* [PATCH v1 0/2] perf trace: Enhance task filtering and remove an unused function
From: Howard Chu @ 2025-05-30 6:24 UTC
To: acme
Cc: mingo, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, peterz, kan.liang, linux-perf-users, linux-kernel,
Howard Chu
Task filtering in perf trace is flawed: spawned threads don't get their
syscall arguments augmented, and the BPF program collects data for more
tasks than it needs. This patch series adds a pids_targeted BPF map to
filter tasks more precisely, and makes the bpf-output event
'__augmented_syscalls__' system-wide. It also removes
evlist__has_bpf_output(), which is no longer used.
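In rough terms, the per-event decision in the BPF program becomes (a
sketch of the logic patch 1 introduces, not a verbatim excerpt):
~~~
/* Drop events early, in the kernel, for tasks we don't care about */
if (!system_wide &&
    (bpf_map_lookup_elem(&pids_targeted, &pid) == NULL ||
     bpf_map_lookup_elem(&pids_filtered, &pid) != NULL))
	return 0;
~~~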
Howard Chu (2):
perf trace: Enhance task filtering
perf evlist: Remove evlist__has_bpf_output()
tools/perf/builtin-trace.c | 52 ++++++++++++++++---
.../bpf_skel/augmented_raw_syscalls.bpf.c | 35 ++++++++++---
tools/perf/util/evlist.c | 14 +----
tools/perf/util/evlist.h | 1 -
4 files changed, 73 insertions(+), 29 deletions(-)
--
2.45.2
* [PATCH v1 1/2] perf trace: Enhance task filtering
From: Howard Chu @ 2025-05-30 6:24 UTC
To: acme
Cc: mingo, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, peterz, kan.liang, linux-perf-users, linux-kernel,
Howard Chu
This patch does two things:
1. Add a pids_targeted map that holds the pids perf trace is interested in.
2. Make the bpf-output event system-wide.
Effect 1:
perf trace doesn't augment spawned threads properly. Reproduce with the
test program below:
Program test_trace_loop.c
~~~
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#define THREAD_NR 2
struct thread_arg {
int index;
};
void *func(void *arg) {
struct thread_arg *t_arg = arg;
while (1) {
printf("thread %d running\n", t_arg->index);
sleep(1);
}
return NULL;
}
int main()
{
pthread_t thread_ids[THREAD_NR];
struct thread_arg thread_args[THREAD_NR];
for (int i = 0; i < THREAD_NR; i++) {
thread_args[i].index = i;
if (pthread_create(&thread_ids[i], NULL, &func, &thread_args[i])) {
perror("failed to create thread, exiting\n");
exit(1);
}
}
while (1) {
printf("parent sleeping\n");
sleep(1);
}
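	/* Never reached: the parent loops forever above. */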
for (int i = 0; i < THREAD_NR; i++)
pthread_join(thread_ids[i], NULL);
return 0;
}
~~~
Commands
~~~
$ gcc test_trace_loop.c -o test_trace_loop
$ ./test_trace_loop &
[1] 1404183
$ pstree 1404183 -p
test_trace_loop(1404183)─┬─{test_trace_loop}(1404185)
└─{test_trace_loop}(1404186)
$ sudo perf trace -p 1404183 -e *sleep
~~~
Output
before:
$ sudo /tmp/perf/perf trace -p 1404183 -e *sleep
? ( ): test_trace_loo/1404186 ... [continued]: clock_nanosleep()) = 0
? ( ): test_trace_loo/1404183 ... [continued]: clock_nanosleep()) = 0
0.119 ( ): test_trace_loo/1404186 clock_nanosleep(rqtp: 0x7a86061fde60, rmtp: 0x7a86061fde60) ...
? ( ): test_trace_loo/1404185 ... [continued]: clock_nanosleep()) = 0
0.047 ( ): test_trace_loo/1404183 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7ffd89091450) ...
0.047 (1000.127 ms): test_trace_loo/1404183 ... [continued]: clock_nanosleep()) = 0
explanation: only the parent thread 1404183 got augmented; the spawned
thread's arguments (rqtp at 0.119) are printed as raw pointers instead
of beautified timespecs
after:
$ sudo /tmp/perf/perf trace -p 1404183 -e *sleep
? ( ): test_trace_loo/1404183 ... [continued]: clock_nanosleep()) = 0
? ( ): test_trace_loo/1404186 ... [continued]: clock_nanosleep()) = 0
0.147 ( ): test_trace_loo/1404186 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7a86061fde60) ...
? ( ): test_trace_loo/1404185 ... [continued]: clock_nanosleep()) = 0
0.076 ( ): test_trace_loo/1404183 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7ffd89091450) ...
0.076 (1000.160 ms): test_trace_loo/1404183 ... [continued]: clock_nanosleep()) = 0
0.147 (1000.090 ms): test_trace_loo/1404186 ... [continued]: clock_nanosleep()) = 0
2.557 ( ): test_trace_loo/1404185 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7a86069fee60) ...
1000.323 ( ): test_trace_loo/1404186 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7a86061fde60) ...
2.557 (1000.129 ms): test_trace_loo/1404185 ... [continued]: clock_nanosleep()) = 0
1000.384 ( ): test_trace_loo/1404183 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7ffd89091450) ...
explanation: all threads are augmented, and every clock_nanosleep() call
shows the beautified timespec
Effect 2: perf trace no longer collects syscall argument data for *ALL*
pids only to throw most of it away. Uninteresting pids are now filtered
right away in the BPF program, which should be a performance advantage.
Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
tools/perf/builtin-trace.c | 52 ++++++++++++++++---
.../bpf_skel/augmented_raw_syscalls.bpf.c | 35 ++++++++++---
tools/perf/util/evlist.c | 2 +-
3 files changed, 73 insertions(+), 16 deletions(-)
diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 67b557ec3b0d..11620cb40198 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -4377,6 +4377,7 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
unsigned long before;
const bool forks = argc > 0;
bool draining = false;
+ bool enable_evlist = false;
trace->live = true;
@@ -4447,6 +4448,9 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
evlist__set_default_cgroup(trace->evlist, trace->cgroup);
create_maps:
+ if (trace->syscalls.events.bpf_output)
+ trace->syscalls.events.bpf_output->core.system_wide = true;
+
err = evlist__create_maps(evlist, &trace->opts.target);
if (err < 0) {
fprintf(trace->output, "Problems parsing the target to trace, check your options!\n");
@@ -4481,20 +4485,54 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
goto out_error_open;
#ifdef HAVE_BPF_SKEL
if (trace->syscalls.events.bpf_output) {
+ struct perf_evsel *perf_evsel = &trace->syscalls.events.bpf_output->core;
struct perf_cpu cpu;
+ bool t = true;
+
+ enable_evlist = true;
+ if (trace->opts.target.system_wide)
+ trace->skel->bss->system_wide = true;
+ else
+ trace->skel->bss->system_wide = false;
/*
* Set up the __augmented_syscalls__ BPF map to hold for each
* CPU the bpf-output event's file descriptor.
*/
- perf_cpu_map__for_each_cpu(cpu, i, trace->syscalls.events.bpf_output->core.cpus) {
+ perf_cpu_map__for_each_cpu(cpu, i, perf_evsel->cpus) {
int mycpu = cpu.cpu;
- bpf_map__update_elem(trace->skel->maps.__augmented_syscalls__,
- &mycpu, sizeof(mycpu),
- xyarray__entry(trace->syscalls.events.bpf_output->core.fd,
- mycpu, 0),
- sizeof(__u32), BPF_ANY);
+ err = bpf_map__update_elem(trace->skel->maps.__augmented_syscalls__,
+ &mycpu, sizeof(mycpu),
+ xyarray__entry(perf_evsel->fd, mycpu, 0),
+ sizeof(__u32), BPF_ANY);
+ if (err) {
+ pr_err("Couldn't set system-wide bpf output perf event fd"
+ ", err: %d\n", err);
+ goto out_disable;
+ }
+ }
+
+ if (target__has_task(&trace->opts.target)) {
+ struct perf_thread_map *threads = trace->evlist->core.threads;
+
+ for (int thread = 0; thread < perf_thread_map__nr(threads); thread++) {
+ pid_t pid = perf_thread_map__pid(threads, thread);
+
+ err = bpf_map__update_elem(trace->skel->maps.pids_targeted, &pid,
+ sizeof(pid), &t, sizeof(t), BPF_ANY);
+ if (err) {
+ pr_err("Couldn't set pids_targeted map, err: %d\n", err);
+ goto out_disable;
+ }
+ }
+ } else if (workload_pid != -1) {
+ err = bpf_map__update_elem(trace->skel->maps.pids_targeted, &workload_pid,
+ sizeof(workload_pid), &t, sizeof(t), BPF_ANY);
+ if (err) {
+ pr_err("Couldn't set pids_targeted map for workload, err: %d\n", err);
+ goto out_disable;
+ }
}
}
@@ -4553,7 +4591,7 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
goto out_error_mmap;
}
- if (!target__none(&trace->opts.target) && !trace->opts.target.initial_delay)
+ if (enable_evlist || (!target__none(&trace->opts.target) && !trace->opts.target.initial_delay))
evlist__enable(evlist);
if (forks)
diff --git a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
index e4352881e3fa..e517eec7290b 100644
--- a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
+++ b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
@@ -26,6 +26,7 @@
#define is_power_of_2(n) (n != 0 && ((n & (n - 1)) == 0))
#define MAX_CPUS 4096
+#define MAX_PIDS 4096
/* bpf-output associated map */
struct __augmented_syscalls__ {
@@ -113,6 +114,15 @@ struct pids_filtered {
__uint(max_entries, 64);
} pids_filtered SEC(".maps");
+volatile bool system_wide;
+
+struct pids_targeted {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __type(key, pid_t);
+ __type(value, bool);
+ __uint(max_entries, MAX_PIDS);
+} pids_targeted SEC(".maps");
+
struct augmented_args_payload {
struct syscall_enter_args args;
struct augmented_arg arg, arg2; // We have to reserve space for two arguments (rename, etc)
@@ -145,6 +155,11 @@ struct beauty_payload_enter_map {
__uint(max_entries, 1);
} beauty_payload_enter_map SEC(".maps");
+static pid_t getpid(void)
+{
+ return bpf_get_current_pid_tgid();
+}
+
static inline struct augmented_args_payload *augmented_args_payload(void)
{
int key = 0;
@@ -418,14 +433,18 @@ int sys_enter_nanosleep(struct syscall_enter_args *args)
return 1; /* Failure: don't filter */
}
-static pid_t getpid(void)
+static bool filter_pid(void)
{
- return bpf_get_current_pid_tgid();
-}
+ if (system_wide)
+ return false;
-static bool pid_filter__has(struct pids_filtered *pids, pid_t pid)
-{
- return bpf_map_lookup_elem(pids, &pid) != NULL;
+ pid_t pid = getpid();
+
+ if (bpf_map_lookup_elem(&pids_targeted, &pid) &&
+ !bpf_map_lookup_elem(&pids_filtered, &pid))
+ return false;
+
+ return true;
}
static int augment_sys_enter(void *ctx, struct syscall_enter_args *args)
@@ -534,7 +553,7 @@ int sys_enter(struct syscall_enter_args *args)
* initial, non-augmented raw_syscalls:sys_enter payload.
*/
- if (pid_filter__has(&pids_filtered, getpid()))
+ if (filter_pid())
return 0;
augmented_args = augmented_args_payload();
@@ -560,7 +579,7 @@ int sys_exit(struct syscall_exit_args *args)
{
struct syscall_exit_args exit_args;
- if (pid_filter__has(&pids_filtered, getpid()))
+ if (filter_pid())
return 0;
bpf_probe_read_kernel(&exit_args, sizeof(exit_args), args);
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index dcd1130502df..7b5837c6e6bb 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1012,7 +1012,7 @@ int evlist__create_maps(struct evlist *evlist, struct target *target)
if (!threads)
return -1;
- if (target__uses_dummy_map(target) && !evlist__has_bpf_output(evlist))
+ if (target__uses_dummy_map(target))
cpus = perf_cpu_map__new_any_cpu();
else
cpus = perf_cpu_map__new(target->cpu_list);
--
2.45.2
* [PATCH v1 2/2] perf evlist: Remove evlist__has_bpf_output()
From: Howard Chu @ 2025-05-30 6:24 UTC
To: acme
Cc: mingo, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, peterz, kan.liang, linux-perf-users, linux-kernel,
Howard Chu
The previous patch removed the last caller of evlist__has_bpf_output(),
so the function is no longer needed.
Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
tools/perf/util/evlist.c | 12 ------------
tools/perf/util/evlist.h | 1 -
2 files changed, 13 deletions(-)
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 7b5837c6e6bb..876952685803 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -2618,18 +2618,6 @@ void evlist__uniquify_evsel_names(struct evlist *evlist, const struct perf_stat_
}
}
-bool evlist__has_bpf_output(struct evlist *evlist)
-{
- struct evsel *evsel;
-
- evlist__for_each_entry(evlist, evsel) {
- if (evsel__is_bpf_output(evsel))
- return true;
- }
-
- return false;
-}
-
bool evlist__needs_bpf_sb_event(struct evlist *evlist)
{
struct evsel *evsel;
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 85859708393e..dd5e32583c74 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -436,7 +436,6 @@ void evlist__format_evsels(struct evlist *evlist, struct strbuf *sb, size_t max_
void evlist__check_mem_load_aux(struct evlist *evlist);
void evlist__warn_user_requested_cpus(struct evlist *evlist, const char *cpu_list);
void evlist__uniquify_evsel_names(struct evlist *evlist, const struct perf_stat_config *config);
-bool evlist__has_bpf_output(struct evlist *evlist);
bool evlist__needs_bpf_sb_event(struct evlist *evlist);
#endif /* __PERF_EVLIST_H */
--
2.45.2
* Re: [PATCH v1 1/2] perf trace: Enhance task filtering
From: Namhyung Kim @ 2025-05-30 21:30 UTC
To: Howard Chu
Cc: acme, mingo, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, peterz, kan.liang, linux-perf-users, linux-kernel
Hi Howard,
On Thu, May 29, 2025 at 11:24:07PM -0700, Howard Chu wrote:
> This patch does two things:
> 1. Add a pids_targeted map that holds the pids perf trace is interested in.
> 2. Make the bpf-output event system-wide.
>
> [...]
>
> Effect 2: perf trace no longer collects syscall argument data for *ALL*
> pids only to throw most of it away. Uninteresting pids are now filtered
> right away in the BPF program, which should be a performance advantage.
Thanks for doing this!
> [...]
> @@ -4553,7 +4591,7 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
> goto out_error_mmap;
> }
>
> - if (!target__none(&trace->opts.target) && !trace->opts.target.initial_delay)
> + if (enable_evlist || (!target__none(&trace->opts.target) && !trace->opts.target.initial_delay))
I guess target__none() should not call evlist__enable() here.
> evlist__enable(evlist);
> [...]
> @@ -418,14 +433,18 @@ int sys_enter_nanosleep(struct syscall_enter_args *args)
> return 1; /* Failure: don't filter */
> }
>
> -static pid_t getpid(void)
> +static bool filter_pid(void)
> {
> - return bpf_get_current_pid_tgid();
> -}
> + if (system_wide)
> + return false;
Doesn't it need to check CPU list when -C option is used?
>
> -static bool pid_filter__has(struct pids_filtered *pids, pid_t pid)
> -{
> - return bpf_map_lookup_elem(pids, &pid) != NULL;
> + pid_t pid = getpid();
> +
> + if (bpf_map_lookup_elem(&pids_targeted, &pid) &&
> + !bpf_map_lookup_elem(&pids_filtered, &pid))
Can we just use a single map for this purpose?
Thanks,
Namhyung
> + return false;
> +
> + return true;
> }
> [...]
* Re: [PATCH v1 1/2] perf trace: Enhance task filtering
From: Howard Chu @ 2025-05-30 23:20 UTC
To: Namhyung Kim
Cc: acme, mingo, mark.rutland, alexander.shishkin, jolsa, irogers,
adrian.hunter, peterz, kan.liang, linux-perf-users, linux-kernel
Hello Namhyung,
On Fri, May 30, 2025 at 2:31 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hi Howard,
>
> On Thu, May 29, 2025 at 11:24:07PM -0700, Howard Chu wrote:
> > This patch does two things:
> > 1. Add a pids_targeted map that holds the pids perf trace is interested in.
> > 2. Make the bpf-output event system-wide.
> >
> > [...]
> >
> > Effect 2: perf trace no longer collects syscall argument data for *ALL*
> > pids only to throw most of it away. Uninteresting pids are now filtered
> > right away in the BPF program, which should be a performance advantage.
>
> Thanks for doing this!
Thank you for reviewing this patch.
> > [...]
> > - if (!target__none(&trace->opts.target) && !trace->opts.target.initial_delay)
> > + if (enable_evlist || (!target__none(&trace->opts.target) && !trace->opts.target.initial_delay))
>
> I guess target__none() should not call evlist__enable() here.
Tracing a workload makes target__none() return true, but I need the
evlist enabled for the workload as well. The condition can be written
more clearly: if the bpf-output event is used, enable the evlist.
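Something like this rough sketch is what I have in mind (untested, and
ignoring the HAVE_BPF_SKEL guard, just to make the intent explicit):
~~~
/* Enable the evlist whenever the bpf-output event is in use, instead of
 * special-casing target__none() for the workload case. */
if (trace->syscalls.events.bpf_output ||
    (!target__none(&trace->opts.target) && !trace->opts.target.initial_delay))
	evlist__enable(evlist);
~~~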
> > [...]
> > -static pid_t getpid(void)
> > +static bool filter_pid(void)
> > {
> > - return bpf_get_current_pid_tgid();
> > -}
> > + if (system_wide)
> > + return false;
>
> Doesn't it need to check CPU list when -C option is used?
I agree, thanks.
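For v2 I'm thinking of something along these lines (a hypothetical
sketch; the cpus_targeted map and filter_cpu() don't exist in this
patch, the names are illustrative):
~~~
/* Hypothetical map of targeted CPUs, mirroring pids_targeted */
struct cpus_targeted {
	__uint(type, BPF_MAP_TYPE_HASH);
	__type(key, __u32);
	__type(value, bool);
	__uint(max_entries, MAX_CPUS);
} cpus_targeted SEC(".maps");

static bool filter_cpu(void)
{
	__u32 cpu = bpf_get_smp_processor_id();

	/* Filter out events on CPUs the user didn't ask for with -C */
	return bpf_map_lookup_elem(&cpus_targeted, &cpu) == NULL;
}
~~~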
>
> >
> > -static bool pid_filter__has(struct pids_filtered *pids, pid_t pid)
> > -{
> > - return bpf_map_lookup_elem(pids, &pid) != NULL;
> > + pid_t pid = getpid();
> > +
> > + if (bpf_map_lookup_elem(&pids_targeted, &pid) &&
> > + !bpf_map_lookup_elem(&pids_filtered, &pid))
>
> Can we just use a single map for this purpose?
pids_targeted allows certain pids in, while pids_filtered excludes some.
I actually made a mistake here: when tracing system-wide, pids_filtered
should still be checked (in my code I just return early when system_wide
is true). For per-task tracing, however, I think the two maps can be
squashed together: if a pid in pids_filtered is also in pids_targeted,
we keep it, since the user explicitly specified it; if not, it gets
filtered anyway because it's not in pids_targeted. So you are right that
a single map is enough for per-task tracing, but since pids_filtered is
still needed in system-wide mode, I don't think we can delete it. A
rough sketch of what I mean (illustrative only):
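~~~
static bool filter_pid(void)
{
	pid_t pid = getpid();

	/* System-wide: every pid is targeted, but exclusions still apply */
	if (system_wide)
		return bpf_map_lookup_elem(&pids_filtered, &pid) != NULL;

	/* Per-task: one lookup suffices, excluded pids are never inserted */
	return bpf_map_lookup_elem(&pids_targeted, &pid) == NULL;
}
~~~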
Thanks,
Howard