* [PATCH v1 0/2] perf trace: Enhance task filtering and remove an unused function

From: Howard Chu @ 2025-05-30  6:24 UTC
  To: acme
  Cc: mingo, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
      adrian.hunter, peterz, kan.liang, linux-perf-users, linux-kernel,
      Howard Chu

Task filtering is flawed in perf trace: spawned threads don't get
augmented, and the BPF program collects more data than it needs.

This patch series adds a pids_targeted BPF map to better filter the
tasks, and makes the bpf-output event '__augmented_syscalls__'
system-wide. It also removes evlist__has_bpf_output(), which is no
longer needed.

Howard Chu (2):
  perf trace: Enhance task filtering
  perf evlist: Remove evlist__has_bpf_output()

 tools/perf/builtin-trace.c                    | 52 ++++++++++++++++---
 .../bpf_skel/augmented_raw_syscalls.bpf.c     | 35 ++++++++++---
 tools/perf/util/evlist.c                      | 14 +----
 tools/perf/util/evlist.h                      |  1 -
 4 files changed, 73 insertions(+), 29 deletions(-)

-- 
2.45.2
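In outline, the BPF-side filtering this series introduces looks like
the sketch below. The names follow patch 1/2; the wiring around them is
an illustration of the idea, not the literal implementation:

~~~
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Illustrative sketch only; see patch 1/2 for the real code. */
volatile bool system_wide;	/* set from user space by perf trace */

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__type(key, pid_t);
	__type(value, bool);
	__uint(max_entries, 4096);
} pids_targeted SEC(".maps");	/* pids perf trace wants augmented */

static bool filter_pid(void)
{
	/* the low 32 bits of the pid/tgid pair are the tid */
	pid_t pid = bpf_get_current_pid_tgid();

	if (system_wide)
		return false;	/* augment every task */

	/* per-task mode: only augment the targeted pids */
	return bpf_map_lookup_elem(&pids_targeted, &pid) == NULL;
}
~~~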
* [PATCH v1 1/2] perf trace: Enhance task filtering

From: Howard Chu @ 2025-05-30  6:24 UTC
  To: acme
  Cc: mingo, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
      adrian.hunter, peterz, kan.liang, linux-perf-users, linux-kernel,
      Howard Chu

This patch does two things:
1. Add a pids_targeted map and populate it with the pids that perf
   trace is interested in.
2. Make the bpf-output event system-wide.

Effect 1:
perf trace doesn't augment spawned threads properly. With the test
program below:

Program test_trace_loop.c
~~~
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>

#define THREAD_NR 2

struct thread_arg {
	int index;
};

void *func(void *arg)
{
	struct thread_arg *t_arg = arg;

	while (1) {
		printf("thread %d running\n", t_arg->index);
		sleep(1);
	}

	return NULL;
}

int main()
{
	pthread_t thread_ids[THREAD_NR];
	struct thread_arg thread_args[THREAD_NR];

	for (int i = 0; i < THREAD_NR; i++) {
		thread_args[i].index = i;
		if (pthread_create(&thread_ids[i], NULL, &func, &thread_args[i])) {
			perror("failed to create thread, exiting\n");
			exit(1);
		}
	}

	while (1) {
		printf("parent sleeping\n");
		sleep(1);
	}

	for (int i = 0; i < THREAD_NR; i++)
		pthread_join(thread_ids[i], NULL);

	return 0;
}
~~~

Commands
~~~
$ gcc test_trace_loop.c -o test_trace_loop

$ ./test_trace_loop &
[1] 1404183

$ pstree 1404183 -p
test_trace_loop(1404183)─┬─{test_trace_loop}(1404185)
                         └─{test_trace_loop}(1404186)

$ sudo perf trace -p 1404183 -e *sleep
~~~

Output
before:
$ sudo /tmp/perf/perf trace -p 1404183 -e *sleep
? ( ): test_trace_loo/1404186 ... [continued]: clock_nanosleep()) = 0
? ( ): test_trace_loo/1404183 ... [continued]: clock_nanosleep()) = 0
0.119 ( ): test_trace_loo/1404186 clock_nanosleep(rqtp: 0x7a86061fde60, rmtp: 0x7a86061fde60) ...
? ( ): test_trace_loo/1404185 ... [continued]: clock_nanosleep()) = 0
0.047 ( ): test_trace_loo/1404183 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7ffd89091450) ...
0.047 (1000.127 ms): test_trace_loo/1404183 ... [continued]: clock_nanosleep()) = 0

Explanation: only the parent thread 1404183 got augmented.

after:
$ sudo /tmp/perf/perf trace -p 1404183 -e *sleep
? ( ): test_trace_loo/1404183 ... [continued]: clock_nanosleep()) = 0
? ( ): test_trace_loo/1404186 ... [continued]: clock_nanosleep()) = 0
0.147 ( ): test_trace_loo/1404186 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7a86061fde60) ...
? ( ): test_trace_loo/1404185 ... [continued]: clock_nanosleep()) = 0
0.076 ( ): test_trace_loo/1404183 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7ffd89091450) ...
0.076 (1000.160 ms): test_trace_loo/1404183 ... [continued]: clock_nanosleep()) = 0
0.147 (1000.090 ms): test_trace_loo/1404186 ... [continued]: clock_nanosleep()) = 0
2.557 ( ): test_trace_loo/1404185 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7a86069fee60) ...
1000.323 ( ): test_trace_loo/1404186 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7a86061fde60) ...
2.557 (1000.129 ms): test_trace_loo/1404185 ... [continued]: clock_nanosleep()) = 0
1000.384 ( ): test_trace_loo/1404183 clock_nanosleep(rqtp: { .tv_sec: 1, .tv_nsec: 0 }, rmtp: 0x7ffd89091450) ...

Explanation: all threads augmented.

Effect 2:
perf trace no longer collects syscall argument data for *ALL* pids
only to throw it away. Uninteresting pids are filtered right away.
There should be a performance advantage.

Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
 tools/perf/builtin-trace.c                    | 52 ++++++++++++++++---
 .../bpf_skel/augmented_raw_syscalls.bpf.c     | 35 ++++++++++---
 tools/perf/util/evlist.c                      |  2 +-
 3 files changed, 73 insertions(+), 16 deletions(-)

diff --git a/tools/perf/builtin-trace.c b/tools/perf/builtin-trace.c
index 67b557ec3b0d..11620cb40198 100644
--- a/tools/perf/builtin-trace.c
+++ b/tools/perf/builtin-trace.c
@@ -4377,6 +4377,7 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
 	unsigned long before;
 	const bool forks = argc > 0;
 	bool draining = false;
+	bool enable_evlist = false;
 
 	trace->live = true;
 
@@ -4447,6 +4448,9 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
 		evlist__set_default_cgroup(trace->evlist, trace->cgroup);
 
 create_maps:
+	if (trace->syscalls.events.bpf_output)
+		trace->syscalls.events.bpf_output->core.system_wide = true;
+
 	err = evlist__create_maps(evlist, &trace->opts.target);
 	if (err < 0) {
 		fprintf(trace->output, "Problems parsing the target to trace, check your options!\n");
@@ -4481,20 +4485,54 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
 		goto out_error_open;
 #ifdef HAVE_BPF_SKEL
 	if (trace->syscalls.events.bpf_output) {
+		struct perf_evsel *perf_evsel = &trace->syscalls.events.bpf_output->core;
 		struct perf_cpu cpu;
+		bool t = true;
+
+		enable_evlist = true;
+		if (trace->opts.target.system_wide)
+			trace->skel->bss->system_wide = true;
+		else
+			trace->skel->bss->system_wide = false;
 
 		/*
 		 * Set up the __augmented_syscalls__ BPF map to hold for each
 		 * CPU the bpf-output event's file descriptor.
 		 */
-		perf_cpu_map__for_each_cpu(cpu, i, trace->syscalls.events.bpf_output->core.cpus) {
+		perf_cpu_map__for_each_cpu(cpu, i, perf_evsel->cpus) {
 			int mycpu = cpu.cpu;
 
-			bpf_map__update_elem(trace->skel->maps.__augmented_syscalls__,
-					     &mycpu, sizeof(mycpu),
-					     xyarray__entry(trace->syscalls.events.bpf_output->core.fd,
-							    mycpu, 0),
-					     sizeof(__u32), BPF_ANY);
+			err = bpf_map__update_elem(trace->skel->maps.__augmented_syscalls__,
+						   &mycpu, sizeof(mycpu),
+						   xyarray__entry(perf_evsel->fd, mycpu, 0),
+						   sizeof(__u32), BPF_ANY);
+			if (err) {
+				pr_err("Couldn't set system-wide bpf output perf event fd"
+				       ", err: %d\n", err);
+				goto out_disable;
+			}
+		}
+
+		if (target__has_task(&trace->opts.target)) {
+			struct perf_thread_map *threads = trace->evlist->core.threads;
+
+			for (int thread = 0; thread < perf_thread_map__nr(threads); thread++) {
+				pid_t pid = perf_thread_map__pid(threads, thread);
+
+				err = bpf_map__update_elem(trace->skel->maps.pids_targeted, &pid,
+							   sizeof(pid), &t, sizeof(t), BPF_ANY);
+				if (err) {
+					pr_err("Couldn't set pids_targeted map, err: %d\n", err);
+					goto out_disable;
+				}
+			}
+		} else if (workload_pid != -1) {
+			err = bpf_map__update_elem(trace->skel->maps.pids_targeted, &workload_pid,
+						   sizeof(workload_pid), &t, sizeof(t), BPF_ANY);
+			if (err) {
+				pr_err("Couldn't set pids_targeted map for workload, err: %d\n", err);
+				goto out_disable;
+			}
 		}
 	}
 
@@ -4553,7 +4591,7 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
 		goto out_error_mmap;
 	}
 
-	if (!target__none(&trace->opts.target) && !trace->opts.target.initial_delay)
+	if (enable_evlist || (!target__none(&trace->opts.target) && !trace->opts.target.initial_delay))
 		evlist__enable(evlist);
 
 	if (forks)
diff --git a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
index e4352881e3fa..e517eec7290b 100644
--- a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
+++ b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
@@ -26,6 +26,7 @@
 #define is_power_of_2(n) (n != 0 && ((n & (n - 1)) == 0))
 
 #define MAX_CPUS 4096
+#define MAX_PIDS 4096
 
 /* bpf-output associated map */
 struct __augmented_syscalls__ {
@@ -113,6 +114,15 @@ struct pids_filtered {
 	__uint(max_entries, 64);
 } pids_filtered SEC(".maps");
 
+volatile bool system_wide;
+
+struct pids_targeted {
+	__uint(type, BPF_MAP_TYPE_HASH);
+	__type(key, pid_t);
+	__type(value, bool);
+	__uint(max_entries, MAX_PIDS);
+} pids_targeted SEC(".maps");
+
 struct augmented_args_payload {
 	struct syscall_enter_args args;
 	struct augmented_arg arg, arg2; // We have to reserve space for two arguments (rename, etc)
@@ -145,6 +155,11 @@ struct beauty_payload_enter_map {
 	__uint(max_entries, 1);
 } beauty_payload_enter_map SEC(".maps");
 
+static pid_t getpid(void)
+{
+	return bpf_get_current_pid_tgid();
+}
+
 static inline struct augmented_args_payload *augmented_args_payload(void)
 {
 	int key = 0;
@@ -418,14 +433,18 @@ int sys_enter_nanosleep(struct syscall_enter_args *args)
 	return 1; /* Failure: don't filter */
 }
 
-static pid_t getpid(void)
+static bool filter_pid(void)
 {
-	return bpf_get_current_pid_tgid();
-}
+	if (system_wide)
+		return false;
 
-static bool pid_filter__has(struct pids_filtered *pids, pid_t pid)
-{
-	return bpf_map_lookup_elem(pids, &pid) != NULL;
+	pid_t pid = getpid();
+
+	if (bpf_map_lookup_elem(&pids_targeted, &pid) &&
+	    !bpf_map_lookup_elem(&pids_filtered, &pid))
+		return false;
+
+	return true;
 }
 
 static int augment_sys_enter(void *ctx, struct syscall_enter_args *args)
@@ -534,7 +553,7 @@ int sys_enter(struct syscall_enter_args *args)
 	 * initial, non-augmented raw_syscalls:sys_enter payload.
 	 */
 
-	if (pid_filter__has(&pids_filtered, getpid()))
+	if (filter_pid())
 		return 0;
 
 	augmented_args = augmented_args_payload();
@@ -560,7 +579,7 @@ int sys_exit(struct syscall_exit_args *args)
 {
 	struct syscall_exit_args exit_args;
 
-	if (pid_filter__has(&pids_filtered, getpid()))
+	if (filter_pid())
 		return 0;
 
 	bpf_probe_read_kernel(&exit_args, sizeof(exit_args), args);
diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index dcd1130502df..7b5837c6e6bb 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -1012,7 +1012,7 @@ int evlist__create_maps(struct evlist *evlist, struct target *target)
 	if (!threads)
 		return -1;
 
-	if (target__uses_dummy_map(target) && !evlist__has_bpf_output(evlist))
+	if (target__uses_dummy_map(target))
 		cpus = perf_cpu_map__new_any_cpu();
 	else
 		cpus = perf_cpu_map__new(target->cpu_list);
-- 
2.45.2
* Re: [PATCH v1 1/2] perf trace: Enhance task filtering

From: Namhyung Kim @ 2025-05-30 21:30 UTC
  To: Howard Chu
  Cc: acme, mingo, mark.rutland, alexander.shishkin, jolsa, irogers,
      adrian.hunter, peterz, kan.liang, linux-perf-users, linux-kernel

Hi Howard,

On Thu, May 29, 2025 at 11:24:07PM -0700, Howard Chu wrote:
> This patch does two things:
> 1. Add a pids_targeted map and populate it with the pids that perf
>    trace is interested in.
> 2. Make the bpf-output event system-wide.
>
[...]
>
> Effect 2:
> perf trace no longer collects syscall argument data for *ALL* pids
> only to throw it away. Uninteresting pids are filtered right away.
> There should be a performance advantage.

Thanks for doing this!

> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> ---
[...]
> @@ -4553,7 +4591,7 @@ static int trace__run(struct trace *trace, int argc, const char **argv)
>  		goto out_error_mmap;
>  	}
>  
> -	if (!target__none(&trace->opts.target) && !trace->opts.target.initial_delay)
> +	if (enable_evlist || (!target__none(&trace->opts.target) && !trace->opts.target.initial_delay))

I guess it should not call evlist__enable() here when target__none()
is true.

>  		evlist__enable(evlist);
>  
>  	if (forks)
[...]
> -static pid_t getpid(void)
> +static bool filter_pid(void)
>  {
> -	return bpf_get_current_pid_tgid();
> -}
> +	if (system_wide)
> +		return false;

Doesn't it need to check the CPU list when the -C option is used?

> -static bool pid_filter__has(struct pids_filtered *pids, pid_t pid)
> -{
> -	return bpf_map_lookup_elem(pids, &pid) != NULL;
> +	pid_t pid = getpid();
> +
> +	if (bpf_map_lookup_elem(&pids_targeted, &pid) &&
> +	    !bpf_map_lookup_elem(&pids_filtered, &pid))

Can we just use a single map for this purpose?

Thanks,
Namhyung

> +		return false;
> +
> +	return true;
>  }
[...]
* Re: [PATCH v1 1/2] perf trace: Enhance task filtering

From: Howard Chu @ 2025-05-30 23:20 UTC
  To: Namhyung Kim
  Cc: acme, mingo, mark.rutland, alexander.shishkin, jolsa, irogers,
      adrian.hunter, peterz, kan.liang, linux-perf-users, linux-kernel

Hello Namhyung,

On Fri, May 30, 2025 at 2:31 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hi Howard,
>
[...]
>
> Thanks for doing this!

Thank you for reviewing this patch.

[...]
> > -	if (!target__none(&trace->opts.target) && !trace->opts.target.initial_delay)
> > +	if (enable_evlist || (!target__none(&trace->opts.target) && !trace->opts.target.initial_delay))
>
> I guess it should not call evlist__enable() here when target__none()
> is true.

Tracing a workload makes target__none() return true, but I need the
evlist enabled for the workload. This condition can be written more
clearly: if the bpf-output event is used, enable the evlist.

[...]
> > +static bool filter_pid(void)
> > {
> > +	if (system_wide)
> > +		return false;
>
> Doesn't it need to check the CPU list when the -C option is used?

I agree, thanks.

> > +	pid_t pid = getpid();
> > +
> > +	if (bpf_map_lookup_elem(&pids_targeted, &pid) &&
> > +	    !bpf_map_lookup_elem(&pids_filtered, &pid))
>
> Can we just use a single map for this purpose?

pids_targeted allows certain pids, and pids_filtered excludes some
pids. I actually made a mistake here: when doing system-wide tracing,
pids_filtered should be checked too (in my code I just return early
when system_wide is true).

In per-task tracing, however, I think we can squash them together: if
a pid in pids_filtered is also in pids_targeted, we still keep it
since the user specified it; if not, it gets filtered anyway because
it is not in pids_targeted.

So you are right that a single map would do for per-task tracing, but
because pids_filtered is still used in system-wide mode, I don't think
we can delete the map.

Thanks,
Howard
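For reference, here is a minimal sketch of the corrected filter_pid()
described above: system-wide mode still honors pids_filtered, and
per-task mode lets pids_targeted take precedence. It is illustrative
only, not part of the posted v1, and reuses the map and helper names
from the patch; the cpus_allowed map mentioned in the comment is
hypothetical:

~~~
static bool filter_pid(void)
{
	pid_t pid = getpid();

	if (system_wide) {
		/*
		 * System-wide: pids_filtered still excludes tasks such
		 * as perf trace itself.  A -C CPU-list check (e.g.
		 * bpf_get_smp_processor_id() against a hypothetical
		 * cpus_allowed map) would also belong here, per the
		 * review.
		 */
		return bpf_map_lookup_elem(&pids_filtered, &pid) != NULL;
	}

	/*
	 * Per-task: pids_targeted is the allow-list.  A pid the user
	 * explicitly targeted stays traced even if it also appears in
	 * pids_filtered.
	 */
	return bpf_map_lookup_elem(&pids_targeted, &pid) == NULL;
}
~~~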
* [PATCH v1 2/2] perf evlist: Remove evlist__has_bpf_output()

From: Howard Chu @ 2025-05-30  6:24 UTC
  To: acme
  Cc: mingo, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
      adrian.hunter, peterz, kan.liang, linux-perf-users, linux-kernel,
      Howard Chu

This function is no longer needed; its last user was removed in the
previous patch.

Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
 tools/perf/util/evlist.c | 12 ------------
 tools/perf/util/evlist.h |  1 -
 2 files changed, 13 deletions(-)

diff --git a/tools/perf/util/evlist.c b/tools/perf/util/evlist.c
index 7b5837c6e6bb..876952685803 100644
--- a/tools/perf/util/evlist.c
+++ b/tools/perf/util/evlist.c
@@ -2618,18 +2618,6 @@ void evlist__uniquify_evsel_names(struct evlist *evlist, const struct perf_stat_
 	}
 }
 
-bool evlist__has_bpf_output(struct evlist *evlist)
-{
-	struct evsel *evsel;
-
-	evlist__for_each_entry(evlist, evsel) {
-		if (evsel__is_bpf_output(evsel))
-			return true;
-	}
-
-	return false;
-}
-
 bool evlist__needs_bpf_sb_event(struct evlist *evlist)
 {
 	struct evsel *evsel;
diff --git a/tools/perf/util/evlist.h b/tools/perf/util/evlist.h
index 85859708393e..dd5e32583c74 100644
--- a/tools/perf/util/evlist.h
+++ b/tools/perf/util/evlist.h
@@ -436,7 +436,6 @@ void evlist__format_evsels(struct evlist *evlist, struct strbuf *sb, size_t max_
 void evlist__check_mem_load_aux(struct evlist *evlist);
 void evlist__warn_user_requested_cpus(struct evlist *evlist, const char *cpu_list);
 void evlist__uniquify_evsel_names(struct evlist *evlist, const struct perf_stat_config *config);
-bool evlist__has_bpf_output(struct evlist *evlist);
 bool evlist__needs_bpf_sb_event(struct evlist *evlist);
 
 #endif /* __PERF_EVLIST_H */
-- 
2.45.2