* [RFC PATCH v1] perf trace: Mitigate failures in parallel perf trace instances
@ 2025-05-29  6:55 Howard Chu
  2025-05-30  0:23 ` Howard Chu
  2025-05-30 23:37 ` Namhyung Kim
  0 siblings, 2 replies; 11+ messages in thread
From: Howard Chu @ 2025-05-29  6:55 UTC (permalink / raw)
  To: acme
  Cc: mingo, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, peterz, kan.liang, linux-perf-users, linux-kernel,
	Howard Chu, Song Liu

perf trace is built on tracepoints, and the only filter in perf trace
is a filter on the syscall type. For example, if perf trace traces only
openat, then it filters out all the other syscalls, such as readlinkat,
readv, etc.

This filtering is flawed. Consider this case: two perf trace
instances are running at the same time, trace instance A tracing
readlinkat, trace instance B tracing openat. When an openat syscall
enters, it triggers the sys_enter BPF programs in both perf trace
instances, and these kernel functions are executed:

perf_syscall_enter
  perf_call_bpf_enter
    trace_call_bpf
      bpf_prog_run_array

In bpf_prog_run_array:
~~~
while ((prog = READ_ONCE(item->prog))) {
	run_ctx.bpf_cookie = item->bpf_cookie;
	ret &= run_prog(prog, ctx);
	item++;
}
~~~

I'm not a BPF expert, but by tinkering I found that if one of the BPF
programs returns 0, there will be no tracepoint sample. That is,

(Is there a sample?) = ProgRetA | ProgRetB | ProgRetC

Where ProgRetA is the return value of one of the BPF programs in the BPF
program array.

Going back to the case where two perf trace instances are tracing two
different syscalls (again, A is tracing readlinkat, B is tracing openat):
when an openat syscall enters, it triggers the sys_enter program in
instance A, call it ProgA, and the sys_enter program in instance B,
ProgB. ProgA returns 0 because it cares about readlinkat only, even
though ProgB returns 1; (Is there a sample?) = ProgRetA (0) |
ProgRetB (1) = 0. So there won't be a tracepoint sample in B's output,
when there really should be one.
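
To make the interference concrete, here is a minimal sketch of the kind
of per-instance filter each sys_enter program boils down to. This is
not the actual augmented_raw_syscalls.bpf.c code; the program name, the
map name and its layout are made up for illustration only:

~~~
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Layout of the raw_syscalls:sys_enter tracepoint payload. */
struct syscall_enter_args {
	unsigned long long common_tp_fields;
	long		   syscall_nr;
	unsigned long	   args[6];
};

/* Illustrative per-instance map holding the syscall numbers this perf
 * trace instance was asked to trace (e.g. only readlinkat, or only openat). */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 512);
	__type(key, int);
	__type(value, __u8);
} traced_syscalls SEC(".maps");

SEC("tp/raw_syscalls/sys_enter")
int sys_enter_filter(struct syscall_enter_args *args)
{
	int nr = args->syscall_nr;

	/* Not a syscall this instance traces: return 0 to filter it out... */
	if (!bpf_map_lookup_elem(&traced_syscalls, &nr))
		return 0;

	/* ...but a 0 from any program in the tracepoint's program array also
	 * suppresses the sample for every other perf trace instance attached
	 * to the same tracepoint. */
	return 1;
}

char _license[] SEC("license") = "GPL";
~~~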

I also want to point out that openat and readlinkat both have augmented
output, so my example might not be entirely accurate, but it does
illustrate the current perf-trace-in-parallel dilemma.

Augmented output is different: when the BPF program calls
bpf_perf_event_output, a sample is emitted directly, so the ProgRetA |
ProgRetB... combination does not apply. I will send another RFC patch to
enable parallelism using this feature. Also, since augmented output
creates a sample on its own, returning 1 would create a duplicated
sample; when augmented, just return 0 instead.
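
As a rough illustration of that path (reusing the includes and the
syscall_enter_args layout from the sketch above; the map and function
names here are simplified stand-ins, not the real ones from
augmented_raw_syscalls.bpf.c):

~~~
/* Perf-event-array map the augmented samples are written to. */
struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(key_size, sizeof(int));
	__uint(value_size, sizeof(__u32));
} augmented_output_map SEC(".maps");

static int emit_augmented(void *ctx, const void *payload, unsigned int len)
{
	/* This writes a sample into the perf ring buffer right here, no
	 * matter what this program (or any other program attached to the
	 * tracepoint) ends up returning. */
	return bpf_perf_event_output(ctx, &augmented_output_map,
				     BPF_F_CURRENT_CPU, (void *)payload, len);
}

SEC("tp/raw_syscalls/sys_enter")
int sys_enter_augmented(struct syscall_enter_args *args)
{
	emit_augmented(args, args, sizeof(*args));

	/* The sample was already emitted above, so return 0 to keep the
	 * tracepoint from producing a second, duplicate sample. */
	return 0;
}
~~~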

Is this approach perfect? Absolutely not; there will likely be some
performance overhead on the kernel side. It is just a quick and dirty
fix that makes perf trace run in parallel without failing. This patch
also serves as an explanation of the reason for the failures and,
possibly, as a link to cite in a nack comment.

Cc: Song Liu <song@kernel.org>
Signed-off-by: Howard Chu <howardchu95@gmail.com>
---
 .../util/bpf_skel/augmented_raw_syscalls.bpf.c   | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
index e4352881e3fa..315fadf01321 100644
--- a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
+++ b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
@@ -546,13 +546,14 @@ int sys_enter(struct syscall_enter_args *args)
 	/*
 	 * Jump to syscall specific augmenter, even if the default one,
 	 * "!raw_syscalls:unaugmented" that will just return 1 to return the
-	 * unaugmented tracepoint payload.
+	 * unaugmented tracepoint payload. If augmented, return 0 to reduce a
+	 * duplicated tracepoint sample.
 	 */
-	if (augment_sys_enter(args, &augmented_args->args))
-		bpf_tail_call(args, &syscalls_sys_enter, augmented_args->args.syscall_nr);
+	if (!augment_sys_enter(args, &augmented_args->args))
+		return 0;
 
-	// If not found on the PROG_ARRAY syscalls map, then we're filtering it:
-	return 0;
+	bpf_tail_call(args, &syscalls_sys_enter, augmented_args->args.syscall_nr);
+	return 1;
 }
 
 SEC("tp/raw_syscalls/sys_exit")
@@ -570,10 +571,7 @@ int sys_exit(struct syscall_exit_args *args)
 	 * unaugmented tracepoint payload.
 	 */
 	bpf_tail_call(args, &syscalls_sys_exit, exit_args.syscall_nr);
-	/*
-	 * If not found on the PROG_ARRAY syscalls map, then we're filtering it:
-	 */
-	return 0;
+	return 1;
 }
 
 char _license[] SEC("license") = "GPL";
-- 
2.45.2



* Re: [RFC PATCH v1] perf trace: Mitigate failures in parallel perf trace instances
  2025-05-29  6:55 [RFC PATCH v1] perf trace: Mitigate failures in parallel perf trace instances Howard Chu
@ 2025-05-30  0:23 ` Howard Chu
  2025-05-30 23:37 ` Namhyung Kim
  1 sibling, 0 replies; 11+ messages in thread
From: Howard Chu @ 2025-05-30  0:23 UTC (permalink / raw)
  To: acme
  Cc: mingo, namhyung, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, peterz, kan.liang, linux-perf-users, linux-kernel,
	Song Liu

On Wed, May 28, 2025 at 11:55 PM Howard Chu <howardchu95@gmail.com> wrote:
>
> perf trace utilizes the tracepoint utility, the only filter in perf
> trace is a filter on syscall type. For example, if perf traces only
> openat, then it filters all the other syscalls, such as readlinkat,
> readv, etc.
>
> This filtering is flawed. Consider this case: two perf trace
> instances are running at the same time, trace instance A tracing
> readlinkat, trace instance B tracing openat. When an openat syscall
> enters, it triggers both BPF programs (sys_enter) in both perf trace
> instances, these kernel functions will be executed:
>
> perf_syscall_enter
>   perf_call_bpf_enter
>     trace_call_bpf
>       bpf_prog_run_array
>
> In bpf_prog_run_array:
> ~~~
> while ((prog = READ_ONCE(item->prog))) {
>         run_ctx.bpf_cookie = item->bpf_cookie;
>         ret &= run_prog(prog, ctx);
>         item++;
> }
> ~~~
>
> I'm not a BPF expert, but by tinkering I found that if one of the BPF
> programs returns 0, there will be no tracepoint sample. That is,
>
> (Is there a sample?) = ProgRetA | ProgRetB | ProgRetC

Sorry, I meant ProgRetA & ProgRetB & ProgRetC.

>
> Where ProgRetA is the return value of one of the BPF programs in the BPF
> program array.
>
> Go back to the case, when two perf trace instances are tracing two
> different syscalls, again, A is tracing readlinkat, B is tracing openat,
> when an openat syscall enters, it triggers the sys_enter program in
> instance A, call it ProgA, and the sys_enter program in instance B,
> ProgB, now ProgA will return 0 because ProgA cares about readlinkat only,
> even though ProgB returns 1; (Is there a sample?) = ProgRetA (0) |
> ProgRetB (1) = 0. So there won't be a tracepoint sample in B's output,

Same, ProgRetA (0) & ProgRetB (1) = 0.

> when there really should be one.
>
> I also want to point out that openat and readlinkat have augmented
> output, so my example might not be accurate, but it does explain the
> current perf-trace-in-parallel dilemma.
>
> Now for augmented output, it is different. When it calls
> bpf_perf_event_output, there is a sample. There won't be no ProgRetA |
> ProgRetB... thing. So I will send another RFC patch to enable

Ditto.

Thanks,
Howard


* Re: [RFC PATCH v1] perf trace: Mitigate failures in parallel perf trace instances
  2025-05-29  6:55 [RFC PATCH v1] perf trace: Mitigate failures in parallel perf trace instances Howard Chu
  2025-05-30  0:23 ` Howard Chu
@ 2025-05-30 23:37 ` Namhyung Kim
  2025-05-31  0:00   ` Howard Chu
  1 sibling, 1 reply; 11+ messages in thread
From: Namhyung Kim @ 2025-05-30 23:37 UTC (permalink / raw)
  To: Howard Chu, Steven Rostedt, Mathieu Desnoyers
  Cc: acme, mingo, mark.rutland, alexander.shishkin, jolsa, irogers,
	adrian.hunter, peterz, kan.liang, linux-perf-users, linux-kernel,
	Song Liu

Hello,

(Adding tracing folks)

On Wed, May 28, 2025 at 11:55:36PM -0700, Howard Chu wrote:
> perf trace utilizes the tracepoint utility, the only filter in perf
> trace is a filter on syscall type. For example, if perf traces only
> openat, then it filters all the other syscalls, such as readlinkat,
> readv, etc.
> 
> This filtering is flawed. Consider this case: two perf trace
> instances are running at the same time, trace instance A tracing
> readlinkat, trace instance B tracing openat. When an openat syscall
> enters, it triggers both BPF programs (sys_enter) in both perf trace
> instances, these kernel functions will be executed:
> 
> perf_syscall_enter
>   perf_call_bpf_enter
>     trace_call_bpf
>       bpf_prog_run_array
> 
> In bpf_prog_run_array:
> ~~~
> while ((prog = READ_ONCE(item->prog))) {
> 	run_ctx.bpf_cookie = item->bpf_cookie;
> 	ret &= run_prog(prog, ctx);
> 	item++;
> }
> ~~~
> 
> I'm not a BPF expert, but by tinkering I found that if one of the BPF
> programs returns 0, there will be no tracepoint sample. That is,
> 
> (Is there a sample?) = ProgRetA & ProgRetB & ProgRetC
> 
> Where ProgRetA is the return value of one of the BPF programs in the BPF
> program array.
> 
> Go back to the case, when two perf trace instances are tracing two
> different syscalls, again, A is tracing readlinkat, B is tracing openat,
> when an openat syscall enters, it triggers the sys_enter program in
> instance A, call it ProgA, and the sys_enter program in instance B,
> ProgB, now ProgA will return 0 because ProgA cares about readlinkat only,
> even though ProgB returns 1; (Is there a sample?) = ProgRetA (0) &
> ProgRetB (1) = 0. So there won't be a tracepoint sample in B's output,
> when there really should be one.

Sounds like a bug.  I think it should run bpf programs attached to the
current perf_event only.  Isn't it the case for tracepoint + perf + bpf?

> 
> I also want to point out that openat and readlinkat have augmented
> output, so my example might not be accurate, but it does explain the
> current perf-trace-in-parallel dilemma.
> 
> Now for augmented output, it is different. When it calls
> bpf_perf_event_output, there is a sample. There won't be no ProgRetA &
> ProgRetB... thing. So I will send another RFC patch to enable
> parallelism using this feature. Also, augmented_output creates a sample
> on it's own, so returning 1 will create a duplicated sample, when
> augmented, just return 0 instead.

Yes, it's bpf-output and tracepoint respectively.  Maybe we should
always return 1 so as not to drop syscalls unintentionally, and let perf
discard the duplicated samples.
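
Just to sketch the idea (this is not perf code, and the dedup key below
is only an assumption for illustration): after decoding, perf could drop
the tracepoint sample when an augmented bpf-output sample with the same
identity was already seen, e.g.:

~~~
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical identity of a syscall-entry sample; assuming pid/tid,
 * the sample timestamp and the syscall number are enough to match a
 * tracepoint sample against its augmented bpf-output twin. */
struct syscall_sample_id {
	uint32_t pid, tid;
	uint64_t time;
	uint64_t syscall_nr;
};

static bool same_syscall_sample(const struct syscall_sample_id *a,
				const struct syscall_sample_id *b)
{
	return a->pid == b->pid && a->tid == b->tid &&
	       a->time == b->time && a->syscall_nr == b->syscall_nr;
}
~~~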

Another approach would be to always return 0 and use bpf-output for
unaugmented syscalls too.  But I'm afraid it'd affect other perf tools
using tracepoints.

> 
> Is this approach perfect? Absolutely not, there will likely be some
> performance overhead on the kernel side. It is just a quick dirty fix
> that makes perf trace run in parallel without failing. This patch is an
> explanation on the reason of failures and possibly, a link used in a
> nack comment.

Thanks for your work, but I'm afraid it'd still miss some syscalls as it
returns 0 sometimes.

Thanks,
Namhyung

> 
> Cc: Song Liu <song@kernel.org>
> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> ---
>  .../util/bpf_skel/augmented_raw_syscalls.bpf.c   | 16 +++++++---------
>  1 file changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
> index e4352881e3fa..315fadf01321 100644
> --- a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
> +++ b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
> @@ -546,13 +546,14 @@ int sys_enter(struct syscall_enter_args *args)
>  	/*
>  	 * Jump to syscall specific augmenter, even if the default one,
>  	 * "!raw_syscalls:unaugmented" that will just return 1 to return the
> -	 * unaugmented tracepoint payload.
> +	 * unaugmented tracepoint payload. If augmented, return 0 to reduce a
> +	 * duplicated tracepoint sample.
>  	 */
> -	if (augment_sys_enter(args, &augmented_args->args))
> -		bpf_tail_call(args, &syscalls_sys_enter, augmented_args->args.syscall_nr);
> +	if (!augment_sys_enter(args, &augmented_args->args))
> +		return 0;
>  
> -	// If not found on the PROG_ARRAY syscalls map, then we're filtering it:
> -	return 0;
> +	bpf_tail_call(args, &syscalls_sys_enter, augmented_args->args.syscall_nr);
> +	return 1;
>  }
>  
>  SEC("tp/raw_syscalls/sys_exit")
> @@ -570,10 +571,7 @@ int sys_exit(struct syscall_exit_args *args)
>  	 * unaugmented tracepoint payload.
>  	 */
>  	bpf_tail_call(args, &syscalls_sys_exit, exit_args.syscall_nr);
> -	/*
> -	 * If not found on the PROG_ARRAY syscalls map, then we're filtering it:
> -	 */
> -	return 0;
> +	return 1;
>  }
>  
>  char _license[] SEC("license") = "GPL";
> -- 
> 2.45.2
> 


* Re: [RFC PATCH v1] perf trace: Mitigate failures in parallel perf trace instances
  2025-05-30 23:37 ` Namhyung Kim
@ 2025-05-31  0:00   ` Howard Chu
  2025-06-02 22:17     ` Steven Rostedt
  0 siblings, 1 reply; 11+ messages in thread
From: Howard Chu @ 2025-05-31  0:00 UTC (permalink / raw)
  To: Namhyung Kim
  Cc: Steven Rostedt, Mathieu Desnoyers, acme, mingo, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, peterz,
	kan.liang, linux-perf-users, linux-kernel, Song Liu

Hello Namhyung,

On Fri, May 30, 2025 at 4:37 PM Namhyung Kim <namhyung@kernel.org> wrote:
>
> Hello,
>
> (Adding tracing folks)

(That's so convenient wow)

>
> On Wed, May 28, 2025 at 11:55:36PM -0700, Howard Chu wrote:
> > perf trace utilizes the tracepoint utility, the only filter in perf
> > trace is a filter on syscall type. For example, if perf traces only
> > openat, then it filters all the other syscalls, such as readlinkat,
> > readv, etc.
> >
> > This filtering is flawed. Consider this case: two perf trace
> > instances are running at the same time, trace instance A tracing
> > readlinkat, trace instance B tracing openat. When an openat syscall
> > enters, it triggers both BPF programs (sys_enter) in both perf trace
> > instances, these kernel functions will be executed:
> >
> > perf_syscall_enter
> >   perf_call_bpf_enter
> >     trace_call_bpf
> >       bpf_prog_run_array
> >
> > In bpf_prog_run_array:
> > ~~~
> > while ((prog = READ_ONCE(item->prog))) {
> >       run_ctx.bpf_cookie = item->bpf_cookie;
> >       ret &= run_prog(prog, ctx);
> >       item++;
> > }
> > ~~~
> >
> > I'm not a BPF expert, but by tinkering I found that if one of the BPF
> > programs returns 0, there will be no tracepoint sample. That is,
> >
> > (Is there a sample?) = ProgRetA & ProgRetB & ProgRetC
> >
> > Where ProgRetA is the return value of one of the BPF programs in the BPF
> > program array.
> >
> > Go back to the case, when two perf trace instances are tracing two
> > different syscalls, again, A is tracing readlinkat, B is tracing openat,
> > when an openat syscall enters, it triggers the sys_enter program in
> > instance A, call it ProgA, and the sys_enter program in instance B,
> > ProgB, now ProgA will return 0 because ProgA cares about readlinkat only,
> > even though ProgB returns 1; (Is there a sample?) = ProgRetA (0) &
> > ProgRetB (1) = 0. So there won't be a tracepoint sample in B's output,
> > when there really should be one.
>
> Sounds like a bug.  I think it should run bpf programs attached to the
> current perf_event only.  Isn't it the case for tracepoint + perf + bpf?

I really can't answer that question.

>
> >
> > I also want to point out that openat and readlinkat have augmented
> > output, so my example might not be accurate, but it does explain the
> > current perf-trace-in-parallel dilemma.
> >
> > Now for augmented output, it is different. When it calls
> > bpf_perf_event_output, there is a sample. There won't be no ProgRetA &
> > ProgRetB... thing. So I will send another RFC patch to enable
> > parallelism using this feature. Also, augmented_output creates a sample
> > on it's own, so returning 1 will create a duplicated sample, when
> > augmented, just return 0 instead.
>
> Yes, it's bpf-output and tracepoint respectively.  Maybe we should
> always return 1 not to drop syscalls unintentionally and perf can
> discard duplicated samples.

I like this.

>
> Another approach would be return 0 always and use bpf-output for
> unaugmented syscalls too.  But I'm afraid it'd affect other perf tools
> using tracepoints.

Yep.

>
> >
> > Is this approach perfect? Absolutely not, there will likely be some
> > performance overhead on the kernel side. It is just a quick dirty fix
> > that makes perf trace run in parallel without failing. This patch is an
> > explanation on the reason of failures and possibly, a link used in a
> > nack comment.
>
> Thanks for your work, but I'm afraid it'd still miss some syscalls as it
> returns 0 sometimes.

My bad... For example this:

if (pid_filter__has(&pids_filtered, getpid()))
   return 0;

This patch is practically meaningless, but it passes the parallel tests.

Thanks,
Howard


* Re: [RFC PATCH v1] perf trace: Mitigate failures in parallel perf trace instances
  2025-05-31  0:00   ` Howard Chu
@ 2025-06-02 22:17     ` Steven Rostedt
  2025-06-03  4:56       ` Namhyung Kim
  2025-06-04 10:25       ` Jiri Olsa
  0 siblings, 2 replies; 11+ messages in thread
From: Steven Rostedt @ 2025-06-02 22:17 UTC (permalink / raw)
  To: Howard Chu
  Cc: Namhyung Kim, Mathieu Desnoyers, acme, mingo, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, peterz,
	kan.liang, linux-perf-users, linux-kernel, Song Liu, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko

On Fri, 30 May 2025 17:00:38 -0700
Howard Chu <howardchu95@gmail.com> wrote:

> Hello Namhyung,
> 
> On Fri, May 30, 2025 at 4:37 PM Namhyung Kim <namhyung@kernel.org> wrote:
> >
> > Hello,
> >
> > (Adding tracing folks)  
> 
> (That's so convenient wow)

Shouldn't the BPF folks be more relevant? I don't see any of the tracing
code involved here.

> 
> >
> > On Wed, May 28, 2025 at 11:55:36PM -0700, Howard Chu wrote:  
> > > perf trace utilizes the tracepoint utility, the only filter in perf
> > > trace is a filter on syscall type. For example, if perf traces only
> > > openat, then it filters all the other syscalls, such as readlinkat,
> > > readv, etc.
> > >
> > > This filtering is flawed. Consider this case: two perf trace
> > > instances are running at the same time, trace instance A tracing
> > > readlinkat, trace instance B tracing openat. When an openat syscall
> > > enters, it triggers both BPF programs (sys_enter) in both perf trace
> > > instances, these kernel functions will be executed:
> > >
> > > perf_syscall_enter
> > >   perf_call_bpf_enter
> > >     trace_call_bpf

This is in bpf_trace.c (BPF related, not tracing related).

-- Steve


> > >       bpf_prog_run_array
> > >
> > > In bpf_prog_run_array:
> > > ~~~
> > > while ((prog = READ_ONCE(item->prog))) {
> > >       run_ctx.bpf_cookie = item->bpf_cookie;
> > >       ret &= run_prog(prog, ctx);
> > >       item++;
> > > }
> > > ~~~
> > >
> > > I'm not a BPF expert, but by tinkering I found that if one of the BPF
> > > programs returns 0, there will be no tracepoint sample. That is,
> > >
> > > (Is there a sample?) = ProgRetA & ProgRetB & ProgRetC
> > >
> > > Where ProgRetA is the return value of one of the BPF programs in the BPF
> > > program array.
> > >
> > > Go back to the case, when two perf trace instances are tracing two
> > > different syscalls, again, A is tracing readlinkat, B is tracing openat,
> > > when an openat syscall enters, it triggers the sys_enter program in
> > > instance A, call it ProgA, and the sys_enter program in instance B,
> > > ProgB, now ProgA will return 0 because ProgA cares about readlinkat only,
> > > even though ProgB returns 1; (Is there a sample?) = ProgRetA (0) &
> > > ProgRetB (1) = 0. So there won't be a tracepoint sample in B's output,
> > > when there really should be one.  
> >
> > Sounds like a bug.  I think it should run bpf programs attached to the
> > current perf_event only.  Isn't it the case for tracepoint + perf + bpf?  
> 
> I really can't answer that question.
> 
> >  
> > >
> > > I also want to point out that openat and readlinkat have augmented
> > > output, so my example might not be accurate, but it does explain the
> > > current perf-trace-in-parallel dilemma.
> > >
> > > Now for augmented output, it is different. When it calls
> > > bpf_perf_event_output, there is a sample. There won't be no ProgRetA &
> > > ProgRetB... thing. So I will send another RFC patch to enable
> > > parallelism using this feature. Also, augmented_output creates a sample
> > > on it's own, so returning 1 will create a duplicated sample, when
> > > augmented, just return 0 instead.  
> >
> > Yes, it's bpf-output and tracepoint respectively.  Maybe we should
> > always return 1 not to drop syscalls unintentionally and perf can
> > discard duplicated samples.  
> 
> I like this.
> 
> >
> > Another approach would be return 0 always and use bpf-output for
> > unaugmented syscalls too.  But I'm afraid it'd affect other perf tools
> > using tracepoints.  
> 
> Yep.
> 
> >  
> > >
> > > Is this approach perfect? Absolutely not, there will likely be some
> > > performance overhead on the kernel side. It is just a quick dirty fix
> > > that makes perf trace run in parallel without failing. This patch is an
> > > explanation on the reason of failures and possibly, a link used in a
> > > nack comment.  
> >
> > Thanks for your work, but I'm afraid it'd still miss some syscalls as it
> > returns 0 sometimes.  
> 
> My bad... For example this:
> 
> if (pid_filter__has(&pids_filtered, getpid()))
>    return 0;
> 
> This patch is practically meaningless, but it passes the parallel tests.
> 
> Thanks,
> Howard



* Re: [RFC PATCH v1] perf trace: Mitigate failures in parallel perf trace instances
  2025-06-02 22:17     ` Steven Rostedt
@ 2025-06-03  4:56       ` Namhyung Kim
  2025-06-04 10:25       ` Jiri Olsa
  1 sibling, 0 replies; 11+ messages in thread
From: Namhyung Kim @ 2025-06-03  4:56 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Howard Chu, Mathieu Desnoyers, acme, mingo, mark.rutland,
	alexander.shishkin, jolsa, irogers, adrian.hunter, peterz,
	kan.liang, linux-perf-users, linux-kernel, Song Liu, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko

Hi Steve,

On Mon, Jun 02, 2025 at 06:17:43PM -0400, Steven Rostedt wrote:
> On Fri, 30 May 2025 17:00:38 -0700
> Howard Chu <howardchu95@gmail.com> wrote:
> 
> > Hello Namhyung,
> > 
> > On Fri, May 30, 2025 at 4:37 PM Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > Hello,
> > >
> > > (Adding tracing folks)  
> > 
> > (That's so convenient wow)
> 
> Shouldn't the BPF folks be more relevant. I don't see any of the tracing
> code involved here.

Yep, they are CC'ed too.  And sorry, I thought you were involved in this
part of the code.

> 
> > 
> > >
> > > On Wed, May 28, 2025 at 11:55:36PM -0700, Howard Chu wrote:  
> > > > perf trace utilizes the tracepoint utility, the only filter in perf
> > > > trace is a filter on syscall type. For example, if perf traces only
> > > > openat, then it filters all the other syscalls, such as readlinkat,
> > > > readv, etc.
> > > >
> > > > This filtering is flawed. Consider this case: two perf trace
> > > > instances are running at the same time, trace instance A tracing
> > > > readlinkat, trace instance B tracing openat. When an openat syscall
> > > > enters, it triggers both BPF programs (sys_enter) in both perf trace
> > > > instances, these kernel functions will be executed:
> > > >
> > > > perf_syscall_enter
> > > >   perf_call_bpf_enter
> > > >     trace_call_bpf
> 
> This is in bpf_trace.c (BPF related, not tracing related).
 
Ok, noted.

Thanks,
Namhyung

> 
> > > >       bpf_prog_run_array
> > > >
> > > > In bpf_prog_run_array:
> > > > ~~~
> > > > while ((prog = READ_ONCE(item->prog))) {
> > > >       run_ctx.bpf_cookie = item->bpf_cookie;
> > > >       ret &= run_prog(prog, ctx);
> > > >       item++;
> > > > }
> > > > ~~~
> > > >
> > > > I'm not a BPF expert, but by tinkering I found that if one of the BPF
> > > > programs returns 0, there will be no tracepoint sample. That is,
> > > >
> > > > (Is there a sample?) = ProgRetA & ProgRetB & ProgRetC
> > > >
> > > > Where ProgRetA is the return value of one of the BPF programs in the BPF
> > > > program array.
> > > >
> > > > Go back to the case, when two perf trace instances are tracing two
> > > > different syscalls, again, A is tracing readlinkat, B is tracing openat,
> > > > when an openat syscall enters, it triggers the sys_enter program in
> > > > instance A, call it ProgA, and the sys_enter program in instance B,
> > > > ProgB, now ProgA will return 0 because ProgA cares about readlinkat only,
> > > > even though ProgB returns 1; (Is there a sample?) = ProgRetA (0) &
> > > > ProgRetB (1) = 0. So there won't be a tracepoint sample in B's output,
> > > > when there really should be one.  
> > >
> > > Sounds like a bug.  I think it should run bpf programs attached to the
> > > current perf_event only.  Isn't it the case for tracepoint + perf + bpf?  
> > 
> > I really can't answer that question.
> > 
> > >  
> > > >
> > > > I also want to point out that openat and readlinkat have augmented
> > > > output, so my example might not be accurate, but it does explain the
> > > > current perf-trace-in-parallel dilemma.
> > > >
> > > > Now for augmented output, it is different. When it calls
> > > > bpf_perf_event_output, there is a sample. There won't be no ProgRetA &
> > > > ProgRetB... thing. So I will send another RFC patch to enable
> > > > parallelism using this feature. Also, augmented_output creates a sample
> > > > on it's own, so returning 1 will create a duplicated sample, when
> > > > augmented, just return 0 instead.  
> > >
> > > Yes, it's bpf-output and tracepoint respectively.  Maybe we should
> > > always return 1 not to drop syscalls unintentionally and perf can
> > > discard duplicated samples.  
> > 
> > I like this.
> > 
> > >
> > > Another approach would be return 0 always and use bpf-output for
> > > unaugmented syscalls too.  But I'm afraid it'd affect other perf tools
> > > using tracepoints.  
> > 
> > Yep.
> > 
> > >  
> > > >
> > > > Is this approach perfect? Absolutely not, there will likely be some
> > > > performance overhead on the kernel side. It is just a quick dirty fix
> > > > that makes perf trace run in parallel without failing. This patch is an
> > > > explanation on the reason of failures and possibly, a link used in a
> > > > nack comment.  
> > >
> > > Thanks for your work, but I'm afraid it'd still miss some syscalls as it
> > > returns 0 sometimes.  
> > 
> > My bad... For example this:
> > 
> > if (pid_filter__has(&pids_filtered, getpid()))
> >    return 0;
> > 
> > This patch is practically meaningless, but it passes the parallel tests.
> > 
> > Thanks,
> > Howard
> 


* Re: [RFC PATCH v1] perf trace: Mitigate failures in parallel perf trace instances
  2025-06-02 22:17     ` Steven Rostedt
  2025-06-03  4:56       ` Namhyung Kim
@ 2025-06-04 10:25       ` Jiri Olsa
  2025-06-06 18:27         ` Alexei Starovoitov
  2025-06-09 18:38         ` Howard Chu
  1 sibling, 2 replies; 11+ messages in thread
From: Jiri Olsa @ 2025-06-04 10:25 UTC (permalink / raw)
  To: Howard Chu, Namhyung Kim
  Cc: Mathieu Desnoyers, acme, mingo, mark.rutland, alexander.shishkin,
	irogers, adrian.hunter, peterz, kan.liang, linux-perf-users,
	linux-kernel, Song Liu, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Steven Rostedt

On Mon, Jun 02, 2025 at 06:17:43PM -0400, Steven Rostedt wrote:
> On Fri, 30 May 2025 17:00:38 -0700
> Howard Chu <howardchu95@gmail.com> wrote:
> 
> > Hello Namhyung,
> > 
> > On Fri, May 30, 2025 at 4:37 PM Namhyung Kim <namhyung@kernel.org> wrote:
> > >
> > > Hello,
> > >
> > > (Adding tracing folks)  
> > 
> > (That's so convenient wow)
> 
> Shouldn't the BPF folks be more relevant. I don't see any of the tracing
> code involved here.
> 
> > 
> > >
> > > On Wed, May 28, 2025 at 11:55:36PM -0700, Howard Chu wrote:  
> > > > perf trace utilizes the tracepoint utility, the only filter in perf
> > > > trace is a filter on syscall type. For example, if perf traces only
> > > > openat, then it filters all the other syscalls, such as readlinkat,
> > > > readv, etc.
> > > >
> > > > This filtering is flawed. Consider this case: two perf trace
> > > > instances are running at the same time, trace instance A tracing
> > > > readlinkat, trace instance B tracing openat. When an openat syscall
> > > > enters, it triggers both BPF programs (sys_enter) in both perf trace
> > > > instances, these kernel functions will be executed:
> > > >
> > > > perf_syscall_enter
> > > >   perf_call_bpf_enter
> > > >     trace_call_bpf
> 
> This is in bpf_trace.c (BPF related, not tracing related).
> 
> -- Steve
> 
> 
> > > >       bpf_prog_run_array
> > > >
> > > > In bpf_prog_run_array:
> > > > ~~~
> > > > while ((prog = READ_ONCE(item->prog))) {
> > > >       run_ctx.bpf_cookie = item->bpf_cookie;
> > > >       ret &= run_prog(prog, ctx);
> > > >       item++;
> > > > }
> > > > ~~~
> > > >
> > > > I'm not a BPF expert, but by tinkering I found that if one of the BPF
> > > > programs returns 0, there will be no tracepoint sample. That is,
> > > >
> > > > (Is there a sample?) = ProgRetA & ProgRetB & ProgRetC
> > > >
> > > > Where ProgRetA is the return value of one of the BPF programs in the BPF
> > > > program array.
> > > >
> > > > Go back to the case, when two perf trace instances are tracing two
> > > > different syscalls, again, A is tracing readlinkat, B is tracing openat,
> > > > when an openat syscall enters, it triggers the sys_enter program in
> > > > instance A, call it ProgA, and the sys_enter program in instance B,
> > > > ProgB, now ProgA will return 0 because ProgA cares about readlinkat only,
> > > > even though ProgB returns 1; (Is there a sample?) = ProgRetA (0) &
> > > > ProgRetB (1) = 0. So there won't be a tracepoint sample in B's output,
> > > > when there really should be one.  
> > >
> > > Sounds like a bug.  I think it should run bpf programs attached to the
> > > current perf_event only.  Isn't it the case for tracepoint + perf + bpf?  
> > 
> > I really can't answer that question.

bpf programs for tracepoint are executed before the perf event specific
check/trigger in perf_trace_run_bpf_submit

bpf programs array is part of struct trace_event_call so it's global per
tracepoint, not per perf event

IIRC perf trace needs the perf event sample and the bpf program is there
to do the filter and some other related stuff?

if that's the case I wonder if we could switch the bpf_prog_run_array logic
to be permissive like below, and perhaps make that a tracepoint-specific
change, because bpf_prog_run_array is used in other places

or make it somehow configurable.. otherwise I fear that might break existing
use cases.. FWIW bpf ci tests [1] survived the change below

jirka


[1] https://github.com/kernel-patches/bpf/pull/9044

---
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5b25d278409b..4ca8afe0217c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2218,12 +2218,12 @@ bpf_prog_run_array(const struct bpf_prog_array *array,
 	const struct bpf_prog *prog;
 	struct bpf_run_ctx *old_run_ctx;
 	struct bpf_trace_run_ctx run_ctx;
-	u32 ret = 1;
+	u32 ret = 0;
 
 	RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "no rcu lock held");
 
 	if (unlikely(!array))
-		return ret;
+		return 1;
 
 	run_ctx.is_uprobe = false;
 
@@ -2232,7 +2232,7 @@ bpf_prog_run_array(const struct bpf_prog_array *array,
 	item = &array->items[0];
 	while ((prog = READ_ONCE(item->prog))) {
 		run_ctx.bpf_cookie = item->bpf_cookie;
-		ret &= run_prog(prog, ctx);
+		ret |= run_prog(prog, ctx);
 		item++;
 	}
 	bpf_reset_run_ctx(old_run_ctx);


* Re: [RFC PATCH v1] perf trace: Mitigate failures in parallel perf trace instances
  2025-06-04 10:25       ` Jiri Olsa
@ 2025-06-06 18:27         ` Alexei Starovoitov
  2025-06-09 18:30           ` Howard Chu
  2025-06-09 18:38         ` Howard Chu
  1 sibling, 1 reply; 11+ messages in thread
From: Alexei Starovoitov @ 2025-06-06 18:27 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Howard Chu, Namhyung Kim, Mathieu Desnoyers,
	Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland,
	Alexander Shishkin, Ian Rogers, Adrian Hunter, Peter Zijlstra,
	Kan Liang, linux-perf-use., LKML, Song Liu, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Steven Rostedt

On Wed, Jun 4, 2025 at 3:25 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Mon, Jun 02, 2025 at 06:17:43PM -0400, Steven Rostedt wrote:
> > On Fri, 30 May 2025 17:00:38 -0700
> > Howard Chu <howardchu95@gmail.com> wrote:
> >
> > > Hello Namhyung,
> > >
> > > On Fri, May 30, 2025 at 4:37 PM Namhyung Kim <namhyung@kernel.org> wrote:
> > > >
> > > > Hello,
> > > >
> > > > (Adding tracing folks)
> > >
> > > (That's so convenient wow)
> >
> > Shouldn't the BPF folks be more relevant. I don't see any of the tracing
> > code involved here.
> >
> > >
> > > >
> > > > On Wed, May 28, 2025 at 11:55:36PM -0700, Howard Chu wrote:
> > > > > perf trace utilizes the tracepoint utility, the only filter in perf
> > > > > trace is a filter on syscall type. For example, if perf traces only
> > > > > openat, then it filters all the other syscalls, such as readlinkat,
> > > > > readv, etc.
> > > > >
> > > > > This filtering is flawed. Consider this case: two perf trace
> > > > > instances are running at the same time, trace instance A tracing
> > > > > readlinkat, trace instance B tracing openat. When an openat syscall
> > > > > enters, it triggers both BPF programs (sys_enter) in both perf trace
> > > > > instances, these kernel functions will be executed:
> > > > >
> > > > > perf_syscall_enter
> > > > >   perf_call_bpf_enter
> > > > >     trace_call_bpf
> >
> > This is in bpf_trace.c (BPF related, not tracing related).
> >
> > -- Steve
> >
> >
> > > > >       bpf_prog_run_array
> > > > >
> > > > > In bpf_prog_run_array:
> > > > > ~~~
> > > > > while ((prog = READ_ONCE(item->prog))) {
> > > > >       run_ctx.bpf_cookie = item->bpf_cookie;
> > > > >       ret &= run_prog(prog, ctx);
> > > > >       item++;
> > > > > }
> > > > > ~~~
> > > > >
> > > > > I'm not a BPF expert, but by tinkering I found that if one of the BPF
> > > > > programs returns 0, there will be no tracepoint sample. That is,
> > > > >
> > > > > (Is there a sample?) = ProgRetA & ProgRetB & ProgRetC
> > > > >
> > > > > Where ProgRetA is the return value of one of the BPF programs in the BPF
> > > > > program array.
> > > > >
> > > > > Go back to the case, when two perf trace instances are tracing two
> > > > > different syscalls, again, A is tracing readlinkat, B is tracing openat,
> > > > > when an openat syscall enters, it triggers the sys_enter program in
> > > > > instance A, call it ProgA, and the sys_enter program in instance B,
> > > > > ProgB, now ProgA will return 0 because ProgA cares about readlinkat only,
> > > > > even though ProgB returns 1; (Is there a sample?) = ProgRetA (0) &
> > > > > ProgRetB (1) = 0. So there won't be a tracepoint sample in B's output,
> > > > > when there really should be one.
> > > >
> > > > Sounds like a bug.  I think it should run bpf programs attached to the
> > > > current perf_event only.  Isn't it the case for tracepoint + perf + bpf?
> > >
> > > I really can't answer that question.
>
> bpf programs for tracepoint are executed before the perf event specific
> check/trigger in perf_trace_run_bpf_submit
>
> bpf programs array is part of struct trace_event_call so it's global per
> tracepoint, not per perf event

right.
looks like perf is attaching two different progs to the same sys_enter
tracepoint and one of them returns 0.
That's expected behavior.
The rule is all-yes-is-yes, any-no-is-no.
We apply this logic to the majority (if not all) of bpf prog return values.

> IIRC perf trace needs the perf event sample and the bpf program is there
> to do the filter and some other related stuff?
>
> if that's the case I wonder we could switch bpf_prog_run_array logic
> to be permissive like below, and perhaps make that as tracepoint specific
> change, because bpf_prog_run_array is used in other place

No. That might break somebody and we don't want to deviate from the rule.

> or make it somehow configurable.. otherwise I fear that might break existing
> use cases.. FWIW bpf ci tests [1] survived the change below
>
> jirka
>
>
> [1] https://github.com/kernel-patches/bpf/pull/9044
>
> ---
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 5b25d278409b..4ca8afe0217c 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2218,12 +2218,12 @@ bpf_prog_run_array(const struct bpf_prog_array *array,
>         const struct bpf_prog *prog;
>         struct bpf_run_ctx *old_run_ctx;
>         struct bpf_trace_run_ctx run_ctx;
> -       u32 ret = 1;
> +       u32 ret = 0;
>
>         RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "no rcu lock held");
>
>         if (unlikely(!array))
> -               return ret;
> +               return 1;
>
>         run_ctx.is_uprobe = false;
>
> @@ -2232,7 +2232,7 @@ bpf_prog_run_array(const struct bpf_prog_array *array,
>         item = &array->items[0];
>         while ((prog = READ_ONCE(item->prog))) {
>                 run_ctx.bpf_cookie = item->bpf_cookie;
> -               ret &= run_prog(prog, ctx);
> +               ret |= run_prog(prog, ctx);
>                 item++;
>         }
>         bpf_reset_run_ctx(old_run_ctx);
>


* Re: [RFC PATCH v1] perf trace: Mitigate failures in parallel perf trace instances
  2025-06-06 18:27         ` Alexei Starovoitov
@ 2025-06-09 18:30           ` Howard Chu
  0 siblings, 0 replies; 11+ messages in thread
From: Howard Chu @ 2025-06-09 18:30 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Jiri Olsa, Namhyung Kim, Mathieu Desnoyers,
	Arnaldo Carvalho de Melo, Ingo Molnar, Mark Rutland,
	Alexander Shishkin, Ian Rogers, Adrian Hunter, Peter Zijlstra,
	Kan Liang, linux-perf-use., LKML, Song Liu, bpf,
	Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Steven Rostedt

Hi Alexei,

On Fri, Jun 6, 2025 at 11:27 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jun 4, 2025 at 3:25 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Mon, Jun 02, 2025 at 06:17:43PM -0400, Steven Rostedt wrote:
> > > On Fri, 30 May 2025 17:00:38 -0700
> > > Howard Chu <howardchu95@gmail.com> wrote:
> > >
> > > > Hello Namhyung,
> > > >
> > > > On Fri, May 30, 2025 at 4:37 PM Namhyung Kim <namhyung@kernel.org> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > (Adding tracing folks)
> > > >
> > > > (That's so convenient wow)
> > >
> > > Shouldn't the BPF folks be more relevant. I don't see any of the tracing
> > > code involved here.
> > >
> > > >
> > > > >
> > > > > On Wed, May 28, 2025 at 11:55:36PM -0700, Howard Chu wrote:
> > > > > > perf trace utilizes the tracepoint utility, the only filter in perf
> > > > > > trace is a filter on syscall type. For example, if perf traces only
> > > > > > openat, then it filters all the other syscalls, such as readlinkat,
> > > > > > readv, etc.
> > > > > >
> > > > > > This filtering is flawed. Consider this case: two perf trace
> > > > > > instances are running at the same time, trace instance A tracing
> > > > > > readlinkat, trace instance B tracing openat. When an openat syscall
> > > > > > enters, it triggers both BPF programs (sys_enter) in both perf trace
> > > > > > instances, these kernel functions will be executed:
> > > > > >
> > > > > > perf_syscall_enter
> > > > > >   perf_call_bpf_enter
> > > > > >     trace_call_bpf
> > >
> > > This is in bpf_trace.c (BPF related, not tracing related).
> > >
> > > -- Steve
> > >
> > >
> > > > > >       bpf_prog_run_array
> > > > > >
> > > > > > In bpf_prog_run_array:
> > > > > > ~~~
> > > > > > while ((prog = READ_ONCE(item->prog))) {
> > > > > >       run_ctx.bpf_cookie = item->bpf_cookie;
> > > > > >       ret &= run_prog(prog, ctx);
> > > > > >       item++;
> > > > > > }
> > > > > > ~~~
> > > > > >
> > > > > > I'm not a BPF expert, but by tinkering I found that if one of the BPF
> > > > > > programs returns 0, there will be no tracepoint sample. That is,
> > > > > >
> > > > > > (Is there a sample?) = ProgRetA & ProgRetB & ProgRetC
> > > > > >
> > > > > > Where ProgRetA is the return value of one of the BPF programs in the BPF
> > > > > > program array.
> > > > > >
> > > > > > Go back to the case, when two perf trace instances are tracing two
> > > > > > different syscalls, again, A is tracing readlinkat, B is tracing openat,
> > > > > > when an openat syscall enters, it triggers the sys_enter program in
> > > > > > instance A, call it ProgA, and the sys_enter program in instance B,
> > > > > > ProgB, now ProgA will return 0 because ProgA cares about readlinkat only,
> > > > > > even though ProgB returns 1; (Is there a sample?) = ProgRetA (0) &
> > > > > > ProgRetB (1) = 0. So there won't be a tracepoint sample in B's output,
> > > > > > when there really should be one.
> > > > >
> > > > > Sounds like a bug.  I think it should run bpf programs attached to the
> > > > > current perf_event only.  Isn't it the case for tracepoint + perf + bpf?
> > > >
> > > > I really can't answer that question.
> >
> > bpf programs for tracepoint are executed before the perf event specific
> > check/trigger in perf_trace_run_bpf_submit
> >
> > bpf programs array is part of struct trace_event_call so it's global per
> > tracepoint, not per perf event
>
> right.
> looks like perf is attaching two different progs to the same sys_enter
> tracepoint and one of them returns 0.
> That's expected behavior.
> The rule is all-yes-is-yes, any-no-is-no.
> We apply this logic to majority (if not all) bpf prog return values.
>
> > IIRC perf trace needs the perf event sample and the bpf program is there
> > to do the filter and some other related stuff?
> >
> > if that's the case I wonder we could switch bpf_prog_run_array logic
> > to be permissive like below, and perhaps make that as tracepoint specific
> > change, because bpf_prog_run_array is used in other place
>
> No. That might break somebody and we don't want to deviate from the rule.

Makes sense. Thanks.

Thanks,
Howard


* Re: [RFC PATCH v1] perf trace: Mitigate failures in parallel perf trace instances
  2025-06-04 10:25       ` Jiri Olsa
  2025-06-06 18:27         ` Alexei Starovoitov
@ 2025-06-09 18:38         ` Howard Chu
  2025-08-11 20:15           ` Namhyung Kim
  1 sibling, 1 reply; 11+ messages in thread
From: Howard Chu @ 2025-06-09 18:38 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Namhyung Kim, Mathieu Desnoyers, acme, mingo, mark.rutland,
	alexander.shishkin, irogers, adrian.hunter, peterz, kan.liang,
	linux-perf-users, linux-kernel, Song Liu, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Steven Rostedt

Hi Jiri,

On Wed, Jun 4, 2025 at 3:25 AM Jiri Olsa <olsajiri@gmail.com> wrote:
>
> On Mon, Jun 02, 2025 at 06:17:43PM -0400, Steven Rostedt wrote:
> > On Fri, 30 May 2025 17:00:38 -0700
> > Howard Chu <howardchu95@gmail.com> wrote:
> >
> > > Hello Namhyung,
> > >
> > > On Fri, May 30, 2025 at 4:37 PM Namhyung Kim <namhyung@kernel.org> wrote:
> > > >
> > > > Hello,
> > > >
> > > > (Adding tracing folks)
> > >
> > > (That's so convenient wow)
> >
> > Shouldn't the BPF folks be more relevant. I don't see any of the tracing
> > code involved here.
> >
> > >
> > > >
> > > > On Wed, May 28, 2025 at 11:55:36PM -0700, Howard Chu wrote:
> > > > > perf trace utilizes the tracepoint utility, the only filter in perf
> > > > > trace is a filter on syscall type. For example, if perf traces only
> > > > > openat, then it filters all the other syscalls, such as readlinkat,
> > > > > readv, etc.
> > > > >
> > > > > This filtering is flawed. Consider this case: two perf trace
> > > > > instances are running at the same time, trace instance A tracing
> > > > > readlinkat, trace instance B tracing openat. When an openat syscall
> > > > > enters, it triggers both BPF programs (sys_enter) in both perf trace
> > > > > instances, these kernel functions will be executed:
> > > > >
> > > > > perf_syscall_enter
> > > > >   perf_call_bpf_enter
> > > > >     trace_call_bpf
> >
> > This is in bpf_trace.c (BPF related, not tracing related).
> >
> > -- Steve
> >
> >
> > > > >       bpf_prog_run_array
> > > > >
> > > > > In bpf_prog_run_array:
> > > > > ~~~
> > > > > while ((prog = READ_ONCE(item->prog))) {
> > > > >       run_ctx.bpf_cookie = item->bpf_cookie;
> > > > >       ret &= run_prog(prog, ctx);
> > > > >       item++;
> > > > > }
> > > > > ~~~
> > > > >
> > > > > I'm not a BPF expert, but by tinkering I found that if one of the BPF
> > > > > programs returns 0, there will be no tracepoint sample. That is,
> > > > >
> > > > > (Is there a sample?) = ProgRetA & ProgRetB & ProgRetC
> > > > >
> > > > > Where ProgRetA is the return value of one of the BPF programs in the BPF
> > > > > program array.
> > > > >
> > > > > Go back to the case, when two perf trace instances are tracing two
> > > > > different syscalls, again, A is tracing readlinkat, B is tracing openat,
> > > > > when an openat syscall enters, it triggers the sys_enter program in
> > > > > instance A, call it ProgA, and the sys_enter program in instance B,
> > > > > ProgB, now ProgA will return 0 because ProgA cares about readlinkat only,
> > > > > even though ProgB returns 1; (Is there a sample?) = ProgRetA (0) &
> > > > > ProgRetB (1) = 0. So there won't be a tracepoint sample in B's output,
> > > > > when there really should be one.
> > > >
> > > > Sounds like a bug.  I think it should run bpf programs attached to the
> > > > current perf_event only.  Isn't it the case for tracepoint + perf + bpf?
> > >
> > > I really can't answer that question.
>
> bpf programs for tracepoint are executed before the perf event specific
> check/trigger in perf_trace_run_bpf_submit
>
> bpf programs array is part of struct trace_event_call so it's global per
> tracepoint, not per perf event
>
> IIRC perf trace needs the perf event sample and the bpf program is there
> to do the filter and some other related stuff?

Actually perf trace relies on BPF programs to produce samples and do
filtering, but it can also turn BPF off and use tracepoints only.

>
> if that's the case I wonder we could switch bpf_prog_run_array logic
> to be permissive like below, and perhaps make that as tracepoint specific
> change, because bpf_prog_run_array is used in other place

Thanks but as Alexei said this may break those who depend on it.

>
> or make it somehow configurable.. otherwise I fear that might break existing
> use cases.. FWIW bpf ci tests [1] survived the change below
>
> jirka
>
>
> [1] https://github.com/kernel-patches/bpf/pull/9044
>
> ---
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 5b25d278409b..4ca8afe0217c 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -2218,12 +2218,12 @@ bpf_prog_run_array(const struct bpf_prog_array *array,
>         const struct bpf_prog *prog;
>         struct bpf_run_ctx *old_run_ctx;
>         struct bpf_trace_run_ctx run_ctx;
> -       u32 ret = 1;
> +       u32 ret = 0;
>
>         RCU_LOCKDEP_WARN(!rcu_read_lock_held(), "no rcu lock held");
>
>         if (unlikely(!array))
> -               return ret;
> +               return 1;
>
>         run_ctx.is_uprobe = false;
>
> @@ -2232,7 +2232,7 @@ bpf_prog_run_array(const struct bpf_prog_array *array,
>         item = &array->items[0];
>         while ((prog = READ_ONCE(item->prog))) {
>                 run_ctx.bpf_cookie = item->bpf_cookie;
> -               ret &= run_prog(prog, ctx);
> +               ret |= run_prog(prog, ctx);
>                 item++;
>         }
>         bpf_reset_run_ctx(old_run_ctx);

Thank you so much for looking into this.

Thanks,
Howard


* Re: [RFC PATCH v1] perf trace: Mitigate failures in parallel perf trace instances
  2025-06-09 18:38         ` Howard Chu
@ 2025-08-11 20:15           ` Namhyung Kim
  0 siblings, 0 replies; 11+ messages in thread
From: Namhyung Kim @ 2025-08-11 20:15 UTC (permalink / raw)
  To: Howard Chu
  Cc: Jiri Olsa, Mathieu Desnoyers, acme, mingo, mark.rutland,
	alexander.shishkin, irogers, adrian.hunter, peterz, kan.liang,
	linux-perf-users, linux-kernel, Song Liu, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko

Hello,

Sorry for the late reply.

On Mon, Jun 09, 2025 at 11:38:00AM -0700, Howard Chu wrote:
> Hi Jiri,
> 
> On Wed, Jun 4, 2025 at 3:25 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Mon, Jun 02, 2025 at 06:17:43PM -0400, Steven Rostedt wrote:
> > > On Fri, 30 May 2025 17:00:38 -0700
> > > Howard Chu <howardchu95@gmail.com> wrote:
> > >
> > > > Hello Namhyung,
> > > >
> > > > On Fri, May 30, 2025 at 4:37 PM Namhyung Kim <namhyung@kernel.org> wrote:
> > > > > On Wed, May 28, 2025 at 11:55:36PM -0700, Howard Chu wrote:
> > > > > > perf trace utilizes the tracepoint utility, the only filter in perf
> > > > > > trace is a filter on syscall type. For example, if perf traces only
> > > > > > openat, then it filters all the other syscalls, such as readlinkat,
> > > > > > readv, etc.
> > > > > >
> > > > > > This filtering is flawed. Consider this case: two perf trace
> > > > > > instances are running at the same time, trace instance A tracing
> > > > > > readlinkat, trace instance B tracing openat. When an openat syscall
> > > > > > enters, it triggers both BPF programs (sys_enter) in both perf trace
> > > > > > instances, these kernel functions will be executed:
> > > > > >
> > > > > > perf_syscall_enter
> > > > > >   perf_call_bpf_enter
> > > > > >     trace_call_bpf
> > > > > >       bpf_prog_run_array
> > > > > >
> > > > > > In bpf_prog_run_array:
> > > > > > ~~~
> > > > > > while ((prog = READ_ONCE(item->prog))) {
> > > > > >       run_ctx.bpf_cookie = item->bpf_cookie;
> > > > > >       ret &= run_prog(prog, ctx);
> > > > > >       item++;
> > > > > > }
> > > > > > ~~~
> > > > > >
> > > > > > I'm not a BPF expert, but by tinkering I found that if one of the BPF
> > > > > > programs returns 0, there will be no tracepoint sample. That is,
> > > > > >
> > > > > > (Is there a sample?) = ProgRetA & ProgRetB & ProgRetC
> > > > > >
> > > > > > Where ProgRetA is the return value of one of the BPF programs in the BPF
> > > > > > program array.
> > > > > >
> > > > > > Go back to the case, when two perf trace instances are tracing two
> > > > > > different syscalls, again, A is tracing readlinkat, B is tracing openat,
> > > > > > when an openat syscall enters, it triggers the sys_enter program in
> > > > > > instance A, call it ProgA, and the sys_enter program in instance B,
> > > > > > ProgB, now ProgA will return 0 because ProgA cares about readlinkat only,
> > > > > > even though ProgB returns 1; (Is there a sample?) = ProgRetA (0) &
> > > > > > ProgRetB (1) = 0. So there won't be a tracepoint sample in B's output,
> > > > > > when there really should be one.
> > > > >
> > > > > Sounds like a bug.  I think it should run bpf programs attached to the
> > > > > current perf_event only.  Isn't it the case for tracepoint + perf + bpf?
> > > >
> > > > I really can't answer that question.
> >
> > bpf programs for tracepoint are executed before the perf event specific
> > check/trigger in perf_trace_run_bpf_submit
> >
> > bpf programs array is part of struct trace_event_call so it's global per
> > tracepoint, not per perf event

Right, I think we need a way to attach a BPF program to the perf_event (as
an overflow handler), not to the trace_event_call, when it comes to a
tracepoint event, so that it only affects the behavior of the calling
thread.  It would access the trace data as sample raw data from ctx.

Maybe it needs new link_create flags and requires BPF_PROG_TYPE_PERF_EVENT.
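
Just to illustrate, a purely hypothetical sketch of the BPF side (none
of this exists today, neither the attach path nor the raw-data access):

~~~
#include <linux/bpf.h>
#include <linux/bpf_perf_event.h>
#include <bpf/bpf_helpers.h>

SEC("perf_event")	/* BPF_PROG_TYPE_PERF_EVENT */
int per_event_sys_enter(struct bpf_perf_event_data *ctx)
{
	/* Hypothetical: read the tracepoint payload as this sample's raw
	 * data from ctx, filter on the syscall number, maybe augment it,
	 * and return 0/1 to drop/keep the sample for this perf_event only,
	 * without affecting other perf trace instances. */
	return 1;
}

char _license[] SEC("license") = "GPL";
~~~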

Wdyt?

Thanks,
Namhyung


