All of lore.kernel.org
 help / color / mirror / Atom feed
From: Namhyung Kim <namhyung@kernel.org>
To: Howard Chu <howardchu95@gmail.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: acme@kernel.org, mingo@redhat.com, mark.rutland@arm.com,
	alexander.shishkin@linux.intel.com, jolsa@kernel.org,
	irogers@google.com, adrian.hunter@intel.com,
	peterz@infradead.org, kan.liang@linux.intel.com,
	linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org,
	Song Liu <song@kernel.org>
Subject: Re: [RFC PATCH v1] perf trace: Mitigate failures in parallel perf trace instances
Date: Fri, 30 May 2025 16:37:48 -0700	[thread overview]
Message-ID: <aDpBTLoeOJ3NAw_-@google.com> (raw)
In-Reply-To: <20250529065537.529937-1-howardchu95@gmail.com>

Hello,

(Adding tracing folks)

On Wed, May 28, 2025 at 11:55:36PM -0700, Howard Chu wrote:
> perf trace utilizes the tracepoint utility, the only filter in perf
> trace is a filter on syscall type. For example, if perf traces only
> openat, then it filters all the other syscalls, such as readlinkat,
> readv, etc.
> 
> This filtering is flawed. Consider this case: two perf trace
> instances are running at the same time, trace instance A tracing
> readlinkat, trace instance B tracing openat. When an openat syscall
> enters, it triggers both BPF programs (sys_enter) in both perf trace
> instances, these kernel functions will be executed:
> 
> perf_syscall_enter
>   perf_call_bpf_enter
>     trace_call_bpf
>       bpf_prog_run_array
> 
> In bpf_prog_run_array:
> ~~~
> while ((prog = READ_ONCE(item->prog))) {
> 	run_ctx.bpf_cookie = item->bpf_cookie;
> 	ret &= run_prog(prog, ctx);
> 	item++;
> }
> ~~~
> 
> I'm not a BPF expert, but by tinkering I found that if one of the BPF
> programs returns 0, there will be no tracepoint sample. That is,
> 
> (Is there a sample?) = ProgRetA & ProgRetB & ProgRetC
> 
> Where ProgRetA is the return value of one of the BPF programs in the BPF
> program array.
> 
> Go back to the case, when two perf trace instances are tracing two
> different syscalls, again, A is tracing readlinkat, B is tracing openat,
> when an openat syscall enters, it triggers the sys_enter program in
> instance A, call it ProgA, and the sys_enter program in instance B,
> ProgB, now ProgA will return 0 because ProgA cares about readlinkat only,
> even though ProgB returns 1; (Is there a sample?) = ProgRetA (0) &
> ProgRetB (1) = 0. So there won't be a tracepoint sample in B's output,
> when there really should be one.

Sounds like a bug.  I think it should run bpf programs attached to the
current perf_event only.  Isn't it the case for tracepoint + perf + bpf?

> 
> I also want to point out that openat and readlinkat have augmented
> output, so my example might not be accurate, but it does explain the
> current perf-trace-in-parallel dilemma.
> 
> Now for augmented output, it is different. When it calls
> bpf_perf_event_output, there is a sample. There won't be no ProgRetA &
> ProgRetB... thing. So I will send another RFC patch to enable
> parallelism using this feature. Also, augmented_output creates a sample
> on it's own, so returning 1 will create a duplicated sample, when
> augmented, just return 0 instead.

Yes, it's bpf-output and tracepoint respectively.  Maybe we should
always return 1 not to drop syscalls unintentionally and perf can
discard duplicated samples.

Another approach would be return 0 always and use bpf-output for
unaugmented syscalls too.  But I'm afraid it'd affect other perf tools
using tracepoints.

> 
> Is this approach perfect? Absolutely not, there will likely be some
> performance overhead on the kernel side. It is just a quick dirty fix
> that makes perf trace run in parallel without failing. This patch is an
> explanation on the reason of failures and possibly, a link used in a
> nack comment.

Thanks for your work, but I'm afraid it'd still miss some syscalls as it
returns 0 sometimes.

Thanks,
Namhyung

> 
> Cc: Song Liu <song@kernel.org>
> Signed-off-by: Howard Chu <howardchu95@gmail.com>
> ---
>  .../util/bpf_skel/augmented_raw_syscalls.bpf.c   | 16 +++++++---------
>  1 file changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
> index e4352881e3fa..315fadf01321 100644
> --- a/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
> +++ b/tools/perf/util/bpf_skel/augmented_raw_syscalls.bpf.c
> @@ -546,13 +546,14 @@ int sys_enter(struct syscall_enter_args *args)
>  	/*
>  	 * Jump to syscall specific augmenter, even if the default one,
>  	 * "!raw_syscalls:unaugmented" that will just return 1 to return the
> -	 * unaugmented tracepoint payload.
> +	 * unaugmented tracepoint payload. If augmented, return 0 to reduce a
> +	 * duplicated tracepoint sample.
>  	 */
> -	if (augment_sys_enter(args, &augmented_args->args))
> -		bpf_tail_call(args, &syscalls_sys_enter, augmented_args->args.syscall_nr);
> +	if (!augment_sys_enter(args, &augmented_args->args))
> +		return 0;
>  
> -	// If not found on the PROG_ARRAY syscalls map, then we're filtering it:
> -	return 0;
> +	bpf_tail_call(args, &syscalls_sys_enter, augmented_args->args.syscall_nr);
> +	return 1;
>  }
>  
>  SEC("tp/raw_syscalls/sys_exit")
> @@ -570,10 +571,7 @@ int sys_exit(struct syscall_exit_args *args)
>  	 * unaugmented tracepoint payload.
>  	 */
>  	bpf_tail_call(args, &syscalls_sys_exit, exit_args.syscall_nr);
> -	/*
> -	 * If not found on the PROG_ARRAY syscalls map, then we're filtering it:
> -	 */
> -	return 0;
> +	return 1;
>  }
>  
>  char _license[] SEC("license") = "GPL";
> -- 
> 2.45.2
> 

  parent reply	other threads:[~2025-05-30 23:37 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-05-29  6:55 [RFC PATCH v1] perf trace: Mitigate failures in parallel perf trace instances Howard Chu
2025-05-30  0:23 ` Howard Chu
2025-05-30 23:37 ` Namhyung Kim [this message]
2025-05-31  0:00   ` Howard Chu
2025-06-02 22:17     ` Steven Rostedt
2025-06-03  4:56       ` Namhyung Kim
2025-06-04 10:25       ` Jiri Olsa
2025-06-06 18:27         ` Alexei Starovoitov
2025-06-09 18:30           ` Howard Chu
2025-06-09 18:38         ` Howard Chu
2025-08-11 20:15           ` Namhyung Kim

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aDpBTLoeOJ3NAw_-@google.com \
    --to=namhyung@kernel.org \
    --cc=acme@kernel.org \
    --cc=adrian.hunter@intel.com \
    --cc=alexander.shishkin@linux.intel.com \
    --cc=howardchu95@gmail.com \
    --cc=irogers@google.com \
    --cc=jolsa@kernel.org \
    --cc=kan.liang@linux.intel.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=mark.rutland@arm.com \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=song@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.