From: Jiri Olsa <olsajiri@gmail.com>
To: Mark Rutland <mark.rutland@arm.com>
Cc: "Steven Rostedt" <rostedt@kernel.org>,
"Florent Revest" <revest@google.com>,
bpf@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-trace-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org,
"Alexei Starovoitov" <ast@kernel.org>,
"Daniel Borkmann" <daniel@iogearbox.net>,
"Andrii Nakryiko" <andrii@kernel.org>,
"Menglong Dong" <menglong8.dong@gmail.com>,
"Naveen N Rao" <naveen@kernel.org>,
"Michael Ellerman" <mpe@ellerman.id.au>,
"Björn Töpel" <bjorn@rivosinc.com>,
"Andy Chiu" <andybnac@gmail.com>,
"Alexandre Ghiti" <alexghiti@rivosinc.com>,
"Palmer Dabbelt" <palmer@dabbelt.com>
Subject: Re: [RFC 00/10] ftrace,bpf: Use single direct ops for bpf trampolines
Date: Wed, 30 Jul 2025 13:19:51 +0200 [thread overview]
Message-ID: <aIn_12KHz7ikF2t1@krava> (raw)
In-Reply-To: <aIkLlB7Z7V--BeGi@J2N7QTR9R3.cambridge.arm.com>
On Tue, Jul 29, 2025 at 06:57:40PM +0100, Mark Rutland wrote:
> Hi Jiri,
>
> [adding some powerpc and riscv folk, see below]
>
> On Tue, Jul 29, 2025 at 12:28:03PM +0200, Jiri Olsa wrote:
> > hi,
> > while poking the multi-tracing interface I ended up with just one
> > ftrace_ops object to attach all trampolines.
> >
> > This change allows to use less direct API calls during the attachment
> > changes in the future code, so in effect speeding up the attachment.
>
> How important is that, and what sort of speedup does this result in? I
> ask due to potential performance hits noted below, and I'm lacking
> context as to why we want to do this in the first place -- what is this
> intended to enable/improve?
so it's all work at the PoC stage; the idea is to be able to attach many
(like 20, 30, 40k) functions to their trampolines quickly, which at the
moment is slow because all the involved interfaces work with just a single
function/trampoline relation
there's ongoing development by Menglong [1] to organize such attachment
for multiple functions and trampolines, but at the end we still have to use
the ftrace direct interface to do the attachment for each involved ftrace_ops

so at the point of attachment it helps to have as few ftrace_ops objects
as possible; in my test code I ended up with just a single ftrace_ops object
and I see an attachment time of around 3 seconds for 20k functions

IIUC Menglong's change needs 12 ftrace_ops objects, so we need to do around
12 ftrace direct calls .. so probably not that bad, but still it would be
faster with just a single ftrace_ops involved
[1] https://lore.kernel.org/bpf/20250703121521.1874196-1-dongml2@chinatelecom.cn/
>
> > However having just single ftrace_ops object removes direct_call
> > field from direct_call, which is needed by arm, so I'm not sure
> > it's the right path forward.
>
> It's also needed by powerpc and riscv since commits:
>
> a52f6043a2238d65 ("powerpc/ftrace: Add support for DYNAMIC_FTRACE_WITH_DIRECT_CALLS")
> b21cdb9523e5561b ("riscv: ftrace: support direct call using call_ops")
>
> > Mark, Florent,
> > any idea how hard would it be to for arm to get rid of direct_call field?
>
> For architectures which follow the arm64 style of implementation, it's
> pretty hard to get rid of it without introducing a performance hit to
> the call and/or a hit to attachment/detachment/modification. It would
> also end up being a fair amount more complicated.
>
> There's some historical rationale at:
>
> https://lore.kernel.org/lkml/ZfBbxPDd0rz6FN2T@FVFF77S0Q05N/
>
> ... but the gist is that for several reasons we want the ops pointer in
> the callsite, and for atomic modification of this to switch everything
> dependent on that ops atomically, as this keeps the call logic and
> attachment/detachment/modification logic simple and pretty fast.
>
> If we remove the direct_call pointer from the ftrace_ops, then IIUC our
> options include:
>
> * Point the callsite pointer at some intermediate structure which points
> to the ops (e.g. the dyn_ftrace for the callsite). That introduces an
> additional dependent load per call that needs the ops, and introduces
> potential incoherency with other fields in that structure, requiring
> more synchronization overhead for attachment/detachment/modification.
>
> * Point the callsite pointer at a trampoline which can generate the ops
> pointer. This requires that every ops has a trampoline even for
> non-direct usage, which then requires more memory / I$, has more
> potential failure points, and is generally more complicated. The
> performance here will vary by architecture and platform, on some this
> might be faster, on some it might be slower.
>
> Note that we probably still need to bounce through an intermediary
> trampoline here to actually load from the callsite pointer and
> indirectly branch to it.
>
> ... but I'm not really keen on either unless we really have to remove
> the ftrace_ops::direct_call field, since they come with a substantial
> jump in complexity.
ok, that sounds bad.. thanks for the details

Steven, please correct me if/when I'm wrong ;-)

IIUC on x86_64, IF there's just a single ftrace_ops defined for the function,
it will bypass the ftrace trampoline and call the function's direct
trampoline directly, like:
<foo>:
call direct_trampoline
...
IF there are other ftrace_ops 'users' on the same function, we execute
each of them, like:
<foo>:
call ftrace_trampoline
call ftrace_ops_1->func
call ftrace_ops_2->func
...
with our direct ftrace_ops->func currently using ftrace_ops->direct_call
to return direct trampoline for the function:
-static void call_direct_funcs(unsigned long ip, unsigned long pip,
- struct ftrace_ops *ops, struct ftrace_regs *fregs)
-{
- unsigned long addr = READ_ONCE(ops->direct_call);
-
- if (!addr)
- return;
-
- arch_ftrace_set_direct_caller(fregs, addr);
-}
in the new changes it will do a hash lookup (based on ip) for the direct
trampoline we want to execute:
+static void call_direct_funcs_hash(unsigned long ip, unsigned long pip,
+ struct ftrace_ops *ops, struct ftrace_regs *fregs)
+{
+ unsigned long addr;
+
+ addr = ftrace_find_rec_direct(ip);
+ if (!addr)
+ return;
+
+ arch_ftrace_set_direct_caller(fregs, addr);
+}
still, this is the slow path, for the case where multiple ftrace_ops objects
use the same function.. for the fast path we have the direct attachment as
described above

sorry, I probably forgot/missed the discussion on this, but doing the fast
path like on x86_64 is not an option on arm, right?
thanks,
jirka
Thread overview: 19+ messages
2025-07-29 10:28 [RFC 00/10] ftrace,bpf: Use single direct ops for bpf trampolines Jiri Olsa
2025-07-29 10:28 ` [RFC 01/10] ftrace: Make alloc_and_copy_ftrace_hash direct friendly Jiri Olsa
2025-07-29 10:28 ` [RFC 02/10] ftrace: Add register_ftrace_direct_hash function Jiri Olsa
2025-07-29 10:28 ` [RFC 03/10] ftrace: Add unregister_ftrace_direct_hash function Jiri Olsa
2025-07-29 10:28 ` [RFC 04/10] ftrace: Add modify_ftrace_direct_hash function Jiri Olsa
2025-07-29 10:28 ` [RFC 05/10] ftrace: Export some of hash related functions Jiri Olsa
2025-07-29 10:28 ` [RFC 06/10] ftrace: Use direct hash interface in direct functions Jiri Olsa
2025-07-29 10:28 ` [RFC 07/10] bpf: Add trampoline ip hash table Jiri Olsa
2025-07-29 10:28 ` [RFC 08/10] ftrace: Factor ftrace_ops ops_func interface Jiri Olsa
2025-07-29 10:28 ` [RFC 09/10] bpf: Remove ftrace_ops from bpf_trampoline object Jiri Olsa
2025-07-29 10:28 ` [RFC 10/10] Revert "ftrace: Store direct called addresses in their ops" Jiri Olsa
2025-07-29 17:57 ` [RFC 00/10] ftrace,bpf: Use single direct ops for bpf trampolines Mark Rutland
2025-07-30 11:19 ` Jiri Olsa [this message]
2025-07-30 13:56 ` Steven Rostedt
2025-07-31 20:40 ` Jiri Olsa
2025-08-01 9:49 ` Mark Rutland
2025-08-02 21:26 ` Jiri Olsa
2025-08-06 10:20 ` Mark Rutland
2025-08-13 11:09 ` Jiri Olsa