Re: [RFC 0/1] BPF tracing for arm64 using fprobe


From: Masami Hiramatsu (Google) <mhiramat@kernel.org>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: Florent Revest <revest@chromium.org>, bpf <bpf@vger.kernel.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	KP Singh <kpsingh@kernel.org>,
	Brendan Jackman <jackmanb@google.com>,
	markowsky@google.com, Mark Rutland <mark.rutland@arm.com>,
	Masami Hiramatsu <mhiramat@kernel.org>,
	Steven Rostedt <rostedt@goodmis.org>,
	Xu Kuohai <xukuohai@huawei.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFC 0/1] BPF tracing for arm64 using fprobe
Date: Thu, 17 Nov 2022 22:33:53 +0900
Message-ID: <20221117223353.431e29124ba51a72c3507ced@kernel.org>
In-Reply-To: <CAADnVQ+BWpzqOV8dGCR=A3dR3u60CkBkqSXEQHe2kVqFzsgnHw@mail.gmail.com>

On Wed, 16 Nov 2022 18:41:26 -0800
Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote:

> On Tue, Nov 8, 2022 at 2:07 PM Florent Revest <revest@chromium.org> wrote:
> >
> > Hi!
> >
> > With this RFC, I'd like to revive the conversation between BPF, ARM and tracing
> > folks on what BPF tracing (fentry/fexit/fmod_ret) could/should look like on
> > arm64.
> >
> > Current status of BPF tracing
> > =============================
> >
> > On currently supported architectures (like x86), BPF tracing programs are
> > called from a JITted BPF trampoline, itself called from the ftrace patch site
> > thanks to the ftrace "direct call" API. (or from the end of the ftrace
> > trampoline if a ftrace ops is also tracing that function, but this is
> > transparent to BPF)
> >
> > Thanks to Xu's work [1], we now have BPF trampolines on arm64 (these can be
> > used for struct ops programs already), but Xu's attempts at getting ftrace
> > direct calls support [2][3] on arm64 have been unsuccessful so far so we still
> > do not support BPF tracing programs. This prompted me to try a different
> > approach. I'd like to collect feedback on it here.
> >
> > Why not direct calls ?
> > ======================
> >
> > Mark and Steven have not been too keen on getting direct calls on arm64 because:
> > - working around BL instruction's limited range introduces complexity [4]
> > - it's difficult to get reliable stacktraces right with direct calls [5]
> > - direct calls are complex to maintain on the arch/ftrace side [5]
> >
> > In the absence of ftrace direct calls support, BPF tracing programs would need
> > to be called from an ftrace ops instead. Note that the BPF callback signature
> > would have to be different, so we can't re-use trampolines (direct called
> > callbacks receive arguments in registers whereas ftrace ops callbacks receive
> > arguments in a struct ftrace_regs pointer)
> >
> > Why fprobe ?
> > ============
> >
> > Ftrace ops per-se only expose an API to hook before a function. There are two
> > systems built on top of ftrace ops that also allow hooking the function exit:
> > fprobe (using rethook) and the function graph tracer. There are plans from
> > Masami and Steven to unify these two systems but, as they stand, only fprobe
> > gives enough flexibility to implement BPF tracing.
> >
> > In order not to reinvent the wheel, if direct calls aren't available on the
> > arch, BPF could leverage fprobe to hook before and after the traced function.
> > Note that return hooking is implemented a bit differently than it is in BPF
> > trampolines. Instead of keeping arguments on a stack frame and calling the
> > traced function, rethook saves arguments in a memory pool and returns to the
> > traced function with a hijacked return pointer that will have its ret jump back
> > to the rethook trampoline.
> >
> > What about performances ?
> > =========================
> >
> > In its current state, a fprobe callback on arm64 is very expensive because:
> > 1- the ftrace trampoline saves all registers (including many unnecessary ones)
> > 2- it calls ftrace_ops_list_func which iterates over all ops and is very slow
> > 3- the fprobe ops unconditionally hooks a rethook
> > 4- rethook grabs memory from a freelist which is slow under high contention
> >
> > However, all the above points are currently being addressed:
> > 1- by Mark's series to save argument registers only [6]
> > 2- by Mark's series to call single ops directly [7]
> > 3- by Masami's patch to skip rethooks if not needed [8]
> > 4- Masami said the rethook freelist would be replaced by a per-task stack as
> >    part of its unification with the function graph tracer [9]
> >
> > I measured the costs of BPF on different approaches on my RPi4 here: [10]
> > tl;dr: the BPF "bench" takes a performance hit of:
> > - 28.6% w/ BPF tracing on direct calls (best case scenario for reference) [11]
> > - 66.8% w/ BPF on kprobe (just for reference)
> > - 62.6% w/ BPF tracing on fprobe without any optimizations (current state) [12]
> > - 34.1% w/ BPF tracing on fprobe with all optimizations (near-future state) [13]
> 
> Even with all optimizations the performance overhead is not acceptable.
> It feels to me that folks are still thinking about bpf trampoline
> as a tracing facility.
> It's a lot more than that. It needs to run 24/7 with zero overhead.
> It needs to replace the kernel functions and be invoked
> millions of times a second until the system is rebooted.
> In this environment every nanosecond counts.
> 
> Even if the fprobe side were completely free, patch 1 has so much
> overhead in copying bpf_cookie, regs, etc. that it's a non-starter
> for these use cases.
> 
> There are several other fundamental issues in this approach
> because of fprobe/ftrace.
> It has ftrace_test_recursion_trylock and disables preemption.
> Both are deal breakers.

I talked with Florent about this offline.
ftrace_test_recursion_trylock() is required for generic ftrace
use because a user callback can call a function that is itself
traced by ftrace, which can cause an infinite loop.
However, if the user can guarantee that check on their own, I can
add a flag to skip that trylock. (Of course, you can then shoot
yourself in the foot.)

I thought the preemption disabling was for accessing per-cpu data,
but it is actually needed for rethook to get an object from an
RCU-protected list.
Thus, once we move to the per-task shadow stack, it can be
removed too.
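
To illustrate why a per-task shadow stack would not need preemption
disabling, here is a minimal user-space sketch. It is not the rethook
or function graph tracer code; struct task_shadow_stack, shadow_push(),
shadow_pop() and SHADOW_STACK_DEPTH are hypothetical names made up for
this example. The point is only that each task pushes and pops its own
array, so there is no shared list to protect.

```c
#include <assert.h>
#include <stddef.h>

#define SHADOW_STACK_DEPTH 64	/* hypothetical per-task limit */

/* One saved return site; 'data' stands in for per-call handler data. */
struct shadow_entry {
	unsigned long ret_addr;
	void *data;
};

/* Hypothetical per-task state: only the owning task ever touches it. */
struct task_shadow_stack {
	struct shadow_entry entries[SHADOW_STACK_DEPTH];
	size_t top;
};

/*
 * Called on function entry. Returns 0 on success, or -1 if the stack
 * is full, in which case the exit hook would simply be skipped.
 */
static int shadow_push(struct task_shadow_stack *s,
		       unsigned long ret_addr, void *data)
{
	if (s->top == SHADOW_STACK_DEPTH)
		return -1;
	s->entries[s->top].ret_addr = ret_addr;
	s->entries[s->top].data = data;
	s->top++;
	return 0;
}

/*
 * Called on function return. Returns the original return address and
 * stores the per-call data in *data, or returns 0 if the stack is empty.
 */
static unsigned long shadow_pop(struct task_shadow_stack *s, void **data)
{
	if (s->top == 0)
		return 0;
	s->top--;
	if (data)
		*data = s->entries[s->top].data;
	return s->entries[s->top].ret_addr;
}
```

Entries come back in last-in, first-out order, which matches nested
function returns, and no allocation or locking happens on either path.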

> 
> bpf trampoline has to allow recursion in some cases.
> See __bpf_prog_enter*() flavors.
> 
> bpf trampoline also has to use migrate_disable instead of preemption
> and rcu_read_lock() in some cases and rcu_read_lock_trace() in others.

Is rcu_read_lock() better than preempt_disable()? 

> 
> bpf trampoline must never allocate memory or grab locks.

Note that ftrace_test_recursion_trylock() is just a per-task bit
operation; it takes no lock and is not atomic.
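
As a rough user-space sketch of what "a bit operation per-task" means
(this is not the kernel implementation; struct task_ctx, the CTX_*
values, recursion_trylock() and recursion_unlock() are hypothetical
names for illustration only):

```c
#include <assert.h>

/* Hypothetical stand-in for per-task state; not the kernel's task_struct. */
struct task_ctx {
	unsigned long recursion_bits;
};

/* Hypothetical context indexes: one guard bit per execution context. */
enum { CTX_NORMAL = 0, CTX_SOFTIRQ = 1, CTX_IRQ = 2, CTX_NMI = 3 };

/*
 * Try to enter the guarded region for 'ctx'. Returns the bit taken, or
 * -1 if this task is already inside it. This is a plain read-modify-write,
 * with no lock and no atomic, because only the owning task uses the word.
 */
static int recursion_trylock(struct task_ctx *t, int ctx)
{
	unsigned long mask = 1UL << ctx;

	if (t->recursion_bits & mask)
		return -1;
	t->recursion_bits |= mask;
	return ctx;
}

/* Leave the guarded region; 'bit' must come from a successful trylock. */
static void recursion_unlock(struct task_ctx *t, int bit)
{
	assert(t->recursion_bits & (1UL << bit));
	t->recursion_bits &= ~(1UL << bit);
}

/*
 * A callback that ends up "tracing itself": without the guard this would
 * recurse forever. Returns how many times the body actually ran.
 */
static int guarded_callback(struct task_ctx *t, int ctx)
{
	int bit = recursion_trylock(t, ctx);
	int calls;

	if (bit < 0)
		return 0;	/* recursion detected, bail out */
	calls = 1 + guarded_callback(t, ctx);
	recursion_unlock(t, bit);
	return calls;
}
```

In this sketch a nested call in the same context is rejected, while a
different context (for example an interrupt arriving in the middle) uses
its own bit and is still allowed through.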

Thank you,

> 
> All of these mandatory features exclude fprobe, ftrace, rethook
> from possible options.
> 
> Let's figure out how to address concerns with direct calls:
> 
> > - working around BL instruction's limited range introduces complexity [4]
> > - it's difficult to get reliable stacktraces right with direct calls [5]
> > - direct calls are complex to maintain on the arch/ftrace side [5]


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>

