From: Jiri Olsa <olsajiri@gmail.com>
To: Jiri Olsa <olsajiri@gmail.com>
Cc: Andrii Nakryiko <andrii.nakryiko@gmail.com>,
Alexei Starovoitov <alexei.starovoitov@gmail.com>,
yunwei356@gmail.com, bpf <bpf@vger.kernel.org>,
Alexei Starovoitov <ast@kernel.org>,
lsf-pc <lsf-pc@lists.linux-foundation.org>,
Yonghong Song <yonghong.song@linux.dev>,
Oleg Nesterov <oleg@redhat.com>,
Daniel Borkmann <daniel@iogearbox.net>
Subject: Re: [LSF/MM/BPF TOPIC] faster uprobes
Date: Tue, 5 Mar 2024 16:30:10 +0100
Message-ID: <Zec6guiGtXHgUpbx@krava>
In-Reply-To: <ZebWqIgABbk8dQXT@krava>
On Tue, Mar 05, 2024 at 09:24:08AM +0100, Jiri Olsa wrote:
> On Mon, Mar 04, 2024 at 04:55:33PM -0800, Andrii Nakryiko wrote:
> > On Sun, Mar 3, 2024 at 2:20 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > >
> > > On Fri, Mar 01, 2024 at 09:26:57AM -0800, Andrii Nakryiko wrote:
> > > > On Fri, Mar 1, 2024 at 9:01 AM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Fri, Mar 1, 2024 at 12:18 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > > > > >
> > > > > > On Thu, Feb 29, 2024 at 04:25:17PM -0800, Andrii Nakryiko wrote:
> > > > > > > On Thu, Feb 29, 2024 at 6:39 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> > > > > > > >
> > > > > > > > One of uprobe pain points is having slow execution that involves
> > > > > > > > two traps in worst case scenario or single trap if the original
> > > > > > > > instruction can be emulated. For return uprobes there's one extra
> > > > > > > > trap on top of that.
> > > > > > > >
> > > > > > > > My current idea on how to make this faster is to follow the optimized
> > > > > > > > kprobes and replace the normal uprobe trap instruction with jump to
> > > > > > > > user space trampoline that:
> > > > > > > >
> > > > > > > > - executes syscall to call uprobe consumers callbacks
> > > > > > >
> > > > > > > Did you get a chance to measure relative performance of syscall vs
> > > > > > > int3 interrupt handling? If not, do you think you'll be able to get
> > > > > > > some numbers by the time the conference starts? This should inform the
> > > > > > > decision whether it even makes sense to go through all the trouble.
> > > > > >
> > > > > > right, will do that
> > > > >
> > > > > I believe Yusheng measured syscall vs uprobe performance
> > > > > difference during LPC. iirc it was something like 3x.
> > > >
> > > > Do you have a link to slides? Was it actual uprobe vs just some fast
> > > > syscall (not doing BPF program execution) comparison? Or comparing the
> > > > performance of int3 handling vs equivalent syscall handling.
> > > >
> > > > I suspect it's the former, and so probably not that representative.
> > > > I'm curious about the performance of going
> > > > userspace->kernel->userspace through int3 vs syscall (all other things
> > > > being equal).
> > >
> > > I have a simple test [1] comparing:
> > > - uprobe with 2 traps
> > > - uprobe with 1 trap
> > > - syscall executing uprobe
> > >
> > > the syscall takes uprobe address as argument, finds the uprobe and executes
> > > its consumers, which should be comparable to what the trampoline will do
> > >
> > > test does same amount of loops triggering each uprobe type and measures
> > > the time it took
> > >
> > > # ./test_progs -t uprobe_syscall_bench -v
> > > bpf_testmod.ko is already unloaded.
> > > Loading bpf_testmod.ko...
> > > Successfully loaded bpf_testmod.ko.
> > > test_bench_1:PASS:uprobe_bench__open_and_load 0 nsec
> > > test_bench_1:PASS:uprobe_bench__attach 0 nsec
> > > test_bench_1:PASS:uprobe1_cnt 0 nsec
> > > test_bench_1:PASS:syscalls_uprobe1_cnt 0 nsec
> > > test_bench_1:PASS:uprobe2_cnt 0 nsec
> > > test_bench_1: uprobes (1 trap) in 36.439s
> > > test_bench_1: uprobes (2 trap) in 91.960s
> > > test_bench_1: syscalls in 17.872s
> > > #395/1 uprobe_syscall_bench/bench_1:OK
> > > #395 uprobe_syscall_bench:OK
> > > Summary: 1/1 PASSED, 0 SKIPPED, 0 FAILED
> > >
> > > syscall uprobe execution seems to be ~2x faster than 1 trap uprobe
> > > and ~5x faster than 2 traps uprobe
> > >
> >
> > Thanks for running benchmarks! I quickly looked at the selftest and
> > noticed this:
> >
> > +/*
> > + * Assuming following prolog:
> > + *
> > + * 6984ac: 55 push %rbp
> > + * 6984ad: 48 89 e5 mov %rsp,%rbp
> > + */
> > +noinline void uprobe2_bench_trigger(void)
> > +{
> > + asm volatile ("");
> > +}
> >
> > This actually will be optimized out to just ret in -O2 mode (make
> > RELEASE=1 for selftests):
> >
> > 00000000005a0ce0 <uprobe2_bench_trigger>:
> > 5a0ce0: c3 retq
> > 5a0ce1: 66 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax)
> > 5a0cec: 0f 1f 40 00 nopl (%rax)
> >
> > So be careful with that.
>
> right, I did not mean for this to be checked in, just wanted to get the
> numbers quickly
>
> >
> > Also, I just updated our existing set of uprobe benchmarks (see [0]),
> > do you mind adding your syscall-based one as another one there and
> > running all of them and sharing the numbers with us? Very curious to
> > see both absolute and relative numbers from that benchmark. (and
> > please do build with RELEASE=1)
> >
> > You should be able to just run benchs/run_bench_uprobes.sh (also don't
> > forget to add your syscall-based benchmark to the list of benchmarks
> > in that shell script).
>
> yes, saw it and was going to run/compare it.. it's good idea to add
> the syscall one and get all numbers together, will do that

seems to be consistent with my previous test:

  base             : 15.854 ± 0.007M/s
  uprobe-nop       :  2.859 ± 0.007M/s
  uprobe-push      :  2.697 ± 0.002M/s
  uprobe-ret       :  1.081 ± 0.000M/s
  uprobe-syscall   :  5.520 ± 0.006M/s
  uretprobe-nop    :  1.422 ± 0.002M/s
  uretprobe-push   :  1.396 ± 0.002M/s
  uretprobe-ret    :  0.787 ± 0.000M/s
  uretprobe-syscall:  1.888 ± 0.002M/s

syscall uprobe is ~2x faster than 1 trap uprobe and ~5x faster than 2 traps uprobe

uretprobe is bit more tricky to compare, the speed up is there for the initial
uprobe hit, then there's again the trap from the uretprobe trampoline

I have the bench changes in here [1], I'll send it out together with rfc post

jirka


[1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=uprobe_syscall_bench_1

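
For reference, the ~2x and ~5x figures follow directly from the throughput
numbers above. A minimal sketch of that arithmetic, assuming uprobe-nop is the
1 trap case and uprobe-ret is the 2 traps case (that mapping is an assumption,
and the struct and function names are hypothetical, not part of the benchmark):

```c
#include <assert.h>
#include <stdio.h>

/* Throughput in million triggers per second, copied from the results above. */
struct bench_result {
	const char *name;
	double mops;
};

static const struct bench_result uprobe_syscall    = { "uprobe-syscall",    5.520 };
static const struct bench_result uprobe_nop        = { "uprobe-nop",        2.859 };
static const struct bench_result uprobe_ret        = { "uprobe-ret",        1.081 };
static const struct bench_result uretprobe_syscall = { "uretprobe-syscall", 1.888 };
static const struct bench_result uretprobe_nop     = { "uretprobe-nop",     1.422 };
static const struct bench_result uretprobe_ret     = { "uretprobe-ret",     0.787 };

/* Ratio of two throughputs: how many times faster 'a' is than 'b'. */
static double speedup(const struct bench_result *a, const struct bench_result *b)
{
	assert(b->mops > 0.0);
	return a->mops / b->mops;
}

/* Print one "a vs b" line. */
static void report(const struct bench_result *a, const struct bench_result *b)
{
	printf("%-18s vs %-14s: %.2fx\n", a->name, b->name, speedup(a, b));
}

/* Print the comparisons discussed in the text. */
static void report_all(void)
{
	report(&uprobe_syscall, &uprobe_nop);       /* assumed 1 trap case  */
	report(&uprobe_syscall, &uprobe_ret);       /* assumed 2 traps case */
	report(&uretprobe_syscall, &uretprobe_nop);
	report(&uretprobe_syscall, &uretprobe_ret);
}
```

Calling report_all() from a main() prints the four ratios; the uretprobe ones
come out smaller, which matches the remaining trap in the uretprobe trampoline.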
>
> >
> > Thank you!
> >
> >
> > BTW, while I think patching multiple instructions for syscall-based
> > uprobe is going to be extremely tricky, I think at least u*ret*probe's
> > int3 can be pretty easily optimized away with syscall, given that the
> > kernel controls code generation there. If anything, it will get the
> > uretprobe case a bit closer to the performance of uprobe. Give it some
> > thought.
>
> hm, right.. the trampoline is there already, but at the moment is global
> and used by all uretprobes.. and int3 code moves userspace (changes rip)
> to the original return address.. maybe we can do that through syscall
> as well
>
> or we could add jump back to uretprobe's original return address to the
> trampoline, but then we need special trampoline for each uretprobe,
> I'll check
>
> thanks,
> jirka
>
> >
> >
> > [0] https://patchwork.kernel.org/project/netdevbpf/patch/20240301214551.1686095-1-andrii@kernel.org/
> >
> > > jirka
> > >
> > >
> > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git/log/?h=uprobe_syscall_bench
> > >
> > > >
> > > > > Certainly necessary to have a benchmark.
> > > > > selftests/bpf/bench has one for uprobe.
> > > > > Probably should extend with sys_bpf.
> > > > >
> > > > > Regarding:
> > > > > > replace the normal uprobe trap instruction with jump to
> > > > > user space trampoline
> > > > >
> > > > > it should probably be a call to trampoline instead of a jump.
> > > > > Unless you plan to generate a different trampoline for every location ?
> > > > >
> > > > > Also how would you pick a space for a trampoline in the target process ?
> > > > > Analyze /proc/pid/maps and look for gaps in executable sections?
> > > >
> > > > kernel already does that for uretprobes, it adds a new "[uprobes]"
> > > > memory mapping, so this part is already implemented
> > > >
> > > > >
> > > > > We can start simple with a USDT that uses nop5 instead of nop1
> > > > > and explicit single trampoline for all USDT locations
> > > > > that saves all (callee and caller saved) registers and
> > > > > then does sys_bpf with a new cmd.
> > > > >
> > > > > To replace nop5 with a call to trampoline we can use text_poke_bp
> > > > > approach: replace 1st byte with int3, replace 2-5 with target addr,
> > > > > replace 1st byte to make an actual call insn.
> > > > >
> > > > > Once patched there will be no simulation of insns or kernel traps.
> > > > > Just normal user code that calls into trampoline, that calls sys_bpf,
> > > > > and returns back.
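
The three-step nop5 -> call patching described in the quoted text above can be
sketched on a plain byte buffer. This is only an illustration of the byte
ordering and the rel32 computation, not the kernel's text_poke_bp: it does not
touch live code, does no cross-CPU synchronization between the steps, and has
no int3 handler. The function names and the addresses used in the usage note
are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define INT3_OPCODE	0xcc	/* 1-byte breakpoint */
#define CALL_OPCODE	0xe8	/* call rel32 */
#define CALL_INSN_SIZE	5	/* opcode + 32-bit displacement */

/*
 * Displacement for "call rel32" located at insn_addr and targeting target.
 * It is relative to the address of the next instruction.
 */
static int32_t call_rel32(uint64_t insn_addr, uint64_t target)
{
	int64_t disp = (int64_t)target - (int64_t)(insn_addr + CALL_INSN_SIZE);

	/* The trampoline has to be within +-2GB of the patched site. */
	assert(disp >= INT32_MIN && disp <= INT32_MAX);
	return (int32_t)disp;
}

/*
 * Rewrite a 5-byte nop in insn[] into a call, in the order described above.
 * On live code each step would need to be made visible to all CPUs before
 * the next one starts; here insn[] is just a buffer.
 */
static void patch_nop5_to_call(uint8_t insn[CALL_INSN_SIZE],
			       uint64_t insn_addr, uint64_t target)
{
	uint32_t rel = (uint32_t)call_rel32(insn_addr, target);

	/* Step 1: first byte becomes int3, so the half-patched insn traps. */
	insn[0] = INT3_OPCODE;

	/* Step 2: bytes 2-5 get the little-endian displacement. */
	insn[1] = (uint8_t)(rel & 0xff);
	insn[2] = (uint8_t)((rel >> 8) & 0xff);
	insn[3] = (uint8_t)((rel >> 16) & 0xff);
	insn[4] = (uint8_t)((rel >> 24) & 0xff);

	/* Step 3: first byte becomes the call opcode, completing the insn. */
	insn[0] = CALL_OPCODE;
}
```

For example, patching the nop5 bytes 0f 1f 44 00 00 placed at 0x1000 with a
trampoline at 0x2000 gives a displacement of 0x2000 - 0x1005 = 0xffb, so the
buffer ends up as e8 fb 0f 00 00.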