From: Jiri Olsa <olsajiri@gmail.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Jiri Olsa <olsajiri@gmail.com>,
Andrii Nakryiko <andrii@kernel.org>,
bpf@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net,
martin.lau@kernel.org, kernel-team@meta.com
Subject: Re: [PATCH v2 bpf-next 2/2] selftests/bpf: add fast mostly in-kernel BPF triggering benchmarks
Date: Sun, 17 Mar 2024 21:38:58 +0100 [thread overview]
Message-ID: <ZfdU4ordMYOGP0BQ@krava> (raw)
In-Reply-To: <CAEf4BzayNECKkmc4=XfLW5fzsPozMnjqOEmGO+r2UmEQXt1XyA@mail.gmail.com>
On Fri, Mar 15, 2024 at 03:22:29PM -0700, Andrii Nakryiko wrote:
> On Fri, Mar 15, 2024 at 2:10 AM Jiri Olsa <olsajiri@gmail.com> wrote:
> >
> > On Thu, Mar 14, 2024 at 10:18:13PM -0700, Andrii Nakryiko wrote:
> > > Existing kprobe/fentry triggering benchmarks have 1-to-1 mapping between
> > > one syscall execution and BPF program run. While we use a fast
> > > get_pgid() syscall, syscall overhead can still be non-trivial.
> > >
> > > This patch adds kprobe/fentry set of benchmarks significantly amortizing
> > > the cost of syscall vs actual BPF triggering overhead. We do this by
> > > employing BPF_PROG_TEST_RUN command to trigger "driver" raw_tp program
> > > which does a tight parameterized loop calling cheap BPF helper
> > > (bpf_get_smp_processor_id()), to which kprobe/fentry programs are
> > > attached for benchmarking.
> > >
> > > This way 1 bpf() syscall causes N executions of BPF program being
> > > benchmarked. N defaults to 100, but can be adjusted with
> > > --trig-batch-iters CLI argument.
> > >
> > > Results speak for themselves:
> > >
> > > $ ./run_bench_trigger.sh
> > > uprobe-base : 138.054 ą 0.556M/s
> > > base : 16.650 ą 0.123M/s
> > > tp : 11.068 ą 0.100M/s
> > > rawtp : 14.087 ą 0.511M/s
> > > kprobe : 9.641 ą 0.027M/s
> > > kprobe-multi : 10.263 ą 0.061M/s
> > > kretprobe : 5.475 ą 0.028M/s
> > > kretprobe-multi : 5.703 ą 0.036M/s
> > > fentry : 14.544 ą 0.112M/s
> > > fexit : 10.637 ą 0.073M/s
> > > fmodret : 11.357 ą 0.061M/s
> > > kprobe-fast : 14.286 ą 0.377M/s
> > > kprobe-multi-fast : 14.999 ą 0.204M/s
> > > kretprobe-fast : 7.646 ą 0.084M/s
> > > kretprobe-multi-fast: 4.354 ą 0.066M/s
> > > fentry-fast : 31.475 ą 0.254M/s
> > > fexit-fast : 17.379 ą 0.195M/s
> > >
> > > Note how xxx-fast variants are measured with significantly higher
> > > throughput, even though it's exactly the same in-kernel overhead:
> > >
> > > fentry : 14.544 ą 0.112M/s
> > > fentry-fast : 31.475 ą 0.254M/s
> > >
> > > kprobe-multi : 10.263 ą 0.061M/s
> > > kprobe-multi-fast : 14.999 ą 0.204M/s
> > >
> > > One huge and not yet explained deviation is a slowdown of
> > > kretprobe-multi, we should look into that separately.
> > >
> > > kretprobe : 5.475 ą 0.028M/s
> > > kretprobe-multi : 5.703 ą 0.036M/s
> > > kretprobe-fast : 7.646 ą 0.084M/s
> > > kretprobe-multi-fast: 4.354 ą 0.066M/s
> > >
> > > Kprobe cases don't seem to have this illogical slowdown:
> > >
> > > kprobe : 9.641 ą 0.027M/s
> > > kprobe-multi : 10.263 ą 0.061M/s
> > > kprobe-fast : 14.286 ą 0.377M/s
> > > kprobe-multi-fast : 14.999 ą 0.204M/s
> >
> > hum, I see that as well:
> >
> > uprobe-base : 230.624 ą 0.527M/s
> > base : 16.320 ą 0.087M/s
> > tp : 10.057 ą 0.122M/s
> > rawtp : 14.851 ą 0.300M/s
> > kprobe : 10.993 ą 0.104M/s
> > kprobe-multi : 11.053 ą 0.038M/s
> > kretprobe : 6.679 ą 0.015M/s
> > kretprobe-multi : 6.466 ą 0.015M/s
> > fentry : 14.949 ą 0.064M/s
> > fexit : 10.530 ą 1.275M/s
> > fmodret : 11.145 ą 0.245M/s
> > kprobe-fast : 20.080 ą 0.468M/s
> > kprobe-multi-fast : 17.603 ą 0.102M/s
> > kretprobe-fast : 9.943 ą 0.056M/s
> > kretprobe-multi-fast: 5.185 ą 0.022M/s
> > fentry-fast : 46.697 ą 0.260M/s
> > fexit-fast : 19.250 ą 0.108M/s
> >
> >
> > I even see decline in kprobe-multi-fast:
> >
> > kprobe-fast : 20.080 ą 0.468M/s
> > kprobe-multi-fast : 17.603 ą 0.102M/s
> >
> > kretprobe-fast : 9.943 ą 0.056M/s
> > kretprobe-multi-fast: 5.185 ą 0.022M/s
> >
> >
> > I've got some IBT related code showing up in the perf profile
> > and when I disabled it I got better results for kprobe-multi-fast
> > but kretprobe-multi-fast is still bad
>
> ok, I think it's a self-inflicted false alarm. Note how this benchmark
> is using bpf_get_smp_processor_id() as an attach point? Well, guess
> what, each triggered program is then calling
> bpf_get_smp_processor_id() again to get the per-CPU counter to
> increment. So this benchmark was effectively heavily testing recursion
> protection, and I guess kretprobe hurts the most.
heh, nice :-) that explains the bpf_prog_inc_misses_counter hits
I see in perf profile
1.46% bench [kernel.vmlinux] [k] bpf_prog_inc_misses_counter
|
---bpf_prog_inc_misses_counter
kprobe_multi_link_prog_run.isra.0
fprobe_exit_handler
rethook_trampoline_handler
arch_rethook_trampoline_callback
arch_rethook_trampoline
bpf_prog_7790468e40a289ea_bench_trigger_kretprobe_multi_fast
kprobe_multi_link_prog_run.isra.0
fprobe_exit_handler
rethook_trampoline_handler
arch_rethook_trampoline_callback
arch_rethook_trampoline
bpf_prog_db14b3b08bc9e952_trigger_driver
__bpf_prog_test_run_raw_tp
bpf_prog_test_run_raw_tp
__sys_bpf
__x64_sys_bpf
do_syscall_64
entry_SYSCALL_64
syscall
start_thread
clone3
jirka
>
> I realized that when I moved to bpf_get_numa_node_id() and seeing much
> higher numbers:
>
> BEFORE
> ======
> $ benchs/run_bench_trigger.sh kprobe-batch fentry-batch
> kprobe-batch : 14.018 ± 0.333M/s
> fentry-batch : 31.470 ± 0.791M/s
>
> AFTER
> =====
> $ benchs/run_bench_trigger.sh kprobe-batch fentry-batch
> kprobe-batch : 19.391 ± 0.868M/s
> fentry-batch : 48.785 ± 0.932M/s
>
> That's when I realized something is off here. Now results make a bit
> more sense (machine on which I test is different, so absolute values
> differ from before):
>
> $ benchs/run_bench_trigger.sh kprobe kprobe-multi kprobe-batch
> kprobe-multi-batch kretprobe kretprobe-multi kretprobe-batch
> kretprobe-multi-batch
> kprobe : 8.114 ± 0.199M/s
> kprobe-multi : 8.739 ± 0.052M/s
> kprobe-batch : 19.379 ± 0.709M/s
> kprobe-multi-batch : 24.969 ± 0.226M/s
> kretprobe : 4.859 ± 0.027M/s
> kretprobe-multi : 4.940 ± 0.083M/s
> kretprobe-batch : 8.460 ± 0.044M/s
> kretprobe-multi-batch: 8.761 ± 0.282M/s
>
>
> >
> >
> > uprobe-base : 234.024 ą 0.225M/s
> > base : 16.383 ą 0.029M/s
> > tp : 9.973 ą 0.017M/s
> > rawtp : 14.889 ą 0.047M/s
> > kprobe : 10.970 ą 0.011M/s
> > kprobe-multi : 11.640 ą 0.009M/s
> > kretprobe : 6.667 ą 0.005M/s
> > kretprobe-multi : 6.704 ą 0.005M/s
> > fentry : 14.968 ą 0.024M/s
> > fexit : 11.860 ą 0.012M/s
> > fmodret : 12.656 ą 0.036M/s
> > kprobe-fast : 20.340 ą 0.043M/s
> > kprobe-multi-fast : 21.203 ą 0.019M/s
> > kretprobe-fast : 9.956 ą 0.021M/s
> > kretprobe-multi-fast: 5.611 ą 0.006M/s
> > fentry-fast : 46.825 ą 0.041M/s
> > fexit-fast : 19.746 ą 0.024M/s
> >
> > slightly better:
> >
> > kprobe-fast : 20.340 ą 0.043M/s
> > kprobe-multi-fast : 21.203 ą 0.019M/s
> >
> > still almost half perf:
> > kretprobe-fast : 9.956 ą 0.021M/s
> > kretprobe-multi-fast: 5.611 ą 0.006M/s
> >
> > jirka
> >
> > >
> > > Cc: Jiri Olsa <jolsa@kernel.org>
> > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > > ---
> > > tools/testing/selftests/bpf/bench.c | 18 +++
> > > .../selftests/bpf/benchs/bench_trigger.c | 123 +++++++++++++++++-
> > > .../selftests/bpf/benchs/run_bench_trigger.sh | 8 +-
> > > .../selftests/bpf/progs/trigger_bench.c | 56 +++++++-
> > > 4 files changed, 201 insertions(+), 4 deletions(-)
> > >
>
> [...]
next prev parent reply other threads:[~2024-03-17 20:39 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-03-15 5:18 [PATCH v2 bpf-next 1/2] selftests/bpf: scale benchmark counting by using per-CPU counters Andrii Nakryiko
2024-03-15 5:18 ` [PATCH v2 bpf-next 2/2] selftests/bpf: add fast mostly in-kernel BPF triggering benchmarks Andrii Nakryiko
2024-03-15 9:10 ` Jiri Olsa
2024-03-15 22:22 ` Andrii Nakryiko
2024-03-17 20:38 ` Jiri Olsa [this message]
2024-03-15 16:03 ` Alexei Starovoitov
2024-03-15 16:31 ` Andrii Nakryiko
2024-03-15 16:58 ` Andrii Nakryiko
2024-03-15 18:47 ` Alexei Starovoitov
2024-03-15 21:07 ` Andrii Nakryiko
2024-03-15 16:08 ` [PATCH v2 bpf-next 1/2] selftests/bpf: scale benchmark counting by using per-CPU counters Alexei Starovoitov
2024-03-15 16:22 ` Andrii Nakryiko
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZfdU4ordMYOGP0BQ@krava \
--to=olsajiri@gmail.com \
--cc=andrii.nakryiko@gmail.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=kernel-team@meta.com \
--cc=martin.lau@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.