On Mon, 29 Apr 2024 13:25:04 -0700
Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:

> On Mon, Apr 29, 2024 at 6:51 AM Masami Hiramatsu <mhiramat@kernel.org> wrote:
> >
> > Hi Andrii,
> >
> > On Thu, 25 Apr 2024 13:31:53 -0700
> > Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
> >
> > > Hey Masami,
> > >
> > > I can't really review most of that code as I'm completely unfamiliar
> > > with all those inner workings of fprobe/ftrace/function_graph. I left
> > > a few comments where there were somewhat more obvious BPF-related
> > > pieces.
> > >
> > > But I also did run our BPF benchmarks on probes/for-next as a baseline
> > > and then with your series applied on top. Just to see if there are any
> > > regressions. I think it will be a useful data point for you.
> >
> > Thanks for testing!
> >
> > >
> > > You should be already familiar with the bench tool we have in BPF
> > > selftests (I used it on some other patches for your tree).
> >
> > What patches we need?
> >
> 
> You mean for this `bench` tool? They are part of BPF selftests (under
> tools/testing/selftests/bpf), you can build them by running:
> 
> $ make RELEASE=1 -j$(nproc) bench
> 
> After that you'll get a self-container `bench` binary, which has all
> the self-contained benchmarks.
> 
> You might also find a small script (benchs/run_bench_trigger.sh inside
> BPF selftests directory) helpful, it collects final summary of the
> benchmark run and optionally accepts a specific set of benchmarks. So
> you can use it like this:
> 
> $ benchs/run_bench_trigger.sh kprobe kprobe-multi
> kprobe         :   18.731 ± 0.639M/s
> kprobe-multi   :   23.938 ± 0.612M/s
> 
> By default it will run a wider set of benchmarks (no uprobes, but a
> bunch of extra fentry/fexit tests and stuff like this).

origin:
# benchs/run_bench_trigger.sh 
kretprobe :    1.329 ± 0.007M/s 
kretprobe-multi:    1.341 ± 0.004M/s 
# benchs/run_bench_trigger.sh 
kretprobe :    1.288 ± 0.014M/s 
kretprobe-multi:    1.365 ± 0.002M/s 
# benchs/run_bench_trigger.sh 
kretprobe :    1.329 ± 0.002M/s 
kretprobe-multi:    1.331 ± 0.011M/s 
# benchs/run_bench_trigger.sh 
kretprobe :    1.311 ± 0.003M/s 
kretprobe-multi:    1.318 ± 0.002M/s s

patched: 

# benchs/run_bench_trigger.sh
kretprobe :    1.274 ± 0.003M/s 
kretprobe-multi:    1.397 ± 0.002M/s 
# benchs/run_bench_trigger.sh
kretprobe :    1.307 ± 0.002M/s 
kretprobe-multi:    1.406 ± 0.004M/s 
# benchs/run_bench_trigger.sh
kretprobe :    1.279 ± 0.004M/s 
kretprobe-multi:    1.330 ± 0.014M/s 
# benchs/run_bench_trigger.sh
kretprobe :    1.256 ± 0.010M/s 
kretprobe-multi:    1.412 ± 0.003M/s 

Hmm, in my case, it seems smaller differences (~3%?).
I attached perf report results for those, but I don't see large difference.

> > >
> > > BASELINE
> > > ========
> > > kprobe         :   24.634 ± 0.205M/s
> > > kprobe-multi   :   28.898 ± 0.531M/s
> > > kretprobe      :   10.478 ± 0.015M/s
> > > kretprobe-multi:   11.012 ± 0.063M/s
> > >
> > > THIS PATCH SET ON TOP
> > > =====================
> > > kprobe         :   25.144 ± 0.027M/s (+2%)
> > > kprobe-multi   :   28.909 ± 0.074M/s
> > > kretprobe      :    9.482 ± 0.008M/s (-9.5%)
> > > kretprobe-multi:   13.688 ± 0.027M/s (+24%)
> >
> > This looks good. Kretprobe should also use kretprobe-multi (fprobe)
> > eventually because it should be a single callback version of
> > kretprobe-multi.

I ran another benchmark (prctl loop, attached), the origin kernel result is here;

# sh ./benchmark.sh 
count = 10000000, took 6.748133 sec

And the patched kernel result;

# sh ./benchmark.sh 
count = 10000000, took 6.644095 sec

I confirmed that the parf result has no big difference.

Thank you,


> >
> > >
> > > These numbers are pretty stable and look to be more or less representative.
> > >
> > > As you can see, kprobes got a bit faster, kprobe-multi seems to be
> > > about the same, though.
> > >
> > > Then (I suppose they are "legacy") kretprobes got quite noticeably
> > > slower, almost by 10%. Not sure why, but looks real after re-running
> > > benchmarks a bunch of times and getting stable results.
> >
> > Hmm, kretprobe on x86 should use ftrace + rethook even with my series.
> > So nothing should be changed. Maybe cache access pattern has been
> > changed?
> > I'll check it with tracefs (to remove the effect from bpf related changes)
> >
> > >
> > > On the other hand, multi-kretprobes got significantly faster (+24%!).
> > > Again, I don't know if it is expected or not, but it's a nice
> > > improvement.
> >
> > Thanks!
> >
> > >
> > > If you have any idea why kretprobes would get so much slower, it would
> > > be nice to look into that and see if you can mitigate the regression
> > > somehow. Thanks!
> >
> > OK, let me check it.
> >
> > Thank you!
> >
> > >
> > >
> > > >  51 files changed, 2325 insertions(+), 882 deletions(-)
> > > >  create mode 100644 tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_repeat.tc
> > > >
> > > > --
> > > > Masami Hiramatsu (Google) <mhiramat@kernel.org>
> > > >
> >
> >
> > --
> > Masami Hiramatsu (Google) <mhiramat@kernel.org>


-- 
Masami Hiramatsu (Google) <mhiramat@kernel.org>