BPF List
From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Joanne Koong <joannekoong@fb.com>, bpf <bpf@vger.kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Kernel Team <Kernel-team@fb.com>
Subject: Re: [PATCH v2 bpf-next 4/4] selftest/bpf/benchs: add bpf_loop benchmark
Date: Wed, 24 Nov 2021 22:59:10 +0100
Message-ID: <87lf1db4gh.fsf@toke.dk>
In-Reply-To: <CAEf4BzbB6utDjOJLZzwbBEoAgdO774=PX8O9dWeZJRzM2kdxaQ@mail.gmail.com>

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Wed, Nov 24, 2021 at 4:56 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Joanne Koong <joannekoong@fb.com> writes:
>>
>> > On 11/23/21 11:19 AM, Toke Høiland-Jørgensen wrote:
>> >
>> >> Joanne Koong <joannekoong@fb.com> writes:
>> >>
>> >>> Add benchmark to measure the throughput and latency of the bpf_loop
>> >>> call.
>> >>>
>> >>> Testing this on qemu on my dev machine on 1 thread, the data is
>> >>> as follows:
>> >>>
>> >>>          nr_loops: 1
>> >>> bpf_loop - throughput: 43.350 ± 0.864 M ops/s, latency: 23.068 ns/op
>> >>>
>> >>>          nr_loops: 10
>> >>> bpf_loop - throughput: 69.586 ± 1.722 M ops/s, latency: 14.371 ns/op
>> >>>
>> >>>          nr_loops: 100
>> >>> bpf_loop - throughput: 72.046 ± 1.352 M ops/s, latency: 13.880 ns/op
>> >>>
>> >>>          nr_loops: 500
>> >>> bpf_loop - throughput: 71.677 ± 1.316 M ops/s, latency: 13.951 ns/op
>> >>>
>> >>>          nr_loops: 1000
>> >>> bpf_loop - throughput: 69.435 ± 1.219 M ops/s, latency: 14.402 ns/op
>> >>>
>> >>>          nr_loops: 5000
>> >>> bpf_loop - throughput: 72.624 ± 1.162 M ops/s, latency: 13.770 ns/op
>> >>>
>> >>>          nr_loops: 10000
>> >>> bpf_loop - throughput: 75.417 ± 1.446 M ops/s, latency: 13.260 ns/op
>> >>>
>> >>>          nr_loops: 50000
>> >>> bpf_loop - throughput: 77.400 ± 2.214 M ops/s, latency: 12.920 ns/op
>> >>>
>> >>>          nr_loops: 100000
>> >>> bpf_loop - throughput: 78.636 ± 2.107 M ops/s, latency: 12.717 ns/op
>> >>>
>> >>>          nr_loops: 500000
>> >>> bpf_loop - throughput: 76.909 ± 2.035 M ops/s, latency: 13.002 ns/op
>> >>>
>> >>>          nr_loops: 1000000
>> >>> bpf_loop - throughput: 77.636 ± 1.748 M ops/s, latency: 12.881 ns/op
>> >>>
>> >>> From this data, we can see that the latency per loop decreases as the
>> >>> number of loops increases. On this particular machine, each loop had an
>> >>> overhead of about 13 ns, and we were able to run ~70 million loops
>> >>> per second.
>> >> The latency figures are great, thanks! I assume these numbers are with
>> >> retpolines enabled? Otherwise 12ns seems a bit much... Or is this
>> >> because of qemu?
>> > I just tested it on a machine (without retpoline enabled) that runs on
>> > actual hardware and here is what I found:
>> >
>> >              nr_loops: 1
>> >      bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
>> >
>> >              nr_loops: 10
>> >      bpf_loop - throughput: 198.519 ± 0.155 M ops/s, latency: 5.037 ns/op
>> >
>> >              nr_loops: 100
>> >      bpf_loop - throughput: 247.448 ± 0.305 M ops/s, latency: 4.041 ns/op
>> >
>> >              nr_loops: 500
>> >      bpf_loop - throughput: 260.839 ± 0.380 M ops/s, latency: 3.834 ns/op
>> >
>> >              nr_loops: 1000
>> >      bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
>> >
>> >              nr_loops: 5000
>> >      bpf_loop - throughput: 264.211 ± 1.508 M ops/s, latency: 3.785 ns/op
>> >
>> >              nr_loops: 10000
>> >      bpf_loop - throughput: 265.366 ± 3.054 M ops/s, latency: 3.768 ns/op
>> >
>> >              nr_loops: 50000
>> >      bpf_loop - throughput: 235.986 ± 20.205 M ops/s, latency: 4.238 ns/op
>> >
>> >              nr_loops: 100000
>> >      bpf_loop - throughput: 264.482 ± 0.279 M ops/s, latency: 3.781 ns/op
>> >
>> >              nr_loops: 500000
>> >      bpf_loop - throughput: 309.773 ± 87.713 M ops/s, latency: 3.228 ns/op
>> >
>> >              nr_loops: 1000000
>> >      bpf_loop - throughput: 262.818 ± 4.143 M ops/s, latency: 3.805 ns/op
>> >
>> > The latency is about 4 ns per loop.
>> >
>> > I will update the commit message in v3 with these new numbers as well.
>>
>> Right, awesome, thank you for the additional test. This is closer to
>> what I would expect: on the hardware I'm usually testing on, a function
>> call takes ~1.5ns, but the difference might just be the hardware, or
>> because these are indirect calls.
>>
>> Another comparison just occurred to me (but it's totally OK if you don't
>> want to add any more benchmarks):
>>
>> The difference between a program that does:
>>
>> bpf_loop(nr_loops, empty_callback, NULL, 0);
>>
>> and
>>
>> for (i = 0; i < nr_loops; i++)
>>   empty_callback();
>
> You are basically trying to measure the overhead of bpf_loop() helper
> call itself, because other than that it should be identical.

No, I'm trying to measure the difference between the indirect call in
the helper, and the direct call from the BPF program. Should be minor
without retpolines, and somewhat higher where they are enabled...
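
As a rough userspace analogy of the two variants being compared (this is
not BPF code and not the kernel's helper), the sketch below contrasts a
helper that invokes its callback through a function pointer with a loop
that calls the same function directly. The names loop_cb_t,
loop_indirect, loop_direct and count_cb are hypothetical, and the
callback increments a counter instead of being empty so that its effect
can be observed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Callback shape modelled on the bpf_loop callback:
 * return 0 to continue, nonzero to stop early. */
typedef long (*loop_cb_t)(uint32_t index, void *ctx);

/* Hypothetical stand-in for the helper: every iteration is an
 * indirect call through cb. Returns the number of iterations run. */
static long loop_indirect(uint32_t nr_loops, loop_cb_t cb, void *ctx)
{
	uint32_t i;

	assert(cb != NULL);
	for (i = 0; i < nr_loops; i++) {
		if (cb(i, ctx))
			return (long)i + 1;
	}
	return (long)i;
}

/* Illustrative callback: counts how often it was invoked. */
static long count_cb(uint32_t index, void *ctx)
{
	(void)index;
	*(uint64_t *)ctx += 1;
	return 0;
}

/* Same loop, but the callee is known at compile time (direct call). */
static long loop_direct(uint32_t nr_loops, void *ctx)
{
	uint32_t i;

	for (i = 0; i < nr_loops; i++) {
		if (count_cb(i, ctx))
			return (long)i + 1;
	}
	return (long)i;
}
```

Both functions do the same work, so any timing difference between them
comes from the call mechanism. An optimizing compiler may inline or
devirtualize either variant, so an actual measurement would need to
prevent that, for example by marking the callback noinline.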

> We can estimate that already from the numbers Joanne posted above:
>
>              nr_loops: 1
>       bpf_loop - throughput: 46.780 ± 0.064 M ops/s, latency: 21.377 ns/op
>              nr_loops: 1000
>       bpf_loop - throughput: 262.806 ± 0.629 M ops/s, latency: 3.805 ns/op
>
> nr_loops:1 is bpf_loop() overhead and one static callback call.
> bpf_loop()'s own overhead will be in the ballpark of 21.4 - 3.8 =
> 17.6ns. I don't think we need yet another benchmark just for this.

That seems really high, though? The helper is a pretty simple function,
and the call to it should just be JIT'ed into a single regular function
call, right? So why the order-of-magnitude difference?

-Toke
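
The estimate quoted above can be written as a two-point fit. This is a
sketch under the assumption that one bpf_loop() invocation with n
iterations costs C + n*c nanoseconds and that the benchmark reports
latency per iteration, i.e. C/n + c. The struct and function names are
hypothetical; the input values are the ones from the quoted hardware run.

```c
#include <assert.h>

/* Assumed cost model: one invocation with n iterations takes
 * fixed_ns + n * per_iter_ns. */
struct cost_model {
	double fixed_ns;    /* C: paid once per invocation */
	double per_iter_ns; /* c: paid once per callback iteration */
};

/* Convert throughput in M ops/s to latency in ns/op. */
static double latency_ns_from_mops(double mops)
{
	assert(mops > 0.0);
	return 1000.0 / mops;
}

/* Solve lat1 = C/n1 + c and lat2 = C/n2 + c for C and c. */
static struct cost_model fit_cost_model(double n1, double lat1,
					double n2, double lat2)
{
	struct cost_model m;

	assert(n1 > 0.0 && n2 > 0.0 && n1 != n2);
	m.fixed_ns = (lat1 - lat2) / (1.0 / n1 - 1.0 / n2);
	m.per_iter_ns = lat1 - m.fixed_ns / n1;
	return m;
}
```

Calling fit_cost_model(1, 21.377, 1000, 3.805) gives a fixed cost of
about 17.59 ns and a per-iteration cost of about 3.79 ns, consistent
with the 17.6 ns ballpark. Note that under this model the fixed term
lumps together everything that is paid once per invocation in the
benchmark, not necessarily the helper call alone.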


Thread overview: 15+ messages
2021-11-23 18:34 [PATCH v2 bpf-next 0/4] Add bpf_loop_helper Joanne Koong
2021-11-23 18:34 ` [PATCH v2 bpf-next 1/4] bpf: Add bpf_loop helper Joanne Koong
2021-11-23 22:46   ` Andrii Nakryiko
2021-11-23 18:34 ` [PATCH v2 bpf-next 2/4] selftests/bpf: Add bpf_loop test Joanne Koong
2021-11-23 18:34 ` [PATCH v2 bpf-next 3/4] selftests/bpf: measure bpf_loop verifier performance Joanne Koong
2021-11-23 18:34 ` [PATCH v2 bpf-next 4/4] selftest/bpf/benchs: add bpf_loop benchmark Joanne Koong
2021-11-23 19:19   ` Toke Høiland-Jørgensen
2021-11-24  0:20     ` Joanne Koong
2021-11-24 12:56       ` Toke Høiland-Jørgensen
2021-11-24 19:26         ` Andrii Nakryiko
2021-11-24 21:59           ` Toke Høiland-Jørgensen [this message]
2021-11-25  0:04             ` Joanne Koong
2021-11-25 11:35               ` Toke Høiland-Jørgensen
2021-11-29 19:41                 ` Joanne Koong
2021-11-23 18:47 ` [PATCH v2 bpf-next 0/4] Add bpf_loop_helper Joanne Koong
