All of lore.kernel.org
 help / color / mirror / Atom feed
From: Vadim Fedorenko <vadim.fedorenko@linux.dev>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>,
	Vadim Fedorenko <vadfed@meta.com>
Cc: Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Eduard Zingerman <eddyz87@gmail.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Mykola Lysenko <mykolal@fb.com>, Jakub Kicinski <kuba@kernel.org>,
	x86@kernel.org, bpf@vger.kernel.org,
	Martin KaFai Lau <martin.lau@linux.dev>
Subject: Re: [PATCH bpf-next v5 1/4] bpf: add bpf_get_cpu_cycles kfunc
Date: Tue, 12 Nov 2024 21:43:01 +0000	[thread overview]
Message-ID: <946c61c2-ab1e-43cb-adfd-cc5b7716b915@linux.dev> (raw)
In-Reply-To: <CAEf4BzaRb+fUK17wrj4sWnYM5oKxTvwZC=U-GjvsdUtF94PqrA@mail.gmail.com>

On 12/11/2024 05:50, Andrii Nakryiko wrote:
> On Fri, Nov 8, 2024 at 4:42 PM Vadim Fedorenko <vadfed@meta.com> wrote:
>>
>> New kfunc to return ARCH-specific timecounter. For x86 BPF JIT converts
>> it into rdtsc ordered call. Other architectures will get JIT
>> implementation too if supported. The fallback is to
>> __arch_get_hw_counter().
>>
>> Signed-off-by: Vadim Fedorenko <vadfed@meta.com>
>> ---
> 
> nit: please add cover letter for the next revision, multi-patch sets
> generally should come with a cover letter, unless it's some set of
> trivial and mostly independent patches. Anyways...

Yeah, sure. This series has grown from the first small version...

> 
> I haven't yet looked through the code (yet), but I was curious to
> benchmark the perf benefit, so that's what I did for fun this evening.
> 
> (!!!) BTW, a complete aside, but I think it's interesting. It turned
> out that using bpf_test_prog_run() scales *VERY POORLY* with large
> number of CPUs, because we start spending tons of time in
> fdget()/fdput(), so I initially got pretty unscalable results,
> profiled a bit, and then switched to just doing
> syscall(syscall(__NR_getpgid); + SEC("raw_tp/sys_enter")). Anyways,
> the point is that microbenchmarking is tricky and we need to improve
> our existing bench setup for some benchmarks. Anyways, getting back to
> the main topic.
> 
> I wrote a quick two benchmarks testing what I see as intended use case
> for these kfuncs (batching amortizes the cost of triggering BPF
> program, batch_iters = 500 in my case):
> 
> SEC("?raw_tp/sys_enter")
> int trigger_driver_ktime(void *ctx)
> {
>          volatile __u64 total = 0;
>          int i;
> 
>          for (i = 0; i < batch_iters; i++) {
>                  __u64 start, end;
> 
>                  start = bpf_ktime_get_ns();
>                  end = bpf_ktime_get_ns();
>                  total += end - start;
>          }
>          inc_counter_batch(batch_iters);
> 
>          return 0;
> }
> 
> extern __u64 bpf_get_cpu_cycles(void) __weak __ksym;
> extern __u64 bpf_cpu_cycles_to_ns(__u64 cycles) __weak __ksym;
> 
> SEC("?raw_tp/sys_enter")
> int trigger_driver_cycles(void *ctx)
> {
>          volatile __u64 total = 0;
>          int i;
> 
>          for (i = 0; i < batch_iters; i++) {
>                  __u64 start, end;
> 
>                  start = bpf_get_cpu_cycles();
>                  end = bpf_get_cpu_cycles();
>                  total += bpf_cpu_cycles_to_ns(end - start);
>          }
>          inc_counter_batch(batch_iters);
> 
>          return 0;
> }
> 
> And here's what I got across multiple numbers of parallel CPUs on our
> production host.
> 
> # ./bench_timing.sh
> 
> ktime                 ( 1 cpus):   32.286 ± 0.309M/s  ( 32.286M/s/cpu)
> ktime                 ( 2 cpus):   63.021 ± 0.538M/s  ( 31.511M/s/cpu)
> ktime                 ( 3 cpus):   94.211 ± 0.686M/s  ( 31.404M/s/cpu)
> ktime                 ( 4 cpus):  124.757 ± 0.691M/s  ( 31.189M/s/cpu)
> ktime                 ( 5 cpus):  154.855 ± 0.693M/s  ( 30.971M/s/cpu)
> ktime                 ( 6 cpus):  185.551 ± 2.304M/s  ( 30.925M/s/cpu)
> ktime                 ( 7 cpus):  211.117 ± 4.755M/s  ( 30.160M/s/cpu)
> ktime                 ( 8 cpus):  236.454 ± 0.226M/s  ( 29.557M/s/cpu)
> ktime                 (10 cpus):  295.526 ± 0.126M/s  ( 29.553M/s/cpu)
> ktime                 (12 cpus):  322.282 ± 0.153M/s  ( 26.857M/s/cpu)
> ktime                 (14 cpus):  375.347 ± 0.087M/s  ( 26.811M/s/cpu)
> ktime                 (16 cpus):  399.813 ± 0.181M/s  ( 24.988M/s/cpu)
> ktime                 (24 cpus):  617.675 ± 7.053M/s  ( 25.736M/s/cpu)
> ktime                 (32 cpus):  819.695 ± 0.231M/s  ( 25.615M/s/cpu)
> ktime                 (40 cpus):  996.264 ± 0.290M/s  ( 24.907M/s/cpu)
> ktime                 (48 cpus): 1180.201 ± 0.160M/s  ( 24.588M/s/cpu)
> ktime                 (56 cpus): 1321.084 ± 0.099M/s  ( 23.591M/s/cpu)
> ktime                 (64 cpus): 1482.061 ± 0.121M/s  ( 23.157M/s/cpu)
> ktime                 (72 cpus): 1666.540 ± 0.460M/s  ( 23.146M/s/cpu)
> ktime                 (80 cpus): 1851.419 ± 0.439M/s  ( 23.143M/s/cpu)
> 
> cycles                ( 1 cpus):   45.815 ± 0.018M/s  ( 45.815M/s/cpu)
> cycles                ( 2 cpus):   86.706 ± 0.068M/s  ( 43.353M/s/cpu)
> cycles                ( 3 cpus):  129.899 ± 0.101M/s  ( 43.300M/s/cpu)
> cycles                ( 4 cpus):  168.435 ± 0.073M/s  ( 42.109M/s/cpu)
> cycles                ( 5 cpus):  210.520 ± 0.164M/s  ( 42.104M/s/cpu)
> cycles                ( 6 cpus):  252.596 ± 0.050M/s  ( 42.099M/s/cpu)
> cycles                ( 7 cpus):  294.356 ± 0.159M/s  ( 42.051M/s/cpu)
> cycles                ( 8 cpus):  317.167 ± 0.163M/s  ( 39.646M/s/cpu)
> cycles                (10 cpus):  396.141 ± 0.208M/s  ( 39.614M/s/cpu)
> cycles                (12 cpus):  431.938 ± 0.511M/s  ( 35.995M/s/cpu)
> cycles                (14 cpus):  503.055 ± 0.070M/s  ( 35.932M/s/cpu)
> cycles                (16 cpus):  534.261 ± 0.107M/s  ( 33.391M/s/cpu)
> cycles                (24 cpus):  836.838 ± 0.141M/s  ( 34.868M/s/cpu)
> cycles                (32 cpus): 1099.689 ± 0.314M/s  ( 34.365M/s/cpu)
> cycles                (40 cpus): 1336.573 ± 0.015M/s  ( 33.414M/s/cpu)
> cycles                (48 cpus): 1571.734 ± 11.151M/s  ( 32.744M/s/cpu)
> cycles                (56 cpus): 1819.242 ± 4.627M/s  ( 32.486M/s/cpu)
> cycles                (64 cpus): 2046.285 ± 5.169M/s  ( 31.973M/s/cpu)
> cycles                (72 cpus): 2287.683 ± 0.787M/s  ( 31.773M/s/cpu)
> cycles                (80 cpus): 2505.414 ± 0.626M/s  ( 31.318M/s/cpu)
> 
> So, from about +42% on a single CPU, to +36% at 80 CPUs. Not that bad.
> Scalability-wise, we still see some drop off in performance, but
> believe me, with bpf_prog_test_run() it was so much worse :) I also
> verified that now we spend cycles almost exclusively inside the BPF
> program (so presumably in those benchmarked kfuncs).

Am I right that the numbers show how many iterations were done during
the very same amount of time? It would be also great to understand if
we get more precise measurements - just in case you have your tests
ready...


  reply	other threads:[~2024-11-12 21:43 UTC|newest]

Thread overview: 23+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-09  0:41 [PATCH bpf-next v5 1/4] bpf: add bpf_get_cpu_cycles kfunc Vadim Fedorenko
2024-11-09  0:41 ` [PATCH bpf-next v5 2/4] bpf: add bpf_cpu_cycles_to_ns helper Vadim Fedorenko
2024-11-12 23:03   ` Eduard Zingerman
2024-11-09  0:41 ` [PATCH bpf-next v5 3/4] selftests/bpf: add selftest to check rdtsc jit Vadim Fedorenko
2024-11-12 23:17   ` Eduard Zingerman
2024-11-09  0:41 ` [PATCH bpf-next v5 4/4] selftests/bpf: add usage example for cpu cycles kfuncs Vadim Fedorenko
2024-11-12  5:50 ` [PATCH bpf-next v5 1/4] bpf: add bpf_get_cpu_cycles kfunc Andrii Nakryiko
2024-11-12 21:43   ` Vadim Fedorenko [this message]
2024-11-12 23:59     ` Andrii Nakryiko
2024-11-12 21:21 ` Eduard Zingerman
2024-11-12 21:39   ` Vadim Fedorenko
2024-11-12 21:53     ` Eduard Zingerman
2024-11-12 22:19       ` Eduard Zingerman
2024-11-12 22:27         ` Alexei Starovoitov
2024-11-12 23:08           ` Vadim Fedorenko
2024-11-13  0:09             ` Alexei Starovoitov
2024-11-13  0:20               ` Vadim Fedorenko
2024-11-13 17:38 ` Yonghong Song
2024-11-13 17:52   ` Vadim Fedorenko
2024-11-13 18:42     ` Yonghong Song
2024-11-13 22:28       ` Vadim Fedorenko
2024-11-13 23:02         ` Yonghong Song
2024-11-14  1:05           ` Vadim Fedorenko

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=946c61c2-ab1e-43cb-adfd-cc5b7716b915@linux.dev \
    --to=vadim.fedorenko@linux.dev \
    --cc=andrii.nakryiko@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=eddyz87@gmail.com \
    --cc=kuba@kernel.org \
    --cc=martin.lau@linux.dev \
    --cc=mykolal@fb.com \
    --cc=tglx@linutronix.de \
    --cc=vadfed@meta.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.