BPF List
From: Yonghong Song <yonghong.song@linux.dev>
To: bpf@vger.kernel.org
Cc: Alexei Starovoitov <ast@kernel.org>,
	Andrii Nakryiko <andrii@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	kernel-team@fb.com, Martin KaFai Lau <martin.lau@kernel.org>
Subject: Re: [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack
Date: Thu, 18 Jul 2024 14:44:41 -0700	[thread overview]
Message-ID: <1297da19-18a7-4727-8dab-e45ef0651e14@linux.dev> (raw)
In-Reply-To: <20240718205203.3652080-1-yonghong.song@linux.dev>


On 7/18/24 1:52 PM, Yonghong Song wrote:
> This patch intends to show some benchmark results comparing a bpf
> program with vs. without private stack. The patch is not intended
> to land since it hacks existing kernel interface in order to
> do proper comparison. The bpf program is similar to
> 7df4e597ea2c ("selftests/bpf: add batched, mostly in-kernel BPF triggering benchmarks")
> where a raw_tp program is triggered with bpf_prog_test_run_opts() and
> the raw_tp program has a loop calling the helper bpf_get_numa_node_id(), which
> triggers a fentry prog to run. The fentry prog calls three
> do-nothing subprogs to maximally expose the cost of the private stack.
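For reference, a minimal sketch of what the bpf side described above might look like. This is a reconstruction from the description and the jit dumps, not the actual progs/private_stack.c; identifiers such as `hits`, `trigger_driver`, and the subprog names are assumptions.

```c
// Sketch only: modeled on the description above; the real
// progs/private_stack.c may differ in names and details.
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

long hits = 0;

static __noinline int subprog1(void) { return 0; }
static __noinline int subprog2(void) { return 0; }
static __noinline int subprog3(void) { return 0; }

/* Driven by bpf_prog_test_run_opts(); each bpf_get_numa_node_id()
 * call fires the fentry prog below.
 */
SEC("raw_tp")
int trigger_driver(void *ctx)
{
	int i;

	for (i = 0; i < 4096; i++)	/* --nr-batch-iters */
		(void)bpf_get_numa_node_id();
	return 0;
}

/* Bumps a counter and calls three do-nothing subprogs, matching the
 * "main prog" jit dump below (counter increment + three calls).
 */
SEC("fentry/bpf_get_numa_node_id")
int func1(void *ctx)
{
	hits++;
	subprog1();
	subprog2();
	subprog3();
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```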
>
> The following is the jited code for bpf prog in progs/private_stack.c
> without private stack. The number of batch iterations is 4096.
>
> subprog:
> 0:  f3 0f 1e fa             endbr64
> 4:  0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
> 9:  66 90                   xchg   ax,ax
> b:  55                      push   rbp
> c:  48 89 e5                mov    rbp,rsp
> f:  f3 0f 1e fa             endbr64
> 13: 31 c0                   xor    eax,eax
> 15: c9                      leave
> 16: c3                      ret
>
> main prog:
> 0:  f3 0f 1e fa             endbr64
> 4:  0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
> 9:  66 90                   xchg   ax,ax
> b:  55                      push   rbp
> c:  48 89 e5                mov    rbp,rsp
> f:  f3 0f 1e fa             endbr64
> 13: 48 bf 00 e0 57 00 00    movabs rdi,0xffffc9000057e000
> 1a: c9 ff ff
> 1d: 48 8b 77 00             mov    rsi,QWORD PTR [rdi+0x0]
> 21: 48 83 c6 01             add    rsi,0x1
> 25: 48 89 77 00             mov    QWORD PTR [rdi+0x0],rsi
> 29: e8 6e 00 00 00          call   0x9c
> 2e: e8 69 00 00 00          call   0x9c
> 33: e8 64 00 00 00          call   0x9c
> 38: 31 c0                   xor    eax,eax
> 3a: c9                      leave
> 3b: c3                      ret
>
> The following are the jited progs with private stack:
>
> subprog:
> 0:  f3 0f 1e fa             endbr64
> 4:  0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
> 9:  66 90                   xchg   ax,ax
> b:  55                      push   rbp
> c:  48 89 e5                mov    rbp,rsp
> f:  f3 0f 1e fa             endbr64
> 13: 49 b9 70 a6 c1 08 7e    movabs r9,0x607e08c1a670
> 1a: 60 00 00
> 1d: 65 4c 03 0c 25 00 1a    add    r9,QWORD PTR gs:0x21a00
> 24: 02 00
> 26: 31 c0                   xor    eax,eax
> 28: c9                      leave
> 29: c3                      ret
>
> main prog:
> 0:  f3 0f 1e fa             endbr64
> 4:  0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]
> 9:  66 90                   xchg   ax,ax
> b:  55                      push   rbp
> c:  48 89 e5                mov    rbp,rsp
> f:  f3 0f 1e fa             endbr64
> 13: 49 b9 88 a6 c1 08 7e    movabs r9,0x607e08c1a688
> 1a: 60 00 00
> 1d: 65 4c 03 0c 25 00 1a    add    r9,QWORD PTR gs:0x21a00
> 24: 02 00
> 26: 48 bf 00 d0 5b 00 00    movabs rdi,0xffffc900005bd000
> 2d: c9 ff ff
> 30: 48 8b 77 00             mov    rsi,QWORD PTR [rdi+0x0]
> 34: 48 83 c6 01             add    rsi,0x1
> 38: 48 89 77 00             mov    QWORD PTR [rdi+0x0],rsi
> 3c: 41 51                   push   r9
> 3e: e8 46 23 51 e1          call   0xffffffffe1512389
> 43: 41 59                   pop    r9
> 45: 41 51                   push   r9
> 47: e8 3d 23 51 e1          call   0xffffffffe1512389
> 4c: 41 59                   pop    r9
> 4e: 41 51                   push   r9
> 50: e8 34 23 51 e1          call   0xffffffffe1512389
> 55: 41 59                   pop    r9
> 57: 31 c0                   xor    eax,eax
> 59: c9                      leave
> 5a: c3                      ret
>
>  From the above, it is clear that for both the subprog and the main prog,
> there is some r9-related overhead, including retrieving the private stack
> in the jited prologue code:
>    movabs r9,0x607e08c1a688
>    add    r9,QWORD PTR gs:0x21a00
> and the 'push r9'/'pop r9' pairs around subprog calls.
>
> I did some benchmarking on an Intel box (Intel(R) Xeon(R) D-2191A CPU @ 1.60GHz)
> which has 20 cores and 80 CPUs. The number of hits is measured in
> loop iterations.
>
> The following are two benchmark runs; a few other attempts showed
> similar variation.
>    $ ./benchs/run_bench_private_stack.sh
>    no-private-stack-1:  2.152 ± 0.004M/s (drops 0.000 ± 0.000M/s)
>    private-stack-1:     2.226 ± 0.003M/s (drops 0.000 ± 0.000M/s)
>    no-private-stack-8:  89.086 ± 0.674M/s (drops 0.000 ± 0.000M/s)
>    private-stack-8:     90.023 ± 0.117M/s (drops 0.000 ± 0.000M/s)
>    no-private-stack-64:  1545.383 ± 3.574M/s (drops 0.000 ± 0.000M/s)
>    private-stack-64:    1534.630 ± 2.063M/s (drops 0.000 ± 0.000M/s)
>    no-private-stack-512:  14591.591 ± 15.202M/s (drops 0.000 ± 0.000M/s)
>    private-stack-512:   14323.796 ± 13.165M/s (drops 0.000 ± 0.000M/s)
>    no-private-stack-2048:  58680.977 ± 46.116M/s (drops 0.000 ± 0.000M/s)
>    private-stack-2048:  58614.699 ± 22.031M/s (drops 0.000 ± 0.000M/s)
>    no-private-stack-4096:  119974.497 ± 90.985M/s (drops 0.000 ± 0.000M/s)
>    private-stack-4096:  114841.949 ± 59.514M/s (drops 0.000 ± 0.000M/s)
>    $ ./benchs/run_bench_private_stack.sh
>    no-private-stack-1:  2.246 ± 0.002M/s (drops 0.000 ± 0.000M/s)
>    private-stack-1:     2.232 ± 0.005M/s (drops 0.000 ± 0.000M/s)
>    no-private-stack-8:  91.446 ± 0.055M/s (drops 0.000 ± 0.000M/s)
>    private-stack-8:     90.120 ± 0.069M/s (drops 0.000 ± 0.000M/s)
>    no-private-stack-64:  1578.374 ± 1.508M/s (drops 0.000 ± 0.000M/s)
>    private-stack-64:    1514.909 ± 3.898M/s (drops 0.000 ± 0.000M/s)
>    no-private-stack-512:  14767.811 ± 22.399M/s (drops 0.000 ± 0.000M/s)
>    private-stack-512:   14232.382 ± 227.217M/s (drops 0.000 ± 0.000M/s)
>    no-private-stack-2048:  58342.372 ± 81.519M/s (drops 0.000 ± 0.000M/s)
>    private-stack-2048:  54503.335 ± 160.199M/s (drops 0.000 ± 0.000M/s)
>    no-private-stack-4096:  117262.975 ± 179.802M/s (drops 0.000 ± 0.000M/s)
>    private-stack-4096:  114643.523 ± 146.956M/s (drops 0.000 ± 0.000M/s)
>
> It is clear that the private stack is up to roughly 5 percent slower than the
> non-private stack. This can be roughly estimated from the above jited code for
> no-private-stack vs. private-stack.
>
> Although the benchmark shows a potential slowdown of up to 5% with the private
> stack, in reality the kernel enables the private stack only when the stack size
> exceeds 64 bytes, which means the bpf prog will be doing some useful work. If the
> bpf prog uses helpers/kfuncs, the push/pop r9 overhead should be minimal compared
> to the cost of the helpers/kfuncs themselves. If the prog does not use many
> helpers/kfuncs, there are few push/pop r9 pairs and the performance should be
> reasonable too.
>
> With 4096 loop iterations per program run, I got
>    $ perf record -- ./bench -w3 -d10 -a --nr-batch-iters=4096 no-private-stack
>    18.47%  bench                                              [k]
>    17.29%  bench    bpf_trampoline_6442522961                 [k] bpf_trampoline_6442522961
>    13.33%  bench    bpf_prog_bcf7977d3b93787c_func1           [k] bpf_prog_bcf7977d3b93787c_func1
>    11.86%  bench    [kernel.vmlinux]                          [k] migrate_enable
>    11.60%  bench    [kernel.vmlinux]                          [k] __bpf_prog_enter_recur
>    11.42%  bench    [kernel.vmlinux]                          [k] __bpf_prog_exit_recur
>     7.87%  bench    [kernel.vmlinux]                          [k] migrate_disable
>     3.71%  bench    [kernel.vmlinux]                          [k] bpf_get_numa_node_id
>     3.67%  bench    bpf_prog_d9703036495d54b0_trigger_driver  [k] bpf_prog_d9703036495d54b0_trigger_driver
>     0.04%  bench    bench                                     [.] btf_validate_type
>
>    $ perf record -- ./bench -w3 -d10 -a --nr-batch-iters=4096 private-stack
>      18.94%  bench                                              [k]
>      16.88%  bench    bpf_prog_bcf7977d3b93787c_func1           [k] bpf_prog_bcf7977d3b93787c_func1
>      15.77%  bench    bpf_trampoline_6442522961                 [k] bpf_trampoline_6442522961
>      11.70%  bench    [kernel.vmlinux]                          [k] __bpf_prog_enter_recur
>      11.48%  bench    [kernel.vmlinux]                          [k] migrate_enable
>      11.30%  bench    [kernel.vmlinux]                          [k] __bpf_prog_exit_recur
>       5.85%  bench    [kernel.vmlinux]                          [k] migrate_disable
>       3.69%  bench    bpf_prog_d9703036495d54b0_trigger_driver  [k] bpf_prog_d9703036495d54b0_trigger_driver
>       3.56%  bench    [kernel.vmlinux]                          [k] bpf_get_numa_node_id
>       0.06%  bench    bench                                     [.] bpf_prog_test_run_opts
>
> NOTE: I tried 6.4 perf and 6.10 perf, both of which have issues. I will investigate this further.

I tried perf built from the latest bpf-next, with no-private-stack, and the issue
still exists. Will debug more.

>
> I suspect the top 18.47%/18.94% entries in the perf runs are probably due to the
> fentry prog bench_trigger_fentry_batch, considering that even subprog func1 takes
> 13.33%/16.88% of the time.
> Overall, the bpf progs, including the trampoline, take more than 50% of the time.
>
> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> ---
>   arch/x86/net/bpf_jit_comp.c                   |   5 +-
>   include/linux/bpf.h                           |   3 +-
>   include/uapi/linux/bpf.h                      |   3 +
>   kernel/bpf/core.c                             |   3 +-
>   kernel/bpf/syscall.c                          |   4 +-
>   kernel/bpf/verifier.c                         |   1 +
>   tools/include/uapi/linux/bpf.h                |   3 +
>   tools/testing/selftests/bpf/Makefile          |   2 +
>   tools/testing/selftests/bpf/bench.c           |   6 +
>   .../bpf/benchs/bench_private_stack.c          | 149 ++++++++++++++++++
>   .../bpf/benchs/run_bench_private_stack.sh     |  11 ++
>   .../selftests/bpf/progs/private_stack.c       |  37 +++++
>   12 files changed, 222 insertions(+), 5 deletions(-)
>   create mode 100644 tools/testing/selftests/bpf/benchs/bench_private_stack.c
>   create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_private_stack.sh
>   create mode 100644 tools/testing/selftests/bpf/progs/private_stack.c
[...]


Thread overview: 33+ messages
2024-07-18 20:51 [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs Yonghong Song
2024-07-18 20:52 ` [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack Yonghong Song
2024-07-18 21:44   ` Yonghong Song [this message]
2024-07-18 21:59     ` Kumar Kartikeya Dwivedi
2024-07-19  3:01       ` Yonghong Song
2024-07-19  0:36     ` Alexei Starovoitov
2024-07-19  2:21       ` Yonghong Song
2024-07-20  0:14   ` bot+bpf-ci
2024-07-20  1:08   ` Alexei Starovoitov
2024-07-22 16:33     ` Yonghong Song
2024-07-20  3:28 ` [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs Andrii Nakryiko
2024-07-22 16:43   ` Yonghong Song
2024-07-24  5:08     ` Yonghong Song
2024-07-24 16:54       ` Alexei Starovoitov
2024-07-24 17:56         ` Yonghong Song
2024-07-22 20:57   ` Andrii Nakryiko
2024-07-23  1:05     ` Alexei Starovoitov
2024-07-23  3:26       ` Andrii Nakryiko
2024-07-24  3:17         ` Alexei Starovoitov
2024-07-24  4:06           ` Andrii Nakryiko
2024-07-24  4:46             ` Yonghong Song
2024-07-24  4:32           ` Yonghong Song
2024-07-23  5:30       ` Yonghong Song
2024-07-23  7:02         ` Yonghong Song
2024-07-22  3:33 ` Eduard Zingerman
2024-07-22 16:54   ` Yonghong Song
2024-07-22 17:53     ` Eduard Zingerman
2024-07-22 17:51   ` Alexei Starovoitov
2024-07-22 18:22     ` Eduard Zingerman
2024-07-22 20:08       ` Alexei Starovoitov
2024-07-24 21:28   ` Yonghong Song
2024-07-25  4:55     ` Alexei Starovoitov
2024-07-25 17:20       ` Eduard Zingerman
