From: Yonghong Song <yonghong.song@linux.dev>
To: bpf@vger.kernel.org
Cc: Alexei Starovoitov <ast@kernel.org>,
Andrii Nakryiko <andrii@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
kernel-team@fb.com, Martin KaFai Lau <martin.lau@kernel.org>
Subject: Re: [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack
Date: Thu, 18 Jul 2024 14:44:41 -0700
Message-ID: <1297da19-18a7-4727-8dab-e45ef0651e14@linux.dev>
In-Reply-To: <20240718205203.3652080-1-yonghong.song@linux.dev>
On 7/18/24 1:52 PM, Yonghong Song wrote:
> This patch intends to show some benchmark results comparing a bpf
> program with vs. without private stack. The patch is not intended
> to land since it hacks an existing kernel interface in order to
> do a proper comparison. The bpf program is similar to
> 7df4e597ea2c ("selftests/bpf: add batched, mostly in-kernel BPF triggering benchmarks")
> where a raw_tp program is triggered with bpf_prog_test_run_opts() and
> the raw_tp program has a loop calling the helper bpf_get_numa_node_id(),
> which causes a fentry prog to run. The fentry prog calls three
> do-nothing functions to maximally expose the cost of the private stack.
>
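> For context, here is a condensed, illustrative sketch of the bpf side
> (the full progs/private_stack.c is in the patch below; exact names and
> the loop construct may differ):
>
>     #include "vmlinux.h"
>     #include <bpf/bpf_helpers.h>
>     #include <bpf/bpf_tracing.h>
>
>     char _license[] SEC("license") = "GPL";
>
>     int nr_batch_iters;
>     long hits;
>
>     static __noinline int func1(void)
>     {
>             /* do-nothing subprog; noinline forces a real subprog call */
>             return 0;
>     }
>
>     SEC("fentry/bpf_get_numa_node_id")
>     int BPF_PROG(bench_trigger_fentry_batch)
>     {
>             /* count the hit, then three do-nothing calls to maximally
>              * expose the private-stack push/pop r9 overhead
>              */
>             hits++;
>             func1();
>             func1();
>             func1();
>             return 0;
>     }
>
>     SEC("raw_tp")
>     int trigger_driver(void *ctx)
>     {
>             int i;
>
>             /* the real prog may use bpf_for() for verifier friendliness */
>             for (i = 0; i < nr_batch_iters; i++)
>                     (void)bpf_get_numa_node_id();
>             return 0;
>     }
>
> User space runs trigger_driver via bpf_prog_test_run_opts(prog_fd, &opts),
> so one test run produces nr_batch_iters fentry invocations.
>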
> The following is the jited code for the bpf progs in progs/private_stack.c
> without private stack. The number of batch iterations is 4096.
>
> subprog:
> 0: f3 0f 1e fa endbr64
> 4: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
> 9: 66 90 xchg ax,ax
> b: 55 push rbp
> c: 48 89 e5 mov rbp,rsp
> f: f3 0f 1e fa endbr64
> 13: 31 c0 xor eax,eax
> 15: c9 leave
> 16: c3 ret
>
> main prog:
> 0: f3 0f 1e fa endbr64
> 4: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
> 9: 66 90 xchg ax,ax
> b: 55 push rbp
> c: 48 89 e5 mov rbp,rsp
> f: f3 0f 1e fa endbr64
> 13: 48 bf 00 e0 57 00 00 movabs rdi,0xffffc9000057e000
> 1a: c9 ff ff
> 1d: 48 8b 77 00 mov rsi,QWORD PTR [rdi+0x0]
> 21: 48 83 c6 01 add rsi,0x1
> 25: 48 89 77 00 mov QWORD PTR [rdi+0x0],rsi
> 29: e8 6e 00 00 00 call 0x9c
> 2e: e8 69 00 00 00 call 0x9c
> 33: e8 64 00 00 00 call 0x9c
> 38: 31 c0 xor eax,eax
> 3a: c9 leave
> 3b: c3 ret
>
> The following are the jited progs with private stack:
>
> subprog:
> 0: f3 0f 1e fa endbr64
> 4: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
> 9: 66 90 xchg ax,ax
> b: 55 push rbp
> c: 48 89 e5 mov rbp,rsp
> f: f3 0f 1e fa endbr64
> 13: 49 b9 70 a6 c1 08 7e movabs r9,0x607e08c1a670
> 1a: 60 00 00
> 1d: 65 4c 03 0c 25 00 1a add r9,QWORD PTR gs:0x21a00
> 24: 02 00
> 26: 31 c0 xor eax,eax
> 28: c9 leave
> 29: c3 ret
>
> main prog:
> 0: f3 0f 1e fa endbr64
> 4: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
> 9: 66 90 xchg ax,ax
> b: 55 push rbp
> c: 48 89 e5 mov rbp,rsp
> f: f3 0f 1e fa endbr64
> 13: 49 b9 88 a6 c1 08 7e movabs r9,0x607e08c1a688
> 1a: 60 00 00
> 1d: 65 4c 03 0c 25 00 1a add r9,QWORD PTR gs:0x21a00
> 24: 02 00
> 26: 48 bf 00 d0 5b 00 00 movabs rdi,0xffffc900005bd000
> 2d: c9 ff ff
> 30: 48 8b 77 00 mov rsi,QWORD PTR [rdi+0x0]
> 34: 48 83 c6 01 add rsi,0x1
> 38: 48 89 77 00 mov QWORD PTR [rdi+0x0],rsi
> 3c: 41 51 push r9
> 3e: e8 46 23 51 e1 call 0xffffffffe1512389
> 43: 41 59 pop r9
> 45: 41 51 push r9
> 47: e8 3d 23 51 e1 call 0xffffffffe1512389
> 4c: 41 59 pop r9
> 4e: 41 51 push r9
> 50: e8 34 23 51 e1 call 0xffffffffe1512389
> 55: 41 59 pop r9
> 57: 31 c0 xor eax,eax
> 59: c9 leave
> 5a: c3 ret
>
> From the above, it is clear that for both the subprog and the main prog
> there is some r9-related overhead, including retrieving the private
> stack pointer in the jit prologue code:
>     movabs r9,0x607e08c1a688
>     add r9,QWORD PTR gs:0x21a00
> and the 'push r9'/'pop r9' pairs around subprog calls.
>
> I did some benchmarking on an Intel box (Intel(R) Xeon(R) D-2191A CPU @ 1.60GHz)
> which has 20 cores and 80 cpus. The number of hits is measured in
> loop iterations.
>
> The following are two benchmark runs; a few other tries showed
> similar results in terms of variation.
> $ ./benchs/run_bench_private_stack.sh
> no-private-stack-1: 2.152 ± 0.004M/s (drops 0.000 ± 0.000M/s)
> private-stack-1: 2.226 ± 0.003M/s (drops 0.000 ± 0.000M/s)
> no-private-stack-8: 89.086 ± 0.674M/s (drops 0.000 ± 0.000M/s)
> private-stack-8: 90.023 ± 0.117M/s (drops 0.000 ± 0.000M/s)
> no-private-stack-64: 1545.383 ± 3.574M/s (drops 0.000 ± 0.000M/s)
> private-stack-64: 1534.630 ± 2.063M/s (drops 0.000 ± 0.000M/s)
> no-private-stack-512: 14591.591 ± 15.202M/s (drops 0.000 ± 0.000M/s)
> private-stack-512: 14323.796 ± 13.165M/s (drops 0.000 ± 0.000M/s)
> no-private-stack-2048: 58680.977 ± 46.116M/s (drops 0.000 ± 0.000M/s)
> private-stack-2048: 58614.699 ± 22.031M/s (drops 0.000 ± 0.000M/s)
> no-private-stack-4096: 119974.497 ± 90.985M/s (drops 0.000 ± 0.000M/s)
> private-stack-4096: 114841.949 ± 59.514M/s (drops 0.000 ± 0.000M/s)
> $ ./benchs/run_bench_private_stack.sh
> no-private-stack-1: 2.246 ± 0.002M/s (drops 0.000 ± 0.000M/s)
> private-stack-1: 2.232 ± 0.005M/s (drops 0.000 ± 0.000M/s)
> no-private-stack-8: 91.446 ± 0.055M/s (drops 0.000 ± 0.000M/s)
> private-stack-8: 90.120 ± 0.069M/s (drops 0.000 ± 0.000M/s)
> no-private-stack-64: 1578.374 ± 1.508M/s (drops 0.000 ± 0.000M/s)
> private-stack-64: 1514.909 ± 3.898M/s (drops 0.000 ± 0.000M/s)
> no-private-stack-512: 14767.811 ± 22.399M/s (drops 0.000 ± 0.000M/s)
> private-stack-512: 14232.382 ± 227.217M/s (drops 0.000 ± 0.000M/s)
> no-private-stack-2048: 58342.372 ± 81.519M/s (drops 0.000 ± 0.000M/s)
> private-stack-2048: 54503.335 ± 160.199M/s (drops 0.000 ± 0.000M/s)
> no-private-stack-4096: 117262.975 ± 179.802M/s (drops 0.000 ± 0.000M/s)
> private-stack-4096: 114643.523 ± 146.956M/s (drops 0.000 ± 0.000M/s)
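>
> For reference, run_bench_private_stack.sh is in the style of the other
> benchs/run_bench_*.sh scripts; a sketch, assuming the summarize and
> $RUN_BENCH helpers from benchs/run_common.sh (the actual script is in
> the patch below):
>
>     #!/bin/bash
>     set -eufo pipefail
>
>     source ./benchs/run_common.sh
>
>     for b in 1 8 64 512 2048 4096; do
>             summarize "no-private-stack-$b" "$($RUN_BENCH --nr-batch-iters=$b no-private-stack)"
>             summarize "private-stack-$b" "$($RUN_BENCH --nr-batch-iters=$b private-stack)"
>     done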
>
> It is clear that private stack is worse than no private stack, by up to
> close to 5 percent. This can be roughly estimated from the above jit
> code: the private-stack version adds two stack-pointer setup
> instructions to each prologue plus a push/pop r9 pair around each of
> the three subprog calls, roughly a dozen extra instructions per fentry
> invocation of an otherwise tiny prog.
>
> Although the benchmark shows up to a 5% potential slowdown with private
> stack, in reality the kernel enables private stack only for stack sizes
> above 64 bytes, which means the bpf prog will be doing some useful work.
> If the bpf prog uses any helper/kfunc, the push/pop r9 overhead should be
> minimal compared to the overhead of the helper/kfunc itself. If the prog
> does not use a lot of helpers/kfuncs, there is no push/pop r9 and the
> performance should be reasonable too.
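>
> For illustration only (not part of this patch), a prog that would
> actually be private-stack eligible looks more like the following (with
> the usual vmlinux.h/bpf_helpers.h includes):
>
>     SEC("raw_tp")
>     int big_stack_prog(void *ctx)
>     {
>             /* 128 bytes of stack, above the 64-byte threshold */
>             __u64 buf[16];
>             int i;
>
>             for (i = 0; i < 16; i++)
>                     buf[i] = bpf_get_prandom_u32();
>             return buf[0] & 1;
>     }
>
> Here the useful per-iteration work dwarfs the two extra prologue
> instructions and the push/pop r9 around each helper call.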
>
> With 4096 loop iterations per program run, I got
> $ perf record -- ./bench -w3 -d10 -a --nr-batch-iters=4096 no-private-stack
> 18.47% bench [k]
> 17.29% bench bpf_trampoline_6442522961 [k] bpf_trampoline_6442522961
> 13.33% bench bpf_prog_bcf7977d3b93787c_func1 [k] bpf_prog_bcf7977d3b93787c_func1
> 11.86% bench [kernel.vmlinux] [k] migrate_enable
> 11.60% bench [kernel.vmlinux] [k] __bpf_prog_enter_recur
> 11.42% bench [kernel.vmlinux] [k] __bpf_prog_exit_recur
> 7.87% bench [kernel.vmlinux] [k] migrate_disable
> 3.71% bench [kernel.vmlinux] [k] bpf_get_numa_node_id
> 3.67% bench bpf_prog_d9703036495d54b0_trigger_driver [k] bpf_prog_d9703036495d54b0_trigger_driver
> 0.04% bench bench [.] btf_validate_type
>
> $ perf record -- ./bench -w3 -d10 -a --nr-batch-iters=4096 private-stack
> 18.94% bench [k]
> 16.88% bench bpf_prog_bcf7977d3b93787c_func1 [k] bpf_prog_bcf7977d3b93787c_func1
> 15.77% bench bpf_trampoline_6442522961 [k] bpf_trampoline_6442522961
> 11.70% bench [kernel.vmlinux] [k] __bpf_prog_enter_recur
> 11.48% bench [kernel.vmlinux] [k] migrate_enable
> 11.30% bench [kernel.vmlinux] [k] __bpf_prog_exit_recur
> 5.85% bench [kernel.vmlinux] [k] migrate_disable
> 3.69% bench bpf_prog_d9703036495d54b0_trigger_driver [k] bpf_prog_d9703036495d54b0_trigger_driver
> 3.56% bench [kernel.vmlinux] [k] bpf_get_numa_node_id
> 0.06% bench bench [.] bpf_prog_test_run_opts
>
> NOTE: I tried 6.4 perf and 6.10 perf, both of which have issues. I will investigate this further.
I tried perf built from the latest bpf-next tree, with no-private-stack; the
issue still exists. I will debug this further.
>
> I suspect the top 18.47%/18.94% entries in the two perf runs are probably
> the fentry prog bench_trigger_fentry_batch, considering that even the
> subprog func1 takes 13.33%/16.88% of the time.
> Overall, the bpf progs including the trampoline take more than 50% of the time.
>
> Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
> ---
> arch/x86/net/bpf_jit_comp.c | 5 +-
> include/linux/bpf.h | 3 +-
> include/uapi/linux/bpf.h | 3 +
> kernel/bpf/core.c | 3 +-
> kernel/bpf/syscall.c | 4 +-
> kernel/bpf/verifier.c | 1 +
> tools/include/uapi/linux/bpf.h | 3 +
> tools/testing/selftests/bpf/Makefile | 2 +
> tools/testing/selftests/bpf/bench.c | 6 +
> .../bpf/benchs/bench_private_stack.c | 149 ++++++++++++++++++
> .../bpf/benchs/run_bench_private_stack.sh | 11 ++
> .../selftests/bpf/progs/private_stack.c | 37 +++++
> 12 files changed, 222 insertions(+), 5 deletions(-)
> create mode 100644 tools/testing/selftests/bpf/benchs/bench_private_stack.c
> create mode 100755 tools/testing/selftests/bpf/benchs/run_bench_private_stack.sh
> create mode 100644 tools/testing/selftests/bpf/progs/private_stack.c
[...]
Thread overview: 33+ messages
2024-07-18 20:51 [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs Yonghong Song
2024-07-18 20:52 ` [PATCH bpf-next v2 2/2] [no_merge] selftests/bpf: Benchmark runtime performance with private stack Yonghong Song
2024-07-18 21:44 ` Yonghong Song [this message]
2024-07-18 21:59 ` Kumar Kartikeya Dwivedi
2024-07-19 3:01 ` Yonghong Song
2024-07-19 0:36 ` Alexei Starovoitov
2024-07-19 2:21 ` Yonghong Song
2024-07-20 0:14 ` bot+bpf-ci
2024-07-20 1:08 ` Alexei Starovoitov
2024-07-22 16:33 ` Yonghong Song
2024-07-20 3:28 ` [PATCH bpf-next v2 1/2] bpf: Support private stack for bpf progs Andrii Nakryiko
2024-07-22 16:43 ` Yonghong Song
2024-07-24 5:08 ` Yonghong Song
2024-07-24 16:54 ` Alexei Starovoitov
2024-07-24 17:56 ` Yonghong Song
2024-07-22 20:57 ` Andrii Nakryiko
2024-07-23 1:05 ` Alexei Starovoitov
2024-07-23 3:26 ` Andrii Nakryiko
2024-07-24 3:17 ` Alexei Starovoitov
2024-07-24 4:06 ` Andrii Nakryiko
2024-07-24 4:46 ` Yonghong Song
2024-07-24 4:32 ` Yonghong Song
2024-07-23 5:30 ` Yonghong Song
2024-07-23 7:02 ` Yonghong Song
2024-07-22 3:33 ` Eduard Zingerman
2024-07-22 16:54 ` Yonghong Song
2024-07-22 17:53 ` Eduard Zingerman
2024-07-22 17:51 ` Alexei Starovoitov
2024-07-22 18:22 ` Eduard Zingerman
2024-07-22 20:08 ` Alexei Starovoitov
2024-07-24 21:28 ` Yonghong Song
2024-07-25 4:55 ` Alexei Starovoitov
2024-07-25 17:20 ` Eduard Zingerman