* On inlining more helpers in the JITs or the verifier
@ 2024-05-02 17:37 Puranjay Mohan
2024-05-02 19:19 ` Puranjay Mohan
2024-05-02 21:22 ` Andrii Nakryiko
0 siblings, 2 replies; 4+ messages in thread
From: Puranjay Mohan @ 2024-05-02 17:37 UTC (permalink / raw)
To: Björn Töpel, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
John Fastabend, Kumar Kartikeya Dwivedi, bpf
Hi Everyone,
While working on inlining bpf_get_smp_processor_id() in the ARM64 and
RISCV JITs, I realized that these archs allow such optimizations because
they keep some information like the per-cpu offset or the pointer to the
task_struct in special system registers.
So, I went through the list of all BPF helpers and made a list of
helpers that we can inline in these JITs to make their usage much more
optimized:
I. ARM64 and RISC-V specific optimizations if inlined:
A) Because pointer to task_struct is available in a register:
1. bpf_get_current_pid_tgid()
2. bpf_get_current_task()
3. bpf_set_retval()
4. bpf_get_retval()
5. bpf_task_pt_regs()
6. bpf_get_attach_cookie()
B) Because per_cpu offset is available in a register:
1. bpf_this_cpu_ptr()
2. bpf_get_numa_node_id()
These can be inlined in the verifier too using the newly
introduced per-cpu instruction.
II. These are very basic writes, can be inlined in the verifier or the JIT:
1. bpf_msg_apply_bytes()
2. bpf_msg_cork_bytes()
3. bpf_set_hash_invalid()
I will first try to inline all these in the ARM64 JIT and see the
performance improvement. I am not sure what would be the best way to
benchmark all of this inlining.
Andrii, can you suggest something for the benchmarking?
Looking forward to your thoughts on this.
Thanks,
Puranjay
* Re: On inlining more helpers in the JITs or the verifier
2024-05-02 17:37 On inlining more helpers in the JITs or the verifier Puranjay Mohan
@ 2024-05-02 19:19 ` Puranjay Mohan
2024-05-02 21:22 ` Andrii Nakryiko
1 sibling, 0 replies; 4+ messages in thread
From: Puranjay Mohan @ 2024-05-02 19:19 UTC (permalink / raw)
To: Björn Töpel, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
John Fastabend, Kumar Kartikeya Dwivedi, bpf
Puranjay Mohan <puranjay@kernel.org> writes:
> Hi Everyone,
>
> While working on inlining bpf_get_smp_processor_id() in the ARM64 and
> RISCV JITs, I realized that these archs allow such optimizations because
> they keep some information like the per-cpu offset or the pointer to the
> task_struct in special system registers.
>
> So, I went through the list of all BPF helpers and made a list of
> helpers that we can inline in these JITs to make their usage much more
> optimized:
>
> I. ARM64 and RISC-V specific optimizations if inlined:
>
> A) Because pointer to task_struct is available in a register:
> 1. bpf_get_current_pid_tgid()
> 2. bpf_get_current_task()
I tried inlining bpf_get_current_task() on ARM64:
Before:
    bpf_prog_6e2672bcc4451a42_trigger_get_current_task:
    ; task = (struct task_struct *)bpf_get_current_task();
    34: mov x10, #0xffffffffffff9838
    38: movk x10, #0x8027, lsl #16
    3c: movk x10, #0x8000, lsl #32
    40: blr x10
    44: add x7, x0, #0x0
After:
    bpf_prog_6e2672bcc4451a42_trigger_get_current_task:
    ; task = (struct task_struct *)bpf_get_current_task();
    34: mrs x7, sp_el0
In the non-inlined version there is a branch [blr x10] to:
    0xffff800080279838 bpf_get_current_task:
    <+0>: mrs x0, sp_el0
    <+4>: ret
So, we only need a single instruction after inlining!!
I just don't know the best way to benchmark this. In theory it looks
highly optimized.
Thanks,
Puranjay
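The branch target in the disassembly above can be checked by hand from the mov/movk sequence. Here is a minimal sketch in C, assuming the usual A64 semantics where MOVK overwrites one 16-bit field and leaves the rest of the register untouched. The names movk() and helper_call_target() are made up for this illustration and are not taken from the JIT sources.

```c
#include <assert.h>
#include <stdint.h>

/* MOVK: replace one 16-bit field of reg, keep all other bits. */
static uint64_t movk(uint64_t reg, uint16_t imm16, unsigned shift)
{
    /* A64 only allows shifts of 0, 16, 32 or 48 for 64-bit MOVK. */
    assert(shift % 16 == 0 && shift < 64);
    uint64_t mask = (uint64_t)0xffff << shift;
    return (reg & ~mask) | ((uint64_t)imm16 << shift);
}

/* Replay the three instructions that load the call target into x10. */
static uint64_t helper_call_target(void)
{
    uint64_t x10 = 0xffffffffffff9838ULL;  /* 34: mov  x10, #0xffffffffffff9838 */
    x10 = movk(x10, 0x8027, 16);           /* 38: movk x10, #0x8027, lsl #16 */
    x10 = movk(x10, 0x8000, 32);           /* 3c: movk x10, #0x8000, lsl #32 */
    return x10;                            /* 40: blr  x10 */
}
```

Replaying the sequence gives 0xffff800080279838, which matches the address shown for bpf_get_current_task. The non-inlined path therefore spends three instructions building a constant, then a call, a return and a register move, all to run one mrs.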
* Re: On inlining more helpers in the JITs or the verifier
2024-05-02 17:37 On inlining more helpers in the JITs or the verifier Puranjay Mohan
2024-05-02 19:19 ` Puranjay Mohan
@ 2024-05-02 21:22 ` Andrii Nakryiko
2024-05-03 16:04 ` Alexei Starovoitov
1 sibling, 1 reply; 4+ messages in thread
From: Andrii Nakryiko @ 2024-05-02 21:22 UTC (permalink / raw)
To: Puranjay Mohan
Cc: Björn Töpel, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
John Fastabend, Kumar Kartikeya Dwivedi, bpf
On Thu, May 2, 2024 at 10:37 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
>
> Hi Everyone,
>
> While working on inlining bpf_get_smp_processor_id() in the ARM64 and
> RISCV JITs, I realized that these archs allow such optimizations because
> they keep some information like the per-cpu offset or the pointer to the
> task_struct in special system registers.
>
> So, I went through the list of all BPF helpers and made a list of
> helpers that we can inline in these JITs to make their usage much more
> optimized:
>
> I. ARM64 and RISC-V specific optimizations if inlined:
>
> A) Because pointer to task_struct is available in a register:
> 1. bpf_get_current_pid_tgid()
> 2. bpf_get_current_task()
These two are used really frequently, so it might make sense to
optimize them (and also bpf_get_current_task_btf(), of course), if
others agree with me.
> 3. bpf_set_retval()
> 4. bpf_get_retval()
> 5. bpf_task_pt_regs()
I'm leaning towards saying that probably not, unless we have a really
good reason to. Inlining is not free in terms of code maintenance and
complexity, so I wouldn't go and inline everything possible. But maybe
others have another opinion.
> 6. bpf_get_attach_cookie()
definitely no, there are multiple implementations depending on
specific program type
>
> B) Because per_cpu offset is available in a register:
> 1. bpf_this_cpu_ptr()
maybe, but I don't think we inline at BPF instruction level, so
inlining in BPF JIT seems premature
> 2. bpf_get_numa_node_id()
I'm not sure how actively this is used, so I'd say no to this one as well.
>
> These can be inlined in the verifier too using the newly
> introduced per-cpu instruction.
yep, I'd start with doing BPF assembly inlining for
bpf_this_cpu_ptr/bpf_per_cpu_ptr, tbh
>
> II. These are very basic writes, can be inlined in the verifier or the JIT:
> 1. bpf_msg_apply_bytes()
> 2. bpf_msg_cork_bytes()
> 3. bpf_set_hash_invalid()
I'd say this is also going overboard with inlining.
>
> I will first try to inline all these in the ARM64 JIT and see the
> performance improvement. I am not sure what would be the best way to
> benchmark all of this inlining.
>
> Andrii, can you suggest something for the benchmarking?
>
> Looking forward to your thoughts on this.
>
> Thanks,
> Puranjay
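For context on why bpf_get_current_pid_tgid() keeps coming up as a candidate: the bpf-helpers documentation describes its return value as current_task->tgid << 32 | current_task->pid, so an inlined version would be roughly two loads from the task_struct plus a shift and an OR. Here is a userspace sketch of that packing. The function names are hypothetical, and the kernel helper's error path is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Pack the two ids the way the helper's documentation describes:
 * tgid in the upper 32 bits, pid in the lower 32 bits. */
static uint64_t pack_pid_tgid(uint32_t tgid, uint32_t pid)
{
    return ((uint64_t)tgid << 32) | pid;
}

/* What a BPF program typically does with the returned value. */
static uint32_t tgid_of(uint64_t pid_tgid) { return (uint32_t)(pid_tgid >> 32); }
static uint32_t pid_of(uint64_t pid_tgid)  { return (uint32_t)pid_tgid; }

/* Sanity check that unpacking recovers exactly what was packed. */
static void check_roundtrip(uint32_t tgid, uint32_t pid)
{
    uint64_t v = pack_pid_tgid(tgid, pid);
    assert(tgid_of(v) == tgid);
    assert(pid_of(v) == pid);
}
```

Whether saving the call around such a small body is measurable is the open benchmarking question in this thread. The sketch only shows how little work the helper body itself does.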
* Re: On inlining more helpers in the JITs or the verifier
2024-05-02 21:22 ` Andrii Nakryiko
@ 2024-05-03 16:04 ` Alexei Starovoitov
0 siblings, 0 replies; 4+ messages in thread
From: Alexei Starovoitov @ 2024-05-03 16:04 UTC (permalink / raw)
To: Andrii Nakryiko
Cc: Puranjay Mohan, Björn Töpel, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
John Fastabend, Kumar Kartikeya Dwivedi, bpf
On Thu, May 2, 2024 at 2:22 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, May 2, 2024 at 10:37 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >
> >
> > Hi Everyone,
> >
> > While working on inlining bpf_get_smp_processor_id() in the ARM64 and
> > RISCV JITs, I realized that these archs allow such optimizations because
> > they keep some information like the per-cpu offset or the pointer to the
> > task_struct in special system registers.
> >
> > So, I went through the list of all BPF helpers and made a list of
> > helpers that we can inline in these JITs to make their usage much more
> > optimized:
> >
> > I. ARM64 and RISC-V specific optimizations if inlined:
> >
> > A) Because pointer to task_struct is available in a register:
> > 1. bpf_get_current_pid_tgid()
> > 2. bpf_get_current_task()
>
> These two are used really frequently, so it might make sense to
> optimize them (and also bpf_get_current_task_btf(), of course), if
> others agree with me.
>
> > 3. bpf_set_retval()
> > 4. bpf_get_retval()
> > 5. bpf_task_pt_regs()
>
> I'm leaning towards saying that probably not, unless we have a really
> good reason to. Inlining is not free in terms of code maintenance and
> complexity, so I wouldn't go and inline everything possible. But maybe
> others have another opinion.
>
>
> > 6. bpf_get_attach_cookie()
>
> definitely no, there are multiple implementations depending on
> specific program type
>
> >
> > B) Because per_cpu offset is available in a register:
> > 1. bpf_this_cpu_ptr()
>
> maybe, but I don't think we inline at BPF instruction level, so
> inlining in BPF JIT seems premature
>
>
> > 2. bpf_get_numa_node_id()
>
> I'm not sure how actively this is used, so I'd say no to this one as well.
>
> >
> > These can be inlined in the verifier too using the newly
> > introduced per-cpu instruction.
>
> yep, I'd start with doing BPF assembly inlining for
> bpf_this_cpu_ptr/bpf_per_cpu_ptr, tbh
>
> >
> > II. These are very basic writes, can be inlined in the verifier or the JIT:
> > 1. bpf_msg_apply_bytes()
> > 2. bpf_msg_cork_bytes()
> > 3. bpf_set_hash_invalid()
>
> I'd say this is also going overboard with inlining.
+1
simplicity of logic is not a reason to inline it.
I would only inline bpf_get_current_task[_btf]()
and do it in the verifier.
JITs should inline only if perf delta is really significant.
I hope bpf_get_smp_processor_id() will be the only such example.
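To make the "do it in the verifier" option concrete, here is a toy model in C of a rewrite pass that replaces calls to one chosen helper with an inline instruction and leaves every other call alone. This is only a sketch of the idea. The toy_* types, opcodes and helper ids are invented for illustration and do not correspond to struct bpf_insn or to the real verifier fixup code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy instruction set, not the real BPF encoding. */
enum toy_op { TOY_OTHER, TOY_CALL_HELPER, TOY_LOAD_CURRENT };

/* Hypothetical helper ids, chosen only for this example. */
enum toy_helper {
    TOY_HELPER_GET_CURRENT_TASK  = 1,
    TOY_HELPER_GET_ATTACH_COOKIE = 2,
};

struct toy_insn {
    enum toy_op op;
    int helper;  /* meaningful only when op == TOY_CALL_HELPER */
};

/* Patch calls to the one helper we decided to inline.
 * Returns the number of instructions that were rewritten. */
static size_t toy_inline_helpers(struct toy_insn *prog, size_t len)
{
    size_t patched = 0;

    assert(prog != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (prog[i].op == TOY_CALL_HELPER &&
            prog[i].helper == TOY_HELPER_GET_CURRENT_TASK) {
            prog[i].op = TOY_LOAD_CURRENT;  /* call becomes a plain load */
            prog[i].helper = 0;
            patched++;
        }
    }
    return patched;
}
```

In this model the allow-list is a single comparison, which mirrors the position taken in the reply: inline only the helpers where it clearly pays off, and keep the rest as ordinary calls.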