* On inlining more helpers in the JITs or the verifier
@ 2024-05-02 17:37 Puranjay Mohan
  2024-05-02 19:19 ` Puranjay Mohan
  2024-05-02 21:22 ` Andrii Nakryiko
  0 siblings, 2 replies; 4+ messages in thread
From: Puranjay Mohan @ 2024-05-02 17:37 UTC
  To: Björn Töpel, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, Kumar Kartikeya Dwivedi, bpf


Hi Everyone,

While working on inlining bpf_get_smp_processor_id() in the ARM64 and
RISC-V JITs, I realized that these archs allow such optimizations because
they keep some information like the per-cpu offset or the pointer to the
task_struct in special system registers.

So, I went through the list of all BPF helpers and made a list of
helpers that we can inline in these JITs to make their usage much more
optimized:

I. ARM64 and RISC-V specific optimizations if inlined:

    A) Because pointer to task_struct is available in a register:
        1. bpf_get_current_pid_tgid()
        2. bpf_get_current_task()
        3. bpf_set_retval()
        4. bpf_get_retval()
        5. bpf_task_pt_regs()
        6. bpf_get_attach_cookie()
    
    B) Because per_cpu offset is available in a register:
        1. bpf_this_cpu_ptr()
        2. bpf_get_numa_node_id()

        These can be inlined in the verifier too using the newly
        introduced per-cpu instruction.

II. These are very basic writes, can be inlined in the verifier or the JIT:
    1. bpf_msg_apply_bytes()
    2. bpf_msg_cork_bytes()
    3. bpf_set_hash_invalid()

I will first try to inline all these in the ARM64 JIT and see the
performance improvement. I am not sure what would be the best way to
benchmark all of this inlining.

Andrii, can you suggest something for the benchmarking?

Looking forward to your thoughts on this.

Thanks,
Puranjay


* Re: On inlining more helpers in the JITs or the verifier
  2024-05-02 17:37 On inlining more helpers in the JITs or the verifier Puranjay Mohan
@ 2024-05-02 19:19 ` Puranjay Mohan
  2024-05-02 21:22 ` Andrii Nakryiko
  1 sibling, 0 replies; 4+ messages in thread
From: Puranjay Mohan @ 2024-05-02 19:19 UTC
  To: Björn Töpel, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, Kumar Kartikeya Dwivedi, bpf

Puranjay Mohan <puranjay@kernel.org> writes:

> Hi Everyone,
>
> While working on inlining bpf_get_smp_processor_id() in the ARM64 and
> RISC-V JITs, I realized that these archs allow such optimizations because
> they keep some information like the per-cpu offset or the pointer to the
> task_struct in special system registers.
>
> So, I went through the list of all BPF helpers and made a list of
> helpers that we can inline in these JITs to make their usage much more
> optimized:
>
> I. ARM64 and RISC-V specific optimizations if inlined:
>
>     A) Because pointer to task_struct is available in a register:
>         1. bpf_get_current_pid_tgid()
>         2. bpf_get_current_task()

Tried inlining bpf_get_current_task() on ARM64:

                  Before                                                                  After
                 --------                                                               --------

bpf_prog_6e2672bcc4451a42_trigger_get_current_task:                      bpf_prog_6e2672bcc4451a42_trigger_get_current_task:
; task = (struct task_struct *)bpf_get_current_task();                   ; task = (struct task_struct *)bpf_get_current_task();
  34:   mov     x10, #0xffffffffffff9838                                   34:   mrs     x7, sp_el0
  38:   movk    x10, #0x8027, lsl #16
  3c:   movk    x10, #0x8000, lsl #32
  40:   blr     x10
  44:   add     x7, x0, #0x0



In the non-inlined version there is a branch [blr x10] to:

0xffff800080279838 bpf_get_current_task:
                     <+0>:     mrs     x0, sp_el0
                     <+4>:     ret


So, we only need a single instruction after inlining!!
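Counting from the listing above: the "Before" path executes five
instructions in the program (mov, two movk, blr, add) plus two in the
helper (mrs, ret), while the "After" path executes one. Three of the five
only build the helper's address in x10. A minimal C sketch that models
MOVK and rebuilds that address is below; movk() and helper_address() are
hypothetical names used only for illustration, not kernel functions.

```c
#include <stdint.h>

/* Model of AArch64 MOVK: overwrite one 16-bit field, keep all other bits. */
static uint64_t movk(uint64_t reg, uint16_t imm16, unsigned shift)
{
	uint64_t mask = (uint64_t)0xffff << shift;

	return (reg & ~mask) | ((uint64_t)imm16 << shift);
}

/* Rebuild the call target that the program loads at offsets 0x34..0x3c. */
static uint64_t helper_address(void)
{
	uint64_t x10 = 0xffffffffffff9838ULL;	/* mov  x10, #0xffffffffffff9838 */

	x10 = movk(x10, 0x8027, 16);		/* movk x10, #0x8027, lsl #16 */
	x10 = movk(x10, 0x8000, 32);		/* movk x10, #0x8000, lsl #32 */
	return x10;
}
```

helper_address() yields 0xffff800080279838, which is the address shown
for bpf_get_current_task in the disassembly of the branch target.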

I just don't know the best way to benchmark this. In theory it looks
highly optimized.

Thanks,
Puranjay


* Re: On inlining more helpers in the JITs or the verifier
  2024-05-02 17:37 On inlining more helpers in the JITs or the verifier Puranjay Mohan
  2024-05-02 19:19 ` Puranjay Mohan
@ 2024-05-02 21:22 ` Andrii Nakryiko
  2024-05-03 16:04   ` Alexei Starovoitov
  1 sibling, 1 reply; 4+ messages in thread
From: Andrii Nakryiko @ 2024-05-02 21:22 UTC
  To: Puranjay Mohan
  Cc: Björn Töpel, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, Kumar Kartikeya Dwivedi, bpf

On Thu, May 2, 2024 at 10:37 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>
>
> Hi Everyone,
>
> While working on inlining bpf_get_smp_processor_id() in the ARM64 and
> RISC-V JITs, I realized that these archs allow such optimizations because
> they keep some information like the per-cpu offset or the pointer to the
> task_struct in special system registers.
>
> So, I went through the list of all BPF helpers and made a list of
> helpers that we can inline in these JITs to make their usage much more
> optimized:
>
> I. ARM64 and RISC-V specific optimizations if inlined:
>
>     A) Because pointer to task_struct is available in a register:
>         1. bpf_get_current_pid_tgid()
>         2. bpf_get_current_task()

These two are used really frequently, so it might make sense to
optimize them (and also bpf_get_current_task_btf(), of course), if
others agree with me.
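
For context on why bpf_get_current_pid_tgid() is cheap once the task
pointer is at hand: the helper returns the tgid in the upper 32 bits and
the pid in the lower 32 bits. A minimal userspace sketch of that packing
follows; struct task_model and the function names are hypothetical
stand-ins, not the kernel's task_struct or any kernel API.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two task fields the helper reads. */
struct task_model {
	uint32_t pid;	/* thread ID */
	uint32_t tgid;	/* thread group ID (what userspace calls the PID) */
};

/* Pack as the helper documents it: tgid << 32 | pid. */
static uint64_t pack_pid_tgid(const struct task_model *t)
{
	return (uint64_t)t->tgid << 32 | t->pid;
}

/* What a BPF program typically does with the returned value. */
static uint32_t tgid_of(uint64_t pid_tgid) { return (uint32_t)(pid_tgid >> 32); }
static uint32_t pid_of(uint64_t pid_tgid)  { return (uint32_t)pid_tgid; }
```

Given the task pointer, an inlined version would amount to two loads plus
a shift and an OR, which is the kind of sequence a JIT or the verifier
could emit directly.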

>         3. bpf_set_retval()
>         4. bpf_get_retval()
>         5. bpf_task_pt_regs()

I'm leaning towards probably not, unless we have a really
good reason to. Inlining is not free in terms of code maintenance and
complexity, so I wouldn't go and inline everything possible. But maybe
others have another opinion.


>         6. bpf_get_attach_cookie()

definitely no, there are multiple implementations depending on
specific program type

>
>     B) Because per_cpu offset is available in a register:
>         1. bpf_this_cpu_ptr()

maybe, but I don't think we inline at BPF instruction level, so
inlining in BPF JIT seems premature


>         2. bpf_get_numa_node_id()

I'm not sure how actively this is used, so I'd say no to this one as well.

>
>         These can be inlined in the verifier too using the newly
>         introduced per-cpu instruction.

yep, I'd start with doing BPF assembly inlining for
bpf_this_cpu_ptr/bpf_per_cpu_ptr, tbh
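
As a reminder of what these two helpers compute: a per-CPU pointer is the
variable's base address plus the offset of the chosen CPU's copy, and the
"this CPU" variant uses the current CPU's offset. A minimal userspace
model of that arithmetic is sketched below; the array, the CPU count and
the function names are hypothetical and only stand in for the kernel's
per-CPU area.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_CPUS_MODEL 4	/* hypothetical CPU count for this model */

/* One copy of the "per-CPU" variable for each modelled CPU. */
static int counters[NR_CPUS_MODEL];

/* Byte distance from CPU 0's copy to the given CPU's copy. */
static ptrdiff_t cpu_offset(unsigned cpu)
{
	return (ptrdiff_t)(cpu * sizeof(counters[0]));
}

/* Model of a per-CPU pointer lookup: base address plus that CPU's offset. */
static void *per_cpu_ptr_model(void *base, unsigned cpu)
{
	return (char *)base + cpu_offset(cpu);
}
```

With the offset already held in a register, the whole lookup reduces to a
single add, which is why it is a natural candidate for inlining at the
BPF instruction level.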

>
> II. These are very basic writes, can be inlined in the verifier or the JIT:
>     1. bpf_msg_apply_bytes()
>     2. bpf_msg_cork_bytes()
>     3. bpf_set_hash_invalid()

I'd say this is also going overboard with inlining.

>
> I will first try to inline all these in the ARM64 JIT and see the
> performance improvement. I am not sure what would be the best way to
> benchmark all of this inlining.
>
> Andrii, can you suggest something for the benchmarking?
>
> Looking forward to your thoughts on this.
>
> Thanks,
> Puranjay


* Re: On inlining more helpers in the JITs or the verifier
  2024-05-02 21:22 ` Andrii Nakryiko
@ 2024-05-03 16:04   ` Alexei Starovoitov
  0 siblings, 0 replies; 4+ messages in thread
From: Alexei Starovoitov @ 2024-05-03 16:04 UTC
  To: Andrii Nakryiko
  Cc: Puranjay Mohan, Björn Töpel, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, Kumar Kartikeya Dwivedi, bpf

On Thu, May 2, 2024 at 2:22 PM Andrii Nakryiko
<andrii.nakryiko@gmail.com> wrote:
>
> On Thu, May 2, 2024 at 10:37 AM Puranjay Mohan <puranjay@kernel.org> wrote:
> >
> >
> > Hi Everyone,
> >
> > While working on inlining bpf_get_smp_processor_id() in the ARM64 and
> > RISC-V JITs, I realized that these archs allow such optimizations because
> > they keep some information like the per-cpu offset or the pointer to the
> > task_struct in special system registers.
> >
> > So, I went through the list of all BPF helpers and made a list of
> > helpers that we can inline in these JITs to make their usage much more
> > optimized:
> >
> > I. ARM64 and RISC-V specific optimizations if inlined:
> >
> >     A) Because pointer to task_struct is available in a register:
> >         1. bpf_get_current_pid_tgid()
> >         2. bpf_get_current_task()
>
> These two are used really frequently, so it might make sense to
> optimize them (and also bpf_get_current_task_btf(), of course), if
> others agree with me.
>
> >         3. bpf_set_retval()
> >         4. bpf_get_retval()
> >         5. bpf_task_pt_regs()
>
> I'm leaning towards probably not, unless we have a really
> good reason to. Inlining is not free in terms of code maintenance and
> complexity, so I wouldn't go and inline everything possible. But maybe
> others have another opinion.
>
>
> >         6. bpf_get_attach_cookie()
>
> definitely no, there are multiple implementations depending on
> specific program type
>
> >
> >     B) Because per_cpu offset is available in a register:
> >         1. bpf_this_cpu_ptr()
>
> maybe, but I don't think we inline at BPF instruction level, so
> inlining in BPF JIT seems premature
>
>
> >         2. bpf_get_numa_node_id()
>
> I'm not sure how actively this is used, so I'd say no to this one as well.
>
> >
> >         These can be inlined in the verifier too using the newly
> >         introduced per-cpu instruction.
>
> yep, I'd start with doing BPF assembly inlining for
> bpf_this_cpu_ptr/bpf_per_cpu_ptr, tbh
>
> >
> > II. These are very basic writes, can be inlined in the verifier or the JIT:
> >     1. bpf_msg_apply_bytes()
> >     2. bpf_msg_cork_bytes()
> >     3. bpf_set_hash_invalid()
>
> I'd say this is also going overboard with inlining.

+1

simplicity of logic is not a reason to inline it.
I would only inline bpf_get_current_task[_btf]() and do it
in the verifier. JITs should inline only if perf delta
is really significant.
I hope bpf_get_smp_processor_id() will be the only such example.
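
Inlining in the verifier means rewriting the helper call in the BPF
instruction stream before any JIT runs, so every architecture benefits
from a single change. The toy sketch below illustrates the idea only: the
instruction layout, opcodes and helper IDs are hypothetical and do not
match the real struct bpf_insn encoding or the verifier's patching code.

```c
#include <stddef.h>

/* Toy opcodes and helper IDs: NOT the real BPF encoding. */
enum toy_op { TOY_CALL, TOY_LOAD_CURRENT, TOY_EXIT };
enum toy_helper { TOY_GET_CURRENT_TASK = 1, TOY_OTHER_HELPER = 2 };

struct toy_insn {
	enum toy_op op;
	int imm;	/* helper ID when op == TOY_CALL, otherwise unused */
};

/*
 * Replace every call to TOY_GET_CURRENT_TASK with a single
 * TOY_LOAD_CURRENT instruction and leave all other instructions alone.
 * Returns the number of calls that were replaced.
 */
static size_t inline_current_task(struct toy_insn *prog, size_t len)
{
	size_t patched = 0;

	for (size_t i = 0; i < len; i++) {
		if (prog[i].op == TOY_CALL &&
		    prog[i].imm == TOY_GET_CURRENT_TASK) {
			prog[i].op = TOY_LOAD_CURRENT;
			prog[i].imm = 0;
			patched++;
		}
	}
	return patched;
}
```

In this model the replacement is one-for-one, so no jump offsets need
fixing up; a real rewrite that changes the instruction count would also
have to adjust branch targets.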

