public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Marco Elver <elver@google.com>
To: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>,
	Yonghong Song <yonghong.song@linux.dev>,
	John Fastabend <john.fastabend@gmail.com>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Hao Luo <haoluo@google.com>,
	Jiri Olsa <jolsa@kernel.org>, Mykola Lysenko <mykolal@fb.com>,
	Shuah Khan <shuah@kernel.org>,
	bpf@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-kselftest@vger.kernel.org
Subject: Re: [PATCH] bpf: Separate bpf_local_storage_lookup() fast and slow paths
Date: Wed, 7 Feb 2024 10:56:23 +0100	[thread overview]
Message-ID: <ZcNTx2X7Y7zA5nrC@elver.google.com> (raw)
In-Reply-To: <9908bdfb-1030-4a9f-8405-3696c5d03981@linux.dev>

On Tue, Feb 06, 2024 at 05:22PM -0800, Martin KaFai Lau wrote:
> On 2/6/24 9:04 AM, Marco Elver wrote:
> > On Mon, Feb 05, 2024 at 03:24PM -0800, Martin KaFai Lau wrote:
> > [...]
> > > > Or can you suggest different functions to hook to for the recursion test?
> > > 
> > > I don't prefer to add another tracepoint for the selftest.
> > 
> > Ok - I also checked, even though it should be a no-op, it wasn't
> > (compiler generated worse code).
> 
> I am interested to how the tracepoint generates worse code. Can you share
> some details ?

My guess is that it produces enough code that some inlinable functions
are no longer being inlined. Specifically __bpf_task_storage_get().

> > 
> > > The test in "SEC("fentry/bpf_local_storage_lookup")" is testing that the
> > > initial bpf_local_storage_lookup() should work and the immediate recurred
> > > bpf_task_storage_delete() will fail.
> > > 
> > > Depends on how the new slow path function will look like in v2. The test can
> > > probably be made to go through the slow path, e.g. by creating a lot of task
> > > storage maps before triggering the lookup.
[...]
> > Could you suggest how we can fix up the tests? I'm a little stuck
> > because there's not much we can hook to left.
> 
> I don't see a solution either if only the cache insertion code path is in a
> traceable function.
> 
> The prog->active counter has already been covered in another test. This test
> is mostly only covering the lookup => delete recur case and the code path is
> contained within the bpf storage logic. The future code review should be
> able to cover. I would make an exception here and remove this test case
> considering anything (e.g. tracepoint) we do here is likely to make it
> worse. (more on the test removal below).
> 
> > 
> > Thanks,
> > -- Marco
> > 
> > ------ >8 ------
> > 
> > From: Marco Elver <elver@google.com>
> > Date: Tue, 30 Jan 2024 17:57:45 +0100
> > Subject: [PATCH v2] bpf: Allow compiler to inline most of
> >   bpf_local_storage_lookup()
> > 
> > In various performance profiles of kernels with BPF programs attached,
> > bpf_local_storage_lookup() appears as a significant portion of CPU
> > cycles spent. To enable the compiler generate more optimal code, turn
> > bpf_local_storage_lookup() into a static inline function, where only the
> > cache insertion code path is outlined (call instruction can be elided
> > entirely if cacheit_lockit is a constant expression).
> 
> Can you share more why only putting the cache insertion code to a function
> improves the larger number of maps case. In the benchmark, cacheit_lockit
> should always be true and __bpf_local_storage_insert_cache() should always
> be called.

Keeping bpf_local_storage_lookup() smaller (even if just outlining the
cache insertion) makes a difference as it allows the compiler generate
more optimal code, specifically we avoid duplicating setting up calls to
_raw_spin_lock/unlock. E.g.  __bpf_task_storage_get is not being inlined
anymore if bpf_local_storage_lookup() becomes too large (i.e. everything
is up for inlining incl. cache insertion).

Also, on x86 preempt builds, spin_lock/unlock aren't inlinable, so we
have to pay the price of 2 calls regardless: previously for calls to
_raw_spin_lock_irqsave and to _raw_spin_unlock_irqsave. However, with
the version of __bpf_local_storage_insert_cache in my patch, the call to
_raw_spin_unlock_irqsave is tail called, which allows the compiler to
perform TCO, i.e. we still only pay the price of 2 calls: one to
__bpf_local_storage_insert_cache and to _raw_spin_lock_irqsave (but no
call to _raw_spin_unlock_irqsave, which can just be jumped to):

<__bpf_local_storage_insert_cache>:
  endbr64
  nopl   0x0(%rax,%rax,1)
  push   %r15
  push   %r14
  push   %r12
  push   %rbx
  mov    %rdx,%rbx
  mov    %rsi,%r12
  mov    %rdi,%r15
  lea    0xa8(%rdi),%r14
  mov    %r14,%rdi
  call   ffffffff82323650 <_raw_spin_lock_irqsave>
  cmpq   $0x0,0x18(%rbx)
  je     ffffffff8127ea80 <__bpf_local_storage_insert_cache+0x40>
  add    $0x40,%rbx
  movzwl 0x10e(%r12),%ecx

  mov    %rbx,(%r15,%rcx,8)
  mov    %r14,%rdi
  mov    %rax,%rsi
  pop    %rbx
  pop    %r12
  pop    %r14
  pop    %r15
  jmp    ffffffff823237d0 <_raw_spin_unlock_irqrestore>        <--- TCO

I also compared a version where _everything_ is inlined vs. the one with
__bpf_local_storage_insert_cache outlined: the one where everything is
inlined nullifies any performance improvements and is significantly
worse than the one with __bpf_local_storage_insert_cache outlined.

[...]
> > -SEC("fentry/bpf_local_storage_lookup")
> > +SEC("fentry/??????????????????????????") >   int BPF_PROG(on_lookup)
> 
> Remove this BPF_PROG.
> 
> >   {
> >   	struct task_struct *task = bpf_get_current_task_btf();
> > diff --git a/tools/testing/selftests/bpf/progs/task_ls_recursion.c b/tools/testing/selftests/bpf/progs/task_ls_recursion.c
> > index 4542dc683b44..d73b33a4c153 100644
> > --- a/tools/testing/selftests/bpf/progs/task_ls_recursion.c
> > +++ b/tools/testing/selftests/bpf/progs/task_ls_recursion.c
> > @@ -27,7 +27,7 @@ struct {
> >   	__type(value, long);
> >   } map_b SEC(".maps");
> > -SEC("fentry/bpf_local_storage_lookup")
> > +SEC("fentry/??????????????????????????")
> 
> Same here. The checks related to on_lookup in
> prog_tests/task_local_storage.c need to be removed also.
> 
> >   int BPF_PROG(on_lookup)
> >   {
> >   	struct task_struct *task = bpf_get_current_task_btf();
> 

Thanks,
-- Marco

  reply	other threads:[~2024-02-07  9:56 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-31 14:18 [PATCH] bpf: Separate bpf_local_storage_lookup() fast and slow paths Marco Elver
2024-01-31 19:52 ` Martin KaFai Lau
2024-01-31 20:09   ` Marco Elver
2024-02-05 15:00   ` Marco Elver
2024-02-05 23:24     ` Martin KaFai Lau
2024-02-06 17:04       ` Marco Elver
2024-02-07  1:22         ` Martin KaFai Lau
2024-02-07  9:56           ` Marco Elver [this message]
2024-01-31 20:04 ` Yonghong Song
2024-02-05 23:02 ` Song Liu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZcNTx2X7Y7zA5nrC@elver.google.com \
    --to=elver@google.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=haoluo@google.com \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-kselftest@vger.kernel.org \
    --cc=martin.lau@linux.dev \
    --cc=mykolal@fb.com \
    --cc=sdf@google.com \
    --cc=shuah@kernel.org \
    --cc=song@kernel.org \
    --cc=yonghong.song@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox