From: Marco Elver <elver@google.com>
To: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>, Song Liu <song@kernel.org>,
Yonghong Song <yonghong.song@linux.dev>,
John Fastabend <john.fastabend@gmail.com>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@google.com>, Hao Luo <haoluo@google.com>,
Jiri Olsa <jolsa@kernel.org>, Mykola Lysenko <mykolal@fb.com>,
Shuah Khan <shuah@kernel.org>,
bpf@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-kselftest@vger.kernel.org
Subject: Re: [PATCH] bpf: Separate bpf_local_storage_lookup() fast and slow paths
Date: Wed, 7 Feb 2024 10:56:23 +0100 [thread overview]
Message-ID: <ZcNTx2X7Y7zA5nrC@elver.google.com> (raw)
In-Reply-To: <9908bdfb-1030-4a9f-8405-3696c5d03981@linux.dev>
On Tue, Feb 06, 2024 at 05:22PM -0800, Martin KaFai Lau wrote:
> On 2/6/24 9:04 AM, Marco Elver wrote:
> > On Mon, Feb 05, 2024 at 03:24PM -0800, Martin KaFai Lau wrote:
> > [...]
> > > > Or can you suggest different functions to hook to for the recursion test?
> > >
> > > I don't prefer to add another tracepoint for the selftest.
> >
> > Ok - I also checked, even though it should be a no-op, it wasn't
> > (compiler generated worse code).
>
> I am interested to how the tracepoint generates worse code. Can you share
> some details ?
My guess is that it produces enough code that some inlinable functions
are no longer being inlined. Specifically __bpf_task_storage_get().
> >
> > > The test in "SEC("fentry/bpf_local_storage_lookup")" is testing that the
> > > initial bpf_local_storage_lookup() should work and the immediate recurred
> > > bpf_task_storage_delete() will fail.
> > >
> > > Depends on how the new slow path function will look like in v2. The test can
> > > probably be made to go through the slow path, e.g. by creating a lot of task
> > > storage maps before triggering the lookup.
[...]
> > Could you suggest how we can fix up the tests? I'm a little stuck
> > because there's not much we can hook to left.
>
> I don't see a solution either if only the cache insertion code path is in a
> traceable function.
>
> The prog->active counter has already been covered in another test. This test
> is mostly only covering the lookup => delete recur case and the code path is
> contained within the bpf storage logic. The future code review should be
> able to cover. I would make an exception here and remove this test case
> considering anything (e.g. tracepoint) we do here is likely to make it
> worse. (more on the test removal below).
>
> >
> > Thanks,
> > -- Marco
> >
> > ------ >8 ------
> >
> > From: Marco Elver <elver@google.com>
> > Date: Tue, 30 Jan 2024 17:57:45 +0100
> > Subject: [PATCH v2] bpf: Allow compiler to inline most of
> > bpf_local_storage_lookup()
> >
> > In various performance profiles of kernels with BPF programs attached,
> > bpf_local_storage_lookup() appears as a significant portion of CPU
> > cycles spent. To enable the compiler generate more optimal code, turn
> > bpf_local_storage_lookup() into a static inline function, where only the
> > cache insertion code path is outlined (call instruction can be elided
> > entirely if cacheit_lockit is a constant expression).
>
> Can you share more why only putting the cache insertion code to a function
> improves the larger number of maps case. In the benchmark, cacheit_lockit
> should always be true and __bpf_local_storage_insert_cache() should always
> be called.
Keeping bpf_local_storage_lookup() smaller (even if just outlining the
cache insertion) makes a difference as it allows the compiler generate
more optimal code, specifically we avoid duplicating setting up calls to
_raw_spin_lock/unlock. E.g. __bpf_task_storage_get is not being inlined
anymore if bpf_local_storage_lookup() becomes too large (i.e. everything
is up for inlining incl. cache insertion).
Also, on x86 preempt builds, spin_lock/unlock aren't inlinable, so we
have to pay the price of 2 calls regardless: previously for calls to
_raw_spin_lock_irqsave and to _raw_spin_unlock_irqsave. However, with
the version of __bpf_local_storage_insert_cache in my patch, the call to
_raw_spin_unlock_irqsave is tail called, which allows the compiler to
perform TCO, i.e. we still only pay the price of 2 calls: one to
__bpf_local_storage_insert_cache and to _raw_spin_lock_irqsave (but no
call to _raw_spin_unlock_irqsave, which can just be jumped to):
<__bpf_local_storage_insert_cache>:
endbr64
nopl 0x0(%rax,%rax,1)
push %r15
push %r14
push %r12
push %rbx
mov %rdx,%rbx
mov %rsi,%r12
mov %rdi,%r15
lea 0xa8(%rdi),%r14
mov %r14,%rdi
call ffffffff82323650 <_raw_spin_lock_irqsave>
cmpq $0x0,0x18(%rbx)
je ffffffff8127ea80 <__bpf_local_storage_insert_cache+0x40>
add $0x40,%rbx
movzwl 0x10e(%r12),%ecx
mov %rbx,(%r15,%rcx,8)
mov %r14,%rdi
mov %rax,%rsi
pop %rbx
pop %r12
pop %r14
pop %r15
jmp ffffffff823237d0 <_raw_spin_unlock_irqrestore> <--- TCO
I also compared a version where _everything_ is inlined vs. the one with
__bpf_local_storage_insert_cache outlined: the one where everything is
inlined nullifies any performance improvements and is significantly
worse than the one with __bpf_local_storage_insert_cache outlined.
[...]
> > -SEC("fentry/bpf_local_storage_lookup")
> > +SEC("fentry/??????????????????????????") > int BPF_PROG(on_lookup)
>
> Remove this BPF_PROG.
>
> > {
> > struct task_struct *task = bpf_get_current_task_btf();
> > diff --git a/tools/testing/selftests/bpf/progs/task_ls_recursion.c b/tools/testing/selftests/bpf/progs/task_ls_recursion.c
> > index 4542dc683b44..d73b33a4c153 100644
> > --- a/tools/testing/selftests/bpf/progs/task_ls_recursion.c
> > +++ b/tools/testing/selftests/bpf/progs/task_ls_recursion.c
> > @@ -27,7 +27,7 @@ struct {
> > __type(value, long);
> > } map_b SEC(".maps");
> > -SEC("fentry/bpf_local_storage_lookup")
> > +SEC("fentry/??????????????????????????")
>
> Same here. The checks related to on_lookup in
> prog_tests/task_local_storage.c need to be removed also.
>
> > int BPF_PROG(on_lookup)
> > {
> > struct task_struct *task = bpf_get_current_task_btf();
>
Thanks,
-- Marco
next prev parent reply other threads:[~2024-02-07 9:56 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-31 14:18 [PATCH] bpf: Separate bpf_local_storage_lookup() fast and slow paths Marco Elver
2024-01-31 19:52 ` Martin KaFai Lau
2024-01-31 20:09 ` Marco Elver
2024-02-05 15:00 ` Marco Elver
2024-02-05 23:24 ` Martin KaFai Lau
2024-02-06 17:04 ` Marco Elver
2024-02-07 1:22 ` Martin KaFai Lau
2024-02-07 9:56 ` Marco Elver [this message]
2024-01-31 20:04 ` Yonghong Song
2024-02-05 23:02 ` Song Liu
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZcNTx2X7Y7zA5nrC@elver.google.com \
--to=elver@google.com \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=haoluo@google.com \
--cc=john.fastabend@gmail.com \
--cc=jolsa@kernel.org \
--cc=kpsingh@kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=martin.lau@linux.dev \
--cc=mykolal@fb.com \
--cc=sdf@google.com \
--cc=shuah@kernel.org \
--cc=song@kernel.org \
--cc=yonghong.song@linux.dev \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.