From: Puranjay Mohan <puranjay@kernel.org>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
Martin KaFai Lau <martin.lau@linux.dev>,
Eduard Zingerman <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
Yonghong Song <yonghong.song@linux.dev>,
John Fastabend <john.fastabend@gmail.com>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@google.com>, Hao Luo <haoluo@google.com>,
Jiri Olsa <jolsa@kernel.org>, Zi Shen Lim <zlim.lnx@gmail.com>,
Xu Kuohai <xukuohai@huawei.com>,
Florent Revest <revest@chromium.org>,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next v3 1/2] arm64, bpf: add internal-only MOV instruction to resolve per-CPU addrs
Date: Fri, 26 Apr 2024 16:55:01 +0000
Message-ID: <mb61psez8vzbu.fsf@kernel.org>
In-Reply-To: <CAEf4BzbBBpsuCGgombEj1N8f97iKrMr2WXSoU8jOUfKSqLXnyw@mail.gmail.com>
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> On Fri, Apr 26, 2024 at 5:14 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>
>> From: Puranjay Mohan <puranjay12@gmail.com>
>>
>> Support an instruction for resolving absolute addresses of per-CPU
>> data from their per-CPU offsets. This instruction is internal-only and
>> users are not allowed to use them directly. They will only be used for
>> internal inlining optimizations for now between BPF verifier and BPF
>> JITs.
>>
>> Since commit 7158627686f0 ("arm64: percpu: implement optimised pcpu
>> access using tpidr_el1"), the per-cpu offset for the CPU is stored in
>> the tpidr_el1/2 register of that CPU.
>>
>> To support this BPF instruction in the ARM64 JIT, the following ARM64
>> instructions are emitted:
>>
>> mov dst, src // Move src to dst, if src != dst
>> mrs tmp, tpidr_el1/2 // Move per-cpu offset of the current cpu in tmp.
>> add dst, dst, tmp // Add the per cpu offset to the dst.
>>
>> To measure the performance improvement provided by this change, the
>> benchmark in [1] was used:
>>
>> Before:
>> glob-arr-inc : 23.597 ± 0.012M/s
>> arr-inc : 23.173 ± 0.019M/s
>> hash-inc : 12.186 ± 0.028M/s
>>
>> After:
>> glob-arr-inc : 23.819 ± 0.034M/s
>> arr-inc : 23.285 ± 0.017M/s
>
> I still expected a better improvement (global-arr-inc's results
> improved more than arr-inc, which is completely different from
> x86-64), but it's still a good thing to support this for arm64, of
> course.
>
> ack for generic parts I can understand:
>
> Acked-by: Andrii Nakryiko <andrii@kernel.org>
>
I will have to investigate further to find out why the improvement is
not larger.

In the meantime, here is what is happening in the JITed code.

This was the complete picture before inlining:
int cpu = bpf_get_smp_processor_id();
    mov     x10, #0xffffffffffffd4a8
    movk    x10, #0x802c, lsl #16
    movk    x10, #0x8000, lsl #32
    blr     x10 ---------------------------------------> nop
                                                         nop
                                                         adrp    x0, 0xffff800082128000
                                                         mrs     x1, tpidr_el1
                                                         add     x0, x0, #0x8
                                                         ldrsw   x0, [x0, x1]
        <----------------------------------------------- ret
    add     x7, x0, #0x0
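
The body of the helper in the listing above amounts to "address of a
per-CPU variable, plus this CPU's offset, then a 32-bit load". As a rough
user-space model of that idea only (not kernel code): the names
model_init(), model_switch_cpu(), fake_tpidr_el1 and the sizes below are
hypothetical and chosen for illustration; the real per-CPU layout is set
up by the kernel.

```c
#include <stdint.h>
#include <string.h>

#define NR_CPUS        4   /* hypothetical CPU count for this model */
#define AREA_SIZE      64  /* hypothetical size of one per-CPU area */
#define CPU_NUMBER_OFF 8   /* hypothetical offset of the variable in an area */

/* Flat backing store: CPU n owns bytes [n * AREA_SIZE, (n + 1) * AREA_SIZE). */
static unsigned char percpu_areas[NR_CPUS * AREA_SIZE];

/* Stand-in for tpidr_el1: byte offset of the current CPU's area. */
static uintptr_t fake_tpidr_el1;

/* Store each CPU's own number into its copy of the variable. */
static void model_init(void)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		int32_t v = cpu;

		memcpy(&percpu_areas[cpu * AREA_SIZE + CPU_NUMBER_OFF], &v, sizeof(v));
	}
}

/* Pretend we are now running on @cpu: only the offset register changes. */
static void model_switch_cpu(int cpu)
{
	fake_tpidr_el1 = (uintptr_t)cpu * AREA_SIZE;
}

/* base (adrp + add), per-CPU offset (mrs), then a 32-bit load (ldrsw). */
static int32_t model_get_smp_processor_id(void)
{
	const unsigned char *base = &percpu_areas[CPU_NUMBER_OFF];
	int32_t v;

	memcpy(&v, base + fake_tpidr_el1, sizeof(v));
	return v;
}
```

Calling model_init() once, then model_switch_cpu(n) followed by
model_get_smp_processor_id(), returns n in this model: the base address
never changes, only the per-CPU offset does.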
Now we have:
int cpu = bpf_get_smp_processor_id();
    mov     x7, #0xffff8000ffffffff
    movk    x7, #0x8212, lsl #16
    movk    x7, #0x8008
    mrs     x10, tpidr_el1
    add     x7, x7, x10
    ldr     w7, [x7]
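
As a sanity check on the two listings, the mov/movk sequences can be
evaluated by hand or with a small C sketch like the one below. The movk()
function only models the architectural effect of MOVK (replace one 16-bit
field, keep the rest); the function names are made up for this
illustration and are not part of the JIT.

```c
#include <stdint.h>

/* MOVK: overwrite the 16-bit field at @shift, keep all other bits of @reg. */
static uint64_t movk(uint64_t reg, uint16_t imm16, unsigned int shift)
{
	uint64_t mask = (uint64_t)0xffff << shift;

	return (reg & ~mask) | ((uint64_t)imm16 << shift);
}

/* Immediate built in x7 by the inlined sequence. */
static uint64_t inlined_base_addr(void)
{
	uint64_t x7 = 0xffff8000ffffffffULL;	/* mov  x7, #0xffff8000ffffffff */

	x7 = movk(x7, 0x8212, 16);		/* movk x7, #0x8212, lsl #16 */
	x7 = movk(x7, 0x8008, 0);		/* movk x7, #0x8008 */
	return x7;
}

/* Immediate built in x10 (the blr target) before inlining. */
static uint64_t helper_call_target(void)
{
	uint64_t x10 = 0xffffffffffffd4a8ULL;	/* mov  x10, #0xffffffffffffd4a8 */

	x10 = movk(x10, 0x802c, 16);		/* movk x10, #0x802c, lsl #16 */
	x10 = movk(x10, 0x8000, 32);		/* movk x10, #0x8000, lsl #32 */
	return x10;
}
```

inlined_base_addr() evaluates to 0xffff800082128008, which is the same
address the helper forms with adrp 0xffff800082128000 followed by
add #0x8, so both versions load from the same per-CPU variable.
helper_call_target() evaluates to 0xffff8000802cd4a8.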

So we have removed several instructions, including a branch and a
return, and I was expecting to see more improvement. These numbers were
taken in a KVM-based virtual machine; maybe running the benchmark on
bare metal would show a larger improvement?

Thanks,
Puranjay