From: Puranjay Mohan <puranjay@kernel.org>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
Will Deacon <will@kernel.org>,
Alexei Starovoitov <ast@kernel.org>,
Daniel Borkmann <daniel@iogearbox.net>,
Andrii Nakryiko <andrii@kernel.org>,
Martin KaFai Lau <martin.lau@linux.dev>,
Eduard Zingerman <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
Yonghong Song <yonghong.song@linux.dev>,
John Fastabend <john.fastabend@gmail.com>,
KP Singh <kpsingh@kernel.org>,
Stanislav Fomichev <sdf@google.com>, Hao Luo <haoluo@google.com>,
Jiri Olsa <jolsa@kernel.org>, Zi Shen Lim <zlim.lnx@gmail.com>,
Xu Kuohai <xukuohai@huawei.com>,
Florent Revest <revest@chromium.org>,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next v3 1/2] arm64, bpf: add internal-only MOV instruction to resolve per-CPU addrs
Date: Tue, 30 Apr 2024 18:30:21 +0000
Message-ID: <mb61p34r23dqa.fsf@kernel.org>
In-Reply-To: <CAEf4BzZejgfw=GiX_LTWVupRzrKVaX5Ky6L3wziSoquEFUju2w@mail.gmail.com>
Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
> On Fri, Apr 26, 2024 at 9:55 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>> > On Fri, Apr 26, 2024 at 5:14 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>> >>
>> >> From: Puranjay Mohan <puranjay12@gmail.com>
>> >>
>> >> Support an instruction for resolving absolute addresses of per-CPU
>> >> data from their per-CPU offsets. This instruction is internal-only and
>> >> users are not allowed to use them directly. They will only be used for
>> >> internal inlining optimizations for now between BPF verifier and BPF
>> >> JITs.
>> >>
>> >> Since commit 7158627686f0 ("arm64: percpu: implement optimised pcpu
>> >> access using tpidr_el1"), the per-cpu offset for the CPU is stored in
>> >> the tpidr_el1/2 register of that CPU.
>> >>
>> >> To support this BPF instruction in the ARM64 JIT, the following ARM64
>> >> instructions are emitted:
>> >>
>> >> mov dst, src // Move src to dst, if src != dst
>> >> mrs tmp, tpidr_el1/2 // Move per-cpu offset of the current cpu in tmp.
>> >> add dst, dst, tmp // Add the per cpu offset to the dst.
>> >>
>> >> To measure the performance improvement provided by this change, the
>> >> benchmark in [1] was used:
>> >>
>> >> Before:
>> >> glob-arr-inc : 23.597 ± 0.012M/s
>> >> arr-inc : 23.173 ± 0.019M/s
>> >> hash-inc : 12.186 ± 0.028M/s
>> >>
>> >> After:
>> >> glob-arr-inc : 23.819 ± 0.034M/s
>> >> arr-inc : 23.285 ± 0.017M/s
>> >
>> > I still expected a better improvement (global-arr-inc's results
>> > improved more than arr-inc, which is completely different from
>> > x86-64), but it's still a good thing to support this for arm64, of
>> > course.
>> >
>> > ack for generic parts I can understand:
>> >
>> > Acked-by: Andrii Nakryiko <andrii@kernel.org>
>> >
>>
>> I will have to do more research to find why we don't see very high
>> improvement.
>>
>> But this is what is happening here:
>>
>> This was the complete picture before inlining:
>>
>> int cpu = bpf_get_smp_processor_id();
>> mov x10, #0xffffffffffffd4a8
>> movk x10, #0x802c, lsl #16
>> movk x10, #0x8000, lsl #32
>> blr x10 ---------------------------------------> nop
>> nop
>> adrp x0, 0xffff800082128000
>> mrs x1, tpidr_el1
>> add x0, x0, #0x8
>> ldrsw x0, [x0, x1]
>> <----------------------------------------ret
>> add x7, x0, #0x0
>>
>>
>> Now we have:
>>
>> int cpu = bpf_get_smp_processor_id();
>> mov x7, #0xffff8000ffffffff
>> movk x7, #0x8212, lsl #16
>> movk x7, #0x8008
>> mrs x10, tpidr_el1
>> add x7, x7, x10
>> ldr w7, [x7]
>>
>>
>> So, we have removed multiple instructions including a branch and a
>> return. I was expecting to see more improvement. This benchmark is taken
>> from a KVM based virtual machine, maybe if I do it on bare-metal I would
>> see more improvement ?
>
> I see, yeah, I think it might change significantly. I remember back
> from times when I was benchmarking BPF ringbuf, I was getting
> very-very different results from inside QEMU vs bare metal. And I
> don't mean just in absolute numbers. QEMU/KVM seems to change a lot of
> things when it comes to contentions, atomic instructions, etc, etc.
> Anyways, for benchmarking, always try to do bare metal.
>
I found the solution to this. I am seeing much better performance when
implementing this inlining in the JIT through another method, similar to
what I did for riscv, see [1].

[1] https://lore.kernel.org/all/20240430175834.33152-3-puranjay@kernel.org/

I will do the same for ARM64 in V5 of this series.

Thanks,
Puranjay
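
For readers following the disassembly quoted above, here is a minimal
userspace sketch in C of what the inlined sequence computes. It only
emulates the arithmetic: MOVK replaces one 16-bit field of a register,
and the final ADD adds the per-CPU offset (read from tpidr_el1 on real
hardware) to the base address. The function names movk(),
resolve_percpu() and self_check() are made up for illustration, and the
0x1000 offset is an arbitrary stand-in, not a value from the thread.

```c
#include <assert.h>
#include <stdint.h>

/* Emulate arm64 MOVK: overwrite one 16-bit field, keep the other bits. */
static inline uint64_t movk(uint64_t reg, uint16_t imm16, unsigned shift)
{
	uint64_t mask = (uint64_t)0xffff << shift;

	return (reg & ~mask) | ((uint64_t)imm16 << shift);
}

/*
 * Emulate "add dst, dst, tmp", where tmp would hold the current CPU's
 * per-CPU offset. Here the offset is passed in by the caller.
 */
static inline uint64_t resolve_percpu(uint64_t base, uint64_t cpu_offset)
{
	return base + cpu_offset;
}

/* Replay the constants from the quoted "Now we have" listing. */
static inline void self_check(void)
{
	uint64_t x7 = 0xffff8000ffffffffULL;	/* mov  x7, #0xffff8000ffffffff */

	x7 = movk(x7, 0x8212, 16);		/* movk x7, #0x8212, lsl #16 */
	x7 = movk(x7, 0x8008, 0);		/* movk x7, #0x8008 */

	/* Same address the old code built with adrp + add #0x8. */
	assert(x7 == 0xffff800082128000ULL + 0x8);

	/* Hypothetical per-CPU offset, chosen only for the example. */
	assert(resolve_percpu(x7, 0x1000) == 0xffff800082129008ULL);
}
```

Calling self_check() from a small main() is enough to try it out. The
first assertion shows that the mov/movk sequence materializes the same
base address as the adrp/add pair in the helper, so the two listings
load from the same per-CPU variable.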