From: Puranjay Mohan <puranjay@kernel.org>
To: Andrii Nakryiko <andrii.nakryiko@gmail.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	Will Deacon <will@kernel.org>,
	Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Eduard Zingerman <eddyz87@gmail.com>, Song Liu <song@kernel.org>,
	Yonghong Song <yonghong.song@linux.dev>,
	John Fastabend <john.fastabend@gmail.com>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Hao Luo <haoluo@google.com>,
	Jiri Olsa <jolsa@kernel.org>, Zi Shen Lim <zlim.lnx@gmail.com>,
	Xu Kuohai <xukuohai@huawei.com>,
	Florent Revest <revest@chromium.org>,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org, bpf@vger.kernel.org
Subject: Re: [PATCH bpf-next v3 1/2] arm64, bpf: add internal-only MOV instruction to resolve per-CPU addrs
Date: Tue, 30 Apr 2024 18:30:21 +0000
Message-ID: <mb61p34r23dqa.fsf@kernel.org>
In-Reply-To: <CAEf4BzZejgfw=GiX_LTWVupRzrKVaX5Ky6L3wziSoquEFUju2w@mail.gmail.com>

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Fri, Apr 26, 2024 at 9:55 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>>
>> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>>
>> > On Fri, Apr 26, 2024 at 5:14 AM Puranjay Mohan <puranjay@kernel.org> wrote:
>> >>
>> >> From: Puranjay Mohan <puranjay12@gmail.com>
>> >>
>> >> Support an instruction for resolving absolute addresses of per-CPU
>> >> data from their per-CPU offsets. This instruction is internal-only and
>> >> users are not allowed to use it directly. It will only be used for
>> >> internal inlining optimizations for now between BPF verifier and BPF
>> >> JITs.
>> >>
>> >> Since commit 7158627686f0 ("arm64: percpu: implement optimised pcpu
>> >> access using tpidr_el1"), the per-cpu offset for the CPU is stored in
>> >> the tpidr_el1/2 register of that CPU.
>> >>
>> >> To support this BPF instruction in the ARM64 JIT, the following ARM64
>> >> instructions are emitted:
>> >>
>> >> mov dst, src            // Move src to dst, if src != dst
>> >> mrs tmp, tpidr_el1/2    // Move per-cpu offset of the current cpu in tmp.
>> >> add dst, dst, tmp       // Add the per cpu offset to the dst.
>> >>
>> >> To measure the performance improvement provided by this change, the
>> >> benchmark in [1] was used:
>> >>
>> >> Before:
>> >> glob-arr-inc   :   23.597 ± 0.012M/s
>> >> arr-inc        :   23.173 ± 0.019M/s
>> >> hash-inc       :   12.186 ± 0.028M/s
>> >>
>> >> After:
>> >> glob-arr-inc   :   23.819 ± 0.034M/s
>> >> arr-inc        :   23.285 ± 0.017M/s
>> >
>> > I still expected a better improvement (global-arr-inc's results
>> > improved more than arr-inc, which is completely different from
>> > x86-64), but it's still a good thing to support this for arm64, of
>> > course.
>> >
>> > ack for generic parts I can understand:
>> >
>> > Acked-by: Andrii Nakryiko <andrii@kernel.org>
>> >
>>
>> I will have to do more research to find why we don't see very high
>> improvement.
>>
>> But this is what is happening here:
>>
>> This was the complete picture before inlining:
>>
>> int cpu = bpf_get_smp_processor_id();
>> mov     x10, #0xffffffffffffd4a8
>> movk    x10, #0x802c, lsl #16
>> movk    x10, #0x8000, lsl #32
>> blr     x10 ---------------------------------------> nop
>>                                                      nop
>>                                                      adrp    x0, 0xffff800082128000
>>                                                      mrs     x1, tpidr_el1
>>                                                      add     x0, x0, #0x8
>>                                                      ldrsw   x0, [x0, x1]
>>             <----------------------------------------ret
>> add     x7, x0, #0x0
>>
>>
>> Now we have:
>>
>> int cpu = bpf_get_smp_processor_id();
>> mov     x7, #0xffff8000ffffffff
>> movk    x7, #0x8212, lsl #16
>> movk    x7, #0x8008
>> mrs     x10, tpidr_el1
>> add     x7, x7, x10
>> ldr     w7, [x7]
>>
>>
>> So, we have removed multiple instructions, including a branch and a
>> return. I was expecting to see more improvement. This benchmark was run
>> in a KVM-based virtual machine; maybe if I run it on bare metal I would
>> see more improvement?
>
> I see, yeah, I think it might change significantly. I remember back
> from times when I was benchmarking BPF ringbuf, I was getting
> very-very different results from inside QEMU vs bare metal. And I
> don't mean just in absolute numbers. QEMU/KVM seems to change a lot of
> things when it comes to contentions, atomic instructions, etc, etc.
> Anyways, for benchmarking, always try to do bare metal.
>

I found the solution to this. I am seeing much better performance when
implementing this inlining in the JIT through another method, similar to
what I did for RISC-V; see [1].

[1] https://lore.kernel.org/all/20240430175834.33152-3-puranjay@kernel.org/

Will do the same for ARM64 in V5 of this series.

Thanks,
Puranjay
