bpf.vger.kernel.org archive mirror
* [PATCH bpf-next] riscv, bpf: add internal-only MOV instruction to resolve per-CPU addrs
@ 2024-04-05 12:43 Puranjay Mohan
  2024-04-05 14:16 ` Björn Töpel
  2024-04-05 18:04 ` Andrii Nakryiko
  0 siblings, 2 replies; 6+ messages in thread
From: Puranjay Mohan @ 2024-04-05 12:43 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Björn Töpel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	bpf, linux-riscv, linux-kernel, Pu Lehui
  Cc: puranjay12

Support an instruction for resolving absolute addresses of per-CPU
data from their per-CPU offsets. This instruction is internal-only;
users are not allowed to use it directly. For now, it will only be
used for internal inlining optimizations between the BPF verifier and
the BPF JITs.

RISC-V uses the generic per-cpu implementation, where the offsets for
CPUs are kept in an array called __per_cpu_offset[cpu_number]. RISC-V
stores the address of the task_struct in the TP register. The first
element in tast_struct is struct thread_info, and we can get the cpu
number by reading from the TP register + offsetof(struct thread_info,
cpu).

Once we have the cpu number in a register, we read the offset for that
cpu from the address &__per_cpu_offset + (cpu_number << 3). Then we add
this offset to the destination register.
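
As an illustration, the address calculation being inlined is roughly
equivalent to the following C sketch (illustrative only;
current_task_pointer() is a made-up stand-in for reading the TP
register, and the cpu field is read at full register width, as in the
patch below):

	/* Sketch of the per-CPU address resolution; not actual JIT code. */
	static unsigned long percpu_addr(unsigned long percpu_off_addr)
	{
		unsigned long tp = current_task_pointer();	/* hypothetical: value of TP */
		unsigned long cpu, off;

		/* struct thread_info is the first member of task_struct. */
		cpu = *(unsigned long *)(tp + offsetof(struct thread_info, cpu));
		/* __per_cpu_offset[] holds one 8-byte entry per CPU, hence << 3. */
		off = *(unsigned long *)((unsigned long)__per_cpu_offset + (cpu << 3));

		return percpu_off_addr + off;	/* absolute per-CPU address */
	}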

To measure the improvement from this change, the benchmark in [1] was
run on QEMU:

Before:
glob-arr-inc   :    1.127 ± 0.013M/s
arr-inc        :    1.121 ± 0.004M/s
hash-inc       :    0.681 ± 0.052M/s

After:
glob-arr-inc   :    1.138 ± 0.011M/s
arr-inc        :    1.366 ± 0.006M/s
hash-inc       :    0.676 ± 0.001M/s

[1] https://github.com/anakryiko/linux/commit/8dec900975ef

Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
---
 arch/riscv/net/bpf_jit_comp64.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
index 15e482f2c657..e95bd1d459a4 100644
--- a/arch/riscv/net/bpf_jit_comp64.c
+++ b/arch/riscv/net/bpf_jit_comp64.c
@@ -12,6 +12,7 @@
 #include <linux/stop_machine.h>
 #include <asm/patch.h>
 #include <asm/cfi.h>
+#include <asm/percpu.h>
 #include "bpf_jit.h"
 
 #define RV_FENTRY_NINSNS 2
@@ -1089,6 +1090,24 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
 			emit_or(RV_REG_T1, rd, RV_REG_T1, ctx);
 			emit_mv(rd, RV_REG_T1, ctx);
 			break;
+		} else if (insn_is_mov_percpu_addr(insn)) {
+			if (rd != rs)
+				emit_mv(rd, rs, ctx);
+#ifdef CONFIG_SMP
+				/* Load current CPU number in T1 */
+				emit_ld(RV_REG_T1, offsetof(struct thread_info, cpu), RV_REG_TP,
+					ctx);
+				/* << 3 because offsets are 8 bytes */
+				emit_slli(RV_REG_T1, RV_REG_T1, 3, ctx);
+				/* Load address of __per_cpu_offset array in T2 */
+				emit_imm(RV_REG_T2, (u64)&__per_cpu_offset, ctx);
+				/* Add offset of current CPU to  __per_cpu_offset */
+				emit_add(RV_REG_T1, RV_REG_T2, RV_REG_T1, ctx);
+				/* Load __per_cpu_offset[cpu] in T1 */
+				emit_ld(RV_REG_T1, 0, RV_REG_T1, ctx);
+				/* Add the offset to Rd */
+				emit_add(rd, rd, RV_REG_T1, ctx);
+#endif
 		}
 		if (imm == 1) {
 			/* Special mov32 for zext */
@@ -2038,3 +2057,8 @@ bool bpf_jit_supports_arena(void)
 {
 	return true;
 }
+
+bool bpf_jit_supports_percpu_insn(void)
+{
+	return true;
+}
-- 
2.40.1



* Re: [PATCH bpf-next] riscv, bpf: add internal-only MOV instruction to resolve per-CPU addrs
  2024-04-05 12:43 [PATCH bpf-next] riscv, bpf: add internal-only MOV instruction to resolve per-CPU addrs Puranjay Mohan
@ 2024-04-05 14:16 ` Björn Töpel
  2024-04-05 14:33   ` Puranjay Mohan
  2024-04-05 18:04 ` Andrii Nakryiko
  1 sibling, 1 reply; 6+ messages in thread
From: Björn Töpel @ 2024-04-05 14:16 UTC (permalink / raw)
  To: Puranjay Mohan, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Paul Walmsley, Palmer Dabbelt, Albert Ou, bpf,
	linux-riscv, linux-kernel, Pu Lehui
  Cc: puranjay12

Puranjay Mohan <puranjay12@gmail.com> writes:

> Support an instruction for resolving absolute addresses of per-CPU
> data from their per-CPU offsets. This instruction is internal-only and
> users are not allowed to use them directly. They will only be used for
> internal inlining optimizations for now between BPF verifier and BPF
> JITs.
>
> RISC-V uses generic per-cpu implementation where the offsets for CPUs
> are kept in an array called __per_cpu_offset[cpu_number]. RISCV stores
> the address of the task_struct in TP register. The first element in
> tast_struct is struct thread_info, and we can get the cpu number by
     ^
     k ;-)
> reading from the TP register + offsetof(struct thread_info, cpu).
>
> Once we have the cpu number in a register we read the offset for that
> cpu from address: &__per_cpu_offset + cpu_number << 3. Then we add this
> offset to the destination register.

Just to clarify for readers: BPF programs run with migration disabled,
which means that on RT we can be preempted, which means that per-cpu
operations are trickier (they require disabling interrupts/preemption).

However, this BPF instruction is only about calculating the per-cpu
address, so the lookup can be inlined.

It's not a per-cpu *operation*.
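
(For context, the kind of access that benefits is a plain per-CPU map
increment, as exercised by the benchmark in [1]. A minimal,
illustrative BPF program follows; the map, section, and function names
are made up for the example:)

	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	struct {
		__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
		__uint(max_entries, 1);
		__type(key, __u32);
		__type(value, __u64);
	} counters SEC(".maps");

	SEC("tp/syscalls/sys_enter_getpid")
	int count_enter(void *ctx)
	{
		__u32 key = 0;
		__u64 *val = bpf_map_lookup_elem(&counters, &key);

		if (val)	/* migration is disabled, so this CPU's slot stays stable */
			__sync_fetch_and_add(val, 1);
		return 0;
	}

	char LICENSE[] SEC("license") = "GPL";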

> To measure the improvement from this change, the benchmark in [1] was
> used on Qemu:
>
> Before:
> glob-arr-inc   :    1.127 ± 0.013M/s
> arr-inc        :    1.121 ± 0.004M/s
> hash-inc       :    0.681 ± 0.052M/s
>
> After:
> glob-arr-inc   :    1.138 ± 0.011M/s
> arr-inc        :    1.366 ± 0.006M/s
> hash-inc       :    0.676 ± 0.001M/s
>
> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
>
> Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
> ---
>  arch/riscv/net/bpf_jit_comp64.c | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
>
> diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
> index 15e482f2c657..e95bd1d459a4 100644
> --- a/arch/riscv/net/bpf_jit_comp64.c
> +++ b/arch/riscv/net/bpf_jit_comp64.c
> @@ -12,6 +12,7 @@
>  #include <linux/stop_machine.h>
>  #include <asm/patch.h>
>  #include <asm/cfi.h>
> +#include <asm/percpu.h>
>  #include "bpf_jit.h"
>  
>  #define RV_FENTRY_NINSNS 2
> @@ -1089,6 +1090,24 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
>  			emit_or(RV_REG_T1, rd, RV_REG_T1, ctx);
>  			emit_mv(rd, RV_REG_T1, ctx);
>  			break;
> +		} else if (insn_is_mov_percpu_addr(insn)) {
> +			if (rd != rs)
> +				emit_mv(rd, rs, ctx);
> +#ifdef CONFIG_SMP
> +				/* Load current CPU number in T1 */
> +				emit_ld(RV_REG_T1, offsetof(struct thread_info, cpu), RV_REG_TP,
> +					ctx);
> +				/* << 3 because offsets are 8 bytes */
> +				emit_slli(RV_REG_T1, RV_REG_T1, 3, ctx);
> +				/* Load address of __per_cpu_offset array in T2 */
> +				emit_imm(RV_REG_T2, (u64)&__per_cpu_offset, ctx);

Did you try using emit_addr() here? I'd guess that'll be fewer
instructions, no?


Björn


* Re: [PATCH bpf-next] riscv, bpf: add internal-only MOV instruction to resolve per-CPU addrs
  2024-04-05 14:16 ` Björn Töpel
@ 2024-04-05 14:33   ` Puranjay Mohan
  0 siblings, 0 replies; 6+ messages in thread
From: Puranjay Mohan @ 2024-04-05 14:33 UTC (permalink / raw)
  To: Björn Töpel, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Paul Walmsley, Palmer Dabbelt, Albert Ou, bpf,
	linux-riscv, linux-kernel, Pu Lehui

Björn Töpel <bjorn@kernel.org> writes:

> Puranjay Mohan <puranjay12@gmail.com> writes:
>
>> Support an instruction for resolving absolute addresses of per-CPU
>> data from their per-CPU offsets. This instruction is internal-only and
>> users are not allowed to use them directly. They will only be used for
>> internal inlining optimizations for now between BPF verifier and BPF
>> JITs.
>>
>> RISC-V uses generic per-cpu implementation where the offsets for CPUs
>> are kept in an array called __per_cpu_offset[cpu_number]. RISCV stores
>> the address of the task_struct in TP register. The first element in
>> tast_struct is struct thread_info, and we can get the cpu number by
>      ^
>      k ;-)

I need to start using some kind of spell-check in vim :D.

>> reading from the TP register + offsetof(struct thread_info, cpu).
>>
>> Once we have the cpu number in a register we read the offset for that
>> cpu from address: &__per_cpu_offset + cpu_number << 3. Then we add this
>> offset to the destination register.
>
> Just to clarify for readers; BPF programs are run with migrate disable,
> which means that on RT we can be preempted, which means that per-cpu
> operations are trickier (disabling interrupts/preemption).
>
> However, this BPF instruction is about calculating the per-cpu address,
> so the look up can be inlined.
>
> It's not a per-cpu *operation*.

Will add this information to the commit message.

>> To measure the improvement from this change, the benchmark in [1] was
>> used on Qemu:
>>
>> Before:
>> glob-arr-inc   :    1.127 ± 0.013M/s
>> arr-inc        :    1.121 ± 0.004M/s
>> hash-inc       :    0.681 ± 0.052M/s
>>
>> After:
>> glob-arr-inc   :    1.138 ± 0.011M/s
>> arr-inc        :    1.366 ± 0.006M/s
>> hash-inc       :    0.676 ± 0.001M/s
>>
>> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
>>
>> Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
>> ---
>>  arch/riscv/net/bpf_jit_comp64.c | 24 ++++++++++++++++++++++++
>>  1 file changed, 24 insertions(+)
>>
>> diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
>> index 15e482f2c657..e95bd1d459a4 100644
>> --- a/arch/riscv/net/bpf_jit_comp64.c
>> +++ b/arch/riscv/net/bpf_jit_comp64.c
>> @@ -12,6 +12,7 @@
>>  #include <linux/stop_machine.h>
>>  #include <asm/patch.h>
>>  #include <asm/cfi.h>
>> +#include <asm/percpu.h>
>>  #include "bpf_jit.h"
>>  
>>  #define RV_FENTRY_NINSNS 2
>> @@ -1089,6 +1090,24 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
>>  			emit_or(RV_REG_T1, rd, RV_REG_T1, ctx);
>>  			emit_mv(rd, RV_REG_T1, ctx);
>>  			break;
>> +		} else if (insn_is_mov_percpu_addr(insn)) {
>> +			if (rd != rs)
>> +				emit_mv(rd, rs, ctx);
>> +#ifdef CONFIG_SMP
>> +				/* Load current CPU number in T1 */
>> +				emit_ld(RV_REG_T1, offsetof(struct thread_info, cpu), RV_REG_TP,
>> +					ctx);
>> +				/* << 3 because offsets are 8 bytes */
>> +				emit_slli(RV_REG_T1, RV_REG_T1, 3, ctx);
>> +				/* Load address of __per_cpu_offset array in T2 */
>> +				emit_imm(RV_REG_T2, (u64)&__per_cpu_offset, ctx);
>
> Did you try using emit_addr() here? I'd guess that'll be fewer
> instructions, no?

Yes, I should have used that; the address would always be in range for
auipc+addi, right? I will try it and see.

Thanks,
Puranjay


* Re: [PATCH bpf-next] riscv, bpf: add internal-only MOV instruction to resolve per-CPU addrs
  2024-04-05 12:43 [PATCH bpf-next] riscv, bpf: add internal-only MOV instruction to resolve per-CPU addrs Puranjay Mohan
  2024-04-05 14:16 ` Björn Töpel
@ 2024-04-05 18:04 ` Andrii Nakryiko
  2024-04-08  7:40   ` Björn Töpel
  1 sibling, 1 reply; 6+ messages in thread
From: Andrii Nakryiko @ 2024-04-05 18:04 UTC (permalink / raw)
  To: Puranjay Mohan
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Björn Töpel, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	bpf, linux-riscv, linux-kernel, Pu Lehui

On Fri, Apr 5, 2024 at 5:44 AM Puranjay Mohan <puranjay12@gmail.com> wrote:
>
> Support an instruction for resolving absolute addresses of per-CPU
> data from their per-CPU offsets. This instruction is internal-only and
> users are not allowed to use them directly. They will only be used for
> internal inlining optimizations for now between BPF verifier and BPF
> JITs.
>
> RISC-V uses generic per-cpu implementation where the offsets for CPUs
> are kept in an array called __per_cpu_offset[cpu_number]. RISCV stores
> the address of the task_struct in TP register. The first element in
> tast_struct is struct thread_info, and we can get the cpu number by
> reading from the TP register + offsetof(struct thread_info, cpu).
>
> Once we have the cpu number in a register we read the offset for that
> cpu from address: &__per_cpu_offset + cpu_number << 3. Then we add this
> offset to the destination register.
>
> To measure the improvement from this change, the benchmark in [1] was
> used on Qemu:
>
> Before:
> glob-arr-inc   :    1.127 ± 0.013M/s
> arr-inc        :    1.121 ± 0.004M/s
> hash-inc       :    0.681 ± 0.052M/s
>
> After:
> glob-arr-inc   :    1.138 ± 0.011M/s
> arr-inc        :    1.366 ± 0.006M/s
> hash-inc       :    0.676 ± 0.001M/s

TBH, I don't trust benchmarks done inside QEMU. Can you try running
this on some real hardware?

>
> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
>
> Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
> ---
>  arch/riscv/net/bpf_jit_comp64.c | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
>
> diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
> index 15e482f2c657..e95bd1d459a4 100644
> --- a/arch/riscv/net/bpf_jit_comp64.c
> +++ b/arch/riscv/net/bpf_jit_comp64.c
> @@ -12,6 +12,7 @@
>  #include <linux/stop_machine.h>
>  #include <asm/patch.h>
>  #include <asm/cfi.h>
> +#include <asm/percpu.h>
>  #include "bpf_jit.h"
>
>  #define RV_FENTRY_NINSNS 2
> @@ -1089,6 +1090,24 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
>                         emit_or(RV_REG_T1, rd, RV_REG_T1, ctx);
>                         emit_mv(rd, RV_REG_T1, ctx);
>                         break;
> +               } else if (insn_is_mov_percpu_addr(insn)) {
> +                       if (rd != rs)
> +                               emit_mv(rd, rs, ctx);

Is this an unconditional move instruction? In x86-64, EMIT_mov checks
whether source and destination registers are the same and doesn't emit
anything if they match (which makes sense, right)?

> +#ifdef CONFIG_SMP
> +                               /* Load current CPU number in T1 */
> +                               emit_ld(RV_REG_T1, offsetof(struct thread_info, cpu), RV_REG_TP,
> +                                       ctx);

nit: maybe keep this on the same line?

> +                               /* << 3 because offsets are 8 bytes */
> +                               emit_slli(RV_REG_T1, RV_REG_T1, 3, ctx);
> +                               /* Load address of __per_cpu_offset array in T2 */
> +                               emit_imm(RV_REG_T2, (u64)&__per_cpu_offset, ctx);
> +                               /* Add offset of current CPU to  __per_cpu_offset */
> +                               emit_add(RV_REG_T1, RV_REG_T2, RV_REG_T1, ctx);
> +                               /* Load __per_cpu_offset[cpu] in T1 */
> +                               emit_ld(RV_REG_T1, 0, RV_REG_T1, ctx);
> +                               /* Add the offset to Rd */
> +                               emit_add(rd, rd, RV_REG_T1, ctx);
> +#endif
>                 }
>                 if (imm == 1) {
>                         /* Special mov32 for zext */
> @@ -2038,3 +2057,8 @@ bool bpf_jit_supports_arena(void)
>  {
>         return true;
>  }
> +
> +bool bpf_jit_supports_percpu_insn(void)
> +{
> +       return true;
> +}
> --
> 2.40.1
>


* Re: [PATCH bpf-next] riscv, bpf: add internal-only MOV instruction to resolve per-CPU addrs
  2024-04-05 18:04 ` Andrii Nakryiko
@ 2024-04-08  7:40   ` Björn Töpel
  2024-04-18 22:24     ` Andrii Nakryiko
  0 siblings, 1 reply; 6+ messages in thread
From: Björn Töpel @ 2024-04-08  7:40 UTC (permalink / raw)
  To: Andrii Nakryiko, Puranjay Mohan
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	Paul Walmsley, Palmer Dabbelt, Albert Ou, bpf, linux-riscv,
	linux-kernel, Pu Lehui

Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:

> On Fri, Apr 5, 2024 at 5:44 AM Puranjay Mohan <puranjay12@gmail.com> wrote:
>>
>> Support an instruction for resolving absolute addresses of per-CPU
>> data from their per-CPU offsets. This instruction is internal-only and
>> users are not allowed to use them directly. They will only be used for
>> internal inlining optimizations for now between BPF verifier and BPF
>> JITs.
>>
>> RISC-V uses generic per-cpu implementation where the offsets for CPUs
>> are kept in an array called __per_cpu_offset[cpu_number]. RISCV stores
>> the address of the task_struct in TP register. The first element in
>> tast_struct is struct thread_info, and we can get the cpu number by
>> reading from the TP register + offsetof(struct thread_info, cpu).
>>
>> Once we have the cpu number in a register we read the offset for that
>> cpu from address: &__per_cpu_offset + cpu_number << 3. Then we add this
>> offset to the destination register.
>>
>> To measure the improvement from this change, the benchmark in [1] was
>> used on Qemu:
>>
>> Before:
>> glob-arr-inc   :    1.127 ± 0.013M/s
>> arr-inc        :    1.121 ± 0.004M/s
>> hash-inc       :    0.681 ± 0.052M/s
>>
>> After:
>> glob-arr-inc   :    1.138 ± 0.011M/s
>> arr-inc        :    1.366 ± 0.006M/s
>> hash-inc       :    0.676 ± 0.001M/s
>
> TBH, I don't trust benchmarks done inside QEMU. Can you try running
> this on some real hardware?

I just ran it on a "VisionFive2" SBC:

BEFORE
======
glob-arr-inc   :   11.586 ± 0.021M/s 
arr-inc        :   10.892 ± 0.005M/s 
hash-inc       :    1.517 ± 0.001M/s 

AFTER
=====
glob-arr-inc   :   11.893 ± 0.017M/s  (+2.6%)
arr-inc        :   11.630 ± 0.020M/s  (+6.8%)
hash-inc       :    1.543 ± 0.002M/s  (+1.7%)

(It's early, and the coffee hasn't kicked in, so I hope the
calculations are correct...)

>>
>> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
>>
>> Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
>> ---
>>  arch/riscv/net/bpf_jit_comp64.c | 24 ++++++++++++++++++++++++
>>  1 file changed, 24 insertions(+)
>>
>> diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
>> index 15e482f2c657..e95bd1d459a4 100644
>> --- a/arch/riscv/net/bpf_jit_comp64.c
>> +++ b/arch/riscv/net/bpf_jit_comp64.c
>> @@ -12,6 +12,7 @@
>>  #include <linux/stop_machine.h>
>>  #include <asm/patch.h>
>>  #include <asm/cfi.h>
>> +#include <asm/percpu.h>
>>  #include "bpf_jit.h"
>>
>>  #define RV_FENTRY_NINSNS 2
>> @@ -1089,6 +1090,24 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
>>                         emit_or(RV_REG_T1, rd, RV_REG_T1, ctx);
>>                         emit_mv(rd, RV_REG_T1, ctx);
>>                         break;
>> +               } else if (insn_is_mov_percpu_addr(insn)) {
>> +                       if (rd != rs)
>> +                               emit_mv(rd, rs, ctx);
>
> Is this an unconditional move instruction? in x86-64, EMIT_mov checks
> whether source and destination registers are the same and doesn't emit
> anything if they match (which makes sense, right)?

Yeah, it is. Folding the check into the emit sounds like a good idea.
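
A minimal sketch of folding the check into a helper (hypothetical
helper name, shown only to illustrate the idea):

	/* Emit a register move only when source and destination differ. */
	static inline void emit_mv_opt(u8 rd, u8 rs, struct rv_jit_context *ctx)
	{
		if (rd != rs)
			emit_mv(rd, rs, ctx);
	}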


Björn


* Re: [PATCH bpf-next] riscv, bpf: add internal-only MOV instruction to resolve per-CPU addrs
  2024-04-08  7:40   ` Björn Töpel
@ 2024-04-18 22:24     ` Andrii Nakryiko
  0 siblings, 0 replies; 6+ messages in thread
From: Andrii Nakryiko @ 2024-04-18 22:24 UTC (permalink / raw)
  To: Björn Töpel
  Cc: Puranjay Mohan, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, Paul Walmsley, Palmer Dabbelt, Albert Ou, bpf,
	linux-riscv, linux-kernel, Pu Lehui

On Mon, Apr 8, 2024 at 12:40 AM Björn Töpel <bjorn@kernel.org> wrote:
>
> Andrii Nakryiko <andrii.nakryiko@gmail.com> writes:
>
> > On Fri, Apr 5, 2024 at 5:44 AM Puranjay Mohan <puranjay12@gmail.com> wrote:
> >>
> >> Support an instruction for resolving absolute addresses of per-CPU
> >> data from their per-CPU offsets. This instruction is internal-only and
> >> users are not allowed to use them directly. They will only be used for
> >> internal inlining optimizations for now between BPF verifier and BPF
> >> JITs.
> >>
> >> RISC-V uses generic per-cpu implementation where the offsets for CPUs
> >> are kept in an array called __per_cpu_offset[cpu_number]. RISCV stores
> >> the address of the task_struct in TP register. The first element in
> >> tast_struct is struct thread_info, and we can get the cpu number by
> >> reading from the TP register + offsetof(struct thread_info, cpu).
> >>
> >> Once we have the cpu number in a register we read the offset for that
> >> cpu from address: &__per_cpu_offset + cpu_number << 3. Then we add this
> >> offset to the destination register.
> >>
> >> To measure the improvement from this change, the benchmark in [1] was
> >> used on Qemu:
> >>
> >> Before:
> >> glob-arr-inc   :    1.127 ± 0.013M/s
> >> arr-inc        :    1.121 ± 0.004M/s
> >> hash-inc       :    0.681 ± 0.052M/s
> >>
> >> After:
> >> glob-arr-inc   :    1.138 ± 0.011M/s
> >> arr-inc        :    1.366 ± 0.006M/s
> >> hash-inc       :    0.676 ± 0.001M/s
> >
> > TBH, I don't trust benchmarks done inside QEMU. Can you try running
> > this on some real hardware?
>
> I just ran it on a "VisionFive2" SBC:
>
> BEFORE
> ======
> glob-arr-inc   :   11.586 ± 0.021M/s
> arr-inc        :   10.892 ± 0.005M/s
> hash-inc       :    1.517 ± 0.001M/s
>
> AFTER
> =====
> glob-arr-inc   :   11.893 ± 0.017M/s  (+2.6%)
> arr-inc        :   11.630 ± 0.020M/s  (+6.8%)
> hash-inc       :    1.543 ± 0.002M/s  (+1.7%)
>

Nice, looks pretty reasonable (and especially if
bpf_get_smp_processor_id() gets inlined as well, the numbers should be
even better).


> (It's early, and the coffee haven't kicked in, so I hope the
> calculations are correct...)
>
> >>
> >> [1] https://github.com/anakryiko/linux/commit/8dec900975ef
> >>
> >> Signed-off-by: Puranjay Mohan <puranjay12@gmail.com>
> >> ---
> >>  arch/riscv/net/bpf_jit_comp64.c | 24 ++++++++++++++++++++++++
> >>  1 file changed, 24 insertions(+)
> >>
> >> diff --git a/arch/riscv/net/bpf_jit_comp64.c b/arch/riscv/net/bpf_jit_comp64.c
> >> index 15e482f2c657..e95bd1d459a4 100644
> >> --- a/arch/riscv/net/bpf_jit_comp64.c
> >> +++ b/arch/riscv/net/bpf_jit_comp64.c
> >> @@ -12,6 +12,7 @@
> >>  #include <linux/stop_machine.h>
> >>  #include <asm/patch.h>
> >>  #include <asm/cfi.h>
> >> +#include <asm/percpu.h>
> >>  #include "bpf_jit.h"
> >>
> >>  #define RV_FENTRY_NINSNS 2
> >> @@ -1089,6 +1090,24 @@ int bpf_jit_emit_insn(const struct bpf_insn *insn, struct rv_jit_context *ctx,
> >>                         emit_or(RV_REG_T1, rd, RV_REG_T1, ctx);
> >>                         emit_mv(rd, RV_REG_T1, ctx);
> >>                         break;
> >> +               } else if (insn_is_mov_percpu_addr(insn)) {
> >> +                       if (rd != rs)
> >> +                               emit_mv(rd, rs, ctx);
> >
> > Is this an unconditional move instruction? in x86-64, EMIT_mov checks
> > whether source and destination registers are the same and doesn't emit
> > anything if they match (which makes sense, right)?
>
> Yeah, it is. Folding the check into the emit sounds like a good idea.
>

great

>
> Björn


