BPF List
* [PATCH v3 bpf-next 0/2] Inline bpf_get_branch_snapshot() BPF helper
@ 2024-04-04  0:26 Andrii Nakryiko
  2024-04-04  0:26 ` [PATCH v3 bpf-next 1/2] bpf: make bpf_get_branch_snapshot() architecture-agnostic Andrii Nakryiko
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Andrii Nakryiko @ 2024-04-04  0:26 UTC (permalink / raw)
  To: bpf, ast, daniel, martin.lau; +Cc: andrii, kernel-team

Implement inlining of the bpf_get_branch_snapshot() BPF helper using a generic
BPF assembly approach. This makes it possible to reduce LBR record usage right
before LBR records are captured from inside the BPF program.

See v1 cover letter ([0]) for some visual examples. I dropped them from v2
because there are multiple independent changes landing and being reviewed, all
of which remove different parts of LBR record waste, so presenting final state
of LBR "waste" gets more complicated until all of the pieces land.

  [0] https://lore.kernel.org/bpf/20240321180501.734779-1-andrii@kernel.org/

v2->v3:
  - fix BPF_MUL instruction definition;
v1->v2:
  - inlining of bpf_get_smp_processor_id() split out into a separate patch set
    implementing an internal per-CPU BPF instruction;
  - add efficient divide-by-24 logic through multiplication, and leave
    comments explaining the idea behind it (a quick sanity check of the math
    is sketched after this cover letter); this way the inlined version of
    bpf_get_branch_snapshot() has no compromises compared to the non-inlined
    version of the helper (Alexei).

Andrii Nakryiko (2):
  bpf: make bpf_get_branch_snapshot() architecture-agnostic
  bpf: inline bpf_get_branch_snapshot() helper

 kernel/bpf/verifier.c    | 55 ++++++++++++++++++++++++++++++++++++++++
 kernel/trace/bpf_trace.c |  4 ---
 2 files changed, 55 insertions(+), 4 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 12+ messages in thread
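
A quick sanity check of the divide-by-24 logic mentioned in the changelog
above: the constant and shift below are taken from the comment in patch 2/2;
everything else is a throwaway userspace sketch, not part of the series.

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
  	/* cnt = size / 24, computed as (size * 0xaaaaaaab) >> 36:
  	 * the multiplication approximates division by 3, and the extra
  	 * 3 bits of right shift divide by 8, giving /24 overall
  	 */
  	for (uint64_t size = 0; size <= 0xffffffffULL; size++) {
  		uint64_t fast = (size * 0xaaaaaaabULL) >> 36;

  		if (fast != size / 24) {
  			printf("mismatch at %llu\n", (unsigned long long)size);
  			return 1;
  		}
  	}
  	printf("(size * 0xaaaaaaab) >> 36 == size / 24 for all 32-bit sizes\n");
  	return 0;
  }

The loop is exhaustive over the full u32 range, so it takes a while to run,
but it confirms the multiply-and-shift form agrees with plain division for
every size value the helper can receive.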

* [PATCH v3 bpf-next 1/2] bpf: make bpf_get_branch_snapshot() architecture-agnostic
  2024-04-04  0:26 [PATCH v3 bpf-next 0/2] Inline bpf_get_branch_snapshot() BPF helper Andrii Nakryiko
@ 2024-04-04  0:26 ` Andrii Nakryiko
  2024-04-04 17:25   ` Yonghong Song
  2024-04-04  0:26 ` [PATCH v3 bpf-next 2/2] bpf: inline bpf_get_branch_snapshot() helper Andrii Nakryiko
  2024-04-04 20:20 ` [PATCH v3 bpf-next 0/2] Inline bpf_get_branch_snapshot() BPF helper patchwork-bot+netdevbpf
  2 siblings, 1 reply; 12+ messages in thread
From: Andrii Nakryiko @ 2024-04-04  0:26 UTC (permalink / raw)
  To: bpf, ast, daniel, martin.lau; +Cc: andrii, kernel-team

perf_snapshot_branch_stack is set up in an architecture-agnostic way, so
there is no reason for the BPF subsystem to keep track of which
architectures support LBR and which do not. E.g., it looks like ARM64 might
soon get support for BRBE ([0]), which (with proper integration) it should
be possible to utilize using this BPF helper.

The perf_snapshot_branch_stack static call will point to
__static_call_return0() by default, which just returns zero; that in turn
leads to -ENOENT, as expected. So there is no need to guard anything here.

  [0] https://lore.kernel.org/linux-arm-kernel/20240125094119.2542332-1-anshuman.khandual@arm.com/

Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/trace/bpf_trace.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6d0c95638e1b..afb232b1d7c2 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1188,9 +1188,6 @@ static const struct bpf_func_proto bpf_get_attach_cookie_proto_tracing = {
 
 BPF_CALL_3(bpf_get_branch_snapshot, void *, buf, u32, size, u64, flags)
 {
-#ifndef CONFIG_X86
-	return -ENOENT;
-#else
 	static const u32 br_entry_size = sizeof(struct perf_branch_entry);
 	u32 entry_cnt = size / br_entry_size;
 
@@ -1203,7 +1200,6 @@ BPF_CALL_3(bpf_get_branch_snapshot, void *, buf, u32, size, u64, flags)
 		return -ENOENT;
 
 	return entry_cnt * br_entry_size;
-#endif
 }
 
 static const struct bpf_func_proto bpf_get_branch_snapshot_proto = {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread
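
The fallback behavior this patch relies on can be mocked in userspace; all
names below are illustrative stand-ins, not the kernel's (the real helper
also takes a flags argument and goes through the perf_snapshot_branch_stack
static call, as the diff above shows). The point is that the default target
behaves like __static_call_return0(): zero entries are "captured" and the
helper maps that to -ENOENT.

  #include <errno.h>
  #include <stdio.h>

  typedef unsigned int (*snapshot_fn)(void *entries, unsigned int cnt);

  /* mimics __static_call_return0(): no entries captured */
  static unsigned int return0_stub(void *entries, unsigned int cnt)
  {
  	return 0;
  }

  /* stand-in for the perf_snapshot_branch_stack static call default */
  static snapshot_fn snapshot_branch_stack = return0_stub;

  static long get_branch_snapshot(void *buf, unsigned int size)
  {
  	const unsigned int br_entry_size = 24; /* sizeof(struct perf_branch_entry) */
  	unsigned int entry_cnt = size / br_entry_size;

  	entry_cnt = snapshot_branch_stack(buf, entry_cnt);
  	if (!entry_cnt)
  		return -ENOENT;

  	return entry_cnt * br_entry_size;
  }

  int main(void)
  {
  	char buf[24 * 8];

  	/* prints -2 (-ENOENT), matching an arch without LBR support */
  	printf("%ld\n", get_branch_snapshot(buf, sizeof(buf)));
  	return 0;
  }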

* [PATCH v3 bpf-next 2/2] bpf: inline bpf_get_branch_snapshot() helper
  2024-04-04  0:26 [PATCH v3 bpf-next 0/2] Inline bpf_get_branch_snapshot() BPF helper Andrii Nakryiko
  2024-04-04  0:26 ` [PATCH v3 bpf-next 1/2] bpf: make bpf_get_branch_snapshot() architecture-agnostic Andrii Nakryiko
@ 2024-04-04  0:26 ` Andrii Nakryiko
  2024-04-04 17:44   ` Yonghong Song
                     ` (2 more replies)
  2024-04-04 20:20 ` [PATCH v3 bpf-next 0/2] Inline bpf_get_branch_snapshot() BPF helper patchwork-bot+netdevbpf
  2 siblings, 3 replies; 12+ messages in thread
From: Andrii Nakryiko @ 2024-04-04  0:26 UTC (permalink / raw)
  To: bpf, ast, daniel, martin.lau; +Cc: andrii, kernel-team, John Fastabend

Inline bpf_get_branch_snapshot() helper using architecture-agnostic
inline BPF code which calls directly into underlying callback of
perf_snapshot_branch_stack static call. This callback is set early
during kernel initialization and is never updated or reset, so it's ok
to fetch actual implementation using static_call_query() and call
directly into it.

This change eliminates a full function call and saves one LBR entry
in PERF_SAMPLE_BRANCH_ANY LBR mode.

Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
---
 kernel/bpf/verifier.c | 55 +++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 17c06f1505e4..2cb5db317a5e 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -20181,6 +20181,61 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 			goto next_insn;
 		}
 
+		/* Implement bpf_get_branch_snapshot inline. */
+		if (prog->jit_requested && BITS_PER_LONG == 64 &&
+		    insn->imm == BPF_FUNC_get_branch_snapshot) {
+			/* We are dealing with the following func protos:
+			 * u64 bpf_get_branch_snapshot(void *buf, u32 size, u64 flags);
+			 * int perf_snapshot_branch_stack(struct perf_branch_entry *entries, u32 cnt);
+			 */
+			const u32 br_entry_size = sizeof(struct perf_branch_entry);
+
+			/* struct perf_branch_entry is part of UAPI and is
+			 * used as an array element, so extremely unlikely to
+			 * ever grow or shrink
+			 */
+			BUILD_BUG_ON(br_entry_size != 24);
+
+			/* if (unlikely(flags)) return -EINVAL */
+			insn_buf[0] = BPF_JMP_IMM(BPF_JNE, BPF_REG_3, 0, 7);
+
+			/* Transform size (bytes) into number of entries (cnt = size / 24).
+			 * But to avoid expensive division instruction, we implement
+			 * divide-by-3 through multiplication, followed by further
+			 * division by 8 through 3-bit right shift.
+			 * Refer to book "Hacker's Delight, 2nd ed." by Henry S. Warren, Jr.,
+			 * p. 227, chapter "Unsigned Division by 3" for details and proofs.
+			 *
+			 * N / 3 <=> M * N / 2^33, where M = (2^33 + 1) / 3 = 0xaaaaaaab.
+			 */
+			insn_buf[1] = BPF_MOV32_IMM(BPF_REG_0, 0xaaaaaaab);
+			insn_buf[2] = BPF_ALU64_REG(BPF_MUL, BPF_REG_2, BPF_REG_0);
+			insn_buf[3] = BPF_ALU64_IMM(BPF_RSH, BPF_REG_2, 36);
+
+			/* call perf_snapshot_branch_stack implementation */
+			insn_buf[4] = BPF_EMIT_CALL(static_call_query(perf_snapshot_branch_stack));
+			/* if (entry_cnt == 0) return -ENOENT */
+			insn_buf[5] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4);
+			/* return entry_cnt * sizeof(struct perf_branch_entry) */
+			insn_buf[6] = BPF_ALU32_IMM(BPF_MUL, BPF_REG_0, br_entry_size);
+			insn_buf[7] = BPF_JMP_A(3);
+			/* return -EINVAL; */
+			insn_buf[8] = BPF_MOV64_IMM(BPF_REG_0, -EINVAL);
+			insn_buf[9] = BPF_JMP_A(1);
+			/* return -ENOENT; */
+			insn_buf[10] = BPF_MOV64_IMM(BPF_REG_0, -ENOENT);
+			cnt = 11;
+
+			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
+			if (!new_prog)
+				return -ENOMEM;
+
+			delta    += cnt - 1;
+			env->prog = prog = new_prog;
+			insn      = new_prog->insnsi + i + delta;
+			continue;
+		}
+
 		/* Implement bpf_kptr_xchg inline */
 		if (prog->jit_requested && BITS_PER_LONG == 64 &&
 		    insn->imm == BPF_FUNC_kptr_xchg &&
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 12+ messages in thread
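
Read as C, the 11-instruction sequence above does roughly the following.
This is an illustration only: inlined_get_branch_snapshot() and
snapshot_branch_stack() are made-up names, the latter standing in for
whatever static_call_query(perf_snapshot_branch_stack) resolves to.

  #include <errno.h>

  /* hypothetical stand-in for the resolved static call target; returns the
   * number of branch entries actually written into 'entries'
   */
  extern unsigned int snapshot_branch_stack(void *entries, unsigned int cnt);

  /* rough C equivalent of insn_buf[0..10], for illustration only */
  long inlined_get_branch_snapshot(void *buf, unsigned int size,
  				 unsigned long long flags)
  {
  	unsigned int cnt;

  	if (flags)				/* insn_buf[0] jumps to [8] */
  		return -EINVAL;
  	/* insn_buf[1..3]: cnt = size / 24 via multiply + right shift */
  	cnt = ((unsigned long long)size * 0xaaaaaaab) >> 36;
  	cnt = snapshot_branch_stack(buf, cnt);	/* insn_buf[4] */
  	if (!cnt)				/* insn_buf[5] jumps to [10] */
  		return -ENOENT;
  	/* insn_buf[6..7]: bytes filled in, i.e. cnt * sizeof(struct perf_branch_entry) */
  	return cnt * 24;
  }

The static call target is resolved once at verification time via
static_call_query(); as the commit message notes, it is set during early
kernel initialization and never changes afterwards, which is what makes the
direct call safe.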

* Re: [PATCH v3 bpf-next 1/2] bpf: make bpf_get_branch_snapshot() architecture-agnostic
  2024-04-04  0:26 ` [PATCH v3 bpf-next 1/2] bpf: make bpf_get_branch_snapshot() architecture-agnostic Andrii Nakryiko
@ 2024-04-04 17:25   ` Yonghong Song
  0 siblings, 0 replies; 12+ messages in thread
From: Yonghong Song @ 2024-04-04 17:25 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, ast, daniel, martin.lau; +Cc: kernel-team


On 4/3/24 5:26 PM, Andrii Nakryiko wrote:
> perf_snapshot_branch_stack is set up in an architecture-agnostic way, so
> there is no reason for the BPF subsystem to keep track of which
> architectures support LBR and which do not. E.g., it looks like ARM64 might
> soon get support for BRBE ([0]), which (with proper integration) it should
> be possible to utilize using this BPF helper.
>
> The perf_snapshot_branch_stack static call will point to
> __static_call_return0() by default, which just returns zero; that in turn
> leads to -ENOENT, as expected. So there is no need to guard anything here.
>
>    [0] https://lore.kernel.org/linux-arm-kernel/20240125094119.2542332-1-anshuman.khandual@arm.com/
>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Acked-by: Yonghong Song <yonghong.song@linux.dev>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 bpf-next 2/2] bpf: inline bpf_get_branch_snapshot() helper
  2024-04-04  0:26 ` [PATCH v3 bpf-next 2/2] bpf: inline bpf_get_branch_snapshot() helper Andrii Nakryiko
@ 2024-04-04 17:44   ` Yonghong Song
  2024-04-04 18:14   ` Alexei Starovoitov
  2024-10-23  8:28   ` Shung-Hsi Yu
  2 siblings, 0 replies; 12+ messages in thread
From: Yonghong Song @ 2024-04-04 17:44 UTC (permalink / raw)
  To: Andrii Nakryiko, bpf, ast, daniel, martin.lau; +Cc: kernel-team, John Fastabend


On 4/3/24 5:26 PM, Andrii Nakryiko wrote:
> Inline bpf_get_branch_snapshot() helper using architecture-agnostic
> inline BPF code which calls directly into underlying callback of
> perf_snapshot_branch_stack static call. This callback is set early
> during kernel initialization and is never updated or reset, so it's ok
> to fetch actual implementation using static_call_query() and call
> directly into it.
>
> This change eliminates a full function call and saves one LBR entry
> in PERF_SAMPLE_BRANCH_ANY LBR mode.
>
> Acked-by: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>

Acked-by: Yonghong Song <yonghong.song@linux.dev>


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 bpf-next 2/2] bpf: inline bpf_get_branch_snapshot() helper
  2024-04-04  0:26 ` [PATCH v3 bpf-next 2/2] bpf: inline bpf_get_branch_snapshot() helper Andrii Nakryiko
  2024-04-04 17:44   ` Yonghong Song
@ 2024-04-04 18:14   ` Alexei Starovoitov
  2024-04-04 18:28     ` Yonghong Song
  2024-10-23  8:28   ` Shung-Hsi Yu
  2 siblings, 1 reply; 12+ messages in thread
From: Alexei Starovoitov @ 2024-04-04 18:14 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Kernel Team, John Fastabend

On Wed, Apr 3, 2024 at 5:27 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>
> Inline bpf_get_branch_snapshot() helper using architecture-agnostic
> inline BPF code which calls directly into underlying callback of
> perf_snapshot_branch_stack static call. This callback is set early
> during kernel initialization and is never updated or reset, so it's ok
> to fetch actual implementation using static_call_query() and call
> directly into it.
>
> This change eliminates a full function call and saves one LBR entry
> in PERF_SAMPLE_BRANCH_ANY LBR mode.
>
> Acked-by: John Fastabend <john.fastabend@gmail.com>
> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> ---
>  kernel/bpf/verifier.c | 55 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 55 insertions(+)
>
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 17c06f1505e4..2cb5db317a5e 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -20181,6 +20181,61 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>                         goto next_insn;
>                 }
>
> +               /* Implement bpf_get_branch_snapshot inline. */
> +               if (prog->jit_requested && BITS_PER_LONG == 64 &&
> +                   insn->imm == BPF_FUNC_get_branch_snapshot) {
> +                       /* We are dealing with the following func protos:
> +                        * u64 bpf_get_branch_snapshot(void *buf, u32 size, u64 flags);
> +                        * int perf_snapshot_branch_stack(struct perf_branch_entry *entries, u32 cnt);
> +                        */
> +                       const u32 br_entry_size = sizeof(struct perf_branch_entry);
> +
> +                       /* struct perf_branch_entry is part of UAPI and is
> +                        * used as an array element, so extremely unlikely to
> +                        * ever grow or shrink
> +                        */
> +                       BUILD_BUG_ON(br_entry_size != 24);
> +
> +                       /* if (unlikely(flags)) return -EINVAL */
> +                       insn_buf[0] = BPF_JMP_IMM(BPF_JNE, BPF_REG_3, 0, 7);
> +
> +                       /* Transform size (bytes) into number of entries (cnt = size / 24).
> +                        * But to avoid expensive division instruction, we implement
> +                        * divide-by-3 through multiplication, followed by further
> +                        * division by 8 through 3-bit right shift.
> +                        * Refer to book "Hacker's Delight, 2nd ed." by Henry S. Warren, Jr.,
> +                        * p. 227, chapter "Unsigned Division by 3" for details and proofs.
> +                        *
> +                        * N / 3 <=> M * N / 2^33, where M = (2^33 + 1) / 3 = 0xaaaaaaab.
> +                        */
> +                       insn_buf[1] = BPF_MOV32_IMM(BPF_REG_0, 0xaaaaaaab);
> +                       insn_buf[2] = BPF_ALU64_REG(BPF_MUL, BPF_REG_2, BPF_REG_0);
> +                       insn_buf[3] = BPF_ALU64_IMM(BPF_RSH, BPF_REG_2, 36);
> +
> +                       /* call perf_snapshot_branch_stack implementation */
> +                       insn_buf[4] = BPF_EMIT_CALL(static_call_query(perf_snapshot_branch_stack));

How will this work on non-x86 ?
I tried to grep the code and looks like only x86 does:
static_call_update(perf_snapshot_branch_stack,...)

so on other arch-s static_call_query() will return zero/einval?
And above will crash?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 bpf-next 2/2] bpf: inline bpf_get_branch_snapshot() helper
  2024-04-04 18:14   ` Alexei Starovoitov
@ 2024-04-04 18:28     ` Yonghong Song
  2024-04-04 18:31       ` Alexei Starovoitov
  0 siblings, 1 reply; 12+ messages in thread
From: Yonghong Song @ 2024-04-04 18:28 UTC (permalink / raw)
  To: Alexei Starovoitov, Andrii Nakryiko
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau,
	Kernel Team, John Fastabend


On 4/4/24 11:14 AM, Alexei Starovoitov wrote:
> On Wed, Apr 3, 2024 at 5:27 PM Andrii Nakryiko <andrii@kernel.org> wrote:
>> Inline bpf_get_branch_snapshot() helper using architecture-agnostic
>> inline BPF code which calls directly into underlying callback of
>> perf_snapshot_branch_stack static call. This callback is set early
>> during kernel initialization and is never updated or reset, so it's ok
>> to fetch actual implementation using static_call_query() and call
>> directly into it.
>>
>> This change eliminates a full function call and saves one LBR entry
>> in PERF_SAMPLE_BRANCH_ANY LBR mode.
>>
>> Acked-by: John Fastabend <john.fastabend@gmail.com>
>> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
>> ---
>>   kernel/bpf/verifier.c | 55 +++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 55 insertions(+)
>>
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 17c06f1505e4..2cb5db317a5e 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -20181,6 +20181,61 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
>>                          goto next_insn;
>>                  }
>>
>> +               /* Implement bpf_get_branch_snapshot inline. */
>> +               if (prog->jit_requested && BITS_PER_LONG == 64 &&
>> +                   insn->imm == BPF_FUNC_get_branch_snapshot) {
>> +                       /* We are dealing with the following func protos:
>> +                        * u64 bpf_get_branch_snapshot(void *buf, u32 size, u64 flags);
>> +                        * int perf_snapshot_branch_stack(struct perf_branch_entry *entries, u32 cnt);
>> +                        */
>> +                       const u32 br_entry_size = sizeof(struct perf_branch_entry);
>> +
>> +                       /* struct perf_branch_entry is part of UAPI and is
>> +                        * used as an array element, so extremely unlikely to
>> +                        * ever grow or shrink
>> +                        */
>> +                       BUILD_BUG_ON(br_entry_size != 24);
>> +
>> +                       /* if (unlikely(flags)) return -EINVAL */
>> +                       insn_buf[0] = BPF_JMP_IMM(BPF_JNE, BPF_REG_3, 0, 7);
>> +
>> +                       /* Transform size (bytes) into number of entries (cnt = size / 24).
>> +                        * But to avoid expensive division instruction, we implement
>> +                        * divide-by-3 through multiplication, followed by further
>> +                        * division by 8 through 3-bit right shift.
>> +                        * Refer to book "Hacker's Delight, 2nd ed." by Henry S. Warren, Jr.,
>> +                        * p. 227, chapter "Unsigned Division by 3" for details and proofs.
>> +                        *
>> +                        * N / 3 <=> M * N / 2^33, where M = (2^33 + 1) / 3 = 0xaaaaaaab.
>> +                        */
>> +                       insn_buf[1] = BPF_MOV32_IMM(BPF_REG_0, 0xaaaaaaab);
>> +                       insn_buf[2] = BPF_ALU64_REG(BPF_MUL, BPF_REG_2, BPF_REG_0);
>> +                       insn_buf[3] = BPF_ALU64_IMM(BPF_RSH, BPF_REG_2, 36);
>> +
>> +                       /* call perf_snapshot_branch_stack implementation */
>> +                       insn_buf[4] = BPF_EMIT_CALL(static_call_query(perf_snapshot_branch_stack));
> How will this work on non-x86 ?
> I tried to grep the code and looks like only x86 does:
> static_call_update(perf_snapshot_branch_stack,...)
>
> so on other arch-s static_call_query() will return zero/einval?
> And above will crash?

Patch 1 will give the answer. In events/core.c, we have the following:

  DEFINE_STATIC_CALL_RET0(perf_snapshot_branch_stack, perf_snapshot_branch_stack_t);

  #define DEFINE_STATIC_CALL_RET0(name, _func)			\
	DECLARE_STATIC_CALL(name, _func);			\
	struct static_call_key STATIC_CALL_KEY(name) = {	\
		.func = __static_call_return0,			\
		.type = 1,					\
	};							\
	ARCH_DEFINE_STATIC_CALL_RET0_TRAMP(name)

So the default value for perf_snapshot_branch_stack is
__static_call_return0.

In static_call.c:

  long __static_call_return0(void)
  {
	return 0;
  }
  EXPORT_SYMBOL_GPL(__static_call_return0);

So we should be fine.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 bpf-next 2/2] bpf: inline bpf_get_branch_snapshot() helper
  2024-04-04 18:28     ` Yonghong Song
@ 2024-04-04 18:31       ` Alexei Starovoitov
  2024-04-04 20:08         ` Andrii Nakryiko
  0 siblings, 1 reply; 12+ messages in thread
From: Alexei Starovoitov @ 2024-04-04 18:31 UTC (permalink / raw)
  To: Yonghong Song
  Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann,
	Martin KaFai Lau, Kernel Team, John Fastabend

On Thu, Apr 4, 2024 at 11:28 AM Yonghong Song <yonghong.song@linux.dev> wrote:
>
>
> On 4/4/24 11:14 AM, Alexei Starovoitov wrote:
> > On Wed, Apr 3, 2024 at 5:27 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> >> Inline bpf_get_branch_snapshot() helper using architecture-agnostic
> >> inline BPF code which calls directly into underlying callback of
> >> perf_snapshot_branch_stack static call. This callback is set early
> >> during kernel initialization and is never updated or reset, so it's ok
> >> to fetch actual implementation using static_call_query() and call
> >> directly into it.
> >>
> >> This change eliminates a full function call and saves one LBR entry
> >> in PERF_SAMPLE_BRANCH_ANY LBR mode.
> >>
> >> Acked-by: John Fastabend <john.fastabend@gmail.com>
> >> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> >> ---
> >>   kernel/bpf/verifier.c | 55 +++++++++++++++++++++++++++++++++++++++++++
> >>   1 file changed, 55 insertions(+)
> >>
> >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> >> index 17c06f1505e4..2cb5db317a5e 100644
> >> --- a/kernel/bpf/verifier.c
> >> +++ b/kernel/bpf/verifier.c
> >> @@ -20181,6 +20181,61 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> >>                          goto next_insn;
> >>                  }
> >>
> >> +               /* Implement bpf_get_branch_snapshot inline. */
> >> +               if (prog->jit_requested && BITS_PER_LONG == 64 &&
> >> +                   insn->imm == BPF_FUNC_get_branch_snapshot) {
> >> +                       /* We are dealing with the following func protos:
> >> +                        * u64 bpf_get_branch_snapshot(void *buf, u32 size, u64 flags);
> >> +                        * int perf_snapshot_branch_stack(struct perf_branch_entry *entries, u32 cnt);
> >> +                        */
> >> +                       const u32 br_entry_size = sizeof(struct perf_branch_entry);
> >> +
> >> +                       /* struct perf_branch_entry is part of UAPI and is
> >> +                        * used as an array element, so extremely unlikely to
> >> +                        * ever grow or shrink
> >> +                        */
> >> +                       BUILD_BUG_ON(br_entry_size != 24);
> >> +
> >> +                       /* if (unlikely(flags)) return -EINVAL */
> >> +                       insn_buf[0] = BPF_JMP_IMM(BPF_JNE, BPF_REG_3, 0, 7);
> >> +
> >> +                       /* Transform size (bytes) into number of entries (cnt = size / 24).
> >> +                        * But to avoid expensive division instruction, we implement
> >> +                        * divide-by-3 through multiplication, followed by further
> >> +                        * division by 8 through 3-bit right shift.
> >> +                        * Refer to book "Hacker's Delight, 2nd ed." by Henry S. Warren, Jr.,
> >> +                        * p. 227, chapter "Unsigned Division by 3" for details and proofs.
> >> +                        *
> >> +                        * N / 3 <=> M * N / 2^33, where M = (2^33 + 1) / 3 = 0xaaaaaaab.
> >> +                        */
> >> +                       insn_buf[1] = BPF_MOV32_IMM(BPF_REG_0, 0xaaaaaaab);
> >> +                       insn_buf[2] = BPF_ALU64_REG(BPF_MUL, BPF_REG_2, BPF_REG_0);
> >> +                       insn_buf[3] = BPF_ALU64_IMM(BPF_RSH, BPF_REG_2, 36);
> >> +
> >> +                       /* call perf_snapshot_branch_stack implementation */
> >> +                       insn_buf[4] = BPF_EMIT_CALL(static_call_query(perf_snapshot_branch_stack));
> > How will this work on non-x86 ?
> > I tried to grep the code and looks like only x86 does:
> > static_call_update(perf_snapshot_branch_stack,...)
> >
> > so on other arch-s static_call_query() will return zero/einval?
> > And above will crash?
>
> Patch 1 will give the answer. In events/core.c, we have the following:
>
>   DEFINE_STATIC_CALL_RET0(perf_snapshot_branch_stack, perf_snapshot_branch_stack_t);
>
>   #define DEFINE_STATIC_CALL_RET0(name, _func)			\
> 	DECLARE_STATIC_CALL(name, _func);			\
> 	struct static_call_key STATIC_CALL_KEY(name) = {	\
> 		.func = __static_call_return0,			\
> 		.type = 1,					\
> 	};							\
> 	ARCH_DEFINE_STATIC_CALL_RET0_TRAMP(name)
>
> So the default value for perf_snapshot_branch_stack is
> __static_call_return0.
>
> In static_call.c:
>
>   long __static_call_return0(void)
>   {
> 	return 0;
>   }
>   EXPORT_SYMBOL_GPL(__static_call_return0);
>
> So we should be fine.

I see. Thanks for explaining.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 bpf-next 2/2] bpf: inline bpf_get_branch_snapshot() helper
  2024-04-04 18:31       ` Alexei Starovoitov
@ 2024-04-04 20:08         ` Andrii Nakryiko
  0 siblings, 0 replies; 12+ messages in thread
From: Andrii Nakryiko @ 2024-04-04 20:08 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Yonghong Song, Andrii Nakryiko, bpf, Alexei Starovoitov,
	Daniel Borkmann, Martin KaFai Lau, Kernel Team, John Fastabend

On Thu, Apr 4, 2024 at 11:31 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Apr 4, 2024 at 11:28 AM Yonghong Song <yonghong.song@linux.dev> wrote:
> >
> >
> > On 4/4/24 11:14 AM, Alexei Starovoitov wrote:
> > > On Wed, Apr 3, 2024 at 5:27 PM Andrii Nakryiko <andrii@kernel.org> wrote:
> > >> Inline bpf_get_branch_snapshot() helper using architecture-agnostic
> > >> inline BPF code which calls directly into underlying callback of
> > >> perf_snapshot_branch_stack static call. This callback is set early
> > >> during kernel initialization and is never updated or reset, so it's ok
> > >> to fetch actual implementation using static_call_query() and call
> > >> directly into it.
> > >>
> > >> This change eliminates a full function call and saves one LBR entry
> > >> in PERF_SAMPLE_BRANCH_ANY LBR mode.
> > >>
> > >> Acked-by: John Fastabend <john.fastabend@gmail.com>
> > >> Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
> > >> ---
> > >>   kernel/bpf/verifier.c | 55 +++++++++++++++++++++++++++++++++++++++++++
> > >>   1 file changed, 55 insertions(+)
> > >>
> > >> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > >> index 17c06f1505e4..2cb5db317a5e 100644
> > >> --- a/kernel/bpf/verifier.c
> > >> +++ b/kernel/bpf/verifier.c
> > >> @@ -20181,6 +20181,61 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
> > >>                          goto next_insn;
> > >>                  }
> > >>
> > >> +               /* Implement bpf_get_branch_snapshot inline. */
> > >> +               if (prog->jit_requested && BITS_PER_LONG == 64 &&
> > >> +                   insn->imm == BPF_FUNC_get_branch_snapshot) {
> > >> +                       /* We are dealing with the following func protos:
> > >> +                        * u64 bpf_get_branch_snapshot(void *buf, u32 size, u64 flags);
> > >> +                        * int perf_snapshot_branch_stack(struct perf_branch_entry *entries, u32 cnt);
> > >> +                        */
> > >> +                       const u32 br_entry_size = sizeof(struct perf_branch_entry);
> > >> +
> > >> +                       /* struct perf_branch_entry is part of UAPI and is
> > >> +                        * used as an array element, so extremely unlikely to
> > >> +                        * ever grow or shrink
> > >> +                        */
> > >> +                       BUILD_BUG_ON(br_entry_size != 24);
> > >> +
> > >> +                       /* if (unlikely(flags)) return -EINVAL */
> > >> +                       insn_buf[0] = BPF_JMP_IMM(BPF_JNE, BPF_REG_3, 0, 7);
> > >> +
> > >> +                       /* Transform size (bytes) into number of entries (cnt = size / 24).
> > >> +                        * But to avoid expensive division instruction, we implement
> > >> +                        * divide-by-3 through multiplication, followed by further
> > >> +                        * division by 8 through 3-bit right shift.
> > >> +                        * Refer to book "Hacker's Delight, 2nd ed." by Henry S. Warren, Jr.,
> > >> +                        * p. 227, chapter "Unsigned Division by 3" for details and proofs.
> > >> +                        *
> > >> +                        * N / 3 <=> M * N / 2^33, where M = (2^33 + 1) / 3 = 0xaaaaaaab.
> > >> +                        */
> > >> +                       insn_buf[1] = BPF_MOV32_IMM(BPF_REG_0, 0xaaaaaaab);
> > >> +                       insn_buf[2] = BPF_ALU64_REG(BPF_MUL, BPF_REG_2, BPF_REG_0);
> > >> +                       insn_buf[3] = BPF_ALU64_IMM(BPF_RSH, BPF_REG_2, 36);
> > >> +
> > >> +                       /* call perf_snapshot_branch_stack implementation */
> > >> +                       insn_buf[4] = BPF_EMIT_CALL(static_call_query(perf_snapshot_branch_stack));
> > > How will this work on non-x86 ?
> > > I tried to grep the code and looks like only x86 does:
> > > static_call_update(perf_snapshot_branch_stack,...)
> > >
> > > so on other arch-s static_call_query() will return zero/einval?
> > > And above will crash?
> >
> > Patch 1 will give the answer. In events/core.c, we have the following:
> >
> >   DEFINE_STATIC_CALL_RET0(perf_snapshot_branch_stack, perf_snapshot_branch_stack_t);
> >
> >   #define DEFINE_STATIC_CALL_RET0(name, _func)			\
> > 	DECLARE_STATIC_CALL(name, _func);			\
> > 	struct static_call_key STATIC_CALL_KEY(name) = {	\
> > 		.func = __static_call_return0,			\
> > 		.type = 1,					\
> > 	};							\
> > 	ARCH_DEFINE_STATIC_CALL_RET0_TRAMP(name)
> >
> > So the default value for perf_snapshot_branch_stack is
> > __static_call_return0.
> >
> > In static_call.c:
> >
> >   long __static_call_return0(void)
> >   {
> > 	return 0;
> >   }
> >   EXPORT_SYMBOL_GPL(__static_call_return0);
> >
> > So we should be fine.
>
> I see. Thanks for explaining.
>

Yep, just like Yonghong explained. And I also dropped that CONFIG_X86
check in anticipation of [0], which hopefully lands some time in the
not too distant future.

  [0] https://lore.kernel.org/all/20240125094119.2542332-1-anshuman.khandual@arm.com/

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 bpf-next 0/2] Inline bpf_get_branch_snapshot() BPF helper
  2024-04-04  0:26 [PATCH v3 bpf-next 0/2] Inline bpf_get_branch_snapshot() BPF helper Andrii Nakryiko
  2024-04-04  0:26 ` [PATCH v3 bpf-next 1/2] bpf: make bpf_get_branch_snapshot() architecture-agnostic Andrii Nakryiko
  2024-04-04  0:26 ` [PATCH v3 bpf-next 2/2] bpf: inline bpf_get_branch_snapshot() helper Andrii Nakryiko
@ 2024-04-04 20:20 ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 12+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-04-04 20:20 UTC (permalink / raw)
  To: Andrii Nakryiko; +Cc: bpf, ast, daniel, martin.lau, kernel-team

Hello:

This series was applied to bpf/bpf-next.git (master)
by Alexei Starovoitov <ast@kernel.org>:

On Wed,  3 Apr 2024 17:26:38 -0700 you wrote:
> Implement inlining of the bpf_get_branch_snapshot() BPF helper using a generic
> BPF assembly approach. This makes it possible to reduce LBR record usage right
> before LBR records are captured from inside the BPF program.
> 
> See v1 cover letter ([0]) for some visual examples. I dropped them from v2
> because there are multiple independent changes landing and being reviewed, all
> of which remove different parts of LBR record waste, so presenting final state
> of LBR "waste" gets more complicated until all of the pieces land.
> 
> [...]

Here is the summary with links:
  - [v3,bpf-next,1/2] bpf: make bpf_get_branch_snapshot() architecture-agnostic
    https://git.kernel.org/bpf/bpf-next/c/5e6a3c1ee693
  - [v3,bpf-next,2/2] bpf: inline bpf_get_branch_snapshot() helper
    https://git.kernel.org/bpf/bpf-next/c/314a53623cd4

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 bpf-next 2/2] bpf: inline bpf_get_branch_snapshot() helper
  2024-04-04  0:26 ` [PATCH v3 bpf-next 2/2] bpf: inline bpf_get_branch_snapshot() helper Andrii Nakryiko
  2024-04-04 17:44   ` Yonghong Song
  2024-04-04 18:14   ` Alexei Starovoitov
@ 2024-10-23  8:28   ` Shung-Hsi Yu
  2024-10-23 16:04     ` Andrii Nakryiko
  2 siblings, 1 reply; 12+ messages in thread
From: Shung-Hsi Yu @ 2024-10-23  8:28 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: bpf, ast, daniel, martin.lau, kernel-team, John Fastabend,
	Yonghong Song

Hi Andrii,

I was looking around in do_misc_fixups() and came across this

...
> +			new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> +			if (!new_prog)
> +				return -ENOMEM;
> +
> +			delta    += cnt - 1;
> +			env->prog = prog = new_prog;
> +			insn      = new_prog->insnsi + i + delta;
> +			continue;

Should the above be turned into "goto next_insn" like the others that
were touched by commit 011832b97b31 "bpf: Introduce may_goto
instruction"?

> +		}
> +
>  		/* Implement bpf_kptr_xchg inline */
>  		if (prog->jit_requested && BITS_PER_LONG == 64 &&
>  		    insn->imm == BPF_FUNC_kptr_xchg &&

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v3 bpf-next 2/2] bpf: inline bpf_get_branch_snapshot() helper
  2024-10-23  8:28   ` Shung-Hsi Yu
@ 2024-10-23 16:04     ` Andrii Nakryiko
  0 siblings, 0 replies; 12+ messages in thread
From: Andrii Nakryiko @ 2024-10-23 16:04 UTC (permalink / raw)
  To: Shung-Hsi Yu
  Cc: Andrii Nakryiko, bpf, ast, daniel, martin.lau, kernel-team,
	John Fastabend, Yonghong Song

On Wed, Oct 23, 2024 at 1:29 AM Shung-Hsi Yu <shung-hsi.yu@suse.com> wrote:
>
> Hi Andrii,
>
> I was looking around in do_misc_fixups() and came across this
>
> ...
> > +                     new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt);
> > +                     if (!new_prog)
> > +                             return -ENOMEM;
> > +
> > +                     delta    += cnt - 1;
> > +                     env->prog = prog = new_prog;
> > +                     insn      = new_prog->insnsi + i + delta;
> > +                     continue;
>
> Should the above be turned into "goto next_insn" like the others that
> were touched by commit 011832b97b31 "bpf: Introduce may_goto
> instruction"?
>

Yep, seems so, thanks for catching. I'll send a fix.

> > +             }
> > +
> >               /* Implement bpf_kptr_xchg inline */
> >               if (prog->jit_requested && BITS_PER_LONG == 64 &&
> >                   insn->imm == BPF_FUNC_kptr_xchg &&

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2024-10-23 16:04 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-04-04  0:26 [PATCH v3 bpf-next 0/2] Inline bpf_get_branch_snapshot() BPF helper Andrii Nakryiko
2024-04-04  0:26 ` [PATCH v3 bpf-next 1/2] bpf: make bpf_get_branch_snapshot() architecture-agnostic Andrii Nakryiko
2024-04-04 17:25   ` Yonghong Song
2024-04-04  0:26 ` [PATCH v3 bpf-next 2/2] bpf: inline bpf_get_branch_snapshot() helper Andrii Nakryiko
2024-04-04 17:44   ` Yonghong Song
2024-04-04 18:14   ` Alexei Starovoitov
2024-04-04 18:28     ` Yonghong Song
2024-04-04 18:31       ` Alexei Starovoitov
2024-04-04 20:08         ` Andrii Nakryiko
2024-10-23  8:28   ` Shung-Hsi Yu
2024-10-23 16:04     ` Andrii Nakryiko
2024-04-04 20:20 ` [PATCH v3 bpf-next 0/2] Inline bpf_get_branch_snapshot() BPF helper patchwork-bot+netdevbpf
