* [PATCH bpf-next 0/3] Inline two LBR-related helpers
@ 2024-03-21 18:04 Andrii Nakryiko
From: Andrii Nakryiko @ 2024-03-21 18:04 UTC (permalink / raw)
To: bpf, ast, daniel, martin.lau; +Cc: peterz, song, Andrii Nakryiko
Implement inlining of the bpf_get_branch_snapshot() BPF helper using a generic
BPF assembly approach.
Also inline the bpf_get_smp_processor_id() BPF helper, but using
architecture-specific assembly code in the x86-64 JIT compiler, given that
getting the CPU ID is highly architecture-specific.
These two helpers are on the critical direct path to grabbing LBR records from
a BPF program, and inlining them helps save 3 LBR records in
PERF_SAMPLE_BRANCH_ANY mode.
Just to give some visual idea of the effect of these changes (and of inlining
kprobe_multi_link_prog_run(), posted as a separate patch), below is retsnoop's
LBR output (with the --lbr=any flag). I only show the "wasted" records that are
needed to go from the moment some event happened (a kernel function return, in
this case) to triggering the BPF program that captures LBR *as the very first
thing* (after getting the CPU ID to obtain a temporary buffer).
There are still ways to reduce the number of "wasted" records further; this is
a problem that requires many small and rather independent steps.
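For readers unfamiliar with the pattern being optimized, here is a minimal
sketch of the kind of BPF program involved (names, buffer layout, and attach
point are illustrative, not taken from retsnoop): get the CPU ID to pick a
temporary buffer, then snapshot LBR as the very first thing so that as few
records as possible are wasted.

	/* minimal illustrative sketch, not retsnoop's actual code */
	#include "vmlinux.h"
	#include <bpf/bpf_helpers.h>
	#include <bpf/bpf_tracing.h>

	#define MAX_LBR_ENTRIES 32
	#define MAX_CPUS 256 /* assumption: power-of-two upper bound on CPU count */

	struct lbr_buf {
		struct perf_branch_entry entries[MAX_LBR_ENTRIES];
	};

	/* one temporary buffer per CPU, kept in global (.bss) memory */
	struct lbr_buf bufs[MAX_CPUS];

	char LICENSE[] SEC("license") = "GPL";

	SEC("fexit/__x64_sys_bpf")
	int BPF_PROG(fexit1)
	{
		/* inlined by patch 3 on x86-64 */
		u32 cpu = bpf_get_smp_processor_id() & (MAX_CPUS - 1);
		struct lbr_buf *buf = &bufs[cpu];
		long sz;

		/* inlined by patch 2; capture LBR before any more branches happen */
		sz = bpf_get_branch_snapshot(buf->entries, sizeof(buf->entries), 0);
		if (sz < 0)
			return 0;

		/* ... post-process and emit the captured records ... */
		return 0;
	}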
fentry mode
===========
BEFORE
------
[#10] __sys_bpf+0x270 -> __x64_sys_bpf+0x18
[#09] __x64_sys_bpf+0x1a -> bpf_trampoline_6442508684+0x7f
[#08] bpf_trampoline_6442508684+0x9c -> __bpf_prog_enter_recur+0x0
[#07] __bpf_prog_enter_recur+0x9 -> migrate_disable+0x0
[#06] migrate_disable+0x37 -> __bpf_prog_enter_recur+0xe
[#05] __bpf_prog_enter_recur+0x43 -> bpf_trampoline_6442508684+0xa1
[#04] bpf_trampoline_6442508684+0xad -> bpf_prog_dc54a596b39d4177_fexit1+0x0
[#03] bpf_prog_dc54a596b39d4177_fexit1+0x32 -> bpf_get_smp_processor_id+0x0
[#02] bpf_get_smp_processor_id+0xe -> bpf_prog_dc54a596b39d4177_fexit1+0x37
[#01] bpf_prog_dc54a596b39d4177_fexit1+0xe0 -> bpf_get_branch_snapshot+0x0
[#00] bpf_get_branch_snapshot+0x13 -> intel_pmu_snapshot_branch_stack+0x0
AFTER
-----
[#07] __sys_bpf+0xdfc -> __x64_sys_bpf+0x18
[#06] __x64_sys_bpf+0x1a -> bpf_trampoline_6442508829+0x7f
[#05] bpf_trampoline_6442508829+0x9c -> __bpf_prog_enter_recur+0x0
[#04] __bpf_prog_enter_recur+0x9 -> migrate_disable+0x0
[#03] migrate_disable+0x37 -> __bpf_prog_enter_recur+0xe
[#02] __bpf_prog_enter_recur+0x43 -> bpf_trampoline_6442508829+0xa1
[#01] bpf_trampoline_6442508829+0xad -> bpf_prog_dc54a596b39d4177_fexit1+0x0
[#00] bpf_prog_dc54a596b39d4177_fexit1+0x101 -> intel_pmu_snapshot_branch_stack+0x0
multi-kprobe mode
=================
BEFORE
------
[#14] __sys_bpf+0x270 -> arch_rethook_trampoline+0x0
[#13] arch_rethook_trampoline+0x27 -> arch_rethook_trampoline_callback+0x0
[#12] arch_rethook_trampoline_callback+0x31 -> rethook_trampoline_handler+0x0
[#11] rethook_trampoline_handler+0x6f -> fprobe_exit_handler+0x0
[#10] fprobe_exit_handler+0x3d -> rcu_is_watching+0x0
[#09] rcu_is_watching+0x17 -> fprobe_exit_handler+0x42
[#08] fprobe_exit_handler+0xb4 -> kprobe_multi_link_exit_handler+0x0
[#07] kprobe_multi_link_exit_handler+0x4 -> kprobe_multi_link_prog_run+0x0
[#06] kprobe_multi_link_prog_run+0x2d -> migrate_disable+0x0
[#05] migrate_disable+0x37 -> kprobe_multi_link_prog_run+0x32
[#04] kprobe_multi_link_prog_run+0x58 -> bpf_prog_2b455b4f8a8d48c5_kexit+0x0
[#03] bpf_prog_2b455b4f8a8d48c5_kexit+0x32 -> bpf_get_smp_processor_id+0x0
[#02] bpf_get_smp_processor_id+0xe -> bpf_prog_2b455b4f8a8d48c5_kexit+0x37
[#01] bpf_prog_2b455b4f8a8d48c5_kexit+0x82 -> bpf_get_branch_snapshot+0x0
[#00] bpf_get_branch_snapshot+0x13 -> intel_pmu_snapshot_branch_stack+0x0
AFTER
-----
[#10] __sys_bpf+0xdfc -> arch_rethook_trampoline+0x0
[#09] arch_rethook_trampoline+0x27 -> arch_rethook_trampoline_callback+0x0
[#08] arch_rethook_trampoline_callback+0x31 -> rethook_trampoline_handler+0x0
[#07] rethook_trampoline_handler+0x6f -> fprobe_exit_handler+0x0
[#06] fprobe_exit_handler+0x3d -> rcu_is_watching+0x0
[#05] rcu_is_watching+0x17 -> fprobe_exit_handler+0x42
[#04] fprobe_exit_handler+0xb4 -> kprobe_multi_link_exit_handler+0x0
[#03] kprobe_multi_link_exit_handler+0x31 -> migrate_disable+0x0
[#02] migrate_disable+0x37 -> kprobe_multi_link_exit_handler+0x36
[#01] kprobe_multi_link_exit_handler+0x5c -> bpf_prog_2b455b4f8a8d48c5_kexit+0x0
[#00] bpf_prog_2b455b4f8a8d48c5_kexit+0xa3 -> intel_pmu_snapshot_branch_stack+0x0
For the default --lbr mode (PERF_SAMPLE_BRANCH_ANY_RETURN), interestingly enough,
multi-kprobe mode is *less* wasteful (by one function call):
fentry mode
===========
BEFORE
------
[#04] __sys_bpf+0x270 -> __x64_sys_bpf+0x18
[#03] __x64_sys_bpf+0x1a -> bpf_trampoline_6442508684+0x7f
[#02] migrate_disable+0x37 -> __bpf_prog_enter_recur+0xe
[#01] __bpf_prog_enter_recur+0x43 -> bpf_trampoline_6442508684+0xa1
[#00] bpf_get_smp_processor_id+0xe -> bpf_prog_dc54a596b39d4177_fexit1+0x37
AFTER
-----
[#03] __sys_bpf+0xdfc -> __x64_sys_bpf+0x18
[#02] __x64_sys_bpf+0x1a -> bpf_trampoline_6442508829+0x7f
[#01] migrate_disable+0x37 -> __bpf_prog_enter_recur+0xe
[#00] __bpf_prog_enter_recur+0x43 -> bpf_trampoline_6442508829+0xa1
multi-kprobe mode
=================
BEFORE
------
[#03] __sys_bpf+0x270 -> arch_rethook_trampoline+0x0
[#02] rcu_is_watching+0x17 -> fprobe_exit_handler+0x42
[#01] migrate_disable+0x37 -> kprobe_multi_link_prog_run+0x32
[#00] bpf_get_smp_processor_id+0xe -> bpf_prog_2b455b4f8a8d48c5_kexit+0x37
AFTER
-----
[#02] __sys_bpf+0xdfc -> arch_rethook_trampoline+0x0
[#01] rcu_is_watching+0x17 -> fprobe_exit_handler+0x42
[#00] migrate_disable+0x37 -> kprobe_multi_link_exit_handler+0x36
Andrii Nakryiko (3):
bpf: make bpf_get_branch_snapshot() architecture-agnostic
bpf: inline bpf_get_branch_snapshot() helper
bpf,x86: inline bpf_get_smp_processor_id() on x86-64
arch/x86/net/bpf_jit_comp.c | 26 +++++++++++++++++++++++++-
kernel/bpf/verifier.c | 37 +++++++++++++++++++++++++++++++++++++
kernel/trace/bpf_trace.c | 4 ----
3 files changed, 62 insertions(+), 5 deletions(-)
--
2.43.0
^ permalink raw reply [flat|nested] 23+ messages in thread* [PATCH bpf-next 1/3] bpf: make bpf_get_branch_snapshot() architecture-agnostic 2024-03-21 18:04 [PATCH bpf-next 0/3] Inline two LBR-related helpers Andrii Nakryiko @ 2024-03-21 18:04 ` Andrii Nakryiko 2024-03-21 21:08 ` Jiri Olsa 2024-03-21 18:05 ` [PATCH bpf-next 2/3] bpf: inline bpf_get_branch_snapshot() helper Andrii Nakryiko ` (2 subsequent siblings) 3 siblings, 1 reply; 23+ messages in thread From: Andrii Nakryiko @ 2024-03-21 18:04 UTC (permalink / raw) To: bpf, ast, daniel, martin.lau; +Cc: peterz, song, Andrii Nakryiko perf_snapshot_branch_stack is set up in an architecture-agnostic way, so there is no reason for BPF subsystem to keep track of which architectures do support LBR or not. E.g., it looks like ARM64 might soon get support for BRBE ([0]), which (with proper integration) should be possible to utilize using this BPF helper. perf_snapshot_branch_stack static call will point to __static_call_return0() by default, which just returns zero, which will lead to -ENOENT, as expected. So no need to guard anything here. [0] https://lore.kernel.org/linux-arm-kernel/20240125094119.2542332-1-anshuman.khandual@arm.com/ Signed-off-by: Andrii Nakryiko <andrii@kernel.org> --- kernel/trace/bpf_trace.c | 4 ---- 1 file changed, 4 deletions(-) diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index 434e3ece6688..6d000332b17b 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -1182,9 +1182,6 @@ static const struct bpf_func_proto bpf_get_attach_cookie_proto_tracing = { BPF_CALL_3(bpf_get_branch_snapshot, void *, buf, u32, size, u64, flags) { -#ifndef CONFIG_X86 - return -ENOENT; -#else static const u32 br_entry_size = sizeof(struct perf_branch_entry); u32 entry_cnt = size / br_entry_size; @@ -1197,7 +1194,6 @@ BPF_CALL_3(bpf_get_branch_snapshot, void *, buf, u32, size, u64, flags) return -ENOENT; return entry_cnt * br_entry_size; -#endif } static const struct bpf_func_proto bpf_get_branch_snapshot_proto = { -- 2.43.0 ^ permalink raw reply related [flat|nested] 23+ messages in thread
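To make the reasoning in patch 1 concrete: on an architecture that never
registers an LBR snapshot implementation, the static call resolves to
__static_call_return0(), so the un-ifdef'ed helper body behaves exactly like
the removed #ifndef CONFIG_X86 branch. A condensed sketch of the resulting
control flow (not a verbatim kernel excerpt):

	/* sketch of the helper's behavior after patch 1 on an arch without LBR support */
	entry_cnt = static_call(perf_snapshot_branch_stack)(buf, entry_cnt);
	/* no implementation registered -> __static_call_return0() -> entry_cnt == 0 */
	if (!entry_cnt)
		return -ENOENT;	/* same result the old #ifndef CONFIG_X86 path returned */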
* Re: [PATCH bpf-next 1/3] bpf: make bpf_get_branch_snapshot() architecture-agnostic 2024-03-21 18:04 ` [PATCH bpf-next 1/3] bpf: make bpf_get_branch_snapshot() architecture-agnostic Andrii Nakryiko @ 2024-03-21 21:08 ` Jiri Olsa 0 siblings, 0 replies; 23+ messages in thread From: Jiri Olsa @ 2024-03-21 21:08 UTC (permalink / raw) To: Andrii Nakryiko; +Cc: bpf, ast, daniel, martin.lau, peterz, song On Thu, Mar 21, 2024 at 11:04:59AM -0700, Andrii Nakryiko wrote: > perf_snapshot_branch_stack is set up in an architecture-agnostic way, so > there is no reason for BPF subsystem to keep track of which > architectures do support LBR or not. E.g., it looks like ARM64 might soon > get support for BRBE ([0]), which (with proper integration) should be > possible to utilize using this BPF helper. > > perf_snapshot_branch_stack static call will point to > __static_call_return0() by default, which just returns zero, which will > lead to -ENOENT, as expected. So no need to guard anything here. > > [0] https://lore.kernel.org/linux-arm-kernel/20240125094119.2542332-1-anshuman.khandual@arm.com/ > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> jirka > --- > kernel/trace/bpf_trace.c | 4 ---- > 1 file changed, 4 deletions(-) > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > index 434e3ece6688..6d000332b17b 100644 > --- a/kernel/trace/bpf_trace.c > +++ b/kernel/trace/bpf_trace.c > @@ -1182,9 +1182,6 @@ static const struct bpf_func_proto bpf_get_attach_cookie_proto_tracing = { > > BPF_CALL_3(bpf_get_branch_snapshot, void *, buf, u32, size, u64, flags) > { > -#ifndef CONFIG_X86 > - return -ENOENT; > -#else > static const u32 br_entry_size = sizeof(struct perf_branch_entry); > u32 entry_cnt = size / br_entry_size; > > @@ -1197,7 +1194,6 @@ BPF_CALL_3(bpf_get_branch_snapshot, void *, buf, u32, size, u64, flags) > return -ENOENT; > > return entry_cnt * br_entry_size; > -#endif > } > > static const struct bpf_func_proto bpf_get_branch_snapshot_proto = { > -- > 2.43.0 > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH bpf-next 2/3] bpf: inline bpf_get_branch_snapshot() helper 2024-03-21 18:04 [PATCH bpf-next 0/3] Inline two LBR-related helpers Andrii Nakryiko 2024-03-21 18:04 ` [PATCH bpf-next 1/3] bpf: make bpf_get_branch_snapshot() architecture-agnostic Andrii Nakryiko @ 2024-03-21 18:05 ` Andrii Nakryiko 2024-03-21 21:08 ` Jiri Olsa 2024-03-21 18:05 ` [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64 Andrii Nakryiko 2024-03-21 23:46 ` [PATCH bpf-next 0/3] Inline two LBR-related helpers Alexei Starovoitov 3 siblings, 1 reply; 23+ messages in thread From: Andrii Nakryiko @ 2024-03-21 18:05 UTC (permalink / raw) To: bpf, ast, daniel, martin.lau; +Cc: peterz, song, Andrii Nakryiko Inline bpf_get_branch_snapshot() helper using architecture-agnostic inline BPF code which calls directly into underlying callback of perf_snapshot_branch_stack static call. This callback is set early during kernel initialization and is never updated or reset, so it's ok to fetch actual implementation using static_call_query() and call directly into it. This change eliminates a full function call and saves one LBR entry in PERF_SAMPLE_BRANCH_ANY LBR mode. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> --- kernel/bpf/verifier.c | 37 +++++++++++++++++++++++++++++++++++++ 1 file changed, 37 insertions(+) diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c index de7813947981..4fb6c468e199 100644 --- a/kernel/bpf/verifier.c +++ b/kernel/bpf/verifier.c @@ -20130,6 +20130,43 @@ static int do_misc_fixups(struct bpf_verifier_env *env) goto next_insn; } + /* Implement bpf_get_branch_snapshot inline. */ + if (prog->jit_requested && BITS_PER_LONG == 64 && + insn->imm == BPF_FUNC_get_branch_snapshot) { + /* We are dealing with the following func protos: + * u64 bpf_get_branch_snapshot(void *buf, u32 size, u64 flags); + * int perf_snapshot_branch_stack(struct perf_branch_entry *entries, u32 cnt); + */ + const u32 br_entry_size = sizeof(struct perf_branch_entry); + + /* if (unlikely(flags)) return -EINVAL */ + insn_buf[0] = BPF_JMP_IMM(BPF_JNE, BPF_REG_3, 0, 5); + /* transform size (bytes) into entry_cnt */ + insn_buf[1] = BPF_ALU32_IMM(BPF_DIV, BPF_REG_2, br_entry_size); + /* call perf_snapshot_branch_stack implementation */ + insn_buf[2] = BPF_EMIT_CALL(static_call_query(perf_snapshot_branch_stack)); + /* if (entry_cnt == 0) return -ENOENT */ + insn_buf[3] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4); + /* return entry_cnt * sizeof(struct perf_branch_entry) */ + insn_buf[4] = BPF_ALU32_IMM(BPF_MUL, BPF_REG_0, br_entry_size); + insn_buf[5] = BPF_JMP_A(3); + /* return -EINVAL; */ + insn_buf[6] = BPF_MOV64_IMM(BPF_REG_0, -EINVAL); + insn_buf[7] = BPF_JMP_A(1); + /* return -ENOENT; */ + insn_buf[8] = BPF_MOV64_IMM(BPF_REG_0, -ENOENT); + cnt = 9; + + new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt); + if (!new_prog) + return -ENOMEM; + + delta += cnt - 1; + env->prog = prog = new_prog; + insn = new_prog->insnsi + i + delta; + continue; + } + /* Implement bpf_kptr_xchg inline */ if (prog->jit_requested && BITS_PER_LONG == 64 && insn->imm == BPF_FUNC_kptr_xchg && -- 2.43.0 ^ permalink raw reply related [flat|nested] 23+ messages in thread
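Read back as C, the nine-instruction sequence emitted above implements the
following logic (a readability sketch; perf_snapshot_branch_stack_impl is a
stand-in name for the callback returned by static_call_query()):

	long inlined_get_branch_snapshot(void *buf, u32 size, u64 flags)
	{
		u32 entry_cnt;

		if (flags)						/* insns 0 and 6 */
			return -EINVAL;
		entry_cnt = size / sizeof(struct perf_branch_entry);	/* insn 1 */
		entry_cnt = perf_snapshot_branch_stack_impl(buf, entry_cnt); /* insn 2 */
		if (!entry_cnt)						/* insns 3 and 8 */
			return -ENOENT;
		return entry_cnt * sizeof(struct perf_branch_entry);	/* insn 4 */
	}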
* Re: [PATCH bpf-next 2/3] bpf: inline bpf_get_branch_snapshot() helper 2024-03-21 18:05 ` [PATCH bpf-next 2/3] bpf: inline bpf_get_branch_snapshot() helper Andrii Nakryiko @ 2024-03-21 21:08 ` Jiri Olsa 2024-03-21 21:27 ` Andrii Nakryiko 0 siblings, 1 reply; 23+ messages in thread From: Jiri Olsa @ 2024-03-21 21:08 UTC (permalink / raw) To: Andrii Nakryiko; +Cc: bpf, ast, daniel, martin.lau, peterz, song On Thu, Mar 21, 2024 at 11:05:00AM -0700, Andrii Nakryiko wrote: > Inline bpf_get_branch_snapshot() helper using architecture-agnostic > inline BPF code which calls directly into underlying callback of > perf_snapshot_branch_stack static call. This callback is set early > during kernel initialization and is never updated or reset, so it's ok > to fetch actual implementation using static_call_query() and call > directly into it. > > This change eliminates a full function call and saves one LBR entry > in PERF_SAMPLE_BRANCH_ANY LBR mode. > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > --- > kernel/bpf/verifier.c | 37 +++++++++++++++++++++++++++++++++++++ > 1 file changed, 37 insertions(+) > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index de7813947981..4fb6c468e199 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -20130,6 +20130,43 @@ static int do_misc_fixups(struct bpf_verifier_env *env) > goto next_insn; > } > > + /* Implement bpf_get_branch_snapshot inline. */ > + if (prog->jit_requested && BITS_PER_LONG == 64 && > + insn->imm == BPF_FUNC_get_branch_snapshot) { > + /* We are dealing with the following func protos: > + * u64 bpf_get_branch_snapshot(void *buf, u32 size, u64 flags); > + * int perf_snapshot_branch_stack(struct perf_branch_entry *entries, u32 cnt); > + */ > + const u32 br_entry_size = sizeof(struct perf_branch_entry); > + > + /* if (unlikely(flags)) return -EINVAL */ > + insn_buf[0] = BPF_JMP_IMM(BPF_JNE, BPF_REG_3, 0, 5); nit, you moved the flags check on top, which I think makes sense and we should do it in bpf_get_branch_snapshot as well to keep it same jirka > + /* transform size (bytes) into entry_cnt */ > + insn_buf[1] = BPF_ALU32_IMM(BPF_DIV, BPF_REG_2, br_entry_size); > + /* call perf_snapshot_branch_stack implementation */ > + insn_buf[2] = BPF_EMIT_CALL(static_call_query(perf_snapshot_branch_stack)); > + /* if (entry_cnt == 0) return -ENOENT */ > + insn_buf[3] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4); > + /* return entry_cnt * sizeof(struct perf_branch_entry) */ > + insn_buf[4] = BPF_ALU32_IMM(BPF_MUL, BPF_REG_0, br_entry_size); > + insn_buf[5] = BPF_JMP_A(3); > + /* return -EINVAL; */ > + insn_buf[6] = BPF_MOV64_IMM(BPF_REG_0, -EINVAL); > + insn_buf[7] = BPF_JMP_A(1); > + /* return -ENOENT; */ > + insn_buf[8] = BPF_MOV64_IMM(BPF_REG_0, -ENOENT); > + cnt = 9; > + > + new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt); > + if (!new_prog) > + return -ENOMEM; > + > + delta += cnt - 1; > + env->prog = prog = new_prog; > + insn = new_prog->insnsi + i + delta; > + continue; > + } > + > /* Implement bpf_kptr_xchg inline */ > if (prog->jit_requested && BITS_PER_LONG == 64 && > insn->imm == BPF_FUNC_kptr_xchg && > -- > 2.43.0 > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH bpf-next 2/3] bpf: inline bpf_get_branch_snapshot() helper 2024-03-21 21:08 ` Jiri Olsa @ 2024-03-21 21:27 ` Andrii Nakryiko 0 siblings, 0 replies; 23+ messages in thread From: Andrii Nakryiko @ 2024-03-21 21:27 UTC (permalink / raw) To: Jiri Olsa; +Cc: Andrii Nakryiko, bpf, ast, daniel, martin.lau, peterz, song On Thu, Mar 21, 2024 at 2:08 PM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Thu, Mar 21, 2024 at 11:05:00AM -0700, Andrii Nakryiko wrote: > > Inline bpf_get_branch_snapshot() helper using architecture-agnostic > > inline BPF code which calls directly into underlying callback of > > perf_snapshot_branch_stack static call. This callback is set early > > during kernel initialization and is never updated or reset, so it's ok > > to fetch actual implementation using static_call_query() and call > > directly into it. > > > > This change eliminates a full function call and saves one LBR entry > > in PERF_SAMPLE_BRANCH_ANY LBR mode. > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > --- > > kernel/bpf/verifier.c | 37 +++++++++++++++++++++++++++++++++++++ > > 1 file changed, 37 insertions(+) > > > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > > index de7813947981..4fb6c468e199 100644 > > --- a/kernel/bpf/verifier.c > > +++ b/kernel/bpf/verifier.c > > @@ -20130,6 +20130,43 @@ static int do_misc_fixups(struct bpf_verifier_env *env) > > goto next_insn; > > } > > > > + /* Implement bpf_get_branch_snapshot inline. */ > > + if (prog->jit_requested && BITS_PER_LONG == 64 && > > + insn->imm == BPF_FUNC_get_branch_snapshot) { > > + /* We are dealing with the following func protos: > > + * u64 bpf_get_branch_snapshot(void *buf, u32 size, u64 flags); > > + * int perf_snapshot_branch_stack(struct perf_branch_entry *entries, u32 cnt); > > + */ > > + const u32 br_entry_size = sizeof(struct perf_branch_entry); > > + > > + /* if (unlikely(flags)) return -EINVAL */ > > + insn_buf[0] = BPF_JMP_IMM(BPF_JNE, BPF_REG_3, 0, 5); > > nit, you moved the flags check on top, which I think makes sense and > we should do it in bpf_get_branch_snapshot as well to keep it same > here I could control that it won't be taken in common case, so if we are to do that in BPF helper itself, we should use unlikely() and check that compiler actually honored it. I can add that in the next revision. 
> jirka > > > + /* transform size (bytes) into entry_cnt */ > > + insn_buf[1] = BPF_ALU32_IMM(BPF_DIV, BPF_REG_2, br_entry_size); > > + /* call perf_snapshot_branch_stack implementation */ > > + insn_buf[2] = BPF_EMIT_CALL(static_call_query(perf_snapshot_branch_stack)); > > + /* if (entry_cnt == 0) return -ENOENT */ > > + insn_buf[3] = BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 4); > > + /* return entry_cnt * sizeof(struct perf_branch_entry) */ > > + insn_buf[4] = BPF_ALU32_IMM(BPF_MUL, BPF_REG_0, br_entry_size); > > + insn_buf[5] = BPF_JMP_A(3); > > + /* return -EINVAL; */ > > + insn_buf[6] = BPF_MOV64_IMM(BPF_REG_0, -EINVAL); > > + insn_buf[7] = BPF_JMP_A(1); > > + /* return -ENOENT; */ > > + insn_buf[8] = BPF_MOV64_IMM(BPF_REG_0, -ENOENT); > > + cnt = 9; > > + > > + new_prog = bpf_patch_insn_data(env, i + delta, insn_buf, cnt); > > + if (!new_prog) > > + return -ENOMEM; > > + > > + delta += cnt - 1; > > + env->prog = prog = new_prog; > > + insn = new_prog->insnsi + i + delta; > > + continue; > > + } > > + > > /* Implement bpf_kptr_xchg inline */ > > if (prog->jit_requested && BITS_PER_LONG == 64 && > > insn->imm == BPF_FUNC_kptr_xchg && > > -- > > 2.43.0 > > > > ^ permalink raw reply [flat|nested] 23+ messages in thread
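For reference, the follow-up discussed above (checking flags first in the
helper itself, hinted as unlikely, to match the inlined sequence) would look
roughly like the sketch below; this is the proposed change, not code from this
series:

	BPF_CALL_3(bpf_get_branch_snapshot, void *, buf, u32, size, u64, flags)
	{
		static const u32 br_entry_size = sizeof(struct perf_branch_entry);
		u32 entry_cnt;

		if (unlikely(flags))
			return -EINVAL;

		entry_cnt = size / br_entry_size;
		entry_cnt = static_call(perf_snapshot_branch_stack)(buf, entry_cnt);
		if (!entry_cnt)
			return -ENOENT;

		return entry_cnt * br_entry_size;
	}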
* [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64 2024-03-21 18:04 [PATCH bpf-next 0/3] Inline two LBR-related helpers Andrii Nakryiko 2024-03-21 18:04 ` [PATCH bpf-next 1/3] bpf: make bpf_get_branch_snapshot() architecture-agnostic Andrii Nakryiko 2024-03-21 18:05 ` [PATCH bpf-next 2/3] bpf: inline bpf_get_branch_snapshot() helper Andrii Nakryiko @ 2024-03-21 18:05 ` Andrii Nakryiko 2024-03-21 21:08 ` Jiri Olsa 2024-03-21 23:49 ` Alexei Starovoitov 2024-03-21 23:46 ` [PATCH bpf-next 0/3] Inline two LBR-related helpers Alexei Starovoitov 3 siblings, 2 replies; 23+ messages in thread From: Andrii Nakryiko @ 2024-03-21 18:05 UTC (permalink / raw) To: bpf, ast, daniel, martin.lau; +Cc: peterz, song, Andrii Nakryiko Add arch-specific inlining of bpf_get_smp_processor_id() using x86-64's gs segment-based addressing. Just to be on the safer side both rip-relative addressing is implemented (providing a shorter instruction, but limiting offset to signed 32 bits) and more universal absolute memory offset addressing is used as a fallback in (unlikely) scenario that given offset doesn't fit int s32. The latter is 5 bytes longer, and it seems compilers prefer rip-relative instructions when compiling kernel code. Both instructions were tested and confirmed using gdb. We also already have a BPF selftest (raw_tp_test_run) that validates correctness of bpf_get_smp_processor_id(), while running target BPF program on each online CPU. Here's a disassembly of bpf_get_smp_processor_id() helper: $ gdb -batch -ex 'file vmlinux' -ex 'set disassembly-flavor intel' -ex 'disassemble/r bpf_get_smp_processor_id' Dump of assembler code for function bpf_get_smp_processor_id: 0xffffffff810fa890 <+0>: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0] 0xffffffff810fa895 <+5>: 65 8b 05 70 62 f3 7e mov eax,DWORD PTR gs:[rip+0x7ef36270] # 0x30b0c <pcpu_hot+12> 0xffffffff810fa89c <+12>: 48 98 cdqe 0xffffffff810fa89e <+14>: c3 ret End of assembler dump. And here's a GDB disassembly dump of a piece of BPF program calling bpf_get_smp_processor_id(). $ sudo cat /proc/kallsyms | rg 'pcpu_hot|bpf_prog_2b455b4f8a8d48c5_kexit' 000000000002d840 A pcpu_hot ffffffffa000f8a8 t bpf_prog_2b455b4f8a8d48c5_kexit [bpf] Then attaching GDB to the running kernel in QEMU and breaking inside BPF program: (gdb) b *0xffffffffa000f8e2 Breakpoint 1 at 0xffffffffa000f8e2 When RIP-relative instruction is used: 0xffffffffa000f8e2 mov %gs:0x6001df63(%rip),%eax # 0x2d84c <pcpu_hot+12> 0xffffffffa000f8e9 cltq You can see that final address is resolved to <pcpu_hot+12> as expected. When absolute addressing is used: 0xffffffffa000f8e2 movabs %gs:0x2d84c,%eax 0xffffffffa000f8ed cltq And here 0x2d84c matches pcpu_hot address from kallsyms (0x2d840), plus 12 (0xc) bytes offset of cpu_number field. This inlining eliminates entire function call for this (rather trivial in terms of instructions executed) helper, saving a bit of performance, but foremost saving LBR records (1 for PERF_SAMPLE_BRANCH_ANY_RETURN mode, and 2 for PERF_SAMPLE_BRANCH_ANY), which is what motivated this work in the first place. 
Signed-off-by: Andrii Nakryiko <andrii@kernel.org> --- arch/x86/net/bpf_jit_comp.c | 26 +++++++++++++++++++++++++- 1 file changed, 25 insertions(+), 1 deletion(-) diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c index 4900b1ee019f..5b7fdc24b5b8 100644 --- a/arch/x86/net/bpf_jit_comp.c +++ b/arch/x86/net/bpf_jit_comp.c @@ -457,6 +457,9 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, *pprog = prog; } +/* reference to bpf_get_smp_processor_id() helper implementation to detect it for inlining */ +extern u64 bpf_get_smp_processor_id(u64, u64, u64, u64, u64); + static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode) { u8 *prog = *pprog; @@ -467,7 +470,28 @@ static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode) pr_err("Target call %p is out of range\n", func); return -ERANGE; } - EMIT1_off32(opcode, offset); + + /* inline bpf_get_smp_processor_id() to avoid calls */ + if (opcode == 0xE8 && func == &bpf_get_smp_processor_id) { + /* 7 to account for the mov instruction itself, + * as rip value *after* mov instruction is used + */ + offset = (void *)&pcpu_hot.cpu_number - ip - 7; + if (is_simm32(offset)) { + /* mov eax,DWORD PTR gs:[rip+<offset>] ; <pcpu_hot+12> */ + EMIT3_off32(0x65, 0x8b, 0x05, (u32)offset); + } else { + /* mov eax,DWORD PTR gs:<offset> ; <pcpu_hot+12> */ + offset = (s64)(void *)&pcpu_hot.cpu_number; + EMIT2(0x65, 0xa1); + EMIT((u32)offset, 4); + EMIT((u64)offset >> 32, 4); + } + EMIT2(0x48, 0x98); /* cdqe, zero-extend eax to rax */ + } else { + EMIT1_off32(opcode, offset); + } + *pprog = prog; return 0; } -- 2.43.0 ^ permalink raw reply related [flat|nested] 23+ messages in thread
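The instruction the JIT emits here is the same per-CPU load the kernel itself
uses for the current CPU number; as a reference point (kernel-side C, not code
from the patch):

	/* on x86-64, raw_smp_processor_id() boils down to this per-CPU read,
	 * which the compiler turns into the gs-prefixed mov shown in the
	 * gdb dumps above */
	int cpu = this_cpu_read(pcpu_hot.cpu_number);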
* Re: [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64 2024-03-21 18:05 ` [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64 Andrii Nakryiko @ 2024-03-21 21:08 ` Jiri Olsa 2024-03-21 21:09 ` Andrii Nakryiko 2024-03-21 23:49 ` Alexei Starovoitov 1 sibling, 1 reply; 23+ messages in thread From: Jiri Olsa @ 2024-03-21 21:08 UTC (permalink / raw) To: Andrii Nakryiko; +Cc: bpf, ast, daniel, martin.lau, peterz, song On Thu, Mar 21, 2024 at 11:05:01AM -0700, Andrii Nakryiko wrote: > Add arch-specific inlining of bpf_get_smp_processor_id() using x86-64's > gs segment-based addressing. > > Just to be on the safer side both rip-relative addressing is implemented > (providing a shorter instruction, but limiting offset to signed 32 bits) > and more universal absolute memory offset addressing is used as > a fallback in (unlikely) scenario that given offset doesn't fit int s32. > The latter is 5 bytes longer, and it seems compilers prefer rip-relative > instructions when compiling kernel code. > > Both instructions were tested and confirmed using gdb. We also already > have a BPF selftest (raw_tp_test_run) that validates correctness of > bpf_get_smp_processor_id(), while running target BPF program on each > online CPU. > > Here's a disassembly of bpf_get_smp_processor_id() helper: > > $ gdb -batch -ex 'file vmlinux' -ex 'set disassembly-flavor intel' -ex 'disassemble/r bpf_get_smp_processor_id' > Dump of assembler code for function bpf_get_smp_processor_id: > 0xffffffff810fa890 <+0>: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0] > 0xffffffff810fa895 <+5>: 65 8b 05 70 62 f3 7e mov eax,DWORD PTR gs:[rip+0x7ef36270] # 0x30b0c <pcpu_hot+12> > 0xffffffff810fa89c <+12>: 48 98 cdqe > 0xffffffff810fa89e <+14>: c3 ret > End of assembler dump. > > And here's a GDB disassembly dump of a piece of BPF program calling > bpf_get_smp_processor_id(). > > $ sudo cat /proc/kallsyms | rg 'pcpu_hot|bpf_prog_2b455b4f8a8d48c5_kexit' > 000000000002d840 A pcpu_hot > ffffffffa000f8a8 t bpf_prog_2b455b4f8a8d48c5_kexit [bpf] > > Then attaching GDB to the running kernel in QEMU and breaking inside BPF > program: > > (gdb) b *0xffffffffa000f8e2 > Breakpoint 1 at 0xffffffffa000f8e2 > > When RIP-relative instruction is used: > > 0xffffffffa000f8e2 mov %gs:0x6001df63(%rip),%eax # 0x2d84c <pcpu_hot+12> > 0xffffffffa000f8e9 cltq > > You can see that final address is resolved to <pcpu_hot+12> as expected. > > When absolute addressing is used: > > 0xffffffffa000f8e2 movabs %gs:0x2d84c,%eax > 0xffffffffa000f8ed cltq > > And here 0x2d84c matches pcpu_hot address from kallsyms (0x2d840), > plus 12 (0xc) bytes offset of cpu_number field. > > This inlining eliminates entire function call for this (rather trivial in terms > of instructions executed) helper, saving a bit of performance, but foremost > saving LBR records (1 for PERF_SAMPLE_BRANCH_ANY_RETURN mode, and 2 for > PERF_SAMPLE_BRANCH_ANY), which is what motivated this work in the first > place. this should also 'fix' the k[ret]probe-multi-fast benchmark issue right? 
https://lore.kernel.org/bpf/20240315051813.1320559-2-andrii@kernel.org/ jirka > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > --- > arch/x86/net/bpf_jit_comp.c | 26 +++++++++++++++++++++++++- > 1 file changed, 25 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c > index 4900b1ee019f..5b7fdc24b5b8 100644 > --- a/arch/x86/net/bpf_jit_comp.c > +++ b/arch/x86/net/bpf_jit_comp.c > @@ -457,6 +457,9 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, > *pprog = prog; > } > > +/* reference to bpf_get_smp_processor_id() helper implementation to detect it for inlining */ > +extern u64 bpf_get_smp_processor_id(u64, u64, u64, u64, u64); > + > static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode) > { > u8 *prog = *pprog; > @@ -467,7 +470,28 @@ static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode) > pr_err("Target call %p is out of range\n", func); > return -ERANGE; > } > - EMIT1_off32(opcode, offset); > + > + /* inline bpf_get_smp_processor_id() to avoid calls */ > + if (opcode == 0xE8 && func == &bpf_get_smp_processor_id) { > + /* 7 to account for the mov instruction itself, > + * as rip value *after* mov instruction is used > + */ > + offset = (void *)&pcpu_hot.cpu_number - ip - 7; > + if (is_simm32(offset)) { > + /* mov eax,DWORD PTR gs:[rip+<offset>] ; <pcpu_hot+12> */ > + EMIT3_off32(0x65, 0x8b, 0x05, (u32)offset); > + } else { > + /* mov eax,DWORD PTR gs:<offset> ; <pcpu_hot+12> */ > + offset = (s64)(void *)&pcpu_hot.cpu_number; > + EMIT2(0x65, 0xa1); > + EMIT((u32)offset, 4); > + EMIT((u64)offset >> 32, 4); > + } > + EMIT2(0x48, 0x98); /* cdqe, zero-extend eax to rax */ > + } else { > + EMIT1_off32(opcode, offset); > + } > + > *pprog = prog; > return 0; > } > -- > 2.43.0 > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64 2024-03-21 21:08 ` Jiri Olsa @ 2024-03-21 21:09 ` Andrii Nakryiko 2024-03-21 22:57 ` Jiri Olsa 0 siblings, 1 reply; 23+ messages in thread From: Andrii Nakryiko @ 2024-03-21 21:09 UTC (permalink / raw) To: Jiri Olsa; +Cc: Andrii Nakryiko, bpf, ast, daniel, martin.lau, peterz, song On Thu, Mar 21, 2024 at 2:08 PM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Thu, Mar 21, 2024 at 11:05:01AM -0700, Andrii Nakryiko wrote: > > Add arch-specific inlining of bpf_get_smp_processor_id() using x86-64's > > gs segment-based addressing. > > > > Just to be on the safer side both rip-relative addressing is implemented > > (providing a shorter instruction, but limiting offset to signed 32 bits) > > and more universal absolute memory offset addressing is used as > > a fallback in (unlikely) scenario that given offset doesn't fit int s32. > > The latter is 5 bytes longer, and it seems compilers prefer rip-relative > > instructions when compiling kernel code. > > > > Both instructions were tested and confirmed using gdb. We also already > > have a BPF selftest (raw_tp_test_run) that validates correctness of > > bpf_get_smp_processor_id(), while running target BPF program on each > > online CPU. > > > > Here's a disassembly of bpf_get_smp_processor_id() helper: > > > > $ gdb -batch -ex 'file vmlinux' -ex 'set disassembly-flavor intel' -ex 'disassemble/r bpf_get_smp_processor_id' > > Dump of assembler code for function bpf_get_smp_processor_id: > > 0xffffffff810fa890 <+0>: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0] > > 0xffffffff810fa895 <+5>: 65 8b 05 70 62 f3 7e mov eax,DWORD PTR gs:[rip+0x7ef36270] # 0x30b0c <pcpu_hot+12> > > 0xffffffff810fa89c <+12>: 48 98 cdqe > > 0xffffffff810fa89e <+14>: c3 ret > > End of assembler dump. > > > > And here's a GDB disassembly dump of a piece of BPF program calling > > bpf_get_smp_processor_id(). > > > > $ sudo cat /proc/kallsyms | rg 'pcpu_hot|bpf_prog_2b455b4f8a8d48c5_kexit' > > 000000000002d840 A pcpu_hot > > ffffffffa000f8a8 t bpf_prog_2b455b4f8a8d48c5_kexit [bpf] > > > > Then attaching GDB to the running kernel in QEMU and breaking inside BPF > > program: > > > > (gdb) b *0xffffffffa000f8e2 > > Breakpoint 1 at 0xffffffffa000f8e2 > > > > When RIP-relative instruction is used: > > > > 0xffffffffa000f8e2 mov %gs:0x6001df63(%rip),%eax # 0x2d84c <pcpu_hot+12> > > 0xffffffffa000f8e9 cltq > > > > You can see that final address is resolved to <pcpu_hot+12> as expected. > > > > When absolute addressing is used: > > > > 0xffffffffa000f8e2 movabs %gs:0x2d84c,%eax > > 0xffffffffa000f8ed cltq > > > > And here 0x2d84c matches pcpu_hot address from kallsyms (0x2d840), > > plus 12 (0xc) bytes offset of cpu_number field. > > > > This inlining eliminates entire function call for this (rather trivial in terms > > of instructions executed) helper, saving a bit of performance, but foremost > > saving LBR records (1 for PERF_SAMPLE_BRANCH_ANY_RETURN mode, and 2 for > > PERF_SAMPLE_BRANCH_ANY), which is what motivated this work in the first > > place. > > this should also 'fix' the k[ret]probe-multi-fast benchmark issue right? 
I already fixed it locally by switching to bpf_get_numa_node_id(), but this change would generally make my original approach not work because bpf_get_smp_processor_id() isn't actually called at runtime on x86-64 :) > > https://lore.kernel.org/bpf/20240315051813.1320559-2-andrii@kernel.org/ > > jirka > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > --- > > arch/x86/net/bpf_jit_comp.c | 26 +++++++++++++++++++++++++- > > 1 file changed, 25 insertions(+), 1 deletion(-) > > > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c > > index 4900b1ee019f..5b7fdc24b5b8 100644 > > --- a/arch/x86/net/bpf_jit_comp.c > > +++ b/arch/x86/net/bpf_jit_comp.c > > @@ -457,6 +457,9 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, > > *pprog = prog; > > } > > > > +/* reference to bpf_get_smp_processor_id() helper implementation to detect it for inlining */ > > +extern u64 bpf_get_smp_processor_id(u64, u64, u64, u64, u64); > > + > > static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode) > > { > > u8 *prog = *pprog; > > @@ -467,7 +470,28 @@ static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode) > > pr_err("Target call %p is out of range\n", func); > > return -ERANGE; > > } > > - EMIT1_off32(opcode, offset); > > + > > + /* inline bpf_get_smp_processor_id() to avoid calls */ > > + if (opcode == 0xE8 && func == &bpf_get_smp_processor_id) { > > + /* 7 to account for the mov instruction itself, > > + * as rip value *after* mov instruction is used > > + */ > > + offset = (void *)&pcpu_hot.cpu_number - ip - 7; > > + if (is_simm32(offset)) { > > + /* mov eax,DWORD PTR gs:[rip+<offset>] ; <pcpu_hot+12> */ > > + EMIT3_off32(0x65, 0x8b, 0x05, (u32)offset); > > + } else { > > + /* mov eax,DWORD PTR gs:<offset> ; <pcpu_hot+12> */ > > + offset = (s64)(void *)&pcpu_hot.cpu_number; > > + EMIT2(0x65, 0xa1); > > + EMIT((u32)offset, 4); > > + EMIT((u64)offset >> 32, 4); > > + } > > + EMIT2(0x48, 0x98); /* cdqe, zero-extend eax to rax */ > > + } else { > > + EMIT1_off32(opcode, offset); > > + } > > + > > *pprog = prog; > > return 0; > > } > > -- > > 2.43.0 > > > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64 2024-03-21 21:09 ` Andrii Nakryiko @ 2024-03-21 22:57 ` Jiri Olsa 2024-03-21 23:38 ` Andrii Nakryiko 0 siblings, 1 reply; 23+ messages in thread From: Jiri Olsa @ 2024-03-21 22:57 UTC (permalink / raw) To: Andrii Nakryiko Cc: Jiri Olsa, Andrii Nakryiko, bpf, ast, daniel, martin.lau, peterz, song On Thu, Mar 21, 2024 at 02:09:41PM -0700, Andrii Nakryiko wrote: > On Thu, Mar 21, 2024 at 2:08 PM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > On Thu, Mar 21, 2024 at 11:05:01AM -0700, Andrii Nakryiko wrote: > > > Add arch-specific inlining of bpf_get_smp_processor_id() using x86-64's > > > gs segment-based addressing. > > > > > > Just to be on the safer side both rip-relative addressing is implemented > > > (providing a shorter instruction, but limiting offset to signed 32 bits) > > > and more universal absolute memory offset addressing is used as > > > a fallback in (unlikely) scenario that given offset doesn't fit int s32. > > > The latter is 5 bytes longer, and it seems compilers prefer rip-relative > > > instructions when compiling kernel code. > > > > > > Both instructions were tested and confirmed using gdb. We also already > > > have a BPF selftest (raw_tp_test_run) that validates correctness of > > > bpf_get_smp_processor_id(), while running target BPF program on each > > > online CPU. > > > > > > Here's a disassembly of bpf_get_smp_processor_id() helper: > > > > > > $ gdb -batch -ex 'file vmlinux' -ex 'set disassembly-flavor intel' -ex 'disassemble/r bpf_get_smp_processor_id' > > > Dump of assembler code for function bpf_get_smp_processor_id: > > > 0xffffffff810fa890 <+0>: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0] > > > 0xffffffff810fa895 <+5>: 65 8b 05 70 62 f3 7e mov eax,DWORD PTR gs:[rip+0x7ef36270] # 0x30b0c <pcpu_hot+12> > > > 0xffffffff810fa89c <+12>: 48 98 cdqe > > > 0xffffffff810fa89e <+14>: c3 ret > > > End of assembler dump. > > > > > > And here's a GDB disassembly dump of a piece of BPF program calling > > > bpf_get_smp_processor_id(). > > > > > > $ sudo cat /proc/kallsyms | rg 'pcpu_hot|bpf_prog_2b455b4f8a8d48c5_kexit' > > > 000000000002d840 A pcpu_hot > > > ffffffffa000f8a8 t bpf_prog_2b455b4f8a8d48c5_kexit [bpf] > > > > > > Then attaching GDB to the running kernel in QEMU and breaking inside BPF > > > program: > > > > > > (gdb) b *0xffffffffa000f8e2 > > > Breakpoint 1 at 0xffffffffa000f8e2 > > > > > > When RIP-relative instruction is used: > > > > > > 0xffffffffa000f8e2 mov %gs:0x6001df63(%rip),%eax # 0x2d84c <pcpu_hot+12> > > > 0xffffffffa000f8e9 cltq > > > > > > You can see that final address is resolved to <pcpu_hot+12> as expected. > > > > > > When absolute addressing is used: > > > > > > 0xffffffffa000f8e2 movabs %gs:0x2d84c,%eax > > > 0xffffffffa000f8ed cltq > > > > > > And here 0x2d84c matches pcpu_hot address from kallsyms (0x2d840), > > > plus 12 (0xc) bytes offset of cpu_number field. > > > > > > This inlining eliminates entire function call for this (rather trivial in terms > > > of instructions executed) helper, saving a bit of performance, but foremost > > > saving LBR records (1 for PERF_SAMPLE_BRANCH_ANY_RETURN mode, and 2 for > > > PERF_SAMPLE_BRANCH_ANY), which is what motivated this work in the first > > > place. > > > > this should also 'fix' the k[ret]probe-multi-fast benchmark issue right? 
> > I already fixed it locally by switching to bpf_get_numa_node_id(), but > this change would generally make my original approach not work because > bpf_get_smp_processor_id() isn't actually called at runtime on x86-64 > :) hm, but the reason was that program attached to bpf_get_smp_processor_id called bpf_get_smp_processor_id helper: https://lore.kernel.org/bpf/CAEf4BzayNECKkmc4=XfLW5fzsPozMnjqOEmGO+r2UmEQXt1XyA@mail.gmail.com/ inlining of bpf_get_smp_processor_id helper call would prevent that, no? jirka > > > > > https://lore.kernel.org/bpf/20240315051813.1320559-2-andrii@kernel.org/ > > > > jirka > > > > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > --- > > > arch/x86/net/bpf_jit_comp.c | 26 +++++++++++++++++++++++++- > > > 1 file changed, 25 insertions(+), 1 deletion(-) > > > > > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c > > > index 4900b1ee019f..5b7fdc24b5b8 100644 > > > --- a/arch/x86/net/bpf_jit_comp.c > > > +++ b/arch/x86/net/bpf_jit_comp.c > > > @@ -457,6 +457,9 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, > > > *pprog = prog; > > > } > > > > > > +/* reference to bpf_get_smp_processor_id() helper implementation to detect it for inlining */ > > > +extern u64 bpf_get_smp_processor_id(u64, u64, u64, u64, u64); > > > + > > > static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode) > > > { > > > u8 *prog = *pprog; > > > @@ -467,7 +470,28 @@ static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode) > > > pr_err("Target call %p is out of range\n", func); > > > return -ERANGE; > > > } > > > - EMIT1_off32(opcode, offset); > > > + > > > + /* inline bpf_get_smp_processor_id() to avoid calls */ > > > + if (opcode == 0xE8 && func == &bpf_get_smp_processor_id) { > > > + /* 7 to account for the mov instruction itself, > > > + * as rip value *after* mov instruction is used > > > + */ > > > + offset = (void *)&pcpu_hot.cpu_number - ip - 7; > > > + if (is_simm32(offset)) { > > > + /* mov eax,DWORD PTR gs:[rip+<offset>] ; <pcpu_hot+12> */ > > > + EMIT3_off32(0x65, 0x8b, 0x05, (u32)offset); > > > + } else { > > > + /* mov eax,DWORD PTR gs:<offset> ; <pcpu_hot+12> */ > > > + offset = (s64)(void *)&pcpu_hot.cpu_number; > > > + EMIT2(0x65, 0xa1); > > > + EMIT((u32)offset, 4); > > > + EMIT((u64)offset >> 32, 4); > > > + } > > > + EMIT2(0x48, 0x98); /* cdqe, zero-extend eax to rax */ > > > + } else { > > > + EMIT1_off32(opcode, offset); > > > + } > > > + > > > *pprog = prog; > > > return 0; > > > } > > > -- > > > 2.43.0 > > > > > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64 2024-03-21 22:57 ` Jiri Olsa @ 2024-03-21 23:38 ` Andrii Nakryiko 0 siblings, 0 replies; 23+ messages in thread From: Andrii Nakryiko @ 2024-03-21 23:38 UTC (permalink / raw) To: Jiri Olsa; +Cc: Andrii Nakryiko, bpf, ast, daniel, martin.lau, peterz, song On Thu, Mar 21, 2024 at 3:57 PM Jiri Olsa <olsajiri@gmail.com> wrote: > > On Thu, Mar 21, 2024 at 02:09:41PM -0700, Andrii Nakryiko wrote: > > On Thu, Mar 21, 2024 at 2:08 PM Jiri Olsa <olsajiri@gmail.com> wrote: > > > > > > On Thu, Mar 21, 2024 at 11:05:01AM -0700, Andrii Nakryiko wrote: > > > > Add arch-specific inlining of bpf_get_smp_processor_id() using x86-64's > > > > gs segment-based addressing. > > > > > > > > Just to be on the safer side both rip-relative addressing is implemented > > > > (providing a shorter instruction, but limiting offset to signed 32 bits) > > > > and more universal absolute memory offset addressing is used as > > > > a fallback in (unlikely) scenario that given offset doesn't fit int s32. > > > > The latter is 5 bytes longer, and it seems compilers prefer rip-relative > > > > instructions when compiling kernel code. > > > > > > > > Both instructions were tested and confirmed using gdb. We also already > > > > have a BPF selftest (raw_tp_test_run) that validates correctness of > > > > bpf_get_smp_processor_id(), while running target BPF program on each > > > > online CPU. > > > > > > > > Here's a disassembly of bpf_get_smp_processor_id() helper: > > > > > > > > $ gdb -batch -ex 'file vmlinux' -ex 'set disassembly-flavor intel' -ex 'disassemble/r bpf_get_smp_processor_id' > > > > Dump of assembler code for function bpf_get_smp_processor_id: > > > > 0xffffffff810fa890 <+0>: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0] > > > > 0xffffffff810fa895 <+5>: 65 8b 05 70 62 f3 7e mov eax,DWORD PTR gs:[rip+0x7ef36270] # 0x30b0c <pcpu_hot+12> > > > > 0xffffffff810fa89c <+12>: 48 98 cdqe > > > > 0xffffffff810fa89e <+14>: c3 ret > > > > End of assembler dump. > > > > > > > > And here's a GDB disassembly dump of a piece of BPF program calling > > > > bpf_get_smp_processor_id(). > > > > > > > > $ sudo cat /proc/kallsyms | rg 'pcpu_hot|bpf_prog_2b455b4f8a8d48c5_kexit' > > > > 000000000002d840 A pcpu_hot > > > > ffffffffa000f8a8 t bpf_prog_2b455b4f8a8d48c5_kexit [bpf] > > > > > > > > Then attaching GDB to the running kernel in QEMU and breaking inside BPF > > > > program: > > > > > > > > (gdb) b *0xffffffffa000f8e2 > > > > Breakpoint 1 at 0xffffffffa000f8e2 > > > > > > > > When RIP-relative instruction is used: > > > > > > > > 0xffffffffa000f8e2 mov %gs:0x6001df63(%rip),%eax # 0x2d84c <pcpu_hot+12> > > > > 0xffffffffa000f8e9 cltq > > > > > > > > You can see that final address is resolved to <pcpu_hot+12> as expected. > > > > > > > > When absolute addressing is used: > > > > > > > > 0xffffffffa000f8e2 movabs %gs:0x2d84c,%eax > > > > 0xffffffffa000f8ed cltq > > > > > > > > And here 0x2d84c matches pcpu_hot address from kallsyms (0x2d840), > > > > plus 12 (0xc) bytes offset of cpu_number field. > > > > > > > > This inlining eliminates entire function call for this (rather trivial in terms > > > > of instructions executed) helper, saving a bit of performance, but foremost > > > > saving LBR records (1 for PERF_SAMPLE_BRANCH_ANY_RETURN mode, and 2 for > > > > PERF_SAMPLE_BRANCH_ANY), which is what motivated this work in the first > > > > place. > > > > > > this should also 'fix' the k[ret]probe-multi-fast benchmark issue right? 
> > > > I already fixed it locally by switching to bpf_get_numa_node_id(), but > > this change would generally make my original approach not work because > > bpf_get_smp_processor_id() isn't actually called at runtime on x86-64 > > :) > > hm, but the reason was that program attached to bpf_get_smp_processor_id > called bpf_get_smp_processor_id helper: > https://lore.kernel.org/bpf/CAEf4BzayNECKkmc4=XfLW5fzsPozMnjqOEmGO+r2UmEQXt1XyA@mail.gmail.com/ > > inlining of bpf_get_smp_processor_id helper call would prevent that, no? Yes, I'm agreeing with you. I'm just saying that in v2 I'm attaching to bpf_get_numa_node_id() instead, so all this is irrelevant. > > jirka > > > > > > > > > > https://lore.kernel.org/bpf/20240315051813.1320559-2-andrii@kernel.org/ > > > > > > jirka > > > > > > > > > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > > > --- > > > > arch/x86/net/bpf_jit_comp.c | 26 +++++++++++++++++++++++++- > > > > 1 file changed, 25 insertions(+), 1 deletion(-) > > > > > > > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c > > > > index 4900b1ee019f..5b7fdc24b5b8 100644 > > > > --- a/arch/x86/net/bpf_jit_comp.c > > > > +++ b/arch/x86/net/bpf_jit_comp.c > > > > @@ -457,6 +457,9 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, > > > > *pprog = prog; > > > > } > > > > > > > > +/* reference to bpf_get_smp_processor_id() helper implementation to detect it for inlining */ > > > > +extern u64 bpf_get_smp_processor_id(u64, u64, u64, u64, u64); > > > > + > > > > static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode) > > > > { > > > > u8 *prog = *pprog; > > > > @@ -467,7 +470,28 @@ static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode) > > > > pr_err("Target call %p is out of range\n", func); > > > > return -ERANGE; > > > > } > > > > - EMIT1_off32(opcode, offset); > > > > + > > > > + /* inline bpf_get_smp_processor_id() to avoid calls */ > > > > + if (opcode == 0xE8 && func == &bpf_get_smp_processor_id) { > > > > + /* 7 to account for the mov instruction itself, > > > > + * as rip value *after* mov instruction is used > > > > + */ > > > > + offset = (void *)&pcpu_hot.cpu_number - ip - 7; > > > > + if (is_simm32(offset)) { > > > > + /* mov eax,DWORD PTR gs:[rip+<offset>] ; <pcpu_hot+12> */ > > > > + EMIT3_off32(0x65, 0x8b, 0x05, (u32)offset); > > > > + } else { > > > > + /* mov eax,DWORD PTR gs:<offset> ; <pcpu_hot+12> */ > > > > + offset = (s64)(void *)&pcpu_hot.cpu_number; > > > > + EMIT2(0x65, 0xa1); > > > > + EMIT((u32)offset, 4); > > > > + EMIT((u64)offset >> 32, 4); > > > > + } > > > > + EMIT2(0x48, 0x98); /* cdqe, zero-extend eax to rax */ > > > > + } else { > > > > + EMIT1_off32(opcode, offset); > > > > + } > > > > + > > > > *pprog = prog; > > > > return 0; > > > > } > > > > -- > > > > 2.43.0 > > > > > > > > ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64 2024-03-21 18:05 ` [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64 Andrii Nakryiko 2024-03-21 21:08 ` Jiri Olsa @ 2024-03-21 23:49 ` Alexei Starovoitov 2024-03-22 16:45 ` Andrii Nakryiko 1 sibling, 1 reply; 23+ messages in thread From: Alexei Starovoitov @ 2024-03-21 23:49 UTC (permalink / raw) To: Andrii Nakryiko Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Peter Zijlstra, Song Liu On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote: > > Add arch-specific inlining of bpf_get_smp_processor_id() using x86-64's > gs segment-based addressing. > > Just to be on the safer side both rip-relative addressing is implemented > (providing a shorter instruction, but limiting offset to signed 32 bits) > and more universal absolute memory offset addressing is used as > a fallback in (unlikely) scenario that given offset doesn't fit int s32. > The latter is 5 bytes longer, and it seems compilers prefer rip-relative > instructions when compiling kernel code. > > Both instructions were tested and confirmed using gdb. We also already > have a BPF selftest (raw_tp_test_run) that validates correctness of > bpf_get_smp_processor_id(), while running target BPF program on each > online CPU. > > Here's a disassembly of bpf_get_smp_processor_id() helper: > > $ gdb -batch -ex 'file vmlinux' -ex 'set disassembly-flavor intel' -ex 'disassemble/r bpf_get_smp_processor_id' > Dump of assembler code for function bpf_get_smp_processor_id: > 0xffffffff810fa890 <+0>: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0] > 0xffffffff810fa895 <+5>: 65 8b 05 70 62 f3 7e mov eax,DWORD PTR gs:[rip+0x7ef36270] # 0x30b0c <pcpu_hot+12> > 0xffffffff810fa89c <+12>: 48 98 cdqe > 0xffffffff810fa89e <+14>: c3 ret > End of assembler dump. > > And here's a GDB disassembly dump of a piece of BPF program calling > bpf_get_smp_processor_id(). > > $ sudo cat /proc/kallsyms | rg 'pcpu_hot|bpf_prog_2b455b4f8a8d48c5_kexit' > 000000000002d840 A pcpu_hot > ffffffffa000f8a8 t bpf_prog_2b455b4f8a8d48c5_kexit [bpf] > > Then attaching GDB to the running kernel in QEMU and breaking inside BPF > program: > > (gdb) b *0xffffffffa000f8e2 > Breakpoint 1 at 0xffffffffa000f8e2 > > When RIP-relative instruction is used: > > 0xffffffffa000f8e2 mov %gs:0x6001df63(%rip),%eax # 0x2d84c <pcpu_hot+12> > 0xffffffffa000f8e9 cltq > > You can see that final address is resolved to <pcpu_hot+12> as expected. > > When absolute addressing is used: > > 0xffffffffa000f8e2 movabs %gs:0x2d84c,%eax > 0xffffffffa000f8ed cltq > > And here 0x2d84c matches pcpu_hot address from kallsyms (0x2d840), > plus 12 (0xc) bytes offset of cpu_number field. > > This inlining eliminates entire function call for this (rather trivial in terms > of instructions executed) helper, saving a bit of performance, but foremost > saving LBR records (1 for PERF_SAMPLE_BRANCH_ANY_RETURN mode, and 2 for > PERF_SAMPLE_BRANCH_ANY), which is what motivated this work in the first > place. 
> > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > --- > arch/x86/net/bpf_jit_comp.c | 26 +++++++++++++++++++++++++- > 1 file changed, 25 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c > index 4900b1ee019f..5b7fdc24b5b8 100644 > --- a/arch/x86/net/bpf_jit_comp.c > +++ b/arch/x86/net/bpf_jit_comp.c > @@ -457,6 +457,9 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, > *pprog = prog; > } > > +/* reference to bpf_get_smp_processor_id() helper implementation to detect it for inlining */ > +extern u64 bpf_get_smp_processor_id(u64, u64, u64, u64, u64); > + > static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode) > { > u8 *prog = *pprog; > @@ -467,7 +470,28 @@ static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode) > pr_err("Target call %p is out of range\n", func); > return -ERANGE; > } > - EMIT1_off32(opcode, offset); > + > + /* inline bpf_get_smp_processor_id() to avoid calls */ > + if (opcode == 0xE8 && func == &bpf_get_smp_processor_id) { > + /* 7 to account for the mov instruction itself, > + * as rip value *after* mov instruction is used > + */ > + offset = (void *)&pcpu_hot.cpu_number - ip - 7; > + if (is_simm32(offset)) { > + /* mov eax,DWORD PTR gs:[rip+<offset>] ; <pcpu_hot+12> */ > + EMIT3_off32(0x65, 0x8b, 0x05, (u32)offset); > + } else { > + /* mov eax,DWORD PTR gs:<offset> ; <pcpu_hot+12> */ > + offset = (s64)(void *)&pcpu_hot.cpu_number; > + EMIT2(0x65, 0xa1); > + EMIT((u32)offset, 4); > + EMIT((u64)offset >> 32, 4); > + } > + EMIT2(0x48, 0x98); /* cdqe, zero-extend eax to rax */ Please introduce new pseudo insn that can access per-cpu vars in a generic way instead of hacking a specific case. Then we can use it in get_lookup in percpu array and hash maps improving their performance and in lots of other places. pw-bot: cr ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64 2024-03-21 23:49 ` Alexei Starovoitov @ 2024-03-22 16:45 ` Andrii Nakryiko 2024-03-25 3:28 ` Alexei Starovoitov 0 siblings, 1 reply; 23+ messages in thread From: Andrii Nakryiko @ 2024-03-22 16:45 UTC (permalink / raw) To: Alexei Starovoitov Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Peter Zijlstra, Song Liu On Thu, Mar 21, 2024 at 4:49 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > Add arch-specific inlining of bpf_get_smp_processor_id() using x86-64's > > gs segment-based addressing. > > > > Just to be on the safer side both rip-relative addressing is implemented > > (providing a shorter instruction, but limiting offset to signed 32 bits) > > and more universal absolute memory offset addressing is used as > > a fallback in (unlikely) scenario that given offset doesn't fit int s32. > > The latter is 5 bytes longer, and it seems compilers prefer rip-relative > > instructions when compiling kernel code. > > > > Both instructions were tested and confirmed using gdb. We also already > > have a BPF selftest (raw_tp_test_run) that validates correctness of > > bpf_get_smp_processor_id(), while running target BPF program on each > > online CPU. > > > > Here's a disassembly of bpf_get_smp_processor_id() helper: > > > > $ gdb -batch -ex 'file vmlinux' -ex 'set disassembly-flavor intel' -ex 'disassemble/r bpf_get_smp_processor_id' > > Dump of assembler code for function bpf_get_smp_processor_id: > > 0xffffffff810fa890 <+0>: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0] > > 0xffffffff810fa895 <+5>: 65 8b 05 70 62 f3 7e mov eax,DWORD PTR gs:[rip+0x7ef36270] # 0x30b0c <pcpu_hot+12> > > 0xffffffff810fa89c <+12>: 48 98 cdqe > > 0xffffffff810fa89e <+14>: c3 ret > > End of assembler dump. > > > > And here's a GDB disassembly dump of a piece of BPF program calling > > bpf_get_smp_processor_id(). > > > > $ sudo cat /proc/kallsyms | rg 'pcpu_hot|bpf_prog_2b455b4f8a8d48c5_kexit' > > 000000000002d840 A pcpu_hot > > ffffffffa000f8a8 t bpf_prog_2b455b4f8a8d48c5_kexit [bpf] > > > > Then attaching GDB to the running kernel in QEMU and breaking inside BPF > > program: > > > > (gdb) b *0xffffffffa000f8e2 > > Breakpoint 1 at 0xffffffffa000f8e2 > > > > When RIP-relative instruction is used: > > > > 0xffffffffa000f8e2 mov %gs:0x6001df63(%rip),%eax # 0x2d84c <pcpu_hot+12> > > 0xffffffffa000f8e9 cltq > > > > You can see that final address is resolved to <pcpu_hot+12> as expected. > > > > When absolute addressing is used: > > > > 0xffffffffa000f8e2 movabs %gs:0x2d84c,%eax > > 0xffffffffa000f8ed cltq > > > > And here 0x2d84c matches pcpu_hot address from kallsyms (0x2d840), > > plus 12 (0xc) bytes offset of cpu_number field. > > > > This inlining eliminates entire function call for this (rather trivial in terms > > of instructions executed) helper, saving a bit of performance, but foremost > > saving LBR records (1 for PERF_SAMPLE_BRANCH_ANY_RETURN mode, and 2 for > > PERF_SAMPLE_BRANCH_ANY), which is what motivated this work in the first > > place. 
> > > > Signed-off-by: Andrii Nakryiko <andrii@kernel.org> > > --- > > arch/x86/net/bpf_jit_comp.c | 26 +++++++++++++++++++++++++- > > 1 file changed, 25 insertions(+), 1 deletion(-) > > > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c > > index 4900b1ee019f..5b7fdc24b5b8 100644 > > --- a/arch/x86/net/bpf_jit_comp.c > > +++ b/arch/x86/net/bpf_jit_comp.c > > @@ -457,6 +457,9 @@ static void emit_prologue(u8 **pprog, u32 stack_depth, bool ebpf_from_cbpf, > > *pprog = prog; > > } > > > > +/* reference to bpf_get_smp_processor_id() helper implementation to detect it for inlining */ > > +extern u64 bpf_get_smp_processor_id(u64, u64, u64, u64, u64); > > + > > static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode) > > { > > u8 *prog = *pprog; > > @@ -467,7 +470,28 @@ static int emit_patch(u8 **pprog, void *func, void *ip, u8 opcode) > > pr_err("Target call %p is out of range\n", func); > > return -ERANGE; > > } > > - EMIT1_off32(opcode, offset); > > + > > + /* inline bpf_get_smp_processor_id() to avoid calls */ > > + if (opcode == 0xE8 && func == &bpf_get_smp_processor_id) { > > + /* 7 to account for the mov instruction itself, > > + * as rip value *after* mov instruction is used > > + */ > > + offset = (void *)&pcpu_hot.cpu_number - ip - 7; > > + if (is_simm32(offset)) { > > + /* mov eax,DWORD PTR gs:[rip+<offset>] ; <pcpu_hot+12> */ > > + EMIT3_off32(0x65, 0x8b, 0x05, (u32)offset); > > + } else { > > + /* mov eax,DWORD PTR gs:<offset> ; <pcpu_hot+12> */ > > + offset = (s64)(void *)&pcpu_hot.cpu_number; > > + EMIT2(0x65, 0xa1); > > + EMIT((u32)offset, 4); > > + EMIT((u64)offset >> 32, 4); > > + } > > + EMIT2(0x48, 0x98); /* cdqe, zero-extend eax to rax */ > > Please introduce new pseudo insn that can access per-cpu vars > in a generic way instead of hacking a specific case. Sure, but do they have to be mutually exclusive? One doesn't prevent the other. Having bpf_get_smp_processor_id() inlined benefits tons of existing BPF applications transparently, which sounds like a win to me. Designing and adding new instruction also sounds fine, but it's a separate feature and is more involved and will require careful considerations. E.g., we'll need to think through whether all JITs will be able to implement it in native code (i.e., whether they will have enough free registers to implement this efficiently; x86-64 is lucky to not need any extra registers, I believe ARM64 needs one extra register, though; other architectures I have no idea). Safety considerations as well (do we accept any random offset, or only ones coming from per-CPU BTF variables, etc). I don't know if there are any differences, but per-CPU access for module per-CPU variables is something to look into (I have no clue). And even once we have this instruction and corresponding compiler support, it will still take a while until applications can assume its availability, so that adds to logistics. Also, even with this instruction, getting CPU ID efficiently is still important for cases when a BPF application needs its own per-CPU "storage" solution. I'm not sure it's a good experience to require every user to figure out the pcpu_hot.cpu_number thing by themselves through a few layers of kernel macros. As an interim compromise and solution, would you like me to implement a similar inlining for bpf_this_cpu_ptr() as well? 
It's just as trivial to do as for bpf_get_smp_processor_id(), and would benefit all existing users of bpf_this_cpu_ptr() as well, while relying on existing BTF information and surrounding infrastructure? What's the worst outcome? If the kernel changes how CPU number is defined (not pcpu_hot.cpu_number) and we can't easily adapt JIT logic, we can just stop doing inlining and we'll lose a bit of performance and function call avoidance. Bad, but not API breaking or anything like that. And we will detect when things change, we have a test that checks this logic for each CPU, making sure we get the right one. > Then we can use it in get_lookup in percpu array and hash maps > improving their performance and in lots of other places. > > pw-bot: cr ^ permalink raw reply [flat|nested] 23+ messages in thread
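For readers who have not used it, the sketch below shows the existing way BPF programs consume bpf_this_cpu_ptr() with a per-CPU kernel variable resolved through BTF. It mirrors the pattern used in kernel selftests rather than anything from this patch set, and the attach point is just a placeholder.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* per-CPU kernel variable, resolved against vmlinux BTF at load time */
extern const struct rq runqueues __ksym;

SEC("raw_tp/sched_switch")	/* placeholder attach point */
int this_cpu_rq(void *ctx)
{
	/* pointer to this CPU's struct rq; no explicit CPU id needed */
	struct rq *rq = bpf_this_cpu_ptr(&runqueues);

	bpf_printk("this-cpu rq: %p", rq);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";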
* Re: [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64 2024-03-22 16:45 ` Andrii Nakryiko @ 2024-03-25 3:28 ` Alexei Starovoitov 2024-03-25 17:01 ` Andrii Nakryiko 0 siblings, 1 reply; 23+ messages in thread From: Alexei Starovoitov @ 2024-03-25 3:28 UTC (permalink / raw) To: Andrii Nakryiko Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Peter Zijlstra, Song Liu On Fri, Mar 22, 2024 at 9:45 AM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > Designing and adding new instruction also sounds fine, but it's a > separate feature and is more involved and will require careful > considerations. E.g., we'll need to think through whether all JITs > will be able to implement it in native code (i.e., whether they will > have enough free registers to implement this efficiently; x86-64 is > lucky to not need any extra registers, I believe ARM64 needs one extra > register, though; other architectures I have no idea). Safety > considerations as well (do we accept any random offset, or only ones > coming from per-CPU BTF variables, etc). I don't know if there are any > differences, but per-CPU access for module per-CPU variables is > something to look into (I have no clue). I didn't mean to add an insn that is exposed all the way to bpf prog and requires verification, addr checks, etc. But a pseudo insn that the verifier can emit when it inlines various helpers. It's ok to add them iteratively and revise as we go, since it won't be an api. Like per_cpu_ptr() equivalent insn doesn't need to be added right away. this_cpu_ptr(offset) will be good enough to start. JIT-ing that single insn will be trivial and you implemented it already: EMIT3_off32(0x65, 0x8b, 0x05, (u32)offset); else ... The verifier will emit BPF_THIS_CPU_PTR pseudo insn with pcpu_hot.cpu_number offset when it inlines bpf_get_smp_processor_id(). At the same time it can inline bpf_this_cpu_ptr() with the same insn. And we can finally implement map_ops->map_gen_lookup for per-cpu array and hash map using the same pseudo insn. Those lookups are often used in critical path and the inlining will give a noticeable performance boost. > As an interim compromise and solution, would you like me to implement > a similar inlining for bpf_this_cpu_ptr() as well? It's just as > trivial to do as for bpf_get_smp_processor_id(), and would benefit all > existing users of bpf_this_cpu_ptr() as well, while relying on > existing BTF information and surrounding infrastructure? yes via pseudo insn. after per-cpu maps, various bpf_*redirect*() helpers can be inlined and XDP folks will enjoy extra performance, and likely other cases where we used different solutions instead of emitting per-cpu accesses in the verifier. ^ permalink raw reply [flat|nested] 23+ messages in thread
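To make the pseudo-instruction idea concrete, here is one possible shape of the verifier-side rewrite, strictly as a sketch: BPF_MOV64_PERCPU_REG is an invented name for the hypothetical "turn this offset into a this-CPU address" instruction, and the real encoding and macros would be decided by the actual patches.

/* Hypothetical inline expansion the verifier could emit in place of
 * "call bpf_get_smp_processor_id" (sketch only, not a real patch):
 *
 *   r0 = <per-CPU offset of pcpu_hot.cpu_number>
 *   r0 = this_cpu_address(r0)        // new JIT-only pseudo instruction
 *   r0 = *(u32 *)(r0 + 0)
 */
struct bpf_insn insn_buf[] = {
	BPF_MOV32_IMM(BPF_REG_0, (u32)(unsigned long)&pcpu_hot.cpu_number),
	BPF_MOV64_PERCPU_REG(BPF_REG_0, BPF_REG_0),	/* invented macro name */
	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0),
};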
* Re: [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64 2024-03-25 3:28 ` Alexei Starovoitov @ 2024-03-25 17:01 ` Andrii Nakryiko 0 siblings, 0 replies; 23+ messages in thread From: Andrii Nakryiko @ 2024-03-25 17:01 UTC (permalink / raw) To: Alexei Starovoitov Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Peter Zijlstra, Song Liu On Sun, Mar 24, 2024 at 8:28 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Fri, Mar 22, 2024 at 9:45 AM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > > > Designing and adding new instruction also sounds fine, but it's a > > separate feature and is more involved and will require careful > > considerations. E.g., we'll need to think through whether all JITs > > will be able to implement it in native code (i.e., whether they will > > have enough free registers to implement this efficiently; x86-64 is > > lucky to not need any extra registers, I believe ARM64 needs one extra > > register, though; other architectures I have no idea). Safety > > considerations as well (do we accept any random offset, or only ones > > coming from per-CPU BTF variables, etc). I don't know if there are any > > differences, but per-CPU access for module per-CPU variables is > > something to look into (I have no clue). > > I didn't mean to add an insn that is exposed all the way to > bpf prog and requires verification, addr checks, etc. > But a pseudo insn that the verifier can emit when it inlines > various helpers. > It's ok to add them iteratively and revise as we go, since it > won't be an api. Ah, ok, so an instruction that JIT will know about, but the verifier will reject if provided by the user? Ok, I can do that. > Like per_cpu_ptr() equivalent insn doesn't need to be added right away. > this_cpu_ptr(offset) will be good enough to start. > JIT-ing that single insn will be trivial and > you implemented it already: > EMIT3_off32(0x65, 0x8b, 0x05, (u32)offset); else ... > > The verifier will emit BPF_THIS_CPU_PTR pseudo insn with > pcpu_hot.cpu_number offset when it inlines bpf_get_smp_processor_id(). > > At the same time it can inline bpf_this_cpu_ptr() with the same insn. > sounds good, we'll need to guard all that with arch-specific check whether JIT supports this instruction, but that's ok (eventually we should be able to make this insn support mandatory and clean up those checks) > And we can finally implement map_ops->map_gen_lookup for > per-cpu array and hash map using the same pseudo insn. > Those lookups are often used in critical path and > the inlining will give a noticeable performance boost. yep, I'll look into that as well > > > As an interim compromise and solution, would you like me to implement > > a similar inlining for bpf_this_cpu_ptr() as well? It's just as > > trivial to do as for bpf_get_smp_processor_id(), and would benefit all > > existing users of bpf_this_cpu_ptr() as well, while relying on > > existing BTF information and surrounding infrastructure? > > yes via pseudo insn. > > after per-cpu maps, various bpf_*redirect*() helpers > can be inlined and XDP folks will enjoy extra performance, > and likely other cases where we used different solutions > instead of emitting per-cpu accesses in the verifier. that part I'll leave up to XDP folks :) ^ permalink raw reply [flat|nested] 23+ messages in thread
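The arch-specific guard mentioned here would presumably follow the existing bpf_jit_supports_*() convention; a minimal sketch, with an invented function name:

/* kernel/bpf/core.c: weak default, so the verifier only emits the per-CPU
 * pseudo insn on JITs that opt in (function name invented for this sketch).
 */
bool __weak bpf_jit_supports_percpu_insn(void)
{
	return false;
}

/* arch/x86/net/bpf_jit_comp.c: the x86-64 JIT would then opt in */
bool bpf_jit_supports_percpu_insn(void)
{
	return true;
}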
* Re: [PATCH bpf-next 0/3] Inline two LBR-related helpers 2024-03-21 18:04 [PATCH bpf-next 0/3] Inline two LBR-related helpers Andrii Nakryiko ` (2 preceding siblings ...) 2024-03-21 18:05 ` [PATCH bpf-next 3/3] bpf,x86: inline bpf_get_smp_processor_id() on x86-64 Andrii Nakryiko @ 2024-03-21 23:46 ` Alexei Starovoitov 2024-03-22 16:45 ` Andrii Nakryiko 3 siblings, 1 reply; 23+ messages in thread From: Alexei Starovoitov @ 2024-03-21 23:46 UTC (permalink / raw) To: Andrii Nakryiko Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Peter Zijlstra, Song Liu On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote: > > > There are still ways to reduce number of "wasted" records further, this is > a problem that requires many small and rather independent steps. I feel this is a wrong path to follow. I think it would be better to introduce a flag for kprobe/fentry to do perf_snapshot_branch_stack() as early as possible and then bpf prog can copy these 16 or 32 24-byte entries at its leisure. Hacking all over the kernel and requiring bpf prog to call bpf_get_branch_snapshot() in the first few instructions looks like self-inflicted pain. ^ permalink raw reply [flat|nested] 23+ messages in thread
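For context, the usage pattern being debated looks roughly like the sketch below (not retsnoop's actual code): the program snapshots LBRs as its very first action so that as few records as possible are burned on BPF plumbing. The attach target and variable names are placeholders.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define MAX_LBR 32

struct perf_branch_entry lbrs[MAX_LBR];	/* 32 * 24 = 768 bytes */
long lbrs_sz;				/* bytes filled in by the last run */

SEC("fexit/__sys_bpf")			/* placeholder target */
int BPF_PROG(grab_lbr)
{
	/* take the snapshot before doing anything else */
	lbrs_sz = bpf_get_branch_snapshot(lbrs, sizeof(lbrs), 0);
	if (lbrs_sz <= 0)
		return 0;
	/* ... filter, then emit lbrs[0 .. lbrs_sz / sizeof(lbrs[0])] ... */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";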
* Re: [PATCH bpf-next 0/3] Inline two LBR-related helpers 2024-03-21 23:46 ` [PATCH bpf-next 0/3] Inline two LBR-related helpers Alexei Starovoitov @ 2024-03-22 16:45 ` Andrii Nakryiko 2024-03-25 2:05 ` Alexei Starovoitov 0 siblings, 1 reply; 23+ messages in thread From: Andrii Nakryiko @ 2024-03-22 16:45 UTC (permalink / raw) To: Alexei Starovoitov Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Peter Zijlstra, Song Liu On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > There are still ways to reduce number of "wasted" records further, this is > > a problem that requires many small and rather independent steps. > > I feel this is a wrong path to follow. > I think it would be better to introduce a flag for kprobe/fentry > to do perf_snapshot_branch_stack() as early as possible > and then bpf prog can copy these 16 or 32 8-byte entries at > its leasure. This is basically how Song started when he was adding this feature a few years ago. And I feel like we discussed this and decided that it would be cleaner to let the BPF program decide when (and whether) to get LBR, based on conditions. It still feels like a right tradeoff. Granted, for PERF_SAMPLE_BRANCH_ANY you gotta take it immediately (and that's what retsnoop does), but for PERF_SAMPLE_BRANCH_ANY_RETURN this doesn't have to happen so fast. BPF program can evaluate some conditions and grab LBR optionally, saving the overhead. With prog flag saying "kernel should capture LBR ASAP", we: a) lose this flexibility to decide whether and when to grab LBR; b) pay overhead regardless if LBR is ever actually used for any given prog invocation; c) have to dedicate a pretty large (32 * 24 = 768 bytes) per-CPU buffers for something that is pretty niche (though hugely valuable when needed, of course); d) each program type that supports bpf_get_branch_snapshot() helper needs to implement this logic in their corresponding `bpf_prog_run_xxx()` running helpers, which is more than a few places. Now, let's see how much we can also realistically save with this approach. For fentry, we do save a few (2) entries, indeed. With changes in this patch we are at: [#07] __sys_bpf+0xdfc -> __x64_sys_bpf+0x18 [#06] __x64_sys_bpf+0x1a -> bpf_trampoline_6442508829+0x7f [#05] bpf_trampoline_6442508829+0x9c -> __bpf_prog_enter_recur+0x0 [#04] __bpf_prog_enter_recur+0x9 -> migrate_disable+0x0 [#03] migrate_disable+0x37 -> __bpf_prog_enter_recur+0xe [#02] __bpf_prog_enter_recur+0x43 -> bpf_trampoline_6442508829+0xa1 [#01] bpf_trampoline_6442508829+0xad -> bpf_prog_dc54a596b39d4177_fexit1+0x0 [#00] bpf_prog_dc54a596b39d4177_fexit1+0x101 -> intel_pmu_snapshot_branch_stack+0x0 With flag and kernel support, we'll be at something like [#07] __sys_bpf+0xdfc -> __x64_sys_bpf+0x18 [#06] __x64_sys_bpf+0x1a -> bpf_trampoline_6442508829+0x7f [#05] bpf_trampoline_6442508829+0x9c -> __bpf_prog_enter_recur+0x0 [#04] __bpf_prog_enter_recur+0x9 -> migrate_disable+0x0 [#03] migrate_disable+0x37 -> __bpf_prog_enter_recur+0xe [#02] __bpf_prog_enter_recur+0x43 -> intel_pmu_snapshot_branch_stack+0x0 So we get 2 extra LBRs at the expense of all those downsides I mentioned above. But for kretprobe-multi it's even worse (just 1). 
With changes in this patch set, we are at: [#10] __sys_bpf+0xdfc -> arch_rethook_trampoline+0x0 [#09] arch_rethook_trampoline+0x27 -> arch_rethook_trampoline_callback+0x0 [#08] arch_rethook_trampoline_callback+0x31 -> rethook_trampoline_handler+0x0 [#07] rethook_trampoline_handler+0x6f -> fprobe_exit_handler+0x0 [#06] fprobe_exit_handler+0x3d -> rcu_is_watching+0x0 [#05] rcu_is_watching+0x17 -> fprobe_exit_handler+0x42 [#04] fprobe_exit_handler+0xb4 -> kprobe_multi_link_exit_handler+0x0 [#03] kprobe_multi_link_exit_handler+0x31 -> migrate_disable+0x0 [#02] migrate_disable+0x37 -> kprobe_multi_link_exit_handler+0x36 [#01] kprobe_multi_link_exit_handler+0x5c -> bpf_prog_2b455b4f8a8d48c5_kexit+0x0 [#00] bpf_prog_2b455b4f8a8d48c5_kexit+0xa3 -> intel_pmu_snapshot_branch_stack+0x0 With custom flag support: [#10] __sys_bpf+0xdfc -> arch_rethook_trampoline+0x0 [#09] arch_rethook_trampoline+0x27 -> arch_rethook_trampoline_callback+0x0 [#08] arch_rethook_trampoline_callback+0x31 -> rethook_trampoline_handler+0x0 [#07] rethook_trampoline_handler+0x6f -> fprobe_exit_handler+0x0 [#06] fprobe_exit_handler+0x3d -> rcu_is_watching+0x0 [#05] rcu_is_watching+0x17 -> fprobe_exit_handler+0x42 [#04] fprobe_exit_handler+0xb4 -> kprobe_multi_link_exit_handler+0x0 [#03] kprobe_multi_link_exit_handler+0x31 -> migrate_disable+0x0 [#02] migrate_disable+0x37 -> kprobe_multi_link_exit_handler+0x36 [#01] kprobe_multi_link_exit_handler+0x5c -> intel_pmu_snapshot_branch_stack+0x0 We save just 1 extra LBR record. For PERF_SAMPLE_BRANCH_ANY_RETURN return mode there will be no savings at all. Is it really worth it? Any other improvements (like flattening of rethook call pass somehow) will benefit both approaches equally. > Hacking all over the kernel and requiring bpf prog to call > bpf_get_branch_snapshot() in the first few instructions > looks like self inflicted pain. While inlining bpf_get_branch_snapshot() does benefit only LBR use case, it's a rather typical BPF helper inlining procedure we do for a lot of helpers, so it's not exactly a hack or anything, just an optimization. But inlining bpf_get_smp_processor_id() goes way beyond LBR, it's a pretty frequently used helper used to implement various BPF-program-specific per-CPU usages (recursion protection, temporary storage, or just plain replacing BPF_MAP_TYPE_ARRAY_PERCPU with global variables, which is already a bit faster approach, and now will be even faster). And the implementation is well-contained. ^ permalink raw reply [flat|nested] 23+ messages in thread
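The "per-CPU storage via global variables" pattern mentioned above is roughly the following (a sketch; MAX_CPUS and struct scratch are made-up names), which is why a cheaper bpf_get_smp_processor_id() pays off well beyond LBR collection:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

#define MAX_CPUS 256	/* assumption for the sketch; size it to the real machine */

struct scratch {
	__u64 depth;	/* e.g. recursion guard */
	__u64 tmp[4];	/* e.g. temporary storage */
};

/* plain global array instead of a BPF_MAP_TYPE_PERCPU_ARRAY lookup */
static struct scratch scratch[MAX_CPUS];

static __always_inline struct scratch *this_cpu_scratch(void)
{
	__u32 cpu = bpf_get_smp_processor_id();

	if (cpu >= MAX_CPUS)	/* bounds check for the verifier */
		return NULL;
	return &scratch[cpu];
}

char LICENSE[] SEC("license") = "GPL";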
* Re: [PATCH bpf-next 0/3] Inline two LBR-related helpers 2024-03-22 16:45 ` Andrii Nakryiko @ 2024-03-25 2:05 ` Alexei Starovoitov 2024-03-25 17:20 ` Andrii Nakryiko 0 siblings, 1 reply; 23+ messages in thread From: Alexei Starovoitov @ 2024-03-25 2:05 UTC (permalink / raw) To: Andrii Nakryiko Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Peter Zijlstra, Song Liu On Fri, Mar 22, 2024 at 9:45 AM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > > > > There are still ways to reduce number of "wasted" records further, this is > > > a problem that requires many small and rather independent steps. > > > > I feel this is a wrong path to follow. > > I think it would be better to introduce a flag for kprobe/fentry > > to do perf_snapshot_branch_stack() as early as possible > > and then bpf prog can copy these 16 or 32 8-byte entries at > > its leasure. > > This is basically how Song started when he was adding this feature a > few years ago. And I feel like we discussed this and decided that it > would be cleaner to let the BPF program decide when (and whether) to > get LBR, based on conditions. It still feels like a right tradeoff. Right we discussed it back then and at that time it was about collecting stacks. What's different now is you want to collect all types of branches in retrnoop including plain 'jmp pc+10' and conditional jmps. This is not something that C code can control. always_inline in C and inline by the verifier reduce call frames, but they may have both positive and negative effect when all branches are collected. Hence __always_inline in kprobe_multi_link_prog_run() is a leap of faith with assumptions that compiler won't add jmps before calling into prog, but lots of different compiler flags add instrumentation: kasan, stack protector, security mitigation that count call depth, etc. > Granted, for PERF_SAMPLE_BRANCH_ANY you gotta take it immediately (and > that's what retsnoop does), but for PERF_SAMPLE_BRANCH_ANY_RETURN this > doesn't have to happen so fast. BPF program can evaluate some > conditions and grab LBR optionally, saving the overhead. > > With prog flag saying "kernel should capture LBR ASAP", we: I was suggesting to use per attachment flag. And kprobe is a lost cause. I would do it for fentry only where we can generate 'save LBR' call first thing in the bpf trampoline. > a) lose this flexibility to decide whether and when to grab LBR; > b) pay overhead regardless if LBR is ever actually used for any > given prog invocation; when retsnoop attaches a prog the prog gotta call that 'save LBR' as soon as possible without any branches. So per-attach flag is not really a downside. > c) have to dedicate a pretty large (32 * 24 = 768 bytes) per-CPU > buffers for something that is pretty niche (though hugely valuable > when needed, of course); I wouldn't worry about such a tiny buffer. > d) each program type that supports bpf_get_branch_snapshot() helper > needs to implement this logic in their corresponding > `bpf_prog_run_xxx()` running helpers, which is more than a few places. I think new kfunc that copies from the buffer will do. Nothing needs to change. Maybe bpf_get_branch_snapshot() can be made smart too, but that is optional. > Now, let's see how much we can also realistically save with this approach. 
> > For fentry, we do save a few (2) entries, indeed. With changes in this > patch we are at: > > [#07] __sys_bpf+0xdfc -> __x64_sys_bpf+0x18 > [#06] __x64_sys_bpf+0x1a -> > bpf_trampoline_6442508829+0x7f > [#05] bpf_trampoline_6442508829+0x9c -> __bpf_prog_enter_recur+0x0 > [#04] __bpf_prog_enter_recur+0x9 -> migrate_disable+0x0 > [#03] migrate_disable+0x37 -> __bpf_prog_enter_recur+0xe > [#02] __bpf_prog_enter_recur+0x43 -> > bpf_trampoline_6442508829+0xa1 > [#01] bpf_trampoline_6442508829+0xad -> > bpf_prog_dc54a596b39d4177_fexit1+0x0 > [#00] bpf_prog_dc54a596b39d4177_fexit1+0x101 -> > intel_pmu_snapshot_branch_stack+0x0 > > With flag and kernel support, we'll be at something like > > [#07] __sys_bpf+0xdfc -> __x64_sys_bpf+0x18 > [#06] __x64_sys_bpf+0x1a -> > bpf_trampoline_6442508829+0x7f > [#05] bpf_trampoline_6442508829+0x9c -> __bpf_prog_enter_recur+0x0 > [#04] __bpf_prog_enter_recur+0x9 -> migrate_disable+0x0 > [#03] migrate_disable+0x37 -> __bpf_prog_enter_recur+0xe > [#02] __bpf_prog_enter_recur+0x43 -> > intel_pmu_snapshot_branch_stack+0x0 with flag migrate_disable and prog_enter* will be gone. It will be only bpf_trampoline_ and intel_pmu_snapshot_branch_stack. If we try hard we can inline wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0); Then it will be bpf_trampoline_ only and that's as minimal as it can get. One entry. Hacking all over the kernel with inline won't get anywhere close. > For PERF_SAMPLE_BRANCH_ANY_RETURN return mode there will be no savings > at all. Is it really worth it? any_return is ok today. Especially with fentry. > While inlining bpf_get_branch_snapshot() does benefit only LBR use > case, it's a rather typical BPF helper inlining procedure we do for a > lot of helpers, so it's not exactly a hack or anything, just an > optimization. Inlining bpf_get_branch_snapshot() may be ok, but the way you're doing is not a clear win. Replacing size / 24 (that compiler optimize into mul) with actual divide and adding extra mul will be slower. div+mul vs mul is quite a difference. How noticable is that is questionable, but from inlining perspective it doesn't feel right to do "inline to avoid extra frame" instead of "inline to improve performance". > But inlining bpf_get_smp_processor_id() goes way beyond LBR, it's a > pretty frequently used helper used to implement various > BPF-program-specific per-CPU usages (recursion protection, temporary > storage, or just plain replacing BPF_MAP_TYPE_ARRAY_PERCPU with global > variables, which is already a bit faster approach, and now will be > even faster). And the implementation is well-contained. Agree that inlining bpf_get_smp_processor_id() is a good thing, but please do it cleanly so that per-cpu accessors can be reused in other places. I'll reply with details in the other thread. ^ permalink raw reply [flat|nested] 23+ messages in thread
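The "size / 24 that the compiler optimizes into mul" remark refers to the usual reciprocal-multiplication lowering of division by a constant. Purely as an illustration (this is not code from the helper), a 64-bit x / 24 can be computed without a div instruction as:

#include <stdint.h>

/* x / 24 == (x >> 3) / 3, and dividing by 3 becomes a multiply by the
 * "magic" constant 0xAAAAAAAAAAAAAAAB followed by taking the high bits.
 */
static inline uint64_t div_by_24(uint64_t x)
{
	return (uint64_t)(((unsigned __int128)(x >> 3) * 0xAAAAAAAAAAAAAAABULL) >> 65);
}

With gcc or clang at -O2 a plain size / 24 in C already compiles to this shape, which is the point being made: a JIT- or verifier-inlined sequence has to replicate it to avoid emitting an actual divide.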
* Re: [PATCH bpf-next 0/3] Inline two LBR-related helpers 2024-03-25 2:05 ` Alexei Starovoitov @ 2024-03-25 17:20 ` Andrii Nakryiko 2024-03-26 3:13 ` Alexei Starovoitov 0 siblings, 1 reply; 23+ messages in thread From: Andrii Nakryiko @ 2024-03-25 17:20 UTC (permalink / raw) To: Alexei Starovoitov Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Peter Zijlstra, Song Liu On Sun, Mar 24, 2024 at 7:05 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Fri, Mar 22, 2024 at 9:45 AM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > > > On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > > > > > > On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > > > > > > > There are still ways to reduce number of "wasted" records further, this is > > > > a problem that requires many small and rather independent steps. > > > > > > I feel this is a wrong path to follow. > > > I think it would be better to introduce a flag for kprobe/fentry > > > to do perf_snapshot_branch_stack() as early as possible > > > and then bpf prog can copy these 16 or 32 8-byte entries at > > > its leasure. > > > > This is basically how Song started when he was adding this feature a > > few years ago. And I feel like we discussed this and decided that it > > would be cleaner to let the BPF program decide when (and whether) to > > get LBR, based on conditions. It still feels like a right tradeoff. > > Right we discussed it back then and at that time it was about > collecting stacks. > What's different now is you want to collect all types of branches I was using --lbr=any from the get go (and it actually was a motivation for the entire feature, because we were lacking visibility inside some large function with lots of conditions), but yes, we had to leave with a bunch of entries wasted, which on Intel CPUs with 32 entries was tolerable, but on AMD now is useless (we get only 1-2 useful entries right now). > in retrnoop including plain 'jmp pc+10' and conditional jmps. > This is not something that C code can control. > always_inline in C and inline by the verifier reduce call frames, > but they may have both positive and negative effect when > all branches are collected. > Hence __always_inline in kprobe_multi_link_prog_run() > is a leap of faith with assumptions that compiler won't > add jmps before calling into prog, > but lots of different compiler flags add instrumentation: > kasan, stack protector, security mitigation that count call depth, etc. > I understand that, but at least for now in practice it does help. I have some more changes in fprobe/ftrace space which reduces waste of LBR entries some more (and would be beneficial regardless of this custom flag support we are discussing), and there is some really good news with aggressive inlining. a) I get only 4 entries wasted for multi-kprobe (7 for fentry, still not bad, but this one is harder to optimize) and b) I get +25% speed up for multi-kprobes, which seems like a nice side benefit. So I agree that none of this is any guarantee, but it also is not some binding UAPI, so seems worth doing. And as I pointed above, I don't think I see any regression in performance, rather the opposite. > > > Granted, for PERF_SAMPLE_BRANCH_ANY you gotta take it immediately (and > > that's what retsnoop does), but for PERF_SAMPLE_BRANCH_ANY_RETURN this > > doesn't have to happen so fast. 
BPF program can evaluate some > > conditions and grab LBR optionally, saving the overhead. > > > > With prog flag saying "kernel should capture LBR ASAP", we: > > I was suggesting to use per attachment flag. > And kprobe is a lost cause. > I would do it for fentry only where > we can generate 'save LBR' call first thing in the bpf trampoline. > See above, I get down to just 4 unavoidable LBR entries wasted with multi-kprobe, all without a custom flag anywhere. I just really not think it's worth it to complicate trampoline just for this, we'll save at most 1-2 LBR entries, inlining bpf_get_branch_snapshot() gets all basically the same benefit, but across all supported program types. > > a) lose this flexibility to decide whether and when to grab LBR; > > b) pay overhead regardless if LBR is ever actually used for any > > given prog invocation; > > when retsnoop attaches a prog the prog gotta call that 'save LBR' > as soon as possible without any branches. > So per-attach flag is not really a downside. for retsnoop, yes, but only if this is supported in multi-kprobe, which is the main mode. But see above, I just don't think we have to do this to get almost all the benefit. I just need to inline bpf_get_branch_snapshot(). > > > c) have to dedicate a pretty large (32 * 24 = 768 bytes) per-CPU > > buffers for something that is pretty niche (though hugely valuable > > when needed, of course); > > I wouldn't worry about such a tiny buffer. > > > d) each program type that supports bpf_get_branch_snapshot() helper > > needs to implement this logic in their corresponding > > `bpf_prog_run_xxx()` running helpers, which is more than a few places. > > I think new kfunc that copies from the buffer will do. > Nothing needs to change. > Maybe bpf_get_branch_snapshot() can be made smart too, > but that is optional. > I meant that fentry would need to implement this LBR capture in BPF trampoline, multi-kprobe in its kprobe_multi_link_prog_run, kprobe in still another helper. And so on, we have many targeted "runner" helpers for specific program types. And just implementing this for fentry/fexit is not very useful. > > Now, let's see how much we can also realistically save with this approach. > > > > For fentry, we do save a few (2) entries, indeed. With changes in this > > patch we are at: > > > > [#07] __sys_bpf+0xdfc -> __x64_sys_bpf+0x18 > > [#06] __x64_sys_bpf+0x1a -> > > bpf_trampoline_6442508829+0x7f > > [#05] bpf_trampoline_6442508829+0x9c -> __bpf_prog_enter_recur+0x0 > > [#04] __bpf_prog_enter_recur+0x9 -> migrate_disable+0x0 > > [#03] migrate_disable+0x37 -> __bpf_prog_enter_recur+0xe > > [#02] __bpf_prog_enter_recur+0x43 -> > > bpf_trampoline_6442508829+0xa1 > > [#01] bpf_trampoline_6442508829+0xad -> > > bpf_prog_dc54a596b39d4177_fexit1+0x0 > > [#00] bpf_prog_dc54a596b39d4177_fexit1+0x101 -> > > intel_pmu_snapshot_branch_stack+0x0 > > > > With flag and kernel support, we'll be at something like > > > > [#07] __sys_bpf+0xdfc -> __x64_sys_bpf+0x18 > > [#06] __x64_sys_bpf+0x1a -> > > bpf_trampoline_6442508829+0x7f > > [#05] bpf_trampoline_6442508829+0x9c -> __bpf_prog_enter_recur+0x0 > > [#04] __bpf_prog_enter_recur+0x9 -> migrate_disable+0x0 > > [#03] migrate_disable+0x37 -> __bpf_prog_enter_recur+0xe > > [#02] __bpf_prog_enter_recur+0x43 -> > > intel_pmu_snapshot_branch_stack+0x0 > > with flag migrate_disable and prog_enter* will be gone. I don't think we can get rid of migrate_disable, we need to make sure we are freezing LBR on the CPU on which BPF program will run. 
So it's either preempt_disable or migrate_disable. Yes, __bpf_prog_enter_recur() won't be there if we code-generate code for BPF trampoline (though, ugh, who wants more code generation than necessary, but that's an aside). But then see above, migrate_disable will have to be called before __bpf_prog_enter_recur(), which is just more opaque code generation than necessary. > It will be only bpf_trampoline_ and intel_pmu_snapshot_branch_stack. Note also __x64_sys_bpf+0x1a, it's the artifact of how fexit is implemented, we call into original function and it returns into trampoline. So this seems unavoidable as well without completely changing how trampoline works for fexit. Multi-kprobe actually, conveniently, avoids this problem. > If we try hard we can inline > wrmsrl(MSR_CORE_PERF_GLOBAL_CTRL, 0); > Then it will be bpf_trampoline_ only and > that's as minimal as it can get. One entry. > Hacking all over the kernel with inline won't get anywhere close. It's not like I'm doing something technically wrong, just enabling more inlining. And it's not really all over the kernel, few targeted places that deal with LBR and BPF programs running. > > > For PERF_SAMPLE_BRANCH_ANY_RETURN return mode there will be no savings > > at all. Is it really worth it? > > any_return is ok today. > Especially with fentry. Yes, even today we are at 2-3 entries, I'm not too worried about this in general. > > > While inlining bpf_get_branch_snapshot() does benefit only LBR use > > case, it's a rather typical BPF helper inlining procedure we do for a > > lot of helpers, so it's not exactly a hack or anything, just an > > optimization. > > Inlining bpf_get_branch_snapshot() may be ok, > but the way you're doing is not a clear win. > Replacing size / 24 (that compiler optimize into mul) > with actual divide and adding extra mul will be slower. > div+mul vs mul is quite a difference. > How noticable is that is questionable, > but from inlining perspective it doesn't feel right to do > "inline to avoid extra frame" instead of > "inline to improve performance". Yes, I saw this division-through-multiplication division. I can replicate that as well, but it will be pretty hard to understand, so I thought it is probably not worth it. Note that bpf_get_branch_snapshot() is not some sort of performance-critical helper, if you are calling it on some super frequent kprobe/fentry, you are paying a lot of price just for copying 700+ bytes (and probably a bunch of other stuff). So I think it's a wrong tradeoff to optimize for performance. bpf_get_branch_snapshot() is about information (most complete LBR), and that's why the inlining. I know it's a bit unconventional compared to other inlining cases, but it's still valid objection, no? > > > But inlining bpf_get_smp_processor_id() goes way beyond LBR, it's a > > pretty frequently used helper used to implement various > > BPF-program-specific per-CPU usages (recursion protection, temporary > > storage, or just plain replacing BPF_MAP_TYPE_ARRAY_PERCPU with global > > variables, which is already a bit faster approach, and now will be > > even faster). And the implementation is well-contained. > > Agree that inlining bpf_get_smp_processor_id() is a good thing, > but please do it cleanly so that per-cpu accessors can be > reused in other places. > I'll reply with details in the other thread. Agreed, internal special instruction makes sense, replied on that patch as well. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH bpf-next 0/3] Inline two LBR-related helpers 2024-03-25 17:20 ` Andrii Nakryiko @ 2024-03-26 3:13 ` Alexei Starovoitov 2024-03-26 16:50 ` Andrii Nakryiko 0 siblings, 1 reply; 23+ messages in thread From: Alexei Starovoitov @ 2024-03-26 3:13 UTC (permalink / raw) To: Andrii Nakryiko Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Peter Zijlstra, Song Liu On Mon, Mar 25, 2024 at 10:21 AM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > On Sun, Mar 24, 2024 at 7:05 PM Alexei Starovoitov > <alexei.starovoitov@gmail.com> wrote: > > > > On Fri, Mar 22, 2024 at 9:45 AM Andrii Nakryiko > > <andrii.nakryiko@gmail.com> wrote: > > > > > > On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > > > > > > > > > > There are still ways to reduce number of "wasted" records further, this is > > > > > a problem that requires many small and rather independent steps. > > > > > > > > I feel this is a wrong path to follow. > > > > I think it would be better to introduce a flag for kprobe/fentry > > > > to do perf_snapshot_branch_stack() as early as possible > > > > and then bpf prog can copy these 16 or 32 8-byte entries at > > > > its leasure. > > > > > > This is basically how Song started when he was adding this feature a > > > few years ago. And I feel like we discussed this and decided that it > > > would be cleaner to let the BPF program decide when (and whether) to > > > get LBR, based on conditions. It still feels like a right tradeoff. > > > > Right we discussed it back then and at that time it was about > > collecting stacks. > > What's different now is you want to collect all types of branches > > I was using --lbr=any and that's probably something to fix. I suspect retsnoop quality won't suffer if ARCH_LBR_REL_JMP is disabled. To figure out the path to return in the code ARCH_LBR_JCC | ARCH_LBR_REL_CALL | ARCH_LBR_IND_CALL | ARCH_LBR_RETURN might be good enough and there won't be a need to do inling in odd places just to avoid tail jmp. > do this to get almost all the benefit. I just need to inline > bpf_get_branch_snapshot(). If that is the only one that needs inling then fine, but I really don't like to always_inline kprobe_multi_link_prog_run(). A day goes by and somebody will send a patch to save 500 bytes of kernel .text by removing always_inline. The argument that it's there to help a user space tool that wants to do lbr=all instead of excluding rel_jmp won't look good. > > I don't think we can get rid of migrate_disable, we need to make sure > we are freezing LBR on the CPU on which BPF program will run. So it's > either preempt_disable or migrate_disable. we cannot extend preempt disable across the prog and migrate_disable won't really help, since there could be another prog on the same cpu doing the same "save lbr" action in a different hook that will trash per-cpu scratch space. But we don't need to either migrate_disable or preempt_disable. We can have a 32*24 byte buffer per attach point. In case of fentry it can be in bpf_trampoline or in bpf_link (I didn't analyze pros/cons too far) and fentry will only do the single "call intel_pmu_snapshot_branch_stack" with that address. That's a trivial addition to arch_prepare_bpf_trampoline. Then bpf prog will take entries from link, since it has access to it. Same thing for kprobes. As soon as it's triggered it will call intel_pmu_snapshot_branch_stack. 
I suspect retsnoop quality won't suffer if ARCH_LBR_REL_JMP is disabled.
To figure out the path to return in the code
ARCH_LBR_JCC |
ARCH_LBR_REL_CALL |
ARCH_LBR_IND_CALL |
ARCH_LBR_RETURN
might be good enough and there won't be a need to do
inling in odd places just to avoid tail jmp.
> do this to get almost all the benefit. I just need to inline
> bpf_get_branch_snapshot().
If that is the only one that needs inling then fine,
but I really don't like to always_inline kprobe_multi_link_prog_run().
A day goes by and somebody will send a patch to save 500 bytes of
kernel .text by removing always_inline.
The argument that it's there to help a user space tool that wants to do
lbr=all instead of excluding rel_jmp won't look good.
> I don't think we can get rid of migrate_disable, we need to make sure
> we are freezing LBR on the CPU on which BPF program will run. So it's
> either preempt_disable or migrate_disable.
we cannot extend preempt disable across the prog and migrate_disable
won't really help, since there could be another prog on the same cpu
doing the same "save lbr" action in a different hook that will trash
per-cpu scratch space.
But we don't need to either migrate_disable or preempt_disable.
We can have a 32*24 byte buffer per attach point.
In case of fentry it can be in bpf_trampoline or in bpf_link
(I didn't analyze pros/cons too far) and fentry will only do the single
"call intel_pmu_snapshot_branch_stack" with that address.
That's a trivial addition to arch_prepare_bpf_trampoline.
Then bpf prog will take entries from link, since it has access to it.
Same thing for kprobes. As soon as it's triggered it will
call intel_pmu_snapshot_branch_stack.
Should be simple to add. Recursion can overwrite that per-attach buffer, but lbr is screwed anyway if we recursed. So not a concern. > Note also __x64_sys_bpf+0x1a, it's the artifact of how fexit is > implemented, we call into original function and it returns into > trampoline. So this seems unavoidable as well without completely > changing how trampoline works for fexit. Multi-kprobe actually, > conveniently, avoids this problem. Definitely do not want to redesign that to help retsnoop save an lbr entry. > > Inlining bpf_get_branch_snapshot() may be ok, > > but the way you're doing is not a clear win. > > Replacing size / 24 (that compiler optimize into mul) > > with actual divide and adding extra mul will be slower. > > div+mul vs mul is quite a difference. > > How noticable is that is questionable, > > but from inlining perspective it doesn't feel right to do > > "inline to avoid extra frame" instead of > > "inline to improve performance". > > Yes, I saw this division-through-multiplication division. I can > replicate that as well, but it will be pretty hard to understand, so I > thought it is probably not worth it. Note that > bpf_get_branch_snapshot() is not some sort of performance-critical > helper, if you are calling it on some super frequent kprobe/fentry, > you are paying a lot of price just for copying 700+ bytes (and > probably a bunch of other stuff). div is the slowest instruction. On Skylake it takes 57 uops and 40-90 cycles while mul takes 3 cycles. L1 cache is 1-2. So it might be faster to copy 768 bytes than do a single div. I still think that adding call intel_pmu_snapshot_branch_stack to fentry and kprobes is a simpler and cleaner solution that eliminates all guesswork of compiler inlining and optimizations. We can potentially optimize it further. Since arch_prepare_bpf_trampoline() is arch-specific, for x86 we can inline: local_irq_save(flags); __intel_pmu_disable_all(false); __intel_pmu_lbr_disable(); into generated trampoline (since above is just 5-6 instructions) and call into __intel_pmu_snapshot_branch_stack. ^ permalink raw reply [flat|nested] 23+ messages in thread
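A rough sketch of what the per-attachment buffer idea could look like (struct bpf_lbr_buf and bpf_snapshot_lbr are invented names; the only real API used is the perf_snapshot_branch_stack static call that bpf_get_branch_snapshot() itself is built on):

#include <linux/perf_event.h>
#include <linux/static_call.h>

/* hypothetical per-attachment snapshot area: 32 entries * 24 bytes = 768 bytes */
struct bpf_lbr_buf {
	struct perf_branch_entry entries[32];
	unsigned int cnt;
};

/* would be emitted as the very first call by the trampoline / kprobe glue */
static void bpf_snapshot_lbr(struct bpf_lbr_buf *buf)
{
	buf->cnt = static_call(perf_snapshot_branch_stack)(buf->entries, 32);
	/* a later kfunc would just copy entries[0..cnt] into the prog's buffer */
}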
* Re: [PATCH bpf-next 0/3] Inline two LBR-related helpers 2024-03-26 3:13 ` Alexei Starovoitov @ 2024-03-26 16:50 ` Andrii Nakryiko 2024-03-27 21:59 ` Alexei Starovoitov 0 siblings, 1 reply; 23+ messages in thread From: Andrii Nakryiko @ 2024-03-26 16:50 UTC (permalink / raw) To: Alexei Starovoitov Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Peter Zijlstra, Song Liu On Mon, Mar 25, 2024 at 8:13 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Mon, Mar 25, 2024 at 10:21 AM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > > > On Sun, Mar 24, 2024 at 7:05 PM Alexei Starovoitov > > <alexei.starovoitov@gmail.com> wrote: > > > > > > On Fri, Mar 22, 2024 at 9:45 AM Andrii Nakryiko > > > <andrii.nakryiko@gmail.com> wrote: > > > > > > > > On Thu, Mar 21, 2024 at 4:46 PM Alexei Starovoitov > > > > <alexei.starovoitov@gmail.com> wrote: > > > > > > > > > > On Thu, Mar 21, 2024 at 11:05 AM Andrii Nakryiko <andrii@kernel.org> wrote: > > > > > > > > > > > > > > > > > > There are still ways to reduce number of "wasted" records further, this is > > > > > > a problem that requires many small and rather independent steps. > > > > > > > > > > I feel this is a wrong path to follow. > > > > > I think it would be better to introduce a flag for kprobe/fentry > > > > > to do perf_snapshot_branch_stack() as early as possible > > > > > and then bpf prog can copy these 16 or 32 8-byte entries at > > > > > its leasure. > > > > > > > > This is basically how Song started when he was adding this feature a > > > > few years ago. And I feel like we discussed this and decided that it > > > > would be cleaner to let the BPF program decide when (and whether) to > > > > get LBR, based on conditions. It still feels like a right tradeoff. > > > > > > Right we discussed it back then and at that time it was about > > > collecting stacks. > > > What's different now is you want to collect all types of branches > > > > I was using --lbr=any > > and that's probably something to fix. Fix in the sense to adjust or add another generic PERF_SAMPLE_BRANCH_xxx value? Or you mean stop using --lbr=any mode? > I suspect retsnoop quality won't suffer if ARCH_LBR_REL_JMP is disabled. > To figure out the path to return in the code > ARCH_LBR_JCC | > ARCH_LBR_REL_CALL | > ARCH_LBR_IND_CALL | > ARCH_LBR_RETURN > > might be good enough and there won't be a need to do > inling in odd places just to avoid tail jmp. retsnoop supports all modes perf exposes generically (see [0]), I believe I tried all of them and keep gravitating back to --lbr=any as most useful, unfortunately. But it's ok, let's put this particular __always_inline on pause for now, it's one LBR record more, not the end of the world. [0] https://github.com/anakryiko/retsnoop/blob/master/src/retsnoop.c#L269-L280 > > > do this to get almost all the benefit. I just need to inline > > bpf_get_branch_snapshot(). > > If that is the only one that needs inling then fine, yes, it's one of the most important ones, I'll take it :) > but I really don't like to always_inline > kprobe_multi_link_prog_run(). > A day goes by and somebody will send a patch > to save 500 bytes of kernel .text by removing always_inline. > The argument that it's there to help a user space tool that > wants to do lbr=all instead of excluding rel_jmp > won't look good. > There is a lot of __always_inline in rethook/ftrace code, as well as some in BPF code as well. I don't remember people trying to roll this back, so this seems a bit overdramatic. 
But ok, if you think it will be problematic to reject such hypothetical patches, let's put kprobe_multi_link_prog_run inlining aside for now. > > > > I don't think we can get rid of migrate_disable, we need to make sure > > we are freezing LBR on the CPU on which BPF program will run. So it's > > either preempt_disable or migrate_disable. > > we cannot extend preempt disable across the prog > and migrate_disable won't really help, since > there could be another prog on the same cpu > doing the same "save lbr" action in a different hook > that will trash per-cpu scratch space. > > But we don't need to either migrate_disable or preempt_disable. > We can have a 32*24 byte buffer per attach point. I'm missing how we can get away from having a per-CPU buffer. LBRs on different CPU cores are completely independent and one BPF prog attachment can be simultaneously running on many CPUs. Or do you mean per-CPU allocation for each attach point? We can do LBR capture before migrate_disable calls and still have correct data most of the time (hopefully), though, so yeah, it can be another improvement (but with inlining of those two BPF helpers I'm not sure we have to do this just yet). > In case of fentry it can be in bpf_trampoline or in bpf_link > (I didn't analyze pros/cons too far) and > fentry will only do the single "call intel_pmu_snapshot_branch_stack" > with that address. > That's a trivial addition to arch_prepare_bpf_trampoline. > > Then bpf prog will take entries from link, since it has access to it. > > Same thing for kprobes. As soon as it's triggered it will > call intel_pmu_snapshot_branch_stack. > Should be simple to add. > > Recursion can overwrite that per-attach buffer, but > lbr is screwed anyway if we recursed. So not a concern. > > > Note also __x64_sys_bpf+0x1a, it's the artifact of how fexit is > > implemented, we call into original function and it returns into > > trampoline. So this seems unavoidable as well without completely > > changing how trampoline works for fexit. Multi-kprobe actually, > > conveniently, avoids this problem. > > Definitely do not want to redesign that to help retsnoop save an lbr entry. Yep. > > > > Inlining bpf_get_branch_snapshot() may be ok, > > > but the way you're doing is not a clear win. > > > Replacing size / 24 (that compiler optimize into mul) > > > with actual divide and adding extra mul will be slower. > > > div+mul vs mul is quite a difference. > > > How noticable is that is questionable, > > > but from inlining perspective it doesn't feel right to do > > > "inline to avoid extra frame" instead of > > > "inline to improve performance". > > > > Yes, I saw this division-through-multiplication division. I can > > replicate that as well, but it will be pretty hard to understand, so I > > thought it is probably not worth it. Note that > > bpf_get_branch_snapshot() is not some sort of performance-critical > > helper, if you are calling it on some super frequent kprobe/fentry, > > you are paying a lot of price just for copying 700+ bytes (and > > probably a bunch of other stuff). > > div is the slowest instruction. > On skylake it takes 57 uops and 40-90 cycles while mul takes 3 cycles. > L1 cache is 1-2. > So it might be faster to copy 768 bytes than do a single div. Ok, I can add the div-through-multiplication code to keep it 1:1 w/ compiled helper code, no problem. 
> > I still think that adding call intel_pmu_snapshot_branch_stack > to fenty and kprobes is a simpler and cleaner solution that eliminates > all guess work of compiler inlining and optimizations. > > We can potentially optimize it further. > Since arch_prepare_bpf_trampoline() is arch specific, > for x86 we can inline: > local_irq_save(flags); > __intel_pmu_disable_all(false); > __intel_pmu_lbr_disable(); > into generated trampoline (since above is just 5-6 instructions) > and call into __intel_pmu_snapshot_branch_stack. Let's keep it as plan B, given this gets into gnarly internals of Intel-specific x86-64 code. ^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH bpf-next 0/3] Inline two LBR-related helpers 2024-03-26 16:50 ` Andrii Nakryiko @ 2024-03-27 21:59 ` Alexei Starovoitov 2024-03-28 22:53 ` Andrii Nakryiko 0 siblings, 1 reply; 23+ messages in thread From: Alexei Starovoitov @ 2024-03-27 21:59 UTC (permalink / raw) To: Andrii Nakryiko Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Peter Zijlstra, Song Liu On Tue, Mar 26, 2024 at 9:50 AM Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote: > > Fix in the sense to adjust or add another generic > PERF_SAMPLE_BRANCH_xxx value? Or you mean stop using --lbr=any mode? > > > I suspect retsnoop quality won't suffer if ARCH_LBR_REL_JMP is disabled. > > To figure out the path to return in the code > > ARCH_LBR_JCC | > > ARCH_LBR_REL_CALL | > > ARCH_LBR_IND_CALL | > > ARCH_LBR_RETURN > > > > might be good enough and there won't be a need to do > > inling in odd places just to avoid tail jmp. > > retsnoop supports all modes perf exposes generically (see [0]), I > believe I tried all of them and keep gravitating back to --lbr=any as > most useful, unfortunately. I mean to use PERF_SAMPLE_BRANCH_ANY_CALL | PERF_SAMPLE_BRANCH_COND which I suspect will exclude ARCH_LBR_REL_JMP and will avoid counting normal goto-s. > I'm missing how we can get away from having a per-CPU buffer. LBRs on > different CPU cores are completely independent and one BPF prog > attachment can be simultaneously running on many CPUs. > > Or do you mean per-CPU allocation for each attach point? > > We can do LBR capture before migrate_disable calls and still have > correct data most of the time (hopefully), though, so yeah, it can be > another improvement (but with inlining of those two BPF helpers I'm > not sure we have to do this just yet). I meant 32*24 buffer per attachment. Doing per-attachment per-cpu might not scale? It's certainly cleaner with per-cpu though. With a single per-attach buffer the assumption was that the different cpus will likely take the same path towards that kprobe. retsnoop doesn't care which cpu it collected that stack trace from. It cares about the code path and it will be there. We can make the whole thing configurable. bpf_link_attach would specify a buffer or per-cpu or a buffer right in the bpf map array that should be used to store the lbr trace. ^ permalink raw reply [flat|nested] 23+ messages in thread
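For reference, the filter combination suggested above corresponds to the following user-space perf event setup, shown as a simplified sketch of what a tool like retsnoop does when it arms LBR capture (not its literal code):

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_lbr_event(int cpu)
{
	struct perf_event_attr attr = {
		.type = PERF_TYPE_HARDWARE,
		.config = PERF_COUNT_HW_CPU_CYCLES,
		.size = sizeof(attr),
		.sample_type = PERF_SAMPLE_BRANCH_STACK,
		/* narrower than PERF_SAMPLE_BRANCH_ANY: calls + conditional jumps */
		.branch_sample_type = PERF_SAMPLE_BRANCH_KERNEL |
				      PERF_SAMPLE_BRANCH_ANY_CALL |
				      PERF_SAMPLE_BRANCH_COND,
	};

	/* one event per CPU so there are LBRs for the BPF side to freeze */
	return syscall(__NR_perf_event_open, &attr, -1 /* any pid */, cpu,
		       -1 /* no group */, 0);
}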
* Re: [PATCH bpf-next 0/3] Inline two LBR-related helpers 2024-03-27 21:59 ` Alexei Starovoitov @ 2024-03-28 22:53 ` Andrii Nakryiko 0 siblings, 0 replies; 23+ messages in thread From: Andrii Nakryiko @ 2024-03-28 22:53 UTC (permalink / raw) To: Alexei Starovoitov Cc: Andrii Nakryiko, bpf, Alexei Starovoitov, Daniel Borkmann, Martin KaFai Lau, Peter Zijlstra, Song Liu On Wed, Mar 27, 2024 at 2:59 PM Alexei Starovoitov <alexei.starovoitov@gmail.com> wrote: > > On Tue, Mar 26, 2024 at 9:50 AM Andrii Nakryiko > <andrii.nakryiko@gmail.com> wrote: > > > > Fix in the sense to adjust or add another generic > > PERF_SAMPLE_BRANCH_xxx value? Or you mean stop using --lbr=any mode? > > > > > I suspect retsnoop quality won't suffer if ARCH_LBR_REL_JMP is disabled. > > > To figure out the path to return in the code > > > ARCH_LBR_JCC | > > > ARCH_LBR_REL_CALL | > > > ARCH_LBR_IND_CALL | > > > ARCH_LBR_RETURN > > > > > > might be good enough and there won't be a need to do > > > inling in odd places just to avoid tail jmp. > > > > retsnoop supports all modes perf exposes generically (see [0]), I > > believe I tried all of them and keep gravitating back to --lbr=any as > > most useful, unfortunately. > > I mean to use PERF_SAMPLE_BRANCH_ANY_CALL | PERF_SAMPLE_BRANCH_COND > which I suspect will exclude ARCH_LBR_REL_JMP > and will avoid counting normal goto-s. This would be equivalent to passing `--lbr=any_call --lbr=cond` to retsnoop. And yes, you are right that it doesn't record unconditional jumps, saving a few more LBR frames. The problem is that it makes it super hard to follow what's going on without disassembling all relevant functions down to assembly and following *very carefully* to understand the flow of logic. This is normally absolutely unnecessary with --lbr=any (unless DWARF line info is screwed up, of course). --lbr=any allows to understand C statement-level code flow based on file:line information alone. So, in summary, yes `--lbr=any_call --lbr=cond` is a good last resort option if a few more LBR records are necessary. But it doesn't feel like something I'd recommend typical users to start with. > > > I'm missing how we can get away from having a per-CPU buffer. LBRs on > > different CPU cores are completely independent and one BPF prog > > attachment can be simultaneously running on many CPUs. > > > > Or do you mean per-CPU allocation for each attach point? > > > > We can do LBR capture before migrate_disable calls and still have > > correct data most of the time (hopefully), though, so yeah, it can be > > another improvement (but with inlining of those two BPF helpers I'm > > not sure we have to do this just yet). > > I meant 32*24 buffer per attachment. > Doing per-attachment per-cpu might not scale? > It's certainly cleaner with per-cpu though. > With a single per-attach buffer the assumption was that the different > cpus will likely take the same path towards that kprobe. This is a rather bold assumption. It might be true in a lot of cases, but certainly not always. Think about tracing __sys_bpf for errors. There could be many simultaneous bpf() syscalls in the system, some returning expected -ENOENT for expected map element iteration (so not really interesting), but some resulting in confusing failures (and thus "interesting"). It's even worse with some file system related functions. My point is that in retsnoop I don't want to rely on these assumptions. I won't stop anyone trying to add this functionality in the kernel, but I don't see retsnoop relying on something like that. 
> retsnoop doesn't care which cpu it collected that stack trace from. > It cares about the code path and it will be there. It does, retsnoop has a bunch of filters to narrow down specific conditions, where process information is taken into account, among other things. And this filtering capabilities will just grow over time. > We can make the whole thing configurable. > bpf_link_attach would specify a buffer or per-cpu or > a buffer right in the bpf map array that should be used > to store the lbr trace. I'm pretty happy with existing functionality and don't really need new APIs. I'll be posting a few more patches improving performance (not just LBR usage) over the next few days. With those I get close enough: 4 LBR records waste with --lbr=any. Thanks for the brainstorming, internal per-CPU instruction implementation turned out pretty well, I'll be sending a patch set soon. ^ permalink raw reply [flat|nested] 23+ messages in thread