* [PATCH bpf-next 0/3] bpf: Introduce BPF_BRANCH_SNAPSHOT_F_COPY flag for bpf_get_branch_snapshot helper
@ 2026-01-09 15:34 Leon Hwang
  2026-01-09 15:34 ` [PATCH bpf-next 1/3] bpf, x64: Call perf_snapshot_branch_stack in trampoline Leon Hwang
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Leon Hwang @ 2026-01-09 15:34 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S . Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Matt Bobrowski, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Shuah Khan, Leon Hwang, netdev, linux-kernel,
	linux-trace-kernel, linux-kselftest, kernel-patches-bot

When the PMU LBR is running in branch-sensitive mode,
'perf_snapshot_branch_stack()' may capture branch entries from the
trampoline entry up to the call site inside a BPF program. These branch
entries are not useful for analyzing the control flow of the tracee.

To eliminate such noise for tracing programs, the branch snapshot should
be taken as early as possible:

* Call 'perf_snapshot_branch_stack()' at the very beginning of the
  trampoline for fentry programs.
* Call 'perf_snapshot_branch_stack()' immediately after invoking the
  tracee for fexit programs.

With this change, LBR snapshots remain meaningful even when multiple BPF
programs execute before the one requesting LBR data.

In addition, more relevant branch entries can be captured on AMD CPUs,
which provide only a 16-entry-deep LBR stack.
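
Conceptually, the trampoline change in patch 1 is equivalent to running
the C below before any fentry program (and again immediately after the
tracee, for fexit programs). This is a sketch only: the actual
implementation emits this sequence as JITed x86 inside the trampoline,
and the function name below is made up for illustration.

  DEFINE_PER_CPU(struct bpf_tramp_branch_entries, bpf_branch_snapshot);

  static void bpf_tramp_snapshot_branches(void)
  {
          struct bpf_tramp_branch_entries *br = this_cpu_ptr(&bpf_branch_snapshot);

          /* Snapshot the LBR as early as possible so the remaining
           * entries still describe the tracee rather than the trampoline.
           */
          br->cnt = static_call(perf_snapshot_branch_stack)(br->entries,
                                                            x86_pmu.lbr_nr);
  }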

Testing

The series was tested in a VM configured with LBR enabled:

vmtest --kvm-cpu-args 'host,pmu=on,lbr-fmt=0x5' -k $(make -s image_name) -

Branch records were verified using bpfsnoop [1]:

/path/to/bpfsnoop -k '(l)icmp_rcv' -E 1 -v \
  --kernel-vmlinux /path/to/kernel/vmlinux

For comparison, the following command was used without
BPF_BRANCH_SNAPSHOT_F_COPY:

/path/to/bpfsnoop -k '(l)icmp_rcv' -E 1 -v \
  --force-get-branch-snapshot --kernel-vmlinux /path/to/kernel/vmlinux

Without BPF_BRANCH_SNAPSHOT_F_COPY, no branch records related to the
tracee are captured. With it enabled, 17 branch records from the tracee
are observed.

Detailed verification results are available in the gist [2].

With this series applied, retsnoop [3] can benefit from improved LBR
support when using the '--lbr --fentries' options.

Links:
[1] https://github.com/bpfsnoop/bpfsnoop
[2] https://gist.github.com/Asphaltt/cffdeb4b2f2db4c3c42f91a59109f9e7
[3] https://github.com/anakryiko/retsnoop

Leon Hwang (3):
  bpf, x64: Call perf_snapshot_branch_stack in trampoline
  bpf: Introduce BPF_BRANCH_SNAPSHOT_F_COPY flag for
    bpf_get_branch_snapshot helper
  selftests/bpf: Add BPF_BRANCH_SNAPSHOT_F_COPY test

 arch/x86/net/bpf_jit_comp.c                   | 66 +++++++++++++++++++
 include/linux/bpf.h                           | 18 ++++-
 include/linux/bpf_verifier.h                  |  1 +
 kernel/bpf/verifier.c                         | 30 +++++++++
 kernel/trace/bpf_trace.c                      | 17 ++++-
 .../bpf/prog_tests/get_branch_snapshot.c      | 26 +++++++-
 .../selftests/bpf/progs/get_branch_snapshot.c |  3 +-
 7 files changed, 153 insertions(+), 8 deletions(-)

--
2.52.0


* [PATCH bpf-next 1/3] bpf, x64: Call perf_snapshot_branch_stack in trampoline
  2026-01-09 15:34 [PATCH bpf-next 0/3] bpf: Introduce BPF_BRANCH_SNAPSHOT_F_COPY flag for bpf_get_branch_snapshot helper Leon Hwang
@ 2026-01-09 15:34 ` Leon Hwang
  2026-01-09 16:24   ` Alexei Starovoitov
  2026-01-09 15:34 ` [PATCH bpf-next 2/3] bpf: Introduce BPF_BRANCH_SNAPSHOT_F_COPY flag for bpf_get_branch_snapshot helper Leon Hwang
  2026-01-09 15:34 ` [PATCH bpf-next 3/3] selftests/bpf: Add BPF_BRANCH_SNAPSHOT_F_COPY test Leon Hwang
  2 siblings, 1 reply; 6+ messages in thread
From: Leon Hwang @ 2026-01-09 15:34 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S . Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Matt Bobrowski, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Shuah Khan, Leon Hwang, netdev, linux-kernel,
	linux-trace-kernel, linux-kselftest, kernel-patches-bot

When the PMU LBR is running in branch-sensitive mode,
'perf_snapshot_branch_stack()' may capture branch entries from the
trampoline entry up to the call site inside a BPF program. These branch
entries are not useful for analyzing the control flow of the tracee.

To eliminate such noise for tracing programs, the branch snapshot should
be taken as early as possible:

* Call 'perf_snapshot_branch_stack()' at the very beginning of the
  trampoline for fentry programs.
* Call 'perf_snapshot_branch_stack()' immediately after invoking the
  tracee for fexit programs.

With this change, LBR snapshots remain meaningful even when multiple BPF
programs execute before the one requesting LBR data.

In addition, more relevant branch entries can be captured on AMD CPUs,
which provide a 16-entry-deep LBR stack.

Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
 arch/x86/net/bpf_jit_comp.c | 66 +++++++++++++++++++++++++++++++++++++
 include/linux/bpf.h         | 16 ++++++++-
 2 files changed, 81 insertions(+), 1 deletion(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index e3b1c4b1d550..a71a6c675392 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -12,6 +12,7 @@
 #include <linux/bpf.h>
 #include <linux/memory.h>
 #include <linux/sort.h>
+#include <linux/perf_event.h>
 #include <asm/extable.h>
 #include <asm/ftrace.h>
 #include <asm/set_memory.h>
@@ -19,6 +20,7 @@
 #include <asm/text-patching.h>
 #include <asm/unwind.h>
 #include <asm/cfi.h>
+#include "../events/perf_event.h"
 
 static bool all_callee_regs_used[4] = {true, true, true, true};
 
@@ -3137,6 +3139,54 @@ static int invoke_bpf_mod_ret(const struct btf_func_model *m, u8 **pprog,
 	return 0;
 }
 
+DEFINE_PER_CPU(struct bpf_tramp_branch_entries, bpf_branch_snapshot);
+
+static int invoke_branch_snapshot(u8 **pprog, void *image, void *rw_image)
+{
+	struct bpf_tramp_branch_entries __percpu *pptr = &bpf_branch_snapshot;
+	u8 *prog = *pprog;
+
+	/*
+	 * Emit:
+	 *
+	 * struct bpf_tramp_branch_entries *br = this_cpu_ptr(&bpf_branch_snapshot);
+	 * br->cnt = static_call(perf_snapshot_branch_stack)(br->entries, x86_pmu.lbr_nr);
+	 */
+
+	/* mov rbx, &bpf_branch_snapshot */
+	emit_mov_imm64(&prog, BPF_REG_6, (long) pptr >> 32, (u32)(long) pptr);
+#ifdef CONFIG_SMP
+	/* add rbx, gs:[<off>] */
+	EMIT2(0x65, 0x48);
+	EMIT3(0x03, 0x1C, 0x25);
+	EMIT((u32)(unsigned long)&this_cpu_off, 4);
+#endif
+	/* mov esi, x86_pmu.lbr_nr */
+	EMIT1_off32(0xBE, x86_pmu.lbr_nr);
+	/* lea rdi, [rbx + offsetof(struct bpf_tramp_branch_entries, entries)] */
+	EMIT4(0x48, 0x8D, 0x7B, offsetof(struct bpf_tramp_branch_entries, entries));
+	/* call static_call_query(perf_snapshot_branch_stack) */
+	if (emit_rsb_call(&prog, static_call_query(perf_snapshot_branch_stack),
+			  image + (prog - (u8 *)rw_image)))
+		return -EINVAL;
+	/* mov dword ptr [rbx], eax */
+	EMIT2(0x89, 0x03);
+
+	*pprog = prog;
+	return 0;
+}
+
+static bool bpf_prog_copy_branch_snapshot(struct bpf_tramp_links *tl)
+{
+	bool copy = false;
+	int i;
+
+	for (i = 0; i < tl->nr_links; i++)
+		copy = copy || tl->links[i]->link.prog->copy_branch_snapshot;
+
+	return copy;
+}
+
 /* mov rax, qword ptr [rbp - rounded_stack_depth - 8] */
 #define LOAD_TRAMP_TAIL_CALL_CNT_PTR(stack)	\
 	__LOAD_TCC_PTR(-round_up(stack, 8) - 8)
@@ -3366,6 +3416,14 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 
 	save_args(m, &prog, regs_off, false, flags);
 
+	if (bpf_prog_copy_branch_snapshot(fentry)) {
+		/* Get branch snapshot asap. */
+		if (invoke_branch_snapshot(&prog, image, rw_image)) {
+			ret = -EINVAL;
+			goto cleanup;
+		}
+	}
+
 	if (flags & BPF_TRAMP_F_CALL_ORIG) {
 		/* arg1: mov rdi, im */
 		emit_mov_imm64(&prog, BPF_REG_1, (long) im >> 32, (u32) (long) im);
@@ -3422,6 +3480,14 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
 		emit_nops(&prog, X86_PATCH_SIZE);
 	}
 
+	if (bpf_prog_copy_branch_snapshot(fexit)) {
+		/* Get branch snapshot asap. */
+		if (invoke_branch_snapshot(&prog, image, rw_image)) {
+			ret = -EINVAL;
+			goto cleanup;
+		}
+	}
+
 	if (fmod_ret->nr_links) {
 		/* From Intel 64 and IA-32 Architectures Optimization
 		 * Reference Manual, 3.4.1.4 Code Alignment, Assembly/Compiler
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5936f8e2996f..16dc21836a06 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -6,6 +6,7 @@
 
 #include <uapi/linux/bpf.h>
 #include <uapi/linux/filter.h>
+#include <uapi/linux/perf_event.h>
 
 #include <crypto/sha2.h>
 #include <linux/workqueue.h>
@@ -1236,6 +1237,18 @@ struct bpf_tramp_links {
 
 struct bpf_tramp_run_ctx;
 
+#ifdef CONFIG_X86_64
+/* Same as MAX_LBR_ENTRIES in arch/x86/events/perf_event.h */
+#define MAX_BRANCH_ENTRIES		32
+
+struct bpf_tramp_branch_entries {
+	int cnt;
+	struct perf_branch_entry entries[MAX_BRANCH_ENTRIES];
+};
+
+DECLARE_PER_CPU(struct bpf_tramp_branch_entries, bpf_branch_snapshot);
+#endif
+
 /* Different use cases for BPF trampoline:
  * 1. replace nop at the function entry (kprobe equivalent)
  *    flags = BPF_TRAMP_F_RESTORE_REGS
@@ -1780,7 +1793,8 @@ struct bpf_prog {
 				call_get_stack:1, /* Do we call bpf_get_stack() or bpf_get_stackid() */
 				call_get_func_ip:1, /* Do we call get_func_ip() */
 				tstamp_type_access:1, /* Accessed __sk_buff->tstamp_type */
-				sleepable:1;	/* BPF program is sleepable */
+				sleepable:1,	/* BPF program is sleepable */
+				copy_branch_snapshot:1; /* Copy branch snapshot from prefetched buffer */
 	enum bpf_prog_type	type;		/* Type of BPF program */
 	enum bpf_attach_type	expected_attach_type; /* For some prog types */
 	u32			len;		/* Number of filter blocks */
-- 
2.52.0



* [PATCH bpf-next 2/3] bpf: Introduce BPF_BRANCH_SNAPSHOT_F_COPY flag for bpf_get_branch_snapshot helper
  2026-01-09 15:34 [PATCH bpf-next 0/3] bpf: Introduce BPF_BRANCH_SNAPSHOT_F_COPY flag for bpf_get_branch_snapshot helper Leon Hwang
  2026-01-09 15:34 ` [PATCH bpf-next 1/3] bpf, x64: Call perf_snapshot_branch_stack in trampoline Leon Hwang
@ 2026-01-09 15:34 ` Leon Hwang
  2026-01-09 15:34 ` [PATCH bpf-next 3/3] selftests/bpf: Add BPF_BRANCH_SNAPSHOT_F_COPY test Leon Hwang
  2 siblings, 0 replies; 6+ messages in thread
From: Leon Hwang @ 2026-01-09 15:34 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S . Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Matt Bobrowski, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Shuah Khan, Leon Hwang, netdev, linux-kernel,
	linux-trace-kernel, linux-kselftest, kernel-patches-bot

Introduce a BPF_BRANCH_SNAPSHOT_F_COPY flag that lets tracing programs
copy branch entries from the per-CPU *bpf_branch_snapshot* buffer.

Instead of introducing a new kfunc, extend the bpf_get_branch_snapshot
helper to support the BPF_BRANCH_SNAPSHOT_F_COPY flag.

When BPF_BRANCH_SNAPSHOT_F_COPY is specified:

* Validate the *flags* value in the verifier's 'check_helper_call()'.
* Skip inlining the 'bpf_get_branch_snapshot()' helper in the verifier's
  'do_misc_fixups()'.
* 'memcpy()' the prefetched branch entries in the
  'bpf_get_branch_snapshot()' helper.

A BPF-side usage sketch is shown after this list.
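
The sketch below is illustrative only and is modeled on the selftest
added in the next patch. BPF_BRANCH_SNAPSHOT_F_COPY is defined locally
because this series adds the flag to include/linux/bpf.h rather than to
the UAPI header; 'icmp_rcv', the program name and ENTRY_CNT are just
example choices.

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char _license[] SEC("license") = "GPL";

  /* Not in UAPI; mirrors the kernel-internal value added by this patch. */
  #define BPF_BRANCH_SNAPSHOT_F_COPY 1

  #define ENTRY_CNT 32
  struct perf_branch_entry entries[ENTRY_CNT] = {};
  __u64 total_entries = 0;

  SEC("fexit/icmp_rcv")
  int BPF_PROG(on_icmp_rcv)
  {
          long sz;

          /* Copy the LBR entries the trampoline snapshotted right after
           * the tracee returned, instead of taking a fresh snapshot here.
           */
          sz = bpf_get_branch_snapshot(entries, sizeof(entries),
                                       BPF_BRANCH_SNAPSHOT_F_COPY);
          if (sz > 0)
                  total_entries = sz / sizeof(struct perf_branch_entry);
          return 0;
  }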

Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
 include/linux/bpf.h          |  4 ++++
 include/linux/bpf_verifier.h |  1 +
 kernel/bpf/verifier.c        | 30 ++++++++++++++++++++++++++++++
 kernel/trace/bpf_trace.c     | 17 ++++++++++++++---
 4 files changed, 49 insertions(+), 3 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 16dc21836a06..71ce225e5160 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1249,6 +1249,10 @@ struct bpf_tramp_branch_entries {
 DECLARE_PER_CPU(struct bpf_tramp_branch_entries, bpf_branch_snapshot);
 #endif

+enum {
+	BPF_BRANCH_SNAPSHOT_F_COPY	= 1,	/* Copy branch snapshot from bpf_branch_snapshot. */
+};
+
 /* Different use cases for BPF trampoline:
  * 1. replace nop at the function entry (kprobe equivalent)
  *    flags = BPF_TRAMP_F_RESTORE_REGS
diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
index 130bcbd66f60..c60a145e0466 100644
--- a/include/linux/bpf_verifier.h
+++ b/include/linux/bpf_verifier.h
@@ -561,6 +561,7 @@ struct bpf_insn_aux_data {
 	bool non_sleepable; /* helper/kfunc may be called from non-sleepable context */
 	bool is_iter_next; /* bpf_iter_<type>_next() kfunc call */
 	bool call_with_percpu_alloc_ptr; /* {this,per}_cpu_ptr() with prog percpu alloc */
+	bool copy_branch_snapshot; /* BPF_BRANCH_SNAPSHOT_F_COPY for bpf_get_branch_snapshot helper */
 	u8 alu_state; /* used in combination with alu_limit */
 	/* true if STX or LDX instruction is a part of a spill/fill
 	 * pattern for a bpf_fastcall call.
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 53635ea2e41b..0a537f9c2f8c 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -11772,6 +11772,33 @@ static int check_helper_call(struct bpf_verifier_env *env, struct bpf_insn *insn
 		err = push_callback_call(env, insn, insn_idx, meta.subprogno,
 					 set_user_ringbuf_callback_state);
 		break;
+	case BPF_FUNC_get_branch_snapshot:
+	{
+		u64 flags;
+
+		if (!is_reg_const(&regs[BPF_REG_3], false)) {
+			verbose(env, "Flags in bpf_get_branch_snapshot helper must be const.\n");
+			return -EINVAL;
+		}
+		flags = reg_const_value(&regs[BPF_REG_3], false);
+		if (flags & ~BPF_BRANCH_SNAPSHOT_F_COPY) {
+			verbose(env, "Invalid flags in bpf_get_branch_snapshot helper.\n");
+			return -EINVAL;
+		}
+
+		if (flags & BPF_BRANCH_SNAPSHOT_F_COPY) {
+			if (env->prog->type != BPF_PROG_TYPE_TRACING ||
+			    (env->prog->expected_attach_type != BPF_TRACE_FENTRY &&
+			     env->prog->expected_attach_type != BPF_TRACE_FEXIT)) {
+				verbose(env, "Only fentry and fexit programs support BPF_BRANCH_SNAPSHOT_F_COPY.\n");
+				return -EINVAL;
+			}
+
+			env->insn_aux_data[insn_idx].copy_branch_snapshot = true;
+			env->prog->copy_branch_snapshot = true;
+		}
+		break;
+	}
 	}

 	if (err)
@@ -23370,6 +23397,9 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 			 */
 			BUILD_BUG_ON(br_entry_size != 24);

+			if (env->insn_aux_data[i + delta].copy_branch_snapshot)
+				goto patch_call_imm;
+
 			/* if (unlikely(flags)) return -EINVAL */
 			insn_buf[0] = BPF_JMP_IMM(BPF_JNE, BPF_REG_3, 0, 7);

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6e076485bf70..e9e1698cf608 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1172,10 +1172,20 @@ BPF_CALL_3(bpf_get_branch_snapshot, void *, buf, u32, size, u64, flags)
 	static const u32 br_entry_size = sizeof(struct perf_branch_entry);
 	u32 entry_cnt = size / br_entry_size;

-	entry_cnt = static_call(perf_snapshot_branch_stack)(buf, entry_cnt);
-
-	if (unlikely(flags))
+	if (likely(!flags)) {
+		entry_cnt = static_call(perf_snapshot_branch_stack)(buf, entry_cnt);
+#ifdef CONFIG_X86_64
+	} else if (flags & BPF_BRANCH_SNAPSHOT_F_COPY) {
+		struct bpf_tramp_branch_entries *br;
+
+		br = this_cpu_ptr(&bpf_branch_snapshot);
+		entry_cnt = min_t(u32, entry_cnt, br->cnt);
+		if (entry_cnt)
+			memcpy(buf, (void *) br->entries, entry_cnt * br_entry_size);
+#endif
+	} else {
 		return -EINVAL;
+	}

 	if (!entry_cnt)
 		return -ENOENT;
@@ -1189,6 +1199,7 @@ const struct bpf_func_proto bpf_get_branch_snapshot_proto = {
 	.ret_type	= RET_INTEGER,
 	.arg1_type	= ARG_PTR_TO_UNINIT_MEM,
 	.arg2_type	= ARG_CONST_SIZE_OR_ZERO,
+	.arg3_type	= ARG_ANYTHING,
 };

 BPF_CALL_3(get_func_arg, void *, ctx, u32, n, u64 *, value)
--
2.52.0



* [PATCH bpf-next 3/3] selftests/bpf: Add BPF_BRANCH_SNAPSHOT_F_COPY test
  2026-01-09 15:34 [PATCH bpf-next 0/3] bpf: Introduce BPF_BRANCH_SNAPSHOT_F_COPY flag for bpf_get_branch_snapshot helper Leon Hwang
  2026-01-09 15:34 ` [PATCH bpf-next 1/3] bpf, x64: Call perf_snapshot_branch_stack in trampoline Leon Hwang
  2026-01-09 15:34 ` [PATCH bpf-next 2/3] bpf: Introduce BPF_BRANCH_SNAPSHOT_F_COPY flag for bpf_get_branch_snapshot helper Leon Hwang
@ 2026-01-09 15:34 ` Leon Hwang
  2 siblings, 0 replies; 6+ messages in thread
From: Leon Hwang @ 2026-01-09 15:34 UTC (permalink / raw)
  To: bpf
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S . Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H . Peter Anvin,
	Matt Bobrowski, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Shuah Khan, Leon Hwang, netdev, linux-kernel,
	linux-trace-kernel, linux-kselftest, kernel-patches-bot

Add a test for the BPF_BRANCH_SNAPSHOT_F_COPY flag by passing the flag
at the call site of the bpf_get_branch_snapshot helper.

Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
 .../bpf/prog_tests/get_branch_snapshot.c      | 26 ++++++++++++++++---
 .../selftests/bpf/progs/get_branch_snapshot.c |  3 ++-
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c b/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
index 0394a1156d99..6b8ab1655ab0 100644
--- a/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
+++ b/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
@@ -73,7 +73,7 @@ static void close_perf_events(void)
 	free(pfd_array);
 }
 
-void serial_test_get_branch_snapshot(void)
+static void test_branch_snapshot(int flags)
 {
 	struct get_branch_snapshot *skel = NULL;
 	int err;
@@ -89,8 +89,14 @@ void serial_test_get_branch_snapshot(void)
 		goto cleanup;
 	}
 
-	skel = get_branch_snapshot__open_and_load();
-	if (!ASSERT_OK_PTR(skel, "get_branch_snapshot__open_and_load"))
+	skel = get_branch_snapshot__open();
+	if (!ASSERT_OK_PTR(skel, "get_branch_snapshot__open"))
+		goto cleanup;
+
+	skel->rodata->flags = flags;
+
+	err = get_branch_snapshot__load(skel);
+	if (!ASSERT_OK(err, "get_branch_snapshot__load"))
 		goto cleanup;
 
 	err = kallsyms_find("bpf_testmod_loop_test", &skel->bss->address_low);
@@ -128,3 +134,17 @@ void serial_test_get_branch_snapshot(void)
 	get_branch_snapshot__destroy(skel);
 	close_perf_events();
 }
+
+void serial_test_get_branch_snapshot(void)
+{
+	test_branch_snapshot(0);
+}
+
+enum {
+	BPF_BRANCH_SNAPSHOT_F_COPY	= 1,	/* Copy branch snapshot from bpf_branch_snapshot. */
+};
+
+void serial_test_copy_branch_snapshot(void)
+{
+	test_branch_snapshot(BPF_BRANCH_SNAPSHOT_F_COPY);
+}
diff --git a/tools/testing/selftests/bpf/progs/get_branch_snapshot.c b/tools/testing/selftests/bpf/progs/get_branch_snapshot.c
index 511ac634eef0..47a1984bdf46 100644
--- a/tools/testing/selftests/bpf/progs/get_branch_snapshot.c
+++ b/tools/testing/selftests/bpf/progs/get_branch_snapshot.c
@@ -6,6 +6,7 @@
 
 char _license[] SEC("license") = "GPL";
 
+volatile const int flags = 0;
 __u64 test1_hits = 0;
 __u64 address_low = 0;
 __u64 address_high = 0;
@@ -25,7 +26,7 @@ int BPF_PROG(test1, int n, int ret)
 {
 	long i;
 
-	total_entries = bpf_get_branch_snapshot(entries, sizeof(entries), 0);
+	total_entries = bpf_get_branch_snapshot(entries, sizeof(entries), flags);
 	total_entries /= sizeof(struct perf_branch_entry);
 
 	for (i = 0; i < ENTRY_CNT; i++) {
-- 
2.52.0



* Re: [PATCH bpf-next 1/3] bpf, x64: Call perf_snapshot_branch_stack in trampoline
  2026-01-09 15:34 ` [PATCH bpf-next 1/3] bpf, x64: Call perf_snapshot_branch_stack in trampoline Leon Hwang
@ 2026-01-09 16:24   ` Alexei Starovoitov
  2026-01-09 16:31     ` Leon Hwang
  0 siblings, 1 reply; 6+ messages in thread
From: Alexei Starovoitov @ 2026-01-09 16:24 UTC (permalink / raw)
  To: Leon Hwang
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S . Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, X86 ML, H . Peter Anvin,
	Matt Bobrowski, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Shuah Khan, Network Development, LKML,
	linux-trace-kernel, open list:KERNEL SELFTEST FRAMEWORK,
	kernel-patches-bot

On Fri, Jan 9, 2026 at 7:37 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>
> When the PMU LBR is running in branch-sensitive mode,
> 'perf_snapshot_branch_stack()' may capture branch entries from the
> trampoline entry up to the call site inside a BPF program. These branch
> entries are not useful for analyzing the control flow of the tracee.
>
> To eliminate such noise for tracing programs, the branch snapshot should
> be taken as early as possible:
>
> * Call 'perf_snapshot_branch_stack()' at the very beginning of the
>   trampoline for fentry programs.
> * Call 'perf_snapshot_branch_stack()' immediately after invoking the
>   tracee for fexit programs.
>
> With this change, LBR snapshots remain meaningful even when multiple BPF
> programs execute before the one requesting LBR data.
>
> In addition, more relevant branch entries can be captured on AMD CPUs,
> which provide a 16-entry-deep LBR stack.
>
> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
> ---
>  arch/x86/net/bpf_jit_comp.c | 66 +++++++++++++++++++++++++++++++++++++
>  include/linux/bpf.h         | 16 ++++++++-
>  2 files changed, 81 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index e3b1c4b1d550..a71a6c675392 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -12,6 +12,7 @@
>  #include <linux/bpf.h>
>  #include <linux/memory.h>
>  #include <linux/sort.h>
> +#include <linux/perf_event.h>
>  #include <asm/extable.h>
>  #include <asm/ftrace.h>
>  #include <asm/set_memory.h>
> @@ -19,6 +20,7 @@
>  #include <asm/text-patching.h>
>  #include <asm/unwind.h>
>  #include <asm/cfi.h>
> +#include "../events/perf_event.h"
>
>  static bool all_callee_regs_used[4] = {true, true, true, true};
>
> @@ -3137,6 +3139,54 @@ static int invoke_bpf_mod_ret(const struct btf_func_model *m, u8 **pprog,
>         return 0;
>  }
>
> +DEFINE_PER_CPU(struct bpf_tramp_branch_entries, bpf_branch_snapshot);
> +
> +static int invoke_branch_snapshot(u8 **pprog, void *image, void *rw_image)
> +{
> +       struct bpf_tramp_branch_entries __percpu *pptr = &bpf_branch_snapshot;
> +       u8 *prog = *pprog;
> +
> +       /*
> +        * Emit:
> +        *
> +        * struct bpf_tramp_branch_entries *br = this_cpu_ptr(&bpf_branch_snapshot);
> +        * br->cnt = static_call(perf_snapshot_branch_stack)(br->entries, x86_pmu.lbr_nr);
> +        */
> +
> +       /* mov rbx, &bpf_branch_snapshot */
> +       emit_mov_imm64(&prog, BPF_REG_6, (long) pptr >> 32, (u32)(long) pptr);
> +#ifdef CONFIG_SMP
> +       /* add rbx, gs:[<off>] */
> +       EMIT2(0x65, 0x48);
> +       EMIT3(0x03, 0x1C, 0x25);
> +       EMIT((u32)(unsigned long)&this_cpu_off, 4);
> +#endif
> +       /* mov esi, x86_pmu.lbr_nr */
> +       EMIT1_off32(0xBE, x86_pmu.lbr_nr);
> +       /* lea rdi, [rbx + offsetof(struct bpf_tramp_branch_entries, entries)] */
> +       EMIT4(0x48, 0x8D, 0x7B, offsetof(struct bpf_tramp_branch_entries, entries));
> +       /* call static_call_query(perf_snapshot_branch_stack) */
> +       if (emit_rsb_call(&prog, static_call_query(perf_snapshot_branch_stack),
> +                         image + (prog - (u8 *)rw_image)))
> +               return -EINVAL;
> +       /* mov dword ptr [rbx], eax */
> +       EMIT2(0x89, 0x03);
> +
> +       *pprog = prog;
> +       return 0;
> +}
> +
> +static bool bpf_prog_copy_branch_snapshot(struct bpf_tramp_links *tl)
> +{
> +       bool copy = false;
> +       int i;
> +
> +       for (i = 0; i < tl->nr_links; i++)
> +               copy = copy || tl->links[i]->link.prog->copy_branch_snapshot;
> +
> +       return copy;
> +}
> +
>  /* mov rax, qword ptr [rbp - rounded_stack_depth - 8] */
>  #define LOAD_TRAMP_TAIL_CALL_CNT_PTR(stack)    \
>         __LOAD_TCC_PTR(-round_up(stack, 8) - 8)
> @@ -3366,6 +3416,14 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
>
>         save_args(m, &prog, regs_off, false, flags);
>
> +       if (bpf_prog_copy_branch_snapshot(fentry)) {
> +               /* Get branch snapshot asap. */
> +               if (invoke_branch_snapshot(&prog, image, rw_image)) {
> +                       ret = -EINVAL;
> +                       goto cleanup;
> +               }
> +       }

Andrii already tried to do it.
I hated it back then and still hate the idea.
We're not going to add custom logic for one specific use case
no matter how appealing it sounds to save very limited LBR entries.
The HW will get better, but we will be stuck with this optimization forever.

pw-bot: cr


* Re: [PATCH bpf-next 1/3] bpf, x64: Call perf_snapshot_branch_stack in trampoline
  2026-01-09 16:24   ` Alexei Starovoitov
@ 2026-01-09 16:31     ` Leon Hwang
  0 siblings, 0 replies; 6+ messages in thread
From: Leon Hwang @ 2026-01-09 16:31 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	David S . Miller, David Ahern, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, X86 ML, H . Peter Anvin,
	Matt Bobrowski, Steven Rostedt, Masami Hiramatsu,
	Mathieu Desnoyers, Shuah Khan, Network Development, LKML,
	linux-trace-kernel, open list:KERNEL SELFTEST FRAMEWORK,
	kernel-patches-bot



On 2026/1/10 00:24, Alexei Starovoitov wrote:
> On Fri, Jan 9, 2026 at 7:37 AM Leon Hwang <leon.hwang@linux.dev> wrote:
>>

[...]

>> @@ -3366,6 +3416,14 @@ static int __arch_prepare_bpf_trampoline(struct bpf_tramp_image *im, void *rw_im
>>
>>         save_args(m, &prog, regs_off, false, flags);
>>
>> +       if (bpf_prog_copy_branch_snapshot(fentry)) {
>> +               /* Get branch snapshot asap. */
>> +               if (invoke_branch_snapshot(&prog, image, rw_image)) {
>> +                       ret = -EINVAL;
>> +                       goto cleanup;
>> +               }
>> +       }
> 
> Andrii already tried to do it.
> I hated it back then and still hate the idea.
> We're not going to add custom logic for one specific use case
> no matter how appealing it sounds to save very limited LBR entries.
> The HW will get better, but we will be stuck with this optimization forever.
> 

Understood, thanks for the explanation.

I won’t pursue this approach further.

Thanks,
Leon


