[PATCH v5 0/4] arm64: Add BRBE support for bpf_get_branch

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v5 0/4] arm64: Add BRBE support for bpf_get_branch_snapshot()
@ 2026-06-16 15:57 Puranjay Mohan
  2026-06-16 15:57 ` [PATCH v5 1/4] perf/core: Fix sched_task callbacks for CPU-wide branch stack events Puranjay Mohan
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Puranjay Mohan @ 2026-06-16 15:57 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	Will Deacon, Mark Rutland, Catalin Marinas, Leo Yan, Rob Herring,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, James Clark, Ian Rogers, Adrian Hunter, Shuah Khan,
	Breno Leitao, Ravi Bangoria, Stephane Eranian,
	Kumar Kartikeya Dwivedi, Usama Arif, linux-arm-kernel,
	linux-perf-users, linux-kselftest, linux-kernel, kernel-team

Changelog:

v4: https://lore.kernel.org/all/20260527121207.2312181-1-puranjay@kernel.org/
Changes in v5:
- Rework patch 2: drop the UAPI union. Instead rename the helper to perf_clear_branch_entry() and
  clear the entry with a single
  	*br = (struct perf_branch_entry){ }.
  This stays kernel-internal, with no changes to the uapi/ or tools/ headers (James Clark).
- Add an isb() to the BRBCR_EL1 == 0 early-exit in the snapshot, for consistency with the other
  two exit paths (bpf-ci).
- Add Rob Herring's Reviewed-by to patch 3

v3: https://lore.kernel.org/all/20260413185740.3286146-1-puranjay@kernel.org/
Changes in v4:
- Fix leaking branch records when scheduled task has an unrelated perf event (Sashiko)
- Update tools/include/uapi/linux/perf_event.h as well for patch 2
- Introduce cpu_has_brbe() and use it in
  arm_brbe_snapshot_branch_stack(0 to make sure we don't run on a CPU
  without BRBE.
- Add explicit isb() after after writing to SYS_BRBFCR_EL1.
- Rebase on latest arm64 tree.

v2: https://lore.kernel.org/all/20260318171706.2840512-1-puranjay@kernel.org/
Changes in v3:
- Move NULL pmu_ctx fix from arm_pmuv3.c to perf core (Leo Yan)
- Use union to clear branch entry bitfields instead of per-field
  zeroing (Leo Yan)
- Remove per-CPU brbe_active flag; check BRBCR_EL1 == 0 instead (Rob
  Herring)
- Remove redundant valid_brbidr() check in snapshot path (Rob Herring)
- Introduce for_each_brbe_entry() iterator to deduplicate bank
  iteration (Rob Herring)
- Include perf core maintainers (Leo Yan, Rob Herring)

v1: https://lore.kernel.org/all/20260313180352.3800358-1-puranjay@kernel.org/
Changes in v2:
- Rebased on arm64/for-next/core
- Add per-CPU brbe_active flag to guard against UNDEFINED sysreg access
  on non-BRBE CPUs in heterogeneous big.LITTLE systems.
- Fix pre-existing bug in perf_clear_branch_entry_bitfields() that missed
  zeroing new_type and priv bitfields, added as a separate patch with
  Fixes tags (new patch 2).
- Use architecture-specific selftest threshold (#if defined(__aarch64__))
  instead of raising the global threshold, to preserve x86 regression
  detection.

RFC: https://lore.kernel.org/all/20260102214043.1410242-1-puranjay@kernel.org/
Changes from RFC:
 - Fix pre-existing NULL pointer dereference in armv8pmu_sched_task()
   found by Leo Yan during testing (patch 1)
 - Pause BRBE before local_daif_save() to avoid branch pollution from
   trace_hardirqs_off()
 - Use local_daif_save() to prevent pNMI race from counter overflow
   (Mark Rutland)
 - Reuse perf_entry_from_brbe_regset() instead of duplicating register
   read logic, by making it accept NULL event (Mark Rutland)
 - Invalidate BRBE after reading to maintain record contiguity for
   other consumers (Mark Rutland)
 - Adjust selftest wasted_entries threshold for ARM64 (patch 3)
 - Tested on ARM FVP with BRBE enabled

This series enables the bpf_get_branch_snapshot() BPF helper on ARM64
by implementing the perf_snapshot_branch_stack static call for ARM's
Branch Record Buffer Extension (BRBE).

bpf_get_branch_snapshot() [1] allows BPF programs to capture hardware
branch records on-demand from any BPF tracing context. This was
previously only available on x86 (Intel LBR) since v5.16. With BRBE
available on ARMv9, this series closes the gap for ARM64.

Usage model
-----------

The helper works in conjunction with perf events. The userspace
component of the BPF application opens a perf event with
PERF_SAMPLE_BRANCH_STACK on each CPU, which configures the hardware
to continuously record branches into BRBE (on ARM64) or LBR (on x86).
A BPF program attached to a tracepoint, kprobe, or fentry hook can
then call bpf_get_branch_snapshot() to snapshot the branch buffer at
any point. Without an active perf event, BRBE is not recording and
the buffer is empty.

On-demand branch snapshots from BPF are useful for diagnosing which
specific code path was taken inside a function. Stack traces only show
function boundaries, but branch records reveal the exact sequence of
jumps, calls, and returns within a function -- making it possible to
identify which specific error check triggered a failure, or which
callback implementation was invoked through a function pointer.

For example, retsnoop [2] is a BPF-based tool for non-intrusive
mass-tracing of kernel internals. Its LBR mode (--lbr) creates per-CPU
perf events with PERF_SAMPLE_BRANCH_STACK and then uses
bpf_get_branch_snapshot() in its fentry/fexit BPF programs to capture
branch records whenever a traced function returns an error.

Consider debugging a bpf() syscall that returns -EINVAL when creating
a BPF map with invalid parameters. Running retsnoop on an ARM64 FVP
with BRBE to trace the bpf() syscall and array_map_alloc_check():

  $ retsnoop -e '*sys_bpf' -a 'array_map_alloc_check' --lbr=any \
             -F -k vmlinux --debug full-lbr
  $ simfail bpf-bad-map-max-entries-array  # in another terminal

Output of retsnoop:

  --- fentry BPF program (entries #63-#17) ---

  [#63-#59] __htab_map_lookup_elem: hash table walk with memcmp        (hashtab.c)
  [#58] __htab_map_lookup_elem+0x98  -> dump_bpf_prog+0xc850           (hashtab.c:750)
  [#57-#55] ... dump_bpf_prog internal branches ...
  [#54] dump_bpf_prog+0xcab8        -> bpf_get_current_pid_tgid+0x0    (helpers.c:225)
  [#53] bpf_get_current_pid_tgid+0x1c -> dump_bpf_prog+0xcabc          (helpers.c:225)
  [#52-#51] ... dump_bpf_prog -> __htab_map_lookup_elem ...
  [#50-#47] __htab_map_lookup_elem: htab_map_hash (jhash2), select_bucket
  [#46-#42] lookup_nulls_elem_raw: hash chain walk with memcmp         (hashtab.c:717)
  [#41] __htab_map_lookup_elem+0x98  -> dump_bpf_prog+0xcaf8           (hashtab.c:750)
  [#40-#37] ... dump_bpf_prog -> bpf_ktime_get_ns ...
  [#36] bpf_ktime_get_ns+0x10       -> ktime_get_mono_fast_ns+0x0      (helpers.c:178)
  [#35-#32] ktime_get_mono_fast_ns: tk_clock_read -> arch_counter_get_cntpct
  [#31] ktime_get_mono_fast_ns+0x9c -> bpf_ktime_get_ns+0x14           (timekeeping.c:493)
  [#30] bpf_ktime_get_ns+0x18       -> dump_bpf_prog+0xcd50            (helpers.c:178)
  [#29-#25] ... dump_bpf_prog internal branches ...
  [#24] dump_bpf_prog+0x11b28       -> __bpf_prog_exit_recur+0x0       (trampoline.c:1190)
  [#23-#17] __bpf_prog_exit_recur: rcu_read_unlock, migrate_enable     (trampoline.c:1195)

  --- array_map_alloc_check (entries #16-#12) ---

  [#16] dump_bpf_prog+0x11b38       -> array_map_alloc_check+0x8       (arraymap.c:55)
  [#15] array_map_alloc_check+0x18  -> array_map_alloc_check+0xb8      (arraymap.c:56)
        . bpf_map_attr_numa_node       . bpf_map_attr_numa_node
  [#14] array_map_alloc_check+0xbc  -> array_map_alloc_check+0x20      (arraymap.c:59)
        . bpf_map_attr_numa_node
  [#13] array_map_alloc_check+0x24  -> array_map_alloc_check+0x94      (arraymap.c:64)
  [#12] array_map_alloc_check+0x98  -> dump_bpf_prog+0x11b3c           (arraymap.c:82)

  --- fexit trampoline overhead (entries #11-#00) ---

  [#11] dump_bpf_prog+0x11b5c       -> __bpf_prog_enter_recur+0x0      (trampoline.c:1145)
  [#10-#03] __bpf_prog_enter_recur: rcu_read_lock, migrate_disable     (trampoline.c:1146)
  [#02] __bpf_prog_enter_recur+0x114 -> dump_bpf_prog+0x11b60          (trampoline.c:1157)
  [#01] dump_bpf_prog+0x11b6c       -> dump_bpf_prog+0xd230
  [#00] dump_bpf_prog+0xd340        -> arm_brbe_snapshot_branch_stack+0x0 (arm_brbe.c:814)

                   el0t_64_sync+0x168
                   el0t_64_sync_handler+0x98
                   el0_svc+0x28
                   do_el0_svc+0x4c
                   invoke_syscall.constprop.0+0x54
    373us [-EINVAL] __arm64_sys_bpf+0x8
                    __sys_bpf+0x87c
                    map_create+0x120
     95us [-EINVAL] array_map_alloc_check+0x8

The FVP's BRBE buffer has 64 entries (BRBE supports 8, 16, 32, or
64). Of these, entries #63-#17 (47) are consumed by the fentry BPF
trampoline that ran before the function, and entries #11-#00 (12)
are consumed by the fexit trampoline that runs after. Entry #00
shows the very last branch recorded before BRBE is paused: the call
into arm_brbe_snapshot_branch_stack().

The 5 useful entries (#16-#12) show the exact path taken inside
array_map_alloc_check(). Record #14 shows a jump from line 56
(bpf_map_attr_numa_node) to line 59 (the if-condition), and #13
shows an immediate jump from line 59 (attr->max_entries == 0) to
line 64 (return -EINVAL), skipping lines 60-63. This pinpoints
max_entries==0 as the cause -- a diagnosis impossible with stack
traces alone.

[1] 856c02dbce4f ("bpf: Introduce helper bpf_get_branch_snapshot")
[2] https://github.com/anakryiko/retsnoop

Puranjay Mohan (4):
  perf/core: Fix sched_task callbacks for CPU-wide branch stack events
  perf/core: Clear the whole branch entry in perf_clear_branch_entry()
  perf/arm64: Add BRBE support for bpf_get_branch_snapshot()
  selftests/bpf: Adjust wasted entries threshold for ARM64 BRBE

 arch/x86/events/amd/brs.c                     |   2 +-
 arch/x86/events/amd/lbr.c                     |   2 +-
 arch/x86/events/intel/lbr.c                   |   6 +-
 drivers/perf/arm_brbe.c                       | 130 +++++++++++++++---
 drivers/perf/arm_brbe.h                       |   9 ++
 drivers/perf/arm_pmuv3.c                      |   5 +-
 include/linux/perf_event.h                    |  16 +--
 kernel/events/core.c                          |  17 ++-
 .../bpf/prog_tests/get_branch_snapshot.c      |  13 +-
 9 files changed, 150 insertions(+), 50 deletions(-)


base-commit: 61c19a9feb1d87156e46e38d7759f3ad23710e24
-- 
2.53.0-Meta



^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v5 1/4] perf/core: Fix sched_task callbacks for CPU-wide branch stack events
  2026-06-16 15:57 [PATCH v5 0/4] arm64: Add BRBE support for bpf_get_branch_snapshot() Puranjay Mohan
@ 2026-06-16 15:57 ` Puranjay Mohan
  2026-06-16 15:57 ` [PATCH v5 2/4] perf/core: Clear the whole branch entry in perf_clear_branch_entry() Puranjay Mohan
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 9+ messages in thread
From: Puranjay Mohan @ 2026-06-16 15:57 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	Will Deacon, Mark Rutland, Catalin Marinas, Leo Yan, Rob Herring,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, James Clark, Ian Rogers, Adrian Hunter, Shuah Khan,
	Breno Leitao, Ravi Bangoria, Stephane Eranian,
	Kumar Kartikeya Dwivedi, Usama Arif, linux-arm-kernel,
	linux-perf-users, linux-kselftest, linux-kernel, kernel-team

perf_pmu_sched_task() returns early when cpuctx->task_ctx is non-NULL,
deferring to perf_ctx_sched_task_cb() in the context sched_in/out
paths. But perf_ctx_sched_task_cb() only walks the task context's
pmu_ctx_list -- PMUs that have only CPU-wide events are not on that
list and their sched_task callback is silently skipped.

On ARM64 with CPU-wide branch recording:

  perf record -b -e cycles -a -- ls

armv8pmu_sched_task() is skipped whenever the scheduled task has an
unrelated perf event (e.g. a software event), and branch records leak
across task boundaries.

A second problem exists in __perf_pmu_sched_task(): it passes
cpc->task_epc directly to pmu->sched_task(), but task_epc is NULL for
PMUs with only CPU-wide events. When perf_pmu_sched_task() does reach
the loop (because cpuctx->task_ctx is NULL), this causes a NULL
pointer dereference:

  Unable to handle kernel NULL pointer dereference at virtual address 00[.]
  PC is at armv8pmu_sched_task+0x14/0x50
  Call trace:
    armv8pmu_sched_task+0x14/0x50 (P)
    perf_pmu_sched_task+0xac/0x108
    __perf_event_task_sched_out+0x6c/0xe0

Fix both:

 - Remove the blanket early return in perf_pmu_sched_task() when
   cpuctx->task_ctx is set. Instead, skip individual CPCs that have a
   task_epc (those are handled by perf_ctx_sched_task_cb()). CPCs
   without a task_epc are CPU-only and must be handled here.

 - Fall back to &cpc->epc in __perf_pmu_sched_task() when task_epc is
   NULL, so the callback always gets a valid pmu_ctx.

Fixes: bd2756811766 ("perf: Rewrite core context handling")
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 kernel/events/core.c | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index 6d1f8bad7e1c..6604f6e8f352 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -3906,7 +3906,8 @@ static void __perf_pmu_sched_task(struct perf_cpu_pmu_context *cpc,
 	perf_ctx_lock(cpuctx, cpuctx->task_ctx);
 	perf_pmu_disable(pmu);
 
-	pmu->sched_task(cpc->task_epc, task, sched_in);
+	pmu->sched_task(cpc->task_epc ? cpc->task_epc : &cpc->epc,
+			task, sched_in);
 
 	perf_pmu_enable(pmu);
 	perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
@@ -3919,12 +3920,20 @@ static void perf_pmu_sched_task(struct task_struct *prev,
 	struct perf_cpu_context *cpuctx = this_cpu_ptr(&perf_cpu_context);
 	struct perf_cpu_pmu_context *cpc;
 
-	/* cpuctx->task_ctx will be handled in perf_event_context_sched_in/out */
-	if (prev == next || cpuctx->task_ctx)
+	if (prev == next)
 		return;
 
-	list_for_each_entry(cpc, this_cpu_ptr(&sched_cb_list), sched_cb_entry)
+	list_for_each_entry(cpc, this_cpu_ptr(&sched_cb_list), sched_cb_entry) {
+		/*
+		 * PMUs with per-task events are handled by
+		 * perf_ctx_sched_task_cb() via perf_event_context_sched_in/out
+		 * when a task context is active.
+		 */
+		if (cpuctx->task_ctx && cpc->task_epc)
+			continue;
+
 		__perf_pmu_sched_task(cpc, sched_in ? next : prev, sched_in);
+	}
 }
 
 static void perf_event_switch(struct task_struct *task,
-- 
2.53.0-Meta



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v5 2/4] perf/core: Clear the whole branch entry in perf_clear_branch_entry()
  2026-06-16 15:57 [PATCH v5 0/4] arm64: Add BRBE support for bpf_get_branch_snapshot() Puranjay Mohan
  2026-06-16 15:57 ` [PATCH v5 1/4] perf/core: Fix sched_task callbacks for CPU-wide branch stack events Puranjay Mohan
@ 2026-06-16 15:57 ` Puranjay Mohan
  2026-06-16 15:57 ` [PATCH v5 3/4] perf/arm64: Add BRBE support for bpf_get_branch_snapshot() Puranjay Mohan
  2026-06-16 15:57 ` [PATCH v5 4/4] selftests/bpf: Adjust wasted entries threshold for ARM64 BRBE Puranjay Mohan
  3 siblings, 0 replies; 9+ messages in thread
From: Puranjay Mohan @ 2026-06-16 15:57 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	Will Deacon, Mark Rutland, Catalin Marinas, Leo Yan, Rob Herring,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, James Clark, Ian Rogers, Adrian Hunter, Shuah Khan,
	Breno Leitao, Ravi Bangoria, Stephane Eranian,
	Kumar Kartikeya Dwivedi, Usama Arif, linux-arm-kernel,
	linux-perf-users, linux-kselftest, linux-kernel, kernel-team

perf_clear_branch_entry_bitfields() resets the bitfields of struct
perf_branch_entry one at a time and deliberately leaves from/to alone,
since callers overwrite those immediately. The list of assignments has to
be kept in sync with the struct by hand, and it has already drifted:
new_type and priv were added to perf_branch_entry but never cleared here,
so stale values can leak into the records handed to userspace.

Clear the entry with a single struct assignment instead:

	*br = (struct perf_branch_entry){ };

Every caller writes from/to right after the clear, so zeroing them as well
is harmless and the dead stores are elided on the x86 LBR read paths.
There is no longer anything to keep in sync when a field is added.

The helper no longer touches only the bitfields, so rename it to
perf_clear_branch_entry() and update the callers, fixing up the
br+nr/br+out spacing on the touched lines while at it.

Fixes: b190bc4ac9e6 ("perf: Extend branch type classification")
Fixes: 5402d25aa571 ("perf: Capture branch privilege information")
Suggested-by: James Clark <james.clark@linaro.org>
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 arch/x86/events/amd/brs.c   |  2 +-
 arch/x86/events/amd/lbr.c   |  2 +-
 arch/x86/events/intel/lbr.c |  6 +++---
 drivers/perf/arm_brbe.c     |  2 +-
 include/linux/perf_event.h  | 16 ++--------------
 5 files changed, 8 insertions(+), 20 deletions(-)

diff --git a/arch/x86/events/amd/brs.c b/arch/x86/events/amd/brs.c
index 06f35a6b58a5..68c5f42965e9 100644
--- a/arch/x86/events/amd/brs.c
+++ b/arch/x86/events/amd/brs.c
@@ -343,7 +343,7 @@ void amd_brs_drain(void)
 
 		rdmsrq(brs_from(brs_idx), from);
 
-		perf_clear_branch_entry_bitfields(br+nr);
+		perf_clear_branch_entry(br + nr);
 
 		br[nr].from = from;
 		br[nr].to   = to;
diff --git a/arch/x86/events/amd/lbr.c b/arch/x86/events/amd/lbr.c
index d24da377df77..08401fd60585 100644
--- a/arch/x86/events/amd/lbr.c
+++ b/arch/x86/events/amd/lbr.c
@@ -181,7 +181,7 @@ void amd_pmu_lbr_read(void)
 		    entry.to.split.reserved)
 			continue;
 
-		perf_clear_branch_entry_bitfields(br + out);
+		perf_clear_branch_entry(br + out);
 
 		br[out].from	= sign_ext_branch_ip(entry.from.split.ip);
 		br[out].to	= sign_ext_branch_ip(entry.to.split.ip);
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index 72f2adcda7c6..295da179fa74 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -755,7 +755,7 @@ void intel_pmu_lbr_read_32(struct cpu_hw_events *cpuc)
 
 		rdmsrq(x86_pmu.lbr_from + lbr_idx, msr_lastbranch.lbr);
 
-		perf_clear_branch_entry_bitfields(br);
+		perf_clear_branch_entry(br);
 
 		br->from	= msr_lastbranch.from;
 		br->to		= msr_lastbranch.to;
@@ -846,7 +846,7 @@ void intel_pmu_lbr_read_64(struct cpu_hw_events *cpuc)
 		if (abort && x86_pmu.lbr_double_abort && out > 0)
 			out--;
 
-		perf_clear_branch_entry_bitfields(br+out);
+		perf_clear_branch_entry(br + out);
 		br[out].from	 = from;
 		br[out].to	 = to;
 		br[out].mispred	 = mis;
@@ -920,7 +920,7 @@ static void intel_pmu_store_lbr(struct cpu_hw_events *cpuc,
 		to = rdlbr_to(i, lbr);
 		info = rdlbr_info(i, lbr);
 
-		perf_clear_branch_entry_bitfields(e);
+		perf_clear_branch_entry(e);
 
 		e->from		= from;
 		e->to		= to;
diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
index ba554e0c846c..effbdeacfcbb 100644
--- a/drivers/perf/arm_brbe.c
+++ b/drivers/perf/arm_brbe.c
@@ -604,7 +604,7 @@ static bool perf_entry_from_brbe_regset(int index, struct perf_branch_entry *ent
 		return false;
 
 	brbinf = bregs.brbinf;
-	perf_clear_branch_entry_bitfields(entry);
+	perf_clear_branch_entry(entry);
 	if (brbe_record_is_complete(brbinf)) {
 		entry->from = bregs.brbsrc;
 		entry->to = bregs.brbtgt;
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 48d851fbd8ea..e034be4a473a 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -1467,21 +1467,9 @@ static inline u32 perf_sample_data_size(struct perf_sample_data *data,
 	return size;
 }
 
-/*
- * Clear all bitfields in the perf_branch_entry.
- * The to and from fields are not cleared because they are
- * systematically modified by caller.
- */
-static inline void perf_clear_branch_entry_bitfields(struct perf_branch_entry *br)
+static inline void perf_clear_branch_entry(struct perf_branch_entry *br)
 {
-	br->mispred	= 0;
-	br->predicted	= 0;
-	br->in_tx	= 0;
-	br->abort	= 0;
-	br->cycles	= 0;
-	br->type	= 0;
-	br->spec	= PERF_BR_SPEC_NA;
-	br->reserved	= 0;
+	*br = (struct perf_branch_entry){ };
 }
 
 extern void perf_output_sample(struct perf_output_handle *handle,
-- 
2.53.0-Meta



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v5 3/4] perf/arm64: Add BRBE support for bpf_get_branch_snapshot()
  2026-06-16 15:57 [PATCH v5 0/4] arm64: Add BRBE support for bpf_get_branch_snapshot() Puranjay Mohan
  2026-06-16 15:57 ` [PATCH v5 1/4] perf/core: Fix sched_task callbacks for CPU-wide branch stack events Puranjay Mohan
  2026-06-16 15:57 ` [PATCH v5 2/4] perf/core: Clear the whole branch entry in perf_clear_branch_entry() Puranjay Mohan
@ 2026-06-16 15:57 ` Puranjay Mohan
  2026-06-16 16:15   ` sashiko-bot
  2026-06-16 15:57 ` [PATCH v5 4/4] selftests/bpf: Adjust wasted entries threshold for ARM64 BRBE Puranjay Mohan
  3 siblings, 1 reply; 9+ messages in thread
From: Puranjay Mohan @ 2026-06-16 15:57 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	Will Deacon, Mark Rutland, Catalin Marinas, Leo Yan, Rob Herring,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, James Clark, Ian Rogers, Adrian Hunter, Shuah Khan,
	Breno Leitao, Ravi Bangoria, Stephane Eranian,
	Kumar Kartikeya Dwivedi, Usama Arif, linux-arm-kernel,
	linux-perf-users, linux-kselftest, linux-kernel, kernel-team

Enable bpf_get_branch_snapshot() on ARM64 by implementing the
perf_snapshot_branch_stack static call for BRBE.

BRBE is paused before masking exceptions to avoid branch buffer
pollution from trace_hardirqs_off(). Exceptions are then masked with
local_daif_save() to prevent PMU overflow pseudo-NMIs from interfering.
If an overflow between pause and DAIF save re-enables BRBE, the snapshot
detects this via BRBFCR_EL1.PAUSED and bails out.

Branch records are read using perf_entry_from_brbe_regset() with a NULL
event pointer to bypass event-specific filtering. The buffer is
invalidated after reading.

Introduce a for_each_brbe_entry() iterator to deduplicate bank
iteration between brbe_read_filtered_entries() and the snapshot.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
Reviewed-by: Rob Herring (Arm) <robh@kernel.org>
---
 drivers/perf/arm_brbe.c  | 128 ++++++++++++++++++++++++++++++++-------
 drivers/perf/arm_brbe.h  |   9 +++
 drivers/perf/arm_pmuv3.c |   5 +-
 3 files changed, 120 insertions(+), 22 deletions(-)

diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
index effbdeacfcbb..a141ad7abcf2 100644
--- a/drivers/perf/arm_brbe.c
+++ b/drivers/perf/arm_brbe.c
@@ -9,6 +9,7 @@
 #include <linux/types.h>
 #include <linux/bitmap.h>
 #include <linux/perf/arm_pmu.h>
+#include <asm/daifflags.h>
 #include "arm_brbe.h"
 
 #define BRBFCR_EL1_BRANCH_FILTERS (BRBFCR_EL1_DIRECT   | \
@@ -256,6 +257,14 @@ static bool valid_brbe_version(int brbe_version)
 	       brbe_version == ID_AA64DFR0_EL1_BRBE_BRBE_V1P1;
 }
 
+static __always_inline bool cpu_has_brbe(void)
+{
+	u64 aa64dfr0 = read_sysreg_s(SYS_ID_AA64DFR0_EL1);
+	int brbe = cpuid_feature_extract_unsigned_field(aa64dfr0, ID_AA64DFR0_EL1_BRBE_SHIFT);
+
+	return valid_brbe_version(brbe);
+}
+
 static void select_brbe_bank(int bank)
 {
 	u64 brbfcr;
@@ -271,6 +280,20 @@ static void select_brbe_bank(int bank)
 	isb();
 }
 
+static inline void __brbe_advance(int *bank, int *idx, int nr_hw)
+{
+	if (++(*idx) >= BRBE_BANK_MAX_ENTRIES &&
+	    *bank * BRBE_BANK_MAX_ENTRIES + *idx < nr_hw) {
+		*idx = 0;
+		select_brbe_bank(++(*bank));
+	}
+}
+
+#define for_each_brbe_entry(idx, nr_hw)					\
+	for (int __bank = (select_brbe_bank(0), 0), idx = 0;		\
+	     __bank * BRBE_BANK_MAX_ENTRIES + idx < (nr_hw);		\
+	     __brbe_advance(&__bank, &idx, (nr_hw)))
+
 static bool __read_brbe_regset(struct brbe_regset *entry, int idx)
 {
 	entry->brbinf = get_brbinf_reg(idx);
@@ -474,11 +497,9 @@ unsigned int brbe_num_branch_records(const struct arm_pmu *armpmu)
 
 void brbe_probe(struct arm_pmu *armpmu)
 {
-	u64 brbidr, aa64dfr0 = read_sysreg_s(SYS_ID_AA64DFR0_EL1);
-	u32 brbe;
+	u64 brbidr;
 
-	brbe = cpuid_feature_extract_unsigned_field(aa64dfr0, ID_AA64DFR0_EL1_BRBE_SHIFT);
-	if (!valid_brbe_version(brbe))
+	if (!cpu_has_brbe())
 		return;
 
 	brbidr = read_sysreg_s(SYS_BRBIDR0_EL1);
@@ -618,10 +639,10 @@ static bool perf_entry_from_brbe_regset(int index, struct perf_branch_entry *ent
 
 	brbe_set_perf_entry_type(entry, brbinf);
 
-	if (!branch_sample_no_cycles(event))
+	if (!event || !branch_sample_no_cycles(event))
 		entry->cycles = brbinf_get_cycles(brbinf);
 
-	if (!branch_sample_no_flags(event)) {
+	if (!event || !branch_sample_no_flags(event)) {
 		/* Mispredict info is available for source only and complete branch records. */
 		if (!brbe_record_is_target_only(brbinf)) {
 			entry->mispred = brbinf_get_mispredict(brbinf);
@@ -774,32 +795,97 @@ void brbe_read_filtered_entries(struct perf_branch_stack *branch_stack,
 {
 	struct arm_pmu *cpu_pmu = to_arm_pmu(event->pmu);
 	int nr_hw = brbe_num_branch_records(cpu_pmu);
-	int nr_banks = DIV_ROUND_UP(nr_hw, BRBE_BANK_MAX_ENTRIES);
 	int nr_filtered = 0;
 	u64 branch_sample_type = event->attr.branch_sample_type;
 	DECLARE_BITMAP(event_type_mask, PERF_BR_ARM64_MAX);
 
 	prepare_event_branch_type_mask(branch_sample_type, event_type_mask);
 
-	for (int bank = 0; bank < nr_banks; bank++) {
-		int nr_remaining = nr_hw - (bank * BRBE_BANK_MAX_ENTRIES);
-		int nr_this_bank = min(nr_remaining, BRBE_BANK_MAX_ENTRIES);
+	for_each_brbe_entry(i, nr_hw) {
+		struct perf_branch_entry *pbe = &branch_stack->entries[nr_filtered];
 
-		select_brbe_bank(bank);
+		if (!perf_entry_from_brbe_regset(i, pbe, event))
+			break;
 
-		for (int i = 0; i < nr_this_bank; i++) {
-			struct perf_branch_entry *pbe = &branch_stack->entries[nr_filtered];
+		if (!filter_branch_record(pbe, branch_sample_type, event_type_mask))
+			continue;
 
-			if (!perf_entry_from_brbe_regset(i, pbe, event))
-				goto done;
+		nr_filtered++;
+	}
 
-			if (!filter_branch_record(pbe, branch_sample_type, event_type_mask))
-				continue;
+	branch_stack->nr = nr_filtered;
+}
 
-			nr_filtered++;
-		}
+/*
+ * Best-effort BRBE snapshot for BPF tracing. Pause BRBE to avoid
+ * self-recording and return 0 if the snapshot state appears disturbed.
+ */
+int arm_brbe_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int cnt)
+{
+	unsigned long flags;
+	int nr_hw, nr_copied = 0;
+	u64 brbfcr, brbcr;
+
+	if (!cnt)
+		return 0;
+
+	/* Guard against running on a CPU without BRBE (e.g. big.LITTLE). */
+	if (!cpu_has_brbe())
+		return 0;
+
+	/*
+	 * Pause BRBE first to avoid recording our own branches. The
+	 * sysreg read/write and ISB are branchless, so pausing before
+	 * checking BRBCR avoids polluting the buffer with our own
+	 * conditional branches.
+	 */
+	brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+	brbcr = read_sysreg_s(SYS_BRBCR_EL1);
+	write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
+	isb();
+
+	/* Bail out if BRBE is not enabled (BRBCR_EL1 == 0). */
+	if (!brbcr) {
+		write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
+		isb();
+		return 0;
 	}
 
-done:
-	branch_stack->nr = nr_filtered;
+	/* Block local exception delivery while reading the buffer. */
+	flags = local_daif_save();
+
+	/*
+	 * A PMU overflow before local_daif_save() could have re-enabled
+	 * BRBE, clearing the PAUSED bit. The overflow handler already
+	 * restored BRBE to its correct state, so just bail out.
+	 */
+	if (!(read_sysreg_s(SYS_BRBFCR_EL1) & BRBFCR_EL1_PAUSED)) {
+		local_daif_restore(flags);
+		return 0;
+	}
+
+	nr_hw = FIELD_GET(BRBIDR0_EL1_NUMREC_MASK,
+			  read_sysreg_s(SYS_BRBIDR0_EL1));
+
+	for_each_brbe_entry(i, nr_hw) {
+		if (nr_copied >= cnt)
+			break;
+
+		if (!perf_entry_from_brbe_regset(i, &entries[nr_copied], NULL))
+			break;
+
+		nr_copied++;
+	}
+
+	brbe_invalidate();
+
+	/* Restore BRBCR before unpausing via BRBFCR, matching brbe_enable(). */
+	write_sysreg_s(brbcr, SYS_BRBCR_EL1);
+	isb();
+	write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
+	/* Ensure BRBE is unpaused before returning to the caller. */
+	isb();
+	local_daif_restore(flags);
+
+	return nr_copied;
 }
diff --git a/drivers/perf/arm_brbe.h b/drivers/perf/arm_brbe.h
index b7c7d8796c86..c2a1824437fb 100644
--- a/drivers/perf/arm_brbe.h
+++ b/drivers/perf/arm_brbe.h
@@ -10,6 +10,7 @@
 struct arm_pmu;
 struct perf_branch_stack;
 struct perf_event;
+struct perf_branch_entry;
 
 #ifdef CONFIG_ARM64_BRBE
 void brbe_probe(struct arm_pmu *arm_pmu);
@@ -22,6 +23,8 @@ void brbe_disable(void);
 bool brbe_branch_attr_valid(struct perf_event *event);
 void brbe_read_filtered_entries(struct perf_branch_stack *branch_stack,
 				const struct perf_event *event);
+int arm_brbe_snapshot_branch_stack(struct perf_branch_entry *entries,
+				   unsigned int cnt);
 #else
 static inline void brbe_probe(struct arm_pmu *arm_pmu) { }
 static inline unsigned int brbe_num_branch_records(const struct arm_pmu *armpmu)
@@ -44,4 +47,10 @@ static void brbe_read_filtered_entries(struct perf_branch_stack *branch_stack,
 				       const struct perf_event *event)
 {
 }
+
+static inline int arm_brbe_snapshot_branch_stack(struct perf_branch_entry *entries,
+						 unsigned int cnt)
+{
+	return 0;
+}
 #endif
diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index 8014ff766cff..1a9f129a0f94 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -1449,8 +1449,11 @@ static int armv8_pmu_init(struct arm_pmu *cpu_pmu, char *name,
 	cpu_pmu->set_event_filter	= armv8pmu_set_event_filter;
 
 	cpu_pmu->pmu.event_idx		= armv8pmu_user_event_idx;
-	if (brbe_num_branch_records(cpu_pmu))
+	if (brbe_num_branch_records(cpu_pmu)) {
 		cpu_pmu->pmu.sched_task		= armv8pmu_sched_task;
+		static_call_update(perf_snapshot_branch_stack,
+				   arm_brbe_snapshot_branch_stack);
+	}
 
 	cpu_pmu->name			= name;
 	cpu_pmu->map_event		= map_event;
-- 
2.53.0-Meta



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v5 4/4] selftests/bpf: Adjust wasted entries threshold for ARM64 BRBE
  2026-06-16 15:57 [PATCH v5 0/4] arm64: Add BRBE support for bpf_get_branch_snapshot() Puranjay Mohan
                   ` (2 preceding siblings ...)
  2026-06-16 15:57 ` [PATCH v5 3/4] perf/arm64: Add BRBE support for bpf_get_branch_snapshot() Puranjay Mohan
@ 2026-06-16 15:57 ` Puranjay Mohan
  2026-06-16 16:04   ` sashiko-bot
  3 siblings, 1 reply; 9+ messages in thread
From: Puranjay Mohan @ 2026-06-16 15:57 UTC (permalink / raw)
  To: bpf
  Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
	Daniel Borkmann, John Fastabend, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	Will Deacon, Mark Rutland, Catalin Marinas, Leo Yan, Rob Herring,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Namhyung Kim, James Clark, Ian Rogers, Adrian Hunter, Shuah Khan,
	Breno Leitao, Ravi Bangoria, Stephane Eranian,
	Kumar Kartikeya Dwivedi, Usama Arif, linux-arm-kernel,
	linux-perf-users, linux-kselftest, linux-kernel, kernel-team

The get_branch_snapshot test checks that bpf_get_branch_snapshot()
doesn't waste too many branch entries on infrastructure overhead. The
threshold of < 10 was calibrated for x86 where about 7 entries are
wasted.

On ARM64, the BPF trampoline generates more branches than x86,
resulting in about 13 wasted entries. The overhead comes from the BPF
trampoline calling __bpf_prog_enter_recur which on ARM64 makes
out-of-line calls to __rcu_read_lock and generates more conditional
branches than x86:

 [#12] bpf_testmod_loop_test+0x40    -> bpf_trampoline_...+0x48
 [#11] bpf_trampoline_...+0x68       -> __bpf_prog_enter_recur+0x0
 [#10] __bpf_prog_enter_recur+0x20   -> __bpf_prog_enter_recur+0x118
 [#09] __bpf_prog_enter_recur+0x154  -> __bpf_prog_enter_recur+0x160
 [#08] __bpf_prog_enter_recur+0x164  -> __bpf_prog_enter_recur+0x2c
 [#07] __bpf_prog_enter_recur+0x2c   -> __rcu_read_lock+0x0
 [#06] __rcu_read_lock+0x18          -> __bpf_prog_enter_recur+0x30
 [#05] __bpf_prog_enter_recur+0x9c   -> __bpf_prog_enter_recur+0xf0
 [#04] __bpf_prog_enter_recur+0xf4   -> __bpf_prog_enter_recur+0xa8
 [#03] __bpf_prog_enter_recur+0xb8   -> __bpf_prog_enter_recur+0x100
 [#02] __bpf_prog_enter_recur+0x114  -> bpf_trampoline_...+0x6c
 [#01] bpf_trampoline_...+0x78       -> bpf_prog_...test1+0x0
 [#00] bpf_prog_...test1+0x58        -> arm_brbe_snapshot_branch_stack+0x0

Use an architecture-specific threshold of < 14 for ARM64 to accommodate
this overhead while still detecting regressions.

Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
 .../selftests/bpf/prog_tests/get_branch_snapshot.c  | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c b/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
index 0394a1156d99..8d1a3480767f 100644
--- a/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
+++ b/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
@@ -116,13 +116,18 @@ void serial_test_get_branch_snapshot(void)
 
 	ASSERT_GT(skel->bss->test1_hits, 6, "find_looptest_in_lbr");
 
-	/* Given we stop LBR in software, we will waste a few entries.
+	/* Given we stop LBR/BRBE in software, we will waste a few entries.
 	 * But we should try to waste as few as possible entries. We are at
-	 * about 7 on x86_64 systems.
-	 * Add a check for < 10 so that we get heads-up when something
-	 * changes and wastes too many entries.
+	 * about 7 on x86_64 and about 13 on arm64 systems (the arm64 BPF
+	 * trampoline generates more branches than x86_64).
+	 * Add a check so that we get heads-up when something changes and
+	 * wastes too many entries.
 	 */
+#if defined(__aarch64__)
+	ASSERT_LT(skel->bss->wasted_entries, 14, "check_wasted_entries");
+#else
 	ASSERT_LT(skel->bss->wasted_entries, 10, "check_wasted_entries");
+#endif
 
 cleanup:
 	get_branch_snapshot__destroy(skel);
-- 
2.53.0-Meta



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v5 4/4] selftests/bpf: Adjust wasted entries threshold for ARM64 BRBE
  2026-06-16 15:57 ` [PATCH v5 4/4] selftests/bpf: Adjust wasted entries threshold for ARM64 BRBE Puranjay Mohan
@ 2026-06-16 16:04   ` sashiko-bot
  2026-06-16 16:12     ` Puranjay Mohan
  0 siblings, 1 reply; 9+ messages in thread
From: sashiko-bot @ 2026-06-16 16:04 UTC (permalink / raw)
  To: Puranjay Mohan; +Cc: linux-perf-users, bpf

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Medium] The test skip threshold for `total_entries` was not adjusted for the increased ARM64 overhead, leading to guaranteed test failures on hardware with 16 branch entries.
- [Medium] The ARM64 threshold for `wasted_entries` provides zero margin for environmental variations, making the test extremely brittle.
--

commit 08ff160f53fde2f401cb063e6cbc84b6bc9a9970
Author: Puranjay Mohan <puranjay@kernel.org>

selftests/bpf: Adjust wasted entries threshold for ARM64 BRBE

The commit adjusts the wasted entries threshold in the get_branch_snapshot test
for ARM64. It raises the threshold to accommodate the ~13 wasted entries
generated by the ARM64 BPF trampoline compared to ~7 on x86.

> diff --git a/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c b/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
> index 0394a1156d99d..8d1a3480767fe 100644
> --- a/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
> +++ b/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
> @@ -116,13 +116,18 @@ void serial_test_get_branch_snapshot(void)
>  
>  	ASSERT_GT(skel->bss->test1_hits, 6, "find_looptest_in_lbr");

[Severity: Medium]
Since this patch increases the expected wasted entries to ~13 on ARM64, does the
skip threshold also need to be updated?

Right above this line, the test skips if total_entries is too small:

    if (skel->bss->total_entries < 16) {
        /* too few entries for the hit/waste test */
        test__skip();
        goto cleanup;
    }

On an ARM64 system with a 16-entry BRBE buffer, the skip condition evaluates
to false, but with 13 wasted entries, only 3 remain for hits. This would cause
the ASSERT_GT check to deterministically fail rather than gracefully skipping.

Could the threshold be adjusted to account for the additional ARM64 overhead
(e.g., needing at least 13 + 7 = 20 entries)?

> -	/* Given we stop LBR in software, we will waste a few entries.
> +	/* Given we stop LBR/BRBE in software, we will waste a few entries.
>  	 * But we should try to waste as few as possible entries. We are at
> -	 * about 7 on x86_64 systems.
> -	 * Add a check for < 10 so that we get heads-up when something
> -	 * changes and wastes too many entries.
> +	 * about 7 on x86_64 and about 13 on arm64 systems (the arm64 BPF
> +	 * trampoline generates more branches than x86_64).
> +	 * Add a check so that we get heads-up when something changes and
> +	 * wastes too many entries.
>  	 */
> +#if defined(__aarch64__)
> +	ASSERT_LT(skel->bss->wasted_entries, 14, "check_wasted_entries");

[Severity: Medium]
Does this threshold provide enough margin for variations in kernel configuration?

The commit message notes there are about 13 wasted entries. Using < 14 means
the test will fail if exactly 14 entries are wasted, leaving no margin for
error. 

Minor variations like KASAN, PREEMPT_RCU, or different compiler code
generation could add a single branch to __bpf_prog_enter_recur and cause this
test to become flaky.

For comparison, the x86 threshold is < 10 for an expected 7 entries, giving
a 2-3 branch margin. Could the ARM64 threshold be relaxed slightly to prevent
spurious failures?

> +#else
>  	ASSERT_LT(skel->bss->wasted_entries, 10, "check_wasted_entries");
> +#endif
>  
>  cleanup:
>  	get_branch_snapshot__destroy(skel);

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260616155716.2631508-1-puranjay@kernel.org?part=4

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v5 4/4] selftests/bpf: Adjust wasted entries threshold for ARM64 BRBE
  2026-06-16 16:04   ` sashiko-bot
@ 2026-06-16 16:12     ` Puranjay Mohan
  0 siblings, 0 replies; 9+ messages in thread
From: Puranjay Mohan @ 2026-06-16 16:12 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-perf-users, bpf

On Tue, Jun 16, 2026 at 5:04 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
> - [Medium] The test skip threshold for `total_entries` was not adjusted for the increased ARM64 overhead, leading to guaranteed test failures on hardware with 16 branch entries.
> - [Medium] The ARM64 threshold for `wasted_entries` provides zero margin for environmental variations, making the test extremely brittle.
> --
>
> commit 08ff160f53fde2f401cb063e6cbc84b6bc9a9970
> Author: Puranjay Mohan <puranjay@kernel.org>
>
> selftests/bpf: Adjust wasted entries threshold for ARM64 BRBE
>
> The commit adjusts the wasted entries threshold in the get_branch_snapshot test
> for ARM64. It raises the threshold to accommodate the ~13 wasted entries
> generated by the ARM64 BPF trampoline compared to ~7 on x86.
>
> > diff --git a/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c b/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
> > index 0394a1156d99d..8d1a3480767fe 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/get_branch_snapshot.c
> > @@ -116,13 +116,18 @@ void serial_test_get_branch_snapshot(void)
> >
> >       ASSERT_GT(skel->bss->test1_hits, 6, "find_looptest_in_lbr");
>
> [Severity: Medium]
> Since this patch increases the expected wasted entries to ~13 on ARM64, does the
> skip threshold also need to be updated?
>
> Right above this line, the test skips if total_entries is too small:
>
>     if (skel->bss->total_entries < 16) {
>         /* too few entries for the hit/waste test */
>         test__skip();
>         goto cleanup;
>     }
>
> On an ARM64 system with a 16-entry BRBE buffer, the skip condition evaluates
> to false, but with 13 wasted entries, only 3 remain for hits. This would cause
> the ASSERT_GT check to deterministically fail rather than gracefully skipping.
>
> Could the threshold be adjusted to account for the additional ARM64 overhead
> (e.g., needing at least 13 + 7 = 20 entries)?
>
> > -     /* Given we stop LBR in software, we will waste a few entries.
> > +     /* Given we stop LBR/BRBE in software, we will waste a few entries.
> >        * But we should try to waste as few as possible entries. We are at
> > -      * about 7 on x86_64 systems.
> > -      * Add a check for < 10 so that we get heads-up when something
> > -      * changes and wastes too many entries.
> > +      * about 7 on x86_64 and about 13 on arm64 systems (the arm64 BPF
> > +      * trampoline generates more branches than x86_64).
> > +      * Add a check so that we get heads-up when something changes and
> > +      * wastes too many entries.
> >        */
> > +#if defined(__aarch64__)
> > +     ASSERT_LT(skel->bss->wasted_entries, 14, "check_wasted_entries");
>
> [Severity: Medium]
> Does this threshold provide enough margin for variations in kernel configuration?
>
> The commit message notes there are about 13 wasted entries. Using < 14 means
> the test will fail if exactly 14 entries are wasted, leaving no margin for
> error.
>
> Minor variations like KASAN, PREEMPT_RCU, or different compiler code
> generation could add a single branch to __bpf_prog_enter_recur and cause this
> test to become flaky.
>
> For comparison, the x86 threshold is < 10 for an expected 7 entries, giving
> a 2-3 branch margin. Could the ARM64 threshold be relaxed slightly to prevent
> spurious failures?
>
> > +#else
> >       ASSERT_LT(skel->bss->wasted_entries, 10, "check_wasted_entries");
> > +#endif
> >
> >  cleanup:
> >       get_branch_snapshot__destroy(skel);


This selftest is supposed to work with the standard BPF CI config and
setup which it does currently, it doesn't need to support all
different configurations and setups.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v5 3/4] perf/arm64: Add BRBE support for bpf_get_branch_snapshot()
  2026-06-16 15:57 ` [PATCH v5 3/4] perf/arm64: Add BRBE support for bpf_get_branch_snapshot() Puranjay Mohan
@ 2026-06-16 16:15   ` sashiko-bot
  2026-06-16 16:17     ` Puranjay Mohan
  0 siblings, 1 reply; 9+ messages in thread
From: sashiko-bot @ 2026-06-16 16:15 UTC (permalink / raw)
  To: Puranjay Mohan; +Cc: bpf, linux-perf-users

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] `arm_brbe_snapshot_branch_stack` contains a TOCTOU vulnerability where caching `SYS_BRBCR_EL1` before disabling interrupts allows a stale value to be restored, unintentionally re-enabling BRBE if an interrupt disables the PMU event in the intervening window.
--

commit dbaa008d5027ec5e859cbcac51237295c1187ed6
Author: Puranjay Mohan <puranjay@kernel.org>

perf/arm64: Add BRBE support for bpf_get_branch_snapshot()

Enable bpf_get_branch_snapshot() on ARM64 by implementing the
perf_snapshot_branch_stack static call for BRBE. This pauses BRBE,
masks exceptions to prevent PMU overflow pseudo-NMI interference,
and reads the branch records.

> diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
> index effbdeacfcbb3..a141ad7abcf25 100644
> --- a/drivers/perf/arm_brbe.c
> +++ b/drivers/perf/arm_brbe.c
[ ... ]
> @@ -774,32 +795,97 @@ void brbe_read_filtered_entries(struct perf_branch_stack *branch_stack,
[ ... ]
> +int arm_brbe_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int cnt)
> +{
> +	unsigned long flags;
> +	int nr_hw, nr_copied = 0;
> +	u64 brbfcr, brbcr;
> +
> +	if (!cnt)
> +		return 0;
> +
> +	/* Guard against running on a CPU without BRBE (e.g. big.LITTLE). */
> +	if (!cpu_has_brbe())
> +		return 0;
> +
> +	/*
> +	 * Pause BRBE first to avoid recording our own branches. The
> +	 * sysreg read/write and ISB are branchless, so pausing before
> +	 * checking BRBCR avoids polluting the buffer with our own
> +	 * conditional branches.
> +	 */
> +	brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
> +	brbcr = read_sysreg_s(SYS_BRBCR_EL1);
> +	write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> +	isb();
> +
> +	/* Bail out if BRBE is not enabled (BRBCR_EL1 == 0). */
> +	if (!brbcr) {
> +		write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
> +		isb();
> +		return 0;
> +	}
> +
> +	/* Block local exception delivery while reading the buffer. */
> +	flags = local_daif_save();
> +
> +	/*
> +	 * A PMU overflow before local_daif_save() could have re-enabled
> +	 * BRBE, clearing the PAUSED bit. The overflow handler already
> +	 * restored BRBE to its correct state, so just bail out.
> +	 */
> +	if (!(read_sysreg_s(SYS_BRBFCR_EL1) & BRBFCR_EL1_PAUSED)) {
> +		local_daif_restore(flags);
> +		return 0;
> +	}
> +
> +	nr_hw = FIELD_GET(BRBIDR0_EL1_NUMREC_MASK,
> +			  read_sysreg_s(SYS_BRBIDR0_EL1));
> +
> +	for_each_brbe_entry(i, nr_hw) {
> +		if (nr_copied >= cnt)
> +			break;
> +
> +		if (!perf_entry_from_brbe_regset(i, &entries[nr_copied], NULL))
> +			break;
> +
> +		nr_copied++;
> +	}
> +
> +	brbe_invalidate();
> +
> +	/* Restore BRBCR before unpausing via BRBFCR, matching brbe_enable(). */
> +	write_sysreg_s(brbcr, SYS_BRBCR_EL1);

[Severity: High]
Is it possible for a race condition here to accidentally re-enable BRBE?

SYS_BRBCR_EL1 is read and cached in brbcr before local_daif_save() disables
interrupts.

If an interrupt arrives in this window and disables the PMU (for example,
an IPI calling perf_remove_from_context() that invokes brbe_disable()),
it would set the BRBFCR_EL1_PAUSED bit.

When execution resumes, the bailout condition evaluates to false because
the PAUSED bit is now set:

    if (!(read_sysreg_s(SYS_BRBFCR_EL1) & BRBFCR_EL1_PAUSED)) {
        local_daif_restore(flags);
        return 0;
    }

The function then continues, and blindly writes the cached brbcr back to
the hardware here.

Since the cached brbcr had BRBE enabled, could this unintentionally
re-enable BRBE globally despite the kernel having just explicitly disabled it?

> +	isb();
> +	write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
> +	/* Ensure BRBE is unpaused before returning to the caller. */
> +	isb();
> +	local_daif_restore(flags);
> +
> +	return nr_copied;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260616155716.2631508-1-puranjay@kernel.org?part=3

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v5 3/4] perf/arm64: Add BRBE support for bpf_get_branch_snapshot()
  2026-06-16 16:15   ` sashiko-bot
@ 2026-06-16 16:17     ` Puranjay Mohan
  0 siblings, 0 replies; 9+ messages in thread
From: Puranjay Mohan @ 2026-06-16 16:17 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: bpf, linux-perf-users

On Tue, Jun 16, 2026 at 5:15 PM <sashiko-bot@kernel.org> wrote:
>
> Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
> - [High] `arm_brbe_snapshot_branch_stack` contains a TOCTOU vulnerability where caching `SYS_BRBCR_EL1` before disabling interrupts allows a stale value to be restored, unintentionally re-enabling BRBE if an interrupt disables the PMU event in the intervening window.
> --
>
> commit dbaa008d5027ec5e859cbcac51237295c1187ed6
> Author: Puranjay Mohan <puranjay@kernel.org>
>
> perf/arm64: Add BRBE support for bpf_get_branch_snapshot()
>
> Enable bpf_get_branch_snapshot() on ARM64 by implementing the
> perf_snapshot_branch_stack static call for BRBE. This pauses BRBE,
> masks exceptions to prevent PMU overflow pseudo-NMI interference,
> and reads the branch records.
>
> > diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
> > index effbdeacfcbb3..a141ad7abcf25 100644
> > --- a/drivers/perf/arm_brbe.c
> > +++ b/drivers/perf/arm_brbe.c
> [ ... ]
> > @@ -774,32 +795,97 @@ void brbe_read_filtered_entries(struct perf_branch_stack *branch_stack,
> [ ... ]
> > +int arm_brbe_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int cnt)
> > +{
> > +     unsigned long flags;
> > +     int nr_hw, nr_copied = 0;
> > +     u64 brbfcr, brbcr;
> > +
> > +     if (!cnt)
> > +             return 0;
> > +
> > +     /* Guard against running on a CPU without BRBE (e.g. big.LITTLE). */
> > +     if (!cpu_has_brbe())
> > +             return 0;
> > +
> > +     /*
> > +      * Pause BRBE first to avoid recording our own branches. The
> > +      * sysreg read/write and ISB are branchless, so pausing before
> > +      * checking BRBCR avoids polluting the buffer with our own
> > +      * conditional branches.
> > +      */
> > +     brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
> > +     brbcr = read_sysreg_s(SYS_BRBCR_EL1);
> > +     write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> > +     isb();
> > +
> > +     /* Bail out if BRBE is not enabled (BRBCR_EL1 == 0). */
> > +     if (!brbcr) {
> > +             write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
> > +             isb();
> > +             return 0;
> > +     }
> > +
> > +     /* Block local exception delivery while reading the buffer. */
> > +     flags = local_daif_save();
> > +
> > +     /*
> > +      * A PMU overflow before local_daif_save() could have re-enabled
> > +      * BRBE, clearing the PAUSED bit. The overflow handler already
> > +      * restored BRBE to its correct state, so just bail out.
> > +      */
> > +     if (!(read_sysreg_s(SYS_BRBFCR_EL1) & BRBFCR_EL1_PAUSED)) {
> > +             local_daif_restore(flags);
> > +             return 0;
> > +     }
> > +
> > +     nr_hw = FIELD_GET(BRBIDR0_EL1_NUMREC_MASK,
> > +                       read_sysreg_s(SYS_BRBIDR0_EL1));
> > +
> > +     for_each_brbe_entry(i, nr_hw) {
> > +             if (nr_copied >= cnt)
> > +                     break;
> > +
> > +             if (!perf_entry_from_brbe_regset(i, &entries[nr_copied], NULL))
> > +                     break;
> > +
> > +             nr_copied++;
> > +     }
> > +
> > +     brbe_invalidate();
> > +
> > +     /* Restore BRBCR before unpausing via BRBFCR, matching brbe_enable(). */
> > +     write_sysreg_s(brbcr, SYS_BRBCR_EL1);
>
> [Severity: High]
> Is it possible for a race condition here to accidentally re-enable BRBE?
>
> SYS_BRBCR_EL1 is read and cached in brbcr before local_daif_save() disables
> interrupts.
>
> If an interrupt arrives in this window and disables the PMU (for example,
> an IPI calling perf_remove_from_context() that invokes brbe_disable()),
> it would set the BRBFCR_EL1_PAUSED bit.
>
> When execution resumes, the bailout condition evaluates to false because
> the PAUSED bit is now set:
>
>     if (!(read_sysreg_s(SYS_BRBFCR_EL1) & BRBFCR_EL1_PAUSED)) {
>         local_daif_restore(flags);
>         return 0;
>     }
>
> The function then continues, and blindly writes the cached brbcr back to
> the hardware here.
>
> Since the cached brbcr had BRBE enabled, could this unintentionally
> re-enable BRBE globally despite the kernel having just explicitly disabled it?

The kernel doesn't leave the BRBE disabled, it enables it at the end
in the PMU interrupt handler and that is the only place where it is
changed. So, this scenario doesn't exist in code.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2026-06-16 16:17 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-16 15:57 [PATCH v5 0/4] arm64: Add BRBE support for bpf_get_branch_snapshot() Puranjay Mohan
2026-06-16 15:57 ` [PATCH v5 1/4] perf/core: Fix sched_task callbacks for CPU-wide branch stack events Puranjay Mohan
2026-06-16 15:57 ` [PATCH v5 2/4] perf/core: Clear the whole branch entry in perf_clear_branch_entry() Puranjay Mohan
2026-06-16 15:57 ` [PATCH v5 3/4] perf/arm64: Add BRBE support for bpf_get_branch_snapshot() Puranjay Mohan
2026-06-16 16:15   ` sashiko-bot
2026-06-16 16:17     ` Puranjay Mohan
2026-06-16 15:57 ` [PATCH v5 4/4] selftests/bpf: Adjust wasted entries threshold for ARM64 BRBE Puranjay Mohan
2026-06-16 16:04   ` sashiko-bot
2026-06-16 16:12     ` Puranjay Mohan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.