* [RFC PATCH] perf/arm64: Add BRBE support for bpf_get_branch_snapshot()
@ 2026-01-02 21:40 Puranjay Mohan
2026-01-02 21:51 ` Puranjay Mohan
2026-01-05 17:49 ` Mark Rutland
0 siblings, 2 replies; 7+ messages in thread
From: Puranjay Mohan @ 2026-01-02 21:40 UTC (permalink / raw)
To: bpf
Cc: Puranjay Mohan, Puranjay Mohan, Alexei Starovoitov,
Andrii Nakryiko, Daniel Borkmann, Martin KaFai Lau,
Eduard Zingerman, Kumar Kartikeya Dwivedi, Will Deacon,
Mark Rutland, Catalin Marinas, Rob Herring, Breno Leitao,
linux-arm-kernel, linux-perf-users, kernel-team
Enable the bpf_get_branch_snapshot() BPF helper on ARM64 by implementing
the perf_snapshot_branch_stack static call for ARM's Branch Record
Buffer Extension (BRBE).
The BPF helper bpf_get_branch_snapshot() allows BPF programs to capture
hardware branch records on-demand. This was previously only available on
x86 (Intel LBR, AMD BRS) but not on ARM64 despite BRBE being available
since ARMv9.
This implementation:
- Follows the x86 snapshot pattern (intel_pmu_snapshot_branch_stack)
- Performs atomic snapshot by pausing BRBE, reading records, and
restoring previous state without disrupting ongoing perf events
- Reads branch records directly from BRBE registers without
event-specific filtering to minimize branch pollution
- Handles all BRBE record types (complete, source-only, target-only)
- Complies with ARM ARM synchronization requirements (ISB barriers per
rule PPBZP)
- Reuses existing BRBE infrastructure (select_brbe_bank,
__read_brbe_regset, brbe_set_perf_entry_type, etc.)
Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
---
This patch is only compile tested as I don't have access to hardware with BRBE.
---
drivers/perf/arm_brbe.c | 95 ++++++++++++++++++++++++++++++++++++++++
drivers/perf/arm_brbe.h | 9 ++++
drivers/perf/arm_pmuv3.c | 5 ++-
3 files changed, 108 insertions(+), 1 deletion(-)
diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
index ba554e0c846c..cda7bf522c06 100644
--- a/drivers/perf/arm_brbe.c
+++ b/drivers/perf/arm_brbe.c
@@ -803,3 +803,98 @@ void brbe_read_filtered_entries(struct perf_branch_stack *branch_stack,
done:
branch_stack->nr = nr_filtered;
}
+
+/*
+ * ARM-specific callback invoked through perf_snapshot_branch_stack static
+ * call, defined in include/linux/perf_event.h. See its definition for API
+ * details. It's up to caller to provide enough space in *entries* to fit all
+ * branch records, otherwise returned result will be truncated to *cnt* entries.
+ *
+ * This is similar to brbe_read_filtered_entries but optimized for snapshot mode:
+ * - No filtering based on event attributes (captures everything)
+ * - Minimal branches to avoid polluting the branch buffer
+ * - Direct register reads without event-specific processing
+ */
+int arm_brbe_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int cnt)
+{
+ unsigned long flags;
+ int nr_hw, nr_banks, nr_copied = 0;
+ u64 brbidr, brbfcr, brbcr;
+
+ /*
+ * The sequence of steps to freeze BRBE should be completely inlined
+ * and contain no branches to minimize contamination of branch snapshot.
+ */
+ local_irq_save(flags);
+
+ /* Save current BRBE configuration */
+ brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+ brbcr = read_sysreg_s(SYS_BRBCR_EL1);
+
+ /* Pause BRBE to freeze the buffer */
+ write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
+ isb();
+
+ /* Read BRBIDR to determine number of records */
+ brbidr = read_sysreg_s(SYS_BRBIDR0_EL1);
+ if (!valid_brbidr(brbidr))
+ goto out_restore;
+
+ nr_hw = FIELD_GET(BRBIDR0_EL1_NUMREC_MASK, brbidr);
+ nr_banks = DIV_ROUND_UP(nr_hw, BRBE_BANK_MAX_ENTRIES);
+
+ /* Read branch records from BRBE banks */
+ for (int bank = 0; bank < nr_banks; bank++) {
+ int nr_remaining = nr_hw - (bank * BRBE_BANK_MAX_ENTRIES);
+ int nr_this_bank = min(nr_remaining, BRBE_BANK_MAX_ENTRIES);
+
+ select_brbe_bank(bank);
+
+ for (int i = 0; i < nr_this_bank; i++) {
+ struct brbe_regset bregs;
+ struct perf_branch_entry *entry;
+
+ if (nr_copied >= cnt)
+ goto out_restore;
+
+ if (!__read_brbe_regset(&bregs, i))
+ goto out_restore;
+
+ entry = &entries[nr_copied];
+ perf_clear_branch_entry_bitfields(entry);
+
+ /* Simple conversion without filtering */
+ if (brbe_record_is_complete(bregs.brbinf)) {
+ entry->from = bregs.brbsrc;
+ entry->to = bregs.brbtgt;
+ } else if (brbe_record_is_source_only(bregs.brbinf)) {
+ entry->from = bregs.brbsrc;
+ entry->to = 0;
+ } else if (brbe_record_is_target_only(bregs.brbinf)) {
+ entry->from = 0;
+ entry->to = bregs.brbtgt;
+ }
+
+ brbe_set_perf_entry_type(entry, bregs.brbinf);
+ entry->cycles = brbinf_get_cycles(bregs.brbinf);
+
+ if (!brbe_record_is_target_only(bregs.brbinf)) {
+ entry->mispred = brbinf_get_mispredict(bregs.brbinf);
+ entry->predicted = !entry->mispred;
+ }
+
+ if (!brbe_record_is_source_only(bregs.brbinf))
+ entry->priv = brbinf_get_perf_priv(bregs.brbinf);
+
+ nr_copied++;
+ }
+ }
+
+out_restore:
+ /* Restore BRBE to its previous state */
+ write_sysreg_s(brbcr, SYS_BRBCR_EL1);
+ isb();
+ write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
+ local_irq_restore(flags);
+ return nr_copied;
+}
diff --git a/drivers/perf/arm_brbe.h b/drivers/perf/arm_brbe.h
index b7c7d8796c86..c2a1824437fb 100644
--- a/drivers/perf/arm_brbe.h
+++ b/drivers/perf/arm_brbe.h
@@ -10,6 +10,7 @@
struct arm_pmu;
struct perf_branch_stack;
struct perf_event;
+struct perf_branch_entry;
#ifdef CONFIG_ARM64_BRBE
void brbe_probe(struct arm_pmu *arm_pmu);
@@ -22,6 +23,8 @@ void brbe_disable(void);
bool brbe_branch_attr_valid(struct perf_event *event);
void brbe_read_filtered_entries(struct perf_branch_stack *branch_stack,
const struct perf_event *event);
+int arm_brbe_snapshot_branch_stack(struct perf_branch_entry *entries,
+ unsigned int cnt);
#else
static inline void brbe_probe(struct arm_pmu *arm_pmu) { }
static inline unsigned int brbe_num_branch_records(const struct arm_pmu *armpmu)
@@ -44,4 +47,10 @@ static void brbe_read_filtered_entries(struct perf_branch_stack *branch_stack,
const struct perf_event *event)
{
}
+
+static inline int arm_brbe_snapshot_branch_stack(struct perf_branch_entry *entries,
+ unsigned int cnt)
+{
+ return 0;
+}
#endif
diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index 8014ff766cff..1a9f129a0f94 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -1449,8 +1449,11 @@ static int armv8_pmu_init(struct arm_pmu *cpu_pmu, char *name,
cpu_pmu->set_event_filter = armv8pmu_set_event_filter;
cpu_pmu->pmu.event_idx = armv8pmu_user_event_idx;
- if (brbe_num_branch_records(cpu_pmu))
+ if (brbe_num_branch_records(cpu_pmu)) {
cpu_pmu->pmu.sched_task = armv8pmu_sched_task;
+ static_call_update(perf_snapshot_branch_stack,
+ arm_brbe_snapshot_branch_stack);
+ }
cpu_pmu->name = name;
cpu_pmu->map_event = map_event;
base-commit: c286e7e9d1f1f3d90ad11c37e896f582b02d19c4
--
2.47.3
^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [RFC PATCH] perf/arm64: Add BRBE support for bpf_get_branch_snapshot()
2026-01-02 21:40 [RFC PATCH] perf/arm64: Add BRBE support for bpf_get_branch_snapshot() Puranjay Mohan
@ 2026-01-02 21:51 ` Puranjay Mohan
2026-01-05 21:22 ` Leo Yan
2026-01-05 17:49 ` Mark Rutland
1 sibling, 1 reply; 7+ messages in thread
From: Puranjay Mohan @ 2026-01-02 21:51 UTC (permalink / raw)
To: Will Deacon, Mark Rutland, Rob Herring, Catalin Marinas
Cc: Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Breno Leitao, linux-arm-kernel, linux-perf-users, kernel-team
Hi Catalin, Will, Mark, and Rob.
I don't have access to BRBE enabled hardware and Qemu doesn't support
this as well therefore I just sent the compile tested version.
Can you help me with testing this:
The relevant bpf self test is:
./test_progs -t get_branch_snapshot
#132 get_branch_snapshot:SKIP
Summary: 1/0 PASSED, 1 SKIPPED, 0 FAILED
skipped for my setup as it is not supported.
Thanks,
Puranjay
On Fri, Jan 2, 2026 at 9:41 PM Puranjay Mohan <puranjay@kernel.org> wrote:
>
> Enable the bpf_get_branch_snapshot() BPF helper on ARM64 by implementing
> the perf_snapshot_branch_stack static call for ARM's Branch Record
> Buffer Extension (BRBE).
>
> The BPF helper bpf_get_branch_snapshot() allows BPF programs to capture
> hardware branch records on-demand. This was previously only available on
> x86 (Intel LBR, AMD BRS) but not on ARM64 despite BRBE being available
> since ARMv9.
>
> This implementation:
>
> - Follows the x86 snapshot pattern (intel_pmu_snapshot_branch_stack)
> - Performs atomic snapshot by pausing BRBE, reading records, and
> restoring previous state without disrupting ongoing perf events
> - Reads branch records directly from BRBE registers without
> event-specific filtering to minimize branch pollution
> - Handles all BRBE record types (complete, source-only, target-only)
> - Complies with ARM ARM synchronization requirements (ISB barriers per
> rule PPBZP)
> - Reuses existing BRBE infrastructure (select_brbe_bank,
> __read_brbe_regset, brbe_set_perf_entry_type, etc.)
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
>
> This patch is only compile tested as I don't have access to hardware with BRBE.
>
> ---
> drivers/perf/arm_brbe.c | 95 ++++++++++++++++++++++++++++++++++++++++
> drivers/perf/arm_brbe.h | 9 ++++
> drivers/perf/arm_pmuv3.c | 5 ++-
> 3 files changed, 108 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
> index ba554e0c846c..cda7bf522c06 100644
> --- a/drivers/perf/arm_brbe.c
> +++ b/drivers/perf/arm_brbe.c
> @@ -803,3 +803,98 @@ void brbe_read_filtered_entries(struct perf_branch_stack *branch_stack,
> done:
> branch_stack->nr = nr_filtered;
> }
> +
> +/*
> + * ARM-specific callback invoked through perf_snapshot_branch_stack static
> + * call, defined in include/linux/perf_event.h. See its definition for API
> + * details. It's up to caller to provide enough space in *entries* to fit all
> + * branch records, otherwise returned result will be truncated to *cnt* entries.
> + *
> + * This is similar to brbe_read_filtered_entries but optimized for snapshot mode:
> + * - No filtering based on event attributes (captures everything)
> + * - Minimal branches to avoid polluting the branch buffer
> + * - Direct register reads without event-specific processing
> + */
> +int arm_brbe_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int cnt)
> +{
> + unsigned long flags;
> + int nr_hw, nr_banks, nr_copied = 0;
> + u64 brbidr, brbfcr, brbcr;
> +
> + /*
> + * The sequence of steps to freeze BRBE should be completely inlined
> + * and contain no branches to minimize contamination of branch snapshot.
> + */
> + local_irq_save(flags);
> +
> + /* Save current BRBE configuration */
> + brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
> + brbcr = read_sysreg_s(SYS_BRBCR_EL1);
> +
> + /* Pause BRBE to freeze the buffer */
> + write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> + isb();
> +
> + /* Read BRBIDR to determine number of records */
> + brbidr = read_sysreg_s(SYS_BRBIDR0_EL1);
> + if (!valid_brbidr(brbidr))
> + goto out_restore;
> +
> + nr_hw = FIELD_GET(BRBIDR0_EL1_NUMREC_MASK, brbidr);
> + nr_banks = DIV_ROUND_UP(nr_hw, BRBE_BANK_MAX_ENTRIES);
> +
> + /* Read branch records from BRBE banks */
> + for (int bank = 0; bank < nr_banks; bank++) {
> + int nr_remaining = nr_hw - (bank * BRBE_BANK_MAX_ENTRIES);
> + int nr_this_bank = min(nr_remaining, BRBE_BANK_MAX_ENTRIES);
> +
> + select_brbe_bank(bank);
> +
> + for (int i = 0; i < nr_this_bank; i++) {
> + struct brbe_regset bregs;
> + struct perf_branch_entry *entry;
> +
> + if (nr_copied >= cnt)
> + goto out_restore;
> +
> + if (!__read_brbe_regset(&bregs, i))
> + goto out_restore;
> +
> + entry = &entries[nr_copied];
> + perf_clear_branch_entry_bitfields(entry);
> +
> + /* Simple conversion without filtering */
> + if (brbe_record_is_complete(bregs.brbinf)) {
> + entry->from = bregs.brbsrc;
> + entry->to = bregs.brbtgt;
> + } else if (brbe_record_is_source_only(bregs.brbinf)) {
> + entry->from = bregs.brbsrc;
> + entry->to = 0;
> + } else if (brbe_record_is_target_only(bregs.brbinf)) {
> + entry->from = 0;
> + entry->to = bregs.brbtgt;
> + }
> +
> + brbe_set_perf_entry_type(entry, bregs.brbinf);
> + entry->cycles = brbinf_get_cycles(bregs.brbinf);
> +
> + if (!brbe_record_is_target_only(bregs.brbinf)) {
> + entry->mispred = brbinf_get_mispredict(bregs.brbinf);
> + entry->predicted = !entry->mispred;
> + }
> +
> + if (!brbe_record_is_source_only(bregs.brbinf))
> + entry->priv = brbinf_get_perf_priv(bregs.brbinf);
> +
> + nr_copied++;
> + }
> + }
> +
> +out_restore:
> + /* Restore BRBE to its previous state */
> + write_sysreg_s(brbcr, SYS_BRBCR_EL1);
> + isb();
> + write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
> + local_irq_restore(flags);
> + return nr_copied;
> +}
> diff --git a/drivers/perf/arm_brbe.h b/drivers/perf/arm_brbe.h
> index b7c7d8796c86..c2a1824437fb 100644
> --- a/drivers/perf/arm_brbe.h
> +++ b/drivers/perf/arm_brbe.h
> @@ -10,6 +10,7 @@
> struct arm_pmu;
> struct perf_branch_stack;
> struct perf_event;
> +struct perf_branch_entry;
>
> #ifdef CONFIG_ARM64_BRBE
> void brbe_probe(struct arm_pmu *arm_pmu);
> @@ -22,6 +23,8 @@ void brbe_disable(void);
> bool brbe_branch_attr_valid(struct perf_event *event);
> void brbe_read_filtered_entries(struct perf_branch_stack *branch_stack,
> const struct perf_event *event);
> +int arm_brbe_snapshot_branch_stack(struct perf_branch_entry *entries,
> + unsigned int cnt);
> #else
> static inline void brbe_probe(struct arm_pmu *arm_pmu) { }
> static inline unsigned int brbe_num_branch_records(const struct arm_pmu *armpmu)
> @@ -44,4 +47,10 @@ static void brbe_read_filtered_entries(struct perf_branch_stack *branch_stack,
> const struct perf_event *event)
> {
> }
> +
> +static inline int arm_brbe_snapshot_branch_stack(struct perf_branch_entry *entries,
> + unsigned int cnt)
> +{
> + return 0;
> +}
> #endif
> diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
> index 8014ff766cff..1a9f129a0f94 100644
> --- a/drivers/perf/arm_pmuv3.c
> +++ b/drivers/perf/arm_pmuv3.c
> @@ -1449,8 +1449,11 @@ static int armv8_pmu_init(struct arm_pmu *cpu_pmu, char *name,
> cpu_pmu->set_event_filter = armv8pmu_set_event_filter;
>
> cpu_pmu->pmu.event_idx = armv8pmu_user_event_idx;
> - if (brbe_num_branch_records(cpu_pmu))
> + if (brbe_num_branch_records(cpu_pmu)) {
> cpu_pmu->pmu.sched_task = armv8pmu_sched_task;
> + static_call_update(perf_snapshot_branch_stack,
> + arm_brbe_snapshot_branch_stack);
> + }
>
> cpu_pmu->name = name;
> cpu_pmu->map_event = map_event;
>
> base-commit: c286e7e9d1f1f3d90ad11c37e896f582b02d19c4
> --
> 2.47.3
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC PATCH] perf/arm64: Add BRBE support for bpf_get_branch_snapshot()
2026-01-02 21:40 [RFC PATCH] perf/arm64: Add BRBE support for bpf_get_branch_snapshot() Puranjay Mohan
2026-01-02 21:51 ` Puranjay Mohan
@ 2026-01-05 17:49 ` Mark Rutland
2026-01-05 19:31 ` Puranjay Mohan
1 sibling, 1 reply; 7+ messages in thread
From: Mark Rutland @ 2026-01-05 17:49 UTC (permalink / raw)
To: Puranjay Mohan
Cc: bpf, Puranjay Mohan, Alexei Starovoitov, Andrii Nakryiko,
Daniel Borkmann, Martin KaFai Lau, Eduard Zingerman,
Kumar Kartikeya Dwivedi, Will Deacon, Catalin Marinas,
Rob Herring, Breno Leitao, linux-arm-kernel, linux-perf-users,
kernel-team
Hi Puranjay,
I have a number of techincal concerns with this, noted below. At a high
level, I don't think this makes sense without a better description of
the usage, and I suspect we'd need changes to bpf_get_branch_snapshot()
to be able to implement something useful.
On Fri, Jan 02, 2026 at 01:40:41PM -0800, Puranjay Mohan wrote:
> Enable the bpf_get_branch_snapshot() BPF helper on ARM64 by implementing
> the perf_snapshot_branch_stack static call for ARM's Branch Record
> Buffer Extension (BRBE).
>
> The BPF helper bpf_get_branch_snapshot() allows BPF programs to capture
> hardware branch records on-demand. This was previously only available on
> x86 (Intel LBR, AMD BRS) but not on ARM64 despite BRBE being available
> since ARMv9.
When exactly can bpf_get_branch_snapshot() be called?
How is it expected to be used?
The driver is written on the expectation that BRBE records are only read
in the PMUv3 overflow handler.
> This implementation:
>
> - Follows the x86 snapshot pattern (intel_pmu_snapshot_branch_stack)
> - Performs atomic snapshot by pausing BRBE, reading records, and
> restoring previous state without disrupting ongoing perf events
If BRBE is stopped, then we lost contiguity of branch records (e.g. for
a series of branches A->B->C->D->E, we could record A->B->D->E). The
driver is intended to ensure contiguity of those records, and if that's
lost, we deliberately discard. That's why brbe_enable() calls
brbe_invalidate().
Restarting BRBE without discarding will produce bogus data for other
consumers. If BRBE is stopped, we *must* discard.
> - Reads branch records directly from BRBE registers without
> event-specific filtering to minimize branch pollution
If BRBE is paused, there's no polution, so this rationale doesn't make
sense to me.
BRBE can be configured with various filters (specific to the event),
multiple events with distinct filters can be active simultaneously, and
if no branch-sampling events are active, BRBE won't record anything. I
do not think it makes sense to record whatever happens to be present,
without filtering. Regardless, BRBE won't be recording if you don't have
events. so this doesn't seem to make sense.
> - Handles all BRBE record types (complete, source-only, target-only)
> - Complies with ARM ARM synchronization requirements (ISB barriers per
> rule PPBZP)
> - Reuses existing BRBE infrastructure (select_brbe_bank,
> __read_brbe_regset, brbe_set_perf_entry_type, etc.)
This duplicates other code (e.g. the body of
brbe_read_filtered_entries()), and I don't think this is justified.
>
> Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> ---
>
> This patch is only compile tested as I don't have access to hardware with BRBE.
>
> ---
> drivers/perf/arm_brbe.c | 95 ++++++++++++++++++++++++++++++++++++++++
> drivers/perf/arm_brbe.h | 9 ++++
> drivers/perf/arm_pmuv3.c | 5 ++-
> 3 files changed, 108 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
> index ba554e0c846c..cda7bf522c06 100644
> --- a/drivers/perf/arm_brbe.c
> +++ b/drivers/perf/arm_brbe.c
> @@ -803,3 +803,98 @@ void brbe_read_filtered_entries(struct perf_branch_stack *branch_stack,
> done:
> branch_stack->nr = nr_filtered;
> }
> +
> +/*
> + * ARM-specific callback invoked through perf_snapshot_branch_stack static
> + * call, defined in include/linux/perf_event.h. See its definition for API
> + * details. It's up to caller to provide enough space in *entries* to fit all
> + * branch records, otherwise returned result will be truncated to *cnt* entries.
> + *
> + * This is similar to brbe_read_filtered_entries but optimized for snapshot mode:
> + * - No filtering based on event attributes (captures everything)
> + * - Minimal branches to avoid polluting the branch buffer
> + * - Direct register reads without event-specific processing
> + */
As above I don't think these differences are necessary nor do I think we
should be duplicating this.
> +int arm_brbe_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int cnt)
> +{
> + unsigned long flags;
> + int nr_hw, nr_banks, nr_copied = 0;
> + u64 brbidr, brbfcr, brbcr;
> +
> + /*
> + * The sequence of steps to freeze BRBE should be completely inlined
> + * and contain no branches to minimize contamination of branch snapshot.
> + */
> + local_irq_save(flags);
The PMU overflow interrupt can be delivered as a pNMI, so this can race
with that. Disabling interrupts is not sufficient.
If this is only called within a perf event overflow handler, then we
already have the necessary serialization.
> +
> + /* Save current BRBE configuration */
> + brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
> + brbcr = read_sysreg_s(SYS_BRBCR_EL1);
> +
> + /* Pause BRBE to freeze the buffer */
> + write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> + isb();
> +
> + /* Read BRBIDR to determine number of records */
> + brbidr = read_sysreg_s(SYS_BRBIDR0_EL1);
> + if (!valid_brbidr(brbidr))
> + goto out_restore;
> +
> + nr_hw = FIELD_GET(BRBIDR0_EL1_NUMREC_MASK, brbidr);
> + nr_banks = DIV_ROUND_UP(nr_hw, BRBE_BANK_MAX_ENTRIES);
> +
> + /* Read branch records from BRBE banks */
> + for (int bank = 0; bank < nr_banks; bank++) {
> + int nr_remaining = nr_hw - (bank * BRBE_BANK_MAX_ENTRIES);
> + int nr_this_bank = min(nr_remaining, BRBE_BANK_MAX_ENTRIES);
> +
> + select_brbe_bank(bank);
> +
> + for (int i = 0; i < nr_this_bank; i++) {
> + struct brbe_regset bregs;
> + struct perf_branch_entry *entry;
> +
> + if (nr_copied >= cnt)
> + goto out_restore;
> +
> + if (!__read_brbe_regset(&bregs, i))
> + goto out_restore;
> +
> + entry = &entries[nr_copied];
> + perf_clear_branch_entry_bitfields(entry);
> +
> + /* Simple conversion without filtering */
> + if (brbe_record_is_complete(bregs.brbinf)) {
> + entry->from = bregs.brbsrc;
> + entry->to = bregs.brbtgt;
> + } else if (brbe_record_is_source_only(bregs.brbinf)) {
> + entry->from = bregs.brbsrc;
> + entry->to = 0;
> + } else if (brbe_record_is_target_only(bregs.brbinf)) {
> + entry->from = 0;
> + entry->to = bregs.brbtgt;
> + }
> +
> + brbe_set_perf_entry_type(entry, bregs.brbinf);
> + entry->cycles = brbinf_get_cycles(bregs.brbinf);
> +
> + if (!brbe_record_is_target_only(bregs.brbinf)) {
> + entry->mispred = brbinf_get_mispredict(bregs.brbinf);
> + entry->predicted = !entry->mispred;
> + }
> +
> + if (!brbe_record_is_source_only(bregs.brbinf))
> + entry->priv = brbinf_get_perf_priv(bregs.brbinf);
> +
> + nr_copied++;
> + }
> + }
> +
> +out_restore:
> + /* Restore BRBE to its previous state */
> + write_sysreg_s(brbcr, SYS_BRBCR_EL1);
> + isb();
> + write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
> + local_irq_restore(flags);
> + return nr_copied;
> +}
As above, this is duplicating a lot of logic we already have. If we
really need this function, we should be sharing much more logic with
brbe_read_filtered_entries(), if not using that directly.
Mark.
> diff --git a/drivers/perf/arm_brbe.h b/drivers/perf/arm_brbe.h
> index b7c7d8796c86..c2a1824437fb 100644
> --- a/drivers/perf/arm_brbe.h
> +++ b/drivers/perf/arm_brbe.h
> @@ -10,6 +10,7 @@
> struct arm_pmu;
> struct perf_branch_stack;
> struct perf_event;
> +struct perf_branch_entry;
>
> #ifdef CONFIG_ARM64_BRBE
> void brbe_probe(struct arm_pmu *arm_pmu);
> @@ -22,6 +23,8 @@ void brbe_disable(void);
> bool brbe_branch_attr_valid(struct perf_event *event);
> void brbe_read_filtered_entries(struct perf_branch_stack *branch_stack,
> const struct perf_event *event);
> +int arm_brbe_snapshot_branch_stack(struct perf_branch_entry *entries,
> + unsigned int cnt);
> #else
> static inline void brbe_probe(struct arm_pmu *arm_pmu) { }
> static inline unsigned int brbe_num_branch_records(const struct arm_pmu *armpmu)
> @@ -44,4 +47,10 @@ static void brbe_read_filtered_entries(struct perf_branch_stack *branch_stack,
> const struct perf_event *event)
> {
> }
> +
> +static inline int arm_brbe_snapshot_branch_stack(struct perf_branch_entry *entries,
> + unsigned int cnt)
> +{
> + return 0;
> +}
> #endif
> diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
> index 8014ff766cff..1a9f129a0f94 100644
> --- a/drivers/perf/arm_pmuv3.c
> +++ b/drivers/perf/arm_pmuv3.c
> @@ -1449,8 +1449,11 @@ static int armv8_pmu_init(struct arm_pmu *cpu_pmu, char *name,
> cpu_pmu->set_event_filter = armv8pmu_set_event_filter;
>
> cpu_pmu->pmu.event_idx = armv8pmu_user_event_idx;
> - if (brbe_num_branch_records(cpu_pmu))
> + if (brbe_num_branch_records(cpu_pmu)) {
> cpu_pmu->pmu.sched_task = armv8pmu_sched_task;
> + static_call_update(perf_snapshot_branch_stack,
> + arm_brbe_snapshot_branch_stack);
> + }
>
> cpu_pmu->name = name;
> cpu_pmu->map_event = map_event;
>
> base-commit: c286e7e9d1f1f3d90ad11c37e896f582b02d19c4
> --
> 2.47.3
>
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC PATCH] perf/arm64: Add BRBE support for bpf_get_branch_snapshot()
2026-01-05 17:49 ` Mark Rutland
@ 2026-01-05 19:31 ` Puranjay Mohan
0 siblings, 0 replies; 7+ messages in thread
From: Puranjay Mohan @ 2026-01-05 19:31 UTC (permalink / raw)
To: Mark Rutland
Cc: bpf, Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Will Deacon, Catalin Marinas, Rob Herring, Breno Leitao,
linux-arm-kernel, linux-perf-users, kernel-team
On Mon, Jan 5, 2026 at 5:49 PM Mark Rutland <mark.rutland@arm.com> wrote:
>
> Hi Puranjay,
>
> I have a number of techincal concerns with this, noted below. At a high
> level, I don't think this makes sense without a better description of
> the usage, and I suspect we'd need changes to bpf_get_branch_snapshot()
> to be able to implement something useful.
Hi Mark,
Thanks for your review.
>
> On Fri, Jan 02, 2026 at 01:40:41PM -0800, Puranjay Mohan wrote:
> > Enable the bpf_get_branch_snapshot() BPF helper on ARM64 by implementing
> > the perf_snapshot_branch_stack static call for ARM's Branch Record
> > Buffer Extension (BRBE).
> >
> > The BPF helper bpf_get_branch_snapshot() allows BPF programs to capture
> > hardware branch records on-demand. This was previously only available on
> > x86 (Intel LBR, AMD BRS) but not on ARM64 despite BRBE being available
> > since ARMv9.
>
> When exactly can bpf_get_branch_snapshot() be called?
>
> How is it expected to be used?
>
> The driver is written on the expectation that BRBE records are only read
> in the PMUv3 overflow handler.
bpf_get_branch_snapshot() can be called from any BPF tracing program
so from unknown context.
>
> > This implementation:
> >
> > - Follows the x86 snapshot pattern (intel_pmu_snapshot_branch_stack)
> > - Performs atomic snapshot by pausing BRBE, reading records, and
> > restoring previous state without disrupting ongoing perf events
>
> If BRBE is stopped, then we lost contiguity of branch records (e.g. for
> a series of branches A->B->C->D->E, we could record A->B->D->E). The
> driver is intended to ensure contiguity of those records, and if that's
> lost, we deliberately discard. That's why brbe_enable() calls
> brbe_invalidate().
>
> Restarting BRBE without discarding will produce bogus data for other
> consumers. If BRBE is stopped, we *must* discard.
>
> > - Reads branch records directly from BRBE registers without
> > event-specific filtering to minimize branch pollution
>
> If BRBE is paused, there's no polution, so this rationale doesn't make
> sense to me.
>
> BRBE can be configured with various filters (specific to the event),
> multiple events with distinct filters can be active simultaneously, and
> if no branch-sampling events are active, BRBE won't record anything. I
> do not think it makes sense to record whatever happens to be present,
> without filtering. Regardless, BRBE won't be recording if you don't have
> events. so this doesn't seem to make sense.
In the expected usage of this bpf helper, perf_events are expected to
be created with the correct filtering.
From 856c02dbce4f ("bpf: Introduce helper bpf_get_branch_snapshot")
Introduce bpf_get_branch_snapshot(), which allows tracing program to get
branch trace from hardware (e.g. Intel LBR). To use the feature, the
user need to create perf_event with proper branch_record filtering
on each cpu, and then calls bpf_get_branch_snapshot in the bpf function.
On Intel CPUs, VLBR event (raw event 0x1b00) can be use for this.
>
> > - Handles all BRBE record types (complete, source-only, target-only)
> > - Complies with ARM ARM synchronization requirements (ISB barriers per
> > rule PPBZP)
> > - Reuses existing BRBE infrastructure (select_brbe_bank,
> > __read_brbe_regset, brbe_set_perf_entry_type, etc.)
>
> This duplicates other code (e.g. the body of
> brbe_read_filtered_entries()), and I don't think this is justified.
Ack.
>
> >
> > Signed-off-by: Puranjay Mohan <puranjay@kernel.org>
> > ---
> >
> > This patch is only compile tested as I don't have access to hardware with BRBE.
> >
> > ---
> > drivers/perf/arm_brbe.c | 95 ++++++++++++++++++++++++++++++++++++++++
> > drivers/perf/arm_brbe.h | 9 ++++
> > drivers/perf/arm_pmuv3.c | 5 ++-
> > 3 files changed, 108 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
> > index ba554e0c846c..cda7bf522c06 100644
> > --- a/drivers/perf/arm_brbe.c
> > +++ b/drivers/perf/arm_brbe.c
> > @@ -803,3 +803,98 @@ void brbe_read_filtered_entries(struct perf_branch_stack *branch_stack,
> > done:
> > branch_stack->nr = nr_filtered;
> > }
> > +
> > +/*
> > + * ARM-specific callback invoked through perf_snapshot_branch_stack static
> > + * call, defined in include/linux/perf_event.h. See its definition for API
> > + * details. It's up to caller to provide enough space in *entries* to fit all
> > + * branch records, otherwise returned result will be truncated to *cnt* entries.
> > + *
> > + * This is similar to brbe_read_filtered_entries but optimized for snapshot mode:
> > + * - No filtering based on event attributes (captures everything)
> > + * - Minimal branches to avoid polluting the branch buffer
> > + * - Direct register reads without event-specific processing
> > + */
>
> As above I don't think these differences are necessary nor do I think we
> should be duplicating this.
Ack.
>
> > +int arm_brbe_snapshot_branch_stack(struct perf_branch_entry *entries, unsigned int cnt)
> > +{
> > + unsigned long flags;
> > + int nr_hw, nr_banks, nr_copied = 0;
> > + u64 brbidr, brbfcr, brbcr;
> > +
> > + /*
> > + * The sequence of steps to freeze BRBE should be completely inlined
> > + * and contain no branches to minimize contamination of branch snapshot.
> > + */
> > + local_irq_save(flags);
>
> The PMU overflow interrupt can be delivered as a pNMI, so this can race
> with that. Disabling interrupts is not sufficient.
In that case this implementation is completely broken, Do you have
suggestions about how this should be implemented?
The use case is to find out how the execution reached a bpf program by
looking at the Branch records. Do you think BRBE can be used to fulfil
this?
>
> If this is only called within a perf event overflow handler, then we
> already have the necessary serialization.
>
> > +
> > + /* Save current BRBE configuration */
> > + brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
> > + brbcr = read_sysreg_s(SYS_BRBCR_EL1);
> > +
> > + /* Pause BRBE to freeze the buffer */
> > + write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
> > + isb();
> > +
> > + /* Read BRBIDR to determine number of records */
> > + brbidr = read_sysreg_s(SYS_BRBIDR0_EL1);
> > + if (!valid_brbidr(brbidr))
> > + goto out_restore;
> > +
> > + nr_hw = FIELD_GET(BRBIDR0_EL1_NUMREC_MASK, brbidr);
> > + nr_banks = DIV_ROUND_UP(nr_hw, BRBE_BANK_MAX_ENTRIES);
> > +
> > + /* Read branch records from BRBE banks */
> > + for (int bank = 0; bank < nr_banks; bank++) {
> > + int nr_remaining = nr_hw - (bank * BRBE_BANK_MAX_ENTRIES);
> > + int nr_this_bank = min(nr_remaining, BRBE_BANK_MAX_ENTRIES);
> > +
> > + select_brbe_bank(bank);
> > +
> > + for (int i = 0; i < nr_this_bank; i++) {
> > + struct brbe_regset bregs;
> > + struct perf_branch_entry *entry;
> > +
> > + if (nr_copied >= cnt)
> > + goto out_restore;
> > +
> > + if (!__read_brbe_regset(&bregs, i))
> > + goto out_restore;
> > +
> > + entry = &entries[nr_copied];
> > + perf_clear_branch_entry_bitfields(entry);
> > +
> > + /* Simple conversion without filtering */
> > + if (brbe_record_is_complete(bregs.brbinf)) {
> > + entry->from = bregs.brbsrc;
> > + entry->to = bregs.brbtgt;
> > + } else if (brbe_record_is_source_only(bregs.brbinf)) {
> > + entry->from = bregs.brbsrc;
> > + entry->to = 0;
> > + } else if (brbe_record_is_target_only(bregs.brbinf)) {
> > + entry->from = 0;
> > + entry->to = bregs.brbtgt;
> > + }
> > +
> > + brbe_set_perf_entry_type(entry, bregs.brbinf);
> > + entry->cycles = brbinf_get_cycles(bregs.brbinf);
> > +
> > + if (!brbe_record_is_target_only(bregs.brbinf)) {
> > + entry->mispred = brbinf_get_mispredict(bregs.brbinf);
> > + entry->predicted = !entry->mispred;
> > + }
> > +
> > + if (!brbe_record_is_source_only(bregs.brbinf))
> > + entry->priv = brbinf_get_perf_priv(bregs.brbinf);
> > +
> > + nr_copied++;
> > + }
> > + }
> > +
> > +out_restore:
> > + /* Restore BRBE to its previous state */
> > + write_sysreg_s(brbcr, SYS_BRBCR_EL1);
> > + isb();
> > + write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
> > + local_irq_restore(flags);
> > + return nr_copied;
> > +}
>
> As above, this is duplicating a lot of logic we already have. If we
> really need this function, we should be sharing much more logic with
> brbe_read_filtered_entries(), if not using that directly.
>
> Mark.
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC PATCH] perf/arm64: Add BRBE support for bpf_get_branch_snapshot()
2026-01-02 21:51 ` Puranjay Mohan
@ 2026-01-05 21:22 ` Leo Yan
2026-01-05 21:46 ` Puranjay Mohan
0 siblings, 1 reply; 7+ messages in thread
From: Leo Yan @ 2026-01-05 21:22 UTC (permalink / raw)
To: Puranjay Mohan
Cc: Will Deacon, Mark Rutland, Rob Herring, Catalin Marinas,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Breno Leitao, linux-arm-kernel, linux-perf-users, kernel-team
Hi,
On Fri, Jan 02, 2026 at 09:51:15PM +0000, Puranjay Mohan wrote:
> Hi Catalin, Will, Mark, and Rob.
>
> I don't have access to BRBE enabled hardware and Qemu doesn't support
> this as well therefore I just sent the compile tested version.
>
> Can you help me with testing this:
>
> The relevant bpf self test is:
>
> ./test_progs -t get_branch_snapshot
> #132 get_branch_snapshot:SKIP
> Summary: 1/0 PASSED, 1 SKIPPED, 0 FAILED
>
> skipped for my setup as it is not supported.
eBPF selftest build is quite heavy (many dependencies), finally I built
it cleanly.
I did a test on FVP RevC, the test hit NULL pointer deference and kernel
panic in armv8pmu_sched_task(). See https://termbin.com/peg4, the
backtrace should can give you hints for the issue.
Thanks,
Leo
P.s. some logs are duplicate, this might be caused by two cores both hit
the NULL pointer issue ?
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC PATCH] perf/arm64: Add BRBE support for bpf_get_branch_snapshot()
2026-01-05 21:22 ` Leo Yan
@ 2026-01-05 21:46 ` Puranjay Mohan
2026-01-06 10:54 ` Leo Yan
0 siblings, 1 reply; 7+ messages in thread
From: Puranjay Mohan @ 2026-01-05 21:46 UTC (permalink / raw)
To: Leo Yan
Cc: Will Deacon, Mark Rutland, Rob Herring, Catalin Marinas,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Breno Leitao, linux-arm-kernel, linux-perf-users, kernel-team
On Mon, Jan 5, 2026 at 9:22 PM Leo Yan <leo.yan@arm.com> wrote:
>
> Hi,
>
> On Fri, Jan 02, 2026 at 09:51:15PM +0000, Puranjay Mohan wrote:
> > Hi Catalin, Will, Mark, and Rob.
> >
> > I don't have access to BRBE enabled hardware and Qemu doesn't support
> > this as well therefore I just sent the compile tested version.
> >
> > Can you help me with testing this:
> >
> > The relevant bpf self test is:
> >
> > ./test_progs -t get_branch_snapshot
> > #132 get_branch_snapshot:SKIP
> > Summary: 1/0 PASSED, 1 SKIPPED, 0 FAILED
> >
> > skipped for my setup as it is not supported.
>
> eBPF selftest build is quite heavy (many dependencies), finally I built
> it cleanly.
>
> I did a test on FVP RevC, the test hit NULL pointer deference and kernel
> panic in armv8pmu_sched_task(). See https://termbin.com/peg4, the
> backtrace should can give you hints for the issue.
>
> Thanks,
> Leo
>
> P.s. some logs are duplicate, this might be caused by two cores both hit
> the NULL pointer issue ?
Thanks for running the test, it will help me debug it better. But as
Mark pointer out this has fundamental problems which I will try to fix
first.
P.S. - The test on FVP RevC, does this need special hardware or this
is an emulation platform, in the later case, can I do the test myself
somehow?
Thanks,
Puranjay
^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [RFC PATCH] perf/arm64: Add BRBE support for bpf_get_branch_snapshot()
2026-01-05 21:46 ` Puranjay Mohan
@ 2026-01-06 10:54 ` Leo Yan
0 siblings, 0 replies; 7+ messages in thread
From: Leo Yan @ 2026-01-06 10:54 UTC (permalink / raw)
To: Puranjay Mohan
Cc: Will Deacon, Mark Rutland, Rob Herring, Catalin Marinas,
Alexei Starovoitov, Andrii Nakryiko, Daniel Borkmann,
Martin KaFai Lau, Eduard Zingerman, Kumar Kartikeya Dwivedi,
Breno Leitao, linux-arm-kernel, linux-perf-users, kernel-team
On Mon, Jan 05, 2026 at 09:46:01PM +0000, Puranjay Mohan wrote:
[...]
> Thanks for running the test, it will help me debug it better. But as
> Mark pointer out this has fundamental problems which I will try to fix
> first.
I saw the discussion. Maybe we can refer to other architecture for
how to resolve the issues.
> P.S. - The test on FVP RevC, does this need special hardware or this
> is an emulation platform, in the later case, can I do the test myself
> somehow?
FVP is an emulation platform, and the best chance is to download
Shrinkwrap tool [1] for access it. Below are some notes:
- For a FVP with BRBE enabled, I created a addition.yaml:
---8<---
layers:
- arch/v9.5.yaml
run:
params:
-C bp.hostbridge.userNetworking: 1
-C bp.hostbridge.userNetPorts: 8022=22
-C cluster0.has_brbe: 1
-C cluster0.has_brbe_v1p1: 1
-C cluster1.has_brbe: 1
-C cluster1.has_brbe_v1p1: 1
rtvars:
CMDLINE:
type: string
value: console=ttyAMA0 earlycon=pl011,0x1c090000 root=/dev/vda rw ip=dhcp kpti=off
--->8---
- Build firmware for FVP:
export PATH=/path/to/shrinkwrap:$PATH
shrinkwrap build --overlay=addition.yaml ns-edk2.yaml
- Build Linux kernel:
ARCH=arm64 CROSS_COMPILE=aarch64-linux-gnu- make Image arm/fvp-base-revc.dtb
- Run FVP:
shrinkwrap run -o addition.yaml \
--rtvar=KERNEL=$KERNEL_PATH/arch/arm64/boot/Image \
--rtvar=DTB=$KERNEL_PATH/arch/arm64/boot/dts/arm/fvp-base-revc.dtb \
--rtvar=ROOTFS=debian.img ns-edk2.yaml
You can download a root file system image from [2]. Just note, FVP is
slow so you might need to cross build kernel and eBPF selftest on your
host machine and then mount your host folder onto FVP's /mnt folder,
something like:
sshfs -o allow_other,default_permissions,reconnect \
username@host_ip:/path/to/build/folder /mnt
Thanks,
Leo
[1] https://shrinkwrap.docs.arm.com/en/latest/userguide/quickstart.html#
[2] https://releases.linaro.org/debian/images/developer-arm64/debian12/debian.img
^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2026-01-06 10:55 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-01-02 21:40 [RFC PATCH] perf/arm64: Add BRBE support for bpf_get_branch_snapshot() Puranjay Mohan
2026-01-02 21:51 ` Puranjay Mohan
2026-01-05 21:22 ` Leo Yan
2026-01-05 21:46 ` Puranjay Mohan
2026-01-06 10:54 ` Leo Yan
2026-01-05 17:49 ` Mark Rutland
2026-01-05 19:31 ` Puranjay Mohan
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox