* [V14 0/8] arm64/perf: Enable branch stack sampling
@ 2023-11-14 5:13 Anshuman Khandual
2023-11-14 5:13 ` [V14 1/8] arm64/sysreg: Add BRBE registers and fields Anshuman Khandual
` (8 more replies)
0 siblings, 9 replies; 30+ messages in thread
From: Anshuman Khandual @ 2023-11-14 5:13 UTC (permalink / raw)
To: linux-arm-kernel, linux-kernel, will, catalin.marinas,
mark.rutland
Cc: Anshuman Khandual, Mark Brown, James Clark, Rob Herring,
Marc Zyngier, Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-perf-users
This series enables perf branch stack sampling support on the arm64 platform
via a new arch feature called Branch Record Buffer Extension (BRBE). All the
relevant register definitions can be found here:
https://developer.arm.com/documentation/ddi0601/2021-12/AArch64-Registers
This series applies on top of 6.7-rc1, after the following series from James Clark:
https://lore.kernel.org/all/20231019165510.1966367-1-james.clark@arm.com/
The series is also hosted below for quick access, review and testing:
https://git.gitlab.arm.com/linux-arm/linux-anshuman.git (brbe_v14)
There are still some open questions regarding the handling of multiple perf
events with different privilege branch filters scheduled on the same PMU,
supporting guest branch stack tracing from the host, etc. Suggestions are also
welcome regarding supporting BRBE inside the guest. The series has been
completely reorganized as suggested earlier.
- Anshuman
========== Perf Branch Stack Sampling Support (arm64 platforms) ===========
Currently the arm64 platform does not support perf branch stack sampling, so
any event requesting branch stack records, i.e. with PERF_SAMPLE_BRANCH_STACK
set in event->attr.sample_type, gets rejected in armpmu_event_init().
static int armpmu_event_init(struct perf_event *event)
{
........
/* does not support taken branch sampling */
if (has_branch_stack(event))
return -EOPNOTSUPP;
........
}
$perf record -j any,u,k ls
Error:
cycles:P: PMU Hardware or event type doesn't support branch stack sampling.
-------------------- CONFIG_ARM64_BRBE and FEAT_BRBE ----------------------
After this series, the perf branch stack sampling feature gets enabled on
arm64 platforms where the FEAT_BRBE HW feature is supported and
CONFIG_ARM64_BRBE is also selected during the build. Let's walk through all
possible scenarios here.
1. Feature not built (!CONFIG_ARM64_BRBE):
Falls back to the current behaviour i.e. the event gets rejected.
2. Feature built but HW not supported (CONFIG_ARM64_BRBE && !FEAT_BRBE):
Falls back to the current behaviour i.e. the event gets rejected.
3. Feature built and HW supported (CONFIG_ARM64_BRBE && FEAT_BRBE):
The platform supports branch stack sampling requests. Let's observe it through
a simple example here.
$perf record -j any_call,u,k,save_type ls
[Please refer to the perf-record man page for all possible branch filter options]
$perf report
-------------------------- Snip ----------------------
# Overhead Command Source Shared Object Source Symbol Target Symbol Basic Block Cycles
# ........ ....... .................... ............................................ ............................................ ..................
#
3.52% ls [kernel.kallsyms] [k] sched_clock_noinstr [k] arch_counter_get_cntpct 16
3.52% ls [kernel.kallsyms] [k] sched_clock [k] sched_clock_noinstr 9
1.85% ls [kernel.kallsyms] [k] sched_clock_cpu [k] sched_clock 5
1.80% ls [kernel.kallsyms] [k] irqtime_account_irq [k] sched_clock_cpu 20
1.58% ls [kernel.kallsyms] [k] gic_handle_irq [k] generic_handle_domain_irq 19
1.58% ls [kernel.kallsyms] [k] call_on_irq_stack [k] gic_handle_irq 9
1.58% ls [kernel.kallsyms] [k] do_interrupt_handler [k] call_on_irq_stack 23
1.58% ls [kernel.kallsyms] [k] generic_handle_domain_irq [k] __irq_resolve_mapping 6
1.58% ls [kernel.kallsyms] [k] __irq_resolve_mapping [k] __rcu_read_lock 10
-------------------------- Snip ----------------------
$perf report -D | grep cycles
-------------------------- Snip ----------------------
..... 1: ffff800080dd3334 -> ffff800080dd759c 39 cycles P 0 IND_CALL
..... 2: ffff800080ffaea0 -> ffff800080ffb688 16 cycles P 0 IND_CALL
..... 3: ffff800080139918 -> ffff800080ffae64 9 cycles P 0 CALL
..... 4: ffff800080dd3324 -> ffff8000801398f8 7 cycles P 0 CALL
..... 5: ffff8000800f8548 -> ffff800080dd330c 21 cycles P 0 IND_CALL
..... 6: ffff8000800f864c -> ffff8000800f84ec 6 cycles P 0 CALL
..... 7: ffff8000800f86dc -> ffff8000800f8638 11 cycles P 0 CALL
..... 8: ffff8000800f86d4 -> ffff800081008630 16 cycles P 0 CALL
-------------------------- Snip ----------------------
perf script and other tooling can also be applied on the captured perf.data.
Similarly, branch stack sampling records can be collected directly via the
perf_event_open() system call after setting up 'struct perf_event_attr' as
required, as shown in the sketch below.
event->attr.sample_type |= PERF_SAMPLE_BRANCH_STACK
event->attr.branch_sample_type |= PERF_SAMPLE_BRANCH_<FILTER_1> |
PERF_SAMPLE_BRANCH_<FILTER_2> |
PERF_SAMPLE_BRANCH_<FILTER_3> |
...............................
But not all branch filters might be supported on the platform.
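For illustration, here is a minimal self-contained sketch of such a
perf_event_open() user. This is not part of the series; the event type,
sample period and branch filters are arbitrary choices, and parsing the
samples from the mmap() ring buffer is left out.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* glibc provides no wrapper for perf_event_open() */
static int perf_event_open(struct perf_event_attr *attr, pid_t pid,
			   int cpu, int group_fd, unsigned long flags)
{
	return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.sample_period = 100000;	/* arbitrary */
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
	attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY |
				  PERF_SAMPLE_BRANCH_USER;
	attr.exclude_kernel = 1;	/* match the _USER branch filter */

	/* Profile the calling task on any CPU */
	fd = perf_event_open(&attr, 0, -1, -1, 0);
	if (fd < 0) {
		perror("perf_event_open");
		return 1;
	}
	/* mmap() the ring buffer and parse PERF_RECORD_SAMPLE here */
	close(fd);
	return 0;
}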
----------------------- BRBE Branch Filters Support -----------------------
- The following branch filters are supported on arm64.
PERF_SAMPLE_BRANCH_USER /* Branch privilege filters */
PERF_SAMPLE_BRANCH_KERNEL
PERF_SAMPLE_BRANCH_HV
PERF_SAMPLE_BRANCH_ANY /* Branch type filters */
PERF_SAMPLE_BRANCH_ANY_CALL
PERF_SAMPLE_BRANCH_ANY_RETURN
PERF_SAMPLE_BRANCH_IND_CALL
PERF_SAMPLE_BRANCH_COND
PERF_SAMPLE_BRANCH_IND_JUMP
PERF_SAMPLE_BRANCH_CALL
PERF_SAMPLE_BRANCH_NO_FLAGS /* Branch record flags */
PERF_SAMPLE_BRANCH_NO_CYCLES
PERF_SAMPLE_BRANCH_TYPE_SAVE
PERF_SAMPLE_BRANCH_HW_INDEX
PERF_SAMPLE_BRANCH_PRIV_SAVE
- The following branch filters are not supported on arm64.
PERF_SAMPLE_BRANCH_ABORT_TX
PERF_SAMPLE_BRANCH_IN_TX
PERF_SAMPLE_BRANCH_NO_TX
PERF_SAMPLE_BRANCH_CALL_STACK
Events requesting the above unsupported branch filters get rejected.
------------------ Possible 'branch_sample_type' Mismatch -----------------
The branch stack sampling attribute 'event->attr.branch_sample_type' generally
remains the same for all the events during a perf record session.
$perf record -e <event_1> -e <event_2> -j <branch_filters> [workload]
event_1->attr.branch_sample_type == event_2->attr.branch_sample_type
This 'branch_sample_type' is used to configure the BRBE hardware when both
events i.e. <event_1> and <event_2> get scheduled on a given PMU. But during a
PMU HW event's privilege filter inheritance, 'branch_sample_type' does not
remain the same for all events. Let's consider the following example
$perf record -e cycles:u -e instructions:k -j any,save_type ls
cycles->attr.branch_sample_type != instructions->attr.branch_sample_type
This is because the cycles event inherits PERF_SAMPLE_BRANCH_USER while the
instructions event inherits PERF_SAMPLE_BRANCH_KERNEL. The proposed solution
here configures the BRBE hardware with the 'branch_sample_type' from the last
event added on the PMU, and hence the captured branch records only get passed
on to matching events during a PMU interrupt.
static int
armpmu_add(struct perf_event *event, int flags)
{
........
if (has_branch_stack(event)) {
/*
* Reset branch records buffer if a new task event gets
* scheduled on a PMU which might have existing records.
* Otherwise older branch records present in the buffer
* might leak into the new task event.
*/
if (event->ctx->task && hw_events->brbe_context != event->ctx) {
hw_events->brbe_context = event->ctx;
if (armpmu->branch_reset)
armpmu->branch_reset();
}
hw_events->brbe_users++;
Here -------> hw_events->brbe_sample_type = event->attr.branch_sample_type;
}
........
}
Instead of overriding the existing 'branch_sample_type', both could be merged,
as sketched below.
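A minimal sketch of that alternative, hypothetical and not what this series
implements, would accumulate the filter requests instead of overriding them:

static int
armpmu_add(struct perf_event *event, int flags)
{
	........
	if (has_branch_stack(event)) {
		........
		hw_events->brbe_users++;
		/*
		 * Hypothetical merge: OR the new request into the
		 * existing configuration so that one common BRBE
		 * setup can serve every active event.
		 */
		hw_events->brbe_sample_type |= event->attr.branch_sample_type;
	}
	........
}

Whether such a merged configuration is acceptable is the same trade-off raised
in the Questions section below.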
--------------------------- Virtualisation support ------------------------
- Branch stack sampling is not currently supported inside the guest (TODO)
- FEAT_BRBE is advertised as absent by clearing ID_AA64DFR0_EL1.BRBE
- Future support in the guest requires emulating FEAT_BRBE
- Branch stack sampling of the guest is not supported in the host (TODO)
- Tracing the guest with event->attr.exclude_guest = 0
- There are multiple challenges involved in mixing events
with mismatched branch_sample_type and exclude_guest, and in passing
on captured BRBE records to the intended events during a PMU interrupt
- Guest access for BRBE registers and instructions has been blocked
- BRBE state save is not required for VHE host (EL2) guest (EL1) transition
- BRBE state is saved for NVHE host (EL1) guest (EL1) transition
-------------------------------- Testing ---------------------------------
- Cross compiled for both arm64 and arm32 platforms
- Passes all branch tests with 'perf test branch' on arm64
-------------------------------- Questions -------------------------------
- Instead of configuring the BRBE HW with the branch_sample_type from the last
event added on the PMU as proposed, could those be merged together,
e.g. all privilege requests ORed, to form a common BRBE configuration, with
all events getting branch records after a PMU interrupt?
Changes in V14:
- This series has been reorganised as suggested during V13
- There are just eight patches now i.e. 5 enablement and 3 perf branch test patches
- Fixed brackets problem in __SYS_BRBINFO/BRBSRC/BRBTGT() macros
- Renamed the macro i.e s/__SYS_BRBINFO/__SYS_BRBINF/
- Renamed s/BRB_IALL/BRB_IALL_INSN and s/BRBE_INJ/BRB_INJ_INSN
- Moved BRB_IALL_INSN and SYS_BRB_INSN instructions to sysreg patch
- Changed E1BRE as ExBRE in sysreg fields inside BRBCR_ELx
- Used BRBCR_ELx for defining all BRBCR_EL1, BRBCR_EL2, and BRBCR_EL12 (new)
- Folded the following three patches into a single patch i.e [PATCH 3/8]
drivers: perf: arm_pmu: Add new sched_task() callback
arm64/perf: Add branch stack support in struct arm_pmu
arm64/perf: Add branch stack support in struct pmu_hw_events
arm64/perf: Add branch stack support in ARMV8 PMU
arm64/perf: Add PERF_ATTACH_TASK_DATA to events with has_branch_stack()
- All armv8pmu_branch_xxxx() stub definitions have been moved inside
include/linux/perf/arm_pmuv3.h for easy access from both arm32 and arm64
- Added brbe_users, brbe_context and brbe_sample_type in struct pmu_hw_events
- Added comments for all the above new elements in struct pmu_hw_events
- Added branch_reset() and sched_task() callbacks
- Changed and optimized branch records processing during a PMU IRQ
- NO branch records get captured for event with mismatched brbe_sample_type
- Branch record context is tracked from armpmu_del() & armpmu_add()
- Branch record hardware is driven from armv8pmu_start() & armv8pmu_stop()
- Dropped NULL check for 'pmu_ctx' inside armv8pmu_sched_task()
- Moved down PERF_ATTACH_TASK_DATA assignment with a preceding comment
- In conflicting branch sample type requests, first event takes precedence
- Folded the following five patches from V13 into a single patch i.e
[PATCH 4/8]
arm64/perf: Enable branch stack events via FEAT_BRBE
arm64/perf: Add struct brbe_regset helper functions
arm64/perf: Implement branch records save on task sched out
arm64/perf: Implement branch records save on PMU IRQ
- Fixed the year in copyright statement
- Added Documentation/arch/arm64/brbe.rst
- Updated Documentation/arch/arm64/booting.rst (BRBCR_EL2.CC for EL1 entry)
- Added __init_el2_brbe() which enables branch record cycle count support
- Disabled EL2 traps in __init_el2_fgt() while accessing BRBE registers and
executing instructions
- Changed CONFIG_ARM64_BRBE user visible description
- Fixed a typo in CONFIG_ARM64_BRBE config option description text
- Added BUILD_BUG_ON() correlating BRBE_BANK_MAX_ENTRIES and MAX_BRANCH_RECORDS
- Dropped arm64_create_brbe_task_ctx_kmem_cache()
- Moved down comment for PERF_SAMPLE_BRANCH_KERNEL in branch_type_to_brbcr()
- Renamed BRBCR_ELx_DEFAULT_CONFIG as BRBCR_ELx_CONFIG_MASK
- Replaced BRBCR_ELx_DEFAULT_TS with BRBCR_ELx_TS_MASK in BRBCR_ELx_CONFIG_MASK
- Replaced BRBCR_ELx_E1BRE instances with BRBCR_ELx_ExBRE
- Added BRBE specific branch stack sampling perf test patches into the series
- Added a patch to prevent guest accesses into BRBE registers and instructions
- Added a patch to save the BRBE host context in NVHE environment
- Updated most commit messages
Changes in V13:
https://lore.kernel.org/all/20230711082455.215983-1-anshuman.khandual@arm.com/
https://lore.kernel.org/all/20230622065351.1092893-1-anshuman.khandual@arm.com/
- Added branch callback stubs for aarch32 pmuv3 based platforms
- Updated the comments for capture_brbe_regset()
- Deleted the comments in __read_brbe_regset()
- Reversed the arguments order in capture_brbe_regset() and brbe_branch_save()
- Fixed BRBE_BANK[0|1]_IDX_MAX indices comparison in armv8pmu_branch_read()
- Fixed BRBE_BANK[0|1]_IDX_MAX indices comparison in capture_brbe_regset()
Changes in V12:
https://lore.kernel.org/all/20230615133239.442736-1-anshuman.khandual@arm.com/
- Replaced branch types with complete DIRECT/INDIRECT prefixes/suffixes
- Replaced branch types with complete INSN/ALIGN prefixes/suffixes
- Replaced return branch types as simple RET/ERET
- Replaced time field GST_PHYSICAL as GUEST_PHYSICAL
- Added 0 padding for BRBIDR0_EL1.NUMREC enum values
- Dropped helper arm_pmu_branch_stack_supported()
- Renamed armv8pmu_branch_valid() as armv8pmu_branch_attr_valid()
- Separated perf_task_ctx_cache setup from arm_pmu private allocation
- Collected changes to branch_records_alloc() in a single patch [5/10]
- Reworked and cleaned up branch_records_alloc()
- Reworked armv8pmu_branch_read() with new loop iterations in patch [6/10]
- Reworked capture_brbe_regset() with new loop iterations in patch [8/10]
- Updated the comment in branch_type_to_brbcr()
- Fixed the comment before stitch_stored_live_entries()
- Fixed BRBINFINJ_EL1 definition for VALID_FULL enum field
- Factored out helper __read_brbe_regset() from capture_brbe_regset()
- Dropped the helper copy_brbe_regset()
- Simplified stitch_stored_live_entries() with memcpy(), memmove()
- Reworked armv8pmu_probe_pmu() to bail out early with !probe.present
- Reworked brbe_attributes_probe() without 'struct brbe_hw_attr'
- Dropped 'struct brbe_hw_attr' argument from capture_brbe_regset()
- Dropped 'struct brbe_hw_attr' argument from brbe_branch_save()
- Dropped arm_pmu->private and added arm_pmu->reg_trbidr instead
Changes in V11:
https://lore.kernel.org/all/20230531040428.501523-1-anshuman.khandual@arm.com/
- Fixed the crash for per-cpu events without event->pmu_ctx->task_ctx_data
Changes in V10:
https://lore.kernel.org/all/20230517022410.722287-1-anshuman.khandual@arm.com/
- Rebased the series on v6.4-rc2
- Moved ARMV8 PMUV3 changes inside drivers/perf/arm_pmuv3.c
- Moved BRBE driver changes inside drivers/perf/arm_brbe.[c|h]
- Moved the WARN_ON() inside the if condition in armv8pmu_handle_irq()
Changes in V9:
https://lore.kernel.org/all/20230315051444.1683170-1-anshuman.khandual@arm.com/
- Fixed build problem with has_branch_stack() in arm64 header
- BRBINF_EL1 definition has been changed from 'Sysreg' to 'SysregFields'
- Renamed all BRBINF_EL1 call sites as BRBINFx_EL1
- Dropped static const char branch_filter_error_msg[]
- Implemented a positive list check for BRBE supported perf branch filters
- Added a comment in armv8pmu_handle_irq()
- Implemented per-cpu allocation for struct branch_record records
- Skipped looping through bank 1 if an invalid record is detected in bank 0
- Added comment in armv8pmu_branch_read() explaining prohibited region etc
- Added comment warning about erroneously marking transactions as aborted
- Replaced the first argument (perf_branch_entry) in capture_brbe_flags()
- Dropped the last argument (idx) in capture_brbe_flags()
- Dropped the brbcr argument from capture_brbe_flags()
- Used perf_sample_save_brstack() to capture branch records for perf_sample_data
- Added comment explaining rationale for setting BRBCR_EL1_FZP for user only traces
- Dropped BRBE prohibited state mechanism while in armv8pmu_branch_read()
- Implemented event task context based branch records save mechanism
Changes in V8:
https://lore.kernel.org/all/20230123125956.1350336-1-anshuman.khandual@arm.com/
- Replaced arm_pmu->features as arm_pmu->has_branch_stack, updated its helper
- Added a comment and line break before arm_pmu->private element
- Added WARN_ON_ONCE() in helpers i.e armv8pmu_branch_[read|valid|enable|disable]()
- Dropped comments in armv8pmu_enable_event() and armv8pmu_disable_event()
- Replaced open bank encoding in BRBFCR_EL1 with SYS_FIELD_PREP()
- Changed brbe_hw_attr->brbe_version from 'bool' to 'int'
- Updated pr_warn() as pr_warn_once() with values in brbe_get_perf_[type|priv]()
- Replaced all pr_warn_once() as pr_debug_once() in armv8pmu_branch_valid()
- Added a comment in branch_type_to_brbcr() for the BRBCR_EL1 privilege settings
- Modified the comment related to BRBINFx_EL1.LASTFAILED in capture_brbe_flags()
- Modified brbe_get_perf_entry_type() as brbe_set_perf_entry_type()
- Renamed brbe_valid() as brbe_record_is_complete()
- Renamed brbe_source() as brbe_record_is_source_only()
- Renamed brbe_target() as brbe_record_is_target_only()
- Inverted checks for !brbe_record_is_[target|source]_only() for info capture
- Replaced 'fetch' with 'get' in all helpers that extract field value
- Dropped 'static int brbe_current_bank' optimization in select_brbe_bank()
- Dropped select_brbe_bank_index() completely, added capture_branch_entry()
- Process captured branch entries in two separate loops one for each BRBE bank
- Moved branch_records_alloc() inside armv8pmu_probe_pmu()
- Added a forward declaration for the helper has_branch_stack()
- Added new callbacks armv8pmu_private_alloc() and armv8pmu_private_free()
- Updated armv8pmu_probe_pmu() to allocate the private structure before SMP call
Changes in V7:
https://lore.kernel.org/all/20230105031039.207972-1-anshuman.khandual@arm.com/
- Folded [PATCH 7/7] into [PATCH 3/7] which enables branch stack sampling event
- Defined BRBFCR_EL1_BRANCH_FILTERS, BRBCR_EL1_DEFAULT_CONFIG in the header
- Defined BRBFCR_EL1_DEFAULT_CONFIG in the header
- Updated BRBCR_EL1_DEFAULT_CONFIG with BRBCR_EL1_FZP
- Defined BRBCR_EL1_DEFAULT_TS in the header
- Updated BRBCR_EL1_DEFAULT_CONFIG with BRBCR_EL1_DEFAULT_TS
- Moved BRBCR_EL1_DEFAULT_CONFIG check inside branch_type_to_brbcr()
- Moved down BRBCR_EL1_CC, BRBCR_EL1_MPRED later in branch_type_to_brbcr()
- Also set BRBE in paused state in armv8pmu_branch_disable()
- Dropped brbe_paused(), set_brbe_paused() helpers
- Extracted error string via branch_filter_error_msg[] for armv8pmu_branch_valid()
- Replaced brbe_v1p1 with brbe_version in struct brbe_hw_attr
- Added valid_brbe_[cc, format, version]() helpers
- Split a separate brbe_attributes_probe() from armv8pmu_branch_probe()
- Capture event->attr.branch_sample_type earlier in armv8pmu_branch_valid()
- Defined enum brbe_bank_idx with possible values for BRBE bank indices
- Changed armpmu->hw_attr into armpmu->private
- Added missing space in stub definition for armv8pmu_branch_valid()
- Replaced both kmalloc() with kzalloc()
- Added BRBE_BANK_MAX_ENTRIES
- Updated comment for capture_brbe_flags()
- Updated comment for struct brbe_hw_attr
- Dropped space after type cast in couple of places
- Replaced inverse with negation for testing BRBCR_EL1_FZP in armv8pmu_branch_read()
- Captured cpuc->branches->branch_entries[idx] in a local variable
- Dropped saved_priv from armv8pmu_branch_read()
- Reorganized PERF_SAMPLE_BRANCH_[NO_CYCLES|NO_FLAGS] related configuration
- Replaced with FIELD_GET() and FIELD_PREP() wherever applicable
- Replaced BRBCR_EL1_TS_PHYSICAL with BRBCR_EL1_TS_VIRTUAL
- Moved valid_brbe_nr(), valid_brbe_cc(), valid_brbe_format(), valid_brbe_version()
select_brbe_bank(), select_brbe_bank_index() helpers inside the C implementation
- Reorganized brbe_valid_nr() and dropped the pr_warn() message
- Changed probe sequence in brbe_attributes_probe()
- Added 'brbcr' argument into capture_brbe_flags() to ascertain correct state
- Disable BRBE before disabling the PMU event counter
- Enable PERF_SAMPLE_BRANCH_HV filters when is_kernel_in_hyp_mode()
- Guard armv8pmu_reset() & armv8pmu_sched_task() with arm_pmu_branch_stack_supported()
Changes in V6:
https://lore.kernel.org/linux-arm-kernel/20221208084402.863310-1-anshuman.khandual@arm.com/
- Restore the exception level privilege after reading the branch records
- Unpause the buffer after reading the branch records
- Decouple BRBCR_EL1_EXCEPTION/ERTN from perf event privilege level
- Reworked BRBE implementation and branch stack sampling support on arm pmu
- BRBE implementation is now part of overall ARMV8 PMU implementation
- BRBE implementation moved from drivers/perf/ to inside arch/arm64/kernel/
- CONFIG_ARM_BRBE_PMU renamed as CONFIG_ARM64_BRBE in arch/arm64/Kconfig
- File moved - drivers/perf/arm_pmu_brbe.c -> arch/arm64/kernel/brbe.c
- File moved - drivers/perf/arm_pmu_brbe.h -> arch/arm64/kernel/brbe.h
- BRBE name has been dropped from struct arm_pmu and struct hw_pmu_events
- BRBE name has been abstracted out as 'branches' in arm_pmu and hw_pmu_events
- BRBE name has been abstracted out as 'branches' in ARMV8 PMU implementation
- Added sched_task() callback into struct arm_pmu
- Added 'hw_attr' into struct arm_pmu encapsulating possible PMU HW attributes
- Dropped explicit attributes brbe_(v1p1, nr, cc, format) from struct arm_pmu
- Dropped brbfcr, brbcr, registers scratch area from struct hw_pmu_events
- Dropped brbe_users, brbe_context tracking in struct hw_pmu_events
- Added 'features' tracking into struct arm_pmu with ARM_PMU_BRANCH_STACK flag
- armpmu->hw_attr maps into 'struct brbe_hw_attr' inside BRBE implementation
- Set ARM_PMU_BRANCH_STACK in 'arm_pmu->features' after successful BRBE probe
- Added armv8pmu_branch_reset() inside armv8pmu_branch_enable()
- Dropped brbe_supported() as events will be rejected via ARM_PMU_BRANCH_STACK
- Dropped set_brbe_disabled() as well
- Reformatted armv8pmu_branch_valid() warnings while rejecting unsupported events
Changes in V5:
https://lore.kernel.org/linux-arm-kernel/20221107062514.2851047-1-anshuman.khandual@arm.com/
- Changed BRBCR_EL1.VIRTUAL from 0b1 to 0b01
- Changed BRBFCR_EL1.EnL into BRBFCR_EL1.EnI
- Changed config ARM_BRBE_PMU from 'tristate' to 'bool'
Changes in V4:
https://lore.kernel.org/all/20221017055713.451092-1-anshuman.khandual@arm.com/
- Changed ../tools/sysreg declarations as suggested
- Set PERF_SAMPLE_BRANCH_STACK in data.sample_flags
- Dropped perfmon_capable() check in armpmu_event_init()
- s/pr_warn_once/pr_info in armpmu_event_init()
- Added brbe_format element into struct pmu_hw_events
- Changed v1p1 as brbe_v1p1 in struct pmu_hw_events
- Dropped pr_info() from arm64_pmu_brbe_probe(), solved LOCKDEP warning
Changes in V3:
https://lore.kernel.org/all/20220929075857.158358-1-anshuman.khandual@arm.com/
- Moved brbe_stack off the stack; it is now dynamically allocated
- Return PERF_BR_PRIV_UNKNOWN instead of -1 in brbe_fetch_perf_priv()
- Moved BRBIDR0, BRBCR, BRBFCR registers and fields into tools/sysreg
- Created dummy BRBINF_EL1 field definitions in tools/sysreg
- Dropped ARMPMU_EVT_PRIV framework which cached perfmon_capable()
- Both exception and exception return branch records are now captured
only if the event has PERF_SAMPLE_BRANCH_KERNEL, which would have already
been checked in generic perf via perf_allow_kernel()
Changes in V2:
https://lore.kernel.org/all/20220908051046.465307-1-anshuman.khandual@arm.com/
- Dropped branch sample filter helpers consolidation patch from this series
- Added new hw_perf_event.flags element ARMPMU_EVT_PRIV to cache perfmon_capable()
- Use cached perfmon_capable() while configuring BRBE branch record filters
Changes in V1:
https://lore.kernel.org/linux-arm-kernel/20220613100119.684673-1-anshuman.khandual@arm.com/
- Added CONFIG_PERF_EVENTS wrapper for all branch sample filter helpers
- Process new perf branch types via PERF_BR_EXTEND_ABI
Changes in RFC V2:
https://lore.kernel.org/linux-arm-kernel/20220412115455.293119-1-anshuman.khandual@arm.com/
- Added branch_sample_priv() while consolidating other branch sample filter helpers
- Changed all SYS_BRBXXXN_EL1 register definition encodings per Marc
- Changed the BRBE driver as per proposed BRBE related perf ABI changes (V5)
- Added documentation for struct arm_pmu changes, updated commit message
- Updated commit message for BRBE detection infrastructure patch
- PERF_SAMPLE_BRANCH_KERNEL gets checked during arm event init (outside the driver)
- Branch privilege state capture mechanism has now moved inside the driver
Changes in RFC V1:
https://lore.kernel.org/all/1642998653-21377-1-git-send-email-anshuman.khandual@arm.com/
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: James Clark <james.clark@arm.com>
Cc: Rob Herring <robh@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Suzuki Poulose <suzuki.poulose@arm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Anshuman Khandual (5):
arm64/sysreg: Add BRBE registers and fields
KVM: arm64: Prevent guest accesses into BRBE system registers/instructions
drivers: perf: arm_pmuv3: Enable branch stack sampling framework
drivers: perf: arm_pmuv3: Enable branch stack sampling via FEAT_BRBE
KVM: arm64: nvhe: Disable branch generation in nVHE guests
James Clark (3):
perf: test: Speed up running brstack test on an Arm model
perf: test: Remove empty lines from branch filter test output
perf: test: Extend branch stack sampling test for Arm64 BRBE
Documentation/arch/arm64/booting.rst | 6 +
Documentation/arch/arm64/brbe.rst | 152 +++++
arch/arm64/include/asm/el2_setup.h | 113 +++-
arch/arm64/include/asm/kvm_host.h | 4 +
arch/arm64/include/asm/sysreg.h | 109 ++++
arch/arm64/kvm/debug.c | 6 +
arch/arm64/kvm/hyp/nvhe/debug-sr.c | 38 ++
arch/arm64/kvm/sys_regs.c | 130 +++++
arch/arm64/tools/sysreg | 170 ++++++
drivers/perf/Kconfig | 11 +
drivers/perf/Makefile | 1 +
drivers/perf/arm_brbe.c | 735 +++++++++++++++++++++++++
drivers/perf/arm_brbe.h | 262 +++++++++
drivers/perf/arm_pmu.c | 41 +-
drivers/perf/arm_pmuv3.c | 141 ++++-
include/linux/perf/arm_pmu.h | 34 +-
include/linux/perf/arm_pmuv3.h | 59 ++
tools/perf/tests/builtin-test.c | 1 +
tools/perf/tests/shell/test_brstack.sh | 57 +-
tools/perf/tests/tests.h | 1 +
tools/perf/tests/workloads/Build | 2 +
tools/perf/tests/workloads/traploop.c | 39 ++
22 files changed, 2099 insertions(+), 13 deletions(-)
create mode 100644 Documentation/arch/arm64/brbe.rst
create mode 100644 drivers/perf/arm_brbe.c
create mode 100644 drivers/perf/arm_brbe.h
create mode 100644 tools/perf/tests/workloads/traploop.c
--
2.25.1
* [V14 1/8] arm64/sysreg: Add BRBE registers and fields
2023-11-14 5:13 [V14 0/8] arm64/perf: Enable branch stack sampling Anshuman Khandual
@ 2023-11-14 5:13 ` Anshuman Khandual
2023-11-14 5:13 ` [V14 2/8] KVM: arm64: Prevent guest accesses into BRBE system registers/instructions Anshuman Khandual
` (7 subsequent siblings)
8 siblings, 0 replies; 30+ messages in thread
From: Anshuman Khandual @ 2023-11-14 5:13 UTC (permalink / raw)
To: linux-arm-kernel, linux-kernel, will, catalin.marinas,
mark.rutland
Cc: Anshuman Khandual, Mark Brown, James Clark, Rob Herring,
Marc Zyngier, Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-perf-users
This adds BRBE related register definitions and various other related field
macros therein. These will be used subsequently in the BRBE driver, which is
added later in the series.
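As a quick orientation, the 32 branch records interleave BRBINF<N>_EL1,
BRBSRC<N>_EL1 and BRBTGT<N>_EL1 within the op0=2, op1=1, CRn=8 encoding space.
A sketch of that mapping, with purely illustrative helper names that are not
part of the patch, mirroring the __SYS_BRBINF/SRC/TGT() macros below:

/* Illustrative only - record index n = 0..31 */
static inline unsigned int brbe_sysreg_crm(unsigned int n)
{
	return n & 0xf;			/* low four bits of the index */
}

static inline unsigned int brbe_sysreg_op2(unsigned int n, unsigned int slot)
{
	/* slot: 0 = BRBINF, 1 = BRBSRC, 2 = BRBTGT */
	return ((n & 0x10) >> 2) + slot;	/* records 16-31 add 4 */
}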
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
Changes in V14:
- Fixed brackets problem in __SYS_BRBINFO/BRBSRC/BRBTGT() macros
- Renamed the macro i.e s/__SYS_BRBINFO/__SYS_BRBINF/
- Renamed s/BRB_IALL/BRB_IALL_INSN and s/BRBE_INJ/BRB_INJ_INSN
- Moved BRB_IALL_INSN and SYS_BRB_INSN instructions here
- Changed E1BRE as ExBRE in sysreg fields inside BRBCR_ELx
- Used BRBCR_ELx for defining all BRBCR_EL1, BRBCR_EL2, and BRBCR_EL12
- Dropped existing tags from James and Mark
arch/arm64/include/asm/sysreg.h | 109 ++++++++++++++++++++
arch/arm64/tools/sysreg | 170 ++++++++++++++++++++++++++++++++
2 files changed, 279 insertions(+)
diff --git a/arch/arm64/include/asm/sysreg.h b/arch/arm64/include/asm/sysreg.h
index 92dfb41af018..8d0913df5176 100644
--- a/arch/arm64/include/asm/sysreg.h
+++ b/arch/arm64/include/asm/sysreg.h
@@ -272,6 +272,109 @@
#define SYS_BRBCR_EL2 sys_reg(2, 4, 9, 0, 0)
+#define __SYS_BRBINF(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10) >> 2) + 0))
+#define __SYS_BRBSRC(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10) >> 2) + 1))
+#define __SYS_BRBTGT(n) sys_reg(2, 1, 8, ((n) & 0xf), ((((n) & 0x10) >> 2) + 2))
+
+#define SYS_BRBINF0_EL1 __SYS_BRBINF(0)
+#define SYS_BRBINF1_EL1 __SYS_BRBINF(1)
+#define SYS_BRBINF2_EL1 __SYS_BRBINF(2)
+#define SYS_BRBINF3_EL1 __SYS_BRBINF(3)
+#define SYS_BRBINF4_EL1 __SYS_BRBINF(4)
+#define SYS_BRBINF5_EL1 __SYS_BRBINF(5)
+#define SYS_BRBINF6_EL1 __SYS_BRBINF(6)
+#define SYS_BRBINF7_EL1 __SYS_BRBINF(7)
+#define SYS_BRBINF8_EL1 __SYS_BRBINF(8)
+#define SYS_BRBINF9_EL1 __SYS_BRBINF(9)
+#define SYS_BRBINF10_EL1 __SYS_BRBINF(10)
+#define SYS_BRBINF11_EL1 __SYS_BRBINF(11)
+#define SYS_BRBINF12_EL1 __SYS_BRBINF(12)
+#define SYS_BRBINF13_EL1 __SYS_BRBINF(13)
+#define SYS_BRBINF14_EL1 __SYS_BRBINF(14)
+#define SYS_BRBINF15_EL1 __SYS_BRBINF(15)
+#define SYS_BRBINF16_EL1 __SYS_BRBINF(16)
+#define SYS_BRBINF17_EL1 __SYS_BRBINF(17)
+#define SYS_BRBINF18_EL1 __SYS_BRBINF(18)
+#define SYS_BRBINF19_EL1 __SYS_BRBINF(19)
+#define SYS_BRBINF20_EL1 __SYS_BRBINF(20)
+#define SYS_BRBINF21_EL1 __SYS_BRBINF(21)
+#define SYS_BRBINF22_EL1 __SYS_BRBINF(22)
+#define SYS_BRBINF23_EL1 __SYS_BRBINF(23)
+#define SYS_BRBINF24_EL1 __SYS_BRBINF(24)
+#define SYS_BRBINF25_EL1 __SYS_BRBINF(25)
+#define SYS_BRBINF26_EL1 __SYS_BRBINF(26)
+#define SYS_BRBINF27_EL1 __SYS_BRBINF(27)
+#define SYS_BRBINF28_EL1 __SYS_BRBINF(28)
+#define SYS_BRBINF29_EL1 __SYS_BRBINF(29)
+#define SYS_BRBINF30_EL1 __SYS_BRBINF(30)
+#define SYS_BRBINF31_EL1 __SYS_BRBINF(31)
+
+#define SYS_BRBSRC0_EL1 __SYS_BRBSRC(0)
+#define SYS_BRBSRC1_EL1 __SYS_BRBSRC(1)
+#define SYS_BRBSRC2_EL1 __SYS_BRBSRC(2)
+#define SYS_BRBSRC3_EL1 __SYS_BRBSRC(3)
+#define SYS_BRBSRC4_EL1 __SYS_BRBSRC(4)
+#define SYS_BRBSRC5_EL1 __SYS_BRBSRC(5)
+#define SYS_BRBSRC6_EL1 __SYS_BRBSRC(6)
+#define SYS_BRBSRC7_EL1 __SYS_BRBSRC(7)
+#define SYS_BRBSRC8_EL1 __SYS_BRBSRC(8)
+#define SYS_BRBSRC9_EL1 __SYS_BRBSRC(9)
+#define SYS_BRBSRC10_EL1 __SYS_BRBSRC(10)
+#define SYS_BRBSRC11_EL1 __SYS_BRBSRC(11)
+#define SYS_BRBSRC12_EL1 __SYS_BRBSRC(12)
+#define SYS_BRBSRC13_EL1 __SYS_BRBSRC(13)
+#define SYS_BRBSRC14_EL1 __SYS_BRBSRC(14)
+#define SYS_BRBSRC15_EL1 __SYS_BRBSRC(15)
+#define SYS_BRBSRC16_EL1 __SYS_BRBSRC(16)
+#define SYS_BRBSRC17_EL1 __SYS_BRBSRC(17)
+#define SYS_BRBSRC18_EL1 __SYS_BRBSRC(18)
+#define SYS_BRBSRC19_EL1 __SYS_BRBSRC(19)
+#define SYS_BRBSRC20_EL1 __SYS_BRBSRC(20)
+#define SYS_BRBSRC21_EL1 __SYS_BRBSRC(21)
+#define SYS_BRBSRC22_EL1 __SYS_BRBSRC(22)
+#define SYS_BRBSRC23_EL1 __SYS_BRBSRC(23)
+#define SYS_BRBSRC24_EL1 __SYS_BRBSRC(24)
+#define SYS_BRBSRC25_EL1 __SYS_BRBSRC(25)
+#define SYS_BRBSRC26_EL1 __SYS_BRBSRC(26)
+#define SYS_BRBSRC27_EL1 __SYS_BRBSRC(27)
+#define SYS_BRBSRC28_EL1 __SYS_BRBSRC(28)
+#define SYS_BRBSRC29_EL1 __SYS_BRBSRC(29)
+#define SYS_BRBSRC30_EL1 __SYS_BRBSRC(30)
+#define SYS_BRBSRC31_EL1 __SYS_BRBSRC(31)
+
+#define SYS_BRBTGT0_EL1 __SYS_BRBTGT(0)
+#define SYS_BRBTGT1_EL1 __SYS_BRBTGT(1)
+#define SYS_BRBTGT2_EL1 __SYS_BRBTGT(2)
+#define SYS_BRBTGT3_EL1 __SYS_BRBTGT(3)
+#define SYS_BRBTGT4_EL1 __SYS_BRBTGT(4)
+#define SYS_BRBTGT5_EL1 __SYS_BRBTGT(5)
+#define SYS_BRBTGT6_EL1 __SYS_BRBTGT(6)
+#define SYS_BRBTGT7_EL1 __SYS_BRBTGT(7)
+#define SYS_BRBTGT8_EL1 __SYS_BRBTGT(8)
+#define SYS_BRBTGT9_EL1 __SYS_BRBTGT(9)
+#define SYS_BRBTGT10_EL1 __SYS_BRBTGT(10)
+#define SYS_BRBTGT11_EL1 __SYS_BRBTGT(11)
+#define SYS_BRBTGT12_EL1 __SYS_BRBTGT(12)
+#define SYS_BRBTGT13_EL1 __SYS_BRBTGT(13)
+#define SYS_BRBTGT14_EL1 __SYS_BRBTGT(14)
+#define SYS_BRBTGT15_EL1 __SYS_BRBTGT(15)
+#define SYS_BRBTGT16_EL1 __SYS_BRBTGT(16)
+#define SYS_BRBTGT17_EL1 __SYS_BRBTGT(17)
+#define SYS_BRBTGT18_EL1 __SYS_BRBTGT(18)
+#define SYS_BRBTGT19_EL1 __SYS_BRBTGT(19)
+#define SYS_BRBTGT20_EL1 __SYS_BRBTGT(20)
+#define SYS_BRBTGT21_EL1 __SYS_BRBTGT(21)
+#define SYS_BRBTGT22_EL1 __SYS_BRBTGT(22)
+#define SYS_BRBTGT23_EL1 __SYS_BRBTGT(23)
+#define SYS_BRBTGT24_EL1 __SYS_BRBTGT(24)
+#define SYS_BRBTGT25_EL1 __SYS_BRBTGT(25)
+#define SYS_BRBTGT26_EL1 __SYS_BRBTGT(26)
+#define SYS_BRBTGT27_EL1 __SYS_BRBTGT(27)
+#define SYS_BRBTGT28_EL1 __SYS_BRBTGT(28)
+#define SYS_BRBTGT29_EL1 __SYS_BRBTGT(29)
+#define SYS_BRBTGT30_EL1 __SYS_BRBTGT(30)
+#define SYS_BRBTGT31_EL1 __SYS_BRBTGT(31)
+
#define SYS_MIDR_EL1 sys_reg(3, 0, 0, 0, 0)
#define SYS_MPIDR_EL1 sys_reg(3, 0, 0, 0, 5)
#define SYS_REVIDR_EL1 sys_reg(3, 0, 0, 0, 6)
@@ -784,6 +887,12 @@
#define OP_DVP_RCTX sys_insn(1, 3, 7, 3, 5)
#define OP_CPP_RCTX sys_insn(1, 3, 7, 3, 7)
+/*
+ * BRBE Instructions
+ */
+#define BRB_IALL_INSN __emit_inst(0xd5000000 | OP_BRB_IALL | (0x1f))
+#define BRB_INJ_INSN __emit_inst(0xd5000000 | OP_BRB_INJ | (0x1f))
+
/* Common SCTLR_ELx flags. */
#define SCTLR_ELx_ENTP2 (BIT(60))
#define SCTLR_ELx_DSSBS (BIT(44))
diff --git a/arch/arm64/tools/sysreg b/arch/arm64/tools/sysreg
index 8fe23eac910f..101801fb19f1 100644
--- a/arch/arm64/tools/sysreg
+++ b/arch/arm64/tools/sysreg
@@ -1002,6 +1002,176 @@ UnsignedEnum 3:0 BT
EndEnum
EndSysreg
+
+SysregFields BRBINFx_EL1
+Res0 63:47
+Field 46 CCU
+Field 45:32 CC
+Res0 31:18
+Field 17 LASTFAILED
+Field 16 T
+Res0 15:14
+Enum 13:8 TYPE
+ 0b000000 UNCOND_DIRECT
+ 0b000001 INDIRECT
+ 0b000010 DIRECT_LINK
+ 0b000011 INDIRECT_LINK
+ 0b000101 RET
+ 0b000111 ERET
+ 0b001000 COND_DIRECT
+ 0b100001 DEBUG_HALT
+ 0b100010 CALL
+ 0b100011 TRAP
+ 0b100100 SERROR
+ 0b100110 INSN_DEBUG
+ 0b100111 DATA_DEBUG
+ 0b101010 ALIGN_FAULT
+ 0b101011 INSN_FAULT
+ 0b101100 DATA_FAULT
+ 0b101110 IRQ
+ 0b101111 FIQ
+ 0b111001 DEBUG_EXIT
+EndEnum
+Enum 7:6 EL
+ 0b00 EL0
+ 0b01 EL1
+ 0b10 EL2
+ 0b11 EL3
+EndEnum
+Field 5 MPRED
+Res0 4:2
+Enum 1:0 VALID
+ 0b00 NONE
+ 0b01 TARGET
+ 0b10 SOURCE
+ 0b11 FULL
+EndEnum
+EndSysregFields
+
+SysregFields BRBCR_ELx
+Res0 63:24
+Field 23 EXCEPTION
+Field 22 ERTN
+Res0 21:9
+Field 8 FZP
+Res0 7
+Enum 6:5 TS
+ 0b01 VIRTUAL
+ 0b10 GUEST_PHYSICAL
+ 0b11 PHYSICAL
+EndEnum
+Field 4 MPRED
+Field 3 CC
+Res0 2
+Field 1 ExBRE
+Field 0 E0BRE
+EndSysregFields
+
+Sysreg BRBCR_EL2 2 4 9 0 0
+Fields BRBCR_ELx
+EndSysreg
+
+Sysreg BRBCR_EL1 2 1 9 0 0
+Fields BRBCR_ELx
+EndSysreg
+
+Sysreg BRBCR_EL12 2 5 9 0 0
+Fields BRBCR_ELx
+EndSysreg
+
+Sysreg BRBFCR_EL1 2 1 9 0 1
+Res0 63:30
+Enum 29:28 BANK
+ 0b0 FIRST
+ 0b1 SECOND
+EndEnum
+Res0 27:23
+Field 22 CONDDIR
+Field 21 DIRCALL
+Field 20 INDCALL
+Field 19 RTN
+Field 18 INDIRECT
+Field 17 DIRECT
+Field 16 EnI
+Res0 15:8
+Field 7 PAUSED
+Field 6 LASTFAILED
+Res0 5:0
+EndSysreg
+
+Sysreg BRBTS_EL1 2 1 9 0 2
+Field 63:0 TS
+EndSysreg
+
+Sysreg BRBINFINJ_EL1 2 1 9 1 0
+Res0 63:47
+Field 46 CCU
+Field 45:32 CC
+Res0 31:18
+Field 17 LASTFAILED
+Field 16 T
+Res0 15:14
+Enum 13:8 TYPE
+ 0b000000 UNCOND_DIRECT
+ 0b000001 INDIRECT
+ 0b000010 DIRECT_LINK
+ 0b000011 INDIRECT_LINK
+ 0b000101 RET
+ 0b000111 ERET
+ 0b001000 COND_DIRECT
+ 0b100001 DEBUG_HALT
+ 0b100010 CALL
+ 0b100011 TRAP
+ 0b100100 SERROR
+ 0b100110 INSN_DEBUG
+ 0b100111 DATA_DEBUG
+ 0b101010 ALIGN_FAULT
+ 0b101011 INSN_FAULT
+ 0b101100 DATA_FAULT
+ 0b101110 IRQ
+ 0b101111 FIQ
+ 0b111001 DEBUG_EXIT
+EndEnum
+Enum 7:6 EL
+ 0b00 EL0
+ 0b01 EL1
+ 0b10 EL2
+ 0b11 EL3
+EndEnum
+Field 5 MPRED
+Res0 4:2
+Enum 1:0 VALID
+ 0b00 NONE
+ 0b01 TARGET
+ 0b10 SOURCE
+ 0b11 FULL
+EndEnum
+EndSysreg
+
+Sysreg BRBSRCINJ_EL1 2 1 9 1 1
+Field 63:0 ADDRESS
+EndSysreg
+
+Sysreg BRBTGTINJ_EL1 2 1 9 1 2
+Field 63:0 ADDRESS
+EndSysreg
+
+Sysreg BRBIDR0_EL1 2 1 9 2 0
+Res0 63:16
+Enum 15:12 CC
+ 0b101 20_BIT
+EndEnum
+Enum 11:8 FORMAT
+ 0b0 0
+EndEnum
+Enum 7:0 NUMREC
+ 0b0001000 8
+ 0b0010000 16
+ 0b0100000 32
+ 0b1000000 64
+EndEnum
+EndSysreg
+
Sysreg ID_AA64ZFR0_EL1 3 0 0 4 4
Res0 63:60
UnsignedEnum 59:56 F64MM
--
2.25.1
* [V14 2/8] KVM: arm64: Prevent guest accesses into BRBE system registers/instructions
2023-11-14 5:13 [V14 0/8] arm64/perf: Enable branch stack sampling Anshuman Khandual
2023-11-14 5:13 ` [V14 1/8] arm64/sysreg: Add BRBE registers and fields Anshuman Khandual
@ 2023-11-14 5:13 ` Anshuman Khandual
2023-11-14 5:13 ` [V14 3/8] drivers: perf: arm_pmuv3: Enable branch stack sampling framework Anshuman Khandual
` (6 subsequent siblings)
8 siblings, 0 replies; 30+ messages in thread
From: Anshuman Khandual @ 2023-11-14 5:13 UTC (permalink / raw)
To: linux-arm-kernel, linux-kernel, will, catalin.marinas,
mark.rutland
Cc: Anshuman Khandual, Mark Brown, James Clark, Rob Herring,
Marc Zyngier, Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-perf-users, Oliver Upton,
James Morse, kvmarm
Currently the BRBE feature is not supported in a guest environment. This hides
BRBE feature availability by masking the ID_AA64DFR0_EL1.BRBE field. This also
blocks guest accesses to BRBE system registers and instructions, as if the
underlying hardware never implemented the FEAT_BRBE feature.
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: James Morse <james.morse@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: kvmarm@lists.linux.dev
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
Changes in V14:
- This is a new patch in the series
arch/arm64/kvm/sys_regs.c | 130 ++++++++++++++++++++++++++++++++++++++
1 file changed, 130 insertions(+)
diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 4735e1b37fb3..42701065b3cd 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1583,6 +1583,9 @@ static u64 read_sanitised_id_aa64dfr0_el1(struct kvm_vcpu *vcpu,
/* Hide SPE from guests */
val &= ~ID_AA64DFR0_EL1_PMSVer_MASK;
+ /* Hide BRBE from guests */
+ val &= ~ID_AA64DFR0_EL1_BRBE_MASK;
+
return val;
}
@@ -2042,6 +2045,8 @@ static const struct sys_reg_desc sys_reg_descs[] = {
{ SYS_DESC(SYS_DC_CISW), access_dcsw },
{ SYS_DESC(SYS_DC_CIGSW), access_dcgsw },
{ SYS_DESC(SYS_DC_CIGDSW), access_dcgsw },
+ { SYS_DESC(OP_BRB_IALL), undef_access },
+ { SYS_DESC(OP_BRB_INJ), undef_access },
DBG_BCR_BVR_WCR_WVR_EL1(0),
DBG_BCR_BVR_WCR_WVR_EL1(1),
@@ -2072,6 +2077,131 @@ static const struct sys_reg_desc sys_reg_descs[] = {
{ SYS_DESC(SYS_DBGCLAIMCLR_EL1), trap_raz_wi },
{ SYS_DESC(SYS_DBGAUTHSTATUS_EL1), trap_dbgauthstatus_el1 },
+ /*
+ * BRBE branch record sysreg address space is interleaved between
+ * corresponding BRBINF<N>_EL1, BRBSRC<N>_EL1, and BRBTGT<N>_EL1.
+ */
+ { SYS_DESC(SYS_BRBINF0_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC0_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT0_EL1), undef_access },
+ { SYS_DESC(SYS_BRBINF16_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC16_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT16_EL1), undef_access },
+
+ { SYS_DESC(SYS_BRBINF1_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC1_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT1_EL1), undef_access },
+ { SYS_DESC(SYS_BRBINF17_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC17_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT17_EL1), undef_access },
+
+ { SYS_DESC(SYS_BRBINF2_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC2_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT2_EL1), undef_access },
+ { SYS_DESC(SYS_BRBINF18_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC18_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT18_EL1), undef_access },
+
+ { SYS_DESC(SYS_BRBINF3_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC3_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT3_EL1), undef_access },
+ { SYS_DESC(SYS_BRBINF19_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC19_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT19_EL1), undef_access },
+
+ { SYS_DESC(SYS_BRBINF4_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC4_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT4_EL1), undef_access },
+ { SYS_DESC(SYS_BRBINF20_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC20_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT20_EL1), undef_access },
+
+ { SYS_DESC(SYS_BRBINF5_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC5_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT5_EL1), undef_access },
+ { SYS_DESC(SYS_BRBINF21_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC21_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT21_EL1), undef_access },
+
+ { SYS_DESC(SYS_BRBINF6_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC6_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT6_EL1), undef_access },
+ { SYS_DESC(SYS_BRBINF22_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC22_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT22_EL1), undef_access },
+
+ { SYS_DESC(SYS_BRBINF7_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC7_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT7_EL1), undef_access },
+ { SYS_DESC(SYS_BRBINF23_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC23_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT23_EL1), undef_access },
+
+ { SYS_DESC(SYS_BRBINF8_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC8_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT8_EL1), undef_access },
+ { SYS_DESC(SYS_BRBINF24_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC24_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT24_EL1), undef_access },
+
+ { SYS_DESC(SYS_BRBINF9_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC9_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT9_EL1), undef_access },
+ { SYS_DESC(SYS_BRBINF25_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC25_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT25_EL1), undef_access },
+
+ { SYS_DESC(SYS_BRBINF10_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC10_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT10_EL1), undef_access },
+ { SYS_DESC(SYS_BRBINF26_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC26_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT26_EL1), undef_access },
+
+ { SYS_DESC(SYS_BRBINF11_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC11_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT11_EL1), undef_access },
+ { SYS_DESC(SYS_BRBINF27_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC27_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT27_EL1), undef_access },
+
+ { SYS_DESC(SYS_BRBINF12_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC12_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT12_EL1), undef_access },
+ { SYS_DESC(SYS_BRBINF28_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC28_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT28_EL1), undef_access },
+
+ { SYS_DESC(SYS_BRBINF13_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC13_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT13_EL1), undef_access },
+ { SYS_DESC(SYS_BRBINF29_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC29_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT29_EL1), undef_access },
+
+ { SYS_DESC(SYS_BRBINF14_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC14_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT14_EL1), undef_access },
+ { SYS_DESC(SYS_BRBINF30_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC30_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT30_EL1), undef_access },
+
+ { SYS_DESC(SYS_BRBINF15_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC15_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT15_EL1), undef_access },
+ { SYS_DESC(SYS_BRBINF31_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRC31_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGT31_EL1), undef_access },
+
+ /* Remaining BRBE sysreg addresses space */
+ { SYS_DESC(SYS_BRBCR_EL1), undef_access },
+ { SYS_DESC(SYS_BRBFCR_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTS_EL1), undef_access },
+ { SYS_DESC(SYS_BRBINFINJ_EL1), undef_access },
+ { SYS_DESC(SYS_BRBSRCINJ_EL1), undef_access },
+ { SYS_DESC(SYS_BRBTGTINJ_EL1), undef_access },
+ { SYS_DESC(SYS_BRBIDR0_EL1), undef_access },
+
{ SYS_DESC(SYS_MDCCSR_EL0), trap_raz_wi },
{ SYS_DESC(SYS_DBGDTR_EL0), trap_raz_wi },
// DBGDTR[TR]X_EL0 share the same encoding
--
2.25.1
* [V14 3/8] drivers: perf: arm_pmuv3: Enable branch stack sampling framework
2023-11-14 5:13 [V14 0/8] arm64/perf: Enable branch stack sampling Anshuman Khandual
2023-11-14 5:13 ` [V14 1/8] arm64/sysreg: Add BRBE registers and fields Anshuman Khandual
2023-11-14 5:13 ` [V14 2/8] KVM: arm64: Prevent guest accesses into BRBE system registers/instructions Anshuman Khandual
@ 2023-11-14 5:13 ` Anshuman Khandual
2023-11-14 9:58 ` James Clark
` (2 more replies)
2023-11-14 5:13 ` [V14 4/8] drivers: perf: arm_pmuv3: Enable branch stack sampling via FEAT_BRBE Anshuman Khandual
` (5 subsequent siblings)
8 siblings, 3 replies; 30+ messages in thread
From: Anshuman Khandual @ 2023-11-14 5:13 UTC (permalink / raw)
To: linux-arm-kernel, linux-kernel, will, catalin.marinas,
mark.rutland
Cc: Anshuman Khandual, Mark Brown, James Clark, Rob Herring,
Marc Zyngier, Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-perf-users
Branch stack sampling support, i.e. capturing branch records during execution
in core perf, rides along with normal HW events being scheduled on the PMU.
This prepares the ARMV8 PMU framework for branch stack support on relevant
PMUs with the required HW implementation.
ARMV8 PMU hardware support for branch stack sampling is indicated via a new
feature flag called 'has_branch_stack' that can be ascertained via probing.
This modifies the current gate in armpmu_event_init(), which blocks branch
stack sampling based perf events unconditionally, and instead allows such
perf events to be initialized on supporting PMU hardware.
Branch stack sampling is enabled and disabled along with regular PMU events.
This adds the required function callbacks, in the armv8pmu_branch_xxx() format,
to drive the PMU branch stack hardware when supported. This also adds fallback
stub definitions for these callbacks for PMUs which do not have the required
support.
If a task gets scheduled out, the current branch records get saved in the
task's context data, which can later be used to fill in the records upon an
event overflow. Hence, we enable PERF_ATTACH_TASK_DATA (an event->attach_state
based flag) for branch stack requesting perf events. But this also requires
adding support for the pmu::sched_task() callback to arm_pmu.
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
Changes in V14:
- Folded the following three patches from V13 series into this single patch
drivers: perf: arm_pmu: Add new sched_task() callback
arm64/perf: Add branch stack support in struct arm_pmu
arm64/perf: Add branch stack support in struct pmu_hw_events
arm64/perf: Add branch stack support in ARMV8 PMU
- All armv8pmu_branch_xxxx() stub definitions have been moved inside
include/linux/perf/arm_pmuv3.h for easy access from both arm32 and
arm64 platforms
- Added brbe_users, brbe_context and brbe_sample_type struct pmu_hw_events
- Added branch_reset() and sched_task() callbacks
- Changed and optimized branch records processing during a PMU IRQ
- NO branch records get captured for event with mismatched brbe_sample_type
- Branch record context is tracked from armpmu_del() & armpmu_add()
- Branch record hardware is driven from armv8pmu_start() & armv8pmu_stop()
drivers/perf/arm_pmu.c | 41 +++++++++-
drivers/perf/arm_pmuv3.c | 141 ++++++++++++++++++++++++++++++++-
include/linux/perf/arm_pmu.h | 29 ++++++-
include/linux/perf/arm_pmuv3.h | 46 +++++++++++
4 files changed, 253 insertions(+), 4 deletions(-)
diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
index d712a19e47ac..76f1376ae594 100644
--- a/drivers/perf/arm_pmu.c
+++ b/drivers/perf/arm_pmu.c
@@ -317,6 +317,15 @@ armpmu_del(struct perf_event *event, int flags)
struct hw_perf_event *hwc = &event->hw;
int idx = hwc->idx;
+ if (has_branch_stack(event)) {
+ WARN_ON_ONCE(!hw_events->brbe_users);
+ hw_events->brbe_users--;
+ if (!hw_events->brbe_users) {
+ hw_events->brbe_context = NULL;
+ hw_events->brbe_sample_type = 0;
+ }
+ }
+
armpmu_stop(event, PERF_EF_UPDATE);
hw_events->events[idx] = NULL;
armpmu->clear_event_idx(hw_events, event);
@@ -333,6 +342,22 @@ armpmu_add(struct perf_event *event, int flags)
struct hw_perf_event *hwc = &event->hw;
int idx;
+ if (has_branch_stack(event)) {
+ /*
+ * Reset branch records buffer if a new task event gets
+ * scheduled on a PMU which might have existing records.
+ * Otherwise older branch records present in the buffer
+ * might leak into the new task event.
+ */
+ if (event->ctx->task && hw_events->brbe_context != event->ctx) {
+ hw_events->brbe_context = event->ctx;
+ if (armpmu->branch_reset)
+ armpmu->branch_reset();
+ }
+ hw_events->brbe_users++;
+ hw_events->brbe_sample_type = event->attr.branch_sample_type;
+ }
+
/* An event following a process won't be stopped earlier */
if (!cpumask_test_cpu(smp_processor_id(), &armpmu->supported_cpus))
return -ENOENT;
@@ -512,13 +537,24 @@ static int armpmu_event_init(struct perf_event *event)
!cpumask_test_cpu(event->cpu, &armpmu->supported_cpus))
return -ENOENT;
- /* does not support taken branch sampling */
- if (has_branch_stack(event))
+ /*
+ * Branch stack sampling events are allowed
+ * only on PMU which has required support.
+ */
+ if (has_branch_stack(event) && !armpmu->has_branch_stack)
return -EOPNOTSUPP;
return __hw_perf_event_init(event);
}
+static void armpmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
+{
+ struct arm_pmu *armpmu = to_arm_pmu(pmu_ctx->pmu);
+
+ if (armpmu->sched_task)
+ armpmu->sched_task(pmu_ctx, sched_in);
+}
+
static void armpmu_enable(struct pmu *pmu)
{
struct arm_pmu *armpmu = to_arm_pmu(pmu);
@@ -865,6 +901,7 @@ struct arm_pmu *armpmu_alloc(void)
}
pmu->pmu = (struct pmu) {
+ .sched_task = armpmu_sched_task,
.pmu_enable = armpmu_enable,
.pmu_disable = armpmu_disable,
.event_init = armpmu_event_init,
diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index 6ca7be05229c..7f973c77a95e 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -751,14 +751,63 @@ static void armv8pmu_start(struct arm_pmu *cpu_pmu)
armv8pmu_pmcr_write(armv8pmu_pmcr_read() | ARMV8_PMU_PMCR_E);
kvm_vcpu_pmu_resync_el0();
+ if (cpu_pmu->has_branch_stack)
+ armv8pmu_branch_enable(cpu_pmu);
}
static void armv8pmu_stop(struct arm_pmu *cpu_pmu)
{
+ if (cpu_pmu->has_branch_stack)
+ armv8pmu_branch_disable();
+
/* Disable all counters */
armv8pmu_pmcr_write(armv8pmu_pmcr_read() & ~ARMV8_PMU_PMCR_E);
}
+/*
+ * This is a read only constant and safe during multi threaded access
+ */
+static struct perf_branch_stack zero_branch_stack = { .nr = 0, .hw_idx = -1ULL};
+
+static void read_branch_records(struct pmu_hw_events *cpuc,
+ struct perf_event *event,
+ struct perf_sample_data *data,
+ bool *branch_captured)
+{
+ /*
+ * CPU specific branch records buffer must have been allocated already
+ * for the hardware records to be captured and processed further.
+ */
+ if (WARN_ON(!cpuc->branches))
+ return;
+
+ /*
+ * Overflowed event's branch_sample_type does not match the configured
+ * branch filters in the BRBE HW. So the captured branch records here
+ * cannot be co-related to the overflowed event. Report to the user as
+ * if no branch records have been captured, and flush branch records.
+ * The same scenario is applicable when the current task context does
+ * not match with overflown event.
+ */
+ if ((cpuc->brbe_sample_type != event->attr.branch_sample_type) ||
+ (event->ctx->task && cpuc->brbe_context != event->ctx)) {
+ perf_sample_save_brstack(data, event, &zero_branch_stack);
+ return;
+ }
+
+ /*
+ * Read the branch records from the hardware once after the PMU IRQ
+ * has been triggered but subsequently same records can be used for
+ * other events that might have been overflowed simultaneously thus
+ * saving much CPU cycles.
+ */
+ if (!*branch_captured) {
+ armv8pmu_branch_read(cpuc, event);
+ *branch_captured = true;
+ }
+ perf_sample_save_brstack(data, event, &cpuc->branches->branch_stack);
+}
+
static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
{
u32 pmovsr;
@@ -766,6 +815,7 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
struct pmu_hw_events *cpuc = this_cpu_ptr(cpu_pmu->hw_events);
struct pt_regs *regs;
int idx;
+ bool branch_captured = false;
/*
* Get and reset the IRQ flags
@@ -809,6 +859,13 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
if (!armpmu_event_set_period(event))
continue;
+ /*
+ * PMU IRQ should remain asserted until all branch records
+ * are captured and processed into struct perf_sample_data.
+ */
+ if (has_branch_stack(event) && cpu_pmu->has_branch_stack)
+ read_branch_records(cpuc, event, &data, &branch_captured);
+
/*
* Perf event overflow will queue the processing of the event as
* an irq_work which will be taken care of in the handling of
@@ -818,6 +875,8 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
cpu_pmu->disable(event);
}
armv8pmu_start(cpu_pmu);
+ if (cpu_pmu->has_branch_stack)
+ armv8pmu_branch_reset();
return IRQ_HANDLED;
}
@@ -907,6 +966,24 @@ static int armv8pmu_user_event_idx(struct perf_event *event)
return event->hw.idx;
}
+static void armv8pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
+{
+ struct arm_pmu *armpmu = to_arm_pmu(pmu_ctx->pmu);
+ void *task_ctx = pmu_ctx->task_ctx_data;
+
+ if (armpmu->has_branch_stack) {
+ /* Save branch records in task_ctx on sched out */
+ if (task_ctx && !sched_in) {
+ armv8pmu_branch_save(armpmu, task_ctx);
+ return;
+ }
+
+ /* Reset branch records on sched in */
+ if (sched_in)
+ armv8pmu_branch_reset();
+ }
+}
+
/*
* Add an event filter to a given event.
*/
@@ -977,6 +1054,9 @@ static void armv8pmu_reset(void *info)
pmcr |= ARMV8_PMU_PMCR_LP;
armv8pmu_pmcr_write(pmcr);
+
+ if (cpu_pmu->has_branch_stack)
+ armv8pmu_branch_reset();
}
static int __armv8_pmuv3_map_event_id(struct arm_pmu *armpmu,
@@ -1014,6 +1094,20 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,
hw_event_id = __armv8_pmuv3_map_event_id(armpmu, event);
+ if (has_branch_stack(event)) {
+ if (!armv8pmu_branch_attr_valid(event))
+ return -EOPNOTSUPP;
+
+ /*
+ * If a task gets scheduled out, the current branch records
+ * get saved in the task's context data, which can be later
+ * used to fill in the records upon an event overflow. Let's
+ * enable PERF_ATTACH_TASK_DATA in 'event->attach_state' for
+ * all branch stack sampling perf events.
+ */
+ event->attach_state |= PERF_ATTACH_TASK_DATA;
+ }
+
/*
* CHAIN events only work when paired with an adjacent counter, and it
* never makes sense for a user to open one in isolation, as they'll be
@@ -1130,6 +1224,35 @@ static void __armv8pmu_probe_pmu(void *info)
cpu_pmu->reg_pmmir = read_pmmir();
else
cpu_pmu->reg_pmmir = 0;
+ armv8pmu_branch_probe(cpu_pmu);
+}
+
+static int branch_records_alloc(struct arm_pmu *armpmu)
+{
+ struct branch_records __percpu *records;
+ int cpu;
+
+ records = alloc_percpu_gfp(struct branch_records, GFP_KERNEL);
+ if (!records)
+ return -ENOMEM;
+
+ /*
+ * percpu memory allocated for 'records' gets completely consumed
+ * here, and never required to be freed up later. So permanently
+ * losing access to this anchor i.e 'records' is acceptable.
+ *
+ * Otherwise this allocation handle would have to be saved up for
+ * free_percpu() release later if required.
+ */
+ for_each_possible_cpu(cpu) {
+ struct pmu_hw_events *events_cpu;
+ struct branch_records *records_cpu;
+
+ events_cpu = per_cpu_ptr(armpmu->hw_events, cpu);
+ records_cpu = per_cpu_ptr(records, cpu);
+ events_cpu->branches = records_cpu;
+ }
+ return 0;
}
static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
@@ -1146,7 +1269,21 @@ static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
if (ret)
return ret;
- return probe.present ? 0 : -ENODEV;
+ if (!probe.present)
+ return -ENODEV;
+
+ if (cpu_pmu->has_branch_stack) {
+ ret = armv8pmu_task_ctx_cache_alloc(cpu_pmu);
+ if (ret)
+ return ret;
+
+ ret = branch_records_alloc(cpu_pmu);
+ if (ret) {
+ armv8pmu_task_ctx_cache_free(cpu_pmu);
+ return ret;
+ }
+ }
+ return 0;
}
static void armv8pmu_disable_user_access_ipi(void *unused)
@@ -1205,6 +1342,8 @@ static int armv8_pmu_init(struct arm_pmu *cpu_pmu, char *name,
cpu_pmu->set_event_filter = armv8pmu_set_event_filter;
cpu_pmu->pmu.event_idx = armv8pmu_user_event_idx;
+ cpu_pmu->sched_task = armv8pmu_sched_task;
+ cpu_pmu->branch_reset = armv8pmu_branch_reset;
cpu_pmu->name = name;
cpu_pmu->map_event = map_event;
diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
index 143fbc10ecfe..a489fdf163b4 100644
--- a/include/linux/perf/arm_pmu.h
+++ b/include/linux/perf/arm_pmu.h
@@ -46,6 +46,18 @@ static_assert((PERF_EVENT_FLAG_ARCH & ARMPMU_EVT_63BIT) == ARMPMU_EVT_63BIT);
}, \
}
+/*
+ * Maximum branch record entries which could be processed
+ * for core perf branch stack sampling support, regardless
+ * of the hardware support available on a given ARM PMU.
+ */
+#define MAX_BRANCH_RECORDS 64
+
+struct branch_records {
+ struct perf_branch_stack branch_stack;
+ struct perf_branch_entry branch_entries[MAX_BRANCH_RECORDS];
+};
+
/* The events for a given PMU register set. */
struct pmu_hw_events {
/*
@@ -72,6 +84,17 @@ struct pmu_hw_events {
struct arm_pmu *percpu_pmu;
int irq;
+
+ struct branch_records *branches;
+
+ /* Active context for task events */
+ void *brbe_context;
+
+ /* Active events requesting branch records */
+ unsigned int brbe_users;
+
+ /* Active branch sample type filters */
+ unsigned long brbe_sample_type;
};
enum armpmu_attr_groups {
@@ -102,8 +125,12 @@ struct arm_pmu {
void (*stop)(struct arm_pmu *);
void (*reset)(void *);
int (*map_event)(struct perf_event *event);
+ void (*sched_task)(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
+ void (*branch_reset)(void);
int num_events;
- bool secure_access; /* 32-bit ARM only */
+ unsigned int secure_access : 1, /* 32-bit ARM only */
+ has_branch_stack: 1, /* 64-bit ARM only */
+ reserved : 30;
#define ARMV8_PMUV3_MAX_COMMON_EVENTS 0x40
DECLARE_BITMAP(pmceid_bitmap, ARMV8_PMUV3_MAX_COMMON_EVENTS);
#define ARMV8_PMUV3_EXT_COMMON_EVENT_BASE 0x4000
diff --git a/include/linux/perf/arm_pmuv3.h b/include/linux/perf/arm_pmuv3.h
index 9c226adf938a..72da4522397c 100644
--- a/include/linux/perf/arm_pmuv3.h
+++ b/include/linux/perf/arm_pmuv3.h
@@ -303,4 +303,50 @@
} \
} while (0)
+struct pmu_hw_events;
+struct arm_pmu;
+struct perf_event;
+
+#ifdef CONFIG_PERF_EVENTS
+static inline void armv8pmu_branch_reset(void)
+{
+}
+
+static inline void armv8pmu_branch_probe(struct arm_pmu *arm_pmu)
+{
+}
+
+static inline bool armv8pmu_branch_attr_valid(struct perf_event *event)
+{
+ WARN_ON_ONCE(!has_branch_stack(event));
+ return false;
+}
+
+static inline void armv8pmu_branch_enable(struct arm_pmu *arm_pmu)
+{
+}
+
+static inline void armv8pmu_branch_disable(void)
+{
+}
+
+static inline void armv8pmu_branch_read(struct pmu_hw_events *cpuc,
+ struct perf_event *event)
+{
+ WARN_ON_ONCE(!has_branch_stack(event));
+}
+
+static inline void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx)
+{
+}
+
+static inline int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu)
+{
+ return 0;
+}
+
+static inline void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu)
+{
+}
+#endif /* CONFIG_PERF_EVENTS */
#endif
--
2.25.1
* [V14 4/8] drivers: perf: arm_pmuv3: Enable branch stack sampling via FEAT_BRBE
2023-11-14 5:13 [V14 0/8] arm64/perf: Enable branch stack sampling Anshuman Khandual
` (2 preceding siblings ...)
2023-11-14 5:13 ` [V14 3/8] drivers: perf: arm_pmuv3: Enable branch stack sampling framework Anshuman Khandual
@ 2023-11-14 5:13 ` Anshuman Khandual
2023-11-14 12:11 ` James Clark
2023-11-14 5:13 ` [V14 5/8] KVM: arm64: nvhe: Disable branch generation in nVHE guests Anshuman Khandual
` (4 subsequent siblings)
8 siblings, 1 reply; 30+ messages in thread
From: Anshuman Khandual @ 2023-11-14 5:13 UTC (permalink / raw)
To: linux-arm-kernel, linux-kernel, will, catalin.marinas,
mark.rutland
Cc: Anshuman Khandual, Mark Brown, James Clark, Rob Herring,
Marc Zyngier, Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-perf-users
This extends the recently added branch stack sampling framework in the ARMv8
PMU to enable such events via the new architecture feature called Branch
Record Buffer Extension aka BRBE. It implements all the armv8pmu_branch_xxx()
callbacks expected at the ARMv8 PMU level that are required to drive perf
branch stack sampling events, and adds a new config option CONFIG_ARM64_BRBE
to encapsulate this BRBE based implementation, available only on ARM64
platforms.
BRBE hardware captures a branch record via three distinct system registers
representing the branch source address, the branch target address, and other
branch information. A BRBE buffer implementation is organized as multiple
banks of 32 branch records each, where each record is a collection of the
BRBSRC_EL1, BRBTGT_EL1 and BRBINF_EL1 registers. The total number of BRBE
record entries i.e BRBE_MAX_ENTRIES cannot exceed MAX_BRANCH_RECORDS as
defined for the ARM PMU.
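As an aside, the bank/in-bank index split described above amounts to the
following (illustrative sketch only, not part of this patch; the helper names
here are hypothetical, while BRBE_BANK_MAX_ENTRIES matches the patch):

    #define BRBE_BANK_MAX_ENTRIES	32

    static int brbe_record_to_bank(int buffer_idx)
    {
            return buffer_idx / BRBE_BANK_MAX_ENTRIES;	/* bank 0 or 1 */
    }

    static int brbe_record_to_index(int buffer_idx)
    {
            return buffer_idx % BRBE_BANK_MAX_ENTRIES;	/* [0 .. 31] */
    }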
BRBE hardware attributes get captured in a new reg_brbidr element in struct
arm_pmu during armv8pmu_branch_probe(), which is called from the broader
probing function __armv8pmu_probe_pmu(). Attributes such as the number of
branch record entries implemented in the hardware can be derived from
armpmu->reg_brbidr.
BRBE gets enabled via armv8pmu_branch_enable(), which derives the branch
filters and additional requirements from the event's 'attr.branch_sample_type'
and configures them via the BRBFCR_EL1 and BRBCR_EL1 registers.
A PMU event overflow triggers an IRQ, where the current branch records get
captured and stitched together with the older records available in 'task_ctx',
before being processed into the core perf ring buffer. Task context switch-out
incrementally saves the current branch records into the event's
'pmu_ctx->task_ctx_data' to maximize the branch record samples for a workload.
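The stitch itself is small; this minimal sketch just restates the semantics
of stitch_stored_live_entries() added below (not additional code):

    /* Live (newer) records take the head, stored (older) ones keep the tail */
    int nr_move = min(nr_stored, nr_max - nr_live);

    memmove(&stored[nr_live], &stored[0], nr_move * sizeof(*stored));
    memcpy(&stored[0], &live[0], nr_live * sizeof(*stored));
    return min(nr_live + nr_stored, nr_max);

For example, with nr_stored = 5, nr_live = 8 and nr_max = 8, nr_move becomes
0, all eight live records land at the head, and all five stored records drop
off the end.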
In case multiple events with different branch sample type requests converge
on the same PMU, BRBE gets configured with the branch filters of the last
event's branch sample type. While handling the PMU IRQ, no branch records
will be captured or processed for an event whose branch sample type does not
match the current BRBE hardware configuration.
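For reference, a branch stack sampling event is requested from user space via
the standard perf_event_open() ABI, along these lines (illustrative only, not
part of this patch):

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <string.h>
    #include <unistd.h>

    static int open_branch_event(pid_t pid)
    {
            struct perf_event_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);
            attr.type = PERF_TYPE_HARDWARE;
            attr.config = PERF_COUNT_HW_CPU_CYCLES;
            attr.sample_period = 10000;
            attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
            attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY_CALL |
                                      PERF_SAMPLE_BRANCH_USER;

            /* armv8pmu_branch_attr_valid() vets branch_sample_type */
            return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
    }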
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
Changes in V14:
- Folded the following five patches from V13 series into this single patch
arm64/perf: Enable branch stack events via FEAT_BRBE
arm64/perf: Add struct brbe_regset helper functions
arm64/perf: Implement branch records save on task sched out
arm64/perf: Implement branch records save on PMU IRQ
- Fixed the year in copyright statement
- Added Documentation/arch/arm64/brbe.rst
- Updated Documentation/arch/arm64/booting.rst
- Added __init_el2_brbe() which enables branch record cycle count support
- Disabled EL2 traps in __init_el2_fgt() while accessing BRBE registers and
executing instructions
- Fixed a typo in ARM64_BRBE config option description text
- Added BUILD_BUG_ON() co-relating BRBE_BANK_MAX_ENTRIES and MAX_BRANCH_RECORDS
- Dropped arm64_create_brbe_task_ctx_kmem_cache()
- Moved down comment for PERF_SAMPLE_BRANCH_KERNEL in branch_type_to_brbcr()
- Renamed BRBCR_ELx_DEFAULT_CONFIG as BRBCR_ELx_CONFIG_MASK
- Replaced BRBCR_ELx_DEFAULT_TS with BRBCR_ELx_TS_MASK in BRBCR_ELx_CONFIG_MASK
Documentation/arch/arm64/booting.rst | 6 +
Documentation/arch/arm64/brbe.rst | 152 ++++++
arch/arm64/include/asm/el2_setup.h | 113 +++-
drivers/perf/Kconfig | 11 +
drivers/perf/Makefile | 1 +
drivers/perf/arm_brbe.c | 735 +++++++++++++++++++++++++++
drivers/perf/arm_brbe.h | 262 ++++++++++
include/linux/perf/arm_pmu.h | 5 +
include/linux/perf/arm_pmuv3.h | 13 +
9 files changed, 1295 insertions(+), 3 deletions(-)
create mode 100644 Documentation/arch/arm64/brbe.rst
create mode 100644 drivers/perf/arm_brbe.c
create mode 100644 drivers/perf/arm_brbe.h
diff --git a/Documentation/arch/arm64/booting.rst b/Documentation/arch/arm64/booting.rst
index b57776a68f15..2276df285e83 100644
--- a/Documentation/arch/arm64/booting.rst
+++ b/Documentation/arch/arm64/booting.rst
@@ -349,6 +349,12 @@ Before jumping into the kernel, the following conditions must be met:
- HWFGWTR_EL2.nSMPRI_EL1 (bit 54) must be initialised to 0b01.
+ For CPUs with feature Branch Record Buffer Extension (FEAT_BRBE):
+
+ - If the kernel is entered at EL1 and EL2 is present:
+
+ - BRBCR_EL2.CC (bit 3) must be initialised to 0b1.
+
For CPUs with the Scalable Matrix Extension FA64 feature (FEAT_SME_FA64):
- If EL3 is present:
diff --git a/Documentation/arch/arm64/brbe.rst b/Documentation/arch/arm64/brbe.rst
new file mode 100644
index 000000000000..f52f1df549bb
--- /dev/null
+++ b/Documentation/arch/arm64/brbe.rst
@@ -0,0 +1,152 @@
+============================================
+Branch Record Buffer Extension aka FEAT_BRBE
+============================================
+
+Author: Anshuman Khandual <anshuman.khandual@arm.com>
+
+FEAT_BRBE is an optional architecture feature, which creates branch records
+containing information about change in control flow. The branch information
+contains source address, target address, and some relevant metadata related
+to that change in control flow. BRBE can be configured to filter out branch
+records based on their type and privilege level.
+
+BRBE Hardware
+=============
+
+FEAT_BRBE support on a given implementation can be determined from the system
+register field ID_AA64DFR0_EL1.BRBE, which contains 'ID_AA64DFR0_EL1_BRBE_IMP'
+or 'ID_AA64DFR0_EL1_BRBE_BRBE_V1P1'. All BRBE system registers, including the
+branch record banks, are available for each CPU.
+
+1) Branch Record System Registers
+---------------------------------
+
+A single BRBE branch record representing a single change in control flow is
+constructed from three distinct BRBE system registers.
+
+1. BRBSRC<N>_EL1 - Branch record source address register
+2. BRBTGT<N>_EL1 - Branch record target address register
+3. BRBINF<N>_EL1 - Branch record information register
+
+The 'N' index mentioned above ranges over [0 .. 31], which forms a complete
+bank of branch records, and a given implementation can have multiple such
+banks of branch records.
+
+2) Branch Record Generation Filters and Controls
+------------------------------------------------
+Branch record generation and capture are controlled via the following system
+registers:
+
+1. BRBCR_EL1 - Branch record generation control
+2. BRBCR_EL2 - Branch record generation control
+3. BRBFCR_EL1 - Branch record function control
+
+Branch record generation can be filtered based on the control flow change
+type and the respective execution privilege level. Additional branch record
+information, such as the elapsed cycle count and prediction/misprediction
+status, can also be selectively enabled.
+
+3) Branch Record Information
+----------------------------
+
+Apart from the branch source and destination addresses, captured branch
+records also contain information such as prediction status, privilege level,
+cycle count, transaction state, control flow type etc. This information is
+stored in the respective BRBINF<N>_EL1 registers.
+
+Perf Implementation
+===================
+
+The perf branch stack sampling framework has been enabled on the arm64
+platform via this new FEAT_BRBE feature. The following description explains
+how this has been implemented at various levels of abstraction - from the
+perf core all the way down to the ARMv8 PMUv3 implementation.
+
+1) Branch stack abstraction at ARM PMU
+--------------------------------------
+
+Basic branch stack abstractions, such as the 'has_branch_stack' PMU feature
+flag in 'struct arm_pmu' and the 'struct branch_records' based branch records
+buffer in 'struct pmu_hw_events', have been implemented at the ARM PMU level.
+
+2) Branch stack implementation at ARMv8 PMUv3
+---------------------------------------------
+
+The branch stack driving callbacks armv8pmu_branch_xxx() have been
+implemented at the ARMv8 PMUv3 level alongside normal PMU HW events, with
+fallback stub definitions for the case where a given ARMv8 PMUv3
+implementation does not include FEAT_BRBE.
+
+**Detect branch stack support**
+
+__armv8pmu_probe_pmu()
+ armv8pmu_branch_probe()
+ arm_pmu->has_branch_stack = 1
+
+**Allocate branch stack buffers**
+
+__armv8pmu_probe_pmu()
+ armv8pmu_branch_probe()
+ arm_pmu->has_branch_stack
+ - armv8pmu_task_ctx_cache_alloc()
+ - branch_records_alloc()
+
+**Allow branch stack event**
+
+armpmu_event_init()
+ armpmu->has_branch_stack
+ has_branch_stack()
+ - branch event allowed to be created
+
+**Check branch stack event feasibility**
+
+__armv8_pmuv3_map_event()
+ has_branch_stack()
+ - event->attach_state | PERF_ATTACH_TASK_DATA
+ - armv8pmu_branch_attr_valid()
+
+**Enable branch record generation**
+
+armpmu_enable()
+ armpmu->start()
+ armv8pmu_start()
+ armv8pmu_branch_enable()
+
+**Disable branch record generation**
+
+armpmu_disable()
+ armpmu->stop()
+ armv8pmu_stop()
+ armv8pmu_branch_disable()
+
+**Capture branch record at PMU IRQ**
+
+armv8pmu_handle_irq()
+ has_branch_stack()
+ armv8pmu_branch_read()
+ perf_sample_save_brstack()
+
+**Process context sched in or out**
+
+armv8pmu_sched_task()
+ armpmu->has_branch_stack
+ - armv8pmu_branch_reset() --> sched_in
+ - armv8pmu_branch_save() --> sched_out
+
+**Reset branch stack buffer**
+
+armv8pmu_reset()
+ armpmu->has_branch_stack
+ armv8pmu_branch_reset()
+
+
+3) BRBE implementation at ARMv8 PMUv3
+-------------------------------------
+
+The FEAT_BRBE specific branch stack callbacks are implemented and made
+available via the new CONFIG_ARM64_BRBE config option. These callbacks drive
+branch record generation control and capture, alongside regular PMU HW
+events, at the ARMv8 PMUv3 level.
+
+Accessing FEAT_BRBE system registers and instructions on hardware where the
+feature is not available triggers illegal instruction exceptions. Hence all
+armv8pmu_branch_xxx() callbacks must only be called after ensuring the PMU
+has branch stack support aka FEAT_BRBE via armpmu->has_branch_stack.
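+
+A minimal caller sketch (illustrative only, mirroring the guard used in the
+driver)::
+
+    if (armpmu->has_branch_stack)
+            armv8pmu_branch_reset();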
diff --git a/arch/arm64/include/asm/el2_setup.h b/arch/arm64/include/asm/el2_setup.h
index b7afaa026842..649b926bf69d 100644
--- a/arch/arm64/include/asm/el2_setup.h
+++ b/arch/arm64/include/asm/el2_setup.h
@@ -154,6 +154,51 @@
.Lskip_set_cptr_\@:
.endm
+#ifdef CONFIG_ARM64_BRBE
+/*
+ * Enable BRBE cycle count
+ *
+ * BRBE requires both the BRBCR_EL1.CC and BRBCR_EL2.CC fields to be
+ * set for the cycle counts to be available in BRBINF<N>_EL1.CC during
+ * branch record processing after a PMU interrupt. This enables the CC
+ * field in both these registers while still executing at EL2.
+ *
+ * The BRBE driver can still toggle branch record cycle count support
+ * via the BRBCR_EL1.CC field, regardless of whether the kernel ends up
+ * executing at EL1 or EL2.
+ */
+.macro __init_el2_brbe
+ mrs x1, id_aa64dfr0_el1
+ ubfx x1, x1, #ID_AA64DFR0_EL1_BRBE_SHIFT, #4
+ cbz x1, .Lskip_brbe_cc_\@
+
+ mrs_s x0, SYS_BRBCR_EL2
+ orr x0, x0, BRBCR_ELx_CC
+ msr_s SYS_BRBCR_EL2, x0
+
+ /*
+ * Accessing the BRBCR_EL1 register here does not require the
+ * BRBCR_EL12 accessor, as HCR_EL2.E2H is still clear at this
+ * point. Regardless, check HCR_EL2.E2H and be on the safe
+ * side.
+ */
+ mrs x1, hcr_el2
+ and x1, x1, #HCR_E2H
+ cbz x1, .Lset_brbe_el1_direct_\@
+
+ mrs_s x0, SYS_BRBCR_EL12
+ orr x0, x0, BRBCR_ELx_CC
+ msr_s SYS_BRBCR_EL12, x0
+ b .Lskip_brbe_cc_\@
+
+.Lset_brbe_el1_direct_\@:
+ mrs_s x0, SYS_BRBCR_EL1
+ orr x0, x0, BRBCR_ELx_CC
+ msr_s SYS_BRBCR_EL1, x0
+.Lskip_brbe_cc_\@:
+.endm
+#endif
+
/* Disable any fine grained traps */
.macro __init_el2_fgt
mrs x1, id_aa64mmfr0_el1
@@ -161,16 +206,62 @@
cbz x1, .Lskip_fgt_\@
mov x0, xzr
+ mov x2, xzr
mrs x1, id_aa64dfr0_el1
ubfx x1, x1, #ID_AA64DFR0_EL1_PMSVer_SHIFT, #4
cmp x1, #3
b.lt .Lset_debug_fgt_\@
+
/* Disable PMSNEVFR_EL1 read and write traps */
- orr x0, x0, #(1 << 62)
+ orr x0, x0, #HDFGRTR_EL2_nPMSNEVFR_EL1_MASK
+ orr x2, x2, #HDFGWTR_EL2_nPMSNEVFR_EL1_MASK
.Lset_debug_fgt_\@:
+ mrs x1, id_aa64dfr0_el1
+ ubfx x1, x1, #ID_AA64DFR0_EL1_BRBE_SHIFT, #4
+ cbz x1, .Lskip_brbe_reg_fgt_\@
+
+ /*
+ * Disable read traps for the following BRBE related
+ * registers.
+ *
+ * BRBSRC_EL1
+ * BRBTGT_EL1
+ * BRBINF_EL1
+ * BRBSRCINJ_EL1
+ * BRBTGTINJ_EL1
+ * BRBINFINJ_EL1
+ * BRBTS_EL1
+ */
+ orr x0, x0, #HDFGRTR_EL2_nBRBDATA_MASK
+
+ /*
+ * Disable write traps for the following BRBE related
+ * registers.
+ *
+ * BRBSRCINJ_EL1
+ * BRBTGTINJ_EL1
+ * BRBINFINJ_EL1
+ * BRBTS_EL1
+ */
+ orr x2, x2, #HDFGWTR_EL2_nBRBDATA_MASK
+
+ /*
+ * Disable both read and write traps for the following
+ * BRBE related registers.
+ *
+ * BRBCR_EL1
+ * BRBFCR_EL1
+ */
+ orr x0, x0, #HDFGRTR_EL2_nBRBCTL_MASK
+ orr x2, x2, #HDFGWTR_EL2_nBRBCTL_MASK
+
+ /* Disable BRBIDR_EL1 read traps */
+ orr x0, x0, #HDFGRTR_EL2_nBRBIDR_MASK
+
+.Lskip_brbe_reg_fgt_\@:
msr_s SYS_HDFGRTR_EL2, x0
- msr_s SYS_HDFGWTR_EL2, x0
+ msr_s SYS_HDFGWTR_EL2, x2
mov x0, xzr
mrs x1, id_aa64pfr1_el1
@@ -193,7 +284,20 @@
.Lset_fgt_\@:
msr_s SYS_HFGRTR_EL2, x0
msr_s SYS_HFGWTR_EL2, x0
- msr_s SYS_HFGITR_EL2, xzr
+
+ mov x0, xzr
+ mrs x1, id_aa64dfr0_el1
+ ubfx x1, x1, #ID_AA64DFR0_EL1_BRBE_SHIFT, #4
+ cbz x1, .Lskip_brbe_insn_fgt_\@
+
+ /* Disable traps for BRBIALL instruction */
+ orr x0, x0, #HFGITR_EL2_nBRBIALL_MASK
+
+ /* Disable traps for BRBINJ instruction */
+ orr x0, x0, #HFGITR_EL2_nBRBINJ_MASK
+
+.Lskip_brbe_insn_fgt_\@:
+ msr_s SYS_HFGITR_EL2, x0
mrs x1, id_aa64pfr0_el1 // AMU traps UNDEF without AMU
ubfx x1, x1, #ID_AA64PFR0_EL1_AMU_SHIFT, #4
@@ -228,6 +332,9 @@
__init_el2_nvhe_idregs
__init_el2_cptr
__init_el2_fgt
+#ifdef CONFIG_ARM64_BRBE
+ __init_el2_brbe
+#endif
.endm
#ifndef __KVM_NVHE_HYPERVISOR__
diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 273d67ecf6d2..4cfafe375b17 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -180,6 +180,17 @@ config ARM_SPE_PMU
Extension, which provides periodic sampling of operations in
the CPU pipeline and reports this via the perf AUX interface.
+config ARM64_BRBE
+ bool "Enable support for branch stack sampling using FEAT_BRBE"
+ depends on PERF_EVENTS && ARM64 && ARM_PMU
+ default y
+ help
+ Enable perf support for the Branch Record Buffer Extension (BRBE),
+ which records the branches taken in an execution path. It supports
+ filtering based on branch type and privilege level, and captures
+ additional relevant information such as cycle count, misprediction
+ status, branch type and branch privilege level.
+
config ARM_DMC620_PMU
tristate "Enable PMU support for the ARM DMC-620 memory controller"
depends on (ARM64 && ACPI) || COMPILE_TEST
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index 16b3ec4db916..a8b7bc22e3d6 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -18,6 +18,7 @@ obj-$(CONFIG_RISCV_PMU_SBI) += riscv_pmu_sbi.o
obj-$(CONFIG_THUNDERX2_PMU) += thunderx2_pmu.o
obj-$(CONFIG_XGENE_PMU) += xgene_pmu.o
obj-$(CONFIG_ARM_SPE_PMU) += arm_spe_pmu.o
+obj-$(CONFIG_ARM64_BRBE) += arm_brbe.o
obj-$(CONFIG_ARM_DMC620_PMU) += arm_dmc620_pmu.o
obj-$(CONFIG_MARVELL_CN10K_TAD_PMU) += marvell_cn10k_tad_pmu.o
obj-$(CONFIG_MARVELL_CN10K_DDR_PMU) += marvell_cn10k_ddr_pmu.o
diff --git a/drivers/perf/arm_brbe.c b/drivers/perf/arm_brbe.c
new file mode 100644
index 000000000000..403c382ac436
--- /dev/null
+++ b/drivers/perf/arm_brbe.c
@@ -0,0 +1,735 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Branch Record Buffer Extension Driver.
+ *
+ * Copyright (C) 2022-2023 ARM Limited
+ *
+ * Author: Anshuman Khandual <anshuman.khandual@arm.com>
+ */
+#include "arm_brbe.h"
+
+void armv8pmu_branch_reset(void)
+{
+ asm volatile(BRB_IALL_INSN);
+ isb();
+}
+
+static bool valid_brbe_nr(int brbe_nr)
+{
+ return brbe_nr == BRBIDR0_EL1_NUMREC_8 ||
+ brbe_nr == BRBIDR0_EL1_NUMREC_16 ||
+ brbe_nr == BRBIDR0_EL1_NUMREC_32 ||
+ brbe_nr == BRBIDR0_EL1_NUMREC_64;
+}
+
+static bool valid_brbe_cc(int brbe_cc)
+{
+ return brbe_cc == BRBIDR0_EL1_CC_20_BIT;
+}
+
+static bool valid_brbe_format(int brbe_format)
+{
+ return brbe_format == BRBIDR0_EL1_FORMAT_0;
+}
+
+static bool valid_brbe_version(int brbe_version)
+{
+ return brbe_version == ID_AA64DFR0_EL1_BRBE_IMP ||
+ brbe_version == ID_AA64DFR0_EL1_BRBE_BRBE_V1P1;
+}
+
+static void select_brbe_bank(int bank)
+{
+ u64 brbfcr;
+
+ WARN_ON(bank > BRBE_BANK_IDX_1);
+ brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+ brbfcr &= ~BRBFCR_EL1_BANK_MASK;
+ brbfcr |= SYS_FIELD_PREP(BRBFCR_EL1, BANK, bank);
+ write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
+ isb();
+}
+
+static bool __read_brbe_regset(struct brbe_regset *entry, int idx)
+{
+ entry->brbinf = get_brbinf_reg(idx);
+
+ if (brbe_invalid(entry->brbinf))
+ return false;
+
+ entry->brbsrc = get_brbsrc_reg(idx);
+ entry->brbtgt = get_brbtgt_reg(idx);
+ return true;
+}
+
+/*
+ * Read all BRBE entries in HW until the first invalid entry.
+ *
+ * The caller must ensure that the BRBE is not concurrently modifying these
+ * branch entries.
+ */
+static int capture_brbe_regset(struct brbe_regset *buf, int nr_hw_entries)
+{
+ int idx = 0;
+
+ select_brbe_bank(BRBE_BANK_IDX_0);
+ while (idx < nr_hw_entries && idx <= BRBE_BANK0_IDX_MAX) {
+ if (!__read_brbe_regset(&buf[idx], idx))
+ return idx;
+ idx++;
+ }
+
+ select_brbe_bank(BRBE_BANK_IDX_1);
+ while (idx < nr_hw_entries && idx <= BRBE_BANK1_IDX_MAX) {
+ if (!__read_brbe_regset(&buf[idx], idx))
+ return idx;
+ idx++;
+ }
+ return idx;
+}
+
+/*
+ * This function concatenates branch records from the stored and live
+ * buffers, up to a maximum of nr_max records, with the stored buffer
+ * holding the result. The concatenated buffer contains all the branch
+ * records from the live buffer, and as many from the stored buffer as
+ * fit while the combined length does not exceed 'nr_max'.
+ *
+ * Stored records Live records
+ * ------------------------------------------------^
+ * | S0 | L0 | Newest |
+ * --------------------------------- |
+ * | S1 | L1 | |
+ * --------------------------------- |
+ * | S2 | L2 | |
+ * --------------------------------- |
+ * | S3 | L3 | |
+ * --------------------------------- |
+ * | S4 | L4 | nr_max
+ * --------------------------------- |
+ * | | L5 | |
+ * --------------------------------- |
+ * | | L6 | |
+ * --------------------------------- |
+ * | | L7 | |
+ * --------------------------------- |
+ * | | | |
+ * --------------------------------- |
+ * | | | Oldest |
+ * ------------------------------------------------V
+ *
+ *
+ * S0 is the newest of the stored records, whereas L7 is the oldest of
+ * the live records. Unless the live buffer is detected as being full,
+ * thus potentially dropping off some older records, L7 and S0 are
+ * contiguous in time for a user task context. The stitched buffer here
+ * represents the maximum possible branch records, contiguous in time.
+ *
+ * Stored records Live records
+ * ------------------------------------------------^
+ * | L0 | L0 | Newest |
+ * --------------------------------- |
+ * | L1 | L1 | |
+ * --------------------------------- |
+ * | L2 | L2 | |
+ * --------------------------------- |
+ * | L3 | L3 | |
+ * --------------------------------- |
+ * | L4 | L4 | nr_max
+ * --------------------------------- |
+ * | L5 | L5 | |
+ * --------------------------------- |
+ * | L6 | L6 | |
+ * --------------------------------- |
+ * | L7 | L7 | |
+ * --------------------------------- |
+ * | S0 | | |
+ * --------------------------------- |
+ * | S1 | | Oldest |
+ * ------------------------------------------------V
+ * | S2 | <----|
+ * ----------------- |
+ * | S3 | <----| Dropped off after nr_max
+ * ----------------- |
+ * | S4 | <----|
+ * -----------------
+ */
+static int stitch_stored_live_entries(struct brbe_regset *stored,
+ struct brbe_regset *live,
+ int nr_stored, int nr_live,
+ int nr_max)
+{
+ int nr_move = min(nr_stored, nr_max - nr_live);
+
+ /* Move the tail of the buffer to make room for the new entries */
+ memmove(&stored[nr_live], &stored[0], nr_move * sizeof(*stored));
+
+ /* Copy the new entries into the head of the buffer */
+ memcpy(&stored[0], &live[0], nr_live * sizeof(*stored));
+
+ /* Return the number of entries in the stitched buffer */
+ return min(nr_live + nr_stored, nr_max);
+}
+
+static int brbe_branch_save(struct brbe_regset *live, int nr_hw_entries)
+{
+ u64 brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+ int nr_live;
+
+ write_sysreg_s(brbfcr | BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
+ isb();
+
+ nr_live = capture_brbe_regset(live, nr_hw_entries);
+
+ write_sysreg_s(brbfcr & ~BRBFCR_EL1_PAUSED, SYS_BRBFCR_EL1);
+ isb();
+
+ return nr_live;
+}
+
+void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx)
+{
+ struct arm64_perf_task_context *task_ctx = ctx;
+ struct brbe_regset live[BRBE_MAX_ENTRIES];
+ int nr_live, nr_store, nr_hw_entries;
+
+ nr_hw_entries = brbe_get_numrec(arm_pmu->reg_brbidr);
+ nr_live = brbe_branch_save(live, nr_hw_entries);
+ nr_store = task_ctx->nr_brbe_records;
+ nr_store = stitch_stored_live_entries(task_ctx->store, live, nr_store,
+ nr_live, nr_hw_entries);
+ task_ctx->nr_brbe_records = nr_store;
+}
+
+/*
+ * Generic perf branch filters supported on BRBE
+ *
+ * New branch filters need to be evaluated for whether they can be supported
+ * on BRBE. This ensures that such branch filters are not silently accepted
+ * only to fail later. PERF_SAMPLE_BRANCH_HV is a special case that is
+ * selectively supported only on platforms where the kernel is in hyp mode.
+ */
+#define BRBE_EXCLUDE_BRANCH_FILTERS (PERF_SAMPLE_BRANCH_ABORT_TX | \
+ PERF_SAMPLE_BRANCH_IN_TX | \
+ PERF_SAMPLE_BRANCH_NO_TX | \
+ PERF_SAMPLE_BRANCH_CALL_STACK)
+
+#define BRBE_ALLOWED_BRANCH_FILTERS (PERF_SAMPLE_BRANCH_USER | \
+ PERF_SAMPLE_BRANCH_KERNEL | \
+ PERF_SAMPLE_BRANCH_HV | \
+ PERF_SAMPLE_BRANCH_ANY | \
+ PERF_SAMPLE_BRANCH_ANY_CALL | \
+ PERF_SAMPLE_BRANCH_ANY_RETURN | \
+ PERF_SAMPLE_BRANCH_IND_CALL | \
+ PERF_SAMPLE_BRANCH_COND | \
+ PERF_SAMPLE_BRANCH_IND_JUMP | \
+ PERF_SAMPLE_BRANCH_CALL | \
+ PERF_SAMPLE_BRANCH_NO_FLAGS | \
+ PERF_SAMPLE_BRANCH_NO_CYCLES | \
+ PERF_SAMPLE_BRANCH_TYPE_SAVE | \
+ PERF_SAMPLE_BRANCH_HW_INDEX | \
+ PERF_SAMPLE_BRANCH_PRIV_SAVE)
+
+#define BRBE_PERF_BRANCH_FILTERS (BRBE_ALLOWED_BRANCH_FILTERS | \
+ BRBE_EXCLUDE_BRANCH_FILTERS)
+
+bool armv8pmu_branch_attr_valid(struct perf_event *event)
+{
+ u64 branch_type = event->attr.branch_sample_type;
+
+ /*
+ * Ensure both perf branch filter allowed and exclude
+ * masks are always in sync with the generic perf ABI.
+ */
+ BUILD_BUG_ON(BRBE_PERF_BRANCH_FILTERS != (PERF_SAMPLE_BRANCH_MAX - 1));
+
+ if (branch_type & ~BRBE_ALLOWED_BRANCH_FILTERS) {
+ pr_debug_once("requested branch filter not supported 0x%llx\n", branch_type);
+ return false;
+ }
+
+ /*
+ * If the event does not have at least one of the privilege
+ * branch filters as in PERF_SAMPLE_BRANCH_PLM_ALL, the core
+ * perf will adjust its value based on perf event's existing
+ * privilege level via attr.exclude_[user|kernel|hv].
+ *
+ * As event->attr.branch_sample_type might have been changed by
+ * the time the event reaches here, it is not possible to figure
+ * out whether the event originally requested the HV privilege
+ * or whether it got added by the core perf. Just report this
+ * situation once and silently ignore any further instances.
+ */
+ if ((branch_type & PERF_SAMPLE_BRANCH_HV) && !is_kernel_in_hyp_mode())
+ pr_debug_once("hypervisor privilege filter not supported 0x%llx\n", branch_type);
+
+ return true;
+}
+
+int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu)
+{
+ size_t size = sizeof(struct arm64_perf_task_context);
+
+ arm_pmu->pmu.task_ctx_cache = kmem_cache_create("arm64_brbe_task_ctx", size, 0, 0, NULL);
+ if (!arm_pmu->pmu.task_ctx_cache)
+ return -ENOMEM;
+ return 0;
+}
+
+void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu)
+{
+ kmem_cache_destroy(arm_pmu->pmu.task_ctx_cache);
+}
+
+static int brbe_attributes_probe(struct arm_pmu *armpmu, u32 brbe)
+{
+ u64 brbidr = read_sysreg_s(SYS_BRBIDR0_EL1);
+ int brbe_version, brbe_format, brbe_cc, brbe_nr;
+
+ brbe_version = brbe;
+ brbe_format = brbe_get_format(brbidr);
+ brbe_cc = brbe_get_cc_bits(brbidr);
+ brbe_nr = brbe_get_numrec(brbidr);
+ armpmu->reg_brbidr = brbidr;
+
+ if (!valid_brbe_version(brbe_version) ||
+ !valid_brbe_format(brbe_format) ||
+ !valid_brbe_cc(brbe_cc) ||
+ !valid_brbe_nr(brbe_nr))
+ return -EOPNOTSUPP;
+ return 0;
+}
+
+void armv8pmu_branch_probe(struct arm_pmu *armpmu)
+{
+ u64 aa64dfr0 = read_sysreg_s(SYS_ID_AA64DFR0_EL1);
+ u32 brbe;
+
+ /*
+ * A BRBE implementation's branch entries cannot exceed the maximum
+ * branch records supported at the ARM PMU level of abstraction.
+ * Otherwise there is a possibility of array overflow while
+ * processing BRBE branch records.
+ */
+ BUILD_BUG_ON(BRBE_BANK_MAX_ENTRIES > MAX_BRANCH_RECORDS);
+
+ brbe = cpuid_feature_extract_unsigned_field(aa64dfr0, ID_AA64DFR0_EL1_BRBE_SHIFT);
+ if (!brbe)
+ return;
+
+ if (brbe_attributes_probe(armpmu, brbe))
+ return;
+
+ armpmu->has_branch_stack = 1;
+}
+
+/*
+ * BRBE supports the following functional branch type filters while
+ * generating branch records. These branch filters can be enabled,
+ * either individually or as a group i.e ORing multiple filters
+ * with each other.
+ *
+ * BRBFCR_EL1_CONDDIR - Conditional direct branch
+ * BRBFCR_EL1_DIRCALL - Direct call
+ * BRBFCR_EL1_INDCALL - Indirect call
+ * BRBFCR_EL1_INDIRECT - Indirect branch
+ * BRBFCR_EL1_DIRECT - Direct branch
+ * BRBFCR_EL1_RTN - Subroutine return
+ */
+static u64 branch_type_to_brbfcr(int branch_type)
+{
+ u64 brbfcr = 0;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_ANY) {
+ brbfcr |= BRBFCR_EL1_BRANCH_FILTERS;
+ return brbfcr;
+ }
+
+ if (branch_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
+ brbfcr |= BRBFCR_EL1_INDCALL;
+ brbfcr |= BRBFCR_EL1_DIRCALL;
+ }
+
+ if (branch_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
+ brbfcr |= BRBFCR_EL1_RTN;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_IND_CALL)
+ brbfcr |= BRBFCR_EL1_INDCALL;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_COND)
+ brbfcr |= BRBFCR_EL1_CONDDIR;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_IND_JUMP)
+ brbfcr |= BRBFCR_EL1_INDIRECT;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_CALL)
+ brbfcr |= BRBFCR_EL1_DIRCALL;
+
+ return brbfcr;
+}
+
+/*
+ * BRBE supports the following privilege mode filters while generating
+ * branch records.
+ *
+ * BRBCR_ELx_E0BRE - EL0 branch records
+ * BRBCR_ELx_ExBRE - EL1/EL2 branch records
+ *
+ * BRBE also supports the following additional functional branch type
+ * filters while generating branch records.
+ *
+ * BRBCR_ELx_EXCEPTION - Exception
+ * BRBCR_ELx_ERTN - Exception return
+ */
+static u64 branch_type_to_brbcr(int branch_type)
+{
+ u64 brbcr = BRBCR_ELx_DEFAULT_TS;
+
+ /*
+ * BRBE should be paused on a PMU interrupt while tracing kernel
+ * space, to stop capturing further branch records. Otherwise the
+ * interrupt handler's branch records might get into the samples,
+ * which is not desired.
+ *
+ * BRBE need not be paused on a PMU interrupt while tracing only
+ * the user space, because it will automatically be inside the
+ * prohibited region. But even after the PMU overflow occurs, the
+ * interrupt could still take many more cycles before it is taken,
+ * and by that time the BRBE buffer may have been overwritten.
+ * Hence enable the pause on PMU interrupt mechanism for user
+ * only traces as well.
+ */
+ brbcr |= BRBCR_ELx_FZP;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_USER)
+ brbcr |= BRBCR_ELx_E0BRE;
+
+ /*
+ * When running in the hyp mode, writing into BRBCR_EL1
+ * actually writes into BRBCR_EL2 instead. Field E2BRE
+ * is also at the same position as E1BRE.
+ */
+ if (branch_type & PERF_SAMPLE_BRANCH_KERNEL)
+ brbcr |= BRBCR_ELx_ExBRE;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_HV) {
+ if (is_kernel_in_hyp_mode())
+ brbcr |= BRBCR_ELx_ExBRE;
+ }
+
+ if (!(branch_type & PERF_SAMPLE_BRANCH_NO_CYCLES))
+ brbcr |= BRBCR_ELx_CC;
+
+ if (!(branch_type & PERF_SAMPLE_BRANCH_NO_FLAGS))
+ brbcr |= BRBCR_ELx_MPRED;
+
+ /*
+ * The exception and exception return branches could be
+ * captured, irrespective of the perf event's privilege.
+ * If the perf event does not have enough privilege for
+ * a given exception level, then addresses which fall
+ * under that exception level will be reported as zero
+ * in the captured branch record, creating source-only
+ * or target-only records.
+ */
+ if (branch_type & PERF_SAMPLE_BRANCH_ANY) {
+ brbcr |= BRBCR_ELx_EXCEPTION;
+ brbcr |= BRBCR_ELx_ERTN;
+ }
+
+ if (branch_type & PERF_SAMPLE_BRANCH_ANY_CALL)
+ brbcr |= BRBCR_ELx_EXCEPTION;
+
+ if (branch_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
+ brbcr |= BRBCR_ELx_ERTN;
+
+ return brbcr & BRBCR_ELx_CONFIG_MASK;
+}
+
+void armv8pmu_branch_enable(struct arm_pmu *arm_pmu)
+{
+ struct pmu_hw_events *cpuc = this_cpu_ptr(arm_pmu->hw_events);
+ u64 brbfcr, brbcr;
+
+ if (!(cpuc->brbe_sample_type && cpuc->brbe_users))
+ return;
+
+ /*
+ * BRBE gets configured with the latest branch sample type
+ * request, overriding any previously programmed branch filters.
+ */
+ brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+ brbfcr &= ~BRBFCR_EL1_DEFAULT_CONFIG;
+ brbfcr |= branch_type_to_brbfcr(cpuc->brbe_sample_type);
+ write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
+ isb();
+
+ brbcr = read_sysreg_s(SYS_BRBCR_EL1);
+ brbcr &= ~BRBCR_ELx_CONFIG_MASK;
+ brbcr |= branch_type_to_brbcr(cpuc->brbe_sample_type);
+ write_sysreg_s(brbcr, SYS_BRBCR_EL1);
+ isb();
+}
+
+void armv8pmu_branch_disable(void)
+{
+ u64 brbfcr, brbcr;
+
+ brbcr = read_sysreg_s(SYS_BRBCR_EL1);
+ brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+ brbcr &= ~(BRBCR_ELx_E0BRE | BRBCR_ELx_ExBRE);
+ brbfcr |= BRBFCR_EL1_PAUSED;
+ write_sysreg_s(brbcr, SYS_BRBCR_EL1);
+ write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
+ isb();
+}
+
+static void brbe_set_perf_entry_type(struct perf_branch_entry *entry, u64 brbinf)
+{
+ int brbe_type = brbe_get_type(brbinf);
+
+ switch (brbe_type) {
+ case BRBINFx_EL1_TYPE_UNCOND_DIRECT:
+ entry->type = PERF_BR_UNCOND;
+ break;
+ case BRBINFx_EL1_TYPE_INDIRECT:
+ entry->type = PERF_BR_IND;
+ break;
+ case BRBINFx_EL1_TYPE_DIRECT_LINK:
+ entry->type = PERF_BR_CALL;
+ break;
+ case BRBINFx_EL1_TYPE_INDIRECT_LINK:
+ entry->type = PERF_BR_IND_CALL;
+ break;
+ case BRBINFx_EL1_TYPE_RET:
+ entry->type = PERF_BR_RET;
+ break;
+ case BRBINFx_EL1_TYPE_COND_DIRECT:
+ entry->type = PERF_BR_COND;
+ break;
+ case BRBINFx_EL1_TYPE_CALL:
+ entry->type = PERF_BR_CALL;
+ break;
+ case BRBINFx_EL1_TYPE_TRAP:
+ entry->type = PERF_BR_SYSCALL;
+ break;
+ case BRBINFx_EL1_TYPE_ERET:
+ entry->type = PERF_BR_ERET;
+ break;
+ case BRBINFx_EL1_TYPE_IRQ:
+ entry->type = PERF_BR_IRQ;
+ break;
+ case BRBINFx_EL1_TYPE_DEBUG_HALT:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_ARM64_DEBUG_HALT;
+ break;
+ case BRBINFx_EL1_TYPE_SERROR:
+ entry->type = PERF_BR_SERROR;
+ break;
+ case BRBINFx_EL1_TYPE_INSN_DEBUG:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_ARM64_DEBUG_INST;
+ break;
+ case BRBINFx_EL1_TYPE_DATA_DEBUG:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_ARM64_DEBUG_DATA;
+ break;
+ case BRBINFx_EL1_TYPE_ALIGN_FAULT:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_NEW_FAULT_ALGN;
+ break;
+ case BRBINFx_EL1_TYPE_INSN_FAULT:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_NEW_FAULT_INST;
+ break;
+ case BRBINFx_EL1_TYPE_DATA_FAULT:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_NEW_FAULT_DATA;
+ break;
+ case BRBINFx_EL1_TYPE_FIQ:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_ARM64_FIQ;
+ break;
+ case BRBINFx_EL1_TYPE_DEBUG_EXIT:
+ entry->type = PERF_BR_EXTEND_ABI;
+ entry->new_type = PERF_BR_ARM64_DEBUG_EXIT;
+ break;
+ default:
+ pr_warn_once("%d - unknown branch type captured\n", brbe_type);
+ entry->type = PERF_BR_UNKNOWN;
+ break;
+ }
+}
+
+static int brbe_get_perf_priv(u64 brbinf)
+{
+ int brbe_el = brbe_get_el(brbinf);
+
+ switch (brbe_el) {
+ case BRBINFx_EL1_EL_EL0:
+ return PERF_BR_PRIV_USER;
+ case BRBINFx_EL1_EL_EL1:
+ return PERF_BR_PRIV_KERNEL;
+ case BRBINFx_EL1_EL_EL2:
+ if (is_kernel_in_hyp_mode())
+ return PERF_BR_PRIV_KERNEL;
+ return PERF_BR_PRIV_HV;
+ default:
+ pr_warn_once("%d - unknown branch privilege captured\n", brbe_el);
+ return PERF_BR_PRIV_UNKNOWN;
+ }
+}
+
+static void capture_brbe_flags(struct perf_branch_entry *entry, struct perf_event *event,
+ u64 brbinf)
+{
+ if (branch_sample_type(event))
+ brbe_set_perf_entry_type(entry, brbinf);
+
+ if (!branch_sample_no_cycles(event))
+ entry->cycles = brbe_get_cycles(brbinf);
+
+ if (!branch_sample_no_flags(event)) {
+ /*
+ * BRBINFx_EL1.LASTFAILED indicates that a TME transaction failed (or
+ * was cancelled) prior to this record, and some number of records
+ * prior to this one, may have been generated during an attempt to
+ * execute the transaction.
+ *
+ * We will remove such entries later in process_branch_aborts().
+ */
+ entry->abort = brbe_get_lastfailed(brbinf);
+
+ /*
+ * All this information (i.e transaction state and mispredict)
+ * is available only for source-only and complete branch records.
+ */
+ if (brbe_record_is_complete(brbinf) ||
+ brbe_record_is_source_only(brbinf)) {
+ entry->mispred = brbe_get_mispredict(brbinf);
+ entry->predicted = !entry->mispred;
+ entry->in_tx = brbe_get_in_tx(brbinf);
+ }
+ }
+
+ if (branch_sample_priv(event)) {
+ /*
+ * This information (i.e the branch privilege level) is
+ * available only for target-only and complete branch records.
+ */
+ if (brbe_record_is_complete(brbinf) ||
+ brbe_record_is_target_only(brbinf))
+ entry->priv = brbe_get_perf_priv(brbinf);
+ }
+}
+
+/*
+ * A branch record with BRBINFx_EL1.LASTFAILED set implies that all
+ * preceding consecutive branch records that were in a transaction
+ * (i.e had their BRBINFx_EL1.TX set) have been aborted.
+ *
+ * Similarly, BRBFCR_EL1.LASTFAILED set indicates that all preceding
+ * consecutive branch records, up to the last record, which were in a
+ * transaction (i.e had their BRBINFx_EL1.TX set) have been aborted.
+ *
+ * --------------------------------- -------------------
+ * | 00 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX success]
+ * --------------------------------- -------------------
+ * | 01 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX success]
+ * --------------------------------- -------------------
+ * | 02 | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 0 |
+ * --------------------------------- -------------------
+ * | 03 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
+ * --------------------------------- -------------------
+ * | 04 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
+ * --------------------------------- -------------------
+ * | 05 | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 1 |
+ * --------------------------------- -------------------
+ * | .. | BRBSRC | BRBTGT | BRBINF | | TX = 0 | LF = 0 |
+ * --------------------------------- -------------------
+ * | 61 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
+ * --------------------------------- -------------------
+ * | 62 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
+ * --------------------------------- -------------------
+ * | 63 | BRBSRC | BRBTGT | BRBINF | | TX = 1 | LF = 0 | [TX failed]
+ * --------------------------------- -------------------
+ *
+ * BRBFCR_EL1.LASTFAILED == 1
+ *
+ * BRBFCR_EL1.LASTFAILED fails all those consecutive in-transaction
+ * branch records near the end of the BRBE buffer.
+ *
+ * The architecture does not guarantee a non-transaction (TX = 0) branch
+ * record between two different transactions. So it is possible that
+ * a subsequent lastfailed record (TX = 0, LF = 1) might erroneously
+ * mark more transactions than required as aborted.
+ */
+static void process_branch_aborts(struct pmu_hw_events *cpuc)
+{
+ u64 brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
+ bool lastfailed = !!(brbfcr & BRBFCR_EL1_LASTFAILED);
+ int idx = brbe_get_numrec(cpuc->percpu_pmu->reg_brbidr) - 1;
+ struct perf_branch_entry *entry;
+
+ do {
+ entry = &cpuc->branches->branch_entries[idx];
+ if (entry->in_tx) {
+ entry->abort = lastfailed;
+ } else {
+ lastfailed = entry->abort;
+ entry->abort = false;
+ }
+ } while (idx--, idx >= 0);
+}
+
+static void brbe_regset_branch_entries(struct pmu_hw_events *cpuc, struct perf_event *event,
+ struct brbe_regset *regset, int idx)
+{
+ struct perf_branch_entry *entry = &cpuc->branches->branch_entries[idx];
+ u64 brbinf = regset[idx].brbinf;
+
+ perf_clear_branch_entry_bitfields(entry);
+ if (brbe_record_is_complete(brbinf)) {
+ entry->from = regset[idx].brbsrc;
+ entry->to = regset[idx].brbtgt;
+ } else if (brbe_record_is_source_only(brbinf)) {
+ entry->from = regset[idx].brbsrc;
+ entry->to = 0;
+ } else if (brbe_record_is_target_only(brbinf)) {
+ entry->from = 0;
+ entry->to = regset[idx].brbtgt;
+ }
+ capture_brbe_flags(entry, event, brbinf);
+}
+
+static void process_branch_entries(struct pmu_hw_events *cpuc, struct perf_event *event,
+ struct brbe_regset *regset, int nr_regset)
+{
+ int idx;
+
+ for (idx = 0; idx < nr_regset; idx++)
+ brbe_regset_branch_entries(cpuc, event, regset, idx);
+
+ cpuc->branches->branch_stack.nr = nr_regset;
+ cpuc->branches->branch_stack.hw_idx = -1ULL;
+}
+
+void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
+{
+ struct arm64_perf_task_context *task_ctx = event->pmu_ctx->task_ctx_data;
+ struct brbe_regset live[BRBE_MAX_ENTRIES];
+ int nr_live, nr_store, nr_hw_entries;
+
+ nr_hw_entries = brbe_get_numrec(cpuc->percpu_pmu->reg_brbidr);
+ nr_live = capture_brbe_regset(live, nr_hw_entries);
+ if (event->ctx->task) {
+ nr_store = task_ctx->nr_brbe_records;
+ nr_store = stitch_stored_live_entries(task_ctx->store, live, nr_store,
+ nr_live, nr_hw_entries);
+ process_branch_entries(cpuc, event, task_ctx->store, nr_store);
+ task_ctx->nr_brbe_records = 0;
+ } else {
+ process_branch_entries(cpuc, event, live, nr_live);
+ }
+ process_branch_aborts(cpuc);
+}
diff --git a/drivers/perf/arm_brbe.h b/drivers/perf/arm_brbe.h
new file mode 100644
index 000000000000..b202fe34ea42
--- /dev/null
+++ b/drivers/perf/arm_brbe.h
@@ -0,0 +1,262 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Branch Record Buffer Extension Helpers.
+ *
+ * Copyright (C) 2022-2023 ARM Limited
+ *
+ * Author: Anshuman Khandual <anshuman.khandual@arm.com>
+ */
+#define pr_fmt(fmt) "brbe: " fmt
+
+#include <linux/perf/arm_pmu.h>
+
+#define BRBFCR_EL1_BRANCH_FILTERS (BRBFCR_EL1_DIRECT | \
+ BRBFCR_EL1_INDIRECT | \
+ BRBFCR_EL1_RTN | \
+ BRBFCR_EL1_INDCALL | \
+ BRBFCR_EL1_DIRCALL | \
+ BRBFCR_EL1_CONDDIR)
+
+#define BRBFCR_EL1_DEFAULT_CONFIG (BRBFCR_EL1_BANK_MASK | \
+ BRBFCR_EL1_PAUSED | \
+ BRBFCR_EL1_EnI | \
+ BRBFCR_EL1_BRANCH_FILTERS)
+
+/*
+ * BRBTS_EL1 is currently not used for the branch stack implementation,
+ * but BRBCR_ELx.TS still needs to hold one of the valid values.
+ * BRBCR_ELx_TS_VIRTUAL is selected for this.
+ */
+#define BRBCR_ELx_DEFAULT_TS FIELD_PREP(BRBCR_ELx_TS_MASK, BRBCR_ELx_TS_VIRTUAL)
+
+#define BRBCR_ELx_CONFIG_MASK (BRBCR_ELx_EXCEPTION | \
+ BRBCR_ELx_ERTN | \
+ BRBCR_ELx_CC | \
+ BRBCR_ELx_MPRED | \
+ BRBCR_ELx_ExBRE | \
+ BRBCR_ELx_E0BRE | \
+ BRBCR_ELx_FZP | \
+ BRBCR_ELx_TS_MASK)
+
+/*
+ * BRBE Buffer Organization
+ *
+ * The BRBE buffer is arranged as multiple banks of 32 branch record
+ * entries each. An individual branch record in a given bank can be
+ * accessed after selecting the bank in BRBFCR_EL1.BANK and then
+ * accessing the register set i.e [BRBSRC, BRBTGT, BRBINF] with
+ * indices [0..31].
+ *
+ * Bank 0
+ *
+ * --------------------------------- ------
+ * | 00 | BRBSRC | BRBTGT | BRBINF | | 00 |
+ * --------------------------------- ------
+ * | 01 | BRBSRC | BRBTGT | BRBINF | | 01 |
+ * --------------------------------- ------
+ * | .. | BRBSRC | BRBTGT | BRBINF | | .. |
+ * --------------------------------- ------
+ * | 31 | BRBSRC | BRBTGT | BRBINF | | 31 |
+ * --------------------------------- ------
+ *
+ * Bank 1
+ *
+ * --------------------------------- ------
+ * | 32 | BRBSRC | BRBTGT | BRBINF | | 00 |
+ * --------------------------------- ------
+ * | 33 | BRBSRC | BRBTGT | BRBINF | | 01 |
+ * --------------------------------- ------
+ * | .. | BRBSRC | BRBTGT | BRBINF | | .. |
+ * --------------------------------- ------
+ * | 63 | BRBSRC | BRBTGT | BRBINF | | 31 |
+ * --------------------------------- ------
+ */
+#define BRBE_BANK_MAX_ENTRIES 32
+#define BRBE_MAX_BANK 2
+#define BRBE_MAX_ENTRIES (BRBE_BANK_MAX_ENTRIES * BRBE_MAX_BANK)
+
+#define BRBE_BANK0_IDX_MIN 0
+#define BRBE_BANK0_IDX_MAX 31
+#define BRBE_BANK1_IDX_MIN 32
+#define BRBE_BANK1_IDX_MAX 63
+
+struct brbe_regset {
+ unsigned long brbsrc;
+ unsigned long brbtgt;
+ unsigned long brbinf;
+};
+
+struct arm64_perf_task_context {
+ struct brbe_regset store[BRBE_MAX_ENTRIES];
+ int nr_brbe_records;
+};
+
+struct brbe_hw_attr {
+ int brbe_version;
+ int brbe_cc;
+ int brbe_nr;
+ int brbe_format;
+};
+
+enum brbe_bank_idx {
+ BRBE_BANK_IDX_INVALID = -1,
+ BRBE_BANK_IDX_0,
+ BRBE_BANK_IDX_1,
+ BRBE_BANK_IDX_MAX
+};
+
+#define RETURN_READ_BRBSRCN(n) \
+ read_sysreg_s(SYS_BRBSRC##n##_EL1)
+
+#define RETURN_READ_BRBTGTN(n) \
+ read_sysreg_s(SYS_BRBTGT##n##_EL1)
+
+#define RETURN_READ_BRBINFN(n) \
+ read_sysreg_s(SYS_BRBINF##n##_EL1)
+
+#define BRBE_REGN_CASE(n, case_macro) \
+ case n: return case_macro(n); break
+
+#define BRBE_REGN_SWITCH(x, case_macro) \
+ do { \
+ switch (x) { \
+ BRBE_REGN_CASE(0, case_macro); \
+ BRBE_REGN_CASE(1, case_macro); \
+ BRBE_REGN_CASE(2, case_macro); \
+ BRBE_REGN_CASE(3, case_macro); \
+ BRBE_REGN_CASE(4, case_macro); \
+ BRBE_REGN_CASE(5, case_macro); \
+ BRBE_REGN_CASE(6, case_macro); \
+ BRBE_REGN_CASE(7, case_macro); \
+ BRBE_REGN_CASE(8, case_macro); \
+ BRBE_REGN_CASE(9, case_macro); \
+ BRBE_REGN_CASE(10, case_macro); \
+ BRBE_REGN_CASE(11, case_macro); \
+ BRBE_REGN_CASE(12, case_macro); \
+ BRBE_REGN_CASE(13, case_macro); \
+ BRBE_REGN_CASE(14, case_macro); \
+ BRBE_REGN_CASE(15, case_macro); \
+ BRBE_REGN_CASE(16, case_macro); \
+ BRBE_REGN_CASE(17, case_macro); \
+ BRBE_REGN_CASE(18, case_macro); \
+ BRBE_REGN_CASE(19, case_macro); \
+ BRBE_REGN_CASE(20, case_macro); \
+ BRBE_REGN_CASE(21, case_macro); \
+ BRBE_REGN_CASE(22, case_macro); \
+ BRBE_REGN_CASE(23, case_macro); \
+ BRBE_REGN_CASE(24, case_macro); \
+ BRBE_REGN_CASE(25, case_macro); \
+ BRBE_REGN_CASE(26, case_macro); \
+ BRBE_REGN_CASE(27, case_macro); \
+ BRBE_REGN_CASE(28, case_macro); \
+ BRBE_REGN_CASE(29, case_macro); \
+ BRBE_REGN_CASE(30, case_macro); \
+ BRBE_REGN_CASE(31, case_macro); \
+ default: \
+ pr_warn("unknown register index\n"); \
+ return -1; \
+ } \
+ } while (0)
+
+static inline int buffer_to_brbe_idx(int buffer_idx)
+{
+ return buffer_idx % BRBE_BANK_MAX_ENTRIES;
+}
+
+static inline u64 get_brbsrc_reg(int buffer_idx)
+{
+ int brbe_idx = buffer_to_brbe_idx(buffer_idx);
+
+ BRBE_REGN_SWITCH(brbe_idx, RETURN_READ_BRBSRCN);
+}
+
+static inline u64 get_brbtgt_reg(int buffer_idx)
+{
+ int brbe_idx = buffer_to_brbe_idx(buffer_idx);
+
+ BRBE_REGN_SWITCH(brbe_idx, RETURN_READ_BRBTGTN);
+}
+
+static inline u64 get_brbinf_reg(int buffer_idx)
+{
+ int brbe_idx = buffer_to_brbe_idx(buffer_idx);
+
+ BRBE_REGN_SWITCH(brbe_idx, RETURN_READ_BRBINFN);
+}
+
+static inline u64 brbe_record_valid(u64 brbinf)
+{
+ return FIELD_GET(BRBINFx_EL1_VALID_MASK, brbinf);
+}
+
+static inline bool brbe_invalid(u64 brbinf)
+{
+ return brbe_record_valid(brbinf) == BRBINFx_EL1_VALID_NONE;
+}
+
+static inline bool brbe_record_is_complete(u64 brbinf)
+{
+ return brbe_record_valid(brbinf) == BRBINFx_EL1_VALID_FULL;
+}
+
+static inline bool brbe_record_is_source_only(u64 brbinf)
+{
+ return brbe_record_valid(brbinf) == BRBINFx_EL1_VALID_SOURCE;
+}
+
+static inline bool brbe_record_is_target_only(u64 brbinf)
+{
+ return brbe_record_valid(brbinf) == BRBINFx_EL1_VALID_TARGET;
+}
+
+static inline int brbe_get_in_tx(u64 brbinf)
+{
+ return FIELD_GET(BRBINFx_EL1_T_MASK, brbinf);
+}
+
+static inline int brbe_get_mispredict(u64 brbinf)
+{
+ return FIELD_GET(BRBINFx_EL1_MPRED_MASK, brbinf);
+}
+
+static inline int brbe_get_lastfailed(u64 brbinf)
+{
+ return FIELD_GET(BRBINFx_EL1_LASTFAILED_MASK, brbinf);
+}
+
+static inline int brbe_get_cycles(u64 brbinf)
+{
+ /*
+ * Captured cycle count is unknown and hence
+ * should not be passed on to the user space.
+ */
+ if (brbinf & BRBINFx_EL1_CCU)
+ return 0;
+
+ return FIELD_GET(BRBINFx_EL1_CC_MASK, brbinf);
+}
+
+static inline int brbe_get_type(u64 brbinf)
+{
+ return FIELD_GET(BRBINFx_EL1_TYPE_MASK, brbinf);
+}
+
+static inline int brbe_get_el(u64 brbinf)
+{
+ return FIELD_GET(BRBINFx_EL1_EL_MASK, brbinf);
+}
+
+static inline int brbe_get_numrec(u64 brbidr)
+{
+ return FIELD_GET(BRBIDR0_EL1_NUMREC_MASK, brbidr);
+}
+
+static inline int brbe_get_format(u64 brbidr)
+{
+ return FIELD_GET(BRBIDR0_EL1_FORMAT_MASK, brbidr);
+}
+
+static inline int brbe_get_cc_bits(u64 brbidr)
+{
+ return FIELD_GET(BRBIDR0_EL1_CC_MASK, brbidr);
+}
diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
index a489fdf163b4..76d49c1d8659 100644
--- a/include/linux/perf/arm_pmu.h
+++ b/include/linux/perf/arm_pmu.h
@@ -144,6 +144,11 @@ struct arm_pmu {
/* store the PMMIR_EL1 to expose slots */
u64 reg_pmmir;
+#ifdef CONFIG_ARM64_BRBE
+ /* store the BRBIDR0_EL1 capturing attributes */
+ u64 reg_brbidr;
+#endif
+
/* Only to be used by ACPI probing code */
unsigned long acpi_cpuid;
};
diff --git a/include/linux/perf/arm_pmuv3.h b/include/linux/perf/arm_pmuv3.h
index 72da4522397c..4a4db12411a4 100644
--- a/include/linux/perf/arm_pmuv3.h
+++ b/include/linux/perf/arm_pmuv3.h
@@ -308,6 +308,18 @@ struct arm_pmu;
struct perf_event;
#ifdef CONFIG_PERF_EVENTS
+#ifdef CONFIG_ARM64_BRBE
+void armv8pmu_branch_reset(void);
+void armv8pmu_branch_probe(struct arm_pmu *arm_pmu);
+bool armv8pmu_branch_attr_valid(struct perf_event *event);
+void armv8pmu_branch_enable(struct arm_pmu *arm_pmu);
+void armv8pmu_branch_disable(void);
+void armv8pmu_branch_read(struct pmu_hw_events *cpuc,
+ struct perf_event *event);
+void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx);
+int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu);
+void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu);
+#else /* !CONFIG_ARM64_BRBE */
static inline void armv8pmu_branch_reset(void)
{
}
@@ -348,5 +360,6 @@ static inline int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu)
static inline void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu)
{
}
+#endif /* CONFIG_ARM64_BRBE */
#endif /* CONFIG_PERF_EVENTS */
#endif
--
2.25.1
* [V14 5/8] KVM: arm64: nvhe: Disable branch generation in nVHE guests
2023-11-14 5:13 [V14 0/8] arm64/perf: Enable branch stack sampling Anshuman Khandual
` (3 preceding siblings ...)
2023-11-14 5:13 ` [V14 4/8] drivers: perf: arm_pmuv3: Enable branch stack sampling via FEAT_BRBE Anshuman Khandual
@ 2023-11-14 5:13 ` Anshuman Khandual
2023-11-14 9:16 ` James Clark
2023-11-14 5:13 ` [V14 6/8] perf: test: Speed up running brstack test on an Arm model Anshuman Khandual
` (3 subsequent siblings)
8 siblings, 1 reply; 30+ messages in thread
From: Anshuman Khandual @ 2023-11-14 5:13 UTC (permalink / raw)
To: linux-arm-kernel, linux-kernel, will, catalin.marinas,
mark.rutland
Cc: Anshuman Khandual, Mark Brown, James Clark, Rob Herring,
Marc Zyngier, Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-perf-users, Oliver Upton,
James Morse, kvmarm
Disable the BRBE before we enter the guest, saving its status, and enable it
again once we are out of the guest. This avoids capturing branch records in
the guest kernel/userspace, which would otherwise pollute the host's samples.
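The switch amounts to the following pattern (a minimal standalone sketch of
the flow implemented below; the function names here are hypothetical, while
the register accesses mirror the patch):

    /* Sketch only: pause BRBE across the guest run, then restore */
    static u64 saved_brbcr, saved_brbfcr;

    static void brbe_pause_for_guest(void)
    {
            saved_brbcr = read_sysreg_s(SYS_BRBCR_EL1);
            saved_brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
            write_sysreg_s(0, SYS_BRBCR_EL1);	/* stop record generation */
            write_sysreg_s(0, SYS_BRBFCR_EL1);
            isb();
    }

    static void brbe_resume_after_guest(void)
    {
            write_sysreg_s(saved_brbcr, SYS_BRBCR_EL1);
            write_sysreg_s(saved_brbfcr, SYS_BRBFCR_EL1);
            isb();
    }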
Cc: Marc Zyngier <maz@kernel.org>
Cc: Oliver Upton <oliver.upton@linux.dev>
Cc: James Morse <james.morse@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: kvmarm@lists.linux.dev
Cc: linux-arm-kernel@lists.infradead.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
Changes in V14:
- This is a new patch in the series
arch/arm64/include/asm/kvm_host.h | 4 ++++
arch/arm64/kvm/debug.c | 6 +++++
arch/arm64/kvm/hyp/nvhe/debug-sr.c | 38 ++++++++++++++++++++++++++++++
3 files changed, 48 insertions(+)
diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 68421c74283a..1faa0430d8dd 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -449,6 +449,8 @@ enum vcpu_sysreg {
CNTHV_CVAL_EL2,
PMSCR_EL1, /* Statistical profiling extension */
TRFCR_EL1, /* Self-hosted trace filters */
+ BRBCR_EL1, /* Branch Record Buffer Control Register */
+ BRBFCR_EL1, /* Branch Record Buffer Function Control Register */
NR_SYS_REGS /* Nothing after this line! */
};
@@ -753,6 +755,8 @@ struct kvm_vcpu_arch {
#define VCPU_HYP_CONTEXT __vcpu_single_flag(iflags, BIT(7))
/* Save trace filter controls */
#define DEBUG_STATE_SAVE_TRFCR __vcpu_single_flag(iflags, BIT(8))
+/* Save BRBE context if active */
+#define DEBUG_STATE_SAVE_BRBE __vcpu_single_flag(iflags, BIT(9))
/* SVE enabled for host EL0 */
#define HOST_SVE_ENABLED __vcpu_single_flag(sflags, BIT(0))
diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
index 2ab41b954512..4055783c3d34 100644
--- a/arch/arm64/kvm/debug.c
+++ b/arch/arm64/kvm/debug.c
@@ -354,6 +354,11 @@ void kvm_arch_vcpu_load_debug_state_flags(struct kvm_vcpu *vcpu)
!(read_sysreg_s(SYS_TRBIDR_EL1) & TRBIDR_EL1_P))
vcpu_set_flag(vcpu, DEBUG_STATE_SAVE_TRBE);
}
+
+ /* Check if we have BRBE implemented and available at the host */
+ if (cpuid_feature_extract_unsigned_field(dfr0, ID_AA64DFR0_EL1_BRBE_SHIFT) &&
+ (read_sysreg_s(SYS_BRBCR_EL1) & (BRBCR_ELx_E0BRE | BRBCR_ELx_ExBRE)))
+ vcpu_set_flag(vcpu, DEBUG_STATE_SAVE_BRBE);
}
void kvm_arch_vcpu_put_debug_state_flags(struct kvm_vcpu *vcpu)
@@ -361,6 +366,7 @@ void kvm_arch_vcpu_put_debug_state_flags(struct kvm_vcpu *vcpu)
vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_SPE);
vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_TRBE);
vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_TRFCR);
+ vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_BRBE);
}
void kvm_etm_set_guest_trfcr(u64 trfcr_guest)
diff --git a/arch/arm64/kvm/hyp/nvhe/debug-sr.c b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
index 6174f710948e..e44a1f71a0f8 100644
--- a/arch/arm64/kvm/hyp/nvhe/debug-sr.c
+++ b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
@@ -93,6 +93,38 @@ static void __debug_restore_trace(struct kvm_cpu_context *host_ctxt,
write_sysreg_s(ctxt_sys_reg(host_ctxt, TRFCR_EL1), SYS_TRFCR_EL1);
}
+static void __debug_save_brbe(struct kvm_cpu_context *host_ctxt)
+{
+ u64 brbcr = read_sysreg_s(SYS_BRBCR_EL1);
+
+ ctxt_sys_reg(host_ctxt, BRBCR_EL1) = 0;
+ ctxt_sys_reg(host_ctxt, BRBFCR_EL1) = 0;
+
+ /*
+ * Check the live register, not the just-zeroed context copy,
+ * to see whether the BRBE is enabled.
+ */
+ if (!(brbcr & (BRBCR_ELx_E0BRE | BRBCR_ELx_ExBRE)))
+ return;
+
+ /*
+ * Prohibit branch record generation while we are in guest.
+ * Since access to BRBCR_EL1 and BRBFCR_EL1 is trapped, the
+ * guest can't modify the filtering set by the host.
+ */
+ ctxt_sys_reg(host_ctxt, BRBCR_EL1) = brbcr;
+ ctxt_sys_reg(host_ctxt, BRBFCR_EL1) = read_sysreg_s(SYS_BRBFCR_EL1);
+ write_sysreg_s(0, SYS_BRBCR_EL1);
+ write_sysreg_s(0, SYS_BRBFCR_EL1);
+ isb();
+}
+
+static void __debug_restore_brbe(struct kvm_cpu_context *host_ctxt)
+{
+ /* BRBFCR_EL1 may legitimately be zero, so key off BRBCR_EL1 alone */
+ if (!ctxt_sys_reg(host_ctxt, BRBCR_EL1))
+ return;
+
+ /* Restore BRBE controls */
+ write_sysreg_s(ctxt_sys_reg(host_ctxt, BRBCR_EL1), SYS_BRBCR_EL1);
+ write_sysreg_s(ctxt_sys_reg(host_ctxt, BRBFCR_EL1), SYS_BRBFCR_EL1);
+ isb();
+}
+
void __debug_save_host_buffers_nvhe(struct kvm_cpu_context *host_ctxt,
struct kvm_cpu_context *guest_ctxt)
{
@@ -102,6 +134,10 @@ void __debug_save_host_buffers_nvhe(struct kvm_cpu_context *host_ctxt,
if (vcpu_get_flag(host_ctxt->__hyp_running_vcpu, DEBUG_STATE_SAVE_TRFCR))
__debug_save_trace(host_ctxt, guest_ctxt);
+
+ /* Disable BRBE branch records */
+ if (vcpu_get_flag(host_ctxt->__hyp_running_vcpu, DEBUG_STATE_SAVE_BRBE))
+ __debug_save_brbe(host_ctxt);
}
void __debug_switch_to_guest(struct kvm_vcpu *vcpu)
@@ -116,6 +152,8 @@ void __debug_restore_host_buffers_nvhe(struct kvm_cpu_context *host_ctxt,
__debug_restore_spe(host_ctxt);
if (vcpu_get_flag(host_ctxt->__hyp_running_vcpu, DEBUG_STATE_SAVE_TRFCR))
__debug_restore_trace(host_ctxt, guest_ctxt);
+ if (vcpu_get_flag(host_ctxt->__hyp_running_vcpu, DEBUG_STATE_SAVE_BRBE))
+ __debug_restore_brbe(host_ctxt);
}
void __debug_switch_to_host(struct kvm_vcpu *vcpu)
--
2.25.1
* [V14 6/8] perf: test: Speed up running brstack test on an Arm model
2023-11-14 5:13 [V14 0/8] arm64/perf: Enable branch stack sampling Anshuman Khandual
` (4 preceding siblings ...)
2023-11-14 5:13 ` [V14 5/8] KVM: arm64: nvhe: Disable branch generation in nVHE guests Anshuman Khandual
@ 2023-11-14 5:13 ` Anshuman Khandual
2023-11-14 5:13 ` [V14 7/8] perf: test: Remove empty lines from branch filter test output Anshuman Khandual
` (2 subsequent siblings)
8 siblings, 0 replies; 30+ messages in thread
From: Anshuman Khandual @ 2023-11-14 5:13 UTC (permalink / raw)
To: linux-arm-kernel, linux-kernel, will, catalin.marinas,
mark.rutland
Cc: Anshuman Khandual, Mark Brown, James Clark, Rob Herring,
Marc Zyngier, Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-perf-users
From: James Clark <james.clark@arm.com>
The test runs quite slowly on the model, so replace "xargs -n1" with
"tr ' ' '\n'", which does the same thing but completes in single-digit
minutes instead of double-digit minutes.
Also reduce the number of loops in the test application. Unfortunately
this causes intermittent failures on x86, presumably because the
sampling interval is too big to pick up any loops, so keep it the same
there.
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: James Clark <james.clark@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
Changes in V14:
- This is a new patch in the series
tools/perf/tests/shell/test_brstack.sh | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/tools/perf/tests/shell/test_brstack.sh b/tools/perf/tests/shell/test_brstack.sh
index 09908d71c994..283c9a902bbf 100755
--- a/tools/perf/tests/shell/test_brstack.sh
+++ b/tools/perf/tests/shell/test_brstack.sh
@@ -12,7 +12,6 @@ if ! perf record -o- --no-buildid --branch-filter any,save_type,u -- true > /dev
fi
TMPDIR=$(mktemp -d /tmp/__perf_test.program.XXXXX)
-TESTPROG="perf test -w brstack"
cleanup() {
rm -rf $TMPDIR
@@ -20,11 +19,21 @@ cleanup() {
trap cleanup EXIT TERM INT
+is_arm64() {
+ uname -m | grep -q aarch64
+}
+
+if is_arm64; then
+ TESTPROG="perf test -w brstack 5000"
+else
+ TESTPROG="perf test -w brstack"
+fi
+
test_user_branches() {
echo "Testing user branch stack sampling"
perf record -o $TMPDIR/perf.data --branch-filter any,save_type,u -- ${TESTPROG} > /dev/null 2>&1
- perf script -i $TMPDIR/perf.data --fields brstacksym | xargs -n1 > $TMPDIR/perf.script
+ perf script -i $TMPDIR/perf.data --fields brstacksym | tr ' ' '\n' > $TMPDIR/perf.script
# example of branch entries:
# brstack_foo+0x14/brstack_bar+0x40/P/-/-/0/CALL
@@ -53,7 +62,7 @@ test_filter() {
echo "Testing branch stack filtering permutation ($test_filter_filter,$test_filter_expect)"
perf record -o $TMPDIR/perf.data --branch-filter $test_filter_filter,save_type,u -- ${TESTPROG} > /dev/null 2>&1
- perf script -i $TMPDIR/perf.data --fields brstack | xargs -n1 > $TMPDIR/perf.script
+ perf script -i $TMPDIR/perf.data --fields brstack | tr ' ' '\n' > $TMPDIR/perf.script
# fail if we find any branch type that doesn't match any of the expected ones
# also consider UNKNOWN branch types (-)
--
2.25.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [V14 7/8] perf: test: Remove empty lines from branch filter test output
2023-11-14 5:13 [V14 0/8] arm64/perf: Enable branch stack sampling Anshuman Khandual
` (5 preceding siblings ...)
2023-11-14 5:13 ` [V14 6/8] perf: test: Speed up running brstack test on an Arm model Anshuman Khandual
@ 2023-11-14 5:13 ` Anshuman Khandual
2023-11-14 5:13 ` [V14 8/8] perf: test: Extend branch stack sampling test for Arm64 BRBE Anshuman Khandual
2023-11-14 17:17 ` [V14 0/8] arm64/perf: Enable branch stack sampling James Clark
8 siblings, 0 replies; 30+ messages in thread
From: Anshuman Khandual @ 2023-11-14 5:13 UTC (permalink / raw)
To: linux-arm-kernel, linux-kernel, will, catalin.marinas,
mark.rutland
Cc: Anshuman Khandual, Mark Brown, James Clark, Rob Herring,
Marc Zyngier, Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-perf-users
From: James Clark <james.clark@arm.com>
In the perf script output, spaces are turned into newlines. But when
there is a double space, this results in empty lines which fail the
subsequent inverse grep test, so strip the empty lines.
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: James Clark <james.clark@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
Changes in V14:
- This is a new patch in the series
tools/perf/tests/shell/test_brstack.sh | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/perf/tests/shell/test_brstack.sh b/tools/perf/tests/shell/test_brstack.sh
index 283c9a902bbf..b1fe29bb71b3 100755
--- a/tools/perf/tests/shell/test_brstack.sh
+++ b/tools/perf/tests/shell/test_brstack.sh
@@ -62,7 +62,7 @@ test_filter() {
echo "Testing branch stack filtering permutation ($test_filter_filter,$test_filter_expect)"
perf record -o $TMPDIR/perf.data --branch-filter $test_filter_filter,save_type,u -- ${TESTPROG} > /dev/null 2>&1
- perf script -i $TMPDIR/perf.data --fields brstack | tr ' ' '\n' > $TMPDIR/perf.script
+ perf script -i $TMPDIR/perf.data --fields brstack | tr ' ' '\n' | sed '/^[[:space:]]*$/d' > $TMPDIR/perf.script
# fail if we find any branch type that doesn't match any of the expected ones
# also consider UNKNOWN branch types (-)
--
2.25.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [V14 8/8] perf: test: Extend branch stack sampling test for Arm64 BRBE
2023-11-14 5:13 [V14 0/8] arm64/perf: Enable branch stack sampling Anshuman Khandual
` (6 preceding siblings ...)
2023-11-14 5:13 ` [V14 7/8] perf: test: Remove empty lines from branch filter test output Anshuman Khandual
@ 2023-11-14 5:13 ` Anshuman Khandual
2023-11-14 17:17 ` [V14 0/8] arm64/perf: Enable branch stack sampling James Clark
8 siblings, 0 replies; 30+ messages in thread
From: Anshuman Khandual @ 2023-11-14 5:13 UTC (permalink / raw)
To: linux-arm-kernel, linux-kernel, will, catalin.marinas,
mark.rutland
Cc: Anshuman Khandual, Mark Brown, James Clark, Rob Herring,
Marc Zyngier, Suzuki Poulose, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, linux-perf-users, German Gomez
From: James Clark <james.clark@arm.com>
Add Arm64 BRBE-specific testing to the existing branch stack sampling test.
The test currently passes on the Arm FVP RevC model, but no hardware has
been tested yet.
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: linux-perf-users@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Co-developed-by: German Gomez <german.gomez@arm.com>
Signed-off-by: German Gomez <german.gomez@arm.com>
Signed-off-by: James Clark <james.clark@arm.com>
Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
Changes in V14:
- This is a new patch in the series
tools/perf/tests/builtin-test.c | 1 +
tools/perf/tests/shell/test_brstack.sh | 42 ++++++++++++++++++++++++--
tools/perf/tests/tests.h | 1 +
tools/perf/tests/workloads/Build | 2 ++
tools/perf/tests/workloads/traploop.c | 39 ++++++++++++++++++++++++
5 files changed, 82 insertions(+), 3 deletions(-)
create mode 100644 tools/perf/tests/workloads/traploop.c
diff --git a/tools/perf/tests/builtin-test.c b/tools/perf/tests/builtin-test.c
index cb6f1dd00dc4..7d9e0a311ef9 100644
--- a/tools/perf/tests/builtin-test.c
+++ b/tools/perf/tests/builtin-test.c
@@ -139,6 +139,7 @@ static struct test_workload *workloads[] = {
&workload__sqrtloop,
&workload__brstack,
&workload__datasym,
+ &workload__traploop
};
static int num_subtests(const struct test_suite *t)
diff --git a/tools/perf/tests/shell/test_brstack.sh b/tools/perf/tests/shell/test_brstack.sh
index b1fe29bb71b3..b0c96bfae304 100755
--- a/tools/perf/tests/shell/test_brstack.sh
+++ b/tools/perf/tests/shell/test_brstack.sh
@@ -47,12 +47,43 @@ test_user_branches() {
grep -E -m1 "^brstack_foo\+[^ ]*/brstack_bench\+[^ ]*/RET/.*$" $TMPDIR/perf.script
grep -E -m1 "^brstack_bench\+[^ ]*/brstack_bench\+[^ ]*/COND/.*$" $TMPDIR/perf.script
grep -E -m1 "^brstack\+[^ ]*/brstack\+[^ ]*/UNCOND/.*$" $TMPDIR/perf.script
+
+ if is_arm64; then
+ # in arm64 with BRBE, we get IRQ entries that correspond
+ # to any point in the process
+ grep -m1 "/IRQ/" $TMPDIR/perf.script
+ fi
set +x
# some branch types are still not being tested:
# IND COND_CALL COND_RET SYSCALL SYSRET IRQ SERROR NO_TX
}
+test_arm64_trap_eret_branches() {
+ echo "Testing trap & eret branches (arm64 brbe)"
+ perf record -o $TMPDIR/perf.data --branch-filter any,save_type,u -- \
+ perf test -w traploop 250
+ perf script -i $TMPDIR/perf.data --fields brstacksym | tr ' ' '\n' > $TMPDIR/perf.script
+ set -x
+ # BRBINF<n>.TYPE == TRAP are mapped to PERF_BR_SYSCALL by the BRBE driver
+ grep -E -m1 "^trap_bench\+[^ ]*/\[unknown\][^ ]*/SYSCALL/" $TMPDIR/perf.script
+ grep -E -m1 "^\[unknown\][^ ]*/trap_bench\+[^ ]*/ERET/" $TMPDIR/perf.script
+ set +x
+}
+
+test_arm64_kernel_branches() {
+ echo "Testing kernel branches (arm64 brbe)"
+ # skip if perf doesn't have enough privileges
+ if ! perf record --branch-filter any,k -o- -- true > /dev/null; then
+ echo "[skipped: not enough privileges]"
+ return 0
+ fi
+ perf record -o $TMPDIR/perf.data --branch-filter any,k -- uname -a
+ perf script -i $TMPDIR/perf.data --fields brstack | tr ' ' '\n' > $TMPDIR/perf.script
+ grep -E -m1 "0xffff[0-9a-f]{12}" $TMPDIR/perf.script
+ ! grep -E -m1 "0x0000[0-9a-f]{12}" $TMPDIR/perf.script
+}
+
# first argument <arg0> is the argument passed to "--branch-stack <arg0>,save_type,u"
# second argument are the expected branch types for the given filter
test_filter() {
@@ -75,11 +106,16 @@ set -e
test_user_branches
-test_filter "any_call" "CALL|IND_CALL|COND_CALL|SYSCALL|IRQ"
+if is_arm64; then
+ test_arm64_trap_eret_branches
+ test_arm64_kernel_branches
+fi
+
+test_filter "any_call" "CALL|IND_CALL|COND_CALL|SYSCALL|IRQ|FAULT_DATA|FAULT_INST"
test_filter "call" "CALL|SYSCALL"
test_filter "cond" "COND"
test_filter "any_ret" "RET|COND_RET|SYSRET|ERET"
test_filter "call,cond" "CALL|SYSCALL|COND"
-test_filter "any_call,cond" "CALL|IND_CALL|COND_CALL|IRQ|SYSCALL|COND"
-test_filter "cond,any_call,any_ret" "COND|CALL|IND_CALL|COND_CALL|SYSCALL|IRQ|RET|COND_RET|SYSRET|ERET"
+test_filter "any_call,cond" "CALL|IND_CALL|COND_CALL|IRQ|SYSCALL|COND|FAULT_DATA|FAULT_INST"
+test_filter "cond,any_call,any_ret" "COND|CALL|IND_CALL|COND_CALL|SYSCALL|IRQ|RET|COND_RET|SYSRET|ERET|FAULT_DATA|FAULT_INST"
diff --git a/tools/perf/tests/tests.h b/tools/perf/tests/tests.h
index b394f3ac2d66..c65455da4eaf 100644
--- a/tools/perf/tests/tests.h
+++ b/tools/perf/tests/tests.h
@@ -205,6 +205,7 @@ DECLARE_WORKLOAD(leafloop);
DECLARE_WORKLOAD(sqrtloop);
DECLARE_WORKLOAD(brstack);
DECLARE_WORKLOAD(datasym);
+DECLARE_WORKLOAD(traploop);
extern const char *dso_to_test;
diff --git a/tools/perf/tests/workloads/Build b/tools/perf/tests/workloads/Build
index a1f34d5861e3..a9dc93d8468b 100644
--- a/tools/perf/tests/workloads/Build
+++ b/tools/perf/tests/workloads/Build
@@ -6,8 +6,10 @@ perf-y += leafloop.o
perf-y += sqrtloop.o
perf-y += brstack.o
perf-y += datasym.o
+perf-y += traploop.o
CFLAGS_sqrtloop.o = -g -O0 -fno-inline -U_FORTIFY_SOURCE
CFLAGS_leafloop.o = -g -O0 -fno-inline -fno-omit-frame-pointer -U_FORTIFY_SOURCE
CFLAGS_brstack.o = -g -O0 -fno-inline -U_FORTIFY_SOURCE
CFLAGS_datasym.o = -g -O0 -fno-inline -U_FORTIFY_SOURCE
+CFLAGS_traploop.o = -g -O0 -fno-inline -U_FORTIFY_SOURCE
diff --git a/tools/perf/tests/workloads/traploop.c b/tools/perf/tests/workloads/traploop.c
new file mode 100644
index 000000000000..7dac94897e49
--- /dev/null
+++ b/tools/perf/tests/workloads/traploop.c
@@ -0,0 +1,39 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <stdlib.h>
+#include "../tests.h"
+
+#define BENCH_RUNS 999999
+
+static volatile int cnt;
+
+#ifdef __aarch64__
+static void trap_bench(void)
+{
+ unsigned long val;
+
+ asm("mrs %0, ID_AA64ISAR0_EL1" : "=r" (val)); /* TRAP + ERET */
+}
+#else
+static void trap_bench(void)
+{
+
+}
+#endif
+
+static int traploop(int argc, const char **argv)
+{
+ int num_loops = BENCH_RUNS;
+
+ if (argc > 0)
+ num_loops = atoi(argv[0]);
+
+ while (1) {
+ if ((cnt++) > num_loops)
+ break;
+
+ trap_bench();
+ }
+ return 0;
+}
+
+DEFINE_WORKLOAD(traploop);
--
2.25.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [V14 5/8] KVM: arm64: nvhe: Disable branch generation in nVHE guests
2023-11-14 5:13 ` [V14 5/8] KVM: arm64: nvhe: Disable branch generation in nVHE guests Anshuman Khandual
@ 2023-11-14 9:16 ` James Clark
2023-11-21 11:12 ` Anshuman Khandual
0 siblings, 1 reply; 30+ messages in thread
From: James Clark @ 2023-11-14 9:16 UTC (permalink / raw)
To: Anshuman Khandual
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, Oliver Upton, James Morse, kvmarm,
linux-arm-kernel, linux-kernel, will, catalin.marinas,
mark.rutland
On 14/11/2023 05:13, Anshuman Khandual wrote:
> Disable the BRBE before we enter the guest, saving its status, and re-enable
> it once we get out of the guest. This is just to avoid capturing records in
> the guest kernel/userspace, which would confuse the samples.
>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Oliver Upton <oliver.upton@linux.dev>
> Cc: James Morse <james.morse@arm.com>
> Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Will Deacon <will@kernel.org>
> Cc: kvmarm@lists.linux.dev
> Cc: linux-arm-kernel@lists.infradead.org
> CC: linux-kernel@vger.kernel.org
> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
> Changes in V14:
>
> - This is a new patch in the series
>
> arch/arm64/include/asm/kvm_host.h | 4 ++++
> arch/arm64/kvm/debug.c | 6 +++++
> arch/arm64/kvm/hyp/nvhe/debug-sr.c | 38 ++++++++++++++++++++++++++++++
> 3 files changed, 48 insertions(+)
>
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 68421c74283a..1faa0430d8dd 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -449,6 +449,8 @@ enum vcpu_sysreg {
> CNTHV_CVAL_EL2,
> PMSCR_EL1, /* Statistical profiling extension */
> TRFCR_EL1, /* Self-hosted trace filters */
> + BRBCR_EL1, /* Branch Record Buffer Control Register */
> + BRBFCR_EL1, /* Branch Record Buffer Function Control Register */
>
> NR_SYS_REGS /* Nothing after this line! */
> };
> @@ -753,6 +755,8 @@ struct kvm_vcpu_arch {
> #define VCPU_HYP_CONTEXT __vcpu_single_flag(iflags, BIT(7))
> /* Save trace filter controls */
> #define DEBUG_STATE_SAVE_TRFCR __vcpu_single_flag(iflags, BIT(8))
> +/* Save BRBE context if active */
> +#define DEBUG_STATE_SAVE_BRBE __vcpu_single_flag(iflags, BIT(9))
>
> /* SVE enabled for host EL0 */
> #define HOST_SVE_ENABLED __vcpu_single_flag(sflags, BIT(0))
> diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
> index 2ab41b954512..4055783c3d34 100644
> --- a/arch/arm64/kvm/debug.c
> +++ b/arch/arm64/kvm/debug.c
> @@ -354,6 +354,11 @@ void kvm_arch_vcpu_load_debug_state_flags(struct kvm_vcpu *vcpu)
> !(read_sysreg_s(SYS_TRBIDR_EL1) & TRBIDR_EL1_P))
> vcpu_set_flag(vcpu, DEBUG_STATE_SAVE_TRBE);
> }
> +
> + /* Check if we have BRBE implemented and available at the host */
> + if (cpuid_feature_extract_unsigned_field(dfr0, ID_AA64DFR0_EL1_BRBE_SHIFT) &&
> + (read_sysreg_s(SYS_BRBCR_EL1) & (BRBCR_ELx_E0BRE | BRBCR_ELx_ExBRE)))
> + vcpu_set_flag(vcpu, DEBUG_STATE_SAVE_BRBE);
Isn't this supposed to just be the feature check? Whether BRBE is
enabled or not is checked later in __debug_save_brbe() anyway.
It seems like it's possible for BRBE to become enabled after this flag
load part.
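i.e. just (untested):

    /* Check if we have BRBE implemented at the host */
    if (cpuid_feature_extract_unsigned_field(dfr0, ID_AA64DFR0_EL1_BRBE_SHIFT))
            vcpu_set_flag(vcpu, DEBUG_STATE_SAVE_BRBE);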
> }
>
> void kvm_arch_vcpu_put_debug_state_flags(struct kvm_vcpu *vcpu)
> @@ -361,6 +366,7 @@ void kvm_arch_vcpu_put_debug_state_flags(struct kvm_vcpu *vcpu)
> vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_SPE);
> vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_TRBE);
> vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_TRFCR);
> + vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_BRBE);
> }
>
> void kvm_etm_set_guest_trfcr(u64 trfcr_guest)
> diff --git a/arch/arm64/kvm/hyp/nvhe/debug-sr.c b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
> index 6174f710948e..e44a1f71a0f8 100644
> --- a/arch/arm64/kvm/hyp/nvhe/debug-sr.c
> +++ b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
> @@ -93,6 +93,38 @@ static void __debug_restore_trace(struct kvm_cpu_context *host_ctxt,
> write_sysreg_s(ctxt_sys_reg(host_ctxt, TRFCR_EL1), SYS_TRFCR_EL1);
> }
>
> +static void __debug_save_brbe(struct kvm_cpu_context *host_ctxt)
> +{
> + ctxt_sys_reg(host_ctxt, BRBCR_EL1) = 0;
> + ctxt_sys_reg(host_ctxt, BRBFCR_EL1) = 0;
> +
> + /* Check if the BRBE is enabled */
> + if (!(ctxt_sys_reg(host_ctxt, BRBCR_EL1) & (BRBCR_ELx_E0BRE | BRBCR_ELx_ExBRE)))
> + return;
Doesn't this always fail? The host BRBCR_EL1 value was just cleared
above.
Also, you need to read the register to determine if it was enabled or
not, so you might as well always store the real value, rather than 0 in
the not enabled case.
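Something like this instead maybe (untested sketch, same identifiers as
above):

    static void __debug_save_brbe(struct kvm_cpu_context *host_ctxt)
    {
            u64 brbcr = read_sysreg_s(SYS_BRBCR_EL1);

            /* Always stash the live values, then test the enable bits */
            ctxt_sys_reg(host_ctxt, BRBCR_EL1) = brbcr;
            ctxt_sys_reg(host_ctxt, BRBFCR_EL1) = read_sysreg_s(SYS_BRBFCR_EL1);

            /* Nothing to do if branch record generation is off */
            if (!(brbcr & (BRBCR_ELx_E0BRE | BRBCR_ELx_ExBRE)))
                    return;

            /* Prohibit branch record generation while we are in the guest */
            write_sysreg_s(0, SYS_BRBCR_EL1);
            write_sysreg_s(0, SYS_BRBFCR_EL1);
            isb();
    }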
> +
> + /*
> + * Prohibit branch record generation while we are in guest.
> + * Since access to BRBCR_EL1 and BRBFCR_EL1 is trapped, the
> + * guest can't modify the filtering set by the host.
> + */
> + ctxt_sys_reg(host_ctxt, BRBCR_EL1) = read_sysreg_s(SYS_BRBCR_EL1);
> + ctxt_sys_reg(host_ctxt, BRBFCR_EL1) = read_sysreg_s(SYS_BRBFCR_EL1);
> + write_sysreg_s(0, SYS_BRBCR_EL1);
> + write_sysreg_s(0, SYS_BRBFCR_EL1);
Why does SYS_BRBFCR_EL1 need to be saved and restored? Only
BRBCR_ELx_E0BRE and BRBCR_ELx_ExBRE need to be cleared to disable BRBE.
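i.e. the disable could be as small as (untested):

    u64 brbcr = read_sysreg_s(SYS_BRBCR_EL1);

    ctxt_sys_reg(host_ctxt, BRBCR_EL1) = brbcr;
    write_sysreg_s(brbcr & ~(BRBCR_ELx_E0BRE | BRBCR_ELx_ExBRE), SYS_BRBCR_EL1);
    isb();

with the restore path just writing the saved BRBCR_EL1 value back.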
> + isb();
> +}
> +
> +static void __debug_restore_brbe(struct kvm_cpu_context *host_ctxt)
> +{
> + if (!ctxt_sys_reg(host_ctxt, BRBCR_EL1) || !ctxt_sys_reg(host_ctxt, BRBFCR_EL1))
> + return;
> +
> + /* Restore BRBE controls */
> + write_sysreg_s(ctxt_sys_reg(host_ctxt, BRBCR_EL1), SYS_BRBCR_EL1);
> + write_sysreg_s(ctxt_sys_reg(host_ctxt, BRBFCR_EL1), SYS_BRBFCR_EL1);
> + isb();
> +}
> +
> void __debug_save_host_buffers_nvhe(struct kvm_cpu_context *host_ctxt,
> struct kvm_cpu_context *guest_ctxt)
> {
> @@ -102,6 +134,10 @@ void __debug_save_host_buffers_nvhe(struct kvm_cpu_context *host_ctxt,
>
> if (vcpu_get_flag(host_ctxt->__hyp_running_vcpu, DEBUG_STATE_SAVE_TRFCR))
> __debug_save_trace(host_ctxt, guest_ctxt);
> +
> + /* Disable BRBE branch records */
> + if (vcpu_get_flag(host_ctxt->__hyp_running_vcpu, DEBUG_STATE_SAVE_BRBE))
> + __debug_save_brbe(host_ctxt);
> }
>
> void __debug_switch_to_guest(struct kvm_vcpu *vcpu)
> @@ -116,6 +152,8 @@ void __debug_restore_host_buffers_nvhe(struct kvm_cpu_context *host_ctxt,
> __debug_restore_spe(host_ctxt);
> if (vcpu_get_flag(host_ctxt->__hyp_running_vcpu, DEBUG_STATE_SAVE_TRFCR))
> __debug_restore_trace(host_ctxt, guest_ctxt);
> + if (vcpu_get_flag(host_ctxt->__hyp_running_vcpu, DEBUG_STATE_SAVE_BRBE))
> + __debug_restore_brbe(host_ctxt);
> }
>
> void __debug_switch_to_host(struct kvm_vcpu *vcpu)
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [V14 3/8] drivers: perf: arm_pmuv3: Enable branch stack sampling framework
2023-11-14 5:13 ` [V14 3/8] drivers: perf: arm_pmuv3: Enable branch stack sampling framework Anshuman Khandual
@ 2023-11-14 9:58 ` James Clark
2023-11-15 5:44 ` Anshuman Khandual
2023-11-14 12:14 ` James Clark
2023-11-14 17:10 ` James Clark
2 siblings, 1 reply; 30+ messages in thread
From: James Clark @ 2023-11-14 9:58 UTC (permalink / raw)
To: Anshuman Khandual
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, linux-arm-kernel, linux-kernel, will,
catalin.marinas, mark.rutland
On 14/11/2023 05:13, Anshuman Khandual wrote:
> Branch stack sampling support i.e capturing branch records during execution
> in core perf, rides along with normal HW events being scheduled on the PMU.
> This prepares ARMV8 PMU framework for branch stack support on relevant PMUs
> with required HW implementation.
>
[...]
> - All armv8pmu_branch_xxxx() stub definitions have been moved inside
> include/linux/perf/arm_pmuv3.h for easy access from both arm32 and
> arm64 platforms
>
This causes lots of W=1 build errors because the prototypes are in
arm_pmuv3.h, but arm_brbe.c doesn't include it.
It seems like the main reason you can't include arm_brbe.h in arm32 code
is because there are a load of inline functions and references to
registers in there. But these are only used in arm_brbe.c, so they don't
need to be in the header anyway.
If you removed the code from the header and moved it to the source file
you could move the brbe prototypes to the brbe header and it would be a
bit cleaner and more idiomatic.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [V14 4/8] drivers: perf: arm_pmuv3: Enable branch stack sampling via FEAT_BRBE
2023-11-14 5:13 ` [V14 4/8] drivers: perf: arm_pmuv3: Enable branch stack sampling via FEAT_BRBE Anshuman Khandual
@ 2023-11-14 12:11 ` James Clark
2023-11-21 10:47 ` Anshuman Khandual
0 siblings, 1 reply; 30+ messages in thread
From: James Clark @ 2023-11-14 12:11 UTC (permalink / raw)
To: Anshuman Khandual
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, linux-arm-kernel, linux-kernel, will,
catalin.marinas, mark.rutland
On 14/11/2023 05:13, Anshuman Khandual wrote:
[...]
> +/*
> + * BRBE supports the following functional branch type filters while
> + * generating branch records. These branch filters can be enabled,
> + * either individually or as a group i.e ORing multiple filters
> + * with each other.
> + *
> + * BRBFCR_EL1_CONDDIR - Conditional direct branch
> + * BRBFCR_EL1_DIRCALL - Direct call
> + * BRBFCR_EL1_INDCALL - Indirect call
> + * BRBFCR_EL1_INDIRECT - Indirect branch
> + * BRBFCR_EL1_DIRECT - Direct branch
> + * BRBFCR_EL1_RTN - Subroutine return
> + */
> +static u64 branch_type_to_brbfcr(int branch_type)
> +{
> + u64 brbfcr = 0;
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_ANY) {
> + brbfcr |= BRBFCR_EL1_BRANCH_FILTERS;
> + return brbfcr;
> + }
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
> + brbfcr |= BRBFCR_EL1_INDCALL;
> + brbfcr |= BRBFCR_EL1_DIRCALL;
> + }
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
> + brbfcr |= BRBFCR_EL1_RTN;
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_IND_CALL)
> + brbfcr |= BRBFCR_EL1_INDCALL;
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_COND)
> + brbfcr |= BRBFCR_EL1_CONDDIR;
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_IND_JUMP)
> + brbfcr |= BRBFCR_EL1_INDIRECT;
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_CALL)
> + brbfcr |= BRBFCR_EL1_DIRCALL;
> +
> + return brbfcr;
> +}
> +
> +/*
> + * BRBE supports the following privilege mode filters while generating
> + * branch records.
> + *
> + * BRBCR_ELx_E0BRE - EL0 branch records
> + * BRBCR_ELx_ExBRE - EL1/EL2 branch records
> + *
> + * BRBE also supports the following additional functional branch type
> + * filters while generating branch records.
> + *
> + * BRBCR_ELx_EXCEPTION - Exception
> + * BRBCR_ELx_ERTN - Exception return
> + */
> +static u64 branch_type_to_brbcr(int branch_type)
> +{
> + u64 brbcr = BRBCR_ELx_DEFAULT_TS;
> +
> + /*
> + * BRBE should be paused on PMU interrupt while tracing kernel
> + * space to stop capturing further branch records. Otherwise
> + * interrupt handler branch records might get into the samples
> + * which is not desired.
> + *
> + * BRBE need not be paused on PMU interrupt while tracing only
> + * the user space, because it will automatically be inside the
> + * prohibited region. But even after PMU overflow occurs, the
> + * interrupt could still take much more cycles, before it can
> + * be taken and by that time BRBE will have been overwritten.
> + * Hence enable pause on PMU interrupt mechanism even for user
> + * only traces as well.
> + */
> + brbcr |= BRBCR_ELx_FZP;
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_USER)
> + brbcr |= BRBCR_ELx_E0BRE;
> +
> + /*
> + * When running in the hyp mode, writing into BRBCR_EL1
> + * actually writes into BRBCR_EL2 instead. Field E2BRE
> + * is also at the same position as E1BRE.
> + */
> + if (branch_type & PERF_SAMPLE_BRANCH_KERNEL)
> + brbcr |= BRBCR_ELx_ExBRE;
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_HV) {
> + if (is_kernel_in_hyp_mode())
> + brbcr |= BRBCR_ELx_ExBRE;
> + }
> +
> + if (!(branch_type & PERF_SAMPLE_BRANCH_NO_CYCLES))
> + brbcr |= BRBCR_ELx_CC;
> +
> + if (!(branch_type & PERF_SAMPLE_BRANCH_NO_FLAGS))
> + brbcr |= BRBCR_ELx_MPRED;
> +
> + /*
> + * The exception and exception return branches could be
> + * captured, irrespective of the perf event's privilege.
> + * If the perf event does not have enough privilege for
> + * a given exception level, then addresses which falls
> + * under that exception level will be reported as zero
> + * for the captured branch record, creating source only
> + * or target only records.
> + */
> + if (branch_type & PERF_SAMPLE_BRANCH_ANY) {
> + brbcr |= BRBCR_ELx_EXCEPTION;
> + brbcr |= BRBCR_ELx_ERTN;
> + }
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_ANY_CALL)
> + brbcr |= BRBCR_ELx_EXCEPTION;
> +
> + if (branch_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
> + brbcr |= BRBCR_ELx_ERTN;
> +
> + return brbcr & BRBCR_ELx_CONFIG_MASK;
> +}
> +
> +void armv8pmu_branch_enable(struct arm_pmu *arm_pmu)
> +{
> + struct pmu_hw_events *cpuc = this_cpu_ptr(arm_pmu->hw_events);
> + u64 brbfcr, brbcr;
> +
> + if (!(cpuc->brbe_sample_type && cpuc->brbe_users))
> + return;
> +
> + /*
> + * BRBE gets configured with a new mismatched branch sample
> + * type request, overriding any previous branch filters.
> + */
> + brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
> + brbfcr &= ~BRBFCR_EL1_DEFAULT_CONFIG;
This is called default_config, but is being used semantically the same
way as BRBCR_ELx_CONFIG_MASK below to clear out the fields. Doesn't that
mean that it's a mask rather than a default config? It's only ever used
in this way. default_config implies it's written or used as an
initialiser at some point.
> + brbfcr |= branch_type_to_brbfcr(cpuc->brbe_sample_type);
> + write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
> + isb();
> +
> + brbcr = read_sysreg_s(SYS_BRBCR_EL1);
> + brbcr &= ~BRBCR_ELx_CONFIG_MASK;
> + brbcr |= branch_type_to_brbcr(cpuc->brbe_sample_type);
BRBCR_ELx_CONFIG_MASK is already &'d at the end of
branch_type_to_brbcr(), so isn't it easier and equivalent to just do the
following instead of the read(), &= and then |= ?
write_sysreg_s(branch_type_to_brbcr(...), SYS_BRBCR_EL1);
Or at least make branch_type_to_brbfcr() consistent and &
BRBFCR_EL1_DEFAULT_CONFIG at the end of that function too.
> + write_sysreg_s(brbcr, SYS_BRBCR_EL1);
> + isb();
> +}
> +
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [V14 3/8] drivers: perf: arm_pmuv3: Enable branch stack sampling framework
2023-11-14 5:13 ` [V14 3/8] drivers: perf: arm_pmuv3: Enable branch stack sampling framework Anshuman Khandual
2023-11-14 9:58 ` James Clark
@ 2023-11-14 12:14 ` James Clark
2023-11-15 7:22 ` Anshuman Khandual
2023-11-14 17:10 ` James Clark
2 siblings, 1 reply; 30+ messages in thread
From: James Clark @ 2023-11-14 12:14 UTC (permalink / raw)
To: Anshuman Khandual
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, linux-arm-kernel, linux-kernel, will,
catalin.marinas, mark.rutland
On 14/11/2023 05:13, Anshuman Khandual wrote:
[...]
> +/*
> + * This is a read only constant and safe during multi threaded access
> + */
> +static struct perf_branch_stack zero_branch_stack = { .nr = 0, .hw_idx = -1ULL};
> +
> +static void read_branch_records(struct pmu_hw_events *cpuc,
> + struct perf_event *event,
> + struct perf_sample_data *data,
> + bool *branch_captured)
> +{
> + /*
> + * CPU specific branch records buffer must have been allocated already
> + * for the hardware records to be captured and processed further.
> + */
> + if (WARN_ON(!cpuc->branches))
> + return;
> +
> + /*
> + * Overflowed event's branch_sample_type does not match the configured
> + * branch filters in the BRBE HW. So the captured branch records here
> + * cannot be co-related to the overflowed event. Report to the user as
> + * if no branch records have been captured, and flush branch records.
> + * The same scenario is applicable when the current task context does
> + * not match with overflown event.
> + */
> + if ((cpuc->brbe_sample_type != event->attr.branch_sample_type) ||
> + (event->ctx->task && cpuc->brbe_context != event->ctx)) {
> + perf_sample_save_brstack(data, event, &zero_branch_stack);
Is there any benefit to outputting a zero size stack vs not outputting
anything at all?
> + return;
> + }
> +
> + /*
> + * Read the branch records from the hardware once after the PMU IRQ
> + * has been triggered but subsequently same records can be used for
> + * other events that might have been overflowed simultaneously thus
> + * saving much CPU cycles.
> + */
> + if (!*branch_captured) {
> + armv8pmu_branch_read(cpuc, event);
> + *branch_captured = true;
> + }
> + perf_sample_save_brstack(data, event, &cpuc->branches->branch_stack);
> +}
> +
> static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
> {
> u32 pmovsr;
> @@ -766,6 +815,7 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
> struct pmu_hw_events *cpuc = this_cpu_ptr(cpu_pmu->hw_events);
> struct pt_regs *regs;
> int idx;
> + bool branch_captured = false;
>
> /*
> * Get and reset the IRQ flags
> @@ -809,6 +859,13 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
> if (!armpmu_event_set_period(event))
> continue;
>
> + /*
> + * PMU IRQ should remain asserted until all branch records
> + * are captured and processed into struct perf_sample_data.
> + */
> + if (has_branch_stack(event) && cpu_pmu->has_branch_stack)
> + read_branch_records(cpuc, event, &data, &branch_captured);
You could return instead of using the out param, not really any
different, but maybe a bit more normal:
branch_captured |= read_branch_records(cpuc, event, &data,
branch_captured);
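i.e. returning true once the hardware records have actually been read,
and passing the previous state back unchanged on the early-out paths
(untested sketch of the same function):

    static bool read_branch_records(struct pmu_hw_events *cpuc,
                                    struct perf_event *event,
                                    struct perf_sample_data *data,
                                    bool branch_captured)
    {
            if (WARN_ON(!cpuc->branches))
                    return branch_captured;

            if ((cpuc->brbe_sample_type != event->attr.branch_sample_type) ||
                (event->ctx->task && cpuc->brbe_context != event->ctx)) {
                    perf_sample_save_brstack(data, event, &zero_branch_stack);
                    return branch_captured;
            }

            /* Read the HW records only once per IRQ */
            if (!branch_captured)
                    armv8pmu_branch_read(cpuc, event);
            perf_sample_save_brstack(data, event, &cpuc->branches->branch_stack);
            return true;
    }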
> +
> /*
> * Perf event overflow will queue the processing of the event as
> * an irq_work which will be taken care of in the handling of
> @@ -818,6 +875,8 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
> cpu_pmu->disable(event);
> }
> armv8pmu_start(cpu_pmu);
> + if (cpu_pmu->has_branch_stack)
> + armv8pmu_branch_reset();
>
> return IRQ_HANDLED;
> }
> @@ -907,6 +966,24 @@ static int armv8pmu_user_event_idx(struct perf_event *event)
> return event->hw.idx;
> }
>
> +static void armv8pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
> +{
> + struct arm_pmu *armpmu = to_arm_pmu(pmu_ctx->pmu);
> + void *task_ctx = pmu_ctx->task_ctx_data;
> +
> + if (armpmu->has_branch_stack) {
> + /* Save branch records in task_ctx on sched out */
> + if (task_ctx && !sched_in) {
> + armv8pmu_branch_save(armpmu, task_ctx);
> + return;
> + }
> +
> + /* Reset branch records on sched in */
> + if (sched_in)
> + armv8pmu_branch_reset();
> + }
> +}
> +
> /*
> * Add an event filter to a given event.
> */
> @@ -977,6 +1054,9 @@ static void armv8pmu_reset(void *info)
> pmcr |= ARMV8_PMU_PMCR_LP;
>
> armv8pmu_pmcr_write(pmcr);
> +
> + if (cpu_pmu->has_branch_stack)
> + armv8pmu_branch_reset();
> }
>
> static int __armv8_pmuv3_map_event_id(struct arm_pmu *armpmu,
> @@ -1014,6 +1094,20 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,
>
> hw_event_id = __armv8_pmuv3_map_event_id(armpmu, event);
>
> + if (has_branch_stack(event)) {
> + if (!armv8pmu_branch_attr_valid(event))
> + return -EOPNOTSUPP;
> +
> + /*
> + * If a task gets scheduled out, the current branch records
> + * get saved in the task's context data, which can be later
> + * used to fill in the records upon an event overflow. Let's
> + * enable PERF_ATTACH_TASK_DATA in 'event->attach_state' for
> + * all branch stack sampling perf events.
> + */
> + event->attach_state |= PERF_ATTACH_TASK_DATA;
> + }
> +
> /*
> * CHAIN events only work when paired with an adjacent counter, and it
> * never makes sense for a user to open one in isolation, as they'll be
> @@ -1130,6 +1224,35 @@ static void __armv8pmu_probe_pmu(void *info)
> cpu_pmu->reg_pmmir = read_pmmir();
> else
> cpu_pmu->reg_pmmir = 0;
> + armv8pmu_branch_probe(cpu_pmu);
I'm not sure if this is splitting hairs or not, but
__armv8pmu_probe_pmu() is run on only one of 'any' of the supported CPUs
for this PMU.
Is it not possible to have some of those CPUs support and some not
support BRBE, even though they are all the same PMU type? Maybe we could
wait for it to explode with some weird system, or change it so that the
BRBE probe is run on every CPU, with a second 'supported_brbe_mask' field.
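Something like this maybe (untested, 'supported_brbe_mask' would be a
new field on struct arm_pmu):

    static void __armv8pmu_probe_brbe(void *info)
    {
            struct arm_pmu *cpu_pmu = info;
            u64 dfr0 = read_sysreg(id_aa64dfr0_el1);

            if (cpuid_feature_extract_unsigned_field(dfr0, ID_AA64DFR0_EL1_BRBE_SHIFT))
                    cpumask_set_cpu(smp_processor_id(), &cpu_pmu->supported_brbe_mask);
    }

    /* run the probe on every CPU this PMU supports, not just one */
    on_each_cpu_mask(&cpu_pmu->supported_cpus, __armv8pmu_probe_brbe, cpu_pmu, 1);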
> +}
> +
> +static int branch_records_alloc(struct arm_pmu *armpmu)
> +{
> + struct branch_records __percpu *records;
> + int cpu;
> +
> + records = alloc_percpu_gfp(struct branch_records, GFP_KERNEL);
> + if (!records)
> + return -ENOMEM;
> +
Doesn't this technically need to take the CPU mask where BRBE is
supported into account? Otherwise you are allocating for cores that
never use it.
Also it's done per-CPU _and_ per-PMU type, multiplying the number of
BRBE buffers allocated, even if they can only ever be used per-CPU.
> + /*
> + * percpu memory allocated for 'records' gets completely consumed
> + * here, and never required to be freed up later. So permanently
> + * losing access to this anchor i.e 'records' is acceptable.
> + *
> + * Otherwise this allocation handle would have to be saved up for
> + * free_percpu() release later if required.
> + */
> + for_each_possible_cpu(cpu) {
> + struct pmu_hw_events *events_cpu;
> + struct branch_records *records_cpu;
> +
> + events_cpu = per_cpu_ptr(armpmu->hw_events, cpu);
> + records_cpu = per_cpu_ptr(records, cpu);
> + events_cpu->branches = records_cpu;
> + }
> + return 0;
> }
>
> static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
> @@ -1146,7 +1269,21 @@ static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
> if (ret)
> return ret;
>
> - return probe.present ? 0 : -ENODEV;
> + if (!probe.present)
> + return -ENODEV;
> +
> + if (cpu_pmu->has_branch_stack) {
> + ret = armv8pmu_task_ctx_cache_alloc(cpu_pmu);
> + if (ret)
> + return ret;
> +
> + ret = branch_records_alloc(cpu_pmu);
> + if (ret) {
> + armv8pmu_task_ctx_cache_free(cpu_pmu);
> + return ret;
> + }
> + }
> + return 0;
> }
>
[...]
> diff --git a/include/linux/perf/arm_pmuv3.h b/include/linux/perf/arm_pmuv3.h
> index 9c226adf938a..72da4522397c 100644
> --- a/include/linux/perf/arm_pmuv3.h
> +++ b/include/linux/perf/arm_pmuv3.h
> @@ -303,4 +303,50 @@
> } \
> } while (0)
>
> +struct pmu_hw_events;
> +struct arm_pmu;
> +struct perf_event;
> +
> +#ifdef CONFIG_PERF_EVENTS
Very minor nit, but if you end up moving the stubs to the brbe header
you probably don't need the #ifdef CONFIG_PERF_EVENTS because it just
won't be included in that case.
> +static inline void armv8pmu_branch_reset(void)
> +{
> +}
> +
> +static inline void armv8pmu_branch_probe(struct arm_pmu *arm_pmu)
> +{
> +}
> +
> +static inline bool armv8pmu_branch_attr_valid(struct perf_event *event)
> +{
> + WARN_ON_ONCE(!has_branch_stack(event));
> + return false;
> +}
> +
> +static inline void armv8pmu_branch_enable(struct arm_pmu *arm_pmu)
> +{
> +}
> +
> +static inline void armv8pmu_branch_disable(void)
> +{
> +}
> +
> +static inline void armv8pmu_branch_read(struct pmu_hw_events *cpuc,
> + struct perf_event *event)
> +{
> + WARN_ON_ONCE(!has_branch_stack(event));
> +}
> +
> +static inline void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx)
> +{
> +}
> +
> +static inline int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu)
> +{
> + return 0;
> +}
> +
> +static inline void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu)
> +{
> +}
> +#endif /* CONFIG_PERF_EVENTS */
> #endif
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [V14 3/8] drivers: perf: arm_pmuv3: Enable branch stack sampling framework
2023-11-14 5:13 ` [V14 3/8] drivers: perf: arm_pmuv3: Enable branch stack sampling framework Anshuman Khandual
2023-11-14 9:58 ` James Clark
2023-11-14 12:14 ` James Clark
@ 2023-11-14 17:10 ` James Clark
2023-11-30 3:58 ` Anshuman Khandual
2 siblings, 1 reply; 30+ messages in thread
From: James Clark @ 2023-11-14 17:10 UTC (permalink / raw)
To: Anshuman Khandual
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, linux-arm-kernel, linux-kernel, will,
catalin.marinas, mark.rutland
On 14/11/2023 05:13, Anshuman Khandual wrote:
[...]
>
> diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
> index d712a19e47ac..76f1376ae594 100644
> --- a/drivers/perf/arm_pmu.c
> +++ b/drivers/perf/arm_pmu.c
> @@ -317,6 +317,15 @@ armpmu_del(struct perf_event *event, int flags)
> struct hw_perf_event *hwc = &event->hw;
> int idx = hwc->idx;
>
> + if (has_branch_stack(event)) {
> + WARN_ON_ONCE(!hw_events->brbe_users);
> + hw_events->brbe_users--;
> + if (!hw_events->brbe_users) {
> + hw_events->brbe_context = NULL;
> + hw_events->brbe_sample_type = 0;
> + }
> + }
> +
> armpmu_stop(event, PERF_EF_UPDATE);
> hw_events->events[idx] = NULL;
> armpmu->clear_event_idx(hw_events, event);
> @@ -333,6 +342,22 @@ armpmu_add(struct perf_event *event, int flags)
> struct hw_perf_event *hwc = &event->hw;
> int idx;
>
> + if (has_branch_stack(event)) {
> + /*
> + * Reset branch records buffer if a new task event gets
> + * scheduled on a PMU which might have existing records.
> + * Otherwise older branch records present in the buffer
> + * might leak into the new task event.
> + */
> + if (event->ctx->task && hw_events->brbe_context != event->ctx) {
> + hw_events->brbe_context = event->ctx;
> + if (armpmu->branch_reset)
> + armpmu->branch_reset();
What about a per-thread event following a per-cpu event? Doesn't that
also need to branch_reset()? If hw_events->brbe_context was already
previously assigned, once the per-thread event is switched in it skips
this reset following a per-cpu event on the same core.
I think it should be possible to add a test for this scenario by
creating simultaneous per-cpu and per-thread events and checking for leakage.
> + }
> + hw_events->brbe_users++;
> + hw_events->brbe_sample_type = event->attr.branch_sample_type;
> + }
> +
> /* An event following a process won't be stopped earlier */
> if (!cpumask_test_cpu(smp_processor_id(), &armpmu->supported_cpus))
> return -ENOENT;
> @@ -512,13 +537,24 @@ static int armpmu_event_init(struct perf_event *event)
> !cpumask_test_cpu(event->cpu, &armpmu->supported_cpus))
> return -ENOENT;
>
> - /* does not support taken branch sampling */
> - if (has_branch_stack(event))
> + /*
> + * Branch stack sampling events are allowed
> + * only on PMU which has required support.
> + */
> + if (has_branch_stack(event) && !armpmu->has_branch_stack)
> return -EOPNOTSUPP;
>
> return __hw_perf_event_init(event);
> }
>
[...]
> +/*
> + * This is a read only constant and safe during multi threaded access
> + */
> +static struct perf_branch_stack zero_branch_stack = { .nr = 0, .hw_idx = -1ULL};
> +
> +static void read_branch_records(struct pmu_hw_events *cpuc,
> + struct perf_event *event,
> + struct perf_sample_data *data,
> + bool *branch_captured)
> +{
> + /*
> + * CPU specific branch records buffer must have been allocated already
> + * for the hardware records to be captured and processed further.
> + */
> + if (WARN_ON(!cpuc->branches))
> + return;
> +
> + /*
> + * Overflowed event's branch_sample_type does not match the configured
> + * branch filters in the BRBE HW. So the captured branch records here
> + * cannot be co-related to the overflowed event. Report to the user as
> + * if no branch records have been captured, and flush branch records.
> + * The same scenario is applicable when the current task context does
> + * not match with overflown event.
> + */
> + if ((cpuc->brbe_sample_type != event->attr.branch_sample_type) ||
> + (event->ctx->task && cpuc->brbe_context != event->ctx)) {
> + perf_sample_save_brstack(data, event, &zero_branch_stack);
> + return;
> + }
I think we should probably add a test for this scenario too, e.g. that
a second event opened on the same thread as another event with
different brbe settings always produces zero records.
I actually tried to reproduce this behaviour but couldn't. Not sure if I
did something wrong though.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [V14 0/8] arm64/perf: Enable branch stack sampling
2023-11-14 5:13 [V14 0/8] arm64/perf: Enable branch stack sampling Anshuman Khandual
` (7 preceding siblings ...)
2023-11-14 5:13 ` [V14 8/8] perf: test: Extend branch stack sampling test for Arm64 BRBE Anshuman Khandual
@ 2023-11-14 17:17 ` James Clark
2023-11-22 5:15 ` Anshuman Khandual
8 siblings, 1 reply; 30+ messages in thread
From: James Clark @ 2023-11-14 17:17 UTC (permalink / raw)
To: Anshuman Khandual
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, linux-arm-kernel, linux-kernel, will,
catalin.marinas, mark.rutland
On 14/11/2023 05:13, Anshuman Khandual wrote:
> This series enables perf branch stack sampling support on arm64 platform
> via a new arch feature called Branch Record Buffer Extension (BRBE). All
> the relevant register definitions could be accessed here.
>
[...]
>
> --------------------------- Virtualisation support ------------------------
>
> - Branch stack sampling is not currently supported inside the guest (TODO)
>
> - FEAT_BRBE advertised as absent via clearing ID_AA64DFR0_EL1.BRBE
> - Future support in guest requires emulating FEAT_BRBE
If you never add support for the host looking into a guest, and you save
and restore all the BRBINF[n] registers, I think you might be able to
just let the guest do whatever it wants with BRBE and not trap and
emulate it? Maybe there is some edge case why that wouldn't work, but
it's worth thinking about.
For BRBE specifically I don't see much of a use case for hosts looking
into a guest, at least not like with PMU counters.
>
> - Branch stack sampling the guest is not supported in the host (TODO)
>
> - Tracing the guest with event->attr.exclude_guest = 0
> - There are multiple challenges involved regarding mixing events
> with mismatched branch_sample_type and exclude_guest and passing
> on captured BRBE records to intended events during PMU interrupt
>
> - Guest access for BRBE registers and instructions has been blocked
>
> - BRBE state save is not required for VHE host (EL2) guest (EL1) transition
>
> - BRBE state is saved for NVHE host (EL1) guest (EL1) transition
>
> -------------------------------- Testing ---------------------------------
>
> - Cross compiled for both arm64 and arm32 platforms
> - Passes all branch tests with 'perf test branch' on arm64
>
> -------------------------------- Questions -------------------------------
>
> - Instead of configuring the BRBE HW with branch_sample_type from the last
> event to be added on the PMU as proposed, could those be merged together
> e.g all privilege requests ORed, to form a common BRBE configuration and
> all events get branch records after a PMU interrupt ?
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [V14 3/8] drivers: perf: arm_pmuv3: Enable branch stack sampling framework
2023-11-14 9:58 ` James Clark
@ 2023-11-15 5:44 ` Anshuman Khandual
2023-11-15 9:37 ` James Clark
0 siblings, 1 reply; 30+ messages in thread
From: Anshuman Khandual @ 2023-11-15 5:44 UTC (permalink / raw)
To: James Clark
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, linux-arm-kernel, linux-kernel, will,
catalin.marinas, mark.rutland
On 11/14/23 15:28, James Clark wrote:
>
>
> On 14/11/2023 05:13, Anshuman Khandual wrote:
>> Branch stack sampling support i.e capturing branch records during execution
>> in core perf, rides along with normal HW events being scheduled on the PMU.
>> This prepares ARMV8 PMU framework for branch stack support on relevant PMUs
>> with required HW implementation.
>>
>
> [...]
>
>> - All armv8pmu_branch_xxxx() stub definitions have been moved inside
>> include/linux/perf/arm_pmuv3.h for easy access from both arm32 and
>> arm64 platforms
>>
>
> This causes lots of W=1 build errors because the prototypes are in
> arm_pmuv3.h, but arm_brbe.c doesn't include it.
I guess these are the W=1 warnings you mentioned above.
drivers/perf/arm_brbe.c:11:6: warning: no previous prototype for ‘armv8pmu_branch_reset’ [-Wmissing-prototypes]
11 | void armv8pmu_branch_reset(void)
| ^~~~~~~~~~~~~~~~~~~~~
drivers/perf/arm_brbe.c:190:6: warning: no previous prototype for ‘armv8pmu_branch_save’ [-Wmissing-prototypes]
190 | void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx)
| ^~~~~~~~~~~~~~~~~~~~
drivers/perf/arm_brbe.c:236:6: warning: no previous prototype for ‘armv8pmu_branch_attr_valid’ [-Wmissing-prototypes]
236 | bool armv8pmu_branch_attr_valid(struct perf_event *event)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/perf/arm_brbe.c:269:5: warning: no previous prototype for ‘armv8pmu_task_ctx_cache_alloc’ [-Wmissing-prototypes]
269 | int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/perf/arm_brbe.c:279:6: warning: no previous prototype for ‘armv8pmu_task_ctx_cache_free’ [-Wmissing-prototypes]
279 | void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu)
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/perf/arm_brbe.c:303:6: warning: no previous prototype for ‘armv8pmu_branch_probe’ [-Wmissing-prototypes]
303 | void armv8pmu_branch_probe(struct arm_pmu *armpmu)
| ^~~~~~~~~~~~~~~~~~~~~
drivers/perf/arm_brbe.c:449:6: warning: no previous prototype for ‘armv8pmu_branch_enable’ [-Wmissing-prototypes]
449 | void armv8pmu_branch_enable(struct arm_pmu *arm_pmu)
| ^~~~~~~~~~~~~~~~~~~~~~
drivers/perf/arm_brbe.c:474:6: warning: no previous prototype for ‘armv8pmu_branch_disable’ [-Wmissing-prototypes]
474 | void armv8pmu_branch_disable(void)
| ^~~~~~~~~~~~~~~~~~~~~~~
drivers/perf/arm_brbe.c:717:6: warning: no previous prototype for ‘armv8pmu_branch_read’ [-Wmissing-prototypes]
717 | void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
Branch helpers are used in the ARM PMU V3 driver i.e drivers/perf/arm_pmuv3.c.
Hence the actual BRBE helper definitions, or their fallback stubs (when
CONFIG_ARM64_BRBE is not enabled), need to be accessible from the arm_pmuv3.c
driver, not from the arm_brbe.c implementation itself.
>
> It seems like the main reason you can't include arm_brbe.h in arm32 code
> is because there are a load of inline functions and references to
> registers in there. But these are only used in arm_brbe.c, so they don't
Right, arm32 should not be exposed to BRBE internals via the arm_brbe.h header.
> need to be in the header anyway.
Right, these are only used in arm_brbe.c
>
> If you removed the code from the header and moved it to the source file
> you could move the brbe prototypes to the brbe header and it would be a
> bit cleaner and more idiomatic.
Alright, how about the following changes (resulting header sketched below) - build tested on arm32 and arm64.
- Move BRBE helpers from arm_brbe.h into arm_brbe.c
- Move armv8_pmu_xxx() declaration inside arm_brbe.h for arm64 (CONFIG_ARM64_BRBE)
- Move armv8_pmu_xxx() stub definitions inside arm_pmuv3.c for arm32 (!CONFIG_ARM64_BRBE)
- Include arm_brbe.h header both in arm_pmuv3.c and arm_brbe.c
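i.e arm_brbe.h would roughly reduce to just the prototypes (sketch):

    /* arm_brbe.h */
    struct pmu_hw_events;
    struct arm_pmu;
    struct perf_event;

    void armv8pmu_branch_reset(void);
    void armv8pmu_branch_probe(struct arm_pmu *arm_pmu);
    bool armv8pmu_branch_attr_valid(struct perf_event *event);
    void armv8pmu_branch_enable(struct arm_pmu *arm_pmu);
    void armv8pmu_branch_disable(void);
    void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event);
    void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx);
    int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu);
    void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu);

with the !CONFIG_ARM64_BRBE static inline stubs staying in arm_pmuv3.c.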
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [V14 3/8] drivers: perf: arm_pmuv3: Enable branch stack sampling framework
2023-11-14 12:14 ` James Clark
@ 2023-11-15 7:22 ` Anshuman Khandual
2023-11-15 10:07 ` James Clark
0 siblings, 1 reply; 30+ messages in thread
From: Anshuman Khandual @ 2023-11-15 7:22 UTC (permalink / raw)
To: James Clark
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, linux-arm-kernel, linux-kernel, will,
catalin.marinas, mark.rutland
On 11/14/23 17:44, James Clark wrote:
>
>
> On 14/11/2023 05:13, Anshuman Khandual wrote:
> [...]
>
>> +/*
>> + * This is a read only constant and safe during multi threaded access
>> + */
>> +static struct perf_branch_stack zero_branch_stack = { .nr = 0, .hw_idx = -1ULL};
>> +
>> +static void read_branch_records(struct pmu_hw_events *cpuc,
>> + struct perf_event *event,
>> + struct perf_sample_data *data,
>> + bool *branch_captured)
>> +{
>> + /*
>> + * CPU specific branch records buffer must have been allocated already
>> + * for the hardware records to be captured and processed further.
>> + */
>> + if (WARN_ON(!cpuc->branches))
>> + return;
>> +
>> + /*
>> + * Overflowed event's branch_sample_type does not match the configured
>> + * branch filters in the BRBE HW. So the captured branch records here
>> + * cannot be co-related to the overflowed event. Report to the user as
>> + * if no branch records have been captured, and flush branch records.
>> + * The same scenario is applicable when the current task context does
>> + * not match with overflown event.
>> + */
>> + if ((cpuc->brbe_sample_type != event->attr.branch_sample_type) ||
>> + (event->ctx->task && cpuc->brbe_context != event->ctx)) {
>> + perf_sample_save_brstack(data, event, &zero_branch_stack);
>
> Is there any benefit to outputting a zero size stack vs not outputting
> anything at all?
The event has got PERF_SAMPLE_BRANCH_STACK marked and hence perf_sample_data
must have PERF_SAMPLE_BRANCH_STACK with it's br_stack pointing to the branch
records. Hence without assigning a zeroed struct perf_branch_stack, there is
a chance, that perf_sample_data will pass on some garbage branch records to
the ring buffer.
>
>> + return;
>> + }
>> +
>> + /*
>> + * Read the branch records from the hardware once after the PMU IRQ
>> + * has been triggered but subsequently same records can be used for
>> + * other events that might have been overflowed simultaneously thus
>> + * saving much CPU cycles.
>> + */
>> + if (!*branch_captured) {
>> + armv8pmu_branch_read(cpuc, event);
>> + *branch_captured = true;
>> + }
>> + perf_sample_save_brstack(data, event, &cpuc->branches->branch_stack);
>> +}
>> +
>> static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
>> {
>> u32 pmovsr;
>> @@ -766,6 +815,7 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
>> struct pmu_hw_events *cpuc = this_cpu_ptr(cpu_pmu->hw_events);
>> struct pt_regs *regs;
>> int idx;
>> + bool branch_captured = false;
>>
>> /*
>> * Get and reset the IRQ flags
>> @@ -809,6 +859,13 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
>> if (!armpmu_event_set_period(event))
>> continue;
>>
>> + /*
>> + * PMU IRQ should remain asserted until all branch records
>> + * are captured and processed into struct perf_sample_data.
>> + */
>> + if (has_branch_stack(event) && cpu_pmu->has_branch_stack)
>> + read_branch_records(cpuc, event, &data, &branch_captured);
>
> You could return instead of using the out param, not really any
> different, but maybe a bit more normal:
>
> branch_captured |= read_branch_records(cpuc, event, &data,
> branch_captured);
I am just wondering - how would that be any better?
>
>> +
>> /*
>> * Perf event overflow will queue the processing of the event as
>> * an irq_work which will be taken care of in the handling of
>> @@ -818,6 +875,8 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
>> cpu_pmu->disable(event);
>> }
>> armv8pmu_start(cpu_pmu);
>> + if (cpu_pmu->has_branch_stack)
>> + armv8pmu_branch_reset();
>>
>> return IRQ_HANDLED;
>> }
>> @@ -907,6 +966,24 @@ static int armv8pmu_user_event_idx(struct perf_event *event)
>> return event->hw.idx;
>> }
>>
>> +static void armv8pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
>> +{
>> + struct arm_pmu *armpmu = to_arm_pmu(pmu_ctx->pmu);
>> + void *task_ctx = pmu_ctx->task_ctx_data;
>> +
>> + if (armpmu->has_branch_stack) {
>> + /* Save branch records in task_ctx on sched out */
>> + if (task_ctx && !sched_in) {
>> + armv8pmu_branch_save(armpmu, task_ctx);
>> + return;
>> + }
>> +
>> + /* Reset branch records on sched in */
>> + if (sched_in)
>> + armv8pmu_branch_reset();
>> + }
>> +}
>> +
>> /*
>> * Add an event filter to a given event.
>> */
>> @@ -977,6 +1054,9 @@ static void armv8pmu_reset(void *info)
>> pmcr |= ARMV8_PMU_PMCR_LP;
>>
>> armv8pmu_pmcr_write(pmcr);
>> +
>> + if (cpu_pmu->has_branch_stack)
>> + armv8pmu_branch_reset();
>> }
>>
>> static int __armv8_pmuv3_map_event_id(struct arm_pmu *armpmu,
>> @@ -1014,6 +1094,20 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,
>>
>> hw_event_id = __armv8_pmuv3_map_event_id(armpmu, event);
>>
>> + if (has_branch_stack(event)) {
>> + if (!armv8pmu_branch_attr_valid(event))
>> + return -EOPNOTSUPP;
>> +
>> + /*
>> + * If a task gets scheduled out, the current branch records
>> + * get saved in the task's context data, which can be later
>> + * used to fill in the records upon an event overflow. Let's
>> + * enable PERF_ATTACH_TASK_DATA in 'event->attach_state' for
>> + * all branch stack sampling perf events.
>> + */
>> + event->attach_state |= PERF_ATTACH_TASK_DATA;
>> + }
>> +
>> /*
>> * CHAIN events only work when paired with an adjacent counter, and it
>> * never makes sense for a user to open one in isolation, as they'll be
>> @@ -1130,6 +1224,35 @@ static void __armv8pmu_probe_pmu(void *info)
>> cpu_pmu->reg_pmmir = read_pmmir();
>> else
>> cpu_pmu->reg_pmmir = 0;
>> + armv8pmu_branch_probe(cpu_pmu);
>
> I'm not sure if this is splitting hairs or not, but
> __armv8pmu_probe_pmu() is run on only one of 'any' of the supported CPUs
> for this PMU.
Right.
>
> Is it not possible to have some of those CPUs support and some not
> support BRBE, even though they are all the same PMU type? Maybe we could
I am not sure, but it is not something I have come across.
> wait for it to explode with some weird system, or change it so that the
> BRBE probe is run on every CPU, with a second 'supported_brbe_mask' field.
Right, but for now the current solution looks sufficient.
>
>> +}
>> +
>> +static int branch_records_alloc(struct arm_pmu *armpmu)
>> +{
>> + struct branch_records __percpu *records;
>> + int cpu;
>> +
>> + records = alloc_percpu_gfp(struct branch_records, GFP_KERNEL);
>> + if (!records)
>> + return -ENOMEM;
>> +
>
> Doesn't this technically need to take the CPU mask where BRBE is
> supported into account? Otherwise you are allocating for cores that
> never use it.
>
> Also it's done per-CPU _and_ per-PMU type, multiplying the number of
> BRBE buffers allocated, even if they can only ever be used per-CPU.
Agreed, but I believe we have already been through this discussion, and
settled on this method as the simpler approach.
>
>> + /*
>> + * percpu memory allocated for 'records' gets completely consumed
>> + * here, and never required to be freed up later. So permanently
>> + * losing access to this anchor i.e 'records' is acceptable.
>> + *
>> + * Otherwise this allocation handle would have to be saved up for
>> + * free_percpu() release later if required.
>> + */
>> + for_each_possible_cpu(cpu) {
>> + struct pmu_hw_events *events_cpu;
>> + struct branch_records *records_cpu;
>> +
>> + events_cpu = per_cpu_ptr(armpmu->hw_events, cpu);
>> + records_cpu = per_cpu_ptr(records, cpu);
>> + events_cpu->branches = records_cpu;
>> + }
>> + return 0;
>> }
>>
>> static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>> @@ -1146,7 +1269,21 @@ static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>> if (ret)
>> return ret;
>>
>> - return probe.present ? 0 : -ENODEV;
>> + if (!probe.present)
>> + return -ENODEV;
>> +
>> + if (cpu_pmu->has_branch_stack) {
>> + ret = armv8pmu_task_ctx_cache_alloc(cpu_pmu);
>> + if (ret)
>> + return ret;
>> +
>> + ret = branch_records_alloc(cpu_pmu);
>> + if (ret) {
>> + armv8pmu_task_ctx_cache_free(cpu_pmu);
>> + return ret;
>> + }
>> + }
>> + return 0;
>> }
>>
>
> [...]
>> diff --git a/include/linux/perf/arm_pmuv3.h b/include/linux/perf/arm_pmuv3.h
>> index 9c226adf938a..72da4522397c 100644
>> --- a/include/linux/perf/arm_pmuv3.h
>> +++ b/include/linux/perf/arm_pmuv3.h
>> @@ -303,4 +303,50 @@
>> } \
>> } while (0)
>>
>> +struct pmu_hw_events;
>> +struct arm_pmu;
>> +struct perf_event;
>> +
>> +#ifdef CONFIG_PERF_EVENTS
>
> Very minor nit, but if you end up moving the stubs to the brbe header
> you probably don't need the #ifdef CONFIG_PERF_EVENTS because it just
> won't be included in that case.
Right, will drop CONFIG_PERF_EVENTS wrapper.
>
>> +static inline void armv8pmu_branch_reset(void)
>> +{
>> +}
>> +
>> +static inline void armv8pmu_branch_probe(struct arm_pmu *arm_pmu)
>> +{
>> +}
>> +
>> +static inline bool armv8pmu_branch_attr_valid(struct perf_event *event)
>> +{
>> + WARN_ON_ONCE(!has_branch_stack(event));
>> + return false;
>> +}
>> +
>> +static inline void armv8pmu_branch_enable(struct arm_pmu *arm_pmu)
>> +{
>> +}
>> +
>> +static inline void armv8pmu_branch_disable(void)
>> +{
>> +}
>> +
>> +static inline void armv8pmu_branch_read(struct pmu_hw_events *cpuc,
>> + struct perf_event *event)
>> +{
>> + WARN_ON_ONCE(!has_branch_stack(event));
>> +}
>> +
>> +static inline void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx)
>> +{
>> +}
>> +
>> +static inline int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu)
>> +{
>> + return 0;
>> +}
>> +
>> +static inline void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu)
>> +{
>> +}
>> +#endif /* CONFIG_PERF_EVENTS */
>> #endif
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [V14 3/8] drivers: perf: arm_pmuv3: Enable branch stack sampling framework
2023-11-15 5:44 ` Anshuman Khandual
@ 2023-11-15 9:37 ` James Clark
2023-11-21 9:13 ` Anshuman Khandual
0 siblings, 1 reply; 30+ messages in thread
From: James Clark @ 2023-11-15 9:37 UTC (permalink / raw)
To: Anshuman Khandual
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, linux-arm-kernel, linux-kernel, will,
catalin.marinas, mark.rutland
On 15/11/2023 05:44, Anshuman Khandual wrote:
> On 11/14/23 15:28, James Clark wrote:
>>
>>
>> On 14/11/2023 05:13, Anshuman Khandual wrote:
>>> Branch stack sampling support i.e capturing branch records during execution
>>> in core perf, rides along with normal HW events being scheduled on the PMU.
>>> This prepares ARMV8 PMU framework for branch stack support on relevant PMUs
>>> with required HW implementation.
>>>
>>
>> [...]
>>
>>> - All armv8pmu_branch_xxxx() stub definitions have been moved inside
>>> include/linux/perf/arm_pmuv3.h for easy access from both arm32 and
>>> arm64 platforms
>>>
>>
>> This causes lots of W=1 build errors because the prototypes are in
>> arm_pmuv3.h, but arm_brbe.c doesn't include it.
>
> I guess these are the W=1 warnings you mentioned above.
>
> drivers/perf/arm_brbe.c:11:6: warning: no previous prototype for ‘armv8pmu_branch_reset’ [-Wmissing-prototypes]
> 11 | void armv8pmu_branch_reset(void)
> | ^~~~~~~~~~~~~~~~~~~~~
> drivers/perf/arm_brbe.c:190:6: warning: no previous prototype for ‘armv8pmu_branch_save’ [-Wmissing-prototypes]
> 190 | void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx)
> | ^~~~~~~~~~~~~~~~~~~~
> drivers/perf/arm_brbe.c:236:6: warning: no previous prototype for ‘armv8pmu_branch_attr_valid’ [-Wmissing-prototypes]
> 236 | bool armv8pmu_branch_attr_valid(struct perf_event *event)
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~
> drivers/perf/arm_brbe.c:269:5: warning: no previous prototype for ‘armv8pmu_task_ctx_cache_alloc’ [-Wmissing-prototypes]
> 269 | int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu)
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> drivers/perf/arm_brbe.c:279:6: warning: no previous prototype for ‘armv8pmu_task_ctx_cache_free’ [-Wmissing-prototypes]
> 279 | void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu)
> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
> drivers/perf/arm_brbe.c:303:6: warning: no previous prototype for ‘armv8pmu_branch_probe’ [-Wmissing-prototypes]
> 303 | void armv8pmu_branch_probe(struct arm_pmu *armpmu)
> | ^~~~~~~~~~~~~~~~~~~~~
> drivers/perf/arm_brbe.c:449:6: warning: no previous prototype for ‘armv8pmu_branch_enable’ [-Wmissing-prototypes]
> 449 | void armv8pmu_branch_enable(struct arm_pmu *arm_pmu)
> | ^~~~~~~~~~~~~~~~~~~~~~
> drivers/perf/arm_brbe.c:474:6: warning: no previous prototype for ‘armv8pmu_branch_disable’ [-Wmissing-prototypes]
> 474 | void armv8pmu_branch_disable(void)
> | ^~~~~~~~~~~~~~~~~~~~~~~
> drivers/perf/arm_brbe.c:717:6: warning: no previous prototype for ‘armv8pmu_branch_read’ [-Wmissing-prototypes]
> 717 | void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
>
> Branch helpers are used in the ARM PMU V3 driver i.e. drivers/perf/arm_pmuv3.c.
> Either the actual BRBE helper definitions, or their fallback stubs (when
> CONFIG_ARM64_BRBE is not enabled), need to be accessible from the arm_pmuv3.c
> driver, not from the arm_brbe.c implementation itself.
>
>>
>> It seems like the main reason you can't include arm_brbe.h in arm32 code
>> is because there are a load of inline functions and references to
>> registers in there. But these are only used in arm_brbe.c, so they don't
>
> Right, arm32 should not be exposed to BRBE internals via arm_brbe.h header.
>
>> need to be in the header anyway.
>
> Right, these are only used in arm_brbe.c
>
>>
>> If you removed the code from the header and moved it to the source file
>> you could move the brbe prototypes to the brbe header and it would be a
>> bit cleaner and more idiomatic.
>
> Alright, how about the following changes - build tested on arm32 and arm64.
>
> - Move BRBE helpers from arm_brbe.h into arm_brbe.c
> - Move armv8_pmu_xxx() declaration inside arm_brbe.h for arm64 (CONFIG_ARM64_BRBE)
> - Move armv8_pmu_xxx() stub definitions inside arm_pmuv3.c for arm32 (!CONFIG_ARM64_BRBE)
> - Include arm_brbe.h header both in arm_pmuv3.c and arm_brbe.c
Agree to them all except:
- Move armv8_pmu_xxx() stub definitions inside arm_pmuv3.c for arm32
(!CONFIG_ARM64_BRBE)
Normally you put the stubs right next to the prototypes with #else, so
in this case both would be in arm_brbe.h. Not sure what the reason for
splitting them here is? You already said "include arm_brbe.h in
arm_pmuv3.c", so that covers arm32 too.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [V14 3/8] drivers: perf: arm_pmuv3: Enable branch stack sampling framework
2023-11-15 7:22 ` Anshuman Khandual
@ 2023-11-15 10:07 ` James Clark
2023-11-21 9:57 ` Anshuman Khandual
0 siblings, 1 reply; 30+ messages in thread
From: James Clark @ 2023-11-15 10:07 UTC (permalink / raw)
To: Anshuman Khandual
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, linux-arm-kernel, linux-kernel, will,
catalin.marinas, mark.rutland
On 15/11/2023 07:22, Anshuman Khandual wrote:
> On 11/14/23 17:44, James Clark wrote:
>>
>>
>> On 14/11/2023 05:13, Anshuman Khandual wrote:
>> [...]
>>
>>> +/*
>>> + * This is a read only constant and safe during multi threaded access
>>> + */
>>> +static struct perf_branch_stack zero_branch_stack = { .nr = 0, .hw_idx = -1ULL};
>>> +
>>> +static void read_branch_records(struct pmu_hw_events *cpuc,
>>> + struct perf_event *event,
>>> + struct perf_sample_data *data,
>>> + bool *branch_captured)
>>> +{
>>> + /*
>>> + * CPU specific branch records buffer must have been allocated already
>>> + * for the hardware records to be captured and processed further.
>>> + */
>>> + if (WARN_ON(!cpuc->branches))
>>> + return;
>>> +
>>> + /*
>>> + * Overflowed event's branch_sample_type does not match the configured
>>> + * branch filters in the BRBE HW. So the captured branch records here
>>> + * cannot be co-related to the overflowed event. Report to the user as
>>> + * if no branch records have been captured, and flush branch records.
>>> + * The same scenario is applicable when the current task context does
>>> + * not match with overflown event.
>>> + */
>>> + if ((cpuc->brbe_sample_type != event->attr.branch_sample_type) ||
>>> + (event->ctx->task && cpuc->brbe_context != event->ctx)) {
>>> + perf_sample_save_brstack(data, event, &zero_branch_stack);
>>
>> Is there any benefit to outputting a zero size stack vs not outputting
>> anything at all?
>
> The event has got PERF_SAMPLE_BRANCH_STACK marked and hence perf_sample_data
> must have PERF_SAMPLE_BRANCH_STACK with it's br_stack pointing to the branch
> records. Hence without assigning a zeroed struct perf_branch_stack, there is
> a chance, that perf_sample_data will pass on some garbage branch records to
> the ring buffer.
>
I don't think that's an issue, the perf core code handles the case where
no branch stack exists on a sample. It even outputs the zero length for
you, but there is other stuff that can be skipped if you just never call
perf_sample_save_brstack():
from kernel/events/core.c:
if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
if (data->br_stack) {
size_t size;
size = data->br_stack->nr
* sizeof(struct perf_branch_entry);
perf_output_put(handle, data->br_stack->nr);
if (branch_sample_hw_index(event))
perf_output_put(handle, data->br_stack->hw_idx);
perf_output_copy(handle, data->br_stack->entries, size);
} else {
/*
* we always store at least the value of nr
*/
u64 nr = 0;
perf_output_put(handle, nr);
}
}
>>
>>> + return;
>>> + }
>>> +
>>> + /*
>>> + * Read the branch records from the hardware once after the PMU IRQ
>>> + * has been triggered but subsequently same records can be used for
>>> + * other events that might have been overflowed simultaneously thus
>>> + * saving much CPU cycles.
>>> + */
>>> + if (!*branch_captured) {
>>> + armv8pmu_branch_read(cpuc, event);
>>> + *branch_captured = true;
>>> + }
>>> + perf_sample_save_brstack(data, event, &cpuc->branches->branch_stack);
>>> +}
>>> +
>>> static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
>>> {
>>> u32 pmovsr;
>>> @@ -766,6 +815,7 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
>>> struct pmu_hw_events *cpuc = this_cpu_ptr(cpu_pmu->hw_events);
>>> struct pt_regs *regs;
>>> int idx;
>>> + bool branch_captured = false;
>>>
>>> /*
>>> * Get and reset the IRQ flags
>>> @@ -809,6 +859,13 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
>>> if (!armpmu_event_set_period(event))
>>> continue;
>>>
>>> + /*
>>> + * PMU IRQ should remain asserted until all branch records
>>> + * are captured and processed into struct perf_sample_data.
>>> + */
>>> + if (has_branch_stack(event) && cpu_pmu->has_branch_stack)
>>> + read_branch_records(cpuc, event, &data, &branch_captured);
>>
>> You could return instead of using the out param, not really any
>> different, but maybe a bit more normal:
>>
>> branch_captured |= read_branch_records(cpuc, event, &data,
>> branch_captured);
>
> I am just wondering - how that would be any better ?
>
Maybe it wouldn't, but I suppose it's just the same way you don't write
returns like:
armv8pmu_task_ctx_cache_alloc(cpu_pmu, &ret);
instead of:
ret = armv8pmu_task_ctx_cache_alloc(cpu_pmu);
Out params can be hard to reason about sometimes. Maybe not in this case
though.
>>
>>> +
>>> /*
>>> * Perf event overflow will queue the processing of the event as
>>> * an irq_work which will be taken care of in the handling of
>>> @@ -818,6 +875,8 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
>>> cpu_pmu->disable(event);
>>> }
>>> armv8pmu_start(cpu_pmu);
>>> + if (cpu_pmu->has_branch_stack)
>>> + armv8pmu_branch_reset();
>>>
>>> return IRQ_HANDLED;
>>> }
>>> @@ -907,6 +966,24 @@ static int armv8pmu_user_event_idx(struct perf_event *event)
>>> return event->hw.idx;
>>> }
>>>
>>> +static void armv8pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
>>> +{
>>> + struct arm_pmu *armpmu = to_arm_pmu(pmu_ctx->pmu);
>>> + void *task_ctx = pmu_ctx->task_ctx_data;
>>> +
>>> + if (armpmu->has_branch_stack) {
>>> + /* Save branch records in task_ctx on sched out */
>>> + if (task_ctx && !sched_in) {
>>> + armv8pmu_branch_save(armpmu, task_ctx);
>>> + return;
>>> + }
>>> +
>>> + /* Reset branch records on sched in */
>>> + if (sched_in)
>>> + armv8pmu_branch_reset();
>>> + }
>>> +}
>>> +
>>> /*
>>> * Add an event filter to a given event.
>>> */
>>> @@ -977,6 +1054,9 @@ static void armv8pmu_reset(void *info)
>>> pmcr |= ARMV8_PMU_PMCR_LP;
>>>
>>> armv8pmu_pmcr_write(pmcr);
>>> +
>>> + if (cpu_pmu->has_branch_stack)
>>> + armv8pmu_branch_reset();
>>> }
>>>
>>> static int __armv8_pmuv3_map_event_id(struct arm_pmu *armpmu,
>>> @@ -1014,6 +1094,20 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,
>>>
>>> hw_event_id = __armv8_pmuv3_map_event_id(armpmu, event);
>>>
>>> + if (has_branch_stack(event)) {
>>> + if (!armv8pmu_branch_attr_valid(event))
>>> + return -EOPNOTSUPP;
>>> +
>>> + /*
>>> + * If a task gets scheduled out, the current branch records
>>> + * get saved in the task's context data, which can be later
>>> + * used to fill in the records upon an event overflow. Let's
>>> + * enable PERF_ATTACH_TASK_DATA in 'event->attach_state' for
>>> + * all branch stack sampling perf events.
>>> + */
>>> + event->attach_state |= PERF_ATTACH_TASK_DATA;
>>> + }
>>> +
>>> /*
>>> * CHAIN events only work when paired with an adjacent counter, and it
>>> * never makes sense for a user to open one in isolation, as they'll be
>>> @@ -1130,6 +1224,35 @@ static void __armv8pmu_probe_pmu(void *info)
>>> cpu_pmu->reg_pmmir = read_pmmir();
>>> else
>>> cpu_pmu->reg_pmmir = 0;
>>> + armv8pmu_branch_probe(cpu_pmu);
>>
>> I'm not sure if this is splitting hairs or not, but
>> __armv8pmu_probe_pmu() is run on only one of 'any' of the supported CPUs
>> for this PMU.
>
> Right.
>
>>
>> Is it not possible to have some of those CPUs support and some not
>> support BRBE, even though they are all the same PMU type? Maybe we could
>
> I am not sure, but not something I have come across.
>
>> wait for it to explode with some weird system, or change it so that the
>> BRBE probe is run on every CPU, with a second 'supported_brbe_mask' field.
>
> Right, but for now, the current solution looks sufficient.
>
I suppose it means people will have to split their PMUs into ones that do
and don't support BRBE. I'm not sure if that's worth adding a comment in
the docs, or if it's too obscure.
>>
>>> +}
>>> +
>>> +static int branch_records_alloc(struct arm_pmu *armpmu)
>>> +{
>>> + struct branch_records __percpu *records;
>>> + int cpu;
>>> +
>>> + records = alloc_percpu_gfp(struct branch_records, GFP_KERNEL);
>>> + if (!records)
>>> + return -ENOMEM;
>>> +
>>
>> Doesn't this technically need to take the CPU mask where BRBE is
>> supported into account? Otherwise you are allocating for cores that
>> never use it.
>>
>> Also it's done per-CPU _and_ per-PMU type, multiplying the number of
>> BRBE buffers allocated, even if they can only ever be used per-CPU.
>
> Agreed, but I believe we have already been through this discussion, and
> settled on this method for being the simpler approach.
>
>>
>>> + /*
>>> + * percpu memory allocated for 'records' gets completely consumed
>>> + * here, and never required to be freed up later. So permanently
>>> + * losing access to this anchor i.e 'records' is acceptable.
>>> + *
>>> + * Otherwise this allocation handle would have to be saved up for
>>> + * free_percpu() release later if required.
>>> + */
>>> + for_each_possible_cpu(cpu) {
>>> + struct pmu_hw_events *events_cpu;
>>> + struct branch_records *records_cpu;
>>> +
>>> + events_cpu = per_cpu_ptr(armpmu->hw_events, cpu);
>>> + records_cpu = per_cpu_ptr(records, cpu);
>>> + events_cpu->branches = records_cpu;
>>> + }
>>> + return 0;
>>> }
>>>
>>> static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>>> @@ -1146,7 +1269,21 @@ static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>>> if (ret)
>>> return ret;
>>>
>>> - return probe.present ? 0 : -ENODEV;
>>> + if (!probe.present)
>>> + return -ENODEV;
>>> +
>>> + if (cpu_pmu->has_branch_stack) {
>>> + ret = armv8pmu_task_ctx_cache_alloc(cpu_pmu);
>>> + if (ret)
>>> + return ret;
>>> +
>>> + ret = branch_records_alloc(cpu_pmu);
>>> + if (ret) {
>>> + armv8pmu_task_ctx_cache_free(cpu_pmu);
>>> + return ret;
>>> + }
>>> + }
>>> + return 0;
>>> }
>>>
>>
>> [...]
>>> diff --git a/include/linux/perf/arm_pmuv3.h b/include/linux/perf/arm_pmuv3.h
>>> index 9c226adf938a..72da4522397c 100644
>>> --- a/include/linux/perf/arm_pmuv3.h
>>> +++ b/include/linux/perf/arm_pmuv3.h
>>> @@ -303,4 +303,50 @@
>>> } \
>>> } while (0)
>>>
>>> +struct pmu_hw_events;
>>> +struct arm_pmu;
>>> +struct perf_event;
>>> +
>>> +#ifdef CONFIG_PERF_EVENTS
>>
>> Very minor nit, but if you end up moving the stubs to the brbe header
>> you probably don't need the #ifdef CONFIG_PERF_EVENTS because it just
>> won't be included in that case.
>
> Right, will drop CONFIG_PERF_EVENTS wrapper.
>
>>
>>> +static inline void armv8pmu_branch_reset(void)
>>> +{
>>> +}
>>> +
>>> +static inline void armv8pmu_branch_probe(struct arm_pmu *arm_pmu)
>>> +{
>>> +}
>>> +
>>> +static inline bool armv8pmu_branch_attr_valid(struct perf_event *event)
>>> +{
>>> + WARN_ON_ONCE(!has_branch_stack(event));
>>> + return false;
>>> +}
>>> +
>>> +static inline void armv8pmu_branch_enable(struct arm_pmu *arm_pmu)
>>> +{
>>> +}
>>> +
>>> +static inline void armv8pmu_branch_disable(void)
>>> +{
>>> +}
>>> +
>>> +static inline void armv8pmu_branch_read(struct pmu_hw_events *cpuc,
>>> + struct perf_event *event)
>>> +{
>>> + WARN_ON_ONCE(!has_branch_stack(event));
>>> +}
>>> +
>>> +static inline void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx)
>>> +{
>>> +}
>>> +
>>> +static inline int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu)
>>> +{
>>> + return 0;
>>> +}
>>> +
>>> +static inline void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu)
>>> +{
>>> +}
>>> +#endif /* CONFIG_PERF_EVENTS */
>>> #endif
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [V14 3/8] drivers: perf: arm_pmuv3: Enable branch stack sampling framework
2023-11-15 9:37 ` James Clark
@ 2023-11-21 9:13 ` Anshuman Khandual
0 siblings, 0 replies; 30+ messages in thread
From: Anshuman Khandual @ 2023-11-21 9:13 UTC (permalink / raw)
To: James Clark
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, linux-arm-kernel, linux-kernel, will,
catalin.marinas, mark.rutland
On 11/15/23 15:07, James Clark wrote:
>
>
> On 15/11/2023 05:44, Anshuman Khandual wrote:
>> On 11/14/23 15:28, James Clark wrote:
>>>
>>>
>>> On 14/11/2023 05:13, Anshuman Khandual wrote:
>>>> Branch stack sampling support i.e capturing branch records during execution
>>>> in core perf, rides along with normal HW events being scheduled on the PMU.
>>>> This prepares ARMV8 PMU framework for branch stack support on relevant PMUs
>>>> with required HW implementation.
>>>>
>>>
>>> [...]
>>>
>>>> - All armv8pmu_branch_xxxx() stub definitions have been moved inside
>>>> include/linux/perf/arm_pmuv3.h for easy access from both arm32 and
>>>> arm64 platforms
>>>>
>>>
>>> This causes lots of W=1 build errors because the prototypes are in
>>> arm_pmuv3.h, but arm_brbe.c doesn't include it.
>>
>> I guess these are the W=1 warnings you mentioned above.
>>
>> drivers/perf/arm_brbe.c:11:6: warning: no previous prototype for ‘armv8pmu_branch_reset’ [-Wmissing-prototypes]
>> 11 | void armv8pmu_branch_reset(void)
>> | ^~~~~~~~~~~~~~~~~~~~~
>> drivers/perf/arm_brbe.c:190:6: warning: no previous prototype for ‘armv8pmu_branch_save’ [-Wmissing-prototypes]
>> 190 | void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx)
>> | ^~~~~~~~~~~~~~~~~~~~
>> drivers/perf/arm_brbe.c:236:6: warning: no previous prototype for ‘armv8pmu_branch_attr_valid’ [-Wmissing-prototypes]
>> 236 | bool armv8pmu_branch_attr_valid(struct perf_event *event)
>> | ^~~~~~~~~~~~~~~~~~~~~~~~~~
>> drivers/perf/arm_brbe.c:269:5: warning: no previous prototype for ‘armv8pmu_task_ctx_cache_alloc’ [-Wmissing-prototypes]
>> 269 | int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu)
>> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> drivers/perf/arm_brbe.c:279:6: warning: no previous prototype for ‘armv8pmu_task_ctx_cache_free’ [-Wmissing-prototypes]
>> 279 | void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu)
>> | ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> drivers/perf/arm_brbe.c:303:6: warning: no previous prototype for ‘armv8pmu_branch_probe’ [-Wmissing-prototypes]
>> 303 | void armv8pmu_branch_probe(struct arm_pmu *armpmu)
>> | ^~~~~~~~~~~~~~~~~~~~~
>> drivers/perf/arm_brbe.c:449:6: warning: no previous prototype for ‘armv8pmu_branch_enable’ [-Wmissing-prototypes]
>> 449 | void armv8pmu_branch_enable(struct arm_pmu *arm_pmu)
>> | ^~~~~~~~~~~~~~~~~~~~~~
>> drivers/perf/arm_brbe.c:474:6: warning: no previous prototype for ‘armv8pmu_branch_disable’ [-Wmissing-prototypes]
>> 474 | void armv8pmu_branch_disable(void)
>> | ^~~~~~~~~~~~~~~~~~~~~~~
>> drivers/perf/arm_brbe.c:717:6: warning: no previous prototype for ‘armv8pmu_branch_read’ [-Wmissing-prototypes]
>> 717 | void armv8pmu_branch_read(struct pmu_hw_events *cpuc, struct perf_event *event)
>>
>> Branch helpers are used in the ARM PMU V3 driver i.e. drivers/perf/arm_pmuv3.c.
>> Either the actual BRBE helper definitions, or their fallback stubs (when
>> CONFIG_ARM64_BRBE is not enabled), need to be accessible from the arm_pmuv3.c
>> driver, not from the arm_brbe.c implementation itself.
>>
>>>
>>> It seems like the main reason you can't include arm_brbe.h in arm32 code
>>> is because there are a load of inline functions and references to
>>> registers in there. But these are only used in arm_brbe.c, so they don't
>>
>> Right, arm32 should not be exposed to BRBE internals via arm_brbe.h header.
>>
>>> need to be in the header anyway.
>>
>> Right, these are only used in arm_brbe.c
>>
>>>
>>> If you removed the code from the header and moved it to the source file
>>> you could move the brbe prototypes to the brbe header and it would be a
>>> bit cleaner and more idiomatic.
>>
>> Alright, how about the following changes - build tested on arm32 and arm64.
>>
>> - Move BRBE helpers from arm_brbe.h into arm_brbe.c
>> - Move armv8_pmu_xxx() declaration inside arm_brbe.h for arm64 (CONFIG_ARM64_BRBE)
>> - Move armv8_pmu_xxx() stub definitions inside arm_pmuv3.c for arm32 (!CONFIG_ARM64_BRBE)
>> - Include arm_brbe.h header both in arm_pmuv3.c and arm_brbe.c
>
> Agree to them all except:
>
> - Move armv8_pmu_xxx() stub definitions inside arm_pmuv3.c for arm32
> (!CONFIG_ARM64_BRBE)
>
> Normally you put the stubs right next to the prototypes with #else, so
> in this case both would be in arm_brbe.h. Not sure what the reason for
> splitting them here is? You already said "include arm_brbe.h in
> arm_pmuv3.c", so that covers arm32 too.
No particular strong reason for the split as such; will move these stubs
to the header as well. The BRBE header includes <linux/perf/arm_pmu.h>,
which causes the following redefinition warning for pr_fmt().
drivers/perf/arm_brbe.c:11: warning: "pr_fmt" redefined
11 | #define pr_fmt(fmt) "brbe: " fmt
|
In file included from ./include/linux/kernel.h:31,
from ./include/linux/interrupt.h:6,
from ./include/linux/perf/arm_pmu.h:11,
from drivers/perf/arm_brbe.h:10,
from drivers/perf/arm_brbe.c:9:
./include/linux/printk.h:345: note: this is the location of the previous definition
345 | #define pr_fmt(fmt) fmt
Although it should be okay to just drop this custom pr_fmt() from BRBE.
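Or, if the custom prefix is worth keeping, the usual pattern should avoid
the warning - define pr_fmt() before the first include, so the '#ifndef
pr_fmt' fallback in printk.h never gets defined (sketch; assumes the
include order in arm_brbe.c can be rearranged):

/* drivers/perf/arm_brbe.c */
#define pr_fmt(fmt) "brbe: " fmt

#include "arm_brbe.h"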
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [V14 3/8] drivers: perf: arm_pmuv3: Enable branch stack sampling framework
2023-11-15 10:07 ` James Clark
@ 2023-11-21 9:57 ` Anshuman Khandual
2023-11-23 12:35 ` James Clark
0 siblings, 1 reply; 30+ messages in thread
From: Anshuman Khandual @ 2023-11-21 9:57 UTC (permalink / raw)
To: James Clark
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, linux-arm-kernel, linux-kernel, will,
catalin.marinas, mark.rutland
On 11/15/23 15:37, James Clark wrote:
>
>
> On 15/11/2023 07:22, Anshuman Khandual wrote:
>> On 11/14/23 17:44, James Clark wrote:
>>>
>>>
>>> On 14/11/2023 05:13, Anshuman Khandual wrote:
>>> [...]
>>>
>>>> +/*
>>>> + * This is a read only constant and safe during multi threaded access
>>>> + */
>>>> +static struct perf_branch_stack zero_branch_stack = { .nr = 0, .hw_idx = -1ULL};
>>>> +
>>>> +static void read_branch_records(struct pmu_hw_events *cpuc,
>>>> + struct perf_event *event,
>>>> + struct perf_sample_data *data,
>>>> + bool *branch_captured)
>>>> +{
>>>> + /*
>>>> + * CPU specific branch records buffer must have been allocated already
>>>> + * for the hardware records to be captured and processed further.
>>>> + */
>>>> + if (WARN_ON(!cpuc->branches))
>>>> + return;
>>>> +
>>>> + /*
>>>> + * Overflowed event's branch_sample_type does not match the configured
>>>> + * branch filters in the BRBE HW. So the captured branch records here
>>>> + * cannot be co-related to the overflowed event. Report to the user as
>>>> + * if no branch records have been captured, and flush branch records.
>>>> + * The same scenario is applicable when the current task context does
>>>> + * not match with overflown event.
>>>> + */
>>>> + if ((cpuc->brbe_sample_type != event->attr.branch_sample_type) ||
>>>> + (event->ctx->task && cpuc->brbe_context != event->ctx)) {
>>>> + perf_sample_save_brstack(data, event, &zero_branch_stack);
>>>
>>> Is there any benefit to outputting a zero size stack vs not outputting
>>> anything at all?
>>
>> The event has got PERF_SAMPLE_BRANCH_STACK marked and hence perf_sample_data
>> must have PERF_SAMPLE_BRANCH_STACK with it's br_stack pointing to the branch
>> records. Hence without assigning a zeroed struct perf_branch_stack, there is
>> a chance, that perf_sample_data will pass on some garbage branch records to
>> the ring buffer.
>>
>
> I don't think that's an issue, the perf core code handles the case where
> no branch stack exists on a sample. It even outputs the zero length for
> you, but there is other stuff that can be skipped if you just never call
> perf_sample_save_brstack():
Sending out perf_sample_data without a valid data->br_stack seems problematic,
which would be the case when perf_sample_save_brstack() never gets called on
the perf_sample_data being prepared, depending on the 'else' case below to
push out zero records.
Alternatively - temporarily zeroing out cpuc->branches->branch_stack.nr just
for the immediate perf_sample_save_brstack() call, and then restoring it back
to its original value, might work as well. Remember it still has valid records
for other qualifying events.
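Something like this (sketch only; assumes nothing else touches the percpu
buffer concurrently while the IRQ handler runs):

	u64 nr = cpuc->branches->branch_stack.nr;

	/* Report zero records for the mismatched event */
	cpuc->branches->branch_stack.nr = 0;
	perf_sample_save_brstack(data, event, &cpuc->branches->branch_stack);
	cpuc->branches->branch_stack.nr = nr;
	return;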
>
> from kernel/events/core.c:
>
> if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
> if (data->br_stack) {
> size_t size;
>
> size = data->br_stack->nr
> * sizeof(struct perf_branch_entry);
>
> perf_output_put(handle, data->br_stack->nr);
> if (branch_sample_hw_index(event))
> perf_output_put(handle, data->br_stack->hw_idx);
> perf_output_copy(handle, data->br_stack->entries, size);
> } else {
> /*
> * we always store at least the value of nr
> */
> u64 nr = 0;
> perf_output_put(handle, nr);
> }
> }
>
>
>>>
>>>> + return;
>>>> + }
>>>> +
>>>> + /*
>>>> + * Read the branch records from the hardware once after the PMU IRQ
>>>> + * has been triggered but subsequently same records can be used for
>>>> + * other events that might have been overflowed simultaneously thus
>>>> + * saving much CPU cycles.
>>>> + */
>>>> + if (!*branch_captured) {
>>>> + armv8pmu_branch_read(cpuc, event);
>>>> + *branch_captured = true;
>>>> + }
>>>> + perf_sample_save_brstack(data, event, &cpuc->branches->branch_stack);
>>>> +}
>>>> +
>>>> static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
>>>> {
>>>> u32 pmovsr;
>>>> @@ -766,6 +815,7 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
>>>> struct pmu_hw_events *cpuc = this_cpu_ptr(cpu_pmu->hw_events);
>>>> struct pt_regs *regs;
>>>> int idx;
>>>> + bool branch_captured = false;
>>>>
>>>> /*
>>>> * Get and reset the IRQ flags
>>>> @@ -809,6 +859,13 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
>>>> if (!armpmu_event_set_period(event))
>>>> continue;
>>>>
>>>> + /*
>>>> + * PMU IRQ should remain asserted until all branch records
>>>> + * are captured and processed into struct perf_sample_data.
>>>> + */
>>>> + if (has_branch_stack(event) && cpu_pmu->has_branch_stack)
>>>> + read_branch_records(cpuc, event, &data, &branch_captured);
>>>
>>> You could return instead of using the out param, not really any
>>> different, but maybe a bit more normal:
>>>
>>> branch_captured |= read_branch_records(cpuc, event, &data,
>>> branch_captured);
>>
>> I am just wondering - how that would be any better ?
>>
>
> Maybe it wouldn't, but I suppose it's just the same way you don't write
> returns like:
>
> armv8pmu_task_ctx_cache_alloc(cpu_pmu, &ret);
>
> instead of:
>
> ret = armv8pmu_task_ctx_cache_alloc(cpu_pmu);
>
> Out params can be hard to reason about sometimes. Maybe not in this case
> though.
The out parameter 'branch_captured' is checked inside read_branch_records()
to ascertain whether the BRBE records have already been captured inside the
buffer i.e. cpuc->branches->branch_stack, in which case the read can be
skipped (an optimization) for subsequent events in the session. Keeping this
parameter branch_captured just inside the caller i.e. armv8pmu_handle_irq()
would not achieve that objective.
>>>
>>>> +
>>>> /*
>>>> * Perf event overflow will queue the processing of the event as
>>>> * an irq_work which will be taken care of in the handling of
>>>> @@ -818,6 +875,8 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
>>>> cpu_pmu->disable(event);
>>>> }
>>>> armv8pmu_start(cpu_pmu);
>>>> + if (cpu_pmu->has_branch_stack)
>>>> + armv8pmu_branch_reset();
>>>>
>>>> return IRQ_HANDLED;
>>>> }
>>>> @@ -907,6 +966,24 @@ static int armv8pmu_user_event_idx(struct perf_event *event)
>>>> return event->hw.idx;
>>>> }
>>>>
>>>> +static void armv8pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
>>>> +{
>>>> + struct arm_pmu *armpmu = to_arm_pmu(pmu_ctx->pmu);
>>>> + void *task_ctx = pmu_ctx->task_ctx_data;
>>>> +
>>>> + if (armpmu->has_branch_stack) {
>>>> + /* Save branch records in task_ctx on sched out */
>>>> + if (task_ctx && !sched_in) {
>>>> + armv8pmu_branch_save(armpmu, task_ctx);
>>>> + return;
>>>> + }
>>>> +
>>>> + /* Reset branch records on sched in */
>>>> + if (sched_in)
>>>> + armv8pmu_branch_reset();
>>>> + }
>>>> +}
>>>> +
>>>> /*
>>>> * Add an event filter to a given event.
>>>> */
>>>> @@ -977,6 +1054,9 @@ static void armv8pmu_reset(void *info)
>>>> pmcr |= ARMV8_PMU_PMCR_LP;
>>>>
>>>> armv8pmu_pmcr_write(pmcr);
>>>> +
>>>> + if (cpu_pmu->has_branch_stack)
>>>> + armv8pmu_branch_reset();
>>>> }
>>>>
>>>> static int __armv8_pmuv3_map_event_id(struct arm_pmu *armpmu,
>>>> @@ -1014,6 +1094,20 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,
>>>>
>>>> hw_event_id = __armv8_pmuv3_map_event_id(armpmu, event);
>>>>
>>>> + if (has_branch_stack(event)) {
>>>> + if (!armv8pmu_branch_attr_valid(event))
>>>> + return -EOPNOTSUPP;
>>>> +
>>>> + /*
>>>> + * If a task gets scheduled out, the current branch records
>>>> + * get saved in the task's context data, which can be later
>>>> + * used to fill in the records upon an event overflow. Let's
>>>> + * enable PERF_ATTACH_TASK_DATA in 'event->attach_state' for
>>>> + * all branch stack sampling perf events.
>>>> + */
>>>> + event->attach_state |= PERF_ATTACH_TASK_DATA;
>>>> + }
>>>> +
>>>> /*
>>>> * CHAIN events only work when paired with an adjacent counter, and it
>>>> * never makes sense for a user to open one in isolation, as they'll be
>>>> @@ -1130,6 +1224,35 @@ static void __armv8pmu_probe_pmu(void *info)
>>>> cpu_pmu->reg_pmmir = read_pmmir();
>>>> else
>>>> cpu_pmu->reg_pmmir = 0;
>>>> + armv8pmu_branch_probe(cpu_pmu);
>>>
>>> I'm not sure if this is splitting hairs or not, but
>>> __armv8pmu_probe_pmu() is run on only one of 'any' of the supported CPUs
>>> for this PMU.
>>
>> Right.
>>
>>>
>>> Is it not possible to have some of those CPUs support and some not
>>> support BRBE, even though they are all the same PMU type? Maybe we could
>>
>> I am not sure, but not something I have come across.
>>
>>> wait for it to explode with some weird system, or change it so that the
>>> BRBE probe is run on every CPU, with a second 'supported_brbe_mask' field.
>>
>> Right, but for now, the current solution looks sufficient.
>>
>
> I suppose it means people will have to split their PMUs into ones that do
> and don't support BRBE. I'm not sure if that's worth adding a comment in
> the docs, or if it's too obscure.
Sure, can add that comment in brbe.rst. Also with debug enabled i.e. wrapped
inside some debug config, it can be ascertained that all CPUs on a given ARM
PMU have BRBE with the exact same properties.
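Something along these lines for that debug check (sketch; assumes the
SYS_BRBIDR0_EL1 sysreg definition from this series, run via a cross call
on each supported CPU after the initial probe):

static void brbe_check_cpu(void *info)
{
	u64 *ref_brbidr = info;

	/* All CPUs behind a given ARM PMU must report identical BRBE properties */
	WARN_ON_ONCE(read_sysreg_s(SYS_BRBIDR0_EL1) != *ref_brbidr);
}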
>
>>>
>>>> +}
>>>> +
>>>> +static int branch_records_alloc(struct arm_pmu *armpmu)
>>>> +{
>>>> + struct branch_records __percpu *records;
>>>> + int cpu;
>>>> +
>>>> + records = alloc_percpu_gfp(struct branch_records, GFP_KERNEL);
>>>> + if (!records)
>>>> + return -ENOMEM;
>>>> +
>>>
>>> Doesn't this technically need to take the CPU mask where BRBE is
>>> supported into account? Otherwise you are allocating for cores that
>>> never use it.
>>>
>>> Also it's done per-CPU _and_ per-PMU type, multiplying the number of
>>> BRBE buffers allocated, even if they can only ever be used per-CPU.
>>
>> Agreed, but I believe we have already been through this discussion, and
>> settled on this method for being the simpler approach.
>>
>>>
>>>> + /*
>>>> + * percpu memory allocated for 'records' gets completely consumed
>>>> + * here, and never required to be freed up later. So permanently
>>>> + * losing access to this anchor i.e 'records' is acceptable.
>>>> + *
>>>> + * Otherwise this allocation handle would have to be saved up for
>>>> + * free_percpu() release later if required.
>>>> + */
>>>> + for_each_possible_cpu(cpu) {
>>>> + struct pmu_hw_events *events_cpu;
>>>> + struct branch_records *records_cpu;
>>>> +
>>>> + events_cpu = per_cpu_ptr(armpmu->hw_events, cpu);
>>>> + records_cpu = per_cpu_ptr(records, cpu);
>>>> + events_cpu->branches = records_cpu;
>>>> + }
>>>> + return 0;
>>>> }
>>>>
>>>> static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>>>> @@ -1146,7 +1269,21 @@ static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>>>> if (ret)
>>>> return ret;
>>>>
>>>> - return probe.present ? 0 : -ENODEV;
>>>> + if (!probe.present)
>>>> + return -ENODEV;
>>>> +
>>>> + if (cpu_pmu->has_branch_stack) {
>>>> + ret = armv8pmu_task_ctx_cache_alloc(cpu_pmu);
>>>> + if (ret)
>>>> + return ret;
>>>> +
>>>> + ret = branch_records_alloc(cpu_pmu);
>>>> + if (ret) {
>>>> + armv8pmu_task_ctx_cache_free(cpu_pmu);
>>>> + return ret;
>>>> + }
>>>> + }
>>>> + return 0;
>>>> }
>>>>
>>>
>>> [...]
>>>> diff --git a/include/linux/perf/arm_pmuv3.h b/include/linux/perf/arm_pmuv3.h
>>>> index 9c226adf938a..72da4522397c 100644
>>>> --- a/include/linux/perf/arm_pmuv3.h
>>>> +++ b/include/linux/perf/arm_pmuv3.h
>>>> @@ -303,4 +303,50 @@
>>>> } \
>>>> } while (0)
>>>>
>>>> +struct pmu_hw_events;
>>>> +struct arm_pmu;
>>>> +struct perf_event;
>>>> +
>>>> +#ifdef CONFIG_PERF_EVENTS
>>>
>>> Very minor nit, but if you end up moving the stubs to the brbe header
>>> you probably don't need the #ifdef CONFIG_PERF_EVENTS because it just
>>> won't be included in that case.
>>
>> Right, will drop CONFIG_PERF_EVENTS wrapper.
>>
>>>
>>>> +static inline void armv8pmu_branch_reset(void)
>>>> +{
>>>> +}
>>>> +
>>>> +static inline void armv8pmu_branch_probe(struct arm_pmu *arm_pmu)
>>>> +{
>>>> +}
>>>> +
>>>> +static inline bool armv8pmu_branch_attr_valid(struct perf_event *event)
>>>> +{
>>>> + WARN_ON_ONCE(!has_branch_stack(event));
>>>> + return false;
>>>> +}
>>>> +
>>>> +static inline void armv8pmu_branch_enable(struct arm_pmu *arm_pmu)
>>>> +{
>>>> +}
>>>> +
>>>> +static inline void armv8pmu_branch_disable(void)
>>>> +{
>>>> +}
>>>> +
>>>> +static inline void armv8pmu_branch_read(struct pmu_hw_events *cpuc,
>>>> + struct perf_event *event)
>>>> +{
>>>> + WARN_ON_ONCE(!has_branch_stack(event));
>>>> +}
>>>> +
>>>> +static inline void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx)
>>>> +{
>>>> +}
>>>> +
>>>> +static inline int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu)
>>>> +{
>>>> + return 0;
>>>> +}
>>>> +
>>>> +static inline void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu)
>>>> +{
>>>> +}
>>>> +#endif /* CONFIG_PERF_EVENTS */
>>>> #endif
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [V14 4/8] drivers: perf: arm_pmuv3: Enable branch stack sampling via FEAT_BRBE
2023-11-14 12:11 ` James Clark
@ 2023-11-21 10:47 ` Anshuman Khandual
0 siblings, 0 replies; 30+ messages in thread
From: Anshuman Khandual @ 2023-11-21 10:47 UTC (permalink / raw)
To: James Clark
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, linux-arm-kernel, linux-kernel, will,
catalin.marinas, mark.rutland
On 11/14/23 17:41, James Clark wrote:
>
>
> On 14/11/2023 05:13, Anshuman Khandual wrote:
> [...]
>
>> +/*
>> + * BRBE supports the following functional branch type filters while
>> + * generating branch records. These branch filters can be enabled,
>> + * either individually or as a group i.e ORing multiple filters
>> + * with each other.
>> + *
>> + * BRBFCR_EL1_CONDDIR - Conditional direct branch
>> + * BRBFCR_EL1_DIRCALL - Direct call
>> + * BRBFCR_EL1_INDCALL - Indirect call
>> + * BRBFCR_EL1_INDIRECT - Indirect branch
>> + * BRBFCR_EL1_DIRECT - Direct branch
>> + * BRBFCR_EL1_RTN - Subroutine return
>> + */
>> +static u64 branch_type_to_brbfcr(int branch_type)
>> +{
>> + u64 brbfcr = 0;
>> +
>> + if (branch_type & PERF_SAMPLE_BRANCH_ANY) {
>> + brbfcr |= BRBFCR_EL1_BRANCH_FILTERS;
>> + return brbfcr;
>> + }
>> +
>> + if (branch_type & PERF_SAMPLE_BRANCH_ANY_CALL) {
>> + brbfcr |= BRBFCR_EL1_INDCALL;
>> + brbfcr |= BRBFCR_EL1_DIRCALL;
>> + }
>> +
>> + if (branch_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
>> + brbfcr |= BRBFCR_EL1_RTN;
>> +
>> + if (branch_type & PERF_SAMPLE_BRANCH_IND_CALL)
>> + brbfcr |= BRBFCR_EL1_INDCALL;
>> +
>> + if (branch_type & PERF_SAMPLE_BRANCH_COND)
>> + brbfcr |= BRBFCR_EL1_CONDDIR;
>> +
>> + if (branch_type & PERF_SAMPLE_BRANCH_IND_JUMP)
>> + brbfcr |= BRBFCR_EL1_INDIRECT;
>> +
>> + if (branch_type & PERF_SAMPLE_BRANCH_CALL)
>> + brbfcr |= BRBFCR_EL1_DIRCALL;
>> +
>> + return brbfcr;
>> +}
>> +
>> +/*
>> + * BRBE supports the following privilege mode filters while generating
>> + * branch records.
>> + *
>> + * BRBCR_ELx_E0BRE - EL0 branch records
>> + * BRBCR_ELx_ExBRE - EL1/EL2 branch records
>> + *
>> + * BRBE also supports the following additional functional branch type
>> + * filters while generating branch records.
>> + *
>> + * BRBCR_ELx_EXCEPTION - Exception
>> + * BRBCR_ELx_ERTN - Exception return
>> + */
>> +static u64 branch_type_to_brbcr(int branch_type)
>> +{
>> + u64 brbcr = BRBCR_ELx_DEFAULT_TS;
>> +
>> + /*
>> + * BRBE should be paused on PMU interrupt while tracing kernel
>> + * space to stop capturing further branch records. Otherwise
>> + * interrupt handler branch records might get into the samples
>> + * which is not desired.
>> + *
>> + * BRBE need not be paused on PMU interrupt while tracing only
>> + * the user space, because it will automatically be inside the
>> + * prohibited region. But even after PMU overflow occurs, the
>> + * interrupt could still take much more cycles, before it can
>> + * be taken and by that time BRBE will have been overwritten.
>> + * Hence enable pause on PMU interrupt mechanism even for user
>> + * only traces as well.
>> + */
>> + brbcr |= BRBCR_ELx_FZP;
>> +
>> + if (branch_type & PERF_SAMPLE_BRANCH_USER)
>> + brbcr |= BRBCR_ELx_E0BRE;
>> +
>> + /*
>> + * When running in the hyp mode, writing into BRBCR_EL1
>> + * actually writes into BRBCR_EL2 instead. Field E2BRE
>> + * is also at the same position as E1BRE.
>> + */
>> + if (branch_type & PERF_SAMPLE_BRANCH_KERNEL)
>> + brbcr |= BRBCR_ELx_ExBRE;
>> +
>> + if (branch_type & PERF_SAMPLE_BRANCH_HV) {
>> + if (is_kernel_in_hyp_mode())
>> + brbcr |= BRBCR_ELx_ExBRE;
>> + }
>> +
>> + if (!(branch_type & PERF_SAMPLE_BRANCH_NO_CYCLES))
>> + brbcr |= BRBCR_ELx_CC;
>> +
>> + if (!(branch_type & PERF_SAMPLE_BRANCH_NO_FLAGS))
>> + brbcr |= BRBCR_ELx_MPRED;
>> +
>> + /*
>> + * The exception and exception return branches could be
>> + * captured, irrespective of the perf event's privilege.
>> + * If the perf event does not have enough privilege for
>> + * a given exception level, then addresses which falls
>> + * under that exception level will be reported as zero
>> + * for the captured branch record, creating source only
>> + * or target only records.
>> + */
>> + if (branch_type & PERF_SAMPLE_BRANCH_ANY) {
>> + brbcr |= BRBCR_ELx_EXCEPTION;
>> + brbcr |= BRBCR_ELx_ERTN;
>> + }
>> +
>> + if (branch_type & PERF_SAMPLE_BRANCH_ANY_CALL)
>> + brbcr |= BRBCR_ELx_EXCEPTION;
>> +
>> + if (branch_type & PERF_SAMPLE_BRANCH_ANY_RETURN)
>> + brbcr |= BRBCR_ELx_ERTN;
>> +
>> + return brbcr & BRBCR_ELx_CONFIG_MASK;
>> +}
>> +
>> +void armv8pmu_branch_enable(struct arm_pmu *arm_pmu)
>> +{
>> + struct pmu_hw_events *cpuc = this_cpu_ptr(arm_pmu->hw_events);
>> + u64 brbfcr, brbcr;
>> +
>> + if (!(cpuc->brbe_sample_type && cpuc->brbe_users))
>> + return;
>> +
>> + /*
>> + * BRBE gets configured with a new mismatched branch sample
>> + * type request, overriding any previous branch filters.
>> + */
>> + brbfcr = read_sysreg_s(SYS_BRBFCR_EL1);
>> + brbfcr &= ~BRBFCR_EL1_DEFAULT_CONFIG;
>
> This is called default_config, but is being used semantically the same
> way as BRBCR_ELx_CONFIG_MASK below to clear out the fields. Doesn't that
> mean that it's a mask rather than a default config? It's only ever used
> in this way. default_config implies it's written or used as an
> initialiser at some point.
Sure, will rename BRBFCR_EL1_DEFAULT_CONFIG to BRBFCR_EL1_CONFIG_MASK,
making it similar to BRBCR_ELx_CONFIG_MASK.
>
>> + brbfcr |= branch_type_to_brbfcr(cpuc->brbe_sample_type);
>> + write_sysreg_s(brbfcr, SYS_BRBFCR_EL1);
>> + isb();
>> +
>> + brbcr = read_sysreg_s(SYS_BRBCR_EL1);
>> + brbcr &= ~BRBCR_ELx_CONFIG_MASK;
>> + brbcr |= branch_type_to_brbcr(cpuc->brbe_sample_type);
>
> BRBCR_ELx_CONFIG_MASK is already &'d at the end of
> branch_type_to_brbcr(), so isn't it easier and equivalent to just do the
> following instead of the read(), &= and then |= ?
>
> write_sysreg_s(branch_type_to_brbcr(...), SYS_BRBCR_EL1);
>
> Or at least make branch_type_to_brbfcr() consistent and &
> BRBFCR_EL1_DEFAULT_CONFIG at the end of that function too.
This sounds better I guess, will '&' BRBFCR_EL1_CONFIG_MASK at the end
of branch_type_to_brbfcr().
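i.e. the tail of branch_type_to_brbfcr() would become (sketch, with the
rename from above applied):

	if (branch_type & PERF_SAMPLE_BRANCH_CALL)
		brbfcr |= BRBFCR_EL1_DIRCALL;

	return brbfcr & BRBFCR_EL1_CONFIG_MASK;

keeping both helpers consistent in masking their own return values.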
>
>> + write_sysreg_s(brbcr, SYS_BRBCR_EL1);
>> + isb();
>> +}
>> +
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [V14 5/8] KVM: arm64: nvhe: Disable branch generation in nVHE guests
2023-11-14 9:16 ` James Clark
@ 2023-11-21 11:12 ` Anshuman Khandual
2023-11-23 13:54 ` James Clark
0 siblings, 1 reply; 30+ messages in thread
From: Anshuman Khandual @ 2023-11-21 11:12 UTC (permalink / raw)
To: James Clark
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, Oliver Upton, James Morse, kvmarm,
linux-arm-kernel, linux-kernel, will, catalin.marinas,
mark.rutland
On 11/14/23 14:46, James Clark wrote:
>
>
> On 14/11/2023 05:13, Anshuman Khandual wrote:
>> Disable the BRBE before we enter the guest, saving the status and enable it
>> back once we get out of the guest. This is just to avoid capturing records
>> in the guest kernel/userspace, which would be confusing the samples.
>>
>> Cc: Marc Zyngier <maz@kernel.org>
>> Cc: Oliver Upton <oliver.upton@linux.dev>
>> Cc: James Morse <james.morse@arm.com>
>> Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>> Cc: Will Deacon <will@kernel.org>
>> Cc: kvmarm@lists.linux.dev
>> Cc: linux-arm-kernel@lists.infradead.org
>> CC: linux-kernel@vger.kernel.org
>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>> ---
>> Changes in V14:
>>
>> - This is a new patch in the series
>>
>> arch/arm64/include/asm/kvm_host.h | 4 ++++
>> arch/arm64/kvm/debug.c | 6 +++++
>> arch/arm64/kvm/hyp/nvhe/debug-sr.c | 38 ++++++++++++++++++++++++++++++
>> 3 files changed, 48 insertions(+)
>>
>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>> index 68421c74283a..1faa0430d8dd 100644
>> --- a/arch/arm64/include/asm/kvm_host.h
>> +++ b/arch/arm64/include/asm/kvm_host.h
>> @@ -449,6 +449,8 @@ enum vcpu_sysreg {
>> CNTHV_CVAL_EL2,
>> PMSCR_EL1, /* Statistical profiling extension */
>> TRFCR_EL1, /* Self-hosted trace filters */
>> + BRBCR_EL1, /* Branch Record Buffer Control Register */
>> + BRBFCR_EL1, /* Branch Record Buffer Function Control Register */
>>
>> NR_SYS_REGS /* Nothing after this line! */
>> };
>> @@ -753,6 +755,8 @@ struct kvm_vcpu_arch {
>> #define VCPU_HYP_CONTEXT __vcpu_single_flag(iflags, BIT(7))
>> /* Save trace filter controls */
>> #define DEBUG_STATE_SAVE_TRFCR __vcpu_single_flag(iflags, BIT(8))
>> +/* Save BRBE context if active */
>> +#define DEBUG_STATE_SAVE_BRBE __vcpu_single_flag(iflags, BIT(9))
>>
>> /* SVE enabled for host EL0 */
>> #define HOST_SVE_ENABLED __vcpu_single_flag(sflags, BIT(0))
>> diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
>> index 2ab41b954512..4055783c3d34 100644
>> --- a/arch/arm64/kvm/debug.c
>> +++ b/arch/arm64/kvm/debug.c
>> @@ -354,6 +354,11 @@ void kvm_arch_vcpu_load_debug_state_flags(struct kvm_vcpu *vcpu)
>> !(read_sysreg_s(SYS_TRBIDR_EL1) & TRBIDR_EL1_P))
>> vcpu_set_flag(vcpu, DEBUG_STATE_SAVE_TRBE);
>> }
>> +
>> + /* Check if we have BRBE implemented and available at the host */
>> + if (cpuid_feature_extract_unsigned_field(dfr0, ID_AA64DFR0_EL1_BRBE_SHIFT) &&
>> + (read_sysreg_s(SYS_BRBCR_EL1) & (BRBCR_ELx_E0BRE | BRBCR_ELx_ExBRE)))
>> + vcpu_set_flag(vcpu, DEBUG_STATE_SAVE_BRBE);
>
> Isn't this supposed to just be the feature check? Whether BRBE is
> enabled or not is checked later in __debug_save_brbe() anyway.
Okay, will make it just a feature check via ID_AA64DFR0_EL1_BRBE_SHIFT.
>
> It seems like it's possible to become enabled after this flag load part.
Agreed.
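i.e. the load path would reduce to just (sketch of the agreed change):

	/* Check if we have BRBE implemented and available at the host */
	if (cpuid_feature_extract_unsigned_field(dfr0, ID_AA64DFR0_EL1_BRBE_SHIFT))
		vcpu_set_flag(vcpu, DEBUG_STATE_SAVE_BRBE);

leaving the actual enablement check to __debug_save_brbe().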
>
>> }
>>
>> void kvm_arch_vcpu_put_debug_state_flags(struct kvm_vcpu *vcpu)
>> @@ -361,6 +366,7 @@ void kvm_arch_vcpu_put_debug_state_flags(struct kvm_vcpu *vcpu)
>> vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_SPE);
>> vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_TRBE);
>> vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_TRFCR);
>> + vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_BRBE);
>> }
>>
>> void kvm_etm_set_guest_trfcr(u64 trfcr_guest)
>> diff --git a/arch/arm64/kvm/hyp/nvhe/debug-sr.c b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
>> index 6174f710948e..e44a1f71a0f8 100644
>> --- a/arch/arm64/kvm/hyp/nvhe/debug-sr.c
>> +++ b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
>> @@ -93,6 +93,38 @@ static void __debug_restore_trace(struct kvm_cpu_context *host_ctxt,
>> write_sysreg_s(ctxt_sys_reg(host_ctxt, TRFCR_EL1), SYS_TRFCR_EL1);
>> }
>>
>> +static void __debug_save_brbe(struct kvm_cpu_context *host_ctxt)
>> +{
>> + ctxt_sys_reg(host_ctxt, BRBCR_EL1) = 0;
>> + ctxt_sys_reg(host_ctxt, BRBFCR_EL1) = 0;
>> +
>> + /* Check if the BRBE is enabled */
>> + if (!(ctxt_sys_reg(host_ctxt, BRBCR_EL1) & (BRBCR_ELx_E0BRE | BRBCR_ELx_ExBRE)))
>> + return;
>
> Doesn't this always fail, the host BRBCR_EL1 value was just cleared on
> the line above.
Agreed, this error might have slipped in while converting to ctxt_sys_reg().
>
> Also, you need to read the register to determine if it was enabled or
Right
> not, so you might as well always store the real value, rather than 0 in
> the not enabled case.
But if it is not enabled, why store the real value?
>
>> +
>> + /*
>> + * Prohibit branch record generation while we are in guest.
>> + * Since access to BRBCR_EL1 and BRBFCR_EL1 is trapped, the
>> + * guest can't modify the filtering set by the host.
>> + */
>> + ctxt_sys_reg(host_ctxt, BRBCR_EL1) = read_sysreg_s(SYS_BRBCR_EL1);
>> + ctxt_sys_reg(host_ctxt, BRBFCR_EL1) = read_sysreg_s(SYS_BRBFCR_EL1)
>> + write_sysreg_s(0, SYS_BRBCR_EL1);
>> + write_sysreg_s(0, SYS_BRBFCR_EL1);
>
> Why does SYS_BRBFCR_EL1 need to be saved and restored? Only
> BRBCR_ELx_E0BRE and BRBCR_ELx_ExBRE need to be cleared to disable BRBE.
Right, just thought that both the brbcr and brbfcr system registers represent
the current BRBE state (besides the branch records) in a more comprehensive
manner, although neither would be changed from inside the guest.
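Though if only the enable bits matter, the save path could indeed shrink
to something like (sketch):

	u64 brbcr = read_sysreg_s(SYS_BRBCR_EL1);

	ctxt_sys_reg(host_ctxt, BRBCR_EL1) = brbcr;
	/* Prohibit branch record generation while in the guest */
	write_sysreg_s(brbcr & ~(BRBCR_ELx_E0BRE | BRBCR_ELx_ExBRE), SYS_BRBCR_EL1);
	isb();

dropping BRBFCR_EL1 from the save/restore path entirely.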
>
>> + isb();
>> +}
>> +
>> +static void __debug_restore_brbe(struct kvm_cpu_context *host_ctxt)
>> +{
>> + if (!ctxt_sys_reg(host_ctxt, BRBCR_EL1) || !ctxt_sys_reg(host_ctxt, BRBFCR_EL1))
>> + return;
>> +
>> + /* Restore BRBE controls */
>> + write_sysreg_s(ctxt_sys_reg(host_ctxt, BRBCR_EL1), SYS_BRBCR_EL1);
>> + write_sysreg_s(ctxt_sys_reg(host_ctxt, BRBFCR_EL1), SYS_BRBFCR_EL1);
>> + isb();
>> +}
>> +
>> void __debug_save_host_buffers_nvhe(struct kvm_cpu_context *host_ctxt,
>> struct kvm_cpu_context *guest_ctxt)
>> {
>> @@ -102,6 +134,10 @@ void __debug_save_host_buffers_nvhe(struct kvm_cpu_context *host_ctxt,
>>
>> if (vcpu_get_flag(host_ctxt->__hyp_running_vcpu, DEBUG_STATE_SAVE_TRFCR))
>> __debug_save_trace(host_ctxt, guest_ctxt);
>> +
>> + /* Disable BRBE branch records */
>> + if (vcpu_get_flag(host_ctxt->__hyp_running_vcpu, DEBUG_STATE_SAVE_BRBE))
>> + __debug_save_brbe(host_ctxt);
>> }
>>
>> void __debug_switch_to_guest(struct kvm_vcpu *vcpu)
>> @@ -116,6 +152,8 @@ void __debug_restore_host_buffers_nvhe(struct kvm_cpu_context *host_ctxt,
>> __debug_restore_spe(host_ctxt);
>> if (vcpu_get_flag(host_ctxt->__hyp_running_vcpu, DEBUG_STATE_SAVE_TRFCR))
>> __debug_restore_trace(host_ctxt, guest_ctxt);
>> + if (vcpu_get_flag(host_ctxt->__hyp_running_vcpu, DEBUG_STATE_SAVE_BRBE))
>> + __debug_restore_brbe(host_ctxt);
>> }
>>
>> void __debug_switch_to_host(struct kvm_vcpu *vcpu)
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [V14 0/8] arm64/perf: Enable branch stack sampling
2023-11-14 17:17 ` [V14 0/8] arm64/perf: Enable branch stack sampling James Clark
@ 2023-11-22 5:15 ` Anshuman Khandual
2023-11-23 16:23 ` James Clark
0 siblings, 1 reply; 30+ messages in thread
From: Anshuman Khandual @ 2023-11-22 5:15 UTC (permalink / raw)
To: James Clark
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, linux-arm-kernel, linux-kernel, will,
catalin.marinas, mark.rutland
On 11/14/23 22:47, James Clark wrote:
>
>
> On 14/11/2023 05:13, Anshuman Khandual wrote:
>> This series enables perf branch stack sampling support on arm64 platform
>> via a new arch feature called Branch Record Buffer Extension (BRBE). All
>> the relevant register definitions could be accessed here.
>>
> [...]
>>
>> --------------------------- Virtualisation support ------------------------
>>
>> - Branch stack sampling is not currently supported inside the guest (TODO)
>>
>> - FEAT_BRBE advertised as absent via clearing ID_AA64DFR0_EL1.BRBE
>> - Future support in guest requires emulating FEAT_BRBE
>
> If you never add support for the host looking into a guest, and you save
But that seems to be a valid use case though. Is there a particular concern
why such a capability should or could not be added for BRBE?
> and restore all the BRBINF[n] registers, I think you might be able to
> just let the guest do whatever it wants with BRBE and not trap and
> emulate it? Maybe there is some edge case why that wouldn't work, but
> it's worth thinking about.
Right, in case host tracing of the guest is not supported (although still
wondering why it should not be), saving and restoring the complete BRBE state
i.e. all system registers that can be accessed from the guest, would let the
guest do whatever it wants with BRBE without requiring the trap-and-emulate
model.
>
> For BRBE specifically I don't see much of a use case for hosts looking
> into a guest, at least not like with PMU counters.
But how is it any different from normal PMU counters? Branch records do
provide statistical insights into hot sections in the guest.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [V14 3/8] drivers: perf: arm_pmuv3: Enable branch stack sampling framework
2023-11-21 9:57 ` Anshuman Khandual
@ 2023-11-23 12:35 ` James Clark
2023-11-27 8:06 ` Anshuman Khandual
0 siblings, 1 reply; 30+ messages in thread
From: James Clark @ 2023-11-23 12:35 UTC (permalink / raw)
To: Anshuman Khandual
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, linux-arm-kernel, linux-kernel, will,
catalin.marinas, mark.rutland
On 21/11/2023 09:57, Anshuman Khandual wrote:
>
>
> On 11/15/23 15:37, James Clark wrote:
>>
>>
>> On 15/11/2023 07:22, Anshuman Khandual wrote:
>>> On 11/14/23 17:44, James Clark wrote:
>>>>
>>>>
>>>> On 14/11/2023 05:13, Anshuman Khandual wrote:
>>>> [...]
>>>>
>>>>> +/*
>>>>> + * This is a read only constant and safe during multi threaded access
>>>>> + */
>>>>> +static struct perf_branch_stack zero_branch_stack = { .nr = 0, .hw_idx = -1ULL};
>>>>> +
>>>>> +static void read_branch_records(struct pmu_hw_events *cpuc,
>>>>> + struct perf_event *event,
>>>>> + struct perf_sample_data *data,
>>>>> + bool *branch_captured)
>>>>> +{
>>>>> + /*
>>>>> + * CPU specific branch records buffer must have been allocated already
>>>>> + * for the hardware records to be captured and processed further.
>>>>> + */
>>>>> + if (WARN_ON(!cpuc->branches))
>>>>> + return;
>>>>> +
>>>>> + /*
>>>>> + * Overflowed event's branch_sample_type does not match the configured
>>>>> + * branch filters in the BRBE HW. So the captured branch records here
>>>>> + * cannot be co-related to the overflowed event. Report to the user as
>>>>> + * if no branch records have been captured, and flush branch records.
>>>>> + * The same scenario is applicable when the current task context does
>>>>> + * not match with overflown event.
>>>>> + */
>>>>> + if ((cpuc->brbe_sample_type != event->attr.branch_sample_type) ||
>>>>> + (event->ctx->task && cpuc->brbe_context != event->ctx)) {
>>>>> + perf_sample_save_brstack(data, event, &zero_branch_stack);
>>>>
>>>> Is there any benefit to outputting a zero size stack vs not outputting
>>>> anything at all?
>>>
>>> The event has got PERF_SAMPLE_BRANCH_STACK marked and hence perf_sample_data
>>> must have PERF_SAMPLE_BRANCH_STACK with it's br_stack pointing to the branch
>>> records. Hence without assigning a zeroed struct perf_branch_stack, there is
>>> a chance, that perf_sample_data will pass on some garbage branch records to
>>> the ring buffer.
>>>
>>
>> I don't think that's an issue, the perf core code handles the case where
>> no branch stack exists on a sample. It even outputs the zero length for
>> you, but there is other stuff that can be skipped if you just never call
>> perf_sample_save_brstack():
>
> Sending out perf_sample_data without a valid data->br_stack seems problematic,
> which would be the case when perf_sample_save_brstack() never gets called on
> the perf_sample_data being prepared, depending on the 'else' case below to
> push out zero records.
>
I'm not following why it would be problematic. data->br_stack is
initialised to NULL in perf_prepare_sample() and the core code
specifically has a path that was added for the case where
perf_sample_save_brstack() was never called.
> Alternatively - temporarily zeroing out cpuc->branches->branch_stack.nr just
> for the immediate perf_sample_save_brstack() call, and then restoring it back
> to its original value, might work as well. Remember it still has valid records
> for other qualifying events.
>
Zeroing isn't required, br_stack is already zero-initialised.
Not sure what you mean by valid records for other qualifying events? But
this is a per-sample thing, and the output struct is wiped per sample.
>>
>> from kernel/events/core.c:
>>
>> if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
>> if (data->br_stack) {
>> size_t size;
>>
>> size = data->br_stack->nr
>> * sizeof(struct perf_branch_entry);
>>
>> perf_output_put(handle, data->br_stack->nr);
>> if (branch_sample_hw_index(event))
>> perf_output_put(handle, data->br_stack->hw_idx);
>> perf_output_copy(handle, data->br_stack->entries, size);
>> } else {
>> /*
>> * we always store at least the value of nr
>> */
>> u64 nr = 0;
>> perf_output_put(handle, nr);
>> }
>> }
>>
>>
>>>>
>>>>> + return;
>>>>> + }
>>>>> +
>>>>> + /*
>>>>> + * Read the branch records from the hardware once after the PMU IRQ
>>>>> + * has been triggered but subsequently same records can be used for
>>>>> + * other events that might have been overflowed simultaneously thus
>>>>> + * saving much CPU cycles.
>>>>> + */
>>>>> + if (!*branch_captured) {
>>>>> + armv8pmu_branch_read(cpuc, event);
>>>>> + *branch_captured = true;
>>>>> + }
>>>>> + perf_sample_save_brstack(data, event, &cpuc->branches->branch_stack);
>>>>> +}
>>>>> +
>>>>> static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
>>>>> {
>>>>> u32 pmovsr;
>>>>> @@ -766,6 +815,7 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
>>>>> struct pmu_hw_events *cpuc = this_cpu_ptr(cpu_pmu->hw_events);
>>>>> struct pt_regs *regs;
>>>>> int idx;
>>>>> + bool branch_captured = false;
>>>>>
>>>>> /*
>>>>> * Get and reset the IRQ flags
>>>>> @@ -809,6 +859,13 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
>>>>> if (!armpmu_event_set_period(event))
>>>>> continue;
>>>>>
>>>>> + /*
>>>>> + * PMU IRQ should remain asserted until all branch records
>>>>> + * are captured and processed into struct perf_sample_data.
>>>>> + */
>>>>> + if (has_branch_stack(event) && cpu_pmu->has_branch_stack)
>>>>> + read_branch_records(cpuc, event, &data, &branch_captured);
>>>>
>>>> You could return instead of using the out param, not really any
>>>> different, but maybe a bit more normal:
>>>>
>>>> branch_captured |= read_branch_records(cpuc, event, &data,
>>>> branch_captured);
>>>
>>> I am just wondering - how would that be any better?
>>>
>>
>> Maybe it wouldn't, but I suppose it's just the same way you don't write
>> returns like:
>>
>> armv8pmu_task_ctx_cache_alloc(cpu_pmu, &ret);
>>
>> instead of:
>>
>> ret = armv8pmu_task_ctx_cache_alloc(cpu_pmu);
>>
>> Out params can be hard to reason about sometimes. Maybe not in this case
>> though.
>
> The out parameter 'branch_captured' is checked inside read_branch_records()
> to ascertain whether the BRBE records have already been captured inside the
> buffer i.e cpuc->branches->branch_stack, so that the process can be skipped
> (an optimization) for subsequent events in the session. Keeping this parameter
> branch_captured just inside the caller i.e armv8pmu_handle_irq() would not
> achieve that objective.
>
No, it would achieve the objective and it's the same. They're also arguably
two different things: one is whether any output was generated, and the
other is whether to skip, so having a return value allows you to give
the variables two different names.
	skip_output |= read_branch_records(cpuc, event, &data, skip_output);

And on the inside:

	bool read_branch_records(..., bool skip_output)
	{
		bool records_output = false;

		if (thing && !skip_output) {
			output_records();
			records_output = true;
		}

		return records_output;
	}
Either way, I'm not that bothered about this one; I was just mentioning
that the out parameter was a bit weird. But to say you can't accomplish
the same thing isn't right.
>>>>
>>>>> +
>>>>> /*
>>>>> * Perf event overflow will queue the processing of the event as
>>>>> * an irq_work which will be taken care of in the handling of
>>>>> @@ -818,6 +875,8 @@ static irqreturn_t armv8pmu_handle_irq(struct arm_pmu *cpu_pmu)
>>>>> cpu_pmu->disable(event);
>>>>> }
>>>>> armv8pmu_start(cpu_pmu);
>>>>> + if (cpu_pmu->has_branch_stack)
>>>>> + armv8pmu_branch_reset();
>>>>>
>>>>> return IRQ_HANDLED;
>>>>> }
>>>>> @@ -907,6 +966,24 @@ static int armv8pmu_user_event_idx(struct perf_event *event)
>>>>> return event->hw.idx;
>>>>> }
>>>>>
>>>>> +static void armv8pmu_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in)
>>>>> +{
>>>>> + struct arm_pmu *armpmu = to_arm_pmu(pmu_ctx->pmu);
>>>>> + void *task_ctx = pmu_ctx->task_ctx_data;
>>>>> +
>>>>> + if (armpmu->has_branch_stack) {
>>>>> + /* Save branch records in task_ctx on sched out */
>>>>> + if (task_ctx && !sched_in) {
>>>>> + armv8pmu_branch_save(armpmu, task_ctx);
>>>>> + return;
>>>>> + }
>>>>> +
>>>>> + /* Reset branch records on sched in */
>>>>> + if (sched_in)
>>>>> + armv8pmu_branch_reset();
>>>>> + }
>>>>> +}
>>>>> +
>>>>> /*
>>>>> * Add an event filter to a given event.
>>>>> */
>>>>> @@ -977,6 +1054,9 @@ static void armv8pmu_reset(void *info)
>>>>> pmcr |= ARMV8_PMU_PMCR_LP;
>>>>>
>>>>> armv8pmu_pmcr_write(pmcr);
>>>>> +
>>>>> + if (cpu_pmu->has_branch_stack)
>>>>> + armv8pmu_branch_reset();
>>>>> }
>>>>>
>>>>> static int __armv8_pmuv3_map_event_id(struct arm_pmu *armpmu,
>>>>> @@ -1014,6 +1094,20 @@ static int __armv8_pmuv3_map_event(struct perf_event *event,
>>>>>
>>>>> hw_event_id = __armv8_pmuv3_map_event_id(armpmu, event);
>>>>>
>>>>> + if (has_branch_stack(event)) {
>>>>> + if (!armv8pmu_branch_attr_valid(event))
>>>>> + return -EOPNOTSUPP;
>>>>> +
>>>>> + /*
>>>>> + * If a task gets scheduled out, the current branch records
>>>>> + * get saved in the task's context data, which can be later
>>>>> + * used to fill in the records upon an event overflow. Let's
>>>>> + * enable PERF_ATTACH_TASK_DATA in 'event->attach_state' for
>>>>> + * all branch stack sampling perf events.
>>>>> + */
>>>>> + event->attach_state |= PERF_ATTACH_TASK_DATA;
>>>>> + }
>>>>> +
>>>>> /*
>>>>> * CHAIN events only work when paired with an adjacent counter, and it
>>>>> * never makes sense for a user to open one in isolation, as they'll be
>>>>> @@ -1130,6 +1224,35 @@ static void __armv8pmu_probe_pmu(void *info)
>>>>> cpu_pmu->reg_pmmir = read_pmmir();
>>>>> else
>>>>> cpu_pmu->reg_pmmir = 0;
>>>>> + armv8pmu_branch_probe(cpu_pmu);
>>>>
>>>> I'm not sure if this is splitting hairs or not, but
>>>> __armv8pmu_probe_pmu() is run on only one of 'any' of the supported CPUs
>>>> for this PMU.
>>>
>>> Right.
>>>
>>>>
>>>> Is it not possible to have some of those CPUs support and some not
>>>> support BRBE, even though they are all the same PMU type? Maybe we could
>>>
>>> I am not sure, but not something I have come across.
>>>
>>>> wait for it to explode with some weird system, or change it so that the
>>>> BRBE probe is run on every CPU, with a second 'supported_brbe_mask' field.
>>>
>>> Right, but for now, the current solution looks sufficient.
>>>
>>
>> I suppose it means people will have to split their PMUs into ones that do
>> and don't support BRBE. I'm not sure whether that's worth adding a comment
>> in the docs, or if it's too obscure.
>
> Sure, can add that comment in brbe.rst. Also, with debug enabled, i.e. wrapped
> inside some debug config, it can be ascertained that all CPUs on a given ARM
> PMU have BRBE with the exact same properties.
>
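Sounds reasonable. For reference, that check could look something like the
sketch below - completely untested, and both the reg_brbidr field and the
config name are made-up placeholders here:

	/* Run from the probe path on each supported CPU */
	static void brbe_check_cpu(void *info)
	{
		struct arm_pmu *cpu_pmu = info;

		/* Compare against the BRBE ID value cached by the initial probe */
		WARN_ON_ONCE(read_sysreg_s(SYS_BRBIDR0_EL1) != cpu_pmu->reg_brbidr);
	}

	if (IS_ENABLED(CONFIG_ARM64_BRBE_DEBUG))
		on_each_cpu_mask(&cpu_pmu->supported_cpus, brbe_check_cpu,
				 cpu_pmu, 1);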
>>
>>>>
>>>>> +}
>>>>> +
>>>>> +static int branch_records_alloc(struct arm_pmu *armpmu)
>>>>> +{
>>>>> + struct branch_records __percpu *records;
>>>>> + int cpu;
>>>>> +
>>>>> + records = alloc_percpu_gfp(struct branch_records, GFP_KERNEL);
>>>>> + if (!records)
>>>>> + return -ENOMEM;
>>>>> +
>>>>
>>>> Doesn't this technically need to take the CPU mask where BRBE is
>>>> supported into account? Otherwise you are allocating for cores that
>>>> never use it.
>>>>
>>>> Also it's done per-CPU _and_ per-PMU type, multiplying the number of
>>>> BRBE buffers allocated, even if they can only ever be used per-CPU.
>>>
>>> Agreed, but I believe we have already been through this discussion, and
>>> settled for this method - for being a simpler approach.
>>>
>>>>
>>>>> + /*
>>>>> + * percpu memory allocated for 'records' gets completely consumed
>>>>> + * here, and is never required to be freed later. So permanently
>>>>> + * losing access to this anchor i.e 'records' is acceptable.
>>>>> + *
>>>>> + * Otherwise this allocation handle would have to be saved up for
>>>>> + * free_percpu() release later if required.
>>>>> + */
>>>>> + for_each_possible_cpu(cpu) {
>>>>> + struct pmu_hw_events *events_cpu;
>>>>> + struct branch_records *records_cpu;
>>>>> +
>>>>> + events_cpu = per_cpu_ptr(armpmu->hw_events, cpu);
>>>>> + records_cpu = per_cpu_ptr(records, cpu);
>>>>> + events_cpu->branches = records_cpu;
>>>>> + }
>>>>> + return 0;
>>>>> }
>>>>>
>>>>> static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>>>>> @@ -1146,7 +1269,21 @@ static int armv8pmu_probe_pmu(struct arm_pmu *cpu_pmu)
>>>>> if (ret)
>>>>> return ret;
>>>>>
>>>>> - return probe.present ? 0 : -ENODEV;
>>>>> + if (!probe.present)
>>>>> + return -ENODEV;
>>>>> +
>>>>> + if (cpu_pmu->has_branch_stack) {
>>>>> + ret = armv8pmu_task_ctx_cache_alloc(cpu_pmu);
>>>>> + if (ret)
>>>>> + return ret;
>>>>> +
>>>>> + ret = branch_records_alloc(cpu_pmu);
>>>>> + if (ret) {
>>>>> + armv8pmu_task_ctx_cache_free(cpu_pmu);
>>>>> + return ret;
>>>>> + }
>>>>> + }
>>>>> + return 0;
>>>>> }
>>>>>
>>>>
>>>> [...]
>>>>> diff --git a/include/linux/perf/arm_pmuv3.h b/include/linux/perf/arm_pmuv3.h
>>>>> index 9c226adf938a..72da4522397c 100644
>>>>> --- a/include/linux/perf/arm_pmuv3.h
>>>>> +++ b/include/linux/perf/arm_pmuv3.h
>>>>> @@ -303,4 +303,50 @@
>>>>> } \
>>>>> } while (0)
>>>>>
>>>>> +struct pmu_hw_events;
>>>>> +struct arm_pmu;
>>>>> +struct perf_event;
>>>>> +
>>>>> +#ifdef CONFIG_PERF_EVENTS
>>>>
>>>> Very minor nit, but if you end up moving the stubs to the brbe header
>>>> you probably don't need the #ifdef CONFIG_PERF_EVENTS because it just
>>>> won't be included in that case.
>>>
>>> Right, will drop CONFIG_PERF_EVENTS wrapper.
>>>
>>>>
>>>>> +static inline void armv8pmu_branch_reset(void)
>>>>> +{
>>>>> +}
>>>>> +
>>>>> +static inline void armv8pmu_branch_probe(struct arm_pmu *arm_pmu)
>>>>> +{
>>>>> +}
>>>>> +
>>>>> +static inline bool armv8pmu_branch_attr_valid(struct perf_event *event)
>>>>> +{
>>>>> + WARN_ON_ONCE(!has_branch_stack(event));
>>>>> + return false;
>>>>> +}
>>>>> +
>>>>> +static inline void armv8pmu_branch_enable(struct arm_pmu *arm_pmu)
>>>>> +{
>>>>> +}
>>>>> +
>>>>> +static inline void armv8pmu_branch_disable(void)
>>>>> +{
>>>>> +}
>>>>> +
>>>>> +static inline void armv8pmu_branch_read(struct pmu_hw_events *cpuc,
>>>>> + struct perf_event *event)
>>>>> +{
>>>>> + WARN_ON_ONCE(!has_branch_stack(event));
>>>>> +}
>>>>> +
>>>>> +static inline void armv8pmu_branch_save(struct arm_pmu *arm_pmu, void *ctx)
>>>>> +{
>>>>> +}
>>>>> +
>>>>> +static inline int armv8pmu_task_ctx_cache_alloc(struct arm_pmu *arm_pmu)
>>>>> +{
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> +static inline void armv8pmu_task_ctx_cache_free(struct arm_pmu *arm_pmu)
>>>>> +{
>>>>> +}
>>>>> +#endif /* CONFIG_PERF_EVENTS */
>>>>> #endif
* Re: [V14 5/8] KVM: arm64: nvhe: Disable branch generation in nVHE guests
2023-11-21 11:12 ` Anshuman Khandual
@ 2023-11-23 13:54 ` James Clark
2023-11-27 8:25 ` Anshuman Khandual
0 siblings, 1 reply; 30+ messages in thread
From: James Clark @ 2023-11-23 13:54 UTC (permalink / raw)
To: Anshuman Khandual
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, Oliver Upton, James Morse, kvmarm,
linux-arm-kernel, linux-kernel, will, catalin.marinas,
mark.rutland
On 21/11/2023 11:12, Anshuman Khandual wrote:
>
>
> On 11/14/23 14:46, James Clark wrote:
>>
>>
>> On 14/11/2023 05:13, Anshuman Khandual wrote:
>>> Disable the BRBE before we enter the guest, saving the status, and enable it
>>> back once we get out of the guest. This is just to avoid capturing records
>>> in the guest kernel/userspace, which would confuse the samples.
>>>
>>> Cc: Marc Zyngier <maz@kernel.org>
>>> Cc: Oliver Upton <oliver.upton@linux.dev>
>>> Cc: James Morse <james.morse@arm.com>
>>> Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
>>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>>> Cc: Will Deacon <will@kernel.org>
>>> Cc: kvmarm@lists.linux.dev
>>> Cc: linux-arm-kernel@lists.infradead.org
>>> CC: linux-kernel@vger.kernel.org
>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>> ---
>>> Changes in V14:
>>>
>>> - This is a new patch in the series
>>>
>>> arch/arm64/include/asm/kvm_host.h | 4 ++++
>>> arch/arm64/kvm/debug.c | 6 +++++
>>> arch/arm64/kvm/hyp/nvhe/debug-sr.c | 38 ++++++++++++++++++++++++++++++
>>> 3 files changed, 48 insertions(+)
>>>
>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>> index 68421c74283a..1faa0430d8dd 100644
>>> --- a/arch/arm64/include/asm/kvm_host.h
>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>> @@ -449,6 +449,8 @@ enum vcpu_sysreg {
>>> CNTHV_CVAL_EL2,
>>> PMSCR_EL1, /* Statistical profiling extension */
>>> TRFCR_EL1, /* Self-hosted trace filters */
>>> + BRBCR_EL1, /* Branch Record Buffer Control Register */
>>> + BRBFCR_EL1, /* Branch Record Buffer Function Control Register */
>>>
>>> NR_SYS_REGS /* Nothing after this line! */
>>> };
>>> @@ -753,6 +755,8 @@ struct kvm_vcpu_arch {
>>> #define VCPU_HYP_CONTEXT __vcpu_single_flag(iflags, BIT(7))
>>> /* Save trace filter controls */
>>> #define DEBUG_STATE_SAVE_TRFCR __vcpu_single_flag(iflags, BIT(8))
>>> +/* Save BRBE context if active */
>>> +#define DEBUG_STATE_SAVE_BRBE __vcpu_single_flag(iflags, BIT(9))
>>>
>>> /* SVE enabled for host EL0 */
>>> #define HOST_SVE_ENABLED __vcpu_single_flag(sflags, BIT(0))
>>> diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
>>> index 2ab41b954512..4055783c3d34 100644
>>> --- a/arch/arm64/kvm/debug.c
>>> +++ b/arch/arm64/kvm/debug.c
>>> @@ -354,6 +354,11 @@ void kvm_arch_vcpu_load_debug_state_flags(struct kvm_vcpu *vcpu)
>>> !(read_sysreg_s(SYS_TRBIDR_EL1) & TRBIDR_EL1_P))
>>> vcpu_set_flag(vcpu, DEBUG_STATE_SAVE_TRBE);
>>> }
>>> +
>>> + /* Check if we have BRBE implemented and available at the host */
>>> + if (cpuid_feature_extract_unsigned_field(dfr0, ID_AA64DFR0_EL1_BRBE_SHIFT) &&
>>> + (read_sysreg_s(SYS_BRBCR_EL1) & (BRBCR_ELx_E0BRE | BRBCR_ELx_ExBRE)))
>>> + vcpu_set_flag(vcpu, DEBUG_STATE_SAVE_BRBE);
>>
>> Isn't this supposed to just be the feature check? Whether BRBE is
>> enabled or not is checked later in __debug_save_brbe() anyway.
>
> Okay, will make it just a feature check via ID_AA64DFR0_EL1_BRBE_SHIFT.
>
>>
>> It seems like it's possible to become enabled after this flag load part.
>
> Agreed.
>
>>
>>> }
>>>
>>> void kvm_arch_vcpu_put_debug_state_flags(struct kvm_vcpu *vcpu)
>>> @@ -361,6 +366,7 @@ void kvm_arch_vcpu_put_debug_state_flags(struct kvm_vcpu *vcpu)
>>> vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_SPE);
>>> vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_TRBE);
>>> vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_TRFCR);
>>> + vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_BRBE);
>>> }
>>>
>>> void kvm_etm_set_guest_trfcr(u64 trfcr_guest)
>>> diff --git a/arch/arm64/kvm/hyp/nvhe/debug-sr.c b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
>>> index 6174f710948e..e44a1f71a0f8 100644
>>> --- a/arch/arm64/kvm/hyp/nvhe/debug-sr.c
>>> +++ b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
>>> @@ -93,6 +93,38 @@ static void __debug_restore_trace(struct kvm_cpu_context *host_ctxt,
>>> write_sysreg_s(ctxt_sys_reg(host_ctxt, TRFCR_EL1), SYS_TRFCR_EL1);
>>> }
>>>
>>> +static void __debug_save_brbe(struct kvm_cpu_context *host_ctxt)
>>> +{
>>> + ctxt_sys_reg(host_ctxt, BRBCR_EL1) = 0;
>>> + ctxt_sys_reg(host_ctxt, BRBFCR_EL1) = 0;
>>> +
>>> + /* Check if the BRBE is enabled */
>>> + if (!(ctxt_sys_reg(host_ctxt, BRBCR_EL1) & (BRBCR_ELx_E0BRE | BRBCR_ELx_ExBRE)))
>>> + return;
>>
>> Doesn't this always fail, the host BRBCR_EL1 value was just cleared on
>> the line above.
>
> Agreed, this error might have slipped in while converting to ctxt_sys_reg().
>
>>
>> Also, you need to read the register to determine if it was enabled or
>
> Right
>
>> not, so you might as well always store the real value, rather than 0 in
>> the not enabled case.
>
>> But if it is not enabled - why store the real value?
>
It's fewer lines of code and it's less likely to catch someone out if
it's always set to whatever the host value was. Using 0 as a special
value could also be an issue because it's indistinguishable from the
register actually being set to 0. It's just more to reason about when
you could reduce it to a single assignment.
Also it probably would have avoided the current mistake if it was always
assigned to the host value as well.
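i.e. the start of __debug_save_brbe() could just be something like this
(only a sketch, untested):

	ctxt_sys_reg(host_ctxt, BRBCR_EL1) = read_sysreg_s(SYS_BRBCR_EL1);

	/* Check if the BRBE is enabled */
	if (!(ctxt_sys_reg(host_ctxt, BRBCR_EL1) &
	      (BRBCR_ELx_E0BRE | BRBCR_ELx_ExBRE)))
		return;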
>>
>>> +
>>> + /*
>>> + * Prohibit branch record generation while we are in guest.
>>> + * Since access to BRBCR_EL1 and BRBFCR_EL1 is trapped, the
>>> + * guest can't modify the filtering set by the host.
>>> + */
>>> + ctxt_sys_reg(host_ctxt, BRBCR_EL1) = read_sysreg_s(SYS_BRBCR_EL1);
>>> + ctxt_sys_reg(host_ctxt, BRBFCR_EL1) = read_sysreg_s(SYS_BRBFCR_EL1);
>>> + write_sysreg_s(0, SYS_BRBCR_EL1);
>>> + write_sysreg_s(0, SYS_BRBFCR_EL1);
>>
>> Why does SYS_BRBFCR_EL1 need to be saved and restored? Only
>> BRBCR_ELx_E0BRE and BRBCR_ELx_ExBRE need to be cleared to disable BRBE.
>
> Right, I just thought both the brbcr and brbfcr system registers represent
> the current BRBE state (besides branch records) in a more comprehensive
> manner, although neither would be changed from inside the guest.
>
The comment above doesn't match up with this explanation.
Having it in the code implies that it's needed. And as you say the
branch records are missing anyway, so you can't even infer that it's
only done to be comprehensive.
It would be better to not make anyone reading it wonder why it's done
and just not do it. It's only 8 bytes but it's also a waste of space.
>>
>>> + isb();
>>> +}
>>> +
>>> +static void __debug_restore_brbe(struct kvm_cpu_context *host_ctxt)
>>> +{
>>> + if (!ctxt_sys_reg(host_ctxt, BRBCR_EL1) || !ctxt_sys_reg(host_ctxt, BRBFCR_EL1))
>>> + return;
>>> +
>>> + /* Restore BRBE controls */
>>> + write_sysreg_s(ctxt_sys_reg(host_ctxt, BRBCR_EL1), SYS_BRBCR_EL1);
>>> + write_sysreg_s(ctxt_sys_reg(host_ctxt, BRBFCR_EL1), SYS_BRBFCR_EL1);
>>> + isb();
>>> +}
>>> +
>>> void __debug_save_host_buffers_nvhe(struct kvm_cpu_context *host_ctxt,
>>> struct kvm_cpu_context *guest_ctxt)
>>> {
>>> @@ -102,6 +134,10 @@ void __debug_save_host_buffers_nvhe(struct kvm_cpu_context *host_ctxt,
>>>
>>> if (vcpu_get_flag(host_ctxt->__hyp_running_vcpu, DEBUG_STATE_SAVE_TRFCR))
>>> __debug_save_trace(host_ctxt, guest_ctxt);
>>> +
>>> + /* Disable BRBE branch records */
>>> + if (vcpu_get_flag(host_ctxt->__hyp_running_vcpu, DEBUG_STATE_SAVE_BRBE))
>>> + __debug_save_brbe(host_ctxt);
>>> }
>>>
>>> void __debug_switch_to_guest(struct kvm_vcpu *vcpu)
>>> @@ -116,6 +152,8 @@ void __debug_restore_host_buffers_nvhe(struct kvm_cpu_context *host_ctxt,
>>> __debug_restore_spe(host_ctxt);
>>> if (vcpu_get_flag(host_ctxt->__hyp_running_vcpu, DEBUG_STATE_SAVE_TRFCR))
>>> __debug_restore_trace(host_ctxt, guest_ctxt);
>>> + if (vcpu_get_flag(host_ctxt->__hyp_running_vcpu, DEBUG_STATE_SAVE_BRBE))
>>> + __debug_restore_brbe(host_ctxt);
>>> }
>>>
>>> void __debug_switch_to_host(struct kvm_vcpu *vcpu)
>
* Re: [V14 0/8] arm64/perf: Enable branch stack sampling
2023-11-22 5:15 ` Anshuman Khandual
@ 2023-11-23 16:23 ` James Clark
0 siblings, 0 replies; 30+ messages in thread
From: James Clark @ 2023-11-23 16:23 UTC (permalink / raw)
To: Anshuman Khandual
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, linux-arm-kernel, linux-kernel, will,
catalin.marinas, mark.rutland
On 22/11/2023 05:15, Anshuman Khandual wrote:
> On 11/14/23 22:47, James Clark wrote:
>>
>>
>> On 14/11/2023 05:13, Anshuman Khandual wrote:
>>> This series enables perf branch stack sampling support on arm64 platform
>>> via a new arch feature called Branch Record Buffer Extension (BRBE). All
>>> the relevant register definitions could be accessed here.
>>>
>> [...]
>>>
>>> --------------------------- Virtualisation support ------------------------
>>>
>>> - Branch stack sampling is not currently supported inside the guest (TODO)
>>>
>>> - FEAT_BRBE advertised as absent via clearing ID_AA64DFR0_EL1.BRBE
>>> - Future support in guest requires emulating FEAT_BRBE
>>
>> If you never add support for the host looking into a guest, and you save
>
> But that seems to be a valid use case though. Is there a particular concern
> why such a capability should or could not be added for BRBE?
>
What's the use case exactly? You wouldn't even have the binary mappings
of the guest without running perf inside the guest too, and at that
point you might as well have just done the BRBE recording from inside
the guest.
My particular concern is only about the effort required to implement it
versus its usefulness. It's not that we shouldn't ever implement the fully
shared BRBE between host and guest; we could always do it later. My idea
was just to get BRBE working inside guests quicker.
>> and restore all the BRBINF[n] registers, I think you might be able to
>> just let the guest do whatever it wants with BRBE and not trap and
>> emulate it? Maybe there is some edge case why that wouldn't work, but
>> it's worth thinking about.
>
> Right, in case host tracing of the guest is not supported (although still
> wondering why it should not be), saving and restoring the complete BRBE state
> i.e all system registers that can be accessed from the guest, would let the
> guest do whatever it wants with BRBE without requiring the trap-and-emulate model.
>
>>
>> For BRBE specifically I don't see much of a use case for hosts looking
>> into a guest, at least not like with PMU counters.
> But how is it any different from normal PMU counters? Branch records do
> provide statistical insights into hot sections in the guest.
>
There is a big difference: PMU counters can be used to infer general
things about a system without any extra information. That's something
that could be used by a monitoring task or someone looking at a guest
running a known workload.
But for BRBE you need the binaries, mappings, scheduling events, thread
switches etc. to make any sense of the pointers in the branch buffers;
otherwise they're just random numbers from who knows which process.
* Re: [V14 3/8] drivers: perf: arm_pmuv3: Enable branch stack sampling framework
2023-11-23 12:35 ` James Clark
@ 2023-11-27 8:06 ` Anshuman Khandual
0 siblings, 0 replies; 30+ messages in thread
From: Anshuman Khandual @ 2023-11-27 8:06 UTC (permalink / raw)
To: James Clark
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, linux-arm-kernel, linux-kernel, will,
catalin.marinas, mark.rutland
On 11/23/23 18:05, James Clark wrote:
>
>
> On 21/11/2023 09:57, Anshuman Khandual wrote:
>>
>>
>> On 11/15/23 15:37, James Clark wrote:
>>>
>>>
>>> On 15/11/2023 07:22, Anshuman Khandual wrote:
>>>> On 11/14/23 17:44, James Clark wrote:
>>>>>
>>>>>
>>>>> On 14/11/2023 05:13, Anshuman Khandual wrote:
>>>>> [...]
>>>>>
>>>>>> +/*
>>>>>> + * This is a read only constant and safe during multi threaded access
>>>>>> + */
>>>>>> +static struct perf_branch_stack zero_branch_stack = { .nr = 0, .hw_idx = -1ULL};
>>>>>> +
>>>>>> +static void read_branch_records(struct pmu_hw_events *cpuc,
>>>>>> + struct perf_event *event,
>>>>>> + struct perf_sample_data *data,
>>>>>> + bool *branch_captured)
>>>>>> +{
>>>>>> + /*
>>>>>> + * CPU specific branch records buffer must have been allocated already
>>>>>> + * for the hardware records to be captured and processed further.
>>>>>> + */
>>>>>> + if (WARN_ON(!cpuc->branches))
>>>>>> + return;
>>>>>> +
>>>>>> + /*
>>>>>> + * Overflowed event's branch_sample_type does not match the configured
>>>>>> + * branch filters in the BRBE HW. So the captured branch records here
>>>>>> + * cannot be co-related to the overflowed event. Report to the user as
>>>>>> + * if no branch records have been captured, and flush branch records.
>>>>>> + * The same scenario is applicable when the current task context does
>>>>>> + * not match the overflown event.
>>>>>> + */
>>>>>> + if ((cpuc->brbe_sample_type != event->attr.branch_sample_type) ||
>>>>>> + (event->ctx->task && cpuc->brbe_context != event->ctx)) {
>>>>>> + perf_sample_save_brstack(data, event, &zero_branch_stack);
>>>>>
>>>>> Is there any benefit to outputting a zero size stack vs not outputting
>>>>> anything at all?
>>>>
>>>> The event has got PERF_SAMPLE_BRANCH_STACK marked and hence perf_sample_data
>>>> must have PERF_SAMPLE_BRANCH_STACK with its br_stack pointing to the branch
>>>> records. Hence without assigning a zeroed struct perf_branch_stack, there is
>>>> a chance that perf_sample_data will pass on some garbage branch records to
>>>> the ring buffer.
>>>>
>>>
>>> I don't think that's an issue, the perf core code handles the case where
>>> no branch stack exists on a sample. It even outputs the zero length for
>>> you, but there is other stuff that can be skipped if you just never call
>>> perf_sample_save_brstack():
>>
>> Sending out perf_sample_data without a valid data->br_stack seems problematic,
>> which would be the case when perf_sample_save_brstack() never gets called on
>> the perf_sample_data being prepared, relying on the 'else' case below for
>> pushing out zero records.
>>
>
> I'm not following why it would be problematic. data->br_stack is
> initialised to NULL in perf_prepare_sample() and the core code
> specifically has a path that was added for the case where
> perf_sample_save_brstack() was never called.
Without perf_sample_save_brstack() being called, the perf sample data will
preserve 'data->br_stack' unchanged as NULL from perf_prepare_sample(),
and the 'data->br_stack' element will eventually be skipped in
perf_output_sample().
void perf_prepare_sample(struct perf_sample_data *data,
			 struct perf_event *event,
			 struct pt_regs *regs)
{
	....
	if (filtered_sample_type & PERF_SAMPLE_BRANCH_STACK) {
		data->br_stack = NULL;
		data->dyn_size += sizeof(u64);
		data->sample_flags |= PERF_SAMPLE_BRANCH_STACK;
	}
	....
}

void perf_output_sample(struct perf_output_handle *handle,
			struct perf_event_header *header,
			struct perf_sample_data *data,
			struct perf_event *event)
{
	....
	if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
		if (data->br_stack) {
			size_t size;

			size = data->br_stack->nr
			     * sizeof(struct perf_branch_entry);

			perf_output_put(handle, data->br_stack->nr);
			if (branch_sample_hw_index(event))
				perf_output_put(handle, data->br_stack->hw_idx);
			perf_output_copy(handle, data->br_stack->entries, size);
		} else {
			/*
			 * we always store at least the value of nr
			 */
			u64 nr = 0;
			perf_output_put(handle, nr);
		}
	}
	....
}
* Re: [V14 5/8] KVM: arm64: nvhe: Disable branch generation in nVHE guests
2023-11-23 13:54 ` James Clark
@ 2023-11-27 8:25 ` Anshuman Khandual
0 siblings, 0 replies; 30+ messages in thread
From: Anshuman Khandual @ 2023-11-27 8:25 UTC (permalink / raw)
To: James Clark
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, Oliver Upton, James Morse, kvmarm,
linux-arm-kernel, linux-kernel, will, catalin.marinas,
mark.rutland
On 11/23/23 19:24, James Clark wrote:
>
>
> On 21/11/2023 11:12, Anshuman Khandual wrote:
>>
>>
>> On 11/14/23 14:46, James Clark wrote:
>>>
>>>
>>> On 14/11/2023 05:13, Anshuman Khandual wrote:
>>>> Disable the BRBE before we enter the guest, saving the status, and enable it
>>>> back once we get out of the guest. This is just to avoid capturing records
>>>> in the guest kernel/userspace, which would confuse the samples.
>>>>
>>>> Cc: Marc Zyngier <maz@kernel.org>
>>>> Cc: Oliver Upton <oliver.upton@linux.dev>
>>>> Cc: James Morse <james.morse@arm.com>
>>>> Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
>>>> Cc: Catalin Marinas <catalin.marinas@arm.com>
>>>> Cc: Will Deacon <will@kernel.org>
>>>> Cc: kvmarm@lists.linux.dev
>>>> Cc: linux-arm-kernel@lists.infradead.org
>>>> CC: linux-kernel@vger.kernel.org
>>>> Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com>
>>>> ---
>>>> Changes in V14:
>>>>
>>>> - This is a new patch in the series
>>>>
>>>> arch/arm64/include/asm/kvm_host.h | 4 ++++
>>>> arch/arm64/kvm/debug.c | 6 +++++
>>>> arch/arm64/kvm/hyp/nvhe/debug-sr.c | 38 ++++++++++++++++++++++++++++++
>>>> 3 files changed, 48 insertions(+)
>>>>
>>>> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
>>>> index 68421c74283a..1faa0430d8dd 100644
>>>> --- a/arch/arm64/include/asm/kvm_host.h
>>>> +++ b/arch/arm64/include/asm/kvm_host.h
>>>> @@ -449,6 +449,8 @@ enum vcpu_sysreg {
>>>> CNTHV_CVAL_EL2,
>>>> PMSCR_EL1, /* Statistical profiling extension */
>>>> TRFCR_EL1, /* Self-hosted trace filters */
>>>> + BRBCR_EL1, /* Branch Record Buffer Control Register */
>>>> + BRBFCR_EL1, /* Branch Record Buffer Function Control Register */
>>>>
>>>> NR_SYS_REGS /* Nothing after this line! */
>>>> };
>>>> @@ -753,6 +755,8 @@ struct kvm_vcpu_arch {
>>>> #define VCPU_HYP_CONTEXT __vcpu_single_flag(iflags, BIT(7))
>>>> /* Save trace filter controls */
>>>> #define DEBUG_STATE_SAVE_TRFCR __vcpu_single_flag(iflags, BIT(8))
>>>> +/* Save BRBE context if active */
>>>> +#define DEBUG_STATE_SAVE_BRBE __vcpu_single_flag(iflags, BIT(9))
>>>>
>>>> /* SVE enabled for host EL0 */
>>>> #define HOST_SVE_ENABLED __vcpu_single_flag(sflags, BIT(0))
>>>> diff --git a/arch/arm64/kvm/debug.c b/arch/arm64/kvm/debug.c
>>>> index 2ab41b954512..4055783c3d34 100644
>>>> --- a/arch/arm64/kvm/debug.c
>>>> +++ b/arch/arm64/kvm/debug.c
>>>> @@ -354,6 +354,11 @@ void kvm_arch_vcpu_load_debug_state_flags(struct kvm_vcpu *vcpu)
>>>> !(read_sysreg_s(SYS_TRBIDR_EL1) & TRBIDR_EL1_P))
>>>> vcpu_set_flag(vcpu, DEBUG_STATE_SAVE_TRBE);
>>>> }
>>>> +
>>>> + /* Check if we have BRBE implemented and available at the host */
>>>> + if (cpuid_feature_extract_unsigned_field(dfr0, ID_AA64DFR0_EL1_BRBE_SHIFT) &&
>>>> + (read_sysreg_s(SYS_BRBCR_EL1) & (BRBCR_ELx_E0BRE | BRBCR_ELx_ExBRE)))
>>>> + vcpu_set_flag(vcpu, DEBUG_STATE_SAVE_BRBE);
>>>
>>> Isn't this supposed to just be the feature check? Whether BRBE is
>>> enabled or not is checked later in __debug_save_brbe() anyway.
>>
>> Okay, will make it just a feature check via ID_AA64DFR0_EL1_BRBE_SHIFT.
>>
>>>
>>> It seems like it's possible to become enabled after this flag load part.
>>
>> Agreed.
>>
>>>
>>>> }
>>>>
>>>> void kvm_arch_vcpu_put_debug_state_flags(struct kvm_vcpu *vcpu)
>>>> @@ -361,6 +366,7 @@ void kvm_arch_vcpu_put_debug_state_flags(struct kvm_vcpu *vcpu)
>>>> vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_SPE);
>>>> vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_TRBE);
>>>> vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_TRFCR);
>>>> + vcpu_clear_flag(vcpu, DEBUG_STATE_SAVE_BRBE);
>>>> }
>>>>
>>>> void kvm_etm_set_guest_trfcr(u64 trfcr_guest)
>>>> diff --git a/arch/arm64/kvm/hyp/nvhe/debug-sr.c b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
>>>> index 6174f710948e..e44a1f71a0f8 100644
>>>> --- a/arch/arm64/kvm/hyp/nvhe/debug-sr.c
>>>> +++ b/arch/arm64/kvm/hyp/nvhe/debug-sr.c
>>>> @@ -93,6 +93,38 @@ static void __debug_restore_trace(struct kvm_cpu_context *host_ctxt,
>>>> write_sysreg_s(ctxt_sys_reg(host_ctxt, TRFCR_EL1), SYS_TRFCR_EL1);
>>>> }
>>>>
>>>> +static void __debug_save_brbe(struct kvm_cpu_context *host_ctxt)
>>>> +{
>>>> + ctxt_sys_reg(host_ctxt, BRBCR_EL1) = 0;
>>>> + ctxt_sys_reg(host_ctxt, BRBFCR_EL1) = 0;
>>>> +
>>>> + /* Check if the BRBE is enabled */
>>>> + if (!(ctxt_sys_reg(host_ctxt, BRBCR_EL1) & (BRBCR_ELx_E0BRE | BRBCR_ELx_ExBRE)))
>>>> + return;
>>>
>>> Doesn't this always fail, the host BRBCR_EL1 value was just cleared on
>>> the line above.
>>
>> Agreed, this error might have slipped in while converting to ctxt_sys_reg().
>>
>>>
>>> Also, you need to read the register to determine if it was enabled or
>>
>> Right
>>
>>> not, so you might as well always store the real value, rather than 0 in
>>> the not enabled case.
>>
>> But if it is not enabled - why store the real value?
>>
>
> It's fewer lines of code and it's less likely to catch someone out if
> it's always set to whatever the host value was. Using 0 as a special
> value could also be an issue because it's indistinguishable from the
> register actually being set to 0. It's just more to reason about when
> you could reduce it to a single assignment.
>
> Also it probably would have avoided the current mistake if it was always
> assigned to the host value as well.
Okay, will always save SYS_BRBCR_EL1 into ctxt_sys_reg(host_ctxt, BRBCR_EL1).
>
>>>
>>>> +
>>>> + /*
>>>> + * Prohibit branch record generation while we are in guest.
>>>> + * Since access to BRBCR_EL1 and BRBFCR_EL1 is trapped, the
>>>> + * guest can't modify the filtering set by the host.
>>>> + */
>>>> + ctxt_sys_reg(host_ctxt, BRBCR_EL1) = read_sysreg_s(SYS_BRBCR_EL1);
>>>> + ctxt_sys_reg(host_ctxt, BRBFCR_EL1) = read_sysreg_s(SYS_BRBFCR_EL1);
>>>> + write_sysreg_s(0, SYS_BRBCR_EL1);
>>>> + write_sysreg_s(0, SYS_BRBFCR_EL1);
>>>
>>> Why does SYS_BRBFCR_EL1 need to be saved and restored? Only
>>> BRBCR_ELx_E0BRE and BRBCR_ELx_ExBRE need to be cleared to disable BRBE.
>>
>> Right, I just thought both the brbcr and brbfcr system registers represent
>> the current BRBE state (besides branch records) in a more comprehensive
>> manner, although neither would be changed from inside the guest.
>>
>
> The comment above doesn't match up with this explanation.
>
> Having it in the code implies that it's needed. And as you say the
> branch records are missing anyway, so you can't even infer that it's
> only done to be comprehensive.
>
> It would be better to not make anyone reading it wonder why it's done
> and just not do it. It's only 8 bytes but it's also a waste of space.
Sure, will drop the BRBFCR_EL1 handling here. The changed code is something
like the following:
static void __debug_save_brbe(struct kvm_cpu_context *host_ctxt)
{
	ctxt_sys_reg(host_ctxt, BRBCR_EL1) = read_sysreg_s(SYS_BRBCR_EL1);

	/* Check if the BRBE is enabled */
	if (!(ctxt_sys_reg(host_ctxt, BRBCR_EL1) & (BRBCR_ELx_E0BRE | BRBCR_ELx_ExBRE)))
		return;

	/*
	 * Prohibit branch record generation while we are in guest.
	 * Since access to BRBCR_EL1 is trapped, the guest can't
	 * modify the filtering set by the host.
	 */
	write_sysreg_s(0, SYS_BRBCR_EL1);
	isb();
}

static void __debug_restore_brbe(struct kvm_cpu_context *host_ctxt)
{
	if (!ctxt_sys_reg(host_ctxt, BRBCR_EL1))
		return;

	/* Restore BRBE controls */
	write_sysreg_s(ctxt_sys_reg(host_ctxt, BRBCR_EL1), SYS_BRBCR_EL1);
	isb();
}
* Re: [V14 3/8] drivers: perf: arm_pmuv3: Enable branch stack sampling framework
2023-11-14 17:10 ` James Clark
@ 2023-11-30 3:58 ` Anshuman Khandual
0 siblings, 0 replies; 30+ messages in thread
From: Anshuman Khandual @ 2023-11-30 3:58 UTC (permalink / raw)
To: James Clark
Cc: Mark Brown, Rob Herring, Marc Zyngier, Suzuki Poulose,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
linux-perf-users, linux-arm-kernel, linux-kernel, will,
catalin.marinas, mark.rutland
On 11/14/23 22:40, James Clark wrote:
>
> On 14/11/2023 05:13, Anshuman Khandual wrote:
> [...]
>> diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
>> index d712a19e47ac..76f1376ae594 100644
>> --- a/drivers/perf/arm_pmu.c
>> +++ b/drivers/perf/arm_pmu.c
>> @@ -317,6 +317,15 @@ armpmu_del(struct perf_event *event, int flags)
>>  	struct hw_perf_event *hwc = &event->hw;
>>  	int idx = hwc->idx;
>>
>> +	if (has_branch_stack(event)) {
>> +		WARN_ON_ONCE(!hw_events->brbe_users);
>> +		hw_events->brbe_users--;
>> +		if (!hw_events->brbe_users) {
>> +			hw_events->brbe_context = NULL;
>> +			hw_events->brbe_sample_type = 0;
>> +		}
>> +	}
>> +
>>  	armpmu_stop(event, PERF_EF_UPDATE);
>>  	hw_events->events[idx] = NULL;
>>  	armpmu->clear_event_idx(hw_events, event);
>> @@ -333,6 +342,22 @@ armpmu_add(struct perf_event *event, int flags)
>>  	struct hw_perf_event *hwc = &event->hw;
>>  	int idx;
>>
>> +	if (has_branch_stack(event)) {
>> +		/*
>> +		 * Reset branch records buffer if a new task event gets
>> +		 * scheduled on a PMU which might have existing records.
>> +		 * Otherwise older branch records present in the buffer
>> +		 * might leak into the new task event.
>> +		 */
>> +		if (event->ctx->task && hw_events->brbe_context != event->ctx) {
>> +			hw_events->brbe_context = event->ctx;
>> +			if (armpmu->branch_reset)
>> +				armpmu->branch_reset();
> What about a per-thread event following a per-cpu event? Doesn't that
> also need to branch_reset()? If hw_events->brbe_context was already
> previously assigned, once the per-thread event is switched in it skips
> this reset following a per-cpu event on the same core.
Right, I guess it is a real possibility. How about folding in something like
the following:
diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
index 76f1376ae594..15bb80823ae6 100644
--- a/drivers/perf/arm_pmu.c
+++ b/drivers/perf/arm_pmu.c
@@ -343,6 +343,22 @@ armpmu_add(struct perf_event *event, int flags)
 	int idx;
 
 	if (has_branch_stack(event)) {
+		/*
+		 * Reset branch records buffer if a new CPU bound event
+		 * gets scheduled on a PMU. Otherwise existing branch
+		 * records present in the buffer might just leak into
+		 * such events.
+		 *
+		 * Also reset the current 'hw_events->brbe_context' because
+		 * any previous task bound event would now have lost an
+		 * opportunity for continuous branch records.
+		 */
+		if (!event->ctx->task) {
+			hw_events->brbe_context = NULL;
+			if (armpmu->branch_reset)
+				armpmu->branch_reset();
+		}
+
 		/*
 		 * Reset branch records buffer if a new task event gets
 		 * scheduled on a PMU which might have existing records.
end of thread
Thread overview: 30+ messages
2023-11-14 5:13 [V14 0/8] arm64/perf: Enable branch stack sampling Anshuman Khandual
2023-11-14 5:13 ` [V14 1/8] arm64/sysreg: Add BRBE registers and fields Anshuman Khandual
2023-11-14 5:13 ` [V14 2/8] KVM: arm64: Prevent guest accesses into BRBE system registers/instructions Anshuman Khandual
2023-11-14 5:13 ` [V14 3/8] drivers: perf: arm_pmuv3: Enable branch stack sampling framework Anshuman Khandual
2023-11-14 9:58 ` James Clark
2023-11-15 5:44 ` Anshuman Khandual
2023-11-15 9:37 ` James Clark
2023-11-21 9:13 ` Anshuman Khandual
2023-11-14 12:14 ` James Clark
2023-11-15 7:22 ` Anshuman Khandual
2023-11-15 10:07 ` James Clark
2023-11-21 9:57 ` Anshuman Khandual
2023-11-23 12:35 ` James Clark
2023-11-27 8:06 ` Anshuman Khandual
2023-11-14 17:10 ` James Clark
2023-11-30 3:58 ` Anshuman Khandual
2023-11-14 5:13 ` [V14 4/8] drivers: perf: arm_pmuv3: Enable branch stack sampling via FEAT_BRBE Anshuman Khandual
2023-11-14 12:11 ` James Clark
2023-11-21 10:47 ` Anshuman Khandual
2023-11-14 5:13 ` [V14 5/8] KVM: arm64: nvhe: Disable branch generation in nVHE guests Anshuman Khandual
2023-11-14 9:16 ` James Clark
2023-11-21 11:12 ` Anshuman Khandual
2023-11-23 13:54 ` James Clark
2023-11-27 8:25 ` Anshuman Khandual
2023-11-14 5:13 ` [V14 6/8] perf: test: Speed up running brstack test on an Arm model Anshuman Khandual
2023-11-14 5:13 ` [V14 7/8] perf: test: Remove empty lines from branch filter test output Anshuman Khandual
2023-11-14 5:13 ` [V14 8/8] perf: test: Extend branch stack sampling test for Arm64 BRBE Anshuman Khandual
2023-11-14 17:17 ` [V14 0/8] arm64/perf: Enable branch stack sampling James Clark
2023-11-22 5:15 ` Anshuman Khandual
2023-11-23 16:23 ` James Clark