* [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake
@ 2025-02-18 15:27 Dapeng Mi
2025-02-18 15:27 ` [Patch v2 01/24] perf/x86: Add dynamic constraint Dapeng Mi
` (23 more replies)
0 siblings, 24 replies; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:27 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
This v2 patch series is based on the latest perf/core tree, commit
1623ced247f7 ("x86/events/amd/iommu: Increase IOMMU_NAME_SIZE"), plus the
first two patches of the patch set "Cleanup for Intel PMU initialization"[1].
Changes:
v1 -> v2:
* Add Panther Lake PMU support (patch 02/24)
* Add PEBS static calls to avoid introducing too many
  x86_pmu.arch_pebs checks (patch 07~08/24)
* Optimize PEBS constraints based on Kan's dynamic constraint patch
  (patch 13/24)
* Split the perf tools patch supporting more vector registers into
  several small patches (patch 20~22/24)
Tests:
* Ran the below tests on Clearwater Forest and no issues were found.
Please note that nmi_watchdog is disabled when running the tests.
a. Basic perf counting case.
perf stat -e '{branches,branches,branches,branches,branches,branches,branches,branches,cycles,instructions,ref-cycles,topdown-bad-spec,topdown-fe-bound,topdown-retiring}' sleep 1
b. Basic PMI based perf sampling case.
perf record -e '{branches,branches,branches,branches,branches,branches,branches,branches,cycles,instructions,ref-cycles,topdown-bad-spec,topdown-fe-bound,topdown-retiring}' sleep 1
c. Basic PEBS based perf sampling case.
perf record -e '{branches,branches,branches,branches,branches,branches,branches,branches,cycles,instructions,ref-cycles,topdown-bad-spec,topdown-fe-bound,topdown-retiring}:p' sleep 1
d. PEBS sampling case with basic, GPRs, vector-registers and LBR groups
perf record -e branches:p -Iax,bx,ip,ssp,xmm0,ymmh0 -b -c 10000 sleep 1
e. PEBS sampling case with auxiliary (memory info) group
perf mem record sleep 1
f. PEBS sampling case with counter group
perf record -e '{branches:p,branches,cycles}:S' -c 10000 sleep 1
g. Perf stat and record test
perf test 95; perf test 119
h. perf-fuzzer test
* Ran similar tests on Panther Lake P-cores and E-cores and no issues
were found. CPU 0 is a P-core and CPU 9 is an E-core. nmi_watchdog is
disabled as well.
P-core:
a. Basic perf counting case.
perf stat -e '{cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/cycles/,cpu_core/instructions/,cpu_core/ref-cycles/,cpu_core/slots/}' taskset -c 0 sleep 1
b. Basic PMI based perf sampling case.
perf record -e '{cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/cycles/,cpu_core/instructions/,cpu_core/ref-cycles/,cpu_core/slots/}' taskset -c 0 sleep 1
c. Basic PEBS based perf sampling case.
perf record -e '{cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/branches/,cpu_core/cycles/,cpu_core/instructions/,cpu_core/ref-cycles/,cpu_core/slots/}:p' taskset -c 0 sleep 1
d. PEBS sampling case with basic, GPRs, vector-registers and LBR groups
perf record -e branches:p -Iax,bx,ip,ssp,xmm0,ymmh0 -b -c 10000 taskset -c 0 sleep 1
e. PEBS sampling case for user space registers
perf record -e branches:p --user-regs=ax,bx,ip -b -c 10000 taskset -c 0 sleep 1
f. PEBS sampling case with auxiliary (memory info) group
perf mem record taskset -c 0 sleep 1
g. PEBS sampling case with counter group
perf record -e '{branches:p,branches,cycles}:S' -c 10000 taskset -c 0 sleep 1
h. Perf stat and record test
perf test 95; perf test 119
E-core:
a. Basic perf counting case.
perf stat -e '{cpu_atom/branches/,cpu_atom/branches/,cpu_atom/branches/,cpu_atom/branches/,cpu_atom/branches/,cpu_atom/branches/,cpu_atom/branches/,cpu_atom/branches/,cpu_atom/cycles/,cpu_atom/instructions/,cpu_atom/ref-cycles/,cpu_atom/topdown-bad-spec/,cpu_atom/topdown-fe-bound/,cpu_atom/topdown-retiring/}' taskset -c 9 sleep 1
b. Basic PMI based perf sampling case.
perf record -e '{cpu_atom/branches/,cpu_atom/branches/,cpu_atom/branches/,cpu_atom/branches/,cpu_atom/branches/,cpu_atom/branches/,cpu_atom/branches/,cpu_atom/branches/,cpu_atom/cycles/,cpu_atom/instructions/,cpu_atom/ref-cycles/,cpu_atom/topdown-bad-spec/,cpu_atom/topdown-fe-bound/,cpu_atom/topdown-retiring/}' taskset -c 9 sleep 1
c. Basic PEBS based perf sampling case.
perf record -e '{cpu_atom/branches/,cpu_atom/branches/,cpu_atom/branches/,cpu_atom/branches/,cpu_atom/branches/,cpu_atom/branches/,cpu_atom/branches/,cpu_atom/branches/,cpu_atom/cycles/,cpu_atom/instructions/,cpu_atom/ref-cycles/,cpu_atom/topdown-bad-spec/,cpu_atom/topdown-fe-bound/,cpu_atom/topdown-retiring/}:p' taskset -c 9 sleep 1
d. PEBS sampling case with basic, GPRs, vector-registers and LBR groups
perf record -e branches:p -Iax,bx,ip,ssp,xmm0,ymmh0 -b -c 10000 taskset -c 9 sleep 1
e. PEBS sampling case for user space registers
perf record -e branches:p --user-regs=ax,bx,ip -b -c 10000 taskset -c 9 sleep 1
f. PEBS sampling case with auxiliary (memory info) group
perf mem record taskset -c 9 sleep 1
g. PEBS sampling case with counter group
perf record -e '{branches:p,branches,cycles}:S' -c 10000 taskset -c 9 sleep 1
History:
v1: https://lore.kernel.org/all/20250123140721.2496639-1-dapeng1.mi@linux.intel.com/
Ref:
[1]: https://lore.kernel.org/all/20250129154820.3755948-1-kan.liang@linux.intel.com/
Dapeng Mi (22):
perf/x86/intel: Add PMU support for Clearwater Forest
perf/x86/intel: Parse CPUID archPerfmonExt leaves for non-hybrid CPUs
perf/x86/intel: Decouple BTS initialization from PEBS initialization
perf/x86/intel: Rename x86_pmu.pebs to x86_pmu.ds_pebs
perf/x86/intel: Introduce pairs of PEBS static calls
perf/x86/intel: Initialize architectural PEBS
perf/x86/intel/ds: Factor out common PEBS processing code to functions
perf/x86/intel: Process arch-PEBS records or record fragments
perf/x86/intel: Factor out common functions to process PEBS groups
perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
perf/x86/intel: Update dyn_constraint based on PEBS event precise level
perf/x86/intel: Setup PEBS data configuration and enable legacy groups
perf/x86/intel: Add SSP register support for arch-PEBS
perf/x86/intel: Add counter group support for arch-PEBS
perf/core: Support to capture higher width vector registers
perf/x86/intel: Support arch-PEBS vector registers group capturing
perf tools: Support to show SSP register
perf tools: Enhance arch__intr/user_reg_mask() helpers
perf tools: Enhance sample_regs_user/intr to capture more registers
perf tools: Support to capture more vector registers (x86/Intel)
perf tools/tests: Add vector registers PEBS sampling test
perf tools: Fix incorrect --user-regs comments
Kan Liang (2):
perf/x86: Add dynamic constraint
perf/x86/intel: Add Panther Lake support
arch/arm/kernel/perf_regs.c | 6 +
arch/arm64/kernel/perf_regs.c | 6 +
arch/csky/kernel/perf_regs.c | 5 +
arch/loongarch/kernel/perf_regs.c | 5 +
arch/mips/kernel/perf_regs.c | 5 +
arch/powerpc/perf/perf_regs.c | 5 +
arch/riscv/kernel/perf_regs.c | 5 +
arch/s390/kernel/perf_regs.c | 5 +
arch/x86/events/core.c | 105 ++-
arch/x86/events/intel/bts.c | 6 +-
arch/x86/events/intel/core.c | 330 +++++++-
arch/x86/events/intel/ds.c | 722 ++++++++++++++----
arch/x86/events/intel/lbr.c | 2 +-
arch/x86/events/perf_event.h | 69 +-
arch/x86/include/asm/intel_ds.h | 10 +-
arch/x86/include/asm/msr-index.h | 28 +
arch/x86/include/asm/perf_event.h | 145 +++-
arch/x86/include/uapi/asm/perf_regs.h | 87 ++-
arch/x86/kernel/perf_regs.c | 55 +-
include/linux/perf_event.h | 3 +
include/linux/perf_regs.h | 10 +
include/uapi/linux/perf_event.h | 11 +
kernel/events/core.c | 53 +-
tools/arch/x86/include/uapi/asm/perf_regs.h | 90 ++-
tools/include/uapi/linux/perf_event.h | 14 +
tools/perf/arch/arm/util/perf_regs.c | 8 +-
tools/perf/arch/arm64/util/perf_regs.c | 11 +-
tools/perf/arch/csky/util/perf_regs.c | 8 +-
tools/perf/arch/loongarch/util/perf_regs.c | 8 +-
tools/perf/arch/mips/util/perf_regs.c | 8 +-
tools/perf/arch/powerpc/util/perf_regs.c | 17 +-
tools/perf/arch/riscv/util/perf_regs.c | 8 +-
tools/perf/arch/s390/util/perf_regs.c | 8 +-
tools/perf/arch/x86/util/perf_regs.c | 112 ++-
tools/perf/builtin-record.c | 2 +-
tools/perf/builtin-script.c | 23 +-
tools/perf/tests/shell/record.sh | 55 ++
tools/perf/util/evsel.c | 36 +-
tools/perf/util/intel-pt.c | 2 +-
tools/perf/util/parse-regs-options.c | 23 +-
.../perf/util/perf-regs-arch/perf_regs_x86.c | 90 +++
tools/perf/util/perf_regs.c | 8 +-
tools/perf/util/perf_regs.h | 20 +-
tools/perf/util/record.h | 4 +-
tools/perf/util/sample.h | 6 +-
tools/perf/util/session.c | 29 +-
tools/perf/util/synthetic-events.c | 6 +-
47 files changed, 1966 insertions(+), 308 deletions(-)
--
2.40.1
^ permalink raw reply [flat|nested] 58+ messages in thread
* [Patch v2 01/24] perf/x86: Add dynamic constraint
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
@ 2025-02-18 15:27 ` Dapeng Mi
2025-02-18 15:27 ` [Patch v2 02/24] perf/x86/intel: Add Panther Lake support Dapeng Mi
` (22 subsequent siblings)
23 siblings, 0 replies; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:27 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi
From: Kan Liang <kan.liang@linux.intel.com>
More and more features require a dynamic event constraint, e.g., branch
counter logging, auto counter reload, Arch PEBS, etc.
Add a generic flag, PMU_FL_DYN_CONSTRAINT, to indicate this case. It
avoids having to add an individual flag check in intel_cpuc_prepare()
for each new feature.
Add a dyn_constraint field to struct hw_perf_event to track the event's
dynamic constraint, and apply it whenever it has been narrowed from the
default (~0ULL).
Convert branch counter logging to the generic dynamic constraint.
Many features on and after architectural perfmon version 6 require a
dynamic constraint, so unconditionally set the flag for v6+.
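As a rough usage sketch (distilled from the branch counter conversion in
this patch, not a new API): a feature only needs to narrow the per-event
mask at hw_config time, and the generic constraint code applies it:

	/* intel_pmu_hw_config(): limit the event to feature-capable counters */
	if (branch_sample_counters(leader))
		leader->hw.dyn_constraint &= x86_pmu.lbr_counters;

	/* intel_get_event_constraints(): applied generically */
	if (event->hw.dyn_constraint != ~0ULL) {
		c2 = dyn_constraint(cpuc, c2, idx);
		c2->idxmsk64 &= event->hw.dyn_constraint;
		c2->weight = hweight64(c2->idxmsk64);
	}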
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
arch/x86/events/core.c | 1 +
arch/x86/events/intel/core.c | 21 +++++++++++++++------
arch/x86/events/intel/lbr.c | 2 +-
arch/x86/events/perf_event.h | 1 +
include/linux/perf_event.h | 1 +
5 files changed, 19 insertions(+), 7 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 7b6430e5a77b..883e0ee893cb 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -675,6 +675,7 @@ static int __x86_pmu_event_init(struct perf_event *event)
event->hw.idx = -1;
event->hw.last_cpu = -1;
event->hw.last_tag = ~0ULL;
+ event->hw.dyn_constraint = ~0ULL;
/* mark unused */
event->hw.extra_reg.idx = EXTRA_REG_NONE;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a9b4f1099a86..5570d97b8f4f 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3736,10 +3736,9 @@ intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
if (cpuc->excl_cntrs)
return intel_get_excl_constraints(cpuc, event, idx, c2);
- /* Not all counters support the branch counter feature. */
- if (branch_sample_counters(event)) {
+ if (event->hw.dyn_constraint != ~0ULL) {
c2 = dyn_constraint(cpuc, c2, idx);
- c2->idxmsk64 &= x86_pmu.lbr_counters;
+ c2->idxmsk64 &= event->hw.dyn_constraint;
c2->weight = hweight64(c2->idxmsk64);
}
@@ -4056,15 +4055,19 @@ static int intel_pmu_hw_config(struct perf_event *event)
leader = event->group_leader;
if (branch_sample_call_stack(leader))
return -EINVAL;
- if (branch_sample_counters(leader))
+ if (branch_sample_counters(leader)) {
num++;
+ leader->hw.dyn_constraint &= x86_pmu.lbr_counters;
+ }
leader->hw.flags |= PERF_X86_EVENT_BRANCH_COUNTERS;
for_each_sibling_event(sibling, leader) {
if (branch_sample_call_stack(sibling))
return -EINVAL;
- if (branch_sample_counters(sibling))
+ if (branch_sample_counters(sibling)) {
num++;
+ sibling->hw.dyn_constraint &= x86_pmu.lbr_counters;
+ }
}
if (num > fls(x86_pmu.lbr_counters))
@@ -4864,7 +4867,7 @@ int intel_cpuc_prepare(struct cpu_hw_events *cpuc, int cpu)
goto err;
}
- if (x86_pmu.flags & (PMU_FL_EXCL_CNTRS | PMU_FL_TFA | PMU_FL_BR_CNTR)) {
+ if (x86_pmu.flags & (PMU_FL_EXCL_CNTRS | PMU_FL_TFA | PMU_FL_DYN_CONSTRAINT)) {
size_t sz = X86_PMC_IDX_MAX * sizeof(struct event_constraint);
cpuc->constraint_list = kzalloc_node(sz, GFP_KERNEL, cpu_to_node(cpu));
@@ -6582,6 +6585,12 @@ __init int intel_pmu_init(void)
pr_cont(" AnyThread deprecated, ");
}
+ /*
+ * Many features on and after V6 require dynamic constraint,
+ * e.g., Arch PEBS, ACR.
+ */
+ if (version >= 6)
+ x86_pmu.flags |= PMU_FL_DYN_CONSTRAINT;
/*
* Install the hw-cache-events table:
*/
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index dc641b50814e..743dcc322085 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1609,7 +1609,7 @@ void __init intel_pmu_arch_lbr_init(void)
x86_pmu.lbr_nr = lbr_nr;
if (!!x86_pmu.lbr_counters)
- x86_pmu.flags |= PMU_FL_BR_CNTR;
+ x86_pmu.flags |= PMU_FL_BR_CNTR | PMU_FL_DYN_CONSTRAINT;
if (x86_pmu.lbr_mispred)
static_branch_enable(&x86_lbr_mispred);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index a698e6484b3b..f4693409e191 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1066,6 +1066,7 @@ do { \
#define PMU_FL_MEM_LOADS_AUX 0x100 /* Require an auxiliary event for the complete memory info */
#define PMU_FL_RETIRE_LATENCY 0x200 /* Support Retire Latency in PEBS */
#define PMU_FL_BR_CNTR 0x400 /* Support branch counter logging */
+#define PMU_FL_DYN_CONSTRAINT 0x800 /* Needs dynamic constraint */
#define EVENT_VAR(_id) event_attr_##_id
#define EVENT_PTR(_id) &event_attr_##_id.attr.attr
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2d07bc1193f3..c381ea7135df 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -158,6 +158,7 @@ struct hw_perf_event {
struct { /* hardware */
u64 config;
u64 last_tag;
+ u64 dyn_constraint;
unsigned long config_base;
unsigned long event_base;
int event_base_rdpmc;
--
2.40.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [Patch v2 02/24] perf/x86/intel: Add Panther Lake support
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
2025-02-18 15:27 ` [Patch v2 01/24] perf/x86: Add dynamic constraint Dapeng Mi
@ 2025-02-18 15:27 ` Dapeng Mi
2025-02-18 15:27 ` [Patch v2 03/24] perf/x86/intel: Add PMU support for Clearwater Forest Dapeng Mi
` (21 subsequent siblings)
23 siblings, 0 replies; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:27 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi
From: Kan Liang <kan.liang@linux.intel.com>
From the PMU's perspective, Panther Lake is similar to the previous
generation Lunar Lake. Both are hybrid platforms with P-cores and
E-cores.
The key differences are the ARCH PEBS feature and several new events.
ARCH PEBS support is added in the following patches. The new events
will be supported later in the perf tool.
Share the code path with Lunar Lake; only the name is updated.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
arch/x86/events/intel/core.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 5570d97b8f4f..936711db9b32 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -7232,8 +7232,17 @@ __init int intel_pmu_init(void)
name = "meteorlake_hybrid";
break;
+ case INTEL_PANTHERLAKE_L:
+ pr_cont("Pantherlake Hybrid events, ");
+ name = "pantherlake_hybrid";
+ goto lnl_common;
+
case INTEL_LUNARLAKE_M:
case INTEL_ARROWLAKE:
+ pr_cont("Lunarlake Hybrid events, ");
+ name = "lunarlake_hybrid";
+
+ lnl_common:
intel_pmu_init_hybrid(hybrid_big_small);
x86_pmu.pebs_latency_data = lnl_latency_data;
@@ -7255,8 +7264,6 @@ __init int intel_pmu_init(void)
intel_pmu_init_skt(&pmu->pmu);
intel_pmu_pebs_data_source_lnl();
- pr_cont("Lunarlake Hybrid events, ");
- name = "lunarlake_hybrid";
break;
case INTEL_ARROWLAKE_H:
--
2.40.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [Patch v2 03/24] perf/x86/intel: Add PMU support for Clearwater Forest
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
2025-02-18 15:27 ` [Patch v2 01/24] perf/x86: Add dynamic constraint Dapeng Mi
2025-02-18 15:27 ` [Patch v2 02/24] perf/x86/intel: Add Panther Lake support Dapeng Mi
@ 2025-02-18 15:27 ` Dapeng Mi
2025-02-18 15:27 ` [Patch v2 04/24] perf/x86/intel: Parse CPUID archPerfmonExt leaves for non-hybrid CPUs Dapeng Mi
` (20 subsequent siblings)
23 siblings, 0 replies; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:27 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
From the PMU's perspective, Clearwater Forest is similar to the previous
generation Sierra Forest.
The key differences are the ARCH PEBS feature and the three newly added
fixed counters for the topdown L1 metrics events.
ARCH PEBS support is added in the following patches. This patch adds
support for the basic perfmon features and the three new fixed counters.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/core.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 936711db9b32..7521e1e55c0e 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2230,6 +2230,18 @@ static struct extra_reg intel_cmt_extra_regs[] __read_mostly = {
EVENT_EXTRA_END
};
+EVENT_ATTR_STR(topdown-fe-bound, td_fe_bound_skt, "event=0x9c,umask=0x01");
+EVENT_ATTR_STR(topdown-retiring, td_retiring_skt, "event=0xc2,umask=0x02");
+EVENT_ATTR_STR(topdown-be-bound, td_be_bound_skt, "event=0xa4,umask=0x02");
+
+static struct attribute *skt_events_attrs[] = {
+ EVENT_PTR(td_fe_bound_skt),
+ EVENT_PTR(td_retiring_skt),
+ EVENT_PTR(td_bad_spec_cmt),
+ EVENT_PTR(td_be_bound_skt),
+ NULL,
+};
+
#define KNL_OT_L2_HITE BIT_ULL(19) /* Other Tile L2 Hit */
#define KNL_OT_L2_HITF BIT_ULL(20) /* Other Tile L2 Hit */
#define KNL_MCDRAM_LOCAL BIT_ULL(21)
@@ -6802,6 +6814,18 @@ __init int intel_pmu_init(void)
name = "crestmont";
break;
+ case INTEL_ATOM_DARKMONT_X:
+ intel_pmu_init_skt(NULL);
+ intel_pmu_pebs_data_source_cmt();
+ x86_pmu.pebs_latency_data = cmt_latency_data;
+ x86_pmu.get_event_constraints = cmt_get_event_constraints;
+ td_attr = skt_events_attrs;
+ mem_attr = grt_mem_attrs;
+ extra_attr = cmt_format_attr;
+ pr_cont("Darkmont events, ");
+ name = "darkmont";
+ break;
+
case INTEL_WESTMERE:
case INTEL_WESTMERE_EP:
case INTEL_WESTMERE_EX:
--
2.40.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [Patch v2 04/24] perf/x86/intel: Parse CPUID archPerfmonExt leaves for non-hybrid CPUs
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (2 preceding siblings ...)
2025-02-18 15:27 ` [Patch v2 03/24] perf/x86/intel: Add PMU support for Clearwater Forest Dapeng Mi
@ 2025-02-18 15:27 ` Dapeng Mi
2025-02-18 15:27 ` [Patch v2 05/24] perf/x86/intel: Decouple BTS initialization from PEBS initialization Dapeng Mi
` (19 subsequent siblings)
23 siblings, 0 replies; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:27 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
CPUID archPerfmonExt (0x23) leaves can enumerate the core-level PMU
capabilities on non-hybrid processors as well.
This patch adds support for parsing the archPerfmonExt leaves on
non-hybrid processors. Architectural PEBS leverages archPerfmonExt
sub-leaves 0x4 and 0x5 to enumerate its PEBS capabilities, so this
patch is a precursor of the subsequent arch-PEBS enabling patches.
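A condensed view of how one parser now serves both cases (relying on the
existing hybrid() accessor, which falls back to the global x86_pmu when a
NULL pmu is passed):

	/* hybrid: per-PMU capabilities, parsed when a new core type comes online */
	if (this_cpu_has(X86_FEATURE_ARCH_PERFMON_EXT))
		update_pmu_cap(&pmu->pmu);

	/* non-hybrid: update the global x86_pmu once from the boot CPU */
	if (!is_hybrid() && boot_cpu_has(X86_FEATURE_ARCH_PERFMON_EXT))
		update_pmu_cap(NULL);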
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/core.c | 27 ++++++++++++++++++++-------
1 file changed, 20 insertions(+), 7 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 7521e1e55c0e..e1383a905cdc 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4968,7 +4968,7 @@ static inline bool intel_pmu_broken_perf_cap(void)
return false;
}
-static void update_pmu_cap(struct x86_hybrid_pmu *pmu)
+static void update_pmu_cap(struct pmu *pmu)
{
unsigned int cntr, fixed_cntr, ecx, edx;
union cpuid35_eax eax;
@@ -4977,20 +4977,20 @@ static void update_pmu_cap(struct x86_hybrid_pmu *pmu)
cpuid(ARCH_PERFMON_EXT_LEAF, &eax.full, &ebx.full, &ecx, &edx);
if (ebx.split.umask2)
- pmu->config_mask |= ARCH_PERFMON_EVENTSEL_UMASK2;
+ hybrid(pmu, config_mask) |= ARCH_PERFMON_EVENTSEL_UMASK2;
if (ebx.split.eq)
- pmu->config_mask |= ARCH_PERFMON_EVENTSEL_EQ;
+ hybrid(pmu, config_mask) |= ARCH_PERFMON_EVENTSEL_EQ;
if (eax.split.cntr_subleaf) {
cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_NUM_COUNTER_LEAF,
&cntr, &fixed_cntr, &ecx, &edx);
- pmu->cntr_mask64 = cntr;
- pmu->fixed_cntr_mask64 = fixed_cntr;
+ hybrid(pmu, cntr_mask64) = cntr;
+ hybrid(pmu, fixed_cntr_mask64) = fixed_cntr;
}
if (!intel_pmu_broken_perf_cap()) {
/* Perf Metric (Bit 15) and PEBS via PT (Bit 16) are hybrid enumeration */
- rdmsrl(MSR_IA32_PERF_CAPABILITIES, pmu->intel_cap.capabilities);
+ rdmsrl(MSR_IA32_PERF_CAPABILITIES, hybrid(pmu, intel_cap).capabilities);
}
}
@@ -5076,7 +5076,7 @@ static bool init_hybrid_pmu(int cpu)
goto end;
if (this_cpu_has(X86_FEATURE_ARCH_PERFMON_EXT))
- update_pmu_cap(pmu);
+ update_pmu_cap(&pmu->pmu);
intel_pmu_check_hybrid_pmus(pmu);
@@ -6559,6 +6559,7 @@ __init int intel_pmu_init(void)
x86_pmu.pebs_events_mask = intel_pmu_pebs_mask(x86_pmu.cntr_mask64);
x86_pmu.pebs_capable = PEBS_COUNTER_MASK;
+ x86_pmu.config_mask = X86_RAW_EVENT_MASK;
/*
* Quirk: v2 perfmon does not report fixed-purpose events, so
@@ -7375,6 +7376,18 @@ __init int intel_pmu_init(void)
x86_pmu.attr_update = hybrid_attr_update;
}
+ /*
+ * The archPerfmonExt (0x23) includes an enhanced enumeration of
+ * PMU architectural features with a per-core view. For non-hybrid,
+ * each core has the same PMU capabilities. It's good enough to
+ * update the x86_pmu from the booting CPU. For hybrid, the x86_pmu
+ * is used to keep the common capabilities. Still keep the values
+ * from the leaf 0xa. The core specific update will be done later
+ * when a new type is online.
+ */
+ if (!is_hybrid() && boot_cpu_has(X86_FEATURE_ARCH_PERFMON_EXT))
+ update_pmu_cap(NULL);
+
intel_pmu_check_counters_mask(&x86_pmu.cntr_mask64,
&x86_pmu.fixed_cntr_mask64,
&x86_pmu.intel_ctrl);
--
2.40.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [Patch v2 05/24] perf/x86/intel: Decouple BTS initialization from PEBS initialization
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (3 preceding siblings ...)
2025-02-18 15:27 ` [Patch v2 04/24] perf/x86/intel: Parse CPUID archPerfmonExt leaves for non-hybrid CPUs Dapeng Mi
@ 2025-02-18 15:27 ` Dapeng Mi
2025-02-18 15:28 ` [Patch v2 06/24] perf/x86/intel: Rename x86_pmu.pebs to x86_pmu.ds_pebs Dapeng Mi
` (18 subsequent siblings)
23 siblings, 0 replies; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:27 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Move the x86_pmu.bts flag initialization from intel_ds_init() into
bts_init(), and rename intel_ds_init() to intel_pebs_init() since it now
only initializes PEBS once the x86_pmu.bts initialization is removed.
It is safe to move the x86_pmu.bts initialization into bts_init() since
all users of the x86_pmu.bts flag run after bts_init() has executed.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/bts.c | 6 +++++-
arch/x86/events/intel/core.c | 2 +-
arch/x86/events/intel/ds.c | 5 ++---
arch/x86/events/perf_event.h | 2 +-
4 files changed, 9 insertions(+), 6 deletions(-)
diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 8f78b0c900ef..a205d1fb37b1 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -584,7 +584,11 @@ static void bts_event_read(struct perf_event *event)
static __init int bts_init(void)
{
- if (!boot_cpu_has(X86_FEATURE_DTES64) || !x86_pmu.bts)
+ if (!boot_cpu_has(X86_FEATURE_DTES64))
+ return -ENODEV;
+
+ x86_pmu.bts = boot_cpu_has(X86_FEATURE_BTS);
+ if (!x86_pmu.bts)
return -ENODEV;
if (boot_cpu_has(X86_FEATURE_PTI)) {
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index e1383a905cdc..a977d4d631fe 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -6588,7 +6588,7 @@ __init int intel_pmu_init(void)
if (boot_cpu_has(X86_FEATURE_ARCH_LBR))
intel_pmu_arch_lbr_init();
- intel_ds_init();
+ intel_pebs_init();
x86_add_quirk(intel_arch_events_quirk); /* Install first, so it runs last */
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 46aaaeae0c8d..9c8947d3413f 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2650,10 +2650,10 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
}
/*
- * BTS, PEBS probe and setup
+ * PEBS probe and setup
*/
-void __init intel_ds_init(void)
+void __init intel_pebs_init(void)
{
/*
* No support for 32bit formats
@@ -2661,7 +2661,6 @@ void __init intel_ds_init(void)
if (!boot_cpu_has(X86_FEATURE_DTES64))
return;
- x86_pmu.bts = boot_cpu_has(X86_FEATURE_BTS);
x86_pmu.pebs = boot_cpu_has(X86_FEATURE_PEBS);
x86_pmu.pebs_buffer_size = PEBS_BUFFER_SIZE;
if (x86_pmu.version <= 4)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index f4693409e191..0a259c98056a 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1662,7 +1662,7 @@ void intel_pmu_drain_pebs_buffer(void);
void intel_pmu_store_pebs_lbrs(struct lbr_entry *lbr);
-void intel_ds_init(void);
+void intel_pebs_init(void);
void intel_pmu_lbr_save_brstack(struct perf_sample_data *data,
struct cpu_hw_events *cpuc,
--
2.40.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [Patch v2 06/24] perf/x86/intel: Rename x86_pmu.pebs to x86_pmu.ds_pebs
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (4 preceding siblings ...)
2025-02-18 15:27 ` [Patch v2 05/24] perf/x86/intel: Decouple BTS initialization from PEBS initialization Dapeng Mi
@ 2025-02-18 15:28 ` Dapeng Mi
2025-02-18 15:28 ` [Patch v2 07/24] perf/x86/intel: Introduce pairs of PEBS static calls Dapeng Mi
` (17 subsequent siblings)
23 siblings, 0 replies; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Since architectural PEBS will be introduced in subsequent patches,
rename x86_pmu.pebs to x86_pmu.ds_pebs to distinguish the DS-based PEBS
from the upcoming architectural PEBS.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/core.c | 6 +++---
arch/x86/events/intel/ds.c | 20 ++++++++++----------
arch/x86/events/perf_event.h | 2 +-
3 files changed, 14 insertions(+), 14 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index a977d4d631fe..f45296f30ec2 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4281,7 +4281,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
.guest = intel_ctrl & ~cpuc->intel_ctrl_host_mask & ~pebs_mask,
};
- if (!x86_pmu.pebs)
+ if (!x86_pmu.ds_pebs)
return arr;
/*
@@ -5454,7 +5454,7 @@ static __init void intel_clovertown_quirk(void)
* these chips.
*/
pr_warn("PEBS disabled due to CPU errata\n");
- x86_pmu.pebs = 0;
+ x86_pmu.ds_pebs = 0;
x86_pmu.pebs_constraints = NULL;
}
@@ -5942,7 +5942,7 @@ tsx_is_visible(struct kobject *kobj, struct attribute *attr, int i)
static umode_t
pebs_is_visible(struct kobject *kobj, struct attribute *attr, int i)
{
- return x86_pmu.pebs ? attr->mode : 0;
+ return x86_pmu.ds_pebs ? attr->mode : 0;
}
static umode_t
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 9c8947d3413f..2a4dc0bbc4f7 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -624,7 +624,7 @@ static int alloc_pebs_buffer(int cpu)
int max, node = cpu_to_node(cpu);
void *buffer, *insn_buff, *cea;
- if (!x86_pmu.pebs)
+ if (!x86_pmu.ds_pebs)
return 0;
buffer = dsalloc_pages(bsiz, GFP_KERNEL, cpu);
@@ -659,7 +659,7 @@ static void release_pebs_buffer(int cpu)
struct cpu_hw_events *hwev = per_cpu_ptr(&cpu_hw_events, cpu);
void *cea;
- if (!x86_pmu.pebs)
+ if (!x86_pmu.ds_pebs)
return;
kfree(per_cpu(insn_buffer, cpu));
@@ -734,7 +734,7 @@ void release_ds_buffers(void)
{
int cpu;
- if (!x86_pmu.bts && !x86_pmu.pebs)
+ if (!x86_pmu.bts && !x86_pmu.ds_pebs)
return;
for_each_possible_cpu(cpu)
@@ -763,13 +763,13 @@ void reserve_ds_buffers(void)
x86_pmu.bts_active = 0;
x86_pmu.pebs_active = 0;
- if (!x86_pmu.bts && !x86_pmu.pebs)
+ if (!x86_pmu.bts && !x86_pmu.ds_pebs)
return;
if (!x86_pmu.bts)
bts_err = 1;
- if (!x86_pmu.pebs)
+ if (!x86_pmu.ds_pebs)
pebs_err = 1;
for_each_possible_cpu(cpu) {
@@ -805,7 +805,7 @@ void reserve_ds_buffers(void)
if (x86_pmu.bts && !bts_err)
x86_pmu.bts_active = 1;
- if (x86_pmu.pebs && !pebs_err)
+ if (x86_pmu.ds_pebs && !pebs_err)
x86_pmu.pebs_active = 1;
for_each_possible_cpu(cpu) {
@@ -2661,12 +2661,12 @@ void __init intel_pebs_init(void)
if (!boot_cpu_has(X86_FEATURE_DTES64))
return;
- x86_pmu.pebs = boot_cpu_has(X86_FEATURE_PEBS);
+ x86_pmu.ds_pebs = boot_cpu_has(X86_FEATURE_PEBS);
x86_pmu.pebs_buffer_size = PEBS_BUFFER_SIZE;
if (x86_pmu.version <= 4)
x86_pmu.pebs_no_isolation = 1;
- if (x86_pmu.pebs) {
+ if (x86_pmu.ds_pebs) {
char pebs_type = x86_pmu.intel_cap.pebs_trap ? '+' : '-';
char *pebs_qual = "";
int format = x86_pmu.intel_cap.pebs_format;
@@ -2758,7 +2758,7 @@ void __init intel_pebs_init(void)
default:
pr_cont("no PEBS fmt%d%c, ", format, pebs_type);
- x86_pmu.pebs = 0;
+ x86_pmu.ds_pebs = 0;
}
}
}
@@ -2767,7 +2767,7 @@ void perf_restore_debug_store(void)
{
struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds);
- if (!x86_pmu.bts && !x86_pmu.pebs)
+ if (!x86_pmu.bts && !x86_pmu.ds_pebs)
return;
wrmsrl(MSR_IA32_DS_AREA, (unsigned long)ds);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 0a259c98056a..1e7884fdd990 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -888,7 +888,7 @@ struct x86_pmu {
*/
unsigned int bts :1,
bts_active :1,
- pebs :1,
+ ds_pebs :1,
pebs_active :1,
pebs_broken :1,
pebs_prec_dist :1,
--
2.40.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [Patch v2 07/24] perf/x86/intel: Introduce pairs of PEBS static calls
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (5 preceding siblings ...)
2025-02-18 15:28 ` [Patch v2 06/24] perf/x86/intel: Rename x86_pmu.pebs to x86_pmu.ds_pebs Dapeng Mi
@ 2025-02-18 15:28 ` Dapeng Mi
2025-02-18 15:28 ` [Patch v2 08/24] perf/x86/intel: Initialize architectural PEBS Dapeng Mi
` (16 subsequent siblings)
23 siblings, 0 replies; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Arch-PEBS retires the IA32_PEBS_ENABLE and MSR_PEBS_DATA_CFG MSRs, so
intel_pmu_pebs_enable/disable() and intel_pmu_pebs_enable/disable_all()
no longer need to be called for arch-PEBS.
To keep the code clean, introduce the static calls
x86_pmu_pebs_enable/disable() and x86_pmu_pebs_enable/disable_all()
instead of adding x86_pmu.arch_pebs checks directly in these helpers.
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/core.c | 10 ++++++++++
arch/x86/events/intel/core.c | 8 ++++----
arch/x86/events/intel/ds.c | 5 +++++
arch/x86/events/perf_event.h | 8 ++++++++
4 files changed, 27 insertions(+), 4 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 883e0ee893cb..1c2ff407ef17 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -96,6 +96,11 @@ DEFINE_STATIC_CALL_NULL(x86_pmu_filter, *x86_pmu.filter);
DEFINE_STATIC_CALL_NULL(x86_pmu_late_setup, *x86_pmu.late_setup);
+DEFINE_STATIC_CALL_NULL(x86_pmu_pebs_enable, *x86_pmu.pebs_enable);
+DEFINE_STATIC_CALL_NULL(x86_pmu_pebs_disable, *x86_pmu.pebs_disable);
+DEFINE_STATIC_CALL_NULL(x86_pmu_pebs_enable_all, *x86_pmu.pebs_enable_all);
+DEFINE_STATIC_CALL_NULL(x86_pmu_pebs_disable_all, *x86_pmu.pebs_disable_all);
+
/*
* This one is magic, it will get called even when PMU init fails (because
* there is no PMU), in which case it should simply return NULL.
@@ -2049,6 +2054,11 @@ static void x86_pmu_static_call_update(void)
static_call_update(x86_pmu_filter, x86_pmu.filter);
static_call_update(x86_pmu_late_setup, x86_pmu.late_setup);
+
+ static_call_update(x86_pmu_pebs_enable, x86_pmu.pebs_enable);
+ static_call_update(x86_pmu_pebs_disable, x86_pmu.pebs_disable);
+ static_call_update(x86_pmu_pebs_enable_all, x86_pmu.pebs_enable_all);
+ static_call_update(x86_pmu_pebs_disable_all, x86_pmu.pebs_disable_all);
}
static void _x86_pmu_read(struct perf_event *event)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index f45296f30ec2..41c7243a4507 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2312,7 +2312,7 @@ static __always_inline void __intel_pmu_disable_all(bool bts)
static __always_inline void intel_pmu_disable_all(void)
{
__intel_pmu_disable_all(true);
- intel_pmu_pebs_disable_all();
+ static_call_cond(x86_pmu_pebs_disable_all)();
intel_pmu_lbr_disable_all();
}
@@ -2344,7 +2344,7 @@ static void __intel_pmu_enable_all(int added, bool pmi)
static void intel_pmu_enable_all(int added)
{
- intel_pmu_pebs_enable_all();
+ static_call_cond(x86_pmu_pebs_enable_all)();
__intel_pmu_enable_all(added, false);
}
@@ -2601,7 +2601,7 @@ static void intel_pmu_disable_event(struct perf_event *event)
* so we don't trigger the event without PEBS bit set.
*/
if (unlikely(event->attr.precise_ip))
- intel_pmu_pebs_disable(event);
+ static_call(x86_pmu_pebs_disable)(event);
}
static void intel_pmu_assign_event(struct perf_event *event, int idx)
@@ -2905,7 +2905,7 @@ static void intel_pmu_enable_event(struct perf_event *event)
int idx = hwc->idx;
if (unlikely(event->attr.precise_ip))
- intel_pmu_pebs_enable(event);
+ static_call(x86_pmu_pebs_enable)(event);
switch (idx) {
case 0 ... INTEL_PMC_IDX_FIXED - 1:
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 2a4dc0bbc4f7..ab4a9a01336d 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2674,6 +2674,11 @@ void __init intel_pebs_init(void)
if (format < 4)
x86_pmu.intel_cap.pebs_baseline = 0;
+ x86_pmu.pebs_enable = intel_pmu_pebs_enable;
+ x86_pmu.pebs_disable = intel_pmu_pebs_disable;
+ x86_pmu.pebs_enable_all = intel_pmu_pebs_enable_all;
+ x86_pmu.pebs_disable_all = intel_pmu_pebs_disable_all;
+
switch (format) {
case 0:
pr_cont("PEBS fmt0%c, ", pebs_type);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 1e7884fdd990..4bc6c9d66b94 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -806,6 +806,10 @@ struct x86_pmu {
int (*hw_config)(struct perf_event *event);
int (*schedule_events)(struct cpu_hw_events *cpuc, int n, int *assign);
void (*late_setup)(void);
+ void (*pebs_enable)(struct perf_event *event);
+ void (*pebs_disable)(struct perf_event *event);
+ void (*pebs_enable_all)(void);
+ void (*pebs_disable_all)(void);
unsigned eventsel;
unsigned perfctr;
unsigned fixedctr;
@@ -1116,6 +1120,10 @@ DECLARE_STATIC_CALL(x86_pmu_set_period, *x86_pmu.set_period);
DECLARE_STATIC_CALL(x86_pmu_update, *x86_pmu.update);
DECLARE_STATIC_CALL(x86_pmu_drain_pebs, *x86_pmu.drain_pebs);
DECLARE_STATIC_CALL(x86_pmu_late_setup, *x86_pmu.late_setup);
+DECLARE_STATIC_CALL(x86_pmu_pebs_enable, *x86_pmu.pebs_enable);
+DECLARE_STATIC_CALL(x86_pmu_pebs_disable, *x86_pmu.pebs_disable);
+DECLARE_STATIC_CALL(x86_pmu_pebs_enable_all, *x86_pmu.pebs_enable_all);
+DECLARE_STATIC_CALL(x86_pmu_pebs_disable_all, *x86_pmu.pebs_disable_all);
static __always_inline struct x86_perf_task_context_opt *task_context_opt(void *ctx)
{
--
2.40.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [Patch v2 08/24] perf/x86/intel: Initialize architectural PEBS
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (6 preceding siblings ...)
2025-02-18 15:28 ` [Patch v2 07/24] perf/x86/intel: Introduce pairs of PEBS static calls Dapeng Mi
@ 2025-02-18 15:28 ` Dapeng Mi
2025-02-18 15:28 ` [Patch v2 09/24] perf/x86/intel/ds: Factor out common PEBS processing code to functions Dapeng Mi
` (15 subsequent siblings)
23 siblings, 0 replies; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Arch-PEBS leverages CPUID.23H sub-leaves 0x4 and 0x5 to enumerate the
supported arch-PEBS capabilities and counter bitmaps. This patch parses
these two sub-leaves and initializes the arch-PEBS capabilities and
corresponding structures.
Since the IA32_PEBS_ENABLE and MSR_PEBS_DATA_CFG MSRs no longer exist
for arch-PEBS, there is no need to manipulate them. Thus add a simple
pair of __intel_pmu_pebs_enable/disable() callbacks for arch-PEBS.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/core.c | 21 ++++++++++---
arch/x86/events/intel/core.c | 40 +++++++++++++++++-------
arch/x86/events/intel/ds.c | 52 ++++++++++++++++++++++++++-----
arch/x86/events/perf_event.h | 25 +++++++++++++--
arch/x86/include/asm/perf_event.h | 7 ++++-
5 files changed, 117 insertions(+), 28 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 1c2ff407ef17..24ae1159d6b9 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -554,14 +554,22 @@ static inline int precise_br_compat(struct perf_event *event)
return m == b;
}
-int x86_pmu_max_precise(void)
+int x86_pmu_max_precise(struct pmu *pmu)
{
int precise = 0;
- /* Support for constant skid */
if (x86_pmu.pebs_active && !x86_pmu.pebs_broken) {
- precise++;
+ /* arch PEBS */
+ if (x86_pmu.arch_pebs) {
+ precise = 2;
+ if (hybrid(pmu, arch_pebs_cap).pdists)
+ precise++;
+
+ return precise;
+ }
+ /* legacy PEBS - support for constant skid */
+ precise++;
/* Support for IP fixup */
if (x86_pmu.lbr_nr || x86_pmu.intel_cap.pebs_format >= 2)
precise++;
@@ -569,13 +577,14 @@ int x86_pmu_max_precise(void)
if (x86_pmu.pebs_prec_dist)
precise++;
}
+
return precise;
}
int x86_pmu_hw_config(struct perf_event *event)
{
if (event->attr.precise_ip) {
- int precise = x86_pmu_max_precise();
+ int precise = x86_pmu_max_precise(event->pmu);
if (event->attr.precise_ip > precise)
return -EOPNOTSUPP;
@@ -2626,7 +2635,9 @@ static ssize_t max_precise_show(struct device *cdev,
struct device_attribute *attr,
char *buf)
{
- return snprintf(buf, PAGE_SIZE, "%d\n", x86_pmu_max_precise());
+ struct pmu *pmu = dev_get_drvdata(cdev);
+
+ return snprintf(buf, PAGE_SIZE, "%d\n", x86_pmu_max_precise(pmu));
}
static DEVICE_ATTR_RO(max_precise);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 41c7243a4507..37540eb80029 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4970,22 +4970,37 @@ static inline bool intel_pmu_broken_perf_cap(void)
static void update_pmu_cap(struct pmu *pmu)
{
- unsigned int cntr, fixed_cntr, ecx, edx;
- union cpuid35_eax eax;
- union cpuid35_ebx ebx;
+ unsigned int eax, ebx, ecx, edx;
+ union cpuid35_eax eax_0;
+ union cpuid35_ebx ebx_0;
- cpuid(ARCH_PERFMON_EXT_LEAF, &eax.full, &ebx.full, &ecx, &edx);
+ cpuid(ARCH_PERFMON_EXT_LEAF, &eax_0.full, &ebx_0.full, &ecx, &edx);
- if (ebx.split.umask2)
+ if (ebx_0.split.umask2)
hybrid(pmu, config_mask) |= ARCH_PERFMON_EVENTSEL_UMASK2;
- if (ebx.split.eq)
+ if (ebx_0.split.eq)
hybrid(pmu, config_mask) |= ARCH_PERFMON_EVENTSEL_EQ;
- if (eax.split.cntr_subleaf) {
+ if (eax_0.split.cntr_subleaf) {
cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_NUM_COUNTER_LEAF,
- &cntr, &fixed_cntr, &ecx, &edx);
- hybrid(pmu, cntr_mask64) = cntr;
- hybrid(pmu, fixed_cntr_mask64) = fixed_cntr;
+ &eax, &ebx, &ecx, &edx);
+ hybrid(pmu, cntr_mask64) = eax;
+ hybrid(pmu, fixed_cntr_mask64) = ebx;
+ }
+
+ /* Bits[5:4] should be set simultaneously if arch-PEBS is supported */
+ if (eax_0.split.pebs_caps_subleaf && eax_0.split.pebs_cnts_subleaf) {
+ cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_PEBS_CAP_LEAF,
+ &eax, &ebx, &ecx, &edx);
+ hybrid(pmu, arch_pebs_cap).caps = (u64)ebx << 32;
+
+ cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_PEBS_COUNTER_LEAF,
+ &eax, &ebx, &ecx, &edx);
+ hybrid(pmu, arch_pebs_cap).counters = ((u64)ecx << 32) | eax;
+ hybrid(pmu, arch_pebs_cap).pdists = ((u64)edx << 32) | ebx;
+ } else {
+ WARN_ON(x86_pmu.arch_pebs == 1);
+ x86_pmu.arch_pebs = 0;
}
if (!intel_pmu_broken_perf_cap()) {
@@ -5942,7 +5957,7 @@ tsx_is_visible(struct kobject *kobj, struct attribute *attr, int i)
static umode_t
pebs_is_visible(struct kobject *kobj, struct attribute *attr, int i)
{
- return x86_pmu.ds_pebs ? attr->mode : 0;
+ return intel_pmu_has_pebs() ? attr->mode : 0;
}
static umode_t
@@ -7388,6 +7403,9 @@ __init int intel_pmu_init(void)
if (!is_hybrid() && boot_cpu_has(X86_FEATURE_ARCH_PERFMON_EXT))
update_pmu_cap(NULL);
+ if (x86_pmu.arch_pebs)
+ pr_cont("Architectural PEBS, ");
+
intel_pmu_check_counters_mask(&x86_pmu.cntr_mask64,
&x86_pmu.fixed_cntr_mask64,
&x86_pmu.intel_ctrl);
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index ab4a9a01336d..e66c5f307e93 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1525,6 +1525,15 @@ static inline void intel_pmu_drain_large_pebs(struct cpu_hw_events *cpuc)
intel_pmu_drain_pebs_buffer();
}
+static void __intel_pmu_pebs_enable(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct hw_perf_event *hwc = &event->hw;
+
+ hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
+ cpuc->pebs_enabled |= 1ULL << hwc->idx;
+}
+
void intel_pmu_pebs_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1533,9 +1542,7 @@ void intel_pmu_pebs_enable(struct perf_event *event)
struct debug_store *ds = cpuc->ds;
unsigned int idx = hwc->idx;
- hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
-
- cpuc->pebs_enabled |= 1ULL << hwc->idx;
+ __intel_pmu_pebs_enable(event);
if ((event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT) && (x86_pmu.version < 5))
cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
@@ -1597,14 +1604,22 @@ void intel_pmu_pebs_del(struct perf_event *event)
pebs_update_state(needed_cb, cpuc, event, false);
}
-void intel_pmu_pebs_disable(struct perf_event *event)
+static void __intel_pmu_pebs_disable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
intel_pmu_drain_large_pebs(cpuc);
-
cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
+ hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
+}
+
+void intel_pmu_pebs_disable(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct hw_perf_event *hwc = &event->hw;
+
+ __intel_pmu_pebs_disable(event);
if ((event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT) &&
(x86_pmu.version < 5))
@@ -1616,8 +1631,6 @@ void intel_pmu_pebs_disable(struct perf_event *event)
if (cpuc->enabled)
wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
-
- hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
}
void intel_pmu_pebs_enable_all(void)
@@ -2649,11 +2662,26 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
}
}
+static void __init intel_arch_pebs_init(void)
+{
+ /*
+ * Current hybrid platforms always both support arch-PEBS or not
+ * on all kinds of cores. So directly set x86_pmu.arch_pebs flag
+ * if boot cpu supports arch-PEBS.
+ */
+ x86_pmu.arch_pebs = 1;
+ x86_pmu.pebs_buffer_size = PEBS_BUFFER_SIZE;
+ x86_pmu.pebs_capable = ~0ULL;
+
+ x86_pmu.pebs_enable = __intel_pmu_pebs_enable;
+ x86_pmu.pebs_disable = __intel_pmu_pebs_disable;
+}
+
/*
* PEBS probe and setup
*/
-void __init intel_pebs_init(void)
+static void __init intel_ds_pebs_init(void)
{
/*
* No support for 32bit formats
@@ -2768,6 +2796,14 @@ void __init intel_pebs_init(void)
}
}
+void __init intel_pebs_init(void)
+{
+ if (x86_pmu.intel_cap.pebs_format == 0xf)
+ intel_arch_pebs_init();
+ else
+ intel_ds_pebs_init();
+}
+
void perf_restore_debug_store(void)
{
struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 4bc6c9d66b94..265d76a321dd 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -707,6 +707,12 @@ enum atom_native_id {
skt_native_id = 0x3, /* Skymont */
};
+struct arch_pebs_cap {
+ u64 caps;
+ u64 counters;
+ u64 pdists;
+};
+
struct x86_hybrid_pmu {
struct pmu pmu;
const char *name;
@@ -742,6 +748,8 @@ struct x86_hybrid_pmu {
mid_ack :1,
enabled_ack :1;
+ struct arch_pebs_cap arch_pebs_cap;
+
u64 pebs_data_source[PERF_PEBS_DATA_SOURCE_MAX];
};
@@ -888,7 +896,7 @@ struct x86_pmu {
union perf_capabilities intel_cap;
/*
- * Intel DebugStore bits
+ * Intel DebugStore and PEBS bits
*/
unsigned int bts :1,
bts_active :1,
@@ -899,7 +907,8 @@ struct x86_pmu {
pebs_no_tlb :1,
pebs_no_isolation :1,
pebs_block :1,
- pebs_ept :1;
+ pebs_ept :1,
+ arch_pebs :1;
int pebs_record_size;
int pebs_buffer_size;
u64 pebs_events_mask;
@@ -911,6 +920,11 @@ struct x86_pmu {
u64 rtm_abort_event;
u64 pebs_capable;
+ /*
+ * Intel Architectural PEBS
+ */
+ struct arch_pebs_cap arch_pebs_cap;
+
/*
* Intel LBR
*/
@@ -1205,7 +1219,7 @@ int x86_reserve_hardware(void);
void x86_release_hardware(void);
-int x86_pmu_max_precise(void);
+int x86_pmu_max_precise(struct pmu *pmu);
void hw_perf_lbr_event_destroy(struct perf_event *event);
@@ -1775,6 +1789,11 @@ static inline int intel_pmu_max_num_pebs(struct pmu *pmu)
return fls((u32)hybrid(pmu, pebs_events_mask));
}
+static inline bool intel_pmu_has_pebs(void)
+{
+ return x86_pmu.ds_pebs || x86_pmu.arch_pebs;
+}
+
#else /* CONFIG_CPU_SUP_INTEL */
static inline void reserve_ds_buffers(void)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index eaf0d5245999..4f0c01610175 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -195,6 +195,8 @@ union cpuid10_edx {
*/
#define ARCH_PERFMON_EXT_LEAF 0x00000023
#define ARCH_PERFMON_NUM_COUNTER_LEAF 0x1
+#define ARCH_PERFMON_PEBS_CAP_LEAF 0x4
+#define ARCH_PERFMON_PEBS_COUNTER_LEAF 0x5
union cpuid35_eax {
struct {
@@ -205,7 +207,10 @@ union cpuid35_eax {
unsigned int acr_subleaf:1;
/* Events Sub-Leaf */
unsigned int events_subleaf:1;
- unsigned int reserved:28;
+ /* arch-PEBS Sub-Leaves */
+ unsigned int pebs_caps_subleaf:1;
+ unsigned int pebs_cnts_subleaf:1;
+ unsigned int reserved:26;
} split;
unsigned int full;
};
--
2.40.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [Patch v2 09/24] perf/x86/intel/ds: Factor out common PEBS processing code to functions
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (7 preceding siblings ...)
2025-02-18 15:28 ` [Patch v2 08/24] perf/x86/intel: Initialize architectural PEBS Dapeng Mi
@ 2025-02-18 15:28 ` Dapeng Mi
2025-02-18 15:28 ` [Patch v2 10/24] perf/x86/intel: Process arch-PEBS records or record fragments Dapeng Mi
` (14 subsequent siblings)
23 siblings, 0 replies; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Besides some PEBS record layout differences, arch-PEBS can share most
of the PEBS record processing code with adaptive PEBS. Thus, factor out
the common processing code into standalone inline functions so it can
be reused by the subsequent arch-PEBS handler.
Suggested-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/ds.c | 80 ++++++++++++++++++++++++++------------
1 file changed, 55 insertions(+), 25 deletions(-)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index e66c5f307e93..94865745e997 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2594,6 +2594,54 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
}
}
+static inline void __intel_pmu_handle_pebs_record(struct pt_regs *iregs,
+ struct pt_regs *regs,
+ struct perf_sample_data *data,
+ void *at, u64 pebs_status,
+ short *counts, void **last,
+ setup_fn setup_sample)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct perf_event *event;
+ int bit;
+
+ for_each_set_bit(bit, (unsigned long *)&pebs_status, X86_PMC_IDX_MAX) {
+ event = cpuc->events[bit];
+
+ if (WARN_ON_ONCE(!event) ||
+ WARN_ON_ONCE(!event->attr.precise_ip))
+ continue;
+
+ if (counts[bit]++)
+ __intel_pmu_pebs_event(event, iregs, regs, data,
+ last[bit], setup_sample);
+
+ last[bit] = at;
+ }
+}
+
+static inline void
+__intel_pmu_handle_last_pebs_record(struct pt_regs *iregs, struct pt_regs *regs,
+ struct perf_sample_data *data, u64 mask,
+ short *counts, void **last,
+ setup_fn setup_sample)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct perf_event *event;
+ int bit;
+
+ for_each_set_bit(bit, (unsigned long *)&mask, X86_PMC_IDX_MAX) {
+ if (!counts[bit])
+ continue;
+
+ event = cpuc->events[bit];
+
+ __intel_pmu_pebs_last_event(event, iregs, regs, data, last[bit],
+ counts[bit], setup_sample);
+ }
+
+}
+
static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_data *data)
{
short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
@@ -2603,9 +2651,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
struct x86_perf_regs perf_regs;
struct pt_regs *regs = &perf_regs.regs;
struct pebs_basic *basic;
- struct perf_event *event;
void *base, *at, *top;
- int bit;
u64 mask;
if (!x86_pmu.pebs_active)
@@ -2618,6 +2664,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
mask = hybrid(cpuc->pmu, pebs_events_mask) |
(hybrid(cpuc->pmu, fixed_cntr_mask64) << INTEL_PMC_IDX_FIXED);
+ mask &= cpuc->pebs_enabled;
if (unlikely(base >= top)) {
intel_pmu_pebs_event_update_no_drain(cpuc, X86_PMC_IDX_MAX);
@@ -2635,31 +2682,14 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
if (basic->format_size != cpuc->pebs_record_size)
continue;
- pebs_status = basic->applicable_counters & cpuc->pebs_enabled & mask;
- for_each_set_bit(bit, (unsigned long *)&pebs_status, X86_PMC_IDX_MAX) {
- event = cpuc->events[bit];
-
- if (WARN_ON_ONCE(!event) ||
- WARN_ON_ONCE(!event->attr.precise_ip))
- continue;
-
- if (counts[bit]++) {
- __intel_pmu_pebs_event(event, iregs, regs, data, last[bit],
- setup_pebs_adaptive_sample_data);
- }
- last[bit] = at;
- }
+ pebs_status = mask & basic->applicable_counters;
+ __intel_pmu_handle_pebs_record(iregs, regs, data, at,
+ pebs_status, counts, last,
+ setup_pebs_adaptive_sample_data);
}
- for_each_set_bit(bit, (unsigned long *)&mask, X86_PMC_IDX_MAX) {
- if (!counts[bit])
- continue;
-
- event = cpuc->events[bit];
-
- __intel_pmu_pebs_last_event(event, iregs, regs, data, last[bit],
- counts[bit], setup_pebs_adaptive_sample_data);
- }
+ __intel_pmu_handle_last_pebs_record(iregs, regs, data, mask, counts, last,
+ setup_pebs_adaptive_sample_data);
}
static void __init intel_arch_pebs_init(void)
--
2.40.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [Patch v2 10/24] perf/x86/intel: Process arch-PEBS records or record fragments
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (8 preceding siblings ...)
2025-02-18 15:28 ` [Patch v2 09/24] perf/x86/intel/ds: Factor out common PEBS processing code to functions Dapeng Mi
@ 2025-02-18 15:28 ` Dapeng Mi
2025-02-25 10:39 ` Peter Zijlstra
2025-02-18 15:28 ` [Patch v2 11/24] perf/x86/intel: Factor out common functions to process PEBS groups Dapeng Mi
` (13 subsequent siblings)
23 siblings, 1 reply; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
A significant difference from adaptive PEBS is that an arch-PEBS record
supports fragments: a record can be split into several independent
fragments, each carrying its own arch-PEBS header.
This patch defines the architectural PEBS record layout structures and
adds helpers to process arch-PEBS records or record fragments. Only the
legacy PEBS groups (basic, GPR, XMM and LBR) are supported in this
patch; capturing the newly added YMM/ZMM/OPMASK vector registers will be
supported in subsequent patches.
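A condensed sketch of the fragment walk implemented below (field names
taken from the new record layout structures in this patch):

	struct arch_pebs_header *header;

again:
	header = at;
	/* decode the basic/aux/GPR/XMM/LBR groups present in this fragment */
	if (header->cont || !(header->format & GENMASK_ULL(63, 16))) {
		/* continue bit or a null record means another fragment follows */
		at = at + header->size;
		goto again;
	}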
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/core.c | 9 ++
arch/x86/events/intel/ds.c | 219 ++++++++++++++++++++++++++++++
arch/x86/include/asm/msr-index.h | 6 +
arch/x86/include/asm/perf_event.h | 100 ++++++++++++++
4 files changed, 334 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 37540eb80029..184f69afde08 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3124,6 +3124,15 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
}
+ /*
+ * Arch PEBS sets bit 54 in the global status register
+ */
+ if (__test_and_clear_bit(GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT,
+ (unsigned long *)&status)) {
+ handled++;
+ x86_pmu.drain_pebs(regs, &data);
+ }
+
/*
* Intel PT
*/
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 94865745e997..f3c0e509c531 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2229,6 +2229,153 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
format_group);
}
+static inline bool arch_pebs_record_continued(struct arch_pebs_header *header)
+{
+ /* Continue bit or null PEBS record indicates fragment follows. */
+ return header->cont || !(header->format & GENMASK_ULL(63, 16));
+}
+
+static void setup_arch_pebs_sample_data(struct perf_event *event,
+ struct pt_regs *iregs, void *__pebs,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct arch_pebs_header *header = NULL;
+ struct arch_pebs_aux *meminfo = NULL;
+ struct arch_pebs_gprs *gprs = NULL;
+ struct x86_perf_regs *perf_regs;
+ void *next_record;
+ void *at = __pebs;
+ u64 sample_type;
+
+ if (at == NULL)
+ return;
+
+ perf_regs = container_of(regs, struct x86_perf_regs, regs);
+ perf_regs->xmm_regs = NULL;
+
+ sample_type = event->attr.sample_type;
+ perf_sample_data_init(data, 0, event->hw.last_period);
+ data->period = event->hw.last_period;
+
+ /*
+ * We must however always use iregs for the unwinder to stay sane; the
+ * record BP,SP,IP can point into thin air when the record is from a
+ * previous PMI context or an (I)RET happened between the record and
+ * PMI.
+ */
+ if (sample_type & PERF_SAMPLE_CALLCHAIN)
+ perf_sample_save_callchain(data, event, iregs);
+
+ *regs = *iregs;
+
+again:
+ header = at;
+ next_record = at + sizeof(struct arch_pebs_header);
+ if (header->basic) {
+ struct arch_pebs_basic *basic = next_record;
+
+ /* The ip in basic is EventingIP */
+ set_linear_ip(regs, basic->ip);
+ regs->flags = PERF_EFLAGS_EXACT;
+ setup_pebs_time(event, data, basic->tsc);
+
+ if (sample_type & PERF_SAMPLE_WEIGHT_STRUCT)
+ data->weight.var3_w = basic->valid ? basic->retire : 0;
+
+ next_record = basic + 1;
+ }
+
+ /*
+ * The record for MEMINFO is in front of GP
+ * But PERF_SAMPLE_TRANSACTION needs gprs->ax.
+ * Save the pointer here but process later.
+ */
+ if (header->aux) {
+ meminfo = next_record;
+ next_record = meminfo + 1;
+ }
+
+ if (header->gpr) {
+ gprs = next_record;
+ next_record = gprs + 1;
+
+ if (event->attr.precise_ip < 2) {
+ set_linear_ip(regs, gprs->ip);
+ regs->flags &= ~PERF_EFLAGS_EXACT;
+ }
+
+ if (sample_type & PERF_SAMPLE_REGS_INTR)
+ adaptive_pebs_save_regs(regs, (struct pebs_gprs *)gprs);
+ }
+
+ if (header->aux) {
+ if (sample_type & PERF_SAMPLE_WEIGHT_TYPE) {
+ u16 latency = meminfo->cache_latency;
+ u64 tsx_latency = intel_get_tsx_weight(meminfo->tsx_tuning);
+
+ data->weight.var2_w = meminfo->instr_latency;
+
+ if (sample_type & PERF_SAMPLE_WEIGHT)
+ data->weight.full = latency ?: tsx_latency;
+ else
+ data->weight.var1_dw = latency ?: (u32)tsx_latency;
+ data->sample_flags |= PERF_SAMPLE_WEIGHT_TYPE;
+ }
+
+ if (sample_type & PERF_SAMPLE_DATA_SRC) {
+ data->data_src.val = get_data_src(event, meminfo->aux);
+ data->sample_flags |= PERF_SAMPLE_DATA_SRC;
+ }
+
+ if (sample_type & PERF_SAMPLE_ADDR_TYPE) {
+ data->addr = meminfo->address;
+ data->sample_flags |= PERF_SAMPLE_ADDR;
+ }
+
+ if (sample_type & PERF_SAMPLE_TRANSACTION) {
+ data->txn = intel_get_tsx_transaction(meminfo->tsx_tuning,
+ gprs ? gprs->ax : 0);
+ data->sample_flags |= PERF_SAMPLE_TRANSACTION;
+ }
+ }
+
+ if (header->xmm) {
+ struct arch_pebs_xmm *xmm;
+
+ next_record += sizeof(struct arch_pebs_xer_header);
+
+ xmm = next_record;
+ perf_regs->xmm_regs = xmm->xmm;
+ next_record = xmm + 1;
+ }
+
+ if (header->lbr) {
+ struct arch_pebs_lbr_header *lbr_header = next_record;
+ struct lbr_entry *lbr;
+ int num_lbr;
+
+ next_record = lbr_header + 1;
+ lbr = next_record;
+
+ num_lbr = header->lbr == ARCH_PEBS_LBR_NUM_VAR ? lbr_header->depth :
+ header->lbr * ARCH_PEBS_BASE_LBR_ENTRIES;
+ next_record += num_lbr * sizeof(struct lbr_entry);
+
+ if (has_branch_stack(event)) {
+ intel_pmu_store_pebs_lbrs(lbr);
+ intel_pmu_lbr_save_brstack(data, cpuc, event);
+ }
+ }
+
+ /* Parse the following fragments, if any. */
+ if (arch_pebs_record_continued(header)) {
+ at = at + header->size;
+ goto again;
+ }
+}
+
static inline void *
get_next_pebs_record_by_bit(void *base, void *top, int bit)
{
@@ -2692,6 +2839,77 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
setup_pebs_adaptive_sample_data);
}
+static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
+ struct perf_sample_data *data)
+{
+ short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
+ void *last[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS];
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ union arch_pebs_index index;
+ struct x86_perf_regs perf_regs;
+ struct pt_regs *regs = &perf_regs.regs;
+ void *base, *at, *top;
+ u64 mask;
+
+ rdmsrl(MSR_IA32_PEBS_INDEX, index.full);
+
+ if (unlikely(!index.split.wr)) {
+ intel_pmu_pebs_event_update_no_drain(cpuc, X86_PMC_IDX_MAX);
+ return;
+ }
+
+ base = cpuc->ds_pebs_vaddr;
+ top = (void *)((u64)cpuc->ds_pebs_vaddr +
+ (index.split.wr << ARCH_PEBS_INDEX_WR_SHIFT));
+
+ mask = hybrid(cpuc->pmu, arch_pebs_cap).counters & cpuc->pebs_enabled;
+
+ if (!iregs)
+ iregs = &dummy_iregs;
+
+ /* Process all but the last event for each counter. */
+ for (at = base; at < top;) {
+ struct arch_pebs_header *header;
+ struct arch_pebs_basic *basic;
+ u64 pebs_status;
+
+ header = at;
+
+ if (WARN_ON_ONCE(!header->size))
+ break;
+
+ /* 1st fragment or single record must have basic group */
+ if (!header->basic) {
+ at += header->size;
+ continue;
+ }
+
+ basic = at + sizeof(struct arch_pebs_header);
+ pebs_status = mask & basic->applicable_counters;
+ __intel_pmu_handle_pebs_record(iregs, regs, data, at,
+ pebs_status, counts, last,
+ setup_arch_pebs_sample_data);
+
+ /* Skip non-last fragments */
+ while (arch_pebs_record_continued(header)) {
+ if (!header->size)
+ break;
+ at += header->size;
+ header = at;
+ }
+
+ /* Skip last fragment or the single record */
+ at += header->size;
+ }
+
+ __intel_pmu_handle_last_pebs_record(iregs, regs, data, mask, counts,
+ last, setup_arch_pebs_sample_data);
+
+ index.split.wr = 0;
+ index.split.full = 0;
+ wrmsrl(MSR_IA32_PEBS_INDEX, index.full);
+}
+
static void __init intel_arch_pebs_init(void)
{
/*
@@ -2701,6 +2919,7 @@ static void __init intel_arch_pebs_init(void)
*/
x86_pmu.arch_pebs = 1;
x86_pmu.pebs_buffer_size = PEBS_BUFFER_SIZE;
+ x86_pmu.drain_pebs = intel_pmu_drain_arch_pebs;
x86_pmu.pebs_capable = ~0ULL;
x86_pmu.pebs_enable = __intel_pmu_pebs_enable;
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 9a71880eec07..dd09ae9752a7 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -312,6 +312,12 @@
#define PERF_CAP_PEBS_MASK (PERF_CAP_PEBS_TRAP | PERF_CAP_ARCH_REG | \
PERF_CAP_PEBS_FORMAT | PERF_CAP_PEBS_BASELINE)
+/* Arch PEBS */
+#define MSR_IA32_PEBS_BASE 0x000003f4
+#define MSR_IA32_PEBS_INDEX 0x000003f5
+#define ARCH_PEBS_OFFSET_MASK 0x7fffff
+#define ARCH_PEBS_INDEX_WR_SHIFT 4
+
#define MSR_IA32_RTIT_CTL 0x00000570
#define RTIT_CTL_TRACEEN BIT(0)
#define RTIT_CTL_CYCLEACC BIT(1)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 4f0c01610175..4103cc745e86 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -432,6 +432,8 @@ static inline bool is_topdown_idx(int idx)
#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT)
#define GLOBAL_STATUS_TRACE_TOPAPMI_BIT 55
#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(GLOBAL_STATUS_TRACE_TOPAPMI_BIT)
+#define GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT 54
+#define GLOBAL_STATUS_ARCH_PEBS_THRESHOLD BIT_ULL(GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT)
#define GLOBAL_STATUS_PERF_METRICS_OVF_BIT 48
#define GLOBAL_CTRL_EN_PERF_METRICS 48
@@ -502,6 +504,104 @@ struct pebs_cntr_header {
#define INTEL_CNTR_METRICS 0x3
+/*
+ * Arch PEBS
+ */
+union arch_pebs_index {
+ struct {
+ u64 rsvd:4,
+ wr:23,
+ rsvd2:4,
+ full:1,
+ en:1,
+ rsvd3:3,
+ thresh:23,
+ rsvd4:5;
+ } split;
+ u64 full;
+};
+
+struct arch_pebs_header {
+ union {
+ u64 format;
+ struct {
+ u64 size:16, /* Record size */
+ rsvd:14,
+ mode:1, /* 64BIT_MODE */
+ cont:1,
+ rsvd2:3,
+ cntr:5,
+ lbr:2,
+ rsvd3:7,
+ xmm:1,
+ ymmh:1,
+ rsvd4:2,
+ opmask:1,
+ zmmh:1,
+ h16zmm:1,
+ rsvd5:5,
+ gpr:1,
+ aux:1,
+ basic:1;
+ };
+ };
+ u64 rsvd6;
+};
+
+struct arch_pebs_basic {
+ u64 ip;
+ u64 applicable_counters;
+ u64 tsc;
+ u64 retire :16, /* Retire Latency */
+ valid :1,
+ rsvd :47;
+ u64 rsvd2;
+ u64 rsvd3;
+};
+
+struct arch_pebs_aux {
+ u64 address;
+ u64 rsvd;
+ u64 rsvd2;
+ u64 rsvd3;
+ u64 rsvd4;
+ u64 aux;
+ u64 instr_latency :16,
+ pad2 :16,
+ cache_latency :16,
+ pad3 :16;
+ u64 tsx_tuning;
+};
+
+struct arch_pebs_gprs {
+ u64 flags, ip, ax, cx, dx, bx, sp, bp, si, di;
+ u64 r8, r9, r10, r11, r12, r13, r14, r15, ssp;
+ u64 rsvd;
+};
+
+struct arch_pebs_xer_header {
+ u64 xstate;
+ u64 rsvd;
+};
+
+struct arch_pebs_xmm {
+ u64 xmm[16*2]; /* two entries for each register */
+};
+
+#define ARCH_PEBS_LBR_NAN 0x0
+#define ARCH_PEBS_LBR_NUM_8 0x1
+#define ARCH_PEBS_LBR_NUM_16 0x2
+#define ARCH_PEBS_LBR_NUM_VAR 0x3
+#define ARCH_PEBS_BASE_LBR_ENTRIES 8
+struct arch_pebs_lbr_header {
+ u64 rsvd;
+ u64 ctl;
+ u64 depth;
+ u64 ler_from;
+ u64 ler_to;
+ u64 ler_info;
+};
+
/*
* AMD Extended Performance Monitoring and Debug cpuid feature detection
*/
--
2.40.1
* [Patch v2 11/24] perf/x86/intel: Factor out common functions to process PEBS groups
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (9 preceding siblings ...)
2025-02-18 15:28 ` [Patch v2 10/24] perf/x86/intel: Process arch-PEBS records or record fragments Dapeng Mi
@ 2025-02-18 15:28 ` Dapeng Mi
2025-02-25 11:02 ` Peter Zijlstra
2025-02-18 15:28 ` [Patch v2 12/24] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR Dapeng Mi
` (12 subsequent siblings)
23 siblings, 1 reply; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Adaptive PEBS and arch-PEBS share a lot of code to process the PEBS
groups, such as the basic, GPR and meminfo groups. Extract this shared
code into common functions to avoid duplication.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/ds.c | 239 ++++++++++++++++++-------------------
1 file changed, 119 insertions(+), 120 deletions(-)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index f3c0e509c531..65eaba3aa48d 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2068,6 +2068,91 @@ static inline void __setup_pebs_counter_group(struct cpu_hw_events *cpuc,
#define PEBS_LATENCY_MASK 0xffff
+static inline void __setup_perf_sample_data(struct perf_event *event,
+ struct pt_regs *iregs,
+ struct perf_sample_data *data)
+{
+ perf_sample_data_init(data, 0, event->hw.last_period);
+ data->period = event->hw.last_period;
+
+ /*
+ * We must however always use iregs for the unwinder to stay sane; the
+ * record BP,SP,IP can point into thin air when the record is from a
+ * previous PMI context or an (I)RET happened between the record and
+ * PMI.
+ */
+ perf_sample_save_callchain(data, event, iregs);
+}
+
+static inline void __setup_pebs_basic_group(struct perf_event *event,
+ struct pt_regs *regs,
+ struct perf_sample_data *data,
+ u64 sample_type, u64 ip,
+ u64 tsc, u16 retire)
+{
+ /* The ip in basic is EventingIP */
+ set_linear_ip(regs, ip);
+ regs->flags = PERF_EFLAGS_EXACT;
+ setup_pebs_time(event, data, tsc);
+
+ if (sample_type & PERF_SAMPLE_WEIGHT_STRUCT)
+ data->weight.var3_w = retire;
+}
+
+static inline void __setup_pebs_gpr_group(struct perf_event *event,
+ struct pt_regs *regs,
+ struct pebs_gprs *gprs,
+ u64 sample_type)
+{
+ if (event->attr.precise_ip < 2) {
+ set_linear_ip(regs, gprs->ip);
+ regs->flags &= ~PERF_EFLAGS_EXACT;
+ }
+
+ if (sample_type & PERF_SAMPLE_REGS_INTR)
+ adaptive_pebs_save_regs(regs, gprs);
+}
+
+static inline void __setup_pebs_meminfo_group(struct perf_event *event,
+ struct perf_sample_data *data,
+ u64 sample_type, u64 latency,
+ u16 instr_latency, u64 address,
+ u64 aux, u64 tsx_tuning, u64 ax)
+{
+ if (sample_type & PERF_SAMPLE_WEIGHT_TYPE) {
+ u64 tsx_latency = intel_get_tsx_weight(tsx_tuning);
+
+ data->weight.var2_w = instr_latency;
+
+ /*
+ * Although meminfo::latency is defined as a u64,
+ * only the lower 32 bits include the valid data
+ * in practice on Ice Lake and earlier platforms.
+ */
+ if (sample_type & PERF_SAMPLE_WEIGHT)
+ data->weight.full = latency ?: tsx_latency;
+ else
+ data->weight.var1_dw = (u32)latency ?: tsx_latency;
+
+ data->sample_flags |= PERF_SAMPLE_WEIGHT_TYPE;
+ }
+
+ if (sample_type & PERF_SAMPLE_DATA_SRC) {
+ data->data_src.val = get_data_src(event, aux);
+ data->sample_flags |= PERF_SAMPLE_DATA_SRC;
+ }
+
+ if (sample_type & PERF_SAMPLE_ADDR_TYPE) {
+ data->addr = address;
+ data->sample_flags |= PERF_SAMPLE_ADDR;
+ }
+
+ if (sample_type & PERF_SAMPLE_TRANSACTION) {
+ data->txn = intel_get_tsx_transaction(tsx_tuning, ax);
+ data->sample_flags |= PERF_SAMPLE_TRANSACTION;
+ }
+}
+
/*
* With adaptive PEBS the layout depends on what fields are configured.
*/
@@ -2077,12 +2162,14 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
struct pt_regs *regs)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ u64 sample_type = event->attr.sample_type;
struct pebs_basic *basic = __pebs;
void *next_record = basic + 1;
- u64 sample_type, format_group;
struct pebs_meminfo *meminfo = NULL;
struct pebs_gprs *gprs = NULL;
struct x86_perf_regs *perf_regs;
+ u64 format_group;
+ u16 retire;
if (basic == NULL)
return;
@@ -2090,32 +2177,17 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
perf_regs = container_of(regs, struct x86_perf_regs, regs);
perf_regs->xmm_regs = NULL;
- sample_type = event->attr.sample_type;
format_group = basic->format_group;
- perf_sample_data_init(data, 0, event->hw.last_period);
- data->period = event->hw.last_period;
- setup_pebs_time(event, data, basic->tsc);
-
- /*
- * We must however always use iregs for the unwinder to stay sane; the
- * record BP,SP,IP can point into thin air when the record is from a
- * previous PMI context or an (I)RET happened between the record and
- * PMI.
- */
- perf_sample_save_callchain(data, event, iregs);
+ __setup_perf_sample_data(event, iregs, data);
*regs = *iregs;
- /* The ip in basic is EventingIP */
- set_linear_ip(regs, basic->ip);
- regs->flags = PERF_EFLAGS_EXACT;
- if (sample_type & PERF_SAMPLE_WEIGHT_STRUCT) {
- if (x86_pmu.flags & PMU_FL_RETIRE_LATENCY)
- data->weight.var3_w = basic->retire_latency;
- else
- data->weight.var3_w = 0;
- }
+ /* basic group */
+ retire = x86_pmu.flags & PMU_FL_RETIRE_LATENCY ?
+ basic->retire_latency : 0;
+ __setup_pebs_basic_group(event, regs, data, sample_type,
+ basic->ip, basic->tsc, retire);
/*
* The record for MEMINFO is in front of GP
@@ -2131,54 +2203,20 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
gprs = next_record;
next_record = gprs + 1;
- if (event->attr.precise_ip < 2) {
- set_linear_ip(regs, gprs->ip);
- regs->flags &= ~PERF_EFLAGS_EXACT;
- }
-
- if (sample_type & PERF_SAMPLE_REGS_INTR)
- adaptive_pebs_save_regs(regs, gprs);
+ __setup_pebs_gpr_group(event, regs, gprs, sample_type);
}
if (format_group & PEBS_DATACFG_MEMINFO) {
- if (sample_type & PERF_SAMPLE_WEIGHT_TYPE) {
- u64 latency = x86_pmu.flags & PMU_FL_INSTR_LATENCY ?
- meminfo->cache_latency : meminfo->mem_latency;
-
- if (x86_pmu.flags & PMU_FL_INSTR_LATENCY)
- data->weight.var2_w = meminfo->instr_latency;
-
- /*
- * Although meminfo::latency is defined as a u64,
- * only the lower 32 bits include the valid data
- * in practice on Ice Lake and earlier platforms.
- */
- if (sample_type & PERF_SAMPLE_WEIGHT) {
- data->weight.full = latency ?:
- intel_get_tsx_weight(meminfo->tsx_tuning);
- } else {
- data->weight.var1_dw = (u32)latency ?:
- intel_get_tsx_weight(meminfo->tsx_tuning);
- }
-
- data->sample_flags |= PERF_SAMPLE_WEIGHT_TYPE;
- }
-
- if (sample_type & PERF_SAMPLE_DATA_SRC) {
- data->data_src.val = get_data_src(event, meminfo->aux);
- data->sample_flags |= PERF_SAMPLE_DATA_SRC;
- }
+ u64 latency = x86_pmu.flags & PMU_FL_INSTR_LATENCY ?
+ meminfo->cache_latency : meminfo->mem_latency;
+ u64 instr_latency = x86_pmu.flags & PMU_FL_INSTR_LATENCY ?
+ meminfo->instr_latency : 0;
+ u64 ax = gprs ? gprs->ax : 0;
- if (sample_type & PERF_SAMPLE_ADDR_TYPE) {
- data->addr = meminfo->address;
- data->sample_flags |= PERF_SAMPLE_ADDR;
- }
-
- if (sample_type & PERF_SAMPLE_TRANSACTION) {
- data->txn = intel_get_tsx_transaction(meminfo->tsx_tuning,
- gprs ? gprs->ax : 0);
- data->sample_flags |= PERF_SAMPLE_TRANSACTION;
- }
+ __setup_pebs_meminfo_group(event, data, sample_type, latency,
+ instr_latency, meminfo->address,
+ meminfo->aux, meminfo->tsx_tuning,
+ ax);
}
if (format_group & PEBS_DATACFG_XMMS) {
@@ -2241,13 +2279,13 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
struct pt_regs *regs)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ u64 sample_type = event->attr.sample_type;
struct arch_pebs_header *header = NULL;
struct arch_pebs_aux *meminfo = NULL;
struct arch_pebs_gprs *gprs = NULL;
struct x86_perf_regs *perf_regs;
void *next_record;
void *at = __pebs;
- u64 sample_type;
if (at == NULL)
return;
@@ -2255,18 +2293,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
perf_regs = container_of(regs, struct x86_perf_regs, regs);
perf_regs->xmm_regs = NULL;
- sample_type = event->attr.sample_type;
- perf_sample_data_init(data, 0, event->hw.last_period);
- data->period = event->hw.last_period;
-
- /*
- * We must however always use iregs for the unwinder to stay sane; the
- * record BP,SP,IP can point into thin air when the record is from a
- * previous PMI context or an (I)RET happened between the record and
- * PMI.
- */
- if (sample_type & PERF_SAMPLE_CALLCHAIN)
- perf_sample_save_callchain(data, event, iregs);
+ __setup_perf_sample_data(event, iregs, data);
*regs = *iregs;
@@ -2275,16 +2302,14 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
next_record = at + sizeof(struct arch_pebs_header);
if (header->basic) {
struct arch_pebs_basic *basic = next_record;
+ u16 retire = 0;
- /* The ip in basic is EventingIP */
- set_linear_ip(regs, basic->ip);
- regs->flags = PERF_EFLAGS_EXACT;
- setup_pebs_time(event, data, basic->tsc);
+ next_record = basic + 1;
if (sample_type & PERF_SAMPLE_WEIGHT_STRUCT)
- data->weight.var3_w = basic->valid ? basic->retire : 0;
-
- next_record = basic + 1;
+ retire = basic->valid ? basic->retire : 0;
+ __setup_pebs_basic_group(event, regs, data, sample_type,
+ basic->ip, basic->tsc, retire);
}
/*
@@ -2301,44 +2326,18 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
gprs = next_record;
next_record = gprs + 1;
- if (event->attr.precise_ip < 2) {
- set_linear_ip(regs, gprs->ip);
- regs->flags &= ~PERF_EFLAGS_EXACT;
- }
-
- if (sample_type & PERF_SAMPLE_REGS_INTR)
- adaptive_pebs_save_regs(regs, (struct pebs_gprs *)gprs);
+ __setup_pebs_gpr_group(event, regs, (struct pebs_gprs *)gprs,
+ sample_type);
}
if (header->aux) {
- if (sample_type & PERF_SAMPLE_WEIGHT_TYPE) {
- u16 latency = meminfo->cache_latency;
- u64 tsx_latency = intel_get_tsx_weight(meminfo->tsx_tuning);
+ u64 ax = gprs ? gprs->ax : 0;
- data->weight.var2_w = meminfo->instr_latency;
-
- if (sample_type & PERF_SAMPLE_WEIGHT)
- data->weight.full = latency ?: tsx_latency;
- else
- data->weight.var1_dw = latency ?: (u32)tsx_latency;
- data->sample_flags |= PERF_SAMPLE_WEIGHT_TYPE;
- }
-
- if (sample_type & PERF_SAMPLE_DATA_SRC) {
- data->data_src.val = get_data_src(event, meminfo->aux);
- data->sample_flags |= PERF_SAMPLE_DATA_SRC;
- }
-
- if (sample_type & PERF_SAMPLE_ADDR_TYPE) {
- data->addr = meminfo->address;
- data->sample_flags |= PERF_SAMPLE_ADDR;
- }
-
- if (sample_type & PERF_SAMPLE_TRANSACTION) {
- data->txn = intel_get_tsx_transaction(meminfo->tsx_tuning,
- gprs ? gprs->ax : 0);
- data->sample_flags |= PERF_SAMPLE_TRANSACTION;
- }
+ __setup_pebs_meminfo_group(event, data, sample_type,
+ meminfo->cache_latency,
+ meminfo->instr_latency,
+ meminfo->address, meminfo->aux,
+ meminfo->tsx_tuning, ax);
}
if (header->xmm) {
--
2.40.1
* [Patch v2 12/24] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (10 preceding siblings ...)
2025-02-18 15:28 ` [Patch v2 11/24] perf/x86/intel: Factor out common functions to process PEBS groups Dapeng Mi
@ 2025-02-18 15:28 ` Dapeng Mi
2025-02-25 11:18 ` Peter Zijlstra
2025-02-25 11:25 ` Peter Zijlstra
2025-02-18 15:28 ` [Patch v2 13/24] perf/x86/intel: Update dyn_constraint based on PEBS event precise level Dapeng Mi
` (11 subsequent siblings)
23 siblings, 2 replies; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Arch-PEBS introduces a new MSR, IA32_PEBS_BASE, to store the physical
address of the arch-PEBS buffer. This patch allocates the arch-PEBS
buffer and then initializes the IA32_PEBS_BASE MSR with the buffer's
physical address.
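As a rough illustration of how the new MSR value is composed (following
the comment in the hunk further down: a 4KB-aligned physical buffer
address, physically contiguous, with the size exponent in the low bits
so that buffer size = 4KB << size_order). Names with sketch_/SK_
prefixes are made up for this example; the kernel code uses
virt_to_phys() and PEBS_BUFFER_SHIFT for the two inputs.
#include <stdint.h>
#define SK_PAGE_SHIFT	12	/* 4KB pages */
static uint64_t sketch_pebs_base(uint64_t buf_phys, unsigned int size_order)
{
	/* keep the 4KB-aligned address, encode the size exponent below it */
	return (buf_phys & ~((1ULL << SK_PAGE_SHIFT) - 1)) | size_order;
}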
Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/core.c | 4 +-
arch/x86/events/intel/core.c | 4 +-
arch/x86/events/intel/ds.c | 112 ++++++++++++++++++++------------
arch/x86/events/perf_event.h | 16 ++---
arch/x86/include/asm/intel_ds.h | 3 +-
5 files changed, 84 insertions(+), 55 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 24ae1159d6b9..4eaafabf033e 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -416,7 +416,7 @@ int x86_reserve_hardware(void)
if (!reserve_pmc_hardware()) {
err = -EBUSY;
} else {
- reserve_ds_buffers();
+ reserve_bts_pebs_buffers();
reserve_lbr_buffers();
}
}
@@ -432,7 +432,7 @@ void x86_release_hardware(void)
{
if (atomic_dec_and_mutex_lock(&pmc_refcount, &pmc_reserve_mutex)) {
release_pmc_hardware();
- release_ds_buffers();
+ release_bts_pebs_buffers();
release_lbr_buffers();
mutex_unlock(&pmc_reserve_mutex);
}
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 184f69afde08..472366c3db22 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -5129,7 +5129,7 @@ static void intel_pmu_cpu_starting(int cpu)
if (is_hybrid() && !init_hybrid_pmu(cpu))
return;
- init_debug_store_on_cpu(cpu);
+ init_pebs_buf_on_cpu(cpu);
/*
* Deal with CPUs that don't clear their LBRs on power-up.
*/
@@ -5223,7 +5223,7 @@ static void free_excl_cntrs(struct cpu_hw_events *cpuc)
static void intel_pmu_cpu_dying(int cpu)
{
- fini_debug_store_on_cpu(cpu);
+ fini_pebs_buf_on_cpu(cpu);
}
void intel_cpuc_finish(struct cpu_hw_events *cpuc)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 65eaba3aa48d..519767fc9180 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -545,26 +545,6 @@ struct pebs_record_skl {
u64 tsc;
};
-void init_debug_store_on_cpu(int cpu)
-{
- struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
-
- if (!ds)
- return;
-
- wrmsr_on_cpu(cpu, MSR_IA32_DS_AREA,
- (u32)((u64)(unsigned long)ds),
- (u32)((u64)(unsigned long)ds >> 32));
-}
-
-void fini_debug_store_on_cpu(int cpu)
-{
- if (!per_cpu(cpu_hw_events, cpu).ds)
- return;
-
- wrmsr_on_cpu(cpu, MSR_IA32_DS_AREA, 0, 0);
-}
-
static DEFINE_PER_CPU(void *, insn_buffer);
static void ds_update_cea(void *cea, void *addr, size_t size, pgprot_t prot)
@@ -624,13 +604,18 @@ static int alloc_pebs_buffer(int cpu)
int max, node = cpu_to_node(cpu);
void *buffer, *insn_buff, *cea;
- if (!x86_pmu.ds_pebs)
+ if (!intel_pmu_has_pebs())
return 0;
- buffer = dsalloc_pages(bsiz, GFP_KERNEL, cpu);
+ buffer = dsalloc_pages(bsiz, preemptible() ? GFP_KERNEL : GFP_ATOMIC, cpu);
if (unlikely(!buffer))
return -ENOMEM;
+ if (x86_pmu.arch_pebs) {
+ hwev->pebs_vaddr = buffer;
+ return 0;
+ }
+
/*
* HSW+ already provides us the eventing ip; no need to allocate this
* buffer then.
@@ -643,7 +628,7 @@ static int alloc_pebs_buffer(int cpu)
}
per_cpu(insn_buffer, cpu) = insn_buff;
}
- hwev->ds_pebs_vaddr = buffer;
+ hwev->pebs_vaddr = buffer;
/* Update the cpu entry area mapping */
cea = &get_cpu_entry_area(cpu)->cpu_debug_buffers.pebs_buffer;
ds->pebs_buffer_base = (unsigned long) cea;
@@ -659,17 +644,20 @@ static void release_pebs_buffer(int cpu)
struct cpu_hw_events *hwev = per_cpu_ptr(&cpu_hw_events, cpu);
void *cea;
- if (!x86_pmu.ds_pebs)
+ if (!intel_pmu_has_pebs())
return;
- kfree(per_cpu(insn_buffer, cpu));
- per_cpu(insn_buffer, cpu) = NULL;
+ if (x86_pmu.ds_pebs) {
+ kfree(per_cpu(insn_buffer, cpu));
+ per_cpu(insn_buffer, cpu) = NULL;
- /* Clear the fixmap */
- cea = &get_cpu_entry_area(cpu)->cpu_debug_buffers.pebs_buffer;
- ds_clear_cea(cea, x86_pmu.pebs_buffer_size);
- dsfree_pages(hwev->ds_pebs_vaddr, x86_pmu.pebs_buffer_size);
- hwev->ds_pebs_vaddr = NULL;
+ /* Clear the fixmap */
+ cea = &get_cpu_entry_area(cpu)->cpu_debug_buffers.pebs_buffer;
+ ds_clear_cea(cea, x86_pmu.pebs_buffer_size);
+ }
+
+ dsfree_pages(hwev->pebs_vaddr, x86_pmu.pebs_buffer_size);
+ hwev->pebs_vaddr = NULL;
}
static int alloc_bts_buffer(int cpu)
@@ -730,11 +718,11 @@ static void release_ds_buffer(int cpu)
per_cpu(cpu_hw_events, cpu).ds = NULL;
}
-void release_ds_buffers(void)
+void release_bts_pebs_buffers(void)
{
int cpu;
- if (!x86_pmu.bts && !x86_pmu.ds_pebs)
+ if (!x86_pmu.bts && !intel_pmu_has_pebs())
return;
for_each_possible_cpu(cpu)
@@ -746,7 +734,7 @@ void release_ds_buffers(void)
* observe cpu_hw_events.ds and not program the DS_AREA when
* they come up.
*/
- fini_debug_store_on_cpu(cpu);
+ fini_pebs_buf_on_cpu(cpu);
}
for_each_possible_cpu(cpu) {
@@ -755,7 +743,7 @@ void release_ds_buffers(void)
}
}
-void reserve_ds_buffers(void)
+void reserve_bts_pebs_buffers(void)
{
int bts_err = 0, pebs_err = 0;
int cpu;
@@ -763,19 +751,20 @@ void reserve_ds_buffers(void)
x86_pmu.bts_active = 0;
x86_pmu.pebs_active = 0;
- if (!x86_pmu.bts && !x86_pmu.ds_pebs)
+ if (!x86_pmu.bts && !intel_pmu_has_pebs())
return;
if (!x86_pmu.bts)
bts_err = 1;
- if (!x86_pmu.ds_pebs)
+ if (!intel_pmu_has_pebs())
pebs_err = 1;
for_each_possible_cpu(cpu) {
if (alloc_ds_buffer(cpu)) {
bts_err = 1;
- pebs_err = 1;
+ if (x86_pmu.ds_pebs)
+ pebs_err = 1;
}
if (!bts_err && alloc_bts_buffer(cpu))
@@ -805,7 +794,7 @@ void reserve_ds_buffers(void)
if (x86_pmu.bts && !bts_err)
x86_pmu.bts_active = 1;
- if (x86_pmu.ds_pebs && !pebs_err)
+ if (intel_pmu_has_pebs() && !pebs_err)
x86_pmu.pebs_active = 1;
for_each_possible_cpu(cpu) {
@@ -813,11 +802,50 @@ void reserve_ds_buffers(void)
* Ignores wrmsr_on_cpu() errors for offline CPUs they
* will get this call through intel_pmu_cpu_starting().
*/
- init_debug_store_on_cpu(cpu);
+ init_pebs_buf_on_cpu(cpu);
}
}
}
+void init_pebs_buf_on_cpu(int cpu)
+{
+ struct cpu_hw_events *cpuc = per_cpu_ptr(&cpu_hw_events, cpu);
+
+ if (x86_pmu.arch_pebs) {
+ u64 arch_pebs_base;
+
+ if (!cpuc->pebs_vaddr)
+ return;
+
+ /*
+ * 4KB-aligned pointer of the output buffer
+ * (__alloc_pages_node() return page aligned address)
+ * Buffer Size = 4KB * 2^SIZE
+ * contiguous physical buffer (__alloc_pages_node() with order)
+ */
+ arch_pebs_base = virt_to_phys(cpuc->pebs_vaddr) | PEBS_BUFFER_SHIFT;
+
+ wrmsr_on_cpu(cpu, MSR_IA32_PEBS_BASE,
+ (u32)arch_pebs_base,
+ (u32)(arch_pebs_base >> 32));
+ } else if (cpuc->ds) {
+ /* legacy PEBS */
+ wrmsr_on_cpu(cpu, MSR_IA32_DS_AREA,
+ (u32)((u64)(unsigned long)cpuc->ds),
+ (u32)((u64)(unsigned long)cpuc->ds >> 32));
+ }
+}
+
+void fini_pebs_buf_on_cpu(int cpu)
+{
+ struct cpu_hw_events *cpuc = per_cpu_ptr(&cpu_hw_events, cpu);
+
+ if (x86_pmu.arch_pebs)
+ wrmsr_on_cpu(cpu, MSR_IA32_PEBS_BASE, 0, 0);
+ else if (cpuc->ds)
+ wrmsr_on_cpu(cpu, MSR_IA32_DS_AREA, 0, 0);
+}
+
/*
* BTS
*/
@@ -2857,8 +2885,8 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
return;
}
- base = cpuc->ds_pebs_vaddr;
- top = (void *)((u64)cpuc->ds_pebs_vaddr +
+ base = cpuc->pebs_vaddr;
+ top = (void *)((u64)cpuc->pebs_vaddr +
(index.split.wr << ARCH_PEBS_INDEX_WR_SHIFT));
mask = hybrid(cpuc->pmu, arch_pebs_cap).counters & cpuc->pebs_enabled;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 265d76a321dd..1f20892f4040 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -266,11 +266,11 @@ struct cpu_hw_events {
int is_fake;
/*
- * Intel DebugStore bits
+ * Intel DebugStore/PEBS bits
*/
struct debug_store *ds;
- void *ds_pebs_vaddr;
void *ds_bts_vaddr;
+ void *pebs_vaddr;
u64 pebs_enabled;
int n_pebs;
int n_large_pebs;
@@ -1603,13 +1603,13 @@ extern void intel_cpuc_finish(struct cpu_hw_events *cpuc);
int intel_pmu_init(void);
-void init_debug_store_on_cpu(int cpu);
+void init_pebs_buf_on_cpu(int cpu);
-void fini_debug_store_on_cpu(int cpu);
+void fini_pebs_buf_on_cpu(int cpu);
-void release_ds_buffers(void);
+void release_bts_pebs_buffers(void);
-void reserve_ds_buffers(void);
+void reserve_bts_pebs_buffers(void);
void release_lbr_buffers(void);
@@ -1796,11 +1796,11 @@ static inline bool intel_pmu_has_pebs(void)
#else /* CONFIG_CPU_SUP_INTEL */
-static inline void reserve_ds_buffers(void)
+static inline void reserve_bts_pebs_buffers(void)
{
}
-static inline void release_ds_buffers(void)
+static inline void release_bts_pebs_buffers(void)
{
}
diff --git a/arch/x86/include/asm/intel_ds.h b/arch/x86/include/asm/intel_ds.h
index 5dbeac48a5b9..023c2883f9f3 100644
--- a/arch/x86/include/asm/intel_ds.h
+++ b/arch/x86/include/asm/intel_ds.h
@@ -4,7 +4,8 @@
#include <linux/percpu-defs.h>
#define BTS_BUFFER_SIZE (PAGE_SIZE << 4)
-#define PEBS_BUFFER_SIZE (PAGE_SIZE << 4)
+#define PEBS_BUFFER_SHIFT 4
+#define PEBS_BUFFER_SIZE (PAGE_SIZE << PEBS_BUFFER_SHIFT)
/* The maximal number of PEBS events: */
#define MAX_PEBS_EVENTS_FMT4 8
--
2.40.1
* [Patch v2 13/24] perf/x86/intel: Update dyn_constraint based on PEBS event precise level
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (11 preceding siblings ...)
2025-02-18 15:28 ` [Patch v2 12/24] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR Dapeng Mi
@ 2025-02-18 15:28 ` Dapeng Mi
2025-02-27 14:06 ` Liang, Kan
2025-02-18 15:28 ` [Patch v2 14/24] perf/x86/intel: Setup PEBS data configuration and enable legacy groups Dapeng Mi
` (10 subsequent siblings)
23 siblings, 1 reply; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
arch-PEBS provides CPUID leaves to enumerate which counters support PEBS
sampling and precise-distribution PEBS sampling. Thus PEBS constraints
should be configured dynamically based on these counter and
precise-distribution bitmaps instead of being defined statically.
Update the event's dyn_constraint based on the PEBS event's precise level.
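A minimal sketch of the selection rule, assuming pdists and counters are
the two bitmaps enumerated by CPUID as described above (the helper name
is made up; the actual assignment is in the hunk below):
#include <stdint.h>
/*
 * precise_ip >= 3 requests precise distribution, so only counters that
 * support pdist qualify; otherwise any PEBS-capable counter is allowed.
 */
static uint64_t sketch_dyn_constraint(unsigned int precise_ip,
				      uint64_t pdists, uint64_t counters)
{
	return precise_ip >= 3 ? pdists : counters;
}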
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/core.c | 6 ++++++
arch/x86/events/intel/ds.c | 1 +
2 files changed, 7 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 472366c3db22..c777e0531d40 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4033,6 +4033,8 @@ static int intel_pmu_hw_config(struct perf_event *event)
return ret;
if (event->attr.precise_ip) {
+ struct arch_pebs_cap pebs_cap = hybrid(event->pmu, arch_pebs_cap);
+
if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == INTEL_FIXED_VLBR_EVENT)
return -EINVAL;
@@ -4046,6 +4048,10 @@ static int intel_pmu_hw_config(struct perf_event *event)
}
if (x86_pmu.pebs_aliases)
x86_pmu.pebs_aliases(event);
+
+ if (x86_pmu.arch_pebs)
+ event->hw.dyn_constraint = event->attr.precise_ip >= 3 ?
+ pebs_cap.pdists : pebs_cap.counters;
}
if (needs_branch_stack(event)) {
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 519767fc9180..615aefb4e52e 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2948,6 +2948,7 @@ static void __init intel_arch_pebs_init(void)
x86_pmu.pebs_buffer_size = PEBS_BUFFER_SIZE;
x86_pmu.drain_pebs = intel_pmu_drain_arch_pebs;
x86_pmu.pebs_capable = ~0ULL;
+ x86_pmu.flags |= PMU_FL_PEBS_ALL;
x86_pmu.pebs_enable = __intel_pmu_pebs_enable;
x86_pmu.pebs_disable = __intel_pmu_pebs_disable;
--
2.40.1
* [Patch v2 14/24] perf/x86/intel: Setup PEBS data configuration and enable legacy groups
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (12 preceding siblings ...)
2025-02-18 15:28 ` [Patch v2 13/24] perf/x86/intel: Update dyn_constraint based on PEBS event precise level Dapeng Mi
@ 2025-02-18 15:28 ` Dapeng Mi
2025-02-18 15:28 ` [Patch v2 15/24] perf/x86/intel: Add SSP register support for arch-PEBS Dapeng Mi
` (9 subsequent siblings)
23 siblings, 0 replies; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Unlike legacy PEBS, arch-PEBS provides per-counter PEBS data
configuration by programming the IA32_PMC_GPx/FXx_CFG_C MSRs.
This patch obtains the PEBS data configuration from the event attribute,
writes it to the corresponding IA32_PMC_GPx/FXx_CFG_C MSR and enables
the corresponding PEBS groups.
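A rough sketch of how such a CFG_C value can be composed, using the bit
positions of the ARCH_PEBS_* defines added by this patch (the SK_
names and the helper are made up for this example; the kernel
additionally masks each group bit with the capabilities enumerated by
CPUID, as the hunk below does):
#include <stdint.h>
#define SK_ARCH_PEBS_RELOAD	0xffffffffULL
#define SK_ARCH_PEBS_GPR	(1ULL << 61)
#define SK_ARCH_PEBS_AUX	(1ULL << 62)
#define SK_ARCH_PEBS_EN		(1ULL << 63)
/* CFG_C for a PEBS event wanting GPR and memory-info groups. */
static uint64_t sketch_cfg_c(uint64_t period, int want_gprs, int want_meminfo)
{
	uint64_t ext = SK_ARCH_PEBS_EN | ((0 - period) & SK_ARCH_PEBS_RELOAD);
	if (want_gprs)
		ext |= SK_ARCH_PEBS_GPR;
	if (want_meminfo)
		ext |= SK_ARCH_PEBS_AUX;
	return ext;
}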
Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/core.c | 127 +++++++++++++++++++++++++++++++
arch/x86/events/intel/ds.c | 17 +++++
arch/x86/events/perf_event.h | 13 ++++
arch/x86/include/asm/intel_ds.h | 7 ++
arch/x86/include/asm/msr-index.h | 10 +++
5 files changed, 174 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index c777e0531d40..b80a66751136 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2568,6 +2568,39 @@ static void intel_pmu_disable_fixed(struct perf_event *event)
cpuc->fixed_ctrl_val &= ~mask;
}
+static inline void __intel_pmu_update_event_ext(int idx, u64 ext)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ u32 msr = idx < INTEL_PMC_IDX_FIXED ?
+ x86_pmu_cfg_c_addr(idx, true) :
+ x86_pmu_cfg_c_addr(idx - INTEL_PMC_IDX_FIXED, false);
+
+ cpuc->cfg_c_val[idx] = ext;
+ wrmsrl(msr, ext);
+}
+
+static void intel_pmu_disable_event_ext(struct perf_event *event)
+{
+ if (!x86_pmu.arch_pebs)
+ return;
+
+ /*
+ * Only clear CFG_C MSR for PEBS counter group events,
+ * it avoids the HW counter's value to be added into
+ * other PEBS records incorrectly after PEBS counter
+ * group events are disabled.
+ *
+ * For other events, it's unnecessary to clear CFG_C MSRs
+ * since CFG_C doesn't take effect if counter is in
+ * disabled state. That helps to reduce the WRMSR overhead
+ * in context switches.
+ */
+ if (!is_pebs_counter_event_group(event))
+ return;
+
+ __intel_pmu_update_event_ext(event->hw.idx, 0);
+}
+
static void intel_pmu_disable_event(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
@@ -2576,9 +2609,12 @@ static void intel_pmu_disable_event(struct perf_event *event)
switch (idx) {
case 0 ... INTEL_PMC_IDX_FIXED - 1:
intel_clear_masks(event, idx);
+ intel_pmu_disable_event_ext(event);
x86_pmu_disable_event(event);
break;
case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1:
+ intel_pmu_disable_event_ext(event);
+ fallthrough;
case INTEL_PMC_IDX_METRIC_BASE ... INTEL_PMC_IDX_METRIC_END:
intel_pmu_disable_fixed(event);
break;
@@ -2898,6 +2934,66 @@ static void intel_pmu_enable_fixed(struct perf_event *event)
cpuc->fixed_ctrl_val |= bits;
}
+static void intel_pmu_enable_event_ext(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct hw_perf_event *hwc = &event->hw;
+ union arch_pebs_index cached, index;
+ struct arch_pebs_cap cap;
+ u64 ext = 0;
+
+ if (!x86_pmu.arch_pebs)
+ return;
+
+ cap = hybrid(cpuc->pmu, arch_pebs_cap);
+
+ if (event->attr.precise_ip) {
+ u64 pebs_data_cfg = intel_get_arch_pebs_data_config(event);
+
+ ext |= ARCH_PEBS_EN;
+ ext |= (-hwc->sample_period) & ARCH_PEBS_RELOAD;
+
+ if (pebs_data_cfg && cap.caps) {
+ if (pebs_data_cfg & PEBS_DATACFG_MEMINFO)
+ ext |= ARCH_PEBS_AUX & cap.caps;
+
+ if (pebs_data_cfg & PEBS_DATACFG_GP)
+ ext |= ARCH_PEBS_GPR & cap.caps;
+
+ if (pebs_data_cfg & PEBS_DATACFG_XMMS)
+ ext |= ARCH_PEBS_VECR_XMM & cap.caps;
+
+ if (pebs_data_cfg & PEBS_DATACFG_LBRS)
+ ext |= ARCH_PEBS_LBR & cap.caps;
+ }
+
+ if (cpuc->n_pebs == cpuc->n_large_pebs)
+ index.split.thresh = ARCH_PEBS_THRESH_MUL;
+ else
+ index.split.thresh = ARCH_PEBS_THRESH_SINGLE;
+
+ rdmsrl(MSR_IA32_PEBS_INDEX, cached.full);
+ if (index.split.thresh != cached.split.thresh || !cached.split.en) {
+ if (cached.split.thresh == ARCH_PEBS_THRESH_MUL &&
+ cached.split.wr > 0) {
+ /*
+ * Large PEBS was enabled.
+ * Drain PEBS buffer before applying the single PEBS.
+ */
+ intel_pmu_drain_pebs_buffer();
+ } else {
+ index.split.wr = 0;
+ index.split.full = 0;
+ index.split.en = 1;
+ wrmsrl(MSR_IA32_PEBS_INDEX, index.full);
+ }
+ }
+ }
+
+ if (cpuc->cfg_c_val[hwc->idx] != ext)
+ __intel_pmu_update_event_ext(hwc->idx, ext);
+}
+
static void intel_pmu_enable_event(struct perf_event *event)
{
u64 enable_mask = ARCH_PERFMON_EVENTSEL_ENABLE;
@@ -2912,9 +3008,12 @@ static void intel_pmu_enable_event(struct perf_event *event)
if (branch_sample_counters(event))
enable_mask |= ARCH_PERFMON_EVENTSEL_BR_CNTR;
intel_set_masks(event, idx);
+ intel_pmu_enable_event_ext(event);
__x86_pmu_enable_event(hwc, enable_mask);
break;
case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1:
+ intel_pmu_enable_event_ext(event);
+ fallthrough;
case INTEL_PMC_IDX_METRIC_BASE ... INTEL_PMC_IDX_METRIC_END:
intel_pmu_enable_fixed(event);
break;
@@ -4983,6 +5082,29 @@ static inline bool intel_pmu_broken_perf_cap(void)
return false;
}
+static inline void __intel_update_pmu_caps(struct pmu *pmu)
+{
+ struct pmu *dest_pmu = pmu ? pmu : x86_get_pmu(smp_processor_id());
+
+ if (hybrid(pmu, arch_pebs_cap).caps & ARCH_PEBS_VECR_XMM)
+ dest_pmu->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
+}
+
+static inline void __intel_update_large_pebs_flags(struct pmu *pmu)
+{
+ u64 caps = hybrid(pmu, arch_pebs_cap).caps;
+
+ x86_pmu.large_pebs_flags |= PERF_SAMPLE_TIME;
+ if (caps & ARCH_PEBS_LBR)
+ x86_pmu.large_pebs_flags |= PERF_SAMPLE_BRANCH_STACK;
+
+ if (!(caps & ARCH_PEBS_AUX))
+ x86_pmu.large_pebs_flags &= ~PERF_SAMPLE_DATA_SRC;
+ if (!(caps & ARCH_PEBS_GPR))
+ x86_pmu.large_pebs_flags &=
+ ~(PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER);
+}
+
static void update_pmu_cap(struct pmu *pmu)
{
unsigned int eax, ebx, ecx, edx;
@@ -5013,6 +5135,9 @@ static void update_pmu_cap(struct pmu *pmu)
&eax, &ebx, &ecx, &edx);
hybrid(pmu, arch_pebs_cap).counters = ((u64)ecx << 32) | eax;
hybrid(pmu, arch_pebs_cap).pdists = ((u64)edx << 32) | ebx;
+
+ __intel_update_pmu_caps(pmu);
+ __intel_update_large_pebs_flags(pmu);
} else {
WARN_ON(x86_pmu.arch_pebs == 1);
x86_pmu.arch_pebs = 0;
@@ -5171,6 +5296,8 @@ static void intel_pmu_cpu_starting(int cpu)
}
}
+ __intel_update_pmu_caps(cpuc->pmu);
+
if (!cpuc->shared_regs)
return;
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 615aefb4e52e..712f7dd05c1d 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1492,6 +1492,18 @@ pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc,
}
}
+u64 intel_get_arch_pebs_data_config(struct perf_event *event)
+{
+ u64 pebs_data_cfg = 0;
+
+ if (WARN_ON(event->hw.idx < 0 || event->hw.idx >= X86_PMC_IDX_MAX))
+ return 0;
+
+ pebs_data_cfg |= pebs_update_adaptive_cfg(event);
+
+ return pebs_data_cfg;
+}
+
void intel_pmu_pebs_add(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -2934,6 +2946,11 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
index.split.wr = 0;
index.split.full = 0;
+ index.split.en = 1;
+ if (cpuc->n_pebs == cpuc->n_large_pebs)
+ index.split.thresh = ARCH_PEBS_THRESH_MUL;
+ else
+ index.split.thresh = ARCH_PEBS_THRESH_SINGLE;
wrmsrl(MSR_IA32_PEBS_INDEX, index.full);
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 1f20892f4040..69c4341f5753 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -286,6 +286,9 @@ struct cpu_hw_events {
u64 fixed_ctrl_val;
u64 active_fixed_ctrl_val;
+ /* Cached CFG_C values */
+ u64 cfg_c_val[X86_PMC_IDX_MAX];
+
/*
* Intel LBR bits
*/
@@ -1203,6 +1206,14 @@ static inline unsigned int x86_pmu_fixed_ctr_addr(int index)
x86_pmu.addr_offset(index, false) : index);
}
+static inline unsigned int x86_pmu_cfg_c_addr(int index, bool gp)
+{
+ u32 base = gp ? MSR_IA32_PMC_V6_GP0_CFG_C : MSR_IA32_PMC_V6_FX0_CFG_C;
+
+ return base + (x86_pmu.addr_offset ? x86_pmu.addr_offset(index, false) :
+ index * MSR_IA32_PMC_V6_STEP);
+}
+
static inline int x86_pmu_rdpmc_index(int index)
{
return x86_pmu.rdpmc_index ? x86_pmu.rdpmc_index(index) : index;
@@ -1757,6 +1768,8 @@ void intel_pmu_pebs_data_source_cmt(void);
void intel_pmu_pebs_data_source_lnl(void);
+u64 intel_get_arch_pebs_data_config(struct perf_event *event);
+
int intel_pmu_setup_lbr_filter(struct perf_event *event);
void intel_pt_interrupt(void);
diff --git a/arch/x86/include/asm/intel_ds.h b/arch/x86/include/asm/intel_ds.h
index 023c2883f9f3..7bb80c993bef 100644
--- a/arch/x86/include/asm/intel_ds.h
+++ b/arch/x86/include/asm/intel_ds.h
@@ -7,6 +7,13 @@
#define PEBS_BUFFER_SHIFT 4
#define PEBS_BUFFER_SIZE (PAGE_SIZE << PEBS_BUFFER_SHIFT)
+/*
+ * The largest PEBS record could consume a page, ensure
+ * a record at least can be written after triggering PMI.
+ */
+#define ARCH_PEBS_THRESH_MUL ((PEBS_BUFFER_SIZE - PAGE_SIZE) >> PEBS_BUFFER_SHIFT)
+#define ARCH_PEBS_THRESH_SINGLE 1
+
/* The maximal number of PEBS events: */
#define MAX_PEBS_EVENTS_FMT4 8
#define MAX_PEBS_EVENTS 32
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index dd09ae9752a7..1e67cb467946 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -318,6 +318,14 @@
#define ARCH_PEBS_OFFSET_MASK 0x7fffff
#define ARCH_PEBS_INDEX_WR_SHIFT 4
+#define ARCH_PEBS_RELOAD 0xffffffff
+#define ARCH_PEBS_LBR_SHIFT 40
+#define ARCH_PEBS_LBR (0x3ull << ARCH_PEBS_LBR_SHIFT)
+#define ARCH_PEBS_VECR_XMM BIT_ULL(49)
+#define ARCH_PEBS_GPR BIT_ULL(61)
+#define ARCH_PEBS_AUX BIT_ULL(62)
+#define ARCH_PEBS_EN BIT_ULL(63)
+
#define MSR_IA32_RTIT_CTL 0x00000570
#define RTIT_CTL_TRACEEN BIT(0)
#define RTIT_CTL_CYCLEACC BIT(1)
@@ -597,7 +605,9 @@
/* V6 PMON MSR range */
#define MSR_IA32_PMC_V6_GP0_CTR 0x1900
#define MSR_IA32_PMC_V6_GP0_CFG_A 0x1901
+#define MSR_IA32_PMC_V6_GP0_CFG_C 0x1903
#define MSR_IA32_PMC_V6_FX0_CTR 0x1980
+#define MSR_IA32_PMC_V6_FX0_CFG_C 0x1983
#define MSR_IA32_PMC_V6_STEP 4
/* KeyID partitioning between MKTME and TDX */
--
2.40.1
* [Patch v2 15/24] perf/x86/intel: Add SSP register support for arch-PEBS
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (13 preceding siblings ...)
2025-02-18 15:28 ` [Patch v2 14/24] perf/x86/intel: Setup PEBS data configuration and enable legacy groups Dapeng Mi
@ 2025-02-18 15:28 ` Dapeng Mi
2025-02-25 11:52 ` Peter Zijlstra
2025-02-25 11:54 ` Peter Zijlstra
2025-02-18 15:28 ` [Patch v2 16/24] perf/x86/intel: Add counter group " Dapeng Mi
` (8 subsequent siblings)
23 siblings, 2 replies; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Arch-PEBS supports capturing the SSP register in the GPR group. This
patch adds support for reading and outputting this register. SSP is the
shadow stack pointer.
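A hypothetical userspace sketch of requesting SSP in the interrupt
register set. The SK_PERF_REG_X86_SSP index is an assumption derived
from the enum change below (the slot right after PERF_REG_X86_R15);
precise_ip is set because, per this patch, only arch-PEBS can capture
SSP and sample_regs_user may not request it.
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#define SK_PERF_REG_X86_SSP 24	/* assumed index, see the enum change below */
static int sketch_open_ssp_sampling(void)
{
	struct perf_event_attr attr;
	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;
	attr.sample_period = 10000;
	attr.precise_ip = 1;	/* PEBS-based sampling */
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_REGS_INTR;
	attr.sample_regs_intr = 1ULL << SK_PERF_REG_X86_SSP;
	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}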
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/core.c | 10 ++++++++++
arch/x86/events/intel/ds.c | 3 +++
arch/x86/include/asm/perf_event.h | 1 +
arch/x86/include/uapi/asm/perf_regs.h | 4 +++-
arch/x86/kernel/perf_regs.c | 5 +++++
5 files changed, 22 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 4eaafabf033e..d5609c0756c2 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -651,6 +651,16 @@ int x86_pmu_hw_config(struct perf_event *event)
return -EINVAL;
}
+ /* sample_regs_user never support SSP register. */
+ if (unlikely(event->attr.sample_regs_user & BIT_ULL(PERF_REG_X86_SSP)))
+ return -EINVAL;
+
+ if (unlikely(event->attr.sample_regs_intr & BIT_ULL(PERF_REG_X86_SSP))) {
+ /* Only arch-PEBS supports to capture SSP register. */
+ if (!x86_pmu.arch_pebs || !event->attr.precise_ip)
+ return -EINVAL;
+ }
+
/* sample_regs_user never support XMM registers */
if (unlikely(event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK))
return -EINVAL;
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 712f7dd05c1d..cad653706431 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2216,6 +2216,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
perf_regs = container_of(regs, struct x86_perf_regs, regs);
perf_regs->xmm_regs = NULL;
+ perf_regs->ssp = 0;
format_group = basic->format_group;
@@ -2332,6 +2333,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
perf_regs = container_of(regs, struct x86_perf_regs, regs);
perf_regs->xmm_regs = NULL;
+ perf_regs->ssp = 0;
__setup_perf_sample_data(event, iregs, data);
@@ -2368,6 +2370,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
__setup_pebs_gpr_group(event, regs, (struct pebs_gprs *)gprs,
sample_type);
+ perf_regs->ssp = gprs->ssp;
}
if (header->aux) {
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 4103cc745e86..d5285bb4b333 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -692,6 +692,7 @@ extern void perf_events_lapic_init(void);
struct pt_regs;
struct x86_perf_regs {
struct pt_regs regs;
+ u64 ssp;
u64 *xmm_regs;
};
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 7c9d2bb3833b..9ee9e55aed09 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -27,9 +27,11 @@ enum perf_event_x86_regs {
PERF_REG_X86_R13,
PERF_REG_X86_R14,
PERF_REG_X86_R15,
+ /* Shadow stack pointer (SSP) present on Clearwater Forest and newer models. */
+ PERF_REG_X86_SSP,
/* These are the limits for the GPRs. */
PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
- PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
+ PERF_REG_X86_64_MAX = PERF_REG_X86_SSP + 1,
/* These all need two bits set because they are 128bit */
PERF_REG_X86_XMM0 = 32,
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 624703af80a1..4b15c7488ec1 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -54,6 +54,8 @@ static unsigned int pt_regs_offset[PERF_REG_X86_MAX] = {
PT_REGS_OFFSET(PERF_REG_X86_R13, r13),
PT_REGS_OFFSET(PERF_REG_X86_R14, r14),
PT_REGS_OFFSET(PERF_REG_X86_R15, r15),
+ /* The pt_regs struct does not store Shadow stack pointer. */
+ (unsigned int) -1,
#endif
};
@@ -68,6 +70,9 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
}
+ if (idx == PERF_REG_X86_SSP)
+ return perf_regs->ssp;
+
if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset)))
return 0;
--
2.40.1
* [Patch v2 16/24] perf/x86/intel: Add counter group support for arch-PEBS
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (14 preceding siblings ...)
2025-02-18 15:28 ` [Patch v2 15/24] perf/x86/intel: Add SSP register support for arch-PEBS Dapeng Mi
@ 2025-02-18 15:28 ` Dapeng Mi
2025-02-18 15:28 ` [Patch v2 17/24] perf/core: Support to capture higher width vector registers Dapeng Mi
` (7 subsequent siblings)
23 siblings, 0 replies; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Based on the previous adaptive PEBS counter snapshot support, add
counter group support for architectural PEBS. Since arch-PEBS shares the
same counter group layout with adaptive PEBS, directly reuse the
__setup_pebs_counter_group() helper to process the arch-PEBS counter
group.
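For illustration, a small sketch of how the size of the counter payload
that follows an arch_pebs_cntr_header can be computed, mirroring the
parsing added below (the struct and helper names with a sketch_/SK_
prefix are made up; SK_INTEL_CNTR_METRICS follows the existing
INTEL_CNTR_METRICS define):
#include <stdint.h>
struct sketch_cntr_header {
	uint32_t cntr;		/* bitmap of included GP counters */
	uint32_t fixed;		/* bitmap of included fixed counters */
	uint32_t metrics;
	uint32_t reserved;
};
#define SK_INTEL_CNTR_METRICS	0x3
/* Number of u64 counter values following the header. */
static unsigned int sketch_cntr_payload_qwords(const struct sketch_cntr_header *h)
{
	unsigned int nr = __builtin_popcount(h->cntr) +
			  __builtin_popcount(h->fixed);
	if (h->metrics == SK_INTEL_CNTR_METRICS)
		nr += 2;	/* the two PERF_METRICS slots */
	return nr;
}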
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/core.c | 38 ++++++++++++++++++++++++++++---
arch/x86/events/intel/ds.c | 31 +++++++++++++++++++++----
arch/x86/events/perf_event.h | 2 ++
arch/x86/include/asm/msr-index.h | 6 +++++
arch/x86/include/asm/perf_event.h | 13 ++++++++---
5 files changed, 80 insertions(+), 10 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index b80a66751136..f21d9f283445 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2965,6 +2965,17 @@ static void intel_pmu_enable_event_ext(struct perf_event *event)
if (pebs_data_cfg & PEBS_DATACFG_LBRS)
ext |= ARCH_PEBS_LBR & cap.caps;
+
+ if (pebs_data_cfg &
+ (PEBS_DATACFG_CNTR_MASK << PEBS_DATACFG_CNTR_SHIFT))
+ ext |= ARCH_PEBS_CNTR_GP & cap.caps;
+
+ if (pebs_data_cfg &
+ (PEBS_DATACFG_FIX_MASK << PEBS_DATACFG_FIX_SHIFT))
+ ext |= ARCH_PEBS_CNTR_FIXED & cap.caps;
+
+ if (pebs_data_cfg & PEBS_DATACFG_METRICS)
+ ext |= ARCH_PEBS_CNTR_METRICS & cap.caps;
}
if (cpuc->n_pebs == cpuc->n_large_pebs)
@@ -2990,6 +3001,9 @@ static void intel_pmu_enable_event_ext(struct perf_event *event)
}
}
+ if (is_pebs_counter_event_group(event))
+ ext |= ARCH_PEBS_CNTR_ALLOW;
+
if (cpuc->cfg_c_val[hwc->idx] != ext)
__intel_pmu_update_event_ext(hwc->idx, ext);
}
@@ -4120,6 +4134,20 @@ static inline bool intel_pmu_has_cap(struct perf_event *event, int idx)
return test_bit(idx, (unsigned long *)&intel_cap->capabilities);
}
+static inline bool intel_pmu_has_pebs_counter_group(struct pmu *pmu)
+{
+ u64 caps;
+
+ if (x86_pmu.intel_cap.pebs_format >= 6 && x86_pmu.intel_cap.pebs_baseline)
+ return true;
+
+ caps = hybrid(pmu, arch_pebs_cap).caps;
+ if (x86_pmu.arch_pebs && (caps & ARCH_PEBS_CNTR_MASK))
+ return true;
+
+ return false;
+}
+
static int intel_pmu_hw_config(struct perf_event *event)
{
int ret = x86_pmu_hw_config(event);
@@ -4242,8 +4270,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
}
if ((event->attr.sample_type & PERF_SAMPLE_READ) &&
- (x86_pmu.intel_cap.pebs_format >= 6) &&
- x86_pmu.intel_cap.pebs_baseline &&
+ intel_pmu_has_pebs_counter_group(event->pmu) &&
is_sampling_event(event) &&
event->attr.precise_ip)
event->group_leader->hw.flags |= PERF_X86_EVENT_PEBS_CNTR;
@@ -5097,6 +5124,8 @@ static inline void __intel_update_large_pebs_flags(struct pmu *pmu)
x86_pmu.large_pebs_flags |= PERF_SAMPLE_TIME;
if (caps & ARCH_PEBS_LBR)
x86_pmu.large_pebs_flags |= PERF_SAMPLE_BRANCH_STACK;
+ if (caps & ARCH_PEBS_CNTR_MASK)
+ x86_pmu.large_pebs_flags |= PERF_SAMPLE_READ;
if (!(caps & ARCH_PEBS_AUX))
x86_pmu.large_pebs_flags &= ~PERF_SAMPLE_DATA_SRC;
@@ -6759,8 +6788,11 @@ __init int intel_pmu_init(void)
* Many features on and after V6 require dynamic constraint,
* e.g., Arch PEBS, ACR.
*/
- if (version >= 6)
+ if (version >= 6) {
x86_pmu.flags |= PMU_FL_DYN_CONSTRAINT;
+ x86_pmu.late_setup = intel_pmu_late_setup;
+ }
+
/*
* Install the hw-cache-events table:
*/
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index cad653706431..4b01beee15f4 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1383,7 +1383,7 @@ static void __intel_pmu_pebs_update_cfg(struct perf_event *event,
}
-static void intel_pmu_late_setup(void)
+void intel_pmu_late_setup(void)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct perf_event *event;
@@ -1494,13 +1494,20 @@ pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc,
u64 intel_get_arch_pebs_data_config(struct perf_event *event)
{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
u64 pebs_data_cfg = 0;
+ u64 cntr_mask;
if (WARN_ON(event->hw.idx < 0 || event->hw.idx >= X86_PMC_IDX_MAX))
return 0;
pebs_data_cfg |= pebs_update_adaptive_cfg(event);
+ cntr_mask = (PEBS_DATACFG_CNTR_MASK << PEBS_DATACFG_CNTR_SHIFT) |
+ (PEBS_DATACFG_FIX_MASK << PEBS_DATACFG_FIX_SHIFT) |
+ PEBS_DATACFG_CNTR | PEBS_DATACFG_METRICS;
+ pebs_data_cfg |= cpuc->pebs_data_cfg & cntr_mask;
+
return pebs_data_cfg;
}
@@ -2411,6 +2418,24 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
}
}
+ if (header->cntr) {
+ struct arch_pebs_cntr_header *cntr = next_record;
+ unsigned int nr;
+
+ next_record += sizeof(struct arch_pebs_cntr_header);
+
+ if (is_pebs_counter_event_group(event)) {
+ __setup_pebs_counter_group(cpuc, event,
+ (struct pebs_cntr_header *)cntr, next_record);
+ data->sample_flags |= PERF_SAMPLE_READ;
+ }
+
+ nr = hweight32(cntr->cntr) + hweight32(cntr->fixed);
+ if (cntr->metrics == INTEL_CNTR_METRICS)
+ nr += 2;
+ next_record += nr * sizeof(u64);
+ }
+
/* Parse followed fragments if there are. */
if (arch_pebs_record_continued(header)) {
at = at + header->size;
@@ -3040,10 +3065,8 @@ static void __init intel_ds_pebs_init(void)
break;
case 6:
- if (x86_pmu.intel_cap.pebs_baseline) {
+ if (x86_pmu.intel_cap.pebs_baseline)
x86_pmu.large_pebs_flags |= PERF_SAMPLE_READ;
- x86_pmu.late_setup = intel_pmu_late_setup;
- }
fallthrough;
case 5:
x86_pmu.pebs_ept = 1;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 69c4341f5753..cba7b928fdb2 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1697,6 +1697,8 @@ void intel_pmu_store_pebs_lbrs(struct lbr_entry *lbr);
void intel_pebs_init(void);
+void intel_pmu_late_setup(void);
+
void intel_pmu_lbr_save_brstack(struct perf_sample_data *data,
struct cpu_hw_events *cpuc,
struct perf_event *event);
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 1e67cb467946..0ca84deb2396 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -319,12 +319,18 @@
#define ARCH_PEBS_INDEX_WR_SHIFT 4
#define ARCH_PEBS_RELOAD 0xffffffff
+#define ARCH_PEBS_CNTR_ALLOW BIT_ULL(35)
+#define ARCH_PEBS_CNTR_GP BIT_ULL(36)
+#define ARCH_PEBS_CNTR_FIXED BIT_ULL(37)
+#define ARCH_PEBS_CNTR_METRICS BIT_ULL(38)
#define ARCH_PEBS_LBR_SHIFT 40
#define ARCH_PEBS_LBR (0x3ull << ARCH_PEBS_LBR_SHIFT)
#define ARCH_PEBS_VECR_XMM BIT_ULL(49)
#define ARCH_PEBS_GPR BIT_ULL(61)
#define ARCH_PEBS_AUX BIT_ULL(62)
#define ARCH_PEBS_EN BIT_ULL(63)
+#define ARCH_PEBS_CNTR_MASK (ARCH_PEBS_CNTR_GP | ARCH_PEBS_CNTR_FIXED | \
+ ARCH_PEBS_CNTR_METRICS)
#define MSR_IA32_RTIT_CTL 0x00000570
#define RTIT_CTL_TRACEEN BIT(0)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index d5285bb4b333..461f0e357c9e 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -137,16 +137,16 @@
#define ARCH_PERFMON_EVENTS_COUNT 7
#define PEBS_DATACFG_MEMINFO BIT_ULL(0)
-#define PEBS_DATACFG_GP BIT_ULL(1)
+#define PEBS_DATACFG_GP BIT_ULL(1)
#define PEBS_DATACFG_XMMS BIT_ULL(2)
#define PEBS_DATACFG_LBRS BIT_ULL(3)
-#define PEBS_DATACFG_LBR_SHIFT 24
#define PEBS_DATACFG_CNTR BIT_ULL(4)
+#define PEBS_DATACFG_METRICS BIT_ULL(5)
+#define PEBS_DATACFG_LBR_SHIFT 24
#define PEBS_DATACFG_CNTR_SHIFT 32
#define PEBS_DATACFG_CNTR_MASK GENMASK_ULL(15, 0)
#define PEBS_DATACFG_FIX_SHIFT 48
#define PEBS_DATACFG_FIX_MASK GENMASK_ULL(7, 0)
-#define PEBS_DATACFG_METRICS BIT_ULL(5)
/* Steal the highest bit of pebs_data_cfg for SW usage */
#define PEBS_UPDATE_DS_SW BIT_ULL(63)
@@ -602,6 +602,13 @@ struct arch_pebs_lbr_header {
u64 ler_info;
};
+struct arch_pebs_cntr_header {
+ u32 cntr;
+ u32 fixed;
+ u32 metrics;
+ u32 reserved;
+};
+
/*
* AMD Extended Performance Monitoring and Debug cpuid feature detection
*/
--
2.40.1
* [Patch v2 17/24] perf/core: Support to capture higher width vector registers
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (15 preceding siblings ...)
2025-02-18 15:28 ` [Patch v2 16/24] perf/x86/intel: Add counter group " Dapeng Mi
@ 2025-02-18 15:28 ` Dapeng Mi
2025-02-25 20:32 ` Peter Zijlstra
2025-02-18 15:28 ` [Patch v2 18/24] perf/x86/intel: Support arch-PEBS vector registers group capturing Dapeng Mi
` (6 subsequent siblings)
23 siblings, 1 reply; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Arch-PEBS supports capturing more vector registers, such as OPMASK/YMM/ZMM
registers, in addition to XMM registers. This patch extends the
PERF_SAMPLE_REGS_INTR attribute to support capturing these wider vector
registers.
The sample_regs_intr_ext[] array is added to the perf_event_attr structure
to record the user-configured extended register bitmap, and a helper
perf_reg_ext_validate() is added to validate whether these registers are
supported on a specific PMU.
This patch only adds the common perf/core support; the x86/intel specific
support is added in the next patch.
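For illustration, below is a minimal user-space sketch of filling in the new
selector, assuming the updated uapi headers from this series; the helper
request_ymmh0() and the hard-coded offset 64 (PERF_REG_EXTENDED_OFFSET) exist
only for this example:
#include <string.h>
#include <linux/perf_event.h>	/* patched uapi: sample_regs_intr_ext[] */
#include <asm/perf_regs.h>	/* patched uapi: PERF_REG_X86_YMMH0 ... */
/* Hypothetical helper, not part of this patch. */
static void request_ymmh0(struct perf_event_attr *attr)
{
	/* Extended selector bits start at register index 64. */
	unsigned int bit = PERF_REG_X86_YMMH0 - 64;
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);		/* >= PERF_ATTR_SIZE_VER9 */
	attr->type = PERF_TYPE_HARDWARE;
	attr->config = PERF_COUNT_HW_CPU_CYCLES;
	attr->sample_period = 100000;
	attr->precise_ip = 1;			/* PEBS based sampling */
	attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_REGS_INTR;
	/* YMMH0 is 128 bits wide, so it takes two adjacent selector bits. */
	attr->sample_regs_intr_ext[bit / 64] |= 0x3ULL << (bit % 64);
}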
Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/arm/kernel/perf_regs.c | 6 ++
arch/arm64/kernel/perf_regs.c | 6 ++
arch/csky/kernel/perf_regs.c | 5 ++
arch/loongarch/kernel/perf_regs.c | 5 ++
arch/mips/kernel/perf_regs.c | 5 ++
arch/powerpc/perf/perf_regs.c | 5 ++
arch/riscv/kernel/perf_regs.c | 5 ++
arch/s390/kernel/perf_regs.c | 5 ++
arch/x86/include/asm/perf_event.h | 4 ++
arch/x86/include/uapi/asm/perf_regs.h | 83 ++++++++++++++++++++++++++-
arch/x86/kernel/perf_regs.c | 50 +++++++++++++++-
include/linux/perf_event.h | 2 +
include/linux/perf_regs.h | 10 ++++
include/uapi/linux/perf_event.h | 11 ++++
kernel/events/core.c | 53 ++++++++++++++++-
15 files changed, 250 insertions(+), 5 deletions(-)
diff --git a/arch/arm/kernel/perf_regs.c b/arch/arm/kernel/perf_regs.c
index 0529f90395c9..86b2002d0846 100644
--- a/arch/arm/kernel/perf_regs.c
+++ b/arch/arm/kernel/perf_regs.c
@@ -37,3 +37,9 @@ void perf_get_regs_user(struct perf_regs *regs_user,
regs_user->regs = task_pt_regs(current);
regs_user->abi = perf_reg_abi(current);
}
+
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size)
+{
+ return -EINVAL;
+}
+
diff --git a/arch/arm64/kernel/perf_regs.c b/arch/arm64/kernel/perf_regs.c
index b4eece3eb17d..1c91fd3530d5 100644
--- a/arch/arm64/kernel/perf_regs.c
+++ b/arch/arm64/kernel/perf_regs.c
@@ -104,3 +104,9 @@ void perf_get_regs_user(struct perf_regs *regs_user,
regs_user->regs = task_pt_regs(current);
regs_user->abi = perf_reg_abi(current);
}
+
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size)
+{
+ return -EINVAL;
+}
+
diff --git a/arch/csky/kernel/perf_regs.c b/arch/csky/kernel/perf_regs.c
index 09b7f88a2d6a..d2e2af0bf1ad 100644
--- a/arch/csky/kernel/perf_regs.c
+++ b/arch/csky/kernel/perf_regs.c
@@ -26,6 +26,11 @@ int perf_reg_validate(u64 mask)
return 0;
}
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size)
+{
+ return -EINVAL;
+}
+
u64 perf_reg_abi(struct task_struct *task)
{
return PERF_SAMPLE_REGS_ABI_32;
diff --git a/arch/loongarch/kernel/perf_regs.c b/arch/loongarch/kernel/perf_regs.c
index 263ac4ab5af6..e1df67e3fab4 100644
--- a/arch/loongarch/kernel/perf_regs.c
+++ b/arch/loongarch/kernel/perf_regs.c
@@ -34,6 +34,11 @@ int perf_reg_validate(u64 mask)
return 0;
}
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size)
+{
+ return -EINVAL;
+}
+
u64 perf_reg_value(struct pt_regs *regs, int idx)
{
if (WARN_ON_ONCE((u32)idx >= PERF_REG_LOONGARCH_MAX))
diff --git a/arch/mips/kernel/perf_regs.c b/arch/mips/kernel/perf_regs.c
index e686780d1647..bbb5f25b9191 100644
--- a/arch/mips/kernel/perf_regs.c
+++ b/arch/mips/kernel/perf_regs.c
@@ -37,6 +37,11 @@ int perf_reg_validate(u64 mask)
return 0;
}
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size)
+{
+ return -EINVAL;
+}
+
u64 perf_reg_value(struct pt_regs *regs, int idx)
{
long v;
diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c
index 350dccb0143c..d919c628aee3 100644
--- a/arch/powerpc/perf/perf_regs.c
+++ b/arch/powerpc/perf/perf_regs.c
@@ -132,6 +132,11 @@ int perf_reg_validate(u64 mask)
return 0;
}
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size)
+{
+ return -EINVAL;
+}
+
u64 perf_reg_abi(struct task_struct *task)
{
if (is_tsk_32bit_task(task))
diff --git a/arch/riscv/kernel/perf_regs.c b/arch/riscv/kernel/perf_regs.c
index fd304a248de6..5beb60544c9a 100644
--- a/arch/riscv/kernel/perf_regs.c
+++ b/arch/riscv/kernel/perf_regs.c
@@ -26,6 +26,11 @@ int perf_reg_validate(u64 mask)
return 0;
}
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size)
+{
+ return -EINVAL;
+}
+
u64 perf_reg_abi(struct task_struct *task)
{
#if __riscv_xlen == 64
diff --git a/arch/s390/kernel/perf_regs.c b/arch/s390/kernel/perf_regs.c
index a6b058ee4a36..9247573229b0 100644
--- a/arch/s390/kernel/perf_regs.c
+++ b/arch/s390/kernel/perf_regs.c
@@ -42,6 +42,11 @@ int perf_reg_validate(u64 mask)
return 0;
}
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size)
+{
+ return -EINVAL;
+}
+
u64 perf_reg_abi(struct task_struct *task)
{
if (test_tsk_thread_flag(task, TIF_31BIT))
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 461f0e357c9e..3bf8dcaa72ca 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -701,6 +701,10 @@ struct x86_perf_regs {
struct pt_regs regs;
u64 ssp;
u64 *xmm_regs;
+ u64 *opmask_regs;
+ u64 *ymmh_regs;
+ u64 **zmmh_regs;
+ u64 **h16zmm_regs;
};
extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 9ee9e55aed09..3851f627ca60 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -33,7 +33,7 @@ enum perf_event_x86_regs {
PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
PERF_REG_X86_64_MAX = PERF_REG_X86_SSP + 1,
- /* These all need two bits set because they are 128bit */
+ /* These all need two bits set because they are 128 bits */
PERF_REG_X86_XMM0 = 32,
PERF_REG_X86_XMM1 = 34,
PERF_REG_X86_XMM2 = 36,
@@ -53,6 +53,87 @@ enum perf_event_x86_regs {
/* These include both GPRs and XMMX registers */
PERF_REG_X86_XMM_MAX = PERF_REG_X86_XMM15 + 2,
+
+ /*
+ * YMM upper bits need two bits set because they are 128 bits.
+ * PERF_REG_X86_YMMH0 = 64
+ */
+ PERF_REG_X86_YMMH0 = PERF_REG_X86_XMM_MAX,
+ PERF_REG_X86_YMMH1 = PERF_REG_X86_YMMH0 + 2,
+ PERF_REG_X86_YMMH2 = PERF_REG_X86_YMMH1 + 2,
+ PERF_REG_X86_YMMH3 = PERF_REG_X86_YMMH2 + 2,
+ PERF_REG_X86_YMMH4 = PERF_REG_X86_YMMH3 + 2,
+ PERF_REG_X86_YMMH5 = PERF_REG_X86_YMMH4 + 2,
+ PERF_REG_X86_YMMH6 = PERF_REG_X86_YMMH5 + 2,
+ PERF_REG_X86_YMMH7 = PERF_REG_X86_YMMH6 + 2,
+ PERF_REG_X86_YMMH8 = PERF_REG_X86_YMMH7 + 2,
+ PERF_REG_X86_YMMH9 = PERF_REG_X86_YMMH8 + 2,
+ PERF_REG_X86_YMMH10 = PERF_REG_X86_YMMH9 + 2,
+ PERF_REG_X86_YMMH11 = PERF_REG_X86_YMMH10 + 2,
+ PERF_REG_X86_YMMH12 = PERF_REG_X86_YMMH11 + 2,
+ PERF_REG_X86_YMMH13 = PERF_REG_X86_YMMH12 + 2,
+ PERF_REG_X86_YMMH14 = PERF_REG_X86_YMMH13 + 2,
+ PERF_REG_X86_YMMH15 = PERF_REG_X86_YMMH14 + 2,
+ PERF_REG_X86_YMMH_MAX = PERF_REG_X86_YMMH15 + 2,
+
+ /*
+ * ZMM0-15 upper bits need four bits set because they are 256 bits
+ * PERF_REG_X86_ZMMH0 = 96
+ */
+ PERF_REG_X86_ZMMH0 = PERF_REG_X86_YMMH_MAX,
+ PERF_REG_X86_ZMMH1 = PERF_REG_X86_ZMMH0 + 4,
+ PERF_REG_X86_ZMMH2 = PERF_REG_X86_ZMMH1 + 4,
+ PERF_REG_X86_ZMMH3 = PERF_REG_X86_ZMMH2 + 4,
+ PERF_REG_X86_ZMMH4 = PERF_REG_X86_ZMMH3 + 4,
+ PERF_REG_X86_ZMMH5 = PERF_REG_X86_ZMMH4 + 4,
+ PERF_REG_X86_ZMMH6 = PERF_REG_X86_ZMMH5 + 4,
+ PERF_REG_X86_ZMMH7 = PERF_REG_X86_ZMMH6 + 4,
+ PERF_REG_X86_ZMMH8 = PERF_REG_X86_ZMMH7 + 4,
+ PERF_REG_X86_ZMMH9 = PERF_REG_X86_ZMMH8 + 4,
+ PERF_REG_X86_ZMMH10 = PERF_REG_X86_ZMMH9 + 4,
+ PERF_REG_X86_ZMMH11 = PERF_REG_X86_ZMMH10 + 4,
+ PERF_REG_X86_ZMMH12 = PERF_REG_X86_ZMMH11 + 4,
+ PERF_REG_X86_ZMMH13 = PERF_REG_X86_ZMMH12 + 4,
+ PERF_REG_X86_ZMMH14 = PERF_REG_X86_ZMMH13 + 4,
+ PERF_REG_X86_ZMMH15 = PERF_REG_X86_ZMMH14 + 4,
+ PERF_REG_X86_ZMMH_MAX = PERF_REG_X86_ZMMH15 + 4,
+
+ /*
+ * ZMM16-31 need eight bits set because they are 512 bits
+ * PERF_REG_X86_ZMM16 = 160
+ */
+ PERF_REG_X86_ZMM16 = PERF_REG_X86_ZMMH_MAX,
+ PERF_REG_X86_ZMM17 = PERF_REG_X86_ZMM16 + 8,
+ PERF_REG_X86_ZMM18 = PERF_REG_X86_ZMM17 + 8,
+ PERF_REG_X86_ZMM19 = PERF_REG_X86_ZMM18 + 8,
+ PERF_REG_X86_ZMM20 = PERF_REG_X86_ZMM19 + 8,
+ PERF_REG_X86_ZMM21 = PERF_REG_X86_ZMM20 + 8,
+ PERF_REG_X86_ZMM22 = PERF_REG_X86_ZMM21 + 8,
+ PERF_REG_X86_ZMM23 = PERF_REG_X86_ZMM22 + 8,
+ PERF_REG_X86_ZMM24 = PERF_REG_X86_ZMM23 + 8,
+ PERF_REG_X86_ZMM25 = PERF_REG_X86_ZMM24 + 8,
+ PERF_REG_X86_ZMM26 = PERF_REG_X86_ZMM25 + 8,
+ PERF_REG_X86_ZMM27 = PERF_REG_X86_ZMM26 + 8,
+ PERF_REG_X86_ZMM28 = PERF_REG_X86_ZMM27 + 8,
+ PERF_REG_X86_ZMM29 = PERF_REG_X86_ZMM28 + 8,
+ PERF_REG_X86_ZMM30 = PERF_REG_X86_ZMM29 + 8,
+ PERF_REG_X86_ZMM31 = PERF_REG_X86_ZMM30 + 8,
+ PERF_REG_X86_ZMM_MAX = PERF_REG_X86_ZMM31 + 8,
+
+ /*
+ * OPMASK Registers
+ * PERF_REG_X86_OPMASK0 = 288
+ */
+ PERF_REG_X86_OPMASK0 = PERF_REG_X86_ZMM_MAX,
+ PERF_REG_X86_OPMASK1 = PERF_REG_X86_OPMASK0 + 1,
+ PERF_REG_X86_OPMASK2 = PERF_REG_X86_OPMASK1 + 1,
+ PERF_REG_X86_OPMASK3 = PERF_REG_X86_OPMASK2 + 1,
+ PERF_REG_X86_OPMASK4 = PERF_REG_X86_OPMASK3 + 1,
+ PERF_REG_X86_OPMASK5 = PERF_REG_X86_OPMASK4 + 1,
+ PERF_REG_X86_OPMASK6 = PERF_REG_X86_OPMASK5 + 1,
+ PERF_REG_X86_OPMASK7 = PERF_REG_X86_OPMASK6 + 1,
+
+ PERF_REG_X86_VEC_MAX = PERF_REG_X86_OPMASK7 + 1,
};
#define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 4b15c7488ec1..1447cd341868 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -59,12 +59,41 @@ static unsigned int pt_regs_offset[PERF_REG_X86_MAX] = {
#endif
};
-u64 perf_reg_value(struct pt_regs *regs, int idx)
+static u64 perf_reg_ext_value(struct pt_regs *regs, int idx)
{
struct x86_perf_regs *perf_regs;
+ perf_regs = container_of(regs, struct x86_perf_regs, regs);
+
+ switch (idx) {
+ case PERF_REG_X86_YMMH0 ... PERF_REG_X86_YMMH_MAX - 1:
+ idx -= PERF_REG_X86_YMMH0;
+ return !perf_regs->ymmh_regs ? 0 : perf_regs->ymmh_regs[idx];
+ case PERF_REG_X86_ZMMH0 ... PERF_REG_X86_ZMMH_MAX - 1:
+ idx -= PERF_REG_X86_ZMMH0;
+ return !perf_regs->zmmh_regs ? 0 : perf_regs->zmmh_regs[idx / 4][idx % 4];
+ case PERF_REG_X86_ZMM16 ... PERF_REG_X86_ZMM_MAX - 1:
+ idx -= PERF_REG_X86_ZMM16;
+ return !perf_regs->h16zmm_regs ? 0 : perf_regs->h16zmm_regs[idx / 8][idx % 8];
+ case PERF_REG_X86_OPMASK0 ... PERF_REG_X86_OPMASK7:
+ idx -= PERF_REG_X86_OPMASK0;
+ return !perf_regs->opmask_regs ? 0 : perf_regs->opmask_regs[idx];
+ default:
+ WARN_ON_ONCE(1);
+ break;
+ }
+
+ return 0;
+}
+
+u64 perf_reg_value(struct pt_regs *regs, int idx)
+{
+ struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
+
+ if (idx >= PERF_REG_EXTENDED_OFFSET)
+ return perf_reg_ext_value(regs, idx);
+
if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
- perf_regs = container_of(regs, struct x86_perf_regs, regs);
if (!perf_regs->xmm_regs)
return 0;
return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
@@ -100,6 +129,11 @@ int perf_reg_validate(u64 mask)
return 0;
}
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size)
+{
+ return -EINVAL;
+}
+
u64 perf_reg_abi(struct task_struct *task)
{
return PERF_SAMPLE_REGS_ABI_32;
@@ -125,6 +159,18 @@ int perf_reg_validate(u64 mask)
return 0;
}
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size)
+{
+ if (!mask || !size || size > PERF_NUM_EXT_REGS)
+ return -EINVAL;
+
+ if (find_last_bit(mask, size) >
+ (PERF_REG_X86_VEC_MAX - PERF_REG_EXTENDED_OFFSET))
+ return -EINVAL;
+
+ return 0;
+}
+
u64 perf_reg_abi(struct task_struct *task)
{
if (!user_64bit_mode(task_pt_regs(task)))
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c381ea7135df..5c50119387d8 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -302,6 +302,7 @@ struct perf_event_pmu_context;
#define PERF_PMU_CAP_AUX_OUTPUT 0x0080
#define PERF_PMU_CAP_EXTENDED_HW_TYPE 0x0100
#define PERF_PMU_CAP_AUX_PAUSE 0x0200
+#define PERF_PMU_CAP_MORE_EXT_REGS 0x0400
/**
* pmu::scope
@@ -1390,6 +1391,7 @@ static inline void perf_clear_branch_entry_bitfields(struct perf_branch_entry *b
br->reserved = 0;
}
+extern bool has_more_extended_regs(struct perf_event *event);
extern void perf_output_sample(struct perf_output_handle *handle,
struct perf_event_header *header,
struct perf_sample_data *data,
diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
index f632c5725f16..aa4dfb5af552 100644
--- a/include/linux/perf_regs.h
+++ b/include/linux/perf_regs.h
@@ -9,6 +9,8 @@ struct perf_regs {
struct pt_regs *regs;
};
+#define PERF_REG_EXTENDED_OFFSET 64
+
#ifdef CONFIG_HAVE_PERF_REGS
#include <asm/perf_regs.h>
@@ -21,6 +23,8 @@ int perf_reg_validate(u64 mask);
u64 perf_reg_abi(struct task_struct *task);
void perf_get_regs_user(struct perf_regs *regs_user,
struct pt_regs *regs);
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size);
+
#else
#define PERF_REG_EXTENDED_MASK 0
@@ -35,6 +39,12 @@ static inline int perf_reg_validate(u64 mask)
return mask ? -ENOSYS : 0;
}
+static inline int perf_reg_ext_validate(unsigned long *mask,
+ unsigned int size)
+{
+ return -EINVAL;
+}
+
static inline u64 perf_reg_abi(struct task_struct *task)
{
return PERF_SAMPLE_REGS_ABI_NONE;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 0524d541d4e3..8a17d696d78c 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -379,6 +379,10 @@ enum perf_event_read_format {
#define PERF_ATTR_SIZE_VER6 120 /* add: aux_sample_size */
#define PERF_ATTR_SIZE_VER7 128 /* add: sig_data */
#define PERF_ATTR_SIZE_VER8 136 /* add: config3 */
+#define PERF_ATTR_SIZE_VER9 168 /* add: sample_regs_intr_ext[PERF_EXT_REGS_ARRAY_SIZE] */
+
+#define PERF_EXT_REGS_ARRAY_SIZE 4
+#define PERF_NUM_EXT_REGS (PERF_EXT_REGS_ARRAY_SIZE * 64)
/*
* Hardware event_id to monitor via a performance monitoring event:
@@ -531,6 +535,13 @@ struct perf_event_attr {
__u64 sig_data;
__u64 config3; /* extension of config2 */
+
+ /*
+ * Extension sets of regs to dump for each sample.
+ * See asm/perf_regs.h for details.
+ */
+ __u64 sample_regs_intr_ext[PERF_EXT_REGS_ARRAY_SIZE];
+ __u64 sample_regs_user_ext[PERF_EXT_REGS_ARRAY_SIZE];
};
/*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 0f8c55990783..0da480b5e025 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7081,6 +7081,21 @@ perf_output_sample_regs(struct perf_output_handle *handle,
}
}
+static void
+perf_output_sample_regs_ext(struct perf_output_handle *handle,
+ struct pt_regs *regs,
+ unsigned long *mask,
+ unsigned int size)
+{
+ int bit;
+ u64 val;
+
+ for_each_set_bit(bit, mask, size) {
+ val = perf_reg_value(regs, bit + PERF_REG_EXTENDED_OFFSET);
+ perf_output_put(handle, val);
+ }
+}
+
static void perf_sample_regs_user(struct perf_regs *regs_user,
struct pt_regs *regs)
{
@@ -7509,6 +7524,13 @@ static void perf_output_read(struct perf_output_handle *handle,
perf_output_read_one(handle, event, enabled, running);
}
+inline bool has_more_extended_regs(struct perf_event *event)
+{
+ return !!bitmap_weight(
+ (unsigned long *)event->attr.sample_regs_intr_ext,
+ PERF_NUM_EXT_REGS);
+}
+
void perf_output_sample(struct perf_output_handle *handle,
struct perf_event_header *header,
struct perf_sample_data *data,
@@ -7666,6 +7688,12 @@ void perf_output_sample(struct perf_output_handle *handle,
perf_output_sample_regs(handle,
data->regs_intr.regs,
mask);
+ if (has_more_extended_regs(event)) {
+ perf_output_sample_regs_ext(
+ handle, data->regs_intr.regs,
+ (unsigned long *)event->attr.sample_regs_intr_ext,
+ PERF_NUM_EXT_REGS);
+ }
}
}
@@ -7980,6 +8008,12 @@ void perf_prepare_sample(struct perf_sample_data *data,
u64 mask = event->attr.sample_regs_intr;
size += hweight64(mask) * sizeof(u64);
+
+ if (has_more_extended_regs(event)) {
+ size += bitmap_weight(
+ (unsigned long *)event->attr.sample_regs_intr_ext,
+ PERF_NUM_EXT_REGS) * sizeof(u64);
+ }
}
data->dyn_size += size;
@@ -11991,6 +12025,10 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
has_extended_regs(event))
ret = -EOPNOTSUPP;
+ if (!(pmu->capabilities & PERF_PMU_CAP_MORE_EXT_REGS) &&
+ has_more_extended_regs(event))
+ ret = -EOPNOTSUPP;
+
if (pmu->capabilities & PERF_PMU_CAP_NO_EXCLUDE &&
event_has_any_exclude_flag(event))
ret = -EINVAL;
@@ -12561,8 +12599,19 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
if (!attr->sample_max_stack)
attr->sample_max_stack = sysctl_perf_event_max_stack;
- if (attr->sample_type & PERF_SAMPLE_REGS_INTR)
- ret = perf_reg_validate(attr->sample_regs_intr);
+ if (attr->sample_type & PERF_SAMPLE_REGS_INTR) {
+ if (attr->sample_regs_intr != 0)
+ ret = perf_reg_validate(attr->sample_regs_intr);
+ if (ret)
+ return ret;
+ if (!!bitmap_weight((unsigned long *)attr->sample_regs_intr_ext,
+ PERF_NUM_EXT_REGS))
+ ret = perf_reg_ext_validate(
+ (unsigned long *)attr->sample_regs_intr_ext,
+ PERF_NUM_EXT_REGS);
+ if (ret)
+ return ret;
+ }
#ifndef CONFIG_CGROUP_PERF
if (attr->sample_type & PERF_SAMPLE_CGROUP)
--
2.40.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [Patch v2 18/24] perf/x86/intel: Support arch-PEBS vector registers group capturing
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (16 preceding siblings ...)
2025-02-18 15:28 ` [Patch v2 17/24] perf/core: Support to capture higher width vector registers Dapeng Mi
@ 2025-02-18 15:28 ` Dapeng Mi
2025-02-25 15:32 ` Peter Zijlstra
2025-02-18 15:28 ` [Patch v2 19/24] perf tools: Support to show SSP register Dapeng Mi
` (5 subsequent siblings)
23 siblings, 1 reply; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Add x86/intel specific vector register (VECR) group capturing for
arch-PEBS. Enable the corresponding VECR group bits in the
GPx_CFG_C/FX0_CFG_C MSRs if users configure these vector registers in the
perf_event_attr bitmap, and parse the VECR groups in the arch-PEBS record.
Currently vector register capturing is only supported by PEBS based
sampling; the PMU driver returns an error if PMI based sampling tries to
capture these vector registers.
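As an illustrative sketch only (not code from this patch), the mapping from
a requested extended register index to its PEBS_DATACFG_* group bit follows
the ranges used by pebs_update_adaptive_cfg() below; the helper name is
invented:
#include <linux/types.h>
#include <asm/perf_event.h>	/* PEBS_DATACFG_* group bits */
#include <asm/perf_regs.h>	/* PERF_REG_X86_* register indexes */
/* Sketch only: assumes reg >= PERF_REG_X86_YMMH0 (64). */
static u64 vecr_group_bit(int reg)
{
	if (reg >= PERF_REG_X86_OPMASK0)	/* 288..295 */
		return PEBS_DATACFG_OPMASKS;
	if (reg >= PERF_REG_X86_ZMM16)		/* 160..287 */
		return PEBS_DATACFG_H16ZMMS;
	if (reg >= PERF_REG_X86_ZMMH0)		/* 96..159 */
		return PEBS_DATACFG_ZMMHS;
	return PEBS_DATACFG_YMMS;		/* 64..95 */
}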
Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/core.c | 59 ++++++++++++++++++++++
arch/x86/events/intel/core.c | 15 ++++++
arch/x86/events/intel/ds.c | 82 ++++++++++++++++++++++++++++---
arch/x86/include/asm/msr-index.h | 6 +++
arch/x86/include/asm/perf_event.h | 20 ++++++++
5 files changed, 175 insertions(+), 7 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index d5609c0756c2..4d4b92b78e2d 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -581,6 +581,39 @@ int x86_pmu_max_precise(struct pmu *pmu)
return precise;
}
+static bool has_vec_regs(struct perf_event *event, int start, int end)
+{
+ /* -1 to subtract PERF_REG_EXTENDED_OFFSET */
+ int idx = start / 64 - 1;
+ int s = start % 64;
+ int e = end % 64;
+
+ return event->attr.sample_regs_intr_ext[idx] & GENMASK_ULL(e, s);
+}
+
+static inline bool has_ymmh_regs(struct perf_event *event)
+{
+ return has_vec_regs(event, PERF_REG_X86_YMMH0, PERF_REG_X86_YMMH15 + 1);
+}
+
+static inline bool has_zmmh_regs(struct perf_event *event)
+{
+ return has_vec_regs(event, PERF_REG_X86_ZMMH0, PERF_REG_X86_ZMMH7 + 3) ||
+ has_vec_regs(event, PERF_REG_X86_ZMMH8, PERF_REG_X86_ZMMH15 + 3);
+}
+
+static inline bool has_h16zmm_regs(struct perf_event *event)
+{
+ return has_vec_regs(event, PERF_REG_X86_ZMM16, PERF_REG_X86_ZMM19 + 7) ||
+ has_vec_regs(event, PERF_REG_X86_ZMM20, PERF_REG_X86_ZMM27 + 7) ||
+ has_vec_regs(event, PERF_REG_X86_ZMM28, PERF_REG_X86_ZMM31 + 7);
+}
+
+static inline bool has_opmask_regs(struct perf_event *event)
+{
+ return has_vec_regs(event, PERF_REG_X86_OPMASK0, PERF_REG_X86_OPMASK7);
+}
+
int x86_pmu_hw_config(struct perf_event *event)
{
if (event->attr.precise_ip) {
@@ -676,6 +709,32 @@ int x86_pmu_hw_config(struct perf_event *event)
return -EINVAL;
}
+ /*
+ * Architectural PEBS supports to capture more vector registers besides
+ * XMM registers, like YMM, OPMASK and ZMM registers.
+ */
+ if (unlikely(has_more_extended_regs(event))) {
+ u64 caps = hybrid(event->pmu, arch_pebs_cap).caps;
+
+ if (!(event->pmu->capabilities & PERF_PMU_CAP_MORE_EXT_REGS))
+ return -EINVAL;
+
+ if (has_opmask_regs(event) && !(caps & ARCH_PEBS_VECR_OPMASK))
+ return -EINVAL;
+
+ if (has_ymmh_regs(event) && !(caps & ARCH_PEBS_VECR_YMM))
+ return -EINVAL;
+
+ if (has_zmmh_regs(event) && !(caps & ARCH_PEBS_VECR_ZMMH))
+ return -EINVAL;
+
+ if (has_h16zmm_regs(event) && !(caps & ARCH_PEBS_VECR_H16ZMM))
+ return -EINVAL;
+
+ if (!event->attr.precise_ip)
+ return -EINVAL;
+ }
+
return x86_setup_perfctr(event);
}
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index f21d9f283445..8ef5b9a05fcc 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2963,6 +2963,18 @@ static void intel_pmu_enable_event_ext(struct perf_event *event)
if (pebs_data_cfg & PEBS_DATACFG_XMMS)
ext |= ARCH_PEBS_VECR_XMM & cap.caps;
+ if (pebs_data_cfg & PEBS_DATACFG_YMMS)
+ ext |= ARCH_PEBS_VECR_YMM & cap.caps;
+
+ if (pebs_data_cfg & PEBS_DATACFG_OPMASKS)
+ ext |= ARCH_PEBS_VECR_OPMASK & cap.caps;
+
+ if (pebs_data_cfg & PEBS_DATACFG_ZMMHS)
+ ext |= ARCH_PEBS_VECR_ZMMH & cap.caps;
+
+ if (pebs_data_cfg & PEBS_DATACFG_H16ZMMS)
+ ext |= ARCH_PEBS_VECR_H16ZMM & cap.caps;
+
if (pebs_data_cfg & PEBS_DATACFG_LBRS)
ext |= ARCH_PEBS_LBR & cap.caps;
@@ -5115,6 +5127,9 @@ static inline void __intel_update_pmu_caps(struct pmu *pmu)
if (hybrid(pmu, arch_pebs_cap).caps & ARCH_PEBS_VECR_XMM)
dest_pmu->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
+
+ if (hybrid(pmu, arch_pebs_cap).caps & ARCH_PEBS_VECR_EXT)
+ dest_pmu->capabilities |= PERF_PMU_CAP_MORE_EXT_REGS;
}
static inline void __intel_update_large_pebs_flags(struct pmu *pmu)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 4b01beee15f4..7e5a4202de37 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1413,6 +1413,7 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
u64 sample_type = attr->sample_type;
u64 pebs_data_cfg = 0;
bool gprs, tsx_weight;
+ int bit = 0;
if (!(sample_type & ~(PERF_SAMPLE_IP|PERF_SAMPLE_TIME)) &&
attr->precise_ip > 1)
@@ -1437,9 +1438,37 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
if (gprs || (attr->precise_ip < 2) || tsx_weight)
pebs_data_cfg |= PEBS_DATACFG_GP;
- if ((sample_type & PERF_SAMPLE_REGS_INTR) &&
- (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK))
- pebs_data_cfg |= PEBS_DATACFG_XMMS;
+ if (sample_type & PERF_SAMPLE_REGS_INTR) {
+ if (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK)
+ pebs_data_cfg |= PEBS_DATACFG_XMMS;
+
+ for_each_set_bit_from(bit,
+ (unsigned long *)event->attr.sample_regs_intr_ext,
+ PERF_NUM_EXT_REGS) {
+ switch (bit + PERF_REG_EXTENDED_OFFSET) {
+ case PERF_REG_X86_OPMASK0 ... PERF_REG_X86_OPMASK7:
+ pebs_data_cfg |= PEBS_DATACFG_OPMASKS;
+ bit = PERF_REG_X86_YMMH0 -
+ PERF_REG_EXTENDED_OFFSET - 1;
+ break;
+ case PERF_REG_X86_YMMH0 ... PERF_REG_X86_ZMMH0 - 1:
+ pebs_data_cfg |= PEBS_DATACFG_YMMS;
+ bit = PERF_REG_X86_ZMMH0 -
+ PERF_REG_EXTENDED_OFFSET - 1;
+ break;
+ case PERF_REG_X86_ZMMH0 ... PERF_REG_X86_ZMM16 - 1:
+ pebs_data_cfg |= PEBS_DATACFG_ZMMHS;
+ bit = PERF_REG_X86_ZMM16 -
+ PERF_REG_EXTENDED_OFFSET - 1;
+ break;
+ case PERF_REG_X86_ZMM16 ... PERF_REG_X86_ZMM_MAX - 1:
+ pebs_data_cfg |= PEBS_DATACFG_H16ZMMS;
+ bit = PERF_REG_X86_ZMM_MAX -
+ PERF_REG_EXTENDED_OFFSET - 1;
+ break;
+ }
+ }
+ }
if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
/*
@@ -2223,6 +2252,10 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
perf_regs = container_of(regs, struct x86_perf_regs, regs);
perf_regs->xmm_regs = NULL;
+ perf_regs->ymmh_regs = NULL;
+ perf_regs->opmask_regs = NULL;
+ perf_regs->zmmh_regs = NULL;
+ perf_regs->h16zmm_regs = NULL;
perf_regs->ssp = 0;
format_group = basic->format_group;
@@ -2340,6 +2373,10 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
perf_regs = container_of(regs, struct x86_perf_regs, regs);
perf_regs->xmm_regs = NULL;
+ perf_regs->ymmh_regs = NULL;
+ perf_regs->opmask_regs = NULL;
+ perf_regs->zmmh_regs = NULL;
+ perf_regs->h16zmm_regs = NULL;
perf_regs->ssp = 0;
__setup_perf_sample_data(event, iregs, data);
@@ -2390,14 +2427,45 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
meminfo->tsx_tuning, ax);
}
- if (header->xmm) {
+ if (header->xmm || header->ymmh || header->opmask ||
+ header->zmmh || header->h16zmm) {
struct arch_pebs_xmm *xmm;
+ struct arch_pebs_ymmh *ymmh;
+ struct arch_pebs_zmmh *zmmh;
+ struct arch_pebs_h16zmm *h16zmm;
+ struct arch_pebs_opmask *opmask;
next_record += sizeof(struct arch_pebs_xer_header);
- xmm = next_record;
- perf_regs->xmm_regs = xmm->xmm;
- next_record = xmm + 1;
+ if (header->xmm) {
+ xmm = next_record;
+ perf_regs->xmm_regs = xmm->xmm;
+ next_record = xmm + 1;
+ }
+
+ if (header->ymmh) {
+ ymmh = next_record;
+ perf_regs->ymmh_regs = ymmh->ymmh;
+ next_record = ymmh + 1;
+ }
+
+ if (header->opmask) {
+ opmask = next_record;
+ perf_regs->opmask_regs = opmask->opmask;
+ next_record = opmask + 1;
+ }
+
+ if (header->zmmh) {
+ zmmh = next_record;
+ perf_regs->zmmh_regs = (u64 **)zmmh->zmmh;
+ next_record = zmmh + 1;
+ }
+
+ if (header->h16zmm) {
+ h16zmm = next_record;
+ perf_regs->h16zmm_regs = (u64 **)h16zmm->h16zmm;
+ next_record = h16zmm + 1;
+ }
}
if (header->lbr) {
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 0ca84deb2396..973f875cec27 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -326,6 +326,12 @@
#define ARCH_PEBS_LBR_SHIFT 40
#define ARCH_PEBS_LBR (0x3ull << ARCH_PEBS_LBR_SHIFT)
#define ARCH_PEBS_VECR_XMM BIT_ULL(49)
+#define ARCH_PEBS_VECR_YMM BIT_ULL(50)
+#define ARCH_PEBS_VECR_OPMASK BIT_ULL(53)
+#define ARCH_PEBS_VECR_ZMMH BIT_ULL(54)
+#define ARCH_PEBS_VECR_H16ZMM BIT_ULL(55)
+#define ARCH_PEBS_VECR_EXT_SHIFT 50
+#define ARCH_PEBS_VECR_EXT (0x3full << ARCH_PEBS_VECR_EXT_SHIFT)
#define ARCH_PEBS_GPR BIT_ULL(61)
#define ARCH_PEBS_AUX BIT_ULL(62)
#define ARCH_PEBS_EN BIT_ULL(63)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 3bf8dcaa72ca..5f4f30ce6c4c 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -142,6 +142,10 @@
#define PEBS_DATACFG_LBRS BIT_ULL(3)
#define PEBS_DATACFG_CNTR BIT_ULL(4)
#define PEBS_DATACFG_METRICS BIT_ULL(5)
+#define PEBS_DATACFG_YMMS BIT_ULL(6)
+#define PEBS_DATACFG_OPMASKS BIT_ULL(7)
+#define PEBS_DATACFG_ZMMHS BIT_ULL(8)
+#define PEBS_DATACFG_H16ZMMS BIT_ULL(9)
#define PEBS_DATACFG_LBR_SHIFT 24
#define PEBS_DATACFG_CNTR_SHIFT 32
#define PEBS_DATACFG_CNTR_MASK GENMASK_ULL(15, 0)
@@ -588,6 +592,22 @@ struct arch_pebs_xmm {
u64 xmm[16*2]; /* two entries for each register */
};
+struct arch_pebs_ymmh {
+ u64 ymmh[16*2]; /* two entries for each register */
+};
+
+struct arch_pebs_opmask {
+ u64 opmask[8];
+};
+
+struct arch_pebs_zmmh {
+ u64 zmmh[16][4]; /* four entries for each register */
+};
+
+struct arch_pebs_h16zmm {
+ u64 h16zmm[16][8]; /* eight entries for each register */
+};
+
#define ARCH_PEBS_LBR_NAN 0x0
#define ARCH_PEBS_LBR_NUM_8 0x1
#define ARCH_PEBS_LBR_NUM_16 0x2
--
2.40.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [Patch v2 19/24] perf tools: Support to show SSP register
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (17 preceding siblings ...)
2025-02-18 15:28 ` [Patch v2 18/24] perf/x86/intel: Support arch-PEBS vector registers group capturing Dapeng Mi
@ 2025-02-18 15:28 ` Dapeng Mi
2025-02-18 15:28 ` [Patch v2 20/24] perf tools: Enhance arch__intr/user_reg_mask() helpers Dapeng Mi
` (4 subsequent siblings)
23 siblings, 0 replies; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Add SSP register support.
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
tools/arch/x86/include/uapi/asm/perf_regs.h | 7 ++++++-
tools/perf/arch/x86/util/perf_regs.c | 2 ++
tools/perf/util/intel-pt.c | 2 +-
tools/perf/util/perf-regs-arch/perf_regs_x86.c | 2 ++
4 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
index 7c9d2bb3833b..9c45b07bfcf7 100644
--- a/tools/arch/x86/include/uapi/asm/perf_regs.h
+++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
@@ -27,9 +27,14 @@ enum perf_event_x86_regs {
PERF_REG_X86_R13,
PERF_REG_X86_R14,
PERF_REG_X86_R15,
+ /* Shadow stack pointer (SSP) present on Clearwater Forest and newer models. */
+ PERF_REG_X86_SSP,
/* These are the limits for the GPRs. */
PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
- PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
+ /* PERF_REG_X86_64_MAX used generally, for PEBS, etc. */
+ PERF_REG_X86_64_MAX = PERF_REG_X86_SSP + 1,
+ /* PERF_REG_INTEL_PT_MAX ignores the SSP register. */
+ PERF_REG_INTEL_PT_MAX = PERF_REG_X86_R15 + 1,
/* These all need two bits set because they are 128bit */
PERF_REG_X86_XMM0 = 32,
diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
index 12fd93f04802..9f492568f3b4 100644
--- a/tools/perf/arch/x86/util/perf_regs.c
+++ b/tools/perf/arch/x86/util/perf_regs.c
@@ -36,6 +36,8 @@ static const struct sample_reg sample_reg_masks[] = {
SMPL_REG(R14, PERF_REG_X86_R14),
SMPL_REG(R15, PERF_REG_X86_R15),
#endif
+ SMPL_REG(SSP, PERF_REG_X86_SSP),
+
SMPL_REG2(XMM0, PERF_REG_X86_XMM0),
SMPL_REG2(XMM1, PERF_REG_X86_XMM1),
SMPL_REG2(XMM2, PERF_REG_X86_XMM2),
diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
index 30be6dfe09eb..86196275c1e7 100644
--- a/tools/perf/util/intel-pt.c
+++ b/tools/perf/util/intel-pt.c
@@ -2139,7 +2139,7 @@ static u64 *intel_pt_add_gp_regs(struct regs_dump *intr_regs, u64 *pos,
u32 bit;
int i;
- for (i = 0, bit = 1; i < PERF_REG_X86_64_MAX; i++, bit <<= 1) {
+ for (i = 0, bit = 1; i < PERF_REG_INTEL_PT_MAX; i++, bit <<= 1) {
/* Get the PEBS gp_regs array index */
int n = pebs_gp_regs[i] - 1;
diff --git a/tools/perf/util/perf-regs-arch/perf_regs_x86.c b/tools/perf/util/perf-regs-arch/perf_regs_x86.c
index 708954a9d35d..9a909f02bc04 100644
--- a/tools/perf/util/perf-regs-arch/perf_regs_x86.c
+++ b/tools/perf/util/perf-regs-arch/perf_regs_x86.c
@@ -54,6 +54,8 @@ const char *__perf_reg_name_x86(int id)
return "R14";
case PERF_REG_X86_R15:
return "R15";
+ case PERF_REG_X86_SSP:
+ return "ssp";
#define XMM(x) \
case PERF_REG_X86_XMM ## x: \
--
2.40.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [Patch v2 20/24] perf tools: Enhance arch__intr/user_reg_mask() helpers
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (18 preceding siblings ...)
2025-02-18 15:28 ` [Patch v2 19/24] perf tools: Support to show SSP register Dapeng Mi
@ 2025-02-18 15:28 ` Dapeng Mi
2025-02-18 15:28 ` [Patch v2 21/24] perf tools: Enhance sample_regs_user/intr to capture more registers Dapeng Mi
` (3 subsequent siblings)
23 siblings, 0 replies; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Arch-PEBS supports capturing wider vector registers, such as YMM/ZMM
registers, but the uint64_t return value of these two helpers is not wide
enough to represent these newly added registers. Thus enhance the two
helpers to take an "unsigned long" pointer, so they can return more bits
through that pointer.
Currently only sample_intr_regs supports these newly added vector
registers, but arch__user_reg_mask() is changed as well for the sake of
consistency.
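A minimal caller sketch (not part of this patch) of the new calling
convention; the wrapper name is made up, and 64 bits of local storage are
assumed to still be enough for the legacy GPR/XMM mask:
#include <stdint.h>
#include <stdbool.h>
#include "util/perf_regs.h"	/* arch__intr_reg_mask() prototype */
/* Hypothetical wrapper; reg must be a legacy (< 64) register index here. */
static bool intr_reg_available(int reg)
{
	uint64_t mask = 0;	/* the caller now owns the storage */
	arch__intr_reg_mask((unsigned long *)&mask);
	return mask & (1ULL << reg);
}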
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
tools/perf/arch/arm/util/perf_regs.c | 8 ++++----
tools/perf/arch/arm64/util/perf_regs.c | 11 ++++++-----
tools/perf/arch/csky/util/perf_regs.c | 8 ++++----
tools/perf/arch/loongarch/util/perf_regs.c | 8 ++++----
tools/perf/arch/mips/util/perf_regs.c | 8 ++++----
tools/perf/arch/powerpc/util/perf_regs.c | 17 +++++++++--------
tools/perf/arch/riscv/util/perf_regs.c | 8 ++++----
tools/perf/arch/s390/util/perf_regs.c | 8 ++++----
tools/perf/arch/x86/util/perf_regs.c | 13 +++++++------
tools/perf/util/evsel.c | 6 ++++--
tools/perf/util/parse-regs-options.c | 6 +++---
tools/perf/util/perf_regs.c | 8 ++++----
tools/perf/util/perf_regs.h | 4 ++--
13 files changed, 59 insertions(+), 54 deletions(-)
diff --git a/tools/perf/arch/arm/util/perf_regs.c b/tools/perf/arch/arm/util/perf_regs.c
index f94a0210c7b7..14f18d518c96 100644
--- a/tools/perf/arch/arm/util/perf_regs.c
+++ b/tools/perf/arch/arm/util/perf_regs.c
@@ -6,14 +6,14 @@ static const struct sample_reg sample_reg_masks[] = {
SMPL_REG_END
};
-uint64_t arch__intr_reg_mask(void)
+void arch__intr_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
-uint64_t arch__user_reg_mask(void)
+void arch__user_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
const struct sample_reg *arch__sample_reg_masks(void)
diff --git a/tools/perf/arch/arm64/util/perf_regs.c b/tools/perf/arch/arm64/util/perf_regs.c
index 09308665e28a..9bcf4755290c 100644
--- a/tools/perf/arch/arm64/util/perf_regs.c
+++ b/tools/perf/arch/arm64/util/perf_regs.c
@@ -140,12 +140,12 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
return SDT_ARG_VALID;
}
-uint64_t arch__intr_reg_mask(void)
+void arch__intr_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
-uint64_t arch__user_reg_mask(void)
+void arch__user_reg_mask(unsigned long *mask)
{
struct perf_event_attr attr = {
.type = PERF_TYPE_HARDWARE,
@@ -170,10 +170,11 @@ uint64_t arch__user_reg_mask(void)
fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
if (fd != -1) {
close(fd);
- return attr.sample_regs_user;
+ *(uint64_t *)mask = attr.sample_regs_user;
+ return;
}
}
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
const struct sample_reg *arch__sample_reg_masks(void)
diff --git a/tools/perf/arch/csky/util/perf_regs.c b/tools/perf/arch/csky/util/perf_regs.c
index 6b1665f41180..56c84fc91aff 100644
--- a/tools/perf/arch/csky/util/perf_regs.c
+++ b/tools/perf/arch/csky/util/perf_regs.c
@@ -6,14 +6,14 @@ static const struct sample_reg sample_reg_masks[] = {
SMPL_REG_END
};
-uint64_t arch__intr_reg_mask(void)
+void arch__intr_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
-uint64_t arch__user_reg_mask(void)
+void arch__user_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
const struct sample_reg *arch__sample_reg_masks(void)
diff --git a/tools/perf/arch/loongarch/util/perf_regs.c b/tools/perf/arch/loongarch/util/perf_regs.c
index f94a0210c7b7..14f18d518c96 100644
--- a/tools/perf/arch/loongarch/util/perf_regs.c
+++ b/tools/perf/arch/loongarch/util/perf_regs.c
@@ -6,14 +6,14 @@ static const struct sample_reg sample_reg_masks[] = {
SMPL_REG_END
};
-uint64_t arch__intr_reg_mask(void)
+void arch__intr_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
-uint64_t arch__user_reg_mask(void)
+void arch__user_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
const struct sample_reg *arch__sample_reg_masks(void)
diff --git a/tools/perf/arch/mips/util/perf_regs.c b/tools/perf/arch/mips/util/perf_regs.c
index 6b1665f41180..56c84fc91aff 100644
--- a/tools/perf/arch/mips/util/perf_regs.c
+++ b/tools/perf/arch/mips/util/perf_regs.c
@@ -6,14 +6,14 @@ static const struct sample_reg sample_reg_masks[] = {
SMPL_REG_END
};
-uint64_t arch__intr_reg_mask(void)
+void arch__intr_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
-uint64_t arch__user_reg_mask(void)
+void arch__user_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
const struct sample_reg *arch__sample_reg_masks(void)
diff --git a/tools/perf/arch/powerpc/util/perf_regs.c b/tools/perf/arch/powerpc/util/perf_regs.c
index bd36cfd420a2..e5d042305030 100644
--- a/tools/perf/arch/powerpc/util/perf_regs.c
+++ b/tools/perf/arch/powerpc/util/perf_regs.c
@@ -187,7 +187,7 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
return SDT_ARG_VALID;
}
-uint64_t arch__intr_reg_mask(void)
+void arch__intr_reg_mask(unsigned long *mask)
{
struct perf_event_attr attr = {
.type = PERF_TYPE_HARDWARE,
@@ -199,7 +199,7 @@ uint64_t arch__intr_reg_mask(void)
};
int fd;
u32 version;
- u64 extended_mask = 0, mask = PERF_REGS_MASK;
+ u64 extended_mask = 0;
/*
* Get the PVR value to set the extended
@@ -210,8 +210,10 @@ uint64_t arch__intr_reg_mask(void)
extended_mask = PERF_REG_PMU_MASK_300;
else if ((version == PVR_POWER10) || (version == PVR_POWER11))
extended_mask = PERF_REG_PMU_MASK_31;
- else
- return mask;
+ else {
+ *(u64 *)mask = PERF_REGS_MASK;
+ return;
+ }
attr.sample_regs_intr = extended_mask;
attr.sample_period = 1;
@@ -224,14 +226,13 @@ uint64_t arch__intr_reg_mask(void)
fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
if (fd != -1) {
close(fd);
- mask |= extended_mask;
+ *(u64 *)mask = PERF_REGS_MASK | extended_mask;
}
- return mask;
}
-uint64_t arch__user_reg_mask(void)
+void arch__user_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
const struct sample_reg *arch__sample_reg_masks(void)
diff --git a/tools/perf/arch/riscv/util/perf_regs.c b/tools/perf/arch/riscv/util/perf_regs.c
index 6b1665f41180..56c84fc91aff 100644
--- a/tools/perf/arch/riscv/util/perf_regs.c
+++ b/tools/perf/arch/riscv/util/perf_regs.c
@@ -6,14 +6,14 @@ static const struct sample_reg sample_reg_masks[] = {
SMPL_REG_END
};
-uint64_t arch__intr_reg_mask(void)
+void arch__intr_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
-uint64_t arch__user_reg_mask(void)
+void arch__user_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
const struct sample_reg *arch__sample_reg_masks(void)
diff --git a/tools/perf/arch/s390/util/perf_regs.c b/tools/perf/arch/s390/util/perf_regs.c
index 6b1665f41180..56c84fc91aff 100644
--- a/tools/perf/arch/s390/util/perf_regs.c
+++ b/tools/perf/arch/s390/util/perf_regs.c
@@ -6,14 +6,14 @@ static const struct sample_reg sample_reg_masks[] = {
SMPL_REG_END
};
-uint64_t arch__intr_reg_mask(void)
+void arch__intr_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
-uint64_t arch__user_reg_mask(void)
+void arch__user_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
const struct sample_reg *arch__sample_reg_masks(void)
diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
index 9f492568f3b4..5b163f0a651a 100644
--- a/tools/perf/arch/x86/util/perf_regs.c
+++ b/tools/perf/arch/x86/util/perf_regs.c
@@ -283,7 +283,7 @@ const struct sample_reg *arch__sample_reg_masks(void)
return sample_reg_masks;
}
-uint64_t arch__intr_reg_mask(void)
+void arch__intr_reg_mask(unsigned long *mask)
{
struct perf_event_attr attr = {
.type = PERF_TYPE_HARDWARE,
@@ -295,6 +295,9 @@ uint64_t arch__intr_reg_mask(void)
.exclude_kernel = 1,
};
int fd;
+
+ *(u64 *)mask = PERF_REGS_MASK;
+
/*
* In an unnamed union, init it here to build on older gcc versions
*/
@@ -320,13 +323,11 @@ uint64_t arch__intr_reg_mask(void)
fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
if (fd != -1) {
close(fd);
- return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
+ *(u64 *)mask = PERF_REG_EXTENDED_MASK | PERF_REGS_MASK;
}
-
- return PERF_REGS_MASK;
}
-uint64_t arch__user_reg_mask(void)
+void arch__user_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index bc144388f892..78bcb12a9d96 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1032,17 +1032,19 @@ static void __evsel__config_callchain(struct evsel *evsel, struct record_opts *o
if (param->record_mode == CALLCHAIN_DWARF) {
if (!function) {
const char *arch = perf_env__arch(evsel__env(evsel));
+ uint64_t mask = 0;
+ arch__user_reg_mask((unsigned long *)&mask);
evsel__set_sample_bit(evsel, REGS_USER);
evsel__set_sample_bit(evsel, STACK_USER);
if (opts->sample_user_regs &&
- DWARF_MINIMAL_REGS(arch) != arch__user_reg_mask()) {
+ DWARF_MINIMAL_REGS(arch) != mask) {
attr->sample_regs_user |= DWARF_MINIMAL_REGS(arch);
pr_warning("WARNING: The use of --call-graph=dwarf may require all the user registers, "
"specifying a subset with --user-regs may render DWARF unwinding unreliable, "
"so the minimal registers set (IP, SP) is explicitly forced.\n");
} else {
- attr->sample_regs_user |= arch__user_reg_mask();
+ attr->sample_regs_user |= mask;
}
attr->sample_stack_user = param->dump_size;
attr->exclude_callchain_user = 1;
diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
index cda1c620968e..3dcd8dc4f81b 100644
--- a/tools/perf/util/parse-regs-options.c
+++ b/tools/perf/util/parse-regs-options.c
@@ -16,7 +16,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
const struct sample_reg *r = NULL;
char *s, *os = NULL, *p;
int ret = -1;
- uint64_t mask;
+ uint64_t mask = 0;
if (unset)
return 0;
@@ -28,9 +28,9 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
return -1;
if (intr)
- mask = arch__intr_reg_mask();
+ arch__intr_reg_mask((unsigned long *)&mask);
else
- mask = arch__user_reg_mask();
+ arch__user_reg_mask((unsigned long *)&mask);
/* str may be NULL in case no arg is passed to -I */
if (str) {
diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
index 44b90bbf2d07..7a96290fd1e6 100644
--- a/tools/perf/util/perf_regs.c
+++ b/tools/perf/util/perf_regs.c
@@ -11,14 +11,14 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
return SDT_ARG_SKIP;
}
-uint64_t __weak arch__intr_reg_mask(void)
+void __weak arch__intr_reg_mask(unsigned long *mask)
{
- return 0;
+ *(uint64_t *)mask = 0;
}
-uint64_t __weak arch__user_reg_mask(void)
+void __weak arch__user_reg_mask(unsigned long *mask)
{
- return 0;
+ *(uint64_t *)mask = 0;
}
static const struct sample_reg sample_reg_masks[] = {
diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
index f2d0736d65cc..316d280e5cd7 100644
--- a/tools/perf/util/perf_regs.h
+++ b/tools/perf/util/perf_regs.h
@@ -24,8 +24,8 @@ enum {
};
int arch_sdt_arg_parse_op(char *old_op, char **new_op);
-uint64_t arch__intr_reg_mask(void);
-uint64_t arch__user_reg_mask(void);
+void arch__intr_reg_mask(unsigned long *mask);
+void arch__user_reg_mask(unsigned long *mask);
const struct sample_reg *arch__sample_reg_masks(void);
const char *perf_reg_name(int id, const char *arch);
--
2.40.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [Patch v2 21/24] perf tools: Enhance sample_regs_user/intr to capture more registers
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (19 preceding siblings ...)
2025-02-18 15:28 ` [Patch v2 20/24] perf tools: Enhance arch__intr/user_reg_mask() helpers Dapeng Mi
@ 2025-02-18 15:28 ` Dapeng Mi
2025-02-18 15:28 ` [Patch v2 22/24] perf tools: Support to capture more vector registers (x86/Intel) Dapeng Mi
` (2 subsequent siblings)
23 siblings, 0 replies; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Intel architectural PEBS supports capturing more vector registers, such as
OPMASK/YMM/ZMM registers, besides the already supported XMM registers.
Arch-PEBS vector register (VECR) capturing in the perf core and the Intel
PMU driver has been supported by previous patches. This patch adds the perf
tool side support. In detail, add support for the new
sample_regs_intr/user_ext register selectors in perf_event_attr. This
32-byte bitmap is used to select the new register groups OPMASK, YMMH, ZMMH
and ZMM in VECR. Update perf regs to introduce the new registers.
This patch only introduces the generic support; the x86/intel specific
support is added in the next patch.
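For illustration, a sketch (not taken from this patch) of how the widened
record_opts selector is expected to land in perf_event_attr, mirroring the
evsel__config() change below; the helper name is invented:
#include <string.h>
#include <linux/perf_event.h>	/* patched uapi with sample_regs_intr_ext[] */
/* Sketch only: word 0 keeps the legacy mask, words 1..4 feed the new array. */
static void apply_intr_regs(struct perf_event_attr *attr,
			    const __u64 regs[PERF_SAMPLE_ARRAY_SIZE])
{
	attr->sample_regs_intr = regs[0];
	memcpy(attr->sample_regs_intr_ext, &regs[1], PERF_NUM_EXT_REGS / 8);
	attr->sample_type |= PERF_SAMPLE_REGS_INTR;
}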
Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
tools/include/uapi/linux/perf_event.h | 14 +++++++++++++
tools/perf/builtin-script.c | 23 +++++++++++++++-----
tools/perf/util/evsel.c | 30 ++++++++++++++++++++-------
tools/perf/util/parse-regs-options.c | 23 ++++++++++++--------
tools/perf/util/perf_regs.h | 16 +++++++++++++-
tools/perf/util/record.h | 4 ++--
tools/perf/util/sample.h | 6 +++++-
tools/perf/util/session.c | 29 +++++++++++++++-----------
tools/perf/util/synthetic-events.c | 6 ++++--
9 files changed, 112 insertions(+), 39 deletions(-)
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 0524d541d4e3..a46b4f33557f 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -379,6 +379,13 @@ enum perf_event_read_format {
#define PERF_ATTR_SIZE_VER6 120 /* add: aux_sample_size */
#define PERF_ATTR_SIZE_VER7 128 /* add: sig_data */
#define PERF_ATTR_SIZE_VER8 136 /* add: config3 */
+#define PERF_ATTR_SIZE_VER9 168 /* add: sample_regs_intr_ext[PERF_EXT_REGS_ARRAY_SIZE] */
+
+#define PERF_EXT_REGS_ARRAY_SIZE 4
+#define PERF_NUM_EXT_REGS (PERF_EXT_REGS_ARRAY_SIZE * 64)
+
+#define PERF_SAMPLE_ARRAY_SIZE (PERF_EXT_REGS_ARRAY_SIZE + 1)
+#define PERF_SAMPLE_REGS_NUM ((PERF_SAMPLE_ARRAY_SIZE) * 64)
/*
* Hardware event_id to monitor via a performance monitoring event:
@@ -531,6 +538,13 @@ struct perf_event_attr {
__u64 sig_data;
__u64 config3; /* extension of config2 */
+
+ /*
+ * Extension sets of regs to dump for each sample.
+ * See asm/perf_regs.h for details.
+ */
+ __u64 sample_regs_intr_ext[PERF_EXT_REGS_ARRAY_SIZE];
+ __u64 sample_regs_user_ext[PERF_EXT_REGS_ARRAY_SIZE];
};
/*
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 33667b534634..14aba0965a26 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -712,21 +712,32 @@ static int perf_session__check_output_opt(struct perf_session *session)
}
static int perf_sample__fprintf_regs(struct regs_dump *regs, uint64_t mask, const char *arch,
- FILE *fp)
+ unsigned long *mask_ext, FILE *fp)
{
+ unsigned int mask_size = sizeof(mask) * 8;
unsigned i = 0, r;
int printed = 0;
+ u64 val;
if (!regs || !regs->regs)
return 0;
printed += fprintf(fp, " ABI:%" PRIu64 " ", regs->abi);
- for_each_set_bit(r, (unsigned long *) &mask, sizeof(mask) * 8) {
- u64 val = regs->regs[i++];
+ for_each_set_bit(r, (unsigned long *)&mask, mask_size) {
+ val = regs->regs[i++];
printed += fprintf(fp, "%5s:0x%"PRIx64" ", perf_reg_name(r, arch), val);
}
+ if (!mask_ext)
+ return printed;
+
+ for_each_set_bit(r, mask_ext, PERF_NUM_EXT_REGS) {
+ val = regs->regs[i++];
+ printed += fprintf(fp, "%5s:0x%"PRIx64" ",
+ perf_reg_name(r + mask_size, arch), val);
+ }
+
return printed;
}
@@ -784,14 +795,16 @@ static int perf_sample__fprintf_iregs(struct perf_sample *sample,
struct perf_event_attr *attr, const char *arch, FILE *fp)
{
return perf_sample__fprintf_regs(&sample->intr_regs,
- attr->sample_regs_intr, arch, fp);
+ attr->sample_regs_intr, arch,
+ (unsigned long *)attr->sample_regs_intr_ext, fp);
}
static int perf_sample__fprintf_uregs(struct perf_sample *sample,
struct perf_event_attr *attr, const char *arch, FILE *fp)
{
return perf_sample__fprintf_regs(&sample->user_regs,
- attr->sample_regs_user, arch, fp);
+ attr->sample_regs_user, arch,
+ (unsigned long *)attr->sample_regs_user_ext, fp);
}
static int perf_sample__fprintf_start(struct perf_script *script,
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 78bcb12a9d96..45e06aeec4e5 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1037,7 +1037,7 @@ static void __evsel__config_callchain(struct evsel *evsel, struct record_opts *o
arch__user_reg_mask((unsigned long *)&mask);
evsel__set_sample_bit(evsel, REGS_USER);
evsel__set_sample_bit(evsel, STACK_USER);
- if (opts->sample_user_regs &&
+ if (bitmap_weight(opts->sample_user_regs, PERF_SAMPLE_REGS_NUM) &&
DWARF_MINIMAL_REGS(arch) != mask) {
attr->sample_regs_user |= DWARF_MINIMAL_REGS(arch);
pr_warning("WARNING: The use of --call-graph=dwarf may require all the user registers, "
@@ -1373,15 +1373,19 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
if (callchain && callchain->enabled && !evsel->no_aux_samples)
evsel__config_callchain(evsel, opts, callchain);
- if (opts->sample_intr_regs && !evsel->no_aux_samples &&
- !evsel__is_dummy_event(evsel)) {
- attr->sample_regs_intr = opts->sample_intr_regs;
+ if (bitmap_weight(opts->sample_intr_regs, PERF_SAMPLE_REGS_NUM) &&
+ !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
+ attr->sample_regs_intr = opts->sample_intr_regs[0];
+ memcpy(attr->sample_regs_intr_ext, &opts->sample_intr_regs[1],
+ PERF_NUM_EXT_REGS / 8);
evsel__set_sample_bit(evsel, REGS_INTR);
}
- if (opts->sample_user_regs && !evsel->no_aux_samples &&
- !evsel__is_dummy_event(evsel)) {
- attr->sample_regs_user |= opts->sample_user_regs;
+ if (bitmap_weight(opts->sample_user_regs, PERF_SAMPLE_REGS_NUM) &&
+ !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
+ attr->sample_regs_user |= opts->sample_user_regs[0];
+ memcpy(attr->sample_regs_user_ext, &opts->sample_user_regs[1],
+ PERF_NUM_EXT_REGS / 8);
evsel__set_sample_bit(evsel, REGS_USER);
}
@@ -3172,10 +3176,16 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
if (data->user_regs.abi) {
u64 mask = evsel->core.attr.sample_regs_user;
+ unsigned long *mask_ext =
+ (unsigned long *)evsel->core.attr.sample_regs_user_ext;
+ u64 *user_regs_mask;
sz = hweight64(mask) * sizeof(u64);
+ sz += bitmap_weight(mask_ext, PERF_NUM_EXT_REGS) * sizeof(u64);
OVERFLOW_CHECK(array, sz, max_size);
data->user_regs.mask = mask;
+ user_regs_mask = (u64 *)&data->user_regs.mask_ext;
+ memcpy(&user_regs_mask[1], mask_ext, PERF_NUM_EXT_REGS);
data->user_regs.regs = (u64 *)array;
array = (void *)array + sz;
}
@@ -3228,10 +3238,16 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
if (data->intr_regs.abi != PERF_SAMPLE_REGS_ABI_NONE) {
u64 mask = evsel->core.attr.sample_regs_intr;
+ unsigned long *mask_ext =
+ (unsigned long *)evsel->core.attr.sample_regs_intr_ext;
+ u64 *intr_regs_mask;
sz = hweight64(mask) * sizeof(u64);
+ sz += bitmap_weight(mask_ext, PERF_NUM_EXT_REGS) * sizeof(u64);
OVERFLOW_CHECK(array, sz, max_size);
data->intr_regs.mask = mask;
+ intr_regs_mask = (u64 *)&data->intr_regs.mask_ext;
+ memcpy(&intr_regs_mask[1], mask_ext, PERF_NUM_EXT_REGS);
data->intr_regs.regs = (u64 *)array;
array = (void *)array + sz;
}
diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
index 3dcd8dc4f81b..42b176705ccf 100644
--- a/tools/perf/util/parse-regs-options.c
+++ b/tools/perf/util/parse-regs-options.c
@@ -12,11 +12,13 @@
static int
__parse_regs(const struct option *opt, const char *str, int unset, bool intr)
{
+ unsigned int size = PERF_SAMPLE_REGS_NUM;
uint64_t *mode = (uint64_t *)opt->value;
const struct sample_reg *r = NULL;
char *s, *os = NULL, *p;
int ret = -1;
- uint64_t mask = 0;
+ DECLARE_BITMAP(mask, size);
+ DECLARE_BITMAP(mask_tmp, size);
if (unset)
return 0;
@@ -24,13 +26,14 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
/*
* cannot set it twice
*/
- if (*mode)
+ if (bitmap_weight((unsigned long *)mode, size))
return -1;
+ bitmap_zero(mask, size);
if (intr)
- arch__intr_reg_mask((unsigned long *)&mask);
+ arch__intr_reg_mask(mask);
else
- arch__user_reg_mask((unsigned long *)&mask);
+ arch__user_reg_mask(mask);
/* str may be NULL in case no arg is passed to -I */
if (str) {
@@ -47,7 +50,8 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
if (!strcmp(s, "?")) {
fprintf(stderr, "available registers: ");
for (r = arch__sample_reg_masks(); r->name; r++) {
- if (r->mask & mask)
+ bitmap_and(mask_tmp, mask, r->mask_ext, size);
+ if (bitmap_weight(mask_tmp, size))
fprintf(stderr, "%s ", r->name);
}
fputc('\n', stderr);
@@ -55,7 +59,8 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
goto error;
}
for (r = arch__sample_reg_masks(); r->name; r++) {
- if ((r->mask & mask) && !strcasecmp(s, r->name))
+ bitmap_and(mask_tmp, mask, r->mask_ext, size);
+ if (bitmap_weight(mask_tmp, size) && !strcasecmp(s, r->name))
break;
}
if (!r || !r->name) {
@@ -64,7 +69,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
goto error;
}
- *mode |= r->mask;
+ bitmap_or((unsigned long *)mode, (unsigned long *)mode, r->mask_ext, size);
if (!p)
break;
@@ -75,8 +80,8 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
ret = 0;
/* default to all possible regs */
- if (*mode == 0)
- *mode = mask;
+ if (!bitmap_weight((unsigned long *)mode, size))
+ bitmap_or((unsigned long *)mode, (unsigned long *)mode, mask, size);
error:
free(os);
return ret;
diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
index 316d280e5cd7..d60a74623a0f 100644
--- a/tools/perf/util/perf_regs.h
+++ b/tools/perf/util/perf_regs.h
@@ -4,18 +4,32 @@
#include <linux/types.h>
#include <linux/compiler.h>
+#include <linux/bitmap.h>
+#include <linux/perf_event.h>
+#include "util/record.h"
struct regs_dump;
struct sample_reg {
const char *name;
- uint64_t mask;
+ union {
+ uint64_t mask;
+ DECLARE_BITMAP(mask_ext, PERF_SAMPLE_REGS_NUM);
+ };
};
#define SMPL_REG_MASK(b) (1ULL << (b))
#define SMPL_REG(n, b) { .name = #n, .mask = SMPL_REG_MASK(b) }
#define SMPL_REG2_MASK(b) (3ULL << (b))
#define SMPL_REG2(n, b) { .name = #n, .mask = SMPL_REG2_MASK(b) }
+#define SMPL_REG_EXT(n, b) \
+ { .name = #n, .mask_ext[b / __BITS_PER_LONG] = 0x1ULL << (b % __BITS_PER_LONG) }
+#define SMPL_REG2_EXT(n, b) \
+ { .name = #n, .mask_ext[b / __BITS_PER_LONG] = 0x3ULL << (b % __BITS_PER_LONG) }
+#define SMPL_REG4_EXT(n, b) \
+ { .name = #n, .mask_ext[b / __BITS_PER_LONG] = 0xfULL << (b % __BITS_PER_LONG) }
+#define SMPL_REG8_EXT(n, b) \
+ { .name = #n, .mask_ext[b / __BITS_PER_LONG] = 0xffULL << (b % __BITS_PER_LONG) }
#define SMPL_REG_END { .name = NULL }
enum {
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index a6566134e09e..2741bbbc2794 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -57,8 +57,8 @@ struct record_opts {
unsigned int auxtrace_mmap_pages;
unsigned int user_freq;
u64 branch_stack;
- u64 sample_intr_regs;
- u64 sample_user_regs;
+ u64 sample_intr_regs[PERF_SAMPLE_ARRAY_SIZE];
+ u64 sample_user_regs[PERF_SAMPLE_ARRAY_SIZE];
u64 default_interval;
u64 user_interval;
size_t auxtrace_snapshot_size;
diff --git a/tools/perf/util/sample.h b/tools/perf/util/sample.h
index 70b2c3135555..a0bb7ba4b0c1 100644
--- a/tools/perf/util/sample.h
+++ b/tools/perf/util/sample.h
@@ -4,13 +4,17 @@
#include <linux/perf_event.h>
#include <linux/types.h>
+#include <linux/bitmap.h>
/* number of register is bound by the number of bits in regs_dump::mask (64) */
#define PERF_SAMPLE_REGS_CACHE_SIZE (8 * sizeof(u64))
struct regs_dump {
u64 abi;
- u64 mask;
+ union {
+ u64 mask;
+ DECLARE_BITMAP(mask_ext, PERF_SAMPLE_REGS_NUM);
+ };
u64 *regs;
/* Cached values/mask filled by first register access. */
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index c06e3020a976..ef42dbff82fc 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -910,12 +910,13 @@ static void branch_stack__printf(struct perf_sample *sample,
}
}
-static void regs_dump__printf(u64 mask, u64 *regs, const char *arch)
+static void regs_dump__printf(struct regs_dump *regs, const char *arch)
{
+ unsigned int size = PERF_SAMPLE_REGS_NUM;
unsigned rid, i = 0;
- for_each_set_bit(rid, (unsigned long *) &mask, sizeof(mask) * 8) {
- u64 val = regs[i++];
+ for_each_set_bit(rid, regs->mask_ext, size) {
+ u64 val = regs->regs[i++];
printf(".... %-5s 0x%016" PRIx64 "\n",
perf_reg_name(rid, arch), val);
@@ -936,16 +937,20 @@ static inline const char *regs_dump_abi(struct regs_dump *d)
return regs_abi[d->abi];
}
-static void regs__printf(const char *type, struct regs_dump *regs, const char *arch)
+static void regs__printf(bool intr, struct regs_dump *regs, const char *arch)
{
- u64 mask = regs->mask;
+ u64 *mask = (u64 *)&regs->mask_ext;
- printf("... %s regs: mask 0x%" PRIx64 " ABI %s\n",
- type,
- mask,
- regs_dump_abi(regs));
+ if (intr)
+ printf("... intr regs: mask 0x");
+ else
+ printf("... user regs: mask 0x");
+
+ for (int i = 0; i < PERF_SAMPLE_ARRAY_SIZE; i++)
+ printf("%" PRIx64 "", mask[i]);
+ printf(" ABI %s\n", regs_dump_abi(regs));
- regs_dump__printf(mask, regs->regs, arch);
+ regs_dump__printf(regs, arch);
}
static void regs_user__printf(struct perf_sample *sample, const char *arch)
@@ -953,7 +958,7 @@ static void regs_user__printf(struct perf_sample *sample, const char *arch)
struct regs_dump *user_regs = &sample->user_regs;
if (user_regs->regs)
- regs__printf("user", user_regs, arch);
+ regs__printf(false, user_regs, arch);
}
static void regs_intr__printf(struct perf_sample *sample, const char *arch)
@@ -961,7 +966,7 @@ static void regs_intr__printf(struct perf_sample *sample, const char *arch)
struct regs_dump *intr_regs = &sample->intr_regs;
if (intr_regs->regs)
- regs__printf("intr", intr_regs, arch);
+ regs__printf(true, intr_regs, arch);
}
static void stack_user__printf(struct stack_dump *dump)
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index 6923b0d5efed..5d124e55167d 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -1538,7 +1538,8 @@ size_t perf_event__sample_event_size(const struct perf_sample *sample, u64 type,
if (type & PERF_SAMPLE_REGS_INTR) {
if (sample->intr_regs.abi) {
result += sizeof(u64);
- sz = hweight64(sample->intr_regs.mask) * sizeof(u64);
+ sz = bitmap_weight(sample->intr_regs.mask_ext,
+ PERF_SAMPLE_REGS_NUM) * sizeof(u64);
result += sz;
} else {
result += sizeof(u64);
@@ -1745,7 +1746,8 @@ int perf_event__synthesize_sample(union perf_event *event, u64 type, u64 read_fo
if (type & PERF_SAMPLE_REGS_INTR) {
if (sample->intr_regs.abi) {
*array++ = sample->intr_regs.abi;
- sz = hweight64(sample->intr_regs.mask) * sizeof(u64);
+ sz = bitmap_weight(sample->intr_regs.mask_ext,
+ PERF_SAMPLE_REGS_NUM) * sizeof(u64);
memcpy(array, sample->intr_regs.regs, sz);
array = (void *)array + sz;
} else {
--
2.40.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [Patch v2 22/24] perf tools: Support to capture more vector registers (x86/Intel)
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (20 preceding siblings ...)
2025-02-18 15:28 ` [Patch v2 21/24] perf tools: Enhance sample_regs_user/intr to capture more registers Dapeng Mi
@ 2025-02-18 15:28 ` Dapeng Mi
2025-02-18 15:28 ` [Patch v2 23/24] perf tools/tests: Add vector registers PEBS sampling test Dapeng Mi
2025-02-18 15:28 ` [Patch v2 24/24] perf tools: Fix incorrect --user-regs comments Dapeng Mi
23 siblings, 0 replies; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Intel architectural PEBS supports capturing more vector registers, such as
the OPMASK/YMM/ZMM registers, in addition to the already supported XMM
registers. This patch adds Intel-specific support to capture these new
vector registers in perf tools.
Besides that, add SSP to perf regs. SSP is stored in the general register
group and is selected by sample_regs_intr.
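For readers less familiar with the extended register bitmap used below, here
is a minimal standalone sketch (not part of the patch) of how an extended
register index maps onto the mask_ext[] words filled in by the SMPL_REG*_EXT
macros. The YMMH0 index of 64 and the SMPL_REG2_EXT semantics come from this
series; the 320-bit width is inferred from the review discussion, and the
BITS_PER_U64/REGS_NUM_BITS names are purely illustrative.
#include <stdio.h>
#include <stdint.h>
#define BITS_PER_U64	64
#define REGS_NUM_BITS	320	/* assumed PERF_SAMPLE_REGS_NUM: 5 x u64 words */
int main(void)
{
	uint64_t mask_ext[REGS_NUM_BITS / BITS_PER_U64] = { 0 };
	unsigned int ymmh0 = 64;	/* PERF_REG_X86_YMMH0 from this patch */
	/* SMPL_REG2_EXT(YMMH0, 64): two bits, since a YMMH half is 128 bits */
	mask_ext[ymmh0 / BITS_PER_U64] |= 0x3ULL << (ymmh0 % BITS_PER_U64);
	/* Prints "mask_ext[1] = 0x3": bit 64 lands in the second u64 word */
	printf("mask_ext[1] = %#llx\n", (unsigned long long)mask_ext[1]);
	return 0;
}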
Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
tools/arch/x86/include/uapi/asm/perf_regs.h | 83 +++++++++++++++-
tools/perf/arch/x86/util/perf_regs.c | 99 +++++++++++++++++++
.../perf/util/perf-regs-arch/perf_regs_x86.c | 88 +++++++++++++++++
3 files changed, 269 insertions(+), 1 deletion(-)
diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
index 9c45b07bfcf7..9d19eef7d7a2 100644
--- a/tools/arch/x86/include/uapi/asm/perf_regs.h
+++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
@@ -36,7 +36,7 @@ enum perf_event_x86_regs {
/* PERF_REG_INTEL_PT_MAX ignores the SSP register. */
PERF_REG_INTEL_PT_MAX = PERF_REG_X86_R15 + 1,
- /* These all need two bits set because they are 128bit */
+ /* These all need two bits set because they are 128 bits */
PERF_REG_X86_XMM0 = 32,
PERF_REG_X86_XMM1 = 34,
PERF_REG_X86_XMM2 = 36,
@@ -56,6 +56,87 @@ enum perf_event_x86_regs {
/* These include both GPRs and XMMX registers */
PERF_REG_X86_XMM_MAX = PERF_REG_X86_XMM15 + 2,
+
+ /*
+ * YMM upper bits need two bits set because they are 128 bits.
+ * PERF_REG_X86_YMMH0 = 64
+ */
+ PERF_REG_X86_YMMH0 = PERF_REG_X86_XMM_MAX,
+ PERF_REG_X86_YMMH1 = PERF_REG_X86_YMMH0 + 2,
+ PERF_REG_X86_YMMH2 = PERF_REG_X86_YMMH1 + 2,
+ PERF_REG_X86_YMMH3 = PERF_REG_X86_YMMH2 + 2,
+ PERF_REG_X86_YMMH4 = PERF_REG_X86_YMMH3 + 2,
+ PERF_REG_X86_YMMH5 = PERF_REG_X86_YMMH4 + 2,
+ PERF_REG_X86_YMMH6 = PERF_REG_X86_YMMH5 + 2,
+ PERF_REG_X86_YMMH7 = PERF_REG_X86_YMMH6 + 2,
+ PERF_REG_X86_YMMH8 = PERF_REG_X86_YMMH7 + 2,
+ PERF_REG_X86_YMMH9 = PERF_REG_X86_YMMH8 + 2,
+ PERF_REG_X86_YMMH10 = PERF_REG_X86_YMMH9 + 2,
+ PERF_REG_X86_YMMH11 = PERF_REG_X86_YMMH10 + 2,
+ PERF_REG_X86_YMMH12 = PERF_REG_X86_YMMH11 + 2,
+ PERF_REG_X86_YMMH13 = PERF_REG_X86_YMMH12 + 2,
+ PERF_REG_X86_YMMH14 = PERF_REG_X86_YMMH13 + 2,
+ PERF_REG_X86_YMMH15 = PERF_REG_X86_YMMH14 + 2,
+ PERF_REG_X86_YMMH_MAX = PERF_REG_X86_YMMH15 + 2,
+
+ /*
+ * ZMM0-15 upper bits need four bits set because they are 256 bits
+ * PERF_REG_X86_ZMMH0 = 96
+ */
+ PERF_REG_X86_ZMMH0 = PERF_REG_X86_YMMH_MAX,
+ PERF_REG_X86_ZMMH1 = PERF_REG_X86_ZMMH0 + 4,
+ PERF_REG_X86_ZMMH2 = PERF_REG_X86_ZMMH1 + 4,
+ PERF_REG_X86_ZMMH3 = PERF_REG_X86_ZMMH2 + 4,
+ PERF_REG_X86_ZMMH4 = PERF_REG_X86_ZMMH3 + 4,
+ PERF_REG_X86_ZMMH5 = PERF_REG_X86_ZMMH4 + 4,
+ PERF_REG_X86_ZMMH6 = PERF_REG_X86_ZMMH5 + 4,
+ PERF_REG_X86_ZMMH7 = PERF_REG_X86_ZMMH6 + 4,
+ PERF_REG_X86_ZMMH8 = PERF_REG_X86_ZMMH7 + 4,
+ PERF_REG_X86_ZMMH9 = PERF_REG_X86_ZMMH8 + 4,
+ PERF_REG_X86_ZMMH10 = PERF_REG_X86_ZMMH9 + 4,
+ PERF_REG_X86_ZMMH11 = PERF_REG_X86_ZMMH10 + 4,
+ PERF_REG_X86_ZMMH12 = PERF_REG_X86_ZMMH11 + 4,
+ PERF_REG_X86_ZMMH13 = PERF_REG_X86_ZMMH12 + 4,
+ PERF_REG_X86_ZMMH14 = PERF_REG_X86_ZMMH13 + 4,
+ PERF_REG_X86_ZMMH15 = PERF_REG_X86_ZMMH14 + 4,
+ PERF_REG_X86_ZMMH_MAX = PERF_REG_X86_ZMMH15 + 4,
+
+ /*
+ * ZMM16-31 need eight bits set because they are 512 bits
+ * PERF_REG_X86_ZMM16 = 160
+ */
+ PERF_REG_X86_ZMM16 = PERF_REG_X86_ZMMH_MAX,
+ PERF_REG_X86_ZMM17 = PERF_REG_X86_ZMM16 + 8,
+ PERF_REG_X86_ZMM18 = PERF_REG_X86_ZMM17 + 8,
+ PERF_REG_X86_ZMM19 = PERF_REG_X86_ZMM18 + 8,
+ PERF_REG_X86_ZMM20 = PERF_REG_X86_ZMM19 + 8,
+ PERF_REG_X86_ZMM21 = PERF_REG_X86_ZMM20 + 8,
+ PERF_REG_X86_ZMM22 = PERF_REG_X86_ZMM21 + 8,
+ PERF_REG_X86_ZMM23 = PERF_REG_X86_ZMM22 + 8,
+ PERF_REG_X86_ZMM24 = PERF_REG_X86_ZMM23 + 8,
+ PERF_REG_X86_ZMM25 = PERF_REG_X86_ZMM24 + 8,
+ PERF_REG_X86_ZMM26 = PERF_REG_X86_ZMM25 + 8,
+ PERF_REG_X86_ZMM27 = PERF_REG_X86_ZMM26 + 8,
+ PERF_REG_X86_ZMM28 = PERF_REG_X86_ZMM27 + 8,
+ PERF_REG_X86_ZMM29 = PERF_REG_X86_ZMM28 + 8,
+ PERF_REG_X86_ZMM30 = PERF_REG_X86_ZMM29 + 8,
+ PERF_REG_X86_ZMM31 = PERF_REG_X86_ZMM30 + 8,
+ PERF_REG_X86_ZMM_MAX = PERF_REG_X86_ZMM31 + 8,
+
+ /*
+ * OPMASK Registers
+ * PERF_REG_X86_OPMASK0 = 288
+ */
+ PERF_REG_X86_OPMASK0 = PERF_REG_X86_ZMM_MAX,
+ PERF_REG_X86_OPMASK1 = PERF_REG_X86_OPMASK0 + 1,
+ PERF_REG_X86_OPMASK2 = PERF_REG_X86_OPMASK1 + 1,
+ PERF_REG_X86_OPMASK3 = PERF_REG_X86_OPMASK2 + 1,
+ PERF_REG_X86_OPMASK4 = PERF_REG_X86_OPMASK3 + 1,
+ PERF_REG_X86_OPMASK5 = PERF_REG_X86_OPMASK4 + 1,
+ PERF_REG_X86_OPMASK6 = PERF_REG_X86_OPMASK5 + 1,
+ PERF_REG_X86_OPMASK7 = PERF_REG_X86_OPMASK6 + 1,
+
+ PERF_REG_X86_VEC_MAX = PERF_REG_X86_OPMASK7 + 1,
};
#define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
index 5b163f0a651a..1902a715efa6 100644
--- a/tools/perf/arch/x86/util/perf_regs.c
+++ b/tools/perf/arch/x86/util/perf_regs.c
@@ -54,6 +54,67 @@ static const struct sample_reg sample_reg_masks[] = {
SMPL_REG2(XMM13, PERF_REG_X86_XMM13),
SMPL_REG2(XMM14, PERF_REG_X86_XMM14),
SMPL_REG2(XMM15, PERF_REG_X86_XMM15),
+
+ SMPL_REG2_EXT(YMMH0, PERF_REG_X86_YMMH0),
+ SMPL_REG2_EXT(YMMH1, PERF_REG_X86_YMMH1),
+ SMPL_REG2_EXT(YMMH2, PERF_REG_X86_YMMH2),
+ SMPL_REG2_EXT(YMMH3, PERF_REG_X86_YMMH3),
+ SMPL_REG2_EXT(YMMH4, PERF_REG_X86_YMMH4),
+ SMPL_REG2_EXT(YMMH5, PERF_REG_X86_YMMH5),
+ SMPL_REG2_EXT(YMMH6, PERF_REG_X86_YMMH6),
+ SMPL_REG2_EXT(YMMH7, PERF_REG_X86_YMMH7),
+ SMPL_REG2_EXT(YMMH8, PERF_REG_X86_YMMH8),
+ SMPL_REG2_EXT(YMMH9, PERF_REG_X86_YMMH9),
+ SMPL_REG2_EXT(YMMH10, PERF_REG_X86_YMMH10),
+ SMPL_REG2_EXT(YMMH11, PERF_REG_X86_YMMH11),
+ SMPL_REG2_EXT(YMMH12, PERF_REG_X86_YMMH12),
+ SMPL_REG2_EXT(YMMH13, PERF_REG_X86_YMMH13),
+ SMPL_REG2_EXT(YMMH14, PERF_REG_X86_YMMH14),
+ SMPL_REG2_EXT(YMMH15, PERF_REG_X86_YMMH15),
+
+ SMPL_REG4_EXT(ZMMH0, PERF_REG_X86_ZMMH0),
+ SMPL_REG4_EXT(ZMMH1, PERF_REG_X86_ZMMH1),
+ SMPL_REG4_EXT(ZMMH2, PERF_REG_X86_ZMMH2),
+ SMPL_REG4_EXT(ZMMH3, PERF_REG_X86_ZMMH3),
+ SMPL_REG4_EXT(ZMMH4, PERF_REG_X86_ZMMH4),
+ SMPL_REG4_EXT(ZMMH5, PERF_REG_X86_ZMMH5),
+ SMPL_REG4_EXT(ZMMH6, PERF_REG_X86_ZMMH6),
+ SMPL_REG4_EXT(ZMMH7, PERF_REG_X86_ZMMH7),
+ SMPL_REG4_EXT(ZMMH8, PERF_REG_X86_ZMMH8),
+ SMPL_REG4_EXT(ZMMH9, PERF_REG_X86_ZMMH9),
+ SMPL_REG4_EXT(ZMMH10, PERF_REG_X86_ZMMH10),
+ SMPL_REG4_EXT(ZMMH11, PERF_REG_X86_ZMMH11),
+ SMPL_REG4_EXT(ZMMH12, PERF_REG_X86_ZMMH12),
+ SMPL_REG4_EXT(ZMMH13, PERF_REG_X86_ZMMH13),
+ SMPL_REG4_EXT(ZMMH14, PERF_REG_X86_ZMMH14),
+ SMPL_REG4_EXT(ZMMH15, PERF_REG_X86_ZMMH15),
+
+ SMPL_REG8_EXT(ZMM16, PERF_REG_X86_ZMM16),
+ SMPL_REG8_EXT(ZMM17, PERF_REG_X86_ZMM17),
+ SMPL_REG8_EXT(ZMM18, PERF_REG_X86_ZMM18),
+ SMPL_REG8_EXT(ZMM19, PERF_REG_X86_ZMM19),
+ SMPL_REG8_EXT(ZMM20, PERF_REG_X86_ZMM20),
+ SMPL_REG8_EXT(ZMM21, PERF_REG_X86_ZMM21),
+ SMPL_REG8_EXT(ZMM22, PERF_REG_X86_ZMM22),
+ SMPL_REG8_EXT(ZMM23, PERF_REG_X86_ZMM23),
+ SMPL_REG8_EXT(ZMM24, PERF_REG_X86_ZMM24),
+ SMPL_REG8_EXT(ZMM25, PERF_REG_X86_ZMM25),
+ SMPL_REG8_EXT(ZMM26, PERF_REG_X86_ZMM26),
+ SMPL_REG8_EXT(ZMM27, PERF_REG_X86_ZMM27),
+ SMPL_REG8_EXT(ZMM28, PERF_REG_X86_ZMM28),
+ SMPL_REG8_EXT(ZMM29, PERF_REG_X86_ZMM29),
+ SMPL_REG8_EXT(ZMM30, PERF_REG_X86_ZMM30),
+ SMPL_REG8_EXT(ZMM31, PERF_REG_X86_ZMM31),
+
+ SMPL_REG_EXT(OPMASK0, PERF_REG_X86_OPMASK0),
+ SMPL_REG_EXT(OPMASK1, PERF_REG_X86_OPMASK1),
+ SMPL_REG_EXT(OPMASK2, PERF_REG_X86_OPMASK2),
+ SMPL_REG_EXT(OPMASK3, PERF_REG_X86_OPMASK3),
+ SMPL_REG_EXT(OPMASK4, PERF_REG_X86_OPMASK4),
+ SMPL_REG_EXT(OPMASK5, PERF_REG_X86_OPMASK5),
+ SMPL_REG_EXT(OPMASK6, PERF_REG_X86_OPMASK6),
+ SMPL_REG_EXT(OPMASK7, PERF_REG_X86_OPMASK7),
+
SMPL_REG_END
};
@@ -283,6 +344,32 @@ const struct sample_reg *arch__sample_reg_masks(void)
return sample_reg_masks;
}
+static void check_intr_reg_ext_mask(struct perf_event_attr *attr, int idx,
+ u64 fmask, unsigned long *mask)
+{
+ u64 src_mask[PERF_SAMPLE_ARRAY_SIZE] = { 0 };
+ int fd;
+
+ attr->sample_regs_intr = 0;
+ attr->sample_regs_intr_ext[idx] = fmask;
+ src_mask[idx + 1] = fmask;
+
+ fd = sys_perf_event_open(attr, 0, -1, -1, 0);
+ if (fd != -1) {
+ close(fd);
+ bitmap_or(mask, mask, (unsigned long *)src_mask,
+ PERF_SAMPLE_REGS_NUM);
+ }
+}
+
+#define PERF_REG_EXTENDED_YMMH_MASK GENMASK_ULL(31, 0)
+#define PERF_REG_EXTENDED_ZMMH_1ST_MASK GENMASK_ULL(63, 32)
+#define PERF_REG_EXTENDED_ZMMH_2ND_MASK GENMASK_ULL(31, 0)
+#define PERF_REG_EXTENDED_ZMM_1ST_MASK GENMASK_ULL(63, 32)
+#define PERF_REG_EXTENDED_ZMM_2ND_MASK GENMASK_ULL(63, 0)
+#define PERF_REG_EXTENDED_ZMM_3RD_MASK GENMASK_ULL(31, 0)
+#define PERF_REG_EXTENDED_OPMASK_MASK GENMASK_ULL(39, 32)
+
void arch__intr_reg_mask(unsigned long *mask)
{
struct perf_event_attr attr = {
@@ -325,6 +412,18 @@ void arch__intr_reg_mask(unsigned long *mask)
close(fd);
*(u64 *)mask = PERF_REG_EXTENDED_MASK | PERF_REGS_MASK;
}
+
+ /* Check YMMH regs */
+ check_intr_reg_ext_mask(&attr, 0, PERF_REG_EXTENDED_YMMH_MASK, mask);
+ /* Check ZMMH0-15 regs */
+ check_intr_reg_ext_mask(&attr, 0, PERF_REG_EXTENDED_ZMMH_1ST_MASK, mask);
+ check_intr_reg_ext_mask(&attr, 1, PERF_REG_EXTENDED_ZMMH_2ND_MASK, mask);
+ /* Check ZMM16-31 regs */
+ check_intr_reg_ext_mask(&attr, 1, PERF_REG_EXTENDED_ZMM_1ST_MASK, mask);
+ check_intr_reg_ext_mask(&attr, 2, PERF_REG_EXTENDED_ZMM_2ND_MASK, mask);
+ check_intr_reg_ext_mask(&attr, 3, PERF_REG_EXTENDED_ZMM_3RD_MASK, mask);
+ /* Check OPMASK regs */
+ check_intr_reg_ext_mask(&attr, 3, PERF_REG_EXTENDED_OPMASK_MASK, mask);
}
void arch__user_reg_mask(unsigned long *mask)
diff --git a/tools/perf/util/perf-regs-arch/perf_regs_x86.c b/tools/perf/util/perf-regs-arch/perf_regs_x86.c
index 9a909f02bc04..c926046ebddc 100644
--- a/tools/perf/util/perf-regs-arch/perf_regs_x86.c
+++ b/tools/perf/util/perf-regs-arch/perf_regs_x86.c
@@ -78,6 +78,94 @@ const char *__perf_reg_name_x86(int id)
XMM(14)
XMM(15)
#undef XMM
+
+#define YMMH(x) \
+ case PERF_REG_X86_YMMH ## x: \
+ case PERF_REG_X86_YMMH ## x + 1: \
+ return "YMMH" #x;
+ YMMH(0)
+ YMMH(1)
+ YMMH(2)
+ YMMH(3)
+ YMMH(4)
+ YMMH(5)
+ YMMH(6)
+ YMMH(7)
+ YMMH(8)
+ YMMH(9)
+ YMMH(10)
+ YMMH(11)
+ YMMH(12)
+ YMMH(13)
+ YMMH(14)
+ YMMH(15)
+#undef YMMH
+
+#define ZMMH(x) \
+ case PERF_REG_X86_ZMMH ## x: \
+ case PERF_REG_X86_ZMMH ## x + 1: \
+ case PERF_REG_X86_ZMMH ## x + 2: \
+ case PERF_REG_X86_ZMMH ## x + 3: \
+ return "ZMMH" #x;
+ ZMMH(0)
+ ZMMH(1)
+ ZMMH(2)
+ ZMMH(3)
+ ZMMH(4)
+ ZMMH(5)
+ ZMMH(6)
+ ZMMH(7)
+ ZMMH(8)
+ ZMMH(9)
+ ZMMH(10)
+ ZMMH(11)
+ ZMMH(12)
+ ZMMH(13)
+ ZMMH(14)
+ ZMMH(15)
+#undef ZMMH
+
+#define ZMM(x) \
+ case PERF_REG_X86_ZMM ## x: \
+ case PERF_REG_X86_ZMM ## x + 1: \
+ case PERF_REG_X86_ZMM ## x + 2: \
+ case PERF_REG_X86_ZMM ## x + 3: \
+ case PERF_REG_X86_ZMM ## x + 4: \
+ case PERF_REG_X86_ZMM ## x + 5: \
+ case PERF_REG_X86_ZMM ## x + 6: \
+ case PERF_REG_X86_ZMM ## x + 7: \
+ return "ZMM" #x;
+ ZMM(16)
+ ZMM(17)
+ ZMM(18)
+ ZMM(19)
+ ZMM(20)
+ ZMM(21)
+ ZMM(22)
+ ZMM(23)
+ ZMM(24)
+ ZMM(25)
+ ZMM(26)
+ ZMM(27)
+ ZMM(28)
+ ZMM(29)
+ ZMM(30)
+ ZMM(31)
+#undef ZMM
+
+#define OPMASK(x) \
+ case PERF_REG_X86_OPMASK ## x: \
+ return "opmask" #x;
+
+ OPMASK(0)
+ OPMASK(1)
+ OPMASK(2)
+ OPMASK(3)
+ OPMASK(4)
+ OPMASK(5)
+ OPMASK(6)
+ OPMASK(7)
+#undef OPMASK
default:
return NULL;
}
--
2.40.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [Patch v2 23/24] perf tools/tests: Add vector registers PEBS sampling test
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (21 preceding siblings ...)
2025-02-18 15:28 ` [Patch v2 22/24] perf tools: Support to capture more vector registers (x86/Intel) Dapeng Mi
@ 2025-02-18 15:28 ` Dapeng Mi
2025-02-18 15:28 ` [Patch v2 24/24] perf tools: Fix incorrect --user-regs comments Dapeng Mi
23 siblings, 0 replies; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
The current adaptive PEBS supports capturing some vector registers such as
the XMM registers, and arch-PEBS supports capturing wider vector registers
such as the YMM and ZMM registers. This patch adds a perf test case to
verify that these vector registers can be captured correctly.
Suggested-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
tools/perf/tests/shell/record.sh | 55 ++++++++++++++++++++++++++++++++
1 file changed, 55 insertions(+)
diff --git a/tools/perf/tests/shell/record.sh b/tools/perf/tests/shell/record.sh
index 0fc7a909ae9b..521eaa1972f9 100755
--- a/tools/perf/tests/shell/record.sh
+++ b/tools/perf/tests/shell/record.sh
@@ -116,6 +116,60 @@ test_register_capture() {
echo "Register capture test [Success]"
}
+test_vec_register_capture() {
+ echo "Vector register capture test"
+ if ! perf record -o /dev/null --quiet -e instructions:p true 2> /dev/null
+ then
+ echo "Vector register capture test [Skipped missing event]"
+ return
+ fi
+ if ! perf record --intr-regs=\? 2>&1 | grep -q 'XMM0'
+ then
+ echo "Vector register capture test [Skipped missing XMM registers]"
+ return
+ fi
+ if ! perf record -o - --intr-regs=xmm0 -e instructions:p \
+ -c 100000 ${testprog} 2> /dev/null \
+ | perf script -F ip,sym,iregs -i - 2> /dev/null \
+ | grep -q "XMM0:"
+ then
+ echo "Vector register capture test [Failed missing XMM output]"
+ err=1
+ return
+ fi
+ echo "Vector register (XMM) capture test [Success]"
+ if ! perf record --intr-regs=\? 2>&1 | grep -q 'YMMH0'
+ then
+ echo "Vector register capture test [Skipped missing YMM registers]"
+ return
+ fi
+ if ! perf record -o - --intr-regs=ymmh0 -e instructions:p \
+ -c 100000 ${testprog} 2> /dev/null \
+ | perf script -F ip,sym,iregs -i - 2> /dev/null \
+ | grep -q "YMMH0:"
+ then
+ echo "Vector register capture test [Failed missing YMMH output]"
+ err=1
+ return
+ fi
+ echo "Vector register (YMM) capture test [Success]"
+ if ! perf record --intr-regs=\? 2>&1 | grep -q 'ZMMH0'
+ then
+ echo "Vector register capture test [Skipped missing ZMM registers]"
+ return
+ fi
+ if ! perf record -o - --intr-regs=zmmh0 -e instructions:p \
+ -c 100000 ${testprog} 2> /dev/null \
+ | perf script -F ip,sym,iregs -i - 2> /dev/null \
+ | grep -q "ZMMH0:"
+ then
+ echo "Vector register capture test [Failed missing ZMMH output]"
+ err=1
+ return
+ fi
+ echo "Vector register (ZMM) capture test [Success]"
+}
+
test_system_wide() {
echo "Basic --system-wide mode test"
if ! perf record -aB --synth=no -o "${perfdata}" ${testprog} 2> /dev/null
@@ -303,6 +357,7 @@ fi
test_per_thread
test_register_capture
+test_vec_register_capture
test_system_wide
test_workload
test_branch_counter
--
2.40.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [Patch v2 24/24] perf tools: Fix incorrect --user-regs comments
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (22 preceding siblings ...)
2025-02-18 15:28 ` [Patch v2 23/24] perf tools/tests: Add vector registers PEBS sampling test Dapeng Mi
@ 2025-02-18 15:28 ` Dapeng Mi
23 siblings, 0 replies; 58+ messages in thread
From: Dapeng Mi @ 2025-02-18 15:28 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
The comment for the "--user-regs" option is not correct; fix it:
"on interrupt," -> "on user space,"
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
tools/perf/builtin-record.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/perf/builtin-record.c b/tools/perf/builtin-record.c
index 5db1aedf48df..c3b1ea2d2eae 100644
--- a/tools/perf/builtin-record.c
+++ b/tools/perf/builtin-record.c
@@ -3471,7 +3471,7 @@ static struct option __record_options[] = {
"sample selected machine registers on interrupt,"
" use '-I?' to list register names", parse_intr_regs),
OPT_CALLBACK_OPTARG(0, "user-regs", &record.opts.sample_user_regs, NULL, "any register",
- "sample selected machine registers on interrupt,"
+ "sample selected machine registers on user space,"
" use '--user-regs=?' to list register names", parse_user_regs),
OPT_BOOLEAN(0, "running-time", &record.opts.running_time,
"Record running/enabled time of read (:S) events"),
--
2.40.1
^ permalink raw reply related [flat|nested] 58+ messages in thread
* Re: [Patch v2 10/24] perf/x86/intel: Process arch-PEBS records or record fragments
2025-02-18 15:28 ` [Patch v2 10/24] perf/x86/intel: Process arch-PEBS records or record fragments Dapeng Mi
@ 2025-02-25 10:39 ` Peter Zijlstra
2025-02-25 11:00 ` Peter Zijlstra
` (2 more replies)
0 siblings, 3 replies; 58+ messages in thread
From: Peter Zijlstra @ 2025-02-25 10:39 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Tue, Feb 18, 2025 at 03:28:04PM +0000, Dapeng Mi wrote:
> A significant difference with adaptive PEBS is that arch-PEBS record
> supports fragments which means an arch-PEBS record could be split into
> several independent fragments which have its own arch-PEBS header in
> each fragment.
>
> This patch defines architectural PEBS record layout structures and add
> helpers to process arch-PEBS records or fragments. Only legacy PEBS
> groups like basic, GPR, XMM and LBR groups are supported in this patch,
> the new added YMM/ZMM/OPMASK vector registers capturing would be
> supported in subsequent patches.
>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
> arch/x86/events/intel/core.c | 9 ++
> arch/x86/events/intel/ds.c | 219 ++++++++++++++++++++++++++++++
> arch/x86/include/asm/msr-index.h | 6 +
> arch/x86/include/asm/perf_event.h | 100 ++++++++++++++
> 4 files changed, 334 insertions(+)
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 37540eb80029..184f69afde08 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -3124,6 +3124,15 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
> wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
> }
>
> + /*
> + * Arch PEBS sets bit 54 in the global status register
> + */
> + if (__test_and_clear_bit(GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT,
> + (unsigned long *)&status)) {
Will arch_pebs hardware ever toggle bit 62?
> + handled++;
> + x86_pmu.drain_pebs(regs, &data);
static_call(x86_pmu_drain_pebs)(regs, &data);
> + }
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 10/24] perf/x86/intel: Process arch-PEBS records or record fragments
2025-02-25 10:39 ` Peter Zijlstra
@ 2025-02-25 11:00 ` Peter Zijlstra
2025-02-26 5:20 ` Mi, Dapeng
2025-02-25 20:42 ` Andi Kleen
2025-02-26 2:54 ` Mi, Dapeng
2 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2025-02-25 11:00 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Tue, Feb 25, 2025 at 11:39:27AM +0100, Peter Zijlstra wrote:
> On Tue, Feb 18, 2025 at 03:28:04PM +0000, Dapeng Mi wrote:
> > A significant difference with adaptive PEBS is that arch-PEBS record
> > supports fragments which means an arch-PEBS record could be split into
> > several independent fragments which have its own arch-PEBS header in
> > each fragment.
> >
> > This patch defines architectural PEBS record layout structures and add
> > helpers to process arch-PEBS records or fragments. Only legacy PEBS
> > groups like basic, GPR, XMM and LBR groups are supported in this patch,
> > the new added YMM/ZMM/OPMASK vector registers capturing would be
> > supported in subsequent patches.
> >
> > Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> > ---
> > arch/x86/events/intel/core.c | 9 ++
> > arch/x86/events/intel/ds.c | 219 ++++++++++++++++++++++++++++++
> > arch/x86/include/asm/msr-index.h | 6 +
> > arch/x86/include/asm/perf_event.h | 100 ++++++++++++++
> > 4 files changed, 334 insertions(+)
> >
> > diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> > index 37540eb80029..184f69afde08 100644
> > --- a/arch/x86/events/intel/core.c
> > +++ b/arch/x86/events/intel/core.c
> > @@ -3124,6 +3124,15 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
> > wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
> > }
> >
> > + /*
> > + * Arch PEBS sets bit 54 in the global status register
> > + */
> > + if (__test_and_clear_bit(GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT,
> > + (unsigned long *)&status)) {
>
> Will arch_pebs hardware ever toggle bit 62?
This had me looking at the bit 62 handling, and I noticed the thing from
commit 8077eca079a2 ("perf/x86/pebs: Add workaround for broken OVFL
status on HSW+").
Did that ever get fixed in later chips; notably I'm assuming ARCH PEBS
does not suffer this?
Also, should that workaround have been extended to also include
GLOBAL_STATUS_PERF_METRICS_OVF in that mask, or was that defect fixed
for every chip capable of metrics stuff?
In any case, I think we want a patch clarifying the situation with a
comment.
> > + handled++;
> > + x86_pmu.drain_pebs(regs, &data);
>
> static_call(x86_pmu_drain_pebs)(regs, &data);
>
> > + }
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 11/24] perf/x86/intel: Factor out common functions to process PEBS groups
2025-02-18 15:28 ` [Patch v2 11/24] perf/x86/intel: Factor out common functions to process PEBS groups Dapeng Mi
@ 2025-02-25 11:02 ` Peter Zijlstra
2025-02-26 5:24 ` Mi, Dapeng
0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2025-02-25 11:02 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Tue, Feb 18, 2025 at 03:28:05PM +0000, Dapeng Mi wrote:
> Adaptive PEBS and arch-PEBS share lots of same code to process these
> PEBS groups, like basic, GPR and meminfo groups. Extract these shared
> code to common functions to avoid duplicated code.
Should you not flip this and the previous patch? Because afaict you're
mostly removing the code you just added, which is a bit silly.
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 12/24] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
2025-02-18 15:28 ` [Patch v2 12/24] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR Dapeng Mi
@ 2025-02-25 11:18 ` Peter Zijlstra
2025-02-26 5:48 ` Mi, Dapeng
2025-02-25 11:25 ` Peter Zijlstra
1 sibling, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2025-02-25 11:18 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Tue, Feb 18, 2025 at 03:28:06PM +0000, Dapeng Mi wrote:
> Arch-PEBS introduces a new MSR IA32_PEBS_BASE to store the arch-PEBS
> buffer physical address. This patch allocates arch-PEBS buffer and then
> initialize IA32_PEBS_BASE MSR with the buffer physical address.
Not loving how this patch obscures the whole DS area thing and naming.
> @@ -624,13 +604,18 @@ static int alloc_pebs_buffer(int cpu)
> int max, node = cpu_to_node(cpu);
> void *buffer, *insn_buff, *cea;
>
> - if (!x86_pmu.ds_pebs)
> + if (!intel_pmu_has_pebs())
> return 0;
>
> - buffer = dsalloc_pages(bsiz, GFP_KERNEL, cpu);
> + buffer = dsalloc_pages(bsiz, preemptible() ? GFP_KERNEL : GFP_ATOMIC, cpu);
But this plain smells bad, what is this about?
> if (unlikely(!buffer))
> return -ENOMEM;
>
> + if (x86_pmu.arch_pebs) {
> + hwev->pebs_vaddr = buffer;
> + return 0;
> + }
> +
> /*
> * HSW+ already provides us the eventing ip; no need to allocate this
> * buffer then.
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 12/24] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
2025-02-18 15:28 ` [Patch v2 12/24] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR Dapeng Mi
2025-02-25 11:18 ` Peter Zijlstra
@ 2025-02-25 11:25 ` Peter Zijlstra
2025-02-26 6:19 ` Mi, Dapeng
1 sibling, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2025-02-25 11:25 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Tue, Feb 18, 2025 at 03:28:06PM +0000, Dapeng Mi wrote:
> Arch-PEBS introduces a new MSR IA32_PEBS_BASE to store the arch-PEBS
> buffer physical address. This patch allocates arch-PEBS buffer and then
> initialize IA32_PEBS_BASE MSR with the buffer physical address.
Just to clarify, parts with ARCH PEBS will not have BTS and thus not
have DS?
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 15/24] perf/x86/intel: Add SSP register support for arch-PEBS
2025-02-18 15:28 ` [Patch v2 15/24] perf/x86/intel: Add SSP register support for arch-PEBS Dapeng Mi
@ 2025-02-25 11:52 ` Peter Zijlstra
2025-02-26 6:56 ` Mi, Dapeng
2025-02-25 11:54 ` Peter Zijlstra
1 sibling, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2025-02-25 11:52 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Tue, Feb 18, 2025 at 03:28:09PM +0000, Dapeng Mi wrote:
> + if (unlikely(event->attr.sample_regs_intr & BIT_ULL(PERF_REG_X86_SSP))) {
> + /* Only arch-PEBS supports to capture SSP register. */
> + if (!x86_pmu.arch_pebs || !event->attr.precise_ip)
> + return -EINVAL;
> + }
> @@ -27,9 +27,11 @@ enum perf_event_x86_regs {
> PERF_REG_X86_R13,
> PERF_REG_X86_R14,
> PERF_REG_X86_R15,
> + /* Shadow stack pointer (SSP) present on Clearwater Forest and newer models. */
> + PERF_REG_X86_SSP,
The first comment makes more sense. Nobody knows or cares what a
clearwater forest is, but ARCH-PEBS is something you can check.
Also, this hard implies that anything exposing ARCH-PEBS exposes
CET-SS. Does virt complicate this?
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 15/24] perf/x86/intel: Add SSP register support for arch-PEBS
2025-02-18 15:28 ` [Patch v2 15/24] perf/x86/intel: Add SSP register support for arch-PEBS Dapeng Mi
2025-02-25 11:52 ` Peter Zijlstra
@ 2025-02-25 11:54 ` Peter Zijlstra
2025-02-25 20:44 ` Andi Kleen
1 sibling, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2025-02-25 11:54 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Tue, Feb 18, 2025 at 03:28:09PM +0000, Dapeng Mi wrote:
> @@ -651,6 +651,16 @@ int x86_pmu_hw_config(struct perf_event *event)
> return -EINVAL;
> }
>
> + /* sample_regs_user never support SSP register. */
> + if (unlikely(event->attr.sample_regs_user & BIT_ULL(PERF_REG_X86_SSP)))
> + return -EINVAL;
We can easily enough read user SSP, no?
Should be possible even on today's machines; MSR_IA32_PL3_SSP provides.
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 18/24] perf/x86/intel: Support arch-PEBS vector registers group capturing
2025-02-18 15:28 ` [Patch v2 18/24] perf/x86/intel: Support arch-PEBS vector registers group capturing Dapeng Mi
@ 2025-02-25 15:32 ` Peter Zijlstra
2025-02-26 8:08 ` Mi, Dapeng
0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2025-02-25 15:32 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Tue, Feb 18, 2025 at 03:28:12PM +0000, Dapeng Mi wrote:
> Add x86/intel specific vector register (VECR) group capturing for
> arch-PEBS. Enable corresponding VECR group bits in
> GPx_CFG_C/FX0_CFG_C MSRs if users configures these vector registers
> bitmap in perf_event_attr and parse VECR group in arch-PEBS record.
>
> Currently vector registers capturing is only supported by PEBS based
> sampling, PMU driver would return error if PMI based sampling tries to
> capture these vector registers.
> @@ -676,6 +709,32 @@ int x86_pmu_hw_config(struct perf_event *event)
> return -EINVAL;
> }
>
> + /*
> + * Architectural PEBS supports to capture more vector registers besides
> + * XMM registers, like YMM, OPMASK and ZMM registers.
> + */
> + if (unlikely(has_more_extended_regs(event))) {
> + u64 caps = hybrid(event->pmu, arch_pebs_cap).caps;
> +
> + if (!(event->pmu->capabilities & PERF_PMU_CAP_MORE_EXT_REGS))
> + return -EINVAL;
> +
> + if (has_opmask_regs(event) && !(caps & ARCH_PEBS_VECR_OPMASK))
> + return -EINVAL;
> +
> + if (has_ymmh_regs(event) && !(caps & ARCH_PEBS_VECR_YMM))
> + return -EINVAL;
> +
> + if (has_zmmh_regs(event) && !(caps & ARCH_PEBS_VECR_ZMMH))
> + return -EINVAL;
> +
> + if (has_h16zmm_regs(event) && !(caps & ARCH_PEBS_VECR_H16ZMM))
> + return -EINVAL;
> +
> + if (!event->attr.precise_ip)
> + return -EINVAL;
> + }
> +
> return x86_setup_perfctr(event);
> }
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index f21d9f283445..8ef5b9a05fcc 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2963,6 +2963,18 @@ static void intel_pmu_enable_event_ext(struct perf_event *event)
> if (pebs_data_cfg & PEBS_DATACFG_XMMS)
> ext |= ARCH_PEBS_VECR_XMM & cap.caps;
>
> + if (pebs_data_cfg & PEBS_DATACFG_YMMS)
> + ext |= ARCH_PEBS_VECR_YMM & cap.caps;
> +
> + if (pebs_data_cfg & PEBS_DATACFG_OPMASKS)
> + ext |= ARCH_PEBS_VECR_OPMASK & cap.caps;
> +
> + if (pebs_data_cfg & PEBS_DATACFG_ZMMHS)
> + ext |= ARCH_PEBS_VECR_ZMMH & cap.caps;
> +
> + if (pebs_data_cfg & PEBS_DATACFG_H16ZMMS)
> + ext |= ARCH_PEBS_VECR_H16ZMM & cap.caps;
> +
> if (pebs_data_cfg & PEBS_DATACFG_LBRS)
> ext |= ARCH_PEBS_LBR & cap.caps;
>
> @@ -5115,6 +5127,9 @@ static inline void __intel_update_pmu_caps(struct pmu *pmu)
>
> if (hybrid(pmu, arch_pebs_cap).caps & ARCH_PEBS_VECR_XMM)
> dest_pmu->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
> +
> + if (hybrid(pmu, arch_pebs_cap).caps & ARCH_PEBS_VECR_EXT)
> + dest_pmu->capabilities |= PERF_PMU_CAP_MORE_EXT_REGS;
> }
There is no technical reason for it to error out, right? We can use
FPU/XSAVE interface to read the CPU state just fine.
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 4b01beee15f4..7e5a4202de37 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -1437,9 +1438,37 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
> if (gprs || (attr->precise_ip < 2) || tsx_weight)
> pebs_data_cfg |= PEBS_DATACFG_GP;
>
> - if ((sample_type & PERF_SAMPLE_REGS_INTR) &&
> - (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK))
> - pebs_data_cfg |= PEBS_DATACFG_XMMS;
> + if (sample_type & PERF_SAMPLE_REGS_INTR) {
> + if (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK)
> + pebs_data_cfg |= PEBS_DATACFG_XMMS;
> +
> + for_each_set_bit_from(bit,
> + (unsigned long *)event->attr.sample_regs_intr_ext,
> + PERF_NUM_EXT_REGS) {
This is indented wrong; please use cino=(0:0
if you worry about indentation depth, break out in helper function.
> + switch (bit + PERF_REG_EXTENDED_OFFSET) {
> + case PERF_REG_X86_OPMASK0 ... PERF_REG_X86_OPMASK7:
> + pebs_data_cfg |= PEBS_DATACFG_OPMASKS;
> + bit = PERF_REG_X86_YMMH0 -
> + PERF_REG_EXTENDED_OFFSET - 1;
> + break;
> + case PERF_REG_X86_YMMH0 ... PERF_REG_X86_ZMMH0 - 1:
> + pebs_data_cfg |= PEBS_DATACFG_YMMS;
> + bit = PERF_REG_X86_ZMMH0 -
> + PERF_REG_EXTENDED_OFFSET - 1;
> + break;
> + case PERF_REG_X86_ZMMH0 ... PERF_REG_X86_ZMM16 - 1:
> + pebs_data_cfg |= PEBS_DATACFG_ZMMHS;
> + bit = PERF_REG_X86_ZMM16 -
> + PERF_REG_EXTENDED_OFFSET - 1;
> + break;
> + case PERF_REG_X86_ZMM16 ... PERF_REG_X86_ZMM_MAX - 1:
> + pebs_data_cfg |= PEBS_DATACFG_H16ZMMS;
> + bit = PERF_REG_X86_ZMM_MAX -
> + PERF_REG_EXTENDED_OFFSET - 1;
> + break;
> + }
> + }
> + }
>
> if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
> /*
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 17/24] perf/core: Support to capture higher width vector registers
2025-02-18 15:28 ` [Patch v2 17/24] perf/core: Support to capture higher width vector registers Dapeng Mi
@ 2025-02-25 20:32 ` Peter Zijlstra
2025-02-26 7:55 ` Mi, Dapeng
0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2025-02-25 20:32 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Tue, Feb 18, 2025 at 03:28:11PM +0000, Dapeng Mi wrote:
> diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
> index 9ee9e55aed09..3851f627ca60 100644
> --- a/arch/x86/include/uapi/asm/perf_regs.h
> +++ b/arch/x86/include/uapi/asm/perf_regs.h
> @@ -33,7 +33,7 @@ enum perf_event_x86_regs {
> PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
> PERF_REG_X86_64_MAX = PERF_REG_X86_SSP + 1,
>
> - /* These all need two bits set because they are 128bit */
> + /* These all need two bits set because they are 128 bits */
> PERF_REG_X86_XMM0 = 32,
> PERF_REG_X86_XMM1 = 34,
> PERF_REG_X86_XMM2 = 36,
> @@ -53,6 +53,87 @@ enum perf_event_x86_regs {
>
> /* These include both GPRs and XMMX registers */
> PERF_REG_X86_XMM_MAX = PERF_REG_X86_XMM15 + 2,
> +
> + /*
> + * YMM upper bits need two bits set because they are 128 bits.
> + * PERF_REG_X86_YMMH0 = 64
> + */
> + PERF_REG_X86_YMMH0 = PERF_REG_X86_XMM_MAX,
> + PERF_REG_X86_YMMH1 = PERF_REG_X86_YMMH0 + 2,
> + PERF_REG_X86_YMMH2 = PERF_REG_X86_YMMH1 + 2,
> + PERF_REG_X86_YMMH3 = PERF_REG_X86_YMMH2 + 2,
> + PERF_REG_X86_YMMH4 = PERF_REG_X86_YMMH3 + 2,
> + PERF_REG_X86_YMMH5 = PERF_REG_X86_YMMH4 + 2,
> + PERF_REG_X86_YMMH6 = PERF_REG_X86_YMMH5 + 2,
> + PERF_REG_X86_YMMH7 = PERF_REG_X86_YMMH6 + 2,
> + PERF_REG_X86_YMMH8 = PERF_REG_X86_YMMH7 + 2,
> + PERF_REG_X86_YMMH9 = PERF_REG_X86_YMMH8 + 2,
> + PERF_REG_X86_YMMH10 = PERF_REG_X86_YMMH9 + 2,
> + PERF_REG_X86_YMMH11 = PERF_REG_X86_YMMH10 + 2,
> + PERF_REG_X86_YMMH12 = PERF_REG_X86_YMMH11 + 2,
> + PERF_REG_X86_YMMH13 = PERF_REG_X86_YMMH12 + 2,
> + PERF_REG_X86_YMMH14 = PERF_REG_X86_YMMH13 + 2,
> + PERF_REG_X86_YMMH15 = PERF_REG_X86_YMMH14 + 2,
> + PERF_REG_X86_YMMH_MAX = PERF_REG_X86_YMMH15 + 2,
> +
> + /*
> + * ZMM0-15 upper bits need four bits set because they are 256 bits
> + * PERF_REG_X86_ZMMH0 = 96
> + */
> + PERF_REG_X86_ZMMH0 = PERF_REG_X86_YMMH_MAX,
> + PERF_REG_X86_ZMMH1 = PERF_REG_X86_ZMMH0 + 4,
> + PERF_REG_X86_ZMMH2 = PERF_REG_X86_ZMMH1 + 4,
> + PERF_REG_X86_ZMMH3 = PERF_REG_X86_ZMMH2 + 4,
> + PERF_REG_X86_ZMMH4 = PERF_REG_X86_ZMMH3 + 4,
> + PERF_REG_X86_ZMMH5 = PERF_REG_X86_ZMMH4 + 4,
> + PERF_REG_X86_ZMMH6 = PERF_REG_X86_ZMMH5 + 4,
> + PERF_REG_X86_ZMMH7 = PERF_REG_X86_ZMMH6 + 4,
> + PERF_REG_X86_ZMMH8 = PERF_REG_X86_ZMMH7 + 4,
> + PERF_REG_X86_ZMMH9 = PERF_REG_X86_ZMMH8 + 4,
> + PERF_REG_X86_ZMMH10 = PERF_REG_X86_ZMMH9 + 4,
> + PERF_REG_X86_ZMMH11 = PERF_REG_X86_ZMMH10 + 4,
> + PERF_REG_X86_ZMMH12 = PERF_REG_X86_ZMMH11 + 4,
> + PERF_REG_X86_ZMMH13 = PERF_REG_X86_ZMMH12 + 4,
> + PERF_REG_X86_ZMMH14 = PERF_REG_X86_ZMMH13 + 4,
> + PERF_REG_X86_ZMMH15 = PERF_REG_X86_ZMMH14 + 4,
> + PERF_REG_X86_ZMMH_MAX = PERF_REG_X86_ZMMH15 + 4,
> +
> + /*
> + * ZMM16-31 need eight bits set because they are 512 bits
> + * PERF_REG_X86_ZMM16 = 160
> + */
> + PERF_REG_X86_ZMM16 = PERF_REG_X86_ZMMH_MAX,
> + PERF_REG_X86_ZMM17 = PERF_REG_X86_ZMM16 + 8,
> + PERF_REG_X86_ZMM18 = PERF_REG_X86_ZMM17 + 8,
> + PERF_REG_X86_ZMM19 = PERF_REG_X86_ZMM18 + 8,
> + PERF_REG_X86_ZMM20 = PERF_REG_X86_ZMM19 + 8,
> + PERF_REG_X86_ZMM21 = PERF_REG_X86_ZMM20 + 8,
> + PERF_REG_X86_ZMM22 = PERF_REG_X86_ZMM21 + 8,
> + PERF_REG_X86_ZMM23 = PERF_REG_X86_ZMM22 + 8,
> + PERF_REG_X86_ZMM24 = PERF_REG_X86_ZMM23 + 8,
> + PERF_REG_X86_ZMM25 = PERF_REG_X86_ZMM24 + 8,
> + PERF_REG_X86_ZMM26 = PERF_REG_X86_ZMM25 + 8,
> + PERF_REG_X86_ZMM27 = PERF_REG_X86_ZMM26 + 8,
> + PERF_REG_X86_ZMM28 = PERF_REG_X86_ZMM27 + 8,
> + PERF_REG_X86_ZMM29 = PERF_REG_X86_ZMM28 + 8,
> + PERF_REG_X86_ZMM30 = PERF_REG_X86_ZMM29 + 8,
> + PERF_REG_X86_ZMM31 = PERF_REG_X86_ZMM30 + 8,
> + PERF_REG_X86_ZMM_MAX = PERF_REG_X86_ZMM31 + 8,
> +
> + /*
> + * OPMASK Registers
> + * PERF_REG_X86_OPMASK0 = 288
> + */
> + PERF_REG_X86_OPMASK0 = PERF_REG_X86_ZMM_MAX,
> + PERF_REG_X86_OPMASK1 = PERF_REG_X86_OPMASK0 + 1,
> + PERF_REG_X86_OPMASK2 = PERF_REG_X86_OPMASK1 + 1,
> + PERF_REG_X86_OPMASK3 = PERF_REG_X86_OPMASK2 + 1,
> + PERF_REG_X86_OPMASK4 = PERF_REG_X86_OPMASK3 + 1,
> + PERF_REG_X86_OPMASK5 = PERF_REG_X86_OPMASK4 + 1,
> + PERF_REG_X86_OPMASK6 = PERF_REG_X86_OPMASK5 + 1,
> + PERF_REG_X86_OPMASK7 = PERF_REG_X86_OPMASK6 + 1,
> +
> + PERF_REG_X86_VEC_MAX = PERF_REG_X86_OPMASK7 + 1,
> };
>
> #define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 0524d541d4e3..8a17d696d78c 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -379,6 +379,10 @@ enum perf_event_read_format {
> #define PERF_ATTR_SIZE_VER6 120 /* add: aux_sample_size */
> #define PERF_ATTR_SIZE_VER7 128 /* add: sig_data */
> #define PERF_ATTR_SIZE_VER8 136 /* add: config3 */
> +#define PERF_ATTR_SIZE_VER9 168 /* add: sample_regs_intr_ext[PERF_EXT_REGS_ARRAY_SIZE] */
> +
> +#define PERF_EXT_REGS_ARRAY_SIZE 4
> +#define PERF_NUM_EXT_REGS (PERF_EXT_REGS_ARRAY_SIZE * 64)
>
> /*
> * Hardware event_id to monitor via a performance monitoring event:
> @@ -531,6 +535,13 @@ struct perf_event_attr {
> __u64 sig_data;
>
> __u64 config3; /* extension of config2 */
> +
> + /*
> + * Extension sets of regs to dump for each sample.
> + * See asm/perf_regs.h for details.
> + */
> + __u64 sample_regs_intr_ext[PERF_EXT_REGS_ARRAY_SIZE];
> + __u64 sample_regs_user_ext[PERF_EXT_REGS_ARRAY_SIZE];
> };
>
> /*
*groan*... so do people really need per-register (or even partial
register) masks for all this?
Or can we perhaps -- like XSAVE/PEBS -- do it per register group?
Also, we're going to be getting EGPRs, which I think just about fit in
this 320 bit mask we now have, but it is quite insane.
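Purely as an illustration of the per-group alternative raised here (all names
below are hypothetical, not something this series defines), that could look
roughly like:
/*
 * Hypothetical sketch only: group bits instead of per-register bits,
 * along the lines of the XSAVE/PEBS group encoding mentioned above.
 */
enum perf_x86_ext_reg_group {
	PERF_X86_EXT_REGS_OPMASK	= 1 << 0,	/* K0-K7 */
	PERF_X86_EXT_REGS_YMMH		= 1 << 1,	/* upper halves of YMM0-15 */
	PERF_X86_EXT_REGS_ZMMH		= 1 << 2,	/* upper halves of ZMM0-15 */
	PERF_X86_EXT_REGS_H16ZMM	= 1 << 3,	/* ZMM16-31 */
	PERF_X86_EXT_REGS_EGPR		= 1 << 4,	/* future APX R16-R31 */
};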
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 10/24] perf/x86/intel: Process arch-PEBS records or record fragments
2025-02-25 10:39 ` Peter Zijlstra
2025-02-25 11:00 ` Peter Zijlstra
@ 2025-02-25 20:42 ` Andi Kleen
2025-02-26 2:54 ` Mi, Dapeng
2 siblings, 0 replies; 58+ messages in thread
From: Andi Kleen @ 2025-02-25 20:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Dapeng Mi, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Ian Rogers, Adrian Hunter, Alexander Shishkin, Kan Liang,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
> >
> > + /*
> > + * Arch PEBS sets bit 54 in the global status register
> > + */
> > + if (__test_and_clear_bit(GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT,
> > + (unsigned long *)&status)) {
>
> Will arch_pebs hardware ever toggle bit 62?
No it won't.
-Andi
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 15/24] perf/x86/intel: Add SSP register support for arch-PEBS
2025-02-25 11:54 ` Peter Zijlstra
@ 2025-02-25 20:44 ` Andi Kleen
2025-02-27 6:29 ` Mi, Dapeng
0 siblings, 1 reply; 58+ messages in thread
From: Andi Kleen @ 2025-02-25 20:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Dapeng Mi, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Ian Rogers, Adrian Hunter, Alexander Shishkin, Kan Liang,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Tue, Feb 25, 2025 at 12:54:50PM +0100, Peter Zijlstra wrote:
> On Tue, Feb 18, 2025 at 03:28:09PM +0000, Dapeng Mi wrote:
>
> > @@ -651,6 +651,16 @@ int x86_pmu_hw_config(struct perf_event *event)
> > return -EINVAL;
> > }
> >
> > + /* sample_regs_user never support SSP register. */
> > + if (unlikely(event->attr.sample_regs_user & BIT_ULL(PERF_REG_X86_SSP)))
> > + return -EINVAL;
>
> We can easily enough read user SSP, no?
Not for multi record PEBS.
Also technically it may not be precise.
-andi
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 10/24] perf/x86/intel: Process arch-PEBS records or record fragments
2025-02-25 10:39 ` Peter Zijlstra
2025-02-25 11:00 ` Peter Zijlstra
2025-02-25 20:42 ` Andi Kleen
@ 2025-02-26 2:54 ` Mi, Dapeng
2 siblings, 0 replies; 58+ messages in thread
From: Mi, Dapeng @ 2025-02-26 2:54 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On 2/25/2025 6:39 PM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2025 at 03:28:04PM +0000, Dapeng Mi wrote:
>> A significant difference with adaptive PEBS is that arch-PEBS record
>> supports fragments which means an arch-PEBS record could be split into
>> several independent fragments which have its own arch-PEBS header in
>> each fragment.
>>
>> This patch defines architectural PEBS record layout structures and add
>> helpers to process arch-PEBS records or fragments. Only legacy PEBS
>> groups like basic, GPR, XMM and LBR groups are supported in this patch,
>> the new added YMM/ZMM/OPMASK vector registers capturing would be
>> supported in subsequent patches.
>>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>> arch/x86/events/intel/core.c | 9 ++
>> arch/x86/events/intel/ds.c | 219 ++++++++++++++++++++++++++++++
>> arch/x86/include/asm/msr-index.h | 6 +
>> arch/x86/include/asm/perf_event.h | 100 ++++++++++++++
>> 4 files changed, 334 insertions(+)
>>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index 37540eb80029..184f69afde08 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -3124,6 +3124,15 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
>> wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
>> }
>>
>> + /*
>> + * Arch PEBS sets bit 54 in the global status register
>> + */
>> + if (__test_and_clear_bit(GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT,
>> + (unsigned long *)&status)) {
> Will arch_pebs hardware ever toggle bit 62?
No, arch-PEBS won't touch anything debug-store related.
>
>> + handled++;
>> + x86_pmu.drain_pebs(regs, &data);
> static_call(x86_pmu_drain_pebs)(regs, &data);
Sure.
>
>> + }
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 10/24] perf/x86/intel: Process arch-PEBS records or record fragments
2025-02-25 11:00 ` Peter Zijlstra
@ 2025-02-26 5:20 ` Mi, Dapeng
2025-02-26 9:35 ` Peter Zijlstra
0 siblings, 1 reply; 58+ messages in thread
From: Mi, Dapeng @ 2025-02-26 5:20 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On 2/25/2025 7:00 PM, Peter Zijlstra wrote:
> On Tue, Feb 25, 2025 at 11:39:27AM +0100, Peter Zijlstra wrote:
>> On Tue, Feb 18, 2025 at 03:28:04PM +0000, Dapeng Mi wrote:
>>> A significant difference with adaptive PEBS is that arch-PEBS record
>>> supports fragments which means an arch-PEBS record could be split into
>>> several independent fragments which have its own arch-PEBS header in
>>> each fragment.
>>>
>>> This patch defines architectural PEBS record layout structures and add
>>> helpers to process arch-PEBS records or fragments. Only legacy PEBS
>>> groups like basic, GPR, XMM and LBR groups are supported in this patch,
>>> the new added YMM/ZMM/OPMASK vector registers capturing would be
>>> supported in subsequent patches.
>>>
>>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>>> ---
>>> arch/x86/events/intel/core.c | 9 ++
>>> arch/x86/events/intel/ds.c | 219 ++++++++++++++++++++++++++++++
>>> arch/x86/include/asm/msr-index.h | 6 +
>>> arch/x86/include/asm/perf_event.h | 100 ++++++++++++++
>>> 4 files changed, 334 insertions(+)
>>>
>>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>>> index 37540eb80029..184f69afde08 100644
>>> --- a/arch/x86/events/intel/core.c
>>> +++ b/arch/x86/events/intel/core.c
>>> @@ -3124,6 +3124,15 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
>>> wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
>>> }
>>>
>>> + /*
>>> + * Arch PEBS sets bit 54 in the global status register
>>> + */
>>> + if (__test_and_clear_bit(GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT,
>>> + (unsigned long *)&status)) {
>> Will arch_pebs hardware ever toggle bit 62?
> This had me looking at the bit 62 handling, and I noticed the thing from
> commit 8077eca079a2 ("perf/x86/pebs: Add workaround for broken OVFL
> status on HSW+").
>
> Did that ever get fixed in later chips; notably I'm assuming ARCH PEBS
> does not suffer this?
I'm not sure if arch-PEBS would still suffer from this race condition, but
the status cleaning for PEBS counters has been moved out of the bit 62
handling scope by commit daa864b8f8e3 ("perf/x86/pebs: Fix handling of
PEBS buffer overflows"). So even if arch-PEBS still has this race condition,
it is still covered.
>
> Also, should that workaround have been extended to also include
> GLOBAL_STATUS_PERF_METRICS_OVF in that mask, or was that defect fixed
> for every chip capable of metrics stuff?
Hmm, per my understanding, GLOBAL_STATUS_PERF_METRICS_OVF handling should
only be skipped when fixed counter 3 or the perf metrics are included in a
PEBS counter group. In that case, the slots and topdown metrics have already
been updated by the PEBS handler and should not be processed again.
@Kan Liang, is that correct?
>
> In any case, I think we want a patch clarifying the situation with a
> comment.
>
>
>>> + handled++;
>>> + x86_pmu.drain_pebs(regs, &data);
>> static_call(x86_pmu_drain_pebs)(regs, &data);
>>
>>> + }
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 11/24] perf/x86/intel: Factor out common functions to process PEBS groups
2025-02-25 11:02 ` Peter Zijlstra
@ 2025-02-26 5:24 ` Mi, Dapeng
0 siblings, 0 replies; 58+ messages in thread
From: Mi, Dapeng @ 2025-02-26 5:24 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On 2/25/2025 7:02 PM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2025 at 03:28:05PM +0000, Dapeng Mi wrote:
>> Adaptive PEBS and arch-PEBS share lots of same code to process these
>> PEBS groups, like basic, GPR and meminfo groups. Extract these shared
>> code to common functions to avoid duplicated code.
> Should you not flip this and the previous patch? Because afaict you're
> mostly removing the code you just added, which is a bit silly.
Sure.
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 12/24] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
2025-02-25 11:18 ` Peter Zijlstra
@ 2025-02-26 5:48 ` Mi, Dapeng
2025-02-26 9:46 ` Peter Zijlstra
0 siblings, 1 reply; 58+ messages in thread
From: Mi, Dapeng @ 2025-02-26 5:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On 2/25/2025 7:18 PM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2025 at 03:28:06PM +0000, Dapeng Mi wrote:
>> Arch-PEBS introduces a new MSR IA32_PEBS_BASE to store the arch-PEBS
>> buffer physical address. This patch allocates arch-PEBS buffer and then
>> initialize IA32_PEBS_BASE MSR with the buffer physical address.
> Not loving how this patch obscures the whole DS area thing and naming.
arch-PEBS uses a totally independent buffer to save the PEBS records and
doesn't use the debug store area anymore. To reuse the original functions as
much as possible and not mislead users into thinking arch-PEBS has some
relationship with the debug store, the original keyword "ds" in the function
names is changed to "BTS_PEBS". I know the name may not be perfect; do you
have any suggestions? Thanks.
>
>
>> @@ -624,13 +604,18 @@ static int alloc_pebs_buffer(int cpu)
>> int max, node = cpu_to_node(cpu);
>> void *buffer, *insn_buff, *cea;
>>
>> - if (!x86_pmu.ds_pebs)
>> + if (!intel_pmu_has_pebs())
>> return 0;
>>
>> - buffer = dsalloc_pages(bsiz, GFP_KERNEL, cpu);
>> + buffer = dsalloc_pages(bsiz, preemptible() ? GFP_KERNEL : GFP_ATOMIC, cpu);
> But this plain smells bad, what is this about?
In the initial implementation, alloc_pebs_buffer() could be called from
init_debug_store_on_cpu(), which may run in IRQ context. But that was dropped
in the latest implementation, so this change is not needed anymore. I'll drop
it in the next version.
>
>> if (unlikely(!buffer))
>> return -ENOMEM;
>>
>> + if (x86_pmu.arch_pebs) {
>> + hwev->pebs_vaddr = buffer;
>> + return 0;
>> + }
>> +
>> /*
>> * HSW+ already provides us the eventing ip; no need to allocate this
>> * buffer then.
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 12/24] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
2025-02-25 11:25 ` Peter Zijlstra
@ 2025-02-26 6:19 ` Mi, Dapeng
2025-02-26 9:48 ` Peter Zijlstra
0 siblings, 1 reply; 58+ messages in thread
From: Mi, Dapeng @ 2025-02-26 6:19 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On 2/25/2025 7:25 PM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2025 at 03:28:06PM +0000, Dapeng Mi wrote:
>> Arch-PEBS introduces a new MSR IA32_PEBS_BASE to store the arch-PEBS
>> buffer physical address. This patch allocates arch-PEBS buffer and then
>> initialize IA32_PEBS_BASE MSR with the buffer physical address.
> Just to clarify, parts with ARCH PEBS will not have BTS and thus not
> have DS?
No, DS and BTS still exist along with arch-PEBS; only the legacy DS-based
PEBS is unavailable and replaced by arch-PEBS.
Here is the output of CPUID.1:EDX[21] and the IA32_MISC_ENABLE MSR on PTL.
sudo cpuid -l 0x1 | grep DS
DS: debug store = true
DS: debug store = true
DS: debug store = true
DS: debug store = true
DS: debug store = true
DS: debug store = true
sudo rdmsr 0x1a0 -a
851089
851089
851089
851089
851089
851089
We can see that debug store is supported, the BTS_UNAVAILABLE bit (bit[11] of
IA32_MISC_ENABLE) is cleared, but the PEBS_UNAVAILABLE bit (bit[12] of
IA32_MISC_ENABLE) is set.
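As a quick sanity check of the two bits called out above, a tiny standalone
snippet (not part of the series) that decodes the 0x851089 value:
#include <stdio.h>
int main(void)
{
	unsigned long long misc_enable = 0x851089ULL;	/* value read above */
	/* bit 11: BTS_UNAVAILABLE, bit 12: PEBS_UNAVAILABLE */
	printf("BTS_UNAVAILABLE=%llu PEBS_UNAVAILABLE=%llu\n",
	       (misc_enable >> 11) & 1, (misc_enable >> 12) & 1);
	return 0;	/* prints BTS_UNAVAILABLE=0 PEBS_UNAVAILABLE=1 */
}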
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 15/24] perf/x86/intel: Add SSP register support for arch-PEBS
2025-02-25 11:52 ` Peter Zijlstra
@ 2025-02-26 6:56 ` Mi, Dapeng
0 siblings, 0 replies; 58+ messages in thread
From: Mi, Dapeng @ 2025-02-26 6:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On 2/25/2025 7:52 PM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2025 at 03:28:09PM +0000, Dapeng Mi wrote:
>
>> + if (unlikely(event->attr.sample_regs_intr & BIT_ULL(PERF_REG_X86_SSP))) {
>> + /* Only arch-PEBS supports to capture SSP register. */
>> + if (!x86_pmu.arch_pebs || !event->attr.precise_ip)
>> + return -EINVAL;
>> + }
>> @@ -27,9 +27,11 @@ enum perf_event_x86_regs {
>> PERF_REG_X86_R13,
>> PERF_REG_X86_R14,
>> PERF_REG_X86_R15,
>> + /* Shadow stack pointer (SSP) present on Clearwater Forest and newer models. */
>> + PERF_REG_X86_SSP,
> The first comment makes more sense. Nobody knows or cares what a
> clearwater forest is, but ARCH-PEBS is something you can check.
Sure. would modify it in next version.
>
> Also, this hard implies that anything exposing ARCH-PEBS exposes
> CET-SS. Does virt complicate this?
Yes, for real HW, I think CET-SS would always be supported as long as
arch-PEBS is supported, but that's not true for virt. So I suppose we need to
check that CET-SS is supported before reading it.
>
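To illustrate the extra check (a sketch on top of the hunk quoted above;
whether X86_FEATURE_SHSTK, the kernel's CET shadow-stack CPUID flag, is the
right gate here is an assumption to be confirmed):

    if (unlikely(event->attr.sample_regs_intr & BIT_ULL(PERF_REG_X86_SSP))) {
            /* Only arch-PEBS supports capturing the SSP register. */
            if (!x86_pmu.arch_pebs || !event->attr.precise_ip)
                    return -EINVAL;

            /*
             * A guest may expose arch-PEBS without CET-SS, so also require
             * the shadow-stack feature before accepting SSP capture.
             */
            if (!boot_cpu_has(X86_FEATURE_SHSTK))
                    return -EINVAL;
    }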
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 17/24] perf/core: Support to capture higher width vector registers
2025-02-25 20:32 ` Peter Zijlstra
@ 2025-02-26 7:55 ` Mi, Dapeng
0 siblings, 0 replies; 58+ messages in thread
From: Mi, Dapeng @ 2025-02-26 7:55 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On 2/26/2025 4:32 AM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2025 at 03:28:11PM +0000, Dapeng Mi wrote:
>> diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
>> index 9ee9e55aed09..3851f627ca60 100644
>> --- a/arch/x86/include/uapi/asm/perf_regs.h
>> +++ b/arch/x86/include/uapi/asm/perf_regs.h
>> @@ -33,7 +33,7 @@ enum perf_event_x86_regs {
>> PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
>> PERF_REG_X86_64_MAX = PERF_REG_X86_SSP + 1,
>>
>> - /* These all need two bits set because they are 128bit */
>> + /* These all need two bits set because they are 128 bits */
>> PERF_REG_X86_XMM0 = 32,
>> PERF_REG_X86_XMM1 = 34,
>> PERF_REG_X86_XMM2 = 36,
>> @@ -53,6 +53,87 @@ enum perf_event_x86_regs {
>>
>> /* These include both GPRs and XMMX registers */
>> PERF_REG_X86_XMM_MAX = PERF_REG_X86_XMM15 + 2,
>> +
>> + /*
>> + * YMM upper bits need two bits set because they are 128 bits.
>> + * PERF_REG_X86_YMMH0 = 64
>> + */
>> + PERF_REG_X86_YMMH0 = PERF_REG_X86_XMM_MAX,
>> + PERF_REG_X86_YMMH1 = PERF_REG_X86_YMMH0 + 2,
>> + PERF_REG_X86_YMMH2 = PERF_REG_X86_YMMH1 + 2,
>> + PERF_REG_X86_YMMH3 = PERF_REG_X86_YMMH2 + 2,
>> + PERF_REG_X86_YMMH4 = PERF_REG_X86_YMMH3 + 2,
>> + PERF_REG_X86_YMMH5 = PERF_REG_X86_YMMH4 + 2,
>> + PERF_REG_X86_YMMH6 = PERF_REG_X86_YMMH5 + 2,
>> + PERF_REG_X86_YMMH7 = PERF_REG_X86_YMMH6 + 2,
>> + PERF_REG_X86_YMMH8 = PERF_REG_X86_YMMH7 + 2,
>> + PERF_REG_X86_YMMH9 = PERF_REG_X86_YMMH8 + 2,
>> + PERF_REG_X86_YMMH10 = PERF_REG_X86_YMMH9 + 2,
>> + PERF_REG_X86_YMMH11 = PERF_REG_X86_YMMH10 + 2,
>> + PERF_REG_X86_YMMH12 = PERF_REG_X86_YMMH11 + 2,
>> + PERF_REG_X86_YMMH13 = PERF_REG_X86_YMMH12 + 2,
>> + PERF_REG_X86_YMMH14 = PERF_REG_X86_YMMH13 + 2,
>> + PERF_REG_X86_YMMH15 = PERF_REG_X86_YMMH14 + 2,
>> + PERF_REG_X86_YMMH_MAX = PERF_REG_X86_YMMH15 + 2,
>> +
>> + /*
>> + * ZMM0-15 upper bits need four bits set because they are 256 bits
>> + * PERF_REG_X86_ZMMH0 = 96
>> + */
>> + PERF_REG_X86_ZMMH0 = PERF_REG_X86_YMMH_MAX,
>> + PERF_REG_X86_ZMMH1 = PERF_REG_X86_ZMMH0 + 4,
>> + PERF_REG_X86_ZMMH2 = PERF_REG_X86_ZMMH1 + 4,
>> + PERF_REG_X86_ZMMH3 = PERF_REG_X86_ZMMH2 + 4,
>> + PERF_REG_X86_ZMMH4 = PERF_REG_X86_ZMMH3 + 4,
>> + PERF_REG_X86_ZMMH5 = PERF_REG_X86_ZMMH4 + 4,
>> + PERF_REG_X86_ZMMH6 = PERF_REG_X86_ZMMH5 + 4,
>> + PERF_REG_X86_ZMMH7 = PERF_REG_X86_ZMMH6 + 4,
>> + PERF_REG_X86_ZMMH8 = PERF_REG_X86_ZMMH7 + 4,
>> + PERF_REG_X86_ZMMH9 = PERF_REG_X86_ZMMH8 + 4,
>> + PERF_REG_X86_ZMMH10 = PERF_REG_X86_ZMMH9 + 4,
>> + PERF_REG_X86_ZMMH11 = PERF_REG_X86_ZMMH10 + 4,
>> + PERF_REG_X86_ZMMH12 = PERF_REG_X86_ZMMH11 + 4,
>> + PERF_REG_X86_ZMMH13 = PERF_REG_X86_ZMMH12 + 4,
>> + PERF_REG_X86_ZMMH14 = PERF_REG_X86_ZMMH13 + 4,
>> + PERF_REG_X86_ZMMH15 = PERF_REG_X86_ZMMH14 + 4,
>> + PERF_REG_X86_ZMMH_MAX = PERF_REG_X86_ZMMH15 + 4,
>> +
>> + /*
>> + * ZMM16-31 need eight bits set because they are 512 bits
>> + * PERF_REG_X86_ZMM16 = 160
>> + */
>> + PERF_REG_X86_ZMM16 = PERF_REG_X86_ZMMH_MAX,
>> + PERF_REG_X86_ZMM17 = PERF_REG_X86_ZMM16 + 8,
>> + PERF_REG_X86_ZMM18 = PERF_REG_X86_ZMM17 + 8,
>> + PERF_REG_X86_ZMM19 = PERF_REG_X86_ZMM18 + 8,
>> + PERF_REG_X86_ZMM20 = PERF_REG_X86_ZMM19 + 8,
>> + PERF_REG_X86_ZMM21 = PERF_REG_X86_ZMM20 + 8,
>> + PERF_REG_X86_ZMM22 = PERF_REG_X86_ZMM21 + 8,
>> + PERF_REG_X86_ZMM23 = PERF_REG_X86_ZMM22 + 8,
>> + PERF_REG_X86_ZMM24 = PERF_REG_X86_ZMM23 + 8,
>> + PERF_REG_X86_ZMM25 = PERF_REG_X86_ZMM24 + 8,
>> + PERF_REG_X86_ZMM26 = PERF_REG_X86_ZMM25 + 8,
>> + PERF_REG_X86_ZMM27 = PERF_REG_X86_ZMM26 + 8,
>> + PERF_REG_X86_ZMM28 = PERF_REG_X86_ZMM27 + 8,
>> + PERF_REG_X86_ZMM29 = PERF_REG_X86_ZMM28 + 8,
>> + PERF_REG_X86_ZMM30 = PERF_REG_X86_ZMM29 + 8,
>> + PERF_REG_X86_ZMM31 = PERF_REG_X86_ZMM30 + 8,
>> + PERF_REG_X86_ZMM_MAX = PERF_REG_X86_ZMM31 + 8,
>> +
>> + /*
>> + * OPMASK Registers
>> + * PERF_REG_X86_OPMASK0 = 288
>> + */
>> + PERF_REG_X86_OPMASK0 = PERF_REG_X86_ZMM_MAX,
>> + PERF_REG_X86_OPMASK1 = PERF_REG_X86_OPMASK0 + 1,
>> + PERF_REG_X86_OPMASK2 = PERF_REG_X86_OPMASK1 + 1,
>> + PERF_REG_X86_OPMASK3 = PERF_REG_X86_OPMASK2 + 1,
>> + PERF_REG_X86_OPMASK4 = PERF_REG_X86_OPMASK3 + 1,
>> + PERF_REG_X86_OPMASK5 = PERF_REG_X86_OPMASK4 + 1,
>> + PERF_REG_X86_OPMASK6 = PERF_REG_X86_OPMASK5 + 1,
>> + PERF_REG_X86_OPMASK7 = PERF_REG_X86_OPMASK6 + 1,
>> +
>> + PERF_REG_X86_VEC_MAX = PERF_REG_X86_OPMASK7 + 1,
>> };
>>
>> #define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
>> index 0524d541d4e3..8a17d696d78c 100644
>> --- a/include/uapi/linux/perf_event.h
>> +++ b/include/uapi/linux/perf_event.h
>> @@ -379,6 +379,10 @@ enum perf_event_read_format {
>> #define PERF_ATTR_SIZE_VER6 120 /* add: aux_sample_size */
>> #define PERF_ATTR_SIZE_VER7 128 /* add: sig_data */
>> #define PERF_ATTR_SIZE_VER8 136 /* add: config3 */
>> +#define PERF_ATTR_SIZE_VER9 168 /* add: sample_regs_intr_ext[PERF_EXT_REGS_ARRAY_SIZE] */
>> +
>> +#define PERF_EXT_REGS_ARRAY_SIZE 4
>> +#define PERF_NUM_EXT_REGS (PERF_EXT_REGS_ARRAY_SIZE * 64)
>>
>> /*
>> * Hardware event_id to monitor via a performance monitoring event:
>> @@ -531,6 +535,13 @@ struct perf_event_attr {
>> __u64 sig_data;
>>
>> __u64 config3; /* extension of config2 */
>> +
>> + /*
>> + * Extension sets of regs to dump for each sample.
>> + * See asm/perf_regs.h for details.
>> + */
>> + __u64 sample_regs_intr_ext[PERF_EXT_REGS_ARRAY_SIZE];
>> + __u64 sample_regs_user_ext[PERF_EXT_REGS_ARRAY_SIZE];
>> };
>>
>> /*
> *groan*... so do people really need per-register (or even partial
> register) masks for all this?
Yeah, I agree. Users should never need to read partial registers. But since
the current perf tool already supports reading XMM registers individually,
I'm not sure whether supporting only per-group reads (e.g. the XMM, YMM or
ZMM group) would introduce backward-compatibility issues.
>
> Or can we perhaps -- like XSAVE/PEBS -- do it per register group?
If there is no backward-compatibility issue, I think it should work.
>
> Also, we're going to be getting EGPRs, which I think just about fit in
> this 320 bit mask we now have, but it is quite insane.
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 18/24] perf/x86/intel: Support arch-PEBS vector registers group capturing
2025-02-25 15:32 ` Peter Zijlstra
@ 2025-02-26 8:08 ` Mi, Dapeng
2025-02-27 6:40 ` Mi, Dapeng
0 siblings, 1 reply; 58+ messages in thread
From: Mi, Dapeng @ 2025-02-26 8:08 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On 2/25/2025 11:32 PM, Peter Zijlstra wrote:
> On Tue, Feb 18, 2025 at 03:28:12PM +0000, Dapeng Mi wrote:
>> Add x86/intel specific vector register (VECR) group capturing for
>> arch-PEBS. Enable corresponding VECR group bits in
>> GPx_CFG_C/FX0_CFG_C MSRs if users configures these vector registers
>> bitmap in perf_event_attr and parse VECR group in arch-PEBS record.
>>
>> Currently vector registers capturing is only supported by PEBS based
>> sampling, PMU driver would return error if PMI based sampling tries to
>> capture these vector registers.
>
>> @@ -676,6 +709,32 @@ int x86_pmu_hw_config(struct perf_event *event)
>> return -EINVAL;
>> }
>>
>> + /*
>> + * Architectural PEBS supports to capture more vector registers besides
>> + * XMM registers, like YMM, OPMASK and ZMM registers.
>> + */
>> + if (unlikely(has_more_extended_regs(event))) {
>> + u64 caps = hybrid(event->pmu, arch_pebs_cap).caps;
>> +
>> + if (!(event->pmu->capabilities & PERF_PMU_CAP_MORE_EXT_REGS))
>> + return -EINVAL;
>> +
>> + if (has_opmask_regs(event) && !(caps & ARCH_PEBS_VECR_OPMASK))
>> + return -EINVAL;
>> +
>> + if (has_ymmh_regs(event) && !(caps & ARCH_PEBS_VECR_YMM))
>> + return -EINVAL;
>> +
>> + if (has_zmmh_regs(event) && !(caps & ARCH_PEBS_VECR_ZMMH))
>> + return -EINVAL;
>> +
>> + if (has_h16zmm_regs(event) && !(caps & ARCH_PEBS_VECR_H16ZMM))
>> + return -EINVAL;
>> +
>> + if (!event->attr.precise_ip)
>> + return -EINVAL;
>> + }
>> +
>> return x86_setup_perfctr(event);
>> }
>>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index f21d9f283445..8ef5b9a05fcc 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -2963,6 +2963,18 @@ static void intel_pmu_enable_event_ext(struct perf_event *event)
>> if (pebs_data_cfg & PEBS_DATACFG_XMMS)
>> ext |= ARCH_PEBS_VECR_XMM & cap.caps;
>>
>> + if (pebs_data_cfg & PEBS_DATACFG_YMMS)
>> + ext |= ARCH_PEBS_VECR_YMM & cap.caps;
>> +
>> + if (pebs_data_cfg & PEBS_DATACFG_OPMASKS)
>> + ext |= ARCH_PEBS_VECR_OPMASK & cap.caps;
>> +
>> + if (pebs_data_cfg & PEBS_DATACFG_ZMMHS)
>> + ext |= ARCH_PEBS_VECR_ZMMH & cap.caps;
>> +
>> + if (pebs_data_cfg & PEBS_DATACFG_H16ZMMS)
>> + ext |= ARCH_PEBS_VECR_H16ZMM & cap.caps;
>> +
>> if (pebs_data_cfg & PEBS_DATACFG_LBRS)
>> ext |= ARCH_PEBS_LBR & cap.caps;
>>
>> @@ -5115,6 +5127,9 @@ static inline void __intel_update_pmu_caps(struct pmu *pmu)
>>
>> if (hybrid(pmu, arch_pebs_cap).caps & ARCH_PEBS_VECR_XMM)
>> dest_pmu->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
>> +
>> + if (hybrid(pmu, arch_pebs_cap).caps & ARCH_PEBS_VECR_EXT)
>> + dest_pmu->capabilities |= PERF_PMU_CAP_MORE_EXT_REGS;
>> }
> There is no technical reason for it to error out, right? We can use
> FPU/XSAVE interface to read the CPU state just fine.
I don't think it's for a technical reason. Let me confirm whether we can add
it for non-PEBS sampling.
>
>
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index 4b01beee15f4..7e5a4202de37 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -1437,9 +1438,37 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
>> if (gprs || (attr->precise_ip < 2) || tsx_weight)
>> pebs_data_cfg |= PEBS_DATACFG_GP;
>>
>> - if ((sample_type & PERF_SAMPLE_REGS_INTR) &&
>> - (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK))
>> - pebs_data_cfg |= PEBS_DATACFG_XMMS;
>> + if (sample_type & PERF_SAMPLE_REGS_INTR) {
>> + if (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK)
>> + pebs_data_cfg |= PEBS_DATACFG_XMMS;
>> +
>> + for_each_set_bit_from(bit,
>> + (unsigned long *)event->attr.sample_regs_intr_ext,
>> + PERF_NUM_EXT_REGS) {
> This is indented wrong; please use cino=(0:0
> if you worry about indentation depth, break out in helper function.
Sure, I will modify it.
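Something along these lines, perhaps (a rough sketch reusing the names from
this patch; besides the helper split, the only intended change is that the
OPMASK case skips to the end of its own range):

    static u64 pebs_update_ext_regs_cfg(struct perf_event *event)
    {
            unsigned long *ext_regs = (unsigned long *)event->attr.sample_regs_intr_ext;
            u64 pebs_data_cfg = 0;
            unsigned int bit = 0;

            /* Map each extended-register range onto its PEBS_DATACFG group bit. */
            for_each_set_bit_from(bit, ext_regs, PERF_NUM_EXT_REGS) {
                    switch (bit + PERF_REG_EXTENDED_OFFSET) {
                    case PERF_REG_X86_YMMH0 ... PERF_REG_X86_ZMMH0 - 1:
                            pebs_data_cfg |= PEBS_DATACFG_YMMS;
                            bit = PERF_REG_X86_ZMMH0 - PERF_REG_EXTENDED_OFFSET - 1;
                            break;
                    case PERF_REG_X86_ZMMH0 ... PERF_REG_X86_ZMM16 - 1:
                            pebs_data_cfg |= PEBS_DATACFG_ZMMHS;
                            bit = PERF_REG_X86_ZMM16 - PERF_REG_EXTENDED_OFFSET - 1;
                            break;
                    case PERF_REG_X86_ZMM16 ... PERF_REG_X86_ZMM_MAX - 1:
                            pebs_data_cfg |= PEBS_DATACFG_H16ZMMS;
                            bit = PERF_REG_X86_ZMM_MAX - PERF_REG_EXTENDED_OFFSET - 1;
                            break;
                    case PERF_REG_X86_OPMASK0 ... PERF_REG_X86_OPMASK7:
                            pebs_data_cfg |= PEBS_DATACFG_OPMASKS;
                            /* OPMASK is the last range in the extended bitmap. */
                            bit = PERF_REG_X86_OPMASK7 - PERF_REG_EXTENDED_OFFSET;
                            break;
                    }
            }

            return pebs_data_cfg;
    }

pebs_update_adaptive_cfg() would then just OR in the helper's result under
the existing PERF_SAMPLE_REGS_INTR check.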
>
>> + switch (bit + PERF_REG_EXTENDED_OFFSET) {
>> + case PERF_REG_X86_OPMASK0 ... PERF_REG_X86_OPMASK7:
>> + pebs_data_cfg |= PEBS_DATACFG_OPMASKS;
>> + bit = PERF_REG_X86_YMMH0 -
>> + PERF_REG_EXTENDED_OFFSET - 1;
>> + break;
>> + case PERF_REG_X86_YMMH0 ... PERF_REG_X86_ZMMH0 - 1:
>> + pebs_data_cfg |= PEBS_DATACFG_YMMS;
>> + bit = PERF_REG_X86_ZMMH0 -
>> + PERF_REG_EXTENDED_OFFSET - 1;
>> + break;
>> + case PERF_REG_X86_ZMMH0 ... PERF_REG_X86_ZMM16 - 1:
>> + pebs_data_cfg |= PEBS_DATACFG_ZMMHS;
>> + bit = PERF_REG_X86_ZMM16 -
>> + PERF_REG_EXTENDED_OFFSET - 1;
>> + break;
>> + case PERF_REG_X86_ZMM16 ... PERF_REG_X86_ZMM_MAX - 1:
>> + pebs_data_cfg |= PEBS_DATACFG_H16ZMMS;
>> + bit = PERF_REG_X86_ZMM_MAX -
>> + PERF_REG_EXTENDED_OFFSET - 1;
>> + break;
>> + }
>> + }
>> + }
>>
>> if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
>> /*
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 10/24] perf/x86/intel: Process arch-PEBS records or record fragments
2025-02-26 5:20 ` Mi, Dapeng
@ 2025-02-26 9:35 ` Peter Zijlstra
2025-02-26 15:45 ` Liang, Kan
0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2025-02-26 9:35 UTC (permalink / raw)
To: Mi, Dapeng
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Wed, Feb 26, 2025 at 01:20:37PM +0800, Mi, Dapeng wrote:
> > Also, should that workaround have been extended to also include
> > GLOBAL_STATUS_PERF_METRICS_OVF in that mask, or was that defect fixed
> > for every chip capable of metrics stuff?
>
> hmm, per my understanding, GLOBAL_STATUS_PERF_METRICS_OVF handling should
> only be skipped when fixed counter 3 or perf metrics are included in PEBS
> counter group. In this case, the slots and topdown metrics have been
> updated by PEBS handler. It should not be processed again.
>
> @Kan Liang, is it correct?
Right, so the thing is, *any* PEBS event pending will clear METRICS_OVF
per:
status &= x86_pmu.intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 12/24] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
2025-02-26 5:48 ` Mi, Dapeng
@ 2025-02-26 9:46 ` Peter Zijlstra
2025-02-27 2:05 ` Mi, Dapeng
0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2025-02-26 9:46 UTC (permalink / raw)
To: Mi, Dapeng
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Wed, Feb 26, 2025 at 01:48:52PM +0800, Mi, Dapeng wrote:
>
> On 2/25/2025 7:18 PM, Peter Zijlstra wrote:
> > On Tue, Feb 18, 2025 at 03:28:06PM +0000, Dapeng Mi wrote:
> >> Arch-PEBS introduces a new MSR IA32_PEBS_BASE to store the arch-PEBS
> >> buffer physical address. This patch allocates arch-PEBS buffer and then
> >> initialize IA32_PEBS_BASE MSR with the buffer physical address.
> > Not loving how this patch obscures the whole DS area thing and naming.
>
arch-PEBS uses a totally independent buffer to save the PEBS records and
doesn't use the debug store area anymore. To reuse the original functions as
much as possible and not mislead users into thinking arch-PEBS has some
relationship with the debug store, the original keyword "ds" in the function
names was changed to "BTS_PEBS". I know the name may not be perfect; do you
have any suggestion? Thanks.
Right, so I realize it has a new buffer, but why do you need to make it
all complicated like this?
Just leave the existing stuff and stick the new arch pebs buffer
somewhere new. All that reserve nonsense shouldn't be needed anymore.
Just add it to the intel_pmu_cpu_{prepare,starting,dying,dead} things.
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 12/24] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
2025-02-26 6:19 ` Mi, Dapeng
@ 2025-02-26 9:48 ` Peter Zijlstra
2025-02-27 2:09 ` Mi, Dapeng
0 siblings, 1 reply; 58+ messages in thread
From: Peter Zijlstra @ 2025-02-26 9:48 UTC (permalink / raw)
To: Mi, Dapeng
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Wed, Feb 26, 2025 at 02:19:15PM +0800, Mi, Dapeng wrote:
>
> On 2/25/2025 7:25 PM, Peter Zijlstra wrote:
> > On Tue, Feb 18, 2025 at 03:28:06PM +0000, Dapeng Mi wrote:
> >> Arch-PEBS introduces a new MSR IA32_PEBS_BASE to store the arch-PEBS
> >> buffer physical address. This patch allocates arch-PEBS buffer and then
> >> initialize IA32_PEBS_BASE MSR with the buffer physical address.
> > Just to clarify, parts with ARCH PEBS will not have BTS and thus not
> > have DS?
>
> No, DS and BTS still exist along with arch-PEBS, only the legacy DS based
> PEBS is unavailable and replaced by arch-PEBS.
Joy. Is anybody still using BTS now that we have PT? I thought PT was
supposed to be the better BTS.
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 10/24] perf/x86/intel: Process arch-PEBS records or record fragments
2025-02-26 9:35 ` Peter Zijlstra
@ 2025-02-26 15:45 ` Liang, Kan
2025-02-27 2:04 ` Mi, Dapeng
0 siblings, 1 reply; 58+ messages in thread
From: Liang, Kan @ 2025-02-26 15:45 UTC (permalink / raw)
To: Peter Zijlstra, Mi, Dapeng
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Andi Kleen, Eranian Stephane,
linux-kernel, linux-perf-users, Dapeng Mi
On 2025-02-26 4:35 a.m., Peter Zijlstra wrote:
> On Wed, Feb 26, 2025 at 01:20:37PM +0800, Mi, Dapeng wrote:
>
>>> Also, should that workaround have been extended to also include
>>> GLOBAL_STATUS_PERF_METRICS_OVF in that mask, or was that defect fixed
>>> for every chip capable of metrics stuff?
>>
>> hmm, per my understanding, GLOBAL_STATUS_PERF_METRICS_OVF handling should
>> only be skipped when fixed counter 3 or perf metrics are included in PEBS
>> counter group. In this case, the slots and topdown metrics have been
>> updated by PEBS handler. It should not be processed again.
>>
>> @Kan Liang, is it correct?
>
> Right, so the thing is, *any* PEBS event pending will clear METRICS_OVF
> per:
>
> status &= x86_pmu.intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;
>
Yes, we have to add it for both legacy PEBS and arch-PEBS.
An alternative would be to change the order of handling the overflow bits.
Commit daa864b8f8e3 ("perf/x86/pebs: Fix handling of PEBS buffer overflows")
moved the "status &= ~cpuc->pebs_enabled;" out of the PEBS overflow code.
As long as the PEBS overflow is handled after PT, I don't think the above is
required anymore.
The same should apply to METRICS_OVF. But PEBS counter snapshotting should be
handled specially, since PEBS will handle the metrics counter as well.
@@ -3211,7 +3211,8 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 	/*
 	 * Intel Perf metrics
 	 */
-	if (__test_and_clear_bit(GLOBAL_STATUS_PERF_METRICS_OVF_BIT, (unsigned long *)&status)) {
+	if (__test_and_clear_bit(GLOBAL_STATUS_PERF_METRICS_OVF_BIT, (unsigned long *)&status) &&
+	    !is_pebs_counter_event_group(cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS])) {
 		handled++;
 		static_call(intel_pmu_update_topdown_event)(NULL, NULL);
 	}
Thanks,
Kan
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 10/24] perf/x86/intel: Process arch-PEBS records or record fragments
2025-02-26 15:45 ` Liang, Kan
@ 2025-02-27 2:04 ` Mi, Dapeng
0 siblings, 0 replies; 58+ messages in thread
From: Mi, Dapeng @ 2025-02-27 2:04 UTC (permalink / raw)
To: Liang, Kan, Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Andi Kleen, Eranian Stephane,
linux-kernel, linux-perf-users, Dapeng Mi
On 2/26/2025 11:45 PM, Liang, Kan wrote:
>
> On 2025-02-26 4:35 a.m., Peter Zijlstra wrote:
>> On Wed, Feb 26, 2025 at 01:20:37PM +0800, Mi, Dapeng wrote:
>>
>>>> Also, should that workaround have been extended to also include
>>>> GLOBAL_STATUS_PERF_METRICS_OVF in that mask, or was that defect fixed
>>>> for every chip capable of metrics stuff?
>>> hmm, per my understanding, GLOBAL_STATUS_PERF_METRICS_OVF handling should
>>> only be skipped when fixed counter 3 or perf metrics are included in PEBS
>>> counter group. In this case, the slots and topdown metrics have been
>>> updated by PEBS handler. It should not be processed again.
>>>
>>> @Kan Liang, is it correct?
>> Right, so the thing is, *any* PEBS event pending will clear METRICS_OVF
>> per:
>>
>> status &= x86_pmu.intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;
>>
> Yes, we have to add it for both legacy PEBS and ARCH PEBS.
>
> An alternative way may change the order of handling the overflow bit.
>
> The commit daa864b8f8e3 ("perf/x86/pebs: Fix handling of PEBS buffer
> overflows") has moved the "status &= ~cpuc->pebs_enabled;" out of PEBS
> overflow code.
>
> As long as the PEBS overflow is handled after PT, I don't think the
> above is required anymore.
>
> It should be similar to METRICS_OVF. But the PEBS counters snapshotting
> should be specially handled, since the PEBS will handle the metrics
> counter as well.
>
> @@ -3211,7 +3211,8 @@ static int handle_pmi_common(struct pt_regs *regs,
> u64 status)
> /*
> * Intel Perf metrics
> */
> - if (__test_and_clear_bit(GLOBAL_STATUS_PERF_METRICS_OVF_BIT, (unsigned
> long *)&status)) {
> + if (__test_and_clear_bit(GLOBAL_STATUS_PERF_METRICS_OVF_BIT, (unsigned
> long *)&status) &&
> +
> !is_pebs_counter_event_group(cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS])) {
> handled++;
> static_call(intel_pmu_update_topdown_event)(NULL, NULL);
> }
Yes, we still need to handle METRICS_OVF when fixed counter 3 and the perf
metrics are not included in a PEBS counter group. That ensures the metrics
counts are updated promptly once the PERF_METRICS MSR overflows.
Since more and more bits have been added to the GLOBAL_STATUS MSR over the
past several years, it is no longer correct to execute the code below in the
BUFFER_OVF_BIT handling path:

	status &= intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;

It unconditionally clears METRICS_OVF and any other bits that could be added
in the future. This is incorrect and could introduce potential issues.
Combining Kan's change, I think we can change the code like this:
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 8ef5b9a05fcc..0cf0f95b1af4 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3192,7 +3192,6 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	int bit;
 	int handled = 0;
-	u64 intel_ctrl = hybrid(cpuc->pmu, intel_ctrl);

 	inc_irq_stat(apic_perf_irqs);

@@ -3236,7 +3235,6 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 		handled++;
 		x86_pmu_handle_guest_pebs(regs, &data);
 		static_call(x86_pmu_drain_pebs)(regs, &data);
-		status &= intel_ctrl | GLOBAL_STATUS_TRACE_TOPAPMI;

 		/*
 		 * PMI throttle may be triggered, which stops the PEBS event.

@@ -3269,12 +3267,17 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
 	/*
 	 * Intel Perf metrics
+	 * If the PEBS counter group includes fixed counter 3, the PEBS handler
+	 * updates the topdown events (which is more accurate), so it's
+	 * unnecessary to update them again.
 	 */
-	if (__test_and_clear_bit(GLOBAL_STATUS_PERF_METRICS_OVF_BIT, (unsigned long *)&status)) {
+	if (__test_and_clear_bit(GLOBAL_STATUS_PERF_METRICS_OVF_BIT, (unsigned long *)&status) &&
+	    !is_pebs_counter_event_group(cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS])) {
 		handled++;
 		static_call(intel_pmu_update_topdown_event)(NULL, NULL);
 	}

+	status &= hybrid(cpuc->pmu, intel_ctrl);
+
 	/*
 	 * Checkpointed counters can lead to 'spurious' PMIs because the
 	 * rollback caused by the PMI will have cleared the overflow status
>
> Thanks,
> Kan
>
>
^ permalink raw reply related [flat|nested] 58+ messages in thread
* Re: [Patch v2 12/24] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
2025-02-26 9:46 ` Peter Zijlstra
@ 2025-02-27 2:05 ` Mi, Dapeng
0 siblings, 0 replies; 58+ messages in thread
From: Mi, Dapeng @ 2025-02-27 2:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On 2/26/2025 5:46 PM, Peter Zijlstra wrote:
> On Wed, Feb 26, 2025 at 01:48:52PM +0800, Mi, Dapeng wrote:
>> On 2/25/2025 7:18 PM, Peter Zijlstra wrote:
>>> On Tue, Feb 18, 2025 at 03:28:06PM +0000, Dapeng Mi wrote:
>>>> Arch-PEBS introduces a new MSR IA32_PEBS_BASE to store the arch-PEBS
>>>> buffer physical address. This patch allocates arch-PEBS buffer and then
>>>> initialize IA32_PEBS_BASE MSR with the buffer physical address.
>>> Not loving how this patch obscures the whole DS area thing and naming.
>> arch-PEBS uses a totally independent buffer to save the PEBS records and
>> don't use the debug store area anymore. To reuse the original function as
>> much as possible and don't mislead users to think arch-PEBS has some
>> relationship with debug store, the original key word "ds" in the function
>> names are changed to "BTS_PEBS". I know the name maybe not perfect, do you
>> have any suggestion? Thanks.
> Right, so I realize it has a new buffer, but why do you need to make it
> all complicated like this?
>
> Just leave the existing stuff and stick the new arch pebs buffer
> somewhere new. All that reserve nonsense shouldn't be needed anymore.
>
> Just add it to the intel_pmu_cpu_{prepare,starting,dying,dead} things.
Sure, will do.
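A minimal sketch of that direction (illustrative only; arch_pebs_alloc(),
arch_pebs_release() and the pebs_vaddr field are placeholder names, and
IA32_PEBS_BASE would still be programmed from the cpu_starting path):

    static int arch_pebs_alloc(struct cpu_hw_events *cpuc, int cpu)
    {
            struct page *page;

            /* one per-CPU arch-PEBS buffer, sized like the legacy PEBS buffer */
            page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO,
                                    get_order(x86_pmu.pebs_buffer_size));
            if (!page)
                    return -ENOMEM;

            cpuc->pebs_vaddr = page_address(page);
            return 0;
    }

    static void arch_pebs_release(struct cpu_hw_events *cpuc)
    {
            if (!cpuc->pebs_vaddr)
                    return;

            free_pages((unsigned long)cpuc->pebs_vaddr,
                       get_order(x86_pmu.pebs_buffer_size));
            cpuc->pebs_vaddr = NULL;
    }

intel_pmu_cpu_prepare() would call arch_pebs_alloc() and intel_pmu_cpu_dead()
would call arch_pebs_release(), without touching the DS reserve machinery.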
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 12/24] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
2025-02-26 9:48 ` Peter Zijlstra
@ 2025-02-27 2:09 ` Mi, Dapeng
0 siblings, 0 replies; 58+ messages in thread
From: Mi, Dapeng @ 2025-02-27 2:09 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On 2/26/2025 5:48 PM, Peter Zijlstra wrote:
> On Wed, Feb 26, 2025 at 02:19:15PM +0800, Mi, Dapeng wrote:
>> On 2/25/2025 7:25 PM, Peter Zijlstra wrote:
>>> On Tue, Feb 18, 2025 at 03:28:06PM +0000, Dapeng Mi wrote:
>>>> Arch-PEBS introduces a new MSR IA32_PEBS_BASE to store the arch-PEBS
>>>> buffer physical address. This patch allocates arch-PEBS buffer and then
>>>> initialize IA32_PEBS_BASE MSR with the buffer physical address.
>>> Just to clarify, parts with ARCH PEBS will not have BTS and thus not
>>> have DS?
>> No, DS and BTS still exist along with arch-PEBS, only the legacy DS based
>> PEBS is unavailable and replaced by arch-PEBS.
> Joy. Is anybody still using BTS now that we have PT? I thought PT was
> supposed to be the better BTS.
Yeah, I suppose no one still uses BTS, but I suppose it would take a long
time to drop BTS from the hardware.
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 15/24] perf/x86/intel: Add SSP register support for arch-PEBS
2025-02-25 20:44 ` Andi Kleen
@ 2025-02-27 6:29 ` Mi, Dapeng
0 siblings, 0 replies; 58+ messages in thread
From: Mi, Dapeng @ 2025-02-27 6:29 UTC (permalink / raw)
To: Andi Kleen, Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Eranian Stephane,
linux-kernel, linux-perf-users, Dapeng Mi
On 2/26/2025 4:44 AM, Andi Kleen wrote:
> On Tue, Feb 25, 2025 at 12:54:50PM +0100, Peter Zijlstra wrote:
>> On Tue, Feb 18, 2025 at 03:28:09PM +0000, Dapeng Mi wrote:
>>
>>> @@ -651,6 +651,16 @@ int x86_pmu_hw_config(struct perf_event *event)
>>> return -EINVAL;
>>> }
>>>
>>> + /* sample_regs_user never support SSP register. */
>>> + if (unlikely(event->attr.sample_regs_user & BIT_ULL(PERF_REG_X86_SSP)))
>>> + return -EINVAL;
>> We can easily enough read user SSP, no?
> Not for multi record PEBS.
>
> Also technically it may not be precise.
Yes, if we need to support capturing the user-space SSP, then PEBS has to
fall back to capturing only a single record.
Andi, you said "it may not be precise"; could you please give more details?
I didn't get this. Thanks.
>
> -andi
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 18/24] perf/x86/intel: Support arch-PEBS vector registers group capturing
2025-02-26 8:08 ` Mi, Dapeng
@ 2025-02-27 6:40 ` Mi, Dapeng
2025-03-04 3:08 ` Mi, Dapeng
0 siblings, 1 reply; 58+ messages in thread
From: Mi, Dapeng @ 2025-02-27 6:40 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On 2/26/2025 4:08 PM, Mi, Dapeng wrote:
> On 2/25/2025 11:32 PM, Peter Zijlstra wrote:
>> On Tue, Feb 18, 2025 at 03:28:12PM +0000, Dapeng Mi wrote:
>>> Add x86/intel specific vector register (VECR) group capturing for
>>> arch-PEBS. Enable corresponding VECR group bits in
>>> GPx_CFG_C/FX0_CFG_C MSRs if users configures these vector registers
>>> bitmap in perf_event_attr and parse VECR group in arch-PEBS record.
>>>
>>> Currently vector registers capturing is only supported by PEBS based
>>> sampling, PMU driver would return error if PMI based sampling tries to
>>> capture these vector registers.
>>> @@ -676,6 +709,32 @@ int x86_pmu_hw_config(struct perf_event *event)
>>> return -EINVAL;
>>> }
>>>
>>> + /*
>>> + * Architectural PEBS supports to capture more vector registers besides
>>> + * XMM registers, like YMM, OPMASK and ZMM registers.
>>> + */
>>> + if (unlikely(has_more_extended_regs(event))) {
>>> + u64 caps = hybrid(event->pmu, arch_pebs_cap).caps;
>>> +
>>> + if (!(event->pmu->capabilities & PERF_PMU_CAP_MORE_EXT_REGS))
>>> + return -EINVAL;
>>> +
>>> + if (has_opmask_regs(event) && !(caps & ARCH_PEBS_VECR_OPMASK))
>>> + return -EINVAL;
>>> +
>>> + if (has_ymmh_regs(event) && !(caps & ARCH_PEBS_VECR_YMM))
>>> + return -EINVAL;
>>> +
>>> + if (has_zmmh_regs(event) && !(caps & ARCH_PEBS_VECR_ZMMH))
>>> + return -EINVAL;
>>> +
>>> + if (has_h16zmm_regs(event) && !(caps & ARCH_PEBS_VECR_H16ZMM))
>>> + return -EINVAL;
>>> +
>>> + if (!event->attr.precise_ip)
>>> + return -EINVAL;
>>> + }
>>> +
>>> return x86_setup_perfctr(event);
>>> }
>>>
>>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>>> index f21d9f283445..8ef5b9a05fcc 100644
>>> --- a/arch/x86/events/intel/core.c
>>> +++ b/arch/x86/events/intel/core.c
>>> @@ -2963,6 +2963,18 @@ static void intel_pmu_enable_event_ext(struct perf_event *event)
>>> if (pebs_data_cfg & PEBS_DATACFG_XMMS)
>>> ext |= ARCH_PEBS_VECR_XMM & cap.caps;
>>>
>>> + if (pebs_data_cfg & PEBS_DATACFG_YMMS)
>>> + ext |= ARCH_PEBS_VECR_YMM & cap.caps;
>>> +
>>> + if (pebs_data_cfg & PEBS_DATACFG_OPMASKS)
>>> + ext |= ARCH_PEBS_VECR_OPMASK & cap.caps;
>>> +
>>> + if (pebs_data_cfg & PEBS_DATACFG_ZMMHS)
>>> + ext |= ARCH_PEBS_VECR_ZMMH & cap.caps;
>>> +
>>> + if (pebs_data_cfg & PEBS_DATACFG_H16ZMMS)
>>> + ext |= ARCH_PEBS_VECR_H16ZMM & cap.caps;
>>> +
>>> if (pebs_data_cfg & PEBS_DATACFG_LBRS)
>>> ext |= ARCH_PEBS_LBR & cap.caps;
>>>
>>> @@ -5115,6 +5127,9 @@ static inline void __intel_update_pmu_caps(struct pmu *pmu)
>>>
>>> if (hybrid(pmu, arch_pebs_cap).caps & ARCH_PEBS_VECR_XMM)
>>> dest_pmu->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
>>> +
>>> + if (hybrid(pmu, arch_pebs_cap).caps & ARCH_PEBS_VECR_EXT)
>>> + dest_pmu->capabilities |= PERF_PMU_CAP_MORE_EXT_REGS;
>>> }
>> There is no technical reason for it to error out, right? We can use
>> FPU/XSAVE interface to read the CPU state just fine.
> I think it's not because of technical reason. Let me confirm if we can add
> it for non-PEBS sampling.
Hi Peter,
Just to double confirm: do you want only PEBS sampling to support capturing
SSP and these vector registers, for both *interrupt* and *user space*? Or,
going further, do you want PMI based sampling to also support capturing SSP
and these vector registers? Thanks.
>
>
>>
>>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>>> index 4b01beee15f4..7e5a4202de37 100644
>>> --- a/arch/x86/events/intel/ds.c
>>> +++ b/arch/x86/events/intel/ds.c
>>> @@ -1437,9 +1438,37 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
>>> if (gprs || (attr->precise_ip < 2) || tsx_weight)
>>> pebs_data_cfg |= PEBS_DATACFG_GP;
>>>
>>> - if ((sample_type & PERF_SAMPLE_REGS_INTR) &&
>>> - (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK))
>>> - pebs_data_cfg |= PEBS_DATACFG_XMMS;
>>> + if (sample_type & PERF_SAMPLE_REGS_INTR) {
>>> + if (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK)
>>> + pebs_data_cfg |= PEBS_DATACFG_XMMS;
>>> +
>>> + for_each_set_bit_from(bit,
>>> + (unsigned long *)event->attr.sample_regs_intr_ext,
>>> + PERF_NUM_EXT_REGS) {
>> This is indented wrong; please use cino=(0:0
>> if you worry about indentation depth, break out in helper function.
> Sure. would modify it.
>
>
>>> + switch (bit + PERF_REG_EXTENDED_OFFSET) {
>>> + case PERF_REG_X86_OPMASK0 ... PERF_REG_X86_OPMASK7:
>>> + pebs_data_cfg |= PEBS_DATACFG_OPMASKS;
>>> + bit = PERF_REG_X86_YMMH0 -
>>> + PERF_REG_EXTENDED_OFFSET - 1;
>>> + break;
>>> + case PERF_REG_X86_YMMH0 ... PERF_REG_X86_ZMMH0 - 1:
>>> + pebs_data_cfg |= PEBS_DATACFG_YMMS;
>>> + bit = PERF_REG_X86_ZMMH0 -
>>> + PERF_REG_EXTENDED_OFFSET - 1;
>>> + break;
>>> + case PERF_REG_X86_ZMMH0 ... PERF_REG_X86_ZMM16 - 1:
>>> + pebs_data_cfg |= PEBS_DATACFG_ZMMHS;
>>> + bit = PERF_REG_X86_ZMM16 -
>>> + PERF_REG_EXTENDED_OFFSET - 1;
>>> + break;
>>> + case PERF_REG_X86_ZMM16 ... PERF_REG_X86_ZMM_MAX - 1:
>>> + pebs_data_cfg |= PEBS_DATACFG_H16ZMMS;
>>> + bit = PERF_REG_X86_ZMM_MAX -
>>> + PERF_REG_EXTENDED_OFFSET - 1;
>>> + break;
>>> + }
>>> + }
>>> + }
>>>
>>> if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
>>> /*
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 13/24] perf/x86/intel: Update dyn_constraint based on PEBS event precise level
2025-02-18 15:28 ` [Patch v2 13/24] perf/x86/intel: Update dyn_constraint based on PEBS event precise level Dapeng Mi
@ 2025-02-27 14:06 ` Liang, Kan
2025-03-05 1:41 ` Mi, Dapeng
0 siblings, 1 reply; 58+ messages in thread
From: Liang, Kan @ 2025-02-27 14:06 UTC (permalink / raw)
To: Dapeng Mi, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi
On 2025-02-18 10:28 a.m., Dapeng Mi wrote:
> arch-PEBS provides CPUIDs to enumerate which counters support PEBS
> sampling and precise-distribution PEBS sampling. Thus PEBS constraints
> should be dynamically configured based on these counter and precise
> distribution bitmaps instead of being defined statically.
>
> Update the event's dyn_constraint based on the PEBS event's precise level.
>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
> arch/x86/events/intel/core.c | 6 ++++++
> arch/x86/events/intel/ds.c | 1 +
> 2 files changed, 7 insertions(+)
>
> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 472366c3db22..c777e0531d40 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -4033,6 +4033,8 @@ static int intel_pmu_hw_config(struct perf_event *event)
> return ret;
>
> if (event->attr.precise_ip) {
> + struct arch_pebs_cap pebs_cap = hybrid(event->pmu, arch_pebs_cap);
> +
> if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == INTEL_FIXED_VLBR_EVENT)
> return -EINVAL;
>
> @@ -4046,6 +4048,10 @@ static int intel_pmu_hw_config(struct perf_event *event)
> }
> if (x86_pmu.pebs_aliases)
> x86_pmu.pebs_aliases(event);
> +
> + if (x86_pmu.arch_pebs)
> + event->hw.dyn_constraint = event->attr.precise_ip >= 3 ?
> + pebs_cap.pdists : pebs_cap.counters;
> }
The dyn_constraint is only required when the counter mask is different.
I think pebs_cap.counters will very likely be the same as the regular
counter mask. Maybe something like the below (not tested).
	if (x86_pmu.arch_pebs) {
		u64 cntr_mask = event->attr.precise_ip >= 3 ?
				pebs_cap.pdists : pebs_cap.counters;

		if (cntr_mask != hybrid(event->pmu, intel_ctrl))
			event->hw.dyn_constraint = cntr_mask;
	}
Thanks,
Kan
>
> if (needs_branch_stack(event)) {
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 519767fc9180..615aefb4e52e 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -2948,6 +2948,7 @@ static void __init intel_arch_pebs_init(void)
> x86_pmu.pebs_buffer_size = PEBS_BUFFER_SIZE;
> x86_pmu.drain_pebs = intel_pmu_drain_arch_pebs;
> x86_pmu.pebs_capable = ~0ULL;
> + x86_pmu.flags |= PMU_FL_PEBS_ALL;
>
> x86_pmu.pebs_enable = __intel_pmu_pebs_enable;
> x86_pmu.pebs_disable = __intel_pmu_pebs_disable;
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 18/24] perf/x86/intel: Support arch-PEBS vector registers group capturing
2025-02-27 6:40 ` Mi, Dapeng
@ 2025-03-04 3:08 ` Mi, Dapeng
2025-03-04 16:26 ` Liang, Kan
0 siblings, 1 reply; 58+ messages in thread
From: Mi, Dapeng @ 2025-03-04 3:08 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On 2/27/2025 2:40 PM, Mi, Dapeng wrote:
> On 2/26/2025 4:08 PM, Mi, Dapeng wrote:
>> On 2/25/2025 11:32 PM, Peter Zijlstra wrote:
>>> On Tue, Feb 18, 2025 at 03:28:12PM +0000, Dapeng Mi wrote:
>>>> Add x86/intel specific vector register (VECR) group capturing for
>>>> arch-PEBS. Enable corresponding VECR group bits in
>>>> GPx_CFG_C/FX0_CFG_C MSRs if users configures these vector registers
>>>> bitmap in perf_event_attr and parse VECR group in arch-PEBS record.
>>>>
>>>> Currently vector registers capturing is only supported by PEBS based
>>>> sampling, PMU driver would return error if PMI based sampling tries to
>>>> capture these vector registers.
>>>> @@ -676,6 +709,32 @@ int x86_pmu_hw_config(struct perf_event *event)
>>>> return -EINVAL;
>>>> }
>>>>
>>>> + /*
>>>> + * Architectural PEBS supports to capture more vector registers besides
>>>> + * XMM registers, like YMM, OPMASK and ZMM registers.
>>>> + */
>>>> + if (unlikely(has_more_extended_regs(event))) {
>>>> + u64 caps = hybrid(event->pmu, arch_pebs_cap).caps;
>>>> +
>>>> + if (!(event->pmu->capabilities & PERF_PMU_CAP_MORE_EXT_REGS))
>>>> + return -EINVAL;
>>>> +
>>>> + if (has_opmask_regs(event) && !(caps & ARCH_PEBS_VECR_OPMASK))
>>>> + return -EINVAL;
>>>> +
>>>> + if (has_ymmh_regs(event) && !(caps & ARCH_PEBS_VECR_YMM))
>>>> + return -EINVAL;
>>>> +
>>>> + if (has_zmmh_regs(event) && !(caps & ARCH_PEBS_VECR_ZMMH))
>>>> + return -EINVAL;
>>>> +
>>>> + if (has_h16zmm_regs(event) && !(caps & ARCH_PEBS_VECR_H16ZMM))
>>>> + return -EINVAL;
>>>> +
>>>> + if (!event->attr.precise_ip)
>>>> + return -EINVAL;
>>>> + }
>>>> +
>>>> return x86_setup_perfctr(event);
>>>> }
>>>>
>>>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>>>> index f21d9f283445..8ef5b9a05fcc 100644
>>>> --- a/arch/x86/events/intel/core.c
>>>> +++ b/arch/x86/events/intel/core.c
>>>> @@ -2963,6 +2963,18 @@ static void intel_pmu_enable_event_ext(struct perf_event *event)
>>>> if (pebs_data_cfg & PEBS_DATACFG_XMMS)
>>>> ext |= ARCH_PEBS_VECR_XMM & cap.caps;
>>>>
>>>> + if (pebs_data_cfg & PEBS_DATACFG_YMMS)
>>>> + ext |= ARCH_PEBS_VECR_YMM & cap.caps;
>>>> +
>>>> + if (pebs_data_cfg & PEBS_DATACFG_OPMASKS)
>>>> + ext |= ARCH_PEBS_VECR_OPMASK & cap.caps;
>>>> +
>>>> + if (pebs_data_cfg & PEBS_DATACFG_ZMMHS)
>>>> + ext |= ARCH_PEBS_VECR_ZMMH & cap.caps;
>>>> +
>>>> + if (pebs_data_cfg & PEBS_DATACFG_H16ZMMS)
>>>> + ext |= ARCH_PEBS_VECR_H16ZMM & cap.caps;
>>>> +
>>>> if (pebs_data_cfg & PEBS_DATACFG_LBRS)
>>>> ext |= ARCH_PEBS_LBR & cap.caps;
>>>>
>>>> @@ -5115,6 +5127,9 @@ static inline void __intel_update_pmu_caps(struct pmu *pmu)
>>>>
>>>> if (hybrid(pmu, arch_pebs_cap).caps & ARCH_PEBS_VECR_XMM)
>>>> dest_pmu->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
>>>> +
>>>> + if (hybrid(pmu, arch_pebs_cap).caps & ARCH_PEBS_VECR_EXT)
>>>> + dest_pmu->capabilities |= PERF_PMU_CAP_MORE_EXT_REGS;
>>>> }
>>> There is no technical reason for it to error out, right? We can use
>>> FPU/XSAVE interface to read the CPU state just fine.
>> I think it's not because of technical reason. Let me confirm if we can add
>> it for non-PEBS sampling.
> Hi Peter,
>
> Just double confirm, you want only PEBS sampling supports to capture SSP
> and these vector registers for both *interrupt* and *user space*? or
> further, you want PMI based sampling can also support to capture SSP and
> these vector registers? Thanks.
Hi Peter,
May I know your opinion on this? Thanks.
>
>>
>>>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>>>> index 4b01beee15f4..7e5a4202de37 100644
>>>> --- a/arch/x86/events/intel/ds.c
>>>> +++ b/arch/x86/events/intel/ds.c
>>>> @@ -1437,9 +1438,37 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
>>>> if (gprs || (attr->precise_ip < 2) || tsx_weight)
>>>> pebs_data_cfg |= PEBS_DATACFG_GP;
>>>>
>>>> - if ((sample_type & PERF_SAMPLE_REGS_INTR) &&
>>>> - (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK))
>>>> - pebs_data_cfg |= PEBS_DATACFG_XMMS;
>>>> + if (sample_type & PERF_SAMPLE_REGS_INTR) {
>>>> + if (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK)
>>>> + pebs_data_cfg |= PEBS_DATACFG_XMMS;
>>>> +
>>>> + for_each_set_bit_from(bit,
>>>> + (unsigned long *)event->attr.sample_regs_intr_ext,
>>>> + PERF_NUM_EXT_REGS) {
>>> This is indented wrong; please use cino=(0:0
>>> if you worry about indentation depth, break out in helper function.
>> Sure. would modify it.
>>
>>
>>>> + switch (bit + PERF_REG_EXTENDED_OFFSET) {
>>>> + case PERF_REG_X86_OPMASK0 ... PERF_REG_X86_OPMASK7:
>>>> + pebs_data_cfg |= PEBS_DATACFG_OPMASKS;
>>>> + bit = PERF_REG_X86_YMMH0 -
>>>> + PERF_REG_EXTENDED_OFFSET - 1;
>>>> + break;
>>>> + case PERF_REG_X86_YMMH0 ... PERF_REG_X86_ZMMH0 - 1:
>>>> + pebs_data_cfg |= PEBS_DATACFG_YMMS;
>>>> + bit = PERF_REG_X86_ZMMH0 -
>>>> + PERF_REG_EXTENDED_OFFSET - 1;
>>>> + break;
>>>> + case PERF_REG_X86_ZMMH0 ... PERF_REG_X86_ZMM16 - 1:
>>>> + pebs_data_cfg |= PEBS_DATACFG_ZMMHS;
>>>> + bit = PERF_REG_X86_ZMM16 -
>>>> + PERF_REG_EXTENDED_OFFSET - 1;
>>>> + break;
>>>> + case PERF_REG_X86_ZMM16 ... PERF_REG_X86_ZMM_MAX - 1:
>>>> + pebs_data_cfg |= PEBS_DATACFG_H16ZMMS;
>>>> + bit = PERF_REG_X86_ZMM_MAX -
>>>> + PERF_REG_EXTENDED_OFFSET - 1;
>>>> + break;
>>>> + }
>>>> + }
>>>> + }
>>>>
>>>> if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
>>>> /*
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 18/24] perf/x86/intel: Support arch-PEBS vector registers group capturing
2025-03-04 3:08 ` Mi, Dapeng
@ 2025-03-04 16:26 ` Liang, Kan
2025-03-05 1:34 ` Mi, Dapeng
0 siblings, 1 reply; 58+ messages in thread
From: Liang, Kan @ 2025-03-04 16:26 UTC (permalink / raw)
To: Mi, Dapeng, Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Andi Kleen, Eranian Stephane,
linux-kernel, linux-perf-users, Dapeng Mi
On 2025-03-03 10:08 p.m., Mi, Dapeng wrote:
>
> On 2/27/2025 2:40 PM, Mi, Dapeng wrote:
>> On 2/26/2025 4:08 PM, Mi, Dapeng wrote:
>>> On 2/25/2025 11:32 PM, Peter Zijlstra wrote:
>>>> On Tue, Feb 18, 2025 at 03:28:12PM +0000, Dapeng Mi wrote:
>>>>> Add x86/intel specific vector register (VECR) group capturing for
>>>>> arch-PEBS. Enable corresponding VECR group bits in
>>>>> GPx_CFG_C/FX0_CFG_C MSRs if users configures these vector registers
>>>>> bitmap in perf_event_attr and parse VECR group in arch-PEBS record.
>>>>>
>>>>> Currently vector registers capturing is only supported by PEBS based
>>>>> sampling, PMU driver would return error if PMI based sampling tries to
>>>>> capture these vector registers.
>>>>> @@ -676,6 +709,32 @@ int x86_pmu_hw_config(struct perf_event *event)
>>>>> return -EINVAL;
>>>>> }
>>>>>
>>>>> + /*
>>>>> + * Architectural PEBS supports to capture more vector registers besides
>>>>> + * XMM registers, like YMM, OPMASK and ZMM registers.
>>>>> + */
>>>>> + if (unlikely(has_more_extended_regs(event))) {
>>>>> + u64 caps = hybrid(event->pmu, arch_pebs_cap).caps;
>>>>> +
>>>>> + if (!(event->pmu->capabilities & PERF_PMU_CAP_MORE_EXT_REGS))
>>>>> + return -EINVAL;
>>>>> +
>>>>> + if (has_opmask_regs(event) && !(caps & ARCH_PEBS_VECR_OPMASK))
>>>>> + return -EINVAL;
>>>>> +
>>>>> + if (has_ymmh_regs(event) && !(caps & ARCH_PEBS_VECR_YMM))
>>>>> + return -EINVAL;
>>>>> +
>>>>> + if (has_zmmh_regs(event) && !(caps & ARCH_PEBS_VECR_ZMMH))
>>>>> + return -EINVAL;
>>>>> +
>>>>> + if (has_h16zmm_regs(event) && !(caps & ARCH_PEBS_VECR_H16ZMM))
>>>>> + return -EINVAL;
>>>>> +
>>>>> + if (!event->attr.precise_ip)
>>>>> + return -EINVAL;
>>>>> + }
>>>>> +
>>>>> return x86_setup_perfctr(event);
>>>>> }
>>>>>
>>>>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>>>>> index f21d9f283445..8ef5b9a05fcc 100644
>>>>> --- a/arch/x86/events/intel/core.c
>>>>> +++ b/arch/x86/events/intel/core.c
>>>>> @@ -2963,6 +2963,18 @@ static void intel_pmu_enable_event_ext(struct perf_event *event)
>>>>> if (pebs_data_cfg & PEBS_DATACFG_XMMS)
>>>>> ext |= ARCH_PEBS_VECR_XMM & cap.caps;
>>>>>
>>>>> + if (pebs_data_cfg & PEBS_DATACFG_YMMS)
>>>>> + ext |= ARCH_PEBS_VECR_YMM & cap.caps;
>>>>> +
>>>>> + if (pebs_data_cfg & PEBS_DATACFG_OPMASKS)
>>>>> + ext |= ARCH_PEBS_VECR_OPMASK & cap.caps;
>>>>> +
>>>>> + if (pebs_data_cfg & PEBS_DATACFG_ZMMHS)
>>>>> + ext |= ARCH_PEBS_VECR_ZMMH & cap.caps;
>>>>> +
>>>>> + if (pebs_data_cfg & PEBS_DATACFG_H16ZMMS)
>>>>> + ext |= ARCH_PEBS_VECR_H16ZMM & cap.caps;
>>>>> +
>>>>> if (pebs_data_cfg & PEBS_DATACFG_LBRS)
>>>>> ext |= ARCH_PEBS_LBR & cap.caps;
>>>>>
>>>>> @@ -5115,6 +5127,9 @@ static inline void __intel_update_pmu_caps(struct pmu *pmu)
>>>>>
>>>>> if (hybrid(pmu, arch_pebs_cap).caps & ARCH_PEBS_VECR_XMM)
>>>>> dest_pmu->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
>>>>> +
>>>>> + if (hybrid(pmu, arch_pebs_cap).caps & ARCH_PEBS_VECR_EXT)
>>>>> + dest_pmu->capabilities |= PERF_PMU_CAP_MORE_EXT_REGS;
>>>>> }
>>>> There is no technical reason for it to error out, right? We can use
>>>> FPU/XSAVE interface to read the CPU state just fine.
>>> I think it's not because of technical reason. Let me confirm if we can add
>>> it for non-PEBS sampling.
>> Hi Peter,
>>
>> Just double confirm, you want only PEBS sampling supports to capture SSP
>> and these vector registers for both *interrupt* and *user space*? or
>> further, you want PMI based sampling can also support to capture SSP and
>> these vector registers? Thanks.
I think one of the main reasons to add the vector registers to PEBS records
is large PEBS: perf can get all the registers of interest without taking a
PMI for each sample.
Technically, I don't think there is a problem with supporting them in
non-PEBS PMI sampling, but I'm not sure it's useful in practice.
REGS_USER should be more useful, and large PEBS is also available there as
long as exclude_kernel is set.
In my opinion, we may support the new vector registers only for REGS_USER
and REGS_INTR with PEBS events for now, and add support for non-PEBS events
later if there is a requirement.
Thanks,
Kan
>
> Hi Peter,
>
> May I know your opinion on this? Thanks.
>
>
>>
>>>
>>>>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>>>>> index 4b01beee15f4..7e5a4202de37 100644
>>>>> --- a/arch/x86/events/intel/ds.c
>>>>> +++ b/arch/x86/events/intel/ds.c
>>>>> @@ -1437,9 +1438,37 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
>>>>> if (gprs || (attr->precise_ip < 2) || tsx_weight)
>>>>> pebs_data_cfg |= PEBS_DATACFG_GP;
>>>>>
>>>>> - if ((sample_type & PERF_SAMPLE_REGS_INTR) &&
>>>>> - (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK))
>>>>> - pebs_data_cfg |= PEBS_DATACFG_XMMS;
>>>>> + if (sample_type & PERF_SAMPLE_REGS_INTR) {
>>>>> + if (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK)
>>>>> + pebs_data_cfg |= PEBS_DATACFG_XMMS;
>>>>> +
>>>>> + for_each_set_bit_from(bit,
>>>>> + (unsigned long *)event->attr.sample_regs_intr_ext,
>>>>> + PERF_NUM_EXT_REGS) {
>>>> This is indented wrong; please use cino=(0:0
>>>> if you worry about indentation depth, break out in helper function.
>>> Sure. would modify it.
>>>
>>>
>>>>> + switch (bit + PERF_REG_EXTENDED_OFFSET) {
>>>>> + case PERF_REG_X86_OPMASK0 ... PERF_REG_X86_OPMASK7:
>>>>> + pebs_data_cfg |= PEBS_DATACFG_OPMASKS;
>>>>> + bit = PERF_REG_X86_YMMH0 -
>>>>> + PERF_REG_EXTENDED_OFFSET - 1;
>>>>> + break;
>>>>> + case PERF_REG_X86_YMMH0 ... PERF_REG_X86_ZMMH0 - 1:
>>>>> + pebs_data_cfg |= PEBS_DATACFG_YMMS;
>>>>> + bit = PERF_REG_X86_ZMMH0 -
>>>>> + PERF_REG_EXTENDED_OFFSET - 1;
>>>>> + break;
>>>>> + case PERF_REG_X86_ZMMH0 ... PERF_REG_X86_ZMM16 - 1:
>>>>> + pebs_data_cfg |= PEBS_DATACFG_ZMMHS;
>>>>> + bit = PERF_REG_X86_ZMM16 -
>>>>> + PERF_REG_EXTENDED_OFFSET - 1;
>>>>> + break;
>>>>> + case PERF_REG_X86_ZMM16 ... PERF_REG_X86_ZMM_MAX - 1:
>>>>> + pebs_data_cfg |= PEBS_DATACFG_H16ZMMS;
>>>>> + bit = PERF_REG_X86_ZMM_MAX -
>>>>> + PERF_REG_EXTENDED_OFFSET - 1;
>>>>> + break;
>>>>> + }
>>>>> + }
>>>>> + }
>>>>>
>>>>> if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
>>>>> /*
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 18/24] perf/x86/intel: Support arch-PEBS vector registers group capturing
2025-03-04 16:26 ` Liang, Kan
@ 2025-03-05 1:34 ` Mi, Dapeng
0 siblings, 0 replies; 58+ messages in thread
From: Mi, Dapeng @ 2025-03-05 1:34 UTC (permalink / raw)
To: Liang, Kan, Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Andi Kleen, Eranian Stephane,
linux-kernel, linux-perf-users, Dapeng Mi
On 3/5/2025 12:26 AM, Liang, Kan wrote:
>
> On 2025-03-03 10:08 p.m., Mi, Dapeng wrote:
>> On 2/27/2025 2:40 PM, Mi, Dapeng wrote:
>>> On 2/26/2025 4:08 PM, Mi, Dapeng wrote:
>>>> On 2/25/2025 11:32 PM, Peter Zijlstra wrote:
>>>>> On Tue, Feb 18, 2025 at 03:28:12PM +0000, Dapeng Mi wrote:
>>>>>> Add x86/intel specific vector register (VECR) group capturing for
>>>>>> arch-PEBS. Enable corresponding VECR group bits in
>>>>>> GPx_CFG_C/FX0_CFG_C MSRs if users configures these vector registers
>>>>>> bitmap in perf_event_attr and parse VECR group in arch-PEBS record.
>>>>>>
>>>>>> Currently vector registers capturing is only supported by PEBS based
>>>>>> sampling, PMU driver would return error if PMI based sampling tries to
>>>>>> capture these vector registers.
>>>>>> @@ -676,6 +709,32 @@ int x86_pmu_hw_config(struct perf_event *event)
>>>>>> return -EINVAL;
>>>>>> }
>>>>>>
>>>>>> + /*
>>>>>> + * Architectural PEBS supports to capture more vector registers besides
>>>>>> + * XMM registers, like YMM, OPMASK and ZMM registers.
>>>>>> + */
>>>>>> + if (unlikely(has_more_extended_regs(event))) {
>>>>>> + u64 caps = hybrid(event->pmu, arch_pebs_cap).caps;
>>>>>> +
>>>>>> + if (!(event->pmu->capabilities & PERF_PMU_CAP_MORE_EXT_REGS))
>>>>>> + return -EINVAL;
>>>>>> +
>>>>>> + if (has_opmask_regs(event) && !(caps & ARCH_PEBS_VECR_OPMASK))
>>>>>> + return -EINVAL;
>>>>>> +
>>>>>> + if (has_ymmh_regs(event) && !(caps & ARCH_PEBS_VECR_YMM))
>>>>>> + return -EINVAL;
>>>>>> +
>>>>>> + if (has_zmmh_regs(event) && !(caps & ARCH_PEBS_VECR_ZMMH))
>>>>>> + return -EINVAL;
>>>>>> +
>>>>>> + if (has_h16zmm_regs(event) && !(caps & ARCH_PEBS_VECR_H16ZMM))
>>>>>> + return -EINVAL;
>>>>>> +
>>>>>> + if (!event->attr.precise_ip)
>>>>>> + return -EINVAL;
>>>>>> + }
>>>>>> +
>>>>>> return x86_setup_perfctr(event);
>>>>>> }
>>>>>>
>>>>>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>>>>>> index f21d9f283445..8ef5b9a05fcc 100644
>>>>>> --- a/arch/x86/events/intel/core.c
>>>>>> +++ b/arch/x86/events/intel/core.c
>>>>>> @@ -2963,6 +2963,18 @@ static void intel_pmu_enable_event_ext(struct perf_event *event)
>>>>>> if (pebs_data_cfg & PEBS_DATACFG_XMMS)
>>>>>> ext |= ARCH_PEBS_VECR_XMM & cap.caps;
>>>>>>
>>>>>> + if (pebs_data_cfg & PEBS_DATACFG_YMMS)
>>>>>> + ext |= ARCH_PEBS_VECR_YMM & cap.caps;
>>>>>> +
>>>>>> + if (pebs_data_cfg & PEBS_DATACFG_OPMASKS)
>>>>>> + ext |= ARCH_PEBS_VECR_OPMASK & cap.caps;
>>>>>> +
>>>>>> + if (pebs_data_cfg & PEBS_DATACFG_ZMMHS)
>>>>>> + ext |= ARCH_PEBS_VECR_ZMMH & cap.caps;
>>>>>> +
>>>>>> + if (pebs_data_cfg & PEBS_DATACFG_H16ZMMS)
>>>>>> + ext |= ARCH_PEBS_VECR_H16ZMM & cap.caps;
>>>>>> +
>>>>>> if (pebs_data_cfg & PEBS_DATACFG_LBRS)
>>>>>> ext |= ARCH_PEBS_LBR & cap.caps;
>>>>>>
>>>>>> @@ -5115,6 +5127,9 @@ static inline void __intel_update_pmu_caps(struct pmu *pmu)
>>>>>>
>>>>>> if (hybrid(pmu, arch_pebs_cap).caps & ARCH_PEBS_VECR_XMM)
>>>>>> dest_pmu->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
>>>>>> +
>>>>>> + if (hybrid(pmu, arch_pebs_cap).caps & ARCH_PEBS_VECR_EXT)
>>>>>> + dest_pmu->capabilities |= PERF_PMU_CAP_MORE_EXT_REGS;
>>>>>> }
>>>>> There is no technical reason for it to error out, right? We can use
>>>>> FPU/XSAVE interface to read the CPU state just fine.
>>>> I think it's not because of technical reason. Let me confirm if we can add
>>>> it for non-PEBS sampling.
>>> Hi Peter,
>>>
>>> Just double confirm, you want only PEBS sampling supports to capture SSP
>>> and these vector registers for both *interrupt* and *user space*? or
>>> further, you want PMI based sampling can also support to capture SSP and
>>> these vector registers? Thanks.
> I think one of the main reasons to add the vector registers into PEBS
> records is because of the large PEBS. So perf can get all the interested
> registers and avoid a PMI for each sample.
> Technically, I don't think there is a problem supporting them in
> non-PEBS PMI sampling. But I'm not sure if it's useful in practice.
>
> The REGS_USER should be more useful. The large PEBS is also available as
> long as exclude_kernel.
>
> In my opinion, we may only support the new vector registers for both
> REGS_USER and REGS_INTR with PEBS events for now. We can add the support
> for non-PEBS events later if there is a requirement.
Yes, agreed. I plan to support these newly added registers for both
REGS_USER and REGS_INTR in the v3 version. If anyone has a different
opinion, please let me know. Thanks.
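Roughly, the x86_pmu_hw_config() check would then become something like the
sketch below (has_more_extended_user_regs() is a hypothetical counterpart of
the existing _intr helper; the per-group ARCH_PEBS_VECR_* checks stay as in
this patch):

    if (unlikely(has_more_extended_regs(event) ||
                 has_more_extended_user_regs(event))) {
            if (!(event->pmu->capabilities & PERF_PMU_CAP_MORE_EXT_REGS))
                    return -EINVAL;

            /* per-group ARCH_PEBS_VECR_* capability checks as in this patch */

            /* keep these registers PEBS-only for now, as discussed above */
            if (!event->attr.precise_ip)
                    return -EINVAL;
    }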
>
> Thanks,
> Kan
>
>> Hi Peter,
>>
>> May I know your opinion on this? Thanks.
>>
>>
>>>>>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>>>>>> index 4b01beee15f4..7e5a4202de37 100644
>>>>>> --- a/arch/x86/events/intel/ds.c
>>>>>> +++ b/arch/x86/events/intel/ds.c
>>>>>> @@ -1437,9 +1438,37 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
>>>>>> if (gprs || (attr->precise_ip < 2) || tsx_weight)
>>>>>> pebs_data_cfg |= PEBS_DATACFG_GP;
>>>>>>
>>>>>> - if ((sample_type & PERF_SAMPLE_REGS_INTR) &&
>>>>>> - (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK))
>>>>>> - pebs_data_cfg |= PEBS_DATACFG_XMMS;
>>>>>> + if (sample_type & PERF_SAMPLE_REGS_INTR) {
>>>>>> + if (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK)
>>>>>> + pebs_data_cfg |= PEBS_DATACFG_XMMS;
>>>>>> +
>>>>>> + for_each_set_bit_from(bit,
>>>>>> + (unsigned long *)event->attr.sample_regs_intr_ext,
>>>>>> + PERF_NUM_EXT_REGS) {
>>>>> This is indented wrong; please use cino=(0:0
>>>>> if you worry about indentation depth, break out in helper function.
>>>> Sure. would modify it.
>>>>
>>>>
>>>>>> + switch (bit + PERF_REG_EXTENDED_OFFSET) {
>>>>>> + case PERF_REG_X86_OPMASK0 ... PERF_REG_X86_OPMASK7:
>>>>>> + pebs_data_cfg |= PEBS_DATACFG_OPMASKS;
>>>>>> + bit = PERF_REG_X86_YMMH0 -
>>>>>> + PERF_REG_EXTENDED_OFFSET - 1;
>>>>>> + break;
>>>>>> + case PERF_REG_X86_YMMH0 ... PERF_REG_X86_ZMMH0 - 1:
>>>>>> + pebs_data_cfg |= PEBS_DATACFG_YMMS;
>>>>>> + bit = PERF_REG_X86_ZMMH0 -
>>>>>> + PERF_REG_EXTENDED_OFFSET - 1;
>>>>>> + break;
>>>>>> + case PERF_REG_X86_ZMMH0 ... PERF_REG_X86_ZMM16 - 1:
>>>>>> + pebs_data_cfg |= PEBS_DATACFG_ZMMHS;
>>>>>> + bit = PERF_REG_X86_ZMM16 -
>>>>>> + PERF_REG_EXTENDED_OFFSET - 1;
>>>>>> + break;
>>>>>> + case PERF_REG_X86_ZMM16 ... PERF_REG_X86_ZMM_MAX - 1:
>>>>>> + pebs_data_cfg |= PEBS_DATACFG_H16ZMMS;
>>>>>> + bit = PERF_REG_X86_ZMM_MAX -
>>>>>> + PERF_REG_EXTENDED_OFFSET - 1;
>>>>>> + break;
>>>>>> + }
>>>>>> + }
>>>>>> + }
>>>>>>
>>>>>> if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
>>>>>> /*
>
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [Patch v2 13/24] perf/x86/intel: Update dyn_constraint based on PEBS event precise level
2025-02-27 14:06 ` Liang, Kan
@ 2025-03-05 1:41 ` Mi, Dapeng
0 siblings, 0 replies; 58+ messages in thread
From: Mi, Dapeng @ 2025-03-05 1:41 UTC (permalink / raw)
To: Liang, Kan, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi
On 2/27/2025 10:06 PM, Liang, Kan wrote:
>
> On 2025-02-18 10:28 a.m., Dapeng Mi wrote:
>> arch-PEBS provides CPUIDs to enumerate which counters support PEBS
>> sampling and precise distribution PEBS sampling. Thus PEBS constraints
>> should be dynamically configured base on these counter and precise
>> distribution bitmap instead of defining them statically.
>>
>> Update event dyn_constraint base on PEBS event precise level.
>>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>> arch/x86/events/intel/core.c | 6 ++++++
>> arch/x86/events/intel/ds.c | 1 +
>> 2 files changed, 7 insertions(+)
>>
>> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
>> index 472366c3db22..c777e0531d40 100644
>> --- a/arch/x86/events/intel/core.c
>> +++ b/arch/x86/events/intel/core.c
>> @@ -4033,6 +4033,8 @@ static int intel_pmu_hw_config(struct perf_event *event)
>> return ret;
>>
>> if (event->attr.precise_ip) {
>> + struct arch_pebs_cap pebs_cap = hybrid(event->pmu, arch_pebs_cap);
>> +
>> if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == INTEL_FIXED_VLBR_EVENT)
>> return -EINVAL;
>>
>> @@ -4046,6 +4048,10 @@ static int intel_pmu_hw_config(struct perf_event *event)
>> }
>> if (x86_pmu.pebs_aliases)
>> x86_pmu.pebs_aliases(event);
>> +
>> + if (x86_pmu.arch_pebs)
>> + event->hw.dyn_constraint = event->attr.precise_ip >= 3 ?
>> + pebs_cap.pdists : pebs_cap.counters;
>> }
> The dyn_constraint is only required when the counter mask is different.
> I think pebs_cap.counters is very likely the same as the regular
> counter mask. Maybe something like the below (untested).
>
> if (x86_pmu.arch_pebs) {
> u64 cntr_mask = event->attr.precise_ip >= 3 ?
> pebs_cap.pdists : pebs_cap.counters;
> if (cntr_mask != hybrid(event->pmu, intel_ctrl))
> event->hw.dyn_constraint = cntr_mask;
> }
Sure. Thanks.
>
> Thanks,
> Kan
>>
>> if (needs_branch_stack(event)) {
>> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
>> index 519767fc9180..615aefb4e52e 100644
>> --- a/arch/x86/events/intel/ds.c
>> +++ b/arch/x86/events/intel/ds.c
>> @@ -2948,6 +2948,7 @@ static void __init intel_arch_pebs_init(void)
>> x86_pmu.pebs_buffer_size = PEBS_BUFFER_SIZE;
>> x86_pmu.drain_pebs = intel_pmu_drain_arch_pebs;
>> x86_pmu.pebs_capable = ~0ULL;
>> + x86_pmu.flags |= PMU_FL_PEBS_ALL;
>>
>> x86_pmu.pebs_enable = __intel_pmu_pebs_enable;
>> x86_pmu.pebs_disable = __intel_pmu_pebs_disable;
>
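A tiny standalone sketch of the check Kan suggests above (types and names
simplified for illustration; only the arch_pebs_cap field names mirror the
patch, everything else is a stand-in):

  #include <stdint.h>

  /* Simplified stand-in for the kernel structure used in the patch. */
  struct arch_pebs_cap {
  	uint64_t counters;	/* counters that support PEBS sampling */
  	uint64_t pdists;	/* counters that support precise distribution */
  };

  /*
   * Return the dynamic constraint mask to install, or 0 when the
   * PEBS-capable mask already matches the regular counter mask and
   * no dynamic constraint is needed.
   */
  static uint64_t pick_dyn_constraint(const struct arch_pebs_cap *cap,
  				    uint64_t intel_ctrl, int precise_ip)
  {
  	uint64_t cntr_mask = precise_ip >= 3 ? cap->pdists : cap->counters;

  	return cntr_mask != intel_ctrl ? cntr_mask : 0;
  }

In other words, only events that genuinely need a narrower counter set pay
the cost of the dynamic constraint path.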
^ permalink raw reply [flat|nested] 58+ messages in thread
end of thread, other threads:[~2025-03-05 1:41 UTC | newest]
Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-18 15:27 [Patch v2 00/24] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
2025-02-18 15:27 ` [Patch v2 01/24] perf/x86: Add dynamic constraint Dapeng Mi
2025-02-18 15:27 ` [Patch v2 02/24] perf/x86/intel: Add Panther Lake support Dapeng Mi
2025-02-18 15:27 ` [Patch v2 03/24] perf/x86/intel: Add PMU support for Clearwater Forest Dapeng Mi
2025-02-18 15:27 ` [Patch v2 04/24] perf/x86/intel: Parse CPUID archPerfmonExt leaves for non-hybrid CPUs Dapeng Mi
2025-02-18 15:27 ` [Patch v2 05/24] perf/x86/intel: Decouple BTS initialization from PEBS initialization Dapeng Mi
2025-02-18 15:28 ` [Patch v2 06/24] perf/x86/intel: Rename x86_pmu.pebs to x86_pmu.ds_pebs Dapeng Mi
2025-02-18 15:28 ` [Patch v2 07/24] perf/x86/intel: Introduce pairs of PEBS static calls Dapeng Mi
2025-02-18 15:28 ` [Patch v2 08/24] perf/x86/intel: Initialize architectural PEBS Dapeng Mi
2025-02-18 15:28 ` [Patch v2 09/24] perf/x86/intel/ds: Factor out common PEBS processing code to functions Dapeng Mi
2025-02-18 15:28 ` [Patch v2 10/24] perf/x86/intel: Process arch-PEBS records or record fragments Dapeng Mi
2025-02-25 10:39 ` Peter Zijlstra
2025-02-25 11:00 ` Peter Zijlstra
2025-02-26 5:20 ` Mi, Dapeng
2025-02-26 9:35 ` Peter Zijlstra
2025-02-26 15:45 ` Liang, Kan
2025-02-27 2:04 ` Mi, Dapeng
2025-02-25 20:42 ` Andi Kleen
2025-02-26 2:54 ` Mi, Dapeng
2025-02-18 15:28 ` [Patch v2 11/24] perf/x86/intel: Factor out common functions to process PEBS groups Dapeng Mi
2025-02-25 11:02 ` Peter Zijlstra
2025-02-26 5:24 ` Mi, Dapeng
2025-02-18 15:28 ` [Patch v2 12/24] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR Dapeng Mi
2025-02-25 11:18 ` Peter Zijlstra
2025-02-26 5:48 ` Mi, Dapeng
2025-02-26 9:46 ` Peter Zijlstra
2025-02-27 2:05 ` Mi, Dapeng
2025-02-25 11:25 ` Peter Zijlstra
2025-02-26 6:19 ` Mi, Dapeng
2025-02-26 9:48 ` Peter Zijlstra
2025-02-27 2:09 ` Mi, Dapeng
2025-02-18 15:28 ` [Patch v2 13/24] perf/x86/intel: Update dyn_constraint based on PEBS event precise level Dapeng Mi
2025-02-27 14:06 ` Liang, Kan
2025-03-05 1:41 ` Mi, Dapeng
2025-02-18 15:28 ` [Patch v2 14/24] perf/x86/intel: Setup PEBS data configuration and enable legacy groups Dapeng Mi
2025-02-18 15:28 ` [Patch v2 15/24] perf/x86/intel: Add SSP register support for arch-PEBS Dapeng Mi
2025-02-25 11:52 ` Peter Zijlstra
2025-02-26 6:56 ` Mi, Dapeng
2025-02-25 11:54 ` Peter Zijlstra
2025-02-25 20:44 ` Andi Kleen
2025-02-27 6:29 ` Mi, Dapeng
2025-02-18 15:28 ` [Patch v2 16/24] perf/x86/intel: Add counter group " Dapeng Mi
2025-02-18 15:28 ` [Patch v2 17/24] perf/core: Support to capture higher width vector registers Dapeng Mi
2025-02-25 20:32 ` Peter Zijlstra
2025-02-26 7:55 ` Mi, Dapeng
2025-02-18 15:28 ` [Patch v2 18/24] perf/x86/intel: Support arch-PEBS vector registers group capturing Dapeng Mi
2025-02-25 15:32 ` Peter Zijlstra
2025-02-26 8:08 ` Mi, Dapeng
2025-02-27 6:40 ` Mi, Dapeng
2025-03-04 3:08 ` Mi, Dapeng
2025-03-04 16:26 ` Liang, Kan
2025-03-05 1:34 ` Mi, Dapeng
2025-02-18 15:28 ` [Patch v2 19/24] perf tools: Support to show SSP register Dapeng Mi
2025-02-18 15:28 ` [Patch v2 20/24] perf tools: Enhance arch__intr/user_reg_mask() helpers Dapeng Mi
2025-02-18 15:28 ` [Patch v2 21/24] perf tools: Enhance sample_regs_user/intr to capture more registers Dapeng Mi
2025-02-18 15:28 ` [Patch v2 22/24] perf tools: Support to capture more vector registers (x86/Intel) Dapeng Mi
2025-02-18 15:28 ` [Patch v2 23/24] perf tools/tests: Add vector registers PEBS sampling test Dapeng Mi
2025-02-18 15:28 ` [Patch v2 24/24] perf tools: Fix incorrect --user-regs comments Dapeng Mi