* [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake
@ 2025-04-15 11:44 Dapeng Mi
2025-04-15 11:44 ` [Patch v3 01/22] perf/x86/intel: Add Panther Lake support Dapeng Mi
` (22 more replies)
0 siblings, 23 replies; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
This v3 patch series is based on the latest perf/core tree, "5c3627b6f059
perf/x86/intel/bts: Replace offsetof() with struct_size()", plus 2 extra
patches from the patchset "perf/x86/intel: Don't clear perf metrics
overflow bit unconditionally"[1].
Changes:
v2 -> v3:
* Rebase patches to 6.15-rc1 code base.
* Refactor arch-PEBS buffer allocation/release code and decouple it from
the legacy PEBS buffer allocation/release code.
* Support capturing SSP/XMM/YMM/ZMM registers for user-space register
sampling (--user-regs option) with PEBS events.
* Fix an incorrect sampling frequency issue in frequency sampling mode.
* Misc changes to address other v2 comments.
Tests:
Ran the tests below on Clearwater Forest and Panther Lake; no issues
were found.
1. Basic perf counting case.
perf stat -e '{branches,branches,branches,branches,branches,branches,branches,branches,cycles,instructions,ref-cycles}' sleep 1
2. Basic PMI based perf sampling case.
perf record -e '{branches,branches,branches,branches,branches,branches,branches,branches,cycles,instructions,ref-cycles}' sleep 1
3. Basic PEBS based perf sampling case.
perf record -e '{branches,branches,branches,branches,branches,branches,branches,branches,cycles,instructions,ref-cycles}:p' sleep 1
4. PEBS sampling case with basic, GPRs, vector-registers and LBR groups
perf record -e branches:p -Iax,bx,ip,ssp,xmm0,ymm0 -b -c 10000 sleep 1
5. User space PEBS sampling case with basic, GPRs, vector-registers and LBR groups
perf record -e branches:pu --user-regs=ax,bx,ip,ssp,xmm0,ymm0 -b -c 10000 sleep 1
6. PEBS sampling case with auxiliary (memory info) group
perf mem record sleep 1
7. PEBS sampling case with counter group
perf record -e '{branches:p,branches,cycles}:S' -c 10000 sleep 1
8. Perf stat and record test
perf test 92; perf test 120
9. perf-fuzzer test
History:
v2: https://lore.kernel.org/all/20250218152818.158614-1-dapeng1.mi@linux.intel.com/
v1: https://lore.kernel.org/all/20250123140721.2496639-1-dapeng1.mi@linux.intel.com/
Ref:
[1]: https://lore.kernel.org/all/20250415104135.318169-1-dapeng1.mi@linux.intel.com/
Dapeng Mi (21):
perf/x86/intel: Add PMU support for Clearwater Forest
perf/x86/intel: Parse CPUID archPerfmonExt leaves for non-hybrid CPUs
perf/x86/intel: Decouple BTS initialization from PEBS initialization
perf/x86/intel: Rename x86_pmu.pebs to x86_pmu.ds_pebs
perf/x86/intel: Introduce pairs of PEBS static calls
perf/x86/intel: Initialize architectural PEBS
perf/x86/intel/ds: Factor out PEBS record processing code to functions
perf/x86/intel/ds: Factor out PEBS group processing code to functions
perf/x86/intel: Process arch-PEBS records or record fragments
perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
perf/x86/intel: Update dyn_constraint based on PEBS event precise level
perf/x86/intel: Setup PEBS data configuration and enable legacy groups
perf/x86/intel: Add counter group support for arch-PEBS
perf/x86/intel: Support SSP register capturing for arch-PEBS
perf/core: Support to capture higher width vector registers
perf/x86/intel: Support arch-PEBS vector registers group capturing
perf tools: Support to show SSP register
perf tools: Enhance arch__intr/user_reg_mask() helpers
perf tools: Enhance sample_regs_user/intr to capture more registers
perf tools: Support to capture more vector registers (x86/Intel)
perf tools/tests: Add vector registers PEBS sampling test
Kan Liang (1):
perf/x86/intel: Add Panther Lake support
arch/arm/kernel/perf_regs.c | 6 +
arch/arm64/kernel/perf_regs.c | 6 +
arch/csky/kernel/perf_regs.c | 5 +
arch/loongarch/kernel/perf_regs.c | 5 +
arch/mips/kernel/perf_regs.c | 5 +
arch/powerpc/perf/perf_regs.c | 5 +
arch/riscv/kernel/perf_regs.c | 5 +
arch/s390/kernel/perf_regs.c | 5 +
arch/x86/events/core.c | 136 +++-
arch/x86/events/intel/bts.c | 6 +-
arch/x86/events/intel/core.c | 329 +++++++-
arch/x86/events/intel/ds.c | 714 ++++++++++++++----
arch/x86/events/perf_event.h | 60 +-
arch/x86/include/asm/intel_ds.h | 10 +-
arch/x86/include/asm/msr-index.h | 26 +
arch/x86/include/asm/perf_event.h | 145 +++-
arch/x86/include/uapi/asm/perf_regs.h | 83 +-
arch/x86/kernel/perf_regs.c | 71 +-
include/linux/perf_event.h | 4 +
include/linux/perf_regs.h | 10 +
include/uapi/linux/perf_event.h | 11 +
kernel/events/core.c | 98 ++-
tools/arch/x86/include/uapi/asm/perf_regs.h | 86 ++-
tools/include/uapi/linux/perf_event.h | 14 +
tools/perf/arch/arm/util/perf_regs.c | 8 +-
tools/perf/arch/arm64/util/perf_regs.c | 11 +-
tools/perf/arch/csky/util/perf_regs.c | 8 +-
tools/perf/arch/loongarch/util/perf_regs.c | 8 +-
tools/perf/arch/mips/util/perf_regs.c | 8 +-
tools/perf/arch/powerpc/util/perf_regs.c | 17 +-
tools/perf/arch/riscv/util/perf_regs.c | 8 +-
tools/perf/arch/s390/util/perf_regs.c | 8 +-
tools/perf/arch/x86/util/perf_regs.c | 138 +++-
tools/perf/builtin-script.c | 23 +-
tools/perf/tests/shell/record.sh | 55 ++
tools/perf/util/evsel.c | 36 +-
tools/perf/util/intel-pt.c | 2 +-
tools/perf/util/parse-regs-options.c | 23 +-
.../perf/util/perf-regs-arch/perf_regs_x86.c | 84 +++
tools/perf/util/perf_regs.c | 8 +-
tools/perf/util/perf_regs.h | 20 +-
tools/perf/util/record.h | 4 +-
tools/perf/util/sample.h | 6 +-
tools/perf/util/session.c | 29 +-
tools/perf/util/synthetic-events.c | 12 +-
45 files changed, 2075 insertions(+), 286 deletions(-)
base-commit: 538f1f04b5bfeaff4cd681b2567a0fde2335be38
--
2.40.1
^ permalink raw reply [flat|nested] 47+ messages in thread
* [Patch v3 01/22] perf/x86/intel: Add Panther Lake support
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 11:44 ` [Patch v3 02/22] perf/x86/intel: Add PMU support for Clearwater Forest Dapeng Mi
` (21 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi
From: Kan Liang <kan.liang@linux.intel.com>
From the PMU's perspective, Panther Lake is similar to the previous
generation Lunar Lake. Both are hybrid platforms with e-cores and
p-cores.
The key differences are the ARCH PEBS feature and several new events.
ARCH PEBS is supported in the following patches. The new events will be
supported later in the perf tool.
Share the code path with Lunar Lake. Only update the name.
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
arch/x86/events/intel/core.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index c6f69ce3b2b3..f107dd826c11 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -7572,8 +7572,17 @@ __init int intel_pmu_init(void)
name = "meteorlake_hybrid";
break;
+ case INTEL_PANTHERLAKE_L:
+ pr_cont("Pantherlake Hybrid events, ");
+ name = "pantherlake_hybrid";
+ goto lnl_common;
+
case INTEL_LUNARLAKE_M:
case INTEL_ARROWLAKE:
+ pr_cont("Lunarlake Hybrid events, ");
+ name = "lunarlake_hybrid";
+
+ lnl_common:
intel_pmu_init_hybrid(hybrid_big_small);
x86_pmu.pebs_latency_data = lnl_latency_data;
@@ -7595,8 +7604,6 @@ __init int intel_pmu_init(void)
intel_pmu_init_skt(&pmu->pmu);
intel_pmu_pebs_data_source_lnl();
- pr_cont("Lunarlake Hybrid events, ");
- name = "lunarlake_hybrid";
break;
case INTEL_ARROWLAKE_H:
--
2.40.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [Patch v3 02/22] perf/x86/intel: Add PMU support for Clearwater Forest
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
2025-04-15 11:44 ` [Patch v3 01/22] perf/x86/intel: Add Panther Lake support Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 11:44 ` [Patch v3 03/22] perf/x86/intel: Parse CPUID archPerfmonExt leaves for non-hybrid CPUs Dapeng Mi
` (20 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
From the PMU's perspective, Clearwater Forest is similar to the previous
generation Sierra Forest.
The key differences are the ARCH PEBS feature and the 3 newly added
fixed counters for topdown L1 metrics events.
ARCH PEBS is supported in the following patches. This patch provides
support for the basic perfmon features and the 3 newly added fixed
counters.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/core.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index f107dd826c11..adc0187a81a0 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2224,6 +2224,18 @@ static struct extra_reg intel_cmt_extra_regs[] __read_mostly = {
EVENT_EXTRA_END
};
+EVENT_ATTR_STR(topdown-fe-bound, td_fe_bound_skt, "event=0x9c,umask=0x01");
+EVENT_ATTR_STR(topdown-retiring, td_retiring_skt, "event=0xc2,umask=0x02");
+EVENT_ATTR_STR(topdown-be-bound, td_be_bound_skt, "event=0xa4,umask=0x02");
+
+static struct attribute *skt_events_attrs[] = {
+ EVENT_PTR(td_fe_bound_skt),
+ EVENT_PTR(td_retiring_skt),
+ EVENT_PTR(td_bad_spec_cmt),
+ EVENT_PTR(td_be_bound_skt),
+ NULL,
+};
+
#define KNL_OT_L2_HITE BIT_ULL(19) /* Other Tile L2 Hit */
#define KNL_OT_L2_HITF BIT_ULL(20) /* Other Tile L2 Hit */
#define KNL_MCDRAM_LOCAL BIT_ULL(21)
@@ -7142,6 +7154,18 @@ __init int intel_pmu_init(void)
name = "crestmont";
break;
+ case INTEL_ATOM_DARKMONT_X:
+ intel_pmu_init_skt(NULL);
+ intel_pmu_pebs_data_source_cmt();
+ x86_pmu.pebs_latency_data = cmt_latency_data;
+ x86_pmu.get_event_constraints = cmt_get_event_constraints;
+ td_attr = skt_events_attrs;
+ mem_attr = grt_mem_attrs;
+ extra_attr = cmt_format_attr;
+ pr_cont("Darkmont events, ");
+ name = "darkmont";
+ break;
+
case INTEL_WESTMERE:
case INTEL_WESTMERE_EP:
case INTEL_WESTMERE_EX:
--
2.40.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [Patch v3 03/22] perf/x86/intel: Parse CPUID archPerfmonExt leaves for non-hybrid CPUs
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
2025-04-15 11:44 ` [Patch v3 01/22] perf/x86/intel: Add Panther Lake support Dapeng Mi
2025-04-15 11:44 ` [Patch v3 02/22] perf/x86/intel: Add PMU support for Clearwater Forest Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 11:44 ` [Patch v3 04/22] perf/x86/intel: Decouple BTS initialization from PEBS initialization Dapeng Mi
` (19 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
CPUID archPerfmonExt (0x23) leaves can enumerate the CPU-level PMU
capabilities on non-hybrid processors as well.
This patch adds support for parsing the archPerfmonExt leaves on
non-hybrid processors. Architectural PEBS leverages archPerfmonExt
sub-leaves 0x4 and 0x5 to enumerate the PEBS capabilities as well. This
patch is a precursor of the subsequent arch-PEBS enabling patches.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/core.c | 31 ++++++++++++++++++++++---------
1 file changed, 22 insertions(+), 9 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index adc0187a81a0..c7937b872348 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -5271,7 +5271,7 @@ static inline bool intel_pmu_broken_perf_cap(void)
return false;
}
-static void update_pmu_cap(struct x86_hybrid_pmu *pmu)
+static void update_pmu_cap(struct pmu *pmu)
{
unsigned int cntr, fixed_cntr, ecx, edx;
union cpuid35_eax eax;
@@ -5280,30 +5280,30 @@ static void update_pmu_cap(struct x86_hybrid_pmu *pmu)
cpuid(ARCH_PERFMON_EXT_LEAF, &eax.full, &ebx.full, &ecx, &edx);
if (ebx.split.umask2)
- pmu->config_mask |= ARCH_PERFMON_EVENTSEL_UMASK2;
+ hybrid(pmu, config_mask) |= ARCH_PERFMON_EVENTSEL_UMASK2;
if (ebx.split.eq)
- pmu->config_mask |= ARCH_PERFMON_EVENTSEL_EQ;
+ hybrid(pmu, config_mask) |= ARCH_PERFMON_EVENTSEL_EQ;
if (eax.split.cntr_subleaf) {
cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_NUM_COUNTER_LEAF,
&cntr, &fixed_cntr, &ecx, &edx);
- pmu->cntr_mask64 = cntr;
- pmu->fixed_cntr_mask64 = fixed_cntr;
+ hybrid(pmu, cntr_mask64) = cntr;
+ hybrid(pmu, fixed_cntr_mask64) = fixed_cntr;
}
if (eax.split.acr_subleaf) {
cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_ACR_LEAF,
&cntr, &fixed_cntr, &ecx, &edx);
/* The mask of the counters which can be reloaded */
- pmu->acr_cntr_mask64 = cntr | ((u64)fixed_cntr << INTEL_PMC_IDX_FIXED);
+ hybrid(pmu, acr_cntr_mask64) = cntr | ((u64)fixed_cntr << INTEL_PMC_IDX_FIXED);
/* The mask of the counters which can cause a reload of reloadable counters */
- pmu->acr_cause_mask64 = ecx | ((u64)edx << INTEL_PMC_IDX_FIXED);
+ hybrid(pmu, acr_cause_mask64) = ecx | ((u64)edx << INTEL_PMC_IDX_FIXED);
}
if (!intel_pmu_broken_perf_cap()) {
/* Perf Metric (Bit 15) and PEBS via PT (Bit 16) are hybrid enumeration */
- rdmsrl(MSR_IA32_PERF_CAPABILITIES, pmu->intel_cap.capabilities);
+ rdmsrl(MSR_IA32_PERF_CAPABILITIES, hybrid(pmu, intel_cap).capabilities);
}
}
@@ -5390,7 +5390,7 @@ static bool init_hybrid_pmu(int cpu)
goto end;
if (this_cpu_has(X86_FEATURE_ARCH_PERFMON_EXT))
- update_pmu_cap(pmu);
+ update_pmu_cap(&pmu->pmu);
intel_pmu_check_hybrid_pmus(pmu);
@@ -6899,6 +6899,7 @@ __init int intel_pmu_init(void)
x86_pmu.pebs_events_mask = intel_pmu_pebs_mask(x86_pmu.cntr_mask64);
x86_pmu.pebs_capable = PEBS_COUNTER_MASK;
+ x86_pmu.config_mask = X86_RAW_EVENT_MASK;
/*
* Quirk: v2 perfmon does not report fixed-purpose events, so
@@ -7715,6 +7716,18 @@ __init int intel_pmu_init(void)
x86_pmu.attr_update = hybrid_attr_update;
}
+ /*
+ * The archPerfmonExt (0x23) includes an enhanced enumeration of
+ * PMU architectural features with a per-core view. For non-hybrid,
+ * each core has the same PMU capabilities. It's good enough to
+ * update the x86_pmu from the booting CPU. For hybrid, the x86_pmu
+ * is used to keep the common capabilities. Still keep the values
+ * from the leaf 0xa. The core specific update will be done later
+ * when a new type is online.
+ */
+ if (!is_hybrid() && boot_cpu_has(X86_FEATURE_ARCH_PERFMON_EXT))
+ update_pmu_cap(NULL);
+
intel_pmu_check_counters_mask(&x86_pmu.cntr_mask64,
&x86_pmu.fixed_cntr_mask64,
&x86_pmu.intel_ctrl);
--
2.40.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [Patch v3 04/22] perf/x86/intel: Decouple BTS initialization from PEBS initialization
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (2 preceding siblings ...)
2025-04-15 11:44 ` [Patch v3 03/22] perf/x86/intel: Parse CPUID archPerfmonExt leaves for non-hybrid CPUs Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 11:44 ` [Patch v3 05/22] perf/x86/intel: Rename x86_pmu.pebs to x86_pmu.ds_pebs Dapeng Mi
` (18 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Move the x86_pmu.bts flag initialization from intel_ds_init() into
bts_init() and rename intel_ds_init() to intel_pebs_init(), since it
only initializes PEBS now that the x86_pmu.bts initialization is
removed.
It's safe to move the x86_pmu.bts initialization into bts_init() since
all users of the x86_pmu.bts flag run after bts_init() has executed.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/bts.c | 6 +++++-
arch/x86/events/intel/core.c | 2 +-
arch/x86/events/intel/ds.c | 5 ++---
arch/x86/events/perf_event.h | 2 +-
4 files changed, 9 insertions(+), 6 deletions(-)
diff --git a/arch/x86/events/intel/bts.c b/arch/x86/events/intel/bts.c
index 16bc89c8023b..9560f693fac0 100644
--- a/arch/x86/events/intel/bts.c
+++ b/arch/x86/events/intel/bts.c
@@ -599,7 +599,11 @@ static void bts_event_read(struct perf_event *event)
static __init int bts_init(void)
{
- if (!boot_cpu_has(X86_FEATURE_DTES64) || !x86_pmu.bts)
+ if (!boot_cpu_has(X86_FEATURE_DTES64))
+ return -ENODEV;
+
+ x86_pmu.bts = boot_cpu_has(X86_FEATURE_BTS);
+ if (!x86_pmu.bts)
return -ENODEV;
if (boot_cpu_has(X86_FEATURE_PTI)) {
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index c7937b872348..16049ba63135 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -6928,7 +6928,7 @@ __init int intel_pmu_init(void)
if (boot_cpu_has(X86_FEATURE_ARCH_LBR))
intel_pmu_arch_lbr_init();
- intel_ds_init();
+ intel_pebs_init();
x86_add_quirk(intel_arch_events_quirk); /* Install first, so it runs last */
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index fcf9c5b26cab..d894cf3f631e 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2651,10 +2651,10 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
}
/*
- * BTS, PEBS probe and setup
+ * PEBS probe and setup
*/
-void __init intel_ds_init(void)
+void __init intel_pebs_init(void)
{
/*
* No support for 32bit formats
@@ -2662,7 +2662,6 @@ void __init intel_ds_init(void)
if (!boot_cpu_has(X86_FEATURE_DTES64))
return;
- x86_pmu.bts = boot_cpu_has(X86_FEATURE_BTS);
x86_pmu.pebs = boot_cpu_has(X86_FEATURE_PEBS);
x86_pmu.pebs_buffer_size = PEBS_BUFFER_SIZE;
if (x86_pmu.version <= 4)
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 46bbb503aca1..ac6743e392ad 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1673,7 +1673,7 @@ void intel_pmu_drain_pebs_buffer(void);
void intel_pmu_store_pebs_lbrs(struct lbr_entry *lbr);
-void intel_ds_init(void);
+void intel_pebs_init(void);
void intel_pmu_lbr_save_brstack(struct perf_sample_data *data,
struct cpu_hw_events *cpuc,
--
2.40.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [Patch v3 05/22] perf/x86/intel: Rename x86_pmu.pebs to x86_pmu.ds_pebs
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (3 preceding siblings ...)
2025-04-15 11:44 ` [Patch v3 04/22] perf/x86/intel: Decouple BTS initialization from PEBS initialization Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 11:44 ` [Patch v3 06/22] perf/x86/intel: Introduce pairs of PEBS static calls Dapeng Mi
` (17 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Since architectural PEBS will be introduced in subsequent patches,
rename x86_pmu.pebs to x86_pmu.ds_pebs to distinguish it from the
upcoming architectural PEBS.
Besides, restrict the reserve_ds_buffers() helper to work only for the
legacy DS based PEBS, so it doesn't corrupt the pebs_active flag or
release the PEBS buffer incorrectly for arch-PEBS, since a later patch
will reuse these flags and the alloc/release_pebs_buffer() helpers for
arch-PEBS.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/core.c | 6 +++---
arch/x86/events/intel/ds.c | 32 ++++++++++++++++++--------------
arch/x86/events/perf_event.h | 2 +-
3 files changed, 22 insertions(+), 18 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 16049ba63135..7bbc7a740242 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4584,7 +4584,7 @@ static struct perf_guest_switch_msr *intel_guest_get_msrs(int *nr, void *data)
.guest = intel_ctrl & ~cpuc->intel_ctrl_host_mask & ~pebs_mask,
};
- if (!x86_pmu.pebs)
+ if (!x86_pmu.ds_pebs)
return arr;
/*
@@ -5764,7 +5764,7 @@ static __init void intel_clovertown_quirk(void)
* these chips.
*/
pr_warn("PEBS disabled due to CPU errata\n");
- x86_pmu.pebs = 0;
+ x86_pmu.ds_pebs = 0;
x86_pmu.pebs_constraints = NULL;
}
@@ -6252,7 +6252,7 @@ tsx_is_visible(struct kobject *kobj, struct attribute *attr, int i)
static umode_t
pebs_is_visible(struct kobject *kobj, struct attribute *attr, int i)
{
- return x86_pmu.pebs ? attr->mode : 0;
+ return x86_pmu.ds_pebs ? attr->mode : 0;
}
static umode_t
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index d894cf3f631e..1d6b3fa6a8eb 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -624,7 +624,7 @@ static int alloc_pebs_buffer(int cpu)
int max, node = cpu_to_node(cpu);
void *buffer, *insn_buff, *cea;
- if (!x86_pmu.pebs)
+ if (!x86_pmu.ds_pebs)
return 0;
buffer = dsalloc_pages(bsiz, GFP_KERNEL, cpu);
@@ -659,7 +659,7 @@ static void release_pebs_buffer(int cpu)
struct cpu_hw_events *hwev = per_cpu_ptr(&cpu_hw_events, cpu);
void *cea;
- if (!x86_pmu.pebs)
+ if (!x86_pmu.ds_pebs)
return;
kfree(per_cpu(insn_buffer, cpu));
@@ -734,7 +734,7 @@ void release_ds_buffers(void)
{
int cpu;
- if (!x86_pmu.bts && !x86_pmu.pebs)
+ if (!x86_pmu.bts && !x86_pmu.ds_pebs)
return;
for_each_possible_cpu(cpu)
@@ -750,7 +750,8 @@ void release_ds_buffers(void)
}
for_each_possible_cpu(cpu) {
- release_pebs_buffer(cpu);
+ if (x86_pmu.ds_pebs)
+ release_pebs_buffer(cpu);
release_bts_buffer(cpu);
}
}
@@ -761,15 +762,17 @@ void reserve_ds_buffers(void)
int cpu;
x86_pmu.bts_active = 0;
- x86_pmu.pebs_active = 0;
- if (!x86_pmu.bts && !x86_pmu.pebs)
+ if (x86_pmu.ds_pebs)
+ x86_pmu.pebs_active = 0;
+
+ if (!x86_pmu.bts && !x86_pmu.ds_pebs)
return;
if (!x86_pmu.bts)
bts_err = 1;
- if (!x86_pmu.pebs)
+ if (!x86_pmu.ds_pebs)
pebs_err = 1;
for_each_possible_cpu(cpu) {
@@ -781,7 +784,8 @@ void reserve_ds_buffers(void)
if (!bts_err && alloc_bts_buffer(cpu))
bts_err = 1;
- if (!pebs_err && alloc_pebs_buffer(cpu))
+ if (x86_pmu.ds_pebs && !pebs_err &&
+ alloc_pebs_buffer(cpu))
pebs_err = 1;
if (bts_err && pebs_err)
@@ -793,7 +797,7 @@ void reserve_ds_buffers(void)
release_bts_buffer(cpu);
}
- if (pebs_err) {
+ if (x86_pmu.ds_pebs && pebs_err) {
for_each_possible_cpu(cpu)
release_pebs_buffer(cpu);
}
@@ -805,7 +809,7 @@ void reserve_ds_buffers(void)
if (x86_pmu.bts && !bts_err)
x86_pmu.bts_active = 1;
- if (x86_pmu.pebs && !pebs_err)
+ if (x86_pmu.ds_pebs && !pebs_err)
x86_pmu.pebs_active = 1;
for_each_possible_cpu(cpu) {
@@ -2662,12 +2666,12 @@ void __init intel_pebs_init(void)
if (!boot_cpu_has(X86_FEATURE_DTES64))
return;
- x86_pmu.pebs = boot_cpu_has(X86_FEATURE_PEBS);
+ x86_pmu.ds_pebs = boot_cpu_has(X86_FEATURE_PEBS);
x86_pmu.pebs_buffer_size = PEBS_BUFFER_SIZE;
if (x86_pmu.version <= 4)
x86_pmu.pebs_no_isolation = 1;
- if (x86_pmu.pebs) {
+ if (x86_pmu.ds_pebs) {
char pebs_type = x86_pmu.intel_cap.pebs_trap ? '+' : '-';
char *pebs_qual = "";
int format = x86_pmu.intel_cap.pebs_format;
@@ -2759,7 +2763,7 @@ void __init intel_pebs_init(void)
default:
pr_cont("no PEBS fmt%d%c, ", format, pebs_type);
- x86_pmu.pebs = 0;
+ x86_pmu.ds_pebs = 0;
}
}
}
@@ -2768,7 +2772,7 @@ void perf_restore_debug_store(void)
{
struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds);
- if (!x86_pmu.bts && !x86_pmu.pebs)
+ if (!x86_pmu.bts && !x86_pmu.ds_pebs)
return;
wrmsrl(MSR_IA32_DS_AREA, (unsigned long)ds);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index ac6743e392ad..2ef407d0a7e2 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -898,7 +898,7 @@ struct x86_pmu {
*/
unsigned int bts :1,
bts_active :1,
- pebs :1,
+ ds_pebs :1,
pebs_active :1,
pebs_broken :1,
pebs_prec_dist :1,
--
2.40.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [Patch v3 06/22] perf/x86/intel: Introduce pairs of PEBS static calls
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (4 preceding siblings ...)
2025-04-15 11:44 ` [Patch v3 05/22] perf/x86/intel: Rename x86_pmu.pebs to x86_pmu.ds_pebs Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 11:44 ` [Patch v3 07/22] perf/x86/intel: Initialize architectural PEBS Dapeng Mi
` (16 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Arch-PEBS retires the IA32_PEBS_ENABLE and MSR_PEBS_DATA_CFG MSRs, so
intel_pmu_pebs_enable/disable() and intel_pmu_pebs_enable/disable_all()
don't need to be called for arch-PEBS.
To keep the code clean, introduce the static calls
x86_pmu_pebs_enable/disable() and x86_pmu_pebs_enable/disable_all()
instead of adding an "x86_pmu.arch_pebs" check directly in these
helpers.
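For illustration, a condensed sketch of the static-call pattern this
patch applies (simplified from the diff below, not a complete listing):

  /* Declare a NULL static call wrapping the optional x86_pmu callback. */
  DEFINE_STATIC_CALL_NULL(x86_pmu_pebs_enable_all, *x86_pmu.pebs_enable_all);

  /* Bind it during PMU init; it stays unbound if the PMU sets no callback. */
  static_call_update(x86_pmu_pebs_enable_all, x86_pmu.pebs_enable_all);

  /* static_call_cond() turns an unbound call into a no-op, so the call
   * site needs no "x86_pmu.arch_pebs" check. */
  static_call_cond(x86_pmu_pebs_enable_all)();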
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/core.c | 10 ++++++++++
arch/x86/events/intel/core.c | 8 ++++----
arch/x86/events/intel/ds.c | 5 +++++
arch/x86/events/perf_event.h | 8 ++++++++
4 files changed, 27 insertions(+), 4 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index cae213296a63..995df8f392b6 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -95,6 +95,11 @@ DEFINE_STATIC_CALL_NULL(x86_pmu_filter, *x86_pmu.filter);
DEFINE_STATIC_CALL_NULL(x86_pmu_late_setup, *x86_pmu.late_setup);
+DEFINE_STATIC_CALL_NULL(x86_pmu_pebs_enable, *x86_pmu.pebs_enable);
+DEFINE_STATIC_CALL_NULL(x86_pmu_pebs_disable, *x86_pmu.pebs_disable);
+DEFINE_STATIC_CALL_NULL(x86_pmu_pebs_enable_all, *x86_pmu.pebs_enable_all);
+DEFINE_STATIC_CALL_NULL(x86_pmu_pebs_disable_all, *x86_pmu.pebs_disable_all);
+
/*
* This one is magic, it will get called even when PMU init fails (because
* there is no PMU), in which case it should simply return NULL.
@@ -2049,6 +2054,11 @@ static void x86_pmu_static_call_update(void)
static_call_update(x86_pmu_filter, x86_pmu.filter);
static_call_update(x86_pmu_late_setup, x86_pmu.late_setup);
+
+ static_call_update(x86_pmu_pebs_enable, x86_pmu.pebs_enable);
+ static_call_update(x86_pmu_pebs_disable, x86_pmu.pebs_disable);
+ static_call_update(x86_pmu_pebs_enable_all, x86_pmu.pebs_enable_all);
+ static_call_update(x86_pmu_pebs_disable_all, x86_pmu.pebs_disable_all);
}
static void _x86_pmu_read(struct perf_event *event)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 7bbc7a740242..cd6329207311 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2306,7 +2306,7 @@ static __always_inline void __intel_pmu_disable_all(bool bts)
static __always_inline void intel_pmu_disable_all(void)
{
__intel_pmu_disable_all(true);
- intel_pmu_pebs_disable_all();
+ static_call_cond(x86_pmu_pebs_disable_all)();
intel_pmu_lbr_disable_all();
}
@@ -2338,7 +2338,7 @@ static void __intel_pmu_enable_all(int added, bool pmi)
static void intel_pmu_enable_all(int added)
{
- intel_pmu_pebs_enable_all();
+ static_call_cond(x86_pmu_pebs_enable_all)();
__intel_pmu_enable_all(added, false);
}
@@ -2595,7 +2595,7 @@ static void intel_pmu_disable_event(struct perf_event *event)
* so we don't trigger the event without PEBS bit set.
*/
if (unlikely(event->attr.precise_ip))
- intel_pmu_pebs_disable(event);
+ static_call(x86_pmu_pebs_disable)(event);
}
static void intel_pmu_assign_event(struct perf_event *event, int idx)
@@ -2948,7 +2948,7 @@ static void intel_pmu_enable_event(struct perf_event *event)
int idx = hwc->idx;
if (unlikely(event->attr.precise_ip))
- intel_pmu_pebs_enable(event);
+ static_call(x86_pmu_pebs_enable)(event);
switch (idx) {
case 0 ... INTEL_PMC_IDX_FIXED - 1:
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 1d6b3fa6a8eb..e216622b94dc 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2679,6 +2679,11 @@ void __init intel_pebs_init(void)
if (format < 4)
x86_pmu.intel_cap.pebs_baseline = 0;
+ x86_pmu.pebs_enable = intel_pmu_pebs_enable;
+ x86_pmu.pebs_disable = intel_pmu_pebs_disable;
+ x86_pmu.pebs_enable_all = intel_pmu_pebs_enable_all;
+ x86_pmu.pebs_disable_all = intel_pmu_pebs_disable_all;
+
switch (format) {
case 0:
pr_cont("PEBS fmt0%c, ", pebs_type);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 2ef407d0a7e2..d201e6ac2ede 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -808,6 +808,10 @@ struct x86_pmu {
int (*hw_config)(struct perf_event *event);
int (*schedule_events)(struct cpu_hw_events *cpuc, int n, int *assign);
void (*late_setup)(void);
+ void (*pebs_enable)(struct perf_event *event);
+ void (*pebs_disable)(struct perf_event *event);
+ void (*pebs_enable_all)(void);
+ void (*pebs_disable_all)(void);
unsigned eventsel;
unsigned perfctr;
unsigned fixedctr;
@@ -1120,6 +1124,10 @@ DECLARE_STATIC_CALL(x86_pmu_set_period, *x86_pmu.set_period);
DECLARE_STATIC_CALL(x86_pmu_update, *x86_pmu.update);
DECLARE_STATIC_CALL(x86_pmu_drain_pebs, *x86_pmu.drain_pebs);
DECLARE_STATIC_CALL(x86_pmu_late_setup, *x86_pmu.late_setup);
+DECLARE_STATIC_CALL(x86_pmu_pebs_enable, *x86_pmu.pebs_enable);
+DECLARE_STATIC_CALL(x86_pmu_pebs_disable, *x86_pmu.pebs_disable);
+DECLARE_STATIC_CALL(x86_pmu_pebs_enable_all, *x86_pmu.pebs_enable_all);
+DECLARE_STATIC_CALL(x86_pmu_pebs_disable_all, *x86_pmu.pebs_disable_all);
static __always_inline struct x86_perf_task_context_opt *task_context_opt(void *ctx)
{
--
2.40.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [Patch v3 07/22] perf/x86/intel: Initialize architectural PEBS
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (5 preceding siblings ...)
2025-04-15 11:44 ` [Patch v3 06/22] perf/x86/intel: Introduce pairs of PEBS static calls Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 11:44 ` [Patch v3 08/22] perf/x86/intel/ds: Factor out PEBS record processing code to functions Dapeng Mi
` (15 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Arch-PEBS leverages CPUID.23H sub-leaves 0x4/0x5 to enumerate the
supported arch-PEBS capabilities and counter bitmaps. This patch parses
these 2 sub-leaves and initializes the arch-PEBS capabilities and the
corresponding structures.
Since the IA32_PEBS_ENABLE and MSR_PEBS_DATA_CFG MSRs no longer exist
for arch-PEBS, arch-PEBS doesn't need to manipulate these MSRs. Thus add
a simple pair of __intel_pmu_pebs_enable/disable() callbacks for
arch-PEBS.
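For reference only (not part of the patch), a small user-space sketch
that probes the same CPUID sub-leaves and prints the bitmaps. The
leaf/sub-leaf numbers and the register-to-bitmap mapping follow
update_pmu_cap() below; the program structure and output are purely
illustrative:

  #include <stdio.h>
  #include <cpuid.h>

  #define ARCH_PERFMON_EXT_LEAF          0x23
  #define ARCH_PERFMON_PEBS_CAP_LEAF     0x4
  #define ARCH_PERFMON_PEBS_COUNTER_LEAF 0x5

  int main(void)
  {
          unsigned int eax, ebx, ecx, edx;

          if (!__get_cpuid_count(ARCH_PERFMON_EXT_LEAF, 0,
                                 &eax, &ebx, &ecx, &edx))
                  return 1;

          /* Sub-leaf 0 EAX bits[5:4] flag the two arch-PEBS sub-leaves. */
          if ((eax & 0x30) != 0x30) {
                  printf("arch-PEBS not enumerated\n");
                  return 0;
          }

          __get_cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_PEBS_CAP_LEAF,
                            &eax, &ebx, &ecx, &edx);
          printf("arch-PEBS capabilities (EBX):  0x%08x\n", ebx);

          __get_cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_PEBS_COUNTER_LEAF,
                            &eax, &ebx, &ecx, &edx);
          printf("PEBS-capable GP counters:      0x%08x\n", eax);
          printf("PEBS-capable fixed counters:   0x%08x\n", ecx);
          printf("pdist-capable GP counters:     0x%08x\n", ebx);
          printf("pdist-capable fixed counters:  0x%08x\n", edx);
          return 0;
  }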
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/core.c | 21 ++++++++++---
arch/x86/events/intel/core.c | 46 ++++++++++++++++++---------
arch/x86/events/intel/ds.c | 52 ++++++++++++++++++++++++++-----
arch/x86/events/perf_event.h | 25 +++++++++++++--
arch/x86/include/asm/perf_event.h | 7 ++++-
5 files changed, 120 insertions(+), 31 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 995df8f392b6..9c205a8a4fa6 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -553,14 +553,22 @@ static inline int precise_br_compat(struct perf_event *event)
return m == b;
}
-int x86_pmu_max_precise(void)
+int x86_pmu_max_precise(struct pmu *pmu)
{
int precise = 0;
- /* Support for constant skid */
if (x86_pmu.pebs_active && !x86_pmu.pebs_broken) {
- precise++;
+ /* arch PEBS */
+ if (x86_pmu.arch_pebs) {
+ precise = 2;
+ if (hybrid(pmu, arch_pebs_cap).pdists)
+ precise++;
+
+ return precise;
+ }
+ /* legacy PEBS - support for constant skid */
+ precise++;
/* Support for IP fixup */
if (x86_pmu.lbr_nr || x86_pmu.intel_cap.pebs_format >= 2)
precise++;
@@ -568,13 +576,14 @@ int x86_pmu_max_precise(void)
if (x86_pmu.pebs_prec_dist)
precise++;
}
+
return precise;
}
int x86_pmu_hw_config(struct perf_event *event)
{
if (event->attr.precise_ip) {
- int precise = x86_pmu_max_precise();
+ int precise = x86_pmu_max_precise(event->pmu);
if (event->attr.precise_ip > precise)
return -EOPNOTSUPP;
@@ -2626,7 +2635,9 @@ static ssize_t max_precise_show(struct device *cdev,
struct device_attribute *attr,
char *buf)
{
- return snprintf(buf, PAGE_SIZE, "%d\n", x86_pmu_max_precise());
+ struct pmu *pmu = dev_get_drvdata(cdev);
+
+ return snprintf(buf, PAGE_SIZE, "%d\n", x86_pmu_max_precise(pmu));
}
static DEVICE_ATTR_RO(max_precise);
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index cd6329207311..09e2a23f9bcc 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -5273,34 +5273,49 @@ static inline bool intel_pmu_broken_perf_cap(void)
static void update_pmu_cap(struct pmu *pmu)
{
- unsigned int cntr, fixed_cntr, ecx, edx;
- union cpuid35_eax eax;
- union cpuid35_ebx ebx;
+ unsigned int eax, ebx, ecx, edx;
+ union cpuid35_eax eax_0;
+ union cpuid35_ebx ebx_0;
- cpuid(ARCH_PERFMON_EXT_LEAF, &eax.full, &ebx.full, &ecx, &edx);
+ cpuid(ARCH_PERFMON_EXT_LEAF, &eax_0.full, &ebx_0.full, &ecx, &edx);
- if (ebx.split.umask2)
+ if (ebx_0.split.umask2)
hybrid(pmu, config_mask) |= ARCH_PERFMON_EVENTSEL_UMASK2;
- if (ebx.split.eq)
+ if (ebx_0.split.eq)
hybrid(pmu, config_mask) |= ARCH_PERFMON_EVENTSEL_EQ;
- if (eax.split.cntr_subleaf) {
+ if (eax_0.split.cntr_subleaf) {
cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_NUM_COUNTER_LEAF,
- &cntr, &fixed_cntr, &ecx, &edx);
- hybrid(pmu, cntr_mask64) = cntr;
- hybrid(pmu, fixed_cntr_mask64) = fixed_cntr;
+ &eax, &ebx, &ecx, &edx);
+ hybrid(pmu, cntr_mask64) = eax;
+ hybrid(pmu, fixed_cntr_mask64) = ebx;
}
- if (eax.split.acr_subleaf) {
+ if (eax_0.split.acr_subleaf) {
cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_ACR_LEAF,
- &cntr, &fixed_cntr, &ecx, &edx);
+ &eax, &ebx, &ecx, &edx);
/* The mask of the counters which can be reloaded */
- hybrid(pmu, acr_cntr_mask64) = cntr | ((u64)fixed_cntr << INTEL_PMC_IDX_FIXED);
+ hybrid(pmu, acr_cntr_mask64) = eax | ((u64)ebx << INTEL_PMC_IDX_FIXED);
/* The mask of the counters which can cause a reload of reloadable counters */
hybrid(pmu, acr_cause_mask64) = ecx | ((u64)edx << INTEL_PMC_IDX_FIXED);
}
+ /* Bits[5:4] should be set simultaneously if arch-PEBS is supported */
+ if (eax_0.split.pebs_caps_subleaf && eax_0.split.pebs_cnts_subleaf) {
+ cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_PEBS_CAP_LEAF,
+ &eax, &ebx, &ecx, &edx);
+ hybrid(pmu, arch_pebs_cap).caps = (u64)ebx << 32;
+
+ cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_PEBS_COUNTER_LEAF,
+ &eax, &ebx, &ecx, &edx);
+ hybrid(pmu, arch_pebs_cap).counters = ((u64)ecx << 32) | eax;
+ hybrid(pmu, arch_pebs_cap).pdists = ((u64)edx << 32) | ebx;
+ } else {
+ WARN_ON(x86_pmu.arch_pebs == 1);
+ x86_pmu.arch_pebs = 0;
+ }
+
if (!intel_pmu_broken_perf_cap()) {
/* Perf Metric (Bit 15) and PEBS via PT (Bit 16) are hybrid enumeration */
rdmsrl(MSR_IA32_PERF_CAPABILITIES, hybrid(pmu, intel_cap).capabilities);
@@ -6252,7 +6267,7 @@ tsx_is_visible(struct kobject *kobj, struct attribute *attr, int i)
static umode_t
pebs_is_visible(struct kobject *kobj, struct attribute *attr, int i)
{
- return x86_pmu.ds_pebs ? attr->mode : 0;
+ return intel_pmu_has_pebs() ? attr->mode : 0;
}
static umode_t
@@ -7728,6 +7743,9 @@ __init int intel_pmu_init(void)
if (!is_hybrid() && boot_cpu_has(X86_FEATURE_ARCH_PERFMON_EXT))
update_pmu_cap(NULL);
+ if (x86_pmu.arch_pebs)
+ pr_cont("Architectural PEBS, ");
+
intel_pmu_check_counters_mask(&x86_pmu.cntr_mask64,
&x86_pmu.fixed_cntr_mask64,
&x86_pmu.intel_ctrl);
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index e216622b94dc..4597b5c48d8a 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1530,6 +1530,15 @@ static inline void intel_pmu_drain_large_pebs(struct cpu_hw_events *cpuc)
intel_pmu_drain_pebs_buffer();
}
+static void __intel_pmu_pebs_enable(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct hw_perf_event *hwc = &event->hw;
+
+ hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
+ cpuc->pebs_enabled |= 1ULL << hwc->idx;
+}
+
void intel_pmu_pebs_enable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -1538,9 +1547,7 @@ void intel_pmu_pebs_enable(struct perf_event *event)
struct debug_store *ds = cpuc->ds;
unsigned int idx = hwc->idx;
- hwc->config &= ~ARCH_PERFMON_EVENTSEL_INT;
-
- cpuc->pebs_enabled |= 1ULL << hwc->idx;
+ __intel_pmu_pebs_enable(event);
if ((event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT) && (x86_pmu.version < 5))
cpuc->pebs_enabled |= 1ULL << (hwc->idx + 32);
@@ -1602,14 +1609,22 @@ void intel_pmu_pebs_del(struct perf_event *event)
pebs_update_state(needed_cb, cpuc, event, false);
}
-void intel_pmu_pebs_disable(struct perf_event *event)
+static void __intel_pmu_pebs_disable(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
struct hw_perf_event *hwc = &event->hw;
intel_pmu_drain_large_pebs(cpuc);
-
cpuc->pebs_enabled &= ~(1ULL << hwc->idx);
+ hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
+}
+
+void intel_pmu_pebs_disable(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct hw_perf_event *hwc = &event->hw;
+
+ __intel_pmu_pebs_disable(event);
if ((event->hw.flags & PERF_X86_EVENT_PEBS_LDLAT) &&
(x86_pmu.version < 5))
@@ -1621,8 +1636,6 @@ void intel_pmu_pebs_disable(struct perf_event *event)
if (cpuc->enabled)
wrmsrl(MSR_IA32_PEBS_ENABLE, cpuc->pebs_enabled);
-
- hwc->config |= ARCH_PERFMON_EVENTSEL_INT;
}
void intel_pmu_pebs_enable_all(void)
@@ -2654,11 +2667,26 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
}
}
+static void __init intel_arch_pebs_init(void)
+{
+ /*
+ * Current hybrid platforms always both support arch-PEBS or not
+ * on all kinds of cores. So directly set x86_pmu.arch_pebs flag
+ * if boot cpu supports arch-PEBS.
+ */
+ x86_pmu.arch_pebs = 1;
+ x86_pmu.pebs_buffer_size = PEBS_BUFFER_SIZE;
+ x86_pmu.pebs_capable = ~0ULL;
+
+ x86_pmu.pebs_enable = __intel_pmu_pebs_enable;
+ x86_pmu.pebs_disable = __intel_pmu_pebs_disable;
+}
+
/*
* PEBS probe and setup
*/
-void __init intel_pebs_init(void)
+static void __init intel_ds_pebs_init(void)
{
/*
* No support for 32bit formats
@@ -2773,6 +2801,14 @@ void __init intel_pebs_init(void)
}
}
+void __init intel_pebs_init(void)
+{
+ if (x86_pmu.intel_cap.pebs_format == 0xf)
+ intel_arch_pebs_init();
+ else
+ intel_ds_pebs_init();
+}
+
void perf_restore_debug_store(void)
{
struct debug_store *ds = __this_cpu_read(cpu_hw_events.ds);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index d201e6ac2ede..23ffad67a927 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -700,6 +700,12 @@ enum hybrid_pmu_type {
hybrid_big_small_tiny = hybrid_big | hybrid_small_tiny,
};
+struct arch_pebs_cap {
+ u64 caps;
+ u64 counters;
+ u64 pdists;
+};
+
struct x86_hybrid_pmu {
struct pmu pmu;
const char *name;
@@ -744,6 +750,8 @@ struct x86_hybrid_pmu {
mid_ack :1,
enabled_ack :1;
+ struct arch_pebs_cap arch_pebs_cap;
+
u64 pebs_data_source[PERF_PEBS_DATA_SOURCE_MAX];
};
@@ -898,7 +906,7 @@ struct x86_pmu {
union perf_capabilities intel_cap;
/*
- * Intel DebugStore bits
+ * Intel DebugStore and PEBS bits
*/
unsigned int bts :1,
bts_active :1,
@@ -909,7 +917,8 @@ struct x86_pmu {
pebs_no_tlb :1,
pebs_no_isolation :1,
pebs_block :1,
- pebs_ept :1;
+ pebs_ept :1,
+ arch_pebs :1;
int pebs_record_size;
int pebs_buffer_size;
u64 pebs_events_mask;
@@ -921,6 +930,11 @@ struct x86_pmu {
u64 rtm_abort_event;
u64 pebs_capable;
+ /*
+ * Intel Architectural PEBS
+ */
+ struct arch_pebs_cap arch_pebs_cap;
+
/*
* Intel LBR
*/
@@ -1209,7 +1223,7 @@ int x86_reserve_hardware(void);
void x86_release_hardware(void);
-int x86_pmu_max_precise(void);
+int x86_pmu_max_precise(struct pmu *pmu);
void hw_perf_lbr_event_destroy(struct perf_event *event);
@@ -1784,6 +1798,11 @@ static inline int intel_pmu_max_num_pebs(struct pmu *pmu)
return fls((u32)hybrid(pmu, pebs_events_mask));
}
+static inline bool intel_pmu_has_pebs(void)
+{
+ return x86_pmu.ds_pebs || x86_pmu.arch_pebs;
+}
+
#else /* CONFIG_CPU_SUP_INTEL */
static inline void reserve_ds_buffers(void)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 70d1d94aca7e..7fca9494aae9 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -196,6 +196,8 @@ union cpuid10_edx {
#define ARCH_PERFMON_EXT_LEAF 0x00000023
#define ARCH_PERFMON_NUM_COUNTER_LEAF 0x1
#define ARCH_PERFMON_ACR_LEAF 0x2
+#define ARCH_PERFMON_PEBS_CAP_LEAF 0x4
+#define ARCH_PERFMON_PEBS_COUNTER_LEAF 0x5
union cpuid35_eax {
struct {
@@ -206,7 +208,10 @@ union cpuid35_eax {
unsigned int acr_subleaf:1;
/* Events Sub-Leaf */
unsigned int events_subleaf:1;
- unsigned int reserved:28;
+ /* arch-PEBS Sub-Leaves */
+ unsigned int pebs_caps_subleaf:1;
+ unsigned int pebs_cnts_subleaf:1;
+ unsigned int reserved:26;
} split;
unsigned int full;
};
--
2.40.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [Patch v3 08/22] perf/x86/intel/ds: Factor out PEBS record processing code to functions
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (6 preceding siblings ...)
2025-04-15 11:44 ` [Patch v3 07/22] perf/x86/intel: Initialize architectural PEBS Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 11:44 ` [Patch v3 09/22] perf/x86/intel/ds: Factor out PEBS group " Dapeng Mi
` (14 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Besides some PEBS record layout differences, arch-PEBS can share most of
the PEBS record processing code with adaptive PEBS. Thus, factor out
this common processing code into independent inline functions so it can
be reused by the subsequent arch-PEBS handler.
Suggested-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/ds.c | 80 ++++++++++++++++++++++++++------------
1 file changed, 55 insertions(+), 25 deletions(-)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 4597b5c48d8a..22831ef003d0 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2599,6 +2599,54 @@ static void intel_pmu_drain_pebs_nhm(struct pt_regs *iregs, struct perf_sample_d
}
}
+static inline void __intel_pmu_handle_pebs_record(struct pt_regs *iregs,
+ struct pt_regs *regs,
+ struct perf_sample_data *data,
+ void *at, u64 pebs_status,
+ short *counts, void **last,
+ setup_fn setup_sample)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct perf_event *event;
+ int bit;
+
+ for_each_set_bit(bit, (unsigned long *)&pebs_status, X86_PMC_IDX_MAX) {
+ event = cpuc->events[bit];
+
+ if (WARN_ON_ONCE(!event) ||
+ WARN_ON_ONCE(!event->attr.precise_ip))
+ continue;
+
+ if (counts[bit]++)
+ __intel_pmu_pebs_event(event, iregs, regs, data,
+ last[bit], setup_sample);
+
+ last[bit] = at;
+ }
+}
+
+static inline void
+__intel_pmu_handle_last_pebs_record(struct pt_regs *iregs, struct pt_regs *regs,
+ struct perf_sample_data *data, u64 mask,
+ short *counts, void **last,
+ setup_fn setup_sample)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct perf_event *event;
+ int bit;
+
+ for_each_set_bit(bit, (unsigned long *)&mask, X86_PMC_IDX_MAX) {
+ if (!counts[bit])
+ continue;
+
+ event = cpuc->events[bit];
+
+ __intel_pmu_pebs_last_event(event, iregs, regs, data, last[bit],
+ counts[bit], setup_sample);
+ }
+
+}
+
static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_data *data)
{
short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
@@ -2608,9 +2656,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
struct x86_perf_regs perf_regs;
struct pt_regs *regs = &perf_regs.regs;
struct pebs_basic *basic;
- struct perf_event *event;
void *base, *at, *top;
- int bit;
u64 mask;
if (!x86_pmu.pebs_active)
@@ -2623,6 +2669,7 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
mask = hybrid(cpuc->pmu, pebs_events_mask) |
(hybrid(cpuc->pmu, fixed_cntr_mask64) << INTEL_PMC_IDX_FIXED);
+ mask &= cpuc->pebs_enabled;
if (unlikely(base >= top)) {
intel_pmu_pebs_event_update_no_drain(cpuc, X86_PMC_IDX_MAX);
@@ -2640,31 +2687,14 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
if (basic->format_size != cpuc->pebs_record_size)
continue;
- pebs_status = basic->applicable_counters & cpuc->pebs_enabled & mask;
- for_each_set_bit(bit, (unsigned long *)&pebs_status, X86_PMC_IDX_MAX) {
- event = cpuc->events[bit];
-
- if (WARN_ON_ONCE(!event) ||
- WARN_ON_ONCE(!event->attr.precise_ip))
- continue;
-
- if (counts[bit]++) {
- __intel_pmu_pebs_event(event, iregs, regs, data, last[bit],
- setup_pebs_adaptive_sample_data);
- }
- last[bit] = at;
- }
+ pebs_status = mask & basic->applicable_counters;
+ __intel_pmu_handle_pebs_record(iregs, regs, data, at,
+ pebs_status, counts, last,
+ setup_pebs_adaptive_sample_data);
}
- for_each_set_bit(bit, (unsigned long *)&mask, X86_PMC_IDX_MAX) {
- if (!counts[bit])
- continue;
-
- event = cpuc->events[bit];
-
- __intel_pmu_pebs_last_event(event, iregs, regs, data, last[bit],
- counts[bit], setup_pebs_adaptive_sample_data);
- }
+ __intel_pmu_handle_last_pebs_record(iregs, regs, data, mask, counts, last,
+ setup_pebs_adaptive_sample_data);
}
static void __init intel_arch_pebs_init(void)
--
2.40.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [Patch v3 09/22] perf/x86/intel/ds: Factor out PEBS group processing code to functions
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (7 preceding siblings ...)
2025-04-15 11:44 ` [Patch v3 08/22] perf/x86/intel/ds: Factor out PEBS record processing code to functions Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 11:44 ` [Patch v3 10/22] perf/x86/intel: Process arch-PEBS records or record fragments Dapeng Mi
` (13 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Adaptive PEBS and arch-PEBS share much of the same code for processing
PEBS groups, such as the basic, GPR and meminfo groups. Extract this
shared code into generic functions to avoid duplication.
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/ds.c | 172 ++++++++++++++++++++++---------------
1 file changed, 105 insertions(+), 67 deletions(-)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 22831ef003d0..6c872bf2e916 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2073,6 +2073,91 @@ static inline void __setup_pebs_counter_group(struct cpu_hw_events *cpuc,
#define PEBS_LATENCY_MASK 0xffff
+static inline void __setup_perf_sample_data(struct perf_event *event,
+ struct pt_regs *iregs,
+ struct perf_sample_data *data)
+{
+ perf_sample_data_init(data, 0, event->hw.last_period);
+ data->period = event->hw.last_period;
+
+ /*
+ * We must however always use iregs for the unwinder to stay sane; the
+ * record BP,SP,IP can point into thin air when the record is from a
+ * previous PMI context or an (I)RET happened between the record and
+ * PMI.
+ */
+ perf_sample_save_callchain(data, event, iregs);
+}
+
+static inline void __setup_pebs_basic_group(struct perf_event *event,
+ struct pt_regs *regs,
+ struct perf_sample_data *data,
+ u64 sample_type, u64 ip,
+ u64 tsc, u16 retire)
+{
+ /* The ip in basic is EventingIP */
+ set_linear_ip(regs, ip);
+ regs->flags = PERF_EFLAGS_EXACT;
+ setup_pebs_time(event, data, tsc);
+
+ if (sample_type & PERF_SAMPLE_WEIGHT_STRUCT)
+ data->weight.var3_w = retire;
+}
+
+static inline void __setup_pebs_gpr_group(struct perf_event *event,
+ struct pt_regs *regs,
+ struct pebs_gprs *gprs,
+ u64 sample_type)
+{
+ if (event->attr.precise_ip < 2) {
+ set_linear_ip(regs, gprs->ip);
+ regs->flags &= ~PERF_EFLAGS_EXACT;
+ }
+
+ if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER))
+ adaptive_pebs_save_regs(regs, gprs);
+}
+
+static inline void __setup_pebs_meminfo_group(struct perf_event *event,
+ struct perf_sample_data *data,
+ u64 sample_type, u64 latency,
+ u16 instr_latency, u64 address,
+ u64 aux, u64 tsx_tuning, u64 ax)
+{
+ if (sample_type & PERF_SAMPLE_WEIGHT_TYPE) {
+ u64 tsx_latency = intel_get_tsx_weight(tsx_tuning);
+
+ data->weight.var2_w = instr_latency;
+
+ /*
+ * Although meminfo::latency is defined as a u64,
+ * only the lower 32 bits include the valid data
+ * in practice on Ice Lake and earlier platforms.
+ */
+ if (sample_type & PERF_SAMPLE_WEIGHT)
+ data->weight.full = latency ?: tsx_latency;
+ else
+ data->weight.var1_dw = (u32)latency ?: tsx_latency;
+
+ data->sample_flags |= PERF_SAMPLE_WEIGHT_TYPE;
+ }
+
+ if (sample_type & PERF_SAMPLE_DATA_SRC) {
+ data->data_src.val = get_data_src(event, aux);
+ data->sample_flags |= PERF_SAMPLE_DATA_SRC;
+ }
+
+ if (sample_type & PERF_SAMPLE_ADDR_TYPE) {
+ data->addr = address;
+ data->sample_flags |= PERF_SAMPLE_ADDR;
+ }
+
+ if (sample_type & PERF_SAMPLE_TRANSACTION) {
+ data->txn = intel_get_tsx_transaction(tsx_tuning, ax);
+ data->sample_flags |= PERF_SAMPLE_TRANSACTION;
+ }
+}
+
/*
* With adaptive PEBS the layout depends on what fields are configured.
*/
@@ -2082,12 +2167,14 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
struct pt_regs *regs)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ u64 sample_type = event->attr.sample_type;
struct pebs_basic *basic = __pebs;
void *next_record = basic + 1;
- u64 sample_type, format_group;
struct pebs_meminfo *meminfo = NULL;
struct pebs_gprs *gprs = NULL;
struct x86_perf_regs *perf_regs;
+ u64 format_group;
+ u16 retire;
if (basic == NULL)
return;
@@ -2095,32 +2182,17 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
perf_regs = container_of(regs, struct x86_perf_regs, regs);
perf_regs->xmm_regs = NULL;
- sample_type = event->attr.sample_type;
format_group = basic->format_group;
- perf_sample_data_init(data, 0, event->hw.last_period);
- data->period = event->hw.last_period;
- setup_pebs_time(event, data, basic->tsc);
-
- /*
- * We must however always use iregs for the unwinder to stay sane; the
- * record BP,SP,IP can point into thin air when the record is from a
- * previous PMI context or an (I)RET happened between the record and
- * PMI.
- */
- perf_sample_save_callchain(data, event, iregs);
+ __setup_perf_sample_data(event, iregs, data);
*regs = *iregs;
- /* The ip in basic is EventingIP */
- set_linear_ip(regs, basic->ip);
- regs->flags = PERF_EFLAGS_EXACT;
- if (sample_type & PERF_SAMPLE_WEIGHT_STRUCT) {
- if (x86_pmu.flags & PMU_FL_RETIRE_LATENCY)
- data->weight.var3_w = basic->retire_latency;
- else
- data->weight.var3_w = 0;
- }
+ /* basic group */
+ retire = x86_pmu.flags & PMU_FL_RETIRE_LATENCY ?
+ basic->retire_latency : 0;
+ __setup_pebs_basic_group(event, regs, data, sample_type,
+ basic->ip, basic->tsc, retire);
/*
* The record for MEMINFO is in front of GP
@@ -2136,54 +2208,20 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
gprs = next_record;
next_record = gprs + 1;
- if (event->attr.precise_ip < 2) {
- set_linear_ip(regs, gprs->ip);
- regs->flags &= ~PERF_EFLAGS_EXACT;
- }
-
- if (sample_type & (PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER))
- adaptive_pebs_save_regs(regs, gprs);
+ __setup_pebs_gpr_group(event, regs, gprs, sample_type);
}
if (format_group & PEBS_DATACFG_MEMINFO) {
- if (sample_type & PERF_SAMPLE_WEIGHT_TYPE) {
- u64 latency = x86_pmu.flags & PMU_FL_INSTR_LATENCY ?
- meminfo->cache_latency : meminfo->mem_latency;
-
- if (x86_pmu.flags & PMU_FL_INSTR_LATENCY)
- data->weight.var2_w = meminfo->instr_latency;
-
- /*
- * Although meminfo::latency is defined as a u64,
- * only the lower 32 bits include the valid data
- * in practice on Ice Lake and earlier platforms.
- */
- if (sample_type & PERF_SAMPLE_WEIGHT) {
- data->weight.full = latency ?:
- intel_get_tsx_weight(meminfo->tsx_tuning);
- } else {
- data->weight.var1_dw = (u32)latency ?:
- intel_get_tsx_weight(meminfo->tsx_tuning);
- }
-
- data->sample_flags |= PERF_SAMPLE_WEIGHT_TYPE;
- }
-
- if (sample_type & PERF_SAMPLE_DATA_SRC) {
- data->data_src.val = get_data_src(event, meminfo->aux);
- data->sample_flags |= PERF_SAMPLE_DATA_SRC;
- }
-
- if (sample_type & PERF_SAMPLE_ADDR_TYPE) {
- data->addr = meminfo->address;
- data->sample_flags |= PERF_SAMPLE_ADDR;
- }
-
- if (sample_type & PERF_SAMPLE_TRANSACTION) {
- data->txn = intel_get_tsx_transaction(meminfo->tsx_tuning,
- gprs ? gprs->ax : 0);
- data->sample_flags |= PERF_SAMPLE_TRANSACTION;
- }
+ u64 latency = x86_pmu.flags & PMU_FL_INSTR_LATENCY ?
+ meminfo->cache_latency : meminfo->mem_latency;
+ u64 instr_latency = x86_pmu.flags & PMU_FL_INSTR_LATENCY ?
+ meminfo->instr_latency : 0;
+ u64 ax = gprs ? gprs->ax : 0;
+
+ __setup_pebs_meminfo_group(event, data, sample_type, latency,
+ instr_latency, meminfo->address,
+ meminfo->aux, meminfo->tsx_tuning,
+ ax);
}
if (format_group & PEBS_DATACFG_XMMS) {
--
2.40.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [Patch v3 10/22] perf/x86/intel: Process arch-PEBS records or record fragments
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (8 preceding siblings ...)
2025-04-15 11:44 ` [Patch v3 09/22] perf/x86/intel/ds: Factor out PEBS group " Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 13:57 ` Peter Zijlstra
2025-04-15 11:44 ` [Patch v3 11/22] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR Dapeng Mi
` (12 subsequent siblings)
22 siblings, 1 reply; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
A significant difference from adaptive PEBS is that an arch-PEBS record
supports fragments, which means a record can be split into several
independent fragments, each carrying its own arch-PEBS header.
This patch defines the architectural PEBS record layout structures and
adds helpers to process arch-PEBS records or fragments. Only the legacy
PEBS groups (basic, GPR, XMM and LBR) are supported in this patch; the
newly added YMM/ZMM/OPMASK vector register capturing will be supported
in subsequent patches.
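For illustration only, a sketch of how a fragmented record would be
walked using the header bits and the arch_pebs_record_continued() helper
added below. The assumption that the next fragment starts right where
the previous fragment's last group ends mirrors the next_record pointer
walk in the handler; "record_base" is a hypothetical starting address:

  struct arch_pebs_header *header;
  void *next, *at = record_base;

  do {
          header = at;
          next = at + sizeof(*header);

          if (header->basic)      /* IP, TSC, retire latency */
                  next += sizeof(struct arch_pebs_basic);
          if (header->aux)        /* memory auxiliary info */
                  next += sizeof(struct arch_pebs_aux);
          if (header->gpr)        /* general purpose registers */
                  next += sizeof(struct arch_pebs_gprs);
          if (header->xmm)        /* XER header plus XMM registers */
                  next += sizeof(struct arch_pebs_xer_header) +
                          sizeof(struct arch_pebs_xmm);
          /* LBR group size depends on header->lbr, as in the handler. */

          at = next;              /* assumed start of the next fragment */
  } while (arch_pebs_record_continued(header));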
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/core.c | 15 ++-
arch/x86/events/intel/ds.c | 180 ++++++++++++++++++++++++++++++
arch/x86/include/asm/msr-index.h | 6 +
arch/x86/include/asm/perf_event.h | 100 +++++++++++++++++
4 files changed, 300 insertions(+), 1 deletion(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 09e2a23f9bcc..0f911e974e02 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3216,6 +3216,19 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
status &= ~GLOBAL_STATUS_PERF_METRICS_OVF_BIT;
}
+ /*
+ * Arch PEBS sets bit 54 in the global status register
+ */
+ if (__test_and_clear_bit(GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT,
+ (unsigned long *)&status)) {
+ handled++;
+ static_call(x86_pmu_drain_pebs)(regs, &data);
+
+ if (cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS] &&
+ is_pebs_counter_event_group(cpuc->events[INTEL_PMC_IDX_FIXED_SLOTS]))
+ status &= ~GLOBAL_STATUS_PERF_METRICS_OVF_BIT;
+ }
+
/*
* Intel PT
*/
@@ -3270,7 +3283,7 @@ static int handle_pmi_common(struct pt_regs *regs, u64 status)
* The PEBS buffer has to be drained before handling the A-PMI
*/
if (is_pebs_counter_event_group(event))
- x86_pmu.drain_pebs(regs, &data);
+ static_call(x86_pmu_drain_pebs)(regs, &data);
last_period = event->hw.last_period;
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 6c872bf2e916..ed0bccb04b95 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2272,6 +2272,114 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
format_group);
}
+static inline bool arch_pebs_record_continued(struct arch_pebs_header *header)
+{
+ /* Continue bit or null PEBS record indicates fragment follows. */
+ return header->cont || !(header->format & GENMASK_ULL(63, 16));
+}
+
+static void setup_arch_pebs_sample_data(struct perf_event *event,
+ struct pt_regs *iregs, void *__pebs,
+ struct perf_sample_data *data,
+ struct pt_regs *regs)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ u64 sample_type = event->attr.sample_type;
+ struct arch_pebs_header *header = NULL;
+ struct arch_pebs_aux *meminfo = NULL;
+ struct arch_pebs_gprs *gprs = NULL;
+ struct x86_perf_regs *perf_regs;
+ void *next_record;
+ void *at = __pebs;
+
+ if (at == NULL)
+ return;
+
+ perf_regs = container_of(regs, struct x86_perf_regs, regs);
+ perf_regs->xmm_regs = NULL;
+
+ __setup_perf_sample_data(event, iregs, data);
+
+ *regs = *iregs;
+
+again:
+ header = at;
+ next_record = at + sizeof(struct arch_pebs_header);
+ if (header->basic) {
+ struct arch_pebs_basic *basic = next_record;
+ u16 retire = 0;
+
+ next_record = basic + 1;
+
+ if (sample_type & PERF_SAMPLE_WEIGHT_STRUCT)
+ retire = basic->valid ? basic->retire : 0;
+ __setup_pebs_basic_group(event, regs, data, sample_type,
+ basic->ip, basic->tsc, retire);
+ }
+
+	/*
+	 * The MEMINFO record comes before the GP group, but
+	 * PERF_SAMPLE_TRANSACTION needs gprs->ax.
+	 * Save the pointer here and process it later.
+	 */
+ if (header->aux) {
+ meminfo = next_record;
+ next_record = meminfo + 1;
+ }
+
+ if (header->gpr) {
+ gprs = next_record;
+ next_record = gprs + 1;
+
+ __setup_pebs_gpr_group(event, regs, (struct pebs_gprs *)gprs,
+ sample_type);
+ }
+
+ if (header->aux) {
+ u64 ax = gprs ? gprs->ax : 0;
+
+ __setup_pebs_meminfo_group(event, data, sample_type,
+ meminfo->cache_latency,
+ meminfo->instr_latency,
+ meminfo->address, meminfo->aux,
+ meminfo->tsx_tuning, ax);
+ }
+
+ if (header->xmm) {
+ struct arch_pebs_xmm *xmm;
+
+ next_record += sizeof(struct arch_pebs_xer_header);
+
+ xmm = next_record;
+ perf_regs->xmm_regs = xmm->xmm;
+ next_record = xmm + 1;
+ }
+
+ if (header->lbr) {
+ struct arch_pebs_lbr_header *lbr_header = next_record;
+ struct lbr_entry *lbr;
+ int num_lbr;
+
+ next_record = lbr_header + 1;
+ lbr = next_record;
+
+ num_lbr = header->lbr == ARCH_PEBS_LBR_NUM_VAR ? lbr_header->depth :
+ header->lbr * ARCH_PEBS_BASE_LBR_ENTRIES;
+ next_record += num_lbr * sizeof(struct lbr_entry);
+
+ if (has_branch_stack(event)) {
+ intel_pmu_store_pebs_lbrs(lbr);
+ intel_pmu_lbr_save_brstack(data, cpuc, event);
+ }
+ }
+
+	/* Parse the following fragments, if any. */
+ if (arch_pebs_record_continued(header)) {
+ at = at + header->size;
+ goto again;
+ }
+}
+
static inline void *
get_next_pebs_record_by_bit(void *base, void *top, int bit)
{
@@ -2735,6 +2843,77 @@ static void intel_pmu_drain_pebs_icl(struct pt_regs *iregs, struct perf_sample_d
setup_pebs_adaptive_sample_data);
}
+static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
+ struct perf_sample_data *data)
+{
+ short counts[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS] = {};
+ void *last[INTEL_PMC_IDX_FIXED + MAX_FIXED_PEBS_EVENTS];
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ union arch_pebs_index index;
+ struct x86_perf_regs perf_regs;
+ struct pt_regs *regs = &perf_regs.regs;
+ void *base, *at, *top;
+ u64 mask;
+
+ rdmsrl(MSR_IA32_PEBS_INDEX, index.full);
+
+ if (unlikely(!index.split.wr)) {
+ intel_pmu_pebs_event_update_no_drain(cpuc, X86_PMC_IDX_MAX);
+ return;
+ }
+
+ base = cpuc->ds_pebs_vaddr;
+ top = (void *)((u64)cpuc->ds_pebs_vaddr +
+ (index.split.wr << ARCH_PEBS_INDEX_WR_SHIFT));
+
+ mask = hybrid(cpuc->pmu, arch_pebs_cap).counters & cpuc->pebs_enabled;
+
+ if (!iregs)
+ iregs = &dummy_iregs;
+
+ /* Process all but the last event for each counter. */
+ for (at = base; at < top;) {
+ struct arch_pebs_header *header;
+ struct arch_pebs_basic *basic;
+ u64 pebs_status;
+
+ header = at;
+
+ if (WARN_ON_ONCE(!header->size))
+ break;
+
+ /* 1st fragment or single record must have basic group */
+ if (!header->basic) {
+ at += header->size;
+ continue;
+ }
+
+ basic = at + sizeof(struct arch_pebs_header);
+ pebs_status = mask & basic->applicable_counters;
+ __intel_pmu_handle_pebs_record(iregs, regs, data, at,
+ pebs_status, counts, last,
+ setup_arch_pebs_sample_data);
+
+ /* Skip non-last fragments */
+ while (arch_pebs_record_continued(header)) {
+ if (!header->size)
+ break;
+ at += header->size;
+ header = at;
+ }
+
+ /* Skip last fragment or the single record */
+ at += header->size;
+ }
+
+ __intel_pmu_handle_last_pebs_record(iregs, regs, data, mask, counts,
+ last, setup_arch_pebs_sample_data);
+
+ index.split.wr = 0;
+ index.split.full = 0;
+ wrmsrl(MSR_IA32_PEBS_INDEX, index.full);
+}
+
static void __init intel_arch_pebs_init(void)
{
/*
@@ -2744,6 +2923,7 @@ static void __init intel_arch_pebs_init(void)
*/
x86_pmu.arch_pebs = 1;
x86_pmu.pebs_buffer_size = PEBS_BUFFER_SIZE;
+ x86_pmu.drain_pebs = intel_pmu_drain_arch_pebs;
x86_pmu.pebs_capable = ~0ULL;
x86_pmu.pebs_enable = __intel_pmu_pebs_enable;
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 53da787b9326..d77048df8e72 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -314,6 +314,12 @@
#define PERF_CAP_PEBS_MASK (PERF_CAP_PEBS_TRAP | PERF_CAP_ARCH_REG | \
PERF_CAP_PEBS_FORMAT | PERF_CAP_PEBS_BASELINE)
+/* Arch PEBS */
+#define MSR_IA32_PEBS_BASE 0x000003f4
+#define MSR_IA32_PEBS_INDEX 0x000003f5
+#define ARCH_PEBS_OFFSET_MASK 0x7fffff
+#define ARCH_PEBS_INDEX_WR_SHIFT 4
+
#define MSR_IA32_RTIT_CTL 0x00000570
#define RTIT_CTL_TRACEEN BIT(0)
#define RTIT_CTL_CYCLEACC BIT(1)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 7fca9494aae9..7f9d8e6577f0 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -433,6 +433,8 @@ static inline bool is_topdown_idx(int idx)
#define GLOBAL_STATUS_LBRS_FROZEN BIT_ULL(GLOBAL_STATUS_LBRS_FROZEN_BIT)
#define GLOBAL_STATUS_TRACE_TOPAPMI_BIT 55
#define GLOBAL_STATUS_TRACE_TOPAPMI BIT_ULL(GLOBAL_STATUS_TRACE_TOPAPMI_BIT)
+#define GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT 54
+#define GLOBAL_STATUS_ARCH_PEBS_THRESHOLD BIT_ULL(GLOBAL_STATUS_ARCH_PEBS_THRESHOLD_BIT)
#define GLOBAL_STATUS_PERF_METRICS_OVF_BIT 48
#define GLOBAL_CTRL_EN_PERF_METRICS 48
@@ -503,6 +505,104 @@ struct pebs_cntr_header {
#define INTEL_CNTR_METRICS 0x3
+/*
+ * Arch PEBS
+ */
+union arch_pebs_index {
+ struct {
+ u64 rsvd:4,
+ wr:23,
+ rsvd2:4,
+ full:1,
+ en:1,
+ rsvd3:3,
+ thresh:23,
+ rsvd4:5;
+ } split;
+ u64 full;
+};
+
+struct arch_pebs_header {
+ union {
+ u64 format;
+ struct {
+ u64 size:16, /* Record size */
+ rsvd:14,
+ mode:1, /* 64BIT_MODE */
+ cont:1,
+ rsvd2:3,
+ cntr:5,
+ lbr:2,
+ rsvd3:7,
+ xmm:1,
+ ymmh:1,
+ rsvd4:2,
+ opmask:1,
+ zmmh:1,
+ h16zmm:1,
+ rsvd5:5,
+ gpr:1,
+ aux:1,
+ basic:1;
+ };
+ };
+ u64 rsvd6;
+};
+
+struct arch_pebs_basic {
+ u64 ip;
+ u64 applicable_counters;
+ u64 tsc;
+ u64 retire :16, /* Retire Latency */
+ valid :1,
+ rsvd :47;
+ u64 rsvd2;
+ u64 rsvd3;
+};
+
+struct arch_pebs_aux {
+ u64 address;
+ u64 rsvd;
+ u64 rsvd2;
+ u64 rsvd3;
+ u64 rsvd4;
+ u64 aux;
+ u64 instr_latency :16,
+ pad2 :16,
+ cache_latency :16,
+ pad3 :16;
+ u64 tsx_tuning;
+};
+
+struct arch_pebs_gprs {
+ u64 flags, ip, ax, cx, dx, bx, sp, bp, si, di;
+ u64 r8, r9, r10, r11, r12, r13, r14, r15, ssp;
+ u64 rsvd;
+};
+
+struct arch_pebs_xer_header {
+ u64 xstate;
+ u64 rsvd;
+};
+
+struct arch_pebs_xmm {
+ u64 xmm[16*2]; /* two entries for each register */
+};
+
+#define ARCH_PEBS_LBR_NAN 0x0
+#define ARCH_PEBS_LBR_NUM_8 0x1
+#define ARCH_PEBS_LBR_NUM_16 0x2
+#define ARCH_PEBS_LBR_NUM_VAR 0x3
+#define ARCH_PEBS_BASE_LBR_ENTRIES 8
+struct arch_pebs_lbr_header {
+ u64 rsvd;
+ u64 ctl;
+ u64 depth;
+ u64 ler_from;
+ u64 ler_to;
+ u64 ler_info;
+};
+
/*
* AMD Extended Performance Monitoring and Debug cpuid feature detection
*/
--
2.40.1
* [Patch v3 11/22] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (9 preceding siblings ...)
2025-04-15 11:44 ` [Patch v3 10/22] perf/x86/intel: Process arch-PEBS records or record fragments Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 13:45 ` Peter Zijlstra
2025-04-15 13:48 ` Peter Zijlstra
2025-04-15 11:44 ` [Patch v3 12/22] perf/x86/intel: Update dyn_constraint based on PEBS event precise level Dapeng Mi
` (11 subsequent siblings)
22 siblings, 2 replies; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Arch-PEBS introduces a new MSR, IA32_PEBS_BASE, to store the physical
address of the arch-PEBS buffer. This patch allocates the arch-PEBS
buffer and then initializes the IA32_PEBS_BASE MSR with the buffer's
physical address.
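For illustration, a minimal user-space sketch of the value that
init_arch_pebs_buf_on_cpu() below writes to IA32_PEBS_BASE: the physical
base address of the buffer with the buffer-size encoding in the low bits.
The physical address used here is hypothetical, and the MSR itself can of
course only be written from ring 0:

#include <stdint.h>
#include <stdio.h>

#define PEBS_BUFFER_SHIFT	4	/* buffer size = 4KB << 4 = 64KB */

/* Low bits encode the buffer size; the rest is the physical base. */
static uint64_t arch_pebs_base_val(uint64_t buf_phys)
{
	return buf_phys | PEBS_BUFFER_SHIFT;
}

int main(void)
{
	uint64_t phys = 0x123456000ULL;	/* hypothetical page-aligned address */

	printf("IA32_PEBS_BASE = %#llx\n",
	       (unsigned long long)arch_pebs_base_val(phys));
	return 0;
}

The kernel splits this 64-bit value into two u32 halves for
wrmsr_on_cpu(), as the hunk below shows.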
Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/core.c | 2 +
arch/x86/events/intel/ds.c | 69 ++++++++++++++++++++++++++-------
arch/x86/events/perf_event.h | 7 +++-
arch/x86/include/asm/intel_ds.h | 3 +-
4 files changed, 66 insertions(+), 15 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 0f911e974e02..e0be6be50936 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -5448,6 +5448,7 @@ static void intel_pmu_cpu_starting(int cpu)
return;
init_debug_store_on_cpu(cpu);
+ init_arch_pebs_buf_on_cpu(cpu);
/*
* Deal with CPUs that don't clear their LBRs on power-up, and that may
* even boot with LBRs enabled.
@@ -5545,6 +5546,7 @@ static void free_excl_cntrs(struct cpu_hw_events *cpuc)
static void intel_pmu_cpu_dying(int cpu)
{
fini_debug_store_on_cpu(cpu);
+ fini_arch_pebs_buf_on_cpu(cpu);
}
void intel_cpuc_finish(struct cpu_hw_events *cpuc)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index ed0bccb04b95..7437a52ba5f0 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -624,13 +624,18 @@ static int alloc_pebs_buffer(int cpu)
int max, node = cpu_to_node(cpu);
void *buffer, *insn_buff, *cea;
- if (!x86_pmu.ds_pebs)
+ if (!intel_pmu_has_pebs())
return 0;
- buffer = dsalloc_pages(bsiz, GFP_KERNEL, cpu);
+ buffer = dsalloc_pages(bsiz, preemptible() ? GFP_KERNEL : GFP_ATOMIC, cpu);
if (unlikely(!buffer))
return -ENOMEM;
+ if (x86_pmu.arch_pebs) {
+ hwev->pebs_vaddr = buffer;
+ return 0;
+ }
+
/*
* HSW+ already provides us the eventing ip; no need to allocate this
* buffer then.
@@ -643,7 +648,7 @@ static int alloc_pebs_buffer(int cpu)
}
per_cpu(insn_buffer, cpu) = insn_buff;
}
- hwev->ds_pebs_vaddr = buffer;
+ hwev->pebs_vaddr = buffer;
/* Update the cpu entry area mapping */
cea = &get_cpu_entry_area(cpu)->cpu_debug_buffers.pebs_buffer;
ds->pebs_buffer_base = (unsigned long) cea;
@@ -659,17 +664,20 @@ static void release_pebs_buffer(int cpu)
struct cpu_hw_events *hwev = per_cpu_ptr(&cpu_hw_events, cpu);
void *cea;
- if (!x86_pmu.ds_pebs)
+ if (!intel_pmu_has_pebs())
return;
- kfree(per_cpu(insn_buffer, cpu));
- per_cpu(insn_buffer, cpu) = NULL;
+ if (x86_pmu.ds_pebs) {
+ kfree(per_cpu(insn_buffer, cpu));
+ per_cpu(insn_buffer, cpu) = NULL;
- /* Clear the fixmap */
- cea = &get_cpu_entry_area(cpu)->cpu_debug_buffers.pebs_buffer;
- ds_clear_cea(cea, x86_pmu.pebs_buffer_size);
- dsfree_pages(hwev->ds_pebs_vaddr, x86_pmu.pebs_buffer_size);
- hwev->ds_pebs_vaddr = NULL;
+ /* Clear the fixmap */
+ cea = &get_cpu_entry_area(cpu)->cpu_debug_buffers.pebs_buffer;
+ ds_clear_cea(cea, x86_pmu.pebs_buffer_size);
+ }
+
+ dsfree_pages(hwev->pebs_vaddr, x86_pmu.pebs_buffer_size);
+ hwev->pebs_vaddr = NULL;
}
static int alloc_bts_buffer(int cpu)
@@ -822,6 +830,41 @@ void reserve_ds_buffers(void)
}
}
+void init_arch_pebs_buf_on_cpu(int cpu)
+{
+ struct cpu_hw_events *cpuc = per_cpu_ptr(&cpu_hw_events, cpu);
+ u64 arch_pebs_base;
+
+ if (!x86_pmu.arch_pebs)
+ return;
+
+ if (alloc_pebs_buffer(cpu) < 0 || !cpuc->pebs_vaddr) {
+		WARN(1, "Failed to allocate PEBS buffer on CPU %d\n", cpu);
+ x86_pmu.pebs_active = 0;
+ return;
+ }
+
+	/*
+	 * 4KB-aligned pointer to the output buffer
+	 * (__alloc_pages_node() returns a page-aligned address).
+	 * Buffer Size = 4KB * 2^SIZE
+	 * Contiguous physical buffer (__alloc_pages_node() with order).
+	 */
+ arch_pebs_base = virt_to_phys(cpuc->pebs_vaddr) | PEBS_BUFFER_SHIFT;
+ wrmsr_on_cpu(cpu, MSR_IA32_PEBS_BASE, (u32)arch_pebs_base,
+ (u32)(arch_pebs_base >> 32));
+ x86_pmu.pebs_active = 1;
+}
+
+void fini_arch_pebs_buf_on_cpu(int cpu)
+{
+ if (!x86_pmu.arch_pebs)
+ return;
+
+ release_pebs_buffer(cpu);
+ wrmsr_on_cpu(cpu, MSR_IA32_PEBS_BASE, 0, 0);
+}
+
/*
* BTS
*/
@@ -2862,8 +2905,8 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
return;
}
- base = cpuc->ds_pebs_vaddr;
- top = (void *)((u64)cpuc->ds_pebs_vaddr +
+ base = cpuc->pebs_vaddr;
+ top = (void *)((u64)cpuc->pebs_vaddr +
(index.split.wr << ARCH_PEBS_INDEX_WR_SHIFT));
mask = hybrid(cpuc->pmu, arch_pebs_cap).counters & cpuc->pebs_enabled;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 23ffad67a927..d93d4c7a9876 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -275,8 +275,9 @@ struct cpu_hw_events {
* Intel DebugStore bits
*/
struct debug_store *ds;
- void *ds_pebs_vaddr;
void *ds_bts_vaddr;
+ /* DS based PEBS or arch-PEBS buffer address */
+ void *pebs_vaddr;
u64 pebs_enabled;
int n_pebs;
int n_large_pebs;
@@ -1610,6 +1611,10 @@ extern void intel_cpuc_finish(struct cpu_hw_events *cpuc);
int intel_pmu_init(void);
+void init_arch_pebs_buf_on_cpu(int cpu);
+
+void fini_arch_pebs_buf_on_cpu(int cpu);
+
void init_debug_store_on_cpu(int cpu);
void fini_debug_store_on_cpu(int cpu);
diff --git a/arch/x86/include/asm/intel_ds.h b/arch/x86/include/asm/intel_ds.h
index 5dbeac48a5b9..023c2883f9f3 100644
--- a/arch/x86/include/asm/intel_ds.h
+++ b/arch/x86/include/asm/intel_ds.h
@@ -4,7 +4,8 @@
#include <linux/percpu-defs.h>
#define BTS_BUFFER_SIZE (PAGE_SIZE << 4)
-#define PEBS_BUFFER_SIZE (PAGE_SIZE << 4)
+#define PEBS_BUFFER_SHIFT 4
+#define PEBS_BUFFER_SIZE (PAGE_SIZE << PEBS_BUFFER_SHIFT)
/* The maximal number of PEBS events: */
#define MAX_PEBS_EVENTS_FMT4 8
--
2.40.1
* [Patch v3 12/22] perf/x86/intel: Update dyn_constraint based on PEBS event precise level
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (10 preceding siblings ...)
2025-04-15 11:44 ` [Patch v3 11/22] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 13:53 ` Peter Zijlstra
2025-04-15 11:44 ` [Patch v3 13/22] perf/x86/intel: Setup PEBS data configuration and enable legacy groups Dapeng Mi
` (10 subsequent siblings)
22 siblings, 1 reply; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
arch-PEBS provides CPUID leaves to enumerate which counters support PEBS
sampling and precise-distribution PEBS sampling. Thus PEBS constraints
should be configured dynamically based on these counter and
precise-distribution bitmaps instead of being defined statically.
Update the event's dyn_constraint based on the PEBS event's precise level.
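A minimal sketch of the counter-mask selection added to
intel_pmu_hw_config() below; the capability bitmaps here are hypothetical
values chosen only for illustration:

#include <stdint.h>
#include <stdio.h>

/* Pick the counter bitmap an arch-PEBS event may be scheduled on. */
static uint64_t pebs_cntr_mask(int precise_ip, uint64_t counters,
			       uint64_t pdists)
{
	/* precise_ip >= 3 requests precise-distribution (pdist) counters. */
	return precise_ip >= 3 ? pdists : counters;
}

int main(void)
{
	uint64_t counters = 0x1ff;	/* hypothetical: counters supporting PEBS */
	uint64_t pdists   = 0x001;	/* hypothetical: only GP0 supports pdist */

	printf("precise_ip=2 -> mask %#llx\n",
	       (unsigned long long)pebs_cntr_mask(2, counters, pdists));
	printf("precise_ip=3 -> mask %#llx\n",
	       (unsigned long long)pebs_cntr_mask(3, counters, pdists));
	return 0;
}

The resulting mask only becomes the event's dyn_constraint when it differs
from the PMU's full counter mask, as the hunk below shows.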
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/core.c | 9 +++++++++
arch/x86/events/intel/ds.c | 1 +
2 files changed, 10 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index e0be6be50936..265b5e4baf73 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4252,6 +4252,8 @@ static int intel_pmu_hw_config(struct perf_event *event)
}
if (event->attr.precise_ip) {
+ struct arch_pebs_cap pebs_cap = hybrid(event->pmu, arch_pebs_cap);
+
if ((event->attr.config & INTEL_ARCH_EVENT_MASK) == INTEL_FIXED_VLBR_EVENT)
return -EINVAL;
@@ -4265,6 +4267,13 @@ static int intel_pmu_hw_config(struct perf_event *event)
}
if (x86_pmu.pebs_aliases)
x86_pmu.pebs_aliases(event);
+
+ if (x86_pmu.arch_pebs) {
+ u64 cntr_mask = event->attr.precise_ip >= 3 ?
+ pebs_cap.pdists : pebs_cap.counters;
+ if (cntr_mask != hybrid(event->pmu, intel_ctrl))
+ event->hw.dyn_constraint = cntr_mask;
+ }
}
if (needs_branch_stack(event)) {
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 7437a52ba5f0..757d97c05d8f 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2968,6 +2968,7 @@ static void __init intel_arch_pebs_init(void)
x86_pmu.pebs_buffer_size = PEBS_BUFFER_SIZE;
x86_pmu.drain_pebs = intel_pmu_drain_arch_pebs;
x86_pmu.pebs_capable = ~0ULL;
+ x86_pmu.flags |= PMU_FL_PEBS_ALL;
x86_pmu.pebs_enable = __intel_pmu_pebs_enable;
x86_pmu.pebs_disable = __intel_pmu_pebs_disable;
--
2.40.1
* [Patch v3 13/22] perf/x86/intel: Setup PEBS data configuration and enable legacy groups
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (11 preceding siblings ...)
2025-04-15 11:44 ` [Patch v3 12/22] perf/x86/intel: Update dyn_constraint based on PEBS event precise level Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 11:44 ` [Patch v3 14/22] perf/x86/intel: Add counter group support for arch-PEBS Dapeng Mi
` (9 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Unlike legacy PEBS, arch-PEBS provides per-counter PEBS data
configuration by programming the IA32_PMC_GPx/FXx_CFG_C MSRs.
This patch obtains the PEBS data configuration from the event attributes,
writes it to the IA32_PMC_GPx/FXx_CFG_C MSRs and enables the
corresponding PEBS groups.
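A simplified sketch of how the CFG_C value is composed from the event's
PEBS data configuration, mirroring the intel_pmu_enable_event_ext() hunk
below. The auto-reload/threshold handling is omitted, and the capability
mask used in main() is hypothetical:

#include <stdint.h>
#include <stdio.h>

#define BIT_ULL(n)		(1ULL << (n))

/* Bit definitions mirror the msr-index.h additions below. */
#define ARCH_PEBS_LBR		(0x3ULL << 40)
#define ARCH_PEBS_VECR_XMM	BIT_ULL(49)
#define ARCH_PEBS_GPR		BIT_ULL(61)
#define ARCH_PEBS_AUX		BIT_ULL(62)
#define ARCH_PEBS_EN		BIT_ULL(63)

#define PEBS_DATACFG_MEMINFO	BIT_ULL(0)
#define PEBS_DATACFG_GP		BIT_ULL(1)
#define PEBS_DATACFG_XMMS	BIT_ULL(2)
#define PEBS_DATACFG_LBRS	BIT_ULL(3)

static uint64_t cfg_c_val(uint64_t data_cfg, uint64_t caps)
{
	uint64_t ext = ARCH_PEBS_EN;

	if (data_cfg & PEBS_DATACFG_MEMINFO)
		ext |= ARCH_PEBS_AUX & caps;
	if (data_cfg & PEBS_DATACFG_GP)
		ext |= ARCH_PEBS_GPR & caps;
	if (data_cfg & PEBS_DATACFG_XMMS)
		ext |= ARCH_PEBS_VECR_XMM & caps;
	if (data_cfg & PEBS_DATACFG_LBRS)
		ext |= ARCH_PEBS_LBR & caps;

	return ext;
}

int main(void)
{
	/* hypothetical capability mask: GPR, AUX and LBR groups supported */
	uint64_t caps = ARCH_PEBS_GPR | ARCH_PEBS_AUX | ARCH_PEBS_LBR;

	printf("CFG_C = %#llx\n", (unsigned long long)
	       cfg_c_val(PEBS_DATACFG_MEMINFO | PEBS_DATACFG_GP, caps));
	return 0;
}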
Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/core.c | 127 +++++++++++++++++++++++++++++++
arch/x86/events/intel/ds.c | 17 +++++
arch/x86/events/perf_event.h | 12 +++
arch/x86/include/asm/intel_ds.h | 7 ++
arch/x86/include/asm/msr-index.h | 8 ++
5 files changed, 171 insertions(+)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 265b5e4baf73..ae7f5dfee041 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2562,6 +2562,39 @@ static void intel_pmu_disable_fixed(struct perf_event *event)
cpuc->fixed_ctrl_val &= ~mask;
}
+static inline void __intel_pmu_update_event_ext(int idx, u64 ext)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ u32 msr = idx < INTEL_PMC_IDX_FIXED ?
+ x86_pmu_cfg_c_addr(idx, true) :
+ x86_pmu_cfg_c_addr(idx - INTEL_PMC_IDX_FIXED, false);
+
+ cpuc->cfg_c_val[idx] = ext;
+ wrmsrl(msr, ext);
+}
+
+static void intel_pmu_disable_event_ext(struct perf_event *event)
+{
+ if (!x86_pmu.arch_pebs)
+ return;
+
+	/*
+	 * Only clear the CFG_C MSR for PEBS counter group events.
+	 * This avoids the HW counter's value being incorrectly
+	 * added into other PEBS records after the PEBS counter
+	 * group events are disabled.
+	 *
+	 * For other events it's unnecessary to clear the CFG_C
+	 * MSRs, since CFG_C has no effect while the counter is
+	 * disabled. That helps to reduce the WRMSR overhead on
+	 * context switches.
+	 */
+ if (!is_pebs_counter_event_group(event))
+ return;
+
+ __intel_pmu_update_event_ext(event->hw.idx, 0);
+}
+
static void intel_pmu_disable_event(struct perf_event *event)
{
struct hw_perf_event *hwc = &event->hw;
@@ -2570,9 +2603,12 @@ static void intel_pmu_disable_event(struct perf_event *event)
switch (idx) {
case 0 ... INTEL_PMC_IDX_FIXED - 1:
intel_clear_masks(event, idx);
+ intel_pmu_disable_event_ext(event);
x86_pmu_disable_event(event);
break;
case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1:
+ intel_pmu_disable_event_ext(event);
+ fallthrough;
case INTEL_PMC_IDX_METRIC_BASE ... INTEL_PMC_IDX_METRIC_END:
intel_pmu_disable_fixed(event);
break;
@@ -2941,6 +2977,67 @@ static void intel_pmu_enable_acr(struct perf_event *event)
DEFINE_STATIC_CALL_NULL(intel_pmu_enable_acr_event, intel_pmu_enable_acr);
+static void intel_pmu_enable_event_ext(struct perf_event *event)
+{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+ struct hw_perf_event *hwc = &event->hw;
+ union arch_pebs_index cached, index;
+ struct arch_pebs_cap cap;
+ u64 ext = 0;
+
+ if (!x86_pmu.arch_pebs)
+ return;
+
+ cap = hybrid(cpuc->pmu, arch_pebs_cap);
+
+ if (event->attr.precise_ip) {
+ u64 pebs_data_cfg = intel_get_arch_pebs_data_config(event);
+
+ ext |= ARCH_PEBS_EN;
+ if (hwc->flags & PERF_X86_EVENT_AUTO_RELOAD)
+ ext |= (-hwc->sample_period) & ARCH_PEBS_RELOAD;
+
+ if (pebs_data_cfg && cap.caps) {
+ if (pebs_data_cfg & PEBS_DATACFG_MEMINFO)
+ ext |= ARCH_PEBS_AUX & cap.caps;
+
+ if (pebs_data_cfg & PEBS_DATACFG_GP)
+ ext |= ARCH_PEBS_GPR & cap.caps;
+
+ if (pebs_data_cfg & PEBS_DATACFG_XMMS)
+ ext |= ARCH_PEBS_VECR_XMM & cap.caps;
+
+ if (pebs_data_cfg & PEBS_DATACFG_LBRS)
+ ext |= ARCH_PEBS_LBR & cap.caps;
+ }
+
+ if (cpuc->n_pebs == cpuc->n_large_pebs)
+ index.split.thresh = ARCH_PEBS_THRESH_MUL;
+ else
+ index.split.thresh = ARCH_PEBS_THRESH_SINGLE;
+
+ rdmsrl(MSR_IA32_PEBS_INDEX, cached.full);
+ if (index.split.thresh != cached.split.thresh || !cached.split.en) {
+ if (cached.split.thresh == ARCH_PEBS_THRESH_MUL &&
+ cached.split.wr > 0) {
+ /*
+ * Large PEBS was enabled.
+ * Drain PEBS buffer before applying the single PEBS.
+ */
+ intel_pmu_drain_pebs_buffer();
+ } else {
+ index.split.wr = 0;
+ index.split.full = 0;
+ index.split.en = 1;
+ wrmsrl(MSR_IA32_PEBS_INDEX, index.full);
+ }
+ }
+ }
+
+ if (cpuc->cfg_c_val[hwc->idx] != ext)
+ __intel_pmu_update_event_ext(hwc->idx, ext);
+}
+
static void intel_pmu_enable_event(struct perf_event *event)
{
u64 enable_mask = ARCH_PERFMON_EVENTSEL_ENABLE;
@@ -2956,10 +3053,12 @@ static void intel_pmu_enable_event(struct perf_event *event)
enable_mask |= ARCH_PERFMON_EVENTSEL_BR_CNTR;
intel_set_masks(event, idx);
static_call_cond(intel_pmu_enable_acr_event)(event);
+ intel_pmu_enable_event_ext(event);
__x86_pmu_enable_event(hwc, enable_mask);
break;
case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1:
static_call_cond(intel_pmu_enable_acr_event)(event);
+ intel_pmu_enable_event_ext(event);
fallthrough;
case INTEL_PMC_IDX_METRIC_BASE ... INTEL_PMC_IDX_METRIC_END:
intel_pmu_enable_fixed(event);
@@ -5293,6 +5392,29 @@ static inline bool intel_pmu_broken_perf_cap(void)
return false;
}
+static inline void __intel_update_pmu_caps(struct pmu *pmu)
+{
+ struct pmu *dest_pmu = pmu ? pmu : x86_get_pmu(smp_processor_id());
+
+ if (hybrid(pmu, arch_pebs_cap).caps & ARCH_PEBS_VECR_XMM)
+ dest_pmu->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
+}
+
+static inline void __intel_update_large_pebs_flags(struct pmu *pmu)
+{
+ u64 caps = hybrid(pmu, arch_pebs_cap).caps;
+
+ x86_pmu.large_pebs_flags |= PERF_SAMPLE_TIME;
+ if (caps & ARCH_PEBS_LBR)
+ x86_pmu.large_pebs_flags |= PERF_SAMPLE_BRANCH_STACK;
+
+ if (!(caps & ARCH_PEBS_AUX))
+ x86_pmu.large_pebs_flags &= ~PERF_SAMPLE_DATA_SRC;
+ if (!(caps & ARCH_PEBS_GPR))
+ x86_pmu.large_pebs_flags &=
+ ~(PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER);
+}
+
static void update_pmu_cap(struct pmu *pmu)
{
unsigned int eax, ebx, ecx, edx;
@@ -5333,6 +5455,9 @@ static void update_pmu_cap(struct pmu *pmu)
&eax, &ebx, &ecx, &edx);
hybrid(pmu, arch_pebs_cap).counters = ((u64)ecx << 32) | eax;
hybrid(pmu, arch_pebs_cap).pdists = ((u64)edx << 32) | ebx;
+
+ __intel_update_pmu_caps(pmu);
+ __intel_update_large_pebs_flags(pmu);
} else {
WARN_ON(x86_pmu.arch_pebs == 1);
x86_pmu.arch_pebs = 0;
@@ -5496,6 +5621,8 @@ static void intel_pmu_cpu_starting(int cpu)
}
}
+ __intel_update_pmu_caps(cpuc->pmu);
+
if (!cpuc->shared_regs)
return;
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 757d97c05d8f..6a138435092d 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1512,6 +1512,18 @@ pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc,
}
}
+u64 intel_get_arch_pebs_data_config(struct perf_event *event)
+{
+ u64 pebs_data_cfg = 0;
+
+ if (WARN_ON(event->hw.idx < 0 || event->hw.idx >= X86_PMC_IDX_MAX))
+ return 0;
+
+ pebs_data_cfg |= pebs_update_adaptive_cfg(event);
+
+ return pebs_data_cfg;
+}
+
void intel_pmu_pebs_add(struct perf_event *event)
{
struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -2954,6 +2966,11 @@ static void intel_pmu_drain_arch_pebs(struct pt_regs *iregs,
index.split.wr = 0;
index.split.full = 0;
+ index.split.en = 1;
+ if (cpuc->n_pebs == cpuc->n_large_pebs)
+ index.split.thresh = ARCH_PEBS_THRESH_MUL;
+ else
+ index.split.thresh = ARCH_PEBS_THRESH_SINGLE;
wrmsrl(MSR_IA32_PEBS_INDEX, index.full);
}
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index d93d4c7a9876..c6c2ab34e711 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -296,6 +296,8 @@ struct cpu_hw_events {
/* Intel ACR configuration */
u64 acr_cfg_b[X86_PMC_IDX_MAX];
u64 acr_cfg_c[X86_PMC_IDX_MAX];
+ /* Cached CFG_C values */
+ u64 cfg_c_val[X86_PMC_IDX_MAX];
/*
* Intel LBR bits
@@ -1208,6 +1210,14 @@ static inline unsigned int x86_pmu_fixed_ctr_addr(int index)
x86_pmu.addr_offset(index, false) : index);
}
+static inline unsigned int x86_pmu_cfg_c_addr(int index, bool gp)
+{
+ u32 base = gp ? MSR_IA32_PMC_V6_GP0_CFG_C : MSR_IA32_PMC_V6_FX0_CFG_C;
+
+ return base + (x86_pmu.addr_offset ? x86_pmu.addr_offset(index, false) :
+ index * MSR_IA32_PMC_V6_STEP);
+}
+
static inline int x86_pmu_rdpmc_index(int index)
{
return x86_pmu.rdpmc_index ? x86_pmu.rdpmc_index(index) : index;
@@ -1771,6 +1781,8 @@ void intel_pmu_pebs_data_source_cmt(void);
void intel_pmu_pebs_data_source_lnl(void);
+u64 intel_get_arch_pebs_data_config(struct perf_event *event);
+
int intel_pmu_setup_lbr_filter(struct perf_event *event);
void intel_pt_interrupt(void);
diff --git a/arch/x86/include/asm/intel_ds.h b/arch/x86/include/asm/intel_ds.h
index 023c2883f9f3..7bb80c993bef 100644
--- a/arch/x86/include/asm/intel_ds.h
+++ b/arch/x86/include/asm/intel_ds.h
@@ -7,6 +7,13 @@
#define PEBS_BUFFER_SHIFT 4
#define PEBS_BUFFER_SIZE (PAGE_SIZE << PEBS_BUFFER_SHIFT)
+/*
+ * The largest PEBS record could consume a page; ensure that at
+ * least one record can still be written after the PMI is triggered.
+ */
+#define ARCH_PEBS_THRESH_MUL ((PEBS_BUFFER_SIZE - PAGE_SIZE) >> PEBS_BUFFER_SHIFT)
+#define ARCH_PEBS_THRESH_SINGLE 1
+
/* The maximal number of PEBS events: */
#define MAX_PEBS_EVENTS_FMT4 8
#define MAX_PEBS_EVENTS 32
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index d77048df8e72..ea4f100dbd3c 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -320,6 +320,14 @@
#define ARCH_PEBS_OFFSET_MASK 0x7fffff
#define ARCH_PEBS_INDEX_WR_SHIFT 4
+#define ARCH_PEBS_RELOAD 0xffffffff
+#define ARCH_PEBS_LBR_SHIFT 40
+#define ARCH_PEBS_LBR (0x3ull << ARCH_PEBS_LBR_SHIFT)
+#define ARCH_PEBS_VECR_XMM BIT_ULL(49)
+#define ARCH_PEBS_GPR BIT_ULL(61)
+#define ARCH_PEBS_AUX BIT_ULL(62)
+#define ARCH_PEBS_EN BIT_ULL(63)
+
#define MSR_IA32_RTIT_CTL 0x00000570
#define RTIT_CTL_TRACEEN BIT(0)
#define RTIT_CTL_CYCLEACC BIT(1)
--
2.40.1
* [Patch v3 14/22] perf/x86/intel: Add counter group support for arch-PEBS
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (12 preceding siblings ...)
2025-04-15 11:44 ` [Patch v3 13/22] perf/x86/intel: Setup PEBS data configuration and enable legacy groups Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 11:44 ` [Patch v3 15/22] perf/x86/intel: Support SSP register capturing " Dapeng Mi
` (8 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Based on the previous adaptive PEBS counter snapshot support, add
counter group support for architectural PEBS. Since arch-PEBS shares the
same counter group layout with adaptive PEBS, directly reuse the
__setup_pebs_counter_group() helper to process the arch-PEBS counter group.
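A minimal sketch of how the size of the counter-group payload is derived
from the new arch_pebs_cntr_header, matching the parsing added to
setup_arch_pebs_sample_data() below; the header values in main() are
hypothetical:

#include <stdint.h>
#include <stdio.h>

#define INTEL_CNTR_METRICS	0x3

struct cntr_header {		/* mirrors struct arch_pebs_cntr_header below */
	uint32_t cntr;		/* bitmap of included GP counters */
	uint32_t fixed;		/* bitmap of included fixed counters */
	uint32_t metrics;	/* INTEL_CNTR_METRICS if metrics are included */
	uint32_t reserved;
};

static unsigned int cntr_payload_qwords(const struct cntr_header *h)
{
	unsigned int nr;

	nr = __builtin_popcount(h->cntr) + __builtin_popcount(h->fixed);
	if (h->metrics == INTEL_CNTR_METRICS)
		nr += 2;	/* two extra values when metrics are included */
	return nr;
}

int main(void)
{
	struct cntr_header h = { .cntr = 0x5, .fixed = 0x1, .metrics = 0 };

	printf("%u counter values follow the header\n",
	       cntr_payload_qwords(&h));
	return 0;
}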
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/intel/core.c | 38 ++++++++++++++++++++++++++++---
arch/x86/events/intel/ds.c | 29 ++++++++++++++++++++---
arch/x86/include/asm/msr-index.h | 6 +++++
arch/x86/include/asm/perf_event.h | 13 ++++++++---
4 files changed, 77 insertions(+), 9 deletions(-)
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index ae7f5dfee041..d543ed052743 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3009,6 +3009,17 @@ static void intel_pmu_enable_event_ext(struct perf_event *event)
if (pebs_data_cfg & PEBS_DATACFG_LBRS)
ext |= ARCH_PEBS_LBR & cap.caps;
+
+ if (pebs_data_cfg &
+ (PEBS_DATACFG_CNTR_MASK << PEBS_DATACFG_CNTR_SHIFT))
+ ext |= ARCH_PEBS_CNTR_GP & cap.caps;
+
+ if (pebs_data_cfg &
+ (PEBS_DATACFG_FIX_MASK << PEBS_DATACFG_FIX_SHIFT))
+ ext |= ARCH_PEBS_CNTR_FIXED & cap.caps;
+
+ if (pebs_data_cfg & PEBS_DATACFG_METRICS)
+ ext |= ARCH_PEBS_CNTR_METRICS & cap.caps;
}
if (cpuc->n_pebs == cpuc->n_large_pebs)
@@ -3034,6 +3045,9 @@ static void intel_pmu_enable_event_ext(struct perf_event *event)
}
}
+ if (is_pebs_counter_event_group(event))
+ ext |= ARCH_PEBS_CNTR_ALLOW;
+
if (cpuc->cfg_c_val[hwc->idx] != ext)
__intel_pmu_update_event_ext(hwc->idx, ext);
}
@@ -4318,6 +4332,20 @@ static bool intel_pmu_is_acr_group(struct perf_event *event)
return false;
}
+static inline bool intel_pmu_has_pebs_counter_group(struct pmu *pmu)
+{
+ u64 caps;
+
+ if (x86_pmu.intel_cap.pebs_format >= 6 && x86_pmu.intel_cap.pebs_baseline)
+ return true;
+
+ caps = hybrid(pmu, arch_pebs_cap).caps;
+ if (x86_pmu.arch_pebs && (caps & ARCH_PEBS_CNTR_MASK))
+ return true;
+
+ return false;
+}
+
static inline void intel_pmu_set_acr_cntr_constr(struct perf_event *event,
u64 *cause_mask, int *num)
{
@@ -4464,8 +4492,7 @@ static int intel_pmu_hw_config(struct perf_event *event)
}
if ((event->attr.sample_type & PERF_SAMPLE_READ) &&
- (x86_pmu.intel_cap.pebs_format >= 6) &&
- x86_pmu.intel_cap.pebs_baseline &&
+ intel_pmu_has_pebs_counter_group(event->pmu) &&
is_sampling_event(event) &&
event->attr.precise_ip)
event->group_leader->hw.flags |= PERF_X86_EVENT_PEBS_CNTR;
@@ -5407,6 +5434,8 @@ static inline void __intel_update_large_pebs_flags(struct pmu *pmu)
x86_pmu.large_pebs_flags |= PERF_SAMPLE_TIME;
if (caps & ARCH_PEBS_LBR)
x86_pmu.large_pebs_flags |= PERF_SAMPLE_BRANCH_STACK;
+ if (caps & ARCH_PEBS_CNTR_MASK)
+ x86_pmu.large_pebs_flags |= PERF_SAMPLE_READ;
if (!(caps & ARCH_PEBS_AUX))
x86_pmu.large_pebs_flags &= ~PERF_SAMPLE_DATA_SRC;
@@ -7108,8 +7137,11 @@ __init int intel_pmu_init(void)
* Many features on and after V6 require dynamic constraint,
* e.g., Arch PEBS, ACR.
*/
- if (version >= 6)
+ if (version >= 6) {
x86_pmu.flags |= PMU_FL_DYN_CONSTRAINT;
+ x86_pmu.late_setup = intel_pmu_late_setup;
+ }
+
/*
* Install the hw-cache-events table:
*/
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 6a138435092d..19b51b4d0d94 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1514,13 +1514,20 @@ pebs_update_state(bool needed_cb, struct cpu_hw_events *cpuc,
u64 intel_get_arch_pebs_data_config(struct perf_event *event)
{
+ struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
u64 pebs_data_cfg = 0;
+ u64 cntr_mask;
if (WARN_ON(event->hw.idx < 0 || event->hw.idx >= X86_PMC_IDX_MAX))
return 0;
pebs_data_cfg |= pebs_update_adaptive_cfg(event);
+ cntr_mask = (PEBS_DATACFG_CNTR_MASK << PEBS_DATACFG_CNTR_SHIFT) |
+ (PEBS_DATACFG_FIX_MASK << PEBS_DATACFG_FIX_SHIFT) |
+ PEBS_DATACFG_CNTR | PEBS_DATACFG_METRICS;
+ pebs_data_cfg |= cpuc->pebs_data_cfg & cntr_mask;
+
return pebs_data_cfg;
}
@@ -2428,6 +2435,24 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
}
}
+ if (header->cntr) {
+ struct arch_pebs_cntr_header *cntr = next_record;
+ unsigned int nr;
+
+ next_record += sizeof(struct arch_pebs_cntr_header);
+
+ if (is_pebs_counter_event_group(event)) {
+ __setup_pebs_counter_group(cpuc, event,
+ (struct pebs_cntr_header *)cntr, next_record);
+ data->sample_flags |= PERF_SAMPLE_READ;
+ }
+
+ nr = hweight32(cntr->cntr) + hweight32(cntr->fixed);
+ if (cntr->metrics == INTEL_CNTR_METRICS)
+ nr += 2;
+ next_record += nr * sizeof(u64);
+ }
+
	/* Parse the following fragments, if any. */
if (arch_pebs_record_continued(header)) {
at = at + header->size;
@@ -3057,10 +3082,8 @@ static void __init intel_ds_pebs_init(void)
break;
case 6:
- if (x86_pmu.intel_cap.pebs_baseline) {
+ if (x86_pmu.intel_cap.pebs_baseline)
x86_pmu.large_pebs_flags |= PERF_SAMPLE_READ;
- x86_pmu.late_setup = intel_pmu_late_setup;
- }
fallthrough;
case 5:
x86_pmu.pebs_ept = 1;
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index ea4f100dbd3c..c971ac09d881 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -321,12 +321,18 @@
#define ARCH_PEBS_INDEX_WR_SHIFT 4
#define ARCH_PEBS_RELOAD 0xffffffff
+#define ARCH_PEBS_CNTR_ALLOW BIT_ULL(35)
+#define ARCH_PEBS_CNTR_GP BIT_ULL(36)
+#define ARCH_PEBS_CNTR_FIXED BIT_ULL(37)
+#define ARCH_PEBS_CNTR_METRICS BIT_ULL(38)
#define ARCH_PEBS_LBR_SHIFT 40
#define ARCH_PEBS_LBR (0x3ull << ARCH_PEBS_LBR_SHIFT)
#define ARCH_PEBS_VECR_XMM BIT_ULL(49)
#define ARCH_PEBS_GPR BIT_ULL(61)
#define ARCH_PEBS_AUX BIT_ULL(62)
#define ARCH_PEBS_EN BIT_ULL(63)
+#define ARCH_PEBS_CNTR_MASK (ARCH_PEBS_CNTR_GP | ARCH_PEBS_CNTR_FIXED | \
+ ARCH_PEBS_CNTR_METRICS)
#define MSR_IA32_RTIT_CTL 0x00000570
#define RTIT_CTL_TRACEEN BIT(0)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 7f9d8e6577f0..4e5adbc7baea 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -137,16 +137,16 @@
#define ARCH_PERFMON_EVENTS_COUNT 7
#define PEBS_DATACFG_MEMINFO BIT_ULL(0)
-#define PEBS_DATACFG_GP BIT_ULL(1)
+#define PEBS_DATACFG_GP BIT_ULL(1)
#define PEBS_DATACFG_XMMS BIT_ULL(2)
#define PEBS_DATACFG_LBRS BIT_ULL(3)
-#define PEBS_DATACFG_LBR_SHIFT 24
#define PEBS_DATACFG_CNTR BIT_ULL(4)
+#define PEBS_DATACFG_METRICS BIT_ULL(5)
+#define PEBS_DATACFG_LBR_SHIFT 24
#define PEBS_DATACFG_CNTR_SHIFT 32
#define PEBS_DATACFG_CNTR_MASK GENMASK_ULL(15, 0)
#define PEBS_DATACFG_FIX_SHIFT 48
#define PEBS_DATACFG_FIX_MASK GENMASK_ULL(7, 0)
-#define PEBS_DATACFG_METRICS BIT_ULL(5)
/* Steal the highest bit of pebs_data_cfg for SW usage */
#define PEBS_UPDATE_DS_SW BIT_ULL(63)
@@ -603,6 +603,13 @@ struct arch_pebs_lbr_header {
u64 ler_info;
};
+struct arch_pebs_cntr_header {
+ u32 cntr;
+ u32 fixed;
+ u32 metrics;
+ u32 reserved;
+};
+
/*
* AMD Extended Performance Monitoring and Debug cpuid feature detection
*/
--
2.40.1
* [Patch v3 15/22] perf/x86/intel: Support SSP register capturing for arch-PEBS
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (13 preceding siblings ...)
2025-04-15 11:44 ` [Patch v3 14/22] perf/x86/intel: Add counter group support for arch-PEBS Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 14:07 ` Peter Zijlstra
2025-04-15 11:44 ` [Patch v3 16/22] perf/core: Support to capture higher width vector registers Dapeng Mi
` (7 subsequent siblings)
22 siblings, 1 reply; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Arch-PEBS supports capturing the shadow stack pointer (SSP) register in
the GPR group. This patch supports capturing and outputting the SSP
register at interrupt time or in user space. Capturing SSP in user space
requires the 'exclude_kernel' attribute to be set, which avoids
unintentionally capturing the kernel-space SSP register.
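As a usage illustration, a hedged user-space sketch of requesting SSP
samples via perf_event_open(). The PERF_REG_X86_SSP value assumes the enum
ordering added by this patch (index 24, right after R15); the event choice
and sample period are arbitrary examples:

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef PERF_REG_X86_SSP
#define PERF_REG_X86_SSP	24	/* follows PERF_REG_X86_R15 (23) */
#endif

static int open_ssp_sampling_event(void)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;
	attr.sample_period = 10000;
	attr.precise_ip = 1;		/* SSP capture requires a PEBS event */
	attr.exclude_kernel = 1;	/* required when sampling SSP in user space */
	attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_REGS_USER;
	attr.sample_regs_user = 1ULL << PERF_REG_X86_SSP;

	return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
}

Requesting SSP in sample_regs_user without exclude_kernel, or on a
non-arch-PEBS event, is rejected by the x86_pmu_hw_config() checks added
below.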
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/core.c | 15 +++++++++++++++
arch/x86/events/intel/core.c | 3 ++-
arch/x86/events/intel/ds.c | 9 +++++++--
arch/x86/events/perf_event.h | 4 ++++
arch/x86/include/asm/perf_event.h | 1 +
arch/x86/include/uapi/asm/perf_regs.h | 4 +++-
arch/x86/kernel/perf_regs.c | 7 +++++++
7 files changed, 39 insertions(+), 4 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 9c205a8a4fa6..0ccbe8385c7f 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -650,6 +650,21 @@ int x86_pmu_hw_config(struct perf_event *event)
return -EINVAL;
}
+ if (unlikely(event->attr.sample_regs_user & BIT_ULL(PERF_REG_X86_SSP))) {
+		/* Only arch-PEBS supports capturing the SSP register. */
+ if (!x86_pmu.arch_pebs || !event->attr.precise_ip)
+ return -EINVAL;
+ /* Only user space is allowed to capture. */
+ if (!event->attr.exclude_kernel)
+ return -EINVAL;
+ }
+
+ if (unlikely(event->attr.sample_regs_intr & BIT_ULL(PERF_REG_X86_SSP))) {
+		/* Only arch-PEBS supports capturing the SSP register. */
+ if (!x86_pmu.arch_pebs || !event->attr.precise_ip)
+ return -EINVAL;
+ }
+
/* sample_regs_user never support XMM registers */
if (unlikely(event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK))
return -EINVAL;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index d543ed052743..b6416535f84d 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4151,12 +4151,13 @@ static void intel_pebs_aliases_skl(struct perf_event *event)
static unsigned long intel_pmu_large_pebs_flags(struct perf_event *event)
{
unsigned long flags = x86_pmu.large_pebs_flags;
+ u64 gprs_mask = x86_pmu.arch_pebs ? ARCH_PEBS_GP_REGS : PEBS_GP_REGS;
if (event->attr.use_clockid)
flags &= ~PERF_SAMPLE_TIME;
if (!event->attr.exclude_kernel)
flags &= ~PERF_SAMPLE_REGS_USER;
- if (event->attr.sample_regs_user & ~PEBS_GP_REGS)
+ if (event->attr.sample_regs_user & ~gprs_mask)
flags &= ~(PERF_SAMPLE_REGS_USER | PERF_SAMPLE_REGS_INTR);
return flags;
}
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 19b51b4d0d94..91a093cba11f 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1431,6 +1431,7 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
u64 sample_type = attr->sample_type;
u64 pebs_data_cfg = 0;
bool gprs, tsx_weight;
+ u64 gprs_mask;
if (!(sample_type & ~(PERF_SAMPLE_IP|PERF_SAMPLE_TIME)) &&
attr->precise_ip > 1)
@@ -1445,10 +1446,11 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
* + precise_ip < 2 for the non event IP
* + For RTM TSX weight we need GPRs for the abort code.
*/
+ gprs_mask = x86_pmu.arch_pebs ? ARCH_PEBS_GP_REGS : PEBS_GP_REGS;
gprs = ((sample_type & PERF_SAMPLE_REGS_INTR) &&
- (attr->sample_regs_intr & PEBS_GP_REGS)) ||
+ (attr->sample_regs_intr & gprs_mask)) ||
((sample_type & PERF_SAMPLE_REGS_USER) &&
- (attr->sample_regs_user & PEBS_GP_REGS));
+ (attr->sample_regs_user & gprs_mask));
tsx_weight = (sample_type & PERF_SAMPLE_WEIGHT_TYPE) &&
((attr->config & INTEL_ARCH_EVENT_MASK) ==
@@ -2243,6 +2245,7 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
perf_regs = container_of(regs, struct x86_perf_regs, regs);
perf_regs->xmm_regs = NULL;
+ perf_regs->ssp = 0;
format_group = basic->format_group;
@@ -2359,6 +2362,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
perf_regs = container_of(regs, struct x86_perf_regs, regs);
perf_regs->xmm_regs = NULL;
+ perf_regs->ssp = 0;
__setup_perf_sample_data(event, iregs, data);
@@ -2395,6 +2399,7 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
__setup_pebs_gpr_group(event, regs, (struct pebs_gprs *)gprs,
sample_type);
+ perf_regs->ssp = gprs->ssp;
}
if (header->aux) {
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index c6c2ab34e711..6a8804a75de9 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -175,6 +175,10 @@ struct amd_nb {
(1ULL << PERF_REG_X86_R14) | \
(1ULL << PERF_REG_X86_R15))
+#define ARCH_PEBS_GP_REGS \
+ (PEBS_GP_REGS | \
+ (1ULL << PERF_REG_X86_SSP))
+
/*
* Per register state.
*/
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 4e5adbc7baea..ba382361b13f 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -704,6 +704,7 @@ extern void perf_events_lapic_init(void);
struct pt_regs;
struct x86_perf_regs {
struct pt_regs regs;
+ u64 ssp;
u64 *xmm_regs;
};
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index 7c9d2bb3833b..f9c5b16b1882 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -27,9 +27,11 @@ enum perf_event_x86_regs {
PERF_REG_X86_R13,
PERF_REG_X86_R14,
PERF_REG_X86_R15,
+	/* arch-PEBS supports capturing the shadow stack pointer (SSP) */
+ PERF_REG_X86_SSP,
/* These are the limits for the GPRs. */
PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
- PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
+ PERF_REG_X86_64_MAX = PERF_REG_X86_SSP + 1,
/* These all need two bits set because they are 128bit */
PERF_REG_X86_XMM0 = 32,
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 624703af80a1..985bd616200e 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -54,6 +54,8 @@ static unsigned int pt_regs_offset[PERF_REG_X86_MAX] = {
PT_REGS_OFFSET(PERF_REG_X86_R13, r13),
PT_REGS_OFFSET(PERF_REG_X86_R14, r14),
PT_REGS_OFFSET(PERF_REG_X86_R15, r15),
+ /* The pt_regs struct does not store Shadow stack pointer. */
+ (unsigned int) -1,
#endif
};
@@ -68,6 +70,11 @@ u64 perf_reg_value(struct pt_regs *regs, int idx)
return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
}
+ if (idx == PERF_REG_X86_SSP) {
+ perf_regs = container_of(regs, struct x86_perf_regs, regs);
+ return perf_regs->ssp;
+ }
+
if (WARN_ON_ONCE(idx >= ARRAY_SIZE(pt_regs_offset)))
return 0;
--
2.40.1
* [Patch v3 16/22] perf/core: Support to capture higher width vector registers
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (14 preceding siblings ...)
2025-04-15 11:44 ` [Patch v3 15/22] perf/x86/intel: Support SSP register capturing " Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 14:36 ` Peter Zijlstra
2025-04-15 11:44 ` [Patch v3 17/22] perf/x86/intel: Support arch-PEBS vector registers group capturing Dapeng Mi
` (6 subsequent siblings)
22 siblings, 1 reply; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Arch-PEBS supports capturing more vector registers, such as the
OPMASK/YMM/ZMM registers, besides the XMM registers. This patch extends
the PERF_SAMPLE_REGS_INTR and PERF_SAMPLE_REGS_USER attributes to support
capturing these new vector registers at interrupt time and in user space.
The arrays sample_regs_intr_ext[]/sample_regs_user_ext[] are added to the
perf_event_attr structure to record the user-configured extended register
bitmap, and a helper perf_reg_ext_validate() is added to validate whether
these registers are supported on a specific PMU. Furthermore, to leave
enough room to support more GPRs (such as R16 ~ R31 introduced by APX) in
the future, the array size is directly extended to 7.
This patch only adds the common perf/core support; the x86/intel specific
support will be added in the next patch.
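As an illustration of the new bitmap encoding, a small user-space sketch
that marks one YMM and one OPMASK register in a sample_regs_intr_ext[]-style
bitmap. The register indices and bit widths follow the enum added below; a
register with index r occupies bits starting at (r - PERF_REG_EXTENDED_OFFSET)
in the extended bitmap:

#include <stdint.h>
#include <stdio.h>

#define PERF_REG_EXTENDED_OFFSET	64
#define PERF_EXT_REGS_ARRAY_SIZE	7

/* Register indices as defined by the enum added below. */
#define PERF_REG_X86_YMM0		128	/* 4 bits per YMM register */
#define PERF_REG_X86_OPMASK0		448	/* 1 bit per OPMASK register */

/* Set 'width' consecutive bits for one register in the _ext bitmap. */
static void set_ext_reg(uint64_t *ext, unsigned int reg, unsigned int width)
{
	unsigned int bit = reg - PERF_REG_EXTENDED_OFFSET;
	unsigned int i;

	for (i = 0; i < width; i++)
		ext[(bit + i) / 64] |= 1ULL << ((bit + i) % 64);
}

int main(void)
{
	uint64_t sample_regs_intr_ext[PERF_EXT_REGS_ARRAY_SIZE] = { 0 };
	int i;

	set_ext_reg(sample_regs_intr_ext, PERF_REG_X86_YMM0, 4);
	set_ext_reg(sample_regs_intr_ext, PERF_REG_X86_OPMASK0, 1);

	for (i = 0; i < PERF_EXT_REGS_ARRAY_SIZE; i++)
		printf("ext[%d] = %#llx\n", i,
		       (unsigned long long)sample_regs_intr_ext[i]);
	return 0;
}

With these values, ext[1] ends up as 0xf (YMM0's four bits) and ext[6] has
bit 0 set (OPMASK0).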
Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/arm/kernel/perf_regs.c | 6 ++
arch/arm64/kernel/perf_regs.c | 6 ++
arch/csky/kernel/perf_regs.c | 5 ++
arch/loongarch/kernel/perf_regs.c | 5 ++
arch/mips/kernel/perf_regs.c | 5 ++
arch/powerpc/perf/perf_regs.c | 5 ++
arch/riscv/kernel/perf_regs.c | 5 ++
arch/s390/kernel/perf_regs.c | 5 ++
arch/x86/include/asm/perf_event.h | 4 ++
arch/x86/include/uapi/asm/perf_regs.h | 79 ++++++++++++++++++++-
arch/x86/kernel/perf_regs.c | 64 ++++++++++++++++-
include/linux/perf_event.h | 4 ++
include/linux/perf_regs.h | 10 +++
include/uapi/linux/perf_event.h | 11 +++
kernel/events/core.c | 98 +++++++++++++++++++++++++--
15 files changed, 304 insertions(+), 8 deletions(-)
diff --git a/arch/arm/kernel/perf_regs.c b/arch/arm/kernel/perf_regs.c
index 0529f90395c9..86b2002d0846 100644
--- a/arch/arm/kernel/perf_regs.c
+++ b/arch/arm/kernel/perf_regs.c
@@ -37,3 +37,9 @@ void perf_get_regs_user(struct perf_regs *regs_user,
regs_user->regs = task_pt_regs(current);
regs_user->abi = perf_reg_abi(current);
}
+
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size)
+{
+ return -EINVAL;
+}
+
diff --git a/arch/arm64/kernel/perf_regs.c b/arch/arm64/kernel/perf_regs.c
index b4eece3eb17d..1c91fd3530d5 100644
--- a/arch/arm64/kernel/perf_regs.c
+++ b/arch/arm64/kernel/perf_regs.c
@@ -104,3 +104,9 @@ void perf_get_regs_user(struct perf_regs *regs_user,
regs_user->regs = task_pt_regs(current);
regs_user->abi = perf_reg_abi(current);
}
+
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size)
+{
+ return -EINVAL;
+}
+
diff --git a/arch/csky/kernel/perf_regs.c b/arch/csky/kernel/perf_regs.c
index 09b7f88a2d6a..d2e2af0bf1ad 100644
--- a/arch/csky/kernel/perf_regs.c
+++ b/arch/csky/kernel/perf_regs.c
@@ -26,6 +26,11 @@ int perf_reg_validate(u64 mask)
return 0;
}
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size)
+{
+ return -EINVAL;
+}
+
u64 perf_reg_abi(struct task_struct *task)
{
return PERF_SAMPLE_REGS_ABI_32;
diff --git a/arch/loongarch/kernel/perf_regs.c b/arch/loongarch/kernel/perf_regs.c
index 263ac4ab5af6..e1df67e3fab4 100644
--- a/arch/loongarch/kernel/perf_regs.c
+++ b/arch/loongarch/kernel/perf_regs.c
@@ -34,6 +34,11 @@ int perf_reg_validate(u64 mask)
return 0;
}
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size)
+{
+ return -EINVAL;
+}
+
u64 perf_reg_value(struct pt_regs *regs, int idx)
{
if (WARN_ON_ONCE((u32)idx >= PERF_REG_LOONGARCH_MAX))
diff --git a/arch/mips/kernel/perf_regs.c b/arch/mips/kernel/perf_regs.c
index e686780d1647..bbb5f25b9191 100644
--- a/arch/mips/kernel/perf_regs.c
+++ b/arch/mips/kernel/perf_regs.c
@@ -37,6 +37,11 @@ int perf_reg_validate(u64 mask)
return 0;
}
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size)
+{
+ return -EINVAL;
+}
+
u64 perf_reg_value(struct pt_regs *regs, int idx)
{
long v;
diff --git a/arch/powerpc/perf/perf_regs.c b/arch/powerpc/perf/perf_regs.c
index 350dccb0143c..d919c628aee3 100644
--- a/arch/powerpc/perf/perf_regs.c
+++ b/arch/powerpc/perf/perf_regs.c
@@ -132,6 +132,11 @@ int perf_reg_validate(u64 mask)
return 0;
}
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size)
+{
+ return -EINVAL;
+}
+
u64 perf_reg_abi(struct task_struct *task)
{
if (is_tsk_32bit_task(task))
diff --git a/arch/riscv/kernel/perf_regs.c b/arch/riscv/kernel/perf_regs.c
index fd304a248de6..5beb60544c9a 100644
--- a/arch/riscv/kernel/perf_regs.c
+++ b/arch/riscv/kernel/perf_regs.c
@@ -26,6 +26,11 @@ int perf_reg_validate(u64 mask)
return 0;
}
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size)
+{
+ return -EINVAL;
+}
+
u64 perf_reg_abi(struct task_struct *task)
{
#if __riscv_xlen == 64
diff --git a/arch/s390/kernel/perf_regs.c b/arch/s390/kernel/perf_regs.c
index a6b058ee4a36..9247573229b0 100644
--- a/arch/s390/kernel/perf_regs.c
+++ b/arch/s390/kernel/perf_regs.c
@@ -42,6 +42,11 @@ int perf_reg_validate(u64 mask)
return 0;
}
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size)
+{
+ return -EINVAL;
+}
+
u64 perf_reg_abi(struct task_struct *task)
{
if (test_tsk_thread_flag(task, TIF_31BIT))
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index ba382361b13f..560eb218868c 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -706,6 +706,10 @@ struct x86_perf_regs {
struct pt_regs regs;
u64 ssp;
u64 *xmm_regs;
+ u64 *opmask_regs;
+ u64 *ymmh_regs;
+ u64 *zmmh_regs;
+ u64 *h16zmm_regs;
};
extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
index f9c5b16b1882..5e2d9796b2cc 100644
--- a/arch/x86/include/uapi/asm/perf_regs.h
+++ b/arch/x86/include/uapi/asm/perf_regs.h
@@ -33,7 +33,7 @@ enum perf_event_x86_regs {
PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
PERF_REG_X86_64_MAX = PERF_REG_X86_SSP + 1,
- /* These all need two bits set because they are 128bit */
+ /* These all need two bits set because they are 128 bits */
PERF_REG_X86_XMM0 = 32,
PERF_REG_X86_XMM1 = 34,
PERF_REG_X86_XMM2 = 36,
@@ -53,6 +53,83 @@ enum perf_event_x86_regs {
/* These include both GPRs and XMMX registers */
PERF_REG_X86_XMM_MAX = PERF_REG_X86_XMM15 + 2,
+
+ /* Leave bits[127:64] for other GP registers, like R16 ~ R31.*/
+
+ /*
+	 * Each YMM register needs 4 bits because it is 256 bits wide.
+ * PERF_REG_X86_YMMH0 = 128
+ */
+ PERF_REG_X86_YMM0 = 128,
+ PERF_REG_X86_YMM1 = PERF_REG_X86_YMM0 + 4,
+ PERF_REG_X86_YMM2 = PERF_REG_X86_YMM1 + 4,
+ PERF_REG_X86_YMM3 = PERF_REG_X86_YMM2 + 4,
+ PERF_REG_X86_YMM4 = PERF_REG_X86_YMM3 + 4,
+ PERF_REG_X86_YMM5 = PERF_REG_X86_YMM4 + 4,
+ PERF_REG_X86_YMM6 = PERF_REG_X86_YMM5 + 4,
+ PERF_REG_X86_YMM7 = PERF_REG_X86_YMM6 + 4,
+ PERF_REG_X86_YMM8 = PERF_REG_X86_YMM7 + 4,
+ PERF_REG_X86_YMM9 = PERF_REG_X86_YMM8 + 4,
+ PERF_REG_X86_YMM10 = PERF_REG_X86_YMM9 + 4,
+ PERF_REG_X86_YMM11 = PERF_REG_X86_YMM10 + 4,
+ PERF_REG_X86_YMM12 = PERF_REG_X86_YMM11 + 4,
+ PERF_REG_X86_YMM13 = PERF_REG_X86_YMM12 + 4,
+ PERF_REG_X86_YMM14 = PERF_REG_X86_YMM13 + 4,
+ PERF_REG_X86_YMM15 = PERF_REG_X86_YMM14 + 4,
+ PERF_REG_X86_YMM_MAX = PERF_REG_X86_YMM15 + 4,
+
+ /*
+	 * Each ZMM register needs 8 bits because it is 512 bits wide.
+ * PERF_REG_X86_ZMMH0 = 192
+ */
+ PERF_REG_X86_ZMM0 = PERF_REG_X86_YMM_MAX,
+ PERF_REG_X86_ZMM1 = PERF_REG_X86_ZMM0 + 8,
+ PERF_REG_X86_ZMM2 = PERF_REG_X86_ZMM1 + 8,
+ PERF_REG_X86_ZMM3 = PERF_REG_X86_ZMM2 + 8,
+ PERF_REG_X86_ZMM4 = PERF_REG_X86_ZMM3 + 8,
+ PERF_REG_X86_ZMM5 = PERF_REG_X86_ZMM4 + 8,
+ PERF_REG_X86_ZMM6 = PERF_REG_X86_ZMM5 + 8,
+ PERF_REG_X86_ZMM7 = PERF_REG_X86_ZMM6 + 8,
+ PERF_REG_X86_ZMM8 = PERF_REG_X86_ZMM7 + 8,
+ PERF_REG_X86_ZMM9 = PERF_REG_X86_ZMM8 + 8,
+ PERF_REG_X86_ZMM10 = PERF_REG_X86_ZMM9 + 8,
+ PERF_REG_X86_ZMM11 = PERF_REG_X86_ZMM10 + 8,
+ PERF_REG_X86_ZMM12 = PERF_REG_X86_ZMM11 + 8,
+ PERF_REG_X86_ZMM13 = PERF_REG_X86_ZMM12 + 8,
+ PERF_REG_X86_ZMM14 = PERF_REG_X86_ZMM13 + 8,
+ PERF_REG_X86_ZMM15 = PERF_REG_X86_ZMM14 + 8,
+ PERF_REG_X86_ZMM16 = PERF_REG_X86_ZMM15 + 8,
+ PERF_REG_X86_ZMM17 = PERF_REG_X86_ZMM16 + 8,
+ PERF_REG_X86_ZMM18 = PERF_REG_X86_ZMM17 + 8,
+ PERF_REG_X86_ZMM19 = PERF_REG_X86_ZMM18 + 8,
+ PERF_REG_X86_ZMM20 = PERF_REG_X86_ZMM19 + 8,
+ PERF_REG_X86_ZMM21 = PERF_REG_X86_ZMM20 + 8,
+ PERF_REG_X86_ZMM22 = PERF_REG_X86_ZMM21 + 8,
+ PERF_REG_X86_ZMM23 = PERF_REG_X86_ZMM22 + 8,
+ PERF_REG_X86_ZMM24 = PERF_REG_X86_ZMM23 + 8,
+ PERF_REG_X86_ZMM25 = PERF_REG_X86_ZMM24 + 8,
+ PERF_REG_X86_ZMM26 = PERF_REG_X86_ZMM25 + 8,
+ PERF_REG_X86_ZMM27 = PERF_REG_X86_ZMM26 + 8,
+ PERF_REG_X86_ZMM28 = PERF_REG_X86_ZMM27 + 8,
+ PERF_REG_X86_ZMM29 = PERF_REG_X86_ZMM28 + 8,
+ PERF_REG_X86_ZMM30 = PERF_REG_X86_ZMM29 + 8,
+ PERF_REG_X86_ZMM31 = PERF_REG_X86_ZMM30 + 8,
+ PERF_REG_X86_ZMM_MAX = PERF_REG_X86_ZMM31 + 8,
+
+ /*
+ * OPMASK Registers
+ * PERF_REG_X86_OPMASK0 = 448
+ */
+ PERF_REG_X86_OPMASK0 = PERF_REG_X86_ZMM_MAX,
+ PERF_REG_X86_OPMASK1 = PERF_REG_X86_OPMASK0 + 1,
+ PERF_REG_X86_OPMASK2 = PERF_REG_X86_OPMASK1 + 1,
+ PERF_REG_X86_OPMASK3 = PERF_REG_X86_OPMASK2 + 1,
+ PERF_REG_X86_OPMASK4 = PERF_REG_X86_OPMASK3 + 1,
+ PERF_REG_X86_OPMASK5 = PERF_REG_X86_OPMASK4 + 1,
+ PERF_REG_X86_OPMASK6 = PERF_REG_X86_OPMASK5 + 1,
+ PERF_REG_X86_OPMASK7 = PERF_REG_X86_OPMASK6 + 1,
+
+ PERF_REG_X86_VEC_MAX = PERF_REG_X86_OPMASK7 + 1,
};
#define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
diff --git a/arch/x86/kernel/perf_regs.c b/arch/x86/kernel/perf_regs.c
index 985bd616200e..466ccd67ea99 100644
--- a/arch/x86/kernel/perf_regs.c
+++ b/arch/x86/kernel/perf_regs.c
@@ -59,12 +59,55 @@ static unsigned int pt_regs_offset[PERF_REG_X86_MAX] = {
#endif
};
+static u64 perf_reg_ext_value(struct pt_regs *regs, int idx)
+{
+ struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
+ u64 data;
+ int mod;
+
+ switch (idx) {
+ case PERF_REG_X86_YMM0 ... PERF_REG_X86_YMM_MAX - 1:
+ idx -= PERF_REG_X86_YMM0;
+ mod = idx % 4;
+ if (mod < 2)
+ data = !perf_regs->xmm_regs ? 0 : perf_regs->xmm_regs[idx / 4 + mod];
+ else
+ data = !perf_regs->ymmh_regs ? 0 : perf_regs->ymmh_regs[idx / 4 + mod - 2];
+ return data;
+ case PERF_REG_X86_ZMM0 ... PERF_REG_X86_ZMM16 - 1:
+ idx -= PERF_REG_X86_ZMM0;
+ mod = idx % 8;
+ if (mod < 4) {
+ if (mod < 2)
+ data = !perf_regs->xmm_regs ? 0 : perf_regs->xmm_regs[idx / 8 + mod];
+ else
+ data = !perf_regs->ymmh_regs ? 0 : perf_regs->ymmh_regs[idx / 8 + mod - 2];
+ } else {
+ data = !perf_regs->zmmh_regs ? 0 : perf_regs->zmmh_regs[idx / 8 + mod - 4];
+ }
+ return data;
+ case PERF_REG_X86_ZMM16 ... PERF_REG_X86_ZMM_MAX - 1:
+ idx -= PERF_REG_X86_ZMM16;
+ return !perf_regs->h16zmm_regs ? 0 : perf_regs->h16zmm_regs[idx];
+ case PERF_REG_X86_OPMASK0 ... PERF_REG_X86_OPMASK7:
+ idx -= PERF_REG_X86_OPMASK0;
+ return !perf_regs->opmask_regs ? 0 : perf_regs->opmask_regs[idx];
+ default:
+ WARN_ON_ONCE(1);
+ break;
+ }
+
+ return 0;
+}
+
u64 perf_reg_value(struct pt_regs *regs, int idx)
{
- struct x86_perf_regs *perf_regs;
+ struct x86_perf_regs *perf_regs = container_of(regs, struct x86_perf_regs, regs);
+
+ if (idx >= PERF_REG_EXTENDED_OFFSET)
+ return perf_reg_ext_value(regs, idx);
if (idx >= PERF_REG_X86_XMM0 && idx < PERF_REG_X86_XMM_MAX) {
- perf_regs = container_of(regs, struct x86_perf_regs, regs);
if (!perf_regs->xmm_regs)
return 0;
return perf_regs->xmm_regs[idx - PERF_REG_X86_XMM0];
@@ -102,6 +145,11 @@ int perf_reg_validate(u64 mask)
return 0;
}
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size)
+{
+ return -EINVAL;
+}
+
u64 perf_reg_abi(struct task_struct *task)
{
return PERF_SAMPLE_REGS_ABI_32;
@@ -127,6 +175,18 @@ int perf_reg_validate(u64 mask)
return 0;
}
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size)
+{
+ if (!mask || !size || size > PERF_NUM_EXT_REGS)
+ return -EINVAL;
+
+ if (find_last_bit(mask, size) >
+ (PERF_REG_X86_VEC_MAX - PERF_REG_EXTENDED_OFFSET))
+ return -EINVAL;
+
+ return 0;
+}
+
u64 perf_reg_abi(struct task_struct *task)
{
if (!user_64bit_mode(task_pt_regs(task)))
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 947ad12dfdbe..5a33c5a0e4e4 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -303,6 +303,7 @@ struct perf_event_pmu_context;
#define PERF_PMU_CAP_AUX_OUTPUT 0x0080
#define PERF_PMU_CAP_EXTENDED_HW_TYPE 0x0100
#define PERF_PMU_CAP_AUX_PAUSE 0x0200
+#define PERF_PMU_CAP_MORE_EXT_REGS 0x0400
/**
* pmu::scope
@@ -1424,6 +1425,9 @@ static inline void perf_clear_branch_entry_bitfields(struct perf_branch_entry *b
br->reserved = 0;
}
+extern bool has_more_extended_intr_regs(struct perf_event *event);
+extern bool has_more_extended_user_regs(struct perf_event *event);
+extern bool has_more_extended_regs(struct perf_event *event);
extern void perf_output_sample(struct perf_output_handle *handle,
struct perf_event_header *header,
struct perf_sample_data *data,
diff --git a/include/linux/perf_regs.h b/include/linux/perf_regs.h
index f632c5725f16..aa4dfb5af552 100644
--- a/include/linux/perf_regs.h
+++ b/include/linux/perf_regs.h
@@ -9,6 +9,8 @@ struct perf_regs {
struct pt_regs *regs;
};
+#define PERF_REG_EXTENDED_OFFSET 64
+
#ifdef CONFIG_HAVE_PERF_REGS
#include <asm/perf_regs.h>
@@ -21,6 +23,8 @@ int perf_reg_validate(u64 mask);
u64 perf_reg_abi(struct task_struct *task);
void perf_get_regs_user(struct perf_regs *regs_user,
struct pt_regs *regs);
+int perf_reg_ext_validate(unsigned long *mask, unsigned int size);
+
#else
#define PERF_REG_EXTENDED_MASK 0
@@ -35,6 +39,12 @@ static inline int perf_reg_validate(u64 mask)
return mask ? -ENOSYS : 0;
}
+static inline int perf_reg_ext_validate(unsigned long *mask,
+ unsigned int size)
+{
+ return -EINVAL;
+}
+
static inline u64 perf_reg_abi(struct task_struct *task)
{
return PERF_SAMPLE_REGS_ABI_NONE;
diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
index 5fc753c23734..78aae0464a54 100644
--- a/include/uapi/linux/perf_event.h
+++ b/include/uapi/linux/perf_event.h
@@ -379,6 +379,10 @@ enum perf_event_read_format {
#define PERF_ATTR_SIZE_VER6 120 /* add: aux_sample_size */
#define PERF_ATTR_SIZE_VER7 128 /* add: sig_data */
#define PERF_ATTR_SIZE_VER8 136 /* add: config3 */
+#define PERF_ATTR_SIZE_VER9 248 /* add: sample_regs_intr_ext[], sample_regs_user_ext[] */
+
+#define PERF_EXT_REGS_ARRAY_SIZE 7
+#define PERF_NUM_EXT_REGS (PERF_EXT_REGS_ARRAY_SIZE * 64)
/*
* Hardware event_id to monitor via a performance monitoring event:
@@ -533,6 +537,13 @@ struct perf_event_attr {
__u64 sig_data;
__u64 config3; /* extension of config2 */
+
+ /*
+ * Extension sets of regs to dump for each sample.
+ * See asm/perf_regs.h for details.
+ */
+ __u64 sample_regs_intr_ext[PERF_EXT_REGS_ARRAY_SIZE];
+ __u64 sample_regs_user_ext[PERF_EXT_REGS_ARRAY_SIZE];
};
/*
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 2eb9cd5d86a1..ebf3be1a6e47 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -7345,6 +7345,21 @@ perf_output_sample_regs(struct perf_output_handle *handle,
}
}
+static void
+perf_output_sample_regs_ext(struct perf_output_handle *handle,
+ struct pt_regs *regs,
+ unsigned long *mask,
+ unsigned int size)
+{
+ int bit;
+ u64 val;
+
+ for_each_set_bit(bit, mask, size) {
+ val = perf_reg_value(regs, bit + PERF_REG_EXTENDED_OFFSET);
+ perf_output_put(handle, val);
+ }
+}
+
static void perf_sample_regs_user(struct perf_regs *regs_user,
struct pt_regs *regs)
{
@@ -7773,6 +7788,26 @@ static void perf_output_read(struct perf_output_handle *handle,
perf_output_read_one(handle, event, enabled, running);
}
+inline bool has_more_extended_intr_regs(struct perf_event *event)
+{
+ return !!bitmap_weight(
+ (unsigned long *)event->attr.sample_regs_intr_ext,
+ PERF_NUM_EXT_REGS);
+}
+
+inline bool has_more_extended_user_regs(struct perf_event *event)
+{
+ return !!bitmap_weight(
+ (unsigned long *)event->attr.sample_regs_user_ext,
+ PERF_NUM_EXT_REGS);
+}
+
+inline bool has_more_extended_regs(struct perf_event *event)
+{
+ return has_more_extended_intr_regs(event) ||
+ has_more_extended_user_regs(event);
+}
+
void perf_output_sample(struct perf_output_handle *handle,
struct perf_event_header *header,
struct perf_sample_data *data,
@@ -7898,6 +7933,12 @@ void perf_output_sample(struct perf_output_handle *handle,
perf_output_sample_regs(handle,
data->regs_user.regs,
mask);
+ if (has_more_extended_user_regs(event)) {
+ perf_output_sample_regs_ext(
+ handle, data->regs_user.regs,
+ (unsigned long *)event->attr.sample_regs_user_ext,
+ PERF_NUM_EXT_REGS);
+ }
}
}
@@ -7930,6 +7971,12 @@ void perf_output_sample(struct perf_output_handle *handle,
perf_output_sample_regs(handle,
data->regs_intr.regs,
mask);
+ if (has_more_extended_intr_regs(event)) {
+ perf_output_sample_regs_ext(
+ handle, data->regs_intr.regs,
+ (unsigned long *)event->attr.sample_regs_intr_ext,
+ PERF_NUM_EXT_REGS);
+ }
}
}
@@ -8181,6 +8228,12 @@ void perf_prepare_sample(struct perf_sample_data *data,
if (data->regs_user.regs) {
u64 mask = event->attr.sample_regs_user;
size += hweight64(mask) * sizeof(u64);
+
+ if (has_more_extended_user_regs(event)) {
+ size += bitmap_weight(
+ (unsigned long *)event->attr.sample_regs_user_ext,
+ PERF_NUM_EXT_REGS) * sizeof(u64);
+ }
}
data->dyn_size += size;
@@ -8244,6 +8297,12 @@ void perf_prepare_sample(struct perf_sample_data *data,
u64 mask = event->attr.sample_regs_intr;
size += hweight64(mask) * sizeof(u64);
+
+ if (has_more_extended_intr_regs(event)) {
+ size += bitmap_weight(
+ (unsigned long *)event->attr.sample_regs_intr_ext,
+ PERF_NUM_EXT_REGS) * sizeof(u64);
+ }
}
data->dyn_size += size;
@@ -12496,6 +12555,12 @@ static int perf_try_init_event(struct pmu *pmu, struct perf_event *event)
goto err_destroy;
}
+ if (!(pmu->capabilities & PERF_PMU_CAP_MORE_EXT_REGS) &&
+ has_more_extended_regs(event)) {
+ ret = -EOPNOTSUPP;
+ goto err_destroy;
+ }
+
if (pmu->capabilities & PERF_PMU_CAP_NO_EXCLUDE &&
event_has_any_exclude_flag(event)) {
ret = -EINVAL;
@@ -13028,9 +13093,19 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
}
if (attr->sample_type & PERF_SAMPLE_REGS_USER) {
- ret = perf_reg_validate(attr->sample_regs_user);
- if (ret)
- return ret;
+ if (attr->sample_regs_user != 0) {
+ ret = perf_reg_validate(attr->sample_regs_user);
+ if (ret)
+ return ret;
+ }
+ if (!!bitmap_weight((unsigned long *)attr->sample_regs_user_ext,
+ PERF_NUM_EXT_REGS)) {
+ ret = perf_reg_ext_validate(
+ (unsigned long *)attr->sample_regs_user_ext,
+ PERF_NUM_EXT_REGS);
+ if (ret)
+ return ret;
+ }
}
if (attr->sample_type & PERF_SAMPLE_STACK_USER) {
@@ -13051,8 +13126,21 @@ static int perf_copy_attr(struct perf_event_attr __user *uattr,
if (!attr->sample_max_stack)
attr->sample_max_stack = sysctl_perf_event_max_stack;
- if (attr->sample_type & PERF_SAMPLE_REGS_INTR)
- ret = perf_reg_validate(attr->sample_regs_intr);
+ if (attr->sample_type & PERF_SAMPLE_REGS_INTR) {
+ if (attr->sample_regs_intr != 0) {
+ ret = perf_reg_validate(attr->sample_regs_intr);
+ if (ret)
+ return ret;
+ }
+ if (!!bitmap_weight((unsigned long *)attr->sample_regs_intr_ext,
+ PERF_NUM_EXT_REGS)) {
+ ret = perf_reg_ext_validate(
+ (unsigned long *)attr->sample_regs_intr_ext,
+ PERF_NUM_EXT_REGS);
+ if (ret)
+ return ret;
+ }
+ }
#ifndef CONFIG_CGROUP_PERF
if (attr->sample_type & PERF_SAMPLE_CGROUP)
--
2.40.1
* [Patch v3 17/22] perf/x86/intel: Support arch-PEBS vector registers group capturing
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (15 preceding siblings ...)
2025-04-15 11:44 ` [Patch v3 16/22] perf/core: Support to capture higher width vector registers Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 11:44 ` [Patch v3 18/22] perf tools: Support to show SSP register Dapeng Mi
` (5 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Add x86/Intel specific vector register (VECR) group capturing for
arch-PEBS. Enable the corresponding VECR group bits in the
GPx_CFG_C/FX0_CFG_C MSRs when users configure these vector registers
in the perf_event_attr bitmaps, and parse the VECR groups in the
arch-PEBS record.
Currently vector register capturing is only supported by PEBS based
sampling; the PMU driver returns an error if PMI based sampling tries
to capture these vector registers.
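For reference, the sketch below (an editor's illustration, not part of
this patch; the helper name select_ymm0() is made up) shows how user
space could request YMM0 together with PERF_SAMPLE_REGS_INTR via the
sample_regs_intr_ext selector introduced earlier in this series,
assuming a uapi header with these patches applied:
#include <string.h>
#include <linux/perf_event.h>
/*
 * Request YMM0: it occupies four bits starting at PERF_REG_X86_YMM0 (128);
 * bits at or above PERF_REG_EXTENDED_OFFSET (64) live in sample_regs_intr_ext[].
 */
static void select_ymm0(struct perf_event_attr *attr)
{
	int bit = 128 - 64;	/* PERF_REG_X86_YMM0 - PERF_REG_EXTENDED_OFFSET */
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);		/* >= PERF_ATTR_SIZE_VER9 */
	attr->type = PERF_TYPE_HARDWARE;
	attr->config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;
	attr->sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_REGS_INTR;
	attr->sample_period = 10000;
	attr->precise_ip = 2;			/* PEBS based sampling */
	attr->sample_regs_intr_ext[bit / 64] |= 0xfULL << (bit % 64);
}
The same pattern applies to sample_regs_user_ext for user-space
register sampling.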
Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
arch/x86/events/core.c | 90 +++++++++++++++++++++++++++++-
arch/x86/events/intel/core.c | 15 +++++
arch/x86/events/intel/ds.c | 93 ++++++++++++++++++++++++++++---
arch/x86/include/asm/msr-index.h | 6 ++
arch/x86/include/asm/perf_event.h | 20 +++++++
5 files changed, 214 insertions(+), 10 deletions(-)
diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 0ccbe8385c7f..16f019ff44f1 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -580,6 +580,73 @@ int x86_pmu_max_precise(struct pmu *pmu)
return precise;
}
+static bool has_vec_regs(struct perf_event *event, bool user,
+ int start, int end)
+{
+ int idx = (start - PERF_REG_EXTENDED_OFFSET) / 64;
+ int s = start % 64;
+ int e = end % 64;
+ u64 regs_mask;
+
+ if (user)
+ regs_mask = event->attr.sample_regs_user_ext[idx];
+ else
+ regs_mask = event->attr.sample_regs_intr_ext[idx];
+
+ return regs_mask & GENMASK_ULL(e, s);
+}
+
+static inline bool has_ymm_regs(struct perf_event *event, bool user)
+{
+ return has_vec_regs(event, user, PERF_REG_X86_YMM0, PERF_REG_X86_YMM_MAX - 1);
+}
+
+static inline bool has_zmm_regs(struct perf_event *event, bool user)
+{
+ return has_vec_regs(event, user, PERF_REG_X86_ZMM0, PERF_REG_X86_ZMM8 - 1) ||
+ has_vec_regs(event, user, PERF_REG_X86_ZMM8, PERF_REG_X86_ZMM16 - 1);
+}
+
+static inline bool has_h16zmm_regs(struct perf_event *event, bool user)
+{
+ return has_vec_regs(event, user, PERF_REG_X86_ZMM16, PERF_REG_X86_ZMM24 - 1) ||
+ has_vec_regs(event, user, PERF_REG_X86_ZMM24, PERF_REG_X86_ZMM_MAX - 1);
+}
+
+static inline bool has_opmask_regs(struct perf_event *event, bool user)
+{
+ return has_vec_regs(event, user, PERF_REG_X86_OPMASK0, PERF_REG_X86_OPMASK7);
+}
+
+static bool ext_vec_regs_supported(struct perf_event *event, bool user)
+{
+ u64 caps = hybrid(event->pmu, arch_pebs_cap).caps;
+
+ if (!(event->pmu->capabilities & PERF_PMU_CAP_MORE_EXT_REGS))
+ return false;
+
+ if (has_opmask_regs(event, user) && !(caps & ARCH_PEBS_VECR_OPMASK))
+ return false;
+
+ if (has_ymm_regs(event, user) && !(caps & ARCH_PEBS_VECR_YMMH))
+ return false;
+
+ if (has_zmm_regs(event, user) && !(caps & ARCH_PEBS_VECR_ZMMH))
+ return false;
+
+ if (has_h16zmm_regs(event, user) && !(caps & ARCH_PEBS_VECR_H16ZMM))
+ return false;
+
+ if (!event->attr.precise_ip)
+ return false;
+
+ /* Only user space sampling is allowed for extended vector registers. */
+ if (user && !event->attr.exclude_kernel)
+ return false;
+
+ return true;
+}
+
int x86_pmu_hw_config(struct perf_event *event)
{
if (event->attr.precise_ip) {
@@ -665,9 +732,12 @@ int x86_pmu_hw_config(struct perf_event *event)
return -EINVAL;
}
- /* sample_regs_user never support XMM registers */
- if (unlikely(event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK))
- return -EINVAL;
+ if (unlikely(event->attr.sample_regs_user & PERF_REG_EXTENDED_MASK)) {
+ /* Only user space sampling is allowed for XMM registers. */
+ if (!event->attr.exclude_kernel)
+ return -EINVAL;
+ }
+
/*
* Besides the general purpose registers, XMM registers may
* be collected in PEBS on some platforms, e.g. Icelake
@@ -680,6 +750,20 @@ int x86_pmu_hw_config(struct perf_event *event)
return -EINVAL;
}
+ /*
+ * Architectural PEBS supports to capture more vector registers besides
+ * XMM registers, like YMM, OPMASK and ZMM registers.
+ */
+ if (unlikely(has_more_extended_user_regs(event))) {
+ if (!ext_vec_regs_supported(event, true))
+ return -EINVAL;
+ }
+
+ if (unlikely(has_more_extended_intr_regs(event))) {
+ if (!ext_vec_regs_supported(event, false))
+ return -EINVAL;
+ }
+
return x86_setup_perfctr(event);
}
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index b6416535f84d..9bd77974d83b 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3007,6 +3007,18 @@ static void intel_pmu_enable_event_ext(struct perf_event *event)
if (pebs_data_cfg & PEBS_DATACFG_XMMS)
ext |= ARCH_PEBS_VECR_XMM & cap.caps;
+ if (pebs_data_cfg & PEBS_DATACFG_YMMHS)
+ ext |= ARCH_PEBS_VECR_YMMH & cap.caps;
+
+ if (pebs_data_cfg & PEBS_DATACFG_OPMASKS)
+ ext |= ARCH_PEBS_VECR_OPMASK & cap.caps;
+
+ if (pebs_data_cfg & PEBS_DATACFG_ZMMHS)
+ ext |= ARCH_PEBS_VECR_ZMMH & cap.caps;
+
+ if (pebs_data_cfg & PEBS_DATACFG_H16ZMMS)
+ ext |= ARCH_PEBS_VECR_H16ZMM & cap.caps;
+
if (pebs_data_cfg & PEBS_DATACFG_LBRS)
ext |= ARCH_PEBS_LBR & cap.caps;
@@ -5426,6 +5438,9 @@ static inline void __intel_update_pmu_caps(struct pmu *pmu)
if (hybrid(pmu, arch_pebs_cap).caps & ARCH_PEBS_VECR_XMM)
dest_pmu->capabilities |= PERF_PMU_CAP_EXTENDED_REGS;
+
+ if (hybrid(pmu, arch_pebs_cap).caps & ARCH_PEBS_VECR_EXT)
+ dest_pmu->capabilities |= PERF_PMU_CAP_MORE_EXT_REGS;
}
static inline void __intel_update_large_pebs_flags(struct pmu *pmu)
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 91a093cba11f..26220bfbe885 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1425,6 +1425,34 @@ void intel_pmu_pebs_late_setup(struct cpu_hw_events *cpuc)
PERF_SAMPLE_TRANSACTION | \
PERF_SAMPLE_DATA_PAGE_SIZE)
+static u64 pebs_get_ext_reg_data_cfg(unsigned long *ext_reg)
+{
+ u64 pebs_data_cfg = 0;
+ int bit;
+
+ for_each_set_bit(bit, ext_reg, PERF_NUM_EXT_REGS) {
+ switch (bit + PERF_REG_EXTENDED_OFFSET) {
+ case PERF_REG_X86_OPMASK0 ... PERF_REG_X86_OPMASK7:
+ pebs_data_cfg |= PEBS_DATACFG_OPMASKS;
+ break;
+ case PERF_REG_X86_YMM0 ... PERF_REG_X86_YMM_MAX - 1:
+ pebs_data_cfg |= PEBS_DATACFG_YMMHS | PEBS_DATACFG_XMMS;
+ break;
+ case PERF_REG_X86_ZMM0 ... PERF_REG_X86_ZMM16 - 1:
+ pebs_data_cfg |= PEBS_DATACFG_ZMMHS | PEBS_DATACFG_YMMHS |
+ PEBS_DATACFG_XMMS;
+ break;
+ case PERF_REG_X86_ZMM16 ... PERF_REG_X86_ZMM_MAX - 1:
+ pebs_data_cfg |= PEBS_DATACFG_H16ZMMS;
+ break;
+ default:
+ break;
+ }
+ }
+
+ return pebs_data_cfg;
+}
+
static u64 pebs_update_adaptive_cfg(struct perf_event *event)
{
struct perf_event_attr *attr = &event->attr;
@@ -1459,9 +1487,21 @@ static u64 pebs_update_adaptive_cfg(struct perf_event *event)
if (gprs || (attr->precise_ip < 2) || tsx_weight)
pebs_data_cfg |= PEBS_DATACFG_GP;
- if ((sample_type & PERF_SAMPLE_REGS_INTR) &&
- (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK))
- pebs_data_cfg |= PEBS_DATACFG_XMMS;
+ if (sample_type & PERF_SAMPLE_REGS_INTR) {
+ if (attr->sample_regs_intr & PERF_REG_EXTENDED_MASK)
+ pebs_data_cfg |= PEBS_DATACFG_XMMS;
+
+ pebs_data_cfg |= pebs_get_ext_reg_data_cfg(
+ (unsigned long *)event->attr.sample_regs_intr_ext);
+ }
+
+ if (sample_type & PERF_SAMPLE_REGS_USER) {
+ if (attr->sample_regs_user & PERF_REG_EXTENDED_MASK)
+ pebs_data_cfg |= PEBS_DATACFG_XMMS;
+
+ pebs_data_cfg |= pebs_get_ext_reg_data_cfg(
+ (unsigned long *)event->attr.sample_regs_user_ext);
+ }
if (sample_type & PERF_SAMPLE_BRANCH_STACK) {
/*
@@ -2245,6 +2285,10 @@ static void setup_pebs_adaptive_sample_data(struct perf_event *event,
perf_regs = container_of(regs, struct x86_perf_regs, regs);
perf_regs->xmm_regs = NULL;
+ perf_regs->ymmh_regs = NULL;
+ perf_regs->opmask_regs = NULL;
+ perf_regs->zmmh_regs = NULL;
+ perf_regs->h16zmm_regs = NULL;
perf_regs->ssp = 0;
format_group = basic->format_group;
@@ -2362,6 +2406,10 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
perf_regs = container_of(regs, struct x86_perf_regs, regs);
perf_regs->xmm_regs = NULL;
+ perf_regs->ymmh_regs = NULL;
+ perf_regs->opmask_regs = NULL;
+ perf_regs->zmmh_regs = NULL;
+ perf_regs->h16zmm_regs = NULL;
perf_regs->ssp = 0;
__setup_perf_sample_data(event, iregs, data);
@@ -2412,14 +2460,45 @@ static void setup_arch_pebs_sample_data(struct perf_event *event,
meminfo->tsx_tuning, ax);
}
- if (header->xmm) {
+ if (header->xmm || header->ymmh || header->opmask ||
+ header->zmmh || header->h16zmm) {
struct arch_pebs_xmm *xmm;
+ struct arch_pebs_ymmh *ymmh;
+ struct arch_pebs_zmmh *zmmh;
+ struct arch_pebs_h16zmm *h16zmm;
+ struct arch_pebs_opmask *opmask;
next_record += sizeof(struct arch_pebs_xer_header);
- xmm = next_record;
- perf_regs->xmm_regs = xmm->xmm;
- next_record = xmm + 1;
+ if (header->xmm) {
+ xmm = next_record;
+ perf_regs->xmm_regs = xmm->xmm;
+ next_record = xmm + 1;
+ }
+
+ if (header->ymmh) {
+ ymmh = next_record;
+ perf_regs->ymmh_regs = ymmh->ymmh;
+ next_record = ymmh + 1;
+ }
+
+ if (header->opmask) {
+ opmask = next_record;
+ perf_regs->opmask_regs = opmask->opmask;
+ next_record = opmask + 1;
+ }
+
+ if (header->zmmh) {
+ zmmh = next_record;
+ perf_regs->zmmh_regs = zmmh->zmmh;
+ next_record = zmmh + 1;
+ }
+
+ if (header->h16zmm) {
+ h16zmm = next_record;
+ perf_regs->h16zmm_regs = h16zmm->h16zmm;
+ next_record = h16zmm + 1;
+ }
}
if (header->lbr) {
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index c971ac09d881..93193eb6ff94 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -328,6 +328,12 @@
#define ARCH_PEBS_LBR_SHIFT 40
#define ARCH_PEBS_LBR (0x3ull << ARCH_PEBS_LBR_SHIFT)
#define ARCH_PEBS_VECR_XMM BIT_ULL(49)
+#define ARCH_PEBS_VECR_YMMH BIT_ULL(50)
+#define ARCH_PEBS_VECR_OPMASK BIT_ULL(53)
+#define ARCH_PEBS_VECR_ZMMH BIT_ULL(54)
+#define ARCH_PEBS_VECR_H16ZMM BIT_ULL(55)
+#define ARCH_PEBS_VECR_EXT_SHIFT 50
+#define ARCH_PEBS_VECR_EXT (0x3full << ARCH_PEBS_VECR_EXT_SHIFT)
#define ARCH_PEBS_GPR BIT_ULL(61)
#define ARCH_PEBS_AUX BIT_ULL(62)
#define ARCH_PEBS_EN BIT_ULL(63)
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 560eb218868c..a7b2548bf7b4 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -142,6 +142,10 @@
#define PEBS_DATACFG_LBRS BIT_ULL(3)
#define PEBS_DATACFG_CNTR BIT_ULL(4)
#define PEBS_DATACFG_METRICS BIT_ULL(5)
+#define PEBS_DATACFG_YMMHS BIT_ULL(6)
+#define PEBS_DATACFG_OPMASKS BIT_ULL(7)
+#define PEBS_DATACFG_ZMMHS BIT_ULL(8)
+#define PEBS_DATACFG_H16ZMMS BIT_ULL(9)
#define PEBS_DATACFG_LBR_SHIFT 24
#define PEBS_DATACFG_CNTR_SHIFT 32
#define PEBS_DATACFG_CNTR_MASK GENMASK_ULL(15, 0)
@@ -589,6 +593,22 @@ struct arch_pebs_xmm {
u64 xmm[16*2]; /* two entries for each register */
};
+struct arch_pebs_ymmh {
+ u64 ymmh[16*2]; /* two entries for each register */
+};
+
+struct arch_pebs_opmask {
+ u64 opmask[8];
+};
+
+struct arch_pebs_zmmh {
+ u64 zmmh[16*4]; /* four entries for each register */
+};
+
+struct arch_pebs_h16zmm {
+ u64 h16zmm[16*8]; /* eight entries for each register */
+};
+
#define ARCH_PEBS_LBR_NAN 0x0
#define ARCH_PEBS_LBR_NUM_8 0x1
#define ARCH_PEBS_LBR_NUM_16 0x2
--
2.40.1
* [Patch v3 18/22] perf tools: Support to show SSP register
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (16 preceding siblings ...)
2025-04-15 11:44 ` [Patch v3 17/22] perf/x86/intel: Support arch-PEBS vector registers group capturing Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 11:44 ` [Patch v3 19/22] perf tools: Enhance arch__intr/user_reg_mask() helpers Dapeng Mi
` (4 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Add support for showing the SSP (shadow stack pointer) register in the
perf tools.
Reviewed-by: Ian Rogers <irogers@google.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
tools/arch/x86/include/uapi/asm/perf_regs.h | 7 ++++++-
tools/perf/arch/x86/util/perf_regs.c | 2 ++
tools/perf/util/intel-pt.c | 2 +-
tools/perf/util/perf-regs-arch/perf_regs_x86.c | 2 ++
4 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
index 7c9d2bb3833b..1c7ab5af5cc1 100644
--- a/tools/arch/x86/include/uapi/asm/perf_regs.h
+++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
@@ -27,9 +27,14 @@ enum perf_event_x86_regs {
PERF_REG_X86_R13,
PERF_REG_X86_R14,
PERF_REG_X86_R15,
+ /* arch-PEBS supports to capture shadow stack pointer (SSP) */
+ PERF_REG_X86_SSP,
/* These are the limits for the GPRs. */
PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
- PERF_REG_X86_64_MAX = PERF_REG_X86_R15 + 1,
+ /* PERF_REG_X86_64_MAX used generally, for PEBS, etc. */
+ PERF_REG_X86_64_MAX = PERF_REG_X86_SSP + 1,
+ /* PERF_REG_INTEL_PT_MAX ignores the SSP register. */
+ PERF_REG_INTEL_PT_MAX = PERF_REG_X86_R15 + 1,
/* These all need two bits set because they are 128bit */
PERF_REG_X86_XMM0 = 32,
diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
index 12fd93f04802..9f492568f3b4 100644
--- a/tools/perf/arch/x86/util/perf_regs.c
+++ b/tools/perf/arch/x86/util/perf_regs.c
@@ -36,6 +36,8 @@ static const struct sample_reg sample_reg_masks[] = {
SMPL_REG(R14, PERF_REG_X86_R14),
SMPL_REG(R15, PERF_REG_X86_R15),
#endif
+ SMPL_REG(SSP, PERF_REG_X86_SSP),
+
SMPL_REG2(XMM0, PERF_REG_X86_XMM0),
SMPL_REG2(XMM1, PERF_REG_X86_XMM1),
SMPL_REG2(XMM2, PERF_REG_X86_XMM2),
diff --git a/tools/perf/util/intel-pt.c b/tools/perf/util/intel-pt.c
index 4e8a9b172fbc..ad23973c9075 100644
--- a/tools/perf/util/intel-pt.c
+++ b/tools/perf/util/intel-pt.c
@@ -2179,7 +2179,7 @@ static u64 *intel_pt_add_gp_regs(struct regs_dump *intr_regs, u64 *pos,
u32 bit;
int i;
- for (i = 0, bit = 1; i < PERF_REG_X86_64_MAX; i++, bit <<= 1) {
+ for (i = 0, bit = 1; i < PERF_REG_INTEL_PT_MAX; i++, bit <<= 1) {
/* Get the PEBS gp_regs array index */
int n = pebs_gp_regs[i] - 1;
diff --git a/tools/perf/util/perf-regs-arch/perf_regs_x86.c b/tools/perf/util/perf-regs-arch/perf_regs_x86.c
index 708954a9d35d..c0e95215b577 100644
--- a/tools/perf/util/perf-regs-arch/perf_regs_x86.c
+++ b/tools/perf/util/perf-regs-arch/perf_regs_x86.c
@@ -54,6 +54,8 @@ const char *__perf_reg_name_x86(int id)
return "R14";
case PERF_REG_X86_R15:
return "R15";
+ case PERF_REG_X86_SSP:
+ return "SSP";
#define XMM(x) \
case PERF_REG_X86_XMM ## x: \
--
2.40.1
* [Patch v3 19/22] perf tools: Enhance arch__intr/user_reg_mask() helpers
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (17 preceding siblings ...)
2025-04-15 11:44 ` [Patch v3 18/22] perf tools: Support to show SSP register Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 11:44 ` [Patch v3 20/22] perf tools: Enhance sample_regs_user/intr to capture more registers Dapeng Mi
` (3 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Arch-PEBS supports capturing wider vector registers, like YMM/ZMM
registers, but the "uint64_t" return value of these two helpers is not
enough to represent the newly added registers. Thus enhance the two
helpers to take an "unsigned long" pointer instead, so they can return
more bits through that pointer.
Currently only sample_intr_regs supports the newly added vector
registers, but change arch__user_reg_mask() as well for the sake of
consistency.
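As a minimal sketch of the new calling convention (an editor's
illustration based on the callers touched in this series, not code from
the patch; query_intr_regs() is a made-up name):
#include <stdint.h>
#include "util/perf_regs.h"
/* The helpers now fill caller-provided storage instead of returning a u64. */
static uint64_t query_intr_regs(void)
{
	uint64_t mask = 0;
	arch__intr_reg_mask((unsigned long *)&mask);
	return mask;	/* a later patch widens callers to a full bitmap */
}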
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
tools/perf/arch/arm/util/perf_regs.c | 8 ++++----
tools/perf/arch/arm64/util/perf_regs.c | 11 ++++++-----
tools/perf/arch/csky/util/perf_regs.c | 8 ++++----
tools/perf/arch/loongarch/util/perf_regs.c | 8 ++++----
tools/perf/arch/mips/util/perf_regs.c | 8 ++++----
tools/perf/arch/powerpc/util/perf_regs.c | 17 +++++++++--------
tools/perf/arch/riscv/util/perf_regs.c | 8 ++++----
tools/perf/arch/s390/util/perf_regs.c | 8 ++++----
tools/perf/arch/x86/util/perf_regs.c | 13 +++++++------
tools/perf/util/evsel.c | 6 ++++--
tools/perf/util/parse-regs-options.c | 6 +++---
tools/perf/util/perf_regs.c | 8 ++++----
tools/perf/util/perf_regs.h | 4 ++--
13 files changed, 59 insertions(+), 54 deletions(-)
diff --git a/tools/perf/arch/arm/util/perf_regs.c b/tools/perf/arch/arm/util/perf_regs.c
index f94a0210c7b7..14f18d518c96 100644
--- a/tools/perf/arch/arm/util/perf_regs.c
+++ b/tools/perf/arch/arm/util/perf_regs.c
@@ -6,14 +6,14 @@ static const struct sample_reg sample_reg_masks[] = {
SMPL_REG_END
};
-uint64_t arch__intr_reg_mask(void)
+void arch__intr_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
-uint64_t arch__user_reg_mask(void)
+void arch__user_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
const struct sample_reg *arch__sample_reg_masks(void)
diff --git a/tools/perf/arch/arm64/util/perf_regs.c b/tools/perf/arch/arm64/util/perf_regs.c
index 09308665e28a..9bcf4755290c 100644
--- a/tools/perf/arch/arm64/util/perf_regs.c
+++ b/tools/perf/arch/arm64/util/perf_regs.c
@@ -140,12 +140,12 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
return SDT_ARG_VALID;
}
-uint64_t arch__intr_reg_mask(void)
+void arch__intr_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
-uint64_t arch__user_reg_mask(void)
+void arch__user_reg_mask(unsigned long *mask)
{
struct perf_event_attr attr = {
.type = PERF_TYPE_HARDWARE,
@@ -170,10 +170,11 @@ uint64_t arch__user_reg_mask(void)
fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
if (fd != -1) {
close(fd);
- return attr.sample_regs_user;
+ *(uint64_t *)mask = attr.sample_regs_user;
+ return;
}
}
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
const struct sample_reg *arch__sample_reg_masks(void)
diff --git a/tools/perf/arch/csky/util/perf_regs.c b/tools/perf/arch/csky/util/perf_regs.c
index 6b1665f41180..56c84fc91aff 100644
--- a/tools/perf/arch/csky/util/perf_regs.c
+++ b/tools/perf/arch/csky/util/perf_regs.c
@@ -6,14 +6,14 @@ static const struct sample_reg sample_reg_masks[] = {
SMPL_REG_END
};
-uint64_t arch__intr_reg_mask(void)
+void arch__intr_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
-uint64_t arch__user_reg_mask(void)
+void arch__user_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
const struct sample_reg *arch__sample_reg_masks(void)
diff --git a/tools/perf/arch/loongarch/util/perf_regs.c b/tools/perf/arch/loongarch/util/perf_regs.c
index f94a0210c7b7..14f18d518c96 100644
--- a/tools/perf/arch/loongarch/util/perf_regs.c
+++ b/tools/perf/arch/loongarch/util/perf_regs.c
@@ -6,14 +6,14 @@ static const struct sample_reg sample_reg_masks[] = {
SMPL_REG_END
};
-uint64_t arch__intr_reg_mask(void)
+void arch__intr_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
-uint64_t arch__user_reg_mask(void)
+void arch__user_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
const struct sample_reg *arch__sample_reg_masks(void)
diff --git a/tools/perf/arch/mips/util/perf_regs.c b/tools/perf/arch/mips/util/perf_regs.c
index 6b1665f41180..56c84fc91aff 100644
--- a/tools/perf/arch/mips/util/perf_regs.c
+++ b/tools/perf/arch/mips/util/perf_regs.c
@@ -6,14 +6,14 @@ static const struct sample_reg sample_reg_masks[] = {
SMPL_REG_END
};
-uint64_t arch__intr_reg_mask(void)
+void arch__intr_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
-uint64_t arch__user_reg_mask(void)
+void arch__user_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
const struct sample_reg *arch__sample_reg_masks(void)
diff --git a/tools/perf/arch/powerpc/util/perf_regs.c b/tools/perf/arch/powerpc/util/perf_regs.c
index bd36cfd420a2..e5d042305030 100644
--- a/tools/perf/arch/powerpc/util/perf_regs.c
+++ b/tools/perf/arch/powerpc/util/perf_regs.c
@@ -187,7 +187,7 @@ int arch_sdt_arg_parse_op(char *old_op, char **new_op)
return SDT_ARG_VALID;
}
-uint64_t arch__intr_reg_mask(void)
+void arch__intr_reg_mask(unsigned long *mask)
{
struct perf_event_attr attr = {
.type = PERF_TYPE_HARDWARE,
@@ -199,7 +199,7 @@ uint64_t arch__intr_reg_mask(void)
};
int fd;
u32 version;
- u64 extended_mask = 0, mask = PERF_REGS_MASK;
+ u64 extended_mask = 0;
/*
* Get the PVR value to set the extended
@@ -210,8 +210,10 @@ uint64_t arch__intr_reg_mask(void)
extended_mask = PERF_REG_PMU_MASK_300;
else if ((version == PVR_POWER10) || (version == PVR_POWER11))
extended_mask = PERF_REG_PMU_MASK_31;
- else
- return mask;
+ else {
+ *(u64 *)mask = PERF_REGS_MASK;
+ return;
+ }
attr.sample_regs_intr = extended_mask;
attr.sample_period = 1;
@@ -224,14 +226,13 @@ uint64_t arch__intr_reg_mask(void)
fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
if (fd != -1) {
close(fd);
- mask |= extended_mask;
+ *(u64 *)mask = PERF_REGS_MASK | extended_mask;
}
- return mask;
}
-uint64_t arch__user_reg_mask(void)
+void arch__user_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
const struct sample_reg *arch__sample_reg_masks(void)
diff --git a/tools/perf/arch/riscv/util/perf_regs.c b/tools/perf/arch/riscv/util/perf_regs.c
index 6b1665f41180..56c84fc91aff 100644
--- a/tools/perf/arch/riscv/util/perf_regs.c
+++ b/tools/perf/arch/riscv/util/perf_regs.c
@@ -6,14 +6,14 @@ static const struct sample_reg sample_reg_masks[] = {
SMPL_REG_END
};
-uint64_t arch__intr_reg_mask(void)
+void arch__intr_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
-uint64_t arch__user_reg_mask(void)
+void arch__user_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
const struct sample_reg *arch__sample_reg_masks(void)
diff --git a/tools/perf/arch/s390/util/perf_regs.c b/tools/perf/arch/s390/util/perf_regs.c
index 6b1665f41180..56c84fc91aff 100644
--- a/tools/perf/arch/s390/util/perf_regs.c
+++ b/tools/perf/arch/s390/util/perf_regs.c
@@ -6,14 +6,14 @@ static const struct sample_reg sample_reg_masks[] = {
SMPL_REG_END
};
-uint64_t arch__intr_reg_mask(void)
+void arch__intr_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
-uint64_t arch__user_reg_mask(void)
+void arch__user_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
const struct sample_reg *arch__sample_reg_masks(void)
diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
index 9f492568f3b4..5b163f0a651a 100644
--- a/tools/perf/arch/x86/util/perf_regs.c
+++ b/tools/perf/arch/x86/util/perf_regs.c
@@ -283,7 +283,7 @@ const struct sample_reg *arch__sample_reg_masks(void)
return sample_reg_masks;
}
-uint64_t arch__intr_reg_mask(void)
+void arch__intr_reg_mask(unsigned long *mask)
{
struct perf_event_attr attr = {
.type = PERF_TYPE_HARDWARE,
@@ -295,6 +295,9 @@ uint64_t arch__intr_reg_mask(void)
.exclude_kernel = 1,
};
int fd;
+
+ *(u64 *)mask = PERF_REGS_MASK;
+
/*
* In an unnamed union, init it here to build on older gcc versions
*/
@@ -320,13 +323,11 @@ uint64_t arch__intr_reg_mask(void)
fd = sys_perf_event_open(&attr, 0, -1, -1, 0);
if (fd != -1) {
close(fd);
- return (PERF_REG_EXTENDED_MASK | PERF_REGS_MASK);
+ *(u64 *)mask = PERF_REG_EXTENDED_MASK | PERF_REGS_MASK;
}
-
- return PERF_REGS_MASK;
}
-uint64_t arch__user_reg_mask(void)
+void arch__user_reg_mask(unsigned long *mask)
{
- return PERF_REGS_MASK;
+ *(uint64_t *)mask = PERF_REGS_MASK;
}
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 1974395492d7..6e71187d6a93 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1056,17 +1056,19 @@ static void __evsel__config_callchain(struct evsel *evsel, struct record_opts *o
if (param->record_mode == CALLCHAIN_DWARF) {
if (!function) {
const char *arch = perf_env__arch(evsel__env(evsel));
+ uint64_t mask = 0;
+ arch__user_reg_mask((unsigned long *)&mask);
evsel__set_sample_bit(evsel, REGS_USER);
evsel__set_sample_bit(evsel, STACK_USER);
if (opts->sample_user_regs &&
- DWARF_MINIMAL_REGS(arch) != arch__user_reg_mask()) {
+ DWARF_MINIMAL_REGS(arch) != mask) {
attr->sample_regs_user |= DWARF_MINIMAL_REGS(arch);
pr_warning("WARNING: The use of --call-graph=dwarf may require all the user registers, "
"specifying a subset with --user-regs may render DWARF unwinding unreliable, "
"so the minimal registers set (IP, SP) is explicitly forced.\n");
} else {
- attr->sample_regs_user |= arch__user_reg_mask();
+ attr->sample_regs_user |= mask;
}
attr->sample_stack_user = param->dump_size;
attr->exclude_callchain_user = 1;
diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
index cda1c620968e..3dcd8dc4f81b 100644
--- a/tools/perf/util/parse-regs-options.c
+++ b/tools/perf/util/parse-regs-options.c
@@ -16,7 +16,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
const struct sample_reg *r = NULL;
char *s, *os = NULL, *p;
int ret = -1;
- uint64_t mask;
+ uint64_t mask = 0;
if (unset)
return 0;
@@ -28,9 +28,9 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
return -1;
if (intr)
- mask = arch__intr_reg_mask();
+ arch__intr_reg_mask((unsigned long *)&mask);
else
- mask = arch__user_reg_mask();
+ arch__user_reg_mask((unsigned long *)&mask);
/* str may be NULL in case no arg is passed to -I */
if (str) {
diff --git a/tools/perf/util/perf_regs.c b/tools/perf/util/perf_regs.c
index 44b90bbf2d07..7a96290fd1e6 100644
--- a/tools/perf/util/perf_regs.c
+++ b/tools/perf/util/perf_regs.c
@@ -11,14 +11,14 @@ int __weak arch_sdt_arg_parse_op(char *old_op __maybe_unused,
return SDT_ARG_SKIP;
}
-uint64_t __weak arch__intr_reg_mask(void)
+void __weak arch__intr_reg_mask(unsigned long *mask)
{
- return 0;
+ *(uint64_t *)mask = 0;
}
-uint64_t __weak arch__user_reg_mask(void)
+void __weak arch__user_reg_mask(unsigned long *mask)
{
- return 0;
+ *(uint64_t *)mask = 0;
}
static const struct sample_reg sample_reg_masks[] = {
diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
index f2d0736d65cc..316d280e5cd7 100644
--- a/tools/perf/util/perf_regs.h
+++ b/tools/perf/util/perf_regs.h
@@ -24,8 +24,8 @@ enum {
};
int arch_sdt_arg_parse_op(char *old_op, char **new_op);
-uint64_t arch__intr_reg_mask(void);
-uint64_t arch__user_reg_mask(void);
+void arch__intr_reg_mask(unsigned long *mask);
+void arch__user_reg_mask(unsigned long *mask);
const struct sample_reg *arch__sample_reg_masks(void);
const char *perf_reg_name(int id, const char *arch);
--
2.40.1
* [Patch v3 20/22] perf tools: Enhance sample_regs_user/intr to capture more registers
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (18 preceding siblings ...)
2025-04-15 11:44 ` [Patch v3 19/22] perf tools: Enhance arch__intr/user_reg_mask() helpers Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 11:44 ` [Patch v3 21/22] perf tools: Support to capture more vector registers (x86/Intel) Dapeng Mi
` (2 subsequent siblings)
22 siblings, 0 replies; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Intel architectural PEBS supports capturing more vector registers such
as OPMASK/YMM/ZMM registers besides the already supported XMM registers.
Capturing arch-PEBS vector registers (VECR) in the perf core and the
Intel PMU driver has been supported by previous patches. This patch adds
the perf tools side of the support. In detail, add support for the new
sample_regs_intr_ext/sample_regs_user_ext register selectors in
perf_event_attr. Each of these 56-byte (PERF_EXT_REGS_ARRAY_SIZE * 8)
bitmaps is used to select the new register groups OPMASK, YMMH, ZMMH
and H16ZMM in VECR. Update perf regs to introduce the new registers.
This patch only introduces the generic support; the x86/Intel specific
support is added in the next patch.
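To make the size accounting concrete, here is a simplified sketch (an
editor's illustration mirroring the evsel.c and synthetic-events.c hunks
below; sample_regs_payload() is a made-up name) of how many bytes of
register data a sample now carries:
#include <stddef.h>
#include <stdint.h>
#include <linux/bitmap.h>
#include <linux/bitops.h>
/* The legacy 64-bit mask and the extension bitmap both contribute u64 slots. */
static size_t sample_regs_payload(uint64_t mask, const unsigned long *mask_ext)
{
	size_t sz = hweight64(mask) * sizeof(uint64_t);
	sz += bitmap_weight(mask_ext, PERF_NUM_EXT_REGS) * sizeof(uint64_t);
	return sz;	/* the sample also carries one leading u64 for the ABI */
}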
Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
tools/include/uapi/linux/perf_event.h | 14 +++++++++++++
tools/perf/builtin-script.c | 23 +++++++++++++++-----
tools/perf/util/evsel.c | 30 ++++++++++++++++++++-------
tools/perf/util/parse-regs-options.c | 23 ++++++++++++--------
tools/perf/util/perf_regs.h | 16 +++++++++++++-
tools/perf/util/record.h | 4 ++--
tools/perf/util/sample.h | 6 +++++-
tools/perf/util/session.c | 29 +++++++++++++++-----------
tools/perf/util/synthetic-events.c | 12 +++++++----
9 files changed, 116 insertions(+), 41 deletions(-)
diff --git a/tools/include/uapi/linux/perf_event.h b/tools/include/uapi/linux/perf_event.h
index 0524d541d4e3..f19370f9bd78 100644
--- a/tools/include/uapi/linux/perf_event.h
+++ b/tools/include/uapi/linux/perf_event.h
@@ -379,6 +379,13 @@ enum perf_event_read_format {
#define PERF_ATTR_SIZE_VER6 120 /* add: aux_sample_size */
#define PERF_ATTR_SIZE_VER7 128 /* add: sig_data */
#define PERF_ATTR_SIZE_VER8 136 /* add: config3 */
+#define PERF_ATTR_SIZE_VER9 248 /* add: sample_regs_intr_ext[], sample_regs_user_ext[] */
+
+#define PERF_EXT_REGS_ARRAY_SIZE 7
+#define PERF_NUM_EXT_REGS (PERF_EXT_REGS_ARRAY_SIZE * 64)
+
+#define PERF_SAMPLE_ARRAY_SIZE (PERF_EXT_REGS_ARRAY_SIZE + 1)
+#define PERF_SAMPLE_REGS_NUM ((PERF_SAMPLE_ARRAY_SIZE) * 64)
/*
* Hardware event_id to monitor via a performance monitoring event:
@@ -531,6 +538,13 @@ struct perf_event_attr {
__u64 sig_data;
__u64 config3; /* extension of config2 */
+
+ /*
+ * Extension sets of regs to dump for each sample.
+ * See asm/perf_regs.h for details.
+ */
+ __u64 sample_regs_intr_ext[PERF_EXT_REGS_ARRAY_SIZE];
+ __u64 sample_regs_user_ext[PERF_EXT_REGS_ARRAY_SIZE];
};
/*
diff --git a/tools/perf/builtin-script.c b/tools/perf/builtin-script.c
index 9b16df881af8..c41d9ccdaa9d 100644
--- a/tools/perf/builtin-script.c
+++ b/tools/perf/builtin-script.c
@@ -722,21 +722,32 @@ static int perf_session__check_output_opt(struct perf_session *session)
}
static int perf_sample__fprintf_regs(struct regs_dump *regs, uint64_t mask, const char *arch,
- FILE *fp)
+ unsigned long *mask_ext, FILE *fp)
{
+ unsigned int mask_size = sizeof(mask) * 8;
unsigned i = 0, r;
int printed = 0;
+ u64 val;
if (!regs || !regs->regs)
return 0;
printed += fprintf(fp, " ABI:%" PRIu64 " ", regs->abi);
- for_each_set_bit(r, (unsigned long *) &mask, sizeof(mask) * 8) {
- u64 val = regs->regs[i++];
+ for_each_set_bit(r, (unsigned long *)&mask, mask_size) {
+ val = regs->regs[i++];
printed += fprintf(fp, "%5s:0x%"PRIx64" ", perf_reg_name(r, arch), val);
}
+ if (!mask_ext)
+ return printed;
+
+ for_each_set_bit(r, mask_ext, PERF_NUM_EXT_REGS) {
+ val = regs->regs[i++];
+ printed += fprintf(fp, "%5s:0x%"PRIx64" ",
+ perf_reg_name(r + mask_size, arch), val);
+ }
+
return printed;
}
@@ -797,7 +808,8 @@ static int perf_sample__fprintf_iregs(struct perf_sample *sample,
return 0;
return perf_sample__fprintf_regs(perf_sample__intr_regs(sample),
- attr->sample_regs_intr, arch, fp);
+ attr->sample_regs_intr, arch,
+ (unsigned long *)attr->sample_regs_intr_ext, fp);
}
static int perf_sample__fprintf_uregs(struct perf_sample *sample,
@@ -807,7 +819,8 @@ static int perf_sample__fprintf_uregs(struct perf_sample *sample,
return 0;
return perf_sample__fprintf_regs(perf_sample__user_regs(sample),
- attr->sample_regs_user, arch, fp);
+ attr->sample_regs_user, arch,
+ (unsigned long *)attr->sample_regs_user_ext, fp);
}
static int perf_sample__fprintf_start(struct perf_script *script,
diff --git a/tools/perf/util/evsel.c b/tools/perf/util/evsel.c
index 6e71187d6a93..4e4389e16369 100644
--- a/tools/perf/util/evsel.c
+++ b/tools/perf/util/evsel.c
@@ -1061,7 +1061,7 @@ static void __evsel__config_callchain(struct evsel *evsel, struct record_opts *o
arch__user_reg_mask((unsigned long *)&mask);
evsel__set_sample_bit(evsel, REGS_USER);
evsel__set_sample_bit(evsel, STACK_USER);
- if (opts->sample_user_regs &&
+ if (bitmap_weight(opts->sample_user_regs, PERF_SAMPLE_REGS_NUM) &&
DWARF_MINIMAL_REGS(arch) != mask) {
attr->sample_regs_user |= DWARF_MINIMAL_REGS(arch);
pr_warning("WARNING: The use of --call-graph=dwarf may require all the user registers, "
@@ -1397,15 +1397,19 @@ void evsel__config(struct evsel *evsel, struct record_opts *opts,
if (callchain && callchain->enabled && !evsel->no_aux_samples)
evsel__config_callchain(evsel, opts, callchain);
- if (opts->sample_intr_regs && !evsel->no_aux_samples &&
- !evsel__is_dummy_event(evsel)) {
- attr->sample_regs_intr = opts->sample_intr_regs;
+ if (bitmap_weight(opts->sample_intr_regs, PERF_SAMPLE_REGS_NUM) &&
+ !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
+ attr->sample_regs_intr = opts->sample_intr_regs[0];
+ memcpy(attr->sample_regs_intr_ext, &opts->sample_intr_regs[1],
+ PERF_NUM_EXT_REGS / 8);
evsel__set_sample_bit(evsel, REGS_INTR);
}
- if (opts->sample_user_regs && !evsel->no_aux_samples &&
- !evsel__is_dummy_event(evsel)) {
- attr->sample_regs_user |= opts->sample_user_regs;
+ if (bitmap_weight(opts->sample_user_regs, PERF_SAMPLE_REGS_NUM) &&
+ !evsel->no_aux_samples && !evsel__is_dummy_event(evsel)) {
+ attr->sample_regs_user |= opts->sample_user_regs[0];
+ memcpy(attr->sample_regs_user_ext, &opts->sample_user_regs[1],
+ PERF_NUM_EXT_REGS / 8);
evsel__set_sample_bit(evsel, REGS_USER);
}
@@ -3198,10 +3202,16 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
if (regs->abi) {
u64 mask = evsel->core.attr.sample_regs_user;
+ unsigned long *mask_ext =
+ (unsigned long *)evsel->core.attr.sample_regs_user_ext;
+ u64 *user_regs_mask;
sz = hweight64(mask) * sizeof(u64);
+ sz += bitmap_weight(mask_ext, PERF_NUM_EXT_REGS) * sizeof(u64);
OVERFLOW_CHECK(array, sz, max_size);
regs->mask = mask;
+ user_regs_mask = (u64 *)regs->mask_ext;
+ memcpy(&user_regs_mask[1], mask_ext, PERF_NUM_EXT_REGS / 8);
regs->regs = (u64 *)array;
array = (void *)array + sz;
}
@@ -3255,10 +3265,16 @@ int evsel__parse_sample(struct evsel *evsel, union perf_event *event,
if (regs->abi != PERF_SAMPLE_REGS_ABI_NONE) {
u64 mask = evsel->core.attr.sample_regs_intr;
+ unsigned long *mask_ext =
+ (unsigned long *)evsel->core.attr.sample_regs_intr_ext;
+ u64 *intr_regs_mask;
sz = hweight64(mask) * sizeof(u64);
+ sz += bitmap_weight(mask_ext, PERF_NUM_EXT_REGS) * sizeof(u64);
OVERFLOW_CHECK(array, sz, max_size);
regs->mask = mask;
+ intr_regs_mask = (u64 *)regs->mask_ext;
+ memcpy(&intr_regs_mask[1], mask_ext, PERF_NUM_EXT_REGS / 8);
regs->regs = (u64 *)array;
array = (void *)array + sz;
}
diff --git a/tools/perf/util/parse-regs-options.c b/tools/perf/util/parse-regs-options.c
index 3dcd8dc4f81b..42b176705ccf 100644
--- a/tools/perf/util/parse-regs-options.c
+++ b/tools/perf/util/parse-regs-options.c
@@ -12,11 +12,13 @@
static int
__parse_regs(const struct option *opt, const char *str, int unset, bool intr)
{
+ unsigned int size = PERF_SAMPLE_REGS_NUM;
uint64_t *mode = (uint64_t *)opt->value;
const struct sample_reg *r = NULL;
char *s, *os = NULL, *p;
int ret = -1;
- uint64_t mask = 0;
+ DECLARE_BITMAP(mask, size);
+ DECLARE_BITMAP(mask_tmp, size);
if (unset)
return 0;
@@ -24,13 +26,14 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
/*
* cannot set it twice
*/
- if (*mode)
+ if (bitmap_weight((unsigned long *)mode, size))
return -1;
+ bitmap_zero(mask, size);
if (intr)
- arch__intr_reg_mask((unsigned long *)&mask);
+ arch__intr_reg_mask(mask);
else
- arch__user_reg_mask((unsigned long *)&mask);
+ arch__user_reg_mask(mask);
/* str may be NULL in case no arg is passed to -I */
if (str) {
@@ -47,7 +50,8 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
if (!strcmp(s, "?")) {
fprintf(stderr, "available registers: ");
for (r = arch__sample_reg_masks(); r->name; r++) {
- if (r->mask & mask)
+ bitmap_and(mask_tmp, mask, r->mask_ext, size);
+ if (bitmap_weight(mask_tmp, size))
fprintf(stderr, "%s ", r->name);
}
fputc('\n', stderr);
@@ -55,7 +59,8 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
goto error;
}
for (r = arch__sample_reg_masks(); r->name; r++) {
- if ((r->mask & mask) && !strcasecmp(s, r->name))
+ bitmap_and(mask_tmp, mask, r->mask_ext, size);
+ if (bitmap_weight(mask_tmp, size) && !strcasecmp(s, r->name))
break;
}
if (!r || !r->name) {
@@ -64,7 +69,7 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
goto error;
}
- *mode |= r->mask;
+ bitmap_or((unsigned long *)mode, (unsigned long *)mode, r->mask_ext, size);
if (!p)
break;
@@ -75,8 +80,8 @@ __parse_regs(const struct option *opt, const char *str, int unset, bool intr)
ret = 0;
/* default to all possible regs */
- if (*mode == 0)
- *mode = mask;
+ if (!bitmap_weight((unsigned long *)mode, size))
+ bitmap_or((unsigned long *)mode, (unsigned long *)mode, mask, size);
error:
free(os);
return ret;
diff --git a/tools/perf/util/perf_regs.h b/tools/perf/util/perf_regs.h
index 316d280e5cd7..d60a74623a0f 100644
--- a/tools/perf/util/perf_regs.h
+++ b/tools/perf/util/perf_regs.h
@@ -4,18 +4,32 @@
#include <linux/types.h>
#include <linux/compiler.h>
+#include <linux/bitmap.h>
+#include <linux/perf_event.h>
+#include "util/record.h"
struct regs_dump;
struct sample_reg {
const char *name;
- uint64_t mask;
+ union {
+ uint64_t mask;
+ DECLARE_BITMAP(mask_ext, PERF_SAMPLE_REGS_NUM);
+ };
};
#define SMPL_REG_MASK(b) (1ULL << (b))
#define SMPL_REG(n, b) { .name = #n, .mask = SMPL_REG_MASK(b) }
#define SMPL_REG2_MASK(b) (3ULL << (b))
#define SMPL_REG2(n, b) { .name = #n, .mask = SMPL_REG2_MASK(b) }
+#define SMPL_REG_EXT(n, b) \
+ { .name = #n, .mask_ext[b / __BITS_PER_LONG] = 0x1ULL << (b % __BITS_PER_LONG) }
+#define SMPL_REG2_EXT(n, b) \
+ { .name = #n, .mask_ext[b / __BITS_PER_LONG] = 0x3ULL << (b % __BITS_PER_LONG) }
+#define SMPL_REG4_EXT(n, b) \
+ { .name = #n, .mask_ext[b / __BITS_PER_LONG] = 0xfULL << (b % __BITS_PER_LONG) }
+#define SMPL_REG8_EXT(n, b) \
+ { .name = #n, .mask_ext[b / __BITS_PER_LONG] = 0xffULL << (b % __BITS_PER_LONG) }
#define SMPL_REG_END { .name = NULL }
enum {
diff --git a/tools/perf/util/record.h b/tools/perf/util/record.h
index a6566134e09e..2741bbbc2794 100644
--- a/tools/perf/util/record.h
+++ b/tools/perf/util/record.h
@@ -57,8 +57,8 @@ struct record_opts {
unsigned int auxtrace_mmap_pages;
unsigned int user_freq;
u64 branch_stack;
- u64 sample_intr_regs;
- u64 sample_user_regs;
+ u64 sample_intr_regs[PERF_SAMPLE_ARRAY_SIZE];
+ u64 sample_user_regs[PERF_SAMPLE_ARRAY_SIZE];
u64 default_interval;
u64 user_interval;
size_t auxtrace_snapshot_size;
diff --git a/tools/perf/util/sample.h b/tools/perf/util/sample.h
index 0e96240052e9..82db52aeae4d 100644
--- a/tools/perf/util/sample.h
+++ b/tools/perf/util/sample.h
@@ -4,13 +4,17 @@
#include <linux/perf_event.h>
#include <linux/types.h>
+#include <linux/bitmap.h>
/* number of register is bound by the number of bits in regs_dump::mask (64) */
#define PERF_SAMPLE_REGS_CACHE_SIZE (8 * sizeof(u64))
struct regs_dump {
u64 abi;
- u64 mask;
+ union {
+ u64 mask;
+ DECLARE_BITMAP(mask_ext, PERF_SAMPLE_REGS_NUM);
+ };
u64 *regs;
/* Cached values/mask filled by first register access. */
diff --git a/tools/perf/util/session.c b/tools/perf/util/session.c
index 60fb9997ea0d..54db3f36d962 100644
--- a/tools/perf/util/session.c
+++ b/tools/perf/util/session.c
@@ -910,12 +910,13 @@ static void branch_stack__printf(struct perf_sample *sample,
}
}
-static void regs_dump__printf(u64 mask, u64 *regs, const char *arch)
+static void regs_dump__printf(struct regs_dump *regs, const char *arch)
{
+ unsigned int size = PERF_SAMPLE_REGS_NUM;
unsigned rid, i = 0;
- for_each_set_bit(rid, (unsigned long *) &mask, sizeof(mask) * 8) {
- u64 val = regs[i++];
+ for_each_set_bit(rid, regs->mask_ext, size) {
+ u64 val = regs->regs[i++];
printf(".... %-5s 0x%016" PRIx64 "\n",
perf_reg_name(rid, arch), val);
@@ -936,16 +937,20 @@ static inline const char *regs_dump_abi(struct regs_dump *d)
return regs_abi[d->abi];
}
-static void regs__printf(const char *type, struct regs_dump *regs, const char *arch)
+static void regs__printf(bool intr, struct regs_dump *regs, const char *arch)
{
- u64 mask = regs->mask;
+ u64 *mask = (u64 *)&regs->mask_ext;
- printf("... %s regs: mask 0x%" PRIx64 " ABI %s\n",
- type,
- mask,
- regs_dump_abi(regs));
+ if (intr)
+ printf("... intr regs: mask 0x");
+ else
+ printf("... user regs: mask 0x");
+
+ for (int i = 0; i < PERF_SAMPLE_ARRAY_SIZE; i++)
+ printf("%" PRIx64 "", mask[i]);
+ printf(" ABI %s\n", regs_dump_abi(regs));
- regs_dump__printf(mask, regs->regs, arch);
+ regs_dump__printf(regs, arch);
}
static void regs_user__printf(struct perf_sample *sample, const char *arch)
@@ -958,7 +963,7 @@ static void regs_user__printf(struct perf_sample *sample, const char *arch)
user_regs = perf_sample__user_regs(sample);
if (user_regs->regs)
- regs__printf("user", user_regs, arch);
+ regs__printf(false, user_regs, arch);
}
static void regs_intr__printf(struct perf_sample *sample, const char *arch)
@@ -971,7 +976,7 @@ static void regs_intr__printf(struct perf_sample *sample, const char *arch)
intr_regs = perf_sample__intr_regs(sample);
if (intr_regs->regs)
- regs__printf("intr", intr_regs, arch);
+ regs__printf(true, intr_regs, arch);
}
static void stack_user__printf(struct stack_dump *dump)
diff --git a/tools/perf/util/synthetic-events.c b/tools/perf/util/synthetic-events.c
index 2fc4d0537840..2706b92c9a80 100644
--- a/tools/perf/util/synthetic-events.c
+++ b/tools/perf/util/synthetic-events.c
@@ -1512,7 +1512,8 @@ size_t perf_event__sample_event_size(const struct perf_sample *sample, u64 type,
if (type & PERF_SAMPLE_REGS_USER) {
if (sample->user_regs && sample->user_regs->abi) {
result += sizeof(u64);
- sz = hweight64(sample->user_regs->mask) * sizeof(u64);
+ sz = bitmap_weight(sample->user_regs->mask_ext,
+ PERF_SAMPLE_REGS_NUM) * sizeof(u64);
result += sz;
} else {
result += sizeof(u64);
@@ -1540,7 +1541,8 @@ size_t perf_event__sample_event_size(const struct perf_sample *sample, u64 type,
if (type & PERF_SAMPLE_REGS_INTR) {
if (sample->intr_regs && sample->intr_regs->abi) {
result += sizeof(u64);
- sz = hweight64(sample->intr_regs->mask) * sizeof(u64);
+ sz = bitmap_weight(sample->intr_regs->mask_ext,
+ PERF_SAMPLE_REGS_NUM) * sizeof(u64);
result += sz;
} else {
result += sizeof(u64);
@@ -1711,7 +1713,8 @@ int perf_event__synthesize_sample(union perf_event *event, u64 type, u64 read_fo
if (type & PERF_SAMPLE_REGS_USER) {
if (sample->user_regs && sample->user_regs->abi) {
*array++ = sample->user_regs->abi;
- sz = hweight64(sample->user_regs->mask) * sizeof(u64);
+ sz = bitmap_weight(sample->user_regs->mask_ext,
+ PERF_SAMPLE_REGS_NUM) * sizeof(u64);
memcpy(array, sample->user_regs->regs, sz);
array = (void *)array + sz;
} else {
@@ -1747,7 +1750,8 @@ int perf_event__synthesize_sample(union perf_event *event, u64 type, u64 read_fo
if (type & PERF_SAMPLE_REGS_INTR) {
if (sample->intr_regs && sample->intr_regs->abi) {
*array++ = sample->intr_regs->abi;
- sz = hweight64(sample->intr_regs->mask) * sizeof(u64);
+ sz = bitmap_weight(sample->intr_regs->mask_ext,
+ PERF_SAMPLE_REGS_NUM) * sizeof(u64);
memcpy(array, sample->intr_regs->regs, sz);
array = (void *)array + sz;
} else {
--
2.40.1
* [Patch v3 21/22] perf tools: Support to capture more vector registers (x86/Intel)
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (19 preceding siblings ...)
2025-04-15 11:44 ` [Patch v3 20/22] perf tools: Enhance sample_regs_user/intr to capture more registers Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 11:44 ` [Patch v3 22/22] perf tools/tests: Add vector registers PEBS sampling test Dapeng Mi
2025-04-15 15:21 ` [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Liang, Kan
22 siblings, 0 replies; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
Intel architectural PEBS supports capturing more vector registers, such
as OPMASK/YMM/ZMM registers, besides the already supported XMM registers.
This patch adds the Intel specific perf tools support for capturing
these new vector registers.
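As a quick illustration of what the new table entries encode (an
editor's note, assuming a 64-bit __BITS_PER_LONG), the
SMPL_REG4_EXT(YMM0, PERF_REG_X86_YMM0) entry from the hunk below
effectively expands to:
#include "util/perf_regs.h"
/* YMM0 starts at bit 128, so it marks bits 0-3 of mask_ext word 2. */
static const struct sample_reg ymm0 = {
	.name = "YMM0",
	.mask_ext[128 / 64] = 0xfULL << (128 % 64),
};
OPMASK entries set a single bit and ZMM entries set eight, matching the
per-register widths defined in the perf_regs.h enum.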
Co-developed-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
tools/arch/x86/include/uapi/asm/perf_regs.h | 79 ++++++++++-
tools/perf/arch/x86/util/perf_regs.c | 129 +++++++++++++++++-
.../perf/util/perf-regs-arch/perf_regs_x86.c | 82 +++++++++++
3 files changed, 285 insertions(+), 5 deletions(-)
diff --git a/tools/arch/x86/include/uapi/asm/perf_regs.h b/tools/arch/x86/include/uapi/asm/perf_regs.h
index 1c7ab5af5cc1..c05c6ec127c8 100644
--- a/tools/arch/x86/include/uapi/asm/perf_regs.h
+++ b/tools/arch/x86/include/uapi/asm/perf_regs.h
@@ -36,7 +36,7 @@ enum perf_event_x86_regs {
/* PERF_REG_INTEL_PT_MAX ignores the SSP register. */
PERF_REG_INTEL_PT_MAX = PERF_REG_X86_R15 + 1,
- /* These all need two bits set because they are 128bit */
+ /* These all need two bits set because they are 128 bits */
PERF_REG_X86_XMM0 = 32,
PERF_REG_X86_XMM1 = 34,
PERF_REG_X86_XMM2 = 36,
@@ -56,6 +56,83 @@ enum perf_event_x86_regs {
/* These include both GPRs and XMMX registers */
PERF_REG_X86_XMM_MAX = PERF_REG_X86_XMM15 + 2,
+
+ /* Leave bits[127:64] for other GP registers, like R16 ~ R31.*/
+
+ /*
+ * Each YMM register need 4 bits to represent because they are 256 bits.
+ * PERF_REG_X86_YMMH0 = 128
+ */
+ PERF_REG_X86_YMM0 = 128,
+ PERF_REG_X86_YMM1 = PERF_REG_X86_YMM0 + 4,
+ PERF_REG_X86_YMM2 = PERF_REG_X86_YMM1 + 4,
+ PERF_REG_X86_YMM3 = PERF_REG_X86_YMM2 + 4,
+ PERF_REG_X86_YMM4 = PERF_REG_X86_YMM3 + 4,
+ PERF_REG_X86_YMM5 = PERF_REG_X86_YMM4 + 4,
+ PERF_REG_X86_YMM6 = PERF_REG_X86_YMM5 + 4,
+ PERF_REG_X86_YMM7 = PERF_REG_X86_YMM6 + 4,
+ PERF_REG_X86_YMM8 = PERF_REG_X86_YMM7 + 4,
+ PERF_REG_X86_YMM9 = PERF_REG_X86_YMM8 + 4,
+ PERF_REG_X86_YMM10 = PERF_REG_X86_YMM9 + 4,
+ PERF_REG_X86_YMM11 = PERF_REG_X86_YMM10 + 4,
+ PERF_REG_X86_YMM12 = PERF_REG_X86_YMM11 + 4,
+ PERF_REG_X86_YMM13 = PERF_REG_X86_YMM12 + 4,
+ PERF_REG_X86_YMM14 = PERF_REG_X86_YMM13 + 4,
+ PERF_REG_X86_YMM15 = PERF_REG_X86_YMM14 + 4,
+ PERF_REG_X86_YMM_MAX = PERF_REG_X86_YMM15 + 4,
+
+ /*
+ * Each ZMM register needs 8 bits to represent because they are 512 bits
+ * PERF_REG_X86_ZMMH0 = 192
+ */
+ PERF_REG_X86_ZMM0 = PERF_REG_X86_YMM_MAX,
+ PERF_REG_X86_ZMM1 = PERF_REG_X86_ZMM0 + 8,
+ PERF_REG_X86_ZMM2 = PERF_REG_X86_ZMM1 + 8,
+ PERF_REG_X86_ZMM3 = PERF_REG_X86_ZMM2 + 8,
+ PERF_REG_X86_ZMM4 = PERF_REG_X86_ZMM3 + 8,
+ PERF_REG_X86_ZMM5 = PERF_REG_X86_ZMM4 + 8,
+ PERF_REG_X86_ZMM6 = PERF_REG_X86_ZMM5 + 8,
+ PERF_REG_X86_ZMM7 = PERF_REG_X86_ZMM6 + 8,
+ PERF_REG_X86_ZMM8 = PERF_REG_X86_ZMM7 + 8,
+ PERF_REG_X86_ZMM9 = PERF_REG_X86_ZMM8 + 8,
+ PERF_REG_X86_ZMM10 = PERF_REG_X86_ZMM9 + 8,
+ PERF_REG_X86_ZMM11 = PERF_REG_X86_ZMM10 + 8,
+ PERF_REG_X86_ZMM12 = PERF_REG_X86_ZMM11 + 8,
+ PERF_REG_X86_ZMM13 = PERF_REG_X86_ZMM12 + 8,
+ PERF_REG_X86_ZMM14 = PERF_REG_X86_ZMM13 + 8,
+ PERF_REG_X86_ZMM15 = PERF_REG_X86_ZMM14 + 8,
+ PERF_REG_X86_ZMM16 = PERF_REG_X86_ZMM15 + 8,
+ PERF_REG_X86_ZMM17 = PERF_REG_X86_ZMM16 + 8,
+ PERF_REG_X86_ZMM18 = PERF_REG_X86_ZMM17 + 8,
+ PERF_REG_X86_ZMM19 = PERF_REG_X86_ZMM18 + 8,
+ PERF_REG_X86_ZMM20 = PERF_REG_X86_ZMM19 + 8,
+ PERF_REG_X86_ZMM21 = PERF_REG_X86_ZMM20 + 8,
+ PERF_REG_X86_ZMM22 = PERF_REG_X86_ZMM21 + 8,
+ PERF_REG_X86_ZMM23 = PERF_REG_X86_ZMM22 + 8,
+ PERF_REG_X86_ZMM24 = PERF_REG_X86_ZMM23 + 8,
+ PERF_REG_X86_ZMM25 = PERF_REG_X86_ZMM24 + 8,
+ PERF_REG_X86_ZMM26 = PERF_REG_X86_ZMM25 + 8,
+ PERF_REG_X86_ZMM27 = PERF_REG_X86_ZMM26 + 8,
+ PERF_REG_X86_ZMM28 = PERF_REG_X86_ZMM27 + 8,
+ PERF_REG_X86_ZMM29 = PERF_REG_X86_ZMM28 + 8,
+ PERF_REG_X86_ZMM30 = PERF_REG_X86_ZMM29 + 8,
+ PERF_REG_X86_ZMM31 = PERF_REG_X86_ZMM30 + 8,
+ PERF_REG_X86_ZMM_MAX = PERF_REG_X86_ZMM31 + 8,
+
+ /*
+ * OPMASK Registers
+ * PERF_REG_X86_OPMASK0 = 448
+ */
+ PERF_REG_X86_OPMASK0 = PERF_REG_X86_ZMM_MAX,
+ PERF_REG_X86_OPMASK1 = PERF_REG_X86_OPMASK0 + 1,
+ PERF_REG_X86_OPMASK2 = PERF_REG_X86_OPMASK1 + 1,
+ PERF_REG_X86_OPMASK3 = PERF_REG_X86_OPMASK2 + 1,
+ PERF_REG_X86_OPMASK4 = PERF_REG_X86_OPMASK3 + 1,
+ PERF_REG_X86_OPMASK5 = PERF_REG_X86_OPMASK4 + 1,
+ PERF_REG_X86_OPMASK6 = PERF_REG_X86_OPMASK5 + 1,
+ PERF_REG_X86_OPMASK7 = PERF_REG_X86_OPMASK6 + 1,
+
+ PERF_REG_X86_VEC_MAX = PERF_REG_X86_OPMASK7 + 1,
};
#define PERF_REG_EXTENDED_MASK (~((1ULL << PERF_REG_X86_XMM0) - 1))
diff --git a/tools/perf/arch/x86/util/perf_regs.c b/tools/perf/arch/x86/util/perf_regs.c
index 5b163f0a651a..bade6c64770c 100644
--- a/tools/perf/arch/x86/util/perf_regs.c
+++ b/tools/perf/arch/x86/util/perf_regs.c
@@ -54,6 +54,66 @@ static const struct sample_reg sample_reg_masks[] = {
SMPL_REG2(XMM13, PERF_REG_X86_XMM13),
SMPL_REG2(XMM14, PERF_REG_X86_XMM14),
SMPL_REG2(XMM15, PERF_REG_X86_XMM15),
+
+ SMPL_REG4_EXT(YMM0, PERF_REG_X86_YMM0),
+ SMPL_REG4_EXT(YMM1, PERF_REG_X86_YMM1),
+ SMPL_REG4_EXT(YMM2, PERF_REG_X86_YMM2),
+ SMPL_REG4_EXT(YMM3, PERF_REG_X86_YMM3),
+ SMPL_REG4_EXT(YMM4, PERF_REG_X86_YMM4),
+ SMPL_REG4_EXT(YMM5, PERF_REG_X86_YMM5),
+ SMPL_REG4_EXT(YMM6, PERF_REG_X86_YMM6),
+ SMPL_REG4_EXT(YMM7, PERF_REG_X86_YMM7),
+ SMPL_REG4_EXT(YMM8, PERF_REG_X86_YMM8),
+ SMPL_REG4_EXT(YMM9, PERF_REG_X86_YMM9),
+ SMPL_REG4_EXT(YMM10, PERF_REG_X86_YMM10),
+ SMPL_REG4_EXT(YMM11, PERF_REG_X86_YMM11),
+ SMPL_REG4_EXT(YMM12, PERF_REG_X86_YMM12),
+ SMPL_REG4_EXT(YMM13, PERF_REG_X86_YMM13),
+ SMPL_REG4_EXT(YMM14, PERF_REG_X86_YMM14),
+ SMPL_REG4_EXT(YMM15, PERF_REG_X86_YMM15),
+
+ SMPL_REG8_EXT(ZMM0, PERF_REG_X86_ZMM0),
+ SMPL_REG8_EXT(ZMM1, PERF_REG_X86_ZMM1),
+ SMPL_REG8_EXT(ZMM2, PERF_REG_X86_ZMM2),
+ SMPL_REG8_EXT(ZMM3, PERF_REG_X86_ZMM3),
+ SMPL_REG8_EXT(ZMM4, PERF_REG_X86_ZMM4),
+ SMPL_REG8_EXT(ZMM5, PERF_REG_X86_ZMM5),
+ SMPL_REG8_EXT(ZMM6, PERF_REG_X86_ZMM6),
+ SMPL_REG8_EXT(ZMM7, PERF_REG_X86_ZMM7),
+ SMPL_REG8_EXT(ZMM8, PERF_REG_X86_ZMM8),
+ SMPL_REG8_EXT(ZMM9, PERF_REG_X86_ZMM9),
+ SMPL_REG8_EXT(ZMM10, PERF_REG_X86_ZMM10),
+ SMPL_REG8_EXT(ZMM11, PERF_REG_X86_ZMM11),
+ SMPL_REG8_EXT(ZMM12, PERF_REG_X86_ZMM12),
+ SMPL_REG8_EXT(ZMM13, PERF_REG_X86_ZMM13),
+ SMPL_REG8_EXT(ZMM14, PERF_REG_X86_ZMM14),
+ SMPL_REG8_EXT(ZMM15, PERF_REG_X86_ZMM15),
+ SMPL_REG8_EXT(ZMM16, PERF_REG_X86_ZMM16),
+ SMPL_REG8_EXT(ZMM17, PERF_REG_X86_ZMM17),
+ SMPL_REG8_EXT(ZMM18, PERF_REG_X86_ZMM18),
+ SMPL_REG8_EXT(ZMM19, PERF_REG_X86_ZMM19),
+ SMPL_REG8_EXT(ZMM20, PERF_REG_X86_ZMM20),
+ SMPL_REG8_EXT(ZMM21, PERF_REG_X86_ZMM21),
+ SMPL_REG8_EXT(ZMM22, PERF_REG_X86_ZMM22),
+ SMPL_REG8_EXT(ZMM23, PERF_REG_X86_ZMM23),
+ SMPL_REG8_EXT(ZMM24, PERF_REG_X86_ZMM24),
+ SMPL_REG8_EXT(ZMM25, PERF_REG_X86_ZMM25),
+ SMPL_REG8_EXT(ZMM26, PERF_REG_X86_ZMM26),
+ SMPL_REG8_EXT(ZMM27, PERF_REG_X86_ZMM27),
+ SMPL_REG8_EXT(ZMM28, PERF_REG_X86_ZMM28),
+ SMPL_REG8_EXT(ZMM29, PERF_REG_X86_ZMM29),
+ SMPL_REG8_EXT(ZMM30, PERF_REG_X86_ZMM30),
+ SMPL_REG8_EXT(ZMM31, PERF_REG_X86_ZMM31),
+
+ SMPL_REG_EXT(OPMASK0, PERF_REG_X86_OPMASK0),
+ SMPL_REG_EXT(OPMASK1, PERF_REG_X86_OPMASK1),
+ SMPL_REG_EXT(OPMASK2, PERF_REG_X86_OPMASK2),
+ SMPL_REG_EXT(OPMASK3, PERF_REG_X86_OPMASK3),
+ SMPL_REG_EXT(OPMASK4, PERF_REG_X86_OPMASK4),
+ SMPL_REG_EXT(OPMASK5, PERF_REG_X86_OPMASK5),
+ SMPL_REG_EXT(OPMASK6, PERF_REG_X86_OPMASK6),
+ SMPL_REG_EXT(OPMASK7, PERF_REG_X86_OPMASK7),
+
SMPL_REG_END
};
@@ -283,13 +343,59 @@ const struct sample_reg *arch__sample_reg_masks(void)
return sample_reg_masks;
}
-void arch__intr_reg_mask(unsigned long *mask)
+static void check_ext2_regs_mask(struct perf_event_attr *attr, bool user,
+ int idx, u64 fmask, unsigned long *mask)
+{
+ u64 reg_mask[PERF_SAMPLE_ARRAY_SIZE] = { 0 };
+ int fd;
+
+ if (user) {
+ attr->sample_regs_user = 0;
+ attr->sample_regs_user_ext[idx] = fmask;
+ } else {
+ attr->sample_regs_intr = 0;
+ attr->sample_regs_intr_ext[idx] = fmask;
+ }
+
+ /* reg_mask[0] holds the sample_regs_intr/user regs, so the extended index is offset by 1. */
+ reg_mask[idx + 1] = fmask;
+
+ fd = sys_perf_event_open(attr, 0, -1, -1, 0);
+ if (fd != -1) {
+ close(fd);
+ bitmap_or(mask, mask, (unsigned long *)reg_mask,
+ PERF_SAMPLE_REGS_NUM);
+ }
+}
+
+#define PERF_REG_EXTENDED_YMM_MASK GENMASK_ULL(63, 0)
+#define PERF_REG_EXTENDED_ZMM_MASK GENMASK_ULL(63, 0)
+#define PERF_REG_EXTENDED_OPMASK_MASK GENMASK_ULL(7, 0)
+
+static void get_ext2_regs_mask(struct perf_event_attr *attr, bool user,
+ unsigned long *mask)
+{
+ event_attr_init(attr);
+
+ /* Check YMM regs, bits 128 ~ 191. */
+ check_ext2_regs_mask(attr, user, 1, PERF_REG_EXTENDED_YMM_MASK, mask);
+ /* Check ZMM 0-7 regs, bits 192 ~ 255. */
+ check_ext2_regs_mask(attr, user, 2, PERF_REG_EXTENDED_ZMM_MASK, mask);
+ /* Check ZMM 8-15 regs, bits 256 ~ 319. */
+ check_ext2_regs_mask(attr, user, 3, PERF_REG_EXTENDED_ZMM_MASK, mask);
+ /* Check ZMM 16-23 regs, bits 320 ~ 383. */
+ check_ext2_regs_mask(attr, user, 4, PERF_REG_EXTENDED_ZMM_MASK, mask);
+ /* Check ZMM 24-31 regs, bits 384 ~ 447. */
+ check_ext2_regs_mask(attr, user, 5, PERF_REG_EXTENDED_ZMM_MASK, mask);
+ /* Check OPMASK regs, bits 448 ~ 455. */
+ check_ext2_regs_mask(attr, user, 6, PERF_REG_EXTENDED_OPMASK_MASK, mask);
+}
+
+static void arch__get_reg_mask(unsigned long *mask, bool user)
{
struct perf_event_attr attr = {
.type = PERF_TYPE_HARDWARE,
.config = PERF_COUNT_HW_CPU_CYCLES,
- .sample_type = PERF_SAMPLE_REGS_INTR,
- .sample_regs_intr = PERF_REG_EXTENDED_MASK,
.precise_ip = 1,
.disabled = 1,
.exclude_kernel = 1,
@@ -298,6 +404,14 @@ void arch__intr_reg_mask(unsigned long *mask)
*(u64 *)mask = PERF_REGS_MASK;
+ if (user) {
+ attr.sample_type = PERF_SAMPLE_REGS_USER;
+ attr.sample_regs_user = PERF_REG_EXTENDED_MASK;
+ } else {
+ attr.sample_type = PERF_SAMPLE_REGS_INTR;
+ attr.sample_regs_intr = PERF_REG_EXTENDED_MASK;
+ }
+
/*
* In an unnamed union, init it here to build on older gcc versions
*/
@@ -325,9 +439,16 @@ void arch__intr_reg_mask(unsigned long *mask)
close(fd);
*(u64 *)mask = PERF_REG_EXTENDED_MASK | PERF_REGS_MASK;
}
+
+ get_ext2_regs_mask(&attr, user, mask);
+}
+
+void arch__intr_reg_mask(unsigned long *mask)
+{
+ arch__get_reg_mask(mask, false);
}
void arch__user_reg_mask(unsigned long *mask)
{
- *(uint64_t *)mask = PERF_REGS_MASK;
+ arch__get_reg_mask(mask, true);
}
diff --git a/tools/perf/util/perf-regs-arch/perf_regs_x86.c b/tools/perf/util/perf-regs-arch/perf_regs_x86.c
index c0e95215b577..eb1e3d716f27 100644
--- a/tools/perf/util/perf-regs-arch/perf_regs_x86.c
+++ b/tools/perf/util/perf-regs-arch/perf_regs_x86.c
@@ -78,6 +78,88 @@ const char *__perf_reg_name_x86(int id)
XMM(14)
XMM(15)
#undef XMM
+
+#define YMM(x) \
+ case PERF_REG_X86_YMM ## x: \
+ case PERF_REG_X86_YMM ## x + 1: \
+ case PERF_REG_X86_YMM ## x + 2: \
+ case PERF_REG_X86_YMM ## x + 3: \
+ return "YMM" #x;
+ YMM(0)
+ YMM(1)
+ YMM(2)
+ YMM(3)
+ YMM(4)
+ YMM(5)
+ YMM(6)
+ YMM(7)
+ YMM(8)
+ YMM(9)
+ YMM(10)
+ YMM(11)
+ YMM(12)
+ YMM(13)
+ YMM(14)
+ YMM(15)
+#undef YMM
+
+#define ZMM(x) \
+ case PERF_REG_X86_ZMM ## x: \
+ case PERF_REG_X86_ZMM ## x + 1: \
+ case PERF_REG_X86_ZMM ## x + 2: \
+ case PERF_REG_X86_ZMM ## x + 3: \
+ case PERF_REG_X86_ZMM ## x + 4: \
+ case PERF_REG_X86_ZMM ## x + 5: \
+ case PERF_REG_X86_ZMM ## x + 6: \
+ case PERF_REG_X86_ZMM ## x + 7: \
+ return "ZMM" #x;
+ ZMM(0)
+ ZMM(1)
+ ZMM(2)
+ ZMM(3)
+ ZMM(4)
+ ZMM(5)
+ ZMM(6)
+ ZMM(7)
+ ZMM(8)
+ ZMM(9)
+ ZMM(10)
+ ZMM(11)
+ ZMM(12)
+ ZMM(13)
+ ZMM(14)
+ ZMM(15)
+ ZMM(16)
+ ZMM(17)
+ ZMM(18)
+ ZMM(19)
+ ZMM(20)
+ ZMM(21)
+ ZMM(22)
+ ZMM(23)
+ ZMM(24)
+ ZMM(25)
+ ZMM(26)
+ ZMM(27)
+ ZMM(28)
+ ZMM(29)
+ ZMM(30)
+ ZMM(31)
+#undef ZMM
+
+#define OPMASK(x) \
+ case PERF_REG_X86_OPMASK ## x: \
+ return "opmask" #x;
+
+ OPMASK(0)
+ OPMASK(1)
+ OPMASK(2)
+ OPMASK(3)
+ OPMASK(4)
+ OPMASK(5)
+ OPMASK(6)
+ OPMASK(7)
+#undef OPMASK
default:
return NULL;
}
--
2.40.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* [Patch v3 22/22] perf tools/tests: Add vector registers PEBS sampling test
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (20 preceding siblings ...)
2025-04-15 11:44 ` [Patch v3 21/22] perf tools: Support to capture more vector registers (x86/Intel) Dapeng Mi
@ 2025-04-15 11:44 ` Dapeng Mi
2025-04-15 15:21 ` [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Liang, Kan
22 siblings, 0 replies; 47+ messages in thread
From: Dapeng Mi @ 2025-04-15 11:44 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Kan Liang, Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi, Dapeng Mi
The current adaptive PEBS supports capturing some vector registers such as
the XMM registers, and arch-PEBS supports capturing wider vector registers
such as the YMM and ZMM registers. This patch adds a perf test case to
verify that these vector registers can be captured correctly.
Suggested-by: Kan Liang <kan.liang@linux.intel.com>
Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
---
tools/perf/tests/shell/record.sh | 55 ++++++++++++++++++++++++++++++++
1 file changed, 55 insertions(+)
diff --git a/tools/perf/tests/shell/record.sh b/tools/perf/tests/shell/record.sh
index ba8d873d3ca7..d85aab09902b 100755
--- a/tools/perf/tests/shell/record.sh
+++ b/tools/perf/tests/shell/record.sh
@@ -116,6 +116,60 @@ test_register_capture() {
echo "Register capture test [Success]"
}
+test_vec_register_capture() {
+ echo "Vector register capture test"
+ if ! perf record -o /dev/null --quiet -e instructions:p true 2> /dev/null
+ then
+ echo "Vector register capture test [Skipped missing event]"
+ return
+ fi
+ if ! perf record --intr-regs=\? 2>&1 | grep -q 'XMM0'
+ then
+ echo "Vector register capture test [Skipped missing XMM registers]"
+ return
+ fi
+ if ! perf record -o - --intr-regs=xmm0 -e instructions:p \
+ -c 100000 ${testprog} 2> /dev/null \
+ | perf script -F ip,sym,iregs -i - 2> /dev/null \
+ | grep -q "XMM0:"
+ then
+ echo "Vector register capture test [Failed missing XMM output]"
+ err=1
+ return
+ fi
+ echo "Vector registe (XMM) capture test [Success]"
+ if ! perf record --intr-regs=\? 2>&1 | grep -q 'YMM0'
+ then
+ echo "Vector register capture test [Skipped missing YMM registers]"
+ return
+ fi
+ if ! perf record -o - --intr-regs=ymm0 -e instructions:p \
+ -c 100000 ${testprog} 2> /dev/null \
+ | perf script -F ip,sym,iregs -i - 2> /dev/null \
+ | grep -q "YMM0:"
+ then
+ echo "Vector register capture test [Failed missing YMM output]"
+ err=1
+ return
+ fi
+ echo "Vector registe (YMM) capture test [Success]"
+ if ! perf record --intr-regs=\? 2>&1 | grep -q 'ZMM0'
+ then
+ echo "Vector register capture test [Skipped missing ZMM registers]"
+ return
+ fi
+ if ! perf record -o - --intr-regs=zmm0 -e instructions:p \
+ -c 100000 ${testprog} 2> /dev/null \
+ | perf script -F ip,sym,iregs -i - 2> /dev/null \
+ | grep -q "ZMM0:"
+ then
+ echo "Vector register capture test [Failed missing ZMM output]"
+ err=1
+ return
+ fi
+ echo "Vector register (ZMM) capture test [Success]"
+}
+
test_system_wide() {
echo "Basic --system-wide mode test"
if ! perf record -aB --synth=no -o "${perfdata}" ${testprog} 2> /dev/null
@@ -318,6 +372,7 @@ fi
test_per_thread
test_register_capture
+test_vec_register_capture
test_system_wide
test_workload
test_branch_counter
--
2.40.1
^ permalink raw reply related [flat|nested] 47+ messages in thread
* Re: [Patch v3 11/22] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
2025-04-15 11:44 ` [Patch v3 11/22] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR Dapeng Mi
@ 2025-04-15 13:45 ` Peter Zijlstra
2025-04-16 0:59 ` Mi, Dapeng
2025-04-15 13:48 ` Peter Zijlstra
1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-04-15 13:45 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Tue, Apr 15, 2025 at 11:44:17AM +0000, Dapeng Mi wrote:
> + buffer = dsalloc_pages(bsiz, preemptible() ? GFP_KERNEL : GFP_ATOMIC, cpu);
You were going to make this go away :-)
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Patch v3 11/22] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
2025-04-15 11:44 ` [Patch v3 11/22] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR Dapeng Mi
2025-04-15 13:45 ` Peter Zijlstra
@ 2025-04-15 13:48 ` Peter Zijlstra
2025-04-16 1:03 ` Mi, Dapeng
1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-04-15 13:48 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Tue, Apr 15, 2025 at 11:44:17AM +0000, Dapeng Mi wrote:
> +void fini_arch_pebs_buf_on_cpu(int cpu)
> +{
> + if (!x86_pmu.arch_pebs)
> + return;
> +
> + release_pebs_buffer(cpu);
> + wrmsr_on_cpu(cpu, MSR_IA32_PEBS_BASE, 0, 0);
> +}
So first we free the pages, and then we tell the hardware to not write
into them again.
What could possibly go wrong :-)
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Patch v3 12/22] perf/x86/intel: Update dyn_constranit base on PEBS event precise level
2025-04-15 11:44 ` [Patch v3 12/22] perf/x86/intel: Update dyn_constranit base on PEBS event precise level Dapeng Mi
@ 2025-04-15 13:53 ` Peter Zijlstra
2025-04-15 16:31 ` Liang, Kan
0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-04-15 13:53 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Tue, Apr 15, 2025 at 11:44:18AM +0000, Dapeng Mi wrote:
> arch-PEBS provides CPUIDs to enumerate which counters support PEBS
> sampling and precise distribution PEBS sampling. Thus PEBS constraints
> should be dynamically configured base on these counter and precise
> distribution bitmap instead of defining them statically.
>
> Update event dyn_constraint base on PEBS event precise level.
What, if any, constraints are there on this? CPUID is virt host
controlled, right, so these could be the most horrible masks ever.
This can land us in EVENT_CONSTRAINT_OVERLAP territory, no?
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Patch v3 10/22] perf/x86/intel: Process arch-PEBS records or record fragments
2025-04-15 11:44 ` [Patch v3 10/22] perf/x86/intel: Process arch-PEBS records or record fragments Dapeng Mi
@ 2025-04-15 13:57 ` Peter Zijlstra
2025-04-15 16:09 ` Liang, Kan
0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-04-15 13:57 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Tue, Apr 15, 2025 at 11:44:16AM +0000, Dapeng Mi wrote:
> A significant difference with adaptive PEBS is that arch-PEBS record
> supports fragments which means an arch-PEBS record could be split into
> several independent fragments which have its own arch-PEBS header in
> each fragment.
With the constraint that all fragments for a single event are
contiguous, right?
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Patch v3 15/22] perf/x86/intel: Support SSP register capturing for arch-PEBS
2025-04-15 11:44 ` [Patch v3 15/22] perf/x86/intel: Support SSP register capturing " Dapeng Mi
@ 2025-04-15 14:07 ` Peter Zijlstra
2025-04-16 5:49 ` Mi, Dapeng
0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-04-15 14:07 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Tue, Apr 15, 2025 at 11:44:21AM +0000, Dapeng Mi wrote:
> Arch-PEBS supports to capture shadow stack pointer (SSP) register in GPR
> group. This patch supports to capture and output SSP register at
> interrupt or user space, but capturing SSP at user space requires
> 'exclude_kernel' attribute must be set. That avoids kernel space SSP
> register is captured unintentionally.
>
> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
> ---
> arch/x86/events/core.c | 15 +++++++++++++++
> arch/x86/events/intel/core.c | 3 ++-
> arch/x86/events/intel/ds.c | 9 +++++++--
> arch/x86/events/perf_event.h | 4 ++++
> arch/x86/include/asm/perf_event.h | 1 +
> arch/x86/include/uapi/asm/perf_regs.h | 4 +++-
> arch/x86/kernel/perf_regs.c | 7 +++++++
> 7 files changed, 39 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
> index 9c205a8a4fa6..0ccbe8385c7f 100644
> --- a/arch/x86/events/core.c
> +++ b/arch/x86/events/core.c
> @@ -650,6 +650,21 @@ int x86_pmu_hw_config(struct perf_event *event)
> return -EINVAL;
> }
>
> + if (unlikely(event->attr.sample_regs_user & BIT_ULL(PERF_REG_X86_SSP))) {
> + /* Only arch-PEBS supports to capture SSP register. */
> + if (!x86_pmu.arch_pebs || !event->attr.precise_ip)
> + return -EINVAL;
> + /* Only user space is allowed to capture. */
> + if (!event->attr.exclude_kernel)
> + return -EINVAL;
> + }
We should be able to support this for !PEBS samples by reading the MSR
just fine, no?
ISTR making a similar comment last time.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Patch v3 16/22] perf/core: Support to capture higher width vector registers
2025-04-15 11:44 ` [Patch v3 16/22] perf/core: Support to capture higher width vector registers Dapeng Mi
@ 2025-04-15 14:36 ` Peter Zijlstra
2025-04-16 6:42 ` Mi, Dapeng
0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-04-15 14:36 UTC (permalink / raw)
To: Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Tue, Apr 15, 2025 at 11:44:22AM +0000, Dapeng Mi wrote:
> extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
> diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
> index f9c5b16b1882..5e2d9796b2cc 100644
> --- a/arch/x86/include/uapi/asm/perf_regs.h
> +++ b/arch/x86/include/uapi/asm/perf_regs.h
> @@ -33,7 +33,7 @@ enum perf_event_x86_regs {
> PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
> PERF_REG_X86_64_MAX = PERF_REG_X86_SSP + 1,
>
> - /* These all need two bits set because they are 128bit */
> + /* These all need two bits set because they are 128 bits */
> PERF_REG_X86_XMM0 = 32,
> PERF_REG_X86_XMM1 = 34,
> PERF_REG_X86_XMM2 = 36,
> @@ -53,6 +53,83 @@ enum perf_event_x86_regs {
>
> /* These include both GPRs and XMMX registers */
> PERF_REG_X86_XMM_MAX = PERF_REG_X86_XMM15 + 2,
> +
> + /* Leave bits[127:64] for other GP registers, like R16 ~ R31.*/
> +
> + /*
> + * Each YMM register need 4 bits to represent because they are 256 bits.
> + * PERF_REG_X86_YMMH0 = 128
> + */
> + PERF_REG_X86_YMM0 = 128,
> + PERF_REG_X86_YMM1 = PERF_REG_X86_YMM0 + 4,
> + PERF_REG_X86_YMM2 = PERF_REG_X86_YMM1 + 4,
> + PERF_REG_X86_YMM3 = PERF_REG_X86_YMM2 + 4,
> + PERF_REG_X86_YMM4 = PERF_REG_X86_YMM3 + 4,
> + PERF_REG_X86_YMM5 = PERF_REG_X86_YMM4 + 4,
> + PERF_REG_X86_YMM6 = PERF_REG_X86_YMM5 + 4,
> + PERF_REG_X86_YMM7 = PERF_REG_X86_YMM6 + 4,
> + PERF_REG_X86_YMM8 = PERF_REG_X86_YMM7 + 4,
> + PERF_REG_X86_YMM9 = PERF_REG_X86_YMM8 + 4,
> + PERF_REG_X86_YMM10 = PERF_REG_X86_YMM9 + 4,
> + PERF_REG_X86_YMM11 = PERF_REG_X86_YMM10 + 4,
> + PERF_REG_X86_YMM12 = PERF_REG_X86_YMM11 + 4,
> + PERF_REG_X86_YMM13 = PERF_REG_X86_YMM12 + 4,
> + PERF_REG_X86_YMM14 = PERF_REG_X86_YMM13 + 4,
> + PERF_REG_X86_YMM15 = PERF_REG_X86_YMM14 + 4,
> + PERF_REG_X86_YMM_MAX = PERF_REG_X86_YMM15 + 4,
> +
> + /*
> + * Each ZMM register needs 8 bits to represent because they are 512 bits
> + * PERF_REG_X86_ZMMH0 = 192
> + */
> + PERF_REG_X86_ZMM0 = PERF_REG_X86_YMM_MAX,
> + PERF_REG_X86_ZMM1 = PERF_REG_X86_ZMM0 + 8,
> + PERF_REG_X86_ZMM2 = PERF_REG_X86_ZMM1 + 8,
> + PERF_REG_X86_ZMM3 = PERF_REG_X86_ZMM2 + 8,
> + PERF_REG_X86_ZMM4 = PERF_REG_X86_ZMM3 + 8,
> + PERF_REG_X86_ZMM5 = PERF_REG_X86_ZMM4 + 8,
> + PERF_REG_X86_ZMM6 = PERF_REG_X86_ZMM5 + 8,
> + PERF_REG_X86_ZMM7 = PERF_REG_X86_ZMM6 + 8,
> + PERF_REG_X86_ZMM8 = PERF_REG_X86_ZMM7 + 8,
> + PERF_REG_X86_ZMM9 = PERF_REG_X86_ZMM8 + 8,
> + PERF_REG_X86_ZMM10 = PERF_REG_X86_ZMM9 + 8,
> + PERF_REG_X86_ZMM11 = PERF_REG_X86_ZMM10 + 8,
> + PERF_REG_X86_ZMM12 = PERF_REG_X86_ZMM11 + 8,
> + PERF_REG_X86_ZMM13 = PERF_REG_X86_ZMM12 + 8,
> + PERF_REG_X86_ZMM14 = PERF_REG_X86_ZMM13 + 8,
> + PERF_REG_X86_ZMM15 = PERF_REG_X86_ZMM14 + 8,
> + PERF_REG_X86_ZMM16 = PERF_REG_X86_ZMM15 + 8,
> + PERF_REG_X86_ZMM17 = PERF_REG_X86_ZMM16 + 8,
> + PERF_REG_X86_ZMM18 = PERF_REG_X86_ZMM17 + 8,
> + PERF_REG_X86_ZMM19 = PERF_REG_X86_ZMM18 + 8,
> + PERF_REG_X86_ZMM20 = PERF_REG_X86_ZMM19 + 8,
> + PERF_REG_X86_ZMM21 = PERF_REG_X86_ZMM20 + 8,
> + PERF_REG_X86_ZMM22 = PERF_REG_X86_ZMM21 + 8,
> + PERF_REG_X86_ZMM23 = PERF_REG_X86_ZMM22 + 8,
> + PERF_REG_X86_ZMM24 = PERF_REG_X86_ZMM23 + 8,
> + PERF_REG_X86_ZMM25 = PERF_REG_X86_ZMM24 + 8,
> + PERF_REG_X86_ZMM26 = PERF_REG_X86_ZMM25 + 8,
> + PERF_REG_X86_ZMM27 = PERF_REG_X86_ZMM26 + 8,
> + PERF_REG_X86_ZMM28 = PERF_REG_X86_ZMM27 + 8,
> + PERF_REG_X86_ZMM29 = PERF_REG_X86_ZMM28 + 8,
> + PERF_REG_X86_ZMM30 = PERF_REG_X86_ZMM29 + 8,
> + PERF_REG_X86_ZMM31 = PERF_REG_X86_ZMM30 + 8,
> + PERF_REG_X86_ZMM_MAX = PERF_REG_X86_ZMM31 + 8,
> +
> + /*
> + * OPMASK Registers
> + * PERF_REG_X86_OPMASK0 = 448
> + */
> + PERF_REG_X86_OPMASK0 = PERF_REG_X86_ZMM_MAX,
> + PERF_REG_X86_OPMASK1 = PERF_REG_X86_OPMASK0 + 1,
> + PERF_REG_X86_OPMASK2 = PERF_REG_X86_OPMASK1 + 1,
> + PERF_REG_X86_OPMASK3 = PERF_REG_X86_OPMASK2 + 1,
> + PERF_REG_X86_OPMASK4 = PERF_REG_X86_OPMASK3 + 1,
> + PERF_REG_X86_OPMASK5 = PERF_REG_X86_OPMASK4 + 1,
> + PERF_REG_X86_OPMASK6 = PERF_REG_X86_OPMASK5 + 1,
> + PERF_REG_X86_OPMASK7 = PERF_REG_X86_OPMASK6 + 1,
> +
> + PERF_REG_X86_VEC_MAX = PERF_REG_X86_OPMASK7 + 1,
> };
> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
> index 5fc753c23734..78aae0464a54 100644
> --- a/include/uapi/linux/perf_event.h
> +++ b/include/uapi/linux/perf_event.h
> @@ -379,6 +379,10 @@ enum perf_event_read_format {
> #define PERF_ATTR_SIZE_VER6 120 /* add: aux_sample_size */
> #define PERF_ATTR_SIZE_VER7 128 /* add: sig_data */
> #define PERF_ATTR_SIZE_VER8 136 /* add: config3 */
> +#define PERF_ATTR_SIZE_VER9 168 /* add: sample_regs_intr_ext[PERF_EXT_REGS_ARRAY_SIZE] */
> +
> +#define PERF_EXT_REGS_ARRAY_SIZE 7
> +#define PERF_NUM_EXT_REGS (PERF_EXT_REGS_ARRAY_SIZE * 64)
>
> /*
> * Hardware event_id to monitor via a performance monitoring event:
> @@ -533,6 +537,13 @@ struct perf_event_attr {
> __u64 sig_data;
>
> __u64 config3; /* extension of config2 */
> +
> + /*
> + * Extension sets of regs to dump for each sample.
> + * See asm/perf_regs.h for details.
> + */
> + __u64 sample_regs_intr_ext[PERF_EXT_REGS_ARRAY_SIZE];
> + __u64 sample_regs_user_ext[PERF_EXT_REGS_ARRAY_SIZE];
> };
>
> /*
I still utterly hate this interface. This is a giant waste of bits.
What makes it even worse is that XMMn is the lower half of YMMn which in
turn is the lower half of ZMMn.
So by exposing only ZMMn you already expose all of them. The interface
explicitly allows asking for sub-words.
But most importantly of all, last time I asked if there are users that
actually care about the whole per-register thing and I don't see an
answer here.
Can we please find a better interface? Ideally one that scales up to
1024 and 2048 bit vector width, because I'd hate to have to rev this
again.
Perhaps add sample_vec_regs_*[] with a saner format, and if that is !0
then the XMM regs disappear from sample_regs_*[] and we get to use that
space for extended GPs.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
` (21 preceding siblings ...)
2025-04-15 11:44 ` [Patch v3 22/22] perf tools/tests: Add vector registers PEBS sampling test Dapeng Mi
@ 2025-04-15 15:21 ` Liang, Kan
2025-04-16 7:42 ` Peter Zijlstra
22 siblings, 1 reply; 47+ messages in thread
From: Liang, Kan @ 2025-04-15 15:21 UTC (permalink / raw)
To: Dapeng Mi, Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Namhyung Kim, Ian Rogers, Adrian Hunter, Alexander Shishkin,
Andi Kleen, Eranian Stephane
Cc: linux-kernel, linux-perf-users, Dapeng Mi
Hi Peter,
On 2025-04-15 7:44 a.m., Dapeng Mi wrote:
> Dapeng Mi (21):
> perf/x86/intel: Add PMU support for Clearwater Forest
>
> Kan Liang (1):
> perf/x86/intel: Add Panther Lake support
Could you please take a look and pick up the above two patches if they
look good to you?
The two patches are generic support for the Panther Lake and Clearwater
Forest. With them, at least the non-PEBS and topdown can work.
The ARCH PEBS will be temporarily disabled until this big patch set is
merged.
# dmesg | grep PMU
[ 0.095162] Performance Events: XSAVE Architectural LBR, AnyThread
deprecated, Pantherlake Hybrid events, 32-deep LBR, full-width counters,
Intel PMU driver.
# perf stat -e
"{slots,topdown-retiring,topdown-bad-spec,topdown-fe-bound,topdown-be-bound}"
-a
WARNING: events were regrouped to match PMUs
^C
Performance counter stats for 'system wide':
2,212,401 cpu_atom/topdown-retiring/
8,121,982 cpu_atom/topdown-bad-spec/
42,119,870 cpu_atom/topdown-fe-bound/
27,667,678 cpu_atom/topdown-be-bound/
496,377,056 cpu_core/slots/
2,058,926 cpu_core/topdown-retiring/
6,008,255 cpu_core/topdown-bad-spec/
265,352,356 cpu_core/topdown-fe-bound/
222,957,516 cpu_core/topdown-be-bound/
# perf record -e cycles:p sleep 1
Error:
cpu_atom/cycles/pH: PMU Hardware doesn't support
sampling/overflow-interrupts. Try 'perf stat'
Thanks,
Kan
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Patch v3 10/22] perf/x86/intel: Process arch-PEBS records or record fragments
2025-04-15 13:57 ` Peter Zijlstra
@ 2025-04-15 16:09 ` Liang, Kan
0 siblings, 0 replies; 47+ messages in thread
From: Liang, Kan @ 2025-04-15 16:09 UTC (permalink / raw)
To: Peter Zijlstra, Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Andi Kleen, Eranian Stephane,
linux-kernel, linux-perf-users, Dapeng Mi
On 2025-04-15 9:57 a.m., Peter Zijlstra wrote:
> On Tue, Apr 15, 2025 at 11:44:16AM +0000, Dapeng Mi wrote:
>> A significant difference with adaptive PEBS is that arch-PEBS record
>> supports fragments which means an arch-PEBS record could be split into
>> several independent fragments which have its own arch-PEBS header in
>> each fragment.
>
> With the constraint that all fragments for a single event are
> contiguous, right?
Yes. If a record is split into n fragments, the first n-1 fragments will
have the CONTINUED bit set and the last fragment will have its CONTINUED
bit cleared. They are contiguous.
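For illustration, a minimal sketch of walking such contiguous fragments;
the header layout and field names here are assumptions for the example,
not the ones defined by the patch:
/*
 * Sketch only: walk one logical arch-PEBS record made of contiguous
 * fragments; the last fragment has its CONTINUED bit cleared.
 * 'struct frag_hdr' and its fields are hypothetical names.
 */
struct frag_hdr {
	u64	size:16;	/* fragment size in bytes, incl. this header */
	u64	continued:1;	/* 1: more fragments follow for this record */
	u64	rsvd:47;
};
static void *walk_one_record(void *at, void *top)
{
	struct frag_hdr *hdr;
	bool more = true;
	while (more && at < top) {
		hdr = at;
		more = hdr->continued;
		/* handle_fragment(hdr); */
		at += hdr->size;	/* fragments are contiguous */
	}
	return at;			/* start of the next record */
}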
Thanks,
Kan
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Patch v3 12/22] perf/x86/intel: Update dyn_constranit base on PEBS event precise level
2025-04-15 13:53 ` Peter Zijlstra
@ 2025-04-15 16:31 ` Liang, Kan
2025-04-16 1:46 ` Mi, Dapeng
2025-04-16 15:32 ` Peter Zijlstra
0 siblings, 2 replies; 47+ messages in thread
From: Liang, Kan @ 2025-04-15 16:31 UTC (permalink / raw)
To: Peter Zijlstra, Dapeng Mi
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Andi Kleen, Eranian Stephane,
linux-kernel, linux-perf-users, Dapeng Mi
On 2025-04-15 9:53 a.m., Peter Zijlstra wrote:
> On Tue, Apr 15, 2025 at 11:44:18AM +0000, Dapeng Mi wrote:
>> arch-PEBS provides CPUIDs to enumerate which counters support PEBS
>> sampling and precise distribution PEBS sampling. Thus PEBS constraints
>> should be dynamically configured base on these counter and precise
>> distribution bitmap instead of defining them statically.
>>
>> Update event dyn_constraint base on PEBS event precise level.
>
> What if any constraints are there on this?
Do you mean the static constraints defined in the
event_constraints/pebs_constraints?
> CPUID is virt host
> controlled, right, so these could be the most horrible masks ever.
>
Yes, it could be changed by the VMM. A sanity check should be required if
a bad mask is given.
> This can land us in EVENT_CONSTRAINT_OVERLAP territory, no?
The dyn_constraint is a supplement of the static constraints. It doesn't
overwrite the static constraints.
In the intel_get_event_constraints(), perf always gets the static
constraints first. If the dyn_constraint is defined, it gets the common
mask of the static constraints and the dynamic constraints. All
constraint rules will be complied.
if (event->hw.dyn_constraint != ~0ULL) {
c2 = dyn_constraint(cpuc, c2, idx);
c2->idxmsk64 &= event->hw.dyn_constraint;
c2->weight = hweight64(c2->idxmsk64);
}
Thanks,
Kan
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Patch v3 11/22] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
2025-04-15 13:45 ` Peter Zijlstra
@ 2025-04-16 0:59 ` Mi, Dapeng
0 siblings, 0 replies; 47+ messages in thread
From: Mi, Dapeng @ 2025-04-16 0:59 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On 4/15/2025 9:45 PM, Peter Zijlstra wrote:
> On Tue, Apr 15, 2025 at 11:44:17AM +0000, Dapeng Mi wrote:
>> + buffer = dsalloc_pages(bsiz, preemptible() ? GFP_KERNEL : GFP_ATOMIC, cpu);
> You were going to make this go away :-)
Yeah, since alloc_pebs_buffer() is currently called from the
xxx_cpu_starting() context, I thought we still needed this, but it seems
I was wrong. I just dropped this change and ran it on PTL without seeing
any warning. I will drop this change in the next version.
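For reference, a minimal sketch of the allocation with that special case
dropped, assuming the existing dsalloc_pages() helper from ds.c:
	/*
	 * Sketch only: the buffer is allocated from a preemptible context,
	 * so plain GFP_KERNEL is enough.
	 */
	buffer = dsalloc_pages(bsiz, GFP_KERNEL, cpu);
	if (unlikely(!buffer))
		return -ENOMEM;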
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Patch v3 11/22] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
2025-04-15 13:48 ` Peter Zijlstra
@ 2025-04-16 1:03 ` Mi, Dapeng
0 siblings, 0 replies; 47+ messages in thread
From: Mi, Dapeng @ 2025-04-16 1:03 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On 4/15/2025 9:48 PM, Peter Zijlstra wrote:
> On Tue, Apr 15, 2025 at 11:44:17AM +0000, Dapeng Mi wrote:
>
>> +void fini_arch_pebs_buf_on_cpu(int cpu)
>> +{
>> + if (!x86_pmu.arch_pebs)
>> + return;
>> +
>> + release_pebs_buffer(cpu);
>> + wrmsr_on_cpu(cpu, MSR_IA32_PEBS_BASE, 0, 0);
>> +}
> So first we free the pages, and then we tell the hardware to not write
> into them again.
>
> What could possibly go wrong :-)
Oh, yes. Thanks for pointing this out. I will swap the order.
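Roughly, the reordered teardown would look like this (a sketch based on the
function quoted above, not the final patch):
void fini_arch_pebs_buf_on_cpu(int cpu)
{
	if (!x86_pmu.arch_pebs)
		return;
	/* Tell the hardware to stop writing first, then free the pages. */
	wrmsr_on_cpu(cpu, MSR_IA32_PEBS_BASE, 0, 0);
	release_pebs_buffer(cpu);
}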
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Patch v3 12/22] perf/x86/intel: Update dyn_constranit base on PEBS event precise level
2025-04-15 16:31 ` Liang, Kan
@ 2025-04-16 1:46 ` Mi, Dapeng
2025-04-16 13:59 ` Liang, Kan
2025-04-16 15:32 ` Peter Zijlstra
1 sibling, 1 reply; 47+ messages in thread
From: Mi, Dapeng @ 2025-04-16 1:46 UTC (permalink / raw)
To: Liang, Kan, Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Andi Kleen, Eranian Stephane,
linux-kernel, linux-perf-users, Dapeng Mi
On 4/16/2025 12:31 AM, Liang, Kan wrote:
>
> On 2025-04-15 9:53 a.m., Peter Zijlstra wrote:
>> On Tue, Apr 15, 2025 at 11:44:18AM +0000, Dapeng Mi wrote:
>>> arch-PEBS provides CPUIDs to enumerate which counters support PEBS
>>> sampling and precise distribution PEBS sampling. Thus PEBS constraints
>>> should be dynamically configured base on these counter and precise
>>> distribution bitmap instead of defining them statically.
>>>
>>> Update event dyn_constraint base on PEBS event precise level.
>> What if any constraints are there on this?
> Do you mean the static constraints defined in the
> event_constraints/pebs_constraints?
>
>> CPUID is virt host
>> controlled, right, so these could be the most horrible masks ever.
>>
> Yes, it could be changed by VMM. A sanity check should be required if
> a bad mask is given.
Yes, we need a check to restrict the PEBS counter mask to the valid
counter mask. I just realized that we can't use hybrid(event->pmu,
intel_ctrl) as-is to check the counter mask; it needs a minor tweak since
it includes the GLOBAL_CTRL_EN_PERF_METRICS bit.
How about this?
if (x86_pmu.arch_pebs) {
u64 cntr_mask = hybrid(event->pmu, intel_ctrl) &
~GLOBAL_CTRL_EN_PERF_METRICS;
u64 pebs_mask = event->attr.precise_ip >= 3 ?
pebs_cap.pdists : pebs_cap.counters;
if (pebs_mask != cntr_mask)
event->hw.dyn_constraint = pebs_mask & cntr_mask;
}
>
>> This can land us in EVENT_CONSTRAINT_OVERLAP territory, no?
> The dyn_constraint is a supplement of the static constraints. It doesn't
> overwrite the static constraints.
>
> In the intel_get_event_constraints(), perf always gets the static
> constraints first. If the dyn_constraint is defined, it gets the common
> mask of the static constraints and the dynamic constraints. All
> constraint rules will be complied.
>
> if (event->hw.dyn_constraint != ~0ULL) {
> c2 = dyn_constraint(cpuc, c2, idx);
> c2->idxmsk64 &= event->hw.dyn_constraint;
> c2->weight = hweight64(c2->idxmsk64);
> }
>
> Thanks,
> Kan
>
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Patch v3 15/22] perf/x86/intel: Support SSP register capturing for arch-PEBS
2025-04-15 14:07 ` Peter Zijlstra
@ 2025-04-16 5:49 ` Mi, Dapeng
0 siblings, 0 replies; 47+ messages in thread
From: Mi, Dapeng @ 2025-04-16 5:49 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On 4/15/2025 10:07 PM, Peter Zijlstra wrote:
> On Tue, Apr 15, 2025 at 11:44:21AM +0000, Dapeng Mi wrote:
>> Arch-PEBS supports to capture shadow stack pointer (SSP) register in GPR
>> group. This patch supports to capture and output SSP register at
>> interrupt or user space, but capturing SSP at user space requires
>> 'exclude_kernel' attribute must be set. That avoids kernel space SSP
>> register is captured unintentionally.
>>
>> Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com>
>> ---
>> arch/x86/events/core.c | 15 +++++++++++++++
>> arch/x86/events/intel/core.c | 3 ++-
>> arch/x86/events/intel/ds.c | 9 +++++++--
>> arch/x86/events/perf_event.h | 4 ++++
>> arch/x86/include/asm/perf_event.h | 1 +
>> arch/x86/include/uapi/asm/perf_regs.h | 4 +++-
>> arch/x86/kernel/perf_regs.c | 7 +++++++
>> 7 files changed, 39 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
>> index 9c205a8a4fa6..0ccbe8385c7f 100644
>> --- a/arch/x86/events/core.c
>> +++ b/arch/x86/events/core.c
>> @@ -650,6 +650,21 @@ int x86_pmu_hw_config(struct perf_event *event)
>> return -EINVAL;
>> }
>>
>> + if (unlikely(event->attr.sample_regs_user & BIT_ULL(PERF_REG_X86_SSP))) {
>> + /* Only arch-PEBS supports to capture SSP register. */
>> + if (!x86_pmu.arch_pebs || !event->attr.precise_ip)
>> + return -EINVAL;
>> + /* Only user space is allowed to capture. */
>> + if (!event->attr.exclude_kernel)
>> + return -EINVAL;
>> + }
> We should be able to support this for !PEBS samples by reading the MSR
> just fine, no?
Yes, it can be supported for !PEBS samples. I hesitated over whether to add
support for capturing SSP for !PEBS. Since there is a latency between the
counter overflowing and SW reading SSP, the captured SSP may be inaccurate
for !PEBS samples. That makes it less valuable, considering we already
support it for PEBS sampling.
Anyway, we can support it in the next version if it's necessary.
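If it is added, a rough sketch of the MSR-based approach for !PEBS samples
could look like the below; treating MSR_IA32_PL3_SSP as holding the user
SSP at PMI time, and the perf_regs->ssp field name, are assumptions here:
	/*
	 * Sketch only: for a non-PEBS sample, read the user SSP from the
	 * CET MSR at PMI time (assumes the user SSP is parked in
	 * MSR_IA32_PL3_SSP while running in the kernel).
	 */
	u64 ssp = 0;
	if (user_mode(regs))
		rdmsrl(MSR_IA32_PL3_SSP, ssp);
	perf_regs->ssp = ssp;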
>
> ISTR making a similar comment last time.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Patch v3 16/22] perf/core: Support to capture higher width vector registers
2025-04-15 14:36 ` Peter Zijlstra
@ 2025-04-16 6:42 ` Mi, Dapeng
2025-04-16 15:53 ` Peter Zijlstra
0 siblings, 1 reply; 47+ messages in thread
From: Mi, Dapeng @ 2025-04-16 6:42 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On 4/15/2025 10:36 PM, Peter Zijlstra wrote:
> On Tue, Apr 15, 2025 at 11:44:22AM +0000, Dapeng Mi wrote:
>> extern unsigned long perf_arch_instruction_pointer(struct pt_regs *regs);
>> diff --git a/arch/x86/include/uapi/asm/perf_regs.h b/arch/x86/include/uapi/asm/perf_regs.h
>> index f9c5b16b1882..5e2d9796b2cc 100644
>> --- a/arch/x86/include/uapi/asm/perf_regs.h
>> +++ b/arch/x86/include/uapi/asm/perf_regs.h
>> @@ -33,7 +33,7 @@ enum perf_event_x86_regs {
>> PERF_REG_X86_32_MAX = PERF_REG_X86_GS + 1,
>> PERF_REG_X86_64_MAX = PERF_REG_X86_SSP + 1,
>>
>> - /* These all need two bits set because they are 128bit */
>> + /* These all need two bits set because they are 128 bits */
>> PERF_REG_X86_XMM0 = 32,
>> PERF_REG_X86_XMM1 = 34,
>> PERF_REG_X86_XMM2 = 36,
>> @@ -53,6 +53,83 @@ enum perf_event_x86_regs {
>>
>> /* These include both GPRs and XMMX registers */
>> PERF_REG_X86_XMM_MAX = PERF_REG_X86_XMM15 + 2,
>> +
>> + /* Leave bits[127:64] for other GP registers, like R16 ~ R31.*/
>> +
>> + /*
>> + * Each YMM register need 4 bits to represent because they are 256 bits.
>> + * PERF_REG_X86_YMMH0 = 128
>> + */
>> + PERF_REG_X86_YMM0 = 128,
>> + PERF_REG_X86_YMM1 = PERF_REG_X86_YMM0 + 4,
>> + PERF_REG_X86_YMM2 = PERF_REG_X86_YMM1 + 4,
>> + PERF_REG_X86_YMM3 = PERF_REG_X86_YMM2 + 4,
>> + PERF_REG_X86_YMM4 = PERF_REG_X86_YMM3 + 4,
>> + PERF_REG_X86_YMM5 = PERF_REG_X86_YMM4 + 4,
>> + PERF_REG_X86_YMM6 = PERF_REG_X86_YMM5 + 4,
>> + PERF_REG_X86_YMM7 = PERF_REG_X86_YMM6 + 4,
>> + PERF_REG_X86_YMM8 = PERF_REG_X86_YMM7 + 4,
>> + PERF_REG_X86_YMM9 = PERF_REG_X86_YMM8 + 4,
>> + PERF_REG_X86_YMM10 = PERF_REG_X86_YMM9 + 4,
>> + PERF_REG_X86_YMM11 = PERF_REG_X86_YMM10 + 4,
>> + PERF_REG_X86_YMM12 = PERF_REG_X86_YMM11 + 4,
>> + PERF_REG_X86_YMM13 = PERF_REG_X86_YMM12 + 4,
>> + PERF_REG_X86_YMM14 = PERF_REG_X86_YMM13 + 4,
>> + PERF_REG_X86_YMM15 = PERF_REG_X86_YMM14 + 4,
>> + PERF_REG_X86_YMM_MAX = PERF_REG_X86_YMM15 + 4,
>> +
>> + /*
>> + * Each ZMM register needs 8 bits to represent because they are 512 bits
>> + * PERF_REG_X86_ZMMH0 = 192
>> + */
>> + PERF_REG_X86_ZMM0 = PERF_REG_X86_YMM_MAX,
>> + PERF_REG_X86_ZMM1 = PERF_REG_X86_ZMM0 + 8,
>> + PERF_REG_X86_ZMM2 = PERF_REG_X86_ZMM1 + 8,
>> + PERF_REG_X86_ZMM3 = PERF_REG_X86_ZMM2 + 8,
>> + PERF_REG_X86_ZMM4 = PERF_REG_X86_ZMM3 + 8,
>> + PERF_REG_X86_ZMM5 = PERF_REG_X86_ZMM4 + 8,
>> + PERF_REG_X86_ZMM6 = PERF_REG_X86_ZMM5 + 8,
>> + PERF_REG_X86_ZMM7 = PERF_REG_X86_ZMM6 + 8,
>> + PERF_REG_X86_ZMM8 = PERF_REG_X86_ZMM7 + 8,
>> + PERF_REG_X86_ZMM9 = PERF_REG_X86_ZMM8 + 8,
>> + PERF_REG_X86_ZMM10 = PERF_REG_X86_ZMM9 + 8,
>> + PERF_REG_X86_ZMM11 = PERF_REG_X86_ZMM10 + 8,
>> + PERF_REG_X86_ZMM12 = PERF_REG_X86_ZMM11 + 8,
>> + PERF_REG_X86_ZMM13 = PERF_REG_X86_ZMM12 + 8,
>> + PERF_REG_X86_ZMM14 = PERF_REG_X86_ZMM13 + 8,
>> + PERF_REG_X86_ZMM15 = PERF_REG_X86_ZMM14 + 8,
>> + PERF_REG_X86_ZMM16 = PERF_REG_X86_ZMM15 + 8,
>> + PERF_REG_X86_ZMM17 = PERF_REG_X86_ZMM16 + 8,
>> + PERF_REG_X86_ZMM18 = PERF_REG_X86_ZMM17 + 8,
>> + PERF_REG_X86_ZMM19 = PERF_REG_X86_ZMM18 + 8,
>> + PERF_REG_X86_ZMM20 = PERF_REG_X86_ZMM19 + 8,
>> + PERF_REG_X86_ZMM21 = PERF_REG_X86_ZMM20 + 8,
>> + PERF_REG_X86_ZMM22 = PERF_REG_X86_ZMM21 + 8,
>> + PERF_REG_X86_ZMM23 = PERF_REG_X86_ZMM22 + 8,
>> + PERF_REG_X86_ZMM24 = PERF_REG_X86_ZMM23 + 8,
>> + PERF_REG_X86_ZMM25 = PERF_REG_X86_ZMM24 + 8,
>> + PERF_REG_X86_ZMM26 = PERF_REG_X86_ZMM25 + 8,
>> + PERF_REG_X86_ZMM27 = PERF_REG_X86_ZMM26 + 8,
>> + PERF_REG_X86_ZMM28 = PERF_REG_X86_ZMM27 + 8,
>> + PERF_REG_X86_ZMM29 = PERF_REG_X86_ZMM28 + 8,
>> + PERF_REG_X86_ZMM30 = PERF_REG_X86_ZMM29 + 8,
>> + PERF_REG_X86_ZMM31 = PERF_REG_X86_ZMM30 + 8,
>> + PERF_REG_X86_ZMM_MAX = PERF_REG_X86_ZMM31 + 8,
>> +
>> + /*
>> + * OPMASK Registers
>> + * PERF_REG_X86_OPMASK0 = 448
>> + */
>> + PERF_REG_X86_OPMASK0 = PERF_REG_X86_ZMM_MAX,
>> + PERF_REG_X86_OPMASK1 = PERF_REG_X86_OPMASK0 + 1,
>> + PERF_REG_X86_OPMASK2 = PERF_REG_X86_OPMASK1 + 1,
>> + PERF_REG_X86_OPMASK3 = PERF_REG_X86_OPMASK2 + 1,
>> + PERF_REG_X86_OPMASK4 = PERF_REG_X86_OPMASK3 + 1,
>> + PERF_REG_X86_OPMASK5 = PERF_REG_X86_OPMASK4 + 1,
>> + PERF_REG_X86_OPMASK6 = PERF_REG_X86_OPMASK5 + 1,
>> + PERF_REG_X86_OPMASK7 = PERF_REG_X86_OPMASK6 + 1,
>> +
>> + PERF_REG_X86_VEC_MAX = PERF_REG_X86_OPMASK7 + 1,
>> };
>> diff --git a/include/uapi/linux/perf_event.h b/include/uapi/linux/perf_event.h
>> index 5fc753c23734..78aae0464a54 100644
>> --- a/include/uapi/linux/perf_event.h
>> +++ b/include/uapi/linux/perf_event.h
>> @@ -379,6 +379,10 @@ enum perf_event_read_format {
>> #define PERF_ATTR_SIZE_VER6 120 /* add: aux_sample_size */
>> #define PERF_ATTR_SIZE_VER7 128 /* add: sig_data */
>> #define PERF_ATTR_SIZE_VER8 136 /* add: config3 */
>> +#define PERF_ATTR_SIZE_VER9 168 /* add: sample_regs_intr_ext[PERF_EXT_REGS_ARRAY_SIZE] */
>> +
>> +#define PERF_EXT_REGS_ARRAY_SIZE 7
>> +#define PERF_NUM_EXT_REGS (PERF_EXT_REGS_ARRAY_SIZE * 64)
>>
>> /*
>> * Hardware event_id to monitor via a performance monitoring event:
>> @@ -533,6 +537,13 @@ struct perf_event_attr {
>> __u64 sig_data;
>>
>> __u64 config3; /* extension of config2 */
>> +
>> + /*
>> + * Extension sets of regs to dump for each sample.
>> + * See asm/perf_regs.h for details.
>> + */
>> + __u64 sample_regs_intr_ext[PERF_EXT_REGS_ARRAY_SIZE];
>> + __u64 sample_regs_user_ext[PERF_EXT_REGS_ARRAY_SIZE];
>> };
>>
>> /*
> I still utterly hate this interface. This is a giant waste of bits.
>
> What makes it even worse is that XMMn is the lower half of YMMn which in
> turn is the lower half of ZMMn.
>
> So by exposing only ZMMn you already expose all of them. The interface
> explicitly allows asking for sub-words.
>
> But most importantly of all, last time I asked if there are users that
> actually care about the whole per-register thing and I don't see an
> answer here.
>
> Can we please find a better interface? Ideally one that scales up to
> 1024 and 2048 bit vector width, because I'd hate to have to rev this
> again.
>
> Perhaps add sample_vec_regs_*[] with a saner format, and if that is !0
> then the XMM regs dissapear from sample_regs_*[] and we get to use that
> space to extended GPs.
Thinking about it some more, using a bitmap to represent these extended
registers indeed wastes bits and is hard to extend; there could be many
more vector registers once AMX is considered.
Since different archs/HW may support different numbers of vector registers,
e.g. platform A supports 8 XMM registers and 8 YMM registers while platform
B only supports 16 XMM registers, a better way to represent these vector
registers may be to add two fields: one is a bitmap which indicates which
kinds of vector registers need to be captured, the other is a u16 array
which gives the number of registers of each kind. It may look like this.
#define PERF_SAMPLE_EXT_REGS_XMM BIT(0)
#define PERF_SAMPLE_EXT_REGS_YMM BIT(1)
#define PERF_SAMPLE_EXT_REGS_ZMM BIT(2)
__u32 sample_regs_intr_ext;
__u16 sample_regs_intr_ext_len[4];
__u32 sample_regs_user_ext;
__u16 sample_regs_user_ext_len[4];
Peter, what do you think of this? Thanks.
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake
2025-04-15 15:21 ` [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Liang, Kan
@ 2025-04-16 7:42 ` Peter Zijlstra
0 siblings, 0 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-04-16 7:42 UTC (permalink / raw)
To: Liang, Kan
Cc: Dapeng Mi, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Ian Rogers, Adrian Hunter, Alexander Shishkin, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Tue, Apr 15, 2025 at 11:21:30AM -0400, Liang, Kan wrote:
> Hi Peter,
>
> On 2025-04-15 7:44 a.m., Dapeng Mi wrote:
> > Dapeng Mi (21):
> > perf/x86/intel: Add PMU support for Clearwater Forest
> >
> > Kan Liang (1):
> > perf/x86/intel: Add Panther Lake support
>
> Could you please take a look and pick up the above two patches if they
> look good to you?
Yes, I've picked up that earlier 2 patch series and will pick up the
first 6 patches from this series.
Thanks!
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Patch v3 12/22] perf/x86/intel: Update dyn_constranit base on PEBS event precise level
2025-04-16 1:46 ` Mi, Dapeng
@ 2025-04-16 13:59 ` Liang, Kan
2025-04-17 1:15 ` Mi, Dapeng
0 siblings, 1 reply; 47+ messages in thread
From: Liang, Kan @ 2025-04-16 13:59 UTC (permalink / raw)
To: Mi, Dapeng, Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Andi Kleen, Eranian Stephane,
linux-kernel, linux-perf-users, Dapeng Mi
On 2025-04-15 9:46 p.m., Mi, Dapeng wrote:
>
> On 4/16/2025 12:31 AM, Liang, Kan wrote:
>>
>> On 2025-04-15 9:53 a.m., Peter Zijlstra wrote:
>>> On Tue, Apr 15, 2025 at 11:44:18AM +0000, Dapeng Mi wrote:
>>>> arch-PEBS provides CPUIDs to enumerate which counters support PEBS
>>>> sampling and precise distribution PEBS sampling. Thus PEBS constraints
>>>> should be dynamically configured base on these counter and precise
>>>> distribution bitmap instead of defining them statically.
>>>>
>>>> Update event dyn_constraint base on PEBS event precise level.
>>> What if any constraints are there on this?
>> Do you mean the static constraints defined in the
>> event_constraints/pebs_constraints?
>>
>>> CPUID is virt host
>>> controlled, right, so these could be the most horrible masks ever.
>>>
>> Yes, it could be changed by VMM. A sanity check should be required if
>> a bad mask is given.
>
> Yes, we need a check to restrict the PEBS counter mask into the valid
> counter mask, and just realized that we can't use hybrid(event->pmu,
> intel_ctrl) to check counter mask and need a minor tweak since it includes
> the GLOBAL_CTRL_EN_PERF_METRICS bit.
>
> How about this?
>
> if (x86_pmu.arch_pebs) {
> u64 cntr_mask = hybrid(event->pmu, intel_ctrl) &
> ~GLOBAL_CTRL_EN_PERF_METRICS;
> u64 pebs_mask = event->attr.precise_ip >= 3 ?
> pebs_cap.pdists : pebs_cap.counters;
> if (pebs_mask != cntr_mask)
> event->hw.dyn_constraint = pebs_mask & cntr_mask;
> }
>
The mask isn't changed after boot. The sanity check should only be done
once. I think it can be done in the update_pmu_cap() when perf retrieves
the value. If a bad mask is detected, the PEBS should be disabled.
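Something along these lines, purely as a sketch (the pebs_cap/counter-mask
names follow the earlier mail and may not match the final code):
	/*
	 * Sketch only: validate the CPUID-provided arch-PEBS counter masks
	 * once at boot and disable arch-PEBS if a bogus mask is reported.
	 */
	u64 cntr_mask = hybrid(pmu, intel_ctrl) & ~GLOBAL_CTRL_EN_PERF_METRICS;
	if ((pebs_cap.counters & ~cntr_mask) || (pebs_cap.pdists & ~cntr_mask)) {
		pr_warn("perf: bogus arch-PEBS counter mask, disabling arch-PEBS\n");
		x86_pmu.arch_pebs = 0;
	}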
Thanks,
Kan
>
>>
>>> This can land us in EVENT_CONSTRAINT_OVERLAP territory, no?
>> The dyn_constraint is a supplement of the static constraints. It doesn't
>> overwrite the static constraints.
>>
>> In the intel_get_event_constraints(), perf always gets the static
>> constraints first. If the dyn_constraint is defined, it gets the common
>> mask of the static constraints and the dynamic constraints. All
>> constraint rules will be complied.
>>
>> if (event->hw.dyn_constraint != ~0ULL) {
>> c2 = dyn_constraint(cpuc, c2, idx);
>> c2->idxmsk64 &= event->hw.dyn_constraint;
>> c2->weight = hweight64(c2->idxmsk64);
>> }
>>
>> Thanks,
>> Kan
>>
>
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Patch v3 12/22] perf/x86/intel: Update dyn_constranit base on PEBS event precise level
2025-04-15 16:31 ` Liang, Kan
2025-04-16 1:46 ` Mi, Dapeng
@ 2025-04-16 15:32 ` Peter Zijlstra
2025-04-16 19:45 ` Liang, Kan
1 sibling, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-04-16 15:32 UTC (permalink / raw)
To: Liang, Kan
Cc: Dapeng Mi, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Ian Rogers, Adrian Hunter, Alexander Shishkin, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Tue, Apr 15, 2025 at 12:31:03PM -0400, Liang, Kan wrote:
> > This can land us in EVENT_CONSTRAINT_OVERLAP territory, no?
> The dyn_constraint is a supplement of the static constraints. It doesn't
> overwrite the static constraints.
That doesn't matter.
> In the intel_get_event_constraints(), perf always gets the static
> constraints first. If the dyn_constraint is defined, it gets the common
> mask of the static constraints and the dynamic constraints. All
> constraint rules will be complied.
>
> if (event->hw.dyn_constraint != ~0ULL) {
> c2 = dyn_constraint(cpuc, c2, idx);
> c2->idxmsk64 &= event->hw.dyn_constraint;
> c2->weight = hweight64(c2->idxmsk64);
> }
Read the comment that goes with EVENT_CONSTRAINT_OVERLAP().
Suppose we have (from intel_lnc_event_constraints[]):
INTEL_UEVENT_CONSTRAINT(0x012a, 0xf)
INTEL_EVENT_CONSTRAINT(0x2e, 0x3ff)
Then since the first is fully contained in the latter, there is no
problem.
Now imagine PEBS gets a dynamic constraint of 0x3c (just because), and
then you try and create a PEBS event along with the above two events,
and all of a sudden you have:
0x000f
0x003c
0x03ff
And that is exactly the problem case.
Also, looking at that LNC table, please explain:
INTEL_UEVENT_CONSTRAINT(0x01cd, 0x3fc)
that looks like the exact thing I've asked to never do, exactly because
of the above problem :-(
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Patch v3 16/22] perf/core: Support to capture higher width vector registers
2025-04-16 6:42 ` Mi, Dapeng
@ 2025-04-16 15:53 ` Peter Zijlstra
2025-04-17 2:00 ` Mi, Dapeng
2025-04-22 3:05 ` Mi, Dapeng
0 siblings, 2 replies; 47+ messages in thread
From: Peter Zijlstra @ 2025-04-16 15:53 UTC (permalink / raw)
To: Mi, Dapeng
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Wed, Apr 16, 2025 at 02:42:12PM +0800, Mi, Dapeng wrote:
> Just think twice, using bitmap to represent these extended registers indeed
> wastes bits and is hard to extend, there could be much much more vector
> registers if considering AMX.
*Groan* so AMX should never have been register state :-(
> Considering different arch/HW may support different number vector register,
> like platform A supports 8 XMM registers and 8 YMM registers, but platform
> B only supports 16 XMM registers, a better way to represent these vector
> registers may add two fields, one is a bitmap which represents which kinds
> of vector registers needs to be captures. The other field could be a u16
> array which represents the corresponding register length of each kind of
> vector register. It may look like this.
>
> #define PERF_SAMPLE_EXT_REGS_XMM BIT(0)
> #define PERF_SAMPLE_EXT_REGS_YMM BIT(1)
> #define PERF_SAMPLE_EXT_REGS_ZMM BIT(2)
> __u32 sample_regs_intr_ext;
> __u16 sample_regs_intr_ext_len[4];
> __u32 sample_regs_user_ext;
> __u16 sample_regs_user_ext_len[4];
>
>
> Peter, how do you think this? Thanks.
I'm not entirely sure I understand.
How about something like:
__u16 sample_simd_reg_words;
__u64 sample_simd_reg_intr;
__u64 sample_simd_reg_user;
Then the simd_reg_words tell us how many (quad) words per register (8 for
512) and simd_reg_{intr,user} are a simple bitmap, one bit per actual
simd reg.
So then all of XMM would be:
words = 2;
intr = user = 0xFFFF;
(16 regs, 128 wide)
Whereas ZMM would be:
words = 8
intr = user = 0xFFFFFFFF;
(32 regs, 512 wide)
Would this be sufficient? Possibly we can split the words thing into two
__u8, but does it make sense to ask for different vector widths for
intr and user?
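As a concrete reading of that proposal, a sketch of how the suggested
fields might be filled in, following the two examples above (the field
names are the proposal itself, not an existing uapi):
/* Sketch only: proposed fields, not part of the current perf_event_attr. */
static void fill_simd_regs_xmm(struct perf_event_attr *attr)
{
	/* 16 XMM registers, 128 bits wide -> 2 quad-words per register */
	attr->sample_simd_reg_words = 2;
	attr->sample_simd_reg_intr  = 0xffff;
	attr->sample_simd_reg_user  = 0xffff;
}
static void fill_simd_regs_zmm(struct perf_event_attr *attr)
{
	/* 32 ZMM registers, 512 bits wide -> 8 quad-words per register */
	attr->sample_simd_reg_words = 8;
	attr->sample_simd_reg_intr  = 0xffffffffULL;
	attr->sample_simd_reg_user  = 0xffffffffULL;
}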
^ permalink raw reply [flat|nested] 47+ messages in thread
* Re: [Patch v3 12/22] perf/x86/intel: Update dyn_constranit base on PEBS event precise level
2025-04-16 15:32 ` Peter Zijlstra
@ 2025-04-16 19:45 ` Liang, Kan
2025-04-16 19:56 ` Peter Zijlstra
0 siblings, 1 reply; 47+ messages in thread
From: Liang, Kan @ 2025-04-16 19:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Dapeng Mi, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Ian Rogers, Adrian Hunter, Alexander Shishkin, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On 2025-04-16 11:32 a.m., Peter Zijlstra wrote:
> On Tue, Apr 15, 2025 at 12:31:03PM -0400, Liang, Kan wrote:
>
>>> This can land us in EVENT_CONSTRAINT_OVERLAP territory, no?
>
>> The dyn_constraint is a supplement of the static constraints. It doesn't
>> overwrite the static constraints.
>
> That doesn't matter.
>
>> In the intel_get_event_constraints(), perf always gets the static
>> constraints first. If the dyn_constraint is defined, it gets the common
>> mask of the static constraints and the dynamic constraints. All
>> constraint rules will be complied.
>>
>> if (event->hw.dyn_constraint != ~0ULL) {
>> c2 = dyn_constraint(cpuc, c2, idx);
>> c2->idxmsk64 &= event->hw.dyn_constraint;
>> c2->weight = hweight64(c2->idxmsk64);
>> }
>
> Read the comment that goes with EVENT_CONSTRAINT_OVERLAP().
>
> Suppose we have (from intel_lnc_event_constraints[]):
>
> INTEL_UEVENT_CONSTRAINT(0x012a, 0xf)
> INTEL_EVENT_CONSTRAINT(0x2e, 0x3ff)
>
> Then since the first is fully contained in the latter, there is no
> problem.
>
> Now imagine PEBS gets a dynamic constraint of 0x3c (just because), and
> then you try and create a PEBS event along with the above two events,
> and all of a sudden you have:
>
> 0x000f
> 0x003c
> 0x03ff
>
> And that is exactly the problem case.
>
> Also, looking at that LNC table, please explain:
>
> INTEL_UEVENT_CONSTRAINT(0x01cd, 0x3fc)
>
> that looks like the exact thing I've asked to never do, exactly because
> of the above problem :-(
I see. I think we can check the constraint table and update the overlap
bit accordingly, similar to what we did in
intel_pmu_check_event_constraints() for the fixed counters.
I'm thinking of something like the below (just a POC, not tested).
For the static table, set the overlap bit at init time for the events that
may trigger the issue.
For the dynamic constraint, add a dyn_overlap_mask to track whether overlap
is required for the feature (the below only supports the branch counters;
ACR and arch-PEBS can be added later). If it is required, set a
PERF_X86_EVENT_OVERLAP flag for the event when the dyn_constraint is
applied. The overlap bit will then be set at runtime.
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 16f8aea33243..76a03a0c28e9 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3825,6 +3825,8 @@ intel_get_event_constraints(struct cpu_hw_events
*cpuc, int idx,
c2 = dyn_constraint(cpuc, c2, idx);
c2->idxmsk64 &= event->hw.dyn_constraint;
c2->weight = hweight64(c2->idxmsk64);
+ if (event->hw.flags & PERF_X86_EVENT_OVERLAP)
+ c2->overlap = 1;
}
return c2;
@@ -4197,6 +4199,12 @@ static inline void
intel_pmu_set_acr_caused_constr(struct perf_event *event,
event->hw.dyn_constraint &= hybrid(event->pmu, acr_cause_mask64);
}
+enum dyn_overlap_bits {
+ DYN_OVERLAP_BRANCH_CNTR
+};
+
+static unsigned long dyn_overlap_mask;
+
static int intel_pmu_hw_config(struct perf_event *event)
{
int ret = x86_pmu_hw_config(event);
@@ -4261,6 +4269,8 @@ static int intel_pmu_hw_config(struct perf_event
*event)
if (branch_sample_counters(leader)) {
num++;
leader->hw.dyn_constraint &= x86_pmu.lbr_counters;
+ if (test_bit(DYN_OVERLAP_BRANCH_CNTR, &dyn_overlap_mask))
+ leader->hw.flags |= PERF_X86_EVENT_OVERLAP;
}
leader->hw.flags |= PERF_X86_EVENT_BRANCH_COUNTERS;
@@ -4270,6 +4280,8 @@ static int intel_pmu_hw_config(struct perf_event
*event)
if (branch_sample_counters(sibling)) {
num++;
sibling->hw.dyn_constraint &= x86_pmu.lbr_counters;
+ if (test_bit(DYN_OVERLAP_BRANCH_CNTR, &dyn_overlap_mask))
+ sibling->hw.flags |= PERF_X86_EVENT_OVERLAP;
}
}
@@ -6638,6 +6650,29 @@ static void
intel_pmu_check_event_constraints(struct event_constraint *event_con
if (!event_constraints)
return;
+ for_each_event_constraint(c, event_constraints) {
+ if (c->weight == 1 || c->overlap)
+ continue;
+
+ /*
+ * The counter mask of an event is not a subset of
+ * the counter mask of a constraint with an equal
+ * or higher weight. The overlap flag must be set.
+ */
+ for_each_event_constraint(c2, event_constraints) {
+ if ((c2->weight >= c->weight) &&
+ (c2->idxmsk64 | c->idxmsk64) != c2->idxmsk64) {
+ c->overlap = 1;
+ break;
+ }
+ }
+
+ /* Check for the dynamic constraint */
+ if (c->weight >= HWEIGHT(x86_pmu.lbr_counters) &&
+ (c->idxmsk64 | x86_pmu.lbr_counters) != c->idxmsk64)
+ __set_bit(DYN_OVERLAP_BRANCH_CNTR, &dyn_overlap_mask);
+ }
+
/*
* event on fixed counter2 (REF_CYCLES) only works on this
* counter, so do not extend mask to generic counters
Thanks,
Kan
* Re: [Patch v3 12/22] perf/x86/intel: Update dyn_constranit base on PEBS event precise level
2025-04-16 19:45 ` Liang, Kan
@ 2025-04-16 19:56 ` Peter Zijlstra
2025-04-22 22:50 ` Liang, Kan
0 siblings, 1 reply; 47+ messages in thread
From: Peter Zijlstra @ 2025-04-16 19:56 UTC (permalink / raw)
To: Liang, Kan
Cc: Dapeng Mi, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Ian Rogers, Adrian Hunter, Alexander Shishkin, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On Wed, Apr 16, 2025 at 03:45:24PM -0400, Liang, Kan wrote:
> I see. I think we can check the constraint table and update the overlap
> bit accordingly. Similar to what we did in the
> intel_pmu_check_event_constraints() for the fixed counters.
>
> I'm thinking something as below (Just a POC, not tested.)
I'll try and digest it in more detail tomorrow, but having overlap is *not*
a good thing, which is why I've always asked to make sure this
doesn't happen :/
At the very least we should WARN if we find the dynamic constraint gets
us there.
* Re: [Patch v3 12/22] perf/x86/intel: Update dyn_constranit base on PEBS event precise level
2025-04-16 13:59 ` Liang, Kan
@ 2025-04-17 1:15 ` Mi, Dapeng
0 siblings, 0 replies; 47+ messages in thread
From: Mi, Dapeng @ 2025-04-17 1:15 UTC (permalink / raw)
To: Liang, Kan, Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Andi Kleen, Eranian Stephane,
linux-kernel, linux-perf-users, Dapeng Mi
On 4/16/2025 9:59 PM, Liang, Kan wrote:
>
> On 2025-04-15 9:46 p.m., Mi, Dapeng wrote:
>> On 4/16/2025 12:31 AM, Liang, Kan wrote:
>>> On 2025-04-15 9:53 a.m., Peter Zijlstra wrote:
>>>> On Tue, Apr 15, 2025 at 11:44:18AM +0000, Dapeng Mi wrote:
>>>>> arch-PEBS provides CPUIDs to enumerate which counters support PEBS
>>>>> sampling and precise distribution PEBS sampling. Thus PEBS constraints
>>>>> should be dynamically configured based on these counter and precise
>>>>> distribution bitmaps instead of being defined statically.
>>>>>
>>>>> Update the event dyn_constraint based on the PEBS event precise level.
>>>> What constraints, if any, are there on this?
>>> Do you mean the static constraints defined in the
>>> event_constraints/pebs_constraints?
>>>
>>>> CPUID is virt host
>>>> controlled, right, so these could be the most horrible masks ever.
>>>>
>>> Yes, it could be changed by a VMM. A sanity check is required if a
>>> bad mask is given.
>> Yes, we need a check to restrict the PEBS counter mask to the valid
>> counter mask. I just realized that we can't use hybrid(event->pmu,
>> intel_ctrl) directly to check the counter mask; it needs a minor tweak
>> since it includes the GLOBAL_CTRL_EN_PERF_METRICS bit.
>>
>> How about this?
>>
>>     if (x86_pmu.arch_pebs) {
>>             u64 cntr_mask = hybrid(event->pmu, intel_ctrl) &
>>                             ~GLOBAL_CTRL_EN_PERF_METRICS;
>>             u64 pebs_mask = event->attr.precise_ip >= 3 ?
>>                             pebs_cap.pdists : pebs_cap.counters;
>>
>>             if (pebs_mask != cntr_mask)
>>                     event->hw.dyn_constraint = pebs_mask & cntr_mask;
>>     }
>>
> The mask isn't changed after boot. The sanity check should only be done
> once. I think it can be done in the update_pmu_cap() when perf retrieves
> the value. If a bad mask is detected, the PEBS should be disabled.
Yeah, it makes sense. Would do.
>
> Thanks,
> Kan
>
>>>> This can land us in EVENT_CONSTRAINT_OVERLAP territory, no?
>>> The dyn_constraint is a supplement to the static constraints. It doesn't
>>> overwrite the static constraints.
>>>
>>> In the intel_get_event_constraints(), perf always gets the static
>>> constraints first. If the dyn_constraint is defined, it gets the common
>>> mask of the static constraints and the dynamic constraints. All
>>> constraint rules will be complied with.
>>>
>>> if (event->hw.dyn_constraint != ~0ULL) {
>>> c2 = dyn_constraint(cpuc, c2, idx);
>>> c2->idxmsk64 &= event->hw.dyn_constraint;
>>> c2->weight = hweight64(c2->idxmsk64);
>>> }
>>>
>>> Thanks,
>>> Kan
>>>
>
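As an aside, a standalone sketch of the one-shot sanity check being discussed; the struct and helper names below are made up for illustration and are not the names used in the patches.

#include <stdint.h>
#include <stdio.h>

/* Accept the CPUID-provided PEBS counter masks only if they are subsets
 * of the real counter mask; otherwise treat arch-PEBS as absent.
 */
struct pebs_caps {
        uint64_t counters;      /* counters that support PEBS */
        uint64_t pdists;        /* counters that support precise distribution */
};

static int arch_pebs_caps_valid(uint64_t cntr_mask, struct pebs_caps c)
{
        return !(c.counters & ~cntr_mask) && !(c.pdists & ~cntr_mask);
}

int main(void)
{
        uint64_t cntr_mask = 0x3ff;                     /* say, 10 GP counters */
        struct pebs_caps good = { 0x3ff, 0x3 };
        struct pebs_caps bad  = { 0xffff, 0x3 };        /* VMM advertised too much */

        printf("good: %d, bad: %d\n",
               arch_pebs_caps_valid(cntr_mask, good),
               arch_pebs_caps_valid(cntr_mask, bad));
        return 0;
}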
* Re: [Patch v3 16/22] perf/core: Support to capture higher width vector registers
2025-04-16 15:53 ` Peter Zijlstra
@ 2025-04-17 2:00 ` Mi, Dapeng
2025-04-22 3:05 ` Mi, Dapeng
1 sibling, 0 replies; 47+ messages in thread
From: Mi, Dapeng @ 2025-04-17 2:00 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On 4/16/2025 11:53 PM, Peter Zijlstra wrote:
> On Wed, Apr 16, 2025 at 02:42:12PM +0800, Mi, Dapeng wrote:
>
>> On second thought, using a bitmap to represent these extended registers indeed
>> wastes bits and is hard to extend; there could be many more vector
>> registers if we consider AMX.
> *Groan* so AMX should never have been register state :-(
>
>
>> Considering that different arch/HW may support different numbers of vector
>> registers, e.g. platform A supports 8 XMM registers and 8 YMM registers, but
>> platform B only supports 16 XMM registers, a better way to represent these
>> vector registers may be to add two fields: one is a bitmap which represents
>> which kinds of vector registers need to be captured, and the other is a u16
>> array which gives the corresponding register length for each kind of vector
>> register. It may look like this:
>>
>> #define PERF_SAMPLE_EXT_REGS_XMM BIT(0)
>> #define PERF_SAMPLE_EXT_REGS_YMM BIT(1)
>> #define PERF_SAMPLE_EXT_REGS_ZMM BIT(2)
>> __u32 sample_regs_intr_ext;
>> __u16 sample_regs_intr_ext_len[4];
>> __u32 sample_regs_user_ext;
>> __u16 sample_regs_user_ext_len[4];
>>
>>
>> Peter, what do you think of this? Thanks.
> I'm not entirely sure I understand.
>
> How about something like:
>
> __u16 sample_simd_reg_words;
> __u64 sample_simd_reg_intr;
> __u64 sample_simd_reg_user;
If we only consider x86 XMM/YMM/ZMM registers, it should be enough, since a
higher-width vector register always contains the lower-width vector
registers on x86 platforms, but I'm not sure we can make that assumption
for other archs. If not, then it's not enough, since a user may want to sample
multiple vector registers with different widths at the same time.
Furthermore, considering there could be other registers, like the APX
registers, that need to be supported in the future, we'd better define a more
generic and easily extended interface. That's why I suggest adding a bitmap
like the "sample_regs_intr_ext" above, which can represent multiple kinds of
registers simultaneously.
>
> Then the simd_reg_words tell us how many (quad) words per register (8 for
> 512) and simd_reg_{intr,user} are a simple bitmap, one bit per actual
> simd reg.
>
> So then all of XMM would be:
>
> words = 2;
> intr = user = 0xFFFF;
>
> (16 regs, 128 wide)
>
> Whereas ZMM would be:
>
> words = 8
> intr = user = 0xFFFFFFFF;
>
> (32 regs, 512 wide)
>
>
> Would this be sufficient? Possibly we can split the words thing into two
> __u8, but does it make sense to ask for different vector width for
> intr and user ?
Yes, we need it. Users may want to sample interrupt registers and
user-space registers simultaneously, although it sounds a little bit weird.
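For reference, a standalone sketch of how the proposal above sizes the per-sample SIMD payload; simd_sample_bytes() is a made-up helper for illustration, not a proposed kernel API.

#include <stdint.h>
#include <stdio.h>

/* One bit per SIMD register, plus a single "u64 words per register"
 * field, as in the proposal quoted above.
 */
static uint64_t simd_sample_bytes(uint16_t reg_words, uint64_t reg_bitmap)
{
        /* each selected register contributes reg_words 64-bit words */
        return (uint64_t)__builtin_popcountll(reg_bitmap) * reg_words * 8;
}

int main(void)
{
        /* all 16 XMM regs, 2 words each -> 256 bytes per sample */
        printf("XMM: %llu bytes\n",
               (unsigned long long)simd_sample_bytes(2, 0xffff));
        /* all 32 ZMM regs, 8 words each -> 2048 bytes per sample */
        printf("ZMM: %llu bytes\n",
               (unsigned long long)simd_sample_bytes(8, 0xffffffffULL));
        return 0;
}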
* Re: [Patch v3 16/22] perf/core: Support to capture higher width vector registers
2025-04-16 15:53 ` Peter Zijlstra
2025-04-17 2:00 ` Mi, Dapeng
@ 2025-04-22 3:05 ` Mi, Dapeng
1 sibling, 0 replies; 47+ messages in thread
From: Mi, Dapeng @ 2025-04-22 3:05 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Ian Rogers,
Adrian Hunter, Alexander Shishkin, Kan Liang, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On 4/16/2025 11:53 PM, Peter Zijlstra wrote:
> On Wed, Apr 16, 2025 at 02:42:12PM +0800, Mi, Dapeng wrote:
>
>> On second thought, using a bitmap to represent these extended registers indeed
>> wastes bits and is hard to extend; there could be many more vector
>> registers if we consider AMX.
> *Groan* so AMX should never have been register state :-(
>
>
>> Considering that different arch/HW may support different numbers of vector
>> registers, e.g. platform A supports 8 XMM registers and 8 YMM registers, but
>> platform B only supports 16 XMM registers, a better way to represent these
>> vector registers may be to add two fields: one is a bitmap which represents
>> which kinds of vector registers need to be captured, and the other is a u16
>> array which gives the corresponding register length for each kind of vector
>> register. It may look like this:
>>
>> #define PERF_SAMPLE_EXT_REGS_XMM BIT(0)
>> #define PERF_SAMPLE_EXT_REGS_YMM BIT(1)
>> #define PERF_SAMPLE_EXT_REGS_ZMM BIT(2)
>> __u32 sample_regs_intr_ext;
>> __u16 sample_regs_intr_ext_len[4];
>> __u32 sample_regs_user_ext;
>> __u16 sample_regs_user_ext_len[4];
>>
>>
>> Peter, what do you think of this? Thanks.
> I'm not entirely sure I understand.
>
> How about something like:
>
> __u16 sample_simd_reg_words;
> __u64 sample_simd_reg_intr;
> __u64 sample_simd_reg_user;
>
> Then the simd_reg_words tell us how many (quad) words per register (8 for
> 512) and simd_reg_{intr,user} are a simple bitmap, one bit per actual
> simd reg.
>
> So then all of XMM would be:
>
> words = 2;
> intr = user = 0xFFFF;
>
> (16 regs, 128 wide)
>
> Whereas ZMM would be:
>
> words = 8
> intr = user = 0xFFFFFFFF;
>
> (32 regs, 512 wide)
>
>
> Would this be sufficient? Possibly we can split the words thing into two
> __u8, but does it make sense to ask for different vector width for
> intr and user ?
Hi Peter,
I discussed this with Kan offline; it looks like a real requirement for users to
sample multiple different kinds of SIMD registers, e.g. a user may want to
sample OPMASK and ZMM registers simultaneously. So to meet the requirement
and make the interface more flexible, we enhanced the interface to this:
/* Bitmap to represent SIMD regs. */
__u64 sample_simd_reg_intr;
__u64 sample_simd_reg_user;
/*
 * Represent each kind of SIMD reg size (how many u64 words are needed)
 * in the above bitmap order, e.g., x86 YMM regs are 256 bits and occupy
 * 4 u64 words.
 */
__u8 sample_simd_reg_size[4];
sample_simd_reg_intr/sample_simd_reg_user represent the SIMD register bitmaps,
e.g. on x86 platforms bit[7:0] represents OPMASK[7:0], bit[23:8] represents
YMM[15:0], and bit[55:24] represents ZMM[31:0].
sample_simd_reg_size[] gives how many u64 words each kind of SIMD register
needs, in the above bitmap order, e.g., sample_simd_reg_size[0] = 1 means
each OPMASK register occupies 1 u64 word, sample_simd_reg_size[1] = 4 means
each YMM register occupies 4 u64 words, and sample_simd_reg_size[2] = 8 means
each ZMM register occupies 8 u64 words.
What do you think of this interface? Thanks.
Dapeng Mi
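For reference, a standalone sketch of how the bitmap layout and size array described above would translate into a per-sample payload size. The bit ranges and word counts are taken from the mail; the table and helper themselves are made up for illustration.

#include <stdint.h>
#include <stdio.h>

/* Bit ranges from the mail: bit[7:0] OPMASK[7:0], bit[23:8] YMM[15:0],
 * bit[55:24] ZMM[31:0]; sample_simd_reg_size[] gives u64 words per reg.
 */
static const struct { int lsb, nbits; } simd_kind[3] = {
        {  0,  8 },     /* OPMASK */
        {  8, 16 },     /* YMM    */
        { 24, 32 },     /* ZMM    */
};

static uint64_t simd_sample_bytes(uint64_t bitmap, const uint8_t size[4])
{
        uint64_t bytes = 0;
        int k;

        for (k = 0; k < 3; k++) {
                uint64_t mask = ((1ULL << simd_kind[k].nbits) - 1) << simd_kind[k].lsb;

                bytes += (uint64_t)__builtin_popcountll(bitmap & mask) * size[k] * 8;
        }
        return bytes;
}

int main(void)
{
        /* all 8 OPMASK regs (1 word each) plus all 32 ZMM regs (8 words each) */
        uint64_t bitmap = 0xffULL | (0xffffffffULL << 24);
        uint8_t size[4] = { 1, 4, 8, 0 };

        printf("%llu bytes of SIMD state per sample\n",
               (unsigned long long)simd_sample_bytes(bitmap, size));   /* 2112 */
        return 0;
}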
* Re: [Patch v3 12/22] perf/x86/intel: Update dyn_constranit base on PEBS event precise level
2025-04-16 19:56 ` Peter Zijlstra
@ 2025-04-22 22:50 ` Liang, Kan
0 siblings, 0 replies; 47+ messages in thread
From: Liang, Kan @ 2025-04-22 22:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Dapeng Mi, Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim,
Ian Rogers, Adrian Hunter, Alexander Shishkin, Andi Kleen,
Eranian Stephane, linux-kernel, linux-perf-users, Dapeng Mi
On 2025-04-16 3:56 p.m., Peter Zijlstra wrote:
> On Wed, Apr 16, 2025 at 03:45:24PM -0400, Liang, Kan wrote:
>
>> I see. I think we can check the constraint table and update the overlap
>> bit accordingly. Similar to what we did in the
>> intel_pmu_check_event_constraints() for the fixed counters.
>>
>> I'm thinking something as below (Just a POC, not tested.)
>
> I'll try and digest it in more detail tomorrow, but having overlap is *not*
> a good thing, which is why I've always asked to make sure this
> doesn't happen :/
>
I've checked all the existing event_constraints[] tables and features,
e.g., auto counter reload.
On the Lion Cove core, the MEM_TRANS_RETIRED.LOAD_LATENCY_GT event has a
constraint mask of 0x3fc. The counter mask for the auto counter reload
feature is 0xfc. On the Golden Cove, the
MEM_TRANS_RETIRED.LOAD_LATENCY_GT event has a constraint mask of 0xfe.
Other constraints (except the one with weight 1) are 0x3, 0xf, 0xff, and
0x3ff.
But I don't think it can trigger the issue mentioned in commit
bc1738f6ee83 ("perf, x86: Fix event scheduler for constraints with
overlapping counters"), because the scheduler always starts from counter 0.
The non-overlapping bits are always scheduled first.
For example, take 0xf and 0x3fc. The events with 0xf (lower weight) are
scheduled first and occupy the non-overlapping counters 0 and 1. There is
no scheduling problem for the events with 0x3fc after that.
I think we are good for the static constraints of the existing platforms.
> At the very least we should WARN if we find the dynamic constraint gets
> us there.
>
So the problem is only with the dynamic constraint.
It looks like checking only the weight and the subset relation is not enough
(it may trigger a false positive for the 0xf and 0x3fc case).
I think the last bit of the mask should be taken into account as well:
WEIGHT(A) <= WEIGHT(B) &&
A | B != B &&
LAST_BIT(A) > LAST_BIT(B)
Thanks,
Kan
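A standalone sketch of that condition, just to show it behaves as intended on the masks discussed above; needs_overlap() is made up for illustration.

#include <stdint.h>
#include <stdio.h>

/* Constraint A only needs the overlap flag w.r.t. B if it is no heavier
 * than B, is not a subset of B, and reaches past B's highest counter.
 */
static int needs_overlap(uint64_t a, uint64_t b)
{
        int last_a = 63 - __builtin_clzll(a);
        int last_b = 63 - __builtin_clzll(b);

        return __builtin_popcountll(a) <= __builtin_popcountll(b) &&
               (a | b) != b &&
               last_a > last_b;
}

int main(void)
{
        /* 0xf vs 0x3fc: no overlap needed (the false positive is avoided) */
        printf("0xf  vs 0x3fc: %d\n", needs_overlap(0xf, 0x3fc));
        /* a dynamic 0x3c vs a static 0xf: overlap is needed */
        printf("0x3c vs 0xf  : %d\n", needs_overlap(0x3c, 0xf));
        return 0;
}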
Thread overview: 47+ messages
2025-04-15 11:44 [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Dapeng Mi
2025-04-15 11:44 ` [Patch v3 01/22] perf/x86/intel: Add Panther Lake support Dapeng Mi
2025-04-15 11:44 ` [Patch v3 02/22] perf/x86/intel: Add PMU support for Clearwater Forest Dapeng Mi
2025-04-15 11:44 ` [Patch v3 03/22] perf/x86/intel: Parse CPUID archPerfmonExt leaves for non-hybrid CPUs Dapeng Mi
2025-04-15 11:44 ` [Patch v3 04/22] perf/x86/intel: Decouple BTS initialization from PEBS initialization Dapeng Mi
2025-04-15 11:44 ` [Patch v3 05/22] perf/x86/intel: Rename x86_pmu.pebs to x86_pmu.ds_pebs Dapeng Mi
2025-04-15 11:44 ` [Patch v3 06/22] perf/x86/intel: Introduce pairs of PEBS static calls Dapeng Mi
2025-04-15 11:44 ` [Patch v3 07/22] perf/x86/intel: Initialize architectural PEBS Dapeng Mi
2025-04-15 11:44 ` [Patch v3 08/22] perf/x86/intel/ds: Factor out PEBS record processing code to functions Dapeng Mi
2025-04-15 11:44 ` [Patch v3 09/22] perf/x86/intel/ds: Factor out PEBS group " Dapeng Mi
2025-04-15 11:44 ` [Patch v3 10/22] perf/x86/intel: Process arch-PEBS records or record fragments Dapeng Mi
2025-04-15 13:57 ` Peter Zijlstra
2025-04-15 16:09 ` Liang, Kan
2025-04-15 11:44 ` [Patch v3 11/22] perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR Dapeng Mi
2025-04-15 13:45 ` Peter Zijlstra
2025-04-16 0:59 ` Mi, Dapeng
2025-04-15 13:48 ` Peter Zijlstra
2025-04-16 1:03 ` Mi, Dapeng
2025-04-15 11:44 ` [Patch v3 12/22] perf/x86/intel: Update dyn_constranit base on PEBS event precise level Dapeng Mi
2025-04-15 13:53 ` Peter Zijlstra
2025-04-15 16:31 ` Liang, Kan
2025-04-16 1:46 ` Mi, Dapeng
2025-04-16 13:59 ` Liang, Kan
2025-04-17 1:15 ` Mi, Dapeng
2025-04-16 15:32 ` Peter Zijlstra
2025-04-16 19:45 ` Liang, Kan
2025-04-16 19:56 ` Peter Zijlstra
2025-04-22 22:50 ` Liang, Kan
2025-04-15 11:44 ` [Patch v3 13/22] perf/x86/intel: Setup PEBS data configuration and enable legacy groups Dapeng Mi
2025-04-15 11:44 ` [Patch v3 14/22] perf/x86/intel: Add counter group support for arch-PEBS Dapeng Mi
2025-04-15 11:44 ` [Patch v3 15/22] perf/x86/intel: Support SSP register capturing " Dapeng Mi
2025-04-15 14:07 ` Peter Zijlstra
2025-04-16 5:49 ` Mi, Dapeng
2025-04-15 11:44 ` [Patch v3 16/22] perf/core: Support to capture higher width vector registers Dapeng Mi
2025-04-15 14:36 ` Peter Zijlstra
2025-04-16 6:42 ` Mi, Dapeng
2025-04-16 15:53 ` Peter Zijlstra
2025-04-17 2:00 ` Mi, Dapeng
2025-04-22 3:05 ` Mi, Dapeng
2025-04-15 11:44 ` [Patch v3 17/22] perf/x86/intel: Support arch-PEBS vector registers group capturing Dapeng Mi
2025-04-15 11:44 ` [Patch v3 18/22] perf tools: Support to show SSP register Dapeng Mi
2025-04-15 11:44 ` [Patch v3 19/22] perf tools: Enhance arch__intr/user_reg_mask() helpers Dapeng Mi
2025-04-15 11:44 ` [Patch v3 20/22] perf tools: Enhance sample_regs_user/intr to capture more registers Dapeng Mi
2025-04-15 11:44 ` [Patch v3 21/22] perf tools: Support to capture more vector registers (x86/Intel) Dapeng Mi
2025-04-15 11:44 ` [Patch v3 22/22] perf tools/tests: Add vector registers PEBS sampling test Dapeng Mi
2025-04-15 15:21 ` [Patch v3 00/22] Arch-PEBS and PMU supports for Clearwater Forest and Panther Lake Liang, Kan
2025-04-16 7:42 ` Peter Zijlstra