public inbox for linux-kernel@vger.kernel.org
* [PATCH V3 0/5] Support auto counter reload
@ 2025-02-13 21:17 kan.liang
  2025-02-13 21:17 ` [PATCH V3 1/5] perf/x86: Add dynamic constraint kan.liang
                   ` (6 more replies)
  0 siblings, 7 replies; 9+ messages in thread
From: kan.liang @ 2025-02-13 21:17 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, adrian.hunter,
	alexander.shishkin, linux-kernel
  Cc: ak, eranian, dapeng1.mi, thomas.falcon, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Changes since V2:
- Rebase on top of several new features, e.g., the counters snapshotting
  feature. Rewrite the code for the ACR CPUID enumeration, configuration,
  and late setup.
- Patches 1-3 are newly added for cleanup.

Changes since V1:
- Add a check that the reload value cannot exceed the max period
- Avoid invoking intel_pmu_enable_acr() for the perf metrics event.
- Update the comments to explain the case in which event->attr.config2
  exceeds the group size

The relative rates among two or more events are useful for performance
analysis, e.g., a high branch miss rate may indicate a performance
issue. Usually, the samples whose relative rate exceeds some threshold
are the most useful ones. However, traditional sampling takes samples
of each event separately. To get the relative rates among two or more
events, a high sample rate is required, which brings high overhead.
Many samples taken in non-hotspot areas are also useless and dropped
in post-processing.

The auto counter reload (ACR) feature takes samples only when the
relative rate of two or more events exceeds some threshold, which
provides fine-grained information at low cost.
To support the feature, two sets of MSRs are introduced. For a given
counter IA32_PMC_GPn_CTR/IA32_PMC_FXm_CTR, bit fields in the
IA32_PMC_GPn_CFG_B/IA32_PMC_FXm_CFG_B MSR indicate which counter(s)
can cause a reload of that counter. The reload value is stored in the
IA32_PMC_GPn_CFG_C/IA32_PMC_FXm_CFG_C MSR.
The details can be found in the Intel SDM (085), Volume 3, 21.9.11
Auto Counter Reload.
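As a rough illustration of the CFG_B semantics described above (a
sketch, not kernel code; the helper name and array layout are made up
for this note):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch of the CFG_B reload rule: cfg_b[n] is the bit mask of
 * "cause" counters for counter n.  When the counters named in
 * overflow_bitmap overflow, every counter whose cfg_b[] intersects
 * that bitmap is reloaded (from its CFG_C value) instead of
 * continuing toward a sample.
 */
static uint64_t counters_to_reload(const uint64_t *cfg_b, int ncntr,
				   uint64_t overflow_bitmap)
{
	uint64_t reload = 0;

	for (int n = 0; n < ncntr; n++) {
		if (cfg_b[n] & overflow_bitmap)
			reload |= 1ULL << n;
	}
	return reload;
}
```

With the branch example in this cover letter (cfg_b = {0x2, 0x3}), an
overflow of counter 1 reloads both counters, while an overflow of
counter 0 reloads only counter 1.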

Example:

Here is a snippet of mispredict.c. Since the array holds random
numbers, the jumps are random and often mispredicted.
The misprediction rate depends on the value compared against.

For Loop 1, ~11% of all branches are mispredicted.
For Loop 2, ~21% of all branches are mispredicted.

main()
{
...
        for (i = 0; i < N; i++)
                data[i] = rand() % 256;
...
        /* Loop 1 */
        for (k = 0; k < 50; k++)
                for (i = 0; i < N; i++)
                        if (data[i] >= 64)
                                sum += data[i];
...

...
        /* Loop 2 */
        for (k = 0; k < 50; k++)
                for (i = 0; i < N; i++)
                        if (data[i] >= 128)
                                sum += data[i];
...
}

Code with a high branch miss rate usually performs poorly.
To understand the branch miss rate of the code, the traditional method
samples both the branches and branch-misses events. E.g.,
perf record -e "{cpu_atom/branch-misses/ppu, cpu_atom/branch-instructions/u}"
               -c 1000000 -- ./mispredict

[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.925 MB perf.data (5106 samples) ]
The 5106 samples are from both events and are spread across both loops.
In the post-processing stage, a user can learn that Loop 2 has a 21%
branch miss rate, and can then focus on the samples of the
branch-misses event for Loop 2.

With this patch, the user can generate samples only when the branch
miss rate is > 20%. For example,
perf record -e "{cpu_atom/branch-misses,period=200000,acr_mask=0x2/ppu,
                 cpu_atom/branch-instructions,period=1000000,acr_mask=0x3/u}"
                -- ./mispredict

(Two different periods are applied to branch-misses and
branch-instructions. The ratio is set to 20%.
If the branch-instructions counter overflows first, the branch miss
rate is < 20%. No sample should be generated, and all counters should
be automatically reloaded.
If the branch-misses counter overflows first, the branch miss rate
is > 20%. A sample triggered by the branch-misses event should be
generated, and just the counter of the branch-instructions event
should be automatically reloaded.

The branch-misses event should only be automatically reloaded when
branch-instructions overflows, so its "cause" event is the
branch-instructions event. Its acr_mask is set to 0x2, since the
group index of branch-instructions is 1.

The branch-instructions event is automatically reloaded no matter
which event overflows, so its "cause" events are both the
branch-misses and the branch-instructions events. Its acr_mask is
set to 0x3.)
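The acr_mask arithmetic in the parenthetical above can be sketched as
a tiny helper (illustrative only; acr_mask_for is not a perf
interface): bit i of acr_mask is set when group member i is a "cause"
event for the event in question.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Build an acr_mask from the group indices of the "cause" events.
 * In the example above, group index 0 is branch-misses and index 1
 * is branch-instructions.
 */
static uint64_t acr_mask_for(const int *cause_idx, int n)
{
	uint64_t mask = 0;

	for (int i = 0; i < n; i++)
		mask |= 1ULL << cause_idx[i];
	return mask;
}
```

branch-misses is reloaded only by branch-instructions (index 1),
giving 0x2; branch-instructions is reloaded by both events (indices 0
and 1), giving 0x3.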

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.098 MB perf.data (2498 samples) ]

 $ perf report

Percent       │154:   movl    $0x0,-0x14(%rbp)
              │     ↓ jmp     1af
              │     for (i = j; i < N; i++)
              │15d:   mov     -0x10(%rbp),%eax
              │       mov     %eax,-0x18(%rbp)
              │     ↓ jmp     1a2
              │     if (data[i] >= 128)
              │165:   mov     -0x18(%rbp),%eax
              │       cltq
              │       lea     0x0(,%rax,4),%rdx
              │       mov     -0x8(%rbp),%rax
              │       add     %rdx,%rax
              │       mov     (%rax),%eax
              │    ┌──cmp     $0x7f,%eax
100.00   0.00 │    ├──jle     19e
              │    │sum += data[i];

The 2498 samples are all from the branch-misses event for Loop 2.

The number of samples and the overhead are significantly reduced
without losing any information.

Kan Liang (5):
  perf/x86: Add dynamic constraint
  perf/x86/intel: Track the num of events needs late setup
  perf: Extend the bit width of the arch-specific flag
  perf/x86/intel: Add CPUID enumeration for the auto counter reload
  perf/x86/intel: Support auto counter reload

 arch/x86/events/core.c             |   3 +-
 arch/x86/events/intel/core.c       | 260 ++++++++++++++++++++++++++++-
 arch/x86/events/intel/ds.c         |   3 +-
 arch/x86/events/intel/lbr.c        |   2 +-
 arch/x86/events/perf_event.h       |  33 ++++
 arch/x86/events/perf_event_flags.h |  41 ++---
 arch/x86/include/asm/msr-index.h   |   4 +
 arch/x86/include/asm/perf_event.h  |   1 +
 include/linux/perf_event.h         |   4 +-
 9 files changed, 320 insertions(+), 31 deletions(-)

-- 
2.38.1


^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH V3 1/5] perf/x86: Add dynamic constraint
  2025-02-13 21:17 [PATCH V3 0/5] Support auto counter reload kan.liang
@ 2025-02-13 21:17 ` kan.liang
  2025-02-13 21:17 ` [PATCH V3 2/5] perf/x86/intel: Track the num of events needs late setup kan.liang
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: kan.liang @ 2025-02-13 21:17 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, adrian.hunter,
	alexander.shishkin, linux-kernel
  Cc: ak, eranian, dapeng1.mi, thomas.falcon, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

More and more features require a dynamic event constraint, e.g., branch
counter logging, auto counter reload, Arch PEBS, etc.

Add a generic flag, PMU_FL_DYN_CONSTRAINT, to indicate the case. This
avoids adding an individual flag check in intel_cpuc_prepare() for
each new feature.

Add a dyn_constraint variable to struct hw_perf_event to track the
dynamic constraint of the event. Apply it when it has been updated.

Apply the generic dynamic constraint to branch counter logging.
Many features on and after v6 require a dynamic constraint, so
unconditionally set the flag for v6 and later.
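The narrowing done in intel_get_event_constraints() can be sketched
like this (simplified; struct constraint here is a stand-in for the
kernel's struct event_constraint): dyn_constraint defaults to ~0ULL,
meaning "no extra restriction", and otherwise is ANDed into the
counter bit mask of the event's constraint.

```c
#include <assert.h>
#include <stdint.h>

struct constraint {
	uint64_t idxmsk64;	/* bit mask of usable counters */
	int weight;		/* number of usable counters */
};

/* Apply an event's dyn_constraint on top of its static constraint. */
static void apply_dyn_constraint(struct constraint *c, uint64_t dyn)
{
	if (dyn == ~0ULL)	/* default: no dynamic constraint */
		return;
	c->idxmsk64 &= dyn;
	c->weight = __builtin_popcountll(c->idxmsk64); /* hweight64() */
}
```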

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c       |  1 +
 arch/x86/events/intel/core.c | 21 +++++++++++++++------
 arch/x86/events/intel/lbr.c  |  2 +-
 arch/x86/events/perf_event.h |  1 +
 include/linux/perf_event.h   |  1 +
 5 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index 20ad5cca6ad2..b56fa6a9d7a4 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -675,6 +675,7 @@ static int __x86_pmu_event_init(struct perf_event *event)
 	event->hw.idx = -1;
 	event->hw.last_cpu = -1;
 	event->hw.last_tag = ~0ULL;
+	event->hw.dyn_constraint = ~0ULL;
 
 	/* mark unused */
 	event->hw.extra_reg.idx = EXTRA_REG_NONE;
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index f728d2cfdf1c..2df05b18ff04 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -3736,10 +3736,9 @@ intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
 	if (cpuc->excl_cntrs)
 		return intel_get_excl_constraints(cpuc, event, idx, c2);
 
-	/* Not all counters support the branch counter feature. */
-	if (branch_sample_counters(event)) {
+	if (event->hw.dyn_constraint != ~0ULL) {
 		c2 = dyn_constraint(cpuc, c2, idx);
-		c2->idxmsk64 &= x86_pmu.lbr_counters;
+		c2->idxmsk64 &= event->hw.dyn_constraint;
 		c2->weight = hweight64(c2->idxmsk64);
 	}
 
@@ -4141,15 +4140,19 @@ static int intel_pmu_hw_config(struct perf_event *event)
 		leader = event->group_leader;
 		if (branch_sample_call_stack(leader))
 			return -EINVAL;
-		if (branch_sample_counters(leader))
+		if (branch_sample_counters(leader)) {
 			num++;
+			leader->hw.dyn_constraint &= x86_pmu.lbr_counters;
+		}
 		leader->hw.flags |= PERF_X86_EVENT_BRANCH_COUNTERS;
 
 		for_each_sibling_event(sibling, leader) {
 			if (branch_sample_call_stack(sibling))
 				return -EINVAL;
-			if (branch_sample_counters(sibling))
+			if (branch_sample_counters(sibling)) {
 				num++;
+				sibling->hw.dyn_constraint &= x86_pmu.lbr_counters;
+			}
 		}
 
 		if (num > fls(x86_pmu.lbr_counters))
@@ -4949,7 +4952,7 @@ int intel_cpuc_prepare(struct cpu_hw_events *cpuc, int cpu)
 			goto err;
 	}
 
-	if (x86_pmu.flags & (PMU_FL_EXCL_CNTRS | PMU_FL_TFA | PMU_FL_BR_CNTR)) {
+	if (x86_pmu.flags & (PMU_FL_EXCL_CNTRS | PMU_FL_TFA | PMU_FL_DYN_CONSTRAINT)) {
 		size_t sz = X86_PMC_IDX_MAX * sizeof(struct event_constraint);
 
 		cpuc->constraint_list = kzalloc_node(sz, GFP_KERNEL, cpu_to_node(cpu));
@@ -6667,6 +6670,12 @@ __init int intel_pmu_init(void)
 			pr_cont(" AnyThread deprecated, ");
 	}
 
+	/*
+	 * Many features on and after V6 require dynamic constraint,
+	 * e.g., Arch PEBS, ACR.
+	 */
+	if (version >= 6)
+		x86_pmu.flags |= PMU_FL_DYN_CONSTRAINT;
 	/*
 	 * Install the hw-cache-events table:
 	 */
diff --git a/arch/x86/events/intel/lbr.c b/arch/x86/events/intel/lbr.c
index dc641b50814e..743dcc322085 100644
--- a/arch/x86/events/intel/lbr.c
+++ b/arch/x86/events/intel/lbr.c
@@ -1609,7 +1609,7 @@ void __init intel_pmu_arch_lbr_init(void)
 	x86_pmu.lbr_nr = lbr_nr;
 
 	if (!!x86_pmu.lbr_counters)
-		x86_pmu.flags |= PMU_FL_BR_CNTR;
+		x86_pmu.flags |= PMU_FL_BR_CNTR | PMU_FL_DYN_CONSTRAINT;
 
 	if (x86_pmu.lbr_mispred)
 		static_branch_enable(&x86_lbr_mispred);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index a698e6484b3b..f4693409e191 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -1066,6 +1066,7 @@ do {									\
 #define PMU_FL_MEM_LOADS_AUX	0x100 /* Require an auxiliary event for the complete memory info */
 #define PMU_FL_RETIRE_LATENCY	0x200 /* Support Retire Latency in PEBS */
 #define PMU_FL_BR_CNTR		0x400 /* Support branch counter logging */
+#define PMU_FL_DYN_CONSTRAINT	0x800 /* Needs dynamic constraint */
 
 #define EVENT_VAR(_id)  event_attr_##_id
 #define EVENT_PTR(_id) &event_attr_##_id.attr.attr
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 2d07bc1193f3..c381ea7135df 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -158,6 +158,7 @@ struct hw_perf_event {
 		struct { /* hardware */
 			u64		config;
 			u64		last_tag;
+			u64		dyn_constraint;
 			unsigned long	config_base;
 			unsigned long	event_base;
 			int		event_base_rdpmc;
-- 
2.38.1



* [PATCH V3 2/5] perf/x86/intel: Track the num of events needs late setup
  2025-02-13 21:17 [PATCH V3 0/5] Support auto counter reload kan.liang
  2025-02-13 21:17 ` [PATCH V3 1/5] perf/x86: Add dynamic constraint kan.liang
@ 2025-02-13 21:17 ` kan.liang
  2025-02-13 21:17 ` [PATCH V3 3/5] perf: Extend the bit width of the arch-specific flag kan.liang
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: kan.liang @ 2025-02-13 21:17 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, adrian.hunter,
	alexander.shishkin, linux-kernel
  Cc: ak, eranian, dapeng1.mi, thomas.falcon, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

When a machine supports PEBS v6, perf unconditionally searches
cpuc->event_list[] for every event and checks whether late setup is
required, which is unnecessary.

Late setup is only required for special events, e.g., events that
support the counter snapshotting feature. Add n_late_setup to track
the number of events that need late setup.

Other features, e.g., the auto counter reload feature, require late
setup as well. Add a wrapper, intel_pmu_pebs_late_setup(), for the
events that support the counter snapshotting feature.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/core.c | 14 ++++++++++++++
 arch/x86/events/intel/ds.c   |  3 +--
 arch/x86/events/perf_event.h |  5 +++++
 3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 2df05b18ff04..ce04553910ab 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2609,6 +2609,8 @@ static void intel_pmu_del_event(struct perf_event *event)
 		intel_pmu_lbr_del(event);
 	if (event->attr.precise_ip)
 		intel_pmu_pebs_del(event);
+	if (is_pebs_counter_event_group(event))
+		this_cpu_ptr(&cpu_hw_events)->n_late_setup--;
 }
 
 static int icl_set_topdown_event_period(struct perf_event *event)
@@ -2920,12 +2922,24 @@ static void intel_pmu_enable_event(struct perf_event *event)
 	}
 }
 
+void intel_pmu_late_setup(void)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+
+	if (!cpuc->n_late_setup)
+		return;
+
+	intel_pmu_pebs_late_setup(cpuc);
+}
+
 static void intel_pmu_add_event(struct perf_event *event)
 {
 	if (event->attr.precise_ip)
 		intel_pmu_pebs_add(event);
 	if (intel_pmu_needs_branch_stack(event))
 		intel_pmu_lbr_add(event);
+	if (is_pebs_counter_event_group(event))
+		this_cpu_ptr(&cpu_hw_events)->n_late_setup++;
 }
 
 /*
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index e8f808905871..df9499d6e4dc 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1355,9 +1355,8 @@ static void __intel_pmu_pebs_update_cfg(struct perf_event *event,
 }
 
 
-static void intel_pmu_late_setup(void)
+void intel_pmu_pebs_late_setup(struct cpu_hw_events *cpuc)
 {
-	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
 	struct perf_event *event;
 	u64 pebs_data_cfg = 0;
 	int i;
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index f4693409e191..5bf9c117e9ef 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -261,6 +261,7 @@ struct cpu_hw_events {
 	struct event_constraint	*event_constraint[X86_PMC_IDX_MAX];
 
 	int			n_excl; /* the number of exclusive events */
+	int			n_late_setup; /* the num of events needs late setup */
 
 	unsigned int		txn_flags;
 	int			is_fake;
@@ -1602,6 +1603,8 @@ void intel_pmu_disable_bts(void);
 
 int intel_pmu_drain_bts_buffer(void);
 
+void intel_pmu_late_setup(void);
+
 u64 grt_latency_data(struct perf_event *event, u64 status);
 
 u64 cmt_latency_data(struct perf_event *event, u64 status);
@@ -1658,6 +1661,8 @@ void intel_pmu_pebs_disable_all(void);
 
 void intel_pmu_pebs_sched_task(struct perf_event_pmu_context *pmu_ctx, bool sched_in);
 
+void intel_pmu_pebs_late_setup(struct cpu_hw_events *cpuc);
+
 void intel_pmu_drain_pebs_buffer(void);
 
 void intel_pmu_store_pebs_lbrs(struct lbr_entry *lbr);
-- 
2.38.1



* [PATCH V3 3/5] perf: Extend the bit width of the arch-specific flag
  2025-02-13 21:17 [PATCH V3 0/5] Support auto counter reload kan.liang
  2025-02-13 21:17 ` [PATCH V3 1/5] perf/x86: Add dynamic constraint kan.liang
  2025-02-13 21:17 ` [PATCH V3 2/5] perf/x86/intel: Track the num of events needs late setup kan.liang
@ 2025-02-13 21:17 ` kan.liang
  2025-02-13 21:17 ` [PATCH V3 4/5] perf/x86/intel: Add CPUID enumeration for the auto counter reload kan.liang
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: kan.liang @ 2025-02-13 21:17 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, adrian.hunter,
	alexander.shishkin, linux-kernel
  Cc: ak, eranian, dapeng1.mi, thomas.falcon, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The auto counter reload feature requires an event flag to indicate an
auto counter reload group, which can only be scheduled on the specific
counters enumerated in CPUID. However, hw_perf_event.flags has run out
of bits on x86.

Two solutions were considered to address the issue.
- Currently, 20 bits are reserved for the architecture-specific flags.
  Only bit 31 is used for a generic flag, so there is still plenty of
  space left. Reserve 8 more bits for the arch-specific flags.
- Add a new x86-specific hw_perf_event.flags1 to support more flags.

The former is implemented here, since there is still enough room left
among the generic flags.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/perf_event_flags.h | 41 +++++++++++++++---------------
 include/linux/perf_event.h         |  2 +-
 2 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/arch/x86/events/perf_event_flags.h b/arch/x86/events/perf_event_flags.h
index 1d9e385649b5..70078334e4a3 100644
--- a/arch/x86/events/perf_event_flags.h
+++ b/arch/x86/events/perf_event_flags.h
@@ -2,23 +2,24 @@
 /*
  * struct hw_perf_event.flags flags
  */
-PERF_ARCH(PEBS_LDLAT,		0x00001) /* ld+ldlat data address sampling */
-PERF_ARCH(PEBS_ST,		0x00002) /* st data address sampling */
-PERF_ARCH(PEBS_ST_HSW,		0x00004) /* haswell style datala, store */
-PERF_ARCH(PEBS_LD_HSW,		0x00008) /* haswell style datala, load */
-PERF_ARCH(PEBS_NA_HSW,		0x00010) /* haswell style datala, unknown */
-PERF_ARCH(EXCL,			0x00020) /* HT exclusivity on counter */
-PERF_ARCH(DYNAMIC,		0x00040) /* dynamic alloc'd constraint */
-PERF_ARCH(PEBS_CNTR,		0x00080) /* PEBS counters snapshot */
-PERF_ARCH(EXCL_ACCT,		0x00100) /* accounted EXCL event */
-PERF_ARCH(AUTO_RELOAD,		0x00200) /* use PEBS auto-reload */
-PERF_ARCH(LARGE_PEBS,		0x00400) /* use large PEBS */
-PERF_ARCH(PEBS_VIA_PT,		0x00800) /* use PT buffer for PEBS */
-PERF_ARCH(PAIR,			0x01000) /* Large Increment per Cycle */
-PERF_ARCH(LBR_SELECT,		0x02000) /* Save/Restore MSR_LBR_SELECT */
-PERF_ARCH(TOPDOWN,		0x04000) /* Count Topdown slots/metrics events */
-PERF_ARCH(PEBS_STLAT,		0x08000) /* st+stlat data address sampling */
-PERF_ARCH(AMD_BRS,		0x10000) /* AMD Branch Sampling */
-PERF_ARCH(PEBS_LAT_HYBRID,	0x20000) /* ld and st lat for hybrid */
-PERF_ARCH(NEEDS_BRANCH_STACK,	0x40000) /* require branch stack setup */
-PERF_ARCH(BRANCH_COUNTERS,	0x80000) /* logs the counters in the extra space of each branch */
+PERF_ARCH(PEBS_LDLAT,		0x0000001) /* ld+ldlat data address sampling */
+PERF_ARCH(PEBS_ST,		0x0000002) /* st data address sampling */
+PERF_ARCH(PEBS_ST_HSW,		0x0000004) /* haswell style datala, store */
+PERF_ARCH(PEBS_LD_HSW,		0x0000008) /* haswell style datala, load */
+PERF_ARCH(PEBS_NA_HSW,		0x0000010) /* haswell style datala, unknown */
+PERF_ARCH(EXCL,			0x0000020) /* HT exclusivity on counter */
+PERF_ARCH(DYNAMIC,		0x0000040) /* dynamic alloc'd constraint */
+PERF_ARCH(PEBS_CNTR,		0x0000080) /* PEBS counters snapshot */
+PERF_ARCH(EXCL_ACCT,		0x0000100) /* accounted EXCL event */
+PERF_ARCH(AUTO_RELOAD,		0x0000200) /* use PEBS auto-reload */
+PERF_ARCH(LARGE_PEBS,		0x0000400) /* use large PEBS */
+PERF_ARCH(PEBS_VIA_PT,		0x0000800) /* use PT buffer for PEBS */
+PERF_ARCH(PAIR,			0x0001000) /* Large Increment per Cycle */
+PERF_ARCH(LBR_SELECT,		0x0002000) /* Save/Restore MSR_LBR_SELECT */
+PERF_ARCH(TOPDOWN,		0x0004000) /* Count Topdown slots/metrics events */
+PERF_ARCH(PEBS_STLAT,		0x0008000) /* st+stlat data address sampling */
+PERF_ARCH(AMD_BRS,		0x0010000) /* AMD Branch Sampling */
+PERF_ARCH(PEBS_LAT_HYBRID,	0x0020000) /* ld and st lat for hybrid */
+PERF_ARCH(NEEDS_BRANCH_STACK,	0x0040000) /* require branch stack setup */
+PERF_ARCH(BRANCH_COUNTERS,	0x0080000) /* logs the counters in the extra space of each branch */
+PERF_ARCH(ACR,			0x0100000) /* Auto counter reload */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index c381ea7135df..238879f6c3e3 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -144,7 +144,7 @@ struct hw_perf_event_extra {
  * PERF_EVENT_FLAG_ARCH bits are reserved for architecture-specific
  * usage.
  */
-#define PERF_EVENT_FLAG_ARCH			0x000fffff
+#define PERF_EVENT_FLAG_ARCH			0x0fffffff
 #define PERF_EVENT_FLAG_USER_READ_CNT		0x80000000
 
 static_assert((PERF_EVENT_FLAG_USER_READ_CNT & PERF_EVENT_FLAG_ARCH) == 0);
-- 
2.38.1



* [PATCH V3 4/5] perf/x86/intel: Add CPUID enumeration for the auto counter reload
  2025-02-13 21:17 [PATCH V3 0/5] Support auto counter reload kan.liang
                   ` (2 preceding siblings ...)
  2025-02-13 21:17 ` [PATCH V3 3/5] perf: Extend the bit width of the arch-specific flag kan.liang
@ 2025-02-13 21:17 ` kan.liang
  2025-02-13 21:17 ` [PATCH V3 5/5] perf/x86/intel: Support " kan.liang
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: kan.liang @ 2025-02-13 21:17 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, adrian.hunter,
	alexander.shishkin, linux-kernel
  Cc: ak, eranian, dapeng1.mi, thomas.falcon, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The counters that support the auto counter reload feature are
enumerated in CPUID leaf 0x23, sub-leaf 0x2.

Add acr_cntr_mask to store the mask of counters which are reloadable.
Add acr_cause_mask to store the mask of counters which can cause a
reload. Since the e-cores and p-cores may have different numbers of
counters, track the masks in struct x86_hybrid_pmu as well.
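For illustration, a user-space probe of the same sub-leaf might look
like this (a sketch assuming GCC's <cpuid.h>; the kernel uses its own
cpuid_count() and additionally gates on the acr_subleaf valid bit of
sub-leaf 0, and INTEL_PMC_IDX_FIXED mirrors the kernel's fixed-counter
offset of 32):

```c
#include <assert.h>
#include <cpuid.h>
#include <stdint.h>

#define INTEL_PMC_IDX_FIXED	32	/* fixed counters start at bit 32 */

/*
 * Read CPUID leaf 0x23, sub-leaf 0x2: EAX/EBX enumerate the GP and
 * fixed counters that can be reloaded, ECX/EDX those that can cause
 * a reload.  Returns -1 if the leaf is not supported.
 */
static int read_acr_masks(uint64_t *cntr_mask, uint64_t *cause_mask)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid_count(0x23, 0x2, &eax, &ebx, &ecx, &edx))
		return -1;
	*cntr_mask  = eax | ((uint64_t)ebx << INTEL_PMC_IDX_FIXED);
	*cause_mask = ecx | ((uint64_t)edx << INTEL_PMC_IDX_FIXED);
	return 0;
}
```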

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/core.c      | 10 ++++++++++
 arch/x86/events/perf_event.h      | 17 +++++++++++++++++
 arch/x86/include/asm/perf_event.h |  1 +
 3 files changed, 28 insertions(+)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index ce04553910ab..8e3ad9efd798 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -5075,6 +5075,16 @@ static void update_pmu_cap(struct x86_hybrid_pmu *pmu)
 		pmu->fixed_cntr_mask64 = fixed_cntr;
 	}
 
+	if (eax.split.acr_subleaf) {
+		cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_ACR_LEAF,
+			    &cntr, &fixed_cntr, &ecx, &edx);
+		/* The mask of the counters which can be reloaded */
+		pmu->acr_cntr_mask64 = cntr | ((u64)fixed_cntr << INTEL_PMC_IDX_FIXED);
+
+		/* The mask of the counters which can cause a reload of reloadable counters */
+		pmu->acr_cause_mask64 = ecx | ((u64)edx << INTEL_PMC_IDX_FIXED);
+	}
+
 	if (!intel_pmu_broken_perf_cap()) {
 		/* Perf Metric (Bit 15) and PEBS via PT (Bit 16) are hybrid enumeration */
 		rdmsrl(MSR_IA32_PERF_CAPABILITIES, pmu->intel_cap.capabilities);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 5bf9c117e9ef..2184ae0c9a4a 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -725,6 +725,15 @@ struct x86_hybrid_pmu {
 			u64		fixed_cntr_mask64;
 			unsigned long	fixed_cntr_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
 	};
+
+	union {
+			u64		acr_cntr_mask64;
+			unsigned long	acr_cntr_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+	};
+	union {
+			u64		acr_cause_mask64;
+			unsigned long	acr_cause_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+	};
 	struct event_constraint		unconstrained;
 
 	u64				hw_cache_event_ids
@@ -823,6 +832,14 @@ struct x86_pmu {
 			u64		fixed_cntr_mask64;
 			unsigned long	fixed_cntr_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
 	};
+	union {
+			u64		acr_cntr_mask64;
+			unsigned long	acr_cntr_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+	};
+	union {
+			u64		acr_cause_mask64;
+			unsigned long	acr_cause_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+	};
 	int		cntval_bits;
 	u64		cntval_mask;
 	union {
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index eaf0d5245999..5b7a84254ee5 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -195,6 +195,7 @@ union cpuid10_edx {
  */
 #define ARCH_PERFMON_EXT_LEAF			0x00000023
 #define ARCH_PERFMON_NUM_COUNTER_LEAF		0x1
+#define ARCH_PERFMON_ACR_LEAF			0x2
 
 union cpuid35_eax {
 	struct {
-- 
2.38.1



* [PATCH V3 5/5] perf/x86/intel: Support auto counter reload
  2025-02-13 21:17 [PATCH V3 0/5] Support auto counter reload kan.liang
                   ` (3 preceding siblings ...)
  2025-02-13 21:17 ` [PATCH V3 4/5] perf/x86/intel: Add CPUID enumeration for the auto counter reload kan.liang
@ 2025-02-13 21:17 ` kan.liang
  2025-02-26 21:43 ` [PATCH V3 0/5] " Falcon, Thomas
  2025-03-04  0:58 ` Stephane Eranian
  6 siblings, 0 replies; 9+ messages in thread
From: kan.liang @ 2025-02-13 21:17 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, adrian.hunter,
	alexander.shishkin, linux-kernel
  Cc: ak, eranian, dapeng1.mi, thomas.falcon, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The relative rates among two or more events are useful for performance
analysis, e.g., a high branch miss rate may indicate a performance
issue. Usually, the samples whose relative rate exceeds some threshold
are the most useful ones. However, traditional sampling takes samples
of each event separately. To get the relative rates among two or more
events, a high sample rate is required, which brings high overhead.
Many samples taken in non-hotspot areas are also useless and dropped
in post-processing.

The auto counter reload (ACR) feature takes samples only when the
relative rate of two or more events exceeds some threshold, which
provides fine-grained information at low cost.
To support the feature, two sets of MSRs are introduced. For a given
counter IA32_PMC_GPn_CTR/IA32_PMC_FXm_CTR, bit fields in the
IA32_PMC_GPn_CFG_B/IA32_PMC_FXm_CFG_B MSR indicate which counter(s)
can cause a reload of that counter. The reload value is stored in the
IA32_PMC_GPn_CFG_C/IA32_PMC_FXm_CFG_C MSR.
The details can be found in the Intel SDM (085), Volume 3, 21.9.11
Auto Counter Reload.

In hw_config(), an ACR event is specially configured, because the
cause/reloadable counter mask has to be applied to the dyn_constraint.
Besides the HW limits, e.g., no support for perf metrics or PDist, a
SW limit is applied as well: ACR events in a group must be contiguous.
This facilitates the later conversion from the event idx to the
counter idx. Otherwise, intel_pmu_acr_late_setup() would have to
traverse the whole event list again to find the "cause" event.
Also, add a new flag, PERF_X86_EVENT_ACR, to indicate an ACR group;
it is set on the group leader.

Late setup is also required for an ACR group, to convert the event
idx to the counter idx and save it in hw.config1.

The ACR configuration MSRs are only updated in enable_event();
disable_event() doesn't clear the ACR CFG registers.
Add acr_cfg_b/acr_cfg_c to struct cpu_hw_events to cache the MSR
values, which avoids an MSR write when the value is unchanged.

Expose an acr_mask format attribute in sysfs. The perf tool can use
the new format to configure the relationship of the events in the
group. The bits of acr_mask follow the order in which the group's
events are enabled.

Example:

Here is a snippet of mispredict.c. Since the array holds random
numbers, the jumps are random and often mispredicted.
The misprediction rate depends on the value compared against.

For Loop 1, ~11% of all branches are mispredicted.
For Loop 2, ~21% of all branches are mispredicted.

main()
{
...
        for (i = 0; i < N; i++)
                data[i] = rand() % 256;
...
        /* Loop 1 */
        for (k = 0; k < 50; k++)
                for (i = 0; i < N; i++)
                        if (data[i] >= 64)
                                sum += data[i];
...

...
        /* Loop 2 */
        for (k = 0; k < 50; k++)
                for (i = 0; i < N; i++)
                        if (data[i] >= 128)
                                sum += data[i];
...
}

Code with a high branch miss rate usually performs poorly.
To understand the branch miss rate of the code, the traditional method
samples both the branches and branch-misses events. E.g.,
perf record -e "{cpu_atom/branch-misses/ppu, cpu_atom/branch-instructions/u}"
               -c 1000000 -- ./mispredict

[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.925 MB perf.data (5106 samples) ]
The 5106 samples are from both events and are spread across both loops.
In the post-processing stage, a user can learn that Loop 2 has a 21%
branch miss rate, and can then focus on the samples of the
branch-misses event for Loop 2.

With this patch, the user can generate samples only when the branch
miss rate is > 20%. For example,
perf record -e "{cpu_atom/branch-misses,period=200000,acr_mask=0x2/ppu,
                 cpu_atom/branch-instructions,period=1000000,acr_mask=0x3/u}"
                -- ./mispredict

(Two different periods are applied to branch-misses and
branch-instructions. The ratio is set to 20%.
If the branch-instructions counter overflows first, the branch miss
rate is < 20%. No sample should be generated, and all counters should
be automatically reloaded.
If the branch-misses counter overflows first, the branch miss rate
is > 20%. A sample triggered by the branch-misses event should be
generated, and just the counter of the branch-instructions event
should be automatically reloaded.

The branch-misses event should only be automatically reloaded when
branch-instructions overflows, so its "cause" event is the
branch-instructions event. Its acr_mask is set to 0x2, since the
group index of branch-instructions is 1.

The branch-instructions event is automatically reloaded no matter
which event overflows, so its "cause" events are both the
branch-misses and the branch-instructions events. Its acr_mask is
set to 0x3.)

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.098 MB perf.data (2498 samples) ]

 $ perf report

Percent       │154:   movl    $0x0,-0x14(%rbp)
              │     ↓ jmp     1af
              │     for (i = j; i < N; i++)
              │15d:   mov     -0x10(%rbp),%eax
              │       mov     %eax,-0x18(%rbp)
              │     ↓ jmp     1a2
              │     if (data[i] >= 128)
              │165:   mov     -0x18(%rbp),%eax
              │       cltq
              │       lea     0x0(,%rax,4),%rdx
              │       mov     -0x8(%rbp),%rax
              │       add     %rdx,%rax
              │       mov     (%rax),%eax
              │    ┌──cmp     $0x7f,%eax
100.00   0.00 │    ├──jle     19e
              │    │sum += data[i];

The 2498 samples are all from the branch-misses event for Loop 2.

The number of samples and the overhead are significantly reduced
without losing any information.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/core.c           |   2 +-
 arch/x86/events/intel/core.c     | 219 ++++++++++++++++++++++++++++++-
 arch/x86/events/perf_event.h     |  10 ++
 arch/x86/include/asm/msr-index.h |   4 +
 include/linux/perf_event.h       |   1 +
 5 files changed, 233 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/core.c b/arch/x86/events/core.c
index b56fa6a9d7a4..c2525567a03f 100644
--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -756,7 +756,7 @@ void x86_pmu_enable_all(int added)
 	}
 }
 
-static inline int is_x86_event(struct perf_event *event)
+int is_x86_event(struct perf_event *event)
 {
 	int i;
 
diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 8e3ad9efd798..89298cde7056 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2609,7 +2609,8 @@ static void intel_pmu_del_event(struct perf_event *event)
 		intel_pmu_lbr_del(event);
 	if (event->attr.precise_ip)
 		intel_pmu_pebs_del(event);
-	if (is_pebs_counter_event_group(event))
+	if (is_pebs_counter_event_group(event) ||
+	    is_acr_event_group(event))
 		this_cpu_ptr(&cpu_hw_events)->n_late_setup--;
 }
 
@@ -2888,6 +2889,54 @@ static void intel_pmu_enable_fixed(struct perf_event *event)
 	cpuc->fixed_ctrl_val |= bits;
 }
 
+static void intel_pmu_config_acr(int idx, u64 mask, u32 reload)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	int msr_b, msr_c;
+
+	if (!mask && !cpuc->acr_cfg_b[idx])
+		return;
+
+	if (idx < INTEL_PMC_IDX_FIXED) {
+		msr_b = MSR_IA32_PMC_V6_GP0_CFG_B;
+		msr_c = MSR_IA32_PMC_V6_GP0_CFG_C;
+	} else {
+		msr_b = MSR_IA32_PMC_V6_FX0_CFG_B;
+		msr_c = MSR_IA32_PMC_V6_FX0_CFG_C;
+		idx -= INTEL_PMC_IDX_FIXED;
+	}
+
+	if (cpuc->acr_cfg_b[idx] != mask) {
+		wrmsrl(msr_b + x86_pmu.addr_offset(idx, false), mask);
+		cpuc->acr_cfg_b[idx] = mask;
+	}
+	/* Only need to update the reload value when there is a valid config value. */
+	if (mask && cpuc->acr_cfg_c[idx] != reload) {
+		wrmsrl(msr_c + x86_pmu.addr_offset(idx, false), reload);
+		cpuc->acr_cfg_c[idx] = reload;
+	}
+}
+
+static void intel_pmu_enable_acr(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	/* The PMU doesn't support ACR */
+	if (!hybrid(event->pmu, acr_cntr_mask64))
+		return;
+
+	if (!is_acr_event_group(event) || !event->attr.config2) {
+		/*
+		 * The disable doesn't clear the ACR CFG register.
+		 * Check and clear the ACR CFG register.
+		 */
+		intel_pmu_config_acr(hwc->idx, 0, 0);
+		return;
+	}
+
+	intel_pmu_config_acr(hwc->idx, hwc->config1, -hwc->sample_period);
+}
+
 static void intel_pmu_enable_event(struct perf_event *event)
 {
 	u64 enable_mask = ARCH_PERFMON_EVENTSEL_ENABLE;
@@ -2903,8 +2952,11 @@ static void intel_pmu_enable_event(struct perf_event *event)
 			enable_mask |= ARCH_PERFMON_EVENTSEL_BR_CNTR;
 		intel_set_masks(event, idx);
 		__x86_pmu_enable_event(hwc, enable_mask);
+		intel_pmu_enable_acr(event);
 		break;
 	case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1:
+		intel_pmu_enable_acr(event);
+		fallthrough;
 	case INTEL_PMC_IDX_METRIC_BASE ... INTEL_PMC_IDX_METRIC_END:
 		intel_pmu_enable_fixed(event);
 		break;
@@ -2922,6 +2974,31 @@ static void intel_pmu_enable_event(struct perf_event *event)
 	}
 }
 
+static void intel_pmu_acr_late_setup(struct cpu_hw_events *cpuc)
+{
+	struct perf_event *event, *leader;
+	int i, j, idx;
+
+	for (i = 0; i < cpuc->n_events; i++) {
+		leader = cpuc->event_list[i];
+		if (!is_acr_event_group(leader))
+			continue;
+
+		/* The ACR events must be contiguous. */
+		for (j = i; j < cpuc->n_events; j++) {
+			event = cpuc->event_list[j];
+			if (event->group_leader != leader->group_leader)
+				break;
+			for_each_set_bit(idx, (unsigned long *)&event->attr.config2, X86_PMC_IDX_MAX) {
+				if (WARN_ON_ONCE(i + idx > cpuc->n_events))
+					return;
+				set_bit(cpuc->assign[i + idx], (unsigned long *)&event->hw.config1);
+			}
+		}
+		i = j - 1;
+	}
+}
+
 void intel_pmu_late_setup(void)
 {
 	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
@@ -2930,6 +3007,7 @@ void intel_pmu_late_setup(void)
 		return;
 
 	intel_pmu_pebs_late_setup(cpuc);
+	intel_pmu_acr_late_setup(cpuc);
 }
 
 static void intel_pmu_add_event(struct perf_event *event)
@@ -2938,7 +3016,8 @@ static void intel_pmu_add_event(struct perf_event *event)
 		intel_pmu_pebs_add(event);
 	if (intel_pmu_needs_branch_stack(event))
 		intel_pmu_lbr_add(event);
-	if (is_pebs_counter_event_group(event))
+	if (is_pebs_counter_event_group(event) ||
+	    is_acr_event_group(event))
 		this_cpu_ptr(&cpu_hw_events)->n_late_setup++;
 }
 
@@ -4093,6 +4172,22 @@ static u64 intel_pmu_freq_start_period(struct perf_event *event)
 	return start;
 }
 
+static bool intel_pmu_is_acr_group(struct perf_event *event)
+{
+	if (!hybrid(event->pmu, acr_cntr_mask64))
+		return false;
+
+	/* The group leader has the ACR flag set */
+	if (is_acr_event_group(event))
+		return true;
+
+	/* The acr_mask is set */
+	if (event->attr.config2)
+		return true;
+
+	return false;
+}
+
 static int intel_pmu_hw_config(struct perf_event *event)
 {
 	int ret = x86_pmu_hw_config(event);
@@ -4221,6 +4316,103 @@ static int intel_pmu_hw_config(struct perf_event *event)
 	    event->attr.precise_ip)
 		event->group_leader->hw.flags |= PERF_X86_EVENT_PEBS_CNTR;
 
+	if (intel_pmu_is_acr_group(event)) {
+		struct perf_event *sibling, *leader = event->group_leader;
+		struct pmu *pmu = event->pmu;
+		u64 constraint = hybrid(pmu, acr_cntr_mask64);
+		bool has_sw_event = false;
+		int num = 0, idx = 0;
+		u64 cause_mask = 0;
+
+		/* Perf metrics events are not supported */
+		if (is_metric_event(event))
+			return -EINVAL;
+
+		/* Freq mode is not supported */
+		if (event->attr.freq)
+			return -EINVAL;
+
+		/* PDist is not supported */
+		if (event->attr.config2 && event->attr.precise_ip > 2)
+			return -EINVAL;
+
+		/* The reload value cannot exceed the max period */
+		if (event->attr.sample_period > x86_pmu.max_period)
+			return -EINVAL;
+		/*
+		 * The counter-constraints of each event cannot be finalized
+		 * unless the whole group is scanned. However, it's hard
+		 * to know whether the event is the last one of the group.
+		 * Recalculate the counter-constraints for each event when
+		 * adding a new event.
+		 *
+		 * The group is traversed twice, which may be optimized later.
+		 * In the first round,
+		 * - Find all events which do reload when other events
+		 *   overflow and set the corresponding counter-constraints
+		 * - Add all events, which can cause other events reload,
+		 *   in the cause_mask
+		 * - Error out if the number of events exceeds the HW limit
+		 * - The ACR events must be contiguous.
+		 *   Error out if there are non-X86 events between ACR events.
+		 *   This is not a HW limit, but a SW limit.
+		 *   With the assumption, the intel_pmu_acr_late_setup() can
+		 *   easily convert the event idx to counter idx without
+		 *   traversing the whole event list.
+		 */
+		if (!is_x86_event(leader))
+			return -EINVAL;
+
+		if (leader->attr.config2) {
+			leader->hw.dyn_constraint &= constraint;
+			cause_mask |= leader->attr.config2;
+			num++;
+		}
+
+		for_each_sibling_event(sibling, leader) {
+			if (!is_x86_event(sibling)) {
+				has_sw_event = true;
+				continue;
+			}
+			if (!sibling->attr.config2)
+				continue;
+			if (has_sw_event)
+				return -EINVAL;
+			sibling->hw.dyn_constraint &= constraint;
+			cause_mask |= sibling->attr.config2;
+			num++;
+		}
+
+		if (leader != event && event->attr.config2) {
+			if (has_sw_event)
+				return -EINVAL;
+			event->hw.dyn_constraint &= constraint;
+			cause_mask |= event->attr.config2;
+			num++;
+		}
+
+		if (hweight64(cause_mask) > hweight64(hybrid(pmu, acr_cause_mask64)) ||
+		    num > hweight64(constraint))
+			return -EINVAL;
+		/*
+		 * In the second round, apply the counter-constraints for
+		 * the events which can cause other events reload.
+		 */
+		constraint = hybrid(pmu, acr_cause_mask64);
+		if (test_bit(idx++, (unsigned long *)&cause_mask))
+			leader->hw.dyn_constraint &= constraint;
+
+		for_each_sibling_event(sibling, leader) {
+			if (test_bit(idx++, (unsigned long *)&cause_mask))
+				sibling->hw.dyn_constraint &= constraint;
+		}
+
+		if ((leader != event) && test_bit(idx, (unsigned long *)&cause_mask))
+			event->hw.dyn_constraint &= constraint;
+
+		leader->hw.flags |= PERF_X86_EVENT_ACR;
+	}
+
 	if ((event->attr.type == PERF_TYPE_HARDWARE) ||
 	    (event->attr.type == PERF_TYPE_HW_CACHE))
 		return 0;
@@ -6070,6 +6262,21 @@ td_is_visible(struct kobject *kobj, struct attribute *attr, int i)
 	return attr->mode;
 }
 
+PMU_FORMAT_ATTR(acr_mask,	"config2:0-63");
+
+static struct attribute *format_acr_attrs[] = {
+	&format_attr_acr_mask.attr,
+	NULL
+};
+
+static umode_t
+acr_is_visible(struct kobject *kobj, struct attribute *attr, int i)
+{
+	struct device *dev = kobj_to_dev(kobj);
+
+	return hybrid(dev_get_drvdata(dev), acr_cntr_mask64) ? attr->mode : 0;
+}
+
 static struct attribute_group group_events_td  = {
 	.name = "events",
 	.is_visible = td_is_visible,
@@ -6112,6 +6319,12 @@ static struct attribute_group group_format_evtsel_ext = {
 	.is_visible = evtsel_ext_is_visible,
 };
 
+static struct attribute_group group_format_acr = {
+	.name       = "format",
+	.attrs      = format_acr_attrs,
+	.is_visible = acr_is_visible,
+};
+
 static struct attribute_group group_default = {
 	.attrs      = intel_pmu_attrs,
 	.is_visible = default_is_visible,
@@ -6126,6 +6339,7 @@ static const struct attribute_group *attr_update[] = {
 	&group_format_extra,
 	&group_format_extra_skl,
 	&group_format_evtsel_ext,
+	&group_format_acr,
 	&group_default,
 	NULL,
 };
@@ -6410,6 +6624,7 @@ static const struct attribute_group *hybrid_attr_update[] = {
 	&group_caps_lbr,
 	&hybrid_group_format_extra,
 	&group_format_evtsel_ext,
+	&group_format_acr,
 	&group_default,
 	&hybrid_group_cpus,
 	NULL,
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 2184ae0c9a4a..e43e5fe01905 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -120,6 +120,11 @@ static inline bool is_pebs_counter_event_group(struct perf_event *event)
 	return event->group_leader->hw.flags & PERF_X86_EVENT_PEBS_CNTR;
 }
 
+static inline bool is_acr_event_group(struct perf_event *event)
+{
+	return event->group_leader->hw.flags & PERF_X86_EVENT_ACR;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
@@ -287,6 +292,10 @@ struct cpu_hw_events {
 	u64			fixed_ctrl_val;
 	u64			active_fixed_ctrl_val;
 
+	/* Intel ACR configuration */
+	u64			acr_cfg_b[X86_PMC_IDX_MAX];
+	u64			acr_cfg_c[X86_PMC_IDX_MAX];
+
 	/*
 	 * Intel LBR bits
 	 */
@@ -1127,6 +1136,7 @@ static struct perf_pmu_format_hybrid_attr format_attr_hybrid_##_name = {\
 	.pmu_type	= _pmu,						\
 }
 
+int is_x86_event(struct perf_event *event);
 struct pmu *x86_get_pmu(unsigned int cpu);
 extern struct x86_pmu x86_pmu __read_mostly;
 
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 9a71880eec07..4c9361d8f05d 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -591,7 +591,11 @@
 /* V6 PMON MSR range */
 #define MSR_IA32_PMC_V6_GP0_CTR		0x1900
 #define MSR_IA32_PMC_V6_GP0_CFG_A	0x1901
+#define MSR_IA32_PMC_V6_GP0_CFG_B	0x1902
+#define MSR_IA32_PMC_V6_GP0_CFG_C	0x1903
 #define MSR_IA32_PMC_V6_FX0_CTR		0x1980
+#define MSR_IA32_PMC_V6_FX0_CFG_B	0x1982
+#define MSR_IA32_PMC_V6_FX0_CFG_C	0x1983
 #define MSR_IA32_PMC_V6_STEP		4
 
 /* KeyID partitioning between MKTME and TDX */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index 238879f6c3e3..24f2eba200ac 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -157,6 +157,7 @@ struct hw_perf_event {
 	union {
 		struct { /* hardware */
 			u64		config;
+			u64		config1;
 			u64		last_tag;
 			u64		dyn_constraint;
 			unsigned long	config_base;
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH V3 0/5] Support auto counter reload
  2025-02-13 21:17 [PATCH V3 0/5] Support auto counter reload kan.liang
                   ` (4 preceding siblings ...)
  2025-02-13 21:17 ` [PATCH V3 5/5] perf/x86/intel: Support " kan.liang
@ 2025-02-26 21:43 ` Falcon, Thomas
  2025-03-04  0:58 ` Stephane Eranian
  6 siblings, 0 replies; 9+ messages in thread
From: Falcon, Thomas @ 2025-02-26 21:43 UTC (permalink / raw)
  To: alexander.shishkin@linux.intel.com, peterz@infradead.org,
	acme@kernel.org, mingo@redhat.com, kan.liang@linux.intel.com,
	Hunter, Adrian, namhyung@kernel.org, irogers@google.com,
	linux-kernel@vger.kernel.org
  Cc: dapeng1.mi@linux.intel.com, ak@linux.intel.com, Eranian, Stephane

On Thu, 2025-02-13 at 13:17 -0800, kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> Changes since V2:
> - Rebase on top of several new features, e.g., counters snapshotting
>   feature. Rewrite the code for the ACR CPUID-enumeration,
> configuration
>   and late setup.
> - Patch 1-3 are newly added for clean up.
> 
> Changes since V1:
> - Add a check to the reload value which cannot exceeds the max period
> - Avoid invoking intel_pmu_enable_acr() for the perf metrics event.
> - Update comments explain to case which the event->attr.config2
> exceeds
>   the group size
> 
> The relative rates among two or more events are useful for
> performance
> analysis, e.g., a high branch miss rate may indicate a performance
> issue. Usually, the samples with a relative rate that exceeds some
> threshold are more useful. However, the traditional sampling takes
> samples of events separately. To get the relative rates among two or
> more events, a high sample rate is required, which can bring high
> overhead. Many samples taken in the non-hotspot area are also dropped
> (useless) in the post-process.
> 
> The auto counter reload (ACR) feature takes samples when the relative
> rate of two or more events exceeds some threshold, which provides the
> fine-grained information at a low cost.
> To support the feature, two sets of MSRs are introduced. For a given
> counter IA32_PMC_GPn_CTR/IA32_PMC_FXm_CTR, bit fields in the
> IA32_PMC_GPn_CFG_B/IA32_PMC_FXm_CFG_B MSR indicate which counter(s)
> can cause a reload of that counter. The reload value is stored in the
> IA32_PMC_GPn_CFG_C/IA32_PMC_FXm_CFG_C.
> The details can be found at Intel SDM (085), Volume 3, 21.9.11 Auto
> Counter Reload.

Works for me on an Core Ultra 9 275HX.

Tested-by: Thomas Falcon <thomas.falcon@intel.com>

Tom

> 
> Example:
> 
> Here is the snippet of the mispredict.c. Since the array has a random
> numbers, jumps are random and often mispredicted.
> The mispredicted rate depends on the compared value.
> 
> For the Loop1, ~11% of all branches are mispredicted.
> For the Loop2, ~21% of all branches are mispredicted.
> 
> main()
> {
> ...
>         for (i = 0; i < N; i++)
>                 data[i] = rand() % 256;
> ...
>         /* Loop 1 */
>         for (k = 0; k < 50; k++)
>                 for (i = 0; i < N; i++)
>                         if (data[i] >= 64)
>                                 sum += data[i];
> ...
> 
> ...
>         /* Loop 2 */
>         for (k = 0; k < 50; k++)
>                 for (i = 0; i < N; i++)
>                         if (data[i] >= 128)
>                                 sum += data[i];
> ...
> }
> 
> Usually, a code with a high branch miss rate means a bad performance.
> To understand the branch miss rate of the codes, the traditional
> method
> usually samples both branches and branch-misses events. E.g.,
> perf record -e "{cpu_atom/branch-misses/ppu, cpu_atom/branch-
> instructions/u}"
>                -c 1000000 -- ./mispredict
> 
> [ perf record: Woken up 4 times to write data ]
> [ perf record: Captured and wrote 0.925 MB perf.data (5106 samples) ]
> The 5106 samples are from both events and spread in both Loops.
> In the post-process stage, a user can know that the Loop 2 has a 21%
> branch miss rate. Then they can focus on the samples of branch-misses
> events for the Loop 2.
> 
> With this patch, the user can generate the samples only when the
> branch
> miss rate > 20%. For example,
> perf record -e "{cpu_atom/branch-
> misses,period=200000,acr_mask=0x2/ppu,
>                  cpu_atom/branch-
> instructions,period=1000000,acr_mask=0x3/u}"
>                 -- ./mispredict
> 
> (Two different periods are applied to branch-misses and
> branch-instructions. The ratio is set to 20%.
> If the branch-instructions is overflowed first, the branch-miss
> rate < 20%. No samples should be generated. All counters should be
> automatically reloaded.
> If the branch-misses is overflowed first, the branch-miss rate > 20%.
> A sample triggered by the branch-misses event should be
> generated. Just the counter of the branch-instructions should be
> automatically reloaded.
> 
> The branch-misses event should only be automatically reloaded when
> the branch-instructions is overflowed. So the "cause" event is the
> branch-instructions event. The acr_mask is set to 0x2, since the
> event index of branch-instructions is 1.
> 
> The branch-instructions event is automatically reloaded no matter
> which
> events are overflowed. So the "cause" events are the branch-misses
> and the branch-instructions event. The acr_mask should be set to
> 0x3.)
> 
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.098 MB perf.data (2498 samples) ]
> 
>  $perf report
> 
> Percent       │154:   movl    $0x0,-0x14(%rbp)
>               │     ↓ jmp     1af
>               │     for (i = j; i < N; i++)
>               │15d:   mov     -0x10(%rbp),%eax
>               │       mov     %eax,-0x18(%rbp)
>               │     ↓ jmp     1a2
>               │     if (data[i] >= 128)
>               │165:   mov     -0x18(%rbp),%eax
>               │       cltq
>               │       lea     0x0(,%rax,4),%rdx
>               │       mov     -0x8(%rbp),%rax
>               │       add     %rdx,%rax
>               │       mov     (%rax),%eax
>               │    ┌──cmp     $0x7f,%eax
> 100.00   0.00 │    ├──jle     19e
>               │    │sum += data[i];
> 
> The 2498 samples are all from the branch-misses events for the Loop
> 2.
> 
> The number of samples and overhead is significantly reduced without
> losing any information.
> 
> Kan Liang (5):
>   perf/x86: Add dynamic constraint
>   perf/x86/intel: Track the num of events needs late setup
>   perf: Extend the bit width of the arch-specific flag
>   perf/x86/intel: Add CPUID enumeration for the auto counter reload
>   perf/x86/intel: Support auto counter reload
> 
>  arch/x86/events/core.c             |   3 +-
>  arch/x86/events/intel/core.c       | 260
> ++++++++++++++++++++++++++++-
>  arch/x86/events/intel/ds.c         |   3 +-
>  arch/x86/events/intel/lbr.c        |   2 +-
>  arch/x86/events/perf_event.h       |  33 ++++
>  arch/x86/events/perf_event_flags.h |  41 ++---
>  arch/x86/include/asm/msr-index.h   |   4 +
>  arch/x86/include/asm/perf_event.h  |   1 +
>  include/linux/perf_event.h         |   4 +-
>  9 files changed, 320 insertions(+), 31 deletions(-)
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH V3 0/5] Support auto counter reload
  2025-02-13 21:17 [PATCH V3 0/5] Support auto counter reload kan.liang
                   ` (5 preceding siblings ...)
  2025-02-26 21:43 ` [PATCH V3 0/5] " Falcon, Thomas
@ 2025-03-04  0:58 ` Stephane Eranian
  2025-03-04  2:33   ` Andi Kleen
  6 siblings, 1 reply; 9+ messages in thread
From: Stephane Eranian @ 2025-03-04  0:58 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, mingo, acme, namhyung, irogers, adrian.hunter,
	alexander.shishkin, linux-kernel, ak, dapeng1.mi, thomas.falcon

Hi Kan,


On Thu, Feb 13, 2025 at 1:17 PM <kan.liang@linux.intel.com> wrote:
>
> From: Kan Liang <kan.liang@linux.intel.com>
>
> Changes since V2:
> - Rebase on top of several new features, e.g., counters snapshotting
>   feature. Rewrite the code for the ACR CPUID-enumeration, configuration
>   and late setup.
> - Patch 1-3 are newly added for clean up.
>
> Changes since V1:
> - Add a check to the reload value which cannot exceeds the max period
> - Avoid invoking intel_pmu_enable_acr() for the perf metrics event.
> - Update comments explain to case which the event->attr.config2 exceeds
>   the group size
>
> The relative rates among two or more events are useful for performance
> analysis, e.g., a high branch miss rate may indicate a performance
> issue. Usually, the samples with a relative rate that exceeds some
> threshold are more useful. However, the traditional sampling takes
> samples of events separately. To get the relative rates among two or
> more events, a high sample rate is required, which can bring high
> overhead. Many samples taken in the non-hotspot area are also dropped
> (useless) in the post-process.
>
> The auto counter reload (ACR) feature takes samples when the relative
> rate of two or more events exceeds some threshold, which provides the
> fine-grained information at a low cost.
> To support the feature, two sets of MSRs are introduced. For a given
> counter IA32_PMC_GPn_CTR/IA32_PMC_FXm_CTR, bit fields in the
> IA32_PMC_GPn_CFG_B/IA32_PMC_FXm_CFG_B MSR indicate which counter(s)
> can cause a reload of that counter. The reload value is stored in the
> IA32_PMC_GPn_CFG_C/IA32_PMC_FXm_CFG_C.
> The details can be found at Intel SDM (085), Volume 3, 21.9.11 Auto
> Counter Reload.
>
> Example:
>
> Here is the snippet of the mispredict.c. Since the array has a random
> numbers, jumps are random and often mispredicted.
> The mispredicted rate depends on the compared value.
>
> For the Loop1, ~11% of all branches are mispredicted.
> For the Loop2, ~21% of all branches are mispredicted.
>
> main()
> {
> ...
>         for (i = 0; i < N; i++)
>                 data[i] = rand() % 256;
> ...
>         /* Loop 1 */
>         for (k = 0; k < 50; k++)
>                 for (i = 0; i < N; i++)
>                         if (data[i] >= 64)
>                                 sum += data[i];
> ...
>
> ...
>         /* Loop 2 */
>         for (k = 0; k < 50; k++)
>                 for (i = 0; i < N; i++)
>                         if (data[i] >= 128)
>                                 sum += data[i];
> ...
> }
>
> Usually, a code with a high branch miss rate means a bad performance.
> To understand the branch miss rate of the codes, the traditional method
> usually samples both branches and branch-misses events. E.g.,
> perf record -e "{cpu_atom/branch-misses/ppu, cpu_atom/branch-instructions/u}"
>                -c 1000000 -- ./mispredict
>
> [ perf record: Woken up 4 times to write data ]
> [ perf record: Captured and wrote 0.925 MB perf.data (5106 samples) ]
> The 5106 samples are from both events and spread in both Loops.
> In the post-process stage, a user can know that the Loop 2 has a 21%
> branch miss rate. Then they can focus on the samples of branch-misses
> events for the Loop 2.
>
> With this patch, the user can generate the samples only when the branch
> miss rate > 20%. For example,
> perf record -e "{cpu_atom/branch-misses,period=200000,acr_mask=0x2/ppu,
>                  cpu_atom/branch-instructions,period=1000000,acr_mask=0x3/u}"
>                 -- ./mispredict
>
> (Two different periods are applied to branch-misses and
> branch-instructions. The ratio is set to 20%.
> If the branch-instructions is overflowed first, the branch-miss
> rate < 20%. No samples should be generated. All counters should be
> automatically reloaded.
> If the branch-misses is overflowed first, the branch-miss rate > 20%.
> A sample triggered by the branch-misses event should be
> generated. Just the counter of the branch-instructions should be
> automatically reloaded.
>
> The branch-misses event should only be automatically reloaded when
> the branch-instructions is overflowed. So the "cause" event is the
> branch-instructions event. The acr_mask is set to 0x2, since the
> event index of branch-instructions is 1.
>
What is your definition of 'event index' here? Is it the position on
the cmdline?


> The branch-instructions event is automatically reloaded no matter which
> events are overflowed. So the "cause" events are the branch-misses
> and the branch-instructions event. The acr_mask should be set to 0x3.)
>
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.098 MB perf.data (2498 samples) ]
>
>  $perf report
>
> Percent       │154:   movl    $0x0,-0x14(%rbp)
>               │     ↓ jmp     1af
>               │     for (i = j; i < N; i++)
>               │15d:   mov     -0x10(%rbp),%eax
>               │       mov     %eax,-0x18(%rbp)
>               │     ↓ jmp     1a2
>               │     if (data[i] >= 128)
>               │165:   mov     -0x18(%rbp),%eax
>               │       cltq
>               │       lea     0x0(,%rax,4),%rdx
>               │       mov     -0x8(%rbp),%rax
>               │       add     %rdx,%rax
>               │       mov     (%rax),%eax
>               │    ┌──cmp     $0x7f,%eax
> 100.00   0.00 │    ├──jle     19e
>               │    │sum += data[i];
>
> The 2498 samples are all from the branch-misses events for the Loop 2.
>
> The number of samples and overhead is significantly reduced without
> losing any information.
>
> Kan Liang (5):
>   perf/x86: Add dynamic constraint
>   perf/x86/intel: Track the num of events needs late setup
>   perf: Extend the bit width of the arch-specific flag
>   perf/x86/intel: Add CPUID enumeration for the auto counter reload
>   perf/x86/intel: Support auto counter reload
>
>  arch/x86/events/core.c             |   3 +-
>  arch/x86/events/intel/core.c       | 260 ++++++++++++++++++++++++++++-
>  arch/x86/events/intel/ds.c         |   3 +-
>  arch/x86/events/intel/lbr.c        |   2 +-
>  arch/x86/events/perf_event.h       |  33 ++++
>  arch/x86/events/perf_event_flags.h |  41 ++---
>  arch/x86/include/asm/msr-index.h   |   4 +
>  arch/x86/include/asm/perf_event.h  |   1 +
>  include/linux/perf_event.h         |   4 +-
>  9 files changed, 320 insertions(+), 31 deletions(-)
>
> --
> 2.38.1
>

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH V3 0/5] Support auto counter reload
  2025-03-04  0:58 ` Stephane Eranian
@ 2025-03-04  2:33   ` Andi Kleen
  0 siblings, 0 replies; 9+ messages in thread
From: Andi Kleen @ 2025-03-04  2:33 UTC (permalink / raw)
  To: Stephane Eranian
  Cc: kan.liang, peterz, mingo, acme, namhyung, irogers, adrian.hunter,
	alexander.shishkin, linux-kernel, dapeng1.mi, thomas.falcon

> > the branch-instructions is overflowed. So the "cause" event is the
> > branch-instructions event. The acr_mask is set to 0x2, since the
> > event index of branch-instructions is 1.
> >
> What is your definition of 'event index' here? Is it the position on
> the cmdline?

It's the index into the group.

-Andi

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-03-04  2:33 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-13 21:17 [PATCH V3 0/5] Support auto counter reload kan.liang
2025-02-13 21:17 ` [PATCH V3 1/5] perf/x86: Add dynamic constraint kan.liang
2025-02-13 21:17 ` [PATCH V3 2/5] perf/x86/intel: Track the num of events needs late setup kan.liang
2025-02-13 21:17 ` [PATCH V3 3/5] perf: Extend the bit width of the arch-specific flag kan.liang
2025-02-13 21:17 ` [PATCH V3 4/5] perf/x86/intel: Add CPUID enumeration for the auto counter reload kan.liang
2025-02-13 21:17 ` [PATCH V3 5/5] perf/x86/intel: Support " kan.liang
2025-02-26 21:43 ` [PATCH V3 0/5] " Falcon, Thomas
2025-03-04  0:58 ` Stephane Eranian
2025-03-04  2:33   ` Andi Kleen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox