[PATCH V2 0/3] Support auto counter reload

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH V2 0/3] Support auto counter reload
@ 2024-10-10 19:28 kan.liang
  2024-10-10 19:28 ` [PATCH V2 1/3] perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF kan.liang
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: kan.liang @ 2024-10-10 19:28 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, adrian.hunter, ak,
	linux-kernel
  Cc: eranian, thomas.falcon, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

Changes since V1:
- Add a check to the reload value which cannot exceeds the max period
- Avoid invoking intel_pmu_enable_acr() for the perf metrics event.
- Update comments explain to case which the event->attr.config2 exceeds
  the group size

The relative rates among two or more events are useful for performance
analysis, e.g., a high branch miss rate may indicate a performance
issue. Usually, the samples with a relative rate that exceeds some
threshold are more useful. However, the traditional sampling takes
samples of events separately. To get the relative rates among two or
more events, a high sample rate is required, which can bring high
overhead. Many samples taken in the non-hotspot area are also dropped
(useless) in the post-process.

Auto Counter Reload (ACR) provides a means for software to specify that,
for each supported counter, the hardware should automatically reload the
counter to a specified initial value upon overflow of chosen counters.
This mechanism enables software to sample based on the relative rate of
two (or more) events, such that a sample (PMI or PEBS) is taken only if
the rate of one event exceeds some threshold relative to the rate of
another event. Taking a PMI or PEBS only when the relative rate of
perfmon events crosses a threshold can have significantly less
performance overhead than other techniques.

The details can be found at Intel Architecture Instruction Set
Extensions and Future Features (053) 8.7 AUTO COUNTER RELOAD.

Examples:

Here is the snippet of the mispredict.c. Since the array has random
numbers, jumps are random and often mispredicted.
The mispredicted rate depends on the compared value.

For the Loop1, ~11% of all branches are mispredicted.
For the Loop2, ~21% of all branches are mispredicted.

main()
{
...
        for (i = 0; i < N; i++)
                data[i] = rand() % 256;
...
        /* Loop 1 */
        for (k = 0; k < 50; k++)
                for (i = 0; i < N; i++)
                        if (data[i] >= 64)
                                sum += data[i];
...

...
        /* Loop 2 */
        for (k = 0; k < 50; k++)
                for (i = 0; i < N; i++)
                        if (data[i] >= 128)
                                sum += data[i];
...
}

Usually, a code with a high branch miss rate means a bad performance.
To understand the branch miss rate of the codes, the traditional method
usually sample both branches and branch-misses events. E.g.,
perf record -e "{cpu_atom/branch-misses/ppu, cpu_atom/branch-instructions/u}"
               -c 1000000 -- ./mispredict

[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.925 MB perf.data (5106 samples) ]
The 5106 samples are from both events and spread in both Loops.
In the post process stage, a user can know that the Loop 2 has a 21%
branch miss rate. Then they can focus on the samples of branch-misses
events for the Loop 2.

With this patch, the user can generate the samples only when the branch
miss rate > 20%.
perf record -e "{cpu_atom/branch-misses,period=200000,acr_mask=0x2/ppu,
                 cpu_atom/branch-instructions,period=1000000,acr_mask=0x3/u}"
                -- ./mispredict
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.098 MB perf.data (2498 samples) ]

 $perf report

Percent       │154:   movl    $0x0,-0x14(%rbp)
              │     ↓ jmp     1af
              │     for (i = j; i < N; i++)
              │15d:   mov     -0x10(%rbp),%eax
              │       mov     %eax,-0x18(%rbp)
              │     ↓ jmp     1a2
              │     if (data[i] >= 128)
              │165:   mov     -0x18(%rbp),%eax
              │       cltq
              │       lea     0x0(,%rax,4),%rdx
              │       mov     -0x8(%rbp),%rax
              │       add     %rdx,%rax
              │       mov     (%rax),%eax
              │    ┌──cmp     $0x7f,%eax
100.00   0.00 │    ├──jle     19e
              │    │sum += data[i];

The 2498 samples are all from the branch-misses events for the Loop 2.

The number of samples and overhead is significantly reduced without
losing any information.

Kan Liang (3):
  perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF
  perf/x86/intel: Add the enumeration and flag for the auto counter
    reload
  perf/x86/intel: Support auto counter reload

 arch/x86/events/intel/core.c       | 262 ++++++++++++++++++++++++++++-
 arch/x86/events/perf_event.h       |  21 +++
 arch/x86/events/perf_event_flags.h |   2 +-
 arch/x86/include/asm/msr-index.h   |   4 +
 arch/x86/include/asm/perf_event.h  |   4 +-
 include/linux/perf_event.h         |   2 +
 6 files changed, 288 insertions(+), 7 deletions(-)

-- 
2.38.1

^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH V2 1/3] perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF
  2024-10-10 19:28 [PATCH V2 0/3] Support auto counter reload kan.liang
@ 2024-10-10 19:28 ` kan.liang
  2024-10-10 19:28 ` [PATCH V2 2/3] perf/x86/intel: Add the enumeration and flag for the auto counter reload kan.liang
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: kan.liang @ 2024-10-10 19:28 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, adrian.hunter, ak,
	linux-kernel
  Cc: eranian, thomas.falcon, Kan Liang, stable

From: Kan Liang <kan.liang@linux.intel.com>

The EAX of the CPUID Leaf 023H enumerates the mask of valid sub-leaves.
To tell the availability of the sub-leaf 1 (enumerate the counter mask),
perf should check the bit 1 (0x2) of EAS, rather than bit 0 (0x1).

The error is not user-visible on bare metal. Because the sub-leaf 0 and
the sub-leaf 1 are always available. However, it may bring issues in a
virtualization environment when a VMM only enumerates the sub-leaf 0.

Fixes: eb467aaac21e ("perf/x86/intel: Support Architectural PerfMon Extension leaf")
Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
Cc: stable@vger.kernel.org
---
 arch/x86/events/intel/core.c      | 4 ++--
 arch/x86/include/asm/perf_event.h | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 7ca40002a19b..2f3bf3bbbd77 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4886,8 +4886,8 @@ static void update_pmu_cap(struct x86_hybrid_pmu *pmu)
 	if (ebx & ARCH_PERFMON_EXT_EQ)
 		pmu->config_mask |= ARCH_PERFMON_EVENTSEL_EQ;
 
-	if (sub_bitmaps & ARCH_PERFMON_NUM_COUNTER_LEAF_BIT) {
-		cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_NUM_COUNTER_LEAF,
+	if (sub_bitmaps & ARCH_PERFMON_NUM_COUNTER_LEAF) {
+		cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_NUM_COUNTER_LEAF_BIT,
 			    &eax, &ebx, &ecx, &edx);
 		pmu->cntr_mask64 = eax;
 		pmu->fixed_cntr_mask64 = ebx;
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 91b73571412f..41ace8431e01 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -190,7 +190,7 @@ union cpuid10_edx {
 #define ARCH_PERFMON_EXT_UMASK2			0x1
 #define ARCH_PERFMON_EXT_EQ			0x2
 #define ARCH_PERFMON_NUM_COUNTER_LEAF_BIT	0x1
-#define ARCH_PERFMON_NUM_COUNTER_LEAF		0x1
+#define ARCH_PERFMON_NUM_COUNTER_LEAF		BIT(ARCH_PERFMON_NUM_COUNTER_LEAF_BIT)
 
 /*
  * Intel Architectural LBR CPUID detection/enumeration details:
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH V2 2/3] perf/x86/intel: Add the enumeration and flag for the auto counter reload
  2024-10-10 19:28 [PATCH V2 0/3] Support auto counter reload kan.liang
  2024-10-10 19:28 ` [PATCH V2 1/3] perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF kan.liang
@ 2024-10-10 19:28 ` kan.liang
  2024-10-10 19:28 ` [PATCH V2 3/3] perf/x86/intel: Support " kan.liang
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 10+ messages in thread
From: kan.liang @ 2024-10-10 19:28 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, adrian.hunter, ak,
	linux-kernel
  Cc: eranian, thomas.falcon, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The counters that support the auto counter reload feature can be
enumerated in the CPUID Leaf 0x23 sub-leaf 0x2.

Add acr_cntr_mask to store the mask of counters which are reloadable.
Add acr_cntr_cause_mask to store the mask of counters which can cause
reload. Since the e-core and p-core may have different numbers of
counters, track the masks in the struct x86_hybrid_pmu as well.

The Auto Counter Reload feature requires a dynamic constraint. Add a PMU
flag to allocate the constraint_list.

There are many existing features which require a dynamic constraint as
well. Add a PMU_FL_DYN_MASK to include the flags of all the features.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/core.c      | 17 +++++++++++++++--
 arch/x86/events/perf_event.h      | 12 ++++++++++++
 arch/x86/include/asm/perf_event.h |  2 ++
 3 files changed, 29 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 2f3bf3bbbd77..726ef13c2c81 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4775,7 +4775,8 @@ static struct intel_excl_cntrs *allocate_excl_cntrs(int cpu)
 	return c;
 }
 
-
+#define PMU_FL_DYN_MASK		(PMU_FL_EXCL_CNTRS | PMU_FL_TFA |	\
+				 PMU_FL_BR_CNTR | PMU_FL_ACR)
 int intel_cpuc_prepare(struct cpu_hw_events *cpuc, int cpu)
 {
 	cpuc->pebs_record_size = x86_pmu.pebs_record_size;
@@ -4786,7 +4787,7 @@ int intel_cpuc_prepare(struct cpu_hw_events *cpuc, int cpu)
 			goto err;
 	}
 
-	if (x86_pmu.flags & (PMU_FL_EXCL_CNTRS | PMU_FL_TFA | PMU_FL_BR_CNTR)) {
+	if (x86_pmu.flags & PMU_FL_DYN_MASK) {
 		size_t sz = X86_PMC_IDX_MAX * sizeof(struct event_constraint);
 
 		cpuc->constraint_list = kzalloc_node(sz, GFP_KERNEL, cpu_to_node(cpu));
@@ -4893,6 +4894,18 @@ static void update_pmu_cap(struct x86_hybrid_pmu *pmu)
 		pmu->fixed_cntr_mask64 = ebx;
 	}
 
+	if (sub_bitmaps & ARCH_PERFMON_ACR_LEAF) {
+		cpuid_count(ARCH_PERFMON_EXT_LEAF, ARCH_PERFMON_ACR_LEAF_BIT,
+			    &eax, &ebx, &ecx, &edx);
+		/* The mask of the counters which can be reloaded */
+		pmu->acr_cntr_mask64 = eax | ((u64)ebx << INTEL_PMC_IDX_FIXED);
+
+		/* The mask of the counters which can cause a reload of reloadable counters */
+		pmu->acr_cntr_cause_mask = ecx | ((u64)edx << INTEL_PMC_IDX_FIXED);
+
+		x86_pmu.flags |= PMU_FL_ACR;
+	}
+
 	if (!intel_pmu_broken_perf_cap()) {
 		/* Perf Metric (Bit 15) and PEBS via PT (Bit 16) are hybrid enumeration */
 		rdmsrl(MSR_IA32_PERF_CAPABILITIES, pmu->intel_cap.capabilities);
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 82c6f45ce975..1ee6d7bb10a3 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -718,6 +718,12 @@ struct x86_hybrid_pmu {
 			u64		fixed_cntr_mask64;
 			unsigned long	fixed_cntr_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
 	};
+
+	union {
+			u64		acr_cntr_mask64;
+			unsigned long	acr_cntr_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+	};
+	u64				acr_cntr_cause_mask;
 	struct event_constraint		unconstrained;
 
 	u64				hw_cache_event_ids
@@ -815,6 +821,11 @@ struct x86_pmu {
 			u64		fixed_cntr_mask64;
 			unsigned long	fixed_cntr_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
 	};
+	union {
+			u64		acr_cntr_mask64;
+			unsigned long	acr_cntr_mask[BITS_TO_LONGS(X86_PMC_IDX_MAX)];
+	};
+	u64				acr_cntr_cause_mask;
 	int		cntval_bits;
 	u64		cntval_mask;
 	union {
@@ -1059,6 +1070,7 @@ do {									\
 #define PMU_FL_MEM_LOADS_AUX	0x100 /* Require an auxiliary event for the complete memory info */
 #define PMU_FL_RETIRE_LATENCY	0x200 /* Support Retire Latency in PEBS */
 #define PMU_FL_BR_CNTR		0x400 /* Support branch counter logging */
+#define PMU_FL_ACR		0x800 /* Support auto-counter reload */
 
 #define EVENT_VAR(_id)  event_attr_##_id
 #define EVENT_PTR(_id) &event_attr_##_id.attr.attr
diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h
index 41ace8431e01..19af3d857db3 100644
--- a/arch/x86/include/asm/perf_event.h
+++ b/arch/x86/include/asm/perf_event.h
@@ -191,6 +191,8 @@ union cpuid10_edx {
 #define ARCH_PERFMON_EXT_EQ			0x2
 #define ARCH_PERFMON_NUM_COUNTER_LEAF_BIT	0x1
 #define ARCH_PERFMON_NUM_COUNTER_LEAF		BIT(ARCH_PERFMON_NUM_COUNTER_LEAF_BIT)
+#define ARCH_PERFMON_ACR_LEAF_BIT		0x2
+#define ARCH_PERFMON_ACR_LEAF			BIT(ARCH_PERFMON_ACR_LEAF_BIT)
 
 /*
  * Intel Architectural LBR CPUID detection/enumeration details:
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH V2 3/3] perf/x86/intel: Support auto counter reload
  2024-10-10 19:28 [PATCH V2 0/3] Support auto counter reload kan.liang
  2024-10-10 19:28 ` [PATCH V2 1/3] perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF kan.liang
  2024-10-10 19:28 ` [PATCH V2 2/3] perf/x86/intel: Add the enumeration and flag for the auto counter reload kan.liang
@ 2024-10-10 19:28 ` kan.liang
  2025-03-14 10:20   ` Peter Zijlstra
  2024-11-04 20:37 ` [PATCH V2 0/3] " Liang, Kan
  2025-03-14  9:51 ` Ingo Molnar
  4 siblings, 1 reply; 10+ messages in thread
From: kan.liang @ 2024-10-10 19:28 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, adrian.hunter, ak,
	linux-kernel
  Cc: eranian, thomas.falcon, Kan Liang

From: Kan Liang <kan.liang@linux.intel.com>

The relative rates among two or more events are useful for performance
analysis, e.g., a high branch miss rate may indicate a performance
issue. Usually, the samples with a relative rate that exceeds some
threshold are more useful. However, the traditional sampling takes
samples of events separately. To get the relative rates among two or
more events, a high sample rate is required, which can bring high
overhead. Many samples taken in the non-hotspot area are also dropped
(useless) in the post-process.

The auto counter reload (ACR) feature takes samples when the relative
rate of two or more events exceeds some threshold, which provides the
fine-grained information at a low cost.
To support the feature, two sets of MSRs are introduced. For a given
counter IA32_PMC_GPn_CTR/IA32_PMC_FXm_CTR, bit fields in the
IA32_PMC_GPn_CFG_B/IA32_PMC_FXm_CFG_B MSR indicate which counter(s)
can cause a reload of that counter. The reload value is stored in the
IA32_PMC_GPn_CFG_C/IA32_PMC_FXm_CFG_C.
The details can be found at Intel Architecture Instruction Set
Extensions and Future Features (053) 8.7 AUTO COUNTER RELOAD.

The ACR event/group needs to be specially configured. Because the
counter mask of an event has to be recalculated according to both the
original mask and the reloadable counter mask. The new counter mask is
stored into the new field, dyn_mask, in struct hw_perf_event.
Also, add a new flag PERF_X86_EVENT_ACR to indicate an ACR group, which
is set to the group leader.

The ACR configuration MSRs are only updated in the enable_event().
The disable_event() doesn't clear the ACR CFG register.
Add acr_cfg_b/acr_cfg_c in the struct cpu_hw_events to cache the MSR
values. It can avoid a MSR write if the value is not changed.

Expose a acr_mask to the sysfs. The perf tool can utilize the new format
to configure the relation of events in the group. The bit sequence of
the acr_mask follows the events enabled order of the group. The
kernel will convert it to the real counter order and save the updated
order into the newly added hw.config1 every time the group is scheduled.
The hw.config1 is eventually to be written to the ACR config MSR
(MSR_IA32_PMC_GP/FX_CFG_B) when the event is enabled.

Example:

Here is the snippet of the mispredict.c. Since the array has random
numbers, jumps are random and often mispredicted.
The mispredicted rate depends on the compared value.

For the Loop1, ~11% of all branches are mispredicted.
For the Loop2, ~21% of all branches are mispredicted.

main()
{
...
        for (i = 0; i < N; i++)
                data[i] = rand() % 256;
...
        /* Loop 1 */
        for (k = 0; k < 50; k++)
                for (i = 0; i < N; i++)
                        if (data[i] >= 64)
                                sum += data[i];
...

...
        /* Loop 2 */
        for (k = 0; k < 50; k++)
                for (i = 0; i < N; i++)
                        if (data[i] >= 128)
                                sum += data[i];
...
}

Usually, a code with a high branch miss rate means a bad performance.
To understand the branch miss rate of the codes, the traditional method
usually sample both branches and branch-misses events. E.g.,
perf record -e "{cpu_atom/branch-misses/ppu, cpu_atom/branch-instructions/u}"
               -c 1000000 -- ./mispredict

[ perf record: Woken up 4 times to write data ]
[ perf record: Captured and wrote 0.925 MB perf.data (5106 samples) ]
The 5106 samples are from both events and spread in both Loops.
In the post process stage, a user can know that the Loop 2 has a 21%
branch miss rate. Then they can focus on the samples of branch-misses
events for the Loop 2.

With this patch, the user can generate the samples only when the branch
miss rate > 20%. For example,
perf record -e "{cpu_atom/branch-misses,period=200000,acr_mask=0x2/ppu,
                 cpu_atom/branch-instructions,period=1000000,acr_mask=0x3/u}"
                -- ./mispredict

(Two different periods are applied to branch-misses and
branch-instructions. The ratio is 20%. If the branch-instructions is
overflowed first, the branch miss rate < 20%. No samples should be
generated. All counters should be automatically reloaded. The acr_mask
is set to 0x3. If the branch-misses is overflowed first, the branch miss
rate > 20%. A sample triggered by the branch-misses event should be
generated. The counter of the branch-instructions should be
automatically reloaded. The acr_mask is set to 0x2, since the
branch-misses event is the first event in the group.)

[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.098 MB perf.data (2498 samples) ]

 $perf report

Percent       │154:   movl    $0x0,-0x14(%rbp)
              │     ↓ jmp     1af
              │     for (i = j; i < N; i++)
              │15d:   mov     -0x10(%rbp),%eax
              │       mov     %eax,-0x18(%rbp)
              │     ↓ jmp     1a2
              │     if (data[i] >= 128)
              │165:   mov     -0x18(%rbp),%eax
              │       cltq
              │       lea     0x0(,%rax,4),%rdx
              │       mov     -0x8(%rbp),%rax
              │       add     %rdx,%rax
              │       mov     (%rax),%eax
              │    ┌──cmp     $0x7f,%eax
100.00   0.00 │    ├──jle     19e
              │    │sum += data[i];

The 2498 samples are all from the branch-misses events for the Loop 2.

The number of samples and overhead is significantly reduced without
losing any information.

Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
---
 arch/x86/events/intel/core.c       | 241 ++++++++++++++++++++++++++++-
 arch/x86/events/perf_event.h       |   9 ++
 arch/x86/events/perf_event_flags.h |   2 +-
 arch/x86/include/asm/msr-index.h   |   4 +
 include/linux/perf_event.h         |   2 +
 5 files changed, 256 insertions(+), 2 deletions(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 726ef13c2c81..d3bdc7d18d3f 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -2851,6 +2851,54 @@ static void intel_pmu_enable_fixed(struct perf_event *event)
 	cpuc->fixed_ctrl_val |= bits;
 }
 
+static void intel_pmu_config_acr(int idx, u64 mask, u32 reload)
+{
+	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
+	int msr_b, msr_c;
+
+	if (!mask && !cpuc->acr_cfg_b[idx])
+		return;
+
+	if (idx < INTEL_PMC_IDX_FIXED) {
+		msr_b = MSR_IA32_PMC_V6_GP0_CFG_B;
+		msr_c = MSR_IA32_PMC_V6_GP0_CFG_C;
+	} else {
+		msr_b = MSR_IA32_PMC_V6_FX0_CFG_B;
+		msr_c = MSR_IA32_PMC_V6_FX0_CFG_C;
+		idx -= INTEL_PMC_IDX_FIXED;
+	}
+
+	if (cpuc->acr_cfg_b[idx] != mask) {
+		wrmsrl(msr_b + x86_pmu.addr_offset(idx, false), mask);
+		cpuc->acr_cfg_b[idx] = mask;
+	}
+	/* Only need to update the reload value when there is a valid config value. */
+	if (mask && cpuc->acr_cfg_c[idx] != reload) {
+		wrmsrl(msr_c + x86_pmu.addr_offset(idx, false), reload);
+		cpuc->acr_cfg_c[idx] = reload;
+	}
+}
+
+static void intel_pmu_enable_acr(struct perf_event *event)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	/* The PMU doesn't support ACR */
+	if (!hybrid(event->pmu, acr_cntr_mask64))
+		return;
+
+	if (!is_acr_event_group(event) || !event->attr.config2) {
+		/*
+		 * The disable doesn't clear the ACR CFG register.
+		 * Check and clear the ACR CFG register.
+		 */
+		intel_pmu_config_acr(hwc->idx, 0, 0);
+		return;
+	}
+
+	intel_pmu_config_acr(hwc->idx, hwc->config1, -hwc->sample_period);
+}
+
 static void intel_pmu_enable_event(struct perf_event *event)
 {
 	u64 enable_mask = ARCH_PERFMON_EVENTSEL_ENABLE;
@@ -2866,8 +2914,11 @@ static void intel_pmu_enable_event(struct perf_event *event)
 			enable_mask |= ARCH_PERFMON_EVENTSEL_BR_CNTR;
 		intel_set_masks(event, idx);
 		__x86_pmu_enable_event(hwc, enable_mask);
+		intel_pmu_enable_acr(event);
 		break;
 	case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1:
+		intel_pmu_enable_acr(event);
+		fallthrough;
 	case INTEL_PMC_IDX_METRIC_BASE ... INTEL_PMC_IDX_METRIC_END:
 		intel_pmu_enable_fixed(event);
 		break;
@@ -3687,6 +3738,12 @@ intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
 		c2->weight = hweight64(c2->idxmsk64);
 	}
 
+	if (is_acr_event_group(event)) {
+		c2 = dyn_constraint(cpuc, c2, idx);
+		c2->idxmsk64 &= event->hw.dyn_mask;
+		c2->weight = hweight64(c2->idxmsk64);
+	}
+
 	return c2;
 }
 
@@ -3945,6 +4002,78 @@ static inline bool intel_pmu_has_cap(struct perf_event *event, int idx)
 	return test_bit(idx, (unsigned long *)&intel_cap->capabilities);
 }
 
+static bool intel_pmu_is_acr_group(struct perf_event *event)
+{
+	if (!hybrid(event->pmu, acr_cntr_mask64))
+		return false;
+
+	/* The group leader has the ACR flag set */
+	if (is_acr_event_group(event))
+		return true;
+
+	/* The acr_mask is set */
+	if (event->attr.config2)
+		return true;
+
+	return false;
+}
+
+static int intel_pmu_acr_check_reloadable_event(struct perf_event *event)
+{
+	struct perf_event *sibling, *leader = event->group_leader;
+	int num = 0;
+
+	/*
+	 * The acr_mask(config2) indicates the event can be reloaded by
+	 * other events. Apply the acr_cntr_mask.
+	 */
+	if (leader->attr.config2) {
+		leader->hw.dyn_mask = hybrid(leader->pmu, acr_cntr_mask64);
+		num++;
+	} else
+		leader->hw.dyn_mask = ~0ULL;
+
+	for_each_sibling_event(sibling, leader) {
+		if (sibling->attr.config2) {
+			sibling->hw.dyn_mask = hybrid(sibling->pmu, acr_cntr_mask64);
+			num++;
+		} else
+			sibling->hw.dyn_mask = ~0ULL;
+	}
+
+	if (event->attr.config2) {
+		event->hw.dyn_mask = hybrid(event->pmu, acr_cntr_mask64);
+		num++;
+	} else
+		event->hw.dyn_mask = ~0ULL;
+
+	if (num > hweight64(hybrid(event->pmu, acr_cntr_mask64)))
+		return -EINVAL;
+
+	return 0;
+}
+
+/*
+ * Update the dyn_mask of each event to guarantee the event is scheduled
+ * to the counters which be able to cause a reload.
+ */
+static void intel_pmu_set_acr_dyn_mask(struct perf_event *event, int idx,
+				       struct perf_event *last)
+{
+	struct perf_event *sibling, *leader = event->group_leader;
+	u64 mask = hybrid(event->pmu, acr_cntr_cause_mask);
+
+	/* The event set in the acr_mask(config2) can causes a reload. */
+	if (test_bit(idx, (unsigned long *)&leader->attr.config2))
+		event->hw.dyn_mask &= mask;
+	for_each_sibling_event(sibling, leader) {
+		if (test_bit(idx, (unsigned long *)&sibling->attr.config2))
+			event->hw.dyn_mask &= mask;
+	}
+	if (test_bit(idx, (unsigned long *)&last->attr.config2))
+		event->hw.dyn_mask &= mask;
+}
+
 static int intel_pmu_hw_config(struct perf_event *event)
 {
 	int ret = x86_pmu_hw_config(event);
@@ -4056,6 +4185,50 @@ static int intel_pmu_hw_config(struct perf_event *event)
 		event->hw.flags |= PERF_X86_EVENT_PEBS_VIA_PT;
 	}
 
+	if (intel_pmu_is_acr_group(event)) {
+		struct perf_event *sibling, *leader = event->group_leader;
+		int event_idx = 0;
+
+		/* Not support perf metrics */
+		if (is_metric_event(event))
+			return -EINVAL;
+
+		/* Not support freq mode */
+		if (event->attr.freq)
+			return -EINVAL;
+
+		/* PDist is not supported */
+		if (event->attr.config2 && event->attr.precise_ip > 2)
+			return -EINVAL;
+
+		/* The reload value cannot exceeds the max period */
+		if (event->attr.sample_period > x86_pmu.max_period)
+			return -EINVAL;
+		/*
+		 * It's hard to know whether the event is the last one of
+		 * the group. Reconfigure the dyn_mask of each X86 event
+		 * every time when add a new event.
+		 * It's impossible to verify whether the bits of
+		 * the event->attr.config2 exceeds the group. But it's
+		 * harmless, because the invalid bits are ignored. See
+		 * intel_pmu_update_acr_mask(). The n - n0 guarantees that
+		 * only the bits in the group is used.
+		 *
+		 * Check whether the reloadable counters is enough and
+		 * initialize the dyn_mask.
+		 */
+		if (intel_pmu_acr_check_reloadable_event(event))
+			return -EINVAL;
+
+		/* Reconfigure the dyn_mask for each event */
+		intel_pmu_set_acr_dyn_mask(leader, event_idx++, event);
+		for_each_sibling_event(sibling, leader)
+			intel_pmu_set_acr_dyn_mask(sibling, event_idx++, event);
+		intel_pmu_set_acr_dyn_mask(event, event_idx, event);
+
+		leader->hw.flags |= PERF_X86_EVENT_ACR;
+	}
+
 	if ((event->attr.type == PERF_TYPE_HARDWARE) ||
 	    (event->attr.type == PERF_TYPE_HW_CACHE))
 		return 0;
@@ -4159,6 +4332,49 @@ static int intel_pmu_hw_config(struct perf_event *event)
 	return 0;
 }
 
+static void intel_pmu_update_acr_mask(struct cpu_hw_events *cpuc, int n, int *assign)
+{
+	struct perf_event *event;
+	int n0, i, off;
+
+	if (cpuc->txn_flags & PERF_PMU_TXN_ADD)
+		n0 = cpuc->n_events - cpuc->n_txn;
+	else
+		n0 = cpuc->n_events;
+
+	for (i = n0; i < n; i++) {
+		event = cpuc->event_list[i];
+		event->hw.config1 = 0;
+
+		/* Convert the group index into the counter index */
+		for_each_set_bit(off, (unsigned long *)&event->attr.config2, n - n0)
+			set_bit(assign[n0 + off], (unsigned long *)&event->hw.config1);
+	}
+}
+
+static int intel_pmu_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
+{
+	struct perf_event *event;
+	int ret = x86_schedule_events(cpuc, n, assign);
+
+	if (ret)
+		return ret;
+
+	if (cpuc->is_fake)
+		return ret;
+
+	event = cpuc->event_list[n - 1];
+	/*
+	 * The acr_mask(config2) is the event-enabling order.
+	 * Update it to the counter order after the counters are assigned.
+	 */
+	if (event && is_acr_event_group(event))
+		intel_pmu_update_acr_mask(cpuc, n, assign);
+
+	return 0;
+}
+
+
 /*
  * Currently, the only caller of this function is the atomic_switch_perf_msrs().
  * The host perf context helps to prepare the values of the real hardware for
@@ -5305,7 +5521,7 @@ static __initconst const struct x86_pmu intel_pmu = {
 	.set_period		= intel_pmu_set_period,
 	.update			= intel_pmu_update,
 	.hw_config		= intel_pmu_hw_config,
-	.schedule_events	= x86_schedule_events,
+	.schedule_events	= intel_pmu_schedule_events,
 	.eventsel		= MSR_ARCH_PERFMON_EVENTSEL0,
 	.perfctr		= MSR_ARCH_PERFMON_PERFCTR0,
 	.fixedctr		= MSR_ARCH_PERFMON_FIXED_CTR0,
@@ -5909,6 +6125,21 @@ td_is_visible(struct kobject *kobj, struct attribute *attr, int i)
 	return attr->mode;
 }
 
+PMU_FORMAT_ATTR(acr_mask,	"config2:0-63");
+
+static struct attribute *format_acr_attrs[] = {
+	&format_attr_acr_mask.attr,
+	NULL
+};
+
+static umode_t
+acr_is_visible(struct kobject *kobj, struct attribute *attr, int i)
+{
+	struct device *dev = kobj_to_dev(kobj);
+
+	return hybrid(dev_get_drvdata(dev), acr_cntr_mask64) ? attr->mode : 0;
+}
+
 static struct attribute_group group_events_td  = {
 	.name = "events",
 	.is_visible = td_is_visible,
@@ -5951,6 +6182,12 @@ static struct attribute_group group_format_evtsel_ext = {
 	.is_visible = evtsel_ext_is_visible,
 };
 
+static struct attribute_group group_format_acr = {
+	.name       = "format",
+	.attrs      = format_acr_attrs,
+	.is_visible = acr_is_visible,
+};
+
 static struct attribute_group group_default = {
 	.attrs      = intel_pmu_attrs,
 	.is_visible = default_is_visible,
@@ -5965,6 +6202,7 @@ static const struct attribute_group *attr_update[] = {
 	&group_format_extra,
 	&group_format_extra_skl,
 	&group_format_evtsel_ext,
+	&group_format_acr,
 	&group_default,
 	NULL,
 };
@@ -6249,6 +6487,7 @@ static const struct attribute_group *hybrid_attr_update[] = {
 	&group_caps_lbr,
 	&hybrid_group_format_extra,
 	&group_format_evtsel_ext,
+	&group_format_acr,
 	&group_default,
 	&hybrid_group_cpus,
 	NULL,
diff --git a/arch/x86/events/perf_event.h b/arch/x86/events/perf_event.h
index 1ee6d7bb10a3..f6f2c0e60043 100644
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -115,6 +115,11 @@ static inline bool is_branch_counters_group(struct perf_event *event)
 	return event->group_leader->hw.flags & PERF_X86_EVENT_BRANCH_COUNTERS;
 }
 
+static inline bool is_acr_event_group(struct perf_event *event)
+{
+	return event->group_leader->hw.flags & PERF_X86_EVENT_ACR;
+}
+
 struct amd_nb {
 	int nb_id;  /* NorthBridge id */
 	int refcnt; /* reference count */
@@ -281,6 +286,10 @@ struct cpu_hw_events {
 	u64			fixed_ctrl_val;
 	u64			active_fixed_ctrl_val;
 
+	/* Intel ACR configuration */
+	u64			acr_cfg_b[X86_PMC_IDX_MAX];
+	u64			acr_cfg_c[X86_PMC_IDX_MAX];
+
 	/*
 	 * Intel LBR bits
 	 */
diff --git a/arch/x86/events/perf_event_flags.h b/arch/x86/events/perf_event_flags.h
index 6c977c19f2cd..f21d00965af6 100644
--- a/arch/x86/events/perf_event_flags.h
+++ b/arch/x86/events/perf_event_flags.h
@@ -9,7 +9,7 @@ PERF_ARCH(PEBS_LD_HSW,		0x00008) /* haswell style datala, load */
 PERF_ARCH(PEBS_NA_HSW,		0x00010) /* haswell style datala, unknown */
 PERF_ARCH(EXCL,			0x00020) /* HT exclusivity on counter */
 PERF_ARCH(DYNAMIC,		0x00040) /* dynamic alloc'd constraint */
-			/*	0x00080	*/
+PERF_ARCH(ACR,			0x00080) /* Auto counter reload */
 PERF_ARCH(EXCL_ACCT,		0x00100) /* accounted EXCL event */
 PERF_ARCH(AUTO_RELOAD,		0x00200) /* use PEBS auto-reload */
 PERF_ARCH(LARGE_PEBS,		0x00400) /* use large PEBS */
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 3ae84c3b8e6d..fdc0279c6917 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -591,7 +591,11 @@
 /* V6 PMON MSR range */
 #define MSR_IA32_PMC_V6_GP0_CTR		0x1900
 #define MSR_IA32_PMC_V6_GP0_CFG_A	0x1901
+#define MSR_IA32_PMC_V6_GP0_CFG_B	0x1902
+#define MSR_IA32_PMC_V6_GP0_CFG_C	0x1903
 #define MSR_IA32_PMC_V6_FX0_CTR		0x1980
+#define MSR_IA32_PMC_V6_FX0_CFG_B	0x1982
+#define MSR_IA32_PMC_V6_FX0_CFG_C	0x1983
 #define MSR_IA32_PMC_V6_STEP		4
 
 /* KeyID partitioning between MKTME and TDX */
diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index fb908843f209..f540ab12c678 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -157,7 +157,9 @@ struct hw_perf_event {
 	union {
 		struct { /* hardware */
 			u64		config;
+			u64		config1;
 			u64		last_tag;
+			u64		dyn_mask;
 			unsigned long	config_base;
 			unsigned long	event_base;
 			int		event_base_rdpmc;
-- 
2.38.1


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH V2 3/3] perf/x86/intel: Support auto counter reload
  2024-10-10 19:28 ` [PATCH V2 3/3] perf/x86/intel: Support " kan.liang
@ 2025-03-14 10:20   ` Peter Zijlstra
  2025-03-14 13:48     ` Liang, Kan
  0 siblings, 1 reply; 10+ messages in thread
From: Peter Zijlstra @ 2025-03-14 10:20 UTC (permalink / raw)
  To: kan.liang
  Cc: mingo, acme, namhyung, irogers, adrian.hunter, ak, linux-kernel,
	eranian, thomas.falcon

On Thu, Oct 10, 2024 at 12:28:44PM -0700, kan.liang@linux.intel.com wrote:

> diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
> index 726ef13c2c81..d3bdc7d18d3f 100644
> --- a/arch/x86/events/intel/core.c
> +++ b/arch/x86/events/intel/core.c
> @@ -2851,6 +2851,54 @@ static void intel_pmu_enable_fixed(struct perf_event *event)
>  	cpuc->fixed_ctrl_val |= bits;
>  }
>  
> +static void intel_pmu_config_acr(int idx, u64 mask, u32 reload)
> +{
> +	struct cpu_hw_events *cpuc = this_cpu_ptr(&cpu_hw_events);
> +	int msr_b, msr_c;
> +
> +	if (!mask && !cpuc->acr_cfg_b[idx])
> +		return;
> +
> +	if (idx < INTEL_PMC_IDX_FIXED) {
> +		msr_b = MSR_IA32_PMC_V6_GP0_CFG_B;
> +		msr_c = MSR_IA32_PMC_V6_GP0_CFG_C;
> +	} else {
> +		msr_b = MSR_IA32_PMC_V6_FX0_CFG_B;
> +		msr_c = MSR_IA32_PMC_V6_FX0_CFG_C;
> +		idx -= INTEL_PMC_IDX_FIXED;
> +	}
> +
> +	if (cpuc->acr_cfg_b[idx] != mask) {
> +		wrmsrl(msr_b + x86_pmu.addr_offset(idx, false), mask);
> +		cpuc->acr_cfg_b[idx] = mask;
> +	}
> +	/* Only need to update the reload value when there is a valid config value. */
> +	if (mask && cpuc->acr_cfg_c[idx] != reload) {
> +		wrmsrl(msr_c + x86_pmu.addr_offset(idx, false), reload);
> +		cpuc->acr_cfg_c[idx] = reload;
> +	}
> +}
> +
> +static void intel_pmu_enable_acr(struct perf_event *event)
> +{
> +	struct hw_perf_event *hwc = &event->hw;
> +
> +	/* The PMU doesn't support ACR */
> +	if (!hybrid(event->pmu, acr_cntr_mask64))
> +		return;

In which case we shouldn't have been able to create such an event,
right?

But you need this because you do the cleanup here. Hmm, perhaps use a
static_call() to elide the whole call if the hardware doesn't support
this?

> +
> +	if (!is_acr_event_group(event) || !event->attr.config2) {
> +		/*
> +		 * The disable doesn't clear the ACR CFG register.
> +		 * Check and clear the ACR CFG register.
> +		 */
> +		intel_pmu_config_acr(hwc->idx, 0, 0);
> +		return;
> +	}
> +
> +	intel_pmu_config_acr(hwc->idx, hwc->config1, -hwc->sample_period);
> +}
> +
>  static void intel_pmu_enable_event(struct perf_event *event)
>  {
>  	u64 enable_mask = ARCH_PERFMON_EVENTSEL_ENABLE;
> @@ -2866,8 +2914,11 @@ static void intel_pmu_enable_event(struct perf_event *event)
>  			enable_mask |= ARCH_PERFMON_EVENTSEL_BR_CNTR;
>  		intel_set_masks(event, idx);
>  		__x86_pmu_enable_event(hwc, enable_mask);
> +		intel_pmu_enable_acr(event);
>  		break;
>  	case INTEL_PMC_IDX_FIXED ... INTEL_PMC_IDX_FIXED_BTS - 1:
> +		intel_pmu_enable_acr(event);
> +		fallthrough;
>  	case INTEL_PMC_IDX_METRIC_BASE ... INTEL_PMC_IDX_METRIC_END:
>  		intel_pmu_enable_fixed(event);
>  		break;

This order is somewhat inconsistent. The GP events get enable,acr, while
the FIXED ones get acr,enable.

> @@ -3687,6 +3738,12 @@ intel_get_event_constraints(struct cpu_hw_events *cpuc, int idx,
>  		c2->weight = hweight64(c2->idxmsk64);
>  	}
>  
> +	if (is_acr_event_group(event)) {
> +		c2 = dyn_constraint(cpuc, c2, idx);
> +		c2->idxmsk64 &= event->hw.dyn_mask;
> +		c2->weight = hweight64(c2->idxmsk64);
> +	}
> +
>  	return c2;
>  }
>  
> @@ -3945,6 +4002,78 @@ static inline bool intel_pmu_has_cap(struct perf_event *event, int idx)
>  	return test_bit(idx, (unsigned long *)&intel_cap->capabilities);
>  }
>  
> +static bool intel_pmu_is_acr_group(struct perf_event *event)
> +{
> +	if (!hybrid(event->pmu, acr_cntr_mask64))
> +		return false;

Why the above check? Aaaah, because the function does two distinct
things. Perhaps split?

> +	/* The group leader has the ACR flag set */
> +	if (is_acr_event_group(event))
> +		return true;
> +
> +	/* The acr_mask is set */
> +	if (event->attr.config2)
> +		return true;
> +
> +	return false;
> +}
> +
> +static int intel_pmu_acr_check_reloadable_event(struct perf_event *event)
> +{
> +	struct perf_event *sibling, *leader = event->group_leader;
> +	int num = 0;
> +
> +	/*
> +	 * The acr_mask(config2) indicates the event can be reloaded by
> +	 * other events. Apply the acr_cntr_mask.
> +	 */
> +	if (leader->attr.config2) {
> +		leader->hw.dyn_mask = hybrid(leader->pmu, acr_cntr_mask64);
> +		num++;
> +	} else
> +		leader->hw.dyn_mask = ~0ULL;

Here..

> +	for_each_sibling_event(sibling, leader) {
> +		if (sibling->attr.config2) {
> +			sibling->hw.dyn_mask = hybrid(sibling->pmu, acr_cntr_mask64);
> +			num++;
> +		} else
> +			sibling->hw.dyn_mask = ~0ULL;

.. here ..

> +	}
> +
> +	if (event->attr.config2) {
> +		event->hw.dyn_mask = hybrid(event->pmu, acr_cntr_mask64);
> +		num++;
> +	} else
> +		event->hw.dyn_mask = ~0ULL;

.. and here, the else branch lost its {}.

Also, you managed to write the exact same logic 3 times, surely that can
be a helper function?

> +
> +	if (num > hweight64(hybrid(event->pmu, acr_cntr_mask64)))
> +		return -EINVAL;
> +
> +	return 0;
> +}
> +
> +/*
> + * Update the dyn_mask of each event to guarantee the event is scheduled
> + * to the counters which be able to cause a reload.
> + */
> +static void intel_pmu_set_acr_dyn_mask(struct perf_event *event, int idx,
> +				       struct perf_event *last)
> +{
> +	struct perf_event *sibling, *leader = event->group_leader;
> +	u64 mask = hybrid(event->pmu, acr_cntr_cause_mask);
> +
> +	/* The event set in the acr_mask(config2) can causes a reload. */
> +	if (test_bit(idx, (unsigned long *)&leader->attr.config2))
> +		event->hw.dyn_mask &= mask;
> +	for_each_sibling_event(sibling, leader) {
> +		if (test_bit(idx, (unsigned long *)&sibling->attr.config2))
> +			event->hw.dyn_mask &= mask;
> +	}
> +	if (test_bit(idx, (unsigned long *)&last->attr.config2))
> +		event->hw.dyn_mask &= mask;
> +}
> +
>  static int intel_pmu_hw_config(struct perf_event *event)
>  {
>  	int ret = x86_pmu_hw_config(event);
> @@ -4056,6 +4185,50 @@ static int intel_pmu_hw_config(struct perf_event *event)
>  		event->hw.flags |= PERF_X86_EVENT_PEBS_VIA_PT;
>  	}
>  
> +	if (intel_pmu_is_acr_group(event)) {

This is where you want to check 2 things:

 - does hardare suport ACR, if not, fail
 - is event ACR, check constraints.

Current code obscures this distinction.

> +		struct perf_event *sibling, *leader = event->group_leader;
> +		int event_idx = 0;
> +
> +		/* Not support perf metrics */
> +		if (is_metric_event(event))
> +			return -EINVAL;
> +
> +		/* Not support freq mode */
> +		if (event->attr.freq)
> +			return -EINVAL;
> +
> +		/* PDist is not supported */
> +		if (event->attr.config2 && event->attr.precise_ip > 2)
> +			return -EINVAL;
> +
> +		/* The reload value cannot exceeds the max period */
> +		if (event->attr.sample_period > x86_pmu.max_period)
> +			return -EINVAL;
> +		/*
> +		 * It's hard to know whether the event is the last one of
> +		 * the group. Reconfigure the dyn_mask of each X86 event
> +		 * every time when add a new event.
> +		 * It's impossible to verify whether the bits of
> +		 * the event->attr.config2 exceeds the group. But it's
> +		 * harmless, because the invalid bits are ignored. See
> +		 * intel_pmu_update_acr_mask(). The n - n0 guarantees that
> +		 * only the bits in the group is used.
> +		 *
> +		 * Check whether the reloadable counters is enough and
> +		 * initialize the dyn_mask.
> +		 */
> +		if (intel_pmu_acr_check_reloadable_event(event))
> +			return -EINVAL;
> +
> +		/* Reconfigure the dyn_mask for each event */
> +		intel_pmu_set_acr_dyn_mask(leader, event_idx++, event);
> +		for_each_sibling_event(sibling, leader)
> +			intel_pmu_set_acr_dyn_mask(sibling, event_idx++, event);
> +		intel_pmu_set_acr_dyn_mask(event, event_idx, event);
> +
> +		leader->hw.flags |= PERF_X86_EVENT_ACR;
> +	}
> +
>  	if ((event->attr.type == PERF_TYPE_HARDWARE) ||
>  	    (event->attr.type == PERF_TYPE_HW_CACHE))
>  		return 0;
> @@ -4159,6 +4332,49 @@ static int intel_pmu_hw_config(struct perf_event *event)
>  	return 0;
>  }
>  
> +static void intel_pmu_update_acr_mask(struct cpu_hw_events *cpuc, int n, int *assign)
> +{
> +	struct perf_event *event;
> +	int n0, i, off;
> +
> +	if (cpuc->txn_flags & PERF_PMU_TXN_ADD)
> +		n0 = cpuc->n_events - cpuc->n_txn;
> +	else
> +		n0 = cpuc->n_events;
> +
> +	for (i = n0; i < n; i++) {
> +		event = cpuc->event_list[i];
> +		event->hw.config1 = 0;
> +
> +		/* Convert the group index into the counter index */
> +		for_each_set_bit(off, (unsigned long *)&event->attr.config2, n - n0)
> +			set_bit(assign[n0 + off], (unsigned long *)&event->hw.config1);

Atomic set_bit() is required?

> +	}
> +}
> +
> +static int intel_pmu_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
> +{
> +	struct perf_event *event;
> +	int ret = x86_schedule_events(cpuc, n, assign);
> +
> +	if (ret)
> +		return ret;
> +
> +	if (cpuc->is_fake)
> +		return ret;
> +
> +	event = cpuc->event_list[n - 1];

ISTR seeing this pattern before somewhere and then argued it was all
sorts of broken. Why is it sane to look at the last event here?

> +	/*
> +	 * The acr_mask(config2) is the event-enabling order.
> +	 * Update it to the counter order after the counters are assigned.
> +	 */
> +	if (event && is_acr_event_group(event))
> +		intel_pmu_update_acr_mask(cpuc, n, assign);
> +
> +	return 0;
> +}
> +
> +
>  /*
>   * Currently, the only caller of this function is the atomic_switch_perf_msrs().
>   * The host perf context helps to prepare the values of the real hardware for
> @@ -5305,7 +5521,7 @@ static __initconst const struct x86_pmu intel_pmu = {
>  	.set_period		= intel_pmu_set_period,
>  	.update			= intel_pmu_update,
>  	.hw_config		= intel_pmu_hw_config,
> -	.schedule_events	= x86_schedule_events,
> +	.schedule_events	= intel_pmu_schedule_events,
>  	.eventsel		= MSR_ARCH_PERFMON_EVENTSEL0,
>  	.perfctr		= MSR_ARCH_PERFMON_PERFCTR0,
>  	.fixedctr		= MSR_ARCH_PERFMON_FIXED_CTR0,

How about only setting that function if the PMU actually support ACR ?

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V2 3/3] perf/x86/intel: Support auto counter reload
  2025-03-14 10:20   ` Peter Zijlstra
@ 2025-03-14 13:48     ` Liang, Kan
  2025-03-14 18:48       ` Liang, Kan
  0 siblings, 1 reply; 10+ messages in thread
From: Liang, Kan @ 2025-03-14 13:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, acme, namhyung, irogers, adrian.hunter, ak, linux-kernel,
	eranian, thomas.falcon



On 2025-03-14 6:20 a.m., Peter Zijlstra wrote:
> On Thu, Oct 10, 2024 at 12:28:44PM -0700, kan.liang@linux.intel.com wrote:
> 
>> @@ -4159,6 +4332,49 @@ static int intel_pmu_hw_config(struct perf_event *event)
>>  	return 0;
>>  }
>>  
>> +static void intel_pmu_update_acr_mask(struct cpu_hw_events *cpuc, int n, int *assign)
>> +{
>> +	struct perf_event *event;
>> +	int n0, i, off;
>> +
>> +	if (cpuc->txn_flags & PERF_PMU_TXN_ADD)
>> +		n0 = cpuc->n_events - cpuc->n_txn;
>> +	else
>> +		n0 = cpuc->n_events;
>> +
>> +	for (i = n0; i < n; i++) {
>> +		event = cpuc->event_list[i];
>> +		event->hw.config1 = 0;
>> +
>> +		/* Convert the group index into the counter index */
>> +		for_each_set_bit(off, (unsigned long *)&event->attr.config2, n - n0)
>> +			set_bit(assign[n0 + off], (unsigned long *)&event->hw.config1);
> 
> Atomic set_bit() is required?

I don't think so. Will change it to __set_bit().

> 
>> +	}
>> +}
>> +
>> +static int intel_pmu_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
>> +{
>> +	struct perf_event *event;
>> +	int ret = x86_schedule_events(cpuc, n, assign);
>> +
>> +	if (ret)
>> +		return ret;
>> +
>> +	if (cpuc->is_fake)
>> +		return ret;
>> +
>> +	event = cpuc->event_list[n - 1];
> 
> ISTR seeing this pattern before somewhere and then argued it was all
> sorts of broken. Why is it sane to look at the last event here?

The schedule_events() is invoked for only two cases, a new event or a
new group. Since the event_list[] is in enabled order, the last event
should be either the new event or the last event of the new group.

The is_acr_event_group() always checks the leader's flag. It doesn't
matter which event in the ACR group is used to do the check.

Checking the last event should be good enough to cover both cases.

> 
>> +	/*
>> +	 * The acr_mask(config2) is the event-enabling order.
>> +	 * Update it to the counter order after the counters are assigned.
>> +	 */
>> +	if (event && is_acr_event_group(event))
>> +		intel_pmu_update_acr_mask(cpuc, n, assign);
>> +
>> +	return 0;
>> +}
>> +
>> +
>>  /*
>>   * Currently, the only caller of this function is the atomic_switch_perf_msrs().
>>   * The host perf context helps to prepare the values of the real hardware for
>> @@ -5305,7 +5521,7 @@ static __initconst const struct x86_pmu intel_pmu = {
>>  	.set_period		= intel_pmu_set_period,
>>  	.update			= intel_pmu_update,
>>  	.hw_config		= intel_pmu_hw_config,
>> -	.schedule_events	= x86_schedule_events,
>> +	.schedule_events	= intel_pmu_schedule_events,
>>  	.eventsel		= MSR_ARCH_PERFMON_EVENTSEL0,
>>  	.perfctr		= MSR_ARCH_PERFMON_PERFCTR0,
>>  	.fixedctr		= MSR_ARCH_PERFMON_FIXED_CTR0,
> 
> How about only setting that function if the PMU actually support ACR ?

Sure. I will also address the other comments which I didn't reply above
in the email.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V2 3/3] perf/x86/intel: Support auto counter reload
  2025-03-14 13:48     ` Liang, Kan
@ 2025-03-14 18:48       ` Liang, Kan
  0 siblings, 0 replies; 10+ messages in thread
From: Liang, Kan @ 2025-03-14 18:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, acme, namhyung, irogers, adrian.hunter, ak, linux-kernel,
	eranian, thomas.falcon

On 2025-03-14 9:48 a.m., Liang, Kan wrote:
>>> +	}
>>> +}
>>> +
>>> +static int intel_pmu_schedule_events(struct cpu_hw_events *cpuc, int n, int *assign)
>>> +{
>>> +	struct perf_event *event;
>>> +	int ret = x86_schedule_events(cpuc, n, assign);
>>> +
>>> +	if (ret)
>>> +		return ret;
>>> +
>>> +	if (cpuc->is_fake)
>>> +		return ret;
>>> +
>>> +	event = cpuc->event_list[n - 1];
>> ISTR seeing this pattern before somewhere and then argued it was all
>> sorts of broken. Why is it sane to look at the last event here?
> The schedule_events() is invoked for only two cases, a new event or a
> new group. Since the event_list[] is in enabled order, the last event
> should be either the new event or the last event of the new group.
> 
> The is_acr_event_group() always checks the leader's flag. It doesn't
> matter which event in the ACR group is used to do the check.
> 
> Checking the last event should be good enough to cover both cases.

This is an old implementation. Actually, I once sent a V3 last month
which move the codes to late_setup(). The late_setup was introduced
by the counters snapshotting feature. It does a late configuration in
the x86_pmu_enable() after the counters are assigned.
https://lore.kernel.org/lkml/173874832555.10177.18398857610370220622.tip-bot2@tip-bot2/

We don't need to check the last event anymore.

The V3 optimize the late_setup() a little bit.
https://lore.kernel.org/lkml/20250213211718.2406744-3-kan.liang@linux.intel.com/

and extend it for both counters snapshotting and ACR.
https://lore.kernel.org/lkml/20250213211718.2406744-6-kan.liang@linux.intel.com/

But other comments still stand. I will send a V4 later.

Thanks,
Kan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V2 0/3] Support auto counter reload
  2024-10-10 19:28 [PATCH V2 0/3] Support auto counter reload kan.liang
                   ` (2 preceding siblings ...)
  2024-10-10 19:28 ` [PATCH V2 3/3] perf/x86/intel: Support " kan.liang
@ 2024-11-04 20:37 ` Liang, Kan
  2025-03-14  9:51 ` Ingo Molnar
  4 siblings, 0 replies; 10+ messages in thread
From: Liang, Kan @ 2024-11-04 20:37 UTC (permalink / raw)
  To: peterz, mingo, acme, namhyung, irogers, adrian.hunter, ak,
	linux-kernel
  Cc: eranian, thomas.falcon

Hi Peter,

Ping. Could you please let me know if you have any comments.

Thanks,
Kan

On 2024-10-10 3:28 p.m., kan.liang@linux.intel.com wrote:
> From: Kan Liang <kan.liang@linux.intel.com>
> 
> Changes since V1:
> - Add a check to the reload value which cannot exceeds the max period
> - Avoid invoking intel_pmu_enable_acr() for the perf metrics event.
> - Update comments explain to case which the event->attr.config2 exceeds
>   the group size
> 
> The relative rates among two or more events are useful for performance
> analysis, e.g., a high branch miss rate may indicate a performance
> issue. Usually, the samples with a relative rate that exceeds some
> threshold are more useful. However, the traditional sampling takes
> samples of events separately. To get the relative rates among two or
> more events, a high sample rate is required, which can bring high
> overhead. Many samples taken in the non-hotspot area are also dropped
> (useless) in the post-process.
> 
> Auto Counter Reload (ACR) provides a means for software to specify that,
> for each supported counter, the hardware should automatically reload the
> counter to a specified initial value upon overflow of chosen counters.
> This mechanism enables software to sample based on the relative rate of
> two (or more) events, such that a sample (PMI or PEBS) is taken only if
> the rate of one event exceeds some threshold relative to the rate of
> another event. Taking a PMI or PEBS only when the relative rate of
> perfmon events crosses a threshold can have significantly less
> performance overhead than other techniques.
> 
> The details can be found at Intel Architecture Instruction Set
> Extensions and Future Features (053) 8.7 AUTO COUNTER RELOAD.
> 
> Examples:
> 
> Here is the snippet of the mispredict.c. Since the array has random
> numbers, jumps are random and often mispredicted.
> The mispredicted rate depends on the compared value.
> 
> For the Loop1, ~11% of all branches are mispredicted.
> For the Loop2, ~21% of all branches are mispredicted.
> 
> main()
> {
> ...
>         for (i = 0; i < N; i++)
>                 data[i] = rand() % 256;
> ...
>         /* Loop 1 */
>         for (k = 0; k < 50; k++)
>                 for (i = 0; i < N; i++)
>                         if (data[i] >= 64)
>                                 sum += data[i];
> ...
> 
> ...
>         /* Loop 2 */
>         for (k = 0; k < 50; k++)
>                 for (i = 0; i < N; i++)
>                         if (data[i] >= 128)
>                                 sum += data[i];
> ...
> }
> 
> Usually, a code with a high branch miss rate means a bad performance.
> To understand the branch miss rate of the codes, the traditional method
> usually sample both branches and branch-misses events. E.g.,
> perf record -e "{cpu_atom/branch-misses/ppu, cpu_atom/branch-instructions/u}"
>                -c 1000000 -- ./mispredict
> 
> [ perf record: Woken up 4 times to write data ]
> [ perf record: Captured and wrote 0.925 MB perf.data (5106 samples) ]
> The 5106 samples are from both events and spread in both Loops.
> In the post process stage, a user can know that the Loop 2 has a 21%
> branch miss rate. Then they can focus on the samples of branch-misses
> events for the Loop 2.
> 
> With this patch, the user can generate the samples only when the branch
> miss rate > 20%.
> perf record -e "{cpu_atom/branch-misses,period=200000,acr_mask=0x2/ppu,
>                  cpu_atom/branch-instructions,period=1000000,acr_mask=0x3/u}"
>                 -- ./mispredict
> [ perf record: Woken up 1 times to write data ]
> [ perf record: Captured and wrote 0.098 MB perf.data (2498 samples) ]
> 
>  $perf report
> 
> Percent       │154:   movl    $0x0,-0x14(%rbp)
>               │     ↓ jmp     1af
>               │     for (i = j; i < N; i++)
>               │15d:   mov     -0x10(%rbp),%eax
>               │       mov     %eax,-0x18(%rbp)
>               │     ↓ jmp     1a2
>               │     if (data[i] >= 128)
>               │165:   mov     -0x18(%rbp),%eax
>               │       cltq
>               │       lea     0x0(,%rax,4),%rdx
>               │       mov     -0x8(%rbp),%rax
>               │       add     %rdx,%rax
>               │       mov     (%rax),%eax
>               │    ┌──cmp     $0x7f,%eax
> 100.00   0.00 │    ├──jle     19e
>               │    │sum += data[i];
> 
> The 2498 samples are all from the branch-misses events for the Loop 2.
> 
> The number of samples and overhead is significantly reduced without
> losing any information.
> 
> Kan Liang (3):
>   perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF
>   perf/x86/intel: Add the enumeration and flag for the auto counter
>     reload
>   perf/x86/intel: Support auto counter reload
> 
>  arch/x86/events/intel/core.c       | 262 ++++++++++++++++++++++++++++-
>  arch/x86/events/perf_event.h       |  21 +++
>  arch/x86/events/perf_event_flags.h |   2 +-
>  arch/x86/include/asm/msr-index.h   |   4 +
>  arch/x86/include/asm/perf_event.h  |   4 +-
>  include/linux/perf_event.h         |   2 +
>  6 files changed, 288 insertions(+), 7 deletions(-)
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V2 0/3] Support auto counter reload
  2024-10-10 19:28 [PATCH V2 0/3] Support auto counter reload kan.liang
                   ` (3 preceding siblings ...)
  2024-11-04 20:37 ` [PATCH V2 0/3] " Liang, Kan
@ 2025-03-14  9:51 ` Ingo Molnar
  2025-03-14 13:06   ` Liang, Kan
  4 siblings, 1 reply; 10+ messages in thread
From: Ingo Molnar @ 2025-03-14  9:51 UTC (permalink / raw)
  To: kan.liang
  Cc: peterz, acme, namhyung, irogers, adrian.hunter, ak, linux-kernel,
	eranian, thomas.falcon


* kan.liang@linux.intel.com <kan.liang@linux.intel.com> wrote:

> From: Kan Liang <kan.liang@linux.intel.com>
> 
> Changes since V1:
> - Add a check to the reload value which cannot exceeds the max period
> - Avoid invoking intel_pmu_enable_acr() for the perf metrics event.
> - Update comments explain to case which the event->attr.config2 exceeds
>   the group size

> The 2498 samples are all from the branch-misses events for the Loop 2.
> 
> The number of samples and overhead is significantly reduced without
> losing any information.

Ok, that looks like a pretty sweet PMU feature.

What is the hardware support range of this auto count reload feature, 
how recent CPU does one have to have?

The series has aged a bit though, while a variant of patch #1 has been 
merged already under:

  47a973fd7563 perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF

... but #2 and #3 don't apply cleanly anymore.

Mind sending a refreshed series perhaps?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH V2 0/3] Support auto counter reload
  2025-03-14  9:51 ` Ingo Molnar
@ 2025-03-14 13:06   ` Liang, Kan
  0 siblings, 0 replies; 10+ messages in thread
From: Liang, Kan @ 2025-03-14 13:06 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: peterz, acme, namhyung, irogers, adrian.hunter, ak, linux-kernel,
	eranian, thomas.falcon



On 2025-03-14 5:51 a.m., Ingo Molnar wrote:
> 
> * kan.liang@linux.intel.com <kan.liang@linux.intel.com> wrote:
> 
>> From: Kan Liang <kan.liang@linux.intel.com>
>>
>> Changes since V1:
>> - Add a check to the reload value which cannot exceeds the max period
>> - Avoid invoking intel_pmu_enable_acr() for the perf metrics event.
>> - Update comments explain to case which the event->attr.config2 exceeds
>>   the group size
> 
>> The 2498 samples are all from the branch-misses events for the Loop 2.
>>
>> The number of samples and overhead is significantly reduced without
>> losing any information.
> 
> Ok, that looks like a pretty sweet PMU feature.
> 

Thanks for the review.

> What is the hardware support range of this auto count reload feature, 
> how recent CPU does one have to have?

The feature was first introduced into the Sierra Forest server, which
was launched last year.
https://en.wikipedia.org/wiki/Sierra_Forest

All the future platforms should have it support as well.

> 
> The series has aged a bit though, while a variant of patch #1 has been 
> merged already under:
> 
>   47a973fd7563 perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF
> 
> ... but #2 and #3 don't apply cleanly anymore.
> 
> Mind sending a refreshed series perhaps?

No problem.

I saw Peter just gives several feedback. I will also address the
concerns in the new series.

Thanks,
Kan



^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-03-14 18:48 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-10 19:28 [PATCH V2 0/3] Support auto counter reload kan.liang
2024-10-10 19:28 ` [PATCH V2 1/3] perf/x86/intel: Fix ARCH_PERFMON_NUM_COUNTER_LEAF kan.liang
2024-10-10 19:28 ` [PATCH V2 2/3] perf/x86/intel: Add the enumeration and flag for the auto counter reload kan.liang
2024-10-10 19:28 ` [PATCH V2 3/3] perf/x86/intel: Support " kan.liang
2025-03-14 10:20   ` Peter Zijlstra
2025-03-14 13:48     ` Liang, Kan
2025-03-14 18:48       ` Liang, Kan
2024-11-04 20:37 ` [PATCH V2 0/3] " Liang, Kan
2025-03-14  9:51 ` Ingo Molnar
2025-03-14 13:06   ` Liang, Kan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox