* [PATCH v2 0/2] perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores
@ 2025-08-20 8:45 Yicong Yang
2025-08-20 8:45 ` [PATCH v2 1/2] perf: arm_pmuv3: Factor out PMCCNTR_EL0 use conditions Yicong Yang
` (3 more replies)
0 siblings, 4 replies; 12+ messages in thread
From: Yicong Yang @ 2025-08-20 8:45 UTC (permalink / raw)
To: will, mark.rutland, linux-arm-kernel
Cc: sudeep.holla, james.clark, robh, anshuman.khandual,
jonathan.cameron, hejunhao3, linuxarm, prime.zeng, xuwei5,
wangyushan12, yangyicong
From: Yicong Yang <yangyicong@hisilicon.com>
This patchset fixed CPU_CYCLES counting on SMT system. CPU_CYCLES can use
PMCCNTR_EL0 which will count the hardware processor clock rather than the
PE clock (ARM DDI0487 L.b D13.1.3) on SMT cores which fails the users
expectation as CPU_CYCLES (0x0011) is defined to count on each PE cycles.
Fix this by avoid using PMCCNTR_EL0 on SMT cores when counting CPU_CYCLES.
Changes since v1:
- remove redundant comment and add tags on Patch 1/2
- detect the SMT implementation during PMU probe rather than runtime
Link: https://lore.kernel.org/linux-arm-kernel/20250812080830.20796-1-yangyicong@huawei.com/
Yicong Yang (2):
perf: arm_pmuv3: Factor out PMCCNTR_EL0 use conditions
perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores
drivers/perf/arm_pmu.c | 3 +++
drivers/perf/arm_pmuv3.c | 39 +++++++++++++++++++++++++++++++++--
include/linux/arch_topology.h | 11 ++++++++++
include/linux/perf/arm_pmu.h | 1 +
4 files changed, 52 insertions(+), 2 deletions(-)
--
2.24.0
^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH v2 1/2] perf: arm_pmuv3: Factor out PMCCNTR_EL0 use conditions
2025-08-20 8:45 [PATCH v2 0/2] perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores Yicong Yang
@ 2025-08-20 8:45 ` Yicong Yang
2025-08-20 8:45 ` [PATCH v2 2/2] perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores Yicong Yang
` (2 subsequent siblings)
3 siblings, 0 replies; 12+ messages in thread
From: Yicong Yang @ 2025-08-20 8:45 UTC (permalink / raw)
To: will, mark.rutland, linux-arm-kernel
Cc: sudeep.holla, james.clark, robh, anshuman.khandual,
jonathan.cameron, hejunhao3, linuxarm, prime.zeng, xuwei5,
wangyushan12, yangyicong
From: Yicong Yang <yangyicong@hisilicon.com>
PMCCNTR_EL0 is preferred for counting CPU_CYCLES under certain
conditions. Factor out the condition check to a separate function
for further extension. Add documents for better understanding.
No functional changes intended.
Reviewed-by: James Clark <james.clark@linaro.org>
Acked-by: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
---
drivers/perf/arm_pmuv3.c | 29 +++++++++++++++++++++++++++--
1 file changed, 27 insertions(+), 2 deletions(-)
diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index f6d7bab5d555..69c5cc8f5606 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -978,6 +978,32 @@ static int armv8pmu_get_chain_idx(struct pmu_hw_events *cpuc,
return -EAGAIN;
}
+static bool armv8pmu_can_use_pmccntr(struct pmu_hw_events *cpuc,
+ struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ unsigned long evtype = hwc->config_base & ARMV8_PMU_EVTYPE_EVENT;
+
+ if (evtype != ARMV8_PMUV3_PERFCTR_CPU_CYCLES)
+ return false;
+
+ /*
+ * A CPU_CYCLES event with threshold counting cannot use PMCCNTR_EL0
+ * since it lacks threshold support.
+ */
+ if (armv8pmu_event_get_threshold(&event->attr))
+ return false;
+
+ /*
+ * PMCCNTR_EL0 is not affected by BRBE controls like BRBCR_ELx.FZP.
+ * So don't use it for branch events.
+ */
+ if (has_branch_stack(event))
+ return false;
+
+ return true;
+}
+
static int armv8pmu_get_event_idx(struct pmu_hw_events *cpuc,
struct perf_event *event)
{
@@ -986,8 +1012,7 @@ static int armv8pmu_get_event_idx(struct pmu_hw_events *cpuc,
unsigned long evtype = hwc->config_base & ARMV8_PMU_EVTYPE_EVENT;
/* Always prefer to place a cycle counter into the cycle counter. */
- if ((evtype == ARMV8_PMUV3_PERFCTR_CPU_CYCLES) &&
- !armv8pmu_event_get_threshold(&event->attr) && !has_branch_stack(event)) {
+ if (armv8pmu_can_use_pmccntr(cpuc, event)) {
if (!test_and_set_bit(ARMV8_PMU_CYCLE_IDX, cpuc->used_mask))
return ARMV8_PMU_CYCLE_IDX;
else if (armv8pmu_event_is_64bit(event) &&
--
2.24.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH v2 2/2] perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores
2025-08-20 8:45 [PATCH v2 0/2] perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores Yicong Yang
2025-08-20 8:45 ` [PATCH v2 1/2] perf: arm_pmuv3: Factor out PMCCNTR_EL0 use conditions Yicong Yang
@ 2025-08-20 8:45 ` Yicong Yang
2025-09-18 13:32 ` Will Deacon
2025-09-08 7:09 ` [PATCH v2 0/2] " Yicong Yang
2025-09-18 16:43 ` Will Deacon
3 siblings, 1 reply; 12+ messages in thread
From: Yicong Yang @ 2025-08-20 8:45 UTC (permalink / raw)
To: will, mark.rutland, linux-arm-kernel
Cc: sudeep.holla, james.clark, robh, anshuman.khandual,
jonathan.cameron, hejunhao3, linuxarm, prime.zeng, xuwei5,
wangyushan12, yangyicong
From: Yicong Yang <yangyicong@hisilicon.com>
CPU_CYCLES is expected to count the logical CPU (PE) clock. Currently it's
preferred to use PMCCNTR_EL0 for counting CPU_CYCLES, but it'll count
processor clock rather than the PE clock (ARM DDI0487 L.b D13.1.3) if
one of the SMT siblings is not idle on a multi-threaded implementation.
So don't use it on SMT cores.
Introduce topology_core_has_smt() for knowing the SMT implementation and
cached it in arm_pmu::has_smt during allocation.
When counting cycles on SMT CPU 2-3 and CPU 3 is idle, without this
patch we'll get:
[root@client1 tmp]# perf stat -e cycles -A -C 2-3 -- stress-ng -c 1
--taskset 2 --timeout 1
[...]
Performance counter stats for 'CPU(s) 2-3':
CPU2 2880457316 cycles
CPU3 2880459810 cycles
1.254688470 seconds time elapsed
With this patch the idle state of CPU3 is observed as expected:
[root@client1 ~]# perf stat -e cycles -A -C 2-3 -- stress-ng -c 1
--taskset 2 --timeout 1
[...]
Performance counter stats for 'CPU(s) 2-3':
CPU2 2558580492 cycles
CPU3 305749 cycles
1.113626410 seconds time elapsed
Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
---
drivers/perf/arm_pmu.c | 3 +++
drivers/perf/arm_pmuv3.c | 10 ++++++++++
include/linux/arch_topology.h | 11 +++++++++++
include/linux/perf/arm_pmu.h | 1 +
4 files changed, 25 insertions(+)
diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
index 5c310e803dd7..137ef55d6973 100644
--- a/drivers/perf/arm_pmu.c
+++ b/drivers/perf/arm_pmu.c
@@ -901,6 +901,9 @@ struct arm_pmu *armpmu_alloc(void)
events = per_cpu_ptr(pmu->hw_events, cpu);
events->percpu_pmu = pmu;
+
+ if (!pmu->has_smt && topology_core_has_smt(cpu))
+ pmu->has_smt = true;
}
return pmu;
diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
index 69c5cc8f5606..32b58a0feb33 100644
--- a/drivers/perf/arm_pmuv3.c
+++ b/drivers/perf/arm_pmuv3.c
@@ -981,6 +981,7 @@ static int armv8pmu_get_chain_idx(struct pmu_hw_events *cpuc,
static bool armv8pmu_can_use_pmccntr(struct pmu_hw_events *cpuc,
struct perf_event *event)
{
+ struct arm_pmu *cpu_pmu = to_arm_pmu(event->pmu);
struct hw_perf_event *hwc = &event->hw;
unsigned long evtype = hwc->config_base & ARMV8_PMU_EVTYPE_EVENT;
@@ -1001,6 +1002,15 @@ static bool armv8pmu_can_use_pmccntr(struct pmu_hw_events *cpuc,
if (has_branch_stack(event))
return false;
+ /*
+ * The PMCCNTR_EL0 increments from the processor clock rather than
+ * the PE clock (ARM DDI0487 L.b D13.1.3) which means it'll continue
+ * counting on a WFI PE if one of its SMT silbing is not idle on a
+ * multi-threaded implementation. So don't use it on SMT cores.
+ */
+ if (cpu_pmu->has_smt)
+ return false;
+
return true;
}
diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
index d72d6e5aa200..daa1af2e8204 100644
--- a/include/linux/arch_topology.h
+++ b/include/linux/arch_topology.h
@@ -89,6 +89,17 @@ void remove_cpu_topology(unsigned int cpuid);
void reset_cpu_topology(void);
int parse_acpi_topology(void);
void freq_inv_set_max_ratio(int cpu, u64 max_rate);
+
+/*
+ * Architectures like ARM64 don't have reliable architectural way to get SMT
+ * information and depend on the firmware (ACPI/OF) report. Non-SMT core won't
+ * initialize thread_id so we can use this to detect the SMT implementation.
+ */
+static inline bool topology_core_has_smt(int cpu)
+{
+ return cpu_topology[cpu].thread_id != -1;
+}
+
#endif
#endif /* _LINUX_ARCH_TOPOLOGY_H_ */
diff --git a/include/linux/perf/arm_pmu.h b/include/linux/perf/arm_pmu.h
index 93c9a26492fc..2d39322c40c4 100644
--- a/include/linux/perf/arm_pmu.h
+++ b/include/linux/perf/arm_pmu.h
@@ -119,6 +119,7 @@ struct arm_pmu {
/* PMUv3 only */
int pmuver;
+ bool has_smt;
u64 reg_pmmir;
u64 reg_brbidr;
#define ARMV8_PMUV3_MAX_COMMON_EVENTS 0x40
--
2.24.0
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH v2 0/2] perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores
2025-08-20 8:45 [PATCH v2 0/2] perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores Yicong Yang
2025-08-20 8:45 ` [PATCH v2 1/2] perf: arm_pmuv3: Factor out PMCCNTR_EL0 use conditions Yicong Yang
2025-08-20 8:45 ` [PATCH v2 2/2] perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores Yicong Yang
@ 2025-09-08 7:09 ` Yicong Yang
2025-09-18 16:43 ` Will Deacon
3 siblings, 0 replies; 12+ messages in thread
From: Yicong Yang @ 2025-09-08 7:09 UTC (permalink / raw)
To: will, mark.rutland, linux-arm-kernel
Cc: yangyicong, sudeep.holla, james.clark, robh, anshuman.khandual,
jonathan.cameron, hejunhao3, linuxarm, prime.zeng, xuwei5,
wangyushan12
a gentle ping...
On 2025/8/20 16:45, Yicong Yang wrote:
> From: Yicong Yang <yangyicong@hisilicon.com>
>
> This patchset fixed CPU_CYCLES counting on SMT system. CPU_CYCLES can use
> PMCCNTR_EL0 which will count the hardware processor clock rather than the
> PE clock (ARM DDI0487 L.b D13.1.3) on SMT cores which fails the users
> expectation as CPU_CYCLES (0x0011) is defined to count on each PE cycles.
> Fix this by avoid using PMCCNTR_EL0 on SMT cores when counting CPU_CYCLES.
>
> Changes since v1:
> - remove redundant comment and add tags on Patch 1/2
> - detect the SMT implementation during PMU probe rather than runtime
> Link: https://lore.kernel.org/linux-arm-kernel/20250812080830.20796-1-yangyicong@huawei.com/
>
> Yicong Yang (2):
> perf: arm_pmuv3: Factor out PMCCNTR_EL0 use conditions
> perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores
>
> drivers/perf/arm_pmu.c | 3 +++
> drivers/perf/arm_pmuv3.c | 39 +++++++++++++++++++++++++++++++++--
> include/linux/arch_topology.h | 11 ++++++++++
> include/linux/perf/arm_pmu.h | 1 +
> 4 files changed, 52 insertions(+), 2 deletions(-)
>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 2/2] perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores
2025-08-20 8:45 ` [PATCH v2 2/2] perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores Yicong Yang
@ 2025-09-18 13:32 ` Will Deacon
2025-09-19 8:56 ` Yicong Yang
2025-09-19 9:37 ` Sudeep Holla
0 siblings, 2 replies; 12+ messages in thread
From: Will Deacon @ 2025-09-18 13:32 UTC (permalink / raw)
To: Yicong Yang
Cc: mark.rutland, linux-arm-kernel, sudeep.holla, james.clark, robh,
anshuman.khandual, jonathan.cameron, hejunhao3, linuxarm,
prime.zeng, xuwei5, wangyushan12, yangyicong
On Wed, Aug 20, 2025 at 04:45:34PM +0800, Yicong Yang wrote:
> From: Yicong Yang <yangyicong@hisilicon.com>
>
> CPU_CYCLES is expected to count the logical CPU (PE) clock. Currently it's
> preferred to use PMCCNTR_EL0 for counting CPU_CYCLES, but it'll count
> processor clock rather than the PE clock (ARM DDI0487 L.b D13.1.3) if
> one of the SMT siblings is not idle on a multi-threaded implementation.
> So don't use it on SMT cores.
>
> Introduce topology_core_has_smt() for knowing the SMT implementation and
> cached it in arm_pmu::has_smt during allocation.
>
> When counting cycles on SMT CPU 2-3 and CPU 3 is idle, without this
> patch we'll get:
> [root@client1 tmp]# perf stat -e cycles -A -C 2-3 -- stress-ng -c 1
> --taskset 2 --timeout 1
> [...]
> Performance counter stats for 'CPU(s) 2-3':
>
> CPU2 2880457316 cycles
> CPU3 2880459810 cycles
> 1.254688470 seconds time elapsed
>
> With this patch the idle state of CPU3 is observed as expected:
> [root@client1 ~]# perf stat -e cycles -A -C 2-3 -- stress-ng -c 1
> --taskset 2 --timeout 1
> [...]
> Performance counter stats for 'CPU(s) 2-3':
>
> CPU2 2558580492 cycles
> CPU3 305749 cycles
> 1.113626410 seconds time elapsed
>
> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> ---
> drivers/perf/arm_pmu.c | 3 +++
> drivers/perf/arm_pmuv3.c | 10 ++++++++++
> include/linux/arch_topology.h | 11 +++++++++++
> include/linux/perf/arm_pmu.h | 1 +
> 4 files changed, 25 insertions(+)
>
> diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
> index 5c310e803dd7..137ef55d6973 100644
> --- a/drivers/perf/arm_pmu.c
> +++ b/drivers/perf/arm_pmu.c
> @@ -901,6 +901,9 @@ struct arm_pmu *armpmu_alloc(void)
>
> events = per_cpu_ptr(pmu->hw_events, cpu);
> events->percpu_pmu = pmu;
> +
> + if (!pmu->has_smt && topology_core_has_smt(cpu))
> + pmu->has_smt = true;
Why isn't that just:
pmu->has_smt = topology_core_has_smt(cpu);
?
but then if that's the case, why do we need to stash the result in the
PMU at all?
> }
>
> return pmu;
> diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
> index 69c5cc8f5606..32b58a0feb33 100644
> --- a/drivers/perf/arm_pmuv3.c
> +++ b/drivers/perf/arm_pmuv3.c
> @@ -981,6 +981,7 @@ static int armv8pmu_get_chain_idx(struct pmu_hw_events *cpuc,
> static bool armv8pmu_can_use_pmccntr(struct pmu_hw_events *cpuc,
> struct perf_event *event)
> {
> + struct arm_pmu *cpu_pmu = to_arm_pmu(event->pmu);
> struct hw_perf_event *hwc = &event->hw;
> unsigned long evtype = hwc->config_base & ARMV8_PMU_EVTYPE_EVENT;
>
> @@ -1001,6 +1002,15 @@ static bool armv8pmu_can_use_pmccntr(struct pmu_hw_events *cpuc,
> if (has_branch_stack(event))
> return false;
>
> + /*
> + * The PMCCNTR_EL0 increments from the processor clock rather than
> + * the PE clock (ARM DDI0487 L.b D13.1.3) which means it'll continue
> + * counting on a WFI PE if one of its SMT silbing is not idle on a
typo: sibling
> + * multi-threaded implementation. So don't use it on SMT cores.
> + */
> + if (cpu_pmu->has_smt)
> + return false;
> +
> return true;
> }
>
> diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
> index d72d6e5aa200..daa1af2e8204 100644
> --- a/include/linux/arch_topology.h
> +++ b/include/linux/arch_topology.h
> @@ -89,6 +89,17 @@ void remove_cpu_topology(unsigned int cpuid);
> void reset_cpu_topology(void);
> int parse_acpi_topology(void);
> void freq_inv_set_max_ratio(int cpu, u64 max_rate);
> +
> +/*
> + * Architectures like ARM64 don't have reliable architectural way to get SMT
> + * information and depend on the firmware (ACPI/OF) report. Non-SMT core won't
> + * initialize thread_id so we can use this to detect the SMT implementation.
> + */
> +static inline bool topology_core_has_smt(int cpu)
> +{
> + return cpu_topology[cpu].thread_id != -1;
> +}
Sudeep -- is this ok?
Will
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 0/2] perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores
2025-08-20 8:45 [PATCH v2 0/2] perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores Yicong Yang
` (2 preceding siblings ...)
2025-09-08 7:09 ` [PATCH v2 0/2] " Yicong Yang
@ 2025-09-18 16:43 ` Will Deacon
3 siblings, 0 replies; 12+ messages in thread
From: Will Deacon @ 2025-09-18 16:43 UTC (permalink / raw)
To: mark.rutland, linux-arm-kernel, Yicong Yang
Cc: catalin.marinas, kernel-team, Will Deacon, sudeep.holla,
james.clark, robh, anshuman.khandual, jonathan.cameron, hejunhao3,
linuxarm, prime.zeng, xuwei5, wangyushan12, yangyicong
On Wed, 20 Aug 2025 16:45:32 +0800, Yicong Yang wrote:
> This patchset fixed CPU_CYCLES counting on SMT system. CPU_CYCLES can use
> PMCCNTR_EL0 which will count the hardware processor clock rather than the
> PE clock (ARM DDI0487 L.b D13.1.3) on SMT cores which fails the users
> expectation as CPU_CYCLES (0x0011) is defined to count on each PE cycles.
> Fix this by avoid using PMCCNTR_EL0 on SMT cores when counting CPU_CYCLES.
>
> Changes since v1:
> - remove redundant comment and add tags on Patch 1/2
> - detect the SMT implementation during PMU probe rather than runtime
> Link: https://lore.kernel.org/linux-arm-kernel/20250812080830.20796-1-yangyicong@huawei.com/
>
> [...]
Applied first patch to will (for-next/perf), thanks!
[1/2] perf: arm_pmuv3: Factor out PMCCNTR_EL0 use conditions
https://git.kernel.org/will/c/f8f89e8cf3d6
Cheers,
--
Will
https://fixes.arm64.dev
https://next.arm64.dev
https://will.arm64.dev
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 2/2] perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores
2025-09-18 13:32 ` Will Deacon
@ 2025-09-19 8:56 ` Yicong Yang
2025-09-19 9:16 ` Mark Rutland
2025-09-19 9:37 ` Sudeep Holla
1 sibling, 1 reply; 12+ messages in thread
From: Yicong Yang @ 2025-09-19 8:56 UTC (permalink / raw)
To: Will Deacon
Cc: yangyicong, mark.rutland, linux-arm-kernel, sudeep.holla,
james.clark, robh, anshuman.khandual, jonathan.cameron, hejunhao3,
linuxarm, prime.zeng, xuwei5, wangyushan12
On 2025/9/18 21:32, Will Deacon wrote:
> On Wed, Aug 20, 2025 at 04:45:34PM +0800, Yicong Yang wrote:
>> From: Yicong Yang <yangyicong@hisilicon.com>
>>
>> CPU_CYCLES is expected to count the logical CPU (PE) clock. Currently it's
>> preferred to use PMCCNTR_EL0 for counting CPU_CYCLES, but it'll count
>> processor clock rather than the PE clock (ARM DDI0487 L.b D13.1.3) if
>> one of the SMT siblings is not idle on a multi-threaded implementation.
>> So don't use it on SMT cores.
>>
>> Introduce topology_core_has_smt() for knowing the SMT implementation and
>> cached it in arm_pmu::has_smt during allocation.
>>
>> When counting cycles on SMT CPU 2-3 and CPU 3 is idle, without this
>> patch we'll get:
>> [root@client1 tmp]# perf stat -e cycles -A -C 2-3 -- stress-ng -c 1
>> --taskset 2 --timeout 1
>> [...]
>> Performance counter stats for 'CPU(s) 2-3':
>>
>> CPU2 2880457316 cycles
>> CPU3 2880459810 cycles
>> 1.254688470 seconds time elapsed
>>
>> With this patch the idle state of CPU3 is observed as expected:
>> [root@client1 ~]# perf stat -e cycles -A -C 2-3 -- stress-ng -c 1
>> --taskset 2 --timeout 1
>> [...]
>> Performance counter stats for 'CPU(s) 2-3':
>>
>> CPU2 2558580492 cycles
>> CPU3 305749 cycles
>> 1.113626410 seconds time elapsed
>>
>> Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
>> ---
>> drivers/perf/arm_pmu.c | 3 +++
>> drivers/perf/arm_pmuv3.c | 10 ++++++++++
>> include/linux/arch_topology.h | 11 +++++++++++
>> include/linux/perf/arm_pmu.h | 1 +
>> 4 files changed, 25 insertions(+)
>>
>> diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
>> index 5c310e803dd7..137ef55d6973 100644
>> --- a/drivers/perf/arm_pmu.c
>> +++ b/drivers/perf/arm_pmu.c
>> @@ -901,6 +901,9 @@ struct arm_pmu *armpmu_alloc(void)
>>
>> events = per_cpu_ptr(pmu->hw_events, cpu);
>> events->percpu_pmu = pmu;
>> +
>> + if (!pmu->has_smt && topology_core_has_smt(cpu))
>> + pmu->has_smt = true;
>
> Why isn't that just:
>
> pmu->has_smt = topology_core_has_smt(cpu);
>
> ?
also works. since one pmu only contains one type of CPU, so just thought
no need to set it multiple times.
>
> but then if that's the case, why do we need to stash the result in the
> PMU at all?
>
should based on the discussion here [1]. stash it during probe will avoid
calling {raw_}smp_processor_id() in pmu::event_init() which may be
horrible for debug in some condition.
[1] https://lore.kernel.org/linux-arm-kernel/aJsV7nzlILHd_ZMa@J2N7QTR9R3/
thanks.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 2/2] perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores
2025-09-19 8:56 ` Yicong Yang
@ 2025-09-19 9:16 ` Mark Rutland
2025-09-19 10:27 ` Yicong Yang
0 siblings, 1 reply; 12+ messages in thread
From: Mark Rutland @ 2025-09-19 9:16 UTC (permalink / raw)
To: Yicong Yang
Cc: Will Deacon, yangyicong, linux-arm-kernel, sudeep.holla,
james.clark, robh, anshuman.khandual, jonathan.cameron, hejunhao3,
linuxarm, prime.zeng, xuwei5, wangyushan12
On Fri, Sep 19, 2025 at 04:56:18PM +0800, Yicong Yang wrote:
> On 2025/9/18 21:32, Will Deacon wrote:
> > On Wed, Aug 20, 2025 at 04:45:34PM +0800, Yicong Yang wrote:
> >> diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
> >> index 5c310e803dd7..137ef55d6973 100644
> >> --- a/drivers/perf/arm_pmu.c
> >> +++ b/drivers/perf/arm_pmu.c
> >> @@ -901,6 +901,9 @@ struct arm_pmu *armpmu_alloc(void)
> >>
> >> events = per_cpu_ptr(pmu->hw_events, cpu);
> >> events->percpu_pmu = pmu;
> >> +
> >> + if (!pmu->has_smt && topology_core_has_smt(cpu))
> >> + pmu->has_smt = true;
> >
> > Why isn't that just:
> >
> > pmu->has_smt = topology_core_has_smt(cpu);
> >
> > ?
>
> also works. since one pmu only contains one type of CPU, so just thought
> no need to set it multiple times.
>
> > but then if that's the case, why do we need to stash the result in the
> > PMU at all?
>
> should based on the discussion here [1]. stash it during probe will avoid
> calling {raw_}smp_processor_id() in pmu::event_init() which may be
> horrible for debug in some condition.
>
> [1] https://lore.kernel.org/linux-arm-kernel/aJsV7nzlILHd_ZMa@J2N7QTR9R3/
This isn't about being 'horrible for debug'; my comment there was saying
that the proposed patch was incorrect AND it would be horrible to debug
that in practice when it inevitably went wrong.
The key details are:
(1) We need pmu::event_init() to know whether the cycle counter can be
used such that it doesn't permit a group to be created which can
*NEVER* be scheduled in hardware. Otherwise, the core perf code will
waste time periodically trying to schedule that group when it will
*ALWAYS* be rejected by pmu::add().
(2) The pmu::event_init() call runs in a preemptible context and can
run on any CPU in the system, completely independent of the PMU's
supported CPUs. Thus [raw_]smp_processor_id() tells you nothing
about the CPU(s) the event will run on.
Note that for task-bound events, the event->cpu is -1, so that
doesn't tell us either. Only the PMU instance tells us the set of
CPUs.
We can solve that by either stashing this boolean flag at probe time OR
having pmu::event_init() check something like:
topology_core_has_smt(cpumask_first(pmu->supported_cpus));
... and I think stashing at probe time is nicer/clearer.
Mark.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 2/2] perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores
2025-09-18 13:32 ` Will Deacon
2025-09-19 8:56 ` Yicong Yang
@ 2025-09-19 9:37 ` Sudeep Holla
2025-09-19 11:18 ` Will Deacon
1 sibling, 1 reply; 12+ messages in thread
From: Sudeep Holla @ 2025-09-19 9:37 UTC (permalink / raw)
To: Will Deacon
Cc: Yicong Yang, mark.rutland, Sudeep Holla, linux-arm-kernel,
james.clark, robh, anshuman.khandual, jonathan.cameron, hejunhao3,
linuxarm, prime.zeng, xuwei5, wangyushan12, yangyicong
On Thu, Sep 18, 2025 at 02:32:20PM +0100, Will Deacon wrote:
> On Wed, Aug 20, 2025 at 04:45:34PM +0800, Yicong Yang wrote:
> > From: Yicong Yang <yangyicong@hisilicon.com>
> >
> > CPU_CYCLES is expected to count the logical CPU (PE) clock. Currently it's
> > preferred to use PMCCNTR_EL0 for counting CPU_CYCLES, but it'll count
> > processor clock rather than the PE clock (ARM DDI0487 L.b D13.1.3) if
> > one of the SMT siblings is not idle on a multi-threaded implementation.
> > So don't use it on SMT cores.
> >
> > Introduce topology_core_has_smt() for knowing the SMT implementation and
> > cached it in arm_pmu::has_smt during allocation.
> >
> > When counting cycles on SMT CPU 2-3 and CPU 3 is idle, without this
> > patch we'll get:
> > [root@client1 tmp]# perf stat -e cycles -A -C 2-3 -- stress-ng -c 1
> > --taskset 2 --timeout 1
> > [...]
> > Performance counter stats for 'CPU(s) 2-3':
> >
> > CPU2 2880457316 cycles
> > CPU3 2880459810 cycles
> > 1.254688470 seconds time elapsed
> >
> > With this patch the idle state of CPU3 is observed as expected:
> > [root@client1 ~]# perf stat -e cycles -A -C 2-3 -- stress-ng -c 1
> > --taskset 2 --timeout 1
> > [...]
> > Performance counter stats for 'CPU(s) 2-3':
> >
> > CPU2 2558580492 cycles
> > CPU3 305749 cycles
> > 1.113626410 seconds time elapsed
> >
> > Signed-off-by: Yicong Yang <yangyicong@hisilicon.com>
> > ---
> > drivers/perf/arm_pmu.c | 3 +++
> > drivers/perf/arm_pmuv3.c | 10 ++++++++++
> > include/linux/arch_topology.h | 11 +++++++++++
> > include/linux/perf/arm_pmu.h | 1 +
> > 4 files changed, 25 insertions(+)
> >
> > diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
> > index 5c310e803dd7..137ef55d6973 100644
> > --- a/drivers/perf/arm_pmu.c
> > +++ b/drivers/perf/arm_pmu.c
> > @@ -901,6 +901,9 @@ struct arm_pmu *armpmu_alloc(void)
> >
> > events = per_cpu_ptr(pmu->hw_events, cpu);
> > events->percpu_pmu = pmu;
> > +
> > + if (!pmu->has_smt && topology_core_has_smt(cpu))
> > + pmu->has_smt = true;
>
> Why isn't that just:
>
> pmu->has_smt = topology_core_has_smt(cpu);
>
> ?
>
> but then if that's the case, why do we need to stash the result in the
> PMU at all?
>
Agreed, I don't see any point in making a copy here, topology_core_has_smt()
should work.
> > }
> >
> > return pmu;
> > diff --git a/drivers/perf/arm_pmuv3.c b/drivers/perf/arm_pmuv3.c
> > index 69c5cc8f5606..32b58a0feb33 100644
> > --- a/drivers/perf/arm_pmuv3.c
> > +++ b/drivers/perf/arm_pmuv3.c
> > @@ -981,6 +981,7 @@ static int armv8pmu_get_chain_idx(struct pmu_hw_events *cpuc,
> > static bool armv8pmu_can_use_pmccntr(struct pmu_hw_events *cpuc,
> > struct perf_event *event)
> > {
> > + struct arm_pmu *cpu_pmu = to_arm_pmu(event->pmu);
> > struct hw_perf_event *hwc = &event->hw;
> > unsigned long evtype = hwc->config_base & ARMV8_PMU_EVTYPE_EVENT;
> >
> > @@ -1001,6 +1002,15 @@ static bool armv8pmu_can_use_pmccntr(struct pmu_hw_events *cpuc,
> > if (has_branch_stack(event))
> > return false;
> >
> > + /*
> > + * The PMCCNTR_EL0 increments from the processor clock rather than
> > + * the PE clock (ARM DDI0487 L.b D13.1.3) which means it'll continue
> > + * counting on a WFI PE if one of its SMT silbing is not idle on a
>
> typo: sibling
>
> > + * multi-threaded implementation. So don't use it on SMT cores.
> > + */
> > + if (cpu_pmu->has_smt)
> > + return false;
> > +
> > return true;
> > }
> >
> > diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
> > index d72d6e5aa200..daa1af2e8204 100644
> > --- a/include/linux/arch_topology.h
> > +++ b/include/linux/arch_topology.h
> > @@ -89,6 +89,17 @@ void remove_cpu_topology(unsigned int cpuid);
> > void reset_cpu_topology(void);
> > int parse_acpi_topology(void);
> > void freq_inv_set_max_ratio(int cpu, u64 max_rate);
> > +
> > +/*
> > + * Architectures like ARM64 don't have reliable architectural way to get SMT
> > + * information and depend on the firmware (ACPI/OF) report. Non-SMT core won't
> > + * initialize thread_id so we can use this to detect the SMT implementation.
> > + */
> > +static inline bool topology_core_has_smt(int cpu)
> > +{
> > + return cpu_topology[cpu].thread_id != -1;
> > +}
>
> Sudeep -- is this ok?
>
Yes, this looks correct as the topology_sibling_cpumask(aka thread_sibling
mask) changes with hotplug events and may not be desirable. Not sure if we
can do any better with name in case we need to check for active/online thread
sibling in the future(I assuming it is for sure not needed here).
--
Regards,
Sudeep
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 2/2] perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores
2025-09-19 9:16 ` Mark Rutland
@ 2025-09-19 10:27 ` Yicong Yang
2025-09-19 11:17 ` Will Deacon
0 siblings, 1 reply; 12+ messages in thread
From: Yicong Yang @ 2025-09-19 10:27 UTC (permalink / raw)
To: Mark Rutland
Cc: yangyicong, Will Deacon, linux-arm-kernel, sudeep.holla,
james.clark, robh, anshuman.khandual, jonathan.cameron, hejunhao3,
linuxarm, prime.zeng, xuwei5, wangyushan12
On 2025/9/19 17:16, Mark Rutland wrote:
> On Fri, Sep 19, 2025 at 04:56:18PM +0800, Yicong Yang wrote:
>> On 2025/9/18 21:32, Will Deacon wrote:
>>> On Wed, Aug 20, 2025 at 04:45:34PM +0800, Yicong Yang wrote:
>
>>>> diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
>>>> index 5c310e803dd7..137ef55d6973 100644
>>>> --- a/drivers/perf/arm_pmu.c
>>>> +++ b/drivers/perf/arm_pmu.c
>>>> @@ -901,6 +901,9 @@ struct arm_pmu *armpmu_alloc(void)
>>>>
>>>> events = per_cpu_ptr(pmu->hw_events, cpu);
>>>> events->percpu_pmu = pmu;
>>>> +
>>>> + if (!pmu->has_smt && topology_core_has_smt(cpu))
>>>> + pmu->has_smt = true;
>>>
>>> Why isn't that just:
>>>
>>> pmu->has_smt = topology_core_has_smt(cpu);
>>>
>>> ?
>>
>> also works. since one pmu only contains one type of CPU, so just thought
>> no need to set it multiple times.
>>
>>> but then if that's the case, why do we need to stash the result in the
>>> PMU at all?
>>
>> should based on the discussion here [1]. stash it during probe will avoid
>> calling {raw_}smp_processor_id() in pmu::event_init() which may be
>> horrible for debug in some condition.
>>
>> [1] https://lore.kernel.org/linux-arm-kernel/aJsV7nzlILHd_ZMa@J2N7QTR9R3/
>
> This isn't about being 'horrible for debug'; my comment there was saying
> that the proposed patch was incorrect AND it would be horrible to debug
> that in practice when it inevitably went wrong.
>
> The key details are:
>
> (1) We need pmu::event_init() to know whether the cycle counter can be
> used such that it doesn't permit a group to be created which can
> *NEVER* be scheduled in hardware. Otherwise, the core perf code will
> waste time periodically trying to schedule that group when it will
> *ALWAYS* be rejected by pmu::add().
>
> (2) The pmu::event_init() call runs in a preemptible context and can
> run on any CPU in the system, completely independent of the PMU's
> supported CPUs. Thus [raw_]smp_processor_id() tells you nothing
> about the CPU(s) the event will run on.
>
> Note that for task-bound events, the event->cpu is -1, so that
> doesn't tell us either. Only the PMU instance tells us the set of
> CPUs.
>
yes this is the problem in the last approach using [raw_]smp_processor_id()
in pmu::event_init().. sorry for the wrong information replied above and
thanks for help me recall this..
> We can solve that by either stashing this boolean flag at probe time OR
> having pmu::event_init() check something like:
>
> topology_core_has_smt(cpumask_first(pmu->supported_cpus));
>
this works. I didn't think of this approach... pmu->supported_cpus may contain
offline CPUs but it doesn't matter since topology_core_has_smt() can also
retieve the SMT implementation for offline CPU.
> ... and I think stashing at probe time is nicer/clearer.
>
I feel similar. will wait for Will's comments :)
thanks.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 2/2] perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores
2025-09-19 10:27 ` Yicong Yang
@ 2025-09-19 11:17 ` Will Deacon
0 siblings, 0 replies; 12+ messages in thread
From: Will Deacon @ 2025-09-19 11:17 UTC (permalink / raw)
To: Yicong Yang
Cc: Mark Rutland, yangyicong, linux-arm-kernel, sudeep.holla,
james.clark, robh, anshuman.khandual, jonathan.cameron, hejunhao3,
linuxarm, prime.zeng, xuwei5, wangyushan12
On Fri, Sep 19, 2025 at 06:27:54PM +0800, Yicong Yang wrote:
> On 2025/9/19 17:16, Mark Rutland wrote:
> > On Fri, Sep 19, 2025 at 04:56:18PM +0800, Yicong Yang wrote:
> >> On 2025/9/18 21:32, Will Deacon wrote:
> >>> On Wed, Aug 20, 2025 at 04:45:34PM +0800, Yicong Yang wrote:
> >
> >>>> diff --git a/drivers/perf/arm_pmu.c b/drivers/perf/arm_pmu.c
> >>>> index 5c310e803dd7..137ef55d6973 100644
> >>>> --- a/drivers/perf/arm_pmu.c
> >>>> +++ b/drivers/perf/arm_pmu.c
> >>>> @@ -901,6 +901,9 @@ struct arm_pmu *armpmu_alloc(void)
> >>>>
> >>>> events = per_cpu_ptr(pmu->hw_events, cpu);
> >>>> events->percpu_pmu = pmu;
> >>>> +
> >>>> + if (!pmu->has_smt && topology_core_has_smt(cpu))
> >>>> + pmu->has_smt = true;
> >>>
> >>> Why isn't that just:
> >>>
> >>> pmu->has_smt = topology_core_has_smt(cpu);
> >>>
> >>> ?
> >>
> >> also works. since one pmu only contains one type of CPU, so just thought
> >> no need to set it multiple times.
> >>
> >>> but then if that's the case, why do we need to stash the result in the
> >>> PMU at all?
> >>
> >> should based on the discussion here [1]. stash it during probe will avoid
> >> calling {raw_}smp_processor_id() in pmu::event_init() which may be
> >> horrible for debug in some condition.
> >>
> >> [1] https://lore.kernel.org/linux-arm-kernel/aJsV7nzlILHd_ZMa@J2N7QTR9R3/
> >
> > This isn't about being 'horrible for debug'; my comment there was saying
> > that the proposed patch was incorrect AND it would be horrible to debug
> > that in practice when it inevitably went wrong.
> >
> > The key details are:
> >
> > (1) We need pmu::event_init() to know whether the cycle counter can be
> > used such that it doesn't permit a group to be created which can
> > *NEVER* be scheduled in hardware. Otherwise, the core perf code will
> > waste time periodically trying to schedule that group when it will
> > *ALWAYS* be rejected by pmu::add().
> >
> > (2) The pmu::event_init() call runs in a preemptible context and can
> > run on any CPU in the system, completely independent of the PMU's
> > supported CPUs. Thus [raw_]smp_processor_id() tells you nothing
> > about the CPU(s) the event will run on.
> >
> > Note that for task-bound events, the event->cpu is -1, so that
> > doesn't tell us either. Only the PMU instance tells us the set of
> > CPUs.
> >
>
> yes this is the problem in the last approach using [raw_]smp_processor_id()
> in pmu::event_init().. sorry for the wrong information replied above and
> thanks for help me recall this..
>
> > We can solve that by either stashing this boolean flag at probe time OR
> > having pmu::event_init() check something like:
> >
> > topology_core_has_smt(cpumask_first(pmu->supported_cpus));
> >
>
> this works. I didn't think of this approach... pmu->supported_cpus may contain
> offline CPUs but it doesn't matter since topology_core_has_smt() can also
> retieve the SMT implementation for offline CPU.
>
> > ... and I think stashing at probe time is nicer/clearer.
> >
>
> I feel similar. will wait for Will's comments :)
Seems like it's two against one, so we'll go with pmu->has_smt and I'll
live with it.
Will
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH v2 2/2] perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores
2025-09-19 9:37 ` Sudeep Holla
@ 2025-09-19 11:18 ` Will Deacon
0 siblings, 0 replies; 12+ messages in thread
From: Will Deacon @ 2025-09-19 11:18 UTC (permalink / raw)
To: Sudeep Holla
Cc: Yicong Yang, mark.rutland, linux-arm-kernel, james.clark, robh,
anshuman.khandual, jonathan.cameron, hejunhao3, linuxarm,
prime.zeng, xuwei5, wangyushan12, yangyicong
On Fri, Sep 19, 2025 at 10:37:45AM +0100, Sudeep Holla wrote:
> On Thu, Sep 18, 2025 at 02:32:20PM +0100, Will Deacon wrote:
> > On Wed, Aug 20, 2025 at 04:45:34PM +0800, Yicong Yang wrote:
> > > diff --git a/include/linux/arch_topology.h b/include/linux/arch_topology.h
> > > index d72d6e5aa200..daa1af2e8204 100644
> > > --- a/include/linux/arch_topology.h
> > > +++ b/include/linux/arch_topology.h
> > > @@ -89,6 +89,17 @@ void remove_cpu_topology(unsigned int cpuid);
> > > void reset_cpu_topology(void);
> > > int parse_acpi_topology(void);
> > > void freq_inv_set_max_ratio(int cpu, u64 max_rate);
> > > +
> > > +/*
> > > + * Architectures like ARM64 don't have reliable architectural way to get SMT
> > > + * information and depend on the firmware (ACPI/OF) report. Non-SMT core won't
> > > + * initialize thread_id so we can use this to detect the SMT implementation.
> > > + */
> > > +static inline bool topology_core_has_smt(int cpu)
> > > +{
> > > + return cpu_topology[cpu].thread_id != -1;
> > > +}
> >
> > Sudeep -- is this ok?
> >
>
> Yes, this looks correct as the topology_sibling_cpumask(aka thread_sibling
> mask) changes with hotplug events and may not be desirable. Not sure if we
> can do any better with name in case we need to check for active/online thread
> sibling in the future(I assuming it is for sure not needed here).
Great, thanks for confirming. We're just using this to figure out whether
the hardware implements SMT so I don't think we care about the thread status
at all.
Will
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2025-09-19 11:19 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-20 8:45 [PATCH v2 0/2] perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores Yicong Yang
2025-08-20 8:45 ` [PATCH v2 1/2] perf: arm_pmuv3: Factor out PMCCNTR_EL0 use conditions Yicong Yang
2025-08-20 8:45 ` [PATCH v2 2/2] perf: arm_pmuv3: Don't use PMCCNTR_EL0 on SMT cores Yicong Yang
2025-09-18 13:32 ` Will Deacon
2025-09-19 8:56 ` Yicong Yang
2025-09-19 9:16 ` Mark Rutland
2025-09-19 10:27 ` Yicong Yang
2025-09-19 11:17 ` Will Deacon
2025-09-19 9:37 ` Sudeep Holla
2025-09-19 11:18 ` Will Deacon
2025-09-08 7:09 ` [PATCH v2 0/2] " Yicong Yang
2025-09-18 16:43 ` Will Deacon
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox