* [PATCH] cpufreq: CPPC: Increase delay between perf counter reads
@ 2025-07-30 22:07 Prashant Malani
2025-08-06 0:26 ` Yang Shi
0 siblings, 1 reply; 11+ messages in thread
From: Prashant Malani @ 2025-07-30 22:07 UTC (permalink / raw)
To: open list, open list:CPU FREQUENCY SCALING FRAMEWORK,
Rafael J. Wysocki, Viresh Kumar
Cc: Prashant Malani, Yang Shi, Ionela Voinescu
On a heavily loaded CPU, performance counter reads can be erratic. This is
due to two factors:
- The method used to calculate CPPC delivered performance.
- Linux scheduler vagaries.
As an example, on a CPU which has a max frequency of 3.4 GHz, if we run
stress-ng on the CPU in the background and then read the frequency, we get
invalid readings:
./stress-ng --cpu 108 --taskset 3 -t 30s &
cat /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_cur_freq
3600000
3500000
3600000
Per [1] CPPC performance is measured by reading the delivered and reference
counters at timestamp t0, then waiting 2us, and then repeating the
measurement at t1. So, in theory, one should end up with:
Timestamp t0: ref0, del0
Timestamp t1: ref1, del1
However, since the reference and delivered registers are individual
register reads (in the case of FFH, it even results in an IPI to the CPU in
question), what happens in practice is:
Timestamp t0: del0
Timestamp t0 + m: ref0
Timestamp t1: del1
Timestamp t1 + n: ref1
There has been prior discussion[2] about the cause of these differences;
it was broadly pegged as due to IRQs and "interconnect congestion".
Since the gap between t0 and t1 is very small (2us), differing values of m
and n mean that the measurements don't correspond to 2 discrete timestamps,
since the delivered performance delta is being measured across a
significantly different time period than the reference performance
delta. This has an influence on the perf measurement which is:
((del1 - del0) * reference perf) / (ref1 - ref0)
Previously collected data [4] shows that cppc_get_perf_ctrs() itself
takes anywhere between 3.6us and 4.9us, which further suggests that a
2us delta is too short.
If we increase the time delta to a high enough value (i.e. if delay >> m, n),
the effects of m and n get mitigated, and both register measurements
(ref and del) effectively correspond to the same timestamps.
When this approach was previously proposed [3], there was concern about
this function being called with interrupts off, but that was later found
not to be true [2]. So, waiting for a slightly longer time between counter
samples should be acceptable.
Increase the time delay between counter reads to 200 us to reduce the
effect of timing discrepancies in reading individual performance registers.
[1] https://docs.kernel.org/admin-guide/acpi/cppc_sysfs.html#computing-average-delivered-performance
[2] https://lore.kernel.org/all/7b57e680-0ba3-0b8b-851e-7cc369050386@os.amperecomputing.com/
[3] https://lore.kernel.org/all/20230328193846.8757-1-yang@os.amperecomputing.com/
[4] https://lore.kernel.org/all/1ce09fd7-0c1d-fc46-ce12-01b25fbd4afd@os.amperecomputing.com/
Cc: Yang Shi <yang@os.amperecomputing.com>
Cc: Ionela Voinescu <Ionela.Voinescu@arm.com>
Signed-off-by: Prashant Malani <pmalani@google.com>
---
drivers/cpufreq/cppc_cpufreq.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
index 4a17162a392d..086c3b87bd4e 100644
--- a/drivers/cpufreq/cppc_cpufreq.c
+++ b/drivers/cpufreq/cppc_cpufreq.c
@@ -718,7 +718,7 @@ static int cppc_get_perf_ctrs_sample(int cpu,
if (ret)
return ret;
- udelay(2); /* 2usec delay between sampling */
+ udelay(200); /* 200usec delay between sampling */
return cppc_get_perf_ctrs(cpu, fb_ctrs_t1);
}
--
2.50.1.552.g942d659e1b-goog
^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH] cpufreq: CPPC: Increase delay between perf counter reads
2025-07-30 22:07 [PATCH] cpufreq: CPPC: Increase delay between perf counter reads Prashant Malani
@ 2025-08-06 0:26 ` Yang Shi
2025-08-06 1:00 ` Prashant Malani
0 siblings, 1 reply; 11+ messages in thread
From: Yang Shi @ 2025-08-06 0:26 UTC (permalink / raw)
To: Prashant Malani, open list,
open list:CPU FREQUENCY SCALING FRAMEWORK, Rafael J. Wysocki,
Viresh Kumar, beata.michalska, Catalin Marinas
Cc: Ionela Voinescu
On 7/30/25 3:07 PM, Prashant Malani wrote:
>
> On a heavily loaded CPU, performance counter reads can be erratic. This is
> due to two factors:
> - The method used to calculate CPPC delivered performance.
> - Linux scheduler vagaries.
>
> As an example, on a CPU which has a max frequency of 3.4 GHz, if we run
> stress-ng on the CPU in the background and then read the frequency, we get
> invalid readings:
>
> ./stress_ng --cpu 108 --taskset 3 -t 30s &
> cat /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_cur_freq
> 3600000
> 3500000
> 3600000
>
> Per [1] CPPC performance is measured by reading the delivered and reference
> counters at timestamp t0, then waiting 2us, and then repeating the
> measurement at t1. So, in theory, one should end up with:
> Timestamp t0: ref0, del0
> Timestamp t1: ref1, del1
>
> However, since the reference and delivered registers are individual
> register reads (in the case of FFH, it even results in an IPI to the CPU in
> question), what happens in practice is:
> Timestamp t0: del0
> Timestamp t0 + m: ref0
> Timestamp t1: del1
> Timestamp t1 + n: ref1
>
> There has been prior discussion[2] about the cause of these differences;
> it was broadly pegged as due to IRQs and "interconnect congestion".
>
> Since the gap between t0 and t1 is very small (2us), differing values of m
> and n mean that the measurements don't correspond to 2 discrete timestamps,
> since the delivered performance delta is being measured across a
> significantly different time period than the reference performance
> delta. This has an influence on the perf measurement which is:
> ((del1 - del0) * reference perf) / (ref1 - ref0)
>
> Previously collected data [4] shows that cppc_get_perf_ctrs() itself
> takes anywhere between 3.6us and 4.9us, which further suggests that a
> 2us delta is too short.
>
> If we increase the time delta to a high enough value (i.e if delay >> m,n)
> then the effects of m and n get mitigated, leading to both the register
> measurements (ref and del) corresponding to the same timestamp.
>
> When this approach was previously proposed[3], there was concern about
> this function being called with interrupts off but that was later found to
> be not true [2]. So, waiting for a slightly longer time in between counter
> samples should be acceptable.
>
> Increase the time delay between counter reads to 200 us to reduce the
> effect of timing discrepancies in reading individual performance registers.
>
> [1] https://docs.kernel.org/admin-guide/acpi/cppc_sysfs.html#computing-average-delivered-performance
> [2] https://lore.kernel.org/all/7b57e680-0ba3-0b8b-851e-7cc369050386@os.amperecomputing.com/
> [3] https://lore.kernel.org/all/20230328193846.8757-1-yang@os.amperecomputing.com/
> [4] https://lore.kernel.org/all/1ce09fd7-0c1d-fc46-ce12-01b25fbd4afd@os.amperecomputing.com/
>
> Cc: Yang Shi <yang@os.amperecomputing.com>
> Cc: Ionela Voinescu <Ionela.Voinescu@arm.com>
> Signed-off-by: Prashant Malani <pmalani@google.com>
Thank you for cc'ing me on the patch. I posted a similar patch a while ago
and there was some discussion on the mailing list. Then someone else from
ARM pursued a different way to solve it, but I didn't follow it very
closely. If I remember correctly, a new sysfs interface called
cpuinfo_avg_freq was added; it should be the preferred way to get the CPU
frequency. Please see
https://github.com/torvalds/linux/commit/fbb4a4759b541d09ebb8e391d5fa7f9a5a0cad61.
Added Beata Michalska, the author of that patch, to the loop too.
Please feel free to correct me if I'm wrong.
Yang
> ---
> drivers/cpufreq/cppc_cpufreq.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/cppc_cpufreq.c b/drivers/cpufreq/cppc_cpufreq.c
> index 4a17162a392d..086c3b87bd4e 100644
> --- a/drivers/cpufreq/cppc_cpufreq.c
> +++ b/drivers/cpufreq/cppc_cpufreq.c
> @@ -718,7 +718,7 @@ static int cppc_get_perf_ctrs_sample(int cpu,
> if (ret)
> return ret;
>
> - udelay(2); /* 2usec delay between sampling */
> + udelay(200); /* 200usec delay between sampling */
>
> return cppc_get_perf_ctrs(cpu, fb_ctrs_t1);
> }
> --
> 2.50.1.552.g942d659e1b-goog
>
* Re: [PATCH] cpufreq: CPPC: Increase delay between perf counter reads
2025-08-06 0:26 ` Yang Shi
@ 2025-08-06 1:00 ` Prashant Malani
2025-08-19 9:28 ` Beata Michalska
0 siblings, 1 reply; 11+ messages in thread
From: Prashant Malani @ 2025-08-06 1:00 UTC (permalink / raw)
To: Yang Shi
Cc: open list, open list:CPU FREQUENCY SCALING FRAMEWORK,
Rafael J. Wysocki, Viresh Kumar, beata.michalska, Catalin Marinas,
Ionela Voinescu
Thanks Yang,
On Tue, 5 Aug 2025 at 17:26, Yang Shi <yang@os.amperecomputing.com> wrote:
> Thank you for cc'ing me on the patch. I posted a similar patch a while
> ago and there was some discussion on the mailing list. Then someone else
> from ARM pursued a different way to solve it, but I didn't follow it very closely.
> If I remember correctly, a new sysfs interface, called cpuinfo_avg_freq
> was added. It should be the preferred way to get cpu frequency. Please
> see
> https://github.com/torvalds/linux/commit/fbb4a4759b541d09ebb8e391d5fa7f9a5a0cad61.
>
> Added Beata Michalska in the loop too, who is the author of the patch.
> Please feel free to correct me, if I'm wrong.
Thanks for the additional context. Yeah, the issues are:
- The new sysfs node's sampling period is too long (20ms) [1].
That makes it problematic for userspace use cases, so we need something
that takes less time.
- The central accuracy issue behind cpuinfo_cur_freq still needs to be handled.
[1] https://elixir.bootlin.com/linux/v6.16/source/arch/arm64/kernel/topology.c#L283
--
-Prashant
* Re: [PATCH] cpufreq: CPPC: Increase delay between perf counter reads
2025-08-06 1:00 ` Prashant Malani
@ 2025-08-19 9:28 ` Beata Michalska
2025-08-19 23:43 ` Prashant Malani
0 siblings, 1 reply; 11+ messages in thread
From: Beata Michalska @ 2025-08-19 9:28 UTC (permalink / raw)
To: Prashant Malani
Cc: Yang Shi, open list, open list:CPU FREQUENCY SCALING FRAMEWORK,
Rafael J. Wysocki, Viresh Kumar, Catalin Marinas, Ionela Voinescu
On Tue, Aug 05, 2025 at 06:00:09PM -0700, Prashant Malani wrote:
> Thanks Yang,
>
> On Tue, 5 Aug 2025 at 17:26, Yang Shi <yang@os.amperecomputing.com> wrote:
> > Thank you for cc'ing me the patch. I posted the similar patch ago and
> > had some discussion on the mailing list. Then someone else from ARM
> > pursued a different way to solve it. But I didn't follow very closely.
> > If I remember correctly, a new sysfs interface, called cpuinfo_avg_freq
> > was added. It should be the preferred way to get cpu frequency. Please
> > see
> > https://github.com/torvalds/linux/commit/fbb4a4759b541d09ebb8e391d5fa7f9a5a0cad61.
> >
> > Added Beata Michalska in the loop too, who is the author of the patch.
> > Please feel free to correct me, if I'm wrong.
>
> Thanks for the additional context. Yeah, the issue is that :
> - The new sysfs node's sampling period is too long (20ms) [1]
> That makes it problematic for userspace use cases, so we need something
> which takes less time.
That actually specifies the duration for which the most recently acquired
sample is considered valid. Sampling is tick-based.
> - The central accuracy issue behind cpuinfo_cur_freq still needs to be handled.
I'd really discourage increasing the sampling period for cppc.
---
BR
Beata
>
> [1] https://elixir.bootlin.com/linux/v6.16/source/arch/arm64/kernel/topology.c#L283
>
>
> --
> -Prashant
* Re: [PATCH] cpufreq: CPPC: Increase delay between perf counter reads
2025-08-19 9:28 ` Beata Michalska
@ 2025-08-19 23:43 ` Prashant Malani
2025-08-20 20:49 ` Beata Michalska
0 siblings, 1 reply; 11+ messages in thread
From: Prashant Malani @ 2025-08-19 23:43 UTC (permalink / raw)
To: Beata Michalska
Cc: Yang Shi, open list, open list:CPU FREQUENCY SCALING FRAMEWORK,
Rafael J. Wysocki, Viresh Kumar, Catalin Marinas, Ionela Voinescu
On Tue, 19 Aug 2025 at 02:29, Beata Michalska <beata.michalska@arm.com> wrote:
>
> On Tue, Aug 05, 2025 at 06:00:09PM -0700, Prashant Malani wrote:
> > Thanks Yang,
> >
> > On Tue, 5 Aug 2025 at 17:26, Yang Shi <yang@os.amperecomputing.com> wrote:
> > > Thank you for cc'ing me the patch. I posted the similar patch ago and
> > > had some discussion on the mailing list. Then someone else from ARM
> > > pursued a different way to solve it. But I didn't follow very closely.
> > > If I remember correctly, a new sysfs interface, called cpuinfo_avg_freq
> > > was added. It should be the preferred way to get cpu frequency. Please
> > > see
> > > https://github.com/torvalds/linux/commit/fbb4a4759b541d09ebb8e391d5fa7f9a5a0cad61.
> > >
> > > Added Beata Michalska in the loop too, who is the author of the patch.
> > > Please feel free to correct me, if I'm wrong.
> >
> > Thanks for the additional context. Yeah, the issue is that :
> > - The new sysfs node is sampling period is too long (20ms) [1]
> > That makes it problematic for userspace use cases, so we need something
> > which takes less time.
> That actually specifies the duration, for which the most recently acquired
> sample is considered valid. Sampling is tick-based.
Thanks for the correction. I made an error in understanding the code.
>
> > - The central accuracy issue behind cpuinfo_cur_freq still needs to be handled.
> I'd really discourage increasing the sampling period for cppc.
The only true solution is to make the register reads (ref + delivered
combined) atomic. We see that this solves the issue when conducting the
same measurements in firmware (which is an RTOS, so there is no
scheduler randomness).
Outside of that, I can't think of another mitigation beyond adding delay to make
the time deltas not matter so much.
Perhaps ARM should introduce a "snapshot" feature which takes a snapshot
of the AMU counters on a register write; in any case, that's outside the
scope of this discussion.
Could you kindly explain why adding the udelay here is discouraged?
Best regards,
--
-Prashant
* Re: [PATCH] cpufreq: CPPC: Increase delay between perf counter reads
2025-08-19 23:43 ` Prashant Malani
@ 2025-08-20 20:49 ` Beata Michalska
2025-08-20 21:25 ` Prashant Malani
0 siblings, 1 reply; 11+ messages in thread
From: Beata Michalska @ 2025-08-20 20:49 UTC (permalink / raw)
To: Prashant Malani
Cc: Yang Shi, open list, open list:CPU FREQUENCY SCALING FRAMEWORK,
Rafael J. Wysocki, Viresh Kumar, Catalin Marinas, Ionela Voinescu
On Tue, Aug 19, 2025 at 04:43:59PM -0700, Prashant Malani wrote:
> On Tue, 19 Aug 2025 at 02:29, Beata Michalska <beata.michalska@arm.com> wrote:
> >
> > On Tue, Aug 05, 2025 at 06:00:09PM -0700, Prashant Malani wrote:
> > > Thanks Yang,
> > >
> > > On Tue, 5 Aug 2025 at 17:26, Yang Shi <yang@os.amperecomputing.com> wrote:
> > > > Thank you for cc'ing me the patch. I posted the similar patch ago and
> > > > had some discussion on the mailing list. Then someone else from ARM
> > > > pursued a different way to solve it. But I didn't follow very closely.
> > > > If I remember correctly, a new sysfs interface, called cpuinfo_avg_freq
> > > > was added. It should be the preferred way to get cpu frequency. Please
> > > > see
> > > > https://github.com/torvalds/linux/commit/fbb4a4759b541d09ebb8e391d5fa7f9a5a0cad61.
> > > >
> > > > Added Beata Michalska in the loop too, who is the author of the patch.
> > > > Please feel free to correct me, if I'm wrong.
> > >
> > > Thanks for the additional context. Yeah, the issue is that :
> > > - The new sysfs node is sampling period is too long (20ms) [1]
> > > That makes it problematic for userspace use cases, so we need something
> > > which takes less time.
> > That actually specifies the duration, for which the most recently acquired
> > sample is considered valid. Sampling is tick-based.
>
> Thanks for the correction. I made an error in understanding the code.
>
> >
> > > - The central accuracy issue behind cpuinfo_cur_freq still needs to be handled.
> > I'd really discourage increasing the sampling period for cppc.
>
> The only true solution is to make the register reads (ref + delivered
> combined) atomic.
> We see that this solves the issue when conducting the same
> measurements on firmware
> (which is an RTOS, so no scheduler randomness).
Kinda working on that one.
>
> Outside of that, I can't think of another mitigation beyond adding delay to make
> the time deltas not matter so much.
I'm not entirely sure what 'so much' means in this context.
How would one quantify whether the added delay actually mitigates the issue?
>
> Perhaps ARM should introduce a "snapshot" feature which takes the snapshot
> of the AMU counters on a register write; IAC that's outside the scope
What do you mean by register write ?
> of this discussion.
>
> Could you kindly explain why adding the udelay here is discouraged?
Would you mind clarifying how the specific value of 200 µs was determined?
Was it carefully derived from measurements across multiple platforms and
workloads, or simply observed to “work” on one relatively stable setup?
Understanding the basis for choosing that delay will help put the discussion
into the right context.
---
BR
Beata
>
> Best regards,
>
> --
> -Prashant
* Re: [PATCH] cpufreq: CPPC: Increase delay between perf counter reads
2025-08-20 20:49 ` Beata Michalska
@ 2025-08-20 21:25 ` Prashant Malani
2025-08-25 14:52 ` Beata Michalska
0 siblings, 1 reply; 11+ messages in thread
From: Prashant Malani @ 2025-08-20 21:25 UTC (permalink / raw)
To: Beata Michalska
Cc: Yang Shi, open list, open list:CPU FREQUENCY SCALING FRAMEWORK,
Rafael J. Wysocki, Viresh Kumar, Catalin Marinas, Ionela Voinescu
Hi Beata,
On Wed, 20 Aug 2025 at 13:49, Beata Michalska <beata.michalska@arm.com> wrote:
>
> Kinda working on that one.
OK. I'm eager to see what the solution is!
> >
> > Outside of that, I can't think of another mitigation beyond adding delay to make
> > the time deltas not matter so much.
> I'm not entirely sure what 'so much' means in this context.
> How one would quantify whether the added delay is actually mitigating the issue?
>
I alluded to it in the commit description, but here is my basic
numerical analysis:
The effective timestamps for the 4 readings right now are:
Timestamp t0: del0
Timestamp t0 + m: ref0
(Time delay X us)
Timestamp t1: del1
Timestamp t1 + n: ref1
Timestamp t1 = t0 + m + X
The perf calculation is:
Perf = (del1 - del0) / (ref1 - ref0)
     = del_counter_diff_over_time(t1 - t0) /
       ref_counter_diff_over_time((t1 + n) - (t0 + m))
     = del_counter_diff_over_time(t0 + m + X - t0) /
       ref_counter_diff_over_time(t0 + m + X + n - t0 - m)
     = del_counter_diff_over_time(m + X) / ref_counter_diff_over_time(n + X)
If X >> (m, n), this becomes:
     = del_counter_diff_over_time(X) / ref_counter_diff_over_time(X)
which is what the actual calculation is supposed to be.
If X ~ (m, n) (which is the case right now), the calculation
becomes erratic.
> >
> > Perhaps ARM should introduce a "snapshot" feature which takes the snapshot
> > of the AMU counters on a register write; IAC that's outside the scope
> What do you mean by register write ?
I mean that we should have a snapshot register, say
SYS_AMEVCNTR0_SNAPSHOT_EL0.
Writing to it would save the current values of SYS_AMEVCNTR0_CORE_EL0 and
SYS_AMEVCNTR0_CONST_EL0 into two shadow registers (say,
SYS_AMEVCNTR0_CORE_SNAPSHOT_EL0 and SYS_AMEVCNTR0_CONST_SNAPSHOT_EL0).
That way, the OS specifics of how the registers are read wouldn't matter.
Of course, I'm not a hardware architect, so this proposal is very rough, but
hopefully it helps illustrate my idea.
> > of this discussion.
> >
> > Could you kindly explain why adding the udelay here is discouraged?
>
> Would you mind clarifying how the specific value of 200 µs was determined?
> Was it carefully derived from measurements across multiple platforms and
> workloads, or simply observed to “work” on one relatively stable setup?
>
> Understanding the basis for choosing that delay will help put the discussion
> into the right context.
TBH, I just experimented with values on our system and observed the
readings of cores running stress-ng. I tried 100us and that still gave
variability (values greater than the theoretical max frequency). It's
possible that the "optimal value" is somewhere between these two.
There have been other observations on this topic [1] that suggest that
even 100us improves the error rate significantly from what it is with
2us.
BR,
-Prashant
[1] https://lore.kernel.org/all/20230328193846.8757-1-yang@os.amperecomputing.com/
* Re: [PATCH] cpufreq: CPPC: Increase delay between perf counter reads
2025-08-20 21:25 ` Prashant Malani
@ 2025-08-25 14:52 ` Beata Michalska
2025-08-25 20:24 ` Prashant Malani
0 siblings, 1 reply; 11+ messages in thread
From: Beata Michalska @ 2025-08-25 14:52 UTC (permalink / raw)
To: Prashant Malani
Cc: Yang Shi, open list, open list:CPU FREQUENCY SCALING FRAMEWORK,
Rafael J. Wysocki, Viresh Kumar, Catalin Marinas, Ionela Voinescu
On Wed, Aug 20, 2025 at 02:25:16PM -0700, Prashant Malani wrote:
> Hi Beata,
>
> On Wed, 20 Aug 2025 at 13:49, Beata Michalska <beata.michalska@arm.com> wrote:
> >
> > Kinda working on that one.
>
> OK. I'm eager to see what the solution is!
>
> > >
> > > Outside of that, I can't think of another mitigation beyond adding delay to make
> > > the time deltas not matter so much.
> > I'm not entirely sure what 'so much' means in this context.
> > How one would quantify whether the added delay is actually mitigating the issue?
> >
>
> I alluded to it in the commit description, but here is the my basic
> numerical analysis:
> The effective timestamps for the 4 readings right now are:
> Timestamp t0: del0
> Timestamp t0 + m: ref0
> (Time delay X us)
> Timestamp t1: del1
> Timestamp t1 + n: ref1
>
> Timestamp t1 = t0 + m + X
>
> The perf calculation is:
> Per = del1 - del0 / ref1 - ref0
> = Del_counter_diff_over_time(t1 - t0) /
> ref_counter_diff_over_time(t1 + n - (t0 + m))
> = Del_counter_diff_over time(t0 + m + X - t0) /
> ref_counter_diff_over_time((t0 + m + X + n - t0 - m)
> = Del_counter_diff_over_time(m + X) / ref_counter_diff_over_time(n + X)
>
> If X >> (m,n) this becomes:
> = Del_counter_diff_over_time(X) / ref_counter_diff_over_time(X)
> which is what the actual calculation is supposed to be.
>
> if X ~ (m, N) (which is what the case is right now), the calculation
> becomes erratic.
This is still bound by the 'm' and 'n' values, as the difference between them
will determine the error factor (for a given, fixed X). If m != n, one counter
delta is stretched more than the other, so the perf ratio no longer represents
the same time interval. And that will vary between platforms/workloads,
leading to over-/under-reporting.
>
> > >
> > > Perhaps ARM should introduce a "snapshot" feature which takes the snapshot
> > > of the AMU counters on a register write; IAC that's outside the scope
> > What do you mean by register write ?
>
> I mean that we should have a snapshot register:
> SYS_AMEVCNTR0_SNAPSHOT_EL0
>
> writing to this will save the current values of SYS_AMEVCNTR0_CORE_EL0 and
> SYS_AMEVCNTR0_CONST_EL0 into two shadow registers (say
> SYS_AMEVCNTR0_CORE_SNAPSHOT_EL0
> and SYS_AMEVCNTR0_CONST_SNAPSHOT_EL0)
>
> That way the OS specifics in how the registers are read don't matter.
>
> Of course, I'm not a H/W architect so this proposal is very rough, but hopefully
> helps illustrate my idea.
>
Ah, so you meant architectural changes - thanks for clarification.
> > > of this discussion.
> > >
> > > Could you kindly explain why adding the udelay here is discouraged?
> >
> > Would you mind clarifying how the specific value of 200 µs was determined?
> > Was it carefully derived from measurements across multiple platforms and
> > workloads, or simply observed to “work” on one relatively stable setup?
> >
> > Understanding the basis for choosing that delay will help put the discussion
> > into the right context.
>
> TBH, I just experimented with values on our system and observed the readings of
> cores running stress-ng. I tried 100us and that still gave variability
> (values greater than
> the theoretical max frequency). It's possible that the "optimal value"
> is somewhere
> between these two.
>
> There have been other observations on this topic [1], that suggest
> that even 100us
> improves the error rate significantly from what it is with 2us.
>
> BR,
Which is exactly why I mentioned that this approach is not really
recommended, as it is bound to a rather specific setup. There have been
similar proposals in the past, all with different delay values, which
should illustrate how fragile such a solution (if any) is.
---
BR
Beata
> -Prashant
>
> [1] https://lore.kernel.org/all/20230328193846.8757-1-yang@os.amperecomputing.com/
* Re: [PATCH] cpufreq: CPPC: Increase delay between perf counter reads
2025-08-25 14:52 ` Beata Michalska
@ 2025-08-25 20:24 ` Prashant Malani
2025-08-28 7:33 ` Beata Michalska
0 siblings, 1 reply; 11+ messages in thread
From: Prashant Malani @ 2025-08-25 20:24 UTC (permalink / raw)
To: Beata Michalska
Cc: Yang Shi, open list, open list:CPU FREQUENCY SCALING FRAMEWORK,
Rafael J. Wysocki, Viresh Kumar, Catalin Marinas, Ionela Voinescu
On Aug 25 16:52, Beata Michalska wrote:
> On Wed, Aug 20, 2025 at 02:25:16PM -0700, Prashant Malani wrote:
> > Hi Beata,
> >
> > On Wed, 20 Aug 2025 at 13:49, Beata Michalska <beata.michalska@arm.com> wrote:
> > >
> > > Kinda working on that one.
> >
> > OK. I'm eager to see what the solution is!
> >
> > > >
> > > > Outside of that, I can't think of another mitigation beyond adding delay to make
> > > > the time deltas not matter so much.
> > > I'm not entirely sure what 'so much' means in this context.
> > > How one would quantify whether the added delay is actually mitigating the issue?
> > >
> >
> > I alluded to it in the commit description, but here is the my basic
> > numerical analysis:
> > The effective timestamps for the 4 readings right now are:
> > Timestamp t0: del0
> > Timestamp t0 + m: ref0
> > (Time delay X us)
> > Timestamp t1: del1
> > Timestamp t1 + n: ref1
> >
> > Timestamp t1 = t0 + m + X
> >
> > The perf calculation is:
> > Per = del1 - del0 / ref1 - ref0
> > = Del_counter_diff_over_time(t1 - t0) /
> > ref_counter_diff_over_time(t1 + n - (t0 + m))
> > = Del_counter_diff_over time(t0 + m + X - t0) /
> > ref_counter_diff_over_time((t0 + m + X + n - t0 - m)
> > = Del_counter_diff_over_time(m + X) / ref_counter_diff_over_time(n + X)
> >
> > If X >> (m,n) this becomes:
> > = Del_counter_diff_over_time(X) / ref_counter_diff_over_time(X)
> > which is what the actual calculation is supposed to be.
> >
> > if X ~ (m, N) (which is what the case is right now), the calculation
> > becomes erratic.
> This is still bound by 'm' and 'n' values, as the difference between those will
> determine the error factor (with given, fixed X). If m != n, one counter delta
> is stretched more than the other, so the perf ratio no longer represents the
> same time interval. And that will vary between platforms/workloads leading to
> over/under-reporting.
What you are saying holds when m, n ~ X. But if X >> m, n, the X component
dominates. On most platforms, m and n are typically 1-2 us.
If X is anything >= 100us, it dominates the m, n component, making both
time intervals practically the same, i.e.:
(100 + 1) / (100 + 2) = 101 / 102 = 0.9902 ~ 1.00
> >
> > There have been other observations on this topic [1], that suggest
> > that even 100us
> > improves the error rate significantly from what it is with 2us.
> >
> > BR,
> Which is exactly why I've mentioned this approach is not really recommended,
> being bound to rather specific setup. There have been similar proposals in the
> past, all with different values of the delay which should illustrate how fragile
> solution (if any) that is.
The reports/occurrences point to the fact that the current value doesn't work.
Another way of putting it is: why is 2us considered the "right"
value?
This patch was never meant to be an ideal solution, but it's better than what
is there at present. Currently, `policy->cur` is completely unusable on CPPC,
and the problem is cropping up in other locations in the cpufreq driver
core [1] while also breaking a user-facing ABI, i.e. scaling_setspeed.
I realize you're working on a solution, so if that is O(weeks) away, it
makes sense to wait; otherwise it would seem logical to mitigate the
error (it can always be reverted once the "better" solution is in
place).
Ultimately it's your call, but I'm not convinced by the rationale provided
thus far.
Best regards,
-Prashant
[1] https://lore.kernel.org/linux-pm/20250823001937.2765316-1-pmalani@google.com/T/#t
* Re: [PATCH] cpufreq: CPPC: Increase delay between perf counter reads
2025-08-25 20:24 ` Prashant Malani
@ 2025-08-28 7:33 ` Beata Michalska
2025-08-28 21:29 ` Prashant Malani
0 siblings, 1 reply; 11+ messages in thread
From: Beata Michalska @ 2025-08-28 7:33 UTC (permalink / raw)
To: Prashant Malani
Cc: Yang Shi, open list, open list:CPU FREQUENCY SCALING FRAMEWORK,
Rafael J. Wysocki, Viresh Kumar, Catalin Marinas, Ionela Voinescu
On Mon, Aug 25, 2025 at 08:24:52PM +0000, Prashant Malani wrote:
> On Aug 25 16:52, Beata Michalska wrote:
> > On Wed, Aug 20, 2025 at 02:25:16PM -0700, Prashant Malani wrote:
> > > Hi Beata,
> > >
> > > On Wed, 20 Aug 2025 at 13:49, Beata Michalska <beata.michalska@arm.com> wrote:
> > > >
> > > > Kinda working on that one.
> > >
> > > OK. I'm eager to see what the solution is!
> > >
> > > > >
> > > > > Outside of that, I can't think of another mitigation beyond adding delay to make
> > > > > the time deltas not matter so much.
> > > > I'm not entirely sure what 'so much' means in this context.
> > > > How one would quantify whether the added delay is actually mitigating the issue?
> > > >
> > >
> > > I alluded to it in the commit description, but here is the my basic
> > > numerical analysis:
> > > The effective timestamps for the 4 readings right now are:
> > > Timestamp t0: del0
> > > Timestamp t0 + m: ref0
> > > (Time delay X us)
> > > Timestamp t1: del1
> > > Timestamp t1 + n: ref1
> > >
> > > Timestamp t1 = t0 + m + X
> > >
> > > The perf calculation is:
> > > Per = del1 - del0 / ref1 - ref0
> > > = Del_counter_diff_over_time(t1 - t0) /
> > > ref_counter_diff_over_time(t1 + n - (t0 + m))
> > > = Del_counter_diff_over time(t0 + m + X - t0) /
> > > ref_counter_diff_over_time((t0 + m + X + n - t0 - m)
> > > = Del_counter_diff_over_time(m + X) / ref_counter_diff_over_time(n + X)
> > >
> > > If X >> (m,n) this becomes:
> > > = Del_counter_diff_over_time(X) / ref_counter_diff_over_time(X)
> > > which is what the actual calculation is supposed to be.
> > >
> > > if X ~ (m, N) (which is what the case is right now), the calculation
> > > becomes erratic.
> > This is still bound by 'm' and 'n' values, as the difference between those will
> > determine the error factor (with given, fixed X). If m != n, one counter delta
> > is stretched more than the other, so the perf ratio no longer represents the
> > same time interval. And that will vary between platforms/workloads leading to
> > over/under-reporting.
>
> What you are saying holds when m,n ~ X. But if X >> m,n, the X component
> dominates. On most platforms, m and n are typically 1-2 us.
> If X is anything >= 100us, it dominates the m,n component, making both
> time intervals practically the same, i.e.:
>
> (100 + 1) / (100 + 2) = 101 / 102 = 0.9902 ~ 1.00
True but that does still influence the error - in this case that's ~1% so
negligible. But the overall error magnitude does increase when the range
between min and max of the possible values of m and n gets bigger.
The question is what maximum error can be deemed acceptable.
And I'm pretty sure there are platforms that would require bigger X still.
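As a quick numeric sketch of the skew model discussed above (the 1-2us values for m and n are taken from the figures quoted in this thread, and the function name is purely illustrative):

```python
# Model of the perf-ratio skew discussed above: the delivered-counter
# delta spans (X + m) us while the reference-counter delta spans
# (X + n) us, so the measured ratio is off by (X + m) / (X + n).
def perf_ratio_error(x_us, m_us, n_us):
    """Relative error of the perf ratio for delay X and read skews m, n."""
    return abs((x_us + m_us) / (x_us + n_us) - 1.0)

# With the current 2us delay, a 1-2us skew dominates the interval:
print(perf_ratio_error(2, 1, 2))    # 0.25, i.e. ~25% error
# With a 100us delay, the same skew is nearly negligible:
print(perf_ratio_error(100, 1, 2))  # ~0.0098, i.e. ~1% error
```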
>
> > >
> > > There have been other observations on this topic [1], which suggest
> > > that even 100us improves the error rate significantly from what it
> > > is with 2us.
> > >
> > > BR,
> > Which is exactly why I've mentioned this approach is not really recommended,
> > being bound to rather specific setup. There have been similar proposals in the
> > past, all with different values of the delay, which should illustrate
> > how fragile a solution (if any) that is.
>
> The reports/occurrences point to the fact that the current value doesn't work.
Wasn't claiming it does.
>
> Another way of putting it is, why is 2us considered the "right"
> value?
Looking at the history, the argument was pretty much the same as yours: it was
considered sufficient for most platforms [1].
>
> This patch was never meant to be an ideal solution, but it's better than what
> is there at present. Currently, the `policy->cur` is completely unusable on CPPC,
> and this is cropping up in other locations in the cpufreq driver core [1],
> while also breaking a user-facing ABI, i.e. scaling_setspeed.
>
> I realize you're working on a solution, so if that is O(weeks) away, it
> makes sense to wait; otherwise it would seem logical to mitigate the
> error (it can always be reverted once the "better" solution is in
> place).
>
> Ultimately it's your call, but I'm not convinced by the rationale provided
> thus far.
Actually it is not up to me, I'm simply sharing my opinion, which is:
we should fix the problem instead of hiding it.
Setting that aside though - this change seems rather harmless.
Aside:
./scripts/get_maintainer.pl --m ./drivers/cpufreq/cppc_cpufreq.c
---
[1] https://lore.kernel.org/all/37517652-9a74-83f8-1315-07fe79a78d73@codeaurora.org/
---
BR
Beata
>
> Best regards,
>
> -Prashant
>
> [1] https://lore.kernel.org/linux-pm/20250823001937.2765316-1-pmalani@google.com/T/#t
* Re: [PATCH] cpufreq: CPPC: Increase delay between perf counter reads
2025-08-28 7:33 ` Beata Michalska
@ 2025-08-28 21:29 ` Prashant Malani
0 siblings, 0 replies; 11+ messages in thread
From: Prashant Malani @ 2025-08-28 21:29 UTC (permalink / raw)
To: Beata Michalska
Cc: Yang Shi, open list, open list:CPU FREQUENCY SCALING FRAMEWORK,
Rafael J. Wysocki, Viresh Kumar, Catalin Marinas, Ionela Voinescu
On Aug 28 09:33, Beata Michalska wrote:
> On Mon, Aug 25, 2025 at 08:24:52PM +0000, Prashant Malani wrote:
> > On Aug 25 16:52, Beata Michalska wrote:
> > What you are saying holds when m,n ~ X. But if X >> m,n, the X component
> > dominates. On most platforms, m and n are typically 1-2 us.
> > If X is anything >= 100us, it dominates the m,n component, making both
> > time intervals practically the same, i.e.:
> >
> > (100 + 1) / (100 + 2) = 101 / 102 = 0.9902 ~ 1.00
> True but that does still influence the error - in this case that's ~1% so
> negligible. But the overall error magnitude does increase when the range
> between min and max of the possible values of m and n gets bigger.
> The question is what maximum error can be deemed acceptable.
> And I'm pretty sure there are platforms that would require bigger X still.
Looking at past measurements in related threads, I think 1-2us is the
typical m,n value for most platforms (it might be a little less, but it's
challenging to measure this accurately with ftrace, since the timestamps
have usec precision).
Even a 5% error would be a great improvement from the current state.
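As a back-of-the-envelope check (assuming the worst-case read skew |m - n| is bounded by 2us, per the figures above; the helper name is made up for illustration):

```python
# The relative error |(X + m)/(X + n) - 1| = |m - n| / (X + n), which is
# at most |m - n| / X, so the smallest delay X that keeps the error
# under a target bound is simply skew / bound.
def min_delay_us(max_skew_us, max_error):
    return max_skew_us / max_error

print(min_delay_us(2, 0.05))  # about 40us keeps the error under 5%
print(min_delay_us(2, 0.01))  # about 200us for a 1% bound
```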
>
> >
> > > >
> > > > There have been other observations on this topic [1], which suggest
> > > > that even 100us improves the error rate significantly from what it
> > > > is with 2us.
> > > >
> > > > BR,
> > > Which is exactly why I've mentioned this approach is not really recommended,
> > > being bound to rather specific setup. There have been similar proposals in the
> > > past, all with different values of the delay, which should illustrate
> > > how fragile a solution (if any) that is.
> >
> > The reports/occurrences point to the fact that the current value doesn't work.
> Wasn't claiming it does.
> >
> > Another way of putting it is, why is 2us considered the "right"
> > value?
> Looking at the history, the argument was pretty much the same as yours: it was
> considered sufficient for most platforms [1].
Thanks for that link. It provided helpful context.
> >
> > This patch was never meant to be an ideal solution, but it's better than what
> > is there at present. Currently, the `policy->cur` is completely unusable on CPPC,
> > and this is cropping up in other locations in the cpufreq driver core [1],
> > while also breaking a user-facing ABI, i.e. scaling_setspeed.
> >
> > I realize you're working on a solution, so if that is O(weeks) away, it
> > makes sense to wait; otherwise it would seem logical to mitigate the
> > error (it can always be reverted once the "better" solution is in
> > place).
> >
> > Ultimately it's your call, but I'm not convinced by the rationale provided
> > thus far.
> Actually it is not up to me, I'm simply sharing my opinion, which is:
> we should fix the problem instead of hiding it.
I think both options can be pursued concurrently :)
> Setting that aside though - this change seems rather harmless.
>
> Aside:
> ./scripts/get_maintainer.pl --m ./drivers/cpufreq/cppc_cpufreq.c
Yes, I use this for all mailing list submissions (including this one),
so I believe the maintainers should be on this thread.
BR,
-Prashant
Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-07-30 22:07 [PATCH] cpufreq: CPPC: Increase delay between perf counter reads Prashant Malani
2025-08-06 0:26 ` Yang Shi
2025-08-06 1:00 ` Prashant Malani
2025-08-19 9:28 ` Beata Michalska
2025-08-19 23:43 ` Prashant Malani
2025-08-20 20:49 ` Beata Michalska
2025-08-20 21:25 ` Prashant Malani
2025-08-25 14:52 ` Beata Michalska
2025-08-25 20:24 ` Prashant Malani
2025-08-28 7:33 ` Beata Michalska
2025-08-28 21:29 ` Prashant Malani