linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC PATCH] cpufreq,base/arch_topology: Calculate cpu_capacity according to boost
@ 2025-06-26  9:30 Dietmar Eggemann
  2025-07-14 12:16 ` Dietmar Eggemann
  0 siblings, 1 reply; 5+ messages in thread
From: Dietmar Eggemann @ 2025-06-26  9:30 UTC (permalink / raw)
  To: Rafael J . Wysocki, Viresh Kumar, Sudeep Holla, Christian Loehle
  Cc: linux-pm, linux-kernel, Robin Murphy, Beata Michalska,
	zhenglifeng1

I noticed on my Arm64 big.Little platform (Juno-r0, scmi-cpufreq) that
the cpu_scale values (/sys/devices/system/cpu/cpu*/cpu_capacity) of the
little CPU changed in v6.14 from 446 to 505. I bisected and found that
commit dd016f379ebc ("cpufreq: Introduce a more generic way to set
default per-policy boost flag") (1) introduced this change.
Juno's scmi FW marks the 2 topmost OPPs of each CPUfreq policy (policy0:
775000 850000, policy1: 950000 1100000) as boost OPPs.

The reason is that the 'policy->boost_enabled = true' is now done after
'cpufreq_table_validate_and_sort() -> cpufreq_frequency_table_cpuinfo()'
in cpufreq_online() so that 'policy->cpuinfo.max_freq' is set to the
'highest non-boost' instead of the 'highest boost' frequency.

This is before the CPUFREQ_CREATE_POLICY notifier is fired in
cpufreq_online() to which the cpu_capacity setup code in
[drivers/base/arch_topology.c] has registered.

Its notifier_call init_cpu_capacity_callback() uses
'policy->cpuinfo.max_freq' to set the per-cpu
capacity_freq_ref so that the cpu_capacity can be calculated as:

cpu_capacity = raw_cpu_capacity (2) * capacity_freq_ref /
				      'max system-wide cpu frequency'

(2) Juno's little CPU has 'capacity-dmips-mhz = <578>'.

So before (1) for a little CPU:

cpu_capacity = 578 * 850000 / 1100000 = 446

and after:

cpu_capacity = 578 * 700000 / 800000 = 505

This issue can also be seen on Arm64 boards with cpufreq-dt drivers
using the 'turbo-mode' dt property for boosted OPPs.

What's actually needed IMHO is to calculate cpu_capacity according to
the boost value. I.e.:

(a) The infrastructure to adjust cpu_capacity in arch_topology.c has to
    be kept alive after boot.

(b) There has to be some kind of notification from cpufreq.c to
    arch_topology.c about the toggling of boost. I'm abusing
    CPUFREQ_CREATE_POLICY for this right now. Could we perhaps add a
    CPUFREQ_MOD_POLICY for this?

(c) Allow unconditional set of policy->cpuinfo.max_freq in case boost
    is set to 0 in cpufreq_frequency_table_cpuinfo().
    This currently clashes with the commented feature that in case the
    driver has set a higher value it should stay untouched.

Tested on Arm64 Juno (scmi-cpufreq) and Hikey 960 (cpufreq-dt +
added 'turbo-mode' to the topmost OPPs in dts file).

This is probably related what Christian Loehle tried to address in
https://lkml.kernel.org/r/3cc5b83b-f81c-4bd7-b7ff-4d02db4e25d8@arm.com .

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 drivers/base/arch_topology.c | 11 -----------
 drivers/cpufreq/cpufreq.c    |  3 +++
 drivers/cpufreq/freq_table.c |  8 +-------
 3 files changed, 4 insertions(+), 18 deletions(-)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 1c3221ff1d1f..0a3916dc9644 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -378,8 +378,6 @@ void acpi_processor_init_invariance_cppc(void)
 
 #ifdef CONFIG_CPU_FREQ
 static cpumask_var_t cpus_to_visit;
-static void parsing_done_workfn(struct work_struct *work);
-static DECLARE_WORK(parsing_done_work, parsing_done_workfn);
 
 static int
 init_cpu_capacity_callback(struct notifier_block *nb,
@@ -408,10 +406,8 @@ init_cpu_capacity_callback(struct notifier_block *nb,
 		if (raw_capacity) {
 			topology_normalize_cpu_scale();
 			schedule_work(&update_topology_flags_work);
-			free_raw_capacity();
 		}
 		pr_debug("cpu_capacity: parsing done\n");
-		schedule_work(&parsing_done_work);
 	}
 
 	return 0;
@@ -447,13 +443,6 @@ static int __init register_cpufreq_notifier(void)
 }
 core_initcall(register_cpufreq_notifier);
 
-static void parsing_done_workfn(struct work_struct *work)
-{
-	cpufreq_unregister_notifier(&init_cpu_capacity_notifier,
-					 CPUFREQ_POLICY_NOTIFIER);
-	free_cpumask_var(cpus_to_visit);
-}
-
 #else
 core_initcall(free_raw_capacity);
 #endif
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index d7426e1d8bdd..6fdfcb6815d7 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -2798,6 +2798,9 @@ int cpufreq_boost_set_sw(struct cpufreq_policy *policy, int state)
 		return ret;
 	}
 
+	blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
+				     CPUFREQ_CREATE_POLICY, policy);
+
 	ret = freq_qos_update_request(policy->max_freq_req, policy->max);
 	if (ret < 0)
 		return ret;
diff --git a/drivers/cpufreq/freq_table.c b/drivers/cpufreq/freq_table.c
index 35de513af6c9..06068a12ec53 100644
--- a/drivers/cpufreq/freq_table.c
+++ b/drivers/cpufreq/freq_table.c
@@ -51,13 +51,7 @@ int cpufreq_frequency_table_cpuinfo(struct cpufreq_policy *policy,
 	}
 
 	policy->min = policy->cpuinfo.min_freq = min_freq;
-	policy->max = max_freq;
-	/*
-	 * If the driver has set its own cpuinfo.max_freq above max_freq, leave
-	 * it as is.
-	 */
-	if (policy->cpuinfo.max_freq < max_freq)
-		policy->max = policy->cpuinfo.max_freq = max_freq;
+	policy->max = policy->cpuinfo.max_freq = max_freq;
 
 	if (policy->min == ~0)
 		return -EINVAL;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH] cpufreq,base/arch_topology: Calculate cpu_capacity according to boost
  2025-06-26  9:30 [RFC PATCH] cpufreq,base/arch_topology: Calculate cpu_capacity according to boost Dietmar Eggemann
@ 2025-07-14 12:16 ` Dietmar Eggemann
  2025-08-04 13:01   ` Vincent Guittot
  0 siblings, 1 reply; 5+ messages in thread
From: Dietmar Eggemann @ 2025-07-14 12:16 UTC (permalink / raw)
  To: Rafael J . Wysocki, Viresh Kumar, Sudeep Holla, Christian Loehle
  Cc: linux-pm, linux-kernel, Robin Murphy, Beata Michalska,
	zhenglifeng1, vincent.guittot@linaro.org, Ionela Voinescu

+cc Vincent Guittot <vincent.guittot@linaro.org>
+cc Ionela Voinescu <ionela.voinescu@arm.com>

On 26/06/2025 11:30, Dietmar Eggemann wrote:
> I noticed on my Arm64 big.Little platform (Juno-r0, scmi-cpufreq) that
> the cpu_scale values (/sys/devices/system/cpu/cpu*/cpu_capacity) of the
> little CPU changed in v6.14 from 446 to 505. I bisected and found that
> commit dd016f379ebc ("cpufreq: Introduce a more generic way to set
> default per-policy boost flag") (1) introduced this change.
> Juno's scmi FW marks the 2 topmost OPPs of each CPUfreq policy (policy0:
> 775000 850000, policy1: 950000 1100000) as boost OPPs.
> 
> The reason is that the 'policy->boost_enabled = true' is now done after
> 'cpufreq_table_validate_and_sort() -> cpufreq_frequency_table_cpuinfo()'
> in cpufreq_online() so that 'policy->cpuinfo.max_freq' is set to the
> 'highest non-boost' instead of the 'highest boost' frequency.
> 
> This is before the CPUFREQ_CREATE_POLICY notifier is fired in
> cpufreq_online() to which the cpu_capacity setup code in
> [drivers/base/arch_topology.c] has registered.
> 
> Its notifier_call init_cpu_capacity_callback() uses
> 'policy->cpuinfo.max_freq' to set the per-cpu
> capacity_freq_ref so that the cpu_capacity can be calculated as:
> 
> cpu_capacity = raw_cpu_capacity (2) * capacity_freq_ref /
> 				      'max system-wide cpu frequency'
> 
> (2) Juno's little CPU has 'capacity-dmips-mhz = <578>'.
> 
> So before (1) for a little CPU:
> 
> cpu_capacity = 578 * 850000 / 1100000 = 446
> 
> and after:
> 
> cpu_capacity = 578 * 700000 / 800000 = 505
> 
> This issue can also be seen on Arm64 boards with cpufreq-dt drivers
> using the 'turbo-mode' dt property for boosted OPPs.
> 
> What's actually needed IMHO is to calculate cpu_capacity according to
> the boost value. I.e.:
> 
> (a) The infrastructure to adjust cpu_capacity in arch_topology.c has to
>     be kept alive after boot.
> 
> (b) There has to be some kind of notification from cpufreq.c to
>     arch_topology.c about the toggling of boost. I'm abusing
>     CPUFREQ_CREATE_POLICY for this right now. Could we perhaps add a
>     CPUFREQ_MOD_POLICY for this?
> 
> (c) Allow unconditional set of policy->cpuinfo.max_freq in case boost
>     is set to 0 in cpufreq_frequency_table_cpuinfo().
>     This currently clashes with the commented feature that in case the
>     driver has set a higher value it should stay untouched.
> 
> Tested on Arm64 Juno (scmi-cpufreq) and Hikey 960 (cpufreq-dt +
> added 'turbo-mode' to the topmost OPPs in dts file).
> 
> This is probably related what Christian Loehle tried to address in
> https://lkml.kernel.org/r/3cc5b83b-f81c-4bd7-b7ff-4d02db4e25d8@arm.com .

Christian L. reminded me that since commit dd016f379ebc we also have a
performance regression on a system with boosted OPPs using schedutil
CPUfreq governor.

The reason is that per cpu 'capacity_freq_ref' is set in
drivers/base/arch_topology.c only during system boot so far based on the
highest non-boosted OPP since boost is disabled per default.

Schedutil uses capacity_freq_ref (*) in get_next_freq() to calculate the
next frequency request:

   next_freq = max_freq * util / max
               ^^^^^^^^
                 (*)

In case the boost OPPs will be enabled:

   echo 1 > /sys/devices/system/cpu/cpufreq/boost

'capacity_freq_ref' stays at the highest non-boosted OPP's so schedutil
won't request any boosted OPPs for util values > ''highest non boosted
OPP'/'highest boosted OPP' * max'. The 'highest non boosted OPP' will be
used by schedutil instead.

This performance regression will go away with the proposed patch as well.

Calling drivers/base/arch_topology.c's init_cpu_capacity_callback() in
the event that boost is toggled makes sure that 'capacity_freq_ref' will
be set to the highest boosted (0->1) or highest non-boosted (1->0) OPP.

[...]






^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH] cpufreq,base/arch_topology: Calculate cpu_capacity according to boost
  2025-07-14 12:16 ` Dietmar Eggemann
@ 2025-08-04 13:01   ` Vincent Guittot
  2025-08-04 13:18     ` Christian Loehle
  0 siblings, 1 reply; 5+ messages in thread
From: Vincent Guittot @ 2025-08-04 13:01 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Rafael J . Wysocki, Viresh Kumar, Sudeep Holla, Christian Loehle,
	linux-pm, linux-kernel, Robin Murphy, Beata Michalska,
	zhenglifeng1, Ionela Voinescu

On Mon, 14 Jul 2025 at 14:17, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> +cc Vincent Guittot <vincent.guittot@linaro.org>
> +cc Ionela Voinescu <ionela.voinescu@arm.com>
>
> On 26/06/2025 11:30, Dietmar Eggemann wrote:
> > I noticed on my Arm64 big.Little platform (Juno-r0, scmi-cpufreq) that
> > the cpu_scale values (/sys/devices/system/cpu/cpu*/cpu_capacity) of the
> > little CPU changed in v6.14 from 446 to 505. I bisected and found that
> > commit dd016f379ebc ("cpufreq: Introduce a more generic way to set
> > default per-policy boost flag") (1) introduced this change.
> > Juno's scmi FW marks the 2 topmost OPPs of each CPUfreq policy (policy0:
> > 775000 850000, policy1: 950000 1100000) as boost OPPs.
> >
> > The reason is that the 'policy->boost_enabled = true' is now done after
> > 'cpufreq_table_validate_and_sort() -> cpufreq_frequency_table_cpuinfo()'
> > in cpufreq_online() so that 'policy->cpuinfo.max_freq' is set to the
> > 'highest non-boost' instead of the 'highest boost' frequency.
> >
> > This is before the CPUFREQ_CREATE_POLICY notifier is fired in
> > cpufreq_online() to which the cpu_capacity setup code in
> > [drivers/base/arch_topology.c] has registered.
> >
> > Its notifier_call init_cpu_capacity_callback() uses
> > 'policy->cpuinfo.max_freq' to set the per-cpu
> > capacity_freq_ref so that the cpu_capacity can be calculated as:
> >
> > cpu_capacity = raw_cpu_capacity (2) * capacity_freq_ref /
> >                                     'max system-wide cpu frequency'
> >
> > (2) Juno's little CPU has 'capacity-dmips-mhz = <578>'.
> >
> > So before (1) for a little CPU:
> >
> > cpu_capacity = 578 * 850000 / 1100000 = 446
> >
> > and after:
> >
> > cpu_capacity = 578 * 700000 / 800000 = 505
> >
> > This issue can also be seen on Arm64 boards with cpufreq-dt drivers
> > using the 'turbo-mode' dt property for boosted OPPs.
> >
> > What's actually needed IMHO is to calculate cpu_capacity according to
> > the boost value. I.e.:
> >
> > (a) The infrastructure to adjust cpu_capacity in arch_topology.c has to
> >     be kept alive after boot.

If we adjust the cpu_capacity at runtime this will create oscillation
in PELT values. We should stay with one single capacity all time :
- either include boost value but when boost is disable we will never
reach the max capacity of the cpu which could imply that the cpu will
never be overloaded (from scheduler pov)
- either not include boost_value but allow to go above cpu max compute
capacity which is something we already discussed for x86 and the turbo
freq in the past.

> >
> > (b) There has to be some kind of notification from cpufreq.c to
> >     arch_topology.c about the toggling of boost. I'm abusing
> >     CPUFREQ_CREATE_POLICY for this right now. Could we perhaps add a
> >     CPUFREQ_MOD_POLICY for this?
> >
> > (c) Allow unconditional set of policy->cpuinfo.max_freq in case boost
> >     is set to 0 in cpufreq_frequency_table_cpuinfo().
> >     This currently clashes with the commented feature that in case the
> >     driver has set a higher value it should stay untouched.
> >
> > Tested on Arm64 Juno (scmi-cpufreq) and Hikey 960 (cpufreq-dt +
> > added 'turbo-mode' to the topmost OPPs in dts file).
> >
> > This is probably related what Christian Loehle tried to address in
> > https://lkml.kernel.org/r/3cc5b83b-f81c-4bd7-b7ff-4d02db4e25d8@arm.com .
>
> Christian L. reminded me that since commit dd016f379ebc we also have a
> performance regression on a system with boosted OPPs using schedutil
> CPUfreq governor.
>
> The reason is that per cpu 'capacity_freq_ref' is set in
> drivers/base/arch_topology.c only during system boot so far based on the
> highest non-boosted OPP since boost is disabled per default.
>
> Schedutil uses capacity_freq_ref (*) in get_next_freq() to calculate the
> next frequency request:
>
>    next_freq = max_freq * util / max
>                ^^^^^^^^
>                  (*)
>
> In case the boost OPPs will be enabled:
>
>    echo 1 > /sys/devices/system/cpu/cpufreq/boost
>
> 'capacity_freq_ref' stays at the highest non-boosted OPP's so schedutil
> won't request any boosted OPPs for util values > ''highest non boosted
> OPP'/'highest boosted OPP' * max'. The 'highest non boosted OPP' will be
> used by schedutil instead.
>
> This performance regression will go away with the proposed patch as well.
>
> Calling drivers/base/arch_topology.c's init_cpu_capacity_callback() in
> the event that boost is toggled makes sure that 'capacity_freq_ref' will
> be set to the highest boosted (0->1) or highest non-boosted (1->0) OPP.
>
> [...]
>
>
>
>
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH] cpufreq,base/arch_topology: Calculate cpu_capacity according to boost
  2025-08-04 13:01   ` Vincent Guittot
@ 2025-08-04 13:18     ` Christian Loehle
  2025-08-04 13:41       ` Vincent Guittot
  0 siblings, 1 reply; 5+ messages in thread
From: Christian Loehle @ 2025-08-04 13:18 UTC (permalink / raw)
  To: Vincent Guittot, Dietmar Eggemann
  Cc: Rafael J . Wysocki, Viresh Kumar, Sudeep Holla, linux-pm,
	linux-kernel, Robin Murphy, Beata Michalska, zhenglifeng1,
	Ionela Voinescu

On 8/4/25 14:01, Vincent Guittot wrote:
> On Mon, 14 Jul 2025 at 14:17, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> +cc Vincent Guittot <vincent.guittot@linaro.org>
>> +cc Ionela Voinescu <ionela.voinescu@arm.com>
>>
>> On 26/06/2025 11:30, Dietmar Eggemann wrote:
>>> I noticed on my Arm64 big.Little platform (Juno-r0, scmi-cpufreq) that
>>> the cpu_scale values (/sys/devices/system/cpu/cpu*/cpu_capacity) of the
>>> little CPU changed in v6.14 from 446 to 505. I bisected and found that
>>> commit dd016f379ebc ("cpufreq: Introduce a more generic way to set
>>> default per-policy boost flag") (1) introduced this change.
>>> Juno's scmi FW marks the 2 topmost OPPs of each CPUfreq policy (policy0:
>>> 775000 850000, policy1: 950000 1100000) as boost OPPs.
>>>
>>> The reason is that the 'policy->boost_enabled = true' is now done after
>>> 'cpufreq_table_validate_and_sort() -> cpufreq_frequency_table_cpuinfo()'
>>> in cpufreq_online() so that 'policy->cpuinfo.max_freq' is set to the
>>> 'highest non-boost' instead of the 'highest boost' frequency.
>>>
>>> This is before the CPUFREQ_CREATE_POLICY notifier is fired in
>>> cpufreq_online() to which the cpu_capacity setup code in
>>> [drivers/base/arch_topology.c] has registered.
>>>
>>> Its notifier_call init_cpu_capacity_callback() uses
>>> 'policy->cpuinfo.max_freq' to set the per-cpu
>>> capacity_freq_ref so that the cpu_capacity can be calculated as:
>>>
>>> cpu_capacity = raw_cpu_capacity (2) * capacity_freq_ref /
>>>                                     'max system-wide cpu frequency'
>>>
>>> (2) Juno's little CPU has 'capacity-dmips-mhz = <578>'.
>>>
>>> So before (1) for a little CPU:
>>>
>>> cpu_capacity = 578 * 850000 / 1100000 = 446
>>>
>>> and after:
>>>
>>> cpu_capacity = 578 * 700000 / 800000 = 505
>>>
>>> This issue can also be seen on Arm64 boards with cpufreq-dt drivers
>>> using the 'turbo-mode' dt property for boosted OPPs.
>>>
>>> What's actually needed IMHO is to calculate cpu_capacity according to
>>> the boost value. I.e.:
>>>
>>> (a) The infrastructure to adjust cpu_capacity in arch_topology.c has to
>>>     be kept alive after boot.
> 
> If we adjust the cpu_capacity at runtime this will create oscillation
> in PELT values. We should stay with one single capacity all time :
> - either include boost value but when boost is disable we will never
> reach the max capacity of the cpu which could imply that the cpu will
> never be overloaded (from scheduler pov)

overutilized I'm assuming, that's the issue I was worried about here.
Strictly speaking the platform doesn't guarantee that the capacity can
be reached and sustained indefinitely. Whether the frequency is marked
as boost or not.

> - either not include boost_value but allow to go above cpu max compute
> capacity which is something we already discussed for x86 and the turbo
> freq in the past.
> 

But that currently breaks schedutil, i.e. boost frequencies will never
be used with schedutil. There's also some other locations where capacities
>1024 just break some assumptions (e.g. the kernel/sched/ext.c cpuperf
interface defines SCX_CPUPERF_ONE).


So we have either:
a) Potential wrong capacity estimation of CPUs when boost is disabled
(but capacity calculation assumed enabled).
b) Boost frequencies completely unused by schedutil.
c) Oscillating PELT values due to boost enable/disable.

Isn't c) (what Dietmar proposed here) by far the smallest evil of these
three?
I've also found a) very hard to actually trigger, although it's obviously
a problem that depends on the platform.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH] cpufreq,base/arch_topology: Calculate cpu_capacity according to boost
  2025-08-04 13:18     ` Christian Loehle
@ 2025-08-04 13:41       ` Vincent Guittot
  0 siblings, 0 replies; 5+ messages in thread
From: Vincent Guittot @ 2025-08-04 13:41 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Dietmar Eggemann, Rafael J . Wysocki, Viresh Kumar, Sudeep Holla,
	linux-pm, linux-kernel, Robin Murphy, Beata Michalska,
	zhenglifeng1, Ionela Voinescu

On Mon, 4 Aug 2025 at 15:18, Christian Loehle <christian.loehle@arm.com> wrote:
>
> On 8/4/25 14:01, Vincent Guittot wrote:
> > On Mon, 14 Jul 2025 at 14:17, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> +cc Vincent Guittot <vincent.guittot@linaro.org>
> >> +cc Ionela Voinescu <ionela.voinescu@arm.com>
> >>
> >> On 26/06/2025 11:30, Dietmar Eggemann wrote:
> >>> I noticed on my Arm64 big.Little platform (Juno-r0, scmi-cpufreq) that
> >>> the cpu_scale values (/sys/devices/system/cpu/cpu*/cpu_capacity) of the
> >>> little CPU changed in v6.14 from 446 to 505. I bisected and found that
> >>> commit dd016f379ebc ("cpufreq: Introduce a more generic way to set
> >>> default per-policy boost flag") (1) introduced this change.
> >>> Juno's scmi FW marks the 2 topmost OPPs of each CPUfreq policy (policy0:
> >>> 775000 850000, policy1: 950000 1100000) as boost OPPs.
> >>>
> >>> The reason is that the 'policy->boost_enabled = true' is now done after
> >>> 'cpufreq_table_validate_and_sort() -> cpufreq_frequency_table_cpuinfo()'
> >>> in cpufreq_online() so that 'policy->cpuinfo.max_freq' is set to the
> >>> 'highest non-boost' instead of the 'highest boost' frequency.
> >>>
> >>> This is before the CPUFREQ_CREATE_POLICY notifier is fired in
> >>> cpufreq_online() to which the cpu_capacity setup code in
> >>> [drivers/base/arch_topology.c] has registered.
> >>>
> >>> Its notifier_call init_cpu_capacity_callback() uses
> >>> 'policy->cpuinfo.max_freq' to set the per-cpu
> >>> capacity_freq_ref so that the cpu_capacity can be calculated as:
> >>>
> >>> cpu_capacity = raw_cpu_capacity (2) * capacity_freq_ref /
> >>>                                     'max system-wide cpu frequency'
> >>>
> >>> (2) Juno's little CPU has 'capacity-dmips-mhz = <578>'.
> >>>
> >>> So before (1) for a little CPU:
> >>>
> >>> cpu_capacity = 578 * 850000 / 1100000 = 446
> >>>
> >>> and after:
> >>>
> >>> cpu_capacity = 578 * 700000 / 800000 = 505
> >>>
> >>> This issue can also be seen on Arm64 boards with cpufreq-dt drivers
> >>> using the 'turbo-mode' dt property for boosted OPPs.
> >>>
> >>> What's actually needed IMHO is to calculate cpu_capacity according to
> >>> the boost value. I.e.:
> >>>
> >>> (a) The infrastructure to adjust cpu_capacity in arch_topology.c has to
> >>>     be kept alive after boot.
> >
> > If we adjust the cpu_capacity at runtime this will create oscillation
> > in PELT values. We should stay with one single capacity all time :
> > - either include boost value but when boost is disable we will never
> > reach the max capacity of the cpu which could imply that the cpu will
> > never be overloaded (from scheduler pov)
>
> overutilized I'm assuming, that's the issue I was worried about here.

no I was referring to group_is_overloaded which use /Sum of CPU's capacity
That' also true for EAS and overutilized

> Strictly speaking the platform doesn't guarantee that the capacity can
> be reached and sustained indefinitely. Whether the frequency is marked
> as boost or not.

Regarding thermal mitigation and user max freq, we take that into account

>
> > - either not include boost_value but allow to go above cpu max compute
> > capacity which is something we already discussed for x86 and the turbo
> > freq in the past.
> >
>
> But that currently breaks schedutil, i.e. boost frequencies will never
> be used with schedutil. There's also some other locations where capacities

We should allow capacity to go above 1024 to reflect HW reality with
turbo and here enabling/diabling boost

> >1024 just break some assumptions (e.g. the kernel/sched/ext.c cpuperf
> interface defines SCX_CPUPERF_ONE).
>
>
> So we have either:
> a) Potential wrong capacity estimation of CPUs when boost is disabled
> (but capacity calculation assumed enabled).
> b) Boost frequencies completely unused by schedutil.
> c) Oscillating PELT values due to boost enable/disable.
>
> Isn't c) (what Dietmar proposed here) by far the smallest evil of these
> three?

No, this breaks PELT invariance

> I've also found a) very hard to actually trigger, although it's obviously
> a problem that depends on the platform.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-08-04 13:41 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-26  9:30 [RFC PATCH] cpufreq,base/arch_topology: Calculate cpu_capacity according to boost Dietmar Eggemann
2025-07-14 12:16 ` Dietmar Eggemann
2025-08-04 13:01   ` Vincent Guittot
2025-08-04 13:18     ` Christian Loehle
2025-08-04 13:41       ` Vincent Guittot

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).