* [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load()
@ 2026-03-20 11:31 Xuewen Yan
2026-03-20 11:31 ` [RFC PATCH 2/2] thermal/cpufreq_cooling: Use idle_time to get cpu_load when scx_enabled Xuewen Yan
2026-03-20 12:32 ` [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load() Lukasz Luba
0 siblings, 2 replies; 18+ messages in thread
From: Xuewen Yan @ 2026-03-20 11:31 UTC (permalink / raw)
To: daniel.lezcano, amit.kachhap, viresh.kumar, lukasz.luba, rafael
Cc: rui.zhang, linux-pm, linux-kernel, ke.wang, di.shen, jeson.gao,
xuewen.yan94
From: Di Shen <di.shen@unisoc.com>
The cpu_idx variable in the get_load function is now
unused and can be safely removed.
No code logic is affected.
Signed-off-by: Di Shen <di.shen@unisoc.com>
---
drivers/thermal/cpufreq_cooling.c | 13 +++++--------
1 file changed, 5 insertions(+), 8 deletions(-)
diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index 32bf5ab44f4a..d030dbeb2973 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -151,26 +151,23 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
* get_load() - get load for a cpu
* @cpufreq_cdev: struct cpufreq_cooling_device for the cpu
* @cpu: cpu number
- * @cpu_idx: index of the cpu in time_in_idle array
*
* Return: The average load of cpu @cpu in percentage since this
* function was last called.
*/
#ifdef CONFIG_SMP
-static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
- int cpu_idx)
+static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu)
{
unsigned long util = sched_cpu_util(cpu);
return (util * 100) / arch_scale_cpu_capacity(cpu);
}
#else /* !CONFIG_SMP */
-static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
- int cpu_idx)
+static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu)
{
u32 load;
u64 now, now_idle, delta_time, delta_idle;
- struct time_in_idle *idle_time = &cpufreq_cdev->idle_time[cpu_idx];
+ struct time_in_idle *idle_time = &cpufreq_cdev->idle_time[cpu];
now_idle = get_cpu_idle_time(cpu, &now, 0);
delta_idle = now_idle - idle_time->time;
@@ -231,7 +228,7 @@ static int cpufreq_get_requested_power(struct thermal_cooling_device *cdev,
u32 *power)
{
unsigned long freq;
- int i = 0, cpu;
+ int cpu;
u32 total_load = 0;
struct cpufreq_cooling_device *cpufreq_cdev = cdev->devdata;
struct cpufreq_policy *policy = cpufreq_cdev->policy;
@@ -242,7 +239,7 @@ static int cpufreq_get_requested_power(struct thermal_cooling_device *cdev,
u32 load;
if (cpu_online(cpu))
- load = get_load(cpufreq_cdev, cpu, i);
+ load = get_load(cpufreq_cdev, cpu);
else
load = 0;
--
2.25.1
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [RFC PATCH 2/2] thermal/cpufreq_cooling: Use idle_time to get cpu_load when scx_enabled
2026-03-20 11:31 [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load() Xuewen Yan
@ 2026-03-20 11:31 ` Xuewen Yan
2026-03-24 1:41 ` Qais Yousef
2026-03-20 12:32 ` [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load() Lukasz Luba
1 sibling, 1 reply; 18+ messages in thread
From: Xuewen Yan @ 2026-03-20 11:31 UTC (permalink / raw)
To: daniel.lezcano, amit.kachhap, viresh.kumar, lukasz.luba, rafael
Cc: rui.zhang, linux-pm, linux-kernel, ke.wang, di.shen, jeson.gao,
xuewen.yan94
From: Di Shen <di.shen@unisoc.com>
Recently, while enabling sched-ext debugging, we observed abnormal behavior
in our thermal power_allocator’s temperature control.
Through debugging, we found that the CPU util was too low, causing
the CPU frequency to remain unrestricted.
This issue stems from the fact that in the sched_cpu_util() function,
when scx is enabled, cpu_util_cfs becomes zero. As a result,
the thermal subsystem perceives an extremely low CPU utilization,
which degrades the effectiveness of the power_allocator’s control.
However, the scx_cpuperf_target() reflects the targeted performance,
not the utilisation. We couldn't use it.
Until a perfect solution is found, using idle_time to get the cpu load
might be a better approach.
Co-developed-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
Signed-off-by: Di Shen <di.shen@unisoc.com>
---
Previous discussion:
https://lore.kernel.org/all/5a5d565b-33ac-4d5c-b0dd-1353324a6117@arm.com/
---
drivers/thermal/cpufreq_cooling.c | 54 ++++++++++++++++++++-----------
1 file changed, 35 insertions(+), 19 deletions(-)
diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
index d030dbeb2973..e8fa70a95d00 100644
--- a/drivers/thermal/cpufreq_cooling.c
+++ b/drivers/thermal/cpufreq_cooling.c
@@ -24,6 +24,9 @@
#include <linux/units.h>
#include "thermal_trace.h"
+#ifdef CONFIG_SCHED_CLASS_EXT
+#include "../../kernel/sched/sched.h"
+#endif
/*
* Cooling state <-> CPUFreq frequency
@@ -72,7 +75,7 @@ struct cpufreq_cooling_device {
struct em_perf_domain *em;
struct cpufreq_policy *policy;
struct thermal_cooling_device_ops cooling_ops;
-#ifndef CONFIG_SMP
+#if !defined(CONFIG_SMP) || defined(CONFIG_SCHED_CLASS_EXT)
struct time_in_idle *idle_time;
#endif
struct freq_qos_request qos_req;
@@ -147,23 +150,9 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
return freq;
}
-/**
- * get_load() - get load for a cpu
- * @cpufreq_cdev: struct cpufreq_cooling_device for the cpu
- * @cpu: cpu number
- *
- * Return: The average load of cpu @cpu in percentage since this
- * function was last called.
- */
-#ifdef CONFIG_SMP
-static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu)
-{
- unsigned long util = sched_cpu_util(cpu);
-
- return (util * 100) / arch_scale_cpu_capacity(cpu);
-}
-#else /* !CONFIG_SMP */
-static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu)
+#if !defined(CONFIG_SMP) || defined(CONFIG_SCHED_CLASS_EXT)
+static u32 get_load_from_idle_time(struct cpufreq_cooling_device *cpufreq_cdev,
+ int cpu)
{
u32 load;
u64 now, now_idle, delta_time, delta_idle;
@@ -183,8 +172,35 @@ static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu)
return load;
}
-#endif /* CONFIG_SMP */
+#endif /* !defined(CONFIG_SMP) || defined(CONFIG_SCHED_CLASS_EXT) */
+/**
+ * get_load() - get load for a cpu
+ * @cpufreq_cdev: struct cpufreq_cooling_device for the cpu
+ * @cpu: cpu number
+ *
+ * Return: The average load of cpu @cpu in percentage since this
+ * function was last called.
+ */
+#ifndef CONFIG_SMP
+static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
+ int cpu_idx)
+{
+ return get_load_from_idle_time(cpufreq_cdev, cpu, cpu_idx);
+}
+#else /* CONFIG_SMP */
+static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu)
+{
+ unsigned long util;
+
+#ifdef CONFIG_SCHED_CLASS_EXT
+ if (scx_enabled())
+ return get_load_from_idle_time(cpufreq_cdev, cpu);
+#endif
+ util = sched_cpu_util(cpu);
+ return (util * 100) / arch_scale_cpu_capacity(cpu);
+}
+#endif /* !CONFIG_SMP */
/**
* get_dynamic_power() - calculate the dynamic power
* @cpufreq_cdev: &cpufreq_cooling_device for this cdev
--
2.25.1
* Re: [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load()
2026-03-20 11:31 [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load() Xuewen Yan
2026-03-20 11:31 ` [RFC PATCH 2/2] thermal/cpufreq_cooling: Use idle_time to get cpu_load when scx_enabled Xuewen Yan
@ 2026-03-20 12:32 ` Lukasz Luba
2026-03-21 8:48 ` Xuewen Yan
2026-03-23 5:34 ` Viresh Kumar
1 sibling, 2 replies; 18+ messages in thread
From: Lukasz Luba @ 2026-03-20 12:32 UTC (permalink / raw)
To: Xuewen Yan
Cc: rui.zhang, rafael, linux-pm, viresh.kumar, amit.kachhap,
daniel.lezcano, linux-kernel, ke.wang, di.shen, jeson.gao,
xuewen.yan94
Hi Xuewen,
On 3/20/26 11:31, Xuewen Yan wrote:
> From: Di Shen <di.shen@unisoc.com>
>
> The cpu_idx variable in the get_load function is now
> unused and can be safely removed.
>
> No code logic is affected.
>
> Signed-off-by: Di Shen <di.shen@unisoc.com>
> ---
> drivers/thermal/cpufreq_cooling.c | 13 +++++--------
> 1 file changed, 5 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
> index 32bf5ab44f4a..d030dbeb2973 100644
> --- a/drivers/thermal/cpufreq_cooling.c
> +++ b/drivers/thermal/cpufreq_cooling.c
> @@ -151,26 +151,23 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
> * get_load() - get load for a cpu
> * @cpufreq_cdev: struct cpufreq_cooling_device for the cpu
> * @cpu: cpu number
> - * @cpu_idx: index of the cpu in time_in_idle array
> *
> * Return: The average load of cpu @cpu in percentage since this
> * function was last called.
> */
> #ifdef CONFIG_SMP
> -static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
> - int cpu_idx)
> +static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu)
> {
> unsigned long util = sched_cpu_util(cpu);
>
> return (util * 100) / arch_scale_cpu_capacity(cpu);
> }
> #else /* !CONFIG_SMP */
> -static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
> - int cpu_idx)
> +static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu)
> {
> u32 load;
> u64 now, now_idle, delta_time, delta_idle;
> - struct time_in_idle *idle_time = &cpufreq_cdev->idle_time[cpu_idx];
> + struct time_in_idle *idle_time = &cpufreq_cdev->idle_time[cpu];
This is a bug. We allocate 'num_cpus' size of array based on
number of CPU in the cpumask for a given cpufreq policy.
If there are 4 cpus in the CPU cluster but CPUs have ids:
CPU4-7 then accessing it with this change would explode.
Please re-design this patch set slightly and I will have a look
on the 2nd version (and particularly the part in current patch 2/2).
Regards,
Lukasz
* Re: [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load()
2026-03-20 12:32 ` [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load() Lukasz Luba
@ 2026-03-21 8:48 ` Xuewen Yan
2026-03-23 5:34 ` Viresh Kumar
1 sibling, 0 replies; 18+ messages in thread
From: Xuewen Yan @ 2026-03-21 8:48 UTC (permalink / raw)
To: Lukasz Luba
Cc: Xuewen Yan, rui.zhang, rafael, linux-pm, viresh.kumar,
amit.kachhap, daniel.lezcano, linux-kernel, ke.wang, di.shen,
jeson.gao
On Fri, Mar 20, 2026 at 8:32 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
> Hi Xuewen,
>
> On 3/20/26 11:31, Xuewen Yan wrote:
> > From: Di Shen <di.shen@unisoc.com>
> >
> > The cpu_idx variable in the get_load function is now
> > unused and can be safely removed.
> >
> > No code logic is affected.
> >
> > Signed-off-by: Di Shen <di.shen@unisoc.com>
> > ---
> > drivers/thermal/cpufreq_cooling.c | 13 +++++--------
> > 1 file changed, 5 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
> > index 32bf5ab44f4a..d030dbeb2973 100644
> > --- a/drivers/thermal/cpufreq_cooling.c
> > +++ b/drivers/thermal/cpufreq_cooling.c
> > @@ -151,26 +151,23 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
> > * get_load() - get load for a cpu
> > * @cpufreq_cdev: struct cpufreq_cooling_device for the cpu
> > * @cpu: cpu number
> > - * @cpu_idx: index of the cpu in time_in_idle array
> > *
> > * Return: The average load of cpu @cpu in percentage since this
> > * function was last called.
> > */
> > #ifdef CONFIG_SMP
> > -static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
> > - int cpu_idx)
> > +static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu)
> > {
> > unsigned long util = sched_cpu_util(cpu);
> >
> > return (util * 100) / arch_scale_cpu_capacity(cpu);
> > }
> > #else /* !CONFIG_SMP */
> > -static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
> > - int cpu_idx)
> > +static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu)
> > {
> > u32 load;
> > u64 now, now_idle, delta_time, delta_idle;
> > - struct time_in_idle *idle_time = &cpufreq_cdev->idle_time[cpu_idx];
> > + struct time_in_idle *idle_time = &cpufreq_cdev->idle_time[cpu];
>
> This is a bug. We allocate 'num_cpus' size of array based on
> number of CPU in the cpumask for a given cpufreq policy.
> If there are 4 cpus in the CPU cluster but CPUs have ids:
> CPU4-7 then accessing it with this change would explode.
Sorry, that was our oversight. However, this patch is fine on its own,
without the other patch applied.
For v2, maybe we can add the i++ back to work around this issue.
Thanks!
>
> Please re-design this patch set slightly and I will have a look
> on the 2nd version (and particularly the part in current patch 2/2).
>
> Regards,
> Lukasz
* Re: [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load()
2026-03-20 12:32 ` [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load() Lukasz Luba
2026-03-21 8:48 ` Xuewen Yan
@ 2026-03-23 5:34 ` Viresh Kumar
2026-03-23 9:20 ` Lukasz Luba
1 sibling, 1 reply; 18+ messages in thread
From: Viresh Kumar @ 2026-03-23 5:34 UTC (permalink / raw)
To: Lukasz Luba
Cc: Xuewen Yan, rui.zhang, rafael, linux-pm, amit.kachhap,
daniel.lezcano, linux-kernel, ke.wang, di.shen, jeson.gao,
xuewen.yan94
On 20-03-26, 12:32, Lukasz Luba wrote:
> Hi Xuewen,
>
> On 3/20/26 11:31, Xuewen Yan wrote:
> > From: Di Shen <di.shen@unisoc.com>
> >
> > The cpu_idx variable in the get_load function is now
> > unused and can be safely removed.
> >
> > No code logic is affected.
> >
> > Signed-off-by: Di Shen <di.shen@unisoc.com>
> > ---
> > drivers/thermal/cpufreq_cooling.c | 13 +++++--------
> > 1 file changed, 5 insertions(+), 8 deletions(-)
> >
> > diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
> > index 32bf5ab44f4a..d030dbeb2973 100644
> > --- a/drivers/thermal/cpufreq_cooling.c
> > +++ b/drivers/thermal/cpufreq_cooling.c
> > @@ -151,26 +151,23 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
> > * get_load() - get load for a cpu
> > * @cpufreq_cdev: struct cpufreq_cooling_device for the cpu
> > * @cpu: cpu number
> > - * @cpu_idx: index of the cpu in time_in_idle array
> > *
> > * Return: The average load of cpu @cpu in percentage since this
> > * function was last called.
> > */
> > #ifdef CONFIG_SMP
> > -static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
> > - int cpu_idx)
> > +static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu)
> > {
> > unsigned long util = sched_cpu_util(cpu);
> > return (util * 100) / arch_scale_cpu_capacity(cpu);
> > }
> > #else /* !CONFIG_SMP */
> > -static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
> > - int cpu_idx)
> > +static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu)
> > {
> > u32 load;
> > u64 now, now_idle, delta_time, delta_idle;
> > - struct time_in_idle *idle_time = &cpufreq_cdev->idle_time[cpu_idx];
> > + struct time_in_idle *idle_time = &cpufreq_cdev->idle_time[cpu];
>
> This is a bug. We allocate 'num_cpus' size of array based on
> number of CPU in the cpumask for a given cpufreq policy.
> If there are 4 cpus in the CPU cluster but CPUs have ids:
> CPU4-7 then accessing it with this change would explode.
I think following commit introduced a bug by removing `i++`.
commit 3f7ced7ac9af ("drivers/thermal/cpufreq_cooling : Refactor thermal_power_cpu_get_power tracing")
--
viresh
* Re: [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load()
2026-03-23 5:34 ` Viresh Kumar
@ 2026-03-23 9:20 ` Lukasz Luba
2026-03-23 10:41 ` Viresh Kumar
0 siblings, 1 reply; 18+ messages in thread
From: Lukasz Luba @ 2026-03-23 9:20 UTC (permalink / raw)
To: Viresh Kumar
Cc: Xuewen Yan, rui.zhang, rafael, linux-pm, amit.kachhap,
daniel.lezcano, linux-kernel, ke.wang, di.shen, jeson.gao,
xuewen.yan94
Hi Viresh,
On 3/23/26 05:34, Viresh Kumar wrote:
> On 20-03-26, 12:32, Lukasz Luba wrote:
>> Hi Xuewen,
>>
>> On 3/20/26 11:31, Xuewen Yan wrote:
>>> From: Di Shen <di.shen@unisoc.com>
>>>
>>> The cpu_idx variable in the get_load function is now
>>> unused and can be safely removed.
>>>
>>> No code logic is affected.
>>>
>>> Signed-off-by: Di Shen <di.shen@unisoc.com>
>>> ---
>>> drivers/thermal/cpufreq_cooling.c | 13 +++++--------
>>> 1 file changed, 5 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
>>> index 32bf5ab44f4a..d030dbeb2973 100644
>>> --- a/drivers/thermal/cpufreq_cooling.c
>>> +++ b/drivers/thermal/cpufreq_cooling.c
>>> @@ -151,26 +151,23 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
>>> * get_load() - get load for a cpu
>>> * @cpufreq_cdev: struct cpufreq_cooling_device for the cpu
>>> * @cpu: cpu number
>>> - * @cpu_idx: index of the cpu in time_in_idle array
>>> *
>>> * Return: The average load of cpu @cpu in percentage since this
>>> * function was last called.
>>> */
>>> #ifdef CONFIG_SMP
>>> -static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
>>> - int cpu_idx)
>>> +static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu)
>>> {
>>> unsigned long util = sched_cpu_util(cpu);
>>> return (util * 100) / arch_scale_cpu_capacity(cpu);
>>> }
>>> #else /* !CONFIG_SMP */
>>> -static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
>>> - int cpu_idx)
>>> +static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu)
>>> {
>>> u32 load;
>>> u64 now, now_idle, delta_time, delta_idle;
>>> - struct time_in_idle *idle_time = &cpufreq_cdev->idle_time[cpu_idx];
>>> + struct time_in_idle *idle_time = &cpufreq_cdev->idle_time[cpu];
>>
>> This is a bug. We allocate 'num_cpus' size of array based on
>> number of CPU in the cpumask for a given cpufreq policy.
>> If there are 4 cpus in the CPU cluster but CPUs have ids:
>> CPU4-7 then accessing it with this change would explode.
>
> I think following commit introduced a bug by removing `i++`.
>
> commit 3f7ced7ac9af ("drivers/thermal/cpufreq_cooling : Refactor thermal_power_cpu_get_power tracing")
>
Thanks for monitoring the development (it's always good
to have extra engineer opinion)!
I've checked the commit that you referred to and the 'i++' there.
It's safe. That commit removed the heavy operation for only
tracing purpose, namely:
- allocate buffer for N CPUs for 'load_cpu' pointer
- populate CPUs' load from the idle fwk
- put that info into the trace
- free the 'load_cpu' buffer
That has been redesigned since it was just for tracing
and introducing extra time spent for code run in the
throttling phase.
The code in get_load() is OK with the commit that you
mentioned.
Regards,
Lukasz
* Re: [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load()
2026-03-23 9:20 ` Lukasz Luba
@ 2026-03-23 10:41 ` Viresh Kumar
2026-03-23 10:52 ` Lukasz Luba
0 siblings, 1 reply; 18+ messages in thread
From: Viresh Kumar @ 2026-03-23 10:41 UTC (permalink / raw)
To: Lukasz Luba
Cc: Xuewen Yan, rui.zhang, rafael, linux-pm, amit.kachhap,
daniel.lezcano, linux-kernel, ke.wang, di.shen, jeson.gao,
xuewen.yan94
On 23-03-26, 09:20, Lukasz Luba wrote:
> Thanks for monitoring the development (it's always good
> to have extra engineer opinion)!
>
> I've checked the commit that you referred to and the 'i++' there.
> It's safe. That commit removed the heavy operation for only
> tracing purpose, namely:
> - allocate buffer for N CPUs for 'load_cpu' pointer
> - populate CPUs' load from the idle fwk
> - put that info into the trace
> - free the 'load_cpu' buffer
>
> That has been redesigned since it was just for tracing
> and introducing extra time spent for code run in the
> throttling phase.
>
> The code in get_load() is OK with the commit that you
> mentioned.
The code
load = get_load(cpufreq_cdev, cpu, i);
depends on `i` being incremented in the loop to get the correct
`cpu_idx`. But the said commit removed it and left `i` to be set to 0
for ever.
How is that okay ? What am I missing ?
--
viresh
* Re: [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load()
2026-03-23 10:41 ` Viresh Kumar
@ 2026-03-23 10:52 ` Lukasz Luba
2026-03-23 11:06 ` Viresh Kumar
0 siblings, 1 reply; 18+ messages in thread
From: Lukasz Luba @ 2026-03-23 10:52 UTC (permalink / raw)
To: Viresh Kumar
Cc: Xuewen Yan, rui.zhang, rafael, linux-pm, amit.kachhap,
daniel.lezcano, linux-kernel, ke.wang, di.shen, jeson.gao,
xuewen.yan94
On 3/23/26 10:41, Viresh Kumar wrote:
> On 23-03-26, 09:20, Lukasz Luba wrote:
>> Thanks for monitoring the development (it's always good
>> to have extra engineer opinion)!
>>
>> I've checked the commit that you referred to and the 'i++' there.
>> It's safe. That commit removed the heavy operation for only
>> tracing purpose, namely:
>> - allocate buffer for N CPUs for 'load_cpu' pointer
>> - populate CPUs' load from the idle fwk
>> - put that info into the trace
>> - free the 'load_cpu' buffer
>>
>> That has been redesigned since it was just for tracing
>> and introducing extra time spent for code run in the
>> throttling phase.
>>
>> The code in get_load() is OK with the commit that you
>> mentioned.
>
> The code
>
> load = get_load(cpufreq_cdev, cpu, i);
>
> depends on `i` being incremented in the loop to get the correct
> `cpu_idx`. But the said commit removed it and left `i` to be set to 0
> for ever.
>
> How is that okay ? What am I missing ?
>
Right, there is a mix of two things.
The 'i' left but should be removed as well, since
this is !SMP code with only 1 cpu and i=0.
The whole split which has been made for getting
the load or utilization from CPU(s) needs to be
cleaned. The compiled code looks different since
it knows there is non-SMP config used.
Do you want to clean that or I should do this?
* Re: [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load()
2026-03-23 10:52 ` Lukasz Luba
@ 2026-03-23 11:06 ` Viresh Kumar
2026-03-23 13:25 ` Lukasz Luba
0 siblings, 1 reply; 18+ messages in thread
From: Viresh Kumar @ 2026-03-23 11:06 UTC (permalink / raw)
To: Lukasz Luba
Cc: Xuewen Yan, rui.zhang, rafael, linux-pm, amit.kachhap,
daniel.lezcano, linux-kernel, ke.wang, di.shen, jeson.gao,
xuewen.yan94
On 23-03-26, 10:52, Lukasz Luba wrote:
> > How is that okay ? What am I missing ?
I was missing !SMP :)
> Right, there is a mix of two things.
> The 'i' left but should be removed as well, since
> this is !SMP code with only 1 cpu and i=0.
>
> The whole split which has been made for getting
> the load or utilization from CPU(s) needs to be
> cleaned. The compiled code looks different since
> it knows there is non-SMP config used.
Right, we are allocating that for num_cpus (which should be 1 CPU
anyway). The entire thing must be cleaned.
> Do you want to clean that or I should do this?
It would be helpful if you can do it :)
--
viresh
* Re: [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load()
2026-03-23 11:06 ` Viresh Kumar
@ 2026-03-23 13:25 ` Lukasz Luba
2026-03-24 2:20 ` Xuewen Yan
0 siblings, 1 reply; 18+ messages in thread
From: Lukasz Luba @ 2026-03-23 13:25 UTC (permalink / raw)
To: Viresh Kumar, Xuewen Yan
Cc: rui.zhang, rafael, linux-pm, amit.kachhap, daniel.lezcano,
linux-kernel, ke.wang, di.shen, jeson.gao, xuewen.yan94
On 3/23/26 11:06, Viresh Kumar wrote:
> On 23-03-26, 10:52, Lukasz Luba wrote:
>>> How is that okay ? What am I missing ?
>
> I was missing !SMP :)
>
>> Right, there is a mix of two things.
>> The 'i' left but should be removed as well, since
>> this is !SMP code with only 1 cpu and i=0.
>>
>> The whole split which has been made for getting
>> the load or utilization from CPU(s) needs to be
>> cleaned. The compiled code looks different since
>> it knows there is non-SMP config used.
>
> Right, we are allocating that for num_cpus (which should be 1 CPU
> anyway). The entire thing must be cleaned.
>
>> Do you want to clean that or I should do this?
>
> It would be helpful if you can do it :)
>
OK, I will. Thanks for your involvement Viresh!
Xuewen please wait with your v2, I will send
a redesign of this left code today.
* Re: [RFC PATCH 2/2] thermal/cpufreq_cooling: Use idle_time to get cpu_load when scx_enabled
2026-03-20 11:31 ` [RFC PATCH 2/2] thermal/cpufreq_cooling: Use idle_time to get cpu_load when scx_enabled Xuewen Yan
@ 2026-03-24 1:41 ` Qais Yousef
0 siblings, 0 replies; 18+ messages in thread
From: Qais Yousef @ 2026-03-24 1:41 UTC (permalink / raw)
To: Xuewen Yan
Cc: daniel.lezcano, amit.kachhap, viresh.kumar, lukasz.luba, rafael,
rui.zhang, linux-pm, linux-kernel, ke.wang, di.shen, jeson.gao,
xuewen.yan94, Peter Zijlstra, Vincent Guittot
On 03/20/26 19:31, Xuewen Yan wrote:
> From: Di Shen <di.shen@unisoc.com>
>
> Recently, while enabling sched-ext debugging, we observed abnormal behavior
> in our thermal power_allocator’s temperature control.
> Through debugging, we found that the CPU util was too low, causing
> the CPU frequency to remain unrestricted.
>
> This issue stems from the fact that in the sched_cpu_util() function,
> when scx is enabled, cpu_util_cfs becomes zero. As a result,
> the thermal subsystem perceives an extremely low CPU utilization,
> which degrades the effectiveness of the power_allocator’s control.
>
> However, the scx_cpuperf_target() reflects the targeted performance,
> not the utilisation. We couldn't use it.
>
> Until a perfect solution is found, using idle_time to get the cpu load
> might be a better approach.
>
> Co-developed-by: Xuewen Yan <xuewen.yan@unisoc.com>
> Signed-off-by: Xuewen Yan <xuewen.yan@unisoc.com>
> Signed-off-by: Di Shen <di.shen@unisoc.com>
> ---
> Previous discussion:
> https://lore.kernel.org/all/5a5d565b-33ac-4d5c-b0dd-1353324a6117@arm.com/
>
> ---
> drivers/thermal/cpufreq_cooling.c | 54 ++++++++++++++++++++-----------
> 1 file changed, 35 insertions(+), 19 deletions(-)
>
> diff --git a/drivers/thermal/cpufreq_cooling.c b/drivers/thermal/cpufreq_cooling.c
> index d030dbeb2973..e8fa70a95d00 100644
> --- a/drivers/thermal/cpufreq_cooling.c
> +++ b/drivers/thermal/cpufreq_cooling.c
> @@ -24,6 +24,9 @@
> #include <linux/units.h>
>
> #include "thermal_trace.h"
> +#ifdef CONFIG_SCHED_CLASS_EXT
> +#include "../../kernel/sched/sched.h"
> +#endif
This is a terrible include
>
> /*
> * Cooling state <-> CPUFreq frequency
> @@ -72,7 +75,7 @@ struct cpufreq_cooling_device {
> struct em_perf_domain *em;
> struct cpufreq_policy *policy;
> struct thermal_cooling_device_ops cooling_ops;
> -#ifndef CONFIG_SMP
> +#if !defined(CONFIG_SMP) || defined(CONFIG_SCHED_CLASS_EXT)
> struct time_in_idle *idle_time;
> #endif
> struct freq_qos_request qos_req;
> @@ -147,23 +150,9 @@ static u32 cpu_power_to_freq(struct cpufreq_cooling_device *cpufreq_cdev,
> return freq;
> }
>
> -/**
> - * get_load() - get load for a cpu
> - * @cpufreq_cdev: struct cpufreq_cooling_device for the cpu
> - * @cpu: cpu number
> - *
> - * Return: The average load of cpu @cpu in percentage since this
> - * function was last called.
> - */
> -#ifdef CONFIG_SMP
> -static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu)
> -{
> - unsigned long util = sched_cpu_util(cpu);
> -
> - return (util * 100) / arch_scale_cpu_capacity(cpu);
> -}
> -#else /* !CONFIG_SMP */
> -static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu)
> +#if !defined(CONFIG_SMP) || defined(CONFIG_SCHED_CLASS_EXT)
> +static u32 get_load_from_idle_time(struct cpufreq_cooling_device *cpufreq_cdev,
> + int cpu)
> {
> u32 load;
> u64 now, now_idle, delta_time, delta_idle;
> @@ -183,8 +172,35 @@ static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu)
>
> return load;
> }
> -#endif /* CONFIG_SMP */
> +#endif /* !defined(CONFIG_SMP) || defined(CONFIG_SCHED_CLASS_EXT) */
More ugly ifdefs
>
> +/**
> + * get_load() - get load for a cpu
> + * @cpufreq_cdev: struct cpufreq_cooling_device for the cpu
> + * @cpu: cpu number
> + *
> + * Return: The average load of cpu @cpu in percentage since this
> + * function was last called.
> + */
> +#ifndef CONFIG_SMP
> +static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu,
> + int cpu_idx)
> +{
> + return get_load_from_idle_time(cpufreq_cdev, cpu, cpu_idx);
> +}
> +#else /* CONFIG_SMP */
> +static u32 get_load(struct cpufreq_cooling_device *cpufreq_cdev, int cpu)
> +{
> + unsigned long util;
> +
> +#ifdef CONFIG_SCHED_CLASS_EXT
> + if (scx_enabled())
> + return get_load_from_idle_time(cpufreq_cdev, cpu);
> +#endif
Instead of this scx-special hack, wouldn't it be better to implement this as
a special operation mode? But then this begs the question: do we actually
need sched_cpu_util() at all, if it can all be done based on idle time, and
should we just remove the dependency on sched_cpu_util()?
ifdef'ing based on scx is a nasty hack; this can be done better, most likely
by decoupling the dependency on util, if the idle time is truly enough. If it
is not enough, then I am not sure this will solve any problem.
> + util = sched_cpu_util(cpu);
> + return (util * 100) / arch_scale_cpu_capacity(cpu);
> +}
> +#endif /* !CONFIG_SMP */
> /**
> * get_dynamic_power() - calculate the dynamic power
> * @cpufreq_cdev: &cpufreq_cooling_device for this cdev
> --
> 2.25.1
>
* Re: [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load()
2026-03-23 13:25 ` Lukasz Luba
@ 2026-03-24 2:20 ` Xuewen Yan
2026-03-24 10:46 ` Lukasz Luba
0 siblings, 1 reply; 18+ messages in thread
From: Xuewen Yan @ 2026-03-24 2:20 UTC (permalink / raw)
To: Lukasz Luba
Cc: Viresh Kumar, Xuewen Yan, rui.zhang, rafael, linux-pm,
amit.kachhap, daniel.lezcano, linux-kernel, ke.wang, di.shen,
jeson.gao
On Mon, Mar 23, 2026 at 9:25 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
>
>
> On 3/23/26 11:06, Viresh Kumar wrote:
> > On 23-03-26, 10:52, Lukasz Luba wrote:
> >>> How is that okay ? What am I missing ?
> >
> > I was missing !SMP :)
> >
> >> Right, there is a mix of two things.
> >> The 'i' left but should be removed as well, since
> >> this is !SMP code with only 1 cpu and i=0.
That's also why we sent out patch 1/2; after all, the index is always 0 on
!SMP systems.
> >>
> >> The whole split which has been made for getting
> >> the load or utilization from CPU(s) needs to be
> >> cleaned. The compiled code looks different since
> >> it knows there is non-SMP config used.
> >
> > Right, we are allocating that for num_cpus (which should be 1 CPU
> > anyway). The entire thing must be cleaned.
> >
> >> Do you want to clean that or I should do this?
> >
> > It would be helpful if you can do it :)
> >
>
> OK, I will. Thanks for your involvement Viresh!
>
> Xuewen please wait with your v2, I will send
> a redesign of this left code today.
Okay, and Qais's point is also worth considering: do we actually need
sched_cpu_util()?
The way I see it, generally speaking, the request_power derived from
idle_time might be higher than what we get from sched_cpu_util().
Take this scenario as an example:
Consider a CPU running at the lowest frequency with 50% idle time,
versus one running at the highest frequency with the same 50% idle
time.
In this case, using idle_time yields the same load value for both.
However, sched_cpu_util() would report a lower load when the CPU
frequency is low. This results in a smaller request_power...
* Re: [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load()
2026-03-24 2:20 ` Xuewen Yan
@ 2026-03-24 10:46 ` Lukasz Luba
2026-03-24 12:03 ` Xuewen Yan
2026-03-26 9:05 ` Qais Yousef
0 siblings, 2 replies; 18+ messages in thread
From: Lukasz Luba @ 2026-03-24 10:46 UTC (permalink / raw)
To: Xuewen Yan
Cc: Viresh Kumar, Xuewen Yan, rui.zhang, rafael, linux-pm,
amit.kachhap, daniel.lezcano, linux-kernel, ke.wang, di.shen,
jeson.gao
On 3/24/26 02:20, Xuewen Yan wrote:
> On Mon, Mar 23, 2026 at 9:25 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>>
>>
>> On 3/23/26 11:06, Viresh Kumar wrote:
>>> On 23-03-26, 10:52, Lukasz Luba wrote:
>>>>> How is that okay ? What am I missing ?
>>>
>>> I was missing !SMP :)
>>>
>>>> Right, there is a mix of two things.
>>>> The 'i' left but should be removed as well, since
>>>> this is !SMP code with only 1 cpu and i=0.
>
> That's also why we sent out patch 1/2; after all, it is always 0 on
> !SMP systems.
>
>>>>
>>>> The whole split which has been made for getting
>>>> the load or utilization from CPU(s) needs to be
>>>> cleaned. The compiled code looks different since
>>>> it knows there is non-SMP config used.
>>>
>>> Right, we are allocating that for num_cpus (which should be 1 CPU
>>> anyway). The entire thing must be cleaned.
>>>
>>>> Do you want to clean that or I should do this?
>>>
>>> It would be helpful if you can do it :)
>>>
>>
>> OK, I will. Thanks for your involvement Viresh!
>>
>> Xuewen please wait with your v2, I will send
>> a redesign of this left code today.
>
> Okay, and Qais's point is also worth considering: do we actually need
> sched_cpu_util()?
> The way I see it, generally speaking, the request_power derived from
> idle_time might be higher than what we get from sched_cpu_util().
> Take this scenario as an example:
> Consider a CPU running at the lowest frequency with 50% idle time,
> versus one running at the highest frequency with the same 50% idle
> time.
> In this case, using idle_time yields the same load value for both.
> However, sched_cpu_util() would report a lower load when the CPU
> frequency is low. This results in a smaller request_power...
Right, there are two things to consider:
1. what the utilization is when the CPU still has idle time, e.g.
the 50% that you mentioned
2. what the utilization is when there is no idle time and the CPU
is fully busy (and starts throttling due to heat)
In this thermal fwk we are mostly in the 2nd case. In that case the
utilization on the CPU's runqueue goes to 1024 no matter the CPU's
frequency. We know the highest frequency that was allowed to run, and we
pick the power value from the EM for it. That's why the estimation is not
that bad (apart from the power variation across different flavors of
workloads: heavy SIMD vs. normal integer/load).
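As a rough sketch of that estimation path (the OPP/power table is entirely made up; the real kernel reads these values from the energy model's perf states):

```c
#include <stddef.h>

/* Hypothetical energy-model entry: frequency (kHz) -> active power (mW). */
struct em_entry {
	unsigned int freq_khz;
	unsigned int power_mw;
};

/* Made-up OPP table, sorted by ascending frequency. */
static const struct em_entry em_table[] = {
	{  500000,  80 },
	{ 1000000, 200 },
	{ 1500000, 450 },
	{ 2000000, 900 },
};

/* Take the highest OPP at or below the allowed frequency and scale its
 * power by the measured load (100 when the CPU is fully busy). */
static unsigned int estimate_power_mw(unsigned int allowed_freq_khz,
				      unsigned int load_pct)
{
	size_t i, idx = 0;

	for (i = 0; i < sizeof(em_table) / sizeof(em_table[0]); i++)
		if (em_table[i].freq_khz <= allowed_freq_khz)
			idx = i;

	return em_table[idx].power_mw * load_pct / 100;
}
```

With this table, a fully busy CPU capped at 1.5 GHz is estimated at 450 mW; capping it one OPP lower drops the estimate to 200 mW.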
In the 1st-case scenario we might underestimate the power, but that
is not the thermal-stress situation anyway, so the max OPP is
still allowed.
So far it has been hard to find the best power model to use and a robust
CPU load mechanism. Adding more complexity and creating over-engineered
code in the kernel to maintain might not make sense. Thermal solutions
are handled in firmware nowadays, since the kernel can't react fast
enough to rapid changes. We have to balance the complexity here.
Let's improve the situation a bit. It would be very much appreciated if
you could let us know whether those changes help your platform
(some older boards might not show any benefit with the new code).
Regards,
Lukasz
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load()
2026-03-24 10:46 ` Lukasz Luba
@ 2026-03-24 12:03 ` Xuewen Yan
2026-03-25 8:31 ` Lukasz Luba
2026-03-26 9:05 ` Qais Yousef
1 sibling, 1 reply; 18+ messages in thread
From: Xuewen Yan @ 2026-03-24 12:03 UTC (permalink / raw)
To: Lukasz Luba
Cc: Viresh Kumar, Xuewen Yan, rui.zhang, rafael, linux-pm,
amit.kachhap, daniel.lezcano, linux-kernel, ke.wang, di.shen,
jeson.gao
On Tue, Mar 24, 2026 at 6:45 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>
>
>
> On 3/24/26 02:20, Xuewen Yan wrote:
> > On Mon, Mar 23, 2026 at 9:25 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
> >>
> >>
> >>
> >> On 3/23/26 11:06, Viresh Kumar wrote:
> >>> On 23-03-26, 10:52, Lukasz Luba wrote:
> >>>>> How is that okay ? What am I missing ?
> >>>
> >>> I was missing !SMP :)
> >>>
> >>>> Right, there is a mix of two things.
> >>>> The 'i' left but should be removed as well, since
> >>>> this is !SMP code with only 1 cpu and i=0.
> >
> > That's also why we sent out patch 1/2; after all, it is always 0 on
> > !SMP systems.
> >
> >>>>
> >>>> The whole split which has been made for getting
> >>>> the load or utilization from CPU(s) needs to be
> >>>> cleaned. The compiled code looks different since
> >>>> it knows there is non-SMP config used.
> >>>
> >>> Right, we are allocating that for num_cpus (which should be 1 CPU
> >>> anyway). The entire thing must be cleaned.
> >>>
> >>>> Do you want to clean that or I should do this?
> >>>
> >>> It would be helpful if you can do it :)
> >>>
> >>
> >> OK, I will. Thanks for your involvement Viresh!
> >>
> >> Xuewen please wait with your v2, I will send
> >> a redesign of this left code today.
> >
> > Okay, and Qais's point is also worth considering: do we actually need
> > sched_cpu_util()?
> > The way I see it, generally speaking, the request_power derived from
> > idle_time might be higher than what we get from sched_cpu_util().
> > Take this scenario as an example:
> > Consider a CPU running at the lowest frequency with 50% idle time,
> > versus one running at the highest frequency with the same 50% idle
> > time.
> > In this case, using idle_time yields the same load value for both.
> > However, sched_cpu_util() would report a lower load when the CPU
> > frequency is low. This results in a smaller request_power...
>
> Right, there are 2 things to consider:
> 1. what is the utilization when the CPU still have idle time, e.g.
> this 50% that you mentioned
> 2. what is the utilization when there is no idle time and CPU
> is fully busy (and starts throttling due to heat)
>
> In this thermal fwk we are mostly in the 2nd case. In that case the
> utilization on CPU's runqueue goes to 1024 no mater the CPU's frequency.
Haha, indeed. When we debug IPA, we also keep the CPU constantly
running with basically no idle time.
In this scenario we tested both sched_cpu_util() and idle_time,
and for thermal control purposes there was essentially no difference
(likely because the load was at 100%).
Maybe we can cook up a test case where the CPU overheats despite
having some idle time? That way we could compare how the two interfaces
perform.
> We know which highest frequency was allowed to run and we pick the power
> value from EM for it. That's why the estimation is not that bad (apart
> from power variation for different flavors of workloads: heavy SIMD vs.
> normal integer/load).
>
> In 1st case scenario we might underestimate the power, but that
> is not the thermal stress situation anyway, so the max OPP is
> still allowed.
>
> So far it is hard to find the best power model to use and robust CPU
> load mechanisms. Adding more complexity and creating some
> over-engineered code in the kernel to maintain might not have sense.
> The thermal solutions are solved in the Firmware nowadays since the
> kernel won't react that fast for some rapid changes.
>
> We have to balance the complexity here.
> Let's improve the situation a bit. It would be very much appreciated if
> you could share information if those changes help your platform
> (some older boards might not show any benefit with the new code).
>
Understood. We appreciate the balance between complexity and accuracy.
We will test these changes on our platforms and let you know whether we
see any improvements in thermal stability or power estimation. Expect
an update from us in a few days.
Thanks!
> Regards,
> Lukasz
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load()
2026-03-24 12:03 ` Xuewen Yan
@ 2026-03-25 8:31 ` Lukasz Luba
0 siblings, 0 replies; 18+ messages in thread
From: Lukasz Luba @ 2026-03-25 8:31 UTC (permalink / raw)
To: Xuewen Yan
Cc: Viresh Kumar, Xuewen Yan, rui.zhang, rafael, linux-pm,
amit.kachhap, daniel.lezcano, linux-kernel, ke.wang, di.shen,
jeson.gao
On 3/24/26 12:03, Xuewen Yan wrote:
> On Tue, Mar 24, 2026 at 6:45 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>
>>
>>
>> On 3/24/26 02:20, Xuewen Yan wrote:
>>> On Mon, Mar 23, 2026 at 9:25 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>>>
>>>>
>>>>
>>>> On 3/23/26 11:06, Viresh Kumar wrote:
>>>>> On 23-03-26, 10:52, Lukasz Luba wrote:
>>>>>>> How is that okay ? What am I missing ?
>>>>>
>>>>> I was missing !SMP :)
>>>>>
>>>>>> Right, there is a mix of two things.
>>>>>> The 'i' left but should be removed as well, since
>>>>>> this is !SMP code with only 1 cpu and i=0.
>>>
>>> That's also why we sent out patch 1/2; after all, it is always 0 on
>>> !SMP systems.
>>>
>>>>>>
>>>>>> The whole split which has been made for getting
>>>>>> the load or utilization from CPU(s) needs to be
>>>>>> cleaned. The compiled code looks different since
>>>>>> it knows there is non-SMP config used.
>>>>>
>>>>> Right, we are allocating that for num_cpus (which should be 1 CPU
>>>>> anyway). The entire thing must be cleaned.
>>>>>
>>>>>> Do you want to clean that or I should do this?
>>>>>
>>>>> It would be helpful if you can do it :)
>>>>>
>>>>
>>>> OK, I will. Thanks for your involvement Viresh!
>>>>
>>>> Xuewen please wait with your v2, I will send
>>>> a redesign of this left code today.
>>>
>>> Okay, and Qais's point is also worth considering: do we actually need
>>> sched_cpu_util()?
>>> The way I see it, generally speaking, the request_power derived from
>>> idle_time might be higher than what we get from sched_cpu_util().
>>> Take this scenario as an example:
>>> Consider a CPU running at the lowest frequency with 50% idle time,
>>> versus one running at the highest frequency with the same 50% idle
>>> time.
>>> In this case, using idle_time yields the same load value for both.
>>> However, sched_cpu_util() would report a lower load when the CPU
>>> frequency is low. This results in a smaller request_power...
>>
>> Right, there are 2 things to consider:
>> 1. what is the utilization when the CPU still have idle time, e.g.
>> this 50% that you mentioned
>> 2. what is the utilization when there is no idle time and CPU
>> is fully busy (and starts throttling due to heat)
>>
>> In this thermal fwk we are mostly in the 2nd case. In that case the
>> utilization on CPU's runqueue goes to 1024 no mater the CPU's frequency.
>
> Haha, indeed. When we debug IPA, we also keep the CPU constantly
> running with basically no idle time.
> In this scenario, we tested using both sched_cpu_util() and idle_time,
> and for thermal control purposes, there was basically no difference
> (likely because the load was at 100%).
> Maybe we can cook up a test case where the CPU is overheating despite
> having some idle time? That way we can compare how the two interfaces
> perform.
I wish we could involve the GPU in those stress scenarios. Then we could
have some load (e.g. 70% on the CPUs) and heavy computation on the GPU,
which heats up the whole silicon. In such an experiment you could observe
the difference between those two methods of input power estimation.
Unfortunately, for the GPU it's hard to craft such a benchmark
(at least for me, lacking GPU programming experience at that level).
Ideally we would have something which controls the amount of GPU
computation and the heat it creates...
>
>> We know which highest frequency was allowed to run and we pick the power
>> value from EM for it. That's why the estimation is not that bad (apart
>> from power variation for different flavors of workloads: heavy SIMD vs.
>> normal integer/load).
>>
>> In 1st case scenario we might underestimate the power, but that
>> is not the thermal stress situation anyway, so the max OPP is
>> still allowed.
>>
>> So far it is hard to find the best power model to use and robust CPU
>> load mechanisms. Adding more complexity and creating some
>> over-engineered code in the kernel to maintain might not have sense.
>> The thermal solutions are solved in the Firmware nowadays since the
>> kernel won't react that fast for some rapid changes.
>>
>> We have to balance the complexity here.
>> Let's improve the situation a bit. It would be very much appreciated if
>> you could share information if those changes help your platform
>> (some older boards might not show any benefit with the new code).
>>
> Understood. We appreciate the balance between complexity and accuracy.
> We could test these changes on our platforms and let you know if we
> see any improvements in thermal stability or power estimation. Expect
> an update from us in a few days.
>
Sounds great, thanks!
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load()
2026-03-24 10:46 ` Lukasz Luba
2026-03-24 12:03 ` Xuewen Yan
@ 2026-03-26 9:05 ` Qais Yousef
2026-03-26 9:21 ` Lukasz Luba
1 sibling, 1 reply; 18+ messages in thread
From: Qais Yousef @ 2026-03-26 9:05 UTC (permalink / raw)
To: Lukasz Luba
Cc: Xuewen Yan, Viresh Kumar, Xuewen Yan, rui.zhang, rafael, linux-pm,
amit.kachhap, daniel.lezcano, linux-kernel, ke.wang, di.shen,
jeson.gao, Peter Zijlstra, Vincent Guittot
On 03/24/26 10:46, Lukasz Luba wrote:
>
> On 3/24/26 02:20, Xuewen Yan wrote:
> > On Mon, Mar 23, 2026 at 9:25 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
> > >
> > >
> > >
> > > On 3/23/26 11:06, Viresh Kumar wrote:
> > > > On 23-03-26, 10:52, Lukasz Luba wrote:
> > > > > > How is that okay ? What am I missing ?
> > > >
> > > > I was missing !SMP :)
> > > >
> > > > > Right, there is a mix of two things.
> > > > > The 'i' left but should be removed as well, since
> > > > > this is !SMP code with only 1 cpu and i=0.
> >
> > That's also why we sent out patch 1/2; after all, it is always 0 on
> > !SMP systems.
> >
> > > > >
> > > > > The whole split which has been made for getting
> > > > > the load or utilization from CPU(s) needs to be
> > > > > cleaned. The compiled code looks different since
> > > > > it knows there is non-SMP config used.
> > > >
> > > > Right, we are allocating that for num_cpus (which should be 1 CPU
> > > > anyway). The entire thing must be cleaned.
> > > >
> > > > > Do you want to clean that or I should do this?
> > > >
> > > > It would be helpful if you can do it :)
> > > >
> > >
> > > OK, I will. Thanks for your involvement Viresh!
> > >
> > > Xuewen please wait with your v2, I will send
> > > a redesign of this left code today.
> >
> > Okay, and Qais's point is also worth considering: do we actually need
> > sched_cpu_util()?
> > The way I see it, generally speaking, the request_power derived from
> > idle_time might be higher than what we get from sched_cpu_util().
> > Take this scenario as an example:
> > Consider a CPU running at the lowest frequency with 50% idle time,
> > versus one running at the highest frequency with the same 50% idle
> > time.
> > In this case, using idle_time yields the same load value for both.
> > However, sched_cpu_util() would report a lower load when the CPU
> > frequency is low. This results in a smaller request_power...
Invariance will cause the settling time to stretch out, but it should
settle to the correct value eventually. More generally, another case
against util is that it has grown into a description of compute demand
more than of the true idleness of the system.
>
> Right, there are 2 things to consider:
> 1. what is the utilization when the CPU still have idle time, e.g.
> this 50% that you mentioned
> 2. what is the utilization when there is no idle time and CPU
> is fully busy (and starts throttling due to heat)
Hmm, I think what you're trying to say here is that we need to
distinguish between two cases, 50% busy or fully busy? I think how idle
the system is is a better question to ask than what the utilization is
(given the ubiquity of the signal nowadays).
>
> In this thermal fwk we are mostly in the 2nd case. In that case the
But from the power allocator's perspective (which I think is the
context, right?), you want to know whether you can shift power?
> utilization on CPU's runqueue goes to 1024 no mater the CPU's frequency.
> We know which highest frequency was allowed to run and we pick the power
> value from EM for it. That's why the estimation is not that bad (apart
> from power variation for different flavors of workloads: heavy SIMD vs.
> normal integer/load).
>
> In 1st case scenario we might underestimate the power, but that
> is not the thermal stress situation anyway, so the max OPP is
> still allowed.
>
> So far it is hard to find the best power model to use and robust CPU
> load mechanisms. Adding more complexity and creating some
> over-engineered code in the kernel to maintain might not have sense.
> The thermal solutions are solved in the Firmware nowadays since the
> kernel won't react that fast for some rapid changes.
>
> We have to balance the complexity here.
I am not versed in all the details, so I'm not sure what complexity you
are referring to. IMHO the idle time is a more stable view of how much
breathing room the CPU has. It also deals better with the long decay of
blocked load over-estimating the utilization. AFAICS, just sample the
idle time over a window when you need to take a decision and you'd solve
several problems in one go.
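That sampling scheme is roughly what the !SMP get_load() already does with its time_in_idle snapshots; a userspace sketch of the same idea (hypothetical microsecond counters):

```c
/* Snapshot of cumulative counters, analogous to the per-CPU
 * time_in_idle state the cooling device keeps in the !SMP path. */
struct idle_snapshot {
	unsigned long long time_us;	/* cumulative wall-clock time */
	unsigned long long idle_us;	/* cumulative idle time */
};

/* At each decision point, diff the fresh cumulative counters against
 * the previous snapshot and derive the busy percentage of that window. */
static unsigned int window_load(struct idle_snapshot *prev,
				unsigned long long now_time_us,
				unsigned long long now_idle_us)
{
	unsigned long long delta_time = now_time_us - prev->time_us;
	unsigned long long delta_idle = now_idle_us - prev->idle_us;
	unsigned int load;

	if (delta_time <= delta_idle)
		load = 0;
	else
		load = (unsigned int)(100 * (delta_time - delta_idle) /
				      delta_time);

	prev->time_us = now_time_us;
	prev->idle_us = now_idle_us;
	return load;
}
```

Because only deltas are used, the window length is simply whatever elapsed between two decisions, and blocked-load decay never enters the picture.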
> Let's improve the situation a bit. It would be very much appreciated if
> you could share information if those changes help your platform
> (some older boards might not show any benefit with the new code).
>
> Regards,
> Lukasz
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load()
2026-03-26 9:05 ` Qais Yousef
@ 2026-03-26 9:21 ` Lukasz Luba
2026-03-28 8:09 ` Qais Yousef
0 siblings, 1 reply; 18+ messages in thread
From: Lukasz Luba @ 2026-03-26 9:21 UTC (permalink / raw)
To: Qais Yousef
Cc: Xuewen Yan, Viresh Kumar, Xuewen Yan, rui.zhang, rafael, linux-pm,
amit.kachhap, daniel.lezcano, linux-kernel, ke.wang, di.shen,
jeson.gao, Peter Zijlstra, Vincent Guittot
On 3/26/26 09:05, Qais Yousef wrote:
> On 03/24/26 10:46, Lukasz Luba wrote:
>>
>> On 3/24/26 02:20, Xuewen Yan wrote:
>>> On Mon, Mar 23, 2026 at 9:25 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
>>>>
>>>>
>>>>
>>>> On 3/23/26 11:06, Viresh Kumar wrote:
>>>>> On 23-03-26, 10:52, Lukasz Luba wrote:
>>>>>>> How is that okay ? What am I missing ?
>>>>>
>>>>> I was missing !SMP :)
>>>>>
>>>>>> Right, there is a mix of two things.
>>>>>> The 'i' left but should be removed as well, since
>>>>>> this is !SMP code with only 1 cpu and i=0.
>>>
>>> That's also why we sent out patch 1/2; after all, it is always 0 on
>>> !SMP systems.
>>>
>>>>>>
>>>>>> The whole split which has been made for getting
>>>>>> the load or utilization from CPU(s) needs to be
>>>>>> cleaned. The compiled code looks different since
>>>>>> it knows there is non-SMP config used.
>>>>>
>>>>> Right, we are allocating that for num_cpus (which should be 1 CPU
>>>>> anyway). The entire thing must be cleaned.
>>>>>
>>>>>> Do you want to clean that or I should do this?
>>>>>
>>>>> It would be helpful if you can do it :)
>>>>>
>>>>
>>>> OK, I will. Thanks for your involvement Viresh!
>>>>
>>>> Xuewen please wait with your v2, I will send
>>>> a redesign of this left code today.
>>>
>>> Okay, and Qais's point is also worth considering: do we actually need
>>> sched_cpu_util()?
>>> The way I see it, generally speaking, the request_power derived from
>>> idle_time might be higher than what we get from sched_cpu_util().
>>> Take this scenario as an example:
>>> Consider a CPU running at the lowest frequency with 50% idle time,
>>> versus one running at the highest frequency with the same 50% idle
>>> time.
>>> In this case, using idle_time yields the same load value for both.
>>> However, sched_cpu_util() would report a lower load when the CPU
>>> frequency is low. This results in a smaller request_power...
>
> Invariance will cause settling time to stretch longer, but it should settle to
> the correct value eventually. But generally another case against util is that
> it has grown to be a description of compute demand more than true idleness of
> the system.
>
>>
>> Right, there are 2 things to consider:
>> 1. what is the utilization when the CPU still have idle time, e.g.
>> this 50% that you mentioned
>> 2. what is the utilization when there is no idle time and CPU
>> is fully busy (and starts throttling due to heat)
>
> Hmm I think what you're trying to say here we need to distinguish between two
> cases 50% or fully busy? I think how idle the system is a better question to
> ask rather than what is the utilization (given the ubiquity of the signal
> nowadays)
Yes, those two cases, which are different, and the util signal is not
the best fit for the idleness one.
>
>>
>> In this thermal fwk we are mostly in the 2nd case. In that case the
>
> But from power allocator perspective (which I think is the context, right?),
> you want to know if you can shift power?
I would like to know the avg power in the last X ms window, then
allocate, shift, set.
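For reference, the "allocate" step with those per-window requests looks roughly like IPA's divvy_up_power(); a simplified sketch with made-up milliwatt values:

```c
#include <stddef.h>

/* Split a total power budget among cooling devices in proportion to
 * their requested power -- roughly what IPA's divvy_up_power() does. */
static void divvy_up_power(const unsigned int *req_mw,
			   unsigned int *granted_mw,
			   size_t n, unsigned int budget_mw)
{
	unsigned long long total = 0;
	size_t i;

	for (i = 0; i < n; i++)
		total += req_mw[i];

	for (i = 0; i < n; i++)
		granted_mw[i] = total ?
			(unsigned int)((unsigned long long)req_mw[i] *
				       budget_mw / total) : 0;
}
```

E.g. if one cluster asks for 600 mW and two others for 200 mW each while only 500 mW fits in the thermal budget, they are granted 300/100/100 mW; a too-high request therefore directly skews the split.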
>
>> utilization on CPU's runqueue goes to 1024 no mater the CPU's frequency.
>> We know which highest frequency was allowed to run and we pick the power
>> value from EM for it. That's why the estimation is not that bad (apart
>> from power variation for different flavors of workloads: heavy SIMD vs.
>> normal integer/load).
>>
>> In 1st case scenario we might underestimate the power, but that
>> is not the thermal stress situation anyway, so the max OPP is
>> still allowed.
>>
>> So far it is hard to find the best power model to use and robust CPU
>> load mechanisms. Adding more complexity and creating some
>> over-engineered code in the kernel to maintain might not have sense.
>> The thermal solutions are solved in the Firmware nowadays since the
>> kernel won't react that fast for some rapid changes.
>>
>> We have to balance the complexity here.
>
> I am not verse in all the details, so not sure what complexity you are
> referring to. IMHO the idle time is a more stable view for how much a breathing
> room the cpu has. It also deals better with long decay of blocked load
> over-estimating the utilization. AFAICS just sample the idle over a window when
> you need to take a decision and you'd solve several problems in one go.
We have issues estimating power in that X ms window due to fast
frequency changes. You know how often we can change the frequency,
almost at every task enqueue (and e.g. uclamp pushes that even harder).
The simple approach of assuming that the frequency we see now on the CPU
has been there for the whole X ms period is 'not the best'.
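A toy calculation shows the size of that error; the per-OPP residencies and powers below are invented for illustration:

```c
#include <stddef.h>

/* Per-OPP residency in one observation window (made-up numbers). */
struct residency {
	unsigned int power_mw;	/* EM power at this OPP */
	unsigned int time_ms;	/* time spent at this OPP in the window */
};

/* Residency-weighted average power over the window. */
static unsigned int avg_power_mw(const struct residency *r, size_t n)
{
	unsigned long long energy = 0, time = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		energy += (unsigned long long)r[i].power_mw * r[i].time_ms;
		time += r[i].time_ms;
	}
	return time ? (unsigned int)(energy / time) : 0;
}
```

A window that spent 90 ms at a 200 mW OPP and 10 ms at a 900 mW OPP averages 270 mW, while sampling only the frequency seen at the end of the window would report 900 mW for the whole period.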
The util information without the uclamp information is not helping
much (even if we tried to derive the frequency from it).
Now it's even more complex - the FW can change the frequency far more
often than the kernel. So the question is how far we have to push
the whole kernel and these frameworks to deal with those new
platforms.
Then add the power variation due to different computation types,
e.g. a SIMD-heavy task vs. a simple logging task (high vs. low power
usage at the same OPP).
IMHO we have to find a balance, since even more complex models
in the kernel won't be able to handle that.
I have been experimenting with the Active Stats patch set for
quite a while, but then FW came into the equation and complicated
the situation. It will still be better for platforms where the FW
doesn't change the frequency, so this approach based on idle stats
is worth adding IMO.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load()
2026-03-26 9:21 ` Lukasz Luba
@ 2026-03-28 8:09 ` Qais Yousef
0 siblings, 0 replies; 18+ messages in thread
From: Qais Yousef @ 2026-03-28 8:09 UTC (permalink / raw)
To: Lukasz Luba
Cc: Xuewen Yan, Viresh Kumar, Xuewen Yan, rui.zhang, rafael, linux-pm,
amit.kachhap, daniel.lezcano, linux-kernel, ke.wang, di.shen,
jeson.gao, Peter Zijlstra, Vincent Guittot
On 03/26/26 09:21, Lukasz Luba wrote:
>
>
> On 3/26/26 09:05, Qais Yousef wrote:
> > On 03/24/26 10:46, Lukasz Luba wrote:
> > >
> > > On 3/24/26 02:20, Xuewen Yan wrote:
> > > > On Mon, Mar 23, 2026 at 9:25 PM Lukasz Luba <lukasz.luba@arm.com> wrote:
> > > > >
> > > > >
> > > > >
> > > > > On 3/23/26 11:06, Viresh Kumar wrote:
> > > > > > On 23-03-26, 10:52, Lukasz Luba wrote:
> > > > > > > > How is that okay ? What am I missing ?
> > > > > >
> > > > > > I was missing !SMP :)
> > > > > >
> > > > > > > Right, there is a mix of two things.
> > > > > > > The 'i' left but should be removed as well, since
> > > > > > > this is !SMP code with only 1 cpu and i=0.
> > > >
> > > > That's also why we sent out patch 1/2; after all, it is always 0 on
> > > > !SMP systems.
> > > >
> > > > > > >
> > > > > > > The whole split which has been made for getting
> > > > > > > the load or utilization from CPU(s) needs to be
> > > > > > > cleaned. The compiled code looks different since
> > > > > > > it knows there is non-SMP config used.
> > > > > >
> > > > > > Right, we are allocating that for num_cpus (which should be 1 CPU
> > > > > > anyway). The entire thing must be cleaned.
> > > > > >
> > > > > > > Do you want to clean that or I should do this?
> > > > > >
> > > > > > It would be helpful if you can do it :)
> > > > > >
> > > > >
> > > > > OK, I will. Thanks for your involvement Viresh!
> > > > >
> > > > > Xuewen please wait with your v2, I will send
> > > > > a redesign of this left code today.
> > > >
> > > > Okay, and Qais's point is also worth considering: do we actually need
> > > > sched_cpu_util()?
> > > > The way I see it, generally speaking, the request_power derived from
> > > > idle_time might be higher than what we get from sched_cpu_util().
> > > > Take this scenario as an example:
> > > > Consider a CPU running at the lowest frequency with 50% idle time,
> > > > versus one running at the highest frequency with the same 50% idle
> > > > time.
> > > > In this case, using idle_time yields the same load value for both.
> > > > However, sched_cpu_util() would report a lower load when the CPU
> > > > frequency is low. This results in a smaller request_power...
> >
> > Invariance will cause settling time to stretch longer, but it should settle to
> > the correct value eventually. But generally another case against util is that
> > it has grown to be a description of compute demand more than true idleness of
> > the system.
> >
> > >
> > > Right, there are 2 things to consider:
> > > 1. what is the utilization when the CPU still have idle time, e.g.
> > > this 50% that you mentioned
> > > 2. what is the utilization when there is no idle time and CPU
> > > is fully busy (and starts throttling due to heat)
> >
> > Hmm I think what you're trying to say here we need to distinguish between two
> > cases 50% or fully busy? I think how idle the system is a better question to
> > ask rather than what is the utilization (given the ubiquity of the signal
> > nowadays)
>
> Yes, these two cases, which are different and util signal is not the
> best for that idleness one.
>
>
> >
> > >
> > > In this thermal fwk we are mostly in the 2nd case. In that case the
> >
> > But from power allocator perspective (which I think is the context, right?),
> > you want to know if you can shift power?
>
> I would like to know the avg power in the last X ms window, then
> allocate, shift, set.
>
> >
> > > utilization on CPU's runqueue goes to 1024 no mater the CPU's frequency.
> > > We know which highest frequency was allowed to run and we pick the power
> > > value from EM for it. That's why the estimation is not that bad (apart
> > > from power variation for different flavors of workloads: heavy SIMD vs.
> > > normal integer/load).
> > >
> > > In 1st case scenario we might underestimate the power, but that
> > > is not the thermal stress situation anyway, so the max OPP is
> > > still allowed.
> > >
> > > So far it is hard to find the best power model to use and robust CPU
> > > load mechanisms. Adding more complexity and creating some
> > > over-engineered code in the kernel to maintain might not have sense.
> > > The thermal solutions are solved in the Firmware nowadays since the
> > > kernel won't react that fast for some rapid changes.
> > >
> > > We have to balance the complexity here.
> >
> > I am not verse in all the details, so not sure what complexity you are
> > referring to. IMHO the idle time is a more stable view for how much a breathing
> > room the cpu has. It also deals better with long decay of blocked load
> > over-estimating the utilization. AFAICS just sample the idle over a window when
> > you need to take a decision and you'd solve several problems in one go.
>
> We have issues in estimating power in that X ms window due to fast
> frequency changes. You know how often we can change the frequency,
> almost per-task enqueue (and e.g. uclamp pushes that even harder).
>
> Simple approach for assuming that the frequency we see now on CPU
> has been there for the whole Xms period is 'not the best'.
> The util information w/o uclamp information is not helping
> much (even we we would try to derive the freq out of it).
>
> Now even more complex - the FW can change the freq way often
> than the kernel. So the question is how far we have to push
> the whole kernel and those frameworks to deal with those new
> platforms.
>
> Then add the power variation due to different computation types
> e.g. SIMD heavy vs simple logging task (high power vs. low power
> usage at the same OPP).
>
> IMHO we have to find a balance since even more complex models
> in kernel won't be able to handle that.
>
> I have been experimenting with the Active Stats patch set for
> quite a while, but then FW came into the equation and complicated
> the situation. It will be still better for such platform
> where FW doesn't change the freq, so this approach based on
> idle stats is worth to add IMO.
Hmm, isn't this orthogonal? It seems CPU util is used today to estimate
how busy (or idle) the CPU is. You can decouple the dependency on util
(and the scheduler in general) and monitor system idleness.
Anyway. Please don't add this ifdeffery and the strange deps on scx. This
is a recipe for shooting ourselves in the foot.
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread [~2026-03-28 8:09 UTC | newest]
Thread overview: 18+ messages
2026-03-20 11:31 [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load() Xuewen Yan
2026-03-20 11:31 ` [RFC PATCH 2/2] thermal/cpufreq_cooling: Use idle_time to get cpu_load when scx_enabled Xuewen Yan
2026-03-24 1:41 ` Qais Yousef
2026-03-20 12:32 ` [RFC PATCH 1/2] thermal/cpufreq_cooling: remove unused cpu_idx in get_load() Lukasz Luba
2026-03-21 8:48 ` Xuewen Yan
2026-03-23 5:34 ` Viresh Kumar
2026-03-23 9:20 ` Lukasz Luba
2026-03-23 10:41 ` Viresh Kumar
2026-03-23 10:52 ` Lukasz Luba
2026-03-23 11:06 ` Viresh Kumar
2026-03-23 13:25 ` Lukasz Luba
2026-03-24 2:20 ` Xuewen Yan
2026-03-24 10:46 ` Lukasz Luba
2026-03-24 12:03 ` Xuewen Yan
2026-03-25 8:31 ` Lukasz Luba
2026-03-26 9:05 ` Qais Yousef
2026-03-26 9:21 ` Lukasz Luba
2026-03-28 8:09 ` Qais Yousef