I think there's an issue with e3f1164fc9e ("PM: EM: Support late CPUs booting and capacity adjustment") if there's "holes" in your CPU topology

Linux Power Management development
 help / color / mirror / Atom feed

* I think there's an issue with e3f1164fc9e ("PM: EM: Support late CPUs booting and capacity adjustment") if there's "holes" in your CPU topology
@ 2025-08-22  5:53 Kenneth Crudup
  2025-08-28 14:56 ` Christian Loehle
  0 siblings, 1 reply; 14+ messages in thread
From: Kenneth Crudup @ 2025-08-22  5:53 UTC (permalink / raw)
  To: lukasz.luba, linux-pm@vger.kernel.org, Rafael J. Wysocki

So I'm testing the "Energy Model" CPU scheduler, which means I need to 
set "nosmt"; this means I've got "holes" in my CPUs:

----
$  ls /dev/cpu | sort -n
0
2
4
6
8
10
12
13
14
15
16
17
18
19
----

So in .../kernel/power/energy_model.c , when this line gets to a CPU 
"hole", it throws an error and breaks out of the loop:

----
	   policy = cpufreq_cpu_get(cpu);
            if (!policy) {
                        pr_debug("Accessing cpu%d policy failed\n", cpu);
                        schedule_delayed_work(&em_update_work,
                                               msecs_to_jiffies(1000));
                        break;
            }
----

... but I'm thinking (and its working here for me) that a simple 
"continue" for those three lines handles the "CPU hole" case:

----
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index ea7995a25780..df6cae6f8845 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -795,12 +795,9 @@ static void em_check_capacity_update(void)
                         continue;

                 policy = cpufreq_cpu_get(cpu);
-               if (!policy) {
-                       pr_debug("Accessing cpu%d policy failed\n", cpu);
-                       schedule_delayed_work(&em_update_work,
-                                             msecs_to_jiffies(1000));
-                       break;
-               }
+               if (!policy)
+                       continue;
+
                 cpufreq_cpu_put(policy);

                 dev = get_cpu_device(cpu);
----

Thoughts?

-Kenny

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: I think there's an issue with e3f1164fc9e ("PM: EM: Support late CPUs booting and capacity adjustment") if there's "holes" in your CPU topology
  2025-08-22  5:53 I think there's an issue with e3f1164fc9e ("PM: EM: Support late CPUs booting and capacity adjustment") if there's "holes" in your CPU topology Kenneth Crudup
@ 2025-08-28 14:56 ` Christian Loehle
  2025-08-28 17:42   ` Kenneth Crudup
  0 siblings, 1 reply; 14+ messages in thread
From: Christian Loehle @ 2025-08-28 14:56 UTC (permalink / raw)
  To: Kenneth Crudup, lukasz.luba, linux-pm@vger.kernel.org,
	Rafael J. Wysocki

On 8/22/25 06:53, Kenneth Crudup wrote:
> 
> So I'm testing the "Energy Model" CPU scheduler, which means I need to set "nosmt"; this means I've got "holes" in my CPUs:
> 
> ----
> $  ls /dev/cpu | sort -n
> 0
> 2
> 4
> 6
> 8
> 10
> 12
> 13
> 14
> 15
> 16
> 17
> 18
> 19
> ----
> 
> So in .../kernel/power/energy_model.c , when this line gets to a CPU "hole", it throws an error and breaks out of the loop:
> 
> ----
>        policy = cpufreq_cpu_get(cpu);
>            if (!policy) {
>                        pr_debug("Accessing cpu%d policy failed\n", cpu);
>                        schedule_delayed_work(&em_update_work,
>                                               msecs_to_jiffies(1000));
>                        break;
>            }
> ----

Hi Kenneth,
So there's probably a problem here, but presumably your proposal breaks the late boot
(i.e. what this code was initially supposed to support).

> 
> ... but I'm thinking (and its working here for me) that a simple "continue" for those three lines handles the "CPU hole" case:
> 
> ----
> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
> index ea7995a25780..df6cae6f8845 100644
> --- a/kernel/power/energy_model.c
> +++ b/kernel/power/energy_model.c
> @@ -795,12 +795,9 @@ static void em_check_capacity_update(void)
>                         continue;
> 
>                 policy = cpufreq_cpu_get(cpu);
> -               if (!policy) {
> -                       pr_debug("Accessing cpu%d policy failed\n", cpu);
> -                       schedule_delayed_work(&em_update_work,
> -                                             msecs_to_jiffies(1000));
> -                       break;
> -               }
> +               if (!policy)
> +                       continue;
> +
>                 cpufreq_cpu_put(policy);
> 
>                 dev = get_cpu_device(cpu);
> ----
> 
> Thoughts?
> 
> -Kenny
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: I think there's an issue with e3f1164fc9e ("PM: EM: Support late CPUs booting and capacity adjustment") if there's "holes" in your CPU topology
  2025-08-28 14:56 ` Christian Loehle
@ 2025-08-28 17:42   ` Kenneth Crudup
  2025-08-29 10:31     ` Christian Loehle
  0 siblings, 1 reply; 14+ messages in thread
From: Kenneth Crudup @ 2025-08-28 17:42 UTC (permalink / raw)
  To: Christian Loehle, lukasz.luba, linux-pm@vger.kernel.org,
	Rafael J. Wysocki
  Cc: Kenneth C


 > So there's probably a problem here, but presumably your proposal 
breaks the late boot
 > (i.e. what this code was initially supposed to support).

Please explain?

It seems to me (and I really don't know, just guessing here) that if 
there were remaining errors from not getting any further CPU policies 
that the loop would just run thru all CPUs "harmlessly".

Now perhaps to ensure the schedule_delayed_work() gets run, what about a 
flag that gets set if any CPU policy was accessed, and the 
schedule_delayed_work() gets run if it's never set?

-Kenny

On 8/28/25 07:56, Christian Loehle wrote:
> On 8/22/25 06:53, Kenneth Crudup wrote:
>>
>> So I'm testing the "Energy Model" CPU scheduler, which means I need to set "nosmt"; this means I've got "holes" in my CPUs:
>>
>> ----
>> $  ls /dev/cpu | sort -n
>> 0
>> 2
>> 4
>> 6
>> 8
>> 10
>> 12
>> 13
>> 14
>> 15
>> 16
>> 17
>> 18
>> 19
>> ----
>>
>> So in .../kernel/power/energy_model.c , when this line gets to a CPU "hole", it throws an error and breaks out of the loop:
>>
>> ----
>>         policy = cpufreq_cpu_get(cpu);
>>             if (!policy) {
>>                         pr_debug("Accessing cpu%d policy failed\n", cpu);
>>                         schedule_delayed_work(&em_update_work,
>>                                                msecs_to_jiffies(1000));
>>                         break;
>>             }
>> ----
> 
> Hi Kenneth,
> So there's probably a problem here, but presumably your proposal breaks the late boot
> (i.e. what this code was initially supposed to support).
> 
>>
>> ... but I'm thinking (and its working here for me) that a simple "continue" for those three lines handles the "CPU hole" case:
>>
>> ----
>> diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
>> index ea7995a25780..df6cae6f8845 100644
>> --- a/kernel/power/energy_model.c
>> +++ b/kernel/power/energy_model.c
>> @@ -795,12 +795,9 @@ static void em_check_capacity_update(void)
>>                          continue;
>>
>>                  policy = cpufreq_cpu_get(cpu);
>> -               if (!policy) {
>> -                       pr_debug("Accessing cpu%d policy failed\n", cpu);
>> -                       schedule_delayed_work(&em_update_work,
>> -                                             msecs_to_jiffies(1000));
>> -                       break;
>> -               }
>> +               if (!policy)
>> +                       continue;
>> +
>>                  cpufreq_cpu_put(policy);
>>
>>                  dev = get_cpu_device(cpu);
>> ----
>>
>> Thoughts?
>>
>> -Kenny
>>
> 

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: I think there's an issue with e3f1164fc9e ("PM: EM: Support late CPUs booting and capacity adjustment") if there's "holes" in your CPU topology
  2025-08-28 17:42   ` Kenneth Crudup
@ 2025-08-29 10:31     ` Christian Loehle
  2025-08-29 16:45       ` Rafael J. Wysocki
  2025-09-03 17:19       ` Kenneth Crudup
  0 siblings, 2 replies; 14+ messages in thread
From: Christian Loehle @ 2025-08-29 10:31 UTC (permalink / raw)
  To: Kenneth Crudup, lukasz.luba, linux-pm@vger.kernel.org,
	Rafael J. Wysocki, Dietmar Eggemann

[-- Attachment #1: Type: text/plain, Size: 1245 bytes --]

On 8/28/25 18:42, Kenneth Crudup wrote:
> 
>> So there's probably a problem here, but presumably your proposal breaks the late boot
>> (i.e. what this code was initially supposed to support).
> 
> Please explain?
> 
> It seems to me (and I really don't know, just guessing here) that if there were remaining errors from not getting any further CPU policies that the loop would just run thru all CPUs "harmlessly".

It would miss not-online-yet CPUs and never run again (potentially).

> 
> Now perhaps to ensure the schedule_delayed_work() gets run, what about a flag that gets set if any CPU policy was accessed, and the schedule_delayed_work() gets run if it's never set?

Yeah that's the obvious immediate fix here, I've attached a patch for it. (please test!)
I wonder though, given that this works fine on my x86 nosmt hybrid (presumably something like
yours?) is this actually an issue for you?
With intel_pstate=passive (and neutering a bunch of userspace stuff that insists on switching
schedutil to something inferior) I get a working
# cat /proc/sys/kernel/sched_energy_aware 
1

with all online CPUs having an EM (see /sys/kernel/debug/energy_model/ )

> [snip]

+CC
Dietmar who has also played with this and reviewed the x86 EM part.

[-- Attachment #2: 0001-PM-EM-Fix-late-boot-with-holes-in-CPU-topology.patch --]
[-- Type: text/x-patch, Size: 2290 bytes --]

From e5087c858fe7bf31680ab624a3b3fb1a0f013ef9 Mon Sep 17 00:00:00 2001
From: Christian Loehle <christian.loehle@arm.com>
Date: Fri, 29 Aug 2025 10:58:43 +0100
Subject: [PATCH] PM: EM: Fix late boot with holes in CPU topology

commit e3f1164fc9ee ("PM: EM: Support late CPUs booting and capacity
adjustment") added a mechanism to handle CPUs that come up late by
retrying when any of the `cpufreq_cpu_get()` call fails.

However, if there are holes in the CPU topology (offline CPUs, e.g.
nosmt), the first missing CPU causes the loop to break, preventing
subsequent online CPUs from being updated.
Instead of aborting on the first missing CPU policy, loop through all
and retry if any were missing.

Fixes: e3f1164fc9ee ("PM: EM: Support late CPUs booting and capacity adjustment")
Suggested-by: Kenneth Crudup <kenneth.crudup@gmail.com>
Reported-by: Kenneth Crudup <kenneth.crudup@gmail.com>
Closes: https://lore.kernel.org/linux-pm/40212796-734c-4140-8a85-854f72b8144d@panix.com/
Cc: stable@vger.kernel.org
Signed-off-by: Christian Loehle <christian.loehle@arm.com>
---
 kernel/power/energy_model.c | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index ea7995a25780..2454ffd455d0 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -778,7 +778,7 @@ void em_adjust_cpu_capacity(unsigned int cpu)
 static void em_check_capacity_update(void)
 {
 	cpumask_var_t cpu_done_mask;
-	int cpu;
+	int cpu, failed_cpu = -1;
 
 	if (!zalloc_cpumask_var(&cpu_done_mask, GFP_KERNEL)) {
 		pr_warn("no free memory\n");
@@ -796,10 +796,8 @@ static void em_check_capacity_update(void)
 
 		policy = cpufreq_cpu_get(cpu);
 		if (!policy) {
-			pr_debug("Accessing cpu%d policy failed\n", cpu);
-			schedule_delayed_work(&em_update_work,
-					      msecs_to_jiffies(1000));
-			break;
+			failed_cpu = cpu;
+			continue;
 		}
 		cpufreq_cpu_put(policy);
 
@@ -814,6 +812,12 @@ static void em_check_capacity_update(void)
 		em_adjust_new_capacity(cpu, dev, pd);
 	}
 
+	if (failed_cpu >= 0) {
+		pr_debug("Accessing cpu%d policy failed\n", failed_cpu);
+		schedule_delayed_work(&em_update_work,
+				      msecs_to_jiffies(1000));
+	}
+
 	free_cpumask_var(cpu_done_mask);
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: I think there's an issue with e3f1164fc9e ("PM: EM: Support late CPUs booting and capacity adjustment") if there's "holes" in your CPU topology
  2025-08-29 10:31     ` Christian Loehle
@ 2025-08-29 16:45       ` Rafael J. Wysocki
  2025-09-03 17:19       ` Kenneth Crudup
  1 sibling, 0 replies; 14+ messages in thread
From: Rafael J. Wysocki @ 2025-08-29 16:45 UTC (permalink / raw)
  To: Christian Loehle
  Cc: Kenneth Crudup, lukasz.luba, linux-pm@vger.kernel.org,
	Rafael J. Wysocki, Dietmar Eggemann

On Fri, Aug 29, 2025 at 12:31 PM Christian Loehle
<christian.loehle@arm.com> wrote:
>
> On 8/28/25 18:42, Kenneth Crudup wrote:
> >
> >> So there's probably a problem here, but presumably your proposal breaks the late boot
> >> (i.e. what this code was initially supposed to support).
> >
> > Please explain?
> >
> > It seems to me (and I really don't know, just guessing here) that if there were remaining errors from not getting any further CPU policies that the loop would just run thru all CPUs "harmlessly".
>
> It would miss not-online-yet CPUs and never run again (potentially).
>
> >
> > Now perhaps to ensure the schedule_delayed_work() gets run, what about a flag that gets set if any CPU policy was accessed, and the schedule_delayed_work() gets run if it's never set?
>
> Yeah that's the obvious immediate fix here, I've attached a patch for it. (please test!)
> I wonder though, given that this works fine on my x86 nosmt hybrid (presumably something like
> yours?) is this actually an issue for you?
> With intel_pstate=passive (and neutering a bunch of userspace stuff that insists on switching
> schedutil to something inferior) I get a working
> # cat /proc/sys/kernel/sched_energy_aware
> 1
>
> with all online CPUs having an EM (see /sys/kernel/debug/energy_model/ )
>
> > [snip]
>
> +CC
> Dietmar who has also played with this and reviewed the x86 EM part.

The message can be printed where it is printed now and failed_cpu can
be replaced with a counter, as far as I'm concerned.  Apart from this
it looks reasonable to me.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: I think there's an issue with e3f1164fc9e ("PM: EM: Support late CPUs booting and capacity adjustment") if there's "holes" in your CPU topology
  2025-08-29 10:31     ` Christian Loehle
  2025-08-29 16:45       ` Rafael J. Wysocki
@ 2025-09-03 17:19       ` Kenneth Crudup
  2025-09-03 17:26         ` Christian Loehle
  1 sibling, 1 reply; 14+ messages in thread
From: Kenneth Crudup @ 2025-09-03 17:19 UTC (permalink / raw)
  To: Christian Loehle, lukasz.luba, linux-pm@vger.kernel.org,
	Rafael J. Wysocki, Dietmar Eggemann
  Cc: Kenneth C


(Sorry for the delay, it's been a US holiday weekend here)

 > I wonder though, given that this works fine on my x86 nosmt hybrid 
(presumably something like yours?) is this actually an issue for you?

TBH, the real reason I'd looked into it was 'cause I was getting 
"energy_model: Accessing cpu1 policy failed" every few seconds spammed 
into my dmesg and I thought it may have been an error.

That being said, do you expect your patch to be upstreamed? Raphael 
seems to think it's OK.

I still have /proc/sys/kernel/sched_energy_aware == 1, and then there's:

sudo ls -las /sys/kernel/debug/energy_model/
total 0
0 drwxr-xr-x 16 root root 0 Sep  2 14:20 .
0 drwx------ 53 root root 0 Sep  2 14:20 ..
0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu0
0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu10
0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu12
0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu13
0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu14
0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu15
0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu16
0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu17
0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu18
0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu19
0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu2
0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu4
0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu6
0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu8

 > (and neutering a bunch of userspace stuff that insists on switching 
schedutil to something inferior)

Huh. What were these processes, so I can be on the lookout? I just 
switched to using the energy model so still weighing it all out to see 
if I'm sticking to it (I'm not sure if I'm seeing the battery-life 
benefits).

-Kenny

On 8/29/25 03:31, Christian Loehle wrote:
> On 8/28/25 18:42, Kenneth Crudup wrote:
>>
>>> So there's probably a problem here, but presumably your proposal breaks the late boot
>>> (i.e. what this code was initially supposed to support).
>>
>> Please explain?
>>
>> It seems to me (and I really don't know, just guessing here) that if there were remaining errors from not getting any further CPU policies that the loop would just run thru all CPUs "harmlessly".
> 
> It would miss not-online-yet CPUs and never run again (potentially).
> 
>>
>> Now perhaps to ensure the schedule_delayed_work() gets run, what about a flag that gets set if any CPU policy was accessed, and the schedule_delayed_work() gets run if it's never set?
> 
> Yeah that's the obvious immediate fix here, I've attached a patch for it. (please test!)
> I wonder though, given that this works fine on my x86 nosmt hybrid (presumably something like
> yours?) is this actually an issue for you?
> With intel_pstate=passive (and neutering a bunch of userspace stuff that insists on switching
> schedutil to something inferior) I get a working
> # cat /proc/sys/kernel/sched_energy_aware
> 1
> 
> with all online CPUs having an EM (see /sys/kernel/debug/energy_model/ )
> 
>> [snip]
> 
> +CC
> Dietmar who has also played with this and reviewed the x86 EM part.

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: I think there's an issue with e3f1164fc9e ("PM: EM: Support late CPUs booting and capacity adjustment") if there's "holes" in your CPU topology
  2025-09-03 17:19       ` Kenneth Crudup
@ 2025-09-03 17:26         ` Christian Loehle
  2025-09-03 18:39           ` Kenneth Crudup
  0 siblings, 1 reply; 14+ messages in thread
From: Christian Loehle @ 2025-09-03 17:26 UTC (permalink / raw)
  To: Kenneth Crudup, lukasz.luba, linux-pm@vger.kernel.org,
	Rafael J. Wysocki, Dietmar Eggemann

On 9/3/25 18:19, Kenneth Crudup wrote:
> 
> (Sorry for the delay, it's been a US holiday weekend here)
> 
>> I wonder though, given that this works fine on my x86 nosmt hybrid (presumably something like yours?) is this actually an issue for you?
> 
> TBH, the real reason I'd looked into it was 'cause I was getting "energy_model: Accessing cpu1 policy failed" every few seconds spammed into my dmesg and I thought it may have been an error.
> 
> That being said, do you expect your patch to be upstreamed? Raphael seems to think it's OK.

Yeah looks like we're dropping that debug message.
See
https://lore.kernel.org/lkml/20250831214357.2020076-1-christian.loehle@arm.com/

> 
> I still have /proc/sys/kernel/sched_energy_aware == 1, and then there's:

That means all is well and EAS is enabled!

> 
> sudo ls -las /sys/kernel/debug/energy_model/
> total 0
> 0 drwxr-xr-x 16 root root 0 Sep  2 14:20 .
> 0 drwx------ 53 root root 0 Sep  2 14:20 ..
> 0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu0
> 0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu10
> 0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu12
> 0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu13
> 0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu14
> 0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu15
> 0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu16
> 0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu17
> 0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu18
> 0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu19
> 0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu2
> 0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu4
> 0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu6
> 0 drwxr-xr-x  6 root root 0 Sep  2 14:20 cpu8
> 
>> (and neutering a bunch of userspace stuff that insists on switching schedutil to something inferior)
> 
> Huh. What were these processes, so I can be on the lookout? I just switched to using the energy model so still weighing it all out to see if I'm sticking to it (I'm not sure if I'm seeing the battery-life benefits).

Pretty much anything listed here:
https://wiki.archlinux.org/title/CPU_frequency_scaling
They have different profiles set for these "powersave/balanced/performance" UI knobs
you often get and some of them switch the governor based on that, but as long as
you get
/sys/devices/system/cpu/cpufreq/policy*/scaling_governor
schedutil
or
/proc/sys/kernel/sched_energy_aware
1
You're fine (EAS requires schedutil anyway, so the latter is stronger).

> [snip]

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: I think there's an issue with e3f1164fc9e ("PM: EM: Support late CPUs booting and capacity adjustment") if there's "holes" in your CPU topology
  2025-09-03 17:26         ` Christian Loehle
@ 2025-09-03 18:39           ` Kenneth Crudup
  2025-09-03 18:40             ` Rafael J. Wysocki
  0 siblings, 1 reply; 14+ messages in thread
From: Kenneth Crudup @ 2025-09-03 18:39 UTC (permalink / raw)
  To: Christian Loehle, lukasz.luba, linux-pm@vger.kernel.org,
	Rafael J. Wysocki, Dietmar Eggemann
  Cc: Kenneth C


On 9/3/25 10:26, Christian Loehle wrote:

> Yeah looks like we're dropping that debug message.

Huh. Then I guess the whole thing's a non-issue after all- but that 
being said, the whole point of the reschedule is in case cores come up 
later, right? So in a case like mine, won't it just keep constantly 
rescheduling?

-Kenny

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: I think there's an issue with e3f1164fc9e ("PM: EM: Support late CPUs booting and capacity adjustment") if there's "holes" in your CPU topology
  2025-09-03 18:39           ` Kenneth Crudup
@ 2025-09-03 18:40             ` Rafael J. Wysocki
  2025-09-03 18:43               ` Kenneth Crudup
  0 siblings, 1 reply; 14+ messages in thread
From: Rafael J. Wysocki @ 2025-09-03 18:40 UTC (permalink / raw)
  To: Kenneth Crudup
  Cc: Christian Loehle, lukasz.luba, linux-pm@vger.kernel.org,
	Rafael J. Wysocki, Dietmar Eggemann

On Wed, Sep 3, 2025 at 8:39 PM Kenneth Crudup <kenny@panix.com> wrote:
>
>
> On 9/3/25 10:26, Christian Loehle wrote:
>
> > Yeah looks like we're dropping that debug message.
>
> Huh. Then I guess the whole thing's a non-issue after all- but that
> being said, the whole point of the reschedule is in case cores come up
> later, right? So in a case like mine, won't it just keep constantly
> rescheduling?

It will, but that's a separate issue.

I think that it needs to back off exponentially or something like that.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: I think there's an issue with e3f1164fc9e ("PM: EM: Support late CPUs booting and capacity adjustment") if there's "holes" in your CPU topology
  2025-09-03 18:40             ` Rafael J. Wysocki
@ 2025-09-03 18:43               ` Kenneth Crudup
  2025-09-03 18:53                 ` Rafael J. Wysocki
  0 siblings, 1 reply; 14+ messages in thread
From: Kenneth Crudup @ 2025-09-03 18:43 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Christian Loehle, lukasz.luba, linux-pm@vger.kernel.org,
	Dietmar Eggemann, Kenneth C


Is there a way to distinguish between "offline cores" and 
"'non-'existent" cores?

This way we could just skip the ones that "aren't" there, right?

-K

On 9/3/25 11:40, Rafael J. Wysocki wrote:
> On Wed, Sep 3, 2025 at 8:39 PM Kenneth Crudup <kenny@panix.com> wrote:
>>
>>
>> On 9/3/25 10:26, Christian Loehle wrote:
>>
>>> Yeah looks like we're dropping that debug message.
>>
>> Huh. Then I guess the whole thing's a non-issue after all- but that
>> being said, the whole point of the reschedule is in case cores come up
>> later, right? So in a case like mine, won't it just keep constantly
>> rescheduling?
> 
> It will, but that's a separate issue.
> 
> I think that it needs to back off exponentially or something like that.
> 

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: I think there's an issue with e3f1164fc9e ("PM: EM: Support late CPUs booting and capacity adjustment") if there's "holes" in your CPU topology
  2025-09-03 18:43               ` Kenneth Crudup
@ 2025-09-03 18:53                 ` Rafael J. Wysocki
  2025-09-04 17:40                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 14+ messages in thread
From: Rafael J. Wysocki @ 2025-09-03 18:53 UTC (permalink / raw)
  To: Kenneth Crudup
  Cc: Rafael J. Wysocki, Christian Loehle, lukasz.luba,
	linux-pm@vger.kernel.org, Dietmar Eggemann

On Wed, Sep 3, 2025 at 8:43 PM Kenneth Crudup <kenny@panix.com> wrote:
>
>
> Is there a way to distinguish between "offline cores" and
> "'non-'existent" cores?
>
> This way we could just skip the ones that "aren't" there, right?

I'm not quite sure about the underlying use case TBH.

The em_check_capacity_update() call may not be necessary on x86 at
all, but I need to double check.

> On 9/3/25 11:40, Rafael J. Wysocki wrote:
> > On Wed, Sep 3, 2025 at 8:39 PM Kenneth Crudup <kenny@panix.com> wrote:
> >>
> >>
> >> On 9/3/25 10:26, Christian Loehle wrote:
> >>
> >>> Yeah looks like we're dropping that debug message.
> >>
> >> Huh. Then I guess the whole thing's a non-issue after all- but that
> >> being said, the whole point of the reschedule is in case cores come up
> >> later, right? So in a case like mine, won't it just keep constantly
> >> rescheduling?
> >
> > It will, but that's a separate issue.
> >
> > I think that it needs to back off exponentially or something like that.
> >
>
> --

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: I think there's an issue with e3f1164fc9e ("PM: EM: Support late CPUs booting and capacity adjustment") if there's "holes" in your CPU topology
  2025-09-03 18:53                 ` Rafael J. Wysocki
@ 2025-09-04 17:40                   ` Rafael J. Wysocki
  2025-09-04 18:27                     ` Kenneth Crudup
  2025-09-04 21:14                     ` Kenneth Crudup
  0 siblings, 2 replies; 14+ messages in thread
From: Rafael J. Wysocki @ 2025-09-04 17:40 UTC (permalink / raw)
  To: Kenneth Crudup
  Cc: Christian Loehle, lukasz.luba, linux-pm@vger.kernel.org,
	Dietmar Eggemann

[-- Attachment #1: Type: text/plain, Size: 777 bytes --]

On Wed, Sep 3, 2025 at 8:53 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Wed, Sep 3, 2025 at 8:43 PM Kenneth Crudup <kenny@panix.com> wrote:
> >
> >
> > Is there a way to distinguish between "offline cores" and
> > "'non-'existent" cores?
> >
> > This way we could just skip the ones that "aren't" there, right?
>
> I'm not quite sure about the underlying use case TBH.
>
> The em_check_capacity_update() call may not be necessary on x86 at
> all, but I need to double check.

So AFAICS, this can be addressed by something like the attached patch
(the majority of changes in it is just moving kerneldoc comments
around and function renaming).

Since intel_pstate handles capacity updates itself, it can do without
em_check_capacity_update().

[-- Attachment #2: pm-em-register-basic.patch --]
[-- Type: text/x-patch, Size: 5347 bytes --]

---
 drivers/cpufreq/intel_pstate.c |    4 +-
 include/linux/energy_model.h   |   10 ++++++
 kernel/power/energy_model.c    |   67 ++++++++++++++++++++++++++---------------
 3 files changed, 56 insertions(+), 25 deletions(-)

--- a/drivers/cpufreq/intel_pstate.c
+++ b/drivers/cpufreq/intel_pstate.c
@@ -1034,8 +1034,8 @@ static bool hybrid_register_perf_domain(
 	if (!cpu_dev)
 		return false;
 
-	if (em_dev_register_perf_domain(cpu_dev, HYBRID_EM_STATE_COUNT, &cb,
-					cpumask_of(cpu), false))
+	if (em_dev_register_pd_no_update(cpu_dev, HYBRID_EM_STATE_COUNT, &cb,
+					 cpumask_of(cpu), false))
 		return false;
 
 	cpudata->pd_registered = true;
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -168,6 +168,9 @@ struct em_perf_domain *em_cpu_get(int cp
 struct em_perf_domain *em_pd_get(struct device *dev);
 int em_dev_update_perf_domain(struct device *dev,
 			      struct em_perf_table *new_table);
+int em_dev_register_pd_no_update(struct device *dev, unsigned int nr_states,
+				 const struct em_data_callback *cb,
+				 const cpumask_t *cpus, bool microwatts);
 int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
 				const struct em_data_callback *cb,
 				const cpumask_t *cpus, bool microwatts);
@@ -344,6 +347,13 @@ struct em_data_callback {};
 #define EM_SET_ACTIVE_POWER_CB(em_cb, cb) do { } while (0)
 
 static inline
+int em_dev_register_pd_no_update(struct device *dev, unsigned int nr_states,
+				 const struct em_data_callback *cb,
+				 const cpumask_t *cpus, bool microwatts)
+{
+	return -EINVAL;
+}
+static inline
 int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
 				const struct em_data_callback *cb,
 				const cpumask_t *cpus, bool microwatts)
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -527,31 +527,19 @@ struct em_perf_domain *em_cpu_get(int cp
 EXPORT_SYMBOL_GPL(em_cpu_get);
 
 /**
- * em_dev_register_perf_domain() - Register the Energy Model (EM) for a device
- * @dev		: Device for which the EM is to register
- * @nr_states	: Number of performance states to register
- * @cb		: Callback functions providing the data of the Energy Model
- * @cpus	: Pointer to cpumask_t, which in case of a CPU device is
- *		obligatory. It can be taken from i.e. 'policy->cpus'. For other
- *		type of devices this should be set to NULL.
- * @microwatts	: Flag indicating that the power values are in micro-Watts or
- *		in some other scale. It must be set properly.
- *
- * Create Energy Model tables for a performance domain using the callbacks
- * defined in cb.
- *
- * The @microwatts is important to set with correct value. Some kernel
- * sub-systems might rely on this flag and check if all devices in the EM are
- * using the same scale.
+ * em_dev_register_pd_no_update() - Register a perf domain for a device
+ * @dev : Device to register the PD for
+ * @nr_states : Number of performance states in the new PD
+ * @cb : Callback functions for populating the energy model
+ * @cpus : CPUs to include in the new PD (mandatory if @dev is a CPU device)
+ * @microwatts : Whether or not the power values in the EM will be in uW
  *
- * If multiple clients register the same performance domain, all but the first
- * registration will be ignored.
- *
- * Return 0 on success
+ * Like em_dev_register_perf_domain(), but does not trigger a CPU capacity
+ * update after registering the PD, even if @dev is a CPU device.
  */
-int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
-				const struct em_data_callback *cb,
-				const cpumask_t *cpus, bool microwatts)
+int em_dev_register_pd_no_update(struct device *dev, unsigned int nr_states,
+				 const struct em_data_callback *cb,
+				 const cpumask_t *cpus, bool microwatts)
 {
 	struct em_perf_table *em_table;
 	unsigned long cap, prev_cap = 0;
@@ -636,6 +624,39 @@ int em_dev_register_perf_domain(struct d
 unlock:
 	mutex_unlock(&em_pd_mutex);
 
+	return ret;
+}
+EXPORT_SYMBOL_GPL(em_dev_register_pd_no_update);
+
+/**
+ * em_dev_register_perf_domain() - Register the Energy Model (EM) for a device
+ * @dev		: Device for which the EM is to register
+ * @nr_states	: Number of performance states to register
+ * @cb		: Callback functions providing the data of the Energy Model
+ * @cpus	: Pointer to cpumask_t, which in case of a CPU device is
+ *		obligatory. It can be taken from i.e. 'policy->cpus'. For other
+ *		type of devices this should be set to NULL.
+ * @microwatts	: Flag indicating that the power values are in micro-Watts or
+ *		in some other scale. It must be set properly.
+ *
+ * Create Energy Model tables for a performance domain using the callbacks
+ * defined in cb.
+ *
+ * The @microwatts is important to set with correct value. Some kernel
+ * sub-systems might rely on this flag and check if all devices in the EM are
+ * using the same scale.
+ *
+ * If multiple clients register the same performance domain, all but the first
+ * registration will be ignored.
+ *
+ * Return 0 on success
+ */
+int em_dev_register_perf_domain(struct device *dev, unsigned int nr_states,
+				const struct em_data_callback *cb,
+				const cpumask_t *cpus, bool microwatts)
+{
+	int ret = em_dev_register_pd_no_update(dev, nr_states, cb, cpus, microwatts);
+
 	if (_is_cpu_device(dev))
 		em_check_capacity_update();
 

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: I think there's an issue with e3f1164fc9e ("PM: EM: Support late CPUs booting and capacity adjustment") if there's "holes" in your CPU topology
  2025-09-04 17:40                   ` Rafael J. Wysocki
@ 2025-09-04 18:27                     ` Kenneth Crudup
  2025-09-04 21:14                     ` Kenneth Crudup
  1 sibling, 0 replies; 14+ messages in thread
From: Kenneth Crudup @ 2025-09-04 18:27 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Christian Loehle, lukasz.luba, linux-pm@vger.kernel.org,
	Dietmar Eggemann, Me


I'll test this today and get back to you.

-K

On 9/4/25 10:40, Rafael J. Wysocki wrote:
> On Wed, Sep 3, 2025 at 8:53 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>
>> On Wed, Sep 3, 2025 at 8:43 PM Kenneth Crudup <kenny@panix.com> wrote:
>>>
>>>
>>> Is there a way to distinguish between "offline cores" and
>>> "'non-'existent" cores?
>>>
>>> This way we could just skip the ones that "aren't" there, right?
>>
>> I'm not quite sure about the underlying use case TBH.
>>
>> The em_check_capacity_update() call may not be necessary on x86 at
>> all, but I need to double check.
> 
> So AFAICS, this can be addressed by something like the attached patch
> (the majority of changes in it is just moving kerneldoc comments
> around and function renaming).
> 
> Since intel_pstate handles capacity updates itself, it can do without
> em_check_capacity_update().

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: I think there's an issue with e3f1164fc9e ("PM: EM: Support late CPUs booting and capacity adjustment") if there's "holes" in your CPU topology
  2025-09-04 17:40                   ` Rafael J. Wysocki
  2025-09-04 18:27                     ` Kenneth Crudup
@ 2025-09-04 21:14                     ` Kenneth Crudup
  1 sibling, 0 replies; 14+ messages in thread
From: Kenneth Crudup @ 2025-09-04 21:14 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Christian Loehle, lukasz.luba, linux-pm@vger.kernel.org,
	Dietmar Eggemann


Tested-by: Kenneth R. Crudup <kenny@panix.com>

On 9/4/25 10:40, Rafael J. Wysocki wrote:

> So AFAICS, this can be addressed by something like the attached patch
> (the majority of changes in it is just moving kerneldoc comments
> around and function renaming).
> 
> Since intel_pstate handles capacity updates itself, it can do without
> em_check_capacity_update().

-- 
Kenneth R. Crudup / Sr. SW Engineer, Scott County Consulting, Orange 
County CA


^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2025-09-04 21:14 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-22  5:53 I think there's an issue with e3f1164fc9e ("PM: EM: Support late CPUs booting and capacity adjustment") if there's "holes" in your CPU topology Kenneth Crudup
2025-08-28 14:56 ` Christian Loehle
2025-08-28 17:42   ` Kenneth Crudup
2025-08-29 10:31     ` Christian Loehle
2025-08-29 16:45       ` Rafael J. Wysocki
2025-09-03 17:19       ` Kenneth Crudup
2025-09-03 17:26         ` Christian Loehle
2025-09-03 18:39           ` Kenneth Crudup
2025-09-03 18:40             ` Rafael J. Wysocki
2025-09-03 18:43               ` Kenneth Crudup
2025-09-03 18:53                 ` Rafael J. Wysocki
2025-09-04 17:40                   ` Rafael J. Wysocki
2025-09-04 18:27                     ` Kenneth Crudup
2025-09-04 21:14                     ` Kenneth Crudup

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox