public inbox for linux-kernel@vger.kernel.org
* cpufreq/thermal regression in 6.10
@ 2024-06-09  7:53 Johan Hovold
  2024-06-10 11:17 ` Rafael J. Wysocki
  0 siblings, 1 reply; 9+ messages in thread
From: Johan Hovold @ 2024-06-09  7:53 UTC
  To: Rafael J. Wysocki, Daniel Lezcano
  Cc: Viresh Kumar, Zhang Rui, Lukasz Luba, Steev Klimaszewski,
	linux-pm, linux-kernel, regressions

Hi,

Steev reported to me off-list that the CPU frequency of the big cores on
the Lenovo ThinkPad X13s sometimes appears to get stuck at a low
frequency with 6.10-rc2.

I just confirmed that once the cores are fully throttled (using the
stepwise thermal governor) due to the skin temperature reaching the
first trip point, scaling_max_freq gets stuck at the next OPP:

	cpu4/cpufreq/scaling_max_freq:940800
	cpu5/cpufreq/scaling_max_freq:940800
	cpu6/cpufreq/scaling_max_freq:940800
	cpu7/cpufreq/scaling_max_freq:940800

when the temperature drops again.
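Here is a minimal sketch for spotting this symptom programmatically. It assumes the standard cpufreq sysfs layout under /sys/devices/system/cpu/, where values are in kHz. It also assumes that cpu4 to cpu7 are the big cores, as in the listing above. The helpers read_khz() and cpu_is_capped() are hypothetical names for illustration, not an existing tool.

```c
#include <assert.h>
#include <stdio.h>

/* Read one integer (kHz) from a cpufreq sysfs attribute file.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
static int read_khz(const char *path, long *out)
{
	FILE *f;
	int ok;

	assert(path != NULL && out != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = fscanf(f, "%ld", out) == 1;
	fclose(f);
	return ok ? 0 : -1;
}

/* Hypothetical check: 1 if cpuN's scaling_max_freq is below its
 * cpuinfo_max_freq (still capped), 0 if not, -1 on any read error. */
static int cpu_is_capped(int cpu)
{
	char path[128];
	long cur_max, hw_max;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/cpufreq/scaling_max_freq", cpu);
	if (read_khz(path, &cur_max))
		return -1;
	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/cpufreq/cpuinfo_max_freq", cpu);
	if (read_khz(path, &hw_max))
		return -1;
	return cur_max < hw_max;
}
```

After the skin temperature has dropped, calling cpu_is_capped() for cpu4 through cpu7 should return 0 on an unaffected system. A persistent 1 matches the report. Userspace can also lower scaling_max_freq deliberately, so a cap on its own does not prove this bug.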

This obviously leads to a massive performance drop and could possibly
also be related to reports like this one:

	https://lore.kernel.org/all/CAHk-=wjwFGQZcDinK=BkEaA8FSyVg5NaUe0BobxowxeZ5PvetA@mail.gmail.com/

I assume the regression may have been introduced by all the thermal work
that went into 6.10-rc1, but I don't have time to try to track this down
myself right now (and will be away from keyboard most of next week).

I've confirmed that 6.9 works as expected.

Johan


#regzbot introduced: v6.9..v6.10-rc2


* Re: cpufreq/thermal regression in 6.10
  2024-06-09  7:53 cpufreq/thermal regression in 6.10 Johan Hovold
@ 2024-06-10 11:17 ` Rafael J. Wysocki
  2024-06-11 10:54   ` Rafael J. Wysocki
  0 siblings, 1 reply; 9+ messages in thread
From: Rafael J. Wysocki @ 2024-06-10 11:17 UTC
  To: Johan Hovold
  Cc: Rafael J. Wysocki, Daniel Lezcano, Viresh Kumar, Zhang Rui,
	Lukasz Luba, Steev Klimaszewski, linux-pm, linux-kernel,
	regressions

Hi,

Thanks for the report.

On Sun, Jun 9, 2024 at 9:53 AM Johan Hovold <johan@kernel.org> wrote:
>
> Hi,
>
> Steev reported to me off-list that the CPU frequency of the big cores on
> the Lenovo ThinkPad X13s sometimes appears to get stuck at a low
> frequency with 6.10-rc2.
>
> I just confirmed that once the cores are fully throttled (using the
> stepwise thermal governor) due to the skin temperature reaching the
> first trip point, scaling_max_freq gets stuck at the next OPP:
>
>         cpu4/cpufreq/scaling_max_freq:940800
>         cpu5/cpufreq/scaling_max_freq:940800
>         cpu6/cpufreq/scaling_max_freq:940800
>         cpu7/cpufreq/scaling_max_freq:940800
>
> when the temperature drops again.

So apparently something fails to update its frequency QoS request.

Would it be possible to provoke this with thermal debug enabled
(CONFIG_THERMAL_DEBUGFS set) and see what's there in
/sys/kernel/debug/thermal/?
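This sketch makes "fails to update its frequency QoS request" concrete. Each max-frequency request acts as a ceiling, and the effective policy limit is the lowest of them. A cooling request left at a throttled value therefore keeps capping the CPUs.

The code below is a simplified user-space model built on that assumption. The names struct max_request and effective_max() are hypothetical; they are not the kernel's freq_qos API, where requests are changed with freq_qos_update_request().

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Simplified model of max-frequency QoS requests: each requester holds
 * one ceiling and the effective limit is the smallest one. */
struct max_request {
	const char *owner; /* hypothetical label, e.g. "cooling" */
	long khz;          /* requested ceiling in kHz; LONG_MAX = no limit */
};

static long effective_max(const struct max_request *reqs, size_t n)
{
	long m = LONG_MAX;

	assert(reqs != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if (reqs[i].khz < m)
			m = reqs[i].khz;
	return m;
}
```

In this model, if the "cooling" entry is never raised back to LONG_MAX after throttling ends, effective_max() keeps returning the throttled value. No other requester can lift it.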

> This obviously leads to a massive performance drop and could possibly
> also be related to reports like this one:
>
>         https://lore.kernel.org/all/CAHk-=wjwFGQZcDinK=BkEaA8FSyVg5NaUe0BobxowxeZ5PvetA@mail.gmail.com/
>
> I assume the regression may have been introduced by all the thermal work
> that went into 6.10-rc1, but I don't have time to try to track this down
> myself right now (and will be away from keyboard most of next week).
>
> I've confirmed that 6.9 works as expected.

Well, I'd need to ask someone else affected by this, then.


* Re: cpufreq/thermal regression in 6.10
  2024-06-10 11:17 ` Rafael J. Wysocki
@ 2024-06-11 10:54   ` Rafael J. Wysocki
  2024-06-11 12:02     ` Johan Hovold
  2024-06-21 15:46     ` Jens Glathe
  0 siblings, 2 replies; 9+ messages in thread
From: Rafael J. Wysocki @ 2024-06-11 10:54 UTC
  To: Johan Hovold
  Cc: Rafael J. Wysocki, Daniel Lezcano, Viresh Kumar, Zhang Rui,
	Lukasz Luba, Steev Klimaszewski, linux-pm, linux-kernel,
	regressions

[-- Attachment #1: Type: text/plain, Size: 2195 bytes --]

On Mon, Jun 10, 2024 at 1:17 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> Hi,
>
> Thanks for the report.
>
> On Sun, Jun 9, 2024 at 9:53 AM Johan Hovold <johan@kernel.org> wrote:
> >
> > Hi,
> >
> > Steev reported to me off-list that the CPU frequency of the big cores on
> > the Lenovo ThinkPad X13s sometimes appears to get stuck at a low
> > frequency with 6.10-rc2.
> >
> > I just confirmed that once the cores are fully throttled (using the
> > stepwise thermal governor) due to the skin temperature reaching the
> > first trip point, scaling_max_freq gets stuck at the next OPP:
> >
> >         cpu4/cpufreq/scaling_max_freq:940800
> >         cpu5/cpufreq/scaling_max_freq:940800
> >         cpu6/cpufreq/scaling_max_freq:940800
> >         cpu7/cpufreq/scaling_max_freq:940800
> >
> > when the temperature drops again.
>
> So apparently something fails to update its frequency QoS request.
>
> Would it be possible to provoke this with thermal debug enabled
> (CONFIG_THERMAL_DEBUGFS set) and see what's there in
> /sys/kernel/debug/thermal/?
>
> > This obviously leads to a massive performance drop and could possibly
> > also be related to reports like this one:
> >
> >         https://lore.kernel.org/all/CAHk-=wjwFGQZcDinK=BkEaA8FSyVg5NaUe0BobxowxeZ5PvetA@mail.gmail.com/
> >
> > I assume the regression may have been introduced by all the thermal work
> > that went into 6.10-rc1, but I don't have time to try to track this down
> > myself right now (and will be away from keyboard most of next week).
> >
> > I've confirmed that 6.9 works as expected.
>
> Well, I'd need to ask someone else affected by this, then.

If this is the step-wise governor, the problem might have been
introduced by commit

042a3d80f118 thermal: core: Move passive polling management to the core

which removed passive polling count updates from that governor, so if
the thermal zone in question has passive polling only and no regular
polling, temperature updates may stop coming before the governor drops
the cooling device states to the "no target" level.
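Below is a toy simulation of the failure mode described above. It rests on several simplifying assumptions:

- There is one passive trip and no regular polling.
- The cooling device has states 0 to upper and follows the governor's target.
- A temperature below the trip always counts as a "dropping" trend.
- There is one initial update when the trip is first crossed.

The "core-managed" variant is an assumed policy in which passive polling runs only while the temperature is at or above the trip. The "governor-managed" variant mirrors the counting in the partial revert. All names are hypothetical.

```c
#include <assert.h>

#define NO_TARGET (-1) /* models THERMAL_NO_TARGET */

/* Simplified step-wise decision: one state up while above the trip,
 * one state down below it, "no target" once back at state 0. */
static int next_target(int cur, int throttle, int upper)
{
	if (throttle)
		return cur + 1 > upper ? upper : cur + 1;
	if (cur <= 0)
		return NO_TARGET;
	return cur - 1;
}

/* Feed temperature samples to the model until updates stop.
 * governor_managed == 0: assumed core policy, poll only while temp >= trip.
 * governor_managed == 1: passive count follows the target transitions.
 * Returns the cooling state left behind when updates stop. */
static int simulate(const int *temps, int n, int trip, int upper,
		    int governor_managed)
{
	int passive = 0, target = NO_TARGET, state = 0;

	assert(upper >= 0);
	for (int i = 0; i < n; i++) {
		int throttle = temps[i] >= trip;
		int old = target;

		target = next_target(state, throttle, upper);
		if (!governor_managed) {
			passive = throttle;
		} else if (old == NO_TARGET && target != NO_TARGET) {
			passive++;
		} else if (old != NO_TARGET && target == NO_TARGET) {
			passive--;
		}
		state = target == NO_TARGET ? 0 : target;

		/* No passive polling and no regular polling: no more updates. */
		if (passive == 0)
			break;
	}
	return state;
}
```

Consider five samples above an 80°C trip, then several below it, with upper = 4. The core-managed variant stops updating with the cooling state left at 3. The governor-managed variant keeps polling until it has walked the state back to 0 and released the target.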

So please test the attached partial revert of the above commit when you can.

[-- Attachment #2: thermal-gov_step_wise--revert-passive.patch --]
[-- Type: text/x-patch, Size: 848 bytes --]

---
 drivers/thermal/gov_step_wise.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

Index: linux-pm/drivers/thermal/gov_step_wise.c
===================================================================
--- linux-pm.orig/drivers/thermal/gov_step_wise.c
+++ linux-pm/drivers/thermal/gov_step_wise.c
@@ -93,6 +93,16 @@ static void thermal_zone_trip_update(str
 		if (instance->initialized && old_target == instance->target)
 			continue;
 
+		if (trip->type == THERMAL_TRIP_PASSIVE) {
+			/* If needed, update the status of passive polling. */
+			if (old_target == THERMAL_NO_TARGET &&
+			    instance->target != THERMAL_NO_TARGET)
+				tz->passive++;
+			else if (old_target != THERMAL_NO_TARGET &&
+				 instance->target == THERMAL_NO_TARGET)
+				tz->passive--;
+		}
+
 		instance->initialized = true;
 
 		mutex_lock(&instance->cdev->lock);


* Re: cpufreq/thermal regression in 6.10
  2024-06-11 10:54   ` Rafael J. Wysocki
@ 2024-06-11 12:02     ` Johan Hovold
  2024-06-11 21:19       ` Steev Klimaszewski
  2024-06-21 15:46     ` Jens Glathe
  1 sibling, 1 reply; 9+ messages in thread
From: Johan Hovold @ 2024-06-11 12:02 UTC
  To: Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Daniel Lezcano, Viresh Kumar, Zhang Rui,
	Lukasz Luba, Steev Klimaszewski, linux-pm, linux-kernel,
	regressions

On Tue, Jun 11, 2024 at 12:54:25PM +0200, Rafael J. Wysocki wrote:
> On Mon, Jun 10, 2024 at 1:17 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
> > On Sun, Jun 9, 2024 at 9:53 AM Johan Hovold <johan@kernel.org> wrote:

> > > Steev reported to me off-list that the CPU frequency of the big cores on
> > > the Lenovo ThinkPad X13s sometimes appears to get stuck at a low
> > > frequency with 6.10-rc2.
> > >
> > > I just confirmed that once the cores are fully throttled (using the
> > > stepwise thermal governor) due to the skin temperature reaching the
> > > first trip point, scaling_max_freq gets stuck at the next OPP:
> > >
> > >         cpu4/cpufreq/scaling_max_freq:940800
> > >         cpu5/cpufreq/scaling_max_freq:940800
> > >         cpu6/cpufreq/scaling_max_freq:940800
> > >         cpu7/cpufreq/scaling_max_freq:940800
> > >
> > > when the temperature drops again.

> If this is the step-wise governor, the problem might have been
> introduced by commit
> 
> 042a3d80f118 thermal: core: Move passive polling management to the core
> 
> which removed passive polling count updates from that governor, so if
> the thermal zone in question has passive polling only and no regular
> polling, temperature updates may stop coming before the governor drops
> the cooling device states to the "no target" level.
> 
> So please test the attached partial revert of the above commit when you can.

Thanks for the quick fix. The partial revert seems to do the trick:

Tested-by: Johan Hovold <johan+linaro@kernel.org>

Johan


* Re: cpufreq/thermal regression in 6.10
  2024-06-11 12:02     ` Johan Hovold
@ 2024-06-11 21:19       ` Steev Klimaszewski
  0 siblings, 0 replies; 9+ messages in thread
From: Steev Klimaszewski @ 2024-06-11 21:19 UTC
  To: Johan Hovold, Rafael J. Wysocki
  Cc: Rafael J. Wysocki, Daniel Lezcano, Viresh Kumar, Zhang Rui,
	Lukasz Luba, linux-pm, linux-kernel, regressions


On 6/11/24 7:02 AM, Johan Hovold wrote:
> On Tue, Jun 11, 2024 at 12:54:25PM +0200, Rafael J. Wysocki wrote:
>> On Mon, Jun 10, 2024 at 1:17 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>>> On Sun, Jun 9, 2024 at 9:53 AM Johan Hovold <johan@kernel.org> wrote:
>>>> Steev reported to me off-list that the CPU frequency of the big cores on
>>>> the Lenovo ThinkPad X13s sometimes appears to get stuck at a low
>>>> frequency with 6.10-rc2.
>>>>
>>>> I just confirmed that once the cores are fully throttled (using the
>>>> stepwise thermal governor) due to the skin temperature reaching the
>>>> first trip point, scaling_max_freq gets stuck at the next OPP:
>>>>
>>>>          cpu4/cpufreq/scaling_max_freq:940800
>>>>          cpu5/cpufreq/scaling_max_freq:940800
>>>>          cpu6/cpufreq/scaling_max_freq:940800
>>>>          cpu7/cpufreq/scaling_max_freq:940800
>>>>
>>>> when the temperature drops again.
>> If this is the step-wise governor, the problem might have been
>> introduced by commit
>>
>> 042a3d80f118 thermal: core: Move passive polling management to the core
>>
>> which removed passive polling count updates from that governor, so if
>> the thermal zone in question has passive polling only and no regular
>> polling, temperature updates may stop coming before the governor drops
>> the cooling device states to the "no target" level.
>>
>> So please test the attached partial revert of the above commit when you can.
> Thanks for the quick fix. The partial revert seems to do the trick:
>
> Tested-by: Johan Hovold <johan+linaro@kernel.org>
>
> Johan

I can also confirm that it's working here!

Tested-by: Steev Klimaszewski <steev@kali.org>



* Re: cpufreq/thermal regression in 6.10
  2024-06-11 10:54   ` Rafael J. Wysocki
  2024-06-11 12:02     ` Johan Hovold
@ 2024-06-21 15:46     ` Jens Glathe
  2024-06-21 16:41       ` Rafael J. Wysocki
  1 sibling, 1 reply; 9+ messages in thread
From: Jens Glathe @ 2024-06-21 15:46 UTC
  To: Rafael J. Wysocki, Johan Hovold
  Cc: Rafael J. Wysocki, Daniel Lezcano, Viresh Kumar, Zhang Rui,
	Lukasz Luba, Steev Klimaszewski, linux-pm, linux-kernel,
	regressions

Hi there,

unfortunately I experienced the issue with the fix applied. I had to
revert both the fix and the original commit to get back to normal
behaviour. My system (also a Lenovo ThinkPad X13s) uses the schedutil
cpufreq governor, and the behaviour is as described by Steev and Johan.
The full throttling happened during a package build and left the
performance cores at 940800.

Cheers

Jens

On 6/11/24 12:54, Rafael J. Wysocki wrote:
> On Mon, Jun 10, 2024 at 1:17 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>> Hi,
>>
>> Thanks for the report.
>>
>> On Sun, Jun 9, 2024 at 9:53 AM Johan Hovold <johan@kernel.org> wrote:
>>> Hi,
>>>
>>> Steev reported to me off-list that the CPU frequency of the big cores on
>>> the Lenovo ThinkPad X13s sometimes appears to get stuck at a low
>>> frequency with 6.10-rc2.
>>>
>>> I just confirmed that once the cores are fully throttled (using the
>>> stepwise thermal governor) due to the skin temperature reaching the
>>> first trip point, scaling_max_freq gets stuck at the next OPP:
>>>
>>>          cpu4/cpufreq/scaling_max_freq:940800
>>>          cpu5/cpufreq/scaling_max_freq:940800
>>>          cpu6/cpufreq/scaling_max_freq:940800
>>>          cpu7/cpufreq/scaling_max_freq:940800
>>>
>>> when the temperature drops again.
>> So apparently something fails to update its frequency QoS request.
>>
>> Would it be possible to provoke this with thermal debug enabled
>> (CONFIG_THERMAL_DEBUGFS set) and see what's there in
>> /sys/kernel/debug/thermal/?
>>
>>> This obviously leads to a massive performance drop and could possibly
>>> also be related to reports like this one:
>>>
>>>          https://lore.kernel.org/all/CAHk-=wjwFGQZcDinK=BkEaA8FSyVg5NaUe0BobxowxeZ5PvetA@mail.gmail.com/
>>>
>>> I assume the regression may have been introduced by all the thermal work
>>> that went into 6.10-rc1, but I don't have time to try to track this down
>>> myself right now (and will be away from keyboard most of next week).
>>>
>>> I've confirmed that 6.9 works as expected.
>> Well, I'd need to ask someone else affected by this, then.
> If this is the step-wise governor, the problem might have been
> introduced by commit
>
> 042a3d80f118 thermal: core: Move passive polling management to the core
>
> which removed passive polling count updates from that governor, so if
> the thermal zone in question has passive polling only and no regular
> polling, temperature updates may stop coming before the governor drops
> the cooling device states to the "no target" level.
>
> So please test the attached partial revert of the above commit when you can.


* Re: cpufreq/thermal regression in 6.10
  2024-06-21 15:46     ` Jens Glathe
@ 2024-06-21 16:41       ` Rafael J. Wysocki
  2024-06-21 19:59         ` Jens Glathe
  0 siblings, 1 reply; 9+ messages in thread
From: Rafael J. Wysocki @ 2024-06-21 16:41 UTC
  To: Jens Glathe
  Cc: Rafael J. Wysocki, Johan Hovold, Rafael J. Wysocki,
	Daniel Lezcano, Viresh Kumar, Zhang Rui, Lukasz Luba,
	Steev Klimaszewski, linux-pm, linux-kernel, regressions

[-- Attachment #1: Type: text/plain, Size: 550 bytes --]

Hi,

On Fri, Jun 21, 2024 at 5:53 PM Jens Glathe
<jens.glathe@oldschoolsolutions.biz> wrote:
>
> Hi there,
>
> unfortunately I experienced the issue with the fix applied. I had to
> revert both the fix and the original commit to get back to normal
> behaviour. My system (also a Lenovo ThinkPad X13s) uses the schedutil
> cpufreq governor, and the behaviour is as described by Steev and Johan.
> The full throttling happened during a package build and left the
> performance cores at 940800.

So can you please test the attached patch, on top of the fix?

[-- Attachment #2: thermal-gov_step_wise-distangle.patch --]
[-- Type: text/x-patch, Size: 1407 bytes --]

---
 drivers/thermal/gov_step_wise.c |   19 +------------------
 1 file changed, 1 insertion(+), 18 deletions(-)

Index: linux-pm/drivers/thermal/gov_step_wise.c
===================================================================
--- linux-pm.orig/drivers/thermal/gov_step_wise.c
+++ linux-pm/drivers/thermal/gov_step_wise.c
@@ -55,7 +55,7 @@ static unsigned long get_target_state(st
 		if (cur_state <= instance->lower)
 			return THERMAL_NO_TARGET;
 
-		return clamp(cur_state - 1, instance->lower, instance->upper);
+		return instance->lower;
 	}
 
 	return instance->target;
@@ -93,23 +93,6 @@ static void thermal_zone_trip_update(str
 		if (instance->initialized && old_target == instance->target)
 			continue;
 
-		if (trip->type == THERMAL_TRIP_PASSIVE) {
-			/*
-			 * If the target state for this thermal instance
-			 * changes from THERMAL_NO_TARGET to something else,
-			 * ensure that the zone temperature will be updated
-			 * (assuming enabled passive cooling) until it becomes
-			 * THERMAL_NO_TARGET again, or the cooling device may
-			 * not be reset to its initial state.
-			 */
-			if (old_target == THERMAL_NO_TARGET &&
-			    instance->target != THERMAL_NO_TARGET)
-				tz->passive++;
-			else if (old_target != THERMAL_NO_TARGET &&
-				 instance->target == THERMAL_NO_TARGET)
-				tz->passive--;
-		}
-
 		instance->initialized = true;
 
 		mutex_lock(&instance->cdev->lock);

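The functional change in this patch is in get_target_state(). Once mitigation is over, the target goes straight to instance->lower instead of stepping down one state per update. Counting updates is one way to see why this reduces how many temperature updates the governor needs before it can release the cooling device.

The model below assumes the cooling device state always follows the target and that every update sees a dropping trend below the trip. The functions cool_old(), cool_new() and updates_to_release() are hypothetical stand-ins, and NO_TARGET models THERMAL_NO_TARGET.

```c
#include <assert.h>

#define NO_TARGET (-1L) /* models THERMAL_NO_TARGET */

static long clampl(long v, long lo, long hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* Cooling-down branch before the patch: one state down per update. */
static long cool_old(long cur, long lower, long upper)
{
	if (cur <= lower)
		return NO_TARGET;
	return clampl(cur - 1, lower, upper);
}

/* Cooling-down branch after the patch: jump straight to the lower limit. */
static long cool_new(long cur, long lower, long upper)
{
	(void)upper;
	if (cur <= lower)
		return NO_TARGET;
	return lower;
}

/* Count updates needed to go from state cur to NO_TARGET, assuming the
 * cooling device state follows the target on every update. */
static int updates_to_release(long (*step)(long, long, long),
			      long cur, long lower, long upper)
{
	int n = 0;

	assert(lower <= upper && cur >= lower);
	for (;;) {
		long t = step(cur, lower, upper);

		n++;
		if (t == NO_TARGET)
			return n;
		cur = t;
	}
}
```

Take a device at state 10 with lower = 0. In this model the old branch needs 11 updates to reach NO_TARGET, and the new one needs 2.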

* Re: cpufreq/thermal regression in 6.10
  2024-06-21 16:41       ` Rafael J. Wysocki
@ 2024-06-21 19:59         ` Jens Glathe
  2024-06-22 12:14           ` Rafael J. Wysocki
  0 siblings, 1 reply; 9+ messages in thread
From: Jens Glathe @ 2024-06-21 19:59 UTC
  To: Rafael J. Wysocki
  Cc: Johan Hovold, Rafael J. Wysocki, Daniel Lezcano, Viresh Kumar,
	Zhang Rui, Lukasz Luba, Steev Klimaszewski, linux-pm,
	linux-kernel, regressions

Hi there,

thank you for the fast fix. Applied, built, installed. The test is
successful: performance-core scaling up to 2995200 kHz comes back when
the skin temperature drops below 55°C.
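cpufreq reports these values in kHz. Taking 2995200 kHz as the big cores' maximum, since that is the value seen to return here, the stuck ceiling of 940800 kHz is roughly a third of it:

```latex
\frac{940\,800\ \mathrm{kHz}}{2\,995\,200\ \mathrm{kHz}} = \frac{49}{156} \approx 0.314
```

In other words, the frequency ceiling drops from about 3.0 GHz to about 0.94 GHz while the bug is in effect.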

Tested-by: Jens Glathe <jens.glathe@oldschoolsolutions.biz>

Cheers

Jens

On 6/21/24 18:41, Rafael J. Wysocki wrote:
> Hi,
>
> On Fri, Jun 21, 2024 at 5:53 PM Jens Glathe
> <jens.glathe@oldschoolsolutions.biz> wrote:
>> Hi there,
>>
>> unfortunately I experienced the issue with the fix applied. I had to
>> revert both the fix and the original commit to get back to normal
>> behaviour. My system (also a Lenovo ThinkPad X13s) uses the schedutil
>> cpufreq governor, and the behaviour is as described by Steev and Johan.
>> The full throttling happened during a package build and left the
>> performance cores at 940800.
> So can you please test the attached patch, on top of the fix?


* Re: cpufreq/thermal regression in 6.10
  2024-06-21 19:59         ` Jens Glathe
@ 2024-06-22 12:14           ` Rafael J. Wysocki
  0 siblings, 0 replies; 9+ messages in thread
From: Rafael J. Wysocki @ 2024-06-22 12:14 UTC
  To: Jens Glathe
  Cc: Rafael J. Wysocki, Johan Hovold, Rafael J. Wysocki,
	Daniel Lezcano, Viresh Kumar, Zhang Rui, Lukasz Luba,
	Steev Klimaszewski, linux-pm, linux-kernel, regressions

On Fri, Jun 21, 2024 at 10:00 PM Jens Glathe
<jens.glathe@oldschoolsolutions.biz> wrote:
>
> Hi there,
>
> thank you for the fast fix. Applied, built, installed. The test is
> successful: performance-core scaling up to 2995200 kHz comes back when
> the skin temperature drops below 55°C.
>
> Tested-by: Jens Glathe <jens.glathe@oldschoolsolutions.biz>

Great, thanks for testing!

I'll submit the patch shortly, and it would be good if others
affected by this issue could try it to confirm that it doesn't regress
anything.


> On 6/21/24 18:41, Rafael J. Wysocki wrote:
> > Hi,
> >
> > On Fri, Jun 21, 2024 at 5:53 PM Jens Glathe
> > <jens.glathe@oldschoolsolutions.biz> wrote:
> >> Hi there,
> >>
> >> unfortunately I experienced the issue with the fix applied. I had to
> >> revert both the fix and the original commit to get back to normal
> >> behaviour. My system (also a Lenovo ThinkPad X13s) uses the schedutil
> >> cpufreq governor, and the behaviour is as described by Steev and Johan.
> >> The full throttling happened during a package build and left the
> >> performance cores at 940800.
> > So can you please test the attached patch, on top of the fix?
>
