* Trip points crossed not detected when no cooling device bound
@ 2024-06-26 6:50 Daniel Lezcano
2024-06-26 10:38 ` Rafael J. Wysocki
0 siblings, 1 reply; 10+ messages in thread
From: Daniel Lezcano @ 2024-06-26 6:50 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Linux PM mailing list, Lukasz Luba
Hi,
after experimenting different setup I noticed there is no longer trip
crossed notifications sent to the userspace when there is no cooling
device bound to a thermal zone.
git bisecting leads me to this commit:
202aa0d4bb532338cd27bcc64c60abc2987a2be7 is the first bad commit
commit 202aa0d4bb532338cd27bcc64c60abc2987a2be7
Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Date: Tue Apr 30 17:45:55 2024 +0200
thermal: core: Do not call handle_thermal_trip() if zone
temperature is invalid
Make __thermal_zone_device_update() bail out if update_temperature()
fails to update the zone temperature because __thermal_zone_get_temp()
has returned an error and the current zone temperature is
THERMAL_TEMP_INVALID (user space receiving netlink thermal messages,
thermal debug code and thermal governors may get confused otherwise).
Fixes: 9ad18043fb35 ("thermal: core: Send trip crossing
notifications at init time if needed")
Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
Tested-by: Lukasz Luba <lukasz.luba@arm.com>
drivers/thermal/thermal_core.c | 3 +++
1 file changed, 3 insertions(+)
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Trip points crossed not detected when no cooling device bound
2024-06-26 6:50 Trip points crossed not detected when no cooling device bound Daniel Lezcano
@ 2024-06-26 10:38 ` Rafael J. Wysocki
2024-06-26 21:21 ` Daniel Lezcano
0 siblings, 1 reply; 10+ messages in thread
From: Rafael J. Wysocki @ 2024-06-26 10:38 UTC (permalink / raw)
To: Daniel Lezcano; +Cc: Rafael J. Wysocki, Linux PM mailing list, Lukasz Luba
[-- Attachment #1: Type: text/plain, Size: 1720 bytes --]
Hi,
On Wed, Jun 26, 2024 at 8:50 AM Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
>
>
> Hi,
>
> after experimenting different setup I noticed there is no longer trip
> crossed notifications sent to the userspace when there is no cooling
> device bound to a thermal zone.
I think I've seen those coming for zones without any cooling devices,
so it seems related to the setup.
> git bisecting leads me to this commit:
>
>
> 202aa0d4bb532338cd27bcc64c60abc2987a2be7 is the first bad commit
> commit 202aa0d4bb532338cd27bcc64c60abc2987a2be7
> Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Date: Tue Apr 30 17:45:55 2024 +0200
>
> thermal: core: Do not call handle_thermal_trip() if zone
> temperature is invalid
>
> Make __thermal_zone_device_update() bail out if update_temperature()
> fails to update the zone temperature because __thermal_zone_get_temp()
> has returned an error and the current zone temperature is
> THERMAL_TEMP_INVALID (user space receiving netlink thermal messages,
> thermal debug code and thermal governors may get confused otherwise).
>
> Fixes: 9ad18043fb35 ("thermal: core: Send trip crossing
> notifications at init time if needed")
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
> Tested-by: Lukasz Luba <lukasz.luba@arm.com>
>
> drivers/thermal/thermal_core.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> -
Oh, I see where the problem can be. If the zone is polling only, it
will not rearm the timer when the current zone temperature is invalid
after the above commit, so does the attached patch help?
[-- Attachment #2: thermal-poll-if-temp-invalid.patch --]
[-- Type: text/x-patch, Size: 647 bytes --]
---
drivers/thermal/thermal_core.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
Index: linux-pm/drivers/thermal/thermal_core.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_core.c
+++ linux-pm/drivers/thermal/thermal_core.c
@@ -514,7 +514,7 @@ void __thermal_zone_device_update(struct
update_temperature(tz);
if (tz->temperature == THERMAL_TEMP_INVALID)
- return;
+ goto monitor;
tz->notify_event = event;
@@ -536,6 +536,7 @@ void __thermal_zone_device_update(struct
thermal_debug_update_trip_stats(tz);
+monitor:
monitor_thermal_zone(tz);
}
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Trip points crossed not detected when no cooling device bound
2024-06-26 10:38 ` Rafael J. Wysocki
@ 2024-06-26 21:21 ` Daniel Lezcano
2024-06-26 22:24 ` Daniel Lezcano
0 siblings, 1 reply; 10+ messages in thread
From: Daniel Lezcano @ 2024-06-26 21:21 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Linux PM mailing list, Lukasz Luba
On 26/06/2024 12:38, Rafael J. Wysocki wrote:
> Hi,
>
> On Wed, Jun 26, 2024 at 8:50 AM Daniel Lezcano
> <daniel.lezcano@linaro.org> wrote:
>>
>>
>> Hi,
>>
>> after experimenting different setup I noticed there is no longer trip
>> crossed notifications sent to the userspace when there is no cooling
>> device bound to a thermal zone.
>
> I think I've seen those coming for zones without any cooling devices,
> so it seems related to the setup.
>
>> git bisecting leads me to this commit:
>>
>>
>> 202aa0d4bb532338cd27bcc64c60abc2987a2be7 is the first bad commit
>> commit 202aa0d4bb532338cd27bcc64c60abc2987a2be7
>> Author: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> Date: Tue Apr 30 17:45:55 2024 +0200
>>
>> thermal: core: Do not call handle_thermal_trip() if zone
>> temperature is invalid
>>
>> Make __thermal_zone_device_update() bail out if update_temperature()
>> fails to update the zone temperature because __thermal_zone_get_temp()
>> has returned an error and the current zone temperature is
>> THERMAL_TEMP_INVALID (user space receiving netlink thermal messages,
>> thermal debug code and thermal governors may get confused otherwise).
>>
>> Fixes: 9ad18043fb35 ("thermal: core: Send trip crossing
>> notifications at init time if needed")
>> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
>> Reviewed-by: Lukasz Luba <lukasz.luba@arm.com>
>> Tested-by: Lukasz Luba <lukasz.luba@arm.com>
>>
>> drivers/thermal/thermal_core.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> -
>
> Oh, I see where the problem can be. If the zone is polling only, it
> will not rearm the timer when the current zone temperature is invalid
> after the above commit, so does the attached patch help?
At this point, I went far when bisecting another problem and I ended up
screwing my config file. So I had to generate a new one from the default
config. Since then the issue is no longer happening which sounds very
strange to me.
I'm still investigating but if you have a suggestion coming in mind, it
would be welcome because I'm failing to find out what is going on ... :/
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Trip points crossed not detected when no cooling device bound
2024-06-26 21:21 ` Daniel Lezcano
@ 2024-06-26 22:24 ` Daniel Lezcano
2024-06-27 9:54 ` Rafael J. Wysocki
0 siblings, 1 reply; 10+ messages in thread
From: Daniel Lezcano @ 2024-06-26 22:24 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Linux PM mailing list, Lukasz Luba
On 26/06/2024 23:21, Daniel Lezcano wrote:
[ ... ]
>> Oh, I see where the problem can be. If the zone is polling only, it
>> will not rearm the timer when the current zone temperature is invalid
>> after the above commit, so does the attached patch help?
>
> At this point, I went far when bisecting another problem and I ended up
> screwing my config file. So I had to generate a new one from the default
> config. Since then the issue is no longer happening which sounds very
> strange to me.
>
> I'm still investigating but if you have a suggestion coming in mind, it
> would be welcome because I'm failing to find out what is going on ... :/
I finally reproduced the issue. That happens when there is *no* cooling
device bound on *any* thermal zones.
Your patch seems to fix the problem but I'm not sure to understand the
conditions of the bug
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Trip points crossed not detected when no cooling device bound
2024-06-26 22:24 ` Daniel Lezcano
@ 2024-06-27 9:54 ` Rafael J. Wysocki
2024-06-27 16:30 ` Daniel Lezcano
0 siblings, 1 reply; 10+ messages in thread
From: Rafael J. Wysocki @ 2024-06-27 9:54 UTC (permalink / raw)
To: Daniel Lezcano; +Cc: Rafael J. Wysocki, Linux PM mailing list, Lukasz Luba
On Thu, Jun 27, 2024 at 12:24 AM Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
>
> On 26/06/2024 23:21, Daniel Lezcano wrote:
>
> [ ... ]
>
> >> Oh, I see where the problem can be. If the zone is polling only, it
> >> will not rearm the timer when the current zone temperature is invalid
> >> after the above commit, so does the attached patch help?
> >
> > At this point, I went far when bisecting another problem and I ended up
> > screwing my config file. So I had to generate a new one from the default
> > config. Since then the issue is no longer happening which sounds very
> > strange to me.
> >
> > I'm still investigating but if you have a suggestion coming in mind, it
> > would be welcome because I'm failing to find out what is going on ... :/
>
> I finally reproduced the issue. That happens when there is *no* cooling
> device bound on *any* thermal zones.
Interesting.
> Your patch seems to fix the problem but I'm not sure to understand the
> conditions of the bug
It's probably the same as for commit 202aa0d4bb53:
thermal_zone_device_init() sets tz->temperature to
THERMAL_TEMP_INVALID and if the first invocation of
__thermal_zone_get_temp() returns an error (because the .get_temp()
callback returns an error), monitor_thermal_zone(). If polling is the
only way in which the zone temperature can be updated, things go south
because the timer is not set and there is no other way to set it. No
updates will be coming.
The reason why the presence of cooling devices can "fix" this is
because thermal_bind_cdev_to_trip() sets tz->need_update to 1 which
then causes the thermal_zone_device_update() in
__thermal_cooling_device_register() to trigger and that will update
the temperature.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Trip points crossed not detected when no cooling device bound
2024-06-27 9:54 ` Rafael J. Wysocki
@ 2024-06-27 16:30 ` Daniel Lezcano
2024-06-27 18:23 ` Rafael J. Wysocki
0 siblings, 1 reply; 10+ messages in thread
From: Daniel Lezcano @ 2024-06-27 16:30 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Linux PM mailing list, Lukasz Luba
On 27/06/2024 11:54, Rafael J. Wysocki wrote:
> On Thu, Jun 27, 2024 at 12:24 AM Daniel Lezcano
> <daniel.lezcano@linaro.org> wrote:
>>
>> On 26/06/2024 23:21, Daniel Lezcano wrote:
>>
>> [ ... ]
>>
>>>> Oh, I see where the problem can be. If the zone is polling only, it
>>>> will not rearm the timer when the current zone temperature is invalid
>>>> after the above commit, so does the attached patch help?
>>>
>>> At this point, I went far when bisecting another problem and I ended up
>>> screwing my config file. So I had to generate a new one from the default
>>> config. Since then the issue is no longer happening which sounds very
>>> strange to me.
>>>
>>> I'm still investigating but if you have a suggestion coming in mind, it
>>> would be welcome because I'm failing to find out what is going on ... :/
>>
>> I finally reproduced the issue. That happens when there is *no* cooling
>> device bound on *any* thermal zones.
>
> Interesting.
>
>> Your patch seems to fix the problem but I'm not sure to understand the
>> conditions of the bug
>
> It's probably the same as for commit 202aa0d4bb53:
> thermal_zone_device_init() sets tz->temperature to
> THERMAL_TEMP_INVALID and if the first invocation of
> __thermal_zone_get_temp() returns an error (because the .get_temp()
> callback returns an error), monitor_thermal_zone(). If polling is the
> only way in which the zone temperature can be updated, things go south
> because the timer is not set and there is no other way to set it. No
> updates will be coming
If there is no polling delay (aka interrupt driven), the routine will
skip the _set_trips function and the monitor_thermal_zone() will do
nothing in this case, right ?
Even setting a label jump to "monitor:" the routine is broken AFAICT
> The reason why the presence of cooling devices can "fix" this is
> because thermal_bind_cdev_to_trip() sets tz->need_update to 1 which
> then causes the thermal_zone_device_update() in
> __thermal_cooling_device_register() to trigger and that will update
> the temperature.
IIUC, the first time get_temp() fails and then when the tz is bound, the
update triggers a new call with get_temp() which returns a valid
temperature ?
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Trip points crossed not detected when no cooling device bound
2024-06-27 16:30 ` Daniel Lezcano
@ 2024-06-27 18:23 ` Rafael J. Wysocki
2024-06-28 8:04 ` Daniel Lezcano
0 siblings, 1 reply; 10+ messages in thread
From: Rafael J. Wysocki @ 2024-06-27 18:23 UTC (permalink / raw)
To: Daniel Lezcano; +Cc: Rafael J. Wysocki, Linux PM mailing list, Lukasz Luba
On Thu, Jun 27, 2024 at 6:30 PM Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
>
> On 27/06/2024 11:54, Rafael J. Wysocki wrote:
> > On Thu, Jun 27, 2024 at 12:24 AM Daniel Lezcano
> > <daniel.lezcano@linaro.org> wrote:
> >>
> >> On 26/06/2024 23:21, Daniel Lezcano wrote:
> >>
> >> [ ... ]
> >>
> >>>> Oh, I see where the problem can be. If the zone is polling only, it
> >>>> will not rearm the timer when the current zone temperature is invalid
> >>>> after the above commit, so does the attached patch help?
> >>>
> >>> At this point, I went far when bisecting another problem and I ended up
> >>> screwing my config file. So I had to generate a new one from the default
> >>> config. Since then the issue is no longer happening which sounds very
> >>> strange to me.
> >>>
> >>> I'm still investigating but if you have a suggestion coming in mind, it
> >>> would be welcome because I'm failing to find out what is going on ... :/
> >>
> >> I finally reproduced the issue. That happens when there is *no* cooling
> >> device bound on *any* thermal zones.
> >
> > Interesting.
> >
> >> Your patch seems to fix the problem but I'm not sure to understand the
> >> conditions of the bug
> >
> > It's probably the same as for commit 202aa0d4bb53:
> > thermal_zone_device_init() sets tz->temperature to
> > THERMAL_TEMP_INVALID and if the first invocation of
> > __thermal_zone_get_temp() returns an error (because the .get_temp()
> > callback returns an error), monitor_thermal_zone(). If polling is the
> > only way in which the zone temperature can be updated, things go south
> > because the timer is not set and there is no other way to set it. No
> > updates will be coming
>
> If there is no polling delay (aka interrupt driven), the routine will
> skip the _set_trips function and the monitor_thermal_zone() will do
> nothing in this case, right ?
_set_trips() looks at tz->temperature, however, and it doesn't make
sense to call it if the latter is invalid.
Same for handle_thermal_trip() and governor callbacks.
> Even setting a label jump to "monitor:" the routine is broken AFAICT
I beg to differ.
Yes, monitor_thermal_zone() does nothing if there is no polling, but
it needs to be called anyway because it checks whether or not polling
is there in the first place.
And if there is no polling, it is assumed that
__thermal_zone_device_update() will be called by other means.
> > The reason why the presence of cooling devices can "fix" this is
> > because thermal_bind_cdev_to_trip() sets tz->need_update to 1 which
> > then causes the thermal_zone_device_update() in
> > __thermal_cooling_device_register() to trigger and that will update
> > the temperature.
>
> IIUC, the first time get_temp() fails and then when the tz is bound, the
> update triggers a new call with get_temp() which returns a valid
> temperature ?
Yes.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Trip points crossed not detected when no cooling device bound
2024-06-27 18:23 ` Rafael J. Wysocki
@ 2024-06-28 8:04 ` Daniel Lezcano
2024-06-28 10:49 ` Rafael J. Wysocki
0 siblings, 1 reply; 10+ messages in thread
From: Daniel Lezcano @ 2024-06-28 8:04 UTC (permalink / raw)
To: Rafael J. Wysocki; +Cc: Linux PM mailing list, Lukasz Luba
On 27/06/2024 20:23, Rafael J. Wysocki wrote:
> On Thu, Jun 27, 2024 at 6:30 PM Daniel Lezcano
> <daniel.lezcano@linaro.org> wrote:
>>
>> On 27/06/2024 11:54, Rafael J. Wysocki wrote:
>>> On Thu, Jun 27, 2024 at 12:24 AM Daniel Lezcano
>>> <daniel.lezcano@linaro.org> wrote:
>>>>
>>>> On 26/06/2024 23:21, Daniel Lezcano wrote:
>>>>
>>>> [ ... ]
>>>>
>>>>>> Oh, I see where the problem can be. If the zone is polling only, it
>>>>>> will not rearm the timer when the current zone temperature is invalid
>>>>>> after the above commit, so does the attached patch help?
>>>>>
>>>>> At this point, I went far when bisecting another problem and I ended up
>>>>> screwing my config file. So I had to generate a new one from the default
>>>>> config. Since then the issue is no longer happening which sounds very
>>>>> strange to me.
>>>>>
>>>>> I'm still investigating but if you have a suggestion coming in mind, it
>>>>> would be welcome because I'm failing to find out what is going on ... :/
>>>>
>>>> I finally reproduced the issue. That happens when there is *no* cooling
>>>> device bound on *any* thermal zones.
>>>
>>> Interesting.
>>>
>>>> Your patch seems to fix the problem but I'm not sure to understand the
>>>> conditions of the bug
>>>
>>> It's probably the same as for commit 202aa0d4bb53:
>>> thermal_zone_device_init() sets tz->temperature to
>>> THERMAL_TEMP_INVALID and if the first invocation of
>>> __thermal_zone_get_temp() returns an error (because the .get_temp()
>>> callback returns an error), monitor_thermal_zone(). If polling is the
>>> only way in which the zone temperature can be updated, things go south
>>> because the timer is not set and there is no other way to set it. No
>>> updates will be coming
>>
>> If there is no polling delay (aka interrupt driven), the routine will
>> skip the _set_trips function and the monitor_thermal_zone() will do
>> nothing in this case, right ?
>
> _set_trips() looks at tz->temperature, however, and it doesn't make
> sense to call it if the latter is invalid.
>
> Same for handle_thermal_trip() and governor callbacks.
>
>> Even setting a label jump to "monitor:" the routine is broken AFAICT
>
> I beg to differ.
>
> Yes, monitor_thermal_zone() does nothing if there is no polling, but
> it needs to be called anyway because it checks whether or not polling
> is there in the first place.
>
> And if there is no polling, it is assumed that
> __thermal_zone_device_update() will be called by other means.
AFAICT, the interrupt can fire and it will result in a
thermal_zone_device_update() but the interrupt must be setup by
__set_trips() which is skipped because of the invalid temperature.
I've confirmed that with my evb board.
- trips point
- no polling delay (interrupt based)
- no cooling device
That does not work, there is no trip crossed notification.
>>> The reason why the presence of cooling devices can "fix" this is
>>> because thermal_bind_cdev_to_trip() sets tz->need_update to 1 which
>>> then causes the thermal_zone_device_update() in
>>> __thermal_cooling_device_register() to trigger and that will update
>>> the temperature.
>>
>> IIUC, the first time get_temp() fails and then when the tz is bound, the
>> update triggers a new call with get_temp() which returns a valid
>> temperature ?
Ok, I can see another glitch here. Actually the thermal_of is calling
thermal_of_zone_register() which in turn calls
thermal_zone_device_enable(). However, the driver can be not fully setup
yet (eg. rockchip_thermal.c), so it results in an error in get_temp and
an invalid temperature.
With the aforementioned setup, that leads to a broken thermal platform
because no trip notification will happen.
All this result in a fragile code :/
That is working only because there is a cooling device bound to the
thermal zone which happens after the sensor is fully setup.
IMO, we should be much less resilient to .get_temp failing ...
--
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Trip points crossed not detected when no cooling device bound
2024-06-28 8:04 ` Daniel Lezcano
@ 2024-06-28 10:49 ` Rafael J. Wysocki
2024-06-28 12:26 ` Rafael J. Wysocki
0 siblings, 1 reply; 10+ messages in thread
From: Rafael J. Wysocki @ 2024-06-28 10:49 UTC (permalink / raw)
To: Daniel Lezcano; +Cc: Rafael J. Wysocki, Linux PM mailing list, Lukasz Luba
On Fri, Jun 28, 2024 at 10:04 AM Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
>
> On 27/06/2024 20:23, Rafael J. Wysocki wrote:
> > On Thu, Jun 27, 2024 at 6:30 PM Daniel Lezcano
> > <daniel.lezcano@linaro.org> wrote:
> >>
> >> On 27/06/2024 11:54, Rafael J. Wysocki wrote:
> >>> On Thu, Jun 27, 2024 at 12:24 AM Daniel Lezcano
> >>> <daniel.lezcano@linaro.org> wrote:
> >>>>
> >>>> On 26/06/2024 23:21, Daniel Lezcano wrote:
> >>>>
> >>>> [ ... ]
> >>>>
> >>>>>> Oh, I see where the problem can be. If the zone is polling only, it
> >>>>>> will not rearm the timer when the current zone temperature is invalid
> >>>>>> after the above commit, so does the attached patch help?
> >>>>>
> >>>>> At this point, I went far when bisecting another problem and I ended up
> >>>>> screwing my config file. So I had to generate a new one from the default
> >>>>> config. Since then the issue is no longer happening which sounds very
> >>>>> strange to me.
> >>>>>
> >>>>> I'm still investigating but if you have a suggestion coming in mind, it
> >>>>> would be welcome because I'm failing to find out what is going on ... :/
> >>>>
> >>>> I finally reproduced the issue. That happens when there is *no* cooling
> >>>> device bound on *any* thermal zones.
> >>>
> >>> Interesting.
> >>>
> >>>> Your patch seems to fix the problem but I'm not sure to understand the
> >>>> conditions of the bug
> >>>
> >>> It's probably the same as for commit 202aa0d4bb53:
> >>> thermal_zone_device_init() sets tz->temperature to
> >>> THERMAL_TEMP_INVALID and if the first invocation of
> >>> __thermal_zone_get_temp() returns an error (because the .get_temp()
> >>> callback returns an error), monitor_thermal_zone(). If polling is the
> >>> only way in which the zone temperature can be updated, things go south
> >>> because the timer is not set and there is no other way to set it. No
> >>> updates will be coming
> >>
> >> If there is no polling delay (aka interrupt driven), the routine will
> >> skip the _set_trips function and the monitor_thermal_zone() will do
> >> nothing in this case, right ?
> >
> > _set_trips() looks at tz->temperature, however, and it doesn't make
> > sense to call it if the latter is invalid.
> >
> > Same for handle_thermal_trip() and governor callbacks.
> >
> >> Even setting a label jump to "monitor:" the routine is broken AFAICT
> >
> > I beg to differ.
> >
> > Yes, monitor_thermal_zone() does nothing if there is no polling, but
> > it needs to be called anyway because it checks whether or not polling
> > is there in the first place.
> >
> > And if there is no polling, it is assumed that
> > __thermal_zone_device_update() will be called by other means.
>
> AFAICT, the interrupt can fire and it will result in a
> thermal_zone_device_update() but the interrupt must be setup by
> __set_trips() which is skipped because of the invalid temperature.
Say the zone temperature is invalid and __set_trips() is not skipped.
You'll end up with nonsensical values of "low" and "high". Is this OK?
If you rely on __set_trips() to set up the interrupts for you, it
needs to be given a valid zone temperature and trip thresholds.
Some special handling of invalid tz->temperature may be added to it, though.
> I've confirmed that with my evb board.
>
> - trips point
> - no polling delay (interrupt based)
> - no cooling device
>
> That does not work, there is no trip crossed notification.
Yes, if the first zone temperature read fails and there are no
interrupts coming to trigger next temperature updates, this is what's
going to happen.
I gather that the fix I sent doesn't help in this case?
I'll send a v2 of it with .set_trips() taken into account.
> >>> The reason why the presence of cooling devices can "fix" this is
> >>> because thermal_bind_cdev_to_trip() sets tz->need_update to 1 which
> >>> then causes the thermal_zone_device_update() in
> >>> __thermal_cooling_device_register() to trigger and that will update
> >>> the temperature.
> >>
> >> IIUC, the first time get_temp() fails and then when the tz is bound, the
> >> update triggers a new call with get_temp() which returns a valid
> >> temperature ?
>
> Ok, I can see another glitch here. Actually the thermal_of is calling
> thermal_of_zone_register() which in turn calls
> thermal_zone_device_enable(). However, the driver can be not fully setup
> yet (eg. rockchip_thermal.c), so it results in an error in get_temp and
> an invalid temperature.
>
> With the aforementioned setup, that leads to a broken thermal platform
> because no trip notification will happen.
>
> All this result in a fragile code :/
>
> That is working only because there is a cooling device bound to the
> thermal zone which happens after the sensor is fully setup.
>
> IMO, we should be much less resilient to .get_temp failing ...
Well, there is one more problem with it which is the handling of the
-EAGAIN return value. Drivers may believe that it is special, but it
actually isn't.
For now, I want to restore the behavior that was present before
because some setups may rely on this.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Trip points crossed not detected when no cooling device bound
2024-06-28 10:49 ` Rafael J. Wysocki
@ 2024-06-28 12:26 ` Rafael J. Wysocki
0 siblings, 0 replies; 10+ messages in thread
From: Rafael J. Wysocki @ 2024-06-28 12:26 UTC (permalink / raw)
To: Daniel Lezcano; +Cc: Linux PM mailing list, Lukasz Luba
On Fri, Jun 28, 2024 at 12:49 PM Rafael J. Wysocki <rafael@kernel.org> wrote:
>
> On Fri, Jun 28, 2024 at 10:04 AM Daniel Lezcano
> <daniel.lezcano@linaro.org> wrote:
> >
> > On 27/06/2024 20:23, Rafael J. Wysocki wrote:
> > > On Thu, Jun 27, 2024 at 6:30 PM Daniel Lezcano
> > > <daniel.lezcano@linaro.org> wrote:
> > >>
> > >> On 27/06/2024 11:54, Rafael J. Wysocki wrote:
> > >>> On Thu, Jun 27, 2024 at 12:24 AM Daniel Lezcano
> > >>> <daniel.lezcano@linaro.org> wrote:
> > >>>>
> > >>>> On 26/06/2024 23:21, Daniel Lezcano wrote:
> > >>>>
> > >>>> [ ... ]
> > >>>>
> > >>>>>> Oh, I see where the problem can be. If the zone is polling only, it
> > >>>>>> will not rearm the timer when the current zone temperature is invalid
> > >>>>>> after the above commit, so does the attached patch help?
> > >>>>>
> > >>>>> At this point, I went far when bisecting another problem and I ended up
> > >>>>> screwing my config file. So I had to generate a new one from the default
> > >>>>> config. Since then the issue is no longer happening which sounds very
> > >>>>> strange to me.
> > >>>>>
> > >>>>> I'm still investigating but if you have a suggestion coming in mind, it
> > >>>>> would be welcome because I'm failing to find out what is going on ... :/
> > >>>>
> > >>>> I finally reproduced the issue. That happens when there is *no* cooling
> > >>>> device bound on *any* thermal zones.
> > >>>
> > >>> Interesting.
> > >>>
> > >>>> Your patch seems to fix the problem but I'm not sure to understand the
> > >>>> conditions of the bug
> > >>>
> > >>> It's probably the same as for commit 202aa0d4bb53:
> > >>> thermal_zone_device_init() sets tz->temperature to
> > >>> THERMAL_TEMP_INVALID and if the first invocation of
> > >>> __thermal_zone_get_temp() returns an error (because the .get_temp()
> > >>> callback returns an error), monitor_thermal_zone(). If polling is the
> > >>> only way in which the zone temperature can be updated, things go south
> > >>> because the timer is not set and there is no other way to set it. No
> > >>> updates will be coming
> > >>
> > >> If there is no polling delay (aka interrupt driven), the routine will
> > >> skip the _set_trips function and the monitor_thermal_zone() will do
> > >> nothing in this case, right ?
> > >
> > > _set_trips() looks at tz->temperature, however, and it doesn't make
> > > sense to call it if the latter is invalid.
> > >
> > > Same for handle_thermal_trip() and governor callbacks.
> > >
> > >> Even setting a label jump to "monitor:" the routine is broken AFAICT
> > >
> > > I beg to differ.
> > >
> > > Yes, monitor_thermal_zone() does nothing if there is no polling, but
> > > it needs to be called anyway because it checks whether or not polling
> > > is there in the first place.
> > >
> > > And if there is no polling, it is assumed that
> > > __thermal_zone_device_update() will be called by other means.
> >
> > AFAICT, the interrupt can fire and it will result in a
> > thermal_zone_device_update() but the interrupt must be setup by
> > __set_trips() which is skipped because of the invalid temperature.
>
> Say the zone temperature is invalid and __set_trips() is not skipped.
>
> You'll end up with nonsensical values of "low" and "high". Is this OK?
>
> If you rely on __set_trips() to set up the interrupts for you, it
> needs to be given a valid zone temperature and trip thresholds.
>
> Some special handling of invalid tz->temperature may be added to it, though.
Well, that's rather tricky because it is unclear what to do in this
case. If the zone temperature is unknown, the boundaries to pass to
the .set_trips() callbacks are unknown too.
One possible choice would be to make the interrupt trigger right away,
but that's equivalent to putting thermal_zone_device_update() on a
runqueue (with some extra overhead).
Another possible choice would be to set the lower boundary to the
possible minimum and the upper boundary to the temperature of the
first trip, but in that case tz->temperature may be equal to
THERMAL_TEMP_INVALID for a long time (basically until the first trip
is crossed on the way up) which seems a bit odd to me, so I'd rather
not do this either.
Instead, the new patch simply schedules another temperature check with
a delay if the zone temperature turns out to be invalid.
> > I've confirmed that with my evb board.
> >
> > - trips point
> > - no polling delay (interrupt based)
> > - no cooling device
> >
> > That does not work, there is no trip crossed notification.
>
> Yes, if the first zone temperature read fails and there are no
> interrupts coming to trigger next temperature updates, this is what's
> going to happen.
>
> I gather that the fix I sent doesn't help in this case?
>
> I'll send a v2 of it with .set_trips() taken into account.
Here: https://lore.kernel.org/linux-pm/2764814.mvXUDI8C0e@rjwysocki.net/
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2024-06-28 12:26 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-26 6:50 Trip points crossed not detected when no cooling device bound Daniel Lezcano
2024-06-26 10:38 ` Rafael J. Wysocki
2024-06-26 21:21 ` Daniel Lezcano
2024-06-26 22:24 ` Daniel Lezcano
2024-06-27 9:54 ` Rafael J. Wysocki
2024-06-27 16:30 ` Daniel Lezcano
2024-06-27 18:23 ` Rafael J. Wysocki
2024-06-28 8:04 ` Daniel Lezcano
2024-06-28 10:49 ` Rafael J. Wysocki
2024-06-28 12:26 ` Rafael J. Wysocki
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox