* Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff
[not found] <1450676778-7840-1-git-send-email-j-keerthy@ti.com>
@ 2015-12-28 17:41 ` Nishanth Menon
2015-12-29 9:16 ` Keerthy
0 siblings, 1 reply; 6+ messages in thread
From: Nishanth Menon @ 2015-12-28 17:41 UTC (permalink / raw)
To: Keerthy, rui.zhang, edubezval; +Cc: linux-omap, linux-pm@vger.kernel.org
On 12/20/2015 11:46 PM, Keerthy wrote:
+linux-pm.
> In few rare conditions like during boot up the orderly_poweroff
> function might not be able to complete fully leaving the device
> running at dangerously high temperatures. Hence adding a backup
> workqueue to act after a known period of time and poweroff the device.
>
> Suggested-by: Nishanth Menon <nm@ti.com>
> Reported-by: Nishanth Menon <nm@ti.com>
The specific case I hit was a critical temperature event as soon as
the hwmon device was probed (the driver tmp102 was a kernel module).
> Signed-off-by: Keerthy <j-keerthy@ti.com>
> ---
> drivers/thermal/Kconfig | 9 +++++++++
> drivers/thermal/thermal_core.c | 26 ++++++++++++++++++++++++++
> 2 files changed, 35 insertions(+)
>
> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index 8cc4ac6..25584ee 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -92,6 +92,15 @@ config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR
>
> endchoice
>
> +config THERMAL_BKUP_SHUTDOWN_DELAY_MS
> + int "Thermal shutdown backup delay in milli-seconds"
> + depends on THERMAL
> + default "5000"
> + ---help---
> + The number of milliseconds to delay after orderly_poweroff call.
Probably needs a rephrase.
> +
> + Default: 5000 (5 seconds)
> +
> config THERMAL_GOV_FAIR_SHARE
> bool "Fair-share thermal governor"
> help
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index d9e525c..b793857 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -61,6 +61,12 @@ static DEFINE_MUTEX(thermal_governor_lock);
>
> static struct thermal_governor *def_governor;
>
> +#ifdef CONFIG_THERMAL_BKUP_SHUTDOWN_DELAY_MS
> +#define BKUP_SHUTDOWN_DELAY CONFIG_THERMAL_BKUP_SHUTDOWN_DELAY_MS
> +#else
> +#define BKUP_SHUTDOWN_DELAY 5000
> +#endif
> +
I am not sure if this #ifdeffery is even needed.
Eduardo, Rui: If this is not the suggested technique, maybe you guys
could suggest how we could handle a case where userspace might be
hungup due to some reason and a case where a critical temperature
event in the middle of device probe was triggered?
Obviously, we'd like to take into consideration userspace latencies as
well- but that is very specific to fs being run.. not really a simple
problem, IMHO..
> static struct thermal_governor *__find_governor(const char *name)
> {
> struct thermal_governor *pos;
> @@ -423,6 +429,18 @@ static void handle_non_critical_trips(struct thermal_zone_device *tz,
> def_governor->throttle(tz, trip);
> }
>
> +static void bkup_shutdown_func(struct work_struct *unused);
> +static DECLARE_DELAYED_WORK(bkup_shutdown_work, bkup_shutdown_func);
> +
> +static void bkup_shutdown_func(struct work_struct *work)
> +{
> + pr_warn("Orderly_poweroff has failed! Attempting kernel_power_off\n");
> + kernel_power_off();
> +
> + pr_warn("kernel_power_off has failed! Attempting emergency_restart\n");
> + emergency_restart();
I think documentation is necessary that we are hoping for bootloader
to be able to detect and halt as needed -> risk here obviously is an
infinite reboot loop :( .
> +}
> +
> static void handle_critical_trips(struct thermal_zone_device *tz,
> int trip, enum thermal_trip_type trip_type)
> {
> @@ -443,6 +461,14 @@ static void handle_critical_trips(struct thermal_zone_device *tz,
> dev_emerg(&tz->device,
> "critical temperature reached(%d C),shutting down\n",
> tz->temperature / 1000);
> + /**
needs to be kernel doc style?
> + * This is a backup option to shutdown the
> + * system in case orderly_poweroff
> + * fails
Maybe adjust to 80char?
> + */
> + schedule_delayed_work(&bkup_shutdown_work,
> + msecs_to_jiffies(BKUP_SHUTDOWN_DELAY));
> +
> orderly_poweroff(true);
> }
> }
>
--
Regards,
Nishanth Menon
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff
2015-12-28 17:41 ` [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff Nishanth Menon
@ 2015-12-29 9:16 ` Keerthy
2015-12-31 17:29 ` Eduardo Valentin
0 siblings, 1 reply; 6+ messages in thread
From: Keerthy @ 2015-12-29 9:16 UTC (permalink / raw)
To: Nishanth Menon, Keerthy, rui.zhang, edubezval
Cc: linux-omap, linux-pm@vger.kernel.org
Hi Nishanth,
On Monday 28 December 2015 11:11 PM, Nishanth Menon wrote:
> On 12/20/2015 11:46 PM, Keerthy wrote:
>
> +linux-pm.
Thanks for looping!
>
>> In few rare conditions like during boot up the orderly_poweroff
>> function might not be able to complete fully leaving the device
>> running at dangerously high temperatures. Hence adding a backup
>> workqueue to act after a known period of time and poweroff the device.
>>
>
>
>> Suggested-by: Nishanth Menon <nm@ti.com>
>> Reported-by: Nishanth Menon <nm@ti.com>
>
> The specific case I hit was a critical temperature event as soon as
> the hwmon device was probed (the driver tmp102 was a kernel module).
>
>> Signed-off-by: Keerthy <j-keerthy@ti.com>
>> ---
>> drivers/thermal/Kconfig | 9 +++++++++
>> drivers/thermal/thermal_core.c | 26 ++++++++++++++++++++++++++
>> 2 files changed, 35 insertions(+)
>>
>> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
>> index 8cc4ac6..25584ee 100644
>> --- a/drivers/thermal/Kconfig
>> +++ b/drivers/thermal/Kconfig
>> @@ -92,6 +92,15 @@ config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR
>>
>> endchoice
>>
>> +config THERMAL_BKUP_SHUTDOWN_DELAY_MS
>> + int "Thermal shutdown backup delay in milli-seconds"
>> + depends on THERMAL
>> + default "5000"
>> + ---help---
>> + The number of milliseconds to delay after orderly_poweroff call.
>
> Probably needs a rephrase.
Delay in milliseconds before the backup thermal shutdown is triggered.
>
>> +
>> + Default: 5000 (5 seconds)
>> +
>> config THERMAL_GOV_FAIR_SHARE
>> bool "Fair-share thermal governor"
>> help
>> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
>> index d9e525c..b793857 100644
>> --- a/drivers/thermal/thermal_core.c
>> +++ b/drivers/thermal/thermal_core.c
>> @@ -61,6 +61,12 @@ static DEFINE_MUTEX(thermal_governor_lock);
>>
>> static struct thermal_governor *def_governor;
>>
>> +#ifdef CONFIG_THERMAL_BKUP_SHUTDOWN_DELAY_MS
>> +#define BKUP_SHUTDOWN_DELAY CONFIG_THERMAL_BKUP_SHUTDOWN_DELAY_MS
>> +#else
>> +#define BKUP_SHUTDOWN_DELAY 5000
>> +#endif
>> +
>
> I am not sure if this #ifdeffery is even needed.
>
>
> Eduardo, Rui: If this is not the suggested technique, maybe you guys
> could suggest how we could handle a case where userspace might be
> hungup due to some reason and a case where a critical temperature
> event in the middle of device probe was triggered?
>
> Obviously, we'd like to take into consideration userspace latencies as
> well- but that is very specific to fs being run.. not really a simple
> problem, IMHO..
>
>> static struct thermal_governor *__find_governor(const char *name)
>> {
>> struct thermal_governor *pos;
>> @@ -423,6 +429,18 @@ static void handle_non_critical_trips(struct thermal_zone_device *tz,
>> def_governor->throttle(tz, trip);
>> }
>>
>> +static void bkup_shutdown_func(struct work_struct *unused);
>> +static DECLARE_DELAYED_WORK(bkup_shutdown_work, bkup_shutdown_func);
>> +
>> +static void bkup_shutdown_func(struct work_struct *work)
>> +{
>> + pr_warn("Orderly_poweroff has failed! Attempting kernel_power_off\n");
>> + kernel_power_off();
>> +
>> + pr_warn("kernel_power_off has failed! Attempting emergency_restart\n");
>> + emergency_restart();
>
> I think documentation is necessary that we are hoping for bootloader
> to be able to detect and halt as needed -> risk here obviously is an
> infinite reboot loop :( .
Agreed.
>
>
>> +}
>> +
>> static void handle_critical_trips(struct thermal_zone_device *tz,
>> int trip, enum thermal_trip_type trip_type)
>> {
>> @@ -443,6 +461,14 @@ static void handle_critical_trips(struct thermal_zone_device *tz,
>> dev_emerg(&tz->device,
>> "critical temperature reached(%d C),shutting down\n",
>> tz->temperature / 1000);
>> + /**
>
> needs to be kernel doc style?
>
>> + * This is a backup option to shutdown the
>> + * system in case orderly_poweroff
>> + * fails
> Maybe adjust to 80char?
Okay.
>
>> + */
>> + schedule_delayed_work(&bkup_shutdown_work,
>> + msecs_to_jiffies(BKUP_SHUTDOWN_DELAY));
>> +
>> orderly_poweroff(true);
>> }
>> }
>>
>
>
I will wait for Eduardo/Rui's inputs before posting the next version.
Best Regards,
Keerthy
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff
2015-12-29 9:16 ` Keerthy
@ 2015-12-31 17:29 ` Eduardo Valentin
2015-12-31 17:47 ` Nishanth Menon
0 siblings, 1 reply; 6+ messages in thread
From: Eduardo Valentin @ 2015-12-31 17:29 UTC (permalink / raw)
To: Keerthy
Cc: Nishanth Menon, Keerthy, rui.zhang, linux-omap,
linux-pm@vger.kernel.org
can we have a shorter title?
On Tue, Dec 29, 2015 at 02:46:49PM +0530, Keerthy wrote:
> Hi Nishanth,
>
<cut>
> >
> >I am not sure if this #ifdeffery is even needed.
> >
> >
> >Eduardo, Rui: If this is not the suggested technique, maybe you guys
> >could suggest how we could handle a case where userspace might be
> >hungup due to some reason and a case where a critical temperature
> >event in the middle of device probe was triggered?
Orderly power off is supposed to take care of this. Looking at the code,
it will force a shutdown in case execution of userland command fails:
static int __orderly_poweroff(bool force)
{
int ret;
ret = run_cmd(poweroff_cmd);
if (ret && force) {
pr_warn("Failed to start orderly shutdown: forcing the issue\n");
/*
* I guess this should try to kick off some daemon to sync and
* poweroff asap. Or not even bother syncing if we're doing an
* emergency shutdown?
*/
emergency_sync();
kernel_power_off();
}
> >
> >Obviously, we'd like to take into consideration userspace latencies as
> >well- but that is very specific to fs being run.. not really a simple
> >problem, IMHO..
> >
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff
2015-12-31 17:29 ` Eduardo Valentin
@ 2015-12-31 17:47 ` Nishanth Menon
2015-12-31 18:20 ` Eduardo Valentin
0 siblings, 1 reply; 6+ messages in thread
From: Nishanth Menon @ 2015-12-31 17:47 UTC (permalink / raw)
To: Eduardo Valentin, Keerthy
Cc: Keerthy, rui.zhang, linux-omap, linux-pm@vger.kernel.org
On 12/31/2015 11:29 AM, Eduardo Valentin wrote:
> can we have a shorter title?
>
> On Tue, Dec 29, 2015 at 02:46:49PM +0530, Keerthy wrote:
>> Hi Nishanth,
>>
>
> <cut>
>>>
>>> I am not sure if this #ifdeffery is even needed.
>>>
>>>
>>> Eduardo, Rui: If this is not the suggested technique, maybe you guys
>>> could suggest how we could handle a case where userspace might be
>>> hungup due to some reason and a case where a critical temperature
>>> event in the middle of device probe was triggered?
>
> Orderly power off is supposed to take care of this. Looking at the code,
> it will force a shutdown in case execution of userland command fails:
>
> static int __orderly_poweroff(bool force)
> {
> int ret;
>
> ret = run_cmd(poweroff_cmd);
>
> if (ret && force) {
> pr_warn("Failed to start orderly shutdown: forcing the issue\n");
>
> /*
> * I guess this should try to kick off some daemon to sync and
> * poweroff asap. Or not even bother syncing if we're doing an
> * emergency shutdown?
> */
> emergency_sync();
> kernel_power_off();
> }
Yes, it will *IF* userspace fails. the condition that I had tracked
was before identifying the following fix[1] - Example fail is here[2]
In this case, tmp102 is setup for X15 as [3] - and built as a module.
as the kernel startsup filesystem and starts a modprobe of all modules
via udev rules, the probe of tmp102 detects (falsely) a critical
temperature condition. Shutdown attempt in the middle of driver probe
is always a tricky business.
As we look at the log in [2], Line 472
> thermal thermal_zone3: critical temperature reached(108 C),shutting down
We have userspace trigger for shutdown taking place.
Line 495: INIT: Sending processes the TERM signal
userspace starts shutting down services. (but note that probe for
other devices were either in progress or queued up to complete)..
at line 647 - we are in a weird place -> sysrq shows that system is
idled and userspace is shutdown and system is still active.
In this case, we entered the case thanks to a driver bug, but if this
situation was a real world temperature scenario, then we'd probably in
an overtemp scenario, then device damage could take place OR something
much worse.
The only alternative is to run a parallel thread in case userspace
fails to complete the job in some given period of time - due to what
ever be the condition triggering the problem.
I hope this explains the problem.
[1]
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=00917b5c55aeb01322d5ab51af8c025b82959224
[2] http://pastebin.ubuntu.com/14326688/
[3]
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/am57xx-beagle-x15.dts#n738
--
Regards,
Nishanth Menon
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff
2015-12-31 17:47 ` Nishanth Menon
@ 2015-12-31 18:20 ` Eduardo Valentin
2015-12-31 18:29 ` Nishanth Menon
0 siblings, 1 reply; 6+ messages in thread
From: Eduardo Valentin @ 2015-12-31 18:20 UTC (permalink / raw)
To: Nishanth Menon
Cc: Keerthy, Keerthy, rui.zhang, linux-omap, linux-pm@vger.kernel.org
On Thu, Dec 31, 2015 at 11:47:57AM -0600, Nishanth Menon wrote:
> On 12/31/2015 11:29 AM, Eduardo Valentin wrote:
> > can we have a shorter title?
> >
> >
> > Orderly power off is supposed to take care of this. Looking at the code,
> > it will force a shutdown in case execution of userland command fails:
> >
> > static int __orderly_poweroff(bool force)
> > {
> > int ret;
> >
> > ret = run_cmd(poweroff_cmd);
> >
> > if (ret && force) {
> > pr_warn("Failed to start orderly shutdown: forcing the issue\n");
> >
> > /*
> > * I guess this should try to kick off some daemon to sync and
> > * poweroff asap. Or not even bother syncing if we're doing an
> > * emergency shutdown?
> > */
> > emergency_sync();
> > kernel_power_off();
> > }
>
> Yes, it will *IF* userspace fails. the condition that I had tracked
> was before identifying the following fix[1] - Example fail is here[2]
>
OK. But still, why other users of orderly_poweroff do not
deserve to be fixed, then?
>
> I hope this explains the problem.
>
> [1]
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=00917b5c55aeb01322d5ab51af8c025b82959224
> [2] http://pastebin.ubuntu.com/14326688/
>
> [3]
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/am57xx-beagle-x15.dts#n738
>
> --
> Regards,
> Nishanth Menon
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff
2015-12-31 18:20 ` Eduardo Valentin
@ 2015-12-31 18:29 ` Nishanth Menon
0 siblings, 0 replies; 6+ messages in thread
From: Nishanth Menon @ 2015-12-31 18:29 UTC (permalink / raw)
To: Eduardo Valentin
Cc: Keerthy, Keerthy, rui.zhang, linux-omap, linux-pm@vger.kernel.org
On 12/31/2015 12:20 PM, Eduardo Valentin wrote:
> On Thu, Dec 31, 2015 at 11:47:57AM -0600, Nishanth Menon wrote:
>> On 12/31/2015 11:29 AM, Eduardo Valentin wrote:
>>> can we have a shorter title?
>>>
>>>
>>> Orderly power off is supposed to take care of this. Looking at the code,
>>> it will force a shutdown in case execution of userland command fails:
>>>
>>> static int __orderly_poweroff(bool force)
>>> {
>>> int ret;
>>>
>>> ret = run_cmd(poweroff_cmd);
>>>
>>> if (ret && force) {
>>> pr_warn("Failed to start orderly shutdown: forcing the issue\n");
>>>
>>> /*
>>> * I guess this should try to kick off some daemon to sync and
>>> * poweroff asap. Or not even bother syncing if we're doing an
>>> * emergency shutdown?
>>> */
>>> emergency_sync();
>>> kernel_power_off();
>>> }
>>
>> Yes, it will *IF* userspace fails. the condition that I had tracked
>> was before identifying the following fix[1] - Example fail is here[2]
>>
>
> OK. But still, why other users of orderly_poweroff do not
> deserve to be fixed, then?
>
I'd agree as well.. I guess the comment from Robin Holt <holt@sgi.com>
anticipated something like this will eventually occur.
"* I guess this should try to kick off some daemon to sync and
* poweroff asap. Or not even bother syncing if we're doing an
* emergency shutdown?
"
Keerthy - would you spin this as a generic fix?
>>
>> I hope this explains the problem.
>>
>> [1]
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=00917b5c55aeb01322d5ab51af8c025b82959224
>> [2] http://pastebin.ubuntu.com/14326688/
>>
>> [3]
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/am57xx-beagle-x15.dts#n738
>>
>> --
>> Regards,
>> Nishanth Menon
--
Regards,
Nishanth Menon
^ permalink raw reply [flat|nested] 6+ messages in thread
end of thread, other threads:[~2015-12-31 18:30 UTC | newest]
Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <1450676778-7840-1-git-send-email-j-keerthy@ti.com>
2015-12-28 17:41 ` [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff Nishanth Menon
2015-12-29 9:16 ` Keerthy
2015-12-31 17:29 ` Eduardo Valentin
2015-12-31 17:47 ` Nishanth Menon
2015-12-31 18:20 ` Eduardo Valentin
2015-12-31 18:29 ` Nishanth Menon
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).