Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff

linux-pm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff
       [not found] <1450676778-7840-1-git-send-email-j-keerthy@ti.com>
@ 2015-12-28 17:41 ` Nishanth Menon
  2015-12-29  9:16   ` Keerthy
  0 siblings, 1 reply; 6+ messages in thread
From: Nishanth Menon @ 2015-12-28 17:41 UTC (permalink / raw)
  To: Keerthy, rui.zhang, edubezval; +Cc: linux-omap, linux-pm@vger.kernel.org

On 12/20/2015 11:46 PM, Keerthy wrote:

+linux-pm.

> In few rare conditions like during boot up the orderly_poweroff
> function might not be able to complete fully leaving the device
> running at dangerously high temperatures. Hence adding a backup
> workqueue to act after a known period of time and poweroff the device.
> 


> Suggested-by: Nishanth Menon <nm@ti.com>
> Reported-by: Nishanth Menon <nm@ti.com>

The specific case I hit was a critical temperature event as soon as
the hwmon device was probed (the driver tmp102 was a kernel module).

> Signed-off-by: Keerthy <j-keerthy@ti.com>
> ---
>  drivers/thermal/Kconfig        |  9 +++++++++
>  drivers/thermal/thermal_core.c | 26 ++++++++++++++++++++++++++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index 8cc4ac6..25584ee 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -92,6 +92,15 @@ config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR
>  
>  endchoice
>  
> +config THERMAL_BKUP_SHUTDOWN_DELAY_MS
> +        int "Thermal shutdown  backup delay in milli-seconds"
> +        depends on THERMAL
> +        default "5000"
> +        ---help---
> +        The number of milliseconds to delay after orderly_poweroff call.

Probably needs a rephrase.

> +
> +        Default: 5000 (5 seconds)
> +
>  config THERMAL_GOV_FAIR_SHARE
>  	bool "Fair-share thermal governor"
>  	help
> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
> index d9e525c..b793857 100644
> --- a/drivers/thermal/thermal_core.c
> +++ b/drivers/thermal/thermal_core.c
> @@ -61,6 +61,12 @@ static DEFINE_MUTEX(thermal_governor_lock);
>  
>  static struct thermal_governor *def_governor;
>  
> +#ifdef CONFIG_THERMAL_BKUP_SHUTDOWN_DELAY_MS
> +#define BKUP_SHUTDOWN_DELAY CONFIG_THERMAL_BKUP_SHUTDOWN_DELAY_MS
> +#else
> +#define BKUP_SHUTDOWN_DELAY 5000
> +#endif
> +

I am not sure if this #ifdeffery is even needed.


Eduardo, Rui: If this is not the suggested technique, maybe you guys
could suggest how we could handle a case where userspace might be
hungup due to some reason and a case where a critical temperature
event in the middle of device probe was triggered?

Obviously, we'd like to take into consideration userspace latencies as
well- but that is very specific to fs being run.. not really a simple
problem, IMHO..

>  static struct thermal_governor *__find_governor(const char *name)
>  {
>  	struct thermal_governor *pos;
> @@ -423,6 +429,18 @@ static void handle_non_critical_trips(struct thermal_zone_device *tz,
>  		       def_governor->throttle(tz, trip);
>  }
>  
> +static void bkup_shutdown_func(struct work_struct *unused);
> +static DECLARE_DELAYED_WORK(bkup_shutdown_work, bkup_shutdown_func);
> +
> +static void bkup_shutdown_func(struct work_struct *work)
> +{
> +	pr_warn("Orderly_poweroff has failed! Attempting kernel_power_off\n");
> +	kernel_power_off();
> +
> +	pr_warn("kernel_power_off has failed! Attempting emergency_restart\n");
> +	emergency_restart();

I think documentation is necessary that we are hoping for bootloader
to be able to detect and halt as needed -> risk here obviously is an
infinite reboot loop :( .


> +}
> +
>  static void handle_critical_trips(struct thermal_zone_device *tz,
>  				int trip, enum thermal_trip_type trip_type)
>  {
> @@ -443,6 +461,14 @@ static void handle_critical_trips(struct thermal_zone_device *tz,
>  		dev_emerg(&tz->device,
>  			  "critical temperature reached(%d C),shutting down\n",
>  			  tz->temperature / 1000);
> +		/**

needs to be kernel doc style?

> +		 * This is a backup option to shutdown the
> +		 * system in case orderly_poweroff
> +		 * fails
Maybe adjust to 80char?

> +		 */
> +		schedule_delayed_work(&bkup_shutdown_work,
> +				      msecs_to_jiffies(BKUP_SHUTDOWN_DELAY));
> +
>  		orderly_poweroff(true);
>  	}
>  }
> 


-- 
Regards,
Nishanth Menon

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff
  2015-12-28 17:41 ` [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff Nishanth Menon
@ 2015-12-29  9:16   ` Keerthy
  2015-12-31 17:29     ` Eduardo Valentin
  0 siblings, 1 reply; 6+ messages in thread
From: Keerthy @ 2015-12-29  9:16 UTC (permalink / raw)
  To: Nishanth Menon, Keerthy, rui.zhang, edubezval
  Cc: linux-omap, linux-pm@vger.kernel.org

Hi Nishanth,

On Monday 28 December 2015 11:11 PM, Nishanth Menon wrote:
> On 12/20/2015 11:46 PM, Keerthy wrote:
>
> +linux-pm.

Thanks for looping!

>
>> In few rare conditions like during boot up the orderly_poweroff
>> function might not be able to complete fully leaving the device
>> running at dangerously high temperatures. Hence adding a backup
>> workqueue to act after a known period of time and poweroff the device.
>>
>
>
>> Suggested-by: Nishanth Menon <nm@ti.com>
>> Reported-by: Nishanth Menon <nm@ti.com>
>
> The specific case I hit was a critical temperature event as soon as
> the hwmon device was probed (the driver tmp102 was a kernel module).
>
>> Signed-off-by: Keerthy <j-keerthy@ti.com>
>> ---
>>   drivers/thermal/Kconfig        |  9 +++++++++
>>   drivers/thermal/thermal_core.c | 26 ++++++++++++++++++++++++++
>>   2 files changed, 35 insertions(+)
>>
>> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
>> index 8cc4ac6..25584ee 100644
>> --- a/drivers/thermal/Kconfig
>> +++ b/drivers/thermal/Kconfig
>> @@ -92,6 +92,15 @@ config THERMAL_DEFAULT_GOV_POWER_ALLOCATOR
>>
>>   endchoice
>>
>> +config THERMAL_BKUP_SHUTDOWN_DELAY_MS
>> +        int "Thermal shutdown  backup delay in milli-seconds"
>> +        depends on THERMAL
>> +        default "5000"
>> +        ---help---
>> +        The number of milliseconds to delay after orderly_poweroff call.
>
> Probably needs a rephrase.

Delay in milliseconds before the backup thermal shutdown is triggered.

>
>> +
>> +        Default: 5000 (5 seconds)
>> +
>>   config THERMAL_GOV_FAIR_SHARE
>>   	bool "Fair-share thermal governor"
>>   	help
>> diff --git a/drivers/thermal/thermal_core.c b/drivers/thermal/thermal_core.c
>> index d9e525c..b793857 100644
>> --- a/drivers/thermal/thermal_core.c
>> +++ b/drivers/thermal/thermal_core.c
>> @@ -61,6 +61,12 @@ static DEFINE_MUTEX(thermal_governor_lock);
>>
>>   static struct thermal_governor *def_governor;
>>
>> +#ifdef CONFIG_THERMAL_BKUP_SHUTDOWN_DELAY_MS
>> +#define BKUP_SHUTDOWN_DELAY CONFIG_THERMAL_BKUP_SHUTDOWN_DELAY_MS
>> +#else
>> +#define BKUP_SHUTDOWN_DELAY 5000
>> +#endif
>> +
>
> I am not sure if this #ifdeffery is even needed.
>
>
> Eduardo, Rui: If this is not the suggested technique, maybe you guys
> could suggest how we could handle a case where userspace might be
> hungup due to some reason and a case where a critical temperature
> event in the middle of device probe was triggered?
>
> Obviously, we'd like to take into consideration userspace latencies as
> well- but that is very specific to fs being run.. not really a simple
> problem, IMHO..
>
>>   static struct thermal_governor *__find_governor(const char *name)
>>   {
>>   	struct thermal_governor *pos;
>> @@ -423,6 +429,18 @@ static void handle_non_critical_trips(struct thermal_zone_device *tz,
>>   		       def_governor->throttle(tz, trip);
>>   }
>>
>> +static void bkup_shutdown_func(struct work_struct *unused);
>> +static DECLARE_DELAYED_WORK(bkup_shutdown_work, bkup_shutdown_func);
>> +
>> +static void bkup_shutdown_func(struct work_struct *work)
>> +{
>> +	pr_warn("Orderly_poweroff has failed! Attempting kernel_power_off\n");
>> +	kernel_power_off();
>> +
>> +	pr_warn("kernel_power_off has failed! Attempting emergency_restart\n");
>> +	emergency_restart();
>
> I think documentation is necessary that we are hoping for bootloader
> to be able to detect and halt as needed -> risk here obviously is an
> infinite reboot loop :( .

Agreed.

>
>
>> +}
>> +
>>   static void handle_critical_trips(struct thermal_zone_device *tz,
>>   				int trip, enum thermal_trip_type trip_type)
>>   {
>> @@ -443,6 +461,14 @@ static void handle_critical_trips(struct thermal_zone_device *tz,
>>   		dev_emerg(&tz->device,
>>   			  "critical temperature reached(%d C),shutting down\n",
>>   			  tz->temperature / 1000);
>> +		/**
>
> needs to be kernel doc style?
>
>> +		 * This is a backup option to shutdown the
>> +		 * system in case orderly_poweroff
>> +		 * fails
> Maybe adjust to 80char?

Okay.

>
>> +		 */
>> +		schedule_delayed_work(&bkup_shutdown_work,
>> +				      msecs_to_jiffies(BKUP_SHUTDOWN_DELAY));
>> +
>>   		orderly_poweroff(true);
>>   	}
>>   }
>>
>
>

I will wait for Eduardo/Rui's inputs before posting the next version.

Best Regards,
Keerthy

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff
  2015-12-29  9:16   ` Keerthy
@ 2015-12-31 17:29     ` Eduardo Valentin
  2015-12-31 17:47       ` Nishanth Menon
  0 siblings, 1 reply; 6+ messages in thread
From: Eduardo Valentin @ 2015-12-31 17:29 UTC (permalink / raw)
  To: Keerthy
  Cc: Nishanth Menon, Keerthy, rui.zhang, linux-omap,
	linux-pm@vger.kernel.org

can we have a shorter title?

On Tue, Dec 29, 2015 at 02:46:49PM +0530, Keerthy wrote:
> Hi Nishanth,
> 

<cut> 
> >
> >I am not sure if this #ifdeffery is even needed.
> >
> >
> >Eduardo, Rui: If this is not the suggested technique, maybe you guys
> >could suggest how we could handle a case where userspace might be
> >hungup due to some reason and a case where a critical temperature
> >event in the middle of device probe was triggered?

Orderly power off is supposed to take care of this. Looking at the code,
it will force a shutdown in case execution of userland command fails:

static int __orderly_poweroff(bool force)
{
        int ret;

        ret = run_cmd(poweroff_cmd);

        if (ret && force) {
                pr_warn("Failed to start orderly shutdown: forcing the issue\n");

                /*
                 * I guess this should try to kick off some daemon to sync and
                 * poweroff asap.  Or not even bother syncing if we're doing an
                 * emergency shutdown?
                 */
                emergency_sync();
                kernel_power_off();
        }

> >
> >Obviously, we'd like to take into consideration userspace latencies as
> >well- but that is very specific to fs being run.. not really a simple
> >problem, IMHO..
> >

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff
  2015-12-31 17:29     ` Eduardo Valentin
@ 2015-12-31 17:47       ` Nishanth Menon
  2015-12-31 18:20         ` Eduardo Valentin
  0 siblings, 1 reply; 6+ messages in thread
From: Nishanth Menon @ 2015-12-31 17:47 UTC (permalink / raw)
  To: Eduardo Valentin, Keerthy
  Cc: Keerthy, rui.zhang, linux-omap, linux-pm@vger.kernel.org

On 12/31/2015 11:29 AM, Eduardo Valentin wrote:
> can we have a shorter title?
> 
> On Tue, Dec 29, 2015 at 02:46:49PM +0530, Keerthy wrote:
>> Hi Nishanth,
>>
> 
> <cut> 
>>>
>>> I am not sure if this #ifdeffery is even needed.
>>>
>>>
>>> Eduardo, Rui: If this is not the suggested technique, maybe you guys
>>> could suggest how we could handle a case where userspace might be
>>> hungup due to some reason and a case where a critical temperature
>>> event in the middle of device probe was triggered?
> 
> Orderly power off is supposed to take care of this. Looking at the code,
> it will force a shutdown in case execution of userland command fails:
> 
> static int __orderly_poweroff(bool force)
> {
>         int ret;
> 
>         ret = run_cmd(poweroff_cmd);
> 
>         if (ret && force) {
>                 pr_warn("Failed to start orderly shutdown: forcing the issue\n");
> 
>                 /*
>                  * I guess this should try to kick off some daemon to sync and
>                  * poweroff asap.  Or not even bother syncing if we're doing an
>                  * emergency shutdown?
>                  */
>                 emergency_sync();
>                 kernel_power_off();
>         }

Yes, it will *IF* userspace fails. the condition that I had tracked
was before identifying the following fix[1] - Example fail is here[2]

In this case, tmp102 is setup for X15 as [3] - and built as a module.
as the kernel startsup filesystem and starts a modprobe of all modules
via udev rules, the probe of tmp102 detects (falsely) a critical
temperature condition. Shutdown attempt in the middle of driver probe
is always a tricky business.

As we look at the log in [2], Line  472
> thermal thermal_zone3: critical temperature reached(108 C),shutting down
We have userspace trigger for shutdown taking place.

Line 495: INIT: Sending processes the TERM signal

userspace starts shutting down services. (but note that probe for
other devices were either in progress or queued up to complete)..

at line 647 - we are in a weird place -> sysrq shows that system is
idled and userspace is shutdown and system is still active.

In this case, we entered the case thanks to a driver bug, but if this
situation was a real world temperature scenario, then we'd probably in
an overtemp scenario, then device damage could take place OR something
much worse.

The only alternative is to run a parallel thread in case userspace
fails to complete the job in some given period of time - due to what
ever be the condition triggering the problem.

I hope this explains the problem.

[1]
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=00917b5c55aeb01322d5ab51af8c025b82959224
[2] http://pastebin.ubuntu.com/14326688/

[3]
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/am57xx-beagle-x15.dts#n738

-- 
Regards,
Nishanth Menon

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff
  2015-12-31 17:47       ` Nishanth Menon
@ 2015-12-31 18:20         ` Eduardo Valentin
  2015-12-31 18:29           ` Nishanth Menon
  0 siblings, 1 reply; 6+ messages in thread
From: Eduardo Valentin @ 2015-12-31 18:20 UTC (permalink / raw)
  To: Nishanth Menon
  Cc: Keerthy, Keerthy, rui.zhang, linux-omap, linux-pm@vger.kernel.org

On Thu, Dec 31, 2015 at 11:47:57AM -0600, Nishanth Menon wrote:
> On 12/31/2015 11:29 AM, Eduardo Valentin wrote:
> > can we have a shorter title?
> > 
> > 
> > Orderly power off is supposed to take care of this. Looking at the code,
> > it will force a shutdown in case execution of userland command fails:
> > 
> > static int __orderly_poweroff(bool force)
> > {
> >         int ret;
> > 
> >         ret = run_cmd(poweroff_cmd);
> > 
> >         if (ret && force) {
> >                 pr_warn("Failed to start orderly shutdown: forcing the issue\n");
> > 
> >                 /*
> >                  * I guess this should try to kick off some daemon to sync and
> >                  * poweroff asap.  Or not even bother syncing if we're doing an
> >                  * emergency shutdown?
> >                  */
> >                 emergency_sync();
> >                 kernel_power_off();
> >         }
> 
> Yes, it will *IF* userspace fails. the condition that I had tracked
> was before identifying the following fix[1] - Example fail is here[2]
> 

OK. But still, why other users of orderly_poweroff do not
deserve to be fixed, then?

> 
> I hope this explains the problem.
> 
> [1]
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=00917b5c55aeb01322d5ab51af8c025b82959224
> [2] http://pastebin.ubuntu.com/14326688/
> 
> [3]
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/am57xx-beagle-x15.dts#n738
> 
> -- 
> Regards,
> Nishanth Menon

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff
  2015-12-31 18:20         ` Eduardo Valentin
@ 2015-12-31 18:29           ` Nishanth Menon
  0 siblings, 0 replies; 6+ messages in thread
From: Nishanth Menon @ 2015-12-31 18:29 UTC (permalink / raw)
  To: Eduardo Valentin
  Cc: Keerthy, Keerthy, rui.zhang, linux-omap, linux-pm@vger.kernel.org

On 12/31/2015 12:20 PM, Eduardo Valentin wrote:
> On Thu, Dec 31, 2015 at 11:47:57AM -0600, Nishanth Menon wrote:
>> On 12/31/2015 11:29 AM, Eduardo Valentin wrote:
>>> can we have a shorter title?
>>>
>>>
>>> Orderly power off is supposed to take care of this. Looking at the code,
>>> it will force a shutdown in case execution of userland command fails:
>>>
>>> static int __orderly_poweroff(bool force)
>>> {
>>>         int ret;
>>>
>>>         ret = run_cmd(poweroff_cmd);
>>>
>>>         if (ret && force) {
>>>                 pr_warn("Failed to start orderly shutdown: forcing the issue\n");
>>>
>>>                 /*
>>>                  * I guess this should try to kick off some daemon to sync and
>>>                  * poweroff asap.  Or not even bother syncing if we're doing an
>>>                  * emergency shutdown?
>>>                  */
>>>                 emergency_sync();
>>>                 kernel_power_off();
>>>         }
>>
>> Yes, it will *IF* userspace fails. the condition that I had tracked
>> was before identifying the following fix[1] - Example fail is here[2]
>>
> 
> OK. But still, why other users of orderly_poweroff do not
> deserve to be fixed, then?
> 


I'd agree as well.. I guess the comment from Robin Holt <holt@sgi.com>
anticipated something like this will eventually occur.

"* I guess this should try to kick off some daemon to sync and
* poweroff asap.  Or not even bother syncing if we're doing an
* emergency shutdown?
"

Keerthy - would you spin this as a generic fix?

>>
>> I hope this explains the problem.
>>
>> [1]
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=00917b5c55aeb01322d5ab51af8c025b82959224
>> [2] http://pastebin.ubuntu.com/14326688/
>>
>> [3]
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/arm/boot/dts/am57xx-beagle-x15.dts#n738
>>
>> -- 
>> Regards,
>> Nishanth Menon


-- 
Regards,
Nishanth Menon

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2015-12-31 18:30 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1450676778-7840-1-git-send-email-j-keerthy@ti.com>
2015-12-28 17:41 ` [RFC PATCH] thermal: Schedule a backup thermal shutdown workqueue after a known period of time to tackle failed poweroff Nishanth Menon
2015-12-29  9:16   ` Keerthy
2015-12-31 17:29     ` Eduardo Valentin
2015-12-31 17:47       ` Nishanth Menon
2015-12-31 18:20         ` Eduardo Valentin
2015-12-31 18:29           ` Nishanth Menon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).