* [RFC PATCHv2 0/2] power: device suspend/resume watchdog @ 2013-05-10 21:28 Zoran Markovic 2013-05-10 21:28 ` [RFC PATCHv2 1/2] drivers: power: Add watchdog timer to catch drivers which lockup during suspend/resume Zoran Markovic 2013-05-10 21:28 ` [RFC PATCHv2 2/2] PM: compile-time configuration of device suspend/resume watchdogs Zoran Markovic 0 siblings, 2 replies; 19+ messages in thread From: Zoran Markovic @ 2013-05-10 21:28 UTC (permalink / raw) To: linux-kernel, linux-pm Cc: Zoran Markovic, Android Kernel Team, Colin Cross, Todd Poynor, San Mehat, Benoit Goby, John Stultz, Pavel Machek, Rafael J. Wysocki, Len Brown, Greg Kroah-Hartman Hi all, Attached are two patches addressing comments on the implementation of device suspend (and resume) watchdogs from the android kernel. I have squashed changes for the suspend and resume watchdogs as they address pretty much the same functionality, and also added compile-time configurability of the watchdogs. Please be kind to review and comment if it is ready for upstreaming. Best regards, Zoran Cc: Android Kernel Team <kernel-team@android.com> Cc: Colin Cross <ccross@android.com> Cc: Todd Poynor <toddpoynor@google.com> Cc: San Mehat <san@google.com> Cc: Benoit Goby <benoit@android.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Len Brown <len.brown@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Benoit Goby (1): drivers: power: Add watchdog timer to catch drivers which lockup during suspend/resume. Zoran Markovic (1): PM: compile-time configuration of device suspend/resume watchdogs. drivers/base/power/main.c | 87 +++++++++++++++++++++++++++++++++++++++++++++ kernel/power/Kconfig | 48 +++++++++++++++++++++++++ 2 files changed, 135 insertions(+) -- 1.7.9.5 ^ permalink raw reply [flat|nested] 19+ messages in thread
* [RFC PATCHv2 1/2] drivers: power: Add watchdog timer to catch drivers which lockup during suspend/resume. 2013-05-10 21:28 [RFC PATCHv2 0/2] power: device suspend/resume watchdog Zoran Markovic @ 2013-05-10 21:28 ` Zoran Markovic 2013-05-11 6:13 ` Colin Cross 2013-05-10 21:28 ` [RFC PATCHv2 2/2] PM: compile-time configuration of device suspend/resume watchdogs Zoran Markovic 1 sibling, 1 reply; 19+ messages in thread From: Zoran Markovic @ 2013-05-10 21:28 UTC (permalink / raw) To: linux-kernel, linux-pm Cc: Benoit Goby, Android Kernel Team, Colin Cross, Todd Poynor, San Mehat, John Stultz, Pavel Machek, Rafael J. Wysocki, Len Brown, Greg Kroah-Hartman, Zoran Markovic From: Benoit Goby <benoit@android.com> Below is a patch from android kernel that detects a driver suspend/resume lockup and captures dump in the kernel log. Please review and provide comments. Rather than hard-lock the kernel, dump the suspend/resume thread stack and BUG() when a driver takes too long to suspend/resume. The timeout is set to 12 seconds to be longer than the usbhid 10 second timeout. Exclude from the watchdog the time spent waiting for children that are resumed asynchronously and time every device, whether or not they resumed synchronously. This patch is targeted for mobile devices where a suspend/resume lockup could cause a system reboot and catch user's attention. Information about failing device can later be retrieved from captured log in subsequent boot session. The hardware watchdog timer is likely suspended during this time and couldn't be relied upon. The soft-lockup detector would eventually tell that tasks are not scheduled, but would provide little context as to why. The patch hence uses system timer and assumes it is still active while the devices are suspended/resumed. Cc: Android Kernel Team <kernel-team@android.com> Cc: Colin Cross <ccross@android.com> Cc: Todd Poynor <toddpoynor@google.com> Cc: San Mehat <san@google.com> Cc: Benoit Goby <benoit@android.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Len Brown <len.brown@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Original-author: San Mehat <san@google.com> Signed-off-by: Benoit Goby <benoit@android.com> [zoran.markovic@linaro.org: Changed printk(KERN_EMERG,...) to pr_emerg(...), tweaked commit message. Minor changes to merge code into kernel tip.] Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org> --- drivers/base/power/main.c | 66 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 66 insertions(+) diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index 5a9b656..a6a02c0 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -29,6 +29,8 @@ #include <linux/async.h> #include <linux/suspend.h> #include <linux/cpuidle.h> +#include <linux/timer.h> + #include "../base.h" #include "power.h" @@ -54,6 +56,12 @@ struct suspend_stats suspend_stats; static DEFINE_MUTEX(dpm_list_mtx); static pm_message_t pm_transition; +struct dpm_watchdog { + struct device *dev; + struct task_struct *tsk; + struct timer_list timer; +}; + static int async_error; /** @@ -384,6 +392,56 @@ static int dpm_run_callback(pm_callback_t cb, struct device *dev, return error; } +/** + * dpm_wd_handler - Driver suspend / resume watchdog handler. + * + * Called when a driver has timed out suspending or resuming. + * There's not much we can do here to recover so BUG() out for + * a crash-dump + */ +static void dpm_wd_handler(unsigned long data) +{ + struct dpm_watchdog *wd = (void *)data; + struct device *dev = wd->dev; + struct task_struct *tsk = wd->tsk; + + dev_emerg(dev, "**** DPM device timeout ****\n"); + show_stack(tsk, NULL); + + BUG(); +} + +/** + * dpm_wd_set - Enable pm watchdog for given device. + * @wd: Watchdog. Must be allocated on the stack. + * @dev: Device to handle. + */ +static void dpm_wd_set(struct dpm_watchdog *wd, struct device *dev) +{ + struct timer_list *timer = &wd->timer; + + wd->dev = dev; + wd->tsk = get_current(); + + init_timer_on_stack(timer); + timer->expires = jiffies + HZ * 12; + timer->function = dpm_wd_handler; + timer->data = (unsigned long)wd; + add_timer(timer); +} + +/** + * dpm_wd_clear - Disable pm watchdog. + * @wd: Watchdog to disable. + */ +static void dpm_wd_clear(struct dpm_watchdog *wd) +{ + struct timer_list *timer = &wd->timer; + + del_timer_sync(timer); + destroy_timer_on_stack(timer); +} + /*------------------------- Resume routines -------------------------*/ /** @@ -570,6 +628,7 @@ static int device_resume(struct device *dev, pm_message_t state, bool async) pm_callback_t callback = NULL; char *info = NULL; int error = 0; + struct dpm_watchdog wd; TRACE_DEVICE(dev); TRACE_RESUME(0); @@ -585,6 +644,7 @@ static int device_resume(struct device *dev, pm_message_t state, bool async) * a resumed device, even if the device hasn't been completed yet. */ dev->power.is_prepared = false; + dpm_wd_set(&wd, dev); if (!dev->power.is_suspended) goto Unlock; @@ -636,6 +696,7 @@ static int device_resume(struct device *dev, pm_message_t state, bool async) Unlock: device_unlock(dev); + dpm_wd_clear(&wd); Complete: complete_all(&dev->power.completion); @@ -1053,6 +1114,7 @@ static int __device_suspend(struct device *dev, pm_message_t state, bool async) pm_callback_t callback = NULL; char *info = NULL; int error = 0; + struct dpm_watchdog wd; dpm_wait_for_children(dev, async); @@ -1076,6 +1138,8 @@ static int __device_suspend(struct device *dev, pm_message_t state, bool async) if (dev->power.syscore) goto Complete; + dpm_wd_set(&wd, dev); + device_lock(dev); if (dev->pm_domain) { @@ -1131,6 +1195,8 @@ static int __device_suspend(struct device *dev, pm_message_t state, bool async) device_unlock(dev); + dpm_wd_clear(&wd); + Complete: complete_all(&dev->power.completion); if (error) -- 1.7.9.5 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [RFC PATCHv2 1/2] drivers: power: Add watchdog timer to catch drivers which lockup during suspend/resume. 2013-05-10 21:28 ` [RFC PATCHv2 1/2] drivers: power: Add watchdog timer to catch drivers which lockup during suspend/resume Zoran Markovic @ 2013-05-11 6:13 ` Colin Cross 2013-05-12 0:39 ` Rafael J. Wysocki 0 siblings, 1 reply; 19+ messages in thread From: Colin Cross @ 2013-05-11 6:13 UTC (permalink / raw) To: Zoran Markovic Cc: lkml, Linux PM list, Benoit Goby, Android Kernel Team, Todd Poynor, San Mehat, John Stultz, Pavel Machek, Rafael J. Wysocki, Len Brown, Greg Kroah-Hartman On Fri, May 10, 2013 at 2:28 PM, Zoran Markovic <zoran.markovic@linaro.org> wrote: > From: Benoit Goby <benoit@android.com> > > Below is a patch from android kernel that detects a driver suspend/resume > lockup and captures dump in the kernel log. Please review and provide > comments. This paragraph should go below the --- line so it doesn't end up in the final commit message. > Rather than hard-lock the kernel, dump the suspend/resume thread stack and > BUG() when a driver takes too long to suspend/resume. The timeout is set to > 12 seconds to be longer than the usbhid 10 second timeout. > > Exclude from the watchdog the time spent waiting for children that > are resumed asynchronously and time every device, whether or not they > resumed synchronously. > > This patch is targeted for mobile devices where a suspend/resume lockup > could cause a system reboot and catch user's attention. Information > about failing device can later be retrieved from captured log in > subsequent boot session. I would take out the phrase "catch user's attention", the intention of this patch is actually the opposite - get the system back to working normally as fast as possible, while still putting enough information to debug the problem into the log. > The hardware watchdog timer is likely suspended during this time and > couldn't be relied upon. The soft-lockup detector would eventually tell > that tasks are not scheduled, but would provide little context as to why. > The patch hence uses system timer and assumes it is still active while the > devices are suspended/resumed. > > Cc: Android Kernel Team <kernel-team@android.com> > Cc: Colin Cross <ccross@android.com> > Cc: Todd Poynor <toddpoynor@google.com> > Cc: San Mehat <san@google.com> > Cc: Benoit Goby <benoit@android.com> > Cc: John Stultz <john.stultz@linaro.org> > Cc: Pavel Machek <pavel@ucw.cz> > Cc: Rafael J. Wysocki <rjw@sisk.pl> > Cc: Len Brown <len.brown@intel.com> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Original-author: San Mehat <san@google.com> > Signed-off-by: Benoit Goby <benoit@android.com> > [zoran.markovic@linaro.org: Changed printk(KERN_EMERG,...) to pr_emerg(...), > tweaked commit message. Minor changes to merge code into kernel tip.] > Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org> > --- > drivers/base/power/main.c | 66 +++++++++++++++++++++++++++++++++++++++++++++ > 1 file changed, 66 insertions(+) > > diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c > index 5a9b656..a6a02c0 100644 > --- a/drivers/base/power/main.c > +++ b/drivers/base/power/main.c > @@ -29,6 +29,8 @@ > #include <linux/async.h> > #include <linux/suspend.h> > #include <linux/cpuidle.h> > +#include <linux/timer.h> > + > #include "../base.h" > #include "power.h" > > @@ -54,6 +56,12 @@ struct suspend_stats suspend_stats; > static DEFINE_MUTEX(dpm_list_mtx); > static pm_message_t pm_transition; > > +struct dpm_watchdog { > + struct device *dev; > + struct task_struct *tsk; > + struct timer_list timer; > +}; > + > static int async_error; > > /** > @@ -384,6 +392,56 @@ static int dpm_run_callback(pm_callback_t cb, struct device *dev, > return error; > } > > +/** > + * dpm_wd_handler - Driver suspend / resume watchdog handler. > + * > + * Called when a driver has timed out suspending or resuming. > + * There's not much we can do here to recover so BUG() out for > + * a crash-dump > + */ > +static void dpm_wd_handler(unsigned long data) > +{ > + struct dpm_watchdog *wd = (void *)data; > + struct device *dev = wd->dev; > + struct task_struct *tsk = wd->tsk; > + > + dev_emerg(dev, "**** DPM device timeout ****\n"); > + show_stack(tsk, NULL); > + > + BUG(); > +} > + > +/** > + * dpm_wd_set - Enable pm watchdog for given device. > + * @wd: Watchdog. Must be allocated on the stack. > + * @dev: Device to handle. > + */ > +static void dpm_wd_set(struct dpm_watchdog *wd, struct device *dev) > +{ > + struct timer_list *timer = &wd->timer; > + > + wd->dev = dev; > + wd->tsk = get_current(); > + > + init_timer_on_stack(timer); > + timer->expires = jiffies + HZ * 12; > + timer->function = dpm_wd_handler; > + timer->data = (unsigned long)wd; > + add_timer(timer); > +} > + > +/** > + * dpm_wd_clear - Disable pm watchdog. > + * @wd: Watchdog to disable. > + */ > +static void dpm_wd_clear(struct dpm_watchdog *wd) > +{ > + struct timer_list *timer = &wd->timer; > + > + del_timer_sync(timer); > + destroy_timer_on_stack(timer); > +} > + > /*------------------------- Resume routines -------------------------*/ > > /** > @@ -570,6 +628,7 @@ static int device_resume(struct device *dev, pm_message_t state, bool async) > pm_callback_t callback = NULL; > char *info = NULL; > int error = 0; > + struct dpm_watchdog wd; > > TRACE_DEVICE(dev); > TRACE_RESUME(0); > @@ -585,6 +644,7 @@ static int device_resume(struct device *dev, pm_message_t state, bool async) > * a resumed device, even if the device hasn't been completed yet. > */ > dev->power.is_prepared = false; > + dpm_wd_set(&wd, dev); > > if (!dev->power.is_suspended) > goto Unlock; > @@ -636,6 +696,7 @@ static int device_resume(struct device *dev, pm_message_t state, bool async) > > Unlock: > device_unlock(dev); > + dpm_wd_clear(&wd); > > Complete: > complete_all(&dev->power.completion); > @@ -1053,6 +1114,7 @@ static int __device_suspend(struct device *dev, pm_message_t state, bool async) > pm_callback_t callback = NULL; > char *info = NULL; > int error = 0; > + struct dpm_watchdog wd; > > dpm_wait_for_children(dev, async); > > @@ -1076,6 +1138,8 @@ static int __device_suspend(struct device *dev, pm_message_t state, bool async) > if (dev->power.syscore) > goto Complete; > > + dpm_wd_set(&wd, dev); > + > device_lock(dev); > > if (dev->pm_domain) { > @@ -1131,6 +1195,8 @@ static int __device_suspend(struct device *dev, pm_message_t state, bool async) > > device_unlock(dev); > > + dpm_wd_clear(&wd); > + > Complete: > complete_all(&dev->power.completion); > if (error) > -- > 1.7.9.5 > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCHv2 1/2] drivers: power: Add watchdog timer to catch drivers which lockup during suspend/resume. 2013-05-11 6:13 ` Colin Cross @ 2013-05-12 0:39 ` Rafael J. Wysocki 2013-05-12 19:15 ` Colin Cross 0 siblings, 1 reply; 19+ messages in thread From: Rafael J. Wysocki @ 2013-05-12 0:39 UTC (permalink / raw) To: Colin Cross Cc: Zoran Markovic, lkml, Linux PM list, Benoit Goby, Android Kernel Team, Todd Poynor, San Mehat, John Stultz, Pavel Machek, Len Brown, Greg Kroah-Hartman On Friday, May 10, 2013 11:13:27 PM Colin Cross wrote: > On Fri, May 10, 2013 at 2:28 PM, Zoran Markovic > <zoran.markovic@linaro.org> wrote: > > From: Benoit Goby <benoit@android.com> > > > > Below is a patch from android kernel that detects a driver suspend/resume > > lockup and captures dump in the kernel log. Please review and provide > > comments. > > This paragraph should go below the --- line so it doesn't end up in > the final commit message. > > > Rather than hard-lock the kernel, dump the suspend/resume thread stack and > > BUG() when a driver takes too long to suspend/resume. And how exactly is that different from hanging the kernel? > > The timeout is set to 12 seconds to be longer than the usbhid 10 > > second timeout. Which is kind of arbitrary. > > Exclude from the watchdog the time spent waiting for children that > > are resumed asynchronously and time every device, whether or not they > > resumed synchronously. What about changing the code to use wait_for_completion_timeout() instead of wait_for_completion() and actually trying to recover if one of them times out? [It could try to unregister the device in question and if *that* hangs indefinitely, *then* commit a panic() or something similar, but if it succeeds, abort the ongoing suspend or complete the resume without that device.] Of course, that would involve changing things not to depend on async, but might be better than slapping a timer on top. > > This patch is targeted for mobile devices where a suspend/resume lockup > > could cause a system reboot and catch user's attention. Information > > about failing device can later be retrieved from captured log in > > subsequent boot session. > > I would take out the phrase "catch user's attention", the intention of > this patch is actually the opposite - get the system back to working > normally as fast as possible, while still putting enough information > to debug the problem into the log. > > > The hardware watchdog timer is likely suspended during this time and > > couldn't be relied upon. The soft-lockup detector would eventually tell > > that tasks are not scheduled, but would provide little context as to why. > > The patch hence uses system timer and assumes it is still active while the > > devices are suspended/resumed. I must say I'm not particularly impressed by this patch series. I understand the motivation, but I'm wondering if that's the best we can do. Thanks, Rafael -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCHv2 1/2] drivers: power: Add watchdog timer to catch drivers which lockup during suspend/resume. 2013-05-12 0:39 ` Rafael J. Wysocki @ 2013-05-12 19:15 ` Colin Cross 2013-05-13 11:26 ` Rafael J. Wysocki 0 siblings, 1 reply; 19+ messages in thread From: Colin Cross @ 2013-05-12 19:15 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Zoran Markovic, lkml, Linux PM list, Benoit Goby, Android Kernel Team, Todd Poynor, San Mehat, John Stultz, Pavel Machek, Len Brown, Greg Kroah-Hartman On Sat, May 11, 2013 at 5:39 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > On Friday, May 10, 2013 11:13:27 PM Colin Cross wrote: >> On Fri, May 10, 2013 at 2:28 PM, Zoran Markovic >> <zoran.markovic@linaro.org> wrote: >> > From: Benoit Goby <benoit@android.com> >> > >> > Below is a patch from android kernel that detects a driver suspend/resume >> > lockup and captures dump in the kernel log. Please review and provide >> > comments. >> >> This paragraph should go below the --- line so it doesn't end up in >> the final commit message. >> >> > Rather than hard-lock the kernel, dump the suspend/resume thread stack and >> > BUG() when a driver takes too long to suspend/resume. > > And how exactly is that different from hanging the kernel? This works best with reboot on panic + pstore ram console, which results in rebooting to a kernel in a good state that can inspect the stack trace. >> > The timeout is set to 12 seconds to be longer than the usbhid 10 >> > second timeout. > > Which is kind of arbitrary. The next patch (which should be squashed into this one) makes this timeout configurable. >> > Exclude from the watchdog the time spent waiting for children that >> > are resumed asynchronously and time every device, whether or not they >> > resumed synchronously. > > What about changing the code to use wait_for_completion_timeout() instead > of wait_for_completion() and actually trying to recover if one of them times > out? [It could try to unregister the device in question and if *that* hangs > indefinitely, *then* commit a panic() or something similar, but if it succeeds, > abort the ongoing suspend or complete the resume without that device.] I don't think this is feasible. What do you do if userspace was using the device? What do you do if the suspend function later starts running again if the blocking condition clears up, and starts accessing hardware for an unregistered device? Trying to unload a driver that is known to be misbehaving sounds to me like it is going to result in an even more unstable situation. The purpose of this patch is to get the end user's phone back into a usable state as fast as possible, while still providing the information needed to debug the issue to power users or developers. > Of course, that would involve changing things not to depend on async, but > might be better than slapping a timer on top. > >> > This patch is targeted for mobile devices where a suspend/resume lockup >> > could cause a system reboot and catch user's attention. Information >> > about failing device can later be retrieved from captured log in >> > subsequent boot session. >> >> I would take out the phrase "catch user's attention", the intention of >> this patch is actually the opposite - get the system back to working >> normally as fast as possible, while still putting enough information >> to debug the problem into the log. >> >> > The hardware watchdog timer is likely suspended during this time and >> > couldn't be relied upon. The soft-lockup detector would eventually tell >> > that tasks are not scheduled, but would provide little context as to why. >> > The patch hence uses system timer and assumes it is still active while the >> > devices are suspended/resumed. > > I must say I'm not particularly impressed by this patch series. I understand > the motivation, but I'm wondering if that's the best we can do. This feature is most useful for Android devices, which is why we never pushed it upstream ourselves. It has been extremely valuable to us during development, we have been using it for years and it turns a class of bugs that is very difficult to debug ("phone is stuck with screen off", none of the regular debugging tools are useful because most drivers are suspended) into something that can be trivially collected out of the logs on the next reboot. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCHv2 1/2] drivers: power: Add watchdog timer to catch drivers which lockup during suspend/resume. 2013-05-12 19:15 ` Colin Cross @ 2013-05-13 11:26 ` Rafael J. Wysocki 2013-05-28 18:26 ` Zoran Markovic 0 siblings, 1 reply; 19+ messages in thread From: Rafael J. Wysocki @ 2013-05-13 11:26 UTC (permalink / raw) To: Colin Cross Cc: Zoran Markovic, lkml, Linux PM list, Benoit Goby, Android Kernel Team, Todd Poynor, San Mehat, John Stultz, Pavel Machek, Len Brown, Greg Kroah-Hartman On Sunday, May 12, 2013 12:15:06 PM Colin Cross wrote: > On Sat, May 11, 2013 at 5:39 PM, Rafael J. Wysocki <rjw@sisk.pl> wrote: > > On Friday, May 10, 2013 11:13:27 PM Colin Cross wrote: > >> On Fri, May 10, 2013 at 2:28 PM, Zoran Markovic > >> <zoran.markovic@linaro.org> wrote: > >> > From: Benoit Goby <benoit@android.com> > >> > > >> > Below is a patch from android kernel that detects a driver suspend/resume > >> > lockup and captures dump in the kernel log. Please review and provide > >> > comments. > >> > >> This paragraph should go below the --- line so it doesn't end up in > >> the final commit message. > >> > >> > Rather than hard-lock the kernel, dump the suspend/resume thread stack and > >> > BUG() when a driver takes too long to suspend/resume. > > > > And how exactly is that different from hanging the kernel? > > This works best with reboot on panic + pstore ram console, which > results in rebooting to a kernel in a good state that can inspect the > stack trace. > > >> > The timeout is set to 12 seconds to be longer than the usbhid 10 > >> > second timeout. > > > > Which is kind of arbitrary. > > The next patch (which should be squashed into this one) makes this > timeout configurable. > > >> > Exclude from the watchdog the time spent waiting for children that > >> > are resumed asynchronously and time every device, whether or not they > >> > resumed synchronously. > > > > What about changing the code to use wait_for_completion_timeout() instead > > of wait_for_completion() and actually trying to recover if one of them times > > out? [It could try to unregister the device in question and if *that* hangs > > indefinitely, *then* commit a panic() or something similar, but if it succeeds, > > abort the ongoing suspend or complete the resume without that device.] > > I don't think this is feasible. What do you do if userspace was using > the device? What do you do if the suspend function later starts > running again if the blocking condition clears up, and starts > accessing hardware for an unregistered device? Trying to unload a > driver that is known to be misbehaving sounds to me like it is going > to result in an even more unstable situation. The purpose of this > patch is to get the end user's phone back into a usable state as fast > as possible, while still providing the information needed to debug the > issue to power users or developers. And using BUG() is probably one of the worst ways to achieve that. On the system I'm using, for example, it will just hang indefinitely and I'm not going to see the call trace anyway because of the console suspend that happens before suspending device drivers. What about this: - Add one more list_head to struct dev_pm_info. - Make dpm_prepare() create a new list for the next steps instead of moving devices out of dpm_list. - Start an async work to carry out dpm_suspend() and make the main thread do wait_for_completion_timeout() for every device in dpm_list (in the reverse order). - If it times out, mark the device in question as unusable, possibly resume the already suspended devices (except for descendants of the failed one) and abort the suspend. Return a specific error code to user space so that it knows what happened. [You can make this step configurable to BUG() instead of doing all those things if you think that will be more useful for platforms you care about.] - Disable future suspends. And analogously for resume. That should allow people to investigate what happened on a system that (hopefully) is not completely dead and you still can have your "reboot if suspend hangs" feature if you like. > > Of course, that would involve changing things not to depend on async, but > > might be better than slapping a timer on top. > > > >> > This patch is targeted for mobile devices where a suspend/resume lockup > >> > could cause a system reboot and catch user's attention. Information > >> > about failing device can later be retrieved from captured log in > >> > subsequent boot session. > >> > >> I would take out the phrase "catch user's attention", the intention of > >> this patch is actually the opposite - get the system back to working > >> normally as fast as possible, while still putting enough information > >> to debug the problem into the log. > >> > >> > The hardware watchdog timer is likely suspended during this time and > >> > couldn't be relied upon. The soft-lockup detector would eventually tell > >> > that tasks are not scheduled, but would provide little context as to why. > >> > The patch hence uses system timer and assumes it is still active while the > >> > devices are suspended/resumed. > > > > I must say I'm not particularly impressed by this patch series. I understand > > the motivation, but I'm wondering if that's the best we can do. > > This feature is most useful for Android devices, which is why we never > pushed it upstream ourselves. And that was OK. > It has been extremely valuable to us during development, we have been using > it for years and it turns a class of bugs that is very difficult to debug > ("phone is stuck with screen off", none of the regular debugging tools are > useful because most drivers are suspended) into something that can be > trivially collected out of the logs on the next reboot. That's *if* you have those logs and that's a pretty big "if". As I said, I understand the motivation and I know why it may be useful. Still, my question is if we can possibly do better here. Thanks, Rafael -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCHv2 1/2] drivers: power: Add watchdog timer to catch drivers which lockup during suspend/resume. 2013-05-13 11:26 ` Rafael J. Wysocki @ 2013-05-28 18:26 ` Zoran Markovic 2013-05-28 20:49 ` Rafael J. Wysocki 0 siblings, 1 reply; 19+ messages in thread From: Zoran Markovic @ 2013-05-28 18:26 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Colin Cross, lkml, Linux PM list, Benoit Goby, Android Kernel Team, Todd Poynor, San Mehat, John Stultz, Pavel Machek, Len Brown, Greg Kroah-Hartman > What about this: > - Add one more list_head to struct dev_pm_info. > - Make dpm_prepare() create a new list for the next steps instead of moving > devices out of dpm_list. > - Start an async work to carry out dpm_suspend() and make the main thread > do wait_for_completion_timeout() for every device in dpm_list (in the > reverse order). > - If it times out, mark the device in question as unusable, possibly resume > the already suspended devices (except for descendants of the failed one) > and abort the suspend. Return a specific error code to user space so that > it knows what happened. [You can make this step configurable to BUG() > instead of doing all those things if you think that will be more useful for > platforms you care about.] > - Disable future suspends. > And analogously for resume. > > That should allow people to investigate what happened on a system that > (hopefully) is not completely dead and you still can have your "reboot if > suspend hangs" feature if you like. I looked into implementing this. The problem that I encountered is that there is no reliable way of canceling an async task, and hence the asynchronous __device_suspend() would be left racing with a recovery from a suspend timeout. We could do cancel_work_sync() as a recovery, but that call blocks until the running async task is flushed, which might never happen. So doing a panic() is pretty much the only option for recovering. - Zoran ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCHv2 1/2] drivers: power: Add watchdog timer to catch drivers which lockup during suspend/resume. 2013-05-28 18:26 ` Zoran Markovic @ 2013-05-28 20:49 ` Rafael J. Wysocki 2013-05-31 21:13 ` Zoran Markovic 0 siblings, 1 reply; 19+ messages in thread From: Rafael J. Wysocki @ 2013-05-28 20:49 UTC (permalink / raw) To: Zoran Markovic Cc: Colin Cross, lkml, Linux PM list, Benoit Goby, Android Kernel Team, Todd Poynor, San Mehat, John Stultz, Pavel Machek, Len Brown, Greg Kroah-Hartman On Tuesday, May 28, 2013 11:26:09 AM Zoran Markovic wrote: > > What about this: > > - Add one more list_head to struct dev_pm_info. > > - Make dpm_prepare() create a new list for the next steps instead of moving > > devices out of dpm_list. > > - Start an async work to carry out dpm_suspend() and make the main thread > > do wait_for_completion_timeout() for every device in dpm_list (in the > > reverse order). > > - If it times out, mark the device in question as unusable, possibly resume > > the already suspended devices (except for descendants of the failed one) > > and abort the suspend. Return a specific error code to user space so that > > it knows what happened. [You can make this step configurable to BUG() > > instead of doing all those things if you think that will be more useful for > > platforms you care about.] > > - Disable future suspends. > > And analogously for resume. > > > > That should allow people to investigate what happened on a system that > > (hopefully) is not completely dead and you still can have your "reboot if > > suspend hangs" feature if you like. > > I looked into implementing this. The problem that I encountered is > that there is no reliable way of canceling an async task, and hence > the asynchronous __device_suspend() would be left racing with a > recovery from a suspend timeout. Why exactly would it be racing? We wouldn't call device_resume() for the device that timed out (and its descendants). > We could do cancel_work_sync() as a recovery, but that call blocks until the > running async task is flushed, which might never happen. So doing a panic() > is pretty much the only option for recovering. Well, its usefulness is quite limited, then. That said I'm still not convinced that this actually is the case. Thanks, Rafael -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCHv2 1/2] drivers: power: Add watchdog timer to catch drivers which lockup during suspend/resume. 2013-05-28 20:49 ` Rafael J. Wysocki @ 2013-05-31 21:13 ` Zoran Markovic 2013-06-05 22:17 ` Zoran Markovic 0 siblings, 1 reply; 19+ messages in thread From: Zoran Markovic @ 2013-05-31 21:13 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Colin Cross, lkml, Linux PM list, Benoit Goby, Android Kernel Team, Todd Poynor, San Mehat, John Stultz, Pavel Machek, Len Brown, Greg Kroah-Hartman >> We could do cancel_work_sync() as a recovery, but that call blocks until the >> running async task is flushed, which might never happen. So doing a panic() >> is pretty much the only option for recovering. > > Well, its usefulness is quite limited, then. That said I'm still not convinced > that this actually is the case. It does block in my environment, AFAICS. Looking a bit further in the code, it looks like dpm_suspend() does an async_synchronize_full() which would wait for all async tasks to complete. This is a show-stopper because (under the circumstances) the assumption that every async suspend routine eventually completes doesn't hold. We could possibly select which async tasks to wait for, but this would add unnecessary complexity to a feature targeted for debugging. It seems that this approach - although sounding reasonable - needs to wait until we have a mechanism to cancel an async task. Regards, Zoran ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCHv2 1/2] drivers: power: Add watchdog timer to catch drivers which lockup during suspend/resume. 2013-05-31 21:13 ` Zoran Markovic @ 2013-06-05 22:17 ` Zoran Markovic 2013-06-05 22:29 ` Rafael J. Wysocki 2013-06-06 14:12 ` Alan Stern 0 siblings, 2 replies; 19+ messages in thread From: Zoran Markovic @ 2013-06-05 22:17 UTC (permalink / raw) To: Rafael J. Wysocki Cc: Colin Cross, lkml, Linux PM list, Benoit Goby, Android Kernel Team, Todd Poynor, San Mehat, John Stultz, Pavel Machek, Len Brown, Greg Kroah-Hartman Rafael, >>> We could do cancel_work_sync() as a recovery, but that call blocks until the >>> running async task is flushed, which might never happen. So doing a panic() >>> is pretty much the only option for recovering. >> >> Well, its usefulness is quite limited, then. That said I'm still not convinced >> that this actually is the case. > > It does block in my environment, AFAICS. Looking a bit further in the > code, it looks like dpm_suspend() does an async_synchronize_full() > which would wait for all async tasks to complete. This is a > show-stopper because (under the circumstances) the assumption that > every async suspend routine eventually completes doesn't hold. > > We could possibly select which async tasks to wait for, but this would > add unnecessary complexity to a feature targeted for debugging. It > seems that this approach - although sounding reasonable - needs to > wait until we have a mechanism to cancel an async task. Looks like the implementation of proposal for an async suspend + wait_for_completion_timeout is quite complex due to above limitations. How do we proceed from here? We have the following options: 1. Give up on the idea of having a suspend/resume watchdog. 2. Use the timer implementation (with possible modifications). 3. Wait for the implementation of (or implement) killing of an already running async work. Are there any other ideas floating around? Thanks, Zoran ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCHv2 1/2] drivers: power: Add watchdog timer to catch drivers which lockup during suspend/resume. 2013-06-05 22:17 ` Zoran Markovic @ 2013-06-05 22:29 ` Rafael J. Wysocki 2013-06-06 14:12 ` Alan Stern 1 sibling, 0 replies; 19+ messages in thread From: Rafael J. Wysocki @ 2013-06-05 22:29 UTC (permalink / raw) To: Zoran Markovic Cc: Colin Cross, lkml, Linux PM list, Benoit Goby, Android Kernel Team, Todd Poynor, San Mehat, John Stultz, Pavel Machek, Len Brown, Greg Kroah-Hartman On Wednesday, June 05, 2013 03:17:59 PM Zoran Markovic wrote: > Rafael, Hi, > >>> We could do cancel_work_sync() as a recovery, but that call blocks until the > >>> running async task is flushed, which might never happen. So doing a panic() > >>> is pretty much the only option for recovering. > >> > >> Well, its usefulness is quite limited, then. That said I'm still not convinced > >> that this actually is the case. > > > > It does block in my environment, AFAICS. Looking a bit further in the > > code, it looks like dpm_suspend() does an async_synchronize_full() > > which would wait for all async tasks to complete. This is a > > show-stopper because (under the circumstances) the assumption that > > every async suspend routine eventually completes doesn't hold. > > > > We could possibly select which async tasks to wait for, but this would > > add unnecessary complexity to a feature targeted for debugging. It > > seems that this approach - although sounding reasonable - needs to > > wait until we have a mechanism to cancel an async task. > > Looks like the implementation of proposal for an async suspend + > wait_for_completion_timeout is quite complex due to above limitations. > How do we proceed from here? We have the following options: > 1. Give up on the idea of having a suspend/resume watchdog. > 2. Use the timer implementation (with possible modifications). > 3. Wait for the implementation of (or implement) killing of an already > running async work. > > Are there any other ideas floating around? I'm not aware of any at the moment, but I really don't think this is urgent. I think we can revisit it during the 3.12 cycle and decide how to proceed then. Thanks, Rafael -- I speak only for myself. Rafael J. Wysocki, Intel Open Source Technology Center. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCHv2 1/2] drivers: power: Add watchdog timer to catch drivers which lockup during suspend/resume. 2013-06-05 22:17 ` Zoran Markovic 2013-06-05 22:29 ` Rafael J. Wysocki @ 2013-06-06 14:12 ` Alan Stern 2013-06-10 21:25 ` Colin Cross 1 sibling, 1 reply; 19+ messages in thread From: Alan Stern @ 2013-06-06 14:12 UTC (permalink / raw) To: Zoran Markovic Cc: Rafael J. Wysocki, Colin Cross, lkml, Linux PM list, Benoit Goby, Android Kernel Team, Todd Poynor, San Mehat, John Stultz, Pavel Machek, Len Brown, Greg Kroah-Hartman On Wed, 5 Jun 2013, Zoran Markovic wrote: > > It does block in my environment, AFAICS. Looking a bit further in the > > code, it looks like dpm_suspend() does an async_synchronize_full() > > which would wait for all async tasks to complete. This is a > > show-stopper because (under the circumstances) the assumption that > > every async suspend routine eventually completes doesn't hold. > > > > We could possibly select which async tasks to wait for, but this would > > add unnecessary complexity to a feature targeted for debugging. It > > seems that this approach - although sounding reasonable - needs to > > wait until we have a mechanism to cancel an async task. > > Looks like the implementation of proposal for an async suspend + > wait_for_completion_timeout is quite complex due to above limitations. > How do we proceed from here? We have the following options: > 1. Give up on the idea of having a suspend/resume watchdog. > 2. Use the timer implementation (with possible modifications). > 3. Wait for the implementation of (or implement) killing of an already > running async work. > > Are there any other ideas floating around? In general, the kernel is not designed to operate when kernel threads get killed at random times. It's also not designed to operate normally while in the middle of a system suspend. This means there is basically no hope of recovering from a hung async suspend task. (In much the same way, there is no hope of recovering from any hung kernel thread.) The best you can accomplish is to store some useful information somewhere and either panic or force a reboot. Given that the usual storage media may be inaccessible, it may not be easy to find a place to store the information. (By the way, what do you do if a _synchronous_ suspend routine hangs? The two problems are fairly similar.) Alan Stern ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCHv2 1/2] drivers: power: Add watchdog timer to catch drivers which lockup during suspend/resume. 2013-06-06 14:12 ` Alan Stern @ 2013-06-10 21:25 ` Colin Cross 0 siblings, 0 replies; 19+ messages in thread From: Colin Cross @ 2013-06-10 21:25 UTC (permalink / raw) To: Alan Stern Cc: Zoran Markovic, Rafael J. Wysocki, lkml, Linux PM list, Benoit Goby, Android Kernel Team, Todd Poynor, San Mehat, John Stultz, Pavel Machek, Len Brown, Greg Kroah-Hartman On Thu, Jun 6, 2013 at 7:12 AM, Alan Stern <stern@rowland.harvard.edu> wrote: > On Wed, 5 Jun 2013, Zoran Markovic wrote: > >> > It does block in my environment, AFAICS. Looking a bit further in the >> > code, it looks like dpm_suspend() does an async_synchronize_full() >> > which would wait for all async tasks to complete. This is a >> > show-stopper because (under the circumstances) the assumption that >> > every async suspend routine eventually completes doesn't hold. >> > >> > We could possibly select which async tasks to wait for, but this would >> > add unnecessary complexity to a feature targeted for debugging. It >> > seems that this approach - although sounding reasonable - needs to >> > wait until we have a mechanism to cancel an async task. >> >> Looks like the implementation of proposal for an async suspend + >> wait_for_completion_timeout is quite complex due to above limitations. >> How do we proceed from here? We have the following options: >> 1. Give up on the idea of having a suspend/resume watchdog. >> 2. Use the timer implementation (with possible modifications). >> 3. Wait for the implementation of (or implement) killing of an already >> running async work. >> >> Are there any other ideas floating around? > > In general, the kernel is not designed to operate when kernel threads > get killed at random times. It's also not designed to operate normally > while in the middle of a system suspend. > > This means there is basically no hope of recovering from a hung async > suspend task. (In much the same way, there is no hope of recovering > from any hung kernel thread.) The best you can accomplish is to store > some useful information somewhere and either panic or force a reboot. > > Given that the usual storage media may be inaccessible, it may not be > easy to find a place to store the information. > > (By the way, what do you do if a _synchronous_ suspend routine hangs? > The two problems are fairly similar.) This is why the original patch dumped a stack trace of the offending task and panic'd. There is nothing else useful you can do. ^ permalink raw reply [flat|nested] 19+ messages in thread
* [RFC PATCHv2 2/2] PM: compile-time configuration of device suspend/resume watchdogs. 2013-05-10 21:28 [RFC PATCHv2 0/2] power: device suspend/resume watchdog Zoran Markovic 2013-05-10 21:28 ` [RFC PATCHv2 1/2] drivers: power: Add watchdog timer to catch drivers which lockup during suspend/resume Zoran Markovic @ 2013-05-10 21:28 ` Zoran Markovic 2013-05-11 6:23 ` Colin Cross 2013-05-11 9:28 ` Pavel Machek 1 sibling, 2 replies; 19+ messages in thread From: Zoran Markovic @ 2013-05-10 21:28 UTC (permalink / raw) To: linux-kernel, linux-pm Cc: Zoran Markovic, Android Kernel Team, Colin Cross, Todd Poynor, San Mehat, Benoit Goby, John Stultz, Pavel Machek, Rafael J. Wysocki, Len Brown, Greg Kroah-Hartman Power management debug option to configure device suspend/resume watchdogs. Available options are: 1. Enable/disable the feature. 2. Select triggered watchdog action between: - system panic (default) - dump stacktrace - log event 3. Select timeout value for the watchdog(s). Minor changes were made to watchdog code to accommodate this feature. Cc: Android Kernel Team <kernel-team@android.com> Cc: Colin Cross <ccross@android.com> Cc: Todd Poynor <toddpoynor@google.com> Cc: San Mehat <san@google.com> Cc: Benoit Goby <benoit@android.com> Cc: John Stultz <john.stultz@linaro.org> Cc: Pavel Machek <pavel@ucw.cz> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Len Brown <len.brown@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org> --- drivers/base/power/main.c | 37 ++++++++++++++++++++++++++-------- kernel/power/Kconfig | 48 +++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 77 insertions(+), 8 deletions(-) diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c index a6a02c0..8e0bb33 100644 --- a/drivers/base/power/main.c +++ b/drivers/base/power/main.c @@ -392,6 +392,26 @@ static int dpm_run_callback(pm_callback_t cb, struct device *dev, return error; } +#ifdef CONFIG_DPM_WD +/** + * dpm_wd_action - recovery from suspend/resume watchdog timeout + * @wd: Watchdog. Must be allocated on the stack. + */ +#if defined(CONFIG_DPM_WD_ACTION_STACKTRACE) +static inline void dpm_wd_action(struct dpm_watchdog *wd) +{ + show_stack(wd->tsk, NULL); +} +#elif defined(CONFIG_DPM_WD_ACTION_PANIC) +static inline void dpm_wd_action(struct dpm_watchdog *wd) +{ + panic("%s: unrecoverable failure\n", dev_name(wd->dev)); +} +#else /* CONFIG_DPM_WD_ACTION_LOG */ +/* event already logged in dpm_wd_handler() */ +#define dpm_wd_action(x) +#endif + /** * dpm_wd_handler - Driver suspend / resume watchdog handler. * @@ -402,13 +422,9 @@ static int dpm_run_callback(pm_callback_t cb, struct device *dev, static void dpm_wd_handler(unsigned long data) { struct dpm_watchdog *wd = (void *)data; - struct device *dev = wd->dev; - struct task_struct *tsk = wd->tsk; - - dev_emerg(dev, "**** DPM device timeout ****\n"); - show_stack(tsk, NULL); - BUG(); + dev_emerg(wd->dev, "**** DPM device timeout ****\n"); + dpm_wd_action(wd); } /** @@ -424,14 +440,15 @@ static void dpm_wd_set(struct dpm_watchdog *wd, struct device *dev) wd->tsk = get_current(); init_timer_on_stack(timer); - timer->expires = jiffies + HZ * 12; + /* use same timeout value for both suspend and resume */ + timer->expires = jiffies + HZ * CONFIG_DPM_WD_TIMEOUT; timer->function = dpm_wd_handler; timer->data = (unsigned long)wd; add_timer(timer); } /** - * dpm_wd_clear - Disable pm watchdog. + * dpm_wd_clear - Disable suspend/resume watchdog. * @wd: Watchdog to disable. */ static void dpm_wd_clear(struct dpm_watchdog *wd) @@ -441,6 +458,10 @@ static void dpm_wd_clear(struct dpm_watchdog *wd) del_timer_sync(timer); destroy_timer_on_stack(timer); } +#else +#define dpm_wd_set(x, y) +#define dpm_wd_clear(x) +#endif /*------------------------- Resume routines -------------------------*/ diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig index edc8bdd..339caa1 100644 --- a/kernel/power/Kconfig +++ b/kernel/power/Kconfig @@ -179,6 +179,54 @@ config PM_SLEEP_DEBUG def_bool y depends on PM_DEBUG && PM_SLEEP +config DPM_WD + bool "Device suspend/resume watchdog" + depends on PM_DEBUG + ---help--- + Sets up a watchdog timer to capture drivers that are + locked up attempting to suspend/resume a device. + A detected lockup causes a configurable watchdog action, + such as logging the event, dumping the stack trace or + kernel panic. + +choice + prompt "Watchdog recovery action" + default DPM_WD_ACTION_PANIC + depends on DPM_WD + ---help--- + Select recovery action triggered by suspend/resume watchdog. + +config DPM_WD_ACTION_PANIC + bool "System panic" + ---help--- + When selected, a lockup during device's suspend or + resume would cause a system panic. This would immediately + capture user's attention. Panic message can be observed in + subsequent boot session using pstore. + +config DPM_WD_ACTION_STACKTRACE + bool "Dump stack" + ---help--- + When selected, a lockup during device's suspend or + resume would cause the caller's stack to be + captured in the system log. The stack trace shows + which driver call caused a lockup. + +config DPM_WD_ACTION_LOG + bool "Log event" + ---help--- + When selected, a lockup during device's suspend or + resume would cause the watchdog timeout event to be + logged in the system log. + +endchoice + +config DPM_WD_TIMEOUT + int "Watchdog timeout in seconds" + range 1 120 + default 12 + depends on DPM_WD + config PM_TRACE bool help -- 1.7.9.5 ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [RFC PATCHv2 2/2] PM: compile-time configuration of device suspend/resume watchdogs. 2013-05-10 21:28 ` [RFC PATCHv2 2/2] PM: compile-time configuration of device suspend/resume watchdogs Zoran Markovic @ 2013-05-11 6:23 ` Colin Cross 2013-05-13 16:03 ` John Stultz 2013-05-11 9:28 ` Pavel Machek 1 sibling, 1 reply; 19+ messages in thread From: Colin Cross @ 2013-05-11 6:23 UTC (permalink / raw) To: Zoran Markovic Cc: lkml, Linux PM list, Android Kernel Team, Todd Poynor, San Mehat, Benoit Goby, John Stultz, Pavel Machek, Rafael J. Wysocki, Len Brown, Greg Kroah-Hartman On Fri, May 10, 2013 at 2:28 PM, Zoran Markovic <zoran.markovic@linaro.org> wrote: > Power management debug option to configure device suspend/resume watchdogs. > Available options are: > 1. Enable/disable the feature. > 2. Select triggered watchdog action between: > - system panic (default) > - dump stacktrace > - log event I don't see much point to logging the event but not printing the stack trace, you can just disable the feature if you don't want your logs cluttered, and this shouldn't fire very often. > 3. Select timeout value for the watchdog(s). > > Minor changes were made to watchdog code to accommodate this feature. > > Cc: Android Kernel Team <kernel-team@android.com> > Cc: Colin Cross <ccross@android.com> > Cc: Todd Poynor <toddpoynor@google.com> > Cc: San Mehat <san@google.com> > Cc: Benoit Goby <benoit@android.com> > Cc: John Stultz <john.stultz@linaro.org> > Cc: Pavel Machek <pavel@ucw.cz> > Cc: Rafael J. Wysocki <rjw@sisk.pl> > Cc: Len Brown <len.brown@intel.com> > Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> > Signed-off-by: Zoran Markovic <zoran.markovic@linaro.org> This whole patch is linked enough with the previous one that I suggest squashing them together, and expanding your comments above your signed off by to say you added configuration options. > --- > drivers/base/power/main.c | 37 ++++++++++++++++++++++++++-------- > kernel/power/Kconfig | 48 +++++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 77 insertions(+), 8 deletions(-) > > diff --git a/drivers/base/power/main.c b/drivers/base/power/main.c > index a6a02c0..8e0bb33 100644 > --- a/drivers/base/power/main.c > +++ b/drivers/base/power/main.c > @@ -392,6 +392,26 @@ static int dpm_run_callback(pm_callback_t cb, struct device *dev, > return error; > } > > +#ifdef CONFIG_DPM_WD > +/** > + * dpm_wd_action - recovery from suspend/resume watchdog timeout > + * @wd: Watchdog. Must be allocated on the stack. > + */ > +#if defined(CONFIG_DPM_WD_ACTION_STACKTRACE) > +static inline void dpm_wd_action(struct dpm_watchdog *wd) > +{ > + show_stack(wd->tsk, NULL); > +} > +#elif defined(CONFIG_DPM_WD_ACTION_PANIC) > +static inline void dpm_wd_action(struct dpm_watchdog *wd) > +{ > + panic("%s: unrecoverable failure\n", dev_name(wd->dev)); The panic here is not very useful, it's going to print the stack of the task that was running when the timer fired which is likely to be the idle task if the suspend task is deadlocked. This should call show_stack and panic. If you take out the log action, then all this can stay inline with the handler and be: dev_emerg(wd->dev, "**** DPM device timeout ****\n"); show_stack(wd->tsk, NULL); #ifdef CONFIG_DPM_WD_ACTION_PANIC panic("%s: unrecoverable failure\n", dev_name(wd->dev)); #endif Also, and this is a problem with the existing patch and not your modifications, the format devname: error in the log is usually printed by the device driver, which may cause confusion for someone who sees this error and tries to grep for it in the mentioned driver. Maybe a better format for the pr_emerg and the panic would be: PM: **** DPM device timeout suspending devname **** Ideally the message would specify if the error happened while suspending or resuming. > +} > +#else /* CONFIG_DPM_WD_ACTION_LOG */ > +/* event already logged in dpm_wd_handler() */ > +#define dpm_wd_action(x) > +#endif > + > /** > * dpm_wd_handler - Driver suspend / resume watchdog handler. > * > @@ -402,13 +422,9 @@ static int dpm_run_callback(pm_callback_t cb, struct device *dev, > static void dpm_wd_handler(unsigned long data) > { > struct dpm_watchdog *wd = (void *)data; > - struct device *dev = wd->dev; > - struct task_struct *tsk = wd->tsk; > - > - dev_emerg(dev, "**** DPM device timeout ****\n"); > - show_stack(tsk, NULL); > > - BUG(); > + dev_emerg(wd->dev, "**** DPM device timeout ****\n"); > + dpm_wd_action(wd); > } > > /** > @@ -424,14 +440,15 @@ static void dpm_wd_set(struct dpm_watchdog *wd, struct device *dev) > wd->tsk = get_current(); > > init_timer_on_stack(timer); > - timer->expires = jiffies + HZ * 12; > + /* use same timeout value for both suspend and resume */ > + timer->expires = jiffies + HZ * CONFIG_DPM_WD_TIMEOUT; > timer->function = dpm_wd_handler; > timer->data = (unsigned long)wd; > add_timer(timer); > } > > /** > - * dpm_wd_clear - Disable pm watchdog. > + * dpm_wd_clear - Disable suspend/resume watchdog. > * @wd: Watchdog to disable. > */ > static void dpm_wd_clear(struct dpm_watchdog *wd) > @@ -441,6 +458,10 @@ static void dpm_wd_clear(struct dpm_watchdog *wd) > del_timer_sync(timer); > destroy_timer_on_stack(timer); > } > +#else > +#define dpm_wd_set(x, y) > +#define dpm_wd_clear(x) > +#endif > > /*------------------------- Resume routines -------------------------*/ > > diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig > index edc8bdd..339caa1 100644 > --- a/kernel/power/Kconfig > +++ b/kernel/power/Kconfig > @@ -179,6 +179,54 @@ config PM_SLEEP_DEBUG > def_bool y > depends on PM_DEBUG && PM_SLEEP > > +config DPM_WD > + bool "Device suspend/resume watchdog" > + depends on PM_DEBUG > + ---help--- > + Sets up a watchdog timer to capture drivers that are > + locked up attempting to suspend/resume a device. > + A detected lockup causes a configurable watchdog action, > + such as logging the event, dumping the stack trace or > + kernel panic. > + > +choice > + prompt "Watchdog recovery action" > + default DPM_WD_ACTION_PANIC > + depends on DPM_WD > + ---help--- > + Select recovery action triggered by suspend/resume watchdog. > + > +config DPM_WD_ACTION_PANIC > + bool "System panic" > + ---help--- > + When selected, a lockup during device's suspend or > + resume would cause a system panic. This would immediately > + capture user's attention. Panic message can be observed in > + subsequent boot session using pstore. > + > +config DPM_WD_ACTION_STACKTRACE > + bool "Dump stack" > + ---help--- > + When selected, a lockup during device's suspend or > + resume would cause the caller's stack to be > + captured in the system log. The stack trace shows > + which driver call caused a lockup. > + > +config DPM_WD_ACTION_LOG > + bool "Log event" > + ---help--- > + When selected, a lockup during device's suspend or > + resume would cause the watchdog timeout event to be > + logged in the system log. > + > +endchoice > + > +config DPM_WD_TIMEOUT > + int "Watchdog timeout in seconds" > + range 1 120 > + default 12 > + depends on DPM_WD > + > config PM_TRACE > bool > help > -- > 1.7.9.5 > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCHv2 2/2] PM: compile-time configuration of device suspend/resume watchdogs. 2013-05-11 6:23 ` Colin Cross @ 2013-05-13 16:03 ` John Stultz 0 siblings, 0 replies; 19+ messages in thread From: John Stultz @ 2013-05-13 16:03 UTC (permalink / raw) To: Colin Cross Cc: Zoran Markovic, lkml, Linux PM list, Android Kernel Team, Todd Poynor, San Mehat, Benoit Goby, Pavel Machek, Rafael J. Wysocki, Len Brown, Greg Kroah-Hartman On 05/10/2013 11:23 PM, Colin Cross wrote: > On Fri, May 10, 2013 at 2:28 PM, Zoran Markovic > <zoran.markovic@linaro.org> wrote: >> +#ifdef CONFIG_DPM_WD >> +/** >> + * dpm_wd_action - recovery from suspend/resume watchdog timeout >> + * @wd: Watchdog. Must be allocated on the stack. >> + */ >> +#if defined(CONFIG_DPM_WD_ACTION_STACKTRACE) >> +static inline void dpm_wd_action(struct dpm_watchdog *wd) >> +{ >> + show_stack(wd->tsk, NULL); >> +} >> +#elif defined(CONFIG_DPM_WD_ACTION_PANIC) >> +static inline void dpm_wd_action(struct dpm_watchdog *wd) >> +{ >> + panic("%s: unrecoverable failure\n", dev_name(wd->dev)); > The panic here is not very useful, it's going to print the stack of > the task that was running when the timer fired which is likely to be > the idle task if the suspend task is deadlocked. This should call > show_stack and panic. If you take out the log action, then all this > can stay inline with the handler and be: > > dev_emerg(wd->dev, "**** DPM device timeout ****\n"); > show_stack(wd->tsk, NULL); > #ifdef CONFIG_DPM_WD_ACTION_PANIC > panic("%s: unrecoverable failure\n", dev_name(wd->dev)); > #endif #ifdefs in functions are usually to be avoided. Thus why I suggested he use the config dependent dpm_wd_action() function to handle this. thanks -john ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCHv2 2/2] PM: compile-time configuration of device suspend/resume watchdogs. 2013-05-10 21:28 ` [RFC PATCHv2 2/2] PM: compile-time configuration of device suspend/resume watchdogs Zoran Markovic 2013-05-11 6:23 ` Colin Cross @ 2013-05-11 9:28 ` Pavel Machek 2013-05-11 22:21 ` Colin Cross 1 sibling, 1 reply; 19+ messages in thread From: Pavel Machek @ 2013-05-11 9:28 UTC (permalink / raw) To: Zoran Markovic Cc: linux-kernel, linux-pm, Android Kernel Team, Colin Cross, Todd Poynor, San Mehat, Benoit Goby, John Stultz, Rafael J. Wysocki, Len Brown, Greg Kroah-Hartman Hi! > Power management debug option to configure device suspend/resume watchdogs. > Available options are: > 1. Enable/disable the feature. > 2. Select triggered watchdog action between: > - system panic (default) > - dump stacktrace > - log event > 3. Select timeout value for the watchdog(s). People disliked the previous behaviour, so you add 10 config options... with different behaviours. Also 1 second timeout is not acceptable for endusers (will break the system), so it should not be offered. In fact, remove that option, too. People doing that kind of debugging can modify the sources, right? (Maybe it would make sense to do same action regular watchdog does, but...) That's not the way to go. If "panic" is right behaviour, just go with panic. > @@ -402,13 +422,9 @@ static int dpm_run_callback(pm_callback_t cb, struct device *dev, > static void dpm_wd_handler(unsigned long data) > { > struct dpm_watchdog *wd = (void *)data; > - struct device *dev = wd->dev; > - struct task_struct *tsk = wd->tsk; > - > - dev_emerg(dev, "**** DPM device timeout ****\n"); > - show_stack(tsk, NULL); > > - BUG(); > + dev_emerg(wd->dev, "**** DPM device timeout ****\n"); > + dpm_wd_action(wd); > } > > /** And merge this to previous patch. Introducing the code then redoing it in next patch does not help review. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCHv2 2/2] PM: compile-time configuration of device suspend/resume watchdogs. 2013-05-11 9:28 ` Pavel Machek @ 2013-05-11 22:21 ` Colin Cross 2013-05-12 0:05 ` Pavel Machek 0 siblings, 1 reply; 19+ messages in thread From: Colin Cross @ 2013-05-11 22:21 UTC (permalink / raw) To: Pavel Machek Cc: Zoran Markovic, lkml, Linux PM list, Android Kernel Team, Todd Poynor, San Mehat, Benoit Goby, John Stultz, Rafael J. Wysocki, Len Brown, Greg Kroah-Hartman On Sat, May 11, 2013 at 2:28 AM, Pavel Machek <pavel@ucw.cz> wrote: > Hi! > >> Power management debug option to configure device suspend/resume watchdogs. >> Available options are: >> 1. Enable/disable the feature. >> 2. Select triggered watchdog action between: >> - system panic (default) >> - dump stacktrace >> - log event >> 3. Select timeout value for the watchdog(s). > > People disliked the previous behaviour, so you add 10 config > options... with different behaviours. Also 1 second timeout is not > acceptable for endusers (will break the system), so it should not be > offered. In fact, remove that option, too. People doing that kind of > debugging can modify the sources, right? Greg KH asked for more configurable options. I agree this may be a little too far (I would replace the action choice with a simple "panic on timeout?"), but its better than it was before. Also, a 1 second timeout is perfectly reasonable, especially if you configure it to dump a stack trace but not panic. Mobile devices normally finish suspending within a few hundred ms. > (Maybe it would make sense to do same action regular watchdog does, > but...) > > That's not the way to go. If "panic" is right behaviour, just go with > panic. I can see uses for both panic and stack trace. If you have a driver that takes too long, but eventually suspends, then a panic is unnecessary and a stack trace that you can see in the logs is better, especially for a short timeout. >> @@ -402,13 +422,9 @@ static int dpm_run_callback(pm_callback_t cb, struct device *dev, >> static void dpm_wd_handler(unsigned long data) >> { >> struct dpm_watchdog *wd = (void *)data; >> - struct device *dev = wd->dev; >> - struct task_struct *tsk = wd->tsk; >> - >> - dev_emerg(dev, "**** DPM device timeout ****\n"); >> - show_stack(tsk, NULL); >> >> - BUG(); >> + dev_emerg(wd->dev, "**** DPM device timeout ****\n"); >> + dpm_wd_action(wd); >> } >> >> /** > > And merge this to previous patch. Introducing the code then redoing it > in next patch does not help review. > > Pavel > > -- > (english) http://www.livejournal.com/~pavelmachek > (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC PATCHv2 2/2] PM: compile-time configuration of device suspend/resume watchdogs. 2013-05-11 22:21 ` Colin Cross @ 2013-05-12 0:05 ` Pavel Machek 0 siblings, 0 replies; 19+ messages in thread From: Pavel Machek @ 2013-05-12 0:05 UTC (permalink / raw) To: Colin Cross Cc: Zoran Markovic, lkml, Linux PM list, Android Kernel Team, Todd Poynor, San Mehat, Benoit Goby, John Stultz, Rafael J. Wysocki, Len Brown, Greg Kroah-Hartman Hi! > >> Power management debug option to configure device suspend/resume watchdogs. > >> Available options are: > >> 1. Enable/disable the feature. > >> 2. Select triggered watchdog action between: > >> - system panic (default) > >> - dump stacktrace > >> - log event > >> 3. Select timeout value for the watchdog(s). > > > > People disliked the previous behaviour, so you add 10 config > > options... with different behaviours. Also 1 second timeout is not > > acceptable for endusers (will break the system), so it should not be > > offered. In fact, remove that option, too. People doing that kind of > > debugging can modify the sources, right? > > Greg KH asked for more configurable options. I agree this may be a > little too far (I would replace the action choice with a simple "panic > on timeout?"), but its better than it was before. Also, a 1 second > timeout is perfectly reasonable, especially if you configure it to > dump a stack trace but not panic. Mobile devices normally finish > suspending within a few hundred ms. For stack dump, yes that's ok. Maybe it ends up less ugly if it will be runtime-configurable? -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2013-06-10 21:25 UTC | newest] Thread overview: 19+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-05-10 21:28 [RFC PATCHv2 0/2] power: device suspend/resume watchdog Zoran Markovic 2013-05-10 21:28 ` [RFC PATCHv2 1/2] drivers: power: Add watchdog timer to catch drivers which lockup during suspend/resume Zoran Markovic 2013-05-11 6:13 ` Colin Cross 2013-05-12 0:39 ` Rafael J. Wysocki 2013-05-12 19:15 ` Colin Cross 2013-05-13 11:26 ` Rafael J. Wysocki 2013-05-28 18:26 ` Zoran Markovic 2013-05-28 20:49 ` Rafael J. Wysocki 2013-05-31 21:13 ` Zoran Markovic 2013-06-05 22:17 ` Zoran Markovic 2013-06-05 22:29 ` Rafael J. Wysocki 2013-06-06 14:12 ` Alan Stern 2013-06-10 21:25 ` Colin Cross 2013-05-10 21:28 ` [RFC PATCHv2 2/2] PM: compile-time configuration of device suspend/resume watchdogs Zoran Markovic 2013-05-11 6:23 ` Colin Cross 2013-05-13 16:03 ` John Stultz 2013-05-11 9:28 ` Pavel Machek 2013-05-11 22:21 ` Colin Cross 2013-05-12 0:05 ` Pavel Machek
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).