From: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
To: Sameer Nanda <snanda@chromium.org>
Cc: mingo@redhat.com, peterz@infradead.org, len.brown@intel.com,
pavel@ucw.cz, rjw@sisk.pl, akpm@linux-foundation.org,
dzickus@redhat.com, msb@chromium.org,
linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
olofj@chromium.org
Subject: Re: [PATCH] watchdog: fix for lockup detector breakage on resume
Date: Mon, 30 Apr 2012 11:42:13 +0530
Message-ID: <4F9E2D3D.3000000@linux.vnet.ibm.com>
In-Reply-To: <1335550240-17765-1-git-send-email-snanda@chromium.org>

On 04/27/2012 11:40 PM, Sameer Nanda wrote:
> On the suspend/resume path the boot CPU does not go through an
> offline->online transition. This breaks the NMI detector
> post-resume since it depends on PMU state that is lost when
> the system gets suspended.
>
> Fix this by forcing a CPU offline->online transition for the
> lockup detector on the boot CPU during resume.
>
> Signed-off-by: Sameer Nanda <snanda@chromium.org>
> ---
> To provide more context, we enable NMI watchdog on
> Chrome OS. We have seen several reports of systems freezing
> up completely which indicated that the NMI watchdog was not
> firing for some reason.
>
> Debugging further, we found a simple way of repro'ing system
> freezes -- issuing the command 'taskset 1 sh -c "echo nmilockup > /proc/breakme"'
> after the system has been suspended/resumed one or more times.
>
> With this patch in place, the system freezes result in panics,
> as expected. These panics provide a nice stack trace for us
> to debug the actual issue causing the freeze.
>
>
> include/linux/sched.h | 4 ++++
> kernel/power/suspend.c | 3 +++
> kernel/watchdog.c | 16 ++++++++++++++++
> 3 files changed, 23 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 81a173c..118cc38 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -317,6 +317,7 @@ extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
> size_t *lenp, loff_t *ppos);
> extern unsigned int softlockup_panic;
> void lockup_detector_init(void);
> +void lockup_detector_bootcpu_resume(void);
> #else
> static inline void touch_softlockup_watchdog(void)
> {
> @@ -330,6 +331,9 @@ static inline void touch_all_softlockup_watchdogs(void)
> static inline void lockup_detector_init(void)
> {
> }
> +static inline void lockup_detector_bootcpu_resume(void)
> +{
> +}
> #endif
>
> #ifdef CONFIG_DETECT_HUNG_TASK
> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> index 396d262..0d262a8 100644
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -177,6 +177,9 @@ static int suspend_enter(suspend_state_t state, bool *wakeup)
> arch_suspend_enable_irqs();
> BUG_ON(irqs_disabled());
>
> + /* Kick the lockup detector */
> + lockup_detector_bootcpu_resume();
> +
> Enable_cpus:
> enable_nonboot_cpus();
>
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index df30ee0..dd2ac93 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -585,6 +585,22 @@ static struct notifier_block __cpuinitdata cpu_nfb = {
> .notifier_call = cpu_callback
> };
>
> +void lockup_detector_bootcpu_resume(void)
> +{
> + void *cpu = (void *)(long)smp_processor_id();
> +
> + /*
> + * On the suspend/resume path the boot CPU does not go through the
> + * offline->online transition. This breaks the NMI detector post
> + * resume. Force an offline->online transition for the boot CPU on
> + * resume.
> + */
> + cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
> + cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
> +

I have a few comments about this:

1. Strictly speaking, we should use the _FROZEN variants here, since the
   tasks are still frozen at this point. That is:

	cpu_callback(&cpu_nfb, CPU_DEAD_FROZEN, cpu);
	cpu_callback(&cpu_nfb, CPU_ONLINE_FROZEN, cpu);

   Right now the callback takes the same action for either variant (i.e.,
   with or without _FROZEN), so it doesn't really matter. But it is still
   good to be on the safer side, no?

2. Why are we skipping the CPU_UP_PREPARE_FROZEN callback?

3. How about hibernation? Don't we hit this problem there too?

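To illustrate point 1, here is a minimal user-space sketch of how the
_FROZEN notifications relate to the plain ones. The constant values are
copied from include/linux/cpu.h as of this kernel era; demo_cpu_callback(),
demo_tasks_frozen() and enum demo_state are hypothetical names made up for
the example and are not the real code in kernel/watchdog.c.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Values copied from include/linux/cpu.h of that era so the sketch
 * builds in user space; this is not kernel code.
 */
#define CPU_ONLINE		0x0002
#define CPU_UP_PREPARE		0x0003
#define CPU_DEAD		0x0007
#define CPU_TASKS_FROZEN	0x0010

#define CPU_ONLINE_FROZEN	(CPU_ONLINE | CPU_TASKS_FROZEN)
#define CPU_UP_PREPARE_FROZEN	(CPU_UP_PREPARE | CPU_TASKS_FROZEN)
#define CPU_DEAD_FROZEN		(CPU_DEAD | CPU_TASKS_FROZEN)

/* Compile-time check: a _FROZEN variant is the base action plus one flag. */
static_assert((CPU_DEAD_FROZEN & ~CPU_TASKS_FROZEN) == CPU_DEAD,
	      "_FROZEN must only add CPU_TASKS_FROZEN");

/* Hypothetical stand-in for the lockup detector state on one CPU. */
enum demo_state { DEMO_DISABLED, DEMO_PREPARED, DEMO_ENABLED };

/*
 * Hypothetical callback (not the real cpu_callback()): it masks off
 * CPU_TASKS_FROZEN so that both variants take the same action.
 */
static enum demo_state demo_cpu_callback(enum demo_state cur,
					 unsigned long action)
{
	switch (action & ~(unsigned long)CPU_TASKS_FROZEN) {
	case CPU_UP_PREPARE:
		return DEMO_PREPARED;
	case CPU_ONLINE:
		return DEMO_ENABLED;
	case CPU_DEAD:
		return DEMO_DISABLED;
	default:
		return cur;	/* unrelated notification: no change */
	}
}

/* True if the notification was sent while tasks were frozen. */
static bool demo_tasks_frozen(unsigned long action)
{
	return (action & CPU_TASKS_FROZEN) != 0;
}
```

A callback written this way behaves identically for CPU_DEAD and
CPU_DEAD_FROZEN today, while passing the _FROZEN value keeps the
"tasks are frozen" information available should a handler ever need it.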
> + return;
> +}
> +
> void __init lockup_detector_init(void)
> {
> void *cpu = (void *)(long)smp_processor_id();

Regards,
Srivatsa S. Bhat