From: "Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com>
To: Sameer Nanda <snanda@chromium.org>
Cc: mingo@redhat.com, peterz@infradead.org, len.brown@intel.com,
pavel@ucw.cz, rjw@sisk.pl, akpm@linux-foundation.org,
dzickus@redhat.com, msb@chromium.org,
linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
olofj@chromium.org
Subject: Re: [PATCH] watchdog: fix for lockup detector breakage on resume
Date: Mon, 30 Apr 2012 11:42:13 +0530
Message-ID: <4F9E2D3D.3000000@linux.vnet.ibm.com>
In-Reply-To: <1335550240-17765-1-git-send-email-snanda@chromium.org>

On 04/27/2012 11:40 PM, Sameer Nanda wrote:
> On the suspend/resume path the boot CPU does not go through an
> offline->online transition. This breaks the NMI detector
> post-resume since it depends on PMU state that is lost when
> the system gets suspended.
>
> Fix this by forcing a CPU offline->online transition for the
> lockup detector on the boot CPU during resume.
>
> Signed-off-by: Sameer Nanda <snanda@chromium.org>
> ---
> To provide more context, we enable NMI watchdog on
> Chrome OS. We have seen several reports of systems freezing
> up completely which indicated that the NMI watchdog was not
> firing for some reason.
>
> Debugging further, we found a simple way of repro'ing system
> freezes -- issuing the command 'taskset 1 sh -c "echo nmilockup > /proc/breakme"'
> after the system has been suspended/resumed one or more times.
>
> With this patch in place, the system freezes result in panics,
> as expected. These panics provide a nice stack trace for us
> to debug the actual issue causing the freeze.
>
>
> include/linux/sched.h | 4 ++++
> kernel/power/suspend.c | 3 +++
> kernel/watchdog.c | 16 ++++++++++++++++
> 3 files changed, 23 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 81a173c..118cc38 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -317,6 +317,7 @@ extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
> size_t *lenp, loff_t *ppos);
> extern unsigned int softlockup_panic;
> void lockup_detector_init(void);
> +void lockup_detector_bootcpu_resume(void);
> #else
> static inline void touch_softlockup_watchdog(void)
> {
> @@ -330,6 +331,9 @@ static inline void touch_all_softlockup_watchdogs(void)
> static inline void lockup_detector_init(void)
> {
> }
> +static inline void lockup_detector_bootcpu_resume(void)
> +{
> +}
> #endif
>
> #ifdef CONFIG_DETECT_HUNG_TASK
> diff --git a/kernel/power/suspend.c b/kernel/power/suspend.c
> index 396d262..0d262a8 100644
> --- a/kernel/power/suspend.c
> +++ b/kernel/power/suspend.c
> @@ -177,6 +177,9 @@ static int suspend_enter(suspend_state_t state, bool *wakeup)
> arch_suspend_enable_irqs();
> BUG_ON(irqs_disabled());
>
> + /* Kick the lockup detector */
> + lockup_detector_bootcpu_resume();
> +
> Enable_cpus:
> enable_nonboot_cpus();
>
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index df30ee0..dd2ac93 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -585,6 +585,22 @@ static struct notifier_block __cpuinitdata cpu_nfb = {
> .notifier_call = cpu_callback
> };
>
> +void lockup_detector_bootcpu_resume(void)
> +{
> + void *cpu = (void *)(long)smp_processor_id();
> +
> + /*
> + * On the suspend/resume path the boot CPU does not go through the
> + * offline->online transition. This breaks the NMI detector post
> + * resume. Force an offline->online transition for the boot CPU on
> + * resume.
> + */
> + cpu_callback(&cpu_nfb, CPU_DEAD, cpu);
> + cpu_callback(&cpu_nfb, CPU_ONLINE, cpu);
> +

I have a couple of comments about this:

1. Strictly speaking, we should be using the _FROZEN variants here (since the
   tasks are still frozen), i.e.:

	cpu_callback(&cpu_nfb, CPU_DEAD_FROZEN, cpu);
	cpu_callback(&cpu_nfb, CPU_ONLINE_FROZEN, cpu);

   Right now the same action is taken for either variant (i.e., with or
   without _FROZEN), so it doesn't really matter. But still, it is good to
   be on the safe side, no?

2. Why are we skipping the CPU_UP_PREPARE_FROZEN callback?

3. How about hibernation? Don't we hit this problem there?

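
To illustrate point 1, here is a minimal userspace sketch (not kernel
code) of a callback that takes the same action for the plain and the
_FROZEN variant of each event. Everything prefixed with mock_ is
hypothetical, and the numeric values of the CPU_* constants are
assumptions made for this sketch only; the point is just that passing
the _FROZEN variants changes nothing today, yet keeps the caller correct
if the two cases ever diverge.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed stand-in values for this sketch; not taken from the patch. */
#define CPU_ONLINE		0x0002
#define CPU_UP_PREPARE		0x0003
#define CPU_DEAD		0x0007
#define CPU_TASKS_FROZEN	0x0010

#define CPU_ONLINE_FROZEN	(CPU_ONLINE | CPU_TASKS_FROZEN)
#define CPU_UP_PREPARE_FROZEN	(CPU_UP_PREPARE | CPU_TASKS_FROZEN)
#define CPU_DEAD_FROZEN		(CPU_DEAD | CPU_TASKS_FROZEN)

#define MOCK_NR_CPUS 4

/* Hypothetical per-CPU watchdog state. */
static int mock_wd_prepared[MOCK_NR_CPUS];
static int mock_wd_enabled[MOCK_NR_CPUS];

/*
 * Hypothetical stand-in for cpu_callback(): each event and its _FROZEN
 * variant share one case body. Returns 1 if the action was handled and
 * 0 if it was ignored.
 */
static int mock_cpu_callback(unsigned long action, int cpu)
{
	assert(cpu >= 0 && cpu < MOCK_NR_CPUS);

	switch (action) {
	case CPU_UP_PREPARE:
	case CPU_UP_PREPARE_FROZEN:
		mock_wd_prepared[cpu] = 1;
		return 1;
	case CPU_ONLINE:
	case CPU_ONLINE_FROZEN:
		mock_wd_enabled[cpu] = 1;
		return 1;
	case CPU_DEAD:
	case CPU_DEAD_FROZEN:
		mock_wd_enabled[cpu] = 0;
		return 1;
	default:
		return 0;	/* unrelated events are ignored */
	}
}

/*
 * Mimics the forced offline->online transition from the patch, with or
 * without the _FROZEN flag. Returns the resulting "enabled" state.
 */
static int mock_bootcpu_resume(int cpu, int tasks_frozen)
{
	unsigned long flag = tasks_frozen ? CPU_TASKS_FROZEN : 0;

	mock_cpu_callback(CPU_DEAD | flag, cpu);
	mock_cpu_callback(CPU_ONLINE | flag, cpu);
	printf("cpu %d: watchdog enabled=%d (frozen=%d)\n",
	       cpu, mock_wd_enabled[cpu], tasks_frozen);
	return mock_wd_enabled[cpu];
}
```

In this sketch, mock_bootcpu_resume(0, 0) and mock_bootcpu_resume(0, 1)
leave the mock watchdog in the same state, which is the "doesn't matter
right now" case described in point 1.
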
> + return;
> +}
> +
> void __init lockup_detector_init(void)
> {
> void *cpu = (void *)(long)smp_processor_id();
Regards,
Srivatsa S. Bhat