* [PATCH] watchdog: Add a sysctl to disable soft lockup detector @ 2013-12-03 21:54 Ben Zhang 2013-12-03 22:27 ` Andrew Morton 2013-12-04 21:29 ` Don Zickus 0 siblings, 2 replies; 7+ messages in thread From: Ben Zhang @ 2013-12-03 21:54 UTC (permalink / raw) To: linux-kernel Cc: Don Zickus, Andrew Morton, Ingo Molnar, Frederic Weisbecker, Ben Zhang This provides usermode a way to disable only the soft lockup detector while keeping the hard lockup detector running. kernel.softlockup_detector_enable=1: This is the default. The soft lockup detector is enabled. When a soft lockup is detected, a warning message with debug info is printed. The kernel may be configured to panics in this case via the sysctl kernel.softlockup_panic. kernel.softlockup_detector_enable=0: The soft lockup detector is disabled. Warning message is not printed on soft lockup. The kernel does not panic on soft lockup regardless of the value of kernel.softlockup_panic. Note kernel.softlockup_detector_enable does not affect the hard lockup detector. Signed-off-by: Ben Zhang <benzh@chromium.org> --- include/linux/sched.h | 1 + kernel/sysctl.c | 9 +++++++++ kernel/watchdog.c | 15 +++++++++++++++ 3 files changed, 25 insertions(+) diff --git a/include/linux/sched.h b/include/linux/sched.h index 768b037..93ebec4 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -270,6 +270,7 @@ extern int proc_dowatchdog_thresh(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); extern unsigned int softlockup_panic; +extern unsigned int softlockup_detector_enable; void lockup_detector_init(void); #else static inline void touch_softlockup_watchdog(void) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 34a6047..8ae1f36 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -840,6 +840,15 @@ static struct ctl_table kern_table[] = { .extra2 = &one, }, { + .procname = "softlockup_detector_enable", + .data = &softlockup_detector_enable, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + .extra2 = &one, + }, + { .procname = "nmi_watchdog", .data = &watchdog_user_enabled, .maxlen = sizeof (int), diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 4431610..b9594e6 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -80,6 +80,18 @@ static int __init softlockup_panic_setup(char *str) } __setup("softlockup_panic=", softlockup_panic_setup); +unsigned int __read_mostly softlockup_detector_enable = 1; + +static int __init softlockup_detector_enable_setup(char *str) +{ + unsigned long res; + if (kstrtoul(str, 0, &res)) + res = 1; + softlockup_detector_enable = res; + return 1; +} +__setup("softlockup_detector_enable=", softlockup_detector_enable_setup); + static int __init nowatchdog_setup(char *str) { watchdog_user_enabled = 0; @@ -293,6 +305,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) return HRTIMER_RESTART; } + if (!softlockup_detector_enable) + return HRTIMER_RESTART; + /* check for a softlockup * This is done by making sure a high priority task is * being scheduled. The task touches the watchdog to -- 1.8.4.1 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] watchdog: Add a sysctl to disable soft lockup detector 2013-12-03 21:54 [PATCH] watchdog: Add a sysctl to disable soft lockup detector Ben Zhang @ 2013-12-03 22:27 ` Andrew Morton 2013-12-04 21:29 ` Don Zickus 1 sibling, 0 replies; 7+ messages in thread From: Andrew Morton @ 2013-12-03 22:27 UTC (permalink / raw) To: Ben Zhang; +Cc: linux-kernel, Don Zickus, Ingo Molnar, Frederic Weisbecker On Tue, 3 Dec 2013 13:54:34 -0800 Ben Zhang <benzh@chromium.org> wrote: > This provides usermode a way to disable only the soft > lockup detector while keeping the hard lockup detector > running. Please update the changelog to describe the current behavior. Please also describe why you think that behavior should be changed. ie: what's the reason for this patch. Please update Documentation/ for this feature. Probably that's kernel-parameters.txt for the boot option and sysctl/kernel.txt for the procfs addition. > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -270,6 +270,7 @@ extern int proc_dowatchdog_thresh(struct ctl_table *table, int write, > void __user *buffer, > size_t *lenp, loff_t *ppos); > extern unsigned int softlockup_panic; > +extern unsigned int softlockup_detector_enable; Remove unneeded space while we're in there. > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -840,6 +840,15 @@ static struct ctl_table kern_table[] = { > .extra2 = &one, > }, > { > + .procname = "softlockup_detector_enable", > + .data = &softlockup_detector_enable, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = proc_dointvec_minmax, > + .extra1 = &zero, > + .extra2 = &one, > + }, And let's describe the full procfs path to this pseudo-file within the changelog. > .procname = "nmi_watchdog", > .data = &watchdog_user_enabled, > .maxlen = sizeof (int), ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: [PATCH] watchdog: Add a sysctl to disable soft lockup detector 2013-12-03 21:54 [PATCH] watchdog: Add a sysctl to disable soft lockup detector Ben Zhang 2013-12-03 22:27 ` Andrew Morton @ 2013-12-04 21:29 ` Don Zickus 2013-12-05 1:55 ` [PATCH v2] " Ben Zhang 1 sibling, 1 reply; 7+ messages in thread From: Don Zickus @ 2013-12-04 21:29 UTC (permalink / raw) To: Ben Zhang; +Cc: linux-kernel, Andrew Morton, Ingo Molnar, Frederic Weisbecker On Tue, Dec 03, 2013 at 01:54:34PM -0800, Ben Zhang wrote: > This provides usermode a way to disable only the soft > lockup detector while keeping the hard lockup detector > running. > > kernel.softlockup_detector_enable=1: > This is the default. The soft lockup detector is enabled. > When a soft lockup is detected, a warning message with > debug info is printed. The kernel may be configured to > panics in this case via the sysctl kernel.softlockup_panic. > > kernel.softlockup_detector_enable=0: > The soft lockup detector is disabled. Warning message is > not printed on soft lockup. The kernel does not panic on > soft lockup regardless of the value of kernel.softlockup_panic. > Note kernel.softlockup_detector_enable does not affect > the hard lockup detector. As Andrew said, I wouldn't mind see an explaination for the need. The softlockup was designed to be always running when the hardlockup detector was running (though this patch is cute way to work around that). So your patch disables the check but not the overhead. I am not sure if that would confuse an end user or not. Cheers, Don > > Signed-off-by: Ben Zhang <benzh@chromium.org> > --- > include/linux/sched.h | 1 + > kernel/sysctl.c | 9 +++++++++ > kernel/watchdog.c | 15 +++++++++++++++ > 3 files changed, 25 insertions(+) > > diff --git a/include/linux/sched.h b/include/linux/sched.h > index 768b037..93ebec4 100644 > --- a/include/linux/sched.h > +++ b/include/linux/sched.h > @@ -270,6 +270,7 @@ extern int proc_dowatchdog_thresh(struct ctl_table *table, int write, > void __user *buffer, > size_t *lenp, loff_t *ppos); > extern unsigned int softlockup_panic; > +extern unsigned int softlockup_detector_enable; > void lockup_detector_init(void); > #else > static inline void touch_softlockup_watchdog(void) > diff --git a/kernel/sysctl.c b/kernel/sysctl.c > index 34a6047..8ae1f36 100644 > --- a/kernel/sysctl.c > +++ b/kernel/sysctl.c > @@ -840,6 +840,15 @@ static struct ctl_table kern_table[] = { > .extra2 = &one, > }, > { > + .procname = "softlockup_detector_enable", > + .data = &softlockup_detector_enable, > + .maxlen = sizeof(int), > + .mode = 0644, > + .proc_handler = proc_dointvec_minmax, > + .extra1 = &zero, > + .extra2 = &one, > + }, > + { > .procname = "nmi_watchdog", > .data = &watchdog_user_enabled, > .maxlen = sizeof (int), > diff --git a/kernel/watchdog.c b/kernel/watchdog.c > index 4431610..b9594e6 100644 > --- a/kernel/watchdog.c > +++ b/kernel/watchdog.c > @@ -80,6 +80,18 @@ static int __init softlockup_panic_setup(char *str) > } > __setup("softlockup_panic=", softlockup_panic_setup); > > +unsigned int __read_mostly softlockup_detector_enable = 1; > + > +static int __init softlockup_detector_enable_setup(char *str) > +{ > + unsigned long res; > + if (kstrtoul(str, 0, &res)) > + res = 1; > + softlockup_detector_enable = res; > + return 1; > +} > +__setup("softlockup_detector_enable=", softlockup_detector_enable_setup); > + > static int __init nowatchdog_setup(char *str) > { > watchdog_user_enabled = 0; > @@ -293,6 +305,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) > return HRTIMER_RESTART; > } > > + if (!softlockup_detector_enable) > + return HRTIMER_RESTART; > + > /* check for a softlockup > * This is done by making sure a high priority task is > * being scheduled. The task touches the watchdog to > -- > 1.8.4.1 > ^ permalink raw reply [flat|nested] 7+ messages in thread
* [PATCH v2] watchdog: Add a sysctl to disable soft lockup detector 2013-12-04 21:29 ` Don Zickus @ 2013-12-05 1:55 ` Ben Zhang 2013-12-05 3:12 ` Don Zickus 0 siblings, 1 reply; 7+ messages in thread From: Ben Zhang @ 2013-12-05 1:55 UTC (permalink / raw) To: linux-kernel Cc: Don Zickus, Andrew Morton, Ingo Molnar, Frederic Weisbecker, Ben Zhang Currently, the soft lockup detector and hard lockup detector can be enabled or disabled together via the flag variable watchdog_user_enabled. There isn't a way to disable only the soft lockup detector while keeping the hard lockup detector running. The hard lockup detector sometimes does not work on a x86 machine with multiple cpus when softlockup_panic is set to 0. For example: 1. Hard lockup occurs on cpu0 ("cli" followed by a infinite loop). 2. Soft lockup occurs on cpu1 shortly after because cpu1 tries to send a function to cpu0 via smp_call_function_single(). 3. watchdog_timer_fn() detects the soft lockup on cpu1 and dumps the stack. dump_stack() eventually calls touch_nmi_watchdog() which sets watchdog_nmi_touch=true for all cpus and sets watchdog_touch_ts=0 for cpu1. 4. NMI fires on cpu0. watchdog_overflow_callback() sees watchdog_nmi_touch=true, so it does not do anything except setting watchdog_nmi_touch=false. 5. watchdog_timer_fn() is called again on cpu1, it sees watchdog_touch_ts=0, so reloads it with the current tick. Thus, is_softlockup() returns false, and soft_watchdog_warn is set to false. 6. Before NMI can fire on cpu0 again with watchdog_nmi_touch=false, watchdog_timer_fn() reports the soft lockup on cpu1 again and we go back to #3. The machine stays locked up and the log shows repeated reports of soft lockup on cpu1. Therefore, we need a way to disable the soft lockup check so that the hard lockup detector can reboot the machine. * Existing boot options for the watchdog: nmi_watchdog=panic/nopanic/0 softlockup_panic=0/1 nowatchdog nosoftlockup * Variables modified by the boot options: int watchdog_user_enabled; unsigned int softlockup_panic; unsigned int hardlockup_panic; * Existing sysctls at /proc/sys/kernel/... for the watchdog: nmi_watchdog=0/1 watchdog=0/1 softlockup_panic=0/1 watchdog_thresh=0~60 * Variables modified by the sysctls: int watchdog_user_enabled; unsigned int softlockup_panic; int watchdog_thresh; This patch adds a new boot option softlockup_detector_enable and a sysctl at /proc/sys/kernel/softlockup_detector_enable to allow disabling only the soft lockup detector. softlockup_detector_enable=1: This is the default. The soft lockup detector is enabled. When a soft lockup is detected, a warning message with debug info is printed. The kernel may be configured to panics in this case via the sysctl kernel.softlockup_panic. softlockup_detector_enable=0: The soft lockup detector is disabled. Warning message is not printed on soft lockup. The kernel does not panic on soft lockup regardless of the value of kernel.softlockup_panic. Note kernel.softlockup_detector_enable does not affect the hard lockup detector. Signed-off-by: Ben Zhang <benzh@chromium.org> --- Documentation/kernel-parameters.txt | 11 +++++++++++ Documentation/sysctl/kernel.txt | 20 ++++++++++++++++++++ include/linux/sched.h | 3 ++- kernel/sysctl.c | 9 +++++++++ kernel/watchdog.c | 15 +++++++++++++++ 5 files changed, 57 insertions(+), 1 deletion(-) diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt index 50680a5..5678ac3 100644 --- a/Documentation/kernel-parameters.txt +++ b/Documentation/kernel-parameters.txt @@ -2980,6 +2980,17 @@ bytes respectively. Such letter suffixes can also be entirely omitted. 1: Fast pin select (default) 2: ATC IRMode + softlockup_detector_enable= + [KNL] Should the soft-lockup detector be enabled. If + the soft-lockup detector is disabled, no warning + message is printed on soft lockup, and the kernel does + not panic on soft lockup regardless of the value of + softlockup_panic. softlockup_detector_enable does not + affect the hard lockup detector. + If this parameter is not present, the soft-lockup + detector is enabled by default. + Format: <integer> + softlockup_panic= [KNL] Should the soft-lockup detector generate panics. Format: <integer> diff --git a/Documentation/sysctl/kernel.txt b/Documentation/sysctl/kernel.txt index 26b7ee4..209212e 100644 --- a/Documentation/sysctl/kernel.txt +++ b/Documentation/sysctl/kernel.txt @@ -70,6 +70,7 @@ show up in /proc/sys/kernel: - shmall - shmmax [ sysv ipc ] - shmmni +- softlockup_detector_enable - stop-a [ SPARC only ] - sysrq ==> Documentation/sysrq.txt - tainted @@ -718,6 +719,25 @@ without users and with a dead originative process will be destroyed. ============================================================== +softlockup_detector_enable: + +Should the soft-lockup detector be enabled. + +softlockup_detector_enable=1: +This is the default. The soft lockup detector is enabled. +When a soft lockup is detected, a warning message with +debug info is printed. The kernel may be configured to +panics in this case via the sysctl kernel.softlockup_panic. + +softlockup_detector_enable=0: +The soft lockup detector is disabled. Warning message is +not printed on soft lockup. The kernel does not panic on +soft lockup regardless of the value of kernel.softlockup_panic. +Note kernel.softlockup_detector_enable does not affect +the hard lockup detector. + +============================================================== + tainted: Non-zero if the kernel has been tainted. Numeric values, which diff --git a/include/linux/sched.h b/include/linux/sched.h index 768b037..6d3749d 100644 --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -269,7 +269,8 @@ extern void touch_all_softlockup_watchdogs(void); extern int proc_dowatchdog_thresh(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); -extern unsigned int softlockup_panic; +extern unsigned int softlockup_panic; +extern unsigned int softlockup_detector_enable; void lockup_detector_init(void); #else static inline void touch_softlockup_watchdog(void) diff --git a/kernel/sysctl.c b/kernel/sysctl.c index 34a6047..8ae1f36 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -840,6 +840,15 @@ static struct ctl_table kern_table[] = { .extra2 = &one, }, { + .procname = "softlockup_detector_enable", + .data = &softlockup_detector_enable, + .maxlen = sizeof(int), + .mode = 0644, + .proc_handler = proc_dointvec_minmax, + .extra1 = &zero, + .extra2 = &one, + }, + { .procname = "nmi_watchdog", .data = &watchdog_user_enabled, .maxlen = sizeof (int), diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 4431610..b9594e6 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -80,6 +80,18 @@ static int __init softlockup_panic_setup(char *str) } __setup("softlockup_panic=", softlockup_panic_setup); +unsigned int __read_mostly softlockup_detector_enable = 1; + +static int __init softlockup_detector_enable_setup(char *str) +{ + unsigned long res; + if (kstrtoul(str, 0, &res)) + res = 1; + softlockup_detector_enable = res; + return 1; +} +__setup("softlockup_detector_enable=", softlockup_detector_enable_setup); + static int __init nowatchdog_setup(char *str) { watchdog_user_enabled = 0; @@ -293,6 +305,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct hrtimer *hrtimer) return HRTIMER_RESTART; } + if (!softlockup_detector_enable) + return HRTIMER_RESTART; + /* check for a softlockup * This is done by making sure a high priority task is * being scheduled. The task touches the watchdog to -- 1.8.5.1 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH v2] watchdog: Add a sysctl to disable soft lockup detector 2013-12-05 1:55 ` [PATCH v2] " Ben Zhang @ 2013-12-05 3:12 ` Don Zickus 2013-12-05 20:42 ` [PATCH] watchdog: touch_nmi_watchdog should only touch local cpu not every one Ben Zhang 0 siblings, 1 reply; 7+ messages in thread From: Don Zickus @ 2013-12-05 3:12 UTC (permalink / raw) To: Ben Zhang; +Cc: linux-kernel, Andrew Morton, Ingo Molnar, Frederic Weisbecker On Wed, Dec 04, 2013 at 05:55:56PM -0800, Ben Zhang wrote: > Currently, the soft lockup detector and hard lockup detector > can be enabled or disabled together via the flag variable > watchdog_user_enabled. There isn't a way to disable only the > soft lockup detector while keeping the hard lockup detector > running. > > The hard lockup detector sometimes does not work on a x86 > machine with multiple cpus when softlockup_panic is set to 0. > For example: > 1. Hard lockup occurs on cpu0 ("cli" followed by a infinite loop). > 2. Soft lockup occurs on cpu1 shortly after because cpu1 tries to > send a function to cpu0 via smp_call_function_single(). > 3. watchdog_timer_fn() detects the soft lockup on cpu1 and > dumps the stack. dump_stack() eventually calls touch_nmi_watchdog() > which sets watchdog_nmi_touch=true for all cpus and sets > watchdog_touch_ts=0 for cpu1. > 4. NMI fires on cpu0. watchdog_overflow_callback() sees > watchdog_nmi_touch=true, so it does not do anything except setting > watchdog_nmi_touch=false. > 5. watchdog_timer_fn() is called again on cpu1, it sees > watchdog_touch_ts=0, so reloads it with the current tick. Thus, > is_softlockup() returns false, and soft_watchdog_warn is set to false. > 6. Before NMI can fire on cpu0 again with watchdog_nmi_touch=false, > watchdog_timer_fn() reports the soft lockup on cpu1 again > and we go back to #3. Yup. This is because touch_nmi_watchdog touches _all_ the cpus instead of its own. I tried to fix this 3 years ago, but Andrew wanted me to fix something else in the panic code first as a trade for his ack in changing the semantics of touch_nmi_watchdog. ;-p I doubt this patch still applies but the concept is pretty simple, touch only the local cpu not all of them. Should be pretty easy to port. Not entirely sure if this would be accepted by folks, but I think it would address your fundamental problem. Cheers, Don -------------------8<------------------------ From: Don Zickus <dzickus@redhat.com> Date: Thu, 4 Nov 2010 20:53:03 -0400 Subject: [PATCH v3] watchdog: touch_nmi_watchdog should only touch local cpu not every one I ran into a scenario where while one cpu was stuck and should have panic'd because of the NMI watchdog, it didn't. The reason was another cpu was spewing stack dumps on to the console. Upon investigation, I noticed that when writing to the console and also when dumping the stack, the watchdog is touched. This causes all the cpus to reset their NMI watchdog flags and the 'stuck' cpu just spins forever. This change causes the semantics of touch_nmi_watchdog to be changed slightly. Previously, I accidentally changed the semantics and we noticed there was a codepath in which touch_nmi_watchdog could be touched from a preemtible area. That caused a BUG() to happen when CONFIG_DEBUG_PREEMPT was enabled. I believe it was the acpi code. My attempt here re-introduces the change to have the touch_nmi_watchdog() code only touch the local cpu instead of all of the cpus. But instead of using __get_cpu_var(), I use the __raw_get_cpu_var() version. This avoids the preemption problem. However my reasoning wasn't because I was trying to be lazy. Instead I rationalized it as, well if preemption is enabled then interrupts should be enabled to and the NMI watchdog will have no reason to trigger. So it won't matter if the wrong cpu is touched because the percpu interrupt counters the NMI watchdog uses should still be incrementing. V3: Really remove touch_all_nmi_watchdogs() V2: Remove touch_all_nmi_watchdogs() Signed-off-by: Don Zickus <dzickus@redhat.com> --- kernel/watchdog.c | 17 +++++++++-------- 1 files changed, 9 insertions(+), 8 deletions(-) diff --git a/kernel/watchdog.c b/kernel/watchdog.c index dc8e168..09fddd7 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -141,14 +141,15 @@ void touch_all_softlockup_watchdogs(void) #ifdef CONFIG_HARDLOCKUP_DETECTOR void touch_nmi_watchdog(void) { - if (watchdog_enabled) { - unsigned cpu; + /* + * Using __raw here because some code paths have + * preemption enabled. If preemption is enabled + * then interrupts should be enabled too, in which + * case we shouldn't have to worry about the watchdog + * going off. + */ + __raw_get_cpu_var(watchdog_nmi_touch) = true; - for_each_present_cpu(cpu) { - if (per_cpu(watchdog_nmi_touch, cpu) != true) - per_cpu(watchdog_nmi_touch, cpu) = true; - } - } touch_softlockup_watchdog(); } EXPORT_SYMBOL(touch_nmi_watchdog); -- 1.7.2.3 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* [PATCH] watchdog: touch_nmi_watchdog should only touch local cpu not every one 2013-12-05 3:12 ` Don Zickus @ 2013-12-05 20:42 ` Ben Zhang 2013-12-16 15:55 ` Don Zickus 0 siblings, 1 reply; 7+ messages in thread From: Ben Zhang @ 2013-12-05 20:42 UTC (permalink / raw) To: linux-kernel Cc: Don Zickus, Andrew Morton, Ingo Molnar, Frederic Weisbecker, Ben Zhang I ran into a scenario where while one cpu was stuck and should have panic'd because of the NMI watchdog, it didn't. The reason was another cpu was spewing stack dumps on to the console. Upon investigation, I noticed that when writing to the console and also when dumping the stack, the watchdog is touched. This causes all the cpus to reset their NMI watchdog flags and the 'stuck' cpu just spins forever. This change causes the semantics of touch_nmi_watchdog to be changed slightly. Previously, I accidentally changed the semantics and we noticed there was a codepath in which touch_nmi_watchdog could be touched from a preemtible area. That caused a BUG() to happen when CONFIG_DEBUG_PREEMPT was enabled. I believe it was the acpi code. My attempt here re-introduces the change to have the touch_nmi_watchdog() code only touch the local cpu instead of all of the cpus. But instead of using __get_cpu_var(), I use the __raw_get_cpu_var() version. This avoids the preemption problem. However my reasoning wasn't because I was trying to be lazy. Instead I rationalized it as, well if preemption is enabled then interrupts should be enabled to and the NMI watchdog will have no reason to trigger. So it won't matter if the wrong cpu is touched because the percpu interrupt counters the NMI watchdog uses should still be incrementing. Signed-off-by: Don Zickus <dzickus@redhat.com> Signed-off-by: Ben Zhang <benzh@chromium.org> --- kernel/watchdog.c | 16 ++++++++-------- 1 file changed, 8 insertions(+), 8 deletions(-) diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 4431610..613f611 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -158,14 +158,14 @@ void touch_all_softlockup_watchdogs(void) #ifdef CONFIG_HARDLOCKUP_DETECTOR void touch_nmi_watchdog(void) { - if (watchdog_user_enabled) { - unsigned cpu; - - for_each_present_cpu(cpu) { - if (per_cpu(watchdog_nmi_touch, cpu) != true) - per_cpu(watchdog_nmi_touch, cpu) = true; - } - } + /* + * Using __raw here because some code paths have + * preemption enabled. If preemption is enabled + * then interrupts should be enabled too, in which + * case we shouldn't have to worry about the watchdog + * going off. + */ + __raw_get_cpu_var(watchdog_nmi_touch) = true; touch_softlockup_watchdog(); } EXPORT_SYMBOL(touch_nmi_watchdog); -- 1.8.5.1 ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: [PATCH] watchdog: touch_nmi_watchdog should only touch local cpu not every one 2013-12-05 20:42 ` [PATCH] watchdog: touch_nmi_watchdog should only touch local cpu not every one Ben Zhang @ 2013-12-16 15:55 ` Don Zickus 0 siblings, 0 replies; 7+ messages in thread From: Don Zickus @ 2013-12-16 15:55 UTC (permalink / raw) To: Ben Zhang; +Cc: linux-kernel, Andrew Morton, Ingo Molnar, Frederic Weisbecker On Thu, Dec 05, 2013 at 12:42:24PM -0800, Ben Zhang wrote: > I ran into a scenario where while one cpu was stuck and should have panic'd > because of the NMI watchdog, it didn't. The reason was another cpu was spewing > stack dumps on to the console. Upon investigation, I noticed that when writing > to the console and also when dumping the stack, the watchdog is touched. > > This causes all the cpus to reset their NMI watchdog flags and the 'stuck' cpu > just spins forever. > > This change causes the semantics of touch_nmi_watchdog to be changed slightly. > Previously, I accidentally changed the semantics and we noticed there was a > codepath in which touch_nmi_watchdog could be touched from a preemtible area. > That caused a BUG() to happen when CONFIG_DEBUG_PREEMPT was enabled. I believe > it was the acpi code. > > My attempt here re-introduces the change to have the touch_nmi_watchdog() code > only touch the local cpu instead of all of the cpus. But instead of using > __get_cpu_var(), I use the __raw_get_cpu_var() version. > > This avoids the preemption problem. However my reasoning wasn't because I was > trying to be lazy. Instead I rationalized it as, well if preemption is enabled > then interrupts should be enabled to and the NMI watchdog will have no reason > to trigger. So it won't matter if the wrong cpu is touched because the percpu > interrupt counters the NMI watchdog uses should still be incrementing. Hi Andrew, I'm ok with this patch, though it does alter the behaviour of how touch_nmi_watchdog works. For the most part I don't think most callers need to touch all of the watchdogs (on each cpu). Perhaps a corner case will pop up (the scheduler?? to mimic touch_all_softlockup_watchdogs() ). But this does address an issue where if a system is locked up and one cpu is spewing out useful debug messages (or error messages), the hard lockup will fail to go off. We have seen this on RHEL also. I wouldn't mind see this get soaked in linux-next if possible. And see what falls out. Cheers, Don > > Signed-off-by: Don Zickus <dzickus@redhat.com> > Signed-off-by: Ben Zhang <benzh@chromium.org> > --- > kernel/watchdog.c | 16 ++++++++-------- > 1 file changed, 8 insertions(+), 8 deletions(-) > > diff --git a/kernel/watchdog.c b/kernel/watchdog.c > index 4431610..613f611 100644 > --- a/kernel/watchdog.c > +++ b/kernel/watchdog.c > @@ -158,14 +158,14 @@ void touch_all_softlockup_watchdogs(void) > #ifdef CONFIG_HARDLOCKUP_DETECTOR > void touch_nmi_watchdog(void) > { > - if (watchdog_user_enabled) { > - unsigned cpu; > - > - for_each_present_cpu(cpu) { > - if (per_cpu(watchdog_nmi_touch, cpu) != true) > - per_cpu(watchdog_nmi_touch, cpu) = true; > - } > - } > + /* > + * Using __raw here because some code paths have > + * preemption enabled. If preemption is enabled > + * then interrupts should be enabled too, in which > + * case we shouldn't have to worry about the watchdog > + * going off. > + */ > + __raw_get_cpu_var(watchdog_nmi_touch) = true; > touch_softlockup_watchdog(); > } > EXPORT_SYMBOL(touch_nmi_watchdog); > -- > 1.8.5.1 > ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2013-12-16 15:55 UTC | newest] Thread overview: 7+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-12-03 21:54 [PATCH] watchdog: Add a sysctl to disable soft lockup detector Ben Zhang 2013-12-03 22:27 ` Andrew Morton 2013-12-04 21:29 ` Don Zickus 2013-12-05 1:55 ` [PATCH v2] " Ben Zhang 2013-12-05 3:12 ` Don Zickus 2013-12-05 20:42 ` [PATCH] watchdog: touch_nmi_watchdog should only touch local cpu not every one Ben Zhang 2013-12-16 15:55 ` Don Zickus
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox