* NMI: Enable watchdog by default
@ 2012-03-07 16:55 Andrew Cooper
2012-03-07 17:08 ` Jan Beulich
2012-03-07 17:11 ` Keir Fraser
0 siblings, 2 replies; 8+ messages in thread
From: Andrew Cooper @ 2012-03-07 16:55 UTC (permalink / raw)
To: xen-devel@lists.xensource.com; +Cc: Keir Fraser, Jan Beulich
[-- Attachment #1: Type: text/plain, Size: 270 bytes --]
This patch is based on one which has been in XenServer for a very long.
To keep the trend of documentation going, it also corrects the new
command line document.
--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
[-- Attachment #2: enable-nmi-watchdog.patch --]
[-- Type: text/x-patch, Size: 2083 bytes --]
# HG changeset patch
# Parent e193375fb82af819eb55a1189308fcd4c1e8b40f
NMI: Enable watchdog by default.
This should make hung conditions more visible. Change the timeout from 5
seconds to 300, as several operations on large guests take far longer than 5
seconds.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
diff -r e193375fb82a docs/misc/xen-command-line.markdown
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -391,7 +391,10 @@ The optional `keep` parameter causes Xen
### watchdog
> `= <boolean>`
-Run an NMI watchdog on each processor. Defaults to disabled.
+> Default: `true`
+
+Run an NMI watchdog on each processor. A CPU hung for more than 300 seconds will result in a panic, making hung conditions more visible.
+
### x2apic
### x2apic\_phys
### xencons
diff -r e193375fb82a xen/arch/x86/nmi.c
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -425,11 +425,12 @@ void nmi_watchdog_tick(struct cpu_user_r
!atomic_read(&watchdog_disable_count) )
{
/*
- * Ayiee, looks like this CPU is stuck ... wait a few IRQs (5 seconds)
- * before doing the oops ...
+ * Ayiee, looks like this CPU is stuck.
+ * Wait 5 minutes before panic because some actions on large guests
+ * can take many seconds to complete.
*/
this_cpu(alert_counter)++;
- if ( this_cpu(alert_counter) == 5*nmi_hz )
+ if ( this_cpu(alert_counter) == 300*nmi_hz )
{
console_force_unlock();
printk("Watchdog timer detects that CPU%d is stuck!\n",
diff -r e193375fb82a xen/arch/x86/setup.c
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -55,7 +55,7 @@ static unsigned int __initdata max_cpus;
integer_param("maxcpus", max_cpus);
/* opt_watchdog: If true, run a watchdog NMI on each processor. */
-static bool_t __initdata opt_watchdog;
+static bool_t __initdata opt_watchdog = 1;
boolean_param("watchdog", opt_watchdog);
/* smep: Enable/disable Supervisor Mode Execution Protection (default on). */
[-- Attachment #3: Type: text/plain, Size: 126 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
^ permalink raw reply [flat|nested] 8+ messages in thread* Re: NMI: Enable watchdog by default 2012-03-07 16:55 NMI: Enable watchdog by default Andrew Cooper @ 2012-03-07 17:08 ` Jan Beulich 2012-03-07 17:11 ` Keir Fraser 1 sibling, 0 replies; 8+ messages in thread From: Jan Beulich @ 2012-03-07 17:08 UTC (permalink / raw) To: Andrew Cooper; +Cc: Keir Fraser, xen-devel >>> On 07.03.12 at 17:55, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > This patch is based on one which has been in XenServer for a very long. I'm not sure this is a good idea - people with systems where the watchdog doesn't work will likely start complaining about why it doesn't work, as that's going to be always visible in the log. Furthermore, a timeout of 5 min is just unacceptably large. If we really still have problems with the 5 sec timeout, we should fix those rather than bumping the timeout - no timer events for 5 sec is already way too long. Jan ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: NMI: Enable watchdog by default 2012-03-07 16:55 NMI: Enable watchdog by default Andrew Cooper 2012-03-07 17:08 ` Jan Beulich @ 2012-03-07 17:11 ` Keir Fraser 2012-03-07 17:35 ` George Dunlap 1 sibling, 1 reply; 8+ messages in thread From: Keir Fraser @ 2012-03-07 17:11 UTC (permalink / raw) To: Andrew Cooper, xen-devel@lists.xensource.com; +Cc: Jan Beulich On 07/03/2012 16:55, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote: > This patch is based on one which has been in XenServer for a very long. > > To keep the trend of documentation going, it also corrects the new > command line document. This does not only enable the watchdog by default, but also changes the timeout from 5 seconds to 5 *minutes*! Even if that is a good idea in some cases, we'd at least have to make the timeout configurable. Developers will not appreciate having to wait for 5 minutes for their lockups to produce useful trace output. -- Keir ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: NMI: Enable watchdog by default 2012-03-07 17:11 ` Keir Fraser @ 2012-03-07 17:35 ` George Dunlap 2012-03-07 17:41 ` Keir Fraser 0 siblings, 1 reply; 8+ messages in thread From: George Dunlap @ 2012-03-07 17:35 UTC (permalink / raw) To: Keir Fraser; +Cc: Andrew Cooper, xen-devel@lists.xensource.com, Jan Beulich On Wed, Mar 7, 2012 at 5:11 PM, Keir Fraser <keir@xen.org> wrote: > On 07/03/2012 16:55, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote: > >> This patch is based on one which has been in XenServer for a very long. >> >> To keep the trend of documentation going, it also corrects the new >> command line document. > > This does not only enable the watchdog by default, but also changes the > timeout from 5 seconds to 5 *minutes*! > > Even if that is a good idea in some cases, we'd at least have to make the > timeout configurable. Developers will not appreciate having to wait for 5 > minutes for their lockups to produce useful trace output. Yes, the 5 minute timeout is what we ship for production boxes, so that we can catch actual deadlocks while being *really really* sure we don't take down a customer's system unless it's *really* dead. I think leaving the timeout as it is and making it configurable is probably the best option. -George > > -- Keir > > > > _______________________________________________ > Xen-devel mailing list > Xen-devel@lists.xen.org > http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: NMI: Enable watchdog by default 2012-03-07 17:35 ` George Dunlap @ 2012-03-07 17:41 ` Keir Fraser 2012-03-07 18:14 ` NMI: watchdog timeout command line parameter Andrew Cooper 0 siblings, 1 reply; 8+ messages in thread From: Keir Fraser @ 2012-03-07 17:41 UTC (permalink / raw) To: George Dunlap; +Cc: Andrew Cooper, xen-devel@lists.xensource.com, Jan Beulich On 07/03/2012 17:35, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote: > On Wed, Mar 7, 2012 at 5:11 PM, Keir Fraser <keir@xen.org> wrote: >> On 07/03/2012 16:55, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote: >> >>> This patch is based on one which has been in XenServer for a very long. >>> >>> To keep the trend of documentation going, it also corrects the new >>> command line document. >> >> This does not only enable the watchdog by default, but also changes the >> timeout from 5 seconds to 5 *minutes*! >> >> Even if that is a good idea in some cases, we'd at least have to make the >> timeout configurable. Developers will not appreciate having to wait for 5 >> minutes for their lockups to produce useful trace output. > > Yes, the 5 minute timeout is what we ship for production boxes, so > that we can catch actual deadlocks while being *really really* sure we > don't take down a customer's system unless it's *really* dead. I > think leaving the timeout as it is and making it configurable is > probably the best option. A patch to do that would be acceptable. With that we may as well keep NMI watchdog disabled by default, and default timeout of 5 seconds. That then means that XenServer carries a command-line option for its watchdog settings, rather than a patch. -- Keir > -George > >> >> -- Keir >> >> >> >> _______________________________________________ >> Xen-devel mailing list >> Xen-devel@lists.xen.org >> http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 8+ messages in thread
* NMI: watchdog timeout command line parameter 2012-03-07 17:41 ` Keir Fraser @ 2012-03-07 18:14 ` Andrew Cooper 2012-03-08 10:06 ` Jan Beulich 0 siblings, 1 reply; 8+ messages in thread From: Andrew Cooper @ 2012-03-07 18:14 UTC (permalink / raw) To: Keir Fraser; +Cc: George Dunlap, xen-devel@lists.xensource.com, Jan Beulich [-- Attachment #1: Type: text/plain, Size: 190 bytes --] Here is a patch which allows a command line parameter to set the watchdog timeout. -- Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer T: +44 (0)1223 225 900, http://www.citrix.com [-- Attachment #2: nmi-watchdog-timeout.patch --] [-- Type: text/x-patch, Size: 3650 bytes --] # HG changeset patch # Parent e193375fb82af819eb55a1189308fcd4c1e8b40f NMI: Command line parameter for watchdog timeout Introduce a command parameter to set the watchtog timeout. Manually specifying "watchdog_timeout=<seconds>" on the command line will also turn the watchdog on. For consistency, move opt_watchdog into nmi.c along with opt_watchdog_timeout. Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com> diff -r e193375fb82a docs/misc/xen-command-line.markdown --- a/docs/misc/xen-command-line.markdown +++ b/docs/misc/xen-command-line.markdown @@ -391,7 +391,17 @@ The optional `keep` parameter causes Xen ### watchdog > `= <boolean>` -Run an NMI watchdog on each processor. Defaults to disabled. +> Default: `false` + +Run an NMI watchdog on each processor. If a processor is stuck for longer than the watchdog\_timeout, a panic occurs. + +### watchdog\_timeout +> `= <integer>` + +> Default: `5` + +Set the NMI watchdog timeout in seconds. Specifying `0` will turn off the watchdog. + ### x2apic ### x2apic\_phys ### xencons diff -r e193375fb82a xen/arch/x86/nmi.c --- a/xen/arch/x86/nmi.c +++ b/xen/arch/x86/nmi.c @@ -40,6 +40,19 @@ static unsigned int nmi_p4_cccr_val; static DEFINE_PER_CPU(struct timer, nmi_timer); static DEFINE_PER_CPU(unsigned int, nmi_timer_ticks); +/* opt_watchdog: If true, run a watchdog NMI on each processor. */ +bool_t __initdata opt_watchdog = 0; +boolean_param("watchdog", opt_watchdog); + +/* opt_watchdog_timeout: Number of seconds to wait before panic. */ +static unsigned int opt_watchdog_timeout = 5; +static void parse_watchdog_timeout(char * s) +{ + opt_watchdog_timeout = simple_strtoull(s, NULL, 0); + opt_watchdog = !!opt_watchdog_timeout; +} +custom_param("watchdog_timeout", parse_watchdog_timeout); + /* * lapic_nmi_owner tracks the ownership of the lapic NMI hardware: * - it may be reserved by some other driver, or not @@ -425,11 +438,11 @@ void nmi_watchdog_tick(struct cpu_user_r !atomic_read(&watchdog_disable_count) ) { /* - * Ayiee, looks like this CPU is stuck ... wait a few IRQs (5 seconds) + * Ayiee, looks like this CPU is stuck ... wait for the timeout * before doing the oops ... */ this_cpu(alert_counter)++; - if ( this_cpu(alert_counter) == 5*nmi_hz ) + if ( this_cpu(alert_counter) == opt_watchdog_timeout*nmi_hz ) { console_force_unlock(); printk("Watchdog timer detects that CPU%d is stuck!\n", diff -r e193375fb82a xen/arch/x86/setup.c --- a/xen/arch/x86/setup.c +++ b/xen/arch/x86/setup.c @@ -45,6 +45,7 @@ #include <asm/mach-generic/mach_apic.h> /* for generic_apic_probe */ #include <asm/setup.h> #include <xen/cpu.h> +#include <asm/nmi.h> /* opt_nosmp: If true, secondary processors are ignored. */ static bool_t __initdata opt_nosmp; @@ -54,10 +55,6 @@ boolean_param("nosmp", opt_nosmp); static unsigned int __initdata max_cpus; integer_param("maxcpus", max_cpus); -/* opt_watchdog: If true, run a watchdog NMI on each processor. */ -static bool_t __initdata opt_watchdog; -boolean_param("watchdog", opt_watchdog); - /* smep: Enable/disable Supervisor Mode Execution Protection (default on). */ static bool_t __initdata disable_smep; invbool_param("smep", disable_smep); diff -r e193375fb82a xen/include/asm-x86/nmi.h --- a/xen/include/asm-x86/nmi.h +++ b/xen/include/asm-x86/nmi.h @@ -5,6 +5,9 @@ #include <public/nmi.h> struct cpu_user_regs; + +/* Watchdog boolean from the command line */ +extern bool_t opt_watchdog; typedef int (*nmi_callback_t)(struct cpu_user_regs *regs, int cpu); [-- Attachment #3: Type: text/plain, Size: 126 bytes --] _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org http://lists.xen.org/xen-devel ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: NMI: watchdog timeout command line parameter 2012-03-07 18:14 ` NMI: watchdog timeout command line parameter Andrew Cooper @ 2012-03-08 10:06 ` Jan Beulich 2012-03-08 10:22 ` George Dunlap 0 siblings, 1 reply; 8+ messages in thread From: Jan Beulich @ 2012-03-08 10:06 UTC (permalink / raw) To: Andrew Cooper, Keir Fraser; +Cc: George Dunlap, xen-devel@lists.xensource.com >>> On 07.03.12 at 19:14, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > Here is a patch which allows a command line parameter to set the > watchdog timeout. Wouldn't it make sense to simply modify the watchdog= semantics slightly: When given a numeric value, it specifies the timeout (and zero disables as before), while when given any value parse_bool() accepts the timeout remains at the default and the watchdog just gets turned on. The only ambiguity with the current possible values would be watchdog=1, and I think having the meaning of this changed is acceptable. Jan ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: NMI: watchdog timeout command line parameter 2012-03-08 10:06 ` Jan Beulich @ 2012-03-08 10:22 ` George Dunlap 0 siblings, 0 replies; 8+ messages in thread From: George Dunlap @ 2012-03-08 10:22 UTC (permalink / raw) To: Jan Beulich Cc: George Dunlap, Andrew Cooper, Keir Fraser, xen-devel@lists.xensource.com On Thu, 2012-03-08 at 10:06 +0000, Jan Beulich wrote: > >>> On 07.03.12 at 19:14, Andrew Cooper <andrew.cooper3@citrix.com> wrote: > > Here is a patch which allows a command line parameter to set the > > watchdog timeout. > > Wouldn't it make sense to simply modify the watchdog= semantics > slightly: When given a numeric value, it specifies the timeout (and zero > disables as before), while when given any value parse_bool() accepts > the timeout remains at the default and the watchdog just gets turned > on. I had thought of this too, but I don't think it actually simplifies it. It makes the parsing code more complicated, and it makes the interface actually more complicated and less predictable, for pretty much no good reason. It's not like adding extra command-line options is really onerous, either in terms of users or in terms of our code. I think it's much more important to have consistency and predictability; the new interpretation "watchdog=1" is exactly the kind of thing we need to avoid. -George > The only ambiguity with the current possible values would be > watchdog=1, and I think having the meaning of this changed is > acceptable. > > Jan > ^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2012-03-08 10:22 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2012-03-07 16:55 NMI: Enable watchdog by default Andrew Cooper 2012-03-07 17:08 ` Jan Beulich 2012-03-07 17:11 ` Keir Fraser 2012-03-07 17:35 ` George Dunlap 2012-03-07 17:41 ` Keir Fraser 2012-03-07 18:14 ` NMI: watchdog timeout command line parameter Andrew Cooper 2012-03-08 10:06 ` Jan Beulich 2012-03-08 10:22 ` George Dunlap
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).