* NMI: Enable watchdog by default
@ 2012-03-07 16:55 Andrew Cooper
2012-03-07 17:08 ` Jan Beulich
2012-03-07 17:11 ` Keir Fraser
0 siblings, 2 replies; 8+ messages in thread
From: Andrew Cooper @ 2012-03-07 16:55 UTC (permalink / raw)
To: xen-devel@lists.xensource.com; +Cc: Keir Fraser, Jan Beulich
[-- Attachment #1: Type: text/plain, Size: 270 bytes --]
This patch is based on one which has been in XenServer for a very long.
To keep the trend of documentation going, it also corrects the new
command line document.
--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
[-- Attachment #2: enable-nmi-watchdog.patch --]
[-- Type: text/x-patch, Size: 2083 bytes --]
# HG changeset patch
# Parent e193375fb82af819eb55a1189308fcd4c1e8b40f
NMI: Enable watchdog by default.
This should make hung conditions more visible. Change the timeout from 5
seconds to 300, as several operations on large guests take far longer than 5
seconds.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
diff -r e193375fb82a docs/misc/xen-command-line.markdown
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -391,7 +391,10 @@ The optional `keep` parameter causes Xen
### watchdog
> `= <boolean>`
-Run an NMI watchdog on each processor. Defaults to disabled.
+> Default: `true`
+
+Run an NMI watchdog on each processor. A CPU hung for more than 300 seconds will result in a panic, making hung conditions more visible.
+
### x2apic
### x2apic\_phys
### xencons
diff -r e193375fb82a xen/arch/x86/nmi.c
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -425,11 +425,12 @@ void nmi_watchdog_tick(struct cpu_user_r
!atomic_read(&watchdog_disable_count) )
{
/*
- * Ayiee, looks like this CPU is stuck ... wait a few IRQs (5 seconds)
- * before doing the oops ...
+ * Ayiee, looks like this CPU is stuck.
+ * Wait 5 minutes before panic because some actions on large guests
+ * can take many seconds to complete.
*/
this_cpu(alert_counter)++;
- if ( this_cpu(alert_counter) == 5*nmi_hz )
+ if ( this_cpu(alert_counter) == 300*nmi_hz )
{
console_force_unlock();
printk("Watchdog timer detects that CPU%d is stuck!\n",
diff -r e193375fb82a xen/arch/x86/setup.c
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -55,7 +55,7 @@ static unsigned int __initdata max_cpus;
integer_param("maxcpus", max_cpus);
/* opt_watchdog: If true, run a watchdog NMI on each processor. */
-static bool_t __initdata opt_watchdog;
+static bool_t __initdata opt_watchdog = 1;
boolean_param("watchdog", opt_watchdog);
/* smep: Enable/disable Supervisor Mode Execution Protection (default on). */
[-- Attachment #3: Type: text/plain, Size: 126 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: NMI: Enable watchdog by default
2012-03-07 16:55 NMI: Enable watchdog by default Andrew Cooper
@ 2012-03-07 17:08 ` Jan Beulich
2012-03-07 17:11 ` Keir Fraser
1 sibling, 0 replies; 8+ messages in thread
From: Jan Beulich @ 2012-03-07 17:08 UTC (permalink / raw)
To: Andrew Cooper; +Cc: Keir Fraser, xen-devel
>>> On 07.03.12 at 17:55, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> This patch is based on one which has been in XenServer for a very long.
I'm not sure this is a good idea - people with systems where the
watchdog doesn't work will likely start complaining about why it
doesn't work, as that's going to be always visible in the log.
Furthermore, a timeout of 5 min is just unacceptably large. If we
really still have problems with the 5 sec timeout, we should fix
those rather than bumping the timeout - no timer events for 5 sec
is already way too long.
Jan
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: NMI: Enable watchdog by default
2012-03-07 16:55 NMI: Enable watchdog by default Andrew Cooper
2012-03-07 17:08 ` Jan Beulich
@ 2012-03-07 17:11 ` Keir Fraser
2012-03-07 17:35 ` George Dunlap
1 sibling, 1 reply; 8+ messages in thread
From: Keir Fraser @ 2012-03-07 17:11 UTC (permalink / raw)
To: Andrew Cooper, xen-devel@lists.xensource.com; +Cc: Jan Beulich
On 07/03/2012 16:55, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:
> This patch is based on one which has been in XenServer for a very long.
>
> To keep the trend of documentation going, it also corrects the new
> command line document.
This does not only enable the watchdog by default, but also changes the
timeout from 5 seconds to 5 *minutes*!
Even if that is a good idea in some cases, we'd at least have to make the
timeout configurable. Developers will not appreciate having to wait for 5
minutes for their lockups to produce useful trace output.
-- Keir
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: NMI: Enable watchdog by default
2012-03-07 17:11 ` Keir Fraser
@ 2012-03-07 17:35 ` George Dunlap
2012-03-07 17:41 ` Keir Fraser
0 siblings, 1 reply; 8+ messages in thread
From: George Dunlap @ 2012-03-07 17:35 UTC (permalink / raw)
To: Keir Fraser; +Cc: Andrew Cooper, xen-devel@lists.xensource.com, Jan Beulich
On Wed, Mar 7, 2012 at 5:11 PM, Keir Fraser <keir@xen.org> wrote:
> On 07/03/2012 16:55, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:
>
>> This patch is based on one which has been in XenServer for a very long.
>>
>> To keep the trend of documentation going, it also corrects the new
>> command line document.
>
> This does not only enable the watchdog by default, but also changes the
> timeout from 5 seconds to 5 *minutes*!
>
> Even if that is a good idea in some cases, we'd at least have to make the
> timeout configurable. Developers will not appreciate having to wait for 5
> minutes for their lockups to produce useful trace output.
Yes, the 5 minute timeout is what we ship for production boxes, so
that we can catch actual deadlocks while being *really really* sure we
don't take down a customer's system unless it's *really* dead. I
think leaving the timeout as it is and making it configurable is
probably the best option.
-George
>
> -- Keir
>
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: NMI: Enable watchdog by default
2012-03-07 17:35 ` George Dunlap
@ 2012-03-07 17:41 ` Keir Fraser
2012-03-07 18:14 ` NMI: watchdog timeout command line parameter Andrew Cooper
0 siblings, 1 reply; 8+ messages in thread
From: Keir Fraser @ 2012-03-07 17:41 UTC (permalink / raw)
To: George Dunlap; +Cc: Andrew Cooper, xen-devel@lists.xensource.com, Jan Beulich
On 07/03/2012 17:35, "George Dunlap" <George.Dunlap@eu.citrix.com> wrote:
> On Wed, Mar 7, 2012 at 5:11 PM, Keir Fraser <keir@xen.org> wrote:
>> On 07/03/2012 16:55, "Andrew Cooper" <andrew.cooper3@citrix.com> wrote:
>>
>>> This patch is based on one which has been in XenServer for a very long.
>>>
>>> To keep the trend of documentation going, it also corrects the new
>>> command line document.
>>
>> This does not only enable the watchdog by default, but also changes the
>> timeout from 5 seconds to 5 *minutes*!
>>
>> Even if that is a good idea in some cases, we'd at least have to make the
>> timeout configurable. Developers will not appreciate having to wait for 5
>> minutes for their lockups to produce useful trace output.
>
> Yes, the 5 minute timeout is what we ship for production boxes, so
> that we can catch actual deadlocks while being *really really* sure we
> don't take down a customer's system unless it's *really* dead. I
> think leaving the timeout as it is and making it configurable is
> probably the best option.
A patch to do that would be acceptable. With that we may as well keep NMI
watchdog disabled by default, and default timeout of 5 seconds. That then
means that XenServer carries a command-line option for its watchdog
settings, rather than a patch.
-- Keir
> -George
>
>>
>> -- Keir
>>
>>
>>
>> _______________________________________________
>> Xen-devel mailing list
>> Xen-devel@lists.xen.org
>> http://lists.xen.org/xen-devel
^ permalink raw reply [flat|nested] 8+ messages in thread
* NMI: watchdog timeout command line parameter
2012-03-07 17:41 ` Keir Fraser
@ 2012-03-07 18:14 ` Andrew Cooper
2012-03-08 10:06 ` Jan Beulich
0 siblings, 1 reply; 8+ messages in thread
From: Andrew Cooper @ 2012-03-07 18:14 UTC (permalink / raw)
To: Keir Fraser; +Cc: George Dunlap, xen-devel@lists.xensource.com, Jan Beulich
[-- Attachment #1: Type: text/plain, Size: 190 bytes --]
Here is a patch which allows a command line parameter to set the
watchdog timeout.
--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com
[-- Attachment #2: nmi-watchdog-timeout.patch --]
[-- Type: text/x-patch, Size: 3650 bytes --]
# HG changeset patch
# Parent e193375fb82af819eb55a1189308fcd4c1e8b40f
NMI: Command line parameter for watchdog timeout
Introduce a command parameter to set the watchtog timeout. Manually specifying
"watchdog_timeout=<seconds>" on the command line will also turn the watchdog
on. For consistency, move opt_watchdog into nmi.c along with
opt_watchdog_timeout.
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
diff -r e193375fb82a docs/misc/xen-command-line.markdown
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -391,7 +391,17 @@ The optional `keep` parameter causes Xen
### watchdog
> `= <boolean>`
-Run an NMI watchdog on each processor. Defaults to disabled.
+> Default: `false`
+
+Run an NMI watchdog on each processor. If a processor is stuck for longer than the watchdog\_timeout, a panic occurs.
+
+### watchdog\_timeout
+> `= <integer>`
+
+> Default: `5`
+
+Set the NMI watchdog timeout in seconds. Specifying `0` will turn off the watchdog.
+
### x2apic
### x2apic\_phys
### xencons
diff -r e193375fb82a xen/arch/x86/nmi.c
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -40,6 +40,19 @@ static unsigned int nmi_p4_cccr_val;
static DEFINE_PER_CPU(struct timer, nmi_timer);
static DEFINE_PER_CPU(unsigned int, nmi_timer_ticks);
+/* opt_watchdog: If true, run a watchdog NMI on each processor. */
+bool_t __initdata opt_watchdog = 0;
+boolean_param("watchdog", opt_watchdog);
+
+/* opt_watchdog_timeout: Number of seconds to wait before panic. */
+static unsigned int opt_watchdog_timeout = 5;
+static void parse_watchdog_timeout(char * s)
+{
+ opt_watchdog_timeout = simple_strtoull(s, NULL, 0);
+ opt_watchdog = !!opt_watchdog_timeout;
+}
+custom_param("watchdog_timeout", parse_watchdog_timeout);
+
/*
* lapic_nmi_owner tracks the ownership of the lapic NMI hardware:
* - it may be reserved by some other driver, or not
@@ -425,11 +438,11 @@ void nmi_watchdog_tick(struct cpu_user_r
!atomic_read(&watchdog_disable_count) )
{
/*
- * Ayiee, looks like this CPU is stuck ... wait a few IRQs (5 seconds)
+ * Ayiee, looks like this CPU is stuck ... wait for the timeout
* before doing the oops ...
*/
this_cpu(alert_counter)++;
- if ( this_cpu(alert_counter) == 5*nmi_hz )
+ if ( this_cpu(alert_counter) == opt_watchdog_timeout*nmi_hz )
{
console_force_unlock();
printk("Watchdog timer detects that CPU%d is stuck!\n",
diff -r e193375fb82a xen/arch/x86/setup.c
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -45,6 +45,7 @@
#include <asm/mach-generic/mach_apic.h> /* for generic_apic_probe */
#include <asm/setup.h>
#include <xen/cpu.h>
+#include <asm/nmi.h>
/* opt_nosmp: If true, secondary processors are ignored. */
static bool_t __initdata opt_nosmp;
@@ -54,10 +55,6 @@ boolean_param("nosmp", opt_nosmp);
static unsigned int __initdata max_cpus;
integer_param("maxcpus", max_cpus);
-/* opt_watchdog: If true, run a watchdog NMI on each processor. */
-static bool_t __initdata opt_watchdog;
-boolean_param("watchdog", opt_watchdog);
-
/* smep: Enable/disable Supervisor Mode Execution Protection (default on). */
static bool_t __initdata disable_smep;
invbool_param("smep", disable_smep);
diff -r e193375fb82a xen/include/asm-x86/nmi.h
--- a/xen/include/asm-x86/nmi.h
+++ b/xen/include/asm-x86/nmi.h
@@ -5,6 +5,9 @@
#include <public/nmi.h>
struct cpu_user_regs;
+
+/* Watchdog boolean from the command line */
+extern bool_t opt_watchdog;
typedef int (*nmi_callback_t)(struct cpu_user_regs *regs, int cpu);
[-- Attachment #3: Type: text/plain, Size: 126 bytes --]
_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: NMI: watchdog timeout command line parameter
2012-03-07 18:14 ` NMI: watchdog timeout command line parameter Andrew Cooper
@ 2012-03-08 10:06 ` Jan Beulich
2012-03-08 10:22 ` George Dunlap
0 siblings, 1 reply; 8+ messages in thread
From: Jan Beulich @ 2012-03-08 10:06 UTC (permalink / raw)
To: Andrew Cooper, Keir Fraser; +Cc: George Dunlap, xen-devel@lists.xensource.com
>>> On 07.03.12 at 19:14, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> Here is a patch which allows a command line parameter to set the
> watchdog timeout.
Wouldn't it make sense to simply modify the watchdog= semantics
slightly: When given a numeric value, it specifies the timeout (and zero
disables as before), while when given any value parse_bool() accepts
the timeout remains at the default and the watchdog just gets turned
on.
The only ambiguity with the current possible values would be
watchdog=1, and I think having the meaning of this changed is
acceptable.
Jan
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: NMI: watchdog timeout command line parameter
2012-03-08 10:06 ` Jan Beulich
@ 2012-03-08 10:22 ` George Dunlap
0 siblings, 0 replies; 8+ messages in thread
From: George Dunlap @ 2012-03-08 10:22 UTC (permalink / raw)
To: Jan Beulich
Cc: George Dunlap, Andrew Cooper, Keir Fraser,
xen-devel@lists.xensource.com
On Thu, 2012-03-08 at 10:06 +0000, Jan Beulich wrote:
> >>> On 07.03.12 at 19:14, Andrew Cooper <andrew.cooper3@citrix.com> wrote:
> > Here is a patch which allows a command line parameter to set the
> > watchdog timeout.
>
> Wouldn't it make sense to simply modify the watchdog= semantics
> slightly: When given a numeric value, it specifies the timeout (and zero
> disables as before), while when given any value parse_bool() accepts
> the timeout remains at the default and the watchdog just gets turned
> on.
I had thought of this too, but I don't think it actually simplifies it.
It makes the parsing code more complicated, and it makes the interface
actually more complicated and less predictable, for pretty much no good
reason. It's not like adding extra command-line options is really
onerous, either in terms of users or in terms of our code.
I think it's much more important to have consistency and predictability;
the new interpretation "watchdog=1" is exactly the kind of thing we need
to avoid.
-George
> The only ambiguity with the current possible values would be
> watchdog=1, and I think having the meaning of this changed is
> acceptable.
>
> Jan
>
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2012-03-08 10:22 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-03-07 16:55 NMI: Enable watchdog by default Andrew Cooper
2012-03-07 17:08 ` Jan Beulich
2012-03-07 17:11 ` Keir Fraser
2012-03-07 17:35 ` George Dunlap
2012-03-07 17:41 ` Keir Fraser
2012-03-07 18:14 ` NMI: watchdog timeout command line parameter Andrew Cooper
2012-03-08 10:06 ` Jan Beulich
2012-03-08 10:22 ` George Dunlap
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).