public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] x86: sysctl to allow panic on IOCK NMI error
@ 2009-06-24 21:32 Greg KH
  2009-06-25  9:16 ` [tip:x86/urgent] x86: Add " tip-bot for Kurt Garloff
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: Greg KH @ 2009-06-24 21:32 UTC (permalink / raw)
  To: x86
  Cc: H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Kurt Garloff,
	linux-kernel

From: Kurt Garloff <garloff@suse.de>

This patch introduces a sysctl /proc/sys/kernel/panic_on_io_nmi, which
defaults to 0 (off).

When enabled, the kernel panics when the kernel receives an NMI caused
by an IO error.

The IO error triggered NMI indicates a serious system condition, which
could result in IO data corruption. Rather than contiuing, panicing and
dumping might be a better choice, so one can figure out what's causing
the IO error.

This could be especially important to companies running IO intensive
applications where corruption must be avoided, e.g. a banks databases.


Signed-off-by: Kurt Garloff <garloff@suse.de>
Signed-off-by: Roberto Angelino <robertangelino@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---

I don't know why this wasn't sent to you before now, sorry about that.
SuSE has been shipping it for a while, it was done at the request of a
large database vendor, for their users.  Can you queue it up for .32?
  - gregkh

 arch/x86/kernel/dumpstack.c |    1 +
 arch/x86/kernel/traps.c     |    3 +++
 include/linux/kernel.h      |    1 +
 include/linux/sysctl.h      |    1 +
 kernel/sysctl.c             |    8 ++++++++
 kernel/sysctl_check.c       |    1 +
 6 files changed, 15 insertions(+)

--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -22,6 +22,7 @@
 #include "dumpstack.h"
 
 int panic_on_unrecovered_nmi;
+int panic_on_io_nmi;
 unsigned int code_bytes = 64;
 int kstack_depth_to_print = 3 * STACKSLOTS_PER_LINE;
 static int die_counter;
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -346,6 +346,9 @@ io_check_error(unsigned char reason, str
 	printk(KERN_EMERG "NMI: IOCK error (debug interrupt?)\n");
 	show_registers(regs);
 
+	if (panic_on_io_nmi)
+		panic("NMI IOCK error: Not continuing");
+
 	/* Re-enable the IOCK line, wait for a few seconds */
 	reason = (reason & 0xf) | 8;
 	outb(reason, 0x61);
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -303,6 +303,7 @@ extern int oops_in_progress;		/* If set,
 extern int panic_timeout;
 extern int panic_on_oops;
 extern int panic_on_unrecovered_nmi;
+extern int panic_on_io_nmi;
 extern const char *print_tainted(void);
 extern void add_taint(unsigned flag);
 extern int test_taint(unsigned flag);
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -163,6 +163,7 @@ enum
 	KERN_MAX_LOCK_DEPTH=74,
 	KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
 	KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
+	KERN_PANIC_ON_IO_NMI=77, /* int: whether we will panic on an io NMI */
 };
 
 
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -744,6 +744,14 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 	{
+		.ctl_name	= KERN_PANIC_ON_IO_NMI,
+		.procname	= "panic_on_io_nmi",
+		.data		= &panic_on_io_nmi,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
 		.ctl_name	= KERN_BOOTLOADER_TYPE,
 		.procname	= "bootloader_type",
 		.data		= &bootloader_type,
--- a/kernel/sysctl_check.c
+++ b/kernel/sysctl_check.c
@@ -104,6 +104,7 @@ static const struct trans_ctl_table tran
 	{ KERN_MAX_LOCK_DEPTH,		"max_lock_depth" },
 	{ KERN_NMI_WATCHDOG,		"nmi_watchdog" },
 	{ KERN_PANIC_ON_NMI,		"panic_on_unrecovered_nmi" },
+	{ KERN_PANIC_ON_IO_NMI,		"panic_on_io_nmi" },
 	{}
 };
 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* [tip:x86/urgent] x86: Add sysctl to allow panic on IOCK NMI error
  2009-06-24 21:32 [PATCH] x86: sysctl to allow panic on IOCK NMI error Greg KH
@ 2009-06-25  9:16 ` tip-bot for Kurt Garloff
  2009-06-25 18:15 ` [PATCH] x86: " Eric W. Biederman
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: tip-bot for Kurt Garloff @ 2009-06-25  9:16 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, gregkh, tglx, robertangelino, garloff,
	mingo

Commit-ID:  246982af230e52c0d83701b292790a0283362c9a
Gitweb:     http://git.kernel.org/tip/246982af230e52c0d83701b292790a0283362c9a
Author:     Kurt Garloff <garloff@suse.de>
AuthorDate: Wed, 24 Jun 2009 14:32:11 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 25 Jun 2009 11:13:02 +0200

x86: Add sysctl to allow panic on IOCK NMI error

This patch introduces a new sysctl:

    /proc/sys/kernel/panic_on_io_nmi

which defaults to 0 (off).

When enabled, the kernel panics when the kernel receives an NMI
caused by an IO error.

The IO error triggered NMI indicates a serious system
condition, which could result in IO data corruption. Rather
than contiuing, panicing and dumping might be a better choice,
so one can figure out what's causing the IO error.

This could be especially important to companies running IO
intensive applications where corruption must be avoided, e.g. a
bank's databases.

[ SuSE has been shipping it for a while, it was done at the
  request of a large database vendor, for their users. ]

Signed-off-by: Kurt Garloff <garloff@suse.de>
Signed-off-by: Roberto Angelino <robertangelino@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Cc: Kurt Garloff <garloff@suse.de>
LKML-Reference: <20090624213211.GA11291@kroah.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 arch/x86/kernel/dumpstack.c |    1 +
 arch/x86/kernel/traps.c     |    3 +++
 include/linux/kernel.h      |    1 +
 include/linux/sysctl.h      |    1 +
 kernel/sysctl.c             |    8 ++++++++
 kernel/sysctl_check.c       |    1 +
 6 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 95ea5fa..c840571 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -22,6 +22,7 @@
 #include "dumpstack.h"
 
 int panic_on_unrecovered_nmi;
+int panic_on_io_nmi;
 unsigned int code_bytes = 64;
 int kstack_depth_to_print = 3 * STACKSLOTS_PER_LINE;
 static int die_counter;
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a0f48f5..5204332 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -346,6 +346,9 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
 	printk(KERN_EMERG "NMI: IOCK error (debug interrupt?)\n");
 	show_registers(regs);
 
+	if (panic_on_io_nmi)
+		panic("NMI IOCK error: Not continuing");
+
 	/* Re-enable the IOCK line, wait for a few seconds */
 	reason = (reason & 0xf) | 8;
 	outb(reason, 0x61);
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index fac104e..d6320a3 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -303,6 +303,7 @@ extern int oops_in_progress;		/* If set, an oops, panic(), BUG() or die() is in
 extern int panic_timeout;
 extern int panic_on_oops;
 extern int panic_on_unrecovered_nmi;
+extern int panic_on_io_nmi;
 extern const char *print_tainted(void);
 extern void add_taint(unsigned flag);
 extern int test_taint(unsigned flag);
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index e76d3b2..56766b9 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -163,6 +163,7 @@ enum
 	KERN_MAX_LOCK_DEPTH=74,
 	KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
 	KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
+	KERN_PANIC_ON_IO_NMI=77, /* int: whether we will panic on an io NMI */
 };
 
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 62e4ff9..d44e8c3 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -744,6 +744,14 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 	{
+		.ctl_name	= KERN_PANIC_ON_IO_NMI,
+		.procname	= "panic_on_io_nmi",
+		.data		= &panic_on_io_nmi,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
 		.ctl_name	= KERN_BOOTLOADER_TYPE,
 		.procname	= "bootloader_type",
 		.data		= &bootloader_type,
diff --git a/kernel/sysctl_check.c b/kernel/sysctl_check.c
index b38423c..e1d8e0f 100644
--- a/kernel/sysctl_check.c
+++ b/kernel/sysctl_check.c
@@ -104,6 +104,7 @@ static const struct trans_ctl_table trans_kern_table[] = {
 	{ KERN_MAX_LOCK_DEPTH,		"max_lock_depth" },
 	{ KERN_NMI_WATCHDOG,		"nmi_watchdog" },
 	{ KERN_PANIC_ON_NMI,		"panic_on_unrecovered_nmi" },
+	{ KERN_PANIC_ON_IO_NMI,		"panic_on_io_nmi" },
 	{}
 };
 

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] x86: sysctl to allow panic on IOCK NMI error
  2009-06-24 21:32 [PATCH] x86: sysctl to allow panic on IOCK NMI error Greg KH
  2009-06-25  9:16 ` [tip:x86/urgent] x86: Add " tip-bot for Kurt Garloff
@ 2009-06-25 18:15 ` Eric W. Biederman
  2009-06-25 20:07   ` Greg KH
  2009-06-25 20:06 ` [tip:x86/urgent] x86: Add " tip-bot for Kurt Garloff
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 20+ messages in thread
From: Eric W. Biederman @ 2009-06-25 18:15 UTC (permalink / raw)
  To: Greg KH
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Kurt Garloff,
	linux-kernel

Greg KH <gregkh@suse.de> writes:

> From: Kurt Garloff <garloff@suse.de>
>
> This patch introduces a sysctl /proc/sys/kernel/panic_on_io_nmi, which
> defaults to 0 (off).
>
> When enabled, the kernel panics when the kernel receives an NMI caused
> by an IO error.
>
> The IO error triggered NMI indicates a serious system condition, which
> could result in IO data corruption. Rather than contiuing, panicing and
> dumping might be a better choice, so one can figure out what's causing
> the IO error.
>
> This could be especially important to companies running IO intensive
> applications where corruption must be avoided, e.g. a banks databases.

Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>

New binary sysctls are not allowed.   Please remove the
sysctl.h and .ctl_name portions.

see:
Documentation/sysctl/ctl_unnumbered.txt
Documentation/feature-remove-schedule.txt

I have a set of patches that should make this kind of thing
fail to compile for .32.  Hopefully I can get that out in 
the next couple of days.  Making problems like this easier
to spot and deal with.

> Signed-off-by: Kurt Garloff <garloff@suse.de>
> Signed-off-by: Roberto Angelino <robertangelino@gmail.com>
> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
>
> ---
>
> I don't know why this wasn't sent to you before now, sorry about that.
> SuSE has been shipping it for a while, it was done at the request of a
> large database vendor, for their users.  Can you queue it up for .32?
>   - gregkh
>
>  arch/x86/kernel/dumpstack.c |    1 +
>  arch/x86/kernel/traps.c     |    3 +++
>  include/linux/kernel.h      |    1 +
>  include/linux/sysctl.h      |    1 +
>  kernel/sysctl.c             |    8 ++++++++
>  kernel/sysctl_check.c       |    1 +
>  6 files changed, 15 insertions(+)
>
> --- a/arch/x86/kernel/dumpstack.c
> +++ b/arch/x86/kernel/dumpstack.c
> @@ -22,6 +22,7 @@
>  #include "dumpstack.h"
>  
>  int panic_on_unrecovered_nmi;
> +int panic_on_io_nmi;
>  unsigned int code_bytes = 64;
>  int kstack_depth_to_print = 3 * STACKSLOTS_PER_LINE;
>  static int die_counter;
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -346,6 +346,9 @@ io_check_error(unsigned char reason, str
>  	printk(KERN_EMERG "NMI: IOCK error (debug interrupt?)\n");
>  	show_registers(regs);
>  
> +	if (panic_on_io_nmi)
> +		panic("NMI IOCK error: Not continuing");
> +
>  	/* Re-enable the IOCK line, wait for a few seconds */
>  	reason = (reason & 0xf) | 8;
>  	outb(reason, 0x61);
> --- a/include/linux/kernel.h
> +++ b/include/linux/kernel.h
> @@ -303,6 +303,7 @@ extern int oops_in_progress;		/* If set,
>  extern int panic_timeout;
>  extern int panic_on_oops;
>  extern int panic_on_unrecovered_nmi;
> +extern int panic_on_io_nmi;
>  extern const char *print_tainted(void);
>  extern void add_taint(unsigned flag);
>  extern int test_taint(unsigned flag);
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -163,6 +163,7 @@ enum
>  	KERN_MAX_LOCK_DEPTH=74,
>  	KERN_NMI_WATCHDOG=75, /* int: enable/disable nmi watchdog */
>  	KERN_PANIC_ON_NMI=76, /* int: whether we will panic on an unrecovered */
> +	KERN_PANIC_ON_IO_NMI=77, /* int: whether we will panic on an io NMI */
>  };
>  
>  
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -744,6 +744,14 @@ static struct ctl_table kern_table[] = {
>  		.proc_handler	= &proc_dointvec,
>  	},
>  	{
> +		.ctl_name	= KERN_PANIC_ON_IO_NMI,
> +		.procname	= "panic_on_io_nmi",
> +		.data		= &panic_on_io_nmi,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= &proc_dointvec,
> +	},
> +	{
>  		.ctl_name	= KERN_BOOTLOADER_TYPE,
>  		.procname	= "bootloader_type",
>  		.data		= &bootloader_type,
> --- a/kernel/sysctl_check.c
> +++ b/kernel/sysctl_check.c
> @@ -104,6 +104,7 @@ static const struct trans_ctl_table tran
>  	{ KERN_MAX_LOCK_DEPTH,		"max_lock_depth" },
>  	{ KERN_NMI_WATCHDOG,		"nmi_watchdog" },
>  	{ KERN_PANIC_ON_NMI,		"panic_on_unrecovered_nmi" },
> +	{ KERN_PANIC_ON_IO_NMI,		"panic_on_io_nmi" },
>  	{}
>  };
>  
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [tip:x86/urgent] x86: Add sysctl to allow panic on IOCK NMI error
  2009-06-24 21:32 [PATCH] x86: sysctl to allow panic on IOCK NMI error Greg KH
  2009-06-25  9:16 ` [tip:x86/urgent] x86: Add " tip-bot for Kurt Garloff
  2009-06-25 18:15 ` [PATCH] x86: " Eric W. Biederman
@ 2009-06-25 20:06 ` tip-bot for Kurt Garloff
  2009-06-25 20:09 ` tip-bot for Kurt Garloff
  2009-06-30 22:27 ` [PATCH] x86: " Maciej W. Rozycki
  4 siblings, 0 replies; 20+ messages in thread
From: tip-bot for Kurt Garloff @ 2009-06-25 20:06 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, ebiederm, gregkh, tglx, robertangelino,
	mingo, garloff

Commit-ID:  0b2f282236572b333d5d74264b04f214c24aba18
Gitweb:     http://git.kernel.org/tip/0b2f282236572b333d5d74264b04f214c24aba18
Author:     Kurt Garloff <garloff@suse.de>
AuthorDate: Wed, 24 Jun 2009 14:32:11 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 25 Jun 2009 22:02:58 +0200

x86: Add sysctl to allow panic on IOCK NMI error

This patch introduces a new sysctl:

    /proc/sys/kernel/panic_on_io_nmi

which defaults to 0 (off).

When enabled, the kernel panics when the kernel receives an NMI
caused by an IO error.

The IO error triggered NMI indicates a serious system
condition, which could result in IO data corruption. Rather
than contiuing, panicing and dumping might be a better choice,
so one can figure out what's causing the IO error.

This could be especially important to companies running IO
intensive applications where corruption must be avoided, e.g. a
bank's databases.

[ SuSE has been shipping it for a while, it was done at the
  request of a large database vendor, for their users. ]

Signed-off-by: Kurt Garloff <garloff@suse.de>
Signed-off-by: Roberto Angelino <robertangelino@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
LKML-Reference: <20090624213211.GA11291@kroah.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 arch/x86/kernel/dumpstack.c |    1 +
 arch/x86/kernel/traps.c     |    3 +++
 include/linux/kernel.h      |    1 +
 kernel/sysctl.c             |    8 ++++++++
 kernel/sysctl_check.c       |    1 +
 5 files changed, 14 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 95ea5fa..c840571 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -22,6 +22,7 @@
 #include "dumpstack.h"
 
 int panic_on_unrecovered_nmi;
+int panic_on_io_nmi;
 unsigned int code_bytes = 64;
 int kstack_depth_to_print = 3 * STACKSLOTS_PER_LINE;
 static int die_counter;
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a0f48f5..5204332 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -346,6 +346,9 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
 	printk(KERN_EMERG "NMI: IOCK error (debug interrupt?)\n");
 	show_registers(regs);
 
+	if (panic_on_io_nmi)
+		panic("NMI IOCK error: Not continuing");
+
 	/* Re-enable the IOCK line, wait for a few seconds */
 	reason = (reason & 0xf) | 8;
 	outb(reason, 0x61);
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index fac104e..d6320a3 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -303,6 +303,7 @@ extern int oops_in_progress;		/* If set, an oops, panic(), BUG() or die() is in
 extern int panic_timeout;
 extern int panic_on_oops;
 extern int panic_on_unrecovered_nmi;
+extern int panic_on_io_nmi;
 extern const char *print_tainted(void);
 extern void add_taint(unsigned flag);
 extern int test_taint(unsigned flag);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 62e4ff9..fba42ed 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -744,6 +744,14 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "panic_on_io_nmi",
+		.data		= &panic_on_io_nmi,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
 		.ctl_name	= KERN_BOOTLOADER_TYPE,
 		.procname	= "bootloader_type",
 		.data		= &bootloader_type,
diff --git a/kernel/sysctl_check.c b/kernel/sysctl_check.c
index b38423c..e1d8e0f 100644
--- a/kernel/sysctl_check.c
+++ b/kernel/sysctl_check.c
@@ -104,6 +104,7 @@ static const struct trans_ctl_table trans_kern_table[] = {
 	{ KERN_MAX_LOCK_DEPTH,		"max_lock_depth" },
 	{ KERN_NMI_WATCHDOG,		"nmi_watchdog" },
 	{ KERN_PANIC_ON_NMI,		"panic_on_unrecovered_nmi" },
+	{ KERN_PANIC_ON_IO_NMI,		"panic_on_io_nmi" },
 	{}
 };
 

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] x86: sysctl to allow panic on IOCK NMI error
  2009-06-25 18:15 ` [PATCH] x86: " Eric W. Biederman
@ 2009-06-25 20:07   ` Greg KH
  2009-06-25 20:18     ` Ingo Molnar
  0 siblings, 1 reply; 20+ messages in thread
From: Greg KH @ 2009-06-25 20:07 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Kurt Garloff,
	linux-kernel

On Thu, Jun 25, 2009 at 11:15:45AM -0700, Eric W. Biederman wrote:
> Greg KH <gregkh@suse.de> writes:
> 
> > From: Kurt Garloff <garloff@suse.de>
> >
> > This patch introduces a sysctl /proc/sys/kernel/panic_on_io_nmi, which
> > defaults to 0 (off).
> >
> > When enabled, the kernel panics when the kernel receives an NMI caused
> > by an IO error.
> >
> > The IO error triggered NMI indicates a serious system condition, which
> > could result in IO data corruption. Rather than contiuing, panicing and
> > dumping might be a better choice, so one can figure out what's causing
> > the IO error.
> >
> > This could be especially important to companies running IO intensive
> > applications where corruption must be avoided, e.g. a banks databases.
> 
> Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> 
> New binary sysctls are not allowed.   Please remove the
> sysctl.h and .ctl_name portions.
> 
> see:
> Documentation/sysctl/ctl_unnumbered.txt
> Documentation/feature-remove-schedule.txt
> 
> I have a set of patches that should make this kind of thing
> fail to compile for .32.  Hopefully I can get that out in 
> the next couple of days.  Making problems like this easier
> to spot and deal with.

Oops, ok, will do, sorry about that.  Let me go create a new patch.

Ingo, do you want an incremental one, or a replacement for the original?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [tip:x86/urgent] x86: Add sysctl to allow panic on IOCK NMI error
  2009-06-24 21:32 [PATCH] x86: sysctl to allow panic on IOCK NMI error Greg KH
                   ` (2 preceding siblings ...)
  2009-06-25 20:06 ` [tip:x86/urgent] x86: Add " tip-bot for Kurt Garloff
@ 2009-06-25 20:09 ` tip-bot for Kurt Garloff
  2009-06-30 22:27 ` [PATCH] x86: " Maciej W. Rozycki
  4 siblings, 0 replies; 20+ messages in thread
From: tip-bot for Kurt Garloff @ 2009-06-25 20:09 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, ebiederm, gregkh, tglx, robertangelino,
	mingo, garloff

Commit-ID:  5211a242d0cbdded372aee59da18f80552b0a80a
Gitweb:     http://git.kernel.org/tip/5211a242d0cbdded372aee59da18f80552b0a80a
Author:     Kurt Garloff <garloff@suse.de>
AuthorDate: Wed, 24 Jun 2009 14:32:11 -0700
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 25 Jun 2009 22:06:11 +0200

x86: Add sysctl to allow panic on IOCK NMI error

This patch introduces a new sysctl:

    /proc/sys/kernel/panic_on_io_nmi

which defaults to 0 (off).

When enabled, the kernel panics when the kernel receives an NMI
caused by an IO error.

The IO error triggered NMI indicates a serious system
condition, which could result in IO data corruption. Rather
than contiuing, panicing and dumping might be a better choice,
so one can figure out what's causing the IO error.

This could be especially important to companies running IO
intensive applications where corruption must be avoided, e.g. a
bank's databases.

[ SuSE has been shipping it for a while, it was done at the
  request of a large database vendor, for their users. ]

Signed-off-by: Kurt Garloff <garloff@suse.de>
Signed-off-by: Roberto Angelino <robertangelino@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
LKML-Reference: <20090624213211.GA11291@kroah.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>


---
 arch/x86/kernel/dumpstack.c |    1 +
 arch/x86/kernel/traps.c     |    3 +++
 include/linux/kernel.h      |    1 +
 kernel/sysctl.c             |    8 ++++++++
 4 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 95ea5fa..c840571 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -22,6 +22,7 @@
 #include "dumpstack.h"
 
 int panic_on_unrecovered_nmi;
+int panic_on_io_nmi;
 unsigned int code_bytes = 64;
 int kstack_depth_to_print = 3 * STACKSLOTS_PER_LINE;
 static int die_counter;
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index a0f48f5..5204332 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -346,6 +346,9 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
 	printk(KERN_EMERG "NMI: IOCK error (debug interrupt?)\n");
 	show_registers(regs);
 
+	if (panic_on_io_nmi)
+		panic("NMI IOCK error: Not continuing");
+
 	/* Re-enable the IOCK line, wait for a few seconds */
 	reason = (reason & 0xf) | 8;
 	outb(reason, 0x61);
diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index fac104e..d6320a3 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -303,6 +303,7 @@ extern int oops_in_progress;		/* If set, an oops, panic(), BUG() or die() is in
 extern int panic_timeout;
 extern int panic_on_oops;
 extern int panic_on_unrecovered_nmi;
+extern int panic_on_io_nmi;
 extern const char *print_tainted(void);
 extern void add_taint(unsigned flag);
 extern int test_taint(unsigned flag);
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 62e4ff9..fba42ed 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -744,6 +744,14 @@ static struct ctl_table kern_table[] = {
 		.proc_handler	= &proc_dointvec,
 	},
 	{
+		.ctl_name	= CTL_UNNUMBERED,
+		.procname	= "panic_on_io_nmi",
+		.data		= &panic_on_io_nmi,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_dointvec,
+	},
+	{
 		.ctl_name	= KERN_BOOTLOADER_TYPE,
 		.procname	= "bootloader_type",
 		.data		= &bootloader_type,

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] x86: sysctl to allow panic on IOCK NMI error
  2009-06-25 20:18     ` Ingo Molnar
@ 2009-06-25 20:16       ` Greg KH
  0 siblings, 0 replies; 20+ messages in thread
From: Greg KH @ 2009-06-25 20:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Eric W. Biederman, x86, H. Peter Anvin, Thomas Gleixner,
	Kurt Garloff, linux-kernel

On Thu, Jun 25, 2009 at 10:18:35PM +0200, Ingo Molnar wrote:
> 
> * Greg KH <gregkh@suse.de> wrote:
> 
> > On Thu, Jun 25, 2009 at 11:15:45AM -0700, Eric W. Biederman wrote:
> > > Greg KH <gregkh@suse.de> writes:
> > > 
> > > > From: Kurt Garloff <garloff@suse.de>
> > > >
> > > > This patch introduces a sysctl /proc/sys/kernel/panic_on_io_nmi, which
> > > > defaults to 0 (off).
> > > >
> > > > When enabled, the kernel panics when the kernel receives an NMI caused
> > > > by an IO error.
> > > >
> > > > The IO error triggered NMI indicates a serious system condition, which
> > > > could result in IO data corruption. Rather than contiuing, panicing and
> > > > dumping might be a better choice, so one can figure out what's causing
> > > > the IO error.
> > > >
> > > > This could be especially important to companies running IO intensive
> > > > applications where corruption must be avoided, e.g. a banks databases.
> > > 
> > > Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> > > 
> > > New binary sysctls are not allowed.   Please remove the
> > > sysctl.h and .ctl_name portions.
> > > 
> > > see:
> > > Documentation/sysctl/ctl_unnumbered.txt
> > > Documentation/feature-remove-schedule.txt
> > > 
> > > I have a set of patches that should make this kind of thing
> > > fail to compile for .32.  Hopefully I can get that out in 
> > > the next couple of days.  Making problems like this easier
> > > to spot and deal with.
> > 
> > Oops, ok, will do, sorry about that.  Let me go create a new 
> > patch.
> > 
> > Ingo, do you want an incremental one, or a replacement for the 
> > original?
> 
> Neither - already fixed :)

Wonderful, thanks for doing this.

greg k-h

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] x86: sysctl to allow panic on IOCK NMI error
  2009-06-25 20:07   ` Greg KH
@ 2009-06-25 20:18     ` Ingo Molnar
  2009-06-25 20:16       ` Greg KH
  0 siblings, 1 reply; 20+ messages in thread
From: Ingo Molnar @ 2009-06-25 20:18 UTC (permalink / raw)
  To: Greg KH
  Cc: Eric W. Biederman, x86, H. Peter Anvin, Thomas Gleixner,
	Kurt Garloff, linux-kernel


* Greg KH <gregkh@suse.de> wrote:

> On Thu, Jun 25, 2009 at 11:15:45AM -0700, Eric W. Biederman wrote:
> > Greg KH <gregkh@suse.de> writes:
> > 
> > > From: Kurt Garloff <garloff@suse.de>
> > >
> > > This patch introduces a sysctl /proc/sys/kernel/panic_on_io_nmi, which
> > > defaults to 0 (off).
> > >
> > > When enabled, the kernel panics when the kernel receives an NMI caused
> > > by an IO error.
> > >
> > > The IO error triggered NMI indicates a serious system condition, which
> > > could result in IO data corruption. Rather than contiuing, panicing and
> > > dumping might be a better choice, so one can figure out what's causing
> > > the IO error.
> > >
> > > This could be especially important to companies running IO intensive
> > > applications where corruption must be avoided, e.g. a banks databases.
> > 
> > Nacked-by: "Eric W. Biederman" <ebiederm@xmission.com>
> > 
> > New binary sysctls are not allowed.   Please remove the
> > sysctl.h and .ctl_name portions.
> > 
> > see:
> > Documentation/sysctl/ctl_unnumbered.txt
> > Documentation/feature-remove-schedule.txt
> > 
> > I have a set of patches that should make this kind of thing
> > fail to compile for .32.  Hopefully I can get that out in 
> > the next couple of days.  Making problems like this easier
> > to spot and deal with.
> 
> Oops, ok, will do, sorry about that.  Let me go create a new 
> patch.
> 
> Ingo, do you want an incremental one, or a replacement for the 
> original?

Neither - already fixed :)

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] x86: sysctl to allow panic on IOCK NMI error
  2009-06-24 21:32 [PATCH] x86: sysctl to allow panic on IOCK NMI error Greg KH
                   ` (3 preceding siblings ...)
  2009-06-25 20:09 ` tip-bot for Kurt Garloff
@ 2009-06-30 22:27 ` Maciej W. Rozycki
  2009-06-30 22:30   ` Greg KH
  4 siblings, 1 reply; 20+ messages in thread
From: Maciej W. Rozycki @ 2009-06-30 22:27 UTC (permalink / raw)
  To: Greg KH
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Kurt Garloff,
	linux-kernel

On Wed, 24 Jun 2009, Greg KH wrote:

> This patch introduces a sysctl /proc/sys/kernel/panic_on_io_nmi, which
> defaults to 0 (off).
> 
> When enabled, the kernel panics when the kernel receives an NMI caused
> by an IO error.
> 
> The IO error triggered NMI indicates a serious system condition, which
> could result in IO data corruption. Rather than contiuing, panicing and
> dumping might be a better choice, so one can figure out what's causing
> the IO error.
> 
> This could be especially important to companies running IO intensive
> applications where corruption must be avoided, e.g. a banks databases.

 These days an IOCK NMI typically happens in response to a PCI SERR -- it 
may be useful to traverse PCI buses to find the offender and dump this 
information on this occasion too.  The south bridge may have additional 
status too.

  Maciej

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] x86: sysctl to allow panic on IOCK NMI error
  2009-06-30 22:27 ` [PATCH] x86: " Maciej W. Rozycki
@ 2009-06-30 22:30   ` Greg KH
  2009-07-01  0:50     ` Maciej W. Rozycki
  0 siblings, 1 reply; 20+ messages in thread
From: Greg KH @ 2009-06-30 22:30 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Kurt Garloff,
	linux-kernel

On Tue, Jun 30, 2009 at 11:27:57PM +0100, Maciej W. Rozycki wrote:
> On Wed, 24 Jun 2009, Greg KH wrote:
> 
> > This patch introduces a sysctl /proc/sys/kernel/panic_on_io_nmi, which
> > defaults to 0 (off).
> > 
> > When enabled, the kernel panics when the kernel receives an NMI caused
> > by an IO error.
> > 
> > The IO error triggered NMI indicates a serious system condition, which
> > could result in IO data corruption. Rather than contiuing, panicing and
> > dumping might be a better choice, so one can figure out what's causing
> > the IO error.
> > 
> > This could be especially important to companies running IO intensive
> > applications where corruption must be avoided, e.g. a banks databases.
> 
>  These days an IOCK NMI typically happens in response to a PCI SERR -- it 
> may be useful to traverse PCI buses to find the offender and dump this 
> information on this occasion too.  The south bridge may have additional 
> status too.

Sure, that would be great to have.  Care to make a patch?  :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] x86: sysctl to allow panic on IOCK NMI error
  2009-06-30 22:30   ` Greg KH
@ 2009-07-01  0:50     ` Maciej W. Rozycki
  2009-07-01 11:10       ` Ingo Molnar
  0 siblings, 1 reply; 20+ messages in thread
From: Maciej W. Rozycki @ 2009-07-01  0:50 UTC (permalink / raw)
  To: Greg KH
  Cc: x86, H. Peter Anvin, Thomas Gleixner, Ingo Molnar, Kurt Garloff,
	linux-kernel

On Tue, 30 Jun 2009, Greg KH wrote:

> >  These days an IOCK NMI typically happens in response to a PCI SERR -- it 
> > may be useful to traverse PCI buses to find the offender and dump this 
> > information on this occasion too.  The south bridge may have additional 
> > status too.
> 
> Sure, that would be great to have.  Care to make a patch?  :)

 ENOTIME, sorry.  Next year perhaps.  Or a homework project for one of the 
newbies. ;)

  Maciej

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] x86: sysctl to allow panic on IOCK NMI error
  2009-07-01  0:50     ` Maciej W. Rozycki
@ 2009-07-01 11:10       ` Ingo Molnar
  2009-07-01 17:30         ` Jesse Barnes
                           ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Ingo Molnar @ 2009-07-01 11:10 UTC (permalink / raw)
  To: Maciej W. Rozycki
  Cc: Greg KH, x86, H. Peter Anvin, Thomas Gleixner, Kurt Garloff,
	linux-kernel, Yinghai Lu, Jesse Barnes


* Maciej W. Rozycki <macro@linux-mips.org> wrote:

> On Tue, 30 Jun 2009, Greg KH wrote:
> 
> > >  These days an IOCK NMI typically happens in response to a PCI 
> > > SERR -- it may be useful to traverse PCI buses to find the 
> > > offender and dump this information on this occasion too.  The 
> > > south bridge may have additional status too.
> > 
> > Sure, that would be great to have.  Care to make a patch?  :)
> 
>  ENOTIME, sorry.  Next year perhaps.  Or a homework project for 
> one of the newbies. ;)

You know that this project would kill a newbie, right? :)

We have no real southbridge drivers on x86 - but we should certainly 
add some. Also, walking the PCI device tree from NMI context is 
tricky as the lists there are not NMI safe - we could crash if we 
happen to get a #IOCK while loading/unloading drivers (which is rare 
but could happen).

IMHO it's all very much desired functionality, but highly 
non-trivial.

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] x86: sysctl to allow panic on IOCK NMI error
  2009-07-01 11:10       ` Ingo Molnar
@ 2009-07-01 17:30         ` Jesse Barnes
  2009-07-01 17:37         ` Maciej W. Rozycki
  2009-07-03  9:19         ` Kurt Garloff
  2 siblings, 0 replies; 20+ messages in thread
From: Jesse Barnes @ 2009-07-01 17:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Maciej W. Rozycki, Greg KH, x86, H. Peter Anvin, Thomas Gleixner,
	Kurt Garloff, linux-kernel, Yinghai Lu

On Wed, 1 Jul 2009 13:10:03 +0200
Ingo Molnar <mingo@elte.hu> wrote:

> 
> * Maciej W. Rozycki <macro@linux-mips.org> wrote:
> 
> > On Tue, 30 Jun 2009, Greg KH wrote:
> > 
> > > >  These days an IOCK NMI typically happens in response to a PCI 
> > > > SERR -- it may be useful to traverse PCI buses to find the 
> > > > offender and dump this information on this occasion too.  The 
> > > > south bridge may have additional status too.
> > > 
> > > Sure, that would be great to have.  Care to make a patch?  :)
> > 
> >  ENOTIME, sorry.  Next year perhaps.  Or a homework project for 
> > one of the newbies. ;)
> 
> You know that this project would kill a newbie, right? :)
> 
> We have no real southbridge drivers on x86 - but we should certainly 
> add some. Also, walking the PCI device tree from NMI context is 
> tricky as the lists there are not NMI safe - we could crash if we 
> happen to get a #IOCK while loading/unloading drivers (which is rare 
> but could happen).
> 
> IMHO it's all very much desired functionality, but highly 
> non-trivial.

We actually have some code to do this for PCIe AER support.  If we
detect multiple errors or the root complex doesn't give us error ID
info, we walk the bus looking for errors.  So there's some potential
for reuse here...

-- 
Jesse Barnes, Intel Open Source Technology Center

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] x86: sysctl to allow panic on IOCK NMI error
  2009-07-01 11:10       ` Ingo Molnar
  2009-07-01 17:30         ` Jesse Barnes
@ 2009-07-01 17:37         ` Maciej W. Rozycki
  2009-07-02  7:53           ` Ingo Molnar
  2009-07-03  9:19         ` Kurt Garloff
  2 siblings, 1 reply; 20+ messages in thread
From: Maciej W. Rozycki @ 2009-07-01 17:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Greg KH, x86, H. Peter Anvin, Thomas Gleixner, Kurt Garloff,
	linux-kernel, Yinghai Lu, Jesse Barnes

On Wed, 1 Jul 2009, Ingo Molnar wrote:

> >  ENOTIME, sorry.  Next year perhaps.  Or a homework project for 
> > one of the newbies. ;)
> 
> You know that this project would kill a newbie, right? :)

 Well, that's just a fast track to become a veteran, isn't it? ;)

> We have no real southbridge drivers on x86 - but we should certainly 
> add some. Also, walking the PCI device tree from NMI context is 
> tricky as the lists there are not NMI safe - we could crash if we 
> happen to get a #IOCK while loading/unloading drivers (which is rare 
> but could happen).

 That shouldn't be a problem if we were about to panic().  For a more 
sophisticated attempt of recovery -- yes, that would have to be addressed.  

 The possibly simplest approach could be keeping a local list of PCI 
configuration space addresses of the PCI status (and secondary status, as 
applicable) register of all the currently present devices.  That would 
still require some care as many northbridges do not provide means for 
atomic PCI config accesses and a NMI could happen because of some other 
than the CPU master's activity in the middle of a regular config access 
being done by the CPU.  So e.g. the config address register would have to 
be restored so that the regular config data access goes to the right 
location once the NMI handler has concluded.  But that does not sound as 
complicated as traversing the regular structures.

> IMHO it's all very much desired functionality, but highly 
> non-trivial.

 Memory ECC error handlers would benefit from some southbridge 
infrastructure too.

  Maciej

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] x86: sysctl to allow panic on IOCK NMI error
  2009-07-01 17:37         ` Maciej W. Rozycki
@ 2009-07-02  7:53           ` Ingo Molnar
  2009-07-02 10:41             ` Borislav Petkov
  2009-07-03 21:35             ` Maciej W. Rozycki
  0 siblings, 2 replies; 20+ messages in thread
From: Ingo Molnar @ 2009-07-02  7:53 UTC (permalink / raw)
  To: Maciej W. Rozycki, Borislav Petkov
  Cc: Greg KH, x86, H. Peter Anvin, Thomas Gleixner, Kurt Garloff,
	linux-kernel, Yinghai Lu, Jesse Barnes


* Maciej W. Rozycki <macro@linux-mips.org> wrote:

> On Wed, 1 Jul 2009, Ingo Molnar wrote:
> 
> > >  ENOTIME, sorry.  Next year perhaps.  Or a homework project 
> > > for one of the newbies. ;)
> > 
> > You know that this project would kill a newbie, right? :)
> 
>  Well, that's just a fast track to become a veteran, isn't it? ;)

No, that's just a fast track to quickly make it into the list of our 
Fallen Heroes :-/ The fast track to become a kernel veteran is to, 
if possible, not challenge a tank with a hand-grenade. But i 
digress.

> > We have no real southbridge drivers on x86 - but we should 
> > certainly add some. Also, walking the PCI device tree from NMI 
> > context is tricky as the lists there are not NMI safe - we could 
> > crash if we happen to get a #IOCK while loading/unloading 
> > drivers (which is rare but could happen).
> 
> That shouldn't be a problem if we were about to panic().  For a 
> more sophisticated attempt of recovery -- yes, that would have to 
> be addressed.

We are only panic-ing if the sysctl is set. The diagnostics would be 
useful anyway. The proper approach would be to defer it a bit in the 
non-panic case an read it out from some friendlier context - such as 
the EDAC core.

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] x86: sysctl to allow panic on IOCK NMI error
  2009-07-02  7:53           ` Ingo Molnar
@ 2009-07-02 10:41             ` Borislav Petkov
  2009-07-03  7:59               ` Ingo Molnar
  2009-07-03 21:35             ` Maciej W. Rozycki
  1 sibling, 1 reply; 20+ messages in thread
From: Borislav Petkov @ 2009-07-02 10:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Maciej W. Rozycki, Greg KH, x86, H. Peter Anvin, Thomas Gleixner,
	Kurt Garloff, linux-kernel, Yinghai Lu, Jesse Barnes,
	Doug Thompson, Dave Jiang

On Thu, Jul 02, 2009 at 09:53:05AM +0200, Ingo Molnar wrote:
> > That shouldn't be a problem if we were about to panic().  For a 
> > more sophisticated attempt of recovery -- yes, that would have to 
> > be addressed.
> 
> We are only panic-ing if the sysctl is set. The diagnostics would be 
> useful anyway. The proper approach would be to defer it a bit in the 
> non-panic case an read it out from some friendlier context - such as 
> the EDAC core.

Quickly skimming through code shows some functionality is there:
drivers/edac/edac_pci{,_sysfs}.c but doesn't seem to be NMI safe. CCing
some more people.

-- 
Regards/Gruss,
Boris.

Operating | Advanced Micro Devices GmbH
  System  | Karl-Hammerschmidt-Str. 34, 85609 Dornach b. München, Germany
 Research | Geschäftsführer: Thomas M. McCoy, Giuliano Meroni
  Center  | Sitz: Dornach, Gemeinde Aschheim, Landkreis München
  (OSRC)  | Registergericht München, HRB Nr. 43632


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] x86: sysctl to allow panic on IOCK NMI error
  2009-07-02 10:41             ` Borislav Petkov
@ 2009-07-03  7:59               ` Ingo Molnar
  0 siblings, 0 replies; 20+ messages in thread
From: Ingo Molnar @ 2009-07-03  7:59 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Maciej W. Rozycki, Greg KH, x86, H. Peter Anvin, Thomas Gleixner,
	Kurt Garloff, linux-kernel, Yinghai Lu, Jesse Barnes,
	Doug Thompson, Dave Jiang


* Borislav Petkov <borislav.petkov@amd.com> wrote:

> On Thu, Jul 02, 2009 at 09:53:05AM +0200, Ingo Molnar wrote:
> > > That shouldn't be a problem if we were about to panic().  For a 
> > > more sophisticated attempt of recovery -- yes, that would have to 
> > > be addressed.
> > 
> > We are only panic-ing if the sysctl is set. The diagnostics 
> > would be useful anyway. The proper approach would be to defer it 
> > a bit in the non-panic case an read it out from some friendlier 
> > context - such as the EDAC core.
> 
> Quickly skimming through code shows some functionality is there: 
> drivers/edac/edac_pci{,_sysfs}.c but doesn't seem to be NMI safe. 
> CCing some more people.

the 'defer' would mean we put it into a softirq context or so and do 
it from there - not from NMI context.

But yes, if we are about to panic due to this and want to print out 
something useful, making EDAC NMI safe would be nice.

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] x86: sysctl to allow panic on IOCK NMI error
  2009-07-01 11:10       ` Ingo Molnar
  2009-07-01 17:30         ` Jesse Barnes
  2009-07-01 17:37         ` Maciej W. Rozycki
@ 2009-07-03  9:19         ` Kurt Garloff
  2009-07-03  9:24           ` Ingo Molnar
  2 siblings, 1 reply; 20+ messages in thread
From: Kurt Garloff @ 2009-07-03  9:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Maciej W. Rozycki, Greg KH, x86, H. Peter Anvin, Thomas Gleixner,
	linux-kernel, Yinghai Lu, Jesse Barnes

[-- Attachment #1: Type: text/plain, Size: 1391 bytes --]

Ingo,

On Wed, Jul 01, 2009 at 01:10:03PM +0200, Ingo Molnar wrote:
> * Maciej W. Rozycki <macro@linux-mips.org> wrote:
> 
> > On Tue, 30 Jun 2009, Greg KH wrote:
> > 
> > > >  These days an IOCK NMI typically happens in response to a PCI 
> > > > SERR -- it may be useful to traverse PCI buses to find the 
> > > > offender and dump this information on this occasion too.  The 
> > > > south bridge may have additional status too.
> > > 
> > > Sure, that would be great to have.  Care to make a patch?  :)
> > 
> >  ENOTIME, sorry.  Next year perhaps.  Or a homework project for 
> > one of the newbies. ;)
> 
> You know that this project would kill a newbie, right? :)
> 
> We have no real southbridge drivers on x86 - but we should certainly 
> add some. Also, walking the PCI device tree from NMI context is 
> tricky as the lists there are not NMI safe - we could crash if we 
> happen to get a #IOCK while loading/unloading drivers (which is rare 
> but could happen).

Well -- in case we panic the system anyway this is not necessarily a
big issue (let's print the message before ...) -- if we crash trying
to gather additional info, we'll lose the info. Currently we never have
the info ...

> IMHO it's all very much desired functionality, but highly 
> non-trivial.

Too bad.

Best,
-- 
Kurt Garloff, VP OPS Partner Engineering -- Novell Inc.

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] x86: sysctl to allow panic on IOCK NMI error
  2009-07-03  9:19         ` Kurt Garloff
@ 2009-07-03  9:24           ` Ingo Molnar
  0 siblings, 0 replies; 20+ messages in thread
From: Ingo Molnar @ 2009-07-03  9:24 UTC (permalink / raw)
  To: Kurt Garloff, Maciej W. Rozycki, Greg KH, x86, H. Peter Anvin,
	Thomas Gleixner, linux-kernel, Yinghai Lu, Jesse Barnes


* Kurt Garloff <garloff@suse.de> wrote:

> Ingo,
> 
> On Wed, Jul 01, 2009 at 01:10:03PM +0200, Ingo Molnar wrote:
> > * Maciej W. Rozycki <macro@linux-mips.org> wrote:
> > 
> > > On Tue, 30 Jun 2009, Greg KH wrote:
> > > 
> > > > >  These days an IOCK NMI typically happens in response to a PCI 
> > > > > SERR -- it may be useful to traverse PCI buses to find the 
> > > > > offender and dump this information on this occasion too.  The 
> > > > > south bridge may have additional status too.
> > > > 
> > > > Sure, that would be great to have.  Care to make a patch?  :)
> > > 
> > >  ENOTIME, sorry.  Next year perhaps.  Or a homework project for 
> > > one of the newbies. ;)
> > 
> > You know that this project would kill a newbie, right? :)
> > 
> > We have no real southbridge drivers on x86 - but we should 
> > certainly add some. Also, walking the PCI device tree from NMI 
> > context is tricky as the lists there are not NMI safe - we could 
> > crash if we happen to get a #IOCK while loading/unloading 
> > drivers (which is rare but could happen).
> 
> Well -- in case we panic the system anyway this is not necessarily 
> a big issue (let's print the message before ...) -- if we crash 
> trying to gather additional info, we'll lose the info. Currently 
> we never have the info ...

We dont _necessarily_ crash ... The crash/panic is default off and 
sysctl driven.

Allowing a crash is not the highest quality of implementation and 
i'm somewhat wary of 'allow a little bit of crap' arguments - it's 
too similar to the 'little bit pregnant' concept ;-)

	Ingo

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] x86: sysctl to allow panic on IOCK NMI error
  2009-07-02  7:53           ` Ingo Molnar
  2009-07-02 10:41             ` Borislav Petkov
@ 2009-07-03 21:35             ` Maciej W. Rozycki
  1 sibling, 0 replies; 20+ messages in thread
From: Maciej W. Rozycki @ 2009-07-03 21:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Borislav Petkov, Greg KH, x86, H. Peter Anvin, Thomas Gleixner,
	Kurt Garloff, linux-kernel, Yinghai Lu, Jesse Barnes

On Thu, 2 Jul 2009, Ingo Molnar wrote:

> >  Well, that's just a fast track to become a veteran, isn't it? ;)
> 
> No, that's just a fast track to quickly make it into the list of our 
> Fallen Heroes :-/ The fast track to become a kernel veteran is to, 
> if possible, not challenge a tank with a hand-grenade. But i 
> digress.

 What doesn't kill you will make you stronger, ;) but otherwise I digress 
too.

> > That shouldn't be a problem if we were about to panic().  For a 
> > more sophisticated attempt of recovery -- yes, that would have to 
> > be addressed.
> 
> We are only panic-ing if the sysctl is set. The diagnostics would be 
> useful anyway. The proper approach would be to defer it a bit in the 
> non-panic case an read it out from some friendlier context - such as 
> the EDAC core.

 Hmm, my concern is in the case of a PCI SERR the system may not 
necessarily be in a recoverable state.  For example if a master abort 
happened due to a timeout (which is outside the PCI spec I'm told, but the 
only way to avoid holding the bus undefinitely) and the target finally 
responded, then it may have corrupted a subsequent transaction.  My point 
is thus any diagnostic output should be produced as soon as possible and 
involving as little system resources as absolutely necessary.  This being 
enough to identify the device triggering the SERR -- so that if an error 
is fatal and recurs, then the possible offender can be determined.

 Deferring such initial diagnostic to a softirq or suchlike does not sound 
as a terribly good idea to me.  I think this is also the right place to 
disable the device's master access to the bus (and possibly target address 
space decoders too -- the device may have started misdecoding and 
interfering with transactions meant to involve other devices) -- till the 
recovery procedure has been completed.

 Then further processing, such as signalling the involved device's driver 
that the error happened and letting it attempt to recover is something 
that should happen in less restricted a context.  It is the driver only 
that could further determine the cause based on the state of the device's 
registers (e.g. what was the target when the reporting device acted as a 
master) and the knowledge of how it operates, reset the device, etc.  
Once the situation has been rectified and the device determined to be 
capable to continue operating (e.g. the built-in to the firmware self-test 
-- if available -- was run and reported success) the device can be 
reconfigured and put on the bus again.

  Maciej

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2009-07-03 21:29 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-06-24 21:32 [PATCH] x86: sysctl to allow panic on IOCK NMI error Greg KH
2009-06-25  9:16 ` [tip:x86/urgent] x86: Add " tip-bot for Kurt Garloff
2009-06-25 18:15 ` [PATCH] x86: " Eric W. Biederman
2009-06-25 20:07   ` Greg KH
2009-06-25 20:18     ` Ingo Molnar
2009-06-25 20:16       ` Greg KH
2009-06-25 20:06 ` [tip:x86/urgent] x86: Add " tip-bot for Kurt Garloff
2009-06-25 20:09 ` tip-bot for Kurt Garloff
2009-06-30 22:27 ` [PATCH] x86: " Maciej W. Rozycki
2009-06-30 22:30   ` Greg KH
2009-07-01  0:50     ` Maciej W. Rozycki
2009-07-01 11:10       ` Ingo Molnar
2009-07-01 17:30         ` Jesse Barnes
2009-07-01 17:37         ` Maciej W. Rozycki
2009-07-02  7:53           ` Ingo Molnar
2009-07-02 10:41             ` Borislav Petkov
2009-07-03  7:59               ` Ingo Molnar
2009-07-03 21:35             ` Maciej W. Rozycki
2009-07-03  9:19         ` Kurt Garloff
2009-07-03  9:24           ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox