* [V5][PATCH 0/6] x86, nmi: new NMI handling routines
@ 2011-09-20 14:43 Don Zickus
2011-09-20 14:43 ` [V5][PATCH 1/6] x86, nmi: split out nmi from traps.c Don Zickus
` (5 more replies)
0 siblings, 6 replies; 28+ messages in thread
From: Don Zickus @ 2011-09-20 14:43 UTC (permalink / raw)
To: x86, Andi Kleen, Robert Richter, Peter Zijlstra, ying.huang
Cc: LKML, paulmck, avi, jeremy, Don Zickus
I had the pleasure of hosting Robert Richter at Red Hat in August. One of
the issues he wanted to talk with me about was having the NMI handling
routines execute all the NMI handlers for each NMI, mainly for his AMD IBS
work. But he also brought up another good point: because of the way NMIs
work, it is possible to lose them if multiple NMIs happen at the same time.
As a result, we sat around and discussed how we could go about executing
all the nmi handlers for each NMI to ensure that we would not lose any events.
We decided the best way to do this would be to have the NMI handlers break
away from the notifier routines and create our own. This would allow us to
execute all the handlers without hacking up the notifier stuff and easily
track the number of events processed at a higher level to deal with the new
problem of extra NMIs.
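Concretely (a simplified sketch of what patch 2 below adds; the real walk is
RCU-protected), the core of the new code just runs every registered handler
for the given NMI type and sums up how many events each one reports:

	handled = 0;
	for (a = desc->head; a; a = a->next)
		handled += a->handler(type, regs);
	return handled;

A non-zero total tells the higher level at least one handler claimed the NMI,
and a total greater than one is the hint that an extra, already-latched NMI
may follow.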
I spent some time hacking and came up with this patch. I tested it on my
core2quad machine, trying to enable all the NMI handlers I could, mainly
perf and kgdb (and oprofile too when perf was disabled). Everything seems
to work correctly. If people are ok with this approach, I'll try and test
this on more machines.
More details about the patch are in the individual changelogs.
This version only changes patch 4.
Don Zickus (6):
x86, nmi: split out nmi from traps.c
x86, nmi: create new NMI handler routines
x86, nmi: wire up NMI handlers to new routines
x86, nmi: add in logic to handle multiple events and unknown NMIs
x86, nmi: track NMI usage stats
x86, nmi: print out NMI stats in /proc/interrupts
arch/x86/include/asm/nmi.h | 39 ++--
arch/x86/include/asm/reboot.h | 2 +-
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/apic/hw_nmi.c | 27 +--
arch/x86/kernel/apic/x2apic_uv_x.c | 20 +-
arch/x86/kernel/cpu/mcheck/mce-inject.c | 20 +-
arch/x86/kernel/cpu/mcheck/mce.c | 3 -
arch/x86/kernel/cpu/perf_event.c | 60 +----
arch/x86/kernel/crash.c | 5 +-
arch/x86/kernel/irq.c | 2 +
arch/x86/kernel/kgdb.c | 60 +++-
arch/x86/kernel/nmi.c | 473 +++++++++++++++++++++++++++++++
arch/x86/kernel/reboot.c | 23 +-
arch/x86/kernel/traps.c | 155 ----------
arch/x86/oprofile/nmi_int.c | 40 +--
arch/x86/oprofile/nmi_timer_int.c | 28 +--
drivers/acpi/apei/ghes.c | 22 +-
drivers/char/ipmi/ipmi_watchdog.c | 32 +--
drivers/watchdog/hpwdt.c | 23 +--
19 files changed, 613 insertions(+), 423 deletions(-)
create mode 100644 arch/x86/kernel/nmi.c
--
1.7.6
* [V5][PATCH 1/6] x86, nmi: split out nmi from traps.c
2011-09-20 14:43 [V5][PATCH 0/6] x86, nmi: new NMI handling routines Don Zickus
@ 2011-09-20 14:43 ` Don Zickus
2011-09-20 14:43 ` [V5][PATCH 2/6] x86, nmi: create new NMI handler routines Don Zickus
` (4 subsequent siblings)
5 siblings, 0 replies; 28+ messages in thread
From: Don Zickus @ 2011-09-20 14:43 UTC (permalink / raw)
To: x86, Andi Kleen, Robert Richter, Peter Zijlstra, ying.huang
Cc: LKML, paulmck, avi, jeremy, Don Zickus
The nmi stuff is changing a lot and adding more functionality. Split it
out from the traps.c file so it doesn't continue to pollute that file.
This makes it easier to find and expand all the future nmi-related work.
No real functional changes here.
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/nmi.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/traps.c | 155 ----------------------------------------
3 files changed, 179 insertions(+), 156 deletions(-)
create mode 100644 arch/x86/kernel/nmi.c
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 82f2912..8baca3c 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -19,7 +19,7 @@ endif
obj-y := process_$(BITS).o signal.o entry_$(BITS).o
obj-y += traps.o irq.o irq_$(BITS).o dumpstack_$(BITS).o
-obj-y += time.o ioport.o ldt.o dumpstack.o
+obj-y += time.o ioport.o ldt.o dumpstack.o nmi.o
obj-y += setup.o x86_init.o i8259.o irqinit.o jump_label.o
obj-$(CONFIG_IRQ_WORK) += irq_work.o
obj-y += probe_roms.o
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
new file mode 100644
index 0000000..68d758a
--- /dev/null
+++ b/arch/x86/kernel/nmi.c
@@ -0,0 +1,178 @@
+/*
+ * Copyright (C) 1991, 1992 Linus Torvalds
+ * Copyright (C) 2000, 2001, 2002 Andi Kleen, SuSE Labs
+ *
+ * Pentium III FXSR, SSE support
+ * Gareth Hughes <gareth@valinux.com>, May 2000
+ */
+
+/*
+ * Handle hardware traps and faults.
+ */
+#include <linux/spinlock.h>
+#include <linux/kprobes.h>
+#include <linux/kdebug.h>
+#include <linux/nmi.h>
+
+#if defined(CONFIG_EDAC)
+#include <linux/edac.h>
+#endif
+
+#include <linux/atomic.h>
+#include <asm/traps.h>
+#include <asm/mach_traps.h>
+
+static int ignore_nmis;
+
+int unknown_nmi_panic;
+/*
+ * Prevent NMI reason port (0x61) being accessed simultaneously, can
+ * only be used in NMI handler.
+ */
+static DEFINE_RAW_SPINLOCK(nmi_reason_lock);
+
+static int __init setup_unknown_nmi_panic(char *str)
+{
+ unknown_nmi_panic = 1;
+ return 1;
+}
+__setup("unknown_nmi_panic", setup_unknown_nmi_panic);
+
+static notrace __kprobes void
+pci_serr_error(unsigned char reason, struct pt_regs *regs)
+{
+ pr_emerg("NMI: PCI system error (SERR) for reason %02x on CPU %d.\n",
+ reason, smp_processor_id());
+
+ /*
+ * On some machines, PCI SERR line is used to report memory
+ * errors. EDAC makes use of it.
+ */
+#if defined(CONFIG_EDAC)
+ if (edac_handler_set()) {
+ edac_atomic_assert_error();
+ return;
+ }
+#endif
+
+ if (panic_on_unrecovered_nmi)
+ panic("NMI: Not continuing");
+
+ pr_emerg("Dazed and confused, but trying to continue\n");
+
+ /* Clear and disable the PCI SERR error line. */
+ reason = (reason & NMI_REASON_CLEAR_MASK) | NMI_REASON_CLEAR_SERR;
+ outb(reason, NMI_REASON_PORT);
+}
+
+static notrace __kprobes void
+io_check_error(unsigned char reason, struct pt_regs *regs)
+{
+ unsigned long i;
+
+ pr_emerg(
+ "NMI: IOCK error (debug interrupt?) for reason %02x on CPU %d.\n",
+ reason, smp_processor_id());
+ show_registers(regs);
+
+ if (panic_on_io_nmi)
+ panic("NMI IOCK error: Not continuing");
+
+ /* Re-enable the IOCK line, wait for a few seconds */
+ reason = (reason & NMI_REASON_CLEAR_MASK) | NMI_REASON_CLEAR_IOCHK;
+ outb(reason, NMI_REASON_PORT);
+
+ i = 20000;
+ while (--i) {
+ touch_nmi_watchdog();
+ udelay(100);
+ }
+
+ reason &= ~NMI_REASON_CLEAR_IOCHK;
+ outb(reason, NMI_REASON_PORT);
+}
+
+static notrace __kprobes void
+unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
+{
+ if (notify_die(DIE_NMIUNKNOWN, "nmi", regs, reason, 2, SIGINT) ==
+ NOTIFY_STOP)
+ return;
+#ifdef CONFIG_MCA
+ /*
+ * Might actually be able to figure out what the guilty party
+ * is:
+ */
+ if (MCA_bus) {
+ mca_handle_nmi();
+ return;
+ }
+#endif
+ pr_emerg("Uhhuh. NMI received for unknown reason %02x on CPU %d.\n",
+ reason, smp_processor_id());
+
+ pr_emerg("Do you have a strange power saving mode enabled?\n");
+ if (unknown_nmi_panic || panic_on_unrecovered_nmi)
+ panic("NMI: Not continuing");
+
+ pr_emerg("Dazed and confused, but trying to continue\n");
+}
+
+static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
+{
+ unsigned char reason = 0;
+
+ /*
+ * CPU-specific NMI must be processed before non-CPU-specific
+ * NMI, otherwise we may lose it, because the CPU-specific
+ * NMI can not be detected/processed on other CPUs.
+ */
+ if (notify_die(DIE_NMI, "nmi", regs, 0, 2, SIGINT) == NOTIFY_STOP)
+ return;
+
+ /* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
+ raw_spin_lock(&nmi_reason_lock);
+ reason = get_nmi_reason();
+
+ if (reason & NMI_REASON_MASK) {
+ if (reason & NMI_REASON_SERR)
+ pci_serr_error(reason, regs);
+ else if (reason & NMI_REASON_IOCHK)
+ io_check_error(reason, regs);
+#ifdef CONFIG_X86_32
+ /*
+ * Reassert NMI in case it became active
+ * meanwhile as it's edge-triggered:
+ */
+ reassert_nmi();
+#endif
+ raw_spin_unlock(&nmi_reason_lock);
+ return;
+ }
+ raw_spin_unlock(&nmi_reason_lock);
+
+ unknown_nmi_error(reason, regs);
+}
+
+dotraplinkage notrace __kprobes void
+do_nmi(struct pt_regs *regs, long error_code)
+{
+ nmi_enter();
+
+ inc_irq_stat(__nmi_count);
+
+ if (!ignore_nmis)
+ default_do_nmi(regs);
+
+ nmi_exit();
+}
+
+void stop_nmi(void)
+{
+ ignore_nmis++;
+}
+
+void restart_nmi(void)
+{
+ ignore_nmis--;
+}
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 6913369..a8e3eb8 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -81,15 +81,6 @@ gate_desc idt_table[NR_VECTORS] __page_aligned_data = { { { { 0, 0 } } }, };
DECLARE_BITMAP(used_vectors, NR_VECTORS);
EXPORT_SYMBOL_GPL(used_vectors);
-static int ignore_nmis;
-
-int unknown_nmi_panic;
-/*
- * Prevent NMI reason port (0x61) being accessed simultaneously, can
- * only be used in NMI handler.
- */
-static DEFINE_RAW_SPINLOCK(nmi_reason_lock);
-
static inline void conditional_sti(struct pt_regs *regs)
{
if (regs->flags & X86_EFLAGS_IF)
@@ -307,152 +298,6 @@ gp_in_kernel:
die("general protection fault", regs, error_code);
}
-static int __init setup_unknown_nmi_panic(char *str)
-{
- unknown_nmi_panic = 1;
- return 1;
-}
-__setup("unknown_nmi_panic", setup_unknown_nmi_panic);
-
-static notrace __kprobes void
-pci_serr_error(unsigned char reason, struct pt_regs *regs)
-{
- pr_emerg("NMI: PCI system error (SERR) for reason %02x on CPU %d.\n",
- reason, smp_processor_id());
-
- /*
- * On some machines, PCI SERR line is used to report memory
- * errors. EDAC makes use of it.
- */
-#if defined(CONFIG_EDAC)
- if (edac_handler_set()) {
- edac_atomic_assert_error();
- return;
- }
-#endif
-
- if (panic_on_unrecovered_nmi)
- panic("NMI: Not continuing");
-
- pr_emerg("Dazed and confused, but trying to continue\n");
-
- /* Clear and disable the PCI SERR error line. */
- reason = (reason & NMI_REASON_CLEAR_MASK) | NMI_REASON_CLEAR_SERR;
- outb(reason, NMI_REASON_PORT);
-}
-
-static notrace __kprobes void
-io_check_error(unsigned char reason, struct pt_regs *regs)
-{
- unsigned long i;
-
- pr_emerg(
- "NMI: IOCK error (debug interrupt?) for reason %02x on CPU %d.\n",
- reason, smp_processor_id());
- show_registers(regs);
-
- if (panic_on_io_nmi)
- panic("NMI IOCK error: Not continuing");
-
- /* Re-enable the IOCK line, wait for a few seconds */
- reason = (reason & NMI_REASON_CLEAR_MASK) | NMI_REASON_CLEAR_IOCHK;
- outb(reason, NMI_REASON_PORT);
-
- i = 20000;
- while (--i) {
- touch_nmi_watchdog();
- udelay(100);
- }
-
- reason &= ~NMI_REASON_CLEAR_IOCHK;
- outb(reason, NMI_REASON_PORT);
-}
-
-static notrace __kprobes void
-unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
-{
- if (notify_die(DIE_NMIUNKNOWN, "nmi", regs, reason, 2, SIGINT) ==
- NOTIFY_STOP)
- return;
-#ifdef CONFIG_MCA
- /*
- * Might actually be able to figure out what the guilty party
- * is:
- */
- if (MCA_bus) {
- mca_handle_nmi();
- return;
- }
-#endif
- pr_emerg("Uhhuh. NMI received for unknown reason %02x on CPU %d.\n",
- reason, smp_processor_id());
-
- pr_emerg("Do you have a strange power saving mode enabled?\n");
- if (unknown_nmi_panic || panic_on_unrecovered_nmi)
- panic("NMI: Not continuing");
-
- pr_emerg("Dazed and confused, but trying to continue\n");
-}
-
-static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
-{
- unsigned char reason = 0;
-
- /*
- * CPU-specific NMI must be processed before non-CPU-specific
- * NMI, otherwise we may lose it, because the CPU-specific
- * NMI can not be detected/processed on other CPUs.
- */
- if (notify_die(DIE_NMI, "nmi", regs, 0, 2, SIGINT) == NOTIFY_STOP)
- return;
-
- /* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
- raw_spin_lock(&nmi_reason_lock);
- reason = get_nmi_reason();
-
- if (reason & NMI_REASON_MASK) {
- if (reason & NMI_REASON_SERR)
- pci_serr_error(reason, regs);
- else if (reason & NMI_REASON_IOCHK)
- io_check_error(reason, regs);
-#ifdef CONFIG_X86_32
- /*
- * Reassert NMI in case it became active
- * meanwhile as it's edge-triggered:
- */
- reassert_nmi();
-#endif
- raw_spin_unlock(&nmi_reason_lock);
- return;
- }
- raw_spin_unlock(&nmi_reason_lock);
-
- unknown_nmi_error(reason, regs);
-}
-
-dotraplinkage notrace __kprobes void
-do_nmi(struct pt_regs *regs, long error_code)
-{
- nmi_enter();
-
- inc_irq_stat(__nmi_count);
-
- if (!ignore_nmis)
- default_do_nmi(regs);
-
- nmi_exit();
-}
-
-void stop_nmi(void)
-{
- ignore_nmis++;
-}
-
-void restart_nmi(void)
-{
- ignore_nmis--;
-}
-
/* May run on IST stack. */
dotraplinkage void __kprobes do_int3(struct pt_regs *regs, long error_code)
{
--
1.7.6
* [V5][PATCH 2/6] x86, nmi: create new NMI handler routines
2011-09-20 14:43 [V5][PATCH 0/6] x86, nmi: new NMI handling routines Don Zickus
2011-09-20 14:43 ` [V5][PATCH 1/6] x86, nmi: split out nmi from traps.c Don Zickus
@ 2011-09-20 14:43 ` Don Zickus
2011-09-21 5:36 ` Huang Ying
2011-09-20 14:43 ` [V5][PATCH 3/6] x86, nmi: wire up NMI handlers to new routines Don Zickus
` (3 subsequent siblings)
5 siblings, 1 reply; 28+ messages in thread
From: Don Zickus @ 2011-09-20 14:43 UTC (permalink / raw)
To: x86, Andi Kleen, Robert Richter, Peter Zijlstra, ying.huang
Cc: LKML, paulmck, avi, jeremy, Don Zickus
The NMI handlers used to rely on the notifier infrastructure. This worked
great until we wanted to support handling multiple events better.
One of the key ideas of the nmi handling is to process _all_ the handlers for
each NMI. The reason behind this switch is that NMIs are edge-triggered.
If enough NMIs are triggered, then they could be lost because the cpu can
only latch at most one NMI (besides the one currently being processed).
In order to deal with this we have decided to process all the NMI handlers
for each NMI. This allows the handlers to determine if they received an
event or not (the ones that cannot determine this will be left to fend
for themselves on the unknown NMI list).
As a result of this change it is now possible to have an extra NMI that
was destined to be received for an already processed event. Because the
event was processed in the previous NMI, this NMI gets dropped and becomes
an 'unknown' NMI. This of course will cause printks that scare people.
However, we prefer to have extra NMIs as opposed to losing NMIs and as such
have developed a basic mechanism to catch most of them. That will be
a later patch.
To accomplish this idea, I unhooked the nmi handlers from the notifier
routines and created a new mechanism loosely based on doIRQ. The reason
for this is that the notifier routines have a couple of shortcomings. One, we
couldn't guarantee all future NMI handlers used NOTIFY_OK instead of
NOTIFY_STOP. Second, we couldn't keep track of the number of events being
handled in each routine (most only handle one, perf can handle more than one).
Third, I wanted to eventually display which nmi handlers are registered in
the system in /proc/interrupts to help see who is generating NMIs.
The patch below just implements the new infrastructure but doesn't wire it up
yet (that is the next patch). Its design is based on doIRQ structs and the
atomic notifier routines. So the rcu stuff in the patch isn't entirely untested
(as the notifier routines have soaked it) but it should be double checked in
case I copied the code wrong.
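For reference, a user of this interface ends up looking roughly like the
sketch below. Only register_nmi_handler(), unregister_nmi_handler(),
NMI_LOCAL, NMI_HANDLED and NMI_DONE come from this patch; the device name
and the two mydev_* helpers are made up for illustration:

	static int mydev_nmi_handler(unsigned int type, struct pt_regs *regs)
	{
		/* claim only the events this device actually raised */
		if (!mydev_nmi_pending())
			return NMI_DONE;

		mydev_handle_event(regs);
		return NMI_HANDLED;	/* one event consumed */
	}

	/* at init time */
	err = register_nmi_handler(NMI_LOCAL, mydev_nmi_handler, 0, "mydev");

	/* at teardown; the handler is looked up by the same name */
	unregister_nmi_handler(NMI_LOCAL, "mydev");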
V2:
- use kstrdup to copy/allocate device name
- fix-up _GPL oops
V3:
- fix leak in register_nmi_handler error path
- removed _raw annotations from rcu_dereference
V4:
- handle kstrndup failure
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
arch/x86/include/asm/nmi.h | 19 +++++
arch/x86/kernel/nmi.c | 157 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 176 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index 4886a68..6d04b28 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -42,6 +42,25 @@ void arch_trigger_all_cpu_backtrace(void);
#define NMI_LOCAL_NORMAL_PRIOR (NMI_LOCAL_BIT | NMI_NORMAL_PRIOR)
#define NMI_LOCAL_LOW_PRIOR (NMI_LOCAL_BIT | NMI_LOW_PRIOR)
+#define NMI_FLAG_FIRST 1
+
+enum {
+ NMI_LOCAL=0,
+ NMI_UNKNOWN,
+ NMI_EXTERNAL,
+ NMI_MAX
+};
+
+#define NMI_DONE 0
+#define NMI_HANDLED 1
+
+typedef int (*nmi_handler_t)(unsigned int, struct pt_regs *);
+
+int register_nmi_handler(unsigned int, nmi_handler_t, unsigned long,
+ const char *);
+
+void unregister_nmi_handler(unsigned int, const char *);
+
void stop_nmi(void);
void restart_nmi(void);
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 68d758a..c2df58a 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -13,6 +13,9 @@
#include <linux/kprobes.h>
#include <linux/kdebug.h>
#include <linux/nmi.h>
+#include <linux/delay.h>
+#include <linux/hardirq.h>
+#include <linux/slab.h>
#if defined(CONFIG_EDAC)
#include <linux/edac.h>
@@ -21,6 +24,28 @@
#include <linux/atomic.h>
#include <asm/traps.h>
#include <asm/mach_traps.h>
+#include <asm/nmi.h>
+
+#define NMI_MAX_NAMELEN 16
+struct nmiaction {
+ struct nmiaction __rcu *next;
+ nmi_handler_t handler;
+ unsigned int flags;
+ char *name;
+};
+
+struct nmi_desc {
+ spinlock_t lock;
+ struct nmiaction __rcu *head;
+};
+
+static struct nmi_desc nmi_desc[NMI_MAX] =
+{
+ { .lock = __SPIN_LOCK_UNLOCKED(&nmi_desc[0].lock), },
+ { .lock = __SPIN_LOCK_UNLOCKED(&nmi_desc[1].lock), },
+ { .lock = __SPIN_LOCK_UNLOCKED(&nmi_desc[2].lock), },
+
+};
static int ignore_nmis;
@@ -38,6 +63,138 @@ static int __init setup_unknown_nmi_panic(char *str)
}
__setup("unknown_nmi_panic", setup_unknown_nmi_panic);
+#define nmi_to_desc(type) (&nmi_desc[type])
+
+static int notrace __kprobes nmi_handle(unsigned int type, struct pt_regs *regs)
+{
+ struct nmi_desc *desc = nmi_to_desc(type);
+ struct nmiaction *next_a, *a, **ap = &desc->head;
+ int handled=0;
+
+ rcu_read_lock();
+
+ a = rcu_dereference(*ap);
+
+ /*
+ * NMIs are edge-triggered, which means if you have enough
+ * of them concurrently, you can lose some because only one
+ * can be latched at any given time. Walk the whole list
+ * to handle those situations.
+ */
+ while (a) {
+ next_a = rcu_dereference(a->next);
+
+ handled += a->handler(type, regs);
+
+ a = next_a;
+ }
+ rcu_read_unlock();
+
+ /* return total number of NMI events handled */
+ return handled;
+}
+
+static int __setup_nmi(unsigned int type, struct nmiaction *action)
+{
+ struct nmi_desc *desc = nmi_to_desc(type);
+ struct nmiaction **a = &(desc->head);
+ unsigned long flags;
+
+ spin_lock_irqsave(&desc->lock, flags);
+
+ /*
+ * some handlers need to be executed first otherwise a fake
+ * event confuses some handlers (kdump uses this flag)
+ */
+ if (!(action->flags & NMI_FLAG_FIRST))
+ while ((*a) != NULL)
+ a = &((*a)->next);
+
+ action->next = *a;
+ rcu_assign_pointer(*a, action);
+
+ spin_unlock_irqrestore(&desc->lock, flags);
+ return 0;
+}
+
+static struct nmiaction *__free_nmi(unsigned int type, const char *name)
+{
+ struct nmi_desc *desc = nmi_to_desc(type);
+ struct nmiaction *n, **np = &(desc->head);
+ unsigned long flags;
+
+ spin_lock_irqsave(&desc->lock, flags);
+
+ while ((*np) != NULL) {
+ n = *np;
+
+ /*
+ * the name passed in to describe the nmi handler
+ * is used as the lookup key
+ */
+ if (!strcmp(n->name, name)) {
+ WARN(in_nmi(),
+ "Trying to free NMI (%s) from NMI context!\n", n->name);
+ rcu_assign_pointer(*np, n->next);
+ break;
+ }
+ np = &(n->next);
+ }
+
+ spin_unlock_irqrestore(&desc->lock, flags);
+ synchronize_rcu();
+ return (*np);
+}
+
+int register_nmi_handler(unsigned int type, nmi_handler_t handler,
+ unsigned long nmiflags, const char *devname)
+{
+ struct nmiaction *action;
+ int retval = -ENOMEM;
+
+ if (!handler)
+ return -EINVAL;
+
+ action = kzalloc(sizeof(struct nmiaction), GFP_KERNEL);
+ if (!action)
+ goto fail_action;
+
+ action->handler = handler;
+ action->flags = nmiflags;
+ action->name = kstrndup(devname, NMI_MAX_NAMELEN, GFP_KERNEL);
+ if (!action->name)
+ goto fail_action_name;
+
+ retval = __setup_nmi(type, action);
+
+ if (retval)
+ goto fail_setup_nmi;
+
+ return retval;
+
+fail_setup_nmi:
+ kfree(action->name);
+fail_action_name:
+ kfree(action);
+fail_action:
+
+ return retval;
+}
+EXPORT_SYMBOL_GPL(register_nmi_handler);
+
+void unregister_nmi_handler(unsigned int type, const char *name)
+{
+ struct nmiaction *a;
+
+ a = __free_nmi(type, name);
+ if (a) {
+ kfree(a->name);
+ kfree(a);
+ }
+}
+
+EXPORT_SYMBOL_GPL(unregister_nmi_handler);
+
static notrace __kprobes void
pci_serr_error(unsigned char reason, struct pt_regs *regs)
{
--
1.7.6
* [V5][PATCH 3/6] x86, nmi: wire up NMI handlers to new routines
2011-09-20 14:43 [V5][PATCH 0/6] x86, nmi: new NMI handling routines Don Zickus
2011-09-20 14:43 ` [V5][PATCH 1/6] x86, nmi: split out nmi from traps.c Don Zickus
2011-09-20 14:43 ` [V5][PATCH 2/6] x86, nmi: create new NMI handler routines Don Zickus
@ 2011-09-20 14:43 ` Don Zickus
2011-09-21 5:41 ` Huang Ying
2011-09-20 14:43 ` [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs Don Zickus
` (2 subsequent siblings)
5 siblings, 1 reply; 28+ messages in thread
From: Don Zickus @ 2011-09-20 14:43 UTC (permalink / raw)
To: x86, Andi Kleen, Robert Richter, Peter Zijlstra, ying.huang
Cc: LKML, paulmck, avi, jeremy, Don Zickus, Jason Wessel, Andi Kleen,
Corey Minyard, Jack Steiner
Just convert all the files that have an nmi handler to the new routines.
Most of it is a straightforward conversion. A couple of places needed some
tweaking, like kgdb, which separates the debug notifier from the nmi handler,
and mce, which removes a call to notify_die (as I couldn't figure out why it
was there).
The things that get converted are the registration/unregistration routines
and the nmi handler itself, whose args are changed and whose list-checking
code is removed (most handlers are on one NMI list, except for kgdb, which
has both an NMI routine and an NMI unknown routine).
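The conversion follows one pattern throughout; roughly (old die-notifier
form first, then the new form, with 'foo' standing in for whichever
subsystem is being converted):

	/* before */
	static int foo_nmi_notify(struct notifier_block *self,
				  unsigned long cmd, void *data)
	{
		struct die_args *args = data;

		if (cmd != DIE_NMI)
			return NOTIFY_DONE;
		if (!foo_check_and_handle(args->regs))
			return NOTIFY_DONE;
		return NOTIFY_STOP;
	}

	static struct notifier_block foo_nb = {
		.notifier_call	= foo_nmi_notify,
		.priority	= NMI_LOCAL_LOW_PRIOR,
	};
	...
	register_die_notifier(&foo_nb);

	/* after */
	static int foo_nmi_handler(unsigned int cmd, struct pt_regs *regs)
	{
		if (!foo_check_and_handle(regs))
			return NMI_DONE;
		return NMI_HANDLED;
	}
	...
	register_nmi_handler(NMI_LOCAL, foo_nmi_handler, 0, "foo");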
Cc: Jason Wessel <jason.wessel@windriver.com>
Cc: Andi Kleen <ak@linux.intel.com>
Cc: Robert Richter <robert.richter@amd.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Corey Minyard <minyard@acm.org>
Cc: Jack Steiner <steiner@sgi.com>
Acked-by: Corey Minyard <minyard@acm.org>
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
arch/x86/include/asm/nmi.h | 20 ----------
arch/x86/include/asm/reboot.h | 2 +-
arch/x86/kernel/apic/hw_nmi.c | 27 +++-----------
arch/x86/kernel/apic/x2apic_uv_x.c | 20 ++--------
arch/x86/kernel/cpu/mcheck/mce-inject.c | 20 ++++-------
arch/x86/kernel/cpu/mcheck/mce.c | 3 --
arch/x86/kernel/cpu/perf_event.c | 60 +++----------------------------
arch/x86/kernel/crash.c | 5 +--
arch/x86/kernel/kgdb.c | 60 +++++++++++++++++++++++--------
arch/x86/kernel/nmi.c | 11 ++++--
arch/x86/kernel/reboot.c | 23 ++++--------
arch/x86/oprofile/nmi_int.c | 40 ++++++--------------
arch/x86/oprofile/nmi_timer_int.c | 28 +++-----------
drivers/acpi/apei/ghes.c | 22 ++++-------
drivers/char/ipmi/ipmi_watchdog.c | 32 +++++-----------
drivers/watchdog/hpwdt.c | 23 +++---------
16 files changed, 125 insertions(+), 271 deletions(-)
diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index 6d04b28..fc74547 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -22,26 +22,6 @@ void arch_trigger_all_cpu_backtrace(void);
#define arch_trigger_all_cpu_backtrace arch_trigger_all_cpu_backtrace
#endif
-/*
- * Define some priorities for the nmi notifier call chain.
- *
- * Create a local nmi bit that has a higher priority than
- * external nmis, because the local ones are more frequent.
- *
- * Also setup some default high/normal/low settings for
- * subsystems to registers with. Using 4 bits to separate
- * the priorities. This can go a lot higher if needed be.
- */
-
-#define NMI_LOCAL_SHIFT 16 /* randomly picked */
-#define NMI_LOCAL_BIT (1ULL << NMI_LOCAL_SHIFT)
-#define NMI_HIGH_PRIOR (1ULL << 8)
-#define NMI_NORMAL_PRIOR (1ULL << 4)
-#define NMI_LOW_PRIOR (1ULL << 0)
-#define NMI_LOCAL_HIGH_PRIOR (NMI_LOCAL_BIT | NMI_HIGH_PRIOR)
-#define NMI_LOCAL_NORMAL_PRIOR (NMI_LOCAL_BIT | NMI_NORMAL_PRIOR)
-#define NMI_LOCAL_LOW_PRIOR (NMI_LOCAL_BIT | NMI_LOW_PRIOR)
-
#define NMI_FLAG_FIRST 1
enum {
diff --git a/arch/x86/include/asm/reboot.h b/arch/x86/include/asm/reboot.h
index 3250e3d..92f29706 100644
--- a/arch/x86/include/asm/reboot.h
+++ b/arch/x86/include/asm/reboot.h
@@ -23,7 +23,7 @@ void machine_real_restart(unsigned int type);
#define MRR_BIOS 0
#define MRR_APM 1
-typedef void (*nmi_shootdown_cb)(int, struct die_args*);
+typedef void (*nmi_shootdown_cb)(int, struct pt_regs*);
void nmi_shootdown_cpus(nmi_shootdown_cb callback);
#endif /* _ASM_X86_REBOOT_H */
diff --git a/arch/x86/kernel/apic/hw_nmi.c b/arch/x86/kernel/apic/hw_nmi.c
index d5e57db0..31cb9ae 100644
--- a/arch/x86/kernel/apic/hw_nmi.c
+++ b/arch/x86/kernel/apic/hw_nmi.c
@@ -60,22 +60,10 @@ void arch_trigger_all_cpu_backtrace(void)
}
static int __kprobes
-arch_trigger_all_cpu_backtrace_handler(struct notifier_block *self,
- unsigned long cmd, void *__args)
+arch_trigger_all_cpu_backtrace_handler(unsigned int cmd, struct pt_regs *regs)
{
- struct die_args *args = __args;
- struct pt_regs *regs;
int cpu;
- switch (cmd) {
- case DIE_NMI:
- break;
-
- default:
- return NOTIFY_DONE;
- }
-
- regs = args->regs;
cpu = smp_processor_id();
if (cpumask_test_cpu(cpu, to_cpumask(backtrace_mask))) {
@@ -86,21 +74,16 @@ arch_trigger_all_cpu_backtrace_handler(struct notifier_block *self,
show_regs(regs);
arch_spin_unlock(&lock);
cpumask_clear_cpu(cpu, to_cpumask(backtrace_mask));
- return NOTIFY_STOP;
+ return NMI_HANDLED;
}
- return NOTIFY_DONE;
+ return NMI_DONE;
}
-static __read_mostly struct notifier_block backtrace_notifier = {
- .notifier_call = arch_trigger_all_cpu_backtrace_handler,
- .next = NULL,
- .priority = NMI_LOCAL_LOW_PRIOR,
-};
-
static int __init register_trigger_all_cpu_backtrace(void)
{
- register_die_notifier(&backtrace_notifier);
+ register_nmi_handler(NMI_LOCAL, arch_trigger_all_cpu_backtrace_handler,
+ 0, "arch_bt");
return 0;
}
early_initcall(register_trigger_all_cpu_backtrace);
diff --git a/arch/x86/kernel/apic/x2apic_uv_x.c b/arch/x86/kernel/apic/x2apic_uv_x.c
index 34b1859..75be00e 100644
--- a/arch/x86/kernel/apic/x2apic_uv_x.c
+++ b/arch/x86/kernel/apic/x2apic_uv_x.c
@@ -672,18 +672,11 @@ void __cpuinit uv_cpu_init(void)
/*
* When NMI is received, print a stack trace.
*/
-int uv_handle_nmi(struct notifier_block *self, unsigned long reason, void *data)
+int uv_handle_nmi(unsigned int reason, struct pt_regs *regs)
{
unsigned long real_uv_nmi;
int bid;
- if (reason != DIE_NMIUNKNOWN)
- return NOTIFY_OK;
-
- if (in_crash_kexec)
- /* do nothing if entering the crash kernel */
- return NOTIFY_OK;
-
/*
* Each blade has an MMR that indicates when an NMI has been sent
* to cpus on the blade. If an NMI is detected, atomically
@@ -704,7 +697,7 @@ int uv_handle_nmi(struct notifier_block *self, unsigned long reason, void *data)
}
if (likely(__get_cpu_var(cpu_last_nmi_count) == uv_blade_info[bid].nmi_count))
- return NOTIFY_DONE;
+ return NMI_DONE;
__get_cpu_var(cpu_last_nmi_count) = uv_blade_info[bid].nmi_count;
@@ -717,17 +710,12 @@ int uv_handle_nmi(struct notifier_block *self, unsigned long reason, void *data)
dump_stack();
spin_unlock(&uv_nmi_lock);
- return NOTIFY_STOP;
+ return NMI_HANDLED;
}
-static struct notifier_block uv_dump_stack_nmi_nb = {
- .notifier_call = uv_handle_nmi,
- .priority = NMI_LOCAL_LOW_PRIOR - 1,
-};
-
void uv_register_nmi_notifier(void)
{
- if (register_die_notifier(&uv_dump_stack_nmi_nb))
+ if (register_nmi_handler(NMI_UNKNOWN, uv_handle_nmi, 0, "uv"))
printk(KERN_WARNING "UV NMI handler failed to register\n");
}
diff --git a/arch/x86/kernel/cpu/mcheck/mce-inject.c b/arch/x86/kernel/cpu/mcheck/mce-inject.c
index 0ed633c..6199232 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-inject.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-inject.c
@@ -78,27 +78,20 @@ static void raise_exception(struct mce *m, struct pt_regs *pregs)
static cpumask_var_t mce_inject_cpumask;
-static int mce_raise_notify(struct notifier_block *self,
- unsigned long val, void *data)
+static int mce_raise_notify(unsigned int cmd, struct pt_regs *regs)
{
- struct die_args *args = (struct die_args *)data;
int cpu = smp_processor_id();
struct mce *m = &__get_cpu_var(injectm);
- if (val != DIE_NMI || !cpumask_test_cpu(cpu, mce_inject_cpumask))
- return NOTIFY_DONE;
+ if (!cpumask_test_cpu(cpu, mce_inject_cpumask))
+ return NMI_DONE;
cpumask_clear_cpu(cpu, mce_inject_cpumask);
if (m->inject_flags & MCJ_EXCEPTION)
- raise_exception(m, args->regs);
+ raise_exception(m, regs);
else if (m->status)
raise_poll(m);
- return NOTIFY_STOP;
+ return NMI_HANDLED;
}
-static struct notifier_block mce_raise_nb = {
- .notifier_call = mce_raise_notify,
- .priority = NMI_LOCAL_NORMAL_PRIOR,
-};
-
/* Inject mce on current CPU */
static int raise_local(void)
{
@@ -216,7 +209,8 @@ static int inject_init(void)
return -ENOMEM;
printk(KERN_INFO "Machine check injector initialized\n");
mce_chrdev_ops.write = mce_write;
- register_die_notifier(&mce_raise_nb);
+ register_nmi_handler(NMI_LOCAL, mce_raise_notify, 0,
+ "mce_notify");
return 0;
}
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 08363b0..3fc65b6 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -908,9 +908,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
percpu_inc(mce_exception_count);
- if (notify_die(DIE_NMI, "machine check", regs, error_code,
- 18, SIGKILL) == NOTIFY_STOP)
- goto out;
if (!banks)
goto out;
diff --git a/arch/x86/kernel/cpu/perf_event.c b/arch/x86/kernel/cpu/perf_event.c
index 4ee3abf..767371f 100644
--- a/arch/x86/kernel/cpu/perf_event.c
+++ b/arch/x86/kernel/cpu/perf_event.c
@@ -1375,68 +1375,18 @@ struct pmu_nmi_state {
static DEFINE_PER_CPU(struct pmu_nmi_state, pmu_nmi);
static int __kprobes
-perf_event_nmi_handler(struct notifier_block *self,
- unsigned long cmd, void *__args)
+perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
{
- struct die_args *args = __args;
- unsigned int this_nmi;
int handled;
if (!atomic_read(&active_events))
- return NOTIFY_DONE;
+ return NMI_DONE;
- switch (cmd) {
- case DIE_NMI:
- break;
- case DIE_NMIUNKNOWN:
- this_nmi = percpu_read(irq_stat.__nmi_count);
- if (this_nmi != __this_cpu_read(pmu_nmi.marked))
- /* let the kernel handle the unknown nmi */
- return NOTIFY_DONE;
- /*
- * This one is a PMU back-to-back nmi. Two events
- * trigger 'simultaneously' raising two back-to-back
- * NMIs. If the first NMI handles both, the latter
- * will be empty and daze the CPU. So, we drop it to
- * avoid false-positive 'unknown nmi' messages.
- */
- return NOTIFY_STOP;
- default:
- return NOTIFY_DONE;
- }
-
- handled = x86_pmu.handle_irq(args->regs);
- if (!handled)
- return NOTIFY_DONE;
-
- this_nmi = percpu_read(irq_stat.__nmi_count);
- if ((handled > 1) ||
- /* the next nmi could be a back-to-back nmi */
- ((__this_cpu_read(pmu_nmi.marked) == this_nmi) &&
- (__this_cpu_read(pmu_nmi.handled) > 1))) {
- /*
- * We could have two subsequent back-to-back nmis: The
- * first handles more than one counter, the 2nd
- * handles only one counter and the 3rd handles no
- * counter.
- *
- * This is the 2nd nmi because the previous was
- * handling more than one counter. We will mark the
- * next (3rd) and then drop it if unhandled.
- */
- __this_cpu_write(pmu_nmi.marked, this_nmi + 1);
- __this_cpu_write(pmu_nmi.handled, handled);
- }
+ handled = x86_pmu.handle_irq(regs);
- return NOTIFY_STOP;
+ return handled;
}
-static __read_mostly struct notifier_block perf_event_nmi_notifier = {
- .notifier_call = perf_event_nmi_handler,
- .next = NULL,
- .priority = NMI_LOCAL_LOW_PRIOR,
-};
-
static struct event_constraint unconstrained;
static struct event_constraint emptyconstraint;
@@ -1557,7 +1507,7 @@ static int __init init_hw_perf_events(void)
((1LL << x86_pmu.num_counters_fixed)-1) << X86_PMC_IDX_FIXED;
perf_events_lapic_init();
- register_die_notifier(&perf_event_nmi_notifier);
+ register_nmi_handler(NMI_LOCAL, perf_event_nmi_handler, 0, "PMI");
unconstrained = (struct event_constraint)
__EVENT_CONSTRAINT(0, (1ULL << x86_pmu.num_counters) - 1,
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 764c7c2..13ad899 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -32,15 +32,12 @@ int in_crash_kexec;
#if defined(CONFIG_SMP) && defined(CONFIG_X86_LOCAL_APIC)
-static void kdump_nmi_callback(int cpu, struct die_args *args)
+static void kdump_nmi_callback(int cpu, struct pt_regs *regs)
{
- struct pt_regs *regs;
#ifdef CONFIG_X86_32
struct pt_regs fixed_regs;
#endif
- regs = args->regs;
-
#ifdef CONFIG_X86_32
if (!user_mode_vm(regs)) {
crash_fixup_ss_esp(&fixed_regs, regs);
diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index 00354d4..faba577 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -511,28 +511,37 @@ single_step_cont(struct pt_regs *regs, struct die_args *args)
static int was_in_debug_nmi[NR_CPUS];
-static int __kgdb_notify(struct die_args *args, unsigned long cmd)
+static int kgdb_nmi_handler(unsigned int cmd, struct pt_regs *regs)
{
- struct pt_regs *regs = args->regs;
-
switch (cmd) {
- case DIE_NMI:
+ case NMI_LOCAL:
if (atomic_read(&kgdb_active) != -1) {
/* KGDB CPU roundup */
kgdb_nmicallback(raw_smp_processor_id(), regs);
was_in_debug_nmi[raw_smp_processor_id()] = 1;
touch_nmi_watchdog();
- return NOTIFY_STOP;
+ return NMI_HANDLED;
}
- return NOTIFY_DONE;
+ break;
- case DIE_NMIUNKNOWN:
+ case NMI_UNKNOWN:
if (was_in_debug_nmi[raw_smp_processor_id()]) {
was_in_debug_nmi[raw_smp_processor_id()] = 0;
- return NOTIFY_STOP;
+ return NMI_HANDLED;
}
- return NOTIFY_DONE;
+ break;
+ default:
+ /* do nothing */
+ break;
+ }
+ return NMI_DONE;
+}
+
+static int __kgdb_notify(struct die_args *args, unsigned long cmd)
+{
+ struct pt_regs *regs = args->regs;
+ switch (cmd) {
case DIE_DEBUG:
if (atomic_read(&kgdb_cpu_doing_single_step) != -1) {
if (user_mode(regs))
@@ -590,11 +599,6 @@ kgdb_notify(struct notifier_block *self, unsigned long cmd, void *ptr)
static struct notifier_block kgdb_notifier = {
.notifier_call = kgdb_notify,
-
- /*
- * Lowest-prio notifier priority, we want to be notified last:
- */
- .priority = NMI_LOCAL_LOW_PRIOR,
};
/**
@@ -605,7 +609,31 @@ static struct notifier_block kgdb_notifier = {
*/
int kgdb_arch_init(void)
{
- return register_die_notifier(&kgdb_notifier);
+ int retval;
+
+ retval = register_die_notifier(&kgdb_notifier);
+ if (retval)
+ goto out;
+
+ retval = register_nmi_handler(NMI_LOCAL, kgdb_nmi_handler,
+ 0, "kgdb");
+ if (retval)
+ goto out1;
+
+ retval = register_nmi_handler(NMI_UNKNOWN, kgdb_nmi_handler,
+ 0, "kgdb");
+
+ if (retval)
+ goto out2;
+
+ return retval;
+
+out2:
+ unregister_nmi_handler(NMI_LOCAL, "kgdb");
+out1:
+ unregister_die_notifier(&kgdb_notifier);
+out:
+ return retval;
}
static void kgdb_hw_overflow_handler(struct perf_event *event,
@@ -673,6 +701,8 @@ void kgdb_arch_exit(void)
breakinfo[i].pev = NULL;
}
}
+ unregister_nmi_handler(NMI_UNKNOWN, "kgdb");
+ unregister_nmi_handler(NMI_LOCAL, "kgdb");
unregister_die_notifier(&kgdb_notifier);
}
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index c2df58a..acd61e8 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -1,6 +1,7 @@
/*
* Copyright (C) 1991, 1992 Linus Torvalds
* Copyright (C) 2000, 2001, 2002 Andi Kleen, SuSE Labs
+ * Copyright (C) 2011 Don Zickus Red Hat, Inc.
*
* Pentium III FXSR, SSE support
* Gareth Hughes <gareth@valinux.com>, May 2000
@@ -252,8 +253,10 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
static notrace __kprobes void
unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
{
- if (notify_die(DIE_NMIUNKNOWN, "nmi", regs, reason, 2, SIGINT) ==
- NOTIFY_STOP)
+ int handled;
+
+ handled = nmi_handle(NMI_UNKNOWN, regs);
+ if (handled)
return;
#ifdef CONFIG_MCA
/*
@@ -278,13 +281,15 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
{
unsigned char reason = 0;
+ int handled;
/*
* CPU-specific NMI must be processed before non-CPU-specific
* NMI, otherwise we may lose it, because the CPU-specific
* NMI can not be detected/processed on other CPUs.
*/
- if (notify_die(DIE_NMI, "nmi", regs, 0, 2, SIGINT) == NOTIFY_STOP)
+ handled = nmi_handle(NMI_LOCAL, regs);
+ if (handled)
return;
/* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
diff --git a/arch/x86/kernel/reboot.c b/arch/x86/kernel/reboot.c
index 9242436..adab340 100644
--- a/arch/x86/kernel/reboot.c
+++ b/arch/x86/kernel/reboot.c
@@ -464,7 +464,7 @@ static inline void kb_wait(void)
}
}
-static void vmxoff_nmi(int cpu, struct die_args *args)
+static void vmxoff_nmi(int cpu, struct pt_regs *regs)
{
cpu_emergency_vmxoff();
}
@@ -736,14 +736,10 @@ static nmi_shootdown_cb shootdown_callback;
static atomic_t waiting_for_crash_ipi;
-static int crash_nmi_callback(struct notifier_block *self,
- unsigned long val, void *data)
+static int crash_nmi_callback(unsigned int val, struct pt_regs *regs)
{
int cpu;
- if (val != DIE_NMI)
- return NOTIFY_OK;
-
cpu = raw_smp_processor_id();
/* Don't do anything if this handler is invoked on crashing cpu.
@@ -751,10 +747,10 @@ static int crash_nmi_callback(struct notifier_block *self,
* an NMI if system was initially booted with nmi_watchdog parameter.
*/
if (cpu == crashing_cpu)
- return NOTIFY_STOP;
+ return NMI_HANDLED;
local_irq_disable();
- shootdown_callback(cpu, (struct die_args *)data);
+ shootdown_callback(cpu, regs);
atomic_dec(&waiting_for_crash_ipi);
/* Assume hlt works */
@@ -762,7 +758,7 @@ static int crash_nmi_callback(struct notifier_block *self,
for (;;)
cpu_relax();
- return 1;
+ return NMI_HANDLED;
}
static void smp_send_nmi_allbutself(void)
@@ -770,12 +766,6 @@ static void smp_send_nmi_allbutself(void)
apic->send_IPI_allbutself(NMI_VECTOR);
}
-static struct notifier_block crash_nmi_nb = {
- .notifier_call = crash_nmi_callback,
- /* we want to be the first one called */
- .priority = NMI_LOCAL_HIGH_PRIOR+1,
-};
-
/* Halt all other CPUs, calling the specified function on each of them
*
* This function can be used to halt all other CPUs on crash
@@ -794,7 +784,8 @@ void nmi_shootdown_cpus(nmi_shootdown_cb callback)
atomic_set(&waiting_for_crash_ipi, num_online_cpus() - 1);
/* Would it be better to replace the trap vector here? */
- if (register_die_notifier(&crash_nmi_nb))
+ if (register_nmi_handler(NMI_LOCAL, crash_nmi_callback,
+ NMI_FLAG_FIRST, "crash"))
return; /* return what? */
/* Ensure the new callback function is set before sending
* out the NMI
diff --git a/arch/x86/oprofile/nmi_int.c b/arch/x86/oprofile/nmi_int.c
index 68894fd..adf8fb3 100644
--- a/arch/x86/oprofile/nmi_int.c
+++ b/arch/x86/oprofile/nmi_int.c
@@ -61,26 +61,15 @@ u64 op_x86_get_ctrl(struct op_x86_model_spec const *model,
}
-static int profile_exceptions_notify(struct notifier_block *self,
- unsigned long val, void *data)
+static int profile_exceptions_notify(unsigned int val, struct pt_regs *regs)
{
- struct die_args *args = (struct die_args *)data;
- int ret = NOTIFY_DONE;
-
- switch (val) {
- case DIE_NMI:
- if (ctr_running)
- model->check_ctrs(args->regs, &__get_cpu_var(cpu_msrs));
- else if (!nmi_enabled)
- break;
- else
- model->stop(&__get_cpu_var(cpu_msrs));
- ret = NOTIFY_STOP;
- break;
- default:
- break;
- }
- return ret;
+ if (ctr_running)
+ model->check_ctrs(regs, &__get_cpu_var(cpu_msrs));
+ else if (!nmi_enabled)
+ return NMI_DONE;
+ else
+ model->stop(&__get_cpu_var(cpu_msrs));
+ return NMI_HANDLED;
}
static void nmi_cpu_save_registers(struct op_msrs *msrs)
@@ -363,12 +352,6 @@ static void nmi_cpu_setup(void *dummy)
apic_write(APIC_LVTPC, APIC_DM_NMI);
}
-static struct notifier_block profile_exceptions_nb = {
- .notifier_call = profile_exceptions_notify,
- .next = NULL,
- .priority = NMI_LOCAL_LOW_PRIOR,
-};
-
static void nmi_cpu_restore_registers(struct op_msrs *msrs)
{
struct op_msr *counters = msrs->counters;
@@ -508,7 +491,8 @@ static int nmi_setup(void)
ctr_running = 0;
/* make variables visible to the nmi handler: */
smp_mb();
- err = register_die_notifier(&profile_exceptions_nb);
+ err = register_nmi_handler(NMI_LOCAL, profile_exceptions_notify,
+ 0, "oprofile");
if (err)
goto fail;
@@ -538,7 +522,7 @@ static void nmi_shutdown(void)
put_online_cpus();
/* make variables visible to the nmi handler: */
smp_mb();
- unregister_die_notifier(&profile_exceptions_nb);
+ unregister_nmi_handler(NMI_LOCAL, "oprofile");
msrs = &get_cpu_var(cpu_msrs);
model->shutdown(msrs);
free_msrs();
diff --git a/arch/x86/oprofile/nmi_timer_int.c b/arch/x86/oprofile/nmi_timer_int.c
index 720bf5a..7f8052c 100644
--- a/arch/x86/oprofile/nmi_timer_int.c
+++ b/arch/x86/oprofile/nmi_timer_int.c
@@ -18,32 +18,16 @@
#include <asm/apic.h>
#include <asm/ptrace.h>
-static int profile_timer_exceptions_notify(struct notifier_block *self,
- unsigned long val, void *data)
+static int profile_timer_exceptions_notify(unsigned int val, struct pt_regs *regs)
{
- struct die_args *args = (struct die_args *)data;
- int ret = NOTIFY_DONE;
-
- switch (val) {
- case DIE_NMI:
- oprofile_add_sample(args->regs, 0);
- ret = NOTIFY_STOP;
- break;
- default:
- break;
- }
- return ret;
+ oprofile_add_sample(regs, 0);
+ return NMI_HANDLED;
}
-static struct notifier_block profile_timer_exceptions_nb = {
- .notifier_call = profile_timer_exceptions_notify,
- .next = NULL,
- .priority = NMI_LOW_PRIOR,
-};
-
static int timer_start(void)
{
- if (register_die_notifier(&profile_timer_exceptions_nb))
+ if (register_nmi_handler(NMI_LOCAL, profile_timer_exceptions_notify,
+ 0, "oprofile-timer"))
return 1;
return 0;
}
@@ -51,7 +35,7 @@ static int timer_start(void)
static void timer_stop(void)
{
- unregister_die_notifier(&profile_timer_exceptions_nb);
+ unregister_nmi_handler(NMI_LOCAL, "oprofile-timer");
synchronize_sched(); /* Allow already-started NMIs to complete. */
}
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 0784f99..b8e08cb 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -50,6 +50,7 @@
#include <acpi/hed.h>
#include <asm/mce.h>
#include <asm/tlbflush.h>
+#include <asm/nmi.h>
#include "apei-internal.h"
@@ -749,15 +750,11 @@ static void ghes_proc_in_irq(struct irq_work *irq_work)
}
}
-static int ghes_notify_nmi(struct notifier_block *this,
- unsigned long cmd, void *data)
+static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
{
struct ghes *ghes, *ghes_global = NULL;
int sev, sev_global = -1;
- int ret = NOTIFY_DONE;
-
- if (cmd != DIE_NMI)
- return ret;
+ int ret = NMI_DONE;
raw_spin_lock(&ghes_nmi_lock);
list_for_each_entry_rcu(ghes, &ghes_nmi, list) {
@@ -770,10 +767,10 @@ static int ghes_notify_nmi(struct notifier_block *this,
sev_global = sev;
ghes_global = ghes;
}
- ret = NOTIFY_STOP;
+ ret = NMI_HANDLED;
}
- if (ret == NOTIFY_DONE)
+ if (ret == NMI_DONE)
goto out;
if (sev_global >= GHES_SEV_PANIC) {
@@ -825,10 +822,6 @@ static struct notifier_block ghes_notifier_sci = {
.notifier_call = ghes_notify_sci,
};
-static struct notifier_block ghes_notifier_nmi = {
- .notifier_call = ghes_notify_nmi,
-};
-
static unsigned long ghes_esource_prealloc_size(
const struct acpi_hest_generic *generic)
{
@@ -918,7 +911,8 @@ static int __devinit ghes_probe(struct platform_device *ghes_dev)
ghes_estatus_pool_expand(len);
mutex_lock(&ghes_list_mutex);
if (list_empty(&ghes_nmi))
- register_die_notifier(&ghes_notifier_nmi);
+ register_nmi_handler(NMI_LOCAL, ghes_notify_nmi, 0,
+ "ghes");
list_add_rcu(&ghes->list, &ghes_nmi);
mutex_unlock(&ghes_list_mutex);
break;
@@ -964,7 +958,7 @@ static int __devexit ghes_remove(struct platform_device *ghes_dev)
mutex_lock(&ghes_list_mutex);
list_del_rcu(&ghes->list);
if (list_empty(&ghes_nmi))
- unregister_die_notifier(&ghes_notifier_nmi);
+ unregister_nmi_handler(NMI_LOCAL, "ghes");
mutex_unlock(&ghes_list_mutex);
/*
* To synchronize with NMI handler, ghes can only be
diff --git a/drivers/char/ipmi/ipmi_watchdog.c b/drivers/char/ipmi/ipmi_watchdog.c
index 3302586..3dcab56 100644
--- a/drivers/char/ipmi/ipmi_watchdog.c
+++ b/drivers/char/ipmi/ipmi_watchdog.c
@@ -1077,17 +1077,8 @@ static void ipmi_unregister_watchdog(int ipmi_intf)
#ifdef HAVE_DIE_NMI
static int
-ipmi_nmi(struct notifier_block *self, unsigned long val, void *data)
+ipmi_nmi(unsigned int val, struct pt_regs *regs)
{
- struct die_args *args = data;
-
- if (val != DIE_NMIUNKNOWN)
- return NOTIFY_OK;
-
- /* Hack, if it's a memory or I/O error, ignore it. */
- if (args->err & 0xc0)
- return NOTIFY_OK;
-
/*
* If we get here, it's an NMI that's not a memory or I/O
* error. We can't truly tell if it's from IPMI or not
@@ -1097,15 +1088,15 @@ ipmi_nmi(struct notifier_block *self, unsigned long val, void *data)
if (testing_nmi) {
testing_nmi = 2;
- return NOTIFY_STOP;
+ return NMI_HANDLED;
}
/* If we are not expecting a timeout, ignore it. */
if (ipmi_watchdog_state == WDOG_TIMEOUT_NONE)
- return NOTIFY_OK;
+ return NMI_DONE;
if (preaction_val != WDOG_PRETIMEOUT_NMI)
- return NOTIFY_OK;
+ return NMI_DONE;
/*
* If no one else handled the NMI, we assume it was the IPMI
@@ -1120,12 +1111,8 @@ ipmi_nmi(struct notifier_block *self, unsigned long val, void *data)
panic(PFX "pre-timeout");
}
- return NOTIFY_STOP;
+ return NMI_HANDLED;
}
-
-static struct notifier_block ipmi_nmi_handler = {
- .notifier_call = ipmi_nmi
-};
#endif
static int wdog_reboot_handler(struct notifier_block *this,
@@ -1290,7 +1277,8 @@ static void check_parms(void)
}
}
if (do_nmi && !nmi_handler_registered) {
- rv = register_die_notifier(&ipmi_nmi_handler);
+ rv = register_nmi_handler(NMI_UNKNOWN, ipmi_nmi, 0,
+ "ipmi");
if (rv) {
printk(KERN_WARNING PFX
"Can't register nmi handler\n");
@@ -1298,7 +1286,7 @@ static void check_parms(void)
} else
nmi_handler_registered = 1;
} else if (!do_nmi && nmi_handler_registered) {
- unregister_die_notifier(&ipmi_nmi_handler);
+ unregister_nmi_handler(NMI_UNKNOWN, "ipmi");
nmi_handler_registered = 0;
}
#endif
@@ -1336,7 +1324,7 @@ static int __init ipmi_wdog_init(void)
if (rv) {
#ifdef HAVE_DIE_NMI
if (nmi_handler_registered)
- unregister_die_notifier(&ipmi_nmi_handler);
+ unregister_nmi_handler(NMI_UNKNOWN, "ipmi");
#endif
atomic_notifier_chain_unregister(&panic_notifier_list,
&wdog_panic_notifier);
@@ -1357,7 +1345,7 @@ static void __exit ipmi_wdog_exit(void)
#ifdef HAVE_DIE_NMI
if (nmi_handler_registered)
- unregister_die_notifier(&ipmi_nmi_handler);
+ unregister_nmi_handler(NMI_UNKNOWN, "ipmi");
#endif
atomic_notifier_chain_unregister(&panic_notifier_list,
diff --git a/drivers/watchdog/hpwdt.c b/drivers/watchdog/hpwdt.c
index 410fba4..e0f6202 100644
--- a/drivers/watchdog/hpwdt.c
+++ b/drivers/watchdog/hpwdt.c
@@ -477,15 +477,12 @@ static int hpwdt_time_left(void)
/*
* NMI Handler
*/
-static int hpwdt_pretimeout(struct notifier_block *nb, unsigned long ulReason,
+static int hpwdt_pretimeout(unsigned int ulReason, struct pt_regs *regs,
void *data)
{
unsigned long rom_pl;
static int die_nmi_called;
- if (ulReason != DIE_NMIUNKNOWN)
- goto out;
-
if (!hpwdt_nmi_decoding)
goto out;
@@ -507,7 +504,7 @@ static int hpwdt_pretimeout(struct notifier_block *nb, unsigned long ulReason,
"Management Log for details.\n");
out:
- return NOTIFY_OK;
+ return NMI_DONE;
}
#endif /* CONFIG_HPWDT_NMI_DECODING */
@@ -647,13 +644,6 @@ static struct miscdevice hpwdt_miscdev = {
.fops = &hpwdt_fops,
};
-#ifdef CONFIG_HPWDT_NMI_DECODING
-static struct notifier_block die_notifier = {
- .notifier_call = hpwdt_pretimeout,
- .priority = 0,
-};
-#endif /* CONFIG_HPWDT_NMI_DECODING */
-
/*
* Init & Exit
*/
@@ -739,10 +729,9 @@ static int __devinit hpwdt_init_nmi_decoding(struct pci_dev *dev)
* die notify list to handle a critical NMI. The default is to
* be last so other users of the NMI signal can function.
*/
- if (priority)
- die_notifier.priority = 0x7FFFFFFF;
-
- retval = register_die_notifier(&die_notifier);
+ retval = register_nmi_handler(NMI_UNKNOWN, hpwdt_pretimeout,
+ (priority) ? NMI_HANDLE_FIRST : 0,
+ "hpwdt");
if (retval != 0) {
dev_warn(&dev->dev,
"Unable to register a die notifier (err=%d).\n",
@@ -762,7 +751,7 @@ static int __devinit hpwdt_init_nmi_decoding(struct pci_dev *dev)
static void hpwdt_exit_nmi_decoding(void)
{
- unregister_die_notifier(&die_notifier);
+ unregister_nmi_handler(NMI_UNKNOWN, "hpwdt");
if (cru_rom_addr)
iounmap(cru_rom_addr);
}
--
1.7.6
* [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs
2011-09-20 14:43 [V5][PATCH 0/6] x86, nmi: new NMI handling routines Don Zickus
` (2 preceding siblings ...)
2011-09-20 14:43 ` [V5][PATCH 3/6] x86, nmi: wire up NMI handlers to new routines Don Zickus
@ 2011-09-20 14:43 ` Don Zickus
2011-09-20 17:23 ` Avi Kivity
` (2 more replies)
2011-09-20 14:43 ` [V5][PATCH 5/6] x86, nmi: track NMI usage stats Don Zickus
2011-09-20 14:43 ` [V5][PATCH 6/6] x86, nmi: print out NMI stats in /proc/interrupts Don Zickus
5 siblings, 3 replies; 28+ messages in thread
From: Don Zickus @ 2011-09-20 14:43 UTC (permalink / raw)
To: x86, Andi Kleen, Robert Richter, Peter Zijlstra, ying.huang
Cc: LKML, paulmck, avi, jeremy, Don Zickus
Previous patches allow the NMI subsystem to process multiple NMI events
in one NMI. As previously discussed, this can cause issues when an event
triggers another NMI but is processed in the current NMI. This causes the
next NMI to go unprocessed and become an 'unknown' NMI.
To handle this, we first flag whether or not the NMI handler handled
more than one event. If it did, then there exists a chance that
the next NMI might already be processed. Once the NMI is flagged as a
candidate to be swallowed, we next look for a back-to-back NMI condition.
This is determined by looking at the %rip from pt_regs. If it is the same
as the previous NMI, it is assumed the cpu did not have a chance to jump
back into a non-NMI context and execute code and instead handled another NMI.
If both of those conditions are true then we will swallow any unknown NMI.
There still exists a chance that we accidentally swallow a real unknown NMI,
but for now things seem better.
An optimization has also been added to the nmi handler routine. Because x86
can latch at most one NMI while currently processing an NMI, we don't have to
worry about executing _all_ the handlers in a standalone NMI. The idea is that
if multiple NMIs come in, the second NMI will represent them. For those
back-to-back NMI cases, we have the potential to drop NMIs. Therefore only
execute all the handlers in the second half of a detected back-to-back NMI.
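To make the sequence concrete, here is a hypothetical perf-driven timeline
of the case the swallow logic targets (illustrative only, not taken from a
real trace):

  1. Counter A overflows and NMI #1 fires.
  2. While NMI #1 runs, counter B overflows, so NMI #2 gets latched.
  3. The perf handler in NMI #1 finds both A and B and returns handled == 2,
     so the swallow flag is set and the interrupted %rip is recorded.
  4. NMI #2 fires immediately on return, with the same %rip (no non-NMI
     instruction ran in between), so it is treated as back-to-back.
  5. No handler claims NMI #2, but because both the back-to-back and swallow
     conditions hold, it is dropped silently instead of producing an
     'unknown NMI' printk.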
V2:
- forgot to add the 'read' code for swallow_nmi (went into next patch)
V3:
- redesigned the algorithm to utilize Avi's idea of detecting a back-to-back
NMI with %rip.
V4:
- clean up fixes, like adding 'static', rename save_rip to last_nmi_rip
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
arch/x86/kernel/nmi.c | 80 +++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 75 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index acd61e8..04c9339 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -66,7 +66,7 @@ __setup("unknown_nmi_panic", setup_unknown_nmi_panic);
#define nmi_to_desc(type) (&nmi_desc[type])
-static int notrace __kprobes nmi_handle(unsigned int type, struct pt_regs *regs)
+static int notrace __kprobes nmi_handle(unsigned int type, struct pt_regs *regs, bool b2b)
{
struct nmi_desc *desc = nmi_to_desc(type);
struct nmiaction *next_a, *a, **ap = &desc->head;
@@ -87,6 +87,16 @@ static int notrace __kprobes nmi_handle(unsigned int type, struct pt_regs *regs)
handled += a->handler(type, regs);
+ /*
+ * Optimization: only loop once if this is not a
+ * back-to-back NMI. The idea is nothing is dropped
+ * on the first NMI, only on the second of a back-to-back
+ * NMI. No need to waste cycles going through all the
+ * handlers.
+ */
+ if (!b2b && handled)
+ break;
+
a = next_a;
}
rcu_read_unlock();
@@ -255,7 +265,13 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
{
int handled;
- handled = nmi_handle(NMI_UNKNOWN, regs);
+ /*
+ * Use 'false' as back-to-back NMIs are dealt with one level up.
+ * Of course this makes having multiple 'unknown' handlers useless
+ * as only the first one is ever run (unless it can actually determine
+ * if it caused the NMI)
+ */
+ handled = nmi_handle(NMI_UNKNOWN, regs, false);
if (handled)
return;
#ifdef CONFIG_MCA
@@ -278,19 +294,49 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
pr_emerg("Dazed and confused, but trying to continue\n");
}
+static DEFINE_PER_CPU(bool, swallow_nmi);
+static DEFINE_PER_CPU(unsigned long, last_nmi_rip);
+
static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
{
unsigned char reason = 0;
int handled;
+ bool b2b = false;
/*
* CPU-specific NMI must be processed before non-CPU-specific
* NMI, otherwise we may lose it, because the CPU-specific
* NMI can not be detected/processed on other CPUs.
*/
- handled = nmi_handle(NMI_LOCAL, regs);
- if (handled)
+
+ /*
+ * Back-to-back NMIs are interesting because they can either
+ * be two NMI or more than two NMIs (any thing over two is dropped
+ * due to NMI being edge-triggered). If this is the second half
+ * of the back-to-back NMI, assume we dropped things and process
+ * more handlers. Otherwise reset the 'swallow' NMI behaviour
+ */
+ if (regs->ip == __this_cpu_read(last_nmi_rip))
+ b2b = true;
+ else
+ __this_cpu_write(swallow_nmi, false);
+
+ __this_cpu_write(last_nmi_rip, regs->ip);
+
+ handled = nmi_handle(NMI_LOCAL, regs, b2b);
+ if (handled) {
+ /*
+ * There are cases when a NMI handler handles multiple
+ * events in the current NMI. One of these events may
+ * be queued for in the next NMI. Because the event is
+ * already handled, the next NMI will result in an unknown
+ * NMI. Instead lets flag this for a potential NMI to
+ * swallow.
+ */
+ if (handled > 1)
+ __this_cpu_write(swallow_nmi, true);
return;
+ }
/* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
raw_spin_lock(&nmi_reason_lock);
@@ -313,7 +359,31 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
}
raw_spin_unlock(&nmi_reason_lock);
- unknown_nmi_error(reason, regs);
+ /*
+ * Only one NMI can be latched at a time. To handle
+ * this we may process multiple nmi handlers at once to
+ * cover the case where an NMI is dropped. The downside
+ * to this approach is we may process an NMI prematurely,
+ * while its real NMI is sitting latched. This will cause
+ * an unknown NMI on the next run of the NMI processing.
+ *
+ * We tried to flag that condition above, by setting the
+ * swallow_nmi flag when we process more than one event.
+ * This condition is also only present on the second half
+ * of a back-to-back NMI, so we flag that condition too.
+ *
+ * If both are true, we assume we already processed this
+ * NMI previously and we swallow it. Otherwise we reset
+ * the logic.
+ *
+ * I am sure there are scenarios where we accidentally
+ * swallow a real 'unknown' NMI. But this is the best
+ * we can do for now.
+ */
+ if (b2b && __this_cpu_read(swallow_nmi))
+ ;
+ else
+ unknown_nmi_error(reason, regs);
}
dotraplinkage notrace __kprobes void
--
1.7.6
* [V5][PATCH 5/6] x86, nmi: track NMI usage stats
2011-09-20 14:43 [V5][PATCH 0/6] x86, nmi: new NMI handling routines Don Zickus
` (3 preceding siblings ...)
2011-09-20 14:43 ` [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs Don Zickus
@ 2011-09-20 14:43 ` Don Zickus
2011-09-20 14:43 ` [V5][PATCH 6/6] x86, nmi: print out NMI stats in /proc/interrupts Don Zickus
5 siblings, 0 replies; 28+ messages in thread
From: Don Zickus @ 2011-09-20 14:43 UTC (permalink / raw)
To: x86, Andi Kleen, Robert Richter, Peter Zijlstra, ying.huang
Cc: LKML, paulmck, avi, jeremy, Don Zickus
Now that the NMI handlers are broken into lists, increment the appropriate
stats for each list. This allows us to see what is going on when they
get printed out in the next patch.
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
arch/x86/kernel/nmi.c | 20 ++++++++++++++++++--
1 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 04c9339..bb8242e 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -48,6 +48,15 @@ static struct nmi_desc nmi_desc[NMI_MAX] =
};
+struct nmi_stats {
+ unsigned int normal;
+ unsigned int unknown;
+ unsigned int external;
+ unsigned int swallow;
+};
+
+static DEFINE_PER_CPU(struct nmi_stats, nmi_stats);
+
static int ignore_nmis;
int unknown_nmi_panic;
@@ -272,8 +281,13 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
* if it caused the NMI)
*/
handled = nmi_handle(NMI_UNKNOWN, regs, false);
- if (handled)
+ if (handled) {
+ __this_cpu_add(nmi_stats.unknown, handled);
return;
+ }
+
+ __this_cpu_add(nmi_stats.unknown, 1);
+
#ifdef CONFIG_MCA
/*
* Might actually be able to figure out what the guilty party
@@ -324,6 +338,7 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
__this_cpu_write(last_nmi_rip, regs->ip);
handled = nmi_handle(NMI_LOCAL, regs, b2b);
+ __this_cpu_add(nmi_stats.normal, handled);
if (handled) {
/*
* There are cases when a NMI handler handles multiple
@@ -354,6 +369,7 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
*/
reassert_nmi();
#endif
+ __this_cpu_add(nmi_stats.external, 1);
raw_spin_unlock(&nmi_reason_lock);
return;
}
@@ -381,7 +397,7 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
* we can do for now.
*/
if (b2b && __this_cpu_read(swallow_nmi))
- ;
+ __this_cpu_add(nmi_stats.swallow, 1);
else
unknown_nmi_error(reason, regs);
}
--
1.7.6
^ permalink raw reply related [flat|nested] 28+ messages in thread
* [V5][PATCH 6/6] x86, nmi: print out NMI stats in /proc/interrupts
2011-09-20 14:43 [V5][PATCH 0/6] x86, nmi: new NMI handling routines Don Zickus
` (4 preceding siblings ...)
2011-09-20 14:43 ` [V5][PATCH 5/6] x86, nmi: track NMI usage stats Don Zickus
@ 2011-09-20 14:43 ` Don Zickus
5 siblings, 0 replies; 28+ messages in thread
From: Don Zickus @ 2011-09-20 14:43 UTC (permalink / raw)
To: x86, Andi Kleen, Robert Richter, Peter Zijlstra, ying.huang
Cc: LKML, paulmck, avi, jeremy, Don Zickus
This is a cheap hack to add the stats to the middle of /proc/interrupts.
It is more of a conversation starter than anything, as I am not sure of
the right letters and place to put this stuff.
The benefit of these stats is a better breakdown of which list the NMIs
get handled in: a normal handler, unknown, or external. It also lists
the number of unknown NMIs swallowed, to help check for false positives.
Another benefit is the ability to actually see which NMI handlers are
currently registered in the system.
The output of 'cat /proc/interrupts' will look like this:
<snip>
58: 275 0 864 0 PCI-MSI-edge eth0
NMI: 4161 4155 158 4194 Non-maskable interrupts
SWA: 0 0 0 0 Unknown NMIs swallowed
0: 4161 4155 158 4194 NMI PMI, arch_bt
UNK: 0 0 0 0 NMI
EXT: 0 0 0 0 NMI
LOC: 12653 13304 13974 12926 Local timer interrupts
SPU: 0 0 0 0 Spurious interrupts
PMI: 6 6 5 6 Performance monitoring interrupts
IWI: 0 0 0 0 IRQ work interrupts
RES: 1839 1897 1821 1854 Rescheduling interrupts
CAL: 524 2714 392 331 Function call interrupts
TLB: 217 146 593 576 TLB shootdowns
TRM: 0 0 0 0 Thermal event interrupts
THR: 0 0 0 0 Threshold APIC interrupts
MCE: 0 0 0 0 Machine check exceptions
MCP: 1 1 1 1 Machine check polls
ERR: 0
MIS: 0
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
arch/x86/include/asm/nmi.h | 2 +
arch/x86/kernel/irq.c | 2 +
arch/x86/kernel/nmi.c | 47 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 51 insertions(+), 0 deletions(-)
diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index fc74547..a4f1945 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -24,6 +24,8 @@ void arch_trigger_all_cpu_backtrace(void);
#define NMI_FLAG_FIRST 1
+void arch_show_nmi(struct seq_file *p, int prec);
+
enum {
NMI_LOCAL=0,
NMI_UNKNOWN,
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 6c0802e..44d1cac 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -16,6 +16,7 @@
#include <asm/idle.h>
#include <asm/mce.h>
#include <asm/hw_irq.h>
+#include <asm/nmi.h>
atomic_t irq_err_count;
@@ -55,6 +56,7 @@ int arch_show_interrupts(struct seq_file *p, int prec)
for_each_online_cpu(j)
seq_printf(p, "%10u ", irq_stats(j)->__nmi_count);
seq_printf(p, " Non-maskable interrupts\n");
+ arch_show_nmi(p, prec);
#ifdef CONFIG_X86_LOCAL_APIC
seq_printf(p, "%*s: ", prec, "LOC");
for_each_online_cpu(j)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index bb8242e..5619ae9 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -424,3 +424,50 @@ void restart_nmi(void)
{
ignore_nmis--;
}
+
+void arch_show_nmi(struct seq_file *p, int prec)
+{
+ int j;
+ struct nmiaction *action;
+
+ seq_printf(p, "%*s: ", prec, "SWA");
+ for_each_online_cpu(j)
+ seq_printf(p, "%10u ", per_cpu(nmi_stats.swallow, j));
+ seq_printf(p, " Unknown NMIs swallowed\n");
+
+ seq_printf(p, "%*s: ", prec, " 0");
+ for_each_online_cpu(j)
+ seq_printf(p, "%10u ", per_cpu(nmi_stats.normal, j));
+ seq_printf(p, " NMI");
+ action = (nmi_to_desc(NMI_LOCAL))->head;
+ if (action) {
+ seq_printf(p, "\t%s", action->name);
+ while ((action = action->next) != NULL)
+ seq_printf(p, ", %s", action->name);
+ }
+ seq_putc(p, '\n');
+
+ seq_printf(p, "%*s: ", prec, "UNK");
+ for_each_online_cpu(j)
+ seq_printf(p, "%10u ", per_cpu(nmi_stats.unknown, j));
+ seq_printf(p, " NMI");
+ action = (nmi_to_desc(NMI_UNKNOWN))->head;
+ if (action) {
+ seq_printf(p, "\t%s", action->name);
+ while ((action = action->next) != NULL)
+ seq_printf(p, ", %s", action->name);
+ }
+ seq_putc(p, '\n');
+
+ seq_printf(p, "%*s: ", prec, "EXT");
+ for_each_online_cpu(j)
+ seq_printf(p, "%10u ", per_cpu(nmi_stats.external, j));
+ seq_printf(p, " NMI");
+ action = (nmi_to_desc(NMI_EXTERNAL))->head;
+ if (action) {
+ seq_printf(p, "\t%s", action->name);
+ while ((action = action->next) != NULL)
+ seq_printf(p, ", %s", action->name);
+ }
+ seq_putc(p, '\n');
+}
--
1.7.6
^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs
2011-09-20 14:43 ` [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs Don Zickus
@ 2011-09-20 17:23 ` Avi Kivity
2011-09-20 20:10 ` Don Zickus
2011-09-21 5:43 ` Huang Ying
2011-09-21 10:08 ` Robert Richter
2 siblings, 1 reply; 28+ messages in thread
From: Avi Kivity @ 2011-09-20 17:23 UTC (permalink / raw)
To: Don Zickus
Cc: x86, Andi Kleen, Robert Richter, Peter Zijlstra, ying.huang, LKML,
paulmck, jeremy
On 09/20/2011 05:43 PM, Don Zickus wrote:
> Previous patches allow the NMI subsystem to process multiple NMI events
> in one NMI. As previously discussed this can cause issues when an event
> triggered another NMI but is processed in the current NMI. This causes the
> next NMI to go unprocessed and become an 'unknown' NMI.
>
> To handle this, we first have to flag whether or not the NMI handler handled
> more than one event or not. If it did, then there exists a chance that
> the next NMI might be already processed. Once the NMI is flagged as a
> candidate to be swallowed, we next look for a back-to-back NMI condition.
>
> This is determined by looking at the %rip from pt_regs. If it is the same
> as the previous NMI, it is assumed the cpu did not have a chance to jump
> back into a non-NMI context and execute code and instead handled another NMI.
>
> If both of those conditions are true then we will swallow any unknown NMI.
>
> There still exists a chance that we accidentally swallow a real unknown NMI,
> but for now things seem better.
>
> An optimization has also been added to the nmi notifier routine. Because x86
> can latch up to one NMI while currently processing an NMI, we don't have to
> worry about executing _all_ the handlers in a standalone NMI. The idea is
> if multiple NMIs come in, the second NMI will represent them. For those
> back-to-back NMI cases, we have the potential to drop NMIs. Therefore only
> execute all the handlers in the second half of a detected back-to-back NMI.
>
> V2:
> - forgot to add the 'read' code for swallow_nmi (went into next patch)
>
> V3:
> - redesigned the algorithm to utilize Avi's idea of detecting a back-to-back
> NMI with %rip.
> V4:
> - clean up fixes, like adding 'static', rename save_rip to last_nmi_rip
>
>
Missing a zeroing of last_nmi_rip in the idle path. Otherwise, as Andi
points out, an idle machine will always see NMIs coming in from the
hlt/mwait address and detect them as back-to-back NMIs.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs
2011-09-20 17:23 ` Avi Kivity
@ 2011-09-20 20:10 ` Don Zickus
2011-09-21 5:45 ` Avi Kivity
0 siblings, 1 reply; 28+ messages in thread
From: Don Zickus @ 2011-09-20 20:10 UTC (permalink / raw)
To: Avi Kivity
Cc: x86, Andi Kleen, Robert Richter, Peter Zijlstra, ying.huang, LKML,
paulmck, jeremy
On Tue, Sep 20, 2011 at 08:23:03PM +0300, Avi Kivity wrote:
> >V4:
> > - clean up fixes, like adding 'static', rename save_rip to last_nmi_rip
> >
> >
>
> Missing a zeroing of last_nmi_rip in the idle path. Otherwise, as
> Andi points out, and idle machine will always see NMIs coming in
> from the hlt/mwait address and detect them as back-to-back NMIs.
You mean something like this?
From: Don Zickus <dzickus@redhat.com>
Date: Fri, 19 Aug 2011 15:51:44 -0400
Subject: [PATCH] x86, nmi: add in logic to handle multiple events and
unknown NMIs
Previous patches allow the NMI subsystem to process multiple NMI events
in one NMI. As previously discussed this can cause issues when an event
triggered another NMI but is processed in the current NMI. This causes the
next NMI to go unprocessed and become an 'unknown' NMI.
To handle this, we first have to flag whether or not the NMI handler handled
more than one event. If it did, then there exists a chance that
the next NMI might already be processed. Once the NMI is flagged as a
candidate to be swallowed, we next look for a back-to-back NMI condition.
This is determined by looking at the %rip from pt_regs. If it is the same
as the previous NMI, it is assumed the cpu did not have a chance to jump
back into a non-NMI context and execute code and instead handled another NMI.
If both of those conditions are true then we will swallow any unknown NMI.
There still exists a chance that we accidentally swallow a real unknown NMI,
but for now things seem better.
An optimization has also been added to the nmi notifier routine. Because x86
can latch up to one NMI while currently processing an NMI, we don't have to
worry about executing _all_ the handlers in a standalone NMI. The idea is
if multiple NMIs come in, the second NMI will represent them. For those
back-to-back NMI cases, we have the potential to drop NMIs. Therefore only
execute all the handlers in the second half of a detected back-to-back NMI.
V2:
- forgot to add the 'read' code for swallow_nmi (went into next patch)
V3:
- redesigned the algorithm to utilize Avi's idea of detecting a back-to-back
NMI with %rip.
V4:
- clean up fixes, like adding 'static', rename save_rip to last_nmi_rip
V5:
- wire up the idle path to reset the back-to-back NMI logic
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
arch/x86/include/asm/nmi.h | 1 +
arch/x86/kernel/nmi.c | 86 +++++++++++++++++++++++++++++++++++++++--
arch/x86/kernel/process_32.c | 2 +
arch/x86/kernel/process_64.c | 2 +
4 files changed, 86 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index fc74547..0930d4a 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -43,5 +43,6 @@ void unregister_nmi_handler(unsigned int, const char *);
void stop_nmi(void);
void restart_nmi(void);
+void local_touch_nmi(void);
#endif /* _ASM_X86_NMI_H */
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index acd61e8..79ac87b 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -66,7 +66,7 @@ __setup("unknown_nmi_panic", setup_unknown_nmi_panic);
#define nmi_to_desc(type) (&nmi_desc[type])
-static int notrace __kprobes nmi_handle(unsigned int type, struct pt_regs *regs)
+static int notrace __kprobes nmi_handle(unsigned int type, struct pt_regs *regs, bool b2b)
{
struct nmi_desc *desc = nmi_to_desc(type);
struct nmiaction *next_a, *a, **ap = &desc->head;
@@ -87,6 +87,16 @@ static int notrace __kprobes nmi_handle(unsigned int type, struct pt_regs *regs)
handled += a->handler(type, regs);
+ /*
+ * Optimization: only loop once if this is not a
+ * back-to-back NMI. The idea is nothing is dropped
+ * on the first NMI, only on the second of a back-to-back
+ * NMI. No need to waste cycles going through all the
+ * handlers.
+ */
+ if (!b2b && handled)
+ break;
+
a = next_a;
}
rcu_read_unlock();
@@ -255,7 +265,13 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
{
int handled;
- handled = nmi_handle(NMI_UNKNOWN, regs);
+ /*
+ * Use 'false' as back-to-back NMIs are dealt with one level up.
+ * Of course this makes having multiple 'unknown' handlers useless
+ * as only the first one is ever run (unless it can actually determine
+ * if it caused the NMI)
+ */
+ handled = nmi_handle(NMI_UNKNOWN, regs, false);
if (handled)
return;
#ifdef CONFIG_MCA
@@ -278,19 +294,49 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
pr_emerg("Dazed and confused, but trying to continue\n");
}
+static DEFINE_PER_CPU(bool, swallow_nmi);
+static DEFINE_PER_CPU(unsigned long, last_nmi_rip);
+
static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
{
unsigned char reason = 0;
int handled;
+ bool b2b = false;
/*
* CPU-specific NMI must be processed before non-CPU-specific
* NMI, otherwise we may lose it, because the CPU-specific
* NMI can not be detected/processed on other CPUs.
*/
- handled = nmi_handle(NMI_LOCAL, regs);
- if (handled)
+
+ /*
+ * Back-to-back NMIs are interesting because they can either
+ * be two NMIs or more than two NMIs (anything over two is dropped
+ * due to NMI being edge-triggered). If this is the second half
+ * of the back-to-back NMI, assume we dropped things and process
+ * more handlers. Otherwise reset the 'swallow' NMI behaviour
+ */
+ if (regs->ip == __this_cpu_read(last_nmi_rip))
+ b2b = true;
+ else
+ __this_cpu_write(swallow_nmi, false);
+
+ __this_cpu_write(last_nmi_rip, regs->ip);
+
+ handled = nmi_handle(NMI_LOCAL, regs, b2b);
+ if (handled) {
+ /*
+ * There are cases when an NMI handler handles multiple
+ * events in the current NMI. One of these events may
+ * be queued for the next NMI. Because the event is
+ * already handled, the next NMI will result in an unknown
+ * NMI. Instead let's flag this for a potential NMI to
+ * swallow.
+ */
+ if (handled > 1)
+ __this_cpu_write(swallow_nmi, true);
return;
+ }
/* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
raw_spin_lock(&nmi_reason_lock);
@@ -313,7 +359,31 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
}
raw_spin_unlock(&nmi_reason_lock);
- unknown_nmi_error(reason, regs);
+ /*
+ * Only one NMI can be latched at a time. To handle
+ * this we may process multiple nmi handlers at once to
+ * cover the case where an NMI is dropped. The downside
+ * to this approach is we may process an NMI prematurely,
+ * while its real NMI is sitting latched. This will cause
+ * an unknown NMI on the next run of the NMI processing.
+ *
+ * We tried to flag that condition above, by setting the
+ * swallow_nmi flag when we process more than one event.
+ * This condition is also only present on the second half
+ * of a back-to-back NMI, so we flag that condition too.
+ *
+ * If both are true, we assume we already processed this
+ * NMI previously and we swallow it. Otherwise we reset
+ * the logic.
+ *
+ * I am sure there are scenarios where we accidentally
+ * swallow a real 'unknown' NMI. But this is the best
+ * we can do for now.
+ */
+ if (b2b && __this_cpu_read(swallow_nmi))
+ ;
+ else
+ unknown_nmi_error(reason, regs);
}
dotraplinkage notrace __kprobes void
@@ -338,3 +408,9 @@ void restart_nmi(void)
{
ignore_nmis--;
}
+
+/* reset the back-to-back NMI logic */
+void local_touch_nmi(void)
+{
+ __this_cpu_write(last_nmi_rip, 0);
+}
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 7a3b651..46ff054 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -57,6 +57,7 @@
#include <asm/idle.h>
#include <asm/syscalls.h>
#include <asm/debugreg.h>
+#include <asm/nmi.h>
asmlinkage void ret_from_fork(void) __asm__("ret_from_fork");
@@ -107,6 +108,7 @@ void cpu_idle(void)
if (cpu_is_offline(cpu))
play_dead();
+ local_touch_nmi();
local_irq_disable();
/* Don't trace irqs off for idle */
stop_critical_timings();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index f693e44..3bd7e6e 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -51,6 +51,7 @@
#include <asm/idle.h>
#include <asm/syscalls.h>
#include <asm/debugreg.h>
+#include <asm/nmi.h>
asmlinkage extern void ret_from_fork(void);
@@ -133,6 +134,7 @@ void cpu_idle(void)
* from here on, until they go to idle.
* Otherwise, idle callbacks can misfire.
*/
+ local_touch_nmi();
local_irq_disable();
enter_idle();
/* Don't trace irqs off for idle */
--
1.7.6
^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [V5][PATCH 2/6] x86, nmi: create new NMI handler routines
2011-09-20 14:43 ` [V5][PATCH 2/6] x86, nmi: create new NMI handler routines Don Zickus
@ 2011-09-21 5:36 ` Huang Ying
2011-09-21 13:56 ` Don Zickus
0 siblings, 1 reply; 28+ messages in thread
From: Huang Ying @ 2011-09-21 5:36 UTC (permalink / raw)
To: Don Zickus
Cc: x86@kernel.org, Andi Kleen, Robert Richter, Peter Zijlstra, LKML,
paulmck@linux.vnet.ibm.com, avi@redhat.com, jeremy@goop.org
On 09/20/2011 10:43 PM, Don Zickus wrote:
> The NMI handlers used to rely on the notifier infrastructure. This worked
> great until we wanted to support handling multiple events better.
>
> One of the key ideas to the nmi handling is to process _all_ the handlers for
> each NMI. The reason behind this switch is because NMIs are edge triggered.
> If enough NMIs are triggered, then they could be lost because the cpu can
> only latch at most one NMI (besides the one currently being processed).
>
> In order to deal with this we have decided to process all the NMI handlers
> for each NMI. This allows the handlers to determine if they received an
> event or not (the ones that can not determine this will be left to fend
> for themselves on the unknown NMI list).
>
> As a result of this change it is now possible to have an extra NMI that
> was destined to be received for an already processed event. Because the
> event was processed in the previous NMI, this NMI gets dropped and becomes
> an 'unknown' NMI. This of course will cause printks that scare people.
>
> However, we prefer to have extra NMIs as opposed to losing NMIs and as such
> have developed a basic mechanism to catch most of them. That will be
> a later patch.
>
> To accomplish this idea, I unhooked the nmi handlers from the notifier
> routines and created a new mechanism loosely based on doIRQ. The reason
> for this is the notifier routines have a couple of shortcomings. One, we
> couldn't guarantee all future NMI handlers used NOTIFY_OK instead of
> NOTIFY_STOP. Second, we couldn't keep track of the number of events being
> handled in each routine (most only handle one, perf can handle more than one).
> Third, I wanted to eventually display which nmi handlers are registered in
> the system in /proc/interrupts to help see who is generating NMIs.
>
> The patch below just implements the new infrastructure but doesn't wire it up
> yet (that is the next patch). Its design is based on doIRQ structs and the
> atomic notifier routines. So the rcu stuff in the patch isn't entirely untested
> (as the notifier routines have soaked it) but it should be double checked in
> case I copied the code wrong.
>
> V2:
> - use kstrdup to copy/allocate device name
> - fix-up _GPL oops
>
> V3:
> - fix leak in register_nmi_handler error path
> - removed _raw annotations from rcu_dereference
>
> V4:
> - handle kstrndup failure
>
> Signed-off-by: Don Zickus <dzickus@redhat.com>
> ---
> arch/x86/include/asm/nmi.h | 19 +++++
> arch/x86/kernel/nmi.c | 157 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 176 insertions(+), 0 deletions(-)
>
> diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
> index 4886a68..6d04b28 100644
> --- a/arch/x86/include/asm/nmi.h
> +++ b/arch/x86/include/asm/nmi.h
> @@ -42,6 +42,25 @@ void arch_trigger_all_cpu_backtrace(void);
> #define NMI_LOCAL_NORMAL_PRIOR (NMI_LOCAL_BIT | NMI_NORMAL_PRIOR)
> #define NMI_LOCAL_LOW_PRIOR (NMI_LOCAL_BIT | NMI_LOW_PRIOR)
>
> +#define NMI_FLAG_FIRST 1
> +
> +enum {
> + NMI_LOCAL=0,
> + NMI_UNKNOWN,
> + NMI_EXTERNAL,
> + NMI_MAX
> +};
> +
> +#define NMI_DONE 0
> +#define NMI_HANDLED 1
> +
> +typedef int (*nmi_handler_t)(unsigned int, struct pt_regs *);
> +
> +int register_nmi_handler(unsigned int, nmi_handler_t, unsigned long,
> + const char *);
> +
> +void unregister_nmi_handler(unsigned int, const char *);
> +
> void stop_nmi(void);
> void restart_nmi(void);
>
> diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
> index 68d758a..c2df58a 100644
> --- a/arch/x86/kernel/nmi.c
> +++ b/arch/x86/kernel/nmi.c
> @@ -13,6 +13,9 @@
> #include <linux/kprobes.h>
> #include <linux/kdebug.h>
> #include <linux/nmi.h>
> +#include <linux/delay.h>
> +#include <linux/hardirq.h>
> +#include <linux/slab.h>
>
> #if defined(CONFIG_EDAC)
> #include <linux/edac.h>
> @@ -21,6 +24,28 @@
> #include <linux/atomic.h>
> #include <asm/traps.h>
> #include <asm/mach_traps.h>
> +#include <asm/nmi.h>
> +
> +#define NMI_MAX_NAMELEN 16
> +struct nmiaction {
> + struct nmiaction __rcu *next;
Why not just use struct list_head here and use list_xxx_rcu family to
operate on the list? IMHO, that will make code simpler without much
overhead.
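Something along these lines, purely as a rough sketch (the field layout
and locking here are assumptions for illustration, not taken from the
posted patch; in particular desc->head would become a list_head instead
of an nmiaction pointer):

	#include <linux/rculist.h>
	#include <linux/rcupdate.h>

	struct nmiaction {
		struct list_head	list;	/* linked into nmi_desc[type].head */
		nmi_handler_t		handler;
		unsigned long		flags;
		const char		*name;
	};

	static int nmi_handle(unsigned int type, struct pt_regs *regs, bool b2b)
	{
		struct nmi_desc *desc = nmi_to_desc(type);
		struct nmiaction *a;
		int handled = 0;

		rcu_read_lock();

		/* lockless walk; writers serialize on the desc spinlock */
		list_for_each_entry_rcu(a, &desc->head, list) {
			handled += a->handler(type, regs);

			/* keep the existing optimization: one pass unless back-to-back */
			if (!b2b && handled)
				break;
		}

		rcu_read_unlock();
		return handled;
	}

The register/unregister side would then just be list_add_rcu()/list_del_rcu()
plus synchronize_rcu() under the desc spinlock, and the hand-rolled __rcu
next pointer with its rcu_dereference() walking goes away.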
Best Regards,
Huang Ying
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [V5][PATCH 3/6] x86, nmi: wire up NMI handlers to new routines
2011-09-20 14:43 ` [V5][PATCH 3/6] x86, nmi: wire up NMI handlers to new routines Don Zickus
@ 2011-09-21 5:41 ` Huang Ying
2011-09-21 10:49 ` Borislav Petkov
0 siblings, 1 reply; 28+ messages in thread
From: Huang Ying @ 2011-09-21 5:41 UTC (permalink / raw)
To: Don Zickus
Cc: x86@kernel.org, Andi Kleen, Robert Richter, Peter Zijlstra, LKML,
paulmck@linux.vnet.ibm.com, avi@redhat.com, jeremy@goop.org,
Jason Wessel, Andi Kleen, Corey Minyard, Jack Steiner, Alan Cox
On 09/20/2011 10:43 PM, Don Zickus wrote:
> Just convert all the files that have an nmi handler to the new routines.
> Most of it is straight forward conversion. A couple of places needed some
> tweaking like kgdb which separates the debug notifier from the nmi handler
> and mce removes a call to notify_die (as I couldn't figure out why it was
> there).
It is used to call a debugger on a machine check, according to the following
thread:
https://lkml.org/lkml/2010/5/27/114
So maybe we can turn that into a kgdb direct call?
Best Regards,
Huang Ying
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs
2011-09-20 14:43 ` [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs Don Zickus
2011-09-20 17:23 ` Avi Kivity
@ 2011-09-21 5:43 ` Huang Ying
2011-09-21 13:57 ` Don Zickus
2011-09-21 10:08 ` Robert Richter
2 siblings, 1 reply; 28+ messages in thread
From: Huang Ying @ 2011-09-21 5:43 UTC (permalink / raw)
To: Don Zickus
Cc: x86@kernel.org, Andi Kleen, Robert Richter, Peter Zijlstra, LKML,
paulmck@linux.vnet.ibm.com, avi@redhat.com, jeremy@goop.org
On 09/20/2011 10:43 PM, Don Zickus wrote:
[snip]
> @@ -313,7 +359,31 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
> }
> raw_spin_unlock(&nmi_reason_lock);
>
> - unknown_nmi_error(reason, regs);
> + /*
> + * Only one NMI can be latched at a time. To handle
> + * this we may process multiple nmi handlers at once to
> + * cover the case where an NMI is dropped. The downside
> + * to this approach is we may process an NMI prematurely,
> + * while its real NMI is sitting latched. This will cause
> + * an unknown NMI on the next run of the NMI processing.
> + *
> + * We tried to flag that condition above, by setting the
> + * swallow_nmi flag when we process more than one event.
> + * This condition is also only present on the second half
> + * of a back-to-back NMI, so we flag that condition too.
> + *
> + * If both are true, we assume we already processed this
> + * NMI previously and we swallow it. Otherwise we reset
> + * the logic.
> + *
> + * I am sure there are scenarios where we accidentally
> + * swallow a real 'unknown' NMI. But this is the best
> + * we can do for now.
Why not describe a scenario where we swallow a real 'unknown' NMI? So
that someone working on the code in the future will know the challenge?
Best Regards,
Huang Ying
> + */
> + if (b2b && __this_cpu_read(swallow_nmi))
> + ;
> + else
> + unknown_nmi_error(reason, regs);
> }
>
> dotraplinkage notrace __kprobes void
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs
2011-09-20 20:10 ` Don Zickus
@ 2011-09-21 5:45 ` Avi Kivity
0 siblings, 0 replies; 28+ messages in thread
From: Avi Kivity @ 2011-09-21 5:45 UTC (permalink / raw)
To: Don Zickus
Cc: x86, Andi Kleen, Robert Richter, Peter Zijlstra, ying.huang, LKML,
paulmck, jeremy
On 09/20/2011 11:10 PM, Don Zickus wrote:
> On Tue, Sep 20, 2011 at 08:23:03PM +0300, Avi Kivity wrote:
> > >V4:
> > > - clean up fixes, like adding 'static', rename save_rip to last_nmi_rip
> > >
> > >
> >
> > Missing a zeroing of last_nmi_rip in the idle path. Otherwise, as
> > Andi points out, and idle machine will always see NMIs coming in
> > from the hlt/mwait address and detect them as back-to-back NMIs.
>
> You mean something like this?
>
Yes.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs
2011-09-20 14:43 ` [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs Don Zickus
2011-09-20 17:23 ` Avi Kivity
2011-09-21 5:43 ` Huang Ying
@ 2011-09-21 10:08 ` Robert Richter
2011-09-21 14:04 ` Don Zickus
2 siblings, 1 reply; 28+ messages in thread
From: Robert Richter @ 2011-09-21 10:08 UTC (permalink / raw)
To: Don Zickus
Cc: x86@kernel.org, Andi Kleen, Peter Zijlstra, ying.huang@intel.com,
LKML, paulmck@linux.vnet.ibm.com, avi@redhat.com, jeremy@goop.org
On 20.09.11 10:43:10, Don Zickus wrote:
> @@ -87,6 +87,16 @@ static int notrace __kprobes nmi_handle(unsigned int type, struct pt_regs *regs)
>
> handled += a->handler(type, regs);
>
> + /*
> + * Optimization: only loop once if this is not a
> + * back-to-back NMI. The idea is nothing is dropped
> + * on the first NMI, only on the second of a back-to-back
> + * NMI. No need to waste cycles going through all the
> + * handlers.
> + */
> + if (!b2b && handled)
> + break;
In rare cases we will lose nmis here.
We see a back-to-back nmi in the case where a 2nd nmi source triggers
*after* the nmi handler is entered. Depending on the internal cpu
timing, influenced by microcode and SMM code execution, the nmi handler
may not be entered immediately. So all sources that trigger *before* the
nmi handler is entered raise only one nmi with no subsequent nmi.
However, as these cases should be very rare, I think we can live with
it in favor of the optimization to jump out of the handler chain and save
a lot of cpu cycles, esp. in the case of heavy PMI load.
-Robert
--
Advanced Micro Devices, Inc.
Operating System Research Center
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [V5][PATCH 3/6] x86, nmi: wire up NMI handlers to new routines
2011-09-21 5:41 ` Huang Ying
@ 2011-09-21 10:49 ` Borislav Petkov
2011-09-21 14:06 ` Don Zickus
0 siblings, 1 reply; 28+ messages in thread
From: Borislav Petkov @ 2011-09-21 10:49 UTC (permalink / raw)
To: Huang Ying
Cc: Don Zickus, x86@kernel.org, Andi Kleen, Robert Richter,
Peter Zijlstra, LKML, paulmck@linux.vnet.ibm.com, avi@redhat.com,
jeremy@goop.org, Jason Wessel, Andi Kleen, Corey Minyard,
Jack Steiner, Alan Cox, Tony Luck
+ Tony.
On Wed, Sep 21, 2011 at 01:41:39PM +0800, Huang Ying wrote:
> On 09/20/2011 10:43 PM, Don Zickus wrote:
> > Just convert all the files that have an nmi handler to the new routines.
> > Most of it is straight forward conversion. A couple of places needed some
> > tweaking like kgdb which separates the debug notifier from the nmi handler
> > and mce removes a call to notify_die (as I couldn't figure out why it was
> > there).
>
> It is used to call a debugger on a machine check, according to following
> thread:
>
> https://lkml.org/lkml/2010/5/27/114
Thanks for digging that out - I couldn't find anywhere in the git logs
why this was added in the first place.
> So maybe we can turn that into a kgdb direct call?
After reading the thread, the semi-legitimate use of it as
a jump into the debugger just because some hardware reports certain
conditions through an MCE sounds pretty hacky to me.
Besides, if the driver developer needs that, he can add the code for the
duration of her/his development cycle as an aid, and remove it in the end.
This early-exit deal is especially unacceptable if you get an
uncorrectable error and some notifier call in the chain consumes it and
we never get to report it or decode it, or do recovery action. And thus
the box merrily continues on although a corruption just happened and we
didn't even get a chance to panic.
So I really really want to remove it, actually.
--
Regards/Gruss,
Boris.
Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [V5][PATCH 2/6] x86, nmi: create new NMI handler routines
2011-09-21 5:36 ` Huang Ying
@ 2011-09-21 13:56 ` Don Zickus
0 siblings, 0 replies; 28+ messages in thread
From: Don Zickus @ 2011-09-21 13:56 UTC (permalink / raw)
To: Huang Ying
Cc: x86@kernel.org, Andi Kleen, Robert Richter, Peter Zijlstra, LKML,
paulmck@linux.vnet.ibm.com, avi@redhat.com, jeremy@goop.org
On Wed, Sep 21, 2011 at 01:36:30PM +0800, Huang Ying wrote:
> On 09/20/2011 10:43 PM, Don Zickus wrote:
> > index 68d758a..c2df58a 100644
> > --- a/arch/x86/kernel/nmi.c
> > +++ b/arch/x86/kernel/nmi.c
> > @@ -13,6 +13,9 @@
> > #include <linux/kprobes.h>
> > #include <linux/kdebug.h>
> > #include <linux/nmi.h>
> > +#include <linux/delay.h>
> > +#include <linux/hardirq.h>
> > +#include <linux/slab.h>
> >
> > #if defined(CONFIG_EDAC)
> > #include <linux/edac.h>
> > @@ -21,6 +24,28 @@
> > #include <linux/atomic.h>
> > #include <asm/traps.h>
> > #include <asm/mach_traps.h>
> > +#include <asm/nmi.h>
> > +
> > +#define NMI_MAX_NAMELEN 16
> > +struct nmiaction {
> > + struct nmiaction __rcu *next;
>
> Why not just use struct list_head here and use list_xxx_rcu family to
> operate on the list? IMHO, that will make code simpler without much
> overhead.
Yeah, I just copied and pasted from what was in the notifier chain stuff.
Even though I don't need the previous pointer, the awkward rcu pointer
stuff would be a lot simpler to implement and read with those macros. Let
me play with it and see what I come up with.
Thanks for the suggestion.
Cheers,
Don
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs
2011-09-21 5:43 ` Huang Ying
@ 2011-09-21 13:57 ` Don Zickus
0 siblings, 0 replies; 28+ messages in thread
From: Don Zickus @ 2011-09-21 13:57 UTC (permalink / raw)
To: Huang Ying
Cc: x86@kernel.org, Andi Kleen, Robert Richter, Peter Zijlstra, LKML,
paulmck@linux.vnet.ibm.com, avi@redhat.com, jeremy@goop.org
On Wed, Sep 21, 2011 at 01:43:57PM +0800, Huang Ying wrote:
> On 09/20/2011 10:43 PM, Don Zickus wrote:
> [snip]
> > @@ -313,7 +359,31 @@ static notrace __kprobes void default_do_nmi(struct pt_regs *regs)
> > }
> > raw_spin_unlock(&nmi_reason_lock);
> >
> > - unknown_nmi_error(reason, regs);
> > + /*
> > + * Only one NMI can be latched at a time. To handle
> > + * this we may process multiple nmi handlers at once to
> > + * cover the case where an NMI is dropped. The downside
> > + * to this approach is we may process an NMI prematurely,
> > + * while its real NMI is sitting latched. This will cause
> > + * an unknown NMI on the next run of the NMI processing.
> > + *
> > + * We tried to flag that condition above, by setting the
> > + * swallow_nmi flag when we process more than one event.
> > + * This condition is also only present on the second half
> > + * of a back-to-back NMI, so we flag that condition too.
> > + *
> > + * If both are true, we assume we already processed this
> > + * NMI previously and we swallow it. Otherwise we reset
> > + * the logic.
> > + *
> > + * I am sure there are scenarios where we accidentally
> > + * swallow a real 'unknown' NMI. But this is the best
> > + * we can do for now.
>
> Why not describe a scenario where we swallow a real 'unknown' NMI? So
> that someone working on the code in the future will know the challenge?
I can add a couple of lines of comment for that.
Thanks,
Don
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs
2011-09-21 10:08 ` Robert Richter
@ 2011-09-21 14:04 ` Don Zickus
2011-09-21 15:18 ` Robert Richter
0 siblings, 1 reply; 28+ messages in thread
From: Don Zickus @ 2011-09-21 14:04 UTC (permalink / raw)
To: Robert Richter
Cc: x86@kernel.org, Andi Kleen, Peter Zijlstra, ying.huang@intel.com,
LKML, paulmck@linux.vnet.ibm.com, avi@redhat.com, jeremy@goop.org
On Wed, Sep 21, 2011 at 12:08:42PM +0200, Robert Richter wrote:
> On 20.09.11 10:43:10, Don Zickus wrote:
> > @@ -87,6 +87,16 @@ static int notrace __kprobes nmi_handle(unsigned int type, struct pt_regs *regs)
> >
> > handled += a->handler(type, regs);
> >
> > + /*
> > + * Optimization: only loop once if this is not a
> > + * back-to-back NMI. The idea is nothing is dropped
> > + * on the first NMI, only on the second of a back-to-back
> > + * NMI. No need to waste cycles going through all the
> > + * handlers.
> > + */
> > + if (!b2b && handled)
> > + break;
>
> In rare cases we will lose nmis here.
>
> We see a back-to-back nmi in the case if a 2 nmi source triggers
> *after* the nmi handler is entered. Depending on the internal cpu
> timing influenced by microcode and SMM code execution, the nmi may not
> entered immediately. So all sources that trigger *before* the nmi
> handler is entered raise only one nmi with no subsequent nmi.
Right, but that can only happen with the second NMI in the back-to-back
NMI case. Here the optimization is only for the first NMI, with the
assumption that you will always have a second NMI if multiple sources
trigger, so you can process those in the second iteration (assuming we
correctly detect the back-to-back NMI condition). Then when the second
NMI comes in, we have no idea how many we dropped to get here so we
process all the handlers based on the assumption we might not have another
NMI behind us to make up for the dropped NMIs.
Unless I misunderstood your point above?
>
> However, as these cases should be very rare, I think we can live with
> it in favor of the optimization to jump out the handler chain and save
> lot of cpu cycles esp. in the case of heavy PMI load.
I think it is covered as described above?
Cheers,
Don
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [V5][PATCH 3/6] x86, nmi: wire up NMI handlers to new routines
2011-09-21 10:49 ` Borislav Petkov
@ 2011-09-21 14:06 ` Don Zickus
0 siblings, 0 replies; 28+ messages in thread
From: Don Zickus @ 2011-09-21 14:06 UTC (permalink / raw)
To: Borislav Petkov
Cc: Huang Ying, x86@kernel.org, Andi Kleen, Robert Richter,
Peter Zijlstra, LKML, paulmck@linux.vnet.ibm.com, avi@redhat.com,
jeremy@goop.org, Jason Wessel, Andi Kleen, Corey Minyard,
Jack Steiner, Alan Cox, Tony Luck
On Wed, Sep 21, 2011 at 12:49:36PM +0200, Borislav Petkov wrote:
> + Tony.
>
> On Wed, Sep 21, 2011 at 01:41:39PM +0800, Huang Ying wrote:
> > On 09/20/2011 10:43 PM, Don Zickus wrote:
> > > Just convert all the files that have an nmi handler to the new routines.
> > > Most of it is straight forward conversion. A couple of places needed some
> > > tweaking like kgdb which separates the debug notifier from the nmi handler
> > > and mce removes a call to notify_die (as I couldn't figure out why it was
> > > there).
> >
> > It is used to call a debugger on a machine check, according to following
> > thread:
> >
> > https://lkml.org/lkml/2010/5/27/114
>
> Thanks for digging that out - I couldn't find anywhere in the git logs
> why was this added in the first place.
>
> > So maybe we can turn that into a kgdb direct call?
>
> After reading the thread, the semi-legitimate usage of using it as
> a jump into the debugger just because some hardware reports certain
> conditions through an MCE sounds pretty hacky to me.
>
> Besides, if the driver developer needs that, he can add the code for the
> duration of her/his development cycle as aid, and remove it in the end.
>
> This early-exit deal is especially inacceptable if you get an
> uncorrectable error and some notifier call in the chain consumes it and
> we never get to report it or decode it, or do recovery action. And thus
> the box merrily continues on although a corruption just happened and we
> didn't even get a chance to panic.
>
> So I really really want to remove it, actually.
Cool. Thanks Ying and Boris for settling that. I was scratching my head
trying to understand why that was there. It keeps the code simpler
now. :-)
Cheers,
Don
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs
2011-09-21 14:04 ` Don Zickus
@ 2011-09-21 15:18 ` Robert Richter
2011-09-21 15:33 ` Peter Zijlstra
2011-09-21 16:13 ` Don Zickus
0 siblings, 2 replies; 28+ messages in thread
From: Robert Richter @ 2011-09-21 15:18 UTC (permalink / raw)
To: Don Zickus
Cc: x86@kernel.org, Andi Kleen, Peter Zijlstra, ying.huang@intel.com,
LKML, paulmck@linux.vnet.ibm.com, avi@redhat.com, jeremy@goop.org
On 21.09.11 10:04:32, Don Zickus wrote:
> On Wed, Sep 21, 2011 at 12:08:42PM +0200, Robert Richter wrote:
> > On 20.09.11 10:43:10, Don Zickus wrote:
> > > @@ -87,6 +87,16 @@ static int notrace __kprobes nmi_handle(unsigned int type, struct pt_regs *regs)
> > >
> > > handled += a->handler(type, regs);
> > >
> > > + /*
> > > + * Optimization: only loop once if this is not a
> > > + * back-to-back NMI. The idea is nothing is dropped
> > > + * on the first NMI, only on the second of a back-to-back
> > > + * NMI. No need to waste cycles going through all the
> > > + * handlers.
> > > + */
> > > + if (!b2b && handled)
> > > + break;
> >
> > In rare cases we will lose nmis here.
> >
> > We see a back-to-back nmi in the case if a 2 nmi source triggers
> > *after* the nmi handler is entered. Depending on the internal cpu
> > timing influenced by microcode and SMM code execution, the nmi may not
> > entered immediately. So all sources that trigger *before* the nmi
> > handler is entered raise only one nmi with no subsequent nmi.
>
> Right, but that can only happen with the second NMI in the back-to-back
> NMI case. Here the optimization is only for the first NMI, with the
> assumption that you will always have a second NMI if multiple sources
> trigger, so you can process those in the second iteration (assuming we
> correctly detect the back-to-back NMI condition). Then when the second
> NMI comes in, we have no idea how many we dropped to get here so we
> process all the handlers based on the assumption we might not have another
> NMI behind us to make up for the dropped NMIs.
>
> Unless I misunderstood your point above?
No, my point was that a second NMI might not be latched even if there
are two nmi sources pending.
Your logic is correct but assumes you will always receive a second
nmi. This is not always the case depending on the cpu's internal
timing. Usually there is the following sequence for back-to-back nmis:
1. HW triggers first NMI, an NMI is pending.
2. NMI handler is called, no NMI pending anymore.
3. HW triggers a second NMI, an NMI is pending.
4. Return from NMI handler.
5. NMI handler is called again to serve the 2nd, no NMI pending anymore.
6. Return from NMI handler.
The above is what your algorithm covers.
But in rare cases there is the following:
1. The cpu executes some microcode or SMM code.
2. HW triggers the first NMI, an NMI is pending.
3. HW triggers a second NMI, the NMI is still pending.
4. The cpu finished microcode or SMM code.
5. NMI handler is called, no NMI pending anymore.
6. Return from NMI handler.
In this case the handler is called only once and the second nmi
remains unhandled with your implementation.
I don't see a way this could be caught without serving all
handlers the first time. But as said, in favor of the optimization I
think we can live with losing some NMIs.
-Robert
> > However, as these cases should be very rare, I think we can live with
> > it in favor of the optimization to jump out the handler chain and save
> > lot of cpu cycles esp. in the case of heavy PMI load.
--
Advanced Micro Devices, Inc.
Operating System Research Center
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs
2011-09-21 15:18 ` Robert Richter
@ 2011-09-21 15:33 ` Peter Zijlstra
2011-09-21 16:04 ` Robert Richter
2011-09-21 16:13 ` Don Zickus
1 sibling, 1 reply; 28+ messages in thread
From: Peter Zijlstra @ 2011-09-21 15:33 UTC (permalink / raw)
To: Robert Richter
Cc: Don Zickus, x86@kernel.org, Andi Kleen, ying.huang@intel.com,
LKML, paulmck@linux.vnet.ibm.com, avi@redhat.com, jeremy@goop.org
On Wed, 2011-09-21 at 17:18 +0200, Robert Richter wrote:
> 1. The cpu executes some microcode or SMM code.
> 2. HW triggers the first NMI, an NMI is pending.
> 3. HW triggers a second NMI, the NMI is still pending.
> 4. The cpu finished microcode or SMM code.
> 5. NMI handler is called, no NMI pending anymore.
> 6. Return from NMI handler.
Even without SMM, all you need is two different NMI users to trigger
while an NMI is in flight.
Wouldn't be entirely impossible to trigger if you have
non-fatal-MCE/hardware-NMI-switch/PMI all active.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs
2011-09-21 15:33 ` Peter Zijlstra
@ 2011-09-21 16:04 ` Robert Richter
2011-09-21 16:40 ` Peter Zijlstra
0 siblings, 1 reply; 28+ messages in thread
From: Robert Richter @ 2011-09-21 16:04 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Don Zickus, x86@kernel.org, Andi Kleen, ying.huang@intel.com,
LKML, paulmck@linux.vnet.ibm.com, avi@redhat.com, jeremy@goop.org
On 21.09.11 11:33:52, Peter Zijlstra wrote:
> On Wed, 2011-09-21 at 17:18 +0200, Robert Richter wrote:
> > 1. The cpu executes some microcode or SMM code.
> > 2. HW triggers the first NMI, an NMI is pending.
> > 3. HW triggers a second NMI, the NMI is still pending.
> > 4. The cpu finished microcode or SMM code.
> > 5. NMI handler is called, no NMI pending anymore.
> > 6. Return from NMI handler.
>
> Even without SMM, all you need is two different NMI users to trigger
> while an NMI is in flight.
>
> Wouldn't be entirely impossible to trigger if you have
> non-fatal-MCE/hardware-NMI-switch/PMI all active.
No, I am trying to explain this...
If hw triggers more than one nmi *before* the nmi handler is entered
(no matter what the source is), then there will be no back-to-back
nmi. NMIs are edge triggered and there is only one pending signal (AMD
cpus). The signal is cleared when entering the handler. If the cpu
executes microcode or smm code and delays execution of the nmi
handler, then there is no second call of the handler even if there are
2 sources for it.
-Robert
--
Advanced Micro Devices, Inc.
Operating System Research Center
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs
2011-09-21 15:18 ` Robert Richter
2011-09-21 15:33 ` Peter Zijlstra
@ 2011-09-21 16:13 ` Don Zickus
2011-09-21 16:24 ` Avi Kivity
1 sibling, 1 reply; 28+ messages in thread
From: Don Zickus @ 2011-09-21 16:13 UTC (permalink / raw)
To: Robert Richter
Cc: x86@kernel.org, Andi Kleen, Peter Zijlstra, ying.huang@intel.com,
LKML, paulmck@linux.vnet.ibm.com, avi@redhat.com, jeremy@goop.org
On Wed, Sep 21, 2011 at 05:18:30PM +0200, Robert Richter wrote:
> On 21.09.11 10:04:32, Don Zickus wrote:
> > On Wed, Sep 21, 2011 at 12:08:42PM +0200, Robert Richter wrote:
> > > On 20.09.11 10:43:10, Don Zickus wrote:
> > > > @@ -87,6 +87,16 @@ static int notrace __kprobes nmi_handle(unsigned int type, struct pt_regs *regs)
> > > >
> > > > handled += a->handler(type, regs);
> > > >
> > > > + /*
> > > > + * Optimization: only loop once if this is not a
> > > > + * back-to-back NMI. The idea is nothing is dropped
> > > > + * on the first NMI, only on the second of a back-to-back
> > > > + * NMI. No need to waste cycles going through all the
> > > > + * handlers.
> > > > + */
> > > > + if (!b2b && handled)
> > > > + break;
> > >
> > > In rare cases we will lose nmis here.
> > >
> > > We see a back-to-back nmi in the case if a 2 nmi source triggers
> > > *after* the nmi handler is entered. Depending on the internal cpu
> > > timing influenced by microcode and SMM code execution, the nmi may not
> > > entered immediately. So all sources that trigger *before* the nmi
> > > handler is entered raise only one nmi with no subsequent nmi.
> >
> > Right, but that can only happen with the second NMI in the back-to-back
> > NMI case. Here the optimization is only for the first NMI, with the
> > assumption that you will always have a second NMI if multiple sources
> > trigger, so you can process those in the second iteration (assuming we
> > correctly detect the back-to-back NMI condition). Then when the second
> > NMI comes in, we have no idea how many we dropped to get here so we
> > process all the handlers based on the assumption we might not have another
> > NMI behind us to make up for the dropped NMIs.
> >
> > Unless I misunderstood your point above?
>
> No, my point was that a second NMI might not be latched even if there
> are two nmi sources pending.
>
> Your logic is correct but assumes you will always receive a second
> nmi. This is not always the case depending on the cpu's internal
> timing. Usually there is the following sequence for back-to-back nmis:
>
> 1. HW triggers first NMI, an NMI is pending.
> 2. NMI handler is called, no NMI pending anymore.
> 3. HW triggers a second NMI, an NMI is pending.
> 4. Return from NMI handler.
> 5. NMI handler is called again to serve the 2nd, no NMI pending anymore.
> 6. Return from NMI handler.
>
> The above is what your algorithm covers.
>
> But in rare cases there is the following:
>
> 1. The cpu executes some microcode or SMM code.
> 2. HW triggers the first NMI, an NMI is pending.
> 3. HW triggers a second NMI, the NMI is still pending.
> 4. The cpu finished microcode or SMM code.
> 5. NMI handler is called, no NMI pending anymore.
> 6. Return from NMI handler.
>
> In this case the handler is called only once and the second nmi
> remains unhandled with you implementation.
>
> I don't see a way how this could be catched without serving all
> handlers the first time. But as said, in favor of the optimization I
> think we can live with losing some NMIs.
Ah, I get it now. Crap. Well I think Avi was pushing it to make those
ticket_spin_locks work in virt land. It seems like we should lean towards
removing the optimization. Avi?
Cheers,
Don
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs
2011-09-21 16:13 ` Don Zickus
@ 2011-09-21 16:24 ` Avi Kivity
2011-09-21 16:54 ` Robert Richter
2011-09-21 17:10 ` Don Zickus
0 siblings, 2 replies; 28+ messages in thread
From: Avi Kivity @ 2011-09-21 16:24 UTC (permalink / raw)
To: Don Zickus
Cc: Robert Richter, x86@kernel.org, Andi Kleen, Peter Zijlstra,
ying.huang@intel.com, LKML, paulmck@linux.vnet.ibm.com,
jeremy@goop.org
On 09/21/2011 07:13 PM, Don Zickus wrote:
> On Wed, Sep 21, 2011 at 05:18:30PM +0200, Robert Richter wrote:
> > On 21.09.11 10:04:32, Don Zickus wrote:
> > > On Wed, Sep 21, 2011 at 12:08:42PM +0200, Robert Richter wrote:
> > > > On 20.09.11 10:43:10, Don Zickus wrote:
> > > > > @@ -87,6 +87,16 @@ static int notrace __kprobes nmi_handle(unsigned int type, struct pt_regs *regs)
> > > > >
> > > > > handled += a->handler(type, regs);
> > > > >
> > > > > + /*
> > > > > + * Optimization: only loop once if this is not a
> > > > > + * back-to-back NMI. The idea is nothing is dropped
> > > > > + * on the first NMI, only on the second of a back-to-back
> > > > > + * NMI. No need to waste cycles going through all the
> > > > > + * handlers.
> > > > > + */
> > > > > + if (!b2b&& handled)
> > > > > + break;
> > > >
> > > > In rare cases we will lose nmis here.
> > > >
> > > > We see a back-to-back nmi in the case if a 2 nmi source triggers
> > > > *after* the nmi handler is entered. Depending on the internal cpu
> > > > timing influenced by microcode and SMM code execution, the nmi may not
> > > > entered immediately. So all sources that trigger *before* the nmi
> > > > handler is entered raise only one nmi with no subsequent nmi.
> > >
> > > Right, but that can only happen with the second NMI in the back-to-back
> > > NMI case. Here the optimization is only for the first NMI, with the
> > > assumption that you will always have a second NMI if multiple sources
> > > trigger, so you can process those in the second iteration (assuming we
> > > correctly detect the back-to-back NMI condition). Then when the second
> > > NMI comes in, we have no idea how many we dropped to get here so we
> > > process all the handlers based on the assumption we might not have another
> > > NMI behind us to make up for the dropped NMIs.
> > >
> > > Unless I misunderstood your point above?
> >
> > No, my point was that a second NMI might not be latched even if there
> > are two nmi sources pending.
> >
> > Your logic is correct but assumes you will always receive a second
> > nmi. This is not always the case depending on the cpu's internal
> > timing. Usually there is the following sequence for back-to-back nmis:
> >
> > 1. HW triggers first NMI, an NMI is pending.
> > 2. NMI handler is called, no NMI pending anymore.
> > 3. HW triggers a second NMI, an NMI is pending.
> > 4. Return from NMI handler.
> > 5. NMI handler is called again to serve the 2nd, no NMI pending anymore.
> > 6. Return from NMI handler.
> >
> > The above is what your algorithm covers.
> >
> > But in rare cases there is the following:
> >
> > 1. The cpu executes some microcode or SMM code.
> > 2. HW triggers the first NMI, an NMI is pending.
> > 3. HW triggers a second NMI, the NMI is still pending.
> > 4. The cpu finished microcode or SMM code.
> > 5. NMI handler is called, no NMI pending anymore.
> > 6. Return from NMI handler.
> >
> > In this case the handler is called only once and the second nmi
> > remains unhandled with you implementation.
> >
> > I don't see a way how this could be catched without serving all
> > handlers the first time. But as said, in favor of the optimization I
> > think we can live with losing some NMIs.
>
> Ah, I get it know. Crap. Well I think Avi was pushing it to make those
> ticket_spin_locks work in virt land. It seems like we should lean towards
> removing the optimization. Avi?
>
Well, in virt land there are no SMIs, and we can guarantee that the
queue length is always two. So if these rare cases are okay for
upstream, it'll be fine for virt.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs
2011-09-21 16:04 ` Robert Richter
@ 2011-09-21 16:40 ` Peter Zijlstra
0 siblings, 0 replies; 28+ messages in thread
From: Peter Zijlstra @ 2011-09-21 16:40 UTC (permalink / raw)
To: Robert Richter
Cc: Don Zickus, x86@kernel.org, Andi Kleen, ying.huang@intel.com,
LKML, paulmck@linux.vnet.ibm.com, avi@redhat.com, jeremy@goop.org
On Wed, 2011-09-21 at 18:04 +0200, Robert Richter wrote:
> No, I am trying to explain this...
D'uh, I got the b2b case inverted.. you're right.
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs
2011-09-21 16:24 ` Avi Kivity
@ 2011-09-21 16:54 ` Robert Richter
2011-09-25 12:54 ` Avi Kivity
2011-09-21 17:10 ` Don Zickus
1 sibling, 1 reply; 28+ messages in thread
From: Robert Richter @ 2011-09-21 16:54 UTC (permalink / raw)
To: Avi Kivity
Cc: Don Zickus, x86@kernel.org, Andi Kleen, Peter Zijlstra,
ying.huang@intel.com, LKML, paulmck@linux.vnet.ibm.com,
jeremy@goop.org
On 21.09.11 12:24:54, Avi Kivity wrote:
> On 09/21/2011 07:13 PM, Don Zickus wrote:
> > On Wed, Sep 21, 2011 at 05:18:30PM +0200, Robert Richter wrote:
> > > On 21.09.11 10:04:32, Don Zickus wrote:
> > > But in rare cases there is the following:
> > >
> > > 1. The cpu executes some microcode or SMM code.
> > > 2. HW triggers the first NMI, an NMI is pending.
> > > 3. HW triggers a second NMI, the NMI is still pending.
> > > 4. The cpu finished microcode or SMM code.
> > > 5. NMI handler is called, no NMI pending anymore.
> > > 6. Return from NMI handler.
> > >
> > > In this case the handler is called only once and the second nmi
> > > remains unhandled with you implementation.
> > >
> > > I don't see a way how this could be catched without serving all
> > > handlers the first time. But as said, in favor of the optimization I
> > > think we can live with losing some NMIs.
I have to revise this after thinking more about it. We must not lose
an nmi for sources where the nmi handler must always re-enable the nmi,
e.g. IBS. For IBS, losing one nmi means that sample generation gets
stuck.
-Robert
> >
> > Ah, I get it know. Crap. Well I think Avi was pushing it to make those
> > ticket_spin_locks work in virt land. It seems like we should lean towards
> > removing the optimization. Avi?
> >
>
> Well, in virt land there are no SMIs, and we can guarantee that the
> queue length is always two. So if these rare cases are okay for
> upstream, it'll be fine for virt.
>
> --
> error compiling committee.c: too many arguments to function
>
>
--
Advanced Micro Devices, Inc.
Operating System Research Center
^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs
2011-09-21 16:24 ` Avi Kivity
2011-09-21 16:54 ` Robert Richter
@ 2011-09-21 17:10 ` Don Zickus
1 sibling, 0 replies; 28+ messages in thread
From: Don Zickus @ 2011-09-21 17:10 UTC (permalink / raw)
To: Avi Kivity
Cc: Robert Richter, x86@kernel.org, Andi Kleen, Peter Zijlstra,
ying.huang@intel.com, LKML, paulmck@linux.vnet.ibm.com,
jeremy@goop.org
On Wed, Sep 21, 2011 at 07:24:54PM +0300, Avi Kivity wrote:
> >> But in rare cases there is the following:
> >>
> >> 1. The cpu executes some microcode or SMM code.
> >> 2. HW triggers the first NMI, an NMI is pending.
> >> 3. HW triggers a second NMI, the NMI is still pending.
> >> 4. The cpu finished microcode or SMM code.
> >> 5. NMI handler is called, no NMI pending anymore.
> >> 6. Return from NMI handler.
> >>
> >> In this case the handler is called only once and the second nmi
> >> remains unhandled with you implementation.
> >>
> >> I don't see a way how this could be catched without serving all
> >> handlers the first time. But as said, in favor of the optimization I
> >> think we can live with losing some NMIs.
> >
> >Ah, I get it know. Crap. Well I think Avi was pushing it to make those
> >ticket_spin_locks work in virt land. It seems like we should lean towards
> >removing the optimization. Avi?
> >
>
> Well, in virt land there are no SMIs, and we can guarantee that the
> queue length is always two. So if these rare cases are okay for
> upstream, it'll be fine for virt.
Actually, I was trying to remember what the argument for the optimization
was again. I believe the virt team needed it, right?
Cheers,
Don
^ permalink raw reply [flat|nested] 28+ messages in thread
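For reference, this is my reading of the accounting patch 4 adds to cope with
the extra latched NMI; it is a rough sketch only, the real code is in the new
arch/x86/kernel/nmi.c and uses different names. If a single delivered NMI turns
out to carry more than one event, the back-to-back NMI still latched behind it
will find nothing left to do, so it can be swallowed instead of being reported
as unknown:

#include <stdbool.h>
#include <stdio.h>

static bool swallow_next_unknown;

/* 'events' is how many registered handlers claimed this delivered NMI. */
static void do_nmi(int events)
{
	if (events) {
		/*
		 * More than one event behind a single NMI means a second NMI
		 * is latched behind us with nothing left to do when it fires.
		 */
		swallow_next_unknown = events > 1;
		printf("handled %d event(s)\n", events);
		return;
	}

	if (swallow_next_unknown) {
		swallow_next_unknown = false;
		printf("unknown NMI swallowed (assumed back-to-back)\n");
		return;
	}

	printf("really unknown NMI, report it\n");
}

int main(void)
{
	do_nmi(2);	/* two events collapsed into one delivered NMI */
	do_nmi(0);	/* the latched extra NMI arrives empty: swallow */
	do_nmi(0);	/* a later empty NMI is still reported as unknown */
	return 0;
}

The three calls in main() print one handled NMI, one swallowed NMI, and one
genuinely unknown NMI, which is the accounting behaviour being debated here.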
* Re: [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs
2011-09-21 16:54 ` Robert Richter
@ 2011-09-25 12:54 ` Avi Kivity
0 siblings, 0 replies; 28+ messages in thread
From: Avi Kivity @ 2011-09-25 12:54 UTC (permalink / raw)
To: Robert Richter
Cc: Don Zickus, x86@kernel.org, Andi Kleen, Peter Zijlstra,
ying.huang@intel.com, LKML, paulmck@linux.vnet.ibm.com,
jeremy@goop.org
On 09/21/2011 07:54 PM, Robert Richter wrote:
> On 21.09.11 12:24:54, Avi Kivity wrote:
> > On 09/21/2011 07:13 PM, Don Zickus wrote:
> > > On Wed, Sep 21, 2011 at 05:18:30PM +0200, Robert Richter wrote:
> > > > On 21.09.11 10:04:32, Don Zickus wrote:
>
> > > > But in rare cases there is the following:
> > > >
> > > > 1. The cpu executes some microcode or SMM code.
> > > > 2. HW triggers the first NMI, an NMI is pending.
> > > > 3. HW triggers a second NMI, the NMI is still pending.
> > > > 4. The cpu finishes microcode or SMM code.
> > > > 5. NMI handler is called, no NMI pending anymore.
> > > > 6. Return from NMI handler.
> > > >
> > > > In this case the handler is called only once and the second nmi
> > > > remains unhandled with your implementation.
> > > >
> > > > I don't see a way this could be caught without serving all
> > > > handlers the first time. But as said, in favor of the optimization I
> > > > think we can live with losing some NMIs.
>
> I have to revise this after thinking about it some more. We cannot afford
> to lose an NMI for sources where the NMI handler must always re-enable the
> NMI, e.g. IBS. Losing one NMI means that IBS sample generation gets stuck.
>
Well, that pretty much kills the whole idea. This thing has to be reliable.
I'll ask Intel if they can guarantee a length-2 queue on their
processors (or maybe Andi, you can find this out).
--
error compiling committee.c: too many arguments to function
^ permalink raw reply [flat|nested] 28+ messages in thread
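Robert's IBS concern can be shown with a small model: a sampling engine that is
only re-armed from inside its own NMI handler stops producing samples for good
once a single NMI is dropped. The structure and helper names below are invented
for illustration and have nothing to do with the real IBS MSR interface:

#include <stdbool.h>
#include <stdio.h>

struct sampler {
	bool enabled;	/* armed for the next sample; only the NMI handler sets this */
	bool fired;	/* hardware finished a sample and raised an NMI */
	int samples;
};

static void hw_tick(struct sampler *s)
{
	if (s->enabled) {
		s->enabled = false;	/* sample done, engine disarms itself */
		s->fired = true;	/* ...and raises an NMI */
	}
}

static void sampler_nmi_handler(struct sampler *s)
{
	s->samples++;
	s->enabled = true;		/* re-arm from the handler, IBS-style */
}

int main(void)
{
	struct sampler s = { .enabled = true };
	int tick;

	for (tick = 0; tick < 5; tick++) {
		hw_tick(&s);
		if (!s.fired)
			continue;
		s.fired = false;		/* the NMI is consumed either way */
		if (tick == 2)
			continue;		/* ...but this one never reaches the handler */
		sampler_nmi_handler(&s);
	}

	printf("samples collected: %d (5 expected)\n", s.samples);
	return 0;
}

The dropped NMI at tick 2 means the handler never re-arms the engine, so only
the first two samples are collected; that is the "sample generation gets stuck"
case Robert describes.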
end of thread, other threads:[~2011-09-25 12:55 UTC | newest]
Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-09-20 14:43 [V5][PATCH 0/6] x86, nmi: new NMI handling routines Don Zickus
2011-09-20 14:43 ` [V5][PATCH 1/6] x86, nmi: split out nmi from traps.c Don Zickus
2011-09-20 14:43 ` [V5][PATCH 2/6] x86, nmi: create new NMI handler routines Don Zickus
2011-09-21 5:36 ` Huang Ying
2011-09-21 13:56 ` Don Zickus
2011-09-20 14:43 ` [V5][PATCH 3/6] x86, nmi: wire up NMI handlers to new routines Don Zickus
2011-09-21 5:41 ` Huang Ying
2011-09-21 10:49 ` Borislav Petkov
2011-09-21 14:06 ` Don Zickus
2011-09-20 14:43 ` [V5][PATCH 4/6] x86, nmi: add in logic to handle multiple events and unknown NMIs Don Zickus
2011-09-20 17:23 ` Avi Kivity
2011-09-20 20:10 ` Don Zickus
2011-09-21 5:45 ` Avi Kivity
2011-09-21 5:43 ` Huang Ying
2011-09-21 13:57 ` Don Zickus
2011-09-21 10:08 ` Robert Richter
2011-09-21 14:04 ` Don Zickus
2011-09-21 15:18 ` Robert Richter
2011-09-21 15:33 ` Peter Zijlstra
2011-09-21 16:04 ` Robert Richter
2011-09-21 16:40 ` Peter Zijlstra
2011-09-21 16:13 ` Don Zickus
2011-09-21 16:24 ` Avi Kivity
2011-09-21 16:54 ` Robert Richter
2011-09-25 12:54 ` Avi Kivity
2011-09-21 17:10 ` Don Zickus
2011-09-20 14:43 ` [V5][PATCH 5/6] x86, nmi: track NMI usage stats Don Zickus
2011-09-20 14:43 ` [V5][PATCH 6/6] x86, nmi: print out NMI stats in /proc/interrupts Don Zickus