public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/6 V2] x86, nmi: Various fixes and cleanups
@ 2014-05-15 19:25 Don Zickus
  2014-05-15 19:25 ` [PATCH 1/6] x86, nmi: Implement delayed irq_work mechanism to handle lost NMIs Don Zickus
                   ` (6 more replies)
  0 siblings, 7 replies; 17+ messages in thread
From: Don Zickus @ 2014-05-15 19:25 UTC (permalink / raw)
  To: x86; +Cc: Peter Zijlstra, Andi Kleen, gong.chen, LKML, Elliott, Don Zickus

I started this series by fixing a performance problem in the GHES
NMI handler, and it evolved into more patches as I poked
around in the code.

The main focus was moving the GHES NMI driver to its own NMI subtype
to avoid slowing down perf handling.  Then I decided to move the
default external NMI handler to its own routine.  Finally, I
needed to see which NMI handlers were registered, so I hacked up
/proc/interrupts to show me.

Tested mostly on HP boxes with GHES enabled, using the iLO to
send NMIs to panic the box (via the hpwdt driver).  Ran perf on other
GHES-enabled boxes to check the performance results.

V2: added irq_work items to handle possible lost NMIs (new patch 1)
    modified output of /proc/interrupts based on feedback (patch 6)

Don Zickus (6):
  x86, nmi:  Implement delayed irq_work mechanism to handle lost NMIs
  x86, nmi:  Add new nmi type 'external'
  x86, nmi: Add boot line option 'panic_on_unrecovered_nmi' and
    'panic_on_io_nmi'
  x86, nmi: Remove 'reason' value from unknown nmi output
  x86, nmi: Move default external NMI handler to its own routine
  x86, nmi: Add better NMI stats to /proc/interrupts and show handlers

 Documentation/kernel-parameters.txt |    9 +
 arch/x86/include/asm/nmi.h          |    5 +
 arch/x86/kernel/irq.c               |    3 +
 arch/x86/kernel/nmi.c               |  302 +++++++++++++++++++++++++++++-----
 drivers/acpi/apei/ghes.c            |    4 +-
 drivers/watchdog/hpwdt.c            |   24 ++-
 6 files changed, 292 insertions(+), 55 deletions(-)


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/6] x86, nmi:  Implement delayed irq_work mechanism to handle lost NMIs
  2014-05-15 19:25 [PATCH 0/6 V2] x86, nmi: Various fixes and cleanups Don Zickus
@ 2014-05-15 19:25 ` Don Zickus
  2014-05-21 10:29   ` Peter Zijlstra
  2014-05-15 19:25 ` [PATCH 2/6] x86, nmi: Add new nmi type 'external' Don Zickus
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 17+ messages in thread
From: Don Zickus @ 2014-05-15 19:25 UTC (permalink / raw)
  To: x86; +Cc: Peter Zijlstra, Andi Kleen, gong.chen, LKML, Elliott, Don Zickus

The x86 NMI is a fragile beast.  There are times when you can have too many
NMIs or too few.  We have addressed the times when you have too many by
implementing back-to-back NMI detection logic.  This patch attempts to
address the issue of losing NMIs.

A lost NMI can occur, for example, when multiple NMIs arrive while another
NMI is being processed.  Only one NMI can be latched at any given time,
so the second NMI is dropped.  If the latched NMI is then handled by
the PMI handler, the current code would never call the SERR or
IO_CHK handlers to process the external NMI.

As a result, future NMIs would be blocked (because the external NMI line
is never unmasked) and the critical event would go unserviced.

Of course these situations are rare, but given the frequency of PMI events,
a rare external NMI can still be accidentally dropped.

To work around this small race, I implemented an irq_work item.  For every NMI
processed, an irq_work item is queued (lazily) to send a simulated NMI on the
next clock tick.  The idea is to keep simulating NMIs until they pass through
all the handlers without any handler claiming them.

The irq_work framework is structured such that only one cpu at a time will
handle the work, which avoids an NMI storm across all the cpus.

I have tested this on an Ivy Bridge with 4 events going on all cpus to help
catch various corner cases.  The logic seems to work and under heavy load I
have seen irq_work_queue prevent up to 30-40 events from being queued because
of the small delay utilized during one iteration of an irq_work item.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 arch/x86/kernel/nmi.c |  109 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 109 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index b4872b9..7b17864 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -110,6 +110,106 @@ static void nmi_max_handler(struct irq_work *w)
 		a->handler, whole_msecs, decimal_msecs);
 }
 
+/*
+ * setup a lazy irq work queue to handle dropped/lost NMIs
+ *
+ * Because of the way NMIs are shared and are edge triggered,
+ * lots of scenarios pop up that could cause us to lose or
+ * receive an extra NMI.
+ *
+ * The nmi_delayed_work logic below is trying to handle
+ * the case where we lose an NMI.  How that happens is if
+ * two NMIs arrive roughly at the same time while currently
+ * processing another NMI.  Only one NMI can be latched,
+ * therefore the second NMI would be dropped.
+ *
+ * Why does this matter even though we cycle through all the
+ * NMI handlers?  It matters because an external NMI for
+ * SERR or IO_CHK currently gets ignored if a local NMI
+ * (like a PMI) is handled.  This can block the external
+ * NMI from:
+ * - causing proper notification of a critical event
+ * - being unmasked, which blocks future NMIs
+ *
+ * As a result we create a delayed irq work item, that
+ * will blindly check all the NMI handlers on the next
+ * timer tick to see if we missed an NMI.  It will continue
+ * to check until we cycle through cleanly and hit every
+ * handler.
+ */
+DEFINE_PER_CPU(bool, nmi_delayed_work_pending);
+
+static void nmi_delayed_work_func(struct irq_work *irq_work)
+{
+	DECLARE_BITMAP(nmi_mask, NR_CPUS);
+	cpumask_t *mask;
+
+	preempt_disable();
+
+	/*
+	 * Can't use send_IPI_self here because it will
+	 * send an NMI in IRQ context which is not what
+	 * we want.  Create a cpumask for local cpu and
+	 * force an IPI the normal way (not the shortcut).
+	 */
+	bitmap_zero(nmi_mask, NR_CPUS);
+	mask = to_cpumask(nmi_mask);
+	cpu_set(smp_processor_id(), *mask);
+
+	__this_cpu_xchg(nmi_delayed_work_pending, true);
+	apic->send_IPI_mask(to_cpumask(nmi_mask), NMI_VECTOR);
+
+	preempt_enable();
+}
+
+struct irq_work nmi_delayed_work =
+{
+	.func	= nmi_delayed_work_func,
+	.flags	= IRQ_WORK_LAZY,
+};
+
+static bool nmi_queue_work_clear(void)
+{
+	bool set = __this_cpu_read(nmi_delayed_work_pending);
+
+	__this_cpu_write(nmi_delayed_work_pending, false);
+
+	return set;
+}
+
+static int nmi_queue_work(void)
+{
+	bool queued = irq_work_queue(&nmi_delayed_work);
+
+	if (queued) {
+		/*
+		 * If the delayed NMI actually finds a 'dropped' NMI, the
+		 * work pending bit will never be cleared.  A new delayed
+		 * work NMI is supposed to be sent in that case.  But there
+		 * is no guarantee that the same cpu will be used.  So
+		 * pro-actively clear the flag here (the new self-IPI will
+		 * re-set it).
+		 *
+		 * However, there is a small chance that a real NMI and the
+		 * simulated one occur at the same time.  What happens is the
+		 * simulated IPI NMI sets the work_pending flag and then sends
+		 * the IPI.  At this point the irq_work allows a new work
+		 * event.  So when the simulated IPI is handled by a real NMI
+		 * handler it comes in here to queue more work.  Because
+		 * irq_work returns success, the work_pending bit is cleared.
+		 * The second part of the back-to-back NMI is kicked off, the
+		 * work_pending bit is not set and an unknown NMI is generated.
+		 * Therefore check the BUSY bit before clearing.  The theory is
+		 * if the BUSY bit is set, then there should be an NMI for this
+		 * cpu latched somewhere and will be cleared when it runs.
+		 */
+		if (!(nmi_delayed_work.flags & IRQ_WORK_BUSY))
+			nmi_queue_work_clear();
+	}
+
+	return 0;
+}
+
 static int __kprobes nmi_handle(unsigned int type, struct pt_regs *regs, bool b2b)
 {
 	struct nmi_desc *desc = nmi_to_desc(type);
@@ -341,6 +441,9 @@ static __kprobes void default_do_nmi(struct pt_regs *regs)
 		 */
 		if (handled > 1)
 			__this_cpu_write(swallow_nmi, true);
+
+		/* kick off delayed work in case we swallowed external NMI */
+		nmi_queue_work();
 		return;
 	}
 
@@ -362,10 +465,16 @@ static __kprobes void default_do_nmi(struct pt_regs *regs)
 #endif
 		__this_cpu_add(nmi_stats.external, 1);
 		raw_spin_unlock(&nmi_reason_lock);
+		/* kick off delayed work in case we swallowed external NMI */
+		nmi_queue_work();
 		return;
 	}
 	raw_spin_unlock(&nmi_reason_lock);
 
+	/* expected delayed queued NMI? Don't flag as unknown */
+	if (nmi_queue_work_clear())
+		return;
+
 	/*
 	 * Only one NMI can be latched at a time.  To handle
 	 * this we may process multiple nmi handlers at once to
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/6] x86, nmi:  Add new nmi type 'external'
  2014-05-15 19:25 [PATCH 0/6 V2] x86, nmi: Various fixes and cleanups Don Zickus
  2014-05-15 19:25 ` [PATCH 1/6] x86, nmi: Implement delayed irq_work mechanism to handle lost NMIs Don Zickus
@ 2014-05-15 19:25 ` Don Zickus
  2014-05-15 19:25 ` [PATCH 3/6] x86, nmi: Add boot line option 'panic_on_unrecovered_nmi' and 'panic_on_io_nmi' Don Zickus
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Don Zickus @ 2014-05-15 19:25 UTC (permalink / raw)
  To: x86; +Cc: Peter Zijlstra, Andi Kleen, gong.chen, LKML, Elliott, Don Zickus

I noticed while debugging a perf problem on a machine with GHES enabled
that perf seemed slow.  I then realized that the GHES NMI routine was taking
a global lock every time to inspect the hardware.  This contended
with all the local perf counters, which did not need a lock.  So each cpu
was accidentally serializing against the others when using perf.

This is because of the way the NMI handler works: it executes all the handlers
registered to a particular subtype (to deal with NMI sharing).  As a result
the GHES handler was executed on every PMI.

Fix this by creating a new nmi type called NMI_EXT, which is used by
handlers that need to probe external hardware and require a global lock
to do so.

Now the main NMI handler can check the internal NMI handlers first and
then the external ones if nothing is found.

This makes perf a little faster again on those machines with GHES enabled.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 arch/x86/include/asm/nmi.h |    1 +
 arch/x86/kernel/nmi.c      |   41 ++++++++++++++++++++++++++---------------
 drivers/acpi/apei/ghes.c   |    4 ++--
 drivers/watchdog/hpwdt.c   |   10 ++--------
 4 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index 5f2fc44..8631a49 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -25,6 +25,7 @@ extern int unknown_nmi_panic;
 
 enum {
 	NMI_LOCAL=0,
+	NMI_EXT,
 	NMI_UNKNOWN,
 	NMI_SERR,
 	NMI_IO_CHECK,
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 7b17864..b7c6f6b 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -34,29 +34,32 @@
 #include <trace/events/nmi.h>
 
 struct nmi_desc {
-	spinlock_t lock;
-	struct list_head head;
+	spinlock_t		lock;
+	struct list_head	head;
 };
 
 static struct nmi_desc nmi_desc[NMI_MAX] = 
 {
-	{
-		.lock = __SPIN_LOCK_UNLOCKED(&nmi_desc[0].lock),
-		.head = LIST_HEAD_INIT(nmi_desc[0].head),
+	[NMI_LOCAL] = {
+		.lock = __SPIN_LOCK_UNLOCKED(&nmi_desc[NMI_LOCAL].lock),
+		.head = LIST_HEAD_INIT(nmi_desc[NMI_LOCAL].head),
 	},
-	{
-		.lock = __SPIN_LOCK_UNLOCKED(&nmi_desc[1].lock),
-		.head = LIST_HEAD_INIT(nmi_desc[1].head),
+	[NMI_EXT] = {
+		.lock = __SPIN_LOCK_UNLOCKED(&nmi_desc[NMI_EXT].lock),
+		.head = LIST_HEAD_INIT(nmi_desc[NMI_EXT].head),
 	},
-	{
-		.lock = __SPIN_LOCK_UNLOCKED(&nmi_desc[2].lock),
-		.head = LIST_HEAD_INIT(nmi_desc[2].head),
+	[NMI_UNKNOWN] = {
+		.lock = __SPIN_LOCK_UNLOCKED(&nmi_desc[NMI_UNKNOWN].lock),
+		.head = LIST_HEAD_INIT(nmi_desc[NMI_UNKNOWN].head),
 	},
-	{
-		.lock = __SPIN_LOCK_UNLOCKED(&nmi_desc[3].lock),
-		.head = LIST_HEAD_INIT(nmi_desc[3].head),
+	[NMI_SERR] = {
+		.lock = __SPIN_LOCK_UNLOCKED(&nmi_desc[NMI_SERR].lock),
+		.head = LIST_HEAD_INIT(nmi_desc[NMI_SERR].head),
+	},
+	[NMI_IO_CHECK] = {
+		.lock = __SPIN_LOCK_UNLOCKED(&nmi_desc[NMI_IO_CHECK].lock),
+		.head = LIST_HEAD_INIT(nmi_desc[NMI_IO_CHECK].head),
 	},
-
 };
 
 struct nmi_stats {
@@ -428,8 +431,16 @@ static __kprobes void default_do_nmi(struct pt_regs *regs)
 
 	__this_cpu_write(last_nmi_rip, regs->ip);
 
+	/* try local NMIs first */
 	handled = nmi_handle(NMI_LOCAL, regs, b2b);
 	__this_cpu_add(nmi_stats.normal, handled);
+
+	/* if unclaimed, try external NMIs next */
+	if (!handled) {
+		handled = nmi_handle(NMI_EXT, regs, b2b);
+		__this_cpu_add(nmi_stats.external, handled);
+	}
+
 	if (handled) {
 		/*
 		 * There are cases when a NMI handler handles multiple
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index dab7cb7..8b78bf5 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -976,7 +976,7 @@ static int ghes_probe(struct platform_device *ghes_dev)
 		ghes_estatus_pool_expand(len);
 		mutex_lock(&ghes_list_mutex);
 		if (list_empty(&ghes_nmi))
-			register_nmi_handler(NMI_LOCAL, ghes_notify_nmi, 0,
+			register_nmi_handler(NMI_EXT, ghes_notify_nmi, 0,
 						"ghes");
 		list_add_rcu(&ghes->list, &ghes_nmi);
 		mutex_unlock(&ghes_list_mutex);
@@ -1025,7 +1025,7 @@ static int ghes_remove(struct platform_device *ghes_dev)
 		mutex_lock(&ghes_list_mutex);
 		list_del_rcu(&ghes->list);
 		if (list_empty(&ghes_nmi))
-			unregister_nmi_handler(NMI_LOCAL, "ghes");
+			unregister_nmi_handler(NMI_EXT, "ghes");
 		mutex_unlock(&ghes_list_mutex);
 		/*
 		 * To synchronize with NMI handler, ghes can only be
diff --git a/drivers/watchdog/hpwdt.c b/drivers/watchdog/hpwdt.c
index 75d2243..dd70ee7 100644
--- a/drivers/watchdog/hpwdt.c
+++ b/drivers/watchdog/hpwdt.c
@@ -736,12 +736,9 @@ static int hpwdt_init_nmi_decoding(struct pci_dev *dev)
 	retval = register_nmi_handler(NMI_UNKNOWN, hpwdt_pretimeout, 0, "hpwdt");
 	if (retval)
 		goto error;
-	retval = register_nmi_handler(NMI_SERR, hpwdt_pretimeout, 0, "hpwdt");
+	retval = register_nmi_handler(NMI_EXT, hpwdt_pretimeout, 0, "hpwdt");
 	if (retval)
 		goto error1;
-	retval = register_nmi_handler(NMI_IO_CHECK, hpwdt_pretimeout, 0, "hpwdt");
-	if (retval)
-		goto error2;
 
 	dev_info(&dev->dev,
 			"HP Watchdog Timer Driver: NMI decoding initialized"
@@ -749,8 +746,6 @@ static int hpwdt_init_nmi_decoding(struct pci_dev *dev)
 			(allow_kdump == 0) ? "OFF" : "ON");
 	return 0;
 
-error2:
-	unregister_nmi_handler(NMI_SERR, "hpwdt");
 error1:
 	unregister_nmi_handler(NMI_UNKNOWN, "hpwdt");
 error:
@@ -765,8 +760,7 @@ error:
 static void hpwdt_exit_nmi_decoding(void)
 {
 	unregister_nmi_handler(NMI_UNKNOWN, "hpwdt");
-	unregister_nmi_handler(NMI_SERR, "hpwdt");
-	unregister_nmi_handler(NMI_IO_CHECK, "hpwdt");
+	unregister_nmi_handler(NMI_EXT, "hpwdt");
 	if (cru_rom_addr)
 		iounmap(cru_rom_addr);
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/6] x86, nmi: Add boot line option 'panic_on_unrecovered_nmi' and 'panic_on_io_nmi'
  2014-05-15 19:25 [PATCH 0/6 V2] x86, nmi: Various fixes and cleanups Don Zickus
  2014-05-15 19:25 ` [PATCH 1/6] x86, nmi: Implement delayed irq_work mechanism to handle lost NMIs Don Zickus
  2014-05-15 19:25 ` [PATCH 2/6] x86, nmi: Add new nmi type 'external' Don Zickus
@ 2014-05-15 19:25 ` Don Zickus
  2014-05-15 19:25 ` [PATCH 4/6] x86, nmi: Remove 'reason' value from unknown nmi output Don Zickus
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Don Zickus @ 2014-05-15 19:25 UTC (permalink / raw)
  To: x86; +Cc: Peter Zijlstra, Andi Kleen, gong.chen, LKML, Elliott, Don Zickus

These options are accessible through /proc/sys/kernel but not on the command
line.  The main use is for on-board controllers (iLO, DRAC, BMC) to be able
to issue an external NMI to bring down a hung box.

This just makes configuring a box a little easier.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 Documentation/kernel-parameters.txt |    9 +++++++++
 arch/x86/kernel/nmi.c               |   14 ++++++++++++++
 2 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index b6c67d5..a4056b5 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2339,6 +2339,15 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
 			timeout < 0: reboot immediately
 			Format: <timeout>
 
+	panic_on_unrecovered_nmi [X86]
+			Force a machine to panic if an unrecoverable NMI is
+			unclaimed.  This covers SERR or UNKNOWN NMI cases.
+
+	panic_on_io_nmi [X86]
+			Force a machine to panic if an IO NMI is unclaimed.
+			This covers external NMIs with no handlers associated
+			with them.
+
 	parkbd.port=	[HW] Parallel port number the keyboard adapter is
 			connected to, default is 0.
 			Format: <parport#>
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index b7c6f6b..0467f42 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -87,6 +87,20 @@ static int __init setup_unknown_nmi_panic(char *str)
 }
 __setup("unknown_nmi_panic", setup_unknown_nmi_panic);
 
+static int __init setup_panic_on_unrecovered_nmi(char *str)
+{
+	panic_on_unrecovered_nmi = 1;
+	return 1;
+}
+__setup("panic_on_unrecovered_nmi", setup_panic_on_unrecovered_nmi);
+
+static int __init setup_panic_on_io_nmi(char *str)
+{
+	panic_on_io_nmi = 1;
+	return 1;
+}
+__setup("panic_on_io_nmi", setup_panic_on_io_nmi);
+
 #define nmi_to_desc(type) (&nmi_desc[type])
 
 static u64 nmi_longest_ns = 1 * NSEC_PER_MSEC;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 4/6] x86, nmi: Remove 'reason' value from unknown nmi output
  2014-05-15 19:25 [PATCH 0/6 V2] x86, nmi: Various fixes and cleanups Don Zickus
                   ` (2 preceding siblings ...)
  2014-05-15 19:25 ` [PATCH 3/6] x86, nmi: Add boot line option 'panic_on_unrecovered_nmi' and 'panic_on_io_nmi' Don Zickus
@ 2014-05-15 19:25 ` Don Zickus
  2014-05-15 19:25 ` [PATCH 5/6] x86, nmi: Move default external NMI handler to its own routine Don Zickus
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 17+ messages in thread
From: Don Zickus @ 2014-05-15 19:25 UTC (permalink / raw)
  To: x86; +Cc: Peter Zijlstra, Andi Kleen, gong.chen, LKML, Elliott, Don Zickus

The 'reason' value is from the io port for the external NMI.

We already have handlers that deal with a non-zero 'reason'.  Providing
it in the unknown NMI output doesn't add much.

In addition, it should lower the number of questions I get when
customers keep attaching new unknown NMI panic updates because the
reason value changes (which is expected because part of it is a
timer bit that toggles all the time).

However, the main motivation is the next patch, which moves the
external NMI check to another function, after which the reason value
is no longer available here.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 arch/x86/kernel/nmi.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 0467f42..a5835ad 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -388,7 +388,7 @@ io_check_error(unsigned char reason, struct pt_regs *regs)
 }
 
 static __kprobes void
-unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
+unknown_nmi_error(struct pt_regs *regs)
 {
 	int handled;
 
@@ -406,8 +406,8 @@ unknown_nmi_error(unsigned char reason, struct pt_regs *regs)
 
 	__this_cpu_add(nmi_stats.unknown, 1);
 
-	pr_emerg("Uhhuh. NMI received for unknown reason %02x on CPU %d.\n",
-		 reason, smp_processor_id());
+	pr_emerg("Uhhuh. NMI received for unknown reason on CPU %d.\n",
+		 smp_processor_id());
 
 	pr_emerg("Do you have a strange power saving mode enabled?\n");
 	if (unknown_nmi_panic || panic_on_unrecovered_nmi)
@@ -533,7 +533,7 @@ static __kprobes void default_do_nmi(struct pt_regs *regs)
 	if (b2b && __this_cpu_read(swallow_nmi))
 		__this_cpu_add(nmi_stats.swallow, 1);
 	else
-		unknown_nmi_error(reason, regs);
+		unknown_nmi_error(regs);
 }
 
 /*
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 5/6] x86, nmi: Move default external NMI handler to its own routine
  2014-05-15 19:25 [PATCH 0/6 V2] x86, nmi: Various fixes and cleanups Don Zickus
                   ` (3 preceding siblings ...)
  2014-05-15 19:25 ` [PATCH 4/6] x86, nmi: Remove 'reason' value from unknown nmi output Don Zickus
@ 2014-05-15 19:25 ` Don Zickus
  2014-05-21 10:38   ` Peter Zijlstra
  2014-05-15 19:25 ` [PATCH 6/6 V2] x86, nmi: Add better NMI stats to /proc/interrupts and show handlers Don Zickus
  2014-05-15 20:28 ` [PATCH 0/6 V2] x86, nmi: Various fixes and cleanups Don Zickus
  6 siblings, 1 reply; 17+ messages in thread
From: Don Zickus @ 2014-05-15 19:25 UTC (permalink / raw)
  To: x86
  Cc: Peter Zijlstra, Andi Kleen, gong.chen, LKML, Elliott, Don Zickus,
	thomas.mingarelli

Now that we have set up an NMI subtype called NMI_EXT, there is really
no need to hard-code the default external NMI handler in the main
NMI handler routine.

Move it to a proper function and register it on boot.  This change is
just code movement.

In addition, update the hpwdt to allow it to unregister the default
handler on its registration (and vice versa).  This allows the driver
to take control of that io port (which it ultimately wanted to do
originally), but in a cleaner way.

Tested by HP to make sure I didn't break anything. :-)

Cc: thomas.mingarelli@hp.com
Signed-off-by: Don Zickus <dzickus@redhat.com>
---
 arch/x86/include/asm/nmi.h |    3 ++
 arch/x86/kernel/nmi.c      |   81 +++++++++++++++++++++++++++----------------
 drivers/watchdog/hpwdt.c   |   14 +++++++
 3 files changed, 68 insertions(+), 30 deletions(-)

diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index 8631a49..c8ffab3 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -64,4 +64,7 @@ void stop_nmi(void);
 void restart_nmi(void);
 void local_touch_nmi(void);
 
+int register_nmi_default_external_handler(void);
+void unregister_nmi_default_external_handler(void);
+
 #endif /* _ASM_X86_NMI_H */
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index a5835ad..458be2a 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -74,11 +74,6 @@ static DEFINE_PER_CPU(struct nmi_stats, nmi_stats);
 static int ignore_nmis;
 
 int unknown_nmi_panic;
-/*
- * Prevent NMI reason port (0x61) being accessed simultaneously, can
- * only be used in NMI handler.
- */
-static DEFINE_RAW_SPINLOCK(nmi_reason_lock);
 
 static int __init setup_unknown_nmi_panic(char *str)
 {
@@ -421,7 +416,6 @@ static DEFINE_PER_CPU(unsigned long, last_nmi_rip);
 
 static __kprobes void default_do_nmi(struct pt_regs *regs)
 {
-	unsigned char reason = 0;
 	int handled;
 	bool b2b = false;
 
@@ -472,30 +466,6 @@ static __kprobes void default_do_nmi(struct pt_regs *regs)
 		return;
 	}
 
-	/* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
-	raw_spin_lock(&nmi_reason_lock);
-	reason = x86_platform.get_nmi_reason();
-
-	if (reason & NMI_REASON_MASK) {
-		if (reason & NMI_REASON_SERR)
-			pci_serr_error(reason, regs);
-		else if (reason & NMI_REASON_IOCHK)
-			io_check_error(reason, regs);
-#ifdef CONFIG_X86_32
-		/*
-		 * Reassert NMI in case it became active
-		 * meanwhile as it's edge-triggered:
-		 */
-		reassert_nmi();
-#endif
-		__this_cpu_add(nmi_stats.external, 1);
-		raw_spin_unlock(&nmi_reason_lock);
-		/* kick off delayed work in case we swallowed external NMI */
-		nmi_queue_work();
-		return;
-	}
-	raw_spin_unlock(&nmi_reason_lock);
-
 	/* expected delayed queued NMI? Don't flag as unknown */
 	if (nmi_queue_work_clear())
 		return;
@@ -537,6 +507,57 @@ static __kprobes void default_do_nmi(struct pt_regs *regs)
 }
 
 /*
+ * Prevent NMI reason port (0x61) being accessed simultaneously, can
+ * only be used in NMI handler.
+ */
+static DEFINE_RAW_SPINLOCK(nmi_reason_lock);
+
+static int __kprobes
+nmi_default_external_handler(unsigned int cmd, struct pt_regs *regs)
+{
+	unsigned char reason = 0;
+	int ret = NMI_DONE;
+
+	/* Non-CPU-specific NMI: NMI sources can be processed on any CPU */
+	raw_spin_lock(&nmi_reason_lock);
+	reason = x86_platform.get_nmi_reason();
+
+	if (reason & NMI_REASON_MASK) {
+		if (reason & NMI_REASON_SERR)
+			pci_serr_error(reason, regs);
+		else if (reason & NMI_REASON_IOCHK)
+			io_check_error(reason, regs);
+#ifdef CONFIG_X86_32
+		/*
+		 * Reassert NMI in case it became active
+		 * meanwhile as it's edge-triggered:
+		 */
+		reassert_nmi();
+#endif
+		ret = NMI_HANDLED;
+	}
+	raw_spin_unlock(&nmi_reason_lock);
+
+	return ret;
+}
+
+int register_nmi_default_external_handler(void)
+{
+	register_nmi_handler(NMI_EXT, nmi_default_external_handler,
+				0, "plat");
+
+	return 0;
+}
+EXPORT_SYMBOL(register_nmi_default_external_handler);
+early_initcall(register_nmi_default_external_handler);
+
+void unregister_nmi_default_external_handler(void)
+{
+	unregister_nmi_handler(NMI_EXT, "plat");
+}
+EXPORT_SYMBOL(unregister_nmi_default_external_handler);
+
+/*
  * NMIs can hit breakpoints which will cause it to lose its
  * NMI context with the CPU when the breakpoint does an iret.
  */
diff --git a/drivers/watchdog/hpwdt.c b/drivers/watchdog/hpwdt.c
index dd70ee7..10e7813 100644
--- a/drivers/watchdog/hpwdt.c
+++ b/drivers/watchdog/hpwdt.c
@@ -37,6 +37,7 @@
 #include <asm/cacheflush.h>
 #endif /* CONFIG_HPWDT_NMI_DECODING */
 #include <asm/nmi.h>
+#include <asm/mach_traps.h>
 
 #define HPWDT_VERSION			"1.3.3"
 #define SECS_TO_TICKS(secs)		((secs) * 1000 / 128)
@@ -486,6 +487,15 @@ static int hpwdt_pretimeout(unsigned int ulReason, struct pt_regs *regs)
 		goto out;
 
 	spin_lock_irqsave(&rom_lock, rom_pl);
+	if (ulReason == NMI_EXT) {
+		unsigned char reason = 0;
+		reason = x86_platform.get_nmi_reason();
+
+		if (!(reason & NMI_REASON_MASK)) {
+			spin_unlock_irqrestore(&rom_lock, rom_pl);
+			goto out;
+		}
+	}
 	if (!die_nmi_called && !is_icru && !is_uefi)
 		asminline_call(&cmn_regs, cru_rom_addr);
 	die_nmi_called = 1;
@@ -740,6 +750,9 @@ static int hpwdt_init_nmi_decoding(struct pci_dev *dev)
 	if (retval)
 		goto error1;
 
+	/* hpwdt is the new external port 0x61 handler */
+	unregister_nmi_default_external_handler();
+
 	dev_info(&dev->dev,
 			"HP Watchdog Timer Driver: NMI decoding initialized"
 			", allow kernel dump: %s (default = 0/OFF)\n",
@@ -761,6 +774,7 @@ static void hpwdt_exit_nmi_decoding(void)
 {
 	unregister_nmi_handler(NMI_UNKNOWN, "hpwdt");
 	unregister_nmi_handler(NMI_EXT, "hpwdt");
+	register_nmi_default_external_handler();
 	if (cru_rom_addr)
 		iounmap(cru_rom_addr);
 }
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 6/6 V2] x86, nmi: Add better NMI stats to /proc/interrupts and show handlers
  2014-05-15 19:25 [PATCH 0/6 V2] x86, nmi: Various fixes and cleanups Don Zickus
                   ` (4 preceding siblings ...)
  2014-05-15 19:25 ` [PATCH 5/6] x86, nmi: Move default external NMI handler to its own routine Don Zickus
@ 2014-05-15 19:25 ` Don Zickus
  2014-05-15 20:28 ` [PATCH 0/6 V2] x86, nmi: Various fixes and cleanups Don Zickus
  6 siblings, 0 replies; 17+ messages in thread
From: Don Zickus @ 2014-05-15 19:25 UTC (permalink / raw)
  To: x86; +Cc: Peter Zijlstra, Andi Kleen, gong.chen, LKML, Elliott, Don Zickus

The main reason for this patch is because I have a hard time knowing
what NMI handlers are registered on the system when debugging NMI issues.

This info is provided in /proc/interrupts for interrupt handlers, so I
added support for the NMI handlers too.  As a bonus it provides stat
breakdowns much like the other interrupt lines.

[root@dhcp71-248 ~]# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  0:         32          0          0          0  IR-IO-APIC-edge      timer
<snip>
NMI:         13         15          7          6   Non-maskable interrupts
NLC:         12         13          8          7   NMI: Local  (PMI, arch_bt)
NXT:          0          0          0          0   NMI: External  (plat)
NUN:          0          0          0          0   NMI: Unknown
NSW:          0          0          0          0   NMI: Swallowed
LOC:      23768      18248      14187      15891   Local timer interrupts
SPU:          0          0          0          0   Spurious interrupts
PMI:         11         12          7          6   Performance monitoring interrupts
IWI:       2131       1196        862       1113   IRQ work interrupts

The extra 'local' NMI count represents the number of NMI events 'handled',
whereas the general NMI count represents how many actual NMIs were
processed.  IOW, two NMI events can come in at once during one call.

I am open to better suggestions.

Signed-off-by: Don Zickus <dzickus@redhat.com>
---
V2: modified output based on feedback (Ingo Molnar, Rob Elliott)
---
 arch/x86/include/asm/nmi.h |    1 +
 arch/x86/kernel/irq.c      |    3 ++
 arch/x86/kernel/nmi.c      |   57 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index c8ffab3..cbb49a2 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -63,6 +63,7 @@ void unregister_nmi_handler(unsigned int, const char *);
 void stop_nmi(void);
 void restart_nmi(void);
 void local_touch_nmi(void);
+void nmi_show_interrupts(struct seq_file *, int);
 
 int register_nmi_default_external_handler(void);
 void unregister_nmi_default_external_handler(void);
diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 42805fa..a969f87 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -17,6 +17,7 @@
 #include <asm/idle.h>
 #include <asm/mce.h>
 #include <asm/hw_irq.h>
+#include <asm/nmi.h>
 
 #define CREATE_TRACE_POINTS
 #include <asm/trace/irq_vectors.h>
@@ -59,6 +60,8 @@ int arch_show_interrupts(struct seq_file *p, int prec)
 	for_each_online_cpu(j)
 		seq_printf(p, "%10u ", irq_stats(j)->__nmi_count);
 	seq_printf(p, "  Non-maskable interrupts\n");
+	nmi_show_interrupts(p, prec);
+
 #ifdef CONFIG_X86_LOCAL_APIC
 	seq_printf(p, "%*s: ", prec, "LOC");
 	for_each_online_cpu(j)
diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c
index 458be2a..f54fea3 100644
--- a/arch/x86/kernel/nmi.c
+++ b/arch/x86/kernel/nmi.c
@@ -506,6 +506,63 @@ static __kprobes void default_do_nmi(struct pt_regs *regs)
 		unknown_nmi_error(regs);
 }
 
+static void print_nmi_action_name(struct seq_file *p, int type)
+{
+	struct nmi_desc *desc;
+	struct nmiaction *a;
+
+	rcu_read_lock();
+
+	desc = nmi_to_desc(type);
+
+	if (!(list_empty(&desc->head))) {
+
+		a = list_entry_rcu(desc->head.next, struct nmiaction, list);
+		seq_printf(p, "  (%s", a->name);
+
+		list_for_each_entry_continue_rcu(a, &desc->head, list)
+			seq_printf(p, ", %s", a->name);
+
+		seq_printf(p, ")");
+	}
+	seq_puts(p, "\n");
+
+	rcu_read_unlock();
+}
+
+void nmi_show_interrupts(struct seq_file *p, int prec)
+{
+	int j;
+
+#define get_nmi_stats(j)	(&per_cpu(nmi_stats, j))
+
+	seq_printf(p, "%*s: ", prec, "NLC");
+	for_each_online_cpu(j)
+		seq_printf(p, "%10u ", get_nmi_stats(j)->normal);
+	seq_printf(p, " %-8s", " NMI: Local");
+
+	print_nmi_action_name(p, NMI_LOCAL);
+
+	seq_printf(p, "%*s: ", prec, "NXT");
+	for_each_online_cpu(j)
+		seq_printf(p, "%10u ", get_nmi_stats(j)->external);
+	seq_printf(p, " %-8s", " NMI: External");
+
+	print_nmi_action_name(p, NMI_EXT);
+
+	seq_printf(p, "%*s: ", prec, "NUN");
+	for_each_online_cpu(j)
+		seq_printf(p, "%10u ", get_nmi_stats(j)->unknown);
+	seq_printf(p, " %-8s", " NMI: Unknown");
+
+	print_nmi_action_name(p, NMI_UNKNOWN);
+
+	seq_printf(p, "%*s: ", prec, "NSW");
+	for_each_online_cpu(j)
+		seq_printf(p, "%10u ", get_nmi_stats(j)->swallow);
+	seq_printf(p, " %-8s", " NMI: Swallowed\n");
+}
+
 /*
  * Prevent NMI reason port (0x61) being accessed simultaneously, can
  * only be used in NMI handler.
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/6 V2] x86, nmi: Various fixes and cleanups
  2014-05-15 19:25 [PATCH 0/6 V2] x86, nmi: Various fixes and cleanups Don Zickus
                   ` (5 preceding siblings ...)
  2014-05-15 19:25 ` [PATCH 6/6 V2] x86, nmi: Add better NMI stats to /proc/interrupts and show handlers Don Zickus
@ 2014-05-15 20:28 ` Don Zickus
  6 siblings, 0 replies; 17+ messages in thread
From: Don Zickus @ 2014-05-15 20:28 UTC (permalink / raw)
  To: x86; +Cc: Peter Zijlstra, Andi Kleen, gong.chen, LKML, Elliott

On Thu, May 15, 2014 at 03:25:43PM -0400, Don Zickus wrote:
> I started this patch by fixing a performance problem with the GHES
> NMI handler and then things evolved to more patches as I was poking
> around in the code.
> 
> The main focus was moving the GHES NMI driver to its own NMI subtype
> to avoid slowing down perf handling.  Then I decided to move the
> default external NMI handler to its own routine.  Then finally, I
> needed to see which NMI handlers were registered so I hacked up
> /proc/interrupts to show me.
> 
> Tested mostly on HP boxes that have GHES enabled and having the iLO
> send NMIs to panic the box (using hpwdt driver).  Ran perf on other
> GHES enabled boxes to test performance results.
> 
> V2: adding irq_work items to handle possible lost NMIs (new patch 1)
>     modified output of /proc/interrupts based on feedback (patch 6)
> 

ugh.  Forgot to snip below..
> Don Zickus (5):
>   x86, nmi:  Add new nmi type 'external'
>   x86, nmi: Add boot line option 'panic_on_unrecovered_nmi' and
>     'panic_on_io_nmi'
>   x86, nmi: Remove 'reason' value from unknown nmi output
>   x86, nmi: Move default external NMI handler to its own routine
>   x86, nmi: Add better NMI stats to /proc/interrupts and show handlers
> 
>  Documentation/kernel-parameters.txt |    9 ++
>  arch/x86/include/asm/nmi.h          |    5 +
>  arch/x86/kernel/irq.c               |    3 +
>  arch/x86/kernel/nmi.c               |  199 ++++++++++++++++++++++++++--------
>  drivers/acpi/apei/ghes.c            |    4 +-
>  drivers/watchdog/hpwdt.c            |   24 +++--
>  6 files changed, 187 insertions(+), 57 deletions(-)
> 
> *** BLURB HERE ***

.. to here.

Cheers,
Don

> 
> Don Zickus (6):
>   x86, nmi:  Implement delayed irq_work mechanism to handle lost NMIs
>   x86, nmi:  Add new nmi type 'external'
>   x86, nmi: Add boot line option 'panic_on_unrecovered_nmi' and
>     'panic_on_io_nmi'
>   x86, nmi: Remove 'reason' value from unknown nmi output
>   x86, nmi: Move default external NMI handler to its own routine
>   x86, nmi: Add better NMI stats to /proc/interrupts and show handlers
> 
>  Documentation/kernel-parameters.txt |    9 +
>  arch/x86/include/asm/nmi.h          |    5 +
>  arch/x86/kernel/irq.c               |    3 +
>  arch/x86/kernel/nmi.c               |  302 +++++++++++++++++++++++++++++-----
>  drivers/acpi/apei/ghes.c            |    4 +-
>  drivers/watchdog/hpwdt.c            |   24 ++-
>  6 files changed, 292 insertions(+), 55 deletions(-)
> 


* Re: [PATCH 1/6] x86, nmi:  Implement delayed irq_work mechanism to handle lost NMIs
  2014-05-15 19:25 ` [PATCH 1/6] x86, nmi: Implement delayed irq_work mechanism to handle lost NMIs Don Zickus
@ 2014-05-21 10:29   ` Peter Zijlstra
  2014-05-21 16:45     ` Don Zickus
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-21 10:29 UTC (permalink / raw)
  To: Don Zickus; +Cc: x86, Andi Kleen, gong.chen, LKML, Elliott, fweisbec

On Thu, May 15, 2014 at 03:25:44PM -0400, Don Zickus wrote:
> +DEFINE_PER_CPU(bool, nmi_delayed_work_pending);
> +
> +static void nmi_delayed_work_func(struct irq_work *irq_work)
> +{
> +	DECLARE_BITMAP(nmi_mask, NR_CPUS);

That's _far_ too big for on-stack, 4k cpus would make that 512 bytes.

> +	cpumask_t *mask;
> +
> +	preempt_disable();

That's superfluous, irq_work's are guaranteed to be called with IRQs
disabled.

> +
> +	/*
> +	 * Can't use send_IPI_self here because it will
> +	 * send an NMI in IRQ context which is not what
> +	 * we want.  Create a cpumask for local cpu and
> +	 * force an IPI the normal way (not the shortcut).
> +	 */
> +	bitmap_zero(nmi_mask, NR_CPUS);
> +	mask = to_cpumask(nmi_mask);
> +	cpu_set(smp_processor_id(), *mask);
> +
> +	__this_cpu_xchg(nmi_delayed_work_pending, true);

Why is this xchg and not __this_cpu_write() ?

> +	apic->send_IPI_mask(to_cpumask(nmi_mask), NMI_VECTOR);

What's wrong with apic->send_IPI_self(NMI_VECTOR); ?

> +
> +	preempt_enable();
> +}
> +
> +struct irq_work nmi_delayed_work =
> +{
> +	.func	= nmi_delayed_work_func,
> +	.flags	= IRQ_WORK_LAZY,
> +};

OK, so I don't particularly like the LAZY stuff and was hoping to remove
it before more users could show up... apparently I'm too late :-(

Frederic, I suppose this means dual lists.

> +static bool nmi_queue_work_clear(void)
> +{
> +	bool set = __this_cpu_read(nmi_delayed_work_pending);
> +
> +	__this_cpu_write(nmi_delayed_work_pending, false);
> +
> +	return set;
> +}

That's a test-and-clear, the name doesn't reflect this. And here you do
_not_ use xchg where you actually could have.

That said, try and avoid using xchg() its unconditionally serialized.

> +
> +static int nmi_queue_work(void)
> +{
> +	bool queued = irq_work_queue(&nmi_delayed_work);
> +
> +	if (queued) {
> +		/*
> +		 * If the delayed NMI actually finds a 'dropped' NMI, the
> +		 * work pending bit will never be cleared.  A new delayed
> +		 * work NMI is supposed to be sent in that case.  But there
> +		 * is no guarantee that the same cpu will be used.  So
> +		 * pro-actively clear the flag here (the new self-IPI will
> +		 * re-set it.
> +		 *
> +		 * However, there is a small chance that a real NMI and the
> +		 * simulated one occur at the same time.  What happens is the
> +		 * simulated IPI NMI sets the work_pending flag and then sends
> +		 * the IPI.  At this point the irq_work allows a new work
> +		 * event.  So when the simulated IPI is handled by a real NMI
> +		 * handler it comes in here to queue more work.  Because
> +		 * irq_work returns success, the work_pending bit is cleared.
> +		 * The second part of the back-to-back NMI is kicked off, the
> +		 * work_pending bit is not set and an unknown NMI is generated.
> +		 * Therefore check the BUSY bit before clearing.  The theory is
> +		 * if the BUSY bit is set, then there should be an NMI for this
> +		 * cpu latched somewhere and will be cleared when it runs.
> +		 */
> +		if (!(nmi_delayed_work.flags & IRQ_WORK_BUSY))
> +			nmi_queue_work_clear();

So I'm utterly and completely failing to parse that. It just doesn't
make sense.

> +	}
> +
> +	return 0;
> +}

Why does this function have a return value if all it can return is 0 and
everybody ignores it?

> +
>  static int __kprobes nmi_handle(unsigned int type, struct pt_regs *regs, bool b2b)
>  {
>  	struct nmi_desc *desc = nmi_to_desc(type);
> @@ -341,6 +441,9 @@ static __kprobes void default_do_nmi(struct pt_regs *regs)
>  		 */
>  		if (handled > 1)
>  			__this_cpu_write(swallow_nmi, true);
> +
> +		/* kick off delayed work in case we swallowed external NMI */

That's inaccurate, there's no guarantee we actually swallowed one
afaict, this is where we have to assume we lost one because there's
really no other place.

> +		nmi_queue_work();
>  		return;
>  	}
>  
> @@ -362,10 +465,16 @@ static __kprobes void default_do_nmi(struct pt_regs *regs)
>  #endif
>  		__this_cpu_add(nmi_stats.external, 1);
>  		raw_spin_unlock(&nmi_reason_lock);
> +		/* kick off delayed work in case we swallowed external NMI */
> +		nmi_queue_work();

Again, inaccurate, there's no guarantee we did swallow an external NMI,
but the thing is, there's no guarantee we didn't either, which is why we
need to do this.

>  		return;
>  	}
>  	raw_spin_unlock(&nmi_reason_lock);
>  
> +	/* expected delayed queued NMI? Don't flag as unknown */
> +	if (nmi_queue_work_clear())
> +		return;
> +

Right, so here we effectively swallow the extra nmi and avoid the
unknown_nmi_error() bits.




* Re: [PATCH 5/6] x86, nmi: Move default external NMI handler to its own routine
  2014-05-15 19:25 ` [PATCH 5/6] x86, nmi: Move default external NMI handler to its own routine Don Zickus
@ 2014-05-21 10:38   ` Peter Zijlstra
  2014-05-21 16:48     ` Don Zickus
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-21 10:38 UTC (permalink / raw)
  To: Don Zickus; +Cc: x86, Andi Kleen, gong.chen, LKML, Elliott, thomas.mingarelli

On Thu, May 15, 2014 at 03:25:48PM -0400, Don Zickus wrote:
> Now that we have set up an NMI subtype called NMI_EXT, there is really
> no need to hard code the default external NMI handler in the main
> nmi handler routine.
> 
> Move it to a proper function and register it on boot.  This change is
> just code movement.
> 
> In addition, update the hpwdt to allow it to unregister the default
> handler on its registration (and vice versa).  This allows the driver
> to take control of that io port (which it ultimately wanted to do
> originally), but in a cleaner way.

wanting that is one thing, but is it also a sane thing? You don't do
things just because drivers want them.


* Re: [PATCH 1/6] x86, nmi:  Implement delayed irq_work mechanism to handle lost NMIs
  2014-05-21 10:29   ` Peter Zijlstra
@ 2014-05-21 16:45     ` Don Zickus
  2014-05-21 17:51       ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Don Zickus @ 2014-05-21 16:45 UTC (permalink / raw)
  To: Peter Zijlstra; +Cc: x86, Andi Kleen, gong.chen, LKML, Elliott, fweisbec

On Wed, May 21, 2014 at 12:29:34PM +0200, Peter Zijlstra wrote:
> On Thu, May 15, 2014 at 03:25:44PM -0400, Don Zickus wrote:
> > +DEFINE_PER_CPU(bool, nmi_delayed_work_pending);
> > +
> > +static void nmi_delayed_work_func(struct irq_work *irq_work)
> > +{
> > +	DECLARE_BITMAP(nmi_mask, NR_CPUS);
> 
> That's _far_ too big for on-stack, 4k cpus would make that 512 bytes.
> 
> > +	cpumask_t *mask;
> > +
> > +	preempt_disable();
> 
> That's superfluous, irq_work's are guaranteed to be called with IRQs
> disabled.
> 
> > +
> > +	/*
> > +	 * Can't use send_IPI_self here because it will
> > +	 * send an NMI in IRQ context which is not what
> > +	 * we want.  Create a cpumask for local cpu and
> > +	 * force an IPI the normal way (not the shortcut).
> > +	 */
> > +	bitmap_zero(nmi_mask, NR_CPUS);
> > +	mask = to_cpumask(nmi_mask);
> > +	cpu_set(smp_processor_id(), *mask);
> > +
> > +	__this_cpu_xchg(nmi_delayed_work_pending, true);
> 
> Why is this xchg and not __this_cpu_write() ?
> 
> > +	apic->send_IPI_mask(to_cpumask(nmi_mask), NMI_VECTOR);
> 
> What's wrong with apic->send_IPI_self(NMI_VECTOR); ?

I tried to explain that in my comment above.  IPI_self uses the shortcut
method to send IPIs which means the NMI_VECTOR will be delivered in IRQ
context _not_ NMI context. :-(  This is why I do the whole silly dance.

I am open to better options.  We might be able to patch IPI_self but I am
not sure why it didn't special case the NMI_VECTOR to begin with (like the
other handlers seem to do).

> 
> > +
> > +	preempt_enable();
> > +}
> > +
> > +struct irq_work nmi_delayed_work =
> > +{
> > +	.func	= nmi_delayed_work_func,
> > +	.flags	= IRQ_WORK_LAZY,
> > +};
> 
> OK, so I don't particularly like the LAZY stuff and was hoping to remove
> it before more users could show up... apparently I'm too late :-(
> 
> Frederic, I suppose this means dual lists.

Again I am open to suggestions here.  I just wanted a little delay somehow
(a few ticks ideally).  The IRQ_WORK_LAZY seemed to be good enough for me
for now so I chose it.

> 
> > +static bool nmi_queue_work_clear(void)
> > +{
> > +	bool set = __this_cpu_read(nmi_delayed_work_pending);
> > +
> > +	__this_cpu_write(nmi_delayed_work_pending, false);
> > +
> > +	return set;
> > +}
> 
> That's a test-and-clear, the name doesn't reflect this. And here you do
> _not_ use xchg where you actually could have.
> 
> That said, try and avoid using xchg() its unconditionally serialized.

Ok. test-and-clear makes more sense.

> 
> > +
> > +static int nmi_queue_work(void)
> > +{
> > +	bool queued = irq_work_queue(&nmi_delayed_work);
> > +
> > +	if (queued) {
> > +		/*
> > +		 * If the delayed NMI actually finds a 'dropped' NMI, the
> > +		 * work pending bit will never be cleared.  A new delayed
> > +		 * work NMI is supposed to be sent in that case.  But there
> > +		 * is no guarantee that the same cpu will be used.  So
> > +		 * pro-actively clear the flag here (the new self-IPI will
> > +		 * re-set it.
> > +		 *
> > +		 * However, there is a small chance that a real NMI and the
> > +		 * simulated one occur at the same time.  What happens is the
> > +		 * simulated IPI NMI sets the work_pending flag and then sends
> > +		 * the IPI.  At this point the irq_work allows a new work
> > +		 * event.  So when the simulated IPI is handled by a real NMI
> > +		 * handler it comes in here to queue more work.  Because
> > +		 * irq_work returns success, the work_pending bit is cleared.
> > +		 * The second part of the back-to-back NMI is kicked off, the
> > +		 * work_pending bit is not set and an unknown NMI is generated.
> > +		 * Therefore check the BUSY bit before clearing.  The theory is
> > +		 * if the BUSY bit is set, then there should be an NMI for this
> > +		 * cpu latched somewhere and will be cleared when it runs.
> > +		 */
> > +		if (!(nmi_delayed_work.flags & IRQ_WORK_BUSY))
> > +			nmi_queue_work_clear();
> 
> So I'm utterly and completely failing to parse that. It just doesn't
> make sense.

Ok.  Sorry about that.  There are two issues here.  One is hypothetical, I
think, and the second I stumbled upon while trying to solve the first one.

Let me try to explain this again and see if I can do a better job.

The original problem is that two NMIs might show up with the second one
being dropped because they are latched waiting for the first NMI to
finish (something like that).

Currently all the NMI handlers are processed regardless if the first one
on the list handled the NMI or not.  This helps avoid the problem
described above.

Now a new wrinkle.  There are two NMI handlers (legacy port 0x61 and GHES)
that require a spin lock to access the hardware safely.  This is obviously
a big latency hit.  Not a problem if the NMI volume is low.

However, perf makes NMI a high volume interrupt. :-)

The question is how to handle the high volume PMI without incurring the
latency hit of the spin locks.

I posted a patch to split these into two queues: internal and external
(external being the ones that use a spin lock).

Ingo pointed out the 'original' problem again and suggested using irq_work
to solve it by passing in a dummy NMI periodically to look for 'stale??'
NMIs.


Implementing Ingo's idea seemed to imply most of the time the dummy NMI
would pass through and cause unknown errors unless I caught it (using that
test-and-clear flag above).  Seemed acceptable.

Say a dummy NMI came through and actually _found_ a 'stale' NMI (like it
was supposed to do).  What happens to the 'flag'?  It doesn't get cleared.

To take care of that, I just keep creating 'dummy' NMIs until it falls all
the way through and clears the flag.




Now to my 'comment' that doesn't make sense.

I was theorizing what happens if a dummy NMI came through and found a 'stale'
NMI?  My patch was supposed to generate another 'dummy' NMI to clear the
'flag'.  But are there any guarantees that the new 'dummy' NMI is sent to
the same cpu?  If not, then the flag is never cleared.  Which means a
later 'real' unknown NMI would get swallowed accidentally.

That is what the first half of my comment was trying to say.

I tried to solve this by always clearing the 'flag' upon re-queue of the
irq_work event. [The flag is always 'set' in the irq_work function]

This way if the irq_work event was scheduled on a different cpu, the
'flag' would not stay set forever.


The second half tries to talk about a problem I uncovered while trying to
solve the first problem, which is:

The irq_work function is called and the first thing it does is set a
'flag' (to prevent the accidentally unknown NMI).  Then it calls the IPI,
which immediately takes the NMI.

Now, oddly enough, there were situations where the NMI happened and it found
a PMI event.  So my patch re-queued the irq_work (after clearing the
'flag' based on problem 1 above).  However, the PMI and self IPI happened
at roughly the same time, such that after the IRET of the PMI NMI, the
'dummy' NMI is immediately executed (think back-to-back NMIs).

This NMI is supposed to be the 'dummy' NMI caught below with the 'flag'
set.  However, the 'flag' was cleared upon re-queue of the previous PMI
NMI.  This leads to a periodic unknown NMI. Ooops.

I tried to work around this by checking the IRQ_BUSY flag and not clearing it
in that case.


/me takes a break..



So both my problems center around what guarantees does irq_work have to
stay on the same cpu?

Or better yet, is there a different way to solve this problem?



> 
> > +	}
> > +
> > +	return 0;
> > +}
> 
> Why does this function have a return value if all it can return is 0 and
> everybody ignores it?

Good question.  I can remove that.

> 
> > +
> >  static int __kprobes nmi_handle(unsigned int type, struct pt_regs *regs, bool b2b)
> >  {
> >  	struct nmi_desc *desc = nmi_to_desc(type);
> > @@ -341,6 +441,9 @@ static __kprobes void default_do_nmi(struct pt_regs *regs)
> >  		 */
> >  		if (handled > 1)
> >  			__this_cpu_write(swallow_nmi, true);
> > +
> > +		/* kick off delayed work in case we swallowed external NMI */
> 
> That's inaccurate, there's no guarantee we actually swallowed one
> afaict, this is where we have to assume we lost one because there's
> really no other place.
> 
> > +		nmi_queue_work();
> >  		return;
> >  	}
> >  
> > @@ -362,10 +465,16 @@ static __kprobes void default_do_nmi(struct pt_regs *regs)
> >  #endif
> >  		__this_cpu_add(nmi_stats.external, 1);
> >  		raw_spin_unlock(&nmi_reason_lock);
> > +		/* kick off delayed work in case we swallowed external NMI */
> > +		nmi_queue_work();
> 
> Again, inaccurate, there's no guarantee we did swallow an external NMI,
> but the thing is, there's no guarantee we didn't either, which is why we
> need to do this.

Ok, I thought adding 'in case' would imply what you suggested.  I can try
to re-word it a little more.  Something like:

   /*
    * we don't know if we swallowed an NMI shared with another handler or
    * not, kick off delayed work to make sure everyone sees it.
    */


> 
> >  		return;
> >  	}
> >  	raw_spin_unlock(&nmi_reason_lock);
> >  
> > +	/* expected delayed queued NMI? Don't flag as unknown */
> > +	if (nmi_queue_work_clear())
> > +		return;
> > +
> 
> Right, so here we effectively swallow the extra nmi and avoid the
> unknown_nmi_error() bits.

Unfortunately, yes.

Cheers,
Don


* Re: [PATCH 5/6] x86, nmi: Move default external NMI handler to its own routine
  2014-05-21 10:38   ` Peter Zijlstra
@ 2014-05-21 16:48     ` Don Zickus
  2014-05-21 18:17       ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Don Zickus @ 2014-05-21 16:48 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, Andi Kleen, gong.chen, LKML, Elliott, thomas.mingarelli

On Wed, May 21, 2014 at 12:38:46PM +0200, Peter Zijlstra wrote:
> On Thu, May 15, 2014 at 03:25:48PM -0400, Don Zickus wrote:
> > Now that we have set up an NMI subtype called NMI_EXT, there is really
> > no need to hard code the default external NMI handler in the main
> > nmi handler routine.
> > 
> > Move it to a proper function and register it on boot.  This change is
> > just code movement.
> > 
> > In addition, update the hpwdt to allow it to unregister the default
> > handler on its registration (and vice versa).  This allows the driver
> > to take control of that io port (which it ultimately wanted to do
> > originally), but in a cleaner way.
> 
> wanting that is one thing, but is it also a sane thing? You don't do
> things just because drivers want them.

Heh. I understand.

Today, I have hacked up the SERR and IOCHK handlers to give hpwdt the
chance to do its 'magic' bios call to collect information before
panic'ing.

I was trying to clean things up by removing those hacks, but I guess I can
see your point, there is no guarantee they handle the hardware correctly.
:-/

Cheers,
Don



* Re: [PATCH 1/6] x86, nmi:  Implement delayed irq_work mechanism to handle lost NMIs
  2014-05-21 16:45     ` Don Zickus
@ 2014-05-21 17:51       ` Peter Zijlstra
  2014-05-21 19:02         ` Don Zickus
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-21 17:51 UTC (permalink / raw)
  To: Don Zickus
  Cc: x86, Andi Kleen, gong.chen, LKML, Elliott, fweisbec, dave.hansen

On Wed, May 21, 2014 at 12:45:25PM -0400, Don Zickus wrote:
> > > +	/*
> > > +	 * Can't use send_IPI_self here because it will
> > > +	 * send an NMI in IRQ context which is not what
> > > +	 * we want.  Create a cpumask for local cpu and
> > > +	 * force an IPI the normal way (not the shortcut).
> > > +	 */
> > > +	bitmap_zero(nmi_mask, NR_CPUS);
> > > +	mask = to_cpumask(nmi_mask);
> > > +	cpu_set(smp_processor_id(), *mask);
> > > +
> > > +	__this_cpu_xchg(nmi_delayed_work_pending, true);
> > 
> > Why is this xchg and not __this_cpu_write() ?
> > 
> > > +	apic->send_IPI_mask(to_cpumask(nmi_mask), NMI_VECTOR);
> > 
> > What's wrong with apic->send_IPI_self(NMI_VECTOR); ?
> 
> I tried to explain that in my comment above.  IPI_self uses the shortcut
> method to send IPIs which means the NMI_VECTOR will be delivered in IRQ
> context _not_ NMI context. :-(  This is why I do the whole silly dance.

I'm still not getting it, probably because I don't know how these APICs
really work, but the way I read both the comment and your explanation
here is that we get an NMI nested in the IRQ context we called it from,
which is pretty much exactly what we want.

> I am open to better options.  We might be able to patch IPI_self but I am
> not sure why it didn't special case the NMI_VECTOR to begin with (like the
> other handlers seem to do).

cpumask_of(cpu) gives the right mask.

> > > +struct irq_work nmi_delayed_work =
> > > +{
> > > +	.func	= nmi_delayed_work_func,
> > > +	.flags	= IRQ_WORK_LAZY,
> > > +};
> > 
> > OK, so I don't particularly like the LAZY stuff and was hoping to remove
> > it before more users could show up... apparently I'm too late :-(
> > 
> > Frederic, I suppose this means dual lists.
> 
> Again I am open to suggestions here.  I just wanted a little delay somehow
> (a few ticks ideally).  The IRQ_WORK_LAZY seemed to be good enough for me
> for now so I chose it.

This might actually be a good enough usage of LAZY to not shoot it in
the head.

> > > +static int nmi_queue_work(void)
> > > +{
> > > +	bool queued = irq_work_queue(&nmi_delayed_work);
> > > +
> > > +	if (queued) {
> > > +		/*
> > > +		 * If the delayed NMI actually finds a 'dropped' NMI, the
> > > +		 * work pending bit will never be cleared.  A new delayed
> > > +		 * work NMI is supposed to be sent in that case.  But there
> > > +		 * is no guarantee that the same cpu will be used.  So
> > > +		 * pro-actively clear the flag here (the new self-IPI will
> > > +		 * re-set it.
> > > +		 *
> > > +		 * However, there is a small chance that a real NMI and the
> > > +		 * simulated one occur at the same time.  What happens is the
> > > +		 * simulated IPI NMI sets the work_pending flag and then sends
> > > +		 * the IPI.  At this point the irq_work allows a new work
> > > +		 * event.  So when the simulated IPI is handled by a real NMI
> > > +		 * handler it comes in here to queue more work.  Because
> > > +		 * irq_work returns success, the work_pending bit is cleared.
> > > +		 * The second part of the back-to-back NMI is kicked off, the
> > > +		 * work_pending bit is not set and an unknown NMI is generated.
> > > +		 * Therefore check the BUSY bit before clearing.  The theory is
> > > +		 * if the BUSY bit is set, then there should be an NMI for this
> > > +		 * cpu latched somewhere and will be cleared when it runs.
> > > +		 */
> > > +		if (!(nmi_delayed_work.flags & IRQ_WORK_BUSY))
> > > +			nmi_queue_work_clear();
> > 
> > So I'm utterly and completely failing to parse that. It just doesn't
> > make sense.
> 
> Ok.  Sorry about that.  There are two issues here.  One is hypothetical, I
> think, and the second I stumbled upon while trying to solve the first one.
> 
> Let me try to explain this again and see if I can do a better job.
> 
> The original problem is that two NMIs might show up with the second one
> being dropped because they are latched waiting for the first NMI to
> finish (something like that).
> 
> Currently all the NMI handlers are processed regardless if the first one
> on the list handled the NMI or not.  This helps avoid the problem
> described above.
> 
> Now a new wrinkle.  There are two NMI handlers (legacy port 0x61 and GHES)
> that require a spin lock to access the hardware safely.  This is obviously
> a big latency hit.  Not a problem if the NMI volume is low.
> 
> However, perf makes NMI a high volume interrupt. :-)
> 
> The question is how to handle the high volume PMI without incurring the
> latency hit of the spin locks.
> 
> I posted a patch to split these into two queues: internal and external
> (external being the ones that use a spin lock).
> 
> Ingo pointed out the 'original' problem again and suggested using irq_work
> to solve it by passing in a dummy NMI periodically to look for 'stale??'
> NMIs.
> 
> 
> Implementing Ingo's idea seemed to imply most of the time the dummy NMI
> would pass through and cause unknown errors unless I caught it (using that
> test-and-clear flag above).  Seemed acceptable.
> 
> Say a dummy NMI came through and actually _found_ a 'stale' NMI (like it
> was supposed to do).  What happens to the 'flag'?  It doesn't get cleared.
> 
> To take care of that, I just keep creating 'dummy' NMIs until it falls all
> the way through and clears the flag.
> 

Right, I got that far.

> Now to my 'comment' that doesn't make sense.
> 
> I was theorizing what happens if a dummy NMI came through and found a 'stale'
> NMI?  My patch was supposed to generate another 'dummy' NMI to clear the
> 'flag'.  But are there any guarantees that the new 'dummy' NMI is sent to
> the same cpu?  If not, then the flag is never cleared.  Which means a
> later 'real' unknown NMI would get swallowed accidentally.
> 
> That is what the first half of my comment was trying to say.
> 
> I tried to solve this by always clearing the 'flag' upon re-queue of the
> irq_work event. [The flag is always 'set' in the irq_work function]
> 
> This way if the irq_work event was scheduled on a different cpu, the
> 'flag' would not stay set forever.

OK, that I got too, I think. And at that point I was thinking, why not
unconditionally wipe the flag in nmi_queue_work() and have the handler
set it again.

> The second half tries to talk about a problem I uncovered while trying to
> solve the first problem, which is:
> 
> The irq_work function is called and the first thing it does is set a
> 'flag' (to prevent the accidentally unknown NMI).  Then it calls the IPI,
> which immediately takes the NMI.
> 
> Now, oddly enough, there were situations where the NMI happened and it found
> a PMI event.  So my patch re-queued the irq_work (after clearing the
> 'flag' based on problem 1 above).  However, the PMI and self IPI happened
> at roughly the same time, such that after the IRET of the PMI NMI, the
> 'dummy' NMI is immediately executed (think back-to-back NMIs).
> 
> This NMI is supposed to be the 'dummy' NMI caught below with the 'flag'
> set.  However, the 'flag' was cleared upon re-queue of the previous PMI
> NMI.  This leads to a periodic unknown NMI. Ooops.
> 
> I tried to work around this by checking the IRQ_BUSY flag and not clearing it
> in that case.

OK, so this is the part I got lost in and didn't see.

So a comment in the handler between setting the flag and raising the IPI
might be a good idea..

> So both my problems center around what guarantees does irq_work have to
> stay on the same cpu?

Well, none, as you used a global irq_work, so all cpus will now contend
on it on every NMI trying to queue it :-(

Didn't we have a problem with NMIs touching global cachelines before?

Wasn't that one of Dave Hansen's problems?

> Or better yet, is there a different way to solve this problem?

Lemme think about that...


* Re: [PATCH 5/6] x86, nmi: Move default external NMI handler to its own routine
  2014-05-21 16:48     ` Don Zickus
@ 2014-05-21 18:17       ` Peter Zijlstra
  2014-05-21 19:13         ` Don Zickus
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-21 18:17 UTC (permalink / raw)
  To: Don Zickus; +Cc: x86, Andi Kleen, gong.chen, LKML, Elliott, thomas.mingarelli

On Wed, May 21, 2014 at 12:48:48PM -0400, Don Zickus wrote:
> On Wed, May 21, 2014 at 12:38:46PM +0200, Peter Zijlstra wrote:
> > On Thu, May 15, 2014 at 03:25:48PM -0400, Don Zickus wrote:
> > > Now that we have set up an NMI subtype called NMI_EXT, there is really
> > > no need to hard code the default external NMI handler in the main
> > > nmi handler routine.
> > > 
> > > Move it to a proper function and register it on boot.  This change is
> > > just code movement.
> > > 
> > > In addition, update the hpwdt to allow it to unregister the default
> > > handler on its registration (and vice versa).  This allows the driver
> > > to take control of that io port (which it ultimately wanted to do
> > > originally), but in a cleaner way.
> > 
> > wanting that is one thing, but is it also a sane thing? You don't do
> > things just because drivers want them.
> 
> Heh. I understand.
> 
> Today, I have hacked up the SERR and IOCHK handlers to give hpwdt the
> chance to do its 'magic' bios call to collect information before
> panic'ing.
> 
> I was trying to clean things up by removing those hacks, but I guess I can
> see your point, there is no guarantee they handle the hardware correctly.
> :-/

So while I'll leave the decision to the x86 people, I find the changelog
entirely devoid of a good reason to do this.

And in my personal opinion, any hardware that triggers non-detectable NMIs
is just plain broken.


* Re: [PATCH 1/6] x86, nmi:  Implement delayed irq_work mechanism to handle lost NMIs
  2014-05-21 17:51       ` Peter Zijlstra
@ 2014-05-21 19:02         ` Don Zickus
  2014-05-21 19:38           ` Peter Zijlstra
  0 siblings, 1 reply; 17+ messages in thread
From: Don Zickus @ 2014-05-21 19:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, Andi Kleen, gong.chen, LKML, Elliott, fweisbec, dave.hansen

On Wed, May 21, 2014 at 07:51:49PM +0200, Peter Zijlstra wrote:
> On Wed, May 21, 2014 at 12:45:25PM -0400, Don Zickus wrote:
> > > > +	/*
> > > > +	 * Can't use send_IPI_self here because it will
> > > > +	 * send an NMI in IRQ context which is not what
> > > > +	 * we want.  Create a cpumask for local cpu and
> > > > +	 * force an IPI the normal way (not the shortcut).
> > > > +	 */
> > > > +	bitmap_zero(nmi_mask, NR_CPUS);
> > > > +	mask = to_cpumask(nmi_mask);
> > > > +	cpu_set(smp_processor_id(), *mask);
> > > > +
> > > > +	__this_cpu_xchg(nmi_delayed_work_pending, true);
> > > 
> > > Why is this xchg and not __this_cpu_write() ?
> > > 
> > > > +	apic->send_IPI_mask(to_cpumask(nmi_mask), NMI_VECTOR);
> > > 
> > > What's wrong with apic->send_IPI_self(NMI_VECTOR); ?
> > 
> > I tried to explain that in my comment above.  IPI_self uses the shortcut
> > method to send IPIs which means the NMI_VECTOR will be delivered in IRQ
> > context _not_ NMI context. :-(  This is why I do the whole silly dance.
> 
> I'm still not getting it, probably because I don't know how these APICs
> really work, but the way I read both the comment and your explanation
> here is that we get an NMI nested in the IRQ context we called it from,
> which is pretty much exactly what we want.

Um, ok.  I think my concern with that is an NMI nested in IRQ context
could be interrupted by a real NMI. I believe that would cause nmi_enter()
to barf among other bad things in the nmi code.

> 
> > I am open to better options.  We might be able to patch IPI_self but I am
> > not sure why it didn't special case the NMI_VECTOR to begin with (like the
> > other handlers seem to do).
> 
> cpumask_of(cpu) gives the right mask.

Ok. Thanks.
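(For reference, with both of Peter's suggestions applied, the quoted hunk would collapse to something like the kernel-style fragment below.  This is only a sketch of the discussed change, not the actual patch, and it is untested:)

```c
/*
 * cpumask_of() yields a const cpumask containing only the given cpu,
 * so the on-stack bitmap dance above is unnecessary.  The IPI is still
 * sent the "normal" way (not the self-IPI shortcut), so the vector
 * arrives as a real NMI rather than in IRQ context.
 */
__this_cpu_write(nmi_delayed_work_pending, true);
apic->send_IPI_mask(cpumask_of(smp_processor_id()), NMI_VECTOR);
```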

<snip>

> > Now oddly enough there were situations where the NMI happened and it found
> > a PMI event.  So my patch re-queued the irq_work (after clearing the
> > 'flag' based on problem 1 above).  However, the PMI and self IPI happened
> > at roughly the same time, such that after the IRET of the PMI NMI, the
> > 'dummy' NMI is immediately executed (think back-to-back NMIs).
> > 
> > This NMI is supposed to be the 'dummy' NMI caught below with the 'flag'
> > set.  However, the 'flag' was cleared upon re-queue of the previous PMI
> > NMI.  This leads to a periodic unknown NMI. Ooops.
> > 
> > I tried to work around this by checking the IRQ_BUSY flag and not clear it
> > in that case.
> 
> OK, so this is the part I got lost in and didn't see.
> 
> So a comment in the handler between setting the flag and raising the IPI
> might be a good idea..

Ok. I can do that.

> 
> > So both my problems center around what guarantees does irq_work have to
> > stay on the same cpu?
> 
> Well, none as you used a global irq_work, so all cpus will now contend
> on it on every NMI trying to queue it :-(

Yes, I was torn between a per-cpu implementation, in which every dummy
NMI grabs the spin lock in the nmi handlers, and a global lock.  I tried
the global lock.

I thought the irq_work lock seemed less contended because it was only read
once before being acted upon (for a cacheline separate from actual nmi work).

Whereas a spin lock in the nmi handlers seems to keep reading the lock
until it owns it, thus slowing down useful work for the handler that owns
the lock (because of the cache contention).

I could be wrong though.


> 
> Didn't we have a problem with NMIs touching global cachelines before?
> 
> Wasn't that one of Dave Hansen's problems?

That might have been the regular port 0x61 spin locks that we added to
support bsp hot plugging.

> 
> > Or better yet, is there a different way to solve this problem?
> 
> Lemme think about that...

Of course, oddly enough, the external NMIs are only routed to cpu0 right
now, so having any cpu other than cpu0 read them would be unexpected.  As a
result the global spin locks are just added overhead for the off-chance
that the bsp cpu disappears for some reason.

Perhaps just remove those global spin locks and keep a global variable of
which cpu gets to read the external NMIs.  Then upon hot plugging of the
bsp cpu (or the cpu that is setup to read the external NMI), a new cpu is
chosen and a dummy NMI is sent to handle the case an NMI is dropped during
the transition.

This would keep things fast in the normal path and be slow in the hotplug
path (which is rare anyway).  Then we could go back to one queue again
(instead of the two queues I am proposing here).

Just an idea to spark conversation.

Cheers,
Don



* Re: [PATCH 5/6] x86, nmi: Move default external NMI handler to its own routine
  2014-05-21 18:17       ` Peter Zijlstra
@ 2014-05-21 19:13         ` Don Zickus
  0 siblings, 0 replies; 17+ messages in thread
From: Don Zickus @ 2014-05-21 19:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: x86, Andi Kleen, gong.chen, LKML, Elliott, thomas.mingarelli

On Wed, May 21, 2014 at 08:17:56PM +0200, Peter Zijlstra wrote:
> On Wed, May 21, 2014 at 12:48:48PM -0400, Don Zickus wrote:
> > On Wed, May 21, 2014 at 12:38:46PM +0200, Peter Zijlstra wrote:
> > > On Thu, May 15, 2014 at 03:25:48PM -0400, Don Zickus wrote:
> > > > Now that we have set up an NMI subtype called NMI_EXT, there is really
> > > > no need to hard code the default external NMI handler in the main
> > > > nmi handler routine.
> > > > 
> > > > Move it to a proper function and register it on boot.  This change is
> > > > just code movement.
> > > > 
> > > > In addition, update the hpwdt to allow it to unregister the default
> > > > handler on its registration (and vice versa).  This allows the driver
> > > > to take control of that io port (which it ultimately wanted to do
> > > > originally), but in a cleaner way.
> > > 
> > > wanting that is one thing, but is it also a sane thing? You don't do
> > > things just because drivers want them.
> > 
> > Heh. I understand.
> > 
> > Today, I have hacked up the SERR and IOCHK handlers to give hpwdt the
> > chance to do its 'magic' bios call to collect information before
> > panic'ing.
> > 
> > I was trying to clean things up by removing those hacks, but I guess I can
> > see your point, there is no guarantee they handle the hardware correctly.
> > :-/
> 
> So while I'll leave the decision to the x86 people, I find the changelog
> entirely devoid of a good reason to do this.
> 
> And in my personal opinion, any hardware that triggers non-detectable NMIs
> is just plain broken.

I do agree.  And I am not looking to argue against your opinion, but the
'broken' part is what is interesting to vendors.  With firmware becoming
more prevalent these days, I have seen large upticks in unknown NMIs with
RHEL-X due to broken firmware implementing the latest bells and whistles.

With so much firmware on the system (various pci cards, system firmware,
etc), no one knows which piece is broken.  What hpwdt is trying to do (and
other vendors too) is, the moment an unknown NMI happens, jump into
bios and start poking registers on various system bridges to figure out
what is causing the problems and log them somehow (on a BMC or its ilk).
Then the hardware guys know what to fix.

Of course, ACPI's APEI was supposed to create a framework to properly
deliver these errors to the OS for reliable reporting (using a properly
registered NMI handler with a detectable NMI).  But I think it is still a
work in progress. :-/

So the problem is the hardware _is_ broken, but how to communicate that is
difficult and unknown NMI appears to be the cheap and easy way to do that.

Cheers,
Don


* Re: [PATCH 1/6] x86, nmi:  Implement delayed irq_work mechanism to handle lost NMIs
  2014-05-21 19:02         ` Don Zickus
@ 2014-05-21 19:38           ` Peter Zijlstra
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Zijlstra @ 2014-05-21 19:38 UTC (permalink / raw)
  To: Don Zickus
  Cc: x86, Andi Kleen, gong.chen, LKML, Elliott, fweisbec, dave.hansen

On Wed, May 21, 2014 at 03:02:25PM -0400, Don Zickus wrote:
> On Wed, May 21, 2014 at 07:51:49PM +0200, Peter Zijlstra wrote:
> > On Wed, May 21, 2014 at 12:45:25PM -0400, Don Zickus wrote:
> > > > > +	/*
> > > > > +	 * Can't use send_IPI_self here because it will
> > > > > +	 * send an NMI in IRQ context which is not what
> > > > > +	 * we want.  Create a cpumask for local cpu and
> > > > > +	 * force an IPI the normal way (not the shortcut).
> > > > > +	 */
> > > > > +	bitmap_zero(nmi_mask, NR_CPUS);
> > > > > +	mask = to_cpumask(nmi_mask);
> > > > > +	cpu_set(smp_processor_id(), *mask);
> > > > > +
> > > > > +	__this_cpu_xchg(nmi_delayed_work_pending, true);
> > > > 
> > > > Why is this xchg and not __this_cpu_write() ?
> > > > 
> > > > > +	apic->send_IPI_mask(to_cpumask(nmi_mask), NMI_VECTOR);
> > > > 
> > > > What's wrong with apic->send_IPI_self(NMI_VECTOR); ?
> > > 
> > > I tried to explain that in my comment above.  IPI_self uses the shortcut
> > > method to send IPIs which means the NMI_VECTOR will be delivered in IRQ
> > > context _not_ NMI context. :-(  This is why I do the whole silly dance.
> > 
> > I'm still not getting it, probably because I don't know how these APICs
> > really work, but the way I read both the comment and your explanation
> > here is that we get an NMI nested in the IRQ context we called it from,
> > which is pretty much exactly what we want.
> 
> Um, ok.  I think my concern with that is an NMI nested in IRQ context
> could be interrupted by a real NMI. I believe that would cause nmi_enter()
> to barf among other bad things in the nmi code.

Ohh, you mean the NMI handler will run as a regular interrupt? Yes, that
would be bad.

> > > So both my problems center around what guarantees does irq_work have to
> > > stay on the same cpu?
> > 
> > Well, none as you used a global irq_work, so all cpus will now contend
> > on it on every NMI trying to queue it :-(
> 
> Yes, I was torn between a per-cpu implementation, in which every dummy
> NMI grabs the spin lock in the nmi handlers, and a global lock.  I tried
> the global lock.
> 
> I thought the irq_work lock seemed less contended because it was only read
> once before being acted upon (for a cacheline separate from actual nmi work).
> 
> Whereas a spin lock in the nmi handlers seems to keep reading the lock
> until it owns it, thus slowing down useful work for the handler that owns
> the lock (because of the cache contention).
> 
> I could be wrong though.

Well, pretty much every NMI will call irq_work_queue(), which calls
irq_work_claim(), which does an unconditional cmpxchg (a locked RMW) on the
global cacheline.

Which *hurts*.


will try and reply to the rest later..


end of thread, other threads:[~2014-05-21 19:38 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-15 19:25 [PATCH 0/6 V2] x86, nmi: Various fixes and cleanups Don Zickus
2014-05-15 19:25 ` [PATCH 1/6] x86, nmi: Implement delayed irq_work mechanism to handle lost NMIs Don Zickus
2014-05-21 10:29   ` Peter Zijlstra
2014-05-21 16:45     ` Don Zickus
2014-05-21 17:51       ` Peter Zijlstra
2014-05-21 19:02         ` Don Zickus
2014-05-21 19:38           ` Peter Zijlstra
2014-05-15 19:25 ` [PATCH 2/6] x86, nmi: Add new nmi type 'external' Don Zickus
2014-05-15 19:25 ` [PATCH 3/6] x86, nmi: Add boot line option 'panic_on_unrecovered_nmi' and 'panic_on_io_nmi' Don Zickus
2014-05-15 19:25 ` [PATCH 4/6] x86, nmi: Remove 'reason' value from unknown nmi output Don Zickus
2014-05-15 19:25 ` [PATCH 5/6] x86, nmi: Move default external NMI handler to its own routine Don Zickus
2014-05-21 10:38   ` Peter Zijlstra
2014-05-21 16:48     ` Don Zickus
2014-05-21 18:17       ` Peter Zijlstra
2014-05-21 19:13         ` Don Zickus
2014-05-15 19:25 ` [PATCH 6/6 V2] x86, nmi: Add better NMI stats to /proc/interrupts and show handlers Don Zickus
2014-05-15 20:28 ` [PATCH 0/6 V2] x86, nmi: Various fixes and cleanups Don Zickus
