linux-kernel.vger.kernel.org archive mirror
* [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.
@ 2014-12-05  2:29 Calvin Owens
  2014-12-09 18:08 ` Borislav Petkov
  0 siblings, 1 reply; 11+ messages in thread
From: Calvin Owens @ 2014-12-05  2:29 UTC (permalink / raw)
  To: Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin
  Cc: x86, linux-edac, linux-kernel, kernel-team, Calvin Owens

The Intel CMCI interrupt handler calls mce_timer_kick() to force more
frequent polling for MCE events when a CMCI storm occurs and CMCI
interrupts are subsequently disabled.

If a CMCI interrupt storm happens to be detected while the timer
interrupt is executing timer functions, mce_timer_kick() can race with
mce_timer_fn(), which results in a double-add and the following BUG:

	#0 [ffff88047fda3ad0] machine_kexec at ffffffff8102bdf5
	#1 [ffff88047fda3b20] crash_kexec at ffffffff8109e788
	#2 [ffff88047fda3bf0] oops_end at ffffffff815f20e8
	#3 [ffff88047fda3c20] die at ffffffff81005c08
	#4 [ffff88047fda3c50] do_trap at ffffffff815f192b
	#5 [ffff88047fda3cb0] do_invalid_op at ffffffff81002f42
	#6 [ffff88047fda3d60] invalid_op at ffffffff815fa668
	[exception RIP: add_timer_on+234]
	RIP: ffffffff8104d05a RSP: ffff88047fda3e18 RFLAGS: 00010286
	RAX: 0000000000000000 RBX: ffff88047fdacbc0 RCX: 000000001fbee3ff
	RDX: ffff88047fda0000 RSI: 000000000000001d RDI: ffff88047fdacbc0
	RBP: ffff88047fda3e58 R8: 0000000000000000 R9: ffffffff81aa0940
	R10: 0720072007200720 R11: 0720072007200765 R12: ffff880474a6c000
	R13: 0000000000000101 R14: 000000000000001d R15: ffff88047fdacbc0
	ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
	#7 [ffff88047fda3e60] mce_timer_fn at ffffffff8101f524
	#8 [ffff88047fda3e80] call_timer_fn at ffffffff8104b4fa
	#9 [ffff88047fda3ec0] run_timer_softirq at ffffffff8104ce70

The add_timer_on() in mce_timer_kick() is actually unnecessary: since the
timer is re-added by its handler function, the only case in which the
timer doesn't exist is when the CMCI interrupt calls mce_timer_kick() in
the interval between the timer firing and mce_timer_fn() actually being
executed. Thus, the timer work will be performed by mce_timer_fn() just
after the interrupt exits.

This patch removes the add_timer_on() call from mce_timer_kick(), and
disables local interrupts during mce_timer_fn() so that mce_timer_fn()
always picks up the timer interval value that mce_timer_kick() stores
in the per-CPU variable.

This means that the CMCI interrupt that hits the storm threshold will
call mce_timer_kick() either:

	1) in the interval between the mce_timer firing and mce_timer_fn()
	   disabling local IRQs, in which case mce_timer_fn() will
	   execute immediately after the CMCI handler exits, and will
	   use the interval that mce_timer_kick() stored in the per-CPU
	   variable to calculate its next timer interval; or

	2) after mce_timer_fn() has done its work, in which case the
	   existing timer will be updated with the new interval if it
	   expires before the existing one.

Signed-off-by: Calvin Owens <calvinowens@fb.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 61a9668ce..7074a90 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1286,7 +1286,7 @@ static int cmc_error_seen(void)
 static void mce_timer_fn(unsigned long data)
 {
 	struct timer_list *t = this_cpu_ptr(&mce_timer);
-	unsigned long iv;
+	unsigned long iv, flags;
 	int notify;
 
 	WARN_ON(smp_processor_id() != data);
@@ -1301,6 +1301,9 @@ static void mce_timer_fn(unsigned long data)
 	 * Alert userspace if needed.  If we logged an MCE, reduce the
 	 * polling interval, otherwise increase the polling interval.
 	 */
+
+	local_irq_save(flags);
+
 	iv = __this_cpu_read(mce_next_interval);
 	notify = mce_notify_irq();
 	notify |= cmc_error_seen();
@@ -1316,6 +1319,8 @@ static void mce_timer_fn(unsigned long data)
 		t->expires = jiffies + iv;
 		add_timer_on(t, smp_processor_id());
 	}
+
+	local_irq_restore(flags);
 }
 
 /*
@@ -1330,9 +1335,6 @@ void mce_timer_kick(unsigned long interval)
 	if (timer_pending(t)) {
 		if (time_before(when, t->expires))
 			mod_timer_pinned(t, when);
-	} else {
-		t->expires = round_jiffies(when);
-		add_timer_on(t, smp_processor_id());
 	}
 	if (interval < iv)
 		__this_cpu_write(mce_next_interval, interval);
-- 
2.1.1



* Re: [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.
  2014-12-05  2:29 [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms Calvin Owens
@ 2014-12-09 18:08 ` Borislav Petkov
  2014-12-09 23:00   ` Luck, Tony
From: Borislav Petkov @ 2014-12-09 18:08 UTC (permalink / raw)
  To: Calvin Owens, Tony Luck
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86, linux-edac,
	linux-kernel, kernel-team

On Thu, Dec 04, 2014 at 06:29:35PM -0800, Calvin Owens wrote:
> The Intel CMCI interrupt handler calls mce_timer_kick() to force more
> frequent polling for MCE events when a CMCI storm occurs and CMCI
> interrupts are subsequently disabled.
> 
> If a CMCI interrupt storm happens to be detected while the timer
> interrupt is executing timer functions, mce_timer_kick() can race with
> mce_timer_fn(), which results in a double-add and the following BUG:
> 
> 	#0 [ffff88047fda3ad0] machine_kexec at ffffffff8102bdf5
> 	#1 [ffff88047fda3b20] crash_kexec at ffffffff8109e788
> 	#2 [ffff88047fda3bf0] oops_end at ffffffff815f20e8
> 	#3 [ffff88047fda3c20] die at ffffffff81005c08
> 	#4 [ffff88047fda3c50] do_trap at ffffffff815f192b
> 	#5 [ffff88047fda3cb0] do_invalid_op at ffffffff81002f42
> 	#6 [ffff88047fda3d60] invalid_op at ffffffff815fa668
> 	[exception RIP: add_timer_on+234]

Do I understand this correctly?

This is

	BUG_ON(timer_pending(timer) || !timer->function);

in add_timer_on() and the first check, at that, the timer_pending()?

> 	RIP: ffffffff8104d05a RSP: ffff88047fda3e18 RFLAGS: 00010286
> 	RAX: 0000000000000000 RBX: ffff88047fdacbc0 RCX: 000000001fbee3ff
> 	RDX: ffff88047fda0000 RSI: 000000000000001d RDI: ffff88047fdacbc0
> 	RBP: ffff88047fda3e58 R8: 0000000000000000 R9: ffffffff81aa0940
> 	R10: 0720072007200720 R11: 0720072007200765 R12: ffff880474a6c000
> 	R13: 0000000000000101 R14: 000000000000001d R15: ffff88047fdacbc0
> 	ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> 	#7 [ffff88047fda3e60] mce_timer_fn at ffffffff8101f524
> 	#8 [ffff88047fda3e80] call_timer_fn at ffffffff8104b4fa
> 	#9 [ffff88047fda3ec0] run_timer_softirq at ffffffff8104ce70
> 
> The timer_add() in mce_timer_kick() is actually unnecessary: since the
> timer is re-added by its handler function,

What about the case where iv can become 0?

        /* Might have become 0 after CMCI storm subsided */
        if (iv) {
                t->expires = jiffies + iv;
                add_timer_on(t, smp_processor_id());
        }

I have to say that whole story of iv becoming 0 doesn't sound all too
sane to me, though. This interval either decreases when polling - but up
to HZ/100 jiffies and not below. So I don't see how that would become 0
ever!

Regardless, you need to re-add the timer because otherwise you won't be
able to poll the MCE banks in storm mode.

> the only case in which the
> timer doesn't exist is when the CMCI interrupt calls mce_timer_kick() in
> the interval between the timer firing and mce_timer_fn() actually being
> executed. Thus, the timer work will be performed by mce_timer_fn() just
> after the interrupt exits.
> 
> This patch removes the add_timer() from mce_timer_kick(), and disables
> local interrupts during mce_timer_fn() so that mce_timer_fn() will
> always pick up the timer interval value that mce_timer_kick() drops
> in the PERCPU variable.
> 
> This means that the CMCI interrupt that hits the storm threshold will
> call mce_timer_kick() either:
> 
> 	1) In the interval between the mce_timer firing and mce_timer_fn()
> 	   disabling local IRQs. In this case, mce_timer_fn() will
> 	   immediately execute after the CMCI handler exits, and will
> 	   use the interval loaded in the PERCPU variable from
> 	   mce_timer_kick() to calculate its next timer interval.
> 
> 	2) Happen after mce_timer_fn() has done its work, in which case
> 	   the existing timer will be updated with the new interval if
> 	   it is before the existing one.

Right, so this polling thing once again proves its fragility to me - we
have had problems with it in the past so maybe we should think about
replacing it with something simpler and much more robust instead of
this flaky dynamically adjustable polling thing.

So I'm thinking of leaving the detection code as it is: when we detect
a storm on a CPU, we set CMCI_STORM_ACTIVE and start a kernel thread
polling the MCA banks at max freq HZ/100. No adjustable frequency, no
timers, no nothing. A stupid per-cpu thread which polls.

Then, once we haven't seen any more CMCI errors - cmc_error_seen() - we
park that thread and switch back to interrupt mode.

I think this should be a lot simpler and hopefully more robust.

Downsides? I don't see any. We will normally poll for only short periods
of time. On boxes where we have to poll for much longer, the current
mechanism will bring us to max frequency polling anyway. But I might be
missing some nasty use cases...

Tony, what do you think?

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* RE: [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.
  2014-12-09 18:08 ` Borislav Petkov
@ 2014-12-09 23:00   ` Luck, Tony
  2014-12-10  3:11     ` Calvin Owens
From: Luck, Tony @ 2014-12-09 23:00 UTC (permalink / raw)
  To: Borislav Petkov, Calvin Owens
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86@kernel.org,
	linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com


> Right, so this polling thing once again proves its fragility to me - we
> have had problems with it in the past so maybe we should think about
> replacing it with something simpler and and much more robust instead of
> this flaky dynamically adjustable polling thing.

Dynamic intervals for polling make sense for cpus that don't support
CMCI.  We need to check occasionally to see if there are any corrected errors,
but we don't want to waste a lot of cpu time doing that too often. There are
almost never any errors to be found. So begin polling at 5 minute intervals
(eternity on a multi-GHz cpu). If we do find an error, then look more frequently,
because there are several cases where a single error source might generate
multiple errors (e.g. stuck bit).

But then we came along and co-opted this mechanism for CMCI storm
control.  And you are right that we made things needlessly complex
by using the same variable rate mechanism.  If we had a storm, we know
we are having a high rate of errors (15 in one second) ... so we just want
to poll at a high-ish rate to collect a good sample of subsequent errors.
Also to detect when the storm ends in a timely manner. So we don't
gain much by tweaking the poll rate, and we have complex code.

> So I'm thinking of leaving the detection code as it is, when we detect
> a storm on a CPU, we set CMCI_STORM_ACTIVE and start a kernel thread at
> max freq HZ/100 and polling the MCA banks. No adjustable frequency, no
> timers, no nothing. A stupid per-cpu thread which polls.

Go for it.

-Tony


* Re: [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.
  2014-12-09 23:00   ` Luck, Tony
@ 2014-12-10  3:11     ` Calvin Owens
  2014-12-10 19:23       ` Borislav Petkov
From: Calvin Owens @ 2014-12-10  3:11 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86@kernel.org, linux-edac@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com

On Tuesday 12/09 at 23:00 +0000, Luck, Tony wrote:
> > Right, so this polling thing once again proves its fragility to me - we
> > have had problems with it in the past so maybe we should think about
> > replacing it with something simpler and and much more robust instead of
> > this flaky dynamically adjustable polling thing.
> 
> Dynamic intervals for polling make sense for cpus that don't support
> CMCI.  We need to check occasionally to see if there are any corrected errors,
> but we don't want to waste a lot of cpu time doing that too often. There are
> almost never any errors to be found. So begin polling at 5 minute intervals
> (eternity on a multi-GHz cpu). If we do find an error, then look more frequently,
> because there are several cases where a single error source might generate
> multiple errors (e.g. stuck bit).
> 
> But then we came along an co-opted this mechanism  for CMCI storm
> control.  And you are right that we made things needlessly complex
> by using the same variable rate mechanism.  If we had a storm, we know
> we are having a high rate of errors (15 in one second) ... so we just want
> to poll at a high-ish rate to collect a good sample of subsequent errors.
> Also to detect when the storm ends in a timely manner. So we don't
> gain much by tweaking the poll rate, and we have complex code.
> 
> > So I'm thinking of leaving the detection code as it is, when we detect
> > a storm on a CPU, we set CMCI_STORM_ACTIVE and start a kernel thread at
> > max freq HZ/100 and polling the MCA banks. No adjustable frequency, no
> > timers, no nothing. A stupid per-cpu thread which polls.
> 
> Go for it.

Just to make sure I understand what you're looking for:

When MCE is initialized, spawn a kthread for each CPU instead of the
current timers. If CMCI is supported, we just leave this thread parked,
and only process errors from the CMCI interrupt handler.

When a CMCI storm happens, we disable CMCI interrupts and kick the
kthread, which polls every HZ/100 until the storm has subsided, at which
point it re-enables CMCI interrupts and parks itself.

If CMCI isn't supported though, how is the polling done? You said the
dynamic interval is desirable, wouldn't that need to be in the kthread?
Having both the kthread and the timer around seems ugly, even if only
one is used on a given machine.

Thanks,
Calvin


* Re: [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.
  2014-12-10  3:11     ` Calvin Owens
@ 2014-12-10 19:23       ` Borislav Petkov
  2014-12-11  2:34         ` Calvin Owens
From: Borislav Petkov @ 2014-12-10 19:23 UTC (permalink / raw)
  To: Calvin Owens, Luck, Tony
  Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, x86@kernel.org,
	linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com

On Tue, Dec 09, 2014 at 07:11:02PM -0800, Calvin Owens wrote:
> Just to make sure I understand what you're looking for:
> 
> When MCE is initialized, spawn a kthread for each CPU instead of the
> current timers. If CMCI is supported, we just leave this thread parked,
> and only process errors from the CMCI interrupt handler.
> 
> When a CMCI storm happens, we disable CMCI interrupts and kick the
> kthread, which polls every HZ/100 until the storm has subsided, at which
> point it re-enables CMCI interrupts and parks itself.
> 
> If CMCI isn't supported though, how is the polling done? You said the
> dynamic interval is desirable, wouldn't that need to be in the kthread?
> Having both the kthread and the timer around seems ugly, even if only
> one is used on a given machine.

Ok, so after talking with the guys here internally, it sounds to me
like a kernel thread or a workqueue or whatever other facility relying
on wake_up_process() would have scheduling latency in there and could
get delayed when polling the MCA banks. In a storm condition, this
might not be such a good idea.

So we may be better off with the timer interrupt after all, but try
to decouple the dynamic adjusting of polling frequency for non-CMCI
machines and do simpler on/off polling on CMCI machines.

So what I'm thinking of is:

* once we've detected CMCI storm, we immediately start polling with
CMCI_STORM_INTERVAL, i.e. once per second.

* as long as we keep seeing errors during polling in storm mode, we keep
that polling frequency.

* the moment we don't see any errors anymore, we go to
CMCI_STORM_SUBSIDED and then eventually to CMCI_STORM_NONE.

The code remains unchanged for CMCI machines which are not in storm
mode and for non-CMCI machines.

Anyway, this below is completely untested but it seems simpler to me and
does away with the adaptive logic for CMCI storms (you might want to
apply it first as the diff is hardly readable and this code is nasty as
hell anyway).

Thoughts?

---
diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 51b26e895933..1b9733614e4c 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -108,6 +108,7 @@ struct mca_config {
 	bool disabled;
 	bool ser;
 	bool bios_cmci_threshold;
+	bool cmci;
 	u8 banks;
 	s8 bootlog;
 	int tolerant;
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 10b46906767f..6e4984fc37ce 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -30,18 +30,19 @@ extern struct mce_bank *mce_banks;
 extern mce_banks_t mce_banks_ce_disabled;
 
 #ifdef CONFIG_X86_MCE_INTEL
-unsigned long mce_intel_adjust_timer(unsigned long interval);
+unsigned long cmci_intel_adjust_timer(unsigned long interval);
 void mce_intel_cmci_poll(void);
 void mce_intel_hcpu_update(unsigned long cpu);
 void cmci_disable_bank(int bank);
 #else
-# define mce_intel_adjust_timer mce_adjust_timer_default
+# define cmci_intel_adjust_timer mce_adjust_timer_default
 static inline void mce_intel_cmci_poll(void) { }
 static inline void mce_intel_hcpu_update(unsigned long cpu) { }
 static inline void cmci_disable_bank(int bank) { }
 #endif
 
 void mce_timer_kick(unsigned long interval);
+int ce_error_seen(void);
 
 #ifdef CONFIG_ACPI_APEI
 int apei_write_mce(struct mce *m);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index d2c611699cd9..e3f426698164 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1324,7 +1324,7 @@ static unsigned long mce_adjust_timer_default(unsigned long interval)
 static unsigned long (*mce_adjust_timer)(unsigned long interval) =
 	mce_adjust_timer_default;
 
-static int cmc_error_seen(void)
+int ce_error_seen(void)
 {
 	unsigned long *v = this_cpu_ptr(&mce_polled_error);
 
@@ -1334,36 +1334,36 @@ static int cmc_error_seen(void)
 static void mce_timer_fn(unsigned long data)
 {
 	struct timer_list *t = this_cpu_ptr(&mce_timer);
+	int cpu = smp_processor_id();
 	unsigned long iv;
-	int notify;
 
-	WARN_ON(smp_processor_id() != data);
+	WARN_ON(cpu != data);
 
 	if (mce_available(this_cpu_ptr(&cpu_info))) {
-		machine_check_poll(MCP_TIMESTAMP,
-				this_cpu_ptr(&mce_poll_banks));
+		machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_poll_banks));
 		mce_intel_cmci_poll();
 	}
 
+	iv = __this_cpu_read(mce_next_interval);
+
+	if (mca_cfg.cmci) {
+		iv = mce_adjust_timer(iv);
+		goto done;
+	}
+
 	/*
-	 * Alert userspace if needed.  If we logged an MCE, reduce the
-	 * polling interval, otherwise increase the polling interval.
+	 * Alert userspace if needed. If we logged an MCE, reduce the polling
+	 * interval, otherwise increase the polling interval.
 	 */
-	iv = __this_cpu_read(mce_next_interval);
-	notify = mce_notify_irq();
-	notify |= cmc_error_seen();
-	if (notify) {
+	if (mce_notify_irq())
 		iv = max(iv / 2, (unsigned long) HZ/100);
-	} else {
+	else
 		iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));
-		iv = mce_adjust_timer(iv);
-	}
+
+done:
 	__this_cpu_write(mce_next_interval, iv);
-	/* Might have become 0 after CMCI storm subsided */
-	if (iv) {
-		t->expires = jiffies + iv;
-		add_timer_on(t, smp_processor_id());
-	}
+	t->expires = jiffies + iv;
+	add_timer_on(t, cpu);
 }
 
 /*
@@ -1682,7 +1682,7 @@ static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)
 	switch (c->x86_vendor) {
 	case X86_VENDOR_INTEL:
 		mce_intel_feature_init(c);
-		mce_adjust_timer = mce_intel_adjust_timer;
+		mce_adjust_timer = cmci_intel_adjust_timer;
 		break;
 	case X86_VENDOR_AMD:
 		mce_amd_feature_init(c);
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index b3c97bafc123..6b8cbeaaca88 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -46,7 +46,7 @@ static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
 
 #define CMCI_THRESHOLD		1
 #define CMCI_POLL_INTERVAL	(30 * HZ)
-#define CMCI_STORM_INTERVAL	(1 * HZ)
+#define CMCI_STORM_INTERVAL	(HZ)
 #define CMCI_STORM_THRESHOLD	15
 
 static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
@@ -68,6 +68,9 @@ static int cmci_supported(int *banks)
 	if (mca_cfg.cmci_disabled || mca_cfg.ignore_ce)
 		return 0;
 
+	if (mca_cfg.cmci)
+		return 1;
+
 	/*
 	 * Vendor check is not strictly needed, but the initial
 	 * initialization is vendor keyed and this
@@ -79,7 +82,9 @@ static int cmci_supported(int *banks)
 		return 0;
 	rdmsrl(MSR_IA32_MCG_CAP, cap);
 	*banks = min_t(unsigned, MAX_NR_BANKS, cap & 0xff);
-	return !!(cap & MCG_CMCI_P);
+	mca_cfg.cmci = !!(cap & MCG_CMCI_P);
+
+	return mca_cfg.cmci;
 }
 
 void mce_intel_cmci_poll(void)
@@ -97,12 +102,13 @@ void mce_intel_hcpu_update(unsigned long cpu)
 	per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
 }
 
-unsigned long mce_intel_adjust_timer(unsigned long interval)
+unsigned long cmci_intel_adjust_timer(unsigned long interval)
 {
-	int r;
-
-	if (interval < CMCI_POLL_INTERVAL)
-		return interval;
+	if (ce_error_seen() &&
+	    (__this_cpu_read(cmci_storm_state) == CMCI_STORM_ACTIVE)) {
+		mce_notify_irq();
+		return CMCI_STORM_INTERVAL;
+	}
 
 	switch (__this_cpu_read(cmci_storm_state)) {
 	case CMCI_STORM_ACTIVE:
@@ -112,8 +118,7 @@ unsigned long mce_intel_adjust_timer(unsigned long interval)
 		 * timer interval is back to our poll interval.
 		 */
 		__this_cpu_write(cmci_storm_state, CMCI_STORM_SUBSIDED);
-		r = atomic_sub_return(1, &cmci_storm_on_cpus);
-		if (r == 0)
+		if (!atomic_sub_return(1, &cmci_storm_on_cpus))
 			pr_notice("CMCI storm subsided: switching to interrupt mode\n");
 		/* FALLTHROUGH */
 
@@ -178,7 +183,7 @@ static bool cmci_storm_detect(void)
 	cmci_storm_disable_banks();
 	__this_cpu_write(cmci_storm_state, CMCI_STORM_ACTIVE);
 	r = atomic_add_return(1, &cmci_storm_on_cpus);
-	mce_timer_kick(CMCI_POLL_INTERVAL);
+	mce_timer_kick(CMCI_STORM_INTERVAL);
 
 	if (r == 1)
 		pr_notice("CMCI storm detected: switching to poll mode\n");


-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.
  2014-12-10 19:23       ` Borislav Petkov
@ 2014-12-11  2:34         ` Calvin Owens
  2014-12-11 14:43           ` Borislav Petkov
From: Calvin Owens @ 2014-12-11  2:34 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86@kernel.org, linux-edac@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com

On Wednesday 12/10 at 20:23 +0100, Borislav Petkov wrote:
> On Tue, Dec 09, 2014 at 07:11:02PM -0800, Calvin Owens wrote:
> > Just to make sure I understand what you're looking for:
> > 
> > When MCE is initialized, spawn a kthread for each CPU instead of the
> > current timers. If CMCI is supported, we just leave this thread parked,
> > and only process errors from the CMCI interrupt handler.
> > 
> > When a CMCI storm happens, we disable CMCI interrupts and kick the
> > kthread, which polls every HZ/100 until the storm has subsided, at which
> > point it re-enables CMCI interrupts and parks itself.
> > 
> > If CMCI isn't supported though, how is the polling done? You said the
> > dynamic interval is desirable, wouldn't that need to be in the kthread?
> > Having both the kthread and the timer around seems ugly, even if only
> > one is used on a given machine.
> 
> Ok, so in talking with the guys here internally it sounds to me like
> a kernel thread or a workqueue or whatever other facility relying on
> wake_up_process() we take, would have scheduling latency in there and
> get delayed when polling the MCA banks. In a storm condition, this might
> not be such a good idea.
> 
> So we maybe better off with the timer interrupt after all but try
> to decouple the dynamic adjusting of polling frequency for non-CMCI
> machines and do an on/off simpler polling on CMCI machines.
> 
> So what I'm thinking of is:
> 
> * once we've detected CMCI storm, we immediately start polling with
> CMCI_STORM_INTERVAL, i.e. once per second.
> 
> * as long as we keep seeing errors during polling in storm mode, we keep
> that polling frequency.
> 
> * the moment we don't see any errors anymore, we go to
> CMCI_STORM_SUBSIDED and then eventually to CMCI_STORM_NONE.
> 
> The code remains unchanged for CMCI machines which are not in storm
> mode and for non-CMCI machines.
> 
> Anyway, this below is completely untested but it seems simpler to me and
> does away with the adaptive logic for CMCI storms (you might want to
> apply it first as the diff is hardly readable and this code is nasty as
> hell anyway).
> 
> Thoughts?

This doesn't fix the original issue though: the timer double-add can
still happen if the CMCI interrupt that hits the STORM threshold pops
off during mce_timer_fn() and calls mce_timer_kick().

The polling is unnecessary on a CMCI-capable machine except in STORMs,
right? Wouldn't it be better to eliminate it in that case?

That said, I ran mce-test with your patch on top of 3.18, looks good
there. But that doesn't simulate CMCI interrupts. 

Thanks,
Calvin

> ---
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index 51b26e895933..1b9733614e4c 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -108,6 +108,7 @@ struct mca_config {
>  	bool disabled;
>  	bool ser;
>  	bool bios_cmci_threshold;
> +	bool cmci;
>  	u8 banks;
>  	s8 bootlog;
>  	int tolerant;
> diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
> index 10b46906767f..6e4984fc37ce 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
> +++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
> @@ -30,18 +30,19 @@ extern struct mce_bank *mce_banks;
>  extern mce_banks_t mce_banks_ce_disabled;
>  
>  #ifdef CONFIG_X86_MCE_INTEL
> -unsigned long mce_intel_adjust_timer(unsigned long interval);
> +unsigned long cmci_intel_adjust_timer(unsigned long interval);
>  void mce_intel_cmci_poll(void);
>  void mce_intel_hcpu_update(unsigned long cpu);
>  void cmci_disable_bank(int bank);
>  #else
> -# define mce_intel_adjust_timer mce_adjust_timer_default
> +# define cmci_intel_adjust_timer mce_adjust_timer_default
>  static inline void mce_intel_cmci_poll(void) { }
>  static inline void mce_intel_hcpu_update(unsigned long cpu) { }
>  static inline void cmci_disable_bank(int bank) { }
>  #endif
>  
>  void mce_timer_kick(unsigned long interval);
> +int ce_error_seen(void);
>  
>  #ifdef CONFIG_ACPI_APEI
>  int apei_write_mce(struct mce *m);
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index d2c611699cd9..e3f426698164 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -1324,7 +1324,7 @@ static unsigned long mce_adjust_timer_default(unsigned long interval)
>  static unsigned long (*mce_adjust_timer)(unsigned long interval) =
>  	mce_adjust_timer_default;
>  
> -static int cmc_error_seen(void)
> +int ce_error_seen(void)
>  {
>  	unsigned long *v = this_cpu_ptr(&mce_polled_error);
>  
> @@ -1334,36 +1334,36 @@ static int cmc_error_seen(void)
>  static void mce_timer_fn(unsigned long data)
>  {
>  	struct timer_list *t = this_cpu_ptr(&mce_timer);
> +	int cpu = smp_processor_id();
>  	unsigned long iv;
> -	int notify;
>  
> -	WARN_ON(smp_processor_id() != data);
> +	WARN_ON(cpu != data);
>  
>  	if (mce_available(this_cpu_ptr(&cpu_info))) {
> -		machine_check_poll(MCP_TIMESTAMP,
> -				this_cpu_ptr(&mce_poll_banks));
> +		machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_poll_banks));
>  		mce_intel_cmci_poll();
>  	}
>  
> +	iv = __this_cpu_read(mce_next_interval);
> +
> +	if (mca_cfg.cmci) {
> +		iv = mce_adjust_timer(iv);
> +		goto done;
> +	}
> +
>  	/*
> -	 * Alert userspace if needed.  If we logged an MCE, reduce the
> -	 * polling interval, otherwise increase the polling interval.
> +	 * Alert userspace if needed. If we logged an MCE, reduce the polling
> +	 * interval, otherwise increase the polling interval.
>  	 */
> -	iv = __this_cpu_read(mce_next_interval);
> -	notify = mce_notify_irq();
> -	notify |= cmc_error_seen();
> -	if (notify) {
> +	if (mce_notify_irq())
>  		iv = max(iv / 2, (unsigned long) HZ/100);
> -	} else {
> +	else
>  		iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));
> -		iv = mce_adjust_timer(iv);
> -	}
> +
> +done:
>  	__this_cpu_write(mce_next_interval, iv);
> -	/* Might have become 0 after CMCI storm subsided */
> -	if (iv) {
> -		t->expires = jiffies + iv;
> -		add_timer_on(t, smp_processor_id());
> -	}
> +	t->expires = jiffies + iv;
> +	add_timer_on(t, cpu);
>  }
>  
>  /*
> @@ -1682,7 +1682,7 @@ static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)
>  	switch (c->x86_vendor) {
>  	case X86_VENDOR_INTEL:
>  		mce_intel_feature_init(c);
> -		mce_adjust_timer = mce_intel_adjust_timer;
> +		mce_adjust_timer = cmci_intel_adjust_timer;
>  		break;
>  	case X86_VENDOR_AMD:
>  		mce_amd_feature_init(c);
> diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
> index b3c97bafc123..6b8cbeaaca88 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
> @@ -46,7 +46,7 @@ static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
>  
>  #define CMCI_THRESHOLD		1
>  #define CMCI_POLL_INTERVAL	(30 * HZ)
> -#define CMCI_STORM_INTERVAL	(1 * HZ)
> +#define CMCI_STORM_INTERVAL	(HZ)
>  #define CMCI_STORM_THRESHOLD	15
>  
>  static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
> @@ -68,6 +68,9 @@ static int cmci_supported(int *banks)
>  	if (mca_cfg.cmci_disabled || mca_cfg.ignore_ce)
>  		return 0;
>  
> +	if (mca_cfg.cmci)
> +		return 1;
> +
>  	/*
>  	 * Vendor check is not strictly needed, but the initial
>  	 * initialization is vendor keyed and this
> @@ -79,7 +82,9 @@ static int cmci_supported(int *banks)
>  		return 0;
>  	rdmsrl(MSR_IA32_MCG_CAP, cap);
>  	*banks = min_t(unsigned, MAX_NR_BANKS, cap & 0xff);
> -	return !!(cap & MCG_CMCI_P);
> +	mca_cfg.cmci = !!(cap & MCG_CMCI_P);
> +
> +	return mca_cfg.cmci;
>  }
>  
>  void mce_intel_cmci_poll(void)
> @@ -97,12 +102,13 @@ void mce_intel_hcpu_update(unsigned long cpu)
>  	per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
>  }
>  
> -unsigned long mce_intel_adjust_timer(unsigned long interval)
> +unsigned long cmci_intel_adjust_timer(unsigned long interval)
>  {
> -	int r;
> -
> -	if (interval < CMCI_POLL_INTERVAL)
> -		return interval;
> +	if (ce_error_seen() &&
> +	    (__this_cpu_read(cmci_storm_state) == CMCI_STORM_ACTIVE)) {
> +		mce_notify_irq();
> +		return CMCI_STORM_INTERVAL;
> +	}
>  
>  	switch (__this_cpu_read(cmci_storm_state)) {
>  	case CMCI_STORM_ACTIVE:
> @@ -112,8 +118,7 @@ unsigned long mce_intel_adjust_timer(unsigned long interval)
>  		 * timer interval is back to our poll interval.
>  		 */
>  		__this_cpu_write(cmci_storm_state, CMCI_STORM_SUBSIDED);
> -		r = atomic_sub_return(1, &cmci_storm_on_cpus);
> -		if (r == 0)
> +		if (!atomic_sub_return(1, &cmci_storm_on_cpus))
>  			pr_notice("CMCI storm subsided: switching to interrupt mode\n");
>  		/* FALLTHROUGH */
>  
> @@ -178,7 +183,7 @@ static bool cmci_storm_detect(void)
>  	cmci_storm_disable_banks();
>  	__this_cpu_write(cmci_storm_state, CMCI_STORM_ACTIVE);
>  	r = atomic_add_return(1, &cmci_storm_on_cpus);
> -	mce_timer_kick(CMCI_POLL_INTERVAL);
> +	mce_timer_kick(CMCI_STORM_INTERVAL);
>  
>  	if (r == 1)
>  		pr_notice("CMCI storm detected: switching to poll mode\n");
> 
> 
> -- 
> Regards/Gruss,
>     Boris.
> 
> Sent from a fat crate under my desk. Formatting is fine.
> --

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.
  2014-12-11  2:34         ` Calvin Owens
@ 2014-12-11 14:43           ` Borislav Petkov
  2014-12-17  0:17             ` Calvin Owens
  0 siblings, 1 reply; 11+ messages in thread
From: Borislav Petkov @ 2014-12-11 14:43 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Luck, Tony, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86@kernel.org, linux-edac@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com

On Wed, Dec 10, 2014 at 06:34:38PM -0800, Calvin Owens wrote:
> This doesn't fix the original issue though: the timer double-add can
> still happen if the CMCI interrupt that hits the STORM threshold pops
> off during mce_timer_fn() and calls mce_timer_kick().

Right, I guess we could do something similar to what you proposed
already (diff below on top of my previous diff).

> The polling is unnecessary on a CMCI-capable machine except in STORMs,
> right? Wouldn't it be better to eliminate it in that case?

Right, the problem with error thresholding is that it only counts errors
and raises the APIC interrupt when a certain count has been reached.
But there's no way for the software to look at *each* error and try to
discover trends and stuff.

Thus, we've been working on something which looks at each memory error,
collects them and then decays them over time and acts only for the
significant ones by poisoning pages:

https://lkml.kernel.org/r/1404242623-10094-1-git-send-email-bp@alien8.de

When we have that thing in place we can probably even go a step further
and simply disable error thresholding because it simply doesn't give us
what we need.

Btw, we would very much welcome your thoughts on this collector thing and
whether it would be usable in large installations. Or, if not, what else
it would need.

> That said, I ran mce-test with your patch on top of 3.18, looks good
> there. But that doesn't simulate CMCI interrupts.

Thanks, I had a machine here which was the perfect natural error
injector (has faulty DIMMs). I need to dig it out and play with it again
:)

Thanks.

---
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index e3f426698164..8cf2f486b81e 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1331,6 +1331,24 @@ int ce_error_seen(void)
 	return test_and_clear_bit(0, v);
 }
 
+static void __restart_timer(struct timer_list *t, unsigned long interval)
+{
+	unsigned long when = jiffies + interval;
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	if (timer_pending(t)) {
+		if (time_before(when, t->expires))
+			mod_timer_pinned(t, when);
+	} else {
+		t->expires = round_jiffies(when);
+		add_timer_on(t, smp_processor_id());
+	}
+
+	local_irq_restore(flags);
+}
+
 static void mce_timer_fn(unsigned long data)
 {
 	struct timer_list *t = this_cpu_ptr(&mce_timer);
@@ -1362,8 +1380,7 @@ static void mce_timer_fn(unsigned long data)
 
 done:
 	__this_cpu_write(mce_next_interval, iv);
-	t->expires = jiffies + iv;
-	add_timer_on(t, cpu);
+	__restart_timer(t, iv);
 }
 
 /*
@@ -1372,16 +1389,10 @@ done:
 void mce_timer_kick(unsigned long interval)
 {
 	struct timer_list *t = this_cpu_ptr(&mce_timer);
-	unsigned long when = jiffies + interval;
 	unsigned long iv = __this_cpu_read(mce_next_interval);
 
-	if (timer_pending(t)) {
-		if (time_before(when, t->expires))
-			mod_timer_pinned(t, when);
-	} else {
-		t->expires = round_jiffies(when);
-		add_timer_on(t, smp_processor_id());
-	}
+	__restart_timer(t, interval);
+
 	if (interval < iv)
 		__this_cpu_write(mce_next_interval, interval);
 }

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.
  2014-12-11 14:43           ` Borislav Petkov
@ 2014-12-17  0:17             ` Calvin Owens
  2014-12-17 21:19               ` Borislav Petkov
  0 siblings, 1 reply; 11+ messages in thread
From: Calvin Owens @ 2014-12-17  0:17 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86@kernel.org, linux-edac@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com

On Thursday 12/11 at 15:43 +0100, Borislav Petkov wrote:
> On Wed, Dec 10, 2014 at 06:34:38PM -0800, Calvin Owens wrote:
> > This doesn't fix the original issue though: the timer double-add can
> > still happen if the CMCI interrupt that hits the STORM threshold pops
> > off during mce_timer_fn() and calls mce_timer_kick().
> 
> Right, I guess we could do something similar to what you proposed
> already (diff below on top of my previous diff).

That looks good, thanks :)
 
> > The polling is unnecessary on a CMCI-capable machine except in STORMs,
> > right? Wouldn't it be better to eliminate it in that case?
> 
> Right, the problem with error thresholding is that it only counts errors
> and raises the APIC interrupt when a certain count has been reached.
> But there's no way for the software to look at *each* error and try to
> discover trends and stuff.
> 
> Thus, we've been working on something which looks at each memory error,
> collects them and then decays them over time and acts only for the
> significant ones by poisoning pages:
> 
> https://lkml.kernel.org/r/1404242623-10094-1-git-send-email-bp@alien8.de
> 
> When we have that thing in place we can probably even go a step further
> and simply disable error thresholding because it simply doesn't give us
> what we need.
> 
> Btw, we would very much welcome your thoughts on this collector thing and
> whether it would be usable in large installations. Or, if not, what else
> it would need.

Ooh I like this, I'll add some thoughts over in that thread.

> > That said, I ran mce-test with your patch on top of 3.18, looks good
> > there. But that doesn't simulate CMCI interrupts.
> 
> Thanks, I had a machine here which was the perfect natural error
> injector (has faulty DIMMs). I need to dig it out and play with it again
> :)
> 
> Thanks.


* Re: [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.
  2014-12-17  0:17             ` Calvin Owens
@ 2014-12-17 21:19               ` Borislav Petkov
  2015-01-23 11:03                 ` Borislav Petkov
  0 siblings, 1 reply; 11+ messages in thread
From: Borislav Petkov @ 2014-12-17 21:19 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Luck, Tony, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86@kernel.org, linux-edac@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com

On Tue, Dec 16, 2014 at 04:17:11PM -0800, Calvin Owens wrote:
> > Right, I guess we could do something similar to what you proposed
> > already (diff below ontop of my previous diff).
> 
> That looks good, thanks :)

Ok, thanks. I'll try to give it a run once I get a quiet minute from all
the CVE fun recently.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--


* Re: [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.
  2014-12-17 21:19               ` Borislav Petkov
@ 2015-01-23 11:03                 ` Borislav Petkov
  2015-02-02 11:05                   ` Borislav Petkov
  0 siblings, 1 reply; 11+ messages in thread
From: Borislav Petkov @ 2015-01-23 11:03 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Luck, Tony, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86@kernel.org, linux-edac@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com

On Wed, Dec 17, 2014 at 10:19:12PM +0100, Borislav Petkov wrote:
> Ok, thanks. I'll try to give it a run once I get a quiet minute from
> all the CVE fun recently.

Ok, I think we have something which shapes up to be a good
simplification for the storm code and cleans up a bunch of stuff in that
area, see below.

I've uploaded it as a branch if you want to give it a run as there are
other changes to MCE pending for 3.20 which might cause merge conflicts:

git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git, branch mce-fb

Shout if there are any remaining issues that have not been addressed.

Thanks.

---
From: Borislav Petkov <bp@suse.de>
Subject: [PATCH] x86, MCE: Cleanup CMCI storm logic

Initially, this started with the yet another report about a race
condition in the CMCI storm adaptive period length thing. Yes, we have
to admit, it is fragile and error prone. So let's simplify it.

The simpler logic is: now, after we enter storm mode, we go straight to
polling with CMCI_STORM_INTERVAL, i.e. once a second. We remain in storm
mode as long as we see errors being logged while polling.

Theoretically, if we see an uninterrupted error stream, we will remain
in storm mode indefinitely and keep polling the MSRs.

However, when the storm is actually a burst of errors, once we have
logged them all, we back out of it after ~5 mins of polling and no more
errors logged.

Making machine_check_poll() return a bool denoting whether it has seen
an error or not lets us simplify a bunch of code and make the storm
handling private to mce_intel.c.

Some minor cleanups while at it.

Reported-by: Calvin Owens <calvinowens@fb.com>
Link: http://lkml.kernel.org/r/1417746575-23299-1-git-send-email-calvinowens@fb.com
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <bp@suse.de>
---
 arch/x86/include/asm/mce.h                |  8 +--
 arch/x86/kernel/cpu/mcheck/mce-internal.h |  9 ++--
 arch/x86/kernel/cpu/mcheck/mce.c          | 86 ++++++++++++++++---------------
 arch/x86/kernel/cpu/mcheck/mce_intel.c    | 63 ++++++++++++++--------
 4 files changed, 96 insertions(+), 70 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 9b3de99dc004..fd38a23e729f 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -183,11 +183,11 @@ typedef DECLARE_BITMAP(mce_banks_t, MAX_NR_BANKS);
 DECLARE_PER_CPU(mce_banks_t, mce_poll_banks);
 
 enum mcp_flags {
-	MCP_TIMESTAMP = (1 << 0),	/* log time stamp */
-	MCP_UC = (1 << 1),		/* log uncorrected errors */
-	MCP_DONTLOG = (1 << 2),		/* only clear, don't log */
+	MCP_TIMESTAMP	= BIT(0),	/* log time stamp */
+	MCP_UC		= BIT(1),	/* log uncorrected errors */
+	MCP_DONTLOG	= BIT(2),	/* only clear, don't log */
 };
-void machine_check_poll(enum mcp_flags flags, mce_banks_t *b);
+bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b);
 
 int mce_notify_irq(void);
 
diff --git a/arch/x86/kernel/cpu/mcheck/mce-internal.h b/arch/x86/kernel/cpu/mcheck/mce-internal.h
index 10b46906767f..e12f0bfb45c1 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-internal.h
+++ b/arch/x86/kernel/cpu/mcheck/mce-internal.h
@@ -14,6 +14,7 @@ enum severity_level {
 };
 
 #define ATTR_LEN		16
+#define INITIAL_CHECK_INTERVAL	5 * 60 /* 5 minutes */
 
 /* One object for each MCE bank, shared by all CPUs */
 struct mce_bank {
@@ -30,13 +31,13 @@ extern struct mce_bank *mce_banks;
 extern mce_banks_t mce_banks_ce_disabled;
 
 #ifdef CONFIG_X86_MCE_INTEL
-unsigned long mce_intel_adjust_timer(unsigned long interval);
-void mce_intel_cmci_poll(void);
+unsigned long cmci_intel_adjust_timer(unsigned long interval);
+bool mce_intel_cmci_poll(void);
 void mce_intel_hcpu_update(unsigned long cpu);
 void cmci_disable_bank(int bank);
 #else
-# define mce_intel_adjust_timer mce_adjust_timer_default
-static inline void mce_intel_cmci_poll(void) { }
+# define cmci_intel_adjust_timer mce_adjust_timer_default
+static inline bool mce_intel_cmci_poll(void) { return false; }
 static inline void mce_intel_hcpu_update(unsigned long cpu) { }
 static inline void cmci_disable_bank(int bank) { }
 #endif
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index d23179900755..410807be1a6c 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -59,7 +59,7 @@ static DEFINE_MUTEX(mce_chrdev_read_mutex);
 #define CREATE_TRACE_POINTS
 #include <trace/events/mce.h>
 
-#define SPINUNIT 100	/* 100ns */
+#define SPINUNIT		100	/* 100ns */
 
 DEFINE_PER_CPU(unsigned, mce_exception_count);
 
@@ -88,9 +88,6 @@ static DECLARE_WAIT_QUEUE_HEAD(mce_chrdev_wait);
 static DEFINE_PER_CPU(struct mce, mces_seen);
 static int			cpu_missing;
 
-/* CMCI storm detection filter */
-static DEFINE_PER_CPU(unsigned long, mce_polled_error);
-
 /*
  * MCA banks polled by the period polling timer for corrected events.
  * With Intel CMCI, this only has MCA banks which do not support CMCI (if any).
@@ -624,8 +621,9 @@ DEFINE_PER_CPU(unsigned, mce_poll_count);
  * is already totally * confused. In this case it's likely it will
  * not fully execute the machine check handler either.
  */
-void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
+bool machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 {
+	bool error_logged = false;
 	struct mce m;
 	int severity;
 	int i;
@@ -648,7 +646,7 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 		if (!(m.status & MCI_STATUS_VAL))
 			continue;
 
-		this_cpu_write(mce_polled_error, 1);
+
 		/*
 		 * Uncorrected or signalled events are handled by the exception
 		 * handler when it is enabled, so don't process those here.
@@ -681,8 +679,10 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 		 * Don't get the IP here because it's unlikely to
 		 * have anything to do with the actual error location.
 		 */
-		if (!(flags & MCP_DONTLOG) && !mca_cfg.dont_log_ce)
+		if (!(flags & MCP_DONTLOG) && !mca_cfg.dont_log_ce) {
+			error_logged = true;
 			mce_log(&m);
+		}
 
 		/*
 		 * Clear state for this bank.
@@ -696,6 +696,8 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 	 */
 
 	sync_core();
+
+	return error_logged;
 }
 EXPORT_SYMBOL_GPL(machine_check_poll);
 
@@ -1257,7 +1259,7 @@ void mce_log_therm_throt_event(__u64 status)
  * poller finds an MCE, poll 2x faster.  When the poller finds no more
  * errors, poll 2x slower (up to check_interval seconds).
  */
-static unsigned long check_interval = 5 * 60; /* 5 minutes */
+static unsigned long check_interval = INITIAL_CHECK_INTERVAL;
 
 static DEFINE_PER_CPU(unsigned long, mce_next_interval); /* in jiffies */
 static DEFINE_PER_CPU(struct timer_list, mce_timer);
@@ -1267,49 +1269,57 @@ static unsigned long mce_adjust_timer_default(unsigned long interval)
 	return interval;
 }
 
-static unsigned long (*mce_adjust_timer)(unsigned long interval) =
-	mce_adjust_timer_default;
+static unsigned long (*mce_adjust_timer)(unsigned long interval) = mce_adjust_timer_default;
 
-static int cmc_error_seen(void)
+static void __restart_timer(struct timer_list *t, unsigned long interval)
 {
-	unsigned long *v = this_cpu_ptr(&mce_polled_error);
+	unsigned long when = jiffies + interval;
+	unsigned long flags;
+
+	local_irq_save(flags);
+
+	if (timer_pending(t)) {
+		if (time_before(when, t->expires))
+			mod_timer_pinned(t, when);
+	} else {
+		t->expires = round_jiffies(when);
+		add_timer_on(t, smp_processor_id());
+	}
 
-	return test_and_clear_bit(0, v);
+	local_irq_restore(flags);
 }
 
 static void mce_timer_fn(unsigned long data)
 {
 	struct timer_list *t = this_cpu_ptr(&mce_timer);
+	int cpu = smp_processor_id();
 	unsigned long iv;
-	int notify;
 
-	WARN_ON(smp_processor_id() != data);
+	WARN_ON(cpu != data);
+
+	iv = __this_cpu_read(mce_next_interval);
 
 	if (mce_available(this_cpu_ptr(&cpu_info))) {
-		machine_check_poll(MCP_TIMESTAMP,
-				this_cpu_ptr(&mce_poll_banks));
-		mce_intel_cmci_poll();
+		machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_poll_banks));
+
+		if (mce_intel_cmci_poll()) {
+			iv = mce_adjust_timer(iv);
+			goto done;
+		}
 	}
 
 	/*
-	 * Alert userspace if needed.  If we logged an MCE, reduce the
-	 * polling interval, otherwise increase the polling interval.
+	 * Alert userspace if needed. If we logged an MCE, reduce the polling
+	 * interval, otherwise increase the polling interval.
 	 */
-	iv = __this_cpu_read(mce_next_interval);
-	notify = mce_notify_irq();
-	notify |= cmc_error_seen();
-	if (notify) {
+	if (mce_notify_irq())
 		iv = max(iv / 2, (unsigned long) HZ/100);
-	} else {
+	else
 		iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));
-		iv = mce_adjust_timer(iv);
-	}
+
+done:
 	__this_cpu_write(mce_next_interval, iv);
-	/* Might have become 0 after CMCI storm subsided */
-	if (iv) {
-		t->expires = jiffies + iv;
-		add_timer_on(t, smp_processor_id());
-	}
+	__restart_timer(t, iv);
 }
 
 /*
@@ -1318,16 +1328,10 @@ static void mce_timer_fn(unsigned long data)
 void mce_timer_kick(unsigned long interval)
 {
 	struct timer_list *t = this_cpu_ptr(&mce_timer);
-	unsigned long when = jiffies + interval;
 	unsigned long iv = __this_cpu_read(mce_next_interval);
 
-	if (timer_pending(t)) {
-		if (time_before(when, t->expires))
-			mod_timer_pinned(t, when);
-	} else {
-		t->expires = round_jiffies(when);
-		add_timer_on(t, smp_processor_id());
-	}
+	__restart_timer(t, interval);
+
 	if (interval < iv)
 		__this_cpu_write(mce_next_interval, interval);
 }
@@ -1628,7 +1632,7 @@ static void __mcheck_cpu_init_vendor(struct cpuinfo_x86 *c)
 	switch (c->x86_vendor) {
 	case X86_VENDOR_INTEL:
 		mce_intel_feature_init(c);
-		mce_adjust_timer = mce_intel_adjust_timer;
+		mce_adjust_timer = cmci_intel_adjust_timer;
 		break;
 	case X86_VENDOR_AMD:
 		mce_amd_feature_init(c);
diff --git a/arch/x86/kernel/cpu/mcheck/mce_intel.c b/arch/x86/kernel/cpu/mcheck/mce_intel.c
index b3c97bafc123..b4a41cf030ed 100644
--- a/arch/x86/kernel/cpu/mcheck/mce_intel.c
+++ b/arch/x86/kernel/cpu/mcheck/mce_intel.c
@@ -39,6 +39,15 @@
 static DEFINE_PER_CPU(mce_banks_t, mce_banks_owned);
 
 /*
+ * CMCI storm detection backoff counter
+ *
+ * During storm, we reset this counter to INITIAL_CHECK_INTERVAL in case we've
+ * encountered an error. If not, we decrement it by one. We signal the end of
+ * the CMCI storm when it reaches 0.
+ */
+static DEFINE_PER_CPU(int, cmci_backoff_cnt);
+
+/*
  * cmci_discover_lock protects against parallel discovery attempts
  * which could race against each other.
  */
@@ -46,7 +55,7 @@ static DEFINE_RAW_SPINLOCK(cmci_discover_lock);
 
 #define CMCI_THRESHOLD		1
 #define CMCI_POLL_INTERVAL	(30 * HZ)
-#define CMCI_STORM_INTERVAL	(1 * HZ)
+#define CMCI_STORM_INTERVAL	(HZ)
 #define CMCI_STORM_THRESHOLD	15
 
 static DEFINE_PER_CPU(unsigned long, cmci_time_stamp);
@@ -82,11 +91,21 @@ static int cmci_supported(int *banks)
 	return !!(cap & MCG_CMCI_P);
 }
 
-void mce_intel_cmci_poll(void)
+bool mce_intel_cmci_poll(void)
 {
 	if (__this_cpu_read(cmci_storm_state) == CMCI_STORM_NONE)
-		return;
-	machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
+		return false;
+
+	/*
+	 * Reset the counter if we've logged an error in the last poll
+	 * during the storm.
+	 */
+	if (machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned)))
+		this_cpu_write(cmci_backoff_cnt, INITIAL_CHECK_INTERVAL);
+	else
+		this_cpu_dec(cmci_backoff_cnt);
+
+	return true;
 }
 
 void mce_intel_hcpu_update(unsigned long cpu)
@@ -97,31 +116,32 @@ void mce_intel_hcpu_update(unsigned long cpu)
 	per_cpu(cmci_storm_state, cpu) = CMCI_STORM_NONE;
 }
 
-unsigned long mce_intel_adjust_timer(unsigned long interval)
+unsigned long cmci_intel_adjust_timer(unsigned long interval)
 {
-	int r;
-
-	if (interval < CMCI_POLL_INTERVAL)
-		return interval;
+	if ((this_cpu_read(cmci_backoff_cnt) > 0) &&
+	    (__this_cpu_read(cmci_storm_state) == CMCI_STORM_ACTIVE)) {
+		mce_notify_irq();
+		return CMCI_STORM_INTERVAL;
+	}
 
 	switch (__this_cpu_read(cmci_storm_state)) {
 	case CMCI_STORM_ACTIVE:
+
 		/*
 		 * We switch back to interrupt mode once the poll timer has
-		 * silenced itself. That means no events recorded and the
-		 * timer interval is back to our poll interval.
+		 * silenced itself. That means no events recorded and the timer
+		 * interval is back to our poll interval.
 		 */
 		__this_cpu_write(cmci_storm_state, CMCI_STORM_SUBSIDED);
-		r = atomic_sub_return(1, &cmci_storm_on_cpus);
-		if (r == 0)
+		if (!atomic_sub_return(1, &cmci_storm_on_cpus))
 			pr_notice("CMCI storm subsided: switching to interrupt mode\n");
+
 		/* FALLTHROUGH */
 
 	case CMCI_STORM_SUBSIDED:
 		/*
-		 * We wait for all cpus to go back to SUBSIDED
-		 * state. When that happens we switch back to
-		 * interrupt mode.
+		 * We wait for all CPUs to go back to SUBSIDED state. When that
+		 * happens we switch back to interrupt mode.
 		 */
 		if (!atomic_read(&cmci_storm_on_cpus)) {
 			__this_cpu_write(cmci_storm_state, CMCI_STORM_NONE);
@@ -130,10 +150,8 @@ unsigned long mce_intel_adjust_timer(unsigned long interval)
 		}
 		return CMCI_POLL_INTERVAL;
 	default:
-		/*
-		 * We have shiny weather. Let the poll do whatever it
-		 * thinks.
-		 */
+
+		/* We have shiny weather. Let the poll do whatever it thinks. */
 		return interval;
 	}
 }
@@ -178,7 +196,8 @@ static bool cmci_storm_detect(void)
 	cmci_storm_disable_banks();
 	__this_cpu_write(cmci_storm_state, CMCI_STORM_ACTIVE);
 	r = atomic_add_return(1, &cmci_storm_on_cpus);
-	mce_timer_kick(CMCI_POLL_INTERVAL);
+	mce_timer_kick(CMCI_STORM_INTERVAL);
+	this_cpu_write(cmci_backoff_cnt, INITIAL_CHECK_INTERVAL);
 
 	if (r == 1)
 		pr_notice("CMCI storm detected: switching to poll mode\n");
@@ -195,6 +214,7 @@ static void intel_threshold_interrupt(void)
 {
 	if (cmci_storm_detect())
 		return;
+
 	machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
 	mce_notify_irq();
 }
@@ -286,6 +306,7 @@ void cmci_recheck(void)
 
 	if (!mce_available(raw_cpu_ptr(&cpu_info)) || !cmci_supported(&banks))
 		return;
+
 	local_irq_save(flags);
 	machine_check_poll(MCP_TIMESTAMP, this_cpu_ptr(&mce_banks_owned));
 	local_irq_restore(flags);
-- 
2.2.0.33.gc18b867

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--


* Re: [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms.
  2015-01-23 11:03                 ` Borislav Petkov
@ 2015-02-02 11:05                   ` Borislav Petkov
  0 siblings, 0 replies; 11+ messages in thread
From: Borislav Petkov @ 2015-02-02 11:05 UTC (permalink / raw)
  To: Calvin Owens
  Cc: Luck, Tony, Thomas Gleixner, Ingo Molnar, H. Peter Anvin,
	x86@kernel.org, linux-edac@vger.kernel.org,
	linux-kernel@vger.kernel.org, kernel-team@fb.com

On Fri, Jan 23, 2015 at 12:03:35PM +0100, Borislav Petkov wrote:
> I've uploaded it as a branch if you want to give it a run as there are
> other changes to MCE pending for 3.20 which might cause merge conflicts:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git, branch mce-fb
> 
> Shout if there are some issues remaining not addressed.

In the meantime, I've tested it successfully on an EINJ box too, will
queue it for 3.21 for extra simmering time in linux-next et al.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--


end of thread, other threads:[~2015-02-02 11:05 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-12-05  2:29 [PATCH] x86: mce: Avoid timer double-add during CMCI interrupt storms Calvin Owens
2014-12-09 18:08 ` Borislav Petkov
2014-12-09 23:00   ` Luck, Tony
2014-12-10  3:11     ` Calvin Owens
2014-12-10 19:23       ` Borislav Petkov
2014-12-11  2:34         ` Calvin Owens
2014-12-11 14:43           ` Borislav Petkov
2014-12-17  0:17             ` Calvin Owens
2014-12-17 21:19               ` Borislav Petkov
2015-01-23 11:03                 ` Borislav Petkov
2015-02-02 11:05                   ` Borislav Petkov
