linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH -v2 2/2] x86, MCE: Drop the default decoding notifier
@ 2011-04-18 14:00 Borislav Petkov
  2011-04-19 17:13 ` Ingo Molnar
  0 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2011-04-18 14:00 UTC (permalink / raw)
  To: H. Peter Anvin, Ingo Molnar, Thomas Gleixner
  Cc: Tony Luck, EDAC devel, LKML, Borislav Petkov, Prarit Bhargava,
	Nagananda Chumbalkar, Russ Anderson

From: Borislav Petkov <borislav.petkov@amd.com>

The default notifier doesn't make a lot of sense to call in the
correctable errors case. Drop it and emit the mcelog decoding hint only
in the uncorrectable errors case and when no notifier is registered.
Also, limit issuing the "mcelog --ascii" message in the rare case when
we dump unreported CEs before panicking.

While at it, remove unused old x86_mce_decode_callback from the header.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/include/asm/mce.h       |    2 --
 arch/x86/kernel/cpu/mcheck/mce.c |   23 ++++++-----------------
 2 files changed, 6 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index eb16e94..021979a 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -142,8 +142,6 @@ static inline void winchip_mcheck_init(struct cpuinfo_x86 *c) {}
 static inline void enable_p5_mce(void) {}
 #endif
 
-extern void (*x86_mce_decode_callback)(struct mce *m);
-
 void mce_setup(struct mce *m);
 void mce_log(struct mce *m);
 DECLARE_PER_CPU(struct sys_device, mce_dev);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 68e2303..679bd9e 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -105,20 +105,6 @@ static int			cpu_missing;
 ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
 EXPORT_SYMBOL_GPL(x86_mce_decoder_chain);
 
-static int default_decode_mce(struct notifier_block *nb, unsigned long val,
-			       void *data)
-{
-	pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n");
-	pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n");
-
-	return NOTIFY_STOP;
-}
-
-static struct notifier_block mce_dec_nb = {
-	.notifier_call = default_decode_mce,
-	.priority      = -1,
-};
-
 /* MCA banks polled by the period polling timer for corrected events */
 DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = {
 	[0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL
@@ -212,6 +198,8 @@ void mce_log(struct mce *mce)
 
 static void print_mce(struct mce *m)
 {
+	int ret = 0;
+
 	pr_emerg(HW_ERR "CPU %d: Machine Check Exception: %Lx Bank %d: %016Lx\n",
 	       m->extcpu, m->mcgstatus, m->bank, m->status);
 
@@ -239,7 +227,10 @@ static void print_mce(struct mce *m)
 	 * Print out human-readable details about the MCE error,
 	 * (if the CPU has an implementation for that)
 	 */
-	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
+	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
+	if (ret != NOTIFY_STOP)
+		printk_ratelimited(KERN_EMERG HW_ERR
+				   "Run the above through 'mcelog --ascii'\n");
 }
 
 #define PANIC_TIMEOUT 5 /* 5 seconds */
@@ -1721,8 +1712,6 @@ __setup("mce", mcheck_enable);
 
 int __init mcheck_init(void)
 {
-	atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb);
-
 	mcheck_intel_therm_init();
 
 	return 0;
-- 
1.7.4.rc2


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH -v2 2/2] x86, MCE: Drop the default decoding notifier
  2011-04-18 14:00 [PATCH -v2 2/2] x86, MCE: Drop the default decoding notifier Borislav Petkov
@ 2011-04-19 17:13 ` Ingo Molnar
  2011-04-19 17:35   ` Borislav Petkov
  0 siblings, 1 reply; 14+ messages in thread
From: Ingo Molnar @ 2011-04-19 17:13 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Thomas Gleixner, Tony Luck, EDAC devel, LKML,
	Borislav Petkov, Prarit Bhargava, Nagananda Chumbalkar,
	Russ Anderson


* Borislav Petkov <bp@amd64.org> wrote:

> From: Borislav Petkov <borislav.petkov@amd.com>
> Signed-off-by: Prarit Bhargava <prarit@redhat.com>
> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>

Hm, that From/SOB combination does not look right.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH -v2 2/2] x86, MCE: Drop the default decoding notifier
  2011-04-19 17:13 ` Ingo Molnar
@ 2011-04-19 17:35   ` Borislav Petkov
  2011-04-19 17:44     ` Ingo Molnar
  0 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2011-04-19 17:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Borislav Petkov, H. Peter Anvin, Thomas Gleixner, Tony Luck,
	EDAC devel, LKML, Prarit Bhargava, Nagananda Chumbalkar,
	Russ Anderson

On Tue, Apr 19, 2011 at 01:13:40PM -0400, Ingo Molnar wrote:
> 
> * Borislav Petkov <bp@amd64.org> wrote:
> 
> > From: Borislav Petkov <borislav.petkov@amd.com>
> > Signed-off-by: Prarit Bhargava <prarit@redhat.com>
> > Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
> 
> Hm, that From/SOB combination does not look right.

Sorry, how's that instead:

--
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Wed, 13 Apr 2011 14:32:06 +0200
Subject: [PATCH -v2 2/2] x86, MCE: Drop the default decoding notifier

The default notifier doesn't make a lot of sense to call in the
correctable errors case. Drop it and emit the mcelog decoding hint only
in the uncorrectable errors case and when no notifier is registered.
Also, limit issuing the "mcelog --ascii" message in the rare case when
we dump unreported CEs before panicking.

While at it, remove unused old x86_mce_decode_callback from the header.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
---
 arch/x86/include/asm/mce.h       |    2 --
 arch/x86/kernel/cpu/mcheck/mce.c |   23 ++++++-----------------
 2 files changed, 6 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index eb16e94..021979a 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -142,8 +142,6 @@ static inline void winchip_mcheck_init(struct cpuinfo_x86 *c) {}
 static inline void enable_p5_mce(void) {}
 #endif
 
-extern void (*x86_mce_decode_callback)(struct mce *m);
-
 void mce_setup(struct mce *m);
 void mce_log(struct mce *m);
 DECLARE_PER_CPU(struct sys_device, mce_dev);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 68e2303..679bd9e 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -105,20 +105,6 @@ static int			cpu_missing;
 ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
 EXPORT_SYMBOL_GPL(x86_mce_decoder_chain);
 
-static int default_decode_mce(struct notifier_block *nb, unsigned long val,
-			       void *data)
-{
-	pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n");
-	pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n");
-
-	return NOTIFY_STOP;
-}
-
-static struct notifier_block mce_dec_nb = {
-	.notifier_call = default_decode_mce,
-	.priority      = -1,
-};
-
 /* MCA banks polled by the period polling timer for corrected events */
 DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = {
 	[0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL
@@ -212,6 +198,8 @@ void mce_log(struct mce *mce)
 
 static void print_mce(struct mce *m)
 {
+	int ret = 0;
+
 	pr_emerg(HW_ERR "CPU %d: Machine Check Exception: %Lx Bank %d: %016Lx\n",
 	       m->extcpu, m->mcgstatus, m->bank, m->status);
 
@@ -239,7 +227,10 @@ static void print_mce(struct mce *m)
 	 * Print out human-readable details about the MCE error,
 	 * (if the CPU has an implementation for that)
 	 */
-	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
+	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
+	if (ret != NOTIFY_STOP)
+		printk_ratelimited(KERN_EMERG HW_ERR
+				   "Run the above through 'mcelog --ascii'\n");
 }
 
 #define PANIC_TIMEOUT 5 /* 5 seconds */
@@ -1721,8 +1712,6 @@ __setup("mce", mcheck_enable);
 
 int __init mcheck_init(void)
 {
-	atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb);
-
 	mcheck_intel_therm_init();
 
 	return 0;
-- 
1.7.4.rc2


-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH -v2 2/2] x86, MCE: Drop the default decoding notifier
  2011-04-19 17:35   ` Borislav Petkov
@ 2011-04-19 17:44     ` Ingo Molnar
  2011-04-20 10:23       ` Borislav Petkov
  0 siblings, 1 reply; 14+ messages in thread
From: Ingo Molnar @ 2011-04-19 17:44 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: H. Peter Anvin, Thomas Gleixner, Tony Luck, EDAC devel, LKML,
	Prarit Bhargava, Nagananda Chumbalkar, Russ Anderson


* Borislav Petkov <bp@amd64.org> wrote:

> +	if (ret != NOTIFY_STOP)
> +		printk_ratelimited(KERN_EMERG HW_ERR
> +				   "Run the above through 'mcelog --ascii'\n");

pr_emerg_ratelimited() would allow you to keep that broken line happy?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH -v2 2/2] x86, MCE: Drop the default decoding notifier
  2011-04-19 17:44     ` Ingo Molnar
@ 2011-04-20 10:23       ` Borislav Petkov
  2011-04-21 12:10         ` [tip:x86/mce] x86, mce: " tip-bot for Borislav Petkov
  2011-04-25 19:40         ` [PATCH -v2 2/2] x86, MCE: " Eric W. Biederman
  0 siblings, 2 replies; 14+ messages in thread
From: Borislav Petkov @ 2011-04-20 10:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Borislav Petkov, H. Peter Anvin, Thomas Gleixner, Tony Luck,
	EDAC devel, LKML, Prarit Bhargava, Nagananda Chumbalkar,
	Russ Anderson

On Tue, Apr 19, 2011 at 01:44:46PM -0400, Ingo Molnar wrote:
> 
> * Borislav Petkov <bp@amd64.org> wrote:
> 
> > +	if (ret != NOTIFY_STOP)
> > +		printk_ratelimited(KERN_EMERG HW_ERR
> > +				   "Run the above through 'mcelog --ascii'\n");
> 
> pr_emerg_ratelimited() would allow you to keep that broken line happy?

Not entirely. I had to do the exit-early-to-save-an-indentation-level
trick also :)

--
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Wed, 13 Apr 2011 14:32:06 +0200
Subject: [PATCH -v2.1 2/2] x86, MCE: Drop the default decoding notifier

The default notifier doesn't make a lot of sense to call in the
correctable errors case. Drop it and emit the mcelog decoding hint only
in the uncorrectable errors case and when no notifier is registered.
Also, limit issuing the "mcelog --ascii" message in the rare case when
we dump unreported CEs before panicking.

While at it, remove unused old x86_mce_decode_callback from the header.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
---
 arch/x86/include/asm/mce.h       |    2 --
 arch/x86/kernel/cpu/mcheck/mce.c |   24 +++++++-----------------
 2 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index eb16e94..021979a 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -142,8 +142,6 @@ static inline void winchip_mcheck_init(struct cpuinfo_x86 *c) {}
 static inline void enable_p5_mce(void) {}
 #endif
 
-extern void (*x86_mce_decode_callback)(struct mce *m);
-
 void mce_setup(struct mce *m);
 void mce_log(struct mce *m);
 DECLARE_PER_CPU(struct sys_device, mce_dev);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 68e2303..ff1ae9b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -105,20 +105,6 @@ static int			cpu_missing;
 ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
 EXPORT_SYMBOL_GPL(x86_mce_decoder_chain);
 
-static int default_decode_mce(struct notifier_block *nb, unsigned long val,
-			       void *data)
-{
-	pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n");
-	pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n");
-
-	return NOTIFY_STOP;
-}
-
-static struct notifier_block mce_dec_nb = {
-	.notifier_call = default_decode_mce,
-	.priority      = -1,
-};
-
 /* MCA banks polled by the period polling timer for corrected events */
 DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = {
 	[0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL
@@ -212,6 +198,8 @@ void mce_log(struct mce *mce)
 
 static void print_mce(struct mce *m)
 {
+	int ret = 0;
+
 	pr_emerg(HW_ERR "CPU %d: Machine Check Exception: %Lx Bank %d: %016Lx\n",
 	       m->extcpu, m->mcgstatus, m->bank, m->status);
 
@@ -239,7 +227,11 @@ static void print_mce(struct mce *m)
 	 * Print out human-readable details about the MCE error,
 	 * (if the CPU has an implementation for that)
 	 */
-	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
+	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
+	if (ret == NOTIFY_STOP)
+		return;
+
+	pr_emerg_ratelimited(HW_ERR "Run the above through 'mcelog --ascii'\n");
 }
 
 #define PANIC_TIMEOUT 5 /* 5 seconds */
@@ -1721,8 +1713,6 @@ __setup("mce", mcheck_enable);
 
 int __init mcheck_init(void)
 {
-	atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb);
-
 	mcheck_intel_therm_init();
 
 	return 0;
-- 
1.7.4.rc2



-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 4363 

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [tip:x86/mce] x86, mce: Drop the default decoding notifier
  2011-04-20 10:23       ` Borislav Petkov
@ 2011-04-21 12:10         ` tip-bot for Borislav Petkov
  2011-04-25 19:40         ` [PATCH -v2 2/2] x86, MCE: " Eric W. Biederman
  1 sibling, 0 replies; 14+ messages in thread
From: tip-bot for Borislav Petkov @ 2011-04-21 12:10 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, tony.luck, rja, Nagananda.Chumbalkar,
	tglx, prarit, mingo, borislav.petkov

Commit-ID:  dffa4b2f62ff28c982144c7033001b1ece4d3532
Gitweb:     http://git.kernel.org/tip/dffa4b2f62ff28c982144c7033001b1ece4d3532
Author:     Borislav Petkov <borislav.petkov@amd.com>
AuthorDate: Wed, 20 Apr 2011 12:23:49 +0200
Committer:  Ingo Molnar <mingo@elte.hu>
CommitDate: Thu, 21 Apr 2011 11:35:10 +0200

x86, mce: Drop the default decoding notifier

The default notifier doesn't make a lot of sense to call in the
correctable errors case. Drop it and emit the mcelog decoding
hint only in the uncorrectable errors case and when no notifier
is registered. Also, limit issuing the "mcelog --ascii" message
in the rare case when we dump unreported CEs before panicking.

While at it, remove unused old x86_mce_decode_callback from the
header.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Nagananda Chumbalkar <Nagananda.Chumbalkar@hp.com>
Cc: Russ Anderson <rja@sgi.com>
Link: http://lkml.kernel.org/r/20110420102349.GB1361@aftab
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 arch/x86/include/asm/mce.h       |    2 --
 arch/x86/kernel/cpu/mcheck/mce.c |   24 +++++++-----------------
 2 files changed, 7 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index eb16e94..021979a 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -142,8 +142,6 @@ static inline void winchip_mcheck_init(struct cpuinfo_x86 *c) {}
 static inline void enable_p5_mce(void) {}
 #endif
 
-extern void (*x86_mce_decode_callback)(struct mce *m);
-
 void mce_setup(struct mce *m);
 void mce_log(struct mce *m);
 DECLARE_PER_CPU(struct sys_device, mce_dev);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 68e2303..ff1ae9b 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -105,20 +105,6 @@ static int			cpu_missing;
 ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
 EXPORT_SYMBOL_GPL(x86_mce_decoder_chain);
 
-static int default_decode_mce(struct notifier_block *nb, unsigned long val,
-			       void *data)
-{
-	pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n");
-	pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n");
-
-	return NOTIFY_STOP;
-}
-
-static struct notifier_block mce_dec_nb = {
-	.notifier_call = default_decode_mce,
-	.priority      = -1,
-};
-
 /* MCA banks polled by the period polling timer for corrected events */
 DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = {
 	[0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL
@@ -212,6 +198,8 @@ void mce_log(struct mce *mce)
 
 static void print_mce(struct mce *m)
 {
+	int ret = 0;
+
 	pr_emerg(HW_ERR "CPU %d: Machine Check Exception: %Lx Bank %d: %016Lx\n",
 	       m->extcpu, m->mcgstatus, m->bank, m->status);
 
@@ -239,7 +227,11 @@ static void print_mce(struct mce *m)
 	 * Print out human-readable details about the MCE error,
 	 * (if the CPU has an implementation for that)
 	 */
-	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
+	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
+	if (ret == NOTIFY_STOP)
+		return;
+
+	pr_emerg_ratelimited(HW_ERR "Run the above through 'mcelog --ascii'\n");
 }
 
 #define PANIC_TIMEOUT 5 /* 5 seconds */
@@ -1721,8 +1713,6 @@ __setup("mce", mcheck_enable);
 
 int __init mcheck_init(void)
 {
-	atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb);
-
 	mcheck_intel_therm_init();
 
 	return 0;

^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH -v2 2/2] x86, MCE: Drop the default decoding notifier
  2011-04-20 10:23       ` Borislav Petkov
  2011-04-21 12:10         ` [tip:x86/mce] x86, mce: " tip-bot for Borislav Petkov
@ 2011-04-25 19:40         ` Eric W. Biederman
  2011-04-26  7:42           ` Borislav Petkov
  1 sibling, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2011-04-25 19:40 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Tony Luck,
	EDAC devel, LKML, Prarit Bhargava, Nagananda Chumbalkar,
	Russ Anderson

Borislav Petkov <bp@amd64.org> writes:

> On Tue, Apr 19, 2011 at 01:44:46PM -0400, Ingo Molnar wrote:
>> 
>> * Borislav Petkov <bp@amd64.org> wrote:
>> 
>> > +	if (ret != NOTIFY_STOP)
>> > +		printk_ratelimited(KERN_EMERG HW_ERR
>> > +				   "Run the above through 'mcelog --ascii'\n");
>> 
>> pr_emerg_ratelimited() would allow you to keep that broken line happy?
>
> Not entirely. I had to do the exit-early-to-save-an-indentation-level
> trick also :)
>
> --
> From: Borislav Petkov <borislav.petkov@amd.com>
> Date: Wed, 13 Apr 2011 14:32:06 +0200
> Subject: [PATCH -v2.1 2/2] x86, MCE: Drop the default decoding notifier
>
> The default notifier doesn't make a lot of sense to call in the
> correctable errors case. Drop it and emit the mcelog decoding hint only
> in the uncorrectable errors case and when no notifier is registered.
> Also, limit issuing the "mcelog --ascii" message in the rare case when
> we dump unreported CEs before panicking.
>
> While at it, remove unused old x86_mce_decode_callback from the
> header.

Can we please print something if we please log something in the
case of a correctable error, when we only report it via mcelog?

I have a stupid recent intel cpu here that hits that case and without
the default x86_mce_decode_callback I wouldn't have even known that I am
getting something like 50 correctable errors an hour on one of my
machines.  In particular I am it hits so often I am seeing:
"mce_notify_irq: 2 callbacks suppressed".  I need to get those dimms
replaced soon because in a new product I simply can't imagine that many
correctable errors.

Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH -v2 2/2] x86, MCE: Drop the default decoding notifier
  2011-04-25 19:40         ` [PATCH -v2 2/2] x86, MCE: " Eric W. Biederman
@ 2011-04-26  7:42           ` Borislav Petkov
  2011-04-26 21:06             ` Eric W. Biederman
  0 siblings, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2011-04-26  7:42 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Borislav Petkov, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Tony Luck, EDAC devel, LKML, Prarit Bhargava,
	Nagananda Chumbalkar, Russ Anderson

On Mon, Apr 25, 2011 at 03:40:11PM -0400, Eric W. Biederman wrote:
> > From: Borislav Petkov <borislav.petkov@amd.com>
> > Date: Wed, 13 Apr 2011 14:32:06 +0200
> > Subject: [PATCH -v2.1 2/2] x86, MCE: Drop the default decoding notifier
> >
> > The default notifier doesn't make a lot of sense to call in the
> > correctable errors case. Drop it and emit the mcelog decoding hint only
> > in the uncorrectable errors case and when no notifier is registered.
> > Also, limit issuing the "mcelog --ascii" message in the rare case when
> > we dump unreported CEs before panicking.
> >
> > While at it, remove unused old x86_mce_decode_callback from the
> > header.
> 
> Can we please print something if we please log something in the
> case of a correctable error, when we only report it via mcelog?
> 
> I have a stupid recent intel cpu here that hits that case and without
> the default x86_mce_decode_callback I wouldn't have even known that I am
> getting something like 50 correctable errors an hour on one of my
> machines.  In particular I am it hits so often I am seeing:
> "mce_notify_irq: 2 callbacks suppressed".  I need to get those dimms
> replaced soon because in a new product I simply can't imagine that many
> correctable errors.

Isn't there a mcelog daemon or something that polls /dev/mcelog and
tells you about those DRAM ECCs in some log file where you're supposed
to look? :)

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH -v2 2/2] x86, MCE: Drop the default decoding notifier
  2011-04-26  7:42           ` Borislav Petkov
@ 2011-04-26 21:06             ` Eric W. Biederman
  2011-04-26 21:47               ` Borislav Petkov
  2011-04-26 22:32               ` Russ Anderson
  0 siblings, 2 replies; 14+ messages in thread
From: Eric W. Biederman @ 2011-04-26 21:06 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Tony Luck,
	EDAC devel, LKML, Prarit Bhargava, Nagananda Chumbalkar,
	Russ Anderson

Borislav Petkov <bp@amd64.org> writes:

> On Mon, Apr 25, 2011 at 03:40:11PM -0400, Eric W. Biederman wrote:
>> > From: Borislav Petkov <borislav.petkov@amd.com>
>> > Date: Wed, 13 Apr 2011 14:32:06 +0200
>> > Subject: [PATCH -v2.1 2/2] x86, MCE: Drop the default decoding notifier
>> >
>> > The default notifier doesn't make a lot of sense to call in the
>> > correctable errors case. Drop it and emit the mcelog decoding hint only
>> > in the uncorrectable errors case and when no notifier is registered.
>> > Also, limit issuing the "mcelog --ascii" message in the rare case when
>> > we dump unreported CEs before panicking.
>> >
>> > While at it, remove unused old x86_mce_decode_callback from the
>> > header.
>> 
>> Can we please print something if we please log something in the
>> case of a correctable error, when we only report it via mcelog?
>> 
>> I have a stupid recent intel cpu here that hits that case and without
>> the default x86_mce_decode_callback I wouldn't have even known that I am
>> getting something like 50 correctable errors an hour on one of my
>> machines.  In particular I am it hits so often I am seeing:
>> "mce_notify_irq: 2 callbacks suppressed".  I need to get those dimms
>> replaced soon because in a new product I simply can't imagine that many
>> correctable errors.
>
> Isn't there a mcelog daemon or something that polls /dev/mcelog and
> tells you about those DRAM ECCs in some log file where you're supposed
> to look? :)

On fedora 14 there is a cron job that writes to /var/log/mcelog, and
does not go through syslog.  But you have to be proactive and look
there.  If the people who work on this code can't even remember
where to look I can't imagine how anyone else can remember.
Which is why I object to the removal of the one printk that told
me something was broken on my machine.

So far from what I have seen /dev/mcelog and the userspace mcelog is
over complicated and near useless.  It seems to focused around the
notion that "This is not a software problem, please do not bug
Andi Kleen about it"

Well it is a hardware problem so I do need to RMA that hardware.
Sigh.

Eric


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH -v2 2/2] x86, MCE: Drop the default decoding notifier
  2011-04-26 21:06             ` Eric W. Biederman
@ 2011-04-26 21:47               ` Borislav Petkov
  2011-04-26 22:26                 ` Eric W. Biederman
  2011-04-26 22:32               ` Russ Anderson
  1 sibling, 1 reply; 14+ messages in thread
From: Borislav Petkov @ 2011-04-26 21:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Borislav Petkov, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Tony Luck, EDAC devel, LKML, Prarit Bhargava,
	Nagananda Chumbalkar, Russ Anderson

On Tue, Apr 26, 2011 at 05:06:39PM -0400, Eric W. Biederman wrote:
> Borislav Petkov <bp@amd64.org> writes:
> 
> > On Mon, Apr 25, 2011 at 03:40:11PM -0400, Eric W. Biederman wrote:
> >> > From: Borislav Petkov <borislav.petkov@amd.com>
> >> > Date: Wed, 13 Apr 2011 14:32:06 +0200
> >> > Subject: [PATCH -v2.1 2/2] x86, MCE: Drop the default decoding notifier
> >> >
> >> > The default notifier doesn't make a lot of sense to call in the
> >> > correctable errors case. Drop it and emit the mcelog decoding hint only
> >> > in the uncorrectable errors case and when no notifier is registered.
> >> > Also, limit issuing the "mcelog --ascii" message in the rare case when
> >> > we dump unreported CEs before panicking.
> >> >
> >> > While at it, remove unused old x86_mce_decode_callback from the
> >> > header.
> >> 
> >> Can we please print something if we please log something in the
> >> case of a correctable error, when we only report it via mcelog?
> >> 
> >> I have a stupid recent intel cpu here that hits that case and without
> >> the default x86_mce_decode_callback I wouldn't have even known that I am
> >> getting something like 50 correctable errors an hour on one of my
> >> machines.  In particular I am it hits so often I am seeing:
> >> "mce_notify_irq: 2 callbacks suppressed".  I need to get those dimms
> >> replaced soon because in a new product I simply can't imagine that many
> >> correctable errors.
> >
> > Isn't there a mcelog daemon or something that polls /dev/mcelog and
> > tells you about those DRAM ECCs in some log file where you're supposed
> > to look? :)
> 
> On fedora 14 there is a cron job that writes to /var/log/mcelog, and
> does not go through syslog.  But you have to be proactive and look
> there.  If the people who work on this code can't even remember
> where to look I can't imagine how anyone else can remember.

Ha!

I'm working exactly in the opposite direction actually - drop mcelog and
make RAS much more user friendly. As a first step, this is why we have
all that MCE decoding code for AMD hw and when you get an error, you
can't miss it:

Apr 20 21:08:24 kepek kernel: [  300.816122] [Hardware Error]: MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc00c000c6080a13
Apr 20 21:08:24 kepek kernel: [  300.825156] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
Apr 20 21:08:24 kepek kernel: [  300.825167] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x4171fe380
Apr 20 21:08:24 kepek kernel: [  300.825257] EDAC MC0: CE page 0x4171fe, offset 0x380, grain 0, syndrome 0xc601, row 3, channel 0, label "": amd64_
edac

or this:

Apr 15 16:54:17 kepek kernel: [72187.027059] [Hardware Error]: MC0_STATUS[-|UE|-|-|AddrV|UECC]: 0xb400210000010016
Apr 15 16:54:17 kepek kernel: [72187.027059] [Hardware Error]: Data Cache Error: L2 TLB multimatch.
Apr 15 16:54:17 kepek kernel: [72187.027059] [Hardware Error]: cache level: L2, tx: DATA


There's also this RAS daemon I'm hacking on which uses perf to carry
error information to userspace and do more than reporting it. For
example, server farm guys don't want to scan syslog for every CECC error
but rather have it collected somewhere on one machine, maybe over the
network, etc, etc.

So now is the time to speak up and let me know how you would like to get
the error reported? In general, what should be done differently in Linux
wrt to RAS.

> Which is why I object to the removal of the one printk that told
> me something was broken on my machine.

I dunno, maybe it's time we moved the MCE decoding functionality which
is shared by most of x86 into core code. Ingo, Peter, Thomas, what do
you guys think?

This'll at least put something in the logs that is sensible instead
of useless strings which tell the users what to do next. Also, we can
ratelimit it so that DIMMs generating too many CECCs don't flood them
too much. Hmm...

> So far from what I have seen /dev/mcelog and the userspace mcelog is
> over complicated and near useless.  It seems to focused around the
> notion that "This is not a software problem, please do not bug
> Andi Kleen about it"

;-)

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH -v2 2/2] x86, MCE: Drop the default decoding notifier
  2011-04-26 21:47               ` Borislav Petkov
@ 2011-04-26 22:26                 ` Eric W. Biederman
  2011-04-26 23:44                   ` Luck, Tony
  0 siblings, 1 reply; 14+ messages in thread
From: Eric W. Biederman @ 2011-04-26 22:26 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, Tony Luck,
	EDAC devel, LKML, Prarit Bhargava, Nagananda Chumbalkar,
	Russ Anderson

Borislav Petkov <bp@amd64.org> writes:

> On Tue, Apr 26, 2011 at 05:06:39PM -0400, Eric W. Biederman wrote:
>
> Ha!
>
> I'm working exactly in the opposite direction actually - drop mcelog and
> make RAS much more user friendly. As a first step, this is why we have
> all that MCE decoding code for AMD hw and when you get an error, you
> can't miss it:

I have no problem with having mcelog go away.

When we are on a system where we can't decode the mce, (aka the hardware
is newer than the kernel) we need some kind of sensible fallback that
prints something into syslog even if we don't have the full decode.

> Apr 20 21:08:24 kepek kernel: [  300.816122] [Hardware Error]: MC4_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc00c000c6080a13
> Apr 20 21:08:24 kepek kernel: [  300.825156] [Hardware Error]: Northbridge Error (node 0): DRAM ECC error detected on the NB.
> Apr 20 21:08:24 kepek kernel: [  300.825167] EDAC amd64 MC0: CE ERROR_ADDRESS= 0x4171fe380
> Apr 20 21:08:24 kepek kernel: [  300.825257] EDAC MC0: CE page 0x4171fe, offset 0x380, grain 0, syndrome 0xc601, row 3, channel 0, label "": amd64_
> edac
>
> or this:
>
> Apr 15 16:54:17 kepek kernel: [72187.027059] [Hardware Error]: MC0_STATUS[-|UE|-|-|AddrV|UECC]: 0xb400210000010016
> Apr 15 16:54:17 kepek kernel: [72187.027059] [Hardware Error]: Data Cache Error: L2 TLB multimatch.
> Apr 15 16:54:17 kepek kernel: [72187.027059] [Hardware Error]: cache level: L2, tx: DATA
>
>
> There's also this RAS daemon I'm hacking on which uses perf to carry
> error information to userspace and do more than reporting it. For
> example, server farm guys don't want to scan syslog for every CECC error
> but rather have it collected somewhere on one machine, maybe over the
> network, etc, etc.

Collected somewhere on one machine sounds remarkably like syslog.

I expect what the big server farm guys object to most is errors that
are hard to parse and hard to deal with in automation.  And I can't
say that I blame them.

> So now is the time to speak up and let me know how you would like to get
> the error reported? In general, what should be done differently in Linux
> wrt to RAS.

>> Which is why I object to the removal of the one printk that told
>> me something was broken on my machine.
>
> I dunno, maybe it's time we moved the MCE decoding functionality which
> is shared by most of x86 into core code. Ingo, Peter, Thomas, what do
> you guys think?
>
> This'll at least put something in the logs that is sensible instead
> of useless strings which tell the users what to do next. Also, we can
> ratelimit it so that DIMMs generating too many CECCs don't flood them
> too much. Hmm...

Sure.  Although any DIMM that is generating so many correctable errors
that you need to rate limit it in the kernel, won't likely to confine
itself to correctable errors.

Still it can happen that things are so bad that you do need to rate
limit it in the kernel.  Still with those you start wondering "How did
this machine boot?"  So printk_ratelimit sounds like a fine idea.

In the casual use situation where I have not yet bothered on my small
number of machines to set up sophisticated logging infrastructure
I just want something that adds an annoying message to syslog so
people who are logged in can say what was that, and downtime can be
scheduled.

Eric

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH -v2 2/2] x86, MCE: Drop the default decoding notifier
  2011-04-26 21:06             ` Eric W. Biederman
  2011-04-26 21:47               ` Borislav Petkov
@ 2011-04-26 22:32               ` Russ Anderson
  1 sibling, 0 replies; 14+ messages in thread
From: Russ Anderson @ 2011-04-26 22:32 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Borislav Petkov, Ingo Molnar, H. Peter Anvin, Thomas Gleixner,
	Tony Luck, EDAC devel, LKML, Prarit Bhargava,
	Nagananda Chumbalkar, rja

On Tue, Apr 26, 2011 at 02:06:39PM -0700, Eric W. Biederman wrote:
> Borislav Petkov <bp@amd64.org> writes:
> > On Mon, Apr 25, 2011 at 03:40:11PM -0400, Eric W. Biederman wrote:
> >> > From: Borislav Petkov <borislav.petkov@amd.com>
> >> > Date: Wed, 13 Apr 2011 14:32:06 +0200
> >> > Subject: [PATCH -v2.1 2/2] x86, MCE: Drop the default decoding notifier
> >> >
> >> > The default notifier doesn't make a lot of sense to call in the
> >> > correctable errors case. Drop it and emit the mcelog decoding hint only
> >> > in the uncorrectable errors case and when no notifier is registered.
> >> > Also, limit issuing the "mcelog --ascii" message in the rare case when
> >> > we dump unreported CEs before panicking.
> >> >
> >> > While at it, remove unused old x86_mce_decode_callback from the
> >> > header.
> >> 
> >> Can we please print something if we please log something in the
> >> case of a correctable error, when we only report it via mcelog?
> >> 
> >> I have a stupid recent intel cpu here that hits that case and without
> >> the default x86_mce_decode_callback I wouldn't have even known that I am
> >> getting something like 50 correctable errors an hour on one of my
> >> machines.  In particular I am it hits so often I am seeing:
> >> "mce_notify_irq: 2 callbacks suppressed".  I need to get those dimms
> >> replaced soon because in a new product I simply can't imagine that many
> >> correctable errors.
> >
> > Isn't there a mcelog daemon or something that polls /dev/mcelog and
> > tells you about those DRAM ECCs in some log file where you're supposed
> > to look? :)
> 
> On fedora 14 there is a cron job that writes to /var/log/mcelog, and
> does not go through syslog.

Interesting.  I'm running fedora 14 and don't have a /var/log/mcelog
file or see an mcelog package (not that I'd looked until just now).

>                              But you have to be proactive and look
> there.  If the people who work on this code can't even remember
> where to look I can't imagine how anyone else can remember.
> Which is why I object to the removal of the one printk that told
> me something was broken on my machine.

Historically hardware error reporting has been very platform
dependent.  Those differences made it difficult to come up with
agreement on standard ways to report errors.  You raise a good
point that it needs to work better.

> So far from what I have seen /dev/mcelog and the userspace mcelog is
> over complicated and near useless.

/dev/mcelog is extremely useful to SGI.  As you said, "you have to 
be proactive and look there" which we are and do.  :-)

>                                     It seems to focused around the
> notion that "This is not a software problem, please do not bug
> Andi Kleen about it"
> 
> Well it is a hardware problem so I do need to RMA that hardware.
> Sigh.

You raise a good issue that users do need to know when their 
hardware is having issues.

> Eric

-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: [PATCH -v2 2/2] x86, MCE: Drop the default decoding notifier
  2011-04-26 22:26                 ` Eric W. Biederman
@ 2011-04-26 23:44                   ` Luck, Tony
  2011-04-27 14:03                     ` Borislav Petkov
  0 siblings, 1 reply; 14+ messages in thread
From: Luck, Tony @ 2011-04-26 23:44 UTC (permalink / raw)
  To: Eric W. Biederman, Borislav Petkov
  Cc: Ingo Molnar, H. Peter Anvin, Thomas Gleixner, EDAC devel, LKML,
	Prarit Bhargava, Nagananda Chumbalkar, Russ Anderson

> Sure.  Although any DIMM that is generating so many correctable errors
> that you need to rate limit it in the kernel, won't likely to confine
> itself to correctable errors.
>
> Still it can happen that things are so bad that you do need to rate
> limit it in the kernel.  Still with those you start wondering "How did
> this machine boot?"  So printk_ratelimit sounds like a fine idea.


Perhaps we really want thresholds rather than rate limits (for corrected
errors).  One corrected error shouldn't cause any but the most paranoid
to worry.  A couple of errors from the same DIMM close together might be
some cause for concern, but could just be happenstance. Enough errors that
rate limiting looks useful, and you are into "something needs to be done"
territory.

-Tony

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH -v2 2/2] x86, MCE: Drop the default decoding notifier
  2011-04-26 23:44                   ` Luck, Tony
@ 2011-04-27 14:03                     ` Borislav Petkov
  0 siblings, 0 replies; 14+ messages in thread
From: Borislav Petkov @ 2011-04-27 14:03 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Eric W. Biederman, Borislav Petkov, Ingo Molnar, H. Peter Anvin,
	Thomas Gleixner, EDAC devel, LKML, Prarit Bhargava,
	Nagananda Chumbalkar, Russ Anderson

On Tue, Apr 26, 2011 at 07:44:35PM -0400, Luck, Tony wrote:
> > Sure.  Although any DIMM that is generating so many correctable errors
> > that you need to rate limit it in the kernel, won't likely to confine
> > itself to correctable errors.
> >
> > Still it can happen that things are so bad that you do need to rate
> > limit it in the kernel.  Still with those you start wondering "How did
> > this machine boot?"  So printk_ratelimit sounds like a fine idea.
> 
> 
> Perhaps we really want thresholds rather than rate limits (for corrected
> errors).  One corrected error shouldn't cause any but the most paranoid
> to worry.  A couple of errors from the same DIMM close together might be
> some cause for concern, but could just be happenstance. Enough errors that
> rate limiting looks useful, and you are into "something needs to be done"
> territory.

Right, but for thresholding you need to know to which DIMM the error
address belongs. And this is not trivial in all cases. It looks like we
need this error reporting thing dynamic:

As a sane default, we want to dump some _sensible_ info to syslog about
some errors happening. Then, all in-kernel decoding modules can enrich
that error info with more specific details. Those two reporting modes
should be ratelimited since doing thresholding in kernel could be tricky
and taking up precious resources.

Finally, you have the userspace daemon which datacenter people or
Google with small root partitions could use, which sends that info to a
centralized location instead of someone collecting it from each node. In
that case, no info goes out to syslog but is eaten up by the RAS daemon
which runs on every machine. And in userspace you can do all the cool
thresholding, rate limiting, policy applying, etc, your heart desires.

Makes sense?

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2011-04-27 14:03 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-04-18 14:00 [PATCH -v2 2/2] x86, MCE: Drop the default decoding notifier Borislav Petkov
2011-04-19 17:13 ` Ingo Molnar
2011-04-19 17:35   ` Borislav Petkov
2011-04-19 17:44     ` Ingo Molnar
2011-04-20 10:23       ` Borislav Petkov
2011-04-21 12:10         ` [tip:x86/mce] x86, mce: " tip-bot for Borislav Petkov
2011-04-25 19:40         ` [PATCH -v2 2/2] x86, MCE: " Eric W. Biederman
2011-04-26  7:42           ` Borislav Petkov
2011-04-26 21:06             ` Eric W. Biederman
2011-04-26 21:47               ` Borislav Petkov
2011-04-26 22:26                 ` Eric W. Biederman
2011-04-26 23:44                   ` Luck, Tony
2011-04-27 14:03                     ` Borislav Petkov
2011-04-26 22:32               ` Russ Anderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).