[PATCH]: mce: don't print "human readable" message for corrected errors

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH]: mce: don't print "human readable" message for corrected errors
@ 2011-04-12 17:44 Prarit Bhargava
  2011-04-12 18:58 ` Borislav Petkov
  0 siblings, 1 reply; 38+ messages in thread
From: Prarit Bhargava @ 2011-04-12 17:44 UTC (permalink / raw)
  To: linux-kernel; +Cc: dzickus, mstowe, dnelson, tony.luck, Prarit Bhargava

Don't display the "human readable" warning for correctable errors in mce.
There is no need for this information to be displayed.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 450a366..8024013 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -100,10 +100,16 @@ ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
 EXPORT_SYMBOL_GPL(x86_mce_decoder_chain);
 
 static int default_decode_mce(struct notifier_block *nb, unsigned long val,
-			       void *data)
+			      void *data)
 {
-	pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n");
-	pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n");
+	struct mce *m = (struct mce *)data;
+
+	if (m->status & MCI_STATUS_UC) {
+		pr_emerg(HW_ERR "No human readable MCE decoding support "
+			 "on this CPU type.\n");
+		pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' "
+			 "to decode.\n");
+	}
 
 	return NOTIFY_STOP;
 }

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH]: mce: don't print "human readable" message for corrected errors
  2011-04-12 17:44 [PATCH]: mce: don't print "human readable" message for corrected errors Prarit Bhargava
@ 2011-04-12 18:58 ` Borislav Petkov
  2011-04-12 19:22   ` Prarit Bhargava
  2011-04-12 20:02   ` Luck, Tony
  0 siblings, 2 replies; 38+ messages in thread
From: Borislav Petkov @ 2011-04-12 18:58 UTC (permalink / raw)
  To: Prarit Bhargava; +Cc: linux-kernel, dzickus, mstowe, dnelson, tony.luck

On Tue, Apr 12, 2011 at 01:44:05PM -0400, Prarit Bhargava wrote:
> Don't display the "human readable" warning for correctable errors in mce.
> There is no need for this information to be displayed.
> 
> Signed-off-by: Prarit Bhargava <prarit@redhat.com>

Why not? This way you turn reporting of _ALL_ correctable MCEs
completely off and some users would actually like to run them through
mcelog on Intel.

> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 450a366..8024013 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -100,10 +100,16 @@ ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
>  EXPORT_SYMBOL_GPL(x86_mce_decoder_chain);
>  
>  static int default_decode_mce(struct notifier_block *nb, unsigned long val,
> -			       void *data)
> +			      void *data)
>  {
> -	pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n");
> -	pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n");
> +	struct mce *m = (struct mce *)data;
> +
> +	if (m->status & MCI_STATUS_UC) {
> +		pr_emerg(HW_ERR "No human readable MCE decoding support "
> +			 "on this CPU type.\n");
> +		pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' "
> +			 "to decode.\n");
> +	}
>  
>  	return NOTIFY_STOP;
>  }
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH]: mce: don't print "human readable" message for corrected errors
  2011-04-12 18:58 ` Borislav Petkov
@ 2011-04-12 19:22   ` Prarit Bhargava
  2011-04-12 19:57     ` Borislav Petkov
  2011-04-12 20:02   ` Luck, Tony
  1 sibling, 1 reply; 38+ messages in thread
From: Prarit Bhargava @ 2011-04-12 19:22 UTC (permalink / raw)
  To: Borislav Petkov, linux-kernel, dzickus, mstowe, dnelson,
	tony.luck



On 04/12/2011 02:58 PM, Borislav Petkov wrote:
> On Tue, Apr 12, 2011 at 01:44:05PM -0400, Prarit Bhargava wrote:
>   
>> Don't display the "human readable" warning for correctable errors in mce.
>> There is no need for this information to be displayed.
>>
>> Signed-off-by: Prarit Bhargava <prarit@redhat.com>
>>     
> Why not? This way you turn reporting of _ALL_ correctable MCEs
> completely off and some users would actually like to run them through
> mcelog on Intel.
>   

I may be reading the code wrong ... but in the case of the
default_decode_mce() callback there is *no* additional output displayed
for correctable MCEs -- so the message is AFAICT useless.

ie) all you see on the console log is

[Hardware Error]: Machine check events logged

^^^ this notifies the user that something happened with MCE

[Hardware Error]: No human readable MCE decoding support on this CPU type.

^^^ this is purely informational and unnecessary

[Hardware Error]: Run the message through 'mcelog --ascii' to decode.

^^^ this makes absolutely no sense.  There is no message to decode.

You won't see this in the case of the amd-edac because it has it's own
callback.

P.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH]: mce: don't print "human readable" message for corrected errors
  2011-04-12 19:22   ` Prarit Bhargava
@ 2011-04-12 19:57     ` Borislav Petkov
  0 siblings, 0 replies; 38+ messages in thread
From: Borislav Petkov @ 2011-04-12 19:57 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: linux-kernel, dzickus, mstowe, dnelson, tony.luck, Ingo Molnar,
	H. Peter Anvin, Thomas Gleixner

Adding x86 people.

On Tue, Apr 12, 2011 at 03:22:50PM -0400, Prarit Bhargava wrote:
> 
> 
> On 04/12/2011 02:58 PM, Borislav Petkov wrote:
> > On Tue, Apr 12, 2011 at 01:44:05PM -0400, Prarit Bhargava wrote:
> >   
> >> Don't display the "human readable" warning for correctable errors in mce.
> >> There is no need for this information to be displayed.
> >>
> >> Signed-off-by: Prarit Bhargava <prarit@redhat.com>
> >>     
> > Why not? This way you turn reporting of _ALL_ correctable MCEs
> > completely off and some users would actually like to run them through
> > mcelog on Intel.
> >   
> 
> I may be reading the code wrong ... but in the case of the
> default_decode_mce() callback there is *no* additional output displayed
> for correctable MCEs -- so the message is AFAICT useless.
> 
> ie) all you see on the console log is
> 
> [Hardware Error]: Machine check events logged
> 
> ^^^ this notifies the user that something happened with MCE
> 
> [Hardware Error]: No human readable MCE decoding support on this CPU type.
> 
> ^^^ this is purely informational and unnecessary
> 
> [Hardware Error]: Run the message through 'mcelog --ascii' to decode.
> 
> ^^^ this makes absolutely no sense.  There is no message to decode.
> 
> You won't see this in the case of the amd-edac because it has it's own
> callback.

Right, I see what you're sayin'. This is actually this piece:

		if (!(flags & MCP_DONTLOG) && !mce_dont_log_ce) {
			mce_log(&m);
			atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m);
			add_taint(TAINT_MACHINE_CHECK);
		}

in machine_check_poll and the reason I added it at the time was so EDAC
gets notified about DRAM ECCs. And you're right, when we're not decoding
in the kernel this message is not helping a lot. I can imagine this
filling up the logs on a machine spitting lotsa DRAM CECC errors.

I'm wondering whether dropping the printks completely is also an option
for I don't see any merit for having them anyway...

Hmmm.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH]: mce: don't print "human readable" message for corrected errors
  2011-04-12 18:58 ` Borislav Petkov
  2011-04-12 19:22   ` Prarit Bhargava
@ 2011-04-12 20:02   ` Luck, Tony
  2011-04-12 20:15     ` Prarit Bhargava
  2011-04-13  2:24     ` [PATCH]: mce: don't print "human readable" message for corrected errors Russ Anderson
  1 sibling, 2 replies; 38+ messages in thread
From: Luck, Tony @ 2011-04-12 20:02 UTC (permalink / raw)
  To: Borislav Petkov, Prarit Bhargava
  Cc: linux-kernel@vger.kernel.org, dzickus@redhat.com,
	mstowe@redhat.com, dnelson@redhat.com

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: text/plain; charset="utf-8", Size: 871 bytes --]

> Why not? This way you turn reporting of _ALL_ correctable MCEs
> completely off and some users would actually like to run them through
> mcelog on Intel.

pr_emerg() is rather overkill for a corrected error - on large systems
corrected errors are going to be a routine occurrence (my personal estimation
is "one soft error per gigabyte per month" ... which is pretty much the
same as "one per terabyte per hour" for the people with the really cool
toys.

We are also setting TAINT_MACHINE_CHECK for corrected errors - perhaps
this made sense when systems were small and machine checks were rare and
scary.  But I think we need to start working with the reality that
corrected errors are normal events.

-Tony
ÿôèº{.nÇ+‰·Ÿ®‰†+%ŠËÿ±éÝ¶\x17¥Šwÿº{.nÇ+‰·¥Š{±þG«éÿŠ{ayº\x1dÊ‡Ú™ë,j\a¢f£¢·hšïêÿ‘êçz_è®\x03(éšŽŠÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?™¨èÚ&£ø§~á¶iO•æ¬z·švØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?–I¥

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH]: mce: don't print "human readable" message for corrected errors
  2011-04-12 20:02   ` Luck, Tony
@ 2011-04-12 20:15     ` Prarit Bhargava
  2011-04-12 20:28       ` Borislav Petkov
  2011-04-13  2:24     ` [PATCH]: mce: don't print "human readable" message for corrected errors Russ Anderson
  1 sibling, 1 reply; 38+ messages in thread
From: Prarit Bhargava @ 2011-04-12 20:15 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, linux-kernel@vger.kernel.org, dzickus@redhat.com,
	mstowe@redhat.com, dnelson@redhat.com


> We are also setting TAINT_MACHINE_CHECK for corrected errors - perhaps
> this made sense when systems were small and machine checks were rare and
> scary.  But I think we need to start working with the reality that
> corrected errors are normal events.
>
>   

It still makes sense for small lt 1TB systems, IMO.  But ... maybe a
flag of some sort is necessary to stop the TAINTing for systems larger
than that.  The CEs may point to something going wrong on a system.  CEs
in theory become UCEs eventually, right?

>From a OS point of view, we would like to know that there is flaky HW on
the system.

/me knows this is going to turn into a PFA discussion in 4 ... 3 ... 2
....   ;)

P.

> -Tony
>   

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH]: mce: don't print "human readable" message for corrected errors
  2011-04-12 20:15     ` Prarit Bhargava
@ 2011-04-12 20:28       ` Borislav Petkov
  2011-04-13  3:00         ` Russ Anderson
  0 siblings, 1 reply; 38+ messages in thread
From: Borislav Petkov @ 2011-04-12 20:28 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: Luck, Tony, linux-kernel@vger.kernel.org, dzickus@redhat.com,
	mstowe@redhat.com, dnelson@redhat.com

On Tue, Apr 12, 2011 at 04:15:38PM -0400, Prarit Bhargava wrote:
> 
> > We are also setting TAINT_MACHINE_CHECK for corrected errors - perhaps
> > this made sense when systems were small and machine checks were rare and
> > scary.  But I think we need to start working with the reality that
> > corrected errors are normal events.
> >
> >   
> 
> It still makes sense for small lt 1TB systems, IMO.  But ... maybe a
> flag of some sort is necessary to stop the TAINTing for systems larger
> than that.  The CEs may point to something going wrong on a system.  CEs
> in theory become UCEs eventually, right?
> 
> From a OS point of view, we would like to know that there is flaky HW on
> the system.
> 
> /me knows this is going to turn into a PFA discussion in 4 ... 3 ... 2
> ....   ;)

Yeah, there's the TAINT thing too, good point Tony. Well, we definitely
don't want to get tainted for correctable errors - they're too "normal"
to do so, IMHO.

I'm thinking remove the TAINT for CEs and don't call the default
notifier if it is the only notifier call registered. Maybe something like

	if (num_notifiers(&x86_mce_decoder_chain) > 1)
		atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m);

or since the notifiers are priority sorted, don't call notifiers with -1
prio.

Or something to that effect.

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH]: mce: don't print "human readable" message for corrected errors
  2011-04-12 20:28       ` Borislav Petkov
@ 2011-04-13  3:00         ` Russ Anderson
  2011-04-13  7:14           ` Borislav Petkov
  0 siblings, 1 reply; 38+ messages in thread
From: Russ Anderson @ 2011-04-13  3:00 UTC (permalink / raw)
  To: Borislav Petkov, Prarit Bhargava, Luck, Tony,
	linux-kernel@vger.kernel.org, dzickus@redhat.com,
	mstowe@redhat.com, dnelson@redhat.com
  Cc: rja

On Tue, Apr 12, 2011 at 10:28:54PM +0200, Borislav Petkov wrote:
> 
> Yeah, there's the TAINT thing too, good point Tony. Well, we definitely
> don't want to get tainted for correctable errors - they're too "normal"
> to do so, IMHO.

I agreed.

> I'm thinking remove the TAINT for CEs and don't call the default
> notifier if it is the only notifier call registered. Maybe something like
> 
> 	if (num_notifiers(&x86_mce_decoder_chain) > 1)
> 		atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m);
> 
> or since the notifiers are priority sorted, don't call notifiers with -1
> prio.
> 
> Or something to that effect.

What is the point of having a default notifier if it doesn't get called?

Any consideration of adding thresholding (ie only log the first X number
of corrected errors) as is done on IA64?  (see arch/ia64/kernel/mca.c)

Thanks,
-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH]: mce: don't print "human readable" message for corrected errors
  2011-04-13  3:00         ` Russ Anderson
@ 2011-04-13  7:14           ` Borislav Petkov
  2011-04-13 13:24             ` Borislav Petkov
  0 siblings, 1 reply; 38+ messages in thread
From: Borislav Petkov @ 2011-04-13  7:14 UTC (permalink / raw)
  To: Russ Anderson
  Cc: Prarit Bhargava, Luck, Tony, linux-kernel@vger.kernel.org,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja

On Tue, Apr 12, 2011 at 10:00:34PM -0500, Russ Anderson wrote:
> > I'm thinking remove the TAINT for CEs and don't call the default
> > notifier if it is the only notifier call registered. Maybe something like
> > 
> > 	if (num_notifiers(&x86_mce_decoder_chain) > 1)
> > 		atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m);
> > 
> > or since the notifiers are priority sorted, don't call notifiers with -1
> > prio.
> > 
> > Or something to that effect.
> 
> What is the point of having a default notifier if it doesn't get called?

Well, the thing is, we call the same notifier chain in the event of
both correctable and uncorrectable errors. But to be honest, if we shut
up the default pr_emerg() calls with "no human readable.." in the CEs'
case, then we don't really need it in the UEs' case either, IMO.

The only half-way sensible info we print is

	"Run the message through 'mcelog --ascii' to decode.\n"

on UE because there we print MCA regs to the console where mcelog can
actually decode them, and this should be a hint to the user to do so.

In the CEs case, no such info comes out and there's no need for those
printks to flood the logs.

So maybe we could drop the default notifier and do in print_mce():

	if (notifier_chain_empty(&x86_mce_decoder_chain))
		pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n");

I think this could work, let me cook up something.

> Any consideration of adding thresholding (ie only log the first X number
> of corrected errors) as is done on IA64?  (see arch/ia64/kernel/mca.c)

Yep, this is in the works with a RAS daemon that should collect all
error info in userspace and do all policies there. We might even drop
the spitting in the system logs almost completely and I know this'll
make a lot of people happy :).

-- 
Regards/Gruss,
    Boris.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH]: mce: don't print "human readable" message for corrected errors
  2011-04-13  7:14           ` Borislav Petkov
@ 2011-04-13 13:24             ` Borislav Petkov
  2011-04-13 13:36               ` [PATCH 1/3] x86, MCE: Do not taint when correctable errors Borislav Petkov
                                 ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Borislav Petkov @ 2011-04-13 13:24 UTC (permalink / raw)
  To: Borislav Petkov, Russ Anderson, Prarit Bhargava, Luck, Tony,
	linux-kernel@vger.kernel.org, dzickus@redhat.com,
	mstowe@redhat.com, dnelson@redhat.com, rja

On Wed, Apr 13, 2011 at 09:14:33AM +0200, Borislav Petkov wrote:

[..]

> So maybe we could drop the default notifier and do in print_mce():
> 
> 	if (notifier_chain_empty(&x86_mce_decoder_chain))
> 		pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n");
> 
> I think this could work, let me cook up something.

Ok, I'm sending what I scratched up as a reply to this mail. They remove
the default notifier and use an atomic flag to distinguish the cases
between when we have a decoder or not.

This is marginally better in the sense that we don't have to call into
the notifier chain if there are no notifier calls on it so it saves us
a call in the UC case. Also, it drops the tainting for CEs and testing
this by injecting DRAM CECCs leaves /proc/sys/kernel/tainted at 0.

I still have to test the UC case but for that I'll have to dust off my
patchset which implements hw MCE injection first.

To be continued...

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 1/3] x86, MCE: Do not taint when correctable errors
  2011-04-13 13:24             ` Borislav Petkov
@ 2011-04-13 13:36               ` Borislav Petkov
  2011-04-13 13:36               ` [PATCH 2/3] x86, MCE: Drop default decoding notifier Borislav Petkov
  2011-04-13 13:36               ` [PATCH 3/3] EDAC, MCE, AMD: Register with MCE core Borislav Petkov
  2 siblings, 0 replies; 38+ messages in thread
From: Borislav Petkov @ 2011-04-13 13:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Borislav Petkov, Russ Anderson, Prarit Bhargava, Luck, Tony,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com,
	rja@americas.sgi.com

From: Borislav Petkov <borislav.petkov@amd.com>

Correctable errors are considered something rather normal on modern
hardware these days. Even more importantly, correctable errors mean
exactly that - they've been corrected by the hardware - and there's no
need to taint the kernel since execution hasn't been compromised so far.

Suggested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/kernel/cpu/mcheck/mce.c |    1 -
 1 files changed, 0 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 3385ea2..68e2303 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -590,7 +590,6 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 		if (!(flags & MCP_DONTLOG) && !mce_dont_log_ce) {
 			mce_log(&m);
 			atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m);
-			add_taint(TAINT_MACHINE_CHECK);
 		}
 
 		/*
-- 
1.7.4.rc2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 2/3] x86, MCE: Drop default decoding notifier
  2011-04-13 13:24             ` Borislav Petkov
  2011-04-13 13:36               ` [PATCH 1/3] x86, MCE: Do not taint when correctable errors Borislav Petkov
@ 2011-04-13 13:36               ` Borislav Petkov
  2011-04-13 14:01                 ` Prarit Bhargava
  2011-04-13 13:36               ` [PATCH 3/3] EDAC, MCE, AMD: Register with MCE core Borislav Petkov
  2 siblings, 1 reply; 38+ messages in thread
From: Borislav Petkov @ 2011-04-13 13:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Borislav Petkov, Russ Anderson, Prarit Bhargava, Luck, Tony,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com,
	rja@americas.sgi.com

From: Borislav Petkov <borislav.petkov@amd.com>

The default notifier doesn't make a lot of sense to call in the
correctable errors case. Drop it and emit the mcelog decoding hint only
in the uncorrectable errors case and when no notifier is registered.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/include/asm/mce.h       |    4 ++--
 arch/x86/kernel/cpu/mcheck/mce.c |   26 +++++++++-----------------
 2 files changed, 11 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index eb16e94..871d3c4 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -117,6 +117,8 @@ struct mce_log {
 
 extern struct atomic_notifier_head x86_mce_decoder_chain;
 
+extern atomic_t mce_decoders;
+
 #include <linux/percpu.h>
 #include <linux/init.h>
 #include <asm/atomic.h>
@@ -142,8 +144,6 @@ static inline void winchip_mcheck_init(struct cpuinfo_x86 *c) {}
 static inline void enable_p5_mce(void) {}
 #endif
 
-extern void (*x86_mce_decode_callback)(struct mce *m);
-
 void mce_setup(struct mce *m);
 void mce_log(struct mce *m);
 DECLARE_PER_CPU(struct sys_device, mce_dev);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 68e2303..3069608 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -105,19 +105,8 @@ static int			cpu_missing;
 ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
 EXPORT_SYMBOL_GPL(x86_mce_decoder_chain);
 
-static int default_decode_mce(struct notifier_block *nb, unsigned long val,
-			       void *data)
-{
-	pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n");
-	pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n");
-
-	return NOTIFY_STOP;
-}
-
-static struct notifier_block mce_dec_nb = {
-	.notifier_call = default_decode_mce,
-	.priority      = -1,
-};
+atomic_t mce_decoders = ATOMIC_INIT(0);
+EXPORT_SYMBOL_GPL(mce_decoders);
 
 /* MCA banks polled by the period polling timer for corrected events */
 DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = {
@@ -239,7 +228,11 @@ static void print_mce(struct mce *m)
 	 * Print out human-readable details about the MCE error,
 	 * (if the CPU has an implementation for that)
 	 */
-	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
+	if (!atomic_read(&mce_decoders)) {
+		pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n");
+		pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' to decode.\n");
+	} else
+		atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
 }
 
 #define PANIC_TIMEOUT 5 /* 5 seconds */
@@ -589,7 +582,8 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 		 */
 		if (!(flags & MCP_DONTLOG) && !mce_dont_log_ce) {
 			mce_log(&m);
-			atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m);
+			if (atomic_read(&mce_decoders))
+				atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m);
 		}
 
 		/*
@@ -1721,8 +1715,6 @@ __setup("mce", mcheck_enable);
 
 int __init mcheck_init(void)
 {
-	atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb);
-
 	mcheck_intel_therm_init();
 
 	return 0;
-- 
1.7.4.rc2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/3] x86, MCE: Drop default decoding notifier
  2011-04-13 13:36               ` [PATCH 2/3] x86, MCE: Drop default decoding notifier Borislav Petkov
@ 2011-04-13 14:01                 ` Prarit Bhargava
  2011-04-13 14:18                   ` Borislav Petkov
  0 siblings, 1 reply; 38+ messages in thread
From: Prarit Bhargava @ 2011-04-13 14:01 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, Borislav Petkov, Russ Anderson, Luck, Tony,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com,
	rja@americas.sgi.com

On 04/13/2011 09:36 AM, Borislav Petkov wrote:
> From: Borislav Petkov <borislav.petkov@amd.com>
> 
> The default notifier doesn't make a lot of sense to call in the
> correctable errors case. Drop it and emit the mcelog decoding hint only
> in the uncorrectable errors case and when no notifier is registered.
> 
> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
> ---
>  arch/x86/include/asm/mce.h       |    4 ++--
>  arch/x86/kernel/cpu/mcheck/mce.c |   26 +++++++++-----------------
>  2 files changed, 11 insertions(+), 19 deletions(-)
> 
  
> +extern atomic_t mce_decoders;
> +

Boris,

I don't think we need to do this.  I think we can use the existing notifier chain tools to do this check for us ... *untested and uncompiled* patch below.

  	 * Print out human-readable details about the MCE error,
>  	 * (if the CPU has an implementation for that)
>  	 */
> -	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
> +	if (!atomic_read(&mce_decoders)) {
> +		pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n");
> +		pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' to decode.\n");

I thought we didn't want these lines at all for CE errors?

> +	} else
> +		atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
>  }
>  
>  #define PANIC_TIMEOUT 5 /* 5 seconds */
> @@ -589,7 +582,8 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
>  		 */
>  		if (!(flags & MCP_DONTLOG) && !mce_dont_log_ce) {
>  			mce_log(&m);
> -			atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m);
> +			if (atomic_read(&mce_decoders))
> +				atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m);
>  		}
>  
>  		/*
> @@ -1721,8 +1715,6 @@ __setup("mce", mcheck_enable);
>  
>  int __init mcheck_init(void)
>  {
> -	atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb);
> -
>  	mcheck_intel_therm_init();
>  
>  	return 0;

Again, this patch has not been tested or compiled against latest upstream.  I
have quickly tested it against RHEL6 (2.6.32,33,34,35,36,38,39 based) and
confirmed that I don't see the messages for CEs.  I have NOT fully tested to see what happens when I add in a "dummy" notifier to see if the messages are not printed.

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 3385ea2..b10a1f4 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -100,25 +100,12 @@ static int			cpu_missing;
 
 /*
  * CPU/chipset specific EDAC code can register a notifier call here to print
- * MCE errors in a human-readable form.
+ * MCE errors in a human-readable form.  Notifiers must return NOTIFY_STOP
+ * upon completion.
  */
 ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
 EXPORT_SYMBOL_GPL(x86_mce_decoder_chain);
 
-static int default_decode_mce(struct notifier_block *nb, unsigned long val,
-			       void *data)
-{
-	pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n");
-	pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n");
-
-	return NOTIFY_STOP;
-}
-
-static struct notifier_block mce_dec_nb = {
-	.notifier_call = default_decode_mce,
-	.priority      = -1,
-};
-
 /* MCA banks polled by the period polling timer for corrected events */
 DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = {
 	[0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL
@@ -212,6 +199,8 @@ void mce_log(struct mce *mce)
 
 static void print_mce(struct mce *m)
 {
+	int ret;
+
 	pr_emerg(HW_ERR "CPU %d: Machine Check Exception: %Lx Bank %d: %016Lx\n",
 	       m->extcpu, m->mcgstatus, m->bank, m->status);
 
@@ -236,10 +225,13 @@ static void print_mce(struct mce *m)
 		m->cpuvendor, m->cpuid, m->time, m->socketid, m->apicid);
 
 	/*
-	 * Print out human-readable details about the MCE error,
-	 * (if the CPU has an implementation for that)
+	 * Output a message if the CPU has human-readable information for
+	 * unhandled UC errors
 	 */
-	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
+	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m);
+	if ((ret != NOTIFY_STOP) && (m->status & MCI_STATUS_UC))
+		pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' "
+			 "to decode.\n");
 }
 
 #define PANIC_TIMEOUT 5 /* 5 seconds */
@@ -589,8 +581,8 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
 		 */
 		if (!(flags & MCP_DONTLOG) && !mce_dont_log_ce) {
 			mce_log(&m);
-			atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m);
-			add_taint(TAINT_MACHINE_CHECK);
+			atomic_notifier_call_chain(&x86_mce_decoder_chain, 0,
+						   &m);
 		}
 
 		/*
@@ -1722,8 +1714,6 @@ __setup("mce", mcheck_enable);
 
 int __init mcheck_init(void)
 {
-	atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb);
-
 	mcheck_intel_therm_init();
 
 	return 0;

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/3] x86, MCE: Drop default decoding notifier
  2011-04-13 14:01                 ` Prarit Bhargava
@ 2011-04-13 14:18                   ` Borislav Petkov
  2011-04-13 14:22                     ` Prarit Bhargava
  0 siblings, 1 reply; 38+ messages in thread
From: Borislav Petkov @ 2011-04-13 14:18 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com,
	rja@americas.sgi.com

On Wed, Apr 13, 2011 at 10:01:22AM -0400, Prarit Bhargava wrote:
> On 04/13/2011 09:36 AM, Borislav Petkov wrote:
> > From: Borislav Petkov <borislav.petkov@amd.com>
> > 
> > The default notifier doesn't make a lot of sense to call in the
> > correctable errors case. Drop it and emit the mcelog decoding hint only
> > in the uncorrectable errors case and when no notifier is registered.
> > 
> > Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
> > ---
> >  arch/x86/include/asm/mce.h       |    4 ++--
> >  arch/x86/kernel/cpu/mcheck/mce.c |   26 +++++++++-----------------
> >  2 files changed, 11 insertions(+), 19 deletions(-)
> > 
>   
> > +extern atomic_t mce_decoders;
> > +
> 
> Boris,
> 
> I don't think we need to do this.  I think we can use the existing notifier chain tools to do this check for us ... *untested and uncompiled* patch below.
> 
>   	 * Print out human-readable details about the MCE error,
> >  	 * (if the CPU has an implementation for that)
> >  	 */
> > -	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
> > +	if (!atomic_read(&mce_decoders)) {
> > +		pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n");
> > +		pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' to decode.\n");
> 
> I thought we didn't want these lines at all for CE errors?

This is the UC only case - we call print_mce() before we panic.

> 
> > +	} else
> > +		atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
> >  }
> >  
> >  #define PANIC_TIMEOUT 5 /* 5 seconds */
> > @@ -589,7 +582,8 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
> >  		 */
> >  		if (!(flags & MCP_DONTLOG) && !mce_dont_log_ce) {
> >  			mce_log(&m);
> > -			atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m);
> > +			if (atomic_read(&mce_decoders))
> > +				atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m);
> >  		}
> >  
> >  		/*
> > @@ -1721,8 +1715,6 @@ __setup("mce", mcheck_enable);
> >  
> >  int __init mcheck_init(void)
> >  {
> > -	atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb);
> > -
> >  	mcheck_intel_therm_init();
> >  
> >  	return 0;
> 
> Again, this patch has not been tested or compiled against latest upstream.  I
> have quickly tested it against RHEL6 (2.6.32,33,34,35,36,38,39 based) and
> confirmed that I don't see the messages for CEs.  I have NOT fully tested to see what happens when I add in a "dummy" notifier to see if the messages are not printed.
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 3385ea2..b10a1f4 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -100,25 +100,12 @@ static int			cpu_missing;
>  
>  /*
>   * CPU/chipset specific EDAC code can register a notifier call here to print
> - * MCE errors in a human-readable form.
> + * MCE errors in a human-readable form.  Notifiers must return NOTIFY_STOP
> + * upon completion.
>   */
>  ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
>  EXPORT_SYMBOL_GPL(x86_mce_decoder_chain);
>  
> -static int default_decode_mce(struct notifier_block *nb, unsigned long val,
> -			       void *data)
> -{
> -	pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n");
> -	pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n");
> -
> -	return NOTIFY_STOP;
> -}
> -
> -static struct notifier_block mce_dec_nb = {
> -	.notifier_call = default_decode_mce,
> -	.priority      = -1,
> -};
> -
>  /* MCA banks polled by the period polling timer for corrected events */
>  DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = {
>  	[0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL
> @@ -212,6 +199,8 @@ void mce_log(struct mce *mce)
>  
>  static void print_mce(struct mce *m)
>  {
> +	int ret;
> +
>  	pr_emerg(HW_ERR "CPU %d: Machine Check Exception: %Lx Bank %d: %016Lx\n",
>  	       m->extcpu, m->mcgstatus, m->bank, m->status);
>  
> @@ -236,10 +225,13 @@ static void print_mce(struct mce *m)
>  		m->cpuvendor, m->cpuid, m->time, m->socketid, m->apicid);
>  
>  	/*
> -	 * Print out human-readable details about the MCE error,
> -	 * (if the CPU has an implementation for that)
> +	 * Output a message if the CPU has human-readable information for
> +	 * unhandled UC errors
>  	 */
> -	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
> +	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m);
> +	if ((ret != NOTIFY_STOP) && (m->status & MCI_STATUS_UC))
> +		pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' "
> +			 "to decode.\n");

Yep, yours is actually nicer - the notifier returns NOTIFY_DONE if it
doesn't call any notifier calls on the chain so the first test should
suffice.

Let me update the patch add your S-O-B so you could give it a run.

Thanks.


-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/3] x86, MCE: Drop default decoding notifier
  2011-04-13 14:18                   ` Borislav Petkov
@ 2011-04-13 14:22                     ` Prarit Bhargava
  2011-04-13 14:26                       ` Borislav Petkov
  0 siblings, 1 reply; 38+ messages in thread
From: Prarit Bhargava @ 2011-04-13 14:22 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com,
	rja@americas.sgi.com



On 04/13/2011 10:18 AM, Borislav Petkov wrote:
> On Wed, Apr 13, 2011 at 10:01:22AM -0400, Prarit Bhargava wrote:
>   
>> On 04/13/2011 09:36 AM, Borislav Petkov wrote:
>>     
>>> From: Borislav Petkov <borislav.petkov@amd.com>
>>>
>>> The default notifier doesn't make a lot of sense to call in the
>>> correctable errors case. Drop it and emit the mcelog decoding hint only
>>> in the uncorrectable errors case and when no notifier is registered.
>>>
>>> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
>>> ---
>>>  arch/x86/include/asm/mce.h       |    4 ++--
>>>  arch/x86/kernel/cpu/mcheck/mce.c |   26 +++++++++-----------------
>>>  2 files changed, 11 insertions(+), 19 deletions(-)
>>>
>>>       
>>   
>>     
>>> +extern atomic_t mce_decoders;
>>> +
>>>       
>> Boris,
>>
>> I don't think we need to do this.  I think we can use the existing notifier chain tools to do this check for us ... *untested and uncompiled* patch below.
>>
>>   	 * Print out human-readable details about the MCE error,
>>     
>>>  	 * (if the CPU has an implementation for that)
>>>  	 */
>>> -	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
>>> +	if (!atomic_read(&mce_decoders)) {
>>> +		pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n");
>>> +		pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' to decode.\n");
>>>       
>> I thought we didn't want these lines at all for CE errors?
>>     
> This is the UC only case - we call print_mce() before we panic.
>   

.. right ... but we would still see nonsensical messages before the
panic, which will confuse the end user.  IMO, dump the messages unless
it's UC.  And we shouldn't print out the human readable message at all.

I'm installing F14 on  a system that is known to generate CEs @
boot-time.  I'll test patches+latest upstream to here to see what happens...

P.

>
>   

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/3] x86, MCE: Drop default decoding notifier
  2011-04-13 14:22                     ` Prarit Bhargava
@ 2011-04-13 14:26                       ` Borislav Petkov
  2011-04-13 14:32                         ` Prarit Bhargava
  2011-04-13 14:36                         ` [PATCH -v2] " Borislav Petkov
  0 siblings, 2 replies; 38+ messages in thread
From: Borislav Petkov @ 2011-04-13 14:26 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: Borislav Petkov, linux-kernel@vger.kernel.org, Russ Anderson,
	Luck, Tony, dzickus@redhat.com, mstowe@redhat.com,
	dnelson@redhat.com, rja@americas.sgi.com

On Wed, Apr 13, 2011 at 10:22:41AM -0400, Prarit Bhargava wrote:
> 
> 
> On 04/13/2011 10:18 AM, Borislav Petkov wrote:
> > On Wed, Apr 13, 2011 at 10:01:22AM -0400, Prarit Bhargava wrote:
> >   
> >> On 04/13/2011 09:36 AM, Borislav Petkov wrote:
> >>     
> >>> From: Borislav Petkov <borislav.petkov@amd.com>
> >>>
> >>> The default notifier doesn't make a lot of sense to call in the
> >>> correctable errors case. Drop it and emit the mcelog decoding hint only
> >>> in the uncorrectable errors case and when no notifier is registered.
> >>>
> >>> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
> >>> ---
> >>>  arch/x86/include/asm/mce.h       |    4 ++--
> >>>  arch/x86/kernel/cpu/mcheck/mce.c |   26 +++++++++-----------------
> >>>  2 files changed, 11 insertions(+), 19 deletions(-)
> >>>
> >>>       
> >>   
> >>     
> >>> +extern atomic_t mce_decoders;
> >>> +
> >>>       
> >> Boris,
> >>
> >> I don't think we need to do this.  I think we can use the existing notifier chain tools to do this check for us ... *untested and uncompiled* patch below.
> >>
> >>   	 * Print out human-readable details about the MCE error,
> >>     
> >>>  	 * (if the CPU has an implementation for that)
> >>>  	 */
> >>> -	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
> >>> +	if (!atomic_read(&mce_decoders)) {
> >>> +		pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n");
> >>> +		pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' to decode.\n");
> >>>       
> >> I thought we didn't want these lines at all for CE errors?
> >>     
> > This is the UC only case - we call print_mce() before we panic.
> >   
> 
> .. right ... but we would still see nonsensical messages before the
> panic, which will confuse the end user.  IMO, dump the messages unless
> it's UC.  And we shouldn't print out the human readable message at all.

I'd still leave the "mcelog ... " line though as a hint.

> I'm installing F14 on  a system that is known to generate CEs @
> boot-time.  I'll test patches+latest upstream to here to see what happens...

Cool.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/3] x86, MCE: Drop default decoding notifier
  2011-04-13 14:26                       ` Borislav Petkov
@ 2011-04-13 14:32                         ` Prarit Bhargava
  2011-04-13 14:39                           ` Borislav Petkov
  2011-04-13 14:36                         ` [PATCH -v2] " Borislav Petkov
  1 sibling, 1 reply; 38+ messages in thread
From: Prarit Bhargava @ 2011-04-13 14:32 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com,
	rja@americas.sgi.com


>> .. right ... but we would still see nonsensical messages before the
>> panic, which will confuse the end user.  IMO, dump the messages unless
>> it's UC.  And we shouldn't print out the human readable message at all.
>>     
> I'd still leave the "mcelog ... " line though as a hint.
>
>   

:)

There are no messages -- that's part of the problem.  We're asking a
user to do something for a message that doesn't exist ...

P.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/3] x86, MCE: Drop default decoding notifier
  2011-04-13 14:32                         ` Prarit Bhargava
@ 2011-04-13 14:39                           ` Borislav Petkov
  2011-04-13 14:45                             ` Prarit Bhargava
  0 siblings, 1 reply; 38+ messages in thread
From: Borislav Petkov @ 2011-04-13 14:39 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com,
	rja@americas.sgi.com

On Wed, Apr 13, 2011 at 10:32:35AM -0400, Prarit Bhargava wrote:
> 
> >> .. right ... but we would still see nonsensical messages before the
> >> panic, which will confuse the end user.  IMO, dump the messages unless
> >> it's UC.  And we shouldn't print out the human readable message at all.
> >>     
> > I'd still leave the "mcelog ... " line though as a hint.
> >
> >   
> 
> :)
> 
> There are no messages -- that's part of the problem.  We're asking a
> user to do something for a message that doesn't exist ...

Yes and no - in the UC case you get all the MCA registers dumped in
print_mce() - and those you want to decode. In the CE case, we won't be
issuing any with the default notifier removed. I just sent you a -v2, I
think it should do what we want.

Or am I missing something...?

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/3] x86, MCE: Drop default decoding notifier
  2011-04-13 14:39                           ` Borislav Petkov
@ 2011-04-13 14:45                             ` Prarit Bhargava
  0 siblings, 0 replies; 38+ messages in thread
From: Prarit Bhargava @ 2011-04-13 14:45 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com,
	rja@americas.sgi.com



On 04/13/2011 10:39 AM, Borislav Petkov wrote:
> On Wed, Apr 13, 2011 at 10:32:35AM -0400, Prarit Bhargava wrote:
>   
>>     
>>>> .. right ... but we would still see nonsensical messages before the
>>>> panic, which will confuse the end user.  IMO, dump the messages unless
>>>> it's UC.  And we shouldn't print out the human readable message at all.
>>>>     
>>>>         
>>> I'd still leave the "mcelog ... " line though as a hint.
>>>
>>>   
>>>       
>> :)
>>
>> There are no messages -- that's part of the problem.  We're asking a
>> user to do something for a message that doesn't exist ...
>>     
> Yes and no - in the UC case you get all the MCA registers dumped in
> print_mce() - and those you want to decode. In the CE case, we won't be
> issuing any with the default notifier removed. I just sent you a -v2, I
> think it should do what we want.
>   

Yup.
> Or am I missing something...?
>
>   

Nope. The follow up looks great :)

P.
> Thanks.
>
>   

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH -v2] x86, MCE: Drop default decoding notifier
  2011-04-13 14:26                       ` Borislav Petkov
  2011-04-13 14:32                         ` Prarit Bhargava
@ 2011-04-13 14:36                         ` Borislav Petkov
  2011-04-13 17:01                           ` Prarit Bhargava
  1 sibling, 1 reply; 38+ messages in thread
From: Borislav Petkov @ 2011-04-13 14:36 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com,
	rja@americas.sgi.com


The default notifier doesn't make a lot of sense to call in the
correctable errors case. Drop it and emit the mcelog decoding hint only
in the uncorrectable errors case and when no notifier is registered.

While at it, remove unused old x86_mce_decode_callback from the header.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/include/asm/mce.h       |    2 --
 arch/x86/kernel/cpu/mcheck/mce.c |   22 +++++-----------------
 2 files changed, 5 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index eb16e94..021979a 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -142,8 +142,6 @@ static inline void winchip_mcheck_init(struct cpuinfo_x86 *c) {}
 static inline void enable_p5_mce(void) {}
 #endif
 
-extern void (*x86_mce_decode_callback)(struct mce *m);
-
 void mce_setup(struct mce *m);
 void mce_log(struct mce *m);
 DECLARE_PER_CPU(struct sys_device, mce_dev);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 68e2303..33714e8 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -105,20 +105,6 @@ static int			cpu_missing;
 ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
 EXPORT_SYMBOL_GPL(x86_mce_decoder_chain);
 
-static int default_decode_mce(struct notifier_block *nb, unsigned long val,
-			       void *data)
-{
-	pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n");
-	pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n");
-
-	return NOTIFY_STOP;
-}
-
-static struct notifier_block mce_dec_nb = {
-	.notifier_call = default_decode_mce,
-	.priority      = -1,
-};
-
 /* MCA banks polled by the period polling timer for corrected events */
 DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = {
 	[0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL
@@ -212,6 +198,8 @@ void mce_log(struct mce *mce)
 
 static void print_mce(struct mce *m)
 {
+	int ret = 0;
+
 	pr_emerg(HW_ERR "CPU %d: Machine Check Exception: %Lx Bank %d: %016Lx\n",
 	       m->extcpu, m->mcgstatus, m->bank, m->status);
 
@@ -239,7 +227,9 @@ static void print_mce(struct mce *m)
 	 * Print out human-readable details about the MCE error,
 	 * (if the CPU has an implementation for that)
 	 */
-	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
+	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
+	if (ret != NOTIFY_STOP)
+		pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' to decode.\n");
 }
 
 #define PANIC_TIMEOUT 5 /* 5 seconds */
@@ -1721,8 +1711,6 @@ __setup("mce", mcheck_enable);
 
 int __init mcheck_init(void)
 {
-	atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb);
-
 	mcheck_intel_therm_init();
 
 	return 0;
-- 
1.7.4.rc2


-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH -v2] x86, MCE: Drop default decoding notifier
  2011-04-13 14:36                         ` [PATCH -v2] " Borislav Petkov
@ 2011-04-13 17:01                           ` Prarit Bhargava
  2011-04-13 17:13                             ` Luck, Tony
  2011-04-13 17:14                             ` Prarit Bhargava
  0 siblings, 2 replies; 38+ messages in thread
From: Prarit Bhargava @ 2011-04-13 17:01 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com,
	rja@americas.sgi.com


> @@ -239,7 +227,9 @@ static void print_mce(struct mce *m)
>  	 * Print out human-readable details about the MCE error,
>  	 * (if the CPU has an implementation for that)
>  	 */
> -	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
> +	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
> +	if (ret != NOTIFY_STOP)
> +		pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' to decode.\n");
>  }
>   

Borislav,

I still think you need the check for UC here.  When an UC occurs and
mce_panic() is called the output will include:

[Hardware Error]:  Run the above through 'mcelog --ascii' to decode.

potentially many, many times.  The problem still is that there is no
output to decode (in the default case).

P.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* RE: [PATCH -v2] x86, MCE: Drop default decoding notifier
  2011-04-13 17:01                           ` Prarit Bhargava
@ 2011-04-13 17:13                             ` Luck, Tony
  2011-04-13 17:17                               ` Prarit Bhargava
  2011-04-13 17:14                             ` Prarit Bhargava
  1 sibling, 1 reply; 38+ messages in thread
From: Luck, Tony @ 2011-04-13 17:13 UTC (permalink / raw)
  To: Prarit Bhargava, Borislav Petkov
  Cc: linux-kernel@vger.kernel.org, Russ Anderson, dzickus@redhat.com,
	mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com

> [Hardware Error]:  Run the above through 'mcelog --ascii' to decode.
>
> potentially many, many times.  The problem still is that there is no
> output to decode (in the default case).

printk_once()?

Giving the admin a hint on how to do their job is all well
and good - but no need to do so for each and every report.

-Tony

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH -v2] x86, MCE: Drop default decoding notifier
  2011-04-13 17:13                             ` Luck, Tony
@ 2011-04-13 17:17                               ` Prarit Bhargava
  0 siblings, 0 replies; 38+ messages in thread
From: Prarit Bhargava @ 2011-04-13 17:17 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, linux-kernel@vger.kernel.org, Russ Anderson,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com,
	rja@americas.sgi.com



On 04/13/2011 01:13 PM, Luck, Tony wrote:
>> [Hardware Error]:  Run the above through 'mcelog --ascii' to decode.
>>
>> potentially many, many times.  The problem still is that there is no
>> output to decode (in the default case).
>>     
> printk_once()?
>   
I don't think that applies here (but you might have a different idea of
what the code needs to look like) as the printk has to happen for *each*
UC error, but never for CE errors :/.

P.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH -v2] x86, MCE: Drop default decoding notifier
  2011-04-13 17:01                           ` Prarit Bhargava
  2011-04-13 17:13                             ` Luck, Tony
@ 2011-04-13 17:14                             ` Prarit Bhargava
  2011-04-13 17:37                               ` Borislav Petkov
  1 sibling, 1 reply; 38+ messages in thread
From: Prarit Bhargava @ 2011-04-13 17:14 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com,
	rja@americas.sgi.com



On 04/13/2011 01:01 PM, Prarit Bhargava wrote:
>   
>> @@ -239,7 +227,9 @@ static void print_mce(struct mce *m)
>>  	 * Print out human-readable details about the MCE error,
>>  	 * (if the CPU has an implementation for that)
>>  	 */
>> -	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
>> +	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
>> +	if (ret != NOTIFY_STOP)
>> +		pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' to decode.\n");
>>  }
>>   
>>     
> Borislav,
>
>   

Oops.  Let me *carefully* rephrase that so it is clear what I'm
complaining about.

> I still think you need the check for UC here.  When an UC occurs and
> mce_panic() is called the output will include:
>
> [Hardware Error]:  Run the above through 'mcelog --ascii' to decode.
>
> potentially many, many times

for _all_ unreported *correctable* errors.

> .  The problem still is that there is no
> output to decode (in the default case).
>
>   

ie) (sorry for the cut-and-paste)

        /* First print corrected ones that are still unlogged */
        for (i = 0; i < MCE_LOG_LEN; i++) {
                struct mce *m = &mcelog.entry[i];
                if (!(m->status & MCI_STATUS_VAL))
                        continue;
                if (!(m->status & MCI_STATUS_UC)) {
                        print_mce(m);
                        if (!apei_err)
                                apei_err = apei_write_mce(m);
                }
        }

will potentially result in many bogus messages during a time at which we
definitely do not want bogus messages.

P.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH -v2] x86, MCE: Drop default decoding notifier
  2011-04-13 17:14                             ` Prarit Bhargava
@ 2011-04-13 17:37                               ` Borislav Petkov
  2011-04-14 14:59                                 ` Prarit Bhargava
  2011-04-14 15:00                                 ` [PATCH -v3] x86, MCE: Drop the " Borislav Petkov
  0 siblings, 2 replies; 38+ messages in thread
From: Borislav Petkov @ 2011-04-13 17:37 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: Borislav Petkov, linux-kernel@vger.kernel.org, Russ Anderson,
	Luck, Tony, dzickus@redhat.com, mstowe@redhat.com,
	dnelson@redhat.com, rja@americas.sgi.com

On Wed, Apr 13, 2011 at 01:14:35PM -0400, Prarit Bhargava wrote:
> 
> 
> On 04/13/2011 01:01 PM, Prarit Bhargava wrote:
> >   
> >> @@ -239,7 +227,9 @@ static void print_mce(struct mce *m)
> >>  	 * Print out human-readable details about the MCE error,
> >>  	 * (if the CPU has an implementation for that)
> >>  	 */
> >> -	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
> >> +	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
> >> +	if (ret != NOTIFY_STOP)
> >> +		pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' to decode.\n");
> >>  }
> >>   
> >>     
> > Borislav,
> >
> >   
> 
> Oops.  Let me *carefully* rephrase that so it is clear what I'm
> complaining about.
> 
> > I still think you need the check for UC here.  When an UC occurs and
> > mce_panic() is called the output will include:
> >
> > [Hardware Error]:  Run the above through 'mcelog --ascii' to decode.
> >
> > potentially many, many times
> 
> for _all_ unreported *correctable* errors.
> 
> > .  The problem still is that there is no
> > output to decode (in the default case).
> >
> >   
> 
> ie) (sorry for the cut-and-paste)
> 
>         /* First print corrected ones that are still unlogged */
>         for (i = 0; i < MCE_LOG_LEN; i++) {
>                 struct mce *m = &mcelog.entry[i];
>                 if (!(m->status & MCI_STATUS_VAL))
>                         continue;
>                 if (!(m->status & MCI_STATUS_UC)) {
>                         print_mce(m);
>                         if (!apei_err)
>                                 apei_err = apei_write_mce(m);
>                 }
>         }
> 
> will potentially result in many bogus messages during a time at which we
> definitely do not want bogus messages.

I don't think that this is a problem. This is on the panic path and it
is supposed to dump only the _unreported_ CE MCEs queued in the mcelog
which can contain 32 MCEs max.

In the worst case, we will report 32 CEs before panicking. For that case
we either do printk_once as Tony suggested or we ratelimit it. I'll
update the patch.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH -v2] x86, MCE: Drop default decoding notifier
  2011-04-13 17:37                               ` Borislav Petkov
@ 2011-04-14 14:59                                 ` Prarit Bhargava
  2011-04-14 15:00                                 ` [PATCH -v3] x86, MCE: Drop the " Borislav Petkov
  1 sibling, 0 replies; 38+ messages in thread
From: Prarit Bhargava @ 2011-04-14 14:59 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com,
	rja@americas.sgi.com


> In the worst case, we will report 32 CEs before panicking. For that case
> we either do printk_once as Tony suggested or we ratelimit it. I'll
> update the patch.
> 

Okay ... how 'bout using Tony's printk_once() just after the call to
print_mce() in mce_panic()? [uncompiled and untested patch below]

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index eb16e94..021979a 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -142,8 +142,6 @@ static inline void winchip_mcheck_init(struct cpuinfo_x86 *c) {}
 static inline void enable_p5_mce(void) {}
 #endif
 
-extern void (*x86_mce_decode_callback)(struct mce *m);
-
 void mce_setup(struct mce *m);
 void mce_log(struct mce *m);
 DECLARE_PER_CPU(struct sys_device, mce_dev);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 3385ea2..6ecfea5 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -105,20 +105,6 @@ static int			cpu_missing;
 ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
 EXPORT_SYMBOL_GPL(x86_mce_decoder_chain);
 
-static int default_decode_mce(struct notifier_block *nb, unsigned long val,
-			       void *data)
-{
-	pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n");
-	pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n");
-
-	return NOTIFY_STOP;
-}
-
-static struct notifier_block mce_dec_nb = {
-	.notifier_call = default_decode_mce,
-	.priority      = -1,
-};
-
 /* MCA banks polled by the period polling timer for corrected events */
 DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = {
 	[0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL
@@ -212,6 +198,8 @@ void mce_log(struct mce *mce)
 
 static void print_mce(struct mce *m)
 {
+	int ret = 0;
+
 	pr_emerg(HW_ERR "CPU %d: Machine Check Exception: %Lx Bank %d: %016Lx\n",
 	       m->extcpu, m->mcgstatus, m->bank, m->status);
 
@@ -239,7 +227,10 @@ static void print_mce(struct mce *m)
 	 * Print out human-readable details about the MCE error,
 	 * (if the CPU has an implementation for that)
 	 */
-	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
+	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
+	if (ret != NOTIFY_STOP && (m->status & MCI_STATUS_UC))
+		pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' "
+			 "to decode.\n");
 }
 
 #define PANIC_TIMEOUT 5 /* 5 seconds */
@@ -289,6 +280,8 @@ static void mce_panic(char *msg, struct mce *final, char *exp)
 			continue;
 		if (!(m->status & MCI_STATUS_UC)) {
 			print_mce(m);
+			printk_once(KERN_EMERG HW_ERR "MCE Corrected Error(s) "
+				    "detected.");
 			if (!apei_err)
 				apei_err = apei_write_mce(m);
 		}
@@ -1722,8 +1715,6 @@ __setup("mce", mcheck_enable);
 
 int __init mcheck_init(void)
 {
-	atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb);
-
 	mcheck_intel_therm_init();
 
 	return 0;

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH -v3] x86, MCE: Drop the default decoding notifier
  2011-04-13 17:37                               ` Borislav Petkov
  2011-04-14 14:59                                 ` Prarit Bhargava
@ 2011-04-14 15:00                                 ` Borislav Petkov
  2011-04-14 15:04                                   ` Prarit Bhargava
  1 sibling, 1 reply; 38+ messages in thread
From: Borislav Petkov @ 2011-04-14 15:00 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com,
	rja@americas.sgi.com

On Wed, Apr 13, 2011 at 01:37:05PM -0400, Borislav Petkov wrote:
> In the worst case, we will report 32 CEs before panicking. For that case
> we either do printk_once as Tony suggested or we ratelimit it. I'll
> update the patch.

Ok, how about the following, I ratelimit the printk to the default of 10
messages per 5 seconds. I've also got the hardware MCE injection patches
ready and will do some testing with them.

--
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Wed, 13 Apr 2011 14:32:06 +0200
Subject: [PATCH -v3] x86, MCE: Drop the default decoding notifier

The default notifier doesn't make a lot of sense to call in the
correctable errors case. Drop it and emit the mcelog decoding hint only
in the uncorrectable errors case and when no notifier is registered.
Also, limit the issual of the "mcelog" message in the remote case when
we dump unreported CEs before panicking.

While at it, remove unused old x86_mce_decode_callback from the header.

Signed-off-by: Prarit Bhargava <prarit@redhat.com>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 arch/x86/include/asm/mce.h       |    2 --
 arch/x86/kernel/cpu/mcheck/mce.c |   23 ++++++-----------------
 2 files changed, 6 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index eb16e94..021979a 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -142,8 +142,6 @@ static inline void winchip_mcheck_init(struct cpuinfo_x86 *c) {}
 static inline void enable_p5_mce(void) {}
 #endif
 
-extern void (*x86_mce_decode_callback)(struct mce *m);
-
 void mce_setup(struct mce *m);
 void mce_log(struct mce *m);
 DECLARE_PER_CPU(struct sys_device, mce_dev);
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 68e2303..679bd9e 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -105,20 +105,6 @@ static int			cpu_missing;
 ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
 EXPORT_SYMBOL_GPL(x86_mce_decoder_chain);
 
-static int default_decode_mce(struct notifier_block *nb, unsigned long val,
-			       void *data)
-{
-	pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n");
-	pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n");
-
-	return NOTIFY_STOP;
-}
-
-static struct notifier_block mce_dec_nb = {
-	.notifier_call = default_decode_mce,
-	.priority      = -1,
-};
-
 /* MCA banks polled by the period polling timer for corrected events */
 DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = {
 	[0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL
@@ -212,6 +198,8 @@ void mce_log(struct mce *mce)
 
 static void print_mce(struct mce *m)
 {
+	int ret = 0;
+
 	pr_emerg(HW_ERR "CPU %d: Machine Check Exception: %Lx Bank %d: %016Lx\n",
 	       m->extcpu, m->mcgstatus, m->bank, m->status);
 
@@ -239,7 +227,10 @@ static void print_mce(struct mce *m)
 	 * Print out human-readable details about the MCE error,
 	 * (if the CPU has an implementation for that)
 	 */
-	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
+	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
+	if (ret != NOTIFY_STOP)
+		printk_ratelimited(KERN_EMERG HW_ERR
+				   "Run the above through 'mcelog --ascii'\n");
 }
 
 #define PANIC_TIMEOUT 5 /* 5 seconds */
@@ -1721,8 +1712,6 @@ __setup("mce", mcheck_enable);
 
 int __init mcheck_init(void)
 {
-	atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb);
-
 	mcheck_intel_therm_init();
 
 	return 0;
-- 
1.7.4.rc2



-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH -v3] x86, MCE: Drop the default decoding notifier
  2011-04-14 15:00                                 ` [PATCH -v3] x86, MCE: Drop the " Borislav Petkov
@ 2011-04-14 15:04                                   ` Prarit Bhargava
  2011-04-14 15:16                                     ` Borislav Petkov
  0 siblings, 1 reply; 38+ messages in thread
From: Prarit Bhargava @ 2011-04-14 15:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com,
	rja@americas.sgi.com



On 04/14/2011 11:00 AM, Borislav Petkov wrote:
> On Wed, Apr 13, 2011 at 01:37:05PM -0400, Borislav Petkov wrote:
>   
>> In the worst case, we will report 32 CEs before panicking. For that case
>> we either do printk_once as Tony suggested or we ratelimit it. I'll
>> update the patch.
>>     
> Ok, how about the following, I ratelimit the printk to the default of 10
> messages per 5 seconds. I've also got the hardware MCE injection patches
> ready and will do some testing with them.
>   

See my previous email ;)  I think just putting in a printk_once after
the CE call to print_mce() in mce_panic() might be better?  At least
that way we get the --ascii message for *EVERY* UC which IMO would be
nice...

P.

> --
> From: Borislav Petkov <borislav.petkov@amd.com>
> Date: Wed, 13 Apr 2011 14:32:06 +0200
> Subject: [PATCH -v3] x86, MCE: Drop the default decoding notifier
>
> The default notifier doesn't make a lot of sense to call in the
> correctable errors case. Drop it and emit the mcelog decoding hint only
> in the uncorrectable errors case and when no notifier is registered.
> Also, limit the issual of the "mcelog" message in the remote case when
> we dump unreported CEs before panicking.
>
> While at it, remove unused old x86_mce_decode_callback from the header.
>
> Signed-off-by: Prarit Bhargava <prarit@redhat.com>
> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
> ---
>  arch/x86/include/asm/mce.h       |    2 --
>  arch/x86/kernel/cpu/mcheck/mce.c |   23 ++++++-----------------
>  2 files changed, 6 insertions(+), 19 deletions(-)
>
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index eb16e94..021979a 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -142,8 +142,6 @@ static inline void winchip_mcheck_init(struct cpuinfo_x86 *c) {}
>  static inline void enable_p5_mce(void) {}
>  #endif
>  
> -extern void (*x86_mce_decode_callback)(struct mce *m);
> -
>  void mce_setup(struct mce *m);
>  void mce_log(struct mce *m);
>  DECLARE_PER_CPU(struct sys_device, mce_dev);
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 68e2303..679bd9e 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -105,20 +105,6 @@ static int			cpu_missing;
>  ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain);
>  EXPORT_SYMBOL_GPL(x86_mce_decoder_chain);
>  
> -static int default_decode_mce(struct notifier_block *nb, unsigned long val,
> -			       void *data)
> -{
> -	pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n");
> -	pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n");
> -
> -	return NOTIFY_STOP;
> -}
> -
> -static struct notifier_block mce_dec_nb = {
> -	.notifier_call = default_decode_mce,
> -	.priority      = -1,
> -};
> -
>  /* MCA banks polled by the period polling timer for corrected events */
>  DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = {
>  	[0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL
> @@ -212,6 +198,8 @@ void mce_log(struct mce *mce)
>  
>  static void print_mce(struct mce *m)
>  {
> +	int ret = 0;
> +
>  	pr_emerg(HW_ERR "CPU %d: Machine Check Exception: %Lx Bank %d: %016Lx\n",
>  	       m->extcpu, m->mcgstatus, m->bank, m->status);
>  
> @@ -239,7 +227,10 @@ static void print_mce(struct mce *m)
>  	 * Print out human-readable details about the MCE error,
>  	 * (if the CPU has an implementation for that)
>  	 */
> -	atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
> +	ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
> +	if (ret != NOTIFY_STOP)
> +		printk_ratelimited(KERN_EMERG HW_ERR
> +				   "Run the above through 'mcelog --ascii'\n");
>  }
>  
>  #define PANIC_TIMEOUT 5 /* 5 seconds */
> @@ -1721,8 +1712,6 @@ __setup("mce", mcheck_enable);
>  
>  int __init mcheck_init(void)
>  {
> -	atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb);
> -
>  	mcheck_intel_therm_init();
>  
>  	return 0;
>   

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH -v3] x86, MCE: Drop the default decoding notifier
  2011-04-14 15:04                                   ` Prarit Bhargava
@ 2011-04-14 15:16                                     ` Borislav Petkov
  2011-04-14 15:23                                       ` Prarit Bhargava
  2011-04-14 15:33                                       ` Russ Anderson
  0 siblings, 2 replies; 38+ messages in thread
From: Borislav Petkov @ 2011-04-14 15:16 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: Borislav Petkov, linux-kernel@vger.kernel.org, Russ Anderson,
	Luck, Tony, dzickus@redhat.com, mstowe@redhat.com,
	dnelson@redhat.com, rja@americas.sgi.com

On Thu, Apr 14, 2011 at 11:04:43AM -0400, Prarit Bhargava wrote:
> 
> 
> On 04/14/2011 11:00 AM, Borislav Petkov wrote:
> > On Wed, Apr 13, 2011 at 01:37:05PM -0400, Borislav Petkov wrote:
> >   
> >> In the worst case, we will report 32 CEs before panicking. For that case
> >> we either do printk_once as Tony suggested or we ratelimit it. I'll
> >> update the patch.
> >>     
> > Ok, how about the following, I ratelimit the printk to the default of 10
> > messages per 5 seconds. I've also got the hardware MCE injection patches
> > ready and will do some testing with them.
> >   
> 
> See my previous email ;)  I think just putting in a printk_once after
> the CE call to print_mce() in mce_panic() might be better?  At least
> that way we get the --ascii message for *EVERY* UC which IMO would be
> nice...

Are you sure? printk_once() is, as its name says, a one-time thing and
it is implemented that way - a static bool counter which is once set and
that's it. I.e., the "--ascii" message will be printed only once for the
system's lifetime.

The ratelimit-ed thing dumps it a strict number of times. In the end,
I don't have a strong opinion on how many times we issue it - I'm fine
with it either way.

Maybe some other opinions. Tony?

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH -v3] x86, MCE: Drop the default decoding notifier
  2011-04-14 15:16                                     ` Borislav Petkov
@ 2011-04-14 15:23                                       ` Prarit Bhargava
  2011-04-14 15:44                                         ` Borislav Petkov
  2011-04-14 15:33                                       ` Russ Anderson
  1 sibling, 1 reply; 38+ messages in thread
From: Prarit Bhargava @ 2011-04-14 15:23 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com,
	rja@americas.sgi.com



On 04/14/2011 11:16 AM, Borislav Petkov wrote:
> On Thu, Apr 14, 2011 at 11:04:43AM -0400, Prarit Bhargava wrote:
>   
>>
>> On 04/14/2011 11:00 AM, Borislav Petkov wrote:
>>     
>>> On Wed, Apr 13, 2011 at 01:37:05PM -0400, Borislav Petkov wrote:
>>>   
>>>       
>>>> In the worst case, we will report 32 CEs before panicking. For that case
>>>> we either do printk_once as Tony suggested or we ratelimit it. I'll
>>>> update the patch.
>>>>     
>>>>         
>>> Ok, how about the following, I ratelimit the printk to the default of 10
>>> messages per 5 seconds. I've also got the hardware MCE injection patches
>>> ready and will do some testing with them.
>>>   
>>>       
>> See my previous email ;)  I think just putting in a printk_once after
>> the CE call to print_mce() in mce_panic() might be better?  At least
>> that way we get the --ascii message for *EVERY* UC which IMO would be
>> nice...
>>     
> Are you sure? printk_once() is, as its name says, a one-time thing and
> it is implemented that way - a static bool counter which is once set and
> that's it. I.e., the "--ascii" message will be printed only once for the
> system's lifetime.
>   

Oops ... I may have confused you because what I did was subtle.  I
really should have explicitly pointed out what I did.  Sorry, my bad.

>From my patch (sorry for the cut-and-paste):

@@ -239,7 +227,10 @@ static void print_mce(struct mce *m)
         * Print out human-readable details about the MCE error,
         * (if the CPU has an implementation for that)
         */
-       atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
+       ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
+       if (ret != NOTIFY_STOP && (m->status & MCI_STATUS_UC))
+               pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' "
+                        "to decode.\n");
 }
 
This, of course, only outputs during UCs.

and

@@ -289,6 +280,8 @@ static void mce_panic(char *msg, struct mce *final,
char *exp)
                        continue;
                if (!(m->status & MCI_STATUS_UC)) {
                        print_mce(m);
+                       printk_once(KERN_EMERG HW_ERR "MCE Corrected
Error(s) "
+                                   "detected.");
                        if (!apei_err)
                                apei_err = apei_write_mce(m);
                }

so we'll print "MCE Corrected Error(s)" _once_ if we go through this
path.  Since there is no data to decode with mcelog, a nice little one
time message is probably the way to go :).

P.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH -v3] x86, MCE: Drop the default decoding notifier
  2011-04-14 15:23                                       ` Prarit Bhargava
@ 2011-04-14 15:44                                         ` Borislav Petkov
  2011-04-14 15:49                                           ` Prarit Bhargava
  0 siblings, 1 reply; 38+ messages in thread
From: Borislav Petkov @ 2011-04-14 15:44 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: Borislav Petkov, linux-kernel@vger.kernel.org, Russ Anderson,
	Luck, Tony, dzickus@redhat.com, mstowe@redhat.com,
	dnelson@redhat.com, rja@americas.sgi.com

On Thu, Apr 14, 2011 at 11:23:04AM -0400, Prarit Bhargava wrote:
> Oops ... I may have confused you because what I did was subtle.  I
> really should have explicitly pointed out what I did.  Sorry, my bad.
> 
> From my patch (sorry for the cut-and-paste):
> 
> @@ -239,7 +227,10 @@ static void print_mce(struct mce *m)
>          * Print out human-readable details about the MCE error,
>          * (if the CPU has an implementation for that)
>          */
> -       atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
> +       ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
> +       if (ret != NOTIFY_STOP && (m->status & MCI_STATUS_UC))
> +               pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' "
> +                        "to decode.\n");
>  }
>  
> This, of course, only outputs during UCs.
> 
> and
> 
> @@ -289,6 +280,8 @@ static void mce_panic(char *msg, struct mce *final,
> char *exp)
>                         continue;
>                 if (!(m->status & MCI_STATUS_UC)) {
>                         print_mce(m);
> +                       printk_once(KERN_EMERG HW_ERR "MCE Corrected
> Error(s) "
> +                                   "detected.");
>                         if (!apei_err)
>                                 apei_err = apei_write_mce(m);
>                 }
> 
> so we'll print "MCE Corrected Error(s)" _once_ if we go through this
> path.  Since there is no data to decode with mcelog, a nice little one
> time message is probably the way to go :).

Ok, first of all, see the print_mce(m) call above? Yes, we're dumping
full CE MCE info in this case because they were unlogged and as such,
that info can be decoded.

But this whole point is moot since those errors can be only 32 max _and_
on the _panic_ path. And I don't think this path matters because it is
_very_ seldom. I bet you don't hit it on any of your machines.

And we don't want to fix that - we want to fix the case with the
occasional CE MCEs which get detected in the polling path but none of
their MCA regs get dumped for decoding so the decoding hint there is
out of place. And we fixed that at least partially so that it doesn't
flood the logs. If you're not fine with the default ratelimit of 10 msgs
per 5 seconds we can always raise the ratelimit but tweaking an almost
hypothetical case is just not worth it.

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH -v3] x86, MCE: Drop the default decoding notifier
  2011-04-14 15:44                                         ` Borislav Petkov
@ 2011-04-14 15:49                                           ` Prarit Bhargava
  2011-04-14 19:02                                             ` Borislav Petkov
  0 siblings, 1 reply; 38+ messages in thread
From: Prarit Bhargava @ 2011-04-14 15:49 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com,
	rja@americas.sgi.com



On 04/14/2011 11:44 AM, Borislav Petkov wrote:
> On Thu, Apr 14, 2011 at 11:23:04AM -0400, Prarit Bhargava wrote:
>   
>> Oops ... I may have confused you because what I did was subtle.  I
>> really should have explicitly pointed out what I did.  Sorry, my bad.
>>
>> From my patch (sorry for the cut-and-paste):
>>
>> @@ -239,7 +227,10 @@ static void print_mce(struct mce *m)
>>          * Print out human-readable details about the MCE error,
>>          * (if the CPU has an implementation for that)
>>          */
>> -       atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
>> +       ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m);
>> +       if (ret != NOTIFY_STOP && (m->status & MCI_STATUS_UC))
>> +               pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' "
>> +                        "to decode.\n");
>>  }
>>  
>> This, of course, only outputs during UCs.
>>
>> and
>>
>> @@ -289,6 +280,8 @@ static void mce_panic(char *msg, struct mce *final,
>> char *exp)
>>                         continue;
>>                 if (!(m->status & MCI_STATUS_UC)) {
>>                         print_mce(m);
>> +                       printk_once(KERN_EMERG HW_ERR "MCE Corrected
>> Error(s) "
>> +                                   "detected.");
>>                         if (!apei_err)
>>                                 apei_err = apei_write_mce(m);
>>                 }
>>
>> so we'll print "MCE Corrected Error(s)" _once_ if we go through this
>> path.  Since there is no data to decode with mcelog, a nice little one
>> time message is probably the way to go :).
>>     
> Ok, first of all, see the print_mce(m) call above? Yes, we're dumping
> full CE MCE info in this case because they were unlogged and as such,
> that info can be decoded.
>
> But this whole point is moot since those errors can be only 32 max _and_
> on the _panic_ path. And I don't think this path matters because it is
> _very_ seldom. I bet you don't hit it on any of your machines.
>   

Ohhhh ... I was running on the assumption that the data was *never* output.

> And we don't want to fix that - we want to fix the case with the
> occasional CE MCEs which get detected in the polling path but none of
> their MCA regs get dumped for decoding so the decoding hint there is
> out of place. And we fixed that at least partially so that it doesn't
> flood the logs. If you're not fine with the default ratelimit of 10 msgs
> per 5 seconds we can always raise the ratelimit but tweaking an almost
> hypothetical case is just not worth it.
>
>   

Okay -- I'm good then.

P.

> Thanks.
>
>   

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH -v3] x86, MCE: Drop the default decoding notifier
  2011-04-14 15:49                                           ` Prarit Bhargava
@ 2011-04-14 19:02                                             ` Borislav Petkov
  2011-04-14 19:04                                               ` Prarit Bhargava
  0 siblings, 1 reply; 38+ messages in thread
From: Borislav Petkov @ 2011-04-14 19:02 UTC (permalink / raw)
  To: Prarit Bhargava
  Cc: Borislav Petkov, linux-kernel@vger.kernel.org, Russ Anderson,
	Luck, Tony, dzickus@redhat.com, mstowe@redhat.com,
	dnelson@redhat.com, rja@americas.sgi.com

On Thu, Apr 14, 2011 at 11:49:08AM -0400, Prarit Bhargava wrote:
> > And we don't want to fix that - we want to fix the case with the
> > occasional CE MCEs which get detected in the polling path but none of
> > their MCA regs get dumped for decoding so the decoding hint there is
> > out of place. And we fixed that at least partially so that it doesn't
> > flood the logs. If you're not fine with the default ratelimit of 10 msgs
> > per 5 seconds we can always raise the ratelimit but tweaking an almost
> > hypothetical case is just not worth it.
> >
> Okay -- I'm good then.

Ok, injecting MCEs with this patch looks like this. Nevermind the
decoded MCEs, I simply uncommented the "if (ret != NOTIFY_DONE)" line
so that I can see the rate limiting. So we still have them there, this
is the default setting of 10 calls per 5 secs, we might want to dial it
down in the output. After the 10th MCE, it doesn't appear anymore.

Hmm...

Prarit, you said you have a machine which spits a lot of CECCs on boot,
does the final version help there, did you have a chance to run it?

[  312.983610] [Hardware Error]: CPU 0: Machine Check Exception: 0 Bank 0: 9200400000010e3f
[  312.987463] [Hardware Error]: TSC 7ca4633c24 
[  312.987463] [Hardware Error]: PROCESSOR 2:100f91 TIME 1302807306 SOCKET 0 APIC 0
[  312.987463] [Hardware Error]: MC0_STATUS[-|CE|-|PCC|-|CECC]: 0x9200400000010e3f
[  312.987463] [Hardware Error]: Data Cache Error: 
[  312.987463] [Hardware Error]: Corrupted DC MCE info?
[  312.987463] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: DRD, part-proc: GEN (no timeout)
[  312.987463] [Hardware Error]: Run the above through 'mcelog --ascii' 					<-----------------
[  312.987463] [Hardware Error]: CPU 0: Machine Check Exception: 0 Bank 0: d400400000010016
[  312.987463] [Hardware Error]: TSC 825c93b856 
[  312.987463] [Hardware Error]: PROCESSOR 2:100f91 TIME 1302807321 SOCKET 0 APIC 0
[  312.987463] [Hardware Error]: MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd400400000010016
[  312.987463] [Hardware Error]: Data Cache Error: L2 TLB multimatch.
[  312.987463] [Hardware Error]: cache level: L2, tx: DATA
[  312.987463] [Hardware Error]: Run the above through 'mcelog --ascii'						<-----------------
[  312.987463] [Hardware Error]: CPU 0: Machine Check Exception: 0 Bank 0: d40040000000081f
[  312.987463] [Hardware Error]: TSC 852bb06a47 
[  312.987463] [Hardware Error]: PROCESSOR 2:100f91 TIME 1302807328 SOCKET 0 APIC 0
[  312.987463] [Hardware Error]: MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd40040000000081f
[  312.987463] [Hardware Error]: Data Cache Error: 
[  312.987463] [Hardware Error]: Corrupted DC MCE info?
[  312.987463] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: RD, part-proc: SRC (no timeout)
[  312.987463] [Hardware Error]: Run the above through 'mcelog --ascii'						<-----------------
[  312.987463] [Hardware Error]: CPU 0: Machine Check Exception: 0 Bank 0: dc03400008000f43
[  312.987463] [Hardware Error]: TSC 9a9c3818f3 
[  312.987463] [Hardware Error]: PROCESSOR 2:100f91 TIME 1302807382 SOCKET 0 APIC 0
[  312.987463] [Hardware Error]: MC0_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc03400008000f43
[  312.987463] [Hardware Error]: Data Cache Error: 
[  312.987463] [Hardware Error]: Corrupted DC MCE info?
[  312.987463] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: DWR, part-proc: GEN (timed out)
[  312.987463] [Hardware Error]: Run the above through 'mcelog --ascii'						<-----------------
[  312.987463] [Hardware Error]: CPU 0: Machine Check Exception: 0 Bank 0: f600210000000107
[  312.987463] [Hardware Error]: TSC 88ce52d026 
[  312.987463] [Hardware Error]: PROCESSOR 2:100f91 TIME 1302807337 SOCKET 0 APIC 0
[  312.987463] [Hardware Error]: MC0_STATUS[Over|UE|-|PCC|AddrV|UECC]: 0xf600210000000107
[  312.987463] [Hardware Error]: Data Cache Error: 
[  312.987463] [Hardware Error]: Corrupted DC MCE info?
[  312.987463] [Hardware Error]: cache level: L3/GEN, tx: DATA, mem-tx: GEN
[  312.987463] [Hardware Error]: Run the above through 'mcelog --ascii'						<-----------------
[  312.987463] [Hardware Error]: CPU 0: Machine Check Exception: 0 Bank 0: dc03400008000f43
[  312.987463] [Hardware Error]: TSC 9a9c3818f3 
[  312.987463] [Hardware Error]: PROCESSOR 2:100f91 TIME 1302807382 SOCKET 0 APIC 0
[  312.987463] [Hardware Error]: MC0_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc03400008000f43
[  312.987463] [Hardware Error]: Data Cache Error: 
[  312.987463] [Hardware Error]: Corrupted DC MCE info?
[  312.987463] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: DWR, part-proc: GEN (timed out)
[  312.987463] [Hardware Error]: Run the above through 'mcelog --ascii'						<-----------------
[  312.987463] [Hardware Error]: Machine check: MCIP not set in MCA handler
[  312.987463] [Hardware Error]: Fake kernel panic: Fatal machine check on current CPU
[  312.987463] mce_notify_irq: 2 callbacks suppressed
[  312.987463] [Hardware Error]: Machine check events logged

Thanks.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH -v3] x86, MCE: Drop the default decoding notifier
  2011-04-14 19:02                                             ` Borislav Petkov
@ 2011-04-14 19:04                                               ` Prarit Bhargava
  0 siblings, 0 replies; 38+ messages in thread
From: Prarit Bhargava @ 2011-04-14 19:04 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com,
	rja@americas.sgi.com



On 04/14/2011 03:02 PM, Borislav Petkov wrote:
> On Thu, Apr 14, 2011 at 11:49:08AM -0400, Prarit Bhargava wrote:
>   
>>> And we don't want to fix that - we want to fix the case with the
>>> occasional CE MCEs which get detected in the polling path but none of
>>> their MCA regs get dumped for decoding so the decoding hint there is
>>> out of place. And we fixed that at least partially so that it doesn't
>>> flood the logs. If you're not fine with the default ratelimit of 10 msgs
>>> per 5 seconds we can always raise the ratelimit but tweaking an almost
>>> hypothetical case is just not worth it.
>>>
>>>       
>> Okay -- I'm good then.
>>     
> Ok, injecting MCEs with this patch looks like this. Nevermind the
> decoded MCEs, I simply uncommented the "if (ret != NOTIFY_DONE)" line
> so that I can see the rate limiting. So we still have them there, this
> is the default setting of 10 calls per 5 secs, we might want to dial it
> down in the output. After the 10th MCE, it doesn't appear anymore.
>
> Hmm...
>
> Prarit, you said you have a machine which spits a lot of CECCs on boot,
> does the final version help there, did you have a chance to run it?
>   

Yup -- it looked good and did what I expected it to do ... I didn't save
the output though :(

P.


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH -v3] x86, MCE: Drop the default decoding notifier
  2011-04-14 15:16                                     ` Borislav Petkov
  2011-04-14 15:23                                       ` Prarit Bhargava
@ 2011-04-14 15:33                                       ` Russ Anderson
  2011-04-14 15:49                                         ` Borislav Petkov
  1 sibling, 1 reply; 38+ messages in thread
From: Russ Anderson @ 2011-04-14 15:33 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Prarit Bhargava, linux-kernel@vger.kernel.org, Luck, Tony,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja

On Thu, Apr 14, 2011 at 05:16:21PM +0200, Borislav Petkov wrote:
> On Thu, Apr 14, 2011 at 11:04:43AM -0400, Prarit Bhargava wrote:
> > On 04/14/2011 11:00 AM, Borislav Petkov wrote:
> > > On Wed, Apr 13, 2011 at 01:37:05PM -0400, Borislav Petkov wrote:
> > >   
> > >> In the worst case, we will report 32 CEs before panicking. For that case
> > >> we either do printk_once as Tony suggested or we ratelimit it. I'll
> > >> update the patch.
> > >>     
> > > Ok, how about the following, I ratelimit the printk to the default of 10
> > > messages per 5 seconds. I've also got the hardware MCE injection patches
> > > ready and will do some testing with them.
> > >   
> > 
> > See my previous email ;)  I think just putting in a printk_once after
> > the CE call to print_mce() in mce_panic() might be better?  At least
> > that way we get the --ascii message for *EVERY* UC which IMO would be
> > nice...
> 
> Are you sure? printk_once() is, as its name says, a one-time thing and
> it is implemented that way - a static bool counter which is once set and
> that's it. I.e., the "--ascii" message will be printed only once for the
> system's lifetime.
> 
> The ratelimit-ed thing dumps it a strict number of times. In the end,
> I don't have a strong opinion on how many times we issue it - I'm fine
> with it either way.
> 
> Maybe some other opinions. Tony?

In general I think you and Prarit are headed in the right direction.

As for when to throttle messages, differing people will have
different opinions as to the right number.  For example, some
sites may want the threshold low because once they see a CE they
will schedule to replace the DIMM.  Manufacturing sometimes 
wants to see all the CEs to know how good/bad the DIMM is.

My suggestion is to pick a default value and have a /sys (or
other) way of changing the value.  That way if someone has 
a need to change the value they can.  In real life someone 
will have a legitimate need to change the value.

Is the thresholding on a per DIMM, per Socket, or per system
basis?  SGI tends to have large systems with lots of DIMMs.
Per DIMM or per Socket thresholds tend to scale better.

Thanks,
-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH -v3] x86, MCE: Drop the default decoding notifier
  2011-04-14 15:33                                       ` Russ Anderson
@ 2011-04-14 15:49                                         ` Borislav Petkov
  0 siblings, 0 replies; 38+ messages in thread
From: Borislav Petkov @ 2011-04-14 15:49 UTC (permalink / raw)
  To: Russ Anderson
  Cc: Prarit Bhargava, linux-kernel@vger.kernel.org, Luck, Tony,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com,
	rja@americas.sgi.com

On Thu, Apr 14, 2011 at 11:33:18AM -0400, Russ Anderson wrote:
> On Thu, Apr 14, 2011 at 05:16:21PM +0200, Borislav Petkov wrote:
> > On Thu, Apr 14, 2011 at 11:04:43AM -0400, Prarit Bhargava wrote:
> > > On 04/14/2011 11:00 AM, Borislav Petkov wrote:
> > > > On Wed, Apr 13, 2011 at 01:37:05PM -0400, Borislav Petkov wrote:
> > > >   
> > > >> In the worst case, we will report 32 CEs before panicking. For that case
> > > >> we either do printk_once as Tony suggested or we ratelimit it. I'll
> > > >> update the patch.
> > > >>     
> > > > Ok, how about the following, I ratelimit the printk to the default of 10
> > > > messages per 5 seconds. I've also got the hardware MCE injection patches
> > > > ready and will do some testing with them.
> > > >   
> > > 
> > > See my previous email ;)  I think just putting in a printk_once after
> > > the CE call to print_mce() in mce_panic() might be better?  At least
> > > that way we get the --ascii message for *EVERY* UC which IMO would be
> > > nice...
> > 
> > Are you sure? printk_once() is, as its name says, a one-time thing and
> > it is implemented that way - a static bool counter which is once set and
> > that's it. I.e., the "--ascii" message will be printed only once for the
> > system's lifetime.
> > 
> > The ratelimit-ed thing dumps it a strict number of times. In the end,
> > I don't have a strong opinion on how many times we issue it - I'm fine
> > with it either way.
> > 
> > Maybe some other opinions. Tony?
> 
> In general I think you and Prarit are headed in the right direction.
> 
> As for when to throttle messages, differing people will have
> different opinions as to the right number.  For example, some
> sites may want the threshold low because once they see a CE they
> will schedule to replace the DIMM.  Manufacturing sometimes 
> wants to see all the CEs to know how good/bad the DIMM is.
> 
> My suggestion is to pick a default value and have a /sys (or
> other) way of changing the value.  That way if someone has 
> a need to change the value they can.  In real life someone 
> will have a legitimate need to change the value.
> 
> Is the thresholding on a per DIMM, per Socket, or per system
> basis?  SGI tends to have large systems with lots of DIMMs.
> Per DIMM or per Socket thresholds tend to scale better.

Nah, we're talking about the decoding hint only

	"Run the above through 'mcelog --ascii'"

and that issuing it makes no sense when no correctable errors info is on
dmesg.

The CEs don't get reported in this case, I think on Intel you have to
run mcelog. On AMD, you generally use the amd64_edac driver and collect
all CEs info which gets decoded to chip selects on the node and then you
can do thresholding.

What you're talking above is not yet fully ... hm.. implemented yet :).

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH 3/3] EDAC, MCE, AMD: Register with MCE core
  2011-04-13 13:24             ` Borislav Petkov
  2011-04-13 13:36               ` [PATCH 1/3] x86, MCE: Do not taint when correctable errors Borislav Petkov
  2011-04-13 13:36               ` [PATCH 2/3] x86, MCE: Drop default decoding notifier Borislav Petkov
@ 2011-04-13 13:36               ` Borislav Petkov
  2 siblings, 0 replies; 38+ messages in thread
From: Borislav Petkov @ 2011-04-13 13:36 UTC (permalink / raw)
  To: linux-kernel
  Cc: Borislav Petkov, Russ Anderson, Prarit Bhargava, Luck, Tony,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com,
	rja@americas.sgi.com

From: Borislav Petkov <borislav.petkov@amd.com>

Now that we don't have a default notifier, register with the MCE core to
state that MCE decoder functionality is present.

Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 drivers/edac/mce_amd.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
index 795cfbc..4448010 100644
--- a/drivers/edac/mce_amd.c
+++ b/drivers/edac/mce_amd.c
@@ -902,6 +902,8 @@ static int __init mce_amd_init(void)
 
 	atomic_notifier_chain_register(&x86_mce_decoder_chain, &amd_mce_dec_nb);
 
+	atomic_inc(&mce_decoders);
+
 	return 0;
 }
 early_initcall(mce_amd_init);
@@ -909,6 +911,7 @@ early_initcall(mce_amd_init);
 #ifdef MODULE
 static void __exit mce_amd_exit(void)
 {
+	atomic_dec(&mce_decoders);
 	atomic_notifier_chain_unregister(&x86_mce_decoder_chain, &amd_mce_dec_nb);
 	kfree(fam_ops);
 }
-- 
1.7.4.rc2


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH]: mce: don't print "human readable" message for corrected errors
  2011-04-12 20:02   ` Luck, Tony
  2011-04-12 20:15     ` Prarit Bhargava
@ 2011-04-13  2:24     ` Russ Anderson
  1 sibling, 0 replies; 38+ messages in thread
From: Russ Anderson @ 2011-04-13  2:24 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, Prarit Bhargava, linux-kernel@vger.kernel.org,
	dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja

On Tue, Apr 12, 2011 at 01:02:21PM -0700, Luck, Tony wrote:
> > Why not? This way you turn reporting of _ALL_ correctable MCEs
> > completely off and some users would actually like to run them through
> > mcelog on Intel.
> 
> pr_emerg() is rather overkill for a corrected error - on large systems
> corrected errors are going to be a routine occurrence (my personal estimation
> is "one soft error per gigabyte per month" ... which is pretty much the
> same as "one per terabyte per hour" for the people with the really cool
> toys.

Good point.

> We are also setting TAINT_MACHINE_CHECK for corrected errors - perhaps
> this made sense when systems were small and machine checks were rare and
> scary.  But I think we need to start working with the reality that
> corrected errors are normal events.

I agree.  Corrected errors - by definition - have hardware corrected data.
There is no corruption so there is no reason for kernel taint.  It would
be like setting taint when one hard drive of a RAID file system goes bad.

It's worth noting that linux does not set taint when it recovers from 
_uncorrected_ memory errors on IA64 (by killing the application
that consumed the bad data and discarding the bad page).  Modern hardware
has enough error detection/correction code to avoid undetected data 
corruption from memory errors.


-- 
Russ Anderson, OS RAS/Partitioning Project Lead  
SGI - Silicon Graphics Inc          rja@sgi.com

^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2011-04-14 19:05 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-04-12 17:44 [PATCH]: mce: don't print "human readable" message for corrected errors Prarit Bhargava
2011-04-12 18:58 ` Borislav Petkov
2011-04-12 19:22   ` Prarit Bhargava
2011-04-12 19:57     ` Borislav Petkov
2011-04-12 20:02   ` Luck, Tony
2011-04-12 20:15     ` Prarit Bhargava
2011-04-12 20:28       ` Borislav Petkov
2011-04-13  3:00         ` Russ Anderson
2011-04-13  7:14           ` Borislav Petkov
2011-04-13 13:24             ` Borislav Petkov
2011-04-13 13:36               ` [PATCH 1/3] x86, MCE: Do not taint when correctable errors Borislav Petkov
2011-04-13 13:36               ` [PATCH 2/3] x86, MCE: Drop default decoding notifier Borislav Petkov
2011-04-13 14:01                 ` Prarit Bhargava
2011-04-13 14:18                   ` Borislav Petkov
2011-04-13 14:22                     ` Prarit Bhargava
2011-04-13 14:26                       ` Borislav Petkov
2011-04-13 14:32                         ` Prarit Bhargava
2011-04-13 14:39                           ` Borislav Petkov
2011-04-13 14:45                             ` Prarit Bhargava
2011-04-13 14:36                         ` [PATCH -v2] " Borislav Petkov
2011-04-13 17:01                           ` Prarit Bhargava
2011-04-13 17:13                             ` Luck, Tony
2011-04-13 17:17                               ` Prarit Bhargava
2011-04-13 17:14                             ` Prarit Bhargava
2011-04-13 17:37                               ` Borislav Petkov
2011-04-14 14:59                                 ` Prarit Bhargava
2011-04-14 15:00                                 ` [PATCH -v3] x86, MCE: Drop the " Borislav Petkov
2011-04-14 15:04                                   ` Prarit Bhargava
2011-04-14 15:16                                     ` Borislav Petkov
2011-04-14 15:23                                       ` Prarit Bhargava
2011-04-14 15:44                                         ` Borislav Petkov
2011-04-14 15:49                                           ` Prarit Bhargava
2011-04-14 19:02                                             ` Borislav Petkov
2011-04-14 19:04                                               ` Prarit Bhargava
2011-04-14 15:33                                       ` Russ Anderson
2011-04-14 15:49                                         ` Borislav Petkov
2011-04-13 13:36               ` [PATCH 3/3] EDAC, MCE, AMD: Register with MCE core Borislav Petkov
2011-04-13  2:24     ` [PATCH]: mce: don't print "human readable" message for corrected errors Russ Anderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).