* [PATCH]: mce: don't print "human readable" message for corrected errors @ 2011-04-12 17:44 Prarit Bhargava 2011-04-12 18:58 ` Borislav Petkov 0 siblings, 1 reply; 38+ messages in thread From: Prarit Bhargava @ 2011-04-12 17:44 UTC (permalink / raw) To: linux-kernel; +Cc: dzickus, mstowe, dnelson, tony.luck, Prarit Bhargava Don't display the "human readable" warning for correctable errors in mce. There is no need for this information to be displayed. Signed-off-by: Prarit Bhargava <prarit@redhat.com> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index 450a366..8024013 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -100,10 +100,16 @@ ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain); EXPORT_SYMBOL_GPL(x86_mce_decoder_chain); static int default_decode_mce(struct notifier_block *nb, unsigned long val, - void *data) + void *data) { - pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n"); - pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n"); + struct mce *m = (struct mce *)data; + + if (m->status & MCI_STATUS_UC) { + pr_emerg(HW_ERR "No human readable MCE decoding support " + "on this CPU type.\n"); + pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' " + "to decode.\n"); + } return NOTIFY_STOP; } ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH]: mce: don't print "human readable" message for corrected errors 2011-04-12 17:44 [PATCH]: mce: don't print "human readable" message for corrected errors Prarit Bhargava @ 2011-04-12 18:58 ` Borislav Petkov 2011-04-12 19:22 ` Prarit Bhargava 2011-04-12 20:02 ` Luck, Tony 0 siblings, 2 replies; 38+ messages in thread From: Borislav Petkov @ 2011-04-12 18:58 UTC (permalink / raw) To: Prarit Bhargava; +Cc: linux-kernel, dzickus, mstowe, dnelson, tony.luck On Tue, Apr 12, 2011 at 01:44:05PM -0400, Prarit Bhargava wrote: > Don't display the "human readable" warning for correctable errors in mce. > There is no need for this information to be displayed. > > Signed-off-by: Prarit Bhargava <prarit@redhat.com> Why not? This way you turn reporting of _ALL_ correctable MCEs completely off and some users would actually like to run them through mcelog on Intel. > > diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c > index 450a366..8024013 100644 > --- a/arch/x86/kernel/cpu/mcheck/mce.c > +++ b/arch/x86/kernel/cpu/mcheck/mce.c > @@ -100,10 +100,16 @@ ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain); > EXPORT_SYMBOL_GPL(x86_mce_decoder_chain); > > static int default_decode_mce(struct notifier_block *nb, unsigned long val, > - void *data) > + void *data) > { > - pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n"); > - pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n"); > + struct mce *m = (struct mce *)data; > + > + if (m->status & MCI_STATUS_UC) { > + pr_emerg(HW_ERR "No human readable MCE decoding support " > + "on this CPU type.\n"); > + pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' " > + "to decode.\n"); > + } > > return NOTIFY_STOP; > } > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH]: mce: don't print "human readable" message for corrected errors 2011-04-12 18:58 ` Borislav Petkov @ 2011-04-12 19:22 ` Prarit Bhargava 2011-04-12 19:57 ` Borislav Petkov 2011-04-12 20:02 ` Luck, Tony 1 sibling, 1 reply; 38+ messages in thread From: Prarit Bhargava @ 2011-04-12 19:22 UTC (permalink / raw) To: Borislav Petkov, linux-kernel, dzickus, mstowe, dnelson, tony.luck On 04/12/2011 02:58 PM, Borislav Petkov wrote: > On Tue, Apr 12, 2011 at 01:44:05PM -0400, Prarit Bhargava wrote: > >> Don't display the "human readable" warning for correctable errors in mce. >> There is no need for this information to be displayed. >> >> Signed-off-by: Prarit Bhargava <prarit@redhat.com> >> > Why not? This way you turn reporting of _ALL_ correctable MCEs > completely off and some users would actually like to run them through > mcelog on Intel. > I may be reading the code wrong ... but in the case of the default_decode_mce() callback there is *no* additional output displayed for correctable MCEs -- so the message is AFAICT useless. ie) all you see on the console log is [Hardware Error]: Machine check events logged ^^^ this notifies the user that something happened with MCE [Hardware Error]: No human readable MCE decoding support on this CPU type. ^^^ this is purely informational and unnecessary [Hardware Error]: Run the message through 'mcelog --ascii' to decode. ^^^ this makes absolutely no sense. There is no message to decode. You won't see this in the case of the amd-edac because it has it's own callback. P. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH]: mce: don't print "human readable" message for corrected errors 2011-04-12 19:22 ` Prarit Bhargava @ 2011-04-12 19:57 ` Borislav Petkov 0 siblings, 0 replies; 38+ messages in thread From: Borislav Petkov @ 2011-04-12 19:57 UTC (permalink / raw) To: Prarit Bhargava Cc: linux-kernel, dzickus, mstowe, dnelson, tony.luck, Ingo Molnar, H. Peter Anvin, Thomas Gleixner Adding x86 people. On Tue, Apr 12, 2011 at 03:22:50PM -0400, Prarit Bhargava wrote: > > > On 04/12/2011 02:58 PM, Borislav Petkov wrote: > > On Tue, Apr 12, 2011 at 01:44:05PM -0400, Prarit Bhargava wrote: > > > >> Don't display the "human readable" warning for correctable errors in mce. > >> There is no need for this information to be displayed. > >> > >> Signed-off-by: Prarit Bhargava <prarit@redhat.com> > >> > > Why not? This way you turn reporting of _ALL_ correctable MCEs > > completely off and some users would actually like to run them through > > mcelog on Intel. > > > > I may be reading the code wrong ... but in the case of the > default_decode_mce() callback there is *no* additional output displayed > for correctable MCEs -- so the message is AFAICT useless. > > ie) all you see on the console log is > > [Hardware Error]: Machine check events logged > > ^^^ this notifies the user that something happened with MCE > > [Hardware Error]: No human readable MCE decoding support on this CPU type. > > ^^^ this is purely informational and unnecessary > > [Hardware Error]: Run the message through 'mcelog --ascii' to decode. > > ^^^ this makes absolutely no sense. There is no message to decode. > > You won't see this in the case of the amd-edac because it has it's own > callback. Right, I see what you're sayin'. This is actually this piece: if (!(flags & MCP_DONTLOG) && !mce_dont_log_ce) { mce_log(&m); atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m); add_taint(TAINT_MACHINE_CHECK); } in machine_check_poll and the reason I added it at the time was so EDAC gets notified about DRAM ECCs. And you're right, when we're not decoding in the kernel this message is not helping a lot. I can imagine this filling up the logs on a machine spitting lotsa DRAM CECC errors. I'm wondering whether dropping the printks completely is also an option for I don't see any merit for having them anyway... Hmmm. -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 38+ messages in thread
* RE: [PATCH]: mce: don't print "human readable" message for corrected errors 2011-04-12 18:58 ` Borislav Petkov 2011-04-12 19:22 ` Prarit Bhargava @ 2011-04-12 20:02 ` Luck, Tony 2011-04-12 20:15 ` Prarit Bhargava 2011-04-13 2:24 ` [PATCH]: mce: don't print "human readable" message for corrected errors Russ Anderson 1 sibling, 2 replies; 38+ messages in thread From: Luck, Tony @ 2011-04-12 20:02 UTC (permalink / raw) To: Borislav Petkov, Prarit Bhargava Cc: linux-kernel@vger.kernel.org, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com [-- Warning: decoded text below may be mangled, UTF-8 assumed --] [-- Attachment #1: Type: text/plain; charset="utf-8", Size: 871 bytes --] > Why not? This way you turn reporting of _ALL_ correctable MCEs > completely off and some users would actually like to run them through > mcelog on Intel. pr_emerg() is rather overkill for a corrected error - on large systems corrected errors are going to be a routine occurrence (my personal estimation is "one soft error per gigabyte per month" ... which is pretty much the same as "one per terabyte per hour" for the people with the really cool toys. We are also setting TAINT_MACHINE_CHECK for corrected errors - perhaps this made sense when systems were small and machine checks were rare and scary. But I think we need to start working with the reality that corrected errors are normal events. -Tony ÿôèº{.nÇ+·®+%Ëÿ±éݶ\x17¥wÿº{.nÇ+·¥{±þG«éÿ{ayº\x1dÊÚë,j\a¢f£¢·hïêÿêçz_è®\x03(éÝ¢j"ú\x1a¶^[m§ÿÿ¾\a«þG«éÿ¢¸?¨èÚ&£ø§~á¶iOæ¬z·vØ^\x14\x04\x1a¶^[m§ÿÿÃ\fÿ¶ìÿ¢¸?I¥ ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH]: mce: don't print "human readable" message for corrected errors 2011-04-12 20:02 ` Luck, Tony @ 2011-04-12 20:15 ` Prarit Bhargava 2011-04-12 20:28 ` Borislav Petkov 2011-04-13 2:24 ` [PATCH]: mce: don't print "human readable" message for corrected errors Russ Anderson 1 sibling, 1 reply; 38+ messages in thread From: Prarit Bhargava @ 2011-04-12 20:15 UTC (permalink / raw) To: Luck, Tony Cc: Borislav Petkov, linux-kernel@vger.kernel.org, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com > We are also setting TAINT_MACHINE_CHECK for corrected errors - perhaps > this made sense when systems were small and machine checks were rare and > scary. But I think we need to start working with the reality that > corrected errors are normal events. > > It still makes sense for small lt 1TB systems, IMO. But ... maybe a flag of some sort is necessary to stop the TAINTing for systems larger than that. The CEs may point to something going wrong on a system. CEs in theory become UCEs eventually, right? >From a OS point of view, we would like to know that there is flaky HW on the system. /me knows this is going to turn into a PFA discussion in 4 ... 3 ... 2 .... ;) P. > -Tony > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH]: mce: don't print "human readable" message for corrected errors 2011-04-12 20:15 ` Prarit Bhargava @ 2011-04-12 20:28 ` Borislav Petkov 2011-04-13 3:00 ` Russ Anderson 0 siblings, 1 reply; 38+ messages in thread From: Borislav Petkov @ 2011-04-12 20:28 UTC (permalink / raw) To: Prarit Bhargava Cc: Luck, Tony, linux-kernel@vger.kernel.org, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com On Tue, Apr 12, 2011 at 04:15:38PM -0400, Prarit Bhargava wrote: > > > We are also setting TAINT_MACHINE_CHECK for corrected errors - perhaps > > this made sense when systems were small and machine checks were rare and > > scary. But I think we need to start working with the reality that > > corrected errors are normal events. > > > > > > It still makes sense for small lt 1TB systems, IMO. But ... maybe a > flag of some sort is necessary to stop the TAINTing for systems larger > than that. The CEs may point to something going wrong on a system. CEs > in theory become UCEs eventually, right? > > From a OS point of view, we would like to know that there is flaky HW on > the system. > > /me knows this is going to turn into a PFA discussion in 4 ... 3 ... 2 > .... ;) Yeah, there's the TAINT thing too, good point Tony. Well, we definitely don't want to get tainted for correctable errors - they're too "normal" to do so, IMHO. I'm thinking remove the TAINT for CEs and don't call the default notifier if it is the only notifier call registered. Maybe something like if (num_notifiers(&x86_mce_decoder_chain) > 1) atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m); or since the notifiers are priority sorted, don't call notifiers with -1 prio. Or something to that effect. -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH]: mce: don't print "human readable" message for corrected errors 2011-04-12 20:28 ` Borislav Petkov @ 2011-04-13 3:00 ` Russ Anderson 2011-04-13 7:14 ` Borislav Petkov 0 siblings, 1 reply; 38+ messages in thread From: Russ Anderson @ 2011-04-13 3:00 UTC (permalink / raw) To: Borislav Petkov, Prarit Bhargava, Luck, Tony, linux-kernel@vger.kernel.org, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com Cc: rja On Tue, Apr 12, 2011 at 10:28:54PM +0200, Borislav Petkov wrote: > > Yeah, there's the TAINT thing too, good point Tony. Well, we definitely > don't want to get tainted for correctable errors - they're too "normal" > to do so, IMHO. I agreed. > I'm thinking remove the TAINT for CEs and don't call the default > notifier if it is the only notifier call registered. Maybe something like > > if (num_notifiers(&x86_mce_decoder_chain) > 1) > atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m); > > or since the notifiers are priority sorted, don't call notifiers with -1 > prio. > > Or something to that effect. What is the point of having a default notifier if it doesn't get called? Any consideration of adding thresholding (ie only log the first X number of corrected errors) as is done on IA64? (see arch/ia64/kernel/mca.c) Thanks, -- Russ Anderson, OS RAS/Partitioning Project Lead SGI - Silicon Graphics Inc rja@sgi.com ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH]: mce: don't print "human readable" message for corrected errors 2011-04-13 3:00 ` Russ Anderson @ 2011-04-13 7:14 ` Borislav Petkov 2011-04-13 13:24 ` Borislav Petkov 0 siblings, 1 reply; 38+ messages in thread From: Borislav Petkov @ 2011-04-13 7:14 UTC (permalink / raw) To: Russ Anderson Cc: Prarit Bhargava, Luck, Tony, linux-kernel@vger.kernel.org, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja On Tue, Apr 12, 2011 at 10:00:34PM -0500, Russ Anderson wrote: > > I'm thinking remove the TAINT for CEs and don't call the default > > notifier if it is the only notifier call registered. Maybe something like > > > > if (num_notifiers(&x86_mce_decoder_chain) > 1) > > atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m); > > > > or since the notifiers are priority sorted, don't call notifiers with -1 > > prio. > > > > Or something to that effect. > > What is the point of having a default notifier if it doesn't get called? Well, the thing is, we call the same notifier chain in the event of both correctable and uncorrectable errors. But to be honest, if we shut up the default pr_emerg() calls with "no human readable.." in the CEs' case, then we don't really need it in the UEs' case either, IMO. The only half-way sensible info we print is "Run the message through 'mcelog --ascii' to decode.\n" on UE because there we print MCA regs to the console where mcelog can actually decode them, and this should be a hint to the user to do so. In the CEs case, no such info comes out and there's no need for those printks to flood the logs. So maybe we could drop the default notifier and do in print_mce(): if (notifier_chain_empty(&x86_mce_decoder_chain)) pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n"); I think this could work, let me cook up something. > Any consideration of adding thresholding (ie only log the first X number > of corrected errors) as is done on IA64? (see arch/ia64/kernel/mca.c) Yep, this is in the works with a RAS daemon that should collect all error info in userspace and do all policies there. We might even drop the spitting in the system logs almost completely and I know this'll make a lot of people happy :). -- Regards/Gruss, Boris. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH]: mce: don't print "human readable" message for corrected errors 2011-04-13 7:14 ` Borislav Petkov @ 2011-04-13 13:24 ` Borislav Petkov 2011-04-13 13:36 ` [PATCH 1/3] x86, MCE: Do not taint when correctable errors Borislav Petkov ` (2 more replies) 0 siblings, 3 replies; 38+ messages in thread From: Borislav Petkov @ 2011-04-13 13:24 UTC (permalink / raw) To: Borislav Petkov, Russ Anderson, Prarit Bhargava, Luck, Tony, linux-kernel@vger.kernel.org, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja On Wed, Apr 13, 2011 at 09:14:33AM +0200, Borislav Petkov wrote: [..] > So maybe we could drop the default notifier and do in print_mce(): > > if (notifier_chain_empty(&x86_mce_decoder_chain)) > pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n"); > > I think this could work, let me cook up something. Ok, I'm sending what I scratched up as a reply to this mail. They remove the default notifier and use an atomic flag to distinguish the cases between when we have a decoder or not. This is marginally better in the sense that we don't have to call into the notifier chain if there are no notifier calls on it so it saves us a call in the UC case. Also, it drops the tainting for CEs and testing this by injecting DRAM CECCs leaves /proc/sys/kernel/tainted at 0. I still have to test the UC case but for that I'll have to dust off my patchset which implements hw MCE injection first. To be continued... -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen Registergericht Muenchen, HRB Nr. 43632 ^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH 1/3] x86, MCE: Do not taint when correctable errors 2011-04-13 13:24 ` Borislav Petkov @ 2011-04-13 13:36 ` Borislav Petkov 2011-04-13 13:36 ` [PATCH 2/3] x86, MCE: Drop default decoding notifier Borislav Petkov 2011-04-13 13:36 ` [PATCH 3/3] EDAC, MCE, AMD: Register with MCE core Borislav Petkov 2 siblings, 0 replies; 38+ messages in thread From: Borislav Petkov @ 2011-04-13 13:36 UTC (permalink / raw) To: linux-kernel Cc: Borislav Petkov, Russ Anderson, Prarit Bhargava, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com From: Borislav Petkov <borislav.petkov@amd.com> Correctable errors are considered something rather normal on modern hardware these days. Even more importantly, correctable errors mean exactly that - they've been corrected by the hardware - and there's no need to taint the kernel since execution hasn't been compromised so far. Suggested-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> --- arch/x86/kernel/cpu/mcheck/mce.c | 1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index 3385ea2..68e2303 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -590,7 +590,6 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b) if (!(flags & MCP_DONTLOG) && !mce_dont_log_ce) { mce_log(&m); atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m); - add_taint(TAINT_MACHINE_CHECK); } /* -- 1.7.4.rc2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 2/3] x86, MCE: Drop default decoding notifier 2011-04-13 13:24 ` Borislav Petkov 2011-04-13 13:36 ` [PATCH 1/3] x86, MCE: Do not taint when correctable errors Borislav Petkov @ 2011-04-13 13:36 ` Borislav Petkov 2011-04-13 14:01 ` Prarit Bhargava 2011-04-13 13:36 ` [PATCH 3/3] EDAC, MCE, AMD: Register with MCE core Borislav Petkov 2 siblings, 1 reply; 38+ messages in thread From: Borislav Petkov @ 2011-04-13 13:36 UTC (permalink / raw) To: linux-kernel Cc: Borislav Petkov, Russ Anderson, Prarit Bhargava, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com From: Borislav Petkov <borislav.petkov@amd.com> The default notifier doesn't make a lot of sense to call in the correctable errors case. Drop it and emit the mcelog decoding hint only in the uncorrectable errors case and when no notifier is registered. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> --- arch/x86/include/asm/mce.h | 4 ++-- arch/x86/kernel/cpu/mcheck/mce.c | 26 +++++++++----------------- 2 files changed, 11 insertions(+), 19 deletions(-) diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h index eb16e94..871d3c4 100644 --- a/arch/x86/include/asm/mce.h +++ b/arch/x86/include/asm/mce.h @@ -117,6 +117,8 @@ struct mce_log { extern struct atomic_notifier_head x86_mce_decoder_chain; +extern atomic_t mce_decoders; + #include <linux/percpu.h> #include <linux/init.h> #include <asm/atomic.h> @@ -142,8 +144,6 @@ static inline void winchip_mcheck_init(struct cpuinfo_x86 *c) {} static inline void enable_p5_mce(void) {} #endif -extern void (*x86_mce_decode_callback)(struct mce *m); - void mce_setup(struct mce *m); void mce_log(struct mce *m); DECLARE_PER_CPU(struct sys_device, mce_dev); diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index 68e2303..3069608 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -105,19 +105,8 @@ static int cpu_missing; ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain); EXPORT_SYMBOL_GPL(x86_mce_decoder_chain); -static int default_decode_mce(struct notifier_block *nb, unsigned long val, - void *data) -{ - pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n"); - pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n"); - - return NOTIFY_STOP; -} - -static struct notifier_block mce_dec_nb = { - .notifier_call = default_decode_mce, - .priority = -1, -}; +atomic_t mce_decoders = ATOMIC_INIT(0); +EXPORT_SYMBOL_GPL(mce_decoders); /* MCA banks polled by the period polling timer for corrected events */ DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = { @@ -239,7 +228,11 @@ static void print_mce(struct mce *m) * Print out human-readable details about the MCE error, * (if the CPU has an implementation for that) */ - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); + if (!atomic_read(&mce_decoders)) { + pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n"); + pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' to decode.\n"); + } else + atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); } #define PANIC_TIMEOUT 5 /* 5 seconds */ @@ -589,7 +582,8 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b) */ if (!(flags & MCP_DONTLOG) && !mce_dont_log_ce) { mce_log(&m); - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m); + if (atomic_read(&mce_decoders)) + atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m); } /* @@ -1721,8 +1715,6 @@ __setup("mce", mcheck_enable); int __init mcheck_init(void) { - atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb); - mcheck_intel_therm_init(); return 0; -- 1.7.4.rc2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] x86, MCE: Drop default decoding notifier 2011-04-13 13:36 ` [PATCH 2/3] x86, MCE: Drop default decoding notifier Borislav Petkov @ 2011-04-13 14:01 ` Prarit Bhargava 2011-04-13 14:18 ` Borislav Petkov 0 siblings, 1 reply; 38+ messages in thread From: Prarit Bhargava @ 2011-04-13 14:01 UTC (permalink / raw) To: Borislav Petkov Cc: linux-kernel, Borislav Petkov, Russ Anderson, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com On 04/13/2011 09:36 AM, Borislav Petkov wrote: > From: Borislav Petkov <borislav.petkov@amd.com> > > The default notifier doesn't make a lot of sense to call in the > correctable errors case. Drop it and emit the mcelog decoding hint only > in the uncorrectable errors case and when no notifier is registered. > > Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> > --- > arch/x86/include/asm/mce.h | 4 ++-- > arch/x86/kernel/cpu/mcheck/mce.c | 26 +++++++++----------------- > 2 files changed, 11 insertions(+), 19 deletions(-) > > +extern atomic_t mce_decoders; > + Boris, I don't think we need to do this. I think we can use the existing notifier chain tools to do this check for us ... *untested and uncompiled* patch below. * Print out human-readable details about the MCE error, > * (if the CPU has an implementation for that) > */ > - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); > + if (!atomic_read(&mce_decoders)) { > + pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n"); > + pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' to decode.\n"); I thought we didn't want these lines at all for CE errors? > + } else > + atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); > } > > #define PANIC_TIMEOUT 5 /* 5 seconds */ > @@ -589,7 +582,8 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b) > */ > if (!(flags & MCP_DONTLOG) && !mce_dont_log_ce) { > mce_log(&m); > - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m); > + if (atomic_read(&mce_decoders)) > + atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m); > } > > /* > @@ -1721,8 +1715,6 @@ __setup("mce", mcheck_enable); > > int __init mcheck_init(void) > { > - atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb); > - > mcheck_intel_therm_init(); > > return 0; Again, this patch has not been tested or compiled against latest upstream. I have quickly tested it against RHEL6 (2.6.32,33,34,35,36,38,39 based) and confirmed that I don't see the messages for CEs. I have NOT fully tested to see what happens when I add in a "dummy" notifier to see if the messages are not printed. diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index 3385ea2..b10a1f4 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -100,25 +100,12 @@ static int cpu_missing; /* * CPU/chipset specific EDAC code can register a notifier call here to print - * MCE errors in a human-readable form. + * MCE errors in a human-readable form. Notifiers must return NOTIFY_STOP + * upon completion. */ ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain); EXPORT_SYMBOL_GPL(x86_mce_decoder_chain); -static int default_decode_mce(struct notifier_block *nb, unsigned long val, - void *data) -{ - pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n"); - pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n"); - - return NOTIFY_STOP; -} - -static struct notifier_block mce_dec_nb = { - .notifier_call = default_decode_mce, - .priority = -1, -}; - /* MCA banks polled by the period polling timer for corrected events */ DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = { [0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL @@ -212,6 +199,8 @@ void mce_log(struct mce *mce) static void print_mce(struct mce *m) { + int ret; + pr_emerg(HW_ERR "CPU %d: Machine Check Exception: %Lx Bank %d: %016Lx\n", m->extcpu, m->mcgstatus, m->bank, m->status); @@ -236,10 +225,13 @@ static void print_mce(struct mce *m) m->cpuvendor, m->cpuid, m->time, m->socketid, m->apicid); /* - * Print out human-readable details about the MCE error, - * (if the CPU has an implementation for that) + * Output a message if the CPU has human-readable information for + * unhandled UC errors */ - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); + ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m); + if ((ret != NOTIFY_STOP) && (m->status & MCI_STATUS_UC)) + pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' " + "to decode.\n"); } #define PANIC_TIMEOUT 5 /* 5 seconds */ @@ -589,8 +581,8 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b) */ if (!(flags & MCP_DONTLOG) && !mce_dont_log_ce) { mce_log(&m); - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m); - add_taint(TAINT_MACHINE_CHECK); + atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, + &m); } /* @@ -1722,8 +1714,6 @@ __setup("mce", mcheck_enable); int __init mcheck_init(void) { - atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb); - mcheck_intel_therm_init(); return 0; ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] x86, MCE: Drop default decoding notifier 2011-04-13 14:01 ` Prarit Bhargava @ 2011-04-13 14:18 ` Borislav Petkov 2011-04-13 14:22 ` Prarit Bhargava 0 siblings, 1 reply; 38+ messages in thread From: Borislav Petkov @ 2011-04-13 14:18 UTC (permalink / raw) To: Prarit Bhargava Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com On Wed, Apr 13, 2011 at 10:01:22AM -0400, Prarit Bhargava wrote: > On 04/13/2011 09:36 AM, Borislav Petkov wrote: > > From: Borislav Petkov <borislav.petkov@amd.com> > > > > The default notifier doesn't make a lot of sense to call in the > > correctable errors case. Drop it and emit the mcelog decoding hint only > > in the uncorrectable errors case and when no notifier is registered. > > > > Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> > > --- > > arch/x86/include/asm/mce.h | 4 ++-- > > arch/x86/kernel/cpu/mcheck/mce.c | 26 +++++++++----------------- > > 2 files changed, 11 insertions(+), 19 deletions(-) > > > > > +extern atomic_t mce_decoders; > > + > > Boris, > > I don't think we need to do this. I think we can use the existing notifier chain tools to do this check for us ... *untested and uncompiled* patch below. > > * Print out human-readable details about the MCE error, > > * (if the CPU has an implementation for that) > > */ > > - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); > > + if (!atomic_read(&mce_decoders)) { > > + pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n"); > > + pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' to decode.\n"); > > I thought we didn't want these lines at all for CE errors? This is the UC only case - we call print_mce() before we panic. > > > + } else > > + atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); > > } > > > > #define PANIC_TIMEOUT 5 /* 5 seconds */ > > @@ -589,7 +582,8 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b) > > */ > > if (!(flags & MCP_DONTLOG) && !mce_dont_log_ce) { > > mce_log(&m); > > - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m); > > + if (atomic_read(&mce_decoders)) > > + atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m); > > } > > > > /* > > @@ -1721,8 +1715,6 @@ __setup("mce", mcheck_enable); > > > > int __init mcheck_init(void) > > { > > - atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb); > > - > > mcheck_intel_therm_init(); > > > > return 0; > > Again, this patch has not been tested or compiled against latest upstream. I > have quickly tested it against RHEL6 (2.6.32,33,34,35,36,38,39 based) and > confirmed that I don't see the messages for CEs. I have NOT fully tested to see what happens when I add in a "dummy" notifier to see if the messages are not printed. > > diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c > index 3385ea2..b10a1f4 100644 > --- a/arch/x86/kernel/cpu/mcheck/mce.c > +++ b/arch/x86/kernel/cpu/mcheck/mce.c > @@ -100,25 +100,12 @@ static int cpu_missing; > > /* > * CPU/chipset specific EDAC code can register a notifier call here to print > - * MCE errors in a human-readable form. > + * MCE errors in a human-readable form. Notifiers must return NOTIFY_STOP > + * upon completion. > */ > ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain); > EXPORT_SYMBOL_GPL(x86_mce_decoder_chain); > > -static int default_decode_mce(struct notifier_block *nb, unsigned long val, > - void *data) > -{ > - pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n"); > - pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n"); > - > - return NOTIFY_STOP; > -} > - > -static struct notifier_block mce_dec_nb = { > - .notifier_call = default_decode_mce, > - .priority = -1, > -}; > - > /* MCA banks polled by the period polling timer for corrected events */ > DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = { > [0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL > @@ -212,6 +199,8 @@ void mce_log(struct mce *mce) > > static void print_mce(struct mce *m) > { > + int ret; > + > pr_emerg(HW_ERR "CPU %d: Machine Check Exception: %Lx Bank %d: %016Lx\n", > m->extcpu, m->mcgstatus, m->bank, m->status); > > @@ -236,10 +225,13 @@ static void print_mce(struct mce *m) > m->cpuvendor, m->cpuid, m->time, m->socketid, m->apicid); > > /* > - * Print out human-readable details about the MCE error, > - * (if the CPU has an implementation for that) > + * Output a message if the CPU has human-readable information for > + * unhandled UC errors > */ > - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); > + ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, &m); > + if ((ret != NOTIFY_STOP) && (m->status & MCI_STATUS_UC)) > + pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' " > + "to decode.\n"); Yep, yours is actually nicer - the notifier returns NOTIFY_DONE if it doesn't call any notifier calls on the chain so the first test should suffice. Let me update the patch add your S-O-B so you could give it a run. Thanks. -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen Registergericht Muenchen, HRB Nr. 43632 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] x86, MCE: Drop default decoding notifier 2011-04-13 14:18 ` Borislav Petkov @ 2011-04-13 14:22 ` Prarit Bhargava 2011-04-13 14:26 ` Borislav Petkov 0 siblings, 1 reply; 38+ messages in thread From: Prarit Bhargava @ 2011-04-13 14:22 UTC (permalink / raw) To: Borislav Petkov Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com On 04/13/2011 10:18 AM, Borislav Petkov wrote: > On Wed, Apr 13, 2011 at 10:01:22AM -0400, Prarit Bhargava wrote: > >> On 04/13/2011 09:36 AM, Borislav Petkov wrote: >> >>> From: Borislav Petkov <borislav.petkov@amd.com> >>> >>> The default notifier doesn't make a lot of sense to call in the >>> correctable errors case. Drop it and emit the mcelog decoding hint only >>> in the uncorrectable errors case and when no notifier is registered. >>> >>> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> >>> --- >>> arch/x86/include/asm/mce.h | 4 ++-- >>> arch/x86/kernel/cpu/mcheck/mce.c | 26 +++++++++----------------- >>> 2 files changed, 11 insertions(+), 19 deletions(-) >>> >>> >> >> >>> +extern atomic_t mce_decoders; >>> + >>> >> Boris, >> >> I don't think we need to do this. I think we can use the existing notifier chain tools to do this check for us ... *untested and uncompiled* patch below. >> >> * Print out human-readable details about the MCE error, >> >>> * (if the CPU has an implementation for that) >>> */ >>> - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); >>> + if (!atomic_read(&mce_decoders)) { >>> + pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n"); >>> + pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' to decode.\n"); >>> >> I thought we didn't want these lines at all for CE errors? >> > This is the UC only case - we call print_mce() before we panic. > .. right ... but we would still see nonsensical messages before the panic, which will confuse the end user. IMO, dump the messages unless it's UC. And we shouldn't print out the human readable message at all. I'm installing F14 on a system that is known to generate CEs @ boot-time. I'll test patches+latest upstream to here to see what happens... P. > > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] x86, MCE: Drop default decoding notifier 2011-04-13 14:22 ` Prarit Bhargava @ 2011-04-13 14:26 ` Borislav Petkov 2011-04-13 14:32 ` Prarit Bhargava 2011-04-13 14:36 ` [PATCH -v2] " Borislav Petkov 0 siblings, 2 replies; 38+ messages in thread From: Borislav Petkov @ 2011-04-13 14:26 UTC (permalink / raw) To: Prarit Bhargava Cc: Borislav Petkov, linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com On Wed, Apr 13, 2011 at 10:22:41AM -0400, Prarit Bhargava wrote: > > > On 04/13/2011 10:18 AM, Borislav Petkov wrote: > > On Wed, Apr 13, 2011 at 10:01:22AM -0400, Prarit Bhargava wrote: > > > >> On 04/13/2011 09:36 AM, Borislav Petkov wrote: > >> > >>> From: Borislav Petkov <borislav.petkov@amd.com> > >>> > >>> The default notifier doesn't make a lot of sense to call in the > >>> correctable errors case. Drop it and emit the mcelog decoding hint only > >>> in the uncorrectable errors case and when no notifier is registered. > >>> > >>> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> > >>> --- > >>> arch/x86/include/asm/mce.h | 4 ++-- > >>> arch/x86/kernel/cpu/mcheck/mce.c | 26 +++++++++----------------- > >>> 2 files changed, 11 insertions(+), 19 deletions(-) > >>> > >>> > >> > >> > >>> +extern atomic_t mce_decoders; > >>> + > >>> > >> Boris, > >> > >> I don't think we need to do this. I think we can use the existing notifier chain tools to do this check for us ... *untested and uncompiled* patch below. > >> > >> * Print out human-readable details about the MCE error, > >> > >>> * (if the CPU has an implementation for that) > >>> */ > >>> - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); > >>> + if (!atomic_read(&mce_decoders)) { > >>> + pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n"); > >>> + pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' to decode.\n"); > >>> > >> I thought we didn't want these lines at all for CE errors? > >> > > This is the UC only case - we call print_mce() before we panic. > > > > .. right ... but we would still see nonsensical messages before the > panic, which will confuse the end user. IMO, dump the messages unless > it's UC. And we shouldn't print out the human readable message at all. I'd still leave the "mcelog ... " line though as a hint. > I'm installing F14 on a system that is known to generate CEs @ > boot-time. I'll test patches+latest upstream to here to see what happens... Cool. -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen Registergericht Muenchen, HRB Nr. 43632 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] x86, MCE: Drop default decoding notifier 2011-04-13 14:26 ` Borislav Petkov @ 2011-04-13 14:32 ` Prarit Bhargava 2011-04-13 14:39 ` Borislav Petkov 2011-04-13 14:36 ` [PATCH -v2] " Borislav Petkov 1 sibling, 1 reply; 38+ messages in thread From: Prarit Bhargava @ 2011-04-13 14:32 UTC (permalink / raw) To: Borislav Petkov Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com >> .. right ... but we would still see nonsensical messages before the >> panic, which will confuse the end user. IMO, dump the messages unless >> it's UC. And we shouldn't print out the human readable message at all. >> > I'd still leave the "mcelog ... " line though as a hint. > > :) There are no messages -- that's part of the problem. We're asking a user to do something for a message that doesn't exist ... P. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] x86, MCE: Drop default decoding notifier 2011-04-13 14:32 ` Prarit Bhargava @ 2011-04-13 14:39 ` Borislav Petkov 2011-04-13 14:45 ` Prarit Bhargava 0 siblings, 1 reply; 38+ messages in thread From: Borislav Petkov @ 2011-04-13 14:39 UTC (permalink / raw) To: Prarit Bhargava Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com On Wed, Apr 13, 2011 at 10:32:35AM -0400, Prarit Bhargava wrote: > > >> .. right ... but we would still see nonsensical messages before the > >> panic, which will confuse the end user. IMO, dump the messages unless > >> it's UC. And we shouldn't print out the human readable message at all. > >> > > I'd still leave the "mcelog ... " line though as a hint. > > > > > > :) > > There are no messages -- that's part of the problem. We're asking a > user to do something for a message that doesn't exist ... Yes and no - in the UC case you get all the MCA registers dumped in print_mce() - and those you want to decode. In the CE case, we won't be issuing any with the default notifier removed. I just sent you a -v2, I think it should do what we want. Or am I missing something...? Thanks. -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen Registergericht Muenchen, HRB Nr. 43632 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/3] x86, MCE: Drop default decoding notifier 2011-04-13 14:39 ` Borislav Petkov @ 2011-04-13 14:45 ` Prarit Bhargava 0 siblings, 0 replies; 38+ messages in thread From: Prarit Bhargava @ 2011-04-13 14:45 UTC (permalink / raw) To: Borislav Petkov Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com On 04/13/2011 10:39 AM, Borislav Petkov wrote: > On Wed, Apr 13, 2011 at 10:32:35AM -0400, Prarit Bhargava wrote: > >> >>>> .. right ... but we would still see nonsensical messages before the >>>> panic, which will confuse the end user. IMO, dump the messages unless >>>> it's UC. And we shouldn't print out the human readable message at all. >>>> >>>> >>> I'd still leave the "mcelog ... " line though as a hint. >>> >>> >>> >> :) >> >> There are no messages -- that's part of the problem. We're asking a >> user to do something for a message that doesn't exist ... >> > Yes and no - in the UC case you get all the MCA registers dumped in > print_mce() - and those you want to decode. In the CE case, we won't be > issuing any with the default notifier removed. I just sent you a -v2, I > think it should do what we want. > Yup. > Or am I missing something...? > > Nope. The follow up looks great :) P. > Thanks. > > ^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH -v2] x86, MCE: Drop default decoding notifier 2011-04-13 14:26 ` Borislav Petkov 2011-04-13 14:32 ` Prarit Bhargava @ 2011-04-13 14:36 ` Borislav Petkov 2011-04-13 17:01 ` Prarit Bhargava 1 sibling, 1 reply; 38+ messages in thread From: Borislav Petkov @ 2011-04-13 14:36 UTC (permalink / raw) To: Prarit Bhargava Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com The default notifier doesn't make a lot of sense to call in the correctable errors case. Drop it and emit the mcelog decoding hint only in the uncorrectable errors case and when no notifier is registered. While at it, remove unused old x86_mce_decode_callback from the header. Signed-off-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> --- arch/x86/include/asm/mce.h | 2 -- arch/x86/kernel/cpu/mcheck/mce.c | 22 +++++----------------- 2 files changed, 5 insertions(+), 19 deletions(-) diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h index eb16e94..021979a 100644 --- a/arch/x86/include/asm/mce.h +++ b/arch/x86/include/asm/mce.h @@ -142,8 +142,6 @@ static inline void winchip_mcheck_init(struct cpuinfo_x86 *c) {} static inline void enable_p5_mce(void) {} #endif -extern void (*x86_mce_decode_callback)(struct mce *m); - void mce_setup(struct mce *m); void mce_log(struct mce *m); DECLARE_PER_CPU(struct sys_device, mce_dev); diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index 68e2303..33714e8 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -105,20 +105,6 @@ static int cpu_missing; ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain); EXPORT_SYMBOL_GPL(x86_mce_decoder_chain); -static int default_decode_mce(struct notifier_block *nb, unsigned long val, - void *data) -{ - pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n"); - pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n"); - - return NOTIFY_STOP; -} - -static struct notifier_block mce_dec_nb = { - .notifier_call = default_decode_mce, - .priority = -1, -}; - /* MCA banks polled by the period polling timer for corrected events */ DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = { [0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL @@ -212,6 +198,8 @@ void mce_log(struct mce *mce) static void print_mce(struct mce *m) { + int ret = 0; + pr_emerg(HW_ERR "CPU %d: Machine Check Exception: %Lx Bank %d: %016Lx\n", m->extcpu, m->mcgstatus, m->bank, m->status); @@ -239,7 +227,9 @@ static void print_mce(struct mce *m) * Print out human-readable details about the MCE error, * (if the CPU has an implementation for that) */ - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); + ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); + if (ret != NOTIFY_STOP) + pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' to decode.\n"); } #define PANIC_TIMEOUT 5 /* 5 seconds */ @@ -1721,8 +1711,6 @@ __setup("mce", mcheck_enable); int __init mcheck_init(void) { - atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb); - mcheck_intel_therm_init(); return 0; -- 1.7.4.rc2 -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen Registergericht Muenchen, HRB Nr. 43632 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH -v2] x86, MCE: Drop default decoding notifier 2011-04-13 14:36 ` [PATCH -v2] " Borislav Petkov @ 2011-04-13 17:01 ` Prarit Bhargava 2011-04-13 17:13 ` Luck, Tony 2011-04-13 17:14 ` Prarit Bhargava 0 siblings, 2 replies; 38+ messages in thread From: Prarit Bhargava @ 2011-04-13 17:01 UTC (permalink / raw) To: Borislav Petkov Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com > @@ -239,7 +227,9 @@ static void print_mce(struct mce *m) > * Print out human-readable details about the MCE error, > * (if the CPU has an implementation for that) > */ > - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); > + ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); > + if (ret != NOTIFY_STOP) > + pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' to decode.\n"); > } > Borislav, I still think you need the check for UC here. When an UC occurs and mce_panic() is called the output will include: [Hardware Error]: Run the above through 'mcelog --ascii' to decode. potentially many, many times. The problem still is that there is no output to decode (in the default case). P. ^ permalink raw reply [flat|nested] 38+ messages in thread
* RE: [PATCH -v2] x86, MCE: Drop default decoding notifier 2011-04-13 17:01 ` Prarit Bhargava @ 2011-04-13 17:13 ` Luck, Tony 2011-04-13 17:17 ` Prarit Bhargava 2011-04-13 17:14 ` Prarit Bhargava 1 sibling, 1 reply; 38+ messages in thread From: Luck, Tony @ 2011-04-13 17:13 UTC (permalink / raw) To: Prarit Bhargava, Borislav Petkov Cc: linux-kernel@vger.kernel.org, Russ Anderson, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com > [Hardware Error]: Run the above through 'mcelog --ascii' to decode. > > potentially many, many times. The problem still is that there is no > output to decode (in the default case). printk_once()? Giving the admin a hint on how to do their job is all well and good - but no need to do so for each and every report. -Tony ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -v2] x86, MCE: Drop default decoding notifier 2011-04-13 17:13 ` Luck, Tony @ 2011-04-13 17:17 ` Prarit Bhargava 0 siblings, 0 replies; 38+ messages in thread From: Prarit Bhargava @ 2011-04-13 17:17 UTC (permalink / raw) To: Luck, Tony Cc: Borislav Petkov, linux-kernel@vger.kernel.org, Russ Anderson, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com On 04/13/2011 01:13 PM, Luck, Tony wrote: >> [Hardware Error]: Run the above through 'mcelog --ascii' to decode. >> >> potentially many, many times. The problem still is that there is no >> output to decode (in the default case). >> > printk_once()? > I don't think that applies here (but you might have a different idea of what the code needs to look like) as the printk has to happen for *each* UC error, but never for CE errors :/. P. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -v2] x86, MCE: Drop default decoding notifier 2011-04-13 17:01 ` Prarit Bhargava 2011-04-13 17:13 ` Luck, Tony @ 2011-04-13 17:14 ` Prarit Bhargava 2011-04-13 17:37 ` Borislav Petkov 1 sibling, 1 reply; 38+ messages in thread From: Prarit Bhargava @ 2011-04-13 17:14 UTC (permalink / raw) To: Borislav Petkov Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com On 04/13/2011 01:01 PM, Prarit Bhargava wrote: > >> @@ -239,7 +227,9 @@ static void print_mce(struct mce *m) >> * Print out human-readable details about the MCE error, >> * (if the CPU has an implementation for that) >> */ >> - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); >> + ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); >> + if (ret != NOTIFY_STOP) >> + pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' to decode.\n"); >> } >> >> > Borislav, > > Oops. Let me *carefully* rephrase that so it is clear what I'm complaining about. > I still think you need the check for UC here. When an UC occurs and > mce_panic() is called the output will include: > > [Hardware Error]: Run the above through 'mcelog --ascii' to decode. > > potentially many, many times for _all_ unreported *correctable* errors. > . The problem still is that there is no > output to decode (in the default case). > > ie) (sorry for the cut-and-paste) /* First print corrected ones that are still unlogged */ for (i = 0; i < MCE_LOG_LEN; i++) { struct mce *m = &mcelog.entry[i]; if (!(m->status & MCI_STATUS_VAL)) continue; if (!(m->status & MCI_STATUS_UC)) { print_mce(m); if (!apei_err) apei_err = apei_write_mce(m); } } will potentially result in many bogus messages during a time at which we definitely do not want bogus messages. P. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -v2] x86, MCE: Drop default decoding notifier 2011-04-13 17:14 ` Prarit Bhargava @ 2011-04-13 17:37 ` Borislav Petkov 2011-04-14 14:59 ` Prarit Bhargava 2011-04-14 15:00 ` [PATCH -v3] x86, MCE: Drop the " Borislav Petkov 0 siblings, 2 replies; 38+ messages in thread From: Borislav Petkov @ 2011-04-13 17:37 UTC (permalink / raw) To: Prarit Bhargava Cc: Borislav Petkov, linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com On Wed, Apr 13, 2011 at 01:14:35PM -0400, Prarit Bhargava wrote: > > > On 04/13/2011 01:01 PM, Prarit Bhargava wrote: > > > >> @@ -239,7 +227,9 @@ static void print_mce(struct mce *m) > >> * Print out human-readable details about the MCE error, > >> * (if the CPU has an implementation for that) > >> */ > >> - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); > >> + ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); > >> + if (ret != NOTIFY_STOP) > >> + pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' to decode.\n"); > >> } > >> > >> > > Borislav, > > > > > > Oops. Let me *carefully* rephrase that so it is clear what I'm > complaining about. > > > I still think you need the check for UC here. When an UC occurs and > > mce_panic() is called the output will include: > > > > [Hardware Error]: Run the above through 'mcelog --ascii' to decode. > > > > potentially many, many times > > for _all_ unreported *correctable* errors. > > > . The problem still is that there is no > > output to decode (in the default case). > > > > > > ie) (sorry for the cut-and-paste) > > /* First print corrected ones that are still unlogged */ > for (i = 0; i < MCE_LOG_LEN; i++) { > struct mce *m = &mcelog.entry[i]; > if (!(m->status & MCI_STATUS_VAL)) > continue; > if (!(m->status & MCI_STATUS_UC)) { > print_mce(m); > if (!apei_err) > apei_err = apei_write_mce(m); > } > } > > will potentially result in many bogus messages during a time at which we > definitely do not want bogus messages. I don't think that this is a problem. This is on the panic path and it is supposed to dump only the _unreported_ CE MCEs queued in the mcelog which can contain 32 MCEs max. In the worst case, we will report 32 CEs before panicking. For that case we either do printk_once as Tony suggested or we ratelimit it. I'll update the patch. -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen Registergericht Muenchen, HRB Nr. 43632 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -v2] x86, MCE: Drop default decoding notifier 2011-04-13 17:37 ` Borislav Petkov @ 2011-04-14 14:59 ` Prarit Bhargava 2011-04-14 15:00 ` [PATCH -v3] x86, MCE: Drop the " Borislav Petkov 1 sibling, 0 replies; 38+ messages in thread From: Prarit Bhargava @ 2011-04-14 14:59 UTC (permalink / raw) To: Borislav Petkov Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com > In the worst case, we will report 32 CEs before panicking. For that case > we either do printk_once as Tony suggested or we ratelimit it. I'll > update the patch. > Okay ... how 'bout using Tony's printk_once() just after the call to print_mce() in mce_panic()? [uncompiled and untested patch below] diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h index eb16e94..021979a 100644 --- a/arch/x86/include/asm/mce.h +++ b/arch/x86/include/asm/mce.h @@ -142,8 +142,6 @@ static inline void winchip_mcheck_init(struct cpuinfo_x86 *c) {} static inline void enable_p5_mce(void) {} #endif -extern void (*x86_mce_decode_callback)(struct mce *m); - void mce_setup(struct mce *m); void mce_log(struct mce *m); DECLARE_PER_CPU(struct sys_device, mce_dev); diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index 3385ea2..6ecfea5 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -105,20 +105,6 @@ static int cpu_missing; ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain); EXPORT_SYMBOL_GPL(x86_mce_decoder_chain); -static int default_decode_mce(struct notifier_block *nb, unsigned long val, - void *data) -{ - pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n"); - pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n"); - - return NOTIFY_STOP; -} - -static struct notifier_block mce_dec_nb = { - .notifier_call = default_decode_mce, - .priority = -1, -}; - /* MCA banks polled by the period polling timer for corrected events */ DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = { [0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL @@ -212,6 +198,8 @@ void mce_log(struct mce *mce) static void print_mce(struct mce *m) { + int ret = 0; + pr_emerg(HW_ERR "CPU %d: Machine Check Exception: %Lx Bank %d: %016Lx\n", m->extcpu, m->mcgstatus, m->bank, m->status); @@ -239,7 +227,10 @@ static void print_mce(struct mce *m) * Print out human-readable details about the MCE error, * (if the CPU has an implementation for that) */ - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); + ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); + if (ret != NOTIFY_STOP && (m->status & MCI_STATUS_UC)) + pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' " + "to decode.\n"); } #define PANIC_TIMEOUT 5 /* 5 seconds */ @@ -289,6 +280,8 @@ static void mce_panic(char *msg, struct mce *final, char *exp) continue; if (!(m->status & MCI_STATUS_UC)) { print_mce(m); + printk_once(KERN_EMERG HW_ERR "MCE Corrected Error(s) " + "detected."); if (!apei_err) apei_err = apei_write_mce(m); } @@ -1722,8 +1715,6 @@ __setup("mce", mcheck_enable); int __init mcheck_init(void) { - atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb); - mcheck_intel_therm_init(); return 0; ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH -v3] x86, MCE: Drop the default decoding notifier 2011-04-13 17:37 ` Borislav Petkov 2011-04-14 14:59 ` Prarit Bhargava @ 2011-04-14 15:00 ` Borislav Petkov 2011-04-14 15:04 ` Prarit Bhargava 1 sibling, 1 reply; 38+ messages in thread From: Borislav Petkov @ 2011-04-14 15:00 UTC (permalink / raw) To: Prarit Bhargava Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com On Wed, Apr 13, 2011 at 01:37:05PM -0400, Borislav Petkov wrote: > In the worst case, we will report 32 CEs before panicking. For that case > we either do printk_once as Tony suggested or we ratelimit it. I'll > update the patch. Ok, how about the following, I ratelimit the printk to the default of 10 messages per 5 seconds. I've also got the hardware MCE injection patches ready and will do some testing with them. -- From: Borislav Petkov <borislav.petkov@amd.com> Date: Wed, 13 Apr 2011 14:32:06 +0200 Subject: [PATCH -v3] x86, MCE: Drop the default decoding notifier The default notifier doesn't make a lot of sense to call in the correctable errors case. Drop it and emit the mcelog decoding hint only in the uncorrectable errors case and when no notifier is registered. Also, limit the issual of the "mcelog" message in the remote case when we dump unreported CEs before panicking. While at it, remove unused old x86_mce_decode_callback from the header. Signed-off-by: Prarit Bhargava <prarit@redhat.com> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> --- arch/x86/include/asm/mce.h | 2 -- arch/x86/kernel/cpu/mcheck/mce.c | 23 ++++++----------------- 2 files changed, 6 insertions(+), 19 deletions(-) diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h index eb16e94..021979a 100644 --- a/arch/x86/include/asm/mce.h +++ b/arch/x86/include/asm/mce.h @@ -142,8 +142,6 @@ static inline void winchip_mcheck_init(struct cpuinfo_x86 *c) {} static inline void enable_p5_mce(void) {} #endif -extern void (*x86_mce_decode_callback)(struct mce *m); - void mce_setup(struct mce *m); void mce_log(struct mce *m); DECLARE_PER_CPU(struct sys_device, mce_dev); diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c index 68e2303..679bd9e 100644 --- a/arch/x86/kernel/cpu/mcheck/mce.c +++ b/arch/x86/kernel/cpu/mcheck/mce.c @@ -105,20 +105,6 @@ static int cpu_missing; ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain); EXPORT_SYMBOL_GPL(x86_mce_decoder_chain); -static int default_decode_mce(struct notifier_block *nb, unsigned long val, - void *data) -{ - pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n"); - pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n"); - - return NOTIFY_STOP; -} - -static struct notifier_block mce_dec_nb = { - .notifier_call = default_decode_mce, - .priority = -1, -}; - /* MCA banks polled by the period polling timer for corrected events */ DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = { [0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL @@ -212,6 +198,8 @@ void mce_log(struct mce *mce) static void print_mce(struct mce *m) { + int ret = 0; + pr_emerg(HW_ERR "CPU %d: Machine Check Exception: %Lx Bank %d: %016Lx\n", m->extcpu, m->mcgstatus, m->bank, m->status); @@ -239,7 +227,10 @@ static void print_mce(struct mce *m) * Print out human-readable details about the MCE error, * (if the CPU has an implementation for that) */ - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); + ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); + if (ret != NOTIFY_STOP) + printk_ratelimited(KERN_EMERG HW_ERR + "Run the above through 'mcelog --ascii'\n"); } #define PANIC_TIMEOUT 5 /* 5 seconds */ @@ -1721,8 +1712,6 @@ __setup("mce", mcheck_enable); int __init mcheck_init(void) { - atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb); - mcheck_intel_therm_init(); return 0; -- 1.7.4.rc2 -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen Registergericht Muenchen, HRB Nr. 43632 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH -v3] x86, MCE: Drop the default decoding notifier 2011-04-14 15:00 ` [PATCH -v3] x86, MCE: Drop the " Borislav Petkov @ 2011-04-14 15:04 ` Prarit Bhargava 2011-04-14 15:16 ` Borislav Petkov 0 siblings, 1 reply; 38+ messages in thread From: Prarit Bhargava @ 2011-04-14 15:04 UTC (permalink / raw) To: Borislav Petkov Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com On 04/14/2011 11:00 AM, Borislav Petkov wrote: > On Wed, Apr 13, 2011 at 01:37:05PM -0400, Borislav Petkov wrote: > >> In the worst case, we will report 32 CEs before panicking. For that case >> we either do printk_once as Tony suggested or we ratelimit it. I'll >> update the patch. >> > Ok, how about the following, I ratelimit the printk to the default of 10 > messages per 5 seconds. I've also got the hardware MCE injection patches > ready and will do some testing with them. > See my previous email ;) I think just putting in a printk_once after the CE call to print_mce() in mce_panic() might be better? At least that way we get the --ascii message for *EVERY* UC which IMO would be nice... P. > -- > From: Borislav Petkov <borislav.petkov@amd.com> > Date: Wed, 13 Apr 2011 14:32:06 +0200 > Subject: [PATCH -v3] x86, MCE: Drop the default decoding notifier > > The default notifier doesn't make a lot of sense to call in the > correctable errors case. Drop it and emit the mcelog decoding hint only > in the uncorrectable errors case and when no notifier is registered. > Also, limit the issual of the "mcelog" message in the remote case when > we dump unreported CEs before panicking. > > While at it, remove unused old x86_mce_decode_callback from the header. > > Signed-off-by: Prarit Bhargava <prarit@redhat.com> > Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> > --- > arch/x86/include/asm/mce.h | 2 -- > arch/x86/kernel/cpu/mcheck/mce.c | 23 ++++++----------------- > 2 files changed, 6 insertions(+), 19 deletions(-) > > diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h > index eb16e94..021979a 100644 > --- a/arch/x86/include/asm/mce.h > +++ b/arch/x86/include/asm/mce.h > @@ -142,8 +142,6 @@ static inline void winchip_mcheck_init(struct cpuinfo_x86 *c) {} > static inline void enable_p5_mce(void) {} > #endif > > -extern void (*x86_mce_decode_callback)(struct mce *m); > - > void mce_setup(struct mce *m); > void mce_log(struct mce *m); > DECLARE_PER_CPU(struct sys_device, mce_dev); > diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c > index 68e2303..679bd9e 100644 > --- a/arch/x86/kernel/cpu/mcheck/mce.c > +++ b/arch/x86/kernel/cpu/mcheck/mce.c > @@ -105,20 +105,6 @@ static int cpu_missing; > ATOMIC_NOTIFIER_HEAD(x86_mce_decoder_chain); > EXPORT_SYMBOL_GPL(x86_mce_decoder_chain); > > -static int default_decode_mce(struct notifier_block *nb, unsigned long val, > - void *data) > -{ > - pr_emerg(HW_ERR "No human readable MCE decoding support on this CPU type.\n"); > - pr_emerg(HW_ERR "Run the message through 'mcelog --ascii' to decode.\n"); > - > - return NOTIFY_STOP; > -} > - > -static struct notifier_block mce_dec_nb = { > - .notifier_call = default_decode_mce, > - .priority = -1, > -}; > - > /* MCA banks polled by the period polling timer for corrected events */ > DEFINE_PER_CPU(mce_banks_t, mce_poll_banks) = { > [0 ... BITS_TO_LONGS(MAX_NR_BANKS)-1] = ~0UL > @@ -212,6 +198,8 @@ void mce_log(struct mce *mce) > > static void print_mce(struct mce *m) > { > + int ret = 0; > + > pr_emerg(HW_ERR "CPU %d: Machine Check Exception: %Lx Bank %d: %016Lx\n", > m->extcpu, m->mcgstatus, m->bank, m->status); > > @@ -239,7 +227,10 @@ static void print_mce(struct mce *m) > * Print out human-readable details about the MCE error, > * (if the CPU has an implementation for that) > */ > - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); > + ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); > + if (ret != NOTIFY_STOP) > + printk_ratelimited(KERN_EMERG HW_ERR > + "Run the above through 'mcelog --ascii'\n"); > } > > #define PANIC_TIMEOUT 5 /* 5 seconds */ > @@ -1721,8 +1712,6 @@ __setup("mce", mcheck_enable); > > int __init mcheck_init(void) > { > - atomic_notifier_chain_register(&x86_mce_decoder_chain, &mce_dec_nb); > - > mcheck_intel_therm_init(); > > return 0; > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -v3] x86, MCE: Drop the default decoding notifier 2011-04-14 15:04 ` Prarit Bhargava @ 2011-04-14 15:16 ` Borislav Petkov 2011-04-14 15:23 ` Prarit Bhargava 2011-04-14 15:33 ` Russ Anderson 0 siblings, 2 replies; 38+ messages in thread From: Borislav Petkov @ 2011-04-14 15:16 UTC (permalink / raw) To: Prarit Bhargava Cc: Borislav Petkov, linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com On Thu, Apr 14, 2011 at 11:04:43AM -0400, Prarit Bhargava wrote: > > > On 04/14/2011 11:00 AM, Borislav Petkov wrote: > > On Wed, Apr 13, 2011 at 01:37:05PM -0400, Borislav Petkov wrote: > > > >> In the worst case, we will report 32 CEs before panicking. For that case > >> we either do printk_once as Tony suggested or we ratelimit it. I'll > >> update the patch. > >> > > Ok, how about the following, I ratelimit the printk to the default of 10 > > messages per 5 seconds. I've also got the hardware MCE injection patches > > ready and will do some testing with them. > > > > See my previous email ;) I think just putting in a printk_once after > the CE call to print_mce() in mce_panic() might be better? At least > that way we get the --ascii message for *EVERY* UC which IMO would be > nice... Are you sure? printk_once() is, as its name says, a one-time thing and it is implemented that way - a static bool counter which is once set and that's it. I.e., the "--ascii" message will be printed only once for the system's lifetime. The ratelimit-ed thing dumps it a strict number of times. In the end, I don't have a strong opinion on how many times we issue it - I'm fine with it either way. Maybe some other opinions. Tony? -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen Registergericht Muenchen, HRB Nr. 43632 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -v3] x86, MCE: Drop the default decoding notifier 2011-04-14 15:16 ` Borislav Petkov @ 2011-04-14 15:23 ` Prarit Bhargava 2011-04-14 15:44 ` Borislav Petkov 2011-04-14 15:33 ` Russ Anderson 1 sibling, 1 reply; 38+ messages in thread From: Prarit Bhargava @ 2011-04-14 15:23 UTC (permalink / raw) To: Borislav Petkov Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com On 04/14/2011 11:16 AM, Borislav Petkov wrote: > On Thu, Apr 14, 2011 at 11:04:43AM -0400, Prarit Bhargava wrote: > >> >> On 04/14/2011 11:00 AM, Borislav Petkov wrote: >> >>> On Wed, Apr 13, 2011 at 01:37:05PM -0400, Borislav Petkov wrote: >>> >>> >>>> In the worst case, we will report 32 CEs before panicking. For that case >>>> we either do printk_once as Tony suggested or we ratelimit it. I'll >>>> update the patch. >>>> >>>> >>> Ok, how about the following, I ratelimit the printk to the default of 10 >>> messages per 5 seconds. I've also got the hardware MCE injection patches >>> ready and will do some testing with them. >>> >>> >> See my previous email ;) I think just putting in a printk_once after >> the CE call to print_mce() in mce_panic() might be better? At least >> that way we get the --ascii message for *EVERY* UC which IMO would be >> nice... >> > Are you sure? printk_once() is, as its name says, a one-time thing and > it is implemented that way - a static bool counter which is once set and > that's it. I.e., the "--ascii" message will be printed only once for the > system's lifetime. > Oops ... I may have confused you because what I did was subtle. I really should have explicitly pointed out what I did. Sorry, my bad. >From my patch (sorry for the cut-and-paste): @@ -239,7 +227,10 @@ static void print_mce(struct mce *m) * Print out human-readable details about the MCE error, * (if the CPU has an implementation for that) */ - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); + ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); + if (ret != NOTIFY_STOP && (m->status & MCI_STATUS_UC)) + pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' " + "to decode.\n"); } This, of course, only outputs during UCs. and @@ -289,6 +280,8 @@ static void mce_panic(char *msg, struct mce *final, char *exp) continue; if (!(m->status & MCI_STATUS_UC)) { print_mce(m); + printk_once(KERN_EMERG HW_ERR "MCE Corrected Error(s) " + "detected."); if (!apei_err) apei_err = apei_write_mce(m); } so we'll print "MCE Corrected Error(s)" _once_ if we go through this path. Since there is no data to decode with mcelog, a nice little one time message is probably the way to go :). P. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -v3] x86, MCE: Drop the default decoding notifier 2011-04-14 15:23 ` Prarit Bhargava @ 2011-04-14 15:44 ` Borislav Petkov 2011-04-14 15:49 ` Prarit Bhargava 0 siblings, 1 reply; 38+ messages in thread From: Borislav Petkov @ 2011-04-14 15:44 UTC (permalink / raw) To: Prarit Bhargava Cc: Borislav Petkov, linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com On Thu, Apr 14, 2011 at 11:23:04AM -0400, Prarit Bhargava wrote: > Oops ... I may have confused you because what I did was subtle. I > really should have explicitly pointed out what I did. Sorry, my bad. > > From my patch (sorry for the cut-and-paste): > > @@ -239,7 +227,10 @@ static void print_mce(struct mce *m) > * Print out human-readable details about the MCE error, > * (if the CPU has an implementation for that) > */ > - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); > + ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); > + if (ret != NOTIFY_STOP && (m->status & MCI_STATUS_UC)) > + pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' " > + "to decode.\n"); > } > > This, of course, only outputs during UCs. > > and > > @@ -289,6 +280,8 @@ static void mce_panic(char *msg, struct mce *final, > char *exp) > continue; > if (!(m->status & MCI_STATUS_UC)) { > print_mce(m); > + printk_once(KERN_EMERG HW_ERR "MCE Corrected > Error(s) " > + "detected."); > if (!apei_err) > apei_err = apei_write_mce(m); > } > > so we'll print "MCE Corrected Error(s)" _once_ if we go through this > path. Since there is no data to decode with mcelog, a nice little one > time message is probably the way to go :). Ok, first of all, see the print_mce(m) call above? Yes, we're dumping full CE MCE info in this case because they were unlogged and as such, that info can be decoded. But this whole point is moot since those errors can be only 32 max _and_ on the _panic_ path. And I don't think this path matters because it is _very_ seldom. I bet you don't hit it on any of your machines. And we don't want to fix that - we want to fix the case with the occasional CE MCEs which get detected in the polling path but none of their MCA regs get dumped for decoding so the decoding hint there is out of place. And we fixed that at least partially so that it doesn't flood the logs. If you're not fine with the default ratelimit of 10 msgs per 5 seconds we can always raise the ratelimit but tweaking an almost hypothetical case is just not worth it. Thanks. -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen Registergericht Muenchen, HRB Nr. 43632 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -v3] x86, MCE: Drop the default decoding notifier 2011-04-14 15:44 ` Borislav Petkov @ 2011-04-14 15:49 ` Prarit Bhargava 2011-04-14 19:02 ` Borislav Petkov 0 siblings, 1 reply; 38+ messages in thread From: Prarit Bhargava @ 2011-04-14 15:49 UTC (permalink / raw) To: Borislav Petkov Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com On 04/14/2011 11:44 AM, Borislav Petkov wrote: > On Thu, Apr 14, 2011 at 11:23:04AM -0400, Prarit Bhargava wrote: > >> Oops ... I may have confused you because what I did was subtle. I >> really should have explicitly pointed out what I did. Sorry, my bad. >> >> From my patch (sorry for the cut-and-paste): >> >> @@ -239,7 +227,10 @@ static void print_mce(struct mce *m) >> * Print out human-readable details about the MCE error, >> * (if the CPU has an implementation for that) >> */ >> - atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); >> + ret = atomic_notifier_call_chain(&x86_mce_decoder_chain, 0, m); >> + if (ret != NOTIFY_STOP && (m->status & MCI_STATUS_UC)) >> + pr_emerg(HW_ERR "Run the above through 'mcelog --ascii' " >> + "to decode.\n"); >> } >> >> This, of course, only outputs during UCs. >> >> and >> >> @@ -289,6 +280,8 @@ static void mce_panic(char *msg, struct mce *final, >> char *exp) >> continue; >> if (!(m->status & MCI_STATUS_UC)) { >> print_mce(m); >> + printk_once(KERN_EMERG HW_ERR "MCE Corrected >> Error(s) " >> + "detected."); >> if (!apei_err) >> apei_err = apei_write_mce(m); >> } >> >> so we'll print "MCE Corrected Error(s)" _once_ if we go through this >> path. Since there is no data to decode with mcelog, a nice little one >> time message is probably the way to go :). >> > Ok, first of all, see the print_mce(m) call above? Yes, we're dumping > full CE MCE info in this case because they were unlogged and as such, > that info can be decoded. > > But this whole point is moot since those errors can be only 32 max _and_ > on the _panic_ path. And I don't think this path matters because it is > _very_ seldom. I bet you don't hit it on any of your machines. > Ohhhh ... I was running on the assumption that the data was *never* output. > And we don't want to fix that - we want to fix the case with the > occasional CE MCEs which get detected in the polling path but none of > their MCA regs get dumped for decoding so the decoding hint there is > out of place. And we fixed that at least partially so that it doesn't > flood the logs. If you're not fine with the default ratelimit of 10 msgs > per 5 seconds we can always raise the ratelimit but tweaking an almost > hypothetical case is just not worth it. > > Okay -- I'm good then. P. > Thanks. > > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -v3] x86, MCE: Drop the default decoding notifier 2011-04-14 15:49 ` Prarit Bhargava @ 2011-04-14 19:02 ` Borislav Petkov 2011-04-14 19:04 ` Prarit Bhargava 0 siblings, 1 reply; 38+ messages in thread From: Borislav Petkov @ 2011-04-14 19:02 UTC (permalink / raw) To: Prarit Bhargava Cc: Borislav Petkov, linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com On Thu, Apr 14, 2011 at 11:49:08AM -0400, Prarit Bhargava wrote: > > And we don't want to fix that - we want to fix the case with the > > occasional CE MCEs which get detected in the polling path but none of > > their MCA regs get dumped for decoding so the decoding hint there is > > out of place. And we fixed that at least partially so that it doesn't > > flood the logs. If you're not fine with the default ratelimit of 10 msgs > > per 5 seconds we can always raise the ratelimit but tweaking an almost > > hypothetical case is just not worth it. > > > Okay -- I'm good then. Ok, injecting MCEs with this patch looks like this. Nevermind the decoded MCEs, I simply uncommented the "if (ret != NOTIFY_DONE)" line so that I can see the rate limiting. So we still have them there, this is the default setting of 10 calls per 5 secs, we might want to dial it down in the output. After the 10th MCE, it doesn't appear anymore. Hmm... Prarit, you said you have a machine which spits a lot of CECCs on boot, does the final version help there, did you have a chance to run it? [ 312.983610] [Hardware Error]: CPU 0: Machine Check Exception: 0 Bank 0: 9200400000010e3f [ 312.987463] [Hardware Error]: TSC 7ca4633c24 [ 312.987463] [Hardware Error]: PROCESSOR 2:100f91 TIME 1302807306 SOCKET 0 APIC 0 [ 312.987463] [Hardware Error]: MC0_STATUS[-|CE|-|PCC|-|CECC]: 0x9200400000010e3f [ 312.987463] [Hardware Error]: Data Cache Error: [ 312.987463] [Hardware Error]: Corrupted DC MCE info? [ 312.987463] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: DRD, part-proc: GEN (no timeout) [ 312.987463] [Hardware Error]: Run the above through 'mcelog --ascii' <----------------- [ 312.987463] [Hardware Error]: CPU 0: Machine Check Exception: 0 Bank 0: d400400000010016 [ 312.987463] [Hardware Error]: TSC 825c93b856 [ 312.987463] [Hardware Error]: PROCESSOR 2:100f91 TIME 1302807321 SOCKET 0 APIC 0 [ 312.987463] [Hardware Error]: MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd400400000010016 [ 312.987463] [Hardware Error]: Data Cache Error: L2 TLB multimatch. [ 312.987463] [Hardware Error]: cache level: L2, tx: DATA [ 312.987463] [Hardware Error]: Run the above through 'mcelog --ascii' <----------------- [ 312.987463] [Hardware Error]: CPU 0: Machine Check Exception: 0 Bank 0: d40040000000081f [ 312.987463] [Hardware Error]: TSC 852bb06a47 [ 312.987463] [Hardware Error]: PROCESSOR 2:100f91 TIME 1302807328 SOCKET 0 APIC 0 [ 312.987463] [Hardware Error]: MC0_STATUS[Over|CE|-|-|AddrV|CECC]: 0xd40040000000081f [ 312.987463] [Hardware Error]: Data Cache Error: [ 312.987463] [Hardware Error]: Corrupted DC MCE info? [ 312.987463] [Hardware Error]: cache level: L3/GEN, mem/io: GEN, mem-tx: RD, part-proc: SRC (no timeout) [ 312.987463] [Hardware Error]: Run the above through 'mcelog --ascii' <----------------- [ 312.987463] [Hardware Error]: CPU 0: Machine Check Exception: 0 Bank 0: dc03400008000f43 [ 312.987463] [Hardware Error]: TSC 9a9c3818f3 [ 312.987463] [Hardware Error]: PROCESSOR 2:100f91 TIME 1302807382 SOCKET 0 APIC 0 [ 312.987463] [Hardware Error]: MC0_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc03400008000f43 [ 312.987463] [Hardware Error]: Data Cache Error: [ 312.987463] [Hardware Error]: Corrupted DC MCE info? [ 312.987463] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: DWR, part-proc: GEN (timed out) [ 312.987463] [Hardware Error]: Run the above through 'mcelog --ascii' <----------------- [ 312.987463] [Hardware Error]: CPU 0: Machine Check Exception: 0 Bank 0: f600210000000107 [ 312.987463] [Hardware Error]: TSC 88ce52d026 [ 312.987463] [Hardware Error]: PROCESSOR 2:100f91 TIME 1302807337 SOCKET 0 APIC 0 [ 312.987463] [Hardware Error]: MC0_STATUS[Over|UE|-|PCC|AddrV|UECC]: 0xf600210000000107 [ 312.987463] [Hardware Error]: Data Cache Error: [ 312.987463] [Hardware Error]: Corrupted DC MCE info? [ 312.987463] [Hardware Error]: cache level: L3/GEN, tx: DATA, mem-tx: GEN [ 312.987463] [Hardware Error]: Run the above through 'mcelog --ascii' <----------------- [ 312.987463] [Hardware Error]: CPU 0: Machine Check Exception: 0 Bank 0: dc03400008000f43 [ 312.987463] [Hardware Error]: TSC 9a9c3818f3 [ 312.987463] [Hardware Error]: PROCESSOR 2:100f91 TIME 1302807382 SOCKET 0 APIC 0 [ 312.987463] [Hardware Error]: MC0_STATUS[Over|CE|MiscV|-|AddrV|CECC]: 0xdc03400008000f43 [ 312.987463] [Hardware Error]: Data Cache Error: [ 312.987463] [Hardware Error]: Corrupted DC MCE info? [ 312.987463] [Hardware Error]: cache level: L3/GEN, mem/io: MEM, mem-tx: DWR, part-proc: GEN (timed out) [ 312.987463] [Hardware Error]: Run the above through 'mcelog --ascii' <----------------- [ 312.987463] [Hardware Error]: Machine check: MCIP not set in MCA handler [ 312.987463] [Hardware Error]: Fake kernel panic: Fatal machine check on current CPU [ 312.987463] mce_notify_irq: 2 callbacks suppressed [ 312.987463] [Hardware Error]: Machine check events logged Thanks. -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen Registergericht Muenchen, HRB Nr. 43632 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -v3] x86, MCE: Drop the default decoding notifier 2011-04-14 19:02 ` Borislav Petkov @ 2011-04-14 19:04 ` Prarit Bhargava 0 siblings, 0 replies; 38+ messages in thread From: Prarit Bhargava @ 2011-04-14 19:04 UTC (permalink / raw) To: Borislav Petkov Cc: linux-kernel@vger.kernel.org, Russ Anderson, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com On 04/14/2011 03:02 PM, Borislav Petkov wrote: > On Thu, Apr 14, 2011 at 11:49:08AM -0400, Prarit Bhargava wrote: > >>> And we don't want to fix that - we want to fix the case with the >>> occasional CE MCEs which get detected in the polling path but none of >>> their MCA regs get dumped for decoding so the decoding hint there is >>> out of place. And we fixed that at least partially so that it doesn't >>> flood the logs. If you're not fine with the default ratelimit of 10 msgs >>> per 5 seconds we can always raise the ratelimit but tweaking an almost >>> hypothetical case is just not worth it. >>> >>> >> Okay -- I'm good then. >> > Ok, injecting MCEs with this patch looks like this. Nevermind the > decoded MCEs, I simply uncommented the "if (ret != NOTIFY_DONE)" line > so that I can see the rate limiting. So we still have them there, this > is the default setting of 10 calls per 5 secs, we might want to dial it > down in the output. After the 10th MCE, it doesn't appear anymore. > > Hmm... > > Prarit, you said you have a machine which spits a lot of CECCs on boot, > does the final version help there, did you have a chance to run it? > Yup -- it looked good and did what I expected it to do ... I didn't save the output though :( P. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -v3] x86, MCE: Drop the default decoding notifier 2011-04-14 15:16 ` Borislav Petkov 2011-04-14 15:23 ` Prarit Bhargava @ 2011-04-14 15:33 ` Russ Anderson 2011-04-14 15:49 ` Borislav Petkov 1 sibling, 1 reply; 38+ messages in thread From: Russ Anderson @ 2011-04-14 15:33 UTC (permalink / raw) To: Borislav Petkov Cc: Prarit Bhargava, linux-kernel@vger.kernel.org, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja On Thu, Apr 14, 2011 at 05:16:21PM +0200, Borislav Petkov wrote: > On Thu, Apr 14, 2011 at 11:04:43AM -0400, Prarit Bhargava wrote: > > On 04/14/2011 11:00 AM, Borislav Petkov wrote: > > > On Wed, Apr 13, 2011 at 01:37:05PM -0400, Borislav Petkov wrote: > > > > > >> In the worst case, we will report 32 CEs before panicking. For that case > > >> we either do printk_once as Tony suggested or we ratelimit it. I'll > > >> update the patch. > > >> > > > Ok, how about the following, I ratelimit the printk to the default of 10 > > > messages per 5 seconds. I've also got the hardware MCE injection patches > > > ready and will do some testing with them. > > > > > > > See my previous email ;) I think just putting in a printk_once after > > the CE call to print_mce() in mce_panic() might be better? At least > > that way we get the --ascii message for *EVERY* UC which IMO would be > > nice... > > Are you sure? printk_once() is, as its name says, a one-time thing and > it is implemented that way - a static bool counter which is once set and > that's it. I.e., the "--ascii" message will be printed only once for the > system's lifetime. > > The ratelimit-ed thing dumps it a strict number of times. In the end, > I don't have a strong opinion on how many times we issue it - I'm fine > with it either way. > > Maybe some other opinions. Tony? In general I think you and Prarit are headed in the right direction. As for when to throttle messages, differing people will have different opinions as to the right number. For example, some sites may want the threshold low because once they see a CE they will schedule to replace the DIMM. Manufacturing sometimes wants to see all the CEs to know how good/bad the DIMM is. My suggestion is to pick a default value and have a /sys (or other) way of changing the value. That way if someone has a need to change the value they can. In real life someone will have a legitimate need to change the value. Is the thresholding on a per DIMM, per Socket, or per system basis? SGI tends to have large systems with lots of DIMMs. Per DIMM or per Socket thresholds tend to scale better. Thanks, -- Russ Anderson, OS RAS/Partitioning Project Lead SGI - Silicon Graphics Inc rja@sgi.com ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH -v3] x86, MCE: Drop the default decoding notifier 2011-04-14 15:33 ` Russ Anderson @ 2011-04-14 15:49 ` Borislav Petkov 0 siblings, 0 replies; 38+ messages in thread From: Borislav Petkov @ 2011-04-14 15:49 UTC (permalink / raw) To: Russ Anderson Cc: Prarit Bhargava, linux-kernel@vger.kernel.org, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com On Thu, Apr 14, 2011 at 11:33:18AM -0400, Russ Anderson wrote: > On Thu, Apr 14, 2011 at 05:16:21PM +0200, Borislav Petkov wrote: > > On Thu, Apr 14, 2011 at 11:04:43AM -0400, Prarit Bhargava wrote: > > > On 04/14/2011 11:00 AM, Borislav Petkov wrote: > > > > On Wed, Apr 13, 2011 at 01:37:05PM -0400, Borislav Petkov wrote: > > > > > > > >> In the worst case, we will report 32 CEs before panicking. For that case > > > >> we either do printk_once as Tony suggested or we ratelimit it. I'll > > > >> update the patch. > > > >> > > > > Ok, how about the following, I ratelimit the printk to the default of 10 > > > > messages per 5 seconds. I've also got the hardware MCE injection patches > > > > ready and will do some testing with them. > > > > > > > > > > See my previous email ;) I think just putting in a printk_once after > > > the CE call to print_mce() in mce_panic() might be better? At least > > > that way we get the --ascii message for *EVERY* UC which IMO would be > > > nice... > > > > Are you sure? printk_once() is, as its name says, a one-time thing and > > it is implemented that way - a static bool counter which is once set and > > that's it. I.e., the "--ascii" message will be printed only once for the > > system's lifetime. > > > > The ratelimit-ed thing dumps it a strict number of times. In the end, > > I don't have a strong opinion on how many times we issue it - I'm fine > > with it either way. > > > > Maybe some other opinions. Tony? > > In general I think you and Prarit are headed in the right direction. > > As for when to throttle messages, differing people will have > different opinions as to the right number. For example, some > sites may want the threshold low because once they see a CE they > will schedule to replace the DIMM. Manufacturing sometimes > wants to see all the CEs to know how good/bad the DIMM is. > > My suggestion is to pick a default value and have a /sys (or > other) way of changing the value. That way if someone has > a need to change the value they can. In real life someone > will have a legitimate need to change the value. > > Is the thresholding on a per DIMM, per Socket, or per system > basis? SGI tends to have large systems with lots of DIMMs. > Per DIMM or per Socket thresholds tend to scale better. Nah, we're talking about the decoding hint only "Run the above through 'mcelog --ascii'" and that issuing it makes no sense when no correctable errors info is on dmesg. The CEs don't get reported in this case, I think on Intel you have to run mcelog. On AMD, you generally use the amd64_edac driver and collect all CEs info which gets decoded to chip selects on the node and then you can do thresholding. What you're talking above is not yet fully ... hm.. implemented yet :). -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen Registergericht Muenchen, HRB Nr. 43632 ^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH 3/3] EDAC, MCE, AMD: Register with MCE core 2011-04-13 13:24 ` Borislav Petkov 2011-04-13 13:36 ` [PATCH 1/3] x86, MCE: Do not taint when correctable errors Borislav Petkov 2011-04-13 13:36 ` [PATCH 2/3] x86, MCE: Drop default decoding notifier Borislav Petkov @ 2011-04-13 13:36 ` Borislav Petkov 2 siblings, 0 replies; 38+ messages in thread From: Borislav Petkov @ 2011-04-13 13:36 UTC (permalink / raw) To: linux-kernel Cc: Borislav Petkov, Russ Anderson, Prarit Bhargava, Luck, Tony, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja@americas.sgi.com From: Borislav Petkov <borislav.petkov@amd.com> Now that we don't have a default notifier, register with the MCE core to state that MCE decoder functionality is present. Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> --- drivers/edac/mce_amd.c | 3 +++ 1 files changed, 3 insertions(+), 0 deletions(-) diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c index 795cfbc..4448010 100644 --- a/drivers/edac/mce_amd.c +++ b/drivers/edac/mce_amd.c @@ -902,6 +902,8 @@ static int __init mce_amd_init(void) atomic_notifier_chain_register(&x86_mce_decoder_chain, &amd_mce_dec_nb); + atomic_inc(&mce_decoders); + return 0; } early_initcall(mce_amd_init); @@ -909,6 +911,7 @@ early_initcall(mce_amd_init); #ifdef MODULE static void __exit mce_amd_exit(void) { + atomic_dec(&mce_decoders); atomic_notifier_chain_unregister(&x86_mce_decoder_chain, &amd_mce_dec_nb); kfree(fam_ops); } -- 1.7.4.rc2 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH]: mce: don't print "human readable" message for corrected errors 2011-04-12 20:02 ` Luck, Tony 2011-04-12 20:15 ` Prarit Bhargava @ 2011-04-13 2:24 ` Russ Anderson 1 sibling, 0 replies; 38+ messages in thread From: Russ Anderson @ 2011-04-13 2:24 UTC (permalink / raw) To: Luck, Tony Cc: Borislav Petkov, Prarit Bhargava, linux-kernel@vger.kernel.org, dzickus@redhat.com, mstowe@redhat.com, dnelson@redhat.com, rja On Tue, Apr 12, 2011 at 01:02:21PM -0700, Luck, Tony wrote: > > Why not? This way you turn reporting of _ALL_ correctable MCEs > > completely off and some users would actually like to run them through > > mcelog on Intel. > > pr_emerg() is rather overkill for a corrected error - on large systems > corrected errors are going to be a routine occurrence (my personal estimation > is "one soft error per gigabyte per month" ... which is pretty much the > same as "one per terabyte per hour" for the people with the really cool > toys. Good point. > We are also setting TAINT_MACHINE_CHECK for corrected errors - perhaps > this made sense when systems were small and machine checks were rare and > scary. But I think we need to start working with the reality that > corrected errors are normal events. I agree. Corrected errors - by definition - have hardware corrected data. There is no corruption so there is no reason for kernel taint. It would be like setting taint when one hard drive of a RAID file system goes bad. It's worth noting that linux does not set taint when it recovers from _uncorrected_ memory errors on IA64 (by killing the application that consumed the bad data and discarding the bad page). Modern hardware has enough error detection/correction code to avoid undetected data corruption from memory errors. -- Russ Anderson, OS RAS/Partitioning Project Lead SGI - Silicon Graphics Inc rja@sgi.com ^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2011-04-14 19:05 UTC | newest] Thread overview: 38+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2011-04-12 17:44 [PATCH]: mce: don't print "human readable" message for corrected errors Prarit Bhargava 2011-04-12 18:58 ` Borislav Petkov 2011-04-12 19:22 ` Prarit Bhargava 2011-04-12 19:57 ` Borislav Petkov 2011-04-12 20:02 ` Luck, Tony 2011-04-12 20:15 ` Prarit Bhargava 2011-04-12 20:28 ` Borislav Petkov 2011-04-13 3:00 ` Russ Anderson 2011-04-13 7:14 ` Borislav Petkov 2011-04-13 13:24 ` Borislav Petkov 2011-04-13 13:36 ` [PATCH 1/3] x86, MCE: Do not taint when correctable errors Borislav Petkov 2011-04-13 13:36 ` [PATCH 2/3] x86, MCE: Drop default decoding notifier Borislav Petkov 2011-04-13 14:01 ` Prarit Bhargava 2011-04-13 14:18 ` Borislav Petkov 2011-04-13 14:22 ` Prarit Bhargava 2011-04-13 14:26 ` Borislav Petkov 2011-04-13 14:32 ` Prarit Bhargava 2011-04-13 14:39 ` Borislav Petkov 2011-04-13 14:45 ` Prarit Bhargava 2011-04-13 14:36 ` [PATCH -v2] " Borislav Petkov 2011-04-13 17:01 ` Prarit Bhargava 2011-04-13 17:13 ` Luck, Tony 2011-04-13 17:17 ` Prarit Bhargava 2011-04-13 17:14 ` Prarit Bhargava 2011-04-13 17:37 ` Borislav Petkov 2011-04-14 14:59 ` Prarit Bhargava 2011-04-14 15:00 ` [PATCH -v3] x86, MCE: Drop the " Borislav Petkov 2011-04-14 15:04 ` Prarit Bhargava 2011-04-14 15:16 ` Borislav Petkov 2011-04-14 15:23 ` Prarit Bhargava 2011-04-14 15:44 ` Borislav Petkov 2011-04-14 15:49 ` Prarit Bhargava 2011-04-14 19:02 ` Borislav Petkov 2011-04-14 19:04 ` Prarit Bhargava 2011-04-14 15:33 ` Russ Anderson 2011-04-14 15:49 ` Borislav Petkov 2011-04-13 13:36 ` [PATCH 3/3] EDAC, MCE, AMD: Register with MCE core Borislav Petkov 2011-04-13 2:24 ` [PATCH]: mce: don't print "human readable" message for corrected errors Russ Anderson
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).