From: Borislav Petkov <bp@alien8.de>
To: "Luck, Tony" <tony.luck@intel.com>
Cc: "Chen, Gong" <gong.chen@linux.intel.com>,
"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [UNTESTED PATCH] x86, mce: Avoid double entry of deferred errors into the genpool.
Date: Mon, 23 Nov 2015 18:59:32 +0100
Message-ID: <20151123175932.GG5134@pd.tnic>
In-Reply-To: <20151119203920.GH6065@pd.tnic>
On Thu, Nov 19, 2015 at 09:39:20PM +0100, Borislav Petkov wrote:
> On Thu, Nov 19, 2015 at 07:33:58PM +0000, Luck, Tony wrote:
> > > Applied, thanks.
> >
> > Did you test it (note the "UNTESTED" in the subject!). My usual system for this is getting upgrades and being
> > flaky at the moment.
>
> Bah, it builds, should be enough. Ship it. :-)
>
> Lemme get a box...
Here are some results:
# grep . /sys/kernel/debug/apei/einj/*
/sys/kernel/debug/apei/einj/available_error_type:0x00000002 Processor Uncorrectable non-fatal
/sys/kernel/debug/apei/einj/available_error_type:0x00000008 Memory Correctable
/sys/kernel/debug/apei/einj/available_error_type:0x00000010 Memory Uncorrectable non-fatal
grep: /sys/kernel/debug/apei/einj/error_inject: Permission denied
/sys/kernel/debug/apei/einj/error_type:0x0
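For reference, a rough sketch of which of the common EINJ error types
this box does not advertise. It assumes the bit assignments listed in
the kernel's EINJ documentation (only the processor and memory types
are included); the helper name missing_types() is made up for
illustration.

```python
# Error-type bits as listed in the kernel's EINJ documentation.
# Only processor and memory types are included in this sketch.
EINJ_TYPES = {
    0x01: "Processor Correctable",
    0x02: "Processor Uncorrectable non-fatal",
    0x04: "Processor Uncorrectable fatal",
    0x08: "Memory Correctable",
    0x10: "Memory Uncorrectable non-fatal",
    0x20: "Memory Uncorrectable fatal",
}


def missing_types(available_lines):
    """Return names of the types above that the platform did not advertise."""
    advertised = {int(line.split()[0], 16) for line in available_lines}
    return [name for bit, name in EINJ_TYPES.items() if bit not in advertised]


# Values copied from the available_error_type output above.
available = [
    "0x00000002 Processor Uncorrectable non-fatal",
    "0x00000008 Memory Correctable",
    "0x00000010 Memory Uncorrectable non-fatal",
]
print(missing_types(available))
```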
Looks like some old EINJ without all the features. Oh well, let's see
what'll happen anyway:
# echo 0x8 > error_type
# echo 1 > error_inject
[ 840.461666] mce: [Hardware Error]: Machine check events logged
[ 840.476221] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 840.489214] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 5: 8c00004000010090
[ 840.507685] EDAC sbridge MC0: TSC 0
[ 840.515223] EDAC sbridge MC0: ADDR bb68ec00 EDAC sbridge MC0: MISC 20403ebe86
[ 840.532477] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1448299322 SOCKET 0 APIC 0
[ 840.551279] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 840.563872] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: 8800004100800090
[ 840.581970] EDAC sbridge MC0: TSC 0
[ 840.589513] EDAC sbridge MC0: ADDR 0 EDAC sbridge MC0: MISC 4908400040004200
[ 840.606267] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1448299322 SOCKET 0 APIC 0
[ 841.499090] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68e offset:0xc00 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
So yeah, the mce_notify_irq() message ("Machine check events logged") is
visible there, i.e., we went through mce_log() here, which sets
mce_need_notify.
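As an aside, the Bank 5 status word from that run can be picked apart
with a few shifts. This is only a sketch: the bit positions are the
MCI_STATUS_* ones from arch/x86/include/asm/mce.h, the corrected error
count in bits 52:38 assumes a CMCI-capable CPU, and decode_status() is
a made-up helper name.

```python
# Flag bit positions follow the MCI_STATUS_* definitions in
# arch/x86/include/asm/mce.h.
FLAGS = [
    (63, "VAL"), (62, "OVER"), (61, "UC"), (60, "EN"),
    (59, "MISCV"), (58, "ADDRV"), (57, "PCC"),
]


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into the fields seen in the logs."""
    return {
        "flags": [name for bit, name in FLAGS if (status >> bit) & 1],
        "ce_count": (status >> 38) & 0x7FFF,  # corrected error count, bits 52:38
        "mscod": (status >> 16) & 0xFFFF,     # model-specific error code
        "mcacod": status & 0xFFFF,            # MCA error code
    }


bank5 = decode_status(0x8c00004000010090)
print(bank5["flags"], bank5["ce_count"],
      f"{bank5['mscod']:04x}:{bank5['mcacod']:04x}")
```

The mscod:mcacod pair corresponds to the err_code:0001:0090 field in the
EDAC line, and the count of 1 to the "1 CE" there.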
# echo 0x2 > error_type
# echo 1 > error_inject
bash: echo: write error: Invalid argument
[ 885.272000] [Firmware Warn]: APEI: Invalid action table, unknown instruction type: 5
ACPI_EINJ_FLUSH_CACHELINE??
Yeah, we're missing some functionality.
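The guess can be checked against the instruction values in
include/acpi/actbl1.h (enum acpi_einj_instructions). A small lookup
sketch; the function name instruction_name() is hypothetical:

```python
# Instruction values as enumerated in include/acpi/actbl1.h
# (enum acpi_einj_instructions); 6 and above are reserved.
EINJ_INSTRUCTIONS = {
    0: "ACPI_EINJ_READ_REGISTER",
    1: "ACPI_EINJ_READ_REGISTER_VALUE",
    2: "ACPI_EINJ_WRITE_REGISTER",
    3: "ACPI_EINJ_WRITE_REGISTER_VALUE",
    4: "ACPI_EINJ_NOOP",
    5: "ACPI_EINJ_FLUSH_CACHELINE",
}


def instruction_name(value):
    """Map an EINJ instruction number to its ACPICA name."""
    return EINJ_INSTRUCTIONS.get(value, "reserved")


print(instruction_name(5))
```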
# echo 0x10 > error_type
# echo 1 > error_inject
That went BOOM:
[ 1296.233435] Disabling lock debugging due to kernel taint
[ 1296.248010] mce: [Hardware Error]: CPU 6: Machine Check Exception: 5 Bank 5: be00000000010090
[ 1296.269245] mce: [Hardware Error]: RIP !INEXACT! 10:<ffffffff8136260f> {intel_idle+0xbf/0x130}
[ 1296.290735] mce: [Hardware Error]: TSC 37c1fb53beb ADDR bb68f400 MISC 20401a9a86
[ 1296.309772] mce: [Hardware Error]: PROCESSOR 0:206d7 TIME 1448299778 SOCKET 0 APIC c microcode 710
[ 1296.332058] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 1296.346094] EDAC sbridge MC0: CPU 6: Machine Check Exception: 5 Bank 5: be00000000010090
[ 1296.366517] EDAC sbridge MC0: TSC 37c1fb53beb
[ 1296.375974] EDAC sbridge MC0: ADDR bb68f400 EDAC sbridge MC0: MISC 20401a9a86
[ 1296.394493] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1448299778 SOCKET 0 APIC c
[ 1296.416153] EDAC MC0: 0 UE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68f offset:0x400 grain:32 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
...
Judging by the CPU numbers, it looks like node 0 got that error in the shared bank:
.... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7
.... node #0, CPUs: #32 #33 #34 #35 #36 #37 #38 #39
finishing with
[ 1299.907994] mce: [Hardware Error]: Machine check: Processor context corrupt
[ 1299.926783] Kernel panic - not syncing: Fatal machine check
[ 1299.959632] Kernel Offset: disabled
[ 1299.984254] Rebooting in 100 seconds..
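The "Processor context corrupt" verdict lines up with the PCC bit in
the Bank 5 status. A quick sketch comparing the flag bits of the
correctable and the uncorrectable status words from the two runs; bit
positions are the MCI_STATUS_* ones from arch/x86/include/asm/mce.h,
and new_flags() is a made-up helper:

```python
# Flag bits per the MCI_STATUS_* definitions in arch/x86/include/asm/mce.h.
FLAG_BITS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
    "MISCV": 59, "ADDRV": 58, "PCC": 57,
}


def new_flags(before, after):
    """Return the flags that are set in `after` but clear in `before`."""
    gained = after & ~before
    return [name for name, bit in FLAG_BITS.items() if (gained >> bit) & 1]


ce_status = 0x8c00004000010090  # Bank 5, Memory Correctable injection
ue_status = 0xbe00000000010090  # Bank 5, Memory Uncorrectable non-fatal injection
print(new_flags(ce_status, ue_status))
```

So compared to the CE case, the UE status additionally has UC, EN and
PCC set.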
Now with dont_log_ce set on all CPUs:
$ for i in $(seq 0 63); do echo 1 > /sys/devices/system/machinecheck/machinecheck$i/dont_log_ce; cat /sys/devices/system/machinecheck/machinecheck$i/dont_log_ce; done | uniq
1
# echo 0x8 > error_type
# echo 1 > error_inject
[ 318.263797] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ 318.277029] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 5: 8c00004000010090
[ 318.295631] EDAC sbridge MC0: TSC 0
[ 318.303143] EDAC sbridge MC0: ADDR bb68f000 EDAC sbridge MC0: MISC 2040262686
[ 318.320473] EDAC sbridge MC0: PROCESSOR 0:206d7 TIME 1448300397 SOCKET 0 APIC 0
[ 318.809112] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0xbb68f offset:0x0 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:1 rank:0)
This looks ok: the mce_notify_irq() line "mce: [Hardware Error]: Machine
check events logged" is missing, which is as expected, but the EDAC lines
are still there because we sent the error down the notifier chain.
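As a cross-check on the EDAC lines in all three runs, the page: and
offset: fields are consistent with the reported ADDR split at the page
boundary. A minimal sketch, assuming 4 KiB pages; split_addr() is a
made-up helper name:

```python
PAGE_SHIFT = 12  # assumes 4 KiB pages, as on x86


def split_addr(addr):
    """Return (page frame number, offset within the page) for an address."""
    return addr >> PAGE_SHIFT, addr & ((1 << PAGE_SHIFT) - 1)


# ADDR values copied from the three injections above.
for addr in (0xbb68ec00, 0xbb68f400, 0xbb68f000):
    page, offset = split_addr(addr)
    print(f"ADDR {addr:x} -> page:{page:#x} offset:{offset:#x}")
```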
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.