From: Chen Yucong <slaoub@gmail.com>
To: "Luck, Tony" <tony.luck@intel.com>
Cc: Borislav Petkov <bp@alien8.de>,
Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>,
"ak@linux.intel.com" <ak@linux.intel.com>,
"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v3 1/2] x86, mce, severity: extend the the mce_severity mechanism to handle UCNA/DEFERRED error
Date: Wed, 12 Nov 2014 09:03:32 +0800
Message-ID: <1415754212.12188.12.camel@debian>
In-Reply-To: <3908561D78D1C84285E8C5FCA982C28F329293DE@ORSMSX114.amr.corp.intel.com>
On Tue, 2014-11-11 at 18:44 +0000, Luck, Tony wrote:
> >> The bank 7 error reported as severity 0 because EN=0 ... so we took no action for it.
> >
> > How come EN is 0? Bank7 error reporting is not enabled? Why? Or the
> > error injection thing doesn't do it?
>
> The "EN" bit is poorly named, and not well documented. Here's a clip from the SDM:
>
> One of the bullets in 15.10.4.1 Machine-Check Exception Handler for Error Recovery:
>
> When the EN flag is zero but the VAL and UC flags are one in the
> IA32_MCi_STATUS register, the reported uncorrected error in this bank
> is not enabled. As uncorrected errors with the EN flag = 0 are not the
> source of machine check exceptions, the MCE handler should log and clear
> non-enabled errors when the S bit is set and should continue searching
> for enabled errors from the other IA32_MCi_STATUS registers. Note that
> when IA32_MCG_CAP [24] is 0, any uncorrected error condition (VAL =1
> and UC=1) including the one with the EN flag cleared are fatal and the
> handler must signal the operating system to reset the system. For the
> errors that do not generate machine check exceptions, the EN flag has
> no meaning. See Chapter 19: Table 19-15 to find the errors that do not
> generate machine check exceptions.
>
> Unfortunately the reference to chapter 19 is stale (that is now all about
> performance monitoring - I'll log a bug with the SDM editor to find the
> right reference and fix this).
>
> What this is trying to say is that the "EN" bit is to enable signaling
> of machine checks - so it only has meaning when checking banks from the
> machine check handler. Errors that are logged, but not signaled, or signaled
> as CMCI will have MCi_STATUS.EN=0
>
>
> >> The bank 3 error got past that hurdle, then through the next check (BIT(8) set
> >> indicates a cache error). It fell at the last check because ADDRV=0.
> >
> > I guess you could tweak the injection path to write in a default address
> > so that that check gets bypassed...
>
> I don't think this is an injection artifact. I think on this processor the mid-level-cache
> just isn't providing an address in this case. It doesn't help to make one up - our whole
> game plan is to offline a page with a UC error - and we must have an address to know
> which page to offline.
>
> Perhaps the severity table entries for UCNA and DEFERRED errors should look to see
> if ADDRV is set - if not, don't report this as UCNA/DEFERRED?
>
We can also find the following snippet from AMD APM Volume 2:
9.3.2 Error-Reporting Register Banks - MCi_STATUS
EN—Bit 60. When set to 1, this bit indicates that the error condition is
enabled in the corresponding error-reporting control register (MCi_CTL).
Errors disabled by MCi_CTL do not cause a `machine-check exception'.
As you said, the severity table entry for the "EN" check should be
skipped when called from the CMCI/poll handler, so it should match only
in exception context, as shown below:
	MCESEV(
		NO, "Not enabled",
		EXCP, BITCLR(MCI_STATUS_EN)
		),
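To make the effect of that entry concrete, here is a minimal user-space
sketch of a severity walk with such a context gate. It is not the kernel's
severity table: the sketch_* names, the SEV_* values and the simplified
"UC with address" rule (which folds in the ADDRV check suggested above) are
assumptions made for illustration only. Only the MCi_STATUS bit positions
are taken from the manuals.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* MCi_STATUS bit positions as documented in the SDM/APM. */
#define MCI_STATUS_VAL   (1ULL << 63)
#define MCI_STATUS_UC    (1ULL << 61)
#define MCI_STATUS_EN    (1ULL << 60)
#define MCI_STATUS_ADDRV (1ULL << 58)

/* Hypothetical names, used by this sketch only. */
enum sketch_excp { ANY_CALLER, EXCP_ONLY, NO_EXCP_ONLY };
enum sketch_sev  { SEV_NO, SEV_UCNA_DEFERRED, SEV_OTHER };

struct sketch_rule {
	enum sketch_sev sev;
	const char *msg;
	enum sketch_excp excp;	/* which caller the rule applies to */
	uint64_t mask;		/* status bits to look at */
	uint64_t result;	/* value those bits must have */
};

static const struct sketch_rule rules[] = {
	{ SEV_NO, "Invalid", ANY_CALLER, MCI_STATUS_VAL, 0 },
	/* EN only has meaning in the #MC handler, so gate on EXCP_ONLY. */
	{ SEV_NO, "Not enabled", EXCP_ONLY, MCI_STATUS_EN, 0 },
	/* Simplified assumption: UC plus a valid address, seen by CMCI/poll. */
	{ SEV_UCNA_DEFERRED, "UC with address", NO_EXCP_ONLY,
	  MCI_STATUS_UC | MCI_STATUS_ADDRV, MCI_STATUS_UC | MCI_STATUS_ADDRV },
	/* Catch-all: an empty mask matches every status value. */
	{ SEV_OTHER, "Other", ANY_CALLER, 0, 0 },
};

/* Return the first rule matching both the status bits and the caller. */
static const struct sketch_rule *sketch_severity(uint64_t status, bool in_excp)
{
	size_t n = sizeof(rules) / sizeof(rules[0]);

	/* The walk relies on the table ending with a catch-all rule. */
	assert(rules[n - 1].mask == 0 && rules[n - 1].excp == ANY_CALLER);

	for (size_t i = 0; i < n; i++) {
		const struct sketch_rule *r = &rules[i];

		if ((status & r->mask) != r->result)
			continue;
		if (r->excp == EXCP_ONLY && !in_excp)
			continue;
		if (r->excp == NO_EXCP_ONLY && in_excp)
			continue;
		return r;
	}
	return &rules[n - 1];
}
```

With this table, a polled status with VAL=1, UC=1, ADDRV=1 and EN=0 is no
longer stopped by "Not enabled" and reaches the UC rule, while the same
value seen from the exception handler still ends at "Not enabled". A status
with UC=1 but ADDRV=0 falls through to the catch-all, since there is no
address that could be used to offline a page.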
thx!
cyc