Re: [PATCH 0/3] Fix MCE handling for AMD multi-node processors

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
To: Borislav Petkov <bp@alien8.de>
Cc: <tglx@linutronix.de>, <mingo@redhat.com>, <hpa@zytor.com>,
	<tony.luck@intel.com>, <dougthompson@xmission.com>,
	<mchehab@osg.samsung.com>, <x86@kernel.org>,
	<linux-kernel@vger.kernel.org>, <linux-edac@vger.kernel.org>,
	<dave.hansen@linux.intel.com>, <mgorman@suse.de>, <bp@suse.de>,
	<riel@redhat.com>, <jacob.w.shin@gmail.com>
Subject: Re: [PATCH 0/3] Fix MCE handling for AMD multi-node processors
Date: Wed, 7 Jan 2015 18:18:58 -0600	[thread overview]
Message-ID: <54ADCCF2.9060402@amd.com> (raw)
In-Reply-To: <20150107170654.GG3984@pd.tnic>

On 1/7/2015 11:06 AM, Borislav Petkov wrote:
> On Tue, Jan 06, 2015 at 05:54:15PM -0600, Aravind Gopalakrishnan wrote:
>> But we still need to change the error injection interfaces in mce_amd_inj:
>> mce_amd_inj triggers a #MC on the cpu number that the user specifies on
>> debugfs.
>> For any error other than MC4 errors, this is fine.
>> But we should really be triggering #MC only on NBC for MC4 errors.
> Why?
>
> As you said yourself, the errors get reported on the NBC. Where they get
> *triggered* is a different story.

Apologies if I was not clear earlier. Let me try to address the issue again-
I shall be verbose for sake of clarity here..

The bank 4 MSRs are per-node and per-node MSR are shared between cores 
in a node.
So, technically, all cores of the same node have access to the MSR.
But, since D18F3x44[NBMstToMstCpuEn] is set, access is restricted to 
only the NBC.
And, BKDG states that-
reads of these MSRs from other cores return 0’s and writes are ignored.

Now, with mce_amd_inj interface as it is right now, we basically 
wrmsr_on_cpu()
to the MCx_[status|addr|misc] registers using the cpu value user 
specifies at /sys/kernel/debug/mce-inject/cpu.

For a bank4 error (assume a UC case here) to a non-NBC (say core 6 of 
first node in a multi-node platform),
mce_amd_inj will simply wrmsr_on_cpu(6,...).
Since writes are ignored, we basically don't populate any info on the 
MSRs and when you trigger_mce on cpu 6,
do_machine_check will try to read status MSR for cpu6 which causes RAZ 
and you basically would not see any output on dmesg.
(This is why I had originally thought we had dropped MCEs)

If the same error were to be introduced on a NBC (core 0 in the above 
example),
(i.e), user were to provide cpu number 0 on 
/sys/kernel/debug/mce-inject/cpu; then we would see output on dmesg.
This is because writes from cpu 0 to the MSR will go through.

This is the correction I have made in patch 3 where, for bank = 4, I 
find the NBC for the given cpu and write the MSRs using the nbc value.
(I still need to modify the patch to also trigger #MC on the NBC)

Also, just to clarify any terminology issues:
'reporting' of errors means: active notification of errors to software 
via machine check exceptions.
(as defined by BKDG in the section "Error Detection, Action, Logging, 
and Reporting".
It's section 2.13.1.3 on a F15h M0h-0fh BKDG rev 3.14 for me.
section number might vary for you depending on the document version you 
are referring to..)

> We do injection as it is described in "2.15.2 Error Injection and
> Simulation" in F15h BKDG, for example. Reporting of the thusly injected
> bank4 error goes to the NBC.
>
>

Just want to clarify some (potential) terminology issues here too:
"Error injection" is causing a DRAM error by writing to D18F3xB8 and 
D18F3xBC.
If a DRAM error were to be introduced by using above method, then HW 
should correctly 'report' the error to NBC.

"Error simulation" is basically what we are doing in mce_amd_inj.
But before we drive a #MC, we should honor the rules specified in the 
BKDG wrt writing of MSRs IMHO. (specifically for the bank=4 case)

Thanks,
-Aravind.

     prev parent reply	other threads:[~2015-01-08  0:35 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-12-22 20:10 [PATCH 0/3] Fix MCE handling for AMD multi-node processors Aravind Gopalakrishnan
2014-12-22 20:10 ` [PATCH 1/3] x86,amd: Refactor amd cpu topology functions for " Aravind Gopalakrishnan
2014-12-22 20:10 ` [PATCH 2/3] x86, mce: Handle AMD MCE on bank4 on NBC " Aravind Gopalakrishnan
2014-12-22 20:10 ` [PATCH 3/3] edac, mce_amd_inj: Inject errors only on NBC for bank 4 errors Aravind Gopalakrishnan
2014-12-22 20:15 ` [PATCH 0/3] Fix MCE handling for AMD multi-node processors Borislav Petkov
2014-12-22 20:56   ` Aravind Gopalakrishnan
2014-12-22 23:19     ` Borislav Petkov
2014-12-23 19:41       ` Aravind Gopalakrishnan
2015-01-06 23:54         ` Aravind Gopalakrishnan
2015-01-07 17:06           ` Borislav Petkov
2015-01-08  0:18             ` Aravind Gopalakrishnan [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=54ADCCF2.9060402@amd.com \
    --to=aravind.gopalakrishnan@amd.com \
    --cc=bp@alien8.de \
    --cc=bp@suse.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=dougthompson@xmission.com \
    --cc=hpa@zytor.com \
    --cc=jacob.w.shin@gmail.com \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mchehab@osg.samsung.com \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=riel@redhat.com \
    --cc=tglx@linutronix.de \
    --cc=tony.luck@intel.com \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox