Re: [PATCH 0/3] Fix MCE handling for AMD multi-node processors

From: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
To: Borislav Petkov <bp@alien8.de>
Cc: <tglx@linutronix.de>, <mingo@redhat.com>, <hpa@zytor.com>,
	<tony.luck@intel.com>, <dougthompson@xmission.com>,
	<mchehab@osg.samsung.com>, <x86@kernel.org>,
	<linux-kernel@vger.kernel.org>, <linux-edac@vger.kernel.org>,
	<dave.hansen@linux.intel.com>, <mgorman@suse.de>, <bp@suse.de>,
	<riel@redhat.com>, <jacob.w.shin@gmail.com>
Subject: Re: [PATCH 0/3] Fix MCE handling for AMD multi-node processors
Date: Wed, 7 Jan 2015 18:18:58 -0600
Message-ID: <54ADCCF2.9060402@amd.com>
In-Reply-To: <20150107170654.GG3984@pd.tnic>

On 1/7/2015 11:06 AM, Borislav Petkov wrote:
> On Tue, Jan 06, 2015 at 05:54:15PM -0600, Aravind Gopalakrishnan wrote:
>> But we still need to change the error injection interfaces in mce_amd_inj:
>> mce_amd_inj triggers a #MC on the cpu number that the user specifies on
>> debugfs.
>> For any error other than MC4 errors, this is fine.
>> But we should really be triggering #MC only on NBC for MC4 errors.
> Why?
>
> As you said yourself, the errors get reported on the NBC. Where they get
> *triggered* is a different story.

Apologies if I was not clear earlier. Let me try to explain the issue
again; I shall be verbose for the sake of clarity here.

The bank 4 MSRs are per-node, and per-node MSRs are shared between the
cores in a node.
So, technically, all cores of the same node have access to these MSRs.
But, since D18F3x44[NbMcaToMstCpuEn] is set, access is restricted to
only the NBC.
And the BKDG states that reads of these MSRs from other cores return
0's and writes are ignored.
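
To make the access rule concrete, here is a small userspace model of it.
This is only a sketch and not kernel code: the names (model_wrmsr_mc4_status,
model_rdmsr_mc4_status), the node and core counts, and the assumption that
the NBC is the lowest-numbered core of each node are all made up for
illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical topology, chosen only for illustration. */
#define CORES_PER_NODE 8
#define NUM_NODES      2

/* One shared bank 4 status value per node (model of a per-node MSR). */
static uint64_t mc4_status[NUM_NODES];

static unsigned int node_of(unsigned int cpu)
{
	return cpu / CORES_PER_NODE;
}

/* Assumption: the NBC is the lowest-numbered core of its node. */
static bool is_nbc(unsigned int cpu)
{
	return cpu % CORES_PER_NODE == 0;
}

/* A write from a non-NBC core is silently ignored. */
static void model_wrmsr_mc4_status(unsigned int cpu, uint64_t val)
{
	if (is_nbc(cpu))
		mc4_status[node_of(cpu)] = val;
}

/* A read from a non-NBC core returns zero. */
static uint64_t model_rdmsr_mc4_status(unsigned int cpu)
{
	return is_nbc(cpu) ? mc4_status[node_of(cpu)] : 0;
}
```

In this model, a write issued from core 6 leaves the node's status value
untouched, and a read from core 6 yields 0 even after core 0 has written
a non-zero value.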

Now, with the mce_amd_inj interface as it is right now, we basically
wrmsr_on_cpu() to the MCx_[status|addr|misc] registers using the cpu
value the user specifies at /sys/kernel/debug/mce-inject/cpu.

For a bank 4 error (assume a UC case here) injected on a non-NBC (say
core 6 of the first node in a multi-node platform),
mce_amd_inj will simply wrmsr_on_cpu(6,...).
Since writes are ignored, we don't populate any info in the MSRs, and
when you trigger_mce on cpu 6,
do_machine_check will try to read the status MSR on cpu 6, which reads
as zero (RAZ), and you would not see any output in dmesg.
(This is why I had originally thought we had dropped MCEs.)

If the same error were to be introduced on an NBC (core 0 in the above
example),
i.e., if the user were to provide cpu number 0 in
/sys/kernel/debug/mce-inject/cpu, then we would see output in dmesg.
This is because writes from cpu 0 to the MSRs will go through.

This is the correction I have made in patch 3 where, for bank = 4, I
find the NBC for the given cpu and write the MSRs using the NBC's cpu
number.
(I still need to modify the patch to also trigger #MC on the NBC.)
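
The redirection amounts to something like the sketch below. It is a
standalone illustration and not the code from patch 3: model_nbc_of() and
model_inject_target() are hypothetical names, and the NBC lookup assumes a
fixed number of cores per node with the NBC being the lowest-numbered core,
which the real code would have to derive from the cpu topology instead.

```c
/* Hypothetical core count per node, chosen only for illustration. */
#define CORES_PER_NODE 8

/* Assumption: the NBC is the lowest-numbered core of the cpu's node. */
static unsigned int model_nbc_of(unsigned int cpu)
{
	return cpu - (cpu % CORES_PER_NODE);
}

/*
 * Pick the cpu on which the MSR writes should be issued:
 * the NBC for bank 4, the user-supplied cpu for every other bank.
 */
static unsigned int model_inject_target(unsigned int cpu, unsigned int bank)
{
	return bank == 4 ? model_nbc_of(cpu) : cpu;
}
```

With these assumptions, a bank 4 request for cpu 6 is redirected to cpu 0,
while a request for any other bank stays on cpu 6.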

Also, just to clarify any terminology issues:
'reporting' of errors means active notification of errors to software
via machine check exceptions
(as defined by the BKDG in the section "Error Detection, Action,
Logging, and Reporting".
It's section 2.13.1.3 in the F15h M0h-0fh BKDG rev 3.14 for me.
The section number might vary for you depending on the document
version you are referring to.)

> We do injection as it is described in "2.15.2 Error Injection and
> Simulation" in F15h BKDG, for example. Reporting of the thusly injected
> bank4 error goes to the NBC.
>
>

Just want to clarify some (potential) terminology issues here too:
"Error injection" is causing a DRAM error by writing to D18F3xB8 and
D18F3xBC.
If a DRAM error were to be introduced using the above method, then HW
should correctly 'report' the error to the NBC.

"Error simulation" is basically what we are doing in mce_amd_inj.
But before we drive a #MC, we should honor the rules specified in the
BKDG wrt writing of MSRs IMHO (specifically for the bank = 4 case).

Thanks,
-Aravind.

Thread overview: 11+ messages
2014-12-22 20:10 [PATCH 0/3] Fix MCE handling for AMD multi-node processors Aravind Gopalakrishnan
2014-12-22 20:10 ` [PATCH 1/3] x86,amd: Refactor amd cpu topology functions for " Aravind Gopalakrishnan
2014-12-22 20:10 ` [PATCH 2/3] x86, mce: Handle AMD MCE on bank4 on NBC " Aravind Gopalakrishnan
2014-12-22 20:10 ` [PATCH 3/3] edac, mce_amd_inj: Inject errors only on NBC for bank 4 errors Aravind Gopalakrishnan
2014-12-22 20:15 ` [PATCH 0/3] Fix MCE handling for AMD multi-node processors Borislav Petkov
2014-12-22 20:56   ` Aravind Gopalakrishnan
2014-12-22 23:19     ` Borislav Petkov
2014-12-23 19:41       ` Aravind Gopalakrishnan
2015-01-06 23:54         ` Aravind Gopalakrishnan
2015-01-07 17:06           ` Borislav Petkov
2015-01-08  0:18             ` Aravind Gopalakrishnan [this message]
