From: Aravind Gopalakrishnan <aravind.gopalakrishnan@amd.com>
To: Borislav Petkov <bp@alien8.de>, <slaoub@gmail.com>
Cc: Tony Luck <tony.luck@intel.com>,
"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
LKML <linux-kernel@vger.kernel.org>
Subject: Re: Fwd: [PATCH] x86, MCE, AMD: save IA32_MCi_STATUS before machine_check_poll() resets it
Date: Wed, 8 Oct 2014 16:52:06 -0500
Message-ID: <5435B206.60402@amd.com>
In-Reply-To: <CAOjmkp9qQiTbqU3NUhUDAoQAa8wAPJnE_qXbDuBKrA3ee1_APQ@mail.gmail.com>
>
> Ok, this return is still bugging me - we're logging the error which
> caused the counter overflow but we go and explicitly clear _STATUS so
> that machine_check_poll doesn't pick up the same error again.
>
> Even though, machine_check_poll is intended to log the thresholding
> error.
>
> Which actually makes me think that that machine_check_poll is actually
> completely useless there. IOW, how about that instead:
>
> ---
> From: Chen Yucong <slaoub@gmail.com>
> Date: Thu, 2 Oct 2014 14:48:19 +0200
> Subject: [PATCH] x86, MCE, AMD: Correct thresholding error logging
>
> mce_setup() does not gather the content of IA32_MCG_STATUS, so it
> should be read explicitly. Moreover, we need to clear IA32_MCx_STATUS
> to avoid that mce_log() logs the processed threshold event again
> at next time.
>
> But we do the logging ourselves and machine_check_poll() is completely
> useless there. So kill it.
>
> Signed-off-by: Chen Yucong <slaoub@gmail.com>
> Signed-off-by: Borislav Petkov <bp@suse.de>
> ---
> arch/x86/kernel/cpu/mcheck/mce_amd.c | 30 +++++++++++++++---------------
> 1 file changed, 15 insertions(+), 15 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce_amd.c
> b/arch/x86/kernel/cpu/mcheck/mce_amd.c
> index 1c54d3d61a4d..9ce64955559d 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce_amd.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce_amd.c
> @@ -270,14 +270,13 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
> static void amd_threshold_interrupt(void)
> {
> u32 low = 0, high = 0, address = 0;
> + int cpu = smp_processor_id();
> unsigned int bank, block;
> struct mce m;
>
> - mce_setup(&m);
> -
> /* assume first bank caused it */
> for (bank = 0; bank < mca_cfg.banks; ++bank) {
> - if (!(per_cpu(bank_map, m.cpu) & (1 << bank)))
> + if (!(per_cpu(bank_map, cpu) & (1 << bank)))
> continue;
> for (block = 0; block < NR_BLOCKS; ++block) {
> if (block == 0) {
> @@ -309,20 +308,21 @@ static void amd_threshold_interrupt(void)
> * Log the machine check that caused the threshold
> * event.
> */
> - machine_check_poll(MCP_TIMESTAMP,
> - &__get_cpu_var(mce_poll_banks));
> -
> - if (high & MASK_OVERFLOW_HI) {
> - rdmsrl(address, m.misc);
> - rdmsrl(MSR_IA32_MCx_STATUS(bank), m.status);
> - m.bank = K8_MCE_THRESHOLD_BASE
> - + bank * NR_BLOCKS
> - + block;
> - mce_log(&m);
> - return;
> - }
> + if (high & MASK_OVERFLOW_HI)
> + goto log;
> }
> }
> + return;
> +
> +log:
> + mce_setup(&m);
> + rdmsrl(MSR_IA32_MCG_STATUS, m.mcgstatus);
> + rdmsrl(address, m.misc);
> + rdmsrl(MSR_IA32_MCx_STATUS(bank), m.status);
> + m.bank = K8_MCE_THRESHOLD_BASE + bank * NR_BLOCKS + block;
I don't understand why m.bank is assigned this value.
It only causes incorrect decoding:
[ 608.832916] DEBUG: raise_amd_threshold_event
[ 608.832926] [Hardware Error]: Corrected error, no action required.
[ 608.833143] [Hardware Error]: CPU:26 (15:2:0) MC165_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00000000000000
[ 608.833551] [Hardware Error]: MC165_ADDR: 0x0000000000000000
[ 608.833777] [Hardware Error]: cache level: RESV, tx: INSN
[ 608.834034] amd_inject module loaded ...
(Obviously, since amd_decode_mce() does a switch (m->bank) to decode the
status, and there is no bank 165.)
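For reference, a minimal sketch of where the 165 comes from. The constant
values below are assumptions recalled from the kernel headers of that era
(K8_MCE_THRESHOLD_BASE as MCE_EXTENDED_BANK + 1, NR_BLOCKS as 9), and both
helper names are hypothetical; please check them against your tree. They
are at least consistent with the log above: bank 4, block 0 gives 165.

```c
/* Assumed values, not taken from this thread; verify against your tree. */
#define MCE_EXTENDED_BANK      128
#define K8_MCE_THRESHOLD_BASE  (MCE_EXTENDED_BANK + 1)
#define NR_BLOCKS              9

/* Hypothetical helper: the software-defined number the patch stores in m.bank. */
unsigned int threshold_log_bank(unsigned int bank, unsigned int block)
{
	return K8_MCE_THRESHOLD_BASE + bank * NR_BLOCKS + block;
}

/* Hypothetical helper: recover the hardware bank from that number. */
unsigned int threshold_hw_bank(unsigned int log_bank)
{
	return (log_bank - K8_MCE_THRESHOLD_BASE) / NR_BLOCKS;
}
```

With those assumptions, threshold_log_bank(4, 0) is 165, which is the
"MC165" the decoder prints, and threshold_hw_bank(165) maps it back to 4.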
OTOH, with m.bank = bank;
we get the correct decoding info:
[ 58.021978] DEBUG: raise_amd_threshold_event
[ 58.021992] [Hardware Error]: Corrected error, no action required.
[ 58.022155] [Hardware Error]: CPU:0 (15:60:0) MC4_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00000000000000
[ 58.022393] [Hardware Error]: MC4_ADDR: 0x0000000000000000
[ 58.022531] [Hardware Error]: MC4 Error (node 0): DRAM ECC error detected on the NB.
<snip.. it throws a WARN, "Something is rotten in the state of Denmark",>
<.. but that's fine, we are just fake-injecting errors here.. :) >
[ 58.022933] [Hardware Error]: cache level: RESV, tx: INSN
[ 58.023084] amd_inject module loaded ...
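As a cross-check on the injected status value, here is a small sketch that
decodes 0x8c00000000000000 by hand. The bit positions follow the
architectural MCi_STATUS layout; the macros are redefined locally so the
snippet stands alone, and the two helper functions are hypothetical, not
kernel API.

```c
#include <stdint.h>

/* Architectural MCi_STATUS bits, redefined here for a standalone sketch. */
#define MCI_STATUS_VAL    (1ULL << 63)  /* valid error */
#define MCI_STATUS_OVER   (1ULL << 62)  /* previous error lost */
#define MCI_STATUS_UC     (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_EN     (1ULL << 60)  /* error reporting enabled */
#define MCI_STATUS_MISCV  (1ULL << 59)  /* MISC register valid */
#define MCI_STATUS_ADDRV  (1ULL << 58)  /* ADDR register valid */
#define MCI_STATUS_PCC    (1ULL << 57)  /* processor context corrupt */

/* Hypothetical helper: nonzero if every bit in flag is set in status. */
int status_has(uint64_t status, uint64_t flag)
{
	return (status & flag) == flag;
}

/* Hypothetical helper: a valid error that is not flagged uncorrected. */
int status_is_corrected(uint64_t status)
{
	return status_has(status, MCI_STATUS_VAL) &&
	       !status_has(status, MCI_STATUS_UC);
}
```

For 0x8c00000000000000 this gives Val, MiscV and AddrV set with UC clear,
i.e. a corrected error, matching the [-|CE|MiscV|-|AddrV|-|-] field in both
logs.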
Thanks,
-Aravind.
> + mce_log(&m);
> +
> + wrmsrl(MSR_IA32_MCx_STATUS(bank), 0);
> }
>
> /*
> --
> 2.0.0
>
> --
> Regards/Gruss,
> Boris.