public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Chen Yucong <slaoub@gmail.com>
To: "Luck, Tony" <tony.luck@intel.com>
Cc: "bp@alien8.de" <bp@alien8.de>,
	"ak@linux.intel.com" <ak@linux.intel.com>,
	"aravind.gopalakrishnan@amd.com" <aravind.gopalakrishnan@amd.com>,
	"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/2] x86, mce: support memory error recovery for both UCNA and Deferred error in machine_check_poll
Date: Tue, 04 Nov 2014 10:11:14 +0800	[thread overview]
Message-ID: <1415067074.24825.27.camel@debian> (raw)
In-Reply-To: <1414548964.20336.17.camel@debian>

On Wed, 2014-10-29 at 10:16 +0800, Chen Yucong wrote:
> On Mon, 2014-10-27 at 23:10 +0000, Luck, Tony wrote:
> > +	m->mcgstatus |= (MCG_STATUS_MCIP|MCG_STATUS_RIPV);
> > +	severity = mce_severity(m, mca_cfg.tolerant, NULL);
> > 
> > This seems a big hack to make mce_severity() work when called from
> > CMCI context (when MCG_STATUS register is not set).  It would also
> > be confusing as the subsequent logged entries would show MCIP and RIPV
> > bits set in the mcg_status.
> > 
> > If someone can think of a less hacky way to do this, that would be good. Otherwise
> > the code needs a comment, and should reset m->mcg_status to avoid making logs
> > that have incorrect data.
> > 
> Hi all,
> 
> At the suggestion of Tony, this patch add a comment, and restore m->mcgstatus to avoid
> making logs that have incorrect data.
> 

Hi Tony,

Do you have any more comments for the two patches?

thx!
cyc
> 
> From: Chen Yucong <slaoub@gmail.com>
> 
> Signed-off-by: Chen Yucong <slaoub@gmail.com>
> ---
>  arch/x86/kernel/cpu/mcheck/mce.c |   64 ++++++++++++++++++++++++++++++++++++--
>  1 file changed, 62 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index fdc422e..d285d26 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -575,6 +575,56 @@ static void mce_read_aux(struct mce *m, int i)
>  	}
>  }
>  
> +static bool mem_deferred_error(struct mce *m)
> +{
> +	int severity;
> +	u8 mcgs = m->mcgstatus & 0xff;
> +	struct cpuinfo_x86 *c = &boot_cpu_data;
> +
> +	/*
> +	 * mce_severity is specific to machine check exception, and it will
> +	 * check MCIP/EIPV/RIPV bits. In order to get pass the check, we need
> +	 * to set MCIP and RIPV.
> +	 */
> +	m->mcgstatus |= (MCG_STATUS_MCIP|MCG_STATUS_RIPV);
> +	severity = mce_severity(m, mca_cfg.tolerant, NULL);
> +
> +	/* restore the original value of m->mcgstatus */
> +	m->mcgstatus = (m->mcgstatus & ~0xff) | mcgs;
> +
> +	if (c->x86_vendor == X86_VENDOR_AMD) {
> +		/*
> +		 * AMD BKDGs - Machine Check Error Codes
> +		 *
> +		 * Bit 8 of ErrCode[15:0] of MCi_STATUS is used for indicating
> +		 * a memory-specific error. Note that this field encodes info-
> +		 * rmation about memory-hierarchy level involved in the error.
> +		 */
> +		if (severity == MCE_DEFERRED_SEVERITY)
> +			return  (m->status & 0xff00) == BIT(8);
> +	} else if (c->x86_vendor == X86_VENDOR_INTEL) {
> +		/*
> +		 * Intel SDM Volume 3B - 15.9.2 Compound Error Codes
> +		 *
> +		 * Bit 7 of the MCACOD field of IA32_MCi_STATUS is used for
> +		 * indicating a memory error. Bit 8 is used for indicating a
> +		 * cache hierarchy error. The combination of bit 2 and bit 3
> +		 * is used for indicating a `generic' cache hierarchy error
> +		 * But we can't just blindly check the above bits, because if
> +		 * bit 11 is set, then it is a bus/interconnect error - and
> +		 * either way the above bits just gives more detail on what
> +		 * bus/interconnect error happened. Note that bit 12 can be
> +		 * ignored, as it's the "filter" bit.
> +		 */
> +		if (severity == MCE_UCNA_SEVERITY)
> +			return (m->status & 0xef80) == BIT(7) ||
> +			       (m->status & 0xef00) == BIT(8) ||
> +			       (m->status & 0xeffc) == 0xc;
> +	}
> +
> +	return false;
> +}
> +
>  DEFINE_PER_CPU(unsigned, mce_poll_count);
>  
>  /*
> @@ -630,6 +680,16 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
>  
>  		if (!(flags & MCP_TIMESTAMP))
>  			m.tsc = 0;
> +
> +		/*
> +		 * In the cases where we don't have a valid address after all,
> +		 * do not add it into the ring buffer.
> +		 */
> +		if (mem_deferred_error(&m) && (m.status & MCI_STATUS_ADDRV)) {
> +			mce_ring_add(m.addr >> PAGE_SHIFT);
> +			mce_schedule_work();
> +		}
> +
>  		/*
>  		 * Don't get the IP here because it's unlikely to
>  		 * have anything to do with the actual error location.
> @@ -1098,8 +1158,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  		severity = mce_severity(&m, cfg->tolerant, NULL);
>  
>  		/*
> -		 * When machine check was for corrected handler don't touch,
> -		 * unless we're panicing.
> +		 * When machine check was for corrected/deferred handler don't
> +		 * touch, unless we're panicing.
>  		 */
>  		if ((severity == MCE_KEEP_SEVERITY ||
>  		     severity == MCE_UCNA_SEVERITY) && !no_way_out)



  reply	other threads:[~2014-11-04  2:11 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-10-27  0:56 [PATCH 0/2] RAS: add the support for handling UCNA/DEFERRED error Chen Yucong
2014-10-27  0:56 ` [PATCH 1/2] x86, mce, severity: extend the the mce_severity mechanism to handle " Chen Yucong
2014-10-27  0:56 ` [PATCH 2/2] x86, mce: support memory error recovery for both UCNA and Deferred error in machine_check_poll Chen Yucong
2014-10-27 23:10   ` Luck, Tony
2014-10-28  2:21     ` Chen Yucong
2014-10-29  2:16     ` Chen Yucong
2014-11-04  2:11       ` Chen Yucong [this message]
2014-11-04 11:38       ` Borislav Petkov
2014-10-27  9:36 ` [PATCH 0/2] RAS: add the support for handling UCNA/DEFERRED error Chen Yucong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1415067074.24825.27.camel@debian \
    --to=slaoub@gmail.com \
    --cc=ak@linux.intel.com \
    --cc=aravind.gopalakrishnan@amd.com \
    --cc=bp@alien8.de \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tony.luck@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox