From: Chen Yucong <slaoub@gmail.com>
To: "Luck, Tony" <tony.luck@intel.com>
Cc: "bp@alien8.de" <bp@alien8.de>,
	"ak@linux.intel.com" <ak@linux.intel.com>,
	"aravind.gopalakrishnan@amd.com" <aravind.gopalakrishnan@amd.com>,
	"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 2/2] x86, mce: support memory error recovery for both UCNA and Deferred error in machine_check_poll
Date: Tue, 04 Nov 2014 10:11:14 +0800
Message-ID: <1415067074.24825.27.camel@debian>
In-Reply-To: <1414548964.20336.17.camel@debian>

On Wed, 2014-10-29 at 10:16 +0800, Chen Yucong wrote:
> On Mon, 2014-10-27 at 23:10 +0000, Luck, Tony wrote:
> > +	m->mcgstatus |= (MCG_STATUS_MCIP|MCG_STATUS_RIPV);
> > +	severity = mce_severity(m, mca_cfg.tolerant, NULL);
> > 
> > This seems a big hack to make mce_severity() work when called from
> > CMCI context (when MCG_STATUS register is not set).  It would also
> > be confusing as the subsequent logged entries would show MCIP and RIPV
> > bits set in the mcg_status.
> > 
> > If someone can think of a less hacky way to do this, that would be good. Otherwise
> > the code needs a comment, and should reset m->mcg_status to avoid making logs
> > that have incorrect data.
> > 
> Hi all,
> 
> At Tony's suggestion, this patch adds a comment and restores m->mcgstatus to avoid
> logging incorrect data.
> 

Hi Tony,

Do you have any more comments for the two patches?

thx!
cyc
> 
> From: Chen Yucong <slaoub@gmail.com>
> 
> Signed-off-by: Chen Yucong <slaoub@gmail.com>
> ---
>  arch/x86/kernel/cpu/mcheck/mce.c |   64 ++++++++++++++++++++++++++++++++++++--
>  1 file changed, 62 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index fdc422e..d285d26 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -575,6 +575,56 @@ static void mce_read_aux(struct mce *m, int i)
>  	}
>  }
>  
> +static bool mem_deferred_error(struct mce *m)
> +{
> +	int severity;
> +	u8 mcgs = m->mcgstatus & 0xff;
> +	struct cpuinfo_x86 *c = &boot_cpu_data;
> +
> +	/*
> +	 * mce_severity() is specific to the machine check exception path
> +	 * and checks the MCIP/EIPV/RIPV bits. To get past that check, we
> +	 * need to set MCIP and RIPV.
> +	 */
> +	m->mcgstatus |= (MCG_STATUS_MCIP|MCG_STATUS_RIPV);
> +	severity = mce_severity(m, mca_cfg.tolerant, NULL);
> +
> +	/* restore the original value of m->mcgstatus */
> +	m->mcgstatus = (m->mcgstatus & ~0xff) | mcgs;
> +
> +	if (c->x86_vendor == X86_VENDOR_AMD) {
> +		/*
> +		 * AMD BKDGs - Machine Check Error Codes
> +		 *
> +		 * Bit 8 of ErrCode[15:0] of MCi_STATUS indicates a
> +		 * memory-specific error. Note that this field encodes
> +		 * information about the memory-hierarchy level involved.
> +		 */
> +		if (severity == MCE_DEFERRED_SEVERITY)
> +			return (m->status & 0xff00) == BIT(8);
> +	} else if (c->x86_vendor == X86_VENDOR_INTEL) {
> +		/*
> +		 * Intel SDM Volume 3B - 15.9.2 Compound Error Codes
> +		 *
> +		 * Bit 7 of the MCACOD field of IA32_MCi_STATUS is used for
> +		 * indicating a memory error. Bit 8 is used for indicating a
> +		 * cache hierarchy error. The combination of bit 2 and bit 3
> +		 * is used for indicating a `generic' cache hierarchy error.
> +		 * But we can't just blindly check the above bits, because if
> +		 * bit 11 is set, then it is a bus/interconnect error - and
> +		 * either way the above bits just give more detail on what
> +		 * bus/interconnect error happened. Note that bit 12 can be
> +		 * ignored, as it's the "filter" bit.
> +		 */
> +		if (severity == MCE_UCNA_SEVERITY)
> +			return (m->status & 0xef80) == BIT(7) ||
> +			       (m->status & 0xef00) == BIT(8) ||
> +			       (m->status & 0xeffc) == 0xc;
> +	}
> +
> +	return false;
> +}
> +
>  DEFINE_PER_CPU(unsigned, mce_poll_count);
>  
>  /*
> @@ -630,6 +680,16 @@ void machine_check_poll(enum mcp_flags flags, mce_banks_t *b)
>  
>  		if (!(flags & MCP_TIMESTAMP))
>  			m.tsc = 0;
> +
> +		/*
> +		 * In the cases where we don't have a valid address after all,
> +		 * do not add it into the ring buffer.
> +		 */
> +		if (mem_deferred_error(&m) && (m.status & MCI_STATUS_ADDRV)) {
> +			mce_ring_add(m.addr >> PAGE_SHIFT);
> +			mce_schedule_work();
> +		}
> +
>  		/*
>  		 * Don't get the IP here because it's unlikely to
>  		 * have anything to do with the actual error location.
> @@ -1098,8 +1158,8 @@ void do_machine_check(struct pt_regs *regs, long error_code)
>  		severity = mce_severity(&m, cfg->tolerant, NULL);
>  
>  		/*
> -		 * When machine check was for corrected handler don't touch,
> -		 * unless we're panicing.
> +		 * When machine check was for corrected/deferred handler don't
> +		 * touch, unless we're panicking.
>  		 */
>  		if ((severity == MCE_KEEP_SEVERITY ||
>  		     severity == MCE_UCNA_SEVERITY) && !no_way_out)

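The error-code masks in mem_deferred_error() can be exercised on their own in userspace. The following is only a sketch: the helper names are hypothetical and not part of the patch, the severity check is left out, and only the mask values are taken from the diff above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Helper names are hypothetical; only the mask values come from the patch. */

/*
 * Intel: true if the MCACOD field (status bits 15:0) encodes a memory
 * controller, cache hierarchy, or generic cache hierarchy error.
 * Bit 12 (the "filter" bit) is masked out; the remaining upper bits,
 * including bit 11 (bus/interconnect), must be clear.
 */
static bool intel_mcacod_is_memory(uint64_t status)
{
	return (status & 0xef80) == (UINT64_C(1) << 7) ||
	       (status & 0xef00) == (UINT64_C(1) << 8) ||
	       (status & 0xeffc) == 0xc;
}

/* AMD: true if ErrCode[15:8] is exactly 0x01 (memory-specific error). */
static bool amd_errcode_is_memory(uint64_t status)
{
	return (status & 0xff00) == (UINT64_C(1) << 8);
}

/* A few sample values showing how the masks behave. */
static void mcacod_examples(void)
{
	assert(intel_mcacod_is_memory(0x0090));  /* bit 7 set */
	assert(intel_mcacod_is_memory(0x1090));  /* same, filter bit set */
	assert(!intel_mcacod_is_memory(0x0890)); /* bit 11: bus/interconnect */
	assert(amd_errcode_is_memory(0x0135));   /* ErrCode[15:8] == 0x01 */
}
```

Calling mcacod_examples() from a small test program is enough to check the masks against sample status values.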

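The save/restore around mce_severity() discussed above can also be written by saving the whole field rather than its low byte. This is a sketch under stated assumptions, not the patch's code: struct mce_sketch and grade_sketch() are hypothetical stand-ins for struct mce and mce_severity(), and the bit positions are assumed to match the kernel's MCG_STATUS_RIPV (bit 0) and MCG_STATUS_MCIP (bit 2).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MCG_STATUS_RIPV (UINT64_C(1) << 0) /* restart IP valid */
#define MCG_STATUS_MCIP (UINT64_C(1) << 2) /* machine check in progress */

/* Hypothetical stand-in for the one struct mce field used here. */
struct mce_sketch {
	uint64_t mcgstatus;
};

/* Hypothetical stand-in for mce_severity(): 1 if MCIP and RIPV are both set. */
static int grade_sketch(const struct mce_sketch *m)
{
	return (m->mcgstatus & MCG_STATUS_MCIP) &&
	       (m->mcgstatus & MCG_STATUS_RIPV);
}

/*
 * Force MCIP and RIPV for the duration of the grading call, then put
 * the original value back so later log entries show what the hardware
 * actually reported.
 */
static int grade_with_forced_bits(struct mce_sketch *m)
{
	uint64_t saved;
	int sev;

	assert(m != NULL);
	saved = m->mcgstatus;
	m->mcgstatus |= MCG_STATUS_MCIP | MCG_STATUS_RIPV;
	sev = grade_sketch(m);
	m->mcgstatus = saved;
	return sev;
}
```

Saving the full 64-bit value avoids the mask arithmetic on restore; whether that is preferable to the byte-wide save in the patch is a matter of taste.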

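The gate added to machine_check_poll() queues a page only when the error is a memory error and the bank reported a valid address. A rough userspace sketch of that decision follows; the function name and PAGE_SHIFT_SKETCH are hypothetical, 4 KiB pages are assumed, and MCI_STATUS_ADDRV is assumed to be bit 58 as in the kernel headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MCI_STATUS_ADDRV  (UINT64_C(1) << 58) /* MCi_ADDR holds a valid address */
#define PAGE_SHIFT_SKETCH 12                  /* assumption: 4 KiB pages */

/*
 * Returns true and stores the page frame number in *pfn only when the
 * error was classified as a memory error and ADDRV is set. Otherwise
 * *pfn is left untouched and nothing should be queued for recovery.
 */
static bool pfn_for_recovery(uint64_t status, uint64_t addr,
			     bool is_mem_error, uint64_t *pfn)
{
	assert(pfn != NULL);
	if (!is_mem_error || !(status & MCI_STATUS_ADDRV))
		return false;
	*pfn = addr >> PAGE_SHIFT_SKETCH;
	return true;
}
```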