Re: [PATCH v2 1/2] x86/mce: Extend AMD severity grading function with new types of errors

From: Carlos Bilbao <carlos.bilbao@amd.com>
To: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: bp@alien8.de, tglx@linutronix.de, mingo@redhat.com,
	dave.hansen@linux.intel.com, x86@kernel.org,
	linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org,
	bilbao@vt.edu
Subject: Re: [PATCH v2 1/2] x86/mce: Extend AMD severity grading function with new types of errors
Date: Tue, 5 Apr 2022 12:24:31 -0500
Message-ID: <ec7dc7e7-5808-58f8-cbe9-d8fdd2de4c35@amd.com>
In-Reply-To: <Ykx59WvoWKi2y23x@yaz-ubuntu>

On 4/5/2022 12:18 PM, Yazen Ghannam wrote:
> On Thu, Mar 31, 2022 at 11:38:49AM -0500, Carlos Bilbao wrote:
>> The MCE handler needs to understand the severity of the machine errors to
>> act accordingly. In the case of AMD, very few errors are covered in the
>> grading logic.
>>
>> Extend the MCEs severity grading of AMD to cover new types of machine
>> errors.
>>
> 
> This patch does not add new types of machine errors. Please update the commit
> message (and cover letter) to be consistent with changes made between patch
> revisions.
>  

I'm thinking "cover error cases not previously considered".

>> Signed-off-by: Carlos Bilbao <carlos.bilbao@amd.com>
>> ---
>>  arch/x86/kernel/cpu/mce/severity.c | 104 ++++++++++-------------------
>>  1 file changed, 37 insertions(+), 67 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
>> index 1add86935349..4d52eef21230 100644
>> --- a/arch/x86/kernel/cpu/mce/severity.c
>> +++ b/arch/x86/kernel/cpu/mce/severity.c
>> @@ -301,85 +301,55 @@ static noinstr int error_context(struct mce *m, struct pt_regs *regs)
>>  	}
>>  }
>>  
>> -static __always_inline int mce_severity_amd_smca(struct mce *m, enum context err_ctx)
>> -{
>> -	u64 mcx_cfg;
>> -
>> -	/*
>> -	 * We need to look at the following bits:
>> -	 * - "succor" bit (data poisoning support), and
>> -	 * - TCC bit (Task Context Corrupt)
>> -	 * in MCi_STATUS to determine error severity.
>> -	 */
>> -	if (!mce_flags.succor)
>> -		return MCE_PANIC_SEVERITY;
>> -
>> -	mcx_cfg = mce_rdmsrl(MSR_AMD64_SMCA_MCx_CONFIG(m->bank));
>> -
>> -	/* TCC (Task context corrupt). If set and if IN_KERNEL, panic. */
>> -	if ((mcx_cfg & MCI_CONFIG_MCAX) &&
>> -	    (m->status & MCI_STATUS_TCC) &&
>> -	    (err_ctx == IN_KERNEL))
>> -		return MCE_PANIC_SEVERITY;
>> -
>> -	 /* ...otherwise invoke hwpoison handler. */
>> -	return MCE_AR_SEVERITY;
>> -}
>> -
>>  /*
>> - * See AMD Error Scope Hierarchy table in a newer BKDG. For example
>> - * 49125_15h_Models_30h-3Fh_BKDG.pdf, section "RAS Features"
>> + * See AMD PPR(s) section 3.1 Machine Check Architecture
> 
> I don't know that section numbers will be consistent between different PPR
> versions, so having the section name is a good idea. The "Machine Check Error
> Handling" section is what the severity grading function is based on.
> 

Ack

>>   */
>>  static noinstr int mce_severity_amd(struct mce *m, struct pt_regs *regs, char **msg, bool is_excp)
>>  {
>> -	enum context ctx = error_context(m, regs);
>> +	int ret;
>> +
>> +	/*
>> +	 * Default return value: Action required, the error must be handled
>> +	 * immediately.
>> +	 */
>> +	ret = MCE_AR_SEVERITY;
>>  
>>  	/* Processor Context Corrupt, no need to fumble too much, die! */
>> -	if (m->status & MCI_STATUS_PCC)
>> -		return MCE_PANIC_SEVERITY;
>> +	if (m->status & MCI_STATUS_PCC) {
>> +		ret = MCE_PANIC_SEVERITY;
>> +		goto amd_severity;
>> +	}
>>  
>> -	if (m->status & MCI_STATUS_UC) {
>> +	/*
>> +	 * Evaluate the severity of deferred errors for AMD systems, for which only
>> +	 * scrub error is interesting to notify an action requirement. The poll
>> +	 * handler catches deferred errors and adds to mce_ring so memory-failure
>> +	 * can take recovery actions.
>> +	 */
> 
> I think this whole comment can be dropped. The "scrub error" part is not
> correct. The polling function may find deferred errors, but they are most
> likely to be seen by the deferred error interrupt handler on modern AMD
> systems. The "mce_ring" was removed a long time ago (in v4.3).
> 

Ack

>> +	if (m->status & MCI_STATUS_DEFERRED) {
>> +		ret = MCE_DEFERRED_SEVERITY;
>> +		goto amd_severity;
>> +	}
>>  
>> -		if (ctx == IN_KERNEL)
>> -			return MCE_PANIC_SEVERITY;
>> +	/* If the UC bit is not set, the error has been corrected */
> 
> This comment is not true. Deferred errors are an example of an uncorrectable
> error where UC is not set.
> 

Ack

>> +	if (!(m->status & MCI_STATUS_UC)) {
>> +		ret = MCE_KEEP_SEVERITY;
>> +		goto amd_severity;
>> +	}
>>  
>> -		/*
>> -		 * On older systems where overflow_recov flag is not present, we
>> -		 * should simply panic if an error overflow occurs. If
>> -		 * overflow_recov flag is present and set, then software can try
>> -		 * to at least kill process to prolong system operation.
>> -		 */
>> -		if (mce_flags.overflow_recov) {
>> -			if (mce_flags.smca)
>> -				return mce_severity_amd_smca(m, ctx);
>> -
>> -			/* kill current process */
>> -			return MCE_AR_SEVERITY;
>> -		} else {
>> -			/* at least one error was not logged */
>> -			if (m->status & MCI_STATUS_OVER)
>> -				return MCE_PANIC_SEVERITY;
>> -		}
>> -
>> -		/*
>> -		 * For any other case, return MCE_UC_SEVERITY so that we log the
>> -		 * error and exit #MC handler.
>> -		 */
>> -		return MCE_UC_SEVERITY;
>> +	if (((m->status & MCI_STATUS_OVER) && !mce_flags.overflow_recov)
>> +	     || !mce_flags.succor) {
> 
> I appreciate merging two cases that have the same result. But I feel
> keeping them separate may be easier to follow. They can also each have their
> own code comments. Or keep them together and explain each within the same
> comment block.
> 

I will divide these two cases.
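
A minimal standalone sketch of that split, not the actual follow-up patch. The enum, the `flags_sketch` struct and the `grade_panic_cases()` helper are hypothetical stand-ins so the snippet compiles outside the kernel; the bit position for MCI_STATUS_OVER is assumed to match asm/mce.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for the kernel definition (assumed bit position). */
#define MCI_STATUS_OVER	(1ULL << 62)	/* error overflow */

/* Stand-in severities; the kernel enum has more members. */
enum { MCE_AR_SEVERITY, MCE_PANIC_SEVERITY };

/* Hypothetical mirror of the two mce_flags bits used below. */
struct flags_sketch {
	bool overflow_recov;	/* overflow recovery supported */
	bool succor;		/* data poisoning supported */
};

/* Hypothetical helper: grades only the two panic cases discussed above. */
static int grade_panic_cases(uint64_t status, const struct flags_sketch *f)
{
	/*
	 * At least one error was not logged and the system cannot
	 * recover from overflow: panic.
	 */
	if ((status & MCI_STATUS_OVER) && !f->overflow_recov)
		return MCE_PANIC_SEVERITY;

	/* No data poisoning support (succor): panic. */
	if (!f->succor)
		return MCE_PANIC_SEVERITY;

	/* Otherwise keep the default: action required. */
	return MCE_AR_SEVERITY;
}
```

Each condition now has its own comment, and neither spans a line break.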

> Also, there's a checkpatch "CHECK" here. You'll see it when using the
> "--strict" flag with checkpatch.
> 
>> +		ret = MCE_PANIC_SEVERITY;
>> +		goto amd_severity;
>>  	}
>>  
>> -	/*
>> -	 * deferred error: poll handler catches these and adds to mce_ring so
>> -	 * memory-failure can take recovery actions.
>> -	 */
>> -	if (m->status & MCI_STATUS_DEFERRED)
>> -		return MCE_DEFERRED_SEVERITY;
>> +	if (error_context(m, regs) == IN_KERNEL) {
>> +		ret = MCE_PANIC_SEVERITY;
>> +	}
> 
> Braces aren't needed here. The previous comment about braces was for when
> there's a block of "if/else-if/else" statements. A single "if" statement with
> a single line doesn't need braces.
> 

Ack

>>  
>> -	/*
>> -	 * corrected error: poll handler catches these and passes responsibility
>> -	 * of decoding the error to EDAC
>> -	 */
>> -	return MCE_KEEP_SEVERITY;
>> +amd_severity:
> 
> This label doesn't look right to me. Maybe I'm too used to seeing "out" and
> "err" labels.
> 
> Please see "Documentation/process/coding-style.rst" section (7) "Centralized
> exiting of functions".
> 
> Maybe something like "out_ret_severity" to indicate the code is going to exit
> and return the severity. Or maybe just use "out"? Maybe others have thoughts
> on this.
> 

"out_amd_severity" sounds good to me.
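
For illustration, a standalone sketch of the single-exit shape with that label. It is not the kernel function: `grade_sketch()` and its `in_kernel` parameter are hypothetical, the enum is a local stand-in, the bit positions are assumed to match asm/mce.h, and the overflow and succor checks are left out to keep it short.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the kernel definitions (assumed bit positions). */
#define MCI_STATUS_UC		(1ULL << 61)	/* uncorrected error */
#define MCI_STATUS_PCC		(1ULL << 57)	/* processor context corrupt */
#define MCI_STATUS_DEFERRED	(1ULL << 44)	/* deferred error (AMD) */

/* Stand-in severities; the kernel enum has more members. */
enum {
	MCE_DEFERRED_SEVERITY,
	MCE_KEEP_SEVERITY,
	MCE_AR_SEVERITY,
	MCE_PANIC_SEVERITY,
};

/* Hypothetical helper; in_kernel replaces the error_context() call. */
static int grade_sketch(uint64_t status, bool in_kernel)
{
	/* Default: action required. */
	int ret = MCE_AR_SEVERITY;

	if (status & MCI_STATUS_PCC) {
		ret = MCE_PANIC_SEVERITY;
		goto out_amd_severity;
	}

	/* Checked before UC: deferred errors do not have UC set. */
	if (status & MCI_STATUS_DEFERRED) {
		ret = MCE_DEFERRED_SEVERITY;
		goto out_amd_severity;
	}

	if (!(status & MCI_STATUS_UC)) {
		ret = MCE_KEEP_SEVERITY;
		goto out_amd_severity;
	}

	if (in_kernel)
		ret = MCE_PANIC_SEVERITY;

out_amd_severity:
	/* Single exit point; anything common to all paths would go here. */
	return ret;
}
```

The label name says what happens there (leave with the AMD severity), in line with the "Centralized exiting of functions" guidance.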

> Thanks,
> Yazen

Will send an updated patchset.

Thanks,
Carlos
