public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/3] Properly check for usable addresses on AMD
@ 2023-06-13 14:11 Yazen Ghannam
  2023-06-13 14:11 ` [PATCH 1/3] x86/MCE/AMD: Split amd_mce_is_memory_error() Yazen Ghannam
                   ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: Yazen Ghannam @ 2023-06-13 14:11 UTC (permalink / raw)
  To: linux-edac
  Cc: linux-kernel, tony.luck, x86, muralidhara.mk, joao.m.martins,
	william.roche, boris.ostrovsky, john.allen, xueshuai, baolin.wang,
	Yazen Ghannam

Hi all,

This set adds proper checks for usable addresses on AMD systems.

Patch 1 creates helper functions for memory error checks that will be
used in the second patch.

Patch 2 adds the proper usable address checks.

Patch 3 restructures the current usable address function to call out to
vendor-specific helpers.

I don't think these need 'stable' backports, since there isn't an urgent
issue to be fixed. But I can include 'stable' if there's interest.

Thanks,
Yazen

Yazen Ghannam (3):
  x86/MCE/AMD: Split amd_mce_is_memory_error()
  x86/mce: Define amd_mce_usable_address()
  x86/mce: Fixup mce_usable_address()

 arch/x86/include/asm/mce.h         |  2 +-
 arch/x86/kernel/cpu/mce/amd.c      | 68 +++++++++++++++++++++++++++---
 arch/x86/kernel/cpu/mce/core.c     | 32 +++++---------
 arch/x86/kernel/cpu/mce/intel.c    | 20 +++++++++
 arch/x86/kernel/cpu/mce/internal.h |  4 ++
 5 files changed, 99 insertions(+), 27 deletions(-)

-- 
2.34.1


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/3] x86/MCE/AMD: Split amd_mce_is_memory_error()
  2023-06-13 14:11 [PATCH 0/3] Properly check for usable addresses on AMD Yazen Ghannam
@ 2023-06-13 14:11 ` Yazen Ghannam
  2023-06-14  2:06   ` Shuai Xue
  2023-10-16 13:48   ` [tip: ras/core] " tip-bot2 for Yazen Ghannam
  2023-06-13 14:11 ` [PATCH 2/3] x86/mce: Define amd_mce_usable_address() Yazen Ghannam
  2023-06-13 14:11 ` [PATCH 3/3] x86/mce: Fixup mce_usable_address() Yazen Ghannam
  2 siblings, 2 replies; 17+ messages in thread
From: Yazen Ghannam @ 2023-06-13 14:11 UTC (permalink / raw)
  To: linux-edac
  Cc: linux-kernel, tony.luck, x86, muralidhara.mk, joao.m.martins,
	william.roche, boris.ostrovsky, john.allen, xueshuai, baolin.wang,
	Yazen Ghannam

Define helper functions for legacy and SMCA systems in order to reuse
individual checks in later changes.

Describe what each function is checking for, and correct the XEC bitmask
for SMCA.

No functional change intended.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
 arch/x86/kernel/cpu/mce/amd.c | 30 +++++++++++++++++++++++++-----
 1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 5e74610b39e7..1ccfb0c9257f 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -713,17 +713,37 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
 		deferred_error_interrupt_enable(c);
 }
 
-bool amd_mce_is_memory_error(struct mce *m)
+/*
+ * DRAM ECC errors are reported in the Northbridge (bank 4) with
+ * Extended Error Code 8.
+ */
+static bool legacy_mce_is_memory_error(struct mce *m)
+{
+	return m->bank == 4 && XEC(m->status, 0x1f) == 8;
+}
+
+/*
+ * DRAM ECC errors are reported in Unified Memory Controllers with
+ * Extended Error Code 0.
+ */
+static bool smca_mce_is_memory_error(struct mce *m)
 {
 	enum smca_bank_types bank_type;
-	/* ErrCodeExt[20:16] */
-	u8 xec = (m->status >> 16) & 0x1f;
+
+	if (XEC(m->status, 0x3f))
+		return false;
 
 	bank_type = smca_get_bank_type(m->extcpu, m->bank);
+
+	return bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2;
+}
+
+bool amd_mce_is_memory_error(struct mce *m)
+{
 	if (mce_flags.smca)
-		return (bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2) && xec == 0x0;
+		return smca_mce_is_memory_error(m);
 
-	return m->bank == 4 && xec == 0x8;
+	return legacy_mce_is_memory_error(m);
 }
 
 static void __log_error(unsigned int bank, u64 status, u64 addr, u64 misc)
-- 
2.34.1



* [PATCH 2/3] x86/mce: Define amd_mce_usable_address()
  2023-06-13 14:11 [PATCH 0/3] Properly check for usable addresses on AMD Yazen Ghannam
  2023-06-13 14:11 ` [PATCH 1/3] x86/MCE/AMD: Split amd_mce_is_memory_error() Yazen Ghannam
@ 2023-06-13 14:11 ` Yazen Ghannam
  2023-06-14  2:19   ` Shuai Xue
  2023-10-16 13:48   ` [tip: ras/core] " tip-bot2 for Yazen Ghannam
  2023-06-13 14:11 ` [PATCH 3/3] x86/mce: Fixup mce_usable_address() Yazen Ghannam
  2 siblings, 2 replies; 17+ messages in thread
From: Yazen Ghannam @ 2023-06-13 14:11 UTC (permalink / raw)
  To: linux-edac
  Cc: linux-kernel, tony.luck, x86, muralidhara.mk, joao.m.martins,
	william.roche, boris.ostrovsky, john.allen, xueshuai, baolin.wang,
	Yazen Ghannam

Currently, all valid MCA_ADDR values are assumed to be usable on AMD
systems. However, this is not correct in most cases. Notifiers expecting
usable addresses may then operate on inappropriate values.

Define a helper function to do AMD-specific checks for a usable memory
address. List out all known cases.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
 arch/x86/kernel/cpu/mce/amd.c      | 38 ++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/mce/core.c     |  3 +++
 arch/x86/kernel/cpu/mce/internal.h |  2 ++
 3 files changed, 43 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index 1ccfb0c9257f..ca79fa10b844 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -746,6 +746,44 @@ bool amd_mce_is_memory_error(struct mce *m)
 	return legacy_mce_is_memory_error(m);
 }
 
+/*
+ * AMD systems do not have an explicit indicator that the value in MCA_ADDR is
+ * a system physical address. Therefore individual cases need to be detected.
+ * Future cases and checks will be added as needed.
+ *
+ * 1) General case
+ *	a) Assume address is not usable.
+ * 2) "Poison" errors
+ *	a) Indicated by MCA_STATUS[43]: POISON. Defined for all banks except legacy
+ *	   Northbridge (bank 4).
+ *	b) Refers to poison consumption in the Core. Does not include "no action",
+ *	   "action optional", or "deferred" error severities.
+ *	c) Will include a usable address so that immediate action can be taken.
+ * 3) Northbridge DRAM ECC errors
+ *	a) Reported in legacy bank 4 with XEC 8.
+ *	b) MCA_STATUS[43] is *not* defined as POISON in legacy bank 4. Therefore,
+ *	   this bit should not be checked.
+ *
+ * NOTE: SMCA UMC memory errors fall into case #1.
+ */
+bool amd_mce_usable_address(struct mce *m)
+{
+	/* Check special Northbridge case first. */
+	if (!mce_flags.smca) {
+		if (legacy_mce_is_memory_error(m))
+			return true;
+		else if (m->bank == 4)
+			return false;
+	}
+
+	/* Check Poison bit for all other bank types. */
+	if (m->status & MCI_STATUS_POISON)
+		return true;
+
+	/* Assume address is not usable for all others. */
+	return false;
+}
+
 static void __log_error(unsigned int bank, u64 status, u64 addr, u64 misc)
 {
 	struct mce m;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 89e2aab5d34d..859ce20dd730 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -464,6 +464,9 @@ int mce_usable_address(struct mce *m)
 	if (!(m->status & MCI_STATUS_ADDRV))
 		return 0;
 
+	if (m->cpuvendor == X86_VENDOR_AMD)
+		return amd_mce_usable_address(m);
+
 	/* Checks after this one are Intel/Zhaoxin-specific: */
 	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
 	    boot_cpu_data.x86_vendor != X86_VENDOR_ZHAOXIN)
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index d2412ce2d312..0d4c5b83ed93 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -207,6 +207,7 @@ extern bool filter_mce(struct mce *m);
 
 #ifdef CONFIG_X86_MCE_AMD
 extern bool amd_filter_mce(struct mce *m);
+bool amd_mce_usable_address(struct mce *m);
 
 /*
  * If MCA_CONFIG[McaLsbInStatusSupported] is set, extract ErrAddr in bits
@@ -234,6 +235,7 @@ static __always_inline void smca_extract_err_addr(struct mce *m)
 
 #else
 static inline bool amd_filter_mce(struct mce *m) { return false; }
+static inline bool amd_mce_usable_address(struct mce *m) { return false; }
 static inline void smca_extract_err_addr(struct mce *m) { }
 #endif
 
-- 
2.34.1
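
The three cases listed in the comment above can be sketched in userspace
as follows. This is only an illustration under stated assumptions: struct
mce_sketch and usable_address_sketch() are hypothetical names, the SMCA
flag is passed as a parameter instead of being read from mce_flags, and
the POISON bit position is taken from the "MCA_STATUS[43]" wording in the
comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed shape of the kernel's XEC(): extended error code starts at bit 16. */
#define XEC(status, mask)	(((status) >> 16) & (mask))
/* MCA_STATUS[43], per the comment in the patch. */
#define STATUS_POISON		(1ULL << 43)

/* Hypothetical stand-in for the struct mce fields used below. */
struct mce_sketch {
	uint64_t status;
	uint8_t bank;
};

/* Mirrors the order of checks in the patch; 'smca' replaces mce_flags.smca. */
static bool usable_address_sketch(const struct mce_sketch *m, bool smca)
{
	/* Case 3: legacy Northbridge (bank 4) is decided by XEC alone. */
	if (!smca) {
		if (m->bank == 4 && XEC(m->status, 0x1f) == 8)
			return true;
		else if (m->bank == 4)
			return false;
	}

	/* Case 2: poison consumption carries a usable address. */
	if (m->status & STATUS_POISON)
		return true;

	/* Case 1: everything else is assumed not usable. */
	return false;
}
```

Note that bit 43 is never consulted for legacy bank 4, and that an SMCA
UMC error without the poison bit falls through to the final return false.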



* [PATCH 3/3] x86/mce: Fixup mce_usable_address()
  2023-06-13 14:11 [PATCH 0/3] Properly check for usable addresses on AMD Yazen Ghannam
  2023-06-13 14:11 ` [PATCH 1/3] x86/MCE/AMD: Split amd_mce_is_memory_error() Yazen Ghannam
  2023-06-13 14:11 ` [PATCH 2/3] x86/mce: Define amd_mce_usable_address() Yazen Ghannam
@ 2023-06-13 14:11 ` Yazen Ghannam
  2023-10-16 13:48   ` [tip: ras/core] x86/mce: Cleanup mce_usable_address() tip-bot2 for Yazen Ghannam
  2 siblings, 1 reply; 17+ messages in thread
From: Yazen Ghannam @ 2023-06-13 14:11 UTC (permalink / raw)
  To: linux-edac
  Cc: linux-kernel, tony.luck, x86, muralidhara.mk, joao.m.martins,
	william.roche, boris.ostrovsky, john.allen, xueshuai, baolin.wang,
	Yazen Ghannam

Move Intel-specific checks into a helper function.

Explicitly use "bool" for return type.

No functional change intended.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
 arch/x86/include/asm/mce.h         |  2 +-
 arch/x86/kernel/cpu/mce/core.c     | 33 +++++++++---------------------
 arch/x86/kernel/cpu/mce/intel.c    | 20 ++++++++++++++++++
 arch/x86/kernel/cpu/mce/internal.h |  2 ++
 4 files changed, 33 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 180b1cbfcc4e..6de6e1d95952 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -245,7 +245,7 @@ static inline void cmci_recheck(void) {}
 int mce_available(struct cpuinfo_x86 *c);
 bool mce_is_memory_error(struct mce *m);
 bool mce_is_correctable(struct mce *m);
-int mce_usable_address(struct mce *m);
+bool mce_usable_address(struct mce *m);
 
 DECLARE_PER_CPU(unsigned, mce_exception_count);
 DECLARE_PER_CPU(unsigned, mce_poll_count);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 859ce20dd730..c17e2b54853b 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -453,35 +453,22 @@ static void mce_irq_work_cb(struct irq_work *entry)
 	mce_schedule_work();
 }
 
-/*
- * Check if the address reported by the CPU is in a format we can parse.
- * It would be possible to add code for most other cases, but all would
- * be somewhat complicated (e.g. segment offset would require an instruction
- * parser). So only support physical addresses up to page granularity for now.
- */
-int mce_usable_address(struct mce *m)
+bool mce_usable_address(struct mce *m)
 {
 	if (!(m->status & MCI_STATUS_ADDRV))
-		return 0;
+		return false;
 
-	if (m->cpuvendor == X86_VENDOR_AMD)
+	switch (m->cpuvendor) {
+	case X86_VENDOR_AMD:
 		return amd_mce_usable_address(m);
 
-	/* Checks after this one are Intel/Zhaoxin-specific: */
-	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
-	    boot_cpu_data.x86_vendor != X86_VENDOR_ZHAOXIN)
-		return 1;
-
-	if (!(m->status & MCI_STATUS_MISCV))
-		return 0;
-
-	if (MCI_MISC_ADDR_LSB(m->misc) > PAGE_SHIFT)
-		return 0;
-
-	if (MCI_MISC_ADDR_MODE(m->misc) != MCI_MISC_ADDR_PHYS)
-		return 0;
+	case X86_VENDOR_INTEL:
+	case X86_VENDOR_ZHAOXIN:
+		return intel_mce_usable_address(m);
 
-	return 1;
+	default:
+		return true;
+	}
 }
 EXPORT_SYMBOL_GPL(mce_usable_address);
 
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index 95275a5e57e0..56ecf128a534 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -519,3 +519,23 @@ bool intel_filter_mce(struct mce *m)
 
 	return false;
 }
+
+/*
+ * Check if the address reported by the CPU is in a format we can parse.
+ * It would be possible to add code for most other cases, but all would
+ * be somewhat complicated (e.g. segment offset would require an instruction
+ * parser). So only support physical addresses up to page granularity for now.
+ */
+bool intel_mce_usable_address(struct mce *m)
+{
+	if (!(m->status & MCI_STATUS_MISCV))
+		return false;
+
+	if (MCI_MISC_ADDR_LSB(m->misc) > PAGE_SHIFT)
+		return false;
+
+	if (MCI_MISC_ADDR_MODE(m->misc) != MCI_MISC_ADDR_PHYS)
+		return false;
+
+	return true;
+}
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 0d4c5b83ed93..962b3134991d 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -49,6 +49,7 @@ void intel_init_cmci(void);
 void intel_init_lmce(void);
 void intel_clear_lmce(void);
 bool intel_filter_mce(struct mce *m);
+bool intel_mce_usable_address(struct mce *m);
 #else
 # define cmci_intel_adjust_timer mce_adjust_timer_default
 static inline bool mce_intel_cmci_poll(void) { return false; }
@@ -58,6 +59,7 @@ static inline void intel_init_cmci(void) { }
 static inline void intel_init_lmce(void) { }
 static inline void intel_clear_lmce(void) { }
 static inline bool intel_filter_mce(struct mce *m) { return false; }
+static inline bool intel_mce_usable_address(struct mce *m) { return false; }
 #endif
 
 void mce_timer_kick(unsigned long interval);
-- 
2.34.1
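
A userspace sketch of the dispatch that results from this patch, for
readers who want to exercise the logic outside the kernel. All names
ending in _sketch or _stub are hypothetical. The bit positions and the
MISC field layout are restated as I understand them from the kernel's
asm/mce.h and should be treated as assumptions, as should the 4 KiB page
size. The AMD branch is reduced to a poison-bit stub.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit layouts; check asm/mce.h before relying on them. */
#define STATUS_ADDRV		(1ULL << 58)
#define STATUS_MISCV		(1ULL << 59)
#define STATUS_POISON		(1ULL << 43)
#define MISC_ADDR_LSB(misc)	((misc) & 0x3f)
#define MISC_ADDR_MODE(misc)	(((misc) >> 6) & 7)
#define MISC_ADDR_PHYS		2
#define PAGE_SHIFT_SKETCH	12	/* assumes 4 KiB pages */

enum vendor_sketch { VENDOR_INTEL, VENDOR_ZHAOXIN, VENDOR_AMD, VENDOR_OTHER };

/* Hypothetical stand-in for the struct mce fields used below. */
struct mce_sketch {
	uint64_t status;
	uint64_t misc;
	enum vendor_sketch vendor;
};

/* Same three checks as intel_mce_usable_address() in the patch. */
static bool intel_usable_sketch(const struct mce_sketch *m)
{
	if (!(m->status & STATUS_MISCV))
		return false;
	if (MISC_ADDR_LSB(m->misc) > PAGE_SHIFT_SKETCH)
		return false;
	if (MISC_ADDR_MODE(m->misc) != MISC_ADDR_PHYS)
		return false;
	return true;
}

/* Stub only: the real AMD helper also handles legacy bank 4. */
static bool amd_usable_stub(const struct mce_sketch *m)
{
	return (m->status & STATUS_POISON) != 0;
}

/* ADDRV is checked once, then the vendor picks the helper. */
static bool usable_address_sketch(const struct mce_sketch *m)
{
	if (!(m->status & STATUS_ADDRV))
		return false;

	switch (m->vendor) {
	case VENDOR_AMD:
		return amd_usable_stub(m);
	case VENDOR_INTEL:
	case VENDOR_ZHAOXIN:
		return intel_usable_sketch(m);
	default:
		return true;
	}
}
```

The default branch keeps the earlier behavior for other vendors: a valid
ADDRV bit alone makes the address usable.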



* Re: [PATCH 1/3] x86/MCE/AMD: Split amd_mce_is_memory_error()
  2023-06-13 14:11 ` [PATCH 1/3] x86/MCE/AMD: Split amd_mce_is_memory_error() Yazen Ghannam
@ 2023-06-14  2:06   ` Shuai Xue
  2023-06-14 15:06     ` Yazen Ghannam
  2023-10-16 13:48   ` [tip: ras/core] " tip-bot2 for Yazen Ghannam
  1 sibling, 1 reply; 17+ messages in thread
From: Shuai Xue @ 2023-06-14  2:06 UTC (permalink / raw)
  To: Yazen Ghannam, linux-edac
  Cc: linux-kernel, tony.luck, x86, muralidhara.mk, joao.m.martins,
	william.roche, boris.ostrovsky, john.allen, baolin.wang



On 2023/6/13 22:11, Yazen Ghannam wrote:
> Define helper functions for legacy and SMCA systems in order to reuse
> individual checks in later changes.
> 
> Describe what each function is checking for, and correct the XEC bitmask
> for SMCA.
> 
> No functional change intended.
> 
> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> ---
>  arch/x86/kernel/cpu/mce/amd.c | 30 +++++++++++++++++++++++++-----
>  1 file changed, 25 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
> index 5e74610b39e7..1ccfb0c9257f 100644
> --- a/arch/x86/kernel/cpu/mce/amd.c
> +++ b/arch/x86/kernel/cpu/mce/amd.c
> @@ -713,17 +713,37 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
>  		deferred_error_interrupt_enable(c);
>  }
>  
> -bool amd_mce_is_memory_error(struct mce *m)
> +/*
> + * DRAM ECC errors are reported in the Northbridge (bank 4) with
> + * Extended Error Code 8.
> + */
> +static bool legacy_mce_is_memory_error(struct mce *m)
> +{
> +	return m->bank == 4 && XEC(m->status, 0x1f) == 8;
> +}
> +
> +/*
> + * DRAM ECC errors are reported in Unified Memory Controllers with
> + * Extended Error Code 0.
> + */
> +static bool smca_mce_is_memory_error(struct mce *m)
>  {
>  	enum smca_bank_types bank_type;
> -	/* ErrCodeExt[20:16] */
> -	u8 xec = (m->status >> 16) & 0x1f;
> +
> +	if (XEC(m->status, 0x3f))
> +		return false;
>  
>  	bank_type = smca_get_bank_type(m->extcpu, m->bank);
> +
> +	return bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2;
> +}
> +
> +bool amd_mce_is_memory_error(struct mce *m)
> +{
>  	if (mce_flags.smca)
> -		return (bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2) && xec == 0x0;
> +		return smca_mce_is_memory_error(m);
>  
> -	return m->bank == 4 && xec == 0x8;
> +	return legacy_mce_is_memory_error(m);
>  }
>  
>  static void __log_error(unsigned int bank, u64 status, u64 addr, u64 misc)

Hi, Yazen,

Which tree are you working on? This patch does not apply to Linus' master
(commit b6dad5178ceaf23f369c3711062ce1f2afc33644)

Thanks.

Best Regards,
Shuai


* Re: [PATCH 2/3] x86/mce: Define amd_mce_usable_address()
  2023-06-13 14:11 ` [PATCH 2/3] x86/mce: Define amd_mce_usable_address() Yazen Ghannam
@ 2023-06-14  2:19   ` Shuai Xue
  2023-06-14 15:09     ` Yazen Ghannam
  2023-10-16 13:48   ` [tip: ras/core] " tip-bot2 for Yazen Ghannam
  1 sibling, 1 reply; 17+ messages in thread
From: Shuai Xue @ 2023-06-14  2:19 UTC (permalink / raw)
  To: Yazen Ghannam, linux-edac
  Cc: linux-kernel, tony.luck, x86, muralidhara.mk, joao.m.martins,
	william.roche, boris.ostrovsky, john.allen, baolin.wang



On 2023/6/13 22:11, Yazen Ghannam wrote:
> Currently, all valid MCA_ADDR values are assumed to be usable on AMD
> systems. However, this is not correct in most cases. Notifiers expecting
> usable addresses may then operate on inappropriate values.
> 
> Define a helper function to do AMD-specific checks for a usable memory
> address. List out all known cases.
> 
> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> ---
>  arch/x86/kernel/cpu/mce/amd.c      | 38 ++++++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/mce/core.c     |  3 +++
>  arch/x86/kernel/cpu/mce/internal.h |  2 ++
>  3 files changed, 43 insertions(+)
> 
> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
> index 1ccfb0c9257f..ca79fa10b844 100644
> --- a/arch/x86/kernel/cpu/mce/amd.c
> +++ b/arch/x86/kernel/cpu/mce/amd.c
> @@ -746,6 +746,44 @@ bool amd_mce_is_memory_error(struct mce *m)
>  	return legacy_mce_is_memory_error(m);
>  }
>  
> +/*
> + * AMD systems do not have an explicit indicator that the value in MCA_ADDR is
> + * a system physical address. Therefore individual cases need to be detected.
> + * Future cases and checks will be added as needed.
> + *
> + * 1) General case
> + *	a) Assume address is not usable.
> + * 2) "Poison" errors
> + *	a) Indicated by MCA_STATUS[43]: POISON. Defined for all banks except legacy
> + *	   Northbridge (bank 4).
> + *	b) Refers to poison consumption in the Core. Does not include "no action",
> + *	   "action optional", or "deferred" error severities.
> + *	c) Will include a usable address so that immediate action can be taken.
> + * 3) Northbridge DRAM ECC errors
> + *	a) Reported in legacy bank 4 with XEC 8.
> + *	b) MCA_STATUS[43] is *not* defined as POISON in legacy bank 4. Therefore,
> + *	   this bit should not be checked.
[nit]

> + *
> + * NOTE: SMCA UMC memory errors fall into case #1.

hi, Yazen

The address for an SMCA UMC memory error is not a system physical address, so it
makes sense to treat it as not usable. But how do we deal with the SMCA address?
MCE chain notifiers like uc_decode_notifier do a sanity check with
mce_usable_address(), so they will not handle an SMCA address.

Thanks.

Best Regards,
Shuai

> + */
> +bool amd_mce_usable_address(struct mce *m)
> +{
> +	/* Check special Northbridge case first. */
> +	if (!mce_flags.smca) {
> +		if (legacy_mce_is_memory_error(m))
> +			return true;
> +		else if (m->bank == 4)
> +			return false;
> +	}
> +
> +	/* Check Poison bit for all other bank types. */
> +	if (m->status & MCI_STATUS_POISON)
> +		return true;
> +
> +	/* Assume address is not usable for all others. */
> +	return false;
> +}
> +
>  static void __log_error(unsigned int bank, u64 status, u64 addr, u64 misc)
>  {
>  	struct mce m;
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 89e2aab5d34d..859ce20dd730 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -464,6 +464,9 @@ int mce_usable_address(struct mce *m)
>  	if (!(m->status & MCI_STATUS_ADDRV))
>  		return 0;
>  
> +	if (m->cpuvendor == X86_VENDOR_AMD)
> +		return amd_mce_usable_address(m);
> +
>  	/* Checks after this one are Intel/Zhaoxin-specific: */
>  	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
>  	    boot_cpu_data.x86_vendor != X86_VENDOR_ZHAOXIN)
> diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
> index d2412ce2d312..0d4c5b83ed93 100644
> --- a/arch/x86/kernel/cpu/mce/internal.h
> +++ b/arch/x86/kernel/cpu/mce/internal.h
> @@ -207,6 +207,7 @@ extern bool filter_mce(struct mce *m);
>  
>  #ifdef CONFIG_X86_MCE_AMD
>  extern bool amd_filter_mce(struct mce *m);
> +bool amd_mce_usable_address(struct mce *m);
>  
>  /*
>   * If MCA_CONFIG[McaLsbInStatusSupported] is set, extract ErrAddr in bits
> @@ -234,6 +235,7 @@ static __always_inline void smca_extract_err_addr(struct mce *m)
>  
>  #else
>  static inline bool amd_filter_mce(struct mce *m) { return false; }
> +static inline bool amd_mce_usable_address(struct mce *m) { return false; }
>  static inline void smca_extract_err_addr(struct mce *m) { }
>  #endif
>  


* Re: [PATCH 1/3] x86/MCE/AMD: Split amd_mce_is_memory_error()
  2023-06-14  2:06   ` Shuai Xue
@ 2023-06-14 15:06     ` Yazen Ghannam
  2023-06-15  2:03       ` Shuai Xue
  0 siblings, 1 reply; 17+ messages in thread
From: Yazen Ghannam @ 2023-06-14 15:06 UTC (permalink / raw)
  To: Shuai Xue, linux-edac
  Cc: yazen.ghannam, linux-kernel, tony.luck, x86, muralidhara.mk,
	joao.m.martins, william.roche, boris.ostrovsky, john.allen,
	baolin.wang

On 6/13/2023 10:06 PM, Shuai Xue wrote:
> 
> 
> On 2023/6/13 22:11, Yazen Ghannam wrote:
>> Define helper functions for legacy and SMCA systems in order to reuse
>> individual checks in later changes.
>>
>> Describe what each function is checking for, and correct the XEC bitmask
>> for SMCA.
>>
>> No functional change intended.
>>
>> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
>> ---
>>   arch/x86/kernel/cpu/mce/amd.c | 30 +++++++++++++++++++++++++-----
>>   1 file changed, 25 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
>> index 5e74610b39e7..1ccfb0c9257f 100644
>> --- a/arch/x86/kernel/cpu/mce/amd.c
>> +++ b/arch/x86/kernel/cpu/mce/amd.c
>> @@ -713,17 +713,37 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
>>   		deferred_error_interrupt_enable(c);
>>   }
>>   
>> -bool amd_mce_is_memory_error(struct mce *m)
>> +/*
>> + * DRAM ECC errors are reported in the Northbridge (bank 4) with
>> + * Extended Error Code 8.
>> + */
>> +static bool legacy_mce_is_memory_error(struct mce *m)
>> +{
>> +	return m->bank == 4 && XEC(m->status, 0x1f) == 8;
>> +}
>> +
>> +/*
>> + * DRAM ECC errors are reported in Unified Memory Controllers with
>> + * Extended Error Code 0.
>> + */
>> +static bool smca_mce_is_memory_error(struct mce *m)
>>   {
>>   	enum smca_bank_types bank_type;
>> -	/* ErrCodeExt[20:16] */
>> -	u8 xec = (m->status >> 16) & 0x1f;
>> +
>> +	if (XEC(m->status, 0x3f))
>> +		return false;
>>   
>>   	bank_type = smca_get_bank_type(m->extcpu, m->bank);
>> +
>> +	return bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2;
>> +}
>> +
>> +bool amd_mce_is_memory_error(struct mce *m)
>> +{
>>   	if (mce_flags.smca)
>> -		return (bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2) && xec == 0x0;
>> +		return smca_mce_is_memory_error(m);
>>   
>> -	return m->bank == 4 && xec == 0x8;
>> +	return legacy_mce_is_memory_error(m);
>>   }
>>   
>>   static void __log_error(unsigned int bank, u64 status, u64 addr, u64 misc)
> 
> Hi, Yazen,
> 
> Which tree are you working on? This patch does not apply to Linus' master
> (commit b6dad5178ceaf23f369c3711062ce1f2afc33644)
> 

Hi Shuai,

I'm using tip/master as the base.
https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/

Sorry, I forgot to mention this in the cover letter.

Thanks,
Yazen



* Re: [PATCH 2/3] x86/mce: Define amd_mce_usable_address()
  2023-06-14  2:19   ` Shuai Xue
@ 2023-06-14 15:09     ` Yazen Ghannam
  2023-06-15  2:12       ` Shuai Xue
  0 siblings, 1 reply; 17+ messages in thread
From: Yazen Ghannam @ 2023-06-14 15:09 UTC (permalink / raw)
  To: Shuai Xue, linux-edac
  Cc: yazen.ghannam, linux-kernel, tony.luck, x86, muralidhara.mk,
	joao.m.martins, william.roche, boris.ostrovsky, john.allen,
	baolin.wang

On 6/13/2023 10:19 PM, Shuai Xue wrote:
> 
> 
> On 2023/6/13 22:11, Yazen Ghannam wrote:
>> Currently, all valid MCA_ADDR values are assumed to be usable on AMD
>> systems. However, this is not correct in most cases. Notifiers expecting
>> usable addresses may then operate on inappropriate values.
>>
>> Define a helper function to do AMD-specific checks for a usable memory
>> address. List out all known cases.
>>
>> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
>> ---
>>   arch/x86/kernel/cpu/mce/amd.c      | 38 ++++++++++++++++++++++++++++++
>>   arch/x86/kernel/cpu/mce/core.c     |  3 +++
>>   arch/x86/kernel/cpu/mce/internal.h |  2 ++
>>   3 files changed, 43 insertions(+)
>>
>> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
>> index 1ccfb0c9257f..ca79fa10b844 100644
>> --- a/arch/x86/kernel/cpu/mce/amd.c
>> +++ b/arch/x86/kernel/cpu/mce/amd.c
>> @@ -746,6 +746,44 @@ bool amd_mce_is_memory_error(struct mce *m)
>>   	return legacy_mce_is_memory_error(m);
>>   }
>>   
>> +/*
>> + * AMD systems do not have an explicit indicator that the value in MCA_ADDR is
>> + * a system physical address. Therefore individual cases need to be detected.
>> + * Future cases and checks will be added as needed.
>> + *
>> + * 1) General case
>> + *	a) Assume address is not usable.
>> + * 2) "Poison" errors
>> + *	a) Indicated by MCA_STATUS[43]: POISON. Defined for all banks except legacy
>> + *	   Northbridge (bank 4).
>> + *	b) Refers to poison consumption in the Core. Does not include "no action",
>> + *	   "action optional", or "deferred" error severities.
>> + *	c) Will include a usable address so that immediate action can be taken.
>> + * 3) Northbridge DRAM ECC errors
>> + *	a) Reported in legacy bank 4 with XEC 8.
>> + *	b) MCA_STATUS[43] is *not* defined as POISON in legacy bank 4. Therefore,
>> + *	   this bit should not be checked.
> [nit]
> 
>> + *
>> + * NOTE: SMCA UMC memory errors fall into case #1.
> 
> hi, Yazen
> 
> The address for an SMCA UMC memory error is not a system physical address, so it
> makes sense to treat it as not usable. But how do we deal with the SMCA address?
> MCE chain notifiers like uc_decode_notifier do a sanity check with
> mce_usable_address(), so they will not handle an SMCA address.
>

Hi Shuai,

That's correct.

There isn't a good solution today. This will be handled in future changes.

Thanks,
Yazen
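
To make the gap discussed here concrete, a rough userspace sketch of a
consumer that gates on the usable-address check. Nothing below is the
kernel's notifier code: notifier_sketch(), the SKETCH_* return codes and
struct mce_sketch are hypothetical, the gate is reduced to the SMCA path
of the patch (ADDRV plus poison bit), and the 4 KiB page size is an
assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions; check asm/mce.h before relying on them. */
#define STATUS_ADDRV	(1ULL << 58)
#define STATUS_POISON	(1ULL << 43)

/* Hypothetical stand-ins for notifier return codes. */
enum { SKETCH_DONE, SKETCH_OK };

/* Hypothetical stand-in for the struct mce fields used below. */
struct mce_sketch {
	uint64_t status;
	uint64_t addr;
};

/* Simplified gate: SMCA path only, so the poison bit decides. */
static bool usable_address_sketch(const struct mce_sketch *m)
{
	return (m->status & STATUS_ADDRV) && (m->status & STATUS_POISON);
}

/* Hypothetical consumer: reports the page frame it would act on. */
static int notifier_sketch(const struct mce_sketch *m, uint64_t *pfn_out)
{
	if (!m || !usable_address_sketch(m))
		return SKETCH_DONE;	/* address is ignored */

	*pfn_out = m->addr >> 12;	/* assumes 4 KiB pages */
	return SKETCH_OK;
}
```

An SMCA UMC error with ADDRV set but no poison bit returns SKETCH_DONE
here, which is the "will not handle SMCA address" outcome described above.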


* Re: [PATCH 1/3] x86/MCE/AMD: Split amd_mce_is_memory_error()
  2023-06-14 15:06     ` Yazen Ghannam
@ 2023-06-15  2:03       ` Shuai Xue
  2023-06-15 15:09         ` Yazen Ghannam
  0 siblings, 1 reply; 17+ messages in thread
From: Shuai Xue @ 2023-06-15  2:03 UTC (permalink / raw)
  To: Yazen Ghannam, linux-edac
  Cc: linux-kernel, tony.luck, x86, muralidhara.mk, joao.m.martins,
	william.roche, boris.ostrovsky, john.allen, baolin.wang



On 2023/6/14 23:06, Yazen Ghannam wrote:
> On 6/13/2023 10:06 PM, Shuai Xue wrote:
>>
>>
>> On 2023/6/13 22:11, Yazen Ghannam wrote:
>>> Define helper functions for legacy and SMCA systems in order to reuse
>>> individual checks in later changes.
>>>
>>> Describe what each function is checking for, and correct the XEC bitmask
>>> for SMCA.
>>>
>>> No functional change intended.
>>>
>>> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
>>> ---
>>>   arch/x86/kernel/cpu/mce/amd.c | 30 +++++++++++++++++++++++++-----
>>>   1 file changed, 25 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
>>> index 5e74610b39e7..1ccfb0c9257f 100644
>>> --- a/arch/x86/kernel/cpu/mce/amd.c
>>> +++ b/arch/x86/kernel/cpu/mce/amd.c
>>> @@ -713,17 +713,37 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
>>>           deferred_error_interrupt_enable(c);
>>>   }
>>>
>>> -bool amd_mce_is_memory_error(struct mce *m)
>>> +/*
>>> + * DRAM ECC errors are reported in the Northbridge (bank 4) with
>>> + * Extended Error Code 8.
>>> + */
>>> +static bool legacy_mce_is_memory_error(struct mce *m)
>>> +{
>>> +    return m->bank == 4 && XEC(m->status, 0x1f) == 8;
>>> +}
>>> +
>>> +/*
>>> + * DRAM ECC errors are reported in Unified Memory Controllers with
>>> + * Extended Error Code 0.
>>> + */
>>> +static bool smca_mce_is_memory_error(struct mce *m)
>>>   {
>>>       enum smca_bank_types bank_type;
>>> -    /* ErrCodeExt[20:16] */
>>> -    u8 xec = (m->status >> 16) & 0x1f;
>>> +
>>> +    if (XEC(m->status, 0x3f))
>>> +        return false;
>>>
>>>       bank_type = smca_get_bank_type(m->extcpu, m->bank);
>>> +
>>> +    return bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2;
>>> +}
>>> +
>>> +bool amd_mce_is_memory_error(struct mce *m)
>>> +{
>>>       if (mce_flags.smca)
>>> -        return (bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2) && xec == 0x0;
>>> +        return smca_mce_is_memory_error(m);
>>>
>>> -    return m->bank == 4 && xec == 0x8;
>>> +    return legacy_mce_is_memory_error(m);
>>>   }
>>>
>>>   static void __log_error(unsigned int bank, u64 status, u64 addr, u64 misc)
>>
>> Hi, Yazen,
>>
>> Which tree are you working on? This patch does not apply to Linus' master
>> (commit b6dad5178ceaf23f369c3711062ce1f2afc33644)
>>
> 
> Hi Shuai,
> 
> I'm using tip/master as the base.
> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/
> 
> Sorry, I forgot to mention this in the cover letter.

Ok. This patch itself looks good to me.

Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>

Thanks.
Shuai


* Re: [PATCH 2/3] x86/mce: Define amd_mce_usable_address()
  2023-06-14 15:09     ` Yazen Ghannam
@ 2023-06-15  2:12       ` Shuai Xue
  2023-06-15 15:15         ` Yazen Ghannam
  0 siblings, 1 reply; 17+ messages in thread
From: Shuai Xue @ 2023-06-15  2:12 UTC (permalink / raw)
  To: Yazen Ghannam, linux-edac
  Cc: linux-kernel, tony.luck, x86, muralidhara.mk, joao.m.martins,
	william.roche, boris.ostrovsky, john.allen, baolin.wang



On 2023/6/14 23:09, Yazen Ghannam wrote:
> On 6/13/2023 10:19 PM, Shuai Xue wrote:
>>
>>
>> On 2023/6/13 22:11, Yazen Ghannam wrote:
>>> Currently, all valid MCA_ADDR values are assumed to be usable on AMD
>>> systems. However, this is not correct in most cases. Notifiers expecting
>>> usable addresses may then operate on inappropriate values.
>>>
>>> Define a helper function to do AMD-specific checks for a usable memory
>>> address. List out all known cases.
>>>
>>> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
>>> ---
>>>   arch/x86/kernel/cpu/mce/amd.c      | 38 ++++++++++++++++++++++++++++++
>>>   arch/x86/kernel/cpu/mce/core.c     |  3 +++
>>>   arch/x86/kernel/cpu/mce/internal.h |  2 ++
>>>   3 files changed, 43 insertions(+)
>>>
>>> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
>>> index 1ccfb0c9257f..ca79fa10b844 100644
>>> --- a/arch/x86/kernel/cpu/mce/amd.c
>>> +++ b/arch/x86/kernel/cpu/mce/amd.c
>>> @@ -746,6 +746,44 @@ bool amd_mce_is_memory_error(struct mce *m)
>>>       return legacy_mce_is_memory_error(m);
>>>   }
>>>
>>> +/*
>>> + * AMD systems do not have an explicit indicator that the value in MCA_ADDR is
>>> + * a system physical address. Therefore individual cases need to be detected.
>>> + * Future cases and checks will be added as needed.
>>> + *
>>> + * 1) General case
>>> + *    a) Assume address is not usable.
>>> + * 2) "Poison" errors
>>> + *    a) Indicated by MCA_STATUS[43]: POISON. Defined for all banks except legacy
>>> + *       Northbridge (bank 4).
>>> + *    b) Refers to poison consumption in the Core. Does not include "no action",
>>> + *       "action optional", or "deferred" error severities.
>>> + *    c) Will include a usable address so that immediate action can be taken.
>>> + * 3) Northbridge DRAM ECC errors
>>> + *    a) Reported in legacy bank 4 with XEC 8.
>>> + *    b) MCA_STATUS[43] is *not* defined as POISON in legacy bank 4. Therefore,
>>> + *       this bit should not be checked.
>> [nit]
>>
>>> + *
>>> + * NOTE: SMCA UMC memory errors fall into case #1.
>>
>> hi, Yazen
>>
>> The address for an SMCA UMC memory error is not a system physical address, so it makes
>> sense for it to be not usable. But how do we deal with the SMCA address? MCE chain
>> notifiers like uc_decode_notifier do a sanity check with mce_usable_address() and
>> will not handle the SMCA address.
>>
> 
> Hi Shuai,
> 
> That's correct.
> 
> There isn't a good solution today. This will be handled in future changes.

Hi, Yazen,

Do you have a plan to address it? If not, I can help. We have hit this problem in our products.

Thanks
Shuai




^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] x86/MCE/AMD: Split amd_mce_is_memory_error()
  2023-06-15  2:03       ` Shuai Xue
@ 2023-06-15 15:09         ` Yazen Ghannam
  0 siblings, 0 replies; 17+ messages in thread
From: Yazen Ghannam @ 2023-06-15 15:09 UTC (permalink / raw)
  To: Shuai Xue, linux-edac
  Cc: yazen.ghannam, linux-kernel, tony.luck, x86, muralidhara.mk,
	joao.m.martins, william.roche, boris.ostrovsky, john.allen,
	baolin.wang

On 6/14/2023 10:03 PM, Shuai Xue wrote:
> 
> 
> On 2023/6/14 23:06, Yazen Ghannam wrote:
>> On 6/13/2023 10:06 PM, Shuai Xue wrote:
>>>
>>>
>>> On 2023/6/13 22:11, Yazen Ghannam wrote:
>>>> Define helper functions for legacy and SMCA systems in order to reuse
>>>> individual checks in later changes.
>>>>
>>>> Describe what each function is checking for, and correct the XEC bitmask
>>>> for SMCA.
>>>>
>>>> No functional change intended.
>>>>
>>>> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
>>>> ---
>>>>    arch/x86/kernel/cpu/mce/amd.c | 30 +++++++++++++++++++++++++-----
>>>>    1 file changed, 25 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
>>>> index 5e74610b39e7..1ccfb0c9257f 100644
>>>> --- a/arch/x86/kernel/cpu/mce/amd.c
>>>> +++ b/arch/x86/kernel/cpu/mce/amd.c
>>>> @@ -713,17 +713,37 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
>>>>            deferred_error_interrupt_enable(c);
>>>>    }
>>>>    -bool amd_mce_is_memory_error(struct mce *m)
>>>> +/*
>>>> + * DRAM ECC errors are reported in the Northbridge (bank 4) with
>>>> + * Extended Error Code 8.
>>>> + */
>>>> +static bool legacy_mce_is_memory_error(struct mce *m)
>>>> +{
>>>> +    return m->bank == 4 && XEC(m->status, 0x1f) == 8;
>>>> +}
>>>> +
>>>> +/*
>>>> + * DRAM ECC errors are reported in Unified Memory Controllers with
>>>> + * Extended Error Code 0.
>>>> + */
>>>> +static bool smca_mce_is_memory_error(struct mce *m)
>>>>    {
>>>>        enum smca_bank_types bank_type;
>>>> -    /* ErrCodeExt[20:16] */
>>>> -    u8 xec = (m->status >> 16) & 0x1f;
>>>> +
>>>> +    if (XEC(m->status, 0x3f))
>>>> +        return false;
>>>>          bank_type = smca_get_bank_type(m->extcpu, m->bank);
>>>> +
>>>> +    return bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2;
>>>> +}
>>>> +
>>>> +bool amd_mce_is_memory_error(struct mce *m)
>>>> +{
>>>>        if (mce_flags.smca)
>>>> -        return (bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2) && xec == 0x0;
>>>> +        return smca_mce_is_memory_error(m);
>>>>    -    return m->bank == 4 && xec == 0x8;
>>>> +    return legacy_mce_is_memory_error(m);
>>>>    }
>>>>      static void __log_error(unsigned int bank, u64 status, u64 addr, u64 misc)
>>>
>>> Hi, Yazen,
>>>
>>> Which tree are you working on? This patch does not apply to Linus' master
>>> (commit b6dad5178ceaf23f369c3711062ce1f2afc33644).
>>>
>>
>> Hi Shuai,
>>
>> I'm using tip/master as the base.
>> https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/log/
>>
>> Sorry, I forgot to mention this in the cover letter.
> 
> Ok. This patch itself looks good to me.
> 
> Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
> 

Thank you!

-Yazen


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] x86/mce: Define amd_mce_usable_address()
  2023-06-15  2:12       ` Shuai Xue
@ 2023-06-15 15:15         ` Yazen Ghannam
  2023-06-16  1:59           ` Shuai Xue
  0 siblings, 1 reply; 17+ messages in thread
From: Yazen Ghannam @ 2023-06-15 15:15 UTC (permalink / raw)
  To: Shuai Xue, linux-edac
  Cc: yazen.ghannam, linux-kernel, tony.luck, x86, muralidhara.mk,
	joao.m.martins, william.roche, boris.ostrovsky, john.allen,
	baolin.wang

On 6/14/2023 10:12 PM, Shuai Xue wrote:
> 
> 
> On 2023/6/14 23:09, Yazen Ghannam wrote:
>> On 6/13/2023 10:19 PM, Shuai Xue wrote:
>>>
>>>
>>> On 2023/6/13 22:11, Yazen Ghannam wrote:
>>>> Currently, all valid MCA_ADDR values are assumed to be usable on AMD
>>>> systems. However, this is not correct in most cases. Notifiers expecting
>>>> usable addresses may then operate on inappropriate values.
>>>>
>>>> Define a helper function to do AMD-specific checks for a usable memory
>>>> address. List out all known cases.
>>>>
>>>> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
>>>> ---
>>>>    arch/x86/kernel/cpu/mce/amd.c      | 38 ++++++++++++++++++++++++++++++
>>>>    arch/x86/kernel/cpu/mce/core.c     |  3 +++
>>>>    arch/x86/kernel/cpu/mce/internal.h |  2 ++
>>>>    3 files changed, 43 insertions(+)
>>>>
>>>> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
>>>> index 1ccfb0c9257f..ca79fa10b844 100644
>>>> --- a/arch/x86/kernel/cpu/mce/amd.c
>>>> +++ b/arch/x86/kernel/cpu/mce/amd.c
>>>> @@ -746,6 +746,44 @@ bool amd_mce_is_memory_error(struct mce *m)
>>>>        return legacy_mce_is_memory_error(m);
>>>>    }
>>>>    +/*
>>>> + * AMD systems do not have an explicit indicator that the value in MCA_ADDR is
>>>> + * a system physical address. Therefore individual cases need to be detected.
>>>> + * Future cases and checks will be added as needed.
>>>> + *
>>>> + * 1) General case
>>>> + *    a) Assume address is not usable.
>>>> + * 2) "Poison" errors
>>>> + *    a) Indicated by MCA_STATUS[43]: POISON. Defined for all banks except legacy
>>>> + *       Northbridge (bank 4).
>>>> + *    b) Refers to poison consumption in the Core. Does not include "no action",
>>>> + *       "action optional", or "deferred" error severities.
>>>> + *    c) Will include a usable address so that immediate action can be taken.
>>>> + * 3) Northbridge DRAM ECC errors
>>>> + *    a) Reported in legacy bank 4 with XEC 8.
>>>> + *    b) MCA_STATUS[43] is *not* defined as POISON in legacy bank 4. Therefore,
>>>> + *       this bit should not be checked.
>>> [nit]
>>>
>>>> + *
>>>> + * NOTE: SMCA UMC memory errors fall into case #1.
>>>
>>> hi, Yazen
>>>
>>> The address for an SMCA UMC memory error is not a system physical address, so it makes
>>> sense for it to be not usable. But how do we deal with the SMCA address? MCE chain
>>> notifiers like uc_decode_notifier do a sanity check with mce_usable_address() and
>>> will not handle the SMCA address.
>>>
>>
>> Hi Shuai,
>>
>> That's correct.
>>
>> There isn't a good solution today. This will be handled in future changes.
> 
> Hi, Yazen,
> 
> Do you have a plan to address it? If not, I can help. We have hit this problem in our products.
> 

Yes, definitely. The first step is to update the address translation
code; this is in progress. Afterwards, we can find a way to leverage this
in the MCE notifier flows.

Just curious, how big is the benefit of this preemptive page offline in 
your use cases? That is, compared to page offline as part of poison data 
consumption.

Thanks,
Yazen


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] x86/mce: Define amd_mce_usable_address()
  2023-06-15 15:15         ` Yazen Ghannam
@ 2023-06-16  1:59           ` Shuai Xue
  2023-06-16  7:46             ` William Roche
  0 siblings, 1 reply; 17+ messages in thread
From: Shuai Xue @ 2023-06-16  1:59 UTC (permalink / raw)
  To: Yazen Ghannam, linux-edac
  Cc: linux-kernel, tony.luck, x86, muralidhara.mk, joao.m.martins,
	william.roche, boris.ostrovsky, john.allen, baolin.wang



On 2023/6/15 23:15, Yazen Ghannam wrote:
> On 6/14/2023 10:12 PM, Shuai Xue wrote:
>>
>>
>> On 2023/6/14 23:09, Yazen Ghannam wrote:
>>> On 6/13/2023 10:19 PM, Shuai Xue wrote:
>>>>
>>>>
>>>> On 2023/6/13 22:11, Yazen Ghannam wrote:
>>>>> Currently, all valid MCA_ADDR values are assumed to be usable on AMD
>>>>> systems. However, this is not correct in most cases. Notifiers expecting
>>>>> usable addresses may then operate on inappropriate values.
>>>>>
>>>>> Define a helper function to do AMD-specific checks for a usable memory
>>>>> address. List out all known cases.
>>>>>
>>>>> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
>>>>> ---
>>>>>    arch/x86/kernel/cpu/mce/amd.c      | 38 ++++++++++++++++++++++++++++++
>>>>>    arch/x86/kernel/cpu/mce/core.c     |  3 +++
>>>>>    arch/x86/kernel/cpu/mce/internal.h |  2 ++
>>>>>    3 files changed, 43 insertions(+)
>>>>>
>>>>> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
>>>>> index 1ccfb0c9257f..ca79fa10b844 100644
>>>>> --- a/arch/x86/kernel/cpu/mce/amd.c
>>>>> +++ b/arch/x86/kernel/cpu/mce/amd.c
>>>>> @@ -746,6 +746,44 @@ bool amd_mce_is_memory_error(struct mce *m)
>>>>>        return legacy_mce_is_memory_error(m);
>>>>>    }
>>>>>    +/*
>>>>> + * AMD systems do not have an explicit indicator that the value in MCA_ADDR is
>>>>> + * a system physical address. Therefore individual cases need to be detected.
>>>>> + * Future cases and checks will be added as needed.
>>>>> + *
>>>>> + * 1) General case
>>>>> + *    a) Assume address is not usable.
>>>>> + * 2) "Poison" errors
>>>>> + *    a) Indicated by MCA_STATUS[43]: POISON. Defined for all banks except legacy
>>>>> + *       Northbridge (bank 4).
>>>>> + *    b) Refers to poison consumption in the Core. Does not include "no action",
>>>>> + *       "action optional", or "deferred" error severities.
>>>>> + *    c) Will include a usable address so that immediate action can be taken.
>>>>> + * 3) Northbridge DRAM ECC errors
>>>>> + *    a) Reported in legacy bank 4 with XEC 8.
>>>>> + *    b) MCA_STATUS[43] is *not* defined as POISON in legacy bank 4. Therefore,
>>>>> + *       this bit should not be checked.
>>>> [nit]
>>>>
>>>>> + *
>>>>> + * NOTE: SMCA UMC memory errors fall into case #1.
>>>>
>>>> hi, Yazen
>>>>
>>>> The address for an SMCA UMC memory error is not a system physical address, so it makes
>>>> sense for it to be not usable. But how do we deal with the SMCA address? MCE chain
>>>> notifiers like uc_decode_notifier do a sanity check with mce_usable_address() and
>>>> will not handle the SMCA address.
>>>>
>>>
>>> Hi Shuai,
>>>
>>> That's correct.
>>>
>>> There isn't a good solution today. This will be handled in future changes.
>>
>> Hi, Yazen,
>>
>> Do you have a plan to address it? If not, I can help. We have hit this problem in our products.
>>
> 
> Yes, definitely. The first step is to update the address translation code; this is in progress. Afterwards, we can find a way to leverage this in the MCE notifier flows.

Look forward to it.

> 
> Just curious, how big is the benefit of this preemptive page offline in your use cases? That is, compared to page offline as part of poison data consumption.

There are three benefits if an SMCA address detected by the scrubber is offlined
in advance:

- Free page: it is isolated and not allocated by the buddy allocator, so the poisoned
  data will never be consumed.
- In-use page: healthy VMs can be migrated to another healthy node if many UCEs occur.
- It mitigates the possibility of a nested MCE, which is a fatal error.

Thank you.

Best Regards,
Shuai.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] x86/mce: Define amd_mce_usable_address()
  2023-06-16  1:59           ` Shuai Xue
@ 2023-06-16  7:46             ` William Roche
  0 siblings, 0 replies; 17+ messages in thread
From: William Roche @ 2023-06-16  7:46 UTC (permalink / raw)
  To: Shuai Xue, Yazen Ghannam, linux-edac
  Cc: linux-kernel, tony.luck, x86, muralidhara.mk, joao.m.martins,
	boris.ostrovsky, john.allen, baolin.wang

On 6/16/23 03:59, Shuai Xue wrote:
> On 2023/6/15 23:15, Yazen Ghannam wrote:
>> On 6/14/2023 10:12 PM, Shuai Xue wrote:
>>>
>>> On 2023/6/14 23:09, Yazen Ghannam wrote:
>>>> On 6/13/2023 10:19 PM, Shuai Xue wrote:
>>>>>
>>>>> On 2023/6/13 22:11, Yazen Ghannam wrote:
>>>>>> Currently, all valid MCA_ADDR values are assumed to be usable on AMD
>>>>>> systems. However, this is not correct in most cases. Notifiers expecting
>>>>>> usable addresses may then operate on inappropriate values.
>>>>>>
>>>>>> Define a helper function to do AMD-specific checks for a usable memory
>>>>>> address. List out all known cases.
>>>>>>
>>>>>> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
>>>>>> ---
>>>>>>     arch/x86/kernel/cpu/mce/amd.c      | 38 ++++++++++++++++++++++++++++++
>>>>>>     arch/x86/kernel/cpu/mce/core.c     |  3 +++
>>>>>>     arch/x86/kernel/cpu/mce/internal.h |  2 ++
>>>>>>     3 files changed, 43 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
>>>>>> index 1ccfb0c9257f..ca79fa10b844 100644
>>>>>> --- a/arch/x86/kernel/cpu/mce/amd.c
>>>>>> +++ b/arch/x86/kernel/cpu/mce/amd.c
>>>>>> @@ -746,6 +746,44 @@ bool amd_mce_is_memory_error(struct mce *m)
>>>>>>         return legacy_mce_is_memory_error(m);
>>>>>>     }
>>>>>>     +/*
>>>>>> + * AMD systems do not have an explicit indicator that the value in MCA_ADDR is
>>>>>> + * a system physical address. Therefore individual cases need to be detected.
>>>>>> + * Future cases and checks will be added as needed.
>>>>>> + *
>>>>>> + * 1) General case
>>>>>> + *    a) Assume address is not usable.
>>>>>> + * 2) "Poison" errors
>>>>>> + *    a) Indicated by MCA_STATUS[43]: POISON. Defined for all banks except legacy
>>>>>> + *       Northbridge (bank 4).
>>>>>> + *    b) Refers to poison consumption in the Core. Does not include "no action",
>>>>>> + *       "action optional", or "deferred" error severities.
>>>>>> + *    c) Will include a usable address so that immediate action can be taken.
>>>>>> + * 3) Northbridge DRAM ECC errors
>>>>>> + *    a) Reported in legacy bank 4 with XEC 8.
>>>>>> + *    b) MCA_STATUS[43] is *not* defined as POISON in legacy bank 4. Therefore,
>>>>>> + *       this bit should not be checked.
>>>>> [nit]
>>>>>
>>>>>> + *
>>>>>> + * NOTE: SMCA UMC memory errors fall into case #1.
>>>>> hi, Yazen
>>>>>
>>>>> The address for an SMCA UMC memory error is not a system physical address, so it makes
>>>>> sense for it to be not usable. But how do we deal with the SMCA address? MCE chain
>>>>> notifiers like uc_decode_notifier do a sanity check with mce_usable_address() and
>>>>> will not handle the SMCA address.
>>>>>
>>>> Hi Shuai,
>>>>
>>>> That's correct.
>>>>
>>>> There isn't a good solution today. This will be handled in future changes.
>>> Hi, Yazen,
>>>
>>> Do you have a plan to address it? If not, I can help. We have hit this problem in our products.
>>>
>> Yes, definitely. The first step is to update the address translation code; this is in progress. Afterwards, we can find a way to leverage this in the MCE notifier flows.
> Look forward to it.
>
>> Just curious, how big is the benefit of this preemptive page offline in your use cases? That is, compared to page offline as part of poison data consumption.
> There are three benefits if an SMCA address detected by the scrubber is offlined
> in advance:
>
> - Free page: it is isolated and not allocated by the buddy allocator, so the poisoned
>    data will never be consumed.
> - In-use page: healthy VMs can be migrated to another healthy node if many UCEs occur.

I would also like to add that an application able to take action and 
re-create the impacted memory page (like a database for example), could 
do so before the data is requested by a user.

> - It mitigates the possibility of a nested MCE, which is a fatal error.
>
> Thank you.
>
> Best Regards,
> Shuai.

HTH,
William


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [tip: ras/core] x86/mce: Define amd_mce_usable_address()
  2023-06-13 14:11 ` [PATCH 2/3] x86/mce: Define amd_mce_usable_address() Yazen Ghannam
  2023-06-14  2:19   ` Shuai Xue
@ 2023-10-16 13:48   ` tip-bot2 for Yazen Ghannam
  1 sibling, 0 replies; 17+ messages in thread
From: tip-bot2 for Yazen Ghannam @ 2023-10-16 13:48 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Yazen Ghannam, Borislav Petkov (AMD), x86, linux-kernel

The following commit has been merged into the ras/core branch of tip:

Commit-ID:     48da1ad8ba95ecd35d76355594c629f3ef2a954a
Gitweb:        https://git.kernel.org/tip/48da1ad8ba95ecd35d76355594c629f3ef2a954a
Author:        Yazen Ghannam <yazen.ghannam@amd.com>
AuthorDate:    Tue, 13 Jun 2023 09:11:41 -05:00
Committer:     Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Mon, 16 Oct 2023 15:31:32 +02:00

x86/mce: Define amd_mce_usable_address()

Currently, all valid MCA_ADDR values are assumed to be usable on AMD
systems. However, this is not correct in most cases. Notifiers expecting
usable addresses may then operate on inappropriate values.

Define a helper function to do AMD-specific checks for a usable memory
address. List out all known cases.

  [ bp: Tone down the capitalized words. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230613141142.36801-3-yazen.ghannam@amd.com
---
 arch/x86/kernel/cpu/mce/amd.c      | 38 +++++++++++++++++++++++++++++-
 arch/x86/kernel/cpu/mce/core.c     |  3 ++-
 arch/x86/kernel/cpu/mce/internal.h |  2 ++-
 3 files changed, 43 insertions(+)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index c069934..f3517b8 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -746,6 +746,44 @@ bool amd_mce_is_memory_error(struct mce *m)
 		return legacy_mce_is_memory_error(m);
 }
 
+/*
+ * AMD systems do not have an explicit indicator that the value in MCA_ADDR is
+ * a system physical address. Therefore, individual cases need to be detected.
+ * Future cases and checks will be added as needed.
+ *
+ * 1) General case
+ *	a) Assume address is not usable.
+ * 2) Poison errors
+ *	a) Indicated by MCA_STATUS[43]: poison. Defined for all banks except legacy
+ *	   northbridge (bank 4).
+ *	b) Refers to poison consumption in the core. Does not include "no action",
+ *	   "action optional", or "deferred" error severities.
+ *	c) Will include a usable address so that immediate action can be taken.
+ * 3) Northbridge DRAM ECC errors
+ *	a) Reported in legacy bank 4 with extended error code (XEC) 8.
+ *	b) MCA_STATUS[43] is *not* defined as poison in legacy bank 4. Therefore,
+ *	   this bit should not be checked.
+ *
+ * NOTE: SMCA UMC memory errors fall into case #1.
+ */
+bool amd_mce_usable_address(struct mce *m)
+{
+	/* Check special northbridge case 3) first. */
+	if (!mce_flags.smca) {
+		if (legacy_mce_is_memory_error(m))
+			return true;
+		else if (m->bank == 4)
+			return false;
+	}
+
+	/* Check poison bit for all other bank types. */
+	if (m->status & MCI_STATUS_POISON)
+		return true;
+
+	/* Assume address is not usable for all others. */
+	return false;
+}
+
 static void __log_error(unsigned int bank, u64 status, u64 addr, u64 misc)
 {
 	struct mce m;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6f35f72..06c21f5 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -464,6 +464,9 @@ int mce_usable_address(struct mce *m)
 	if (!(m->status & MCI_STATUS_ADDRV))
 		return 0;
 
+	if (m->cpuvendor == X86_VENDOR_AMD)
+		return amd_mce_usable_address(m);
+
 	/* Checks after this one are Intel/Zhaoxin-specific: */
 	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
 	    boot_cpu_data.x86_vendor != X86_VENDOR_ZHAOXIN)
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index bcf1b3c..a191554 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -210,6 +210,7 @@ extern bool filter_mce(struct mce *m);
 
 #ifdef CONFIG_X86_MCE_AMD
 extern bool amd_filter_mce(struct mce *m);
+bool amd_mce_usable_address(struct mce *m);
 
 /*
  * If MCA_CONFIG[McaLsbInStatusSupported] is set, extract ErrAddr in bits
@@ -237,6 +238,7 @@ static __always_inline void smca_extract_err_addr(struct mce *m)
 
 #else
 static inline bool amd_filter_mce(struct mce *m) { return false; }
+static inline bool amd_mce_usable_address(struct mce *m) { return false; }
 static inline void smca_extract_err_addr(struct mce *m) { }
 #endif
 

^ permalink raw reply related	[flat|nested] 17+ messages in thread
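
Editorial sketch, not part of the merged commit: the three cases listed in the
amd_mce_usable_address() comment above can be tried out in user space with a
small mock. The names mock_mce, STATUS_POISON, legacy_xec() and
mock_amd_usable_address() are made up for illustration, and the smca argument
stands in for mce_flags.smca. Only the bit positions (MCA_STATUS[43] for poison,
the 0x1f mask after a 16-bit shift for the legacy XEC) are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* MCA_STATUS[43], the bit the patch tests via MCI_STATUS_POISON. */
#define STATUS_POISON (1ULL << 43)

/* Hypothetical stand-in for the two struct mce fields the check reads. */
struct mock_mce {
	uint64_t status;
	unsigned int bank;
};

/* Legacy extended error code: shift out the low 16 bits, keep five bits. */
static unsigned int legacy_xec(uint64_t status)
{
	return (status >> 16) & 0x1f;
}

/* smca is passed explicitly here; the kernel reads mce_flags.smca instead. */
static bool mock_amd_usable_address(const struct mock_mce *m, bool smca)
{
	assert(m);

	if (!smca) {
		/* Case 3: northbridge DRAM ECC error in legacy bank 4. */
		if (m->bank == 4 && legacy_xec(m->status) == 8)
			return true;
		/* Bit 43 is not poison in legacy bank 4, so do not test it. */
		if (m->bank == 4)
			return false;
	}

	/* Case 2 if the poison bit is set, general case 1 otherwise. */
	return (m->status & STATUS_POISON) != 0;
}
```

With this mock, a legacy bank 4 record with XEC 8 is reported as usable, any
other legacy bank 4 record is not (even with bit 43 set), and every other record
is usable only when the poison bit is set. An SMCA UMC memory error without
poison therefore lands in case 1, as the NOTE in the comment says.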

* [tip: ras/core] x86/mce: Cleanup mce_usable_address()
  2023-06-13 14:11 ` [PATCH 3/3] x86/mce: Fixup mce_usable_address() Yazen Ghannam
@ 2023-10-16 13:48   ` tip-bot2 for Yazen Ghannam
  0 siblings, 0 replies; 17+ messages in thread
From: tip-bot2 for Yazen Ghannam @ 2023-10-16 13:48 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Yazen Ghannam, Borislav Petkov (AMD), x86, linux-kernel

The following commit has been merged into the ras/core branch of tip:

Commit-ID:     1bae0cfe4a171ccc5f731426296e45beafa096b8
Gitweb:        https://git.kernel.org/tip/1bae0cfe4a171ccc5f731426296e45beafa096b8
Author:        Yazen Ghannam <yazen.ghannam@amd.com>
AuthorDate:    Tue, 13 Jun 2023 09:11:42 -05:00
Committer:     Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Mon, 16 Oct 2023 15:37:01 +02:00

x86/mce: Cleanup mce_usable_address()

Move Intel-specific checks into a helper function.

Explicitly use "bool" for return type.

No functional change intended.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20230613141142.36801-4-yazen.ghannam@amd.com
---
 arch/x86/include/asm/mce.h         |  2 +-
 arch/x86/kernel/cpu/mce/core.c     | 33 ++++++++---------------------
 arch/x86/kernel/cpu/mce/intel.c    | 20 ++++++++++++++++++-
 arch/x86/kernel/cpu/mce/internal.h |  2 ++-
 4 files changed, 33 insertions(+), 24 deletions(-)

diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
index 180b1cb..6de6e1d 100644
--- a/arch/x86/include/asm/mce.h
+++ b/arch/x86/include/asm/mce.h
@@ -245,7 +245,7 @@ static inline void cmci_recheck(void) {}
 int mce_available(struct cpuinfo_x86 *c);
 bool mce_is_memory_error(struct mce *m);
 bool mce_is_correctable(struct mce *m);
-int mce_usable_address(struct mce *m);
+bool mce_usable_address(struct mce *m);
 
 DECLARE_PER_CPU(unsigned, mce_exception_count);
 DECLARE_PER_CPU(unsigned, mce_poll_count);
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 06c21f5..0214d42 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -453,35 +453,22 @@ static void mce_irq_work_cb(struct irq_work *entry)
 	mce_schedule_work();
 }
 
-/*
- * Check if the address reported by the CPU is in a format we can parse.
- * It would be possible to add code for most other cases, but all would
- * be somewhat complicated (e.g. segment offset would require an instruction
- * parser). So only support physical addresses up to page granularity for now.
- */
-int mce_usable_address(struct mce *m)
+bool mce_usable_address(struct mce *m)
 {
 	if (!(m->status & MCI_STATUS_ADDRV))
-		return 0;
+		return false;
 
-	if (m->cpuvendor == X86_VENDOR_AMD)
+	switch (m->cpuvendor) {
+	case X86_VENDOR_AMD:
 		return amd_mce_usable_address(m);
 
-	/* Checks after this one are Intel/Zhaoxin-specific: */
-	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL &&
-	    boot_cpu_data.x86_vendor != X86_VENDOR_ZHAOXIN)
-		return 1;
-
-	if (!(m->status & MCI_STATUS_MISCV))
-		return 0;
-
-	if (MCI_MISC_ADDR_LSB(m->misc) > PAGE_SHIFT)
-		return 0;
-
-	if (MCI_MISC_ADDR_MODE(m->misc) != MCI_MISC_ADDR_PHYS)
-		return 0;
+	case X86_VENDOR_INTEL:
+	case X86_VENDOR_ZHAOXIN:
+		return intel_mce_usable_address(m);
 
-	return 1;
+	default:
+		return true;
+	}
 }
 EXPORT_SYMBOL_GPL(mce_usable_address);
 
diff --git a/arch/x86/kernel/cpu/mce/intel.c b/arch/x86/kernel/cpu/mce/intel.c
index f532355..52bce53 100644
--- a/arch/x86/kernel/cpu/mce/intel.c
+++ b/arch/x86/kernel/cpu/mce/intel.c
@@ -536,3 +536,23 @@ bool intel_filter_mce(struct mce *m)
 
 	return false;
 }
+
+/*
+ * Check if the address reported by the CPU is in a format we can parse.
+ * It would be possible to add code for most other cases, but all would
+ * be somewhat complicated (e.g. segment offset would require an instruction
+ * parser). So only support physical addresses up to page granularity for now.
+ */
+bool intel_mce_usable_address(struct mce *m)
+{
+	if (!(m->status & MCI_STATUS_MISCV))
+		return false;
+
+	if (MCI_MISC_ADDR_LSB(m->misc) > PAGE_SHIFT)
+		return false;
+
+	if (MCI_MISC_ADDR_MODE(m->misc) != MCI_MISC_ADDR_PHYS)
+		return false;
+
+	return true;
+}
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index a191554..e13a26c 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -49,6 +49,7 @@ void intel_init_cmci(void);
 void intel_init_lmce(void);
 void intel_clear_lmce(void);
 bool intel_filter_mce(struct mce *m);
+bool intel_mce_usable_address(struct mce *m);
 #else
 # define cmci_intel_adjust_timer mce_adjust_timer_default
 static inline bool mce_intel_cmci_poll(void) { return false; }
@@ -58,6 +59,7 @@ static inline void intel_init_cmci(void) { }
 static inline void intel_init_lmce(void) { }
 static inline void intel_clear_lmce(void) { }
 static inline bool intel_filter_mce(struct mce *m) { return false; }
+static inline bool intel_mce_usable_address(struct mce *m) { return false; }
 #endif
 
 void mce_timer_kick(unsigned long interval);

^ permalink raw reply related	[flat|nested] 17+ messages in thread
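
Editorial sketch, not part of the merged commit: the Intel/Zhaoxin helper above
accepts an address only when MCi_MISC is valid, the reported address granularity
is no coarser than a page, and the address mode is "physical". The user-space
mock below spells out those three checks. The macro values are written from
memory of the kernel's definitions (MISCV as status bit 59, LSB in MISC[5:0],
mode in MISC[8:6], physical mode encoded as 2) and should be treated as
assumptions; MOCK_PAGE_SHIFT assumes 4 KiB pages, and mock_mce and
mock_intel_usable_address() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STATUS_MISCV		(1ULL << 59)		/* MCi_MISC holds valid data */
#define MISC_ADDR_LSB(misc)	((misc) & 0x3f)		/* least significant valid address bit */
#define MISC_ADDR_MODE(misc)	(((misc) >> 6) & 7)	/* address mode field */
#define MISC_ADDR_PHYS		2			/* mode value: physical address */
#define MOCK_PAGE_SHIFT		12			/* assumption: 4 KiB pages */

/* Hypothetical stand-in for the struct mce fields the helper reads. */
struct mock_mce {
	uint64_t status;
	uint64_t misc;
};

static bool mock_intel_usable_address(const struct mock_mce *m)
{
	assert(m);

	/* Without MISCV the granularity and mode fields mean nothing. */
	if (!(m->status & STATUS_MISCV))
		return false;

	/* Reject addresses that are only valid at coarser than page granularity. */
	if (MISC_ADDR_LSB(m->misc) > MOCK_PAGE_SHIFT)
		return false;

	/* Only physical addresses can be turned into a page frame directly. */
	if (MISC_ADDR_MODE(m->misc) != MISC_ADDR_PHYS)
		return false;

	return true;
}
```

For example, a record with MISCV set, an LSB of 6 (cache-line granularity) and
physical mode passes, while the same record with an LSB of 13 or with a
different address mode does not.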

* [tip: ras/core] x86/MCE/AMD: Split amd_mce_is_memory_error()
  2023-06-13 14:11 ` [PATCH 1/3] x86/MCE/AMD: Split amd_mce_is_memory_error() Yazen Ghannam
  2023-06-14  2:06   ` Shuai Xue
@ 2023-10-16 13:48   ` tip-bot2 for Yazen Ghannam
  1 sibling, 0 replies; 17+ messages in thread
From: tip-bot2 for Yazen Ghannam @ 2023-10-16 13:48 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Yazen Ghannam, Borislav Petkov (AMD), Shuai Xue, x86,
	linux-kernel

The following commit has been merged into the ras/core branch of tip:

Commit-ID:     495a91d0998367f4f079593f491bdfe8ef06838e
Gitweb:        https://git.kernel.org/tip/495a91d0998367f4f079593f491bdfe8ef06838e
Author:        Yazen Ghannam <yazen.ghannam@amd.com>
AuthorDate:    Tue, 13 Jun 2023 09:11:40 -05:00
Committer:     Borislav Petkov (AMD) <bp@alien8.de>
CommitterDate: Mon, 16 Oct 2023 15:04:53 +02:00

x86/MCE/AMD: Split amd_mce_is_memory_error()

Define helper functions for legacy and SMCA systems in order to reuse
individual checks in later changes.

Describe what each function is checking for, and correct the XEC bitmask
for SMCA.

No functional change intended.

  [ bp: Use "else" in amd_mce_is_memory_error() to make the conditional
    balanced, for readability. ]

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>
Link: https://lore.kernel.org/r/20230613141142.36801-2-yazen.ghannam@amd.com
---
 arch/x86/kernel/cpu/mce/amd.c | 32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/amd.c b/arch/x86/kernel/cpu/mce/amd.c
index c267f43..c069934 100644
--- a/arch/x86/kernel/cpu/mce/amd.c
+++ b/arch/x86/kernel/cpu/mce/amd.c
@@ -713,17 +713,37 @@ void mce_amd_feature_init(struct cpuinfo_x86 *c)
 		deferred_error_interrupt_enable(c);
 }
 
-bool amd_mce_is_memory_error(struct mce *m)
+/*
+ * DRAM ECC errors are reported in the Northbridge (bank 4) with
+ * Extended Error Code 8.
+ */
+static bool legacy_mce_is_memory_error(struct mce *m)
+{
+	return m->bank == 4 && XEC(m->status, 0x1f) == 8;
+}
+
+/*
+ * DRAM ECC errors are reported in Unified Memory Controllers with
+ * Extended Error Code 0.
+ */
+static bool smca_mce_is_memory_error(struct mce *m)
 {
 	enum smca_bank_types bank_type;
-	/* ErrCodeExt[20:16] */
-	u8 xec = (m->status >> 16) & 0x1f;
+
+	if (XEC(m->status, 0x3f))
+		return false;
 
 	bank_type = smca_get_bank_type(m->extcpu, m->bank);
-	if (mce_flags.smca)
-		return (bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2) && xec == 0x0;
 
-	return m->bank == 4 && xec == 0x8;
+	return bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2;
+}
+
+bool amd_mce_is_memory_error(struct mce *m)
+{
+	if (mce_flags.smca)
+		return smca_mce_is_memory_error(m);
+	else
+		return legacy_mce_is_memory_error(m);
 }
 
 static void __log_error(unsigned int bank, u64 status, u64 addr, u64 misc)

^ permalink raw reply related	[flat|nested] 17+ messages in thread
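
Editorial sketch, not part of the merged commit: the commit message mentions
correcting the XEC bitmask for SMCA, and the diff changes the mask from 0x1f to
0x3f. Assuming the SMCA extended error code is six bits wide starting at
MCA_STATUS bit 16, as the 0x3f mask implies, the snippet below shows a status
value where the two masks disagree. mock_xec() is a made-up name that mirrors
the shift-then-mask shape of the XEC() calls in the diff.

```c
#include <assert.h>
#include <stdint.h>

/* Extract the extended error code: drop the low 16 bits, then apply the mask. */
static unsigned int mock_xec(uint64_t status, unsigned int mask)
{
	/* The masks used in the patch are at most six bits wide. */
	assert((mask & ~0x3fu) == 0);

	return (unsigned int)((status >> 16) & mask);
}
```

A status with only bit 21 set yields 0 under the five-bit mask but 0x20 under
the six-bit mask, so a five-bit check for "XEC == 0" would accept an error code
that the six-bit check rejects. The legacy path keeps the 0x1f mask and compares
against 8.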

end of thread, other threads:[~2023-10-16 13:49 UTC | newest]

Thread overview: 17+ messages
2023-06-13 14:11 [PATCH 0/3] Properly check for usable addresses on AMD Yazen Ghannam
2023-06-13 14:11 ` [PATCH 1/3] x86/MCE/AMD: Split amd_mce_is_memory_error() Yazen Ghannam
2023-06-14  2:06   ` Shuai Xue
2023-06-14 15:06     ` Yazen Ghannam
2023-06-15  2:03       ` Shuai Xue
2023-06-15 15:09         ` Yazen Ghannam
2023-10-16 13:48   ` [tip: ras/core] " tip-bot2 for Yazen Ghannam
2023-06-13 14:11 ` [PATCH 2/3] x86/mce: Define amd_mce_usable_address() Yazen Ghannam
2023-06-14  2:19   ` Shuai Xue
2023-06-14 15:09     ` Yazen Ghannam
2023-06-15  2:12       ` Shuai Xue
2023-06-15 15:15         ` Yazen Ghannam
2023-06-16  1:59           ` Shuai Xue
2023-06-16  7:46             ` William Roche
2023-10-16 13:48   ` [tip: ras/core] " tip-bot2 for Yazen Ghannam
2023-06-13 14:11 ` [PATCH 3/3] x86/mce: Fixup mce_usable_address() Yazen Ghannam
2023-10-16 13:48   ` [tip: ras/core] x86/mce: Cleanup mce_usable_address() tip-bot2 for Yazen Ghannam
