linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/2] Incorporate DRAM address in EDAC messages
@ 2025-07-17 16:48 Avadhut Naik
  2025-07-17 16:48 ` [PATCH 1/2] RAS/AMD/ATL: Translate UMC normalized address to DRAM address using PRM Avadhut Naik
  2025-07-17 16:48 ` [PATCH 2/2] EDAC/amd64: Incorporate DRAM Address in EDAC message Avadhut Naik
  0 siblings, 2 replies; 10+ messages in thread
From: Avadhut Naik @ 2025-07-17 16:48 UTC (permalink / raw)
  To: linux-edac; +Cc: bp, yazen.ghannam, john.allen, linux-kernel, avadhut.naik

Currently, the amd64_edac module only provides UMC normalized and system
physical address when a DRAM ECC error occurs. DRAM Address is neither
logged nor exported through tracepoint.

Modern AMD SOCs provide UEFI PRM module that implements various address
translation PRM handlers. These PRM handlers can be leveraged to convert
UMC normalized address into DRAM address at runtime on occurence of a DRAM
ECC error. This translated DRAM address can then be logged and exported
through tracepoints. This set adds the required support to accomplish the
aforementioned.

The first patch adds support in the Address Transaltion Library to invoke
the appropriate PRM handler to perform the translation.

The second patch leverages the support added in the first patch to log
DRAM Address and export it through the RAS tracepoint on occurrence of a
DRAM ECC error.

Avadhut Naik (2):
  RAS/AMD/ATL: Translate UMC normalized address to DRAM address using
    PRM
  EDAC/amd64: Incorporate DRAM Address in EDAC message

 drivers/edac/amd64_edac.c      | 23 ++++++++++++++++++++++
 drivers/edac/amd64_edac.h      |  1 +
 drivers/ras/amd/atl/core.c     |  3 ++-
 drivers/ras/amd/atl/internal.h |  9 +++++++++
 drivers/ras/amd/atl/prm.c      | 36 ++++++++++++++++++++++++++++++----
 drivers/ras/amd/atl/umc.c      | 12 ++++++++++++
 drivers/ras/ras.c              | 18 +++++++++++++++--
 include/linux/ras.h            | 19 +++++++++++++++++-
 8 files changed, 113 insertions(+), 8 deletions(-)


base-commit: 1fb0ddddf5d139089675b86702933cbca992b4d4
-- 
2.43.0


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/2] RAS/AMD/ATL: Translate UMC normalized address to DRAM address using PRM
  2025-07-17 16:48 [PATCH 0/2] Incorporate DRAM address in EDAC messages Avadhut Naik
@ 2025-07-17 16:48 ` Avadhut Naik
  2025-07-28 14:40   ` Yazen Ghannam
  2025-07-17 16:48 ` [PATCH 2/2] EDAC/amd64: Incorporate DRAM Address in EDAC message Avadhut Naik
  1 sibling, 1 reply; 10+ messages in thread
From: Avadhut Naik @ 2025-07-17 16:48 UTC (permalink / raw)
  To: linux-edac; +Cc: bp, yazen.ghannam, john.allen, linux-kernel, avadhut.naik

Modern AMD SOCs like Zen5 provide UEFI PRM module that implements various
address translation PRM handlers.[1] These handlers can be invoked by the
OS or hypervisor at runtime to perform address translations.

On AMD's Zen-based SOCs, Unified Memory Controller (UMC) relative
"normalized" address is reported through MCA_ADDR of UMC SMCA bank type
on occurrence of a DRAM ECC error. This address must be converted into
system physical address and DRAM address to export additional information
about the error.

Add support to convert normalized address into DRAM address through the
appropriate PRM handler. Support for obtaining the system physical address
already exists. Instead of logging the translated DRAM address locally,
register the translating function when the Address Translation library is
initialized. Modules like amd64_edac can then invoke the PRM handler to
add the DRAM address to their error records. Additionally, it can also be
exported through the RAS tracepont.

[1] AMD Family 1Ah Models 00h–0Fh and Models 10h–1Fh ACPI v6.5 Porting Guide, Chapter 22

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
---
 drivers/ras/amd/atl/core.c     |  3 ++-
 drivers/ras/amd/atl/internal.h |  9 +++++++++
 drivers/ras/amd/atl/prm.c      | 36 ++++++++++++++++++++++++++++++----
 drivers/ras/amd/atl/umc.c      | 12 ++++++++++++
 drivers/ras/ras.c              | 18 +++++++++++++++--
 include/linux/ras.h            | 19 +++++++++++++++++-
 6 files changed, 89 insertions(+), 8 deletions(-)

diff --git a/drivers/ras/amd/atl/core.c b/drivers/ras/amd/atl/core.c
index 4197e10993ac..ca1646d030ca 100644
--- a/drivers/ras/amd/atl/core.c
+++ b/drivers/ras/amd/atl/core.c
@@ -207,7 +207,8 @@ static int __init amd_atl_init(void)
 
 	/* Increment this module's recount so that it can't be easily unloaded. */
 	__module_get(THIS_MODULE);
-	amd_atl_register_decoder(convert_umc_mca_addr_to_sys_addr);
+	amd_atl_register_decoder(convert_umc_mca_addr_to_sys_addr,
+				 convert_umc_mca_addr_to_dram_addr);
 
 	pr_info("AMD Address Translation Library initialized\n");
 	return 0;
diff --git a/drivers/ras/amd/atl/internal.h b/drivers/ras/amd/atl/internal.h
index 2b6279d32774..53095310438c 100644
--- a/drivers/ras/amd/atl/internal.h
+++ b/drivers/ras/amd/atl/internal.h
@@ -279,18 +279,27 @@ int dehash_address(struct addr_ctx *ctx);
 
 unsigned long norm_to_sys_addr(u8 socket_id, u8 die_id, u8 coh_st_inst_id, unsigned long addr);
 unsigned long convert_umc_mca_addr_to_sys_addr(struct atl_err *err);
+int convert_umc_mca_addr_to_dram_addr(struct atl_err *err, struct dram_addr *dram_addr);
 
 u64 add_base_and_hole(struct addr_ctx *ctx, u64 addr);
 u64 remove_base_and_hole(struct addr_ctx *ctx, u64 addr);
 
 #ifdef CONFIG_AMD_ATL_PRM
 unsigned long prm_umc_norm_to_sys_addr(u8 socket_id, u64 umc_bank_inst_id, unsigned long addr);
+int prm_umc_norm_to_dram_addr(u8 socket_id, u64 bank_id,
+			      unsigned long addr, struct dram_addr *dram_addr);
 #else
 static inline unsigned long prm_umc_norm_to_sys_addr(u8 socket_id, u64 umc_bank_inst_id,
 						     unsigned long addr)
 {
        return -ENODEV;
 }
+
+static inline int prm_umc_norm_to_dram_addr(u8 socket_id, u64 bank_id,
+					    unsigned long addr, struct dram_addr *dram_addr)
+{
+	return -ENODEV;
+}
 #endif
 
 /*
diff --git a/drivers/ras/amd/atl/prm.c b/drivers/ras/amd/atl/prm.c
index 0931a20d213b..9bbaf8c85da0 100644
--- a/drivers/ras/amd/atl/prm.c
+++ b/drivers/ras/amd/atl/prm.c
@@ -19,10 +19,11 @@
 #include <linux/prmt.h>
 
 /*
- * PRM parameter buffer - normalized to system physical address, as described
- * in the "PRM Parameter Buffer" section of the AMD ACPI Porting Guide.
+ * PRM parameter buffer - normalized to system physical address and normalized
+ * to DRAM address, as described in the "PRM Parameter Buffer" section of the
+ * AMD ACPI Porting Guide.
  */
-struct norm_to_sys_param_buf {
+struct prm_parameter_buffer {
 	u64 norm_addr;
 	u8 socket;
 	u64 bank_id;
@@ -33,9 +34,13 @@ static const guid_t norm_to_sys_guid = GUID_INIT(0xE7180659, 0xA65D, 0x451D,
 						 0x92, 0xCD, 0x2B, 0x56, 0xF1,
 						 0x2B, 0xEB, 0xA6);
 
+static const guid_t norm_to_dram_guid = GUID_INIT(0x7626C6AE, 0xF973, 0x429C,
+						 0xA9, 0x1C, 0x10, 0x7D, 0x7B,
+						 0xE2, 0x98, 0xB0);
+
 unsigned long prm_umc_norm_to_sys_addr(u8 socket_id, u64 bank_id, unsigned long addr)
 {
-	struct norm_to_sys_param_buf p_buf;
+	struct prm_parameter_buffer p_buf;
 	unsigned long ret_addr;
 	int ret;
 
@@ -55,3 +60,26 @@ unsigned long prm_umc_norm_to_sys_addr(u8 socket_id, u64 bank_id, unsigned long
 
 	return ret;
 }
+
+int prm_umc_norm_to_dram_addr(u8 socket_id, u64 bank_id,
+			      unsigned long addr, struct dram_addr *dram_addr)
+{
+	struct prm_parameter_buffer p_buf;
+	int ret;
+
+	p_buf.norm_addr	= addr;
+	p_buf.socket	= socket_id;
+	p_buf.bank_id	= bank_id;
+	p_buf.out_buf	= dram_addr;
+
+	ret = acpi_call_prm_handler(norm_to_dram_guid, &p_buf);
+	if (!ret)
+		return ret;
+
+	if (ret == -ENODEV)
+		pr_debug("PRM module/handler not available.\n");
+	else
+		pr_notice_once("PRM DRAM Address Translation failed.\n");
+
+	return ret;
+}
diff --git a/drivers/ras/amd/atl/umc.c b/drivers/ras/amd/atl/umc.c
index 6e072b7667e9..df6accae8929 100644
--- a/drivers/ras/amd/atl/umc.c
+++ b/drivers/ras/amd/atl/umc.c
@@ -427,3 +427,15 @@ unsigned long convert_umc_mca_addr_to_sys_addr(struct atl_err *err)
 
 	return norm_to_sys_addr(socket_id, die_id, coh_st_inst_id, addr);
 }
+
+int convert_umc_mca_addr_to_dram_addr(struct atl_err *err, struct dram_addr *dram_addr)
+{
+	u8 socket_id = topology_physical_package_id(err->cpu);
+	unsigned long addr = get_addr(err->addr);
+	u64 bank_id = err->ipid;
+	int ret;
+
+	ret = prm_umc_norm_to_dram_addr(socket_id, bank_id, addr, dram_addr);
+
+	return ret;
+}
diff --git a/drivers/ras/ras.c b/drivers/ras/ras.c
index a6e4792a1b2e..cae6388d41be 100644
--- a/drivers/ras/ras.c
+++ b/drivers/ras/ras.c
@@ -19,15 +19,20 @@
  */
 static unsigned long (*amd_atl_umc_na_to_spa)(struct atl_err *err);
 
-void amd_atl_register_decoder(unsigned long (*f)(struct atl_err *))
+static int (*amd_atl_umc_na_to_dram_addr)(struct atl_err *err, struct dram_addr *dram_addr);
+
+void amd_atl_register_decoder(unsigned long (*f1)(struct atl_err *),
+			      int (*f2)(struct atl_err *, struct dram_addr *))
 {
-	amd_atl_umc_na_to_spa = f;
+	amd_atl_umc_na_to_spa = f1;
+	amd_atl_umc_na_to_dram_addr = f2;
 }
 EXPORT_SYMBOL_GPL(amd_atl_register_decoder);
 
 void amd_atl_unregister_decoder(void)
 {
 	amd_atl_umc_na_to_spa = NULL;
+	amd_atl_umc_na_to_dram_addr = NULL;
 }
 EXPORT_SYMBOL_GPL(amd_atl_unregister_decoder);
 
@@ -39,6 +44,15 @@ unsigned long amd_convert_umc_mca_addr_to_sys_addr(struct atl_err *err)
 	return amd_atl_umc_na_to_spa(err);
 }
 EXPORT_SYMBOL_GPL(amd_convert_umc_mca_addr_to_sys_addr);
+
+int amd_convert_umc_mca_addr_to_dram_addr(struct atl_err *err, struct dram_addr *dram_addr)
+{
+	if (!amd_atl_umc_na_to_dram_addr)
+		return -EINVAL;
+
+	return amd_atl_umc_na_to_dram_addr(err, dram_addr);
+}
+EXPORT_SYMBOL_GPL(amd_convert_umc_mca_addr_to_dram_addr);
 #endif /* CONFIG_AMD_ATL */
 
 #define CREATE_TRACE_POINTS
diff --git a/include/linux/ras.h b/include/linux/ras.h
index a64182bc72ad..feb53f8470b0 100644
--- a/include/linux/ras.h
+++ b/include/linux/ras.h
@@ -42,15 +42,32 @@ struct atl_err {
 	u32 cpu;
 };
 
+struct dram_addr {
+	u8 chip_select;
+	u8 bank_group;
+	u8 bank_addr;
+	u32 row_addr;
+	u16 col_addr;
+	u8 rank_mul;
+	u8 sub_ch;
+} __packed;
+
 #if IS_ENABLED(CONFIG_AMD_ATL)
-void amd_atl_register_decoder(unsigned long (*f)(struct atl_err *));
+void amd_atl_register_decoder(unsigned long (*f1)(struct atl_err *),
+			      int (*f2)(struct atl_err *, struct dram_addr *));
 void amd_atl_unregister_decoder(void);
 void amd_retire_dram_row(struct atl_err *err);
 unsigned long amd_convert_umc_mca_addr_to_sys_addr(struct atl_err *err);
+int amd_convert_umc_mca_addr_to_dram_addr(struct atl_err *err, struct dram_addr *dram_addr);
 #else
 static inline void amd_retire_dram_row(struct atl_err *err) { }
 static inline unsigned long
 amd_convert_umc_mca_addr_to_sys_addr(struct atl_err *err) { return -EINVAL; }
+static inline int amd_convert_umc_mca_addr_to_dram_addr(struct atl_err *err,
+							struct dram_addr *dram_addr)
+{
+	return -EINVAL;
+}
 #endif /* CONFIG_AMD_ATL */
 
 #endif /* __RAS_H__ */
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* [PATCH 2/2] EDAC/amd64: Incorporate DRAM Address in EDAC message
  2025-07-17 16:48 [PATCH 0/2] Incorporate DRAM address in EDAC messages Avadhut Naik
  2025-07-17 16:48 ` [PATCH 1/2] RAS/AMD/ATL: Translate UMC normalized address to DRAM address using PRM Avadhut Naik
@ 2025-07-17 16:48 ` Avadhut Naik
  2025-07-28 15:14   ` Yazen Ghannam
  1 sibling, 1 reply; 10+ messages in thread
From: Avadhut Naik @ 2025-07-17 16:48 UTC (permalink / raw)
  To: linux-edac; +Cc: bp, yazen.ghannam, john.allen, linux-kernel, avadhut.naik

Currently, the amd64_edac module only provides UMC normalized and system
physical address when a DRAM ECC error occurs. DRAM Address on which the
error has occurred is not provided since the support required to translate
the normalized address into DRAM address has not yet been implemented.

This support however, has now been implemented through an earlier patch
(RAS/AMD/ATL: Translate UMC normalized address to DRAM address using PRM)
and DRAM address, which provides additional debugging information relating
to the error received, can now be logged by the module.

Add the required support to log DRAM address on which the error has been
received in dmesg and through the RAS tracepoint.

Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
---
 drivers/edac/amd64_edac.c | 23 +++++++++++++++++++++++
 drivers/edac/amd64_edac.h |  1 +
 2 files changed, 24 insertions(+)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 07f1e9dc1ca7..36b46cd81bb2 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -2724,6 +2724,22 @@ static void __log_ecc_error(struct mem_ctl_info *mci, struct err_info *err,
 	switch (err->err_code) {
 	case DECODE_OK:
 		string = "";
+		if (err->dram_addr) {
+			char s[100];
+
+			memset(s, 0, 100);
+			sprintf(s, "Cs: 0x%x Bank Grp: 0x%x Bank Addr: 0x%x"
+					   " Row: 0x%x Column: 0x%x"
+					   " RankMul: 0x%x SubChannel: 0x%x",
+					   err->dram_addr->chip_select,
+					   err->dram_addr->bank_group,
+					   err->dram_addr->bank_addr,
+					   err->dram_addr->row_addr,
+					   err->dram_addr->col_addr,
+					   err->dram_addr->rank_mul,
+					   err->dram_addr->sub_ch);
+			string = s;
+		}
 		break;
 	case ERR_NODE:
 		string = "Failed to map error addr to a node";
@@ -2808,11 +2824,13 @@ static void umc_get_err_info(struct mce *m, struct err_info *err)
 static void decode_umc_error(int node_id, struct mce *m)
 {
 	u8 ecc_type = (m->status >> 45) & 0x3;
+	struct dram_addr dram_addr;
 	struct mem_ctl_info *mci;
 	unsigned long sys_addr;
 	struct amd64_pvt *pvt;
 	struct atl_err a_err;
 	struct err_info err;
+	int ret;
 
 	node_id = fixup_node_id(node_id, m);
 
@@ -2822,6 +2840,7 @@ static void decode_umc_error(int node_id, struct mce *m)
 
 	pvt = mci->pvt_info;
 
+	memset(&dram_addr, 0, sizeof(dram_addr));
 	memset(&err, 0, sizeof(err));
 
 	if (m->status & MCI_STATUS_DEFERRED)
@@ -2853,6 +2872,10 @@ static void decode_umc_error(int node_id, struct mce *m)
 		goto log_error;
 	}
 
+	ret = amd_convert_umc_mca_addr_to_dram_addr(&a_err, &dram_addr);
+	if (!ret)
+		err.dram_addr = &dram_addr;
+
 	error_address_to_page_and_offset(sys_addr, &err);
 
 log_error:
diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
index 17228d07de4c..88b0b8425ab3 100644
--- a/drivers/edac/amd64_edac.h
+++ b/drivers/edac/amd64_edac.h
@@ -399,6 +399,7 @@ struct err_info {
 	u16 syndrome;
 	u32 page;
 	u32 offset;
+	struct dram_addr *dram_addr;
 };
 
 static inline u32 get_umc_base(u8 channel)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] RAS/AMD/ATL: Translate UMC normalized address to DRAM address using PRM
  2025-07-17 16:48 ` [PATCH 1/2] RAS/AMD/ATL: Translate UMC normalized address to DRAM address using PRM Avadhut Naik
@ 2025-07-28 14:40   ` Yazen Ghannam
  2025-08-06 18:21     ` Naik, Avadhut
  0 siblings, 1 reply; 10+ messages in thread
From: Yazen Ghannam @ 2025-07-28 14:40 UTC (permalink / raw)
  To: Avadhut Naik; +Cc: linux-edac, bp, john.allen, linux-kernel

On Thu, Jul 17, 2025 at 04:48:42PM +0000, Avadhut Naik wrote:
> Modern AMD SOCs like Zen5 provide UEFI PRM module that implements various

Minor nit: Zen5 is a core revision and doesn't represent an SoC.

You could say "Recent AMD systems...". A platform (FW/OS) interface like
PRM doesn't depend on hardware revisions.

> address translation PRM handlers.[1] These handlers can be invoked by the
> OS or hypervisor at runtime to perform address translations.
> 
> On AMD's Zen-based SOCs, Unified Memory Controller (UMC) relative
> "normalized" address is reported through MCA_ADDR of UMC SMCA bank type
> on occurrence of a DRAM ECC error. This address must be converted into
> system physical address and DRAM address to export additional information
> about the error.
> 
> Add support to convert normalized address into DRAM address through the
> appropriate PRM handler. Support for obtaining the system physical address

It's not necessary to mention that the SPA translation already exists.

> already exists. Instead of logging the translated DRAM address locally,
> register the translating function when the Address Translation library is
> initialized. Modules like amd64_edac can then invoke the PRM handler to
> add the DRAM address to their error records. Additionally, it can also be
> exported through the RAS tracepont.
> 
> [1] AMD Family 1Ah Models 00h–0Fh and Models 10h–1Fh ACPI v6.5 Porting Guide, Chapter 22

Could this be a link?
https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#links-to-documentation

> 
> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
> ---
>  drivers/ras/amd/atl/core.c     |  3 ++-
>  drivers/ras/amd/atl/internal.h |  9 +++++++++
>  drivers/ras/amd/atl/prm.c      | 36 ++++++++++++++++++++++++++++++----
>  drivers/ras/amd/atl/umc.c      | 12 ++++++++++++
>  drivers/ras/ras.c              | 18 +++++++++++++++--
>  include/linux/ras.h            | 19 +++++++++++++++++-
>  6 files changed, 89 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/ras/amd/atl/core.c b/drivers/ras/amd/atl/core.c
> index 4197e10993ac..ca1646d030ca 100644
> --- a/drivers/ras/amd/atl/core.c
> +++ b/drivers/ras/amd/atl/core.c
> @@ -207,7 +207,8 @@ static int __init amd_atl_init(void)
>  
>  	/* Increment this module's recount so that it can't be easily unloaded. */
>  	__module_get(THIS_MODULE);
> -	amd_atl_register_decoder(convert_umc_mca_addr_to_sys_addr);
> +	amd_atl_register_decoder(convert_umc_mca_addr_to_sys_addr,
> +				 convert_umc_mca_addr_to_dram_addr);
>  
>  	pr_info("AMD Address Translation Library initialized\n");
>  	return 0;
> diff --git a/drivers/ras/amd/atl/internal.h b/drivers/ras/amd/atl/internal.h
> index 2b6279d32774..53095310438c 100644
> --- a/drivers/ras/amd/atl/internal.h
> +++ b/drivers/ras/amd/atl/internal.h
> @@ -279,18 +279,27 @@ int dehash_address(struct addr_ctx *ctx);
>  
>  unsigned long norm_to_sys_addr(u8 socket_id, u8 die_id, u8 coh_st_inst_id, unsigned long addr);
>  unsigned long convert_umc_mca_addr_to_sys_addr(struct atl_err *err);
> +int convert_umc_mca_addr_to_dram_addr(struct atl_err *err, struct dram_addr *dram_addr);
>  
>  u64 add_base_and_hole(struct addr_ctx *ctx, u64 addr);
>  u64 remove_base_and_hole(struct addr_ctx *ctx, u64 addr);
>  
>  #ifdef CONFIG_AMD_ATL_PRM
>  unsigned long prm_umc_norm_to_sys_addr(u8 socket_id, u64 umc_bank_inst_id, unsigned long addr);
> +int prm_umc_norm_to_dram_addr(u8 socket_id, u64 bank_id,
> +			      unsigned long addr, struct dram_addr *dram_addr);
>  #else
>  static inline unsigned long prm_umc_norm_to_sys_addr(u8 socket_id, u64 umc_bank_inst_id,
>  						     unsigned long addr)
>  {
>         return -ENODEV;
>  }
> +
> +static inline int prm_umc_norm_to_dram_addr(u8 socket_id, u64 bank_id,
> +					    unsigned long addr, struct dram_addr *dram_addr)
> +{
> +	return -ENODEV;
> +}
>  #endif
>  
>  /*
> diff --git a/drivers/ras/amd/atl/prm.c b/drivers/ras/amd/atl/prm.c
> index 0931a20d213b..9bbaf8c85da0 100644
> --- a/drivers/ras/amd/atl/prm.c
> +++ b/drivers/ras/amd/atl/prm.c
> @@ -19,10 +19,11 @@
>  #include <linux/prmt.h>
>  
>  /*
> - * PRM parameter buffer - normalized to system physical address, as described
> - * in the "PRM Parameter Buffer" section of the AMD ACPI Porting Guide.
> + * PRM parameter buffer - normalized to system physical address and normalized
> + * to DRAM address, as described in the "PRM Parameter Buffer" section of the
> + * AMD ACPI Porting Guide.
>   */
> -struct norm_to_sys_param_buf {
> +struct prm_parameter_buffer {
>  	u64 norm_addr;
>  	u8 socket;
>  	u64 bank_id;
> @@ -33,9 +34,13 @@ static const guid_t norm_to_sys_guid = GUID_INIT(0xE7180659, 0xA65D, 0x451D,
>  						 0x92, 0xCD, 0x2B, 0x56, 0xF1,
>  						 0x2B, 0xEB, 0xA6);
>  
> +static const guid_t norm_to_dram_guid = GUID_INIT(0x7626C6AE, 0xF973, 0x429C,
> +						 0xA9, 0x1C, 0x10, 0x7D, 0x7B,
> +						 0xE2, 0x98, 0xB0);
> +
>  unsigned long prm_umc_norm_to_sys_addr(u8 socket_id, u64 bank_id, unsigned long addr)
>  {
> -	struct norm_to_sys_param_buf p_buf;
> +	struct prm_parameter_buffer p_buf;
>  	unsigned long ret_addr;
>  	int ret;
>  
> @@ -55,3 +60,26 @@ unsigned long prm_umc_norm_to_sys_addr(u8 socket_id, u64 bank_id, unsigned long
>  
>  	return ret;
>  }
> +
> +int prm_umc_norm_to_dram_addr(u8 socket_id, u64 bank_id,
> +			      unsigned long addr, struct dram_addr *dram_addr)
> +{
> +	struct prm_parameter_buffer p_buf;
> +	int ret;
> +
> +	p_buf.norm_addr	= addr;
> +	p_buf.socket	= socket_id;
> +	p_buf.bank_id	= bank_id;
> +	p_buf.out_buf	= dram_addr;
> +
> +	ret = acpi_call_prm_handler(norm_to_dram_guid, &p_buf);
> +	if (!ret)
> +		return ret;
> +
> +	if (ret == -ENODEV)
> +		pr_debug("PRM module/handler not available.\n");
> +	else
> +		pr_notice_once("PRM DRAM Address Translation failed.\n");
> +
> +	return ret;
> +}
> diff --git a/drivers/ras/amd/atl/umc.c b/drivers/ras/amd/atl/umc.c
> index 6e072b7667e9..df6accae8929 100644
> --- a/drivers/ras/amd/atl/umc.c
> +++ b/drivers/ras/amd/atl/umc.c
> @@ -427,3 +427,15 @@ unsigned long convert_umc_mca_addr_to_sys_addr(struct atl_err *err)
>  
>  	return norm_to_sys_addr(socket_id, die_id, coh_st_inst_id, addr);
>  }
> +
> +int convert_umc_mca_addr_to_dram_addr(struct atl_err *err, struct dram_addr *dram_addr)
> +{
> +	u8 socket_id = topology_physical_package_id(err->cpu);
> +	unsigned long addr = get_addr(err->addr);
> +	u64 bank_id = err->ipid;
> +	int ret;
> +
> +	ret = prm_umc_norm_to_dram_addr(socket_id, bank_id, addr, dram_addr);
> +
> +	return ret;

The 'ret' variable is not necessary. Just return the function call.

You can go even further and not use any new variables. Not sure how
it'll be aesthetically, but that was my first thought.

> +}
> diff --git a/drivers/ras/ras.c b/drivers/ras/ras.c
> index a6e4792a1b2e..cae6388d41be 100644
> --- a/drivers/ras/ras.c
> +++ b/drivers/ras/ras.c
> @@ -19,15 +19,20 @@
>   */
>  static unsigned long (*amd_atl_umc_na_to_spa)(struct atl_err *err);
>  
> -void amd_atl_register_decoder(unsigned long (*f)(struct atl_err *))
> +static int (*amd_atl_umc_na_to_dram_addr)(struct atl_err *err, struct dram_addr *dram_addr);
> +
> +void amd_atl_register_decoder(unsigned long (*f1)(struct atl_err *),
> +			      int (*f2)(struct atl_err *, struct dram_addr *))
>  {
> -	amd_atl_umc_na_to_spa = f;
> +	amd_atl_umc_na_to_spa = f1;
> +	amd_atl_umc_na_to_dram_addr = f2;
>  }
>  EXPORT_SYMBOL_GPL(amd_atl_register_decoder);
>  
>  void amd_atl_unregister_decoder(void)
>  {
>  	amd_atl_umc_na_to_spa = NULL;
> +	amd_atl_umc_na_to_dram_addr = NULL;
>  }
>  EXPORT_SYMBOL_GPL(amd_atl_unregister_decoder);
>  
> @@ -39,6 +44,15 @@ unsigned long amd_convert_umc_mca_addr_to_sys_addr(struct atl_err *err)
>  	return amd_atl_umc_na_to_spa(err);
>  }
>  EXPORT_SYMBOL_GPL(amd_convert_umc_mca_addr_to_sys_addr);
> +
> +int amd_convert_umc_mca_addr_to_dram_addr(struct atl_err *err, struct dram_addr *dram_addr)
> +{
> +	if (!amd_atl_umc_na_to_dram_addr)
> +		return -EINVAL;
> +
> +	return amd_atl_umc_na_to_dram_addr(err, dram_addr);
> +}
> +EXPORT_SYMBOL_GPL(amd_convert_umc_mca_addr_to_dram_addr);
>  #endif /* CONFIG_AMD_ATL */
>  
>  #define CREATE_TRACE_POINTS
> diff --git a/include/linux/ras.h b/include/linux/ras.h
> index a64182bc72ad..feb53f8470b0 100644
> --- a/include/linux/ras.h
> +++ b/include/linux/ras.h
> @@ -42,15 +42,32 @@ struct atl_err {
>  	u32 cpu;
>  };
>  
> +struct dram_addr {

There's another struct in the kernel called 'dram_addr'.

It may help to make this more unique with a prefix: atl_dram_addr.

> +	u8 chip_select;
> +	u8 bank_group;
> +	u8 bank_addr;
> +	u32 row_addr;
> +	u16 col_addr;
> +	u8 rank_mul;
> +	u8 sub_ch;
> +} __packed;
> +
>  #if IS_ENABLED(CONFIG_AMD_ATL)
> -void amd_atl_register_decoder(unsigned long (*f)(struct atl_err *));
> +void amd_atl_register_decoder(unsigned long (*f1)(struct atl_err *),
> +			      int (*f2)(struct atl_err *, struct dram_addr *));
>  void amd_atl_unregister_decoder(void);
>  void amd_retire_dram_row(struct atl_err *err);
>  unsigned long amd_convert_umc_mca_addr_to_sys_addr(struct atl_err *err);
> +int amd_convert_umc_mca_addr_to_dram_addr(struct atl_err *err, struct dram_addr *dram_addr);
>  #else
>  static inline void amd_retire_dram_row(struct atl_err *err) { }
>  static inline unsigned long
>  amd_convert_umc_mca_addr_to_sys_addr(struct atl_err *err) { return -EINVAL; }
> +static inline int amd_convert_umc_mca_addr_to_dram_addr(struct atl_err *err,
> +							struct dram_addr *dram_addr)
> +{
> +	return -EINVAL;
> +}
>  #endif /* CONFIG_AMD_ATL */
>  
>  #endif /* __RAS_H__ */
> -- 

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] EDAC/amd64: Incorporate DRAM Address in EDAC message
  2025-07-17 16:48 ` [PATCH 2/2] EDAC/amd64: Incorporate DRAM Address in EDAC message Avadhut Naik
@ 2025-07-28 15:14   ` Yazen Ghannam
  2025-08-06 21:08     ` Naik, Avadhut
  0 siblings, 1 reply; 10+ messages in thread
From: Yazen Ghannam @ 2025-07-28 15:14 UTC (permalink / raw)
  To: Avadhut Naik; +Cc: linux-edac, bp, john.allen, linux-kernel

On Thu, Jul 17, 2025 at 04:48:43PM +0000, Avadhut Naik wrote:
> Currently, the amd64_edac module only provides UMC normalized and system

The amd64_edac module provides data for the EDAC interface. This is only
the system physical address (PFN + offset). The UMC normalized address
comes from MCA error decoding.

> physical address when a DRAM ECC error occurs. DRAM Address on which the
> error has occurred is not provided since the support required to translate
> the normalized address into DRAM address has not yet been implemented.

I don't think this last sentence is necessary.

> 
> This support however, has now been implemented through an earlier patch
> (RAS/AMD/ATL: Translate UMC normalized address to DRAM address using PRM)
> and DRAM address, which provides additional debugging information relating
> to the error received, can now be logged by the module.
> 
> Add the required support to log DRAM address on which the error has been
> received in dmesg and through the RAS tracepoint.

These last two paragraphs could be something like this:

"Use the new PRM call in the AMD Address Translation Library to gather the
DRAM address of an error. Include this data in the EDAC 'string' so it
is available in the kernel messages and EDAC trace event." 

> 
> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
> ---
>  drivers/edac/amd64_edac.c | 23 +++++++++++++++++++++++
>  drivers/edac/amd64_edac.h |  1 +
>  2 files changed, 24 insertions(+)
> 
> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
> index 07f1e9dc1ca7..36b46cd81bb2 100644
> --- a/drivers/edac/amd64_edac.c
> +++ b/drivers/edac/amd64_edac.c
> @@ -2724,6 +2724,22 @@ static void __log_ecc_error(struct mem_ctl_info *mci, struct err_info *err,
>  	switch (err->err_code) {
>  	case DECODE_OK:
>  		string = "";
> +		if (err->dram_addr) {
> +			char s[100];
> +
> +			memset(s, 0, 100);
> +			sprintf(s, "Cs: 0x%x Bank Grp: 0x%x Bank Addr: 0x%x"
> +					   " Row: 0x%x Column: 0x%x"
> +					   " RankMul: 0x%x SubChannel: 0x%x",

There's a checkpatch warning here about splitting user-visible strings.

Why not use scnprintf() or the like?

> +					   err->dram_addr->chip_select,
> +					   err->dram_addr->bank_group,
> +					   err->dram_addr->bank_addr,
> +					   err->dram_addr->row_addr,
> +					   err->dram_addr->col_addr,
> +					   err->dram_addr->rank_mul,
> +					   err->dram_addr->sub_ch);
> +			string = s;

Looking at the description for 'edac_mc_handle_error()', it seems that
"other_detail" would be more appropriate for this info.

I do think we should consider updating the EDAC interface if multiple
vendors are gathering this data.

> +		}
>  		break;
>  	case ERR_NODE:
>  		string = "Failed to map error addr to a node";
> @@ -2808,11 +2824,13 @@ static void umc_get_err_info(struct mce *m, struct err_info *err)
>  static void decode_umc_error(int node_id, struct mce *m)
>  {
>  	u8 ecc_type = (m->status >> 45) & 0x3;
> +	struct dram_addr dram_addr;
>  	struct mem_ctl_info *mci;
>  	unsigned long sys_addr;
>  	struct amd64_pvt *pvt;
>  	struct atl_err a_err;
>  	struct err_info err;
> +	int ret;
>  
>  	node_id = fixup_node_id(node_id, m);
>  
> @@ -2822,6 +2840,7 @@ static void decode_umc_error(int node_id, struct mce *m)
>  
>  	pvt = mci->pvt_info;
>  
> +	memset(&dram_addr, 0, sizeof(dram_addr));
>  	memset(&err, 0, sizeof(err));
>  
>  	if (m->status & MCI_STATUS_DEFERRED)
> @@ -2853,6 +2872,10 @@ static void decode_umc_error(int node_id, struct mce *m)
>  		goto log_error;
>  	}
>  
> +	ret = amd_convert_umc_mca_addr_to_dram_addr(&a_err, &dram_addr);
> +	if (!ret)
> +		err.dram_addr = &dram_addr;

I feel like it is not necessary to pass a second struct if it is already
contained in another.

You could just clear (or not set) that field if an error occurs.

> +
>  	error_address_to_page_and_offset(sys_addr, &err);
>  
>  log_error:
> diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
> index 17228d07de4c..88b0b8425ab3 100644
> --- a/drivers/edac/amd64_edac.h
> +++ b/drivers/edac/amd64_edac.h
> @@ -399,6 +399,7 @@ struct err_info {
>  	u16 syndrome;
>  	u32 page;
>  	u32 offset;
> +	struct dram_addr *dram_addr;
>  };
>  
>  static inline u32 get_umc_base(u8 channel)
> -- 

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 1/2] RAS/AMD/ATL: Translate UMC normalized address to DRAM address using PRM
  2025-07-28 14:40   ` Yazen Ghannam
@ 2025-08-06 18:21     ` Naik, Avadhut
  0 siblings, 0 replies; 10+ messages in thread
From: Naik, Avadhut @ 2025-08-06 18:21 UTC (permalink / raw)
  To: Yazen Ghannam, Avadhut Naik; +Cc: linux-edac, bp, john.allen, linux-kernel

Hi,

On 7/28/2025 09:40, Yazen Ghannam wrote:
> On Thu, Jul 17, 2025 at 04:48:42PM +0000, Avadhut Naik wrote:
>> Modern AMD SOCs like Zen5 provide UEFI PRM module that implements various
> 
> Minor nit: Zen5 is a core revision and doesn't represent an SoC.
> 
> You could say "Recent AMD systems...". A platform (FW/OS) interface like
> PRM doesn't depend on hardware revisions.
> 
Will do!

>> address translation PRM handlers.[1] These handlers can be invoked by the
>> OS or hypervisor at runtime to perform address translations.
>>
>> On AMD's Zen-based SOCs, Unified Memory Controller (UMC) relative
>> "normalized" address is reported through MCA_ADDR of UMC SMCA bank type
>> on occurrence of a DRAM ECC error. This address must be converted into
>> system physical address and DRAM address to export additional information
>> about the error.
>>
>> Add support to convert normalized address into DRAM address through the
>> appropriate PRM handler. Support for obtaining the system physical address
> 
> It's not necessary to mention that the SPA translation already exists.
>
Okay!
 
>> already exists. Instead of logging the translated DRAM address locally,
>> register the translating function when the Address Translation library is
>> initialized. Modules like amd64_edac can then invoke the PRM handler to
>> add the DRAM address to their error records. Additionally, it can also be
>> exported through the RAS tracepont.
>>
>> [1] AMD Family 1Ah Models 00h–0Fh and Models 10h–1Fh ACPI v6.5 Porting Guide, Chapter 22
> 
> Could this be a link?
> https://www.kernel.org/doc/html/latest/process/maintainer-tip.html#links-to-documentation
> 
Will take a look.
>>
>> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
>> ---
>>  drivers/ras/amd/atl/core.c     |  3 ++-
>>  drivers/ras/amd/atl/internal.h |  9 +++++++++
>>  drivers/ras/amd/atl/prm.c      | 36 ++++++++++++++++++++++++++++++----
>>  drivers/ras/amd/atl/umc.c      | 12 ++++++++++++
>>  drivers/ras/ras.c              | 18 +++++++++++++++--
>>  include/linux/ras.h            | 19 +++++++++++++++++-
>>  6 files changed, 89 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/ras/amd/atl/core.c b/drivers/ras/amd/atl/core.c
>> index 4197e10993ac..ca1646d030ca 100644
>> --- a/drivers/ras/amd/atl/core.c
>> +++ b/drivers/ras/amd/atl/core.c
>> @@ -207,7 +207,8 @@ static int __init amd_atl_init(void)
>>  
>>  	/* Increment this module's recount so that it can't be easily unloaded. */
>>  	__module_get(THIS_MODULE);
>> -	amd_atl_register_decoder(convert_umc_mca_addr_to_sys_addr);
>> +	amd_atl_register_decoder(convert_umc_mca_addr_to_sys_addr,
>> +				 convert_umc_mca_addr_to_dram_addr);
>>  
>>  	pr_info("AMD Address Translation Library initialized\n");
>>  	return 0;
>> diff --git a/drivers/ras/amd/atl/internal.h b/drivers/ras/amd/atl/internal.h
>> index 2b6279d32774..53095310438c 100644
>> --- a/drivers/ras/amd/atl/internal.h
>> +++ b/drivers/ras/amd/atl/internal.h
>> @@ -279,18 +279,27 @@ int dehash_address(struct addr_ctx *ctx);
>>  
>>  unsigned long norm_to_sys_addr(u8 socket_id, u8 die_id, u8 coh_st_inst_id, unsigned long addr);
>>  unsigned long convert_umc_mca_addr_to_sys_addr(struct atl_err *err);
>> +int convert_umc_mca_addr_to_dram_addr(struct atl_err *err, struct dram_addr *dram_addr);
>>  
>>  u64 add_base_and_hole(struct addr_ctx *ctx, u64 addr);
>>  u64 remove_base_and_hole(struct addr_ctx *ctx, u64 addr);
>>  
>>  #ifdef CONFIG_AMD_ATL_PRM
>>  unsigned long prm_umc_norm_to_sys_addr(u8 socket_id, u64 umc_bank_inst_id, unsigned long addr);
>> +int prm_umc_norm_to_dram_addr(u8 socket_id, u64 bank_id,
>> +			      unsigned long addr, struct dram_addr *dram_addr);
>>  #else
>>  static inline unsigned long prm_umc_norm_to_sys_addr(u8 socket_id, u64 umc_bank_inst_id,
>>  						     unsigned long addr)
>>  {
>>         return -ENODEV;
>>  }
>> +
>> +static inline int prm_umc_norm_to_dram_addr(u8 socket_id, u64 bank_id,
>> +					    unsigned long addr, struct dram_addr *dram_addr)
>> +{
>> +	return -ENODEV;
>> +}
>>  #endif
>>  
>>  /*
>> diff --git a/drivers/ras/amd/atl/prm.c b/drivers/ras/amd/atl/prm.c
>> index 0931a20d213b..9bbaf8c85da0 100644
>> --- a/drivers/ras/amd/atl/prm.c
>> +++ b/drivers/ras/amd/atl/prm.c
>> @@ -19,10 +19,11 @@
>>  #include <linux/prmt.h>
>>  
>>  /*
>> - * PRM parameter buffer - normalized to system physical address, as described
>> - * in the "PRM Parameter Buffer" section of the AMD ACPI Porting Guide.
>> + * PRM parameter buffer - normalized to system physical address and normalized
>> + * to DRAM address, as described in the "PRM Parameter Buffer" section of the
>> + * AMD ACPI Porting Guide.
>>   */
>> -struct norm_to_sys_param_buf {
>> +struct prm_parameter_buffer {
>>  	u64 norm_addr;
>>  	u8 socket;
>>  	u64 bank_id;
>> @@ -33,9 +34,13 @@ static const guid_t norm_to_sys_guid = GUID_INIT(0xE7180659, 0xA65D, 0x451D,
>>  						 0x92, 0xCD, 0x2B, 0x56, 0xF1,
>>  						 0x2B, 0xEB, 0xA6);
>>  
>> +static const guid_t norm_to_dram_guid = GUID_INIT(0x7626C6AE, 0xF973, 0x429C,
>> +						 0xA9, 0x1C, 0x10, 0x7D, 0x7B,
>> +						 0xE2, 0x98, 0xB0);
>> +
>>  unsigned long prm_umc_norm_to_sys_addr(u8 socket_id, u64 bank_id, unsigned long addr)
>>  {
>> -	struct norm_to_sys_param_buf p_buf;
>> +	struct prm_parameter_buffer p_buf;
>>  	unsigned long ret_addr;
>>  	int ret;
>>  
>> @@ -55,3 +60,26 @@ unsigned long prm_umc_norm_to_sys_addr(u8 socket_id, u64 bank_id, unsigned long
>>  
>>  	return ret;
>>  }
>> +
>> +int prm_umc_norm_to_dram_addr(u8 socket_id, u64 bank_id,
>> +			      unsigned long addr, struct dram_addr *dram_addr)
>> +{
>> +	struct prm_parameter_buffer p_buf;
>> +	int ret;
>> +
>> +	p_buf.norm_addr	= addr;
>> +	p_buf.socket	= socket_id;
>> +	p_buf.bank_id	= bank_id;
>> +	p_buf.out_buf	= dram_addr;
>> +
>> +	ret = acpi_call_prm_handler(norm_to_dram_guid, &p_buf);
>> +	if (!ret)
>> +		return ret;
>> +
>> +	if (ret == -ENODEV)
>> +		pr_debug("PRM module/handler not available.\n");
>> +	else
>> +		pr_notice_once("PRM DRAM Address Translation failed.\n");
>> +
>> +	return ret;
>> +}
>> diff --git a/drivers/ras/amd/atl/umc.c b/drivers/ras/amd/atl/umc.c
>> index 6e072b7667e9..df6accae8929 100644
>> --- a/drivers/ras/amd/atl/umc.c
>> +++ b/drivers/ras/amd/atl/umc.c
>> @@ -427,3 +427,15 @@ unsigned long convert_umc_mca_addr_to_sys_addr(struct atl_err *err)
>>  
>>  	return norm_to_sys_addr(socket_id, die_id, coh_st_inst_id, addr);
>>  }
>> +
>> +int convert_umc_mca_addr_to_dram_addr(struct atl_err *err, struct dram_addr *dram_addr)
>> +{
>> +	u8 socket_id = topology_physical_package_id(err->cpu);
>> +	unsigned long addr = get_addr(err->addr);
>> +	u64 bank_id = err->ipid;
>> +	int ret;
>> +
>> +	ret = prm_umc_norm_to_dram_addr(socket_id, bank_id, addr, dram_addr);
>> +
>> +	return ret;
> 
> The 'ret' variable is not necessary. Just return the function call.
> 
> You can go even further and not use any new variables. Not sure how
> it'll be aesthetically, but that was my first thought.
> 
It may not look good aesthetically if we remove all variables.
Will remove ret though.

>> +}
>> diff --git a/drivers/ras/ras.c b/drivers/ras/ras.c
>> index a6e4792a1b2e..cae6388d41be 100644
>> --- a/drivers/ras/ras.c
>> +++ b/drivers/ras/ras.c
>> @@ -19,15 +19,20 @@
>>   */
>>  static unsigned long (*amd_atl_umc_na_to_spa)(struct atl_err *err);
>>  
>> -void amd_atl_register_decoder(unsigned long (*f)(struct atl_err *))
>> +static int (*amd_atl_umc_na_to_dram_addr)(struct atl_err *err, struct dram_addr *dram_addr);
>> +
>> +void amd_atl_register_decoder(unsigned long (*f1)(struct atl_err *),
>> +			      int (*f2)(struct atl_err *, struct dram_addr *))
>>  {
>> -	amd_atl_umc_na_to_spa = f;
>> +	amd_atl_umc_na_to_spa = f1;
>> +	amd_atl_umc_na_to_dram_addr = f2;
>>  }
>>  EXPORT_SYMBOL_GPL(amd_atl_register_decoder);
>>  
>>  void amd_atl_unregister_decoder(void)
>>  {
>>  	amd_atl_umc_na_to_spa = NULL;
>> +	amd_atl_umc_na_to_dram_addr = NULL;
>>  }
>>  EXPORT_SYMBOL_GPL(amd_atl_unregister_decoder);
>>  
>> @@ -39,6 +44,15 @@ unsigned long amd_convert_umc_mca_addr_to_sys_addr(struct atl_err *err)
>>  	return amd_atl_umc_na_to_spa(err);
>>  }
>>  EXPORT_SYMBOL_GPL(amd_convert_umc_mca_addr_to_sys_addr);
>> +
>> +int amd_convert_umc_mca_addr_to_dram_addr(struct atl_err *err, struct dram_addr *dram_addr)
>> +{
>> +	if (!amd_atl_umc_na_to_dram_addr)
>> +		return -EINVAL;
>> +
>> +	return amd_atl_umc_na_to_dram_addr(err, dram_addr);
>> +}
>> +EXPORT_SYMBOL_GPL(amd_convert_umc_mca_addr_to_dram_addr);
>>  #endif /* CONFIG_AMD_ATL */
>>  
>>  #define CREATE_TRACE_POINTS
>> diff --git a/include/linux/ras.h b/include/linux/ras.h
>> index a64182bc72ad..feb53f8470b0 100644
>> --- a/include/linux/ras.h
>> +++ b/include/linux/ras.h
>> @@ -42,15 +42,32 @@ struct atl_err {
>>  	u32 cpu;
>>  };
>>  
>> +struct dram_addr {
> 
> There's another struct in the kernel called 'dram_addr'.
> 
> It may help to make this more unique with a prefix: atl_dram_addr.
> 
Will change this.
>> +	u8 chip_select;
>> +	u8 bank_group;
>> +	u8 bank_addr;
>> +	u32 row_addr;
>> +	u16 col_addr;
>> +	u8 rank_mul;
>> +	u8 sub_ch;
>> +} __packed;
>> +
>>  #if IS_ENABLED(CONFIG_AMD_ATL)
>> -void amd_atl_register_decoder(unsigned long (*f)(struct atl_err *));
>> +void amd_atl_register_decoder(unsigned long (*f1)(struct atl_err *),
>> +			      int (*f2)(struct atl_err *, struct dram_addr *));
>>  void amd_atl_unregister_decoder(void);
>>  void amd_retire_dram_row(struct atl_err *err);
>>  unsigned long amd_convert_umc_mca_addr_to_sys_addr(struct atl_err *err);
>> +int amd_convert_umc_mca_addr_to_dram_addr(struct atl_err *err, struct dram_addr *dram_addr);
>>  #else
>>  static inline void amd_retire_dram_row(struct atl_err *err) { }
>>  static inline unsigned long
>>  amd_convert_umc_mca_addr_to_sys_addr(struct atl_err *err) { return -EINVAL; }
>> +static inline int amd_convert_umc_mca_addr_to_dram_addr(struct atl_err *err,
>> +							struct dram_addr *dram_addr)
>> +{
>> +	return -EINVAL;
>> +}
>>  #endif /* CONFIG_AMD_ATL */
>>  
>>  #endif /* __RAS_H__ */
>> -- 
> 
> Thanks,
> Yazen

-- 
Thanks,
Avadhut Naik


^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 2/2] EDAC/amd64: Incorporate DRAM Address in EDAC message
  2025-07-28 15:14   ` Yazen Ghannam
@ 2025-08-06 21:08     ` Naik, Avadhut
  2025-08-06 21:15       ` Borislav Petkov
  2025-08-11 13:43       ` Yazen Ghannam
  0 siblings, 2 replies; 10+ messages in thread
From: Naik, Avadhut @ 2025-08-06 21:08 UTC (permalink / raw)
  To: Yazen Ghannam, Avadhut Naik; +Cc: linux-edac, bp, john.allen, linux-kernel

Hi,

On 7/28/2025 10:14, Yazen Ghannam wrote:
> On Thu, Jul 17, 2025 at 04:48:43PM +0000, Avadhut Naik wrote:
>> Currently, the amd64_edac module only provides UMC normalized and system
> 
> The amd64_edac module provides data for the EDAC interface. This is only
> the system physical address (PFN + offset). The UMC normalized address
> comes from MCA error decoding.
> 
Will reword this part.
>> physical address when a DRAM ECC error occurs. DRAM Address on which the
>> error has occurred is not provided since the support required to translate
>> the normalized address into DRAM address has not yet been implemented.
> 
> I don't think this last sentence is necessary.
> 
Noted.
>>
>> This support however, has now been implemented through an earlier patch
>> (RAS/AMD/ATL: Translate UMC normalized address to DRAM address using PRM)
>> and DRAM address, which provides additional debugging information relating
>> to the error received, can now be logged by the module.
>>
>> Add the required support to log DRAM address on which the error has been
>> received in dmesg and through the RAS tracepoint.
> 
> These last two paragraphs could be something like this:
> 
> "Use the new PRM call in the AMD Address Translation Library to gather the
> DRAM address of an error. Include this data in the EDAC 'string' so it
> is available in the kernel messages and EDAC trace event." 
> 
Okay. Will change them.
>>
>> Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
>> ---
>>  drivers/edac/amd64_edac.c | 23 +++++++++++++++++++++++
>>  drivers/edac/amd64_edac.h |  1 +
>>  2 files changed, 24 insertions(+)
>>
>> diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
>> index 07f1e9dc1ca7..36b46cd81bb2 100644
>> --- a/drivers/edac/amd64_edac.c
>> +++ b/drivers/edac/amd64_edac.c
>> @@ -2724,6 +2724,22 @@ static void __log_ecc_error(struct mem_ctl_info *mci, struct err_info *err,
>>  	switch (err->err_code) {
>>  	case DECODE_OK:
>>  		string = "";
>> +		if (err->dram_addr) {
>> +			char s[100];
>> +
>> +			memset(s, 0, 100);
>> +			sprintf(s, "Cs: 0x%x Bank Grp: 0x%x Bank Addr: 0x%x"
>> +					   " Row: 0x%x Column: 0x%x"
>> +					   " RankMul: 0x%x SubChannel: 0x%x",
> 
> There's a checkpatch warning here about splitting user-visible strings.
> 
> Why not use scnprintf() or the like?
> 
Had noticed the checkpatch warning initially.
IIRC, it was for splitting the quoted string across multiple lines.
Can use scnprintf here. But wont the warning still prevail?
One way I can think of for getting rid of the warning is to generate the string
through multiple scnprintf calls. Something like below:

            memset(s, 0, len);
            n = scnprintf(s + n, len - n, "Cs: 0x%x Bank Grp: 0x%x",
                      err->dram_addr->chip_select,
                      err->dram_addr->bank_group);
            n += scnprintf(s + n, len - n, " Bank Addr: 0x%x",
                      err->dram_addr->bank_addr);
            n += scnprintf(s + n, len - n, " Row: 0x%x Column: 0x%x",
                      err->dram_addr->row_addr,
                      err->dram_addr->col_addr);
            n += scnprintf(s + n, len - n, " RankMul: 0x%x SubChannel: 0x%x",
                      err->dram_addr->rank_mul,
                      err->dram_addr->sub_ch);

            pr_err("%s: s: %s\n", __func__, s);
            string = s;

Is this acceptable?

>> +					   err->dram_addr->chip_select,
>> +					   err->dram_addr->bank_group,
>> +					   err->dram_addr->bank_addr,
>> +					   err->dram_addr->row_addr,
>> +					   err->dram_addr->col_addr,
>> +					   err->dram_addr->rank_mul,
>> +					   err->dram_addr->sub_ch);
>> +			string = s;
> 
> Looking at the description for 'edac_mc_handle_error()', it seems that
> "other_detail" would be more appropriate for this info.
> 
> I do think we should consider updating the EDAC interface if multiple
> vendors are gathering this data.
> 
Okay, will use "other_detail" parameter of edac_mc_handle_error() for this.

>> +		}
>>  		break;
>>  	case ERR_NODE:
>>  		string = "Failed to map error addr to a node";
>> @@ -2808,11 +2824,13 @@ static void umc_get_err_info(struct mce *m, struct err_info *err)
>>  static void decode_umc_error(int node_id, struct mce *m)
>>  {
>>  	u8 ecc_type = (m->status >> 45) & 0x3;
>> +	struct dram_addr dram_addr;
>>  	struct mem_ctl_info *mci;
>>  	unsigned long sys_addr;
>>  	struct amd64_pvt *pvt;
>>  	struct atl_err a_err;
>>  	struct err_info err;
>> +	int ret;
>>  
>>  	node_id = fixup_node_id(node_id, m);
>>  
>> @@ -2822,6 +2840,7 @@ static void decode_umc_error(int node_id, struct mce *m)
>>  
>>  	pvt = mci->pvt_info;
>>  
>> +	memset(&dram_addr, 0, sizeof(dram_addr));
>>  	memset(&err, 0, sizeof(err));
>>  
>>  	if (m->status & MCI_STATUS_DEFERRED)
>> @@ -2853,6 +2872,10 @@ static void decode_umc_error(int node_id, struct mce *m)
>>  		goto log_error;
>>  	}
>>  
>> +	ret = amd_convert_umc_mca_addr_to_dram_addr(&a_err, &dram_addr);
>> +	if (!ret)
>> +		err.dram_addr = &dram_addr;
> 
> I feel like it is not necessary to pass a second struct if it is already
> contained in another.
> 
> You could just clear (or not set) that field if an error occurs.
>
Slightly confused here.
Do you mean we should avoid passing dram_addr as second parameter
for amd_convert_umc_mca_addr_to_dram_addr() and instead just pass
struct err_info instance err?

And, in case some error occurs, we should just do
	err.dram_addr = 0x0;
?
 
>> +
>>  	error_address_to_page_and_offset(sys_addr, &err);
>>  
>>  log_error:
>> diff --git a/drivers/edac/amd64_edac.h b/drivers/edac/amd64_edac.h
>> index 17228d07de4c..88b0b8425ab3 100644
>> --- a/drivers/edac/amd64_edac.h
>> +++ b/drivers/edac/amd64_edac.h
>> @@ -399,6 +399,7 @@ struct err_info {
>>  	u16 syndrome;
>>  	u32 page;
>>  	u32 offset;
>> +	struct dram_addr *dram_addr;
>>  };
>>  
>>  static inline u32 get_umc_base(u8 channel)
>> -- 
> 
> Thanks,
> Yazen

-- 
Thanks,
Avadhut Naik


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] EDAC/amd64: Incorporate DRAM Address in EDAC message
  2025-08-06 21:08     ` Naik, Avadhut
@ 2025-08-06 21:15       ` Borislav Petkov
  2025-08-06 21:37         ` Naik, Avadhut
  2025-08-11 13:43       ` Yazen Ghannam
  1 sibling, 1 reply; 10+ messages in thread
From: Borislav Petkov @ 2025-08-06 21:15 UTC (permalink / raw)
  To: Naik, Avadhut
  Cc: Yazen Ghannam, Avadhut Naik, linux-edac, john.allen, linux-kernel

On Wed, Aug 06, 2025 at 04:08:07PM -0500, Naik, Avadhut wrote:
> >> +			sprintf(s, "Cs: 0x%x Bank Grp: 0x%x Bank Addr: 0x%x"
> >> +					   " Row: 0x%x Column: 0x%x"
> >> +					   " RankMul: 0x%x SubChannel: 0x%x",
> > 
> > There's a checkpatch warning here about splitting user-visible strings.
> > 
> > Why not use scnprintf() or the like?
> > 
> Had noticed the checkpatch warning initially.
> IIRC, it was for splitting the quoted string across multiple lines.
> Can use scnprintf here. But wont the warning still prevail?

Just do one long line. The idea is to be able to grep the code when you see
the string issued in dmesg.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] EDAC/amd64: Incorporate DRAM Address in EDAC message
  2025-08-06 21:15       ` Borislav Petkov
@ 2025-08-06 21:37         ` Naik, Avadhut
  0 siblings, 0 replies; 10+ messages in thread
From: Naik, Avadhut @ 2025-08-06 21:37 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Yazen Ghannam, Avadhut Naik, linux-edac, john.allen, linux-kernel



On 8/6/2025 16:15, Borislav Petkov wrote:
> On Wed, Aug 06, 2025 at 04:08:07PM -0500, Naik, Avadhut wrote:
>>>> +			sprintf(s, "Cs: 0x%x Bank Grp: 0x%x Bank Addr: 0x%x"
>>>> +					   " Row: 0x%x Column: 0x%x"
>>>> +					   " RankMul: 0x%x SubChannel: 0x%x",
>>>
>>> There's a checkpatch warning here about splitting user-visible strings.
>>>
>>> Why not use scnprintf() or the like?
>>>
>> Had noticed the checkpatch warning initially.
>> IIRC, it was for splitting the quoted string across multiple lines.
>> Can use scnprintf here. But wont the warning still prevail?
> 
> Just do one long line. The idea is to be able to grep the code when you see
> the string issued in dmesg.
> 
Okay, sounds good! Will change it to something like below:

scnprintf(s, 100, "Cs: 0x%x Bank Grp: 0x%x Bank Addr: 0x%x Row: 0x%x Column: 0x%x RankMul: 0x%x SubChannel: 0x%x",
          err->dram_addr->chip_select,
          err->dram_addr->bank_group,
          err->dram_addr->bank_addr,
          err->dram_addr->row_addr,
          err->dram_addr->col_addr,
          err->dram_addr->rank_mul,
          err->dram_addr->sub_ch);

-- 
Thanks,
Avadhut Naik


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 2/2] EDAC/amd64: Incorporate DRAM Address in EDAC message
  2025-08-06 21:08     ` Naik, Avadhut
  2025-08-06 21:15       ` Borislav Petkov
@ 2025-08-11 13:43       ` Yazen Ghannam
  1 sibling, 0 replies; 10+ messages in thread
From: Yazen Ghannam @ 2025-08-11 13:43 UTC (permalink / raw)
  To: Naik, Avadhut; +Cc: Avadhut Naik, linux-edac, bp, john.allen, linux-kernel

On Wed, Aug 06, 2025 at 04:08:07PM -0500, Naik, Avadhut wrote:
[...]
> >> @@ -2808,11 +2824,13 @@ static void umc_get_err_info(struct mce *m, struct err_info *err)
> >>  static void decode_umc_error(int node_id, struct mce *m)
> >>  {
> >>  	u8 ecc_type = (m->status >> 45) & 0x3;
> >> +	struct dram_addr dram_addr;
> >>  	struct mem_ctl_info *mci;
> >>  	unsigned long sys_addr;
> >>  	struct amd64_pvt *pvt;
> >>  	struct atl_err a_err;
> >>  	struct err_info err;
> >> +	int ret;
> >>  
> >>  	node_id = fixup_node_id(node_id, m);
> >>  
> >> @@ -2822,6 +2840,7 @@ static void decode_umc_error(int node_id, struct mce *m)
> >>  
> >>  	pvt = mci->pvt_info;
> >>  
> >> +	memset(&dram_addr, 0, sizeof(dram_addr));
> >>  	memset(&err, 0, sizeof(err));
> >>  
> >>  	if (m->status & MCI_STATUS_DEFERRED)
> >> @@ -2853,6 +2872,10 @@ static void decode_umc_error(int node_id, struct mce *m)
> >>  		goto log_error;
> >>  	}
> >>  
> >> +	ret = amd_convert_umc_mca_addr_to_dram_addr(&a_err, &dram_addr);
> >> +	if (!ret)
> >> +		err.dram_addr = &dram_addr;
> > 
> > I feel like it is not necessary to pass a second struct if it is already
> > contained in another.
> > 
> > You could just clear (or not set) that field if an error occurs.
> >
> Slightly confused here.
> Do you mean we should avoid passing dram_addr as second parameter
> for amd_convert_umc_mca_addr_to_dram_addr() and instead just pass
> struct err_info instance err?
> 
> And, in case some error occurs, we should just do
> 	err.dram_addr = 0x0;
> ?
>  

Sorry, I think I misread this before.

I was thinking you can add 'struct dram_addr' to 'struct atl_err'. The
intent of 'struct atl_err' is to collect all needed parameters for the
translation functions.

Additionally, I just realized, we should have an 'invalid' default value
for dram_addr. Technically, bank=0, row=0, col=0, etc., would be a valid
DRAM address.

Thanks,
Yazen

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2025-08-11 13:43 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-17 16:48 [PATCH 0/2] Incorporate DRAM address in EDAC messages Avadhut Naik
2025-07-17 16:48 ` [PATCH 1/2] RAS/AMD/ATL: Translate UMC normalized address to DRAM address using PRM Avadhut Naik
2025-07-28 14:40   ` Yazen Ghannam
2025-08-06 18:21     ` Naik, Avadhut
2025-07-17 16:48 ` [PATCH 2/2] EDAC/amd64: Incorporate DRAM Address in EDAC message Avadhut Naik
2025-07-28 15:14   ` Yazen Ghannam
2025-08-06 21:08     ` Naik, Avadhut
2025-08-06 21:15       ` Borislav Petkov
2025-08-06 21:37         ` Naik, Avadhut
2025-08-11 13:43       ` Yazen Ghannam

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).