[PATCH] EDAC/versal: Report PFN and page offset for DDR errors

public inbox for linux-kernel@vger.kernel.org

* [PATCH] EDAC/versal: Report PFN and page offset for DDR errors
@ 2026-04-15  6:02 Shubhrajyoti Datta
  2026-04-16  5:32 ` Prasanna Kumar T S M
  2026-04-22  3:49 ` Srivatsa S. Bhat
  0 siblings, 2 replies; 5+ messages in thread
From: Shubhrajyoti Datta @ 2026-04-15  6:02 UTC
  To: linux-kernel, linux-edac
  Cc: git, ptsm, srivatsa, shubhrajyoti.datta, Borislav Petkov,
	Tony Luck, Shubhrajyoti Datta

Currently, DDRMC correctable and uncorrectable error events are reported
to EDAC with the page frame number (pfn) and page offset set to zero,
so the report cannot be used to locate the failing memory address.

Compute the physical address from the error information and extract
the page frame number and offset before calling edac_mc_handle_error().
This gives userspace the actual location of the memory error.

Fixes: 6f15b178cd63 ("EDAC/versal: Add a Xilinx Versal memory controller driver")
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
---
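
The PFN/offset split described in the commit message can be sketched in
userspace as follows. This is only an illustration of the arithmetic
behind PHYS_PFN() and offset_in_page(); the 4 KiB page size, the
sketch_* names, and the struct are assumptions made for the example and
are not part of the driver.

```c
#include <assert.h>

/* Assumption: 4 KiB pages. The kernel uses its configured PAGE_SHIFT. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Hypothetical container for the two values handed to EDAC. */
struct sketch_loc {
	unsigned long pfn;	/* page frame number, as PHYS_PFN() would give */
	unsigned long offset;	/* byte offset, as offset_in_page() would give */
};

static struct sketch_loc sketch_split(unsigned long pa)
{
	struct sketch_loc loc;

	loc.pfn = pa >> SKETCH_PAGE_SHIFT;
	loc.offset = pa & (SKETCH_PAGE_SIZE - 1);

	/* Recombining the two parts must give back the original address. */
	assert(((loc.pfn << SKETCH_PAGE_SHIFT) | loc.offset) == pa);
	return loc;
}
```

Under these assumptions, sketch_split(0x12345678) yields pfn 0x12345 and
offset 0x678, which correspond to the two arguments that replace the
zeros in the edac_mc_handle_error() call.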

 drivers/edac/versal_edac.c | 36 +++++++++++++++++-------------------
 1 file changed, 17 insertions(+), 19 deletions(-)

diff --git a/drivers/edac/versal_edac.c b/drivers/edac/versal_edac.c
index 5a43b5d43ca2..18045f96610e 100644
--- a/drivers/edac/versal_edac.c
+++ b/drivers/edac/versal_edac.c
@@ -414,34 +414,32 @@ static unsigned long convert_to_physical(struct edac_priv *priv, union ecc_error
 static void handle_error(struct mem_ctl_info *mci, struct ecc_status *stat)
 {
 	struct edac_priv *priv = mci->pvt_info;
+	enum hw_event_mc_err_type type;
 	union ecc_error_info pinf;
+	unsigned long pa, pfn;
 
 	if (stat->error_type == XDDR_ERR_TYPE_CE) {
 		priv->ce_cnt++;
 		pinf = stat->ceinfo[stat->channel];
-		snprintf(priv->message, XDDR_EDAC_MSG_SIZE,
-			 "Error type:%s MC ID: %d Addr at %lx Burst Pos: %d\n",
-			 "CE", priv->mc_id,
-			 convert_to_physical(priv, pinf), pinf.burstpos);
-
-		edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci,
-				     1, 0, 0, 0, 0, 0, -1,
-				     priv->message, "");
-	}
-
-	if (stat->error_type == XDDR_ERR_TYPE_UE) {
+		type = HW_EVENT_ERR_CORRECTED;
+	} else if (stat->error_type == XDDR_ERR_TYPE_UE) {
 		priv->ue_cnt++;
 		pinf = stat->ueinfo[stat->channel];
-		snprintf(priv->message, XDDR_EDAC_MSG_SIZE,
-			 "Error type:%s MC ID: %d Addr at %lx Burst Pos: %d\n",
-			 "UE", priv->mc_id,
-			 convert_to_physical(priv, pinf), pinf.burstpos);
-
-		edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, mci,
-				     1, 0, 0, 0, 0, 0, -1,
-				     priv->message, "");
+		type = HW_EVENT_ERR_UNCORRECTED;
+	} else {
+		return;
 	}
 
+	pa = convert_to_physical(priv, pinf);
+	pfn = PHYS_PFN(pa);
+	snprintf(priv->message, XDDR_EDAC_MSG_SIZE,
+		 "Error type:%s MC ID: %d Addr at %lx Burst Pos: %d\n",
+		 type == HW_EVENT_ERR_UNCORRECTED ? "UE" : "CE", priv->mc_id,
+		 pa, pinf.burstpos);
+	edac_mc_handle_error(type, mci,
+			     1, pfn, offset_in_page(pa), 0, 0, 0, -1,
+			     priv->message, "");
+
 	memset(stat, 0, sizeof(*stat));
 }
 
-- 
2.34.1



* Re: [PATCH] EDAC/versal: Report PFN and page offset for DDR errors
  2026-04-15  6:02 [PATCH] EDAC/versal: Report PFN and page offset for DDR errors Shubhrajyoti Datta
@ 2026-04-16  5:32 ` Prasanna Kumar T S M
  2026-04-22  3:49 ` Srivatsa S. Bhat
  1 sibling, 0 replies; 5+ messages in thread
From: Prasanna Kumar T S M @ 2026-04-16  5:32 UTC
  To: Shubhrajyoti Datta, linux-kernel, linux-edac
  Cc: git, srivatsa, shubhrajyoti.datta, Borislav Petkov, Tony Luck



On 15-04-2026 11:32, Shubhrajyoti Datta wrote:
> Currently, DDRMC correctable and uncorrectable error events are reported
> to EDAC with page frame number (pfn) and offset set to zero.
> This information is not useful to locate the address for memory errors.
> 
> Compute the physical address from the error information and extract
> the page frame number and offset before calling edac_mc_handle_error().
> This provides the actual memory location information to the userspace.
> 
> Fixes: 6f15b178cd63 ("EDAC/versal: Add a Xilinx Versal memory controller driver")
> Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
> ---
> 
>   drivers/edac/versal_edac.c | 36 +++++++++++++++++-------------------
>   1 file changed, 17 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/edac/versal_edac.c b/drivers/edac/versal_edac.c
> index 5a43b5d43ca2..18045f96610e 100644
> --- a/drivers/edac/versal_edac.c
> +++ b/drivers/edac/versal_edac.c
> @@ -414,34 +414,32 @@ static unsigned long convert_to_physical(struct edac_priv *priv, union ecc_error
>   static void handle_error(struct mem_ctl_info *mci, struct ecc_status *stat)
>   {
>   	struct edac_priv *priv = mci->pvt_info;
> +	enum hw_event_mc_err_type type;
>   	union ecc_error_info pinf;
> +	unsigned long pa, pfn;
>   
>   	if (stat->error_type == XDDR_ERR_TYPE_CE) {
>   		priv->ce_cnt++;
>   		pinf = stat->ceinfo[stat->channel];
> -		snprintf(priv->message, XDDR_EDAC_MSG_SIZE,
> -			 "Error type:%s MC ID: %d Addr at %lx Burst Pos: %d\n",
> -			 "CE", priv->mc_id,
> -			 convert_to_physical(priv, pinf), pinf.burstpos);
> -
> -		edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci,
> -				     1, 0, 0, 0, 0, 0, -1,
> -				     priv->message, "");
> -	}
> -
> -	if (stat->error_type == XDDR_ERR_TYPE_UE) {
> +		type = HW_EVENT_ERR_CORRECTED;
> +	} else if (stat->error_type == XDDR_ERR_TYPE_UE) {
>   		priv->ue_cnt++;
>   		pinf = stat->ueinfo[stat->channel];
> -		snprintf(priv->message, XDDR_EDAC_MSG_SIZE,
> -			 "Error type:%s MC ID: %d Addr at %lx Burst Pos: %d\n",
> -			 "UE", priv->mc_id,
> -			 convert_to_physical(priv, pinf), pinf.burstpos);
> -
> -		edac_mc_handle_error(HW_EVENT_ERR_UNCORRECTED, mci,
> -				     1, 0, 0, 0, 0, 0, -1,
> -				     priv->message, "");
> +		type = HW_EVENT_ERR_UNCORRECTED;
> +	} else {
> +		return;
>   	}
>   
> +	pa = convert_to_physical(priv, pinf);
> +	pfn = PHYS_PFN(pa);
> +	snprintf(priv->message, XDDR_EDAC_MSG_SIZE,
> +		 "Error type:%s MC ID: %d Addr at %lx Burst Pos: %d\n",
> +		 type == HW_EVENT_ERR_UNCORRECTED ? "UE" : "CE", priv->mc_id,
> +		 pa, pinf.burstpos);
> +	edac_mc_handle_error(type, mci,
> +			     1, pfn, offset_in_page(pa), 0, 0, 0, -1,
> +			     priv->message, "");
> +
>   	memset(stat, 0, sizeof(*stat));
>   }
>   

Hi Shubhrajyoti,

Looks good to me.

Reviewed-by: Prasanna Kumar T S M <ptsm@linux.microsoft.com>

Thanks,
Prasanna


* Re: [PATCH] EDAC/versal: Report PFN and page offset for DDR errors
  2026-04-15  6:02 [PATCH] EDAC/versal: Report PFN and page offset for DDR errors Shubhrajyoti Datta
  2026-04-16  5:32 ` Prasanna Kumar T S M
@ 2026-04-22  3:49 ` Srivatsa S. Bhat
  2026-04-27  6:27   ` Datta, Shubhrajyoti
  1 sibling, 1 reply; 5+ messages in thread
From: Srivatsa S. Bhat @ 2026-04-22  3:49 UTC
  To: Shubhrajyoti Datta
  Cc: linux-kernel, linux-edac, git, ptsm, shubhrajyoti.datta,
	Borislav Petkov, Tony Luck

On Wed, Apr 15, 2026 at 11:32:39AM +0530, Shubhrajyoti Datta wrote:
> Currently, DDRMC correctable and uncorrectable error events are reported
> to EDAC with page frame number (pfn) and offset set to zero.
> This information is not useful to locate the address for memory errors.
> 
> Compute the physical address from the error information and extract
> the page frame number and offset before calling edac_mc_handle_error().
> This provides the actual memory location information to the userspace.
> 
> Fixes: 6f15b178cd63 ("EDAC/versal: Add a Xilinx Versal memory controller driver")
> Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
> ---
> 
>  drivers/edac/versal_edac.c | 36 +++++++++++++++++-------------------
>  1 file changed, 17 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/edac/versal_edac.c b/drivers/edac/versal_edac.c
> index 5a43b5d43ca2..18045f96610e 100644
> --- a/drivers/edac/versal_edac.c
> +++ b/drivers/edac/versal_edac.c
> @@ -414,34 +414,32 @@ static unsigned long convert_to_physical(struct edac_priv *priv, union ecc_error
>  static void handle_error(struct mem_ctl_info *mci, struct ecc_status *stat)
>  {

[...]

>  	if (stat->error_type == XDDR_ERR_TYPE_CE) {

[...]

> +	} else if (stat->error_type == XDDR_ERR_TYPE_UE) {

[...]
> +	} else {
> +		return;

I like the cleanup contributed by this patch (in terms of reducing
code duplication) in addition to the actual fix. However, this patch
also introduces a subtle behavior change - the existing code calls
memset() to clear out the ecc_status struct unconditionally, but this
patch doesn't call memset if the error type is not CE or UE (i.e., in
the early return path).

Was this change intentional? Wouldn't it potentially cause stale data
to be left over in the ecc_status struct, affecting future reuse?

[...]

> +
>  	memset(stat, 0, sizeof(*stat));
>  }
>  

If the expectation is to actually clear it out unconditionally, it
would be great to document it in the comments (if not done already).

Thank you!

Regards,
Srivatsa
Microsoft Linux Systems Group


* RE: [PATCH] EDAC/versal: Report PFN and page offset for DDR errors
  2026-04-22  3:49 ` Srivatsa S. Bhat
@ 2026-04-27  6:27   ` Datta, Shubhrajyoti
  2026-04-27 18:49     ` Srivatsa S. Bhat
  0 siblings, 1 reply; 5+ messages in thread
From: Datta, Shubhrajyoti @ 2026-04-27  6:27 UTC
  To: Srivatsa S. Bhat
  Cc: linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org,
	git (AMD-Xilinx), ptsm@linux.microsoft.com,
	shubhrajyoti.datta@gmail.com, Borislav Petkov, Tony Luck

> -----Original Message-----
> From: Srivatsa S. Bhat <srivatsa@csail.mit.edu>
> Sent: Wednesday, April 22, 2026 9:20 AM
> To: Datta, Shubhrajyoti <shubhrajyoti.datta@amd.com>
> Cc: linux-kernel@vger.kernel.org; linux-edac@vger.kernel.org; git (AMD-Xilinx)
> <git@amd.com>; ptsm@linux.microsoft.com; shubhrajyoti.datta@gmail.com;
> Borislav Petkov <bp@alien8.de>; Tony Luck <tony.luck@intel.com>
> Subject: Re: [PATCH] EDAC/versal: Report PFN and page offset for DDR errors
>
> On Wed, Apr 15, 2026 at 11:32:39AM +0530, Shubhrajyoti Datta wrote:
> > Currently, DDRMC correctable and uncorrectable error events are
> > reported to EDAC with page frame number (pfn) and offset set to zero.
> > This information is not useful to locate the address for memory errors.
> >
> > Compute the physical address from the error information and extract
> > the page frame number and offset before calling edac_mc_handle_error().
> > This provides the actual memory location information to the userspace.
> >
> > Fixes: 6f15b178cd63 ("EDAC/versal: Add a Xilinx Versal memory controller driver")
> > Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
> > ---
> >
> >  drivers/edac/versal_edac.c | 36 +++++++++++++++++-------------------
> >  1 file changed, 17 insertions(+), 19 deletions(-)
> >
> > diff --git a/drivers/edac/versal_edac.c b/drivers/edac/versal_edac.c
> > index 5a43b5d43ca2..18045f96610e 100644
> > --- a/drivers/edac/versal_edac.c
> > +++ b/drivers/edac/versal_edac.c
> > @@ -414,34 +414,32 @@ static unsigned long convert_to_physical(struct edac_priv *priv, union ecc_error
> >  static void handle_error(struct mem_ctl_info *mci, struct ecc_status *stat)
> >  {
>
> [...]
>
> >       if (stat->error_type == XDDR_ERR_TYPE_CE) {
>
> [...]
>
> > +     } else if (stat->error_type == XDDR_ERR_TYPE_UE) {
>
> [...]
> > +     } else {
> > +             return;
>
> I like the cleanup contributed by this patch (in terms of reducing code
> duplication) in addition to the actual fix. However, this patch also introduces a
> subtle behavior change - the existing code calls
> memset() to clear out the ecc_status struct unconditionally, but this patch
> doesn't call memset if the error type is not CE or UE (i.e., in the early return
> path).
>
> Was this change intentional? Wouldn't it potentially cause stale data to be left
> over in the ecc_status struct, affecting future reuse?

Hi Srivatsa,
Thanks for the review.

The early return in handle_error is intentionally safe.
get_error_info is always called before handle_error:

  if (get_error_info(priv))
          return;
  handle_error(mci, &priv->stat);

get_error_info already guards against non-CE/UE events:

  if (!eccr0_ceval && !eccr1_ceval && !eccr0_ueval && !eccr1_ueval)
          return 1;

So handle_error is only ever reached when at least one CE or UE
error is present, making the else { return; } branch unreachable
in practice. The removed memset in that path was overprotective
dead code.

For future events, get_error_info always reloads the data fresh
before handle_error runs, so there is no risk of stale data.

I can add a comment above handle_error noting this precondition if
that would help future readers.
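
The call ordering described above can be modelled with a small userspace
sketch. Everything named flow_* is hypothetical: the four eccr*_ceval
and eccr*_ueval bits are collapsed into two flags, and the sketch
assumes that the info-gathering step always rewrites the error type to
CE or UE before returning 0 (with UE taking priority, which is an
arbitrary choice for the example). Under that assumption the fallthrough
counter never moves.

```c
#include <assert.h>

enum flow_err_type { FLOW_ERR_NONE = 0, FLOW_ERR_CE, FLOW_ERR_UE };

/* Hypothetical stand-in for the valid bits read from the ECC registers. */
struct flow_regs {
	int ce_valid;
	int ue_valid;
};

struct flow_priv {
	enum flow_err_type error_type;
	int ce_cnt;
	int ue_cnt;
	int fallthrough_cnt;	/* counts hits on the else { return; } path */
};

/* Returns 1 when there is nothing to report, 0 otherwise. */
static int flow_get_error_info(const struct flow_regs *regs,
			       struct flow_priv *priv)
{
	if (!regs->ce_valid && !regs->ue_valid)
		return 1;

	/* Assumption: the type is freshly written on every reported event. */
	priv->error_type = regs->ue_valid ? FLOW_ERR_UE : FLOW_ERR_CE;
	assert(priv->error_type == FLOW_ERR_CE ||
	       priv->error_type == FLOW_ERR_UE);
	return 0;
}

static void flow_handle_error(struct flow_priv *priv)
{
	if (priv->error_type == FLOW_ERR_CE) {
		priv->ce_cnt++;
	} else if (priv->error_type == FLOW_ERR_UE) {
		priv->ue_cnt++;
	} else {
		priv->fallthrough_cnt++;
		return;
	}

	priv->error_type = FLOW_ERR_NONE;	/* stands in for the memset() */
}

static void flow_check(const struct flow_regs *regs, struct flow_priv *priv)
{
	if (flow_get_error_info(regs, priv))
		return;
	flow_handle_error(priv);
}
```

If the assumption about the type being rewritten did not hold, the
fallthrough path would become reachable, which is why documenting the
precondition next to handle_error seems worthwhile.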

Regards,
Shubhrajyoti


>
> [...]
>
> > +
> >       memset(stat, 0, sizeof(*stat));
> >  }
> >
>
> If the expectation is to actually clear it out unconditionally, it would be great to
> document it in the comments (if not done already).
>
> Thank you!
>
> Regards,
> Srivatsa
> Microsoft Linux Systems Group


* Re: [PATCH] EDAC/versal: Report PFN and page offset for DDR errors
  2026-04-27  6:27   ` Datta, Shubhrajyoti
@ 2026-04-27 18:49     ` Srivatsa S. Bhat
  0 siblings, 0 replies; 5+ messages in thread
From: Srivatsa S. Bhat @ 2026-04-27 18:49 UTC
  To: Datta, Shubhrajyoti
  Cc: linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org,
	git (AMD-Xilinx), ptsm@linux.microsoft.com,
	shubhrajyoti.datta@gmail.com, Borislav Petkov, Tony Luck

On Mon, Apr 27, 2026 at 06:27:36AM +0000, Datta, Shubhrajyoti wrote:
> > -----Original Message-----
> > From: Srivatsa S. Bhat <srivatsa@csail.mit.edu>
> > Sent: Wednesday, April 22, 2026 9:20 AM
> > To: Datta, Shubhrajyoti <shubhrajyoti.datta@amd.com>
> > Cc: linux-kernel@vger.kernel.org; linux-edac@vger.kernel.org; git (AMD-Xilinx)
> > <git@amd.com>; ptsm@linux.microsoft.com; shubhrajyoti.datta@gmail.com;
> > Borislav Petkov <bp@alien8.de>; Tony Luck <tony.luck@intel.com>
> > Subject: Re: [PATCH] EDAC/versal: Report PFN and page offset for DDR errors
> >
> > On Wed, Apr 15, 2026 at 11:32:39AM +0530, Shubhrajyoti Datta wrote:
> > > Currently, DDRMC correctable and uncorrectable error events are
> > > reported to EDAC with page frame number (pfn) and offset set to zero.
> > > This information is not useful to locate the address for memory errors.
> > >
> > > Compute the physical address from the error information and extract
> > > the page frame number and offset before calling edac_mc_handle_error().
> > > This provides the actual memory location information to the userspace.
> > >
> > > Fixes: 6f15b178cd63 ("EDAC/versal: Add a Xilinx Versal memory controller driver")
> > > Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
> > > ---
> > >
> > >  drivers/edac/versal_edac.c | 36 +++++++++++++++++-------------------
> > >  1 file changed, 17 insertions(+), 19 deletions(-)
> > >
> > > diff --git a/drivers/edac/versal_edac.c b/drivers/edac/versal_edac.c
> > > index 5a43b5d43ca2..18045f96610e 100644
> > > --- a/drivers/edac/versal_edac.c
> > > +++ b/drivers/edac/versal_edac.c
> > > @@ -414,34 +414,32 @@ static unsigned long convert_to_physical(struct edac_priv *priv, union ecc_error
> > >  static void handle_error(struct mem_ctl_info *mci, struct ecc_status *stat)
> > >  {
> >
> > [...]
> >
> > >       if (stat->error_type == XDDR_ERR_TYPE_CE) {
> >
> > [...]
> >
> > > +     } else if (stat->error_type == XDDR_ERR_TYPE_UE) {
> >
> > [...]
> > > +     } else {
> > > +             return;
> >
> > I like the cleanup contributed by this patch (in terms of reducing code
> > duplication) in addition to the actual fix. However, this patch also introduces a
> > subtle behavior change - the existing code calls
> > memset() to clear out the ecc_status struct unconditionally, but this patch
> > doesn't call memset if the error type is not CE or UE (i.e., in the early return
> > path).
> >
> > Was this change intentional? Wouldn't it potentially cause stale data to be left
> > over in the ecc_status struct, affecting future reuse?
> 
> Hi Srivatsa,
> Thanks for the review.
> 
> The early return in handle_error is intentionally safe.
> get_error_info is always called before handle_error:
> 
>   if (get_error_info(priv))
>           return;
>   handle_error(mci, &priv->stat);
> 
>   get_error_info already guards against non-CE/UE events:
> 
>   if (!eccr0_ceval && !eccr1_ceval && !eccr0_ueval && !eccr1_ueval)
>           return 1;
> 
>   So handle_error is only ever reached when at least one CE or UE
>   error is present, making the else { return; } branch unreachable
>   in practice. The removed memset in that path was overprotective
>   dead code.
> 
>   For future events, get_error_info always reloads the data fresh
>   before handle_error runs, so there is no risk of stale data.
> 

I see, thank you for the clarification!

>   I can add a comment above handle_error noting this precondition if
>   that would help future readers.
> 

Sure, that would be great! Along with that, how about also removing
the else { return; } branch if it is actually unreachable?

Regards,
Srivatsa
Microsoft Linux Systems Group


Thread overview: 5+ messages
2026-04-15  6:02 [PATCH] EDAC/versal: Report PFN and page offset for DDR errors Shubhrajyoti Datta
2026-04-16  5:32 ` Prasanna Kumar T S M
2026-04-22  3:49 ` Srivatsa S. Bhat
2026-04-27  6:27   ` Datta, Shubhrajyoti
2026-04-27 18:49     ` Srivatsa S. Bhat
