public inbox for linux-kernel@vger.kernel.org
* [PATCH] RAS/AMD/FMPM: Add option to ignore CEs
@ 2025-10-06 15:17 Yazen Ghannam
  2025-10-06 16:35 ` Naik, Avadhut
  2025-10-06 21:34 ` Borislav Petkov
  0 siblings, 2 replies; 6+ messages in thread
From: Yazen Ghannam @ 2025-10-06 15:17 UTC
  To: bp, tony.luck, linux-edac
  Cc: linux-kernel, avadhut.naik, john.allen, Yazen Ghannam

Generally, FMPM will handle all memory errors as it is expected that
"upstream" entities, like hardware thresholding or other Linux notifier
blocks, will filter out errors.

However, some users prefer that correctable errors are not filtered out
but only that FMPM does not take action on them.

Add a module parameter to ignore correctable errors.

When set, FMPM will not retire memory nor will it save FRU records for
correctable errors.

Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
---
 drivers/ras/amd/fmpm.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
index 8877c6ff64c4..08b16a133f20 100644
--- a/drivers/ras/amd/fmpm.c
+++ b/drivers/ras/amd/fmpm.c
@@ -129,6 +129,14 @@ static struct dentry *fmpm_dfs_entries;
 	GUID_INIT(0x5e4706c1, 0x5356, 0x48c6, 0x93, 0x0b, 0x52, 0xf2,	\
 		  0x12, 0x0a, 0x44, 0x58)
 
+/**
+ * DOC: ignore_ce (bool)
+ * Switch to handle or ignore correctable errors.
+ */
+static bool ignore_ce;
+module_param(ignore_ce, bool, 0644);
+MODULE_PARM_DESC(ignore_ce, "Ignore correctable errors");
+
 /**
  * DOC: max_nr_entries (byte)
  * Maximum number of descriptor entries possible for each FRU.
@@ -413,6 +421,9 @@ static int fru_handle_mem_poison(struct notifier_block *nb, unsigned long val, v
 	if (!mce_is_memory_error(m))
 		return NOTIFY_DONE;
 
+	if (ignore_ce && mce_is_correctable(m))
+		return NOTIFY_DONE;
+
 	retire_dram_row(m->addr, m->ipid, m->extcpu);
 
 	/*

base-commit: fd94619c43360eb44d28bd3ef326a4f85c600a07
-- 
2.51.0
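
The early-return logic added to fru_handle_mem_poison() can be sketched as
a small userspace model. This is only an illustration under stated
assumptions: struct model_error and fmpm_would_act() are hypothetical names
that do not exist in the kernel, and the two flags merely stand in for the
results of mce_is_memory_error() and mce_is_correctable().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parts of struct mce that matter here. */
struct model_error {
	bool is_memory_error;	/* models mce_is_memory_error(m) */
	bool is_correctable;	/* models mce_is_correctable(m) */
};

/*
 * Returns true if the handler would go on to retire the row and save a
 * FRU record, false if it would return NOTIFY_DONE early.
 */
static bool fmpm_would_act(const struct model_error *e, bool ignore_ce)
{
	assert(e != NULL);

	/* Non-memory errors are never handled by FMPM. */
	if (!e->is_memory_error)
		return false;

	/* New check: skip correctable errors when the parameter is set. */
	if (ignore_ce && e->is_correctable)
		return false;

	return true;
}
```

With ignore_ce unset, a correctable memory error is still acted on. With
ignore_ce set, only memory errors that are not correctable reach the
retirement and record-saving steps.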



* Re: [PATCH] RAS/AMD/FMPM: Add option to ignore CEs
  2025-10-06 15:17 [PATCH] RAS/AMD/FMPM: Add option to ignore CEs Yazen Ghannam
@ 2025-10-06 16:35 ` Naik, Avadhut
  2025-10-06 21:34 ` Borislav Petkov
  1 sibling, 0 replies; 6+ messages in thread
From: Naik, Avadhut @ 2025-10-06 16:35 UTC
  To: Yazen Ghannam, bp, tony.luck, linux-edac
  Cc: linux-kernel, avadhut.naik, john.allen



On 10/6/2025 10:17, Yazen Ghannam wrote:
> Generally, FMPM will handle all memory errors as it is expected that
> "upstream" entities, like hardware thresholding or other Linux notifier
> blocks, will filter out errors.
> 
> However, some users prefer that correctable errors are not filtered out
> but only that FMPM does not take action on them.
> 
> Add a module parameter to ignore correctable errors.
> 
> When set, FMPM will not retire memory nor will it save FRU records for
> correctable errors.
> 
> Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
> ---
>  drivers/ras/amd/fmpm.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/drivers/ras/amd/fmpm.c b/drivers/ras/amd/fmpm.c
> index 8877c6ff64c4..08b16a133f20 100644
> --- a/drivers/ras/amd/fmpm.c
> +++ b/drivers/ras/amd/fmpm.c
> @@ -129,6 +129,14 @@ static struct dentry *fmpm_dfs_entries;
>  	GUID_INIT(0x5e4706c1, 0x5356, 0x48c6, 0x93, 0x0b, 0x52, 0xf2,	\
>  		  0x12, 0x0a, 0x44, 0x58)
>  
> +/**
> + * DOC: ignore_ce (bool)
> + * Switch to handle or ignore correctable errors.
> + */
> +static bool ignore_ce;
> +module_param(ignore_ce, bool, 0644);
> +MODULE_PARM_DESC(ignore_ce, "Ignore correctable errors");
> +
>  /**
>   * DOC: max_nr_entries (byte)
>   * Maximum number of descriptor entries possible for each FRU.
> @@ -413,6 +421,9 @@ static int fru_handle_mem_poison(struct notifier_block *nb, unsigned long val, v
>  	if (!mce_is_memory_error(m))
>  		return NOTIFY_DONE;
>  
> +	if (ignore_ce && mce_is_correctable(m))
> +		return NOTIFY_DONE;
> +
>  	retire_dram_row(m->addr, m->ipid, m->extcpu);
>  
>  	/*
> 
> base-commit: fd94619c43360eb44d28bd3ef326a4f85c600a07

LGTM!

Reviewed-by: Avadhut Naik <avadhut.naik@amd.com>

-- 
Thanks,
Avadhut Naik



* Re: [PATCH] RAS/AMD/FMPM: Add option to ignore CEs
  2025-10-06 15:17 [PATCH] RAS/AMD/FMPM: Add option to ignore CEs Yazen Ghannam
  2025-10-06 16:35 ` Naik, Avadhut
@ 2025-10-06 21:34 ` Borislav Petkov
  2025-10-07 14:56   ` Yazen Ghannam
  1 sibling, 1 reply; 6+ messages in thread
From: Borislav Petkov @ 2025-10-06 21:34 UTC
  To: Yazen Ghannam
  Cc: tony.luck, linux-edac, linux-kernel, avadhut.naik, john.allen

On Mon, Oct 06, 2025 at 03:17:31PM +0000, Yazen Ghannam wrote:
> Generally, FMPM will handle all memory errors as it is expected that
> "upstream" entities, like hardware thresholding or other Linux notifier
> blocks, will filter out errors.
> 
> However, some users prefer that correctable errors are not filtered out
> but only that FMPM does not take action on them.

That's a pretty shallow use case if you ask me...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH] RAS/AMD/FMPM: Add option to ignore CEs
  2025-10-06 21:34 ` Borislav Petkov
@ 2025-10-07 14:56   ` Yazen Ghannam
  2025-10-07 16:52     ` Luck, Tony
  0 siblings, 1 reply; 6+ messages in thread
From: Yazen Ghannam @ 2025-10-07 14:56 UTC
  To: Borislav Petkov
  Cc: tony.luck, linux-edac, linux-kernel, avadhut.naik, john.allen

On Mon, Oct 06, 2025 at 11:34:06PM +0200, Borislav Petkov wrote:
> On Mon, Oct 06, 2025 at 03:17:31PM +0000, Yazen Ghannam wrote:
> > Generally, FMPM will handle all memory errors as it is expected that
> > "upstream" entities, like hardware thresholding or other Linux notifier
> > blocks, will filter out errors.
> > 
> > However, some users prefer that correctable errors are not filtered out
> > but only that FMPM does not take action on them.
> 
> That's a pretty shallow use case if you ask me...
> 
> -- 

I think it's a common use case without FMPM.

IOW, log correctable errors but don't offline memory because of them.

Does that sound better or about the same?

Thanks,
Yazen


* RE: [PATCH] RAS/AMD/FMPM: Add option to ignore CEs
  2025-10-07 14:56   ` Yazen Ghannam
@ 2025-10-07 16:52     ` Luck, Tony
  2025-10-07 17:38       ` Yazen Ghannam
  0 siblings, 1 reply; 6+ messages in thread
From: Luck, Tony @ 2025-10-07 16:52 UTC
  To: Yazen Ghannam, Borislav Petkov
  Cc: linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	avadhut.naik@amd.com, john.allen@amd.com

> I think it's a common use case without FMPM.
>
> IOW, log correctable errors but don't offline memory because of them.
>
> Does that sound better or about the same?

Linux has the /proc/sys/vm/enable_soft_offline toggle for that case.

-Tony
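
For reference, the current state of that toggle can be read from userspace.
The sketch below is a minimal illustration, not part of the patch:
read_sysctl_int() is a hypothetical helper name, and it assumes a kernel
that exposes /proc/sys/vm/enable_soft_offline as a single integer (1 for
enabled, 0 for disabled).

```c
#include <assert.h>
#include <stdio.h>

/*
 * Reads a single integer from the file at path.
 * Returns the value, or -1 if the file cannot be opened or parsed.
 */
static int read_sysctl_int(const char *path)
{
	FILE *f;
	int value;

	assert(path != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;

	/* The sysctl file holds one decimal number followed by a newline. */
	if (fscanf(f, "%d", &value) != 1)
		value = -1;

	fclose(f);
	return value;
}
```

Calling read_sysctl_int("/proc/sys/vm/enable_soft_offline") returns the
toggle value, or -1 on kernels that do not provide the file.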


* Re: [PATCH] RAS/AMD/FMPM: Add option to ignore CEs
  2025-10-07 16:52     ` Luck, Tony
@ 2025-10-07 17:38       ` Yazen Ghannam
  0 siblings, 0 replies; 6+ messages in thread
From: Yazen Ghannam @ 2025-10-07 17:38 UTC
  To: Luck, Tony
  Cc: Borislav Petkov, linux-edac@vger.kernel.org,
	linux-kernel@vger.kernel.org, avadhut.naik@amd.com,
	john.allen@amd.com

On Tue, Oct 07, 2025 at 04:52:55PM +0000, Luck, Tony wrote:
> > I think it's a common use case without FMPM.
> >
> > IOW, log correctable errors but don't offline memory because of them.
> >
> > Does that sound better or about the same?
> 
> Linux has the /proc/sys/vm/enable_soft_offline toggle for that case.
> 

Thanks, that's a good suggestion.

We would still need a check in fru_handle_mem_poison() to skip saving
records to persistent storage.

And we would need a code update in _retire_row_mi300() to use the
soft_offline path.

Thanks,
Yazen

