[PATCH] ghes: Track number of recovered hardware errors

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] ghes: Track number of recovered hardware errors
@ 2025-07-14 16:57 Breno Leitao
  2025-07-14 17:10 ` Luck, Tony
  2025-07-14 17:10 ` Borislav Petkov
  0 siblings, 2 replies; 30+ messages in thread
From: Breno Leitao @ 2025-07-14 16:57 UTC (permalink / raw)
  To: Rafael J. Wysocki, Len Brown, James Morse, Tony Luck,
	Borislav Petkov, Robert Moore
  Cc: linux-acpi, linux-kernel, acpica-devel, kernel-team, Breno Leitao

Add a global variable, ghes_recovered_erors, to count hardware errors
classified as recoverable or corrected. This counter is exported and
included in vmcoreinfo for post-crash diagnostics.

Tracking this value helps operators potentially correlate hardware
errors across system events and crash dumps, indicating that RAS logs
might be useful while analyzing these crashes. This discussion and
motivation could be found in [1].

Atomic operations are deliberately omitted, as precise accuracy is not
required for this metric.

Link: https://lore.kernel.org/all/20250704-taint_recovered-v1-0-7a817f2d228e@debian.org/#t [1]
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 drivers/acpi/apei/ghes.c | 15 +++++++++++++--
 include/acpi/ghes.h      |  2 ++
 kernel/vmcore_info.c     |  4 ++++
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index f0584ccad4519..3735cfba17667 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -118,6 +118,12 @@ static inline bool is_hest_sync_notify(struct ghes *ghes)
 	return notify_type == ACPI_HEST_NOTIFY_SEA;
 }
 
+/* Count the number of hardware recovered errors, to be reported at
+ * crash/vmcore
+ */
+unsigned int ghes_recovered_erors;
+EXPORT_SYMBOL_GPL(ghes_recovered_erors);
+
 /*
  * This driver isn't really modular, however for the time being,
  * continuing to use module_param is the easiest way to remain
@@ -1100,13 +1106,16 @@ static int ghes_proc(struct ghes *ghes)
 {
 	struct acpi_hest_generic_status *estatus = ghes->estatus;
 	u64 buf_paddr;
-	int rc;
+	int rc, sev;
 
 	rc = ghes_read_estatus(ghes, estatus, &buf_paddr, FIX_APEI_GHES_IRQ);
 	if (rc)
 		goto out;
 
-	if (ghes_severity(estatus->error_severity) >= GHES_SEV_PANIC)
+	sev = ghes_severity(estatus->error_severity);
+	if (sev == GHES_SEV_RECOVERABLE || sev ==  GHES_SEV_CORRECTED)
+		ghes_recovered_erors += 1;
+	else if (sev >= GHES_SEV_PANIC)
 		__ghes_panic(ghes, estatus, buf_paddr, FIX_APEI_GHES_IRQ);
 
 	if (!ghes_estatus_cached(estatus)) {
@@ -1750,6 +1759,8 @@ void __init acpi_ghes_init(void)
 		pr_info(GHES_PFX "APEI firmware first mode is enabled by APEI bit.\n");
 	else
 		pr_info(GHES_PFX "Failed to enable APEI firmware first mode.\n");
+
+	ghes_recovered_erors = 0;
 }
 
 /*
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index be1dd4c1a9174..4b6be6733f826 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -75,6 +75,8 @@ void ghes_unregister_vendor_record_notifier(struct notifier_block *nb);
 struct list_head *ghes_get_devices(void);
 
 void ghes_estatus_pool_region_free(unsigned long addr, u32 size);
+
+extern unsigned int ghes_recovered_erors;
 #else
 static inline struct list_head *ghes_get_devices(void) { return NULL; }
 
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index e066d31d08f89..cb2a7daef3a68 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -14,6 +14,7 @@
 #include <linux/cpuhotplug.h>
 #include <linux/memblock.h>
 #include <linux/kmemleak.h>
+#include <acpi/ghes.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -223,6 +224,9 @@ static int __init crash_save_vmcoreinfo_init(void)
 	VMCOREINFO_SYMBOL(kallsyms_offsets);
 	VMCOREINFO_SYMBOL(kallsyms_relative_base);
 #endif /* CONFIG_KALLSYMS */
+#ifdef CONFIG_ACPI_APEI_GHES
+	VMCOREINFO_NUMBER(ghes_recovered_erors);
+#endif
 
 	arch_crash_save_vmcoreinfo();
 	update_vmcoreinfo_note();

---
base-commit: d7b8f8e20813f0179d8ef519541a3527e7661d3a
change-id: 20250707-vmcore_hw_error-322429e6c316

Best regards,
--  
Breno Leitao <leitao@debian.org>


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-14 16:57 [PATCH] ghes: Track number of recovered hardware errors Breno Leitao
@ 2025-07-14 17:10 ` Luck, Tony
  2025-07-14 17:10 ` Borislav Petkov
  1 sibling, 0 replies; 30+ messages in thread
From: Luck, Tony @ 2025-07-14 17:10 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Rafael J. Wysocki, Len Brown, James Morse, Borislav Petkov,
	Robert Moore, linux-acpi, linux-kernel, acpica-devel, kernel-team

On Mon, Jul 14, 2025 at 09:57:29AM -0700, Breno Leitao wrote:
> Add a global variable, ghes_recovered_erors, to count hardware errors
> classified as recoverable or corrected. This counter is exported and
> included in vmcoreinfo for post-crash diagnostics.
> 
> Tracking this value helps operators potentially correlate hardware
> errors across system events and crash dumps, indicating that RAS logs
> might be useful while analyzing these crashes. This discussion and
> motivation could be found in [1].
> 
> Atomic operations are deliberately omitted, as precise accuracy is not
> required for this metric.

[snip]

> @@ -1100,13 +1106,16 @@ static int ghes_proc(struct ghes *ghes)
>  {
>  	struct acpi_hest_generic_status *estatus = ghes->estatus;
>  	u64 buf_paddr;
> -	int rc;
> +	int rc, sev;
>  
>  	rc = ghes_read_estatus(ghes, estatus, &buf_paddr, FIX_APEI_GHES_IRQ);
>  	if (rc)
>  		goto out;
>  
> -	if (ghes_severity(estatus->error_severity) >= GHES_SEV_PANIC)
> +	sev = ghes_severity(estatus->error_severity);
> +	if (sev == GHES_SEV_RECOVERABLE || sev ==  GHES_SEV_CORRECTED)
> +		ghes_recovered_erors += 1;

		ghes_recovered_erors++:

> +	else if (sev >= GHES_SEV_PANIC)
>  		__ghes_panic(ghes, estatus, buf_paddr, FIX_APEI_GHES_IRQ);
>  
>  	if (!ghes_estatus_cached(estatus)) {
> @@ -1750,6 +1759,8 @@ void __init acpi_ghes_init(void)
>  		pr_info(GHES_PFX "APEI firmware first mode is enabled by APEI bit.\n");
>  	else
>  		pr_info(GHES_PFX "Failed to enable APEI firmware first mode.\n");
> +
> +	ghes_recovered_erors = 0;

Unnecessary. Global variables all start at zero unless otherwise
initialized.

>  }

-Tony

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-14 16:57 [PATCH] ghes: Track number of recovered hardware errors Breno Leitao
  2025-07-14 17:10 ` Luck, Tony
@ 2025-07-14 17:10 ` Borislav Petkov
  2025-07-14 17:33   ` Luck, Tony
  1 sibling, 1 reply; 30+ messages in thread
From: Borislav Petkov @ 2025-07-14 17:10 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Rafael J. Wysocki, Len Brown, James Morse, Tony Luck,
	Robert Moore, linux-acpi, linux-kernel, acpica-devel, kernel-team

On Mon, Jul 14, 2025 at 09:57:29AM -0700, Breno Leitao wrote:
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index f0584ccad4519..3735cfba17667 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -118,6 +118,12 @@ static inline bool is_hest_sync_notify(struct ghes *ghes)
>  	return notify_type == ACPI_HEST_NOTIFY_SEA;
>  }
>  
> +/* Count the number of hardware recovered errors, to be reported at
> + * crash/vmcore
> + */

Kernel comments style format is:

	/*
	 * A sentence ending with a full-stop.
	 * Another sentence. ...
	 * More sentences. ...
	 */

> +unsigned int ghes_recovered_erors;
> +EXPORT_SYMBOL_GPL(ghes_recovered_erors);

If you're going to do this, then you can perhaps make this variable always
present so that you don't need an export and call it "hardware_errors_count"
or so and all machinery which deals with RAS - GHES, MCE, AER, bla, can
increment it...

> @@ -223,6 +224,9 @@ static int __init crash_save_vmcoreinfo_init(void)
>  	VMCOREINFO_SYMBOL(kallsyms_offsets);
>  	VMCOREINFO_SYMBOL(kallsyms_relative_base);
>  #endif /* CONFIG_KALLSYMS */
> +#ifdef CONFIG_ACPI_APEI_GHES
> +	VMCOREINFO_NUMBER(ghes_recovered_erors);
> +#endif

... and then you can add it to the vmcore image unconditionally as a metric
telling that the machine has had so and so hw errors.

I'd say.

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-14 17:10 ` Borislav Petkov
@ 2025-07-14 17:33   ` Luck, Tony
  2025-07-14 17:35     ` Borislav Petkov
  0 siblings, 1 reply; 30+ messages in thread
From: Luck, Tony @ 2025-07-14 17:33 UTC (permalink / raw)
  To: Borislav Petkov, Breno Leitao
  Cc: Rafael J. Wysocki, Len Brown, James Morse, Moore, Robert,
	linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org,
	acpica-devel@lists.linux.dev, kernel-team@meta.com

> If you're going to do this, then you can perhaps make this variable always
> present so that you don't need an export and call it "hardware_errors_count"
> or so and all machinery which deals with RAS - GHES, MCE, AER, bla, can
> increment it...

Not sure I'd want to see all the different classes of errors bundled together
in a single count.  I think MCE recovery is quite robust and rarely leads to
subsequent kernel problems.

-Tony

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-14 17:33   ` Luck, Tony
@ 2025-07-14 17:35     ` Borislav Petkov
  2025-07-14 22:21       ` Luck, Tony
  0 siblings, 1 reply; 30+ messages in thread
From: Borislav Petkov @ 2025-07-14 17:35 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Breno Leitao, Rafael J. Wysocki, Len Brown, James Morse,
	Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com

On Mon, Jul 14, 2025 at 05:33:45PM +0000, Luck, Tony wrote:
> > If you're going to do this, then you can perhaps make this variable always
> > present so that you don't need an export and call it "hardware_errors_count"
> > or so and all machinery which deals with RAS - GHES, MCE, AER, bla, can
> > increment it...
> 
> Not sure I'd want to see all the different classes of errors bundled together
> in a single count.  I think MCE recovery is quite robust and rarely leads to
> subsequent kernel problems.

That's what I said. And a RAS tool can give that info already.

But for some reason Breno still wants that info somewhere else.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-14 17:35     ` Borislav Petkov
@ 2025-07-14 22:21       ` Luck, Tony
  2025-07-15  8:29         ` Borislav Petkov
                           ` (2 more replies)
  0 siblings, 3 replies; 30+ messages in thread
From: Luck, Tony @ 2025-07-14 22:21 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Breno Leitao, Rafael J. Wysocki, Len Brown, James Morse,
	Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com

On Mon, Jul 14, 2025 at 07:35:56PM +0200, Borislav Petkov wrote:
> On Mon, Jul 14, 2025 at 05:33:45PM +0000, Luck, Tony wrote:
> > > If you're going to do this, then you can perhaps make this variable always
> > > present so that you don't need an export and call it "hardware_errors_count"
> > > or so and all machinery which deals with RAS - GHES, MCE, AER, bla, can
> > > increment it...
> > 
> > Not sure I'd want to see all the different classes of errors bundled together
> > in a single count.  I think MCE recovery is quite robust and rarely leads to
> > subsequent kernel problems.
> 
> That's what I said. And a RAS tool can give that info already.

There's some value in it being in the kdump file, rather than having
to correlate data from multiple places. That's both time consuming
and error prone.

> But for some reason Breno still wants that info somewhere else.

So what about something like:

enum recovered_error_sources {
	ERR_GHES,
	ERR_MCE,
	ERR_AER,
	...
	ERR_NUM_SOURCES
};

static struct recovered_error_info {
	int	num_recovered_errors;
	time64_t	last_recovered_error_timestamp;
} recovered_error_info[ERR_NUM_SOURCES];

void log_recovered_error(enum recovered_error_sources src)
{
	recovered_error_info[src].num_recovered_errors++;
	recovered_error_info[src].last_recovered_error_timestamp =
		ktime_get_real_seconds();
}
EXPORT_SYMBOL_GPL(log_recovered_error);


PLus code to include that in VMCORE.

Then each subsystem just adds:


	log_recovered_error(ERR_GHES);
or
	log_recovered_error(ERR_MCE);
etc.

in the recovery path.

A count is just a hint. A count with a timestamp that is shortly
before a crash is a smoking gun.

-Tony

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-14 22:21       ` Luck, Tony
@ 2025-07-15  8:29         ` Borislav Petkov
  2025-07-15 10:20           ` Breno Leitao
  2025-07-15 10:07         ` Breno Leitao
  2025-07-17 16:06         ` Breno Leitao
  2 siblings, 1 reply; 30+ messages in thread
From: Borislav Petkov @ 2025-07-15  8:29 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Breno Leitao, Rafael J. Wysocki, Len Brown, James Morse,
	Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com

On Mon, Jul 14, 2025 at 03:21:44PM -0700, Luck, Tony wrote:
> So what about something like:
> 
> enum recovered_error_sources {
> 	ERR_GHES,
> 	ERR_MCE,
> 	ERR_AER,
> 	...
> 	ERR_NUM_SOURCES
> };
> 
> static struct recovered_error_info {
> 	int	num_recovered_errors;
> 	time64_t	last_recovered_error_timestamp;
> } recovered_error_info[ERR_NUM_SOURCES];

Too many "recovered" :-)

> A count is just a hint. A count with a timestamp that is shortly
> before a crash is a smoking gun.

All good thoughts... from where I'm standing right now, though, it looks to me
like we're wagging the dog: inventing issues and thinking of which solution
would fit them best. :-)

We have all of that info in rasdaemon. If the machine explodes, one can simply
read out its database from the core file. 

And if you don't run rasdaemon, you can read out dmesg from the vmcore where
the errors should have been dumped anyway.

So there's no need to add anything new to the kernel.

IMNSVHO.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-14 22:21       ` Luck, Tony
  2025-07-15  8:29         ` Borislav Petkov
@ 2025-07-15 10:07         ` Breno Leitao
  2025-07-15 10:18           ` Borislav Petkov
  2025-07-17 16:06         ` Breno Leitao
  2 siblings, 1 reply; 30+ messages in thread
From: Breno Leitao @ 2025-07-15 10:07 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, Rafael J. Wysocki, Len Brown, James Morse,
	Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com

On Mon, Jul 14, 2025 at 03:21:44PM -0700, Luck, Tony wrote:
> On Mon, Jul 14, 2025 at 07:35:56PM +0200, Borislav Petkov wrote:
> > On Mon, Jul 14, 2025 at 05:33:45PM +0000, Luck, Tony wrote:
> > > > If you're going to do this, then you can perhaps make this variable always
> > > > present so that you don't need an export and call it "hardware_errors_count"
> > > > or so and all machinery which deals with RAS - GHES, MCE, AER, bla, can
> > > > increment it...
> > > 
> > > Not sure I'd want to see all the different classes of errors bundled together
> > > in a single count.  I think MCE recovery is quite robust and rarely leads to
> > > subsequent kernel problems.
> > 
> > That's what I said. And a RAS tool can give that info already.
> 
> There's some value in it being in the kdump file, rather than having
> to correlate data from multiple places. That's both time consuming
> and error prone.

That's precisely the aim: I want to streamline the process without
duplicating detailed error reports, since we already have specialized
tools for in-depth analysis.

Even a brief value in the crash dump alerting the reader that errors
occurred would be a valuable aid for anyone diagnosing the issue.

> > But for some reason Breno still wants that info somewhere else.
> 
> So what about something like:
> 
> enum recovered_error_sources {
> 	ERR_GHES,
> 	ERR_MCE,
> 	ERR_AER,
> 	...
> 	ERR_NUM_SOURCES
> };
> 
> static struct recovered_error_info {
> 	int	num_recovered_errors;
> 	time64_t	last_recovered_error_timestamp;
> } recovered_error_info[ERR_NUM_SOURCES];
> 
> void log_recovered_error(enum recovered_error_sources src)
> {
> 	recovered_error_info[src].num_recovered_errors++;
> 	recovered_error_info[src].last_recovered_error_timestamp =
> 		ktime_get_real_seconds();
> }
> EXPORT_SYMBOL_GPL(log_recovered_error);
> 
> 
> PLus code to include that in VMCORE.
> 
> Then each subsystem just adds:
> 
> 
> 	log_recovered_error(ERR_GHES);
> or
> 	log_recovered_error(ERR_MCE);
> etc.
> 
> in the recovery path.
 
Thanks. Let me play with this suggestion.
--breno

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-15 10:07         ` Breno Leitao
@ 2025-07-15 10:18           ` Borislav Petkov
  0 siblings, 0 replies; 30+ messages in thread
From: Borislav Petkov @ 2025-07-15 10:18 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Luck, Tony, Rafael J. Wysocki, Len Brown, James Morse,
	Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com

On Tue, Jul 15, 2025 at 03:07:15AM -0700, Breno Leitao wrote:
> That's precisely the aim: I want to streamline the process without
> duplicating detailed error reports, since we already have specialized
> tools for in-depth analysis.
> 
> Even a brief value in the crash dump alerting the reader that errors
> occurred would be a valuable aid for anyone diagnosing the issue.

And what's stopping you from examining dmesg from the core dump?

I'm sure you have to look at that anyway...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-15  8:29         ` Borislav Petkov
@ 2025-07-15 10:20           ` Breno Leitao
  2025-07-15 10:31             ` Borislav Petkov
  0 siblings, 1 reply; 30+ messages in thread
From: Breno Leitao @ 2025-07-15 10:20 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Rafael J. Wysocki, Len Brown, James Morse,
	Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com

Hello Borislav,

On Tue, Jul 15, 2025 at 10:29:39AM +0200, Borislav Petkov wrote:
> We have all of that info in rasdaemon. If the machine explodes, one can simply
> read out its database from the core file. 
> 
> And if you don't run rasdaemon, you can read out dmesg from the vmcore where
> the errors should have been dumped anyway.

I approach this from a slightly different perspective, given my
experience overseeing millions of servers and having to diagnose kernel
problems across such a massive fleet.

To be candid, when you’re operating at that scale, kernel crashes and
hardware errors are unfortunately a common occurrence.

For instance, If every investigation (as you suggested above) take just
a couple of minutes, there simply wouldn’t be enough hours in the day,
even working 24x7, to keep up with the volume.

Thanks for the review and suggestions,
--breno

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-15 10:20           ` Breno Leitao
@ 2025-07-15 10:31             ` Borislav Petkov
  2025-07-15 12:02               ` Breno Leitao
  0 siblings, 1 reply; 30+ messages in thread
From: Borislav Petkov @ 2025-07-15 10:31 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Luck, Tony, Rafael J. Wysocki, Len Brown, James Morse,
	Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com

On Tue, Jul 15, 2025 at 03:20:35AM -0700, Breno Leitao wrote:
> For instance, If every investigation (as you suggested above) take just
> a couple of minutes, there simply wouldn’t be enough hours in the day,
> even working 24x7, to keep up with the volume.

Well, first of all, it would help considerably if you put the use case in the
commit message.

Then, are you saying that when examining kernel crashes, you don't look at
dmesg in the core file?

I find that hard to believe.

Because if you do look at dmesg and if you would grep it for hw errors - we do
log those if desired, AFAIR - you don't need anything new.

I'd say...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-15 10:31             ` Borislav Petkov
@ 2025-07-15 12:02               ` Breno Leitao
  2025-07-15 12:53                 ` Borislav Petkov
  0 siblings, 1 reply; 30+ messages in thread
From: Breno Leitao @ 2025-07-15 12:02 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Luck, Tony, Rafael J. Wysocki, Len Brown, James Morse,
	Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com

Hello Borislav,

On Tue, Jul 15, 2025 at 12:31:25PM +0200, Borislav Petkov wrote:
> On Tue, Jul 15, 2025 at 03:20:35AM -0700, Breno Leitao wrote:
> > For instance, If every investigation (as you suggested above) take just
> > a couple of minutes, there simply wouldn’t be enough hours in the day,
> > even working 24x7, to keep up with the volume.
> 
> Well, first of all, it would help considerably if you put the use case in the
> commit message.

Sorry, my bad. I can do better if we decide that this is worth pursuing.

> Then, are you saying that when examining kernel crashes, you don't look at
> I find that hard to believe.

We absolutely do examine kernel messages when investigating crashes, and
over time we've developed an extensive set of regular expressions to
identify relevant errors.

In practice, what you're describing is very similar to the workflow we
already use. For example, here are just a few of the regex patterns we
match in dmesg, grouped by category:

    (r"Machine check: Processor context corrupt", "cpu"),
    (r"Kernel panic - not syncing: Panicing machine check CPU died", "cpu"),
    (r"Machine check: Data load in unrecoverable area of kernel", "memory"),
    (r"Instruction fetch error in kernel", "memory"),
    (r"\[Hardware Error\]: +section_type: memory error", "memory"),
    (r"EDAC skx MC\d: HANDLING MCE MEMORY ERROR", "memory"),
    (r"\[Hardware Error\]:   section_type: general processor error", "cpu"),
    (r"UE memory read error on", "memory"),

And that’s just a partial list. We have 26 regexps for various issues,
and I wouldn’t be surprised if other large operators use a similar
approach.

While this system mostly works, there are real advantages to
consolidating this logic in the kernel itself, as I’m proposing:

    * Reduces the risk of mistakes
    	- Less chance of missing changes or edge cases.

    * Centralizes effort
	- Users don’t have to maintain their own lists; the logic lives
	  closer to the source of truth.

    * Simplifies maintenance
	- Avoids the constant need to update regexps if message strings
	  change.

    * Easier validation
	- It becomes straightforward to cross-check that all relevant
	  messages are being captured.

    * Automatic accounting
	- Any new or updated messages are immediately reflected.

    * Lower postmortem overhead
	- Requires less supporting infrastructure for crash analysis.

    * Netconsole support
	- Makes this status data available via netconsole, which is
	  helpful for those users.


> Because if you do look at dmesg and if you would grep it for hw errors - we do
> log those if desired, AFAIR - you don't need anything new.

Understood. If you don’t see additional value in kernel-side
counting, I can certainly keep relying on our current method. For
us, though, having this functionality built in feels more robust and
sustainable.

Thanks for the discussion,
--breno

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-15 12:02               ` Breno Leitao
@ 2025-07-15 12:53                 ` Borislav Petkov
  2025-07-15 13:46                   ` Shuai Xue
  0 siblings, 1 reply; 30+ messages in thread
From: Borislav Petkov @ 2025-07-15 12:53 UTC (permalink / raw)
  To: Breno Leitao, Alexander Graf, Konrad Rzeszutek Wilk, Peter Gonda
  Cc: Luck, Tony, Rafael J. Wysocki, Len Brown, James Morse,
	Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com

On Tue, Jul 15, 2025 at 05:02:39AM -0700, Breno Leitao wrote:
> Hello Borislav,
> 
> On Tue, Jul 15, 2025 at 12:31:25PM +0200, Borislav Petkov wrote:
> > On Tue, Jul 15, 2025 at 03:20:35AM -0700, Breno Leitao wrote:
> > > For instance, If every investigation (as you suggested above) take just
> > > a couple of minutes, there simply wouldn’t be enough hours in the day,
> > > even working 24x7, to keep up with the volume.
> > 
> > Well, first of all, it would help considerably if you put the use case in the
> > commit message.
> 
> Sorry, my bad. I can do better if we decide that this is worth pursuing.
> 
> > Then, are you saying that when examining kernel crashes, you don't look at
> > I find that hard to believe.
> 
> We absolutely do examine kernel messages when investigating crashes, and
> over time we've developed an extensive set of regular expressions to
> identify relevant errors.
> 
> In practice, what you're describing is very similar to the workflow we
> already use. For example, here are just a few of the regex patterns we
> match in dmesg, grouped by category:
> 
>     (r"Machine check: Processor context corrupt", "cpu"),
>     (r"Kernel panic - not syncing: Panicing machine check CPU died", "cpu"),
>     (r"Machine check: Data load in unrecoverable area of kernel", "memory"),
>     (r"Instruction fetch error in kernel", "memory"),
>     (r"\[Hardware Error\]: +section_type: memory error", "memory"),
>     (r"EDAC skx MC\d: HANDLING MCE MEMORY ERROR", "memory"),
>     (r"\[Hardware Error\]:   section_type: general processor error", "cpu"),
>     (r"UE memory read error on", "memory"),
> 
> And that’s just a partial list. We have 26 regexps for various issues,
> and I wouldn’t be surprised if other large operators use a similar
> approach.
> 
> While this system mostly works, there are real advantages to
> consolidating this logic in the kernel itself, as I’m proposing:
> 
>     * Reduces the risk of mistakes
>     	- Less chance of missing changes or edge cases.
> 
>     * Centralizes effort
> 	- Users don’t have to maintain their own lists; the logic lives
> 	  closer to the source of truth.
> 
>     * Simplifies maintenance
> 	- Avoids the constant need to update regexps if message strings
> 	  change.
> 
>     * Easier validation
> 	- It becomes straightforward to cross-check that all relevant
> 	  messages are being captured.
> 
>     * Automatic accounting
> 	- Any new or updated messages are immediately reflected.
> 
>     * Lower postmortem overhead
> 	- Requires less supporting infrastructure for crash analysis.
> 
>     * Netconsole support
> 	- Makes this status data available via netconsole, which is
> 	  helpful for those users.

Yap, this is more like it. Those sound to me like good reasons to have this
additional logging.

It would be really good to sync with other cloud providers here so that we can
do this one solution which fits all. Lemme CC some other folks I know who do
cloud gunk and leave the whole mail for their pleasure.

Newly CCed folks, you know how to find the whole discussion. :-)

Thx.

> > Because if you do look at dmesg and if you would grep it for hw errors - we do
> > log those if desired, AFAIR - you don't need anything new.
> 
> Understood. If you don’t see additional value in kernel-side
> counting, I can certainly keep relying on our current method. For
> us, though, having this functionality built in feels more robust and
> sustainable.
> 
> Thanks for the discussion,
> --breno

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-15 12:53                 ` Borislav Petkov
@ 2025-07-15 13:46                   ` Shuai Xue
  2025-07-15 15:09                     ` Borislav Petkov
  2025-07-15 17:25                     ` Breno Leitao
  0 siblings, 2 replies; 30+ messages in thread
From: Shuai Xue @ 2025-07-15 13:46 UTC (permalink / raw)
  To: Borislav Petkov, Breno Leitao, Alexander Graf,
	Konrad Rzeszutek Wilk, Peter Gonda, Luck, Tony
  Cc: Rafael J. Wysocki, Len Brown, James Morse, Moore, Robert,
	linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org,
	acpica-devel@lists.linux.dev, kernel-team@meta.com



在 2025/7/15 20:53, Borislav Petkov 写道:
> On Tue, Jul 15, 2025 at 05:02:39AM -0700, Breno Leitao wrote:
>> Hello Borislav,
>>
>> On Tue, Jul 15, 2025 at 12:31:25PM +0200, Borislav Petkov wrote:
>>> On Tue, Jul 15, 2025 at 03:20:35AM -0700, Breno Leitao wrote:
>>>> For instance, If every investigation (as you suggested above) take just
>>>> a couple of minutes, there simply wouldn’t be enough hours in the day,
>>>> even working 24x7, to keep up with the volume.
>>>
>>> Well, first of all, it would help considerably if you put the use case in the
>>> commit message.
>>
>> Sorry, my bad. I can do better if we decide that this is worth pursuing.
>>
>>> Then, are you saying that when examining kernel crashes, you don't look at
>>> I find that hard to believe.
>>
>> We absolutely do examine kernel messages when investigating crashes, and
>> over time we've developed an extensive set of regular expressions to
>> identify relevant errors.
>>
>> In practice, what you're describing is very similar to the workflow we
>> already use. For example, here are just a few of the regex patterns we
>> match in dmesg, grouped by category:
>>
>>      (r"Machine check: Processor context corrupt", "cpu"),
>>      (r"Kernel panic - not syncing: Panicing machine check CPU died", "cpu"),
>>      (r"Machine check: Data load in unrecoverable area of kernel", "memory"),
>>      (r"Instruction fetch error in kernel", "memory"),
>>      (r"\[Hardware Error\]: +section_type: memory error", "memory"),
>>      (r"EDAC skx MC\d: HANDLING MCE MEMORY ERROR", "memory"),
>>      (r"\[Hardware Error\]:   section_type: general processor error", "cpu"),
>>      (r"UE memory read error on", "memory"),
>>
>> And that’s just a partial list. We have 26 regexps for various issues,
>> and I wouldn’t be surprised if other large operators use a similar
>> approach.
>>
>> While this system mostly works, there are real advantages to
>> consolidating this logic in the kernel itself, as I’m proposing:
>>
>>      * Reduces the risk of mistakes
>>      	- Less chance of missing changes or edge cases.
>>
>>      * Centralizes effort
>> 	- Users don’t have to maintain their own lists; the logic lives
>> 	  closer to the source of truth.
>>
>>      * Simplifies maintenance
>> 	- Avoids the constant need to update regexps if message strings
>> 	  change.
>>
>>      * Easier validation
>> 	- It becomes straightforward to cross-check that all relevant
>> 	  messages are being captured.
>>
>>      * Automatic accounting
>> 	- Any new or updated messages are immediately reflected.
>>
>>      * Lower postmortem overhead
>> 	- Requires less supporting infrastructure for crash analysis.
>>
>>      * Netconsole support
>> 	- Makes this status data available via netconsole, which is
>> 	  helpful for those users.
> 
> Yap, this is more like it. Those sound to me like good reasons to have this
> additional logging.
> 
> It would be really good to sync with other cloud providers here so that we can
> do this one solution which fits all. Lemme CC some other folks I know who do
> cloud gunk and leave the whole mail for their pleasure.
> 
> Newly CCed folks, you know how to find the whole discussion. :-)
> 
> Thx.


For the purpose of counting, how about using the cmdline of rasdaemon?

$ ras-mc-ctl --summary
Memory controller events summary:
         Uncorrected on DIMM Label(s): 'SOCKET 1 CHANNEL 1 DIMM 0 DIMM1' 
location: 0:18:-1:-1 errors: 1

PCIe AER events summary:
         2 Uncorrected (Non-Fatal) errors: Completion Timeout

ARM processor events summary:
         CPU(mpidr=0x81090100) has 1 errors
         CPU(mpidr=0x810e0000) has 1 errors
         CPU(mpidr=0x81180000) has 1 errors
         CPU(mpidr=0x811a0000) has 1 errors
         CPU(mpidr=0x811c0000) has 1 errors
         CPU(mpidr=0x811d0300) has 1 errors
         CPU(mpidr=0x811f0100) has 1 errors
         CPU(mpidr=0x81390300) has 1 errors
         CPU(mpidr=0x813a0200) has 1 errors

No devlink errors.
Disk errors summary:
         0:0 has 60 errors
         0:2048 has 7 errors
         0:66304 has 2162 errors
Memory failure events summary:
         Recovered errors: 24

@Breno, Is rasdaemon not enough for your needs?


AFAICS, it is easier to extend more statistical metrics, like PR 205 
[1]. Also, it is easier to carry out releases and changes than with the 
kernel in the production environment.


Thanks.
Shuai

[1] 
https://github.com/mchehab/rasdaemon/pull/205/commits/391d67bc7d17443d00db96850e56770451126a0e


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-15 13:46                   ` Shuai Xue
@ 2025-07-15 15:09                     ` Borislav Petkov
  2025-07-16  2:05                       ` Shuai Xue
  2025-07-15 17:25                     ` Breno Leitao
  1 sibling, 1 reply; 30+ messages in thread
From: Borislav Petkov @ 2025-07-15 15:09 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Breno Leitao, Alexander Graf, Konrad Rzeszutek Wilk, Peter Gonda,
	Luck, Tony, Rafael J. Wysocki, Len Brown, James Morse,
	Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com

On Tue, Jul 15, 2025 at 09:46:03PM +0800, Shuai Xue wrote:
> For the purpose of counting, how about using the cmdline of rasdaemon?

That would mean you have to run rasdaemon on those machines before they
explode and then carve out the rasdaemon db from the coredump (this is
post-mortem analysis).

I would love for rasdaemon to log over the network and then other tools can
query those centralized logs but that has its own challenges...

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-15 13:46                   ` Shuai Xue
  2025-07-15 15:09                     ` Borislav Petkov
@ 2025-07-15 17:25                     ` Breno Leitao
  2025-07-16  3:04                       ` Shuai Xue
  1 sibling, 1 reply; 30+ messages in thread
From: Breno Leitao @ 2025-07-15 17:25 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Borislav Petkov, Alexander Graf, Konrad Rzeszutek Wilk,
	Peter Gonda, Luck, Tony, Rafael J. Wysocki, Len Brown,
	James Morse, Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com

Hello Shuai,

On Tue, Jul 15, 2025 at 09:46:03PM +0800, Shuai Xue wrote:
> > It would be really good to sync with other cloud providers here so that we can
> > do this one solution which fits all. Lemme CC some other folks I know who do
> > cloud gunk and leave the whole mail for their pleasure.
> > 
> > Newly CCed folks, you know how to find the whole discussion. :-)
> > 
> > Thx.
> 
> 
> For the purpose of counting, how about using the cmdline of rasdaemon?

How do you manage it at a large fleet of hosts? Do you have rasdaemon
logging always and how do you correlate with kernel crashes? At Meta, we
have an a "clues" tag for each crash, and one of the tags is Machine
Check Exception (MCE), which is parsed from dmesg right now (with the
regexp I shared earlier).

My plan with this patch is to have a counter for hardware errors that
would be exposed to the crashdump. So, post-morten analyzes tooling can
easily query if there are hardware errors and query RAS information in
the right databases, in case it seems a smoking gun.

Do you have any experience with this type of automatic correlation?

Thanks for your insights,
--breno

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-15 15:09                     ` Borislav Petkov
@ 2025-07-16  2:05                       ` Shuai Xue
  2025-07-16  6:30                         ` Mauro Carvalho Chehab
  0 siblings, 1 reply; 30+ messages in thread
From: Shuai Xue @ 2025-07-16  2:05 UTC (permalink / raw)
  To: Borislav Petkov, mchehab+huawei
  Cc: Breno Leitao, Alexander Graf, Konrad Rzeszutek Wilk, Peter Gonda,
	Luck, Tony, Rafael J. Wysocki, Len Brown, James Morse,
	Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com

在 2025/7/15 23:09, Borislav Petkov 写道:
> On Tue, Jul 15, 2025 at 09:46:03PM +0800, Shuai Xue wrote:
>> For the purpose of counting, how about using the cmdline of rasdaemon?
> 
> That would mean you have to run rasdaemon on those machines before they
> explode and then carve out the rasdaemon db from the coredump (this is
> post-mortem analysis).

Rasdaemon is a userspace tool that will collect all hardware error 
events reported by the Linux Kernel from several sources (EDAC, MCE, 
PCI, ...) into one common framework. And it has been a standard tools
in Alibaba. As far as I know, twitter also use Rasdaemon in its production.

> 
> I would love for rasdaemon to log over the network and then other tools can
> query those centralized logs but that has its own challenges...
> 

I also prefer collecting rasdaemon data in a centralized data center, as 
this is more beneficial for using big data analytics to analyze and 
predict errors. At the same time, the centralized side also uses 
rasdaemon logs as one of the references for machine operations and 
maintenance.

As for rasdaemon itself, it is just a single-node event collector and 
database, although it does also print logs. In practice, we use SLS [1] 
to collect rasdaemon text logs from individual nodes and parse them on 
the central side.

Thanks.
Shuai

[1]https://www.alibabacloud.com/help/en/sls/getting-started
[2]https://blog.x.com/engineering/en_us/topics/infrastructure/2023/how-twitter-uses-rasdaemon-for-hardware-reliability 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-15 17:25                     ` Breno Leitao
@ 2025-07-16  3:04                       ` Shuai Xue
  2025-07-16 12:42                         ` Breno Leitao
  0 siblings, 1 reply; 30+ messages in thread
From: Shuai Xue @ 2025-07-16  3:04 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Borislav Petkov, Alexander Graf, Konrad Rzeszutek Wilk,
	Peter Gonda, Luck, Tony, Rafael J. Wysocki, Len Brown,
	James Morse, Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com



在 2025/7/16 01:25, Breno Leitao 写道:
> Hello Shuai,
> 
> On Tue, Jul 15, 2025 at 09:46:03PM +0800, Shuai Xue wrote:
>>> It would be really good to sync with other cloud providers here so that we can
>>> do this one solution which fits all. Lemme CC some other folks I know who do
>>> cloud gunk and leave the whole mail for their pleasure.
>>>
>>> Newly CCed folks, you know how to find the whole discussion. :-)
>>>
>>> Thx.
>>
>>
>> For the purpose of counting, how about using the cmdline of rasdaemon?
> 
> How do you manage it at a large fleet of hosts? Do you have rasdaemon
> logging always and how do you correlate with kernel crashes? At Meta, we
> have an a "clues" tag for each crash, and one of the tags is Machine
> Check Exception (MCE), which is parsed from dmesg right now (with the
> regexp I shared earlier).

We deploy rasdaemon on each individual node, and then collect the
rasdaemon logs centrally. At the same time, we collect out-of-band
error logs. We aggregate and count the types and occurrences of errors,
and finally use empirical thresholds for operational alerts. The crash
analysis service consumes these alert messages.

> 
> My plan with this patch is to have a counter for hardware errors that
> would be exposed to the crashdump. So, post-morten analyzes tooling can
> easily query if there are hardware errors and query RAS information in
> the right databases, in case it seems a smoking gun.

I see your point. But does using a single ghes_recovered_errors counter
to track all corrected and non-fatal errors for CPU, memory, and PCIe
really help?

> 
> Do you have any experience with this type of automatic correlation?

Please see my reply above.

> 
> Thanks for your insights,
> --breno

Thanks.
Shuai


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-16  2:05                       ` Shuai Xue
@ 2025-07-16  6:30                         ` Mauro Carvalho Chehab
  0 siblings, 0 replies; 30+ messages in thread
From: Mauro Carvalho Chehab @ 2025-07-16  6:30 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Borislav Petkov, Breno Leitao, Alexander Graf,
	Konrad Rzeszutek Wilk, Peter Gonda, Luck, Tony, Rafael J. Wysocki,
	Len Brown, James Morse, Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com

Em Wed, 16 Jul 2025 10:05:27 +0800
Shuai Xue <xueshuai@linux.alibaba.com> escreveu:

> 在 2025/7/15 23:09, Borislav Petkov 写道:
> > On Tue, Jul 15, 2025 at 09:46:03PM +0800, Shuai Xue wrote:  
> >> For the purpose of counting, how about using the cmdline of rasdaemon?  
> > 
> > That would mean you have to run rasdaemon on those machines before they
> > explode and then carve out the rasdaemon db from the coredump (this is
> > post-mortem analysis).  
> 
> Rasdaemon is a userspace tool that will collect all hardware error 
> events reported by the Linux Kernel from several sources (EDAC, MCE, 
> PCI, ...) into one common framework. And it has been a standard tools
> in Alibaba. As far as I know, twitter also use Rasdaemon in its production.

There are several others using rasdaemon, afaikt. It was originally
implemented due to a demand from supercomputer customers with thousands
of nodes in US, and have been shipped on major distros for quite a while.

> 
> > 
> > I would love for rasdaemon to log over the network and then other tools can
> > query those centralized logs but that has its own challenges...
> >   
> 
> I also prefer collecting rasdaemon data in a centralized data center, as 
> this is more beneficial for using big data analytics to analyze and 
> predict errors. At the same time, the centralized side also uses 
> rasdaemon logs as one of the references for machine operations and 
> maintenance.
> 
> As for rasdaemon itself, it is just a single-node event collector and 
> database, although it does also print logs. In practice, we use SLS [1] 
> to collect rasdaemon text logs from individual nodes and parse them on 
> the central side.

Well, rasdaemon already uses SQL commands to store on its SQLite database.

It shouldn't be hard to add a patch series to optionally use a centralized
database directly. My only concern is that delivering logs to an external
database on a machine that has hardware errors can be problematic and
eventually end losing events.

Also, supporting different databases can be problematic due to the
libraries they require. Last time I wrote a code to write to an Oracle
DB (a life-long time ago), the number of the libraries that were required
were huge. Also, changing the order with "-l" caused ld to not find the
right objects. It was messy. Ok, supporting MySQL and PostgreSQL is not
that hard.

Perhaps a good compromise would be to add a logic there to open a local
socket or a tcp socket with a logger daemon, sending the events asynchronously
after storing locally at SQLite. Then, write a Python script using SQLAlchemy. 
This way, we gain for free support for several different databases.

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-16  3:04                       ` Shuai Xue
@ 2025-07-16 12:42                         ` Breno Leitao
  2025-07-17  3:03                           ` Shuai Xue
  0 siblings, 1 reply; 30+ messages in thread
From: Breno Leitao @ 2025-07-16 12:42 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Borislav Petkov, Alexander Graf, Konrad Rzeszutek Wilk,
	Peter Gonda, Luck, Tony, Rafael J. Wysocki, Len Brown,
	James Morse, Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com

hello Shuai,

On Wed, Jul 16, 2025 at 11:04:28AM +0800, Shuai Xue wrote:
> > My plan with this patch is to have a counter for hardware errors that
> > would be exposed to the crashdump. So, post-morten analyzes tooling can
> > easily query if there are hardware errors and query RAS information in
> > the right databases, in case it seems a smoking gun.
> 
> I see your point. But does using a single ghes_recovered_errors counter
> to track all corrected and non-fatal errors for CPU, memory, and PCIe
> really help?

It provides a quick indication that hardware issues have occurred, which
can prompt the operator to investigate further via RAS events.

That said, Tony proposed a more robust approach—categorizing and
tracking errors by their source. This would involve maintaining separate
counters for each source using an counter per enum type:

	enum recovered_error_sources {
		ERR_GHES,
		ERR_MCE,
		ERR_AER,
		...
		ERR_NUM_SOURCES
	};

See more at: https://lore.kernel.org/all/aHWC-J851eaHa_Au@agluck-desk3/

Do you think this would help you by any chance?

Thanks
--breno

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-16 12:42                         ` Breno Leitao
@ 2025-07-17  3:03                           ` Shuai Xue
  2025-07-17 12:06                             ` Breno Leitao
  0 siblings, 1 reply; 30+ messages in thread
From: Shuai Xue @ 2025-07-17  3:03 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Borislav Petkov, Alexander Graf, Konrad Rzeszutek Wilk,
	Peter Gonda, Luck, Tony, Rafael J. Wysocki, Len Brown,
	James Morse, Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com



在 2025/7/16 20:42, Breno Leitao 写道:
> hello Shuai,
> 
> On Wed, Jul 16, 2025 at 11:04:28AM +0800, Shuai Xue wrote:
>>> My plan with this patch is to have a counter for hardware errors that
>>> would be exposed to the crashdump. So, post-morten analyzes tooling can
>>> easily query if there are hardware errors and query RAS information in
>>> the right databases, in case it seems a smoking gun.
>>
>> I see your point. But does using a single ghes_recovered_errors counter
>> to track all corrected and non-fatal errors for CPU, memory, and PCIe
>> really help?
> 
> It provides a quick indication that hardware issues have occurred, which
> can prompt the operator to investigate further via RAS events.
> 
> That said, Tony proposed a more robust approach—categorizing and
> tracking errors by their source. This would involve maintaining separate
> counters for each source using an counter per enum type:
> 
> 	enum recovered_error_sources {
> 		ERR_GHES,
> 		ERR_MCE,
> 		ERR_AER,
> 		...
> 		ERR_NUM_SOURCES
> 	};
> 
> See more at: https://lore.kernel.org/all/aHWC-J851eaHa_Au@agluck-desk3/
> 
> Do you think this would help you by any chance?
> 
> Thanks
> --breno


Personally, I think this approach would be more helpful. Additionally, I
suggest not mixing CEs (Correctable Errors) and UEs (Uncorrectable
Errors) together. This is especially important for memory errors, as CEs
occur much more frequently than UEs, but their impact is much smaller.

Thanks.
Shuai

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-17  3:03                           ` Shuai Xue
@ 2025-07-17 12:06                             ` Breno Leitao
  2025-07-17 17:19                               ` Luck, Tony
  0 siblings, 1 reply; 30+ messages in thread
From: Breno Leitao @ 2025-07-17 12:06 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Borislav Petkov, Alexander Graf, Konrad Rzeszutek Wilk,
	Peter Gonda, Luck, Tony, Rafael J. Wysocki, Len Brown,
	James Morse, Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com

hello Shuai,

On Thu, Jul 17, 2025 at 11:03:51AM +0800, Shuai Xue wrote:
> 在 2025/7/16 20:42, Breno Leitao 写道:
> > That said, Tony proposed a more robust approach—categorizing and
> > tracking errors by their source. This would involve maintaining separate
> > counters for each source using an counter per enum type:
> > 
> > 	enum recovered_error_sources {
> > 		ERR_GHES,
> > 		ERR_MCE,
> > 		ERR_AER,
> > 		...
> > 		ERR_NUM_SOURCES
> > 	};
> > 
> > See more at: https://lore.kernel.org/all/aHWC-J851eaHa_Au@agluck-desk3/
> > 
> > Do you think this would help you by any chance?
> 
> Personally, I think this approach would be more helpful. Additionally, I
> suggest not mixing CEs (Correctable Errors) and UEs (Uncorrectable
> Errors) together. This is especially important for memory errors, as CEs
> occur much more frequently than UEs, but their impact is much smaller.

Yes, I totally agree. This would be even better than my original
solution. Let me spend some time on it and see how further I can go.

Thanks for your opinions,
--breno

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-14 22:21       ` Luck, Tony
  2025-07-15  8:29         ` Borislav Petkov
  2025-07-15 10:07         ` Breno Leitao
@ 2025-07-17 16:06         ` Breno Leitao
  2025-07-17 17:29           ` Luck, Tony
  2 siblings, 1 reply; 30+ messages in thread
From: Breno Leitao @ 2025-07-17 16:06 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Borislav Petkov, Rafael J. Wysocki, Len Brown, James Morse,
	Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com

Hello Tony,

On Mon, Jul 14, 2025 at 03:21:44PM -0700, Luck, Tony wrote:

> > But for some reason Breno still wants that info somewhere else.
> 
> So what about something like:

This is a way better suggestion than what I propose and it seems that
Shuai also liked it.

That said, I am playing with it and it is looking promissing. I liked
Borislav idea of having an always-present array and use your error
sources to index the array.

> static struct recovered_error_info {
> 	int	num_recovered_errors;
> 	time64_t	last_recovered_error_timestamp;
> } recovered_error_info[ERR_NUM_SOURCES];

I know naming is hard. Playing with it, I thought about the
hwerror_tracking as the "name" for this feature. Does it sound ok?

> void log_recovered_error(enum recovered_error_sources src)
> {
> 	recovered_error_info[src].num_recovered_errors++;
> 	recovered_error_info[src].last_recovered_error_timestamp =
> 		ktime_get_real_seconds();
> }
> EXPORT_SYMBOL_GPL(log_recovered_error);

Where do you think this code should be? I suppose we have a few options:

  1) Maybe a driver called hwerror_tracking in drivers/ directory
    - Pro: Code is self-contained
    - Cons: This will require a CONFIG_HWERROR_TRACKING, around. Also
      it will create some inter dependency of drivers.
  
  2) A hwerror_tracking.c error in kernel/ and have it always enabled
    - Pro: This is always available, and doesn't depend on the CONFIG dance
    - Cons: There is no way to disable it (?)
  
  3) In some pre-existing file.

Any opinion about it?
Thanks
--breno


^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-17 12:06                             ` Breno Leitao
@ 2025-07-17 17:19                               ` Luck, Tony
  2025-07-17 17:39                                 ` Breno Leitao
  0 siblings, 1 reply; 30+ messages in thread
From: Luck, Tony @ 2025-07-17 17:19 UTC (permalink / raw)
  To: Breno Leitao, Shuai Xue
  Cc: Borislav Petkov, Graf, Alexander, Konrad Rzeszutek Wilk,
	Peter Gonda, Rafael J. Wysocki, Len Brown, James Morse,
	Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com

>> Personally, I think this approach would be more helpful. Additionally, I
>> suggest not mixing CEs (Correctable Errors) and UEs (Uncorrectable
>> Errors) together. This is especially important for memory errors, as CEs
>> occur much more frequently than UEs, but their impact is much smaller.

Total agreement on keeping corrected memory errors out of this special
handling. They happen all the time in a large fleet, and are not significant
unless the same address repeats.

-Tony

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-17 16:06         ` Breno Leitao
@ 2025-07-17 17:29           ` Luck, Tony
  2025-07-18 16:11             ` Breno Leitao
  0 siblings, 1 reply; 30+ messages in thread
From: Luck, Tony @ 2025-07-17 17:29 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Borislav Petkov, Rafael J. Wysocki, Len Brown, James Morse,
	Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com

> That said, I am playing with it and it is looking promissing. I liked
> Borislav idea of having an always-present array and use your error
> sources to index the array.
>
> > static struct recovered_error_info {
> >     int     num_recovered_errors;
> >     time64_t        last_recovered_error_timestamp;
> > } recovered_error_info[ERR_NUM_SOURCES];
>
> I know naming is hard. Playing with it, I thought about the
> hwerror_tracking as the "name" for this feature. Does it sound ok?

Much better than the name I picked. But easy to swap out if
somebody suggests something better.

> > void log_recovered_error(enum recovered_error_sources src)
> > {
> >     recovered_error_info[src].num_recovered_errors++;
> >     recovered_error_info[src].last_recovered_error_timestamp =
> >             ktime_get_real_seconds();
> > }
> > EXPORT_SYMBOL_GPL(log_recovered_error);
>
> Where do you think this code should be? I suppose we have a few options:
>
>   1) Maybe a driver called hwerror_tracking in drivers/ directory
>     - Pro: Code is self-contained
>     - Cons: This will require a CONFIG_HWERROR_TRACKING, around. Also
>       it will create some inter dependency of drivers.
>
>   2) A hwerror_tracking.c error in kernel/ and have it always enabled
>     - Pro: This is always available, and doesn't depend on the CONFIG dance
>     - Cons: There is no way to disable it (?)
>
>   3) In some pre-existing file.

If the intent is still to add this information to vmcore (as in
earlier discussions in this thread). Then it could go into
kernel/vmcore_info.c (and be configured with CONFIG_VMCORE_INFO).

Would just need an empty stub in some header file for the
log_recovered_error() function.

-Tony

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-17 17:19                               ` Luck, Tony
@ 2025-07-17 17:39                                 ` Breno Leitao
  2025-07-17 17:54                                   ` Luck, Tony
  0 siblings, 1 reply; 30+ messages in thread
From: Breno Leitao @ 2025-07-17 17:39 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Shuai Xue, Borislav Petkov, Graf, Alexander,
	Konrad Rzeszutek Wilk, Peter Gonda, Rafael J. Wysocki, Len Brown,
	James Morse, Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com

On Thu, Jul 17, 2025 at 05:19:48PM +0000, Luck, Tony wrote:
> >> Personally, I think this approach would be more helpful. Additionally, I
> >> suggest not mixing CEs (Correctable Errors) and UEs (Uncorrectable
> >> Errors) together. This is especially important for memory errors, as CEs
> >> occur much more frequently than UEs, but their impact is much smaller.
> 
> Total agreement on keeping corrected memory errors out of this special
> handling. They happen all the time in a large fleet, and are not significant
> unless the same address repeats.

Are these EDAC errors? Shouldn't we track CE errors in
edac_device_handle_ce_count()?

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-17 17:39                                 ` Breno Leitao
@ 2025-07-17 17:54                                   ` Luck, Tony
  0 siblings, 0 replies; 30+ messages in thread
From: Luck, Tony @ 2025-07-17 17:54 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Shuai Xue, Borislav Petkov, Graf, Alexander,
	Konrad Rzeszutek Wilk, Peter Gonda, Rafael J. Wysocki, Len Brown,
	James Morse, Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com

>> Total agreement on keeping corrected memory errors out of this special
>> handling. They happen all the time in a large fleet, and are not significant
>> unless the same address repeats.
>
>  Are these EDAC errors? Shouldn't we track CE errors in
> edac_device_handle_ce_count()?

Existing code should already handle this. EDAC drivers register
with mce_register_injector_chain() to be told of memory errors.

-Tony

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-17 17:29           ` Luck, Tony
@ 2025-07-18 16:11             ` Breno Leitao
  2025-07-18 17:36               ` Luck, Tony
  0 siblings, 1 reply; 30+ messages in thread
From: Breno Leitao @ 2025-07-18 16:11 UTC (permalink / raw)
  To: Luck, Tony, xueshuai, mahesh, oohall
  Cc: Borislav Petkov, Rafael J. Wysocki, Len Brown, James Morse,
	Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com, osandov, linuxppc-dev

Hello Tony,

On Thu, Jul 17, 2025 at 05:29:14PM +0000, Luck, Tony wrote:
>
> If the intent is still to add this information to vmcore (as in
> earlier discussions in this thread). Then it could go into
> kernel/vmcore_info.c (and be configured with CONFIG_VMCORE_INFO).
> 
> Would just need an empty stub in some header file for the
> log_recovered_error() function.

Thanks for the suggestion.

I found that I don't need to expose the metrics in vmcore info at all to
be able to read them from vmcore, given crash/drgn can read those
symbols.

Global variable hwerror_tracking will be write-only during kernel
run-time, and only read during post morten analyzes. I am still not sure
if the compiler might not get rid of them completely, given no on reads.
I am wondering if I should EXPORT_SYMBOL_GPL(hwerror_tracking) to avoid
any optimization there.

Anyway, this is the patch I am using and it solves the problem I am
interested in. Any opinion?

Thanks for your support,
--breno

commit 396d9bd5266607731b535f4246fd3e4971df9016
Author: Breno Leitao <leitao@debian.org>
Date:   Thu Jul 17 07:39:26 2025 -0700

    vmcoreinfo: Track and log recoverable hardware errors
    
    Introduce a generic infrastructure for tracking recoverable hardware
    errors (HW errors that did not cause a panic) and record them for vmcore
    consumption. This aids post-mortem crash analysis tools by preserving
    a count and timestamp for the last occurrence of such errors.
    
    This patch adds centralized logging for three common sources of
    recoverable hardware errors:
    
      - PCIe AER Correctable errors
      - x86 Machine Check Exceptions (MCE)
      - APEI/CPER GHES corrected or recoverable errors
    
    Each source logs to a shared `hwerror_tracking` array, protected by a
    spinlock, and maintains a per-source error count and timestamp of the
    most recent event.
    
    hwerror_tracking is write-only at kernel runtime, and it is meant to be
    read from vmcore using tools like crash/drgn. For example, this is how
    it looks like from drgn:
    
            >>> prog['hwerror_tracking']
            (struct hwerror_tracking_info [3]){
                    {
                            .count = (int)0,
                            .timestamp = (time64_t)0,
                    },
                    {
                            .count = (int)0,
                            .timestamp = (time64_t)0,
                    },
                    {
                            .count = (int)844,
                            .timestamp = (time64_t)1752852018,
                    },
            }
    
    Suggested-by: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Breno Leitao <leitao@debian.org>

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 4da4eab56c81d..781cf574642eb 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -45,6 +45,7 @@
 #include <linux/task_work.h>
 #include <linux/hardirq.h>
 #include <linux/kexec.h>
+#include <linux/vmcore_info.h>
 
 #include <asm/fred.h>
 #include <asm/cpu_device_id.h>
@@ -1692,6 +1693,8 @@ noinstr void do_machine_check(struct pt_regs *regs)
 out:
 	instrumentation_end();
 
+	/* Given it didn't panic, mark it as recoverable */
+	hwerror_tracking_log(HWE_RECOV_MCE);
 clear:
 	mce_wrmsrq(MSR_IA32_MCG_STATUS, 0);
 }
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index f0584ccad4519..255453cdc72e9 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -43,6 +43,7 @@
 #include <linux/uuid.h>
 #include <linux/ras.h>
 #include <linux/task_work.h>
+#include <linux/vmcore_info.h>
 
 #include <acpi/actbl1.h>
 #include <acpi/ghes.h>
@@ -1100,13 +1101,16 @@ static int ghes_proc(struct ghes *ghes)
 {
 	struct acpi_hest_generic_status *estatus = ghes->estatus;
 	u64 buf_paddr;
-	int rc;
+	int rc, sev;
 
 	rc = ghes_read_estatus(ghes, estatus, &buf_paddr, FIX_APEI_GHES_IRQ);
 	if (rc)
 		goto out;
 
-	if (ghes_severity(estatus->error_severity) >= GHES_SEV_PANIC)
+	sev = ghes_severity(estatus->error_severity);
+	if (sev == GHES_SEV_RECOVERABLE || sev ==  GHES_SEV_CORRECTED)
+		hwerror_tracking_log(HWE_RECOV_GHES);
+	else if (sev >= GHES_SEV_PANIC)
 		__ghes_panic(ghes, estatus, buf_paddr, FIX_APEI_GHES_IRQ);
 
 	if (!ghes_estatus_cached(estatus)) {
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 70ac661883672..9d4fa1cb8afb9 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -30,6 +30,7 @@
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include <linux/slab.h>
+#include <linux/vmcore_info.h>
 #include <acpi/apei.h>
 #include <acpi/ghes.h>
 #include <ras/ras_event.h>
@@ -746,6 +747,7 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
 	switch (info->severity) {
 	case AER_CORRECTABLE:
 		aer_info->dev_total_cor_errs++;
+		hwerror_tracking_log(HWE_RECOV_AER);
 		counter = &aer_info->dev_cor_errs[0];
 		max = AER_MAX_TYPEOF_COR_ERRS;
 		break;
diff --git a/include/linux/vmcore_info.h b/include/linux/vmcore_info.h
index 37e003ae52626..5894da92a6ba4 100644
--- a/include/linux/vmcore_info.h
+++ b/include/linux/vmcore_info.h
@@ -77,4 +77,18 @@ extern u32 *vmcoreinfo_note;
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
 			  void *data, size_t data_len);
 void final_note(Elf_Word *buf);
+
+enum hwerror_tracking_source {
+	HWE_RECOV_AER,
+	HWE_RECOV_MCE,
+	HWE_RECOV_GHES,
+	HWE_RECOV_MAX,
+};
+
+#ifdef CONFIG_VMCORE_INFO
+void hwerror_tracking_log(enum hwerror_tracking_source src);
+#else
+void hwerror_tracking_log(enum hwerror_tracking_source src) {};
+#endif
+
 #endif /* LINUX_VMCORE_INFO_H */
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index e066d31d08f89..c3d2bfffec298 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -31,6 +31,14 @@ u32 *vmcoreinfo_note;
 /* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
 static unsigned char *vmcoreinfo_data_safecopy;
 
+static struct hwerror_tracking_info {
+	int         count;
+	time64_t    timestamp;
+};
+
+static struct hwerror_tracking_info hwerror_tracking[HWE_RECOV_MAX];
+static DEFINE_SPINLOCK(hwerror_tracking_lock);
+
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
 			  void *data, size_t data_len)
 {
@@ -118,6 +126,23 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
 }
 EXPORT_SYMBOL(paddr_vmcoreinfo_note);
 
+void hwerror_tracking_log(enum hwerror_tracking_source src)
+{
+	struct hwerror_tracking_info *hwet;
+	unsigned long flags;
+
+	if (src < 0 || src >= HWE_RECOV_MAX)
+		return;
+
+	hwet = &hwerror_tracking[src];
+
+	spin_lock_irqsave(&hwerror_tracking_lock, flags);
+	hwet->count++;
+	hwet->timestamp = ktime_get_real_seconds();
+	spin_unlock_irqrestore(&hwerror_tracking_lock, flags);
+}
+EXPORT_SYMBOL_GPL(hwerror_tracking_log);
+
 static int __init crash_save_vmcoreinfo_init(void)
 {
 	vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL);

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* RE: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-18 16:11             ` Breno Leitao
@ 2025-07-18 17:36               ` Luck, Tony
  2025-07-21  8:56                 ` Breno Leitao
  0 siblings, 1 reply; 30+ messages in thread
From: Luck, Tony @ 2025-07-18 17:36 UTC (permalink / raw)
  To: Breno Leitao, xueshuai@linux.alibaba.com, mahesh@linux.ibm.com,
	oohall@gmail.com
  Cc: Borislav Petkov, Rafael J. Wysocki, Len Brown, James Morse,
	Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com, osandov@osandov.com,
	linuxppc-dev@lists.ozlabs.org

> I found that I don't need to expose the metrics in vmcore info at all to
> be able to read them from vmcore, given crash/drgn can read those
> symbols.
>
> Global variable hwerror_tracking will be write-only during kernel
> run-time, and only read during post morten analyzes. I am still not sure
> if the compiler might not get rid of them completely, given no on reads.
> I am wondering if I should EXPORT_SYMBOL_GPL(hwerror_tracking) to avoid
> any optimization there.

Thanks for fleshing this out into a patch. It looks very much like I
imagined.

I'd be amazed if a compiler did elide all this code and data because it
noticed it was written but never read.

Is the spinlock when logging really helping anything? You weren't
worried about locking/atomics in your original patch because users
mostly care about zero vs. non-zero (or maybe vs. "many"). If the
count is slightly off when many logs happen, it may not matter.

The spinlock doesn't help with the timestamp part at all.

> Anyway, this is the patch I am using and it solves the problem I am
> interested in. Any opinion?

-Tony

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH] ghes: Track number of recovered hardware errors
  2025-07-18 17:36               ` Luck, Tony
@ 2025-07-21  8:56                 ` Breno Leitao
  0 siblings, 0 replies; 30+ messages in thread
From: Breno Leitao @ 2025-07-21  8:56 UTC (permalink / raw)
  To: Luck, Tony
  Cc: xueshuai@linux.alibaba.com, mahesh@linux.ibm.com,
	oohall@gmail.com, Borislav Petkov, Rafael J. Wysocki, Len Brown,
	James Morse, Moore, Robert, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, acpica-devel@lists.linux.dev,
	kernel-team@meta.com, osandov@osandov.com,
	linuxppc-dev@lists.ozlabs.org

Hello Tony,

On Fri, Jul 18, 2025 at 05:36:50PM +0000, Luck, Tony wrote:
> > I found that I don't need to expose the metrics in vmcore info at all to
> > be able to read them from vmcore, given crash/drgn can read those
> > symbols.
> >
> > Global variable hwerror_tracking will be write-only during kernel
> > run-time, and only read during post morten analyzes. I am still not sure
> > if the compiler might not get rid of them completely, given no on reads.
> > I am wondering if I should EXPORT_SYMBOL_GPL(hwerror_tracking) to avoid
> > any optimization there.
> 
> Thanks for fleshing this out into a patch. It looks very much like I
> imagined.
> 
> I'd be amazed if a compiler did elide all this code and data because it
> noticed it was written but never read.
> 
> Is the spinlock when logging really helping anything? You weren't
> worried about locking/atomics in your original patch because users
> mostly care about zero vs. non-zero (or maybe vs. "many"). If the
> count is slightly off when many logs happen, it may not matter.
> 
> The spinlock doesn't help with the timestamp part at all.

Agree, precise number is not important there, and if there are
conflicts, it will not hurt the message.

Let me remove the spinlock completely them and send a new version.

Thanks for your support,
--breno

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2025-07-21  8:56 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-14 16:57 [PATCH] ghes: Track number of recovered hardware errors Breno Leitao
2025-07-14 17:10 ` Luck, Tony
2025-07-14 17:10 ` Borislav Petkov
2025-07-14 17:33   ` Luck, Tony
2025-07-14 17:35     ` Borislav Petkov
2025-07-14 22:21       ` Luck, Tony
2025-07-15  8:29         ` Borislav Petkov
2025-07-15 10:20           ` Breno Leitao
2025-07-15 10:31             ` Borislav Petkov
2025-07-15 12:02               ` Breno Leitao
2025-07-15 12:53                 ` Borislav Petkov
2025-07-15 13:46                   ` Shuai Xue
2025-07-15 15:09                     ` Borislav Petkov
2025-07-16  2:05                       ` Shuai Xue
2025-07-16  6:30                         ` Mauro Carvalho Chehab
2025-07-15 17:25                     ` Breno Leitao
2025-07-16  3:04                       ` Shuai Xue
2025-07-16 12:42                         ` Breno Leitao
2025-07-17  3:03                           ` Shuai Xue
2025-07-17 12:06                             ` Breno Leitao
2025-07-17 17:19                               ` Luck, Tony
2025-07-17 17:39                                 ` Breno Leitao
2025-07-17 17:54                                   ` Luck, Tony
2025-07-15 10:07         ` Breno Leitao
2025-07-15 10:18           ` Borislav Petkov
2025-07-17 16:06         ` Breno Leitao
2025-07-17 17:29           ` Luck, Tony
2025-07-18 16:11             ` Breno Leitao
2025-07-18 17:36               ` Luck, Tony
2025-07-21  8:56                 ` Breno Leitao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).