From: Breno Leitao <leitao@debian.org>
To: "Rafael J. Wysocki" <rafael@kernel.org>,
	Len Brown <lenb@kernel.org>,  James Morse <james.morse@arm.com>,
	Tony Luck <tony.luck@intel.com>,  Borislav Petkov <bp@alien8.de>,
	Robert Moore <robert.moore@intel.com>
Cc: linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org,
	 acpica-devel@lists.linux.dev, kernel-team@meta.com,
	 Breno Leitao <leitao@debian.org>
Subject: [PATCH] ghes: Track number of recovered hardware errors
Date: Mon, 14 Jul 2025 09:57:29 -0700
Message-ID: <20250714-vmcore_hw_error-v1-1-8cf45edb6334@debian.org>

Add a global variable, ghes_recovered_erors, to count hardware errors
classified as recoverable or corrected. This counter is exported and
included in vmcoreinfo for post-crash diagnostics.

Tracking this value lets operators correlate hardware errors with
system events and crash dumps: a non-zero count indicates that RAS
logs may be worth consulting when analyzing a crash. The discussion
and motivation can be found in [1].

Atomic operations are deliberately omitted, as precise accuracy is not
required for this metric.

Link: https://lore.kernel.org/all/20250704-taint_recovered-v1-0-7a817f2d228e@debian.org/#t [1]
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 drivers/acpi/apei/ghes.c | 15 +++++++++++++--
 include/acpi/ghes.h      |  2 ++
 kernel/vmcore_info.c     |  4 ++++
 3 files changed, 19 insertions(+), 2 deletions(-)
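
For readers who want to experiment with the accounting outside the kernel,
here is a minimal userspace sketch of the branch added to ghes_proc(). The
helper name account_severity() and the standalone counter are hypothetical,
and the enum values are assumed to mirror those in include/acpi/ghes.h. It
is an illustration of the classification only, not the driver code.

```c
#include <assert.h>

/* Severity levels, assumed to mirror the enum in include/acpi/ghes.h. */
enum ghes_sev {
	GHES_SEV_NO = 0,
	GHES_SEV_CORRECTED = 1,
	GHES_SEV_RECOVERABLE = 2,
	GHES_SEV_PANIC = 3,
};

/* Userspace stand-in for the counter added by this patch. */
static unsigned int recovered_errors;

/*
 * Hypothetical helper modelling the new branch in ghes_proc():
 * bump the counter for corrected/recoverable events and return 1
 * when the caller would go on to panic instead.
 */
static int account_severity(int sev)
{
	assert(sev >= GHES_SEV_NO && sev <= GHES_SEV_PANIC);

	if (sev == GHES_SEV_RECOVERABLE || sev == GHES_SEV_CORRECTED)
		recovered_errors += 1;	/* plain, non-atomic increment */
	else if (sev >= GHES_SEV_PANIC)
		return 1;

	return 0;
}
```

The increment is a plain read-modify-write, so concurrent callers could
occasionally lose an update; that matches the trade-off described in the
commit message, where an approximate count is considered sufficient.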

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index f0584ccad4519..3735cfba17667 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -118,6 +118,12 @@ static inline bool is_hest_sync_notify(struct ghes *ghes)
 	return notify_type == ACPI_HEST_NOTIFY_SEA;
 }
 
+/*
+ * Count of recovered hardware errors, reported in vmcoreinfo at crash time.
+ */
+unsigned int ghes_recovered_erors;
+EXPORT_SYMBOL_GPL(ghes_recovered_erors);
+
 /*
  * This driver isn't really modular, however for the time being,
  * continuing to use module_param is the easiest way to remain
@@ -1100,13 +1106,16 @@ static int ghes_proc(struct ghes *ghes)
 {
 	struct acpi_hest_generic_status *estatus = ghes->estatus;
 	u64 buf_paddr;
-	int rc;
+	int rc, sev;
 
 	rc = ghes_read_estatus(ghes, estatus, &buf_paddr, FIX_APEI_GHES_IRQ);
 	if (rc)
 		goto out;
 
-	if (ghes_severity(estatus->error_severity) >= GHES_SEV_PANIC)
+	sev = ghes_severity(estatus->error_severity);
+	if (sev == GHES_SEV_RECOVERABLE || sev == GHES_SEV_CORRECTED)
+		ghes_recovered_erors += 1;
+	else if (sev >= GHES_SEV_PANIC)
 		__ghes_panic(ghes, estatus, buf_paddr, FIX_APEI_GHES_IRQ);
 
 	if (!ghes_estatus_cached(estatus)) {
@@ -1750,6 +1759,8 @@ void __init acpi_ghes_init(void)
 		pr_info(GHES_PFX "APEI firmware first mode is enabled by APEI bit.\n");
 	else
 		pr_info(GHES_PFX "Failed to enable APEI firmware first mode.\n");
+
+	ghes_recovered_erors = 0;
 }
 
 /*
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index be1dd4c1a9174..4b6be6733f826 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -75,6 +75,8 @@ void ghes_unregister_vendor_record_notifier(struct notifier_block *nb);
 struct list_head *ghes_get_devices(void);
 
 void ghes_estatus_pool_region_free(unsigned long addr, u32 size);
+
+extern unsigned int ghes_recovered_erors;
 #else
 static inline struct list_head *ghes_get_devices(void) { return NULL; }
 
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index e066d31d08f89..cb2a7daef3a68 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -14,6 +14,7 @@
 #include <linux/cpuhotplug.h>
 #include <linux/memblock.h>
 #include <linux/kmemleak.h>
+#include <acpi/ghes.h>
 
 #include <asm/page.h>
 #include <asm/sections.h>
@@ -223,6 +224,9 @@ static int __init crash_save_vmcoreinfo_init(void)
 	VMCOREINFO_SYMBOL(kallsyms_offsets);
 	VMCOREINFO_SYMBOL(kallsyms_relative_base);
 #endif /* CONFIG_KALLSYMS */
+#ifdef CONFIG_ACPI_APEI_GHES
+	VMCOREINFO_NUMBER(ghes_recovered_erors);
+#endif
 
 	arch_crash_save_vmcoreinfo();
 	update_vmcoreinfo_note();

---
base-commit: d7b8f8e20813f0179d8ef519541a3527e7661d3a
change-id: 20250707-vmcore_hw_error-322429e6c316
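
As a rough illustration of how the exported value could be consumed after a
crash: VMCOREINFO_NUMBER() records an entry as a "NUMBER(name)=value" line
in the vmcoreinfo text, so a post-mortem tool can look it up by name. The
helper vmcoreinfo_number() below is hypothetical and assumes the vmcoreinfo
note has already been extracted into a NUL-terminated string; it does not
check that the match sits at the start of a line.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: look up "NUMBER(<name>)=<value>" in a vmcoreinfo
 * text blob. Returns 0 and stores the value on success, -1 if absent.
 */
static int vmcoreinfo_number(const char *text, const char *name, long *out)
{
	size_t len;
	const char *p;

	assert(text && name && out);

	len = strlen(name);
	p = text;

	/* Visit every "NUMBER(" occurrence and compare the full name. */
	while ((p = strstr(p, "NUMBER(")) != NULL) {
		p += strlen("NUMBER(");
		if (strncmp(p, name, len) == 0 &&
		    p[len] == ')' && p[len + 1] == '=') {
			*out = strtol(p + len + 2, NULL, 10);
			return 0;
		}
	}

	return -1;
}
```

A caller would pass the extracted text and "ghes_recovered_erors" as the
name; a return of -1 would suggest a kernel built without
CONFIG_ACPI_APEI_GHES or without this patch.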

Best regards,
--  
Breno Leitao <leitao@debian.org>


