The Linux Kernel Mailing List
* [PATCH 0/2] vmcoreinfo: GHES: track fatal hardware errors
@ 2026-06-17 13:32 Breno Leitao
  2026-06-17 13:32 ` [PATCH 1/2] ACPI: APEI: GHES: fix severity namespace in ghes_log_hwerr() Breno Leitao
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Breno Leitao @ 2026-06-17 13:32 UTC (permalink / raw)
  To: Rafael J. Wysocki, Tony Luck, Borislav Petkov, Hanjun Guo,
	Mauro Carvalho Chehab, Shuai Xue, Len Brown, Andrew Morton,
	Baoquan He, Mike Rapoport, Pasha Tatashin, Pratyush Yadav,
	Dave Young
  Cc: linux-acpi, linux-kernel, riel, caggio, kexec, Breno Leitao,
	kernel-team

Hardware errors reported through APEI/GHES are recorded in the kernel's
hwerr_data array so that crash tooling can tell from the vmcore whether a
hardware error preceded a crash.

This short series improves that path:

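 * The first patch fixes ghes_log_hwerr(), which compared a GHES_SEV_*
   value against CPER_SEV_RECOVERABLE and so never recorded recoverable
   errors.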
 * The second adds a HWERR_FATAL bucket that records panic-severity
   errors -- the ones most likely to have caused the crash -- when GHES
   reports them.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Breno Leitao (2):
      ACPI: APEI: GHES: fix severity namespace in ghes_log_hwerr()
      vmcore_info: track fatal hardware errors

 drivers/acpi/apei/ghes.c    | 7 ++++++-
 include/uapi/linux/vmcore.h | 1 +
 2 files changed, 7 insertions(+), 1 deletion(-)
---
base-commit: 4fa3f5fabb30bf00d7475d5a33459ea83d639bf9
change-id: 20260617-hwerr-a28d5d66ae87

Best regards,
-- 
Breno Leitao <leitao@debian.org>



* [PATCH 1/2] ACPI: APEI: GHES: fix severity namespace in ghes_log_hwerr()
  2026-06-17 13:32 [PATCH 0/2] vmcoreinfo: GHES: track fatal hardware errors Breno Leitao
@ 2026-06-17 13:32 ` Breno Leitao
  2026-06-17 13:32 ` [PATCH 2/2] vmcore_info: track fatal hardware errors Breno Leitao
  2026-06-29 16:02 ` [PATCH 0/2] vmcoreinfo: GHES: " Breno Leitao
  2 siblings, 0 replies; 4+ messages in thread
From: Breno Leitao @ 2026-06-17 13:32 UTC (permalink / raw)
  To: Rafael J. Wysocki, Tony Luck, Borislav Petkov, Hanjun Guo,
	Mauro Carvalho Chehab, Shuai Xue, Len Brown, Andrew Morton,
	Baoquan He, Mike Rapoport, Pasha Tatashin, Pratyush Yadav,
	Dave Young
  Cc: linux-acpi, linux-kernel, riel, caggio, kexec, Breno Leitao,
	kernel-team

ghes_log_hwerr() receives a GHES_SEV_* value from ghes_severity() but
tests it against CPER_SEV_RECOVERABLE.  GHES_SEV_RECOVERABLE is 2 while
CPER_SEV_RECOVERABLE is 0, so every recoverable error is dropped and
only GHES_SEV_NO slips through; nothing useful is recorded through the
APEI/GHES path, which is the only one arm64 has.

Compare against GHES_SEV_RECOVERABLE so recoverable hardware errors are
tracked as intended.

Fixes: 918e1507cff9 ("vmcoreinfo: track and log recoverable hardware errors")
Signed-off-by: Breno Leitao <leitao@debian.org>
---
 drivers/acpi/apei/ghes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 3236a3ce79d6b..f0f9f1529e7aa 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -877,7 +877,7 @@ EXPORT_SYMBOL_NS_GPL(cxl_cper_kfifo_get, "CXL");
 
 static void ghes_log_hwerr(int sev, guid_t *sec_type)
 {
-	if (sev != CPER_SEV_RECOVERABLE)
+	if (sev != GHES_SEV_RECOVERABLE)
 		return;
 
 	if (guid_equal(sec_type, &CPER_SEC_PROC_ARM) ||

-- 
2.53.0-Meta
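
The namespace mismatch described in the commit message can be reproduced
outside the kernel. The following is a minimal userspace sketch, not kernel
code: the enumerators are local copies limited to the values the commit
message cites (GHES_SEV_NO passing the old test implies it is 0), and the
helper names records_before_fix() and records_after_fix() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Local copies of the values cited in the commit message (assumption:
 * these mirror the kernel headers; other enumerators are omitted). */
enum { GHES_SEV_NO = 0, GHES_SEV_RECOVERABLE = 2 };
enum { CPER_SEV_RECOVERABLE = 0 };

/* The two constants share a name suffix but not a value. */
static_assert(GHES_SEV_RECOVERABLE != CPER_SEV_RECOVERABLE,
	      "GHES and CPER severities are different namespaces");

/* Hypothetical helper: true when the pre-fix check lets sev through. */
static bool records_before_fix(int sev)
{
	return sev == CPER_SEV_RECOVERABLE;
}

/* Hypothetical helper: true when the post-fix check lets sev through. */
static bool records_after_fix(int sev)
{
	return sev == GHES_SEV_RECOVERABLE;
}
```

With a GHES_SEV_* input, records_before_fix() accepts only GHES_SEV_NO and
rejects GHES_SEV_RECOVERABLE, while records_after_fix() does the opposite.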



* [PATCH 2/2] vmcore_info: track fatal hardware errors
  2026-06-17 13:32 [PATCH 0/2] vmcoreinfo: GHES: track fatal hardware errors Breno Leitao
  2026-06-17 13:32 ` [PATCH 1/2] ACPI: APEI: GHES: fix severity namespace in ghes_log_hwerr() Breno Leitao
@ 2026-06-17 13:32 ` Breno Leitao
  2026-06-29 16:02 ` [PATCH 0/2] vmcoreinfo: GHES: " Breno Leitao
  2 siblings, 0 replies; 4+ messages in thread
From: Breno Leitao @ 2026-06-17 13:32 UTC (permalink / raw)
  To: Rafael J. Wysocki, Tony Luck, Borislav Petkov, Hanjun Guo,
	Mauro Carvalho Chehab, Shuai Xue, Len Brown, Andrew Morton,
	Baoquan He, Mike Rapoport, Pasha Tatashin, Pratyush Yadav,
	Dave Young
  Cc: linux-acpi, linux-kernel, riel, caggio, kexec, Breno Leitao,
	kernel-team

Fatal (panic-severity) hardware errors reported through APEI/GHES are
the ones most likely to have caused a crash, but hwerr_data did not
record them.  Add a HWERR_FATAL bucket and bump it from
ghes_log_hwerr() when a GHES_SEV_PANIC error is seen, so crash tooling
can tell from the vmcore that a fatal hardware error preceded the
crash.

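Tools reading hwerr_data gain one entry (HWERR_FATAL).  It is added after
the existing HWERR_RECOV_* types, so their values are unchanged; only
HWERR_RECOV_MAX grows by one.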

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 drivers/acpi/apei/ghes.c    | 5 +++++
 include/uapi/linux/vmcore.h | 1 +
 2 files changed, 6 insertions(+)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index f0f9f1529e7aa..5a9e16bdca2b6 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -877,6 +877,11 @@ EXPORT_SYMBOL_NS_GPL(cxl_cper_kfifo_get, "CXL");
 
 static void ghes_log_hwerr(int sev, guid_t *sec_type)
 {
+	if (sev == GHES_SEV_PANIC) {
+		hwerr_log_error_type(HWERR_FATAL);
+		return;
+	}
+
 	if (sev != GHES_SEV_RECOVERABLE)
 		return;
 
diff --git a/include/uapi/linux/vmcore.h b/include/uapi/linux/vmcore.h
index 2ba89fafa518a..c774b037603e2 100644
--- a/include/uapi/linux/vmcore.h
+++ b/include/uapi/linux/vmcore.h
@@ -21,6 +21,7 @@ enum hwerr_error_type {
 	HWERR_RECOV_PCI,
 	HWERR_RECOV_CXL,
 	HWERR_RECOV_OTHERS,
+	HWERR_FATAL,		/* fatal hardware errors */
 	HWERR_RECOV_MAX,
 };
 

-- 
2.53.0-Meta
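
The control flow that ghes_log_hwerr() has after this patch can be sketched in
userspace as below. This is a simplified model under stated assumptions, not
the kernel implementation: every demo_* and DEMO_* name is hypothetical, the
severity values are local stand-ins (2 for recoverable as cited in patch 1;
the others are assumed for illustration), and the per-section-type dispatch of
recoverable errors is collapsed into a single bucket.

```c
#include <assert.h>

/* Stand-in severity values (assumption, for illustration only). */
enum { DEMO_SEV_NO = 0, DEMO_SEV_CORRECTED = 1,
       DEMO_SEV_RECOVERABLE = 2, DEMO_SEV_PANIC = 3 };

/* Simplified buckets: the new fatal bucket sits just before the sentinel. */
enum demo_bucket { DEMO_RECOV_OTHERS, DEMO_FATAL, DEMO_BUCKET_MAX };

/* One counter per bucket, standing in for hwerr_data. */
static unsigned int demo_counts[DEMO_BUCKET_MAX];

static_assert(sizeof(demo_counts) / sizeof(demo_counts[0]) == DEMO_BUCKET_MAX,
	      "one counter per bucket");

static void demo_log_hwerr(int sev)
{
	/* Panic-severity errors are counted first and never fall through. */
	if (sev == DEMO_SEV_PANIC) {
		demo_counts[DEMO_FATAL]++;
		return;
	}

	/* Anything that is not recoverable is ignored, as before. */
	if (sev != DEMO_SEV_RECOVERABLE)
		return;

	demo_counts[DEMO_RECOV_OTHERS]++;
}
```

Calling demo_log_hwerr() once with each of the four severities increments the
fatal and recoverable counters once each and leaves the other two severities
unrecorded.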



* Re: [PATCH 0/2] vmcoreinfo: GHES: track fatal hardware errors
  2026-06-17 13:32 [PATCH 0/2] vmcoreinfo: GHES: track fatal hardware errors Breno Leitao
  2026-06-17 13:32 ` [PATCH 1/2] ACPI: APEI: GHES: fix severity namespace in ghes_log_hwerr() Breno Leitao
  2026-06-17 13:32 ` [PATCH 2/2] vmcore_info: track fatal hardware errors Breno Leitao
@ 2026-06-29 16:02 ` Breno Leitao
  2 siblings, 0 replies; 4+ messages in thread
From: Breno Leitao @ 2026-06-29 16:02 UTC (permalink / raw)
  To: Rafael J. Wysocki, Tony Luck, Borislav Petkov, Hanjun Guo,
	Mauro Carvalho Chehab, Shuai Xue, Len Brown, Andrew Morton,
	Baoquan He, Mike Rapoport, Pasha Tatashin, Pratyush Yadav,
	Dave Young
  Cc: linux-acpi, linux-kernel, riel, caggio, kexec, kernel-team

On Wed, Jun 17, 2026 at 06:32:46AM -0700, Breno Leitao wrote:
> Hardware errors reported through APEI/GHES are recorded in the kernel's
> hwerr_data array so that crash tooling can tell from the vmcore whether a
> hardware error preceded a crash.

This hardware error tracking code currently sits in an awkward location:
it doesn't belong in RAS and has only a minimal connection to vmcore
info.

Following Baoquan's earlier suggestion, I'll refactor this as a standalone
driver, which should make the code organization clearer and more maintainable.

https://lore.kernel.org/all/aYvi4Y_HNqk_u1-v@fedora/

