[PATCH v3] vmcoreinfo: Track and log recoverable hardware errors

linux-pci.vger.kernel.org archive mirror

* [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
@ 2025-07-22 16:56 Breno Leitao
  2025-07-23 14:28 ` kernel test robot
  2025-07-24  8:00 ` Shuai Xue
  0 siblings, 2 replies; 17+ messages in thread
From: Breno Leitao @ 2025-07-22 16:56 UTC (permalink / raw)
  To: Rafael J. Wysocki, Len Brown, James Morse, Tony Luck,
	Borislav Petkov, Robert Moore, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
	Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas
  Cc: linux-acpi, linux-kernel, acpica-devel, osandov, xueshuai,
	konrad.wilk, linux-edac, linuxppc-dev, linux-pci, kernel-team,
	Breno Leitao

Introduce a generic infrastructure for tracking recoverable hardware
errors (HW errors that did not cause a panic) and record them for vmcore
consumption. This aids post-mortem crash analysis tools by preserving
a count and timestamp for the last occurrence of such errors.

Add centralized logging for three common sources of recoverable hardware
errors:

  - PCIe AER Correctable errors
  - x86 Machine Check Exceptions (MCE)
  - APEI/CPER GHES corrected or recoverable errors

hwerr_data is write-only at kernel runtime, and it is meant to be
read from vmcore using tools like crash/drgn. For example, this is what
it looks like when opening the crashdump from drgn.

	>>> prog['hwerr_data']
	(struct hwerr_info[3]){
		{
			.count = (int)844,
			.timestamp = (time64_t)1752852018,
		},
		...

This helps fleet operators quickly triage whether a crash may be
influenced by recoverable hardware errors (which execute an uncommon
code path in the kernel), especially when recoverable errors occurred
shortly before a panic, such as the bug fixed by
commit ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them
when destroying the pool").

This is not intended to replace full hardware diagnostics, but it
provides a fast way to correlate hardware events with kernel panics.
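
As a rough illustration of the triage described above (not part of this
patch), a post-mortem tool could compare each recorded timestamp against
the time of the panic. The sketch below is userspace C; struct
hwerr_sample, hwerr_seen_recently() and the window length are
hypothetical names and values chosen for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace copy of one hwerr_data[] entry read from a vmcore. */
struct hwerr_sample {
	int count;          /* number of recoverable errors recorded */
	int64_t timestamp;  /* wall-clock seconds of the last one */
};

/*
 * Return true if at least one error was recorded and the last one happened
 * no more than window_secs before the crash. A timestamp later than the
 * crash time is treated as "not recent", which is a simplification made
 * for this sketch.
 */
bool hwerr_seen_recently(const struct hwerr_sample *s,
			 int64_t crash_time, int64_t window_secs)
{
	if (s->count <= 0)
		return false;
	if (s->timestamp > crash_time)
		return false;
	return crash_time - s->timestamp <= window_secs;
}
```

A tool would call this once per entry of the array and flag the crash
for closer hardware inspection if any source reports a recent error.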

Suggested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v3:
- Add more information about this feature in the commit message
  (Borislav Petkov)
- Renamed the function to hwerr_log_error_type() and use hwerr as
  prefix (Borislav Petkov)
- Make the empty function static inline (kernel test robot)
- Link to v2: https://lore.kernel.org/r/20250721-vmcore_hw_error-v2-1-ab65a6b43c5a@debian.org

Changes in v2:
- Split the counter by recoverable error (Tony Luck)
- Link to v1: https://lore.kernel.org/r/20250714-vmcore_hw_error-v1-1-8cf45edb6334@debian.org
---
 arch/x86/kernel/cpu/mce/core.c |  3 +++
 drivers/acpi/apei/ghes.c       |  8 ++++++--
 drivers/pci/pcie/aer.c         |  2 ++
 include/linux/vmcore_info.h    | 14 ++++++++++++++
 kernel/vmcore_info.c           | 18 ++++++++++++++++++
 5 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 4da4eab56c81d..cb225a42eebbb 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -45,6 +45,7 @@
 #include <linux/task_work.h>
 #include <linux/hardirq.h>
 #include <linux/kexec.h>
+#include <linux/vmcore_info.h>
 
 #include <asm/fred.h>
 #include <asm/cpu_device_id.h>
@@ -1692,6 +1693,8 @@ noinstr void do_machine_check(struct pt_regs *regs)
 out:
 	instrumentation_end();
 
+	/* Given it didn't panic, mark it as recoverable */
+	hwerr_log_error_type(HWERR_RECOV_MCE);
 clear:
 	mce_wrmsrq(MSR_IA32_MCG_STATUS, 0);
 }
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a0d54993edb3b..ebda2aa3d68f2 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -43,6 +43,7 @@
 #include <linux/uuid.h>
 #include <linux/ras.h>
 #include <linux/task_work.h>
+#include <linux/vmcore_info.h>
 
 #include <acpi/actbl1.h>
 #include <acpi/ghes.h>
@@ -1136,13 +1137,16 @@ static int ghes_proc(struct ghes *ghes)
 {
 	struct acpi_hest_generic_status *estatus = ghes->estatus;
 	u64 buf_paddr;
-	int rc;
+	int rc, sev;
 
 	rc = ghes_read_estatus(ghes, estatus, &buf_paddr, FIX_APEI_GHES_IRQ);
 	if (rc)
 		goto out;
 
-	if (ghes_severity(estatus->error_severity) >= GHES_SEV_PANIC)
+	sev = ghes_severity(estatus->error_severity);
+	if (sev == GHES_SEV_RECOVERABLE || sev ==  GHES_SEV_CORRECTED)
+		hwerr_log_error_type(HWERR_RECOV_GHES);
+	else if (sev >= GHES_SEV_PANIC)
 		__ghes_panic(ghes, estatus, buf_paddr, FIX_APEI_GHES_IRQ);
 
 	if (!ghes_estatus_cached(estatus)) {
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e286c197d7167..1ab744a3b7310 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -30,6 +30,7 @@
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include <linux/slab.h>
+#include <linux/vmcore_info.h>
 #include <acpi/apei.h>
 #include <acpi/ghes.h>
 #include <ras/ras_event.h>
@@ -746,6 +747,7 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
 	switch (info->severity) {
 	case AER_CORRECTABLE:
 		aer_info->dev_total_cor_errs++;
+		hwerr_log_error_type(HWERR_RECOV_AER);
 		counter = &aer_info->dev_cor_errs[0];
 		max = AER_MAX_TYPEOF_COR_ERRS;
 		break;
diff --git a/include/linux/vmcore_info.h b/include/linux/vmcore_info.h
index 37e003ae52626..39afce28bfaac 100644
--- a/include/linux/vmcore_info.h
+++ b/include/linux/vmcore_info.h
@@ -77,4 +77,18 @@ extern u32 *vmcoreinfo_note;
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
 			  void *data, size_t data_len);
 void final_note(Elf_Word *buf);
+
+enum hwerr_error_type {
+	HWERR_RECOV_AER,
+	HWERR_RECOV_MCE,
+	HWERR_RECOV_GHES,
+	HWERR_RECOV_MAX,
+};
+
+#ifdef CONFIG_VMCORE_INFO
+void hwerr_log_error_type(enum hwerr_error_type src);
+#else
+static inline void hwerr_log_error_type(enum hwerr_error_type src) {}
+#endif
+
 #endif /* LINUX_VMCORE_INFO_H */
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index e066d31d08f89..4b5ab45d468f5 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -31,6 +31,13 @@ u32 *vmcoreinfo_note;
 /* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
 static unsigned char *vmcoreinfo_data_safecopy;
 
+struct hwerr_info {
+	int __data_racy count;
+	time64_t __data_racy timestamp;
+};
+
+static struct hwerr_info hwerr_data[HWERR_RECOV_MAX];
+
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
 			  void *data, size_t data_len)
 {
@@ -118,6 +125,17 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
 }
 EXPORT_SYMBOL(paddr_vmcoreinfo_note);
 
+void hwerr_log_error_type(enum hwerr_error_type src)
+{
+	if (src < 0 || src >= HWERR_RECOV_MAX)
+		return;
+
+	/* No need for atomics/locks, given that precision is not important */
+	hwerr_data[src].count++;
+	hwerr_data[src].timestamp = ktime_get_real_seconds();
+}
+EXPORT_SYMBOL_GPL(hwerr_log_error_type);
+
 static int __init crash_save_vmcoreinfo_init(void)
 {
 	vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL);

---
base-commit: 97987520025658f30bb787a99ffbd9bbff9ffc9d
change-id: 20250707-vmcore_hw_error-322429e6c316

Best regards,
--  
Breno Leitao <leitao@debian.org>



* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-22 16:56 [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors Breno Leitao
@ 2025-07-23 14:28 ` kernel test robot
  2025-07-23 15:36   ` Breno Leitao
  2025-07-24  8:00 ` Shuai Xue
  1 sibling, 1 reply; 17+ messages in thread
From: kernel test robot @ 2025-07-23 14:28 UTC (permalink / raw)
  To: Breno Leitao, Rafael J. Wysocki, Len Brown, James Morse,
	Tony Luck, Borislav Petkov, Robert Moore, Thomas Gleixner,
	Ingo Molnar, Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
	Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas
  Cc: oe-kbuild-all, linux-media, linux-acpi, linux-kernel,
	acpica-devel, osandov, xueshuai, konrad.wilk, linux-edac,
	linuxppc-dev, linux-pci, kernel-team, Breno Leitao

Hi Breno,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 97987520025658f30bb787a99ffbd9bbff9ffc9d]

url:    https://github.com/intel-lab-lkp/linux/commits/Breno-Leitao/vmcoreinfo-Track-and-log-recoverable-hardware-errors/20250723-005950
base:   97987520025658f30bb787a99ffbd9bbff9ffc9d
patch link:    https://lore.kernel.org/r/20250722-vmcore_hw_error-v3-1-ff0683fc1f17%40debian.org
patch subject: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
config: x86_64-buildonly-randconfig-001-20250723 (https://download.01.org/0day-ci/archive/20250723/202507232209.GrgpSr47-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250723/202507232209.GrgpSr47-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507232209.GrgpSr47-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> vmlinux.o: warning: objtool: do_machine_check+0x5cc: call to hwerr_log_error_type() leaves .noinstr.text section

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-23 14:28 ` kernel test robot
@ 2025-07-23 15:36   ` Breno Leitao
  2025-07-23 19:00     ` Borislav Petkov
  0 siblings, 1 reply; 17+ messages in thread
From: Breno Leitao @ 2025-07-23 15:36 UTC (permalink / raw)
  To: kernel test robot
  Cc: Rafael J. Wysocki, Len Brown, James Morse, Tony Luck,
	Borislav Petkov, Robert Moore, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
	Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, oe-kbuild-all, linux-media, linux-acpi,
	linux-kernel, acpica-devel, osandov, xueshuai, konrad.wilk,
	linux-edac, linuxppc-dev, linux-pci, kernel-team

On Wed, Jul 23, 2025 at 10:28:29PM +0800, kernel test robot wrote:
> Hi Breno,
> 
> kernel test robot noticed the following build warnings:
> 
> [auto build test WARNING on 97987520025658f30bb787a99ffbd9bbff9ffc9d]
> 
> url:    https://github.com/intel-lab-lkp/linux/commits/Breno-Leitao/vmcoreinfo-Track-and-log-recoverable-hardware-errors/20250723-005950
> base:   97987520025658f30bb787a99ffbd9bbff9ffc9d
> patch link:    https://lore.kernel.org/r/20250722-vmcore_hw_error-v3-1-ff0683fc1f17%40debian.org
> patch subject: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
> config: x86_64-buildonly-randconfig-001-20250723 (https://download.01.org/0day-ci/archive/20250723/202507232209.GrgpSr47-lkp@intel.com/config)
> compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250723/202507232209.GrgpSr47-lkp@intel.com/reproduce)
> 
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202507232209.GrgpSr47-lkp@intel.com/
> 
> All warnings (new ones prefixed by >>):
> 
> >> vmlinux.o: warning: objtool: do_machine_check+0x5cc: call to hwerr_log_error_type() leaves .noinstr.text section

Oh, this seems to be a real issue.

Basically there are two approaches, from what I understand:

	1) mark do_machine_check() as noinstr

	2) Move hwerr_log_error_type() earlier inside the
	instrumentation_begin() area.

Option 1 is probably more flexible, given that
hwerr_log_error_type() doesn't seem to be a function that anyone wants
to instrument?!


* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-23 15:36   ` Breno Leitao
@ 2025-07-23 19:00     ` Borislav Petkov
  2025-07-23 23:21       ` Huang, Kai
  0 siblings, 1 reply; 17+ messages in thread
From: Borislav Petkov @ 2025-07-23 19:00 UTC (permalink / raw)
  To: Breno Leitao
  Cc: kernel test robot, Rafael J. Wysocki, Len Brown, James Morse,
	Tony Luck, Robert Moore, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
	Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, oe-kbuild-all, linux-media, linux-acpi,
	linux-kernel, acpica-devel, osandov, xueshuai, konrad.wilk,
	linux-edac, linuxppc-dev, linux-pci, kernel-team

On Wed, Jul 23, 2025 at 08:36:52AM -0700, Breno Leitao wrote:
> Basically there are two approaches, from what I understand:
> 
> 	1) mark do_machine_check() as noinstr

do_machine_check is already noinstr. I think you mean mark
hwerr_log_error_type() noinstr.

And yes, you can mark it. hwerr_log_error_type() is not that fascinating
to allow instrumentation for it.

> 	2) Move hwerr_log_error_type() earlier inside the
> 	instrumentation_begin() area.

Or you can do that - that looks like less of an effort btw.
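
For illustration only, the second option amounts to making the call while
instrumentation is still enabled. The following is a userspace model, not
kernel code: instrumentation_begin()/instrumentation_end() are stubbed as
a depth counter, the stubbed hwerr_log_error_type() counts calls made
outside the region (roughly what objtool complains about), and
model_machine_check() is a hypothetical stand-in for the tail of
do_machine_check().

```c
#include <stdbool.h>

/* Userspace stand-ins for the kernel's instrumentation markers. */
int instr_depth;
int calls_outside_region;  /* calls that would leave .noinstr.text */
int calls_logged;

void instrumentation_begin(void) { instr_depth++; }
void instrumentation_end(void)   { instr_depth--; }

/* Stub of an instrumentable function: only safe inside the region. */
void hwerr_log_error_type(int src)
{
	(void)src;
	if (instr_depth <= 0)
		calls_outside_region++;
	calls_logged++;
}

/*
 * Hypothetical model of the tail of do_machine_check(). With
 * log_inside_region == false the call sits after instrumentation_end(),
 * as in v3; with true it is moved before it, as in option 2.
 */
void model_machine_check(bool log_inside_region)
{
	instrumentation_begin();
	/* ... error handling would run here ... */
	if (log_inside_region)
		hwerr_log_error_type(1);
	instrumentation_end();

	if (!log_inside_region)
		hwerr_log_error_type(1);
}
```

Where exactly the call belongs relative to the existing labels in
do_machine_check() is not settled by this model.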

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-23 19:00     ` Borislav Petkov
@ 2025-07-23 23:21       ` Huang, Kai
  0 siblings, 0 replies; 17+ messages in thread
From: Huang, Kai @ 2025-07-23 23:21 UTC (permalink / raw)
  To: leitao@debian.org, bp@alien8.de
  Cc: lkp, oe-kbuild-all@lists.linux.dev, xueshuai@linux.alibaba.com,
	acpica-devel@lists.linux.dev, linux-media@vger.kernel.org,
	Luck, Tony, james.morse@arm.com, dave.hansen@linux.intel.com,
	mchehab@kernel.org, konrad.wilk@oracle.com, oohall@gmail.com,
	helgaas@kernel.org, mingo@redhat.com, osandov@osandov.com,
	linux-kernel@vger.kernel.org, tglx@linutronix.de, lenb@kernel.org,
	kernel-team@meta.com, linux-edac@vger.kernel.org, hpa@zytor.com,
	linuxppc-dev@lists.ozlabs.org, mahesh@linux.ibm.com,
	guohanjun@huawei.com, rafael@kernel.org,
	linux-pci@vger.kernel.org, linux-acpi@vger.kernel.org,
	x86@kernel.org, Moore, Robert

On Wed, 2025-07-23 at 21:00 +0200, Borislav Petkov wrote:
> On Wed, Jul 23, 2025 at 08:36:52AM -0700, Breno Leitao wrote:
> > Basically there are two approaches, from what I understand:
> > 
> > 	1) mark do_machine_check() as noinstr
> 
> do_machine_check is already noinstr. I think you mean mark
> hwerr_log_error_type() noinstr.
> 
> And yes, you can mark it. hwerr_log_error_type() is not that fascinating
> to allow instrumentation for it.

This option doesn't seem to be able to work because IIRC
hwerr_log_error_type() calls ktime_get_real_seconds() which is not
'noinstr'.

> 
> > 	2) Move hwerr_log_error_type() earlier inside the
> > 	instrumentation_begin() area.
> 
> Or you can do that - that looks like less of an effort btw.


* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-22 16:56 [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors Breno Leitao
  2025-07-23 14:28 ` kernel test robot
@ 2025-07-24  8:00 ` Shuai Xue
  2025-07-24 13:34   ` Breno Leitao
  1 sibling, 1 reply; 17+ messages in thread
From: Shuai Xue @ 2025-07-24  8:00 UTC (permalink / raw)
  To: Breno Leitao, Rafael J. Wysocki, Len Brown, James Morse,
	Tony Luck, Borislav Petkov, Robert Moore, Thomas Gleixner,
	Ingo Molnar, Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
	Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas
  Cc: linux-acpi, linux-kernel, acpica-devel, osandov, konrad.wilk,
	linux-edac, linuxppc-dev, linux-pci, kernel-team

Hi, Breno,

On 2025/7/23 00:56, Breno Leitao wrote:
> Introduce a generic infrastructure for tracking recoverable hardware
> errors (HW errors that did not cause a panic) and record them for vmcore
> consumption. This aids post-mortem crash analysis tools by preserving
> a count and timestamp for the last occurrence of such errors.
> 
> Add centralized logging for three common sources of recoverable hardware
> errors:

The term "recoverable" is highly ambiguous. Even within the x86
architecture, different vendors define errors differently. I'm not
trying to be pedantic about classification. As far as I know, for 2-bit
memory errors detected by scrub, AMD defines them as deferred errors
(DE) and handles them with log_error_deferred, while Intel uses
machine_check_poll. For 2-bit memory errors consumed by processes, both
Intel and AMD use MCE handling via do_machine_check(). Does your
HWERR_RECOV_MCE only focus on synchronous UE errors handled in
do_machine_check? What makes it special?

> 
>    - PCIe AER Correctable errors
>    - x86 Machine Check Exceptions (MCE)
>    - APEI/CPER GHES corrected or recoverable errors
> 
> hwerror_data is write-only at kernel runtime, and it is meant to be
> read from vmcore using tools like crash/drgn. For example, this is how
> it looks like when opening the crashdump from drgn.
> 
> 	>>> prog['hwerror_data']
> 	(struct hwerror_info[3]){
> 		{
> 			.count = (int)844,
> 			.timestamp = (time64_t)1752852018,
> 		},
> 		...
> 
> This helps fleet operators quickly triage whether a crash may be
> influenced by hardware recoverable errors (which executes a uncommon
> code path in the kernel), especially when recoverable errors occurred
> shortly before a panic, such as the bug fixed by
> commit ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them
> when destroying the pool")
> 
> This is not intended to replace full hardware diagnostics but provides
> a fast way to correlate hardware events with kernel panics quickly.
> 
> Suggested-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> Changes in v3:
> - Add more information about this feature in the commit message
>    (Borislav Petkov)
> - Renamed the function to hwerr_log_error_type() and use hwerr as
>    suffix (Borislav Petkov)
> - Make the empty function static inline (kernel test robot)
> - Link to v2: https://lore.kernel.org/r/20250721-vmcore_hw_error-v2-1-ab65a6b43c5a@debian.org
> 
> Changes in v2:
> - Split the counter by recoverable error (Tony Luck)
> - Link to v1: https://lore.kernel.org/r/20250714-vmcore_hw_error-v1-1-8cf45edb6334@debian.org
> ---
>   arch/x86/kernel/cpu/mce/core.c |  3 +++
>   drivers/acpi/apei/ghes.c       |  8 ++++++--
>   drivers/pci/pcie/aer.c         |  2 ++
>   include/linux/vmcore_info.h    | 14 ++++++++++++++
>   kernel/vmcore_info.c           | 18 ++++++++++++++++++
>   5 files changed, 43 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 4da4eab56c81d..cb225a42eebbb 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -45,6 +45,7 @@
>   #include <linux/task_work.h>
>   #include <linux/hardirq.h>
>   #include <linux/kexec.h>
> +#include <linux/vmcore_info.h>
>   
>   #include <asm/fred.h>
>   #include <asm/cpu_device_id.h>
> @@ -1692,6 +1693,8 @@ noinstr void do_machine_check(struct pt_regs *regs)
>   out:
>   	instrumentation_end();
>   
> +	/* Given it didn't panic, mark it as recoverable */
> +	hwerr_log_error_type(HWERR_RECOV_MCE);
>   clear:
>   	mce_wrmsrq(MSR_IA32_MCG_STATUS, 0);
>   }
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index a0d54993edb3b..ebda2aa3d68f2 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -43,6 +43,7 @@
>   #include <linux/uuid.h>
>   #include <linux/ras.h>
>   #include <linux/task_work.h>
> +#include <linux/vmcore_info.h>
>   
>   #include <acpi/actbl1.h>
>   #include <acpi/ghes.h>
> @@ -1136,13 +1137,16 @@ static int ghes_proc(struct ghes *ghes)
>   {
>   	struct acpi_hest_generic_status *estatus = ghes->estatus;
>   	u64 buf_paddr;
> -	int rc;
> +	int rc, sev;
>   
>   	rc = ghes_read_estatus(ghes, estatus, &buf_paddr, FIX_APEI_GHES_IRQ);
>   	if (rc)
>   		goto out;
>   
> -	if (ghes_severity(estatus->error_severity) >= GHES_SEV_PANIC)
> +	sev = ghes_severity(estatus->error_severity);
> +	if (sev == GHES_SEV_RECOVERABLE || sev ==  GHES_SEV_CORRECTED)
> +		hwerr_log_error_type(HWERR_RECOV_GHES);

APEI does not define an error type named GHES. GHES is just a kernel
driver name. Many hardware error types can be handled in GHES (see
ghes_do_proc), for example, AER is routed by GHES when firmware-first
mode is used. As far as I know, firmware-first mode is commonly used in
production. Should GHES errors be categorized into AER, memory, and CXL
memory instead?

Thanks.
Shuai


* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-24  8:00 ` Shuai Xue
@ 2025-07-24 13:34   ` Breno Leitao
  2025-07-25  7:40     ` Shuai Xue
  0 siblings, 1 reply; 17+ messages in thread
From: Breno Leitao @ 2025-07-24 13:34 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Rafael J. Wysocki, Len Brown, James Morse, Tony Luck,
	Borislav Petkov, Robert Moore, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
	Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, linux-acpi, linux-kernel, acpica-devel, osandov,
	konrad.wilk, linux-edac, linuxppc-dev, linux-pci, kernel-team

Hello Shuai,

On Thu, Jul 24, 2025 at 04:00:09PM +0800, Shuai Xue wrote:
> On 2025/7/23 00:56, Breno Leitao wrote:
> > Introduce a generic infrastructure for tracking recoverable hardware
> > errors (HW errors that did not cause a panic) and record them for vmcore
> > consumption. This aids post-mortem crash analysis tools by preserving
> > a count and timestamp for the last occurrence of such errors.
> > 
> > Add centralized logging for three common sources of recoverable hardware
> > errors:
> 
> The term "recoverable" is highly ambiguous. Even within the x86
> architecture, different vendors define errors differently. I'm not
> trying to be pedantic about classification. As far as I know, for 2-bit
> memory errors detected by scrub, AMD defines them as deferred errors
> (DE) and handles them with log_error_deferred, while Intel uses
> machine_check_poll. For 2-bit memory errors consumed by processes,
> both Intel and AMD use MCE handling via do_machine_check(). Does your
> HWERR_RECOV_MCE only focus on synchronous UE errors handled in
> do_machine_check? What makes it special?

I understand that deferred errors (DE) detected by memory scrubbing are
typically silent and may not significantly impact system stability. In
other words, I’m not convinced that including DE metrics in crash dumps
would be helpful for correlating crashes with hardware issues—it might
just add noise.

Do you think it would be valuable to also log these events within
log_error_deferred()?

> > -	if (ghes_severity(estatus->error_severity) >= GHES_SEV_PANIC)
> > +	sev = ghes_severity(estatus->error_severity);
> > +	if (sev == GHES_SEV_RECOVERABLE || sev ==  GHES_SEV_CORRECTED)
> > +		hwerr_log_error_type(HWERR_RECOV_GHES);
> 
> APEI does not define an error type named GHES. GHES is just a kernel
> driver name. Many hardware error types can be handled in GHES (see
> ghes_do_proc), for example, AER is routed by GHES when firmware-first
> mode is used. As far as I know, firmware-first mode is commonly used in
> production. Should GHES errors be categorized into AER, memory, and CXL
> memory instead?

I also considered slicing the data differently initially, but then
realized it would add more complexity than necessary for my needs.

If you believe we should further subdivide the data, I’m happy to do so.

You’re suggesting a structure like this, which would then map to the
corresponding CPER_SEC_ sections:

	enum hwerr_error_type {
		HWERR_RECOV_AER,     // maps to CPER_SEC_PCIE
		HWERR_RECOV_MCE,     // maps to default MCE + CPER_SEC_PCIE
		HWERR_RECOV_CXL,     // maps to CPER_SEC_CXL_*
		HWERR_RECOV_MEMORY,  // maps to CPER_SEC_PLATFORM_MEM
	};

Additionally, what about events related to CPU, Firmware, or DMA
errors—for example, CPER_SEC_PROC, CPER_SEC_FW, CPER_SEC_DMAR? Should we
include those in the classification as well?


Thanks for your review and for the ongoing discussion!
--breno


* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-24 13:34   ` Breno Leitao
@ 2025-07-25  7:40     ` Shuai Xue
  2025-07-25 16:16       ` Breno Leitao
  0 siblings, 1 reply; 17+ messages in thread
From: Shuai Xue @ 2025-07-25  7:40 UTC (permalink / raw)
  To: Breno Leitao, Tony Luck, Borislav Petkov
  Cc: Rafael J. Wysocki, Len Brown, James Morse, Robert Moore,
	Thomas Gleixner, Ingo Molnar, Dave Hansen, x86, H. Peter Anvin,
	Hanjun Guo, Mauro Carvalho Chehab, Mahesh J Salgaonkar,
	Oliver O'Halloran, Bjorn Helgaas, linux-acpi, linux-kernel,
	acpica-devel, osandov, konrad.wilk, linux-edac, linuxppc-dev,
	linux-pci, kernel-team



On 2025/7/24 21:34, Breno Leitao wrote:
> Hello Shuai,
> 
> On Thu, Jul 24, 2025 at 04:00:09PM +0800, Shuai Xue wrote:
>> On 2025/7/23 00:56, Breno Leitao wrote:
>>> Introduce a generic infrastructure for tracking recoverable hardware
>>> errors (HW errors that did not cause a panic) and record them for vmcore
>>> consumption. This aids post-mortem crash analysis tools by preserving
>>> a count and timestamp for the last occurrence of such errors.
>>>
>>> Add centralized logging for three common sources of recoverable hardware
>>> errors:
>>
>> The term "recoverable" is highly ambiguous. Even within the x86
>> architecture, different vendors define errors differently. I'm not
>> trying to be pedantic about classification. As far as I know, for 2-bit
>> memory errors detected by scrub, AMD defines them as deferred errors
>> (DE) and handles them with log_error_deferred, while Intel uses
>> machine_check_poll. For 2-bit memory errors consumed by processes,
>> both Intel and AMD use MCE handling via do_machine_check(). Does your
>> HWERR_RECOV_MCE only focus on synchronous UE errors handled in
>> do_machine_check? What makes it special?
> 
> I understand that deferred errors (DE) detected by memory scrubbing are
> typically silent and may not significantly impact system stability. In
> other words, I’m not convinced that including DE metrics in crash dumps
> would be helpful for correlating crashes with hardware issues—it might
> just add noise.
> 
> Do you think it would be valuable to also log these events within
> log_error_deferred()?

Not really. As you mentioned, DEs are typically silent in the background.
But I hope this is well documented.
> 
>>> -	if (ghes_severity(estatus->error_severity) >= GHES_SEV_PANIC)
>>> +	sev = ghes_severity(estatus->error_severity);
>>> +	if (sev == GHES_SEV_RECOVERABLE || sev ==  GHES_SEV_CORRECTED)
>>> +		hwerr_log_error_type(HWERR_RECOV_GHES);
>>
>> APEI does not define an error type named GHES. GHES is just a kernel
>> driver name. Many hardware error types can be handled in GHES (see
>> ghes_do_proc), for example, AER is routed by GHES when firmware-first
>> mode is used. As far as I know, firmware-first mode is commonly used in
>> production. Should GHES errors be categorized into AER, memory, and CXL
>> memory instead?
> 
> I also considered slicing the data differently initially, but then
> realized it would add more complexity than necessary for my needs.
> 
> If you believe we should further subdivide the data, I’m happy to do so.
> 
> You’re suggesting a structure like this, which would then map to the
> corresponding CPER_SEC_ sections:
> 
> 	enum hwerr_error_type {
> 	HWERR_RECOV_AER,     // maps to CPER_SEC_PCIE
> 	HWERR_RECOV_MCE,     // maps to default MCE + CPER_SEC_PCIE

Is CPER_SEC_PCIE a typo?

> 	HWERR_RECOV_CXL,     // maps to CPER_SEC_CXL_*
> 	HWERR_RECOV_MEMORY,  // maps to CPER_SEC_PLATFORM_MEM
> 	}
> 
> Additionally, what about events related to CPU, Firmware, or DMA
> errors—for example, CPER_SEC_PROC, CPER_SEC_FW, CPER_SEC_DMAR? Should we
> include those in the classification as well?

I would like to split each error from GHES into its own type;
that sounds more reasonable. I cannot tell what happened from HWERR_RECOV_AER at all :(
> 
> 
> Thanks for your review and for the ongoing discussion!
> --breno

Thanks.
Shuai



* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-25  7:40     ` Shuai Xue
@ 2025-07-25 16:16       ` Breno Leitao
  2025-07-28  1:08         ` Shuai Xue
  0 siblings, 1 reply; 17+ messages in thread
From: Breno Leitao @ 2025-07-25 16:16 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Tony Luck, Borislav Petkov, Rafael J. Wysocki, Len Brown,
	James Morse, Robert Moore, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
	Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, linux-acpi, linux-kernel, acpica-devel, osandov,
	konrad.wilk, linux-edac, linuxppc-dev, linux-pci, kernel-team

Hello Shuai,

On Fri, Jul 25, 2025 at 03:40:58PM +0800, Shuai Xue wrote:
> > > APEI does not define an error type named GHES. GHES is just a kernel
> > > driver name. Many hardware error types can be handled in GHES (see
> > > ghes_do_proc), for example, AER is routed by GHES when firmware-first
> > > mode is used. As far as I know, firmware-first mode is commonly used in
> > > production. Should GHES errors be categorized into AER, memory, and CXL
> > > memory instead?
> > 
> > I also considered slicing the data differently initially, but then
> > realized it would add more complexity than necessary for my needs.
> > 
> > If you believe we should further subdivide the data, I’m happy to do so.
> > 
> > You’re suggesting a structure like this, which would then map to the
> > corresponding CPER_SEC_ sections:
> > 
> > 	enum hwerr_error_type {
> > 	HWERR_RECOV_AER,     // maps to CPER_SEC_PCIE
> > 	HWERR_RECOV_MCE,     // maps to default MCE + CPER_SEC_PCIE
> 
> CPER_SEC_PCIE is typo?

Correct, HWERR_RECOV_MCE would map to the regular MCE and not errors
coming from GHES.

> > 	HWERR_RECOV_CXL,     // maps to CPER_SEC_CXL_*
> > 	HWERR_RECOV_MEMORY,  // maps to CPER_SEC_PLATFORM_MEM
> > 	}
> > 
> > Additionally, what about events related to CPU, Firmware, or DMA
> > errors—for example, CPER_SEC_PROC, CPER_SEC_FW, CPER_SEC_DMAR? Should we
> > include those in the classification as well?
> 
> I would like to split each error from GHES into its own type;
> that sounds more reasonable. I cannot tell what happened from HWERR_RECOV_AER at all :(

Makes sense. Regarding your answer, I suppose we might want to have
something like the following:

	enum hwerr_error_type {
		HWERR_RECOV_MCE,     // maps to errors in do_machine_check()
		HWERR_RECOV_CXL,     // maps to CPER_SEC_CXL_
		HWERR_RECOV_PCI,     // maps to AER (pci_dev_aer_stats_incr()) and CPER_SEC_PCIE and CPER_SEC_PCI
		HWERR_RECOV_MEMORY,  // maps to CPER_SEC_PLATFORM_MEM_
		HWERR_RECOV_CPU,     // maps to CPER_SEC_PROC_
		HWERR_RECOV_DMA,     // maps to CPER_SEC_DMAR_
		HWERR_RECOV_OTHERS,  // maps to CPER_SEC_FW_, CPER_SEC_DMAR_
	};

Is this what you think we should track?
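
A minimal sketch of how such a split might be wired up, assuming the
categories listed above. This is userspace C for illustration only:
enum cper_section_kind and hwerr_type_from_section() are hypothetical
(the kernel identifies CPER sections by GUID, not by an enum), and the
sketch arbitrarily files DMAR under HWERR_RECOV_DMA and firmware under
HWERR_RECOV_OTHERS.

```c
/* Categories as listed above, plus a MAX sentinel as in the v3 patch. */
enum hwerr_error_type {
	HWERR_RECOV_MCE,
	HWERR_RECOV_CXL,
	HWERR_RECOV_PCI,
	HWERR_RECOV_MEMORY,
	HWERR_RECOV_CPU,
	HWERR_RECOV_DMA,
	HWERR_RECOV_OTHERS,
	HWERR_RECOV_MAX,
};

/* Hypothetical stand-in for the CPER section type GUIDs. */
enum cper_section_kind {
	SEC_KIND_PCIE,
	SEC_KIND_CXL,
	SEC_KIND_PLATFORM_MEM,
	SEC_KIND_PROC,
	SEC_KIND_DMAR,
	SEC_KIND_FW,
	SEC_KIND_UNKNOWN,
};

/* Map a section kind reported through GHES to a counter slot. */
enum hwerr_error_type hwerr_type_from_section(enum cper_section_kind kind)
{
	switch (kind) {
	case SEC_KIND_PCIE:
		return HWERR_RECOV_PCI;
	case SEC_KIND_CXL:
		return HWERR_RECOV_CXL;
	case SEC_KIND_PLATFORM_MEM:
		return HWERR_RECOV_MEMORY;
	case SEC_KIND_PROC:
		return HWERR_RECOV_CPU;
	case SEC_KIND_DMAR:
		return HWERR_RECOV_DMA;
	case SEC_KIND_FW:
	default:
		/* Firmware and anything unrecognized share one bucket. */
		return HWERR_RECOV_OTHERS;
	}
}
```

HWERR_RECOV_MCE has no section mapping here because it would be logged
directly from do_machine_check() rather than from GHES.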

Thanks
--breno


* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-25 16:16       ` Breno Leitao
@ 2025-07-28  1:08         ` Shuai Xue
  2025-07-29 13:48           ` Breno Leitao
  0 siblings, 1 reply; 17+ messages in thread
From: Shuai Xue @ 2025-07-28  1:08 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Tony Luck, Borislav Petkov, Rafael J. Wysocki, Len Brown,
	James Morse, Robert Moore, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
	Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, linux-acpi, linux-kernel, acpica-devel, osandov,
	konrad.wilk, linux-edac, linuxppc-dev, linux-pci, kernel-team



On 2025/7/26 00:16, Breno Leitao wrote:
> Hello Shuai,
> 
> On Fri, Jul 25, 2025 at 03:40:58PM +0800, Shuai Xue wrote:
>>>> APEI does not define an error type named GHES. GHES is just a kernel
>>>> driver name. Many hardware error types can be handled in GHES (see
>>>> ghes_do_proc), for example, AER is routed by GHES when firmware-first
>>>> mode is used. As far as I know, firmware-first mode is commonly used in
>>>> production. Should GHES errors be categorized into AER, memory, and CXL
>>>> memory instead?
>>>
>>> I also considered slicing the data differently initially, but then
>>> realized it would add more complexity than necessary for my needs.
>>>
>>> If you believe we should further subdivide the data, I’m happy to do so.
>>>
>>> You’re suggesting a structure like this, which would then map to the
>>> corresponding CPER_SEC_ sections:
>>>
>>> 	enum hwerr_error_type {
>>> 	HWERR_RECOV_AER,     // maps to CPER_SEC_PCIE
>>> 	HWERR_RECOV_MCE,     // maps to default MCE + CPER_SEC_PCIE
>>
>> CPER_SEC_PCIE is typo?
> 
> Correct, HWERR_RECOV_MCE would map to the regular MCE and not errors
> coming from GHES.
> 
>>> 	HWERR_RECOV_CXL,     // maps to CPER_SEC_CXL_*
>>> 	HWERR_RECOV_MEMORY,  // maps to CPER_SEC_PLATFORM_MEM
>>> 	}
>>>
>>> Additionally, what about events related to CPU, Firmware, or DMA
>>> errors—for example, CPER_SEC_PROC, CPER_SEC_FW, CPER_SEC_DMAR? Should we
>>> include those in the classification as well?
>>
>> I would like to split an error from GHES into its own type;
>> that sounds more reasonable. I cannot tell what happened from HWERR_RECOV_AER at all :(
> 
> Makes sense. Regarding your answer, I suppose we might want to have
> something like the following:
> 
> 	enum hwerr_error_type {
> 		HWERR_RECOV_MCE,     // maps to errors in do_machine_check()
> 		HWERR_RECOV_CXL,     // maps to CPER_SEC_CXL_
> 		HWERR_RECOV_PCI,     // maps to AER (pci_dev_aer_stats_incr()) and CPER_SEC_PCIE and CPER_SEC_PCI
> 		HWERR_RECOV_MEMORY,  // maps to CPER_SEC_PLATFORM_MEM_
> 		HWERR_RECOV_CPU,     // maps to CPER_SEC_PROC_
> 		HWERR_RECOV_DMA,     // maps to CPER_SEC_DMAR_
> 		HWERR_RECOV_OTHERS,  // maps to CPER_SEC_FW_, CPER_SEC_DMAR_,
> 	}
> 
> Is this what you think we should track?
> 
> Thanks
> --breno

It sounds good to me.

Thanks.
Shuai



* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-28  1:08         ` Shuai Xue
@ 2025-07-29 13:48           ` Breno Leitao
  2025-07-30  2:13             ` Shuai Xue
  0 siblings, 1 reply; 17+ messages in thread
From: Breno Leitao @ 2025-07-29 13:48 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Tony Luck, Borislav Petkov, Rafael J. Wysocki, Len Brown,
	James Morse, Robert Moore, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
	Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, linux-acpi, linux-kernel, acpica-devel, osandov,
	konrad.wilk, linux-edac, linuxppc-dev, linux-pci, kernel-team

On Mon, Jul 28, 2025 at 09:08:25AM +0800, Shuai Xue wrote:
> On 2025/7/26 00:16, Breno Leitao wrote:
> > On Fri, Jul 25, 2025 at 03:40:58PM +0800, Shuai Xue wrote:
> > 
> > 	enum hwerr_error_type {
> > 		HWERR_RECOV_MCE,     // maps to errors in do_machine_check()
> > 		HWERR_RECOV_CXL,     // maps to CPER_SEC_CXL_
> > 		HWERR_RECOV_PCI,     // maps to AER (pci_dev_aer_stats_incr()) and CPER_SEC_PCIE and CPER_SEC_PCI
> > 		HWERR_RECOV_MEMORY,  // maps to CPER_SEC_PLATFORM_MEM_
> > 		HWERR_RECOV_CPU,     // maps to CPER_SEC_PROC_
> > 		HWERR_RECOV_DMA,     // maps to CPER_SEC_DMAR_
> > 		HWERR_RECOV_OTHERS,  // maps to CPER_SEC_FW_, CPER_SEC_DMAR_,
> > 	}
> > 
> > Is this what you think we should track?
> > 
> > Thanks
> > --breno
> 
> It sounds good to me.

Does the following patch match your expectation?

Thanks!

Author: Breno Leitao <leitao@debian.org>
Date:   Thu Jul 17 07:39:26 2025 -0700

    vmcoreinfo: Track and log recoverable hardware errors
    
    Introduce a generic infrastructure for tracking recoverable hardware
    errors (HW errors that did not cause a panic) and record them for vmcore
    consumption. This aids post-mortem crash analysis tools by preserving
    a count and timestamp for the last occurrence of such errors.
    
    Add centralized logging for sources of recoverable hardware
    errors, keyed by the subsystem that reported them.
    
    hwerror_data is write-only at kernel runtime, and it is meant to be read
    from vmcore using tools like crash/drgn. For example, this is what it
    looks like when opening the crashdump from drgn.
    
            >>> prog['hwerror_data']
            (struct hwerror_info[6]){
                    {
                            .count = (int)844,
                            .timestamp = (time64_t)1752852018,
                    },
                    ...
    
    This helps fleet operators quickly triage whether a crash may be
    influenced by recoverable hardware errors (which exercise an uncommon
    code path in the kernel), especially when recoverable errors occurred
    shortly before a panic, such as the bug fixed by
    commit ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them
    when destroying the pool")
    
    This is not intended to replace full hardware diagnostics but provides
    a fast way to correlate hardware events with kernel panics.
    
    Suggested-by: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Breno Leitao <leitao@debian.org>

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 4da4eab56c81d..f85759453f89a 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -45,6 +45,7 @@
 #include <linux/task_work.h>
 #include <linux/hardirq.h>
 #include <linux/kexec.h>
+#include <linux/vmcore_info.h>
 
 #include <asm/fred.h>
 #include <asm/cpu_device_id.h>
@@ -1690,6 +1691,9 @@ noinstr void do_machine_check(struct pt_regs *regs)
 	}
 
 out:
+	/* Given it didn't panic, mark it as recoverable */
+	hwerr_log_error_type(HWERR_RECOV_MCE);
+
 	instrumentation_end();
 
 clear:
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a0d54993edb3b..f0b17efff713e 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -43,6 +43,7 @@
 #include <linux/uuid.h>
 #include <linux/ras.h>
 #include <linux/task_work.h>
+#include <linux/vmcore_info.h>
 
 #include <acpi/actbl1.h>
 #include <acpi/ghes.h>
@@ -867,6 +868,40 @@ int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cper_kfifo_get, "CXL");
 
+static void ghes_log_hwerr(int sev, guid_t *sec_type)
+{
+	if (sev != CPER_SEV_CORRECTED && sev != CPER_SEV_RECOVERABLE)
+		return;
+
+	if (guid_equal(sec_type, &CPER_SEC_PROC_ARM) ||
+	    guid_equal(sec_type, &CPER_SEC_PROC_GENERIC) ||
+	    guid_equal(sec_type, &CPER_SEC_PROC_IA)) {
+		hwerr_log_error_type(HWERR_RECOV_CPU);
+		return;
+	}
+
+	if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR) ||
+	    guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID) ||
+	    guid_equal(sec_type, &CPER_SEC_CXL_DRAM_GUID) ||
+	    guid_equal(sec_type, &CPER_SEC_CXL_MEM_MODULE_GUID)) {
+		hwerr_log_error_type(HWERR_RECOV_CXL);
+		return;
+	}
+
+	if (guid_equal(sec_type, &CPER_SEC_PCIE) ||
+	    guid_equal(sec_type, &CPER_SEC_PCI_X_BUS)) {
+		hwerr_log_error_type(HWERR_RECOV_PCI);
+		return;
+	}
+
+	if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
+		hwerr_log_error_type(HWERR_RECOV_MEMORY);
+		return;
+	}
+
+	hwerr_log_error_type(HWERR_RECOV_OTHERS);
+}
+
 static void ghes_do_proc(struct ghes *ghes,
 			 const struct acpi_hest_generic_status *estatus)
 {
@@ -888,6 +923,7 @@ static void ghes_do_proc(struct ghes *ghes,
 		if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
 			fru_text = gdata->fru_text;
 
+		ghes_log_hwerr(sev, sec_type);
 		if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
 			struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
 
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e286c197d7167..5ccb6ca347f3f 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -30,6 +30,7 @@
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include <linux/slab.h>
+#include <linux/vmcore_info.h>
 #include <acpi/apei.h>
 #include <acpi/ghes.h>
 #include <ras/ras_event.h>
@@ -746,6 +747,7 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
 	switch (info->severity) {
 	case AER_CORRECTABLE:
 		aer_info->dev_total_cor_errs++;
+		hwerr_log_error_type(HWERR_RECOV_PCI);
 		counter = &aer_info->dev_cor_errs[0];
 		max = AER_MAX_TYPEOF_COR_ERRS;
 		break;
diff --git a/include/linux/vmcore_info.h b/include/linux/vmcore_info.h
index 37e003ae52626..538a3635fb1e5 100644
--- a/include/linux/vmcore_info.h
+++ b/include/linux/vmcore_info.h
@@ -77,4 +77,21 @@ extern u32 *vmcoreinfo_note;
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
 			  void *data, size_t data_len);
 void final_note(Elf_Word *buf);
+
+enum hwerr_error_type {
+	HWERR_RECOV_MCE,
+	HWERR_RECOV_CPU,
+	HWERR_RECOV_MEMORY,
+	HWERR_RECOV_PCI,
+	HWERR_RECOV_CXL,
+	HWERR_RECOV_OTHERS,
+	HWERR_RECOV_MAX,
+};
+
+#ifdef CONFIG_VMCORE_INFO
+noinstr void hwerr_log_error_type(enum hwerr_error_type src);
+#else
+static inline void hwerr_log_error_type(enum hwerr_error_type src) {};
+#endif
+
 #endif /* LINUX_VMCORE_INFO_H */
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index e066d31d08f89..4b5ab45d468f5 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -31,6 +31,13 @@ u32 *vmcoreinfo_note;
 /* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
 static unsigned char *vmcoreinfo_data_safecopy;
 
+struct hwerr_info {
+	int __data_racy count;
+	time64_t __data_racy timestamp;
+};
+
+static struct hwerr_info hwerr_data[HWERR_RECOV_MAX];
+
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
 			  void *data, size_t data_len)
 {
@@ -118,6 +125,17 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
 }
 EXPORT_SYMBOL(paddr_vmcoreinfo_note);
 
+void hwerr_log_error_type(enum hwerr_error_type src)
+{
+	if (src < 0 || src >= HWERR_RECOV_MAX)
+		return;
+
+	/* No need for atomics/locks, given that precision is not important */
+	hwerr_data[src].count++;
+	hwerr_data[src].timestamp = ktime_get_real_seconds();
+}
+EXPORT_SYMBOL_GPL(hwerr_log_error_type);
+
 static int __init crash_save_vmcoreinfo_init(void)
 {
 	vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL);


* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-29 13:48           ` Breno Leitao
@ 2025-07-30  2:13             ` Shuai Xue
  2025-07-30 13:11               ` Breno Leitao
  0 siblings, 1 reply; 17+ messages in thread
From: Shuai Xue @ 2025-07-30  2:13 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Tony Luck, Borislav Petkov, Rafael J. Wysocki, Len Brown,
	James Morse, Robert Moore, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
	Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, linux-acpi, linux-kernel, acpica-devel, osandov,
	konrad.wilk, linux-edac, linuxppc-dev, linux-pci, kernel-team



On 2025/7/29 21:48, Breno Leitao wrote:
> On Mon, Jul 28, 2025 at 09:08:25AM +0800, Shuai Xue wrote:
>> On 2025/7/26 00:16, Breno Leitao wrote:
>>> On Fri, Jul 25, 2025 at 03:40:58PM +0800, Shuai Xue wrote:
>>>
>>> 	enum hwerr_error_type {
>>> 		HWERR_RECOV_MCE,     // maps to errors in do_machine_check()
>>> 		HWERR_RECOV_CXL,     // maps to CPER_SEC_CXL_
>>> 		HWERR_RECOV_PCI,     // maps to AER (pci_dev_aer_stats_incr()) and CPER_SEC_PCIE and CPER_SEC_PCI
>>> 		HWERR_RECOV_MEMORY,  // maps to CPER_SEC_PLATFORM_MEM_
>>> 		HWERR_RECOV_CPU,     // maps to CPER_SEC_PROC_
>>> 		HWERR_RECOV_DMA,     // maps to CPER_SEC_DMAR_
>>> 		HWERR_RECOV_OTHERS,  // maps to CPER_SEC_FW_, CPER_SEC_DMAR_,
>>> 	}
>>>
>>> Is this what you think we should track?
>>>
>>> Thanks
>>> --breno
>>
>> It sounds good to me.
> 
> Does the following patch match your expectation?
> 
> Thanks!
> 
> Author: Breno Leitao <leitao@debian.org>
> Date:   Thu Jul 17 07:39:26 2025 -0700
> 
>      vmcoreinfo: Track and log recoverable hardware errors
>      
>      Introduce a generic infrastructure for tracking recoverable hardware
>      errors (HW errors that did not cause a panic) and record them for vmcore
>      consumption. This aids post-mortem crash analysis tools by preserving
>      a count and timestamp for the last occurrence of such errors.
>      
>      Add centralized logging for sources of recoverable hardware
>      errors, keyed by the subsystem that reported them.
>      
>      hwerror_data is write-only at kernel runtime, and it is meant to be read
>      from vmcore using tools like crash/drgn. For example, this is what it
>      looks like when opening the crashdump from drgn.
>      
>              >>> prog['hwerror_data']
>              (struct hwerror_info[6]){
>                      {
>                              .count = (int)844,
>                              .timestamp = (time64_t)1752852018,
>                      },
>                      ...
>      
>      This helps fleet operators quickly triage whether a crash may be
>      influenced by recoverable hardware errors (which exercise an uncommon
>      code path in the kernel), especially when recoverable errors occurred
>      shortly before a panic, such as the bug fixed by
>      commit ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them
>      when destroying the pool")
>      
>      This is not intended to replace full hardware diagnostics but provides
>      a fast way to correlate hardware events with kernel panics.
>      
>      Suggested-by: Tony Luck <tony.luck@intel.com>
>      Signed-off-by: Breno Leitao <leitao@debian.org>
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 4da4eab56c81d..f85759453f89a 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -45,6 +45,7 @@
>   #include <linux/task_work.h>
>   #include <linux/hardirq.h>
>   #include <linux/kexec.h>
> +#include <linux/vmcore_info.h>
>   
>   #include <asm/fred.h>
>   #include <asm/cpu_device_id.h>
> @@ -1690,6 +1691,9 @@ noinstr void do_machine_check(struct pt_regs *regs)
>   	}
>   
>   out:
> +	/* Given it didn't panic, mark it as recoverable */
> +	hwerr_log_error_type(HWERR_RECOV_MCE);
> +
>   	instrumentation_end();
>   
>   clear:
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index a0d54993edb3b..f0b17efff713e 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -43,6 +43,7 @@
>   #include <linux/uuid.h>
>   #include <linux/ras.h>
>   #include <linux/task_work.h>
> +#include <linux/vmcore_info.h>
>   
>   #include <acpi/actbl1.h>
>   #include <acpi/ghes.h>
> @@ -867,6 +868,40 @@ int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd)
>   }
>   EXPORT_SYMBOL_NS_GPL(cxl_cper_kfifo_get, "CXL");
>   
> +static void ghes_log_hwerr(int sev, guid_t *sec_type)
> +{
> +	if (sev != CPER_SEV_CORRECTED && sev != CPER_SEV_RECOVERABLE)
> +		return;
> +
> +	if (guid_equal(sec_type, &CPER_SEC_PROC_ARM) ||
> +	    guid_equal(sec_type, &CPER_SEC_PROC_GENERIC) ||
> +	    guid_equal(sec_type, &CPER_SEC_PROC_IA)) {
> +		hwerr_log_error_type(HWERR_RECOV_CPU);
> +		return;
> +	}
> +
> +	if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR) ||
> +	    guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID) ||
> +	    guid_equal(sec_type, &CPER_SEC_CXL_DRAM_GUID) ||
> +	    guid_equal(sec_type, &CPER_SEC_CXL_MEM_MODULE_GUID)) {
> +		hwerr_log_error_type(HWERR_RECOV_CXL);
> +		return;
> +	}
> +
> +	if (guid_equal(sec_type, &CPER_SEC_PCIE) ||
> +	    guid_equal(sec_type, &CPER_SEC_PCI_X_BUS)) {
> +		hwerr_log_error_type(HWERR_RECOV_PCI);
> +		return;
> +	}
> +
> +	if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
> +		hwerr_log_error_type(HWERR_RECOV_MEMORY);
> +		return;
> +	}
> +
> +	hwerr_log_error_type(HWERR_RECOV_OTHERS);
> +}
> +
>   static void ghes_do_proc(struct ghes *ghes,
>   			 const struct acpi_hest_generic_status *estatus)
>   {
> @@ -888,6 +923,7 @@ static void ghes_do_proc(struct ghes *ghes,
>   		if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
>   			fru_text = gdata->fru_text;
>   
> +		ghes_log_hwerr(sev, sec_type);
>   		if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
>   			struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
>   
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index e286c197d7167..5ccb6ca347f3f 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -30,6 +30,7 @@
>   #include <linux/kfifo.h>
>   #include <linux/ratelimit.h>
>   #include <linux/slab.h>
> +#include <linux/vmcore_info.h>
>   #include <acpi/apei.h>
>   #include <acpi/ghes.h>
>   #include <ras/ras_event.h>
> @@ -746,6 +747,7 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
>   	switch (info->severity) {
>   	case AER_CORRECTABLE:
>   		aer_info->dev_total_cor_errs++;
> +		hwerr_log_error_type(HWERR_RECOV_PCI);

Hi Breno,

Thanks for working on this! The patch looks good overall, but I noticed
an inconsistency in the AER handling:

In ghes_log_hwerr(), you're counting both CPER_SEV_CORRECTED and
CPER_SEV_RECOVERABLE errors.

However, in the AER section, you're only handling AER_CORRECTABLE cases.
IMHO, non-fatal errors are recoverable and correspond to
CPER_SEV_RECOVERABLE in the ACPI context.

The mapping should probably be:

- AER_CORRECTABLE → CPER_SEV_CORRECTED
- AER_NONFATAL → CPER_SEV_RECOVERABLE
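
The mapping above can be written down as a small lookup. This is only a
minimal sketch: the SK_-prefixed enums are hypothetical stand-ins so that
it compiles on its own (they are not the kernel's AER_* / CPER_SEV_*
definitions), and the fatal-to-fatal case is an assumption added for
completeness rather than something stated in this thread.

```c
/* Hypothetical stand-ins for the kernel's AER and CPER severities. */
enum sk_aer_severity {
	SK_AER_CORRECTABLE,
	SK_AER_NONFATAL,
	SK_AER_FATAL,
};

enum sk_cper_severity {
	SK_CPER_SEV_CORRECTED,
	SK_CPER_SEV_RECOVERABLE,
	SK_CPER_SEV_FATAL,
};

/* Translate an AER severity into the CPER severity it corresponds to. */
static enum sk_cper_severity sk_aer_to_cper(enum sk_aer_severity sev)
{
	switch (sev) {
	case SK_AER_CORRECTABLE:
		return SK_CPER_SEV_CORRECTED;
	case SK_AER_NONFATAL:
		return SK_CPER_SEV_RECOVERABLE;
	default:
		/* Assumption: anything else is treated as fatal. */
		return SK_CPER_SEV_FATAL;
	}
}
```

With such a mapping, the AER path and the GHES path could apply the same
severity filter instead of each picking its own.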

What do you think?

Thanks,
Shuai



>   		counter = &aer_info->dev_cor_errs[0];
>   		max = AER_MAX_TYPEOF_COR_ERRS;
>   		break;
> diff --git a/include/linux/vmcore_info.h b/include/linux/vmcore_info.h
> index 37e003ae52626..538a3635fb1e5 100644
> --- a/include/linux/vmcore_info.h
> +++ b/include/linux/vmcore_info.h
> @@ -77,4 +77,21 @@ extern u32 *vmcoreinfo_note;
>   Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
>   			  void *data, size_t data_len);
>   void final_note(Elf_Word *buf);
> +
> +enum hwerr_error_type {
> +	HWERR_RECOV_MCE,
> +	HWERR_RECOV_CPU,
> +	HWERR_RECOV_MEMORY,
> +	HWERR_RECOV_PCI,
> +	HWERR_RECOV_CXL,
> +	HWERR_RECOV_OTHERS,
> +	HWERR_RECOV_MAX,
> +};
> +
> +#ifdef CONFIG_VMCORE_INFO
> +noinstr void hwerr_log_error_type(enum hwerr_error_type src);
> +#else
> +static inline void hwerr_log_error_type(enum hwerr_error_type src) {};
> +#endif
> +
>   #endif /* LINUX_VMCORE_INFO_H */
> diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
> index e066d31d08f89..4b5ab45d468f5 100644
> --- a/kernel/vmcore_info.c
> +++ b/kernel/vmcore_info.c
> @@ -31,6 +31,13 @@ u32 *vmcoreinfo_note;
>   /* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
>   static unsigned char *vmcoreinfo_data_safecopy;
>   
> +struct hwerr_info {
> +	int __data_racy count;
> +	time64_t __data_racy timestamp;
> +};
> +
> +static struct hwerr_info hwerr_data[HWERR_RECOV_MAX];
> +
>   Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
>   			  void *data, size_t data_len)
>   {
> @@ -118,6 +125,17 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
>   }
>   EXPORT_SYMBOL(paddr_vmcoreinfo_note);
>   
> +void hwerr_log_error_type(enum hwerr_error_type src)
> +{
> +	if (src < 0 || src >= HWERR_RECOV_MAX)
> +		return;
> +
> +	/* No need for atomics/locks, given that precision is not important */
> +	hwerr_data[src].count++;
> +	hwerr_data[src].timestamp = ktime_get_real_seconds();
> +}
> +EXPORT_SYMBOL_GPL(hwerr_log_error_type);
> +
>   static int __init crash_save_vmcoreinfo_init(void)
>   {
>   	vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL);


* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-30  2:13             ` Shuai Xue
@ 2025-07-30 13:11               ` Breno Leitao
  2025-07-30 13:50                 ` Shuai Xue
  2025-07-30 16:21                 ` Mauro Carvalho Chehab
  0 siblings, 2 replies; 17+ messages in thread
From: Breno Leitao @ 2025-07-30 13:11 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Tony Luck, Borislav Petkov, Rafael J. Wysocki, Len Brown,
	James Morse, Robert Moore, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
	Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, linux-acpi, linux-kernel, acpica-devel, osandov,
	konrad.wilk, linux-edac, linuxppc-dev, linux-pci, kernel-team

Hello Shuai,

On Wed, Jul 30, 2025 at 10:13:13AM +0800, Shuai Xue wrote:
> In ghes_log_hwerr(), you're counting both CPER_SEV_CORRECTED and
> CPER_SEV_RECOVERABLE errors:

Thanks. I was reading this code a bit more, and I want to make sure my
understanding is correct, given that I was confused about CORRECTED and
RECOVERABLE errors.

CPER_SEV_CORRECTED means it is corrected in the background, and the OS
was not even notified about it. That includes 1-bit ECC errors.
Those are not the errors we are interested in, since they are irrelevant
to the OS.

If that is true, then I might not want to count CPER_SEV_CORRECTED errors
at all, but only CPER_SEV_RECOVERABLE.

> However, in the AER section, you're only handling AER_CORRECTABLE cases.
> IMHO, Non-fatal errors are recoverable and correspond to
> CPER_SEV_RECOVERABLE in the ACPI context.
> 
> The mapping should probably be:
> 
> - AER_CORRECTABLE → CPER_SEV_CORRECTED
> - AER_NONFATAL → CPER_SEV_RECOVERABLE

Thanks. This means I want to count AER_NONFATAL but not AER_CORRECTABLE.
Is this right?

Summarizing, this is a new version of the change, according to my
new understanding:

commit deca1c4b99dcfa64b29fe035f8422b4601212413
Author: Breno Leitao <leitao@debian.org>
Date:   Thu Jul 17 07:39:26 2025 -0700

    vmcoreinfo: Track and log recoverable hardware errors

    Introduce a generic infrastructure for tracking recoverable hardware
    errors (HW errors that are visible to the OS but do not cause a panic)
    and record them for vmcore consumption. This aids post-mortem crash
    analysis tools by preserving a count and timestamp for the last
    occurrence of such errors. Correctable errors, on the other hand, which
    the OS typically remains unaware of because the underlying hardware
    handles them transparently, are less relevant and therefore are NOT
    tracked in this infrastructure.

    Add centralized logging for sources of recoverable hardware
    errors, keyed by the subsystem that reported them.

    hwerror_data is write-only at kernel runtime, and it is meant to be read
    from vmcore using tools like crash/drgn. For example, this is what it
    looks like when opening the crashdump from drgn.

            >>> prog['hwerror_data']
            (struct hwerror_info[6]){
                    {
                            .count = (int)844,
                            .timestamp = (time64_t)1752852018,
                    },
                    ...

    This helps fleet operators quickly triage whether a crash may be
    influenced by recoverable hardware errors (which exercise an uncommon
    code path in the kernel), especially when recoverable errors occurred
    shortly before a panic, such as the bug fixed by
    commit ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them
    when destroying the pool")

    This is not intended to replace full hardware diagnostics but provides
    a fast way to correlate hardware events with kernel panics.

    Suggested-by: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Breno Leitao <leitao@debian.org>

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 4da4eab56c81d..f85759453f89a 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -45,6 +45,7 @@
 #include <linux/task_work.h>
 #include <linux/hardirq.h>
 #include <linux/kexec.h>
+#include <linux/vmcore_info.h>

 #include <asm/fred.h>
 #include <asm/cpu_device_id.h>
@@ -1690,6 +1691,9 @@ noinstr void do_machine_check(struct pt_regs *regs)
 	}

 out:
+	/* Given it didn't panic, mark it as recoverable */
+	hwerr_log_error_type(HWERR_RECOV_MCE);
+
 	instrumentation_end();

 clear:
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a0d54993edb3b..9c549c4a1a708 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -43,6 +43,7 @@
 #include <linux/uuid.h>
 #include <linux/ras.h>
 #include <linux/task_work.h>
+#include <linux/vmcore_info.h>

 #include <acpi/actbl1.h>
 #include <acpi/ghes.h>
@@ -867,6 +868,40 @@ int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cper_kfifo_get, "CXL");

+static void ghes_log_hwerr(int sev, guid_t *sec_type)
+{
+	if (sev != CPER_SEV_RECOVERABLE)
+		return;
+
+	if (guid_equal(sec_type, &CPER_SEC_PROC_ARM) ||
+	    guid_equal(sec_type, &CPER_SEC_PROC_GENERIC) ||
+	    guid_equal(sec_type, &CPER_SEC_PROC_IA)) {
+		hwerr_log_error_type(HWERR_RECOV_CPU);
+		return;
+	}
+
+	if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR) ||
+	    guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID) ||
+	    guid_equal(sec_type, &CPER_SEC_CXL_DRAM_GUID) ||
+	    guid_equal(sec_type, &CPER_SEC_CXL_MEM_MODULE_GUID)) {
+		hwerr_log_error_type(HWERR_RECOV_CXL);
+		return;
+	}
+
+	if (guid_equal(sec_type, &CPER_SEC_PCIE) ||
+	    guid_equal(sec_type, &CPER_SEC_PCI_X_BUS)) {
+		hwerr_log_error_type(HWERR_RECOV_PCI);
+		return;
+	}
+
+	if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
+		hwerr_log_error_type(HWERR_RECOV_MEMORY);
+		return;
+	}
+
+	hwerr_log_error_type(HWERR_RECOV_OTHERS);
+}
+
 static void ghes_do_proc(struct ghes *ghes,
 			 const struct acpi_hest_generic_status *estatus)
 {
@@ -888,6 +923,7 @@ static void ghes_do_proc(struct ghes *ghes,
 		if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
 			fru_text = gdata->fru_text;

+		ghes_log_hwerr(sev, sec_type);
 		if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
 			struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e286c197d7167..d814c06cdbee6 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -30,6 +30,7 @@
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include <linux/slab.h>
+#include <linux/vmcore_info.h>
 #include <acpi/apei.h>
 #include <acpi/ghes.h>
 #include <ras/ras_event.h>
@@ -751,6 +752,7 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
 		break;
 	case AER_NONFATAL:
 		aer_info->dev_total_nonfatal_errs++;
+		hwerr_log_error_type(HWERR_RECOV_PCI);
 		counter = &aer_info->dev_nonfatal_errs[0];
 		max = AER_MAX_TYPEOF_UNCOR_ERRS;
 		break;
diff --git a/include/linux/vmcore_info.h b/include/linux/vmcore_info.h
index 37e003ae52626..538a3635fb1e5 100644
--- a/include/linux/vmcore_info.h
+++ b/include/linux/vmcore_info.h
@@ -77,4 +77,21 @@ extern u32 *vmcoreinfo_note;
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
 			  void *data, size_t data_len);
 void final_note(Elf_Word *buf);
+
+enum hwerr_error_type {
+	HWERR_RECOV_MCE,
+	HWERR_RECOV_CPU,
+	HWERR_RECOV_MEMORY,
+	HWERR_RECOV_PCI,
+	HWERR_RECOV_CXL,
+	HWERR_RECOV_OTHERS,
+	HWERR_RECOV_MAX,
+};
+
+#ifdef CONFIG_VMCORE_INFO
+noinstr void hwerr_log_error_type(enum hwerr_error_type src);
+#else
+static inline void hwerr_log_error_type(enum hwerr_error_type src) {};
+#endif
+
 #endif /* LINUX_VMCORE_INFO_H */
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index e066d31d08f89..4b5ab45d468f5 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -31,6 +31,13 @@ u32 *vmcoreinfo_note;
 /* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
 static unsigned char *vmcoreinfo_data_safecopy;

+struct hwerr_info {
+	int __data_racy count;
+	time64_t __data_racy timestamp;
+};
+
+static struct hwerr_info hwerr_data[HWERR_RECOV_MAX];
+
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
 			  void *data, size_t data_len)
 {
@@ -118,6 +125,17 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
 }
 EXPORT_SYMBOL(paddr_vmcoreinfo_note);

+void hwerr_log_error_type(enum hwerr_error_type src)
+{
+	if (src < 0 || src >= HWERR_RECOV_MAX)
+		return;
+
+	/* No need for atomics/locks, given that precision is not important */
+	hwerr_data[src].count++;
+	hwerr_data[src].timestamp = ktime_get_real_seconds();
+}
+EXPORT_SYMBOL_GPL(hwerr_log_error_type);
+
 static int __init crash_save_vmcoreinfo_init(void)
 {
 	vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL);
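
For readers who want to experiment with the bookkeeping outside the
kernel, the count/timestamp update in hwerr_log_error_type() can be
modelled in userspace roughly as follows. This is a sketch under
assumptions: all sk_-prefixed names are hypothetical, time() stands in
for ktime_get_real_seconds(), and sk_hwerr_recent() is an illustrative
triage helper that is not part of the patch.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical mirror of enum hwerr_error_type from the patch. */
enum sk_hwerr_type {
	SK_HWERR_MCE,
	SK_HWERR_CPU,
	SK_HWERR_MEMORY,
	SK_HWERR_PCI,
	SK_HWERR_CXL,
	SK_HWERR_OTHERS,
	SK_HWERR_MAX,
};

struct sk_hwerr_info {
	int count;		/* number of events seen so far */
	int64_t timestamp;	/* wall-clock seconds of the last event */
};

static struct sk_hwerr_info sk_hwerr_data[SK_HWERR_MAX];

/* Bump the counter and remember when the last event happened. */
static void sk_hwerr_log(int type)
{
	if (type < 0 || type >= SK_HWERR_MAX)
		return;

	sk_hwerr_data[type].count++;
	sk_hwerr_data[type].timestamp = (int64_t)time(NULL);
}

/*
 * Triage helper (assumption, not in the patch): return 1 if the last
 * event of this type happened at most `window` seconds before panic_ts.
 */
static int sk_hwerr_recent(int type, int64_t panic_ts, int64_t window)
{
	const struct sk_hwerr_info *e;

	if (type < 0 || type >= SK_HWERR_MAX)
		return 0;

	e = &sk_hwerr_data[type];
	return e->count > 0 && e->timestamp <= panic_ts &&
	       panic_ts - e->timestamp <= window;
}
```

A post-mortem script would read the same two fields from the vmcore and
apply a check like sk_hwerr_recent() against the crash time.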



* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-30 13:11               ` Breno Leitao
@ 2025-07-30 13:50                 ` Shuai Xue
  2025-07-30 17:16                   ` Breno Leitao
  2025-07-30 16:21                 ` Mauro Carvalho Chehab
  1 sibling, 1 reply; 17+ messages in thread
From: Shuai Xue @ 2025-07-30 13:50 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Tony Luck, Borislav Petkov, Rafael J. Wysocki, Len Brown,
	James Morse, Robert Moore, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
	Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, linux-acpi, linux-kernel, acpica-devel, osandov,
	konrad.wilk, linux-edac, linuxppc-dev, linux-pci, kernel-team



On 2025/7/30 21:11, Breno Leitao wrote:
> Hello Shuai,
> 
> On Wed, Jul 30, 2025 at 10:13:13AM +0800, Shuai Xue wrote:
>> In ghes_log_hwerr(), you're counting both CPER_SEV_CORRECTED and
>> CPER_SEV_RECOVERABLE errors:
> 
> Thanks. I was reading this code a bit more, and I want to make sure my
> understanding is correct, given that I was confused about CORRECTED and
> RECOVERABLE errors.
> 
> CPER_SEV_CORRECTED means it is corrected in the background, and the OS
> was not even notified about it. That includes 1-bit ECC errors.

Not quite correct. From ACPI spec:

     > A corrected error is a hardware error condition that has been
     > corrected by the hardware or by the firmware by the time the OSPM is
     > notified about the existence of the error condition.

For example, 1-bit ECC errors can be reported via CMCI interrupt when
the threshold of correctable errors exceeds the desired limit. The Linux
GHES driver then initiates kernel actions like soft-offlining pages.

> Those are not the errors we are interested in, since they are irrelevant
> to the OS.
> 
> If that is true, then I might not want to count CPER_SEV_CORRECTED errors
> at all, but only CPER_SEV_RECOVERABLE.

Yes, that's the right approach. Hardware corrects CE errors and software
can continue running without intervention. Since HWERR_RECOV_MCE only
records uncorrected errors, focusing on CPER_SEV_RECOVERABLE is more
appropriate for crash correlation analysis.

> 
>> However, in the AER section, you're only handling AER_CORRECTABLE cases.
>> IMHO, Non-fatal errors are recoverable and correspond to
>> CPER_SEV_RECOVERABLE in the ACPI context.
>>
>> The mapping should probably be:
>>
>> - AER_CORRECTABLE → CPER_SEV_CORRECTED
>> - AER_NONFATAL → CPER_SEV_RECOVERABLE
> 
> Thanks. This means I want to count AER_NONFATAL but not AER_CORRECTABLE.
> Is this right?

Exactly. IMHO, the updated mapping looks correct:

     - GHES: Only CPER_SEV_RECOVERABLE
     - AER: Only AER_NONFATAL (which maps to recoverable errors)
     - MCE: Uncorrected errors that didn't cause panic
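
Put as a single predicate, the three rules above might look like the
sketch below. It is illustrative only: the SK_-prefixed names are
hypothetical stand-ins, SK_SEV_RECOVERABLE plays the role of both
CPER_SEV_RECOVERABLE and AER_NONFATAL, and the `panicked` flag is an
assumption used to express "did not cause a panic" for the MCE path.

```c
#include <stdbool.h>

/* Hypothetical stand-ins, not kernel definitions. */
enum sk_source { SK_SRC_GHES, SK_SRC_AER, SK_SRC_MCE };
enum sk_severity { SK_SEV_CORRECTED, SK_SEV_RECOVERABLE, SK_SEV_FATAL };

/* Decide whether an event should bump the recoverable-error counters. */
static bool sk_should_track(enum sk_source src, enum sk_severity sev,
			    bool panicked)
{
	switch (src) {
	case SK_SRC_GHES:
	case SK_SRC_AER:
		/* Only recoverable (non-fatal) events are counted. */
		return sev == SK_SEV_RECOVERABLE;
	case SK_SRC_MCE:
		/* Uncorrected errors that the system survived. */
		return sev != SK_SEV_CORRECTED && !panicked;
	}
	return false;
}
```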

> 
> Summarizing, this is a new version of the change, according to my
> new understanding:
> 
> commit deca1c4b99dcfa64b29fe035f8422b4601212413
> Author: Breno Leitao <leitao@debian.org>
> Date:   Thu Jul 17 07:39:26 2025 -0700
> 
>      vmcoreinfo: Track and log recoverable hardware errors
> 
>      Introduce a generic infrastructure for tracking recoverable hardware
>      errors (HW errors that are visible to the OS but does not cause a panic)
>      and record them for vmcore consumption. This aids post-mortem crash
>      analysis tools by preserving a count and timestamp for the last
>      occurrence of such errors. On the other hand, correctable errors, which
>      the OS typically remains unaware of because the underlying hardware
>      handles them transparently, are less relevant and therefore are NOT
>      tracked in this infrastructure.
> 
>      Add centralized logging for sources of recoverable hardware
>      errors based on the subsystem that was notified.
> 
>      hwerror_data is write-only at kernel runtime, and it is meant to be read
>      from vmcore using tools like crash/drgn. For example, this is what it
>      looks like when opening the crashdump from drgn.
> 
>              >>> prog['hwerror_data']
>              (struct hwerror_info[6]){
>                      {
>                              .count = (int)844,
>                              .timestamp = (time64_t)1752852018,
>                      },
>                      ...
> 
>      This helps fleet operators quickly triage whether a crash may be
>      influenced by hardware recoverable errors (which execute an uncommon
>      code path in the kernel), especially when recoverable errors occurred
>      shortly before a panic, such as the bug fixed by
>      commit ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them
>      when destroying the pool")
> 
>      This is not intended to replace full hardware diagnostics but provides
>      a fast way to correlate hardware events with kernel panics.
> 
>      Suggested-by: Tony Luck <tony.luck@intel.com>
>      Signed-off-by: Breno Leitao <leitao@debian.org>
> 
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 4da4eab56c81d..f85759453f89a 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -45,6 +45,7 @@
>   #include <linux/task_work.h>
>   #include <linux/hardirq.h>
>   #include <linux/kexec.h>
> +#include <linux/vmcore_info.h>
> 
>   #include <asm/fred.h>
>   #include <asm/cpu_device_id.h>
> @@ -1690,6 +1691,9 @@ noinstr void do_machine_check(struct pt_regs *regs)
>   	}
> 
>   out:
> +	/* Given it didn't panic, mark it as recoverable */
> +	hwerr_log_error_type(HWERR_RECOV_MCE);
> +

Indentation: needs tab alignment.

The current placement only logs errors that reach the out: label. Errors
that go to the `clear` label won't be recorded. Would it be better to log at
the beginning of do_machine_check() to capture all recoverable MCEs?

>   	instrumentation_end();
> 
>   clear:
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index a0d54993edb3b..9c549c4a1a708 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -43,6 +43,7 @@
>   #include <linux/uuid.h>
>   #include <linux/ras.h>
>   #include <linux/task_work.h>
> +#include <linux/vmcore_info.h>
> 
>   #include <acpi/actbl1.h>
>   #include <acpi/ghes.h>
> @@ -867,6 +868,40 @@ int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd)
>   }
>   EXPORT_SYMBOL_NS_GPL(cxl_cper_kfifo_get, "CXL");
> 
> +static void ghes_log_hwerr(int sev, guid_t *sec_type)
> +{
> +	if (sev != CPER_SEV_RECOVERABLE)
> +		return;
> +
> +	if (guid_equal(sec_type, &CPER_SEC_PROC_ARM) ||
> +	    guid_equal(sec_type, &CPER_SEC_PROC_GENERIC) ||
> +	    guid_equal(sec_type, &CPER_SEC_PROC_IA)) {
> +		hwerr_log_error_type(HWERR_RECOV_CPU);
> +		return;
> +	}
> +
> +	if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR) ||
> +	    guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID) ||
> +	    guid_equal(sec_type, &CPER_SEC_CXL_DRAM_GUID) ||
> +	    guid_equal(sec_type, &CPER_SEC_CXL_MEM_MODULE_GUID)) {
> +		hwerr_log_error_type(HWERR_RECOV_CXL);
> +		return;
> +	}
> +
> +	if (guid_equal(sec_type, &CPER_SEC_PCIE) ||
> +	    guid_equal(sec_type, &CPER_SEC_PCI_X_BUS)) {
> +		hwerr_log_error_type(HWERR_RECOV_PCI);
> +		return;
> +	}
> +
> +	if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
> +		hwerr_log_error_type(HWERR_RECOV_MEMORY);
> +		return;
> +	}
> +
> +	hwerr_log_error_type(HWERR_RECOV_OTHERS);
> +}
> +
>   static void ghes_do_proc(struct ghes *ghes,
>   			 const struct acpi_hest_generic_status *estatus)
>   {
> @@ -888,6 +923,7 @@ static void ghes_do_proc(struct ghes *ghes,
>   		if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
>   			fru_text = gdata->fru_text;
> 
> +		ghes_log_hwerr(sev, sec_type);
>   		if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
>   			struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index e286c197d7167..d814c06cdbee6 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -30,6 +30,7 @@
>   #include <linux/kfifo.h>
>   #include <linux/ratelimit.h>
>   #include <linux/slab.h>
> +#include <linux/vmcore_info.h>
>   #include <acpi/apei.h>
>   #include <acpi/ghes.h>
>   #include <ras/ras_event.h>
> @@ -751,6 +752,7 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
>   		break;
>   	case AER_NONFATAL:
>   		aer_info->dev_total_nonfatal_errs++;
> +		hwerr_log_error_type(HWERR_RECOV_PCI);
>   		counter = &aer_info->dev_nonfatal_errs[0];
>   		max = AER_MAX_TYPEOF_UNCOR_ERRS;
>   		break;
> diff --git a/include/linux/vmcore_info.h b/include/linux/vmcore_info.h
> index 37e003ae52626..538a3635fb1e5 100644
> --- a/include/linux/vmcore_info.h
> +++ b/include/linux/vmcore_info.h
> @@ -77,4 +77,21 @@ extern u32 *vmcoreinfo_note;
>   Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
>   			  void *data, size_t data_len);
>   void final_note(Elf_Word *buf);
> +
> +enum hwerr_error_type {
> +	HWERR_RECOV_MCE,
> +	HWERR_RECOV_CPU,
> +	HWERR_RECOV_MEMORY,
> +	HWERR_RECOV_PCI,
> +	HWERR_RECOV_CXL,
> +	HWERR_RECOV_OTHERS,
> +	HWERR_RECOV_MAX,
> +};
> +
> +#ifdef CONFIG_VMCORE_INFO
> +noinstr void hwerr_log_error_type(enum hwerr_error_type src);
> +#else
> +static inline void hwerr_log_error_type(enum hwerr_error_type src) {}
> +#endif
> +
>   #endif /* LINUX_VMCORE_INFO_H */
> diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
> index e066d31d08f89..4b5ab45d468f5 100644
> --- a/kernel/vmcore_info.c
> +++ b/kernel/vmcore_info.c
> @@ -31,6 +31,13 @@ u32 *vmcoreinfo_note;
>   /* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
>   static unsigned char *vmcoreinfo_data_safecopy;
> 
> +struct hwerr_info {
> +	int __data_racy count;
> +	time64_t __data_racy timestamp;
> +};
> +
> +static struct hwerr_info hwerr_data[HWERR_RECOV_MAX];
> +
>   Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
>   			  void *data, size_t data_len)
>   {
> @@ -118,6 +125,17 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
>   }
>   EXPORT_SYMBOL(paddr_vmcoreinfo_note);
> 
> +void hwerr_log_error_type(enum hwerr_error_type src)
> +{
> +	if (src < 0 || src >= HWERR_RECOV_MAX)
> +		return;
> +
> +	/* No need for atomics/locks given the precision is not important */
> +	hwerr_data[src].count++;
> +	hwerr_data[src].timestamp = ktime_get_real_seconds();
> +}
> +EXPORT_SYMBOL_GPL(hwerr_log_error_type);
> +
>   static int __init crash_save_vmcoreinfo_init(void)
>   {
>   	vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL);

Looks good to me.

Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>

It would be valuable to get additional review from other RAS experts.

Thanks.
Shuai


* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-30 13:50                 ` Shuai Xue
@ 2025-07-30 17:16                   ` Breno Leitao
  0 siblings, 0 replies; 17+ messages in thread
From: Breno Leitao @ 2025-07-30 17:16 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Tony Luck, Borislav Petkov, Rafael J. Wysocki, Len Brown,
	James Morse, Robert Moore, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
	Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, linux-acpi, linux-kernel, acpica-devel, osandov,
	konrad.wilk, linux-edac, linuxppc-dev, linux-pci, kernel-team

Hello Shuai,

Thanks for the review,

On Wed, Jul 30, 2025 at 09:50:39PM +0800, Shuai Xue wrote:
> On 2025/7/30 21:11, Breno Leitao wrote:
> >
> > @@ -1690,6 +1691,9 @@ noinstr void do_machine_check(struct pt_regs *regs)
> >   	}
> > 
> >   out:
> > +	/* Given it didn't panic, mark it as recoverable */
> > +	hwerr_log_error_type(HWERR_RECOV_MCE);
> > +
> 
> Indentation: needs tab alignment.

Not sure I understand the alignment problem. The code seems to be
properly aligned, and it uses tabs. Could you please clarify what the
current problem is?

> The current placement only logs errors that reach the out: label. Errors
> that go to the `clear` label won't be recorded. Would it be better to log at
> the beginning of do_machine_check() to capture all recoverable MCEs?

This is a good point, and I've thought about it. I understand we don't
want to track the code flow that goes to the clear: label, since it is
wrongly triggered by some CPUs, and it is not a real MCE.
That is described in commit 8ca97812c3c830 ("x86/mce: Work around an
erratum on fast string copy instructions").

At the same time, some MCEs are not being properly tracked, since they
return early from do_machine_check(). Here is a quick sketch of those
paths:

   void do_machine_check(struct pt_regs *regs)
   {
           ...
           if (unlikely(mce_flags.p5))
                   return pentium_machine_check(regs);
           else if (unlikely(mce_flags.winchip))
                   return winchip_machine_check(regs);
           else if (unlikely(!mca_cfg.initialized))
                   return unexpected_machine_check(regs);

           if (mce_flags.skx_repmov_quirk && quirk_skylake_repmov())
                   goto clear;

           /* Code doesn't exit anymore unless through out: */
   }

Given that instrumentation is not enabled when those returns are taken,
we cannot easily call hwerr_log_error_type() before the returns.

An option is just to ignore those, given they are unlikely. Another
option is to call hwerr_log_error_type() inside those functions above,
so we do not miss these counters in case do_machine_check() returns
early:


--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1481,6 +1481,7 @@ static void queue_task_work(struct mce_hw_err *err, char *msg, void (*func)(stru
 static noinstr void unexpected_machine_check(struct pt_regs *regs)
 {
        instrumentation_begin();
+       hwerr_log_error_type(HWERR_RECOV_MCE);
        pr_err("CPU#%d: Unexpected int18 (Machine Check)\n",
               smp_processor_id());
        instrumentation_end();
diff --git a/arch/x86/kernel/cpu/mce/p5.c b/arch/x86/kernel/cpu/mce/p5.c
index 2272ad53fc339..a627ed10b752d 100644
--- a/arch/x86/kernel/cpu/mce/p5.c
+++ b/arch/x86/kernel/cpu/mce/p5.c
@@ -26,6 +26,7 @@ noinstr void pentium_machine_check(struct pt_regs *regs)
        u32 loaddr, hi, lotype;

        instrumentation_begin();
+       hwerr_log_error_type(HWERR_RECOV_MCE);
        rdmsr(MSR_IA32_P5_MC_ADDR, loaddr, hi);
        rdmsr(MSR_IA32_P5_MC_TYPE, lotype, hi);

diff --git a/arch/x86/kernel/cpu/mce/winchip.c b/arch/x86/kernel/cpu/mce/winchip.c
index 6c99f29419090..b7862bf5ba870 100644
--- a/arch/x86/kernel/cpu/mce/winchip.c
+++ b/arch/x86/kernel/cpu/mce/winchip.c
@@ -20,6 +20,7 @@
 noinstr void winchip_machine_check(struct pt_regs *regs)
 {
        instrumentation_begin();
+       hwerr_log_error_type(HWERR_RECOV_MCE);
        pr_emerg("CPU0: Machine Check Exception.\n");
        add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
        instrumentation_end();



* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-30 13:11               ` Breno Leitao
  2025-07-30 13:50                 ` Shuai Xue
@ 2025-07-30 16:21                 ` Mauro Carvalho Chehab
  2025-07-30 17:22                   ` Breno Leitao
  1 sibling, 1 reply; 17+ messages in thread
From: Mauro Carvalho Chehab @ 2025-07-30 16:21 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Shuai Xue, Tony Luck, Borislav Petkov, Rafael J. Wysocki,
	Len Brown, James Morse, Robert Moore, Thomas Gleixner,
	Ingo Molnar, Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
	Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, linux-acpi, linux-kernel, acpica-devel, osandov,
	konrad.wilk, linux-edac, linuxppc-dev, linux-pci, kernel-team

On Wed, 30 Jul 2025 06:11:52 -0700
Breno Leitao <leitao@debian.org> wrote:

> Hello Shuai,
> 
> On Wed, Jul 30, 2025 at 10:13:13AM +0800, Shuai Xue wrote:
> > In ghes_log_hwerr(), you're counting both CPER_SEV_CORRECTED and
> > CPER_SEV_RECOVERABLE errors:  
> 
> Thanks. I was reading this code a bit more, and I want to make sure my
> understanding is correct, given I was confused about CORRECTED and
> RECOVERABLE errors.
> 
> CPER_SEV_CORRECTED means it is corrected in the background, and the OS
> was not even notified about it. That includes 1-bit ECC error.
> Those are not the errors we are interested in, since they are irrelevant
> to the OS.

Hardware-corrected errors aren't irrelevant. The rasdaemon utils capture
such errors, as they may be a symptom of a hardware defect. As a matter
of fact, in rasdaemon, thresholds can be set to trigger an action, such
as disabling memory blocks that contain defective memory.

This is especially relevant for HPC and supercomputer workloads, where
it is a lot cheaper to disable a block of bad memory than to lose
an entire job, which could take several weeks of run time on
a supercomputer, just because defective memory ended up causing a
failure in the application.

Regards,
Mauro


* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-30 16:21                 ` Mauro Carvalho Chehab
@ 2025-07-30 17:22                   ` Breno Leitao
  0 siblings, 0 replies; 17+ messages in thread
From: Breno Leitao @ 2025-07-30 17:22 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Shuai Xue, Tony Luck, Borislav Petkov, Rafael J. Wysocki,
	Len Brown, James Morse, Robert Moore, Thomas Gleixner,
	Ingo Molnar, Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
	Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, linux-acpi, linux-kernel, acpica-devel, osandov,
	konrad.wilk, linux-edac, linuxppc-dev, linux-pci, kernel-team

Hello Mauro,

On Wed, Jul 30, 2025 at 06:21:37PM +0200, Mauro Carvalho Chehab wrote:
> On Wed, 30 Jul 2025 06:11:52 -0700
> Breno Leitao <leitao@debian.org> wrote:
> > On Wed, Jul 30, 2025 at 10:13:13AM +0800, Shuai Xue wrote:
> > > In ghes_log_hwerr(), you're counting both CPER_SEV_CORRECTED and
> > > CPER_SEV_RECOVERABLE errors:  
> > 
> > Thanks. I was reading this code a bit more, and I want to make sure my
> > understanding is correct, given I was confused about CORRECTED and
> > RECOVERABLE errors.
> > 
> > CPER_SEV_CORRECTED means it is corrected in the background, and the OS
> > was not even notified about it. That includes 1-bit ECC error.
> > Those are not the errors we are interested in, since they are irrelevant
> > to the OS.
> 
> Hardware-corrected errors aren't irrelevant. The rasdaemon utils capture
> such errors, as they may be a symptom of a hardware defect. As a matter
> of fact, in rasdaemon, thresholds can be set to trigger an action, such
> as disabling memory blocks that contain defective memory.

Sorry, I meant that hardware-corrected errors aren't relevant in the
context of this patch, where we are tracking errors over which the OS
has some influence and decision.

> This is especially relevant for HPC and supercomputer workloads, where
> it is a lot cheaper to disable a block of bad memory than to lose
> an entire job, which could take several weeks of run time on
> a supercomputer, just because defective memory ended up causing a
> failure in the application.

Agreed. These errors are used in several ways, including to detect
hardware aging and to plan hardware replacement during maintenance
windows.

In this patchset, I am more focused on what information to add to the
crashdump so that it is easy to correlate crashes with hardware events,
and RECOVERABLE errors are the main ones.
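
To illustrate the kind of correlation I have in mind, here is a rough
userspace sketch. It assumes a post-mortem tool has already copied one
slot's count and timestamp out of the vmcore; the struct and function
names are made up for this example, and the window is whatever the
operator picks:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace copy of one per-source slot read from a vmcore. */
struct hwerr_sample {
	int count;
	int64_t timestamp;	/* seconds since the epoch of the last error */
};

/*
 * Return true when at least one error was recorded and the last one
 * happened no more than window_secs before the panic.
 */
static bool hwerr_recent_before_panic(const struct hwerr_sample *s,
				      int64_t panic_ts, int64_t window_secs)
{
	if (s->count <= 0)
		return false;
	/* A timestamp after the panic cannot explain the panic. */
	if (s->timestamp > panic_ts)
		return false;
	return panic_ts - s->timestamp <= window_secs;
}
```

With the values from the drgn example (count 844, timestamp 1752852018)
and a 60 second window, a panic 30 seconds later would be flagged, and
a panic two minutes later would not.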


Thread overview: 17+ messages
2025-07-22 16:56 [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors Breno Leitao
2025-07-23 14:28 ` kernel test robot
2025-07-23 15:36   ` Breno Leitao
2025-07-23 19:00     ` Borislav Petkov
2025-07-23 23:21       ` Huang, Kai
2025-07-24  8:00 ` Shuai Xue
2025-07-24 13:34   ` Breno Leitao
2025-07-25  7:40     ` Shuai Xue
2025-07-25 16:16       ` Breno Leitao
2025-07-28  1:08         ` Shuai Xue
2025-07-29 13:48           ` Breno Leitao
2025-07-30  2:13             ` Shuai Xue
2025-07-30 13:11               ` Breno Leitao
2025-07-30 13:50                 ` Shuai Xue
2025-07-30 17:16                   ` Breno Leitao
2025-07-30 16:21                 ` Mauro Carvalho Chehab
2025-07-30 17:22                   ` Breno Leitao
