* [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
From: Breno Leitao @ 2025-07-22 16:56 UTC
To: Rafael J. Wysocki, Len Brown, James Morse, Tony Luck, Borislav Petkov,
    Robert Moore, Thomas Gleixner, Ingo Molnar, Dave Hansen, x86,
    H. Peter Anvin, Hanjun Guo, Mauro Carvalho Chehab,
    Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas
Cc: linux-acpi, linux-kernel, acpica-devel, osandov, xueshuai,
    konrad.wilk, linux-edac, linuxppc-dev, linux-pci, kernel-team,
    Breno Leitao

Introduce a generic infrastructure for tracking recoverable hardware
errors (HW errors that did not cause a panic) and record them for vmcore
consumption. This aids post-mortem crash analysis tools by preserving
a count and timestamp for the last occurrence of such errors.

Add centralized logging for three common sources of recoverable hardware
errors:

  - PCIe AER Correctable errors
  - x86 Machine Check Exceptions (MCE)
  - APEI/CPER GHES corrected or recoverable errors

hwerror_data is write-only at kernel runtime, and it is meant to be
read from vmcore using tools like crash/drgn. For example, this is what
it looks like when opening the crashdump from drgn.

	>>> prog['hwerror_data']
	(struct hwerror_info[3]){
		{
			.count = (int)844,
			.timestamp = (time64_t)1752852018,
		},
		...

This helps fleet operators quickly triage whether a crash may be
influenced by hardware recoverable errors (which execute an uncommon
code path in the kernel), especially when recoverable errors occurred
shortly before a panic, such as the bug fixed by
commit ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them
when destroying the pool")

This is not intended to replace full hardware diagnostics but provides
a fast way to correlate hardware events with kernel panics.

Suggested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v3:
- Add more information about this feature in the commit message
  (Borislav Petkov)
- Renamed the function to hwerr_log_error_type() and use hwerr as
  suffix (Borislav Petkov)
- Make the empty function static inline (kernel test robot)
- Link to v2: https://lore.kernel.org/r/20250721-vmcore_hw_error-v2-1-ab65a6b43c5a@debian.org

Changes in v2:
- Split the counter by recoverable error (Tony Luck)
- Link to v1: https://lore.kernel.org/r/20250714-vmcore_hw_error-v1-1-8cf45edb6334@debian.org
---
 arch/x86/kernel/cpu/mce/core.c |  3 +++
 drivers/acpi/apei/ghes.c       |  8 ++++++--
 drivers/pci/pcie/aer.c         |  2 ++
 include/linux/vmcore_info.h    | 14 ++++++++++++++
 kernel/vmcore_info.c           | 18 ++++++++++++++++++
 5 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 4da4eab56c81d..cb225a42eebbb 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -45,6 +45,7 @@
 #include <linux/task_work.h>
 #include <linux/hardirq.h>
 #include <linux/kexec.h>
+#include <linux/vmcore_info.h>
 
 #include <asm/fred.h>
 #include <asm/cpu_device_id.h>
@@ -1692,6 +1693,8 @@ noinstr void do_machine_check(struct pt_regs *regs)
 out:
 	instrumentation_end();
 
+	/* Given it didn't panic, mark it as recoverable */
+	hwerr_log_error_type(HWERR_RECOV_MCE);
 clear:
 	mce_wrmsrq(MSR_IA32_MCG_STATUS, 0);
 }
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a0d54993edb3b..ebda2aa3d68f2 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -43,6 +43,7 @@
 #include <linux/uuid.h>
 #include <linux/ras.h>
 #include <linux/task_work.h>
+#include <linux/vmcore_info.h>
 
 #include <acpi/actbl1.h>
 #include <acpi/ghes.h>
@@ -1136,13 +1137,16 @@ static int ghes_proc(struct ghes *ghes)
 {
 	struct acpi_hest_generic_status *estatus = ghes->estatus;
 	u64 buf_paddr;
-	int rc;
+	int rc, sev;
 
 	rc = ghes_read_estatus(ghes, estatus, &buf_paddr, FIX_APEI_GHES_IRQ);
 	if (rc)
 		goto out;
 
-	if (ghes_severity(estatus->error_severity) >= GHES_SEV_PANIC)
+	sev = ghes_severity(estatus->error_severity);
+	if (sev == GHES_SEV_RECOVERABLE || sev == GHES_SEV_CORRECTED)
+		hwerr_log_error_type(HWERR_RECOV_GHES);
+	else if (sev >= GHES_SEV_PANIC)
 		__ghes_panic(ghes, estatus, buf_paddr, FIX_APEI_GHES_IRQ);
 
 	if (!ghes_estatus_cached(estatus)) {
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e286c197d7167..1ab744a3b7310 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -30,6 +30,7 @@
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include <linux/slab.h>
+#include <linux/vmcore_info.h>
 #include <acpi/apei.h>
 #include <acpi/ghes.h>
 #include <ras/ras_event.h>
@@ -746,6 +747,7 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
 	switch (info->severity) {
 	case AER_CORRECTABLE:
 		aer_info->dev_total_cor_errs++;
+		hwerr_log_error_type(HWERR_RECOV_AER);
 		counter = &aer_info->dev_cor_errs[0];
 		max = AER_MAX_TYPEOF_COR_ERRS;
 		break;
diff --git a/include/linux/vmcore_info.h b/include/linux/vmcore_info.h
index 37e003ae52626..39afce28bfaac 100644
--- a/include/linux/vmcore_info.h
+++ b/include/linux/vmcore_info.h
@@ -77,4 +77,18 @@ extern u32 *vmcoreinfo_note;
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
 			  void *data, size_t data_len);
 void final_note(Elf_Word *buf);
+
+enum hwerr_error_type {
+	HWERR_RECOV_AER,
+	HWERR_RECOV_MCE,
+	HWERR_RECOV_GHES,
+	HWERR_RECOV_MAX,
+};
+
+#ifdef CONFIG_VMCORE_INFO
+void hwerr_log_error_type(enum hwerr_error_type src);
+#else
+static inline void hwerr_log_error_type(enum hwerr_error_type src) {};
+#endif
+
 #endif /* LINUX_VMCORE_INFO_H */
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index e066d31d08f89..4b5ab45d468f5 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -31,6 +31,13 @@ u32 *vmcoreinfo_note;
 /* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
 static unsigned char *vmcoreinfo_data_safecopy;
 
+struct hwerr_info {
+	int __data_racy count;
+	time64_t __data_racy timestamp;
+};
+
+static struct hwerr_info hwerr_data[HWERR_RECOV_MAX];
+
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
 			  void *data, size_t data_len)
 {
@@ -118,6 +125,17 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
 }
 EXPORT_SYMBOL(paddr_vmcoreinfo_note);
 
+void hwerr_log_error_type(enum hwerr_error_type src)
+{
+	if (src < 0 || src >= HWERR_RECOV_MAX)
+		return;
+
+	/* No need to atomics/locks given the precision is not important */
+	hwerr_data[src].count++;
+	hwerr_data[src].timestamp = ktime_get_real_seconds();
+}
+EXPORT_SYMBOL_GPL(hwerr_log_error_type);
+
 static int __init crash_save_vmcoreinfo_init(void)
 {
 	vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL);

---
base-commit: 97987520025658f30bb787a99ffbd9bbff9ffc9d
change-id: 20250707-vmcore_hw_error-322429e6c316

Best regards,
--
Breno Leitao <leitao@debian.org>
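[Editor's sketch, not part of the posted patch.] The bookkeeping the patch adds is small enough to model in userspace, which may help when reading the kernel/vmcore_info.c hunk. The version below is a minimal sketch under stated assumptions: time(NULL) stands in for ktime_get_real_seconds(), the __data_racy annotation is dropped, and hwerr_peek() is a hypothetical helper that does not exist in the patch (the kernel array is only meant to be read from a vmcore).

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Mirrors the enum the patch adds to include/linux/vmcore_info.h. */
enum hwerr_error_type {
	HWERR_RECOV_AER,
	HWERR_RECOV_MCE,
	HWERR_RECOV_GHES,
	HWERR_RECOV_MAX,
};

/* Userspace stand-in for the kernel struct; __data_racy is dropped. */
struct hwerr_info {
	int count;
	int64_t timestamp;	/* seconds since the epoch, like time64_t */
};

static struct hwerr_info hwerr_data[HWERR_RECOV_MAX];

/*
 * Same shape as the patch: bounds check, then a plain (non-atomic)
 * increment and a "last seen" timestamp.
 * Assumption: time(NULL) replaces ktime_get_real_seconds().
 */
void hwerr_log_error_type(enum hwerr_error_type src)
{
	if ((int)src < 0 || src >= HWERR_RECOV_MAX)
		return;

	hwerr_data[src].count++;
	hwerr_data[src].timestamp = (int64_t)time(NULL);
}

/* Hypothetical helper, not in the patch: read back one slot. */
struct hwerr_info hwerr_peek(enum hwerr_error_type src)
{
	return hwerr_data[src];
}
```

Each slot keeps only a running count and the time of the most recent event, so a post-mortem tool can tell how long before the panic the last recoverable error of that type was seen, but not when earlier ones happened.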
* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
From: kernel test robot @ 2025-07-23 14:28 UTC
To: Breno Leitao, Rafael J. Wysocki, Len Brown, James Morse, Tony Luck,
    Borislav Petkov, Robert Moore, Thomas Gleixner, Ingo Molnar,
    Dave Hansen, x86, H. Peter Anvin, Hanjun Guo, Mauro Carvalho Chehab,
    Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas
Cc: oe-kbuild-all, linux-media, linux-acpi, linux-kernel, acpica-devel,
    osandov, xueshuai, konrad.wilk, linux-edac, linuxppc-dev, linux-pci,
    kernel-team, Breno Leitao

Hi Breno,

kernel test robot noticed the following build warnings:

[auto build test WARNING on 97987520025658f30bb787a99ffbd9bbff9ffc9d]

url:    https://github.com/intel-lab-lkp/linux/commits/Breno-Leitao/vmcoreinfo-Track-and-log-recoverable-hardware-errors/20250723-005950
base:   97987520025658f30bb787a99ffbd9bbff9ffc9d
patch link:    https://lore.kernel.org/r/20250722-vmcore_hw_error-v3-1-ff0683fc1f17%40debian.org
patch subject: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
config: x86_64-buildonly-randconfig-001-20250723 (https://download.01.org/0day-ci/archive/20250723/202507232209.GrgpSr47-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250723/202507232209.GrgpSr47-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202507232209.GrgpSr47-lkp@intel.com/

All warnings (new ones prefixed by >>):

>> vmlinux.o: warning: objtool: do_machine_check+0x5cc: call to hwerr_log_error_type() leaves .noinstr.text section

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
From: Breno Leitao @ 2025-07-23 15:36 UTC
To: kernel test robot
Cc: Rafael J. Wysocki, Len Brown, James Morse, Tony Luck, Borislav Petkov,
    Robert Moore, Thomas Gleixner, Ingo Molnar, Dave Hansen, x86,
    H. Peter Anvin, Hanjun Guo, Mauro Carvalho Chehab,
    Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas, oe-kbuild-all,
    linux-media, linux-acpi, linux-kernel, acpica-devel, osandov,
    xueshuai, konrad.wilk, linux-edac, linuxppc-dev, linux-pci,
    kernel-team

On Wed, Jul 23, 2025 at 10:28:29PM +0800, kernel test robot wrote:
> Hi Breno,
>
> kernel test robot noticed the following build warnings:
>
> [auto build test WARNING on 97987520025658f30bb787a99ffbd9bbff9ffc9d]
>
> url:    https://github.com/intel-lab-lkp/linux/commits/Breno-Leitao/vmcoreinfo-Track-and-log-recoverable-hardware-errors/20250723-005950
> base:   97987520025658f30bb787a99ffbd9bbff9ffc9d
> patch link:    https://lore.kernel.org/r/20250722-vmcore_hw_error-v3-1-ff0683fc1f17%40debian.org
> patch subject: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
> config: x86_64-buildonly-randconfig-001-20250723 (https://download.01.org/0day-ci/archive/20250723/202507232209.GrgpSr47-lkp@intel.com/config)
> compiler: gcc-12 (Debian 12.2.0-14+deb12u1) 12.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250723/202507232209.GrgpSr47-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202507232209.GrgpSr47-lkp@intel.com/
>
> All warnings (new ones prefixed by >>):
>
> >> vmlinux.o: warning: objtool: do_machine_check+0x5cc: call to hwerr_log_error_type() leaves .noinstr.text section

Oh, this seems to be a real issue.

Basically there are two approaches, from what I understand:

1) mark do_machine_check() as noinstr
2) Move hwerr_log_error_type() earlier inside the
   instrumentation_begin() area.

Option 1 is probably more flexible, given that hwerr_log_error_type()
doesn't seem to be a function that anyone wants to instrument?!
* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
From: Borislav Petkov @ 2025-07-23 19:00 UTC
To: Breno Leitao
Cc: kernel test robot, Rafael J. Wysocki, Len Brown, James Morse,
    Tony Luck, Robert Moore, Thomas Gleixner, Ingo Molnar, Dave Hansen,
    x86, H. Peter Anvin, Hanjun Guo, Mauro Carvalho Chehab,
    Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas, oe-kbuild-all,
    linux-media, linux-acpi, linux-kernel, acpica-devel, osandov,
    xueshuai, konrad.wilk, linux-edac, linuxppc-dev, linux-pci,
    kernel-team

On Wed, Jul 23, 2025 at 08:36:52AM -0700, Breno Leitao wrote:
> Basically there are two approaches, from what I understand:
>
> 1) mark do_machine_check() as noinstr

do_machine_check is already noinstr. I think you mean mark
hwerr_log_error_type() noinstr.

And yes, you can mark it. hwerr_log_error_type() is not that fascinating
to allow instrumentation for it.

> 2) Move hwerr_log_error_type() earlier inside the
>    instrumentation_begin() area.

Or you can do that - that looks like less of an effort btw.

--
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette
* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
From: Huang, Kai @ 2025-07-23 23:21 UTC
To: leitao@debian.org, bp@alien8.de
Cc: lkp, oe-kbuild-all@lists.linux.dev, xueshuai@linux.alibaba.com,
    acpica-devel@lists.linux.dev, linux-media@vger.kernel.org,
    Luck, Tony, james.morse@arm.com, dave.hansen@linux.intel.com,
    mchehab@kernel.org, konrad.wilk@oracle.com, oohall@gmail.com,
    helgaas@kernel.org, mingo@redhat.com, osandov@osandov.com,
    linux-kernel@vger.kernel.org, tglx@linutronix.de, lenb@kernel.org,
    kernel-team@meta.com, linux-edac@vger.kernel.org, hpa@zytor.com,
    linuxppc-dev@lists.ozlabs.org, mahesh@linux.ibm.com,
    guohanjun@huawei.com, rafael@kernel.org, linux-pci@vger.kernel.org,
    linux-acpi@vger.kernel.org, x86@kernel.org, Moore, Robert

On Wed, 2025-07-23 at 21:00 +0200, Borislav Petkov wrote:
> On Wed, Jul 23, 2025 at 08:36:52AM -0700, Breno Leitao wrote:
> > Basically there are two approaches, from what I understand:
> >
> > 1) mark do_machine_check() as noinstr
>
> do_machine_check is already noinstr. I think you mean mark
> hwerr_log_error_type() noinstr.
>
> And yes, you can mark it. hwerr_log_error_type() is not that fascinating
> to allow instrumentation for it.

This option doesn't seem to be able to work because IIRC
hwerr_log_error_type() calls ktime_get_real_seconds() which is not
'noinstr'.

>
> > 2) Move hwerr_log_error_type() earlier inside the
> >    instrumentation_begin() area.
>
> Or you can do that - that looks like less of an effort btw.
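[Editor's sketch, not from the thread.] The objtool warning and "option 2" come down to where the call sits relative to instrumentation_end(). The toy model below only illustrates that ordering. It is not kernel code: the two instrumentation_*() functions are stand-ins that flip a flag (the real ones are annotation macros from <linux/instrumentation.h>), hwerr_log_error_type_stub() is a hypothetical stand-in for an ordinary instrumentable function, and both machine_check_*_shape() functions are invented names.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy state: is instrumentable code currently allowed to run? */
static bool instr_allowed;
int calls_inside, calls_outside;

/* Stand-ins for the kernel annotations; here they only flip a flag. */
static void instrumentation_begin(void) { instr_allowed = true; }
static void instrumentation_end(void)   { instr_allowed = false; }

/*
 * Stand-in for an ordinary, instrumentable function such as
 * hwerr_log_error_type(); it records where it was called from.
 */
static void hwerr_log_error_type_stub(void)
{
	if (instr_allowed)
		calls_inside++;
	else
		calls_outside++;
}

/* Shape of the v3 hunk: the call sits after instrumentation_end(). */
void machine_check_v3_shape(void)
{
	instrumentation_begin();
	/* ... recovery work ... */
	instrumentation_end();
	hwerr_log_error_type_stub();	/* the pattern objtool flagged */
}

/* Option 2 from the thread: move the call before instrumentation_end(). */
void machine_check_option2_shape(void)
{
	instrumentation_begin();
	/* ... recovery work ... */
	hwerr_log_error_type_stub();
	instrumentation_end();
}
```

In this model the v3 shape bumps calls_outside and the option 2 shape bumps calls_inside. Marking the logging function noinstr instead (option 1) would, per Kai Huang's note, also require every function it calls to be noinstr.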
* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
From: Shuai Xue @ 2025-07-24 8:00 UTC
To: Breno Leitao, Rafael J. Wysocki, Len Brown, James Morse, Tony Luck,
    Borislav Petkov, Robert Moore, Thomas Gleixner, Ingo Molnar,
    Dave Hansen, x86, H. Peter Anvin, Hanjun Guo, Mauro Carvalho Chehab,
    Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas
Cc: linux-acpi, linux-kernel, acpica-devel, osandov, konrad.wilk,
    linux-edac, linuxppc-dev, linux-pci, kernel-team

Hi, Breno,

On 2025/7/23 00:56, Breno Leitao wrote:
> Introduce a generic infrastructure for tracking recoverable hardware
> errors (HW errors that did not cause a panic) and record them for vmcore
> consumption. This aids post-mortem crash analysis tools by preserving
> a count and timestamp for the last occurrence of such errors.
>
> Add centralized logging for three common sources of recoverable hardware
> errors:

The term "recoverable" is highly ambiguous. Even within the x86
architecture, different vendors define errors differently. I'm not
trying to be pedantic about classification. As far as I know, for 2-bit
memory errors detected by scrub, AMD defines them as deferred errors
(DE) and handles them with log_error_deferred, while Intel uses
machine_check_poll. For 2-bit memory errors consumed by processes,
both Intel and AMD use MCE handling via do_machine_check(). Does your
HWERR_RECOV_MCE only focus on synchronous UE errors handled in
do_machine_check? What makes it special?

>
>   - PCIe AER Correctable errors
>   - x86 Machine Check Exceptions (MCE)
>   - APEI/CPER GHES corrected or recoverable errors
>
> hwerror_data is write-only at kernel runtime, and it is meant to be
> read from vmcore using tools like crash/drgn. For example, this is what
> it looks like when opening the crashdump from drgn.
>
> 	>>> prog['hwerror_data']
> 	(struct hwerror_info[3]){
> 		{
> 			.count = (int)844,
> 			.timestamp = (time64_t)1752852018,
> 		},
> 		...
>
> This helps fleet operators quickly triage whether a crash may be
> influenced by hardware recoverable errors (which execute an uncommon
> code path in the kernel), especially when recoverable errors occurred
> shortly before a panic, such as the bug fixed by
> commit ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them
> when destroying the pool")
>
> This is not intended to replace full hardware diagnostics but provides
> a fast way to correlate hardware events with kernel panics.
>
> Suggested-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> Changes in v3:
> - Add more information about this feature in the commit message
>   (Borislav Petkov)
> - Renamed the function to hwerr_log_error_type() and use hwerr as
>   suffix (Borislav Petkov)
> - Make the empty function static inline (kernel test robot)
> - Link to v2: https://lore.kernel.org/r/20250721-vmcore_hw_error-v2-1-ab65a6b43c5a@debian.org
>
> Changes in v2:
> - Split the counter by recoverable error (Tony Luck)
> - Link to v1: https://lore.kernel.org/r/20250714-vmcore_hw_error-v1-1-8cf45edb6334@debian.org
> ---
>  arch/x86/kernel/cpu/mce/core.c |  3 +++
>  drivers/acpi/apei/ghes.c       |  8 ++++++--
>  drivers/pci/pcie/aer.c         |  2 ++
>  include/linux/vmcore_info.h    | 14 ++++++++++++++
>  kernel/vmcore_info.c           | 18 ++++++++++++++++++
>  5 files changed, 43 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 4da4eab56c81d..cb225a42eebbb 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -45,6 +45,7 @@
>  #include <linux/task_work.h>
>  #include <linux/hardirq.h>
>  #include <linux/kexec.h>
> +#include <linux/vmcore_info.h>
>
>  #include <asm/fred.h>
>  #include <asm/cpu_device_id.h>
> @@ -1692,6 +1693,8 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  out:
>  	instrumentation_end();
>
> +	/* Given it didn't panic, mark it as recoverable */
> +	hwerr_log_error_type(HWERR_RECOV_MCE);
>  clear:
>  	mce_wrmsrq(MSR_IA32_MCG_STATUS, 0);
>  }
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index a0d54993edb3b..ebda2aa3d68f2 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -43,6 +43,7 @@
>  #include <linux/uuid.h>
>  #include <linux/ras.h>
>  #include <linux/task_work.h>
> +#include <linux/vmcore_info.h>
>
>  #include <acpi/actbl1.h>
>  #include <acpi/ghes.h>
> @@ -1136,13 +1137,16 @@ static int ghes_proc(struct ghes *ghes)
>  {
>  	struct acpi_hest_generic_status *estatus = ghes->estatus;
>  	u64 buf_paddr;
> -	int rc;
> +	int rc, sev;
>
>  	rc = ghes_read_estatus(ghes, estatus, &buf_paddr, FIX_APEI_GHES_IRQ);
>  	if (rc)
>  		goto out;
>
> -	if (ghes_severity(estatus->error_severity) >= GHES_SEV_PANIC)
> +	sev = ghes_severity(estatus->error_severity);
> +	if (sev == GHES_SEV_RECOVERABLE || sev == GHES_SEV_CORRECTED)
> +		hwerr_log_error_type(HWERR_RECOV_GHES);

APEI does not define an error type named GHES. GHES is just a kernel
driver name. Many hardware error types can be handled in GHES (see
ghes_do_proc), for example, AER is routed by GHES when firmware-first
mode is used. As far as I know, firmware-first mode is commonly used in
production. Should GHES errors be categorized into AER, memory, and CXL
memory instead?

Thanks.
Shuai
* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
From: Breno Leitao @ 2025-07-24 13:34 UTC
To: Shuai Xue
Cc: Rafael J. Wysocki, Len Brown, James Morse, Tony Luck, Borislav Petkov,
    Robert Moore, Thomas Gleixner, Ingo Molnar, Dave Hansen, x86,
    H. Peter Anvin, Hanjun Guo, Mauro Carvalho Chehab,
    Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas, linux-acpi,
    linux-kernel, acpica-devel, osandov, konrad.wilk, linux-edac,
    linuxppc-dev, linux-pci, kernel-team

Hello Shuai,

On Thu, Jul 24, 2025 at 04:00:09PM +0800, Shuai Xue wrote:
> On 2025/7/23 00:56, Breno Leitao wrote:
> > Introduce a generic infrastructure for tracking recoverable hardware
> > errors (HW errors that did not cause a panic) and record them for vmcore
> > consumption. This aids post-mortem crash analysis tools by preserving
> > a count and timestamp for the last occurrence of such errors.
> >
> > Add centralized logging for three common sources of recoverable hardware
> > errors:
>
> The term "recoverable" is highly ambiguous. Even within the x86
> architecture, different vendors define errors differently. I'm not
> trying to be pedantic about classification. As far as I know, for 2-bit
> memory errors detected by scrub, AMD defines them as deferred errors
> (DE) and handles them with log_error_deferred, while Intel uses
> machine_check_poll. For 2-bit memory errors consumed by processes,
> both Intel and AMD use MCE handling via do_machine_check(). Does your
> HWERR_RECOV_MCE only focus on synchronous UE errors handled in
> do_machine_check? What makes it special?

I understand that deferred errors (DE) detected by memory scrubbing are
typically silent and may not significantly impact system stability. In
other words, I’m not convinced that including DE metrics in crash dumps
would be helpful for correlating crashes with hardware issues; it might
just add noise.

Do you think it would be valuable to also log these events within
log_error_deferred()?

> > -	if (ghes_severity(estatus->error_severity) >= GHES_SEV_PANIC)
> > +	sev = ghes_severity(estatus->error_severity);
> > +	if (sev == GHES_SEV_RECOVERABLE || sev == GHES_SEV_CORRECTED)
> > +		hwerr_log_error_type(HWERR_RECOV_GHES);
>
> APEI does not define an error type named GHES. GHES is just a kernel
> driver name. Many hardware error types can be handled in GHES (see
> ghes_do_proc), for example, AER is routed by GHES when firmware-first
> mode is used. As far as I know, firmware-first mode is commonly used in
> production. Should GHES errors be categorized into AER, memory, and CXL
> memory instead?

I also considered slicing the data differently initially, but then
realized it would add more complexity than necessary for my needs.

If you believe we should further subdivide the data, I’m happy to do so.

You’re suggesting a structure like this, which would then map to the
corresponding CPER_SEC_ sections:

	enum hwerr_error_type {
	  HWERR_RECOV_AER,     // maps to CPER_SEC_PCIE
	  HWERR_RECOV_MCE,     // maps to default MCE + CPER_SEC_PCIE
	  HWERR_RECOV_CXL,     // maps to CPER_SEC_CXL_*
	  HWERR_RECOV_MEMORY,  // maps to CPER_SEC_PLATFORM_MEM
	}

Additionally, what about events related to CPU, Firmware, or DMA
errors, for example CPER_SEC_PROC, CPER_SEC_FW, CPER_SEC_DMAR? Should we
include those in the classification as well?

Thanks for your review and for the ongoing discussion!
--breno
* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
From: Shuai Xue @ 2025-07-25 7:40 UTC
To: Breno Leitao, Tony Luck, Borislav Petkov
Cc: Rafael J. Wysocki, Len Brown, James Morse, Robert Moore,
    Thomas Gleixner, Ingo Molnar, Dave Hansen, x86, H. Peter Anvin,
    Hanjun Guo, Mauro Carvalho Chehab, Mahesh J Salgaonkar,
    Oliver O'Halloran, Bjorn Helgaas, linux-acpi, linux-kernel,
    acpica-devel, osandov, konrad.wilk, linux-edac, linuxppc-dev,
    linux-pci, kernel-team

On 2025/7/24 21:34, Breno Leitao wrote:
> Hello Shuai,
>
> On Thu, Jul 24, 2025 at 04:00:09PM +0800, Shuai Xue wrote:
>> On 2025/7/23 00:56, Breno Leitao wrote:
>>> Introduce a generic infrastructure for tracking recoverable hardware
>>> errors (HW errors that did not cause a panic) and record them for vmcore
>>> consumption. This aids post-mortem crash analysis tools by preserving
>>> a count and timestamp for the last occurrence of such errors.
>>>
>>> Add centralized logging for three common sources of recoverable hardware
>>> errors:
>>
>> The term "recoverable" is highly ambiguous. Even within the x86
>> architecture, different vendors define errors differently. I'm not
>> trying to be pedantic about classification. As far as I know, for 2-bit
>> memory errors detected by scrub, AMD defines them as deferred errors
>> (DE) and handles them with log_error_deferred, while Intel uses
>> machine_check_poll. For 2-bit memory errors consumed by processes,
>> both Intel and AMD use MCE handling via do_machine_check(). Does your
>> HWERR_RECOV_MCE only focus on synchronous UE errors handled in
>> do_machine_check? What makes it special?
>
> I understand that deferred errors (DE) detected by memory scrubbing are
> typically silent and may not significantly impact system stability. In
> other words, I’m not convinced that including DE metrics in crash dumps
> would be helpful for correlating crashes with hardware issues; it might
> just add noise.
>
> Do you think it would be valuable to also log these events within
> log_error_deferred()?

Not really. As you mentioned, a DE is typically silent and handled in
the background. But I hope this is well documented.

>
>>> -	if (ghes_severity(estatus->error_severity) >= GHES_SEV_PANIC)
>>> +	sev = ghes_severity(estatus->error_severity);
>>> +	if (sev == GHES_SEV_RECOVERABLE || sev == GHES_SEV_CORRECTED)
>>> +		hwerr_log_error_type(HWERR_RECOV_GHES);
>>
>> APEI does not define an error type named GHES. GHES is just a kernel
>> driver name. Many hardware error types can be handled in GHES (see
>> ghes_do_proc), for example, AER is routed by GHES when firmware-first
>> mode is used. As far as I know, firmware-first mode is commonly used in
>> production. Should GHES errors be categorized into AER, memory, and CXL
>> memory instead?
>
> I also considered slicing the data differently initially, but then
> realized it would add more complexity than necessary for my needs.
>
> If you believe we should further subdivide the data, I’m happy to do so.
>
> You’re suggesting a structure like this, which would then map to the
> corresponding CPER_SEC_ sections:
>
> 	enum hwerr_error_type {
> 	  HWERR_RECOV_AER,     // maps to CPER_SEC_PCIE
> 	  HWERR_RECOV_MCE,     // maps to default MCE + CPER_SEC_PCIE

Is CPER_SEC_PCIE a typo here?

> 	  HWERR_RECOV_CXL,     // maps to CPER_SEC_CXL_*
> 	  HWERR_RECOV_MEMORY,  // maps to CPER_SEC_PLATFORM_MEM
> 	}
>
> Additionally, what about events related to CPU, Firmware, or DMA
> errors, for example CPER_SEC_PROC, CPER_SEC_FW, CPER_SEC_DMAR? Should we
> include those in the classification as well?

I would like to split each error reported through GHES into its own
type; that sounds more reasonable. I cannot tell what happened from
HWERR_RECOV_AER at all :(

>
>
> Thanks for your review and for the ongoing discussion!
> --breno

Thanks.
Shuai
* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
From: Breno Leitao @ 2025-07-25 16:16 UTC
To: Shuai Xue
Cc: Tony Luck, Borislav Petkov, Rafael J. Wysocki, Len Brown, James Morse,
    Robert Moore, Thomas Gleixner, Ingo Molnar, Dave Hansen, x86,
    H. Peter Anvin, Hanjun Guo, Mauro Carvalho Chehab,
    Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas, linux-acpi,
    linux-kernel, acpica-devel, osandov, konrad.wilk, linux-edac,
    linuxppc-dev, linux-pci, kernel-team

Hello Shuai,

On Fri, Jul 25, 2025 at 03:40:58PM +0800, Shuai Xue wrote:
> > > APEI does not define an error type named GHES. GHES is just a kernel
> > > driver name. Many hardware error types can be handled in GHES (see
> > > ghes_do_proc), for example, AER is routed by GHES when firmware-first
> > > mode is used. As far as I know, firmware-first mode is commonly used in
> > > production. Should GHES errors be categorized into AER, memory, and CXL
> > > memory instead?
> >
> > I also considered slicing the data differently initially, but then
> > realized it would add more complexity than necessary for my needs.
> >
> > If you believe we should further subdivide the data, I’m happy to do so.
> >
> > You’re suggesting a structure like this, which would then map to the
> > corresponding CPER_SEC_ sections:
> >
> > 	enum hwerr_error_type {
> > 	  HWERR_RECOV_AER,     // maps to CPER_SEC_PCIE
> > 	  HWERR_RECOV_MCE,     // maps to default MCE + CPER_SEC_PCIE
>
> Is CPER_SEC_PCIE a typo here?

Correct, HWERR_RECOV_MCE would map to the regular MCE and not errors
coming from GHES.

> > 	  HWERR_RECOV_CXL,     // maps to CPER_SEC_CXL_*
> > 	  HWERR_RECOV_MEMORY,  // maps to CPER_SEC_PLATFORM_MEM
> > 	}
> >
> > Additionally, what about events related to CPU, Firmware, or DMA
> > errors, for example CPER_SEC_PROC, CPER_SEC_FW, CPER_SEC_DMAR? Should we
> > include those in the classification as well?
>
> I would like to split each error reported through GHES into its own
> type; that sounds more reasonable. I cannot tell what happened from
> HWERR_RECOV_AER at all :(

Makes sense. Regarding your answer, I suppose we might want to have
something like the following:

	enum hwerr_error_type {
	  HWERR_RECOV_MCE,     // maps to errors in do_machine_check()
	  HWERR_RECOV_CXL,     // maps to CPER_SEC_CXL_
	  HWERR_RECOV_PCI,     // maps to AER (pci_dev_aer_stats_incr()) and CPER_SEC_PCIE and CPER_SEC_PCI
	  HWERR_RECOV_MEMORY,  // maps to CPER_SEC_PLATFORM_MEM_
	  HWERR_RECOV_CPU,     // maps to CPER_SEC_PROC_
	  HWERR_RECOV_DMA,     // maps to CPER_SEC_DMAR_
	  HWERR_RECOV_OTHERS,  // maps to CPER_SEC_FW_, CPER_SEC_DMAR_,
	}

Is this what you think we should track?

Thanks
--breno
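[Editor's sketch, not from the thread.] The proposed split amounts to a lookup from "which CPER section did GHES report" to "which counter gets bumped". The sketch below is a userspace illustration of that mapping. It makes several assumptions: enum cper_sec_kind and its SEC_* tags are hypothetical stand-ins (the kernel identifies sections by the CPER_SEC_* GUIDs, not by an enum), hwerr_type_for_section() is an invented name, HWERR_RECOV_MAX is carried over from the v3 patch, and DMAR is mapped to HWERR_RECOV_DMA even though the proposal's comments also list it under OTHERS.

```c
#include <assert.h>

/*
 * Enum as proposed in the message above. HWERR_RECOV_MAX is an
 * assumption carried over from the v3 patch.
 */
enum hwerr_error_type {
	HWERR_RECOV_MCE,
	HWERR_RECOV_CXL,
	HWERR_RECOV_PCI,
	HWERR_RECOV_MEMORY,
	HWERR_RECOV_CPU,
	HWERR_RECOV_DMA,
	HWERR_RECOV_OTHERS,
	HWERR_RECOV_MAX,
};

/*
 * Hypothetical tags standing in for the CPER section-type GUIDs
 * (CPER_SEC_*); the kernel compares GUIDs rather than an enum.
 */
enum cper_sec_kind {
	SEC_PLATFORM_MEM,
	SEC_PCIE,
	SEC_CXL,
	SEC_PROC,
	SEC_DMAR,
	SEC_FW,
};

/* Hypothetical helper: pick the counter for a GHES-reported section. */
enum hwerr_error_type hwerr_type_for_section(enum cper_sec_kind kind)
{
	switch (kind) {
	case SEC_PLATFORM_MEM:
		return HWERR_RECOV_MEMORY;
	case SEC_PCIE:
		return HWERR_RECOV_PCI;
	case SEC_CXL:
		return HWERR_RECOV_CXL;
	case SEC_PROC:
		return HWERR_RECOV_CPU;
	case SEC_DMAR:
		return HWERR_RECOV_DMA;
	case SEC_FW:
		return HWERR_RECOV_OTHERS;
	}
	/* Anything unrecognized lands in the catch-all bucket. */
	return HWERR_RECOV_OTHERS;
}
```

Under this mapping a firmware-first AER event delivered as a PCIe section and a native AER event would both land in HWERR_RECOV_PCI, which is the property Shuai asked for. HWERR_RECOV_MCE is not reachable from GHES at all and stays reserved for do_machine_check().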
* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
From: Shuai Xue @ 2025-07-28 1:08 UTC
To: Breno Leitao
Cc: Tony Luck, Borislav Petkov, Rafael J. Wysocki, Len Brown, James Morse,
    Robert Moore, Thomas Gleixner, Ingo Molnar, Dave Hansen, x86,
    H. Peter Anvin, Hanjun Guo, Mauro Carvalho Chehab,
    Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas, linux-acpi,
    linux-kernel, acpica-devel, osandov, konrad.wilk, linux-edac,
    linuxppc-dev, linux-pci, kernel-team

On 2025/7/26 00:16, Breno Leitao wrote:
> Hello Shuai,
>
> On Fri, Jul 25, 2025 at 03:40:58PM +0800, Shuai Xue wrote:
>>>> APEI does not define an error type named GHES. GHES is just a kernel
>>>> driver name. Many hardware error types can be handled in GHES (see
>>>> ghes_do_proc), for example, AER is routed by GHES when firmware-first
>>>> mode is used. As far as I know, firmware-first mode is commonly used in
>>>> production. Should GHES errors be categorized into AER, memory, and CXL
>>>> memory instead?
>>>
>>> I also considered slicing the data differently initially, but then
>>> realized it would add more complexity than necessary for my needs.
>>>
>>> If you believe we should further subdivide the data, I’m happy to do so.
>>>
>>> You’re suggesting a structure like this, which would then map to the
>>> corresponding CPER_SEC_ sections:
>>>
>>> 	enum hwerr_error_type {
>>> 	  HWERR_RECOV_AER,     // maps to CPER_SEC_PCIE
>>> 	  HWERR_RECOV_MCE,     // maps to default MCE + CPER_SEC_PCIE
>>
>> Is CPER_SEC_PCIE a typo here?
>
> Correct, HWERR_RECOV_MCE would map to the regular MCE and not errors
> coming from GHES.
>
>>> 	  HWERR_RECOV_CXL,     // maps to CPER_SEC_CXL_*
>>> 	  HWERR_RECOV_MEMORY,  // maps to CPER_SEC_PLATFORM_MEM
>>> 	}
>>>
>>> Additionally, what about events related to CPU, Firmware, or DMA
>>> errors, for example CPER_SEC_PROC, CPER_SEC_FW, CPER_SEC_DMAR? Should we
>>> include those in the classification as well?
>>
>> I would like to split each error reported through GHES into its own
>> type; that sounds more reasonable. I cannot tell what happened from
>> HWERR_RECOV_AER at all :(
>
> Makes sense. Regarding your answer, I suppose we might want to have
> something like the following:
>
> 	enum hwerr_error_type {
> 	  HWERR_RECOV_MCE,     // maps to errors in do_machine_check()
> 	  HWERR_RECOV_CXL,     // maps to CPER_SEC_CXL_
> 	  HWERR_RECOV_PCI,     // maps to AER (pci_dev_aer_stats_incr()) and CPER_SEC_PCIE and CPER_SEC_PCI
> 	  HWERR_RECOV_MEMORY,  // maps to CPER_SEC_PLATFORM_MEM_
> 	  HWERR_RECOV_CPU,     // maps to CPER_SEC_PROC_
> 	  HWERR_RECOV_DMA,     // maps to CPER_SEC_DMAR_
> 	  HWERR_RECOV_OTHERS,  // maps to CPER_SEC_FW_, CPER_SEC_DMAR_,
> 	}
>
> Is this what you think we should track?
>
> Thanks
> --breno

It sounds good to me.

Thanks.
Shuai
* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-28  1:08 ` Shuai Xue
@ 2025-07-29 13:48 ` Breno Leitao
  2025-07-30  2:13   ` Shuai Xue
  0 siblings, 1 reply; 17+ messages in thread
From: Breno Leitao @ 2025-07-29 13:48 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Tony Luck, Borislav Petkov, Rafael J. Wysocki, Len Brown,
      James Morse, Robert Moore, Thomas Gleixner, Ingo Molnar,
      Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
      Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
      Bjorn Helgaas, linux-acpi, linux-kernel, acpica-devel, osandov,
      konrad.wilk, linux-edac, linuxppc-dev, linux-pci, kernel-team

On Mon, Jul 28, 2025 at 09:08:25AM +0800, Shuai Xue wrote:
> On 2025/7/26 00:16, Breno Leitao wrote:
> > On Fri, Jul 25, 2025 at 03:40:58PM +0800, Shuai Xue wrote:
> >
> > enum hwerr_error_type {
> > 	HWERR_RECOV_MCE,	// maps to errors in do_machine_check()
> > 	HWERR_RECOV_CXL,	// maps to CPER_SEC_CXL_
> > 	HWERR_RECOV_PCI,	// maps to AER (pci_dev_aer_stats_incr()) and CPER_SEC_PCIE and CPER_SEC_PCI
> > 	HWERR_RECOV_MEMORY,	// maps to CPER_SEC_PLATFORM_MEM_
> > 	HWERR_RECOV_CPU,	// maps to CPER_SEC_PROC_
> > 	HWERR_RECOV_DMA,	// maps to CPER_SEC_DMAR_
> > 	HWERR_RECOV_OTHERS,	// maps to CPER_SEC_FW_, CPER_SEC_DMAR_,
> > }
> >
> > Is this what you think we should track?
> >
> > Thanks
> > --breno
>
> It sounds good to me.

Does the following patch match your expectation?

Thanks!

Author: Breno Leitao <leitao@debian.org>
Date:   Thu Jul 17 07:39:26 2025 -0700

    vmcoreinfo: Track and log recoverable hardware errors

    Introduce a generic infrastructure for tracking recoverable hardware
    errors (HW errors that did not cause a panic) and record them for vmcore
    consumption. This aids post-mortem crash analysis tools by preserving
    a count and timestamp for the last occurrence of such errors.

    Add centralized logging for sources of recoverable hardware
    errors, based on the subsystem that was notified.

    hwerror_data is write-only at kernel runtime, and it is meant to be read
    from vmcore using tools like crash/drgn. For example, this is what it
    looks like when opening the crashdump from drgn.

        >>> prog['hwerror_data']
        (struct hwerror_info[6]){
                {
                        .count = (int)844,
                        .timestamp = (time64_t)1752852018,
                },
                ...

    This helps fleet operators quickly triage whether a crash may be
    influenced by recoverable hardware errors (which execute an uncommon
    code path in the kernel), especially when recoverable errors occurred
    shortly before a panic, such as the bug fixed by
    commit ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them
    when destroying the pool")

    This is not intended to replace full hardware diagnostics but provides
    a fast way to correlate hardware events with kernel panics.

    Suggested-by: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Breno Leitao <leitao@debian.org>

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 4da4eab56c81d..f85759453f89a 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -45,6 +45,7 @@
 #include <linux/task_work.h>
 #include <linux/hardirq.h>
 #include <linux/kexec.h>
+#include <linux/vmcore_info.h>
 
 #include <asm/fred.h>
 #include <asm/cpu_device_id.h>
@@ -1690,6 +1691,9 @@ noinstr void do_machine_check(struct pt_regs *regs)
 	}
 
 out:
+	/* Given it didn't panic, mark it as recoverable */
+	hwerr_log_error_type(HWERR_RECOV_MCE);
+
 	instrumentation_end();
 
 clear:
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a0d54993edb3b..f0b17efff713e 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -43,6 +43,7 @@
 #include <linux/uuid.h>
 #include <linux/ras.h>
 #include <linux/task_work.h>
+#include <linux/vmcore_info.h>
 
 #include <acpi/actbl1.h>
 #include <acpi/ghes.h>
@@ -867,6 +868,40 @@ int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cper_kfifo_get, "CXL");
 
+static void ghes_log_hwerr(int sev, guid_t *sec_type)
+{
+	if (sev != CPER_SEV_CORRECTED && sev != CPER_SEV_RECOVERABLE)
+		return;
+
+	if (guid_equal(sec_type, &CPER_SEC_PROC_ARM) ||
+	    guid_equal(sec_type, &CPER_SEC_PROC_GENERIC) ||
+	    guid_equal(sec_type, &CPER_SEC_PROC_IA)) {
+		hwerr_log_error_type(HWERR_RECOV_CPU);
+		return;
+	}
+
+	if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR) ||
+	    guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID) ||
+	    guid_equal(sec_type, &CPER_SEC_CXL_DRAM_GUID) ||
+	    guid_equal(sec_type, &CPER_SEC_CXL_MEM_MODULE_GUID)) {
+		hwerr_log_error_type(HWERR_RECOV_CXL);
+		return;
+	}
+
+	if (guid_equal(sec_type, &CPER_SEC_PCIE) ||
+	    guid_equal(sec_type, &CPER_SEC_PCI_X_BUS)) {
+		hwerr_log_error_type(HWERR_RECOV_PCI);
+		return;
+	}
+
+	if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
+		hwerr_log_error_type(HWERR_RECOV_MEMORY);
+		return;
+	}
+
+	hwerr_log_error_type(HWERR_RECOV_OTHERS);
+}
+
 static void ghes_do_proc(struct ghes *ghes,
 			 const struct acpi_hest_generic_status *estatus)
 {
@@ -888,6 +923,7 @@ static void ghes_do_proc(struct ghes *ghes,
 		if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
 			fru_text = gdata->fru_text;
 
+		ghes_log_hwerr(sev, sec_type);
 		if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
 			struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
 
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e286c197d7167..5ccb6ca347f3f 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -30,6 +30,7 @@
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include <linux/slab.h>
+#include <linux/vmcore_info.h>
 #include <acpi/apei.h>
 #include <acpi/ghes.h>
 #include <ras/ras_event.h>
@@ -746,6 +747,7 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
 	switch (info->severity) {
 	case AER_CORRECTABLE:
 		aer_info->dev_total_cor_errs++;
+		hwerr_log_error_type(HWERR_RECOV_PCI);
 		counter = &aer_info->dev_cor_errs[0];
 		max = AER_MAX_TYPEOF_COR_ERRS;
 		break;
diff --git a/include/linux/vmcore_info.h b/include/linux/vmcore_info.h
index 37e003ae52626..538a3635fb1e5 100644
--- a/include/linux/vmcore_info.h
+++ b/include/linux/vmcore_info.h
@@ -77,4 +77,21 @@ extern u32 *vmcoreinfo_note;
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
 			  void *data, size_t data_len);
 void final_note(Elf_Word *buf);
+
+enum hwerr_error_type {
+	HWERR_RECOV_MCE,
+	HWERR_RECOV_CPU,
+	HWERR_RECOV_MEMORY,
+	HWERR_RECOV_PCI,
+	HWERR_RECOV_CXL,
+	HWERR_RECOV_OTHERS,
+	HWERR_RECOV_MAX,
+};
+
+#ifdef CONFIG_VMCORE_INFO
+noinstr void hwerr_log_error_type(enum hwerr_error_type src);
+#else
+static inline void hwerr_log_error_type(enum hwerr_error_type src) {}
+#endif
+
 #endif /* LINUX_VMCORE_INFO_H */
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index e066d31d08f89..4b5ab45d468f5 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -31,6 +31,13 @@ u32 *vmcoreinfo_note;
 /* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
 static unsigned char *vmcoreinfo_data_safecopy;
 
+struct hwerr_info {
+	int __data_racy count;
+	time64_t __data_racy timestamp;
+};
+
+static struct hwerr_info hwerr_data[HWERR_RECOV_MAX];
+
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
 			  void *data, size_t data_len)
 {
@@ -118,6 +125,17 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
 }
 EXPORT_SYMBOL(paddr_vmcoreinfo_note);
 
+void hwerr_log_error_type(enum hwerr_error_type src)
+{
+	if (src < 0 || src >= HWERR_RECOV_MAX)
+		return;
+
+	/* No need for atomics/locks given that precision is not important */
+	hwerr_data[src].count++;
+	hwerr_data[src].timestamp = ktime_get_real_seconds();
+}
+EXPORT_SYMBOL_GPL(hwerr_log_error_type);
+
 static int __init crash_save_vmcoreinfo_init(void)
 {
 	vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL);

^ permalink raw reply related [flat|nested] 17+ messages in thread
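For readers who want to experiment with the bookkeeping outside the kernel, the following is a minimal user-space sketch of what hwerr_log_error_type() in the patch above does: bump a per-type counter and remember when the last event happened. It is an illustration under stated assumptions, not the kernel code. time() stands in for ktime_get_real_seconds(), int64_t stands in for time64_t, the __data_racy annotations are dropped, the array is made non-static so it can be inspected, and the function takes a plain int so that the range check is meaningful.

```c
/* User-space model of the per-type error counter; NOT the kernel code. */
#include <assert.h>
#include <stdint.h>
#include <time.h>

enum hwerr_error_type {
	HWERR_RECOV_MCE,
	HWERR_RECOV_CPU,
	HWERR_RECOV_MEMORY,
	HWERR_RECOV_PCI,
	HWERR_RECOV_CXL,
	HWERR_RECOV_OTHERS,
	HWERR_RECOV_MAX,
};

/* Six slots, matching the array length shown in the drgn output. */
static_assert(HWERR_RECOV_MAX == 6, "one slot per error type");

struct hwerr_info {
	int count;		/* number of events seen for this type */
	int64_t timestamp;	/* wall-clock seconds of the latest event */
};

struct hwerr_info hwerr_data[HWERR_RECOV_MAX];

/* Record one event; out-of-range types are silently ignored. */
void hwerr_log_error_type(int src)
{
	if (src < 0 || src >= HWERR_RECOV_MAX)
		return;

	/* Deliberately unsynchronized, as in the patch: precision is not needed. */
	hwerr_data[src].count++;
	hwerr_data[src].timestamp = (int64_t)time(NULL); /* assumption: stand-in for ktime_get_real_seconds() */
}
```

Calling hwerr_log_error_type(HWERR_RECOV_PCI) twice leaves that slot's count at 2 and the other slots untouched. Because only the most recent timestamp is kept, a post-mortem tool can tell how close the last event was to the crash but not how the events were spread over time.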
* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-29 13:48 ` Breno Leitao
@ 2025-07-30  2:13 ` Shuai Xue
  2025-07-30 13:11   ` Breno Leitao
  0 siblings, 1 reply; 17+ messages in thread
From: Shuai Xue @ 2025-07-30  2:13 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Tony Luck, Borislav Petkov, Rafael J. Wysocki, Len Brown,
      James Morse, Robert Moore, Thomas Gleixner, Ingo Molnar,
      Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
      Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
      Bjorn Helgaas, linux-acpi, linux-kernel, acpica-devel, osandov,
      konrad.wilk, linux-edac, linuxppc-dev, linux-pci, kernel-team

On 2025/7/29 21:48, Breno Leitao wrote:
> On Mon, Jul 28, 2025 at 09:08:25AM +0800, Shuai Xue wrote:
>> On 2025/7/26 00:16, Breno Leitao wrote:
>>> On Fri, Jul 25, 2025 at 03:40:58PM +0800, Shuai Xue wrote:
>>>
>>> enum hwerr_error_type {
>>> 	HWERR_RECOV_MCE,	// maps to errors in do_machine_check()
>>> 	HWERR_RECOV_CXL,	// maps to CPER_SEC_CXL_
>>> 	HWERR_RECOV_PCI,	// maps to AER (pci_dev_aer_stats_incr()) and CPER_SEC_PCIE and CPER_SEC_PCI
>>> 	HWERR_RECOV_MEMORY,	// maps to CPER_SEC_PLATFORM_MEM_
>>> 	HWERR_RECOV_CPU,	// maps to CPER_SEC_PROC_
>>> 	HWERR_RECOV_DMA,	// maps to CPER_SEC_DMAR_
>>> 	HWERR_RECOV_OTHERS,	// maps to CPER_SEC_FW_, CPER_SEC_DMAR_,
>>> }
>>>
>>> Is this what you think we should track?
>>>
>>> Thanks
>>> --breno
>>
>> It sounds good to me.
>
> Does the following patch match your expectation?
>
> Thanks!
>
> Author: Breno Leitao <leitao@debian.org>
> Date:   Thu Jul 17 07:39:26 2025 -0700
>
>     vmcoreinfo: Track and log recoverable hardware errors
>
>     Introduce a generic infrastructure for tracking recoverable hardware
>     errors (HW errors that did not cause a panic) and record them for vmcore
>     consumption. This aids post-mortem crash analysis tools by preserving
>     a count and timestamp for the last occurrence of such errors.
>
>     Add centralized logging for sources of recoverable hardware
>     errors, based on the subsystem that was notified.
>
>     hwerror_data is write-only at kernel runtime, and it is meant to be read
>     from vmcore using tools like crash/drgn. For example, this is what it
>     looks like when opening the crashdump from drgn.
>
>         >>> prog['hwerror_data']
>         (struct hwerror_info[6]){
>                 {
>                         .count = (int)844,
>                         .timestamp = (time64_t)1752852018,
>                 },
>                 ...
>
>     This helps fleet operators quickly triage whether a crash may be
>     influenced by recoverable hardware errors (which execute an uncommon
>     code path in the kernel), especially when recoverable errors occurred
>     shortly before a panic, such as the bug fixed by
>     commit ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them
>     when destroying the pool")
>
>     This is not intended to replace full hardware diagnostics but provides
>     a fast way to correlate hardware events with kernel panics.
>
>     Suggested-by: Tony Luck <tony.luck@intel.com>
>     Signed-off-by: Breno Leitao <leitao@debian.org>
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 4da4eab56c81d..f85759453f89a 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -45,6 +45,7 @@
>  #include <linux/task_work.h>
>  #include <linux/hardirq.h>
>  #include <linux/kexec.h>
> +#include <linux/vmcore_info.h>
>
>  #include <asm/fred.h>
>  #include <asm/cpu_device_id.h>
> @@ -1690,6 +1691,9 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  	}
>
>  out:
> +	/* Given it didn't panic, mark it as recoverable */
> +	hwerr_log_error_type(HWERR_RECOV_MCE);
> +
>  	instrumentation_end();
>
>  clear:
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index a0d54993edb3b..f0b17efff713e 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -43,6 +43,7 @@
>  #include <linux/uuid.h>
>  #include <linux/ras.h>
>  #include <linux/task_work.h>
> +#include <linux/vmcore_info.h>
>
>  #include <acpi/actbl1.h>
>  #include <acpi/ghes.h>
> @@ -867,6 +868,40 @@ int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_cper_kfifo_get, "CXL");
>
> +static void ghes_log_hwerr(int sev, guid_t *sec_type)
> +{
> +	if (sev != CPER_SEV_CORRECTED && sev != CPER_SEV_RECOVERABLE)
> +		return;
> +
> +	if (guid_equal(sec_type, &CPER_SEC_PROC_ARM) ||
> +	    guid_equal(sec_type, &CPER_SEC_PROC_GENERIC) ||
> +	    guid_equal(sec_type, &CPER_SEC_PROC_IA)) {
> +		hwerr_log_error_type(HWERR_RECOV_CPU);
> +		return;
> +	}
> +
> +	if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR) ||
> +	    guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID) ||
> +	    guid_equal(sec_type, &CPER_SEC_CXL_DRAM_GUID) ||
> +	    guid_equal(sec_type, &CPER_SEC_CXL_MEM_MODULE_GUID)) {
> +		hwerr_log_error_type(HWERR_RECOV_CXL);
> +		return;
> +	}
> +
> +	if (guid_equal(sec_type, &CPER_SEC_PCIE) ||
> +	    guid_equal(sec_type, &CPER_SEC_PCI_X_BUS)) {
> +		hwerr_log_error_type(HWERR_RECOV_PCI);
> +		return;
> +	}
> +
> +	if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
> +		hwerr_log_error_type(HWERR_RECOV_MEMORY);
> +		return;
> +	}
> +
> +	hwerr_log_error_type(HWERR_RECOV_OTHERS);
> +}
> +
>  static void ghes_do_proc(struct ghes *ghes,
>  			 const struct acpi_hest_generic_status *estatus)
>  {
> @@ -888,6 +923,7 @@ static void ghes_do_proc(struct ghes *ghes,
>  		if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
>  			fru_text = gdata->fru_text;
>
> +		ghes_log_hwerr(sev, sec_type);
>  		if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
>  			struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index e286c197d7167..5ccb6ca347f3f 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -30,6 +30,7 @@
>  #include <linux/kfifo.h>
>  #include <linux/ratelimit.h>
>  #include <linux/slab.h>
> +#include <linux/vmcore_info.h>
>  #include <acpi/apei.h>
>  #include <acpi/ghes.h>
>  #include <ras/ras_event.h>
> @@ -746,6 +747,7 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
>  	switch (info->severity) {
>  	case AER_CORRECTABLE:
>  		aer_info->dev_total_cor_errs++;
> +		hwerr_log_error_type(HWERR_RECOV_PCI);

Hi Breno,

Thanks for working on this! The patch looks good overall, but I noticed
an inconsistency in the AER handling:

In ghes_log_hwerr(), you're counting both CPER_SEV_CORRECTED and
CPER_SEV_RECOVERABLE errors:

However, in the AER section, you're only handling AER_CORRECTABLE cases.
IMHO, non-fatal errors are recoverable and correspond to
CPER_SEV_RECOVERABLE in the ACPI context.

The mapping should probably be:

- AER_CORRECTABLE → CPER_SEV_CORRECTED
- AER_NONFATAL → CPER_SEV_RECOVERABLE

What do you think?

Thanks,
Shuai

>  		counter = &aer_info->dev_cor_errs[0];
>  		max = AER_MAX_TYPEOF_COR_ERRS;
>  		break;
> diff --git a/include/linux/vmcore_info.h b/include/linux/vmcore_info.h
> index 37e003ae52626..538a3635fb1e5 100644
> --- a/include/linux/vmcore_info.h
> +++ b/include/linux/vmcore_info.h
> @@ -77,4 +77,21 @@ extern u32 *vmcoreinfo_note;
>  Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
>  			  void *data, size_t data_len);
>  void final_note(Elf_Word *buf);
> +
> +enum hwerr_error_type {
> +	HWERR_RECOV_MCE,
> +	HWERR_RECOV_CPU,
> +	HWERR_RECOV_MEMORY,
> +	HWERR_RECOV_PCI,
> +	HWERR_RECOV_CXL,
> +	HWERR_RECOV_OTHERS,
> +	HWERR_RECOV_MAX,
> +};
> +
> +#ifdef CONFIG_VMCORE_INFO
> +noinstr void hwerr_log_error_type(enum hwerr_error_type src);
> +#else
> +static inline void hwerr_log_error_type(enum hwerr_error_type src) {}
> +#endif
> +
>  #endif /* LINUX_VMCORE_INFO_H */
> diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
> index e066d31d08f89..4b5ab45d468f5 100644
> --- a/kernel/vmcore_info.c
> +++ b/kernel/vmcore_info.c
> @@ -31,6 +31,13 @@ u32 *vmcoreinfo_note;
>  /* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
>  static unsigned char *vmcoreinfo_data_safecopy;
>
> +struct hwerr_info {
> +	int __data_racy count;
> +	time64_t __data_racy timestamp;
> +};
> +
> +static struct hwerr_info hwerr_data[HWERR_RECOV_MAX];
> +
>  Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
>  			  void *data, size_t data_len)
>  {
> @@ -118,6 +125,17 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
>  }
>  EXPORT_SYMBOL(paddr_vmcoreinfo_note);
>
> +void hwerr_log_error_type(enum hwerr_error_type src)
> +{
> +	if (src < 0 || src >= HWERR_RECOV_MAX)
> +		return;
> +
> +	/* No need for atomics/locks given that precision is not important */
> +	hwerr_data[src].count++;
> +	hwerr_data[src].timestamp = ktime_get_real_seconds();
> +}
> +EXPORT_SYMBOL_GPL(hwerr_log_error_type);
> +
>  static int __init crash_save_vmcoreinfo_init(void)
>  {
>  	vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL);

^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-30  2:13 ` Shuai Xue
@ 2025-07-30 13:11 ` Breno Leitao
  2025-07-30 13:50   ` Shuai Xue
  2025-07-30 16:21   ` Mauro Carvalho Chehab
  0 siblings, 2 replies; 17+ messages in thread
From: Breno Leitao @ 2025-07-30 13:11 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Tony Luck, Borislav Petkov, Rafael J. Wysocki, Len Brown,
      James Morse, Robert Moore, Thomas Gleixner, Ingo Molnar,
      Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
      Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
      Bjorn Helgaas, linux-acpi, linux-kernel, acpica-devel, osandov,
      konrad.wilk, linux-edac, linuxppc-dev, linux-pci, kernel-team

Hello Shuai,

On Wed, Jul 30, 2025 at 10:13:13AM +0800, Shuai Xue wrote:
> In ghes_log_hwerr(), you're counting both CPER_SEV_CORRECTED and
> CPER_SEV_RECOVERABLE errors:

Thanks. I was reading this code a bit more, and I want to make sure my
understanding is correct, given that I was confused about CORRECTED and
RECOVERABLE errors.

CPER_SEV_CORRECTED means it is corrected in the background, and the OS
was not even notified about it. That includes 1-bit ECC errors.
Those are not the errors we are interested in, since they are irrelevant
to the OS.

If that is true, then I might not want to count CPER_SEV_CORRECTED errors
at all, but only CPER_SEV_RECOVERABLE.

> However, in the AER section, you're only handling AER_CORRECTABLE cases.
> IMHO, Non-fatal errors are recoverable and correspond to
> CPER_SEV_RECOVERABLE in the ACPI context.
>
> The mapping should probably be:
>
> - AER_CORRECTABLE → CPER_SEV_CORRECTED
> - AER_NONFATAL → CPER_SEV_RECOVERABLE

Thanks. This means I want to count AER_NONFATAL but not AER_CORRECTABLE.
Is this right?

Summarizing, this is a new version of the change, according to my
new understanding:

commit deca1c4b99dcfa64b29fe035f8422b4601212413
Author: Breno Leitao <leitao@debian.org>
Date:   Thu Jul 17 07:39:26 2025 -0700

    vmcoreinfo: Track and log recoverable hardware errors

    Introduce a generic infrastructure for tracking recoverable hardware
    errors (HW errors that are visible to the OS but do not cause a panic)
    and record them for vmcore consumption. This aids post-mortem crash
    analysis tools by preserving a count and timestamp for the last
    occurrence of such errors. On the other hand, correctable errors, which
    the OS typically remains unaware of because the underlying hardware
    handles them transparently, are less relevant and therefore are NOT
    tracked in this infrastructure.

    Add centralized logging for sources of recoverable hardware
    errors, based on the subsystem that was notified.

    hwerror_data is write-only at kernel runtime, and it is meant to be read
    from vmcore using tools like crash/drgn. For example, this is what it
    looks like when opening the crashdump from drgn.

        >>> prog['hwerror_data']
        (struct hwerror_info[6]){
                {
                        .count = (int)844,
                        .timestamp = (time64_t)1752852018,
                },
                ...

    This helps fleet operators quickly triage whether a crash may be
    influenced by recoverable hardware errors (which execute an uncommon
    code path in the kernel), especially when recoverable errors occurred
    shortly before a panic, such as the bug fixed by
    commit ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them
    when destroying the pool")

    This is not intended to replace full hardware diagnostics but provides
    a fast way to correlate hardware events with kernel panics.

    Suggested-by: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Breno Leitao <leitao@debian.org>

diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 4da4eab56c81d..f85759453f89a 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -45,6 +45,7 @@
 #include <linux/task_work.h>
 #include <linux/hardirq.h>
 #include <linux/kexec.h>
+#include <linux/vmcore_info.h>
 
 #include <asm/fred.h>
 #include <asm/cpu_device_id.h>
@@ -1690,6 +1691,9 @@ noinstr void do_machine_check(struct pt_regs *regs)
 	}
 
 out:
+	/* Given it didn't panic, mark it as recoverable */
+	hwerr_log_error_type(HWERR_RECOV_MCE);
+
 	instrumentation_end();
 
 clear:
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index a0d54993edb3b..9c549c4a1a708 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -43,6 +43,7 @@
 #include <linux/uuid.h>
 #include <linux/ras.h>
 #include <linux/task_work.h>
+#include <linux/vmcore_info.h>
 
 #include <acpi/actbl1.h>
 #include <acpi/ghes.h>
@@ -867,6 +868,40 @@ int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cper_kfifo_get, "CXL");
 
+static void ghes_log_hwerr(int sev, guid_t *sec_type)
+{
+	if (sev != CPER_SEV_RECOVERABLE)
+		return;
+
+	if (guid_equal(sec_type, &CPER_SEC_PROC_ARM) ||
+	    guid_equal(sec_type, &CPER_SEC_PROC_GENERIC) ||
+	    guid_equal(sec_type, &CPER_SEC_PROC_IA)) {
+		hwerr_log_error_type(HWERR_RECOV_CPU);
+		return;
+	}
+
+	if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR) ||
+	    guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID) ||
+	    guid_equal(sec_type, &CPER_SEC_CXL_DRAM_GUID) ||
+	    guid_equal(sec_type, &CPER_SEC_CXL_MEM_MODULE_GUID)) {
+		hwerr_log_error_type(HWERR_RECOV_CXL);
+		return;
+	}
+
+	if (guid_equal(sec_type, &CPER_SEC_PCIE) ||
+	    guid_equal(sec_type, &CPER_SEC_PCI_X_BUS)) {
+		hwerr_log_error_type(HWERR_RECOV_PCI);
+		return;
+	}
+
+	if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
+		hwerr_log_error_type(HWERR_RECOV_MEMORY);
+		return;
+	}
+
+	hwerr_log_error_type(HWERR_RECOV_OTHERS);
+}
+
 static void ghes_do_proc(struct ghes *ghes,
 			 const struct acpi_hest_generic_status *estatus)
 {
@@ -888,6 +923,7 @@ static void ghes_do_proc(struct ghes *ghes,
 		if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
 			fru_text = gdata->fru_text;
 
+		ghes_log_hwerr(sev, sec_type);
 		if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
 			struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
 
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e286c197d7167..d814c06cdbee6 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -30,6 +30,7 @@
 #include <linux/kfifo.h>
 #include <linux/ratelimit.h>
 #include <linux/slab.h>
+#include <linux/vmcore_info.h>
 #include <acpi/apei.h>
 #include <acpi/ghes.h>
 #include <ras/ras_event.h>
@@ -751,6 +752,7 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
 		break;
 	case AER_NONFATAL:
 		aer_info->dev_total_nonfatal_errs++;
+		hwerr_log_error_type(HWERR_RECOV_PCI);
 		counter = &aer_info->dev_nonfatal_errs[0];
 		max = AER_MAX_TYPEOF_UNCOR_ERRS;
 		break;
diff --git a/include/linux/vmcore_info.h b/include/linux/vmcore_info.h
index 37e003ae52626..538a3635fb1e5 100644
--- a/include/linux/vmcore_info.h
+++ b/include/linux/vmcore_info.h
@@ -77,4 +77,21 @@ extern u32 *vmcoreinfo_note;
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
 			  void *data, size_t data_len);
 void final_note(Elf_Word *buf);
+
+enum hwerr_error_type {
+	HWERR_RECOV_MCE,
+	HWERR_RECOV_CPU,
+	HWERR_RECOV_MEMORY,
+	HWERR_RECOV_PCI,
+	HWERR_RECOV_CXL,
+	HWERR_RECOV_OTHERS,
+	HWERR_RECOV_MAX,
+};
+
+#ifdef CONFIG_VMCORE_INFO
+noinstr void hwerr_log_error_type(enum hwerr_error_type src);
+#else
+static inline void hwerr_log_error_type(enum hwerr_error_type src) {}
+#endif
+
 #endif /* LINUX_VMCORE_INFO_H */
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index e066d31d08f89..4b5ab45d468f5 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -31,6 +31,13 @@ u32 *vmcoreinfo_note;
 /* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
 static unsigned char *vmcoreinfo_data_safecopy;
 
+struct hwerr_info {
+	int __data_racy count;
+	time64_t __data_racy timestamp;
+};
+
+static struct hwerr_info hwerr_data[HWERR_RECOV_MAX];
+
 Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
 			  void *data, size_t data_len)
 {
@@ -118,6 +125,17 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
 }
 EXPORT_SYMBOL(paddr_vmcoreinfo_note);
 
+void hwerr_log_error_type(enum hwerr_error_type src)
+{
+	if (src < 0 || src >= HWERR_RECOV_MAX)
+		return;
+
+	/* No need for atomics/locks given that precision is not important */
+	hwerr_data[src].count++;
+	hwerr_data[src].timestamp = ktime_get_real_seconds();
+}
+EXPORT_SYMBOL_GPL(hwerr_log_error_type);
+
 static int __init crash_save_vmcoreinfo_init(void)
 {
 	vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL);

^ permalink raw reply related [flat|nested] 17+ messages in thread
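The severity policy this revision settles on (count AER_NONFATAL and CPER_SEV_RECOVERABLE, skip corrected and correctable events) can be written down as a small stand-alone model. This is only a sketch for reasoning about the policy. All MODEL_* names and both functions are hypothetical and do not exist in the kernel, the enum values are arbitrary rather than the kernel's, and the fatal-to-fatal mapping is an assumption added for completeness; the thread only discusses the first two rows.

```c
/* Stand-alone model of the tracking policy; all names here are hypothetical. */
#include <assert.h>
#include <stdbool.h>

enum model_aer_sev {
	MODEL_AER_CORRECTABLE,
	MODEL_AER_NONFATAL,
	MODEL_AER_FATAL,
};

enum model_cper_sev {
	MODEL_CPER_CORRECTED,
	MODEL_CPER_RECOVERABLE,
	MODEL_CPER_FATAL,
};

/* Mapping proposed in the thread, plus fatal -> fatal (an assumption). */
enum model_cper_sev model_aer_to_cper(enum model_aer_sev sev)
{
	switch (sev) {
	case MODEL_AER_CORRECTABLE:
		return MODEL_CPER_CORRECTED;
	case MODEL_AER_NONFATAL:
		return MODEL_CPER_RECOVERABLE;
	case MODEL_AER_FATAL:
		return MODEL_CPER_FATAL;
	}
	assert(!"unknown AER severity");
	return MODEL_CPER_FATAL;
}

/* Policy of the revised patch: only recoverable events are counted. */
bool model_should_track(enum model_cper_sev sev)
{
	return sev == MODEL_CPER_RECOVERABLE;
}
```

Under this model a non-fatal AER event is tracked, while correctable and fatal ones are not. That matches the revised ghes_log_hwerr() check and the move of the AER hook from the AER_CORRECTABLE case to the AER_NONFATAL case.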
* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors 2025-07-30 13:11 ` Breno Leitao @ 2025-07-30 13:50 ` Shuai Xue 2025-07-30 17:16 ` Breno Leitao 2025-07-30 16:21 ` Mauro Carvalho Chehab 1 sibling, 1 reply; 17+ messages in thread From: Shuai Xue @ 2025-07-30 13:50 UTC (permalink / raw) To: Breno Leitao Cc: Tony Luck, Borislav Petkov, Rafael J. Wysocki, Len Brown, James Morse, Robert Moore, Thomas Gleixner, Ingo Molnar, Dave Hansen, x86, H. Peter Anvin, Hanjun Guo, Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas, linux-acpi, linux-kernel, acpica-devel, osandov, konrad.wilk, linux-edac, linuxppc-dev, linux-pci, kernel-team 在 2025/7/30 21:11, Breno Leitao 写道: > Hello Shuai, > > On Wed, Jul 30, 2025 at 10:13:13AM +0800, Shuai Xue wrote: >> In ghes_log_hwerr(), you're counting both CPER_SEV_CORRECTED and >> CPER_SEV_RECOVERABLE errors: > > Thanks. I was reading this code a bit more, and I want to make sure my > understanding is correct, giving I was confused about CORRECTED and > RECOVERABLE errors. > > CPER_SEV_CORRECTED means it is corrected in the background, and the OS > was not even notified about it. That includes 1-bit ECC error. Not quite correct. From ACPI spec: > A corrected error is a hardware error condition that has been > corrected by the hardware or by the firmware by the time the OSPM is > notified about the existence of the error condition. For example, 1-bit ECC errors can be reported via CMCI interrupt when the threshold of correctable errors exceeds the desired limit. The Linux GHES driver then initiates kernel actions like soft-offlining pages. > THose are not the errors we are interested in, since they are irrelavant > to the OS. > > If that is true, then I might not want count CPER_SEV_CORRECTED errors > at all, but only CPER_SEV_RECOVERABLE. Yes, that's the right approach. Hardware corrects CE errors and software can continue running without intervention. 
Since HWERR_RECOV_MCE only records uncorrected errors, focusing on CPER_SEV_RECOVERABLE is more appropriate for crash correlation analysis. > >> However, in the AER section, you're only handling AER_CORRECTABLE cases. >> IMHO, Non-fatal errors are recoverable and correspond to >> CPER_SEV_RECOVERABLE in the ACPI context. >> >> The mapping should probably be: >> >> - AER_CORRECTABLE → CPER_SEV_CORRECTED >> - AER_NONFATAL → CPER_SEV_RECOVERABLE > > Thanks. This means I want to count AER_NONFATAL but not AER_CORRECTABLE. > Is this right? Exactly. IMHO, the updated mapping looks correct: - GHES: Only CPER_SEV_RECOVERABLE - AER: Only AER_NONFATAL (which maps to recoverable errors) - MCE: Uncorrected errors that didn't cause panic > > Summarizing, This is the a new version of the change, according to my > new understanding: > > commit deca1c4b99dcfa64b29fe035f8422b4601212413 > Author: Breno Leitao <leitao@debian.org> > Date: Thu Jul 17 07:39:26 2025 -0700 > > vmcoreinfo: Track and log recoverable hardware errors > > Introduce a generic infrastructure for tracking recoverable hardware > errors (HW errors that are visible to the OS but does not cause a panic) > and record them for vmcore consumption. This aids post-mortem crash > analysis tools by preserving a count and timestamp for the last > occurrence of such errors. On the other side, correctable errors, which > the OS typically remains unaware of because the underlying hardware > handles them transparently, are less relevant and therefore are NOT > tracked in this infrastructure. > > Add centralized logging for sources of recoverable hardware > errors based on the subsystem it has been notified. > > hwerror_data is write-only at kernel runtime, and it is meant to be read > from vmcore using tools like crash/drgn. For example, this is how it > looks like when opening the crashdump from drgn. 
>
>     >>> prog['hwerror_data']
>     (struct hwerror_info[6]){
>         {
>             .count = (int)844,
>             .timestamp = (time64_t)1752852018,
>         },
>         ...
>
> This helps fleet operators quickly triage whether a crash may be
> influenced by hardware recoverable errors (which execute an uncommon
> code path in the kernel), especially when recoverable errors occurred
> shortly before a panic, such as the bug fixed by
> commit ee62ce7a1d90 ("page_pool: Track DMA-mapped pages and unmap them
> when destroying the pool")
>
> This is not intended to replace full hardware diagnostics but provides
> a fast way to correlate hardware events with kernel panics.
>
> Suggested-by: Tony Luck <tony.luck@intel.com>
> Signed-off-by: Breno Leitao <leitao@debian.org>
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 4da4eab56c81d..f85759453f89a 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -45,6 +45,7 @@
>  #include <linux/task_work.h>
>  #include <linux/hardirq.h>
>  #include <linux/kexec.h>
> +#include <linux/vmcore_info.h>
>
>  #include <asm/fred.h>
>  #include <asm/cpu_device_id.h>
> @@ -1690,6 +1691,9 @@ noinstr void do_machine_check(struct pt_regs *regs)
>  	}
>
>  out:
> +	/* Given it didn't panic, mark it as recoverable */
> +	hwerr_log_error_type(HWERR_RECOV_MCE);
> +

Indentation: needs tab alignment.

The current placement only logs errors that reach the out: label. Errors
that go to the `clear` label won't be recorded. Would it be better to log at
the beginning of do_machine_check() to capture all recoverable MCEs?

>  	instrumentation_end();
>
>  clear:
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index a0d54993edb3b..9c549c4a1a708 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -43,6 +43,7 @@
>  #include <linux/uuid.h>
>  #include <linux/ras.h>
>  #include <linux/task_work.h>
> +#include <linux/vmcore_info.h>
>
>  #include <acpi/actbl1.h>
>  #include <acpi/ghes.h>
> @@ -867,6 +868,40 @@ int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_cper_kfifo_get, "CXL");
>
> +static void ghes_log_hwerr(int sev, guid_t *sec_type)
> +{
> +	if (sev != CPER_SEV_RECOVERABLE)
> +		return;
> +
> +	if (guid_equal(sec_type, &CPER_SEC_PROC_ARM) ||
> +	    guid_equal(sec_type, &CPER_SEC_PROC_GENERIC) ||
> +	    guid_equal(sec_type, &CPER_SEC_PROC_IA)) {
> +		hwerr_log_error_type(HWERR_RECOV_CPU);
> +		return;
> +	}
> +
> +	if (guid_equal(sec_type, &CPER_SEC_CXL_PROT_ERR) ||
> +	    guid_equal(sec_type, &CPER_SEC_CXL_GEN_MEDIA_GUID) ||
> +	    guid_equal(sec_type, &CPER_SEC_CXL_DRAM_GUID) ||
> +	    guid_equal(sec_type, &CPER_SEC_CXL_MEM_MODULE_GUID)) {
> +		hwerr_log_error_type(HWERR_RECOV_CXL);
> +		return;
> +	}
> +
> +	if (guid_equal(sec_type, &CPER_SEC_PCIE) ||
> +	    guid_equal(sec_type, &CPER_SEC_PCI_X_BUS)) {
> +		hwerr_log_error_type(HWERR_RECOV_PCI);
> +		return;
> +	}
> +
> +	if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
> +		hwerr_log_error_type(HWERR_RECOV_MEMORY);
> +		return;
> +	}
> +
> +	hwerr_log_error_type(HWERR_RECOV_OTHERS);
> +}
> +
>  static void ghes_do_proc(struct ghes *ghes,
>  			 const struct acpi_hest_generic_status *estatus)
>  {
> @@ -888,6 +923,7 @@ static void ghes_do_proc(struct ghes *ghes,
>  		if (gdata->validation_bits & CPER_SEC_VALID_FRU_TEXT)
>  			fru_text = gdata->fru_text;
>
> +		ghes_log_hwerr(sev, sec_type);
>  		if (guid_equal(sec_type, &CPER_SEC_PLATFORM_MEM)) {
>  			struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index e286c197d7167..d814c06cdbee6 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -30,6 +30,7 @@
>  #include <linux/kfifo.h>
>  #include <linux/ratelimit.h>
>  #include <linux/slab.h>
> +#include <linux/vmcore_info.h>
>  #include <acpi/apei.h>
>  #include <acpi/ghes.h>
>  #include <ras/ras_event.h>
> @@ -751,6 +752,7 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
>  		break;
>  	case AER_NONFATAL:
>  		aer_info->dev_total_nonfatal_errs++;
> +		hwerr_log_error_type(HWERR_RECOV_PCI);
>  		counter = &aer_info->dev_nonfatal_errs[0];
>  		max = AER_MAX_TYPEOF_UNCOR_ERRS;
>  		break;
> diff --git a/include/linux/vmcore_info.h b/include/linux/vmcore_info.h
> index 37e003ae52626..538a3635fb1e5 100644
> --- a/include/linux/vmcore_info.h
> +++ b/include/linux/vmcore_info.h
> @@ -77,4 +77,21 @@ extern u32 *vmcoreinfo_note;
>  Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
>  			  void *data, size_t data_len);
>  void final_note(Elf_Word *buf);
> +
> +enum hwerr_error_type {
> +	HWERR_RECOV_MCE,
> +	HWERR_RECOV_CPU,
> +	HWERR_RECOV_MEMORY,
> +	HWERR_RECOV_PCI,
> +	HWERR_RECOV_CXL,
> +	HWERR_RECOV_OTHERS,
> +	HWERR_RECOV_MAX,
> +};
> +
> +#ifdef CONFIG_VMCORE_INFO
> +noinstr void hwerr_log_error_type(enum hwerr_error_type src);
> +#else
> +static inline void hwerr_log_error_type(enum hwerr_error_type src) {};
> +#endif
> +
>  #endif /* LINUX_VMCORE_INFO_H */
> diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
> index e066d31d08f89..4b5ab45d468f5 100644
> --- a/kernel/vmcore_info.c
> +++ b/kernel/vmcore_info.c
> @@ -31,6 +31,13 @@ u32 *vmcoreinfo_note;
>  /* trusted vmcoreinfo, e.g. we can make a copy in the crash memory */
>  static unsigned char *vmcoreinfo_data_safecopy;
>
> +struct hwerr_info {
> +	int __data_racy count;
> +	time64_t __data_racy timestamp;
> +};
> +
> +static struct hwerr_info hwerr_data[HWERR_RECOV_MAX];
> +
>  Elf_Word *append_elf_note(Elf_Word *buf, char *name, unsigned int type,
>  			  void *data, size_t data_len)
>  {
> @@ -118,6 +125,17 @@ phys_addr_t __weak paddr_vmcoreinfo_note(void)
>  }
>  EXPORT_SYMBOL(paddr_vmcoreinfo_note);
>
> +void hwerr_log_error_type(enum hwerr_error_type src)
> +{
> +	if (src < 0 || src >= HWERR_RECOV_MAX)
> +		return;
> +
> +	/* No need for atomics/locks given the precision is not important */
> +	hwerr_data[src].count++;
> +	hwerr_data[src].timestamp = ktime_get_real_seconds();
> +}
> +EXPORT_SYMBOL_GPL(hwerr_log_error_type);
> +
>  static int __init crash_save_vmcoreinfo_init(void)
>  {
>  	vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL);

Looks good to me.

Reviewed-by: Shuai Xue <xueshuai@linux.alibaba.com>

It would be valuable to get additional review from other RAS experts.

Thanks.
Shuai
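The counter logic in the quoted kernel/vmcore_info.c hunk can be modelled in userspace as a minimal sketch. It assumes time() as a stand-in for ktime_get_real_seconds() and drops the kernel-only annotations (__data_racy, noinstr, EXPORT_SYMBOL_GPL); hwerr_count() is a hypothetical accessor added only for illustration and is not part of the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Same enumerators as the quoted patch; the value doubles as array index. */
enum hwerr_error_type {
	HWERR_RECOV_MCE,
	HWERR_RECOV_CPU,
	HWERR_RECOV_MEMORY,
	HWERR_RECOV_PCI,
	HWERR_RECOV_CXL,
	HWERR_RECOV_OTHERS,
	HWERR_RECOV_MAX,
};

/* One slot per error type: occurrence count and time of the last event. */
struct hwerr_info {
	int count;
	int64_t timestamp;
};

static struct hwerr_info hwerr_data[HWERR_RECOV_MAX];

/* Userspace model: no locking, mirroring the "precision is not important" note. */
static void hwerr_log_error_type(enum hwerr_error_type src)
{
	if ((int)src < 0 || src >= HWERR_RECOV_MAX)
		return;

	hwerr_data[src].count++;
	/* Assumption: time() replaces ktime_get_real_seconds(). */
	hwerr_data[src].timestamp = (int64_t)time(NULL);
}

/* Hypothetical helper, not in the patch: read back a counter. */
static int hwerr_count(enum hwerr_error_type src)
{
	assert((unsigned int)src < HWERR_RECOV_MAX);
	return hwerr_data[src].count;
}
```

Calling hwerr_log_error_type(HWERR_RECOV_PCI) twice leaves that slot with a count of 2 and a recent timestamp, while an out-of-range value is silently ignored. In the real kernel the array is never read at runtime; it is meant to be inspected from a vmcore.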
* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-30 13:50 ` Shuai Xue
@ 2025-07-30 17:16 ` Breno Leitao
  0 siblings, 0 replies; 17+ messages in thread
From: Breno Leitao @ 2025-07-30 17:16 UTC (permalink / raw)
  To: Shuai Xue
  Cc: Tony Luck, Borislav Petkov, Rafael J. Wysocki, Len Brown,
	James Morse, Robert Moore, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
	Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, linux-acpi, linux-kernel, acpica-devel, osandov,
	konrad.wilk, linux-edac, linuxppc-dev, linux-pci, kernel-team

Hello Shuai,

Thanks for the review,

On Wed, Jul 30, 2025 at 09:50:39PM +0800, Shuai Xue wrote:
> On 2025/7/30 21:11, Breno Leitao wrote:
> >
> > @@ -1690,6 +1691,9 @@ noinstr void do_machine_check(struct pt_regs *regs)
> >  	}
> >
> >  out:
> > +	/* Given it didn't panic, mark it as recoverable */
> > +	hwerr_log_error_type(HWERR_RECOV_MCE);
> > +
>
> Indentation: needs tab alignment.

I am not sure I understand the alignment problem. The code seems to be
properly aligned, and it uses tabs. Could you please clarify what the
current problem is?

> The current placement only logs errors that reach the out: label. Errors
> that go to the `clear` label won't be recorded. Would it be better to log at
> the beginning of do_machine_check() to capture all recoverable MCEs?

This is a good point, and I've thought about it. I understand we don't
want to track the code flow that goes to the clear: label, since it is
wrongly triggered by some CPUs, and it is not a real MCE.
That is described in commit 8ca97812c3c830 ("x86/mce: Work around an
erratum on fast string copy instructions").

At the same time, the MCEs handled in the block below are not being
tracked properly, since they return early from do_machine_check().
Here is a quick outline:

	void do_machine_check(struct pt_regs *regs)
		...
		if (unlikely(mce_flags.p5))
			return pentium_machine_check(regs);
		else if (unlikely(mce_flags.winchip))
			return winchip_machine_check(regs);
		else if (unlikely(!mca_cfg.initialized))
			return unexpected_machine_check(regs);

		if (mce_flags.skx_repmov_quirk && quirk_skylake_repmov())
			goto clear;

		/* Code doesn't exit anymore unless through out: */
	}

Given that instrumentation is not enabled when those returns are
reached, we cannot easily call hwerr_log_error_type() before them.

One option is to simply ignore those cases, given they are unlikely.

Another option is to call hwerr_log_error_type() inside the functions
above, so that we do not miss these counters when do_machine_check()
returns early.

--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -1481,6 +1481,7 @@ static void queue_task_work(struct mce_hw_err *err, char *msg, void (*func)(stru
 static noinstr void unexpected_machine_check(struct pt_regs *regs)
 {
 	instrumentation_begin();
+	hwerr_log_error_type(HWERR_RECOV_MCE);
 	pr_err("CPU#%d: Unexpected int18 (Machine Check)\n",
 	       smp_processor_id());
 	instrumentation_end();
diff --git a/arch/x86/kernel/cpu/mce/p5.c b/arch/x86/kernel/cpu/mce/p5.c
index 2272ad53fc339..a627ed10b752d 100644
--- a/arch/x86/kernel/cpu/mce/p5.c
+++ b/arch/x86/kernel/cpu/mce/p5.c
@@ -26,6 +26,7 @@ noinstr void pentium_machine_check(struct pt_regs *regs)
 	u32 loaddr, hi, lotype;

 	instrumentation_begin();
+	hwerr_log_error_type(HWERR_RECOV_MCE);
 	rdmsr(MSR_IA32_P5_MC_ADDR, loaddr, hi);
 	rdmsr(MSR_IA32_P5_MC_TYPE, lotype, hi);

diff --git a/arch/x86/kernel/cpu/mce/winchip.c b/arch/x86/kernel/cpu/mce/winchip.c
index 6c99f29419090..b7862bf5ba870 100644
--- a/arch/x86/kernel/cpu/mce/winchip.c
+++ b/arch/x86/kernel/cpu/mce/winchip.c
@@ -20,6 +20,7 @@
 noinstr void winchip_machine_check(struct pt_regs *regs)
 {
 	instrumentation_begin();
+	hwerr_log_error_type(HWERR_RECOV_MCE);
 	pr_emerg("CPU0: Machine Check Exception.\n");
 	add_taint(TAINT_MACHINE_CHECK, LOCKDEP_NOW_UNRELIABLE);
 	instrumentation_end();
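The exit paths discussed in the message above can be sketched as a small userspace model. Everything here is a hypothetical stand-in: struct mce_model replaces mce_flags and mca_cfg, mce_logged replaces the HWERR_RECOV_MCE counter, and the log_in_legacy_handlers parameter selects between the two options (ignore the early returns, or log inside the legacy handlers). It only illustrates which paths would be counted, not the real handler.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flags standing in for mce_flags / mca_cfg state. */
struct mce_model {
	bool p5;
	bool winchip;
	bool initialized;
	bool repmov_quirk_hit;
};

enum mce_exit { EXIT_LEGACY, EXIT_CLEAR, EXIT_OUT };

/* Stand-in for the HWERR_RECOV_MCE counter. */
static int mce_logged;

static enum mce_exit do_machine_check_model(const struct mce_model *m,
					    bool log_in_legacy_handlers)
{
	assert(m);

	/* Early returns: p5, winchip and "unexpected" handlers. */
	if (m->p5 || m->winchip || !m->initialized) {
		if (log_in_legacy_handlers)
			mce_logged++;
		return EXIT_LEGACY;
	}

	/* Erratum workaround path: not a real MCE, deliberately not counted. */
	if (m->repmov_quirk_hit)
		goto clear;

	/* ... normal handling elided; it always ends at the out: label ... */

	/* out: the only place the patch as posted logs the event. */
	mce_logged++;
	return EXIT_OUT;

clear:
	return EXIT_CLEAR;
}
```

With log_in_legacy_handlers set to false the model counts only events that reach out:, which is the behaviour of the posted patch; setting it to true corresponds to the follow-up diff that adds the call to the three legacy handlers.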
* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-30 13:11 ` Breno Leitao
  2025-07-30 13:50 ` Shuai Xue
@ 2025-07-30 16:21 ` Mauro Carvalho Chehab
  2025-07-30 17:22 ` Breno Leitao
  1 sibling, 1 reply; 17+ messages in thread
From: Mauro Carvalho Chehab @ 2025-07-30 16:21 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Shuai Xue, Tony Luck, Borislav Petkov, Rafael J. Wysocki,
	Len Brown, James Morse, Robert Moore, Thomas Gleixner,
	Ingo Molnar, Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
	Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, linux-acpi, linux-kernel, acpica-devel, osandov,
	konrad.wilk, linux-edac, linuxppc-dev, linux-pci, kernel-team

On Wed, 30 Jul 2025 06:11:52 -0700
Breno Leitao <leitao@debian.org> wrote:

> Hello Shuai,
>
> On Wed, Jul 30, 2025 at 10:13:13AM +0800, Shuai Xue wrote:
> > In ghes_log_hwerr(), you're counting both CPER_SEV_CORRECTED and
> > CPER_SEV_RECOVERABLE errors:
>
> Thanks. I was reading this code a bit more, and I want to make sure my
> understanding is correct, given I was confused about CORRECTED and
> RECOVERABLE errors.
>
> CPER_SEV_CORRECTED means it is corrected in the background, and the OS
> was not even notified about it. That includes 1-bit ECC errors.
> Those are not the errors we are interested in, since they are irrelevant
> to the OS.

Hardware-corrected errors aren't irrelevant. The rasdaemon utils capture
such errors, as they may be a symptom of a hardware defect. As a matter
of fact, in rasdaemon, thresholds can be set to trigger an action, such
as disabling memory blocks that contain defective memory.

This is especially relevant for HPC and supercomputer workloads, where
it is a lot cheaper to disable a block of bad memory than to lose
an entire job, which could take several weeks of run time on
a supercomputer, just because defective memory ended up causing a
failure in the application.

Regards,
Mauro
* Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
  2025-07-30 16:21 ` Mauro Carvalho Chehab
@ 2025-07-30 17:22 ` Breno Leitao
  0 siblings, 0 replies; 17+ messages in thread
From: Breno Leitao @ 2025-07-30 17:22 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Shuai Xue, Tony Luck, Borislav Petkov, Rafael J. Wysocki,
	Len Brown, James Morse, Robert Moore, Thomas Gleixner,
	Ingo Molnar, Dave Hansen, x86, H. Peter Anvin, Hanjun Guo,
	Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, linux-acpi, linux-kernel, acpica-devel, osandov,
	konrad.wilk, linux-edac, linuxppc-dev, linux-pci, kernel-team

Hello Mauro,

On Wed, Jul 30, 2025 at 06:21:37PM +0200, Mauro Carvalho Chehab wrote:
> On Wed, 30 Jul 2025 06:11:52 -0700
> Breno Leitao <leitao@debian.org> wrote:
> > On Wed, Jul 30, 2025 at 10:13:13AM +0800, Shuai Xue wrote:
> > > In ghes_log_hwerr(), you're counting both CPER_SEV_CORRECTED and
> > > CPER_SEV_RECOVERABLE errors:
> >
> > Thanks. I was reading this code a bit more, and I want to make sure my
> > understanding is correct, given I was confused about CORRECTED and
> > RECOVERABLE errors.
> >
> > CPER_SEV_CORRECTED means it is corrected in the background, and the OS
> > was not even notified about it. That includes 1-bit ECC errors.
> > Those are not the errors we are interested in, since they are irrelevant
> > to the OS.
>
> Hardware-corrected errors aren't irrelevant. The rasdaemon utils capture
> such errors, as they may be a symptom of a hardware defect. As a matter
> of fact, in rasdaemon, thresholds can be set to trigger an action, such
> as disabling memory blocks that contain defective memory.

Sorry, I meant that hardware-corrected errors aren't relevant in the
context of this patch, where we are interested in errors over which the
OS has some influence and decision.

> This is especially relevant for HPC and supercomputer workloads, where
> it is a lot cheaper to disable a block of bad memory than to lose
> an entire job, which could take several weeks of run time on
> a supercomputer, just because defective memory ended up causing a
> failure in the application.

Agree. These errors are used in several ways, including to detect
hardware aging and to plan hardware replacement at maintenance windows.

In this patchset, I am more focused on what information to add to the
crashdump, so that it is easy to correlate crashes with hardware
events, and RECOVERABLE errors are the main ones.
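The distinction drawn in this exchange, counting only recoverable events and leaving corrected ones to tools such as rasdaemon, can be sketched as a severity filter. This is a userspace illustration only: the enum names are local stand-ins for the kernel's CPER_SEV_* values, and recoverable_seen is a hypothetical counter, not a kernel symbol.

```c
#include <assert.h>

/* Local stand-ins for the CPER severities discussed in the thread. */
enum sev_model {
	SEV_RECOVERABLE,
	SEV_FATAL,
	SEV_CORRECTED,
	SEV_INFORMATIONAL,
	SEV_MAX,
};

/* Hypothetical counter for events that would end up in the crashdump data. */
static int recoverable_seen;

/* Mirrors the early return at the top of ghes_log_hwerr() in the quoted patch. */
static void ghes_log_hwerr_model(enum sev_model sev)
{
	assert((unsigned int)sev < SEV_MAX);

	/* Corrected, informational and fatal events are not counted here. */
	if (sev != SEV_RECOVERABLE)
		return;

	recoverable_seen++;
}
```

Under this filter a stream of corrected ECC events leaves the counter untouched, so the value read from a vmcore reflects only the events where the OS had to take a recovery path.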
end of thread, other threads:[~2025-07-30 17:23 UTC | newest]

Thread overview: 17+ messages
2025-07-22 16:56 [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors Breno Leitao
2025-07-23 14:28 ` kernel test robot
2025-07-23 15:36 ` Breno Leitao
2025-07-23 19:00 ` Borislav Petkov
2025-07-23 23:21 ` Huang, Kai
2025-07-24 8:00 ` Shuai Xue
2025-07-24 13:34 ` Breno Leitao
2025-07-25 7:40 ` Shuai Xue
2025-07-25 16:16 ` Breno Leitao
2025-07-28 1:08 ` Shuai Xue
2025-07-29 13:48 ` Breno Leitao
2025-07-30 2:13 ` Shuai Xue
2025-07-30 13:11 ` Breno Leitao
2025-07-30 13:50 ` Shuai Xue
2025-07-30 17:16 ` Breno Leitao
2025-07-30 16:21 ` Mauro Carvalho Chehab
2025-07-30 17:22 ` Breno Leitao