* [PATCH v2 0/2] vmcoreinfo: Expose hardware error recovery statistics via sysfs
@ 2026-02-02 14:27 Breno Leitao
2026-02-02 14:27 ` [PATCH v2 1/2] vmcoreinfo: expose " Breno Leitao
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Breno Leitao @ 2026-02-02 14:27 UTC (permalink / raw)
To: akpm, bhe
Cc: linux-kernel, kexec, linux-arm-kernel, linux-acpi, dyoung,
tony.luck, xueshuai, vgoyal, zhiquan1.li, olja, Breno Leitao,
kernel-team
The kernel already tracks recoverable hardware errors (CPU, memory, PCI,
CXL, etc.) in the hwerr_data array for vmcoreinfo crash dump analysis.
However, this data is only accessible after a crash.

This series adds a sysfs directory at /sys/kernel/hwerr_recovery_stats/ to
expose these statistics at runtime, allowing monitoring tools to track
hardware health without requiring a kernel crash.

The directory contains one file per error subsystem:

  /sys/kernel/hwerr_recovery_stats/{cpu, memory, pci, cxl, others}

Each file contains a single integer representing the error count.

This is useful for:
- Proactive detection of failing hardware components
- Time-series tracking of recoverable errors
- System health monitoring in cloud environments
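
As an illustration of how a monitoring tool might consume these files, here
is a minimal userspace sketch in C. It assumes only what the series states:
each file holds a single integer followed by a newline. The helper names
hwerr_parse_count() and hwerr_read_count() and the HWERR_STATS_DIR macro are
hypothetical and are not part of the patch.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Directory proposed by this series; present only on kernels carrying it. */
#define HWERR_STATS_DIR "/sys/kernel/hwerr_recovery_stats"

/*
 * Parse the text of one counter file ("<integer>\n") into *out.
 * Returns 0 on success, -EINVAL on malformed input, -ERANGE on overflow.
 */
int hwerr_parse_count(const char *text, int *out)
{
	char *end;
	long val;

	errno = 0;
	val = strtol(text, &end, 10);
	if (end == text)
		return -EINVAL;		/* no digits at all */
	if (errno == ERANGE || val < INT_MIN || val > INT_MAX)
		return -ERANGE;		/* does not fit in an int */
	if (*end != '\n' && *end != '\0')
		return -EINVAL;		/* trailing garbage */
	*out = (int)val;
	return 0;
}

/*
 * Read one counter by file name ("cpu", "memory", "pci", "cxl", "others")
 * from the directory dir. Returns 0 on success or a negative errno value.
 */
int hwerr_read_count(const char *dir, const char *name, int *out)
{
	char path[256];
	char buf[32];
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path), "%s/%s", dir, name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -ENAMETOOLONG;

	f = fopen(path, "r");
	if (!f)
		return -errno;		/* e.g. -ENOENT on kernels without the series */
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -EIO;
	}
	fclose(f);
	return hwerr_parse_count(buf, out);
}
```

A poller could call hwerr_read_count(HWERR_STATS_DIR, "memory", &n) once per
interval and export n as a gauge. Passing the directory as a parameter keeps
the helper testable against a temporary directory.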
To: akpm@linux-foundation.org
Cc: kexec@lists.infradead.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-acpi@vger.kernel.org
To: bhe@redhat.com
Cc: linux-kernel@vger.kernel.org
Cc: dyoung@redhat.com
Cc: tony.luck@intel.com
Cc: xueshuai@linux.alibaba.com
Cc: vgoyal@redhat.com
Cc: zhiquan1.li@intel.com
Cc: olja@meta.com
Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v2:
- Renamed vmcore_stats to hwerr_stats
- Split the statistics into one sysfs file per subsystem
- Link to v1: https://patch.msgid.link/20260129-vmcoreinfo_sysfs-v1-1-164c1fe1fe07@debian.org
---
Breno Leitao (2):
vmcoreinfo: expose hardware error recovery statistics via sysfs
docs: add ABI documentation for /sys/kernel/hwerr_recovery_stats/
.../ABI/testing/sysfs-kernel-hwerr_recovery_stats | 47 ++++++++++++++++++
Documentation/driver-api/hw-recoverable-errors.rst | 3 +-
kernel/vmcore_info.c | 55 ++++++++++++++++++++++
3 files changed, 104 insertions(+), 1 deletion(-)
---
base-commit: 4d310797262f0ddf129e76c2aad2b950adaf1fda
change-id: 20260129-vmcoreinfo_sysfs-ff4687979cd5
Best regards,
--
Breno Leitao <leitao@debian.org>
* [PATCH v2 1/2] vmcoreinfo: expose hardware error recovery statistics via sysfs
2026-02-02 14:27 [PATCH v2 0/2] vmcoreinfo: Expose hardware error recovery statistics via sysfs Breno Leitao
@ 2026-02-02 14:27 ` Breno Leitao
2026-02-11 2:01 ` Baoquan He
2026-02-02 14:27 ` [PATCH v2 2/2] docs: add ABI documentation for /sys/kernel/hwerr_recovery_stats/ Breno Leitao
2026-02-10 9:11 ` [PATCH v2 0/2] vmcoreinfo: Expose hardware error recovery statistics via sysfs Breno Leitao
2 siblings, 1 reply; 6+ messages in thread
From: Breno Leitao @ 2026-02-02 14:27 UTC (permalink / raw)
To: akpm, bhe
Cc: linux-kernel, kexec, linux-arm-kernel, linux-acpi, dyoung,
tony.luck, xueshuai, vgoyal, zhiquan1.li, olja, Breno Leitao,
kernel-team
Add a sysfs directory at /sys/kernel/hwerr_recovery_stats/ to expose
hardware error recovery statistics that are already tracked by the
kernel. This allows userspace monitoring tools to track recovered
hardware errors without requiring kernel crashes.

This is useful to track recoverable hardware errors in a time series,
even if the host doesn't crash.

The sysfs directory contains one file per error subsystem:

  /sys/kernel/hwerr_recovery_stats/cpu    - CPU-related errors (MCE, ARM errors)
  /sys/kernel/hwerr_recovery_stats/memory - Memory-related errors
  /sys/kernel/hwerr_recovery_stats/pci    - PCI/PCIe AER non-fatal errors
  /sys/kernel/hwerr_recovery_stats/cxl    - CXL errors
  /sys/kernel/hwerr_recovery_stats/others - Other hardware errors

Each file contains a single integer representing the count of recovered
errors for that subsystem.

These statistics provide visibility into the health of the system's
hardware and can be used by system administrators to proactively detect
failing components before they cause system crashes.

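
For the time-series use case, a collector mostly cares about how many new
errors appeared between two polls. The sketch below is one possible way to
compute that in userspace C. It assumes the counters are non-negative and
restart from zero at boot, and it does not handle counter wraparound. The
enum, struct hwerr_sample, hwerr_delta() and hwerr_sample_delta() are
hypothetical names for illustration and are not part of the patch.

```c
/* Index of each counter, in the order the files are listed above. */
enum {
	HWERR_CPU,
	HWERR_MEMORY,
	HWERR_PCI,
	HWERR_CXL,
	HWERR_OTHERS,
	HWERR_NR
};

/* One poll of all five counter files. */
struct hwerr_sample {
	int count[HWERR_NR];
};

/*
 * New errors between two polls of one counter.
 * A value that went backwards is treated as a reboot, in which case
 * everything counted so far in the current boot is new.
 * Assumes both values are non-negative.
 */
int hwerr_delta(int prev, int cur)
{
	return cur >= prev ? cur - prev : cur;
}

/*
 * Fill delta[] with the per-subsystem increase from prev to cur and
 * return the total number of new errors across all subsystems.
 */
long hwerr_sample_delta(const struct hwerr_sample *prev,
			const struct hwerr_sample *cur,
			int delta[HWERR_NR])
{
	long total = 0;
	int i;

	for (i = 0; i < HWERR_NR; i++) {
		delta[i] = hwerr_delta(prev->count[i], cur->count[i]);
		total += delta[i];
	}
	return total;
}
```

Treating a decrease as a reboot is a simplification. A collector that also
records the boot ID or uptime could tell a reboot apart from other causes.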
Signed-off-by: Breno Leitao <leitao@debian.org>
---
kernel/vmcore_info.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 55 insertions(+)
diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index e2784038bbed7..b7fcd21be7c59 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -6,6 +6,8 @@
 
 #include <linux/buildid.h>
 #include <linux/init.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
 #include <linux/utsname.h>
 #include <linux/vmalloc.h>
 #include <linux/sizes.h>
@@ -139,6 +141,56 @@ void hwerr_log_error_type(enum hwerr_error_type src)
 }
 EXPORT_SYMBOL_GPL(hwerr_log_error_type);
 
+/* sysfs interface for hardware error recovery statistics */
+#define HWERR_ATTR_RO(_name, _type)					\
+static ssize_t _name##_show(struct kobject *kobj,			\
+			    struct kobj_attribute *attr, char *buf)	\
+{									\
+	return sysfs_emit(buf, "%d\n",					\
+			  atomic_read(&hwerr_data[_type].count));	\
+}									\
+static struct kobj_attribute hwerr_##_name##_attr = __ATTR_RO(_name)
+
+HWERR_ATTR_RO(cpu, HWERR_RECOV_CPU);
+HWERR_ATTR_RO(memory, HWERR_RECOV_MEMORY);
+HWERR_ATTR_RO(pci, HWERR_RECOV_PCI);
+HWERR_ATTR_RO(cxl, HWERR_RECOV_CXL);
+HWERR_ATTR_RO(others, HWERR_RECOV_OTHERS);
+
+static struct attribute *hwerr_recovery_stats_attrs[] = {
+	&hwerr_cpu_attr.attr,
+	&hwerr_memory_attr.attr,
+	&hwerr_pci_attr.attr,
+	&hwerr_cxl_attr.attr,
+	&hwerr_others_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group hwerr_recovery_stats_group = {
+	.attrs = hwerr_recovery_stats_attrs,
+};
+
+static struct kobject *hwerr_recovery_stats_kobj;
+
+static int __init hwerr_recovery_stats_init(void)
+{
+	hwerr_recovery_stats_kobj = kobject_create_and_add("hwerr_recovery_stats",
+							   kernel_kobj);
+	if (!hwerr_recovery_stats_kobj) {
+		pr_warn("Failed to create hwerr_recovery_stats kobject\n");
+		return -ENOMEM;
+	}
+
+	if (sysfs_create_group(hwerr_recovery_stats_kobj,
+			       &hwerr_recovery_stats_group)) {
+		kobject_put(hwerr_recovery_stats_kobj);
+		pr_warn("Failed to create hwerr_recovery_stats sysfs group\n");
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
 static int __init crash_save_vmcoreinfo_init(void)
 {
 	vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL);
@@ -248,6 +300,9 @@ static int __init crash_save_vmcoreinfo_init(void)
 	arch_crash_save_vmcoreinfo();
 	update_vmcoreinfo_note();
 
+	/* Create /sys/kernel/hwerr_recovery_stats/ directory */
+	hwerr_recovery_stats_init();
+
 	return 0;
 }
--
2.47.3
* [PATCH v2 2/2] docs: add ABI documentation for /sys/kernel/hwerr_recovery_stats/
2026-02-02 14:27 [PATCH v2 0/2] vmcoreinfo: Expose hardware error recovery statistics via sysfs Breno Leitao
2026-02-02 14:27 ` [PATCH v2 1/2] vmcoreinfo: expose " Breno Leitao
@ 2026-02-02 14:27 ` Breno Leitao
2026-02-10 9:11 ` [PATCH v2 0/2] vmcoreinfo: Expose hardware error recovery statistics via sysfs Breno Leitao
2 siblings, 0 replies; 6+ messages in thread
From: Breno Leitao @ 2026-02-02 14:27 UTC (permalink / raw)
To: akpm, bhe
Cc: linux-kernel, kexec, linux-arm-kernel, linux-acpi, dyoung,
tony.luck, xueshuai, vgoyal, zhiquan1.li, olja, Breno Leitao,
kernel-team
Document the new hwerr_recovery_stats sysfs directory that exposes
hardware error recovery statistics.
Update hw-recoverable-errors.rst to reference the new sysfs interface
for runtime monitoring.
Signed-off-by: Breno Leitao <leitao@debian.org>
---
.../ABI/testing/sysfs-kernel-hwerr_recovery_stats | 47 ++++++++++++++++++++++
Documentation/driver-api/hw-recoverable-errors.rst | 3 +-
2 files changed, 49 insertions(+), 1 deletion(-)
diff --git a/Documentation/ABI/testing/sysfs-kernel-hwerr_recovery_stats b/Documentation/ABI/testing/sysfs-kernel-hwerr_recovery_stats
new file mode 100644
index 0000000000000..4cb9f5a89fba9
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-hwerr_recovery_stats
@@ -0,0 +1,47 @@
+What: /sys/kernel/hwerr_recovery_stats/
+Date: February 2026
+KernelVersion: 6.20
+Contact: Breno Leitao <leitao@debian.org>
+Description:
+ Directory containing hardware error recovery statistics.
+ These statistics track recoverable hardware errors that the
+ kernel has handled since boot.
+
+ Each file contains a single integer representing the count
+ of recovered errors for that subsystem.
+
+What: /sys/kernel/hwerr_recovery_stats/cpu
+Date: February 2026
+KernelVersion: 6.20
+Contact: Breno Leitao <leitao@debian.org>
+Description:
+ Count of CPU-related recovered errors (MCE, ARM processor
+ errors).
+
+What: /sys/kernel/hwerr_recovery_stats/memory
+Date: February 2026
+KernelVersion: 6.20
+Contact: Breno Leitao <leitao@debian.org>
+Description:
+ Count of memory-related recovered errors.
+
+What: /sys/kernel/hwerr_recovery_stats/pci
+Date: February 2026
+KernelVersion: 6.20
+Contact: Breno Leitao <leitao@debian.org>
+Description:
+ Count of PCI/PCIe AER non-fatal recovered errors.
+
+What: /sys/kernel/hwerr_recovery_stats/cxl
+Date: February 2026
+KernelVersion: 6.20
+Contact: Breno Leitao <leitao@debian.org>
+Description:
+ Count of CXL (Compute Express Link) recovered errors.
+
+What: /sys/kernel/hwerr_recovery_stats/others
+Date: February 2026
+KernelVersion: 6.20
+Contact: Breno Leitao <leitao@debian.org>
+Description:
+ Count of other hardware recovered errors.
diff --git a/Documentation/driver-api/hw-recoverable-errors.rst b/Documentation/driver-api/hw-recoverable-errors.rst
index fc526c3454bd7..4aefcd103be22 100644
--- a/Documentation/driver-api/hw-recoverable-errors.rst
+++ b/Documentation/driver-api/hw-recoverable-errors.rst
@@ -36,7 +36,8 @@ Data Exposure and Consumption
types like CPU, memory, PCI, CXL, and others.
- It is exposed via vmcoreinfo crash dump notes and can be read using tools
like `crash`, `drgn`, or other kernel crash analysis utilities.
-- There is no other way to read these data other than from crash dumps.
+- It is also exposed via sysfs at ``/sys/kernel/hwerr_recovery_stats/`` for runtime
+ monitoring without requiring a crash dump.
- These errors are divided by area, which includes CPU, Memory, PCI, CXL and
others.
--
2.47.3
* Re: [PATCH v2 0/2] vmcoreinfo: Expose hardware error recovery statistics via sysfs
2026-02-02 14:27 [PATCH v2 0/2] vmcoreinfo: Expose hardware error recovery statistics via sysfs Breno Leitao
2026-02-02 14:27 ` [PATCH v2 1/2] vmcoreinfo: expose " Breno Leitao
2026-02-02 14:27 ` [PATCH v2 2/2] docs: add ABI documentation for /sys/kernel/hwerr_recovery_stats/ Breno Leitao
@ 2026-02-10 9:11 ` Breno Leitao
2026-02-10 18:46 ` Andrew Morton
2 siblings, 1 reply; 6+ messages in thread
From: Breno Leitao @ 2026-02-10 9:11 UTC (permalink / raw)
To: akpm, bhe
Cc: linux-kernel, kexec, linux-arm-kernel, linux-acpi, dyoung,
tony.luck, xueshuai, vgoyal, zhiquan1.li, olja, kernel-team
Hello Andrew,
On Mon, Feb 02, 2026 at 06:27:38AM -0800, Breno Leitao wrote:
> The kernel already tracks recoverable hardware errors (CPU, memory, PCI,
> CXL, etc.) in the hwerr_data array for vmcoreinfo crash dump analysis.
> However, this data is only accessible after a crash.
>
> This series adds a sysfs directory at /sys/kernel/hwerr_recovery_stats/ to
> expose these statistics at runtime, allowing monitoring tools to track
> hardware health without requiring a kernel crash.
>
> The directory contains one file per error subsystem:
> /sys/kernel/hwerr_recovery_stats/{cpu, memory, pci, cxl, others}
>
> Each file contains a single integer representing the error count.
>
> This is useful for:
> - Proactive detection of failing hardware components
> - Time-series tracking of recoverable errors
> - System health monitoring in cloud environments
Is there a chance this could be included in the 6.20 merge window?
Thanks,
--breno
* Re: [PATCH v2 0/2] vmcoreinfo: Expose hardware error recovery statistics via sysfs
2026-02-10 9:11 ` [PATCH v2 0/2] vmcoreinfo: Expose hardware error recovery statistics via sysfs Breno Leitao
@ 2026-02-10 18:46 ` Andrew Morton
0 siblings, 0 replies; 6+ messages in thread
From: Andrew Morton @ 2026-02-10 18:46 UTC (permalink / raw)
To: Breno Leitao
Cc: bhe, linux-kernel, kexec, linux-arm-kernel, linux-acpi, dyoung,
tony.luck, xueshuai, vgoyal, zhiquan1.li, olja, kernel-team
On Tue, 10 Feb 2026 01:11:41 -0800 Breno Leitao <leitao@debian.org> wrote:
> Hello Andrew,
>
> On Mon, Feb 02, 2026 at 06:27:38AM -0800, Breno Leitao wrote:
> > The kernel already tracks recoverable hardware errors (CPU, memory, PCI,
> > CXL, etc.) in the hwerr_data array for vmcoreinfo crash dump analysis.
> > However, this data is only accessible after a crash.
> >
> > This series adds a sysfs directory at /sys/kernel/hwerr_recovery_stats/ to
> > expose these statistics at runtime, allowing monitoring tools to track
> > hardware health without requiring a kernel crash.
> >
> > The directory contains one file per error subsystem:
> > /sys/kernel/hwerr_recovery_stats/{cpu, memory, pci, cxl, others}
> >
> > Each file contains a single integer representing the error count.
> >
> > This is useful for:
> > - Proactive detection of failing hardware components
> > - Time-series tracking of recoverable errors
> > - System health monitoring in cloud environments
>
> Is there a chance this could be included in the 6.20 merge window?
During the 7.0 merge window? Sure. I'll be taking a look at this (and
a whole lot more) after 7.0-rc1 is released.
* Re: [PATCH v2 1/2] vmcoreinfo: expose hardware error recovery statistics via sysfs
2026-02-02 14:27 ` [PATCH v2 1/2] vmcoreinfo: expose " Breno Leitao
@ 2026-02-11 2:01 ` Baoquan He
0 siblings, 0 replies; 6+ messages in thread
From: Baoquan He @ 2026-02-11 2:01 UTC (permalink / raw)
To: Breno Leitao
Cc: akpm, linux-kernel, kexec, linux-arm-kernel, linux-acpi, dyoung,
tony.luck, xueshuai, vgoyal, zhiquan1.li, olja, kernel-team
Hi Breno,
On 02/02/26 at 06:27am, Breno Leitao wrote:
> Add a sysfs directory at /sys/kernel/hwerr_recovery_stats/ to expose
> hardware error recovery statistics that are already tracked by the
> kernel. This allows userspace monitoring tools to track recovered
> hardware errors without requiring kernel crashes.
>
> This is useful to track recoverable hardware errors in a time series,
> even if the host doesn't crash.
>
> The sysfs directory contains one file per error subsystem:
>
> /sys/kernel/hwerr_recovery_stats/cpu - CPU-related errors (MCE, ARM errors)
> /sys/kernel/hwerr_recovery_stats/memory - Memory-related errors
> /sys/kernel/hwerr_recovery_stats/pci - PCI/PCIe AER non-fatal errors
> /sys/kernel/hwerr_recovery_stats/cxl - CXL errors
> /sys/kernel/hwerr_recovery_stats/others - Other hardware errors
>
> Each file contains a single integer representing the count of recovered
> errors for that subsystem.
>
> These statistics provide visibility into the health of the system's
> hardware and can be used by system administrators to proactively detect
> failing components before they cause system crashes.
>
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> kernel/vmcore_info.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 55 insertions(+)
>
> diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
> index e2784038bbed7..b7fcd21be7c59 100644
> --- a/kernel/vmcore_info.c
> +++ b/kernel/vmcore_info.c
Since we agreed hwerr_recovery_stats has nothing to do with vmcore, it
seems inappropriate to put its sysfs handling code in
kernel/vmcore_info.c. That file is only used to build vmcore info for
later vmcore dumping, so hwerr_log_error_type() should not live in
kernel/vmcore_info.c either. I didn't check this carefully before,
sorry. Please reconsider whether these can be handled better.
Thanks
Baoquan