public inbox for linux-arm-kernel@lists.infradead.org
* [PATCH v2 0/2] vmcoreinfo: Expose hardware error recovery statistics via sysfs
@ 2026-02-02 14:27 Breno Leitao
  2026-02-02 14:27 ` [PATCH v2 1/2] vmcoreinfo: expose " Breno Leitao
                   ` (2 more replies)
  0 siblings, 3 replies; 6+ messages in thread
From: Breno Leitao @ 2026-02-02 14:27 UTC
  To: akpm, bhe
  Cc: linux-kernel, kexec, linux-arm-kernel, linux-acpi, dyoung,
	tony.luck, xueshuai, vgoyal, zhiquan1.li, olja, Breno Leitao,
	kernel-team

The kernel already tracks recoverable hardware errors (CPU, memory, PCI,
CXL, etc.) in the hwerr_data array for vmcoreinfo crash dump analysis.
However, this data is only accessible after a crash.

This series adds a sysfs directory at /sys/kernel/hwerr_recovery_stats/ to
expose these statistics at runtime, allowing monitoring tools to track
hardware health without requiring a kernel crash.

The directory contains one file per error subsystem:
  /sys/kernel/hwerr_recovery_stats/{cpu, memory, pci, cxl, others}

Each file contains a single integer representing the error count.

This is useful for:
- Proactive detection of failing hardware components
- Time-series tracking of recoverable errors
- System health monitoring in cloud environments
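
For reference, a minimal userspace sketch of how a monitoring tool might
consume these files, assuming the one-integer-per-file format described
above. The helper names (hwerr_read_counter, hwerr_print_all) are
hypothetical and not part of this series; the directory is passed in as a
parameter so the same code can be pointed at a test directory.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Subsystem file names as listed in the cover letter. */
static const char *const hwerr_files[] = {
	"cpu", "memory", "pci", "cxl", "others",
};

/*
 * Read one counter file under dir (hypothetical helper).
 * Returns 0 and stores the value in *out, or -1 if the file cannot be
 * read or does not hold a single integer.
 */
static int hwerr_read_counter(const char *dir, const char *name, long *out)
{
	char path[256];
	char buf[32];
	char *end;
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path), "%s/%s", dir, name);
	if (n < 0 || (size_t)n >= sizeof(path))
		return -1;

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	errno = 0;
	*out = strtol(buf, &end, 10);
	if (errno || end == buf || (*end != '\n' && *end != '\0'))
		return -1;
	return 0;
}

/* Print every counter found under dir; returns how many were readable. */
int hwerr_print_all(const char *dir, FILE *stream)
{
	size_t i;
	int ok = 0;

	for (i = 0; i < sizeof(hwerr_files) / sizeof(hwerr_files[0]); i++) {
		long val;

		if (hwerr_read_counter(dir, hwerr_files[i], &val)) {
			fprintf(stream, "%s: unavailable\n", hwerr_files[i]);
			continue;
		}
		fprintf(stream, "%s: %ld\n", hwerr_files[i], val);
		ok++;
	}
	return ok;
}
```

Calling hwerr_print_all("/sys/kernel/hwerr_recovery_stats", stdout) would
print one "name: value" line per subsystem on a kernel carrying this
series. A missing file is reported as unavailable rather than treated as
zero, so a tool can tell "no errors" apart from "interface not present".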

To: akpm@linux-foundation.org
Cc: kexec@lists.infradead.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-acpi@vger.kernel.org
To: bhe@redhat.com
Cc: linux-kernel@vger.kernel.org
Cc: dyoung@redhat.com
Cc: tony.luck@intel.com
Cc: xueshuai@linux.alibaba.com
Cc: vgoyal@redhat.com
Cc: zhiquan1.li@intel.com
Cc: olja@meta.com

Signed-off-by: Breno Leitao <leitao@debian.org>
---
Changes in v2:
- Renamed vmcore_stats to hwerr_stats
- Split the statistics into one sysfs file per subsystem
- Link to v1: https://patch.msgid.link/20260129-vmcoreinfo_sysfs-v1-1-164c1fe1fe07@debian.org

---
Breno Leitao (2):
      vmcoreinfo: expose hardware error recovery statistics via sysfs
      docs: add ABI documentation for /sys/kernel/hwerr_recovery_stats/

 .../ABI/testing/sysfs-kernel-hwerr_recovery_stats  | 47 ++++++++++++++++++
 Documentation/driver-api/hw-recoverable-errors.rst |  3 +-
 kernel/vmcore_info.c                               | 55 ++++++++++++++++++++++
 3 files changed, 104 insertions(+), 1 deletion(-)
---
base-commit: 4d310797262f0ddf129e76c2aad2b950adaf1fda
change-id: 20260129-vmcoreinfo_sysfs-ff4687979cd5

Best regards,
--  
Breno Leitao <leitao@debian.org>




* [PATCH v2 1/2] vmcoreinfo: expose hardware error recovery statistics via sysfs
  2026-02-02 14:27 [PATCH v2 0/2] vmcoreinfo: Expose hardware error recovery statistics via sysfs Breno Leitao
@ 2026-02-02 14:27 ` Breno Leitao
  2026-02-11  2:01   ` Baoquan He
  2026-02-02 14:27 ` [PATCH v2 2/2] docs: add ABI documentation for /sys/kernel/hwerr_recovery_stats/ Breno Leitao
  2026-02-10  9:11 ` [PATCH v2 0/2] vmcoreinfo: Expose hardware error recovery statistics via sysfs Breno Leitao
  2 siblings, 1 reply; 6+ messages in thread
From: Breno Leitao @ 2026-02-02 14:27 UTC
  To: akpm, bhe
  Cc: linux-kernel, kexec, linux-arm-kernel, linux-acpi, dyoung,
	tony.luck, xueshuai, vgoyal, zhiquan1.li, olja, Breno Leitao,
	kernel-team

Add a sysfs directory at /sys/kernel/hwerr_recovery_stats/ to expose
hardware error recovery statistics that are already tracked by the
kernel. This allows userspace monitoring tools to track recovered
hardware errors without requiring kernel crashes.

This is useful to track recoverable hardware errors in a time series,
even if the host doesn't crash.

The sysfs directory contains one file per error subsystem:

  /sys/kernel/hwerr_recovery_stats/cpu     - CPU-related errors (MCE, ARM errors)
  /sys/kernel/hwerr_recovery_stats/memory  - Memory-related errors
  /sys/kernel/hwerr_recovery_stats/pci     - PCI/PCIe AER non-fatal errors
  /sys/kernel/hwerr_recovery_stats/cxl     - CXL errors
  /sys/kernel/hwerr_recovery_stats/others  - Other hardware errors

Each file contains a single integer representing the count of recovered
errors for that subsystem.

These statistics provide visibility into the health of the system's
hardware and can be used by system administrators to proactively detect
failing components before they cause system crashes.
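
As an illustration of the time-series use case, here is a small sketch
of how a collector might turn two successive reads into per-subsystem
deltas. Everything in it is hypothetical (struct hwerr_sample,
hwerr_delta, hwerr_compare, the threshold); it also assumes that a
counter lower than its previous value means the counts restarted, for
example after a reboot, since the values are counted since boot.

```c
#include <stddef.h>

#define HWERR_NR_SUBSYS 5	/* cpu, memory, pci, cxl, others */

/* One snapshot of the five counters (hypothetical layout). */
struct hwerr_sample {
	long count[HWERR_NR_SUBSYS];
};

/*
 * Errors recovered between two reads of one counter.
 * Assumption: a lower value than before means the counter restarted
 * (for example after a reboot), so the current value is the delta.
 */
static long hwerr_delta(long prev, long cur)
{
	return cur >= prev ? cur - prev : cur;
}

/*
 * Fill delta[] for every subsystem and return how many subsystems
 * gained more than threshold errors since the previous sample.
 */
int hwerr_compare(const struct hwerr_sample *prev,
		  const struct hwerr_sample *cur,
		  long threshold, long delta[HWERR_NR_SUBSYS])
{
	size_t i;
	int flagged = 0;

	for (i = 0; i < HWERR_NR_SUBSYS; i++) {
		delta[i] = hwerr_delta(prev->count[i], cur->count[i]);
		if (delta[i] > threshold)
			flagged++;
	}
	return flagged;
}
```

A collector would typically store the deltas rather than the raw values,
so that a steady rise in one subsystem stands out regardless of uptime.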

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 kernel/vmcore_info.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
index e2784038bbed7..b7fcd21be7c59 100644
--- a/kernel/vmcore_info.c
+++ b/kernel/vmcore_info.c
@@ -6,6 +6,8 @@
 
 #include <linux/buildid.h>
 #include <linux/init.h>
+#include <linux/kobject.h>
+#include <linux/sysfs.h>
 #include <linux/utsname.h>
 #include <linux/vmalloc.h>
 #include <linux/sizes.h>
@@ -139,6 +141,56 @@ void hwerr_log_error_type(enum hwerr_error_type src)
 }
 EXPORT_SYMBOL_GPL(hwerr_log_error_type);
 
+/* sysfs interface for hardware error recovery statistics */
+#define HWERR_ATTR_RO(_name, _type)					\
+static ssize_t _name##_show(struct kobject *kobj,			\
+			    struct kobj_attribute *attr, char *buf)	\
+{									\
+	return sysfs_emit(buf, "%d\n",					\
+			  atomic_read(&hwerr_data[_type].count));	\
+}									\
+static struct kobj_attribute hwerr_##_name##_attr = __ATTR_RO(_name)
+
+HWERR_ATTR_RO(cpu, HWERR_RECOV_CPU);
+HWERR_ATTR_RO(memory, HWERR_RECOV_MEMORY);
+HWERR_ATTR_RO(pci, HWERR_RECOV_PCI);
+HWERR_ATTR_RO(cxl, HWERR_RECOV_CXL);
+HWERR_ATTR_RO(others, HWERR_RECOV_OTHERS);
+
+static struct attribute *hwerr_recovery_stats_attrs[] = {
+	&hwerr_cpu_attr.attr,
+	&hwerr_memory_attr.attr,
+	&hwerr_pci_attr.attr,
+	&hwerr_cxl_attr.attr,
+	&hwerr_others_attr.attr,
+	NULL,
+};
+
+static const struct attribute_group hwerr_recovery_stats_group = {
+	.attrs = hwerr_recovery_stats_attrs,
+};
+
+static struct kobject *hwerr_recovery_stats_kobj;
+
+static int __init hwerr_recovery_stats_init(void)
+{
+	hwerr_recovery_stats_kobj = kobject_create_and_add("hwerr_recovery_stats",
+							   kernel_kobj);
+	if (!hwerr_recovery_stats_kobj) {
+		pr_warn("Failed to create hwerr_recovery_stats kobject\n");
+		return -ENOMEM;
+	}
+
+	if (sysfs_create_group(hwerr_recovery_stats_kobj,
+			       &hwerr_recovery_stats_group)) {
+		kobject_put(hwerr_recovery_stats_kobj);
+		pr_warn("Failed to create hwerr_recovery_stats sysfs group\n");
+		return -ENOMEM;
+	}
+
+	return 0;
+}
+
 static int __init crash_save_vmcoreinfo_init(void)
 {
 	vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL);
@@ -248,6 +300,9 @@ static int __init crash_save_vmcoreinfo_init(void)
 	arch_crash_save_vmcoreinfo();
 	update_vmcoreinfo_note();
 
+	/* Create /sys/kernel/hwerr_recovery_stats/ directory */
+	hwerr_recovery_stats_init();
+
 	return 0;
 }
 

-- 
2.47.3




* [PATCH v2 2/2] docs: add ABI documentation for /sys/kernel/hwerr_recovery_stats/
  2026-02-02 14:27 [PATCH v2 0/2] vmcoreinfo: Expose hardware error recovery statistics via sysfs Breno Leitao
  2026-02-02 14:27 ` [PATCH v2 1/2] vmcoreinfo: expose " Breno Leitao
@ 2026-02-02 14:27 ` Breno Leitao
  2026-02-10  9:11 ` [PATCH v2 0/2] vmcoreinfo: Expose hardware error recovery statistics via sysfs Breno Leitao
  2 siblings, 0 replies; 6+ messages in thread
From: Breno Leitao @ 2026-02-02 14:27 UTC
  To: akpm, bhe
  Cc: linux-kernel, kexec, linux-arm-kernel, linux-acpi, dyoung,
	tony.luck, xueshuai, vgoyal, zhiquan1.li, olja, Breno Leitao,
	kernel-team

Document the new hwerr_recovery_stats sysfs directory that exposes
hardware error recovery statistics.

Update hw-recoverable-errors.rst to reference the new sysfs interface
for runtime monitoring.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 .../ABI/testing/sysfs-kernel-hwerr_recovery_stats  | 47 ++++++++++++++++++++++
 Documentation/driver-api/hw-recoverable-errors.rst |  3 +-
 2 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/Documentation/ABI/testing/sysfs-kernel-hwerr_recovery_stats b/Documentation/ABI/testing/sysfs-kernel-hwerr_recovery_stats
new file mode 100644
index 0000000000000..4cb9f5a89fba9
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-hwerr_recovery_stats
@@ -0,0 +1,47 @@
+What:		/sys/kernel/hwerr_recovery_stats/
+Date:		February 2026
+KernelVersion:	6.20
+Contact:	Breno Leitao <leitao@debian.org>
+Description:
+		Directory containing hardware error recovery statistics.
+		These statistics track recoverable hardware errors that the
+		kernel has handled since boot.
+
+		Each file contains a single integer representing the count
+		of recovered errors for that subsystem.
+
+What:		/sys/kernel/hwerr_recovery_stats/cpu
+Date:		February 2026
+KernelVersion:	6.20
+Contact:	Breno Leitao <leitao@debian.org>
+Description:
+		Count of CPU-related recovered errors (MCE, ARM processor
+		errors).
+
+What:		/sys/kernel/hwerr_recovery_stats/memory
+Date:		February 2026
+KernelVersion:	6.20
+Contact:	Breno Leitao <leitao@debian.org>
+Description:
+		Count of memory-related recovered errors.
+
+What:		/sys/kernel/hwerr_recovery_stats/pci
+Date:		February 2026
+KernelVersion:	6.20
+Contact:	Breno Leitao <leitao@debian.org>
+Description:
+		Count of PCI/PCIe AER non-fatal recovered errors.
+
+What:		/sys/kernel/hwerr_recovery_stats/cxl
+Date:		February 2026
+KernelVersion:	6.20
+Contact:	Breno Leitao <leitao@debian.org>
+Description:
+		Count of CXL (Compute Express Link) recovered errors.
+
+What:		/sys/kernel/hwerr_recovery_stats/others
+Date:		February 2026
+KernelVersion:	6.20
+Contact:	Breno Leitao <leitao@debian.org>
+Description:
+		Count of other hardware recovered errors.
diff --git a/Documentation/driver-api/hw-recoverable-errors.rst b/Documentation/driver-api/hw-recoverable-errors.rst
index fc526c3454bd7..4aefcd103be22 100644
--- a/Documentation/driver-api/hw-recoverable-errors.rst
+++ b/Documentation/driver-api/hw-recoverable-errors.rst
@@ -36,7 +36,8 @@ Data Exposure and Consumption
   types like CPU, memory, PCI, CXL, and others.
 - It is exposed via vmcoreinfo crash dump notes and can be read using tools
   like `crash`, `drgn`, or other kernel crash analysis utilities.
-- There is no other way to read these data other than from crash dumps.
+- It is also exposed via sysfs at ``/sys/kernel/hwerr_recovery_stats/`` for runtime
+  monitoring without requiring a crash dump.
 - These errors are divided by area, which includes CPU, Memory, PCI, CXL and
   others.
 

-- 
2.47.3




* Re: [PATCH v2 0/2] vmcoreinfo: Expose hardware error recovery statistics via sysfs
  2026-02-02 14:27 [PATCH v2 0/2] vmcoreinfo: Expose hardware error recovery statistics via sysfs Breno Leitao
  2026-02-02 14:27 ` [PATCH v2 1/2] vmcoreinfo: expose " Breno Leitao
  2026-02-02 14:27 ` [PATCH v2 2/2] docs: add ABI documentation for /sys/kernel/hwerr_recovery_stats/ Breno Leitao
@ 2026-02-10  9:11 ` Breno Leitao
  2026-02-10 18:46   ` Andrew Morton
  2 siblings, 1 reply; 6+ messages in thread
From: Breno Leitao @ 2026-02-10  9:11 UTC
  To: akpm, bhe
  Cc: linux-kernel, kexec, linux-arm-kernel, linux-acpi, dyoung,
	tony.luck, xueshuai, vgoyal, zhiquan1.li, olja, kernel-team

Hello Andrew,

On Mon, Feb 02, 2026 at 06:27:38AM -0800, Breno Leitao wrote:
> The kernel already tracks recoverable hardware errors (CPU, memory, PCI,
> CXL, etc.) in the hwerr_data array for vmcoreinfo crash dump analysis.
> However, this data is only accessible after a crash.
>
> This series adds a sysfs directory at /sys/kernel/hwerr_recovery_stats/ to
> expose these statistics at runtime, allowing monitoring tools to track
> hardware health without requiring a kernel crash.
>
> The directory contains one file per error subsystem:
>   /sys/kernel/hwerr_recovery_stats/{cpu, memory, pci, cxl, others}
>
> Each file contains a single integer representing the error count.
>
> This is useful for:
> - Proactive detection of failing hardware components
> - Time-series tracking of recoverable errors
> - System health monitoring in cloud environments

Is there a chance this could be included in the 6.20 merge window?

Thanks,
--breno



* Re: [PATCH v2 0/2] vmcoreinfo: Expose hardware error recovery statistics via sysfs
  2026-02-10  9:11 ` [PATCH v2 0/2] vmcoreinfo: Expose hardware error recovery statistics via sysfs Breno Leitao
@ 2026-02-10 18:46   ` Andrew Morton
  0 siblings, 0 replies; 6+ messages in thread
From: Andrew Morton @ 2026-02-10 18:46 UTC
  To: Breno Leitao
  Cc: bhe, linux-kernel, kexec, linux-arm-kernel, linux-acpi, dyoung,
	tony.luck, xueshuai, vgoyal, zhiquan1.li, olja, kernel-team

On Tue, 10 Feb 2026 01:11:41 -0800 Breno Leitao <leitao@debian.org> wrote:

> Hello Andrew,
> 
> On Mon, Feb 02, 2026 at 06:27:38AM -0800, Breno Leitao wrote:
> > The kernel already tracks recoverable hardware errors (CPU, memory, PCI,
> > CXL, etc.) in the hwerr_data array for vmcoreinfo crash dump analysis.
> > However, this data is only accessible after a crash.
> >
> > This series adds a sysfs directory at /sys/kernel/hwerr_recovery_stats/ to
> > expose these statistics at runtime, allowing monitoring tools to track
> > hardware health without requiring a kernel crash.
> >
> > The directory contains one file per error subsystem:
> >   /sys/kernel/hwerr_recovery_stats/{cpu, memory, pci, cxl, others}
> >
> > Each file contains a single integer representing the error count.
> >
> > This is useful for:
> > - Proactive detection of failing hardware components
> > - Time-series tracking of recoverable errors
> > - System health monitoring in cloud environments
> 
> Is there a chance this could be included in the 6.20 merge window?

During the 7.0 merge window?  Sure.  I'll be taking a look at this (and
a whole lot more) after 7.0-rc1 is released.  



* Re: [PATCH v2 1/2] vmcoreinfo: expose hardware error recovery statistics via sysfs
  2026-02-02 14:27 ` [PATCH v2 1/2] vmcoreinfo: expose " Breno Leitao
@ 2026-02-11  2:01   ` Baoquan He
  0 siblings, 0 replies; 6+ messages in thread
From: Baoquan He @ 2026-02-11  2:01 UTC
  To: Breno Leitao
  Cc: akpm, linux-kernel, kexec, linux-arm-kernel, linux-acpi, dyoung,
	tony.luck, xueshuai, vgoyal, zhiquan1.li, olja, kernel-team

Hi Breno,

On 02/02/26 at 06:27am, Breno Leitao wrote:
> Add a sysfs directory at /sys/kernel/hwerr_recovery_stats/ to expose
> hardware error recovery statistics that are already tracked by the
> kernel. This allows userspace monitoring tools to track recovered
> hardware errors without requiring kernel crashes.
> 
> This is useful to track recoverable hardware errors in a time series,
> even if the host doesn't crash.
> 
> The sysfs directory contains one file per error subsystem:
> 
>   /sys/kernel/hwerr_recovery_stats/cpu     - CPU-related errors (MCE, ARM errors)
>   /sys/kernel/hwerr_recovery_stats/memory  - Memory-related errors
>   /sys/kernel/hwerr_recovery_stats/pci     - PCI/PCIe AER non-fatal errors
>   /sys/kernel/hwerr_recovery_stats/cxl     - CXL errors
>   /sys/kernel/hwerr_recovery_stats/others  - Other hardware errors
> 
> Each file contains a single integer representing the count of recovered
> errors for that subsystem.
> 
> These statistics provide visibility into the health of the system's
> hardware and can be used by system administrators to proactively detect
> failing components before they cause system crashes.
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
>  kernel/vmcore_info.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 55 insertions(+)
> 
> diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
> index e2784038bbed7..b7fcd21be7c59 100644
> --- a/kernel/vmcore_info.c
> +++ b/kernel/vmcore_info.c

Since we agreed hwerr_recovery_stats has nothing to do with vmcore, it
seems inappropriate to put its sysfs handling code in
kernel/vmcore_info.c. File kernel/vmcore_info.c is only used to build
vmcore info for later vmcore dumping. And hwerr_log_error_type() should
not be put in kernel/vmcore_info.c either. I didn't check this
carefully before, sorry. Please reconsider if these can be handled better.

Thanks
Baoquan

> @@ -6,6 +6,8 @@
>  
>  #include <linux/buildid.h>
>  #include <linux/init.h>
> +#include <linux/kobject.h>
> +#include <linux/sysfs.h>
>  #include <linux/utsname.h>
>  #include <linux/vmalloc.h>
>  #include <linux/sizes.h>
> @@ -139,6 +141,56 @@ void hwerr_log_error_type(enum hwerr_error_type src)
>  }
>  EXPORT_SYMBOL_GPL(hwerr_log_error_type);
>  
> +/* sysfs interface for hardware error recovery statistics */
> +#define HWERR_ATTR_RO(_name, _type)					\
> +static ssize_t _name##_show(struct kobject *kobj,			\
> +			    struct kobj_attribute *attr, char *buf)	\
> +{									\
> +	return sysfs_emit(buf, "%d\n",					\
> +			  atomic_read(&hwerr_data[_type].count));	\
> +}									\
> +static struct kobj_attribute hwerr_##_name##_attr = __ATTR_RO(_name)
> +
> +HWERR_ATTR_RO(cpu, HWERR_RECOV_CPU);
> +HWERR_ATTR_RO(memory, HWERR_RECOV_MEMORY);
> +HWERR_ATTR_RO(pci, HWERR_RECOV_PCI);
> +HWERR_ATTR_RO(cxl, HWERR_RECOV_CXL);
> +HWERR_ATTR_RO(others, HWERR_RECOV_OTHERS);
> +
> +static struct attribute *hwerr_recovery_stats_attrs[] = {
> +	&hwerr_cpu_attr.attr,
> +	&hwerr_memory_attr.attr,
> +	&hwerr_pci_attr.attr,
> +	&hwerr_cxl_attr.attr,
> +	&hwerr_others_attr.attr,
> +	NULL,
> +};
> +
> +static const struct attribute_group hwerr_recovery_stats_group = {
> +	.attrs = hwerr_recovery_stats_attrs,
> +};
> +
> +static struct kobject *hwerr_recovery_stats_kobj;
> +
> +static int __init hwerr_recovery_stats_init(void)
> +{
> +	hwerr_recovery_stats_kobj = kobject_create_and_add("hwerr_recovery_stats",
> +							   kernel_kobj);
> +	if (!hwerr_recovery_stats_kobj) {
> +		pr_warn("Failed to create hwerr_recovery_stats kobject\n");
> +		return -ENOMEM;
> +	}
> +
> +	if (sysfs_create_group(hwerr_recovery_stats_kobj,
> +			       &hwerr_recovery_stats_group)) {
> +		kobject_put(hwerr_recovery_stats_kobj);
> +		pr_warn("Failed to create hwerr_recovery_stats sysfs group\n");
> +		return -ENOMEM;
> +	}
> +
> +	return 0;
> +}
> +
>  static int __init crash_save_vmcoreinfo_init(void)
>  {
>  	vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL);
> @@ -248,6 +300,9 @@ static int __init crash_save_vmcoreinfo_init(void)
>  	arch_crash_save_vmcoreinfo();
>  	update_vmcoreinfo_note();
>  
> +	/* Create /sys/kernel/hwerr_recovery_stats/ directory */
> +	hwerr_recovery_stats_init();
> +
>  	return 0;
>  }
>  
> 
> -- 
> 2.47.3
> 


