From: Baoquan He <bhe@redhat.com>
To: Breno Leitao <leitao@debian.org>
Cc: akpm@linux-foundation.org, linux-kernel@vger.kernel.org,
	kernel-team@meta.com, kexec@lists.infradead.org,
	dyoung@redhat.com, tony.luck@intel.com,
	xueshuai@linux.alibaba.com, vgoyal@redhat.com,
	zhiquan1.li@intel.com, olja@meta.com
Subject: Re: [PATCH] vmcore_info: expose hardware error recovery statistics via sysfs
Date: Fri, 30 Jan 2026 09:59:58 +0800
Message-ID: <aXwQngYZx9Nk7hbv@fedora>
In-Reply-To: <20260129-vmcoreinfo_sysfs-v1-1-164c1fe1fe07@debian.org>

On 01/29/26 at 05:34am, Breno Leitao wrote:
> Add a sysfs file at /sys/kernel/vmcore_stats and expose hardware error
> recovery statistics that are already tracked by the kernel. This allows
> userspace monitoring tools to track recovered hardware errors without
> requiring kernel crashes.

I don't understand. If this works without requiring kernel crashes, why
do you call it vmcore_stats? It just shows the hardware error recovery
statistics tracked by the kernel, so can we name it
/sys/kernel/hwerr_stats instead? It obviously has nothing to do with
vmcore, does it?

> 
> This is useful to track recoverable hardware errors in a time series,
> even if the host doesn't crash.
> 
> Create a generic vmcore_stats sysfs, and add a section for
> hwerr_recovery that shows the counts per subsystem and timestamps:
> 
>   - cpu: CPU-related errors (MCE, ARM processor errors)
>   - memory: Memory-related errors
>   - pci: PCI/PCIe AER non-fatal errors
>   - cxl: CXL errors
>   - other: Other hardware errors
> 
> Example output:
>   hwerr_recovery:
>     cpu: 0 (0)
>     memory: 2 (1738148257)
>     pci: 1 (1738147000)
>     cxl: 0 (0)
>     other: 0 (0)
> 
> The value in parentheses is the timestamp (seconds since epoch) of the
> last error of that type, or 0 if no errors have occurred.
> 
> These statistics provide visibility into the health of the system's
> hardware and can be used by system administrators to proactively detect
> failing components before they cause system crashes.
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> To: akpm@linux-foundation.org
> Cc: kexec@lists.infradead.org
> To: bhe@redhat.com
> Cc: linux-kernel@vger.kernel.org
> Cc: dyoung@redhat.com
> Cc: tony.luck@intel.com
> Cc: xueshuai@linux.alibaba.com
> Cc: vgoyal@redhat.com
> Cc: zhiquan1.li@intel.com
> Cc: olja@meta.com
> ---
>  .../ABI/testing/sysfs-kernel-vmcore_stats          | 23 ++++++++++++++++
>  kernel/vmcore_info.c                               | 31 ++++++++++++++++++++++
>  2 files changed, 54 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-kernel-vmcore_stats b/Documentation/ABI/testing/sysfs-kernel-vmcore_stats
> new file mode 100644
> index 0000000000000..b42f18d24c00b
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-kernel-vmcore_stats
> @@ -0,0 +1,23 @@
> +What:		/sys/kernel/vmcore_stats
> +Date:		January 2026
> +KernelVersion:	6.20
> +Contact:	Breno Leitao <leitao@debian.org>
> +Description:
> +		Shows statistics related to vmcore functionality. Currently
> +		includes hardware error recovery statistics.
> +
> +		Format:
> +		  Recovered hardware errors:
> +		    metric: count (timestamp)
> +
> +		Statistics about recoverable hardware errors that the kernel
> +		has handled since boot. Each metric shows the count and
> +		timestamp (seconds since epoch) of the last error in
> +		parentheses (0 if no errors have occurred).
> +
> +		Metrics:
> +		    - cpu: CPU-related errors (MCE, ARM processor errors)
> +		    - memory: Memory-related errors
> +		    - pci: PCI/PCIe AER non-fatal errors
> +		    - cxl: CXL (Compute Express Link) errors
> +		    - other: Other hardware errors
> diff --git a/kernel/vmcore_info.c b/kernel/vmcore_info.c
> index fe9bf8db1922e..5974b4be08cbc 100644
> --- a/kernel/vmcore_info.c
> +++ b/kernel/vmcore_info.c
> @@ -6,6 +6,8 @@
>  
>  #include <linux/buildid.h>
>  #include <linux/init.h>
> +#include <linux/kobject.h>
> +#include <linux/sysfs.h>
>  #include <linux/utsname.h>
>  #include <linux/vmalloc.h>
>  #include <linux/sizes.h>
> @@ -135,6 +137,31 @@ void hwerr_log_error_type(enum hwerr_error_type src)
>  }
>  EXPORT_SYMBOL_GPL(hwerr_log_error_type);
>  
> +/* sysfs interface for hardware error recovery statistics */
> +static ssize_t vmcore_stats_show(struct kobject *kobj,
> +				 struct kobj_attribute *attr, char *buf)
> +{
> +	return sysfs_emit(buf,
> +			  "Recovered hardware errors:\n"
> +			  "  cpu: %d (%lld)\n"
> +			  "  memory: %d (%lld)\n"
> +			  "  pci: %d (%lld)\n"
> +			  "  cxl: %d (%lld)\n"
> +			  "  other: %d (%lld)\n",
> +			  atomic_read(&hwerr_data[HWERR_RECOV_CPU].count),
> +			  (long long)READ_ONCE(hwerr_data[HWERR_RECOV_CPU].timestamp),
> +			  atomic_read(&hwerr_data[HWERR_RECOV_MEMORY].count),
> +			  (long long)READ_ONCE(hwerr_data[HWERR_RECOV_MEMORY].timestamp),
> +			  atomic_read(&hwerr_data[HWERR_RECOV_PCI].count),
> +			  (long long)READ_ONCE(hwerr_data[HWERR_RECOV_PCI].timestamp),
> +			  atomic_read(&hwerr_data[HWERR_RECOV_CXL].count),
> +			  (long long)READ_ONCE(hwerr_data[HWERR_RECOV_CXL].timestamp),
> +			  atomic_read(&hwerr_data[HWERR_RECOV_OTHERS].count),
> +			  (long long)READ_ONCE(hwerr_data[HWERR_RECOV_OTHERS].timestamp));
> +}
> +
> +static struct kobj_attribute vmcore_stats_attr = __ATTR_RO(vmcore_stats);
> +
>  static int __init crash_save_vmcoreinfo_init(void)
>  {
>  	vmcoreinfo_data = (unsigned char *)get_zeroed_page(GFP_KERNEL);
> @@ -244,6 +271,10 @@ static int __init crash_save_vmcoreinfo_init(void)
>  	arch_crash_save_vmcoreinfo();
>  	update_vmcoreinfo_note();
>  
> +	/* Create /sys/kernel/vmcore_stats */
> +	if (sysfs_create_file(kernel_kobj, &vmcore_stats_attr.attr))
> +		pr_warn("Failed to create vmcore_stats sysfs file\n");
> +
>  	return 0;
>  }
>  
> 
> ---
> base-commit: 8dfce8991b95d8625d0a1d2896e42f93b9d7f68d
> change-id: 20260129-vmcoreinfo_sysfs-ff4687979cd5
> 
> Best regards,
> --  
> Breno Leitao <leitao@debian.org>
> 



