linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
To: Breno Leitao <leitao@debian.org>
Cc: Shuai Xue <xueshuai@linux.alibaba.com>,
	Tony Luck <tony.luck@intel.com>, Borislav Petkov <bp@alien8.de>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Len Brown <lenb@kernel.org>, James Morse <james.morse@arm.com>,
	Robert Moore <robert.moore@intel.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
	Hanjun Guo <guohanjun@huawei.com>,
	Mauro Carvalho Chehab <mchehab@kernel.org>,
	Mahesh J Salgaonkar <mahesh@linux.ibm.com>,
	Oliver O'Halloran <oohall@gmail.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org,
	acpica-devel@lists.linux.dev, osandov@osandov.com,
	konrad.wilk@oracle.com, linux-edac@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org, linux-pci@vger.kernel.org,
	kernel-team@meta.com
Subject: Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
Date: Wed, 30 Jul 2025 18:21:37 +0200	[thread overview]
Message-ID: <20250730182137.18605ea1@foz.lan> (raw)
In-Reply-To: <zc4jm3hwvtwo5y2knk2bqzwmpf7ma7bdzs6uv2osavzcdew3nk@lfjrlp6sr7zz>

Em Wed, 30 Jul 2025 06:11:52 -0700
Breno Leitao <leitao@debian.org> escreveu:

> Hello Shuai,
> 
> On Wed, Jul 30, 2025 at 10:13:13AM +0800, Shuai Xue wrote:
> > In ghes_log_hwerr(), you're counting both CPER_SEV_CORRECTED and
> > CPER_SEV_RECOVERABLE errors:  
> 
> Thanks. I was reading this code a bit more, and I want to make sure my
> understanding is correct, giving I was confused about CORRECTED and
> RECOVERABLE errors.
> 
> CPER_SEV_CORRECTED means it is corrected in the background, and the OS
> was not even notified about it. That includes 1-bit ECC error.
> THose are not the errors we are interested in, since they are irrelavant
> to the OS.

Hardware-corrected errors aren't irrelevant. The rasdaemon utils capture
such errors, as they may be a symptom of a hardware defect. In a matter
of fact, at rasdamon, thresholds can be set to trigger an action, like
for instance, disable memory blocks that contain defective memories.

This is specially relevant on HPC and supercomputer workloads, where
it is a lot cheaper to disable a block of bad memory than to lose
an entire job because that could take several weeks of run time on
a supercomputer, just because a defective memory ended causing a
failure at the application.

Regards,
Mauro

  parent reply	other threads:[~2025-07-30 16:21 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-22 16:56 [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors Breno Leitao
2025-07-23 14:28 ` kernel test robot
2025-07-23 15:36   ` Breno Leitao
2025-07-23 19:00     ` Borislav Petkov
2025-07-23 23:21       ` Huang, Kai
2025-07-24  8:00 ` Shuai Xue
2025-07-24 13:34   ` Breno Leitao
2025-07-25  7:40     ` Shuai Xue
2025-07-25 16:16       ` Breno Leitao
2025-07-28  1:08         ` Shuai Xue
2025-07-29 13:48           ` Breno Leitao
2025-07-30  2:13             ` Shuai Xue
2025-07-30 13:11               ` Breno Leitao
2025-07-30 13:50                 ` Shuai Xue
2025-07-30 17:16                   ` Breno Leitao
2025-07-30 16:21                 ` Mauro Carvalho Chehab [this message]
2025-07-30 17:22                   ` Breno Leitao

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250730182137.18605ea1@foz.lan \
    --to=mchehab+huawei@kernel.org \
    --cc=acpica-devel@lists.linux.dev \
    --cc=bhelgaas@google.com \
    --cc=bp@alien8.de \
    --cc=dave.hansen@linux.intel.com \
    --cc=guohanjun@huawei.com \
    --cc=hpa@zytor.com \
    --cc=james.morse@arm.com \
    --cc=kernel-team@meta.com \
    --cc=konrad.wilk@oracle.com \
    --cc=leitao@debian.org \
    --cc=lenb@kernel.org \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-edac@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mahesh@linux.ibm.com \
    --cc=mchehab@kernel.org \
    --cc=mingo@redhat.com \
    --cc=oohall@gmail.com \
    --cc=osandov@osandov.com \
    --cc=rafael@kernel.org \
    --cc=robert.moore@intel.com \
    --cc=tglx@linutronix.de \
    --cc=tony.luck@intel.com \
    --cc=x86@kernel.org \
    --cc=xueshuai@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).