[PATCH 0/2] panic: taint flag for recoverable hardware errors


* [PATCH 0/2] panic: taint flag for recoverable hardware errors
@ 2025-07-04 10:55 Breno Leitao
  2025-07-04 10:55 ` [PATCH 1/2] panic: add " Breno Leitao
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Breno Leitao @ 2025-07-04 10:55 UTC (permalink / raw)
  To: Len Brown, James Morse, Jonathan Corbet, tony.luck, rafael
  Cc: Alexei Starovoitov, kbusch, rmikey, kuba, ast, linux-edac,
	mchehab, bp, linux-acpi, linux-kernel, linux-doc, kernel-team,
	Breno Leitao

Overview
========

This patchset introduces a new kernel taint flag to track systems that
have experienced recoverable hardware errors during runtime. The
motivation comes from the operational challenges of managing large
server fleets where hardware events are common, and having them tainting
the kernel helps operators to correlate problems more easily.

This complements the new MACHINE_CHECK taint that was added for fatal
errors. [1]

Problem Statement
=================

In large-scale deployments with thousands of servers, hardware errors
are inevitable. While modern systems can recover from many hardware
failures (corrected ECC errors, recoverable CPU errors, etc.), these
events cause the kernel to take rarely exercised code paths, which can
expose bugs.

I experienced this pain very recently, where several machines were
crashing after a recoverable PCI port offline event. The hardware was
behaving correctly, but, during the recovery process, the kernel went
through code paths that are rarely tested.

In my case, the kernel recovery process caused issues whose root cause
was hard to find. For instance, recoverable PCI events cause the device
to suddenly go offline, followed later by PCI re-enumeration, which
reinitializes the driver.

The event above caused some real crashes in production, in very
different ways. From those that I investigated, I found:

	1) If the disk was going away, it was causing a filesystem
	   issue that has already been fixed in 6.14 and 6.15

	2) If the network was going away, it was causing some iommu
	   issues discussed and fixed in [2].

	3) Possible other issues that were not easy to correlate, such
	   as stalls, hung tasks, memory leaks, warnings, etc.

	  a) These are hidden today, and I would like to expose them
	     with this patch.

In summary, when investigating system issues, there's no trivial way to
determine if a machine has previously experienced hardware problems that
might be contributing to current instability, other than going host by
host and scanning kernel logs.

Proposed Solution
=================

Add a new taint flag to the kernel (HW_ERROR_RECOVERED - for lack of
a better name) that gets set whenever the kernel detects and recovers
from hardware errors.

The taint provides additional context during crash investigation *without*
implying that crashes are necessarily caused by hardware failures
(similar to how PROPRIETARY_MODULE taint works). It is just extra
information that provides more context about that machine.

This patchset focuses on ACPI/GHES, which handles most recoverable
hardware errors I have experience with, but can be extended to other
subsystems like EDAC HW_EVENT_ERR_CORRECTED in the future.

--

I would like to *thank* Tony for the early discussions and
encouragement.

Link: https://lore.kernel.org/all/20250702-add_tain-v1-1-9187b10914b9@debian.org/ [1]
Link: https://lore.kernel.org/all/20250409-page-pool-track-dma-v9-0-6a9ef2e0cba8@redhat.com/ [2]

---
Breno Leitao (2):
      panic: add taint flag for recoverable hardware errors
      acpi/ghes: taint kernel on recovered hardware errors

 Documentation/admin-guide/tainted-kernels.rst | 7 ++++++-
 drivers/acpi/apei/ghes.c                      | 7 +++++--
 include/linux/panic.h                         | 3 ++-
 kernel/panic.c                                | 1 +
 tools/debugging/kernel-chktaint               | 8 ++++++++
 5 files changed, 22 insertions(+), 4 deletions(-)
---
base-commit: dc3cd0dfd91cad0611f0f0eace339a401da5d5ee
change-id: 20250703-taint_recovered-1d2e890a684b

Best regards,
--  
Breno Leitao <leitao@debian.org>


* [PATCH 1/2] panic: add taint flag for recoverable hardware errors
  2025-07-04 10:55 [PATCH 0/2] panic: taint flag for recoverable hardware errors Breno Leitao
@ 2025-07-04 10:55 ` Breno Leitao
  2025-07-04 10:55 ` [PATCH 2/2] acpi/ghes: taint kernel on recovered " Breno Leitao
  2025-07-04 11:19 ` [PATCH 0/2] panic: taint flag for recoverable " Borislav Petkov
  2 siblings, 0 replies; 7+ messages in thread
From: Breno Leitao @ 2025-07-04 10:55 UTC (permalink / raw)
  To: Len Brown, James Morse, Jonathan Corbet, tony.luck, rafael
  Cc: Alexei Starovoitov, kbusch, rmikey, kuba, ast, linux-edac,
	mchehab, bp, linux-acpi, linux-kernel, linux-doc, kernel-team,
	Breno Leitao

This change introduces a new taint flag, bit 20 ('H'), to indicate when
the kernel has identified recoverable hardware failures during runtime.

The flag is documented in tainted-kernels.rst, defined in panic.h, added
to the taint_flags array in panic.c, and supported in the
kernel-chktaint debugging tool.
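
As a rough userspace illustration (not part of the patch), the sketch
below shows how bit 20 maps to the decimal value 1048576 listed in the
documentation table, and how the repeated "divide by 2, test modulo 2"
steps in kernel-chktaint reach that bit. The helper names taint_number()
and chktaint_bit() are hypothetical, and bit 20 is only the position
proposed in this series.

```c
#include <stddef.h>

/* Assumption: bit position proposed by this series, not an upstream value. */
#define TAINT_HW_ERROR_RECOVERED 20

/* Decimal value shown in the "Number" column of tainted-kernels.rst. */
static unsigned long taint_number(unsigned int bit)
{
	return 1UL << bit;
}

/*
 * Mimics the kernel-chktaint approach: halve the value once per lower
 * bit, then test whether the remaining value is odd.
 */
static int chktaint_bit(unsigned long t, unsigned int bit)
{
	unsigned int i;

	for (i = 0; i < bit; i++)
		t /= 2;
	return (int)(t % 2);
}
```

With these helpers, chktaint_bit(value, TAINT_HW_ERROR_RECOVERED) would
return 1 for any value read from /proc/sys/kernel/tainted that has the
proposed flag set.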

Marking kernels that have encountered recoverable hardware errors helps
correlate future issues with hardware events, improving diagnostics and
support for affected systems.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 Documentation/admin-guide/tainted-kernels.rst | 7 ++++++-
 include/linux/panic.h                         | 3 ++-
 kernel/panic.c                                | 1 +
 tools/debugging/kernel-chktaint               | 8 ++++++++
 4 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/Documentation/admin-guide/tainted-kernels.rst b/Documentation/admin-guide/tainted-kernels.rst
index a0cc017e44246..28185e9c0e039 100644
--- a/Documentation/admin-guide/tainted-kernels.rst
+++ b/Documentation/admin-guide/tainted-kernels.rst
@@ -102,7 +102,8 @@ Bit  Log  Number  Reason that got the kernel tainted
  17  _/T  131072  kernel was built with the struct randomization plugin
  18  _/N  262144  an in-kernel test has been run
  19  _/J  524288  userspace used a mutating debug operation in fwctl
-===  ===  ======  ========================================================
+ 20  _/H 1048576  hardware recoverable failures identified
+===  === =======  ========================================================
 
 Note: The character ``_`` is representing a blank in this table to make reading
 easier.
@@ -189,3 +190,7 @@ More detailed explanation for tainting
  19) ``J`` if userpace opened /dev/fwctl/* and performed a FWTCL_RPC_DEBUG_WRITE
      to use the devices debugging features. Device debugging features could
      cause the device to malfunction in undefined ways.
+
+ 20) ``H`` if the kernel identified any recoverable hardware failure earlier
+     during its operation. This helps to correlate possible future issues to
+     the fact that the hardware got a recoverable error.
diff --git a/include/linux/panic.h b/include/linux/panic.h
index 4adc657669354..d8241a052d69a 100644
--- a/include/linux/panic.h
+++ b/include/linux/panic.h
@@ -73,7 +73,8 @@ static inline void set_arch_panic_timeout(int timeout, int arch_default_timeout)
 #define TAINT_RANDSTRUCT		17
 #define TAINT_TEST			18
 #define TAINT_FWCTL			19
-#define TAINT_FLAGS_COUNT		20
+#define TAINT_HW_ERROR_RECOVERED	20
+#define TAINT_FLAGS_COUNT		21
 #define TAINT_FLAGS_MAX			((1UL << TAINT_FLAGS_COUNT) - 1)
 
 struct taint_flag {
diff --git a/kernel/panic.c b/kernel/panic.c
index b0b9a8bf4560d..fd13baf5d94bc 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -540,6 +540,7 @@ const struct taint_flag taint_flags[TAINT_FLAGS_COUNT] = {
 	TAINT_FLAG(RANDSTRUCT,			'T', ' ', true),
 	TAINT_FLAG(TEST,			'N', ' ', true),
 	TAINT_FLAG(FWCTL,			'J', ' ', true),
+	TAINT_FLAG(HW_ERROR_RECOVERED,		'H', ' ', false),
 };
 
 #undef TAINT_FLAG
diff --git a/tools/debugging/kernel-chktaint b/tools/debugging/kernel-chktaint
index e7da0909d0970..b2099155a820c 100755
--- a/tools/debugging/kernel-chktaint
+++ b/tools/debugging/kernel-chktaint
@@ -212,6 +212,14 @@ else
 	echo " * fwctl's mutating debug interface was used (#19)"
 fi
 
+T=`expr $T / 2`
+if [ `expr $T % 2` -eq 0 ]; then
+	addout " "
+else
+	addout "H"
+	echo " * the kernel identified recoverable hardware errors (#20)"
+fi
+
 echo "For a more detailed explanation of the various taint flags see"
 echo " Documentation/admin-guide/tainted-kernels.rst in the Linux kernel sources"
 echo " or https://kernel.org/doc/html/latest/admin-guide/tainted-kernels.html"

-- 
2.47.1



* [PATCH 2/2] acpi/ghes: taint kernel on recovered hardware errors
  2025-07-04 10:55 [PATCH 0/2] panic: taint flag for recoverable hardware errors Breno Leitao
  2025-07-04 10:55 ` [PATCH 1/2] panic: add " Breno Leitao
@ 2025-07-04 10:55 ` Breno Leitao
  2025-07-04 11:19 ` [PATCH 0/2] panic: taint flag for recoverable " Borislav Petkov
  2 siblings, 0 replies; 7+ messages in thread
From: Breno Leitao @ 2025-07-04 10:55 UTC (permalink / raw)
  To: Len Brown, James Morse, Jonathan Corbet, tony.luck, rafael
  Cc: Alexei Starovoitov, kbusch, rmikey, kuba, ast, linux-edac,
	mchehab, bp, linux-acpi, linux-kernel, linux-doc, kernel-team,
	Breno Leitao

Update the GHES error processing logic to taint the kernel when
a recoverable or corrected hardware error is detected.

If the error severity is GHES_SEV_RECOVERABLE or GHES_SEV_CORRECTED, the
TAINT_HW_ERROR_RECOVERED flag is set. This allows users and support
tools to identify systems that have experienced hardware issues that
were recovered at runtime, improving traceability and diagnostics.
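
The decision described above can be modelled outside the kernel as a
small pure function. This is only a sketch: the enum values are assumed
to mirror the kernel's GHES severity ordering, and the ghes_action type
and ghes_action_for() helper are hypothetical names used for
illustration.

```c
/* Assumed to mirror the kernel's GHES severity ordering. */
enum ghes_sev {
	GHES_SEV_NO = 0x0,
	GHES_SEV_CORRECTED = 0x1,
	GHES_SEV_RECOVERABLE = 0x2,
	GHES_SEV_PANIC = 0x3,
};

/* Hypothetical result type: what the error path would do. */
enum ghes_action {
	GHES_ACT_NONE,
	GHES_ACT_TAINT,
	GHES_ACT_PANIC,
};

/* Same branch structure as the proposed change to ghes_proc(). */
static enum ghes_action ghes_action_for(int sev)
{
	if (sev == GHES_SEV_RECOVERABLE || sev == GHES_SEV_CORRECTED)
		return GHES_ACT_TAINT;
	else if (sev >= GHES_SEV_PANIC)
		return GHES_ACT_PANIC;
	return GHES_ACT_NONE;
}
```

Note that in this model GHES_SEV_NO neither taints nor panics, and a
panic-level error never sets the taint.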

Signed-off-by: Breno Leitao <leitao@debian.org>
---
 drivers/acpi/apei/ghes.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 3d44f926afe8e..f323cefe234b9 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -1102,13 +1102,16 @@ static int ghes_proc(struct ghes *ghes)
 {
 	struct acpi_hest_generic_status *estatus = ghes->estatus;
 	u64 buf_paddr;
-	int rc;
+	int rc, sev;
 
 	rc = ghes_read_estatus(ghes, estatus, &buf_paddr, FIX_APEI_GHES_IRQ);
 	if (rc)
 		goto out;
 
-	if (ghes_severity(estatus->error_severity) >= GHES_SEV_PANIC)
+	sev = ghes_severity(estatus->error_severity);
+	if (sev == GHES_SEV_RECOVERABLE || sev == GHES_SEV_CORRECTED)
+		add_taint(TAINT_HW_ERROR_RECOVERED, LOCKDEP_STILL_OK);
+	else if (sev >= GHES_SEV_PANIC)
 		__ghes_panic(ghes, estatus, buf_paddr, FIX_APEI_GHES_IRQ);
 
 	if (!ghes_estatus_cached(estatus)) {

-- 
2.47.1



* Re: [PATCH 0/2] panic: taint flag for recoverable hardware errors
  2025-07-04 10:55 [PATCH 0/2] panic: taint flag for recoverable hardware errors Breno Leitao
  2025-07-04 10:55 ` [PATCH 1/2] panic: add " Breno Leitao
  2025-07-04 10:55 ` [PATCH 2/2] acpi/ghes: taint kernel on recovered " Breno Leitao
@ 2025-07-04 11:19 ` Borislav Petkov
  2025-07-04 12:15   ` Breno Leitao
  2 siblings, 1 reply; 7+ messages in thread
From: Borislav Petkov @ 2025-07-04 11:19 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Len Brown, James Morse, Jonathan Corbet, tony.luck, rafael,
	Alexei Starovoitov, kbusch, rmikey, kuba, linux-edac, mchehab,
	linux-acpi, linux-kernel, linux-doc, kernel-team

On Fri, Jul 04, 2025 at 03:55:18AM -0700, Breno Leitao wrote:
> Add a new taint flag to the kernel (HW_ERROR_RECOVERED - for lack of
> a better name) that gets set whenever the kernel detects and recovers
> from hardware errors.
> 
> The taint provides additional context during crash investigation *without*
> implying that crashes are necessarily caused by hardware failures
> (similar to how PROPRIETARY_MODULE taint works). It is just extra
> information that provides more context about that machine.

Dunno, looks like a hack to me to serve your purpose only.

Because when this goes up, then people will start wanting to taint the kernel
for *every* *single* correctable error.

So even if an error got corrected, the kernel will be tainted.

Then users will say, oh oh, my kernel is tainted, I need to replace my hw
because broken. Even if it isn't broken in the very least.

Basically what we're doing with drivers/ras/cec.c will be undone.

All because you want to put a bit of information somewhere that the machine
had a recoverable error.

Well, that bit of information is in your own RAS logs, no? I presume you log
hw errors in a big fleet and then you analyze those logs when the machine
bombs. So a mere look at those logs will tell you that you had hw errors.

And mind you, that proposed solution does not help people who want to know
what the errors were: "Oh look, my kernel got tainted because of hw errors. Now
where are those errors?"

So I think this is just adding redundant information which we already have
somewhere else and also actively can mislead users.

IOW, no need to taint - you want to simply put a bit of info in the kdump blob
which gets dumped by the second kernel that the first kernel experienced hw
errors. That is, if you don't log hw errors. But you should...!

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH 0/2] panic: taint flag for recoverable hardware errors
  2025-07-04 11:19 ` [PATCH 0/2] panic: taint flag for recoverable " Borislav Petkov
@ 2025-07-04 12:15   ` Breno Leitao
  2025-07-04 13:25     ` Borislav Petkov
  0 siblings, 1 reply; 7+ messages in thread
From: Breno Leitao @ 2025-07-04 12:15 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Len Brown, James Morse, Jonathan Corbet, tony.luck, rafael,
	Alexei Starovoitov, kbusch, rmikey, kuba, linux-edac, mchehab,
	linux-acpi, linux-kernel, linux-doc, kernel-team

Hello Borislav,

First of all, thanks for spending time on this one.

On Fri, Jul 04, 2025 at 01:19:54PM +0200, Borislav Petkov wrote:
> On Fri, Jul 04, 2025 at 03:55:18AM -0700, Breno Leitao wrote:
> > Add a new taint flag to the kernel (HW_ERROR_RECOVERED - for lack of
> > a better name) that gets set whenever the kernel detects and recovers
> > from hardware errors.
> > 
> > The taint provides additional context during crash investigation *without*
> > implying that crashes are necessarily caused by hardware failures
> > (similar to how PROPRIETARY_MODULE taint works). It is just extra
> > information that provides more context about that machine.
> 
> Dunno, looks like a hack to me to serve your purpose only.
> 
> Because when this goes up, then people will start wanting to taint the kernel
> for *every* *single* correctable error.
> 
> So even if an error got corrected, the kernel will be tainted.
> 
> Then users will say, oh oh, my kernel is tainted, I need to replace my hw
> because broken. Even if it isn't broken in the very least.

The information is not there to show correlation of broken hardware,
but, to flag that this kernel is running on hardware that has
recovered from an error. It does not mean that the problem is in the
hardware.  During my investigations, most of the time, the kernel was
buggy when recovering from hardware issues, mainly PCI re-plugs.

Anyway, the taints are not to tell you the root cause of the problem,
but, to give you an indication that would help the investigations. For
instance:

TAINT_PROPRIETARY_MODULE:
  - It doesn't mean that the machine crashed because of the proprietary
    module.

TAINT_FIRMWARE_WORKAROUND:
  - It doesn't tell you that crashes came because of the workaround,
    but it tells you about the workaround.

Same for TAINT_LIVEPATCH, TAINT_FORCED_RMMOD and most of the taints. It
helps the users; it does not tell you the root cause. For that we
have AITM. :-P

> Basically what we're doing with drivers/ras/cec.c will be undone.
> 
> All because you want to put a bit of information somewhere that the machine
> had a recoverable error.
> 
> Well, that bit of information is in your own RAS logs, no? I presume you log
> hw errors in a big fleet and then you analyze those logs when the machine
> bombs. So a mere look at those logs will tell you that you had hw errors.

True, but this argument would apply to every taint flag above. You can
look at the logs and find LIVEPATCHES, PROPRIETARY_MODULES,
FIRMWARE_WORKAROUND, etc.

That information could be found somewhere else, but having it somewhere
easy to read has proved to be useful.

For instance, running `cat /proc/sys/kernel/tainted` might be
*way easier* than parsing *thousands* of different RAS tool logs
to find what is going on.

> And mind you, that proposed solution does not help people who want to know
> what the errors were: "Oh look, my kernel got tainted because of hw errors. Now
> where are those errors?"

Agreed, and that is the intention. Whoever looks at a crash/warning
knows that the machine recovered from a hardware error, and this helps
the user in two ways:

1) Know that the kernel executed a path that is not frequently executed.
2) Look at the RAS logs if you think this is hardware related

Maybe these two things don't mean much, but it is like a heads-up
flag for whoever is looking at the issue.

> So I think this is just adding redundant information which we already have
> somewhere else and also actively can mislead users.
> 
> IOW, no need to taint - you want to simply put a bit of info in the kdump blob
> which gets dumped by the second kernel that the first kernel experienced hw
> errors. That is, if you don't log hw errors. But you should...!

Sure, saving this information somewhere will solve the problem as well.

I thought that adding a taint would be easier for a few reasons:

1) you can easily read it from userspace (/proc/sys/kernel/tainted), so
it is easy to scan the fleet for hardware errors, and query the RAS logs
only for those.

2) it is already shown at crash time, so this information is mostly
"free".

3) taint is consumed by kdump/kexec already, so, nothing would change.
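
To make point 1 concrete, here is a minimal userspace sketch of such a
check, assuming the flag lands on bit 20 as proposed in this series. The
helpers parse_taint(), read_taint() and hw_error_recovered() are
hypothetical names; only /proc/sys/kernel/tainted is an existing
interface.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Parses the decimal text of a taint mask; returns 0 on success. */
static int parse_taint(const char *text, unsigned long *mask)
{
	char *end;

	errno = 0;
	*mask = strtoul(text, &end, 10);
	if (errno || end == text)
		return -1;
	return 0;
}

/* Reads a taint mask from path; returns 0 on success, -1 on failure. */
static int read_taint(const char *path, unsigned long *mask)
{
	char buf[32];
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	return parse_taint(buf, mask);
}

/* Assumption: bit 20 is the position proposed here, not an upstream value. */
static int hw_error_recovered(unsigned long mask)
{
	return (int)((mask >> 20) & 1UL);
}
```

A fleet scanner could call read_taint("/proc/sys/kernel/tainted", &mask)
on each host and only pull RAS logs from hosts where
hw_error_recovered(mask) is non-zero.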

Anyway, I am happy to add this information somewhere else if you think
that taint is not the right place.

Thanks for your ideas and suggestions,
--breno


* Re: [PATCH 0/2] panic: taint flag for recoverable hardware errors
  2025-07-04 12:15   ` Breno Leitao
@ 2025-07-04 13:25     ` Borislav Petkov
  2025-07-14 17:01       ` Breno Leitao
  0 siblings, 1 reply; 7+ messages in thread
From: Borislav Petkov @ 2025-07-04 13:25 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Len Brown, James Morse, Jonathan Corbet, tony.luck, rafael,
	Alexei Starovoitov, kbusch, rmikey, kuba, linux-edac, mchehab,
	linux-acpi, linux-kernel, linux-doc, kernel-team

On Fri, Jul 04, 2025 at 01:15:06PM +0100, Breno Leitao wrote:
> The information is not there to show correlation of broken hardware,
> but,

I didn't say that.

I say that users will misunderstand this taint. Like all the other things we
have issued wrt RAS - people jump to conclusions without even reading English
text. Not to even talk about taint flags.

You having to explain it here basically proves my point.

> For instance, running `cat /proc/sys/kernel/tainted` might be
> *way easier* than parsing *thousands* of different RAS tool logs
> to find what is going on.

Thousands huh? I know of only two but maybe you will enlighten me.

And those I know can simply dump you an error log which you can check. It is
way easy already.

> Anyway, I am happy to add this information somewhere else if you think
> that taint is not the right place.

Documentation/admin-guide/kdump/vmcoreinfo.rst could be one place.

But again, this is redundant info which you can read out from logs which you
already *have* to collect anyway, in a large fleet.

IMO, you have everything already and this is not really needed.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette


* Re: [PATCH 0/2] panic: taint flag for recoverable hardware errors
  2025-07-04 13:25     ` Borislav Petkov
@ 2025-07-14 17:01       ` Breno Leitao
  0 siblings, 0 replies; 7+ messages in thread
From: Breno Leitao @ 2025-07-14 17:01 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Len Brown, James Morse, Jonathan Corbet, tony.luck, rafael,
	Alexei Starovoitov, kbusch, rmikey, kuba, linux-edac, mchehab,
	linux-acpi, linux-kernel, linux-doc, kernel-team

Hello Boris,

On Fri, Jul 04, 2025 at 03:25:18PM +0200, Borislav Petkov wrote:
> On Fri, Jul 04, 2025 at 01:15:06PM +0100, Breno Leitao wrote:
> 
> > Anyway, I am happy to add this information somewhere else if you think
> > that taint is not the right place.
> 
> Documentation/admin-guide/kdump/vmcoreinfo.rst could be one place.

I've tested adding it in GHES, and it would solve the problem I am
trying to address.

https://lore.kernel.org/all/20250714-vmcore_hw_error-v1-1-8cf45edb6334@debian.org/

I wasn't sure whether to include you in the "Suggested-by" field since
your idea was more general, but I added it anyway. Please let me know if
this doesn't accurately reflect your contribution.

Thanks for the guidance,
--breno

