linux-edac.vger.kernel.org archive mirror
* [PATCH 0/2] panic: taint flag for recoverable hardware errors
@ 2025-07-04 10:55 Breno Leitao
  2025-07-04 10:55 ` [PATCH 1/2] panic: add " Breno Leitao
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Breno Leitao @ 2025-07-04 10:55 UTC (permalink / raw)
  To: Len Brown, James Morse, Jonathan Corbet, tony.luck, rafael
  Cc: Alexei Starovoitov, kbusch, rmikey, kuba, ast, linux-edac,
	mchehab, bp, linux-acpi, linux-kernel, linux-doc, kernel-team,
	Breno Leitao

Overview
========

This patchset introduces a new kernel taint flag to track systems that
have experienced recoverable hardware errors during runtime. The
motivation comes from the operational challenges of managing large
server fleets, where hardware events are common: tainting the kernel
when they occur helps operators correlate later problems with them.

This complements the new MACHINE_CHECK taint that was added for fatal
errors. [1]

Problem Statement
=================

In large-scale deployments with thousands of servers, hardware errors
are inevitable. While modern systems can recover from many hardware
failures (corrected ECC errors, recoverable CPU errors, etc.), these
events make the kernel behave in very different ways and can trigger
bugs in code paths that are rarely exercised.

I experienced this pain very recently, when several machines were
crashing after a recoverable PCI port offline event. The hardware was
behaving correctly but, during the recovery process, the kernel went
through code paths that are rarely tested.

In my case, the kernel's recovery process caused issues whose root
cause was hard to find. For instance, recoverable PCI events cause the
device to suddenly go offline, followed later by a PCI re-enumeration
that reinitializes the driver.

The event above caused some real crashes in production, in very
different ways. From those that I investigated, I found:

	1) If the disk was going away, it was causing a file system
	   issue that has already been fixed in 6.14 and 6.15

	2) If the network was going away, it was causing some iommu
	   issues discussed and fixed in [2].

	3) Possibly other issues that were not easy to correlate, such
	   as stalls, hung tasks, memory leaks, warnings, etc.

	  a) These are hidden today, and I would like to expose them
	     with this patch.

In summary, when investigating system issues, there's no trivial way to
determine if a machine has previously experienced hardware problems that
might be contributing to current instability, other than going host by
host and scanning kernel logs.

Proposed Solution
=================

Add a new taint flag to the kernel (HW_ERROR_RECOVERED, for lack of
a better name) that gets set whenever the kernel detects and recovers
from hardware errors.

The taint provides additional context during crash investigation *without*
implying that crashes are necessarily caused by hardware failures
(similar to how the PROPRIETARY_MODULE taint works). It is just extra
information that provides more context about the machine.
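
As an illustration only (this is not part of the patchset), fleet
tooling could test a single bit of the taint bitmask that the kernel
exposes in /proc/sys/kernel/tainted. The sketch below is a minimal
example under stated assumptions: the bit number is hypothetical,
since this cover letter does not say which bit HW_ERROR_RECOVERED
would occupy, and the helper names are made up for the example.

```python
"""Sketch: check one bit of the kernel taint bitmask across hosts."""

# The kernel exposes the taint bitmask as a decimal integer in this file.
TAINTED_PATH = "/proc/sys/kernel/tainted"

# ASSUMPTION: hypothetical bit position. The cover letter does not state
# which bit the proposed HW_ERROR_RECOVERED flag would use.
ASSUMED_HW_ERROR_RECOVERED_BIT = 20


def read_taint_mask(path: str = TAINTED_PATH) -> int:
    """Read the taint bitmask of the local machine (Linux only)."""
    with open(path, encoding="ascii") as f:
        return int(f.read().strip())


def has_taint(mask: int, bit: int) -> bool:
    """Return True if the given bit is set in the taint bitmask."""
    return bool((mask >> bit) & 1)


def hosts_with_taint(masks: dict[str, int], bit: int) -> list[str]:
    """Given host -> taint mask, return the sorted hosts with the bit set."""
    return sorted(host for host, mask in masks.items() if has_taint(mask, bit))
```

On a single host, has_taint(read_taint_mask(),
ASSUMED_HW_ERROR_RECOVERED_BIT) would answer the question directly.
Across a fleet, hosts_with_taint() applied to already collected masks
could replace a host-by-host scan of kernel logs.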

This patchset focuses on ACPI/GHES, which handles most recoverable
hardware errors I have experience with, but can be extended to other
subsystems like EDAC HW_EVENT_ERR_CORRECTED in the future.

--

I would like to thank Tony for the early discussions and
encouragement.

Link: https://lore.kernel.org/all/20250702-add_tain-v1-1-9187b10914b9@debian.org/ [1]
Link: https://lore.kernel.org/all/20250409-page-pool-track-dma-v9-0-6a9ef2e0cba8@redhat.com/ [2]

---
Breno Leitao (2):
      panic: add taint flag for recoverable hardware errors
      acpi/ghes: taint kernel on recovered hardware errors

 Documentation/admin-guide/tainted-kernels.rst | 7 ++++++-
 drivers/acpi/apei/ghes.c                      | 7 +++++--
 include/linux/panic.h                         | 3 ++-
 kernel/panic.c                                | 1 +
 tools/debugging/kernel-chktaint               | 8 ++++++++
 5 files changed, 22 insertions(+), 4 deletions(-)
---
base-commit: dc3cd0dfd91cad0611f0f0eace339a401da5d5ee
change-id: 20250703-taint_recovered-1d2e890a684b

Best regards,
--  
Breno Leitao <leitao@debian.org>



Thread overview: 7+ messages
2025-07-04 10:55 [PATCH 0/2] panic: taint flag for recoverable hardware errors Breno Leitao
2025-07-04 10:55 ` [PATCH 1/2] panic: add " Breno Leitao
2025-07-04 10:55 ` [PATCH 2/2] acpi/ghes: taint kernel on recovered " Breno Leitao
2025-07-04 11:19 ` [PATCH 0/2] panic: taint flag for recoverable " Borislav Petkov
2025-07-04 12:15   ` Breno Leitao
2025-07-04 13:25     ` Borislav Petkov
2025-07-14 17:01       ` Breno Leitao
