[PATCH] PCI/AER: Add option to panic on unrecoverable errors

public inbox for linux-kernel@vger.kernel.org

* [PATCH] PCI/AER: Add option to panic on unrecoverable errors
@ 2026-02-06 18:23 Breno Leitao
  2026-02-06 18:41 ` Lukas Wunner
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Breno Leitao @ 2026-02-06 18:23 UTC (permalink / raw)
  To: Jonathan Corbet, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, kbusch
  Cc: linux-doc, linux-kernel, linuxppc-dev, linux-pci, dcostantino,
	rneu, kernel-team, Breno Leitao

When a device lacks an error_detected callback, AER recovery fails and
the device is left in a disconnected state. This can mask serious
hardware issues during development and testing.

Add a module parameter 'aer_unrecoverable_fatal' that panics the kernel
instead, making such failures immediately visible. The parameter
defaults to false to preserve existing behavior.

Signed-off-by: Breno Leitao <leitao@debian.org>
---
In environments where all hardware must be fully operational, silently
leaving a device in a disconnected state after an AER recovery failure
is unacceptable. This is common in high-reliability systems, production
servers, and testing infrastructure where a degraded system should not
continue running.

This patch adds a module parameter that allows administrators to enforce
a strict policy: if a device cannot recover from an AER error, the
kernel panics instead of continuing with degraded hardware. This ensures
that hardware failures are immediately visible and can trigger
appropriate remediation (restart, failover, alerting).
---
 Documentation/admin-guide/kernel-parameters.txt | 9 +++++++++
 drivers/pci/pcie/err.c                          | 3 +++
 drivers/pci/pcie/portdrv.c                      | 7 +++++++
 drivers/pci/pcie/portdrv.h                      | 1 +
 4 files changed, 20 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 1058f2a6d6a8c..ff95c24280e3c 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5240,6 +5240,15 @@ Kernel parameters
 		nomsi	Do not use MSI for native PCIe PME signaling (this makes
 			all PCIe root ports use INTx for all services).
 
+	pcieportdrv.aer_unrecoverable_fatal=
+			[PCIE] Panic on unrecoverable AER errors:
+		0	Log the error and leave the device in a disconnected
+			state (default).
+		1	Panic the kernel when a device cannot recover from an
+			AER error (no error_detected callback). Useful for
+			high-reliability systems where degraded hardware is
+			unacceptable.
+
 	pcmv=		[HW,PCMCIA] BadgePAD 4
 
 	pd_ignore_unused
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index bebe4bc111d75..788484791902e 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -73,6 +73,9 @@ static int report_error_detected(struct pci_dev *dev,
 		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
 			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
 			pci_info(dev, "can't recover (no error_detected callback)\n");
+			if (aer_unrecoverable_fatal)
+				panic("AER: %s: no error_detected callback\n",
+				      pci_name(dev));
 		} else {
 			vote = PCI_ERS_RESULT_NONE;
 		}
diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
index 38a41ccf79b9a..a411f60ff50ce 100644
--- a/drivers/pci/pcie/portdrv.c
+++ b/drivers/pci/pcie/portdrv.c
@@ -22,6 +22,13 @@
 #include "../pci.h"
 #include "portdrv.h"
 
+#ifdef CONFIG_PCIEAER
+bool aer_unrecoverable_fatal;
+module_param(aer_unrecoverable_fatal, bool, 0644);
+MODULE_PARM_DESC(aer_unrecoverable_fatal,
+		 "Panic if a device cannot recover from an AER error (default: false)");
+#endif
+
 /*
  * The PCIe Capability Interrupt Message Number (PCIe r3.1, sec 7.8.2) must
  * be one of the first 32 MSI-X entries.  Per PCI r3.0, sec 6.8.3.1, MSI
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index bd29d1cc7b8bd..6c67b18de93c9 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -29,6 +29,7 @@ extern bool pcie_ports_dpc_native;
 
 #ifdef CONFIG_PCIEAER
 int pcie_aer_init(void);
+extern bool aer_unrecoverable_fatal;
 #else
 static inline int pcie_aer_init(void) { return 0; }
 #endif

---
base-commit: 6bd9ed02871f22beb0e50690b0c3caf457104f7c
change-id: 20260206-pci-362cf172187f

Best regards,
--  
Breno Leitao <leitao@debian.org>



* Re: [PATCH] PCI/AER: Add option to panic on unrecoverable errors
  2026-02-06 18:23 [PATCH] PCI/AER: Add option to panic on unrecoverable errors Breno Leitao
@ 2026-02-06 18:41 ` Lukas Wunner
  2026-02-06 18:50 ` Keith Busch
  2026-02-06 18:52 ` Bjorn Helgaas
  2 siblings, 0 replies; 9+ messages in thread
From: Lukas Wunner @ 2026-02-06 18:41 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Jonathan Corbet, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, kbusch, linux-doc, linux-kernel, linuxppc-dev,
	linux-pci, dcostantino, rneu, kernel-team, Terry Bowman

On Fri, Feb 06, 2026 at 10:23:11AM -0800, Breno Leitao wrote:
> When a device lacks an error_detected callback, AER recovery fails and
> the device is left in a disconnected state. This can mask serious
> hardware issues during development and testing.
> 
> Add a module parameter 'aer_unrecoverable_fatal' that panics the kernel
> instead, making such failures immediately visible. The parameter
> defaults to false to preserve existing behavior.

There's a parallel effort by Terry Bowman (+cc) to introduce a
PCI_ERS_RESULT_PANIC return value for error handling:

https://lore.kernel.org/all/20260203025244.3093805-4-terry.bowman@amd.com/

Please consider using that as the basis for your needs.

Thanks,

Lukas


* Re: [PATCH] PCI/AER: Add option to panic on unrecoverable errors
  2026-02-06 18:23 [PATCH] PCI/AER: Add option to panic on unrecoverable errors Breno Leitao
  2026-02-06 18:41 ` Lukas Wunner
@ 2026-02-06 18:50 ` Keith Busch
  2026-02-06 18:52 ` Bjorn Helgaas
  2 siblings, 0 replies; 9+ messages in thread
From: Keith Busch @ 2026-02-06 18:50 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Jonathan Corbet, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, linux-doc, linux-kernel, linuxppc-dev, linux-pci,
	dcostantino, rneu, kernel-team

On Fri, Feb 06, 2026 at 10:23:11AM -0800, Breno Leitao wrote:
> When a device lacks an error_detected callback, AER recovery fails and
> the device is left in a disconnected state. This can mask serious
> hardware issues during development and testing.
> 
> Add a module parameter 'aer_unrecoverable_fatal' that panics the kernel
> instead, making such failures immediately visible. The parameter
> defaults to false to preserve existing behavior.

Sounds like a good idea. There used to be a code comment suggesting
there are probably conditions where you want this panic behavior but it
was removed with commit:

  b06d125e6280603a34d9064cd9c12748ca2edb04

Which I'm not sure was an accurate thing to do as it assumes the system
can remain operational without recovering, and that's just not always
the case.

> @@ -73,6 +73,9 @@ static int report_error_detected(struct pci_dev *dev,
>  		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
>  			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
>  			pci_info(dev, "can't recover (no error_detected callback)\n");
> +			if (aer_unrecoverable_fatal)
> +				panic("AER: %s: no error_detected callback\n",
> +				      pci_name(dev));

Is this the only condition to which the panic behavior should apply? I
feel like we may want to defer the panic to the recovery failed case and
even include the "disconnect" condition. Maybe something like this?

---
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index bebe4bc111d75..c5a631e2b565b 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -295,5 +295,9 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 
 	pci_info(bridge, "device recovery failed\n");
 
+	if (aer_unrecoverable_fatal &&
+	    (status == PCI_ERS_RESULT_DISCONNECT ||
+	     status == PCI_ERS_RESULT_NO_AER_DRIVER))
+		panic("AER: %s: can not continue, status:%d\n", pci_name(dev), status);
+
 	return status;
 }
--
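
The condition in the snippet above can be exercised outside the kernel
with a small userspace sketch. This is not kernel code: the enum is an
illustrative stand-in for pci_ers_result_t with arbitrary values, and
should_panic() is a hypothetical helper name that appears nowhere in the
patch. It only shows that the panic is gated on both the parameter and
the final recovery status.

```c
#include <stdbool.h>

/*
 * Illustrative stand-in for the kernel's pci_ers_result_t.
 * The numeric values are arbitrary and chosen only for this sketch.
 */
enum ers_result {
	ERS_RESULT_NONE,
	ERS_RESULT_CAN_RECOVER,
	ERS_RESULT_NEED_RESET,
	ERS_RESULT_DISCONNECT,
	ERS_RESULT_RECOVERED,
	ERS_RESULT_NO_AER_DRIVER,
};

/*
 * Hypothetical helper: returns true when a failed recovery should
 * escalate to a panic. The flag models the proposed module parameter.
 */
bool should_panic(bool unrecoverable_fatal, enum ers_result status)
{
	if (!unrecoverable_fatal)
		return false;

	return status == ERS_RESULT_DISCONNECT ||
	       status == ERS_RESULT_NO_AER_DRIVER;
}
```

With the flag cleared the helper never asks for a panic, which matches
the "defaults to false to preserve existing behavior" requirement of the
original patch.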


* Re: [PATCH] PCI/AER: Add option to panic on unrecoverable errors
  2026-02-06 18:23 [PATCH] PCI/AER: Add option to panic on unrecoverable errors Breno Leitao
  2026-02-06 18:41 ` Lukas Wunner
  2026-02-06 18:50 ` Keith Busch
@ 2026-02-06 18:52 ` Bjorn Helgaas
  2026-02-06 19:22   ` Keith Busch
  2026-02-09 14:28   ` Breno Leitao
  2 siblings, 2 replies; 9+ messages in thread
From: Bjorn Helgaas @ 2026-02-06 18:52 UTC (permalink / raw)
  To: Breno Leitao
  Cc: Jonathan Corbet, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, kbusch, linux-doc, linux-kernel, linuxppc-dev,
	linux-pci, dcostantino, rneu, kernel-team

On Fri, Feb 06, 2026 at 10:23:11AM -0800, Breno Leitao wrote:
> When a device lacks an error_detected callback, AER recovery fails and
> the device is left in a disconnected state. This can mask serious
> hardware issues during development and testing.
> 
> Add a module parameter 'aer_unrecoverable_fatal' that panics the kernel
> instead, making such failures immediately visible. The parameter
> defaults to false to preserve existing behavior.
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>
> ---
> In environments where all hardware must be fully operational, silently
> leaving a device in a disconnected state after an AER recovery failure
> is unacceptable. This is common in high-reliability systems, production
> servers, and testing infrastructure where a degraded system should not
> continue running.
> 
> This patch adds a module parameter that allows administrators to enforce
> a strict policy: if a device cannot recover from an AER error, the
> kernel panics instead of continuing with degraded hardware. This ensures
> that hardware failures are immediately visible and can trigger
> appropriate remediation (restart, failover, alerting).
> ---
>  Documentation/admin-guide/kernel-parameters.txt | 9 +++++++++
>  drivers/pci/pcie/err.c                          | 3 +++
>  drivers/pci/pcie/portdrv.c                      | 7 +++++++
>  drivers/pci/pcie/portdrv.h                      | 1 +
>  4 files changed, 20 insertions(+)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 1058f2a6d6a8c..ff95c24280e3c 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5240,6 +5240,15 @@ Kernel parameters
>  		nomsi	Do not use MSI for native PCIe PME signaling (this makes
>  			all PCIe root ports use INTx for all services).
>  
> +	pcieportdrv.aer_unrecoverable_fatal=
> +			[PCIE] Panic on unrecoverable AER errors:
> +		0	Log the error and leave the device in a disconnected
> +			state (default).
> +		1	Panic the kernel when a device cannot recover from an
> +			AER error (no error_detected callback). Useful for
> +			high-reliability systems where degraded hardware is
> +			unacceptable.

Just from an overall complexity point of view, I'm a little hesitant
to add new kernel parameters because this seems like a very specific
case.

Is there anything we could do to improve the logging to make the issue
more recognizable?  I assume you already look for KERN_CRIT, KERN_ERR,
etc., but it looks like the current message is just KERN_INFO.  I
think we could make a good case for at least KERN_WARNING.

But I guess you probably want something that's just impossible to
ignore.

Are there any other similar flags you already use that we could
piggy-back on?  E.g., if we raised the level to KERN_WARNING, maybe
the existing "panic_on_warn" would be enough?

> +++ b/drivers/pci/pcie/err.c
> @@ -73,6 +73,9 @@ static int report_error_detected(struct pci_dev *dev,
>  		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
>  			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
>  			pci_info(dev, "can't recover (no error_detected callback)\n");
> +			if (aer_unrecoverable_fatal)
> +				panic("AER: %s: no error_detected callback\n",
> +				      pci_name(dev));
>  		} else {
>  			vote = PCI_ERS_RESULT_NONE;
>  		}


* Re: [PATCH] PCI/AER: Add option to panic on unrecoverable errors
  2026-02-06 18:52 ` Bjorn Helgaas
@ 2026-02-06 19:22   ` Keith Busch
  2026-02-06 20:53     ` Lukas Wunner
  2026-02-09 14:28   ` Breno Leitao
  1 sibling, 1 reply; 9+ messages in thread
From: Keith Busch @ 2026-02-06 19:22 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Breno Leitao, Jonathan Corbet, Mahesh J Salgaonkar,
	Oliver O'Halloran, Bjorn Helgaas, linux-doc, linux-kernel,
	linuxppc-dev, linux-pci, dcostantino, rneu, kernel-team

On Fri, Feb 06, 2026 at 12:52:32PM -0600, Bjorn Helgaas wrote:
> Just from an overall complexity point of view, I'm a little hesitant
> to add new kernel parameters because this seems like a very specific
> case.
> 
> Is there anything we could do to improve the logging to make the issue
> more recognizable?  I assume you already look for KERN_CRIT, KERN_ERR,
> etc., but it looks like the current message is just KERN_INFO.  I
> think we could make a good case for at least KERN_WARNING.
> 
> But I guess you probably want something that's just impossible to
> ignore.

It's not necessarily about improving visibility with a higher alert
level. It's more that the system can't be trusted to operate correctly
from here on. Consider an interconnected GPU setup and only one
experiences an unrecoverable error. We don't want to leave the system
limping along with this unresolved error as it can't perform anything
useful. A panic-induced reboot is the least bad option to return the
system to operation, and it crashes the system temporally close to the
failure, which gets logs for the vendor if we're actively debugging.

> Are there any other similar flags you already use that we could
> piggy-back on?  E.g., if we raised the level to KERN_WARNING, maybe
> the existing "panic_on_warn" would be enough?

There are many KERN_WARNING messages that don't rise to the level of
warranting a 'panic', so we don't want to enable such an option in
production. It looks like panic_on_warn was introduced for developer
debugging.

I agree the current INFO level is too low for the generic unrecovered
condition, though.


* Re: [PATCH] PCI/AER: Add option to panic on unrecoverable errors
  2026-02-06 19:22   ` Keith Busch
@ 2026-02-06 20:53     ` Lukas Wunner
  2026-02-06 21:10       ` Lukas Wunner
  2026-02-07  5:55       ` Keith Busch
  0 siblings, 2 replies; 9+ messages in thread
From: Lukas Wunner @ 2026-02-06 20:53 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bjorn Helgaas, Breno Leitao, Jonathan Corbet, Mahesh J Salgaonkar,
	Oliver O'Halloran, Bjorn Helgaas, linux-doc, linux-kernel,
	linuxppc-dev, linux-pci, dcostantino, rneu, kernel-team

On Fri, Feb 06, 2026 at 12:22:44PM -0700, Keith Busch wrote:
> On Fri, Feb 06, 2026 at 12:52:32PM -0600, Bjorn Helgaas wrote:
> > Are there any other similar flags you already use that we could
> > piggy-back on?  E.g., if we raised the level to KERN_WARNING, maybe
> > the existing "panic_on_warn" would be enough?
> 
> There are many KERN_WARNING messages that don't rise to the level of
> warranting a 'panic', so we don't want to enable such an option in
> production. It looks like panic_on_warn was introduced for developer
> debugging.

panic_on_warn springs into action on WARN() splats, not arbitrary
messages with KERN_WARNING severity.  Also, sysctl kernel.warn_limit
may be used to grant a certain number of panic-free WARNs.
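
The distinction can be sketched as a small userspace model. This is not
the kernel implementation: struct warn_state and both function names are
hypothetical, and the exact threshold semantics of warn_limit (0 meaning
"no limit", panic once the count reaches the limit) are an assumption of
this sketch rather than something stated in this thread.

```c
#include <stdbool.h>

/* Hypothetical model of the two knobs discussed above. */
struct warn_state {
	bool panic_on_warn;      /* models kernel.panic_on_warn */
	unsigned int warn_limit; /* models kernel.warn_limit, 0 = no limit (assumed) */
	unsigned int warn_count; /* number of WARN() splats seen so far */
};

/*
 * Models a plain message logged at KERN_WARNING severity: it is only
 * logged, so the panic settings are never consulted.
 * Returns true if the system would panic (never, in this model).
 */
bool log_at_warning_level(struct warn_state *s)
{
	(void)s;
	return false;
}

/*
 * Models a WARN() splat: this is the path that honors panic_on_warn
 * and, failing that, counts against warn_limit.
 * Returns true if the system would panic.
 */
bool warn_splat(struct warn_state *s)
{
	if (s->panic_on_warn)
		return true;

	s->warn_count++;
	return s->warn_limit != 0 && s->warn_count >= s->warn_limit;
}
```

In this model, raising a message to KERN_WARNING changes nothing about
panic behavior; only converting it into a WARN() would.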

FWIW, the "pcieportdrv.aer_unrecoverable_fatal" parameter introduced
by this patch feels somewhat oddly named.  Something like
"pci.panic_on_fatal" might be clearer and more succinct.

> I agree the current INFO level is too low for the generic unrecovered
> condition, though.

At least for unbound devices, I think 918b4053184c went way too far.
I think an unbound device should generally be considered recoverable
through a reset.

As for bound devices whose drivers lack pci_error_handlers, it has been
painful in practice that they're considered unrecoverable wholesale.
E.g. GPUs often expose an audio device as well as telemetry devices,
all arranged below an integrated PCIe switch.  All of these devices
need drivers with pci_error_handlers in order for the GPU to be
recoverable.  In some cases, dummy callbacks were added to render
the whole thing recoverable.

So I wouldn't consider 918b4053184c to have been a universally successful
approach and I fear that this patch goes even further.

Thanks,

Lukas


* Re: [PATCH] PCI/AER: Add option to panic on unrecoverable errors
  2026-02-06 20:53     ` Lukas Wunner
@ 2026-02-06 21:10       ` Lukas Wunner
  2026-02-07  5:55       ` Keith Busch
  1 sibling, 0 replies; 9+ messages in thread
From: Lukas Wunner @ 2026-02-06 21:10 UTC (permalink / raw)
  To: Keith Busch
  Cc: Bjorn Helgaas, Breno Leitao, Jonathan Corbet, Mahesh J Salgaonkar,
	Oliver O'Halloran, Bjorn Helgaas, linux-doc, linux-kernel,
	linuxppc-dev, linux-pci, dcostantino, rneu, kernel-team

On Fri, Feb 06, 2026 at 09:53:39PM +0100, Lukas Wunner wrote:
> So I wouldn't consider 918b4053184c to have been a universally successful
> approach and I fear that this patch goes even further.

Forgot to mention -- there's another problem:

PCI_ERS_RESULT_NO_AER_DRIVER is obviously AER-specific.

powerpc (EEH) and s390 have error recovery mechanisms separate from AER
and we've been trying to align them more closely so that drivers don't
need to be aware of platform-specific behavior.

eeh_pe_report_edev() does not modify the pci_ers_result for unbound
drivers and those without pci_error_handlers.  And the default is
PCI_ERS_RESULT_NONE.  eeh_report_error() also returns PCI_ERS_RESULT_NONE
for drivers without ->error_detected() callback.

In the PCI_ERS_RESULT_NONE case, EEH seems to perform a reset and
assume successful recovery.

It's only AER that is this strict about unbound devices and drivers that
lack pci_error_handlers.

If anything we should try to *reduce* deviations between the various
error recovery mechanisms, not double down on increasing them.
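
The deviation described above can be put side by side in a minimal
userspace sketch. It is not kernel code: the enum is a two-value
stand-in for pci_ers_result_t and all three function names are
hypothetical. The AER branch mirrors the report_error_detected() hunk
quoted in this thread; the EEH branch only restates the behavior
described above and is not derived from the EEH sources.

```c
#include <stdbool.h>

/* Two-value stand-in for pci_ers_result_t, for illustration only. */
enum ers_result {
	ERS_NONE,          /* models PCI_ERS_RESULT_NONE */
	ERS_NO_AER_DRIVER, /* models PCI_ERS_RESULT_NO_AER_DRIVER */
};

/*
 * AER vote for a device with no error_detected callback: bridges
 * vote NONE, everything else votes NO_AER_DRIVER.
 */
enum ers_result aer_vote_without_callback(bool is_bridge)
{
	return is_bridge ? ERS_NONE : ERS_NO_AER_DRIVER;
}

/*
 * EEH result for a device with no error_detected callback, as
 * described above: the default of NONE is left untouched.
 */
enum ers_result eeh_vote_without_callback(void)
{
	return ERS_NONE;
}

/* Only the AER-specific vote marks the device as unrecoverable. */
bool treated_as_unrecoverable(enum ers_result vote)
{
	return vote == ERS_NO_AER_DRIVER;
}
```

The same non-bridge device without a callback is unrecoverable under the
AER model and a reset candidate under the EEH model, which is the
asymmetry a panic option keyed to PCI_ERS_RESULT_NO_AER_DRIVER would
widen.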

Thanks,

Lukas


* Re: [PATCH] PCI/AER: Add option to panic on unrecoverable errors
  2026-02-06 20:53     ` Lukas Wunner
  2026-02-06 21:10       ` Lukas Wunner
@ 2026-02-07  5:55       ` Keith Busch
  1 sibling, 0 replies; 9+ messages in thread
From: Keith Busch @ 2026-02-07  5:55 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Bjorn Helgaas, Breno Leitao, Jonathan Corbet, Mahesh J Salgaonkar,
	Oliver O'Halloran, Bjorn Helgaas, linux-doc, linux-kernel,
	linuxppc-dev, linux-pci, dcostantino, rneu, kernel-team

On Fri, Feb 06, 2026 at 09:53:39PM +0100, Lukas Wunner wrote:
> On Fri, Feb 06, 2026 at 12:22:44PM -0700, Keith Busch wrote:
> > On Fri, Feb 06, 2026 at 12:52:32PM -0600, Bjorn Helgaas wrote:
> > > Are there any other similar flags you already use that we could
> > > piggy-back on?  E.g., if we raised the level to KERN_WARNING, maybe
> > > the existing "panic_on_warn" would be enough?
> > 
> > > There are many KERN_WARNING messages that don't rise to the level of
> > > warranting a 'panic', so we don't want to enable such an option in
> > > production. It looks like panic_on_warn was introduced for developer
> > > debugging.
> 
> panic_on_warn springs into action on WARN() splats, not arbitrary
> messages with KERN_WARNING severity.  Also, sysctl kernel.warn_limit
> may be used to grant a certain number of panic-free WARNs.

Okay, but the warn panic param still isn't an option for production.

> FWIW, the "pcieportdrv.aer_unrecoverable_fatal" parameter introduced
> by this patch feels somewhat oddly named.  Something like
> "pci.panic_on_fatal" might be clearer and more succinct.

Naming is hard; thanks for the suggestion.

> > I agree the current INFO level is too low for the generic unrecovered
> > condition, though.
> 
> At least for unbound devices, I think 918b4053184c went way too far.
> I think an unbound device should generally be considered recoverable
> through a reset.

Yes, I agree, especially considering the generic probe saves a
checkpoint of the state that we can restore to that is consistent with
the kernel's view. There's no clear reason to fail recovery when there's
no bound driver, so changing that behavior is a good idea.

> As for bound devices whose drivers lack pci_error_handlers, it has been
> painful in practice that they're considered unrecoverable wholesale.

Yes, it gets tricky when there is a bound driver; there's no telling
whether or not it may initiate a broken transaction with cascading
consequences for the rest of the system if anything in the chain is not
cooperating with the error recovery orchestration. I don't know if there
is a best default action, so allowing it to be user defined seems okay.

> E.g. GPUs often expose an audio device as well as telemetry devices,
> all arranged below an integrated PCIe switch.  All of these devices
> need drivers with pci_error_handlers in order for the GPU to be
> recoverable.  In some cases, dummy callbacks were added to render
> the whole thing recoverable.

This experience sounds familiar, and it really does appear that a hard
reboot is the best outcome in many cases because orchestrating all the
components to recover is not going to happen. Hence the reboot param.

> So I wouldn't consider 918b4053184c to have been a universally successful
> approach and I fear that this patch goes even further.

If anyone goes through the effort of fixing that, will it be considered?
You told me in Vienna LPC '24 that you'd help resolve the pci hotplug
deadlocks that have been plaguing pci for the last 10 years, but not a
single comment has happened despite multiple complete and validated
solutions offered.


* Re: [PATCH] PCI/AER: Add option to panic on unrecoverable errors
  2026-02-06 18:52 ` Bjorn Helgaas
  2026-02-06 19:22   ` Keith Busch
@ 2026-02-09 14:28   ` Breno Leitao
  1 sibling, 0 replies; 9+ messages in thread
From: Breno Leitao @ 2026-02-09 14:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Jonathan Corbet, Mahesh J Salgaonkar, Oliver O'Halloran,
	Bjorn Helgaas, kbusch, linux-doc, linux-kernel, linuxppc-dev,
	linux-pci, dcostantino, rneu, kernel-team

Hello Bjorn,

On Fri, Feb 06, 2026 at 12:52:32PM -0600, Bjorn Helgaas wrote:
> On Fri, Feb 06, 2026 at 10:23:11AM -0800, Breno Leitao wrote:
> Is there anything we could do to improve the logging to make the issue
> more recognizable?  I assume you already look for KERN_CRIT, KERN_ERR,
> etc., but it looks like the current message is just KERN_INFO.  I
> think we could make a good case for at least KERN_WARNING.
>
> But I guess you probably want something that's just impossible to
> ignore.
>
> Are there any other similar flags you already use that we could
> piggy-back on?  E.g., if we raised the level to KERN_WARNING, maybe
> the existing "panic_on_warn" would be enough?

Let me provide context on what we observe in production environments.

We manage a fleet of machines that regularly encounter AER errors. The
typical failure pattern we see involves:

1) AER errors on devices (sometimes with proprietary drivers):

	{2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 302
	 	0009:01:00.0:    [22] UncorrIntErr

2) The device enters an unrecoverable state where any subsequent access
   triggers additional failures.

3) The driver continues attempting hardware access, which generates
   cascading errors. On arm64, we observe sequences like:

	arm-smmu-v3 arm-smmu-v3.13.auto: unexpected global error reported (0x00000001), this could be serious
	arm-smmu-v3 arm-smmu-v3.13.auto: CMDQ error (cons 0x030120f3): ATC invalidate timeout
	..
	watchdog: CPU75: Watchdog detected hard LOCKUP on cpu 76

4) For NIC uncorrectable errors, we see:

	pcieport 0007:00:00.0: DPC: containment event, status:0x2009: unmasked uncorrectable error detected
	mlx5_core 0017:01:00.0 eth1: ERR CQE on SQ: 0x128b
	mlx5_core 0017:01:00.0 eth1: hw csum failure
	mlx5_core 0007:01:00.0 eth0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
	WARNING: CPU: 32 PID: 0 at drivers/iommu/dma-iommu.c:1237 iommu_dma_unmap_phys+0xd0/0xe0 (in a loop)


Keith and I discussed several approaches (all untested except the last
one -- this patch):

a) Mark the device as disconnected when recovery fails:

	diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
	index 6b697654d654..405aac6085a1 100644
	--- a/drivers/pci/pcie/err.c
	+++ b/drivers/pci/pcie/err.c
	@@ -271,6 +271,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
	     return status;

	 failed:
	+    pci_walk_bridge(bridge, pci_dev_set_disconnected, NULL);
	     pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);

	     pci_uevent_ers(bridge, PCI_ERS_RESULT_DISCONNECT);

b) Remove the device from the bus entirely:

	diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
	index 6b697654d6546..33559a0022318 100644
	--- a/drivers/pci/pcie/err.c
	+++ b/drivers/pci/pcie/err.c

		cb(bridge, userdata);
	}

	+static void pci_err_detach_subordinate(struct pci_dev *bridge)
	+{
	+    struct pci_dev *dev, *tmp;
	+    int ret;
	+
	+    pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
	+
	+    ret = pci_trylock_rescan_remove(bridge);
	+    if (!ret)
	+        return;
	+
	+    list_for_each_entry_safe_reverse(dev, tmp, &bridge->devices, bus_list) {
	+        pci_dev_get(dev);
	+        pci_stop_and_remove_bus_device(dev);
	+        pci_dev_put(dev);
	+    }
	+    pci_unlock_rescan_remove();
	+}
	+
	pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
		pci_channel_state_t state,
		pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
	@@ -271,6 +290,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
	return status;

	failed:
	+    pci_err_detach_subordinate(bridge);
	pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);

	pci_uevent_ers(bridge, PCI_ERS_RESULT_DISCONNECT);

c) Panic the system (this patch).

The key issue is that simply raising the log level to KERN_WARNING
wouldn't address the fundamental problem. Once recovery fails, the system
becomes unstable and eventually crashes with varied symptoms (soft lockup,
hard lockup, BUG). These different crash signatures make correlation
difficult and prevent effective tracking of the root cause.

As Keith suggested, panicking immediately when a device is unrecoverable
appears to be the most appropriate approach for our use case. While the
other options may have merit in different scenarios, they don't adequately
address our stability requirements.

Thanks for the review and suggestions,
--breno

