Re: [PATCH] PCI/AER: Add option to panic on unrecoverable errors

From: Breno Leitao <leitao@debian.org>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: Jonathan Corbet <corbet@lwn.net>,
	 Mahesh J Salgaonkar <mahesh@linux.ibm.com>,
	Oliver O'Halloran <oohall@gmail.com>,
	 Bjorn Helgaas <bhelgaas@google.com>,
	kbusch@kernel.org, linux-doc@vger.kernel.org,
	 linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	linux-pci@vger.kernel.org,  dcostantino@meta.com, rneu@meta.com,
	kernel-team@meta.com
Subject: Re: [PATCH] PCI/AER: Add option to panic on unrecoverable errors
Date: Mon, 9 Feb 2026 06:28:40 -0800
Message-ID: <aYnour-Z8rm8pW2D@gmail.com>
In-Reply-To: <20260206185232.GA70936@bhelgaas>

Hello Bjorn,

On Fri, Feb 06, 2026 at 12:52:32PM -0600, Bjorn Helgaas wrote:
> On Fri, Feb 06, 2026 at 10:23:11AM -0800, Breno Leitao wrote:
> Is there anything we could do to improve the logging to make the issue
> more recognizable?  I assume you already look for KERN_CRIT, KERN_ERR,
> etc., but it looks like the current message is just KERN_INFO.  I
> think we could make a good case for at least KERN_WARNING.
>
> But I guess you probably want something that's just impossible to
> ignore.
>
> Are there any other similar flags you already use that we could
> piggy-back on?  E.g., if we raised the level to KERN_WARNING, maybe
> the existing "panic_on_warn" would be enough?

Let me provide context on what we observe in production environments.

We manage a fleet of machines that regularly encounter AER errors. The
typical failure pattern we see involves:

1) AER errors on devices (sometimes with proprietary drivers):

	{2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 302
	 	0009:01:00.0:    [22] UncorrIntErr

2) The device enters an unrecoverable state where any subsequent access
   triggers additional failures.

3) The driver continues attempting hardware access, which generates
   cascading errors. On arm64, we observe sequences like:

	arm-smmu-v3 arm-smmu-v3.13.auto: unexpected global error reported (0x00000001), this could be serious
	arm-smmu-v3 arm-smmu-v3.13.auto: CMDQ error (cons 0x030120f3): ATC invalidate timeout
	..
	watchdog: CPU75: Watchdog detected hard LOCKUP on cpu 76

4) For NIC uncorrectable errors, we see:

	pcieport 0007:00:00.0: DPC: containment event, status:0x2009: unmasked uncorrectable error detected
	mlx5_core 0017:01:00.0 eth1: ERR CQE on SQ: 0x128b
	mlx5_core 0017:01:00.0 eth1: hw csum failure
	mlx5_core 0007:01:00.0 eth0: mlx5e_ethtool_get_link_ksettings: query port ptys failed: -67
	WARNING: CPU: 32 PID: 0 at drivers/iommu/dma-iommu.c:1237 iommu_dma_unmap_phys+0xd0/0xe0 (in a loop)


Keith and I discussed several approaches (all untested except the last
one -- this patch):

a) Mark the device as disconnected when recovery fails:

	diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
	index 6b697654d654..405aac6085a1 100644
	--- a/drivers/pci/pcie/err.c
	+++ b/drivers/pci/pcie/err.c
	@@ -271,6 +271,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
	     return status;

	 failed:
	+    pci_walk_bridge(bridge, pci_dev_set_disconnected, NULL);
	     pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);

	     pci_uevent_ers(bridge, PCI_ERS_RESULT_DISCONNECT);

b) Remove the device from the bus entirely:

	diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
	index 6b697654d6546..33559a0022318 100644
	--- a/drivers/pci/pcie/err.c
	+++ b/drivers/pci/pcie/err.c

		cb(bridge, userdata);
	}

	+static void pci_err_detach_subordinate(struct pci_dev *bridge)
	+{
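	+    struct pci_bus *parent = bridge->subordinate;
	+    struct pci_dev *dev, *tmp;
	+    int ret;
	+
	+    /* Nothing to detach if the bridge has no secondary bus. */
	+    if (!parent)
	+        return;
	+
	+    pci_walk_bus(parent, pci_dev_set_disconnected, NULL);
	+
	+    ret = pci_trylock_rescan_remove(bridge);
	+    if (!ret)
	+        return;
	+
	+    /* Child devices hang off the secondary bus, not the bridge's pci_dev. */
	+    list_for_each_entry_safe_reverse(dev, tmp, &parent->devices, bus_list) {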
	+        pci_dev_get(dev);
	+        pci_stop_and_remove_bus_device(dev);
	+        pci_dev_put(dev);
	+    }
	+    pci_unlock_rescan_remove();
	+}
	+
	pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
		pci_channel_state_t state,
		pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
	@@ -271,6 +290,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
	     return status;

	 failed:
	+    pci_err_detach_subordinate(bridge);
	     pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);

	     pci_uevent_ers(bridge, PCI_ERS_RESULT_DISCONNECT);

c) Panic the system (this patch).
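
As a rough illustration of what option (a) is after, the user-space C
sketch below models a device whose accesses are short-circuited once it
has been marked disconnected. It is not kernel code: struct model_dev,
model_set_disconnected() and model_read32() are hypothetical names made
up for this example, and returning all-ones for a dead device is an
assumption based on how reads from a missing PCI device usually look.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a PCI device; NOT the kernel's struct pci_dev. */
struct model_dev {
	bool disconnected;        /* set once error recovery has failed */
	unsigned int hw_accesses; /* accesses that would have reached hardware */
	uint32_t reg;             /* pretend register contents */
};

/* Models option (a): flag the device so later accesses can be skipped. */
void model_set_disconnected(struct model_dev *dev)
{
	assert(dev != NULL);
	dev->disconnected = true;
}

/*
 * Read a register. Once the device is flagged, return all-ones without
 * touching the (modelled) hardware, so no new error can be triggered.
 */
uint32_t model_read32(struct model_dev *dev)
{
	assert(dev != NULL);
	if (dev->disconnected)
		return UINT32_MAX;

	dev->hw_accesses++;
	return dev->reg;
}
```

In this model, a read before model_set_disconnected() increments
hw_accesses, while every read afterwards returns UINT32_MAX and leaves
the counter alone. That is the kind of damping (a) and (b) aim for; it
only helps if the driver actually goes through the gated accessors.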

The key issue is that simply raising the log level to KERN_WARNING
wouldn't address the fundamental problem. Once recovery fails, the system
becomes unstable and eventually crashes with varied symptoms (soft lockup,
hard lockup, BUG). These different crash signatures make correlation
difficult and prevent effective tracking of the root cause.

As Keith suggested, panicking immediately when a device is unrecoverable
appears to be the most appropriate approach for our use case. While the
other options may have merit in different scenarios, they don't adequately
address our stability requirements.
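
The policy behind option (c) can be sketched in a few lines of
user-space C. This is only a model of the decision, not the code in the
patch: enum model_result, enum model_action, model_on_recovery_done()
and the panic_on_fatal flag are all hypothetical names, and the real
option name and plumbing may differ.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical outcome of a recovery attempt; not the kernel's enum. */
enum model_result {
	MODEL_RECOVERED,  /* device is usable again */
	MODEL_DISCONNECT, /* recovery failed, device is unrecoverable */
};

/* What the (modelled) error path does next. */
enum model_action {
	ACTION_CONTINUE, /* nothing special, keep running */
	ACTION_LOG_ONLY, /* report the failure and keep running */
	ACTION_PANIC,    /* stop at once with a single, clear signature */
};

/*
 * Decide what to do after recovery finishes. panic_on_fatal stands in
 * for an opt-in administrator setting (assumed name, default off).
 */
enum model_action model_on_recovery_done(enum model_result res,
					 bool panic_on_fatal)
{
	assert(res == MODEL_RECOVERED || res == MODEL_DISCONNECT);

	if (res == MODEL_RECOVERED)
		return ACTION_CONTINUE;

	return panic_on_fatal ? ACTION_PANIC : ACTION_LOG_ONLY;
}
```

The point of the sketch is that the panic is tied to the recovery
outcome rather than to a log level, so an unrecoverable device produces
one consistent crash signature instead of a later lockup or BUG.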

Thanks for the review and suggestions,
--breno
