linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH 0/4] pci: implement "pci=aer_panic"
@ 2025-05-16 16:55 Hans Zhang
  2025-05-16 16:55 ` [PATCH 1/4] " Hans Zhang
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Hans Zhang @ 2025-05-16 16:55 UTC
  To: bhelgaas, tglx, kw, manivannan.sadhasivam, mahesh
  Cc: oohall, linux-pci, linux-kernel, linuxppc-dev, Hans Zhang

The following series introduces a new kernel command-line option aer_panic
to enhance error handling for PCIe Advanced Error Reporting (AER) in
mission-critical environments. This feature ensures deterministic recovery
from fatal PCIe errors by triggering a controlled kernel panic when device
recovery fails, avoiding indefinite system hangs.

Problem Statement
In systems where unresolved PCIe errors (e.g., bus hangs) occur,
traditional error recovery mechanisms may leave the system unresponsive
indefinitely. This is unacceptable for high-availability environments
requiring prompt recovery via reboot.

Solution
The aer_panic option forces a kernel panic on unrecoverable AER errors.
This bypasses prolonged recovery attempts and ensures immediate reboot.

Patch Summary
- Documentation Update: Adds aer_panic to kernel-parameters.txt, explaining
  its purpose and usage.
- Command-Line Handling: Implements pci=aer_panic parsing and state
  management in the PCI core.
- State Exposure: Introduces pci_aer_panic_enabled() to check whether panic
  mode is active.
- Panic Trigger: Modifies the recovery logic to panic the system when
  recovery fails and aer_panic is enabled.
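
The flow described in the patch summary can be modeled in a small userspace
sketch. This is not the code from the series: only the names pci=aer_panic
and pci_aer_panic_enabled() come from this cover letter. The flag variable,
parse_pci_options(), after_recovery() and both enums are hypothetical
stand-ins, and the real patches may structure the parsing and the panic
path differently.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical flag; the series keeps its own state in the PCI/AER code. */
static bool aer_panic_enabled;

/* Accessor named in patch 3/4; this body is an assumption. */
static bool pci_aer_panic_enabled(void)
{
	return aer_panic_enabled;
}

/*
 * Hypothetical model of parsing the comma-separated value that follows
 * "pci=" on the kernel command line, e.g. "nomsi,aer_panic".
 */
static void parse_pci_options(const char *value)
{
	static const char opt[] = "aer_panic";

	while (*value) {
		size_t len = strcspn(value, ",");

		/* Match the whole token only, not a prefix of a longer one. */
		if (len == strlen(opt) && !strncmp(value, opt, len))
			aer_panic_enabled = true;

		value += len;
		if (*value == ',')
			value++;
	}
}

enum recovery_result { RECOVERED, RECOVERY_FAILED };
enum recovery_action { ACTION_CONTINUE, ACTION_REPORT_FAILURE, ACTION_PANIC };

/*
 * Hypothetical decision helper: escalate to a panic only when recovery
 * failed and the option was given; otherwise keep the existing behavior.
 */
static enum recovery_action after_recovery(enum recovery_result res)
{
	if (res == RECOVERED)
		return ACTION_CONTINUE;

	return pci_aer_panic_enabled() ? ACTION_PANIC : ACTION_REPORT_FAILURE;
}
```

In this model, calling parse_pci_options("nomsi,aer_panic") and then
after_recovery(RECOVERY_FAILED) yields ACTION_PANIC, while the same failure
without the option yields ACTION_REPORT_FAILURE, which matches the "no
default behavior change" goal.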

Impact
- Controlled Recovery: Reduces downtime by replacing hangs with immediate
  reboots.
- Optional: Enabled via pci=aer_panic; no default behavior change.
- Dependency: Requires CONFIG_PCIEAER.

For example, on mobile phones and tablets, when the PCIe link fails and
cannot be restored, there should be a way to make the system panic and
restart right away, rather than leaving it hung until the battery is
completely drained.

---
For example, the sm8250 and sm8350 platforms from qcom panic and restart
the system when the PCIe link goes down.

https://github.com/DOITfit/xiaomi_kernel_sm8250/blob/d42aa408e8cef14f4ec006554fac67ef80b86d0d/drivers/pci/controller/pci-msm.c#L5440

https://github.com/OnePlusOSS/android_kernel_oneplus_sm8350/blob/13ca08fdf0979fdd61d5e8991661874bb2d19150/drivers/net/wireless/cnss2/pci.c#L950


Because each SoC manufacturer's design is different, the AXI and other
buses that PCIe is connected to may not be designed to prevent hangs. Once
a FATAL error occurs on the PCIe link and the link cannot be restored, the
system needs to be restarted.


Dear Mani,

I wonder if you know how other qcom SoCs handle FATAL errors that occur
on the PCIe link.
---

Hans Zhang (4):
  pci: implement "pci=aer_panic"
  PCI/AER: Introduce aer_panic kernel command-line option
  PCI/AER: Expose AER panic state via pci_aer_panic_enabled()
  PCI/AER: Trigger kernel panic on recovery failure if aer_panic is set

 .../admin-guide/kernel-parameters.txt          |  7 +++++++
 drivers/pci/pci.c                              |  2 ++
 drivers/pci/pci.h                              |  4 ++++
 drivers/pci/pcie/aer.c                         | 18 ++++++++++++++++++
 drivers/pci/pcie/err.c                         |  8 ++++++--
 5 files changed, 37 insertions(+), 2 deletions(-)


base-commit: fee3e843b309444f48157e2188efa6818bae85cf
prerequisite-patch-id: 299f33d3618e246cd7c04de10e591ace2d0116e6
prerequisite-patch-id: 482ad0609459a7654a4100cdc9f9aa4b671be50b
-- 
2.25.1




end of thread, other threads:[~2025-05-22 16:02 UTC | newest]

Thread overview: 19+ messages
2025-05-16 16:55 [PATCH 0/4] pci: implement "pci=aer_panic" Hans Zhang
2025-05-16 16:55 ` [PATCH 1/4] " Hans Zhang
2025-05-16 16:55 ` [PATCH 2/4] PCI/AER: Introduce aer_panic kernel command-line option Hans Zhang
2025-05-16 16:55 ` [PATCH 3/4] PCI/AER: Expose AER panic state via pci_aer_panic_enabled() Hans Zhang
2025-05-17  4:07   ` Sathyanarayanan Kuppuswamy
2025-05-19 14:03     ` Hans Zhang
2025-05-16 16:55 ` [PATCH 4/4] PCI/AER: Trigger kernel panic on recovery failure if aer_panic is set Hans Zhang
2025-05-16 18:10 ` [PATCH 0/4] pci: implement "pci=aer_panic" Sathyanarayanan Kuppuswamy
2025-05-19 14:21   ` Hans Zhang
2025-05-19 14:39     ` Hans Zhang
2025-05-19 14:41     ` Hans Zhang
2025-05-20 16:09       ` Sathyanarayanan Kuppuswamy
2025-05-21 14:54         ` Hans Zhang
2025-05-21 16:17           ` Sathyanarayanan Kuppuswamy
2025-05-22  9:33             ` Hans Zhang
2025-05-19 22:03 ` Bjorn Helgaas
2025-05-20 15:11   ` Hans Zhang
2025-05-22 11:47 ` Manivannan Sadhasivam
2025-05-22 16:01   ` Hans Zhang
