linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH 0/4] pci: implement "pci=aer_panic"
@ 2025-05-16 16:55 Hans Zhang
  2025-05-16 16:55 ` [PATCH 1/4] " Hans Zhang
                   ` (6 more replies)
  0 siblings, 7 replies; 19+ messages in thread
From: Hans Zhang @ 2025-05-16 16:55 UTC
  To: bhelgaas, tglx, kw, manivannan.sadhasivam, mahesh
  Cc: oohall, linux-pci, linux-kernel, linuxppc-dev, Hans Zhang

The following series introduces a new kernel command-line option, aer_panic,
to enhance error handling for PCIe Advanced Error Reporting (AER) in
mission-critical environments. This feature ensures deterministic recovery
from fatal PCIe errors by triggering a controlled kernel panic when device
recovery fails, avoiding indefinite system hangs.

Problem Statement
In systems where unresolved PCIe errors (e.g., bus hangs) occur,
traditional error recovery mechanisms may leave the system unresponsive
indefinitely. This is unacceptable for high-availability environments
that require prompt recovery via reboot.

Solution
The aer_panic option forces a kernel panic on unrecoverable AER errors.
This bypasses prolonged recovery attempts and ensures immediate reboot.

Patch Summary:
Documentation Update: Adds aer_panic to kernel-parameters.txt, explaining
its purpose and usage.

Command-Line Handling: Implements pci=aer_panic parsing and state
management in PCI core.

State Exposure: Introduces pci_aer_panic_enabled() to check if the panic
mode is active.

Panic Trigger: Modifies recovery logic to panic the system when recovery
fails and aer_panic is enabled.

Impact
Controlled Recovery: Reduces downtime by replacing hangs with immediate
reboots.

Optional: Enabled via pci=aer_panic; no default behavior change.

Dependency: Requires CONFIG_PCIEAER.

For example, on mobile phones and tablets, when the PCIe link fails and
cannot be restored, the system should be able to panic and restart rather
than stay hung until the battery is completely drained.

---
For example, Qualcomm's sm8250 and sm8350 kernels panic and restart the
system when the PCIe link goes down.

https://github.com/DOITfit/xiaomi_kernel_sm8250/blob/d42aa408e8cef14f4ec006554fac67ef80b86d0d/drivers/pci/controller/pci-msm.c#L5440

https://github.com/OnePlusOSS/android_kernel_oneplus_sm8350/blob/13ca08fdf0979fdd61d5e8991661874bb2d19150/drivers/net/wireless/cnss2/pci.c#L950


Because SoC designs differ between manufacturers, the AXI and other buses
that PCIe connects to are not always designed to prevent hangs. Once a
FATAL error occurs on the PCIe link and the link cannot be restored, the
system needs to be restarted.


Dear Mani,

Do you know how other Qualcomm SoCs handle FATAL errors that occur on the
PCIe link?
---

Hans Zhang (4):
  pci: implement "pci=aer_panic"
  PCI/AER: Introduce aer_panic kernel command-line option
  PCI/AER: Expose AER panic state via pci_aer_panic_enabled()
  PCI/AER: Trigger kernel panic on recovery failure if aer_panic is set

 .../admin-guide/kernel-parameters.txt          |  7 +++++++
 drivers/pci/pci.c                              |  2 ++
 drivers/pci/pci.h                              |  4 ++++
 drivers/pci/pcie/aer.c                         | 18 ++++++++++++++++++
 drivers/pci/pcie/err.c                         |  8 ++++++--
 5 files changed, 37 insertions(+), 2 deletions(-)


base-commit: fee3e843b309444f48157e2188efa6818bae85cf
prerequisite-patch-id: 299f33d3618e246cd7c04de10e591ace2d0116e6
prerequisite-patch-id: 482ad0609459a7654a4100cdc9f9aa4b671be50b
-- 
2.25.1




* [PATCH 1/4] pci: implement "pci=aer_panic"
  2025-05-16 16:55 [PATCH 0/4] pci: implement "pci=aer_panic" Hans Zhang
@ 2025-05-16 16:55 ` Hans Zhang
  2025-05-16 16:55 ` [PATCH 2/4] PCI/AER: Introduce aer_panic kernel command-line option Hans Zhang
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Hans Zhang @ 2025-05-16 16:55 UTC
  To: bhelgaas, tglx, kw, manivannan.sadhasivam, mahesh
  Cc: oohall, linux-pci, linux-kernel, linuxppc-dev, Hans Zhang

Add a new "aer_panic" parameter to force kernel panic on unrecoverable
PCIe Advanced Error Reporting (AER) errors. This is designed for systems
where unresolved PCIe bus hangs require immediate reboot to maintain
service availability.

The option can be enabled via "pci=aer_panic" on the kernel command line.
It prepares for safer error handling in mission-critical environments
by bypassing indefinite hangs and triggering controlled panic.

Signed-off-by: Hans Zhang <18255117159@163.com>
---
 Documentation/admin-guide/kernel-parameters.txt | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 8f75ec177399..a4a221bb1636 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -4679,6 +4679,13 @@
 		noaer		[PCIE] If the PCIEAER kernel config parameter is
 				enabled, this kernel boot option can be used to
 				disable the use of PCIE advanced error reporting.
+		aer_panic	[PCIE] Force kernel panic on unrecoverable
+				PCIe Advanced Error Reporting (AER) errors when
+				device recovery fails. This is recommended for
+				systems where bus hangs from unresolved errors
+				require immediate reboot. Use with caution as
+				this bypasses normal error recovery procedures.
+				Requires CONFIG_PCIEAER.
 		nodomains	[PCI] Disable support for multiple PCI
 				root domains (aka PCI segments, in ACPI-speak).
 		nommconf	[X86] Disable use of MMCONFIG for PCI
-- 
2.25.1




* [PATCH 2/4] PCI/AER: Introduce aer_panic kernel command-line option
  2025-05-16 16:55 [PATCH 0/4] pci: implement "pci=aer_panic" Hans Zhang
  2025-05-16 16:55 ` [PATCH 1/4] " Hans Zhang
@ 2025-05-16 16:55 ` Hans Zhang
  2025-05-16 16:55 ` [PATCH 3/4] PCI/AER: Expose AER panic state via pci_aer_panic_enabled() Hans Zhang
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Hans Zhang @ 2025-05-16 16:55 UTC
  To: bhelgaas, tglx, kw, manivannan.sadhasivam, mahesh
  Cc: oohall, linux-pci, linux-kernel, linuxppc-dev, Hans Zhang

From: Hans Zhang <hans.zhang@cixtech.com>

Add a new "aer_panic" kernel parameter to force panic on unrecoverable
PCIe errors. This prepares for handling fatal AER errors in systems where
bus hangs require immediate reboot.

Signed-off-by: Hans Zhang <hans.zhang@cixtech.com>
---
 drivers/pci/pci.c      | 2 ++
 drivers/pci/pci.h      | 2 ++
 drivers/pci/pcie/aer.c | 6 ++++++
 3 files changed, 10 insertions(+)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e77d5b53c0ce..663454135224 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -6821,6 +6821,8 @@ static int __init pci_setup(char *str)
 				pcie_ats_disabled = true;
 			} else if (!strcmp(str, "noaer")) {
 				pci_no_aer();
+			} else if (!strcmp(str, "aer_panic")) {
+				pci_aer_panic();
 			} else if (!strcmp(str, "earlydump")) {
 				pci_early_dump = true;
 			} else if (!strncmp(str, "realloc=", 8)) {
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index b81e99cd4b62..8ddfc1677eeb 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -958,6 +958,7 @@ static inline void of_pci_remove_host_bridge_node(struct pci_host_bridge *bridge
 
 #ifdef CONFIG_PCIEAER
 void pci_no_aer(void);
+void pci_aer_panic(void);
 void pci_aer_init(struct pci_dev *dev);
 void pci_aer_exit(struct pci_dev *dev);
 extern const struct attribute_group aer_stats_attr_group;
@@ -968,6 +969,7 @@ void pci_save_aer_state(struct pci_dev *dev);
 void pci_restore_aer_state(struct pci_dev *dev);
 #else
 static inline void pci_no_aer(void) { }
+static inline void pci_aer_panic(void) { }
 static inline void pci_aer_init(struct pci_dev *d) { }
 static inline void pci_aer_exit(struct pci_dev *d) { }
 static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index ade98c5a19b9..fa51fb8a5fe7 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -112,6 +112,7 @@ struct aer_stats {
 					PCI_ERR_ROOT_MULTI_UNCOR_RCV)
 
 static bool pcie_aer_disable;
+static bool pcie_aer_panic;
 static pci_ers_result_t aer_root_reset(struct pci_dev *dev);
 
 void pci_no_aer(void)
@@ -119,6 +120,11 @@ void pci_no_aer(void)
 	pcie_aer_disable = true;
 }
 
+void pci_aer_panic(void)
+{
+	pcie_aer_panic = true;
+}
+
 bool pci_aer_available(void)
 {
 	return !pcie_aer_disable && pci_msi_enabled();
-- 
2.25.1
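
The option matching added to pci_setup() above can be tried outside the
kernel. The following is a minimal user-space sketch, not the kernel code:
demo_pci_setup(), demo_aer_disable and demo_aer_panic are hypothetical
names, and only the comma splitting and strcmp() matching are taken from
the hunk.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical user-space stand-ins for the kernel's static flags. */
static bool demo_aer_disable;
static bool demo_aer_panic;

/*
 * Walk a comma-separated option string: split at each ',' and compare
 * every non-empty token against the known names. The buffer is modified
 * in place, so it must be writable.
 */
static void demo_pci_setup(char *str)
{
	while (str) {
		char *next = strchr(str, ',');

		if (next)
			*next++ = '\0';

		if (*str) {
			if (!strcmp(str, "noaer"))
				demo_aer_disable = true;
			else if (!strcmp(str, "aer_panic"))
				demo_aer_panic = true;
			/* Unknown tokens are ignored in this sketch. */
		}

		str = next;
	}
}
```

Because the comparison is an exact strcmp(), a token such as
"aer_panicx" does not enable the flag, and the position of "aer_panic"
within the list does not matter.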




* [PATCH 3/4] PCI/AER: Expose AER panic state via pci_aer_panic_enabled()
  2025-05-16 16:55 [PATCH 0/4] pci: implement "pci=aer_panic" Hans Zhang
  2025-05-16 16:55 ` [PATCH 1/4] " Hans Zhang
  2025-05-16 16:55 ` [PATCH 2/4] PCI/AER: Introduce aer_panic kernel command-line option Hans Zhang
@ 2025-05-16 16:55 ` Hans Zhang
  2025-05-17  4:07   ` Sathyanarayanan Kuppuswamy
  2025-05-16 16:55 ` [PATCH 4/4] PCI/AER: Trigger kernel panic on recovery failure if aer_panic is set Hans Zhang
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 19+ messages in thread
From: Hans Zhang @ 2025-05-16 16:55 UTC
  To: bhelgaas, tglx, kw, manivannan.sadhasivam, mahesh
  Cc: oohall, linux-pci, linux-kernel, linuxppc-dev, Hans Zhang

From: Hans Zhang <hans.zhang@cixtech.com>

Add pci_aer_panic_enabled() to check if aer_panic is enabled system-wide.
Export the function for use in error recovery logic.

Signed-off-by: Hans Zhang <hans.zhang@cixtech.com>
---
 drivers/pci/pci.h      |  2 ++
 drivers/pci/pcie/aer.c | 12 ++++++++++++
 2 files changed, 14 insertions(+)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 8ddfc1677eeb..f92928dadc6a 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -959,6 +959,7 @@ static inline void of_pci_remove_host_bridge_node(struct pci_host_bridge *bridge
 #ifdef CONFIG_PCIEAER
 void pci_no_aer(void);
 void pci_aer_panic(void);
+bool pci_aer_panic_enabled(void);
 void pci_aer_init(struct pci_dev *dev);
 void pci_aer_exit(struct pci_dev *dev);
 extern const struct attribute_group aer_stats_attr_group;
@@ -970,6 +971,7 @@ void pci_restore_aer_state(struct pci_dev *dev);
 #else
 static inline void pci_no_aer(void) { }
 static inline void pci_aer_panic(void) { }
+static inline bool pci_aer_panic_enabled(void) { return false; }
 static inline void pci_aer_init(struct pci_dev *d) { }
 static inline void pci_aer_exit(struct pci_dev *d) { }
 static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index fa51fb8a5fe7..4fd7db90b77c 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -125,6 +125,18 @@ void pci_aer_panic(void)
 	pcie_aer_panic = true;
 }
 
+/**
+ * pci_aer_panic_enabled() - Are AER panic enabled system-wide?
+ *
+ * Return: true if AER panic has not been globally disabled through ACPI FADT,
+ * PCI bridge quirks, or the "pci=aer_panic" kernel command-line option.
+ */
+bool pci_aer_panic_enabled(void)
+{
+	return pcie_aer_panic;
+}
+EXPORT_SYMBOL(pci_aer_panic_enabled);
+
 bool pci_aer_available(void)
 {
 	return !pcie_aer_disable && pci_msi_enabled();
-- 
2.25.1
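
The pci.h hunk above follows the usual pattern of a real declaration when
the feature is configured and a static inline stub otherwise. The sketch
below shows that pattern in isolation; DEMO_HAVE_AER is a hypothetical
stand-in for CONFIG_PCIEAER and the demo_* names are not kernel symbols.

```c
#include <stdbool.h>

/* DEMO_HAVE_AER is a hypothetical stand-in for CONFIG_PCIEAER. */
#ifdef DEMO_HAVE_AER
static bool demo_flag;
static inline void demo_aer_panic(void) { demo_flag = true; }
static inline bool demo_aer_panic_enabled(void) { return demo_flag; }
#else
/* Feature compiled out: the setter is a no-op, the query is always false. */
static inline void demo_aer_panic(void) { }
static inline bool demo_aer_panic_enabled(void) { return false; }
#endif
```

With the stub variant, callers need no #ifdef of their own: a check of
demo_aer_panic_enabled() simply evaluates to false when the feature is
not built.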




* [PATCH 4/4] PCI/AER: Trigger kernel panic on recovery failure if aer_panic is set
  2025-05-16 16:55 [PATCH 0/4] pci: implement "pci=aer_panic" Hans Zhang
                   ` (2 preceding siblings ...)
  2025-05-16 16:55 ` [PATCH 3/4] PCI/AER: Expose AER panic state via pci_aer_panic_enabled() Hans Zhang
@ 2025-05-16 16:55 ` Hans Zhang
  2025-05-16 18:10 ` [PATCH 0/4] pci: implement "pci=aer_panic" Sathyanarayanan Kuppuswamy
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 19+ messages in thread
From: Hans Zhang @ 2025-05-16 16:55 UTC
  To: bhelgaas, tglx, kw, manivannan.sadhasivam, mahesh
  Cc: oohall, linux-pci, linux-kernel, linuxppc-dev, Hans Zhang

From: Hans Zhang <hans.zhang@cixtech.com>

Modify pcie_do_recovery() to panic the system when device recovery fails
and aer_panic is enabled via kernel command-line. This addresses scenarios
where PCIe link errors cause bus hangs requiring forced reboots.

Signed-off-by: Hans Zhang <hans.zhang@cixtech.com>
---
 drivers/pci/pcie/err.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 31090770fffc..f0994f66d462 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -271,8 +271,12 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 
 	pci_uevent_ers(bridge, PCI_ERS_RESULT_DISCONNECT);
 
-	/* TODO: Should kernel panic here? */
-	pci_info(bridge, "device recovery failed\n");
+	if (!pci_aer_panic_enabled())
+		pci_info(bridge, "%s: device recovery failed\n",
+			 pci_name(bridge));
+	else
+		panic("Kernel panic: %s: device recovery failed\n",
+		      pci_name(bridge));
 
 	return status;
 }
-- 
2.25.1
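
The branch added to pcie_do_recovery() above can be modelled in user space
as a small decision function. This is only a sketch under assumptions: the
demo_* names and the enum are hypothetical, and the function returns an
action instead of calling panic() so that it can be exercised safely.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical outcome type; the patch itself calls pci_info() or panic(). */
enum demo_action { DEMO_LOG_ONLY, DEMO_PANIC };

static bool demo_panic_flag;	/* stands in for pcie_aer_panic */

static bool demo_aer_panic_enabled(void)
{
	return demo_panic_flag;
}

/*
 * Decide what to do once recovery has already failed. Returning an
 * action instead of aborting keeps the sketch testable in user space.
 */
static enum demo_action demo_recovery_failed(const char *dev_name)
{
	if (!demo_aer_panic_enabled()) {
		fprintf(stderr, "%s: device recovery failed\n", dev_name);
		return DEMO_LOG_ONLY;
	}

	fprintf(stderr, "%s: device recovery failed, would panic\n", dev_name);
	return DEMO_PANIC;
}
```

With the flag left at its default the function only logs, which matches
the series' statement that behavior is unchanged unless pci=aer_panic is
given.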




* Re: [PATCH 0/4] pci: implement "pci=aer_panic"
  2025-05-16 16:55 [PATCH 0/4] pci: implement "pci=aer_panic" Hans Zhang
                   ` (3 preceding siblings ...)
  2025-05-16 16:55 ` [PATCH 4/4] PCI/AER: Trigger kernel panic on recovery failure if aer_panic is set Hans Zhang
@ 2025-05-16 18:10 ` Sathyanarayanan Kuppuswamy
  2025-05-19 14:21   ` Hans Zhang
  2025-05-19 22:03 ` Bjorn Helgaas
  2025-05-22 11:47 ` Manivannan Sadhasivam
  6 siblings, 1 reply; 19+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-16 18:10 UTC
  To: Hans Zhang, bhelgaas, tglx, kw, manivannan.sadhasivam, mahesh
  Cc: oohall, linux-pci, linux-kernel, linuxppc-dev


On 5/16/25 9:55 AM, Hans Zhang wrote:
> The following series introduces a new kernel command-line option, aer_panic,
> to enhance error handling for PCIe Advanced Error Reporting (AER) in
> mission-critical environments. This feature ensures deterministic recovery
> from fatal PCIe errors by triggering a controlled kernel panic when device
> recovery fails, avoiding indefinite system hangs.

Why would a device recovery failure lead to a system hang? Worst case
that device may not be accessible, right?  Any real use case?


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer




* Re: [PATCH 3/4] PCI/AER: Expose AER panic state via pci_aer_panic_enabled()
  2025-05-16 16:55 ` [PATCH 3/4] PCI/AER: Expose AER panic state via pci_aer_panic_enabled() Hans Zhang
@ 2025-05-17  4:07   ` Sathyanarayanan Kuppuswamy
  2025-05-19 14:03     ` Hans Zhang
  0 siblings, 1 reply; 19+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-17  4:07 UTC
  To: Hans Zhang, bhelgaas, tglx, kw, manivannan.sadhasivam, mahesh
  Cc: oohall, linux-pci, linux-kernel, linuxppc-dev, Hans Zhang


On 5/16/25 9:55 AM, Hans Zhang wrote:
> From: Hans Zhang <hans.zhang@cixtech.com>
>
> Add pci_aer_panic_enabled() to check if aer_panic is enabled system-wide.
> Export the function for use in error recovery logic.
>
> Signed-off-by: Hans Zhang <hans.zhang@cixtech.com>
> ---
>   drivers/pci/pci.h      |  2 ++
>   drivers/pci/pcie/aer.c | 12 ++++++++++++
>   2 files changed, 14 insertions(+)
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 8ddfc1677eeb..f92928dadc6a 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -959,6 +959,7 @@ static inline void of_pci_remove_host_bridge_node(struct pci_host_bridge *bridge
>   #ifdef CONFIG_PCIEAER
>   void pci_no_aer(void);
>   void pci_aer_panic(void);
> +bool pci_aer_panic_enabled(void);
>   void pci_aer_init(struct pci_dev *dev);
>   void pci_aer_exit(struct pci_dev *dev);
>   extern const struct attribute_group aer_stats_attr_group;
> @@ -970,6 +971,7 @@ void pci_restore_aer_state(struct pci_dev *dev);
>   #else
>   static inline void pci_no_aer(void) { }
>   static inline void pci_aer_panic(void) { }
> +static inline bool pci_aer_panic_enabled(void) { return false; }
>   static inline void pci_aer_init(struct pci_dev *d) { }
>   static inline void pci_aer_exit(struct pci_dev *d) { }
>   static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index fa51fb8a5fe7..4fd7db90b77c 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -125,6 +125,18 @@ void pci_aer_panic(void)
>   	pcie_aer_panic = true;
>   }
>   
> +/**
> + * pci_aer_panic_enabled() - Are AER panic enabled system-wide?
> + *
> + * Return: true if AER panic has not been globally disabled through ACPI FADT,
> + * PCI bridge quirks, or the "pci=aer_panic" kernel command-line option.

I don't think we have code to disable it via ACPI FADT or PCI bridge quirks
currently, right? If yes, just list what is currently supported.

> + */
> +bool pci_aer_panic_enabled(void)
> +{
> +	return pcie_aer_panic;
> +}
> +EXPORT_SYMBOL(pci_aer_panic_enabled);
> +
>   bool pci_aer_available(void)
>   {
>   	return !pcie_aer_disable && pci_msi_enabled();

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer




* Re: [PATCH 3/4] PCI/AER: Expose AER panic state via pci_aer_panic_enabled()
  2025-05-17  4:07   ` Sathyanarayanan Kuppuswamy
@ 2025-05-19 14:03     ` Hans Zhang
  0 siblings, 0 replies; 19+ messages in thread
From: Hans Zhang @ 2025-05-19 14:03 UTC
  To: Sathyanarayanan Kuppuswamy, bhelgaas, tglx, kw,
	manivannan.sadhasivam, mahesh
  Cc: oohall, linux-pci, linux-kernel, linuxppc-dev, Hans Zhang



On 2025/5/17 12:07, Sathyanarayanan Kuppuswamy wrote:
> 
> On 5/16/25 9:55 AM, Hans Zhang wrote:
>> From: Hans Zhang <hans.zhang@cixtech.com>
>>
>> Add pci_aer_panic_enabled() to check if aer_panic is enabled system-wide.
>> Export the function for use in error recovery logic.
>>
>> Signed-off-by: Hans Zhang <hans.zhang@cixtech.com>
>> ---
>>   drivers/pci/pci.h      |  2 ++
>>   drivers/pci/pcie/aer.c | 12 ++++++++++++
>>   2 files changed, 14 insertions(+)
>>
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index 8ddfc1677eeb..f92928dadc6a 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -959,6 +959,7 @@ static inline void 
>> of_pci_remove_host_bridge_node(struct pci_host_bridge *bridge
>>   #ifdef CONFIG_PCIEAER
>>   void pci_no_aer(void);
>>   void pci_aer_panic(void);
>> +bool pci_aer_panic_enabled(void);
>>   void pci_aer_init(struct pci_dev *dev);
>>   void pci_aer_exit(struct pci_dev *dev);
>>   extern const struct attribute_group aer_stats_attr_group;
>> @@ -970,6 +971,7 @@ void pci_restore_aer_state(struct pci_dev *dev);
>>   #else
>>   static inline void pci_no_aer(void) { }
>>   static inline void pci_aer_panic(void) { }
>> +static inline bool pci_aer_panic_enabled(void) { return false; }
>>   static inline void pci_aer_init(struct pci_dev *d) { }
>>   static inline void pci_aer_exit(struct pci_dev *d) { }
>>   static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index fa51fb8a5fe7..4fd7db90b77c 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -125,6 +125,18 @@ void pci_aer_panic(void)
>>       pcie_aer_panic = true;
>>   }
>> +/**
>> + * pci_aer_panic_enabled() - Are AER panic enabled system-wide?
>> + *
>> + * Return: true if AER panic has not been globally disabled through 
>> ACPI FADT,
>> + * PCI bridge quirks, or the "pci=aer_panic" kernel command-line option.
> 
> I don't think we have code to disable it via ACPI FADT or PCI bridge quirks
> currently, right? If yes, just list what is currently supported.
> 

Dear Sathyanarayanan,

Thank you very much for your reply. You're right. If this series gains
support in the discussion, I will remove the "ACPI FADT, PCI bridge quirks"
wording from the comment in the next version.

Best regards,
Hans

>> + */
>> +bool pci_aer_panic_enabled(void)
>> +{
>> +    return pcie_aer_panic;
>> +}
>> +EXPORT_SYMBOL(pci_aer_panic_enabled);
>> +
>>   bool pci_aer_available(void)
>>   {
>>       return !pcie_aer_disable && pci_msi_enabled();
> 




* Re: [PATCH 0/4] pci: implement "pci=aer_panic"
  2025-05-16 18:10 ` [PATCH 0/4] pci: implement "pci=aer_panic" Sathyanarayanan Kuppuswamy
@ 2025-05-19 14:21   ` Hans Zhang
  2025-05-19 14:39     ` Hans Zhang
  2025-05-19 14:41     ` Hans Zhang
  0 siblings, 2 replies; 19+ messages in thread
From: Hans Zhang @ 2025-05-19 14:21 UTC
  To: Sathyanarayanan Kuppuswamy, bhelgaas, tglx, kw,
	manivannan.sadhasivam, mahesh
  Cc: oohall, linux-pci, linux-kernel, linuxppc-dev



On 2025/5/17 02:10, Sathyanarayanan Kuppuswamy wrote:
> 
> On 5/16/25 9:55 AM, Hans Zhang wrote:
>> The following series introduces a new kernel command-line option, aer_panic,
>> to enhance error handling for PCIe Advanced Error Reporting (AER) in
>> mission-critical environments. This feature ensures deterministic recovery
>> from fatal PCIe errors by triggering a controlled kernel panic when device
>> recovery fails, avoiding indefinite system hangs.
> 
> Why would a device recovery failure lead to a system hang? Worst case
> that device may not be accessible, right?  Any real use case?
> 


Dear Sathyanarayanan,

With Synopsys and Cadence PCIe IP, AER interrupts are usually delivered as
SPI interrupts rather than INTx/MSI/MSI-X interrupts.  (Some customers wire
them up as MSI/MSI-X interrupts, e.g. the RK3588, but not all do.)  For
example, on many Qualcomm mobile phone SoCs, when an AER interrupt is
handled and the link is down, i.e. a fatal problem has occurred on the
PCIe physical link, the system cannot recover.  At that point a system
restart is needed to solve the problem.

The SoC our company designed (http://radxa.com/products/orion/o6/) has
five PCIe ports and has the same problem: if one of the PCIe ports fails,
the entire system hangs.  So I hope Linux can offer an option that lets
SoC manufacturers choose to restart the system when a fatal hardware error
occurs in PCIe.

There are also products such as mobile phones and tablets, where we don't
want to wait until the battery is completely drained before they restart.

For the Qualcomm code in question, please refer to the links in the cover
letter.

Best regards,
Hans





* Re: [PATCH 0/4] pci: implement "pci=aer_panic"
  2025-05-19 14:21   ` Hans Zhang
@ 2025-05-19 14:39     ` Hans Zhang
  2025-05-19 14:41     ` Hans Zhang
  1 sibling, 0 replies; 19+ messages in thread
From: Hans Zhang @ 2025-05-19 14:39 UTC
  To: Sathyanarayanan Kuppuswamy, bhelgaas, tglx, kw,
	manivannan.sadhasivam, mahesh
  Cc: oohall, linux-pci, linux-kernel, linuxppc-dev



Dear Sathyanarayanan,

Supplementary reasons:

drivers/pci/controller/cadence/pcie-cadence-host.c
cdns_pci_map_bus
     /* Clear AXI link-down status */
     cdns_pcie_writel(pcie, CDNS_PCIE_AT_LINKDOWN, 0x0);

https://elixir.bootlin.com/linux/v6.15-rc6/source/drivers/pci/controller/cadence/pcie-cadence-host.c#L52

If a link down has occurred on this PCIe port, the CDNS_PCIE_AT_LINKDOWN
register must be cleared to 0 before AXI transactions can continue.  This
differs from Synopsys.

If CPU Core0 is just reaching line 52 of that file while CPU Core1 is
writing files to an NVMe SSD, the CDNS_PCIE_AT_LINKDOWN register is still
1, so Core1 cannot send TLPs and hangs.  This is a very extreme situation.
(The Cadence code currently upstream is for the legacy PCIe IP; support
for the HPA IP is still being upstreamed.)

Radxa O6 uses Cadence's PCIe HPA IP.
http://radxa.com/products/orion/o6/

Best regards,
Hans
> 
>>>
>>> Problem Statement
>>> In systems where unresolved PCIe errors (e.g., bus hangs) occur,
>>> traditional error recovery mechanisms may leave the system unresponsive
>>> indefinitely. This is unacceptable for high-availability environment
>>> requiring prompt recovery via reboot.
>>>
>>> Solution
>>> The aer_panic option forces a kernel panic on unrecoverable AER errors.
>>> This bypasses prolonged recovery attempts and ensures immediate reboot.
>>>
>>> Patch Summary:
>>> Documentation Update: Adds aer_panic to kernel-parameters.txt, 
>>> explaining
>>> its purpose and usage.
>>>
>>> Command-Line Handling: Implements pci=aer_panic parsing and state
>>> management in PCI core.
>>>
>>> State Exposure: Introduces pci_aer_panic_enabled() to check if the panic
>>> mode is active.
>>>
>>> Panic Trigger: Modifies recovery logic to panic the system when recovery
>>> fails and aer_panic is enabled.
>>>
>>> Impact
>>> Controlled Recovery: Reduces downtime by replacing hangs with immediate
>>> reboots.
>>>
>>> Optional: Enabled via pci=aer_panic; no default behavior change.
>>>
>>> Dependency: Requires CONFIG_PCIEAER.
>>>
>>> For example, in mobile phones and tablets, when there is a problem with
>>> the PCIe link and it cannot be restored, it is expected to provide an
>>> alternative method to make the system panic without waiting for the
>>> battery power to be completely exhausted before restarting the system.
>>>
>>> ---
>>> For example, the sm8250 and sm8350 of qcom will panic and restart the
>>> system when they are linked down.
>>>
>>> https://github.com/DOITfit/xiaomi_kernel_sm8250/blob/d42aa408e8cef14f4ec006554fac67ef80b86d0d/drivers/pci/controller/pci-msm.c#L5440
>>>
>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8350/blob/13ca08fdf0979fdd61d5e8991661874bb2d19150/drivers/net/wireless/cnss2/pci.c#L950
>>>
>>>
>>> Since the design schemes of each SOC manufacturer are different, the AXI
>>> and other buses connected by PCIe do not have a design to prevent 
>>> hanging.
>>> Once a FATAL error occurs in the PCIe link and cannot be restored, the
>>> system needs to be restarted.
>>>
>>>
>>> Dear Mani,
>>>
>>> I wonder if you know how other SoCs of qcom handle FATAL errors that 
>>> occur
>>> in PCIe link.
>>> ---
>>>
>>> Hans Zhang (4):
>>>    pci: implement "pci=aer_panic"
>>>    PCI/AER: Introduce aer_panic kernel command-line option
>>>    PCI/AER: Expose AER panic state via pci_aer_panic_enabled()
>>>    PCI/AER: Trigger kernel panic on recovery failure if aer_panic is set
>>>
>>>   .../admin-guide/kernel-parameters.txt          |  7 +++++++
>>>   drivers/pci/pci.c                              |  2 ++
>>>   drivers/pci/pci.h                              |  4 ++++
>>>   drivers/pci/pcie/aer.c                         | 18 ++++++++++++++++++
>>>   drivers/pci/pcie/err.c                         |  8 ++++++--
>>>   5 files changed, 37 insertions(+), 2 deletions(-)
>>>
>>>
>>> base-commit: fee3e843b309444f48157e2188efa6818bae85cf
>>> prerequisite-patch-id: 299f33d3618e246cd7c04de10e591ace2d0116e6
>>> prerequisite-patch-id: 482ad0609459a7654a4100cdc9f9aa4b671be50b
>>




* Re: [PATCH 0/4] pci: implement "pci=aer_panic"
  2025-05-19 14:21   ` Hans Zhang
  2025-05-19 14:39     ` Hans Zhang
@ 2025-05-19 14:41     ` Hans Zhang
  2025-05-20 16:09       ` Sathyanarayanan Kuppuswamy
  1 sibling, 1 reply; 19+ messages in thread
From: Hans Zhang @ 2025-05-19 14:41 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, bhelgaas, tglx, kw,
	manivannan.sadhasivam, mahesh
  Cc: oohall, linux-pci, linux-kernel, linuxppc-dev



On 2025/5/19 22:21, Hans Zhang wrote:
> 
> 
> On 2025/5/17 02:10, Sathyanarayanan Kuppuswamy wrote:
>>
>> On 5/16/25 9:55 AM, Hans Zhang wrote:
>>> The following series introduces a new kernel command-line option 
>>> aer_panic
>>> to enhance error handling for PCIe Advanced Error Reporting (AER) in
>>> mission-critical environments. This feature ensures deterministic 
>>> recovery
>>> from fatal PCIe errors by triggering a controlled kernel panic when 
>>> device
>>> recovery fails, avoiding indefinite system hangs.
>>
>> Why would a device recovery failure lead to a system hang? Worst case
>> that device may not be accessible, right?  Any real use case?
>>
> 
> 
> Dear Sathyanarayanan,
> 
> Due to Synopsys and Cadence PCIe IP, their AER interrupts are usually 
> SPI interrupts, not INTx/MSI/MSIx interrupts.  (Some customers will 
> design it as an MSI/MSIx interrupt, e.g.: RK3588, but not all customers 
> have designed it this way.)  For example, when many mobile phone SoCs of 
> Qualcomm handle AER interrupts and there is a link down, that is, a 
> fatal problem occurs in the current PCIe physical link, the system 
> cannot recover.  At this point, a system restart is needed to solve the 
> problem.
> 
> And the SoC our company designed, http://radxa.com/products/orion/o6/, 
> has 5 PCIe ports.
> There is also the same problem.  If there is a problem with one of the 
> PCIe ports, it will cause the entire system to hang.  So I hope linux OS 
> can offer an option that enables SOC manufacturers to choose to restart 
> the system in case of fatal hardware errors occurring in PCIe.
> 
> There are also products such as mobile phones and tablets.  We don't 
> want to wait until the battery is completely used up before restarting 
> them.
> 
> For the specific code of Qualcomm, please refer to the email I sent.
> 


Dear Sathyanarayanan,

Supplementary reasons:

drivers/pci/controller/cadence/pcie-cadence-host.c
cdns_pci_map_bus
     /* Clear AXI link-down status */
     cdns_pcie_writel(pcie, CDNS_PCIE_AT_LINKDOWN, 0x0);

https://elixir.bootlin.com/linux/v6.15-rc6/source/drivers/pci/controller/cadence/pcie-cadence-host.c#L52

If there has been a link down in this PCIe port, the register 
CDNS_PCIE_AT_LINKDOWN must be set to 0 for the AXI transmission to 
continue.  This is different from Synopsys.

If CPU Core0 is just reaching line 52 (the write above) while CPU Core1 
is saving files to an NVMe SSD, the CDNS_PCIE_AT_LINKDOWN register is 
still 1, so CPU Core1 cannot send TLPs and hangs.  This is a very 
extreme situation.
(The current upstream Cadence code is for the legacy PCIe IP; support 
for the HPA IP is still being upstreamed.)

Radxa O6 uses Cadence's PCIe HPA IP.
http://radxa.com/products/orion/o6/

Best regards,
Hans

> 
>>>
>>> Problem Statement
>>> In systems where unresolved PCIe errors (e.g., bus hangs) occur,
>>> traditional error recovery mechanisms may leave the system unresponsive
>>> indefinitely. This is unacceptable for high-availability environment
>>> requiring prompt recovery via reboot.
>>>
>>> Solution
>>> The aer_panic option forces a kernel panic on unrecoverable AER errors.
>>> This bypasses prolonged recovery attempts and ensures immediate reboot.
>>>
>>> Patch Summary:
>>> Documentation Update: Adds aer_panic to kernel-parameters.txt, 
>>> explaining
>>> its purpose and usage.
>>>
>>> Command-Line Handling: Implements pci=aer_panic parsing and state
>>> management in PCI core.
>>>
>>> State Exposure: Introduces pci_aer_panic_enabled() to check if the panic
>>> mode is active.
>>>
>>> Panic Trigger: Modifies recovery logic to panic the system when recovery
>>> fails and aer_panic is enabled.
>>>
>>> Impact
>>> Controlled Recovery: Reduces downtime by replacing hangs with immediate
>>> reboots.
>>>
>>> Optional: Enabled via pci=aer_panic; no default behavior change.
>>>
>>> Dependency: Requires CONFIG_PCIEAER.
>>>
>>> For example, in mobile phones and tablets, when there is a problem with
>>> the PCIe link and it cannot be restored, it is expected to provide an
>>> alternative method to make the system panic without waiting for the
>>> battery power to be completely exhausted before restarting the system.
>>>
>>> ---
>>> For example, the sm8250 and sm8350 of qcom will panic and restart the
>>> system when they are linked down.
>>>
>>> https://github.com/DOITfit/xiaomi_kernel_sm8250/blob/d42aa408e8cef14f4ec006554fac67ef80b86d0d/drivers/pci/controller/pci-msm.c#L5440
>>>
>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8350/blob/13ca08fdf0979fdd61d5e8991661874bb2d19150/drivers/net/wireless/cnss2/pci.c#L950
>>>
>>>
>>> Since the design schemes of each SOC manufacturer are different, the AXI
>>> and other buses connected by PCIe do not have a design to prevent 
>>> hanging.
>>> Once a FATAL error occurs in the PCIe link and cannot be restored, the
>>> system needs to be restarted.
>>>
>>>
>>> Dear Mani,
>>>
>>> I wonder if you know how other SoCs of qcom handle FATAL errors that 
>>> occur
>>> in PCIe link.
>>> ---
>>>
>>> Hans Zhang (4):
>>>    pci: implement "pci=aer_panic"
>>>    PCI/AER: Introduce aer_panic kernel command-line option
>>>    PCI/AER: Expose AER panic state via pci_aer_panic_enabled()
>>>    PCI/AER: Trigger kernel panic on recovery failure if aer_panic is set
>>>
>>>   .../admin-guide/kernel-parameters.txt          |  7 +++++++
>>>   drivers/pci/pci.c                              |  2 ++
>>>   drivers/pci/pci.h                              |  4 ++++
>>>   drivers/pci/pcie/aer.c                         | 18 ++++++++++++++++++
>>>   drivers/pci/pcie/err.c                         |  8 ++++++--
>>>   5 files changed, 37 insertions(+), 2 deletions(-)
>>>
>>>
>>> base-commit: fee3e843b309444f48157e2188efa6818bae85cf
>>> prerequisite-patch-id: 299f33d3618e246cd7c04de10e591ace2d0116e6
>>> prerequisite-patch-id: 482ad0609459a7654a4100cdc9f9aa4b671be50b
>>




* Re: [PATCH 0/4] pci: implement "pci=aer_panic"
  2025-05-16 16:55 [PATCH 0/4] pci: implement "pci=aer_panic" Hans Zhang
                   ` (4 preceding siblings ...)
  2025-05-16 18:10 ` [PATCH 0/4] pci: implement "pci=aer_panic" Sathyanarayanan Kuppuswamy
@ 2025-05-19 22:03 ` Bjorn Helgaas
  2025-05-20 15:11   ` Hans Zhang
  2025-05-22 11:47 ` Manivannan Sadhasivam
  6 siblings, 1 reply; 19+ messages in thread
From: Bjorn Helgaas @ 2025-05-19 22:03 UTC (permalink / raw)
  To: Hans Zhang
  Cc: bhelgaas, tglx, kw, manivannan.sadhasivam, mahesh, oohall,
	linux-pci, linux-kernel, linuxppc-dev

On Sat, May 17, 2025 at 12:55:14AM +0800, Hans Zhang wrote:
> The following series introduces a new kernel command-line option aer_panic
> to enhance error handling for PCIe Advanced Error Reporting (AER) in
> mission-critical environments. This feature ensures deterministic recovery
> from fatal PCIe errors by triggering a controlled kernel panic when device
> recovery fails, avoiding indefinite system hangs.

We try very hard not to add new kernel parameters.

It sounds like part of the problem is the use of SPI interrupts rather
than the PCIe-architected INTx/MSI/MSI-X.  I'm not sure this warrants
generic upstream code changes.  This might be something you need to
maintain out-of-tree.

> Problem Statement
> In systems where unresolved PCIe errors (e.g., bus hangs) occur,
> traditional error recovery mechanisms may leave the system unresponsive
> indefinitely. This is unacceptable for high-availability environment
> requiring prompt recovery via reboot.
> 
> Solution
> The aer_panic option forces a kernel panic on unrecoverable AER errors.
> This bypasses prolonged recovery attempts and ensures immediate reboot.
> 
> Patch Summary:
> Documentation Update: Adds aer_panic to kernel-parameters.txt, explaining
> its purpose and usage.
> 
> Command-Line Handling: Implements pci=aer_panic parsing and state
> management in PCI core.
> 
> State Exposure: Introduces pci_aer_panic_enabled() to check if the panic
> mode is active.
> 
> Panic Trigger: Modifies recovery logic to panic the system when recovery
> fails and aer_panic is enabled.
> 
> Impact
> Controlled Recovery: Reduces downtime by replacing hangs with immediate
> reboots.
> 
> Optional: Enabled via pci=aer_panic; no default behavior change.
> 
> Dependency: Requires CONFIG_PCIEAER.
> 
> For example, in mobile phones and tablets, when there is a problem with
> the PCIe link and it cannot be restored, it is expected to provide an
> alternative method to make the system panic without waiting for the
> battery power to be completely exhausted before restarting the system.
> 
> ---
> For example, the sm8250 and sm8350 of qcom will panic and restart the
> system when they are linked down.
> 
> https://github.com/DOITfit/xiaomi_kernel_sm8250/blob/d42aa408e8cef14f4ec006554fac67ef80b86d0d/drivers/pci/controller/pci-msm.c#L5440
> 
> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8350/blob/13ca08fdf0979fdd61d5e8991661874bb2d19150/drivers/net/wireless/cnss2/pci.c#L950
> 
> 
> Since the design schemes of each SOC manufacturer are different, the AXI
> and other buses connected by PCIe do not have a design to prevent hanging.
> Once a FATAL error occurs in the PCIe link and cannot be restored, the
> system needs to be restarted.
> 
> 
> Dear Mani,
> 
> I wonder if you know how other SoCs of qcom handle FATAL errors that occur
> in PCIe link.
> ---
> 
> Hans Zhang (4):
>   pci: implement "pci=aer_panic"
>   PCI/AER: Introduce aer_panic kernel command-line option
>   PCI/AER: Expose AER panic state via pci_aer_panic_enabled()
>   PCI/AER: Trigger kernel panic on recovery failure if aer_panic is set
> 
>  .../admin-guide/kernel-parameters.txt          |  7 +++++++
>  drivers/pci/pci.c                              |  2 ++
>  drivers/pci/pci.h                              |  4 ++++
>  drivers/pci/pcie/aer.c                         | 18 ++++++++++++++++++
>  drivers/pci/pcie/err.c                         |  8 ++++++--
>  5 files changed, 37 insertions(+), 2 deletions(-)
> 
> 
> base-commit: fee3e843b309444f48157e2188efa6818bae85cf
> prerequisite-patch-id: 299f33d3618e246cd7c04de10e591ace2d0116e6
> prerequisite-patch-id: 482ad0609459a7654a4100cdc9f9aa4b671be50b
> -- 
> 2.25.1
> 



* Re: [PATCH 0/4] pci: implement "pci=aer_panic"
  2025-05-19 22:03 ` Bjorn Helgaas
@ 2025-05-20 15:11   ` Hans Zhang
  0 siblings, 0 replies; 19+ messages in thread
From: Hans Zhang @ 2025-05-20 15:11 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: bhelgaas, tglx, kw, manivannan.sadhasivam, mahesh, oohall,
	linux-pci, linux-kernel, linuxppc-dev



On 2025/5/20 06:03, Bjorn Helgaas wrote:
> On Sat, May 17, 2025 at 12:55:14AM +0800, Hans Zhang wrote:
>> The following series introduces a new kernel command-line option aer_panic
>> to enhance error handling for PCIe Advanced Error Reporting (AER) in
>> mission-critical environments. This feature ensures deterministic recovery
>> from fatal PCIe errors by triggering a controlled kernel panic when device
>> recovery fails, avoiding indefinite system hangs.
> 
> We try very hard not to add new kernel parameters.
> 
> It sounds like part of the problem is the use of SPI interrupts rather
> than the PCIe-architected INTx/MSI/MSI-X.  I'm not sure this warrants
> generic upstream code changes.  This might be something you need to
> maintain out-of-tree.
> 

Dear Bjorn,

This does not seem to depend on whether AER uses the INTx/MSI/MSI-X 
interrupts specified in the PCIe spec, as in the example I gave 
earlier.

Our next-generation SoC already delivers AER interrupts as INTx and 
reports them to the GIC interrupt controller, but the following 
problem still cannot be solved.

```
Supplementary reasons:

drivers/pci/controller/cadence/pcie-cadence-host.c
cdns_pci_map_bus
      /* Clear AXI link-down status */
      cdns_pcie_writel(pcie, CDNS_PCIE_AT_LINKDOWN, 0x0);

https://elixir.bootlin.com/linux/v6.15-rc6/source/drivers/pci/controller/cadence/pcie-cadence-host.c#L52

If there has been a link down in this PCIe port, the register
CDNS_PCIE_AT_LINKDOWN must be set to 0 for the AXI transmission to
continue.  This is different from Synopsys.

If CPU Core0 is just reaching line 52 (the write above) while CPU Core1
is saving files to an NVMe SSD, the CDNS_PCIE_AT_LINKDOWN register is
still 1, so CPU Core1 cannot send TLPs and hangs.  This is a very
extreme situation.
(The current upstream Cadence code is for the legacy PCIe IP; support
for the HPA IP is still being upstreamed.)

Radxa O6 uses Cadence's PCIe HPA IP.
http://radxa.com/products/orion/o6/
```


If we maintain the corresponding driver out of tree, 
arch/arm64/configs/defconfig still sets "CONFIG_PCIEAER=y", so we cannot 
modify the AER common code.  It also cannot be built as aer.ko.

This is because CONFIG_PCIEAER is a bool, so it can only be y or n:
config PCIEAER
	bool "PCI Express Advanced Error Reporting support"
	depends on PCIEPORTBUS
	select RAS
	help
	  This enables PCI Express Root Port Advanced Error Reporting
	  (AER) driver support. Error reporting messages sent to Root
	  Port will be handled by PCI Express AER driver.

Furthermore, the API of the AER common code cannot be used either, and 
many of its variables are not exported.  If we write our own AER driver, 
it will duplicate a lot of processing logic.


I believe that the Qualcomm platform and many other platforms also have 
similar problems.


So could we add a config option, for example CONFIG_PCIEAER_PANIC, 
instead of the command-line option aer_panic?  Or could the AER driver 
be made tristate so that it can be built as a module (.ko), which would 
let SoC manufacturers modify it?


Best regards,
Hans

>> Problem Statement
>> In systems where unresolved PCIe errors (e.g., bus hangs) occur,
>> traditional error recovery mechanisms may leave the system unresponsive
>> indefinitely. This is unacceptable for high-availability environment
>> requiring prompt recovery via reboot.
>>
>> Solution
>> The aer_panic option forces a kernel panic on unrecoverable AER errors.
>> This bypasses prolonged recovery attempts and ensures immediate reboot.
>>
>> Patch Summary:
>> Documentation Update: Adds aer_panic to kernel-parameters.txt, explaining
>> its purpose and usage.
>>
>> Command-Line Handling: Implements pci=aer_panic parsing and state
>> management in PCI core.
>>
>> State Exposure: Introduces pci_aer_panic_enabled() to check if the panic
>> mode is active.
>>
>> Panic Trigger: Modifies recovery logic to panic the system when recovery
>> fails and aer_panic is enabled.
>>
>> Impact
>> Controlled Recovery: Reduces downtime by replacing hangs with immediate
>> reboots.
>>
>> Optional: Enabled via pci=aer_panic; no default behavior change.
>>
>> Dependency: Requires CONFIG_PCIEAER.
>>
>> For example, in mobile phones and tablets, when there is a problem with
>> the PCIe link and it cannot be restored, it is expected to provide an
>> alternative method to make the system panic without waiting for the
>> battery power to be completely exhausted before restarting the system.
>>
>> ---
>> For example, the sm8250 and sm8350 of qcom will panic and restart the
>> system when they are linked down.
>>
>> https://github.com/DOITfit/xiaomi_kernel_sm8250/blob/d42aa408e8cef14f4ec006554fac67ef80b86d0d/drivers/pci/controller/pci-msm.c#L5440
>>
>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8350/blob/13ca08fdf0979fdd61d5e8991661874bb2d19150/drivers/net/wireless/cnss2/pci.c#L950
>>
>>
>> Since the design schemes of each SOC manufacturer are different, the AXI
>> and other buses connected by PCIe do not have a design to prevent hanging.
>> Once a FATAL error occurs in the PCIe link and cannot be restored, the
>> system needs to be restarted.
>>
>>
>> Dear Mani,
>>
>> I wonder if you know how other SoCs of qcom handle FATAL errors that occur
>> in PCIe link.
>> ---
>>
>> Hans Zhang (4):
>>    pci: implement "pci=aer_panic"
>>    PCI/AER: Introduce aer_panic kernel command-line option
>>    PCI/AER: Expose AER panic state via pci_aer_panic_enabled()
>>    PCI/AER: Trigger kernel panic on recovery failure if aer_panic is set
>>
>>   .../admin-guide/kernel-parameters.txt          |  7 +++++++
>>   drivers/pci/pci.c                              |  2 ++
>>   drivers/pci/pci.h                              |  4 ++++
>>   drivers/pci/pcie/aer.c                         | 18 ++++++++++++++++++
>>   drivers/pci/pcie/err.c                         |  8 ++++++--
>>   5 files changed, 37 insertions(+), 2 deletions(-)
>>
>>
>> base-commit: fee3e843b309444f48157e2188efa6818bae85cf
>> prerequisite-patch-id: 299f33d3618e246cd7c04de10e591ace2d0116e6
>> prerequisite-patch-id: 482ad0609459a7654a4100cdc9f9aa4b671be50b
>> -- 
>> 2.25.1
>>




* Re: [PATCH 0/4] pci: implement "pci=aer_panic"
  2025-05-19 14:41     ` Hans Zhang
@ 2025-05-20 16:09       ` Sathyanarayanan Kuppuswamy
  2025-05-21 14:54         ` Hans Zhang
  0 siblings, 1 reply; 19+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-20 16:09 UTC (permalink / raw)
  To: Hans Zhang, bhelgaas, tglx, kw, manivannan.sadhasivam, mahesh
  Cc: oohall, linux-pci, linux-kernel, linuxppc-dev


On 5/19/25 7:41 AM, Hans Zhang wrote:
>
>
> On 2025/5/19 22:21, Hans Zhang wrote:
>>
>>
>> On 2025/5/17 02:10, Sathyanarayanan Kuppuswamy wrote:
>>>
>>> On 5/16/25 9:55 AM, Hans Zhang wrote:
>>>> The following series introduces a new kernel command-line option aer_panic
>>>> to enhance error handling for PCIe Advanced Error Reporting (AER) in
>>>> mission-critical environments. This feature ensures deterministic recovery
>>>> from fatal PCIe errors by triggering a controlled kernel panic when device
>>>> recovery fails, avoiding indefinite system hangs.
>>>
>>> Why would a device recovery failure lead to a system hang? Worst case
>>> that device may not be accessible, right?  Any real use case?
>>>
>>
>>
>> Dear Sathyanarayanan,
>>
>> Due to Synopsys and Cadence PCIe IP, their AER interrupts are usually SPI interrupts, not INTx/MSI/MSIx interrupts.  (Some customers will design it as an MSI/MSIx interrupt, e.g.: RK3588, but not all customers have designed it this way.)  For example, when many mobile phone SoCs of Qualcomm handle AER interrupts and there is a link down, that is, a fatal problem occurs in the current PCIe physical link, the system cannot recover.  At this point, a system restart is needed to solve the problem.
>>
>> And the SoC our company designed, http://radxa.com/products/orion/o6/, has 5 PCIe ports.
>> There is also the same problem.  If there is a problem with one of the PCIe ports, it will cause the entire system to hang.  So I hope linux OS can offer an option that enables SOC manufacturers to choose to restart the system in case of fatal hardware errors occurring in PCIe.
>>
>> There are also products such as mobile phones and tablets.  We don't want to wait until the battery is completely used up before restarting them.
>>
>> For the specific code of Qualcomm, please refer to the email I sent.
>>
>
>
> Dear Sathyanarayanan,
>
> Supplementary reasons:
>
> drivers/pci/controller/cadence/pcie-cadence-host.c
> cdns_pci_map_bus
>     /* Clear AXI link-down status */
>     cdns_pcie_writel(pcie, CDNS_PCIE_AT_LINKDOWN, 0x0);
>
> https://elixir.bootlin.com/linux/v6.15-rc6/source/drivers/pci/controller/cadence/pcie-cadence-host.c#L52
>
> If there has been a link down in this PCIe port, the register CDNS_PCIE_AT_LINKDOWN must be set to 0 for the AXI transmission to continue.  This is different from Synopsys.
>
> If CPU Core0 runs to code L52 and CPU Core1 is executing NVMe SSD saving files, since the CDNS_PCIE_AT_LINKDOWN register is still 1, it causes CPU Core1 to be unable to send TLP transfers and hang. This is a very extreme situation.
> (The current Cadence code is Legacy PCIe IP, and the HPA IP is still in the upstream process at present.)
>
> Radxa O6 uses Cadence's PCIe HPA IP.
> http://radxa.com/products/orion/o6/
>

It sounds like a system-level issue to me. Why don't they rely on a watchdog to reboot in
this case?

Even if you want to add this support, I think it is more appropriate to add it to your
specific PCIe controller driver.  I don't see why you want to make it part of the generic
AER driver.

> Best regards,
> Hans
>
>>
>>>>
>>>> Problem Statement
>>>> In systems where unresolved PCIe errors (e.g., bus hangs) occur,
>>>> traditional error recovery mechanisms may leave the system unresponsive
>>>> indefinitely. This is unacceptable for high-availability environment
>>>> requiring prompt recovery via reboot.
>>>>
>>>> Solution
>>>> The aer_panic option forces a kernel panic on unrecoverable AER errors.
>>>> This bypasses prolonged recovery attempts and ensures immediate reboot.
>>>>
>>>> Patch Summary:
>>>> Documentation Update: Adds aer_panic to kernel-parameters.txt, explaining
>>>> its purpose and usage.
>>>>
>>>> Command-Line Handling: Implements pci=aer_panic parsing and state
>>>> management in PCI core.
>>>>
>>>> State Exposure: Introduces pci_aer_panic_enabled() to check if the panic
>>>> mode is active.
>>>>
>>>> Panic Trigger: Modifies recovery logic to panic the system when recovery
>>>> fails and aer_panic is enabled.
>>>>
>>>> Impact
>>>> Controlled Recovery: Reduces downtime by replacing hangs with immediate
>>>> reboots.
>>>>
>>>> Optional: Enabled via pci=aer_panic; no default behavior change.
>>>>
>>>> Dependency: Requires CONFIG_PCIEAER.
>>>>
>>>> For example, in mobile phones and tablets, when there is a problem with
>>>> the PCIe link and it cannot be restored, it is expected to provide an
>>>> alternative method to make the system panic without waiting for the
>>>> battery power to be completely exhausted before restarting the system.
>>>>
>>>> ---
>>>> For example, the sm8250 and sm8350 of qcom will panic and restart the
>>>> system when they are linked down.
>>>>
>>>> https://github.com/DOITfit/xiaomi_kernel_sm8250/blob/d42aa408e8cef14f4ec006554fac67ef80b86d0d/drivers/pci/controller/pci-msm.c#L5440
>>>>
>>>> https://github.com/OnePlusOSS/android_kernel_oneplus_sm8350/blob/13ca08fdf0979fdd61d5e8991661874bb2d19150/drivers/net/wireless/cnss2/pci.c#L950
>>>>
>>>>
>>>> Since the design schemes of each SOC manufacturer are different, the AXI
>>>> and other buses connected by PCIe do not have a design to prevent hanging.
>>>> Once a FATAL error occurs in the PCIe link and cannot be restored, the
>>>> system needs to be restarted.
>>>>
>>>>
>>>> Dear Mani,
>>>>
>>>> I wonder if you know how other SoCs of qcom handle FATAL errors that occur
>>>> in PCIe link.
>>>> ---
>>>>
>>>> Hans Zhang (4):
>>>>    pci: implement "pci=aer_panic"
>>>>    PCI/AER: Introduce aer_panic kernel command-line option
>>>>    PCI/AER: Expose AER panic state via pci_aer_panic_enabled()
>>>>    PCI/AER: Trigger kernel panic on recovery failure if aer_panic is set
>>>>
>>>>   .../admin-guide/kernel-parameters.txt          |  7 +++++++
>>>>   drivers/pci/pci.c                              |  2 ++
>>>>   drivers/pci/pci.h                              |  4 ++++
>>>>   drivers/pci/pcie/aer.c                         | 18 ++++++++++++++++++
>>>>   drivers/pci/pcie/err.c                         |  8 ++++++--
>>>>   5 files changed, 37 insertions(+), 2 deletions(-)
>>>>
>>>>
>>>> base-commit: fee3e843b309444f48157e2188efa6818bae85cf
>>>> prerequisite-patch-id: 299f33d3618e246cd7c04de10e591ace2d0116e6
>>>> prerequisite-patch-id: 482ad0609459a7654a4100cdc9f9aa4b671be50b
>>>
>
>
-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer




* Re: [PATCH 0/4] pci: implement "pci=aer_panic"
  2025-05-20 16:09       ` Sathyanarayanan Kuppuswamy
@ 2025-05-21 14:54         ` Hans Zhang
  2025-05-21 16:17           ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 19+ messages in thread
From: Hans Zhang @ 2025-05-21 14:54 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, bhelgaas, tglx, kw,
	manivannan.sadhasivam, mahesh
  Cc: oohall, linux-pci, linux-kernel, linuxppc-dev



On 2025/5/21 00:09, Sathyanarayanan Kuppuswamy wrote:
> 
> On 5/19/25 7:41 AM, Hans Zhang wrote:
>>
>>
>> On 2025/5/19 22:21, Hans Zhang wrote:
>>>
>>>
>>> On 2025/5/17 02:10, Sathyanarayanan Kuppuswamy wrote:
>>>>
>>>> On 5/16/25 9:55 AM, Hans Zhang wrote:
>>>>> The following series introduces a new kernel command-line option 
>>>>> aer_panic
>>>>> to enhance error handling for PCIe Advanced Error Reporting (AER) in
>>>>> mission-critical environments. This feature ensures deterministic 
>>>>> recovery
>>>>> from fatal PCIe errors by triggering a controlled kernel panic when 
>>>>> device
>>>>> recovery fails, avoiding indefinite system hangs.
>>>>
>>>> Why would a device recovery failure lead to a system hang? Worst case
>>>> that device may not be accessible, right?  Any real use case?
>>>>
>>>
>>>
>>> Dear Sathyanarayanan,
>>>
>>> Due to Synopsys and Cadence PCIe IP, their AER interrupts are usually 
>>> SPI interrupts, not INTx/MSI/MSIx interrupts.  (Some customers will 
>>> design it as an MSI/MSIx interrupt, e.g.: RK3588, but not all 
>>> customers have designed it this way.)  For example, when many mobile 
>>> phone SoCs of Qualcomm handle AER interrupts and there is a link 
>>> down, that is, a fatal problem occurs in the current PCIe physical 
>>> link, the system cannot recover.  At this point, a system restart is 
>>> needed to solve the problem.
>>>
>>> And the SoC our company designed, http://radxa.com/products/orion/o6/, 
>>> has 5 PCIe ports.
>>> There is also the same problem.  If there is a problem with one of 
>>> the PCIe ports, it will cause the entire system to hang.  So I hope 
>>> linux OS can offer an option that enables SOC manufacturers to choose 
>>> to restart the system in case of fatal hardware errors occurring in 
>>> PCIe.
>>>
>>> There are also products such as mobile phones and tablets.  We don't 
>>> want to wait until the battery is completely used up before 
>>> restarting them.
>>>
>>> For the specific code of Qualcomm, please refer to the email I sent.
>>>
>>
>>
>> Dear Sathyanarayanan,
>>
>> Supplementary reasons:
>>
>> drivers/pci/controller/cadence/pcie-cadence-host.c
>> cdns_pci_map_bus
>>     /* Clear AXI link-down status */
>>     cdns_pcie_writel(pcie, CDNS_PCIE_AT_LINKDOWN, 0x0);
>>
>> https://elixir.bootlin.com/linux/v6.15-rc6/source/drivers/pci/controller/cadence/pcie-cadence-host.c#L52
>>
>> If there has been a link down in this PCIe port, the register 
>> CDNS_PCIE_AT_LINKDOWN must be set to 0 for the AXI transmission to 
>> continue.  This is different from Synopsys.
>>
>> If CPU Core0 runs to code L52 and CPU Core1 is executing NVMe SSD 
>> saving files, since the CDNS_PCIE_AT_LINKDOWN register is still 1, it 
>> causes CPU Core1 to be unable to send TLP transfers and hang. This is 
>> a very extreme situation.
>> (The current Cadence code is Legacy PCIe IP, and the HPA IP is still 
>> in the upstream process at present.)
>>
>> Radxa O6 uses Cadence's PCIe HPA IP.
>> http://radxa.com/products/orion/o6/
>>
> 
> It sounds like a system level issue to me. Why not they rely on watchdog 
> to reboot for
> this case ?

Dear Sathyanarayanan,

Thank you for your reply. Yes, I also think it is a system-level 
problem. In a local test, directly unplugging the EP device from the 
slot hung the system, and this reproduced over many runs. Because we 
have no bus timeout response mechanism for PCIe, it hangs easily.

> 
> Even if you want to add this support, I think it is more appropriate to 
> add this to your
> specific PCIe controller driver.  I don't see why you want to add it 
> part of generic
> AER driver.
> 
Because we want to reuse the processing logic of the generic AER driver. 
If recovery succeeds, there is no problem. If recovery fails, my 
intention was to restart the system.
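
The intended policy can be sketched as follows.  This is a simplified
user-space model, not the patch itself: the enums are hypothetical
stand-ins for the kernel's recovery result values, and the aer_panic
argument stands for whatever switch ends up enabling the behaviour.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-ins for the outcome of AER recovery. */
enum model_ers_result {
	MODEL_ERS_RECOVERED,	/* recovery succeeded */
	MODEL_ERS_DISCONNECT,	/* recovery failed */
};

enum model_action {
	MODEL_ACTION_CONTINUE,	/* recovery worked, keep running */
	MODEL_ACTION_REPORT,	/* default: report the failure and carry on */
	MODEL_ACTION_PANIC,	/* opt-in: panic so the system can restart */
};

/* Decide what to do once the generic recovery path has finished. */
static enum model_action model_after_recovery(enum model_ers_result res,
					      bool aer_panic)
{
	assert(res == MODEL_ERS_RECOVERED || res == MODEL_ERS_DISCONNECT);

	if (res == MODEL_ERS_RECOVERED)
		return MODEL_ACTION_CONTINUE;

	return aer_panic ? MODEL_ACTION_PANIC : MODEL_ACTION_REPORT;
}
```

The point of the sketch is that nothing changes unless recovery has
already failed and the option is set.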

If this were added to the specific PCIe controller driver, a lot of AER 
processing logic would have to be duplicated. So I was wondering whether 
the AER driver could be changed so that it can be built as a module (.ko).


If this series is not reasonable, I'll drop it.


Best regards,
Hans





* Re: [PATCH 0/4] pci: implement "pci=aer_panic"
  2025-05-21 14:54         ` Hans Zhang
@ 2025-05-21 16:17           ` Sathyanarayanan Kuppuswamy
  2025-05-22  9:33             ` Hans Zhang
  0 siblings, 1 reply; 19+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-21 16:17 UTC (permalink / raw)
  To: Hans Zhang, bhelgaas, tglx, kw, manivannan.sadhasivam, mahesh
  Cc: oohall, linux-pci, linux-kernel, linuxppc-dev


On 5/21/25 7:54 AM, Hans Zhang wrote:
>
>
> On 2025/5/21 00:09, Sathyanarayanan Kuppuswamy wrote:
>>
>> On 5/19/25 7:41 AM, Hans Zhang wrote:
>>>
>>>
>>> On 2025/5/19 22:21, Hans Zhang wrote:
>>>>
>>>>
>>>> On 2025/5/17 02:10, Sathyanarayanan Kuppuswamy wrote:
>>>>>
>>>>> On 5/16/25 9:55 AM, Hans Zhang wrote:
>>>>>> The following series introduces a new kernel command-line option aer_panic
>>>>>> to enhance error handling for PCIe Advanced Error Reporting (AER) in
>>>>>> mission-critical environments. This feature ensures deterministic recover
>>>>>> from fatal PCIe errors by triggering a controlled kernel panic when device
>>>>>> recovery fails, avoiding indefinite system hangs.
>>>>>
>>>>> Why would a device recovery failure lead to a system hang? Worst case
>>>>> that device may not be accessible, right?  Any real use case?
>>>>>
>>>>
>>>>
>>>> Dear Sathyanarayanan,
>>>>
>>>> Due to Synopsys and Cadence PCIe IP, their AER interrupts are usually SPI interrupts, not INTx/MSI/MSIx interrupts. (Some customers will design it as an MSI/MSIx interrupt, e.g.: RK3588, but not all customers have designed it this way.)  For example, when many mobile phone SoCs of Qualcomm handle AER interrupts and there is a link down, that is, a fatal problem occurs in the current PCIe physical link, the system cannot recover.  At this point, a system restart is needed to solve the problem.
>>>>
>>>> And our company design of SOC: http://radxa.com/products/orion/o6/, it has 5 road PCIe port.
>>>> There is also the same problem.  If there is a problem with one of the PCIe ports, it will cause the entire system to hang.  So I hope linux OS can offer an option that enables SOC manufacturers to choose to restart the system in case of fatal hardware errors occurring in PCIe.
>>>>
>>>> There are also products such as mobile phones and tablets. We don't want to wait until the battery is completely used up before restarting them.
>>>>
>>>> For the specific code of Qualcomm, please refer to the email I sent.
>>>>
>>>
>>>
>>> Dear Sathyanarayanan,
>>>
>>> Supplementary reasons:
>>>
>>> drivers/pci/controller/cadence/pcie-cadence-host.c
>>> cdns_pci_map_bus
>>>     /* Clear AXI link-down status */
>>>     cdns_pcie_writel(pcie, CDNS_PCIE_AT_LINKDOWN, 0x0);
>>>
>>> https://elixir.bootlin.com/linux/v6.15-rc6/source/drivers/pci/controller/cadence/pcie-cadence-host.c#L52
>>>
>>> If there has been a link down in this PCIe port, the register CDNS_PCIE_AT_LINKDOWN must be set to 0 for the AXI transmission to continue.  This is different from Synopsys.
>>>
>>> If CPU Core0 runs to code L52 and CPU Core1 is executing NVMe SSD saving files, since the CDNS_PCIE_AT_LINKDOWN register is still 1, it causes CPU Core1 to be unable to send TLP transfers and hang. This is a very extreme situation.
>>> (The current Cadence code is Legacy PCIe IP, and the HPA IP is still in the upstream process at present.)
>>>
>>> Radxa O6 uses Cadence's PCIe HPA IP.
>>> http://radxa.com/products/orion/o6/
>>>
>>
>> It sounds like a system level issue to me. Why not they rely on watchdog to reboot for
>> this case ?
>
> Dear Sathyanarayanan,
>
> Thank you for your reply. Yes, personally, I think it's also a problem at the system level. I conducted a local test. When I directly unplugged the EP device on the slot, the system would hang. It has been tested many times. Since we don't have a bus timeout response mechanism for PCIe, it hangs easily.

Any comment on why a watchdog is not used to reboot the unresponsive system?

>
>>
>> Even if you want to add this support, I think it is more appropriate to add this to your
>> specific PCIe controller driver.  I don't see why you want to add it part of generic
>> AER driver.
>>
> Because we want to use the processing logic of the general AER driver. If the recovery is successful, there will be no problem. If the recovery fails, my original intention was to restart the system.
>
> If added to the specific PCIe controller driver, a lot of repetitive AER processing logic will be written. So I was thinking whether the AER driver could be changed to be compiled as a KO module.

Maybe you can rely on the error handler callbacks to get notified of fatal errors, or you can even use a uevent handler to detect the device disconnect event and handle it there.
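
As a rough illustration of the uevent route: in userspace, kobject uevents arrive on a `NETLINK_KOBJECT_UEVENT` netlink socket as a buffer of NUL-separated `KEY=value` strings. The sketch below only covers the parsing step, deciding whether one such buffer describes the removal of a PCI device. The function names are hypothetical, and what to do once a removal is detected (for example, requesting a reboot) is left to system policy.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if the n-byte field equals the NUL-terminated string want. */
static bool field_equals(const char *field, size_t n, const char *want)
{
    return n == strlen(want) && memcmp(field, want, n) == 0;
}

/*
 * Hypothetical helper: return true if a uevent payload (NUL-separated
 * "KEY=value" strings, len bytes in total) reports ACTION=remove for a
 * device in the "pci" subsystem.
 */
bool uevent_is_pci_remove(const char *buf, size_t len)
{
    bool is_remove = false, is_pci = false;
    size_t off = 0;

    while (off < len) {
        const char *field = buf + off;
        const char *end = memchr(field, '\0', len - off);
        /* The last field may lack a terminator; take the rest of the buffer. */
        size_t n = end ? (size_t)(end - field) : len - off;

        if (field_equals(field, n, "ACTION=remove"))
            is_remove = true;
        else if (field_equals(field, n, "SUBSYSTEM=pci"))
            is_pci = true;

        off += n + 1;   /* skip the field and its NUL separator */
    }
    return is_remove && is_pci;
}
```

A listener would pass each received datagram and its length to `uevent_is_pci_remove()` and act only on a match.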

>
>
> If this series is not reasonable, I'll drop it.

Adding a new kernel parameter to solve a specific system issue is not recommended. Try to find a custom solution for your chip/controller.

>
>
> Best regards,
> Hans
>
-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer




* Re: [PATCH 0/4] pci: implement "pci=aer_panic"
  2025-05-21 16:17           ` Sathyanarayanan Kuppuswamy
@ 2025-05-22  9:33             ` Hans Zhang
  0 siblings, 0 replies; 19+ messages in thread
From: Hans Zhang @ 2025-05-22  9:33 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, bhelgaas, tglx, kw,
	manivannan.sadhasivam, mahesh
  Cc: oohall, linux-pci, linux-kernel, linuxppc-dev



On 2025/5/22 00:17, Sathyanarayanan Kuppuswamy wrote:
> 
> On 5/21/25 7:54 AM, Hans Zhang wrote:
>>
>>
>> On 2025/5/21 00:09, Sathyanarayanan Kuppuswamy wrote:
>>>
>>> On 5/19/25 7:41 AM, Hans Zhang wrote:
>>>>
>>>>
>>>> On 2025/5/19 22:21, Hans Zhang wrote:
>>>>>
>>>>>
>>>>> On 2025/5/17 02:10, Sathyanarayanan Kuppuswamy wrote:
>>>>>>
>>>>>> On 5/16/25 9:55 AM, Hans Zhang wrote:
>>>>>>> The following series introduces a new kernel command-line option 
>>>>>>> aer_panic
>>>>>>> to enhance error handling for PCIe Advanced Error Reporting (AER) in
>>>>>>> mission-critical environments. This feature ensures deterministic 
>>>>>>> recover
>>>>>>> from fatal PCIe errors by triggering a controlled kernel panic 
>>>>>>> when device
>>>>>>> recovery fails, avoiding indefinite system hangs.
>>>>>>
>>>>>> Why would a device recovery failure lead to a system hang? Worst case
>>>>>> that device may not be accessible, right?  Any real use case?
>>>>>>
>>>>>
>>>>>
>>>>> Dear Sathyanarayanan,
>>>>>
>>>>> Due to Synopsys and Cadence PCIe IP, their AER interrupts are 
>>>>> usually SPI interrupts, not INTx/MSI/MSIx interrupts. (Some 
>>>>> customers will design it as an MSI/MSIx interrupt, e.g.: RK3588, 
>>>>> but not all customers have designed it this way.)  For example, 
>>>>> when many mobile phone SoCs of Qualcomm handle AER interrupts and 
>>>>> there is a link down, that is, a fatal problem occurs in the 
>>>>> current PCIe physical link, the system cannot recover.  At this 
>>>>> point, a system restart is needed to solve the problem.
>>>>>
>>>>> And our company design of SOC: http://radxa.com/products/orion/o6/, 
>>>>> it has 5 road PCIe port.
>>>>> There is also the same problem.  If there is a problem with one of 
>>>>> the PCIe ports, it will cause the entire system to hang.  So I hope 
>>>>> linux OS can offer an option that enables SOC manufacturers to 
>>>>> choose to restart the system in case of fatal hardware errors 
>>>>> occurring in PCIe.
>>>>>
>>>>> There are also products such as mobile phones and tablets. We don't 
>>>>> want to wait until the battery is completely used up before 
>>>>> restarting them.
>>>>>
>>>>> For the specific code of Qualcomm, please refer to the email I sent.
>>>>>
>>>>
>>>>
>>>> Dear Sathyanarayanan,
>>>>
>>>> Supplementary reasons:
>>>>
>>>> drivers/pci/controller/cadence/pcie-cadence-host.c
>>>> cdns_pci_map_bus
>>>>     /* Clear AXI link-down status */
>>>>     cdns_pcie_writel(pcie, CDNS_PCIE_AT_LINKDOWN, 0x0);
>>>>
>>>> https://elixir.bootlin.com/linux/v6.15-rc6/source/drivers/pci/controller/cadence/pcie-cadence-host.c#L52
>>>>
>>>> If there has been a link down in this PCIe port, the register 
>>>> CDNS_PCIE_AT_LINKDOWN must be set to 0 for the AXI transmission to 
>>>> continue.  This is different from Synopsys.
>>>>
>>>> If CPU Core0 runs to code L52 and CPU Core1 is executing NVMe SSD 
>>>> saving files, since the CDNS_PCIE_AT_LINKDOWN register is still 1, 
>>>> it causes CPU Core1 to be unable to send TLP transfers and hang. 
>>>> This is a very extreme situation.
>>>> (The current Cadence code is Legacy PCIe IP, and the HPA IP is still 
>>>> in the upstream process at present.)
>>>>
>>>> Radxa O6 uses Cadence's PCIe HPA IP.
>>>> http://radxa.com/products/orion/o6/
>>>>
>>>
>>> It sounds like a system level issue to me. Why not they rely on 
>>> watchdog to reboot for
>>> this case ?
>>
>> Dear Sathyanarayanan,
>>
>> Thank you for your reply. Yes, personally, I think it's also a problem 
>> at the system level. I conducted a local test. When I directly 
>> unplugged the EP device on the slot, the system would hang. It has 
>> been tested many times. Since we don't have a bus timeout response 
>> mechanism for PCIe, it hangs easily.
> 
> Any comment on why watchdog is not used to reboot the unresponsive system?

Dear Sathyanarayanan,

Thank you very much for your reply.

In my testing, the watchdog does not work properly every time. There
might be other causes for the entire system hanging.


> 
>>
>>>
>>> Even if you want to add this support, I think it is more appropriate 
>>> to add this to your
>>> specific PCIe controller driver.  I don't see why you want to add it 
>>> part of generic
>>> AER driver.
>>>
>> Because we want to use the processing logic of the general AER driver. 
>> If the recovery is successful, there will be no problem. If the 
>> recovery fails, my original intention was to restart the system.
>>
>> If added to the specific PCIe controller driver, a lot of repetitive 
>> AER processing logic will be written. So I was thinking whether the 
>> AER driver could be changed to be compiled as a KO module.
> 
> May be you can rely on err handler callbacks to get notification on 
> fatal errors or you can even use uevent handler to detect the 
> disconnected device event and handle it there.

I will try the method you suggested.

> 
>>
>>
>> If this series is not reasonable, I'll drop it.
> 
> Adding new kernel param to solve a specific system issue is not 
> recommended. Try to find some custom solution for your chip/controller.
> 

Ok. Understood. Thank you again for your reply.

Best regards,
Hans




* Re: [PATCH 0/4] pci: implement "pci=aer_panic"
  2025-05-16 16:55 [PATCH 0/4] pci: implement "pci=aer_panic" Hans Zhang
                   ` (5 preceding siblings ...)
  2025-05-19 22:03 ` Bjorn Helgaas
@ 2025-05-22 11:47 ` Manivannan Sadhasivam
  2025-05-22 16:01   ` Hans Zhang
  6 siblings, 1 reply; 19+ messages in thread
From: Manivannan Sadhasivam @ 2025-05-22 11:47 UTC (permalink / raw)
  To: Hans Zhang
  Cc: bhelgaas, tglx, kw, mahesh, oohall, linux-pci, linux-kernel,
	linuxppc-dev

On Sat, May 17, 2025 at 12:55:14AM +0800, Hans Zhang wrote:
> The following series introduces a new kernel command-line option aer_panic
> to enhance error handling for PCIe Advanced Error Reporting (AER) in
> mission-critical environments. This feature ensures deterministic recover
> from fatal PCIe errors by triggering a controlled kernel panic when device
> recovery fails, avoiding indefinite system hangs.
> 
> Problem Statement
> In systems where unresolved PCIe errors (e.g., bus hangs) occur,
> traditional error recovery mechanisms may leave the system unresponsive
> indefinitely. This is unacceptable for high-availability environment
> requiring prompt recovery via reboot.
> 
> Solution
> The aer_panic option forces a kernel panic on unrecoverable AER errors.
> This bypasses prolonged recovery attempts and ensures immediate reboot.
> 

You should not panic the kernel when a PCI error occurs (even if it is a fatal
one). You should instead try to reset the root complex. For that you need this
series that got merged recently:
https://lore.kernel.org/all/20250508-pcie-reset-slot-v4-0-7050093e2b50@linaro.org

PS: You need to populate the slot_reset callback in your controller driver to
reset the controller in the event of a fatal AER error or link down.
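
The idea of a controller-provided reset hook can be pictured with a small userspace model. This is a sketch under stated assumptions, not the interface added by the linked series: the struct, enum, and function names are all hypothetical. It only shows the control flow, where the generic path calls a platform-specific reset if the controller driver supplied one and otherwise reports the device as lost.

```c
#include <stddef.h>

/* Hypothetical userspace model; none of these names are kernel API. */
struct model_bridge {
    /* Platform-specific reset hook; NULL if the driver provides none. */
    int (*reset)(struct model_bridge *bridge);
    int reset_count;    /* how many times the hook has run */
};

enum model_result { MODEL_RECOVERED, MODEL_DISCONNECT };

/* Generic fatal-error path: try the controller's reset hook, never panic. */
enum model_result model_handle_fatal(struct model_bridge *bridge)
{
    if (!bridge->reset)
        return MODEL_DISCONNECT;

    return bridge->reset(bridge) == 0 ? MODEL_RECOVERED : MODEL_DISCONNECT;
}

/* Example hook a controller driver might supply; always succeeds here. */
int model_reset_ok(struct model_bridge *bridge)
{
    bridge->reset_count++;
    return 0;
}
```

In this model the recovery decision stays in generic code, while the controller-specific reset sequence lives behind the hook.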

- Mani

-- 
மணிவண்ணன் சதாசிவம்



* Re: [PATCH 0/4] pci: implement "pci=aer_panic"
  2025-05-22 11:47 ` Manivannan Sadhasivam
@ 2025-05-22 16:01   ` Hans Zhang
  0 siblings, 0 replies; 19+ messages in thread
From: Hans Zhang @ 2025-05-22 16:01 UTC (permalink / raw)
  To: Manivannan Sadhasivam
  Cc: bhelgaas, tglx, kw, mahesh, oohall, linux-pci, linux-kernel,
	linuxppc-dev



On 2025/5/22 19:47, Manivannan Sadhasivam wrote:
> On Sat, May 17, 2025 at 12:55:14AM +0800, Hans Zhang wrote:
>> The following series introduces a new kernel command-line option aer_panic
>> to enhance error handling for PCIe Advanced Error Reporting (AER) in
>> mission-critical environments. This feature ensures deterministic recover
>> from fatal PCIe errors by triggering a controlled kernel panic when device
>> recovery fails, avoiding indefinite system hangs.
>>
>> Problem Statement
>> In systems where unresolved PCIe errors (e.g., bus hangs) occur,
>> traditional error recovery mechanisms may leave the system unresponsive
>> indefinitely. This is unacceptable for high-availability environment
>> requiring prompt recovery via reboot.
>>
>> Solution
>> The aer_panic option forces a kernel panic on unrecoverable AER errors.
>> This bypasses prolonged recovery attempts and ensures immediate reboot.
>>
> 
> You should not panic the kernel when a PCI error occurs (even if it is a fatal
> one). You should instead try to reset the root complex. For that you need this
> series that got merged recently:
> https://lore.kernel.org/all/20250508-pcie-reset-slot-v4-0-7050093e2b50@linaro.org
> 
> PS: You need to populate the slot_reset callback in your controller driver to
> reset the controller in the event of a fatal AER error or link down.

Dear Mani,

Thank you for your reply. I will take a look at the series you
linked.

Best regards,
Hans





Thread overview: 19+ messages
2025-05-16 16:55 [PATCH 0/4] pci: implement "pci=aer_panic" Hans Zhang
2025-05-16 16:55 ` [PATCH 1/4] " Hans Zhang
2025-05-16 16:55 ` [PATCH 2/4] PCI/AER: Introduce aer_panic kernel command-line option Hans Zhang
2025-05-16 16:55 ` [PATCH 3/4] PCI/AER: Expose AER panic state via pci_aer_panic_enabled() Hans Zhang
2025-05-17  4:07   ` Sathyanarayanan Kuppuswamy
2025-05-19 14:03     ` Hans Zhang
2025-05-16 16:55 ` [PATCH 4/4] PCI/AER: Trigger kernel panic on recovery failure if aer_panic is set Hans Zhang
2025-05-16 18:10 ` [PATCH 0/4] pci: implement "pci=aer_panic" Sathyanarayanan Kuppuswamy
2025-05-19 14:21   ` Hans Zhang
2025-05-19 14:39     ` Hans Zhang
2025-05-19 14:41     ` Hans Zhang
2025-05-20 16:09       ` Sathyanarayanan Kuppuswamy
2025-05-21 14:54         ` Hans Zhang
2025-05-21 16:17           ` Sathyanarayanan Kuppuswamy
2025-05-22  9:33             ` Hans Zhang
2025-05-19 22:03 ` Bjorn Helgaas
2025-05-20 15:11   ` Hans Zhang
2025-05-22 11:47 ` Manivannan Sadhasivam
2025-05-22 16:01   ` Hans Zhang
