* [PATCH] PCI/ERR: Clear fatal status of the reporting device
@ 2026-02-27 10:25 Sizhe Liu
From: Sizhe Liu @ 2026-02-27 10:25 UTC
To: bhelgaas, jonathan.cameron, shiju.jose, keith.busch
Cc: linux-pci, linuxarm, prime.zeng, fanghao11, shenyang39, liusizhe5
During PCIe native AER error recovery, ERR_FATAL status bits are not cleared
after fatal error handling. This causes stale ERR_FATAL bits to be reported
in subsequent AER events, even after reporting "device recovery successful".

Prior to commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery"), native
AER handled non-fatal and fatal errors separately, clearing the corresponding
status bits for both types after processing. That commit unified the error
paths through pcie_do_recovery(), which began invoking
pci_cleanup_aer_uncorrect_error_status() after commit bfcb79fca19d
("PCI/ERR: Run error recovery callbacks for all affected devices").

This function only clears non-fatal error (NFE) bits and leaves ERR_FATAL
bits uncleared, resulting in stale error status.

Fix this by explicitly clearing the ERR_FATAL status bits for the reporting
device during recovery, restoring the original fatal error handling behavior.

Signed-off-by: Sizhe Liu <liusizhe5@huawei.com>
---
drivers/pci/pcie/err.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index bebe4bc111d7..f51225f2592d 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -281,6 +281,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	if (host->native_aer || pcie_ports_native) {
 		pcie_clear_device_status(dev);
 		pci_aer_clear_nonfatal_status(dev);
+		pci_aer_clear_fatal_status(dev);
 	}
 
 	pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);
--
2.33.0
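
The severity split that the commit message relies on can be sketched in userspace C. This is a simplified model and not the kernel code: `struct aer_model`, `rw1c_write()` and the two `model_clear_*()` helpers are hypothetical names, and it assumes that the Uncorrectable Error Status register is write-1-to-clear and that the Uncorrectable Error Severity register marks which status bits count as fatal.

```c
#include <stdint.h>

/* Hypothetical stand-in for a device's AER uncorrectable registers. */
struct aer_model {
	uint32_t uncor_status;	/* write-1-to-clear status bits */
	uint32_t uncor_sever;	/* 1 = the matching status bit is fatal */
};

/* Model of a write to a write-1-to-clear register. */
static void rw1c_write(uint32_t *reg, uint32_t val)
{
	*reg &= ~val;
}

/* Clear only the status bits whose severity is non-fatal. */
static void model_clear_nonfatal(struct aer_model *m)
{
	uint32_t bits = m->uncor_status & ~m->uncor_sever;

	if (bits)
		rw1c_write(&m->uncor_status, bits);
}

/* Clear only the status bits whose severity is fatal. */
static void model_clear_fatal(struct aer_model *m)
{
	uint32_t bits = m->uncor_status & m->uncor_sever;

	if (bits)
		rw1c_write(&m->uncor_status, bits);
}
```

In this model, a status of 0x1010 with a severity of 0x10 becomes 0x10 after model_clear_nonfatal() and reaches 0 only after model_clear_fatal() runs as well.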
* Re: [PATCH] PCI/ERR: Clear fatal status of the reporting device
@ 2026-02-27 16:31 Bjorn Helgaas
From: Bjorn Helgaas @ 2026-02-27 16:31 UTC
To: Sizhe Liu
Cc: bhelgaas, jonathan.cameron, shiju.jose, keith.busch, linux-pci, linuxarm, prime.zeng, fanghao11, shenyang39, Shuai Xue, Lukas Wunner, Kuppuswamy Sathyanarayanan, Terry Bowman

[+cc others interested in error handling]

On Fri, Feb 27, 2026 at 06:25:05PM +0800, Sizhe Liu wrote:
> During PCIe native AER error recovery, ERR_FATAL status bits are not cleared
> after fatal error handling. This causes stale ERR_FATAL bits to be reported
> in subsequent AER events, even after reporting "device recovery successful".
>
> Prior to commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery"), native
> AER handled non-fatal and fatal errors separately, clearing the corresponding
> status bits for both types after processing. That commit unified the error
> paths through pcie_do_recovery(), which began invoking
> pci_cleanup_aer_uncorrect_error_status() after commit bfcb79fca19d
> ("PCI/ERR: Run error recovery callbacks for all affected devices").
>
> This function only clears non-fatal error (NFE) bits and leaves ERR_FATAL
> bits uncleared, resulting in stale error status.
>
> Fix this by explicitly clearing the ERR_FATAL status bits for the reporting
> device during recovery, restoring the original fatal error handling behavior.
>
> Signed-off-by: Sizhe Liu <liusizhe5@huawei.com>
> ---
> drivers/pci/pcie/err.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index bebe4bc111d7..f51225f2592d 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -281,6 +281,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>  	if (host->native_aer || pcie_ports_native) {
>  		pcie_clear_device_status(dev);
>  		pci_aer_clear_nonfatal_status(dev);
> +		pci_aer_clear_fatal_status(dev);
>  	}
>
>  	pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);
> --
> 2.33.0
>
* Re: [PATCH] PCI/ERR: Clear fatal status of the reporting device
@ 2026-02-27 18:01 Kuppuswamy Sathyanarayanan
From: Kuppuswamy Sathyanarayanan @ 2026-02-27 18:01 UTC
To: Bjorn Helgaas, Sizhe Liu
Cc: bhelgaas, jonathan.cameron, shiju.jose, keith.busch, linux-pci, linuxarm, prime.zeng, fanghao11, shenyang39, Shuai Xue, Lukas Wunner, Terry Bowman

Hi,

On 2/27/2026 8:31 AM, Bjorn Helgaas wrote:
> [+cc others interested in error handling]
>
> On Fri, Feb 27, 2026 at 06:25:05PM +0800, Sizhe Liu wrote:
>> During PCIe native AER error recovery, ERR_FATAL status bits are not cleared
>> after fatal error handling. This causes stale ERR_FATAL bits to be reported
>> in subsequent AER events, even after reporting "device recovery successful".
>>
>> Prior to commit bdb5ac85777d ("PCI/ERR: Handle fatal error recovery"), native
>> AER handled non-fatal and fatal errors separately, clearing the corresponding
>> status bits for both types after processing. That commit unified the error
>> paths through pcie_do_recovery(), which began invoking
>> pci_cleanup_aer_uncorrect_error_status() after commit bfcb79fca19d
>> ("PCI/ERR: Run error recovery callbacks for all affected devices").
>>
>> This function only clears non-fatal error (NFE) bits and leaves ERR_FATAL
>> bits uncleared, resulting in stale error status.
>>
>> Fix this by explicitly clearing the ERR_FATAL status bits for the reporting
>> device during recovery, restoring the original fatal error handling behavior.
>>

It seems to be a valid fix to me. You probably need to add a Fixes: tag and
cc stable.

>> Signed-off-by: Sizhe Liu <liusizhe5@huawei.com>
>> ---
>> drivers/pci/pcie/err.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index bebe4bc111d7..f51225f2592d 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -281,6 +281,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>  	if (host->native_aer || pcie_ports_native) {
>>  		pcie_clear_device_status(dev);
>>  		pci_aer_clear_nonfatal_status(dev);
>> +		pci_aer_clear_fatal_status(dev);
>>  	}

This code path is common to both fatal and non-fatal recovery. Clearing the
fatal status here during non-fatal recovery could silently erase newly
detected fatal errors. I suggest clearing it in the AER driver after it
parses the error.

>>
>>  	pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);
>> --
>> 2.33.0
>>
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
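
The concern about erasing newly detected fatal errors can be illustrated with a small userspace sketch. The reply does not say how the clearing in the AER driver would be implemented, so this is only one possible reading: clear exactly the bits that were captured when the error was parsed, instead of re-reading the status register at the end of recovery. Both helper names are hypothetical, and each returns the status value left behind after the clear.

```c
#include <stdint.h>

/*
 * Re-read the status at the end of recovery and clear every fatal bit
 * that is set at that moment, including bits nobody has reported yet.
 */
static uint32_t clear_fatal_by_reread(uint32_t status_now, uint32_t sever)
{
	return status_now & ~(status_now & sever);
}

/* Clear only the bits that were captured and logged for this event. */
static uint32_t clear_by_snapshot(uint32_t status_now, uint32_t logged)
{
	return status_now & ~logged;
}
```

Suppose a non-fatal bit 0x1000 was logged and a fatal bit 0x10 arrives while recovery is running, so the register reads 0x1010. The re-read variant leaves 0x1000 and drops the unreported fatal bit, while the snapshot variant leaves 0x10 set for a later report.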
* Re: [PATCH] PCI/ERR: Clear fatal status of the reporting device
@ 2026-02-27 21:15 Lukas Wunner
From: Lukas Wunner @ 2026-02-27 21:15 UTC
To: Bjorn Helgaas
Cc: Sizhe Liu, bhelgaas, jonathan.cameron, shiju.jose, keith.busch, linux-pci, linuxarm, prime.zeng, fanghao11, shenyang39, Shuai Xue, Kuppuswamy Sathyanarayanan, Terry Bowman

On Fri, Feb 27, 2026 at 06:25:05PM +0800, Sizhe Liu wrote:
> During PCIe native AER error recovery, ERR_FATAL status bits are not cleared
> after fatal error handling. This causes stale ERR_FATAL bits to be reported
> in subsequent AER events, even after reporting "device recovery successful".

Wrong. The bits are cleared by:

report_slot_reset()
  err_handler->slot_reset()
    pci_restore_state()
      pci_aer_clear_status()
        pci_aer_raw_clear_status()

Is this an LLM-generated submission? The confidently worded but incorrect
commit message seems to suggest that it is. If so, please follow the
guidelines in Documentation/process/generated-content.rst and be
transparent about the tools you used to come up with the patch.

Thanks,

Lukas
* Re: [PATCH] PCI/ERR: Clear fatal status of the reporting device
@ 2026-02-27 22:47 Kuppuswamy Sathyanarayanan
From: Kuppuswamy Sathyanarayanan @ 2026-02-27 22:47 UTC
To: Lukas Wunner, Bjorn Helgaas
Cc: Sizhe Liu, bhelgaas, jonathan.cameron, shiju.jose, keith.busch, linux-pci, linuxarm, prime.zeng, fanghao11, shenyang39, Shuai Xue, Terry Bowman

Hi Lukas,

On 2/27/2026 1:15 PM, Lukas Wunner wrote:
> On Fri, Feb 27, 2026 at 06:25:05PM +0800, Sizhe Liu wrote:
>> During PCIe native AER error recovery, ERR_FATAL status bits are not cleared
>> after fatal error handling. This causes stale ERR_FATAL bits to be reported
>> in subsequent AER events, even after reporting "device recovery successful".
>
> Wrong. The bits are cleared by:
>
> report_slot_reset()
>   err_handler->slot_reset()
>     pci_restore_state()
>       pci_aer_clear_status()
>         pci_aer_raw_clear_status()

Thanks for the correction and for sharing the call flow. I was not aware
that the fatal status bits are already cleared via pci_restore_state().

That raises a question. If pci_restore_state() already clears all AER
status bits through pci_aer_raw_clear_status(), do we still need the
explicit pci_aer_clear_nonfatal_status() call in pcie_do_recovery()?
Similarly, could pcie_clear_device_status() also be moved there?
I see that the pcie_clear_device_status() call is sprinkled across all error
handling paths (EDR, DPC & AER).

Also, since pci_aer_raw_clear_status() clears all error status registers,
is there a risk of silently losing newly detected errors that arrive while
recovery is still in progress?

>
> Thanks,
>
> Lukas
>

--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
* Re: [PATCH] PCI/ERR: Clear fatal status of the reporting device
@ 2026-02-28 2:06 Shuai Xue
From: Shuai Xue @ 2026-02-28 2:06 UTC
To: Kuppuswamy Sathyanarayanan, Lukas Wunner, Bjorn Helgaas
Cc: Sizhe Liu, bhelgaas, jonathan.cameron, shiju.jose, keith.busch, linux-pci, linuxarm, prime.zeng, fanghao11, shenyang39, Terry Bowman

On 2/28/26 6:47 AM, Kuppuswamy Sathyanarayanan wrote:
> Hi Lukas,
>
> On 2/27/2026 1:15 PM, Lukas Wunner wrote:
>> On Fri, Feb 27, 2026 at 06:25:05PM +0800, Sizhe Liu wrote:
>>> During PCIe native AER error recovery, ERR_FATAL status bits are not cleared
>>> after fatal error handling. This causes stale ERR_FATAL bits to be reported
>>> in subsequent AER events, even after reporting "device recovery successful".
>>
>> Wrong. The bits are cleared by:
>>
>> report_slot_reset()
>>   err_handler->slot_reset()
>>     pci_restore_state()
>>       pci_aer_clear_status()
>>         pci_aer_raw_clear_status()
>
> Thanks for the correction and for sharing the call flow. I was not aware
> that the fatal status bits are already cleared via pci_restore_state().

Hi,

I also have a patch which tries to handle the same problem. Please also see:

https://lore.kernel.org/all/de195f39-8197-494e-8451-2cf8fde064ea@linux.alibaba.com/

>
> That raises a question. If pci_restore_state() already clears all AER
> status bits through pci_aer_raw_clear_status(), do we still need the
> explicit pci_aer_clear_nonfatal_status() call in pcie_do_recovery()?
> Similarly, could pcie_clear_device_status() also be moved there?
> I see pcie_clear_device_status() call is sprinkled across all error
> handling paths (EDR, DPC & AER).
>
> Also, since pci_aer_raw_clear_status() clears all error status registers,
> is there a risk of silently losing newly detected errors that arrive while
> recovery is still in progress?

Since ->slot_reset() is driver-defined and not all drivers invoke
pci_restore_state(), there could be cases where fatal AER status bits
remain set after the frozen recovery completes.

Lukas is working on moving pci_aer_clear_status() out of
pci_restore_state() into the callers that actually need it.
After that, IMHO, we should clear AER status conditional on the
'state' parameter.

Thanks.
Shuai
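
The point that clearing depends on what each driver's ->slot_reset() does can be shown with a small userspace model. This is a sketch under stated assumptions and not kernel code: `struct dev_model` and every `model_*`/`slot_reset_*` name below is hypothetical, and model_restore_state() simply stands in for a restore path that wipes all AER status as a side effect.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical device with one status register and a driver callback. */
struct dev_model {
	uint32_t uncor_status;
	void (*slot_reset)(struct dev_model *dev);	/* may be NULL */
};

/* Stand-in for a restore-state helper that also clears all AER status. */
static void model_restore_state(struct dev_model *dev)
{
	dev->uncor_status = 0;
}

/* A driver whose ->slot_reset() restores saved state. */
static void slot_reset_with_restore(struct dev_model *dev)
{
	model_restore_state(dev);
}

/* A driver whose ->slot_reset() reinitializes the device some other way. */
static void slot_reset_without_restore(struct dev_model *dev)
{
	(void)dev;
}

/* Recovery step: status is cleared only if the driver's callback does it. */
static void model_report_slot_reset(struct dev_model *dev)
{
	if (dev->slot_reset)
		dev->slot_reset(dev);
}
```

In this model only the first kind of driver ends recovery with a clean status register. A driver that skips the restore helper, or a device with no callback at all, keeps whatever bits were set before.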
* Re: [PATCH] PCI/ERR: Clear fatal status of the reporting device
@ 2026-03-03 13:34 Sizhe Liu
From: Sizhe Liu @ 2026-03-03 13:34 UTC
To: Shuai Xue, Kuppuswamy Sathyanarayanan, Lukas Wunner, Bjorn Helgaas
Cc: bhelgaas, jonathan.cameron, shiju.jose, keith.busch, linux-pci, linuxarm, prime.zeng, fanghao11, shenyang39, Terry Bowman

On 2026/2/28 10:06, Shuai Xue wrote:
> On 2/28/26 6:47 AM, Kuppuswamy Sathyanarayanan wrote:
>> Hi Lukas,
>>
>> On 2/27/2026 1:15 PM, Lukas Wunner wrote:
>>> On Fri, Feb 27, 2026 at 06:25:05PM +0800, Sizhe Liu wrote:
>>>> During PCIe native AER error recovery, ERR_FATAL status bits are not
>>>> cleared after fatal error handling. This causes stale ERR_FATAL bits
>>>> to be reported in subsequent AER events, even after reporting
>>>> "device recovery successful".
>>>
>>> Wrong. The bits are cleared by:
>>>
>>> report_slot_reset()
>>>   err_handler->slot_reset()
>>>     pci_restore_state()
>>>       pci_aer_clear_status()
>>>         pci_aer_raw_clear_status()
>>
>> Thanks for the correction and for sharing the call flow. I was not aware
>> that the fatal status bits are already cleared via pci_restore_state().
>
> Hi,
>
> I also have a patch which tries to handle the same problem. Please also
> see:
>
> https://lore.kernel.org/all/de195f39-8197-494e-8451-2cf8fde064ea@linux.alibaba.com/
>

Hi Shuai,

Thanks for your feedback, and sorry that I missed your patch. To keep things
focused, I think we can continue the discussion in your patch series.

>>
>> That raises a question. If pci_restore_state() already clears all AER
>> status bits through pci_aer_raw_clear_status(), do we still need the
>> explicit pci_aer_clear_nonfatal_status() call in pcie_do_recovery()?
>> Similarly, could pcie_clear_device_status() also be moved there?
>> I see pcie_clear_device_status() call is sprinkled across all error
>> handling paths (EDR, DPC & AER).
>>
>> Also, since pci_aer_raw_clear_status() clears all error status
>> registers, is there a risk of silently losing newly detected errors
>> that arrive while recovery is still in progress?
>
> Since ->slot_reset() is driver-defined and not all drivers invoke
> pci_restore_state(), there could be cases where fatal AER status bits
> remain set after the frozen recovery completes.
>

I agree with you. In addition, pcie_do_recovery() uses pci_walk_bridge() to
walk the devices that are potentially affected by the error. For a device
that has a subordinate bus, we only walk that subordinate bus, including any
bridged devices below it. That is, if the fatal error source is a Root Port
(RP), the RP driver's callbacks will not be invoked.

I'm not sure whether we need to call report_slot_reset() for the RP. IMHO,
we should at least clear its device status bits and error status bits.

Results of my AER injection test on the RP:
https://lore.kernel.org/linux-pci/46194828-2ea0-49e1-8812-07581cc04e8a@huawei.com/

> Lukas is working on moving pci_aer_clear_status() out of
> pci_restore_state() into the callers that actually need it.
> After that, IMHO, we should clear AER status conditional on the
> 'state' parameter.

Sounds interesting. I appreciate Lukas's efforts and the information you
have shared.

Regards,
Sizhe

>
> Thanks.
> Shuai
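
The walking behavior described in this reply can be sketched with a toy device tree in userspace C. The structure and function names (`struct node`, `model_walk_bus()`, `model_walk_bridge()`, `visit()`) are hypothetical; the sketch only assumes what the message states, namely that a device with a subordinate bus has the callback run on the devices below it and not on itself.

```c
#include <stddef.h>

/* Hypothetical device node in a simplified PCI hierarchy. */
struct node {
	const char *name;
	struct node *children;	/* first device on the subordinate bus, or NULL */
	struct node *sibling;	/* next device on the same bus */
	int visited;
};

/* Example callback: record that recovery touched this device. */
static void visit(struct node *n)
{
	n->visited = 1;
}

/* Run cb on every device on a bus, descending through bridges. */
static void model_walk_bus(struct node *first, void (*cb)(struct node *))
{
	for (struct node *n = first; n; n = n->sibling) {
		cb(n);
		model_walk_bus(n->children, cb);
	}
}

/* With a subordinate bus, only devices below the bridge get the callback. */
static void model_walk_bridge(struct node *bridge, void (*cb)(struct node *))
{
	if (bridge->children)
		model_walk_bus(bridge->children, cb);
	else
		cb(bridge);
}
```

Walking a Root Port with one endpoint below it marks the endpoint and leaves the Root Port untouched, which is the gap the message points at when the Root Port itself is the error source.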
* Re: [PATCH] PCI/ERR: Clear fatal status of the reporting device
@ 2026-02-28 12:01 Sizhe Liu
From: Sizhe Liu @ 2026-02-28 12:01 UTC
To: Lukas Wunner, Bjorn Helgaas
Cc: bhelgaas, jonathan.cameron, shiju.jose, keith.busch, linux-pci, linuxarm, prime.zeng, fanghao11, shenyang39, Shuai Xue, Kuppuswamy Sathyanarayanan, Terry Bowman

On 2026/2/28 5:15, Lukas Wunner wrote:
> On Fri, Feb 27, 2026 at 06:25:05PM +0800, Sizhe Liu wrote:
>> During PCIe native AER error recovery, ERR_FATAL status bits are not cleared
>> after fatal error handling. This causes stale ERR_FATAL bits to be reported
>> in subsequent AER events, even after reporting "device recovery successful".
> Wrong. The bits are cleared by:
>
> report_slot_reset()
>   err_handler->slot_reset()
>     pci_restore_state()
>       pci_aer_clear_status()
>         pci_aer_raw_clear_status()
>
> Is this an LLM-generated submission? The confidently worded but incorrect
> commit message seems to suggest that it is. If so, please follow the
> guidelines in Documentation/process/generated-content.rst and be
> transparent about the tools you used to come up with the patch.

Hi Lukas,

Thank you for your feedback and the detailed call trace. I had some
misunderstanding about this issue. I was not aware that we clear all AER
status registers in pci_restore_state(), which led me to incorrectly
assume fatal error status was not cleared in Native AER.

This issue was discovered during native AER injection testing on the RP
side of the Kunpeng CPU platform. The relevant test process and kernel
logs are shown below:

[root@localhost ~]# busybox devmem aer_nce_inject_addr 32 0x10
pcieport 0000:20:10.0: AER: Uncorrected (Fatal) error message received from 0000:20:10.0
pcieport 0000:20:10.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Data Link Layer, (Receiver ID)
pcieport 0000:20:10.0: device [19e5:a120] error status/mask=00000010/04580000
pcieport 0000:20:10.0: [ 4] DLP
pcieport 0000:20:10.0: AER: Root Port link has been reset (0)
pcieport 0000:20:10.0: AER: device recovery successful
[root@localhost ~]# busybox devmem aer_nce_inject_addr 32 0x1000
pcieport 0000:20:10.0: AER: Uncorrected (Non-Fatal) error message received from 0000:20:10.0
pcieport 0000:20:10.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Data Link Layer, (Receiver ID)
pcieport 0000:20:10.0: device [19e5:a120] error status/mask=00001010/04580000
pcieport 0000:20:10.0: [ 4] DLP
pcieport 0000:20:10.0: [12] TLP (First)
pcieport 0000:20:10.0: AER: TLP Header: 00000000 00000000 00000000 00000000
pcieport 0000:20:10.0: AER: device recovery successful

The root cause of the problem may not be here. I will perform further
analysis.

Thanks,
Sizhe

>
> Thanks,
>
> Lukas
>
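
The two status values in the log above (00000010 after the first injection, 00001010 after the second) can be decoded with a short userspace helper. This is only an illustrative sketch; `set_bit_indices()` and `carried_over()` are hypothetical names and are not part of the kernel or of the test that was run.

```c
#include <stdint.h>

/* Store the indices of set bits in out[] (room for 32) and return the count. */
static int set_bit_indices(uint32_t value, int out[32])
{
	int n = 0;

	for (int bit = 0; bit < 32; bit++)
		if (value & (UINT32_C(1) << bit))
			out[n++] = bit;
	return n;
}

/* Bits in the second report that were already present in the first one. */
static uint32_t carried_over(uint32_t first_status, uint32_t second_status)
{
	return first_status & second_status;
}
```

Applied to the logged values, the second status has bits 4 and 12 set, matching the "[ 4] DLP" and "[12] TLP" lines, and carried_over() yields 0x10: bit 4 from the first, fatal event is still present in the second report.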