* [PATCH] PCI: Mask Replay Timer Timeout for Realtek RTS525A
From: Max Lee @ 2026-05-28  3:23 UTC
To: bhelgaas
Cc: linux-pci, linux-kernel, acelan.kao, Max Lee

The Realtek RTS525A PCI-Express SD card reader (10ec:525a) generates
excessive Correctable Error (Replay Timer Timeout) AER events during
PCIe link initialization. On systems where firmware enables AER
reporting (CERptEn+), this causes an AER storm of ~240K error events
within 11 seconds of boot, overwhelming the kernel's error handler and
blocking shutdown/reboot.

The root cause is a transient link training instability inherent to this
device -- even on BIOS versions that suppress reporting, the error
status register (CESta) shows Timeout+ set.

Unlike the GL9750/GL9755 fixup (which only masks the parent root port),
the RTS525A additionally requires masking the endpoint's own Correctable
Error Mask register bit 12 (PCI_ERR_COR_REP_TIMER) to prevent it from
sending ERR_COR messages upstream. Call pci_mask_replay_timer_timeout()
to mask the parent root port as well.

Signed-off-by: Max Lee <max.lee@canonical.com>
---
 drivers/pci/quirks.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index caaed1a01dc0..072d1456daad 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -6380,4 +6380,23 @@ static void pci_mask_replay_timer_timeout(struct pci_dev *pdev)
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9750, pci_mask_replay_timer_timeout);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9755, pci_mask_replay_timer_timeout);
+
+static void pci_mask_replay_timer_timeout_on_endpoint(struct pci_dev *pdev)
+{
+	u32 val;
+
+	if (!pdev->aer_cap)
+		return;
+
+	pci_info(pdev, "mask Replay Timer Timeout on endpoint due to hardware defect\n");
+
+	pci_read_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, &val);
+	val |= PCI_ERR_COR_REP_TIMER;
+	pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, val);
+
+	/* Also mask the parent root port */
+	pci_mask_replay_timer_timeout(pdev);
+}
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REALTEK, 0x525a,
+			pci_mask_replay_timer_timeout_on_endpoint);
 #endif
--
2.43.0

* Re: [PATCH] PCI: Mask Replay Timer Timeout for Realtek RTS525A
From: sashiko-bot @ 2026-05-28  3:41 UTC
To: Max Lee
Cc: linux-pci

Thank you for your contribution! Sashiko AI review found 1 potential
issue(s) to consider:
- [High] The quirk returns early if the endpoint lacks the AER
  capability, unintentionally skipping the AER mask on the parent root
  port.
--

commit 33ed73136f45313933a545d6af06c38d893ebef2
Author: Max Lee <max.lee@canonical.com>

PCI: Mask Replay Timer Timeout for Realtek RTS525A

This commit adds a PCI fixup quirk for the Realtek RTS525A SD card
reader. It masks the Replay Timer Timeout on both the endpoint and its
parent root port to prevent an AER storm during PCIe link
initialization.

> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index caaed1a01dc02..072d1456daad4 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -6380,4 +6380,23 @@ static void pci_mask_replay_timer_timeout(struct pci_dev *pdev)
>  }
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9750, pci_mask_replay_timer_timeout);
>  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9755, pci_mask_replay_timer_timeout);
> +
> +static void pci_mask_replay_timer_timeout_on_endpoint(struct pci_dev *pdev)
> +{
> +	u32 val;
> +
> +	if (!pdev->aer_cap)
> +		return;

[Severity: High]
If the endpoint lacks the AER capability, this early return will
entirely skip masking the parent root port later in the function.

Does this leave the system vulnerable to the AER storm from the parent
port if the endpoint's AER capability is disabled or hidden?

Could this be restructured so that the parent root port masking is still
applied even when the endpoint lacks the AER capability?

> +
> +	pci_info(pdev, "mask Replay Timer Timeout on endpoint due to hardware defect\n");
> +
> +	pci_read_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, &val);
> +	val |= PCI_ERR_COR_REP_TIMER;
> +	pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, val);
> +
> +	/* Also mask the parent root port */
> +	pci_mask_replay_timer_timeout(pdev);
> +}

--
Sashiko AI review · https://sashiko.dev/#/patchset/20260528032347.644365-1-max.lee@canonical.com?part=1

* [PATCH v2] PCI: Mask Replay Timer Timeout for Realtek RTS525A
From: Max Lee @ 2026-06-10  2:47 UTC
To: bhelgaas
Cc: linux-pci, linux-kernel, acelan.kao, Max Lee

The Realtek RTS525A PCI-Express SD card reader (10ec:525a) generates
excessive Correctable Error (Replay Timer Timeout) AER events during
PCIe link initialization. On systems where firmware enables AER
reporting (CERptEn+), this causes an AER storm of ~240K error events
within 11 seconds of boot, overwhelming the kernel error handler and
blocking shutdown/reboot.

The root cause is a transient link training instability inherent to this
device -- even on BIOS versions that suppress reporting, the error
status register (CESta) shows Timeout+ set.

Unlike the GL9750/GL9755 fixup, which only masks the parent root port,
the RTS525A also needs its endpoint Correctable Error Mask bit 12
(PCI_ERR_COR_REP_TIMER) masked when the endpoint exposes AER, so it does
not send ERR_COR messages upstream. Also mask the parent root port to
cover root-port reporting of link errors caused by the endpoint.

Signed-off-by: Max Lee <max.lee@canonical.com>
---
Changes in v2:
- Mask the parent root port even when the endpoint lacks AER capability.
- Remove the early return before parent root port masking.

 drivers/pci/quirks.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index caaed1a01dc0..6597536a4c70 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -6380,4 +6380,26 @@ static void pci_mask_replay_timer_timeout(struct pci_dev *pdev)
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9750, pci_mask_replay_timer_timeout);
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9755, pci_mask_replay_timer_timeout);
+
+static void pci_mask_replay_timer_timeout_on_endpoint(struct pci_dev *pdev)
+{
+	u32 val;
+
+	if (pdev->aer_cap) {
+		pci_info(pdev, "mask Replay Timer Timeout on endpoint due to hardware defect\n");
+
+		pci_read_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, &val);
+		val |= PCI_ERR_COR_REP_TIMER;
+		pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, val);
+	}
+
+	/*
+	 * Also mask the parent root port. Do this even if the endpoint lacks
+	 * AER capability because the root port may still report link errors
+	 * caused by the endpoint.
+	 */
+	pci_mask_replay_timer_timeout(pdev);
+}
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REALTEK, 0x525a,
+			pci_mask_replay_timer_timeout_on_endpoint);
 #endif
--
2.43.0

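The endpoint half of the v2 quirk is a read-modify-write of the AER
Correctable Error Mask register. A minimal user-space sketch of that step
follows. It is a model under stated assumptions, not kernel code:
struct fake_dev, cfg_read32(), cfg_write32() and mask_replay_timer() are
hypothetical stand-ins for struct pci_dev and the pci_*_config_dword()
helpers, and the config space is a plain byte array in host byte order.
The two register constants use the values from
include/uapi/linux/pci_regs.h.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Values as defined in include/uapi/linux/pci_regs.h */
#define PCI_ERR_COR_MASK	0x14		/* offset within the AER capability */
#define PCI_ERR_COR_REP_TIMER	0x00001000u	/* bit 12: Replay Timer Timeout */

#define CFG_SPACE_SIZE		4096u

/* Hypothetical stand-in for struct pci_dev; aer_cap == 0 means "no AER". */
struct fake_dev {
	uint16_t aer_cap;
	uint8_t cfg[CFG_SPACE_SIZE];
};

static uint32_t cfg_read32(const struct fake_dev *dev, uint16_t off)
{
	uint32_t val;

	assert((size_t)off + sizeof(val) <= CFG_SPACE_SIZE);
	memcpy(&val, &dev->cfg[off], sizeof(val));
	return val;
}

static void cfg_write32(struct fake_dev *dev, uint16_t off, uint32_t val)
{
	assert((size_t)off + sizeof(val) <= CFG_SPACE_SIZE);
	memcpy(&dev->cfg[off], &val, sizeof(val));
}

/*
 * Set the Replay Timer Timeout bit in the Correctable Error Mask while
 * preserving every other mask bit. Returns 1 if the mask was updated,
 * 0 if the device exposes no AER capability and nothing was touched.
 */
static int mask_replay_timer(struct fake_dev *dev)
{
	uint16_t off;
	uint32_t val;

	if (!dev->aer_cap)
		return 0;

	off = (uint16_t)(dev->aer_cap + PCI_ERR_COR_MASK);
	val = cfg_read32(dev, off);
	val |= PCI_ERR_COR_REP_TIMER;
	cfg_write32(dev, off, val);
	return 1;
}
```

In this model, an AER capability at 0x100 whose mask starts at
0x00006000 ends up with 0x00007000 after mask_replay_timer(): bit 12 is
added and the previously masked bits are left alone.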
* Re: [PATCH v2] PCI: Mask Replay Timer Timeout for Realtek RTS525A
From: Manivannan Sadhasivam @ 2026-07-01  6:27 UTC
To: Max Lee
Cc: bhelgaas, linux-pci, linux-kernel, acelan.kao

On Wed, Jun 10, 2026 at 10:47:23AM +0800, Max Lee wrote:
> The Realtek RTS525A PCI-Express SD card reader (10ec:525a) generates
> excessive Correctable Error (Replay Timer Timeout) AER events during
> PCIe link initialization. On systems where firmware enables AER
> reporting (CERptEn+), this causes an AER storm of ~240K error events
> within 11 seconds of boot, overwhelming the kernel error handler and
> blocking shutdown/reboot.
>
> The root cause is a transient link training instability inherent to this
> device -- even on BIOS versions that suppress reporting, the error
> status register (CESta) shows Timeout+ set.
>
> Unlike the GL9750/GL9755 fixup, which only masks the parent root port,
> the RTS525A also needs its endpoint Correctable Error Mask bit 12
> (PCI_ERR_COR_REP_TIMER) masked when the endpoint exposes AER, so it does
> not send ERR_COR messages upstream. Also mask the parent root port to
> cover root-port reporting of link errors caused by the endpoint.
>
> Signed-off-by: Max Lee <max.lee@canonical.com>
> ---
> Changes in v2:
> - Mask the parent root port even when the endpoint lacks AER capability.

You've added this based on the comment by Sashiko I believe. But I think
Sashiko was wrong here. If the EP lacks AER capability, then there is no
way it is going to send AER errors to upstream RP. So it is OK to skip
the quirk.

There is a slight chance that the RP itself generating AER events when
it has detected the Replay Timeout. But to verify, you need to share the
AER log during bootup.

Also, since you mentioned that the firmware is enabling the AER
reporting for the device, which means the device would've been
enumerated by BIOS before the kernel boots. In that case, RP would still
continue to receive the COR_ERR and by the time the AER driver binds to
the RP (which happens before the quirk for the RTS525A gets executed due
to depth first enumeration), those AER errors will still be logged until
the quirk execution. So you'd still see AER errors in dmesg, but not a
flood before this patch.

- Mani

--
மணிவண்ணன் சதாசிவம்

* Re: [PATCH v2] PCI: Mask Replay Timer Timeout for Realtek RTS525A
From: Bjorn Helgaas @ 2026-07-01 20:42 UTC
To: Manivannan Sadhasivam
Cc: Max Lee, bhelgaas, linux-pci, linux-kernel, acelan.kao,
	Kai-Heng Feng, Victor Shih, Lukas Wunner, Jon Pan-Doh

[+cc Kai-Heng, Victor, Lukas, Jon]

On Wed, Jul 01, 2026 at 08:27:13AM +0200, Manivannan Sadhasivam wrote:
> On Wed, Jun 10, 2026 at 10:47:23AM +0800, Max Lee wrote:
> > The Realtek RTS525A PCI-Express SD card reader (10ec:525a) generates
> > excessive Correctable Error (Replay Timer Timeout) AER events during
> > PCIe link initialization. On systems where firmware enables AER
> > reporting (CERptEn+), this causes an AER storm of ~240K error events
> > within 11 seconds of boot, overwhelming the kernel error handler and
> > blocking shutdown/reboot.

Specifically, I guess these error events are the AER interrupts. We
rate-limit the actual *messages*, but not the interrupts, so we can be
overwhelmed by handling them.

> > The root cause is a transient link training instability inherent to this
> > device -- even on BIOS versions that suppress reporting, the error
> > status register (CESta) shows Timeout+ set.
> >
> > Unlike the GL9750/GL9755 fixup, which only masks the parent root port,

I think PCI_ERR_COR_REP_TIMER is masked in both the GL975x Endpoint
and the Downstream Port leading to it:

  015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the replay timer timeout of AER")
  eeee3b5e6d0b ("PCI: Mask Replay Timer Timeout errors for Genesys GL975x SD host controller")

Maybe those should be combined so the quirk masks
PCI_ERR_COR_REP_TIMER at both ends of the link in one place. A quirk
like that could be used for both GL975x and RTS525A.

> > the RTS525A also needs its endpoint Correctable Error Mask bit 12
> > (PCI_ERR_COR_REP_TIMER) masked when the endpoint exposes AER, so it does
> > not send ERR_COR messages upstream. Also mask the parent root port to
> > cover root-port reporting of link errors caused by the endpoint.

I think this and the similar comment in the code are slightly
misleading. pci_mask_replay_timer_timeout() masks
PCI_ERR_COR_REP_TIMER in the Downstream Port leading to pdev. That
may be either a Root Port or a Switch Downstream Port.

> > Signed-off-by: Max Lee <max.lee@canonical.com>
> > ---
> > Changes in v2:
> > - Mask the parent root port even when the endpoint lacks AER capability.
>
> You've added this based on the comment by Sashiko I believe. But I
> think Sashiko was wrong here. If the EP lacks AER capability, then
> there is no way it is going to send AER errors to upstream RP. So it
> is OK to skip the quirk.

I'm not sure it's OK to skip the quirk.

Based on the flowchart in PCIe r7.0, sec 6.2.5, it looks like an
Endpoint that detects a Correctable Error but lacks an AER Capability
sets PCI_EXP_DEVSTA_CED and, if PCI_EXP_DEVCTL_CERE is set, sends
ERR_COR upstream.

I suspect that what we need here is:

  - Mask PCI_ERR_COR_REP_TIMER in Endpoint to prevent it from sending
    ERR_COR

  - Mask PCI_ERR_COR_REP_TIMER in upstream bridge to prevent it from
    sending ERR_COR

  - If the upstream bridge is not the Root Port, mask
    PCI_ERR_COR_REP_TIMER in Root Port to prevent AER interrupt if it
    receives ERR_COR, e.g., if the Endpoint or bridge lacks AER

I wonder if aer.c should disable the AER interrupt by temporarily
clearing PCI_ERR_ROOT_CMD_COR_EN if it detects a storm. Maybe that
could avoid device quirks for transient issues like this.

> There is a slight chance that the RP itself generating AER events
> when it has detected the Replay Timeout. But to verify, you need to
> share the AER log during bootup.
>
> Also, since you mentioned that the firmware is enabling the AER
> reporting for the device, which means the device would've been
> enumerated by BIOS before the kernel boots. In that case, RP would
> still continue to receive the COR_ERR and by the time the AER driver
> binds to the RP (which happens before the quirk for the RTS525A gets
> executed due to depth first enumeration), those AER errors will
> still be logged until the quirk execution. So you'd still see AER
> errors in dmesg, but not a flood before this patch.

> > +++ b/drivers/pci/quirks.c
> > @@ -6380,4 +6380,26 @@ static void pci_mask_replay_timer_timeout(struct pci_dev *pdev)
> >  }
> >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9750, pci_mask_replay_timer_timeout);
> >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9755, pci_mask_replay_timer_timeout);
> > +
> > +static void pci_mask_replay_timer_timeout_on_endpoint(struct pci_dev *pdev)
> > +{
> > +	u32 val;
> > +
> > +	if (pdev->aer_cap) {
> > +		pci_info(pdev, "mask Replay Timer Timeout on endpoint due to hardware defect\n");
> > +
> > +		pci_read_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, &val);
> > +		val |= PCI_ERR_COR_REP_TIMER;
> > +		pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, val);
> > +	}
> > +
> > +	/*
> > +	 * Also mask the parent root port. Do this even if the endpoint lacks
> > +	 * AER capability because the root port may still report link errors
> > +	 * caused by the endpoint.
> > +	 */
> > +	pci_mask_replay_timer_timeout(pdev);
> > +}
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REALTEK, 0x525a,
> > +			pci_mask_replay_timer_timeout_on_endpoint);
> >  #endif
> > --
> > 2.43.0
> >

* Re: [PATCH v2] PCI: Mask Replay Timer Timeout for Realtek RTS525A
From: Lukas Wunner @ 2026-07-02  5:10 UTC
To: Bjorn Helgaas
Cc: Manivannan Sadhasivam, Max Lee, bhelgaas, linux-pci, linux-kernel,
	acelan.kao, Kai-Heng Feng, Victor Shih, Jon Pan-Doh

On Wed, Jul 01, 2026 at 03:42:01PM -0500, Bjorn Helgaas wrote:
> On Wed, Jul 01, 2026 at 08:27:13AM +0200, Manivannan Sadhasivam wrote:
> > On Wed, Jun 10, 2026 at 10:47:23AM +0800, Max Lee wrote:
> > > The Realtek RTS525A PCI-Express SD card reader (10ec:525a) generates
> > > excessive Correctable Error (Replay Timer Timeout) AER events during
> > > PCIe link initialization. On systems where firmware enables AER
> > > reporting (CERptEn+), this causes an AER storm of ~240K error events
> > > within 11 seconds of boot, overwhelming the kernel error handler and
> > > blocking shutdown/reboot.
>
> Specifically, I guess these error events are the AER interrupts. We
> rate-limit the actual *messages*, but not the interrupts, so we can be
> overwhelmed by handling them.

Ratelimiting went into v6.16. It would be good to know whether the
behavior described in the commit message occurs only with older kernels
or persists with ratelimiting.

It would also be good to know whether the "overwhelming" occurs with
OS-native AER handling (aer_print_error()) or with Firmware First AER
handling (pci_print_aer()).

The PCIe port driver is registered in a device_initcall, which happens
much later than PCI device enumeration in a subsys_initcall. Do these
errors occur before the port driver registers? If they do, then the
"overwhelming" likely happens through the Firmware First code path.
Or perhaps "within 11 seconds of boot" is when portdrv probes and
registers the AER driver?

It would be good to have full dmesg output for analysis because this
is all a little murky.

> Based on the flowchart in PCIe r7.0, sec 6.2.5, it looks like an
> Endpoint that detects a Correctable Error but lacks an AER Capability
> sets PCI_EXP_DEVSTA_CED and, if PCI_EXP_DEVCTL_CERE is set, sends
> ERR_COR upstream.
>
> I suspect that what we need here is:
>
>   - Mask PCI_ERR_COR_REP_TIMER in Endpoint to prevent it from sending
>     ERR_COR
>
>   - Mask PCI_ERR_COR_REP_TIMER in upstream bridge to prevent it from
>     sending ERR_COR
>
>   - If the upstream bridge is not the Root Port, mask
>     PCI_ERR_COR_REP_TIMER in Root Port to prevent AER interrupt if it
>     receives ERR_COR, e.g., if the Endpoint or bridge lacks AER

The last bullet point doesn't sound quite right: Per table 6-4 in
PCIe r7.0 sec 6.2.7, Replay Timer Timeout is detected by the
Transmitter. This can only be the Endpoint or its upstream bridge.
If the upstream bridge is not the Root Port, masking
PCI_ERR_COR_REP_TIMER in the Root Port doesn't help.

If the Endpoint or its upstream bridge don't support AER, one has to
clear Correctable Error Reporting Enable in the Device Control Register
to suppress the Replay Timer Timeout errors. Obviously this will
suppress reporting of any other Correctable Error as well.

> I wonder if aer.c should disable the AER interrupt by temporarily
> clearing PCI_ERR_ROOT_CMD_COR_EN if it detects a storm. Maybe that
> could avoid device quirks for transient issues like this.

The hierarchy below the Root Port might be deeper and we'd risk losing
errors reported by other devices in that hierarchy while interrupt
generation is temporarily disabled.

Thanks,

Lukas

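The two suppression paths discussed above (the AER mask bit when AER is
present, the Device Control CERE bit otherwise) can be put into a small
decision model. This is only a sketch of the reasoning in this thread,
not a transcription of the PCIe flowchart: struct port_model and
sends_err_cor() are hypothetical names, and the model assumes the AER
mask is consulted only when the function has an AER capability and that
CERE gates the ERR_COR message in either case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_ERR_COR_REP_TIMER	0x00001000u	/* bit 12: Replay Timer Timeout */

/* Hypothetical, simplified view of one function's error-reporting state. */
struct port_model {
	bool has_aer;		/* exposes an AER capability */
	uint32_t cor_mask;	/* AER Correctable Error Mask; ignored without AER */
	bool cere;		/* Device Control: Correctable Error Reporting Enable */
};

/*
 * Would this function send ERR_COR upstream after detecting the
 * correctable error identified by err_bit?
 *
 * Assumption taken from the discussion: with AER, a set mask bit stops
 * the message for that one error type; without AER, only clearing CERE
 * stops it, which also silences every other correctable error.
 */
static bool sends_err_cor(const struct port_model *p, uint32_t err_bit)
{
	assert(p != NULL);

	if (p->has_aer && (p->cor_mask & err_bit))
		return false;

	return p->cere;
}
```

Under these assumptions, a function without AER and with CERE set still
sends ERR_COR whatever is written to cor_mask, which is the case Bjorn
raises; clearing CERE stops it, which is the fallback Lukas describes.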
* Re: [PATCH v2] PCI: Mask Replay Timer Timeout for Realtek RTS525A
From: Max Lee @ 2026-07-02  7:06 UTC
To: Lukas Wunner, bhelgaas, Manivannan Sadhasivam
Cc: Bjorn Helgaas, linux-pci, linux-kernel, acelan.kao, Kai-Heng Feng,
	Victor Shih, Jon Pan-Doh

On Thu, Jul 02, 2026 at 07:10:46AM +0200, Lukas Wunner wrote:
> It would be good to have full dmesg output for analysis because this
> is all a little murky.

Thanks Mani, Bjorn and Lukas for the detailed analysis.

I collected full boot dmesg and lspci -vv from the affected system.

The system is:

  HP ZBook Power 16 inch G11 Mobile Workstation PC
  BIOS: W97 Ver. 01.09.01, 03/02/2026
  Kernel: 6.17.0-1012-oem

The topology is a direct Root Port -> Endpoint connection, with no
switch in between:

  0000:00:1c.6-[58]----00.0 Realtek RTS525A PCI Express Card Reader

The log shows that OS-native AER is in control:

  acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug SHPCHotplug PME AER PCIeCapability LTR DPC]
  pcieport 0000:00:1c.6: AER: enabled with IRQ 127

I did not find GHES/HEST/Firmware First AER messages in dmesg/journal
for this boot.

The storm starts after the PCIe port driver enables AER on 0000:00:1c.6
and immediately after rtsx_pci enables the endpoint:

  [    0.798443] pcieport 0000:00:1c.6: AER: enabled with IRQ 127
  [    1.057380] rtsx_pci 0000:58:00.0: enabling device (0000 -> 0002)
  [    1.057399] pcieport 0000:00:1c.6: AER: Correctable error message received from 0000:58:00.0
  [    1.057405] rtsx_pci 0000:58:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Transmitter ID)
  [    1.057408] rtsx_pci 0000:58:00.0:   device [10ec:525a] error status/mask=00001000/00006000
  [    1.057411] rtsx_pci 0000:58:00.0:    [12] Timeout

Ratelimiting is present, but the interrupt/callback volume is still very
large:

  [    6.058155] aer_ratelimit: 120541 callbacks suppressed
  [   11.059227] aer_ratelimit: 120906 callbacks suppressed

/proc/interrupts later showed IRQ 127 for 0000:00:1c.6 (PCIe PME,
aerdrv, PCIe bwctrl) at 940777 interrupts, while the endpoint's own
rtsx_pci MSI had only 6 interrupts.

Both the endpoint and the immediate upstream port expose AER capability.
On this unpatched boot, Replay Timer Timeout is not masked on either
side.

  Endpoint 0000:58:00.0:
    Capabilities: [100 v2] Advanced Error Reporting
      CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
      CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+

  Root Port 0000:00:1c.6:
    Capabilities: [100 v1] Advanced Error Reporting
      CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
      CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
      RootCmd: CERptEn+ NFERptEn+ FERptEn+
      ErrorSrc: ERR_COR: 5800 ERR_FATAL/NONFATAL: 0000

So for this system, the endpoint does expose AER, and the observed storm
does not require handling the no-endpoint-AER case.

I agree that the v2 wording is imprecise.
pci_mask_replay_timer_timeout() does not necessarily mask a Root Port;
it masks the immediate upstream Downstream Port leading to pdev, which
may be a Root Port or a Switch Downstream Port.

If I send v3, I will fix the commit log and code comment to use that
wording instead of "parent root port". I can also include the full dmesg
and lspci output if useful.

Thanks,
Max

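The figures in the log above can be cross-checked with a few lines of
arithmetic. The sketch below is illustrative only: cor_bit_is_set(),
cor_error_reported() and total_suppressed() are hypothetical helpers,
and "reported" here simply means the status bit is set while the same
bit in the mask word is clear.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Test one bit of a 32-bit AER correctable status or mask word. */
static int cor_bit_is_set(uint32_t reg, unsigned int bit)
{
	return (int)((reg >> bit) & 1u);
}

/*
 * Simplified check: the error is signalled when its status bit is set
 * and the same bit in the mask word is clear.
 */
static int cor_error_reported(uint32_t status, uint32_t mask, unsigned int bit)
{
	return cor_bit_is_set(status, bit) && !cor_bit_is_set(mask, bit);
}

/* Sum the "callbacks suppressed" counts taken from aer_ratelimit lines. */
static unsigned long total_suppressed(const unsigned long *counts, size_t n)
{
	unsigned long sum = 0;
	size_t i;

	assert(counts != NULL || n == 0);
	for (i = 0; i < n; i++)
		sum += counts[i];
	return sum;
}
```

With status/mask=00001000/00006000, bit 12 is set in the status word and
clear in the mask word, so cor_error_reported() returns 1; once bit 12 is
added to the mask (00007000) it returns 0. The two aer_ratelimit counts,
120541 and 120906, add up to 241447, which matches the "~240K error
events within 11 seconds" quoted in the commit message.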
* Re: [PATCH v2] PCI: Mask Replay Timer Timeout for Realtek RTS525A
From: Manivannan Sadhasivam @ 2026-07-02  8:08 UTC
To: Bjorn Helgaas
Cc: Max Lee, bhelgaas, linux-pci, linux-kernel, acelan.kao,
	Kai-Heng Feng, Victor Shih, Lukas Wunner, Jon Pan-Doh

On Wed, Jul 01, 2026 at 03:42:01PM -0500, Bjorn Helgaas wrote:
> [+cc Kai-Heng, Victor, Lukas, Jon]
>
> On Wed, Jul 01, 2026 at 08:27:13AM +0200, Manivannan Sadhasivam wrote:
> > On Wed, Jun 10, 2026 at 10:47:23AM +0800, Max Lee wrote:
> > > The Realtek RTS525A PCI-Express SD card reader (10ec:525a) generates
> > > excessive Correctable Error (Replay Timer Timeout) AER events during
> > > PCIe link initialization. On systems where firmware enables AER
> > > reporting (CERptEn+), this causes an AER storm of ~240K error events
> > > within 11 seconds of boot, overwhelming the kernel error handler and
> > > blocking shutdown/reboot.
>
> Specifically, I guess these error events are the AER interrupts. We
> rate-limit the actual *messages*, but not the interrupts, so we can be
> overwhelmed by handling them.
>
> > > The root cause is a transient link training instability inherent to this
> > > device -- even on BIOS versions that suppress reporting, the error
> > > status register (CESta) shows Timeout+ set.
> > >
> > > Unlike the GL9750/GL9755 fixup, which only masks the parent root port,
>
> I think PCI_ERR_COR_REP_TIMER is masked in both the GL975x Endpoint
> and the Downstream Port leading to it:
>
>   015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the replay timer
>   timeout of AER")
>
>   eeee3b5e6d0b ("PCI: Mask Replay Timer Timeout errors for Genesys
>   GL975x SD host controller")
>
> Maybe those should be combined so the quirk masks
> PCI_ERR_COR_REP_TIMER at both ends of the link in one place. A quirk
> like that could be used for both GL975x and RTS525A.
>
> > > the RTS525A also needs its endpoint Correctable Error Mask bit 12
> > > (PCI_ERR_COR_REP_TIMER) masked when the endpoint exposes AER, so it does
> > > not send ERR_COR messages upstream. Also mask the parent root port to
> > > cover root-port reporting of link errors caused by the endpoint.
>
> I think this and the similar comment in the code are slightly
> misleading. pci_mask_replay_timer_timeout() masks
> PCI_ERR_COR_REP_TIMER in the Downstream Port leading to pdev. That
> may be either a Root Port or a Switch Downstream Port.
>
> > > Signed-off-by: Max Lee <max.lee@canonical.com>
> > > ---
> > > Changes in v2:
> > > - Mask the parent root port even when the endpoint lacks AER capability.
> >
> > You've added this based on the comment by Sashiko I believe. But I
> > think Sashiko was wrong here. If the EP lacks AER capability, then
> > there is no way it is going to send AER errors to upstream RP. So it
> > is OK to skip the quirk.
>
> I'm not sure it's OK to skip the quirk.
>
> Based on the flowchart in PCIe r7.0, sec 6.2.5, it looks like an
> Endpoint that detects a Correctable Error but lacks an AER Capability
> sets PCI_EXP_DEVSTA_CED and, if PCI_EXP_DEVCTL_CERE is set, sends
> ERR_COR upstream.
>

Ah, you're right. I missed CERE.

> I suspect that what we need here is:
>
>   - Mask PCI_ERR_COR_REP_TIMER in Endpoint to prevent it from sending
>     ERR_COR
>
>   - Mask PCI_ERR_COR_REP_TIMER in upstream bridge to prevent it from
>     sending ERR_COR
>
>   - If the upstream bridge is not the Root Port, mask
>     PCI_ERR_COR_REP_TIMER in Root Port to prevent AER interrupt if it
>     receives ERR_COR, e.g., if the Endpoint or bridge lacks AER
>

If the endpoint lacks AER capability, it will simply send an ERR_COR
message to the upstream bridge without any advanced logging info.

Because the message reaches the Root Port as a generic ERR_COR, RP will
log the error in its Root Error Status register and raise the AER
interrupt without checking its own PCI_ERR_COR_REP_TIMER mask status.
Also, any upstream bridge lacking AER capability will simply forward the
ERR_COR message upstream irrespective of its own PCI_ERR_COR_REP_TIMER
status.

So if the endpoint lacks AER capability, we cannot prevent it from
reporting the error, unless we disable CERE, which will affect other
errors also.

However, we can still mask PCI_ERR_COR_REP_TIMER in the upstream
bridge's AER to prevent the bridge itself from generating its own Replay
Link Timeout errors due to the endpoint's behavior, but this acts only
as a safeguard and will not prevent the messages originating from the
endpoint.

Correct me if my understanding is wrong.

- Mani

--
மணிவண்ணன் சதாசிவம்

end of thread, other threads:[~2026-07-02  8:08 UTC | newest]

Thread overview: 8+ messages
2026-05-28  3:23 [PATCH] PCI: Mask Replay Timer Timeout for Realtek RTS525A Max Lee
2026-05-28  3:41 ` sashiko-bot
2026-06-10  2:47 ` [PATCH v2] " Max Lee
2026-07-01  6:27   ` Manivannan Sadhasivam
2026-07-01 20:42     ` Bjorn Helgaas
2026-07-02  5:10       ` Lukas Wunner
2026-07-02  7:06         ` Max Lee
2026-07-02  8:08       ` Manivannan Sadhasivam