* [PATCH] PCI: Mask Replay Timer Timeout for Realtek RTS525A
@ 2026-05-28 3:23 Max Lee
2026-05-28 3:41 ` sashiko-bot
2026-06-10 2:47 ` [PATCH v2] " Max Lee
0 siblings, 2 replies; 5+ messages in thread
From: Max Lee @ 2026-05-28 3:23 UTC (permalink / raw)
To: bhelgaas; +Cc: linux-pci, linux-kernel, acelan.kao, Max Lee
The Realtek RTS525A PCI-Express SD card reader (10ec:525a) generates
excessive Correctable Error (Replay Timer Timeout) AER events during
PCIe link initialization. On systems where firmware enables AER
reporting (CERptEn+), this causes an AER storm of ~240K error events
within 11 seconds of boot, overwhelming the kernel's error handler and
blocking shutdown/reboot.
The root cause is a transient link training instability inherent to this
device -- even on BIOS versions that suppress reporting, the error
status register (CESta) shows Timeout+ set.
Unlike the GL9750/GL9755 fixup (which only masks the parent root port),
the RTS525A additionally requires masking the endpoint's own Correctable
Error Mask register bit 12 (PCI_ERR_COR_REP_TIMER) to prevent it from
sending ERR_COR messages upstream. Call pci_mask_replay_timer_timeout()
to mask the parent root port as well.
Signed-off-by: Max Lee <max.lee@canonical.com>
---
drivers/pci/quirks.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index caaed1a01dc0..072d1456daad 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -6380,4 +6380,23 @@ static void pci_mask_replay_timer_timeout(struct pci_dev *pdev)
}
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9750, pci_mask_replay_timer_timeout);
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9755, pci_mask_replay_timer_timeout);
+
+static void pci_mask_replay_timer_timeout_on_endpoint(struct pci_dev *pdev)
+{
+ u32 val;
+
+ if (!pdev->aer_cap)
+ return;
+
+ pci_info(pdev, "mask Replay Timer Timeout on endpoint due to hardware defect\n");
+
+ pci_read_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, &val);
+ val |= PCI_ERR_COR_REP_TIMER;
+ pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, val);
+
+ /* Also mask the parent root port */
+ pci_mask_replay_timer_timeout(pdev);
+}
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REALTEK, 0x525a,
+ pci_mask_replay_timer_timeout_on_endpoint);
#endif
--
2.43.0
* Re: [PATCH] PCI: Mask Replay Timer Timeout for Realtek RTS525A
2026-05-28 3:23 [PATCH] PCI: Mask Replay Timer Timeout for Realtek RTS525A Max Lee
@ 2026-05-28 3:41 ` sashiko-bot
2026-06-10 2:47 ` [PATCH v2] " Max Lee
1 sibling, 0 replies; 5+ messages in thread
From: sashiko-bot @ 2026-05-28 3:41 UTC (permalink / raw)
To: Max Lee; +Cc: linux-pci
Thank you for your contribution! Sashiko AI review found 1 potential issue to consider:
- [High] The quirk returns early if the endpoint lacks the AER capability, unintentionally skipping the AER mask on the parent root port.
--
commit 33ed73136f45313933a545d6af06c38d893ebef2
Author: Max Lee <max.lee@canonical.com>
PCI: Mask Replay Timer Timeout for Realtek RTS525A
This commit adds a PCI fixup quirk for the Realtek RTS525A SD card reader.
It masks the Replay Timer Timeout on both the endpoint and its parent root
port to prevent an AER storm during PCIe link initialization.
> diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
> index caaed1a01dc02..072d1456daad4 100644
> --- a/drivers/pci/quirks.c
> +++ b/drivers/pci/quirks.c
> @@ -6380,4 +6380,23 @@ static void pci_mask_replay_timer_timeout(struct pci_dev *pdev)
> }
> DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9750, pci_mask_replay_timer_timeout);
> DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9755, pci_mask_replay_timer_timeout);
> +
> +static void pci_mask_replay_timer_timeout_on_endpoint(struct pci_dev *pdev)
> +{
> + u32 val;
> +
> + if (!pdev->aer_cap)
> + return;
[Severity: High]
If the endpoint lacks the AER capability, this early return will entirely
skip masking the parent root port later in the function. Does this leave the
system vulnerable to the AER storm from the parent port if the endpoint's
AER capability is disabled or hidden?
Could this be restructured so that the parent root port masking is still
applied even when the endpoint lacks the AER capability?
> +
> + pci_info(pdev, "mask Replay Timer Timeout on endpoint due to hardware defect\n");
> +
> + pci_read_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, &val);
> + val |= PCI_ERR_COR_REP_TIMER;
> + pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, val);
> +
> + /* Also mask the parent root port */
> + pci_mask_replay_timer_timeout(pdev);
> +}
--
Sashiko AI review · https://sashiko.dev/#/patchset/20260528032347.644365-1-max.lee@canonical.com?part=1
* [PATCH v2] PCI: Mask Replay Timer Timeout for Realtek RTS525A
2026-05-28 3:23 [PATCH] PCI: Mask Replay Timer Timeout for Realtek RTS525A Max Lee
2026-05-28 3:41 ` sashiko-bot
@ 2026-06-10 2:47 ` Max Lee
2026-07-01 6:27 ` Manivannan Sadhasivam
1 sibling, 1 reply; 5+ messages in thread
From: Max Lee @ 2026-06-10 2:47 UTC (permalink / raw)
To: bhelgaas; +Cc: linux-pci, linux-kernel, acelan.kao, Max Lee
The Realtek RTS525A PCI-Express SD card reader (10ec:525a) generates
excessive Correctable Error (Replay Timer Timeout) AER events during
PCIe link initialization. On systems where firmware enables AER
reporting (CERptEn+), this causes an AER storm of ~240K error events
within 11 seconds of boot, overwhelming the kernel error handler and
blocking shutdown/reboot.
The root cause is a transient link training instability inherent to this
device -- even on BIOS versions that suppress reporting, the error
status register (CESta) shows Timeout+ set.
Unlike the GL9750/GL9755 fixup, which only masks the parent root port,
the RTS525A also needs its endpoint Correctable Error Mask bit 12
(PCI_ERR_COR_REP_TIMER) masked when the endpoint exposes AER, so it does
not send ERR_COR messages upstream. Also mask the parent root port to
cover root-port reporting of link errors caused by the endpoint.
Signed-off-by: Max Lee <max.lee@canonical.com>
---
Changes in v2:
- Mask the parent root port even when the endpoint lacks AER capability.
- Remove the early return before parent root port masking.
drivers/pci/quirks.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index caaed1a01dc0..6597536a4c70 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -6380,4 +6380,26 @@ static void pci_mask_replay_timer_timeout(struct pci_dev *pdev)
}
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9750, pci_mask_replay_timer_timeout);
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9755, pci_mask_replay_timer_timeout);
+
+static void pci_mask_replay_timer_timeout_on_endpoint(struct pci_dev *pdev)
+{
+ u32 val;
+
+ if (pdev->aer_cap) {
+ pci_info(pdev, "mask Replay Timer Timeout on endpoint due to hardware defect\n");
+
+ pci_read_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, &val);
+ val |= PCI_ERR_COR_REP_TIMER;
+ pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, val);
+ }
+
+ /*
+ * Also mask the parent root port. Do this even if the endpoint lacks
+ * AER capability because the root port may still report link errors
+ * caused by the endpoint.
+ */
+ pci_mask_replay_timer_timeout(pdev);
+}
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REALTEK, 0x525a,
+ pci_mask_replay_timer_timeout_on_endpoint);
#endif
--
2.43.0
* Re: [PATCH v2] PCI: Mask Replay Timer Timeout for Realtek RTS525A
2026-06-10 2:47 ` [PATCH v2] " Max Lee
@ 2026-07-01 6:27 ` Manivannan Sadhasivam
2026-07-01 20:42 ` Bjorn Helgaas
0 siblings, 1 reply; 5+ messages in thread
From: Manivannan Sadhasivam @ 2026-07-01 6:27 UTC (permalink / raw)
To: Max Lee; +Cc: bhelgaas, linux-pci, linux-kernel, acelan.kao
On Wed, Jun 10, 2026 at 10:47:23AM +0800, Max Lee wrote:
> The Realtek RTS525A PCI-Express SD card reader (10ec:525a) generates
> excessive Correctable Error (Replay Timer Timeout) AER events during
> PCIe link initialization. On systems where firmware enables AER
> reporting (CERptEn+), this causes an AER storm of ~240K error events
> within 11 seconds of boot, overwhelming the kernel error handler and
> blocking shutdown/reboot.
>
> The root cause is a transient link training instability inherent to this
> device -- even on BIOS versions that suppress reporting, the error
> status register (CESta) shows Timeout+ set.
>
> Unlike the GL9750/GL9755 fixup, which only masks the parent root port,
> the RTS525A also needs its endpoint Correctable Error Mask bit 12
> (PCI_ERR_COR_REP_TIMER) masked when the endpoint exposes AER, so it does
> not send ERR_COR messages upstream. Also mask the parent root port to
> cover root-port reporting of link errors caused by the endpoint.
>
> Signed-off-by: Max Lee <max.lee@canonical.com>
> ---
> Changes in v2:
> - Mask the parent root port even when the endpoint lacks AER capability.
I believe you added this based on Sashiko's comment, but I think Sashiko
was wrong here. If the EP lacks AER capability, there is no way it is going
to send AER errors to the upstream RP. So it is OK to skip the quirk.
There is a slight chance that the RP itself is generating AER events when it
detects the Replay Timer Timeout. But to verify that, you need to share the
AER log from bootup.
Also, since you mentioned that the firmware enables AER reporting for the
device, the device would have been enumerated by the BIOS before the kernel
boots. In that case, the RP would continue to receive ERR_COR, and by the
time the AER driver binds to the RP (which happens before the quirk for the
RTS525A is executed, due to depth-first enumeration), those AER errors will
still be logged until the quirk runs. So with this patch you'd still see AER
errors in dmesg, but not a flood as before.
- Mani
--
மணிவண்ணன் சதாசிவம்
* Re: [PATCH v2] PCI: Mask Replay Timer Timeout for Realtek RTS525A
2026-07-01 6:27 ` Manivannan Sadhasivam
@ 2026-07-01 20:42 ` Bjorn Helgaas
0 siblings, 0 replies; 5+ messages in thread
From: Bjorn Helgaas @ 2026-07-01 20:42 UTC (permalink / raw)
To: Manivannan Sadhasivam
Cc: Max Lee, bhelgaas, linux-pci, linux-kernel, acelan.kao,
Kai-Heng Feng, Victor Shih, Lukas Wunner, Jon Pan-Doh
[+cc Kai-Heng, Victor, Lukas, Jon]
On Wed, Jul 01, 2026 at 08:27:13AM +0200, Manivannan Sadhasivam wrote:
> On Wed, Jun 10, 2026 at 10:47:23AM +0800, Max Lee wrote:
> > The Realtek RTS525A PCI-Express SD card reader (10ec:525a) generates
> > excessive Correctable Error (Replay Timer Timeout) AER events during
> > PCIe link initialization. On systems where firmware enables AER
> > reporting (CERptEn+), this causes an AER storm of ~240K error events
> > within 11 seconds of boot, overwhelming the kernel error handler and
> > blocking shutdown/reboot.
Specifically, I guess these error events are the AER interrupts. We
rate-limit the actual *messages*, but not the interrupts, so we can be
overwhelmed by handling them.
> > The root cause is a transient link training instability inherent to this
> > device -- even on BIOS versions that suppress reporting, the error
> > status register (CESta) shows Timeout+ set.
> >
> > Unlike the GL9750/GL9755 fixup, which only masks the parent root port,
I think PCI_ERR_COR_REP_TIMER is masked in both the GL975x Endpoint
and the Downstream Port leading to it:
015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the replay timer
timeout of AER")
eeee3b5e6d0b ("PCI: Mask Replay Timer Timeout errors for Genesys
GL975x SD host controller")
Maybe those should be combined so the quirk masks
PCI_ERR_COR_REP_TIMER at both ends of the link in one place. A quirk
like that could be used for both GL975x and RTS525A.
> > the RTS525A also needs its endpoint Correctable Error Mask bit 12
> > (PCI_ERR_COR_REP_TIMER) masked when the endpoint exposes AER, so it does
> > not send ERR_COR messages upstream. Also mask the parent root port to
> > cover root-port reporting of link errors caused by the endpoint.
I think this and the similar comment in the code are slightly
misleading. pci_mask_replay_timer_timeout() masks
PCI_ERR_COR_REP_TIMER in the Downstream Port leading to pdev. That
may be either a Root Port or a Switch Downstream Port.
> > Signed-off-by: Max Lee <max.lee@canonical.com>
> > ---
> > Changes in v2:
> > - Mask the parent root port even when the endpoint lacks AER capability.
>
> I believe you added this based on Sashiko's comment, but I think
> Sashiko was wrong here. If the EP lacks AER capability, there is no
> way it is going to send AER errors to the upstream RP. So it is OK
> to skip the quirk.
I'm not sure it's OK to skip the quirk.
Based on the flowchart in PCIe r7.0, sec 6.2.5, it looks like an
Endpoint that detects a Correctable Error but lacks an AER Capability
sets PCI_EXP_DEVSTA_CED and, if PCI_EXP_DEVCTL_CERE is set, sends
ERR_COR upstream.
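A minimal user-space sketch of the decision described above, as I read
the sec 6.2.5 flowchart. `struct fake_fn` and `detect_correctable()` are
hypothetical names for illustration, not kernel APIs; the register bit
values match include/uapi/linux/pci_regs.h. It leaves out Advisory
Non-Fatal handling and the whole uncorrectable path.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/pci_regs.h */
#define PCI_EXP_DEVCTL_CERE	0x0001		/* Correctable Error Reporting Enable */
#define PCI_EXP_DEVSTA_CED	0x0001		/* Correctable Error Detected */
#define PCI_ERR_COR_REP_TIMER	0x00001000	/* Replay Timer Timeout */

/* Hypothetical stand-in for the relevant config state of one function. */
struct fake_fn {
	bool has_aer;		/* AER Capability present? */
	uint16_t devctl;	/* Device Control */
	uint16_t devsta;	/* Device Status */
	uint32_t cor_status;	/* AER Correctable Error Status */
	uint32_t cor_mask;	/* AER Correctable Error Mask */
};

/*
 * Model a function detecting the correctable error 'err'.
 * Returns true if it would send ERR_COR upstream.
 */
static bool detect_correctable(struct fake_fn *fn, uint32_t err)
{
	/* Device Status is updated whether or not AER is present. */
	fn->devsta |= PCI_EXP_DEVSTA_CED;

	if (fn->has_aer) {
		/* Status is logged even when the error is masked. */
		fn->cor_status |= err;
		if (fn->cor_mask & err)
			return false;
	}

	/* Without AER there is no mask; only CERE gates the message. */
	return (fn->devctl & PCI_EXP_DEVCTL_CERE) != 0;
}
```

In this model, a function without AER and with CERE set still returns
true for a Replay Timer Timeout, which is the case the early return in
v1 would have left unhandled.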
I suspect that what we need here is:
- Mask PCI_ERR_COR_REP_TIMER in Endpoint to prevent it from sending
ERR_COR
- Mask PCI_ERR_COR_REP_TIMER in upstream bridge to prevent it from
sending ERR_COR
- If the upstream bridge is not the Root Port, mask
PCI_ERR_COR_REP_TIMER in Root Port to prevent AER interrupt if it
receives ERR_COR, e.g., if the Endpoint or bridge lacks AER
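The three steps above could be modeled roughly as follows. This is a
user-space sketch only, not a proposed patch: `struct fake_dev`,
`mask_rep_timer()` and `mask_rep_timer_on_path()` are hypothetical
names, and the real quirk would use config accessors on `struct pci_dev`
instead. Whether masking in the Root Port actually suppresses the
interrupt for a received ERR_COR is the open question in this thread;
the sketch only shows which devices would be touched.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_ERR_COR_REP_TIMER	0x00001000	/* Replay Timer Timeout */

/* Hypothetical stand-in for a device in the hierarchy. */
struct fake_dev {
	struct fake_dev *upstream;	/* NULL for a Root Port */
	bool has_aer;			/* AER Capability present? */
	uint32_t cor_mask;		/* AER Correctable Error Mask */
};

/* Set the Replay Timer Timeout mask bit if the device has AER. */
static void mask_rep_timer(struct fake_dev *dev)
{
	if (dev && dev->has_aer)
		dev->cor_mask |= PCI_ERR_COR_REP_TIMER;
}

/* Mask at the Endpoint, its upstream bridge, and the Root Port. */
static void mask_rep_timer_on_path(struct fake_dev *ep)
{
	struct fake_dev *bridge = ep->upstream;
	struct fake_dev *root = bridge;

	mask_rep_timer(ep);		/* step 1: Endpoint */
	mask_rep_timer(bridge);		/* step 2: Downstream Port above it */

	/* step 3: walk up to the Root Port if the bridge is not one */
	while (root && root->upstream)
		root = root->upstream;
	if (root != bridge)
		mask_rep_timer(root);
}
```

Intermediate switch ports between the bridge and the Root Port are left
alone here, matching the list above.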
I wonder if aer.c should disable the AER interrupt by temporarily
clearing PCI_ERR_ROOT_CMD_COR_EN if it detects a storm. Maybe that
could avoid device quirks for transient issues like this.
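To make the storm idea concrete, here is a user-space sketch of one way
it might work, assuming a simple fixed-window counter. Nothing like this
exists in aer.c as far as this thread shows: `struct storm_state`,
`aer_cor_irq()`, `aer_cor_reenable()`, the window length and the
threshold are all hypothetical. Only PCI_ERR_ROOT_CMD_COR_EN is a real
definition from pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_ERR_ROOT_CMD_COR_EN	0x00000001	/* Correctable Err Reporting Enable */

#define STORM_WINDOW_MS	1000	/* hypothetical window length */
#define STORM_THRESHOLD	100	/* hypothetical max interrupts per window */

/* Hypothetical per-Root-Port state. */
struct storm_state {
	uint64_t window_start_ms;
	unsigned int count;
	uint32_t root_cmd;	/* stand-in for the Root Error Command register */
};

/*
 * Account for one correctable-error interrupt at time 'now_ms'.
 * Returns true if a storm was detected and COR_EN was cleared.
 */
static bool aer_cor_irq(struct storm_state *s, uint64_t now_ms)
{
	/* Start a new window once the current one has expired. */
	if (now_ms - s->window_start_ms >= STORM_WINDOW_MS) {
		s->window_start_ms = now_ms;
		s->count = 0;
	}

	if (++s->count > STORM_THRESHOLD) {
		s->root_cmd &= ~(uint32_t)PCI_ERR_ROOT_CMD_COR_EN;
		return true;
	}
	return false;
}

/* Re-arm reporting, e.g. from a delayed work item after a quiet period. */
static void aer_cor_reenable(struct storm_state *s, uint64_t now_ms)
{
	s->root_cmd |= PCI_ERR_ROOT_CMD_COR_EN;
	s->window_start_ms = now_ms;
	s->count = 0;
}
```

The reported rate of ~240K events in 11 seconds is far above any
threshold of this order, so a transient storm like the RTS525A one would
trip it almost immediately; choosing when to re-enable is the harder
part and is left open here.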
> There is a slight chance that the RP itself is generating AER events
> when it detects the Replay Timer Timeout. But to verify that, you
> need to share the AER log from bootup.
>
> Also, since you mentioned that the firmware enables AER reporting
> for the device, the device would have been enumerated by the BIOS
> before the kernel boots. In that case, the RP would continue to
> receive ERR_COR, and by the time the AER driver binds to the RP
> (which happens before the quirk for the RTS525A is executed, due to
> depth-first enumeration), those AER errors will still be logged
> until the quirk runs. So with this patch you'd still see AER errors
> in dmesg, but not a flood as before.
> > +++ b/drivers/pci/quirks.c
> > @@ -6380,4 +6380,26 @@ static void pci_mask_replay_timer_timeout(struct pci_dev *pdev)
> > }
> > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9750, pci_mask_replay_timer_timeout);
> > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9755, pci_mask_replay_timer_timeout);
> > +
> > +static void pci_mask_replay_timer_timeout_on_endpoint(struct pci_dev *pdev)
> > +{
> > + u32 val;
> > +
> > + if (pdev->aer_cap) {
> > + pci_info(pdev, "mask Replay Timer Timeout on endpoint due to hardware defect\n");
> > +
> > + pci_read_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, &val);
> > + val |= PCI_ERR_COR_REP_TIMER;
> > + pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, val);
> > + }
> > +
> > + /*
> > + * Also mask the parent root port. Do this even if the endpoint lacks
> > + * AER capability because the root port may still report link errors
> > + * caused by the endpoint.
> > + */
> > + pci_mask_replay_timer_timeout(pdev);
> > +}
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REALTEK, 0x525a,
> > + pci_mask_replay_timer_timeout_on_endpoint);
> > #endif
> > --
> > 2.43.0
> >