* [PATCH] PCI: Mask Replay Timer Timeout for Realtek RTS525A
@ 2026-05-28 3:23 Max Lee
2026-06-10 2:47 ` [PATCH v2] " Max Lee
0 siblings, 1 reply; 8+ messages in thread
From: Max Lee @ 2026-05-28 3:23 UTC
To: bhelgaas; +Cc: linux-pci, linux-kernel, acelan.kao, Max Lee
The Realtek RTS525A PCI-Express SD card reader (10ec:525a) generates
excessive Correctable Error (Replay Timer Timeout) AER events during
PCIe link initialization. On systems where firmware enables AER
reporting (CERptEn+), this causes an AER storm of ~240K error events
within 11 seconds of boot, overwhelming the kernel's error handler and
blocking shutdown/reboot.

The root cause is a transient link training instability inherent to this
device -- even on BIOS versions that suppress reporting, the error
status register (CESta) shows Timeout+ set.

Unlike the GL9750/GL9755 fixup (which only masks the parent root port),
the RTS525A additionally requires masking the endpoint's own Correctable
Error Mask register bit 12 (PCI_ERR_COR_REP_TIMER) to prevent it from
sending ERR_COR messages upstream. Call pci_mask_replay_timer_timeout()
to mask the parent root port as well.

Signed-off-by: Max Lee <max.lee@canonical.com>
---
drivers/pci/quirks.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index caaed1a01dc0..072d1456daad 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -6380,4 +6380,23 @@ static void pci_mask_replay_timer_timeout(struct pci_dev *pdev)
}
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9750, pci_mask_replay_timer_timeout);
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9755, pci_mask_replay_timer_timeout);
+
+static void pci_mask_replay_timer_timeout_on_endpoint(struct pci_dev *pdev)
+{
+	u32 val;
+
+	if (!pdev->aer_cap)
+		return;
+
+	pci_info(pdev, "mask Replay Timer Timeout on endpoint due to hardware defect\n");
+
+	pci_read_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, &val);
+	val |= PCI_ERR_COR_REP_TIMER;
+	pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, val);
+
+	/* Also mask the parent root port */
+	pci_mask_replay_timer_timeout(pdev);
+}
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REALTEK, 0x525a,
+			pci_mask_replay_timer_timeout_on_endpoint);
#endif
--
2.43.0
* [PATCH v2] PCI: Mask Replay Timer Timeout for Realtek RTS525A
2026-05-28 3:23 [PATCH] PCI: Mask Replay Timer Timeout for Realtek RTS525A Max Lee
@ 2026-06-10 2:47 ` Max Lee
2026-07-01 6:27 ` Manivannan Sadhasivam
0 siblings, 1 reply; 8+ messages in thread
From: Max Lee @ 2026-06-10 2:47 UTC
To: bhelgaas; +Cc: linux-pci, linux-kernel, acelan.kao, Max Lee
The Realtek RTS525A PCI-Express SD card reader (10ec:525a) generates
excessive Correctable Error (Replay Timer Timeout) AER events during
PCIe link initialization. On systems where firmware enables AER
reporting (CERptEn+), this causes an AER storm of ~240K error events
within 11 seconds of boot, overwhelming the kernel error handler and
blocking shutdown/reboot.

The root cause is a transient link training instability inherent to this
device -- even on BIOS versions that suppress reporting, the error
status register (CESta) shows Timeout+ set.

Unlike the GL9750/GL9755 fixup, which only masks the parent root port,
the RTS525A also needs its endpoint Correctable Error Mask bit 12
(PCI_ERR_COR_REP_TIMER) masked when the endpoint exposes AER, so it does
not send ERR_COR messages upstream. Also mask the parent root port to
cover root-port reporting of link errors caused by the endpoint.

Signed-off-by: Max Lee <max.lee@canonical.com>
---
Changes in v2:
- Mask the parent root port even when the endpoint lacks AER capability.
- Remove the early return before parent root port masking.

drivers/pci/quirks.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c
index caaed1a01dc0..6597536a4c70 100644
--- a/drivers/pci/quirks.c
+++ b/drivers/pci/quirks.c
@@ -6380,4 +6380,26 @@ static void pci_mask_replay_timer_timeout(struct pci_dev *pdev)
}
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9750, pci_mask_replay_timer_timeout);
DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9755, pci_mask_replay_timer_timeout);
+
+static void pci_mask_replay_timer_timeout_on_endpoint(struct pci_dev *pdev)
+{
+	u32 val;
+
+	if (pdev->aer_cap) {
+		pci_info(pdev, "mask Replay Timer Timeout on endpoint due to hardware defect\n");
+
+		pci_read_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, &val);
+		val |= PCI_ERR_COR_REP_TIMER;
+		pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, val);
+	}
+
+	/*
+	 * Also mask the parent root port. Do this even if the endpoint lacks
+	 * AER capability because the root port may still report link errors
+	 * caused by the endpoint.
+	 */
+	pci_mask_replay_timer_timeout(pdev);
+}
+DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REALTEK, 0x525a,
+			pci_mask_replay_timer_timeout_on_endpoint);
#endif
--
2.43.0
* Re: [PATCH v2] PCI: Mask Replay Timer Timeout for Realtek RTS525A
2026-06-10 2:47 ` [PATCH v2] " Max Lee
@ 2026-07-01 6:27 ` Manivannan Sadhasivam
2026-07-01 20:42 ` Bjorn Helgaas
0 siblings, 1 reply; 8+ messages in thread
From: Manivannan Sadhasivam @ 2026-07-01 6:27 UTC
To: Max Lee; +Cc: bhelgaas, linux-pci, linux-kernel, acelan.kao
On Wed, Jun 10, 2026 at 10:47:23AM +0800, Max Lee wrote:
> The Realtek RTS525A PCI-Express SD card reader (10ec:525a) generates
> excessive Correctable Error (Replay Timer Timeout) AER events during
> PCIe link initialization. On systems where firmware enables AER
> reporting (CERptEn+), this causes an AER storm of ~240K error events
> within 11 seconds of boot, overwhelming the kernel error handler and
> blocking shutdown/reboot.
>
> The root cause is a transient link training instability inherent to this
> device -- even on BIOS versions that suppress reporting, the error
> status register (CESta) shows Timeout+ set.
>
> Unlike the GL9750/GL9755 fixup, which only masks the parent root port,
> the RTS525A also needs its endpoint Correctable Error Mask bit 12
> (PCI_ERR_COR_REP_TIMER) masked when the endpoint exposes AER, so it does
> not send ERR_COR messages upstream. Also mask the parent root port to
> cover root-port reporting of link errors caused by the endpoint.
>
> Signed-off-by: Max Lee <max.lee@canonical.com>
> ---
> Changes in v2:
> - Mask the parent root port even when the endpoint lacks AER capability.
You've added this based on the comment by Sashiko, I believe. But I think Sashiko
was wrong here. If the EP lacks AER capability, then there is no way it is going
to send AER errors to the upstream RP. So it is OK to skip the quirk.

There is a slight chance that the RP itself is generating AER events when it
detects the Replay Timer Timeout. But to verify that, you need to share the AER
log from bootup.

Also, you mentioned that the firmware enables AER reporting for the device,
which means the device would have been enumerated by the BIOS before the kernel
boots. In that case, the RP would continue to receive ERR_COR, and by the time
the AER driver binds to the RP (which happens before the quirk for the RTS525A
gets executed due to depth-first enumeration), those AER errors would still be
logged until the quirk runs. So you'd still see AER errors in dmesg, but not a
flood like the one before this patch.

- Mani
--
மணிவண்ணன் சதாசிவம்
* Re: [PATCH v2] PCI: Mask Replay Timer Timeout for Realtek RTS525A
2026-07-01 6:27 ` Manivannan Sadhasivam
@ 2026-07-01 20:42 ` Bjorn Helgaas
2026-07-02 5:10 ` Lukas Wunner
2026-07-02 8:08 ` Manivannan Sadhasivam
0 siblings, 2 replies; 8+ messages in thread
From: Bjorn Helgaas @ 2026-07-01 20:42 UTC
To: Manivannan Sadhasivam
Cc: Max Lee, bhelgaas, linux-pci, linux-kernel, acelan.kao,
Kai-Heng Feng, Victor Shih, Lukas Wunner, Jon Pan-Doh
[+cc Kai-Heng, Victor, Lukas, Jon]
On Wed, Jul 01, 2026 at 08:27:13AM +0200, Manivannan Sadhasivam wrote:
> On Wed, Jun 10, 2026 at 10:47:23AM +0800, Max Lee wrote:
> > The Realtek RTS525A PCI-Express SD card reader (10ec:525a) generates
> > excessive Correctable Error (Replay Timer Timeout) AER events during
> > PCIe link initialization. On systems where firmware enables AER
> > reporting (CERptEn+), this causes an AER storm of ~240K error events
> > within 11 seconds of boot, overwhelming the kernel error handler and
> > blocking shutdown/reboot.
Specifically, I guess these error events are the AER interrupts. We
rate-limit the actual *messages*, but not the interrupts, so we can be
overwhelmed by handling them.
> > The root cause is a transient link training instability inherent to this
> > device -- even on BIOS versions that suppress reporting, the error
> > status register (CESta) shows Timeout+ set.
> >
> > Unlike the GL9750/GL9755 fixup, which only masks the parent root port,
I think PCI_ERR_COR_REP_TIMER is masked in both the GL975x Endpoint
and the Downstream Port leading to it:
015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the replay timer
timeout of AER")
eeee3b5e6d0b ("PCI: Mask Replay Timer Timeout errors for Genesys
GL975x SD host controller")
Maybe those should be combined so the quirk masks
PCI_ERR_COR_REP_TIMER at both ends of the link in one place. A quirk
like that could be used for both GL975x and RTS525A.
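
A rough userspace sketch of what "mask at both ends of the link in one
place" could look like is below. It is not kernel code: struct port_model
and both helper names are hypothetical stand-ins for struct pci_dev and the
AER config accessors. Only the value of PCI_ERR_COR_REP_TIMER is taken from
pci_regs.h.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_ERR_COR_REP_TIMER	0x00001000	/* Replay Timer Timeout */

/* Hypothetical stand-in for struct pci_dev: only what the sketch needs */
struct port_model {
	bool has_aer;			/* AER Capability present */
	uint32_t cor_mask;		/* AER Correctable Error Mask */
	struct port_model *upstream;	/* Downstream Port above us, or NULL */
};

/* Set the Replay Timer Timeout mask bit if the function has AER */
static void mask_replay_timer(struct port_model *p)
{
	if (p && p->has_aer)
		p->cor_mask |= PCI_ERR_COR_REP_TIMER;
}

/* Mask Replay Timer Timeout in dev and in the Downstream Port leading to it */
static void mask_replay_timer_both_ends(struct port_model *dev)
{
	mask_replay_timer(dev);
	mask_replay_timer(dev->upstream);
}
```

With a Correctable Error Mask of 0x6000 on both functions, one call on the
endpoint leaves 0x7000 in each. A function without AER is left untouched.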
> > the RTS525A also needs its endpoint Correctable Error Mask bit 12
> > (PCI_ERR_COR_REP_TIMER) masked when the endpoint exposes AER, so it does
> > not send ERR_COR messages upstream. Also mask the parent root port to
> > cover root-port reporting of link errors caused by the endpoint.
I think this and the similar comment in the code are slightly
misleading. pci_mask_replay_timer_timeout() masks
PCI_ERR_COR_REP_TIMER in the Downstream Port leading to pdev. That
may be either a Root Port or a Switch Downstream Port.
> > Signed-off-by: Max Lee <max.lee@canonical.com>
> > ---
> > Changes in v2:
> > - Mask the parent root port even when the endpoint lacks AER capability.
>
> You've added this based on the comment by Sashiko I believe. But I
> think Sashiko was wrong here. If the EP lacks AER capability, then
> there is no way it is going to send AER errors to upstream RP. So it
> is OK to skip the quirk.
I'm not sure it's OK to skip the quirk.

Based on the flowchart in PCIe r7.0, sec 6.2.5, it looks like an
Endpoint that detects a Correctable Error but lacks an AER Capability
sets PCI_EXP_DEVSTA_CED and, if PCI_EXP_DEVCTL_CERE is set, sends
ERR_COR upstream.
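
To make the no-AER path concrete, here is a small userspace model of the
decision described above. It is a deliberate simplification of the
flowchart, not kernel code and not a full rendering of the spec: struct
fn_model and sends_err_cor() are made-up names, and Device Status
bookkeeping is left out. The two register constants use the values from
pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_EXP_DEVCTL_CERE	0x0001		/* Correctable Error Reporting En. */
#define PCI_ERR_COR_REP_TIMER	0x00001000	/* Replay Timer Timeout */

/* Hypothetical, simplified view of one PCIe function's error controls */
struct fn_model {
	bool has_aer;		/* function exposes an AER Capability */
	uint16_t devctl;	/* Device Control register */
	uint32_t cor_mask;	/* AER Correctable Error Mask (if has_aer) */
};

/* Would this function send ERR_COR upstream for the given correctable error? */
static bool sends_err_cor(const struct fn_model *f, uint32_t cor_bit)
{
	/* Only a function with AER can mask an individual error type */
	if (f->has_aer && (f->cor_mask & cor_bit))
		return false;

	/* Otherwise reporting is gated solely by Device Control CERE */
	return (f->devctl & PCI_EXP_DEVCTL_CERE) != 0;
}
```

In this model, a function without AER and with CERE set always reports the
error, which is why skipping the quirk in that case may not be safe.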
I suspect that what we need here is:

  - Mask PCI_ERR_COR_REP_TIMER in Endpoint to prevent it from sending
    ERR_COR

  - Mask PCI_ERR_COR_REP_TIMER in upstream bridge to prevent it from
    sending ERR_COR

  - If the upstream bridge is not the Root Port, mask
    PCI_ERR_COR_REP_TIMER in Root Port to prevent AER interrupt if it
    receives ERR_COR, e.g., if the Endpoint or bridge lacks AER

I wonder if aer.c should disable the AER interrupt by temporarily
clearing PCI_ERR_ROOT_CMD_COR_EN if it detects a storm. Maybe that
could avoid device quirks for transient issues like this.
> There is a slight chance that the RP itself generating AER events
> when it has detected the Replay Timeout. But to verify, you need to
> share the AER log during bootup.
>
> Also, since you mentioned that the firmware is enabling the AER
> reporting for the device, which means the device would've been
> enumerated by BIOS before the kernel boots. In that case, RP would
> still continue to receive the COR_ERR and by the time the AER driver
> binds to the RP (which happens before the quirk for the RTS525A gets
> executed due to depth first enumeration), those AER errors will
> still be logged until the quirk execution. So you'd still see AER
> errors in dmesg, but not a flood before this patch.
> > +++ b/drivers/pci/quirks.c
> > @@ -6380,4 +6380,26 @@ static void pci_mask_replay_timer_timeout(struct pci_dev *pdev)
> > }
> > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9750, pci_mask_replay_timer_timeout);
> > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9755, pci_mask_replay_timer_timeout);
> > +
> > +static void pci_mask_replay_timer_timeout_on_endpoint(struct pci_dev *pdev)
> > +{
> > +	u32 val;
> > +
> > +	if (pdev->aer_cap) {
> > +		pci_info(pdev, "mask Replay Timer Timeout on endpoint due to hardware defect\n");
> > +
> > +		pci_read_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, &val);
> > +		val |= PCI_ERR_COR_REP_TIMER;
> > +		pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, val);
> > +	}
> > +
> > +	/*
> > +	 * Also mask the parent root port. Do this even if the endpoint lacks
> > +	 * AER capability because the root port may still report link errors
> > +	 * caused by the endpoint.
> > +	 */
> > +	pci_mask_replay_timer_timeout(pdev);
> > +}
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REALTEK, 0x525a,
> > +			pci_mask_replay_timer_timeout_on_endpoint);
> > #endif
> > --
> > 2.43.0
> >
* Re: [PATCH v2] PCI: Mask Replay Timer Timeout for Realtek RTS525A
2026-07-01 20:42 ` Bjorn Helgaas
@ 2026-07-02 5:10 ` Lukas Wunner
2026-07-02 7:06 ` Max Lee
2026-07-02 8:08 ` Manivannan Sadhasivam
1 sibling, 1 reply; 8+ messages in thread
From: Lukas Wunner @ 2026-07-02 5:10 UTC
To: Bjorn Helgaas
Cc: Manivannan Sadhasivam, Max Lee, bhelgaas, linux-pci, linux-kernel,
acelan.kao, Kai-Heng Feng, Victor Shih, Jon Pan-Doh
On Wed, Jul 01, 2026 at 03:42:01PM -0500, Bjorn Helgaas wrote:
> On Wed, Jul 01, 2026 at 08:27:13AM +0200, Manivannan Sadhasivam wrote:
> > On Wed, Jun 10, 2026 at 10:47:23AM +0800, Max Lee wrote:
> > > The Realtek RTS525A PCI-Express SD card reader (10ec:525a) generates
> > > excessive Correctable Error (Replay Timer Timeout) AER events during
> > > PCIe link initialization. On systems where firmware enables AER
> > > reporting (CERptEn+), this causes an AER storm of ~240K error events
> > > within 11 seconds of boot, overwhelming the kernel error handler and
> > > blocking shutdown/reboot.
>
> Specifically, I guess these error events are the AER interrupts. We
> rate-limit the actual *messages*, but not the interrupts, so we can be
> overwhelmed by handling them.
Ratelimiting went into v6.16. It would be good to know whether
the behavior described in the commit message occurs only with
older kernels or persists with ratelimiting.
It would also be good to know whether the "overwhelming" occurs
with OS-native AER handling (aer_print_error()) or with Firmware
First AER handling (pci_print_aer()).
The PCIe port driver is registered in a device_initcall, which happens
much later than PCI device enumeration in a subsys_initcall. Do these
errors occur before the port driver registers? If they do, then the
"overwhelming" likely happens through the Firmware First code path.
Or perhaps "within 11 seconds of boot" is when portdrv probes and
registers the AER driver?
It would be good to have full dmesg output for analysis because this
is all a little murky.
> Based on the flowchart in PCIe r7.0, sec 6.2.5, it looks like an
> Endpoint that detects a Correctable Error but lacks an AER Capability
> sets PCI_EXP_DEVSTA_CED and, if PCI_EXP_DEVCTL_CERE is set, sends
> ERR_COR upstream.
>
> I suspect that what we need here is:
>
> - Mask PCI_ERR_COR_REP_TIMER in Endpoint to prevent it from sending
> ERR_COR
>
> - Mask PCI_ERR_COR_REP_TIMER in upstream bridge to prevent it from
> sending ERR_COR
>
> - If the upstream bridge is not the Root Port, mask
> PCI_ERR_COR_REP_TIMER in Root Port to prevent AER interrupt if it
> receives ERR_COR, e.g., if the Endpoint or bridge lacks AER
The last bullet point doesn't sound quite right: Per table 6-4 in
PCIe r7.0 sec 6.2.7, Replay Timer Timeout is detected by the Transmitter.
This can only be the Endpoint or its upstream bridge. If the upstream
bridge is not the Root Port, masking PCI_ERR_COR_REP_TIMER in the
Root Port doesn't help.
If the Endpoint or its upstream bridge doesn't support AER, one has to
clear Correctable Error Reporting Enable in the Device Control Register
to suppress the Replay Timer Timeout errors. Obviously this will
suppress reporting of any other Correctable Error as well.
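
As a minimal illustration of that trade-off, the helper below clears the
Correctable Error Reporting Enable bit in a Device Control value. It is a
userspace sketch with a made-up function name. The bit definition is the one
from pci_regs.h.

```c
#include <stdint.h>

#define PCI_EXP_DEVCTL_CERE	0x0001	/* Correctable Error Reporting En. */

/*
 * Return devctl with Correctable Error Reporting disabled. There is no
 * per-error-type control here: every correctable error goes quiet.
 */
static uint16_t devctl_disable_cor_reporting(uint16_t devctl)
{
	return devctl & (uint16_t)~PCI_EXP_DEVCTL_CERE;
}
```

In the kernel, the equivalent read-modify-write would presumably go through
pcie_capability_clear_word() on PCI_EXP_DEVCTL rather than open-coded
config accesses.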
> I wonder if aer.c should disable the AER interrupt by temporarily
> clearing PCI_ERR_ROOT_CMD_COR_EN if it detects a storm. Maybe that
> could avoid device quirks for transient issues like this.
The hierarchy below the Root Port might be deeper and we'd risk
losing errors reported by other devices in that hierarchy while
interrupt generation is temporarily disabled.
Thanks,
Lukas
* Re: [PATCH v2] PCI: Mask Replay Timer Timeout for Realtek RTS525A
2026-07-02 5:10 ` Lukas Wunner
@ 2026-07-02 7:06 ` Max Lee
2026-07-02 14:42 ` Lukas Wunner
0 siblings, 1 reply; 8+ messages in thread
From: Max Lee @ 2026-07-02 7:06 UTC
To: Lukas Wunner, bhelgaas, Manivannan Sadhasivam
Cc: Bjorn Helgaas, linux-pci, linux-kernel, acelan.kao, Kai-Heng Feng,
Victor Shih, Jon Pan-Doh
On Thu, Jul 02, 2026 at 07:10:46AM +0200, Lukas Wunner wrote:
> It would be good to have full dmesg output for analysis because this
> is all a little murky.
Thanks Mani, Bjorn and Lukas for the detailed analysis.
I collected full boot dmesg and lspci -vv from the affected system.
The system is:
HP ZBook Power 16 inch G11 Mobile Workstation PC
BIOS: W97 Ver. 01.09.01, 03/02/2026
Kernel: 6.17.0-1012-oem
The topology is a direct Root Port -> Endpoint connection, with no switch
in between:
0000:00:1c.6-[58]----00.0 Realtek RTS525A PCI Express Card Reader
The log shows that OS-native AER is in control:
acpi PNP0A08:00: _OSC: OS now controls [PCIeHotplug SHPCHotplug PME
AER PCIeCapability LTR DPC]
pcieport 0000:00:1c.6: AER: enabled with IRQ 127
I did not find GHES/HEST/Firmware First AER messages in dmesg/journal for
this boot.
The storm starts after the PCIe port driver enables AER on 0000:00:1c.6
and immediately after rtsx_pci enables the endpoint:
[ 0.798443] pcieport 0000:00:1c.6: AER: enabled with IRQ 127
[ 1.057380] rtsx_pci 0000:58:00.0: enabling device (0000 -> 0002)
[ 1.057399] pcieport 0000:00:1c.6: AER: Correctable error message
received from 0000:58:00.0
[ 1.057405] rtsx_pci 0000:58:00.0: PCIe Bus Error:
severity=Correctable, type=Data Link Layer, (Transmitter ID)
[ 1.057408] rtsx_pci 0000:58:00.0: device [10ec:525a] error
status/mask=00001000/00006000
[ 1.057411] rtsx_pci 0000:58:00.0: [12] Timeout
Ratelimiting is present, but the interrupt/callback volume is still very
large:
[ 6.058155] aer_ratelimit: 120541 callbacks suppressed
[ 11.059227] aer_ratelimit: 120906 callbacks suppressed
/proc/interrupts later showed IRQ 127 for 0000:00:1c.6
(PCIe PME, aerdrv, PCIe bwctrl) at 940777 interrupts, while the
endpoint's own rtsx_pci MSI had only 6 interrupts.
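
As a quick cross-check against the "~240K error events within 11 seconds"
figure in the commit message, the two aer_ratelimit lines can be added up.
The snippet is a standalone calculation, not kernel code, with illustrative
function names. It assumes each suppressed callback corresponds to one
handled error event, which the log does not strictly prove.

```c
#include <stddef.h>

/* Figures copied from the dmesg excerpt above */
static const unsigned long suppressed[] = { 120541, 120906 };
static const double t_first = 1.057399;	/* first AER message */
static const double t_last = 11.059227;	/* second aer_ratelimit line */

/* Total callbacks suppressed across both ratelimit reports */
static unsigned long total_suppressed(void)
{
	unsigned long sum = 0;

	for (size_t i = 0; i < sizeof(suppressed) / sizeof(suppressed[0]); i++)
		sum += suppressed[i];
	return sum;
}

/* Average suppressed callbacks per second over the observed window */
static double callbacks_per_second(void)
{
	return (double)total_suppressed() / (t_last - t_first);
}
```

The sum is 241447 suppressed callbacks over about 10 seconds, or roughly
24,000 per second.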
Both the endpoint and the immediate upstream port expose AER capability.
On this unpatched boot, Replay Timer Timeout is not masked on either side.
Endpoint 0000:58:00.0:
Capabilities: [100 v2] Advanced Error Reporting
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
Root Port 0000:00:1c.6:
Capabilities: [100 v1] Advanced Error Reporting
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
RootCmd: CERptEn+ NFERptEn+ FERptEn+
ErrorSrc: ERR_COR: 5800 ERR_FATAL/NONFATAL: 0000
So for this system, the endpoint does expose AER, and the observed storm
does not require handling the no-endpoint-AER case.
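
The raw values quoted above can be decoded by hand. The sketch below shows
how "ErrorSrc: ERR_COR: 5800" maps to 58:00.0 and how "status/mask=
00001000/00006000" leaves only bit 12 (Replay Timer Timeout) unmasked. It is
plain userspace C with hypothetical helper names. The Requester ID layout
(bus in bits 15:8, device in 7:3, function in 2:0) is the standard one.

```c
#include <stdint.h>

#define PCI_ERR_COR_REP_TIMER	0x00001000	/* Replay Timer Timeout */

/* Requester ID layout: bus[15:8], device[7:3], function[2:0] */
struct bdf {
	uint8_t bus;
	uint8_t dev;
	uint8_t fn;
};

static struct bdf decode_requester_id(uint16_t id)
{
	struct bdf b = {
		.bus = (uint8_t)(id >> 8),
		.dev = (uint8_t)((id >> 3) & 0x1f),
		.fn = (uint8_t)(id & 0x7),
	};
	return b;
}

/* Correctable Error Status bits that are set and not masked */
static uint32_t unmasked_status(uint32_t status, uint32_t mask)
{
	return status & ~mask;
}
```

decode_requester_id(0x5800) yields bus 0x58, device 0, function 0, and
unmasked_status(0x00001000, 0x00006000) yields PCI_ERR_COR_REP_TIMER.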
I agree that the v2 wording is imprecise. pci_mask_replay_timer_timeout()
does not necessarily mask a Root Port; it masks the immediate upstream
Downstream Port leading to pdev, which may be a Root Port or a Switch
Downstream Port. If I send v3, I will fix the commit log and code comment
to use that wording instead of "parent root port".
I can also include the full dmesg and lspci output if useful.
Thanks,
Max
* Re: [PATCH v2] PCI: Mask Replay Timer Timeout for Realtek RTS525A
2026-07-01 20:42 ` Bjorn Helgaas
2026-07-02 5:10 ` Lukas Wunner
@ 2026-07-02 8:08 ` Manivannan Sadhasivam
1 sibling, 0 replies; 8+ messages in thread
From: Manivannan Sadhasivam @ 2026-07-02 8:08 UTC
To: Bjorn Helgaas
Cc: Max Lee, bhelgaas, linux-pci, linux-kernel, acelan.kao,
Kai-Heng Feng, Victor Shih, Lukas Wunner, Jon Pan-Doh
On Wed, Jul 01, 2026 at 03:42:01PM -0500, Bjorn Helgaas wrote:
> [+cc Kai-Heng, Victor, Lukas, Jon]
>
> On Wed, Jul 01, 2026 at 08:27:13AM +0200, Manivannan Sadhasivam wrote:
> > On Wed, Jun 10, 2026 at 10:47:23AM +0800, Max Lee wrote:
> > > The Realtek RTS525A PCI-Express SD card reader (10ec:525a) generates
> > > excessive Correctable Error (Replay Timer Timeout) AER events during
> > > PCIe link initialization. On systems where firmware enables AER
> > > reporting (CERptEn+), this causes an AER storm of ~240K error events
> > > within 11 seconds of boot, overwhelming the kernel error handler and
> > > blocking shutdown/reboot.
>
> Specifically, I guess these error events are the AER interrupts. We
> rate-limit the actual *messages*, but not the interrupts, so we can be
> overwhelmed by handling them.
>
> > > The root cause is a transient link training instability inherent to this
> > > device -- even on BIOS versions that suppress reporting, the error
> > > status register (CESta) shows Timeout+ set.
> > >
> > > Unlike the GL9750/GL9755 fixup, which only masks the parent root port,
>
> I think PCI_ERR_COR_REP_TIMER is masked in both the GL975x Endpoint
> and the Downstream Port leading to it:
>
> 015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the replay timer
> timeout of AER")
>
> eeee3b5e6d0b ("PCI: Mask Replay Timer Timeout errors for Genesys
> GL975x SD host controller")
>
> Maybe those should be combined so the quirk masks
> PCI_ERR_COR_REP_TIMER at both ends of the link in one place. A quirk
> like that could be used for both GL975x and RTS525A.
>
> > > the RTS525A also needs its endpoint Correctable Error Mask bit 12
> > > (PCI_ERR_COR_REP_TIMER) masked when the endpoint exposes AER, so it does
> > > not send ERR_COR messages upstream. Also mask the parent root port to
> > > cover root-port reporting of link errors caused by the endpoint.
>
> I think this and the similar comment in the code are slightly
> misleading. pci_mask_replay_timer_timeout() masks
> PCI_ERR_COR_REP_TIMER in the Downstream Port leading to pdev. That
> may be either a Root Port or a Switch Downstream Port.
>
> > > Signed-off-by: Max Lee <max.lee@canonical.com>
> > > ---
> > > Changes in v2:
> > > - Mask the parent root port even when the endpoint lacks AER capability.
> >
> > You've added this based on the comment by Sashiko I believe. But I
> > think Sashiko was wrong here. If the EP lacks AER capability, then
> > there is no way it is going to send AER errors to upstream RP. So it
> > is OK to skip the quirk.
>
> I'm not sure it's OK to skip the quirk.
>
> Based on the flowchart in PCIe r7.0, sec 6.2.5, it looks like an
> Endpoint that detects a Correctable Error but lacks an AER Capability
> sets PCI_EXP_DEVSTA_CED and, if PCI_EXP_DEVCTL_CERE is set, sends
> ERR_COR upstream.
>
Ah, you're right. I missed CERE.
> I suspect that what we need here is:
>
> - Mask PCI_ERR_COR_REP_TIMER in Endpoint to prevent it from sending
> ERR_COR
>
> - Mask PCI_ERR_COR_REP_TIMER in upstream bridge to prevent it from
> sending ERR_COR
>
> - If the upstream bridge is not the Root Port, mask
> PCI_ERR_COR_REP_TIMER in Root Port to prevent AER interrupt if it
> receives ERR_COR, e.g., if the Endpoint or bridge lacks AER
>
If the endpoint lacks AER capability, it will simply send an ERR_COR message to
the upstream bridge without any advanced logging info. Because the message
reaches the Root Port as a generic ERR_COR, RP will log the error in its Root
Error Status register and raise the AER interrupt without checking its own
PCI_ERR_COR_REP_TIMER mask status. Also, any upstream bridge lacking AER
capability will simply forward the ERR_COR message upstream irrespective of its
own PCI_ERR_COR_REP_TIMER status.
So if the endpoint lacks AER capability, we cannot prevent it from reporting the
error unless we disable CERE, which would also affect other errors. However, we
can still mask PCI_ERR_COR_REP_TIMER in the upstream bridge's AER to prevent
the bridge itself from generating its own Replay Timer Timeout errors due to the
endpoint's behavior. This acts only as a safeguard and will not prevent the
messages originating from the endpoint.
Correct me if my understanding is wrong.
- Mani
--
மணிவண்ணன் சதாசிவம்
* Re: [PATCH v2] PCI: Mask Replay Timer Timeout for Realtek RTS525A
2026-07-02 7:06 ` Max Lee
@ 2026-07-02 14:42 ` Lukas Wunner
0 siblings, 0 replies; 8+ messages in thread
From: Lukas Wunner @ 2026-07-02 14:42 UTC
To: Max Lee
Cc: bhelgaas, Manivannan Sadhasivam, Bjorn Helgaas, linux-pci,
linux-kernel, acelan.kao, Kai-Heng Feng, Victor Shih, Jon Pan-Doh
On Thu, Jul 02, 2026 at 03:06:29PM +0800, Max Lee wrote:
> Both the endpoint and the immediate upstream port expose AER capability.
> On this unpatched boot, Replay Timer Timeout is not masked on either side.
>
> Endpoint 0000:58:00.0:
>
> Capabilities: [100 v2] Advanced Error Reporting
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>
> Root Port 0000:00:1c.6:
>
> Capabilities: [100 v1] Advanced Error Reporting
> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
> RootCmd: CERptEn+ NFERptEn+ FERptEn+
> ErrorSrc: ERR_COR: 5800 ERR_FATAL/NONFATAL: 0000
There were Advisory Non-Fatal Errors on both ends of the link.
The upstream kernel does not support ANFE so far, but this
development branch contains three tentative patches to add it:
https://github.com/l1k/linux/commits/anfe_v1/
So far this is compile-tested only. You may want to give these
patches a spin to see which Non-Fatal Errors are signaled. The
kernel should also dump the TLP Prefix Log for those errors and
you can use this tool to decode it:
https://github.com/mmpg-x86/tlp-tool
See the example usage in: Documentation/PCI/pcieaer-howto.rst
The TLP Prefix Log might give a hint as to the root cause.
Your commit message mentions a "transient link training instability",
but I think if that were the case, you'd see Surprise Link Down Errors
(which are Fatal Errors). Except if the Root Port was hotplug-capable,
in which case Surprise Link Down Error generation is blocked per
PCIe r7.0 sec 3.2.1. Another exception would be if the Root Port is
not Surprise Down Error Reporting Capable (bit 19 in the Link
Capabilities Register).
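
Checking that last condition comes down to testing one bit of the Link
Capabilities value. The helper below is a userspace sketch with a made-up
name. The bit definition matches PCI_EXP_LNKCAP_SDERC in pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 19: Surprise Down Error Reporting Capable */
#define PCI_EXP_LNKCAP_SDERC	0x00080000

/* True if the port can report Surprise Down errors at all */
static bool lnkcap_reports_surprise_down(uint32_t lnkcap)
{
	return (lnkcap & PCI_EXP_LNKCAP_SDERC) != 0;
}
```

In the kernel, the input value would presumably come from
pcie_capability_read_dword() on PCI_EXP_LNKCAP for the Root Port. In lspci
-vv output the same bit is shown as SDErrRep in the LnkCap line.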
Bjorn mentioned commit eeee3b5e6d0b, which states that the Replay Timer
Timeout errors (only) occur when ASPM is enabled. That may be the
actual root cause, so you may want to play with ASPM settings (disable
L1 substates etc) to see if it makes the issue go away. Disabling
non-working ASPM settings in a quirk would be better than silencing
the ensuing errors.
Thanks,
Lukas
Thread overview: 8+ messages
2026-05-28 3:23 [PATCH] PCI: Mask Replay Timer Timeout for Realtek RTS525A Max Lee
2026-06-10 2:47 ` [PATCH v2] " Max Lee
2026-07-01 6:27 ` Manivannan Sadhasivam
2026-07-01 20:42 ` Bjorn Helgaas
2026-07-02 5:10 ` Lukas Wunner
2026-07-02 7:06 ` Max Lee
2026-07-02 14:42 ` Lukas Wunner
2026-07-02 8:08 ` Manivannan Sadhasivam