From: Bjorn Helgaas <helgaas@kernel.org>
To: Manivannan Sadhasivam <mani@kernel.org>
Cc: Max Lee <max.lee@canonical.com>,
bhelgaas@google.com, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, acelan.kao@canonical.com,
Kai-Heng Feng <kaihengf@nvidia.com>,
Victor Shih <victorshihgli@gmail.com>,
Lukas Wunner <lukas@wunner.de>, Jon Pan-Doh <pandoh@google.com>
Subject: Re: [PATCH v2] PCI: Mask Replay Timer Timeout for Realtek RTS525A
Date: Wed, 1 Jul 2026 15:42:01 -0500 [thread overview]
Message-ID: <20260701204201.GA302478@bhelgaas> (raw)
In-Reply-To: <oc3i3yn4c22s5c3gtqgs7a4wrpqiywwvboxstdf7josxddqmpd@y43m75ovby2t>
[+cc Kai-Heng, Victor, Lukas, Jon]
On Wed, Jul 01, 2026 at 08:27:13AM +0200, Manivannan Sadhasivam wrote:
> On Wed, Jun 10, 2026 at 10:47:23AM +0800, Max Lee wrote:
> > The Realtek RTS525A PCI-Express SD card reader (10ec:525a) generates
> > excessive Correctable Error (Replay Timer Timeout) AER events during
> > PCIe link initialization. On systems where firmware enables AER
> > reporting (CERptEn+), this causes an AER storm of ~240K error events
> > within 11 seconds of boot, overwhelming the kernel error handler and
> > blocking shutdown/reboot.
Specifically, I guess these error events are the AER interrupts. We
rate-limit the actual *messages*, but not the interrupts, so we can be
overwhelmed by handling them.
> > The root cause is a transient link training instability inherent to this
> > device -- even on BIOS versions that suppress reporting, the error
> > status register (CESta) shows Timeout+ set.
> >
> > Unlike the GL9750/GL9755 fixup, which only masks the parent root port,
I think PCI_ERR_COR_REP_TIMER is masked in both the GL975x Endpoint
and the Downstream Port leading to it:
015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the replay timer
timeout of AER")
eeee3b5e6d0b ("PCI: Mask Replay Timer Timeout errors for Genesys
GL975x SD host controller")
Maybe those should be combined so the quirk masks
PCI_ERR_COR_REP_TIMER at both ends of the link in one place. A quirk
like that could be used for both GL975x and RTS525A.
> > the RTS525A also needs its endpoint Correctable Error Mask bit 12
> > (PCI_ERR_COR_REP_TIMER) masked when the endpoint exposes AER, so it does
> > not send ERR_COR messages upstream. Also mask the parent root port to
> > cover root-port reporting of link errors caused by the endpoint.
I think this and the similar comment in the code are slightly
misleading. pci_mask_replay_timer_timeout() masks
PCI_ERR_COR_REP_TIMER in the Downstream Port leading to pdev. That
may be either a Root Port or a Switch Downstream Port.
> > Signed-off-by: Max Lee <max.lee@canonical.com>
> > ---
> > Changes in v2:
> > - Mask the parent root port even when the endpoint lacks AER capability.
>
> You've added this based on the comment by Sashiko I believe. But I
> think Sashiko was wrong here. If the EP lacks AER capability, then
> there is no way it is going to send AER errors to upstream RP. So it
> is OK to skip the quirk.
I'm not sure it's OK to skip the quirk.
Based on the flowchart in PCIe r7.0, sec 6.2.5, it looks like an
Endpoint that detects a Correctable Error but lacks an AER Capability
sets PCI_EXP_DEVSTA_CED and, if PCI_EXP_DEVCTL_CERE is set, sends
ERR_COR upstream.
I suspect that what we need here is:
- Mask PCI_ERR_COR_REP_TIMER in Endpoint to prevent it from sending
ERR_COR
- Mask PCI_ERR_COR_REP_TIMER in upstream bridge to prevent it from
sending ERR_COR
- If the upstream bridge is not the Root Port, mask
PCI_ERR_COR_REP_TIMER in Root Port to prevent AER interrupt if it
receives ERR_COR, e.g., if the Endpoint or bridge lacks AER
I wonder if aer.c should disable the AER interrupt by temporarily
clearing PCI_ERR_ROOT_CMD_COR_EN if it detects a storm. Maybe that
could avoid device quirks for transient issues like this.
> There is a slight chance that the RP itself generating AER events
> when it has detected the Replay Timeout. But to verify, you need to
> share the AER log during bootup.
>
> Also, since you mentioned that the firmware is enabling the AER
> reporting for the device, which means the device would've been
> enumerated by BIOS before the kernel boots. In that case, RP would
> still continue to receive the COR_ERR and by the time the AER driver
> binds to the RP (which happens before the quirk for the RTS525A gets
> executed due to depth first enumeration), those AER errors will
> still be logged until the quirk execution. So you'd still see AER
> errors in dmesg, but not a flood before this patch.
> > +++ b/drivers/pci/quirks.c
> > @@ -6380,4 +6380,26 @@ static void pci_mask_replay_timer_timeout(struct pci_dev *pdev)
> > }
> > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9750, pci_mask_replay_timer_timeout);
> > DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9755, pci_mask_replay_timer_timeout);
> > +
> > +static void pci_mask_replay_timer_timeout_on_endpoint(struct pci_dev *pdev)
> > +{
> > + u32 val;
> > +
> > + if (pdev->aer_cap) {
> > + pci_info(pdev, "mask Replay Timer Timeout on endpoint due to hardware defect\n");
> > +
> > + pci_read_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, &val);
> > + val |= PCI_ERR_COR_REP_TIMER;
> > + pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, val);
> > + }
> > +
> > + /*
> > + * Also mask the parent root port. Do this even if the endpoint lacks
> > + * AER capability because the root port may still report link errors
> > + * caused by the endpoint.
> > + */
> > + pci_mask_replay_timer_timeout(pdev);
> > +}
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REALTEK, 0x525a,
> > + pci_mask_replay_timer_timeout_on_endpoint);
> > #endif
> > --
> > 2.43.0
> >
next prev parent reply other threads:[~2026-07-01 20:42 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-28 3:23 [PATCH] PCI: Mask Replay Timer Timeout for Realtek RTS525A Max Lee
2026-06-10 2:47 ` [PATCH v2] " Max Lee
2026-07-01 6:27 ` Manivannan Sadhasivam
2026-07-01 20:42 ` Bjorn Helgaas [this message]
2026-07-02 5:10 ` Lukas Wunner
2026-07-02 7:06 ` Max Lee
2026-07-02 14:42 ` Lukas Wunner
2026-07-02 8:08 ` Manivannan Sadhasivam
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260701204201.GA302478@bhelgaas \
--to=helgaas@kernel.org \
--cc=acelan.kao@canonical.com \
--cc=bhelgaas@google.com \
--cc=kaihengf@nvidia.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=lukas@wunner.de \
--cc=mani@kernel.org \
--cc=max.lee@canonical.com \
--cc=pandoh@google.com \
--cc=victorshihgli@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox