From: Bjorn Helgaas <helgaas@kernel.org>
To: Manivannan Sadhasivam <mani@kernel.org>
Cc: Max Lee <max.lee@canonical.com>,
	bhelgaas@google.com, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org, acelan.kao@canonical.com,
	Kai-Heng Feng <kaihengf@nvidia.com>,
	Victor Shih <victorshihgli@gmail.com>,
	Lukas Wunner <lukas@wunner.de>, Jon Pan-Doh <pandoh@google.com>
Subject: Re: [PATCH v2] PCI: Mask Replay Timer Timeout for Realtek RTS525A
Date: Wed, 1 Jul 2026 15:42:01 -0500
Message-ID: <20260701204201.GA302478@bhelgaas>
In-Reply-To: <oc3i3yn4c22s5c3gtqgs7a4wrpqiywwvboxstdf7josxddqmpd@y43m75ovby2t>

[+cc Kai-Heng, Victor, Lukas, Jon]

On Wed, Jul 01, 2026 at 08:27:13AM +0200, Manivannan Sadhasivam wrote:
> On Wed, Jun 10, 2026 at 10:47:23AM +0800, Max Lee wrote:
> > The Realtek RTS525A PCI-Express SD card reader (10ec:525a) generates
> > excessive Correctable Error (Replay Timer Timeout) AER events during
> > PCIe link initialization. On systems where firmware enables AER
> > reporting (CERptEn+), this causes an AER storm of ~240K error events
> > within 11 seconds of boot, overwhelming the kernel error handler and
> > blocking shutdown/reboot.

Specifically, I guess these error events are the AER interrupts.  We
rate-limit the actual *messages*, but not the interrupts, so we can be
overwhelmed by handling them.

> > The root cause is a transient link training instability inherent to this
> > device -- even on BIOS versions that suppress reporting, the error
> > status register (CESta) shows Timeout+ set.
> > 
> > Unlike the GL9750/GL9755 fixup, which only masks the parent root port,

I think PCI_ERR_COR_REP_TIMER is masked in both the GL975x Endpoint
and the Downstream Port leading to it:

  015c9cbcf0ad ("mmc: sdhci-pci-gli: GL9750: Mask the replay timer
  timeout of AER")

  eeee3b5e6d0b ("PCI: Mask Replay Timer Timeout errors for Genesys
  GL975x SD host controller")

Maybe those should be combined so the quirk masks
PCI_ERR_COR_REP_TIMER at both ends of the link in one place.  A quirk
like that could be used for both GL975x and RTS525A.

> > the RTS525A also needs its endpoint Correctable Error Mask bit 12
> > (PCI_ERR_COR_REP_TIMER) masked when the endpoint exposes AER, so it does
> > not send ERR_COR messages upstream. Also mask the parent root port to
> > cover root-port reporting of link errors caused by the endpoint.

I think this and the similar comment in the code are slightly
misleading.  pci_mask_replay_timer_timeout() masks
PCI_ERR_COR_REP_TIMER in the Downstream Port leading to pdev.  That
may be either a Root Port or a Switch Downstream Port.

> > Signed-off-by: Max Lee <max.lee@canonical.com>
> > ---
> > Changes in v2:
> >   - Mask the parent root port even when the endpoint lacks AER capability.
> 
> You've added this based on the comment by Sashiko I believe. But I
> think Sashiko was wrong here. If the EP lacks AER capability, then
> there is no way it is going to send AER errors to upstream RP. So it
> is OK to skip the quirk.

I'm not sure it's OK to skip the quirk.

Based on the flowchart in PCIe r7.0, sec 6.2.5, it looks like an
Endpoint that detects a Correctable Error but lacks an AER Capability
sets PCI_EXP_DEVSTA_CED and, if PCI_EXP_DEVCTL_CERE is set, sends
ERR_COR upstream.
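
As a rough illustration of that reading, here is a user-space sketch
(not kernel code) of the Endpoint's decision. The register bit values
match include/uapi/linux/pci_regs.h; struct ep_state and
ep_sends_err_cor() are hypothetical names, and the logic is a
simplification of the flowchart, not a restatement of the spec.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_EXP_DEVCTL_CERE	0x0001		/* Correctable Error Reporting En. */
#define PCI_EXP_DEVSTA_CED	0x0001		/* Correctable Error Detected */
#define PCI_ERR_COR_REP_TIMER	0x00001000u	/* Replay Timer Timeout */

/* Hypothetical model of the Endpoint registers involved. */
struct ep_state {
	bool has_aer;		/* AER Capability present? */
	uint32_t cor_status;	/* models PCI_ERR_COR_STATUS */
	uint32_t cor_mask;	/* models PCI_ERR_COR_MASK */
	uint16_t devctl;	/* models PCI_EXP_DEVCTL */
	uint16_t devsta;	/* models PCI_EXP_DEVSTA */
};

/* Return true if the Endpoint would send ERR_COR for err_bit. */
static bool ep_sends_err_cor(struct ep_state *ep, uint32_t err_bit)
{
	/* Detection is recorded in Device Status with or without AER. */
	ep->devsta |= PCI_EXP_DEVSTA_CED;

	if (ep->has_aer) {
		ep->cor_status |= err_bit;

		/* Only an AER mask bit can suppress this specific error. */
		if (ep->cor_mask & err_bit)
			return false;
	}

	/* Without AER, only the Device Control enable gates the message. */
	return (ep->devctl & PCI_EXP_DEVCTL_CERE) != 0;
}
```

In this model an Endpoint without AER and with CERE set still sends
ERR_COR, which is the case where masking only in the Endpoint cannot
help.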

I suspect that what we need here is:

  - Mask PCI_ERR_COR_REP_TIMER in Endpoint to prevent it from sending
    ERR_COR

  - Mask PCI_ERR_COR_REP_TIMER in upstream bridge to prevent it from
    sending ERR_COR

  - If the upstream bridge is not the Root Port, mask
    PCI_ERR_COR_REP_TIMER in Root Port to prevent AER interrupt if it
    receives ERR_COR, e.g., if the Endpoint or bridge lacks AER
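
The three steps above could be sketched roughly as follows. This is a
user-space model, not a proposed patch: struct model_dev stands in for
struct pci_dev, and model_mask_rep_timer() and
model_quirk_rep_timer() are hypothetical names. Only the
PCI_ERR_COR_REP_TIMER value comes from pci_regs.h.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_ERR_COR_REP_TIMER	0x00001000u	/* Replay Timer Timeout */

/* Hypothetical stand-in for struct pci_dev. */
struct model_dev {
	bool has_aer;			/* AER Capability present? */
	bool is_root_port;
	uint32_t cor_mask;		/* models PCI_ERR_COR_MASK */
	struct model_dev *upstream;	/* bridge above, NULL at the top */
};

/* Set the mask bit if the device has an AER Capability to hold it. */
static bool model_mask_rep_timer(struct model_dev *dev)
{
	if (!dev || !dev->has_aer)
		return false;

	dev->cor_mask |= PCI_ERR_COR_REP_TIMER;
	return true;
}

static void model_quirk_rep_timer(struct model_dev *ep)
{
	struct model_dev *bridge = ep->upstream;
	struct model_dev *rp;

	/* Steps 1 and 2: both ends of the link. */
	model_mask_rep_timer(ep);
	model_mask_rep_timer(bridge);

	/* Step 3: find the Root Port if the bridge is a Switch port. */
	for (rp = bridge; rp && !rp->is_root_port; rp = rp->upstream)
		;

	if (rp && rp != bridge)
		model_mask_rep_timer(rp);
}
```

With an Endpoint that lacks AER below a Switch, this masks the Switch
Downstream Port and the Root Port and leaves the Switch Upstream Port
alone.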

I wonder if aer.c should disable the AER interrupt by temporarily
clearing PCI_ERR_ROOT_CMD_COR_EN if it detects a storm.  Maybe that
could avoid device quirks for transient issues like this.
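
A minimal sketch of what such storm handling might look like, again as
a user-space model. Nothing here is taken from aer.c: the thresholds,
struct model_root_port, model_cor_irq() and model_cor_tick() are all
assumptions made up for illustration. Only PCI_ERR_ROOT_CMD_COR_EN is
the real bit from pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_ERR_ROOT_CMD_COR_EN	0x00000001u	/* Correctable Err Reporting */

/* Hypothetical tuning values, not taken from the kernel. */
#define STORM_WINDOW_MS		1000u
#define STORM_THRESHOLD		100u
#define STORM_COOLDOWN_MS	5000u

/* Hypothetical per-Root Port state. */
struct model_root_port {
	uint32_t root_cmd;		/* models PCI_ERR_ROOT_COMMAND */
	uint64_t window_start_ms;
	uint32_t irqs_in_window;
	uint64_t disabled_at_ms;
	bool cor_disabled;
};

/* Call per correctable-error interrupt; true means a storm was detected. */
static bool model_cor_irq(struct model_root_port *rp, uint64_t now_ms)
{
	/* Start a new counting window once the old one has expired. */
	if (now_ms - rp->window_start_ms >= STORM_WINDOW_MS) {
		rp->window_start_ms = now_ms;
		rp->irqs_in_window = 0;
	}

	if (++rp->irqs_in_window <= STORM_THRESHOLD)
		return false;

	/* Too many interrupts in one window: stop taking them for a while. */
	rp->root_cmd &= ~PCI_ERR_ROOT_CMD_COR_EN;
	rp->cor_disabled = true;
	rp->disabled_at_ms = now_ms;
	return true;
}

/* Call periodically; re-enables the interrupt after the cooldown. */
static void model_cor_tick(struct model_root_port *rp, uint64_t now_ms)
{
	if (!rp->cor_disabled ||
	    now_ms - rp->disabled_at_ms < STORM_COOLDOWN_MS)
		return;

	rp->root_cmd |= PCI_ERR_ROOT_CMD_COR_EN;
	rp->cor_disabled = false;
	rp->window_start_ms = now_ms;
	rp->irqs_in_window = 0;
}
```

A transient storm during link training would trip the threshold once,
and the interrupt would come back after the cooldown without a
per-device quirk. A device that keeps storming would be disabled again
on the next window.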

> There is a slight chance that the RP itself generating AER events
> when it has detected the Replay Timeout. But to verify, you need to
> share the AER log during bootup.
> 
> Also, since you mentioned that the firmware is enabling the AER
> reporting for the device, which means the device would've been
> enumerated by BIOS before the kernel boots. In that case, RP would
> still continue to receive the COR_ERR and by the time the AER driver
> binds to the RP (which happens before the quirk for the RTS525A gets
> executed due to depth first enumeration), those AER errors will
> still be logged until the quirk execution. So you'd still see AER
> errors in dmesg, but not a flood before this patch.

> > +++ b/drivers/pci/quirks.c
> > @@ -6380,4 +6380,26 @@ static void pci_mask_replay_timer_timeout(struct pci_dev *pdev)
> >  }
> >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9750, pci_mask_replay_timer_timeout);
> >  DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_GLI, 0x9755, pci_mask_replay_timer_timeout);
> > +
> > +static void pci_mask_replay_timer_timeout_on_endpoint(struct pci_dev *pdev)
> > +{
> > +	u32 val;
> > +
> > +	if (pdev->aer_cap) {
> > +		pci_info(pdev, "mask Replay Timer Timeout on endpoint due to hardware defect\n");
> > +
> > +		pci_read_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, &val);
> > +		val |= PCI_ERR_COR_REP_TIMER;
> > +		pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_MASK, val);
> > +	}
> > +
> > +	/*
> > +	 * Also mask the parent root port. Do this even if the endpoint lacks
> > +	 * AER capability because the root port may still report link errors
> > +	 * caused by the endpoint.
> > +	 */
> > +	pci_mask_replay_timer_timeout(pdev);
> > +}
> > +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_REALTEK, 0x525a,
> > +			pci_mask_replay_timer_timeout_on_endpoint);
> >  #endif
> > -- 
> > 2.43.0
> > 

Thread overview: 5+ messages
2026-05-28  3:23 [PATCH] PCI: Mask Replay Timer Timeout for Realtek RTS525A Max Lee
2026-05-28  3:41 ` sashiko-bot
2026-06-10  2:47 ` [PATCH v2] " Max Lee
2026-07-01  6:27   ` Manivannan Sadhasivam
2026-07-01 20:42     ` Bjorn Helgaas [this message]
