From: Lukas Wunner <lukas@wunner.de>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: Manivannan Sadhasivam <mani@kernel.org>,
Max Lee <max.lee@canonical.com>,
bhelgaas@google.com, linux-pci@vger.kernel.org,
linux-kernel@vger.kernel.org, acelan.kao@canonical.com,
Kai-Heng Feng <kaihengf@nvidia.com>,
Victor Shih <victorshihgli@gmail.com>,
Jon Pan-Doh <pandoh@google.com>
Subject: Re: [PATCH v2] PCI: Mask Replay Timer Timeout for Realtek RTS525A
Date: Thu, 2 Jul 2026 07:10:46 +0200
Message-ID: <akXy1gsijpfslOJm@wunner.de>
In-Reply-To: <20260701204201.GA302478@bhelgaas>

On Wed, Jul 01, 2026 at 03:42:01PM -0500, Bjorn Helgaas wrote:
> On Wed, Jul 01, 2026 at 08:27:13AM +0200, Manivannan Sadhasivam wrote:
> > On Wed, Jun 10, 2026 at 10:47:23AM +0800, Max Lee wrote:
> > > The Realtek RTS525A PCI-Express SD card reader (10ec:525a) generates
> > > excessive Correctable Error (Replay Timer Timeout) AER events during
> > > PCIe link initialization. On systems where firmware enables AER
> > > reporting (CERptEn+), this causes an AER storm of ~240K error events
> > > within 11 seconds of boot, overwhelming the kernel error handler and
> > > blocking shutdown/reboot.
>
> Specifically, I guess these error events are the AER interrupts. We
> rate-limit the actual *messages*, but not the interrupts, so we can be
> overwhelmed by handling them.

Ratelimiting went into v6.16.  It would be good to know whether
the behavior described in the commit message occurs only with
older kernels or persists with ratelimiting.

It would also be good to know whether the "overwhelming" occurs
with OS-native AER handling (aer_print_error()) or with Firmware
First AER handling (pci_print_aer()).

The PCIe port driver is registered in a device_initcall, which happens
much later than PCI device enumeration in a subsys_initcall.  Do these
errors occur before the port driver registers?  If they do, then the
"overwhelming" likely happens through the Firmware First code path.
Or perhaps "within 11 seconds of boot" is when portdrv probes and
registers the AER driver?

It would be good to have full dmesg output for analysis because this
is all a little murky.

> Based on the flowchart in PCIe r7.0, sec 6.2.5, it looks like an
> Endpoint that detects a Correctable Error but lacks an AER Capability
> sets PCI_EXP_DEVSTA_CED and, if PCI_EXP_DEVCTL_CERE is set, sends
> ERR_COR upstream.
>
> I suspect that what we need here is:
>
> - Mask PCI_ERR_COR_REP_TIMER in Endpoint to prevent it from sending
>   ERR_COR
>
> - Mask PCI_ERR_COR_REP_TIMER in upstream bridge to prevent it from
>   sending ERR_COR
>
> - If the upstream bridge is not the Root Port, mask
>   PCI_ERR_COR_REP_TIMER in Root Port to prevent AER interrupt if it
>   receives ERR_COR, e.g., if the Endpoint or bridge lacks AER

The last bullet point doesn't sound quite right:  Per table 6-4 in
PCIe r7.0 sec 6.2.7, Replay Timer Timeout is detected by the Transmitter.
This can only be the Endpoint or its upstream bridge.  If the upstream
bridge is not the Root Port, masking PCI_ERR_COR_REP_TIMER in the
Root Port doesn't help.

If the Endpoint or its upstream bridge doesn't support AER, one has to
clear Correctable Error Reporting Enable in the Device Control Register
to suppress the Replay Timer Timeout errors.  Obviously this will
suppress reporting of any other Correctable Error as well.
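
To illustrate the two cases above, here is a minimal userspace model, not
kernel code.  struct model_dev and both model_*() helpers are hypothetical
names made up for this sketch; only the two bit values are taken from
include/uapi/linux/pci_regs.h.  In an actual quirk the same bits would
presumably be touched via pci_read_config_dword() / pci_write_config_dword()
at dev->aer_cap + PCI_ERR_COR_MASK, or via pcie_capability_clear_word() for
Device Control.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/pci_regs.h */
#define PCI_EXP_DEVCTL_CERE   0x0001      /* Correctable Error Reporting En. */
#define PCI_ERR_COR_REP_TIMER 0x00001000  /* Replay Timer Timeout */

/* Hypothetical stand-in for the registers of one Transmitter. */
struct model_dev {
	bool has_aer;       /* device implements the AER Capability */
	uint32_t cor_mask;  /* AER Correctable Error Mask register */
	uint16_t devctl;    /* PCIe Device Control register */
};

/*
 * Suppress Replay Timer Timeout reporting on one Transmitter (Endpoint
 * or its upstream bridge).  With AER, mask only that error.  Without AER,
 * the only knob is CERE, which silences all Correctable Errors.
 */
static void model_suppress_rep_timer(struct model_dev *dev)
{
	assert(dev != NULL);

	if (dev->has_aer)
		dev->cor_mask |= PCI_ERR_COR_REP_TIMER;
	else
		dev->devctl &= (uint16_t)~PCI_EXP_DEVCTL_CERE;
}

/* Would a Replay Timer Timeout detected by this device send ERR_COR? */
static bool model_sends_err_cor(const struct model_dev *dev)
{
	assert(dev != NULL);

	/* A masked error is logged in AER status but not signaled. */
	if (dev->has_aer && (dev->cor_mask & PCI_ERR_COR_REP_TIMER))
		return false;

	return (dev->devctl & PCI_EXP_DEVCTL_CERE) != 0;
}
```

The helper has to be applied to both ends of the link, because either one
can be the Transmitter that detects the timeout.  A Root Port further up
the hierarchy never appears in this model, which is the point made above.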

> I wonder if aer.c should disable the AER interrupt by temporarily
> clearing PCI_ERR_ROOT_CMD_COR_EN if it detects a storm.  Maybe that
> could avoid device quirks for transient issues like this.

The hierarchy below the Root Port might be deeper and we'd risk
losing errors reported by other devices in that hierarchy while
interrupt generation is temporarily disabled.
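
A rough userspace model of that concern, as I understand the Root Error
Status semantics.  struct model_root_port and model_receive_err_cor() are
hypothetical names for this sketch; the bit values are those from
include/uapi/linux/pci_regs.h.  Interrupt delivery is simplified to one
count per received message, which real hardware does not do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/pci_regs.h */
#define PCI_ERR_ROOT_CMD_COR_EN    0x00000001 /* Correctable Err Reporting En. */
#define PCI_ERR_ROOT_COR_RCV       0x00000001 /* ERR_COR Received */
#define PCI_ERR_ROOT_MULTI_COR_RCV 0x00000002 /* Multiple ERR_COR Received */

/* Hypothetical stand-in for the Root Port's AER root registers. */
struct model_root_port {
	uint32_t root_cmd;       /* Root Error Command */
	uint32_t root_status;    /* Root Error Status */
	uint16_t cor_source_id;  /* ERR_COR Source Identification */
	unsigned int irqs;       /* interrupts raised (simplified) */
};

/* Model the Root Port receiving ERR_COR from a device below it. */
static void model_receive_err_cor(struct model_root_port *rp,
				  uint16_t requester_id)
{
	assert(rp != NULL);

	if (!(rp->root_status & PCI_ERR_ROOT_COR_RCV)) {
		/* First message: latch the source. */
		rp->root_status |= PCI_ERR_ROOT_COR_RCV;
		rp->cor_source_id = requester_id;
	} else {
		/* Later messages: only the "multiple" flag is recorded. */
		rp->root_status |= PCI_ERR_ROOT_MULTI_COR_RCV;
	}

	/* Simplification: one interrupt per message while enabled. */
	if (rp->root_cmd & PCI_ERR_ROOT_CMD_COR_EN)
		rp->irqs++;
}
```

With COR_EN cleared, a second device below the same Root Port can report
an error and the only trace left is the Multiple ERR_COR Received bit,
because the Source ID still points at the first reporter.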

Thanks,

Lukas