From: Bjorn Helgaas <helgaas@kernel.org>
To: Jon Pan-Doh <pandoh@google.com>
Cc: "Karolina Stolarek" <karolina.stolarek@oracle.com>,
linux-pci@vger.kernel.org, "Bjorn Helgaas" <bhelgaas@google.com>,
"Martin Petersen" <martin.petersen@oracle.com>,
"Ben Fuller" <ben.fuller@oracle.com>,
"Drew Walton" <drewwalton@microsoft.com>,
"Anil Agrawal" <anilagrawal@meta.com>,
"Tony Luck" <tony.luck@intel.com>,
"Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com>,
"Sathyanarayanan Kuppuswamy"
<sathyanarayanan.kuppuswamy@linux.intel.com>,
"Lukas Wunner" <lukas@wunner.de>,
"Jonathan Cameron" <Jonathan.Cameron@huawei.com>,
"Terry Bowman" <Terry.bowman@amd.com>
Subject: Re: [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs
Date: Thu, 20 Mar 2025 15:29:13 -0500 [thread overview]
Message-ID: <20250320202913.GA1097165@bhelgaas> (raw)
In-Reply-To: <CAMC_AXVi7cFOnSa25SEkZsYf27eoX1NwFmc8VnRgFQS44PpKRQ@mail.gmail.com>
On Thu, Mar 20, 2025 at 12:53:53PM -0700, Jon Pan-Doh wrote:
> On Thu, Mar 20, 2025 at 10:51 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, Mar 20, 2025 at 03:56:53PM +0100, Karolina Stolarek wrote:
> > > On 20/03/2025 09:20, Jon Pan-Doh wrote:
> > > > + /*
> > > > + * Ratelimits are doubled as a given error produces 2 logs (root port
> > > > + * and endpoint) that should be under same ratelimit.
> > > > + */
> > > For these devices, we would call the ratelimit just once. I
> > > don't have a nice an clean solution for this problem, I just
> > > wanted to highlight that 1) we don't use the Root Port's
> > > ratelimit in aer_print_port_info(), 2) we may use the bursts to
> > > either print port_info + error message or just the message, in
> > > different combinations. I think we should reword this comment to
> > > highlight the fact that we don't check the ratelimit once per
> > > error, we could do it twice.
>
> You're right. I was thinking of amending it to something like:
>
> Ratelimits are doubled as a given error notification produces up to
> 2 logs (1 at root port and 1 at source device) that should be under
> the same ratelimit.
>
> > Very good point. aer_print_rp_info() is always ratelimited based
> > on the ERR_* Source Identification, but if Multiple ERR_* is set,
> > aer_print_error() ratelimits based on whatever downstream device
> > we found that had any error of the same class logged.
> >
> > That does seem like a problem. I would propose that we always
> > ratelimit using the device from Source Identification. I think
> > that's available in aer_print_error(); we would just use info->id
> > instead of "dev".
>
> Wouldn't you be incorrectly counting the non-source ID devices then?
> I think this is another reason where removing
> aer_print_port_info()[1] (only printing port info when failing to
> get device error info) simplifies things. Of course, we then have to
> weigh whether the loss of info is less than the ratelimit
> complexity.
Yes, I guess so. Maybe the ratelimit should be in the source of the
interrupt (Root Port for AER, Root Port or Downstream Port for DPC) so
it's more directly related to the interrupt that got us here in the
first place.
I think the struct aer_err_info is basically a per-interrupt thing, so
maybe we could evaluate __ratelimit() once at the initial entry, save
the result in aer_err_info, and use that saved value everywhere we
print messages?
- native AER: aer_isr_one_error() has RP pointer in rpc->rpd and
could save it (or pointer to the RP's ratelimit struct, or just
the result of __ratelimit()) in aer_err_info.
- GHES AER: I'm not sure struct cper_sec_pcie contains the RP, might
have to search upwards from the device we know about?
- native DPC: dpc_process_error() has DP pointer and could save it
in aer_err_info.
- EDR DPC: passes DP pointer to dpc_process_error().
> > > I'm also thinking -- we are ratelimiting the
> > > aer_print_port_info() and aer_print_error(). What about the
> > > messages in dpc_process_error()? Should we check early if DPC
> > > was triggered because of an uncorrectable error, and if so,
> > > ratelimit that?
> >
> > Good question. It does seem like the dpc_process_error() messages
> > should be similarly ratelimited. I think we currently only enable
> > DPC for fatal errors, and the Downstream Port takes the link down,
> > which resets the hierarchy below. So (1) we probably won't see
> > storms of fatal error messages, and (2) it looks like we might not
> > print any error info from downstream devices, since they're not
> > reachable while the link is down.
>
> I did not expect error storms from DPC so I thought it best to focus
> on AER.
Completely agreed.
> [1] https://lore.kernel.org/linux-pci/CAMC_AXWVOtKh2r4kX6c7jtJwQaEE4KEQsH=uoB1OhczJ=8K2VA@mail.gmail.com/
[BTW, email quoting style error here; shouldn't have the entire
message you're replying to repeated below. Gmail tries hard to screw
this up for you ;)]
> On Thu, Mar 20, 2025 at 10:51 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> ... (snipped)
next prev parent reply other threads:[~2025-03-20 20:29 UTC|newest]
Thread overview: 32+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-03-20 8:20 [PATCH v4 0/7] Rate limit AER logs Jon Pan-Doh
2025-03-20 8:20 ` [PATCH v4 1/7] PCI/AER: Check log level once and propagate down Jon Pan-Doh
2025-03-20 8:20 ` [PATCH v4 2/7] PCI/AER: Make all pci_print_aer() log levels depend on error type Jon Pan-Doh
2025-03-20 8:20 ` [PATCH v4 3/7] PCI/AER: Move AER stat collection out of __aer_print_error() Jon Pan-Doh
2025-03-20 14:59 ` Karolina Stolarek
2025-03-20 19:07 ` Jon Pan-Doh
2025-03-20 8:20 ` [PATCH v4 4/7] PCI/AER: Rename struct aer_stats to aer_report Jon Pan-Doh
2025-03-20 17:42 ` Sathyanarayanan Kuppuswamy
2025-03-20 19:53 ` Jon Pan-Doh
2025-03-21 13:38 ` Karolina Stolarek
2025-03-20 8:20 ` [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs Jon Pan-Doh
2025-03-20 14:56 ` Karolina Stolarek
2025-03-20 17:51 ` Bjorn Helgaas
2025-03-20 19:53 ` Jon Pan-Doh
2025-03-20 20:29 ` Bjorn Helgaas [this message]
2025-03-21 1:58 ` Jon Pan-Doh
2025-03-20 19:37 ` Jon Pan-Doh
2025-03-21 1:00 ` Sathyanarayanan Kuppuswamy
2025-03-21 19:24 ` Jon Pan-Doh
2025-03-21 21:47 ` Sathyanarayanan Kuppuswamy
2025-03-21 21:59 ` Bjorn Helgaas
2025-03-21 22:11 ` Jon Pan-Doh
2025-03-20 8:20 ` [PATCH v4 6/7] PCI/AER: Add ratelimits to PCI AER Documentation Jon Pan-Doh
2025-03-20 14:57 ` Karolina Stolarek
2025-03-21 1:00 ` Sathyanarayanan Kuppuswamy
2025-03-20 8:20 ` [PATCH v4 7/7] PCI/AER: Add sysfs attributes for log ratelimits Jon Pan-Doh
2025-03-20 14:58 ` Karolina Stolarek
2025-03-20 19:36 ` Jon Pan-Doh
2025-03-21 1:02 ` Sathyanarayanan Kuppuswamy
2025-03-21 1:55 ` Jon Pan-Doh
2025-03-20 14:34 ` [PATCH v4 0/7] Rate limit AER logs Christoph Hellwig
2025-03-20 18:45 ` Paul E. McKenney
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250320202913.GA1097165@bhelgaas \
--to=helgaas@kernel.org \
--cc=Jonathan.Cameron@huawei.com \
--cc=Terry.bowman@amd.com \
--cc=anilagrawal@meta.com \
--cc=ben.fuller@oracle.com \
--cc=bhelgaas@google.com \
--cc=drewwalton@microsoft.com \
--cc=ilpo.jarvinen@linux.intel.com \
--cc=karolina.stolarek@oracle.com \
--cc=linux-pci@vger.kernel.org \
--cc=lukas@wunner.de \
--cc=martin.petersen@oracle.com \
--cc=pandoh@google.com \
--cc=sathyanarayanan.kuppuswamy@linux.intel.com \
--cc=tony.luck@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox