From: Sathyanarayanan Kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com>
To: Jon Pan-Doh <pandoh@google.com>,
Bjorn Helgaas <bhelgaas@google.com>,
Karolina Stolarek <karolina.stolarek@oracle.com>
Cc: linux-pci@vger.kernel.org,
"Martin Petersen" <martin.petersen@oracle.com>,
"Ben Fuller" <ben.fuller@oracle.com>,
"Drew Walton" <drewwalton@microsoft.com>,
"Anil Agrawal" <anilagrawal@meta.com>,
"Tony Luck" <tony.luck@intel.com>,
"Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com>,
"Lukas Wunner" <lukas@wunner.de>,
"Jonathan Cameron" <Jonathan.Cameron@huawei.com>,
"Terry Bowman" <Terry.bowman@amd.com>
Subject: Re: [PATCH v3 0/8] Rate limit AER logs
Date: Wed, 19 Mar 2025 15:29:17 -0700
Message-ID: <e03bf65b-c961-4196-8844-c61ac59a4a1c@linux.intel.com>
In-Reply-To: <20250319084050.366718-1-pandoh@google.com>

Hi Jon,

On 3/19/25 1:40 AM, Jon Pan-Doh wrote:
> Proposal
> ========
>
> When using native AER, spammy devices can flood kernel logs with AER errors
> and slow/stall execution. Add per-device per-error-severity ratelimits
> for more robust error logging. Allow userspace to configure ratelimits
> via sysfs knobs.
>
> Motivation
> ==========
>
> Several OCP members have issues with inconsistent PCIe error handling,
> exacerbated at datacenter scale (myriad of devices).
> OCP HW/Fault Management subproject set out to solve this by
> standardizing industry:
>
> - PCIe error handling best practices
> - Fault Management/RAS (incl. PCIe errors)
>
> Exposing PCIe errors/debug info in-band for a userspace daemon (e.g.
> rasdaemon) to collect/pass on to repairability services is part of the
> roadmap.
>
> Background
> ==========
>
> AER error spam has been observed many times, both publicly (e.g. [1], [2],
> [3]) and privately. While it usually occurs with correctable errors, it can
> happen with uncorrectable errors (e.g. during new HW bringup).
>
> There have been previous attempts to add ratelimits to AER logs ([4],
> [5]). The most recent attempt[5] has many similarities with the proposed
> approach.
>
> Patch organization
> ==================
> 1-4 AER logging cleanup
> 5-8 Ratelimits and sysfs knobs
>
> Outstanding work
> ================
> Cleanup:
> - Consolidate aer_print_error() and pci_print_error() path
>
> Roadmap:
> - IRQ ratelimiting

What base is this patch set built on? I tried applying it on both
v6.14-rc7 and linux-next, and it did not apply cleanly on either.

> v3:
> - Ratelimit aer_print_port_info() (drop Patch 1)
> - Add ratelimit enable toggle
> - Move trace outside of ratelimit
> - Split log level (Patch 2) into two
> - More descriptive documentation/sysfs naming
> v2:
> - Rebased on top of pci/aer (6.14.rc-1)
> - Split series into log and IRQ ratelimits (defer patch 5)
> - Dropped patch 8 (Move AER sysfs)
> - Added log level cleanup patch[6] from Karolina's series
> - Fixed bug where dpc errors didn't increment counters
> - "X callbacks suppressed" message on ratelimit release -> immediately
> - Separate documentation into own patch
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=215027
> [2] https://bugzilla.kernel.org/show_bug.cgi?id=201517
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=196183
> [4] https://lore.kernel.org/linux-pci/20230606035442.2886343-2-grundler@chromium.org/
> [5] https://lore.kernel.org/linux-pci/cover.1736341506.git.karolina.stolarek@oracle.com/
> [6] https://lore.kernel.org/linux-pci/edd77011aafad4c0654358a26b4e538d0c5a321d.1736341506.git.karolina.stolarek@oracle.com/
>
> Jon Pan-Doh (6):
> PCI/AER: Move AER stat collection out of __aer_print_error()
> PCI/AER: Rename struct aer_stats to aer_report
> PCI/AER: Introduce ratelimit for error logs
> PCI/AER: Add ratelimits to PCI AER Documentation
> PCI/AER: Add sysfs attributes for log ratelimits
> PCI/AER: Update AER sysfs ABI filename
>
> Karolina Stolarek (2):
> PCI/AER: Check log level once and propagate down
> PCI/AER: Make all pci_print_aer() log levels depend on error type
>
> ...es-aer_stats => sysfs-bus-pci-devices-aer} | 34 +++
> Documentation/PCI/pcieaer-howto.rst | 16 +-
> drivers/pci/pci-sysfs.c | 1 +
> drivers/pci/pci.h | 4 +-
> drivers/pci/pcie/aer.c | 276 +++++++++++++-----
> drivers/pci/pcie/dpc.c | 3 +-
> include/linux/pci.h | 2 +-
> 7 files changed, 266 insertions(+), 70 deletions(-)
> rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (77%)
>
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer

Thread overview (28+ messages):
2025-03-19 8:40 [PATCH v3 0/8] Rate limit AER logs Jon Pan-Doh
2025-03-19 8:40 ` [PATCH v3 1/8] PCI/AER: Check log level once and propagate down Jon Pan-Doh
2025-03-20 0:07 ` Sathyanarayanan Kuppuswamy
2025-03-19 8:40 ` [PATCH v3 2/8] PCI/AER: Make all pci_print_aer() log levels depend on error type Jon Pan-Doh
2025-03-19 9:52 ` Ilpo Järvinen
2025-03-20 2:39 ` Sathyanarayanan Kuppuswamy
2025-03-20 8:27 ` Jon Pan-Doh
2025-03-20 14:23 ` Sathyanarayanan Kuppuswamy
2025-03-20 19:06 ` Jon Pan-Doh
2025-03-19 8:40 ` [PATCH v3 3/8] PCI/AER: Move AER stat collection out of __aer_print_error() Jon Pan-Doh
2025-03-19 18:19 ` Bjorn Helgaas
2025-03-20 8:27 ` Jon Pan-Doh
2025-03-20 3:22 ` Sathyanarayanan Kuppuswamy
2025-03-20 8:29 ` Jon Pan-Doh
2025-03-19 8:40 ` [PATCH v3 4/8] PCI/AER: Rename struct aer_stats to aer_report Jon Pan-Doh
2025-03-20 3:29 ` Sathyanarayanan Kuppuswamy
2025-03-20 8:28 ` Jon Pan-Doh
2025-03-19 8:40 ` [PATCH v3 5/8] PCI/AER: Introduce ratelimit for error logs Jon Pan-Doh
2025-03-19 18:47 ` Bjorn Helgaas
2025-03-20 8:27 ` Jon Pan-Doh
2025-03-20 18:23 ` Bjorn Helgaas
2025-03-19 8:40 ` [PATCH v3 6/8] PCI/AER: Add ratelimits to PCI AER Documentation Jon Pan-Doh
2025-03-19 8:40 ` [PATCH v3 7/8] PCI/AER: Add sysfs attributes for log ratelimits Jon Pan-Doh
2025-03-19 9:51 ` Ilpo Järvinen
2025-03-20 8:27 ` Jon Pan-Doh
2025-03-19 8:40 ` [PATCH v3 8/8] PCI/AER: Update AER sysfs ABI filename Jon Pan-Doh
2025-03-19 22:29 ` Sathyanarayanan Kuppuswamy [this message]
2025-03-19 22:52 ` [PATCH v3 0/8] Rate limit AER logs Jon Pan-Doh