Linux PCI subsystem development
 help / color / mirror / Atom feed
From: Jon Pan-Doh <pandoh@google.com>
To: Bjorn Helgaas <bhelgaas@google.com>,
	Karolina Stolarek <karolina.stolarek@oracle.com>
Cc: linux-pci@vger.kernel.org,
	"Martin Petersen" <martin.petersen@oracle.com>,
	"Ben Fuller" <ben.fuller@oracle.com>,
	"Drew Walton" <drewwalton@microsoft.com>,
	"Anil Agrawal" <anilagrawal@meta.com>,
	"Tony Luck" <tony.luck@intel.com>,
	"Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com>,
	"Sathyanarayanan Kuppuswamy"
	<sathyanarayanan.kuppuswamy@linux.intel.com>,
	"Lukas Wunner" <lukas@wunner.de>,
	"Jonathan Cameron" <Jonathan.Cameron@huawei.com>,
	"Sargun Dhillon" <sargun@meta.com>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	"Jon Pan-Doh" <pandoh@google.com>
Subject: [PATCH v5 0/8] Rate limit AER logs
Date: Thu, 20 Mar 2025 18:57:58 -0700	[thread overview]
Message-ID: <20250321015806.954866-1-pandoh@google.com> (raw)

Proposal
========

When using native AER, spammy devices can flood kernel logs with AER errors
and slow/stall execution. Add per-device per-error-severity ratelimits
for more robust error logging. Allow userspace to configure ratelimits
via sysfs knobs.

Motivation
==========

Inconsistent PCIe error handling, exacerbated at datacenter scale (myriad
of devices), affects repairabilitiy flows for fleet operators.

Exposing PCIe errors/debug info in-band for a userspace daemon (e.g.
rasdaemon) to collect/pass on to repairability services will allow for
more predictable repair flows and decrease machine downtime.

Background
==========

AER error spam has been observed many times, both publicly (e.g. [1], [2],
[3]) and privately. While it usually occurs with correctable errors, it can
happen with uncorrectable errors (e.g. during new HW bringup). 

There have been previous attempts to add ratelimits to AER logs ([4],
[5]). The most recent attempt[5] has many similarities with the proposed
approach.

Patch organization
==================
1-5 AER logging cleanup
6-8 Ratelimits and sysfs knobs

Outstanding work
================
Cleanup:
- Consolidate aer_print_error() and pci_print_error() path[6]

Roadmap:
- IRQ ratelimiting

v5:
- Handle multi-error AER by evaluating ratelimits once and storing result
- Reword/rename commit messages/functions/variable
v4:
- Fix bug where trace not emitted with malformed aer_err_info
- Extend ratelimit to malformed aer_err_info
- Update commit messages with patch motivation
- Squash AER sysfs filename change (Patch 8)
v3:
- Ratelimit aer_print_port_info() (drop Patch 1)
- Add ratelimit enable toggle
- Move trace outside of ratelimit
- Split log level (Patch 2) into two
- More descriptive documentation/sysfs naming
v2:
- Rebased on top of pci/aer (6.14.rc-1)
- Split series into log and IRQ ratelimits (defer patch 5)
- Dropped patch 8 (Move AER sysfs)
- Added log level cleanup patch[7] from Karolina's series
- Fixed bug where dpc errors didn't increment counters
- "X callbacks suppressed" message on ratelimit release -> immediately
- Separate documentation into own patch

[1] https://bugzilla.kernel.org/show_bug.cgi?id=215027
[2] https://bugzilla.kernel.org/show_bug.cgi?id=201517
[3] https://bugzilla.kernel.org/show_bug.cgi?id=196183
[4] https://lore.kernel.org/linux-pci/20230606035442.2886343-2-grundler@chromium.org/
[5] https://lore.kernel.org/linux-pci/cover.1736341506.git.karolina.stolarek@oracle.com/
[6] https://lore.kernel.org/linux-pci/8bcb8c9a7b38ce3bdaca5a64fe76f08b0b337511.1742202797.git.karolina.stolarek@oracle.com/
[7] https://lore.kernel.org/linux-pci/edd77011aafad4c0654358a26b4e538d0c5a321d.1736341506.git.karolina.stolarek@oracle.com/

Jon Pan-Doh (6):
  PCI/AER: Move AER stat collection out of __aer_print_error()
  PCI/AER: Rename aer_print_port_info() to aer_printrp_info()
  PCI/AER: Rename struct aer_stats to aer_report
  PCI/AER: Introduce ratelimit for error logs
  PCI/AER: Add ratelimits to PCI AER Documentation
  PCI/AER: Add sysfs attributes for log ratelimits

Karolina Stolarek (2):
  PCI/AER: Check log level once and propagate down
  PCI/AER: Make all pci_print_aer() log levels depend on error type

 ...es-aer_stats => sysfs-bus-pci-devices-aer} |  34 +++
 Documentation/PCI/pcieaer-howto.rst           |  16 +-
 drivers/pci/pci-sysfs.c                       |   1 +
 drivers/pci/pci.h                             |   6 +-
 drivers/pci/pcie/aer.c                        | 282 +++++++++++++-----
 drivers/pci/pcie/dpc.c                        |   3 +-
 include/linux/pci.h                           |   2 +-
 7 files changed, 270 insertions(+), 74 deletions(-)
 rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (77%)

-- 
2.49.0.395.g12beb8f557-goog


             reply	other threads:[~2025-03-21  1:58 UTC|newest]

Thread overview: 35+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-21  1:57 Jon Pan-Doh [this message]
2025-03-21  1:57 ` [PATCH v5 1/8] PCI/AER: Check log level once and propagate down Jon Pan-Doh
2025-05-01 21:43   ` Bjorn Helgaas
2025-05-05  9:30     ` Karolina Stolarek
2025-05-05 17:43       ` Bjorn Helgaas
2025-05-08 15:07         ` Karolina Stolarek
2025-03-21  1:58 ` [PATCH v5 2/8] PCI/AER: Make all pci_print_aer() log levels depend on error type Jon Pan-Doh
2025-03-21  1:58 ` [PATCH v5 3/8] PCI/AER: Move AER stat collection out of __aer_print_error() Jon Pan-Doh
2025-03-21  1:58 ` [PATCH v5 4/8] PCI/AER: Rename aer_print_port_info() to aer_printrp_info() Jon Pan-Doh
2025-03-21 13:39   ` Karolina Stolarek
2025-03-21 19:26     ` Jon Pan-Doh
2025-03-21  1:58 ` [PATCH v5 5/8] PCI/AER: Rename struct aer_stats to aer_report Jon Pan-Doh
2025-03-21 22:01   ` Bjorn Helgaas
2025-03-21 22:15     ` Jon Pan-Doh
2025-03-21 22:30       ` Paul E. McKenney
2025-03-21 22:16     ` Paul E. McKenney
2025-03-21 22:39       ` Bjorn Helgaas
2025-03-21 22:47         ` Paul E. McKenney
2025-05-01 22:02         ` Bjorn Helgaas
2025-05-02  2:16           ` Paul E. McKenney
2025-03-21  1:58 ` [PATCH v5 6/8] PCI/AER: Introduce ratelimit for error logs Jon Pan-Doh
2025-03-21 13:46   ` Karolina Stolarek
2025-03-21 18:41     ` Jon Pan-Doh
2025-04-04  9:32       ` Karolina Stolarek
2025-03-25 17:17   ` Paul E. McKenney
2025-03-27 22:49     ` Jon Pan-Doh
2025-04-03 19:02       ` Paul E. McKenney
2025-03-31 18:48   ` Bjorn Helgaas
2025-04-01  0:30     ` Jon Pan-Doh
2025-04-01 18:02       ` Bjorn Helgaas
2025-04-24 20:31   ` Bjorn Helgaas
2025-03-21  1:58 ` [PATCH v5 7/8] PCI/AER: Add ratelimits to PCI AER Documentation Jon Pan-Doh
2025-03-21  1:58 ` [PATCH v5 8/8] PCI/AER: Add sysfs attributes for log ratelimits Jon Pan-Doh
2025-03-23 12:20   ` Krzysztof Wilczyński
2025-03-27 22:50     ` Jon Pan-Doh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250321015806.954866-1-pandoh@google.com \
    --to=pandoh@google.com \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=anilagrawal@meta.com \
    --cc=ben.fuller@oracle.com \
    --cc=bhelgaas@google.com \
    --cc=drewwalton@microsoft.com \
    --cc=ilpo.jarvinen@linux.intel.com \
    --cc=karolina.stolarek@oracle.com \
    --cc=linux-pci@vger.kernel.org \
    --cc=lukas@wunner.de \
    --cc=martin.petersen@oracle.com \
    --cc=paulmck@kernel.org \
    --cc=sargun@meta.com \
    --cc=sathyanarayanan.kuppuswamy@linux.intel.com \
    --cc=tony.luck@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox