* [PATCH 0/8] Rate limit AER logs/IRQs
From: Jon Pan-Doh @ 2025-01-15  7:42 UTC (permalink / raw)
  To: Bjorn Helgaas, Karolina Stolarek
  Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
	Tony Luck, Jon Pan-Doh

Proposal
========

When using native AER, spammy devices can flood the kernel log with AER
errors and slow or stall execution. Add per-device, per-error-severity
ratelimits for more robust error logging, and allow userspace to configure
those ratelimits via sysfs knobs.
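
As a rough illustration of what "per-device per-error-severity" means, the
userspace sketch below keeps one interval/burst limiter per severity inside
each device record. It is not the code from this series: the names
(struct dev_limits, dev_should_log(), the SEV_* values) and the fixed-window
policy are assumptions made for the example only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical severity classes; illustrative names, not from the series. */
enum sev { SEV_CORRECTABLE, SEV_NONFATAL, SEV_FATAL, SEV_COUNT };

/* Fixed-window limiter: at most `burst` events per `interval_ms`. */
struct ratelimit {
	uint64_t window_start_ms;	/* start of the current window */
	uint32_t interval_ms;		/* window length; 0 disables limiting */
	uint32_t burst;			/* events allowed per window */
	uint32_t printed;		/* events allowed in this window */
	uint32_t missed;		/* events suppressed so far (cumulative) */
};

/* One limiter per severity, embedded in each (hypothetical) device record. */
struct dev_limits {
	struct ratelimit rl[SEV_COUNT];
};

static void dev_limits_init(struct dev_limits *d, uint32_t interval_ms,
			    uint32_t burst)
{
	for (int i = 0; i < SEV_COUNT; i++)
		d->rl[i] = (struct ratelimit){
			.interval_ms = interval_ms,
			.burst = burst,
		};
}

/* Returns true if an error of severity `s` seen at `now_ms` may be logged. */
static bool dev_should_log(struct dev_limits *d, enum sev s, uint64_t now_ms)
{
	struct ratelimit *rl = &d->rl[s];

	if (rl->interval_ms == 0)
		return true;	/* limiting disabled in this sketch */

	/* Open a new window once the current one has expired. */
	if (now_ms - rl->window_start_ms >= rl->interval_ms) {
		rl->window_start_ms = now_ms;
		rl->printed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}

	rl->missed++;
	return false;
}
```

Because each severity has its own state, a burst of correctable errors that
exhausts its budget does not suppress a later fatal error from the same
device, and a noisy device does not consume the budget of its neighbors.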

Motivation
==========

Several OCP members have issues with inconsistent PCIe error handling,
which are exacerbated at datacenter scale (myriad devices). The OCP
HW/Fault Management subproject set out to solve this by standardizing the
following industry-wide:

- PCIe error handling best practices
- Fault Management/RAS (incl. PCIe errors)

Exposing PCIe errors/debug info in-band for a userspace daemon (e.g.
rasdaemon) to collect/pass on to repairability services is part of the
roadmap.

Background
==========

AER error spam has been observed many times, both publicly (e.g. [1], [2],
[3]) and privately. While it usually occurs with correctable errors, it can
also happen with uncorrectable errors (e.g. during new HW bringup).

There have been previous attempts to add ratelimits to AER logs ([4],
[5]). The most recent attempt[5] has many similarities with the proposed
approach.

Patch organization
==================

1-3 AER logging cleanup
4-7 Ratelimits and sysfs knobs
8   Sysfs cleanup (RFC; breaks existing ABI and can be dropped)

Outstanding work
================

Cleanup:
- Consolidate aer_print_error() and pci_print_aer() paths
- Elevate log level logic out of print functions[6]

[1] https://bugzilla.kernel.org/show_bug.cgi?id=215027
[2] https://bugzilla.kernel.org/show_bug.cgi?id=201517
[3] https://bugzilla.kernel.org/show_bug.cgi?id=196183
[4] https://lore.kernel.org/linux-pci/20230606035442.2886343-2-grundler@chromium.org/
[5] https://lore.kernel.org/linux-pci/cover.1736341506.git.karolina.stolarek@oracle.com/
[6] https://lore.kernel.org/linux-pci/edd77011aafad4c0654358a26b4e538d0c5a321d.1736341506.git.karolina.stolarek@oracle.com/

Jon Pan-Doh (8):
  PCI/AER: Remove aer_print_port_info
  PCI/AER: Move AER stat collection out of __aer_print_error
  PCI/AER: Rename struct aer_stats to aer_info
  PCI/AER: Introduce ratelimit for error logs
  PCI/AER: Introduce ratelimit for AER IRQs
  PCI/AER: Add AER sysfs attributes for ratelimits
  PCI/AER: Update AER sysfs ABI filename
  PCI/AER: Move AER sysfs attributes into separate directory

 ...es-aer_stats => sysfs-bus-pci-devices-aer} |  50 +++-
 Documentation/PCI/pcieaer-howto.rst           |  10 +-
 drivers/pci/pci-sysfs.c                       |   2 +-
 drivers/pci/pci.h                             |   2 +-
 drivers/pci/pcie/aer.c                        | 227 +++++++++++++-----
 include/linux/pci.h                           |   2 +-
 6 files changed, 216 insertions(+), 77 deletions(-)
 rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (69%)

-- 
2.48.0.rc2.279.g1de40edade-goog


