Linux PCI subsystem development
 help / color / mirror / Atom feed
* [PATCH v4 0/7] Rate limit AER logs
@ 2025-03-20  8:20 Jon Pan-Doh
  2025-03-20  8:20 ` [PATCH v4 1/7] PCI/AER: Check log level once and propagate down Jon Pan-Doh
                   ` (8 more replies)
  0 siblings, 9 replies; 32+ messages in thread
From: Jon Pan-Doh @ 2025-03-20  8:20 UTC (permalink / raw)
  To: Bjorn Helgaas, Karolina Stolarek
  Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
	Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
	Lukas Wunner, Jonathan Cameron, Terry Bowman, Jon Pan-Doh

Proposal
========

When using native AER, spammy devices can flood kernel logs with AER errors
and slow/stall execution. Add per-device per-error-severity ratelimits
for more robust error logging. Allow userspace to configure ratelimits
via sysfs knobs.

Motivation
==========

Several OCP members have issues with inconsistent PCIe error handling,
exacerbated at datacenter scale (myriad of devices).
OCP HW/Fault Management subproject set out to solve this by
standardizing industry:

- PCIe error handling best practices
- Fault Management/RAS (incl. PCIe errors)

Exposing PCIe errors/debug info in-band for a userspace daemon (e.g.
rasdaemon) to collect/pass on to repairability services is part of the
roadmap.

Background
==========

AER error spam has been observed many times, both publicly (e.g. [1], [2],
[3]) and privately. While it usually occurs with correctable errors, it can
happen with uncorrectable errors (e.g. during new HW bringup). 

There have been previous attempts to add ratelimits to AER logs ([4],
[5]). The most recent attempt[5] has many similarities with the proposed
approach.

Patch organization
==================
1-4 AER logging cleanup
5-7 Ratelimits and sysfs knobs

Outstanding work
================
Cleanup:
- Consolidate aer_print_error() and pci_print_error() path

Roadmap:
- IRQ ratelimiting

v4:
- Fix bug where trace not emitted with malformed aer_err_info
- Extend ratelimit to malformed aer_err_info
- Update commit messages with patch motivation
- Squash AER sysfs filename change (Patch 8)
v3:
- Ratelimit aer_print_port_info() (drop Patch 1)
- Add ratelimit enable toggle
- Move trace outside of ratelimit
- Split log level (Patch 2) into two
- More descriptive documentation/sysfs naming
v2:
- Rebased on top of pci/aer (6.14.rc-1)
- Split series into log and IRQ ratelimits (defer patch 5)
- Dropped patch 8 (Move AER sysfs)
- Added log level cleanup patch[6] from Karolina's series
- Fixed bug where dpc errors didn't increment counters
- "X callbacks suppressed" message on ratelimit release -> immediately
- Separate documentation into own patch

[1] https://bugzilla.kernel.org/show_bug.cgi?id=215027
[2] https://bugzilla.kernel.org/show_bug.cgi?id=201517
[3] https://bugzilla.kernel.org/show_bug.cgi?id=196183
[4] https://lore.kernel.org/linux-pci/20230606035442.2886343-2-grundler@chromium.org/
[5] https://lore.kernel.org/linux-pci/cover.1736341506.git.karolina.stolarek@oracle.com/
[6] https://lore.kernel.org/linux-pci/edd77011aafad4c0654358a26b4e538d0c5a321d.1736341506.git.karolina.stolarek@oracle.com/

Jon Pan-Doh (5):
  PCI/AER: Move AER stat collection out of __aer_print_error()
  PCI/AER: Rename struct aer_stats to aer_report
  PCI/AER: Introduce ratelimit for error logs
  PCI/AER: Add ratelimits to PCI AER Documentation
  PCI/AER: Add sysfs attributes for log ratelimits

Karolina Stolarek (2):
  PCI/AER: Check log level once and propagate down
  PCI/AER: Make all pci_print_aer() log levels depend on error type

 ...es-aer_stats => sysfs-bus-pci-devices-aer} |  34 +++
 Documentation/PCI/pcieaer-howto.rst           |  16 +-
 drivers/pci/pci-sysfs.c                       |   1 +
 drivers/pci/pci.h                             |   4 +-
 drivers/pci/pcie/aer.c                        | 271 +++++++++++++-----
 drivers/pci/pcie/dpc.c                        |   3 +-
 include/linux/pci.h                           |   2 +-
 7 files changed, 260 insertions(+), 71 deletions(-)
 rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (77%)

-- 
2.49.0.rc1.451.g8f38331e32-goog


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH v4 1/7] PCI/AER: Check log level once and propagate down
  2025-03-20  8:20 [PATCH v4 0/7] Rate limit AER logs Jon Pan-Doh
@ 2025-03-20  8:20 ` Jon Pan-Doh
  2025-03-20  8:20 ` [PATCH v4 2/7] PCI/AER: Make all pci_print_aer() log levels depend on error type Jon Pan-Doh
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Jon Pan-Doh @ 2025-03-20  8:20 UTC (permalink / raw)
  To: Bjorn Helgaas, Karolina Stolarek
  Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
	Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
	Lukas Wunner, Jonathan Cameron, Terry Bowman, Jon Pan-Doh

From: Karolina Stolarek <karolina.stolarek@oracle.com>

When reporting an AER error, we check its type multiple times
to determine the log level for each message. Do this check only
in the top-level functions (aer_isr_one_error(), pci_print_aer()) and
propagate the result down the call chain.

Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
Reviewed-by: Jon Pan-Doh <pandoh@google.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/pci/pci.h      |  2 +-
 drivers/pci/pcie/aer.c | 34 +++++++++++++++++-----------------
 drivers/pci/pcie/dpc.c |  2 +-
 3 files changed, 19 insertions(+), 19 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index b8911d1e10dc..75985b96ecc1 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -551,7 +551,7 @@ struct aer_err_info {
 };
 
 int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);
-void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
+void aer_print_error(struct pci_dev *dev, struct aer_err_info *info, const char *level);
 
 int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
 		      unsigned int tlp_len, bool flit,
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9cff7069577e..45629e1ea058 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -670,20 +670,18 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
 }
 
 static void __aer_print_error(struct pci_dev *dev,
-			      struct aer_err_info *info)
+			      struct aer_err_info *info,
+			      const char *level)
 {
 	const char **strings;
 	unsigned long status = info->status & ~info->mask;
-	const char *level, *errmsg;
+	const char *errmsg;
 	int i;
 
-	if (info->severity == AER_CORRECTABLE) {
+	if (info->severity == AER_CORRECTABLE)
 		strings = aer_correctable_error_string;
-		level = KERN_WARNING;
-	} else {
+	else
 		strings = aer_uncorrectable_error_string;
-		level = KERN_ERR;
-	}
 
 	for_each_set_bit(i, &status, 32) {
 		errmsg = strings[i];
@@ -696,11 +694,11 @@ static void __aer_print_error(struct pci_dev *dev,
 	pci_dev_aer_stats_incr(dev, info);
 }
 
-void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
+void aer_print_error(struct pci_dev *dev, struct aer_err_info *info,
+		     const char *level)
 {
 	int layer, agent;
 	int id = pci_dev_id(dev);
-	const char *level;
 
 	if (!info->status) {
 		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
@@ -711,8 +709,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
 	agent = AER_GET_AGENT(info->severity, info->status);
 
-	level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
-
 	aer_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
 		   aer_error_severity_string[info->severity],
 		   aer_error_layer[layer], aer_agent_string[agent]);
@@ -720,7 +716,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 	aer_printk(level, dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
 		   dev->vendor, dev->device, info->status, info->mask);
 
-	__aer_print_error(dev, info);
+	__aer_print_error(dev, info, level);
 
 	if (info->tlp_header_valid)
 		pcie_print_tlp_log(dev, &info->tlp, dev_fmt("  "));
@@ -765,15 +761,18 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 {
 	int layer, agent, tlp_header_valid = 0;
 	u32 status, mask;
+	const char *level;
 	struct aer_err_info info;
 
 	if (aer_severity == AER_CORRECTABLE) {
 		status = aer->cor_status;
 		mask = aer->cor_mask;
+		level = KERN_WARNING;
 	} else {
 		status = aer->uncor_status;
 		mask = aer->uncor_mask;
 		tlp_header_valid = status & AER_LOG_TLP_MASKS;
+		level = KERN_ERR;
 	}
 
 	layer = AER_GET_LAYER_ERROR(aer_severity, status);
@@ -786,7 +785,7 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 	info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
 
 	pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
-	__aer_print_error(dev, &info);
+	__aer_print_error(dev, &info, level);
 	pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
 		aer_error_layer[layer], aer_agent_string[agent]);
 
@@ -1257,14 +1256,15 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
 	return 1;
 }
 
-static inline void aer_process_err_devices(struct aer_err_info *e_info)
+static inline void aer_process_err_devices(struct aer_err_info *e_info,
+					   const char *level)
 {
 	int i;
 
 	/* Report all before handle them, not to lost records by reset etc. */
 	for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
 		if (aer_get_device_error_info(e_info->dev[i], e_info))
-			aer_print_error(e_info->dev[i], e_info);
+			aer_print_error(e_info->dev[i], e_info, level);
 	}
 	for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
 		if (aer_get_device_error_info(e_info->dev[i], e_info))
@@ -1300,7 +1300,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
 		aer_print_port_info(pdev, &e_info);
 
 		if (find_source_device(pdev, &e_info))
-			aer_process_err_devices(&e_info);
+			aer_process_err_devices(&e_info, KERN_WARNING);
 	}
 
 	if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
@@ -1319,7 +1319,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
 		aer_print_port_info(pdev, &e_info);
 
 		if (find_source_device(pdev, &e_info))
-			aer_process_err_devices(&e_info);
+			aer_process_err_devices(&e_info, KERN_ERR);
 	}
 }
 
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index df42f15c9829..9e4c9ac737a7 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -289,7 +289,7 @@ void dpc_process_error(struct pci_dev *pdev)
 	else if (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_UNCOR &&
 		 dpc_get_aer_uncorrect_severity(pdev, &info) &&
 		 aer_get_device_error_info(pdev, &info)) {
-		aer_print_error(pdev, &info);
+		aer_print_error(pdev, &info, KERN_ERR);
 		pci_aer_clear_nonfatal_status(pdev);
 		pci_aer_clear_fatal_status(pdev);
 	}
-- 
2.49.0.rc1.451.g8f38331e32-goog


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 2/7] PCI/AER: Make all pci_print_aer() log levels depend on error type
  2025-03-20  8:20 [PATCH v4 0/7] Rate limit AER logs Jon Pan-Doh
  2025-03-20  8:20 ` [PATCH v4 1/7] PCI/AER: Check log level once and propagate down Jon Pan-Doh
@ 2025-03-20  8:20 ` Jon Pan-Doh
  2025-03-20  8:20 ` [PATCH v4 3/7] PCI/AER: Move AER stat collection out of __aer_print_error() Jon Pan-Doh
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Jon Pan-Doh @ 2025-03-20  8:20 UTC (permalink / raw)
  To: Bjorn Helgaas, Karolina Stolarek
  Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
	Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
	Lukas Wunner, Jonathan Cameron, Terry Bowman, Jon Pan-Doh

From: Karolina Stolarek <karolina.stolarek@oracle.com>

Some existing logs in pci_print_aer() log with error severity
by default. Convert them to depend on error type (consistent
with rest of AER logging).

Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/pci/pcie/aer.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 45629e1ea058..3116b4678081 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -784,14 +784,14 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 	info.mask = mask;
 	info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
 
-	pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
+	aer_printk(level, dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
 	__aer_print_error(dev, &info, level);
-	pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
-		aer_error_layer[layer], aer_agent_string[agent]);
+	aer_printk(level, dev, "aer_layer=%s, aer_agent=%s\n",
+		   aer_error_layer[layer], aer_agent_string[agent]);
 
 	if (aer_severity != AER_CORRECTABLE)
-		pci_err(dev, "aer_uncor_severity: 0x%08x\n",
-			aer->uncor_severity);
+		aer_printk(level, dev, "aer_uncor_severity: 0x%08x\n",
+			   aer->uncor_severity);
 
 	if (tlp_header_valid)
 		pcie_print_tlp_log(dev, &aer->header_log, dev_fmt("  "));
-- 
2.49.0.rc1.451.g8f38331e32-goog


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 3/7] PCI/AER: Move AER stat collection out of __aer_print_error()
  2025-03-20  8:20 [PATCH v4 0/7] Rate limit AER logs Jon Pan-Doh
  2025-03-20  8:20 ` [PATCH v4 1/7] PCI/AER: Check log level once and propagate down Jon Pan-Doh
  2025-03-20  8:20 ` [PATCH v4 2/7] PCI/AER: Make all pci_print_aer() log levels depend on error type Jon Pan-Doh
@ 2025-03-20  8:20 ` Jon Pan-Doh
  2025-03-20 14:59   ` Karolina Stolarek
  2025-03-20  8:20 ` [PATCH v4 4/7] PCI/AER: Rename struct aer_stats to aer_report Jon Pan-Doh
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Jon Pan-Doh @ 2025-03-20  8:20 UTC (permalink / raw)
  To: Bjorn Helgaas, Karolina Stolarek
  Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
	Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
	Lukas Wunner, Jonathan Cameron, Terry Bowman, Jon Pan-Doh

Decouple stat collection from internal AER print functions. Prerequisite
to add ratelimits to logs while leaving stats unaffected. Also simplifies
control flow as stats collection is no longer buried in nested functions.

AERs from ghes or cxl drivers are a minor exception. Stat collection occurs
in pci_print_aer(), an external interface, as that is where aer_err_info
is populated.

Signed-off-by: Jon Pan-Doh <pandoh@google.com>
---
 drivers/pci/pci.h      |  1 +
 drivers/pci/pcie/aer.c | 10 ++++++----
 drivers/pci/pcie/dpc.c |  1 +
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 75985b96ecc1..9d63d32f041c 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -551,6 +551,7 @@ struct aer_err_info {
 };
 
 int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);
+void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info);
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info, const char *level);
 
 int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 3116b4678081..e5db1fdd8421 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -617,8 +617,7 @@ const struct attribute_group aer_stats_attr_group = {
 	.is_visible = aer_stats_attrs_are_visible,
 };
 
-static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
-				   struct aer_err_info *info)
+void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info)
 {
 	unsigned long status = info->status & ~info->mask;
 	int i, max = -1;
@@ -691,7 +690,6 @@ static void __aer_print_error(struct pci_dev *dev,
 		aer_printk(level, dev, "   [%2d] %-22s%s\n", i, errmsg,
 				info->first_error == i ? " (First)" : "");
 	}
-	pci_dev_aer_stats_incr(dev, info);
 }
 
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info,
@@ -784,6 +782,8 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 	info.mask = mask;
 	info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
 
+	pci_dev_aer_stats_incr(dev, &info);
+
 	aer_printk(level, dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
 	__aer_print_error(dev, &info, level);
 	aer_printk(level, dev, "aer_layer=%s, aer_agent=%s\n",
@@ -1263,8 +1263,10 @@ static inline void aer_process_err_devices(struct aer_err_info *e_info,
 
 	/* Report all before handle them, not to lost records by reset etc. */
 	for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
-		if (aer_get_device_error_info(e_info->dev[i], e_info))
+		if (aer_get_device_error_info(e_info->dev[i], e_info)) {
+			pci_dev_aer_stats_incr(e_info->dev[i], e_info);
 			aer_print_error(e_info->dev[i], e_info, level);
+		}
 	}
 	for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
 		if (aer_get_device_error_info(e_info->dev[i], e_info))
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index 9e4c9ac737a7..81cd6e8ff3a4 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -289,6 +289,7 @@ void dpc_process_error(struct pci_dev *pdev)
 	else if (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_UNCOR &&
 		 dpc_get_aer_uncorrect_severity(pdev, &info) &&
 		 aer_get_device_error_info(pdev, &info)) {
+		pci_dev_aer_stats_incr(pdev, &info);
 		aer_print_error(pdev, &info, KERN_ERR);
 		pci_aer_clear_nonfatal_status(pdev);
 		pci_aer_clear_fatal_status(pdev);
-- 
2.49.0.rc1.451.g8f38331e32-goog


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 4/7] PCI/AER: Rename struct aer_stats to aer_report
  2025-03-20  8:20 [PATCH v4 0/7] Rate limit AER logs Jon Pan-Doh
                   ` (2 preceding siblings ...)
  2025-03-20  8:20 ` [PATCH v4 3/7] PCI/AER: Move AER stat collection out of __aer_print_error() Jon Pan-Doh
@ 2025-03-20  8:20 ` Jon Pan-Doh
  2025-03-20 17:42   ` Sathyanarayanan Kuppuswamy
  2025-03-20  8:20 ` [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs Jon Pan-Doh
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Jon Pan-Doh @ 2025-03-20  8:20 UTC (permalink / raw)
  To: Bjorn Helgaas, Karolina Stolarek
  Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
	Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
	Lukas Wunner, Jonathan Cameron, Terry Bowman, Jon Pan-Doh

Update name to reflect the broader definition of structs/variables that
are stored (e.g. ratelimits).

Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
Reviewed-by: Karolina Stolarek <karolina.stolarek@oracle.com>
---
 drivers/pci/pcie/aer.c | 50 +++++++++++++++++++++---------------------
 include/linux/pci.h    |  2 +-
 2 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e5db1fdd8421..3069376b3553 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -54,11 +54,11 @@ struct aer_rpc {
 	DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
 };
 
-/* AER stats for the device */
-struct aer_stats {
+/* AER report for the device */
+struct aer_report {
 
 	/*
-	 * Fields for all AER capable devices. They indicate the errors
+	 * Stats for all AER capable devices. They indicate the errors
 	 * "as seen by this device". Note that this may mean that if an
 	 * end point is causing problems, the AER counters may increment
 	 * at its link partner (e.g. root port) because the errors will be
@@ -80,7 +80,7 @@ struct aer_stats {
 	u64 dev_total_nonfatal_errs;
 
 	/*
-	 * Fields for Root ports & root complex event collectors only, these
+	 * Stats for Root ports & root complex event collectors only, these
 	 * indicate the total number of ERR_COR, ERR_FATAL, and ERR_NONFATAL
 	 * messages received by the root port / event collector, INCLUDING the
 	 * ones that are generated internally (by the rootport itself)
@@ -377,7 +377,7 @@ void pci_aer_init(struct pci_dev *dev)
 	if (!dev->aer_cap)
 		return;
 
-	dev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
+	dev->aer_report = kzalloc(sizeof(*dev->aer_report), GFP_KERNEL);
 
 	/*
 	 * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
@@ -398,8 +398,8 @@ void pci_aer_init(struct pci_dev *dev)
 
 void pci_aer_exit(struct pci_dev *dev)
 {
-	kfree(dev->aer_stats);
-	dev->aer_stats = NULL;
+	kfree(dev->aer_report);
+	dev->aer_report = NULL;
 }
 
 #define AER_AGENT_RECEIVER		0
@@ -537,10 +537,10 @@ static const char *aer_agent_string[] = {
 {									\
 	unsigned int i;							\
 	struct pci_dev *pdev = to_pci_dev(dev);				\
-	u64 *stats = pdev->aer_stats->stats_array;			\
+	u64 *stats = pdev->aer_report->stats_array;			\
 	size_t len = 0;							\
 									\
-	for (i = 0; i < ARRAY_SIZE(pdev->aer_stats->stats_array); i++) {\
+	for (i = 0; i < ARRAY_SIZE(pdev->aer_report->stats_array); i++) {\
 		if (strings_array[i])					\
 			len += sysfs_emit_at(buf, len, "%s %llu\n",	\
 					     strings_array[i],		\
@@ -551,7 +551,7 @@ static const char *aer_agent_string[] = {
 					     i, stats[i]);		\
 	}								\
 	len += sysfs_emit_at(buf, len, "TOTAL_%s %llu\n", total_string,	\
-			     pdev->aer_stats->total_field);		\
+			     pdev->aer_report->total_field);		\
 	return len;							\
 }									\
 static DEVICE_ATTR_RO(name)
@@ -572,7 +572,7 @@ aer_stats_dev_attr(aer_dev_nonfatal, dev_nonfatal_errs,
 		     char *buf)						\
 {									\
 	struct pci_dev *pdev = to_pci_dev(dev);				\
-	return sysfs_emit(buf, "%llu\n", pdev->aer_stats->field);	\
+	return sysfs_emit(buf, "%llu\n", pdev->aer_report->field);	\
 }									\
 static DEVICE_ATTR_RO(name)
 
@@ -599,7 +599,7 @@ static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
 	struct device *dev = kobj_to_dev(kobj);
 	struct pci_dev *pdev = to_pci_dev(dev);
 
-	if (!pdev->aer_stats)
+	if (!pdev->aer_report)
 		return 0;
 
 	if ((a == &dev_attr_aer_rootport_total_err_cor.attr ||
@@ -622,25 +622,25 @@ void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info)
 	unsigned long status = info->status & ~info->mask;
 	int i, max = -1;
 	u64 *counter = NULL;
-	struct aer_stats *aer_stats = pdev->aer_stats;
+	struct aer_report *aer_report = pdev->aer_report;
 
-	if (!aer_stats)
+	if (!aer_report)
 		return;
 
 	switch (info->severity) {
 	case AER_CORRECTABLE:
-		aer_stats->dev_total_cor_errs++;
-		counter = &aer_stats->dev_cor_errs[0];
+		aer_report->dev_total_cor_errs++;
+		counter = &aer_report->dev_cor_errs[0];
 		max = AER_MAX_TYPEOF_COR_ERRS;
 		break;
 	case AER_NONFATAL:
-		aer_stats->dev_total_nonfatal_errs++;
-		counter = &aer_stats->dev_nonfatal_errs[0];
+		aer_report->dev_total_nonfatal_errs++;
+		counter = &aer_report->dev_nonfatal_errs[0];
 		max = AER_MAX_TYPEOF_UNCOR_ERRS;
 		break;
 	case AER_FATAL:
-		aer_stats->dev_total_fatal_errs++;
-		counter = &aer_stats->dev_fatal_errs[0];
+		aer_report->dev_total_fatal_errs++;
+		counter = &aer_report->dev_fatal_errs[0];
 		max = AER_MAX_TYPEOF_UNCOR_ERRS;
 		break;
 	}
@@ -652,19 +652,19 @@ void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info)
 static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
 				 struct aer_err_source *e_src)
 {
-	struct aer_stats *aer_stats = pdev->aer_stats;
+	struct aer_report *aer_report = pdev->aer_report;
 
-	if (!aer_stats)
+	if (!aer_report)
 		return;
 
 	if (e_src->status & PCI_ERR_ROOT_COR_RCV)
-		aer_stats->rootport_total_cor_errs++;
+		aer_report->rootport_total_cor_errs++;
 
 	if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
 		if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
-			aer_stats->rootport_total_fatal_errs++;
+			aer_report->rootport_total_fatal_errs++;
 		else
-			aer_stats->rootport_total_nonfatal_errs++;
+			aer_report->rootport_total_nonfatal_errs++;
 	}
 }
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index e4bf67bf8172..900edb6f8f62 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -346,7 +346,7 @@ struct pci_dev {
 	u8		hdr_type;	/* PCI header type (`multi' flag masked out) */
 #ifdef CONFIG_PCIEAER
 	u16		aer_cap;	/* AER capability offset */
-	struct aer_stats *aer_stats;	/* AER stats for this device */
+	struct aer_report *aer_report;	/* AER report for this device */
 #endif
 #ifdef CONFIG_PCIEPORTBUS
 	struct rcec_ea	*rcec_ea;	/* RCEC cached endpoint association */
-- 
2.49.0.rc1.451.g8f38331e32-goog


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs
  2025-03-20  8:20 [PATCH v4 0/7] Rate limit AER logs Jon Pan-Doh
                   ` (3 preceding siblings ...)
  2025-03-20  8:20 ` [PATCH v4 4/7] PCI/AER: Rename struct aer_stats to aer_report Jon Pan-Doh
@ 2025-03-20  8:20 ` Jon Pan-Doh
  2025-03-20 14:56   ` Karolina Stolarek
  2025-03-21  1:00   ` Sathyanarayanan Kuppuswamy
  2025-03-20  8:20 ` [PATCH v4 6/7] PCI/AER: Add ratelimits to PCI AER Documentation Jon Pan-Doh
                   ` (3 subsequent siblings)
  8 siblings, 2 replies; 32+ messages in thread
From: Jon Pan-Doh @ 2025-03-20  8:20 UTC (permalink / raw)
  To: Bjorn Helgaas, Karolina Stolarek
  Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
	Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
	Lukas Wunner, Jonathan Cameron, Terry Bowman, Jon Pan-Doh

Spammy devices can flood kernel logs with AER errors and slow/stall
execution. Add per-device ratelimits for AER correctable and uncorrectable
errors that use the kernel defaults (10 per 5s).

Tested using aer-inject[1]. Sent 11 AER errors. Observed 10 errors logged
while AER stats (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) show
true count of 11.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git

Signed-off-by: Jon Pan-Doh <pandoh@google.com>
Reviewed-by: Karolina Stolarek <karolina.stolarek@oracle.com>
---
 drivers/pci/pcie/aer.c | 74 +++++++++++++++++++++++++++++++++---------
 1 file changed, 58 insertions(+), 16 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 3069376b3553..081cef5fc678 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -28,6 +28,7 @@
 #include <linux/interrupt.h>
 #include <linux/delay.h>
 #include <linux/kfifo.h>
+#include <linux/ratelimit.h>
 #include <linux/slab.h>
 #include <acpi/apei.h>
 #include <acpi/ghes.h>
@@ -88,6 +89,10 @@ struct aer_report {
 	u64 rootport_total_cor_errs;
 	u64 rootport_total_fatal_errs;
 	u64 rootport_total_nonfatal_errs;
+
+	/* Ratelimits for errors */
+	struct ratelimit_state cor_log_ratelimit;
+	struct ratelimit_state uncor_log_ratelimit;
 };
 
 #define AER_LOG_TLP_MASKS		(PCI_ERR_UNC_POISON_TLP|	\
@@ -379,6 +384,15 @@ void pci_aer_init(struct pci_dev *dev)
 
 	dev->aer_report = kzalloc(sizeof(*dev->aer_report), GFP_KERNEL);
 
+	/*
+	 * Ratelimits are doubled as a given error produces 2 logs (root port
+	 * and endpoint) that should be under same ratelimit.
+	 */
+	ratelimit_state_init(&dev->aer_report->cor_log_ratelimit,
+			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST * 2);
+	ratelimit_state_init(&dev->aer_report->uncor_log_ratelimit,
+			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST * 2);
+
 	/*
 	 * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
 	 * PCI_ERR_COR_MASK, and PCI_ERR_CAP.  Root and Root Complex Event
@@ -668,6 +682,17 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
 	}
 }
 
+static int aer_ratelimit(struct pci_dev *dev, unsigned int severity)
+{
+	struct ratelimit_state *ratelimit;
+
+	if (severity == AER_CORRECTABLE)
+		ratelimit = &dev->aer_report->cor_log_ratelimit;
+	else
+		ratelimit = &dev->aer_report->uncor_log_ratelimit;
+	return __ratelimit(ratelimit);
+}
+
 static void __aer_print_error(struct pci_dev *dev,
 			      struct aer_err_info *info,
 			      const char *level)
@@ -698,6 +723,12 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info,
 	int layer, agent;
 	int id = pci_dev_id(dev);
 
+	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
+			info->severity, info->tlp_header_valid, &info->tlp);
+
+	if (!aer_ratelimit(dev, info->severity))
+		return;
+
 	if (!info->status) {
 		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
 			aer_error_severity_string[info->severity]);
@@ -722,21 +753,28 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info,
 out:
 	if (info->id && info->error_dev_num > 1 && info->id == id)
 		pci_err(dev, "  Error of this Agent is reported first\n");
-
-	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
-			info->severity, info->tlp_header_valid, &info->tlp);
 }
 
 static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
 {
 	u8 bus = info->id >> 8;
 	u8 devfn = info->id & 0xff;
+	struct pci_dev *endpoint;
+	int i;
+
+	/* extract endpoint device ratelimit */
+	for (i = 0; i < info->error_dev_num; i++) {
+		endpoint = info->dev[i];
+		if (info->id == pci_dev_id(endpoint))
+			break;
+	}
 
-	pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d\n",
-		 info->multi_error_valid ? "Multiple " : "",
-		 aer_error_severity_string[info->severity],
-		 pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
-		 PCI_FUNC(devfn));
+	if (aer_ratelimit(endpoint, info->severity))
+		pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d\n",
+			 info->multi_error_valid ? "Multiple " : "",
+			 aer_error_severity_string[info->severity],
+			 pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
+			 PCI_FUNC(devfn));
 }
 
 #ifdef CONFIG_ACPI_APEI_PCIEAER
@@ -784,6 +822,12 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 
 	pci_dev_aer_stats_incr(dev, &info);
 
+	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
+			aer_severity, tlp_header_valid, &aer->header_log);
+
+	if (!aer_ratelimit(dev, aer_severity))
+		return;
+
 	aer_printk(level, dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
 	__aer_print_error(dev, &info, level);
 	aer_printk(level, dev, "aer_layer=%s, aer_agent=%s\n",
@@ -795,9 +839,6 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 
 	if (tlp_header_valid)
 		pcie_print_tlp_log(dev, &aer->header_log, dev_fmt("  "));
-
-	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
-			aer_severity, tlp_header_valid, &aer->header_log);
 }
 EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
 
@@ -1299,10 +1340,11 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
 			e_info.multi_error_valid = 1;
 		else
 			e_info.multi_error_valid = 0;
-		aer_print_port_info(pdev, &e_info);
 
-		if (find_source_device(pdev, &e_info))
+		if (find_source_device(pdev, &e_info)) {
+			aer_print_port_info(pdev, &e_info);
 			aer_process_err_devices(&e_info, KERN_WARNING);
+		}
 	}
 
 	if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
@@ -1318,10 +1360,10 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
 		else
 			e_info.multi_error_valid = 0;
 
-		aer_print_port_info(pdev, &e_info);
-
-		if (find_source_device(pdev, &e_info))
+		if (find_source_device(pdev, &e_info)) {
+			aer_print_port_info(pdev, &e_info);
 			aer_process_err_devices(&e_info, KERN_ERR);
+		}
 	}
 }
 
-- 
2.49.0.rc1.451.g8f38331e32-goog


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 6/7] PCI/AER: Add ratelimits to PCI AER Documentation
  2025-03-20  8:20 [PATCH v4 0/7] Rate limit AER logs Jon Pan-Doh
                   ` (4 preceding siblings ...)
  2025-03-20  8:20 ` [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs Jon Pan-Doh
@ 2025-03-20  8:20 ` Jon Pan-Doh
  2025-03-20 14:57   ` Karolina Stolarek
  2025-03-21  1:00   ` Sathyanarayanan Kuppuswamy
  2025-03-20  8:20 ` [PATCH v4 7/7] PCI/AER: Add sysfs attributes for log ratelimits Jon Pan-Doh
                   ` (2 subsequent siblings)
  8 siblings, 2 replies; 32+ messages in thread
From: Jon Pan-Doh @ 2025-03-20  8:20 UTC (permalink / raw)
  To: Bjorn Helgaas, Karolina Stolarek
  Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
	Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
	Lukas Wunner, Jonathan Cameron, Terry Bowman, Jon Pan-Doh

Add ratelimits section for rationale and defaults.

Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
---
 Documentation/PCI/pcieaer-howto.rst | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
index f013f3b27c82..896d2a232a90 100644
--- a/Documentation/PCI/pcieaer-howto.rst
+++ b/Documentation/PCI/pcieaer-howto.rst
@@ -85,6 +85,17 @@ In the example, 'Requester ID' means the ID of the device that sent
 the error message to the Root Port. Please refer to PCIe specs for other
 fields.
 
+AER Ratelimits
+--------------
+
+Since error messages can be generated for each transaction, we may see
+large volumes of errors reported. To prevent spammy devices from flooding
+the console/stalling execution, messages are throttled by device and error
+type (correctable vs. uncorrectable).
+
+AER uses the default ratelimit of DEFAULT_RATELIMIT_BURST (10 events) over
+DEFAULT_RATELIMIT_INTERVAL (5 seconds).
+
 AER Statistics / Counters
 -------------------------
 
-- 
2.49.0.rc1.451.g8f38331e32-goog


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 7/7] PCI/AER: Add sysfs attributes for log ratelimits
  2025-03-20  8:20 [PATCH v4 0/7] Rate limit AER logs Jon Pan-Doh
                   ` (5 preceding siblings ...)
  2025-03-20  8:20 ` [PATCH v4 6/7] PCI/AER: Add ratelimits to PCI AER Documentation Jon Pan-Doh
@ 2025-03-20  8:20 ` Jon Pan-Doh
  2025-03-20 14:58   ` Karolina Stolarek
  2025-03-21  1:02   ` Sathyanarayanan Kuppuswamy
  2025-03-20 14:34 ` [PATCH v4 0/7] Rate limit AER logs Christoph Hellwig
  2025-03-20 18:45 ` Paul E. McKenney
  8 siblings, 2 replies; 32+ messages in thread
From: Jon Pan-Doh @ 2025-03-20  8:20 UTC (permalink / raw)
  To: Bjorn Helgaas, Karolina Stolarek
  Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
	Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
	Lukas Wunner, Jonathan Cameron, Terry Bowman, Jon Pan-Doh

Allow userspace to read/write log ratelimits per device (including
enable/disable). Create aer/ sysfs directory to store them and any
future aer configs.

Update AER sysfs ABI filename to reflect the broader scope of AER sysfs
attributes (e.g. stats and ratelimits).

Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats ->
Documentation/ABI/testing/sysfs-bus-pci-devices-aer

Tested using aer-inject[1]. Configured correctable log ratelimit to 5.
Sent 6 AER errors. Observed 5 errors logged while AER stats
(cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) shows 6.

Disabled ratelimiting and sent 6 more AER errors. Observed all 6 errors
logged and accounted in AER stats (12 total errors).

[1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git

Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
---
 ...es-aer_stats => sysfs-bus-pci-devices-aer} | 34 +++++++
 Documentation/PCI/pcieaer-howto.rst           |  5 +-
 drivers/pci/pci-sysfs.c                       |  1 +
 drivers/pci/pci.h                             |  1 +
 drivers/pci/pcie/aer.c                        | 93 +++++++++++++++++++
 5 files changed, 133 insertions(+), 1 deletion(-)
 rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (77%)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
similarity index 77%
rename from Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
rename to Documentation/ABI/testing/sysfs-bus-pci-devices-aer
index d1f67bb81d5d..4561653fdbde 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
+++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
@@ -117,3 +117,37 @@ Date:		July 2018
 KernelVersion:	4.19.0
 Contact:	linux-pci@vger.kernel.org, rajatja@google.com
 Description:	Total number of ERR_NONFATAL messages reported to rootport.
+
+PCIe AER ratelimits
+-------------------
+
+These attributes show up under all the devices that are AER capable.
+They represent configurable ratelimits of logs per error type.
+
+See Documentation/PCI/pcieaer-howto.rst for more info on ratelimits.
+
+What:		/sys/bus/pci/devices/<dev>/aer/ratelimit_log_enable
+Date:		March 2025
+KernelVersion:	6.15.0
+Contact:	linux-pci@vger.kernel.org, pandoh@google.com
+Description:	Writing 1/0 enables/disables AER log ratelimiting. Reading
+		gets whether or not AER is currently enabled. Enabled by
+		default.
+
+What:		/sys/bus/pci/devices/<dev>/aer/ratelimit_in_5secs_cor_log
+Date:		March 2025
+KernelVersion:	6.15.0
+Contact:	linux-pci@vger.kernel.org, pandoh@google.com
+Description:	Ratelimit burst for correctable error logs. Writing a value
+		changes the number of errors (burst) allowed per interval
+		(5 second window) before ratelimiting. Reading gets the
+		current ratelimit burst.
+
+What:		/sys/bus/pci/devices/<dev>/aer/ratelimit_in_5secs_uncor_log
+Date:		March 2025
+KernelVersion:	6.15.0
+Contact:	linux-pci@vger.kernel.org, pandoh@google.com
+Description:	Ratelimit burst for uncorrectable error logs. Writing a
+		value changes the number of errors (burst) allowed per
+		interval (5 second window) before ratelimiting. Reading
+		gets the current ratelimit burst.
diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
index 896d2a232a90..043cdb3194be 100644
--- a/Documentation/PCI/pcieaer-howto.rst
+++ b/Documentation/PCI/pcieaer-howto.rst
@@ -96,12 +96,15 @@ type (correctable vs. uncorrectable).
 AER uses the default ratelimit of DEFAULT_RATELIMIT_BURST (10 events) over
 DEFAULT_RATELIMIT_INTERVAL (5 seconds).
 
+Ratelimits are exposed in the form of sysfs attributes and configurable.
+See Documentation/ABI/testing/sysfs-bus-pci-devices-aer.
+
 AER Statistics / Counters
 -------------------------
 
 When PCIe AER errors are captured, the counters / statistics are also exposed
 in the form of sysfs attributes which are documented at
-Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
+Documentation/ABI/testing/sysfs-bus-pci-devices-aer.
 
 Developer Guide
 ===============
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index b46ce1a2c554..16de3093294e 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1801,6 +1801,7 @@ const struct attribute_group *pci_dev_attr_groups[] = {
 	&pcie_dev_attr_group,
 #ifdef CONFIG_PCIEAER
 	&aer_stats_attr_group,
+	&aer_attr_group,
 #endif
 #ifdef CONFIG_PCIEASPM
 	&aspm_ctrl_attr_group,
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9d63d32f041c..34633fe12201 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -889,6 +889,7 @@ void pci_no_aer(void);
 void pci_aer_init(struct pci_dev *dev);
 void pci_aer_exit(struct pci_dev *dev);
 extern const struct attribute_group aer_stats_attr_group;
+extern const struct attribute_group aer_attr_group;
 void pci_aer_clear_fatal_status(struct pci_dev *dev);
 int pci_aer_clear_status(struct pci_dev *dev);
 int pci_aer_raw_clear_status(struct pci_dev *dev);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 081cef5fc678..f84ae1872fa3 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -631,6 +631,99 @@ const struct attribute_group aer_stats_attr_group = {
 	.is_visible = aer_stats_attrs_are_visible,
 };
 
+/*
+ * Ratelimit enable toggle uses interval value of
+ * 0: disabled
+ * DEFAULT_RATELIMIT_INTERVAL: enabled
+ */
+static ssize_t ratelimit_log_enable_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	bool enable = pdev->aer_report->cor_log_ratelimit.interval != 0;
+
+	return sysfs_emit(buf, "%d\n", enable);
+}
+
+static ssize_t ratelimit_log_enable_store(struct device *dev,
+					  struct device_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	bool enable;
+	int interval;
+
+	if (kstrtobool(buf, &enable) < 0)
+		return -EINVAL;
+
+	if (enable)
+		interval = DEFAULT_RATELIMIT_INTERVAL;
+	else
+		interval = 0;
+
+	pdev->aer_report->cor_log_ratelimit.interval = interval;
+	pdev->aer_report->uncor_log_ratelimit.interval = interval;
+	return count;
+}
+static DEVICE_ATTR_RW(ratelimit_log_enable);
+
+/*
+ * Ratelimits are doubled as a given error produces 2 logs (root port
+ * and endpoint) that should be under same ratelimit.
+ */
+#define aer_ratelimit_burst_attr(name, ratelimit)			\
+	static ssize_t							\
+	name##_show(struct device *dev, struct device_attribute *attr,	\
+		    char *buf)						\
+{									\
+	struct pci_dev *pdev = to_pci_dev(dev);				\
+	return sysfs_emit(buf, "%d\n",					\
+			  pdev->aer_report->ratelimit.burst / 2);	\
+}									\
+									\
+	static ssize_t							\
+	name##_store(struct device *dev, struct device_attribute *attr,	\
+		     const char *buf, size_t count)			\
+{									\
+	struct pci_dev *pdev = to_pci_dev(dev);				\
+	int burst;							\
+									\
+	if (kstrtoint(buf, 0, &burst) < 0)				\
+		return -EINVAL;						\
+									\
+	pdev->aer_report->ratelimit.burst = burst * 2;			\
+	return count;							\
+}									\
+static DEVICE_ATTR_RW(name)
+
+aer_ratelimit_burst_attr(ratelimit_in_5secs_cor_log, cor_log_ratelimit);
+aer_ratelimit_burst_attr(ratelimit_in_5secs_uncor_log, uncor_log_ratelimit);
+
+static struct attribute *aer_attrs[] = {
+	&dev_attr_ratelimit_log_enable.attr,
+	&dev_attr_ratelimit_in_5secs_cor_log.attr,
+	&dev_attr_ratelimit_in_5secs_uncor_log.attr,
+	NULL
+};
+
+static umode_t aer_attrs_are_visible(struct kobject *kobj,
+				     struct attribute *a, int n)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	if (!pdev->aer_report)
+		return 0;
+	return a->mode;
+}
+
+const struct attribute_group aer_attr_group = {
+	.name = "aer",
+	.attrs = aer_attrs,
+	.is_visible = aer_attrs_are_visible,
+};
+
 void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info)
 {
 	unsigned long status = info->status & ~info->mask;
-- 
2.49.0.rc1.451.g8f38331e32-goog


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 0/7] Rate limit AER logs
  2025-03-20  8:20 [PATCH v4 0/7] Rate limit AER logs Jon Pan-Doh
                   ` (6 preceding siblings ...)
  2025-03-20  8:20 ` [PATCH v4 7/7] PCI/AER: Add sysfs attributes for log ratelimits Jon Pan-Doh
@ 2025-03-20 14:34 ` Christoph Hellwig
  2025-03-20 18:45 ` Paul E. McKenney
  8 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2025-03-20 14:34 UTC (permalink / raw)
  To: Jon Pan-Doh
  Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Sathyanarayanan Kuppuswamy, Lukas Wunner,
	Jonathan Cameron, Terry Bowman

On Thu, Mar 20, 2025 at 01:20:50AM -0700, Jon Pan-Doh wrote:
> Several OCP members 

The fact that some users are part of a borderline illegal purchasing
cartel should not in any way matter for a linux patch submission.
Please work on your commit messages.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs
  2025-03-20  8:20 ` [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs Jon Pan-Doh
@ 2025-03-20 14:56   ` Karolina Stolarek
  2025-03-20 17:51     ` Bjorn Helgaas
  2025-03-20 19:37     ` Jon Pan-Doh
  2025-03-21  1:00   ` Sathyanarayanan Kuppuswamy
  1 sibling, 2 replies; 32+ messages in thread
From: Karolina Stolarek @ 2025-03-20 14:56 UTC (permalink / raw)
  To: Jon Pan-Doh
  Cc: linux-pci, Bjorn Helgaas, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Terry Bowman


On 20/03/2025 09:20, Jon Pan-Doh wrote:
> Spammy devices can flood kernel logs with AER errors and slow/stall 
> execution. Add per-device ratelimits for AER correctable and 
> uncorrectable errors that use the kernel defaults (10 per 5s).
> 
> Tested using aer-inject[1]. Sent 11 AER errors. Observed 10 errors 
> logged while AER stats (cat /sys/bus/pci/devices/<dev>/ 
> aer_dev_correctable) show true count of 11.
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer- 
> inject.git
 >
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> Reviewed-by: Karolina Stolarek <karolina.stolarek@oracle.com>

For future reference -- please drop r-bs from patches that have 
functional/bigger changes. New code nullifies previous reviews.

> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 3069376b3553..081cef5fc678 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -88,6 +89,10 @@ struct aer_report {
>   	u64 rootport_total_cor_errs;
>   	u64 rootport_total_fatal_errs;
>   	u64 rootport_total_nonfatal_errs;
> +
> +	/* Ratelimits for errors */
> +	struct ratelimit_state cor_log_ratelimit;
> +	struct ratelimit_state uncor_log_ratelimit;
>   };
>   
>   #define AER_LOG_TLP_MASKS		(PCI_ERR_UNC_POISON_TLP|	\
> @@ -379,6 +384,15 @@ void pci_aer_init(struct pci_dev *dev)
>   
>   	dev->aer_report = kzalloc(sizeof(*dev->aer_report), GFP_KERNEL);
>   
> +	/*
> +	 * Ratelimits are doubled as a given error produces 2 logs (root port
> +	 * and endpoint) that should be under same ratelimit.
> +	 */

It took me a bit to understand what this comment is about.

When we handle an error message, we first use the source's ratelimit to 
decide if we want to print the port info, and then the actual error. In 
theory, there could be more errors of the same class coming from other 
devices within one message. For these devices, we would call the 
ratelimit just once. I don't have a nice an clean solution for this 
problem, I just wanted to highlight that 1) we don't use the Root Port's 
ratelimit in aer_print_port_info(), 2) we may use the bursts to either 
print port_info + error message or just the message, in different 
combinations. I think we should reword this comment to highlight the 
fact that we don't check the ratelimit once per error, we could do it twice.

Also, I wonder -- do only Endpoints generate error messages? From what I 
understand, that some errors can be detected by intermediary devices.

> +	ratelimit_state_init(&dev->aer_report->cor_log_ratelimit,
> +			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST * 2);
> +	ratelimit_state_init(&dev->aer_report->uncor_log_ratelimit,
> +			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST * 2);
> +
>   	/*
>   	 * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
>   	 * PCI_ERR_COR_MASK, and PCI_ERR_CAP.  Root and Root Complex Event
> @@ -668,6 +682,17 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
>   	}
>   }
>   
> +static int aer_ratelimit(struct pci_dev *dev, unsigned int severity)

I really like this solution, it's nice and tidy


>   static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
>   {

I'm also thinking -- we are ratelimiting the aer_print_port_info() and 
aer_print_error(). What about the messages in dpc_process_error()? 
Should we check early if DPC was triggered because of an uncorrectable 
error, and if so, ratelimit that?

All the best,
Karolina

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 6/7] PCI/AER: Add ratelimits to PCI AER Documentation
  2025-03-20  8:20 ` [PATCH v4 6/7] PCI/AER: Add ratelimits to PCI AER Documentation Jon Pan-Doh
@ 2025-03-20 14:57   ` Karolina Stolarek
  2025-03-21  1:00   ` Sathyanarayanan Kuppuswamy
  1 sibling, 0 replies; 32+ messages in thread
From: Karolina Stolarek @ 2025-03-20 14:57 UTC (permalink / raw)
  To: Jon Pan-Doh
  Cc: linux-pci, Bjorn Helgaas, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Terry Bowman

On 20/03/2025 09:20, Jon Pan-Doh wrote:
> Add ratelimits section for rationale and defaults.
> 
> Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> ---
>   Documentation/PCI/pcieaer-howto.rst | 11 +++++++++++
>   1 file changed, 11 insertions(+)
> 
> diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
> index f013f3b27c82..896d2a232a90 100644
> --- a/Documentation/PCI/pcieaer-howto.rst
> +++ b/Documentation/PCI/pcieaer-howto.rst
> @@ -85,6 +85,17 @@ In the example, 'Requester ID' means the ID of the device that sent
>   the error message to the Root Port. Please refer to PCIe specs for other
>   fields.
>   
> +AER Ratelimits
> +--------------
> +
> +Since error messages can be generated for each transaction, we may see
> +large volumes of errors reported. To prevent spammy devices from flooding
> +the console/stalling execution, messages are throttled by device and error
> +type (correctable vs. uncorrectable).
> +
> +AER uses the default ratelimit of DEFAULT_RATELIMIT_BURST (10 events) over
> +DEFAULT_RATELIMIT_INTERVAL (5 seconds).

This is not quite true, as we double the number of available bursts so 
we can print both the port info and an error message. We could say that 
this limit (2 * DEFAULT_RATELIMIT_BURST) roughly translates to ten error 
notifications within the 5 second window.

All the best,
Karolina

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 7/7] PCI/AER: Add sysfs attributes for log ratelimits
  2025-03-20  8:20 ` [PATCH v4 7/7] PCI/AER: Add sysfs attributes for log ratelimits Jon Pan-Doh
@ 2025-03-20 14:58   ` Karolina Stolarek
  2025-03-20 19:36     ` Jon Pan-Doh
  2025-03-21  1:02   ` Sathyanarayanan Kuppuswamy
  1 sibling, 1 reply; 32+ messages in thread
From: Karolina Stolarek @ 2025-03-20 14:58 UTC (permalink / raw)
  To: Jon Pan-Doh
  Cc: linux-pci, Bjorn Helgaas, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Terry Bowman

On 20/03/2025 09:20, Jon Pan-Doh wrote:
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> similarity index 77%
> rename from Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> rename to Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> index d1f67bb81d5d..4561653fdbde 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> @@ -117,3 +117,37 @@ Date:		July 2018
>   KernelVersion:	4.19.0
>   Contact:	linux-pci@vger.kernel.org, rajatja@google.com
>   Description:	Total number of ERR_NONFATAL messages reported to rootport.
> +
> +PCIe AER ratelimits
> +-------------------
> +
> +These attributes show up under all the devices that are AER capable.
> +They represent configurable ratelimits of logs per error type.
> +
> +See Documentation/PCI/pcieaer-howto.rst for more info on ratelimits.
> +
> +What:		/sys/bus/pci/devices/<dev>/aer/ratelimit_log_enable

Having a dedicated toggle for this makes sense. It would be hard to come 
up with a magical number that disables ratelimiting.

> +Date:		March 2025
> +KernelVersion:	6.15.0
> +Contact:	linux-pci@vger.kernel.org, pandoh@google.com
> +Description:	Writing 1/0 enables/disables AER log ratelimiting. Reading
> +		gets whether or not AER is currently enabled. Enabled by
> +		default.
> +
> +What:		/sys/bus/pci/devices/<dev>/aer/ratelimit_in_5secs_cor_log

I think this attribute name (and the uncor counterpart) is too wordy. A 
user can check what this knob controls by looking up this file or AER 
docs, so I'd name it to "ratelimit_burst_cor_log" or something along 
these lines.

> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 081cef5fc678..f84ae1872fa3 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -631,6 +631,99 @@ const struct attribute_group aer_stats_attr_group = {
>   	.is_visible = aer_stats_attrs_are_visible,
>   };
>   
> +/*
> + * Ratelimit enable toggle uses interval value of
> + * 0: disabled
> + * DEFAULT_RATELIMIT_INTERVAL: enabled

We set that internally, but to the user we are just operating on 0s and 
1s. I would connect this comment with what we have in the documentation.

> + */
> +static ssize_t ratelimit_log_enable_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	bool enable = pdev->aer_report->cor_log_ratelimit.interval != 0;
> +
> +	return sysfs_emit(buf, "%d\n", enable);
> +}
> +
> +static ssize_t ratelimit_log_enable_store(struct device *dev,
> +					  struct device_attribute *attr,
> +					  const char *buf, size_t count)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	bool enable;
> +	int interval;
> +
> +	if (kstrtobool(buf, &enable) < 0)
> +		return -EINVAL;
> +
> +	if (enable)
> +		interval = DEFAULT_RATELIMIT_INTERVAL;
> +	else
> +		interval = 0;
> +
> +	pdev->aer_report->cor_log_ratelimit.interval = interval;
> +	pdev->aer_report->uncor_log_ratelimit.interval = interval;
> +	return count;

Nit, suggestion: add a blank line before return (this applies to all 
returns in the patch)

All the best,
Karolina

> +}
> +static DEVICE_ATTR_RW(ratelimit_log_enable);
> +
> +/*
> + * Ratelimits are doubled as a given error produces 2 logs (root port
> + * and endpoint) that should be under same ratelimit.
> + */
> +#define aer_ratelimit_burst_attr(name, ratelimit)			\
> +	static ssize_t							\
> +	name##_show(struct device *dev, struct device_attribute *attr,	\
> +		    char *buf)						\
> +{									\
> +	struct pci_dev *pdev = to_pci_dev(dev);				\
> +	return sysfs_emit(buf, "%d\n",					\
> +			  pdev->aer_report->ratelimit.burst / 2);	\
> +}									\
> +									\
> +	static ssize_t							\
> +	name##_store(struct device *dev, struct device_attribute *attr,	\
> +		     const char *buf, size_t count)			\
> +{									\
> +	struct pci_dev *pdev = to_pci_dev(dev);				\
> +	int burst;							\
> +									\
> +	if (kstrtoint(buf, 0, &burst) < 0)				\
> +		return -EINVAL;						\
> +									\
> +	pdev->aer_report->ratelimit.burst = burst * 2;			\
> +	return count;							\
> +}									\
> +static DEVICE_ATTR_RW(name)
> +
> +aer_ratelimit_burst_attr(ratelimit_in_5secs_cor_log, cor_log_ratelimit);
> +aer_ratelimit_burst_attr(ratelimit_in_5secs_uncor_log, uncor_log_ratelimit);
> +
> +static struct attribute *aer_attrs[] = {
> +	&dev_attr_ratelimit_log_enable.attr,
> +	&dev_attr_ratelimit_in_5secs_cor_log.attr,
> +	&dev_attr_ratelimit_in_5secs_uncor_log.attr,
> +	NULL
> +};
> +
> +static umode_t aer_attrs_are_visible(struct kobject *kobj,
> +				     struct attribute *a, int n)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	if (!pdev->aer_report)
> +		return 0;
> +	return a->mode;
> +}
> +
> +const struct attribute_group aer_attr_group = {
> +	.name = "aer",
> +	.attrs = aer_attrs,
> +	.is_visible = aer_attrs_are_visible,
> +};
> +
>   void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info)
>   {
>   	unsigned long status = info->status & ~info->mask;


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 3/7] PCI/AER: Move AER stat collection out of __aer_print_error()
  2025-03-20  8:20 ` [PATCH v4 3/7] PCI/AER: Move AER stat collection out of __aer_print_error() Jon Pan-Doh
@ 2025-03-20 14:59   ` Karolina Stolarek
  2025-03-20 19:07     ` Jon Pan-Doh
  0 siblings, 1 reply; 32+ messages in thread
From: Karolina Stolarek @ 2025-03-20 14:59 UTC (permalink / raw)
  To: Jon Pan-Doh
  Cc: linux-pci, Bjorn Helgaas, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Terry Bowman

On 20/03/2025 09:20, Jon Pan-Doh wrote:
> Decouple stat collection from internal AER print functions. Prerequisite
> to add ratelimits to logs while leaving stats unaffected. Also simplifies
> control flow as stats collection is no longer buried in nested functions.
> 
> AERs from ghes or cxl drivers are a minor exception. Stat collection occurs
> in pci_print_aer(), an external interface, as that is where aer_err_info
> is populated.

Hopefully, I'll get my patch in sooner than later, and the second 
paragraph (and the related changes) will go away.

As for the first one, I'd restructure it to chain all the sentences 
together. I'm thinking about something like this:

"
Decouple stat collection from internal AER print functions, so the 
ratelimit does not impact the error counters. The stats collection is no 
longer buried in nested functions, simplifying the function flow.
"

All the best,
Karolina


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 4/7] PCI/AER: Rename struct aer_stats to aer_report
  2025-03-20  8:20 ` [PATCH v4 4/7] PCI/AER: Rename struct aer_stats to aer_report Jon Pan-Doh
@ 2025-03-20 17:42   ` Sathyanarayanan Kuppuswamy
  2025-03-20 19:53     ` Jon Pan-Doh
  0 siblings, 1 reply; 32+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-03-20 17:42 UTC (permalink / raw)
  To: Jon Pan-Doh, Bjorn Helgaas, Karolina Stolarek
  Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
	Tony Luck, Ilpo Järvinen, Lukas Wunner, Jonathan Cameron,
	Terry Bowman


On 3/20/25 1:20 AM, Jon Pan-Doh wrote:
> Update name to reflect the broader definition of structs/variables that
> are stored (e.g. ratelimits).

I think you meant rate limit will be added by an upcoming patch. Please
mention that it is a preparatory patch for adding rate limit support.

or move this patch after ratelimit support patch. That way this rename
will make more sense.

>
> Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> Reviewed-by: Karolina Stolarek <karolina.stolarek@oracle.com>
> ---
>   drivers/pci/pcie/aer.c | 50 +++++++++++++++++++++---------------------
>   include/linux/pci.h    |  2 +-
>   2 files changed, 26 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index e5db1fdd8421..3069376b3553 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -54,11 +54,11 @@ struct aer_rpc {
>   	DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
>   };
>   
> -/* AER stats for the device */
> -struct aer_stats {
> +/* AER report for the device */
> +struct aer_report {

As Bjorn suggested, I also think be aer_info would be a better name for it.

>   
>   	/*
> -	 * Fields for all AER capable devices. They indicate the errors
> +	 * Stats for all AER capable devices. They indicate the errors

If you move this patch after ratelimit support patch, try adding that
detail in the help content as well.

May be "Tracks error statistics and AER debug related controls for all
AER capable devices"

>   	 * "as seen by this device". Note that this may mean that if an
>   	 * end point is causing problems, the AER counters may increment
>   	 * at its link partner (e.g. root port) because the errors will be
> @@ -80,7 +80,7 @@ struct aer_stats {
>   	u64 dev_total_nonfatal_errs;
>   
>   	/*
> -	 * Fields for Root ports & root complex event collectors only, these
> +	 * Stats for Root ports & root complex event collectors only, these
>   	 * indicate the total number of ERR_COR, ERR_FATAL, and ERR_NONFATAL
>   	 * messages received by the root port / event collector, INCLUDING the
>   	 * ones that are generated internally (by the rootport itself)
> @@ -377,7 +377,7 @@ void pci_aer_init(struct pci_dev *dev)
>   	if (!dev->aer_cap)
>   		return;
>   
> -	dev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
> +	dev->aer_report = kzalloc(sizeof(*dev->aer_report), GFP_KERNEL);
>   
>   	/*
>   	 * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
> @@ -398,8 +398,8 @@ void pci_aer_init(struct pci_dev *dev)
>   
>   void pci_aer_exit(struct pci_dev *dev)
>   {
> -	kfree(dev->aer_stats);
> -	dev->aer_stats = NULL;
> +	kfree(dev->aer_report);
> +	dev->aer_report = NULL;
>   }
>   
>   #define AER_AGENT_RECEIVER		0
> @@ -537,10 +537,10 @@ static const char *aer_agent_string[] = {
>   {									\
>   	unsigned int i;							\
>   	struct pci_dev *pdev = to_pci_dev(dev);				\
> -	u64 *stats = pdev->aer_stats->stats_array;			\
> +	u64 *stats = pdev->aer_report->stats_array;			\
>   	size_t len = 0;							\
>   									\
> -	for (i = 0; i < ARRAY_SIZE(pdev->aer_stats->stats_array); i++) {\
> +	for (i = 0; i < ARRAY_SIZE(pdev->aer_report->stats_array); i++) {\
>   		if (strings_array[i])					\
>   			len += sysfs_emit_at(buf, len, "%s %llu\n",	\
>   					     strings_array[i],		\
> @@ -551,7 +551,7 @@ static const char *aer_agent_string[] = {
>   					     i, stats[i]);		\
>   	}								\
>   	len += sysfs_emit_at(buf, len, "TOTAL_%s %llu\n", total_string,	\
> -			     pdev->aer_stats->total_field);		\
> +			     pdev->aer_report->total_field);		\
>   	return len;							\
>   }									\
>   static DEVICE_ATTR_RO(name)
> @@ -572,7 +572,7 @@ aer_stats_dev_attr(aer_dev_nonfatal, dev_nonfatal_errs,
>   		     char *buf)						\
>   {									\
>   	struct pci_dev *pdev = to_pci_dev(dev);				\
> -	return sysfs_emit(buf, "%llu\n", pdev->aer_stats->field);	\
> +	return sysfs_emit(buf, "%llu\n", pdev->aer_report->field);	\
>   }									\
>   static DEVICE_ATTR_RO(name)
>   
> @@ -599,7 +599,7 @@ static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
>   	struct device *dev = kobj_to_dev(kobj);
>   	struct pci_dev *pdev = to_pci_dev(dev);
>   
> -	if (!pdev->aer_stats)
> +	if (!pdev->aer_report)
>   		return 0;
>   
>   	if ((a == &dev_attr_aer_rootport_total_err_cor.attr ||
> @@ -622,25 +622,25 @@ void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info)
>   	unsigned long status = info->status & ~info->mask;
>   	int i, max = -1;
>   	u64 *counter = NULL;
> -	struct aer_stats *aer_stats = pdev->aer_stats;
> +	struct aer_report *aer_report = pdev->aer_report;
>   
> -	if (!aer_stats)
> +	if (!aer_report)
>   		return;
>   
>   	switch (info->severity) {
>   	case AER_CORRECTABLE:
> -		aer_stats->dev_total_cor_errs++;
> -		counter = &aer_stats->dev_cor_errs[0];
> +		aer_report->dev_total_cor_errs++;
> +		counter = &aer_report->dev_cor_errs[0];
>   		max = AER_MAX_TYPEOF_COR_ERRS;
>   		break;
>   	case AER_NONFATAL:
> -		aer_stats->dev_total_nonfatal_errs++;
> -		counter = &aer_stats->dev_nonfatal_errs[0];
> +		aer_report->dev_total_nonfatal_errs++;
> +		counter = &aer_report->dev_nonfatal_errs[0];
>   		max = AER_MAX_TYPEOF_UNCOR_ERRS;
>   		break;
>   	case AER_FATAL:
> -		aer_stats->dev_total_fatal_errs++;
> -		counter = &aer_stats->dev_fatal_errs[0];
> +		aer_report->dev_total_fatal_errs++;
> +		counter = &aer_report->dev_fatal_errs[0];
>   		max = AER_MAX_TYPEOF_UNCOR_ERRS;
>   		break;
>   	}
> @@ -652,19 +652,19 @@ void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info)
>   static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
>   				 struct aer_err_source *e_src)
>   {
> -	struct aer_stats *aer_stats = pdev->aer_stats;
> +	struct aer_report *aer_report = pdev->aer_report;
>   
> -	if (!aer_stats)
> +	if (!aer_report)
>   		return;
>   
>   	if (e_src->status & PCI_ERR_ROOT_COR_RCV)
> -		aer_stats->rootport_total_cor_errs++;
> +		aer_report->rootport_total_cor_errs++;
>   
>   	if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
>   		if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
> -			aer_stats->rootport_total_fatal_errs++;
> +			aer_report->rootport_total_fatal_errs++;
>   		else
> -			aer_stats->rootport_total_nonfatal_errs++;
> +			aer_report->rootport_total_nonfatal_errs++;
>   	}
>   }
>   
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index e4bf67bf8172..900edb6f8f62 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -346,7 +346,7 @@ struct pci_dev {
>   	u8		hdr_type;	/* PCI header type (`multi' flag masked out) */
>   #ifdef CONFIG_PCIEAER
>   	u16		aer_cap;	/* AER capability offset */
> -	struct aer_stats *aer_stats;	/* AER stats for this device */
> +	struct aer_report *aer_report;	/* AER report for this device */
>   #endif
>   #ifdef CONFIG_PCIEPORTBUS
>   	struct rcec_ea	*rcec_ea;	/* RCEC cached endpoint association */

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs
  2025-03-20 14:56   ` Karolina Stolarek
@ 2025-03-20 17:51     ` Bjorn Helgaas
  2025-03-20 19:53       ` Jon Pan-Doh
  2025-03-20 19:37     ` Jon Pan-Doh
  1 sibling, 1 reply; 32+ messages in thread
From: Bjorn Helgaas @ 2025-03-20 17:51 UTC (permalink / raw)
  To: Karolina Stolarek
  Cc: Jon Pan-Doh, linux-pci, Bjorn Helgaas, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Sathyanarayanan Kuppuswamy, Lukas Wunner,
	Jonathan Cameron, Terry Bowman

On Thu, Mar 20, 2025 at 03:56:53PM +0100, Karolina Stolarek wrote:
> On 20/03/2025 09:20, Jon Pan-Doh wrote:
> > Spammy devices can flood kernel logs with AER errors and slow/stall
> > execution. Add per-device ratelimits for AER correctable and
> > uncorrectable errors that use the kernel defaults (10 per 5s).

> > +	/*
> > +	 * Ratelimits are doubled as a given error produces 2 logs (root port
> > +	 * and endpoint) that should be under same ratelimit.
> > +	 */
> 
> It took me a bit to understand what this comment is about.
> 
> When we handle an error message, we first use the source's ratelimit to
> decide if we want to print the port info, and then the actual error. In
> theory, there could be more errors of the same class coming from other
> devices within one message.

I think this refers to the fact that when we get an AER interrupt from
a Root Port, the RP has a single Requester ID logged in the Source
Identification, but if Multiple ERR_* is set, find_device_iter() may
collect error info from several devices?

> For these devices, we would call the ratelimit
> just once. I don't have a nice an clean solution for this problem, I just
> wanted to highlight that 1) we don't use the Root Port's ratelimit in
> aer_print_port_info(), 2) we may use the bursts to either print port_info +
> error message or just the message, in different combinations. I think we
> should reword this comment to highlight the fact that we don't check the
> ratelimit once per error, we could do it twice.

Very good point.  aer_print_rp_info() is always ratelimited based on
the ERR_* Source Identification, but if Multiple ERR_* is set,
aer_print_error() ratelimits based on whatever downstream device we
found that had any error of the same class logged.

E.g., if we had something like this topology:

  00:1c.0 Root Port to [bus 01-04]
  01:00.0 Switch Upstream Port to [bus 02-04]
  02:00.0 Switch Downstream Port to [bus 03]
  02:00.1 Switch Downstream Port to [bus 04]
  03:00.0 NIC
  04:00.0 NVMe

where 03:00.0 and 04:00.0 both logged ERR_FATAL, and the Root Port
received the 03:00.0 message first, 03:00.0 would be logged as the
Source Identification, and Multiple ERR_FATAL Received should be set.
The messages related to 00:1c.0 and 03:00.0 would be ratelimited based
on 03:00.0, but aer_print_error() messages related to 04:00.0 would be
ratelimited based on 04:00.0.

That does seem like a problem.  I would propose that we always
ratelimit using the device from Source Identification. I think that's
available in aer_print_error(); we would just use info->id instead of
"dev".

That does mean we'd have to figure out the pci_dev corresponding to
the Requester ID in info->id.  Maybe we could add an
aer_err_info.src_dev pointer, and fill it in when find_device_iter()
finds the right device?

> Also, I wonder -- do only Endpoints generate error messages? From what I
> understand, that some errors can be detected by intermediary devices.

Yes, I think any device, including switches between a Root Port and
Endpoint, can detect errors and generate error messages.

I guess this means the "endpoint" variable in aer_print_port_info() is
probably too specific.  IIUC the aer_print_port_info() "dev" parameter
is always a Root Port since it came from aer_rpc.rpd.  Naming it "rp"
would make this more obvious and free up "dev" for the source device.
And "aer_print_port_info" itself could be more descriptive, e.g.,
"aer_print_rp_info()", since *every* PCIe device has a Port.

> I'm also thinking -- we are ratelimiting the aer_print_port_info() and
> aer_print_error(). What about the messages in dpc_process_error()? Should we
> check early if DPC was triggered because of an uncorrectable error, and if
> so, ratelimit that?

Good question.  It does seem like the dpc_process_error() messages
should be similarly ratelimited.  I think we currently only enable DPC
for fatal errors, and the Downstream Port takes the link down, which
resets the hierarchy below.  So (1) we probably won't see storms of
fatal error messages, and (2) it looks like we might not print any
error info from downstream devices, since they're not reachable while
the link is down.

It does seem like we *should* try to print that info after the link
comes back up, since the log registers are sticky and should survive
the reset.  Maybe we do that already and I just missed it.

This seems like something we could put off a little bit while we deal
with the AER correctable error issue.

Bjorn

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 0/7] Rate limit AER logs
  2025-03-20  8:20 [PATCH v4 0/7] Rate limit AER logs Jon Pan-Doh
                   ` (7 preceding siblings ...)
  2025-03-20 14:34 ` [PATCH v4 0/7] Rate limit AER logs Christoph Hellwig
@ 2025-03-20 18:45 ` Paul E. McKenney
  8 siblings, 0 replies; 32+ messages in thread
From: Paul E. McKenney @ 2025-03-20 18:45 UTC (permalink / raw)
  To: Jon Pan-Doh
  Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Sathyanarayanan Kuppuswamy, Lukas Wunner,
	Jonathan Cameron, Terry Bowman

On Thu, Mar 20, 2025 at 01:20:50AM -0700, Jon Pan-Doh wrote:
> Proposal
> ========
> 
> When using native AER, spammy devices can flood kernel logs with AER errors
> and slow/stall execution. Add per-device per-error-severity ratelimits
> for more robust error logging. Allow userspace to configure ratelimits
> via sysfs knobs.
> 
> Motivation
> ==========
> 
> Several OCP members have issues with inconsistent PCIe error handling,
> exacerbated at datacenter scale (myriad of devices).
> OCP HW/Fault Management subproject set out to solve this by
> standardizing industry:
> 
> - PCIe error handling best practices
> - Fault Management/RAS (incl. PCIe errors)
> 
> Exposing PCIe errors/debug info in-band for a userspace daemon (e.g.
> rasdaemon) to collect/pass on to repairability services is part of the
> roadmap.

For the series:

Reported-by: Sargun Dhillon <sargun@meta.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>

And thank you!  I did throw together a quick hack to get things going
internally on our fleet, but this series is way better.

							Thanx, Paul

> Background
> ==========
> 
> AER error spam has been observed many times, both publicly (e.g. [1], [2],
> [3]) and privately. While it usually occurs with correctable errors, it can
> happen with uncorrectable errors (e.g. during new HW bringup). 
> 
> There have been previous attempts to add ratelimits to AER logs ([4],
> [5]). The most recent attempt[5] has many similarities with the proposed
> approach.
> 
> Patch organization
> ==================
> 1-4 AER logging cleanup
> 5-7 Ratelimits and sysfs knobs
> 
> Outstanding work
> ================
> Cleanup:
> - Consolidate aer_print_error() and pci_print_error() path
> 
> Roadmap:
> - IRQ ratelimiting
> 
> v4:
> - Fix bug where trace not emitted with malformed aer_err_info
> - Extend ratelimit to malformed aer_err_info
> - Update commit messages with patch motivation
> - Squash AER sysfs filename change (Patch 8)
> v3:
> - Ratelimit aer_print_port_info() (drop Patch 1)
> - Add ratelimit enable toggle
> - Move trace outside of ratelimit
> - Split log level (Patch 2) into two
> - More descriptive documentation/sysfs naming
> v2:
> - Rebased on top of pci/aer (6.14.rc-1)
> - Split series into log and IRQ ratelimits (defer patch 5)
> - Dropped patch 8 (Move AER sysfs)
> - Added log level cleanup patch[6] from Karolina's series
> - Fixed bug where dpc errors didn't increment counters
> - "X callbacks suppressed" message on ratelimit release -> immediately
> - Separate documentation into own patch
> 
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=215027
> [2] https://bugzilla.kernel.org/show_bug.cgi?id=201517
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=196183
> [4] https://lore.kernel.org/linux-pci/20230606035442.2886343-2-grundler@chromium.org/
> [5] https://lore.kernel.org/linux-pci/cover.1736341506.git.karolina.stolarek@oracle.com/
> [6] https://lore.kernel.org/linux-pci/edd77011aafad4c0654358a26b4e538d0c5a321d.1736341506.git.karolina.stolarek@oracle.com/
> 
> Jon Pan-Doh (5):
>   PCI/AER: Move AER stat collection out of __aer_print_error()
>   PCI/AER: Rename struct aer_stats to aer_report
>   PCI/AER: Introduce ratelimit for error logs
>   PCI/AER: Add ratelimits to PCI AER Documentation
>   PCI/AER: Add sysfs attributes for log ratelimits
> 
> Karolina Stolarek (2):
>   PCI/AER: Check log level once and propagate down
>   PCI/AER: Make all pci_print_aer() log levels depend on error type
> 
>  ...es-aer_stats => sysfs-bus-pci-devices-aer} |  34 +++
>  Documentation/PCI/pcieaer-howto.rst           |  16 +-
>  drivers/pci/pci-sysfs.c                       |   1 +
>  drivers/pci/pci.h                             |   4 +-
>  drivers/pci/pcie/aer.c                        | 271 +++++++++++++-----
>  drivers/pci/pcie/dpc.c                        |   3 +-
>  include/linux/pci.h                           |   2 +-
>  7 files changed, 260 insertions(+), 71 deletions(-)
>  rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (77%)
> 
> -- 
> 2.49.0.rc1.451.g8f38331e32-goog
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 3/7] PCI/AER: Move AER stat collection out of __aer_print_error()
  2025-03-20 14:59   ` Karolina Stolarek
@ 2025-03-20 19:07     ` Jon Pan-Doh
  0 siblings, 0 replies; 32+ messages in thread
From: Jon Pan-Doh @ 2025-03-20 19:07 UTC (permalink / raw)
  To: Karolina Stolarek
  Cc: linux-pci, Bjorn Helgaas, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Terry Bowman

On Thu, Mar 20, 2025 at 7:59 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> As for the first one, I'd restructure it to chain all the sentences
> together. I'm thinking about something like this:
>
> "
> Decouple stat collection from internal AER print functions, so the
> ratelimit does not impact the error counters. The stats collection is no
> longer buried in nested functions, simplifying the function flow.
> "

Thanks for the suggestion. Will reword in v5.

Thanks,
Jon

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 7/7] PCI/AER: Add sysfs attributes for log ratelimits
  2025-03-20 14:58   ` Karolina Stolarek
@ 2025-03-20 19:36     ` Jon Pan-Doh
  0 siblings, 0 replies; 32+ messages in thread
From: Jon Pan-Doh @ 2025-03-20 19:36 UTC (permalink / raw)
  To: Karolina Stolarek
  Cc: linux-pci, Bjorn Helgaas, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Terry Bowman

On Thu, Mar 20, 2025 at 7:58 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
>
> On 20/03/2025 09:20, Jon Pan-Doh wrote:
> > +What:                /sys/bus/pci/devices/<dev>/aer/ratelimit_in_5secs_cor_log
>
> I think this attribute name (and the uncor counterpart) is too wordy. A
> user can check what this knob controls by looking up this file or AER
> docs, so I'd name it to "ratelimit_burst_cor_log" or something along
> these lines.

Will change in v5.

> > +/*
> > + * Ratelimit enable toggle uses interval value of
> > + * 0: disabled
> > + * DEFAULT_RATELIMIT_INTERVAL: enabled
>
> We set that internally, but to the user we are just operating on 0s and
> 1s. I would connect this comment with what we have in the documentation.

Will reword in v5

> > +     pdev->aer_report->cor_log_ratelimit.interval = interval;
> > +     pdev->aer_report->uncor_log_ratelimit.interval = interval;
> > +     return count;
>
> Nit, suggestion: add a blank line before return (this applies to all
> returns in the patch)

Ack.

Thanks,
Jon

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs
  2025-03-20 14:56   ` Karolina Stolarek
  2025-03-20 17:51     ` Bjorn Helgaas
@ 2025-03-20 19:37     ` Jon Pan-Doh
  1 sibling, 0 replies; 32+ messages in thread
From: Jon Pan-Doh @ 2025-03-20 19:37 UTC (permalink / raw)
  To: Karolina Stolarek
  Cc: linux-pci, Bjorn Helgaas, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Terry Bowman

On Thu, Mar 20, 2025 at 7:57 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> For future reference -- please drop r-bs from patches that have
> functional/bigger changes. New code nullifies previous reviews.

Ack. Will remove in v5.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 4/7] PCI/AER: Rename struct aer_stats to aer_report
  2025-03-20 17:42   ` Sathyanarayanan Kuppuswamy
@ 2025-03-20 19:53     ` Jon Pan-Doh
  2025-03-21 13:38       ` Karolina Stolarek
  0 siblings, 1 reply; 32+ messages in thread
From: Jon Pan-Doh @ 2025-03-20 19:53 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Lukas Wunner, Jonathan Cameron, Terry Bowman

On Thu, Mar 20, 2025 at 10:42 AM Sathyanarayanan Kuppuswamy
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> I think you meant rate limit will be added by an upcoming patch. Please
> mention that it is a preparatory patch for adding rate limit support.

Appended to commit message:

"This is a preparatory patch for adding rate limit support"

> As Bjorn suggested, I also think be aer_info would be a better name for it.

Ok. Karolina, given that Sathyanarayanan and Bjorn are partial to aer_info,
should I switch back to aer_info? Or do you still have strong objections?

Thanks,
Jon

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs
  2025-03-20 17:51     ` Bjorn Helgaas
@ 2025-03-20 19:53       ` Jon Pan-Doh
  2025-03-20 20:29         ` Bjorn Helgaas
  0 siblings, 1 reply; 32+ messages in thread
From: Jon Pan-Doh @ 2025-03-20 19:53 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Karolina Stolarek, linux-pci, Bjorn Helgaas, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Sathyanarayanan Kuppuswamy, Lukas Wunner,
	Jonathan Cameron, Terry Bowman

On Thu, Mar 20, 2025 at 10:51 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, Mar 20, 2025 at 03:56:53PM +0100, Karolina Stolarek wrote:
> > On 20/03/2025 09:20, Jon Pan-Doh wrote:
> > > +   /*
> > > +    * Ratelimits are doubled as a given error produces 2 logs (root port
> > > +    * and endpoint) that should be under same ratelimit.
> > > +    */
> > For these devices, we would call the ratelimit
> > just once. I don't have a nice an clean solution for this problem, I just
> > wanted to highlight that 1) we don't use the Root Port's ratelimit in
> > aer_print_port_info(), 2) we may use the bursts to either print port_info +
> > error message or just the message, in different combinations. I think we
> > should reword this comment to highlight the fact that we don't check the
> > ratelimit once per error, we could do it twice.

You're right. I was thinking of amending it to something like:

Ratelimits are doubled as a given error notification produces up to 2 logs
(1 at root port and 1 at source device) that should be under the same ratelimit.

> Very good point.  aer_print_rp_info() is always ratelimited based on
> the ERR_* Source Identification, but if Multiple ERR_* is set,
> aer_print_error() ratelimits based on whatever downstream device we
> found that had any error of the same class logged.
>
> That does seem like a problem.  I would propose that we always
> ratelimit using the device from Source Identification. I think that's
> available in aer_print_error(); we would just use info->id instead of
> "dev".

Wouldn't you be incorrectly counting the non-source ID devices then?
I think this is another reason where removing aer_print_port_info()[1] (only
printing port info when failing to get device error info) simplifies things. Of
course, we then have to weigh whether the loss of info is less than the
ratelimit complexity.

> > I'm also thinking -- we are ratelimiting the aer_print_port_info() and
> > aer_print_error(). What about the messages in dpc_process_error()? Should we
> > check early if DPC was triggered because of an uncorrectable error, and if
> > so, ratelimit that?
>
> Good question.  It does seem like the dpc_process_error() messages
> should be similarly ratelimited.  I think we currently only enable DPC
> for fatal errors, and the Downstream Port takes the link down, which
> resets the hierarchy below.  So (1) we probably won't see storms of
> fatal error messages, and (2) it looks like we might not print any
> error info from downstream devices, since they're not reachable while
> the link is down.

I did not expect error storms from DPC so I thought it best to focus on AER.

[1] https://lore.kernel.org/linux-pci/CAMC_AXWVOtKh2r4kX6c7jtJwQaEE4KEQsH=uoB1OhczJ=8K2VA@mail.gmail.com/

Thanks,
Jon





On Thu, Mar 20, 2025 at 10:51 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, Mar 20, 2025 at 03:56:53PM +0100, Karolina Stolarek wrote:
> > On 20/03/2025 09:20, Jon Pan-Doh wrote:
> > > Spammy devices can flood kernel logs with AER errors and slow/stall
> > > execution. Add per-device ratelimits for AER correctable and
> > > uncorrectable errors that use the kernel defaults (10 per 5s).
>
> > > +   /*
> > > +    * Ratelimits are doubled as a given error produces 2 logs (root port
> > > +    * and endpoint) that should be under same ratelimit.
> > > +    */
> >
> > It took me a bit to understand what this comment is about.
> >
> > When we handle an error message, we first use the source's ratelimit to
> > decide if we want to print the port info, and then the actual error. In
> > theory, there could be more errors of the same class coming from other
> > devices within one message.
>
> I think this refers to the fact that when we get an AER interrupt from
> a Root Port, the RP has a single Requester ID logged in the Source
> Identification, but if Multiple ERR_* is set, find_device_iter() may
> collect error info from several devices?
>
> > For these devices, we would call the ratelimit
> > just once. I don't have a nice an clean solution for this problem, I just
> > wanted to highlight that 1) we don't use the Root Port's ratelimit in
> > aer_print_port_info(), 2) we may use the bursts to either print port_info +
> > error message or just the message, in different combinations. I think we
> > should reword this comment to highlight the fact that we don't check the
> > ratelimit once per error, we could do it twice.
>
> Very good point.  aer_print_rp_info() is always ratelimited based on
> the ERR_* Source Identification, but if Multiple ERR_* is set,
> aer_print_error() ratelimits based on whatever downstream device we
> found that had any error of the same class logged.
>
> E.g., if we had something like this topology:
>
>   00:1c.0 Root Port to [bus 01-04]
>   01:00.0 Switch Upstream Port to [bus 02-04]
>   02:00.0 Switch Downstream Port to [bus 03]
>   02:00.1 Switch Downstream Port to [bus 04]
>   03:00.0 NIC
>   04:00.0 NVMe
>
> where 03:00.0 and 04:00.0 both logged ERR_FATAL, and the Root Port
> received the 03:00.0 message first, 03:00.0 would be logged as the
> Source Identification, and Multiple ERR_FATAL Received should be set.
> The messages related to 00:1c.0 and 03:00.0 would be ratelimited based
> on 03:00.0, but aer_print_error() messages related to 04:00.0 would be
> ratelimited based on 04:00.0.
>
> That does seem like a problem.  I would propose that we always
> ratelimit using the device from Source Identification. I think that's
> available in aer_print_error(); we would just use info->id instead of
> "dev".
>
> That does mean we'd have to figure out the pci_dev corresponding to
> the Requester ID in info->id.  Maybe we could add an
> aer_err_info.src_dev pointer, and fill it in when find_device_iter()
> finds the right device?
>
> > Also, I wonder -- do only Endpoints generate error messages? From what I
> > understand, that some errors can be detected by intermediary devices.
>
> Yes, I think any device, including switches between a Root Port and
> Endpoint, can detect errors and generate error messages.
>
> I guess this means the "endpoint" variable in aer_print_port_info() is
> probably too specific.  IIUC the aer_print_port_info() "dev" parameter
> is always a Root Port since it came from aer_rpc.rpd.  Naming it "rp"
> would make this more obvious and free up "dev" for the source device.
> And "aer_print_port_info" itself could be more descriptive, e.g.,
> "aer_print_rp_info()", since *every* PCIe device has a Port.
>
> > I'm also thinking -- we are ratelimiting the aer_print_port_info() and
> > aer_print_error(). What about the messages in dpc_process_error()? Should we
> > check early if DPC was triggered because of an uncorrectable error, and if
> > so, ratelimit that?
>
> Good question.  It does seem like the dpc_process_error() messages
> should be similarly ratelimited.  I think we currently only enable DPC
> for fatal errors, and the Downstream Port takes the link down, which
> resets the hierarchy below.  So (1) we probably won't see storms of
> fatal error messages, and (2) it looks like we might not print any
> error info from downstream devices, since they're not reachable while
> the link is down.
>
> It does seem like we *should* try to print that info after the link
> comes back up, since the log registers are sticky and should survive
> the reset.  Maybe we do that already and I just missed it.
>
> This seems like something we could put off a little bit while we deal
> with the AER correctable error issue.
>
> Bjorn

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs
  2025-03-20 19:53       ` Jon Pan-Doh
@ 2025-03-20 20:29         ` Bjorn Helgaas
  2025-03-21  1:58           ` Jon Pan-Doh
  0 siblings, 1 reply; 32+ messages in thread
From: Bjorn Helgaas @ 2025-03-20 20:29 UTC (permalink / raw)
  To: Jon Pan-Doh
  Cc: Karolina Stolarek, linux-pci, Bjorn Helgaas, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Sathyanarayanan Kuppuswamy, Lukas Wunner,
	Jonathan Cameron, Terry Bowman

On Thu, Mar 20, 2025 at 12:53:53PM -0700, Jon Pan-Doh wrote:
> On Thu, Mar 20, 2025 at 10:51 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Thu, Mar 20, 2025 at 03:56:53PM +0100, Karolina Stolarek wrote:
> > > On 20/03/2025 09:20, Jon Pan-Doh wrote:
> > > > +   /*
> > > > +    * Ratelimits are doubled as a given error produces 2 logs (root port
> > > > +    * and endpoint) that should be under same ratelimit.
> > > > +    */
> > > For these devices, we would call the ratelimit just once. I
> > > don't have a nice an clean solution for this problem, I just
> > > wanted to highlight that 1) we don't use the Root Port's
> > > ratelimit in aer_print_port_info(), 2) we may use the bursts to
> > > either print port_info + error message or just the message, in
> > > different combinations. I think we should reword this comment to
> > > highlight the fact that we don't check the ratelimit once per
> > > error, we could do it twice.
> 
> You're right. I was thinking of amending it to something like:
> 
> Ratelimits are doubled as a given error notification produces up to
> 2 logs (1 at root port and 1 at source device) that should be under
> the same ratelimit.
> 
> > Very good point.  aer_print_rp_info() is always ratelimited based
> > on the ERR_* Source Identification, but if Multiple ERR_* is set,
> > aer_print_error() ratelimits based on whatever downstream device
> > we found that had any error of the same class logged.
> >
> > That does seem like a problem.  I would propose that we always
> > ratelimit using the device from Source Identification. I think
> > that's available in aer_print_error(); we would just use info->id
> > instead of "dev".
> 
> Wouldn't you be incorrectly counting the non-source ID devices then?
> I think this is another reason where removing
> aer_print_port_info()[1] (only printing port info when failing to
> get device error info) simplifies things. Of course, we then have to
> weigh whether the loss of info is less than the ratelimit
> complexity.

Yes, I guess so.  Maybe the ratelimit should be in the source of the
interrupt (Root Port for AER, Root Port or Downstream Port for DPC) so
it's more directly related to the interrupt that got us here in the
first place.

I think the struct aer_err_info is basically a per-interrupt thing, so
maybe we could evaluate __ratelimit() once at the initial entry, save
the result in aer_err_info, and use that saved value everywhere we
print messages?

  - native AER: aer_isr_one_error() has RP pointer in rpc->rpd and
    could save it (or pointer to the RP's ratelimit struct, or just
    the result of __ratelimit()) in aer_err_info.

  - GHES AER: I'm not sure struct cper_sec_pcie contains the RP, might
    have to search upwards from the device we know about?

  - native DPC: dpc_process_error() has DP pointer and could save it
    in aer_err_info.

  - EDR DPC: passes DP pointer to dpc_process_error().

> > > I'm also thinking -- we are ratelimiting the
> > > aer_print_port_info() and aer_print_error(). What about the
> > > messages in dpc_process_error()? Should we check early if DPC
> > > was triggered because of an uncorrectable error, and if so,
> > > ratelimit that?
> >
> > Good question.  It does seem like the dpc_process_error() messages
> > should be similarly ratelimited.  I think we currently only enable
> > DPC for fatal errors, and the Downstream Port takes the link down,
> > which resets the hierarchy below.  So (1) we probably won't see
> > storms of fatal error messages, and (2) it looks like we might not
> > print any error info from downstream devices, since they're not
> > reachable while the link is down.
> 
> I did not expect error storms from DPC so I thought it best to focus
> on AER.

Completely agreed.

> [1] https://lore.kernel.org/linux-pci/CAMC_AXWVOtKh2r4kX6c7jtJwQaEE4KEQsH=uoB1OhczJ=8K2VA@mail.gmail.com/

[BTW, email quoting style error here; shouldn't have the entire
message you're replying to repeated below.  Gmail tries hard to screw
this up for you ;)]

> On Thu, Mar 20, 2025 at 10:51 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> ... (snipped)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs
  2025-03-20  8:20 ` [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs Jon Pan-Doh
  2025-03-20 14:56   ` Karolina Stolarek
@ 2025-03-21  1:00   ` Sathyanarayanan Kuppuswamy
  2025-03-21 19:24     ` Jon Pan-Doh
  1 sibling, 1 reply; 32+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-03-21  1:00 UTC (permalink / raw)
  To: Jon Pan-Doh, Bjorn Helgaas, Karolina Stolarek
  Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
	Tony Luck, Ilpo Järvinen, Lukas Wunner, Jonathan Cameron,
	Terry Bowman

Hi,

On 3/20/25 1:20 AM, Jon Pan-Doh wrote:
> Spammy devices can flood kernel logs with AER errors and slow/stall
> execution. Add per-device ratelimits for AER correctable and uncorrectable
> errors that use the kernel defaults (10 per 5s).

Should we exclude fatal errors from the rate limit? Fatal error logs 
would be
really useful for debug analysis, and they not happen very frequently.

>
> Tested using aer-inject[1]. Sent 11 AER errors. Observed 10 errors logged
> while AER stats (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) show
> true count of 11.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> Reviewed-by: Karolina Stolarek <karolina.stolarek@oracle.com>
> ---
>   drivers/pci/pcie/aer.c | 74 +++++++++++++++++++++++++++++++++---------
>   1 file changed, 58 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 3069376b3553..081cef5fc678 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -28,6 +28,7 @@
>   #include <linux/interrupt.h>
>   #include <linux/delay.h>
>   #include <linux/kfifo.h>
> +#include <linux/ratelimit.h>
>   #include <linux/slab.h>
>   #include <acpi/apei.h>
>   #include <acpi/ghes.h>
> @@ -88,6 +89,10 @@ struct aer_report {
>   	u64 rootport_total_cor_errs;
>   	u64 rootport_total_fatal_errs;
>   	u64 rootport_total_nonfatal_errs;
> +
> +	/* Ratelimits for errors */
> +	struct ratelimit_state cor_log_ratelimit;
> +	struct ratelimit_state uncor_log_ratelimit;
>   };
>   
>   #define AER_LOG_TLP_MASKS		(PCI_ERR_UNC_POISON_TLP|	\
> @@ -379,6 +384,15 @@ void pci_aer_init(struct pci_dev *dev)
>   
>   	dev->aer_report = kzalloc(sizeof(*dev->aer_report), GFP_KERNEL);
>   
> +	/*
> +	 * Ratelimits are doubled as a given error produces 2 logs (root port
> +	 * and endpoint) that should be under same ratelimit.
> +	 */
> +	ratelimit_state_init(&dev->aer_report->cor_log_ratelimit,
> +			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST * 2);
> +	ratelimit_state_init(&dev->aer_report->uncor_log_ratelimit,
> +			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST * 2);
> +
>   	/*
>   	 * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
>   	 * PCI_ERR_COR_MASK, and PCI_ERR_CAP.  Root and Root Complex Event
> @@ -668,6 +682,17 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
>   	}
>   }
>   
> +static int aer_ratelimit(struct pci_dev *dev, unsigned int severity)
> +{
> +	struct ratelimit_state *ratelimit;
> +
> +	if (severity == AER_CORRECTABLE)
> +		ratelimit = &dev->aer_report->cor_log_ratelimit;
> +	else
> +		ratelimit = &dev->aer_report->uncor_log_ratelimit;
> +	return __ratelimit(ratelimit);
> +}
> +
>   static void __aer_print_error(struct pci_dev *dev,
>   			      struct aer_err_info *info,
>   			      const char *level)
> @@ -698,6 +723,12 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info,
>   	int layer, agent;
>   	int id = pci_dev_id(dev);
>   
> +	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
> +			info->severity, info->tlp_header_valid, &info->tlp);
> +
> +	if (!aer_ratelimit(dev, info->severity))
> +		return;
> +
>   	if (!info->status) {
>   		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
>   			aer_error_severity_string[info->severity]);
> @@ -722,21 +753,28 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info,
>   out:
>   	if (info->id && info->error_dev_num > 1 && info->id == id)
>   		pci_err(dev, "  Error of this Agent is reported first\n");
> -
> -	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
> -			info->severity, info->tlp_header_valid, &info->tlp);
>   }
>   
>   static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
>   {
>   	u8 bus = info->id >> 8;
>   	u8 devfn = info->id & 0xff;
> +	struct pci_dev *endpoint;
> +	int i;
> +
> +	/* extract endpoint device ratelimit */
> +	for (i = 0; i < info->error_dev_num; i++) {
> +		endpoint = info->dev[i];
> +		if (info->id == pci_dev_id(endpoint))
> +			break;
> +	}
>   
> -	pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d\n",
> -		 info->multi_error_valid ? "Multiple " : "",
> -		 aer_error_severity_string[info->severity],
> -		 pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
> -		 PCI_FUNC(devfn));
> +	if (aer_ratelimit(endpoint, info->severity))
> +		pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d\n",
> +			 info->multi_error_valid ? "Multiple " : "",
> +			 aer_error_severity_string[info->severity],
> +			 pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
> +			 PCI_FUNC(devfn));
>   }
>   
>   #ifdef CONFIG_ACPI_APEI_PCIEAER
> @@ -784,6 +822,12 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>   
>   	pci_dev_aer_stats_incr(dev, &info);
>   
> +	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
> +			aer_severity, tlp_header_valid, &aer->header_log);
> +
> +	if (!aer_ratelimit(dev, aer_severity))
> +		return;
> +
>   	aer_printk(level, dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
>   	__aer_print_error(dev, &info, level);
>   	aer_printk(level, dev, "aer_layer=%s, aer_agent=%s\n",
> @@ -795,9 +839,6 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>   
>   	if (tlp_header_valid)
>   		pcie_print_tlp_log(dev, &aer->header_log, dev_fmt("  "));
> -
> -	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
> -			aer_severity, tlp_header_valid, &aer->header_log);
>   }
>   EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
>   
> @@ -1299,10 +1340,11 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
>   			e_info.multi_error_valid = 1;
>   		else
>   			e_info.multi_error_valid = 0;
> -		aer_print_port_info(pdev, &e_info);
>   
> -		if (find_source_device(pdev, &e_info))
> +		if (find_source_device(pdev, &e_info)) {
> +			aer_print_port_info(pdev, &e_info);
>   			aer_process_err_devices(&e_info, KERN_WARNING);
> +		}
>   	}
>   
>   	if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
> @@ -1318,10 +1360,10 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
>   		else
>   			e_info.multi_error_valid = 0;
>   
> -		aer_print_port_info(pdev, &e_info);
> -
> -		if (find_source_device(pdev, &e_info))
> +		if (find_source_device(pdev, &e_info)) {
> +			aer_print_port_info(pdev, &e_info);
>   			aer_process_err_devices(&e_info, KERN_ERR);
> +		}
>   	}
>   }
>   

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 6/7] PCI/AER: Add ratelimits to PCI AER Documentation
  2025-03-20  8:20 ` [PATCH v4 6/7] PCI/AER: Add ratelimits to PCI AER Documentation Jon Pan-Doh
  2025-03-20 14:57   ` Karolina Stolarek
@ 2025-03-21  1:00   ` Sathyanarayanan Kuppuswamy
  1 sibling, 0 replies; 32+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-03-21  1:00 UTC (permalink / raw)
  To: Jon Pan-Doh, Bjorn Helgaas, Karolina Stolarek
  Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
	Tony Luck, Ilpo Järvinen, Lukas Wunner, Jonathan Cameron,
	Terry Bowman


On 3/20/25 1:20 AM, Jon Pan-Doh wrote:
> Add ratelimits section for rationale and defaults.
>
> Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> ---

Reviewed-by: Kuppuswamy Sathyanarayanan 
<sathyanarayanan.kuppuswamy@linux.intel.com>

>   Documentation/PCI/pcieaer-howto.rst | 11 +++++++++++
>   1 file changed, 11 insertions(+)
>
> diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
> index f013f3b27c82..896d2a232a90 100644
> --- a/Documentation/PCI/pcieaer-howto.rst
> +++ b/Documentation/PCI/pcieaer-howto.rst
> @@ -85,6 +85,17 @@ In the example, 'Requester ID' means the ID of the device that sent
>   the error message to the Root Port. Please refer to PCIe specs for other
>   fields.
>   
> +AER Ratelimits
> +--------------
> +
> +Since error messages can be generated for each transaction, we may see
> +large volumes of errors reported. To prevent spammy devices from flooding
> +the console/stalling execution, messages are throttled by device and error
> +type (correctable vs. uncorrectable).
> +
> +AER uses the default ratelimit of DEFAULT_RATELIMIT_BURST (10 events) over
> +DEFAULT_RATELIMIT_INTERVAL (5 seconds).
> +
>   AER Statistics / Counters
>   -------------------------
>   

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 7/7] PCI/AER: Add sysfs attributes for log ratelimits
  2025-03-20  8:20 ` [PATCH v4 7/7] PCI/AER: Add sysfs attributes for log ratelimits Jon Pan-Doh
  2025-03-20 14:58   ` Karolina Stolarek
@ 2025-03-21  1:02   ` Sathyanarayanan Kuppuswamy
  2025-03-21  1:55     ` Jon Pan-Doh
  1 sibling, 1 reply; 32+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-03-21  1:02 UTC (permalink / raw)
  To: Jon Pan-Doh, Bjorn Helgaas, Karolina Stolarek
  Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
	Tony Luck, Ilpo Järvinen, Lukas Wunner, Jonathan Cameron,
	Terry Bowman


On 3/20/25 1:20 AM, Jon Pan-Doh wrote:
> Allow userspace to read/write log ratelimits per device (including
> enable/disable). Create aer/ sysfs directory to store them and any
> future aer configs.
>
> Update AER sysfs ABI filename to reflect the broader scope of AER sysfs
> attributes (e.g. stats and ratelimits).
>
> Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats ->
> Documentation/ABI/testing/sysfs-bus-pci-devices-aer
>
> Tested using aer-inject[1]. Configured correctable log ratelimit to 5.
> Sent 6 AER errors. Observed 5 errors logged while AER stats
> (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) shows 6.
>
> Disabled ratelimiting and sent 6 more AER errors. Observed all 6 errors
> logged and accounted in AER stats (12 total errors).
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
>
> Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> ---
>   ...es-aer_stats => sysfs-bus-pci-devices-aer} | 34 +++++++
>   Documentation/PCI/pcieaer-howto.rst           |  5 +-
>   drivers/pci/pci-sysfs.c                       |  1 +
>   drivers/pci/pci.h                             |  1 +
>   drivers/pci/pcie/aer.c                        | 93 +++++++++++++++++++
>   5 files changed, 133 insertions(+), 1 deletion(-)
>   rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (77%)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> similarity index 77%
> rename from Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> rename to Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> index d1f67bb81d5d..4561653fdbde 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> @@ -117,3 +117,37 @@ Date:		July 2018
>   KernelVersion:	4.19.0
>   Contact:	linux-pci@vger.kernel.org, rajatja@google.com
>   Description:	Total number of ERR_NONFATAL messages reported to rootport.
> +
> +PCIe AER ratelimits
> +-------------------
> +
> +These attributes show up under all the devices that are AER capable.
> +They represent configurable ratelimits of logs per error type.
> +
> +See Documentation/PCI/pcieaer-howto.rst for more info on ratelimits.
> +
> +What:		/sys/bus/pci/devices/<dev>/aer/ratelimit_log_enable
> +Date:		March 2025
> +KernelVersion:	6.15.0
> +Contact:	linux-pci@vger.kernel.org, pandoh@google.com
> +Description:	Writing 1/0 enables/disables AER log ratelimiting. Reading
> +		gets whether or not AER is currently enabled. Enabled by
> +		default.
> +
> +What:		/sys/bus/pci/devices/<dev>/aer/ratelimit_in_5secs_cor_log

Why not just use ratelimit_cor_log and ratelimit_uncor_log? Any way the
detail about 5 second window would be available in the Documentation.

> +Date:		March 2025
> +KernelVersion:	6.15.0
> +Contact:	linux-pci@vger.kernel.org, pandoh@google.com
> +Description:	Ratelimit burst for correctable error logs. Writing a value
> +		changes the number of errors (burst) allowed per interval
> +		(5 second window) before ratelimiting. Reading gets the
> +		current ratelimit burst.
> +
> +What:		/sys/bus/pci/devices/<dev>/aer/ratelimit_in_5secs_uncor_log
> +Date:		March 2025
> +KernelVersion:	6.15.0
> +Contact:	linux-pci@vger.kernel.org, pandoh@google.com
> +Description:	Ratelimit burst for uncorrectable error logs. Writing a
> +		value changes the number of errors (burst) allowed per
> +		interval (5 second window) before ratelimiting. Reading
> +		gets the current ratelimit burst.
> diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
> index 896d2a232a90..043cdb3194be 100644
> --- a/Documentation/PCI/pcieaer-howto.rst
> +++ b/Documentation/PCI/pcieaer-howto.rst
> @@ -96,12 +96,15 @@ type (correctable vs. uncorrectable).
>   AER uses the default ratelimit of DEFAULT_RATELIMIT_BURST (10 events) over
>   DEFAULT_RATELIMIT_INTERVAL (5 seconds).
>   
> +Ratelimits are exposed in the form of sysfs attributes and configurable.
> +See Documentation/ABI/testing/sysfs-bus-pci-devices-aer.
> +
>   AER Statistics / Counters
>   -------------------------
>   
>   When PCIe AER errors are captured, the counters / statistics are also exposed
>   in the form of sysfs attributes which are documented at
> -Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> +Documentation/ABI/testing/sysfs-bus-pci-devices-aer.
>   
>   Developer Guide
>   ===============
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index b46ce1a2c554..16de3093294e 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -1801,6 +1801,7 @@ const struct attribute_group *pci_dev_attr_groups[] = {
>   	&pcie_dev_attr_group,
>   #ifdef CONFIG_PCIEAER
>   	&aer_stats_attr_group,
> +	&aer_attr_group,
>   #endif
>   #ifdef CONFIG_PCIEASPM
>   	&aspm_ctrl_attr_group,
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 9d63d32f041c..34633fe12201 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -889,6 +889,7 @@ void pci_no_aer(void);
>   void pci_aer_init(struct pci_dev *dev);
>   void pci_aer_exit(struct pci_dev *dev);
>   extern const struct attribute_group aer_stats_attr_group;
> +extern const struct attribute_group aer_attr_group;
>   void pci_aer_clear_fatal_status(struct pci_dev *dev);
>   int pci_aer_clear_status(struct pci_dev *dev);
>   int pci_aer_raw_clear_status(struct pci_dev *dev);
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 081cef5fc678..f84ae1872fa3 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -631,6 +631,99 @@ const struct attribute_group aer_stats_attr_group = {
>   	.is_visible = aer_stats_attrs_are_visible,
>   };
>   
> +/*
> + * Ratelimit enable toggle uses interval value of
> + * 0: disabled
> + * DEFAULT_RATELIMIT_INTERVAL: enabled
> + */
> +static ssize_t ratelimit_log_enable_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	bool enable = pdev->aer_report->cor_log_ratelimit.interval != 0;
> +
> +	return sysfs_emit(buf, "%d\n", enable);
> +}
> +
> +static ssize_t ratelimit_log_enable_store(struct device *dev,
> +					  struct device_attribute *attr,
> +					  const char *buf, size_t count)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	bool enable;
> +	int interval;
> +
> +	if (kstrtobool(buf, &enable) < 0)
> +		return -EINVAL;
> +
> +	if (enable)
> +		interval = DEFAULT_RATELIMIT_INTERVAL;
> +	else
> +		interval = 0;
> +
> +	pdev->aer_report->cor_log_ratelimit.interval = interval;
> +	pdev->aer_report->uncor_log_ratelimit.interval = interval;
> +	return count;
> +}
> +static DEVICE_ATTR_RW(ratelimit_log_enable);
> +
> +/*
> + * Ratelimits are doubled as a given error produces 2 logs (root port
> + * and endpoint) that should be under same ratelimit.
> + */
> +#define aer_ratelimit_burst_attr(name, ratelimit)			\
> +	static ssize_t							\
> +	name##_show(struct device *dev, struct device_attribute *attr,	\
> +		    char *buf)						\
> +{									\
> +	struct pci_dev *pdev = to_pci_dev(dev);				\
> +	return sysfs_emit(buf, "%d\n",					\
> +			  pdev->aer_report->ratelimit.burst / 2);	\
> +}									\
> +									\
> +	static ssize_t							\
> +	name##_store(struct device *dev, struct device_attribute *attr,	\
> +		     const char *buf, size_t count)			\
> +{									\
> +	struct pci_dev *pdev = to_pci_dev(dev);				\
> +	int burst;							\
> +									\
> +	if (kstrtoint(buf, 0, &burst) < 0)				\
> +		return -EINVAL;						\
> +									\
> +	pdev->aer_report->ratelimit.burst = burst * 2;			\
> +	return count;							\
> +}									\
> +static DEVICE_ATTR_RW(name)
> +
> +aer_ratelimit_burst_attr(ratelimit_in_5secs_cor_log, cor_log_ratelimit);
> +aer_ratelimit_burst_attr(ratelimit_in_5secs_uncor_log, uncor_log_ratelimit);
> +
> +static struct attribute *aer_attrs[] = {
> +	&dev_attr_ratelimit_log_enable.attr,
> +	&dev_attr_ratelimit_in_5secs_cor_log.attr,
> +	&dev_attr_ratelimit_in_5secs_uncor_log.attr,
> +	NULL
> +};
> +
> +static umode_t aer_attrs_are_visible(struct kobject *kobj,
> +				     struct attribute *a, int n)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	if (!pdev->aer_report)
> +		return 0;
> +	return a->mode;
> +}
> +
> +const struct attribute_group aer_attr_group = {
> +	.name = "aer",
> +	.attrs = aer_attrs,
> +	.is_visible = aer_attrs_are_visible,
> +};
> +
>   void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info)
>   {
>   	unsigned long status = info->status & ~info->mask;

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 7/7] PCI/AER: Add sysfs attributes for log ratelimits
  2025-03-21  1:02   ` Sathyanarayanan Kuppuswamy
@ 2025-03-21  1:55     ` Jon Pan-Doh
  0 siblings, 0 replies; 32+ messages in thread
From: Jon Pan-Doh @ 2025-03-21  1:55 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Lukas Wunner, Jonathan Cameron, Terry Bowman

On Thu, Mar 20, 2025 at 6:02 PM Sathyanarayanan Kuppuswamy
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>
> On 3/20/25 1:20 AM, Jon Pan-Doh wrote:
> > +What:                /sys/bus/pci/devices/<dev>/aer/ratelimit_in_5secs_cor_log
>
> Why not just use ratelimit_cor_log and ratelimit_uncor_log? Any way the
> detail about 5 second window would be available in the Documentation.

This was the original name. It seems to have gone full circle so I'm not sure
how to resolve it. Personally, I think Karolina's suggestion is a good
compromise
of detail vs. verboseness:

- Jonathan: ratelimit_in_5secs_cor_log[1]
- Karolina: ratelimit_burst_cor_log[2]

[1] https://lore.kernel.org/linux-pci/CAMC_AXW9nE-q_8qqX+1KOeYdTQVoUDovY03aPbLGZBpe9HCcWQ@mail.gmail.com/
[2] https://lore.kernel.org/linux-pci/d75a96f1-5162-4ec4-971b-9ebd4cfa5447@oracle.com/

Thanks,
Jon

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs
  2025-03-20 20:29         ` Bjorn Helgaas
@ 2025-03-21  1:58           ` Jon Pan-Doh
  0 siblings, 0 replies; 32+ messages in thread
From: Jon Pan-Doh @ 2025-03-21  1:58 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Karolina Stolarek, linux-pci, Bjorn Helgaas, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Sathyanarayanan Kuppuswamy, Lukas Wunner,
	Jonathan Cameron, Terry Bowman

On Thu, Mar 20, 2025 at 1:29 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
>
> On Thu, Mar 20, 2025 at 12:53:53PM -0700, Jon Pan-Doh wrote:
> I think the struct aer_err_info is basically a per-interrupt thing, so
> maybe we could evaluate __ratelimit() once at the initial entry, save
> the result in aer_err_info, and use that saved value everywhere we
> print messages?

I like this approach. Another advantage is it removes the need for the 2x
ratelimit logic. Updated for v5.

>   - native AER: aer_isr_one_error() has RP pointer in rpc->rpd and
>     could save it (or pointer to the RP's ratelimit struct, or just
>     the result of __ratelimit()) in aer_err_info.

Similar to aer_err_info.dev[], I store the evaluated __ratelimit() in
aer_err_info.ratelimited[]. The main quirk is that for multiple
errors, you won't
see the root port log if the first error is ratelimited, but the
subsequent errors
are under the limit. I think this is fine, as the log prints out the
first error only,
but can change aer_print_port_info() to log if any of the errors is
under the limit.

>   - GHES AER: I'm not sure struct cper_sec_pcie contains the RP, might
>     have to search upwards from the device we know about?
>
>   - native DPC: dpc_process_error() has DP pointer and could save it
>     in aer_err_info.
>
>   - EDR DPC: passes DP pointer to dpc_process_error().

These are largely unchanged:
- GHES/CXL gated by aer_ratelimit() in pci_print_aer()
- DPC not ratelimited with the expectation that there won't be error storms

Thanks,
Jon

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 4/7] PCI/AER: Rename struct aer_stats to aer_report
  2025-03-20 19:53     ` Jon Pan-Doh
@ 2025-03-21 13:38       ` Karolina Stolarek
  0 siblings, 0 replies; 32+ messages in thread
From: Karolina Stolarek @ 2025-03-21 13:38 UTC (permalink / raw)
  To: Jon Pan-Doh
  Cc: linux-pci, Bjorn Helgaas, Sathyanarayanan Kuppuswamy,
	Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Lukas Wunner, Jonathan Cameron, Terry Bowman

On 20/03/2025 20:53, Jon Pan-Doh wrote:
>
> Ok. Karolina, given that Sathyanarayanan and Bjorn are partial to aer_info,
> should I switch back to aer_info? Or do you still have strong objections?

No objections, let's go with aer_info. You can keep my rb, I don't 
expect bigger changes here.

All the best,
Karolina

> 
> Thanks,
> Jon


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs
  2025-03-21  1:00   ` Sathyanarayanan Kuppuswamy
@ 2025-03-21 19:24     ` Jon Pan-Doh
  2025-03-21 21:47       ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 32+ messages in thread
From: Jon Pan-Doh @ 2025-03-21 19:24 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Lukas Wunner, Jonathan Cameron, Terry Bowman

On Thu, Mar 20, 2025 at 6:00 PM Sathyanarayanan Kuppuswamy
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> Should we exclude fatal errors from the rate limit? Fatal error logs
> would be
> really useful for debug analysis, and they not happen very frequently.

The logs today only make the distinction between correctable vs.
uncorrectable so I thought it made sense to be consistent.

Maybe this is something that could be deferred? The only fixed
component is the sysfs attribute names (which can be made to refer to
uncorrectable nonfatal vs. uncorrectable in doc/underlying
implementation).

Thanks,
Jon

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs
  2025-03-21 19:24     ` Jon Pan-Doh
@ 2025-03-21 21:47       ` Sathyanarayanan Kuppuswamy
  2025-03-21 21:59         ` Bjorn Helgaas
  2025-03-21 22:11         ` Jon Pan-Doh
  0 siblings, 2 replies; 32+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-03-21 21:47 UTC (permalink / raw)
  To: Jon Pan-Doh
  Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Lukas Wunner, Jonathan Cameron, Terry Bowman

Hi Jon,

On 3/21/25 12:24 PM, Jon Pan-Doh wrote:
> On Thu, Mar 20, 2025 at 6:00 PM Sathyanarayanan Kuppuswamy
> <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
>> Should we exclude fatal errors from the rate limit? Fatal error logs
>> would be
>> really useful for debug analysis, and they not happen very frequently.
> The logs today only make the distinction between correctable vs.
> uncorrectable so I thought it made sense to be consistent.

You're right. From a logging perspective, the current driver only
differentiates between correctable and uncorrectable errors. However,
the goal of your patch series is to reduce the spam of frequent errors.
While we are rate-limiting these frequent logs, we must ensure that we
don't miss important logs. I believe we did not rate-limit DPC logs for
this very reason.


>
> Maybe this is something that could be deferred? The only fixed

I am fine with deferring. IIUC, if needed, through sysfs user can
skip rate-limit for uncorrectable errors, right?

But, is the required change to do this complex? Won't skipping the
rate limit check for fatal errors solve the problem?

Bjorn, any comments? Do you think Fatal errors should be
rate-limited?

> component is the sysfs attribute names (which can be made to refer to
> uncorrectable nonfatal vs. uncorrectable in doc/underlying
> implementation).
>
> Thanks,
> Jon

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs
  2025-03-21 21:47       ` Sathyanarayanan Kuppuswamy
@ 2025-03-21 21:59         ` Bjorn Helgaas
  2025-03-21 22:11         ` Jon Pan-Doh
  1 sibling, 0 replies; 32+ messages in thread
From: Bjorn Helgaas @ 2025-03-21 21:59 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: Jon Pan-Doh, Bjorn Helgaas, Karolina Stolarek, linux-pci,
	Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Lukas Wunner, Jonathan Cameron, Terry Bowman

On Fri, Mar 21, 2025 at 02:47:36PM -0700, Sathyanarayanan Kuppuswamy wrote:
> On 3/21/25 12:24 PM, Jon Pan-Doh wrote:
> > On Thu, Mar 20, 2025 at 6:00 PM Sathyanarayanan Kuppuswamy
> > <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> > > Should we exclude fatal errors from the rate limit? Fatal error
> > > logs would be really useful for debug analysis, and they not
> > > happen very frequently.
>
> > The logs today only make the distinction between correctable vs.
> > uncorrectable so I thought it made sense to be consistent.
> 
> You're right. From a logging perspective, the current driver only
> differentiates between correctable and uncorrectable errors.
> However, the goal of your patch series is to reduce the spam of
> frequent errors.  While we are rate-limiting these frequent logs, we
> must ensure that we don't miss important logs. I believe we did not
> rate-limit DPC logs for this very reason.
> 
> > Maybe this is something that could be deferred? The only fixed
> 
> I am fine with deferring. IIUC, if needed, through sysfs user can
> skip rate-limit for uncorrectable errors, right?
> 
> But, is the required change to do this complex? Won't skipping the
> rate limit check for fatal errors solve the problem?
> 
> Bjorn, any comments? Do you think Fatal errors should be
> rate-limited?

I'm inclined to not ratelimit fatal errors unless we've seen issues
with a flood of them.

Bjorn

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs
  2025-03-21 21:47       ` Sathyanarayanan Kuppuswamy
  2025-03-21 21:59         ` Bjorn Helgaas
@ 2025-03-21 22:11         ` Jon Pan-Doh
  1 sibling, 0 replies; 32+ messages in thread
From: Jon Pan-Doh @ 2025-03-21 22:11 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Lukas Wunner, Jonathan Cameron, Terry Bowman

On Fri, Mar 21, 2025 at 2:47 PM Sathyanarayanan Kuppuswamy
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> I am fine with deferring. IIUC, if needed, through sysfs user can
> skip rate-limit for uncorrectable errors, right?

Yes.

> But, is the required change to do this complex? Won't skipping the
> rate limit check for fatal errors solve the problem?

No, it's not complex. I was trying to minimize extra changes in an
attempt to get this in for 6.15. However, we're right against the
merge window so that may not be possible.

Thanks,
Jon

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2025-03-21 22:11 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-03-20  8:20 [PATCH v4 0/7] Rate limit AER logs Jon Pan-Doh
2025-03-20  8:20 ` [PATCH v4 1/7] PCI/AER: Check log level once and propagate down Jon Pan-Doh
2025-03-20  8:20 ` [PATCH v4 2/7] PCI/AER: Make all pci_print_aer() log levels depend on error type Jon Pan-Doh
2025-03-20  8:20 ` [PATCH v4 3/7] PCI/AER: Move AER stat collection out of __aer_print_error() Jon Pan-Doh
2025-03-20 14:59   ` Karolina Stolarek
2025-03-20 19:07     ` Jon Pan-Doh
2025-03-20  8:20 ` [PATCH v4 4/7] PCI/AER: Rename struct aer_stats to aer_report Jon Pan-Doh
2025-03-20 17:42   ` Sathyanarayanan Kuppuswamy
2025-03-20 19:53     ` Jon Pan-Doh
2025-03-21 13:38       ` Karolina Stolarek
2025-03-20  8:20 ` [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs Jon Pan-Doh
2025-03-20 14:56   ` Karolina Stolarek
2025-03-20 17:51     ` Bjorn Helgaas
2025-03-20 19:53       ` Jon Pan-Doh
2025-03-20 20:29         ` Bjorn Helgaas
2025-03-21  1:58           ` Jon Pan-Doh
2025-03-20 19:37     ` Jon Pan-Doh
2025-03-21  1:00   ` Sathyanarayanan Kuppuswamy
2025-03-21 19:24     ` Jon Pan-Doh
2025-03-21 21:47       ` Sathyanarayanan Kuppuswamy
2025-03-21 21:59         ` Bjorn Helgaas
2025-03-21 22:11         ` Jon Pan-Doh
2025-03-20  8:20 ` [PATCH v4 6/7] PCI/AER: Add ratelimits to PCI AER Documentation Jon Pan-Doh
2025-03-20 14:57   ` Karolina Stolarek
2025-03-21  1:00   ` Sathyanarayanan Kuppuswamy
2025-03-20  8:20 ` [PATCH v4 7/7] PCI/AER: Add sysfs attributes for log ratelimits Jon Pan-Doh
2025-03-20 14:58   ` Karolina Stolarek
2025-03-20 19:36     ` Jon Pan-Doh
2025-03-21  1:02   ` Sathyanarayanan Kuppuswamy
2025-03-21  1:55     ` Jon Pan-Doh
2025-03-20 14:34 ` [PATCH v4 0/7] Rate limit AER logs Christoph Hellwig
2025-03-20 18:45 ` Paul E. McKenney

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox