linux-pci.vger.kernel.org archive mirror
* [PATCH v6 00/16] Rate limit AER logs
@ 2025-05-19 21:35 Bjorn Helgaas
  2025-05-19 21:35 ` [PATCH v6 01/16] PCI/DPC: Initialize aer_err_info before using it Bjorn Helgaas
                   ` (16 more replies)
  0 siblings, 17 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-19 21:35 UTC (permalink / raw)
  To: linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

From: Bjorn Helgaas <bhelgaas@google.com>

This work is mostly due to Jon Pan-Doh and Karolina Stolarek.  I rebased
this to v6.15-rc1, factored out some of the trace and statistics updates,
and added some minor cleanups.

Proposal
========

When using native AER, spammy devices can flood kernel logs with AER errors
and slow/stall execution. Add per-device per-error-severity ratelimits for
more robust error logging. Allow userspace to configure ratelimits via
sysfs knobs.
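
As a rough illustration of the idea (not the code in this series), a
per-device, per-severity ratelimit can be sketched in userspace C as one
interval/burst window per severity. All names here are hypothetical, and the
choice of which severities are limited and with what defaults is made by the
patches themselves, not by this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names throughout; this is not the kernel's ratelimit API. */
enum sev { SEV_CORRECTABLE, SEV_NONFATAL, SEV_FATAL, SEV_COUNT };

struct log_ratelimit {
	uint64_t interval_ms;	/* length of one window */
	unsigned int burst;	/* messages allowed per window */
	uint64_t window_start;	/* start of the current window, in ms */
	unsigned int printed;	/* messages allowed so far in this window */
	unsigned int missed;	/* messages suppressed in this window */
};

/* One of these per device: an independent window for each severity. */
struct dev_log_state {
	struct log_ratelimit rl[SEV_COUNT];
};

static void dev_log_state_init(struct dev_log_state *st,
			       uint64_t interval_ms, unsigned int burst)
{
	for (int i = 0; i < SEV_COUNT; i++)
		st->rl[i] = (struct log_ratelimit){
			.interval_ms = interval_ms,
			.burst = burst,
		};
}

/* Return true if a message of this severity may be logged at now_ms. */
static bool dev_log_allowed(struct dev_log_state *st, enum sev sev,
			    uint64_t now_ms)
{
	struct log_ratelimit *rl;

	assert(sev < SEV_COUNT);
	rl = &st->rl[sev];

	/* Open a fresh window once the previous one has expired. */
	if (now_ms - rl->window_start >= rl->interval_ms) {
		rl->window_start = now_ms;
		rl->printed = 0;
		rl->missed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}

	rl->missed++;
	return false;
}
```

Because each severity has its own window, a flood of correctable errors from
one device does not suppress an uncorrectable error from the same device, or
any message from a different device.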

Motivation
==========

Inconsistent PCIe error handling, exacerbated at datacenter scale (myriad
of devices), affects repairability flows for fleet operators.

Exposing PCIe errors/debug info in-band for a userspace daemon (e.g.
rasdaemon) to collect/pass on to repairability services will allow for more
predictable repair flows and decrease machine downtime.

Background
==========

AER error spam has been observed many times, both publicly (e.g. [1], [2],
[3]) and privately. While it usually occurs with correctable errors, it can
happen with uncorrectable errors (e.g. during new HW bringup).

There have been previous attempts to add ratelimits to AER logs ([4], [5]).
The most recent attempt[5] has many similarities with the proposed
approach.


v6:
- Rebase to v6.15-rc1
- Initialize struct aer_err_info completely before using it
- Log DPC Error Source ID only when it's valid
- Consolidate AER Error Source ID logging to one place
- Tidy Error Source ID bus/dev/fn decoding using macros
- Rename aer_print_port_info() to aer_print_source()
- Consolidate trace events and statistic updates to one non-ratelimited place
- Save log level in struct aer_err_info instead of passing as parameter
v5: https://lore.kernel.org/r/20250321015806.954866-1-pandoh@google.com
- Handle multi-error AER by evaluating ratelimits once and storing result
- Reword/rename commit messages/functions/variables
v4: https://lore.kernel.org/r/20250320082057.622983-1-pandoh@google.com
- Fix bug where trace not emitted with malformed aer_err_info
- Extend ratelimit to malformed aer_err_info
- Update commit messages with patch motivation
- Squash AER sysfs filename change (Patch 8)
v3: https://lore.kernel.org/r/20250319084050.366718-1-pandoh@google.com
- Ratelimit aer_print_port_info() (drop Patch 1)
- Add ratelimit enable toggle
- Move trace outside of ratelimit
- Split log level (Patch 2) into two
- More descriptive documentation/sysfs naming
v2: https://lore.kernel.org/r/20250214023543.992372-1-pandoh@google.com
- Rebased on top of pci/aer (v6.14-rc1)
- Split series into log and IRQ ratelimits (defer patch 5)
- Dropped patch 8 (Move AER sysfs)
- Added log level cleanup patch[7] from Karolina's series
- Fixed bug where dpc errors didn't increment counters
- "X callbacks suppressed" message on ratelimit release -> immediately
- Separate documentation into own patch
v1: https://lore.kernel.org/r/20250115074301.3514927-1-pandoh@google.com

[1] https://bugzilla.kernel.org/show_bug.cgi?id=215027
[2] https://bugzilla.kernel.org/show_bug.cgi?id=201517
[3] https://bugzilla.kernel.org/show_bug.cgi?id=196183
[4] https://lore.kernel.org/linux-pci/20230606035442.2886343-2-grundler@chromium.org/
[5] https://lore.kernel.org/linux-pci/cover.1736341506.git.karolina.stolarek@oracle.com/
[6] https://lore.kernel.org/linux-pci/8bcb8c9a7b38ce3bdaca5a64fe76f08b0b337511.1742202797.git.karolina.stolarek@oracle.com/
[7] https://lore.kernel.org/linux-pci/edd77011aafad4c0654358a26b4e538d0c5a321d.1736341506.git.karolina.stolarek@oracle.com/


Bjorn Helgaas (9):
  PCI/DPC: Initialize aer_err_info before using it
  PCI/DPC: Log Error Source ID only when valid
  PCI/AER: Consolidate Error Source ID logging in aer_print_port_info()
  PCI/AER: Extract bus/dev/fn in aer_print_port_info() with
    PCI_BUS_NUM(), etc
  PCI/AER: Move aer_print_source() earlier in file
  PCI/AER: Initialize aer_err_info before using it
  PCI/AER: Simplify pci_print_aer()
  PCI/AER: Update statistics early in logging
  PCI/AER: Combine trace_aer_event() with statistics updates

Jon Pan-Doh (4):
  PCI/AER: Rename aer_print_port_info() to aer_print_source()
  PCI/AER: Introduce ratelimit for error logs
  PCI/AER: Add ratelimits to PCI AER Documentation
  PCI/AER: Add sysfs attributes for log ratelimits

Karolina Stolarek (3):
  PCI/AER: Check log level once and remember it
  PCI/AER: Make all pci_print_aer() log levels depend on error type
  PCI/AER: Rename struct aer_stats to aer_report

 ...es-aer_stats => sysfs-bus-pci-devices-aer} |  34 ++
 Documentation/PCI/pcieaer-howto.rst           |  16 +-
 drivers/pci/pci-sysfs.c                       |   1 +
 drivers/pci/pci.h                             |   5 +-
 drivers/pci/pcie/aer.c                        | 346 ++++++++++++------
 drivers/pci/pcie/dpc.c                        |  49 ++-
 include/linux/pci.h                           |   2 +-
 7 files changed, 329 insertions(+), 124 deletions(-)
 rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (77%)

-- 
2.43.0



* [PATCH v6 01/16] PCI/DPC: Initialize aer_err_info before using it
  2025-05-19 21:35 [PATCH v6 00/16] Rate limit AER logs Bjorn Helgaas
@ 2025-05-19 21:35 ` Bjorn Helgaas
  2025-05-19 22:41   ` Sathyanarayanan Kuppuswamy
  2025-05-20  9:39   ` Ilpo Järvinen
  2025-05-19 21:35 ` [PATCH v6 02/16] PCI/DPC: Log Error Source ID only when valid Bjorn Helgaas
                   ` (15 subsequent siblings)
  16 siblings, 2 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-19 21:35 UTC (permalink / raw)
  To: linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

From: Bjorn Helgaas <bhelgaas@google.com>

Previously the struct aer_err_info "info" was allocated on the stack
without being initialized, so it contained junk except for the fields we
explicitly set later.

Initialize "info" at declaration so it starts as all zeroes.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
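For readers less familiar with the idiom, here is a minimal userspace sketch
of what "= { 0 }" guarantees. The struct is a hypothetical stand-in with only
a few members; the real struct aer_err_info is larger.

```c
#include <assert.h>

/* Hypothetical stand-in; the real struct aer_err_info has more members. */
struct err_info {
	unsigned int id;
	unsigned int severity;
	unsigned int status;
	unsigned int mask;
};

/* Build a record where only "severity" is assigned explicitly. */
static struct err_info make_info(unsigned int severity)
{
	/*
	 * "{ 0 }" sets the first member to 0 and, per the C initializer
	 * rules, every remaining member to zero as well.  Without it, the
	 * members not assigned below would hold whatever was on the stack.
	 */
	struct err_info info = { 0 };

	assert(severity <= 2);	/* sketch only: three severities assumed */
	info.severity = severity;
	return info;
}
```
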
 drivers/pci/pcie/dpc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index df42f15c9829..fe7719238456 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -258,7 +258,7 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,
 void dpc_process_error(struct pci_dev *pdev)
 {
 	u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
-	struct aer_err_info info;
+	struct aer_err_info info = { 0 };
 
 	pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
 	pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &source);
-- 
2.43.0



* [PATCH v6 02/16] PCI/DPC: Log Error Source ID only when valid
  2025-05-19 21:35 [PATCH v6 00/16] Rate limit AER logs Bjorn Helgaas
  2025-05-19 21:35 ` [PATCH v6 01/16] PCI/DPC: Initialize aer_err_info before using it Bjorn Helgaas
@ 2025-05-19 21:35 ` Bjorn Helgaas
  2025-05-19 23:15   ` Sathyanarayanan Kuppuswamy
  2025-05-20 10:28   ` Ilpo Järvinen
  2025-05-19 21:35 ` [PATCH v6 03/16] PCI/AER: Consolidate Error Source ID logging in aer_print_port_info() Bjorn Helgaas
                   ` (14 subsequent siblings)
  16 siblings, 2 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-19 21:35 UTC (permalink / raw)
  To: linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

From: Bjorn Helgaas <bhelgaas@google.com>

DPC Error Source ID is only valid when the DPC Trigger Reason indicates
that DPC was triggered due to reception of an ERR_NONFATAL or ERR_FATAL
Message (PCIe r6.0, sec 7.9.14.5).

When DPC was triggered by ERR_NONFATAL (PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE)
or ERR_FATAL (PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) from a downstream device,
log the Error Source ID (decoded into domain/bus/device/function).  Don't
print the source otherwise, since it's not valid.

For DPC trigger due to reception of ERR_NONFATAL or ERR_FATAL, the dmesg
logging changes:

  - pci 0000:00:01.0: DPC: containment event, status:0x000d source:0x0200
  - pci 0000:00:01.0: DPC: ERR_FATAL detected
  + pci 0000:00:01.0: DPC: containment event, status:0x000d, ERR_FATAL received from 0000:02:00.0

and when DPC triggered for other reasons, where DPC Error Source ID is
undefined, e.g., unmasked uncorrectable error:

  - pci 0000:00:01.0: DPC: containment event, status:0x0009 source:0x0200
  - pci 0000:00:01.0: DPC: unmasked uncorrectable error detected
  + pci 0000:00:01.0: DPC: containment event, status:0x0009: unmasked uncorrectable error detected

Previously the "containment event" message was at KERN_INFO and the
"%s detected" message was at KERN_WARNING.  Now the single message is at
KERN_WARNING.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
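The decode described above can be tried in userspace with the sketch below.
The DPC_* names are local to this example, and their values are assumed to
mirror the PCI_EXP_DPC_STATUS_TRIGGER_RSN* definitions; they agree with the
two status values in the sample dmesg output (0x000d and 0x0009).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Local names; values assumed to mirror PCI_EXP_DPC_STATUS_TRIGGER_RSN*. */
#define DPC_RSN_MASK		0x0006	/* Trigger Reason field */
#define DPC_RSN_UNCOR		0x0000	/* unmasked uncorrectable error */
#define DPC_RSN_NFE		0x0002	/* received ERR_NONFATAL */
#define DPC_RSN_FE		0x0004	/* received ERR_FATAL */
#define DPC_RSN_IN_EXT		0x0006	/* see Trigger Reason Extension */
#define DPC_RSN_EXT_MASK	0x0060	/* Trigger Reason Extension field */
#define DPC_RSN_EXT_RP_PIO	0x0000
#define DPC_RSN_EXT_SW_TRIGGER	0x0020

/* The Error Source ID is meaningful only for ERR_NONFATAL/ERR_FATAL. */
static bool dpc_source_id_valid(uint16_t status)
{
	uint16_t reason = status & DPC_RSN_MASK;

	return reason == DPC_RSN_NFE || reason == DPC_RSN_FE;
}

/* Map a DPC Status value to the reason text used in the log message. */
static const char *dpc_reason_string(uint16_t status)
{
	switch (status & DPC_RSN_MASK) {
	case DPC_RSN_UNCOR:
		return "unmasked uncorrectable error";
	case DPC_RSN_NFE:
		return "ERR_NONFATAL";
	case DPC_RSN_FE:
		return "ERR_FATAL";
	default:	/* DPC_RSN_IN_EXT: consult the extension field */
		switch (status & DPC_RSN_EXT_MASK) {
		case DPC_RSN_EXT_RP_PIO:
			return "RP PIO error";
		case DPC_RSN_EXT_SW_TRIGGER:
			return "software trigger";
		default:
			return "reserved error";
		}
	}
}
```
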
 drivers/pci/pcie/dpc.c | 45 ++++++++++++++++++++++++++----------------
 1 file changed, 28 insertions(+), 17 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index fe7719238456..315bf2bfd570 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -261,25 +261,36 @@ void dpc_process_error(struct pci_dev *pdev)
 	struct aer_err_info info = { 0 };
 
 	pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
-	pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &source);
-
-	pci_info(pdev, "containment event, status:%#06x source:%#06x\n",
-		 status, source);
 
 	reason = status & PCI_EXP_DPC_STATUS_TRIGGER_RSN;
-	ext_reason = status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT;
-	pci_warn(pdev, "%s detected\n",
-		 (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_UNCOR) ?
-		 "unmasked uncorrectable error" :
-		 (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE) ?
-		 "ERR_NONFATAL" :
-		 (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) ?
-		 "ERR_FATAL" :
-		 (ext_reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_RP_PIO) ?
-		 "RP PIO error" :
-		 (ext_reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_SW_TRIGGER) ?
-		 "software trigger" :
-		 "reserved error");
+
+	switch (reason) {
+	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_UNCOR:
+		pci_warn(pdev, "containment event, status:%#06x: unmasked uncorrectable error detected\n",
+			 status);
+		break;
+	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE:
+	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE:
+		pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID,
+				     &source);
+		pci_warn(pdev, "containment event, status:%#06x, %s received from %04x:%02x:%02x.%d\n",
+			 status,
+			 (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) ?
+				"ERR_FATAL" : "ERR_NONFATAL",
+			 pci_domain_nr(pdev->bus), PCI_BUS_NUM(source),
+			 PCI_SLOT(source), PCI_FUNC(source));
+		return;
+	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_IN_EXT:
+		ext_reason = status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT;
+		pci_warn(pdev, "containment event, status:%#06x: %s detected\n",
+			 status,
+			 (ext_reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_RP_PIO) ?
+			 "RP PIO error" :
+			 (ext_reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_SW_TRIGGER) ?
+			 "software trigger" :
+			 "reserved error");
+		break;
+	}
 
 	/* show RP PIO error detail information */
 	if (pdev->dpc_rp_extensions &&
-- 
2.43.0



* [PATCH v6 03/16] PCI/AER: Consolidate Error Source ID logging in aer_print_port_info()
  2025-05-19 21:35 [PATCH v6 00/16] Rate limit AER logs Bjorn Helgaas
  2025-05-19 21:35 ` [PATCH v6 01/16] PCI/DPC: Initialize aer_err_info before using it Bjorn Helgaas
  2025-05-19 21:35 ` [PATCH v6 02/16] PCI/DPC: Log Error Source ID only when valid Bjorn Helgaas
@ 2025-05-19 21:35 ` Bjorn Helgaas
  2025-05-19 23:39   ` Sathyanarayanan Kuppuswamy
  2025-05-20 10:31   ` Ilpo Järvinen
  2025-05-19 21:35 ` [PATCH v6 04/16] PCI/AER: Extract bus/dev/fn in aer_print_port_info() with PCI_BUS_NUM(), etc Bjorn Helgaas
                   ` (13 subsequent siblings)
  16 siblings, 2 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-19 21:35 UTC (permalink / raw)
  To: linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

From: Bjorn Helgaas <bhelgaas@google.com>

Previously we decoded the AER Error Source ID in two places.  Consolidate
them so both places use aer_print_port_info().  Add a "details" parameter
so we can add a note when we didn't find any downstream devices with errors
logged in their AER Capability.

When we didn't read any error details from the source device, we logged two
messages: one in aer_isr_one_error() and another in find_source_device().
Since they both contain the same information, only log the first one when
find_source_device() has found error details.

This changes the dmesg logging when we found no devices with errors logged:

  - pci 0000:00:01.0: AER: Correctable error message received from 0000:02:00.0
  - pci 0000:00:01.0: AER: found no error details for 0000:02:00.0
  + pci 0000:00:01.0: AER: Correctable error message received from 0000:02:00.0 (no details found)

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/pcie/aer.c | 30 ++++++++++++++++--------------
 1 file changed, 16 insertions(+), 14 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a1cf8c7ef628..b8494ccd935b 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -733,16 +733,17 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 			info->severity, info->tlp_header_valid, &info->tlp);
 }
 
-static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
+static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info,
+				const char *details)
 {
 	u8 bus = info->id >> 8;
 	u8 devfn = info->id & 0xff;
 
-	pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d\n",
+	pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d%s\n",
 		 info->multi_error_valid ? "Multiple " : "",
 		 aer_error_severity_string[info->severity],
 		 pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
-		 PCI_FUNC(devfn));
+		 PCI_FUNC(devfn), details);
 }
 
 #ifdef CONFIG_ACPI_APEI_PCIEAER
@@ -926,13 +927,13 @@ static bool find_source_device(struct pci_dev *parent,
 	else
 		pci_walk_bus(parent->subordinate, find_device_iter, e_info);
 
+	/*
+	 * If we didn't find any devices with errors logged in the AER
+	 * Capability, just print the Error Source ID from the Root Port or
+	 * RCEC that received an ERR_* Message.
+	 */
 	if (!e_info->error_dev_num) {
-		u8 bus = e_info->id >> 8;
-		u8 devfn = e_info->id & 0xff;
-
-		pci_info(parent, "found no error details for %04x:%02x:%02x.%d\n",
-			 pci_domain_nr(parent->bus), bus, PCI_SLOT(devfn),
-			 PCI_FUNC(devfn));
+		aer_print_port_info(parent, e_info, " (no details found)");
 		return false;
 	}
 	return true;
@@ -1297,10 +1298,11 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
 			e_info.multi_error_valid = 1;
 		else
 			e_info.multi_error_valid = 0;
-		aer_print_port_info(pdev, &e_info);
 
-		if (find_source_device(pdev, &e_info))
+		if (find_source_device(pdev, &e_info)) {
+			aer_print_port_info(pdev, &e_info, "");
 			aer_process_err_devices(&e_info);
+		}
 	}
 
 	if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
@@ -1316,10 +1318,10 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
 		else
 			e_info.multi_error_valid = 0;
 
-		aer_print_port_info(pdev, &e_info);
-
-		if (find_source_device(pdev, &e_info))
+		if (find_source_device(pdev, &e_info)) {
+			aer_print_port_info(pdev, &e_info, "");
 			aer_process_err_devices(&e_info);
+		}
 	}
 }
 
-- 
2.43.0



* [PATCH v6 04/16] PCI/AER: Extract bus/dev/fn in aer_print_port_info() with PCI_BUS_NUM(), etc
  2025-05-19 21:35 [PATCH v6 00/16] Rate limit AER logs Bjorn Helgaas
                   ` (2 preceding siblings ...)
  2025-05-19 21:35 ` [PATCH v6 03/16] PCI/AER: Consolidate Error Source ID logging in aer_print_port_info() Bjorn Helgaas
@ 2025-05-19 21:35 ` Bjorn Helgaas
  2025-05-19 23:47   ` Sathyanarayanan Kuppuswamy
  2025-05-20 10:32   ` Ilpo Järvinen
  2025-05-19 21:35 ` [PATCH v6 05/16] PCI/AER: Rename aer_print_port_info() to aer_print_source() Bjorn Helgaas
                   ` (12 subsequent siblings)
  16 siblings, 2 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-19 21:35 UTC (permalink / raw)
  To: linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

From: Bjorn Helgaas <bhelgaas@google.com>

Use PCI_BUS_NUM(), PCI_SLOT(), PCI_FUNC() to extract the bus number,
device, and function number directly from the Error Source ID.  There's no
need to shift and mask it explicitly.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
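A small userspace sketch of the same decoding, for reference. The SRC_*
macros are local stand-ins written to behave like PCI_BUS_NUM(), PCI_SLOT(),
and PCI_FUNC() when applied to a 16-bit Source ID, and format_source() is a
hypothetical helper that produces the domain:bus:dev.fn text used in the log
messages.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Local stand-ins for PCI_BUS_NUM(), PCI_SLOT(), and PCI_FUNC(). */
#define SRC_BUS(id)	(((id) >> 8) & 0xff)	/* bits 15:8 */
#define SRC_SLOT(id)	(((id) >> 3) & 0x1f)	/* bits 7:3 */
#define SRC_FUNC(id)	((id) & 0x07)		/* bits 2:0 */

/*
 * Format a 16-bit Error Source ID as domain:bus:dev.fn.  Because the
 * slot and function macros mask their result, they can take the whole
 * Source ID; there is no need to split out a devfn byte first.
 */
static int format_source(char *buf, size_t len, unsigned int domain,
			 uint16_t id)
{
	return snprintf(buf, len, "%04x:%02x:%02x.%d", domain,
			(unsigned int)SRC_BUS(id),
			(unsigned int)SRC_SLOT(id),
			(int)SRC_FUNC(id));
}
```

With domain 0, the Source ID 0x0200 from the earlier log samples formats as
0000:02:00.0.
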
 drivers/pci/pcie/aer.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index b8494ccd935b..dc8a50e0a2b7 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -736,14 +736,13 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info,
 				const char *details)
 {
-	u8 bus = info->id >> 8;
-	u8 devfn = info->id & 0xff;
+	u16 source = info->id;
 
 	pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d%s\n",
 		 info->multi_error_valid ? "Multiple " : "",
 		 aer_error_severity_string[info->severity],
-		 pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
-		 PCI_FUNC(devfn), details);
+		 pci_domain_nr(dev->bus), PCI_BUS_NUM(source),
+		 PCI_SLOT(source), PCI_FUNC(source), details);
 }
 
 #ifdef CONFIG_ACPI_APEI_PCIEAER
-- 
2.43.0



* [PATCH v6 05/16] PCI/AER: Rename aer_print_port_info() to aer_print_source()
  2025-05-19 21:35 [PATCH v6 00/16] Rate limit AER logs Bjorn Helgaas
                   ` (3 preceding siblings ...)
  2025-05-19 21:35 ` [PATCH v6 04/16] PCI/AER: Extract bus/dev/fn in aer_print_port_info() with PCI_BUS_NUM(), etc Bjorn Helgaas
@ 2025-05-19 21:35 ` Bjorn Helgaas
  2025-05-19 23:48   ` Sathyanarayanan Kuppuswamy
  2025-05-20 10:33   ` Ilpo Järvinen
  2025-05-19 21:35 ` [PATCH v6 06/16] PCI/AER: Move aer_print_source() earlier in file Bjorn Helgaas
                   ` (11 subsequent siblings)
  16 siblings, 2 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-19 21:35 UTC (permalink / raw)
  To: linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

From: Jon Pan-Doh <pandoh@google.com>

Rename aer_print_port_info() to aer_print_source() to be more descriptive.
This function logs the Error Source ID recorded by a Root Port or Root
Complex Event Collector when it receives an ERR_COR, ERR_NONFATAL, or
ERR_FATAL Message.

[bhelgaas: aer_print_rp_info() -> aer_print_source()]
Link: https://lore.kernel.org/r/20250321015806.954866-5-pandoh@google.com
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/pcie/aer.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index dc8a50e0a2b7..eb42d50b2def 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -733,8 +733,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 			info->severity, info->tlp_header_valid, &info->tlp);
 }
 
-static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info,
-				const char *details)
+static void aer_print_source(struct pci_dev *dev, struct aer_err_info *info,
+			     const char *details)
 {
 	u16 source = info->id;
 
@@ -932,7 +932,7 @@ static bool find_source_device(struct pci_dev *parent,
 	 * RCEC that received an ERR_* Message.
 	 */
 	if (!e_info->error_dev_num) {
-		aer_print_port_info(parent, e_info, " (no details found)");
+		aer_print_source(parent, e_info, " (no details found)");
 		return false;
 	}
 	return true;
@@ -1299,7 +1299,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
 			e_info.multi_error_valid = 0;
 
 		if (find_source_device(pdev, &e_info)) {
-			aer_print_port_info(pdev, &e_info, "");
+			aer_print_source(pdev, &e_info, "");
 			aer_process_err_devices(&e_info);
 		}
 	}
@@ -1318,7 +1318,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
 			e_info.multi_error_valid = 0;
 
 		if (find_source_device(pdev, &e_info)) {
-			aer_print_port_info(pdev, &e_info, "");
+			aer_print_source(pdev, &e_info, "");
 			aer_process_err_devices(&e_info);
 		}
 	}
-- 
2.43.0



* [PATCH v6 06/16] PCI/AER: Move aer_print_source() earlier in file
  2025-05-19 21:35 [PATCH v6 00/16] Rate limit AER logs Bjorn Helgaas
                   ` (4 preceding siblings ...)
  2025-05-19 21:35 ` [PATCH v6 05/16] PCI/AER: Rename aer_print_port_info() to aer_print_source() Bjorn Helgaas
@ 2025-05-19 21:35 ` Bjorn Helgaas
  2025-05-19 23:49   ` Sathyanarayanan Kuppuswamy
  2025-05-20 10:34   ` Ilpo Järvinen
  2025-05-19 21:35 ` [PATCH v6 07/16] PCI/AER: Initialize aer_err_info before using it Bjorn Helgaas
                   ` (10 subsequent siblings)
  16 siblings, 2 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-19 21:35 UTC (permalink / raw)
  To: linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

From: Bjorn Helgaas <bhelgaas@google.com>

Move aer_print_source() earlier in the file so a future change can use it
from aer_print_error(), where it's easier to rate limit it.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/pcie/aer.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index eb42d50b2def..95a4cab1d517 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -696,6 +696,18 @@ static void __aer_print_error(struct pci_dev *dev,
 	pci_dev_aer_stats_incr(dev, info);
 }
 
+static void aer_print_source(struct pci_dev *dev, struct aer_err_info *info,
+			     const char *details)
+{
+	u16 source = info->id;
+
+	pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d%s\n",
+		 info->multi_error_valid ? "Multiple " : "",
+		 aer_error_severity_string[info->severity],
+		 pci_domain_nr(dev->bus), PCI_BUS_NUM(source),
+		 PCI_SLOT(source), PCI_FUNC(source), details);
+}
+
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 {
 	int layer, agent;
@@ -733,18 +745,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 			info->severity, info->tlp_header_valid, &info->tlp);
 }
 
-static void aer_print_source(struct pci_dev *dev, struct aer_err_info *info,
-			     const char *details)
-{
-	u16 source = info->id;
-
-	pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d%s\n",
-		 info->multi_error_valid ? "Multiple " : "",
-		 aer_error_severity_string[info->severity],
-		 pci_domain_nr(dev->bus), PCI_BUS_NUM(source),
-		 PCI_SLOT(source), PCI_FUNC(source), details);
-}
-
 #ifdef CONFIG_ACPI_APEI_PCIEAER
 int cper_severity_to_aer(int cper_severity)
 {
-- 
2.43.0



* [PATCH v6 07/16] PCI/AER: Initialize aer_err_info before using it
  2025-05-19 21:35 [PATCH v6 00/16] Rate limit AER logs Bjorn Helgaas
                   ` (5 preceding siblings ...)
  2025-05-19 21:35 ` [PATCH v6 06/16] PCI/AER: Move aer_print_source() earlier in file Bjorn Helgaas
@ 2025-05-19 21:35 ` Bjorn Helgaas
  2025-05-19 23:50   ` Sathyanarayanan Kuppuswamy
  2025-05-20 10:39   ` Ilpo Järvinen
  2025-05-19 21:35 ` [PATCH v6 08/16] PCI/AER: Simplify pci_print_aer() Bjorn Helgaas
                   ` (9 subsequent siblings)
  16 siblings, 2 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-19 21:35 UTC (permalink / raw)
  To: linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

From: Bjorn Helgaas <bhelgaas@google.com>

Previously the struct aer_err_info "e_info" was allocated on the stack
without being initialized, so it contained junk except for the fields we
explicitly set later.

Initialize "e_info" at declaration with a designated initializer list,
which initializes the other members to zero.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
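The pattern can be sketched in userspace as follows. The struct, the ROOT_*
bit values, and the "upper 16 bits" Source ID layout are assumptions made for
this example, chosen to resemble the Root Error Status handling in the diff.
The sketch also assumes a one-bit multi_error_valid member, which is why the
flag is normalized with "? 1 : 0" rather than assigned the masked value.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit values, modeled on the Root Error Status register. */
#define ROOT_UNCOR_RCV		0x04
#define ROOT_MULTI_UNCOR_RCV	0x08
#define ROOT_FATAL_RCV		0x40

enum { SEV_NONFATAL, SEV_FATAL, SEV_CORRECTABLE };

/* Hypothetical stand-in for struct aer_err_info. */
struct err_info {
	uint16_t id;
	unsigned int severity : 2;
	unsigned int multi_error_valid : 1;
	int error_dev_num;
	uint32_t status;
};

/* Build the record for a received uncorrectable error message. */
static struct err_info uncor_info(uint32_t root_status, uint32_t source_id)
{
	int fatal = root_status & ROOT_FATAL_RCV;
	int multi = root_status & ROOT_MULTI_UNCOR_RCV;
	/*
	 * Members not named in the initializer (error_dev_num, status)
	 * are zeroed.  Assigning "multi" (0x08) directly to a one-bit
	 * field would truncate it to 0, hence the explicit 1 or 0.
	 */
	struct err_info info = {
		.id = source_id >> 16,	/* assumed: uncorrectable ID in upper half */
		.severity = fatal ? SEV_FATAL : SEV_NONFATAL,
		.multi_error_valid = multi ? 1 : 0,
	};

	return info;
}
```
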
 drivers/pci/pcie/aer.c | 37 ++++++++++++++++---------------------
 1 file changed, 16 insertions(+), 21 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 95a4cab1d517..40f003eca1c5 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1281,7 +1281,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
 		struct aer_err_source *e_src)
 {
 	struct pci_dev *pdev = rpc->rpd;
-	struct aer_err_info e_info;
+	u32 status = e_src->status;
 
 	pci_rootport_aer_stats_incr(pdev, e_src);
 
@@ -1289,14 +1289,13 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
 	 * There is a possibility that both correctable error and
 	 * uncorrectable error being logged. Report correctable error first.
 	 */
-	if (e_src->status & PCI_ERR_ROOT_COR_RCV) {
-		e_info.id = ERR_COR_ID(e_src->id);
-		e_info.severity = AER_CORRECTABLE;
-
-		if (e_src->status & PCI_ERR_ROOT_MULTI_COR_RCV)
-			e_info.multi_error_valid = 1;
-		else
-			e_info.multi_error_valid = 0;
+	if (status & PCI_ERR_ROOT_COR_RCV) {
+		int multi = status & PCI_ERR_ROOT_MULTI_COR_RCV;
+		struct aer_err_info e_info = {
+			.id = ERR_COR_ID(e_src->id),
+			.severity = AER_CORRECTABLE,
+			.multi_error_valid = multi ? 1 : 0,
+		};
 
 		if (find_source_device(pdev, &e_info)) {
 			aer_print_source(pdev, &e_info, "");
@@ -1304,18 +1303,14 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
 		}
 	}
 
-	if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
-		e_info.id = ERR_UNCOR_ID(e_src->id);
-
-		if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
-			e_info.severity = AER_FATAL;
-		else
-			e_info.severity = AER_NONFATAL;
-
-		if (e_src->status & PCI_ERR_ROOT_MULTI_UNCOR_RCV)
-			e_info.multi_error_valid = 1;
-		else
-			e_info.multi_error_valid = 0;
+	if (status & PCI_ERR_ROOT_UNCOR_RCV) {
+		int fatal = status & PCI_ERR_ROOT_FATAL_RCV;
+		int multi = status & PCI_ERR_ROOT_MULTI_UNCOR_RCV;
+		struct aer_err_info e_info = {
+			.id = ERR_UNCOR_ID(e_src->id),
+			.severity = fatal ? AER_FATAL : AER_NONFATAL,
+			.multi_error_valid = multi ? 1 : 0,
+		};
 
 		if (find_source_device(pdev, &e_info)) {
 			aer_print_source(pdev, &e_info, "");
-- 
2.43.0



* [PATCH v6 08/16] PCI/AER: Simplify pci_print_aer()
  2025-05-19 21:35 [PATCH v6 00/16] Rate limit AER logs Bjorn Helgaas
                   ` (6 preceding siblings ...)
  2025-05-19 21:35 ` [PATCH v6 07/16] PCI/AER: Initialize aer_err_info before using it Bjorn Helgaas
@ 2025-05-19 21:35 ` Bjorn Helgaas
  2025-05-20  0:02   ` Sathyanarayanan Kuppuswamy
  2025-05-20 10:42   ` Ilpo Järvinen
  2025-05-19 21:35 ` [PATCH v6 09/16] PCI/AER: Update statistics early in logging Bjorn Helgaas
                   ` (8 subsequent siblings)
  16 siblings, 2 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-19 21:35 UTC (permalink / raw)
  To: linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

From: Bjorn Helgaas <bhelgaas@google.com>

Simplify pci_print_aer() by initializing the struct aer_err_info "info"
with a designated initializer list (it was previously initialized with
memset()) and using pci_name().

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/pcie/aer.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 40f003eca1c5..73d618354f6a 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -765,7 +765,10 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 {
 	int layer, agent, tlp_header_valid = 0;
 	u32 status, mask;
-	struct aer_err_info info;
+	struct aer_err_info info = {
+		.severity = aer_severity,
+		.first_error = PCI_ERR_CAP_FEP(aer->cap_control),
+	};
 
 	if (aer_severity == AER_CORRECTABLE) {
 		status = aer->cor_status;
@@ -776,14 +779,11 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 		tlp_header_valid = status & AER_LOG_TLP_MASKS;
 	}
 
-	layer = AER_GET_LAYER_ERROR(aer_severity, status);
-	agent = AER_GET_AGENT(aer_severity, status);
-
-	memset(&info, 0, sizeof(info));
-	info.severity = aer_severity;
 	info.status = status;
 	info.mask = mask;
-	info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
+
+	layer = AER_GET_LAYER_ERROR(aer_severity, status);
+	agent = AER_GET_AGENT(aer_severity, status);
 
 	pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
 	__aer_print_error(dev, &info);
@@ -797,7 +797,7 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 	if (tlp_header_valid)
 		pcie_print_tlp_log(dev, &aer->header_log, dev_fmt("  "));
 
-	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
+	trace_aer_event(pci_name(dev), (status & ~mask),
 			aer_severity, tlp_header_valid, &aer->header_log);
 }
 EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
-- 
2.43.0
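
Note on the initializer change: the following is a minimal userspace sketch,
not kernel code. It assumes a hypothetical struct demo_err_info as a stand-in
for struct aer_err_info, and illustrates why a designated initializer can
replace memset() plus field assignments: members that are not named in the
initializer are zero-initialized.

```c
#include <string.h>

/* Hypothetical stand-in for struct aer_err_info; not the kernel definition. */
struct demo_err_info {
	int severity;
	unsigned int status;
	unsigned int mask;
	unsigned int first_error;
};

/* Old style: clear everything, then assign field by field. */
static struct demo_err_info demo_init_memset(int severity, unsigned int fep)
{
	struct demo_err_info info;

	memset(&info, 0, sizeof(info));
	info.severity = severity;
	info.first_error = fep;
	return info;
}

/* New style: members not named in the initializer are zero-initialized. */
static struct demo_err_info demo_init_designated(int severity, unsigned int fep)
{
	struct demo_err_info info = {
		.severity = severity,
		.first_error = fep,
	};

	return info;
}

/* Returns 1 if both styles produce the same member values. */
static int demo_inits_match(int severity, unsigned int fep)
{
	struct demo_err_info a = demo_init_memset(severity, fep);
	struct demo_err_info b = demo_init_designated(severity, fep);

	return a.severity == b.severity && a.status == b.status &&
	       a.mask == b.mask && a.first_error == b.first_error;
}
```

The comparison is member by member on purpose: a designated initializer
guarantees zeroed members, while memset() also clears any padding bytes.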



* [PATCH v6 09/16] PCI/AER: Update statistics early in logging
  2025-05-19 21:35 [PATCH v6 00/16] Rate limit AER logs Bjorn Helgaas
                   ` (7 preceding siblings ...)
  2025-05-19 21:35 ` [PATCH v6 08/16] PCI/AER: Simplify pci_print_aer() Bjorn Helgaas
@ 2025-05-19 21:35 ` Bjorn Helgaas
  2025-05-20  1:32   ` Sathyanarayanan Kuppuswamy
  2025-05-20 11:04   ` Ilpo Järvinen
  2025-05-19 21:35 ` [PATCH v6 10/16] PCI/AER: Combine trace_aer_event() with statistics updates Bjorn Helgaas
                   ` (7 subsequent siblings)
  16 siblings, 2 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-19 21:35 UTC (permalink / raw)
  To: linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

From: Bjorn Helgaas <bhelgaas@google.com>

There are two AER logging entry points:

  - aer_print_error() is used by DPC (dpc_process_error()) and native AER
    handling (aer_process_err_devices()).

  - pci_print_aer() is used by GHES (aer_recover_work_func()) and CXL
    (cxl_handle_rdport_errors()).

Both use __aer_print_error() to print the AER error bits.  Previously
__aer_print_error() also incremented the AER statistics via
pci_dev_aer_stats_incr().

Call pci_dev_aer_stats_incr() early in the entry points instead of in
__aer_print_error() so we update the statistics even if the actual printing
of error bits is rate limited by a future change.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/pcie/aer.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 73d618354f6a..eb80c382187d 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -693,7 +693,6 @@ static void __aer_print_error(struct pci_dev *dev,
 		aer_printk(level, dev, "   [%2d] %-22s%s\n", i, errmsg,
 				info->first_error == i ? " (First)" : "");
 	}
-	pci_dev_aer_stats_incr(dev, info);
 }
 
 static void aer_print_source(struct pci_dev *dev, struct aer_err_info *info,
@@ -714,6 +713,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 	int id = pci_dev_id(dev);
 	const char *level;
 
+	pci_dev_aer_stats_incr(dev, info);
+
 	if (!info->status) {
 		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
 			aer_error_severity_string[info->severity]);
@@ -782,6 +783,8 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 	info.status = status;
 	info.mask = mask;
 
+	pci_dev_aer_stats_incr(dev, &info);
+
 	layer = AER_GET_LAYER_ERROR(aer_severity, status);
 	agent = AER_GET_AGENT(aer_severity, status);
 
-- 
2.43.0
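
Note on the ordering: a minimal userspace sketch of the idea, assuming a
hypothetical struct demo_dev and a may_print flag that stands in for the
future ratelimit decision. It only shows that counting happens before the
early return, so suppressed messages are still counted.

```c
/* Hypothetical device record; field names are illustrative only. */
struct demo_dev {
	unsigned long total_errs;	/* statistics: every error */
	unsigned long printed;		/* messages that reached the log */
};

/*
 * Entry point in the style described above: update statistics first,
 * then decide whether to print.
 */
static void demo_print_error(struct demo_dev *dev, int may_print)
{
	dev->total_errs++;
	if (!may_print)
		return;
	dev->printed++;		/* stands in for the actual printing */
}
```

If the increment stayed in the printing helper, any early return added in
front of that helper would silently drop the error from the statistics.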



* [PATCH v6 10/16] PCI/AER: Combine trace_aer_event() with statistics updates
  2025-05-19 21:35 [PATCH v6 00/16] Rate limit AER logs Bjorn Helgaas
                   ` (8 preceding siblings ...)
  2025-05-19 21:35 ` [PATCH v6 09/16] PCI/AER: Update statistics early in logging Bjorn Helgaas
@ 2025-05-19 21:35 ` Bjorn Helgaas
  2025-05-20  1:49   ` Sathyanarayanan Kuppuswamy
  2025-05-20 11:08   ` Ilpo Järvinen
  2025-05-19 21:35 ` [PATCH v6 11/16] PCI/AER: Check log level once and remember it Bjorn Helgaas
                   ` (6 subsequent siblings)
  16 siblings, 2 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-19 21:35 UTC (permalink / raw)
  To: linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

From: Bjorn Helgaas <bhelgaas@google.com>

As with the AER statistics, we always want to emit trace events, even if
the actual dmesg logging is rate limited.

Call trace_aer_event() directly from pci_dev_aer_stats_incr(), where we
update the statistics.

Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/pcie/aer.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index eb80c382187d..4683a99c7568 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -625,6 +625,9 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
 	u64 *counter = NULL;
 	struct aer_stats *aer_stats = pdev->aer_stats;
 
+	trace_aer_event(pci_name(pdev), (info->status & ~info->mask),
+			info->severity, info->tlp_header_valid, &info->tlp);
+
 	if (!aer_stats)
 		return;
 
@@ -741,9 +744,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 out:
 	if (info->id && info->error_dev_num > 1 && info->id == id)
 		pci_err(dev, "  Error of this Agent is reported first\n");
-
-	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
-			info->severity, info->tlp_header_valid, &info->tlp);
 }
 
 #ifdef CONFIG_ACPI_APEI_PCIEAER
@@ -782,6 +782,9 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 
 	info.status = status;
 	info.mask = mask;
+	info.tlp_header_valid = tlp_header_valid;
+	if (tlp_header_valid)
+		info.tlp = aer->header_log;
 
 	pci_dev_aer_stats_incr(dev, &info);
 
@@ -799,9 +802,6 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 
 	if (tlp_header_valid)
 		pcie_print_tlp_log(dev, &aer->header_log, dev_fmt("  "));
-
-	trace_aer_event(pci_name(dev), (status & ~mask),
-			aer_severity, tlp_header_valid, &aer->header_log);
 }
 EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
 
-- 
2.43.0
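
Note on "status & ~mask": both the trace event and the statistics work on
the status bits that are not masked. The sketch below is a userspace model
under assumptions: struct demo_stats and the per-bit counter loop are
hypothetical and only illustrate how the unmasked bits are derived and
counted.

```c
#include <stdint.h>

#define DEMO_NBITS 32

/* Hypothetical per-bit counters; the kernel keeps these in its own struct. */
struct demo_stats {
	uint64_t per_bit[DEMO_NBITS];
	uint64_t total;
};

/* Bits that are set in status and not masked off. */
static uint32_t demo_unmasked(uint32_t status, uint32_t mask)
{
	return status & ~mask;
}

/* Count one event and bump a counter for every unmasked status bit. */
static void demo_stats_incr(struct demo_stats *s, uint32_t status,
			    uint32_t mask)
{
	uint32_t bits = demo_unmasked(status, mask);
	int i;

	s->total++;
	for (i = 0; i < DEMO_NBITS; i++)
		if (bits & (UINT32_C(1) << i))
			s->per_bit[i]++;
}
```

For example, a status of 0x41 with a mask of 0x01 leaves only bit 6 to be
reported.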



* [PATCH v6 11/16] PCI/AER: Check log level once and remember it
  2025-05-19 21:35 [PATCH v6 00/16] Rate limit AER logs Bjorn Helgaas
                   ` (9 preceding siblings ...)
  2025-05-19 21:35 ` [PATCH v6 10/16] PCI/AER: Combine trace_aer_event() with statistics updates Bjorn Helgaas
@ 2025-05-19 21:35 ` Bjorn Helgaas
  2025-05-19 23:17   ` Weinan Liu
                     ` (2 more replies)
  2025-05-19 21:35 ` [PATCH v6 12/16] PCI/AER: Make all pci_print_aer() log levels depend on error type Bjorn Helgaas
                   ` (5 subsequent siblings)
  16 siblings, 3 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-19 21:35 UTC (permalink / raw)
  To: linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

From: Karolina Stolarek <karolina.stolarek@oracle.com>

When reporting an AER error, we check its type multiple times to determine
the log level for each message. Do this check only in the top-level
functions (aer_isr_one_error(), pci_print_aer()) and save the level in
struct aer_err_info.

[bhelgaas: save log level in struct aer_err_info instead of passing it
as a parameter]
Link: https://lore.kernel.org/r/20250321015806.954866-2-pandoh@google.com
Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/pci.h      |  1 +
 drivers/pci/pcie/aer.c | 21 ++++++++++-----------
 drivers/pci/pcie/dpc.c |  1 +
 3 files changed, 12 insertions(+), 11 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index b81e99cd4b62..705f9ef58acc 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -588,6 +588,7 @@ static inline bool pci_dev_test_and_set_removed(struct pci_dev *dev)
 struct aer_err_info {
 	struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
 	int error_dev_num;
+	const char *level;		/* printk level */
 
 	unsigned int id:16;
 
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 4683a99c7568..73b03a195b14 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -672,21 +672,18 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
 	}
 }
 
-static void __aer_print_error(struct pci_dev *dev,
-			      struct aer_err_info *info)
+static void __aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 {
 	const char **strings;
 	unsigned long status = info->status & ~info->mask;
-	const char *level, *errmsg;
+	const char *level = info->level;
+	const char *errmsg;
 	int i;
 
-	if (info->severity == AER_CORRECTABLE) {
+	if (info->severity == AER_CORRECTABLE)
 		strings = aer_correctable_error_string;
-		level = KERN_WARNING;
-	} else {
+	else
 		strings = aer_uncorrectable_error_string;
-		level = KERN_ERR;
-	}
 
 	for_each_set_bit(i, &status, 32) {
 		errmsg = strings[i];
@@ -714,7 +711,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 {
 	int layer, agent;
 	int id = pci_dev_id(dev);
-	const char *level;
+	const char *level = info->level;
 
 	pci_dev_aer_stats_incr(dev, info);
 
@@ -727,8 +724,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
 	agent = AER_GET_AGENT(info->severity, info->status);
 
-	level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
-
 	aer_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
 		   aer_error_severity_string[info->severity],
 		   aer_error_layer[layer], aer_agent_string[agent]);
@@ -774,9 +769,11 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 	if (aer_severity == AER_CORRECTABLE) {
 		status = aer->cor_status;
 		mask = aer->cor_mask;
+		info.level = KERN_WARNING;
 	} else {
 		status = aer->uncor_status;
 		mask = aer->uncor_mask;
+		info.level = KERN_ERR;
 		tlp_header_valid = status & AER_LOG_TLP_MASKS;
 	}
 
@@ -1297,6 +1294,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
 		struct aer_err_info e_info = {
 			.id = ERR_COR_ID(e_src->id),
 			.severity = AER_CORRECTABLE,
+			.level = KERN_WARNING,
 			.multi_error_valid = multi ? 1 : 0,
 		};
 
@@ -1312,6 +1310,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
 		struct aer_err_info e_info = {
 			.id = ERR_UNCOR_ID(e_src->id),
 			.severity = fatal ? AER_FATAL : AER_NONFATAL,
+			.level = KERN_ERR,
 			.multi_error_valid = multi ? 1 : 0,
 		};
 
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index 315bf2bfd570..34af0ea45c0d 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -252,6 +252,7 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,
 	else
 		info->severity = AER_NONFATAL;
 
+	info->level = KERN_ERR;
 	return 1;
 }
 
-- 
2.43.0
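
Note on remembering the level: a minimal userspace sketch of the pattern,
assuming hypothetical names (struct demo_err_info, DEMO_WARNING, DEMO_ERR)
in place of the kernel's struct and KERN_* strings. The severity values
follow the comment in struct aer_err_info (0:NONFATAL, 1:FATAL, 2:COR).

```c
/* Severity encoding as noted in struct aer_err_info's comment. */
enum demo_severity { DEMO_NONFATAL = 0, DEMO_FATAL = 1, DEMO_COR = 2 };

/* Stand-ins for KERN_WARNING / KERN_ERR; plain strings for this sketch. */
#define DEMO_WARNING "warning"
#define DEMO_ERR     "err"

struct demo_err_info {
	enum demo_severity severity;
	const char *level;	/* decided once, reused by every message */
};

/* Top-level step: pick the level from the severity and remember it. */
static void demo_set_level(struct demo_err_info *info)
{
	info->level = (info->severity == DEMO_COR) ? DEMO_WARNING : DEMO_ERR;
}

/* Lower-level helpers only read info->level; they no longer re-derive it. */
static const char *demo_message_level(const struct demo_err_info *info)
{
	return info->level;
}
```

Keeping the decision in one place means every message printed for one error
uses the same level, and a new entry point only has to set the field once.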



* [PATCH v6 12/16] PCI/AER: Make all pci_print_aer() log levels depend on error type
  2025-05-19 21:35 [PATCH v6 00/16] Rate limit AER logs Bjorn Helgaas
                   ` (10 preceding siblings ...)
  2025-05-19 21:35 ` [PATCH v6 11/16] PCI/AER: Check log level once and remember it Bjorn Helgaas
@ 2025-05-19 21:35 ` Bjorn Helgaas
  2025-05-20  3:23   ` Sathyanarayanan Kuppuswamy
  2025-05-20 11:37   ` Ilpo Järvinen
  2025-05-19 21:35 ` [PATCH v6 13/16] PCI/AER: Rename struct aer_stats to aer_report Bjorn Helgaas
                   ` (4 subsequent siblings)
  16 siblings, 2 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-19 21:35 UTC (permalink / raw)
  To: linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

From: Karolina Stolarek <karolina.stolarek@oracle.com>

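Some messages in pci_print_aer() are always logged at KERN_ERR via
pci_err(), regardless of the error type. Log them at the level that
matches the error type instead (KERN_WARNING for correctable errors,
KERN_ERR for uncorrectable ones), consistent with the rest of AER logging.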

Link: https://lore.kernel.org/r/20250321015806.954866-3-pandoh@google.com
Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/pcie/aer.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 73b03a195b14..06a7dda20846 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -788,15 +788,21 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 	layer = AER_GET_LAYER_ERROR(aer_severity, status);
 	agent = AER_GET_AGENT(aer_severity, status);
 
-	pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
+	aer_printk(info.level, dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n",
+		   status, mask);
 	__aer_print_error(dev, &info);
-	pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
-		aer_error_layer[layer], aer_agent_string[agent]);
+	aer_printk(info.level, dev, "aer_layer=%s, aer_agent=%s\n",
+		   aer_error_layer[layer], aer_agent_string[agent]);
 
 	if (aer_severity != AER_CORRECTABLE)
-		pci_err(dev, "aer_uncor_severity: 0x%08x\n",
-			aer->uncor_severity);
+		aer_printk(info.level, dev, "aer_uncor_severity: 0x%08x\n",
+			   aer->uncor_severity);
 
+	/*
+	 * pcie_print_tlp_log() uses KERN_ERR, but we only call it when
+	 * tlp_header_valid is set, and info.level is always KERN_ERR in
+	 * that case.
+	 */
 	if (tlp_header_valid)
 		pcie_print_tlp_log(dev, &aer->header_log, dev_fmt("  "));
 }
-- 
2.43.0



* [PATCH v6 13/16] PCI/AER: Rename struct aer_stats to aer_report
  2025-05-19 21:35 [PATCH v6 00/16] Rate limit AER logs Bjorn Helgaas
                   ` (11 preceding siblings ...)
  2025-05-19 21:35 ` [PATCH v6 12/16] PCI/AER: Make all pci_print_aer() log levels depend on error type Bjorn Helgaas
@ 2025-05-19 21:35 ` Bjorn Helgaas
  2025-05-20  3:30   ` Sathyanarayanan Kuppuswamy
  2025-05-20 11:38   ` Ilpo Järvinen
  2025-05-19 21:35 ` [PATCH v6 14/16] PCI/AER: Introduce ratelimit for error logs Bjorn Helgaas
                   ` (3 subsequent siblings)
  16 siblings, 2 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-19 21:35 UTC (permalink / raw)
  To: linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

From: Karolina Stolarek <karolina.stolarek@oracle.com>

Rename struct aer_stats to aer_report to reflect that it will hold more
than statistics (e.g. ratelimits). This is a preparatory patch for adding
rate limit support.

Link: https://lore.kernel.org/r/20250321015806.954866-6-pandoh@google.com
Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/pcie/aer.c | 50 +++++++++++++++++++++---------------------
 include/linux/pci.h    |  2 +-
 2 files changed, 26 insertions(+), 26 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 06a7dda20846..da62032bf024 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -54,11 +54,11 @@ struct aer_rpc {
 	DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
 };
 
-/* AER stats for the device */
-struct aer_stats {
+/* AER report for the device */
+struct aer_report {
 
 	/*
-	 * Fields for all AER capable devices. They indicate the errors
+	 * Stats for all AER capable devices. They indicate the errors
 	 * "as seen by this device". Note that this may mean that if an
 	 * Endpoint is causing problems, the AER counters may increment
 	 * at its link partner (e.g. Root Port) because the errors will be
@@ -80,7 +80,7 @@ struct aer_stats {
 	u64 dev_total_nonfatal_errs;
 
 	/*
-	 * Fields for Root Ports & Root Complex Event Collectors only; these
+	 * Stats for Root Ports & Root Complex Event Collectors only; these
 	 * indicate the total number of ERR_COR, ERR_FATAL, and ERR_NONFATAL
 	 * messages received by the Root Port / Event Collector, INCLUDING the
 	 * ones that are generated internally (by the Root Port itself)
@@ -377,7 +377,7 @@ void pci_aer_init(struct pci_dev *dev)
 	if (!dev->aer_cap)
 		return;
 
-	dev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
+	dev->aer_report = kzalloc(sizeof(*dev->aer_report), GFP_KERNEL);
 
 	/*
 	 * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
@@ -398,8 +398,8 @@ void pci_aer_init(struct pci_dev *dev)
 
 void pci_aer_exit(struct pci_dev *dev)
 {
-	kfree(dev->aer_stats);
-	dev->aer_stats = NULL;
+	kfree(dev->aer_report);
+	dev->aer_report = NULL;
 }
 
 #define AER_AGENT_RECEIVER		0
@@ -537,10 +537,10 @@ static const char *aer_agent_string[] = {
 {									\
 	unsigned int i;							\
 	struct pci_dev *pdev = to_pci_dev(dev);				\
-	u64 *stats = pdev->aer_stats->stats_array;			\
+	u64 *stats = pdev->aer_report->stats_array;			\
 	size_t len = 0;							\
 									\
-	for (i = 0; i < ARRAY_SIZE(pdev->aer_stats->stats_array); i++) {\
+	for (i = 0; i < ARRAY_SIZE(pdev->aer_report->stats_array); i++) {\
 		if (strings_array[i])					\
 			len += sysfs_emit_at(buf, len, "%s %llu\n",	\
 					     strings_array[i],		\
@@ -551,7 +551,7 @@ static const char *aer_agent_string[] = {
 					     i, stats[i]);		\
 	}								\
 	len += sysfs_emit_at(buf, len, "TOTAL_%s %llu\n", total_string,	\
-			     pdev->aer_stats->total_field);		\
+			     pdev->aer_report->total_field);		\
 	return len;							\
 }									\
 static DEVICE_ATTR_RO(name)
@@ -572,7 +572,7 @@ aer_stats_dev_attr(aer_dev_nonfatal, dev_nonfatal_errs,
 		     char *buf)						\
 {									\
 	struct pci_dev *pdev = to_pci_dev(dev);				\
-	return sysfs_emit(buf, "%llu\n", pdev->aer_stats->field);	\
+	return sysfs_emit(buf, "%llu\n", pdev->aer_report->field);	\
 }									\
 static DEVICE_ATTR_RO(name)
 
@@ -599,7 +599,7 @@ static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
 	struct device *dev = kobj_to_dev(kobj);
 	struct pci_dev *pdev = to_pci_dev(dev);
 
-	if (!pdev->aer_stats)
+	if (!pdev->aer_report)
 		return 0;
 
 	if ((a == &dev_attr_aer_rootport_total_err_cor.attr ||
@@ -623,28 +623,28 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
 	unsigned long status = info->status & ~info->mask;
 	int i, max = -1;
 	u64 *counter = NULL;
-	struct aer_stats *aer_stats = pdev->aer_stats;
+	struct aer_report *aer_report = pdev->aer_report;
 
 	trace_aer_event(pci_name(pdev), (info->status & ~info->mask),
 			info->severity, info->tlp_header_valid, &info->tlp);
 
-	if (!aer_stats)
+	if (!aer_report)
 		return;
 
 	switch (info->severity) {
 	case AER_CORRECTABLE:
-		aer_stats->dev_total_cor_errs++;
-		counter = &aer_stats->dev_cor_errs[0];
+		aer_report->dev_total_cor_errs++;
+		counter = &aer_report->dev_cor_errs[0];
 		max = AER_MAX_TYPEOF_COR_ERRS;
 		break;
 	case AER_NONFATAL:
-		aer_stats->dev_total_nonfatal_errs++;
-		counter = &aer_stats->dev_nonfatal_errs[0];
+		aer_report->dev_total_nonfatal_errs++;
+		counter = &aer_report->dev_nonfatal_errs[0];
 		max = AER_MAX_TYPEOF_UNCOR_ERRS;
 		break;
 	case AER_FATAL:
-		aer_stats->dev_total_fatal_errs++;
-		counter = &aer_stats->dev_fatal_errs[0];
+		aer_report->dev_total_fatal_errs++;
+		counter = &aer_report->dev_fatal_errs[0];
 		max = AER_MAX_TYPEOF_UNCOR_ERRS;
 		break;
 	}
@@ -656,19 +656,19 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
 static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
 				 struct aer_err_source *e_src)
 {
-	struct aer_stats *aer_stats = pdev->aer_stats;
+	struct aer_report *aer_report = pdev->aer_report;
 
-	if (!aer_stats)
+	if (!aer_report)
 		return;
 
 	if (e_src->status & PCI_ERR_ROOT_COR_RCV)
-		aer_stats->rootport_total_cor_errs++;
+		aer_report->rootport_total_cor_errs++;
 
 	if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
 		if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
-			aer_stats->rootport_total_fatal_errs++;
+			aer_report->rootport_total_fatal_errs++;
 		else
-			aer_stats->rootport_total_nonfatal_errs++;
+			aer_report->rootport_total_nonfatal_errs++;
 	}
 }
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 0e8e3fd77e96..4b11a90107cb 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -346,7 +346,7 @@ struct pci_dev {
 	u8		hdr_type;	/* PCI header type (`multi' flag masked out) */
 #ifdef CONFIG_PCIEAER
 	u16		aer_cap;	/* AER capability offset */
-	struct aer_stats *aer_stats;	/* AER stats for this device */
+	struct aer_report *aer_report;	/* AER report for this device */
 #endif
 #ifdef CONFIG_PCIEPORTBUS
 	struct rcec_ea	*rcec_ea;	/* RCEC cached endpoint association */
-- 
2.43.0



* [PATCH v6 14/16] PCI/AER: Introduce ratelimit for error logs
  2025-05-19 21:35 [PATCH v6 00/16] Rate limit AER logs Bjorn Helgaas
                   ` (12 preceding siblings ...)
  2025-05-19 21:35 ` [PATCH v6 13/16] PCI/AER: Rename struct aer_stats to aer_report Bjorn Helgaas
@ 2025-05-19 21:35 ` Bjorn Helgaas
  2025-05-20  4:59   ` Sathyanarayanan Kuppuswamy
  2025-05-20 11:55   ` Ilpo Järvinen
  2025-05-19 21:35 ` [PATCH v6 15/16] PCI/AER: Add ratelimits to PCI AER Documentation Bjorn Helgaas
                   ` (2 subsequent siblings)
  16 siblings, 2 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-19 21:35 UTC (permalink / raw)
  To: linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

From: Jon Pan-Doh <pandoh@google.com>

Spammy devices can flood kernel logs with AER errors and slow/stall
execution. Add per-device ratelimits for AER correctable and uncorrectable
errors that use the kernel defaults (10 per 5s).

There are two AER logging entry points:

  - aer_print_error() is used by DPC and native AER

  - pci_print_aer() is used by GHES and CXL

The native AER aer_print_error() case includes a loop that may log details
from multiple devices.  This is ratelimited by the union of ratelimits for
these devices, set by add_error_device(), which collects the devices.  If
no such device is found, the Error Source message is ratelimited by the
Root Port or RCEC that received the ERR_* message.

The DPC aer_print_error() case is currently not ratelimited.

The GHES and CXL pci_print_aer() cases are ratelimited by the Error Source
device.

Sargun at Meta reported internally that a flood of AER errors causes RCU
CPU stall warnings and CSD-lock warnings.

Tested using aer-inject[1]. Sent 11 AER errors. Observed 10 errors logged
while AER stats (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) show
true count of 11.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git

[bhelgaas: commit log, factor out trace_aer_event() and aer_print_rp_info()
changes to previous patches, collect single aer_err_info.ratelimit as union
of ratelimits of all error source devices]
Link: https://lore.kernel.org/r/20250321015806.954866-7-pandoh@google.com
Reported-by: Sargun Dhillon <sargun@meta.com>
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
---
 drivers/pci/pci.h      |  3 ++-
 drivers/pci/pcie/aer.c | 49 ++++++++++++++++++++++++++++++++++++------
 drivers/pci/pcie/dpc.c |  1 +
 3 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 705f9ef58acc..65c466279ade 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -593,7 +593,8 @@ struct aer_err_info {
 	unsigned int id:16;
 
 	unsigned int severity:2;	/* 0:NONFATAL | 1:FATAL | 2:COR */
-	unsigned int __pad1:5;
+	unsigned int ratelimit:1;	/* 0=skip, 1=print */
+	unsigned int __pad1:4;
 	unsigned int multi_error_valid:1;
 
 	unsigned int first_error:5;
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index da62032bf024..c335e0bb9f51 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -28,6 +28,7 @@
 #include <linux/interrupt.h>
 #include <linux/delay.h>
 #include <linux/kfifo.h>
+#include <linux/ratelimit.h>
 #include <linux/slab.h>
 #include <acpi/apei.h>
 #include <acpi/ghes.h>
@@ -88,6 +89,10 @@ struct aer_report {
 	u64 rootport_total_cor_errs;
 	u64 rootport_total_fatal_errs;
 	u64 rootport_total_nonfatal_errs;
+
+	/* Ratelimits for errors */
+	struct ratelimit_state cor_log_ratelimit;
+	struct ratelimit_state uncor_log_ratelimit;
 };
 
 #define AER_LOG_TLP_MASKS		(PCI_ERR_UNC_POISON_TLP|	\
@@ -379,6 +384,11 @@ void pci_aer_init(struct pci_dev *dev)
 
 	dev->aer_report = kzalloc(sizeof(*dev->aer_report), GFP_KERNEL);
 
+	ratelimit_state_init(&dev->aer_report->cor_log_ratelimit,
+			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
+	ratelimit_state_init(&dev->aer_report->uncor_log_ratelimit,
+			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
+
 	/*
 	 * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
 	 * PCI_ERR_COR_MASK, and PCI_ERR_CAP.  Root and Root Complex Event
@@ -672,6 +682,18 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
 	}
 }
 
+static int aer_ratelimit(struct pci_dev *dev, unsigned int severity)
+{
+	struct ratelimit_state *ratelimit;
+
+	if (severity == AER_CORRECTABLE)
+		ratelimit = &dev->aer_report->cor_log_ratelimit;
+	else
+		ratelimit = &dev->aer_report->uncor_log_ratelimit;
+
+	return __ratelimit(ratelimit);
+}
+
 static void __aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 {
 	const char **strings;
@@ -715,6 +737,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 
 	pci_dev_aer_stats_incr(dev, info);
 
+	if (!info->ratelimit)
+		return;
+
 	if (!info->status) {
 		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
 			aer_error_severity_string[info->severity]);
@@ -785,6 +810,9 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 
 	pci_dev_aer_stats_incr(dev, &info);
 
+	if (!aer_ratelimit(dev, info.severity))
+		return;
+
 	layer = AER_GET_LAYER_ERROR(aer_severity, status);
 	agent = AER_GET_AGENT(aer_severity, status);
 
@@ -815,8 +843,14 @@ EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
  */
 static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
 {
+	/*
+	 * Ratelimit AER log messages.  Generally we add the Error Source
+	 * device, but there are is_error_source() cases that can result in
+	 * multiple devices being added here, so we OR them all together.
+	 */
 	if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
 		e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
+		e_info->ratelimit |= aer_ratelimit(dev, e_info->severity);
 		e_info->error_dev_num++;
 		return 0;
 	}
@@ -914,7 +948,7 @@ static int find_device_iter(struct pci_dev *dev, void *data)
  * e_info->error_dev_num and e_info->dev[], based on the given information.
  */
 static bool find_source_device(struct pci_dev *parent,
-		struct aer_err_info *e_info)
+			       struct aer_err_info *e_info)
 {
 	struct pci_dev *dev = parent;
 	int result;
@@ -935,10 +969,12 @@ static bool find_source_device(struct pci_dev *parent,
 	/*
 	 * If we didn't find any devices with errors logged in the AER
 	 * Capability, just print the Error Source ID from the Root Port or
-	 * RCEC that received an ERR_* Message.
+	 * RCEC that received an ERR_* Message, ratelimited by the RP or
+	 * RCEC.
 	 */
 	if (!e_info->error_dev_num) {
-		aer_print_source(parent, e_info, " (no details found)");
+		if (aer_ratelimit(parent, e_info->severity))
+			aer_print_source(parent, e_info, " (no details found)");
 		return false;
 	}
 	return true;
@@ -1147,9 +1183,10 @@ static void aer_recover_work_func(struct work_struct *work)
 		pdev = pci_get_domain_bus_and_slot(entry.domain, entry.bus,
 						   entry.devfn);
 		if (!pdev) {
-			pr_err("no pci_dev for %04x:%02x:%02x.%x\n",
-			       entry.domain, entry.bus,
-			       PCI_SLOT(entry.devfn), PCI_FUNC(entry.devfn));
+			pr_err_ratelimited("%04x:%02x:%02x.%x: no pci_dev found\n",
+					   entry.domain, entry.bus,
+					   PCI_SLOT(entry.devfn),
+					   PCI_FUNC(entry.devfn));
 			continue;
 		}
 		pci_print_aer(pdev, entry.severity, entry.regs);
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index 34af0ea45c0d..597df7790f36 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -301,6 +301,7 @@ void dpc_process_error(struct pci_dev *pdev)
 	else if (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_UNCOR &&
 		 dpc_get_aer_uncorrect_severity(pdev, &info) &&
 		 aer_get_device_error_info(pdev, &info)) {
+		info.ratelimit = 1;	/* no ratelimiting */
 		aer_print_error(pdev, &info);
 		pci_aer_clear_nonfatal_status(pdev);
 		pci_aer_clear_fatal_status(pdev);
-- 
2.43.0
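
Note on the "union of ratelimits": a minimal userspace sketch, assuming a
hypothetical per-device budget (struct demo_dev) in place of struct
ratelimit_state. It models how ORing the per-device verdicts allows logging
when any collected device still has budget.

```c
/* Hypothetical per-device budget standing in for struct ratelimit_state. */
struct demo_dev {
	int budget;	/* messages still allowed in the current interval */
};

/* Returns 1 and consumes budget if this device may log, else 0. */
static int demo_ratelimit(struct demo_dev *dev)
{
	if (dev->budget <= 0)
		return 0;
	dev->budget--;
	return 1;
}

/*
 * Collect the verdict for a group of devices.  Every device's limiter is
 * consulted (and charged), and the results are ORed together.
 */
static unsigned int demo_collect(struct demo_dev *devs, int n)
{
	unsigned int allow = 0;
	int i;

	for (i = 0; i < n; i++)
		allow |= demo_ratelimit(&devs[i]);
	return allow;
}
```

Because "|=" does not short-circuit, each device is charged even when an
earlier device has already allowed the message. In this model a quiet
device can therefore let through messages for a group that also contains a
noisy one, until its own budget runs out.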



* [PATCH v6 15/16] PCI/AER: Add ratelimits to PCI AER Documentation
  2025-05-19 21:35 [PATCH v6 00/16] Rate limit AER logs Bjorn Helgaas
                   ` (13 preceding siblings ...)
  2025-05-19 21:35 ` [PATCH v6 14/16] PCI/AER: Introduce ratelimit for error logs Bjorn Helgaas
@ 2025-05-19 21:35 ` Bjorn Helgaas
  2025-05-20  5:01   ` Sathyanarayanan Kuppuswamy
  2025-05-19 21:35 ` [PATCH v6 16/16] PCI/AER: Add sysfs attributes for log ratelimits Bjorn Helgaas
  2025-05-20  9:05 ` [PATCH v6 00/16] Rate limit AER logs Krzysztof Wilczyński
  16 siblings, 1 reply; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-19 21:35 UTC (permalink / raw)
  To: linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

From: Jon Pan-Doh <pandoh@google.com>

Add ratelimits section for rationale and defaults.

Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
---
 Documentation/PCI/pcieaer-howto.rst | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
index f013f3b27c82..896d2a232a90 100644
--- a/Documentation/PCI/pcieaer-howto.rst
+++ b/Documentation/PCI/pcieaer-howto.rst
@@ -85,6 +85,17 @@ In the example, 'Requester ID' means the ID of the device that sent
 the error message to the Root Port. Please refer to PCIe specs for other
 fields.
 
+AER Ratelimits
+--------------
+
+Since error messages can be generated for each transaction, we may see
+large volumes of errors reported. To prevent spammy devices from flooding
+the console/stalling execution, messages are throttled by device and error
+type (correctable vs. uncorrectable).
+
+AER uses the default ratelimit of DEFAULT_RATELIMIT_BURST (10 events) over
+DEFAULT_RATELIMIT_INTERVAL (5 seconds).
+
 AER Statistics / Counters
 -------------------------
 
-- 
2.43.0
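
The documentation above describes a burst of 10 messages per 5-second window. As a rough illustration only, the following user-space sketch models that behaviour. It is not the kernel's ratelimit implementation: struct rl_model, rl_allow() and the millisecond clock are hypothetical names chosen for this example, and the first window is assumed to start at time 0. An interval of 0 is treated as "ratelimiting disabled", matching the enable toggle described in patch 16.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model; field names are illustrative only. */
struct rl_model {
	uint64_t interval_ms;		/* window length; 0 disables ratelimiting */
	int burst;			/* messages allowed per window */
	uint64_t window_start_ms;	/* start of the current window */
	int printed;			/* messages allowed in this window */
	int missed;			/* messages suppressed in this window */
};

/* Return true if a message arriving at now_ms may be logged. */
static bool rl_allow(struct rl_model *rl, uint64_t now_ms)
{
	/* interval == 0 means "no ratelimit": always log. */
	if (rl->interval_ms == 0)
		return true;

	/* Start a new window once the current one has expired. */
	if (now_ms - rl->window_start_ms >= rl->interval_ms) {
		rl->window_start_ms = now_ms;
		rl->printed = 0;
		rl->missed = 0;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}

	rl->missed++;
	return false;
}
```

With interval_ms = 5000 and burst = 10, twelve messages arriving inside one window would see ten logged and two suppressed; the first message of the next window is logged again.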


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* [PATCH v6 16/16] PCI/AER: Add sysfs attributes for log ratelimits
  2025-05-19 21:35 [PATCH v6 00/16] Rate limit AER logs Bjorn Helgaas
                   ` (14 preceding siblings ...)
  2025-05-19 21:35 ` [PATCH v6 15/16] PCI/AER: Add ratelimits to PCI AER Documentation Bjorn Helgaas
@ 2025-05-19 21:35 ` Bjorn Helgaas
  2025-05-20  5:05   ` Sathyanarayanan Kuppuswamy
  2025-05-20 12:02   ` Ilpo Järvinen
  2025-05-20  9:05 ` [PATCH v6 00/16] Rate limit AER logs Krzysztof Wilczyński
  16 siblings, 2 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-19 21:35 UTC (permalink / raw)
  To: linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

From: Jon Pan-Doh <pandoh@google.com>

Allow userspace to read/write log ratelimits per device (including
enable/disable). Create aer/ sysfs directory to store them and any
future aer configs.

Update AER sysfs ABI filename to reflect the broader scope of AER sysfs
attributes (e.g. stats and ratelimits).

  Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats ->
    sysfs-bus-pci-devices-aer

Tested using aer-inject[1]. Configured correctable log ratelimit to 5.
Sent 6 AER errors. Observed 5 errors logged while AER stats
(cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) shows 6.

Disabled ratelimiting and sent 6 more AER errors. Observed all 6 errors
logged and accounted in AER stats (12 total errors).

[1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git

Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Paul E. McKenney <paulmck@kernel.org>
---
 ...es-aer_stats => sysfs-bus-pci-devices-aer} | 34 +++++++
 Documentation/PCI/pcieaer-howto.rst           |  5 +-
 drivers/pci/pci-sysfs.c                       |  1 +
 drivers/pci/pci.h                             |  1 +
 drivers/pci/pcie/aer.c                        | 99 +++++++++++++++++++
 5 files changed, 139 insertions(+), 1 deletion(-)
 rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (77%)

diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
similarity index 77%
rename from Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
rename to Documentation/ABI/testing/sysfs-bus-pci-devices-aer
index d1f67bb81d5d..771204197b71 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
+++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
@@ -117,3 +117,37 @@ Date:		July 2018
 KernelVersion:	4.19.0
 Contact:	linux-pci@vger.kernel.org, rajatja@google.com
 Description:	Total number of ERR_NONFATAL messages reported to rootport.
+
+PCIe AER ratelimits
+-------------------
+
+These attributes show up under all the devices that are AER capable.
+They represent configurable ratelimits of logs per error type.
+
+See Documentation/PCI/pcieaer-howto.rst for more info on ratelimits.
+
+What:		/sys/bus/pci/devices/<dev>/aer/ratelimit_log_enable
+Date:		March 2025
+KernelVersion:	6.15.0
+Contact:	linux-pci@vger.kernel.org, pandoh@google.com
+Description:	Writing 1/0 enables/disables AER log ratelimiting. Reading
+		shows whether AER log ratelimiting is currently enabled.
+		Enabled by default.
+
+What:		/sys/bus/pci/devices/<dev>/aer/ratelimit_burst_cor_log
+Date:		March 2025
+KernelVersion:	6.15.0
+Contact:	linux-pci@vger.kernel.org, pandoh@google.com
+Description:	Ratelimit burst for correctable error logs. Writing a value
+		changes the number of errors (burst) allowed per interval
+		(5 second window) before ratelimiting. Reading gets the
+		current ratelimit burst.
+
+What:		/sys/bus/pci/devices/<dev>/aer/ratelimit_burst_uncor_log
+Date:		March 2025
+KernelVersion:	6.15.0
+Contact:	linux-pci@vger.kernel.org, pandoh@google.com
+Description:	Ratelimit burst for uncorrectable error logs. Writing a
+		value changes the number of errors (burst) allowed per
+		interval (5 second window) before ratelimiting. Reading
+		gets the current ratelimit burst.
diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
index 896d2a232a90..043cdb3194be 100644
--- a/Documentation/PCI/pcieaer-howto.rst
+++ b/Documentation/PCI/pcieaer-howto.rst
@@ -96,12 +96,15 @@ type (correctable vs. uncorrectable).
 AER uses the default ratelimit of DEFAULT_RATELIMIT_BURST (10 events) over
 DEFAULT_RATELIMIT_INTERVAL (5 seconds).
 
+Ratelimits are exposed as sysfs attributes and are configurable.
+See Documentation/ABI/testing/sysfs-bus-pci-devices-aer.
+
 AER Statistics / Counters
 -------------------------
 
 When PCIe AER errors are captured, the counters / statistics are also exposed
 in the form of sysfs attributes which are documented at
-Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
+Documentation/ABI/testing/sysfs-bus-pci-devices-aer.
 
 Developer Guide
 ===============
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index c6cda56ca52c..278de99b00ce 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1805,6 +1805,7 @@ const struct attribute_group *pci_dev_attr_groups[] = {
 	&pcie_dev_attr_group,
 #ifdef CONFIG_PCIEAER
 	&aer_stats_attr_group,
+	&aer_attr_group,
 #endif
 #ifdef CONFIG_PCIEASPM
 	&aspm_ctrl_attr_group,
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 65c466279ade..a3261e842d6d 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -963,6 +963,7 @@ void pci_no_aer(void);
 void pci_aer_init(struct pci_dev *dev);
 void pci_aer_exit(struct pci_dev *dev);
 extern const struct attribute_group aer_stats_attr_group;
+extern const struct attribute_group aer_attr_group;
 void pci_aer_clear_fatal_status(struct pci_dev *dev);
 int pci_aer_clear_status(struct pci_dev *dev);
 int pci_aer_raw_clear_status(struct pci_dev *dev);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index c335e0bb9f51..42df5cb963b3 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -627,6 +627,105 @@ const struct attribute_group aer_stats_attr_group = {
 	.is_visible = aer_stats_attrs_are_visible,
 };
 
+/*
+ * Ratelimit enable toggle
+ * 0: disabled with ratelimit.interval = 0
+ * 1: enabled with ratelimit.interval = nonzero
+ */
+static ssize_t ratelimit_log_enable_show(struct device *dev,
+					 struct device_attribute *attr,
+					 char *buf)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	bool enabled = pdev->aer_report->cor_log_ratelimit.interval != 0;
+
+	return sysfs_emit(buf, "%d\n", enabled);
+}
+
+static ssize_t ratelimit_log_enable_store(struct device *dev,
+					  struct device_attribute *attr,
+					  const char *buf, size_t count)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	bool enable;
+	int interval;
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+
+	if (kstrtobool(buf, &enable) < 0)
+		return -EINVAL;
+
+	if (enable)
+		interval = DEFAULT_RATELIMIT_INTERVAL;
+	else
+		interval = 0;
+
+	pdev->aer_report->cor_log_ratelimit.interval = interval;
+	pdev->aer_report->uncor_log_ratelimit.interval = interval;
+
+	return count;
+}
+static DEVICE_ATTR_RW(ratelimit_log_enable);
+
+#define aer_ratelimit_burst_attr(name, ratelimit)			\
+	static ssize_t							\
+	name##_show(struct device *dev, struct device_attribute *attr,	\
+		    char *buf)						\
+{									\
+	struct pci_dev *pdev = to_pci_dev(dev);				\
+									\
+	return sysfs_emit(buf, "%d\n",					\
+			  pdev->aer_report->ratelimit.burst);		\
+}									\
+									\
+	static ssize_t							\
+	name##_store(struct device *dev, struct device_attribute *attr,	\
+		     const char *buf, size_t count)			\
+{									\
+	struct pci_dev *pdev = to_pci_dev(dev);				\
+	int burst;							\
+									\
+	if (!capable(CAP_SYS_ADMIN))					\
+		return -EPERM;						\
+									\
+	if (kstrtoint(buf, 0, &burst) < 0)				\
+		return -EINVAL;						\
+									\
+	pdev->aer_report->ratelimit.burst = burst;			\
+									\
+	return count;							\
+}									\
+static DEVICE_ATTR_RW(name)
+
+aer_ratelimit_burst_attr(ratelimit_burst_cor_log, cor_log_ratelimit);
+aer_ratelimit_burst_attr(ratelimit_burst_uncor_log, uncor_log_ratelimit);
+
+static struct attribute *aer_attrs[] = {
+	&dev_attr_ratelimit_log_enable.attr,
+	&dev_attr_ratelimit_burst_cor_log.attr,
+	&dev_attr_ratelimit_burst_uncor_log.attr,
+	NULL
+};
+
+static umode_t aer_attrs_are_visible(struct kobject *kobj,
+				     struct attribute *a, int n)
+{
+	struct device *dev = kobj_to_dev(kobj);
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	if (!pdev->aer_report)
+		return 0;
+
+	return a->mode;
+}
+
+const struct attribute_group aer_attr_group = {
+	.name = "aer",
+	.attrs = aer_attrs,
+	.is_visible = aer_attrs_are_visible,
+};
+
 static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
 				   struct aer_err_info *info)
 {
-- 
2.43.0
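
For readers who want to drive these knobs from a program rather than a shell, here is a minimal user-space sketch. The attribute names and the aer/ directory come from the ABI text in this patch; the helper names aer_attr_path() and aer_write_int() are hypothetical. Since the store handlers check CAP_SYS_ADMIN, writes are expected to fail for unprivileged callers.

```c
#include <stdio.h>

/*
 * Hypothetical helper: build /sys/bus/pci/devices/<dev>/aer/<attr>.
 * Returns 0 on success, -1 if the path does not fit in buf.
 */
static int aer_attr_path(char *buf, size_t len, const char *dev,
			 const char *attr)
{
	int n = snprintf(buf, len, "/sys/bus/pci/devices/%s/aer/%s",
			 dev, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Hypothetical helper: write a decimal integer to an attribute file.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 */
static int aer_write_int(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;

	ok = fprintf(f, "%d\n", value) > 0;
	if (fclose(f) != 0)	/* sysfs write errors may surface on close */
		ok = 0;

	return ok ? 0 : -1;
}
```

To mirror the test described in the commit message, one would build the path for "ratelimit_burst_cor_log" on the device under test and call aer_write_int(path, 5), then write 0 to "ratelimit_log_enable" to turn ratelimiting off.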


^ permalink raw reply related	[flat|nested] 68+ messages in thread

* Re: [PATCH v6 01/16] PCI/DPC: Initialize aer_err_info before using it
  2025-05-19 21:35 ` [PATCH v6 01/16] PCI/DPC: Initialize aer_err_info before using it Bjorn Helgaas
@ 2025-05-19 22:41   ` Sathyanarayanan Kuppuswamy
  2025-05-20 13:53     ` Bjorn Helgaas
  2025-05-20  9:39   ` Ilpo Järvinen
  1 sibling, 1 reply; 68+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-19 22:41 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Lukas Wunner, Jonathan Cameron, Sargun Dhillon, Paul E . McKenney,
	Mahesh J Salgaonkar, Oliver O'Halloran, Kai-Heng Feng,
	Keith Busch, Robert Richter, Terry Bowman, Shiju Jose, Dave Jiang,
	linux-kernel, linuxppc-dev, Bjorn Helgaas

Hi,

On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> From: Bjorn Helgaas <bhelgaas@google.com>
>
> Previously the struct aer_err_info "info" was allocated on the stack

/s/Previously/Currently ?

> without being initialized, so it contained junk except for the fields we
> explicitly set later.
>
> Initialize "info" at declaration so it starts as all zeroes.

/s/zeroes/zeros

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>   drivers/pci/pcie/dpc.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index df42f15c9829..fe7719238456 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -258,7 +258,7 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,
>   void dpc_process_error(struct pci_dev *pdev)
>   {
>   	u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
> -	struct aer_err_info info;
> +	struct aer_err_info info = { 0 };
>   
>   	pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
>   	pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &source);

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v6 02/16] PCI/DPC: Log Error Source ID only when valid
  2025-05-19 21:35 ` [PATCH v6 02/16] PCI/DPC: Log Error Source ID only when valid Bjorn Helgaas
@ 2025-05-19 23:15   ` Sathyanarayanan Kuppuswamy
  2025-05-20 14:00     ` Bjorn Helgaas
  2025-05-20 10:28   ` Ilpo Järvinen
  1 sibling, 1 reply; 68+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-19 23:15 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Lukas Wunner, Jonathan Cameron, Sargun Dhillon, Paul E . McKenney,
	Mahesh J Salgaonkar, Oliver O'Halloran, Kai-Heng Feng,
	Keith Busch, Robert Richter, Terry Bowman, Shiju Jose, Dave Jiang,
	linux-kernel, linuxppc-dev, Bjorn Helgaas


On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> From: Bjorn Helgaas <bhelgaas@google.com>
>
> DPC Error Source ID is only valid when the DPC Trigger Reason indicates
> that DPC was triggered due to reception of an ERR_NONFATAL or ERR_FATAL
> Message (PCIe r6.0, sec 7.9.14.5).
>
> When DPC was triggered by ERR_NONFATAL (PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE)
> or ERR_FATAL (PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) from a downstream device,
> log the Error Source ID (decoded into domain/bus/device/function).  Don't
> print the source otherwise, since it's not valid.
>
> For DPC trigger due to reception of ERR_NONFATAL or ERR_FATAL, the dmesg
> logging changes:
>
>    - pci 0000:00:01.0: DPC: containment event, status:0x000d source:0x0200
>    - pci 0000:00:01.0: DPC: ERR_FATAL detected
>    + pci 0000:00:01.0: DPC: containment event, status:0x000d, ERR_FATAL received from 0000:02:00.0
>
> and when DPC triggered for other reasons, where DPC Error Source ID is
> undefined, e.g., unmasked uncorrectable error:
>
>    - pci 0000:00:01.0: DPC: containment event, status:0x0009 source:0x0200
>    - pci 0000:00:01.0: DPC: unmasked uncorrectable error detected
>    + pci 0000:00:01.0: DPC: containment event, status:0x0009: unmasked uncorrectable error detected
>
> Previously the "containment event" message was at KERN_INFO and the
> "%s detected" message was at KERN_WARNING.  Now the single message is at
> KERN_WARNING.

Since we are handling Uncorrectable errors, why not use pci_err?

>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

>   drivers/pci/pcie/dpc.c | 45 ++++++++++++++++++++++++++----------------
>   1 file changed, 28 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index fe7719238456..315bf2bfd570 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -261,25 +261,36 @@ void dpc_process_error(struct pci_dev *pdev)
>   	struct aer_err_info info = { 0 };
>   
>   	pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
> -	pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &source);
> -
> -	pci_info(pdev, "containment event, status:%#06x source:%#06x\n",
> -		 status, source);
>   
>   	reason = status & PCI_EXP_DPC_STATUS_TRIGGER_RSN;
> -	ext_reason = status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT;
> -	pci_warn(pdev, "%s detected\n",
> -		 (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_UNCOR) ?
> -		 "unmasked uncorrectable error" :
> -		 (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE) ?
> -		 "ERR_NONFATAL" :
> -		 (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) ?
> -		 "ERR_FATAL" :
> -		 (ext_reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_RP_PIO) ?
> -		 "RP PIO error" :
> -		 (ext_reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_SW_TRIGGER) ?
> -		 "software trigger" :
> -		 "reserved error");
> +
> +	switch (reason) {
> +	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_UNCOR:
> +		pci_warn(pdev, "containment event, status:%#06x: unmasked uncorrectable error detected\n",
> +			 status);
> +		break;
> +	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE:
> +	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE:
> +		pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID,
> +				     &source);
> +		pci_warn(pdev, "containment event, status:%#06x, %s received from %04x:%02x:%02x.%d\n",
> +			 status,

I see the BDF extraction and format code in many places in the PCI drivers. Maybe a
common macro would make it more readable.

> +			 (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) ?
> +				"ERR_FATAL" : "ERR_NONFATAL",
> +			 pci_domain_nr(pdev->bus), PCI_BUS_NUM(source),
> +			 PCI_SLOT(source), PCI_FUNC(source));
> +		return;
> +	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_IN_EXT:
> +		ext_reason = status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT;
> +		pci_warn(pdev, "containment event, status:%#06x: %s detected\n",
> +			 status,
> +			 (ext_reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_RP_PIO) ?
> +			 "RP PIO error" :
> +			 (ext_reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_SW_TRIGGER) ?
> +			 "software trigger" :
> +			 "reserved error");
> +		break;
> +	}
>   
>   	/* show RP PIO error detail information */
>   	if (pdev->dpc_rp_extensions &&

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
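
The two status values in the commit message (0x000d and 0x0009) can be decoded by hand with the same logic as the new switch statement. The sketch below is a user-space illustration, not the patch itself: the bit definitions are local copies of what I understand include/uapi/linux/pci_regs.h to contain, so check them against your tree, and dpc_reason() and dpc_source_valid() are hypothetical helper names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the DPC Status bits (assumed to match pci_regs.h). */
#define PCI_EXP_DPC_STATUS_TRIGGER_RSN			0x0006
#define PCI_EXP_DPC_STATUS_TRIGGER_RSN_UNCOR		0x0000
#define PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE		0x0002
#define PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE		0x0004
#define PCI_EXP_DPC_STATUS_TRIGGER_RSN_IN_EXT		0x0006
#define PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT		0x0060
#define PCI_EXP_DPC_STATUS_TRIGGER_RSN_RP_PIO		0x0000
#define PCI_EXP_DPC_STATUS_TRIGGER_RSN_SW_TRIGGER	0x0020

/* Hypothetical helper: name the trigger reason encoded in DPC Status. */
static const char *dpc_reason(uint16_t status)
{
	switch (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN) {
	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_UNCOR:
		return "unmasked uncorrectable error";
	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE:
		return "ERR_NONFATAL";
	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE:
		return "ERR_FATAL";
	default:
		break;
	}

	/* Reason is "see extension field": decode the extended reason. */
	switch (status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT) {
	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_RP_PIO:
		return "RP PIO error";
	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_SW_TRIGGER:
		return "software trigger";
	default:
		return "reserved error";
	}
}

/* The Error Source ID is only meaningful for ERR_NONFATAL/ERR_FATAL. */
static bool dpc_source_valid(uint16_t status)
{
	uint16_t reason = status & PCI_EXP_DPC_STATUS_TRIGGER_RSN;

	return reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE ||
	       reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE;
}
```

Under these definitions, status 0x000d decodes as ERR_FATAL with a valid source, while 0x0009 decodes as an unmasked uncorrectable error with no valid source, which is consistent with the before/after log lines quoted above.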


^ permalink raw reply	[flat|nested] 68+ messages in thread

* [PATCH v6 11/16] PCI/AER: Check log level once and remember it
  2025-05-19 21:35 ` [PATCH v6 11/16] PCI/AER: Check log level once and remember it Bjorn Helgaas
@ 2025-05-19 23:17   ` Weinan Liu
  2025-05-20 14:46     ` Bjorn Helgaas
  2025-05-20  2:49   ` Sathyanarayanan Kuppuswamy
  2025-05-20 11:26   ` Ilpo Järvinen
  2 siblings, 1 reply; 68+ messages in thread
From: Weinan Liu @ 2025-05-19 23:17 UTC (permalink / raw)
  To: helgaas
  Cc: Jonathan.Cameron, anilagrawal, ben.fuller, bhelgaas, dave.jiang,
	drewwalton, ilpo.jarvinen, kaihengf, karolina.stolarek, kbusch,
	linux-kernel, linux-pci, linuxppc-dev, lukas, mahesh,
	martin.petersen, oohall, pandoh, paulmck, rrichter, sargun,
	sathyanarayanan.kuppuswamy, shiju.jose, terry.bowman, tony.luck


> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index 315bf2bfd570..34af0ea45c0d 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -252,6 +252,7 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,
>   else
>   info->severity = AER_NONFATAL;
>
> + info->level = KERN_WARNING;
>  return 1;
> }

I think the print level should be KERN_ERR for uncorrectable errors.


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v6 03/16] PCI/AER: Consolidate Error Source ID logging in aer_print_port_info()
  2025-05-19 21:35 ` [PATCH v6 03/16] PCI/AER: Consolidate Error Source ID logging in aer_print_port_info() Bjorn Helgaas
@ 2025-05-19 23:39   ` Sathyanarayanan Kuppuswamy
  2025-05-20 14:21     ` Bjorn Helgaas
  2025-05-20 10:31   ` Ilpo Järvinen
  1 sibling, 1 reply; 68+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-19 23:39 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Lukas Wunner, Jonathan Cameron, Sargun Dhillon, Paul E . McKenney,
	Mahesh J Salgaonkar, Oliver O'Halloran, Kai-Heng Feng,
	Keith Busch, Robert Richter, Terry Bowman, Shiju Jose, Dave Jiang,
	linux-kernel, linuxppc-dev, Bjorn Helgaas

Hi,

On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> From: Bjorn Helgaas <bhelgaas@google.com>
>
> Previously we decoded the AER Error Source ID in two places.  Consolidate
> them so both places use aer_print_port_info().  Add a "details" parameter
> so we can add a note when we didn't find any downstream devices with errors
> logged in their AER Capability.
>
> When we didn't read any error details from the source device, we logged two
> messages: one in aer_isr_one_error() and another in find_source_device().
> Since they both contain the same information, only log the first one when
> when find_source_device() has found error details.
/s/when//
>
> This changes the dmesg logging when we found no devices with errors logged:
>
>    - pci 0000:00:01.0: AER: Correctable error message received from 0000:02:00.0
>    - pci 0000:00:01.0: AER: found no error details for 0000:02:00.0
>    + pci 0000:00:01.0: AER: Correctable error message received from 0000:02:00.0 (no details found)
>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>   drivers/pci/pcie/aer.c | 30 ++++++++++++++++--------------
>   1 file changed, 16 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index a1cf8c7ef628..b8494ccd935b 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -733,16 +733,17 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>   			info->severity, info->tlp_header_valid, &info->tlp);
>   }
>   
> -static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
> +static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info,
> +				const char *details)
>   {
>   	u8 bus = info->id >> 8;
>   	u8 devfn = info->id & 0xff;
>   
> -	pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d\n",
> +	pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d%s\n",

Instead of relying on the callers, why not add a space before details here?

>   		 info->multi_error_valid ? "Multiple " : "",
>   		 aer_error_severity_string[info->severity],
>   		 pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
> -		 PCI_FUNC(devfn));
> +		 PCI_FUNC(devfn), details);
>   }
>   
>   #ifdef CONFIG_ACPI_APEI_PCIEAER
> @@ -926,13 +927,13 @@ static bool find_source_device(struct pci_dev *parent,
>   	else
>   		pci_walk_bus(parent->subordinate, find_device_iter, e_info);
>   
> +	/*
> +	 * If we didn't find any devices with errors logged in the AER
> +	 * Capability, just print the Error Source ID from the Root Port or
> +	 * RCEC that received an ERR_* Message.
> +	 */
>   	if (!e_info->error_dev_num) {
> -		u8 bus = e_info->id >> 8;
> -		u8 devfn = e_info->id & 0xff;
> -
> -		pci_info(parent, "found no error details for %04x:%02x:%02x.%d\n",
> -			 pci_domain_nr(parent->bus), bus, PCI_SLOT(devfn),
> -			 PCI_FUNC(devfn));
> +		aer_print_port_info(parent, e_info, " (no details found)");
>   		return false;
>   	}
>   	return true;
> @@ -1297,10 +1298,11 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
>   			e_info.multi_error_valid = 1;
>   		else
>   			e_info.multi_error_valid = 0;
> -		aer_print_port_info(pdev, &e_info);
>   

Instead of printing the error information in find_source_device() (a helper
function), I think it would be better to print it here (the error handler):

	source_found = find_source_device(pdev, &e_info);
	aer_print_port_info(pdev, &e_info,
			    source_found ? "" : "(no details found) ");

	if (source_found)
		aer_process_err_devices(&e_info);


> -		if (find_source_device(pdev, &e_info))
> +		if (find_source_device(pdev, &e_info)) {
> +			aer_print_port_info(pdev, &e_info, "");
>   			aer_process_err_devices(&e_info);
> +		}
>   	}
>   
>   	if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
> @@ -1316,10 +1318,10 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
>   		else
>   			e_info.multi_error_valid = 0;
>   
> -		aer_print_port_info(pdev, &e_info);
> -
> -		if (find_source_device(pdev, &e_info))
> +		if (find_source_device(pdev, &e_info)) {
> +			aer_print_port_info(pdev, &e_info, "");
>   			aer_process_err_devices(&e_info);
> +		}
>   	}
>   }
>   

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v6 04/16] PCI/AER: Extract bus/dev/fn in aer_print_port_info() with PCI_BUS_NUM(), etc
  2025-05-19 21:35 ` [PATCH v6 04/16] PCI/AER: Extract bus/dev/fn in aer_print_port_info() with PCI_BUS_NUM(), etc Bjorn Helgaas
@ 2025-05-19 23:47   ` Sathyanarayanan Kuppuswamy
  2025-05-20 10:32   ` Ilpo Järvinen
  1 sibling, 0 replies; 68+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-19 23:47 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Lukas Wunner, Jonathan Cameron, Sargun Dhillon, Paul E . McKenney,
	Mahesh J Salgaonkar, Oliver O'Halloran, Kai-Heng Feng,
	Keith Busch, Robert Richter, Terry Bowman, Shiju Jose, Dave Jiang,
	linux-kernel, linuxppc-dev, Bjorn Helgaas


On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> From: Bjorn Helgaas <bhelgaas@google.com>
>
> Use PCI_BUS_NUM(), PCI_SLOT(), PCI_FUNC() to extract the bus number,
> device, and function number directly from the Error Source ID.  There's no
> need to shift and mask it explicitly.
>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

>   drivers/pci/pcie/aer.c | 7 +++----
>   1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index b8494ccd935b..dc8a50e0a2b7 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -736,14 +736,13 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>   static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info,
>   				const char *details)
>   {
> -	u8 bus = info->id >> 8;
> -	u8 devfn = info->id & 0xff;
> +	u16 source = info->id;
>   
>   	pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d%s\n",

Since it is used in many places in the PCI drivers, maybe define

#define PCI_ADDR_FMT "%04x:%02x:%02x.%d"

>   		 info->multi_error_valid ? "Multiple " : "",
>   		 aer_error_severity_string[info->severity],
> -		 pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
> -		 PCI_FUNC(devfn), details);
> +		 pci_domain_nr(dev->bus), PCI_BUS_NUM(source),
> +		 PCI_SLOT(source), PCI_FUNC(source), details);
>   }
>   
>   #ifdef CONFIG_ACPI_APEI_PCIEAER

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
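
To make the suggestion above concrete, here is a small user-space sketch of how such a format macro could be combined with PCI_BUS_NUM(), PCI_SLOT() and PCI_FUNC(). The three extraction macros are local copies of the kernel helpers so the example builds outside the kernel; PCI_ADDR_FMT is only the name floated in this review and, as far as this thread shows, is not an existing kernel macro. format_source() is a hypothetical helper.

```c
#include <stdint.h>
#include <stdio.h>

/* Local copies of the kernel's helpers so this builds in user space. */
#define PCI_BUS_NUM(x)	(((x) >> 8) & 0xff)
#define PCI_SLOT(devfn)	(((devfn) >> 3) & 0x1f)
#define PCI_FUNC(devfn)	((devfn) & 0x07)

/* Hypothetical format macro, as proposed in the review comment. */
#define PCI_ADDR_FMT	"%04x:%02x:%02x.%d"

/*
 * Hypothetical helper: format a 16-bit Error Source ID (bus in bits 15:8,
 * device in bits 7:3, function in bits 2:0) as domain:bus:dev.fn.
 */
static int format_source(char *buf, size_t len, unsigned int domain,
			 uint16_t source)
{
	return snprintf(buf, len, PCI_ADDR_FMT, domain,
			PCI_BUS_NUM(source), PCI_SLOT(source),
			PCI_FUNC(source));
}
```

For the Source ID 0x0200 used in the example log lines earlier in the series, this yields "0000:02:00.0" in domain 0.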


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v6 05/16] PCI/AER: Rename aer_print_port_info() to aer_print_source()
  2025-05-19 21:35 ` [PATCH v6 05/16] PCI/AER: Rename aer_print_port_info() to aer_print_source() Bjorn Helgaas
@ 2025-05-19 23:48   ` Sathyanarayanan Kuppuswamy
  2025-05-20 10:33   ` Ilpo Järvinen
  1 sibling, 0 replies; 68+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-19 23:48 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Lukas Wunner, Jonathan Cameron, Sargun Dhillon, Paul E . McKenney,
	Mahesh J Salgaonkar, Oliver O'Halloran, Kai-Heng Feng,
	Keith Busch, Robert Richter, Terry Bowman, Shiju Jose, Dave Jiang,
	linux-kernel, linuxppc-dev, Bjorn Helgaas


On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> From: Jon Pan-Doh <pandoh@google.com>
>
> Rename aer_print_port_info() to aer_print_source() to be more descriptive.
> This logs the Error Source ID logged by a Root Port or Root Complex Event
> Collector when it receives an ERR_COR, ERR_NONFATAL, or ERR_FATAL Message.
>
> [bhelgaas: aer_print_rp_info() -> aer_print_source()]
> Link: https://lore.kernel.org/r/20250321015806.954866-5-pandoh@google.com
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

>   drivers/pci/pcie/aer.c | 10 +++++-----
>   1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index dc8a50e0a2b7..eb42d50b2def 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -733,8 +733,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>   			info->severity, info->tlp_header_valid, &info->tlp);
>   }
>   
> -static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info,
> -				const char *details)
> +static void aer_print_source(struct pci_dev *dev, struct aer_err_info *info,
> +			     const char *details)
>   {
>   	u16 source = info->id;
>   
> @@ -932,7 +932,7 @@ static bool find_source_device(struct pci_dev *parent,
>   	 * RCEC that received an ERR_* Message.
>   	 */
>   	if (!e_info->error_dev_num) {
> -		aer_print_port_info(parent, e_info, " (no details found)");
> +		aer_print_source(parent, e_info, " (no details found)");
>   		return false;
>   	}
>   	return true;
> @@ -1299,7 +1299,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
>   			e_info.multi_error_valid = 0;
>   
>   		if (find_source_device(pdev, &e_info)) {
> -			aer_print_port_info(pdev, &e_info, "");
> +			aer_print_source(pdev, &e_info, "");
>   			aer_process_err_devices(&e_info);
>   		}
>   	}
> @@ -1318,7 +1318,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
>   			e_info.multi_error_valid = 0;
>   
>   		if (find_source_device(pdev, &e_info)) {
> -			aer_print_port_info(pdev, &e_info, "");
> +			aer_print_source(pdev, &e_info, "");
>   			aer_process_err_devices(&e_info);
>   		}
>   	}

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v6 06/16] PCI/AER: Move aer_print_source() earlier in file
  2025-05-19 21:35 ` [PATCH v6 06/16] PCI/AER: Move aer_print_source() earlier in file Bjorn Helgaas
@ 2025-05-19 23:49   ` Sathyanarayanan Kuppuswamy
  2025-05-20 10:34   ` Ilpo Järvinen
  1 sibling, 0 replies; 68+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-19 23:49 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Lukas Wunner, Jonathan Cameron, Sargun Dhillon, Paul E . McKenney,
	Mahesh J Salgaonkar, Oliver O'Halloran, Kai-Heng Feng,
	Keith Busch, Robert Richter, Terry Bowman, Shiju Jose, Dave Jiang,
	linux-kernel, linuxppc-dev, Bjorn Helgaas


On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> From: Bjorn Helgaas <bhelgaas@google.com>
>
> Move aer_print_source() earlier in the file so a future change can use it
> from aer_print_error(), where it's easier to rate limit it.
>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

>   drivers/pci/pcie/aer.c | 24 ++++++++++++------------
>   1 file changed, 12 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index eb42d50b2def..95a4cab1d517 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -696,6 +696,18 @@ static void __aer_print_error(struct pci_dev *dev,
>   	pci_dev_aer_stats_incr(dev, info);
>   }
>   
> +static void aer_print_source(struct pci_dev *dev, struct aer_err_info *info,
> +			     const char *details)
> +{
> +	u16 source = info->id;
> +
> +	pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d%s\n",
> +		 info->multi_error_valid ? "Multiple " : "",
> +		 aer_error_severity_string[info->severity],
> +		 pci_domain_nr(dev->bus), PCI_BUS_NUM(source),
> +		 PCI_SLOT(source), PCI_FUNC(source), details);
> +}
> +
>   void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>   {
>   	int layer, agent;
> @@ -733,18 +745,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>   			info->severity, info->tlp_header_valid, &info->tlp);
>   }
>   
> -static void aer_print_source(struct pci_dev *dev, struct aer_err_info *info,
> -			     const char *details)
> -{
> -	u16 source = info->id;
> -
> -	pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d%s\n",
> -		 info->multi_error_valid ? "Multiple " : "",
> -		 aer_error_severity_string[info->severity],
> -		 pci_domain_nr(dev->bus), PCI_BUS_NUM(source),
> -		 PCI_SLOT(source), PCI_FUNC(source), details);
> -}
> -
>   #ifdef CONFIG_ACPI_APEI_PCIEAER
>   int cper_severity_to_aer(int cper_severity)
>   {

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v6 07/16] PCI/AER: Initialize aer_err_info before using it
  2025-05-19 21:35 ` [PATCH v6 07/16] PCI/AER: Initialize aer_err_info before using it Bjorn Helgaas
@ 2025-05-19 23:50   ` Sathyanarayanan Kuppuswamy
  2025-05-20 10:39   ` Ilpo Järvinen
  1 sibling, 0 replies; 68+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-19 23:50 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Lukas Wunner, Jonathan Cameron, Sargun Dhillon, Paul E . McKenney,
	Mahesh J Salgaonkar, Oliver O'Halloran, Kai-Heng Feng,
	Keith Busch, Robert Richter, Terry Bowman, Shiju Jose, Dave Jiang,
	linux-kernel, linuxppc-dev, Bjorn Helgaas


On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> From: Bjorn Helgaas <bhelgaas@google.com>
>
> Previously the struct aer_err_info "e_info" was allocated on the stack
> without being initialized, so it contained junk except for the fields we
> explicitly set later.
>
> Initialize "e_info" at declaration with a designated initializer list,
> which initializes the other members to zero.
>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
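
For anyone following along, here is a minimal userspace sketch of the C
behavior the commit log relies on: members that are not named in a
designated initializer list are zero-initialized. The struct below is a
hypothetical, trimmed-down stand-in for struct aer_err_info, and make_info()
is an illustrative helper, not a kernel function.

```c
/* Hypothetical, trimmed-down stand-in for struct aer_err_info. */
struct err_info {
	unsigned int id:16;
	unsigned int severity:2;
	unsigned int multi_error_valid:1;
	unsigned int status;	/* never named in the initializer below */
	unsigned int mask;	/* never named in the initializer below */
};

/*
 * Name only the members we care about.  C guarantees that the remaining
 * members (status, mask) start out as zero, so no memset() is needed.
 */
static struct err_info make_info(unsigned int id, unsigned int severity,
				 unsigned int multi)
{
	struct err_info info = {
		.id = id,
		.severity = severity,
		.multi_error_valid = multi ? 1 : 0,
	};

	return info;
}
```

Note that "multi" may be any non-zero mask result, which is why the patch
normalizes it with "? 1 : 0" before storing it in a 1-bit field.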

>   drivers/pci/pcie/aer.c | 37 ++++++++++++++++---------------------
>   1 file changed, 16 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 95a4cab1d517..40f003eca1c5 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1281,7 +1281,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
>   		struct aer_err_source *e_src)
>   {
>   	struct pci_dev *pdev = rpc->rpd;
> -	struct aer_err_info e_info;
> +	u32 status = e_src->status;
>   
>   	pci_rootport_aer_stats_incr(pdev, e_src);
>   
> @@ -1289,14 +1289,13 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
>   	 * There is a possibility that both correctable error and
>   	 * uncorrectable error being logged. Report correctable error first.
>   	 */
> -	if (e_src->status & PCI_ERR_ROOT_COR_RCV) {
> -		e_info.id = ERR_COR_ID(e_src->id);
> -		e_info.severity = AER_CORRECTABLE;
> -
> -		if (e_src->status & PCI_ERR_ROOT_MULTI_COR_RCV)
> -			e_info.multi_error_valid = 1;
> -		else
> -			e_info.multi_error_valid = 0;
> +	if (status & PCI_ERR_ROOT_COR_RCV) {
> +		int multi = status & PCI_ERR_ROOT_MULTI_COR_RCV;
> +		struct aer_err_info e_info = {
> +			.id = ERR_COR_ID(e_src->id),
> +			.severity = AER_CORRECTABLE,
> +			.multi_error_valid = multi ? 1 : 0,
> +		};
>   
>   		if (find_source_device(pdev, &e_info)) {
>   			aer_print_source(pdev, &e_info, "");
> @@ -1304,18 +1303,14 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
>   		}
>   	}
>   
> -	if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
> -		e_info.id = ERR_UNCOR_ID(e_src->id);
> -
> -		if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
> -			e_info.severity = AER_FATAL;
> -		else
> -			e_info.severity = AER_NONFATAL;
> -
> -		if (e_src->status & PCI_ERR_ROOT_MULTI_UNCOR_RCV)
> -			e_info.multi_error_valid = 1;
> -		else
> -			e_info.multi_error_valid = 0;
> +	if (status & PCI_ERR_ROOT_UNCOR_RCV) {
> +		int fatal = status & PCI_ERR_ROOT_FATAL_RCV;
> +		int multi = status & PCI_ERR_ROOT_MULTI_UNCOR_RCV;
> +		struct aer_err_info e_info = {
> +			.id = ERR_UNCOR_ID(e_src->id),
> +			.severity = fatal ? AER_FATAL : AER_NONFATAL,
> +			.multi_error_valid = multi ? 1 : 0,
> +		};
>   
>   		if (find_source_device(pdev, &e_info)) {
>   			aer_print_source(pdev, &e_info, "");

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer



* Re: [PATCH v6 08/16] PCI/AER: Simplify pci_print_aer()
  2025-05-19 21:35 ` [PATCH v6 08/16] PCI/AER: Simplify pci_print_aer() Bjorn Helgaas
@ 2025-05-20  0:02   ` Sathyanarayanan Kuppuswamy
  2025-05-20 14:38     ` Bjorn Helgaas
  2025-05-20 10:42   ` Ilpo Järvinen
  1 sibling, 1 reply; 68+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-20  0:02 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Lukas Wunner, Jonathan Cameron, Sargun Dhillon, Paul E . McKenney,
	Mahesh J Salgaonkar, Oliver O'Halloran, Kai-Heng Feng,
	Keith Busch, Robert Richter, Terry Bowman, Shiju Jose, Dave Jiang,
	linux-kernel, linuxppc-dev, Bjorn Helgaas


On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> From: Bjorn Helgaas <bhelgaas@google.com>
>
> Simplify pci_print_aer() by initializing the struct aer_err_info "info"
> with a designated initializer list (it was previously initialized with
> memset()) and using pci_name().
>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>   drivers/pci/pcie/aer.c | 16 ++++++++--------
>   1 file changed, 8 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 40f003eca1c5..73d618354f6a 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -765,7 +765,10 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>   {
>   	int layer, agent, tlp_header_valid = 0;
>   	u32 status, mask;
> -	struct aer_err_info info;

In your previous patches you converted the other stack-allocated
struct aer_err_info instances to zero initialization at declaration. Why
not do the same here? I don't think this function sets every field of
aer_err_info, right?

> +	struct aer_err_info info = {
> +		.severity = aer_severity,
> +		.first_error = PCI_ERR_CAP_FEP(aer->cap_control),
> +	};
>   
>   	if (aer_severity == AER_CORRECTABLE) {
>   		status = aer->cor_status;
> @@ -776,14 +779,11 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>   		tlp_header_valid = status & AER_LOG_TLP_MASKS;
>   	}
>   
> -	layer = AER_GET_LAYER_ERROR(aer_severity, status);
> -	agent = AER_GET_AGENT(aer_severity, status);
> -
> -	memset(&info, 0, sizeof(info));
> -	info.severity = aer_severity;
>   	info.status = status;
>   	info.mask = mask;
> -	info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
> +
> +	layer = AER_GET_LAYER_ERROR(aer_severity, status);
> +	agent = AER_GET_AGENT(aer_severity, status);
>   
>   	pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
>   	__aer_print_error(dev, &info);
> @@ -797,7 +797,7 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>   	if (tlp_header_valid)
>   		pcie_print_tlp_log(dev, &aer->header_log, dev_fmt("  "));
>   
> -	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
> +	trace_aer_event(pci_name(dev), (status & ~mask),
>   			aer_severity, tlp_header_valid, &aer->header_log);
>   }
>   EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer



* Re: [PATCH v6 09/16] PCI/AER: Update statistics early in logging
  2025-05-19 21:35 ` [PATCH v6 09/16] PCI/AER: Update statistics early in logging Bjorn Helgaas
@ 2025-05-20  1:32   ` Sathyanarayanan Kuppuswamy
  2025-05-20 11:04   ` Ilpo Järvinen
  1 sibling, 0 replies; 68+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-20  1:32 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Lukas Wunner, Jonathan Cameron, Sargun Dhillon, Paul E . McKenney,
	Mahesh J Salgaonkar, Oliver O'Halloran, Kai-Heng Feng,
	Keith Busch, Robert Richter, Terry Bowman, Shiju Jose, Dave Jiang,
	linux-kernel, linuxppc-dev, Bjorn Helgaas


On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> From: Bjorn Helgaas <bhelgaas@google.com>
>
> There are two AER logging entry points:
>
>    - aer_print_error() is used by DPC (dpc_process_error()) and native AER
>      handling (aer_process_err_devices()).
>
>    - pci_print_aer() is used by GHES (aer_recover_work_func()) and CXL
>      (cxl_handle_rdport_errors())
>
> Both use __aer_print_error() to print the AER error bits.  Previously
> __aer_print_error() also incremented the AER statistics via
> pci_dev_aer_stats_incr().
>
> Call pci_dev_aer_stats_incr() early in the entry points instead of in
> __aer_print_error() so we update the statistics even if the actual printing
> of error bits is rate limited by a future change.
>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
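
A rough userspace sketch of the ordering this patch sets up, in case it
helps other readers: the counter is bumped before the rate limit is
consulted, so suppressed messages are still counted. Everything here is an
assumption for illustration: struct window_limit is a simplified stand-in
for the kernel's struct ratelimit_state, and report_error() stands in for
the logging entry points.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-in for struct ratelimit_state (hypothetical). */
struct window_limit {
	uint64_t interval;	/* window length, in caller-defined ticks */
	unsigned int burst;	/* messages allowed per window */
	uint64_t begin;		/* start of the current window */
	unsigned int printed;	/* messages allowed so far in this window */
};

/* Return true if the caller may print at time "now". */
static bool window_limit_ok(struct window_limit *rl, uint64_t now)
{
	if (now - rl->begin >= rl->interval) {
		/* Window expired: start a new one with a fresh budget. */
		rl->begin = now;
		rl->printed = 0;
	}
	if (rl->printed >= rl->burst)
		return false;
	rl->printed++;
	return true;
}

struct dev_report {
	uint64_t total_errs;	/* always counted */
	uint64_t logged;	/* messages that actually reached the log */
	struct window_limit limit;
};

static void report_error(struct dev_report *r, uint64_t now)
{
	r->total_errs++;	/* statistics first, unconditionally */

	if (!window_limit_ok(&r->limit, now))
		return;		/* suppressed, but already counted */

	r->logged++;		/* stands in for the actual printk calls */
}
```

With a burst of 10, eleven errors in one window would leave logged at 10
and total_errs at 11 in this sketch.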

>   drivers/pci/pcie/aer.c | 5 ++++-
>   1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 73d618354f6a..eb80c382187d 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -693,7 +693,6 @@ static void __aer_print_error(struct pci_dev *dev,
>   		aer_printk(level, dev, "   [%2d] %-22s%s\n", i, errmsg,
>   				info->first_error == i ? " (First)" : "");
>   	}
> -	pci_dev_aer_stats_incr(dev, info);
>   }
>   
>   static void aer_print_source(struct pci_dev *dev, struct aer_err_info *info,
> @@ -714,6 +713,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>   	int id = pci_dev_id(dev);
>   	const char *level;
>   
> +	pci_dev_aer_stats_incr(dev, info);
> +
>   	if (!info->status) {
>   		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
>   			aer_error_severity_string[info->severity]);
> @@ -782,6 +783,8 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>   	info.status = status;
>   	info.mask = mask;
>   
> +	pci_dev_aer_stats_incr(dev, &info);
> +
>   	layer = AER_GET_LAYER_ERROR(aer_severity, status);
>   	agent = AER_GET_AGENT(aer_severity, status);
>   

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer



* Re: [PATCH v6 10/16] PCI/AER: Combine trace_aer_event() with statistics updates
  2025-05-19 21:35 ` [PATCH v6 10/16] PCI/AER: Combine trace_aer_event() with statistics updates Bjorn Helgaas
@ 2025-05-20  1:49   ` Sathyanarayanan Kuppuswamy
  2025-05-20 11:08   ` Ilpo Järvinen
  1 sibling, 0 replies; 68+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-20  1:49 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Lukas Wunner, Jonathan Cameron, Sargun Dhillon, Paul E . McKenney,
	Mahesh J Salgaonkar, Oliver O'Halloran, Kai-Heng Feng,
	Keith Busch, Robert Richter, Terry Bowman, Shiju Jose, Dave Jiang,
	linux-kernel, linuxppc-dev, Bjorn Helgaas


On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> From: Bjorn Helgaas <bhelgaas@google.com>
>
> As with the AER statistics, we always want to emit trace events, even if
> the actual dmesg logging is rate limited.
>
> Call trace_aer_event() directly from pci_dev_aer_stats_incr(), where we
> update the statistics.
>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

>   drivers/pci/pcie/aer.c | 12 ++++++------
>   1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index eb80c382187d..4683a99c7568 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -625,6 +625,9 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
>   	u64 *counter = NULL;
>   	struct aer_stats *aer_stats = pdev->aer_stats;
>   
> +	trace_aer_event(pci_name(pdev), (info->status & ~info->mask),
> +			info->severity, info->tlp_header_valid, &info->tlp);
> +
>   	if (!aer_stats)
>   		return;
>   
> @@ -741,9 +744,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>   out:
>   	if (info->id && info->error_dev_num > 1 && info->id == id)
>   		pci_err(dev, "  Error of this Agent is reported first\n");
> -
> -	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
> -			info->severity, info->tlp_header_valid, &info->tlp);
>   }
>   
>   #ifdef CONFIG_ACPI_APEI_PCIEAER
> @@ -782,6 +782,9 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>   
>   	info.status = status;
>   	info.mask = mask;
> +	info.tlp_header_valid = tlp_header_valid;
> +	if (tlp_header_valid)

I think you can skip this check. The trace call checks the valid flag before
accessing the TLP buffer. If you want to keep it, try to set it to NULL for
the !tlp_header_valid case.

> +		info.tlp = aer->header_log;
>   
>   	pci_dev_aer_stats_incr(dev, &info);
>   
> @@ -799,9 +802,6 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>   
>   	if (tlp_header_valid)
>   		pcie_print_tlp_log(dev, &aer->header_log, dev_fmt("  "));
> -
> -	trace_aer_event(pci_name(dev), (status & ~mask),
> -			aer_severity, tlp_header_valid, &aer->header_log);
>   }
>   EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
>   

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer



* Re: [PATCH v6 11/16] PCI/AER: Check log level once and remember it
  2025-05-19 21:35 ` [PATCH v6 11/16] PCI/AER: Check log level once and remember it Bjorn Helgaas
  2025-05-19 23:17   ` Weinan Liu
@ 2025-05-20  2:49   ` Sathyanarayanan Kuppuswamy
  2025-05-20 11:26   ` Ilpo Järvinen
  2 siblings, 0 replies; 68+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-20  2:49 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Lukas Wunner, Jonathan Cameron, Sargun Dhillon, Paul E . McKenney,
	Mahesh J Salgaonkar, Oliver O'Halloran, Kai-Heng Feng,
	Keith Busch, Robert Richter, Terry Bowman, Shiju Jose, Dave Jiang,
	linux-kernel, linuxppc-dev, Bjorn Helgaas


On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> From: Karolina Stolarek <karolina.stolarek@oracle.com>
>
> When reporting an AER error, we check its type multiple times to determine
> the log level for each message. Do this check only in the top-level
> functions (aer_isr_one_error(), pci_print_aer()) and save the level in
> struct aer_err_info.
>
> [bhelgaas: save log level in struct aer_err_info instead of passing it
> as a parameter]
> Link: https://lore.kernel.org/r/20250321015806.954866-2-pandoh@google.com
> Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

>   drivers/pci/pci.h      |  1 +
>   drivers/pci/pcie/aer.c | 21 ++++++++++-----------
>   drivers/pci/pcie/dpc.c |  1 +
>   3 files changed, 12 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index b81e99cd4b62..705f9ef58acc 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -588,6 +588,7 @@ static inline bool pci_dev_test_and_set_removed(struct pci_dev *dev)
>   struct aer_err_info {
>   	struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
>   	int error_dev_num;
> +	const char *level;		/* printk level */
>   
>   	unsigned int id:16;
>   
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 4683a99c7568..73b03a195b14 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -672,21 +672,18 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
>   	}
>   }
>   
> -static void __aer_print_error(struct pci_dev *dev,
> -			      struct aer_err_info *info)
> +static void __aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>   {
>   	const char **strings;
>   	unsigned long status = info->status & ~info->mask;
> -	const char *level, *errmsg;
> +	const char *level = info->level;
> +	const char *errmsg;
>   	int i;
>   
> -	if (info->severity == AER_CORRECTABLE) {
> +	if (info->severity == AER_CORRECTABLE)
>   		strings = aer_correctable_error_string;
> -		level = KERN_WARNING;
> -	} else {
> +	else
>   		strings = aer_uncorrectable_error_string;
> -		level = KERN_ERR;
> -	}
>   
>   	for_each_set_bit(i, &status, 32) {
>   		errmsg = strings[i];
> @@ -714,7 +711,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>   {
>   	int layer, agent;
>   	int id = pci_dev_id(dev);
> -	const char *level;
> +	const char *level = info->level;
>   
>   	pci_dev_aer_stats_incr(dev, info);
>   
> @@ -727,8 +724,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>   	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
>   	agent = AER_GET_AGENT(info->severity, info->status);
>   
> -	level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
> -
>   	aer_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
>   		   aer_error_severity_string[info->severity],
>   		   aer_error_layer[layer], aer_agent_string[agent]);
> @@ -774,9 +769,11 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>   	if (aer_severity == AER_CORRECTABLE) {
>   		status = aer->cor_status;
>   		mask = aer->cor_mask;
> +		info.level = KERN_WARNING;
>   	} else {
>   		status = aer->uncor_status;
>   		mask = aer->uncor_mask;
> +		info.level = KERN_ERR;
>   		tlp_header_valid = status & AER_LOG_TLP_MASKS;
>   	}
>   
> @@ -1297,6 +1294,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
>   		struct aer_err_info e_info = {
>   			.id = ERR_COR_ID(e_src->id),
>   			.severity = AER_CORRECTABLE,
> +			.level = KERN_WARNING,
>   			.multi_error_valid = multi ? 1 : 0,
>   		};
>   
> @@ -1312,6 +1310,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
>   		struct aer_err_info e_info = {
>   			.id = ERR_UNCOR_ID(e_src->id),
>   			.severity = fatal ? AER_FATAL : AER_NONFATAL,
> +			.level = KERN_ERR,
>   			.multi_error_valid = multi ? 1 : 0,
>   		};
>   
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index 315bf2bfd570..34af0ea45c0d 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -252,6 +252,7 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,
>   	else
>   		info->severity = AER_NONFATAL;
>   
> +	info->level = KERN_WARNING;

As Weinan pointed out, it should be KERN_ERR.

>   	return 1;
>   }
>   
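
To spell out the mapping behind the KERN_ERR comment above: the other hunks
in this patch use KERN_WARNING only for correctable errors and KERN_ERR for
both uncorrectable severities, and the DPC helper only ever sets
AER_FATAL or AER_NONFATAL. A small userspace sketch follows; the LEVEL_*
strings and aer_log_level() are hypothetical stand-ins, not kernel
definitions, and the enum values follow the "0:NONFATAL | 1:FATAL | 2:COR"
comment in struct aer_err_info.

```c
/* Severity encoding as documented in struct aer_err_info. */
enum aer_severity {
	AER_NONFATAL = 0,
	AER_FATAL = 1,
	AER_CORRECTABLE = 2,
};

/* Hypothetical userspace stand-ins for KERN_ERR and KERN_WARNING. */
#define LEVEL_ERR	"<3>"
#define LEVEL_WARNING	"<4>"

/* Pick the log level once, based only on the error severity. */
static const char *aer_log_level(enum aer_severity severity)
{
	return severity == AER_CORRECTABLE ? LEVEL_WARNING : LEVEL_ERR;
}
```

Under this mapping a DPC-reported error can never resolve to the warning
level, which matches the KERN_ERR suggestion.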

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer



* Re: [PATCH v6 12/16] PCI/AER: Make all pci_print_aer() log levels depend on error type
  2025-05-19 21:35 ` [PATCH v6 12/16] PCI/AER: Make all pci_print_aer() log levels depend on error type Bjorn Helgaas
@ 2025-05-20  3:23   ` Sathyanarayanan Kuppuswamy
  2025-05-20 11:37   ` Ilpo Järvinen
  1 sibling, 0 replies; 68+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-20  3:23 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Lukas Wunner, Jonathan Cameron, Sargun Dhillon, Paul E . McKenney,
	Mahesh J Salgaonkar, Oliver O'Halloran, Kai-Heng Feng,
	Keith Busch, Robert Richter, Terry Bowman, Shiju Jose, Dave Jiang,
	linux-kernel, linuxppc-dev, Bjorn Helgaas


On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> From: Karolina Stolarek <karolina.stolarek@oracle.com>
>
> Some existing logs in pci_print_aer() log with error severity by default.
> Convert them to depend on error type (consistent with rest of AER logging).
>
> Link: https://lore.kernel.org/r/20250321015806.954866-3-pandoh@google.com
> Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

>   drivers/pci/pcie/aer.c | 16 +++++++++++-----
>   1 file changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 73b03a195b14..06a7dda20846 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -788,15 +788,21 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>   	layer = AER_GET_LAYER_ERROR(aer_severity, status);
>   	agent = AER_GET_AGENT(aer_severity, status);
>   
> -	pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
> +	aer_printk(info.level, dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n",
> +		   status, mask);
>   	__aer_print_error(dev, &info);
> -	pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
> -		aer_error_layer[layer], aer_agent_string[agent]);
> +	aer_printk(info.level, dev, "aer_layer=%s, aer_agent=%s\n",
> +		   aer_error_layer[layer], aer_agent_string[agent]);
>   
>   	if (aer_severity != AER_CORRECTABLE)
> -		pci_err(dev, "aer_uncor_severity: 0x%08x\n",
> -			aer->uncor_severity);
> +		aer_printk(info.level, dev, "aer_uncor_severity: 0x%08x\n",
> +			   aer->uncor_severity);
>   
> +	/*
> +	 * pcie_print_tlp_log() uses KERN_ERR, but we only call it when
> +	 * tlp_header_valid is set, and info.level is always KERN_ERR in
> +	 * that case.
> +	 */
>   	if (tlp_header_valid)
>   		pcie_print_tlp_log(dev, &aer->header_log, dev_fmt("  "));
>   }

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer



* Re: [PATCH v6 13/16] PCI/AER: Rename struct aer_stats to aer_report
  2025-05-19 21:35 ` [PATCH v6 13/16] PCI/AER: Rename struct aer_stats to aer_report Bjorn Helgaas
@ 2025-05-20  3:30   ` Sathyanarayanan Kuppuswamy
  2025-05-20 21:25     ` Bjorn Helgaas
  2025-05-20 11:38   ` Ilpo Järvinen
  1 sibling, 1 reply; 68+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-20  3:30 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Lukas Wunner, Jonathan Cameron, Sargun Dhillon, Paul E . McKenney,
	Mahesh J Salgaonkar, Oliver O'Halloran, Kai-Heng Feng,
	Keith Busch, Robert Richter, Terry Bowman, Shiju Jose, Dave Jiang,
	linux-kernel, linuxppc-dev, Bjorn Helgaas


On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> From: Karolina Stolarek <karolina.stolarek@oracle.com>
>
> Update name to reflect the broader definition of structs/variables that are
> stored (e.g. ratelimits). This is a preparatory patch for adding rate limit
> support.
>
> Link: https://lore.kernel.org/r/20250321015806.954866-6-pandoh@google.com
> Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

>   drivers/pci/pcie/aer.c | 50 +++++++++++++++++++++---------------------
>   include/linux/pci.h    |  2 +-
>   2 files changed, 26 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 06a7dda20846..da62032bf024 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -54,11 +54,11 @@ struct aer_rpc {
>   	DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
>   };
>   
> -/* AER stats for the device */
> -struct aer_stats {
> +/* AER report for the device */
> +struct aer_report {

To me, aer_report also sounds like a stats-like struct. I prefer aer_info,
but it is up to you.

>   
>   	/*
> -	 * Fields for all AER capable devices. They indicate the errors
> +	 * Stats for all AER capable devices. They indicate the errors
>   	 * "as seen by this device". Note that this may mean that if an
>   	 * Endpoint is causing problems, the AER counters may increment
>   	 * at its link partner (e.g. Root Port) because the errors will be
> @@ -80,7 +80,7 @@ struct aer_stats {
>   	u64 dev_total_nonfatal_errs;
>   
>   	/*
> -	 * Fields for Root Ports & Root Complex Event Collectors only; these
> +	 * Stats for Root Ports & Root Complex Event Collectors only; these
>   	 * indicate the total number of ERR_COR, ERR_FATAL, and ERR_NONFATAL
>   	 * messages received by the Root Port / Event Collector, INCLUDING the
>   	 * ones that are generated internally (by the Root Port itself)
> @@ -377,7 +377,7 @@ void pci_aer_init(struct pci_dev *dev)
>   	if (!dev->aer_cap)
>   		return;
>   
> -	dev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
> +	dev->aer_report = kzalloc(sizeof(*dev->aer_report), GFP_KERNEL);
>   
>   	/*
>   	 * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
> @@ -398,8 +398,8 @@ void pci_aer_init(struct pci_dev *dev)
>   
>   void pci_aer_exit(struct pci_dev *dev)
>   {
> -	kfree(dev->aer_stats);
> -	dev->aer_stats = NULL;
> +	kfree(dev->aer_report);
> +	dev->aer_report = NULL;
>   }
>   
>   #define AER_AGENT_RECEIVER		0
> @@ -537,10 +537,10 @@ static const char *aer_agent_string[] = {
>   {									\
>   	unsigned int i;							\
>   	struct pci_dev *pdev = to_pci_dev(dev);				\
> -	u64 *stats = pdev->aer_stats->stats_array;			\
> +	u64 *stats = pdev->aer_report->stats_array;			\
>   	size_t len = 0;							\
>   									\
> -	for (i = 0; i < ARRAY_SIZE(pdev->aer_stats->stats_array); i++) {\
> +	for (i = 0; i < ARRAY_SIZE(pdev->aer_report->stats_array); i++) {\
>   		if (strings_array[i])					\
>   			len += sysfs_emit_at(buf, len, "%s %llu\n",	\
>   					     strings_array[i],		\
> @@ -551,7 +551,7 @@ static const char *aer_agent_string[] = {
>   					     i, stats[i]);		\
>   	}								\
>   	len += sysfs_emit_at(buf, len, "TOTAL_%s %llu\n", total_string,	\
> -			     pdev->aer_stats->total_field);		\
> +			     pdev->aer_report->total_field);		\
>   	return len;							\
>   }									\
>   static DEVICE_ATTR_RO(name)
> @@ -572,7 +572,7 @@ aer_stats_dev_attr(aer_dev_nonfatal, dev_nonfatal_errs,
>   		     char *buf)						\
>   {									\
>   	struct pci_dev *pdev = to_pci_dev(dev);				\
> -	return sysfs_emit(buf, "%llu\n", pdev->aer_stats->field);	\
> +	return sysfs_emit(buf, "%llu\n", pdev->aer_report->field);	\
>   }									\
>   static DEVICE_ATTR_RO(name)
>   
> @@ -599,7 +599,7 @@ static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
>   	struct device *dev = kobj_to_dev(kobj);
>   	struct pci_dev *pdev = to_pci_dev(dev);
>   
> -	if (!pdev->aer_stats)
> +	if (!pdev->aer_report)
>   		return 0;
>   
>   	if ((a == &dev_attr_aer_rootport_total_err_cor.attr ||
> @@ -623,28 +623,28 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
>   	unsigned long status = info->status & ~info->mask;
>   	int i, max = -1;
>   	u64 *counter = NULL;
> -	struct aer_stats *aer_stats = pdev->aer_stats;
> +	struct aer_report *aer_report = pdev->aer_report;
>   
>   	trace_aer_event(pci_name(pdev), (info->status & ~info->mask),
>   			info->severity, info->tlp_header_valid, &info->tlp);
>   
> -	if (!aer_stats)
> +	if (!aer_report)
>   		return;
>   
>   	switch (info->severity) {
>   	case AER_CORRECTABLE:
> -		aer_stats->dev_total_cor_errs++;
> -		counter = &aer_stats->dev_cor_errs[0];
> +		aer_report->dev_total_cor_errs++;
> +		counter = &aer_report->dev_cor_errs[0];
>   		max = AER_MAX_TYPEOF_COR_ERRS;
>   		break;
>   	case AER_NONFATAL:
> -		aer_stats->dev_total_nonfatal_errs++;
> -		counter = &aer_stats->dev_nonfatal_errs[0];
> +		aer_report->dev_total_nonfatal_errs++;
> +		counter = &aer_report->dev_nonfatal_errs[0];
>   		max = AER_MAX_TYPEOF_UNCOR_ERRS;
>   		break;
>   	case AER_FATAL:
> -		aer_stats->dev_total_fatal_errs++;
> -		counter = &aer_stats->dev_fatal_errs[0];
> +		aer_report->dev_total_fatal_errs++;
> +		counter = &aer_report->dev_fatal_errs[0];
>   		max = AER_MAX_TYPEOF_UNCOR_ERRS;
>   		break;
>   	}
> @@ -656,19 +656,19 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
>   static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
>   				 struct aer_err_source *e_src)
>   {
> -	struct aer_stats *aer_stats = pdev->aer_stats;
> +	struct aer_report *aer_report = pdev->aer_report;
>   
> -	if (!aer_stats)
> +	if (!aer_report)
>   		return;
>   
>   	if (e_src->status & PCI_ERR_ROOT_COR_RCV)
> -		aer_stats->rootport_total_cor_errs++;
> +		aer_report->rootport_total_cor_errs++;
>   
>   	if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
>   		if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
> -			aer_stats->rootport_total_fatal_errs++;
> +			aer_report->rootport_total_fatal_errs++;
>   		else
> -			aer_stats->rootport_total_nonfatal_errs++;
> +			aer_report->rootport_total_nonfatal_errs++;
>   	}
>   }
>   
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 0e8e3fd77e96..4b11a90107cb 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -346,7 +346,7 @@ struct pci_dev {
>   	u8		hdr_type;	/* PCI header type (`multi' flag masked out) */
>   #ifdef CONFIG_PCIEAER
>   	u16		aer_cap;	/* AER capability offset */
> -	struct aer_stats *aer_stats;	/* AER stats for this device */
> +	struct aer_report *aer_report;	/* AER report for this device */
>   #endif
>   #ifdef CONFIG_PCIEPORTBUS
>   	struct rcec_ea	*rcec_ea;	/* RCEC cached endpoint association */

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer



* Re: [PATCH v6 14/16] PCI/AER: Introduce ratelimit for error logs
  2025-05-19 21:35 ` [PATCH v6 14/16] PCI/AER: Introduce ratelimit for error logs Bjorn Helgaas
@ 2025-05-20  4:59   ` Sathyanarayanan Kuppuswamy
  2025-05-20 18:31     ` Bjorn Helgaas
  2025-05-20 11:55   ` Ilpo Järvinen
  1 sibling, 1 reply; 68+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-20  4:59 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Lukas Wunner, Jonathan Cameron, Sargun Dhillon, Paul E . McKenney,
	Mahesh J Salgaonkar, Oliver O'Halloran, Kai-Heng Feng,
	Keith Busch, Robert Richter, Terry Bowman, Shiju Jose, Dave Jiang,
	linux-kernel, linuxppc-dev, Bjorn Helgaas

Hi Bjorn,

On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> From: Jon Pan-Doh <pandoh@google.com>
>
> Spammy devices can flood kernel logs with AER errors and slow/stall
> execution. Add per-device ratelimits for AER correctable and uncorrectable
> errors that use the kernel defaults (10 per 5s).
>
> There are two AER logging entry points:
>
>    - aer_print_error() is used by DPC and native AER
>
>    - pci_print_aer() is used by GHES and CXL
>
> The native AER aer_print_error() case includes a loop that may log details
> from multiple devices.  This is ratelimited by the union of ratelimits for
> these devices, set by add_error_device(), which collects the devices.  If
> no such device is found, the Error Source message is ratelimited by the
> Root Port or RCEC that received the ERR_* message.
>
> The DPC aer_print_error() case is currently not ratelimited.

Can we also skip rate limiting for fatal errors in the AER driver?
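
As an aside on the "union of ratelimits" wording in the commit log above,
here is how I read it, as a userspace sketch only. The assumption (not
confirmed by the hunks quoted here) is that each source device's limiter is
consulted once while the devices are collected, and that the single
ratelimit bit is set if any one of them still allows printing. struct
src_dev, dev_ratelimit_ok() and collect_ratelimit() are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device limiter: messages still allowed in this window. */
struct src_dev {
	unsigned int budget;
};

/* Consume one message from the device's budget if any is left. */
static bool dev_ratelimit_ok(struct src_dev *d)
{
	if (d->budget == 0)
		return false;
	d->budget--;
	return true;
}

/* Only the bit of interest from struct aer_err_info. */
struct err_info {
	unsigned int ratelimit:1;	/* 0=skip, 1=print */
};

/*
 * Consult every device's limiter (no short-circuit, so each one is
 * charged) and OR the results into the single ratelimit bit.
 */
static void collect_ratelimit(struct err_info *info, struct src_dev *devs,
			      size_t n)
{
	for (size_t i = 0; i < n; i++)
		info->ratelimit |= dev_ratelimit_ok(&devs[i]);
}
```

In this reading, one quiet device among several noisy ones is enough to let
the whole report through.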

>
> The GHES and CXL pci_print_aer() cases are ratelimited by the Error Source
> device.
>
> Sargun at Meta reported internally that a flood of AER errors causes RCU
> CPU stall warnings and CSD-lock warnings.
>
> Tested using aer-inject[1]. Sent 11 AER errors. Observed 10 errors logged
> while AER stats (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) show
> true count of 11.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
>
> [bhelgaas: commit log, factor out trace_aer_event() and aer_print_rp_info()
> changes to previous patches, collect single aer_err_info.ratelimit as union
> of ratelimits of all error source devices]
> Link: https://lore.kernel.org/r/20250321015806.954866-7-pandoh@google.com
> Reported-by: Sargun Dhillon <sargun@meta.com>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>   drivers/pci/pci.h      |  3 ++-
>   drivers/pci/pcie/aer.c | 49 ++++++++++++++++++++++++++++++++++++------
>   drivers/pci/pcie/dpc.c |  1 +
>   3 files changed, 46 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 705f9ef58acc..65c466279ade 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -593,7 +593,8 @@ struct aer_err_info {
>   	unsigned int id:16;
>   
>   	unsigned int severity:2;	/* 0:NONFATAL | 1:FATAL | 2:COR */
> -	unsigned int __pad1:5;
> +	unsigned int ratelimit:1;	/* 0=skip, 1=print */
> +	unsigned int __pad1:4;
>   	unsigned int multi_error_valid:1;
>   
>   	unsigned int first_error:5;
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index da62032bf024..c335e0bb9f51 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -28,6 +28,7 @@
>   #include <linux/interrupt.h>
>   #include <linux/delay.h>
>   #include <linux/kfifo.h>
> +#include <linux/ratelimit.h>
>   #include <linux/slab.h>
>   #include <acpi/apei.h>
>   #include <acpi/ghes.h>
> @@ -88,6 +89,10 @@ struct aer_report {
>   	u64 rootport_total_cor_errs;
>   	u64 rootport_total_fatal_errs;
>   	u64 rootport_total_nonfatal_errs;
> +
> +	/* Ratelimits for errors */
> +	struct ratelimit_state cor_log_ratelimit;
> +	struct ratelimit_state uncor_log_ratelimit;
>   };
>   
>   #define AER_LOG_TLP_MASKS		(PCI_ERR_UNC_POISON_TLP|	\
> @@ -379,6 +384,11 @@ void pci_aer_init(struct pci_dev *dev)
>   
>   	dev->aer_report = kzalloc(sizeof(*dev->aer_report), GFP_KERNEL);
>   
> +	ratelimit_state_init(&dev->aer_report->cor_log_ratelimit,
> +			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
> +	ratelimit_state_init(&dev->aer_report->uncor_log_ratelimit,
> +			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
> +
>   	/*
>   	 * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
>   	 * PCI_ERR_COR_MASK, and PCI_ERR_CAP.  Root and Root Complex Event
> @@ -672,6 +682,18 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
>   	}
>   }
>   
> +static int aer_ratelimit(struct pci_dev *dev, unsigned int severity)
> +{
> +	struct ratelimit_state *ratelimit;
> +
> +	if (severity == AER_CORRECTABLE)
> +		ratelimit = &dev->aer_report->cor_log_ratelimit;
> +	else
> +		ratelimit = &dev->aer_report->uncor_log_ratelimit;
> +
> +	return __ratelimit(ratelimit);
> +}
> +
>   static void __aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>   {
>   	const char **strings;
> @@ -715,6 +737,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>   
>   	pci_dev_aer_stats_incr(dev, info);
>   
> +	if (!info->ratelimit)
> +		return;
> +
>   	if (!info->status) {
>   		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
>   			aer_error_severity_string[info->severity]);
> @@ -785,6 +810,9 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>   
>   	pci_dev_aer_stats_incr(dev, &info);
>   
> +	if (!aer_ratelimit(dev, info.severity))
> +		return;
> +
>   	layer = AER_GET_LAYER_ERROR(aer_severity, status);
>   	agent = AER_GET_AGENT(aer_severity, status);
>   
> @@ -815,8 +843,14 @@ EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
>    */
>   static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
>   {
> +	/*
> +	 * Ratelimit AER log messages.  Generally we add the Error Source
> +	 * device, but there are is_error_source() cases that can result in
> +	 * multiple devices being added here, so we OR them all together.
> +	 */
>   	if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
>   		e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
> +		e_info->ratelimit |= aer_ratelimit(dev, e_info->severity);
>   		e_info->error_dev_num++;
>   		return 0;
>   	}
> @@ -914,7 +948,7 @@ static int find_device_iter(struct pci_dev *dev, void *data)
>    * e_info->error_dev_num and e_info->dev[], based on the given information.
>    */
>   static bool find_source_device(struct pci_dev *parent,
> -		struct aer_err_info *e_info)
> +			       struct aer_err_info *e_info)
>   {
>   	struct pci_dev *dev = parent;
>   	int result;
> @@ -935,10 +969,12 @@ static bool find_source_device(struct pci_dev *parent,
>   	/*
>   	 * If we didn't find any devices with errors logged in the AER
>   	 * Capability, just print the Error Source ID from the Root Port or
> -	 * RCEC that received an ERR_* Message.
> +	 * RCEC that received an ERR_* Message, ratelimited by the RP or
> +	 * RCEC.
>   	 */
>   	if (!e_info->error_dev_num) {
> -		aer_print_source(parent, e_info, " (no details found)");
> +		if (aer_ratelimit(parent, e_info->severity))
> +			aer_print_source(parent, e_info, " (no details found)");
>   		return false;
>   	}
>   	return true;
> @@ -1147,9 +1183,10 @@ static void aer_recover_work_func(struct work_struct *work)
>   		pdev = pci_get_domain_bus_and_slot(entry.domain, entry.bus,
>   						   entry.devfn);
>   		if (!pdev) {
> -			pr_err("no pci_dev for %04x:%02x:%02x.%x\n",
> -			       entry.domain, entry.bus,
> -			       PCI_SLOT(entry.devfn), PCI_FUNC(entry.devfn));
> +			pr_err_ratelimited("%04x:%02x:%02x.%x: no pci_dev found\n",
> +					   entry.domain, entry.bus,
> +					   PCI_SLOT(entry.devfn),
> +					   PCI_FUNC(entry.devfn));
>   			continue;
>   		}
>   		pci_print_aer(pdev, entry.severity, entry.regs);
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index 34af0ea45c0d..597df7790f36 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -301,6 +301,7 @@ void dpc_process_error(struct pci_dev *pdev)
>   	else if (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_UNCOR &&
>   		 dpc_get_aer_uncorrect_severity(pdev, &info) &&
>   		 aer_get_device_error_info(pdev, &info)) {
> +		info.ratelimit = 1;	/* no ratelimiting */
>   		aer_print_error(pdev, &info);
>   		pci_aer_clear_nonfatal_status(pdev);
>   		pci_aer_clear_fatal_status(pdev);

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v6 15/16] PCI/AER: Add ratelimits to PCI AER Documentation
  2025-05-19 21:35 ` [PATCH v6 15/16] PCI/AER: Add ratelimits to PCI AER Documentation Bjorn Helgaas
@ 2025-05-20  5:01   ` Sathyanarayanan Kuppuswamy
  2025-05-20 19:48     ` Bjorn Helgaas
  0 siblings, 1 reply; 68+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-20  5:01 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Lukas Wunner, Jonathan Cameron, Sargun Dhillon, Paul E . McKenney,
	Mahesh J Salgaonkar, Oliver O'Halloran, Kai-Heng Feng,
	Keith Busch, Robert Richter, Terry Bowman, Shiju Jose, Dave Jiang,
	linux-kernel, linuxppc-dev, Bjorn Helgaas


On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> From: Jon Pan-Doh <pandoh@google.com>
>
> Add ratelimits section for rationale and defaults.
>
> Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Acked-by: Paul E. McKenney <paulmck@kernel.org>
> ---
>   Documentation/PCI/pcieaer-howto.rst | 11 +++++++++++
>   1 file changed, 11 insertions(+)
>
> diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
> index f013f3b27c82..896d2a232a90 100644
> --- a/Documentation/PCI/pcieaer-howto.rst
> +++ b/Documentation/PCI/pcieaer-howto.rst
> @@ -85,6 +85,17 @@ In the example, 'Requester ID' means the ID of the device that sent
>   the error message to the Root Port. Please refer to PCIe specs for other
>   fields.
>   
> +AER Ratelimits
> +--------------
> +
> +Since error messages can be generated for each transaction, we may see
> +large volumes of errors reported. To prevent spammy devices from flooding
> +the console/stalling execution, messages are throttled by device and error
> +type (correctable vs. uncorrectable).

Can we list exceptions like DPC and FATAL errors (if added)?

> +
> +AER uses the default ratelimit of DEFAULT_RATELIMIT_BURST (10 events) over
> +DEFAULT_RATELIMIT_INTERVAL (5 seconds).
> +
>   AER Statistics / Counters
>   -------------------------
>   

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v6 16/16] PCI/AER: Add sysfs attributes for log ratelimits
  2025-05-19 21:35 ` [PATCH v6 16/16] PCI/AER: Add sysfs attributes for log ratelimits Bjorn Helgaas
@ 2025-05-20  5:05   ` Sathyanarayanan Kuppuswamy
  2025-05-20 12:02   ` Ilpo Järvinen
  1 sibling, 0 replies; 68+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-20  5:05 UTC (permalink / raw)
  To: Bjorn Helgaas, linux-pci
  Cc: Jon Pan-Doh, Karolina Stolarek, Martin Petersen, Ben Fuller,
	Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
	Lukas Wunner, Jonathan Cameron, Sargun Dhillon, Paul E . McKenney,
	Mahesh J Salgaonkar, Oliver O'Halloran, Kai-Heng Feng,
	Keith Busch, Robert Richter, Terry Bowman, Shiju Jose, Dave Jiang,
	linux-kernel, linuxppc-dev, Bjorn Helgaas


On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> From: Jon Pan-Doh <pandoh@google.com>
>
> Allow userspace to read/write log ratelimits per device (including
> enable/disable). Create aer/ sysfs directory to store them and any
> future aer configs.
>
> Update AER sysfs ABI filename to reflect the broader scope of AER sysfs
> attributes (e.g. stats and ratelimits).
>
>    Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats ->
>      sysfs-bus-pci-devices-aer
>
> Tested using aer-inject[1]. Configured correctable log ratelimit to 5.
> Sent 6 AER errors. Observed 5 errors logged while AER stats
> (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) shows 6.
>
> Disabled ratelimiting and sent 6 more AER errors. Observed all 6 errors
> logged and accounted in AER stats (12 total errors).
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
>
> Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> Acked-by: Paul E. McKenney <paulmck@kernel.org>
> ---

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

>   ...es-aer_stats => sysfs-bus-pci-devices-aer} | 34 +++++++
>   Documentation/PCI/pcieaer-howto.rst           |  5 +-
>   drivers/pci/pci-sysfs.c                       |  1 +
>   drivers/pci/pci.h                             |  1 +
>   drivers/pci/pcie/aer.c                        | 99 +++++++++++++++++++
>   5 files changed, 139 insertions(+), 1 deletion(-)
>   rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (77%)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> similarity index 77%
> rename from Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> rename to Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> index d1f67bb81d5d..771204197b71 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> @@ -117,3 +117,37 @@ Date:		July 2018
>   KernelVersion:	4.19.0
>   Contact:	linux-pci@vger.kernel.org, rajatja@google.com
>   Description:	Total number of ERR_NONFATAL messages reported to rootport.
> +
> +PCIe AER ratelimits
> +-------------------
> +
> +These attributes show up under all the devices that are AER capable.
> +They represent configurable ratelimits of logs per error type.
> +
> +See Documentation/PCI/pcieaer-howto.rst for more info on ratelimits.
> +
> +What:		/sys/bus/pci/devices/<dev>/aer/ratelimit_log_enable
> +Date:		March 2025
> +KernelVersion:	6.15.0
> +Contact:	linux-pci@vger.kernel.org, pandoh@google.com
> +Description:	Writing 1/0 enables/disables AER log ratelimiting. Reading
> +		gets whether or not ratelimiting is currently enabled.
> +		Enabled by default.
> +
> +What:		/sys/bus/pci/devices/<dev>/aer/ratelimit_burst_cor_log
> +Date:		March 2025
> +KernelVersion:	6.15.0
> +Contact:	linux-pci@vger.kernel.org, pandoh@google.com
> +Description:	Ratelimit burst for correctable error logs. Writing a value
> +		changes the number of errors (burst) allowed per interval
> +		(5 second window) before ratelimiting. Reading gets the
> +		current ratelimit burst.
> +
> +What:		/sys/bus/pci/devices/<dev>/aer/ratelimit_burst_uncor_log
> +Date:		March 2025
> +KernelVersion:	6.15.0
> +Contact:	linux-pci@vger.kernel.org, pandoh@google.com
> +Description:	Ratelimit burst for uncorrectable error logs. Writing a
> +		value changes the number of errors (burst) allowed per
> +		interval (5 second window) before ratelimiting. Reading
> +		gets the current ratelimit burst.
> diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
> index 896d2a232a90..043cdb3194be 100644
> --- a/Documentation/PCI/pcieaer-howto.rst
> +++ b/Documentation/PCI/pcieaer-howto.rst
> @@ -96,12 +96,15 @@ type (correctable vs. uncorrectable).
>   AER uses the default ratelimit of DEFAULT_RATELIMIT_BURST (10 events) over
>   DEFAULT_RATELIMIT_INTERVAL (5 seconds).
>   
> +Ratelimits are exposed in the form of sysfs attributes and are configurable.
> +See Documentation/ABI/testing/sysfs-bus-pci-devices-aer.
> +
>   AER Statistics / Counters
>   -------------------------
>   
>   When PCIe AER errors are captured, the counters / statistics are also exposed
>   in the form of sysfs attributes which are documented at
> -Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> +Documentation/ABI/testing/sysfs-bus-pci-devices-aer.
>   
>   Developer Guide
>   ===============
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index c6cda56ca52c..278de99b00ce 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -1805,6 +1805,7 @@ const struct attribute_group *pci_dev_attr_groups[] = {
>   	&pcie_dev_attr_group,
>   #ifdef CONFIG_PCIEAER
>   	&aer_stats_attr_group,
> +	&aer_attr_group,
>   #endif
>   #ifdef CONFIG_PCIEASPM
>   	&aspm_ctrl_attr_group,
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 65c466279ade..a3261e842d6d 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -963,6 +963,7 @@ void pci_no_aer(void);
>   void pci_aer_init(struct pci_dev *dev);
>   void pci_aer_exit(struct pci_dev *dev);
>   extern const struct attribute_group aer_stats_attr_group;
> +extern const struct attribute_group aer_attr_group;
>   void pci_aer_clear_fatal_status(struct pci_dev *dev);
>   int pci_aer_clear_status(struct pci_dev *dev);
>   int pci_aer_raw_clear_status(struct pci_dev *dev);
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index c335e0bb9f51..42df5cb963b3 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -627,6 +627,105 @@ const struct attribute_group aer_stats_attr_group = {
>   	.is_visible = aer_stats_attrs_are_visible,
>   };
>   
> +/*
> + * Ratelimit enable toggle
> + * 0: disabled with ratelimit.interval = 0
> + * 1: enabled with ratelimit.interval = nonzero
> + */
> +static ssize_t ratelimit_log_enable_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	bool enabled = pdev->aer_report->cor_log_ratelimit.interval != 0;
> +
> +	return sysfs_emit(buf, "%d\n", enabled);
> +}
> +
> +static ssize_t ratelimit_log_enable_store(struct device *dev,
> +					  struct device_attribute *attr,
> +					  const char *buf, size_t count)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	bool enable;
> +	int interval;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	if (kstrtobool(buf, &enable) < 0)
> +		return -EINVAL;
> +
> +	if (enable)
> +		interval = DEFAULT_RATELIMIT_INTERVAL;
> +	else
> +		interval = 0;
> +
> +	pdev->aer_report->cor_log_ratelimit.interval = interval;
> +	pdev->aer_report->uncor_log_ratelimit.interval = interval;
> +
> +	return count;
> +}
> +static DEVICE_ATTR_RW(ratelimit_log_enable);
> +
> +#define aer_ratelimit_burst_attr(name, ratelimit)			\
> +	static ssize_t							\
> +	name##_show(struct device *dev, struct device_attribute *attr,	\
> +		    char *buf)						\
> +{									\
> +	struct pci_dev *pdev = to_pci_dev(dev);				\
> +									\
> +	return sysfs_emit(buf, "%d\n",					\
> +			  pdev->aer_report->ratelimit.burst);		\
> +}									\
> +									\
> +	static ssize_t							\
> +	name##_store(struct device *dev, struct device_attribute *attr,	\
> +		     const char *buf, size_t count)			\
> +{									\
> +	struct pci_dev *pdev = to_pci_dev(dev);				\
> +	int burst;							\
> +									\
> +	if (!capable(CAP_SYS_ADMIN))					\
> +		return -EPERM;						\
> +									\
> +	if (kstrtoint(buf, 0, &burst) < 0)				\
> +		return -EINVAL;						\
> +									\
> +	pdev->aer_report->ratelimit.burst = burst;			\
> +									\
> +	return count;							\
> +}									\
> +static DEVICE_ATTR_RW(name)
> +
> +aer_ratelimit_burst_attr(ratelimit_burst_cor_log, cor_log_ratelimit);
> +aer_ratelimit_burst_attr(ratelimit_burst_uncor_log, uncor_log_ratelimit);
> +
> +static struct attribute *aer_attrs[] = {
> +	&dev_attr_ratelimit_log_enable.attr,
> +	&dev_attr_ratelimit_burst_cor_log.attr,
> +	&dev_attr_ratelimit_burst_uncor_log.attr,
> +	NULL
> +};
> +
> +static umode_t aer_attrs_are_visible(struct kobject *kobj,
> +				     struct attribute *a, int n)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	if (!pdev->aer_report)
> +		return 0;
> +
> +	return a->mode;
> +}
> +
> +const struct attribute_group aer_attr_group = {
> +	.name = "aer",
> +	.attrs = aer_attrs,
> +	.is_visible = aer_attrs_are_visible,
> +};
> +
>   static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
>   				   struct aer_err_info *info)
>   {
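
As a usage illustration, the new attributes could be driven from a userspace helper along these lines. This is a minimal sketch: the sk_ names are hypothetical, the device address in the note after the code is only a placeholder, and the only facts taken from the patch are the aer/ directory and the attribute names.

```c
#include <stdio.h>

/*
 * Build the path of an attribute in the per-device aer/ directory.
 * Returns 0 on success, -1 if the path does not fit in 'buf'.
 */
static int sk_aer_attr_path(char *buf, size_t len, const char *bdf,
			    const char *attr)
{
	int n = snprintf(buf, len, "/sys/bus/pci/devices/%s/aer/%s", bdf, attr);

	return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Write 'value' to the attribute.  Returns 0 on success, -1 on any failure. */
static int sk_aer_attr_write(const char *bdf, const char *attr,
			     const char *value)
{
	char path[256];
	FILE *f;
	int ret = 0;

	if (sk_aer_attr_path(path, sizeof(path), bdf, attr))
		return -1;

	f = fopen(path, "w");
	if (!f)
		return -1;

	if (fputs(value, f) == EOF)
		ret = -1;
	/* The store handler's verdict may only show up when the write is flushed. */
	if (fclose(f) == EOF)
		ret = -1;

	return ret;
}
```

For example, sk_aer_attr_write("0000:02:00.0", "ratelimit_burst_cor_log", "5") would correspond to the "configured correctable log ratelimit to 5" step in the commit log, and writing "0" to ratelimit_log_enable to the "disabled ratelimiting" step. Since the store handlers check CAP_SYS_ADMIN, an unprivileged caller should expect the write to fail.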

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


* Re: [PATCH v6 00/16] Rate limit AER logs
  2025-05-19 21:35 [PATCH v6 00/16] Rate limit AER logs Bjorn Helgaas
                   ` (15 preceding siblings ...)
  2025-05-19 21:35 ` [PATCH v6 16/16] PCI/AER: Add sysfs attributes for log ratelimits Bjorn Helgaas
@ 2025-05-20  9:05 ` Krzysztof Wilczyński
  16 siblings, 0 replies; 68+ messages in thread
From: Krzysztof Wilczyński @ 2025-05-20  9:05 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Sathyanarayanan Kuppuswamy, Lukas Wunner,
	Jonathan Cameron, Sargun Dhillon, Paul E . McKenney,
	Mahesh J Salgaonkar, Oliver O'Halloran, Kai-Heng Feng,
	Keith Busch, Robert Richter, Terry Bowman, Shiju Jose, Dave Jiang,
	linux-kernel, linuxppc-dev, Bjorn Helgaas

Hello,

> This work is mostly due to Jon Pan-Doh and Karolina Stolarek.  I rebased
> this to v6.15-rc1, factored out some of the trace and statistics updates,
> and added some minor cleanups.
> 
> Proposal
> ========
> 
> When using native AER, spammy devices can flood kernel logs with AER errors
> and slow/stall execution. Add per-device per-error-severity ratelimits for
> more robust error logging. Allow userspace to configure ratelimits via
> sysfs knobs.
> 
> Motivation
> ==========
> 
> Inconsistent PCIe error handling, exacerbated at datacenter scale (myriad
> of devices), affects repairability flows for fleet operators.
> 
> Exposing PCIe errors/debug info in-band for a userspace daemon (e.g.
> rasdaemon) to collect/pass on to repairability services will allow for more
> predictable repair flows and decrease machine downtime.
> 
> Background
> ==========
> 
> AER error spam has been observed many times, both publicly (e.g. [1], [2],
> [3]) and privately. While it usually occurs with correctable errors, it can
> happen with uncorrectable errors (e.g. during new HW bringup).
> 
> There have been previous attempts to add ratelimits to AER logs ([4], [5]).
> The most recent attempt[5] has many similarities with the proposed
> approach.

I have been testing this series locally with and without faults triggered
using the AER error injection facility.  No issues thus far.

And, as such...

Tested-by: Krzysztof Wilczyński <kwilczynski@kernel.org>

Thank you!

	Krzysztof

* Re: [PATCH v6 01/16] PCI/DPC: Initialize aer_err_info before using it
  2025-05-19 21:35 ` [PATCH v6 01/16] PCI/DPC: Initialize aer_err_info before using it Bjorn Helgaas
  2025-05-19 22:41   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20  9:39   ` Ilpo Järvinen
  2025-05-20 13:54     ` Bjorn Helgaas
  1 sibling, 1 reply; 68+ messages in thread
From: Ilpo Järvinen @ 2025-05-20  9:39 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Mon, 19 May 2025, Bjorn Helgaas wrote:

> From: Bjorn Helgaas <bhelgaas@google.com>
> 
> Previously the struct aer_err_info "info" was allocated on the stack
> without being initialized, so it contained junk except for the fields we
> explicitly set later.
> 
> Initialize "info" at declaration so it starts as all zeroes.
> 
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  drivers/pci/pcie/dpc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index df42f15c9829..fe7719238456 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -258,7 +258,7 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,
>  void dpc_process_error(struct pci_dev *pdev)
>  {
>  	u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
> -	struct aer_err_info info;
> +	struct aer_err_info info = { 0 };

= {}; is enough to initialize it, no need to add those zeros.
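
To illustrate the point, the standalone sketch below compares the two spellings. struct sk_err_info is a hypothetical, cut-down imitation of aer_err_info, not the real layout. Note that the empty-brace form is a GNU extension before C23; the kernel's GNU C dialect accepts it, but a strictly pedantic pre-C23 compiler may warn.

```c
/* Hypothetical, cut-down imitation of struct aer_err_info. */
struct sk_err_info {
	unsigned int id:16;
	unsigned int severity:2;
	unsigned int ratelimit:1;
	unsigned int first_error:5;
	unsigned int status;
};

/* Return 1 if every named member is zero. */
static int sk_all_zero(const struct sk_err_info *info)
{
	return !info->id && !info->severity && !info->ratelimit &&
	       !info->first_error && !info->status;
}

/* Spelling used in the patch: the first member is given explicitly. */
static int sk_init_with_zero(void)
{
	struct sk_err_info info = { 0 };

	return sk_all_zero(&info);
}

/* Suggested spelling: empty braces, all members implicitly zeroed. */
static int sk_init_empty(void)
{
	struct sk_err_info info = {};

	return sk_all_zero(&info);
}
```

Both helpers return 1: in either form, members without an explicit initializer are zero-initialized, so the choice is about style rather than behaviour.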

>  
>  	pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
>  	pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &source);
> 

-- 
 i.


* Re: [PATCH v6 02/16] PCI/DPC: Log Error Source ID only when valid
  2025-05-19 21:35 ` [PATCH v6 02/16] PCI/DPC: Log Error Source ID only when valid Bjorn Helgaas
  2025-05-19 23:15   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 10:28   ` Ilpo Järvinen
  2025-05-20 14:05     ` Bjorn Helgaas
  1 sibling, 1 reply; 68+ messages in thread
From: Ilpo Järvinen @ 2025-05-20 10:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Mon, 19 May 2025, Bjorn Helgaas wrote:

> From: Bjorn Helgaas <bhelgaas@google.com>
> 
> DPC Error Source ID is only valid when the DPC Trigger Reason indicates
> that DPC was triggered due to reception of an ERR_NONFATAL or ERR_FATAL
> Message (PCIe r6.0, sec 7.9.14.5).
> 
> When DPC was triggered by ERR_NONFATAL (PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE)
> or ERR_FATAL (PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) from a downstream device,
> log the Error Source ID (decoded into domain/bus/device/function).  Don't
> print the source otherwise, since it's not valid.
> 
> For DPC trigger due to reception of ERR_NONFATAL or ERR_FATAL, the dmesg
> logging changes:
> 
>   - pci 0000:00:01.0: DPC: containment event, status:0x000d source:0x0200
>   - pci 0000:00:01.0: DPC: ERR_FATAL detected
>   + pci 0000:00:01.0: DPC: containment event, status:0x000d, ERR_FATAL received from 0000:02:00.0
> 
> and when DPC triggered for other reasons, where DPC Error Source ID is
> undefined, e.g., unmasked uncorrectable error:
> 
>   - pci 0000:00:01.0: DPC: containment event, status:0x0009 source:0x0200
>   - pci 0000:00:01.0: DPC: unmasked uncorrectable error detected
>   + pci 0000:00:01.0: DPC: containment event, status:0x0009: unmasked uncorrectable error detected
> 
> Previously the "containment event" message was at KERN_INFO and the
> "%s detected" message was at KERN_WARNING.  Now the single message is at
> KERN_WARNING.
> 
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  drivers/pci/pcie/dpc.c | 45 ++++++++++++++++++++++++++----------------
>  1 file changed, 28 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index fe7719238456..315bf2bfd570 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -261,25 +261,36 @@ void dpc_process_error(struct pci_dev *pdev)
>  	struct aer_err_info info = { 0 };
>  
>  	pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
> -	pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &source);
> -
> -	pci_info(pdev, "containment event, status:%#06x source:%#06x\n",
> -		 status, source);
>  
>  	reason = status & PCI_EXP_DPC_STATUS_TRIGGER_RSN;
> -	ext_reason = status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT;
> -	pci_warn(pdev, "%s detected\n",
> -		 (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_UNCOR) ?
> -		 "unmasked uncorrectable error" :
> -		 (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE) ?
> -		 "ERR_NONFATAL" :
> -		 (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) ?
> -		 "ERR_FATAL" :
> -		 (ext_reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_RP_PIO) ?
> -		 "RP PIO error" :
> -		 (ext_reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_SW_TRIGGER) ?
> -		 "software trigger" :
> -		 "reserved error");
> +
> +	switch (reason) {
> +	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_UNCOR:
> +		pci_warn(pdev, "containment event, status:%#06x: unmasked uncorrectable error detected\n",
> +			 status);
> +		break;
> +	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE:
> +	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE:
> +		pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID,
> +				     &source);
> +		pci_warn(pdev, "containment event, status:%#06x, %s received from %04x:%02x:%02x.%d\n",
> +			 status,
> +			 (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) ?
> +				"ERR_FATAL" : "ERR_NONFATAL",
> +			 pci_domain_nr(pdev->bus), PCI_BUS_NUM(source),
> +			 PCI_SLOT(source), PCI_FUNC(source));
> +		return;
> +	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_IN_EXT:
> +		ext_reason = status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT;
> +		pci_warn(pdev, "containment event, status:%#06x: %s detected\n",
> +			 status,
> +			 (ext_reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_RP_PIO) ?
> +			 "RP PIO error" :
> +			 (ext_reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_SW_TRIGGER) ?
> +			 "software trigger" :
> +			 "reserved error");
> +		break;
> +	}
>  
>  	/* show RP PIO error detail information */
>  	if (pdev->dpc_rp_extensions &&
> 

After adding that switch (reason) there, wouldn't it make sense to also
move the code from the if blocks into the case blocks? Those if
conditions check reason anyway, so each if branch would naturally
belong under one of the cases.
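
A rough userspace sketch of that shape follows. Each case both formats the message and decides which follow-up detail dump applies, so no separate if chain re-tests reason afterwards. All sk_ names are hypothetical, the bit masks are assumptions chosen to be consistent with the status values in the commit log (0x000d and 0x0009), and the follow-up is only reported as an enum value rather than performed.

```c
#include <stdio.h>

/* Assumed DPC Status bit masks; names are local to this sketch. */
#define SK_RSN			0x0006	/* Trigger Reason field */
#define SK_RSN_UNCOR		0x0000
#define SK_RSN_NFE		0x0002
#define SK_RSN_FE		0x0004
#define SK_RSN_IN_EXT		0x0006
#define SK_RSN_EXT		0x0060	/* Trigger Reason Extension field */
#define SK_RSN_RP_PIO		0x0000
#define SK_RSN_SW_TRIGGER	0x0020

#define SK_BUS_NUM(x)	(((x) >> 8) & 0xff)
#define SK_SLOT(x)	(((x) >> 3) & 0x1f)
#define SK_FUNC(x)	((x) & 0x07)

/* Which detail dump would follow the containment message. */
enum sk_followup { SK_NO_DETAILS, SK_AER_DETAILS, SK_RP_PIO_DETAILS };

/* Format the containment message into 'buf' and return the follow-up. */
static enum sk_followup sk_dpc_decode(unsigned int status, unsigned int source,
				      unsigned int domain, int rp_extensions,
				      char *buf, size_t len)
{
	unsigned int reason = status & SK_RSN;
	unsigned int ext_reason;

	switch (reason) {
	case SK_RSN_UNCOR:
		snprintf(buf, len,
			 "containment event, status:%#06x: unmasked uncorrectable error detected",
			 status);
		return SK_AER_DETAILS;
	case SK_RSN_NFE:
	case SK_RSN_FE:
		/* The Error Source ID is only meaningful for these reasons. */
		snprintf(buf, len,
			 "containment event, status:%#06x, %s received from %04x:%02x:%02x.%u",
			 status,
			 reason == SK_RSN_FE ? "ERR_FATAL" : "ERR_NONFATAL",
			 domain, SK_BUS_NUM(source), SK_SLOT(source),
			 SK_FUNC(source));
		return SK_NO_DETAILS;
	case SK_RSN_IN_EXT:
		ext_reason = status & SK_RSN_EXT;
		snprintf(buf, len,
			 "containment event, status:%#06x: %s detected", status,
			 ext_reason == SK_RSN_RP_PIO ? "RP PIO error" :
			 ext_reason == SK_RSN_SW_TRIGGER ? "software trigger" :
			 "reserved error");
		if (rp_extensions && ext_reason == SK_RSN_RP_PIO)
			return SK_RP_PIO_DETAILS;
		return SK_NO_DETAILS;
	}

	return SK_NO_DETAILS;	/* not reached: the 2-bit field has four values */
}
```

In the real function the SK_AER_DETAILS and SK_RP_PIO_DETAILS results would correspond to the bodies of the existing if branches moved into the respective cases.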

-- 
 i.


* Re: [PATCH v6 03/16] PCI/AER: Consolidate Error Source ID logging in aer_print_port_info()
  2025-05-19 21:35 ` [PATCH v6 03/16] PCI/AER: Consolidate Error Source ID logging in aer_print_port_info() Bjorn Helgaas
  2025-05-19 23:39   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 10:31   ` Ilpo Järvinen
  1 sibling, 0 replies; 68+ messages in thread
From: Ilpo Järvinen @ 2025-05-20 10:31 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Mon, 19 May 2025, Bjorn Helgaas wrote:

> From: Bjorn Helgaas <bhelgaas@google.com>
> 
> Previously we decoded the AER Error Source ID in two places.  Consolidate
> them so both places use aer_print_port_info().  Add a "details" parameter
> so we can add a note when we didn't find any downstream devices with errors
> logged in their AER Capability.
> 
> When we didn't read any error details from the source device, we logged two
> messages: one in aer_isr_one_error() and another in find_source_device().
> Since they both contain the same information, only log the first one when
> find_source_device() has found error details.
> 
> This changes the dmesg logging when we found no devices with errors logged:
> 
>   - pci 0000:00:01.0: AER: Correctable error message received from 0000:02:00.0
>   - pci 0000:00:01.0: AER: found no error details for 0000:02:00.0
>   + pci 0000:00:01.0: AER: Correctable error message received from 0000:02:00.0 (no details found)
> 
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  drivers/pci/pcie/aer.c | 30 ++++++++++++++++--------------
>  1 file changed, 16 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index a1cf8c7ef628..b8494ccd935b 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -733,16 +733,17 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  			info->severity, info->tlp_header_valid, &info->tlp);
>  }
>  
> -static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
> +static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info,
> +				const char *details)
>  {
>  	u8 bus = info->id >> 8;
>  	u8 devfn = info->id & 0xff;
>  
> -	pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d\n",
> +	pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d%s\n",
>  		 info->multi_error_valid ? "Multiple " : "",
>  		 aer_error_severity_string[info->severity],
>  		 pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
> -		 PCI_FUNC(devfn));
> +		 PCI_FUNC(devfn), details);
>  }
>  
>  #ifdef CONFIG_ACPI_APEI_PCIEAER
> @@ -926,13 +927,13 @@ static bool find_source_device(struct pci_dev *parent,
>  	else
>  		pci_walk_bus(parent->subordinate, find_device_iter, e_info);
>  
> +	/*
> +	 * If we didn't find any devices with errors logged in the AER
> +	 * Capability, just print the Error Source ID from the Root Port or
> +	 * RCEC that received an ERR_* Message.
> +	 */
>  	if (!e_info->error_dev_num) {
> -		u8 bus = e_info->id >> 8;
> -		u8 devfn = e_info->id & 0xff;
> -
> -		pci_info(parent, "found no error details for %04x:%02x:%02x.%d\n",
> -			 pci_domain_nr(parent->bus), bus, PCI_SLOT(devfn),
> -			 PCI_FUNC(devfn));
> +		aer_print_port_info(parent, e_info, " (no details found)");
>  		return false;
>  	}
>  	return true;
> @@ -1297,10 +1298,11 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
>  			e_info.multi_error_valid = 1;
>  		else
>  			e_info.multi_error_valid = 0;
> -		aer_print_port_info(pdev, &e_info);
>  
> -		if (find_source_device(pdev, &e_info))
> +		if (find_source_device(pdev, &e_info)) {
> +			aer_print_port_info(pdev, &e_info, "");
>  			aer_process_err_devices(&e_info);
> +		}
>  	}
>  
>  	if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
> @@ -1316,10 +1318,10 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
>  		else
>  			e_info.multi_error_valid = 0;
>  
> -		aer_print_port_info(pdev, &e_info);
> -
> -		if (find_source_device(pdev, &e_info))
> +		if (find_source_device(pdev, &e_info)) {
> +			aer_print_port_info(pdev, &e_info, "");
>  			aer_process_err_devices(&e_info);
> +		}
>  	}
>  }
>  
> 

Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

-- 
 i.

* Re: [PATCH v6 04/16] PCI/AER: Extract bus/dev/fn in aer_print_port_info() with PCI_BUS_NUM(), etc
  2025-05-19 21:35 ` [PATCH v6 04/16] PCI/AER: Extract bus/dev/fn in aer_print_port_info() with PCI_BUS_NUM(), etc Bjorn Helgaas
  2025-05-19 23:47   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 10:32   ` Ilpo Järvinen
  1 sibling, 0 replies; 68+ messages in thread
From: Ilpo Järvinen @ 2025-05-20 10:32 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Mon, 19 May 2025, Bjorn Helgaas wrote:

> From: Bjorn Helgaas <bhelgaas@google.com>
> 
> Use PCI_BUS_NUM(), PCI_SLOT(), PCI_FUNC() to extract the bus number,
> device, and function number directly from the Error Source ID.  There's no
> need to shift and mask it explicitly.
> 
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  drivers/pci/pcie/aer.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index b8494ccd935b..dc8a50e0a2b7 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -736,14 +736,13 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info,
>  				const char *details)
>  {
> -	u8 bus = info->id >> 8;
> -	u8 devfn = info->id & 0xff;
> +	u16 source = info->id;
>  
>  	pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d%s\n",
>  		 info->multi_error_valid ? "Multiple " : "",
>  		 aer_error_severity_string[info->severity],
> -		 pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
> -		 PCI_FUNC(devfn), details);
> +		 pci_domain_nr(dev->bus), PCI_BUS_NUM(source),
> +		 PCI_SLOT(source), PCI_FUNC(source), details);
>  }
>  
>  #ifdef CONFIG_ACPI_APEI_PCIEAER
> 

Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
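
A small userspace sketch of why the explicit shift and mask are redundant.
The three macros are local copies of the kernel helpers so this builds
outside the tree; struct bdf and the function names are made up for
illustration. PCI_SLOT() and PCI_FUNC() mask after shifting, so the bus bits
of the 16-bit Error Source ID never leak into the result.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the kernel helpers so the sketch builds in userspace. */
#define PCI_BUS_NUM(x)	(((x) >> 8) & 0xff)
#define PCI_SLOT(devfn)	(((devfn) >> 3) & 0x1f)
#define PCI_FUNC(devfn)	((devfn) & 0x07)

struct bdf {
	uint8_t bus, dev, fn;
};

/* New style: apply the helpers to the 16-bit Error Source ID directly. */
struct bdf decode_source_id(uint16_t source)
{
	struct bdf b = {
		.bus = PCI_BUS_NUM(source),
		.dev = PCI_SLOT(source),
		.fn = PCI_FUNC(source),
	};

	return b;
}

/* Old style: split into bus and devfn by hand first. */
struct bdf decode_source_id_open_coded(uint16_t source)
{
	uint8_t bus = source >> 8;
	uint8_t devfn = source & 0xff;
	struct bdf b = {
		.bus = bus,
		.dev = PCI_SLOT(devfn),
		.fn = PCI_FUNC(devfn),
	};

	return b;
}

/* Exhaustively confirm both forms agree for every possible Source ID. */
void check_all_source_ids(void)
{
	for (uint32_t id = 0; id <= UINT16_MAX; id++) {
		struct bdf a = decode_source_id((uint16_t)id);
		struct bdf b = decode_source_id_open_coded((uint16_t)id);

		assert(a.bus == b.bus && a.dev == b.dev && a.fn == b.fn);
	}
}
```

For example, Source ID 0x1a0b decodes to bus 0x1a, device 1, function 3 with
either form.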

-- 
 i.

* Re: [PATCH v6 05/16] PCI/AER: Rename aer_print_port_info() to aer_print_source()
  2025-05-19 21:35 ` [PATCH v6 05/16] PCI/AER: Rename aer_print_port_info() to aer_print_source() Bjorn Helgaas
  2025-05-19 23:48   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 10:33   ` Ilpo Järvinen
  1 sibling, 0 replies; 68+ messages in thread
From: Ilpo Järvinen @ 2025-05-20 10:33 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Mon, 19 May 2025, Bjorn Helgaas wrote:

> From: Jon Pan-Doh <pandoh@google.com>
> 
> Rename aer_print_port_info() to aer_print_source() to be more descriptive.
> This logs the Error Source ID logged by a Root Port or Root Complex Event
> Collector when it receives an ERR_COR, ERR_NONFATAL, or ERR_FATAL Message.
> 
> [bhelgaas: aer_print_rp_info() -> aer_print_source()]
> Link: https://lore.kernel.org/r/20250321015806.954866-5-pandoh@google.com
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  drivers/pci/pcie/aer.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index dc8a50e0a2b7..eb42d50b2def 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -733,8 +733,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  			info->severity, info->tlp_header_valid, &info->tlp);
>  }
>  
> -static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info,
> -				const char *details)
> +static void aer_print_source(struct pci_dev *dev, struct aer_err_info *info,
> +			     const char *details)
>  {
>  	u16 source = info->id;
>  
> @@ -932,7 +932,7 @@ static bool find_source_device(struct pci_dev *parent,
>  	 * RCEC that received an ERR_* Message.
>  	 */
>  	if (!e_info->error_dev_num) {
> -		aer_print_port_info(parent, e_info, " (no details found)");
> +		aer_print_source(parent, e_info, " (no details found)");
>  		return false;
>  	}
>  	return true;
> @@ -1299,7 +1299,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
>  			e_info.multi_error_valid = 0;
>  
>  		if (find_source_device(pdev, &e_info)) {
> -			aer_print_port_info(pdev, &e_info, "");
> +			aer_print_source(pdev, &e_info, "");
>  			aer_process_err_devices(&e_info);
>  		}
>  	}
> @@ -1318,7 +1318,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
>  			e_info.multi_error_valid = 0;
>  
>  		if (find_source_device(pdev, &e_info)) {
> -			aer_print_port_info(pdev, &e_info, "");
> +			aer_print_source(pdev, &e_info, "");
>  			aer_process_err_devices(&e_info);
>  		}
>  	}
> 

Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

-- 
 i.

* Re: [PATCH v6 06/16] PCI/AER: Move aer_print_source() earlier in file
  2025-05-19 21:35 ` [PATCH v6 06/16] PCI/AER: Move aer_print_source() earlier in file Bjorn Helgaas
  2025-05-19 23:49   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 10:34   ` Ilpo Järvinen
  1 sibling, 0 replies; 68+ messages in thread
From: Ilpo Järvinen @ 2025-05-20 10:34 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Sathyanarayanan Kuppuswamy, Lukas Wunner,
	Jonathan Cameron, Sargun Dhillon, Paul E . McKenney,
	Mahesh J Salgaonkar, Oliver O'Halloran, Kai-Heng Feng,
	Keith Busch, Robert Richter, Terry Bowman, Shiju Jose, Dave Jiang,
	linux-kernel, linuxppc-dev, Bjorn Helgaas

On Mon, 19 May 2025, Bjorn Helgaas wrote:

> From: Bjorn Helgaas <bhelgaas@google.com>
> 
> Move aer_print_source() earlier in the file so a future change can use it
> from aer_print_error(), where it's easier to rate limit it.
> 
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  drivers/pci/pcie/aer.c | 24 ++++++++++++------------
>  1 file changed, 12 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index eb42d50b2def..95a4cab1d517 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -696,6 +696,18 @@ static void __aer_print_error(struct pci_dev *dev,
>  	pci_dev_aer_stats_incr(dev, info);
>  }
>  
> +static void aer_print_source(struct pci_dev *dev, struct aer_err_info *info,
> +			     const char *details)
> +{
> +	u16 source = info->id;
> +
> +	pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d%s\n",
> +		 info->multi_error_valid ? "Multiple " : "",
> +		 aer_error_severity_string[info->severity],
> +		 pci_domain_nr(dev->bus), PCI_BUS_NUM(source),
> +		 PCI_SLOT(source), PCI_FUNC(source), details);
> +}
> +
>  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  {
>  	int layer, agent;
> @@ -733,18 +745,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  			info->severity, info->tlp_header_valid, &info->tlp);
>  }
>  
> -static void aer_print_source(struct pci_dev *dev, struct aer_err_info *info,
> -			     const char *details)
> -{
> -	u16 source = info->id;
> -
> -	pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d%s\n",
> -		 info->multi_error_valid ? "Multiple " : "",
> -		 aer_error_severity_string[info->severity],
> -		 pci_domain_nr(dev->bus), PCI_BUS_NUM(source),
> -		 PCI_SLOT(source), PCI_FUNC(source), details);
> -}
> -
>  #ifdef CONFIG_ACPI_APEI_PCIEAER
>  int cper_severity_to_aer(int cper_severity)
>  {
> 

Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

-- 
 i.

* Re: [PATCH v6 07/16] PCI/AER: Initialize aer_err_info before using it
  2025-05-19 21:35 ` [PATCH v6 07/16] PCI/AER: Initialize aer_err_info before using it Bjorn Helgaas
  2025-05-19 23:50   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 10:39   ` Ilpo Järvinen
  2025-05-20 14:27     ` Bjorn Helgaas
  1 sibling, 1 reply; 68+ messages in thread
From: Ilpo Järvinen @ 2025-05-20 10:39 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Mon, 19 May 2025, Bjorn Helgaas wrote:

> From: Bjorn Helgaas <bhelgaas@google.com>
> 
> Previously the struct aer_err_info "e_info" was allocated on the stack
> without being initialized, so it contained junk except for the fields we
> explicitly set later.
> 
> Initialize "e_info" at declaration with a designated initializer list,
> which initializes the other members to zero.
> 
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  drivers/pci/pcie/aer.c | 37 ++++++++++++++++---------------------
>  1 file changed, 16 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 95a4cab1d517..40f003eca1c5 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1281,7 +1281,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
>  		struct aer_err_source *e_src)

Unrelated to this change, these would fit on a single line.

Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

>  {
>  	struct pci_dev *pdev = rpc->rpd;
> -	struct aer_err_info e_info;
> +	u32 status = e_src->status;
>  
>  	pci_rootport_aer_stats_incr(pdev, e_src);
>  
> @@ -1289,14 +1289,13 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
>  	 * There is a possibility that both correctable error and
>  	 * uncorrectable error being logged. Report correctable error first.
>  	 */
> -	if (e_src->status & PCI_ERR_ROOT_COR_RCV) {
> -		e_info.id = ERR_COR_ID(e_src->id);
> -		e_info.severity = AER_CORRECTABLE;
> -
> -		if (e_src->status & PCI_ERR_ROOT_MULTI_COR_RCV)
> -			e_info.multi_error_valid = 1;
> -		else
> -			e_info.multi_error_valid = 0;
> +	if (status & PCI_ERR_ROOT_COR_RCV) {
> +		int multi = status & PCI_ERR_ROOT_MULTI_COR_RCV;
> +		struct aer_err_info e_info = {
> +			.id = ERR_COR_ID(e_src->id),
> +			.severity = AER_CORRECTABLE,
> +			.multi_error_valid = multi ? 1 : 0,
> +		};
>  
>  		if (find_source_device(pdev, &e_info)) {
>  			aer_print_source(pdev, &e_info, "");
> @@ -1304,18 +1303,14 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
>  		}
>  	}
>  
> -	if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
> -		e_info.id = ERR_UNCOR_ID(e_src->id);
> -
> -		if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
> -			e_info.severity = AER_FATAL;
> -		else
> -			e_info.severity = AER_NONFATAL;
> -
> -		if (e_src->status & PCI_ERR_ROOT_MULTI_UNCOR_RCV)
> -			e_info.multi_error_valid = 1;
> -		else
> -			e_info.multi_error_valid = 0;
> +	if (status & PCI_ERR_ROOT_UNCOR_RCV) {
> +		int fatal = status & PCI_ERR_ROOT_FATAL_RCV;
> +		int multi = status & PCI_ERR_ROOT_MULTI_UNCOR_RCV;
> +		struct aer_err_info e_info = {
> +			.id = ERR_UNCOR_ID(e_src->id),
> +			.severity = fatal ? AER_FATAL : AER_NONFATAL,
> +			.multi_error_valid = multi ? 1 : 0,
> +		};
>  
>  		if (find_source_device(pdev, &e_info)) {
>  			aer_print_source(pdev, &e_info, "");
> 
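
To illustrate the initialization pattern used above, here is a minimal
userspace sketch. The struct is a hypothetical, trimmed-down stand-in for
struct aer_err_info, and the SKETCH_* bit values are chosen for illustration
only. It shows two things: members not named in a designated initializer
start out as zero, and the raw flag test is normalized with "? 1 : 0"
before being stored in the 1-bit field.

```c
#include <assert.h>

/* Illustrative status bits; not taken from the kernel headers. */
#define SKETCH_ROOT_COR_RCV		0x1
#define SKETCH_ROOT_MULTI_COR_RCV	0x2

enum { SKETCH_NONFATAL, SKETCH_FATAL, SKETCH_CORRECTABLE };

/* Hypothetical, trimmed-down stand-in for struct aer_err_info. */
struct err_info_sketch {
	int error_dev_num;
	unsigned int id:16;
	unsigned int severity:2;
	unsigned int multi_error_valid:1;
	unsigned int status;
	unsigned int mask;
};

struct err_info_sketch make_cor_info(unsigned int status, unsigned int src_id)
{
	/*
	 * "multi" is either 0 or the raw bit value (2 here), so it is
	 * normalized to 0/1 before being stored in the 1-bit field.
	 */
	int multi = status & SKETCH_ROOT_MULTI_COR_RCV;
	struct err_info_sketch info = {
		.id = src_id & 0xffff,
		.severity = SKETCH_CORRECTABLE,
		.multi_error_valid = multi ? 1 : 0,
	};

	/* Members not named in the initializer are zero, not stack junk. */
	assert(info.error_dev_num == 0 && info.status == 0 && info.mask == 0);
	return info;
}
```

Declaring the struct inside each branch, as the patch does, also means each
error report starts from a clean state instead of inheriting fields from the
previous one.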

-- 
 i.

* Re: [PATCH v6 08/16] PCI/AER: Simplify pci_print_aer()
  2025-05-19 21:35 ` [PATCH v6 08/16] PCI/AER: Simplify pci_print_aer() Bjorn Helgaas
  2025-05-20  0:02   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 10:42   ` Ilpo Järvinen
  1 sibling, 0 replies; 68+ messages in thread
From: Ilpo Järvinen @ 2025-05-20 10:42 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Mon, 19 May 2025, Bjorn Helgaas wrote:

> From: Bjorn Helgaas <bhelgaas@google.com>
> 
> Simplify pci_print_aer() by initializing the struct aer_err_info "info"
> with a designated initializer list (it was previously initialized with
> memset()) and using pci_name().
> 
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  drivers/pci/pcie/aer.c | 16 ++++++++--------
>  1 file changed, 8 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 40f003eca1c5..73d618354f6a 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -765,7 +765,10 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  {
>  	int layer, agent, tlp_header_valid = 0;
>  	u32 status, mask;
> -	struct aer_err_info info;
> +	struct aer_err_info info = {
> +		.severity = aer_severity,
> +		.first_error = PCI_ERR_CAP_FEP(aer->cap_control),
> +	};
>  
>  	if (aer_severity == AER_CORRECTABLE) {
>  		status = aer->cor_status;
> @@ -776,14 +779,11 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  		tlp_header_valid = status & AER_LOG_TLP_MASKS;
>  	}
>  
> -	layer = AER_GET_LAYER_ERROR(aer_severity, status);
> -	agent = AER_GET_AGENT(aer_severity, status);
> -
> -	memset(&info, 0, sizeof(info));
> -	info.severity = aer_severity;
>  	info.status = status;
>  	info.mask = mask;
> -	info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
> +
> +	layer = AER_GET_LAYER_ERROR(aer_severity, status);
> +	agent = AER_GET_AGENT(aer_severity, status);
>  
>  	pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
>  	__aer_print_error(dev, &info);
> @@ -797,7 +797,7 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  	if (tlp_header_valid)
>  		pcie_print_tlp_log(dev, &aer->header_log, dev_fmt("  "));
>  
> -	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
> +	trace_aer_event(pci_name(dev), (status & ~mask),
>  			aer_severity, tlp_header_valid, &aer->header_log);
>  }
>  EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
> 

Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

-- 
 i.

* Re: [PATCH v6 09/16] PCI/AER: Update statistics early in logging
  2025-05-19 21:35 ` [PATCH v6 09/16] PCI/AER: Update statistics early in logging Bjorn Helgaas
  2025-05-20  1:32   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 11:04   ` Ilpo Järvinen
  1 sibling, 0 replies; 68+ messages in thread
From: Ilpo Järvinen @ 2025-05-20 11:04 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Mon, 19 May 2025, Bjorn Helgaas wrote:

> From: Bjorn Helgaas <bhelgaas@google.com>
> 
> There are two AER logging entry points:
> 
>   - aer_print_error() is used by DPC (dpc_process_error()) and native AER
>     handling (aer_process_err_devices()).
> 
>   - pci_print_aer() is used by GHES (aer_recover_work_func()) and CXL
>     (cxl_handle_rdport_errors())
> 
> Both use __aer_print_error() to print the AER error bits.  Previously
> __aer_print_error() also incremented the AER statistics via
> pci_dev_aer_stats_incr().
> 
> Call pci_dev_aer_stats_incr() early in the entry points instead of in
> __aer_print_error() so we update the statistics even if the actual printing
> of error bits is rate limited by a future change.
> 
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  drivers/pci/pcie/aer.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 73d618354f6a..eb80c382187d 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -693,7 +693,6 @@ static void __aer_print_error(struct pci_dev *dev,
>  		aer_printk(level, dev, "   [%2d] %-22s%s\n", i, errmsg,
>  				info->first_error == i ? " (First)" : "");
>  	}
> -	pci_dev_aer_stats_incr(dev, info);
>  }
>  
>  static void aer_print_source(struct pci_dev *dev, struct aer_err_info *info,
> @@ -714,6 +713,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  	int id = pci_dev_id(dev);
>  	const char *level;
>  
> +	pci_dev_aer_stats_incr(dev, info);
> +
>  	if (!info->status) {
>  		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
>  			aer_error_severity_string[info->severity]);
> @@ -782,6 +783,8 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  	info.status = status;
>  	info.mask = mask;
>  
> +	pci_dev_aer_stats_incr(dev, &info);
> +
>  	layer = AER_GET_LAYER_ERROR(aer_severity, status);
>  	agent = AER_GET_AGENT(aer_severity, status);
>  
> 

Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
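
A minimal sketch of the ordering this patch sets up, assuming a simple
print budget as the limiter. The struct, the budget counter, and the
function names are hypothetical and are not the kernel's ratelimit API; the
sketch only shows that counting happens before the decision to print, so
suppressed messages still show up in the statistics.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

struct dev_sketch {
	unsigned long long total_errs;	/* statistics: every error counts */
	unsigned int printed;		/* lines actually emitted */
	unsigned int burst_left;	/* hypothetical print budget */
};

/* Hypothetical limiter: allow a print while the budget lasts. */
static bool print_allowed(struct dev_sketch *dev)
{
	if (!dev->burst_left)
		return false;

	dev->burst_left--;
	return true;
}

void report_error(struct dev_sketch *dev, unsigned int status)
{
	assert(dev);

	/* Counted first, so rate limiting never hides an error from stats. */
	dev->total_errs++;

	if (!print_allowed(dev))
		return;

	printf("error status: 0x%08x\n", status);
	dev->printed++;
}
```

With a budget of 2 and five reported errors, the counter ends at 5 while
only 2 lines are printed.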

-- 
 i.

* Re: [PATCH v6 10/16] PCI/AER: Combine trace_aer_event() with statistics updates
  2025-05-19 21:35 ` [PATCH v6 10/16] PCI/AER: Combine trace_aer_event() with statistics updates Bjorn Helgaas
  2025-05-20  1:49   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 11:08   ` Ilpo Järvinen
  1 sibling, 0 replies; 68+ messages in thread
From: Ilpo Järvinen @ 2025-05-20 11:08 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Mon, 19 May 2025, Bjorn Helgaas wrote:

> From: Bjorn Helgaas <bhelgaas@google.com>
> 
> As with the AER statistics, we always want to emit trace events, even if
> the actual dmesg logging is rate limited.
> 
> Call trace_aer_event() directly from pci_dev_aer_stats_incr(), where we
> update the statistics.
> 
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  drivers/pci/pcie/aer.c | 12 ++++++------
>  1 file changed, 6 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index eb80c382187d..4683a99c7568 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -625,6 +625,9 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
>  	u64 *counter = NULL;
>  	struct aer_stats *aer_stats = pdev->aer_stats;
>  
> +	trace_aer_event(pci_name(pdev), (info->status & ~info->mask),
> +			info->severity, info->tlp_header_valid, &info->tlp);
> +
>  	if (!aer_stats)
>  		return;
>  
> @@ -741,9 +744,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  out:
>  	if (info->id && info->error_dev_num > 1 && info->id == id)
>  		pci_err(dev, "  Error of this Agent is reported first\n");
> -
> -	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
> -			info->severity, info->tlp_header_valid, &info->tlp);
>  }
>  
>  #ifdef CONFIG_ACPI_APEI_PCIEAER
> @@ -782,6 +782,9 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  
>  	info.status = status;
>  	info.mask = mask;
> +	info.tlp_header_valid = tlp_header_valid;
> +	if (tlp_header_valid)
> +		info.tlp = aer->header_log;
>  
>  	pci_dev_aer_stats_incr(dev, &info);
>  
> @@ -799,9 +802,6 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  
>  	if (tlp_header_valid)
>  		pcie_print_tlp_log(dev, &aer->header_log, dev_fmt("  "));
> -
> -	trace_aer_event(pci_name(dev), (status & ~mask),
> -			aer_severity, tlp_header_valid, &aer->header_log);
>  }
>  EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
>  
> 

Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
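
One detail of the hunk above that is easy to miss: the trace call sits
before the "if (!aer_stats) return;" check, so the event is emitted even
when no statistics were allocated. A minimal userspace sketch of that
ordering follows; all names are hypothetical stand-ins, and the trace event
is modeled as a counter plus the last traced value (status & ~mask, i.e.
only the unmasked error bits).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct stats_sketch {
	uint64_t total;
};

struct dev_sketch {
	struct stats_sketch *stats;	/* may be NULL if allocation failed */
	unsigned int traced;		/* stand-in for emitted trace events */
	uint32_t last_traced_bits;	/* what the last event reported */
};

/* Hypothetical stand-in for the trace event. */
static void trace_event_sketch(struct dev_sketch *dev, uint32_t bits)
{
	dev->traced++;
	dev->last_traced_bits = bits;
}

void stats_incr(struct dev_sketch *dev, uint32_t status, uint32_t mask)
{
	assert(dev);

	/* Trace first: only unmasked bits, and regardless of stats. */
	trace_event_sketch(dev, status & ~mask);

	if (!dev->stats)
		return;

	dev->stats->total++;
}
```

With status 0x41 and mask 0x40, the traced value is 0x01, whether or not
dev->stats is NULL.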

-- 
 i.

* Re: [PATCH v6 11/16] PCI/AER: Check log level once and remember it
  2025-05-19 21:35 ` [PATCH v6 11/16] PCI/AER: Check log level once and remember it Bjorn Helgaas
  2025-05-19 23:17   ` Weinan Liu
  2025-05-20  2:49   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 11:26   ` Ilpo Järvinen
  2 siblings, 0 replies; 68+ messages in thread
From: Ilpo Järvinen @ 2025-05-20 11:26 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Mon, 19 May 2025, Bjorn Helgaas wrote:

> From: Karolina Stolarek <karolina.stolarek@oracle.com>
> 
> When reporting an AER error, we check its type multiple times to determine
> the log level for each message. Do this check only in the top-level
> functions (aer_isr_one_error(), pci_print_aer()) and save the level in
> struct aer_err_info.
> 
> [bhelgaas: save log level in struct aer_err_info instead of passing it
> as a parameter]
> Link: https://lore.kernel.org/r/20250321015806.954866-2-pandoh@google.com
> Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  drivers/pci/pci.h      |  1 +
>  drivers/pci/pcie/aer.c | 21 ++++++++++-----------
>  drivers/pci/pcie/dpc.c |  1 +
>  3 files changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index b81e99cd4b62..705f9ef58acc 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -588,6 +588,7 @@ static inline bool pci_dev_test_and_set_removed(struct pci_dev *dev)
>  struct aer_err_info {
>  	struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
>  	int error_dev_num;
> +	const char *level;		/* printk level */

As a general direction, wouldn't it be better to start writing these
comments in a kernel-doc-compatible format (even if kernel-doc isn't yet
enabled for the struct with /**)?
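
Roughly what that could look like, as a sketch only: the struct below is a
hypothetical three-member subset of struct aer_err_info, the member
descriptions are my own wording, and the accessor exists only so the
fragment compiles and can be exercised. The comment follows the kernel-doc
layout but opens with a plain "/*", so it is not picked up by the tooling
until someone changes it to "/**".

```c
#include <assert.h>
#include <stddef.h>

/*
 * struct err_info_sketch - hypothetical subset of struct aer_err_info
 * @error_dev_num: number of devices found with errors logged
 * @level: printk level used for every message about this error
 * @id: Error Source ID reported by the Root Port or RCEC
 */
struct err_info_sketch {
	int error_dev_num;
	const char *level;
	unsigned int id:16;
};

/* Trivial user so the sketch compiles and can be exercised. */
const char *err_info_level(const struct err_info_sketch *info)
{
	assert(info != NULL);
	return info->level;
}
```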

Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

-- 
 i.

>  
>  	unsigned int id:16;
>  
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 4683a99c7568..73b03a195b14 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -672,21 +672,18 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
>  	}
>  }
>  
> -static void __aer_print_error(struct pci_dev *dev,
> -			      struct aer_err_info *info)
> +static void __aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  {
>  	const char **strings;
>  	unsigned long status = info->status & ~info->mask;
> -	const char *level, *errmsg;
> +	const char *level = info->level;
> +	const char *errmsg;
>  	int i;
>  
> -	if (info->severity == AER_CORRECTABLE) {
> +	if (info->severity == AER_CORRECTABLE)
>  		strings = aer_correctable_error_string;
> -		level = KERN_WARNING;
> -	} else {
> +	else
>  		strings = aer_uncorrectable_error_string;
> -		level = KERN_ERR;
> -	}
>  
>  	for_each_set_bit(i, &status, 32) {
>  		errmsg = strings[i];
> @@ -714,7 +711,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  {
>  	int layer, agent;
>  	int id = pci_dev_id(dev);
> -	const char *level;
> +	const char *level = info->level;
>  
>  	pci_dev_aer_stats_incr(dev, info);
>  
> @@ -727,8 +724,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
>  	agent = AER_GET_AGENT(info->severity, info->status);
>  
> -	level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
> -
>  	aer_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
>  		   aer_error_severity_string[info->severity],
>  		   aer_error_layer[layer], aer_agent_string[agent]);
> @@ -774,9 +769,11 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  	if (aer_severity == AER_CORRECTABLE) {
>  		status = aer->cor_status;
>  		mask = aer->cor_mask;
> +		info.level = KERN_WARNING;
>  	} else {
>  		status = aer->uncor_status;
>  		mask = aer->uncor_mask;
> +		info.level = KERN_ERR;
>  		tlp_header_valid = status & AER_LOG_TLP_MASKS;
>  	}
>  
> @@ -1297,6 +1294,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
>  		struct aer_err_info e_info = {
>  			.id = ERR_COR_ID(e_src->id),
>  			.severity = AER_CORRECTABLE,
> +			.level = KERN_WARNING,
>  			.multi_error_valid = multi ? 1 : 0,
>  		};
>  
> @@ -1312,6 +1310,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
>  		struct aer_err_info e_info = {
>  			.id = ERR_UNCOR_ID(e_src->id),
>  			.severity = fatal ? AER_FATAL : AER_NONFATAL,
> +			.level = KERN_ERR,
>  			.multi_error_valid = multi ? 1 : 0,
>  		};
>  
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index 315bf2bfd570..34af0ea45c0d 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -252,6 +252,7 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,
>  	else
>  		info->severity = AER_NONFATAL;
>  
> +	info->level = KERN_ERR;
>  	return 1;
>  }
>  
> 
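
A minimal userspace sketch of the "decide once, remember it" pattern in
this patch. The KERN_* strings and AER_* values are local copies so the
sketch builds outside the kernel; the struct, set_level(), and print_line()
are hypothetical names. The level is chosen in one place from the severity,
and helpers only read it back.

```c
#include <assert.h>
#include <stdio.h>

/* Local copies so the sketch builds in userspace. */
#define KERN_ERR	"\001" "3"
#define KERN_WARNING	"\001" "4"

enum { AER_NONFATAL = 0, AER_FATAL = 1, AER_CORRECTABLE = 2 };

/* Hypothetical stand-in for the two aer_err_info fields used here. */
struct err_info_sketch {
	int severity;
	const char *level;	/* printk level, chosen once */
};

/* Top-level entry point: the only place that inspects the severity. */
void set_level(struct err_info_sketch *info)
{
	assert(info);
	info->level = (info->severity == AER_CORRECTABLE) ?
		      KERN_WARNING : KERN_ERR;
}

/* Helpers no longer re-derive the level; they just read info->level. */
int print_line(char *buf, size_t len, const struct err_info_sketch *info,
	       const char *msg)
{
	assert(buf && info && info->level && msg);
	return snprintf(buf, len, "<%c>%s", info->level[1], msg);
}
```

Keeping the level in the struct rather than passing it as a parameter means
every helper that already takes the info pointer gets it for free.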

* Re: [PATCH v6 12/16] PCI/AER: Make all pci_print_aer() log levels depend on error type
  2025-05-19 21:35 ` [PATCH v6 12/16] PCI/AER: Make all pci_print_aer() log levels depend on error type Bjorn Helgaas
  2025-05-20  3:23   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 11:37   ` Ilpo Järvinen
  2025-05-20 15:04     ` Bjorn Helgaas
  1 sibling, 1 reply; 68+ messages in thread
From: Ilpo Järvinen @ 2025-05-20 11:37 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Mon, 19 May 2025, Bjorn Helgaas wrote:

> From: Karolina Stolarek <karolina.stolarek@oracle.com>
> 
> Some existing logs in pci_print_aer() log with error severity by default.
> Convert them to depend on error type (consistent with rest of AER logging).
> 
> Link: https://lore.kernel.org/r/20250321015806.954866-3-pandoh@google.com
> Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  drivers/pci/pcie/aer.c | 16 +++++++++++-----
>  1 file changed, 11 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 73b03a195b14..06a7dda20846 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -788,15 +788,21 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  	layer = AER_GET_LAYER_ERROR(aer_severity, status);
>  	agent = AER_GET_AGENT(aer_severity, status);
>  
> -	pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
> +	aer_printk(info.level, dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n",
> +		   status, mask);
>  	__aer_print_error(dev, &info);
> -	pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
> -		aer_error_layer[layer], aer_agent_string[agent]);
> +	aer_printk(info.level, dev, "aer_layer=%s, aer_agent=%s\n",
> +		   aer_error_layer[layer], aer_agent_string[agent]);
>  
>  	if (aer_severity != AER_CORRECTABLE)
> -		pci_err(dev, "aer_uncor_severity: 0x%08x\n",
> -			aer->uncor_severity);
> +		aer_printk(info.level, dev, "aer_uncor_severity: 0x%08x\n",
> +			   aer->uncor_severity);
>  
> +	/*
> +	 * pcie_print_tlp_log() uses KERN_ERR, but we only call it when
> +	 * tlp_header_valid is set, and info.level is always KERN_ERR in
> +	 * that case.
> +	 */
>  	if (tlp_header_valid)
>  		pcie_print_tlp_log(dev, &aer->header_log, dev_fmt("  "));

There's another similar call site, but only this one has the comment added.
I was wondering whether this call could be made from __aer_print_error()
instead. That would slightly change the order of the messages, but I can't
decide whether that would be good or bad.

Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>


>  }
> 
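
The invariant stated in the new comment can be written down as a small
userspace sketch. Everything here is a hypothetical stand-in: the mask
value is arbitrary (it is not AER_LOG_TLP_MASKS), and levels are modeled as
the characters 'W' and 'E'. The TLP header is only considered in the
uncorrectable branch, which is the same branch that selects the error
level, so "TLP header valid" implies "error level".

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Arbitrary stand-in for AER_LOG_TLP_MASKS; the real value differs. */
#define SKETCH_TLP_MASKS	0x00040000u

struct print_plan {
	char level;		/* 'W' for warning, 'E' for error */
	bool tlp_header_valid;
};

struct print_plan plan_print(bool correctable, uint32_t status)
{
	struct print_plan p = { .level = correctable ? 'W' : 'E' };

	/* The TLP header is only examined for uncorrectable errors. */
	if (!correctable)
		p.tlp_header_valid = (status & SKETCH_TLP_MASKS) != 0;

	/* Hence a valid TLP header always goes with the error level. */
	assert(!p.tlp_header_valid || p.level == 'E');
	return p;
}
```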

-- 
 i.

* Re: [PATCH v6 13/16] PCI/AER: Rename struct aer_stats to aer_report
  2025-05-19 21:35 ` [PATCH v6 13/16] PCI/AER: Rename struct aer_stats to aer_report Bjorn Helgaas
  2025-05-20  3:30   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 11:38   ` Ilpo Järvinen
  1 sibling, 0 replies; 68+ messages in thread
From: Ilpo Järvinen @ 2025-05-20 11:38 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Mon, 19 May 2025, Bjorn Helgaas wrote:

> From: Karolina Stolarek <karolina.stolarek@oracle.com>
> 
> Update name to reflect the broader definition of structs/variables that are
> stored (e.g. ratelimits). This is a preparatory patch for adding rate limit
> support.
> 
> Link: https://lore.kernel.org/r/20250321015806.954866-6-pandoh@google.com
> Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  drivers/pci/pcie/aer.c | 50 +++++++++++++++++++++---------------------
>  include/linux/pci.h    |  2 +-
>  2 files changed, 26 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 06a7dda20846..da62032bf024 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -54,11 +54,11 @@ struct aer_rpc {
>  	DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
>  };
>  
> -/* AER stats for the device */
> -struct aer_stats {
> +/* AER report for the device */
> +struct aer_report {
>  
>  	/*
> -	 * Fields for all AER capable devices. They indicate the errors
> +	 * Stats for all AER capable devices. They indicate the errors
>  	 * "as seen by this device". Note that this may mean that if an
>  	 * Endpoint is causing problems, the AER counters may increment
>  	 * at its link partner (e.g. Root Port) because the errors will be
> @@ -80,7 +80,7 @@ struct aer_stats {
>  	u64 dev_total_nonfatal_errs;
>  
>  	/*
> -	 * Fields for Root Ports & Root Complex Event Collectors only; these
> +	 * Stats for Root Ports & Root Complex Event Collectors only; these
>  	 * indicate the total number of ERR_COR, ERR_FATAL, and ERR_NONFATAL
>  	 * messages received by the Root Port / Event Collector, INCLUDING the
>  	 * ones that are generated internally (by the Root Port itself)
> @@ -377,7 +377,7 @@ void pci_aer_init(struct pci_dev *dev)
>  	if (!dev->aer_cap)
>  		return;
>  
> -	dev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
> +	dev->aer_report = kzalloc(sizeof(*dev->aer_report), GFP_KERNEL);
>  
>  	/*
>  	 * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
> @@ -398,8 +398,8 @@ void pci_aer_init(struct pci_dev *dev)
>  
>  void pci_aer_exit(struct pci_dev *dev)
>  {
> -	kfree(dev->aer_stats);
> -	dev->aer_stats = NULL;
> +	kfree(dev->aer_report);
> +	dev->aer_report = NULL;
>  }
>  
>  #define AER_AGENT_RECEIVER		0
> @@ -537,10 +537,10 @@ static const char *aer_agent_string[] = {
>  {									\
>  	unsigned int i;							\
>  	struct pci_dev *pdev = to_pci_dev(dev);				\
> -	u64 *stats = pdev->aer_stats->stats_array;			\
> +	u64 *stats = pdev->aer_report->stats_array;			\
>  	size_t len = 0;							\
>  									\
> -	for (i = 0; i < ARRAY_SIZE(pdev->aer_stats->stats_array); i++) {\
> +	for (i = 0; i < ARRAY_SIZE(pdev->aer_report->stats_array); i++) {\
>  		if (strings_array[i])					\
>  			len += sysfs_emit_at(buf, len, "%s %llu\n",	\
>  					     strings_array[i],		\
> @@ -551,7 +551,7 @@ static const char *aer_agent_string[] = {
>  					     i, stats[i]);		\
>  	}								\
>  	len += sysfs_emit_at(buf, len, "TOTAL_%s %llu\n", total_string,	\
> -			     pdev->aer_stats->total_field);		\
> +			     pdev->aer_report->total_field);		\
>  	return len;							\
>  }									\
>  static DEVICE_ATTR_RO(name)
> @@ -572,7 +572,7 @@ aer_stats_dev_attr(aer_dev_nonfatal, dev_nonfatal_errs,
>  		     char *buf)						\
>  {									\
>  	struct pci_dev *pdev = to_pci_dev(dev);				\
> -	return sysfs_emit(buf, "%llu\n", pdev->aer_stats->field);	\
> +	return sysfs_emit(buf, "%llu\n", pdev->aer_report->field);	\
>  }									\
>  static DEVICE_ATTR_RO(name)
>  
> @@ -599,7 +599,7 @@ static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
>  	struct device *dev = kobj_to_dev(kobj);
>  	struct pci_dev *pdev = to_pci_dev(dev);
>  
> -	if (!pdev->aer_stats)
> +	if (!pdev->aer_report)
>  		return 0;
>  
>  	if ((a == &dev_attr_aer_rootport_total_err_cor.attr ||
> @@ -623,28 +623,28 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
>  	unsigned long status = info->status & ~info->mask;
>  	int i, max = -1;
>  	u64 *counter = NULL;
> -	struct aer_stats *aer_stats = pdev->aer_stats;
> +	struct aer_report *aer_report = pdev->aer_report;
>  
>  	trace_aer_event(pci_name(pdev), (info->status & ~info->mask),
>  			info->severity, info->tlp_header_valid, &info->tlp);
>  
> -	if (!aer_stats)
> +	if (!aer_report)
>  		return;
>  
>  	switch (info->severity) {
>  	case AER_CORRECTABLE:
> -		aer_stats->dev_total_cor_errs++;
> -		counter = &aer_stats->dev_cor_errs[0];
> +		aer_report->dev_total_cor_errs++;
> +		counter = &aer_report->dev_cor_errs[0];
>  		max = AER_MAX_TYPEOF_COR_ERRS;
>  		break;
>  	case AER_NONFATAL:
> -		aer_stats->dev_total_nonfatal_errs++;
> -		counter = &aer_stats->dev_nonfatal_errs[0];
> +		aer_report->dev_total_nonfatal_errs++;
> +		counter = &aer_report->dev_nonfatal_errs[0];
>  		max = AER_MAX_TYPEOF_UNCOR_ERRS;
>  		break;
>  	case AER_FATAL:
> -		aer_stats->dev_total_fatal_errs++;
> -		counter = &aer_stats->dev_fatal_errs[0];
> +		aer_report->dev_total_fatal_errs++;
> +		counter = &aer_report->dev_fatal_errs[0];
>  		max = AER_MAX_TYPEOF_UNCOR_ERRS;
>  		break;
>  	}
> @@ -656,19 +656,19 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
>  static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
>  				 struct aer_err_source *e_src)
>  {
> -	struct aer_stats *aer_stats = pdev->aer_stats;
> +	struct aer_report *aer_report = pdev->aer_report;
>  
> -	if (!aer_stats)
> +	if (!aer_report)
>  		return;
>  
>  	if (e_src->status & PCI_ERR_ROOT_COR_RCV)
> -		aer_stats->rootport_total_cor_errs++;
> +		aer_report->rootport_total_cor_errs++;
>  
>  	if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
>  		if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
> -			aer_stats->rootport_total_fatal_errs++;
> +			aer_report->rootport_total_fatal_errs++;
>  		else
> -			aer_stats->rootport_total_nonfatal_errs++;
> +			aer_report->rootport_total_nonfatal_errs++;
>  	}
>  }
>  
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 0e8e3fd77e96..4b11a90107cb 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -346,7 +346,7 @@ struct pci_dev {
>  	u8		hdr_type;	/* PCI header type (`multi' flag masked out) */
>  #ifdef CONFIG_PCIEAER
>  	u16		aer_cap;	/* AER capability offset */
> -	struct aer_stats *aer_stats;	/* AER stats for this device */
> +	struct aer_report *aer_report;	/* AER report for this device */
>  #endif
>  #ifdef CONFIG_PCIEPORTBUS
>  	struct rcec_ea	*rcec_ea;	/* RCEC cached endpoint association */
> 

Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>

-- 
 i.

* Re: [PATCH v6 14/16] PCI/AER: Introduce ratelimit for error logs
  2025-05-19 21:35 ` [PATCH v6 14/16] PCI/AER: Introduce ratelimit for error logs Bjorn Helgaas
  2025-05-20  4:59   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 11:55   ` Ilpo Järvinen
  2025-05-20 19:38     ` Bjorn Helgaas
  1 sibling, 1 reply; 68+ messages in thread
From: Ilpo Järvinen @ 2025-05-20 11:55 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Mon, 19 May 2025, Bjorn Helgaas wrote:

> From: Jon Pan-Doh <pandoh@google.com>
> 
> Spammy devices can flood kernel logs with AER errors and slow/stall
> execution. Add per-device ratelimits for AER correctable and uncorrectable
> errors that use the kernel defaults (10 per 5s).
> 
> There are two AER logging entry points:
> 
>   - aer_print_error() is used by DPC and native AER
> 
>   - pci_print_aer() is used by GHES and CXL
> 
> The native AER aer_print_error() case includes a loop that may log details
> from multiple devices.  This is ratelimited by the union of ratelimits for
> these devices, set by add_error_device(), which collects the devices.  If
> no such device is found, the Error Source message is ratelimited by the
> Root Port or RCEC that received the ERR_* message.
> 
> The DPC aer_print_error() case is currently not ratelimited.
> 
> The GHES and CXL pci_print_aer() cases are ratelimited by the Error Source
> device.
> 
> Sargun at Meta reported internally that a flood of AER errors causes RCU
> CPU stall warnings and CSD-lock warnings.
> 
> Tested using aer-inject[1]. Sent 11 AER errors. Observed 10 errors logged
> while AER stats (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) show
> true count of 11.
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
> 
> [bhelgaas: commit log, factor out trace_aer_event() and aer_print_rp_info()
> changes to previous patches, collect single aer_err_info.ratelimit as union
> of ratelimits of all error source devices]
> Link: https://lore.kernel.org/r/20250321015806.954866-7-pandoh@google.com
> Reported-by: Sargun Dhillon <sargun@meta.com>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> ---
>  drivers/pci/pci.h      |  3 ++-
>  drivers/pci/pcie/aer.c | 49 ++++++++++++++++++++++++++++++++++++------
>  drivers/pci/pcie/dpc.c |  1 +
>  3 files changed, 46 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 705f9ef58acc..65c466279ade 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -593,7 +593,8 @@ struct aer_err_info {
>  	unsigned int id:16;
>  
>  	unsigned int severity:2;	/* 0:NONFATAL | 1:FATAL | 2:COR */
> -	unsigned int __pad1:5;
> +	unsigned int ratelimit:1;	/* 0=skip, 1=print */
> +	unsigned int __pad1:4;
>  	unsigned int multi_error_valid:1;
>  
>  	unsigned int first_error:5;
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index da62032bf024..c335e0bb9f51 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -28,6 +28,7 @@
>  #include <linux/interrupt.h>
>  #include <linux/delay.h>
>  #include <linux/kfifo.h>
> +#include <linux/ratelimit.h>
>  #include <linux/slab.h>
>  #include <acpi/apei.h>
>  #include <acpi/ghes.h>
> @@ -88,6 +89,10 @@ struct aer_report {
>  	u64 rootport_total_cor_errs;
>  	u64 rootport_total_fatal_errs;
>  	u64 rootport_total_nonfatal_errs;
> +
> +	/* Ratelimits for errors */
> +	struct ratelimit_state cor_log_ratelimit;
> +	struct ratelimit_state uncor_log_ratelimit;
>  };
>  
>  #define AER_LOG_TLP_MASKS		(PCI_ERR_UNC_POISON_TLP|	\
> @@ -379,6 +384,11 @@ void pci_aer_init(struct pci_dev *dev)
>  
>  	dev->aer_report = kzalloc(sizeof(*dev->aer_report), GFP_KERNEL);
>  
> +	ratelimit_state_init(&dev->aer_report->cor_log_ratelimit,
> +			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
> +	ratelimit_state_init(&dev->aer_report->uncor_log_ratelimit,
> +			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
> +
>  	/*
>  	 * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
>  	 * PCI_ERR_COR_MASK, and PCI_ERR_CAP.  Root and Root Complex Event
> @@ -672,6 +682,18 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
>  	}
>  }
>  
> +static int aer_ratelimit(struct pci_dev *dev, unsigned int severity)
> +{
> +	struct ratelimit_state *ratelimit;
> +
> +	if (severity == AER_CORRECTABLE)
> +		ratelimit = &dev->aer_report->cor_log_ratelimit;
> +	else
> +		ratelimit = &dev->aer_report->uncor_log_ratelimit;
> +
> +	return __ratelimit(ratelimit);
> +}
> +
>  static void __aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  {
>  	const char **strings;
> @@ -715,6 +737,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  
>  	pci_dev_aer_stats_incr(dev, info);
>  
> +	if (!info->ratelimit)
> +		return;
> +
>  	if (!info->status) {
>  		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
>  			aer_error_severity_string[info->severity]);
> @@ -785,6 +810,9 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  
>  	pci_dev_aer_stats_incr(dev, &info);
>  
> +	if (!aer_ratelimit(dev, info.severity))
> +		return;
> +
>  	layer = AER_GET_LAYER_ERROR(aer_severity, status);
>  	agent = AER_GET_AGENT(aer_severity, status);
>  
> @@ -815,8 +843,14 @@ EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
>   */
>  static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
>  {
> +	/*
> +	 * Ratelimit AER log messages.  Generally we add the Error Source
> +	 * device, but there are is_error_source() cases that can result in
> +	 * multiple devices being added here, so we OR them all together.

I can see the code uses OR ;-) but that wasn't helpful because this comment
doesn't explain why at all. As this ratelimit thing is using reverse logic
to begin with, this is a very tricky bit.

Perhaps something less vague like:

... we ratelimit if all devices have reached their ratelimit.

Assuming that was the intention here? (I'm not sure.)
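
A minimal userspace sketch of how the OR behaves, assuming a toy burst-only
limiter in place of struct ratelimit_state (toy_ratelimit_ok() and
should_print() are made-up names for illustration, not kernel API). A nonzero
return means "print", so ORing the results suppresses the message only once
every device has used up its burst:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct ratelimit_state: burst only, no time window. */
struct toy_ratelimit {
	int burst;	/* messages allowed before suppression */
	int printed;	/* messages allowed so far */
};

/* Like __ratelimit(): nonzero means "go ahead and print". */
int toy_ratelimit_ok(struct toy_ratelimit *rs)
{
	if (rs->printed >= rs->burst)
		return 0;
	rs->printed++;
	return 1;
}

/*
 * Mirrors "e_info->ratelimit |= aer_ratelimit(dev, ...)": every device is
 * consulted, and the result is true if any of them still allows printing.
 */
bool should_print(struct toy_ratelimit *devs, int n)
{
	bool print = false;

	for (int i = 0; i < n; i++)
		print |= toy_ratelimit_ok(&devs[i]);

	return print;
}

/* Two devices share one log message; bursts of 1 and 3. */
void should_print_selftest(void)
{
	struct toy_ratelimit devs[2] = { { .burst = 1 }, { .burst = 3 } };

	assert(should_print(devs, 2));	/* both allow */
	assert(should_print(devs, 2));	/* only devs[1] allows */
	assert(should_print(devs, 2));	/* only devs[1] allows */
	assert(!should_print(devs, 2));	/* both exhausted: skip */
}
```

Under that reading, a comment along the lines of "skip the message only if
every device has hit its ratelimit" would match the code.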

> +	 */
>  	if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
>  		e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
> +		e_info->ratelimit |= aer_ratelimit(dev, e_info->severity);
>  		e_info->error_dev_num++;
>  		return 0;
>  	}
> @@ -914,7 +948,7 @@ static int find_device_iter(struct pci_dev *dev, void *data)
>   * e_info->error_dev_num and e_info->dev[], based on the given information.
>   */
>  static bool find_source_device(struct pci_dev *parent,
> -		struct aer_err_info *e_info)
> +			       struct aer_err_info *e_info)
>  {
>  	struct pci_dev *dev = parent;
>  	int result;
> @@ -935,10 +969,12 @@ static bool find_source_device(struct pci_dev *parent,
>  	/*
>  	 * If we didn't find any devices with errors logged in the AER
>  	 * Capability, just print the Error Source ID from the Root Port or
> -	 * RCEC that received an ERR_* Message.
> +	 * RCEC that received an ERR_* Message, ratelimited by the RP or
> +	 * RCEC.
>  	 */
>  	if (!e_info->error_dev_num) {
> -		aer_print_source(parent, e_info, " (no details found)");
> +		if (aer_ratelimit(parent, e_info->severity))
> +			aer_print_source(parent, e_info, " (no details found)");
>  		return false;
>  	}
>  	return true;
> @@ -1147,9 +1183,10 @@ static void aer_recover_work_func(struct work_struct *work)
>  		pdev = pci_get_domain_bus_and_slot(entry.domain, entry.bus,
>  						   entry.devfn);
>  		if (!pdev) {
> -			pr_err("no pci_dev for %04x:%02x:%02x.%x\n",
> -			       entry.domain, entry.bus,
> -			       PCI_SLOT(entry.devfn), PCI_FUNC(entry.devfn));
> +			pr_err_ratelimited("%04x:%02x:%02x.%x: no pci_dev found\n",

This case was not mentioned in the changelog.

> +					   entry.domain, entry.bus,
> +					   PCI_SLOT(entry.devfn),
> +					   PCI_FUNC(entry.devfn));
>  			continue;
>  		}
>  		pci_print_aer(pdev, entry.severity, entry.regs);
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index 34af0ea45c0d..597df7790f36 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -301,6 +301,7 @@ void dpc_process_error(struct pci_dev *pdev)
>  	else if (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_UNCOR &&
>  		 dpc_get_aer_uncorrect_severity(pdev, &info) &&
>  		 aer_get_device_error_info(pdev, &info)) {
> +		info.ratelimit = 1;	/* no ratelimiting */
>  		aer_print_error(pdev, &info);
>  		pci_aer_clear_nonfatal_status(pdev);
>  		pci_aer_clear_fatal_status(pdev);
> 

-- 
 i.


* Re: [PATCH v6 16/16] PCI/AER: Add sysfs attributes for log ratelimits
  2025-05-19 21:35 ` [PATCH v6 16/16] PCI/AER: Add sysfs attributes for log ratelimits Bjorn Helgaas
  2025-05-20  5:05   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 12:02   ` Ilpo Järvinen
  2025-05-20 16:31     ` Bjorn Helgaas
  1 sibling, 1 reply; 68+ messages in thread
From: Ilpo Järvinen @ 2025-05-20 12:02 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Mon, 19 May 2025, Bjorn Helgaas wrote:

> From: Jon Pan-Doh <pandoh@google.com>
> 
> Allow userspace to read/write log ratelimits per device (including
> enable/disable). Create aer/ sysfs directory to store them and any
> future aer configs.
> 
> Update AER sysfs ABI filename to reflect the broader scope of AER sysfs
> attributes (e.g. stats and ratelimits).
> 
>   Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats ->
>     sysfs-bus-pci-devices-aer
> 
> Tested using aer-inject[1]. Configured correctable log ratelimit to 5.
> Sent 6 AER errors. Observed 5 errors logged while AER stats
> (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) shows 6.
> 
> Disabled ratelimiting and sent 6 more AER errors. Observed all 6 errors
> logged and accounted in AER stats (12 total errors).
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
> 
> Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> Acked-by: Paul E. McKenney <paulmck@kernel.org>
> ---
>  ...es-aer_stats => sysfs-bus-pci-devices-aer} | 34 +++++++
>  Documentation/PCI/pcieaer-howto.rst           |  5 +-
>  drivers/pci/pci-sysfs.c                       |  1 +
>  drivers/pci/pci.h                             |  1 +
>  drivers/pci/pcie/aer.c                        | 99 +++++++++++++++++++
>  5 files changed, 139 insertions(+), 1 deletion(-)
>  rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (77%)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> similarity index 77%
> rename from Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> rename to Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> index d1f67bb81d5d..771204197b71 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> @@ -117,3 +117,37 @@ Date:		July 2018
>  KernelVersion:	4.19.0
>  Contact:	linux-pci@vger.kernel.org, rajatja@google.com
>  Description:	Total number of ERR_NONFATAL messages reported to rootport.
> +
> +PCIe AER ratelimits
> +-------------------
> +
> +These attributes show up under all the devices that are AER capable.
> +They represent configurable ratelimits of logs per error type.
> +
> +See Documentation/PCI/pcieaer-howto.rst for more info on ratelimits.
> +
> +What:		/sys/bus/pci/devices/<dev>/aer/ratelimit_log_enable
> +Date:		March 2025
> +KernelVersion:	6.15.0

This ship has sailed (6.15 can no longer be the target version).

> +Contact:	linux-pci@vger.kernel.org, pandoh@google.com
> +Description:	Writing 1/0 enables/disables AER log ratelimiting. Reading
> +		gets whether or not AER is currently enabled.

Does this mean AER itself or AER log ratelimiting is enabled?

> +             Enabled by
> +		default.
> +
> +What:		/sys/bus/pci/devices/<dev>/aer/ratelimit_burst_cor_log
> +Date:		March 2025
> +KernelVersion:	6.15.0
> +Contact:	linux-pci@vger.kernel.org, pandoh@google.com
> +Description:	Ratelimit burst for correctable error logs. Writing a value
> +		changes the number of errors (burst) allowed per interval
> +		(5 second window) before ratelimiting. Reading gets the
> +		current ratelimit burst.
> +
> +What:		/sys/bus/pci/devices/<dev>/aer/ratelimit_burst_uncor_log
> +Date:		March 2025
> +KernelVersion:	6.15.0
> +Contact:	linux-pci@vger.kernel.org, pandoh@google.com
> +Description:	Ratelimit burst for uncorrectable error logs. Writing a
> +		value changes the number of errors (burst) allowed per
> +		interval (5 second window) before ratelimiting. Reading
> +		gets the current ratelimit burst.
> diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
> index 896d2a232a90..043cdb3194be 100644
> --- a/Documentation/PCI/pcieaer-howto.rst
> +++ b/Documentation/PCI/pcieaer-howto.rst
> @@ -96,12 +96,15 @@ type (correctable vs. uncorrectable).
>  AER uses the default ratelimit of DEFAULT_RATELIMIT_BURST (10 events) over
>  DEFAULT_RATELIMIT_INTERVAL (5 seconds).
>  
> +Ratelimits are exposed in the form of sysfs attributes and configurable.
> +See Documentation/ABI/testing/sysfs-bus-pci-devices-aer.
> +
>  AER Statistics / Counters
>  -------------------------
>  
>  When PCIe AER errors are captured, the counters / statistics are also exposed
>  in the form of sysfs attributes which are documented at
> -Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> +Documentation/ABI/testing/sysfs-bus-pci-devices-aer.
>  
>  Developer Guide
>  ===============
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index c6cda56ca52c..278de99b00ce 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -1805,6 +1805,7 @@ const struct attribute_group *pci_dev_attr_groups[] = {
>  	&pcie_dev_attr_group,
>  #ifdef CONFIG_PCIEAER
>  	&aer_stats_attr_group,
> +	&aer_attr_group,
>  #endif
>  #ifdef CONFIG_PCIEASPM
>  	&aspm_ctrl_attr_group,
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 65c466279ade..a3261e842d6d 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -963,6 +963,7 @@ void pci_no_aer(void);
>  void pci_aer_init(struct pci_dev *dev);
>  void pci_aer_exit(struct pci_dev *dev);
>  extern const struct attribute_group aer_stats_attr_group;
> +extern const struct attribute_group aer_attr_group;
>  void pci_aer_clear_fatal_status(struct pci_dev *dev);
>  int pci_aer_clear_status(struct pci_dev *dev);
>  int pci_aer_raw_clear_status(struct pci_dev *dev);
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index c335e0bb9f51..42df5cb963b3 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -627,6 +627,105 @@ const struct attribute_group aer_stats_attr_group = {
>  	.is_visible = aer_stats_attrs_are_visible,
>  };
>  
> +/*
> + * Ratelimit enable toggle
> + * 0: disabled with ratelimit.interval = 0
> + * 1: enabled with ratelimit.interval = nonzero
> + */
> +static ssize_t ratelimit_log_enable_show(struct device *dev,
> +					 struct device_attribute *attr,
> +					 char *buf)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	bool enabled = pdev->aer_report->cor_log_ratelimit.interval != 0;
> +
> +	return sysfs_emit(buf, "%d\n", enabled);
> +}
> +
> +static ssize_t ratelimit_log_enable_store(struct device *dev,
> +					  struct device_attribute *attr,
> +					  const char *buf, size_t count)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	bool enable;
> +	int interval;
> +
> +	if (!capable(CAP_SYS_ADMIN))
> +		return -EPERM;
> +
> +	if (kstrtobool(buf, &enable) < 0)
> +		return -EINVAL;
> +
> +	if (enable)
> +		interval = DEFAULT_RATELIMIT_INTERVAL;
> +	else
> +		interval = 0;
> +
> +	pdev->aer_report->cor_log_ratelimit.interval = interval;
> +	pdev->aer_report->uncor_log_ratelimit.interval = interval;
> +
> +	return count;
> +}
> +static DEVICE_ATTR_RW(ratelimit_log_enable);
> +
> +#define aer_ratelimit_burst_attr(name, ratelimit)			\
> +	static ssize_t							\
> +	name##_show(struct device *dev, struct device_attribute *attr,	\
> +		    char *buf)						\
> +{									\
> +	struct pci_dev *pdev = to_pci_dev(dev);				\
> +									\
> +	return sysfs_emit(buf, "%d\n",					\
> +			  pdev->aer_report->ratelimit.burst);		\
> +}									\
> +									\
> +	static ssize_t							\
> +	name##_store(struct device *dev, struct device_attribute *attr,	\
> +		     const char *buf, size_t count)			\
> +{									\
> +	struct pci_dev *pdev = to_pci_dev(dev);				\
> +	int burst;							\
> +									\
> +	if (!capable(CAP_SYS_ADMIN))					\
> +		return -EPERM;						\
> +									\
> +	if (kstrtoint(buf, 0, &burst) < 0)				\
> +		return -EINVAL;						\
> +									\
> +	pdev->aer_report->ratelimit.burst = burst;			\
> +									\
> +	return count;							\
> +}									\
> +static DEVICE_ATTR_RW(name)
> +
> +aer_ratelimit_burst_attr(ratelimit_burst_cor_log, cor_log_ratelimit);
> +aer_ratelimit_burst_attr(ratelimit_burst_uncor_log, uncor_log_ratelimit);
> +
> +static struct attribute *aer_attrs[] = {
> +	&dev_attr_ratelimit_log_enable.attr,
> +	&dev_attr_ratelimit_burst_cor_log.attr,
> +	&dev_attr_ratelimit_burst_uncor_log.attr,
> +	NULL
> +};
> +
> +static umode_t aer_attrs_are_visible(struct kobject *kobj,
> +				     struct attribute *a, int n)
> +{
> +	struct device *dev = kobj_to_dev(kobj);
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	if (!pdev->aer_report)
> +		return 0;
> +
> +	return a->mode;
> +}
> +
> +const struct attribute_group aer_attr_group = {
> +	.name = "aer",
> +	.attrs = aer_attrs,
> +	.is_visible = aer_attrs_are_visible,
> +};
> +
>  static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
>  				   struct aer_err_info *info)
>  {
> 

-- 
 i.


* Re: [PATCH v6 01/16] PCI/DPC: Initialize aer_err_info before using it
  2025-05-19 22:41   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 13:53     ` Bjorn Helgaas
  0 siblings, 0 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-20 13:53 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

On Mon, May 19, 2025 at 03:41:50PM -0700, Sathyanarayanan Kuppuswamy wrote:
> Hi,
> 
> On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> > From: Bjorn Helgaas <bhelgaas@google.com>
> > 
> > Previously the struct aer_err_info "info" was allocated on the stack
> 
> /s/Previously/Currently ?

I prefer "previously" here because it clearly refers to the situation
*before* this patch (allocated on stack without initialization), and
it also gives a hint that this situation is what the patch changes.

If I used "currently," I could be mentioning something relevant that
isn't being changed by the patch, e.g., "currently the struct is
allocated on the stack so it's important to keep it small."

> > without being initialized, so it contained junk except for the fields we
> > explicitly set later.
> > 
> > Initialize "info" at declaration so it starts as all zeroes.
> 
> /s/zeroes/zeros

Fixed, thank you!

> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> > 
> > Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> > ---
> >   drivers/pci/pcie/dpc.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> > index df42f15c9829..fe7719238456 100644
> > --- a/drivers/pci/pcie/dpc.c
> > +++ b/drivers/pci/pcie/dpc.c
> > @@ -258,7 +258,7 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,
> >   void dpc_process_error(struct pci_dev *pdev)
> >   {
> >   	u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
> > -	struct aer_err_info info;
> > +	struct aer_err_info info = { 0 };
> >   	pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
> >   	pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &source);
> 
> -- 
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer
> 

* Re: [PATCH v6 01/16] PCI/DPC: Initialize aer_err_info before using it
  2025-05-20  9:39   ` Ilpo Järvinen
@ 2025-05-20 13:54     ` Bjorn Helgaas
  0 siblings, 0 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-20 13:54 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Tue, May 20, 2025 at 12:39:18PM +0300, Ilpo Järvinen wrote:
> On Mon, 19 May 2025, Bjorn Helgaas wrote:
> 
> > From: Bjorn Helgaas <bhelgaas@google.com>
> > 
> > Previously the struct aer_err_info "info" was allocated on the stack
> > without being initialized, so it contained junk except for the fields we
> > explicitly set later.
> > 
> > Initialize "info" at declaration so it starts as all zeroes.
> > 
> > Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> > ---
> >  drivers/pci/pcie/dpc.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> > index df42f15c9829..fe7719238456 100644
> > --- a/drivers/pci/pcie/dpc.c
> > +++ b/drivers/pci/pcie/dpc.c
> > @@ -258,7 +258,7 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,
> >  void dpc_process_error(struct pci_dev *pdev)
> >  {
> >  	u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
> > -	struct aer_err_info info;
> > +	struct aer_err_info info = { 0 };
> 
> = {}; is enough to initialize it, no need to add those zeros.

Changed, thank you!
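
For reference, a small userspace sketch of the two spellings, using a
trimmed-down hypothetical stand-in for struct aer_err_info (the toy_* names
are illustrative only). Note that empty braces are a GNU extension before
C23, which is fine for kernel code but may warn under strict ISO modes:

```c
#include <assert.h>

/* Hypothetical, trimmed-down stand-in for struct aer_err_info. */
struct toy_err_info {
	unsigned int id:16;
	unsigned int severity:2;
	unsigned int ratelimit:1;
	unsigned int status;
	unsigned int mask;
};

/* Check members one by one; padding bits are deliberately not compared. */
int toy_err_info_all_zero(const struct toy_err_info *info)
{
	return !info->id && !info->severity && !info->ratelimit &&
	       !info->status && !info->mask;
}

void toy_err_info_selftest(void)
{
	struct toy_err_info a = {};	/* empty braces */
	struct toy_err_info b = { 0 };	/* explicit zero */

	assert(toy_err_info_all_zero(&a));
	assert(toy_err_info_all_zero(&b));
}
```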

* Re: [PATCH v6 02/16] PCI/DPC: Log Error Source ID only when valid
  2025-05-19 23:15   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 14:00     ` Bjorn Helgaas
  2025-05-20 14:20       ` Ilpo Järvinen
  0 siblings, 1 reply; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-20 14:00 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

On Mon, May 19, 2025 at 04:15:56PM -0700, Sathyanarayanan Kuppuswamy wrote:
> On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> > From: Bjorn Helgaas <bhelgaas@google.com>
> > 
> > DPC Error Source ID is only valid when the DPC Trigger Reason indicates
> > that DPC was triggered due to reception of an ERR_NONFATAL or ERR_FATAL
> > Message (PCIe r6.0, sec 7.9.14.5).
> > 
> > When DPC was triggered by ERR_NONFATAL (PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE)
> > or ERR_FATAL (PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) from a downstream device,
> > log the Error Source ID (decoded into domain/bus/device/function).  Don't
> > print the source otherwise, since it's not valid.
> > 
> > For DPC trigger due to reception of ERR_NONFATAL or ERR_FATAL, the dmesg
> > logging changes:
> > 
> >    - pci 0000:00:01.0: DPC: containment event, status:0x000d source:0x0200
> >    - pci 0000:00:01.0: DPC: ERR_FATAL detected
> >    + pci 0000:00:01.0: DPC: containment event, status:0x000d, ERR_FATAL received from 0000:02:00.0
> > 
> > and when DPC triggered for other reasons, where DPC Error Source ID is
> > undefined, e.g., unmasked uncorrectable error:
> > 
> >    - pci 0000:00:01.0: DPC: containment event, status:0x0009 source:0x0200
> >    - pci 0000:00:01.0: DPC: unmasked uncorrectable error detected
> >    + pci 0000:00:01.0: DPC: containment event, status:0x0009: unmasked uncorrectable error detected
> > 
> > Previously the "containment event" message was at KERN_INFO and the
> > "%s detected" message was at KERN_WARNING.  Now the single message is at
> > KERN_WARNING.
> 
> Since we are handling Uncorrectable errors, why not use pci_err?

Sounds reasonable to me.  I would do it in a separate patch because
the point of this one is to avoid logging junk when Error Source ID is
not valid.

> > +		pci_warn(pdev, "containment event, status:%#06x, %s received from %04x:%02x:%02x.%d\n",
> > +			 status,
> 
> I see the BDF extraction and format code in many places in the PCI
> drivers. Maybe a common macro would make it more readable.

Good idea.  Not sure how to implement it, so I put that on my TODO
list for now.
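
One possible shape, purely as a userspace sketch and not a proposal for the
actual interface: a format-string macro paired with an argument-list macro.
TOY_BDF_FMT, TOY_BDF_ARGS and toy_format_bdf() are hypothetical names; the
bit extraction uses the same layout as PCI_BUS_NUM(), PCI_SLOT() and
PCI_FUNC(). With source 0x0200 in domain 0 it yields "0000:02:00.0", as in
the example from the commit log:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Same bit layout as the kernel's PCI_BUS_NUM(), PCI_SLOT(), PCI_FUNC(). */
#define TOY_BUS_NUM(id)	(((id) >> 8) & 0xff)
#define TOY_SLOT(id)	(((id) >> 3) & 0x1f)
#define TOY_FUNC(id)	((id) & 0x07)

/* Hypothetical helpers: one format string plus the matching arguments. */
#define TOY_BDF_FMT	"%04x:%02x:%02x.%d"
#define TOY_BDF_ARGS(domain, id) \
	(unsigned int)(domain), TOY_BUS_NUM(id), TOY_SLOT(id), (int)TOY_FUNC(id)

/* Format a domain and 16-bit Requester ID as domain:bus:dev.fn. */
int toy_format_bdf(char *buf, size_t len, int domain, unsigned int id)
{
	return snprintf(buf, len, TOY_BDF_FMT, TOY_BDF_ARGS(domain, id));
}

void toy_format_bdf_selftest(void)
{
	char buf[32];

	toy_format_bdf(buf, sizeof(buf), 0, 0x0200);
	assert(strcmp(buf, "0000:02:00.0") == 0);
}
```

A caller would then write something like
printf("received from " TOY_BDF_FMT "\n", TOY_BDF_ARGS(domain, source)),
which keeps the format and its arguments from drifting apart.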

> > +			 (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) ?
> > +				"ERR_FATAL" : "ERR_NONFATAL",
> > +			 pci_domain_nr(pdev->bus), PCI_BUS_NUM(source),
> > +			 PCI_SLOT(source), PCI_FUNC(source));

* Re: [PATCH v6 02/16] PCI/DPC: Log Error Source ID only when valid
  2025-05-20 10:28   ` Ilpo Järvinen
@ 2025-05-20 14:05     ` Bjorn Helgaas
  0 siblings, 0 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-20 14:05 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Tue, May 20, 2025 at 01:28:02PM +0300, Ilpo Järvinen wrote:
> On Mon, 19 May 2025, Bjorn Helgaas wrote:
> > DPC Error Source ID is only valid when the DPC Trigger Reason indicates
> > that DPC was triggered due to reception of an ERR_NONFATAL or ERR_FATAL
> > Message (PCIe r6.0, sec 7.9.14.5).
> > 
> > When DPC was triggered by ERR_NONFATAL (PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE)
> > or ERR_FATAL (PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) from a downstream device,
> > log the Error Source ID (decoded into domain/bus/device/function).  Don't
> > print the source otherwise, since it's not valid.
> > 
> > For DPC trigger due to reception of ERR_NONFATAL or ERR_FATAL, the dmesg
> > logging changes:
> > 
> >   - pci 0000:00:01.0: DPC: containment event, status:0x000d source:0x0200
> >   - pci 0000:00:01.0: DPC: ERR_FATAL detected
> >   + pci 0000:00:01.0: DPC: containment event, status:0x000d, ERR_FATAL received from 0000:02:00.0
> > 
> > and when DPC triggered for other reasons, where DPC Error Source ID is
> > undefined, e.g., unmasked uncorrectable error:
> > 
> >   - pci 0000:00:01.0: DPC: containment event, status:0x0009 source:0x0200
> >   - pci 0000:00:01.0: DPC: unmasked uncorrectable error detected
> >   + pci 0000:00:01.0: DPC: containment event, status:0x0009: unmasked uncorrectable error detected
> > 
> > Previously the "containment event" message was at KERN_INFO and the
> > "%s detected" message was at KERN_WARNING.  Now the single message is at
> > KERN_WARNING.
> > 
> > Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> > ---
> >  drivers/pci/pcie/dpc.c | 45 ++++++++++++++++++++++++++----------------
> >  1 file changed, 28 insertions(+), 17 deletions(-)
> > 
> > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> > index fe7719238456..315bf2bfd570 100644
> > --- a/drivers/pci/pcie/dpc.c
> > +++ b/drivers/pci/pcie/dpc.c
> > @@ -261,25 +261,36 @@ void dpc_process_error(struct pci_dev *pdev)
> >  	struct aer_err_info info = { 0 };
> >  
> >  	pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
> > -	pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID, &source);
> > -
> > -	pci_info(pdev, "containment event, status:%#06x source:%#06x\n",
> > -		 status, source);
> >  
> >  	reason = status & PCI_EXP_DPC_STATUS_TRIGGER_RSN;
> > -	ext_reason = status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT;
> > -	pci_warn(pdev, "%s detected\n",
> > -		 (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_UNCOR) ?
> > -		 "unmasked uncorrectable error" :
> > -		 (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE) ?
> > -		 "ERR_NONFATAL" :
> > -		 (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) ?
> > -		 "ERR_FATAL" :
> > -		 (ext_reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_RP_PIO) ?
> > -		 "RP PIO error" :
> > -		 (ext_reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_SW_TRIGGER) ?
> > -		 "software trigger" :
> > -		 "reserved error");
> > +
> > +	switch (reason) {
> > +	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_UNCOR:
> > +		pci_warn(pdev, "containment event, status:%#06x: unmasked uncorrectable error detected\n",
> > +			 status);
> > +		break;
> > +	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE:
> > +	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE:
> > +		pci_read_config_word(pdev, cap + PCI_EXP_DPC_SOURCE_ID,
> > +				     &source);
> > +		pci_warn(pdev, "containment event, status:%#06x, %s received from %04x:%02x:%02x.%d\n",
> > +			 status,
> > +			 (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) ?
> > +				"ERR_FATAL" : "ERR_NONFATAL",
> > +			 pci_domain_nr(pdev->bus), PCI_BUS_NUM(source),
> > +			 PCI_SLOT(source), PCI_FUNC(source));
> > +		return;
> > +	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_IN_EXT:
> > +		ext_reason = status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT;
> > +		pci_warn(pdev, "containment event, status:%#06x: %s detected\n",
> > +			 status,
> > +			 (ext_reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_RP_PIO) ?
> > +			 "RP PIO error" :
> > +			 (ext_reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_SW_TRIGGER) ?
> > +			 "software trigger" :
> > +			 "reserved error");
> > +		break;
> > +	}
> >  
> >  	/* show RP PIO error detail information */
> >  	if (pdev->dpc_rp_extensions &&
> 
> After adding that switch (reason) there, wouldn't it make sense to also
> move the code from the if blocks into the case blocks? Those if
> conditions check the reason anyway, so each of those if branches would
> naturally belong under one of the cases.

Great idea, thanks!
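
As a rough illustration of the decode the new switch performs, here is a
userspace sketch.  The DPC_STATUS_* names and both function names are
made up for this example; the field values are believed to match the
PCI_EXP_DPC_STATUS_* definitions in pci_regs.h and agree with the
status:0x000d and status:0x0009 examples in the commit log, but treat
them as assumptions:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed field values, modeled on PCI_EXP_DPC_STATUS_* in pci_regs.h. */
#define DPC_STATUS_TRIGGER_RSN		0x0006
#define DPC_STATUS_TRIGGER_RSN_UNCOR	0x0000
#define DPC_STATUS_TRIGGER_RSN_NFE	0x0002
#define DPC_STATUS_TRIGGER_RSN_FE	0x0004
#define DPC_STATUS_TRIGGER_RSN_IN_EXT	0x0006
#define DPC_STATUS_TRIGGER_RSN_EXT	0x0060
#define DPC_STATUS_TRIGGER_RSN_RP_PIO	0x0000
#define DPC_STATUS_TRIGGER_RSN_SW	0x0020

/* Error Source ID is only meaningful for ERR_NONFATAL/ERR_FATAL triggers. */
static bool dpc_source_id_valid(uint16_t status)
{
	uint16_t reason = status & DPC_STATUS_TRIGGER_RSN;

	return reason == DPC_STATUS_TRIGGER_RSN_NFE ||
	       reason == DPC_STATUS_TRIGGER_RSN_FE;
}

/* Map a DPC Status value to the trigger description used in the log. */
static const char *dpc_trigger_name(uint16_t status)
{
	uint16_t ext;

	switch (status & DPC_STATUS_TRIGGER_RSN) {
	case DPC_STATUS_TRIGGER_RSN_UNCOR:
		return "unmasked uncorrectable error";
	case DPC_STATUS_TRIGGER_RSN_NFE:
		return "ERR_NONFATAL";
	case DPC_STATUS_TRIGGER_RSN_FE:
		return "ERR_FATAL";
	default:	/* DPC_STATUS_TRIGGER_RSN_IN_EXT */
		ext = status & DPC_STATUS_TRIGGER_RSN_EXT;
		if (ext == DPC_STATUS_TRIGGER_RSN_RP_PIO)
			return "RP PIO error";
		if (ext == DPC_STATUS_TRIGGER_RSN_SW)
			return "software trigger";
		return "reserved error";
	}
}
```

Keeping the validity check separate from the name lookup means a caller
only reads the Source ID register when dpc_source_id_valid() says the
value is defined.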


* Re: [PATCH v6 02/16] PCI/DPC: Log Error Source ID only when valid
  2025-05-20 14:00     ` Bjorn Helgaas
@ 2025-05-20 14:20       ` Ilpo Järvinen
  0 siblings, 0 replies; 68+ messages in thread
From: Ilpo Järvinen @ 2025-05-20 14:20 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Sathyanarayanan Kuppuswamy, linux-pci, Jon Pan-Doh,
	Karolina Stolarek, Martin Petersen, Ben Fuller, Drew Walton,
	Anil Agrawal, Tony Luck, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Tue, 20 May 2025, Bjorn Helgaas wrote:

> On Mon, May 19, 2025 at 04:15:56PM -0700, Sathyanarayanan Kuppuswamy wrote:
> > On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> > > From: Bjorn Helgaas <bhelgaas@google.com>
> > > 
> > > DPC Error Source ID is only valid when the DPC Trigger Reason indicates
> > > that DPC was triggered due to reception of an ERR_NONFATAL or ERR_FATAL
> > > Message (PCIe r6.0, sec 7.9.14.5).
> > > 
> > > When DPC was triggered by ERR_NONFATAL (PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE)
> > > or ERR_FATAL (PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) from a downstream device,
> > > log the Error Source ID (decoded into domain/bus/device/function).  Don't
> > > print the source otherwise, since it's not valid.
> > > 
> > > For DPC trigger due to reception of ERR_NONFATAL or ERR_FATAL, the dmesg
> > > logging changes:
> > > 
> > >    - pci 0000:00:01.0: DPC: containment event, status:0x000d source:0x0200
> > >    - pci 0000:00:01.0: DPC: ERR_FATAL detected
> > >    + pci 0000:00:01.0: DPC: containment event, status:0x000d, ERR_FATAL received from 0000:02:00.0
> > > 
> > > and when DPC triggered for other reasons, where DPC Error Source ID is
> > > undefined, e.g., unmasked uncorrectable error:
> > > 
> > >    - pci 0000:00:01.0: DPC: containment event, status:0x0009 source:0x0200
> > >    - pci 0000:00:01.0: DPC: unmasked uncorrectable error detected
> > >    + pci 0000:00:01.0: DPC: containment event, status:0x0009: unmasked uncorrectable error detected
> > > 
> > > Previously the "containment event" message was at KERN_INFO and the
> > > "%s detected" message was at KERN_WARNING.  Now the single message is at
> > > KERN_WARNING.
> > 
> > Since we are handling Uncorrectable errors, why not use pci_err?
> 
> Sounds reasonable to me.  I would do it in a separate patch because
> the point of this one is to avoid logging junk when Error Source ID is
> not valid.
> 
> > > +		pci_warn(pdev, "containment event, status:%#06x, %s received from %04x:%02x:%02x.%d\n",
> > > +			 status,
> > 
> > > I see the BDF extraction and format code in many places in the PCI
> > > drivers. Maybe a common macro would make it more readable.
> 
> Good idea.  Not sure how to implement it, so I put that on my TODO
> list for now.

Instead of macros, it might be worth adding a printf specifier for this. 
Together with some flags, it should be possible to cover also the 
variations that print less than the full BDF format.
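
Until such a specifier exists, a small helper could centralize the decode.
The following is only a userspace sketch: format_bdf() and the BDF_*
macros are hypothetical names, and the bit layout (bus in bits 15:8,
device in bits 7:3, function in bits 2:0) mirrors what PCI_BUS_NUM(),
PCI_SLOT() and PCI_FUNC() extract from a 16-bit ID:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical userspace equivalents of PCI_BUS_NUM/PCI_SLOT/PCI_FUNC. */
#define BDF_BUS(id)	(((unsigned int)(id) >> 8) & 0xff)
#define BDF_SLOT(id)	(((unsigned int)(id) >> 3) & 0x1f)
#define BDF_FUNC(id)	((unsigned int)(id) & 0x07)

/*
 * Format a 16-bit Requester/Source ID as "dddd:bb:dd.f".
 * Returns whatever snprintf() returns, so callers can detect truncation.
 */
static int format_bdf(char *buf, size_t len, unsigned int domain, uint16_t id)
{
	return snprintf(buf, len, "%04x:%02x:%02x.%u", domain,
			BDF_BUS(id), BDF_SLOT(id), BDF_FUNC(id));
}
```

With domain 0 and the source:0x0200 value from the commit log, this
produces the same 0000:02:00.0 string the new message prints.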

> > > +			 (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE) ?
> > > +				"ERR_FATAL" : "ERR_NONFATAL",
> > > +			 pci_domain_nr(pdev->bus), PCI_BUS_NUM(source),
> > > +			 PCI_SLOT(source), PCI_FUNC(source));
> 

-- 
 i.



* Re: [PATCH v6 03/16] PCI/AER: Consolidate Error Source ID logging in aer_print_port_info()
  2025-05-19 23:39   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 14:21     ` Bjorn Helgaas
  0 siblings, 0 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-20 14:21 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

On Mon, May 19, 2025 at 04:39:19PM -0700, Sathyanarayanan Kuppuswamy wrote:
> On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> > From: Bjorn Helgaas <bhelgaas@google.com>
> > 
> > Previously we decoded the AER Error Source ID in two places.  Consolidate
> > them so both places use aer_print_port_info().  Add a "details" parameter
> > so we can add a note when we didn't find any downstream devices with errors
> > logged in their AER Capability.
> > 
> > When we didn't read any error details from the source device, we logged two
> > messages: one in aer_isr_one_error() and another in find_source_device().
> > Since they both contain the same information, only log the first one when
> > when find_source_device() has found error details.
> /s/when//

Fixed, thanks!

> > -	pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d\n",
> > +	pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d%s\n",
> 
> Instead of relying on the callers, why not add a space before details here?

Could, but I don't like adding an extra space at the end of the line
when the caller passes "".  The extra space could make the line wrap
unnecessarily.

> >   		 info->multi_error_valid ? "Multiple " : "",
> >   		 aer_error_severity_string[info->severity],
> >   		 pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
> > -		 PCI_FUNC(devfn));
> > +		 PCI_FUNC(devfn), details);
> >   }
> >   #ifdef CONFIG_ACPI_APEI_PCIEAER
> > @@ -926,13 +927,13 @@ static bool find_source_device(struct pci_dev *parent,
> >   	else
> >   		pci_walk_bus(parent->subordinate, find_device_iter, e_info);
> > +	/*
> > +	 * If we didn't find any devices with errors logged in the AER
> > +	 * Capability, just print the Error Source ID from the Root Port or
> > +	 * RCEC that received an ERR_* Message.
> > +	 */
> >   	if (!e_info->error_dev_num) {
> > -		u8 bus = e_info->id >> 8;
> > -		u8 devfn = e_info->id & 0xff;
> > -
> > -		pci_info(parent, "found no error details for %04x:%02x:%02x.%d\n",
> > -			 pci_domain_nr(parent->bus), bus, PCI_SLOT(devfn),
> > -			 PCI_FUNC(devfn));
> > +		aer_print_port_info(parent, e_info, " (no details found)");
> >   		return false;
> >   	}
> >   	return true;
> > @@ -1297,10 +1298,11 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
> >   			e_info.multi_error_valid = 1;
> >   		else
> >   			e_info.multi_error_valid = 0;
> > -		aer_print_port_info(pdev, &e_info);
> 
> Instead of printing the error information in find_source_device() (a
> helper function), I think it would be better to print it here (the
> error handler):
>
>   source_found = find_source_device(pdev, &e_info);
>   aer_print_port_info(pdev, &e_info,
>                       source_found ? "" : " (no details found)");
>
>   if (source_found)
>           aer_process_err_devices(&e_info);

Great idea, thanks!  That looks much nicer.

> > -		if (find_source_device(pdev, &e_info))
> > +		if (find_source_device(pdev, &e_info)) {
> > +			aer_print_port_info(pdev, &e_info, "");
> >   			aer_process_err_devices(&e_info);
> > +		}
> >   	}
> >   	if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
> > @@ -1316,10 +1318,10 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
> >   		else
> >   			e_info.multi_error_valid = 0;
> > -		aer_print_port_info(pdev, &e_info);
> > -
> > -		if (find_source_device(pdev, &e_info))
> > +		if (find_source_device(pdev, &e_info)) {
> > +			aer_print_port_info(pdev, &e_info, "");
> >   			aer_process_err_devices(&e_info);
> > +		}
> >   	}
> >   }
> 
> -- 
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer
> 


* Re: [PATCH v6 07/16] PCI/AER: Initialize aer_err_info before using it
  2025-05-20 10:39   ` Ilpo Järvinen
@ 2025-05-20 14:27     ` Bjorn Helgaas
  0 siblings, 0 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-20 14:27 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Tue, May 20, 2025 at 01:39:06PM +0300, Ilpo Järvinen wrote:
> On Mon, 19 May 2025, Bjorn Helgaas wrote:
> 
> > From: Bjorn Helgaas <bhelgaas@google.com>
> > 
> > Previously the struct aer_err_info "e_info" was allocated on the stack
> > without being initialized, so it contained junk except for the fields we
> > explicitly set later.
> > 
> > Initialize "e_info" at declaration with a designated initializer list,
> > which initializes the other members to zero.
> > 
> > Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> > ---
> >  drivers/pci/pcie/aer.c | 37 ++++++++++++++++---------------------
> >  1 file changed, 16 insertions(+), 21 deletions(-)
> > 
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index 95a4cab1d517..40f003eca1c5 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -1281,7 +1281,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
> >  		struct aer_err_source *e_src)
> 
> Unrelated to this change, these would fit on a single line.

Thanks, fixed!


* Re: [PATCH v6 08/16] PCI/AER: Simplify pci_print_aer()
  2025-05-20  0:02   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 14:38     ` Bjorn Helgaas
  0 siblings, 0 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-20 14:38 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

On Mon, May 19, 2025 at 05:02:28PM -0700, Sathyanarayanan Kuppuswamy wrote:
> On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> > From: Bjorn Helgaas <bhelgaas@google.com>
> > 
> > Simplify pci_print_aer() by initializing the struct aer_err_info "info"
> > with a designated initializer list (it was previously initialized with
> > memset()) and using pci_name().
> > 
> > Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> > ---
> >   drivers/pci/pcie/aer.c | 16 ++++++++--------
> >   1 file changed, 8 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index 40f003eca1c5..73d618354f6a 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -765,7 +765,10 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
> >   {
> >   	int layer, agent, tlp_header_valid = 0;
> >   	u32 status, mask;
> > -	struct aer_err_info info;
> 
> You have cleaned up other stack allocations of struct aer_err_info to zero
> initialization in your previous patches. Why not follow the same format
> here? I don't think this function resets all fields of aer_err_info, right?

This is new to me, but IIUC this does initialize all the fields.
https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html says "Omitted
fields are implicitly initialized the same as for objects that have
static storage duration."
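
ISO C makes the same promise for members left out of an initializer list
(C11 6.7.9), so this is not GCC-specific.  A quick userspace check of the
behavior, where struct err_info is a made-up stand-in and not the
kernel's struct aer_err_info:

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct aer_err_info; fields are illustrative. */
struct err_info {
	int severity;
	unsigned int status;
	unsigned int mask;
	unsigned int first_error;
	int error_dev_num;
};

/* Build an err_info naming only two members; the rest must come out zero. */
static struct err_info make_info(int severity, unsigned int first_error)
{
	struct err_info info = {
		.severity = severity,
		.first_error = first_error,
	};

	return info;
}

/* True if every member not named in the initializer above is zero. */
static bool omitted_fields_are_zero(const struct err_info *info)
{
	return info->status == 0 && info->mask == 0 &&
	       info->error_dev_num == 0;
}
```

The members named in the initializer keep their given values, and
everything else reads back as zero without a separate memset().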

> > +	struct aer_err_info info = {
> > +		.severity = aer_severity,
> > +		.first_error = PCI_ERR_CAP_FEP(aer->cap_control),
> > +	};
> >   	if (aer_severity == AER_CORRECTABLE) {
> >   		status = aer->cor_status;
> > @@ -776,14 +779,11 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
> >   		tlp_header_valid = status & AER_LOG_TLP_MASKS;
> >   	}
> > -	layer = AER_GET_LAYER_ERROR(aer_severity, status);
> > -	agent = AER_GET_AGENT(aer_severity, status);
> > -
> > -	memset(&info, 0, sizeof(info));
> > -	info.severity = aer_severity;
> >   	info.status = status;
> >   	info.mask = mask;
> > -	info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
> > +
> > +	layer = AER_GET_LAYER_ERROR(aer_severity, status);
> > +	agent = AER_GET_AGENT(aer_severity, status);
> >   	pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
> >   	__aer_print_error(dev, &info);
> > @@ -797,7 +797,7 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
> >   	if (tlp_header_valid)
> >   		pcie_print_tlp_log(dev, &aer->header_log, dev_fmt("  "));
> > -	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
> > +	trace_aer_event(pci_name(dev), (status & ~mask),
> >   			aer_severity, tlp_header_valid, &aer->header_log);
> >   }
> >   EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
> 
> -- 
> Sathyanarayanan Kuppuswamy
> Linux Kernel Developer
> 


* Re: [PATCH v6 11/16] PCI/AER: Check log level once and remember it
  2025-05-19 23:17   ` Weinan Liu
@ 2025-05-20 14:46     ` Bjorn Helgaas
  0 siblings, 0 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-20 14:46 UTC (permalink / raw)
  To: Weinan Liu
  Cc: Jonathan.Cameron, anilagrawal, ben.fuller, bhelgaas, dave.jiang,
	drewwalton, ilpo.jarvinen, kaihengf, karolina.stolarek, kbusch,
	linux-kernel, linux-pci, linuxppc-dev, lukas, mahesh,
	martin.petersen, oohall, pandoh, paulmck, rrichter, sargun,
	sathyanarayanan.kuppuswamy, shiju.jose, terry.bowman, tony.luck

On Mon, May 19, 2025 at 11:17:28PM +0000, Weinan Liu wrote:
> > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> > index 315bf2bfd570..34af0ea45c0d 100644
> > --- a/drivers/pci/pcie/dpc.c
> > +++ b/drivers/pci/pcie/dpc.c
> > @@ -252,6 +252,7 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,
> >   else
> >   info->severity = AER_NONFATAL;
> >
> > + info->level = KERN_WARNING;
> >  return 1;
> > }
> 
> I think the print level should be KERN_ERR for uncorrectable errors.

Yes, thank you, fixed!  dpc_get_aer_uncorrect_severity() always sets
info->severity to AER_FATAL or AER_NONFATAL, and aer_print_error()
only uses KERN_WARNING for AER_CORRECTABLE.
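
The rule being applied can be written down in a few lines.  This is a
userspace sketch, not the kernel code: enum sev and level_for_severity()
are hypothetical, and the syslog priorities LOG_WARNING and LOG_ERR stand
in for KERN_WARNING and KERN_ERR:

```c
#include <syslog.h>

/* Hypothetical severity enum; the ordering is for this sketch only. */
enum sev { SEV_NONFATAL, SEV_FATAL, SEV_CORRECTABLE };

/*
 * Correctable errors are logged at warning level; both uncorrectable
 * severities (non-fatal and fatal) are logged at error level.
 */
static int level_for_severity(enum sev sev)
{
	return sev == SEV_CORRECTABLE ? LOG_WARNING : LOG_ERR;
}
```

Since the DPC path only ever reports SEV_NONFATAL or SEV_FATAL in this
model, it always ends up at the error level.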


* Re: [PATCH v6 12/16] PCI/AER: Make all pci_print_aer() log levels depend on error type
  2025-05-20 11:37   ` Ilpo Järvinen
@ 2025-05-20 15:04     ` Bjorn Helgaas
  0 siblings, 0 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-20 15:04 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Tue, May 20, 2025 at 02:37:33PM +0300, Ilpo Järvinen wrote:
> On Mon, 19 May 2025, Bjorn Helgaas wrote:
> 
> > From: Karolina Stolarek <karolina.stolarek@oracle.com>
> > 
> > Some existing logs in pci_print_aer() log with error severity by default.
> > Convert them to depend on error type (consistent with rest of AER logging).
> > 
> > Link: https://lore.kernel.org/r/20250321015806.954866-3-pandoh@google.com
> > Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
> > Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> > ---
> >  drivers/pci/pcie/aer.c | 16 +++++++++++-----
> >  1 file changed, 11 insertions(+), 5 deletions(-)
> > 
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index 73b03a195b14..06a7dda20846 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -788,15 +788,21 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
> >  	layer = AER_GET_LAYER_ERROR(aer_severity, status);
> >  	agent = AER_GET_AGENT(aer_severity, status);
> >  
> > -	pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
> > +	aer_printk(info.level, dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n",
> > +		   status, mask);
> >  	__aer_print_error(dev, &info);
> > -	pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
> > -		aer_error_layer[layer], aer_agent_string[agent]);
> > +	aer_printk(info.level, dev, "aer_layer=%s, aer_agent=%s\n",
> > +		   aer_error_layer[layer], aer_agent_string[agent]);
> >  
> >  	if (aer_severity != AER_CORRECTABLE)
> > -		pci_err(dev, "aer_uncor_severity: 0x%08x\n",
> > -			aer->uncor_severity);
> > +		aer_printk(info.level, dev, "aer_uncor_severity: 0x%08x\n",
> > +			   aer->uncor_severity);
> >  
> > +	/*
> > +	 * pcie_print_tlp_log() uses KERN_ERR, but we only call it when
> > +	 * tlp_header_valid is set, and info.level is always KERN_ERR in
> > +	 * that case.
> > +	 */
> >  	if (tlp_header_valid)
> >  		pcie_print_tlp_log(dev, &aer->header_log, dev_fmt("  "));
> 
> There's another similar callsite but only this has the comment added. I 
> was thinking if this call could be made from __aer_print_error(). There 
> would be small change in order of messages but I can't seem to decide if 
> it would be bad/good.

I guess the other caller is dpc_process_rp_pio_error(), which uses
pci_err() for other logging, so at least it matches the level used by
pcie_print_tlp_log().

This patch uses info.level to control the message level, and
pcie_print_tlp_log() doesn't look at info.level.  I added this comment
to explain why that's OK and the message level happens to match
already.  Maybe not super ideal long term.


* Re: [PATCH v6 16/16] PCI/AER: Add sysfs attributes for log ratelimits
  2025-05-20 12:02   ` Ilpo Järvinen
@ 2025-05-20 16:31     ` Bjorn Helgaas
  0 siblings, 0 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-20 16:31 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Tue, May 20, 2025 at 03:02:06PM +0300, Ilpo Järvinen wrote:
> On Mon, 19 May 2025, Bjorn Helgaas wrote:
> 
> > From: Jon Pan-Doh <pandoh@google.com>
> > 
> > Allow userspace to read/write log ratelimits per device (including
> > enable/disable). Create aer/ sysfs directory to store them and any
> > future aer configs.
> > 
> > Update AER sysfs ABI filename to reflect the broader scope of AER sysfs
> > attributes (e.g. stats and ratelimits).
> > 
> >   Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats ->
> >     sysfs-bus-pci-devices-aer
> > 
> > Tested using aer-inject[1]. Configured correctable log ratelimit to 5.
> > Sent 6 AER errors. Observed 5 errors logged while AER stats
> > (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) shows 6.
> > 
> > Disabled ratelimiting and sent 6 more AER errors. Observed all 6 errors
> > logged and accounted in AER stats (12 total errors).
> > 
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
> > 
> > Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
> > Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> > Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> > Acked-by: Paul E. McKenney <paulmck@kernel.org>
> > ---
> >  ...es-aer_stats => sysfs-bus-pci-devices-aer} | 34 +++++++
> >  Documentation/PCI/pcieaer-howto.rst           |  5 +-
> >  drivers/pci/pci-sysfs.c                       |  1 +
> >  drivers/pci/pci.h                             |  1 +
> >  drivers/pci/pcie/aer.c                        | 99 +++++++++++++++++++
> >  5 files changed, 139 insertions(+), 1 deletion(-)
> >  rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (77%)
> > 
> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> > similarity index 77%
> > rename from Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> > rename to Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> > index d1f67bb81d5d..771204197b71 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> > +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> > @@ -117,3 +117,37 @@ Date:		July 2018
> >  KernelVersion:	4.19.0
> >  Contact:	linux-pci@vger.kernel.org, rajatja@google.com
> >  Description:	Total number of ERR_NONFATAL messages reported to rootport.
> > +
> > +PCIe AER ratelimits
> > +-------------------
> > +
> > +These attributes show up under all the devices that are AER capable.
> > +They represent configurable ratelimits of logs per error type.
> > +
> > +See Documentation/PCI/pcieaer-howto.rst for more info on ratelimits.
> > +
> > +What:		/sys/bus/pci/devices/<dev>/aer/ratelimit_log_enable
> > +Date:		March 2025
> > +KernelVersion:	6.15.0
> 
> This ship has sailed.

Updated to May 2025 and 6.16.0 (I hope :)).

> > +Contact:	linux-pci@vger.kernel.org, pandoh@google.com
> > +Description:	Writing 1/0 enables/disables AER log ratelimiting. Reading
> > +		gets whether or not AER is currently enabled.
> 
> AER or AER ratelimiting is enabled?

I think we want "AER ratelimiting" here, thanks!

> > + * Ratelimit enable toggle
> > + * 0: disabled with ratelimit.interval = 0
> > + * 1: enabled with ratelimit.interval = nonzero
> > + */
> > +static ssize_t ratelimit_log_enable_show(struct device *dev,
> > +					 struct device_attribute *attr,
> > +					 char *buf)
> > +{
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +	bool enabled = pdev->aer_report->cor_log_ratelimit.interval != 0;
> > +
> > +	return sysfs_emit(buf, "%d\n", enabled);
> > +}
> > +
> > +static ssize_t ratelimit_log_enable_store(struct device *dev,
> > +					  struct device_attribute *attr,
> > +					  const char *buf, size_t count)
> > +{
> > +	struct pci_dev *pdev = to_pci_dev(dev);
> > +	bool enable;
> > +	int interval;
> > +
> > +	if (!capable(CAP_SYS_ADMIN))
> > +		return -EPERM;
> > +
> > +	if (kstrtobool(buf, &enable) < 0)
> > +		return -EINVAL;
> > +
> > +	if (enable)
> > +		interval = DEFAULT_RATELIMIT_INTERVAL;
> > +	else
> > +		interval = 0;
> > +
> > +	pdev->aer_report->cor_log_ratelimit.interval = interval;
> > +	pdev->aer_report->uncor_log_ratelimit.interval = interval;
> > +
> > +	return count;
> > +}
> > +static DEVICE_ATTR_RW(ratelimit_log_enable);


* Re: [PATCH v6 14/16] PCI/AER: Introduce ratelimit for error logs
  2025-05-20  4:59   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 18:31     ` Bjorn Helgaas
  2025-05-20 18:42       ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-20 18:31 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

On Mon, May 19, 2025 at 09:59:29PM -0700, Sathyanarayanan Kuppuswamy wrote:
> On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> > From: Jon Pan-Doh <pandoh@google.com>
> > 
> > Spammy devices can flood kernel logs with AER errors and slow/stall
> > execution. Add per-device ratelimits for AER correctable and uncorrectable
> > errors that use the kernel defaults (10 per 5s).
> > 
> > There are two AER logging entry points:
> > 
> >    - aer_print_error() is used by DPC and native AER
> > 
> >    - pci_print_aer() is used by GHES and CXL
> > 
> > The native AER aer_print_error() case includes a loop that may log details
> > from multiple devices.  This is ratelimited by the union of ratelimits for
> > these devices, set by add_error_device(), which collects the devices.  If
> > no such device is found, the Error Source message is ratelimited by the
> > Root Port or RCEC that received the ERR_* message.
> > 
> > The DPC aer_print_error() case is currently not ratelimited.
> 
> Can we also not rate limit fatal errors in AER driver?

In other words, only rate limit AER_CORRECTABLE and AER_NONFATAL for
AER?  Seems plausible to me.
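
A userspace sketch of that policy, under several assumptions: the struct
and function names are invented, time is passed in as plain seconds, and
the limiter is a simplified fixed-window model rather than the kernel's
ratelimit implementation.  It also uses a single limiter, whereas the
series keeps separate correctable and uncorrectable ratelimits per
device.  An interval of 0 disables limiting, matching the sysfs toggle in
patch 16:

```c
#include <stdbool.h>

/* Hypothetical severity enum; the ordering is for this sketch only. */
enum sev { SEV_NONFATAL, SEV_FATAL, SEV_CORRECTABLE };

/* Simplified window-based limiter: at most "burst" events per "interval". */
struct ratelimit {
	long interval;	/* window length in seconds; 0 disables limiting */
	int burst;	/* events allowed per window */
	long begin;	/* start of the current window */
	int printed;	/* events allowed so far in this window */
};

static bool ratelimit_allow(struct ratelimit *rl, long now)
{
	if (rl->interval == 0)
		return true;

	/* Start a fresh window once the current one has expired. */
	if (now - rl->begin >= rl->interval) {
		rl->begin = now;
		rl->printed = 0;
	}

	if (rl->printed >= rl->burst)
		return false;

	rl->printed++;
	return true;
}

/* ERR_FATAL is always logged; other severities go through the limiter. */
static bool should_log(struct ratelimit *rl, enum sev sev, long now)
{
	if (sev == SEV_FATAL)
		return true;

	return ratelimit_allow(rl, now);
}
```

With interval 5 and burst 10 (the defaults quoted in the commit log), a
burst of twelve non-fatal errors in one window logs ten of them, while a
fatal error in the same window is still logged.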


* Re: [PATCH v6 14/16] PCI/AER: Introduce ratelimit for error logs
  2025-05-20 18:31     ` Bjorn Helgaas
@ 2025-05-20 18:42       ` Sathyanarayanan Kuppuswamy
  0 siblings, 0 replies; 68+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-20 18:42 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas


On 5/20/25 11:31 AM, Bjorn Helgaas wrote:
> On Mon, May 19, 2025 at 09:59:29PM -0700, Sathyanarayanan Kuppuswamy wrote:
>> On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
>>> From: Jon Pan-Doh <pandoh@google.com>
>>>
>>> Spammy devices can flood kernel logs with AER errors and slow/stall
>>> execution. Add per-device ratelimits for AER correctable and uncorrectable
>>> errors that use the kernel defaults (10 per 5s).
>>>
>>> There are two AER logging entry points:
>>>
>>>     - aer_print_error() is used by DPC and native AER
>>>
>>>     - pci_print_aer() is used by GHES and CXL
>>>
>>> The native AER aer_print_error() case includes a loop that may log details
>>> from multiple devices.  This is ratelimited by the union of ratelimits for
>>> these devices, set by add_error_device(), which collects the devices.  If
>>> no such device is found, the Error Source message is ratelimited by the
>>> Root Port or RCEC that received the ERR_* message.
>>>
>>> The DPC aer_print_error() case is currently not ratelimited.
>> Can we also not rate limit fatal errors in AER driver?
> In other words, only rate limit AER_CORRECTABLE and AER_NONFATAL for
> AER?  Seems plausible to me.
Yes, we might lose important information by rate-limiting FATAL errors. I
believe FATAL errors should be infrequent, so it's reasonable to allow them
through without rate limiting. Once you make this change, please also
update the related SysFS documentation and update code accordingly.

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer



* Re: [PATCH v6 14/16] PCI/AER: Introduce ratelimit for error logs
  2025-05-20 11:55   ` Ilpo Järvinen
@ 2025-05-20 19:38     ` Bjorn Helgaas
  2025-05-21  9:57       ` Ilpo Järvinen
  0 siblings, 1 reply; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-20 19:38 UTC (permalink / raw)
  To: Ilpo Järvinen
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas

On Tue, May 20, 2025 at 02:55:32PM +0300, Ilpo Järvinen wrote:
> On Mon, 19 May 2025, Bjorn Helgaas wrote:
> 
> > From: Jon Pan-Doh <pandoh@google.com>
> > 
> > Spammy devices can flood kernel logs with AER errors and slow/stall
> > execution. Add per-device ratelimits for AER correctable and uncorrectable
> > errors that use the kernel defaults (10 per 5s).
> > 
> > There are two AER logging entry points:
> > 
> >   - aer_print_error() is used by DPC and native AER
> > 
> >   - pci_print_aer() is used by GHES and CXL
> > 
> > The native AER aer_print_error() case includes a loop that may log details
> > from multiple devices.  This is ratelimited by the union of ratelimits for
> > these devices, set by add_error_device(), which collects the devices.  If
> > no such device is found, the Error Source message is ratelimited by the
> > Root Port or RCEC that received the ERR_* message.
> > 
> > The DPC aer_print_error() case is currently not ratelimited.
> > 
> > The GHES and CXL pci_print_aer() cases are ratelimited by the Error Source
> > device.

> >  static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
> >  {
> > +	/*
> > +	 * Ratelimit AER log messages.  Generally we add the Error Source
> > +	 * device, but there are is_error_source() cases that can result in
> > +	 * multiple devices being added here, so we OR them all together.
> 
> I can see the code uses OR ;-) but it wasn't helpful because this comment 
> didn't explain why at all. As this ratelimit thing is using reverse logic 
> to begin with, this is a very tricky bit.
> 
> Perhaps something less vague like:
> 
> ... we ratelimit if all devices have reached their ratelimit.
> 
> Assuming that was the intention here? (I'm not sure.)

My intention was that if there's any downstream device that has an
unmasked error logged and it has not reached its ratelimit, we should
log messages for all devices with errors logged.  Does something like
this help?

  /*
   * Ratelimit AER log messages.  "dev" is either the source
   * identified by the root's Error Source ID or it has an unmasked
   * error logged in its own AER Capability.  If any of these devices
   * has not reached its ratelimit, log messages for all of them.
   * Messages are emitted when e_info->ratelimit is non-zero.
   *
   * Note that e_info->ratelimit was already initialized to 1 for the
   * ERR_FATAL case.
   */

The ERR_FATAL case is from this post-v6 change that I haven't posted
yet:

  aer_isr_one_error(...)
  {
    ...
    if (status & PCI_ERR_ROOT_UNCOR_RCV) {
      int fatal = status & PCI_ERR_ROOT_FATAL_RCV;
      struct aer_err_info e_info = {
        ...
 +      .ratelimit = fatal ? 1 : 0,


> > +	 */
> >  	if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
> >  		e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
> > +		e_info->ratelimit |= aer_ratelimit(dev, e_info->severity);
> >  		e_info->error_dev_num++;
> >  		return 0;
> >  	}
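
To make the OR semantics concrete, here is a small user-space model,
not kernel code.  All model_* names, the explicit "now" argument, and
the cap of 5 devices are assumptions made for illustration; only the
"10 per 5s" default and the "|=" accumulation come from the patch.  A
limiter returns 1 while the device is still under its limit, so OR-ing
the results means messages are emitted if any one device has not
reached its ratelimit:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_INTERVAL 5	/* seconds per window (default is 10 per 5s) */
#define MODEL_BURST    10	/* messages allowed per window */
#define MODEL_MAX_DEVS 5	/* hypothetical stand-in for AER_MAX_MULTI_ERR_DEVICES */

/* Hypothetical per-device limiter state */
struct model_ratelimit {
	long begin;	/* start of the current window */
	int printed;	/* messages allowed so far in this window */
	bool started;
};

/* Returns 1 if a message may be emitted at time 'now', 0 if suppressed */
static int model_ratelimit(struct model_ratelimit *rl, long now)
{
	if (!rl->started || now - rl->begin >= MODEL_INTERVAL) {
		rl->begin = now;
		rl->printed = 0;
		rl->started = true;
	}
	if (rl->printed < MODEL_BURST) {
		rl->printed++;
		return 1;
	}
	return 0;
}

struct model_err_info {
	int ratelimit;	/* non-zero: emit messages for all collected devices */
	size_t ndevs;
};

/* Collect one device and OR its "still under the limit" result in */
static int model_add_error_device(struct model_err_info *info,
				  struct model_ratelimit *dev_rl, long now)
{
	assert(info && dev_rl);

	if (info->ndevs >= MODEL_MAX_DEVS)
		return -1;

	info->ratelimit |= model_ratelimit(dev_rl, now);
	info->ndevs++;
	return 0;
}
```

In this model, a struct model_err_info that starts with .ratelimit = 1
(the ERR_FATAL case) stays at 1 no matter what the per-device limiters
return, which is the effect the note in the comment describes.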

> > @@ -1147,9 +1183,10 @@ static void aer_recover_work_func(struct work_struct *work)
> >  		pdev = pci_get_domain_bus_and_slot(entry.domain, entry.bus,
> >  						   entry.devfn);
> >  		if (!pdev) {
> > -			pr_err("no pci_dev for %04x:%02x:%02x.%x\n",
> > -			       entry.domain, entry.bus,
> > -			       PCI_SLOT(entry.devfn), PCI_FUNC(entry.devfn));
> > +			pr_err_ratelimited("%04x:%02x:%02x.%x: no pci_dev found\n",
> 
> This case was not mentioned in the changelog.

Sharp eyes!  What do you think of this commit log text?

  The CXL pci_print_aer() case is ratelimited by the Error Source device.

  The GHES pci_print_aer() case is via aer_recover_work_func(), which
  searches for the Error Source device.  If the device is not found, there's
  no per-device ratelimit, so we use a system-wide ratelimit that covers all
  error types (correctable, non-fatal, and fatal).

This isn't really ideal because in pci_print_aer(), the struct
aer_capability_regs has already been filled by firmware and the
logging doesn't read any registers from the device at all.

However, pci_print_aer() *does* want the pci_dev for statistics and
tracing (pci_dev_aer_stats_incr()) and, of course, for the aer_printks
themselves.

We could leave this pr_err() completely alone; hopefully it's a rare
case.  I think the CXL path just silently skips pci_print_aer() if
this happens.

Eventually I would really like the native AER path to start by doing
whatever firmware is doing, e.g., fill in struct aer_capability_regs,
so the core of the AER handling could be identical between native AER
and GHES/CXL.  If we could do that, maybe we could figure out a
cleaner way to handle this corner case.
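
For reference, the fallback described in the commit log text above can
be modeled in user space roughly like this.  This is only a sketch: the
sketch_* names and the explicit "now" argument are invented for
illustration, and the shared limiter stands in for the static state
behind pr_err_ratelimited(), which uses the kernel default interval and
burst.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limiter state; not the kernel's struct ratelimit_state */
struct sketch_limiter {
	long window_start;
	int emitted;
	int active;
};

/* Returns 1 if a message may be emitted at time 'now', 0 if suppressed */
static int sketch_allow(struct sketch_limiter *l, long now,
			long interval, int burst)
{
	assert(l && interval > 0 && burst > 0);

	if (!l->active || now - l->window_start >= interval) {
		l->window_start = now;
		l->emitted = 0;
		l->active = 1;
	}
	if (l->emitted < burst) {
		l->emitted++;
		return 1;
	}
	return 0;
}

/* Hypothetical device with its own limiter */
struct sketch_dev {
	struct sketch_limiter limit;
};

/* One limiter shared by every "no pci_dev found" report, all error types */
static struct sketch_limiter sketch_global_limit;

/* 'dev' is NULL when the Error Source device lookup failed */
static int sketch_should_log(struct sketch_dev *dev, long now)
{
	struct sketch_limiter *l = dev ? &dev->limit : &sketch_global_limit;

	return sketch_allow(l, now, 5, 10);
}
```

The point of the sketch is that the "not found" reports share a single
budget regardless of which device or error type triggered them, while
devices that are found keep independent budgets.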

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v6 15/16] PCI/AER: Add ratelimits to PCI AER Documentation
  2025-05-20  5:01   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 19:48     ` Bjorn Helgaas
  2025-05-20 20:36       ` Sathyanarayanan Kuppuswamy
  0 siblings, 1 reply; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-20 19:48 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

On Mon, May 19, 2025 at 10:01:09PM -0700, Sathyanarayanan Kuppuswamy wrote:
> 
> On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> > From: Jon Pan-Doh <pandoh@google.com>
> > 
> > Add ratelimits section for rationale and defaults.

> > +AER Ratelimits
> > +--------------
> > +
> > +Since error messages can be generated for each transaction, we may see
> > +large volumes of errors reported. To prevent spammy devices from flooding
> > +the console/stalling execution, messages are throttled by device and error
> > +type (correctable vs. uncorrectable).
> 
> Can we list exceptions like DPC and FATAL errors (if added) ?

Like this?

  +... messages are throttled by device and error
  +type (correctable vs. non-fatal uncorrectable).  Fatal errors, including
  +DPC errors, are not ratelimited.

DPC is currently only triggered for fatal errors.
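
A rough user-space model of the policy that text describes, in case it
helps review the wording.  The doc_* names, the explicit "now"
argument, and the split into one limiter per non-fatal error type are
assumptions for illustration, not the actual patch:

```c
#include <assert.h>

enum doc_severity { DOC_CORRECTABLE, DOC_NONFATAL, DOC_FATAL };

/* Hypothetical limiter state: 10 messages per 5 second window */
struct doc_limiter {
	long window_start;
	int emitted;
	int active;
};

/* Hypothetical per-device state: one limiter per ratelimited error type */
struct doc_dev_limits {
	struct doc_limiter correctable;
	struct doc_limiter nonfatal;
};

/* Returns 1 if a message may be emitted at time 'now', 0 if suppressed */
static int doc_allow(struct doc_limiter *l, long now)
{
	assert(l);

	if (!l->active || now - l->window_start >= 5) {
		l->window_start = now;
		l->emitted = 0;
		l->active = 1;
	}
	if (l->emitted < 10) {
		l->emitted++;
		return 1;
	}
	return 0;
}

/* Fatal errors bypass the limiters; the other types are throttled separately */
static int doc_should_log(struct doc_dev_limits *d, enum doc_severity sev,
			  long now)
{
	switch (sev) {
	case DOC_CORRECTABLE:
		return doc_allow(&d->correctable, now);
	case DOC_NONFATAL:
		return doc_allow(&d->nonfatal, now);
	case DOC_FATAL:
		return 1;
	}
	return 1;
}
```

Since DPC is only triggered for fatal errors, the DPC path corresponds
to the DOC_FATAL branch in this model.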

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v6 15/16] PCI/AER: Add ratelimits to PCI AER Documentation
  2025-05-20 19:48     ` Bjorn Helgaas
@ 2025-05-20 20:36       ` Sathyanarayanan Kuppuswamy
  0 siblings, 0 replies; 68+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-05-20 20:36 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas


On 5/20/25 12:48 PM, Bjorn Helgaas wrote:
> On Mon, May 19, 2025 at 10:01:09PM -0700, Sathyanarayanan Kuppuswamy wrote:
>> On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
>>> From: Jon Pan-Doh <pandoh@google.com>
>>>
>>> Add ratelimits section for rationale and defaults.
>>> +AER Ratelimits
>>> +--------------
>>> +
>>> +Since error messages can be generated for each transaction, we may see
>>> +large volumes of errors reported. To prevent spammy devices from flooding
>>> +the console/stalling execution, messages are throttled by device and error
>>> +type (correctable vs. uncorrectable).
>> Can we list exceptions like DPC and FATAL errors (if added) ?
> Like this?
>
>    +... messages are throttled by device and error
>    +type (correctable vs. non-fatal uncorrectable).  Fatal errors, including
>    +DPC errors, are not ratelimited.
>
> DPC is currently only triggered for fatal errors.

Yes.  I think it is good enough.


-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v6 13/16] PCI/AER: Rename struct aer_stats to aer_report
  2025-05-20  3:30   ` Sathyanarayanan Kuppuswamy
@ 2025-05-20 21:25     ` Bjorn Helgaas
  0 siblings, 0 replies; 68+ messages in thread
From: Bjorn Helgaas @ 2025-05-20 21:25 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Ilpo Järvinen, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, linux-kernel, linuxppc-dev,
	Bjorn Helgaas

On Mon, May 19, 2025 at 08:30:09PM -0700, Sathyanarayanan Kuppuswamy wrote:
> 
> On 5/19/25 2:35 PM, Bjorn Helgaas wrote:
> > From: Karolina Stolarek <karolina.stolarek@oracle.com>
> > 
> > Update name to reflect the broader definition of structs/variables that are
> > stored (e.g. ratelimits). This is a preparatory patch for adding rate limit
> > support.
> > 
> > Link: https://lore.kernel.org/r/20250321015806.954866-6-pandoh@google.com
> > Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
> > Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
> > ---
> 
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> >   drivers/pci/pcie/aer.c | 50 +++++++++++++++++++++---------------------
> >   include/linux/pci.h    |  2 +-
> >   2 files changed, 26 insertions(+), 26 deletions(-)
> > 
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index 06a7dda20846..da62032bf024 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -54,11 +54,11 @@ struct aer_rpc {
> >   	DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
> >   };
> > -/* AER stats for the device */
> > -struct aer_stats {
> > +/* AER report for the device */
> > +struct aer_report {
> 
> To me, aer_report also sounds like a stats-like struct. I prefer
> aer_info, but it is up to you.

I tend to agree and can imagine a future where we might collect the
stats, ratelimits, and maybe aer_capability_regs into a per-device AER
structure.  "aer_info" seems like a decent generic name, so I did
s/\<aer_stats\>/aer_info/

^ permalink raw reply	[flat|nested] 68+ messages in thread

* Re: [PATCH v6 14/16] PCI/AER: Introduce ratelimit for error logs
  2025-05-20 19:38     ` Bjorn Helgaas
@ 2025-05-21  9:57       ` Ilpo Järvinen
  0 siblings, 0 replies; 68+ messages in thread
From: Ilpo Järvinen @ 2025-05-21  9:57 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-pci, Jon Pan-Doh, Karolina Stolarek, Martin Petersen,
	Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
	Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron,
	Sargun Dhillon, Paul E . McKenney, Mahesh J Salgaonkar,
	Oliver O'Halloran, Kai-Heng Feng, Keith Busch, Robert Richter,
	Terry Bowman, Shiju Jose, Dave Jiang, LKML, linuxppc-dev,
	Bjorn Helgaas


On Tue, 20 May 2025, Bjorn Helgaas wrote:

> On Tue, May 20, 2025 at 02:55:32PM +0300, Ilpo Järvinen wrote:
> > On Mon, 19 May 2025, Bjorn Helgaas wrote:
> > 
> > > From: Jon Pan-Doh <pandoh@google.com>
> > > 
> > > Spammy devices can flood kernel logs with AER errors and slow/stall
> > > execution. Add per-device ratelimits for AER correctable and uncorrectable
> > > errors that use the kernel defaults (10 per 5s).
> > > 
> > > There are two AER logging entry points:
> > > 
> > >   - aer_print_error() is used by DPC and native AER
> > > 
> > >   - pci_print_aer() is used by GHES and CXL
> > > 
> > > The native AER aer_print_error() case includes a loop that may log details
> > > from multiple devices.  This is ratelimited by the union of ratelimits for
> > > these devices, set by add_error_device(), which collects the devices.  If
> > > no such device is found, the Error Source message is ratelimited by the
> > > Root Port or RCEC that received the ERR_* message.
> > > 
> > > The DPC aer_print_error() case is currently not ratelimited.
> > > 
> > > The GHES and CXL pci_print_aer() cases are ratelimited by the Error Source
> > > device.
> 
> > >  static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
> > >  {
> > > +	/*
> > > +	 * Ratelimit AER log messages.  Generally we add the Error Source
> > > +	 * device, but there are is_error_source() cases that can result in
> > > +	 * multiple devices being added here, so we OR them all together.
> > 
> > > I can see the code uses OR ;-) but it wasn't helpful because this comment 
> > didn't explain why at all. As this ratelimit thing is using reverse logic 
> > to begin with, this is a very tricky bit.
> > 
> > Perhaps something less vague like:
> > 
> > ... we ratelimit if all devices have reached their ratelimit.
> > 
> > Assuming that was the intention here? (I'm not sure.)
> 
> My intention was that if there's any downstream device that has an
> unmasked error logged and it has not reached its ratelimit, we should
> log messages for all devices with errors logged.  Does something like
> this help?
> 
>   /*
>    * Ratelimit AER log messages.  "dev" is either the source
>    * identified by the root's Error Source ID or it has an unmasked
>    * error logged in its own AER Capability.  If any of these devices
>    * has not reached its ratelimit, log messages for all of them.
>    * Messages are emitted when e_info->ratelimit is non-zero.
>    *
>    * Note that e_info->ratelimit was already initialized to 1 for the
>    * ERR_FATAL case.
>    */

Yes, this is much clearer of intent, thanks.

> The ERR_FATAL case is from this post-v6 change that I haven't posted
> yet:
> 
>   aer_isr_one_error(...)
>   {
>     ...
>     if (status & PCI_ERR_ROOT_UNCOR_RCV) {
>       int fatal = status & PCI_ERR_ROOT_FATAL_RCV;
>       struct aer_err_info e_info = {
>         ...
>  +      .ratelimit = fatal ? 1 : 0,
> 
> 
> > > +	 */
> > >  	if (e_info->error_dev_num < AER_MAX_MULTI_ERR_DEVICES) {
> > >  		e_info->dev[e_info->error_dev_num] = pci_dev_get(dev);
> > > +		e_info->ratelimit |= aer_ratelimit(dev, e_info->severity);
> > >  		e_info->error_dev_num++;
> > >  		return 0;
> > >  	}
> 
> > > @@ -1147,9 +1183,10 @@ static void aer_recover_work_func(struct work_struct *work)
> > >  		pdev = pci_get_domain_bus_and_slot(entry.domain, entry.bus,
> > >  						   entry.devfn);
> > >  		if (!pdev) {
> > > -			pr_err("no pci_dev for %04x:%02x:%02x.%x\n",
> > > -			       entry.domain, entry.bus,
> > > -			       PCI_SLOT(entry.devfn), PCI_FUNC(entry.devfn));
> > > +			pr_err_ratelimited("%04x:%02x:%02x.%x: no pci_dev found\n",
> > 
> > This case was not mentioned in the changelog.
> 
> Sharp eyes!  What do you think of this commit log text?
> 
>   The CXL pci_print_aer() case is ratelimited by the Error Source device.
> 
>   The GHES pci_print_aer() case is via aer_recover_work_func(), which
>   searches for the Error Source device.  If the device is not found, there's
>   no per-device ratelimit, so we use a system-wide ratelimit that covers all
>   error types (correctable, non-fatal, and fatal).

Works for me as long as it is mentioned.

> This isn't really ideal because in pci_print_aer(), the struct
> aer_capability_regs has already been filled by firmware and the
> logging doesn't read any registers from the device at all.
> 
> However, pci_print_aer() *does* want the pci_dev for statistics and
> tracing (pci_dev_aer_stats_incr()) and, of course, for the aer_printks
> themselves.

While not a perfect solution, this looks like yet another case where it
would help to create a dummy pci_dev struct with minimal setup, which
allows calling functions that take a pci_dev.

That solution is not perfect because it arms a trap. Downstream
functions could get changed, and if the developer assumes they have a full
pci_dev at hand, it could cause issues with the dummy pci_dev. How likely
that is to happen is debatable, but in the many cases where the call chain
isn't overly complex, such as here, a dummy pci_dev seems helpful.

> We could leave this pr_err() completely alone; hopefully it's a rare
> case.  I think the CXL path just silently skips pci_print_aer() if
> this happens.
> 
> Eventually I would really like the native AER path to start by doing
> whatever firmware is doing, e.g., fill in struct aer_capability_regs,
> so the core of the AER handling could be identical between native AER
> and GHES/CXL.  If we could do that, maybe we could figure out a
> cleaner way to handle this corner case.


-- 
 i.

^ permalink raw reply	[flat|nested] 68+ messages in thread

end of thread, other threads:[~2025-05-21 10:04 UTC | newest]

Thread overview: 68+ messages
2025-05-19 21:35 [PATCH v6 00/16] Rate limit AER logs Bjorn Helgaas
2025-05-19 21:35 ` [PATCH v6 01/16] PCI/DPC: Initialize aer_err_info before using it Bjorn Helgaas
2025-05-19 22:41   ` Sathyanarayanan Kuppuswamy
2025-05-20 13:53     ` Bjorn Helgaas
2025-05-20  9:39   ` Ilpo Järvinen
2025-05-20 13:54     ` Bjorn Helgaas
2025-05-19 21:35 ` [PATCH v6 02/16] PCI/DPC: Log Error Source ID only when valid Bjorn Helgaas
2025-05-19 23:15   ` Sathyanarayanan Kuppuswamy
2025-05-20 14:00     ` Bjorn Helgaas
2025-05-20 14:20       ` Ilpo Järvinen
2025-05-20 10:28   ` Ilpo Järvinen
2025-05-20 14:05     ` Bjorn Helgaas
2025-05-19 21:35 ` [PATCH v6 03/16] PCI/AER: Consolidate Error Source ID logging in aer_print_port_info() Bjorn Helgaas
2025-05-19 23:39   ` Sathyanarayanan Kuppuswamy
2025-05-20 14:21     ` Bjorn Helgaas
2025-05-20 10:31   ` Ilpo Järvinen
2025-05-19 21:35 ` [PATCH v6 04/16] PCI/AER: Extract bus/dev/fn in aer_print_port_info() with PCI_BUS_NUM(), etc Bjorn Helgaas
2025-05-19 23:47   ` Sathyanarayanan Kuppuswamy
2025-05-20 10:32   ` Ilpo Järvinen
2025-05-19 21:35 ` [PATCH v6 05/16] PCI/AER: Rename aer_print_port_info() to aer_print_source() Bjorn Helgaas
2025-05-19 23:48   ` Sathyanarayanan Kuppuswamy
2025-05-20 10:33   ` Ilpo Järvinen
2025-05-19 21:35 ` [PATCH v6 06/16] PCI/AER: Move aer_print_source() earlier in file Bjorn Helgaas
2025-05-19 23:49   ` Sathyanarayanan Kuppuswamy
2025-05-20 10:34   ` Ilpo Järvinen
2025-05-19 21:35 ` [PATCH v6 07/16] PCI/AER: Initialize aer_err_info before using it Bjorn Helgaas
2025-05-19 23:50   ` Sathyanarayanan Kuppuswamy
2025-05-20 10:39   ` Ilpo Järvinen
2025-05-20 14:27     ` Bjorn Helgaas
2025-05-19 21:35 ` [PATCH v6 08/16] PCI/AER: Simplify pci_print_aer() Bjorn Helgaas
2025-05-20  0:02   ` Sathyanarayanan Kuppuswamy
2025-05-20 14:38     ` Bjorn Helgaas
2025-05-20 10:42   ` Ilpo Järvinen
2025-05-19 21:35 ` [PATCH v6 09/16] PCI/AER: Update statistics early in logging Bjorn Helgaas
2025-05-20  1:32   ` Sathyanarayanan Kuppuswamy
2025-05-20 11:04   ` Ilpo Järvinen
2025-05-19 21:35 ` [PATCH v6 10/16] PCI/AER: Combine trace_aer_event() with statistics updates Bjorn Helgaas
2025-05-20  1:49   ` Sathyanarayanan Kuppuswamy
2025-05-20 11:08   ` Ilpo Järvinen
2025-05-19 21:35 ` [PATCH v6 11/16] PCI/AER: Check log level once and remember it Bjorn Helgaas
2025-05-19 23:17   ` Weinan Liu
2025-05-20 14:46     ` Bjorn Helgaas
2025-05-20  2:49   ` Sathyanarayanan Kuppuswamy
2025-05-20 11:26   ` Ilpo Järvinen
2025-05-19 21:35 ` [PATCH v6 12/16] PCI/AER: Make all pci_print_aer() log levels depend on error type Bjorn Helgaas
2025-05-20  3:23   ` Sathyanarayanan Kuppuswamy
2025-05-20 11:37   ` Ilpo Järvinen
2025-05-20 15:04     ` Bjorn Helgaas
2025-05-19 21:35 ` [PATCH v6 13/16] PCI/AER: Rename struct aer_stats to aer_report Bjorn Helgaas
2025-05-20  3:30   ` Sathyanarayanan Kuppuswamy
2025-05-20 21:25     ` Bjorn Helgaas
2025-05-20 11:38   ` Ilpo Järvinen
2025-05-19 21:35 ` [PATCH v6 14/16] PCI/AER: Introduce ratelimit for error logs Bjorn Helgaas
2025-05-20  4:59   ` Sathyanarayanan Kuppuswamy
2025-05-20 18:31     ` Bjorn Helgaas
2025-05-20 18:42       ` Sathyanarayanan Kuppuswamy
2025-05-20 11:55   ` Ilpo Järvinen
2025-05-20 19:38     ` Bjorn Helgaas
2025-05-21  9:57       ` Ilpo Järvinen
2025-05-19 21:35 ` [PATCH v6 15/16] PCI/AER: Add ratelimits to PCI AER Documentation Bjorn Helgaas
2025-05-20  5:01   ` Sathyanarayanan Kuppuswamy
2025-05-20 19:48     ` Bjorn Helgaas
2025-05-20 20:36       ` Sathyanarayanan Kuppuswamy
2025-05-19 21:35 ` [PATCH v6 16/16] PCI/AER: Add sysfs attributes for log ratelimits Bjorn Helgaas
2025-05-20  5:05   ` Sathyanarayanan Kuppuswamy
2025-05-20 12:02   ` Ilpo Järvinen
2025-05-20 16:31     ` Bjorn Helgaas
2025-05-20  9:05 ` [PATCH v6 00/16] Rate limit AER logs Krzysztof Wilczyński
