* [PATCH 0/8] Rate limit AER logs/IRQs
@ 2025-01-15 7:42 Jon Pan-Doh
2025-01-15 7:42 ` [PATCH 1/8] PCI/AER: Remove aer_print_port_info Jon Pan-Doh
` (9 more replies)
0 siblings, 10 replies; 51+ messages in thread
From: Jon Pan-Doh @ 2025-01-15 7:42 UTC (permalink / raw)
To: Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Jon Pan-Doh
Proposal
========
When using native AER, spammy devices can flood kernel logs with AER errors
and slow/stall execution. Add per-device per-error-severity ratelimits
for more robust error logging. Allow userspace to configure ratelimits
via sysfs knobs.
Motivation
==========
Several OCP members have issues with inconsistent PCIe error handling,
exacerbated at datacenter scale (myriad of devices).
OCP HW/Fault Management subproject set out to solve this by
standardizing industry:
- PCIe error handling best practices
- Fault Management/RAS (incl. PCIe errors)
Exposing PCIe errors/debug info in-band for a userspace daemon (e.g.
rasdaemon) to collect/pass on to repairability services is part of the
roadmap.
Background
==========
AER error spam has been observed many times, both publicly (e.g. [1], [2],
[3]) and privately. While it usually occurs with correctable errors, it can
happen with uncorrectable errors (e.g. during new HW bringup).
There have been previous attempts to add ratelimits to AER logs ([4],
[5]). The most recent attempt[5] has many similarities with the proposed
approach.
Patch organization
==================
1-3 AER logging cleanup
4-7 Ratelimits and sysfs knobs
8 Sysfs cleanup (RFC that breaks existing ABI/can be dropped)
Outstanding work
================
Cleanup:
- Consolidate aer_print_error() and pci_print_error() path
- Elevate log level logic out of print functions[6]
[1] https://bugzilla.kernel.org/show_bug.cgi?id=215027
[2] https://bugzilla.kernel.org/show_bug.cgi?id=201517
[3] https://bugzilla.kernel.org/show_bug.cgi?id=196183
[4] https://lore.kernel.org/linux-pci/20230606035442.2886343-2-grundler@chromium.org/
[5] https://lore.kernel.org/linux-pci/cover.1736341506.git.karolina.stolarek@oracle.com/
[6] https://lore.kernel.org/linux-pci/edd77011aafad4c0654358a26b4e538d0c5a321d.1736341506.git.karolina.stolarek@oracle.com/
Jon Pan-Doh (8):
PCI/AER: Remove aer_print_port_info
PCI/AER: Move AER stat collection out of __aer_print_error
PCI/AER: Rename struct aer_stats to aer_info
PCI/AER: Introduce ratelimit for error logs
PCI/AER: Introduce ratelimit for AER IRQs
PCI/AER: Add AER sysfs attributes for ratelimits
PCI/AER: Update AER sysfs ABI filename
PCI/AER: Move AER sysfs attributes into separate directory
...es-aer_stats => sysfs-bus-pci-devices-aer} | 50 +++-
Documentation/PCI/pcieaer-howto.rst | 10 +-
drivers/pci/pci-sysfs.c | 2 +-
drivers/pci/pci.h | 2 +-
drivers/pci/pcie/aer.c | 227 +++++++++++++-----
include/linux/pci.h | 2 +-
6 files changed, 216 insertions(+), 77 deletions(-)
rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (69%)
--
2.48.0.rc2.279.g1de40edade-goog
^ permalink raw reply [flat|nested] 51+ messages in thread
* [PATCH 1/8] PCI/AER: Remove aer_print_port_info
2025-01-15 7:42 [PATCH 0/8] Rate limit AER logs/IRQs Jon Pan-Doh
@ 2025-01-15 7:42 ` Jon Pan-Doh
2025-01-16 14:27 ` Karolina Stolarek
` (2 more replies)
2025-01-15 7:42 ` [PATCH 2/8] PCI/AER: Move AER stat collection out of __aer_print_error Jon Pan-Doh
` (8 subsequent siblings)
9 siblings, 3 replies; 51+ messages in thread
From: Jon Pan-Doh @ 2025-01-15 7:42 UTC (permalink / raw)
To: Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Jon Pan-Doh
Info logged is duplicated when either the source device is processed. If
no source device is found, than an error is logged.
Code flow:
aer_isr_one_error()
-> aer_print_port_info()
-> find_source_device()
-> return/pci_info() if no device found else continue
-> aer_process_err_devices()
-> aer_print_error()
aer_print_port_info():
[ 21.596150] pcieport 0000:00:04.0: Correctable error message received
from 0000:01:00.0
aer_print_error():
[ 21.596163] e1000e 0000:01:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
[ 21.600575] e1000e 0000:01:00.0: device [8086:10d3] error status/mask=00000040/0000e000
[ 21.604707] e1000e 0000:01:00.0: [ 6] BadTLP
Tested using aer-inject[1] tool. No more root port log on dmesg.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
---
drivers/pci/pcie/aer.c | 15 ---------------
1 file changed, 15 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 34ce9f834d0c..ba40800b5494 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -735,18 +735,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
info->severity, info->tlp_header_valid, &info->tlp);
}
-static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
-{
- u8 bus = info->id >> 8;
- u8 devfn = info->id & 0xff;
-
- pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d\n",
- info->multi_error_valid ? "Multiple " : "",
- aer_error_severity_string[info->severity],
- pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
- PCI_FUNC(devfn));
-}
-
#ifdef CONFIG_ACPI_APEI_PCIEAER
int cper_severity_to_aer(int cper_severity)
{
@@ -1295,7 +1283,6 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
e_info.multi_error_valid = 1;
else
e_info.multi_error_valid = 0;
- aer_print_port_info(pdev, &e_info);
if (find_source_device(pdev, &e_info))
aer_process_err_devices(&e_info);
@@ -1314,8 +1301,6 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
else
e_info.multi_error_valid = 0;
- aer_print_port_info(pdev, &e_info);
-
if (find_source_device(pdev, &e_info))
aer_process_err_devices(&e_info);
}
--
2.48.0.rc2.279.g1de40edade-goog
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 2/8] PCI/AER: Move AER stat collection out of __aer_print_error
2025-01-15 7:42 [PATCH 0/8] Rate limit AER logs/IRQs Jon Pan-Doh
2025-01-15 7:42 ` [PATCH 1/8] PCI/AER: Remove aer_print_port_info Jon Pan-Doh
@ 2025-01-15 7:42 ` Jon Pan-Doh
2025-01-16 14:47 ` Karolina Stolarek
2025-01-25 4:37 ` Sathyanarayanan Kuppuswamy
2025-01-15 7:42 ` [PATCH 3/8] PCI/AER: Rename struct aer_stats to aer_info Jon Pan-Doh
` (7 subsequent siblings)
9 siblings, 2 replies; 51+ messages in thread
From: Jon Pan-Doh @ 2025-01-15 7:42 UTC (permalink / raw)
To: Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Jon Pan-Doh
Decouple stat collection from internal AER print functions. AERs from ghes
or cxl drivers have stat collection in pci_print_aer as that is where
aer_err_info is populated.
Tested using aer-inject[1] tool. AER sysfs counters still updated
correctly.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
---
drivers/pci/pcie/aer.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index ba40800b5494..4bb0b3840402 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -695,7 +695,6 @@ static void __aer_print_error(struct pci_dev *dev,
pci_printk(level, dev, " [%2d] %-22s%s\n", i, errmsg,
info->first_error == i ? " (First)" : "");
}
- pci_dev_aer_stats_incr(dev, info);
}
void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
@@ -775,6 +774,8 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
info.mask = mask;
info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
+ pci_dev_aer_stats_incr(dev, &info);
+
pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
__aer_print_error(dev, &info);
pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
@@ -1249,8 +1250,10 @@ static inline void aer_process_err_devices(struct aer_err_info *e_info)
/* Report all before handle them, not to lost records by reset etc. */
for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
- if (aer_get_device_error_info(e_info->dev[i], e_info))
+ if (aer_get_device_error_info(e_info->dev[i], e_info)) {
+ pci_dev_aer_stats_incr(e_info->dev[i], e_info);
aer_print_error(e_info->dev[i], e_info);
+ }
}
for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
if (aer_get_device_error_info(e_info->dev[i], e_info))
--
2.48.0.rc2.279.g1de40edade-goog
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 3/8] PCI/AER: Rename struct aer_stats to aer_info
2025-01-15 7:42 [PATCH 0/8] Rate limit AER logs/IRQs Jon Pan-Doh
2025-01-15 7:42 ` [PATCH 1/8] PCI/AER: Remove aer_print_port_info Jon Pan-Doh
2025-01-15 7:42 ` [PATCH 2/8] PCI/AER: Move AER stat collection out of __aer_print_error Jon Pan-Doh
@ 2025-01-15 7:42 ` Jon Pan-Doh
2025-01-16 10:11 ` Karolina Stolarek
2025-01-15 7:42 ` [PATCH 4/8] PCI/AER: Introduce ratelimit for error logs Jon Pan-Doh
` (6 subsequent siblings)
9 siblings, 1 reply; 51+ messages in thread
From: Jon Pan-Doh @ 2025-01-15 7:42 UTC (permalink / raw)
To: Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Jon Pan-Doh
Update name to reflect the broader definition of structs/variables that
are stored (e.g. ratelimits).
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
---
drivers/pci/pcie/aer.c | 50 +++++++++++++++++++++---------------------
include/linux/pci.h | 2 +-
2 files changed, 26 insertions(+), 26 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 4bb0b3840402..5ab5cd7368bc 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -50,11 +50,11 @@ struct aer_rpc {
DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
};
-/* AER stats for the device */
-struct aer_stats {
+/* AER info for the device */
+struct aer_info {
/*
- * Fields for all AER capable devices. They indicate the errors
+ * Stats for all AER capable devices. They indicate the errors
* "as seen by this device". Note that this may mean that if an
* end point is causing problems, the AER counters may increment
* at its link partner (e.g. root port) because the errors will be
@@ -76,7 +76,7 @@ struct aer_stats {
u64 dev_total_nonfatal_errs;
/*
- * Fields for Root ports & root complex event collectors only, these
+ * Stats for Root ports & root complex event collectors only, these
* indicate the total number of ERR_COR, ERR_FATAL, and ERR_NONFATAL
* messages received by the root port / event collector, INCLUDING the
* ones that are generated internally (by the rootport itself)
@@ -373,7 +373,7 @@ void pci_aer_init(struct pci_dev *dev)
if (!dev->aer_cap)
return;
- dev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
+ dev->aer_info = kzalloc(sizeof(struct aer_info), GFP_KERNEL);
/*
* We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
@@ -394,8 +394,8 @@ void pci_aer_init(struct pci_dev *dev)
void pci_aer_exit(struct pci_dev *dev)
{
- kfree(dev->aer_stats);
- dev->aer_stats = NULL;
+ kfree(dev->aer_info);
+ dev->aer_info = NULL;
}
#define AER_AGENT_RECEIVER 0
@@ -533,10 +533,10 @@ static const char *aer_agent_string[] = {
{ \
unsigned int i; \
struct pci_dev *pdev = to_pci_dev(dev); \
- u64 *stats = pdev->aer_stats->stats_array; \
+ u64 *stats = pdev->aer_info->stats_array; \
size_t len = 0; \
\
- for (i = 0; i < ARRAY_SIZE(pdev->aer_stats->stats_array); i++) {\
+ for (i = 0; i < ARRAY_SIZE(pdev->aer_info->stats_array); i++) {\
if (strings_array[i]) \
len += sysfs_emit_at(buf, len, "%s %llu\n", \
strings_array[i], \
@@ -547,7 +547,7 @@ static const char *aer_agent_string[] = {
i, stats[i]); \
} \
len += sysfs_emit_at(buf, len, "TOTAL_%s %llu\n", total_string, \
- pdev->aer_stats->total_field); \
+ pdev->aer_info->total_field); \
return len; \
} \
static DEVICE_ATTR_RO(name)
@@ -568,7 +568,7 @@ aer_stats_dev_attr(aer_dev_nonfatal, dev_nonfatal_errs,
char *buf) \
{ \
struct pci_dev *pdev = to_pci_dev(dev); \
- return sysfs_emit(buf, "%llu\n", pdev->aer_stats->field); \
+ return sysfs_emit(buf, "%llu\n", pdev->aer_info->field); \
} \
static DEVICE_ATTR_RO(name)
@@ -595,7 +595,7 @@ static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
struct device *dev = kobj_to_dev(kobj);
struct pci_dev *pdev = to_pci_dev(dev);
- if (!pdev->aer_stats)
+ if (!pdev->aer_info)
return 0;
if ((a == &dev_attr_aer_rootport_total_err_cor.attr ||
@@ -619,25 +619,25 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
unsigned long status = info->status & ~info->mask;
int i, max = -1;
u64 *counter = NULL;
- struct aer_stats *aer_stats = pdev->aer_stats;
+ struct aer_info *aer_info = pdev->aer_info;
- if (!aer_stats)
+ if (!aer_info)
return;
switch (info->severity) {
case AER_CORRECTABLE:
- aer_stats->dev_total_cor_errs++;
- counter = &aer_stats->dev_cor_errs[0];
+ aer_info->dev_total_cor_errs++;
+ counter = &aer_info->dev_cor_errs[0];
max = AER_MAX_TYPEOF_COR_ERRS;
break;
case AER_NONFATAL:
- aer_stats->dev_total_nonfatal_errs++;
- counter = &aer_stats->dev_nonfatal_errs[0];
+ aer_info->dev_total_nonfatal_errs++;
+ counter = &aer_info->dev_nonfatal_errs[0];
max = AER_MAX_TYPEOF_UNCOR_ERRS;
break;
case AER_FATAL:
- aer_stats->dev_total_fatal_errs++;
- counter = &aer_stats->dev_fatal_errs[0];
+ aer_info->dev_total_fatal_errs++;
+ counter = &aer_info->dev_fatal_errs[0];
max = AER_MAX_TYPEOF_UNCOR_ERRS;
break;
}
@@ -649,19 +649,19 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
struct aer_err_source *e_src)
{
- struct aer_stats *aer_stats = pdev->aer_stats;
+ struct aer_info *aer_info = pdev->aer_info;
- if (!aer_stats)
+ if (!aer_info)
return;
if (e_src->status & PCI_ERR_ROOT_COR_RCV)
- aer_stats->rootport_total_cor_errs++;
+ aer_info->rootport_total_cor_errs++;
if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
- aer_stats->rootport_total_fatal_errs++;
+ aer_info->rootport_total_fatal_errs++;
else
- aer_stats->rootport_total_nonfatal_errs++;
+ aer_info->rootport_total_nonfatal_errs++;
}
}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index db9b47ce3eef..72e6f5560164 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -346,7 +346,7 @@ struct pci_dev {
u8 hdr_type; /* PCI header type (`multi' flag masked out) */
#ifdef CONFIG_PCIEAER
u16 aer_cap; /* AER capability offset */
- struct aer_stats *aer_stats; /* AER stats for this device */
+ struct aer_info *aer_info; /* AER info for this device */
#endif
#ifdef CONFIG_PCIEPORTBUS
struct rcec_ea *rcec_ea; /* RCEC cached endpoint association */
--
2.48.0.rc2.279.g1de40edade-goog
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 4/8] PCI/AER: Introduce ratelimit for error logs
2025-01-15 7:42 [PATCH 0/8] Rate limit AER logs/IRQs Jon Pan-Doh
` (2 preceding siblings ...)
2025-01-15 7:42 ` [PATCH 3/8] PCI/AER: Rename struct aer_stats to aer_info Jon Pan-Doh
@ 2025-01-15 7:42 ` Jon Pan-Doh
2025-01-16 11:11 ` Karolina Stolarek
2025-01-15 7:42 ` [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs Jon Pan-Doh
` (5 subsequent siblings)
9 siblings, 1 reply; 51+ messages in thread
From: Jon Pan-Doh @ 2025-01-15 7:42 UTC (permalink / raw)
To: Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Jon Pan-Doh
Spammy devices can flood kernel logs with AER errors and slow/stall
execution. Add per-device ratelimits for AER errors (correctable and
uncorrectable). Set the default rate to the default kernel ratelimit
(10 per 5s).
Tested using aer-inject[1] tool. Sent 11 AER errors. Observed 10 errors
logged while AER stats (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable)
show true count of 11.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
---
Documentation/PCI/pcieaer-howto.rst | 6 ++++++
drivers/pci/pcie/aer.c | 31 +++++++++++++++++++++++++++--
2 files changed, 35 insertions(+), 2 deletions(-)
diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
index f013f3b27c82..5546de60f184 100644
--- a/Documentation/PCI/pcieaer-howto.rst
+++ b/Documentation/PCI/pcieaer-howto.rst
@@ -85,6 +85,12 @@ In the example, 'Requester ID' means the ID of the device that sent
the error message to the Root Port. Please refer to PCIe specs for other
fields.
+AER Ratelimits
+-------------------------
+
+Error messages are ratelimited per device and error type. This prevents spammy
+devices from flooding the console.
+
AER Statistics / Counters
-------------------------
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 5ab5cd7368bc..025c50b0f293 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -27,6 +27,7 @@
#include <linux/interrupt.h>
#include <linux/delay.h>
#include <linux/kfifo.h>
+#include <linux/ratelimit.h>
#include <linux/slab.h>
#include <acpi/apei.h>
#include <acpi/ghes.h>
@@ -84,6 +85,10 @@ struct aer_info {
u64 rootport_total_cor_errs;
u64 rootport_total_fatal_errs;
u64 rootport_total_nonfatal_errs;
+
+ /* Ratelimits for errors */
+ struct ratelimit_state cor_log_ratelimit;
+ struct ratelimit_state uncor_log_ratelimit;
};
#define AER_LOG_TLP_MASKS (PCI_ERR_UNC_POISON_TLP| \
@@ -374,6 +379,12 @@ void pci_aer_init(struct pci_dev *dev)
return;
dev->aer_info = kzalloc(sizeof(struct aer_info), GFP_KERNEL);
+ ratelimit_state_init(&dev->aer_info->cor_log_ratelimit,
+ DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
+ ratelimit_state_init(&dev->aer_info->uncor_log_ratelimit,
+ DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
+ ratelimit_set_flags(&dev->aer_info->cor_log_ratelimit, RATELIMIT_MSG_ON_RELEASE);
+ ratelimit_set_flags(&dev->aer_info->uncor_log_ratelimit, RATELIMIT_MSG_ON_RELEASE);
/*
* We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
@@ -702,6 +713,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
int layer, agent;
int id = pci_dev_id(dev);
const char *level;
+ struct ratelimit_state *ratelimit;
if (!info->status) {
pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
@@ -709,11 +721,20 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
goto out;
}
+ if (info->severity == AER_CORRECTABLE) {
+ ratelimit = &dev->aer_info->cor_log_ratelimit;
+ level = KERN_WARNING;
+ } else {
+ ratelimit = &dev->aer_info->uncor_log_ratelimit;
+ level = KERN_ERR;
+ }
+
+ if (!__ratelimit(ratelimit))
+ return;
+
layer = AER_GET_LAYER_ERROR(info->severity, info->status);
agent = AER_GET_AGENT(info->severity, info->status);
- level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
-
pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
aer_error_severity_string[info->severity],
aer_error_layer[layer], aer_agent_string[agent]);
@@ -755,11 +776,14 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
int layer, agent, tlp_header_valid = 0;
u32 status, mask;
struct aer_err_info info;
+ struct ratelimit_state *ratelimit;
if (aer_severity == AER_CORRECTABLE) {
+ ratelimit = &dev->aer_info->cor_log_ratelimit;
status = aer->cor_status;
mask = aer->cor_mask;
} else {
+ ratelimit = &dev->aer_info->uncor_log_ratelimit;
status = aer->uncor_status;
mask = aer->uncor_mask;
tlp_header_valid = status & AER_LOG_TLP_MASKS;
@@ -776,6 +800,9 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
pci_dev_aer_stats_incr(dev, &info);
+ if (!__ratelimit(ratelimit))
+ return;
+
pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
__aer_print_error(dev, &info);
pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
--
2.48.0.rc2.279.g1de40edade-goog
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs
2025-01-15 7:42 [PATCH 0/8] Rate limit AER logs/IRQs Jon Pan-Doh
` (3 preceding siblings ...)
2025-01-15 7:42 ` [PATCH 4/8] PCI/AER: Introduce ratelimit for error logs Jon Pan-Doh
@ 2025-01-15 7:42 ` Jon Pan-Doh
2025-01-16 12:02 ` Karolina Stolarek
` (2 more replies)
2025-01-15 7:42 ` [PATCH 6/8] PCI/AER: Add AER sysfs attributes for ratelimits Jon Pan-Doh
` (4 subsequent siblings)
9 siblings, 3 replies; 51+ messages in thread
From: Jon Pan-Doh @ 2025-01-15 7:42 UTC (permalink / raw)
To: Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Jon Pan-Doh
After ratelimiting logs, spammy devices can still slow execution by
continued AER IRQ servicing.
Add higher per-device ratelimits for AER errors to mask out those IRQs.
Set the default rate to 3x default AER ratelimit (30 per 5s).
Tested using aer-inject[1] tool. Injected 32 AER errors. Observed IRQ
masked via lspci and sysfs counters record 31 errors (1 masked).
Before: CEMsk: BadTLP-
After: CEMsk: BadTLP+
[1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
---
Documentation/PCI/pcieaer-howto.rst | 4 +-
drivers/pci/pcie/aer.c | 64 +++++++++++++++++++++++++----
2 files changed, 57 insertions(+), 11 deletions(-)
diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
index 5546de60f184..d41272504b18 100644
--- a/Documentation/PCI/pcieaer-howto.rst
+++ b/Documentation/PCI/pcieaer-howto.rst
@@ -88,8 +88,8 @@ fields.
AER Ratelimits
-------------------------
-Error messages are ratelimited per device and error type. This prevents spammy
-devices from flooding the console.
+Errors, both at log and IRQ level, are ratelimited per device and error type.
+This prevents spammy devices from stalling execution.
AER Statistics / Counters
-------------------------
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 025c50b0f293..1db70ae87f52 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -87,6 +87,8 @@ struct aer_info {
u64 rootport_total_nonfatal_errs;
/* Ratelimits for errors */
+ struct ratelimit_state cor_irq_ratelimit;
+ struct ratelimit_state uncor_irq_ratelimit;
struct ratelimit_state cor_log_ratelimit;
struct ratelimit_state uncor_log_ratelimit;
};
@@ -379,6 +381,10 @@ void pci_aer_init(struct pci_dev *dev)
return;
dev->aer_info = kzalloc(sizeof(struct aer_info), GFP_KERNEL);
+ ratelimit_state_init(&dev->aer_info->cor_irq_ratelimit,
+ DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST*3);
+ ratelimit_state_init(&dev->aer_info->uncor_irq_ratelimit,
+ DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST*3);
ratelimit_state_init(&dev->aer_info->cor_log_ratelimit,
DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
ratelimit_state_init(&dev->aer_info->uncor_log_ratelimit,
@@ -676,6 +682,39 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
}
}
+static void mask_reported_error(struct pci_dev *dev, struct aer_err_info *info)
+{
+ const char **strings;
+ const char *errmsg;
+ u16 aer_offset = dev->aer_cap;
+ u16 mask_reg_offset;
+ u32 mask;
+ unsigned long status = info->status;
+ int i;
+
+ if (info->severity == AER_CORRECTABLE) {
+ strings = aer_correctable_error_string;
+ mask_reg_offset = PCI_ERR_COR_MASK;
+ } else {
+ strings = aer_uncorrectable_error_string;
+ mask_reg_offset = PCI_ERR_UNCOR_MASK;
+ }
+
+ pci_read_config_dword(dev, aer_offset + mask_reg_offset, &mask);
+ mask |= status;
+ pci_write_config_dword(dev, aer_offset + mask_reg_offset, mask);
+
+ pci_warn(dev, "%s error(s) masked due to rate-limiting:",
+ aer_error_severity_string[info->severity]);
+ for_each_set_bit(i, &status, 32) {
+ errmsg = strings[i];
+ if (!errmsg)
+ errmsg = "Unknown Error Bit";
+
+ pci_warn(dev, " [%2d] %-22s\n", i, errmsg);
+ }
+}
+
static void __print_tlp_header(struct pci_dev *dev, struct pcie_tlp_log *t)
{
pci_err(dev, " TLP Header: %08x %08x %08x %08x\n",
@@ -713,7 +752,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
int layer, agent;
int id = pci_dev_id(dev);
const char *level;
- struct ratelimit_state *ratelimit;
+ struct ratelimit_state *irq_ratelimit;
+ struct ratelimit_state *log_ratelimit;
if (!info->status) {
pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
@@ -722,14 +762,20 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
}
if (info->severity == AER_CORRECTABLE) {
- ratelimit = &dev->aer_info->cor_log_ratelimit;
+ irq_ratelimit = &dev->aer_info->cor_irq_ratelimit;
+ log_ratelimit = &dev->aer_info->cor_log_ratelimit;
level = KERN_WARNING;
} else {
- ratelimit = &dev->aer_info->uncor_log_ratelimit;
+ irq_ratelimit = &dev->aer_info->uncor_irq_ratelimit;
+ log_ratelimit = &dev->aer_info->uncor_log_ratelimit;
level = KERN_ERR;
}
- if (!__ratelimit(ratelimit))
+ if (!__ratelimit(irq_ratelimit)) {
+ mask_reported_error(dev, info);
+ return;
+ }
+ if (!__ratelimit(log_ratelimit))
return;
layer = AER_GET_LAYER_ERROR(info->severity, info->status);
@@ -776,14 +822,14 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
int layer, agent, tlp_header_valid = 0;
u32 status, mask;
struct aer_err_info info;
- struct ratelimit_state *ratelimit;
+ struct ratelimit_state *log_ratelimit;
if (aer_severity == AER_CORRECTABLE) {
- ratelimit = &dev->aer_info->cor_log_ratelimit;
+ log_ratelimit = &dev->aer_info->cor_log_ratelimit;
status = aer->cor_status;
mask = aer->cor_mask;
} else {
- ratelimit = &dev->aer_info->uncor_log_ratelimit;
+ log_ratelimit = &dev->aer_info->uncor_log_ratelimit;
status = aer->uncor_status;
mask = aer->uncor_mask;
tlp_header_valid = status & AER_LOG_TLP_MASKS;
@@ -799,8 +845,8 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
pci_dev_aer_stats_incr(dev, &info);
-
- if (!__ratelimit(ratelimit))
+ /* Only ratelimit logs (no IRQ) as AERs reported via GHES/CXL (caller). */
+ if (!__ratelimit(log_ratelimit))
return;
pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
--
2.48.0.rc2.279.g1de40edade-goog
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 6/8] PCI/AER: Add AER sysfs attributes for ratelimits
2025-01-15 7:42 [PATCH 0/8] Rate limit AER logs/IRQs Jon Pan-Doh
` (4 preceding siblings ...)
2025-01-15 7:42 ` [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs Jon Pan-Doh
@ 2025-01-15 7:42 ` Jon Pan-Doh
2025-01-31 14:32 ` Jonathan Cameron
2025-01-15 7:42 ` [PATCH 7/8] PCI/AER: Update AER sysfs ABI filename Jon Pan-Doh
` (3 subsequent siblings)
9 siblings, 1 reply; 51+ messages in thread
From: Jon Pan-Doh @ 2025-01-15 7:42 UTC (permalink / raw)
To: Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Jon Pan-Doh
Allow userspace to read/write ratelimits per device. Create aer/ sysfs
directory to store them and any future aer configs.
Tested using aer-inject[1] tool. Configured correctable log/irq ratelimits
to 5/10 respectively. Sent 12 AER errors. Observed 5 errors logged while
AER stats (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) shows 11
(1 masked).
Before: CEMsk: BadTLP-
After: CEMsk: BadTLP+.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
---
.../testing/sysfs-bus-pci-devices-aer_stats | 32 +++++++++++
Documentation/PCI/pcieaer-howto.rst | 4 +-
drivers/pci/pci-sysfs.c | 1 +
drivers/pci/pci.h | 1 +
drivers/pci/pcie/aer.c | 56 +++++++++++++++++++
5 files changed, 93 insertions(+), 1 deletion(-)
diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
index d1f67bb81d5d..c680a53af0f4 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
+++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
@@ -117,3 +117,35 @@ Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_NONFATAL messages reported to rootport.
+
+PCIe AER ratelimits
+----------------------------
+
+These attributes show up under all the devices that are AER capable. Provides
+configurable ratelimits of logs/irq per error type. Writing a nonzero value
+changes the number of errors (burst) allowed per 5 second window before
+ratelimiting. Reading gets the current ratelimits.
+
+What: /sys/bus/pci/devices/<dev>/aer/ratelimit_cor_irq
+Date: Jan 2025
+KernelVersion: 6.14.0
+Contact: linux-pci@vger.kernel.org, pandoh@google.com
+Description: IRQ ratelimit for correctable errors.
+
+What: /sys/bus/pci/devices/<dev>/aer/ratelimit_uncor_irq
+Date: Jan 2025
+KernelVersion: 6.14.0
+Contact: linux-pci@vger.kernel.org, pandoh@google.com
+Description: IRQ ratelimit for uncorrectable errors.
+
+What: /sys/bus/pci/devices/<dev>/aer/ratelimit_cor_log
+Date: Jan 2025
+KernelVersion: 6.14.0
+Contact: linux-pci@vger.kernel.org, pandoh@google.com
+Description: Log ratelimit for correctable errors.
+
+What: /sys/bus/pci/devices/<dev>/aer/ratelimit_uncor_log
+Date: Jan 2025
+KernelVersion: 6.14.0
+Contact: linux-pci@vger.kernel.org, pandoh@google.com
+Description: Log ratelimit for uncorrectable errors.
diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
index d41272504b18..4d5b0638f120 100644
--- a/Documentation/PCI/pcieaer-howto.rst
+++ b/Documentation/PCI/pcieaer-howto.rst
@@ -89,7 +89,9 @@ AER Ratelimits
-------------------------
Errors, both at log and IRQ level, are ratelimited per device and error type.
-This prevents spammy devices from stalling execution.
+This prevents spammy devices from stalling execution. Ratelimits are exposed
+in the form of sysfs attributes and configurable. See
+Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
AER Statistics / Counters
-------------------------
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 7679d75d71e5..41acb6713e2d 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1693,6 +1693,7 @@ const struct attribute_group *pci_dev_attr_groups[] = {
&pcie_dev_attr_group,
#ifdef CONFIG_PCIEAER
&aer_stats_attr_group,
+ &aer_attr_group,
#endif
#ifdef CONFIG_PCIEASPM
&aspm_ctrl_attr_group,
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 2e40fc63ba31..9d0272a890ef 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -881,6 +881,7 @@ void pci_no_aer(void);
void pci_aer_init(struct pci_dev *dev);
void pci_aer_exit(struct pci_dev *dev);
extern const struct attribute_group aer_stats_attr_group;
+extern const struct attribute_group aer_attr_group;
void pci_aer_clear_fatal_status(struct pci_dev *dev);
int pci_aer_clear_status(struct pci_dev *dev);
int pci_aer_raw_clear_status(struct pci_dev *dev);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 1db70ae87f52..e48e2951baae 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -630,6 +630,62 @@ const struct attribute_group aer_stats_attr_group = {
.is_visible = aer_stats_attrs_are_visible,
};
+#define aer_ratelimit_attr(name, ratelimit) \
+ static ssize_t \
+ name##_show(struct device *dev, struct device_attribute *attr, \
+ char *buf) \
+{ \
+ struct pci_dev *pdev = to_pci_dev(dev); \
+ return sysfs_emit(buf, "%u errors every %u seconds\n", \
+ pdev->aer_info->ratelimit.burst, \
+ pdev->aer_info->ratelimit.interval / HZ); \
+} \
+ \
+ static ssize_t \
+ name##_store(struct device *dev, struct device_attribute *attr, \
+ const char *buf, size_t count) \
+{ \
+ struct pci_dev *pdev = to_pci_dev(dev); \
+ int burst; \
+ \
+ if (kstrtoint(buf, 0, &burst) < 0) \
+ return -EINVAL; \
+ \
+ pdev->aer_info->ratelimit.burst = burst; \
+ return count; \
+} \
+static DEVICE_ATTR_RW(name)
+
+aer_ratelimit_attr(ratelimit_cor_irq, cor_irq_ratelimit);
+aer_ratelimit_attr(ratelimit_uncor_irq, uncor_irq_ratelimit);
+aer_ratelimit_attr(ratelimit_cor_log, cor_log_ratelimit);
+aer_ratelimit_attr(ratelimit_uncor_log, uncor_log_ratelimit);
+
+static struct attribute *aer_attrs[] __ro_after_init = {
+ &dev_attr_ratelimit_cor_irq.attr,
+ &dev_attr_ratelimit_uncor_irq.attr,
+ &dev_attr_ratelimit_cor_log.attr,
+ &dev_attr_ratelimit_uncor_log.attr,
+ NULL
+};
+
+static umode_t aer_attrs_are_visible(struct kobject *kobj,
+ struct attribute *a, int n)
+{
+ struct device *dev = kobj_to_dev(kobj);
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ if (!pdev->aer_info)
+ return 0;
+ return a->mode;
+}
+
+const struct attribute_group aer_attr_group = {
+ .name = "aer",
+ .attrs = aer_attrs,
+ .is_visible = aer_attrs_are_visible,
+};
+
static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
struct aer_err_info *info)
{
--
2.48.0.rc2.279.g1de40edade-goog
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 7/8] PCI/AER: Update AER sysfs ABI filename
2025-01-15 7:42 [PATCH 0/8] Rate limit AER logs/IRQs Jon Pan-Doh
` (5 preceding siblings ...)
2025-01-15 7:42 ` [PATCH 6/8] PCI/AER: Add AER sysfs attributes for ratelimits Jon Pan-Doh
@ 2025-01-15 7:42 ` Jon Pan-Doh
2025-01-15 7:43 ` [PATCH 8/8] PCI/AER: Move AER sysfs attributes into separate directory Jon Pan-Doh
` (2 subsequent siblings)
9 siblings, 0 replies; 51+ messages in thread
From: Jon Pan-Doh @ 2025-01-15 7:42 UTC (permalink / raw)
To: Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Jon Pan-Doh
Change Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats to
Documentation/ABI/testing/sysfs-bus-pci-devices-aer to reflect the broader
scope of AER sysfs attributes (e.g. stats and ratelimits).
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
---
...fs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} | 0
Documentation/PCI/pcieaer-howto.rst | 4 ++--
2 files changed, 2 insertions(+), 2 deletions(-)
rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (100%)
diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
similarity index 100%
rename from Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
rename to Documentation/ABI/testing/sysfs-bus-pci-devices-aer
diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
index 4d5b0638f120..4c9d31ae7d88 100644
--- a/Documentation/PCI/pcieaer-howto.rst
+++ b/Documentation/PCI/pcieaer-howto.rst
@@ -91,14 +91,14 @@ AER Ratelimits
Errors, both at log and IRQ level, are ratelimited per device and error type.
This prevents spammy devices from stalling execution. Ratelimits are exposed
in the form of sysfs attributes and configurable. See
-Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
+Documentation/ABI/testing/sysfs-bus-pci-devices-aer
AER Statistics / Counters
-------------------------
When PCIe AER errors are captured, the counters / statistics are also exposed
in the form of sysfs attributes which are documented at
-Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
+Documentation/ABI/testing/sysfs-bus-pci-devices-aer
Developer Guide
===============
--
2.48.0.rc2.279.g1de40edade-goog
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [PATCH 8/8] PCI/AER: Move AER sysfs attributes into separate directory
2025-01-15 7:42 [PATCH 0/8] Rate limit AER logs/IRQs Jon Pan-Doh
` (6 preceding siblings ...)
2025-01-15 7:42 ` [PATCH 7/8] PCI/AER: Update AER sysfs ABI filename Jon Pan-Doh
@ 2025-01-15 7:43 ` Jon Pan-Doh
2025-01-16 10:26 ` Karolina Stolarek
2025-01-23 15:18 ` [PATCH 0/8] Rate limit AER logs/IRQs Bowman, Terry
2025-02-06 13:32 ` Karolina Stolarek
9 siblings, 1 reply; 51+ messages in thread
From: Jon Pan-Doh @ 2025-01-15 7:43 UTC (permalink / raw)
To: Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Jon Pan-Doh
Prepare for the addition of new AER sysfs attributes (e.g. ratelimits)
by moving them into their own directory. Update naming to reflect
broader definition and for consistency.
/sys/bus/pci/devices/<dev>/aer_dev_correctable
/sys/bus/pci/devices/<dev>/aer_dev_fatal
/sys/bus/pci/devices/<dev>/aer_dev_nonfatal
/sys/bus/pci/devices/<dev>/aer_rootport_total_err_cor
/sys/bus/pci/devices/<dev>/aer_rootport_total_err_fatal
/sys/bus/pci/devices/<dev>/aer_rootport_total_err_nonfatal
->
/sys/bus/pci/devices/<dev>/aer/err_cor
/sys/bus/pci/devices/<dev>/aer/err_fatal
/sys/bus/pci/devices/<dev>/aer/err_nonfatal
/sys/bus/pci/devices/<dev>/aer/rootport_total_err_cor
/sys/bus/pci/devices/<dev>/aer/rootport_total_err_fatal
/sys/bus/pci/devices/<dev>/aer/rootport_total_err_nonfatal
Tested using aer-inject[1] tool. Sent 1 AER error. Observed AER stats
correctedly logged (cat /sys/bus/pci/devices/<dev>/aer/dev_err_cor).
[1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
---
.../ABI/testing/sysfs-bus-pci-devices-aer | 18 +++---
drivers/pci/pci-sysfs.c | 1 -
drivers/pci/pci.h | 1 -
drivers/pci/pcie/aer.c | 64 +++++++------------
4 files changed, 32 insertions(+), 52 deletions(-)
diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
index c680a53af0f4..e1472583207b 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
+++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
@@ -9,7 +9,7 @@ errors may be "seen" / reported by the link partner and not the
problematic endpoint itself (which may report all counters as 0 as it never
saw any problems).
-What: /sys/bus/pci/devices/<dev>/aer_dev_correctable
+What: /sys/bus/pci/devices/<dev>/aer/err_cor
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
@@ -19,7 +19,7 @@ Description: List of correctable errors seen and reported by this
TOTAL_ERR_COR at the end of the file may not match the actual
total of all the errors in the file. Sample output::
- localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_correctable
+ localhost /sys/devices/pci0000:00/0000:00:1c.0/aer # cat err_cor
Receiver Error 2
Bad TLP 0
Bad DLLP 0
@@ -30,7 +30,7 @@ Description: List of correctable errors seen and reported by this
Header Log Overflow 0
TOTAL_ERR_COR 2
-What: /sys/bus/pci/devices/<dev>/aer_dev_fatal
+What: /sys/bus/pci/devices/<dev>/aer/err_fatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
@@ -40,7 +40,7 @@ Description: List of uncorrectable fatal errors seen and reported by this
TOTAL_ERR_FATAL at the end of the file may not match the actual
total of all the errors in the file. Sample output::
- localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_fatal
+ localhost /sys/devices/pci0000:00/0000:00:1c.0/aer # cat err_fatal
Undefined 0
Data Link Protocol 0
Surprise Down Error 0
@@ -60,7 +60,7 @@ Description: List of uncorrectable fatal errors seen and reported by this
TLP Prefix Blocked Error 0
TOTAL_ERR_FATAL 0
-What: /sys/bus/pci/devices/<dev>/aer_dev_nonfatal
+What: /sys/bus/pci/devices/<dev>/aer/err_nonfatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
@@ -70,7 +70,7 @@ Description: List of uncorrectable nonfatal errors seen and reported by this
TOTAL_ERR_NONFATAL at the end of the file may not match the
actual total of all the errors in the file. Sample output::
- localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_nonfatal
+ localhost /sys/devices/pci0000:00/0000:00:1c.0/aer # cat err_nonfatal
Undefined 0
Data Link Protocol 0
Surprise Down Error 0
@@ -100,19 +100,19 @@ collectors) that are AER capable. These indicate the number of error messages as
device, so these counters include them and are thus cumulative of all the error
messages on the PCI hierarchy originating at that root port.
-What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_cor
+What: /sys/bus/pci/devices/<dev>/aer/rootport_total_err_cor
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_COR messages reported to rootport.
-What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_fatal
+What: /sys/bus/pci/devices/<dev>/aer/rootport_total_err_fatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_FATAL messages reported to rootport.
-What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_nonfatal
+What: /sys/bus/pci/devices/<dev>/aer/rootport_total_err_nonfatal
Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index 41acb6713e2d..e16b92edf3bd 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1692,7 +1692,6 @@ const struct attribute_group *pci_dev_attr_groups[] = {
&pci_bridge_attr_group,
&pcie_dev_attr_group,
#ifdef CONFIG_PCIEAER
- &aer_stats_attr_group,
&aer_attr_group,
#endif
#ifdef CONFIG_PCIEASPM
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9d0272a890ef..a80cfc08f634 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -880,7 +880,6 @@ static inline void of_pci_remove_node(struct pci_dev *pdev) { }
void pci_no_aer(void);
void pci_aer_init(struct pci_dev *dev);
void pci_aer_exit(struct pci_dev *dev);
-extern const struct attribute_group aer_stats_attr_group;
extern const struct attribute_group aer_attr_group;
void pci_aer_clear_fatal_status(struct pci_dev *dev);
int pci_aer_clear_status(struct pci_dev *dev);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e48e2951baae..68850525cc8d 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -569,13 +569,13 @@ static const char *aer_agent_string[] = {
} \
static DEVICE_ATTR_RO(name)
-aer_stats_dev_attr(aer_dev_correctable, dev_cor_errs,
+aer_stats_dev_attr(err_cor, dev_cor_errs,
aer_correctable_error_string, "ERR_COR",
dev_total_cor_errs);
-aer_stats_dev_attr(aer_dev_fatal, dev_fatal_errs,
+aer_stats_dev_attr(err_fatal, dev_fatal_errs,
aer_uncorrectable_error_string, "ERR_FATAL",
dev_total_fatal_errs);
-aer_stats_dev_attr(aer_dev_nonfatal, dev_nonfatal_errs,
+aer_stats_dev_attr(err_nonfatal, dev_nonfatal_errs,
aer_uncorrectable_error_string, "ERR_NONFATAL",
dev_total_nonfatal_errs);
@@ -589,47 +589,13 @@ aer_stats_dev_attr(aer_dev_nonfatal, dev_nonfatal_errs,
} \
static DEVICE_ATTR_RO(name)
-aer_stats_rootport_attr(aer_rootport_total_err_cor,
+aer_stats_rootport_attr(rootport_total_err_cor,
rootport_total_cor_errs);
-aer_stats_rootport_attr(aer_rootport_total_err_fatal,
+aer_stats_rootport_attr(rootport_total_err_fatal,
rootport_total_fatal_errs);
-aer_stats_rootport_attr(aer_rootport_total_err_nonfatal,
+aer_stats_rootport_attr(rootport_total_err_nonfatal,
rootport_total_nonfatal_errs);
-static struct attribute *aer_stats_attrs[] __ro_after_init = {
- &dev_attr_aer_dev_correctable.attr,
- &dev_attr_aer_dev_fatal.attr,
- &dev_attr_aer_dev_nonfatal.attr,
- &dev_attr_aer_rootport_total_err_cor.attr,
- &dev_attr_aer_rootport_total_err_fatal.attr,
- &dev_attr_aer_rootport_total_err_nonfatal.attr,
- NULL
-};
-
-static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
- struct attribute *a, int n)
-{
- struct device *dev = kobj_to_dev(kobj);
- struct pci_dev *pdev = to_pci_dev(dev);
-
- if (!pdev->aer_info)
- return 0;
-
- if ((a == &dev_attr_aer_rootport_total_err_cor.attr ||
- a == &dev_attr_aer_rootport_total_err_fatal.attr ||
- a == &dev_attr_aer_rootport_total_err_nonfatal.attr) &&
- ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) &&
- (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC)))
- return 0;
-
- return a->mode;
-}
-
-const struct attribute_group aer_stats_attr_group = {
- .attrs = aer_stats_attrs,
- .is_visible = aer_stats_attrs_are_visible,
-};
-
#define aer_ratelimit_attr(name, ratelimit) \
static ssize_t \
name##_show(struct device *dev, struct device_attribute *attr, \
@@ -662,6 +628,14 @@ aer_ratelimit_attr(ratelimit_cor_log, cor_log_ratelimit);
aer_ratelimit_attr(ratelimit_uncor_log, uncor_log_ratelimit);
static struct attribute *aer_attrs[] __ro_after_init = {
+ /* Stats */
+ &dev_attr_err_cor.attr,
+ &dev_attr_err_fatal.attr,
+ &dev_attr_err_nonfatal.attr,
+ &dev_attr_rootport_total_err_cor.attr,
+ &dev_attr_rootport_total_err_fatal.attr,
+ &dev_attr_rootport_total_err_nonfatal.attr,
+ /* Ratelimits */
&dev_attr_ratelimit_cor_irq.attr,
&dev_attr_ratelimit_uncor_irq.attr,
&dev_attr_ratelimit_cor_log.attr,
@@ -670,13 +644,21 @@ static struct attribute *aer_attrs[] __ro_after_init = {
};
static umode_t aer_attrs_are_visible(struct kobject *kobj,
- struct attribute *a, int n)
+ struct attribute *a, int n)
{
struct device *dev = kobj_to_dev(kobj);
struct pci_dev *pdev = to_pci_dev(dev);
if (!pdev->aer_info)
return 0;
+
+ if ((a == &dev_attr_rootport_total_err_cor.attr ||
+ a == &dev_attr_rootport_total_err_fatal.attr ||
+ a == &dev_attr_rootport_total_err_nonfatal.attr) &&
+ ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) &&
+ (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC)))
+ return 0;
+
return a->mode;
}
--
2.48.0.rc2.279.g1de40edade-goog
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [PATCH 3/8] PCI/AER: Rename struct aer_stats to aer_info
2025-01-15 7:42 ` [PATCH 3/8] PCI/AER: Rename struct aer_stats to aer_info Jon Pan-Doh
@ 2025-01-16 10:11 ` Karolina Stolarek
2025-01-18 1:59 ` Jon Pan-Doh
0 siblings, 1 reply; 51+ messages in thread
From: Karolina Stolarek @ 2025-01-16 10:11 UTC (permalink / raw)
To: Jon Pan-Doh, Bjorn Helgaas
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck
On 15/01/2025 08:42, Jon Pan-Doh wrote:
> Update name to reflect the broader definition of structs/variables that
> are stored (e.g. ratelimits).
I understand the intention behind this change, but I'm not fully
convinced if we should mix AER error attributes with the tools to
control error reporting/generation (ratelimits). I'd argue that
"aer_info" name still doesn't express what the structure does.
aer_stats sits in pci_dev, so I can see why you decided to use it; it's
one of the few available places where we could keep a stateful ratelimit.
How about creating a struct to keep all the ratelimits in one place and
embedding that in the pci_dev?
All the best,
Karolina
>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> ---
> drivers/pci/pcie/aer.c | 50 +++++++++++++++++++++---------------------
> include/linux/pci.h | 2 +-
> 2 files changed, 26 insertions(+), 26 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 4bb0b3840402..5ab5cd7368bc 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -50,11 +50,11 @@ struct aer_rpc {
> DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
> };
>
> -/* AER stats for the device */
> -struct aer_stats {
> +/* AER info for the device */
> +struct aer_info {
>
> /*
> - * Fields for all AER capable devices. They indicate the errors
> + * Stats for all AER capable devices. They indicate the errors
> * "as seen by this device". Note that this may mean that if an
> * end point is causing problems, the AER counters may increment
> * at its link partner (e.g. root port) because the errors will be
> @@ -76,7 +76,7 @@ struct aer_stats {
> u64 dev_total_nonfatal_errs;
>
> /*
> - * Fields for Root ports & root complex event collectors only, these
> + * Stats for Root ports & root complex event collectors only, these
> * indicate the total number of ERR_COR, ERR_FATAL, and ERR_NONFATAL
> * messages received by the root port / event collector, INCLUDING the
> * ones that are generated internally (by the rootport itself)
> @@ -373,7 +373,7 @@ void pci_aer_init(struct pci_dev *dev)
> if (!dev->aer_cap)
> return;
>
> - dev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
> + dev->aer_info = kzalloc(sizeof(struct aer_info), GFP_KERNEL);
>
> /*
> * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
> @@ -394,8 +394,8 @@ void pci_aer_init(struct pci_dev *dev)
>
> void pci_aer_exit(struct pci_dev *dev)
> {
> - kfree(dev->aer_stats);
> - dev->aer_stats = NULL;
> + kfree(dev->aer_info);
> + dev->aer_info = NULL;
> }
>
> #define AER_AGENT_RECEIVER 0
> @@ -533,10 +533,10 @@ static const char *aer_agent_string[] = {
> { \
> unsigned int i; \
> struct pci_dev *pdev = to_pci_dev(dev); \
> - u64 *stats = pdev->aer_stats->stats_array; \
> + u64 *stats = pdev->aer_info->stats_array; \
> size_t len = 0; \
> \
> - for (i = 0; i < ARRAY_SIZE(pdev->aer_stats->stats_array); i++) {\
> + for (i = 0; i < ARRAY_SIZE(pdev->aer_info->stats_array); i++) {\
> if (strings_array[i]) \
> len += sysfs_emit_at(buf, len, "%s %llu\n", \
> strings_array[i], \
> @@ -547,7 +547,7 @@ static const char *aer_agent_string[] = {
> i, stats[i]); \
> } \
> len += sysfs_emit_at(buf, len, "TOTAL_%s %llu\n", total_string, \
> - pdev->aer_stats->total_field); \
> + pdev->aer_info->total_field); \
> return len; \
> } \
> static DEVICE_ATTR_RO(name)
> @@ -568,7 +568,7 @@ aer_stats_dev_attr(aer_dev_nonfatal, dev_nonfatal_errs,
> char *buf) \
> { \
> struct pci_dev *pdev = to_pci_dev(dev); \
> - return sysfs_emit(buf, "%llu\n", pdev->aer_stats->field); \
> + return sysfs_emit(buf, "%llu\n", pdev->aer_info->field); \
> } \
> static DEVICE_ATTR_RO(name)
>
> @@ -595,7 +595,7 @@ static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
> struct device *dev = kobj_to_dev(kobj);
> struct pci_dev *pdev = to_pci_dev(dev);
>
> - if (!pdev->aer_stats)
> + if (!pdev->aer_info)
> return 0;
>
> if ((a == &dev_attr_aer_rootport_total_err_cor.attr ||
> @@ -619,25 +619,25 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
> unsigned long status = info->status & ~info->mask;
> int i, max = -1;
> u64 *counter = NULL;
> - struct aer_stats *aer_stats = pdev->aer_stats;
> + struct aer_info *aer_info = pdev->aer_info;
>
> - if (!aer_stats)
> + if (!aer_info)
> return;
>
> switch (info->severity) {
> case AER_CORRECTABLE:
> - aer_stats->dev_total_cor_errs++;
> - counter = &aer_stats->dev_cor_errs[0];
> + aer_info->dev_total_cor_errs++;
> + counter = &aer_info->dev_cor_errs[0];
> max = AER_MAX_TYPEOF_COR_ERRS;
> break;
> case AER_NONFATAL:
> - aer_stats->dev_total_nonfatal_errs++;
> - counter = &aer_stats->dev_nonfatal_errs[0];
> + aer_info->dev_total_nonfatal_errs++;
> + counter = &aer_info->dev_nonfatal_errs[0];
> max = AER_MAX_TYPEOF_UNCOR_ERRS;
> break;
> case AER_FATAL:
> - aer_stats->dev_total_fatal_errs++;
> - counter = &aer_stats->dev_fatal_errs[0];
> + aer_info->dev_total_fatal_errs++;
> + counter = &aer_info->dev_fatal_errs[0];
> max = AER_MAX_TYPEOF_UNCOR_ERRS;
> break;
> }
> @@ -649,19 +649,19 @@ static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
> static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
> struct aer_err_source *e_src)
> {
> - struct aer_stats *aer_stats = pdev->aer_stats;
> + struct aer_info *aer_info = pdev->aer_info;
>
> - if (!aer_stats)
> + if (!aer_info)
> return;
>
> if (e_src->status & PCI_ERR_ROOT_COR_RCV)
> - aer_stats->rootport_total_cor_errs++;
> + aer_info->rootport_total_cor_errs++;
>
> if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
> if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
> - aer_stats->rootport_total_fatal_errs++;
> + aer_info->rootport_total_fatal_errs++;
> else
> - aer_stats->rootport_total_nonfatal_errs++;
> + aer_info->rootport_total_nonfatal_errs++;
> }
> }
>
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index db9b47ce3eef..72e6f5560164 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -346,7 +346,7 @@ struct pci_dev {
> u8 hdr_type; /* PCI header type (`multi' flag masked out) */
> #ifdef CONFIG_PCIEAER
> u16 aer_cap; /* AER capability offset */
> - struct aer_stats *aer_stats; /* AER stats for this device */
> + struct aer_info *aer_info; /* AER info for this device */
> #endif
> #ifdef CONFIG_PCIEPORTBUS
> struct rcec_ea *rcec_ea; /* RCEC cached endpoint association */
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 8/8] PCI/AER: Move AER sysfs attributes into separate directory
2025-01-15 7:43 ` [PATCH 8/8] PCI/AER: Move AER sysfs attributes into separate directory Jon Pan-Doh
@ 2025-01-16 10:26 ` Karolina Stolarek
2025-01-16 17:18 ` Rajat Jain
0 siblings, 1 reply; 51+ messages in thread
From: Karolina Stolarek @ 2025-01-16 10:26 UTC (permalink / raw)
To: Jon Pan-Doh, Bjorn Helgaas
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck
On 15/01/2025 08:43, Jon Pan-Doh wrote:
> Prepare for the addition of new AER sysfs attributes (e.g. ratelimits)
> by moving them into their own directory. Update naming to reflect
> broader definition and for consistency.
>
> /sys/bus/pci/devices/<dev>/aer_dev_correctable
> /sys/bus/pci/devices/<dev>/aer_dev_fatal
> /sys/bus/pci/devices/<dev>/aer_dev_nonfatal
> /sys/bus/pci/devices/<dev>/aer_rootport_total_err_cor
> /sys/bus/pci/devices/<dev>/aer_rootport_total_err_fatal
> /sys/bus/pci/devices/<dev>/aer_rootport_total_err_nonfatal
> ->
> /sys/bus/pci/devices/<dev>/aer/err_cor
> /sys/bus/pci/devices/<dev>/aer/err_fatal
> /sys/bus/pci/devices/<dev>/aer/err_nonfatal
> /sys/bus/pci/devices/<dev>/aer/rootport_total_err_cor
> /sys/bus/pci/devices/<dev>/aer/rootport_total_err_fatal
> /sys/bus/pci/devices/<dev>/aer/rootport_total_err_nonfatal
>
> Tested using aer-inject[1] tool. Sent 1 AER error. Observed AER stats
> correctedly logged (cat /sys/bus/pci/devices/<dev>/aer/dev_err_cor).
I'm not a sysfs expert but my understanding is that we shouldn't do
major changes in the existing hierarchies.
On one hand, I think it would be nice to extract out AER-specific info
and knobs into a subdirectory (e.g., using attribute_group with name
"aer"), but on the other this would be disruptive to the userspace. I
can imagine that there are tools that watch these values that would
break after this change.
All the best,
Karolina
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> ---
> .../ABI/testing/sysfs-bus-pci-devices-aer | 18 +++---
> drivers/pci/pci-sysfs.c | 1 -
> drivers/pci/pci.h | 1 -
> drivers/pci/pcie/aer.c | 64 +++++++------------
> 4 files changed, 32 insertions(+), 52 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> index c680a53af0f4..e1472583207b 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> @@ -9,7 +9,7 @@ errors may be "seen" / reported by the link partner and not the
> problematic endpoint itself (which may report all counters as 0 as it never
> saw any problems).
>
> -What: /sys/bus/pci/devices/<dev>/aer_dev_correctable
> +What: /sys/bus/pci/devices/<dev>/aer/err_cor
> Date: July 2018
> KernelVersion: 4.19.0
> Contact: linux-pci@vger.kernel.org, rajatja@google.com
> @@ -19,7 +19,7 @@ Description: List of correctable errors seen and reported by this
> TOTAL_ERR_COR at the end of the file may not match the actual
> total of all the errors in the file. Sample output::
>
> - localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_correctable
> + localhost /sys/devices/pci0000:00/0000:00:1c.0/aer # cat err_cor
> Receiver Error 2
> Bad TLP 0
> Bad DLLP 0
> @@ -30,7 +30,7 @@ Description: List of correctable errors seen and reported by this
> Header Log Overflow 0
> TOTAL_ERR_COR 2
>
> -What: /sys/bus/pci/devices/<dev>/aer_dev_fatal
> +What: /sys/bus/pci/devices/<dev>/aer/err_fatal
> Date: July 2018
> KernelVersion: 4.19.0
> Contact: linux-pci@vger.kernel.org, rajatja@google.com
> @@ -40,7 +40,7 @@ Description: List of uncorrectable fatal errors seen and reported by this
> TOTAL_ERR_FATAL at the end of the file may not match the actual
> total of all the errors in the file. Sample output::
>
> - localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_fatal
> + localhost /sys/devices/pci0000:00/0000:00:1c.0/aer # cat err_fatal
> Undefined 0
> Data Link Protocol 0
> Surprise Down Error 0
> @@ -60,7 +60,7 @@ Description: List of uncorrectable fatal errors seen and reported by this
> TLP Prefix Blocked Error 0
> TOTAL_ERR_FATAL 0
>
> -What: /sys/bus/pci/devices/<dev>/aer_dev_nonfatal
> +What: /sys/bus/pci/devices/<dev>/aer/err_nonfatal
> Date: July 2018
> KernelVersion: 4.19.0
> Contact: linux-pci@vger.kernel.org, rajatja@google.com
> @@ -70,7 +70,7 @@ Description: List of uncorrectable nonfatal errors seen and reported by this
> TOTAL_ERR_NONFATAL at the end of the file may not match the
> actual total of all the errors in the file. Sample output::
>
> - localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_nonfatal
> + localhost /sys/devices/pci0000:00/0000:00:1c.0/aer # cat err_nonfatal
> Undefined 0
> Data Link Protocol 0
> Surprise Down Error 0
> @@ -100,19 +100,19 @@ collectors) that are AER capable. These indicate the number of error messages as
> device, so these counters include them and are thus cumulative of all the error
> messages on the PCI hierarchy originating at that root port.
>
> -What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_cor
> +What: /sys/bus/pci/devices/<dev>/aer/rootport_total_err_cor
> Date: July 2018
> KernelVersion: 4.19.0
> Contact: linux-pci@vger.kernel.org, rajatja@google.com
> Description: Total number of ERR_COR messages reported to rootport.
>
> -What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_fatal
> +What: /sys/bus/pci/devices/<dev>/aer/rootport_total_err_fatal
> Date: July 2018
> KernelVersion: 4.19.0
> Contact: linux-pci@vger.kernel.org, rajatja@google.com
> Description: Total number of ERR_FATAL messages reported to rootport.
>
> -What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_nonfatal
> +What: /sys/bus/pci/devices/<dev>/aer/rootport_total_err_nonfatal
> Date: July 2018
> KernelVersion: 4.19.0
> Contact: linux-pci@vger.kernel.org, rajatja@google.com
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index 41acb6713e2d..e16b92edf3bd 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -1692,7 +1692,6 @@ const struct attribute_group *pci_dev_attr_groups[] = {
> &pci_bridge_attr_group,
> &pcie_dev_attr_group,
> #ifdef CONFIG_PCIEAER
> - &aer_stats_attr_group,
> &aer_attr_group,
> #endif
> #ifdef CONFIG_PCIEASPM
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 9d0272a890ef..a80cfc08f634 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -880,7 +880,6 @@ static inline void of_pci_remove_node(struct pci_dev *pdev) { }
> void pci_no_aer(void);
> void pci_aer_init(struct pci_dev *dev);
> void pci_aer_exit(struct pci_dev *dev);
> -extern const struct attribute_group aer_stats_attr_group;
> extern const struct attribute_group aer_attr_group;
> void pci_aer_clear_fatal_status(struct pci_dev *dev);
> int pci_aer_clear_status(struct pci_dev *dev);
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index e48e2951baae..68850525cc8d 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -569,13 +569,13 @@ static const char *aer_agent_string[] = {
> } \
> static DEVICE_ATTR_RO(name)
>
> -aer_stats_dev_attr(aer_dev_correctable, dev_cor_errs,
> +aer_stats_dev_attr(err_cor, dev_cor_errs,
> aer_correctable_error_string, "ERR_COR",
> dev_total_cor_errs);
> -aer_stats_dev_attr(aer_dev_fatal, dev_fatal_errs,
> +aer_stats_dev_attr(err_fatal, dev_fatal_errs,
> aer_uncorrectable_error_string, "ERR_FATAL",
> dev_total_fatal_errs);
> -aer_stats_dev_attr(aer_dev_nonfatal, dev_nonfatal_errs,
> +aer_stats_dev_attr(err_nonfatal, dev_nonfatal_errs,
> aer_uncorrectable_error_string, "ERR_NONFATAL",
> dev_total_nonfatal_errs);
>
> @@ -589,47 +589,13 @@ aer_stats_dev_attr(aer_dev_nonfatal, dev_nonfatal_errs,
> } \
> static DEVICE_ATTR_RO(name)
>
> -aer_stats_rootport_attr(aer_rootport_total_err_cor,
> +aer_stats_rootport_attr(rootport_total_err_cor,
> rootport_total_cor_errs);
> -aer_stats_rootport_attr(aer_rootport_total_err_fatal,
> +aer_stats_rootport_attr(rootport_total_err_fatal,
> rootport_total_fatal_errs);
> -aer_stats_rootport_attr(aer_rootport_total_err_nonfatal,
> +aer_stats_rootport_attr(rootport_total_err_nonfatal,
> rootport_total_nonfatal_errs);
>
> -static struct attribute *aer_stats_attrs[] __ro_after_init = {
> - &dev_attr_aer_dev_correctable.attr,
> - &dev_attr_aer_dev_fatal.attr,
> - &dev_attr_aer_dev_nonfatal.attr,
> - &dev_attr_aer_rootport_total_err_cor.attr,
> - &dev_attr_aer_rootport_total_err_fatal.attr,
> - &dev_attr_aer_rootport_total_err_nonfatal.attr,
> - NULL
> -};
> -
> -static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
> - struct attribute *a, int n)
> -{
> - struct device *dev = kobj_to_dev(kobj);
> - struct pci_dev *pdev = to_pci_dev(dev);
> -
> - if (!pdev->aer_info)
> - return 0;
> -
> - if ((a == &dev_attr_aer_rootport_total_err_cor.attr ||
> - a == &dev_attr_aer_rootport_total_err_fatal.attr ||
> - a == &dev_attr_aer_rootport_total_err_nonfatal.attr) &&
> - ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) &&
> - (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC)))
> - return 0;
> -
> - return a->mode;
> -}
> -
> -const struct attribute_group aer_stats_attr_group = {
> - .attrs = aer_stats_attrs,
> - .is_visible = aer_stats_attrs_are_visible,
> -};
> -
> #define aer_ratelimit_attr(name, ratelimit) \
> static ssize_t \
> name##_show(struct device *dev, struct device_attribute *attr, \
> @@ -662,6 +628,14 @@ aer_ratelimit_attr(ratelimit_cor_log, cor_log_ratelimit);
> aer_ratelimit_attr(ratelimit_uncor_log, uncor_log_ratelimit);
>
> static struct attribute *aer_attrs[] __ro_after_init = {
> + /* Stats */
> + &dev_attr_err_cor.attr,
> + &dev_attr_err_fatal.attr,
> + &dev_attr_err_nonfatal.attr,
> + &dev_attr_rootport_total_err_cor.attr,
> + &dev_attr_rootport_total_err_fatal.attr,
> + &dev_attr_rootport_total_err_nonfatal.attr,
> + /* Ratelimits */
> &dev_attr_ratelimit_cor_irq.attr,
> &dev_attr_ratelimit_uncor_irq.attr,
> &dev_attr_ratelimit_cor_log.attr,
> @@ -670,13 +644,21 @@ static struct attribute *aer_attrs[] __ro_after_init = {
> };
>
> static umode_t aer_attrs_are_visible(struct kobject *kobj,
> - struct attribute *a, int n)
> + struct attribute *a, int n)
> {
> struct device *dev = kobj_to_dev(kobj);
> struct pci_dev *pdev = to_pci_dev(dev);
>
> if (!pdev->aer_info)
> return 0;
> +
> + if ((a == &dev_attr_rootport_total_err_cor.attr ||
> + a == &dev_attr_rootport_total_err_fatal.attr ||
> + a == &dev_attr_rootport_total_err_nonfatal.attr) &&
> + ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) &&
> + (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC)))
> + return 0;
> +
> return a->mode;
> }
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 4/8] PCI/AER: Introduce ratelimit for error logs
2025-01-15 7:42 ` [PATCH 4/8] PCI/AER: Introduce ratelimit for error logs Jon Pan-Doh
@ 2025-01-16 11:11 ` Karolina Stolarek
2025-01-18 1:59 ` Jon Pan-Doh
0 siblings, 1 reply; 51+ messages in thread
From: Karolina Stolarek @ 2025-01-16 11:11 UTC (permalink / raw)
To: Jon Pan-Doh, Bjorn Helgaas
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck
On 15/01/2025 08:42, Jon Pan-Doh wrote:
> Spammy devices can flood kernel logs with AER errors and slow/stall
> execution. Add per-device ratelimits for AER errors (correctable and
> uncorrectable). Set the default rate to the default kernel ratelimit
> (10 per 5s).
>
> Tested using aer-inject[1] tool. Sent 11 AER errors. Observed 10 errors
> logged while AER stats (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable)
> show true count of 11.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> ---
> Documentation/PCI/pcieaer-howto.rst | 6 ++++++
> drivers/pci/pcie/aer.c | 31 +++++++++++++++++++++++++++--
> 2 files changed, 35 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
> index f013f3b27c82..5546de60f184 100644
> --- a/Documentation/PCI/pcieaer-howto.rst
> +++ b/Documentation/PCI/pcieaer-howto.rst
> @@ -85,6 +85,12 @@ In the example, 'Requester ID' means the ID of the device that sent
> the error message to the Root Port. Please refer to PCIe specs for other
> fields.
>
> +AER Ratelimits
> +-------------------------
> +
> +Error messages are ratelimited per device and error type. This prevents spammy
> +devices from flooding the console.
> +
Nit: I would create a separate commit for the documentation updates.
Also, I think it would be good to mention the default interval and,
maybe, explain why such rate-limiting was introduced in the first place.
If you don't feel like writing about it, let me know and we can work it
out together.
> AER Statistics / Counters
> -------------------------
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 5ab5cd7368bc..025c50b0f293 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -27,6 +27,7 @@
> #include <linux/interrupt.h>
> #include <linux/delay.h>
> #include <linux/kfifo.h>
> +#include <linux/ratelimit.h>
> #include <linux/slab.h>
> #include <acpi/apei.h>
> #include <acpi/ghes.h>
> @@ -84,6 +85,10 @@ struct aer_info {
> u64 rootport_total_cor_errs;
> u64 rootport_total_fatal_errs;
> u64 rootport_total_nonfatal_errs;
> +
> + /* Ratelimits for errors */
> + struct ratelimit_state cor_log_ratelimit;
> + struct ratelimit_state uncor_log_ratelimit;
My comment for 3/8 applies here as well.
> };
>
> #define AER_LOG_TLP_MASKS (PCI_ERR_UNC_POISON_TLP| \
> @@ -374,6 +379,12 @@ void pci_aer_init(struct pci_dev *dev)
> return;
>
> dev->aer_info = kzalloc(sizeof(struct aer_info), GFP_KERNEL);
> + ratelimit_state_init(&dev->aer_info->cor_log_ratelimit,
> + DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
> + ratelimit_state_init(&dev->aer_info->uncor_log_ratelimit,
> + DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
> + ratelimit_set_flags(&dev->aer_info->cor_log_ratelimit, RATELIMIT_MSG_ON_RELEASE);
> + ratelimit_set_flags(&dev->aer_info->uncor_log_ratelimit, RATELIMIT_MSG_ON_RELEASE);
In some cases, it would be beneficial to keep the "X callbacks
suppressed" to give an idea of how prevalent the errors are. On some
devices, we could see just 11 Correctable Errors per 5 seconds, but on
others this would be >1k (seen such a case myself).
As it's not immediately clear what the defaults are, could you add a
comment for this?
>
> /*
> * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
> @@ -702,6 +713,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> int layer, agent;
> int id = pci_dev_id(dev);
> const char *level;
> + struct ratelimit_state *ratelimit;
>
> if (!info->status) {
> pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> @@ -709,11 +721,20 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> goto out;
> }
>
> + if (info->severity == AER_CORRECTABLE) {
> + ratelimit = &dev->aer_info->cor_log_ratelimit;
> + level = KERN_WARNING;
> + } else {
> + ratelimit = &dev->aer_info->uncor_log_ratelimit;
> + level = KERN_ERR;
> + }
> +
> + if (!__ratelimit(ratelimit))
> + return;
> +
Maybe it's just me but I found "!__ratelimit(..)" expression confusing.
At first, I read this "if there is not ratelimit", whereas the function
returns 0 ("hey, you can't fire at this point of time") and we negate
it. My series attempted to communicate this via a variable, but maybe
that was too wordy/complicated, so I'm not pushy about introducing a
similar solution here.
All the best,
Karolina
> layer = AER_GET_LAYER_ERROR(info->severity, info->status);
> agent = AER_GET_AGENT(info->severity, info->status);
>
> - level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
> -
> pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> aer_error_severity_string[info->severity],
> aer_error_layer[layer], aer_agent_string[agent]);
> @@ -755,11 +776,14 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
> int layer, agent, tlp_header_valid = 0;
> u32 status, mask;
> struct aer_err_info info;
> + struct ratelimit_state *ratelimit;
>
> if (aer_severity == AER_CORRECTABLE) {
> + ratelimit = &dev->aer_info->cor_log_ratelimit;
> status = aer->cor_status;
> mask = aer->cor_mask;
> } else {
> + ratelimit = &dev->aer_info->uncor_log_ratelimit;
> status = aer->uncor_status;
> mask = aer->uncor_mask;
> tlp_header_valid = status & AER_LOG_TLP_MASKS;
> @@ -776,6 +800,9 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>
> pci_dev_aer_stats_incr(dev, &info);
>
> + if (!__ratelimit(ratelimit))
> + return;
> +
> pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
> __aer_print_error(dev, &info);
> pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs
2025-01-15 7:42 ` [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs Jon Pan-Doh
@ 2025-01-16 12:02 ` Karolina Stolarek
2025-01-18 1:58 ` Jon Pan-Doh
2025-01-25 7:39 ` Lukas Wunner
2025-01-31 14:55 ` Jonathan Cameron
2 siblings, 1 reply; 51+ messages in thread
From: Karolina Stolarek @ 2025-01-16 12:02 UTC (permalink / raw)
To: Jon Pan-Doh, Bjorn Helgaas
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck
On 15/01/2025 08:42, Jon Pan-Doh wrote:
> (...)
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 025c50b0f293..1db70ae87f52 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -676,6 +682,39 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
> }
> }
>
> +static void mask_reported_error(struct pci_dev *dev, struct aer_err_info *info)
> +{
> + const char **strings;
> + const char *errmsg;
> + u16 aer_offset = dev->aer_cap;
> + u16 mask_reg_offset;
> + u32 mask;
> + unsigned long status = info->status;
> + int i;
> +
> + if (info->severity == AER_CORRECTABLE) {
> + strings = aer_correctable_error_string;
> + mask_reg_offset = PCI_ERR_COR_MASK;
> + } else {
> + strings = aer_uncorrectable_error_string;
> + mask_reg_offset = PCI_ERR_UNCOR_MASK;
> + }
> +
> + pci_read_config_dword(dev, aer_offset + mask_reg_offset, &mask);
> + mask |= status;
> + pci_write_config_dword(dev, aer_offset + mask_reg_offset, mask);
> +
> + pci_warn(dev, "%s error(s) masked due to rate-limiting:",
> + aer_error_severity_string[info->severity]);
> + for_each_set_bit(i, &status, 32) {
> + errmsg = strings[i];
> + if (!errmsg)
> + errmsg = "Unknown Error Bit";
> +
> + pci_warn(dev, " [%2d] %-22s\n", i, errmsg);
> + }
> +}
> +
My approach was more heavy-handed :)
To confirm that I understand the flow -- when we're processing
aer_err_info, that potentially carries a couple of errors, and we hit a
ratelimit, we mask the error bits in Error Status Register and print
a warning. After doing so, we won't see these types of errors reported
again. What if some time passes (let's say, 2 mins), and we hit a
condition that would normally generate an error but it's now masked? Are
we fine with missing it? I think we should be informed about
Uncorrectable errors as much as possible, as they indicate Link
integrity issues.
> - if (!__ratelimit(ratelimit))
> + if (!__ratelimit(irq_ratelimit)) {
> + mask_reported_error(dev, info);
> + return;
> + }
> + if (!__ratelimit(log_ratelimit))
> return;
Nit: add a line between these two ifs
All the best,
Karolina
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/8] PCI/AER: Remove aer_print_port_info
2025-01-15 7:42 ` [PATCH 1/8] PCI/AER: Remove aer_print_port_info Jon Pan-Doh
@ 2025-01-16 14:27 ` Karolina Stolarek
2025-01-18 1:57 ` Jon Pan-Doh
2025-01-21 14:18 ` Ilpo Järvinen
2025-01-25 4:15 ` Sathyanarayanan Kuppuswamy
2 siblings, 1 reply; 51+ messages in thread
From: Karolina Stolarek @ 2025-01-16 14:27 UTC (permalink / raw)
To: Jon Pan-Doh, Bjorn Helgaas
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck
On 15/01/2025 08:42, Jon Pan-Doh wrote:
> Info logged is duplicated when either the source device is processed. If
> no source device is found, than an error is logged.
Nit: s/than/then/
>
> Code flow:
> aer_isr_one_error()
> -> aer_print_port_info()
> -> find_source_device()
> -> return/pci_info() if no device found else continue
> -> aer_process_err_devices()
> -> aer_print_error()
>
> aer_print_port_info():
> [ 21.596150] pcieport 0000:00:04.0: Correctable error message received
> from 0000:01:00.0
I agree that the bus, device and function info is repeated later, but
isn't this line also about the fact we deal with one or multiple errors
in a message? The question is how valuable this information, in itself, is.
All the best,
Karolina
>
> aer_print_error():
> [ 21.596163] e1000e 0000:01:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
> [ 21.600575] e1000e 0000:01:00.0: device [8086:10d3] error status/mask=00000040/0000e000
> [ 21.604707] e1000e 0000:01:00.0: [ 6] BadTLP
>
> Tested using aer-inject[1] tool. No more root port log on dmesg.
>
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 2/8] PCI/AER: Move AER stat collection out of __aer_print_error
2025-01-15 7:42 ` [PATCH 2/8] PCI/AER: Move AER stat collection out of __aer_print_error Jon Pan-Doh
@ 2025-01-16 14:47 ` Karolina Stolarek
2025-01-18 1:57 ` Jon Pan-Doh
2025-01-25 4:37 ` Sathyanarayanan Kuppuswamy
1 sibling, 1 reply; 51+ messages in thread
From: Karolina Stolarek @ 2025-01-16 14:47 UTC (permalink / raw)
To: Jon Pan-Doh, Bjorn Helgaas
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck
On 15/01/2025 08:42, Jon Pan-Doh wrote:
> Decouple stat collection from internal AER print functions. AERs from ghes
> or cxl drivers have stat collection in pci_print_aer as that is where
> aer_err_info is populated.
>
> Tested using aer-inject[1] tool. AER sysfs counters still updated
> correctly.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> ---
> drivers/pci/pcie/aer.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index ba40800b5494..4bb0b3840402 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -695,7 +695,6 @@ static void __aer_print_error(struct pci_dev *dev,
> pci_printk(level, dev, " [%2d] %-22s%s\n", i, errmsg,
> info->first_error == i ? " (First)" : "");
> }
> - pci_dev_aer_stats_incr(dev, info);
> }
With this change, we stop calling pci_dev_aer_stats_incr() in
dpc_process_error(). Is this intended?
All the best,
Karolina
>
> void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> @@ -775,6 +774,8 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
> info.mask = mask;
> info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
>
> + pci_dev_aer_stats_incr(dev, &info);
> +
> pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
> __aer_print_error(dev, &info);
> pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
> @@ -1249,8 +1250,10 @@ static inline void aer_process_err_devices(struct aer_err_info *e_info)
>
> /* Report all before handle them, not to lost records by reset etc. */
> for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
> - if (aer_get_device_error_info(e_info->dev[i], e_info))
> + if (aer_get_device_error_info(e_info->dev[i], e_info)) {
> + pci_dev_aer_stats_incr(e_info->dev[i], e_info);
> aer_print_error(e_info->dev[i], e_info);
> + }
> }
> for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
> if (aer_get_device_error_info(e_info->dev[i], e_info))
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 8/8] PCI/AER: Move AER sysfs attributes into separate directory
2025-01-16 10:26 ` Karolina Stolarek
@ 2025-01-16 17:18 ` Rajat Jain
2025-01-31 14:36 ` Jonathan Cameron
0 siblings, 1 reply; 51+ messages in thread
From: Rajat Jain @ 2025-01-16 17:18 UTC (permalink / raw)
To: Karolina Stolarek
Cc: Jon Pan-Doh, Bjorn Helgaas, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck
Hello,
On Thu, Jan 16, 2025 at 2:26 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
>
> On 15/01/2025 08:43, Jon Pan-Doh wrote:
> > Prepare for the addition of new AER sysfs attributes (e.g. ratelimits)
> > by moving them into their own directory. Update naming to reflect
> > broader definition and for consistency.
> >
> > /sys/bus/pci/devices/<dev>/aer_dev_correctable
> > /sys/bus/pci/devices/<dev>/aer_dev_fatal
> > /sys/bus/pci/devices/<dev>/aer_dev_nonfatal
> > /sys/bus/pci/devices/<dev>/aer_rootport_total_err_cor
> > /sys/bus/pci/devices/<dev>/aer_rootport_total_err_fatal
> > /sys/bus/pci/devices/<dev>/aer_rootport_total_err_nonfatal
> > ->
> > /sys/bus/pci/devices/<dev>/aer/err_cor
> > /sys/bus/pci/devices/<dev>/aer/err_fatal
> > /sys/bus/pci/devices/<dev>/aer/err_nonfatal
> > /sys/bus/pci/devices/<dev>/aer/rootport_total_err_cor
> > /sys/bus/pci/devices/<dev>/aer/rootport_total_err_fatal
> > /sys/bus/pci/devices/<dev>/aer/rootport_total_err_nonfatal
> >
> > Tested using aer-inject[1] tool. Sent 1 AER error. Observed AER stats
> > correctedly logged (cat /sys/bus/pci/devices/<dev>/aer/dev_err_cor).
>
> I'm not a sysfs expert but my understanding is that we shouldn't do
> major changes in the existing hierarchies.
>
> On one hand, I think it would be nice to extract out AER-specific info
> and knobs into a subdirectory (e.g., using attribute_group with name
> "aer"), but on the other this would be disruptive to the userspace. I
> can imagine that there are tools that watch these values that would
> break after this change.
Thank you. This is the right guidance.
As the original author to introduce these attributes, I just wanted to
chime in from the ChromeOS team's perspective (who originally
introduced these attributes). I can say that we have used these
attributes for debugging mostly manually, and do not have tools yet
with hardcoded hierarchy / paths. So we wouldn't be opposed to it, if
changes to the hierarchy have wider acceptance and it seems better in
general.
Thanks & Best Regards,
Rajat
>
> All the best,
> Karolina
>
> >
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
> >
> > Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> > ---
> > .../ABI/testing/sysfs-bus-pci-devices-aer | 18 +++---
> > drivers/pci/pci-sysfs.c | 1 -
> > drivers/pci/pci.h | 1 -
> > drivers/pci/pcie/aer.c | 64 +++++++------------
> > 4 files changed, 32 insertions(+), 52 deletions(-)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> > index c680a53af0f4..e1472583207b 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> > +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> > @@ -9,7 +9,7 @@ errors may be "seen" / reported by the link partner and not the
> > problematic endpoint itself (which may report all counters as 0 as it never
> > saw any problems).
> >
> > -What: /sys/bus/pci/devices/<dev>/aer_dev_correctable
> > +What: /sys/bus/pci/devices/<dev>/aer/err_cor
> > Date: July 2018
> > KernelVersion: 4.19.0
> > Contact: linux-pci@vger.kernel.org, rajatja@google.com
> > @@ -19,7 +19,7 @@ Description: List of correctable errors seen and reported by this
> > TOTAL_ERR_COR at the end of the file may not match the actual
> > total of all the errors in the file. Sample output::
> >
> > - localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_correctable
> > + localhost /sys/devices/pci0000:00/0000:00:1c.0/aer # cat err_cor
> > Receiver Error 2
> > Bad TLP 0
> > Bad DLLP 0
> > @@ -30,7 +30,7 @@ Description: List of correctable errors seen and reported by this
> > Header Log Overflow 0
> > TOTAL_ERR_COR 2
> >
> > -What: /sys/bus/pci/devices/<dev>/aer_dev_fatal
> > +What: /sys/bus/pci/devices/<dev>/aer/err_fatal
> > Date: July 2018
> > KernelVersion: 4.19.0
> > Contact: linux-pci@vger.kernel.org, rajatja@google.com
> > @@ -40,7 +40,7 @@ Description: List of uncorrectable fatal errors seen and reported by this
> > TOTAL_ERR_FATAL at the end of the file may not match the actual
> > total of all the errors in the file. Sample output::
> >
> > - localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_fatal
> > + localhost /sys/devices/pci0000:00/0000:00:1c.0/aer # cat err_fatal
> > Undefined 0
> > Data Link Protocol 0
> > Surprise Down Error 0
> > @@ -60,7 +60,7 @@ Description: List of uncorrectable fatal errors seen and reported by this
> > TLP Prefix Blocked Error 0
> > TOTAL_ERR_FATAL 0
> >
> > -What: /sys/bus/pci/devices/<dev>/aer_dev_nonfatal
> > +What: /sys/bus/pci/devices/<dev>/aer/err_nonfatal
> > Date: July 2018
> > KernelVersion: 4.19.0
> > Contact: linux-pci@vger.kernel.org, rajatja@google.com
> > @@ -70,7 +70,7 @@ Description: List of uncorrectable nonfatal errors seen and reported by this
> > TOTAL_ERR_NONFATAL at the end of the file may not match the
> > actual total of all the errors in the file. Sample output::
> >
> > - localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_nonfatal
> > + localhost /sys/devices/pci0000:00/0000:00:1c.0/aer # cat err_nonfatal
> > Undefined 0
> > Data Link Protocol 0
> > Surprise Down Error 0
> > @@ -100,19 +100,19 @@ collectors) that are AER capable. These indicate the number of error messages as
> > device, so these counters include them and are thus cumulative of all the error
> > messages on the PCI hierarchy originating at that root port.
> >
> > -What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_cor
> > +What: /sys/bus/pci/devices/<dev>/aer/rootport_total_err_cor
> > Date: July 2018
> > KernelVersion: 4.19.0
> > Contact: linux-pci@vger.kernel.org, rajatja@google.com
> > Description: Total number of ERR_COR messages reported to rootport.
> >
> > -What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_fatal
> > +What: /sys/bus/pci/devices/<dev>/aer/rootport_total_err_fatal
> > Date: July 2018
> > KernelVersion: 4.19.0
> > Contact: linux-pci@vger.kernel.org, rajatja@google.com
> > Description: Total number of ERR_FATAL messages reported to rootport.
> >
> > -What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_nonfatal
> > +What: /sys/bus/pci/devices/<dev>/aer/rootport_total_err_nonfatal
> > Date: July 2018
> > KernelVersion: 4.19.0
> > Contact: linux-pci@vger.kernel.org, rajatja@google.com
> > diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> > index 41acb6713e2d..e16b92edf3bd 100644
> > --- a/drivers/pci/pci-sysfs.c
> > +++ b/drivers/pci/pci-sysfs.c
> > @@ -1692,7 +1692,6 @@ const struct attribute_group *pci_dev_attr_groups[] = {
> > &pci_bridge_attr_group,
> > &pcie_dev_attr_group,
> > #ifdef CONFIG_PCIEAER
> > - &aer_stats_attr_group,
> > &aer_attr_group,
> > #endif
> > #ifdef CONFIG_PCIEASPM
> > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > index 9d0272a890ef..a80cfc08f634 100644
> > --- a/drivers/pci/pci.h
> > +++ b/drivers/pci/pci.h
> > @@ -880,7 +880,6 @@ static inline void of_pci_remove_node(struct pci_dev *pdev) { }
> > void pci_no_aer(void);
> > void pci_aer_init(struct pci_dev *dev);
> > void pci_aer_exit(struct pci_dev *dev);
> > -extern const struct attribute_group aer_stats_attr_group;
> > extern const struct attribute_group aer_attr_group;
> > void pci_aer_clear_fatal_status(struct pci_dev *dev);
> > int pci_aer_clear_status(struct pci_dev *dev);
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index e48e2951baae..68850525cc8d 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -569,13 +569,13 @@ static const char *aer_agent_string[] = {
> > } \
> > static DEVICE_ATTR_RO(name)
> >
> > -aer_stats_dev_attr(aer_dev_correctable, dev_cor_errs,
> > +aer_stats_dev_attr(err_cor, dev_cor_errs,
> > aer_correctable_error_string, "ERR_COR",
> > dev_total_cor_errs);
> > -aer_stats_dev_attr(aer_dev_fatal, dev_fatal_errs,
> > +aer_stats_dev_attr(err_fatal, dev_fatal_errs,
> > aer_uncorrectable_error_string, "ERR_FATAL",
> > dev_total_fatal_errs);
> > -aer_stats_dev_attr(aer_dev_nonfatal, dev_nonfatal_errs,
> > +aer_stats_dev_attr(err_nonfatal, dev_nonfatal_errs,
> > aer_uncorrectable_error_string, "ERR_NONFATAL",
> > dev_total_nonfatal_errs);
> >
> > @@ -589,47 +589,13 @@ aer_stats_dev_attr(aer_dev_nonfatal, dev_nonfatal_errs,
> > } \
> > static DEVICE_ATTR_RO(name)
> >
> > -aer_stats_rootport_attr(aer_rootport_total_err_cor,
> > +aer_stats_rootport_attr(rootport_total_err_cor,
> > rootport_total_cor_errs);
> > -aer_stats_rootport_attr(aer_rootport_total_err_fatal,
> > +aer_stats_rootport_attr(rootport_total_err_fatal,
> > rootport_total_fatal_errs);
> > -aer_stats_rootport_attr(aer_rootport_total_err_nonfatal,
> > +aer_stats_rootport_attr(rootport_total_err_nonfatal,
> > rootport_total_nonfatal_errs);
> >
> > -static struct attribute *aer_stats_attrs[] __ro_after_init = {
> > - &dev_attr_aer_dev_correctable.attr,
> > - &dev_attr_aer_dev_fatal.attr,
> > - &dev_attr_aer_dev_nonfatal.attr,
> > - &dev_attr_aer_rootport_total_err_cor.attr,
> > - &dev_attr_aer_rootport_total_err_fatal.attr,
> > - &dev_attr_aer_rootport_total_err_nonfatal.attr,
> > - NULL
> > -};
> > -
> > -static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
> > - struct attribute *a, int n)
> > -{
> > - struct device *dev = kobj_to_dev(kobj);
> > - struct pci_dev *pdev = to_pci_dev(dev);
> > -
> > - if (!pdev->aer_info)
> > - return 0;
> > -
> > - if ((a == &dev_attr_aer_rootport_total_err_cor.attr ||
> > - a == &dev_attr_aer_rootport_total_err_fatal.attr ||
> > - a == &dev_attr_aer_rootport_total_err_nonfatal.attr) &&
> > - ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) &&
> > - (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC)))
> > - return 0;
> > -
> > - return a->mode;
> > -}
> > -
> > -const struct attribute_group aer_stats_attr_group = {
> > - .attrs = aer_stats_attrs,
> > - .is_visible = aer_stats_attrs_are_visible,
> > -};
> > -
> > #define aer_ratelimit_attr(name, ratelimit) \
> > static ssize_t \
> > name##_show(struct device *dev, struct device_attribute *attr, \
> > @@ -662,6 +628,14 @@ aer_ratelimit_attr(ratelimit_cor_log, cor_log_ratelimit);
> > aer_ratelimit_attr(ratelimit_uncor_log, uncor_log_ratelimit);
> >
> > static struct attribute *aer_attrs[] __ro_after_init = {
> > + /* Stats */
> > + &dev_attr_err_cor.attr,
> > + &dev_attr_err_fatal.attr,
> > + &dev_attr_err_nonfatal.attr,
> > + &dev_attr_rootport_total_err_cor.attr,
> > + &dev_attr_rootport_total_err_fatal.attr,
> > + &dev_attr_rootport_total_err_nonfatal.attr,
> > + /* Ratelimits */
> > &dev_attr_ratelimit_cor_irq.attr,
> > &dev_attr_ratelimit_uncor_irq.attr,
> > &dev_attr_ratelimit_cor_log.attr,
> > @@ -670,13 +644,21 @@ static struct attribute *aer_attrs[] __ro_after_init = {
> > };
> >
> > static umode_t aer_attrs_are_visible(struct kobject *kobj,
> > - struct attribute *a, int n)
> > + struct attribute *a, int n)
> > {
> > struct device *dev = kobj_to_dev(kobj);
> > struct pci_dev *pdev = to_pci_dev(dev);
> >
> > if (!pdev->aer_info)
> > return 0;
> > +
> > + if ((a == &dev_attr_rootport_total_err_cor.attr ||
> > + a == &dev_attr_rootport_total_err_fatal.attr ||
> > + a == &dev_attr_rootport_total_err_nonfatal.attr) &&
> > + ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) &&
> > + (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC)))
> > + return 0;
> > +
> > return a->mode;
> > }
> >
>
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 2/8] PCI/AER: Move AER stat collection out of __aer_print_error
2025-01-16 14:47 ` Karolina Stolarek
@ 2025-01-18 1:57 ` Jon Pan-Doh
0 siblings, 0 replies; 51+ messages in thread
From: Jon Pan-Doh @ 2025-01-18 1:57 UTC (permalink / raw)
To: Karolina Stolarek
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck
On Thu, Jan 16, 2025 at 6:47 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> On 15/01/2025 08:42, Jon Pan-Doh wrote:
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index ba40800b5494..4bb0b3840402 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -695,7 +695,6 @@ static void __aer_print_error(struct pci_dev *dev,
> > pci_printk(level, dev, " [%2d] %-22s%s\n", i, errmsg,
> > info->first_error == i ? " (First)" : "");
> > }
> > - pci_dev_aer_stats_incr(dev, info);
> > }
>
> With this change, we stop calling pci_dev_aer_stats_incr() in
> dpc_process_error(). Is this intended?
No, this should be fixed.
We can call pci_dev_aer_stats_incr() in dpc_process_error() (similar
to aer_process_error).
Thanks,
Jon
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/8] PCI/AER: Remove aer_print_port_info
2025-01-16 14:27 ` Karolina Stolarek
@ 2025-01-18 1:57 ` Jon Pan-Doh
2025-01-20 9:25 ` Karolina Stolarek
0 siblings, 1 reply; 51+ messages in thread
From: Jon Pan-Doh @ 2025-01-18 1:57 UTC (permalink / raw)
To: Karolina Stolarek
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck
On Thu, Jan 16, 2025 at 6:27 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> On 15/01/2025 08:42, Jon Pan-Doh wrote:
> > aer_print_port_info():
> > [ 21.596150] pcieport 0000:00:04.0: Correctable error message received
> > from 0000:01:00.0
>
> I agree that the bus, device and function info is repeated later, but
> isn't this line also about the fact we deal with one or multiple errors
> in a message? The question is how valuable this information, in itself, is.
I think whether or not there are multiple errors can be gleaned from
__aer_print_error(). It prints out all fields of the status register
(as well as denotes the first error).
Thanks,
Jon
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs
2025-01-16 12:02 ` Karolina Stolarek
@ 2025-01-18 1:58 ` Jon Pan-Doh
2025-01-20 10:38 ` Karolina Stolarek
0 siblings, 1 reply; 51+ messages in thread
From: Jon Pan-Doh @ 2025-01-18 1:58 UTC (permalink / raw)
To: Karolina Stolarek
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck
On Thu, Jan 16, 2025 at 4:02 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> To confirm that I understand the flow -- when we're processing
> aer_err_info, that potentially carries a couple of errors, and we hit a
> ratelimit, we mask the error bits in Error Status Register and print
> a warning. After doing so, we won't see these types of errors reported
> again. What if some time passes (let's say, 2 mins), and we hit a
> condition that would normally generate an error but it's now masked? Are
> we fine with missing it? I think we should be informed about
> Uncorrectable errors as much as possible, as they indicate Link
> integrity issues.
Your understanding is correct. There's definitely more nuance/tradeoff
with uncorrectable errors (likelihood of uncorrectable spam vs.
missing critical errors). At the minimum, I think the uncorrectable
IRQ default should be higher (semi-arbitrarily chose defaults for
IRQs).
I think a dynamic (un)masking in the kernel is a bit too much and
punted the decision to userspace (e.g. rasdaemon et al.) to manage
(part of OCP Fault Management groups roadmap).
Other options include:
- only focus on correctable errors
- seen uncorrectable spam e.g. new HW bringup but it is rarer
- some type of system-wide toggle (sysfs, kernel config/cmdline) for
uncorrectable spam handling (may be clunky)
Thanks,
Jon
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 4/8] PCI/AER: Introduce ratelimit for error logs
2025-01-16 11:11 ` Karolina Stolarek
@ 2025-01-18 1:59 ` Jon Pan-Doh
2025-01-20 10:25 ` Karolina Stolarek
0 siblings, 1 reply; 51+ messages in thread
From: Jon Pan-Doh @ 2025-01-18 1:59 UTC (permalink / raw)
To: Karolina Stolarek
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck
On Thu, Jan 16, 2025 at 3:11 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> On 15/01/2025 08:42, Jon Pan-Doh wrote:
> > @@ -374,6 +379,12 @@ void pci_aer_init(struct pci_dev *dev)
> > + ratelimit_set_flags(&dev->aer_info->cor_log_ratelimit, RATELIMIT_MSG_ON_RELEASE);
> > + ratelimit_set_flags(&dev->aer_info->uncor_log_ratelimit, RATELIMIT_MSG_ON_RELEASE);
>
> In some cases, it would be beneficial to keep the "X callbacks
> suppressed" to give an idea of how prevalent the errors are. On some
> devices, we could see just 11 Correctable Errors per 5 seconds, but on
> others this would be >1k (seen such a case myself).
>
> As it's not immediately clear what the defaults are, could you add a
> comment for this?
It seems like the convention is not to use RATELIMIT_MSG_ON_RELEASE (I
was unfamiliar :)). I'll omit this in the next version. Let me know if
you'd still like the comment in that case.
> > @@ -709,11 +721,20 @@ void aer_print_error(struct pci_dev *dev, struct
> > + if (!__ratelimit(ratelimit))
> > + return;
> > +
>
> Maybe it's just me but I found "!__ratelimit(..)" expression confusing.
> At first, I read this "if there is not ratelimit", whereas the function
> returns 0 ("hey, you can't fire at this point of time") and we negate
> it. My series attempted to communicate this via a variable, but maybe
> that was too wordy/complicated, so I'm not pushy about introducing a
> similar solution here.
I see your point. I'm not particularly attached to it, but the
alternatives seem equally unappealing :/:
- put rest of function inside conditional: if
(__ratelimit(ratelimit))) { log errors }
- another print helper?
- separate variable (as you suggested
FYI the majority of existing usage seems to be split between
__ratelimit() and !__ratelimit() (though majority doesn't always
indicate the right thing). The only instance I see of a variable is in
drivers/iommu/intel/dmar.c:dmar_fault where the variable is used in
repeated conditionals.
Thanks,
Jon
On Thu, Jan 16, 2025 at 3:11 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
>
> On 15/01/2025 08:42, Jon Pan-Doh wrote:
> > Spammy devices can flood kernel logs with AER errors and slow/stall
> > execution. Add per-device ratelimits for AER errors (correctable and
> > uncorrectable). Set the default rate to the default kernel ratelimit
> > (10 per 5s).
> >
> > Tested using aer-inject[1] tool. Sent 11 AER errors. Observed 10 errors
> > logged while AER stats (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable)
> > show true count of 11.
> >
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
> >
> > Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> > ---
> > Documentation/PCI/pcieaer-howto.rst | 6 ++++++
> > drivers/pci/pcie/aer.c | 31 +++++++++++++++++++++++++++--
> > 2 files changed, 35 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
> > index f013f3b27c82..5546de60f184 100644
> > --- a/Documentation/PCI/pcieaer-howto.rst
> > +++ b/Documentation/PCI/pcieaer-howto.rst
> > @@ -85,6 +85,12 @@ In the example, 'Requester ID' means the ID of the device that sent
> > the error message to the Root Port. Please refer to PCIe specs for other
> > fields.
> >
> > +AER Ratelimits
> > +-------------------------
> > +
> > +Error messages are ratelimited per device and error type. This prevents spammy
> > +devices from flooding the console.
> > +
>
> Nit: I would create a separate commit for the documentation updates.
> Also, I think it would be good to mention the default interval and,
> maybe, explain why such rate-limiting was introduced in the first place.
> If you don't feel like writing about it, let me know and we can work it
> out together.
>
> > AER Statistics / Counters
> > -------------------------
> >
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index 5ab5cd7368bc..025c50b0f293 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -27,6 +27,7 @@
> > #include <linux/interrupt.h>
> > #include <linux/delay.h>
> > #include <linux/kfifo.h>
> > +#include <linux/ratelimit.h>
> > #include <linux/slab.h>
> > #include <acpi/apei.h>
> > #include <acpi/ghes.h>
> > @@ -84,6 +85,10 @@ struct aer_info {
> > u64 rootport_total_cor_errs;
> > u64 rootport_total_fatal_errs;
> > u64 rootport_total_nonfatal_errs;
> > +
> > + /* Ratelimits for errors */
> > + struct ratelimit_state cor_log_ratelimit;
> > + struct ratelimit_state uncor_log_ratelimit;
>
> My comment for 3/8 applies here as well.
>
> > };
> >
> > #define AER_LOG_TLP_MASKS (PCI_ERR_UNC_POISON_TLP| \
> > @@ -374,6 +379,12 @@ void pci_aer_init(struct pci_dev *dev)
> > return;
> >
> > dev->aer_info = kzalloc(sizeof(struct aer_info), GFP_KERNEL);
> > + ratelimit_state_init(&dev->aer_info->cor_log_ratelimit,
> > + DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
> > + ratelimit_state_init(&dev->aer_info->uncor_log_ratelimit,
> > + DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
> > + ratelimit_set_flags(&dev->aer_info->cor_log_ratelimit, RATELIMIT_MSG_ON_RELEASE);
> > + ratelimit_set_flags(&dev->aer_info->uncor_log_ratelimit, RATELIMIT_MSG_ON_RELEASE);
>
> In some cases, it would be beneficial to keep the "X callbacks
> suppressed" to give an idea of how prevalent the errors are. On some
> devices, we could see just 11 Correctable Errors per 5 seconds, but on
> others this would be >1k (seen such a case myself).
>
> As it's not immediately clear what the defaults are, could you add a
> comment for this?
>
> >
> > /*
> > * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
> > @@ -702,6 +713,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> > int layer, agent;
> > int id = pci_dev_id(dev);
> > const char *level;
> > + struct ratelimit_state *ratelimit;
> >
> > if (!info->status) {
> > pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> > @@ -709,11 +721,20 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> > goto out;
> > }
> >
> > + if (info->severity == AER_CORRECTABLE) {
> > + ratelimit = &dev->aer_info->cor_log_ratelimit;
> > + level = KERN_WARNING;
> > + } else {
> > + ratelimit = &dev->aer_info->uncor_log_ratelimit;
> > + level = KERN_ERR;
> > + }
> > +
> > + if (!__ratelimit(ratelimit))
> > + return;
> > +
>
> Maybe it's just me but I found "!__ratelimit(..)" expression confusing.
> At first, I read this "if there is not ratelimit", whereas the function
> returns 0 ("hey, you can't fire at this point of time") and we negate
> it. My series attempted to communicate this via a variable, but maybe
> that was too wordy/complicated, so I'm not pushy about introducing a
> similar solution here.
>
> All the best,
> Karolina
>
> > layer = AER_GET_LAYER_ERROR(info->severity, info->status);
> > agent = AER_GET_AGENT(info->severity, info->status);
> >
> > - level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
> > -
> > pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> > aer_error_severity_string[info->severity],
> > aer_error_layer[layer], aer_agent_string[agent]);
> > @@ -755,11 +776,14 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
> > int layer, agent, tlp_header_valid = 0;
> > u32 status, mask;
> > struct aer_err_info info;
> > + struct ratelimit_state *ratelimit;
> >
> > if (aer_severity == AER_CORRECTABLE) {
> > + ratelimit = &dev->aer_info->cor_log_ratelimit;
> > status = aer->cor_status;
> > mask = aer->cor_mask;
> > } else {
> > + ratelimit = &dev->aer_info->uncor_log_ratelimit;
> > status = aer->uncor_status;
> > mask = aer->uncor_mask;
> > tlp_header_valid = status & AER_LOG_TLP_MASKS;
> > @@ -776,6 +800,9 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
> >
> > pci_dev_aer_stats_incr(dev, &info);
> >
> > + if (!__ratelimit(ratelimit))
> > + return;
> > +
> > pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
> > __aer_print_error(dev, &info);
> > pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/8] PCI/AER: Rename struct aer_stats to aer_info
2025-01-16 10:11 ` Karolina Stolarek
@ 2025-01-18 1:59 ` Jon Pan-Doh
2025-01-20 10:13 ` Karolina Stolarek
0 siblings, 1 reply; 51+ messages in thread
From: Jon Pan-Doh @ 2025-01-18 1:59 UTC (permalink / raw)
To: Karolina Stolarek, Bjorn Helgaas
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck
On Thu, Jan 16, 2025 at 2:12 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> On 15/01/2025 08:42, Jon Pan-Doh wrote:
> > Update name to reflect the broader definition of structs/variables that
> > are stored (e.g. ratelimits).
>
> I understand the intention behind this change, but I'm not fully
> convinced if we should mix AER error attributes with the tools to
> control error reporting/generation (ratelimits). I'd argue that
> "aer_info" name still doesn't express what the structure does.
>
> aer_stats sits in pci_dev, so I can see why you decided to use it; it's
> one of the few available places where we could keep a stateful ratelimit.
>
> How about creating a struct to keep all the ratelimits in one place and
> embedding that in the pci_dev?
In a previous version, I had a separate struct aer_ratelimits in
pci_dev (similar to your patch[1]):
@@ -346,7 +346,7 @@ struct pci_dev {
u8 hdr_type; /* PCI header type (`multi'
flag masked out) */
#ifdef CONFIG_PCIEAER
u16 aer_cap; /* AER capability offset */
struct aer_stats *aer_stats; /* AER stats for this device */
+ struct aer_ratelimits *aer_ratelimits; /* AER ratelimits
for this device */
However, in an internal review, Bjorn said:
> I don't think we need to burn two pointers here since we always populate both at the
> same time. Perhaps combine the stats and ratelimits into a single "struct aer_info" or
> similar.
It seems like the preference is to minimize additional memory in
pci_dev. Any suggestions for a more appropriate name than "aer_info"?
Otherwise, is this a good enough justification to maintain two pointers Bjorn?
Thanks,
Jon
[1] https://lore.kernel.org/linux-pci/68ef082c855b4e1d094dcfc9a861f43488b64922.1736341506.git.karolina.stolarek@oracle.com/#Z31include:linux:pci.h
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/8] PCI/AER: Remove aer_print_port_info
2025-01-18 1:57 ` Jon Pan-Doh
@ 2025-01-20 9:25 ` Karolina Stolarek
2025-02-12 23:20 ` Jon Pan-Doh
0 siblings, 1 reply; 51+ messages in thread
From: Karolina Stolarek @ 2025-01-20 9:25 UTC (permalink / raw)
To: Jon Pan-Doh
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck
On 18/01/2025 02:57, Jon Pan-Doh wrote:
> On Thu, Jan 16, 2025 at 6:27 AM Karolina Stolarek
> <karolina.stolarek@oracle.com> wrote:
>>
>> I agree that the bus, device and function info is repeated later, but
>> isn't this line also about the fact we deal with one or multiple errors
>> in a message? The question is how valuable this information, in itself, is.
>
> I think whether or not there are multiple errors can be gleaned from
> __aer_print_error(). It prints out all fields of the status register
> (as well as denotes the first error).
That's true, so I think it's reasonable to remove it.
For the code changes:
Reviewed-by: Karolina Stolarek <karolina.stolarek@oracle.com>
As for the commit message, I'd drop the before-after dmesg messages and
explain that the information presented by aer_print_port_info() can be
easily inferred from the actual error messages. I had to read the code
to remind myself what information is duplicated.
All the best,
Karolina
>
> Thanks,
> Jon
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/8] PCI/AER: Rename struct aer_stats to aer_info
2025-01-18 1:59 ` Jon Pan-Doh
@ 2025-01-20 10:13 ` Karolina Stolarek
2025-02-12 23:20 ` Jon Pan-Doh
0 siblings, 1 reply; 51+ messages in thread
From: Karolina Stolarek @ 2025-01-20 10:13 UTC (permalink / raw)
To: Jon Pan-Doh, Bjorn Helgaas
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck
On 18/01/2025 02:59, Jon Pan-Doh wrote:
>
> In a previous version, I had a separate struct aer_ratelimits in
> pci_dev (similar to your patch[1]):
>
> @@ -346,7 +346,7 @@ struct pci_dev {
> u8 hdr_type; /* PCI header type (`multi'
> flag masked out) */
> #ifdef CONFIG_PCIEAER
> u16 aer_cap; /* AER capability offset */
> struct aer_stats *aer_stats; /* AER stats for this device */
> + struct aer_ratelimits *aer_ratelimits; /* AER ratelimits
> for this device */
>
> However, in an internal review, Bjorn said:
>> I don't think we need to burn two pointers here since we always populate both at the
>> same time. Perhaps combine the stats and ratelimits into a single "struct aer_info" or
>> similar.
I see, thanks for sharing it here.
My thinking is that it's preferable to have small and specialized
structs, with one used for collecting/showing error info and the other
to control the rates at which they are reported and generated. I'd
propose a similar change to the one above.
> It seems like the preference is to minimize additional memory in
> pci_dev. Any suggestions for a more appropriate name than "aer_info"?
Like I said, my preference would be to leave aer_stats as it is, but if
we have to stick to one struct, I'd go with "aer_control" or "aer_report".
All the best,
Karolina
>
> Otherwise, is this a good enough justification to maintain two pointers Bjorn?
>
> Thanks,
> Jon
>
> [1] https://lore.kernel.org/linux-pci/68ef082c855b4e1d094dcfc9a861f43488b64922.1736341506.git.karolina.stolarek@oracle.com/#Z31include:linux:pci.h
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 4/8] PCI/AER: Introduce ratelimit for error logs
2025-01-18 1:59 ` Jon Pan-Doh
@ 2025-01-20 10:25 ` Karolina Stolarek
0 siblings, 0 replies; 51+ messages in thread
From: Karolina Stolarek @ 2025-01-20 10:25 UTC (permalink / raw)
To: Jon Pan-Doh
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck
On 18/01/2025 02:59, Jon Pan-Doh wrote:
> On Thu, Jan 16, 2025 at 3:11 AM Karolina Stolarek
> <karolina.stolarek@oracle.com> wrote:
>> In some cases, it would be beneficial to keep the "X callbacks
>> suppressed" to give an idea of how prevalent the errors are. On some
>> devices, we could see just 11 Correctable Errors per 5 seconds, but on
>> others this would be >1k (seen such a case myself).
>>
>> As it's not immediately clear what the defaults are, could you add a
>> comment for this?
>
> It seems like the convention is not to use RATELIMIT_MSG_ON_RELEASE (I
> was unfamiliar :)). I'll omit this in the next version. Let me know if
> you'd still like the comment in that case.
Interesting, I wasn't aware of that. There are still a couple of places
where this flag is used, but maybe it's legacy code.
I'm not too passionate about the comment, it was just a suggestion (I
tend to be verbose in my code), feel free to omit it.
> FYI the majority of existing usage seems to be split between
> __ratelimit() and !__ratelimit() (though majority doesn't always
> indicate the right thing). The only instance I see of a variable is in
> drivers/iommu/intel/dmar.c:dmar_fault where the variable is used in
> repeated conditionals.
Yeah, it's a problem that's hard to solve at this level.
I thought that introducing a variable to make it easier to read is
acceptable, but it's hard to defend this approach when the value is
checked only once. Let's keep things simple and leave it as it is.
All the best,
Karolina
>
> Thanks,
> Jon
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs
2025-01-18 1:58 ` Jon Pan-Doh
@ 2025-01-20 10:38 ` Karolina Stolarek
0 siblings, 0 replies; 51+ messages in thread
From: Karolina Stolarek @ 2025-01-20 10:38 UTC (permalink / raw)
To: Jon Pan-Doh
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck
On 18/01/2025 02:58, Jon Pan-Doh wrote:
> On Thu, Jan 16, 2025 at 4:02 AM Karolina Stolarek
> <karolina.stolarek@oracle.com> wrote:
>> To confirm that I understand the flow -- when we're processing
>> aer_err_info, that potentially carries a couple of errors, and we hit a
>> ratelimit, we mask the error bits in Error Status Register and print
>> a warning. After doing so, we won't see these types of errors reported
>> again. What if some time passes (let's say, 2 mins), and we hit a
>> condition that would normally generate an error but it's now masked? Are
>> we fine with missing it? I think we should be informed about
>> Uncorrectable errors as much as possible, as they indicate Link
>> integrity issues.
>
> Your understanding is correct. There's definitely more nuance/tradeoff
> with uncorrectable errors (likelihood of uncorrectable spam vs.
> missing critical errors). At the minimum, I think the uncorrectable
> IRQ default should be higher (semi-arbitrarily chose defaults for
> IRQs).
My comment was mostly me worrying about Uncorrectable errors. Pulling
the plug on Correctable errors after we see too many is reasonable, I think.
> I think a dynamic (un)masking in the kernel is a bit too much and
> punted the decision to userspace (e.g. rasdaemon et al.) to manage
> (part of OCP Fault Management groups roadmap).
I agree. Still, if we decide to go with IRQ masking for (Un)correctable
errors, this should be communicated to the user in the documentation.
All the best,
Karolina
>
> Other options include:
> - only focus on correctable errors
> - seen uncorrectable spam e.g. new HW bringup but it is rarer
> - some type of system-wide toggle (sysfs, kernel config/cmdline) for
> uncorrectable spam handling (may be clunky)
>
> Thanks,
> Jon
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/8] PCI/AER: Remove aer_print_port_info
2025-01-15 7:42 ` [PATCH 1/8] PCI/AER: Remove aer_print_port_info Jon Pan-Doh
2025-01-16 14:27 ` Karolina Stolarek
@ 2025-01-21 14:18 ` Ilpo Järvinen
2025-02-12 23:20 ` Jon Pan-Doh
2025-01-25 4:15 ` Sathyanarayanan Kuppuswamy
2 siblings, 1 reply; 51+ messages in thread
From: Ilpo Järvinen @ 2025-01-21 14:18 UTC (permalink / raw)
To: Jon Pan-Doh
Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck
On Tue, 14 Jan 2025, Jon Pan-Doh wrote:
> Info logged is duplicated when either the source device is processed. If
Is the first sentence lacking something, as "either" feels to require
another option/alternative that is not present in the sentence? (I'm
non-native myself)
> no source device is found, than an error is logged.
>
> Code flow:
> aer_isr_one_error()
> -> aer_print_port_info()
> -> find_source_device()
> -> return/pci_info() if no device found else continue
> -> aer_process_err_devices()
> -> aer_print_error()
>
> aer_print_port_info():
> [ 21.596150] pcieport 0000:00:04.0: Correctable error message received
> from 0000:01:00.0
>
> aer_print_error():
> [ 21.596163] e1000e 0000:01:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
> [ 21.600575] e1000e 0000:01:00.0: device [8086:10d3] error status/mask=00000040/0000e000
> [ 21.604707] e1000e 0000:01:00.0: [ 6] BadTLP
>
> Tested using aer-inject[1] tool. No more root port log on dmesg.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> ---
> drivers/pci/pcie/aer.c | 15 ---------------
> 1 file changed, 15 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 34ce9f834d0c..ba40800b5494 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -735,18 +735,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> info->severity, info->tlp_header_valid, &info->tlp);
> }
>
> -static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
> -{
> - u8 bus = info->id >> 8;
> - u8 devfn = info->id & 0xff;
> -
> - pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d\n",
> - info->multi_error_valid ? "Multiple " : "",
> - aer_error_severity_string[info->severity],
> - pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
> - PCI_FUNC(devfn));
> -}
> -
> #ifdef CONFIG_ACPI_APEI_PCIEAER
> int cper_severity_to_aer(int cper_severity)
> {
> @@ -1295,7 +1283,6 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
> e_info.multi_error_valid = 1;
> else
> e_info.multi_error_valid = 0;
> - aer_print_port_info(pdev, &e_info);
>
> if (find_source_device(pdev, &e_info))
> aer_process_err_devices(&e_info);
> @@ -1314,8 +1301,6 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
> else
> e_info.multi_error_valid = 0;
>
> - aer_print_port_info(pdev, &e_info);
> -
> if (find_source_device(pdev, &e_info))
> aer_process_err_devices(&e_info);
> }
>
--
i.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/8] Rate limit AER logs/IRQs
2025-01-15 7:42 [PATCH 0/8] Rate limit AER logs/IRQs Jon Pan-Doh
` (7 preceding siblings ...)
2025-01-15 7:43 ` [PATCH 8/8] PCI/AER: Move AER sysfs attributes into separate directory Jon Pan-Doh
@ 2025-01-23 15:18 ` Bowman, Terry
2025-01-24 6:46 ` Jon Pan-Doh
2025-02-06 13:32 ` Karolina Stolarek
9 siblings, 1 reply; 51+ messages in thread
From: Bowman, Terry @ 2025-01-23 15:18 UTC (permalink / raw)
To: Jon Pan-Doh, Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, PradeepVineshReddy.Kodamati, Lukas Wunner
Hi Jon,
Can you share the base commit used here? I would like to try the patchset.
Regards,
Terry
On 1/15/2025 1:42 AM, Jon Pan-Doh wrote:
> Proposal
> ========
>
> When using native AER, spammy devices can flood kernel logs with AER errors
> and slow/stall execution. Add per-device per-error-severity ratelimits
> for more robust error logging. Allow userspace to configure ratelimits
> via sysfs knobs.
>
> Motivation
> ==========
>
> Several OCP members have issues with inconsistent PCIe error handling,
> exacerbated at datacenter scale (myriad of devices).
> OCP HW/Fault Management subproject set out to solve this by
> standardizing industry:
>
> - PCIe error handling best practices
> - Fault Management/RAS (incl. PCIe errors)
>
> Exposing PCIe errors/debug info in-band for a userspace daemon (e.g.
> rasdaemon) to collect/pass on to repairability services is part of the
> roadmap.
>
> Background
> ==========
>
> AER error spam has been observed many times, both publicly (e.g. [1], [2],
> [3]) and privately. While it usually occurs with correctable errors, it can
> happen with uncorrectable errors (e.g. during new HW bringup).
>
> There have been previous attempts to add ratelimits to AER logs ([4],
> [5]). The most recent attempt[5] has many similarities with the proposed
> approach.
>
> Patch organization
> ==================
> 1-3 AER logging cleanup
> 4-7 Ratelimits and sysfs knobs
> 8 Sysfs cleanup (RFC that breaks existing ABI/can be dropped)
>
> Outstanding work
> ================
> Cleanup:
> - Consolidate aer_print_error() and pci_print_error() path
> - Elevate log level logic out of print functions[6]
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=215027
> [2] https://bugzilla.kernel.org/show_bug.cgi?id=201517
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=196183
> [4] https://lore.kernel.org/linux-pci/20230606035442.2886343-2-grundler@chromium.org/
> [5] https://lore.kernel.org/linux-pci/cover.1736341506.git.karolina.stolarek@oracle.com/
> [6] https://lore.kernel.org/linux-pci/edd77011aafad4c0654358a26b4e538d0c5a321d.1736341506.git.karolina.stolarek@oracle.com/
>
> Jon Pan-Doh (8):
> PCI/AER: Remove aer_print_port_info
> PCI/AER: Move AER stat collection out of __aer_print_error
> PCI/AER: Rename struct aer_stats to aer_info
> PCI/AER: Introduce ratelimit for error logs
> PCI/AER: Introduce ratelimit for AER IRQs
> PCI/AER: Add AER sysfs attributes for ratelimits
> PCI/AER: Update AER sysfs ABI filename
> PCI/AER: Move AER sysfs attributes into separate directory
>
> ...es-aer_stats => sysfs-bus-pci-devices-aer} | 50 +++-
> Documentation/PCI/pcieaer-howto.rst | 10 +-
> drivers/pci/pci-sysfs.c | 2 +-
> drivers/pci/pci.h | 2 +-
> drivers/pci/pcie/aer.c | 227 +++++++++++++-----
> include/linux/pci.h | 2 +-
> 6 files changed, 216 insertions(+), 77 deletions(-)
> rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (69%)
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/8] Rate limit AER logs/IRQs
2025-01-23 15:18 ` [PATCH 0/8] Rate limit AER logs/IRQs Bowman, Terry
@ 2025-01-24 6:46 ` Jon Pan-Doh
2025-01-25 7:59 ` Lukas Wunner
0 siblings, 1 reply; 51+ messages in thread
From: Jon Pan-Doh @ 2025-01-24 6:46 UTC (permalink / raw)
To: Bowman, Terry
Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
PradeepVineshReddy.Kodamati, Lukas Wunner
On Thu, Jan 23, 2025 at 7:18 AM Bowman, Terry <terry.bowman@amd.com> wrote:
> Can you share the base commit used here? I would like to try the patchset.
Sure, it's 7f5b6a8ec18e3add4c74682f60b90c31bdf849f2 ("Merge tag
'pci-v6.13-fixes-3' of
git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci").
As Karolina pointed out[1], there is a chance of conflicts (e.g. TLP
log/print consolidation series) with pci/err and pci-next branches.
The next version will be rebased on top of one of those branches.
Thanks,
Jon
[1] https://lore.kernel.org/linux-pci/8be04d4f-c9e8-4ed2-bf6a-3550d51eb972@oracle.com/
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/8] PCI/AER: Remove aer_print_port_info
2025-01-15 7:42 ` [PATCH 1/8] PCI/AER: Remove aer_print_port_info Jon Pan-Doh
2025-01-16 14:27 ` Karolina Stolarek
2025-01-21 14:18 ` Ilpo Järvinen
@ 2025-01-25 4:15 ` Sathyanarayanan Kuppuswamy
2025-02-12 23:20 ` Jon Pan-Doh
2 siblings, 1 reply; 51+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-01-25 4:15 UTC (permalink / raw)
To: Jon Pan-Doh, Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck
On 1/14/25 11:42 PM, Jon Pan-Doh wrote:
> Info logged is duplicated when either the source device is processed. If
> no source device is found, than an error is logged.
Any issue with logging when no source device is found in device
hierarchy?
>
> Code flow:
> aer_isr_one_error()
> -> aer_print_port_info()
> -> find_source_device()
> -> return/pci_info() if no device found else continue
> -> aer_process_err_devices()
> -> aer_print_error()
>
> aer_print_port_info():
> [ 21.596150] pcieport 0000:00:04.0: Correctable error message received
> from 0000:01:00.0
>
> aer_print_error():
> [ 21.596163] e1000e 0000:01:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
> [ 21.600575] e1000e 0000:01:00.0: device [8086:10d3] error status/mask=00000040/0000e000
> [ 21.604707] e1000e 0000:01:00.0: [ 6] BadTLP
Please remove time stamp from dmesg log.
>
> Tested using aer-inject[1] tool. No more root port log on dmesg.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> ---
> drivers/pci/pcie/aer.c | 15 ---------------
> 1 file changed, 15 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 34ce9f834d0c..ba40800b5494 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -735,18 +735,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> info->severity, info->tlp_header_valid, &info->tlp);
> }
>
> -static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
> -{
> - u8 bus = info->id >> 8;
> - u8 devfn = info->id & 0xff;
> -
> - pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d\n",
> - info->multi_error_valid ? "Multiple " : "",
> - aer_error_severity_string[info->severity],
> - pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
> - PCI_FUNC(devfn));
> -}
> -
> #ifdef CONFIG_ACPI_APEI_PCIEAER
> int cper_severity_to_aer(int cper_severity)
> {
> @@ -1295,7 +1283,6 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
> e_info.multi_error_valid = 1;
> else
> e_info.multi_error_valid = 0;
> - aer_print_port_info(pdev, &e_info);
>
> if (find_source_device(pdev, &e_info))
> aer_process_err_devices(&e_info);
> @@ -1314,8 +1301,6 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
> else
> e_info.multi_error_valid = 0;
>
> - aer_print_port_info(pdev, &e_info);
> -
> if (find_source_device(pdev, &e_info))
> aer_process_err_devices(&e_info);
> }
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 2/8] PCI/AER: Move AER stat collection out of __aer_print_error
2025-01-15 7:42 ` [PATCH 2/8] PCI/AER: Move AER stat collection out of __aer_print_error Jon Pan-Doh
2025-01-16 14:47 ` Karolina Stolarek
@ 2025-01-25 4:37 ` Sathyanarayanan Kuppuswamy
2025-02-12 23:20 ` Jon Pan-Doh
1 sibling, 1 reply; 51+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-01-25 4:37 UTC (permalink / raw)
To: Jon Pan-Doh, Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck
On 1/14/25 11:42 PM, Jon Pan-Doh wrote:
> Decouple stat collection from internal AER print functions. AERs from ghes
> or cxl drivers have stat collection in pci_print_aer as that is where
> aer_err_info is populated.
Isn't pci_print_ear() internally calls __aer_print_error()? So the stat
collection
should work fine even now. Can you give more info on why you want to
decouple here.
>
> Tested using aer-inject[1] tool. AER sysfs counters still updated
> correctly.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> ---
> drivers/pci/pcie/aer.c | 7 +++++--
> 1 file changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index ba40800b5494..4bb0b3840402 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -695,7 +695,6 @@ static void __aer_print_error(struct pci_dev *dev,
> pci_printk(level, dev, " [%2d] %-22s%s\n", i, errmsg,
> info->first_error == i ? " (First)" : "");
> }
> - pci_dev_aer_stats_incr(dev, info);
> }
>
> void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> @@ -775,6 +774,8 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
> info.mask = mask;
> info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
>
> + pci_dev_aer_stats_incr(dev, &info);
> +
> pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
> __aer_print_error(dev, &info);
> pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
> @@ -1249,8 +1250,10 @@ static inline void aer_process_err_devices(struct aer_err_info *e_info)
>
> /* Report all before handle them, not to lost records by reset etc. */
> for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
> - if (aer_get_device_error_info(e_info->dev[i], e_info))
> + if (aer_get_device_error_info(e_info->dev[i], e_info)) {
> + pci_dev_aer_stats_incr(e_info->dev[i], e_info);
> aer_print_error(e_info->dev[i], e_info);
> + }
> }
> for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
> if (aer_get_device_error_info(e_info->dev[i], e_info))
--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs
2025-01-15 7:42 ` [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs Jon Pan-Doh
2025-01-16 12:02 ` Karolina Stolarek
@ 2025-01-25 7:39 ` Lukas Wunner
2025-01-31 14:43 ` Jonathan Cameron
2025-02-06 13:56 ` Karolina Stolarek
2025-01-31 14:55 ` Jonathan Cameron
2 siblings, 2 replies; 51+ messages in thread
From: Lukas Wunner @ 2025-01-25 7:39 UTC (permalink / raw)
To: Jon Pan-Doh
Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck
On Tue, Jan 14, 2025 at 11:42:57PM -0800, Jon Pan-Doh wrote:
> After ratelimiting logs, spammy devices can still slow execution by
> continued AER IRQ servicing.
>
> Add higher per-device ratelimits for AER errors to mask out those IRQs.
> Set the default rate to 3x default AER ratelimit (30 per 5s).
Masking errors at the register level feels overzealous,
in particular because it also disables logging via tracepoints.
Is there a concrete device that necessitates this change?
If there is, consider adding a quirk for this particular device
which masks specific errors, but doesn't affect other devices.
If there isn't, consider dropping this change until a buggy device
appears that actually needs it.
Thanks,
Lukas
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/8] Rate limit AER logs/IRQs
2025-01-24 6:46 ` Jon Pan-Doh
@ 2025-01-25 7:59 ` Lukas Wunner
0 siblings, 0 replies; 51+ messages in thread
From: Lukas Wunner @ 2025-01-25 7:59 UTC (permalink / raw)
To: Jon Pan-Doh
Cc: Bowman, Terry, Bjorn Helgaas, Karolina Stolarek, linux-pci,
Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
PradeepVineshReddy.Kodamati
On Thu, Jan 23, 2025 at 10:46:29PM -0800, Jon Pan-Doh wrote:
> On Thu, Jan 23, 2025 at 7:18AM Bowman, Terry <terry.bowman@amd.com> wrote:
> > Can you share the base commit used here? I would like to try the patchset.
>
> Sure, it's 7f5b6a8ec18e3add4c74682f60b90c31bdf849f2 ("Merge tag
> 'pci-v6.13-fixes-3' of
> git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci").
>
> As Karolina pointed out[1], there is a chance of conflicts (e.g. TLP
> log/print consolidation series) with pci/err and pci-next branches.
> The next version will be rebased on top of one of those branches.
Patch sets that are intended to be applied to pci.git should generally
be based on the most recent -rc1 release because pci.git contains
a collection of topic branches (each based on rc1) which are then
merged together to form the pull request for Linus.
Note that there are several other patch sets in flight which target AER:
Terry's CXL error handling (now at v5):
https://lore.kernel.org/linux-pci/20250107143852.3692571-1-terry.bowman@amd.com/
Shuai's endpoint error reporting (now at v2):
https://lore.kernel.org/linux-pci/20241112135419.59491-1-xueshuai@linux.alibaba.com/
You may want to double-check that your changes do not collide with
these other in-flight patch sets. Terry's is most far along and
may be applied in the upcoming cycle, though it's unclear to me
whether that'll be through the pci or cxl tree. Probably the former
to avoid merge conflicts?
Thanks,
Lukas
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 6/8] PCI/AER: Add AER sysfs attributes for ratelimits
2025-01-15 7:42 ` [PATCH 6/8] PCI/AER: Add AER sysfs attributes for ratelimits Jon Pan-Doh
@ 2025-01-31 14:32 ` Jonathan Cameron
2025-02-28 23:11 ` Jon Pan-Doh
0 siblings, 1 reply; 51+ messages in thread
From: Jonathan Cameron @ 2025-01-31 14:32 UTC (permalink / raw)
To: Jon Pan-Doh
Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck
On Tue, 14 Jan 2025 23:42:58 -0800
Jon Pan-Doh <pandoh@google.com> wrote:
> Allow userspace to read/write ratelimits per device. Create aer/ sysfs
> directory to store them and any future aer configs.
>
> Tested using aer-inject[1] tool. Configured correctable log/irq ratelimits
> to 5/10 respectively. Sent 12 AER errors. Observed 5 errors logged while
> AER stats (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) shows 11
> (1 masked).
>
> Before: CEMsk: BadTLP-
> After: CEMsk: BadTLP+.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> ---
> .../testing/sysfs-bus-pci-devices-aer_stats | 32 +++++++++++
> Documentation/PCI/pcieaer-howto.rst | 4 +-
> drivers/pci/pci-sysfs.c | 1 +
> drivers/pci/pci.h | 1 +
> drivers/pci/pcie/aer.c | 56 +++++++++++++++++++
> 5 files changed, 93 insertions(+), 1 deletion(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> index d1f67bb81d5d..c680a53af0f4 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> @@ -117,3 +117,35 @@ Date: July 2018
> KernelVersion: 4.19.0
> Contact: linux-pci@vger.kernel.org, rajatja@google.com
> Description: Total number of ERR_NONFATAL messages reported to rootport.
> +
> +PCIe AER ratelimits
> +----------------------------
Purely from an rst formatting point of view (this gets picked up in the docs
build) --- should extend I think only under the text.
> +
> +These attributes show up under all the devices that are AER capable. Provides
> +configurable ratelimits of logs/irq per error type. Writing a nonzero value
> +changes the number of errors (burst) allowed per 5 second window before
> +ratelimiting. Reading gets the current ratelimits.
> +
> +What: /sys/bus/pci/devices/<dev>/aer/ratelimit_cor_irq
> +Date: Jan 2025
> +KernelVersion: 6.14.0
> +Contact: linux-pci@vger.kernel.org, pandoh@google.com
> +Description: IRQ ratelimit for correctable errors.
I would add enough info that we can understand what values to write and what
to read to each description. This doc didn't lead me to the comment below
and it should have done...
> +
> +What: /sys/bus/pci/devices/<dev>/aer/ratelimit_uncor_irq
> +Date: Jan 2025
> +KernelVersion: 6.14.0
> +Contact: linux-pci@vger.kernel.org, pandoh@google.com
> +Description: IRQ ratelimit for uncorrectable errors.
> +
> +What: /sys/bus/pci/devices/<dev>/aer/ratelimit_cor_log
> +Date: Jan 2025
> +KernelVersion: 6.14.0
> +Contact: linux-pci@vger.kernel.org, pandoh@google.com
> +Description: Log ratelimit for correctable errors.
> +
> +What: /sys/bus/pci/devices/<dev>/aer/ratelimit_uncor_log
> +Date: Jan 2025
> +KernelVersion: 6.14.0
> +Contact: linux-pci@vger.kernel.org, pandoh@google.com
> +Description: Log ratelimit for uncorrectable errors.
> diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
> index d41272504b18..4d5b0638f120 100644
> --- a/Documentation/PCI/pcieaer-howto.rst
> +++ b/Documentation/PCI/pcieaer-howto.rst
> @@ -89,7 +89,9 @@ AER Ratelimits
> -------------------------
>
> Errors, both at log and IRQ level, are ratelimited per device and error type.
> -This prevents spammy devices from stalling execution.
> +This prevents spammy devices from stalling execution. Ratelimits are exposed
> +in the form of sysfs attributes and configurable. See
> +Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
>
> AER Statistics / Counters
> -------------------------
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index 7679d75d71e5..41acb6713e2d 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -1693,6 +1693,7 @@ const struct attribute_group *pci_dev_attr_groups[] = {
> &pcie_dev_attr_group,
> #ifdef CONFIG_PCIEAER
> &aer_stats_attr_group,
> + &aer_attr_group,
> #endif
> #ifdef CONFIG_PCIEASPM
> &aspm_ctrl_attr_group,
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 2e40fc63ba31..9d0272a890ef 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -881,6 +881,7 @@ void pci_no_aer(void);
> void pci_aer_init(struct pci_dev *dev);
> void pci_aer_exit(struct pci_dev *dev);
> extern const struct attribute_group aer_stats_attr_group;
> +extern const struct attribute_group aer_attr_group;
> void pci_aer_clear_fatal_status(struct pci_dev *dev);
> int pci_aer_clear_status(struct pci_dev *dev);
> int pci_aer_raw_clear_status(struct pci_dev *dev);
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 1db70ae87f52..e48e2951baae 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -630,6 +630,62 @@ const struct attribute_group aer_stats_attr_group = {
> .is_visible = aer_stats_attrs_are_visible,
> };
>
> +#define aer_ratelimit_attr(name, ratelimit) \
> + static ssize_t \
> + name##_show(struct device *dev, struct device_attribute *attr, \
> + char *buf) \
> +{ \
> + struct pci_dev *pdev = to_pci_dev(dev); \
> + return sysfs_emit(buf, "%u errors every %u seconds\n", \
> + pdev->aer_info->ratelimit.burst, \
> + pdev->aer_info->ratelimit.interval / HZ); \
Usual rule of thumb is you need a very strong reason to read something
different from what was written to a sysfs file.
I think your interval is fixed? If so name the files so that is apparent
and just have a count returned here. Or if you want to future proof
add a read only ratelimit_period_secs file that prints 5
ratelimit_in_5secs_uncor_log etc.
> +} \
> + \
> + static ssize_t \
> + name##_store(struct device *dev, struct device_attribute *attr, \
> + const char *buf, size_t count) \
> +{ \
> + struct pci_dev *pdev = to_pci_dev(dev); \
> + int burst; \
> + \
> + if (kstrtoint(buf, 0, &burst) < 0) \
> + return -EINVAL; \
> + \
> + pdev->aer_info->ratelimit.burst = burst; \
> + return count; \
> +} \
> +static DEVICE_ATTR_RW(name)
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 8/8] PCI/AER: Move AER sysfs attributes into separate directory
2025-01-16 17:18 ` Rajat Jain
@ 2025-01-31 14:36 ` Jonathan Cameron
2025-02-12 23:19 ` Jon Pan-Doh
0 siblings, 1 reply; 51+ messages in thread
From: Jonathan Cameron @ 2025-01-31 14:36 UTC (permalink / raw)
To: Rajat Jain
Cc: Karolina Stolarek, Jon Pan-Doh, Bjorn Helgaas, linux-pci,
Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck
On Thu, 16 Jan 2025 09:18:20 -0800
Rajat Jain <rajatja@google.com> wrote:
> Hello,
>
> On Thu, Jan 16, 2025 at 2:26 AM Karolina Stolarek
> <karolina.stolarek@oracle.com> wrote:
> >
> > On 15/01/2025 08:43, Jon Pan-Doh wrote:
> > > Prepare for the addition of new AER sysfs attributes (e.g. ratelimits)
> > > by moving them into their own directory. Update naming to reflect
> > > broader definition and for consistency.
> > >
> > > /sys/bus/pci/devices/<dev>/aer_dev_correctable
> > > /sys/bus/pci/devices/<dev>/aer_dev_fatal
> > > /sys/bus/pci/devices/<dev>/aer_dev_nonfatal
> > > /sys/bus/pci/devices/<dev>/aer_rootport_total_err_cor
> > > /sys/bus/pci/devices/<dev>/aer_rootport_total_err_fatal
> > > /sys/bus/pci/devices/<dev>/aer_rootport_total_err_nonfatal
> > > ->
> > > /sys/bus/pci/devices/<dev>/aer/err_cor
> > > /sys/bus/pci/devices/<dev>/aer/err_fatal
> > > /sys/bus/pci/devices/<dev>/aer/err_nonfatal
> > > /sys/bus/pci/devices/<dev>/aer/rootport_total_err_cor
> > > /sys/bus/pci/devices/<dev>/aer/rootport_total_err_fatal
> > > /sys/bus/pci/devices/<dev>/aer/rootport_total_err_nonfatal
> > >
> > > Tested using aer-inject[1] tool. Sent 1 AER error. Observed AER stats
> > > correctedly logged (cat /sys/bus/pci/devices/<dev>/aer/dev_err_cor).
> >
> > I'm not a sysfs expert but my understanding is that we shouldn't do
> > major changes in the existing hierarchies.
> >
> > On one hand, I think it would be nice to extract out AER-specific info
> > and knobs into a subdirectory (e.g., using attribute_group with name
> > "aer"), but on the other this would be disruptive to the userspace. I
> > can imagine that there are tools that watch these values that would
> > break after this change.
>
> Thank you. This is the right guidance.
>
> As the original author to introduce these attributes, I just wanted to
> chime in from the ChromeOS team's perspective (who originally
> introduced these attributes). I can say that we have used these
> attributes for debugging mostly manually, and do not have tools yet
> with hardcoded hierarchy / paths. So we wouldn't be opposed to it, if
> changes to the hierarchy have wider acceptance and it seems better in
> general.
You'd need to be really sure no one is using them or this is
ABI breakage and will need reverting. If it's been live for a while
then we are in a mess as we have to revert and break new users...
Generally I'd go with don't touch the existing elements.
Jonathan
>
> Thanks & Best Regards,
>
> Rajat
>
>
> >
> > All the best,
> > Karolina
> >
> > >
> > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
> > >
> > > Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> > > ---
> > > .../ABI/testing/sysfs-bus-pci-devices-aer | 18 +++---
> > > drivers/pci/pci-sysfs.c | 1 -
> > > drivers/pci/pci.h | 1 -
> > > drivers/pci/pcie/aer.c | 64 +++++++------------
> > > 4 files changed, 32 insertions(+), 52 deletions(-)
> > >
> > > diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> > > index c680a53af0f4..e1472583207b 100644
> > > --- a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> > > +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
> > > @@ -9,7 +9,7 @@ errors may be "seen" / reported by the link partner and not the
> > > problematic endpoint itself (which may report all counters as 0 as it never
> > > saw any problems).
> > >
> > > -What: /sys/bus/pci/devices/<dev>/aer_dev_correctable
> > > +What: /sys/bus/pci/devices/<dev>/aer/err_cor
> > > Date: July 2018
> > > KernelVersion: 4.19.0
> > > Contact: linux-pci@vger.kernel.org, rajatja@google.com
> > > @@ -19,7 +19,7 @@ Description: List of correctable errors seen and reported by this
> > > TOTAL_ERR_COR at the end of the file may not match the actual
> > > total of all the errors in the file. Sample output::
> > >
> > > - localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_correctable
> > > + localhost /sys/devices/pci0000:00/0000:00:1c.0/aer # cat err_cor
> > > Receiver Error 2
> > > Bad TLP 0
> > > Bad DLLP 0
> > > @@ -30,7 +30,7 @@ Description: List of correctable errors seen and reported by this
> > > Header Log Overflow 0
> > > TOTAL_ERR_COR 2
> > >
> > > -What: /sys/bus/pci/devices/<dev>/aer_dev_fatal
> > > +What: /sys/bus/pci/devices/<dev>/aer/err_fatal
> > > Date: July 2018
> > > KernelVersion: 4.19.0
> > > Contact: linux-pci@vger.kernel.org, rajatja@google.com
> > > @@ -40,7 +40,7 @@ Description: List of uncorrectable fatal errors seen and reported by this
> > > TOTAL_ERR_FATAL at the end of the file may not match the actual
> > > total of all the errors in the file. Sample output::
> > >
> > > - localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_fatal
> > > + localhost /sys/devices/pci0000:00/0000:00:1c.0/aer # cat err_fatal
> > > Undefined 0
> > > Data Link Protocol 0
> > > Surprise Down Error 0
> > > @@ -60,7 +60,7 @@ Description: List of uncorrectable fatal errors seen and reported by this
> > > TLP Prefix Blocked Error 0
> > > TOTAL_ERR_FATAL 0
> > >
> > > -What: /sys/bus/pci/devices/<dev>/aer_dev_nonfatal
> > > +What: /sys/bus/pci/devices/<dev>/aer/err_nonfatal
> > > Date: July 2018
> > > KernelVersion: 4.19.0
> > > Contact: linux-pci@vger.kernel.org, rajatja@google.com
> > > @@ -70,7 +70,7 @@ Description: List of uncorrectable nonfatal errors seen and reported by this
> > > TOTAL_ERR_NONFATAL at the end of the file may not match the
> > > actual total of all the errors in the file. Sample output::
> > >
> > > - localhost /sys/devices/pci0000:00/0000:00:1c.0 # cat aer_dev_nonfatal
> > > + localhost /sys/devices/pci0000:00/0000:00:1c.0/aer # cat err_nonfatal
> > > Undefined 0
> > > Data Link Protocol 0
> > > Surprise Down Error 0
> > > @@ -100,19 +100,19 @@ collectors) that are AER capable. These indicate the number of error messages as
> > > device, so these counters include them and are thus cumulative of all the error
> > > messages on the PCI hierarchy originating at that root port.
> > >
> > > -What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_cor
> > > +What: /sys/bus/pci/devices/<dev>/aer/rootport_total_err_cor
> > > Date: July 2018
> > > KernelVersion: 4.19.0
> > > Contact: linux-pci@vger.kernel.org, rajatja@google.com
> > > Description: Total number of ERR_COR messages reported to rootport.
> > >
> > > -What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_fatal
> > > +What: /sys/bus/pci/devices/<dev>/aer/rootport_total_err_fatal
> > > Date: July 2018
> > > KernelVersion: 4.19.0
> > > Contact: linux-pci@vger.kernel.org, rajatja@google.com
> > > Description: Total number of ERR_FATAL messages reported to rootport.
> > >
> > > -What: /sys/bus/pci/devices/<dev>/aer_rootport_total_err_nonfatal
> > > +What: /sys/bus/pci/devices/<dev>/aer/rootport_total_err_nonfatal
> > > Date: July 2018
> > > KernelVersion: 4.19.0
> > > Contact: linux-pci@vger.kernel.org, rajatja@google.com
> > > diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> > > index 41acb6713e2d..e16b92edf3bd 100644
> > > --- a/drivers/pci/pci-sysfs.c
> > > +++ b/drivers/pci/pci-sysfs.c
> > > @@ -1692,7 +1692,6 @@ const struct attribute_group *pci_dev_attr_groups[] = {
> > > &pci_bridge_attr_group,
> > > &pcie_dev_attr_group,
> > > #ifdef CONFIG_PCIEAER
> > > - &aer_stats_attr_group,
> > > &aer_attr_group,
> > > #endif
> > > #ifdef CONFIG_PCIEASPM
> > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> > > index 9d0272a890ef..a80cfc08f634 100644
> > > --- a/drivers/pci/pci.h
> > > +++ b/drivers/pci/pci.h
> > > @@ -880,7 +880,6 @@ static inline void of_pci_remove_node(struct pci_dev *pdev) { }
> > > void pci_no_aer(void);
> > > void pci_aer_init(struct pci_dev *dev);
> > > void pci_aer_exit(struct pci_dev *dev);
> > > -extern const struct attribute_group aer_stats_attr_group;
> > > extern const struct attribute_group aer_attr_group;
> > > void pci_aer_clear_fatal_status(struct pci_dev *dev);
> > > int pci_aer_clear_status(struct pci_dev *dev);
> > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > > index e48e2951baae..68850525cc8d 100644
> > > --- a/drivers/pci/pcie/aer.c
> > > +++ b/drivers/pci/pcie/aer.c
> > > @@ -569,13 +569,13 @@ static const char *aer_agent_string[] = {
> > > } \
> > > static DEVICE_ATTR_RO(name)
> > >
> > > -aer_stats_dev_attr(aer_dev_correctable, dev_cor_errs,
> > > +aer_stats_dev_attr(err_cor, dev_cor_errs,
> > > aer_correctable_error_string, "ERR_COR",
> > > dev_total_cor_errs);
> > > -aer_stats_dev_attr(aer_dev_fatal, dev_fatal_errs,
> > > +aer_stats_dev_attr(err_fatal, dev_fatal_errs,
> > > aer_uncorrectable_error_string, "ERR_FATAL",
> > > dev_total_fatal_errs);
> > > -aer_stats_dev_attr(aer_dev_nonfatal, dev_nonfatal_errs,
> > > +aer_stats_dev_attr(err_nonfatal, dev_nonfatal_errs,
> > > aer_uncorrectable_error_string, "ERR_NONFATAL",
> > > dev_total_nonfatal_errs);
> > >
> > > @@ -589,47 +589,13 @@ aer_stats_dev_attr(aer_dev_nonfatal, dev_nonfatal_errs,
> > > } \
> > > static DEVICE_ATTR_RO(name)
> > >
> > > -aer_stats_rootport_attr(aer_rootport_total_err_cor,
> > > +aer_stats_rootport_attr(rootport_total_err_cor,
> > > rootport_total_cor_errs);
> > > -aer_stats_rootport_attr(aer_rootport_total_err_fatal,
> > > +aer_stats_rootport_attr(rootport_total_err_fatal,
> > > rootport_total_fatal_errs);
> > > -aer_stats_rootport_attr(aer_rootport_total_err_nonfatal,
> > > +aer_stats_rootport_attr(rootport_total_err_nonfatal,
> > > rootport_total_nonfatal_errs);
> > >
> > > -static struct attribute *aer_stats_attrs[] __ro_after_init = {
> > > - &dev_attr_aer_dev_correctable.attr,
> > > - &dev_attr_aer_dev_fatal.attr,
> > > - &dev_attr_aer_dev_nonfatal.attr,
> > > - &dev_attr_aer_rootport_total_err_cor.attr,
> > > - &dev_attr_aer_rootport_total_err_fatal.attr,
> > > - &dev_attr_aer_rootport_total_err_nonfatal.attr,
> > > - NULL
> > > -};
> > > -
> > > -static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
> > > - struct attribute *a, int n)
> > > -{
> > > - struct device *dev = kobj_to_dev(kobj);
> > > - struct pci_dev *pdev = to_pci_dev(dev);
> > > -
> > > - if (!pdev->aer_info)
> > > - return 0;
> > > -
> > > - if ((a == &dev_attr_aer_rootport_total_err_cor.attr ||
> > > - a == &dev_attr_aer_rootport_total_err_fatal.attr ||
> > > - a == &dev_attr_aer_rootport_total_err_nonfatal.attr) &&
> > > - ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) &&
> > > - (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC)))
> > > - return 0;
> > > -
> > > - return a->mode;
> > > -}
> > > -
> > > -const struct attribute_group aer_stats_attr_group = {
> > > - .attrs = aer_stats_attrs,
> > > - .is_visible = aer_stats_attrs_are_visible,
> > > -};
> > > -
> > > #define aer_ratelimit_attr(name, ratelimit) \
> > > static ssize_t \
> > > name##_show(struct device *dev, struct device_attribute *attr, \
> > > @@ -662,6 +628,14 @@ aer_ratelimit_attr(ratelimit_cor_log, cor_log_ratelimit);
> > > aer_ratelimit_attr(ratelimit_uncor_log, uncor_log_ratelimit);
> > >
> > > static struct attribute *aer_attrs[] __ro_after_init = {
> > > + /* Stats */
> > > + &dev_attr_err_cor.attr,
> > > + &dev_attr_err_fatal.attr,
> > > + &dev_attr_err_nonfatal.attr,
> > > + &dev_attr_rootport_total_err_cor.attr,
> > > + &dev_attr_rootport_total_err_fatal.attr,
> > > + &dev_attr_rootport_total_err_nonfatal.attr,
> > > + /* Ratelimits */
> > > &dev_attr_ratelimit_cor_irq.attr,
> > > &dev_attr_ratelimit_uncor_irq.attr,
> > > &dev_attr_ratelimit_cor_log.attr,
> > > @@ -670,13 +644,21 @@ static struct attribute *aer_attrs[] __ro_after_init = {
> > > };
> > >
> > > static umode_t aer_attrs_are_visible(struct kobject *kobj,
> > > - struct attribute *a, int n)
> > > + struct attribute *a, int n)
> > > {
> > > struct device *dev = kobj_to_dev(kobj);
> > > struct pci_dev *pdev = to_pci_dev(dev);
> > >
> > > if (!pdev->aer_info)
> > > return 0;
> > > +
> > > + if ((a == &dev_attr_rootport_total_err_cor.attr ||
> > > + a == &dev_attr_rootport_total_err_fatal.attr ||
> > > + a == &dev_attr_rootport_total_err_nonfatal.attr) &&
> > > + ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) &&
> > > + (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC)))
> > > + return 0;
> > > +
> > > return a->mode;
> > > }
> > >
> >
> >
>
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs
2025-01-25 7:39 ` Lukas Wunner
@ 2025-01-31 14:43 ` Jonathan Cameron
2025-03-04 23:42 ` Jon Pan-Doh
2025-02-06 13:56 ` Karolina Stolarek
1 sibling, 1 reply; 51+ messages in thread
From: Jonathan Cameron @ 2025-01-31 14:43 UTC (permalink / raw)
To: Lukas Wunner
Cc: Jon Pan-Doh, Bjorn Helgaas, Karolina Stolarek, linux-pci,
Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck
On Sat, 25 Jan 2025 08:39:35 +0100
Lukas Wunner <lukas@wunner.de> wrote:
> On Tue, Jan 14, 2025 at 11:42:57PM -0800, Jon Pan-Doh wrote:
> > After ratelimiting logs, spammy devices can still slow execution by
> > continued AER IRQ servicing.
> >
> > Add higher per-device ratelimits for AER errors to mask out those IRQs.
> > Set the default rate to 3x default AER ratelimit (30 per 5s).
>
> Masking errors at the register level feels overzealous,
> in particular because it also disables logging via tracepoints.
>
> Is there a concrete device that necessitates this change?
> If there is, consider adding a quirk for this particular device
> which masks specific errors, but doesn't affect other devices.
> If there isn't, consider dropping this change until a buggy device
> appears that actually needs it.
Fully agree with this comment. At very least this should default
to not ratelimiting on the tracepoints unless a specific opt in has
occurred (probably a platform or device driver quirk).
In particular I'd worry that you are masking whatever errors are
finally trigger masking. That might be the only one of that
particular type that was seen and I think the only report we
see is the 'I masked it message'. So rasdaemon for example
never sees the error at all. So another tweak would be report
one last time so we definitely see any given error type at least
once.
For CXL errors we trigger off one AER error type (internal error),
but then that is multiplexed onto finer grained errors. Even if
we fix the above we would want the masking in the CXL RAS controls,
not AER.
Jonathan
>
> Thanks,
>
> Lukas
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs
2025-01-15 7:42 ` [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs Jon Pan-Doh
2025-01-16 12:02 ` Karolina Stolarek
2025-01-25 7:39 ` Lukas Wunner
@ 2025-01-31 14:55 ` Jonathan Cameron
2025-03-04 23:42 ` Jon Pan-Doh
2 siblings, 1 reply; 51+ messages in thread
From: Jonathan Cameron @ 2025-01-31 14:55 UTC (permalink / raw)
To: Jon Pan-Doh
Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck
On Tue, 14 Jan 2025 23:42:57 -0800
Jon Pan-Doh <pandoh@google.com> wrote:
> After ratelimiting logs, spammy devices can still slow execution by
> continued AER IRQ servicing.
>
> Add higher per-device ratelimits for AER errors to mask out those IRQs.
> Set the default rate to 3x default AER ratelimit (30 per 5s).
>
> Tested using aer-inject[1] tool. Injected 32 AER errors. Observed IRQ
> masked via lspci and sysfs counters record 31 errors (1 masked).
>
> Before: CEMsk: BadTLP-
> After: CEMsk: BadTLP+
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
Hi Jon,
Comment inline.
> ---
> Documentation/PCI/pcieaer-howto.rst | 4 +-
> drivers/pci/pcie/aer.c | 64 +++++++++++++++++++++++++----
> 2 files changed, 57 insertions(+), 11 deletions(-)
>
> diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
> index 5546de60f184..d41272504b18 100644
> --- a/Documentation/PCI/pcieaer-howto.rst
> +++ b/Documentation/PCI/pcieaer-howto.rst
> @@ -88,8 +88,8 @@ fields.
> AER Ratelimits
> -------------------------
>
> -Error messages are ratelimited per device and error type. This prevents spammy
> -devices from flooding the console.
> +Errors, both at log and IRQ level, are ratelimited per device and error type.
> +This prevents spammy devices from stalling execution.
>
> AER Statistics / Counters
> -------------------------
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 025c50b0f293..1db70ae87f52 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -87,6 +87,8 @@ struct aer_info {
> u64 rootport_total_nonfatal_errs;
>
> /* Ratelimits for errors */
> + struct ratelimit_state cor_irq_ratelimit;
> + struct ratelimit_state uncor_irq_ratelimit;
> struct ratelimit_state cor_log_ratelimit;
> struct ratelimit_state uncor_log_ratelimit;
> };
> @@ -379,6 +381,10 @@ void pci_aer_init(struct pci_dev *dev)
> return;
>
> dev->aer_info = kzalloc(sizeof(struct aer_info), GFP_KERNEL);
> + ratelimit_state_init(&dev->aer_info->cor_irq_ratelimit,
> + DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST*3);
> + ratelimit_state_init(&dev->aer_info->uncor_irq_ratelimit,
> + DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST*3);
> ratelimit_state_init(&dev->aer_info->cor_log_ratelimit,
> DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
> ratelimit_state_init(&dev->aer_info->uncor_log_ratelimit,
> @@ -676,6 +682,39 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
> }
> }
>
> +static void mask_reported_error(struct pci_dev *dev, struct aer_err_info *info)
> +{
> + const char **strings;
> + const char *errmsg;
> + u16 aer_offset = dev->aer_cap;
> + u16 mask_reg_offset;
> + u32 mask;
> + unsigned long status = info->status;
> + int i;
> +
> + if (info->severity == AER_CORRECTABLE) {
> + strings = aer_correctable_error_string;
> + mask_reg_offset = PCI_ERR_COR_MASK;
> + } else {
> + strings = aer_uncorrectable_error_string;
> + mask_reg_offset = PCI_ERR_UNCOR_MASK;
> + }
> +
> + pci_read_config_dword(dev, aer_offset + mask_reg_offset, &mask);
> + mask |= status;
> + pci_write_config_dword(dev, aer_offset + mask_reg_offset, mask);
> +
> + pci_warn(dev, "%s error(s) masked due to rate-limiting:",
> + aer_error_severity_string[info->severity]);
> + for_each_set_bit(i, &status, 32) {
> + errmsg = strings[i];
> + if (!errmsg)
> + errmsg = "Unknown Error Bit";
> +
> + pci_warn(dev, " [%2d] %-22s\n", i, errmsg);
> + }
> +}
> +
> static void __print_tlp_header(struct pci_dev *dev, struct pcie_tlp_log *t)
> {
> pci_err(dev, " TLP Header: %08x %08x %08x %08x\n",
> @@ -713,7 +752,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> int layer, agent;
> int id = pci_dev_id(dev);
> const char *level;
> - struct ratelimit_state *ratelimit;
> + struct ratelimit_state *irq_ratelimit;
> + struct ratelimit_state *log_ratelimit;
>
> if (!info->status) {
> pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> @@ -722,14 +762,20 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> }
>
> if (info->severity == AER_CORRECTABLE) {
> - ratelimit = &dev->aer_info->cor_log_ratelimit;
> + irq_ratelimit = &dev->aer_info->cor_irq_ratelimit;
> + log_ratelimit = &dev->aer_info->cor_log_ratelimit;
> level = KERN_WARNING;
> } else {
> - ratelimit = &dev->aer_info->uncor_log_ratelimit;
> + irq_ratelimit = &dev->aer_info->uncor_irq_ratelimit;
> + log_ratelimit = &dev->aer_info->uncor_log_ratelimit;
> level = KERN_ERR;
> }
>
> - if (!__ratelimit(ratelimit))
> + if (!__ratelimit(irq_ratelimit)) {
> + mask_reported_error(dev, info);
> + return;
So if I follow correctly. We count irqs for any error type and
then mask whatever was set on one that triggered this rate_limit check?
That last one isn't reported other than via a log message.
Imagine that is a totally unrelated error to the earlier ones,
now RASDaemon has no info on it at all as the tracepoint never
fired. To me that's a very different situation to it knowing there
were 10 errors of the type vs more.
I'd like to see that final trace point and also to see a tracepoint
that lets rasdaemon etc know you cut off errors after this point
+ rasdaemon support for using that.
Terry can address if we need to do anything different for CXL given
he is updating this handling for that. Superficially I think we
need to driver the masking down into the CXL RAS capability registers.
Internal error in the AER capabilities is far to big a hammer to apply.
> + }
> + if (!__ratelimit(log_ratelimit))
> return;
>
> layer = AER_GET_LAYER_ERROR(info->severity, info->status);
> @@ -776,14 +822,14 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
> int layer, agent, tlp_header_valid = 0;
> u32 status, mask;
> struct aer_err_info info;
> - struct ratelimit_state *ratelimit;
> + struct ratelimit_state *log_ratelimit;
>
> if (aer_severity == AER_CORRECTABLE) {
> - ratelimit = &dev->aer_info->cor_log_ratelimit;
> + log_ratelimit = &dev->aer_info->cor_log_ratelimit;
> status = aer->cor_status;
> mask = aer->cor_mask;
> } else {
> - ratelimit = &dev->aer_info->uncor_log_ratelimit;
> + log_ratelimit = &dev->aer_info->uncor_log_ratelimit;
> status = aer->uncor_status;
> mask = aer->uncor_mask;
> tlp_header_valid = status & AER_LOG_TLP_MASKS;
> @@ -799,8 +845,8 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
> info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
>
> pci_dev_aer_stats_incr(dev, &info);
> -
> - if (!__ratelimit(ratelimit))
> + /* Only ratelimit logs (no IRQ) as AERs reported via GHES/CXL (caller). */
> + if (!__ratelimit(log_ratelimit))
> return;
>
> pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/8] Rate limit AER logs/IRQs
2025-01-15 7:42 [PATCH 0/8] Rate limit AER logs/IRQs Jon Pan-Doh
` (8 preceding siblings ...)
2025-01-23 15:18 ` [PATCH 0/8] Rate limit AER logs/IRQs Bowman, Terry
@ 2025-02-06 13:32 ` Karolina Stolarek
2025-02-12 23:19 ` Jon Pan-Doh
9 siblings, 1 reply; 51+ messages in thread
From: Karolina Stolarek @ 2025-02-06 13:32 UTC (permalink / raw)
To: Jon Pan-Doh
Cc: linux-pci, Bjorn Helgaas, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck
Hi Jon,
On 15/01/2025 08:42, Jon Pan-Doh wrote:
> Proposal
> ========
>
> When using native AER, spammy devices can flood kernel logs with AER errors
> and slow/stall execution. Add per-device per-error-severity ratelimits
> for more robust error logging. Allow userspace to configure ratelimits
> via sysfs knobs.
Do you have any update on the series?
I'm aware that a lot is happening in the AER code right now, so I was
thinking if it would be helpful to split up the series to get the logs
ratelimiting in sooner. There are some concerns about disabling error
generation that should be discussed, but I don't want them to block the
logs ratelimit changes. I think it would be good to fix this first to
save people (myself included) from overflown syslogs.
All the best,
Karolina
>
> Motivation
> ==========
>
> Several OCP members have issues with inconsistent PCIe error handling,
> exacerbated at datacenter scale (myriad of devices).
> OCP HW/Fault Management subproject set out to solve this by
> standardizing industry:
>
> - PCIe error handling best practices
> - Fault Management/RAS (incl. PCIe errors)
>
> Exposing PCIe errors/debug info in-band for a userspace daemon (e.g.
> rasdaemon) to collect/pass on to repairability services is part of the
> roadmap.
>
> Background
> ==========
>
> AER error spam has been observed many times, both publicly (e.g. [1], [2],
> [3]) and privately. While it usually occurs with correctable errors, it can
> happen with uncorrectable errors (e.g. during new HW bringup).
>
> There have been previous attempts to add ratelimits to AER logs ([4],
> [5]). The most recent attempt[5] has many similarities with the proposed
> approach.
>
> Patch organization
> ==================
> 1-3 AER logging cleanup
> 4-7 Ratelimits and sysfs knobs
> 8 Sysfs cleanup (RFC that breaks existing ABI/can be dropped)
>
> Outstanding work
> ================
> Cleanup:
> - Consolidate aer_print_error() and pci_print_error() path
> - Elevate log level logic out of print functions[6]
>
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=215027
> [2] https://bugzilla.kernel.org/show_bug.cgi?id=201517
> [3] https://bugzilla.kernel.org/show_bug.cgi?id=196183
> [4] https://lore.kernel.org/linux-pci/20230606035442.2886343-2-grundler@chromium.org/
> [5] https://lore.kernel.org/linux-pci/cover.1736341506.git.karolina.stolarek@oracle.com/
> [6] https://lore.kernel.org/linux-pci/edd77011aafad4c0654358a26b4e538d0c5a321d.1736341506.git.karolina.stolarek@oracle.com/
>
> Jon Pan-Doh (8):
> PCI/AER: Remove aer_print_port_info
> PCI/AER: Move AER stat collection out of __aer_print_error
> PCI/AER: Rename struct aer_stats to aer_info
> PCI/AER: Introduce ratelimit for error logs
> PCI/AER: Introduce ratelimit for AER IRQs
> PCI/AER: Add AER sysfs attributes for ratelimits
> PCI/AER: Update AER sysfs ABI filename
> PCI/AER: Move AER sysfs attributes into separate directory
>
> ...es-aer_stats => sysfs-bus-pci-devices-aer} | 50 +++-
> Documentation/PCI/pcieaer-howto.rst | 10 +-
> drivers/pci/pci-sysfs.c | 2 +-
> drivers/pci/pci.h | 2 +-
> drivers/pci/pcie/aer.c | 227 +++++++++++++-----
> include/linux/pci.h | 2 +-
> 6 files changed, 216 insertions(+), 77 deletions(-)
> rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (69%)
>
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs
2025-01-25 7:39 ` Lukas Wunner
2025-01-31 14:43 ` Jonathan Cameron
@ 2025-02-06 13:56 ` Karolina Stolarek
2025-02-06 20:25 ` Lukas Wunner
1 sibling, 1 reply; 51+ messages in thread
From: Karolina Stolarek @ 2025-02-06 13:56 UTC (permalink / raw)
To: Lukas Wunner
Cc: Jon Pan-Doh, Bjorn Helgaas, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck
On 25/01/2025 08:39, Lukas Wunner wrote:
>
> Masking errors at the register level feels overzealous,
> in particular because it also disables logging via tracepoints.
>
> Is there a concrete device that necessitates this change?
I faced issues with excessive Correctable Errors reporting with Samsung
PM1733 NVMe (a couple of thousand errors per hour), which were still
polluting the logs even after introducing a ratelimit (first every 2s,
second ever 30s, as proposed in [1]). Also, instead of masking the
errors automatically, we could give a user a sysfs knob to turn error
generation off and on.
> If there is, consider adding a quirk for this particular device
> which masks specific errors, but doesn't affect other devices.
There were many other reports of Correctable Error floods, as signaled
in the cover letter, so it's hard to pinpoint the specific driver that
should mask these errors.
All the best,
Karolina
-------------------------------------
[1] -
https://lore.kernel.org/linux-pci/cover.1736341506.git.karolina.stolarek@oracle.com/
> If there isn't, consider dropping this change until a buggy device
> appears that actually needs it.
>
> Thanks,
>
> Lukas
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs
2025-02-06 13:56 ` Karolina Stolarek
@ 2025-02-06 20:25 ` Lukas Wunner
0 siblings, 0 replies; 51+ messages in thread
From: Lukas Wunner @ 2025-02-06 20:25 UTC (permalink / raw)
To: Karolina Stolarek
Cc: Jon Pan-Doh, Bjorn Helgaas, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck
On Thu, Feb 06, 2025 at 02:56:20PM +0100, Karolina Stolarek wrote:
> On 25/01/2025 08:39, Lukas Wunner wrote:
> > Masking errors at the register level feels overzealous,
> > in particular because it also disables logging via tracepoints.
> >
> > Is there a concrete device that necessitates this change?
>
> I faced issues with excessive Correctable Errors reporting with Samsung
> PM1733 NVMe (a couple of thousand errors per hour), which were still
> polluting the logs even after introducing a ratelimit
I'd suggest to add a "u32 aer_cor_mask" to "struct pci_dev" in the
"#ifdef CONFIG_PCIEAER" section.
Then add a "DECLARE_PCI_FIXUP_HEADER()" macro in drivers/pci/quirks.c
for the Samsung PM1733 which calls a new function which sets exactly the
error bits you're seeing to aer_cor_mask. This should be #ifdef'ed to
CONFIG_PCIEAER as well.
Finally, amend aer.c to set the bits in aer_cor_mask in the
PCI_ERR_COR_MASK register on probe.
> > If there is, consider adding a quirk for this particular device
> > which masks specific errors, but doesn't affect other devices.
>
> There were many other reports of Correctable Error floods, as signaled in
> the cover letter, so it's hard to pinpoint the specific driver that should
> mask these errors.
If a specific device frequently signals the same errors,
I think that's a bug of that device and if the vendor doesn't
provide a firmware update, quiescing the errors through a quirk
is a plausible solution.
Of course if this is widespread, it becomes a maintenance nightmare
and then the quirk approach is not a viable option. I cannot say
whether that's the case. So far there's a report for one specific
product (the Samsung drive) and hinting that the problem may be
widespread. It's difficult to make a recommendation without
precise data.
Thanks,
Lukas
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/8] Rate limit AER logs/IRQs
2025-02-06 13:32 ` Karolina Stolarek
@ 2025-02-12 23:19 ` Jon Pan-Doh
2025-02-13 16:00 ` Karolina Stolarek
0 siblings, 1 reply; 51+ messages in thread
From: Jon Pan-Doh @ 2025-02-12 23:19 UTC (permalink / raw)
To: Karolina Stolarek
Cc: linux-pci, Bjorn Helgaas, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck
On Thu, Feb 6, 2025 at 5:32 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> Do you have any update on the series?
>
> I'm aware that a lot is happening in the AER code right now, so I was
> thinking if it would be helpful to split up the series to get the logs
> ratelimiting in sooner. There are some concerns about disabling error
> generation that should be discussed, but I don't want them to block the
> logs ratelimit changes. I think it would be good to fix this first to
> save people (myself included) from overflown syslogs.
Sorry for the delayed response. I was on vacation and hadn't had time
to address the comments. I think splitting the series into log
ratelimits vs. irq ratelimits is a good idea as we continue to discuss
the latter. I'll aim to send out v2 by the end of week.
One outstanding item (mentioned off-list) is Bjorn's desire to
consolidate the logging paths (aer_print_error() for AER IRQ and
pci_print_error() for CXL/GHES) as a prerequisite (clean up/reduce
tech debt). Maybe you could help with this?
Thanks,
Jon
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 8/8] PCI/AER: Move AER sysfs attributes into separate directory
2025-01-31 14:36 ` Jonathan Cameron
@ 2025-02-12 23:19 ` Jon Pan-Doh
0 siblings, 0 replies; 51+ messages in thread
From: Jon Pan-Doh @ 2025-02-12 23:19 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Rajat Jain, Karolina Stolarek, Bjorn Helgaas, linux-pci,
Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck
On Fri, Jan 31, 2025 at 6:36 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
> You'd need to be really sure no one is using them or this is
> ABI breakage and will need reverting. If it's been live for a while
> then we are in a mess as we have to revert and break new users...
>
> Generally I'd go with don't touch the existing elements.
Ack. Karolina/your feedback echoes Bjorn/Rajat's comments in internal
review. Will drop it in the next version.
I posted it to solicit outside feedback from other users (e.g. how
many tools depend on the path).
Thanks,
Jon
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 2/8] PCI/AER: Move AER stat collection out of __aer_print_error
2025-01-25 4:37 ` Sathyanarayanan Kuppuswamy
@ 2025-02-12 23:20 ` Jon Pan-Doh
0 siblings, 0 replies; 51+ messages in thread
From: Jon Pan-Doh @ 2025-02-12 23:20 UTC (permalink / raw)
To: Sathyanarayanan Kuppuswamy
Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck
On Fri, Jan 24, 2025 at 8:37 PM Sathyanarayanan Kuppuswamy
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> Isn't pci_print_ear() internally calls __aer_print_error()? So the stat
> collection
> should work fine even now. Can you give more info on why you want to
> decouple here.
Sure. From a readability perspective, I think that it is better to not
have stat collection buried in print functions. This allows the print
functions to be side-effect free as well (more or less).
One of the outstanding items is to unify the {aer, pci)_print_error()
paths, which may alter where we do the stat collection. However, I
think the general idea stands.
Thanks,
Jon
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 3/8] PCI/AER: Rename struct aer_stats to aer_info
2025-01-20 10:13 ` Karolina Stolarek
@ 2025-02-12 23:20 ` Jon Pan-Doh
0 siblings, 0 replies; 51+ messages in thread
From: Jon Pan-Doh @ 2025-02-12 23:20 UTC (permalink / raw)
To: Karolina Stolarek
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck
On Mon, Jan 20, 2025 at 2:13 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> Like I said, my preference would be to leave aer_stats as it is, but if
> we have to stick to one struct, I'd go with "aer_control" or "aer_report".
Will change to aer_report in v2.
Thanks,
Jon
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/8] PCI/AER: Remove aer_print_port_info
2025-01-20 9:25 ` Karolina Stolarek
@ 2025-02-12 23:20 ` Jon Pan-Doh
0 siblings, 0 replies; 51+ messages in thread
From: Jon Pan-Doh @ 2025-02-12 23:20 UTC (permalink / raw)
To: Karolina Stolarek
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck
On Mon, Jan 20, 2025 at 1:26 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> As for the commit message, I'd drop the before-after dmesg messages and
> explain that the information presented by aer_print_port_info() can be
> easily inferred from the actual error messages. I had to read the code
> to remind myself what information is duplicated.
Any objection to doing both (i.e. dmesg + explicitly call out fields)?
An ex. of duplicated log was requested in an earlier version (internal
review).
Thanks,
Jon
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/8] PCI/AER: Remove aer_print_port_info
2025-01-21 14:18 ` Ilpo Järvinen
@ 2025-02-12 23:20 ` Jon Pan-Doh
0 siblings, 0 replies; 51+ messages in thread
From: Jon Pan-Doh @ 2025-02-12 23:20 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck
On Tue, Jan 21, 2025 at 6:18 AM Ilpo Järvinen
<ilpo.jarvinen@linux.intel.com> wrote:
> Is the first sentence lacking something, as "either" feels to require
> another option/alternative that is not present in the sentence? (I'm
> non-native myself)
Typo. Good catch. Will remove in v2.
Thanks,
Jon
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 1/8] PCI/AER: Remove aer_print_port_info
2025-01-25 4:15 ` Sathyanarayanan Kuppuswamy
@ 2025-02-12 23:20 ` Jon Pan-Doh
0 siblings, 0 replies; 51+ messages in thread
From: Jon Pan-Doh @ 2025-02-12 23:20 UTC (permalink / raw)
To: Sathyanarayanan Kuppuswamy
Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck
On Fri, Jan 24, 2025 at 8:15 PM Sathyanarayanan Kuppuswamy
<sathyanarayanan.kuppuswamy@linux.intel.com> wrote:
> Any issue with logging when no source device is found in device
> hierarchy?
When there is no source device, an error is logged (pci_info()) with
the same BDF info that is found in aer_print_port_info() (i.e. from
aer_error_info). Will add it to v2 commit message to be more explicit.
> Please remove time stamp from dmesg log.
Ack.
Thanks,
Jon
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/8] Rate limit AER logs/IRQs
2025-02-12 23:19 ` Jon Pan-Doh
@ 2025-02-13 16:00 ` Karolina Stolarek
2025-02-14 2:49 ` Jon Pan-Doh
0 siblings, 1 reply; 51+ messages in thread
From: Karolina Stolarek @ 2025-02-13 16:00 UTC (permalink / raw)
To: Jon Pan-Doh
Cc: linux-pci, Bjorn Helgaas, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck
On 13/02/2025 00:19, Jon Pan-Doh wrote:
>
> Sorry for the delayed response. I was on vacation and hadn't had time
> to address the comments.
It's alright, there's no need to apologize for it :)
> I think splitting the series into log
> ratelimits vs. irq ratelimits is a good idea as we continue to discuss
> the latter. I'll aim to send out v2 by the end of week.
OK, sounds good
> One outstanding item (mentioned off-list) is Bjorn's desire to
> consolidate the logging paths (aer_print_error() for AER IRQ and
> pci_print_error() for CXL/GHES) as a prerequisite (clean up/reduce
> tech debt). Maybe you could help with this?
I'd need to dive into CXL part and ramp up, but I think that's something
I can help with. Does it mean that you'd rebase this series on the top
of the proposed cleanup?
All the best,
Karolina
>
> Thanks,
> Jon
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 0/8] Rate limit AER logs/IRQs
2025-02-13 16:00 ` Karolina Stolarek
@ 2025-02-14 2:49 ` Jon Pan-Doh
0 siblings, 0 replies; 51+ messages in thread
From: Jon Pan-Doh @ 2025-02-14 2:49 UTC (permalink / raw)
To: Karolina Stolarek
Cc: linux-pci, Bjorn Helgaas, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck
On Thu, Feb 13, 2025 at 8:00 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> I'd need to dive into CXL part and ramp up, but I think that's something
> I can help with. Does it mean that you'd rebase this series on the top
> of the proposed cleanup?
Yeah. Either that or you can append to the beginning of this series as
the first few patches are AER cleanup.
The former is probably easier depending on how large the patch(es) are
(i.e. I will rebase the ratelimit series on top of AER log cleanup).
It may even make sense to absorb the first few patches of this series
into the cleanup effort.
Thanks for the help,
Jon
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 6/8] PCI/AER: Add AER sysfs attributes for ratelimits
2025-01-31 14:32 ` Jonathan Cameron
@ 2025-02-28 23:11 ` Jon Pan-Doh
0 siblings, 0 replies; 51+ messages in thread
From: Jon Pan-Doh @ 2025-02-28 23:11 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck
On Fri, Jan 31, 2025 at 6:33 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Tue, 14 Jan 2025 23:42:58 -0800
> Jon Pan-Doh <pandoh@google.com> wrote:
> > +
> > +These attributes show up under all the devices that are AER capable. Provides
> > +configurable ratelimits of logs/irq per error type. Writing a nonzero value
> > +changes the number of errors (burst) allowed per 5 second window before
> > +ratelimiting. Reading gets the current ratelimits.
> > +
> > +What: /sys/bus/pci/devices/<dev>/aer/ratelimit_cor_irq
> > +Date: Jan 2025
> > +KernelVersion: 6.14.0
> > +Contact: linux-pci@vger.kernel.org, pandoh@google.com
> > +Description: IRQ ratelimit for correctable errors.
>
> I would add enough info that we can understand what values to write and what
> to read to each description. This doc didn't lead me to the comment below
> and it should have done...
Since there's a lot of commonality in the descriptions, I opted for
putting the bulk of info at the top of "PCIe AER ratelimits" section.
I believe I already wrote what should be written/read. Is there extra
specificity you are looking for?
How does this look?
PCIe AER ratelimits
-------------------
These attributes show up under all the devices that are AER capable.
They represent configurable ratelimits of logs per error type. Writing a
value changes the number of errors (burst) allowed per interval
(5 second window) before ratelimiting. Reading gets the current ratelimit
burst.
See Documentation/PCI/pcieaer-howto.rst for more info on ratelimits.
What: /sys/bus/pci/devices/<dev>/aer/ratelimit_in_5secs_cor_log
Date: March 2025
KernelVersion: 6.15.0
Contact: linux-pci@vger.kernel.org, pandoh@google.com
Description: Ratelimit burst for correctable error logs.
> > +#define aer_ratelimit_attr(name, ratelimit) \
> > + static ssize_t \
> > + name##_show(struct device *dev, struct device_attribute *attr, \
> > + char *buf) \
> > +{ \
> > + struct pci_dev *pdev = to_pci_dev(dev); \
> > + return sysfs_emit(buf, "%u errors every %u seconds\n", \
> > + pdev->aer_info->ratelimit.burst, \
> > + pdev->aer_info->ratelimit.interval / HZ); \
>
> Usual rule of thumb is you need a very strong reason to read something
> different from what was written to a sysfs file.
Ack. Will make read/write symmetric in v3.
> I think your interval is fixed? If so name the files so that is apparent
> and just have a count returned here. Or if you want to future proof
> add a read only ratelimit_period_secs file that prints 5
>
> ratelimit_in_5secs_uncor_log etc.
Yes, the interval is fixed. Will update sysfs entry names to be more
specific in v3.
Thanks,
Jon
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs
2025-01-31 14:43 ` Jonathan Cameron
@ 2025-03-04 23:42 ` Jon Pan-Doh
0 siblings, 0 replies; 51+ messages in thread
From: Jon Pan-Doh @ 2025-03-04 23:42 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Lukas Wunner, Bjorn Helgaas, Karolina Stolarek, linux-pci,
Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck
On Fri, Jan 31, 2025 at 6:44 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Sat, 25 Jan 2025 08:39:35 +0100
> Lukas Wunner <lukas@wunner.de> wrote:
> > Masking errors at the register level feels overzealous,
> > in particular because it also disables logging via tracepoints.
> >
> > Is there a concrete device that necessitates this change?
> > If there is, consider adding a quirk for this particular device
> > which masks specific errors, but doesn't affect other devices.
> > If there isn't, consider dropping this change until a buggy device
> > appears that actually needs it.
>
> Fully agree with this comment. At very least this should default
> to not ratelimiting on the tracepoints unless a specific opt in has
> occurred (probably a platform or device driver quirk).
Hi Lukas and Jonathan,
Thanks for the comments. Since IRQ ratelimiting/masking is more
drastic, it requires more nuance/thought (split the series in v2[1] as
a result).
I am not targeting specific devices per say. The intent is to allow
userspace daemons/agents to dynamically collect telemetry/handle spam.
In the context of the datacenter (i.e. several OCP members), there are
many deployments of new HW/configurations where we may see/have seen
error spam when trying to enable native AER. Kernel quirks work in the
medium term (until the underlying device is fixed), but require a
kernel rollout. There is a desire to address this faster (i.e. without
rollout/reinstall) and I think IRQ ratelimiting fits the requirements.
I like the idea of having IRQ ratelimiting off as default though as it
is a big change.
> In particular I'd worry that you are masking whatever errors are
> finally trigger masking. That might be the only one of that
> particular type that was seen and I think the only report we
> see is the 'I masked it message'. So rasdaemon for example
> never sees the error at all. So another tweak would be report
> one last time so we definitely see any given error type at least
> once.
Ack.
[1] https://lore.kernel.org/linux-pci/20250214023543.992372-1-pandoh@google.com/
Thanks,
Jon
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs
2025-01-31 14:55 ` Jonathan Cameron
@ 2025-03-04 23:42 ` Jon Pan-Doh
0 siblings, 0 replies; 51+ messages in thread
From: Jon Pan-Doh @ 2025-03-04 23:42 UTC (permalink / raw)
To: Jonathan Cameron
Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck
On Fri, Jan 31, 2025 at 6:55 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
> So if I follow correctly. We count irqs for any error type and
> then mask whatever was set on one that triggered this rate_limit check?
> That last one isn't reported other than via a log message.
>
> Imagine that is a totally unrelated error to the earlier ones,
> now RASDaemon has no info on it at all as the tracepoint never
> fired. To me that's a very different situation to it knowing there
> were 10 errors of the type vs more.
>
> I'd like to see that final trace point and also to see a tracepoint
> that lets rasdaemon etc know you cut off errors after this point
> + rasdaemon support for using that.
Your understanding is correct. I get your point and will try to
address this when I send out the IRQ ratelimiting series.
Thanks,
Jon
^ permalink raw reply [flat|nested] 51+ messages in thread
end of thread, other threads:[~2025-03-04 23:43 UTC | newest]
Thread overview: 51+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-15 7:42 [PATCH 0/8] Rate limit AER logs/IRQs Jon Pan-Doh
2025-01-15 7:42 ` [PATCH 1/8] PCI/AER: Remove aer_print_port_info Jon Pan-Doh
2025-01-16 14:27 ` Karolina Stolarek
2025-01-18 1:57 ` Jon Pan-Doh
2025-01-20 9:25 ` Karolina Stolarek
2025-02-12 23:20 ` Jon Pan-Doh
2025-01-21 14:18 ` Ilpo Järvinen
2025-02-12 23:20 ` Jon Pan-Doh
2025-01-25 4:15 ` Sathyanarayanan Kuppuswamy
2025-02-12 23:20 ` Jon Pan-Doh
2025-01-15 7:42 ` [PATCH 2/8] PCI/AER: Move AER stat collection out of __aer_print_error Jon Pan-Doh
2025-01-16 14:47 ` Karolina Stolarek
2025-01-18 1:57 ` Jon Pan-Doh
2025-01-25 4:37 ` Sathyanarayanan Kuppuswamy
2025-02-12 23:20 ` Jon Pan-Doh
2025-01-15 7:42 ` [PATCH 3/8] PCI/AER: Rename struct aer_stats to aer_info Jon Pan-Doh
2025-01-16 10:11 ` Karolina Stolarek
2025-01-18 1:59 ` Jon Pan-Doh
2025-01-20 10:13 ` Karolina Stolarek
2025-02-12 23:20 ` Jon Pan-Doh
2025-01-15 7:42 ` [PATCH 4/8] PCI/AER: Introduce ratelimit for error logs Jon Pan-Doh
2025-01-16 11:11 ` Karolina Stolarek
2025-01-18 1:59 ` Jon Pan-Doh
2025-01-20 10:25 ` Karolina Stolarek
2025-01-15 7:42 ` [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs Jon Pan-Doh
2025-01-16 12:02 ` Karolina Stolarek
2025-01-18 1:58 ` Jon Pan-Doh
2025-01-20 10:38 ` Karolina Stolarek
2025-01-25 7:39 ` Lukas Wunner
2025-01-31 14:43 ` Jonathan Cameron
2025-03-04 23:42 ` Jon Pan-Doh
2025-02-06 13:56 ` Karolina Stolarek
2025-02-06 20:25 ` Lukas Wunner
2025-01-31 14:55 ` Jonathan Cameron
2025-03-04 23:42 ` Jon Pan-Doh
2025-01-15 7:42 ` [PATCH 6/8] PCI/AER: Add AER sysfs attributes for ratelimits Jon Pan-Doh
2025-01-31 14:32 ` Jonathan Cameron
2025-02-28 23:11 ` Jon Pan-Doh
2025-01-15 7:42 ` [PATCH 7/8] PCI/AER: Update AER sysfs ABI filename Jon Pan-Doh
2025-01-15 7:43 ` [PATCH 8/8] PCI/AER: Move AER sysfs attributes into separate directory Jon Pan-Doh
2025-01-16 10:26 ` Karolina Stolarek
2025-01-16 17:18 ` Rajat Jain
2025-01-31 14:36 ` Jonathan Cameron
2025-02-12 23:19 ` Jon Pan-Doh
2025-01-23 15:18 ` [PATCH 0/8] Rate limit AER logs/IRQs Bowman, Terry
2025-01-24 6:46 ` Jon Pan-Doh
2025-01-25 7:59 ` Lukas Wunner
2025-02-06 13:32 ` Karolina Stolarek
2025-02-12 23:19 ` Jon Pan-Doh
2025-02-13 16:00 ` Karolina Stolarek
2025-02-14 2:49 ` Jon Pan-Doh
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox