* [PATCH v2 0/8] Rate limit AER logs
@ 2025-02-14 2:35 Jon Pan-Doh
2025-02-14 2:35 ` [PATCH v2 1/8] PCI/AER: Remove aer_print_port_info Jon Pan-Doh
` (7 more replies)
0 siblings, 8 replies; 39+ messages in thread
From: Jon Pan-Doh @ 2025-02-14 2:35 UTC (permalink / raw)
To: Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
Lukas Wunner, Jonathan Cameron, Jon Pan-Doh
Proposal
========
When using native AER, spammy devices can flood kernel logs with AER errors
and slow/stall execution. Add per-device per-error-severity ratelimits
for more robust error logging. Allow userspace to configure ratelimits
via sysfs knobs.
Motivation
==========
Several OCP members have issues with inconsistent PCIe error handling,
exacerbated at datacenter scale (myriad of devices).
The OCP HW/Fault Management subproject set out to solve this by
standardizing across the industry:
- PCIe error handling best practices
- Fault Management/RAS (incl. PCIe errors)
Exposing PCIe errors/debug info in-band for a userspace daemon (e.g.
rasdaemon) to collect/pass on to repairability services is part of the
roadmap.
Background
==========
AER error spam has been observed many times, both publicly (e.g. [1], [2],
[3]) and privately. While it usually occurs with correctable errors, it can
happen with uncorrectable errors (e.g. during new HW bringup).
There have been previous attempts to add ratelimits to AER logs ([4],
[5]). The most recent attempt[5] has many similarities with the proposed
approach.
Patch organization
==================
1-4 AER logging cleanup
5-8 Ratelimits and sysfs knobs
Outstanding work
================
Cleanup:
- Consolidate aer_print_error() and pci_print_aer() paths
Roadmap:
- IRQ ratelimiting
v2:
- Rebased on top of pci/aer (6.14-rc1)
- Split series into log and IRQ ratelimits (defer patch 5)
- Dropped patch 8 (Move AER sysfs)
- Added log level cleanup patch[6] from Karolina's series
- Fixed bug where dpc errors didn't increment counters
- Print "X callbacks suppressed" message immediately instead of on ratelimit release
- Separate documentation into own patch
[1] https://bugzilla.kernel.org/show_bug.cgi?id=215027
[2] https://bugzilla.kernel.org/show_bug.cgi?id=201517
[3] https://bugzilla.kernel.org/show_bug.cgi?id=196183
[4] https://lore.kernel.org/linux-pci/20230606035442.2886343-2-grundler@chromium.org/
[5] https://lore.kernel.org/linux-pci/cover.1736341506.git.karolina.stolarek@oracle.com/
[6] https://lore.kernel.org/linux-pci/edd77011aafad4c0654358a26b4e538d0c5a321d.1736341506.git.karolina.stolarek@oracle.com/
Jon Pan-Doh (7):
PCI/AER: Remove aer_print_port_info
PCI/AER: Move AER stat collection out of __aer_print_error
PCI/AER: Rename struct aer_stats to aer_report
PCI/AER: Introduce ratelimit for error logs
PCI/AER: Add ratelimits to PCI AER Documentation
PCI/AER: Add AER sysfs attributes for log ratelimits
PCI/AER: Update AER sysfs ABI filename
Karolina Stolarek (1):
PCI/AER: Use the same log level for all messages
...es-aer_stats => sysfs-bus-pci-devices-aer} | 20 ++
Documentation/PCI/pcieaer-howto.rst | 13 +-
drivers/pci/pci-sysfs.c | 1 +
drivers/pci/pci.h | 4 +-
drivers/pci/pcie/aer.c | 194 ++++++++++++------
drivers/pci/pcie/dpc.c | 3 +-
include/linux/pci.h | 2 +-
7 files changed, 169 insertions(+), 68 deletions(-)
rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (85%)
--
2.48.1.601.g30ceb7b040-goog
* [PATCH v2 1/8] PCI/AER: Remove aer_print_port_info
2025-02-14 2:35 [PATCH v2 0/8] Rate limit AER logs Jon Pan-Doh
@ 2025-02-14 2:35 ` Jon Pan-Doh
2025-03-04 18:32 ` Bjorn Helgaas
2025-02-14 2:35 ` [PATCH v2 2/8] PCI/AER: Use the same log level for all messages Jon Pan-Doh
` (6 subsequent siblings)
7 siblings, 1 reply; 39+ messages in thread
From: Jon Pan-Doh @ 2025-02-14 2:35 UTC (permalink / raw)
To: Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
Lukas Wunner, Jonathan Cameron, Jon Pan-Doh
The info logged by aer_print_port_info() is duplicated when the source
device is processed. In both cases, BDF and error severity are derived
from aer_err_info. If no source device is found, then an error is logged
with the BDF from aer_err_info.
Code flow:
aer_isr_one_error()
-> aer_print_port_info()
-> find_source_device()
-> return/pci_info() if no device found else continue
-> aer_process_err_devices()
-> aer_print_error()
aer_print_port_info():
pcieport 0000:00:04.0: Correctable error message received
from 0000:01:00.0
aer_print_error():
e1000e 0000:01:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
e1000e 0000:01:00.0: device [8086:10d3] error status/mask=00000040/0000e000
e1000e 0000:01:00.0: [ 6] BadTLP
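Both log paths print the same BDF because both decode it from the 16-bit
source ID. A minimal userspace sketch of that decoding follows. The helper
name format_bdf() is hypothetical and not part of the kernel. The bit split
mirrors what the removed aer_print_port_info() does with info->id,
PCI_SLOT() and PCI_FUNC().

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper (not a kernel function): format a 16-bit PCIe
 * source ID as domain:bus:device.function.
 */
int format_bdf(char *buf, size_t len, unsigned int domain, uint16_t id)
{
	uint8_t bus = id >> 8;                   /* upper byte: bus number */
	uint8_t devfn = id & 0xff;               /* lower byte: device/function */
	unsigned int slot = (devfn >> 3) & 0x1f; /* upper 5 bits of devfn */
	unsigned int func = devfn & 0x07;        /* lower 3 bits of devfn */

	return snprintf(buf, len, "%04x:%02x:%02x.%u", domain, bus, slot, func);
}
```

With an ID of 0x0100 in domain 0, this yields the 0000:01:00.0 seen in the
log lines above.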
Tested using aer-inject[1]. The root port message no longer appears in dmesg.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
Reviewed-by: Karolina Stolarek <karolina.stolarek@oracle.com>
---
drivers/pci/pcie/aer.c | 15 ---------------
1 file changed, 15 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index ad4206125b86..9a8cc81d01e4 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -733,18 +733,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
info->severity, info->tlp_header_valid, &info->tlp);
}
-static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
-{
- u8 bus = info->id >> 8;
- u8 devfn = info->id & 0xff;
-
- pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d\n",
- info->multi_error_valid ? "Multiple " : "",
- aer_error_severity_string[info->severity],
- pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
- PCI_FUNC(devfn));
-}
-
#ifdef CONFIG_ACPI_APEI_PCIEAER
int cper_severity_to_aer(int cper_severity)
{
@@ -1296,7 +1284,6 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
e_info.multi_error_valid = 1;
else
e_info.multi_error_valid = 0;
- aer_print_port_info(pdev, &e_info);
if (find_source_device(pdev, &e_info))
aer_process_err_devices(&e_info);
@@ -1315,8 +1302,6 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
else
e_info.multi_error_valid = 0;
- aer_print_port_info(pdev, &e_info);
-
if (find_source_device(pdev, &e_info))
aer_process_err_devices(&e_info);
}
--
2.48.1.601.g30ceb7b040-goog
* [PATCH v2 2/8] PCI/AER: Use the same log level for all messages
2025-02-14 2:35 [PATCH v2 0/8] Rate limit AER logs Jon Pan-Doh
2025-02-14 2:35 ` [PATCH v2 1/8] PCI/AER: Remove aer_print_port_info Jon Pan-Doh
@ 2025-02-14 2:35 ` Jon Pan-Doh
2025-02-17 11:25 ` Karolina Stolarek
2025-03-04 18:59 ` Bjorn Helgaas
2025-02-14 2:35 ` [PATCH v2 3/8] PCI/AER: Move AER stat collection out of __aer_print_error Jon Pan-Doh
` (5 subsequent siblings)
7 siblings, 2 replies; 39+ messages in thread
From: Jon Pan-Doh @ 2025-02-14 2:35 UTC (permalink / raw)
To: Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
Lukas Wunner, Jonathan Cameron, Jon Pan-Doh
From: Karolina Stolarek <karolina.stolarek@oracle.com>
When reporting an AER error, we check its type multiple times
to determine the log level for each message. Do this check only
in the top-level function and propagate the result down the call
chain. Make aer_print_port_info output to match the level of the
reported error.
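The pattern described above (pick the level once, pass it down) can be
sketched in userspace as follows. All names here are hypothetical stand-ins:
the enum replaces the kernel's AER severities and the strings replace
KERN_WARNING/KERN_ERR. This is an illustration, not the patch's code.

```c
#include <stdio.h>

/* Stand-ins for the kernel's AER severities (values are illustrative). */
enum aer_severity { SEV_CORRECTABLE, SEV_NONFATAL, SEV_FATAL };

/* Stand-ins for KERN_WARNING and KERN_ERR. */
#define LVL_WARNING "warning"
#define LVL_ERR     "err"

/* Decide the log level once, at the top of the call chain. */
const char *level_for_severity(enum aer_severity sev)
{
	return sev == SEV_CORRECTABLE ? LVL_WARNING : LVL_ERR;
}

/* Leaf printer: receives the level instead of re-deriving it. */
int print_detail(char *buf, size_t len, const char *level, const char *msg)
{
	return snprintf(buf, len, "<%s> %s", level, msg);
}

/* Top-level reporter: every message for one error shares one level. */
int report_error(char *buf, size_t len, enum aer_severity sev, const char *msg)
{
	const char *level = level_for_severity(sev);

	return print_detail(buf, len, level, msg);
}
```

Keeping the severity check in one place means a new message added lower in
the chain cannot accidentally pick a different level.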
Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
Reviewed-by: Jon Pan-Doh <pandoh@google.com>
---
drivers/pci/pci.h | 2 +-
drivers/pci/pcie/aer.c | 43 ++++++++++++++++++++++--------------------
drivers/pci/pcie/dpc.c | 2 +-
3 files changed, 25 insertions(+), 22 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 01e51db8d285..8cb816ee5388 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -550,7 +550,7 @@ struct aer_err_info {
};
int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);
-void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
+void aer_print_error(struct pci_dev *dev, struct aer_err_info *info, const char *level);
int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
unsigned int tlp_len, struct pcie_tlp_log *log);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9a8cc81d01e4..f1fdaa052cf6 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -670,20 +670,18 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
}
static void __aer_print_error(struct pci_dev *dev,
- struct aer_err_info *info)
+ struct aer_err_info *info,
+ const char *level)
{
const char **strings;
unsigned long status = info->status & ~info->mask;
- const char *level, *errmsg;
+ const char *errmsg;
int i;
- if (info->severity == AER_CORRECTABLE) {
+ if (info->severity == AER_CORRECTABLE)
strings = aer_correctable_error_string;
- level = KERN_WARNING;
- } else {
+ else
strings = aer_uncorrectable_error_string;
- level = KERN_ERR;
- }
for_each_set_bit(i, &status, 32) {
errmsg = strings[i];
@@ -696,11 +694,11 @@ static void __aer_print_error(struct pci_dev *dev,
pci_dev_aer_stats_incr(dev, info);
}
-void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
+void aer_print_error(struct pci_dev *dev, struct aer_err_info *info,
+ const char *level)
{
int layer, agent;
int id = pci_dev_id(dev);
- const char *level;
if (!info->status) {
pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
@@ -711,8 +709,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
layer = AER_GET_LAYER_ERROR(info->severity, info->status);
agent = AER_GET_AGENT(info->severity, info->status);
- level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
-
aer_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
aer_error_severity_string[info->severity],
aer_error_layer[layer], aer_agent_string[agent]);
@@ -720,7 +716,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
aer_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
dev->vendor, dev->device, info->status, info->mask);
- __aer_print_error(dev, info);
+ __aer_print_error(dev, info, level);
if (info->tlp_header_valid)
pcie_print_tlp_log(dev, &info->tlp, dev_fmt(" "));
@@ -753,15 +749,18 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
{
int layer, agent, tlp_header_valid = 0;
u32 status, mask;
+ const char *level;
struct aer_err_info info;
if (aer_severity == AER_CORRECTABLE) {
status = aer->cor_status;
mask = aer->cor_mask;
+ level = KERN_WARNING;
} else {
status = aer->uncor_status;
mask = aer->uncor_mask;
tlp_header_valid = status & AER_LOG_TLP_MASKS;
+ level = KERN_ERR;
}
layer = AER_GET_LAYER_ERROR(aer_severity, status);
@@ -773,13 +772,13 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
info.mask = mask;
info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
- pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
- __aer_print_error(dev, &info);
- pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
+ aer_printk(level, dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
+ __aer_print_error(dev, &info, level);
+ aer_printk(level, dev, "aer_layer=%s, aer_agent=%s\n",
aer_error_layer[layer], aer_agent_string[agent]);
if (aer_severity != AER_CORRECTABLE)
- pci_err(dev, "aer_uncor_severity: 0x%08x\n",
+ aer_printk(level, dev, "aer_uncor_severity: 0x%08x\n",
aer->uncor_severity);
if (tlp_header_valid)
@@ -1244,14 +1243,15 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
return 1;
}
-static inline void aer_process_err_devices(struct aer_err_info *e_info)
+static inline void aer_process_err_devices(struct aer_err_info *e_info,
+ const char *level)
{
int i;
/* Report all before handle them, not to lost records by reset etc. */
for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
if (aer_get_device_error_info(e_info->dev[i], e_info))
- aer_print_error(e_info->dev[i], e_info);
+ aer_print_error(e_info->dev[i], e_info, level);
}
for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
if (aer_get_device_error_info(e_info->dev[i], e_info))
@@ -1269,6 +1269,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
{
struct pci_dev *pdev = rpc->rpd;
struct aer_err_info e_info;
+ const char *level;
pci_rootport_aer_stats_incr(pdev, e_src);
@@ -1279,6 +1280,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
if (e_src->status & PCI_ERR_ROOT_COR_RCV) {
e_info.id = ERR_COR_ID(e_src->id);
e_info.severity = AER_CORRECTABLE;
+ level = KERN_WARNING;
if (e_src->status & PCI_ERR_ROOT_MULTI_COR_RCV)
e_info.multi_error_valid = 1;
@@ -1286,11 +1288,12 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
e_info.multi_error_valid = 0;
if (find_source_device(pdev, &e_info))
- aer_process_err_devices(&e_info);
+ aer_process_err_devices(&e_info, level);
}
if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
e_info.id = ERR_UNCOR_ID(e_src->id);
+ level = KERN_ERR;
if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
e_info.severity = AER_FATAL;
@@ -1303,7 +1306,7 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
e_info.multi_error_valid = 0;
if (find_source_device(pdev, &e_info))
- aer_process_err_devices(&e_info);
+ aer_process_err_devices(&e_info, level);
}
}
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index 242cabd5eeeb..f06fad95f2eb 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -287,7 +287,7 @@ void dpc_process_error(struct pci_dev *pdev)
else if (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_UNCOR &&
dpc_get_aer_uncorrect_severity(pdev, &info) &&
aer_get_device_error_info(pdev, &info)) {
- aer_print_error(pdev, &info);
+ aer_print_error(pdev, &info, KERN_ERR);
pci_aer_clear_nonfatal_status(pdev);
pci_aer_clear_fatal_status(pdev);
}
--
2.48.1.601.g30ceb7b040-goog
* [PATCH v2 3/8] PCI/AER: Move AER stat collection out of __aer_print_error
2025-02-14 2:35 [PATCH v2 0/8] Rate limit AER logs Jon Pan-Doh
2025-02-14 2:35 ` [PATCH v2 1/8] PCI/AER: Remove aer_print_port_info Jon Pan-Doh
2025-02-14 2:35 ` [PATCH v2 2/8] PCI/AER: Use the same log level for all messages Jon Pan-Doh
@ 2025-02-14 2:35 ` Jon Pan-Doh
2025-02-17 11:29 ` Karolina Stolarek
2025-02-14 2:35 ` [PATCH v2 4/8] PCI/AER: Rename struct aer_stats to aer_report Jon Pan-Doh
` (4 subsequent siblings)
7 siblings, 1 reply; 39+ messages in thread
From: Jon Pan-Doh @ 2025-02-14 2:35 UTC (permalink / raw)
To: Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
Lukas Wunner, Jonathan Cameron, Jon Pan-Doh
Decouple stat collection from internal AER print functions. AERs from ghes
or cxl drivers have stat collection in pci_print_aer as that is where
aer_err_info is populated.
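A rough userspace sketch of stat collection as a step that is independent of
printing is shown below. It is loosely modeled on what
pci_dev_aer_stats_incr() appears to do with status & ~mask. The struct and
function names are hypothetical, and the per-severity split of the real
counters is omitted.

```c
#include <stdint.h>

#define MAX_ERR_TYPES 32 /* hypothetical: one counter per status bit */

struct err_counters {
	uint64_t total;                   /* events seen */
	uint64_t per_type[MAX_ERR_TYPES]; /* events per error type */
};

/*
 * Count one error event: bump the total, then bump one counter for each
 * status bit that is set and not masked. Nothing is printed here, so the
 * caller can count first and decide separately whether to log.
 */
void count_error(struct err_counters *c, uint32_t status, uint32_t mask)
{
	uint32_t active = status & ~mask; /* ignore masked errors */
	int i;

	c->total++;
	for (i = 0; i < MAX_ERR_TYPES; i++)
		if (active & (UINT32_C(1) << i))
			c->per_type[i]++;
}
```

For the status/mask pair 00000040/0000e000 from the patch 1 log, only bit 6
(BadTLP) is counted.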
Tested using aer-inject[1]. AER sysfs counters still updated correctly.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
---
drivers/pci/pci.h | 1 +
drivers/pci/pcie/aer.c | 10 ++++++----
drivers/pci/pcie/dpc.c | 1 +
3 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 8cb816ee5388..26104aee06c0 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -550,6 +550,7 @@ struct aer_err_info {
};
int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);
+void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info);
void aer_print_error(struct pci_dev *dev, struct aer_err_info *info, const char *level);
int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index f1fdaa052cf6..d6edb95d468f 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -617,8 +617,7 @@ const struct attribute_group aer_stats_attr_group = {
.is_visible = aer_stats_attrs_are_visible,
};
-static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
- struct aer_err_info *info)
+void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info)
{
unsigned long status = info->status & ~info->mask;
int i, max = -1;
@@ -691,7 +690,6 @@ static void __aer_print_error(struct pci_dev *dev,
aer_printk(level, dev, " [%2d] %-22s%s\n", i, errmsg,
info->first_error == i ? " (First)" : "");
}
- pci_dev_aer_stats_incr(dev, info);
}
void aer_print_error(struct pci_dev *dev, struct aer_err_info *info,
@@ -772,6 +770,8 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
info.mask = mask;
info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
+ pci_dev_aer_stats_incr(dev, &info);
+
aer_printk(level, dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
__aer_print_error(dev, &info, level);
aer_printk(level, dev, "aer_layer=%s, aer_agent=%s\n",
@@ -1250,8 +1250,10 @@ static inline void aer_process_err_devices(struct aer_err_info *e_info,
/* Report all before handle them, not to lost records by reset etc. */
for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
- if (aer_get_device_error_info(e_info->dev[i], e_info))
+ if (aer_get_device_error_info(e_info->dev[i], e_info)) {
+ pci_dev_aer_stats_incr(e_info->dev[i], e_info);
aer_print_error(e_info->dev[i], e_info, level);
+ }
}
for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
if (aer_get_device_error_info(e_info->dev[i], e_info))
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index f06fad95f2eb..a85ea76b4dea 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -287,6 +287,7 @@ void dpc_process_error(struct pci_dev *pdev)
else if (reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_UNCOR &&
dpc_get_aer_uncorrect_severity(pdev, &info) &&
aer_get_device_error_info(pdev, &info)) {
+ pci_dev_aer_stats_incr(pdev, &info);
aer_print_error(pdev, &info, KERN_ERR);
pci_aer_clear_nonfatal_status(pdev);
pci_aer_clear_fatal_status(pdev);
--
2.48.1.601.g30ceb7b040-goog
* [PATCH v2 4/8] PCI/AER: Rename struct aer_stats to aer_report
2025-02-14 2:35 [PATCH v2 0/8] Rate limit AER logs Jon Pan-Doh
` (2 preceding siblings ...)
2025-02-14 2:35 ` [PATCH v2 3/8] PCI/AER: Move AER stat collection out of __aer_print_error Jon Pan-Doh
@ 2025-02-14 2:35 ` Jon Pan-Doh
2025-02-17 11:29 ` Karolina Stolarek
2025-02-14 2:35 ` [PATCH v2 5/8] PCI/AER: Introduce ratelimit for error logs Jon Pan-Doh
` (3 subsequent siblings)
7 siblings, 1 reply; 39+ messages in thread
From: Jon Pan-Doh @ 2025-02-14 2:35 UTC (permalink / raw)
To: Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
Lukas Wunner, Jonathan Cameron, Jon Pan-Doh
Update name to reflect the broader definition of structs/variables that
are stored (e.g. ratelimits).
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
---
drivers/pci/pcie/aer.c | 50 +++++++++++++++++++++---------------------
include/linux/pci.h | 2 +-
2 files changed, 26 insertions(+), 26 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d6edb95d468f..b4f902fd5ef6 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -54,11 +54,11 @@ struct aer_rpc {
DECLARE_KFIFO(aer_fifo, struct aer_err_source, AER_ERROR_SOURCES_MAX);
};
-/* AER stats for the device */
-struct aer_stats {
+/* AER report for the device */
+struct aer_report {
/*
- * Fields for all AER capable devices. They indicate the errors
+ * Stats for all AER capable devices. They indicate the errors
* "as seen by this device". Note that this may mean that if an
* end point is causing problems, the AER counters may increment
* at its link partner (e.g. root port) because the errors will be
@@ -80,7 +80,7 @@ struct aer_stats {
u64 dev_total_nonfatal_errs;
/*
- * Fields for Root ports & root complex event collectors only, these
+ * Stats for Root ports & root complex event collectors only, these
* indicate the total number of ERR_COR, ERR_FATAL, and ERR_NONFATAL
* messages received by the root port / event collector, INCLUDING the
* ones that are generated internally (by the rootport itself)
@@ -377,7 +377,7 @@ void pci_aer_init(struct pci_dev *dev)
if (!dev->aer_cap)
return;
- dev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
+ dev->aer_report = kzalloc(sizeof(struct aer_report), GFP_KERNEL);
/*
* We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
@@ -398,8 +398,8 @@ void pci_aer_init(struct pci_dev *dev)
void pci_aer_exit(struct pci_dev *dev)
{
- kfree(dev->aer_stats);
- dev->aer_stats = NULL;
+ kfree(dev->aer_report);
+ dev->aer_report = NULL;
}
#define AER_AGENT_RECEIVER 0
@@ -537,10 +537,10 @@ static const char *aer_agent_string[] = {
{ \
unsigned int i; \
struct pci_dev *pdev = to_pci_dev(dev); \
- u64 *stats = pdev->aer_stats->stats_array; \
+ u64 *stats = pdev->aer_report->stats_array; \
size_t len = 0; \
\
- for (i = 0; i < ARRAY_SIZE(pdev->aer_stats->stats_array); i++) {\
+ for (i = 0; i < ARRAY_SIZE(pdev->aer_report->stats_array); i++) {\
if (strings_array[i]) \
len += sysfs_emit_at(buf, len, "%s %llu\n", \
strings_array[i], \
@@ -551,7 +551,7 @@ static const char *aer_agent_string[] = {
i, stats[i]); \
} \
len += sysfs_emit_at(buf, len, "TOTAL_%s %llu\n", total_string, \
- pdev->aer_stats->total_field); \
+ pdev->aer_report->total_field); \
return len; \
} \
static DEVICE_ATTR_RO(name)
@@ -572,7 +572,7 @@ aer_stats_dev_attr(aer_dev_nonfatal, dev_nonfatal_errs,
char *buf) \
{ \
struct pci_dev *pdev = to_pci_dev(dev); \
- return sysfs_emit(buf, "%llu\n", pdev->aer_stats->field); \
+ return sysfs_emit(buf, "%llu\n", pdev->aer_report->field); \
} \
static DEVICE_ATTR_RO(name)
@@ -599,7 +599,7 @@ static umode_t aer_stats_attrs_are_visible(struct kobject *kobj,
struct device *dev = kobj_to_dev(kobj);
struct pci_dev *pdev = to_pci_dev(dev);
- if (!pdev->aer_stats)
+ if (!pdev->aer_report)
return 0;
if ((a == &dev_attr_aer_rootport_total_err_cor.attr ||
@@ -622,25 +622,25 @@ void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info)
unsigned long status = info->status & ~info->mask;
int i, max = -1;
u64 *counter = NULL;
- struct aer_stats *aer_stats = pdev->aer_stats;
+ struct aer_report *aer_report = pdev->aer_report;
- if (!aer_stats)
+ if (!aer_report)
return;
switch (info->severity) {
case AER_CORRECTABLE:
- aer_stats->dev_total_cor_errs++;
- counter = &aer_stats->dev_cor_errs[0];
+ aer_report->dev_total_cor_errs++;
+ counter = &aer_report->dev_cor_errs[0];
max = AER_MAX_TYPEOF_COR_ERRS;
break;
case AER_NONFATAL:
- aer_stats->dev_total_nonfatal_errs++;
- counter = &aer_stats->dev_nonfatal_errs[0];
+ aer_report->dev_total_nonfatal_errs++;
+ counter = &aer_report->dev_nonfatal_errs[0];
max = AER_MAX_TYPEOF_UNCOR_ERRS;
break;
case AER_FATAL:
- aer_stats->dev_total_fatal_errs++;
- counter = &aer_stats->dev_fatal_errs[0];
+ aer_report->dev_total_fatal_errs++;
+ counter = &aer_report->dev_fatal_errs[0];
max = AER_MAX_TYPEOF_UNCOR_ERRS;
break;
}
@@ -652,19 +652,19 @@ void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info)
static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
struct aer_err_source *e_src)
{
- struct aer_stats *aer_stats = pdev->aer_stats;
+ struct aer_report *aer_report = pdev->aer_report;
- if (!aer_stats)
+ if (!aer_report)
return;
if (e_src->status & PCI_ERR_ROOT_COR_RCV)
- aer_stats->rootport_total_cor_errs++;
+ aer_report->rootport_total_cor_errs++;
if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
if (e_src->status & PCI_ERR_ROOT_FATAL_RCV)
- aer_stats->rootport_total_fatal_errs++;
+ aer_report->rootport_total_fatal_errs++;
else
- aer_stats->rootport_total_nonfatal_errs++;
+ aer_report->rootport_total_nonfatal_errs++;
}
}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 3c2c8d8eb1fd..7f55009a626b 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -346,7 +346,7 @@ struct pci_dev {
u8 hdr_type; /* PCI header type (`multi' flag masked out) */
#ifdef CONFIG_PCIEAER
u16 aer_cap; /* AER capability offset */
- struct aer_stats *aer_stats; /* AER stats for this device */
+ struct aer_report *aer_report; /* AER report for this device */
#endif
#ifdef CONFIG_PCIEPORTBUS
struct rcec_ea *rcec_ea; /* RCEC cached endpoint association */
--
2.48.1.601.g30ceb7b040-goog
* [PATCH v2 5/8] PCI/AER: Introduce ratelimit for error logs
2025-02-14 2:35 [PATCH v2 0/8] Rate limit AER logs Jon Pan-Doh
` (3 preceding siblings ...)
2025-02-14 2:35 ` [PATCH v2 4/8] PCI/AER: Rename struct aer_stats to aer_report Jon Pan-Doh
@ 2025-02-14 2:35 ` Jon Pan-Doh
2025-02-17 11:29 ` Karolina Stolarek
2025-02-14 2:35 ` [PATCH v2 6/8] PCI/AER: Add ratelimits to PCI AER Documentation Jon Pan-Doh
` (2 subsequent siblings)
7 siblings, 1 reply; 39+ messages in thread
From: Jon Pan-Doh @ 2025-02-14 2:35 UTC (permalink / raw)
To: Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
Lukas Wunner, Jonathan Cameron, Jon Pan-Doh
Spammy devices can flood kernel logs with AER errors and slow/stall
execution. Add per-device ratelimits for AER errors (correctable and
uncorrectable). Set the default rate to the default kernel ratelimit
(10 per 5s).
Tested using aer-inject[1]. Sent 11 AER errors. Observed 10 errors logged
while AER stats (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) show
true count of 11.
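The test result above (11 errors counted, 10 logged) can be reproduced with
a small userspace model. This is a simplified fixed-window limiter with
hypothetical names, written only to illustrate the behavior. It is not the
kernel's __ratelimit() implementation and ignores locking and the
"callbacks suppressed" message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a fixed-window ratelimit. */
struct log_ratelimit {
	uint64_t interval_ms;  /* window length */
	unsigned int burst;    /* messages allowed per window */
	uint64_t window_start; /* start of the current window */
	unsigned int printed;  /* messages emitted in this window */
	unsigned int missed;   /* messages suppressed in this window */
	bool started;          /* false until the first event arrives */
};

/* Return true if a message arriving at now_ms may be logged. */
bool ratelimit_allow(struct log_ratelimit *rl, uint64_t now_ms)
{
	/* Open a new window on first use or once the interval has elapsed. */
	if (!rl->started || now_ms - rl->window_start >= rl->interval_ms) {
		rl->window_start = now_ms;
		rl->printed = 0;
		rl->missed = 0;
		rl->started = true;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	rl->missed++;
	return false;
}

/* Hypothetical per-device state: a counter plus a log ratelimit. */
struct dev_report {
	uint64_t total_errs;          /* always incremented */
	uint64_t logged;              /* incremented only when not limited */
	struct log_ratelimit cor_log; /* limiter for correctable errors */
};

/* Handle one correctable error: count it, then log it if allowed. */
void report_correctable(struct dev_report *r, uint64_t now_ms)
{
	r->total_errs++; /* stats are never ratelimited */
	if (ratelimit_allow(&r->cor_log, now_ms))
		r->logged++; /* stands in for the actual log lines */
}
```

With interval_ms = 5000 and burst = 10, eleven calls within the same window
leave total_errs at 11 and logged at 10. Counting before the ratelimit check
is what keeps the sysfs counters accurate while the log is throttled.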
[1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
---
drivers/pci/pcie/aer.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index b4f902fd5ef6..c5b5381e2930 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -28,6 +28,7 @@
#include <linux/interrupt.h>
#include <linux/delay.h>
#include <linux/kfifo.h>
+#include <linux/ratelimit.h>
#include <linux/slab.h>
#include <acpi/apei.h>
#include <acpi/ghes.h>
@@ -88,6 +89,10 @@ struct aer_report {
u64 rootport_total_cor_errs;
u64 rootport_total_fatal_errs;
u64 rootport_total_nonfatal_errs;
+
+ /* Ratelimits for errors */
+ struct ratelimit_state cor_log_ratelimit;
+ struct ratelimit_state uncor_log_ratelimit;
};
#define AER_LOG_TLP_MASKS (PCI_ERR_UNC_POISON_TLP| \
@@ -378,6 +383,10 @@ void pci_aer_init(struct pci_dev *dev)
return;
dev->aer_report = kzalloc(sizeof(struct aer_report), GFP_KERNEL);
+ ratelimit_state_init(&dev->aer_report->cor_log_ratelimit,
+ DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
+ ratelimit_state_init(&dev->aer_report->uncor_log_ratelimit,
+ DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
/*
* We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
@@ -697,6 +706,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info,
{
int layer, agent;
int id = pci_dev_id(dev);
+ struct ratelimit_state *ratelimit;
if (!info->status) {
pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
@@ -704,6 +714,14 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info,
goto out;
}
+ if (info->severity == AER_CORRECTABLE)
+ ratelimit = &dev->aer_report->cor_log_ratelimit;
+ else
+ ratelimit = &dev->aer_report->uncor_log_ratelimit;
+
+ if (!__ratelimit(ratelimit))
+ return;
+
layer = AER_GET_LAYER_ERROR(info->severity, info->status);
agent = AER_GET_AGENT(info->severity, info->status);
@@ -749,12 +767,15 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
u32 status, mask;
const char *level;
struct aer_err_info info;
+ struct ratelimit_state *ratelimit;
if (aer_severity == AER_CORRECTABLE) {
+ ratelimit = &dev->aer_report->cor_log_ratelimit;
status = aer->cor_status;
mask = aer->cor_mask;
level = KERN_WARNING;
} else {
+ ratelimit = &dev->aer_report->uncor_log_ratelimit;
status = aer->uncor_status;
mask = aer->uncor_mask;
tlp_header_valid = status & AER_LOG_TLP_MASKS;
@@ -772,6 +793,9 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
pci_dev_aer_stats_incr(dev, &info);
+ if (!__ratelimit(ratelimit))
+ return;
+
aer_printk(level, dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
__aer_print_error(dev, &info, level);
aer_printk(level, dev, "aer_layer=%s, aer_agent=%s\n",
--
2.48.1.601.g30ceb7b040-goog
* [PATCH v2 6/8] PCI/AER: Add ratelimits to PCI AER Documentation
2025-02-14 2:35 [PATCH v2 0/8] Rate limit AER logs Jon Pan-Doh
` (4 preceding siblings ...)
2025-02-14 2:35 ` [PATCH v2 5/8] PCI/AER: Introduce ratelimit for error logs Jon Pan-Doh
@ 2025-02-14 2:35 ` Jon Pan-Doh
2025-02-17 11:30 ` Karolina Stolarek
2025-02-14 2:35 ` [PATCH v2 7/8] PCI/AER: Add AER sysfs attributes for log ratelimits Jon Pan-Doh
2025-02-14 2:35 ` [PATCH v2 8/8] PCI/AER: Update AER sysfs ABI filename Jon Pan-Doh
7 siblings, 1 reply; 39+ messages in thread
From: Jon Pan-Doh @ 2025-02-14 2:35 UTC (permalink / raw)
To: Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
Lukas Wunner, Jonathan Cameron, Jon Pan-Doh
Add ratelimits section for rationale and defaults.
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
---
Documentation/PCI/pcieaer-howto.rst | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
index f013f3b27c82..167c0b277b62 100644
--- a/Documentation/PCI/pcieaer-howto.rst
+++ b/Documentation/PCI/pcieaer-howto.rst
@@ -85,6 +85,14 @@ In the example, 'Requester ID' means the ID of the device that sent
the error message to the Root Port. Please refer to PCIe specs for other
fields.
+AER Ratelimits
+--------------
+
+Error messages are ratelimited per device and error type. This prevents
+spammy devices from flooding the console and stalling execution. Set the
+default ratelimit to DEFAULT_RATELIMIT_BURST over
+DEFAULT_RATELIMIT_INTERVAL (10 per 5 seconds).
+
AER Statistics / Counters
-------------------------
--
2.48.1.601.g30ceb7b040-goog
* [PATCH v2 7/8] PCI/AER: Add AER sysfs attributes for log ratelimits
2025-02-14 2:35 [PATCH v2 0/8] Rate limit AER logs Jon Pan-Doh
` (5 preceding siblings ...)
2025-02-14 2:35 ` [PATCH v2 6/8] PCI/AER: Add ratelimits to PCI AER Documentation Jon Pan-Doh
@ 2025-02-14 2:35 ` Jon Pan-Doh
2025-02-17 13:31 ` Karolina Stolarek
2025-02-25 13:56 ` Karolina Stolarek
2025-02-14 2:35 ` [PATCH v2 8/8] PCI/AER: Update AER sysfs ABI filename Jon Pan-Doh
7 siblings, 2 replies; 39+ messages in thread
From: Jon Pan-Doh @ 2025-02-14 2:35 UTC (permalink / raw)
To: Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
Lukas Wunner, Jonathan Cameron, Jon Pan-Doh
Allow userspace to read/write log ratelimits per device. Create aer/ sysfs
directory to store them and any future aer configs.
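As a rough illustration of how a userspace tool might use these knobs, the
sketch below builds the attribute path and writes a new burst value. The
helper names are hypothetical. The attribute names ratelimit_cor_log and
ratelimit_uncor_log are taken from the ABI text in this patch and may change
while the series is under review.

```c
#include <stdio.h>

/*
 * Build the path of a log-ratelimit attribute for a device.
 * "bdf" is e.g. "0000:01:00.0"; "kind" is "cor" or "uncor".
 */
int aer_ratelimit_path(char *buf, size_t len, const char *bdf, const char *kind)
{
	return snprintf(buf, len,
			"/sys/bus/pci/devices/%s/aer/ratelimit_%s_log",
			bdf, kind);
}

/* Write a new burst value to the attribute at "path"; 0 on success. */
int write_burst(const char *path, unsigned int burst)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%u\n", burst) > 0;
	if (fclose(f) != 0) /* the write may only fail at flush time */
		ok = 0;
	return ok ? 0 : -1;
}
```

Checking the fclose() result matters here because stdio buffers the write,
so a rejected value may only surface as an error when the stream is flushed.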
Tested using aer-inject[1]. Configured correctable log ratelimits to 5.
Sent 6 AER errors. Observed 5 errors logged while AER stats
(cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) shows 6.
[1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
---
.../testing/sysfs-bus-pci-devices-aer_stats | 20 +++++++
Documentation/PCI/pcieaer-howto.rst | 3 ++
drivers/pci/pci-sysfs.c | 1 +
drivers/pci/pci.h | 1 +
drivers/pci/pcie/aer.c | 52 +++++++++++++++++++
5 files changed, 77 insertions(+)
diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
index d1f67bb81d5d..c1221614c079 100644
--- a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
+++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
@@ -117,3 +117,23 @@ Date: July 2018
KernelVersion: 4.19.0
Contact: linux-pci@vger.kernel.org, rajatja@google.com
Description: Total number of ERR_NONFATAL messages reported to rootport.
+
+PCIe AER ratelimits
+-------------------
+
+These attributes show up under all the devices that are AER capable.
+Provides configurable ratelimits of logs/irq per error type. Writing a
+nonzero value changes the number of errors (burst) allowed per 5 second
+window before ratelimiting. Reading gets the current ratelimits.
+
+What: /sys/bus/pci/devices/<dev>/aer/ratelimit_cor_log
+Date: March 2025
+KernelVersion: 6.15.0
+Contact: linux-pci@vger.kernel.org, pandoh@google.com
+Description: Log ratelimit for correctable errors.
+
+What: /sys/bus/pci/devices/<dev>/aer/ratelimit_uncor_log
+Date: March 2025
+KernelVersion: 6.15.0
+Contact: linux-pci@vger.kernel.org, pandoh@google.com
+Description: Log ratelimit for uncorrectable errors.
diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
index 167c0b277b62..ab5b0f232204 100644
--- a/Documentation/PCI/pcieaer-howto.rst
+++ b/Documentation/PCI/pcieaer-howto.rst
@@ -93,6 +93,9 @@ spammy devices from flooding the console and stalling execution. Set the
default ratelimit to DEFAULT_RATELIMIT_BURST over
DEFAULT_RATELIMIT_INTERVAL (10 per 5 seconds).
+Ratelimits are exposed in the form of sysfs attributes and configurable.
+See Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats.
+
AER Statistics / Counters
-------------------------
diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
index b46ce1a2c554..16de3093294e 100644
--- a/drivers/pci/pci-sysfs.c
+++ b/drivers/pci/pci-sysfs.c
@@ -1801,6 +1801,7 @@ const struct attribute_group *pci_dev_attr_groups[] = {
&pcie_dev_attr_group,
#ifdef CONFIG_PCIEAER
&aer_stats_attr_group,
+ &aer_attr_group,
#endif
#ifdef CONFIG_PCIEASPM
&aspm_ctrl_attr_group,
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 26104aee06c0..26d30a99c48b 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -887,6 +887,7 @@ void pci_no_aer(void);
void pci_aer_init(struct pci_dev *dev);
void pci_aer_exit(struct pci_dev *dev);
extern const struct attribute_group aer_stats_attr_group;
+extern const struct attribute_group aer_attr_group;
void pci_aer_clear_fatal_status(struct pci_dev *dev);
int pci_aer_clear_status(struct pci_dev *dev);
int pci_aer_raw_clear_status(struct pci_dev *dev);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index c5b5381e2930..1237faee6542 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -626,6 +626,58 @@ const struct attribute_group aer_stats_attr_group = {
.is_visible = aer_stats_attrs_are_visible,
};
+#define aer_ratelimit_attr(name, ratelimit) \
+ static ssize_t \
+ name##_show(struct device *dev, struct device_attribute *attr, \
+ char *buf) \
+{ \
+ struct pci_dev *pdev = to_pci_dev(dev); \
+ return sysfs_emit(buf, "%u errors every %u seconds\n", \
+ pdev->aer_report->ratelimit.burst, \
+ pdev->aer_report->ratelimit.interval / HZ); \
+} \
+ \
+ static ssize_t \
+ name##_store(struct device *dev, struct device_attribute *attr, \
+ const char *buf, size_t count) \
+{ \
+ struct pci_dev *pdev = to_pci_dev(dev); \
+ int burst; \
+ \
+ if (kstrtoint(buf, 0, &burst) < 0) \
+ return -EINVAL; \
+ \
+ pdev->aer_report->ratelimit.burst = burst; \
+ return count; \
+} \
+static DEVICE_ATTR_RW(name)
+
+aer_ratelimit_attr(ratelimit_cor_log, cor_log_ratelimit);
+aer_ratelimit_attr(ratelimit_uncor_log, uncor_log_ratelimit);
+
+static struct attribute *aer_attrs[] __ro_after_init = {
+ &dev_attr_ratelimit_cor_log.attr,
+ &dev_attr_ratelimit_uncor_log.attr,
+ NULL
+};
+
+static umode_t aer_attrs_are_visible(struct kobject *kobj,
+ struct attribute *a, int n)
+{
+ struct device *dev = kobj_to_dev(kobj);
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ if (!pdev->aer_report)
+ return 0;
+ return a->mode;
+}
+
+const struct attribute_group aer_attr_group = {
+ .name = "aer",
+ .attrs = aer_attrs,
+ .is_visible = aer_attrs_are_visible,
+};
+
void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info)
{
unsigned long status = info->status & ~info->mask;
--
2.48.1.601.g30ceb7b040-goog
* [PATCH v2 8/8] PCI/AER: Update AER sysfs ABI filename
2025-02-14 2:35 [PATCH v2 0/8] Rate limit AER logs Jon Pan-Doh
` (6 preceding siblings ...)
2025-02-14 2:35 ` [PATCH v2 7/8] PCI/AER: Add AER sysfs attributes for log ratelimits Jon Pan-Doh
@ 2025-02-14 2:35 ` Jon Pan-Doh
7 siblings, 0 replies; 39+ messages in thread
From: Jon Pan-Doh @ 2025-02-14 2:35 UTC (permalink / raw)
To: Bjorn Helgaas, Karolina Stolarek
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
Lukas Wunner, Jonathan Cameron, Jon Pan-Doh
Change Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats to
Documentation/ABI/testing/sysfs-bus-pci-devices-aer to reflect the broader
scope of AER sysfs attributes (e.g. stats and ratelimits).
Signed-off-by: Jon Pan-Doh <pandoh@google.com>
---
...fs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} | 0
Documentation/PCI/pcieaer-howto.rst | 4 ++--
2 files changed, 2 insertions(+), 2 deletions(-)
rename Documentation/ABI/testing/{sysfs-bus-pci-devices-aer_stats => sysfs-bus-pci-devices-aer} (100%)
diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer
similarity index 100%
rename from Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
rename to Documentation/ABI/testing/sysfs-bus-pci-devices-aer
diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
index ab5b0f232204..912d70b9e178 100644
--- a/Documentation/PCI/pcieaer-howto.rst
+++ b/Documentation/PCI/pcieaer-howto.rst
@@ -94,14 +94,14 @@ default ratelimit to DEFAULT_RATELIMIT_BURST over
DEFAULT_RATELIMIT_INTERVAL (10 per 5 seconds).
Ratelimits are exposed in the form of sysfs attributes and configurable.
-See Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats.
+See Documentation/ABI/testing/sysfs-bus-pci-devices-aer.
AER Statistics / Counters
-------------------------
When PCIe AER errors are captured, the counters / statistics are also exposed
in the form of sysfs attributes which are documented at
-Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
+Documentation/ABI/testing/sysfs-bus-pci-devices-aer.
Developer Guide
===============
--
2.48.1.601.g30ceb7b040-goog
* Re: [PATCH v2 2/8] PCI/AER: Use the same log level for all messages
2025-02-14 2:35 ` [PATCH v2 2/8] PCI/AER: Use the same log level for all messages Jon Pan-Doh
@ 2025-02-17 11:25 ` Karolina Stolarek
2025-02-19 2:48 ` Jon Pan-Doh
2025-03-04 18:59 ` Bjorn Helgaas
1 sibling, 1 reply; 39+ messages in thread
From: Karolina Stolarek @ 2025-02-17 11:25 UTC (permalink / raw)
To: Jon Pan-Doh, Bjorn Helgaas
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
Lukas Wunner, Jonathan Cameron
On 14/02/2025 03:35, Jon Pan-Doh wrote:
> From: Karolina Stolarek <karolina.stolarek@oracle.com>
>
> When reporting an AER error, we check its type multiple times
> to determine the log level for each message. Do this check only
> in the top-level function and propagate the result down the call
> chain. Make aer_print_port_info output to match the level of the
> reported error.
>
> Signed-off-by: Karolina Stolarek <karolina.stolarek@oracle.com>
> Reviewed-by: Jon Pan-Doh <pandoh@google.com>
> ---
> drivers/pci/pci.h | 2 +-
> drivers/pci/pcie/aer.c | 43 ++++++++++++++++++++++--------------------
> drivers/pci/pcie/dpc.c | 2 +-
> 3 files changed, 25 insertions(+), 22 deletions(-)
(...)
> @@ -773,13 +772,13 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
> info.mask = mask;
> info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
>
> - pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
> - __aer_print_error(dev, &info);
> - pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
> + aer_printk(level, dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
> + __aer_print_error(dev, &info, level);
> + aer_printk(level, dev, "aer_layer=%s, aer_agent=%s\n",
> aer_error_layer[layer], aer_agent_string[agent]);
It seems that the printk's alignment is wonky after the rebase.
Checkpatch agrees with me here...
>
> if (aer_severity != AER_CORRECTABLE)
> - pci_err(dev, "aer_uncor_severity: 0x%08x\n",
> + aer_printk(level, dev, "aer_uncor_severity: 0x%08x\n",
> aer->uncor_severity);
...and here
All the best,
Karolina
* Re: [PATCH v2 3/8] PCI/AER: Move AER stat collection out of __aer_print_error
2025-02-14 2:35 ` [PATCH v2 3/8] PCI/AER: Move AER stat collection out of __aer_print_error Jon Pan-Doh
@ 2025-02-17 11:29 ` Karolina Stolarek
2025-02-19 2:48 ` Jon Pan-Doh
0 siblings, 1 reply; 39+ messages in thread
From: Karolina Stolarek @ 2025-02-17 11:29 UTC (permalink / raw)
To: Jon Pan-Doh, Bjorn Helgaas
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
Lukas Wunner, Jonathan Cameron
On 14/02/2025 03:35, Jon Pan-Doh wrote:
> Decouple stat collection from internal AER print functions. AERs from ghes
> or cxl drivers have stat collection in pci_print_aer as that is where
> aer_err_info is populated.
>
> Tested using aer-inject[1]. AER sysfs counters still updated correctly.
I don't think we have to mention that it was tested. In other patches
you mention specific examples that illustrate the change nicely, but we
don't get the same value from the statement above.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> ---
> drivers/pci/pci.h | 1 +
> drivers/pci/pcie/aer.c | 10 ++++++----
> drivers/pci/pcie/dpc.c | 1 +
> 3 files changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 8cb816ee5388..26104aee06c0 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -550,6 +550,7 @@ struct aer_err_info {
> };
>
> int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);
> +void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info);
> void aer_print_error(struct pci_dev *dev, struct aer_err_info *info, const char *level);
>
> int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index f1fdaa052cf6..d6edb95d468f 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -617,8 +617,7 @@ const struct attribute_group aer_stats_attr_group = {
> .is_visible = aer_stats_attrs_are_visible,
> };
>
> -static void pci_dev_aer_stats_incr(struct pci_dev *pdev,
> - struct aer_err_info *info)
> +void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info)
> {
> unsigned long status = info->status & ~info->mask;
> int i, max = -1;
> @@ -691,7 +690,6 @@ static void __aer_print_error(struct pci_dev *dev,
> aer_printk(level, dev, " [%2d] %-22s%s\n", i, errmsg,
> info->first_error == i ? " (First)" : "");
> }
> - pci_dev_aer_stats_incr(dev, info);
> }
>
> void aer_print_error(struct pci_dev *dev, struct aer_err_info *info,
> @@ -772,6 +770,8 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
> info.mask = mask;
> info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
>
> + pci_dev_aer_stats_incr(dev, &info);
With this change, we increment the stats when we iterate the recovery
queue in ghes_handle_aer. Is there a possibility that in the GHES path
we would increment the stats twice? First via AER module (aer_isr) and
then in aer_recover_work_func, or is it either/or?
All the best,
Karolina
* Re: [PATCH v2 4/8] PCI/AER: Rename struct aer_stats to aer_report
2025-02-14 2:35 ` [PATCH v2 4/8] PCI/AER: Rename struct aer_stats to aer_report Jon Pan-Doh
@ 2025-02-17 11:29 ` Karolina Stolarek
2025-02-19 2:49 ` Jon Pan-Doh
0 siblings, 1 reply; 39+ messages in thread
From: Karolina Stolarek @ 2025-02-17 11:29 UTC (permalink / raw)
To: Jon Pan-Doh, Bjorn Helgaas
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
Lukas Wunner, Jonathan Cameron
On 14/02/2025 03:35, Jon Pan-Doh wrote:
> Update name to reflect the broader definition of structs/variables that
> are stored (e.g. ratelimits).
>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> ---
> drivers/pci/pcie/aer.c | 50 +++++++++++++++++++++---------------------
> include/linux/pci.h | 2 +-
> 2 files changed, 26 insertions(+), 26 deletions(-)
(...)
> - dev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
> + dev->aer_report = kzalloc(sizeof(struct aer_report), GFP_KERNEL);
The rename brings back a checkpatch warning (to use
sizeof(*dev->aer_report)). If you feel like it, you can fix it. Apart
from that:
Reviewed-by: Karolina Stolarek <karolina.stolarek@oracle.com>
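For reference, the idiom behind that checkpatch warning can be shown with a
minimal userspace sketch. This is not the kernel code: calloc() stands in
for kzalloc(), and struct report_demo, struct holder_demo and
holder_demo_init() are hypothetical names made up for the illustration.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct aer_report. */
struct report_demo {
	unsigned long long total_cor_errs;
	unsigned long long total_fatal_errs;
};

/* Hypothetical stand-in for the owner of the report pointer. */
struct holder_demo {
	struct report_demo *report;
};

/*
 * Size the allocation from the pointer being assigned rather than from a
 * spelled-out struct name, so a later rename of the struct cannot leave
 * the sizeof() argument stale. calloc() zeroes the memory, as kzalloc()
 * would.
 */
int holder_demo_init(struct holder_demo *h)
{
	h->report = calloc(1, sizeof(*h->report));
	return h->report ? 0 : -1;
}
```

With this form, a rename like aer_stats -> aer_report would not need to
touch the allocation line at all.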
All the best,
Karolina
* Re: [PATCH v2 5/8] PCI/AER: Introduce ratelimit for error logs
2025-02-14 2:35 ` [PATCH v2 5/8] PCI/AER: Introduce ratelimit for error logs Jon Pan-Doh
@ 2025-02-17 11:29 ` Karolina Stolarek
2025-02-19 2:49 ` Jon Pan-Doh
0 siblings, 1 reply; 39+ messages in thread
From: Karolina Stolarek @ 2025-02-17 11:29 UTC (permalink / raw)
To: Jon Pan-Doh, Bjorn Helgaas
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
Lukas Wunner, Jonathan Cameron
On 14/02/2025 03:35, Jon Pan-Doh wrote:
> Spammy devices can flood kernel logs with AER errors and slow/stall
> execution. Add per-device ratelimits for AER errors (correctable and
> uncorrectable). Set the default rate to the default kernel ratelimit
> (10 per 5s).
I'd rephrase the last sentence to say "Add per-device ratelimits for AER
correctable and uncorrectable errors that use the kernel defaults (10
bursts per 5s)", but it's just a nit.
Overall, it looks good to me:
Reviewed-by: Karolina Stolarek <karolina.stolarek@oracle.com>
>
> Tested using aer-inject[1]. Sent 11 AER errors. Observed 10 errors logged
> while AER stats (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) show
> true count of 11.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> ---
> drivers/pci/pcie/aer.c | 24 ++++++++++++++++++++++++
> 1 file changed, 24 insertions(+)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index b4f902fd5ef6..c5b5381e2930 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -28,6 +28,7 @@
> #include <linux/interrupt.h>
> #include <linux/delay.h>
> #include <linux/kfifo.h>
> +#include <linux/ratelimit.h>
> #include <linux/slab.h>
> #include <acpi/apei.h>
> #include <acpi/ghes.h>
> @@ -88,6 +89,10 @@ struct aer_report {
> u64 rootport_total_cor_errs;
> u64 rootport_total_fatal_errs;
> u64 rootport_total_nonfatal_errs;
> +
> + /* Ratelimits for errors */
> + struct ratelimit_state cor_log_ratelimit;
> + struct ratelimit_state uncor_log_ratelimit;
> };
>
> #define AER_LOG_TLP_MASKS (PCI_ERR_UNC_POISON_TLP| \
> @@ -378,6 +383,10 @@ void pci_aer_init(struct pci_dev *dev)
> return;
>
> dev->aer_report = kzalloc(sizeof(struct aer_report), GFP_KERNEL);
> + ratelimit_state_init(&dev->aer_report->cor_log_ratelimit,
> + DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
> + ratelimit_state_init(&dev->aer_report->uncor_log_ratelimit,
> + DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
>
> /*
> * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
> @@ -697,6 +706,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info,
> {
> int layer, agent;
> int id = pci_dev_id(dev);
> + struct ratelimit_state *ratelimit;
>
> if (!info->status) {
> pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> @@ -704,6 +714,14 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info,
> goto out;
> }
>
> + if (info->severity == AER_CORRECTABLE)
> + ratelimit = &dev->aer_report->cor_log_ratelimit;
> + else
> + ratelimit = &dev->aer_report->uncor_log_ratelimit;
> +
> + if (!__ratelimit(ratelimit))
> + return;
> +
> layer = AER_GET_LAYER_ERROR(info->severity, info->status);
> agent = AER_GET_AGENT(info->severity, info->status);
>
> @@ -749,12 +767,15 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
> u32 status, mask;
> const char *level;
> struct aer_err_info info;
> + struct ratelimit_state *ratelimit;
>
> if (aer_severity == AER_CORRECTABLE) {
> + ratelimit = &dev->aer_report->cor_log_ratelimit;
> status = aer->cor_status;
> mask = aer->cor_mask;
> level = KERN_WARNING;
> } else {
> + ratelimit = &dev->aer_report->uncor_log_ratelimit;
> status = aer->uncor_status;
> mask = aer->uncor_mask;
> tlp_header_valid = status & AER_LOG_TLP_MASKS;
> @@ -772,6 +793,9 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>
> pci_dev_aer_stats_incr(dev, &info);
>
> + if (!__ratelimit(ratelimit))
> + return;
> +
> aer_printk(level, dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
> __aer_print_error(dev, &info, level);
> aer_printk(level, dev, "aer_layer=%s, aer_agent=%s\n",
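The behaviour described in the commit message (11 errors sent, 10 logged,
counter at 11) can be reproduced with a simplified userspace model. This is
only a sketch under stated assumptions: struct rl_model and rl_model_allow()
are hypothetical stand-ins for struct ratelimit_state and __ratelimit(), time
is passed in as plain seconds instead of jiffies, and the stats-before-
ratelimit order mirrors what the pci_print_aer() hunk above shows.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for struct ratelimit_state. */
struct rl_model {
	uint64_t interval;	/* window length in seconds */
	unsigned int burst;	/* events allowed per window */
	uint64_t begin;		/* start of the current window */
	unsigned int printed;	/* events allowed so far in this window */
	bool started;
};

/* Returns true if the event at time 'now' may be logged. */
bool rl_model_allow(struct rl_model *rl, uint64_t now)
{
	/* Open a new window on first use or once the old one has expired. */
	if (!rl->started || now - rl->begin > rl->interval) {
		rl->begin = now;
		rl->printed = 0;
		rl->started = true;
	}
	if (rl->printed < rl->burst) {
		rl->printed++;
		return true;
	}
	return false;
}

/* Hypothetical per-device state: one counter and one log ratelimit. */
struct aer_model {
	struct rl_model cor_log;
	unsigned long long cor_errs;	/* plays the role of the sysfs counter */
	unsigned int logged;		/* number of messages actually printed */
};

/* The counter is always bumped; the log line is subject to the ratelimit. */
void aer_model_report(struct aer_model *m, uint64_t now)
{
	m->cor_errs++;
	if (!rl_model_allow(&m->cor_log, now))
		return;
	m->logged++;
}
```

Feeding 11 events into a model initialised with interval 5 and burst 10,
all within the same window, leaves logged at 10 and cor_errs at 11. An event
arriving after the window has expired is logged again.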
* Re: [PATCH v2 6/8] PCI/AER: Add ratelimits to PCI AER Documentation
2025-02-14 2:35 ` [PATCH v2 6/8] PCI/AER: Add ratelimits to PCI AER Documentation Jon Pan-Doh
@ 2025-02-17 11:30 ` Karolina Stolarek
0 siblings, 0 replies; 39+ messages in thread
From: Karolina Stolarek @ 2025-02-17 11:30 UTC (permalink / raw)
To: Jon Pan-Doh, Bjorn Helgaas
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
Lukas Wunner, Jonathan Cameron
On 14/02/2025 03:35, Jon Pan-Doh wrote:
> Add ratelimits section for rationale and defaults.
>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> ---
> Documentation/PCI/pcieaer-howto.rst | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
> index f013f3b27c82..167c0b277b62 100644
> --- a/Documentation/PCI/pcieaer-howto.rst
> +++ b/Documentation/PCI/pcieaer-howto.rst
> @@ -85,6 +85,14 @@ In the example, 'Requester ID' means the ID of the device that sent
> the error message to the Root Port. Please refer to PCIe specs for other
> fields.
>
> +AER Ratelimits
> +--------------
> +
> +Error messages are ratelimited per device and error type. This prevents
> +spammy devices from flooding the console and stalling execution. Set the
> +default ratelimit to DEFAULT_RATELIMIT_BURST over
> +DEFAULT_RATELIMIT_INTERVAL (10 per 5 seconds).
The imperative tone of the last sentence doesn't fit here. How about
adding information about the ratelimit values to the first one, to
specify how frequently we limit the error reporting?
If you're eager, you can explain here that the errors are reported on
each interrupt generated for an Error Message, and that PCIe registers
can only toggle error generation on and off, with no option to control
the rate. Feel free to use bits of my cover letter[1] and if you do so,
please add my SOB to the patch.
All the best,
Karolina
[1] https://lore.kernel.org/linux-pci/cover.1734005191.git.karolina.stolarek@oracle.com/T/#u
> +
> AER Statistics / Counters
> -------------------------
>
* Re: [PATCH v2 7/8] PCI/AER: Add AER sysfs attributes for log ratelimits
2025-02-14 2:35 ` [PATCH v2 7/8] PCI/AER: Add AER sysfs attributes for log ratelimits Jon Pan-Doh
@ 2025-02-17 13:31 ` Karolina Stolarek
2025-02-19 2:50 ` Jon Pan-Doh
2025-02-19 5:42 ` Jon Pan-Doh
2025-02-25 13:56 ` Karolina Stolarek
1 sibling, 2 replies; 39+ messages in thread
From: Karolina Stolarek @ 2025-02-17 13:31 UTC (permalink / raw)
To: Jon Pan-Doh, Bjorn Helgaas
Cc: linux-pci, Martin Petersen, Ben Fuller, Drew Walton, Anil Agrawal,
Tony Luck, Ilpo Järvinen, Sathyanarayanan Kuppuswamy,
Lukas Wunner, Jonathan Cameron
On 14/02/2025 03:35, Jon Pan-Doh wrote:
> Allow userspace to read/write log ratelimits per device. Create aer/ sysfs
> directory to store them and any future aer configs.
I don't think it's necessary to keep ratelimits in a separate directory
when we decided to keep the rest of AER attributes at the dev level.
>
> Tested using aer-inject[1]. Configured correctable log ratelimits to 5.
> Sent 6 AER errors. Observed 5 errors logged while AER stats
> (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable) shows 6.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> ---
> .../testing/sysfs-bus-pci-devices-aer_stats | 20 +++++++
> Documentation/PCI/pcieaer-howto.rst | 3 ++
> drivers/pci/pci-sysfs.c | 1 +
> drivers/pci/pci.h | 1 +
> drivers/pci/pcie/aer.c | 52 +++++++++++++++++++
> 5 files changed, 77 insertions(+)
>
> diff --git a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> index d1f67bb81d5d..c1221614c079 100644
> --- a/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> +++ b/Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats
> @@ -117,3 +117,23 @@ Date: July 2018
> KernelVersion: 4.19.0
> Contact: linux-pci@vger.kernel.org, rajatja@google.com
> Description: Total number of ERR_NONFATAL messages reported to rootport.
> +
> +PCIe AER ratelimits
> +-------------------
> +
> +These attributes show up under all the devices that are AER capable.
> +Provides configurable ratelimits of logs/irq per error type. Writing a
This sentence seems to refer to _one_ attribute and mentions IRQ
ratelimiting, which is not a part of the series.
How about rephrasing this section to say that the attributes allow
configuration of the rate at which AER errors are reported, with each of
them dedicated to one error type (correctable or uncorrectable)?
Something along these lines.
> +nonzero value changes the number of errors (burst) allowed per 5 second
> +window before ratelimiting. Reading gets the current ratelimits.
> +
> +What: /sys/bus/pci/devices/<dev>/aer/ratelimit_cor_log
> +Date: March 2025
> +KernelVersion: 6.15.0
> +Contact: linux-pci@vger.kernel.org, pandoh@google.com
> +Description: Log ratelimit for correctable errors.
> +
> +What: /sys/bus/pci/devices/<dev>/aer/ratelimit_uncor_log
> +Date: March 2025
> +KernelVersion: 6.15.0
> +Contact: linux-pci@vger.kernel.org, pandoh@google.com
> +Description: Log ratelimit for uncorrectable errors.
> diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
> index 167c0b277b62..ab5b0f232204 100644
> --- a/Documentation/PCI/pcieaer-howto.rst
> +++ b/Documentation/PCI/pcieaer-howto.rst
> @@ -93,6 +93,9 @@ spammy devices from flooding the console and stalling execution. Set the
> default ratelimit to DEFAULT_RATELIMIT_BURST over
> DEFAULT_RATELIMIT_INTERVAL (10 per 5 seconds).
>
> +Ratelimits are exposed in the form of sysfs attributes and configurable.
> +See Documentation/ABI/testing/sysfs-bus-pci-devices-aer_stats.
> +
> AER Statistics / Counters
> -------------------------
>
> diff --git a/drivers/pci/pci-sysfs.c b/drivers/pci/pci-sysfs.c
> index b46ce1a2c554..16de3093294e 100644
> --- a/drivers/pci/pci-sysfs.c
> +++ b/drivers/pci/pci-sysfs.c
> @@ -1801,6 +1801,7 @@ const struct attribute_group *pci_dev_attr_groups[] = {
> &pcie_dev_attr_group,
> #ifdef CONFIG_PCIEAER
> &aer_stats_attr_group,
> + &aer_attr_group,
> #endif
> #ifdef CONFIG_PCIEASPM
> &aspm_ctrl_attr_group,
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 26104aee06c0..26d30a99c48b 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -887,6 +887,7 @@ void pci_no_aer(void);
> void pci_aer_init(struct pci_dev *dev);
> void pci_aer_exit(struct pci_dev *dev);
> extern const struct attribute_group aer_stats_attr_group;
> +extern const struct attribute_group aer_attr_group;
> void pci_aer_clear_fatal_status(struct pci_dev *dev);
> int pci_aer_clear_status(struct pci_dev *dev);
> int pci_aer_raw_clear_status(struct pci_dev *dev);
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index c5b5381e2930..1237faee6542 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -626,6 +626,58 @@ const struct attribute_group aer_stats_attr_group = {
> .is_visible = aer_stats_attrs_are_visible,
> };
>
> +#define aer_ratelimit_attr(name, ratelimit) \
> + static ssize_t \
> + name##_show(struct device *dev, struct device_attribute *attr, \
> + char *buf) \
> +{ \
> + struct pci_dev *pdev = to_pci_dev(dev); \
> + return sysfs_emit(buf, "%u errors every %u seconds\n", \
The convention is that sysfs files should provide one value per file. It
won't be just people interacting with this interface, but scripts as
well. Parsing such a string is a hassle. As we can only change the burst
of the ratelimit, I'd simply emit pdev->aer_report->ratelimit.burst.
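To illustrate the parsing concern from the consumer side, here is a minimal
userspace sketch. It is not part of the patch: parse_burst_value() and
parse_burst_sentence() are hypothetical helpers, and the single-value format
("10\n") is an assumption about what a reworked attribute would emit.

```c
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Parse a one-value-per-file attribute such as "10\n".
 * Returns 0 on success, -1 if the text is not a single number.
 */
int parse_burst_value(const char *s, unsigned int *burst)
{
	char *end;
	unsigned long v;

	errno = 0;
	v = strtoul(s, &end, 10);
	if (errno || end == s || v > UINT_MAX)
		return -1;
	if (*end != '\n' && *end != '\0')
		return -1;
	*burst = (unsigned int)v;
	return 0;
}

/*
 * Parse the v2 output, e.g. "10 errors every 5 seconds\n". The reader has
 * to hard-code the wording of the sentence to get at the numbers.
 * Returns 0 on success, -1 on mismatch.
 */
int parse_burst_sentence(const char *s, unsigned int *burst,
			 unsigned int *secs)
{
	if (sscanf(s, "%u errors every %u seconds", burst, secs) != 2)
		return -1;
	return 0;
}
```

The single-value reader needs no knowledge of the surrounding text, whereas
the sentence reader breaks as soon as the wording changes.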
All the best,
Karolina
> + pdev->aer_report->ratelimit.burst, \
> + pdev->aer_report->ratelimit.interval / HZ); \
> +} \
> + \
> + static ssize_t \
> + name##_store(struct device *dev, struct device_attribute *attr, \
> + const char *buf, size_t count) \
> +{ \
> + struct pci_dev *pdev = to_pci_dev(dev); \
> + int burst; \
> + \
> + if (kstrtoint(buf, 0, &burst) < 0) \
> + return -EINVAL; \
> + \
> + pdev->aer_report->ratelimit.burst = burst; \
> + return count; \
> +} \
> +static DEVICE_ATTR_RW(name)
> +
> +aer_ratelimit_attr(ratelimit_cor_log, cor_log_ratelimit);
> +aer_ratelimit_attr(ratelimit_uncor_log, uncor_log_ratelimit);
> +
> +static struct attribute *aer_attrs[] __ro_after_init = {
> + &dev_attr_ratelimit_cor_log.attr,
> + &dev_attr_ratelimit_uncor_log.attr,
> + NULL
> +};
> +
> +static umode_t aer_attrs_are_visible(struct kobject *kobj,
> + struct attribute *a, int n)
> +{
> + struct device *dev = kobj_to_dev(kobj);
> + struct pci_dev *pdev = to_pci_dev(dev);
> +
> + if (!pdev->aer_report)
> + return 0;
> + return a->mode;
> +}
> +
> +const struct attribute_group aer_attr_group = {
> + .name = "aer",
> + .attrs = aer_attrs,
> + .is_visible = aer_attrs_are_visible,
> +};
> +
> void pci_dev_aer_stats_incr(struct pci_dev *pdev, struct aer_err_info *info)
> {
> unsigned long status = info->status & ~info->mask;
* Re: [PATCH v2 2/8] PCI/AER: Use the same log level for all messages
2025-02-17 11:25 ` Karolina Stolarek
@ 2025-02-19 2:48 ` Jon Pan-Doh
2025-02-24 11:26 ` Karolina Stolarek
0 siblings, 1 reply; 39+ messages in thread
From: Jon Pan-Doh @ 2025-02-19 2:48 UTC (permalink / raw)
To: Karolina Stolarek
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron
On Mon, Feb 17, 2025 at 3:25 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> It seems that the printk's alignment is wonky after the rebase.
> Checkpatch agrees with me here...
Odd. It passed checkpatch for me. These are the commands I used:
{Kernel home dir}$ scripts/checkpatch.pl {diff (e.g. downloaded from Patchwork)}
{Kernel home dir}$ scripts/checkpatch.pl -f drivers/pci/pcie/aer.c
Maybe I'm not using it correctly. Could you paste your checkpatch
command/output?
Thanks,
Jon
* Re: [PATCH v2 3/8] PCI/AER: Move AER stat collection out of __aer_print_error
2025-02-17 11:29 ` Karolina Stolarek
@ 2025-02-19 2:48 ` Jon Pan-Doh
2025-02-24 11:26 ` Karolina Stolarek
0 siblings, 1 reply; 39+ messages in thread
From: Jon Pan-Doh @ 2025-02-19 2:48 UTC (permalink / raw)
To: Karolina Stolarek
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron
On Mon, Feb 17, 2025 at 3:29 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> On 14/02/2025 03:35, Jon Pan-Doh wrote:
> > Tested using aer-inject[1]. AER sysfs counters still updated correctly.
>
> I don't think we have to mention that it was tested. In other patches
> you mention specific examples that illustrate the change nicely, but we
> don't get the same value from the statement above.
Ack. Will omit in v3.
> With this change, we increment the stats when we iterate the recovery
> queue in ghes_handle_aer. Is there a possibility that in the GHES path
> we would increment the stats twice? First via AER module (aer_isr) and
> then in aer_recover_work_func, or is it either/or?
It's either/or. aer_isr deals with native AER handling (i.e. by OS).
However, AER errors from GHES originate from ACPI APEI (i.e. FW
notifies OS of error).
Thanks,
Jon
* Re: [PATCH v2 4/8] PCI/AER: Rename struct aer_stats to aer_report
2025-02-17 11:29 ` Karolina Stolarek
@ 2025-02-19 2:49 ` Jon Pan-Doh
0 siblings, 0 replies; 39+ messages in thread
From: Jon Pan-Doh @ 2025-02-19 2:49 UTC (permalink / raw)
To: Karolina Stolarek
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron
On Mon, Feb 17, 2025 at 3:29 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
>
> On 14/02/2025 03:35, Jon Pan-Doh wrote:
> > - dev->aer_stats = kzalloc(sizeof(struct aer_stats), GFP_KERNEL);
> > + dev->aer_report = kzalloc(sizeof(struct aer_report), GFP_KERNEL);
>
> The rename brings back a checkpatch warning (to use
> sizeof(*dev->aer_report)). If you feel like it, you can fix it.
Huh. Similar to my other reply[1], checkpatch isn't showing any warnings.
Maybe a divergence in checkpatch version? Mine is 0.32 (Pulled from
pci/aer branch commit: 28d3871db7ef8ad0112f195c48a72d8638af89d1).
[1] https://lore.kernel.org/linux-pci/CAMC_AXVgYegnfc-vyKuxZS-Ck=aCJ95=HqdYNraVv99kXxw1QA@mail.gmail.com/
Thanks,
Jon
* Re: [PATCH v2 5/8] PCI/AER: Introduce ratelimit for error logs
2025-02-17 11:29 ` Karolina Stolarek
@ 2025-02-19 2:49 ` Jon Pan-Doh
0 siblings, 0 replies; 39+ messages in thread
From: Jon Pan-Doh @ 2025-02-19 2:49 UTC (permalink / raw)
To: Karolina Stolarek
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron
On Mon, Feb 17, 2025 at 3:29 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> I'd rephrase the last sentence to say "Add per-device ratelimits for AER
> correctable and uncorrectable errors that use the kernel defaults (10
> bursts per 5s)", but it's just a nit.
Ack. Will change in v3.
Thanks,
Jon
* Re: [PATCH v2 7/8] PCI/AER: Add AER sysfs attributes for log ratelimits
2025-02-17 13:31 ` Karolina Stolarek
@ 2025-02-19 2:50 ` Jon Pan-Doh
2025-02-24 11:28 ` Karolina Stolarek
2025-02-19 5:42 ` Jon Pan-Doh
1 sibling, 1 reply; 39+ messages in thread
From: Jon Pan-Doh @ 2025-02-19 2:50 UTC (permalink / raw)
To: Karolina Stolarek
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron
On Mon, Feb 17, 2025 at 5:31 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> I don't think it's necessary to keep ratelimits in a separate directory
> when we decided to keep the rest of AER attributes at the dev level.
My motivation for this is that there may be more AER sysfs attributes we
want to expose. An example is the OCP Fault Management roadmap, which aims
to have userspace manage/enforce AER settings (set/get PCIe AER regs) for
datacenter repairability.
Given the permanence of sysfs entries, I think that it is reasonable to
create a new directory to make AER sysfs attributes more extensible.
> > +These attributes show up under all the devices that are AER capable.
> > +Provides configurable ratelimits of logs/irq per error type. Writing a
>
> This sentence seems to refer to _one_ attribute and mentions IRQ
> ratelimiting, which is not a part of the series.
>
> How about rephrasing this section to say that the attributes allow
> configuration of the rate at which AER errors are reported, with each of
> them dedicated to one error type (correctable or uncorrectable)?
> Something along these lines.
Good catch. IRQ verbiage slipped in from v1. How's this:
These attributes show up under all the devices that are AER capable.
They allow configuration of the rate at which AER errors are reported,
with each of them dedicated to one error type (correctable or uncorrectable).
I kept the first sentence as that is common under the other AER sysfs
attributes.
> The convention is that sysfs files should provide one value per file. It
> won't be just people interacting with this interface, but scripts as
> well. Parsing such a string is a hassle. As we can only change the burst
> of the ratelimit, I'd simply emit pdev->aer_report->ratelimit.burst.
Ack. Will update in v3 and add explicit mention of ratelimit interval
in sysfs-bus-pci-devices-aer_stats so context is still there.
Thanks,
Jon
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v2 7/8] PCI/AER: Add AER sysfs attributes for log ratelimits
2025-02-17 13:31 ` Karolina Stolarek
2025-02-19 2:50 ` Jon Pan-Doh
@ 2025-02-19 5:42 ` Jon Pan-Doh
1 sibling, 0 replies; 39+ messages in thread
From: Jon Pan-Doh @ 2025-02-19 5:42 UTC (permalink / raw)
To: Karolina Stolarek
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron
On Mon, Feb 17, 2025 at 5:31 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> The convention is that sysfs files should provide one value per file. It
> won't be just people interacting with this interface, but scripts as
> well. Parsing such a string is a hassle. As we can only change the burst
> of the ratelimit, I'd simply emit pdev->aer_report->ratelimit.burst.
Realized that Jonathan said something similar in v1[1] (that I forgot
to fix). In addition to the reason you provided, he stated that
convention is to read/write the same thing to a sysfs file.
[1] https://lore.kernel.org/linux-pci/20250131143246.000037a2@huawei.com/
Thanks,
Jon
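
A minimal userspace sketch of the "one value per file, read back what you
write" convention discussed above. It only models the idea: burst_show()
and burst_store() are hypothetical names, not the attribute callbacks from
the patch, and the struct stands in for the per-device ratelimit state.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-device ratelimit state. */
struct burst_model {
	int burst;
};

/* "show": emit exactly one value, newline-terminated. Returns bytes written. */
int burst_show(const struct burst_model *m, char *buf, size_t len)
{
	assert(m != NULL && buf != NULL);
	return snprintf(buf, len, "%d\n", m->burst);
}

/* "store": accept the same single integer that "show" prints. */
int burst_store(struct burst_model *m, const char *buf)
{
	char *end;
	long val;

	assert(m != NULL && buf != NULL);

	errno = 0;
	val = strtol(buf, &end, 10);
	if (end == buf || errno == ERANGE || val < 0 || val > INT_MAX)
		return -EINVAL;
	if (*end == '\n')
		end++;
	if (*end != '\0')
		return -EINVAL;	/* reject trailing text such as "10 per 5s" */

	m->burst = (int)val;
	return 0;
}
```

Because the output of burst_show() is valid input for burst_store(), a
script can save and restore the setting without any parsing beyond a
single integer.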
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v2 2/8] PCI/AER: Use the same log level for all messages
2025-02-19 2:48 ` Jon Pan-Doh
@ 2025-02-24 11:26 ` Karolina Stolarek
2025-02-28 22:25 ` Jon Pan-Doh
0 siblings, 1 reply; 39+ messages in thread
From: Karolina Stolarek @ 2025-02-24 11:26 UTC (permalink / raw)
To: Jon Pan-Doh
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron
On 19/02/2025 03:48, Jon Pan-Doh wrote:
> On Mon, Feb 17, 2025 at 3:25 AM Karolina Stolarek
> <karolina.stolarek@oracle.com> wrote:
>> It seems that the printk's alignment is wonky after the rebase.
>> Checkpatch agrees with me here...
>
> Odd. It passed checkpatch for me. These are the commands I used:
>
> {Kernel home dir}$ scripts/checkpatch.pl {diff (e.g. downloaded from Patchwork)}
> {Kernel home dir}$ scripts/checkpatch.pl -f drivers/pci/pcie/aer.c
Right, that's probably because I'm using checkpatch with a "strict"
parameter. I applied all of your patches and then ran:
git rebase --exec "git show --format=email HEAD | scripts/checkpatch.pl --strict --codespell" -i <upstream>
All the best,
Karolina
> Maybe I'm not using it correctly. Could you paste your checkpatch
> command/output?
>
> Thanks,
> Jon
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v2 3/8] PCI/AER: Move AER stat collection out of __aer_print_error
2025-02-19 2:48 ` Jon Pan-Doh
@ 2025-02-24 11:26 ` Karolina Stolarek
0 siblings, 0 replies; 39+ messages in thread
From: Karolina Stolarek @ 2025-02-24 11:26 UTC (permalink / raw)
To: Jon Pan-Doh
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron
On 19/02/2025 03:48, Jon Pan-Doh wrote:
> On Mon, Feb 17, 2025 at 3:29 AM Karolina Stolarek
> <karolina.stolarek@oracle.com> wrote:
>> With this change, we increment the stats when we iterate the recovery
>> queue in ghes_handle_aer. Is there a possibility that in the GHES path
>> we would increment the stats twice? First via AER module (aer_isr) and
>> then in aer_recover_work_func, or is it either/or?
>
> It's either/or. aer_isr deals with native AER handling (i.e. by OS).
> However, AER errors from GHES originate from ACPI APEI (i.e. FW
> notifies OS of error).
Ah, OK, that should be fine then, thanks for the confirmation.
All the best,
Karolina
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v2 7/8] PCI/AER: Add AER sysfs attributes for log ratelimits
2025-02-19 2:50 ` Jon Pan-Doh
@ 2025-02-24 11:28 ` Karolina Stolarek
0 siblings, 0 replies; 39+ messages in thread
From: Karolina Stolarek @ 2025-02-24 11:28 UTC (permalink / raw)
To: Jon Pan-Doh
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron
On 19/02/2025 03:50, Jon Pan-Doh wrote:
> On Mon, Feb 17, 2025 at 5:31 AM Karolina Stolarek
> <karolina.stolarek@oracle.com> wrote:
>> I don't think it's necessary to keep ratelimits in a separate directory
>> when we decided to keep the rest of AER attributes at the dev level.
>
> My motivation for this is that there may be more AER sysfs attributes we
> want to expose. An example is the OCP Fault Management roadmap, which aims
> to have userspace manage/enforce AER settings (set/get PCIe AER regs) for
> datacenter repairability.
>
> Given the permanence of sysfs entries, I think that it is reasonable to
> create a new directory to make AER sysfs attributes more extensible.
OK, I see what you mean. Maybe I was just unhappy with the naming but,
to be completely honest, "aer_ratelimit" isn't an option, so I won't
propose it as an alternative. Let's leave it as it is for now.
> Good catch. IRQ verbiage slipped in from v1. How's this:
>
> These attributes show up under all the devices that are AER capable.
> They allow configuration of the rate at which AER errors are reported,
> with each of them dedicated to one error type (correctable or uncorrectable).
Perfect
>> The convention is that sysfs files should provide one value per file. It
>> won't be just people interacting with this interface, but scripts as
>> well. Parsing such a string is a hassle. As we can only change the burst
>> of the ratelimit, I'd simply emit pdev->aer_report->ratelimit.burst.
>
> Ack. Will update in v3 and add explicit mention of ratelimit interval
> in sysfs-bus-pci-devices-aer_stats so context is still there.
Good!
All the best,
Karolina
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v2 7/8] PCI/AER: Add AER sysfs attributes for log ratelimits
2025-02-14 2:35 ` [PATCH v2 7/8] PCI/AER: Add AER sysfs attributes for log ratelimits Jon Pan-Doh
2025-02-17 13:31 ` Karolina Stolarek
@ 2025-02-25 13:56 ` Karolina Stolarek
2025-02-28 22:28 ` Jon Pan-Doh
1 sibling, 1 reply; 39+ messages in thread
From: Karolina Stolarek @ 2025-02-25 13:56 UTC (permalink / raw)
To: Jon Pan-Doh
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron
Hi Jon,
Sorry for catching this one so late:
On 14/02/2025 03:35, Jon Pan-Doh wrote:
> +
> +static struct attribute *aer_attrs[] __ro_after_init = {
                                        ^^^^^^^^^^^^^^^
This is a copy-paste error. These attributes are in the read-only region
and can't be written to, so please remove it in the next version.
Also, what value do we have to write to turn off ratelimiting
completely? Can we handle that as a special case?
All the best,
Karolina
> + &dev_attr_ratelimit_cor_log.attr,
> + &dev_attr_ratelimit_uncor_log.attr,
> + NULL
> +};
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v2 2/8] PCI/AER: Use the same log level for all messages
2025-02-24 11:26 ` Karolina Stolarek
@ 2025-02-28 22:25 ` Jon Pan-Doh
0 siblings, 0 replies; 39+ messages in thread
From: Jon Pan-Doh @ 2025-02-28 22:25 UTC (permalink / raw)
To: Karolina Stolarek
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron
On Mon, Feb 24, 2025 at 3:27 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> Right, that's probably because I'm using checkpatch with a "strict" parameter.
Ah. I see the formatting issues now. Will fix it in v3.
Thanks,
Jon
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v2 7/8] PCI/AER: Add AER sysfs attributes for log ratelimits
2025-02-25 13:56 ` Karolina Stolarek
@ 2025-02-28 22:28 ` Jon Pan-Doh
2025-03-03 8:31 ` Karolina Stolarek
0 siblings, 1 reply; 39+ messages in thread
From: Jon Pan-Doh @ 2025-02-28 22:28 UTC (permalink / raw)
To: Karolina Stolarek
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron
On Tue, Feb 25, 2025 at 5:56 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> On 14/02/2025 03:35, Jon Pan-Doh wrote:
> > +
> > +static struct attribute *aer_attrs[] __ro_after_init = {
>                                         ^^^^^^^^^^^^^^^
> This is a copy-paste error. These attributes are in the read-only region
> and can't be written to, so please remove it in the next version.
Ack.
> Also, what value do we have to write to turn off ratelimiting
> completely? Can we handle that as a special case?
Not something I originally planned, but can add it in v3. Given the
permanence of sysfs entries, would we want one toggle for all
ratelimiting (logs and potentially IRQs) or separate ones?
Thanks,
Jon
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v2 7/8] PCI/AER: Add AER sysfs attributes for log ratelimits
2025-02-28 22:28 ` Jon Pan-Doh
@ 2025-03-03 8:31 ` Karolina Stolarek
2025-03-04 0:58 ` Jon Pan-Doh
0 siblings, 1 reply; 39+ messages in thread
From: Karolina Stolarek @ 2025-03-03 8:31 UTC (permalink / raw)
To: Jon Pan-Doh
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron
On 28/02/2025 23:28, Jon Pan-Doh wrote:
> On Tue, Feb 25, 2025 at 5:56 AM Karolina Stolarek
> <karolina.stolarek@oracle.com> wrote:
>> On 14/02/2025 03:35, Jon Pan-Doh wrote:
>>
>> Also, what value do we have to write to turn off ratelimiting
>> completely? Can we handle that as a special case?
>
> Not something I originally planned, but can add it in v3.
Right, I think that some users would be interested in seeing all
uncorrectable errors and we should give them an option to turn off
ratelimiting completely.
> Given the permanence of sysfs entries, would we want one toggle
> for all ratelimiting (logs and potentially IRQs) or separate ones?
In my opinion, we want them to be separate. We may want to see no logs
of errors but still have them recorded in rasdaemon, for example.
All the best,
Karolina
>
> Thanks,
> Jon
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v2 7/8] PCI/AER: Add AER sysfs attributes for log ratelimits
2025-03-03 8:31 ` Karolina Stolarek
@ 2025-03-04 0:58 ` Jon Pan-Doh
2025-03-07 12:10 ` Karolina Stolarek
0 siblings, 1 reply; 39+ messages in thread
From: Jon Pan-Doh @ 2025-03-04 0:58 UTC (permalink / raw)
To: Karolina Stolarek
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron
On Mon, Mar 3, 2025 at 12:31 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> In my opinion, we want them to be separate. We may want to see no logs
> of errors but still have them recorded in rasdaemon, for example.
Understood. So the sysfs toggles could be something like:
aer/ratelimit_log_enable
aer/ratelimit_irq_enable (with default = off)
This assumes that IRQ ratelimiting part is able to be merged.
FYI, the current implementation ratelimits for both logs and trace
events, but increments AER counters. If there's a scenario where you'd
want no logs but have trace events sent, then we may need another
ratelimit and/or roll that into IRQ ratelimiting (to avoid trace
buffer/userspace agent getting inundated with events). Granted, there
is probably a higher tolerance for spam there than in console logs.
If that's desirable, maybe it could be a follow-up as well? I figure
this series is at least a good first step to handle any spam (vs.
status quo).
Thanks,
Jon
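
A minimal sketch of the behaviour described above, assuming a single
ratelimit window and one shared gate for logs and trace events: counters
always increment, while log/trace output is suppressed once the burst is
used up, unless ratelimiting is switched off. The struct and its fields
are hypothetical; only the ratelimit_log_enable name comes from the
proposal in this mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device model; field names are illustrative only. */
struct aer_dev_model {
	bool ratelimit_log_enable;	/* toggle name proposed in the thread */
	int burst;			/* logs allowed in the current window */
	int printed;			/* logs emitted in the current window */
	unsigned long errors_counted;	/* stands in for the AER counters */
	unsigned long errors_logged;	/* stands in for log + trace output */
};

/* Returns true if the error was logged, false if it was suppressed. */
bool aer_model_report(struct aer_dev_model *dev)
{
	assert(dev != NULL);

	/* Counters are updated unconditionally. */
	dev->errors_counted++;

	/* Logs and trace events share one gate in this model. */
	if (dev->ratelimit_log_enable && dev->printed >= dev->burst)
		return false;

	dev->printed++;
	dev->errors_logged++;
	return true;
}
```

Window expiry is deliberately left out to keep the sketch short. Splitting
the gate in two (one for logs, one for trace events) would be the change
needed for the "no logs but still trace" scenario.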
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v2 1/8] PCI/AER: Remove aer_print_port_info
2025-02-14 2:35 ` [PATCH v2 1/8] PCI/AER: Remove aer_print_port_info Jon Pan-Doh
@ 2025-03-04 18:32 ` Bjorn Helgaas
2025-03-05 1:04 ` Jon Pan-Doh
0 siblings, 1 reply; 39+ messages in thread
From: Bjorn Helgaas @ 2025-03-04 18:32 UTC (permalink / raw)
To: Jon Pan-Doh
Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
Ilpo Järvinen, Sathyanarayanan Kuppuswamy, Lukas Wunner,
Jonathan Cameron
On Thu, Feb 13, 2025 at 06:35:36PM -0800, Jon Pan-Doh wrote:
> Info logged is duplicated when the source device is processed. In both
> cases, BDF and error severity are derived from aer_error_info. If
> no source device is found, then an error is logged with the BDF from
> aer_error_info.
Nit: say what the patch does in the commit log as well as in the
subject.
> Code flow:
> aer_isr_one_error()
> -> aer_print_port_info()
> -> find_source_device()
> -> return/pci_info() if no device found else continue
> -> aer_process_err_devices()
> -> aer_print_error()
Nit: drop "->"; the indentation is enough.
> aer_print_port_info():
> pcieport 0000:00:04.0: Correctable error message received
> from 0000:01:00.0
Nit: don't wrap log messages, and indent them a couple of spaces since
they're quoted material.
> aer_print_error():
> e1000e 0000:01:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
> e1000e 0000:01:00.0: device [8086:10d3] error status/mask=00000040/0000e000
> e1000e 0000:01:00.0: [ 6] BadTLP
Give us a clear sample of the log showing the duplicated info. Are
you referring to this:
pcieport 0000:00:04.0: Correctable error message received from 0000:01:00.0
e1000e 0000:01:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
e1000e 0000:01:00.0: device [8086:10d3] error status/mask=00000040/0000e000
e1000e 0000:01:00.0: [ 6] BadTLP
where the pcieport message refers to 0000:01:00.0, and the subsequent
e1000e messages also include 0000:01:00.0?
It's true this is redundant information, but that e1000e device may
no longer be accessible.
In that case, I think aer_get_device_error_info() would probably
return 0 because config reads would all return ~0, and
PCI_ERR_COR_STATUS & ~PCI_ERR_COR_MASK would be 0, so
we probably wouldn't see the e1000e messages at all.
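
A small worked example of the arithmetic above, as a userspace sketch. It
assumes the check reduces to "status & ~mask"; the struct and function
names are hypothetical and this is not the body of
aer_get_device_error_info().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of the Correctable Error Status/Mask registers. */
struct cor_regs_model {
	uint32_t status;
	uint32_t mask;
};

/* True if at least one correctable error bit is set and not masked. */
bool has_unmasked_cor_error(const struct cor_regs_model *regs)
{
	assert(regs != NULL);
	return (regs->status & ~regs->mask) != 0;
}
```

With the values from the sample log (status 0x00000040, mask 0x0000e000)
the result is nonzero, so the error is reported. If the device is gone and
both config reads return 0xffffffff, then ~mask is 0 and the result is 0,
so nothing would be printed for the endpoint.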
> Tested using aer-inject[1]. No more root port log on dmesg.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> Reviewed-by: Karolina Stolarek <karolina.stolarek@oracle.com>
> ---
> drivers/pci/pcie/aer.c | 15 ---------------
> 1 file changed, 15 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index ad4206125b86..9a8cc81d01e4 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -733,18 +733,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> info->severity, info->tlp_header_valid, &info->tlp);
> }
>
> -static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
> -{
> - u8 bus = info->id >> 8;
> - u8 devfn = info->id & 0xff;
> -
> - pci_info(dev, "%s%s error message received from %04x:%02x:%02x.%d\n",
> - info->multi_error_valid ? "Multiple " : "",
> - aer_error_severity_string[info->severity],
> - pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
> - PCI_FUNC(devfn));
> -}
> -
> #ifdef CONFIG_ACPI_APEI_PCIEAER
> int cper_severity_to_aer(int cper_severity)
> {
> @@ -1296,7 +1284,6 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
> e_info.multi_error_valid = 1;
> else
> e_info.multi_error_valid = 0;
> - aer_print_port_info(pdev, &e_info);
>
> if (find_source_device(pdev, &e_info))
> aer_process_err_devices(&e_info);
> @@ -1315,8 +1302,6 @@ static void aer_isr_one_error(struct aer_rpc *rpc,
> else
> e_info.multi_error_valid = 0;
>
> - aer_print_port_info(pdev, &e_info);
> -
> if (find_source_device(pdev, &e_info))
> aer_process_err_devices(&e_info);
> }
> --
> 2.48.1.601.g30ceb7b040-goog
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v2 2/8] PCI/AER: Use the same log level for all messages
2025-02-14 2:35 ` [PATCH v2 2/8] PCI/AER: Use the same log level for all messages Jon Pan-Doh
2025-02-17 11:25 ` Karolina Stolarek
@ 2025-03-04 18:59 ` Bjorn Helgaas
2025-03-07 12:04 ` Karolina Stolarek
1 sibling, 1 reply; 39+ messages in thread
From: Bjorn Helgaas @ 2025-03-04 18:59 UTC (permalink / raw)
To: Jon Pan-Doh
Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
Ilpo Järvinen, Sathyanarayanan Kuppuswamy, Lukas Wunner,
Jonathan Cameron
On Thu, Feb 13, 2025 at 06:35:37PM -0800, Jon Pan-Doh wrote:
> From: Karolina Stolarek <karolina.stolarek@oracle.com>
>
> When reporting an AER error, we check its type multiple times
> to determine the log level for each message. Do this check only
> in the top-level function and propagate the result down the call
> chain. Make aer_print_port_info output to match the level of the
> reported error.
Can you mention the top-level function name? I guess it's
aer_isr_one_error()? Nit: add "()" after function names.
It *might* be possible to split this into two patches:
- Check level once and propagate down. This would include the
changes to aer_isr_one_error(), aer_process_err_devices(),
aer_print_error(), __aer_print_error(), and probably
dpc_process_error(). It looks like this wouldn't change the
levels of any messages.
- Change log levels. It looks like the main place is
pci_print_aer(), which previously always used pci_err() and would
now use either KERN_WARNING or KERN_ERR.
pcie_print_tlp_log() also always uses pci_err(). Maybe that's only
used for Uncorrectable errors? I'm not sure what the rules are for
the Header Log.
Bjorn
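
A minimal userspace sketch of the first suggested split (decide the level
once at the top, pass it down), assuming Correctable maps to a warning
level and everything else to an error level. All names are hypothetical
and the numeric values only mirror the usual kernel log-level digits; the
helpers are not the functions touched by the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum severity_model { SEV_CORRECTABLE, SEV_NONFATAL, SEV_FATAL };
enum log_level_model { LVL_ERR = 3, LVL_WARNING = 4 };

/* The only place where severity is turned into a log level. */
enum log_level_model level_for_severity(enum severity_model sev)
{
	return sev == SEV_CORRECTABLE ? LVL_WARNING : LVL_ERR;
}

/* Lower-level helper: uses the level it is given, never re-derives it. */
int emit_line(char *buf, size_t len, enum log_level_model level,
	      const char *msg)
{
	assert(buf != NULL && msg != NULL);
	return snprintf(buf, len, "<%d>%s\n", (int)level, msg);
}

/* Top-level function: computes the level once and propagates it. */
int report_error(enum severity_model sev, char *buf, size_t len)
{
	enum log_level_model level = level_for_severity(sev);
	int n;

	assert(buf != NULL);

	n = emit_line(buf, len, level, "PCIe Bus Error");
	if (n < 0 || (size_t)n >= len)
		return -1;

	return emit_line(buf + n, len - (size_t)n, level, "error status/mask");
}
```

Every line of one report carries the same level, which is the property
the first of the two patches would establish without changing any
existing message levels.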
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v2 1/8] PCI/AER: Remove aer_print_port_info
2025-03-04 18:32 ` Bjorn Helgaas
@ 2025-03-05 1:04 ` Jon Pan-Doh
2025-03-05 22:35 ` Bjorn Helgaas
0 siblings, 1 reply; 39+ messages in thread
From: Jon Pan-Doh @ 2025-03-05 1:04 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
Ilpo Järvinen, Sathyanarayanan Kuppuswamy, Lukas Wunner,
Jonathan Cameron
On Tue, Mar 4, 2025 at 10:32 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> It's true this is redundant information, but that e1000e device may
> no longer be accessible.
>
> In that case, I think aer_get_device_error_info() would probably
> return 0 because config reads would all return ~0, and
> PCI_ERR_COR_STATUS & ~PCI_ERR_COR_MASK would be 0, so
> we probably wouldn't see the e1000e messages at all.
Wouldn't we have larger issues if the device is no longer accessible?
Would a log suffice in that case (i.e. when aer_get_device_error_info()
returns 0)? Something along the lines of "{device} is not accessible
while processing (un)correctable error".
Thanks,
Jon
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v2 1/8] PCI/AER: Remove aer_print_port_info
2025-03-05 1:04 ` Jon Pan-Doh
@ 2025-03-05 22:35 ` Bjorn Helgaas
2025-03-06 1:32 ` Jon Pan-Doh
0 siblings, 1 reply; 39+ messages in thread
From: Bjorn Helgaas @ 2025-03-05 22:35 UTC (permalink / raw)
To: Jon Pan-Doh
Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
Ilpo Järvinen, Sathyanarayanan Kuppuswamy, Lukas Wunner,
Jonathan Cameron
On Tue, Mar 04, 2025 at 05:04:21PM -0800, Jon Pan-Doh wrote:
> On Tue, Mar 4, 2025 at 10:32 AM Bjorn Helgaas <helgaas@kernel.org> wrote:
> > It's true this is redundant information, but that e1000e device may
> > no longer be accessible.
> >
> > In that case, I think aer_get_device_error_info() would probably
> > return 0 because config reads would all return ~0, and
> > PCI_ERR_COR_STATUS & ~PCI_ERR_COR_MASK would be 0, so
> > we probably wouldn't see the e1000e messages at all.
>
> Wouldn't we have larger issues if the device is no longer accessible?
> Would a log suffice in that case (i.e. when aer_get_device_error_info()
> returns 0)? Something along the lines of "{device} is not accessible
> while processing (un)correctable error".
It's quite likely that a device is inaccessible after an uncorrectable
error.
DPC takes the link down automatically for uncorrectable errors, but I
don't think aer_print_port_info() is used in that case anyway.
Documentation/PCI/pci-error-recovery.rst mentions other cases where
the affected device is disconnected.
If the purpose of this patch is only to turn this:
  pcieport 0000:00:04.0: Correctable error message received from 0000:01:00.0
  e1000e 0000:01:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
  e1000e 0000:01:00.0: device [8086:10d3] error status/mask=00000040/0000e000
  e1000e 0000:01:00.0: [ 6] BadTLP
into this:
  e1000e 0000:01:00.0: PCIe Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
  e1000e 0000:01:00.0: device [8086:10d3] error status/mask=00000040/0000e000
  e1000e 0000:01:00.0: [ 6] BadTLP
I don't think it's worth it.
I guess the problem is that future patches rate limit the e1000e
messages, and we really need to rate limit the pcieport message using
the same e1000e ratelimit_state. We do know the Requester ID of the
device, so maybe we could look up that ratelimit_state?
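
A minimal userspace sketch of how a Requester ID breaks down into
bus/device/function, following the same shifts and masks as the
aer_print_port_info() code quoted earlier in the thread. The struct and
function names are hypothetical. The actual lookup of the source device
and its ratelimit_state is not shown; in the kernel, a helper such as
pci_get_domain_bus_and_slot() could presumably be used for that, which is
an assumption rather than something the thread settles.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical decoded form of a 16-bit Requester ID. */
struct bdf_model {
	uint8_t bus;
	uint8_t dev;
	uint8_t fn;
};

/* Bus is the high byte; devfn is the low byte (5 bits slot, 3 bits func). */
struct bdf_model decode_requester_id(uint16_t id)
{
	uint8_t devfn = id & 0xff;
	struct bdf_model bdf = {
		.bus = (uint8_t)(id >> 8),
		.dev = (devfn >> 3) & 0x1f,
		.fn = devfn & 0x07,
	};

	return bdf;
}

/* Format as domain:bus:dev.fn, e.g. "0000:01:00.0". Returns length. */
int format_bdf(uint16_t domain, uint16_t id, char *buf, size_t len)
{
	struct bdf_model bdf = decode_requester_id(id);

	assert(buf != NULL);
	return snprintf(buf, len, "%04x:%02x:%02x.%d", (unsigned int)domain,
			(unsigned int)bdf.bus, (unsigned int)bdf.dev,
			(int)bdf.fn);
}
```

For the sample log, Requester ID 0x0100 decodes to 01:00.0 (the e1000e
device), and 0x0020 would decode to 00:04.0 (the Root Port).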
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v2 1/8] PCI/AER: Remove aer_print_port_info
2025-03-05 22:35 ` Bjorn Helgaas
@ 2025-03-06 1:32 ` Jon Pan-Doh
2025-03-07 0:02 ` Bjorn Helgaas
0 siblings, 1 reply; 39+ messages in thread
From: Jon Pan-Doh @ 2025-03-06 1:32 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
Ilpo Järvinen, Sathyanarayanan Kuppuswamy, Lukas Wunner,
Jonathan Cameron
On Wed, Mar 5, 2025 at 2:35 PM Bjorn Helgaas <helgaas@kernel.org> wrote:
> I guess the problem is that future patches rate limit the e1000e
> messages, and we really need to rate limit the pcieport message using
> the same e1000e ratelimit_state. We do know the Requester ID of the
> device, so maybe we could look up that ratelimit_state?
Yeah, the intent behind the patch is to simplify the ratelimiting logic.
If I accessed the ratelimit via aer_err_info, then the ratelimit would need
to be doubled.
> On Tue, Mar 04, 2025 at 05:04:21PM -0800, Jon Pan-Doh wrote:
> > Would a log suffice in that case (i.e. when aer_get_device_error_info()
> > returns 0)? Something along the lines of "{device} is not accessible
> > while processing (un)correctable error"
What are your thoughts on this? It adds the pcie port log in the
edge case described (with no loss of info) and doesn't require
changes to current ratelimit logic. Something like this (with more
fields filled in of course):
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 21cdf590b25e..bdfc7e8d6f0f 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1253,6 +1253,8 @@ static inline void aer_process_err_devices(struct aer_err_info *e_info)
 	for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
 		if (aer_get_device_error_info(e_info->dev[i], e_info))
 			aer_print_error(e_info->dev[i], e_info);
+		else
+			pci_err(e_info->dev[i], "{device} is not accessible while processing (un)correctable error\n");
 	}
 	for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
 		if (aer_get_device_error_info(e_info->dev[i], e_info))
Thanks,
Jon
^ permalink raw reply related [flat|nested] 39+ messages in thread
* Re: [PATCH v2 1/8] PCI/AER: Remove aer_print_port_info
2025-03-06 1:32 ` Jon Pan-Doh
@ 2025-03-07 0:02 ` Bjorn Helgaas
0 siblings, 0 replies; 39+ messages in thread
From: Bjorn Helgaas @ 2025-03-07 0:02 UTC (permalink / raw)
To: Jon Pan-Doh
Cc: Bjorn Helgaas, Karolina Stolarek, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
Ilpo Järvinen, Sathyanarayanan Kuppuswamy, Lukas Wunner,
Jonathan Cameron
On Wed, Mar 05, 2025 at 05:32:45PM -0800, Jon Pan-Doh wrote:
> > On Tue, Mar 04, 2025 at 05:04:21PM -0800, Jon Pan-Doh wrote:
> > > Would a log suffice in that case (i.e. when aer_get_device_error_info()
> > > returns 0)? Something along the lines of "{device} is not accessible
> > > while processing (un)correctable error"
>
> What are your thoughts on this? It adds the pcie port log in the
> edge case described (with no loss of info) and doesn't require
> changes to current ratelimit logic. Something like this (with more
> fields filled in of course):
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 21cdf590b25e..bdfc7e8d6f0f 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1253,6 +1253,8 @@ static inline void aer_process_err_devices(struct aer_err_info *e_info)
>  	for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
>  		if (aer_get_device_error_info(e_info->dev[i], e_info))
>  			aer_print_error(e_info->dev[i], e_info);
> +		else
> +			pci_err(e_info->dev[i], "{device} is not accessible while processing (un)correctable error\n");
>  	}
>  	for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
>  		if (aer_get_device_error_info(e_info->dev[i], e_info))
Maybe, although I think consistency is very important, and we'll
always have Root Port info but won't always have Endpoint info. So
dropping the Root Port message seems possibly the wrong way around
when it's the Endpoint part that's "optional".
One thing I do like about the current messages is that they associate
information with the device that is the source of the information. I
remember finding this very confusing when I first looked at how AER
works.
E.g., the "pcieport ... Correctable error" message means the Root Port
received an ERR_COR and generated an interrupt, and the error class
and error source came from the Root Port AER Capability. Similarly,
the "e1000e ... error status" message contains information read from
the Endpoint AER Capability.
I do think the existing messages are WAY too verbose. I would love to
make them more concise, and I think the important endpoint info could
probably be squeezed into a single line, although obviously TLP header
logs would be too much for that.
Bjorn
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v2 2/8] PCI/AER: Use the same log level for all messages
2025-03-04 18:59 ` Bjorn Helgaas
@ 2025-03-07 12:04 ` Karolina Stolarek
2025-03-13 21:15 ` Jon Pan-Doh
0 siblings, 1 reply; 39+ messages in thread
From: Karolina Stolarek @ 2025-03-07 12:04 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: Jon Pan-Doh, Bjorn Helgaas, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
Ilpo Järvinen, Sathyanarayanan Kuppuswamy, Lukas Wunner,
Jonathan Cameron
On 04/03/2025 19:59, Bjorn Helgaas wrote:
>
> pcie_print_tlp_log() also always uses pci_err(). Maybe that's only
> used for Uncorrectable errors? I'm not sure what the rules are for
> the Header Log.
I kept it this way, because logging of the TLP header is required only
for Uncorrectable Errors (per PCIe Spec v6.2, 6.2.7 Error Listings and
Rules).
As for other changes, I agree, we can split it up. Jon, would you like
to do it in v3?
All the best,
Karolina
>
> Bjorn
>
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v2 7/8] PCI/AER: Add AER sysfs attributes for log ratelimits
2025-03-04 0:58 ` Jon Pan-Doh
@ 2025-03-07 12:10 ` Karolina Stolarek
2025-03-07 19:41 ` Bjorn Helgaas
0 siblings, 1 reply; 39+ messages in thread
From: Karolina Stolarek @ 2025-03-07 12:10 UTC (permalink / raw)
To: Jon Pan-Doh
Cc: Bjorn Helgaas, linux-pci, Martin Petersen, Ben Fuller,
Drew Walton, Anil Agrawal, Tony Luck, Ilpo Järvinen,
Sathyanarayanan Kuppuswamy, Lukas Wunner, Jonathan Cameron
On 04/03/2025 01:58, Jon Pan-Doh wrote:
> On Mon, Mar 3, 2025 at 12:31 AM Karolina Stolarek
> <karolina.stolarek@oracle.com> wrote:
>> In my opinion, we want them to be separate. We may want to see no logs
>> of errors but still have them recorded in rasdaemon, for example.
>
> Understood. So the sysfs toggles could be something like:
>
> aer/ratelimit_log_enable
> aer/ratelimit_irq_enable (with default = off)
>
> This assumes that IRQ ratelimiting part is able to be merged.
Sounds good to me
> FYI, the current implementation ratelimits for both logs and trace
> events, but increments AER counters. If there's a scenario where you'd
> want no logs but have trace events sent, then we may need another
> ratelimit and/or roll that into IRQ ratelimiting (to avoid trace
> buffer/userspace agent getting inundated with events). Granted, there
> is probably a higher tolerance for spam there than in console logs.
Right, I see what you mean. I think we would like to still trace them,
at least that's what I got from the conversation I had with Jonathan[1].
It would be good to agree on the final solution here.
All the best,
Karolina
--------------------------------------------------------------------
[1] - https://lore.kernel.org/linux-pci/20241216104424.00000fab@huawei.com/
>
> If that's desirable, maybe it could be a follow-up as well? I figure
> this series is at least a good first step to handle any spam (vs.
> status quo).
>
> Thanks,
> Jon
^ permalink raw reply [flat|nested] 39+ messages in thread
* Re: [PATCH v2 7/8] PCI/AER: Add AER sysfs attributes for log ratelimits
2025-03-07 12:10 ` Karolina Stolarek
@ 2025-03-07 19:41 ` Bjorn Helgaas
0 siblings, 0 replies; 39+ messages in thread
From: Bjorn Helgaas @ 2025-03-07 19:41 UTC (permalink / raw)
To: Karolina Stolarek
Cc: Jon Pan-Doh, Bjorn Helgaas, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
Ilpo Järvinen, Sathyanarayanan Kuppuswamy, Lukas Wunner,
Jonathan Cameron
On Fri, Mar 07, 2025 at 01:10:33PM +0100, Karolina Stolarek wrote:
> On 04/03/2025 01:58, Jon Pan-Doh wrote:
> > On Mon, Mar 3, 2025 at 12:31 AM Karolina Stolarek
> > <karolina.stolarek@oracle.com> wrote:
> > > In my opinion, we want them to be separate. We may want to see no logs
> > > of errors but still have them recorded in rasdaemon, for example.
> >
> > Understood. So the sysfs toggles could be something like:
> >
> > aer/ratelimit_log_enable
> > aer/ratelimit_irq_enable (with default = off)
> >
> > This assumes that IRQ ratelimiting part is able to be merged.
>
> Sounds good to me
>
> > FYI, the current implementation ratelimits for both logs and trace
> > events, but increments AER counters. If there's a scenario where you'd
> > want no logs but have trace events sent, then we may need another
> > ratelimit and/or roll that into IRQ ratelimiting (to avoid trace
> > buffer/userspace agent getting inundated with events). Granted, there
> > is probably a higher tolerance for spam there than in console logs.
>
> Right, I see what you mean. I think we would like to still trace them, at
> least that's what I got from the conversation I had with Jonathan[1]. It
> would be good to agree on the final solution here.
I agree; I don't think we have a need to rate limit trace events.
If we get inundated by trace events, I suspect the solution will be to
turn off the interrupt and poll periodically.
> --------------------------------------------------------------------
> [1] - https://lore.kernel.org/linux-pci/20241216104424.00000fab@huawei.com/
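For illustration only, the behaviour discussed above (AER counters always incremented, trace events never rate limited, console logs rate limited per interval/burst) can be modelled in a small userspace sketch. This is not code from the series: struct aer_model, struct log_ratelimit, log_allowed() and report_error() are hypothetical names, the "enabled" field merely stands in for a toggle such as the proposed aer/ratelimit_log_enable, and timestamps are passed in explicitly so the behaviour is deterministic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct log_ratelimit {
	uint64_t interval_ms;	/* length of one rate limit window */
	unsigned int burst;	/* log lines allowed per window */
	uint64_t window_start;	/* start time of the current window */
	unsigned int printed;	/* log lines printed in the current window */
	bool enabled;		/* stands in for a toggle like ratelimit_log_enable */
};

struct aer_model {
	unsigned long counter;		/* always incremented */
	unsigned long trace_events;	/* never rate limited */
	unsigned long logged;		/* subject to the log rate limit */
	struct log_ratelimit rl;
};

static void aer_model_init(struct aer_model *m, uint64_t interval_ms,
			   unsigned int burst)
{
	assert(burst > 0);
	*m = (struct aer_model){
		.rl = {
			.interval_ms = interval_ms,
			.burst = burst,
			.enabled = true,
		},
	};
}

/* Return true if a log line may be printed at time now_ms. */
static bool log_allowed(struct log_ratelimit *rl, uint64_t now_ms)
{
	if (!rl->enabled)
		return true;	/* rate limiting switched off: log everything */

	/* Open a new window once the interval has elapsed. */
	if (now_ms - rl->window_start >= rl->interval_ms) {
		rl->window_start = now_ms;
		rl->printed = 0;
	}

	if (rl->printed >= rl->burst)
		return false;	/* burst exhausted for this window */

	rl->printed++;
	return true;
}

/* Handle one error: count it, trace it, and log it only if allowed. */
static void report_error(struct aer_model *m, uint64_t now_ms)
{
	m->counter++;		/* statistics are kept even when logs are dropped */
	m->trace_events++;	/* trace consumers (e.g. rasdaemon) see every error */

	if (log_allowed(&m->rl, now_ms)) {
		m->logged++;
		printf("[%llu ms] AER: error logged\n",
		       (unsigned long long)now_ms);
	}
}
```

With a 1000 ms interval and a burst of 2, five errors reported within the first few milliseconds would all be counted and traced, but only two would reach the log; a further error at 1000 ms opens a new window and is logged again.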
* Re: [PATCH v2 2/8] PCI/AER: Use the same log level for all messages
2025-03-07 12:04 ` Karolina Stolarek
@ 2025-03-13 21:15 ` Jon Pan-Doh
0 siblings, 0 replies; 39+ messages in thread
From: Jon Pan-Doh @ 2025-03-13 21:15 UTC (permalink / raw)
To: Karolina Stolarek
Cc: Bjorn Helgaas, Bjorn Helgaas, linux-pci, Martin Petersen,
Ben Fuller, Drew Walton, Anil Agrawal, Tony Luck,
Ilpo Järvinen, Sathyanarayanan Kuppuswamy, Lukas Wunner,
Jonathan Cameron
On Fri, Mar 7, 2025 at 4:05 AM Karolina Stolarek
<karolina.stolarek@oracle.com> wrote:
> As for other changes, I agree, we can split it up. Jon, would you like
> to do it in v3?
Yeah. I can do this in v3.
Thanks,
Jon
Thread overview: 39+ messages
2025-02-14 2:35 [PATCH v2 0/8] Rate limit AER logs Jon Pan-Doh
2025-02-14 2:35 ` [PATCH v2 1/8] PCI/AER: Remove aer_print_port_info Jon Pan-Doh
2025-03-04 18:32 ` Bjorn Helgaas
2025-03-05 1:04 ` Jon Pan-Doh
2025-03-05 22:35 ` Bjorn Helgaas
2025-03-06 1:32 ` Jon Pan-Doh
2025-03-07 0:02 ` Bjorn Helgaas
2025-02-14 2:35 ` [PATCH v2 2/8] PCI/AER: Use the same log level for all messages Jon Pan-Doh
2025-02-17 11:25 ` Karolina Stolarek
2025-02-19 2:48 ` Jon Pan-Doh
2025-02-24 11:26 ` Karolina Stolarek
2025-02-28 22:25 ` Jon Pan-Doh
2025-03-04 18:59 ` Bjorn Helgaas
2025-03-07 12:04 ` Karolina Stolarek
2025-03-13 21:15 ` Jon Pan-Doh
2025-02-14 2:35 ` [PATCH v2 3/8] PCI/AER: Move AER stat collection out of __aer_print_error Jon Pan-Doh
2025-02-17 11:29 ` Karolina Stolarek
2025-02-19 2:48 ` Jon Pan-Doh
2025-02-24 11:26 ` Karolina Stolarek
2025-02-14 2:35 ` [PATCH v2 4/8] PCI/AER: Rename struct aer_stats to aer_report Jon Pan-Doh
2025-02-17 11:29 ` Karolina Stolarek
2025-02-19 2:49 ` Jon Pan-Doh
2025-02-14 2:35 ` [PATCH v2 5/8] PCI/AER: Introduce ratelimit for error logs Jon Pan-Doh
2025-02-17 11:29 ` Karolina Stolarek
2025-02-19 2:49 ` Jon Pan-Doh
2025-02-14 2:35 ` [PATCH v2 6/8] PCI/AER: Add ratelimits to PCI AER Documentation Jon Pan-Doh
2025-02-17 11:30 ` Karolina Stolarek
2025-02-14 2:35 ` [PATCH v2 7/8] PCI/AER: Add AER sysfs attributes for log ratelimits Jon Pan-Doh
2025-02-17 13:31 ` Karolina Stolarek
2025-02-19 2:50 ` Jon Pan-Doh
2025-02-24 11:28 ` Karolina Stolarek
2025-02-19 5:42 ` Jon Pan-Doh
2025-02-25 13:56 ` Karolina Stolarek
2025-02-28 22:28 ` Jon Pan-Doh
2025-03-03 8:31 ` Karolina Stolarek
2025-03-04 0:58 ` Jon Pan-Doh
2025-03-07 12:10 ` Karolina Stolarek
2025-03-07 19:41 ` Bjorn Helgaas
2025-02-14 2:35 ` [PATCH v2 8/8] PCI/AER: Update AER sysfs ABI filename Jon Pan-Doh