* [PATCH] PCI/AER: Add kernel.aer_print_skip_mask to control aer log
From: Bijie Xu @ 2025-01-08 7:57 UTC (permalink / raw)
To: oohall; +Cc: bijie.xu, mahesh, bhelgaas, linuxppc-dev, linux-pci, linux-kernel
Some PCIe devices installed on certain servers occasionally
produce a large number of AER correctable error logs, which is quite
annoying. Add the sysctl parameter kernel.aer_print_skip_mask to
skip printing AER errors of selected severities.
The AER severity can be 0 (NONFATAL), 1 (FATAL), or 2 (CORRECTABLE). The
three low bits of the mask select which of these severities to skip:
setting bit 0 skips printing NONFATAL AER errors, setting bit 1 skips
FATAL AER errors, and setting bit 2 skips CORRECTABLE AER errors.
Multiple bits can be set to skip multiple severities.
Signed-off-by: Bijie Xu <bijie.xu@corigine.com>
---
drivers/pci/pcie/aer.c | 23 +++++++++++++++++++++++
1 file changed, 23 insertions(+)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 80c5ba8d8296..b46973526bcf 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -698,6 +698,7 @@ static void __aer_print_error(struct pci_dev *dev,
 	pci_dev_aer_stats_incr(dev, info);
 }
 
+unsigned int aer_print_skip_mask __read_mostly;
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 {
 	int layer, agent;
@@ -710,6 +711,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 		goto out;
 	}
 
+	if ((1 << info->severity) & aer_print_skip_mask)
+		goto out;
+
 	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
 	agent = AER_GET_AGENT(info->severity, info->status);
 
@@ -1596,3 +1600,22 @@ int __init pcie_aer_init(void)
 		return -ENXIO;
 	return pcie_port_service_register(&aerdriver);
 }
+
+static const struct ctl_table aer_print_skip_mask_sysctls[] = {
+	{
+		.procname = "aer_print_skip_mask",
+		.data = &aer_print_skip_mask,
+		.maxlen = sizeof(unsigned int),
+		.mode = 0644,
+		.proc_handler = &proc_douintvec,
+	},
+	{}
+};
+
+static int __init aer_print_skip_mask_sysctl_init(void)
+{
+	register_sysctl_init("kernel", aer_print_skip_mask_sysctls);
+	return 0;
+}
+
+late_initcall(aer_print_skip_mask_sysctl_init);
--
2.25.1
* Re: [PATCH] PCI/AER: Add kernel.aer_print_skip_mask to control aer log
From: Bjorn Helgaas @ 2025-03-04 23:22 UTC (permalink / raw)
To: Bijie Xu
Cc: oohall, mahesh, bhelgaas, linuxppc-dev, linux-pci, linux-kernel,
Jon Pan-Doh, Karolina Stolarek
[+cc Jon, Karolina]
On Wed, Jan 08, 2025 at 03:57:03PM +0800, Bijie Xu wrote:
> Some PCIe devices installed on certain servers occasionally
> produce a large number of AER correctable error logs, which is quite
> annoying. Add the sysctl parameter kernel.aer_print_skip_mask to
> skip printing AER errors of selected severities.
>
> The AER severity can be 0 (NONFATAL), 1 (FATAL), or 2 (CORRECTABLE). The
> three low bits of the mask select which of these severities to skip:
> setting bit 0 skips printing NONFATAL AER errors, setting bit 1 skips
> FATAL AER errors, and setting bit 2 skips CORRECTABLE AER errors.
> Multiple bits can be set to skip multiple severities.
This is definitely annoying, actually MORE than annoying in some
cases.
I'm hoping the correctable error rate-limiting work can reduce the
annoyance to a tolerable level:
https://lore.kernel.org/r/20250214023543.992372-1-pandoh@google.com
Can you take a look at this and see if it's going the right direction
for you, or if it needs extensions to do what you need?
> Signed-off-by: Bijie Xu <bijie.xu@corigine.com>
> ---
> drivers/pci/pcie/aer.c | 23 +++++++++++++++++++++++
> 1 file changed, 23 insertions(+)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 80c5ba8d8296..b46973526bcf 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -698,6 +698,7 @@ static void __aer_print_error(struct pci_dev *dev,
>  	pci_dev_aer_stats_incr(dev, info);
>  }
>
> +unsigned int aer_print_skip_mask __read_mostly;
>  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  {
>  	int layer, agent;
> @@ -710,6 +711,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  		goto out;
>  	}
>
> +	if ((1 << info->severity) & aer_print_skip_mask)
> +		goto out;
> +
>  	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
>  	agent = AER_GET_AGENT(info->severity, info->status);
>
> @@ -1596,3 +1600,22 @@ int __init pcie_aer_init(void)
>  		return -ENXIO;
>  	return pcie_port_service_register(&aerdriver);
>  }
> +
> +static const struct ctl_table aer_print_skip_mask_sysctls[] = {
> +	{
> +		.procname = "aer_print_skip_mask",
> +		.data = &aer_print_skip_mask,
> +		.maxlen = sizeof(unsigned int),
> +		.mode = 0644,
> +		.proc_handler = &proc_douintvec,
> +	},
> +	{}
> +};
> +
> +static int __init aer_print_skip_mask_sysctl_init(void)
> +{
> +	register_sysctl_init("kernel", aer_print_skip_mask_sysctls);
> +	return 0;
> +}
> +
> +late_initcall(aer_print_skip_mask_sysctl_init);
> --
> 2.25.1
>
* Re: [PATCH] PCI/AER: Add kernel.aer_print_skip_mask to control aer log
From: Bijie Xu @ 2025-03-06 8:25 UTC (permalink / raw)
To: helgaas
Cc: bhelgaas, bijie.xu, karolina.stolarek, linux-kernel, linux-pci,
linuxppc-dev, mahesh, oohall, pandoh
On Tue, 4 Mar 2025 17:22:30 -0600, Bjorn Helgaas wrote:
> Can you take a look at this and see if it's going the right direction
> for you, or if it needs extensions to do what you need?
Thanks for your suggestion. I've taken some time to review the patch you suggested.
It solves part of the problem, and it can set a ratelimit on a single device, which
is good.
But this patch solves the problem in a different way.
1. Some users get very nervous when they notice this kind of error log. This patch
gives them an option to disable these logs entirely at the system level
instead of just setting a ratelimit on a specific device.
2. The sysctl configuration can be persisted across a system reboot. Users may dislike
seeing these AER logs appear again after a reboot.
Regards,
Bijie Xu