From: Karolina Stolarek <karolina.stolarek@oracle.com>
To: Jon Pan-Doh <pandoh@google.com>, Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org,
Martin Petersen <martin.petersen@oracle.com>,
Ben Fuller <ben.fuller@oracle.com>,
Drew Walton <drewwalton@microsoft.com>,
Anil Agrawal <anilagrawal@meta.com>,
Tony Luck <tony.luck@intel.com>
Subject: Re: [PATCH 4/8] PCI/AER: Introduce ratelimit for error logs
Date: Thu, 16 Jan 2025 12:11:14 +0100 [thread overview]
Message-ID: <b0eb4417-600b-4b70-ae5f-d51ccbd5ef2c@oracle.com> (raw)
In-Reply-To: <20250115074301.3514927-5-pandoh@google.com>
On 15/01/2025 08:42, Jon Pan-Doh wrote:
> Spammy devices can flood kernel logs with AER errors and slow/stall
> execution. Add per-device ratelimits for AER errors (correctable and
> uncorrectable). Set the default rate to the default kernel ratelimit
> (10 per 5s).
>
> Tested using aer-inject[1] tool. Sent 11 AER errors. Observed 10 errors
> logged while AER stats (cat /sys/bus/pci/devices/<dev>/aer_dev_correctable)
> show true count of 11.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git
>
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> ---
> Documentation/PCI/pcieaer-howto.rst | 6 ++++++
> drivers/pci/pcie/aer.c | 31 +++++++++++++++++++++++++++--
> 2 files changed, 35 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
> index f013f3b27c82..5546de60f184 100644
> --- a/Documentation/PCI/pcieaer-howto.rst
> +++ b/Documentation/PCI/pcieaer-howto.rst
> @@ -85,6 +85,12 @@ In the example, 'Requester ID' means the ID of the device that sent
> the error message to the Root Port. Please refer to PCIe specs for other
> fields.
>
> +AER Ratelimits
> +-------------------------
> +
> +Error messages are ratelimited per device and error type. This prevents spammy
> +devices from flooding the console.
> +
Nit: I would create a separate commit for the documentation updates.
Also, I think it would be good to mention the default interval and,
maybe, explain why such rate-limiting was introduced in the first place.
If you don't feel like writing about it, let me know and we can work it
out together.
> AER Statistics / Counters
> -------------------------
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 5ab5cd7368bc..025c50b0f293 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -27,6 +27,7 @@
> #include <linux/interrupt.h>
> #include <linux/delay.h>
> #include <linux/kfifo.h>
> +#include <linux/ratelimit.h>
> #include <linux/slab.h>
> #include <acpi/apei.h>
> #include <acpi/ghes.h>
> @@ -84,6 +85,10 @@ struct aer_info {
> u64 rootport_total_cor_errs;
> u64 rootport_total_fatal_errs;
> u64 rootport_total_nonfatal_errs;
> +
> + /* Ratelimits for errors */
> + struct ratelimit_state cor_log_ratelimit;
> + struct ratelimit_state uncor_log_ratelimit;
My comment for 3/8 applies here as well.
> };
>
> #define AER_LOG_TLP_MASKS (PCI_ERR_UNC_POISON_TLP| \
> @@ -374,6 +379,12 @@ void pci_aer_init(struct pci_dev *dev)
> return;
>
> dev->aer_info = kzalloc(sizeof(struct aer_info), GFP_KERNEL);
> + ratelimit_state_init(&dev->aer_info->cor_log_ratelimit,
> + DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
> + ratelimit_state_init(&dev->aer_info->uncor_log_ratelimit,
> + DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST);
> + ratelimit_set_flags(&dev->aer_info->cor_log_ratelimit, RATELIMIT_MSG_ON_RELEASE);
> + ratelimit_set_flags(&dev->aer_info->uncor_log_ratelimit, RATELIMIT_MSG_ON_RELEASE);
In some cases, it would be beneficial to keep the "X callbacks
suppressed" to give an idea of how prevalent the errors are. On some
devices, we could see just 11 Correctable Errors per 5 seconds, but on
others this would be >1k (seen such a case myself).
As it's not immediately clear what the defaults are, could you add a
comment for this?
>
> /*
> * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
> @@ -702,6 +713,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> int layer, agent;
> int id = pci_dev_id(dev);
> const char *level;
> + struct ratelimit_state *ratelimit;
>
> if (!info->status) {
> pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> @@ -709,11 +721,20 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> goto out;
> }
>
> + if (info->severity == AER_CORRECTABLE) {
> + ratelimit = &dev->aer_info->cor_log_ratelimit;
> + level = KERN_WARNING;
> + } else {
> + ratelimit = &dev->aer_info->uncor_log_ratelimit;
> + level = KERN_ERR;
> + }
> +
> + if (!__ratelimit(ratelimit))
> + return;
> +
Maybe it's just me but I found "!__ratelimit(..)" expression confusing.
At first, I read this "if there is not ratelimit", whereas the function
returns 0 ("hey, you can't fire at this point of time") and we negate
it. My series attempted to communicate this via a variable, but maybe
that was too wordy/complicated, so I'm not pushy about introducing a
similar solution here.
All the best,
Karolina
> layer = AER_GET_LAYER_ERROR(info->severity, info->status);
> agent = AER_GET_AGENT(info->severity, info->status);
>
> - level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
> -
> pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> aer_error_severity_string[info->severity],
> aer_error_layer[layer], aer_agent_string[agent]);
> @@ -755,11 +776,14 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
> int layer, agent, tlp_header_valid = 0;
> u32 status, mask;
> struct aer_err_info info;
> + struct ratelimit_state *ratelimit;
>
> if (aer_severity == AER_CORRECTABLE) {
> + ratelimit = &dev->aer_info->cor_log_ratelimit;
> status = aer->cor_status;
> mask = aer->cor_mask;
> } else {
> + ratelimit = &dev->aer_info->uncor_log_ratelimit;
> status = aer->uncor_status;
> mask = aer->uncor_mask;
> tlp_header_valid = status & AER_LOG_TLP_MASKS;
> @@ -776,6 +800,9 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>
> pci_dev_aer_stats_incr(dev, &info);
>
> + if (!__ratelimit(ratelimit))
> + return;
> +
> pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
> __aer_print_error(dev, &info);
> pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
next prev parent reply other threads:[~2025-01-16 11:11 UTC|newest]
Thread overview: 51+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-01-15 7:42 [PATCH 0/8] Rate limit AER logs/IRQs Jon Pan-Doh
2025-01-15 7:42 ` [PATCH 1/8] PCI/AER: Remove aer_print_port_info Jon Pan-Doh
2025-01-16 14:27 ` Karolina Stolarek
2025-01-18 1:57 ` Jon Pan-Doh
2025-01-20 9:25 ` Karolina Stolarek
2025-02-12 23:20 ` Jon Pan-Doh
2025-01-21 14:18 ` Ilpo Järvinen
2025-02-12 23:20 ` Jon Pan-Doh
2025-01-25 4:15 ` Sathyanarayanan Kuppuswamy
2025-02-12 23:20 ` Jon Pan-Doh
2025-01-15 7:42 ` [PATCH 2/8] PCI/AER: Move AER stat collection out of __aer_print_error Jon Pan-Doh
2025-01-16 14:47 ` Karolina Stolarek
2025-01-18 1:57 ` Jon Pan-Doh
2025-01-25 4:37 ` Sathyanarayanan Kuppuswamy
2025-02-12 23:20 ` Jon Pan-Doh
2025-01-15 7:42 ` [PATCH 3/8] PCI/AER: Rename struct aer_stats to aer_info Jon Pan-Doh
2025-01-16 10:11 ` Karolina Stolarek
2025-01-18 1:59 ` Jon Pan-Doh
2025-01-20 10:13 ` Karolina Stolarek
2025-02-12 23:20 ` Jon Pan-Doh
2025-01-15 7:42 ` [PATCH 4/8] PCI/AER: Introduce ratelimit for error logs Jon Pan-Doh
2025-01-16 11:11 ` Karolina Stolarek [this message]
2025-01-18 1:59 ` Jon Pan-Doh
2025-01-20 10:25 ` Karolina Stolarek
2025-01-15 7:42 ` [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs Jon Pan-Doh
2025-01-16 12:02 ` Karolina Stolarek
2025-01-18 1:58 ` Jon Pan-Doh
2025-01-20 10:38 ` Karolina Stolarek
2025-01-25 7:39 ` Lukas Wunner
2025-01-31 14:43 ` Jonathan Cameron
2025-03-04 23:42 ` Jon Pan-Doh
2025-02-06 13:56 ` Karolina Stolarek
2025-02-06 20:25 ` Lukas Wunner
2025-01-31 14:55 ` Jonathan Cameron
2025-03-04 23:42 ` Jon Pan-Doh
2025-01-15 7:42 ` [PATCH 6/8] PCI/AER: Add AER sysfs attributes for ratelimits Jon Pan-Doh
2025-01-31 14:32 ` Jonathan Cameron
2025-02-28 23:11 ` Jon Pan-Doh
2025-01-15 7:42 ` [PATCH 7/8] PCI/AER: Update AER sysfs ABI filename Jon Pan-Doh
2025-01-15 7:43 ` [PATCH 8/8] PCI/AER: Move AER sysfs attributes into separate directory Jon Pan-Doh
2025-01-16 10:26 ` Karolina Stolarek
2025-01-16 17:18 ` Rajat Jain
2025-01-31 14:36 ` Jonathan Cameron
2025-02-12 23:19 ` Jon Pan-Doh
2025-01-23 15:18 ` [PATCH 0/8] Rate limit AER logs/IRQs Bowman, Terry
2025-01-24 6:46 ` Jon Pan-Doh
2025-01-25 7:59 ` Lukas Wunner
2025-02-06 13:32 ` Karolina Stolarek
2025-02-12 23:19 ` Jon Pan-Doh
2025-02-13 16:00 ` Karolina Stolarek
2025-02-14 2:49 ` Jon Pan-Doh
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=b0eb4417-600b-4b70-ae5f-d51ccbd5ef2c@oracle.com \
--to=karolina.stolarek@oracle.com \
--cc=anilagrawal@meta.com \
--cc=ben.fuller@oracle.com \
--cc=bhelgaas@google.com \
--cc=drewwalton@microsoft.com \
--cc=linux-pci@vger.kernel.org \
--cc=martin.petersen@oracle.com \
--cc=pandoh@google.com \
--cc=tony.luck@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox