Linux PCI subsystem development
 help / color / mirror / Atom feed
From: Karolina Stolarek <karolina.stolarek@oracle.com>
To: Jon Pan-Doh <pandoh@google.com>
Cc: linux-pci@vger.kernel.org, "Bjorn Helgaas" <bhelgaas@google.com>,
	"Martin Petersen" <martin.petersen@oracle.com>,
	"Ben Fuller" <ben.fuller@oracle.com>,
	"Drew Walton" <drewwalton@microsoft.com>,
	"Anil Agrawal" <anilagrawal@meta.com>,
	"Tony Luck" <tony.luck@intel.com>,
	"Ilpo Järvinen" <ilpo.jarvinen@linux.intel.com>,
	"Sathyanarayanan Kuppuswamy"
	<sathyanarayanan.kuppuswamy@linux.intel.com>,
	"Lukas Wunner" <lukas@wunner.de>,
	"Jonathan Cameron" <Jonathan.Cameron@huawei.com>,
	"Terry Bowman" <Terry.bowman@amd.com>
Subject: Re: [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs
Date: Thu, 20 Mar 2025 15:56:53 +0100	[thread overview]
Message-ID: <ebafe3cc-2d0f-4000-863d-20365977e27c@oracle.com> (raw)
In-Reply-To: <20250320082057.622983-6-pandoh@google.com>


On 20/03/2025 09:20, Jon Pan-Doh wrote:
> Spammy devices can flood kernel logs with AER errors and slow/stall 
> execution. Add per-device ratelimits for AER correctable and 
> uncorrectable errors that use the kernel defaults (10 per 5s).
> 
> Tested using aer-inject[1]. Sent 11 AER errors. Observed 10 errors 
> logged while AER stats (cat /sys/bus/pci/devices/<dev>/ 
> aer_dev_correctable) show true count of 11.
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer- 
> inject.git
 >
> Signed-off-by: Jon Pan-Doh <pandoh@google.com>
> Reviewed-by: Karolina Stolarek <karolina.stolarek@oracle.com>

For future reference -- please drop r-bs from patches that have 
functional/bigger changes. New code nullifies previous reviews.

> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 3069376b3553..081cef5fc678 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -88,6 +89,10 @@ struct aer_report {
>   	u64 rootport_total_cor_errs;
>   	u64 rootport_total_fatal_errs;
>   	u64 rootport_total_nonfatal_errs;
> +
> +	/* Ratelimits for errors */
> +	struct ratelimit_state cor_log_ratelimit;
> +	struct ratelimit_state uncor_log_ratelimit;
>   };
>   
>   #define AER_LOG_TLP_MASKS		(PCI_ERR_UNC_POISON_TLP|	\
> @@ -379,6 +384,15 @@ void pci_aer_init(struct pci_dev *dev)
>   
>   	dev->aer_report = kzalloc(sizeof(*dev->aer_report), GFP_KERNEL);
>   
> +	/*
> +	 * Ratelimits are doubled as a given error produces 2 logs (root port
> +	 * and endpoint) that should be under same ratelimit.
> +	 */

It took me a bit to understand what this comment is about.

When we handle an error message, we first use the source's ratelimit to 
decide if we want to print the port info, and then the actual error. In 
theory, there could be more errors of the same class coming from other 
devices within one message. For these devices, we would call the 
ratelimit just once. I don't have a nice an clean solution for this 
problem, I just wanted to highlight that 1) we don't use the Root Port's 
ratelimit in aer_print_port_info(), 2) we may use the bursts to either 
print port_info + error message or just the message, in different 
combinations. I think we should reword this comment to highlight the 
fact that we don't check the ratelimit once per error, we could do it twice.

Also, I wonder -- do only Endpoints generate error messages? From what I 
understand, that some errors can be detected by intermediary devices.

> +	ratelimit_state_init(&dev->aer_report->cor_log_ratelimit,
> +			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST * 2);
> +	ratelimit_state_init(&dev->aer_report->uncor_log_ratelimit,
> +			     DEFAULT_RATELIMIT_INTERVAL, DEFAULT_RATELIMIT_BURST * 2);
> +
>   	/*
>   	 * We save/restore PCI_ERR_UNCOR_MASK, PCI_ERR_UNCOR_SEVER,
>   	 * PCI_ERR_COR_MASK, and PCI_ERR_CAP.  Root and Root Complex Event
> @@ -668,6 +682,17 @@ static void pci_rootport_aer_stats_incr(struct pci_dev *pdev,
>   	}
>   }
>   
> +static int aer_ratelimit(struct pci_dev *dev, unsigned int severity)

I really like this solution, it's nice and tidy


>   static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
>   {

I'm also thinking -- we are ratelimiting the aer_print_port_info() and 
aer_print_error(). What about the messages in dpc_process_error()? 
Should we check early if DPC was triggered because of an uncorrectable 
error, and if so, ratelimit that?

All the best,
Karolina

  reply	other threads:[~2025-03-20 14:57 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-20  8:20 [PATCH v4 0/7] Rate limit AER logs Jon Pan-Doh
2025-03-20  8:20 ` [PATCH v4 1/7] PCI/AER: Check log level once and propagate down Jon Pan-Doh
2025-03-20  8:20 ` [PATCH v4 2/7] PCI/AER: Make all pci_print_aer() log levels depend on error type Jon Pan-Doh
2025-03-20  8:20 ` [PATCH v4 3/7] PCI/AER: Move AER stat collection out of __aer_print_error() Jon Pan-Doh
2025-03-20 14:59   ` Karolina Stolarek
2025-03-20 19:07     ` Jon Pan-Doh
2025-03-20  8:20 ` [PATCH v4 4/7] PCI/AER: Rename struct aer_stats to aer_report Jon Pan-Doh
2025-03-20 17:42   ` Sathyanarayanan Kuppuswamy
2025-03-20 19:53     ` Jon Pan-Doh
2025-03-21 13:38       ` Karolina Stolarek
2025-03-20  8:20 ` [PATCH v4 5/7] PCI/AER: Introduce ratelimit for error logs Jon Pan-Doh
2025-03-20 14:56   ` Karolina Stolarek [this message]
2025-03-20 17:51     ` Bjorn Helgaas
2025-03-20 19:53       ` Jon Pan-Doh
2025-03-20 20:29         ` Bjorn Helgaas
2025-03-21  1:58           ` Jon Pan-Doh
2025-03-20 19:37     ` Jon Pan-Doh
2025-03-21  1:00   ` Sathyanarayanan Kuppuswamy
2025-03-21 19:24     ` Jon Pan-Doh
2025-03-21 21:47       ` Sathyanarayanan Kuppuswamy
2025-03-21 21:59         ` Bjorn Helgaas
2025-03-21 22:11         ` Jon Pan-Doh
2025-03-20  8:20 ` [PATCH v4 6/7] PCI/AER: Add ratelimits to PCI AER Documentation Jon Pan-Doh
2025-03-20 14:57   ` Karolina Stolarek
2025-03-21  1:00   ` Sathyanarayanan Kuppuswamy
2025-03-20  8:20 ` [PATCH v4 7/7] PCI/AER: Add sysfs attributes for log ratelimits Jon Pan-Doh
2025-03-20 14:58   ` Karolina Stolarek
2025-03-20 19:36     ` Jon Pan-Doh
2025-03-21  1:02   ` Sathyanarayanan Kuppuswamy
2025-03-21  1:55     ` Jon Pan-Doh
2025-03-20 14:34 ` [PATCH v4 0/7] Rate limit AER logs Christoph Hellwig
2025-03-20 18:45 ` Paul E. McKenney

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ebafe3cc-2d0f-4000-863d-20365977e27c@oracle.com \
    --to=karolina.stolarek@oracle.com \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=Terry.bowman@amd.com \
    --cc=anilagrawal@meta.com \
    --cc=ben.fuller@oracle.com \
    --cc=bhelgaas@google.com \
    --cc=drewwalton@microsoft.com \
    --cc=ilpo.jarvinen@linux.intel.com \
    --cc=linux-pci@vger.kernel.org \
    --cc=lukas@wunner.de \
    --cc=martin.petersen@oracle.com \
    --cc=pandoh@google.com \
    --cc=sathyanarayanan.kuppuswamy@linux.intel.com \
    --cc=tony.luck@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox