linuxppc-dev.lists.ozlabs.org archive mirror
 help / color / mirror / Atom feed
From: Bjorn Helgaas <helgaas@kernel.org>
To: Grant Grundler <grundler@chromium.org>
Cc: Rajat Jain <rajatja@chromium.org>,
	Rajat Khandelwal <rajat.khandelwal@linux.intel.com>,
	linux-pci@vger.kernel.org,
	Mahesh J Salgaonkar <mahesh@linux.ibm.com>,
	linux-kernel@vger.kernel.org,
	Oliver O 'Halloran <oohall@gmail.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors
Date: Thu, 6 Apr 2023 14:50:45 -0500	[thread overview]
Message-ID: <20230406195045.GA3729127@bhelgaas> (raw)
In-Reply-To: <20230317175109.3859943-2-grundler@chromium.org>

On Fri, Mar 17, 2023 at 10:51:09AM -0700, Grant Grundler wrote:
> From: Rajat Khandelwal <rajat.khandelwal@linux.intel.com>
> 
> There are many instances where correctable errors tend to inundate
> the message buffer. We observe such instances during thunderbolt PCIe
> tunneling.
> 
> It's true that they are mitigated by the hardware and are non-fatal
> but we shouldn't be spamming the logs with such correctable errors as it
> confuses other kernel developers less familiar with PCI errors, support
> staff, and users who happen to look at the logs, hence rate limit them.
> 
> A typical example log inside an HP TBT4 dock:
> [54912.661142] pcieport 0000:00:07.0: AER: Multiple Corrected error received: 0000:2b:00.0
> [54912.661194] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> [54912.661203] igc 0000:2b:00.0:   device [8086:5502] error status/mask=00001100/00002000
> [54912.661211] igc 0000:2b:00.0:    [ 8] Rollover
> [54912.661219] igc 0000:2b:00.0:    [12] Timeout
> [54982.838760] pcieport 0000:00:07.0: AER: Corrected error received: 0000:2b:00.0
> [54982.838798] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
> [54982.838808] igc 0000:2b:00.0:   device [8086:5502] error status/mask=00001000/00002000
> [54982.838817] igc 0000:2b:00.0:    [12] Timeout

The timestamps don't contribute to understanding the problem, so we
can omit them.

> This gets repeated continuously, thus inundating the buffer.
> 
> Signed-off-by: Rajat Khandelwal <rajat.khandelwal@linux.intel.com>
> Signed-off-by: Grant Grundler <grundler@chromium.org>
> ---
>  drivers/pci/pcie/aer.c | 42 ++++++++++++++++++++++++++++--------------
>  1 file changed, 28 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index cb6b96233967..b592cea8bffe 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -706,8 +706,8 @@ static void __aer_print_error(struct pci_dev *dev,
>  			errmsg = "Unknown Error Bit";
>  
>  		if (info->severity == AER_CORRECTABLE)
> -			pci_info(dev, "   [%2d] %-22s%s\n", i, errmsg,
> -				info->first_error == i ? " (First)" : "");
> +			pci_info_ratelimited(dev, "   [%2d] %-22s%s\n", i, errmsg,
> +					     info->first_error == i ? " (First)" : "");

I don't think this is going to reliably work the way we want.  We have
a bunch of pci_info_ratelimited() calls, and each caller has its own
ratelimit_state data.  Unless we call pci_info_ratelimited() exactly
the same number of times for each error, the ratelimit counters will
get out of sync and we'll end up printing fragments from error A mixed
with fragments from error B.

I think we need to explicitly manage the ratelimiting ourselves,
similar to print_hmi_event_info() or print_extlog_rcd().  Then we can
have a *single* ratelimit_state, and we can check it once to determine
whether to log this correctable error.

>  		else
>  			pci_err(dev, "   [%2d] %-22s%s\n", i, errmsg,
>  				info->first_error == i ? " (First)" : "");
> @@ -719,7 +719,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  {
>  	int layer, agent;
>  	int id = ((dev->bus->number << 8) | dev->devfn);
> -	const char *level;
>  
>  	if (!info->status) {
>  		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> @@ -730,14 +729,21 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
>  	agent = AER_GET_AGENT(info->severity, info->status);
>  
> -	level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR;
> +	if (info->severity == AER_CORRECTABLE) {
> +		pci_info_ratelimited(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> +				     aer_error_severity_string[info->severity],
> +				     aer_error_layer[layer], aer_agent_string[agent]);
>  
> -	pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> -		   aer_error_severity_string[info->severity],
> -		   aer_error_layer[layer], aer_agent_string[agent]);
> +		pci_info_ratelimited(dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
> +				     dev->vendor, dev->device, info->status, info->mask);
> +	} else {
> +		pci_err(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> +			aer_error_severity_string[info->severity],
> +			aer_error_layer[layer], aer_agent_string[agent]);
>  
> -	pci_printk(level, dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
> -		   dev->vendor, dev->device, info->status, info->mask);
> +		pci_err(dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
> +			dev->vendor, dev->device, info->status, info->mask);
> +	}
>  
>  	__aer_print_error(dev, info);
>  
> @@ -757,11 +763,19 @@ static void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
>  	u8 bus = info->id >> 8;
>  	u8 devfn = info->id & 0xff;
>  
> -	pci_info(dev, "%s%s error received: %04x:%02x:%02x.%d\n",
> -		 info->multi_error_valid ? "Multiple " : "",
> -		 aer_error_severity_string[info->severity],
> -		 pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
> -		 PCI_FUNC(devfn));
> +	if (info->severity == AER_CORRECTABLE)
> +		pci_info_ratelimited(dev, "%s%s error received: %04x:%02x:%02x.%d\n",
> +				     info->multi_error_valid ? "Multiple " : "",
> +				     aer_error_severity_string[info->severity],
> +				     pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
> +				     PCI_FUNC(devfn));
> +	else
> +		pci_info(dev, "%s%s error received: %04x:%02x:%02x.%d\n",
> +			 info->multi_error_valid ? "Multiple " : "",
> +			 aer_error_severity_string[info->severity],
> +			 pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn),
> +			 PCI_FUNC(devfn));
> +
>  }
>  
>  #ifdef CONFIG_ACPI_APEI_PCIEAER
> -- 
> 2.40.0.rc1.284.g88254d51c5-goog
> 

  reply	other threads:[~2023-04-06 19:51 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-03-17 17:51 [PATCHv2 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO Grant Grundler
2023-03-17 17:51 ` [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors Grant Grundler
2023-04-06 19:50   ` Bjorn Helgaas [this message]
2023-04-07 18:53     ` Grant Grundler
2023-04-07 19:46       ` Bjorn Helgaas
2023-04-07 23:46         ` Grant Grundler
2023-05-17 16:02           ` Bjorn Helgaas
2023-05-17 21:02             ` Grant Grundler
2023-05-18  5:58             ` Grant Grundler
2023-04-07 23:47         ` Grant Grundler
2023-04-07 23:49         ` Grant Grundler
2023-05-18  6:11         ` Grant Grundler
2023-06-06  3:44           ` Grant Grundler
2023-06-06  3:45           ` Grant Grundler
2023-03-17 18:50 ` [PATCHv2 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO Sathyanarayanan Kuppuswamy
2023-03-17 19:30   ` Bjorn Helgaas
2023-03-19  6:00     ` Grant Grundler

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230406195045.GA3729127@bhelgaas \
    --to=helgaas@kernel.org \
    --cc=bhelgaas@google.com \
    --cc=grundler@chromium.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mahesh@linux.ibm.com \
    --cc=oohall@gmail.com \
    --cc=rajat.khandelwal@linux.intel.com \
    --cc=rajatja@chromium.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).