Re: [PATCH] PCI/AER: Rate limit the reporting of the correctable errors

From: Rajat Khandelwal <rajat.khandelwal@linux.intel.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: Paul Menzel <pmenzel@molgen.mpg.de>,
	"Neftin, Sasha" <sasha.neftin@intel.com>,
	Leon Romanovsky <leon@kernel.org>,
	linux-pci@vger.kernel.org,
	Frederick Zhang <frederick888@tsundere.moe>,
	rajat.khandelwal@intel.com, linux-kernel@vger.kernel.org,
	oohall@gmail.com, bhelgaas@google.com,
	linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH] PCI/AER: Rate limit the reporting of the correctable errors
Date: Wed, 4 Jan 2023 10:27:33 +0530
Message-ID: <e6e53119-a249-a03f-c9eb-3caafbe5d983@linux.intel.com>
In-Reply-To: <20230103191418.GA1011392@bhelgaas>

Hi Bjorn,

Thanks for the acknowledgement.

On 1/4/2023 12:44 AM, Bjorn Helgaas wrote:
> [+cc Paul, Sasha, Leon, Frederick]
>
> (Please cc folks who have commented on previous versions of your
> patch.)
>
> On Tue, Jan 03, 2023 at 10:25:48PM +0530, Rajat Khandelwal wrote:
>> There are many instances where correctable errors tend to inundate
>> the message buffer. We observe such instances during thunderbolt PCIe
>> tunneling.
>>
>> It's true that they are mitigated by the hardware and are non-fatal
>> but we shouldn't be spamming the logs with such correctable errors as it
>> confuses other kernel developers less familiar with PCI errors, support
>> staff, and users who happen to look at the logs, hence rate limit them.
> I want a better understanding of why we have so many errors before
> rate-limiting everybody.

--> We are debugging this inside Intel together with the Thunderbolt/PCIe team, and it
will take some time to reach a conclusion. Since I see these errors with other Thunderbolt
devices as well, I am currently segregating all the TBT devices so that we have proper data to debug.

>
>> A typical example log inside an HP TBT4 dock:
>> [54912.661142] pcieport 0000:00:07.0: AER: Multiple Corrected error received: 0000:2b:00.0
>> [54912.661194] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
>> [54912.661203] igc 0000:2b:00.0:   device [8086:5502] error status/mask=00001100/00002000
>> [54912.661211] igc 0000:2b:00.0:    [ 8] Rollover
>> [54912.661219] igc 0000:2b:00.0:    [12] Timeout
>> [54982.838760] pcieport 0000:00:07.0: AER: Corrected error received: 0000:2b:00.0
>> [54982.838798] igc 0000:2b:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Transmitter ID)
>> [54982.838808] igc 0000:2b:00.0:   device [8086:5502] error status/mask=00001000/00002000
>> [54982.838817] igc 0000:2b:00.0:    [12] Timeout
> Please remove the timestamps; they don't contribute to understanding
> the problem.

--> Sure.
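
For anyone reading the log above, here is a small sketch of how the status/mask
values decode into bit names. The bit positions are those of the AER Correctable
Error Status register; the short names are modeled on the labels printed in the
log, and only "Rollover" (bit 8) and "Timeout" (bit 12) are confirmed by this
thread. The helper names `decode_bits` and `decode_line` are made up for this
illustration. The regex matches anywhere in the line, so it works with or
without the timestamps.

```python
import re

# Bit positions in the AER Correctable Error Status register. The short names
# are illustrative; only bits 8 and 12 appear in the log quoted in this thread.
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}

STATUS_MASK_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")


def decode_bits(value):
    """Return the names of all known bits set in a status or mask value."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items()) if value & (1 << bit)]


def decode_line(line):
    """Extract and decode status/mask from one log line; None if absent."""
    match = STATUS_MASK_RE.search(line)
    if match is None:
        return None
    status, mask = (int(group, 16) for group in match.groups())
    return {"status": decode_bits(status), "mask": decode_bits(mask)}


line = "igc 0000:2b:00.0:   device [8086:5502] error status/mask=00001100/00002000"
print(decode_line(line))
```

In the first log entry the status 00001100 has bits 8 and 12 set, which matches
the "[ 8] Rollover" and "[12] Timeout" lines that follow it.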

>
>> This gets repeated continuously, thus inundating the buffer.
> Did you verify that we actually clear the Correctable Error Status
> register?

--> This patch only rate limits the reporting of correctable errors, since they are
non-fatal and tend to flood the kernel log, particularly with Thunderbolt
connections. It has no impact anywhere else.
Following your suggestion on the igc patch, I found rate limiting to be a workable option
for now. I have dropped any masking of the bits.
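
To make the idea concrete, below is a minimal user-space sketch of
interval/burst rate limiting. It is only an illustration of the technique, not
the code from the patch: the `RateLimit` class, its parameters, and the
injected clock are all assumptions made for this example. It lets at most
`burst` messages through per `interval` seconds and counts the rest as
suppressed.

```python
import time


class RateLimit:
    """Allow at most `burst` events per `interval` seconds; count the rest."""

    def __init__(self, interval, burst, clock=time.monotonic):
        self.interval = interval      # window length in seconds
        self.burst = burst            # messages allowed per window
        self.clock = clock            # injectable so the example is deterministic
        self.window_start = None
        self.printed = 0
        self.missed = 0               # total number of suppressed messages

    def allow(self):
        now = self.clock()
        if self.window_start is None or now - self.window_start >= self.interval:
            # A new window begins: reset the per-window budget.
            self.window_start = now
            self.printed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        self.missed += 1
        return False


# Simulated arrival times of correctable-error reports, in seconds.
fake_now = [0.0]
limiter = RateLimit(interval=5.0, burst=2, clock=lambda: fake_now[0])
results = []
for t in (0.0, 1.0, 2.0, 6.0):
    fake_now[0] = t
    results.append(limiter.allow())
print(results, limiter.missed)
```

With a burst of 2 per 5 seconds, the third report in the first window is
dropped and the report at 6 seconds starts a new window, so the errors are
still visible in the log without flooding it.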

>
> https://bugzilla.kernel.org/show_bug.cgi?id=216863  looks like a
> similar issue.  The issue Frederick is seeing happens when resuming
> from sleep.  Is there some event that triggers the correctable errors
> you see?

--> The signatures look similar, but no single event triggers these errors.
I see them in many situations (hot plug, cold boot, warm boot, s0ix, etc.).
Further, I think the replay correctable errors arise in Thunderbolt PCIe devices because
the timeout values are not adjusted properly for Thunderbolt daisy chains.
I am not sure, but since these PCIe devices work when attached directly to the motherboard and
only give issues when they are inside Thunderbolt devices, I think the timeout values do not
account for the additional PCIe bridges in the daisy chain.
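
The hypothesis can be pictured with a toy model. This is purely illustrative:
the function names and every number below are placeholders chosen for the
example, not measurements or values from any specification. The model assumes
a replay timeout is reported when the ACK round trip exceeds a fixed timeout,
and that each extra hop in the daisy chain adds a fixed one-way latency.

```python
def ack_round_trip_ns(base_latency_ns, per_hop_latency_ns, hops):
    """Round trip for an ACK: one-way latency there and back (toy model)."""
    return 2 * (base_latency_ns + per_hop_latency_ns * hops)


def replay_timer_expires(round_trip_ns, replay_timeout_ns):
    """True if the ACK would arrive after the replay timer has run out."""
    return round_trip_ns > replay_timeout_ns


# Placeholder values only, picked to show the effect of added hops.
BASE_NS = 200
PER_HOP_NS = 300
TIMEOUT_NS = 1000

for hops in range(3):
    rtt = ack_round_trip_ns(BASE_NS, PER_HOP_NS, hops)
    print(hops, rtt, replay_timer_expires(rtt, TIMEOUT_NS))
```

In this model a timeout that is adequate for a device attached directly to the
motherboard (zero hops) is exceeded once enough hops are added, unless the
timeout is raised to match.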

>
> Bjorn

Thread overview: 5+ messages
2023-01-03 16:55 [PATCH] PCI/AER: Rate limit the reporting of the correctable errors Rajat Khandelwal
2023-01-03 19:14 ` Bjorn Helgaas
2023-01-04  4:57   ` Rajat Khandelwal [this message]
2023-01-04  6:46     ` Leon Romanovsky
2023-01-04 13:04       ` Rajat Khandelwal
