All of lore.kernel.org
 help / color / mirror / Atom feed
From: Lukas Wunner <lukas@wunner.de>
To: Karolina Stolarek <karolina.stolarek@oracle.com>
Cc: Jon Pan-Doh <pandoh@google.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	linux-pci@vger.kernel.org,
	Martin Petersen <martin.petersen@oracle.com>,
	Ben Fuller <ben.fuller@oracle.com>,
	Drew Walton <drewwalton@microsoft.com>,
	Anil Agrawal <anilagrawal@meta.com>,
	Tony Luck <tony.luck@intel.com>
Subject: Re: [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs
Date: Thu, 6 Feb 2025 21:25:33 +0100	[thread overview]
Message-ID: <Z6UavQr5UCDSDyMk@wunner.de> (raw)
In-Reply-To: <29d3d0c4-656a-403b-841d-e3b61bbe594d@oracle.com>

On Thu, Feb 06, 2025 at 02:56:20PM +0100, Karolina Stolarek wrote:
> On 25/01/2025 08:39, Lukas Wunner wrote:
> > Masking errors at the register level feels overzealous,
> > in particular because it also disables logging via tracepoints.
> > 
> > Is there a concrete device that necessitates this change?
> 
> I faced issues with excessive Correctable Errors reporting with Samsung
> PM1733 NVMe (a couple of thousand errors per hour), which were still
> polluting the logs even after introducing a ratelimit

I'd suggest to add a "u32 aer_cor_mask" to "struct pci_dev" in the
"#ifdef CONFIG_PCIEAER" section.

Then add a "DECLARE_PCI_FIXUP_HEADER()" macro in drivers/pci/quirks.c
for the Samsung PM1733 which calls a new function which sets exactly the
error bits you're seeing to aer_cor_mask.  This should be #ifdef'ed to
CONFIG_PCIEAER as well.

Finally, amend aer.c to set the bits in aer_cor_mask in the
PCI_ERR_COR_MASK register on probe.


> > If there is, consider adding a quirk for this particular device
> > which masks specific errors, but doesn't affect other devices.
> 
> There were many other reports of Correctable Error floods, as signaled in
> the cover letter, so it's hard to pinpoint the specific driver that should
> mask these errors.

If a specific device frequently signals the same errors,
I think that's a bug of that device and if the vendor doesn't
provide a firmware update, quiescing the errors through a quirk
is a plausible solution.

Of course if this is widespread, it becomes a maintenance nightmare
and then the quirk approach is not a viable option.  I cannot say
whether that's the case.  So far there's a report for one specific
product (the Samsung drive) and hinting that the problem may be
widespread.  It's difficult to make a recommendation without
precise data.

Thanks,

Lukas

  reply	other threads:[~2025-02-06 20:25 UTC|newest]

Thread overview: 51+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-15  7:42 [PATCH 0/8] Rate limit AER logs/IRQs Jon Pan-Doh
2025-01-15  7:42 ` [PATCH 1/8] PCI/AER: Remove aer_print_port_info Jon Pan-Doh
2025-01-16 14:27   ` Karolina Stolarek
2025-01-18  1:57     ` Jon Pan-Doh
2025-01-20  9:25       ` Karolina Stolarek
2025-02-12 23:20         ` Jon Pan-Doh
2025-01-21 14:18   ` Ilpo Järvinen
2025-02-12 23:20     ` Jon Pan-Doh
2025-01-25  4:15   ` Sathyanarayanan Kuppuswamy
2025-02-12 23:20     ` Jon Pan-Doh
2025-01-15  7:42 ` [PATCH 2/8] PCI/AER: Move AER stat collection out of __aer_print_error Jon Pan-Doh
2025-01-16 14:47   ` Karolina Stolarek
2025-01-18  1:57     ` Jon Pan-Doh
2025-01-25  4:37   ` Sathyanarayanan Kuppuswamy
2025-02-12 23:20     ` Jon Pan-Doh
2025-01-15  7:42 ` [PATCH 3/8] PCI/AER: Rename struct aer_stats to aer_info Jon Pan-Doh
2025-01-16 10:11   ` Karolina Stolarek
2025-01-18  1:59     ` Jon Pan-Doh
2025-01-20 10:13       ` Karolina Stolarek
2025-02-12 23:20         ` Jon Pan-Doh
2025-01-15  7:42 ` [PATCH 4/8] PCI/AER: Introduce ratelimit for error logs Jon Pan-Doh
2025-01-16 11:11   ` Karolina Stolarek
2025-01-18  1:59     ` Jon Pan-Doh
2025-01-20 10:25       ` Karolina Stolarek
2025-01-15  7:42 ` [PATCH 5/8] PCI/AER: Introduce ratelimit for AER IRQs Jon Pan-Doh
2025-01-16 12:02   ` Karolina Stolarek
2025-01-18  1:58     ` Jon Pan-Doh
2025-01-20 10:38       ` Karolina Stolarek
2025-01-25  7:39   ` Lukas Wunner
2025-01-31 14:43     ` Jonathan Cameron
2025-03-04 23:42       ` Jon Pan-Doh
2025-02-06 13:56     ` Karolina Stolarek
2025-02-06 20:25       ` Lukas Wunner [this message]
2025-01-31 14:55   ` Jonathan Cameron
2025-03-04 23:42     ` Jon Pan-Doh
2025-01-15  7:42 ` [PATCH 6/8] PCI/AER: Add AER sysfs attributes for ratelimits Jon Pan-Doh
2025-01-31 14:32   ` Jonathan Cameron
2025-02-28 23:11     ` Jon Pan-Doh
2025-01-15  7:42 ` [PATCH 7/8] PCI/AER: Update AER sysfs ABI filename Jon Pan-Doh
2025-01-15  7:43 ` [PATCH 8/8] PCI/AER: Move AER sysfs attributes into separate directory Jon Pan-Doh
2025-01-16 10:26   ` Karolina Stolarek
2025-01-16 17:18     ` Rajat Jain
2025-01-31 14:36       ` Jonathan Cameron
2025-02-12 23:19         ` Jon Pan-Doh
2025-01-23 15:18 ` [PATCH 0/8] Rate limit AER logs/IRQs Bowman, Terry
2025-01-24  6:46   ` Jon Pan-Doh
2025-01-25  7:59     ` Lukas Wunner
2025-02-06 13:32 ` Karolina Stolarek
2025-02-12 23:19   ` Jon Pan-Doh
2025-02-13 16:00     ` Karolina Stolarek
2025-02-14  2:49       ` Jon Pan-Doh

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z6UavQr5UCDSDyMk@wunner.de \
    --to=lukas@wunner.de \
    --cc=anilagrawal@meta.com \
    --cc=ben.fuller@oracle.com \
    --cc=bhelgaas@google.com \
    --cc=drewwalton@microsoft.com \
    --cc=karolina.stolarek@oracle.com \
    --cc=linux-pci@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=pandoh@google.com \
    --cc=tony.luck@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.