Linux PCI subsystem development
 help / color / mirror / Atom feed
From: Bjorn Helgaas <helgaas@kernel.org>
To: Alexey Bogoslavsky <Alexey.Bogoslavsky@wdc.com>
Cc: "'hch@lst.de'" <hch@lst.de>,
	"linux-pci@vger.kernel.org" <linux-pci@vger.kernel.org>,
	"bhelgaas@google.com" <bhelgaas@google.com>,
	Rajat Khandelwal <rajat.khandelwal@linux.intel.com>
Subject: Re: [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD
Date: Tue, 17 Jan 2023 08:22:57 -0600	[thread overview]
Message-ID: <20230117142257.GA117624@bhelgaas> (raw)
In-Reply-To: <DM6PR04MB6473099D6C2AA889D959A11C8BC69@DM6PR04MB6473.namprd04.prod.outlook.com>

[+cc Rajat]

On Tue, Jan 17, 2023 at 01:20:28PM +0000, Alexey Bogoslavsky wrote:
> >From: 'hch@lst.de' <hch@lst.de> 
> >On Mon, Jan 16, 2023 at 06:32:54PM +0000, Alexey Bogoslavsky wrote:
> >> From: Alexey Bogoslavsky <mailto:Alexey.Bogoslavsky@wdc.com>
> >>
> >> A bug was found in SN730 WD SSD that causes occasional false AER reporting
> >> of correctable errors. While functionally harmless, this causes error
> >> messages to appear in the system log (dmesg) which, in turn, causes
> >> problems in automated platform validation tests.

I don't think automated test problems warrant an OS change to suppress
warnings.  Those are internal tests where you can automatically ignore
warnings you don't care about.

> >> Since the issue can not
> >> be fixed by FW, customers asked for correctable error reporting to be
> >> quirked out in the kernel for this particular device.

Customer complaints are a little more of an issue.  But let's clarify
what's happening here.  You mention "false AER reporting," and I want
to understand that better.  I guess you mean that SN730 logs a
correctable error and generates an ERR_COR message when no error
actually occurred?

I hesitate to turn off correctable error reporting completely because
that information is useful for monitoring overall system health.  But
there are two things I think we could do:

  - Reformat the correctable error messages so they are obviously
    different from uncorrectable errors and indicate that these have
    been automatically corrected, and

  - Possibly rate-limit them on a per-device basis, e.g., something
    like https://lore.kernel.org/r/20230103165548.570377-1-rajat.khandelwal@linux.intel.com

If we do end up having to turn off reporting completely, I would log a
note that we're doing that so we know we're giving up something and
there may be legitimate errors that we will never see.

> >> The patch was manually verified. It was checked that correctable errors
> >> are still detected but ignored for the target device (SN730), and are both
> >> detected and reported for devices not affected by this quirk.
> >>
> >> Signed-off-by: Alexey Bogoslavsky mailto:alexey.bogoslavsky@wdc.com

Check the history and use the same format here, i.e.,

  Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>

(Also for the From: line inserted at the top.)

> >> ---
> >>  drivers/pci/pcie/aer.c  | 3 +++
> >>  drivers/pci/quirks.c    | 6 ++++++
> >>  include/linux/pci.h     | 1 +
> >>  include/linux/pci_ids.h | 4 ++++
> >>  4 files changed, 14 insertions(+)
> >>
> >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> >> index d7ee79d7b192..5cc24d28b76d 100644
> >> --- a/drivers/pci/pcie/aer.c
> >> +++ b/drivers/pci/pcie/aer.c
> >> @@ -721,6 +721,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> >>               goto out;
> >>       }
> >>
> >> +     if ((info->severity == AER_CORRECTABLE) && dev->ignore_correctable_errors)
> 
> >No need for the inner braces.
> Will fix
> 
> >> +DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_WESTERN_DIGITAL, PCI_DEVICE_ID_WESTERN_DIGITAL_SN730, wd_ignore_correctable_errors);
> 
> >Overly long line.  Also wd_ seems like an odd prefix, I'd do pci_
> instead.
> Will fix both, thanks
> >But overall I'm not really sure it's worth adding code just to surpress
> >a harmless warning.
> This is, of course, problematic. We're only resorting to this option after we've tried pretty much everything else

  reply	other threads:[~2023-01-17 14:25 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <BY5PR04MB704131DBB47254C9F1FF12B38B409@BY5PR04MB7041.namprd04.prod.outlook.com>
2023-01-16 18:32 ` [PATCH 1/1] PCI/AER: Ignore correctable error reports for SN730 WD SSD Alexey Bogoslavsky
2023-01-17  7:14   ` 'hch@lst.de'
2023-01-17 13:20     ` Alexey Bogoslavsky
2023-01-17 14:22       ` Bjorn Helgaas [this message]
2023-01-17 18:06         ` Alexey Bogoslavsky
2023-01-17 15:54   ` Keith Busch
2023-01-17 18:15     ` Alexey Bogoslavsky
2023-04-11 22:15       ` Bjorn Helgaas
2023-04-24 11:27         ` Alexey Bogoslavsky
2023-04-28 22:00           ` Bjorn Helgaas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230117142257.GA117624@bhelgaas \
    --to=helgaas@kernel.org \
    --cc=Alexey.Bogoslavsky@wdc.com \
    --cc=bhelgaas@google.com \
    --cc=hch@lst.de \
    --cc=linux-pci@vger.kernel.org \
    --cc=rajat.khandelwal@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox