From: "Raj, Ashok" <ashok.raj@intel.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: Kuppuswamy Sathyanarayanan
<sathyanarayanan.kuppuswamy@linux.intel.com>,
Kuppuswamy Sathyanarayanan <knsathya@kernel.org>,
Ashok Raj <ashok.raj@intel.com>,
linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
Eric Badger <ebadger@purestorage.com>,
Oliver OHalloran <oohall@gmail.com>,
Bjorn Helgaas <bhelgaas@google.com>,
linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly
Date: Sun, 13 Mar 2022 14:43:14 -0700
Message-ID: <20220313214314.GD182809@otc-nc-03>
In-Reply-To: <20220313195220.GA436941@bhelgaas>
On Sun, Mar 13, 2022 at 02:52:20PM -0500, Bjorn Helgaas wrote:
> On Fri, Mar 11, 2022 at 02:58:07AM +0000, Kuppuswamy Sathyanarayanan wrote:
> > Currently the aer_irq() handler returns IRQ_NONE when neither
> > PCI_ERR_ROOT_UNCOR_RCV nor PCI_ERR_ROOT_COR_RCV is set. But this
> > assumption is incorrect.
> >
> > Consider a scenario where aer_irq() is triggered for a correctable
> > error, and while we process the error and before we clear the error
> > status in "Root Error Status" register, if the same kind of error
> > is triggered again, since aer_irq() only clears events it saw, the
> > multi-bit error is left intact. This will cause the interrupt to fire
> > again, resulting in entering aer_irq() with just the multi-bit error
> > logged in the "Root Error Status" register.
> >
> > Repeated AER recovery test has revealed this condition does happen
> > and this prevents any new interrupt from being triggered. Allow to
> > process interrupt even if only multi-correctable (BIT 1) or
> > multi-uncorrectable bit (BIT 3) is set.
> >
> > Reported-by: Eric Badger <ebadger@purestorage.com>
>
> Is there a bug report with any concrete details (dmesg, lspci, etc)
> that we can include here?
Eric might have more details to add; he collected numerous logs to
establish the timeline of the problem. The test stressed the links with an
automated power off, which results in an eDPC UC error followed by link
down. Recovery worked fine for several cycles, and then suddenly there
were no more interrupts. A manual PCI rescan would probe the device and it
would be operational again.

The test patch revealed that we entered aer_irq() with only a multi-error
bit set (PCI_ERR_ROOT_MULTI_COR_RCV or PCI_ERR_ROOT_MULTI_UNCOR_RCV).
Since we returned without clearing those bits, interrupt generation ceased
after that.
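To make the sequence concrete, here is a rough user-space model of the race
described above. It is only a sketch under stated assumptions, not the
kernel code: the sim_* names and struct sim_root_port are hypothetical, the
register is reduced to the four status bits (values as in pci_regs.h), and
the handler is simplified to "return early unless a bit in the handled mask
is set, otherwise write-1-to-clear what was read".

```c
#include <stdbool.h>
#include <stdint.h>

/* Root Error Status bits, same values as include/uapi/linux/pci_regs.h */
#define PCI_ERR_ROOT_COR_RCV         0x00000001
#define PCI_ERR_ROOT_MULTI_COR_RCV   0x00000002
#define PCI_ERR_ROOT_UNCOR_RCV       0x00000004
#define PCI_ERR_ROOT_MULTI_UNCOR_RCV 0x00000008

/* Hypothetical masks: bits the handler accepts before and after the patch */
#define SIM_MASK_OLD (PCI_ERR_ROOT_COR_RCV | PCI_ERR_ROOT_UNCOR_RCV)
#define SIM_MASK_NEW (SIM_MASK_OLD | PCI_ERR_ROOT_MULTI_COR_RCV | \
                      PCI_ERR_ROOT_MULTI_UNCOR_RCV)

/* Hypothetical stand-in for a root port; only the status register is modeled */
struct sim_root_port {
	uint32_t root_status;
};

/* Hardware side: first error sets *_RCV, a repeat of the same kind sets MULTI_* */
static void sim_report_error(struct sim_root_port *rp, bool correctable)
{
	uint32_t first = correctable ? PCI_ERR_ROOT_COR_RCV
				     : PCI_ERR_ROOT_UNCOR_RCV;
	uint32_t multi = correctable ? PCI_ERR_ROOT_MULTI_COR_RCV
				     : PCI_ERR_ROOT_MULTI_UNCOR_RCV;

	rp->root_status |= (rp->root_status & first) ? multi : first;
}

/* Simplified handler: false models IRQ_NONE, in which case nothing is cleared */
static bool sim_aer_irq(struct sim_root_port *rp, uint32_t handled_mask)
{
	uint32_t status = rp->root_status;

	if (!(status & handled_mask))
		return false;

	rp->root_status &= ~status;	/* write-1-to-clear of the bits we saw */
	return true;
}

/* Replays the race and returns whatever is left in the status register */
static uint32_t sim_race_leftover(uint32_t handled_mask)
{
	struct sim_root_port rp = { 0 };
	uint32_t seen;

	sim_report_error(&rp, true);	/* first correctable error */
	seen = rp.root_status;		/* handler reads the status ... */
	sim_report_error(&rp, true);	/* ... same kind of error arrives again */
	rp.root_status &= ~seen;	/* handler clears only what it saw */

	/* Interrupt fires again with just the multi bit logged */
	sim_aer_irq(&rp, handled_mask);
	return rp.root_status;
}
```

In this model, sim_race_leftover(SIM_MASK_OLD) leaves
PCI_ERR_ROOT_MULTI_COR_RCV set, which matches the stuck state we observed,
while sim_race_leftover(SIM_MASK_NEW) ends with the register clear.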
Cheers,
Ashok