Re: [PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly

linuxppc-dev.lists.ozlabs.org archive mirror

From: Bjorn Helgaas <helgaas@kernel.org>
To: Eric Badger <ebadger@purestorage.com>
Cc: Kuppuswamy Sathyanarayanan
	<sathyanarayanan.kuppuswamy@linux.intel.com>,
	Kuppuswamy Sathyanarayanan <knsathya@kernel.org>,
	"Raj, Ashok" <ashok.raj@intel.com>,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org,
	Oliver OHalloran <oohall@gmail.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	linuxppc-dev@lists.ozlabs.org
Subject: Re: [PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly
Date: Thu, 17 Mar 2022 17:59:44 -0500
Message-ID: <20220317225944.GA765564@bhelgaas>
In-Reply-To: <20220314162146.GA1439451@ebps>

On Mon, Mar 14, 2022 at 09:21:46AM -0700, Eric Badger wrote:
> On Sun, Mar 13, 2022 at 02:43:14PM -0700, Raj, Ashok wrote:
> > On Sun, Mar 13, 2022 at 02:52:20PM -0500, Bjorn Helgaas wrote:
> > > On Fri, Mar 11, 2022 at 02:58:07AM +0000, Kuppuswamy Sathyanarayanan wrote:
> > > > Currently the aer_irq() handler returns IRQ_NONE unless
> > > > PCI_ERR_ROOT_UNCOR_RCV or PCI_ERR_ROOT_COR_RCV is set. But this
> > > > assumption is incorrect.
> > > > 
> > > > Consider a scenario where aer_irq() is triggered for a correctable
> > > > error. If the same kind of error is triggered again while we are
> > > > processing the first one, before we clear the error status in the
> > > > "Root Error Status" register, the multi-bit error is left intact,
> > > > because aer_irq() only clears the events it saw. This causes the
> > > > interrupt to fire again, and aer_irq() is entered with just the
> > > > multi-bit error logged in the "Root Error Status" register.
> > > > 
> > > > Repeated AER recovery testing has shown that this condition does
> > > > happen, and that it prevents any new interrupt from being triggered.
> > > > Allow the interrupt to be processed even if only the multi-correctable
> > > > bit (BIT 1) or the multi-uncorrectable bit (BIT 3) is set.
> > > > 
> > > > Reported-by: Eric Badger <ebadger@purestorage.com>
> > > 
> > > Is there a bug report with any concrete details (dmesg, lspci, etc)
> > > that we can include here?
> > 
> > Eric might have more details to add from when he collected numerous logs
> > to get to the timeline of the problem. The test was to stress the links
> > with an automated power off, which results in some eDPC UC error followed
> > by link down. The recovery worked fine for several cycles, and then
> > suddenly there were no more interrupts. A manual PCI rescan would probe
> > the device and make it operational again.
> 
> The problem was originally discovered while performing a looping hot plug
> test. At hot remove time, one or more corrected errors usually appeared:
> 
> [256236.078151] pcieport 0000:89:02.0: AER: Corrected error received: 0000:89:02.0
> [256236.078154] pcieport 0000:89:02.0: AER: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
> [256236.088606] pcieport 0000:89:02.0: AER:   device [8086:347a] error status/mask=00000001/00000000
> [256236.097857] pcieport 0000:89:02.0: AER:    [ 0] RxErr                 
> [256236.152622] pcieport 0000:89:02.0: pciehp: Slot(400): Link Down
> [256236.152623] pcieport 0000:89:02.0: pciehp: Slot(400): Card not present
> [256236.152631] pcieport 0000:89:02.0: DPC: containment event, status:0x1f01 source:0x0000
> [256236.152632] pcieport 0000:89:02.0: DPC: unmasked uncorrectable error detected reason 0 ext_reason 0
> [256236.152634] pcieport 0000:89:02.0: AER: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
> [256236.164207] pcieport 0000:89:02.0: AER:   device [8086:347a] error status/mask=00000020/00100000
> [256236.173464] pcieport 0000:89:02.0: AER:    [ 5] SDES                   (First)
> [256236.278407] pci 0000:8a:00.0: Removing from iommu group 32
> [256237.500837] pcieport 0000:89:02.0: Data Link Layer Link Active not set in 1000 msec
> [256237.500842] pcieport 0000:89:02.0: link reset at upstream device 0000:89:02.0 failed
> [256237.500865] pcieport 0000:89:02.0: AER: Device recovery failed
> 
> The problematic case arose when 2 corrected errors arrived in a sequence like this:
> 
> 1. Correctable error triggered, bit 0 (ERR_COR) set in Root Error Status,
>    which now has value 0x1.
> 2. aer_irq() triggered, reads Root Error Status, finds value 0x1.
> 3. Second correctable error triggered, bit 1 (multiple ERR_COR) set in Root
>    Error Status, which now has value 0x3.
> 4. aer_irq() writes back 0x1 to Root Error Status, which now has value 0x2.
> 5. aer_irq() triggered again due to the second error, but, finding value 0x2
>    in Root Error Status, takes no action. Future interrupts are now inhibited.

Thanks for the additional details!

After this patch, I guess aer_irq() still reads 0x2
(PCI_ERR_ROOT_MULTI_COR_RCV), but now it writes 0x2 back which clears
PCI_ERR_ROOT_MULTI_COR_RCV.
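
To make the sequence concrete, here is a minimal user-space sketch of it,
assuming Root Error Status behaves as a plain RW1C register and that the
check in aer_irq() amounts to a mask test. The bit values are the ones in
include/uapi/linux/pci_regs.h; struct root_status, hw_correctable_error(),
model_irq_ack(), replay_race() and the two *_ACCEPT masks are hypothetical
names made up for illustration, not kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Root Error Status bits, same values as include/uapi/linux/pci_regs.h */
#define PCI_ERR_ROOT_COR_RCV		0x00000001
#define PCI_ERR_ROOT_MULTI_COR_RCV	0x00000002
#define PCI_ERR_ROOT_UNCOR_RCV		0x00000004
#define PCI_ERR_ROOT_MULTI_UNCOR_RCV	0x00000008

/* Hypothetical masks: bits that make the modeled handler act. */
#define OLD_ACCEPT (PCI_ERR_ROOT_COR_RCV | PCI_ERR_ROOT_UNCOR_RCV)
#define NEW_ACCEPT (OLD_ACCEPT | PCI_ERR_ROOT_MULTI_COR_RCV | \
		    PCI_ERR_ROOT_MULTI_UNCOR_RCV)

/* Hypothetical stand-in for the hardware register. */
struct root_status {
	uint32_t val;
};

/* Hardware side: first ERR_COR sets bit 0, a further one sets bit 1. */
static void hw_correctable_error(struct root_status *r)
{
	if (r->val & PCI_ERR_ROOT_COR_RCV)
		r->val |= PCI_ERR_ROOT_MULTI_COR_RCV;
	else
		r->val |= PCI_ERR_ROOT_COR_RCV;
}

/*
 * Software side: write back what was read. RW1C means every 1 written
 * clears that bit. Returns 1 for "handled", 0 for the IRQ_NONE path,
 * in which case nothing is cleared.
 */
static int model_irq_ack(struct root_status *r, uint32_t seen,
			 uint32_t accept)
{
	if (!(seen & accept))
		return 0;
	r->val &= ~seen;
	return 1;
}

/* Replay steps 1-5 from Eric's list; return the final register value. */
uint32_t replay_race(uint32_t accept)
{
	struct root_status r = { 0 };
	uint32_t seen;

	hw_correctable_error(&r);		/* 1: status is 0x1 */
	seen = r.val;				/* 2: handler reads 0x1 */
	hw_correctable_error(&r);		/* 3: status is 0x3 */
	assert(r.val == 0x3);
	model_irq_ack(&r, seen, accept);	/* 4: write 0x1, leaves 0x2 */
	seen = r.val;				/* 5: handler reads 0x2 */
	model_irq_ack(&r, seen, accept);
	return r.val;
}
```

Under this model, replay_race(OLD_ACCEPT) leaves 0x2 in the register
(the stuck state), and replay_race(NEW_ACCEPT) leaves 0x0.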

In addition, aer_irq() will continue on to read PCI_ERR_ROOT_ERR_SRC,
which probably contains either 0 or junk left over from being captured
when PCI_ERR_ROOT_COR_RCV was set.

And aer_irq() will queue an e_src record with status ==
PCI_ERR_ROOT_MULTI_COR_RCV.  But since PCI_ERR_ROOT_COR_RCV is not set
in status, aer_isr_one_error() will do nothing, right?
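
Here is a simplified model of how I read that dispatch, as a sketch only.
It assumes the MULTI bits are consulted only inside the branch for the
corresponding *_RCV bit. struct model_result, model_isr_one_error() and
model_self_check() are made-up names; the source lookup, logging and
recovery that the real function does are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Root Error Status bits, same values as include/uapi/linux/pci_regs.h */
#define PCI_ERR_ROOT_COR_RCV		0x00000001
#define PCI_ERR_ROOT_MULTI_COR_RCV	0x00000002
#define PCI_ERR_ROOT_UNCOR_RCV		0x00000004
#define PCI_ERR_ROOT_MULTI_UNCOR_RCV	0x00000008

/* Hypothetical summary of what the modeled ISR decided to do. */
struct model_result {
	int cor_handled;	/* correctable branch taken */
	int uncor_handled;	/* uncorrectable branch taken */
	int multi_error_valid;	/* a MULTI bit was noticed */
};

/* Model: a MULTI bit only matters if its *_RCV bit is also set. */
struct model_result model_isr_one_error(uint32_t status)
{
	struct model_result res = { 0, 0, 0 };

	if (status & PCI_ERR_ROOT_COR_RCV) {
		res.cor_handled = 1;
		if (status & PCI_ERR_ROOT_MULTI_COR_RCV)
			res.multi_error_valid = 1;
	}

	if (status & PCI_ERR_ROOT_UNCOR_RCV) {
		res.uncor_handled = 1;
		if (status & PCI_ERR_ROOT_MULTI_UNCOR_RCV)
			res.multi_error_valid = 1;
	}

	return res;
}

/* A record with only the MULTI bit set falls through both branches. */
void model_self_check(void)
{
	struct model_result res;

	res = model_isr_one_error(PCI_ERR_ROOT_MULTI_COR_RCV);
	assert(!res.cor_handled && !res.uncor_handled);
	assert(!res.multi_error_valid);
}
```

So in this model a record with status == 0x2 is consumed without any
action, while status == 0x3 takes the correctable branch and notes the
multiple-error condition.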

That might not be *terrible* and is definitely better than not being
able to handle future interrupts.  But we basically threw away the
information that multiple errors occurred, and we queued an e_src
record that occupies space without being used for anything.

Bjorn

Thread overview: 6+ messages
2022-03-11  2:58 [PATCH v1] PCI/AER: Handle Multi UnCorrectable/Correctable errors properly Kuppuswamy Sathyanarayanan
2022-03-13 19:52 ` Bjorn Helgaas
2022-03-13 21:43   ` Raj, Ashok
2022-03-14 16:21     ` Eric Badger
2022-03-17 22:59       ` Bjorn Helgaas [this message]
2022-03-18 20:28         ` Sathyanarayanan Kuppuswamy
