From: Bjorn Helgaas <helgaas@kernel.org>
To: David Henningsson <david.henningsson@canonical.com>
Cc: linux-pci@vger.kernel.org, bhelgaas@google.com
Subject: Re: Dmesg filled with "AER: Corrected error received"
Date: Fri, 15 Jan 2016 17:21:51 -0600
Message-ID: <20160115232151.GB14080@localhost>
In-Reply-To: <20151229155822.GA17321@localhost>

On Tue, Dec 29, 2015 at 09:58:22AM -0600, Bjorn Helgaas wrote:
> On Fri, Dec 18, 2015 at 11:30:33AM +0100, David Henningsson wrote:
> > Hi Linux PCI maintainers,
> > 
> > My dmesg gets filled with a few lines repeated over and over again:
> > 
> > pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
> > pcieport 0000:00:1c.0: can't find device of ID00e0
> > pcieport 0000:00:1c.0: AER: Corrected error received: id=00e0
> > pcieport 0000:00:1c.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e0(Receiver ID)
> > pcieport 0000:00:1c.0:   device [8086:9d14] error status/mask=00000001/00002000
> > pcieport 0000:00:1c.0:    [ 0] Receiver Error
> > 
> > This happens 10-30 times per second (!), so dmesg fills up quickly.
> > The bug is present in both vanilla and Ubuntu kernels.
> 
> This is a pretty obvious bug in our AER code.  We normally clear
> correctable errors by writing the PCI_ERR_COR_STATUS register in
> handle_error_source().  The execution path looks like this:
> 
>   aer_isr_one_error
>     aer_print_port_info
>     if (find_source_device())
>       aer_process_err_devices
>         handle_error_source
>           pci_write_config_dword(dev, PCI_ERR_COR_STATUS, ...)
> 
> In this case, find_source_device() printed "can't find device of
> ID00e0" [sic] and returned false, so we don't call
> aer_process_err_devices().  The error is never cleared, so
> we discover it again and again.
> 
> I'll work on fixing this.  Incidentally, there's another report
> with similar symptoms here:
> 
>   https://bugzilla.kernel.org/show_bug.cgi?id=109691
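
To make the quoted execution path concrete, here is a toy user-space
model of it. This is a sketch, not kernel code: the *_model functions,
the cor_status variable, and count_reports() are invented for
illustration, and the only facts taken from the report are that bit 0 is
the Receiver Error bit and that the status is cleared only when the
source device is found. It also assumes, as a simplification, that the
error keeps being reported for as long as the status bit stays set.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model, not kernel code. Bit 0 matches "[ 0] Receiver Error" in the log. */
#define RECEIVER_ERROR 0x00000001u

/* Stands in for the device's PCI_ERR_COR_STATUS register. */
static uint32_t cor_status;

/* Hypothetical stand-in for find_source_device(); in the report it fails. */
static bool find_source_device_model(bool source_known)
{
    return source_known;
}

/* Stand-in for the clear done in handle_error_source(): writing back the
 * bits that were read clears them (write-1-to-clear). */
static void handle_error_source_model(void)
{
    uint32_t status = cor_status;

    cor_status &= ~status;
}

/* Mirrors the quoted path: the clear is reached only if the lookup succeeds. */
static void aer_isr_one_error_model(bool source_known)
{
    if (find_source_device_model(source_known))
        handle_error_source_model();
    /* Otherwise nothing on this path clears cor_status. */
}

/* Latch one error, then count how often it is reported in `polls` passes. */
int count_reports(bool source_known, int polls)
{
    int reports = 0;

    cor_status = RECEIVER_ERROR;
    for (int i = 0; i < polls; i++) {
        if (cor_status == 0)
            break;              /* error was cleared, nothing more to report */
        reports++;
        aer_isr_one_error_model(source_known);
    }
    return reports;
}
```

In this model count_reports(true, 10) returns 1, because the first pass
clears the bit, while count_reports(false, 10) returns 10, because the
failed lookup skips the only clear and the same error is seen on every
pass.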

I've thought about this problem a bit, but realistically I don't have
time to do the fix I'd like to do, which would involve reading the AER
status registers in the ISR and *clearing* the error indication there
as well.  I think the current design, where we read bits of the status
in various places and clear it in yet other locations, is error-prone.
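
One way to picture the read-and-clear-in-the-ISR idea is the toy model
below. It is only a sketch under stated assumptions, not a proposed
patch: struct aer_snapshot, latch_error(), pending_errors(),
isr_read_and_clear(), and count_logged_bits() are hypothetical names,
and the register is modeled as a plain variable with write-1-to-clear
behavior. The point is that the status is read once, exactly those bits
are cleared in the same place, and later processing works only from the
saved copy.

```c
#include <stdint.h>

/* Toy model, not kernel code: one correctable-error status register. */
static uint32_t cor_status;

/* Hypothetical snapshot of what the ISR saw. */
struct aer_snapshot {
    uint32_t cor_status;
};

/* Test helper: hardware latches one or more error bits. */
void latch_error(uint32_t bits)
{
    cor_status |= bits;
}

/* Test helper: which error bits are still pending in the register. */
uint32_t pending_errors(void)
{
    return cor_status;
}

/* Read the status once and clear exactly the bits that were read, so a
 * bit that is set later is not lost and a bit already seen is not seen
 * again. */
struct aer_snapshot isr_read_and_clear(void)
{
    struct aer_snapshot snap = { .cor_status = cor_status };

    cor_status &= ~snap.cor_status;   /* models write-1-to-clear */
    return snap;
}

/* Later processing uses only the snapshot; whether the source device can
 * be found no longer decides whether the error gets cleared. */
int count_logged_bits(const struct aer_snapshot *snap)
{
    int n = 0;

    for (uint32_t bits = snap->cor_status; bits != 0; bits >>= 1)
        n += (int)(bits & 1u);
    return n;
}
```

After latch_error(0x1), a call to isr_read_and_clear() returns a
snapshot with bit 0 set and leaves pending_errors() at 0, so a second
call returns an empty snapshot instead of the same error again.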

Anybody else who is interested should feel free to take a crack at it.

Bjorn
