linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Keith Busch <keith.busch@intel.com>
To: Bjorn Helgaas <helgaas@kernel.org>
Cc: Linux PCI <linux-pci@vger.kernel.org>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Benjamin Herrenschmidt <benh@kernel.crashing.org>,
	Sinan Kaya <okaya@kernel.org>, Thomas Tai <thomas.tai@oracle.com>,
	poza@codeaurora.org, Lukas Wunner <lukas@wunner.de>,
	Christoph Hellwig <hch@lst.de>,
	Mika Westerberg <mika.westerberg@linux.intel.com>
Subject: Re: [PATCHv4 08/12] PCI: ERR: Always use the first downstream port
Date: Fri, 28 Sep 2018 15:35:23 -0600	[thread overview]
Message-ID: <20180928213523.GA22508@localhost.localdomain> (raw)
In-Reply-To: <20180928205034.GA119911@bhelgaas-glaptop.roam.corp.google.com>

On Fri, Sep 28, 2018 at 03:50:34PM -0500, Bjorn Helgaas wrote:
> On Fri, Sep 28, 2018 at 09:42:20AM -0600, Keith Busch wrote:
> > On Thu, Sep 27, 2018 at 05:56:25PM -0500, Bjorn Helgaas wrote:
> > > On Wed, Sep 26, 2018 at 04:19:25PM -0600, Keith Busch wrote:
> > The "first downstream port" was supposed to mean the first DSP we
> > find when walking toward the root from the pci_dev that reported
> > ERR_[NON]FATAL.
> > 
> > By "use", I mean "walking down the sub-tree for error handling".
> > 
> > After thinking more on this, that doesn't really capture the intent. A
> > better way might be:
> > 
> >   Run error handling starting from the downstream port of the bus
> >   reporting an error
> 
> I think the light is beginning to dawn.  Does this make sense (as far
> as it goes)?
> 
>   PCI/ERR: Run error recovery callbacks for all affected devices
> 
>   If an Endpoint reports an error with ERR_FATAL, we previously ran
>   driver error recovery callbacks only for the Endpoint.  But if
>   recovery requires that we reset the Endpoint, we may have to use a
>   port farther upstream to initiate a Link reset, and that may affect
>   components other than the Endpoint, e.g., multi-function peers and
>   their children.  Drivers for those devices were never notified of
>   the impending reset.
> 
>   Call driver error recovery callbacks for every device that will be
>   reset.

Yes!
 
> Now help me understand this part:
> 
> > This allows two other clean-ups.  First, error handling can only run
> > on bridges so this patch removes checks for endpoints.  
> 
> "error handling can only run on bridges"?  I *think* only Root Ports
> and Root Complex Event Collectors can assert AER interrupts, so
> aer_irq() is only run for them (actually I don't think we quite
> support Event Collectors yet).  Is this what you're getting at?

I mean the pci_dev sent to pcie_do_recovery(), which may be any device
in the topology including or below the root port that aer_irq() serviced.

> Probably not, because the "dev" passed to pcie_do_recovery() isn't an
> RP or RCEC.  But the e_info->dev[i] that aer_process_err_devices()
> eventually passes in doesn't have to be a bridge at all, does it?

Yes, e_info->dev[i] is sent to pcie_do_recovery(). That could be an RP,
but it may also be anything anything below it.

The assumption I'm making (which I think is a safe assumption with
general consensus) is that errors detected on an end point or an upstream
port happened because of something wrong with the link going upstream:
end devices have no other option, and I don't think it's possible a
bus error occurs on "virtual" busses.

The patch then assumes we should avoid sending any traffic through that
link until error recovery completes.
 
> So I can't quite figure out the bridge connection here.

I should have included RP's, or more generically "type 1 header devices".

> > Second, the first accessible port does not inherit the channel error
> > state since we can access it, so the special cases for error detect
> > and resume are no longer necessary.
> 
> When we call pcie_do_recovery(), I guess we specify a "dev" that
> logged an AER event and a pci_channel_state.  It seems a little bit
> weird to handle those separately, as opposed to incorporating into 
> the pci_dev or something.
>
> But if you're just saying that "if A is frozen because of an error,
> that doesn't mean bridges upstream from A are frozen", that makes good
> sense.

Yes. Another way to think of it is in terms of busses instead of
devices. While devices report errors, the error is because of the
bus. Paths going through that bus are considered frozen, and everything
else is not. We can safely communicate with the RP or DSP that connects
to that bus because communicating with those do not go through the
frozen bus.

  reply	other threads:[~2018-09-28 21:33 UTC|newest]

Thread overview: 28+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-09-20 16:27 [PATCHv4 00/12] pci error handling fixes Keith Busch
2018-09-20 16:27 ` [PATCHv4 01/12] PCI: portdrv: Initialize service drivers directly Keith Busch
2018-09-20 16:27 ` [PATCHv4 02/12] PCI: portdrv: Restore pci state on slot reset Keith Busch
2018-09-20 16:27 ` [PATCHv4 03/12] PCI: DPC: Save and restore control state Keith Busch
2018-09-20 19:46   ` Sinan Kaya
2018-09-20 19:47     ` Sinan Kaya
2018-09-20 19:54       ` Keith Busch
2018-09-20 16:27 ` [PATCHv4 04/12] PCI: AER: Take reference on error devices Keith Busch
2018-09-20 16:27 ` [PATCHv4 05/12] PCI: AER: Don't read upstream ports below fatal errors Keith Busch
2018-09-20 16:27 ` [PATCHv4 06/12] PCI: ERR: Use slot reset if available Keith Busch
2018-09-20 16:27 ` [PATCHv4 07/12] PCI: ERR: Handle fatal error recovery Keith Busch
2018-09-20 16:27 ` [PATCHv4 08/12] PCI: ERR: Always use the first downstream port Keith Busch
2018-09-26 22:01   ` Bjorn Helgaas
2018-09-26 22:19     ` Keith Busch
2018-09-27 22:56       ` Bjorn Helgaas
2018-09-28 15:42         ` Keith Busch
2018-09-28 20:50           ` Bjorn Helgaas
2018-09-28 21:35             ` Keith Busch [this message]
2018-09-28 23:28               ` Bjorn Helgaas
2018-10-01 15:14                 ` Keith Busch
2018-10-02 19:35                   ` Bjorn Helgaas
2018-10-02 19:55                     ` Keith Busch
2018-09-20 16:27 ` [PATCHv4 09/12] PCI: ERR: Simplify broadcast callouts Keith Busch
2018-09-20 16:27 ` [PATCHv4 10/12] PCI: ERR: Report current recovery status for udev Keith Busch
2018-09-20 16:27 ` [PATCHv4 11/12] PCI: Unify device inaccessible Keith Busch
2018-09-20 16:27 ` [PATCHv4 12/12] PCI: Make link active reporting detection generic Keith Busch
2018-09-20 20:00 ` [PATCHv4 00/12] pci error handling fixes Sinan Kaya
2018-09-20 21:17 ` Bjorn Helgaas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180928213523.GA22508@localhost.localdomain \
    --to=keith.busch@intel.com \
    --cc=benh@kernel.crashing.org \
    --cc=bhelgaas@google.com \
    --cc=hch@lst.de \
    --cc=helgaas@kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=lukas@wunner.de \
    --cc=mika.westerberg@linux.intel.com \
    --cc=okaya@kernel.org \
    --cc=poza@codeaurora.org \
    --cc=thomas.tai@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).