linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
To: poza@codeaurora.org
Cc: Thomas Tai <thomas.tai@oracle.com>,
	bhelgaas@google.com, keith.busch@intel.com,
	linux-pci@vger.kernel.org, linux-pci-owner@vger.kernel.org
Subject: Re: [PATCH 1/1] PCI/AER: prevent pcie_do_fatal_recovery from using device after it is removed
Date: Thu, 16 Aug 2018 16:51:43 +1000	[thread overview]
Message-ID: <e931bc7a7e468e8bb2ee3766b3fcf96f1800f552.camel@kernel.crashing.org> (raw)
In-Reply-To: <6cb069038530757f31f3dd60328c7e30@codeaurora.org>

On Thu, 2018-08-16 at 12:06 +0530, poza@codeaurora.org wrote:
> 
> > Under what circumstances do you actually "unplug" the device ? We are
> > trying to cleanup/fix some of the PowerPC EEH code which is in a way
> > similar to AER, and we found that this unplug/replug, which we do if
> > the driver doesn't have recovery callbacks only, is causing more
> > problems than it solves.
> > 
> > We are moving toward instead unbinding the driver, resetting the
> > device, then re-binding the driver instead of unplug/replug.
> > 
> > Also why would you ever bypass the driver callbacks if the driver has
> > some ? The whole point is to keep the driver bound while resetting the
> > device (provided it has the right callbacks) so we don't lose the
> > linkage between stroage devices and mounted filesystems.
> > 
> 
> there are two different paths we are taking, one is for ERR_FATAL and 
> the other is for ERR_NONFATAL.

Ok. With EEH we have a much finer granularity but fundamentally we
don't care as long as we reset the device.

> while AER and DPC converge into ERR_FATAL path; ERR_NONFATAL path is the 
> one where AER and driver callbacks would come into picture.

I disagree. See below.

> since DPC does hw recovery of the link, and handles only ERR_FATAL, we 
> do remove devices and re-enumerate them.

What is "DPC" ?

> but if the error is no fatal, we will fall back to driver callbacks to 
> initiate link error handling and recovery process.

Hrm ... the callbacks were never about link errors, they were about
device errors of any kind and predate PCIe concept of links in fact.

They include iommu errors, HW errors, SW bugs etc...

> under normal circumstances I do not see the some downstream devices 
> would have been replaced in case of FATAL errors, but code might not 
> want to assume that
> same device is there, since we are removing downstream devices 
> completely. so taking reference of old BDF and keeping it for later 
> reference is not good idea.
> although I think there are some parts of pcie_do_fatal_recovery need a 
> fix, which Thomas is fixing is anyway.

I think you shouldn't be unplugging/replugging.

> so yes, I agree with you and we are talking about same thing e.g.
> "
>   We are moving toward instead unbinding the driver, resetting the
>   device, then re-binding the driver instead of unplug/replug.
> "
> 
> driver callbacks are very much there, please have a look at 
> pcie_do_nonfatal_recovery().

Even for fatal. Why on earth would you skip the driver callbacks ? Most
errors are fatal anyway.

For us, we don't really differenciate because in order to avoid any
form of propagation of bad data, our HW will freeze/isolate a device on
any error, whether it's fatal or non fatal.

So we treat them both as needing a full recovery which usually implies
a reset of the device.

However we do that through the use of the driver callbacks because
otherwise you will lose the linkage between a storage driver and the
moutned file systems which cannot be recovered, so you are killing your
system.

> refering Sepc: 6.2.2.2.1 Fatal Errors, where link is unreliable and it 
> might need AER style reset of link or DPC style HW recovery
> In both cases, the shutdown callbacks are expected to be called,

No, this is wrong and not the intent of the error handling.

You seem to be applying PCIe specific concepts brain-farted at Intel
that are way way away from what we care about in practice and in Linux.

> e.g.  some driver handle errors ERR_NONFATAL or FATAL in similar ways
> e.g.
> ioat_pcie_error_detected();  calls  ioat_shutdown(); in case of 
> ERR_NONFATAL
> otherwise ioat_shutdown() in case of ERR_FATAL.

Since when the error handling callbacks even have the concept of FATAL
vs. non-fatal ? This doesn't appear anyhwhere in the prototype of the
struct pci_error_handlers and shouldn't.

Benm=.

> 
> > Cheers,
> > Ben.

  reply	other threads:[~2018-08-16  9:48 UTC|newest]

Thread overview: 91+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-08-13 16:51 [PATCH 0/1] PCI/AER: prevent pcie_do_fatal_recovery from using device after it is removed Thomas Tai
2018-08-13 16:51 ` [PATCH 1/1] " Thomas Tai
2018-08-14  9:16   ` poza
2018-08-14  9:22     ` poza
2018-08-14 13:51       ` Thomas Tai
2018-08-15 14:57         ` poza
2018-08-15 15:02           ` Thomas Tai
2018-08-15 15:26             ` poza
2018-08-15 15:43               ` Thomas Tai
2018-08-15 15:59                 ` poza
2018-08-15 16:04                   ` Thomas Tai
2018-08-15 21:55       ` Benjamin Herrenschmidt
2018-08-15 21:56       ` Benjamin Herrenschmidt
2018-08-16  6:36         ` poza
2018-08-16  6:51           ` Benjamin Herrenschmidt [this message]
2018-08-16  6:59             ` Benjamin Herrenschmidt
2018-08-16  8:07               ` poza
2018-08-16  8:12                 ` Benjamin Herrenschmidt
2018-08-16  9:03                   ` poza
2018-08-16 10:07                     ` Benjamin Herrenschmidt
2018-08-16 14:11                       ` poza
2018-08-16 23:30                         ` Benjamin Herrenschmidt
2018-08-17 10:29                           ` poza
2018-08-17 10:44                             ` poza
2018-08-18  7:38                             ` Benjamin Herrenschmidt
2018-08-16  7:05             ` Benjamin Herrenschmidt
2018-08-16  7:15               ` Benjamin Herrenschmidt
2018-08-16  7:56                 ` poza
2018-08-16  8:10                   ` Benjamin Herrenschmidt
2018-08-16  8:05               ` poza
2018-08-16  8:15                 ` Benjamin Herrenschmidt
2018-08-16  8:22                   ` poza
2018-08-16  8:28                     ` Benjamin Herrenschmidt
2018-08-16 13:30                       ` Thomas Tai
2018-08-16 13:46                     ` Sinan Kaya
2018-08-16 23:27                       ` Benjamin Herrenschmidt
2018-08-17  6:35                         ` poza
2018-08-19  2:24                         ` Bjorn Helgaas
2018-08-20  5:09                           ` poza
2018-08-20  5:15                             ` Benjamin Herrenschmidt
2018-08-20 13:02                               ` Thomas Tai
2018-08-20 13:27                                 ` Benjamin Herrenschmidt
2018-08-19  2:19               ` Bjorn Helgaas
2018-08-19 21:41                 ` Sinan Kaya
2018-08-20  2:03                   ` Benjamin Herrenschmidt
2018-08-20  5:19                     ` poza
2018-08-20  5:33                       ` Benjamin Herrenschmidt
2018-08-20  7:56                         ` poza
2018-08-20 11:22                           ` Benjamin Herrenschmidt
2018-08-20 13:26                             ` poza
2018-08-20 21:02                               ` Benjamin Herrenschmidt
2018-08-21  5:14                                 ` poza
2018-08-21  6:06                                   ` Benjamin Herrenschmidt
2018-08-21 14:37                                     ` Keith Busch
2018-08-21 15:07                                       ` Sinan Kaya
2018-08-21 15:29                                         ` Keith Busch
2018-08-21 15:50                                           ` Sinan Kaya
2018-08-21 15:55                                             ` Sinan Kaya
2018-08-21 16:44                                             ` Keith Busch
2018-08-21 15:30                                       ` poza
2018-08-21 21:14                                       ` Benjamin Herrenschmidt
2018-08-21 22:04                                         ` Keith Busch
2018-08-21 22:33                                           ` Keith Busch
2018-08-21 23:06                                           ` Benjamin Herrenschmidt
2018-08-21 23:13                                             ` Keith Busch
2018-08-22  0:36                                               ` Benjamin Herrenschmidt
2018-08-30  0:01                                             ` Keith Busch
2018-08-30  0:10                                               ` Sinan Kaya
2018-08-30  0:46                                                 ` Keith Busch
2018-08-30  4:26                                               ` Benjamin Herrenschmidt
2018-08-20 15:53                             ` Keith Busch
2018-08-20 16:13                               ` poza
2018-08-20 16:32                                 ` Keith Busch
2018-08-20 21:05                               ` Benjamin Herrenschmidt
2018-08-20 21:21                                 ` Sinan Kaya
2018-08-20 21:35                                   ` Keith Busch
2018-08-20 21:53                                     ` Benjamin Herrenschmidt
2018-08-20 22:02                                       ` Sinan Kaya
2018-08-20 22:04                                         ` Benjamin Herrenschmidt
2018-08-20 22:13                                           ` Sinan Kaya
2018-08-20 22:19                                             ` Benjamin Herrenschmidt
2018-08-22  9:13                                             ` Lukas Wunner
2018-08-22 14:38                                               ` Keith Busch
2018-08-22 14:51                                                 ` Sinan Kaya
2018-08-20 22:13                                           ` Keith Busch
2018-08-20 22:19                                             ` Benjamin Herrenschmidt
2018-08-21  1:30                                               ` Keith Busch
2018-08-20  4:37                 ` Benjamin Herrenschmidt
2018-08-20  4:39                 ` PATCH] Partial revert of "PCI/AER: Handle ERR_FATAL with removal and re-enumeration of devices" Benjamin Herrenschmidt
2018-08-21 19:50                   ` Bjorn Helgaas
2018-08-22  4:35                     ` poza

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e931bc7a7e468e8bb2ee3766b3fcf96f1800f552.camel@kernel.crashing.org \
    --to=benh@kernel.crashing.org \
    --cc=bhelgaas@google.com \
    --cc=keith.busch@intel.com \
    --cc=linux-pci-owner@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=poza@codeaurora.org \
    --cc=thomas.tai@oracle.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).