public inbox for linux-pci@vger.kernel.org
From: Bjorn Helgaas <helgaas@kernel.org>
To: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Cc: amd-gfx@lists.freedesktop.org,
	sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-pci@vger.kernel.org, alexander.deucher@amd.com,
	nirmodas@amd.com, Dennis.Li@amd.com, christian.koenig@amd.com,
	luben.tuikov@amd.com, bhelgaas@google.com
Subject: Re: [PATCH v4 0/8] Implement PCI Error Recovery on Navi12
Date: Wed, 2 Sep 2020 16:36:25 -0500
Message-ID: <20200902213625.GA269978@bjorn-Precision-5520>
In-Reply-To: <1599072130-10043-1-git-send-email-andrey.grodzovsky@amd.com>

On Wed, Sep 02, 2020 at 02:42:02PM -0400, Andrey Grodzovsky wrote:
> Many PCI bus controllers can detect a variety of hardware PCI errors on the bus,
> such as parity errors on the data and address buses. A typical action is to disconnect
> the affected device, halting all I/O to it. Typically, a reconnection mechanism is also offered,
> so that the affected PCI device(s) are reset and put back into working condition.
> In our case the reconnection mechanism is provided by the kernel Downstream Port Containment (DPC)
> driver, which intercepts the PCIe error and removes (isolates) the faulting device, after which it
> calls into the PCIe recovery code of the PCI core.
> This code calls hooks which are implemented in this patchset. The error is
> first reported, at which point we block the GPU scheduler. Next, DPC resets the
> PCI link, which generates a HW interrupt that is intercepted by SMU/PSP, which
> start executing a mode1 reset of the ASIC. Next, the slot reset hook is called,
> at which point we wait for the ASIC reset to complete, restore PCI config space and run
> the HW suspend/resume sequence to reinit the ASIC.
> The last hook called is resume normal operation, at which point we restart the GPU scheduler.
> 
> More info on PCIe error handling and DPC is here:
> https://www.kernel.org/doc/html/latest/PCI/pci-error-recovery.html
> https://patchwork.kernel.org/patch/8945681/
> 
> v4: Rebase to the 5.9 kernel and revert the PCI error recovery core commit which breaks the feature.
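
For readers following the hook sequence described in the quoted cover letter, here is a minimal userspace sketch of the ordering. It is an illustration only, not code from the patchset and not kernel code. The struct loosely mirrors the shape of the kernel's struct pci_error_handlers (error_detected, slot_reset, resume), but every name below (gpu_dev, err_handlers, model_link_reset, model_recovery, the ERS_* values) is hypothetical, and the link reset plus SMU/PSP mode1 reset are collapsed into a single flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for the kernel's pci_ers_result_t values. */
enum ers_result {
	ERS_NEED_RESET,
	ERS_RECOVERED,
	ERS_DISCONNECT,
};

/* Hypothetical device state; field names are illustrative only. */
struct gpu_dev {
	bool sched_blocked;	/* GPU scheduler is blocked */
	bool asic_reset_done;	/* mode1 reset of the ASIC has completed */
	bool config_restored;	/* PCI config space has been restored */
};

/* Hypothetical callback table shaped like struct pci_error_handlers. */
struct err_handlers {
	enum ers_result (*error_detected)(struct gpu_dev *dev);
	enum ers_result (*slot_reset)(struct gpu_dev *dev);
	void (*resume)(struct gpu_dev *dev);
};

/* Step 1: the error is reported; block the scheduler and ask for a reset. */
static enum ers_result gpu_error_detected(struct gpu_dev *dev)
{
	dev->sched_blocked = true;
	dev->asic_reset_done = false;
	dev->config_restored = false;
	return ERS_NEED_RESET;
}

/*
 * Step 2 (modeled): the link reset triggers the ASIC reset.  The real
 * flow is asynchronous; here it completes immediately.
 */
static void model_link_reset(struct gpu_dev *dev)
{
	dev->asic_reset_done = true;
}

/* Step 3: after the ASIC reset completes, restore config space and reinit. */
static enum ers_result gpu_slot_reset(struct gpu_dev *dev)
{
	if (!dev->asic_reset_done)
		return ERS_DISCONNECT;
	dev->config_restored = true;
	return ERS_RECOVERED;
}

/* Step 4: resume normal operation; restart the scheduler. */
static void gpu_resume(struct gpu_dev *dev)
{
	/* Model invariant: resume only follows a successful slot reset. */
	assert(dev->config_restored);
	dev->sched_blocked = false;
}

const struct err_handlers gpu_err_handlers = {
	.error_detected = gpu_error_detected,
	.slot_reset = gpu_slot_reset,
	.resume = gpu_resume,
};

/* Drive the hooks in the order the cover letter describes. */
bool model_recovery(const struct err_handlers *h, struct gpu_dev *dev)
{
	if (h->error_detected(dev) != ERS_NEED_RESET)
		return false;
	model_link_reset(dev);
	if (h->slot_reset(dev) != ERS_RECOVERED)
		return false;
	h->resume(dev);
	return true;
}
```

Calling model_recovery(&gpu_err_handlers, &dev) on a zero-initialized struct gpu_dev walks all four steps and leaves the scheduler unblocked. The actual driver hooks take a struct pci_dev and do considerably more work at each step.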

What does this apply to?  I tried 

  - v5.9-rc1 (9123e3a74ec7 ("Linux 5.9-rc1")),
  - v5.9-rc2 (d012a7190fc1 ("Linux 5.9-rc2")),
  - v5.9-rc3 (f75aef392f86 ("Linux 5.9-rc3")),
  - drm-next (3393649977f9 ("Merge tag 'drm-intel-next-2020-08-24-1' of git://anongit.freedesktop.org/drm/drm-intel into drm-next")),
  - linux-next (4442749a2031 ("Add linux-next specific files for 20200902"))

but it doesn't apply cleanly to any.

> Andrey Grodzovsky (8):
>   drm/amdgpu: Avoid accessing HW when suspending SW state
>   drm/amdgpu: Block all job scheduling activity during DPC recovery
>   drm/amdgpu: Fix SMU error failure
>   drm/amdgpu: Fix consecutive DPC recovery failures.
>   drm/amdgpu: Trim amdgpu_pci_slot_reset by reusing code.
>   drm/amdgpu: Disable DPC for XGMI for now.
>   drm/amdgpu: Minor checkpatch fix
>   Revert "PCI/ERR: Update error status after reset_link()"
> 
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   6 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 247 +++++++++++++++++++++--------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |   4 +-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_gfx.c    |   6 +
>  drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |   6 +
>  drivers/gpu/drm/amd/amdgpu/gfx_v10_0.c     |  18 ++-
>  drivers/gpu/drm/amd/amdgpu/nv.c            |   4 +-
>  drivers/gpu/drm/amd/amdgpu/soc15.c         |   4 +-
>  drivers/gpu/drm/amd/pm/swsmu/smu_cmn.c     |   3 +
>  drivers/pci/pcie/err.c                     |   3 +-
>  10 files changed, 222 insertions(+), 79 deletions(-)
> 
> -- 
> 2.7.4
> 

