[PATCH 0/4] PCI: Update error recovery documentation

* [PATCH 0/4] PCI: Update error recovery documentation
@ 2025-08-29  7:25 Lukas Wunner
  2025-08-29  7:25 ` [PATCH 1/4] PCI/AER: Sync documentation with code Lukas Wunner
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Lukas Wunner @ 2025-08-29  7:25 UTC (permalink / raw)
  To: Bjorn Helgaas, Jonathan Corbet
  Cc: Terry Bowman, Ilpo Jarvinen, Sathyanarayanan Kuppuswamy,
	Niklas Schnelle, Linas Vepstas, Mahesh J Salgaonkar,
	Oliver OHalloran, linuxppc-dev, linux-pci, linux-doc,
	Brian Norris

The documentation on PCIe Advanced Error Reporting hasn't kept up with
code changes over the years.  This series seeks to remedy as many issues
as I could find.

Previous commits touching the documentation either prefixed the subject
with "Documentation: PCI:" or (when combined with code changes) "PCI/AER:"
or "PCI/ERR:".  I chose the latter for brevity and to avoid mentioning
"documentation" or "PCI" twice in the subject.  If anyone objects or
finds other mistakes in these patches, please let me know.  Thanks!

Lukas Wunner (4):
  PCI/AER: Sync documentation with code
  PCI/ERR: Sync documentation with code
  PCI/ERR: Amend documentation with DPC and AER specifics
  PCI/ERR: Tidy documentation's PCIe nomenclature

 Documentation/PCI/pci-error-recovery.rst | 42 +++++++++---
 Documentation/PCI/pcieaer-howto.rst      | 81 +++++++++++-------------
 2 files changed, 71 insertions(+), 52 deletions(-)

-- 
2.47.2

* [PATCH 1/4] PCI/AER: Sync documentation with code
  2025-08-29  7:25 [PATCH 0/4] PCI: Update error recovery documentation Lukas Wunner
@ 2025-08-29  7:25 ` Lukas Wunner
  2025-08-29  7:25 ` [PATCH 2/4] PCI/ERR: " Lukas Wunner
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 8+ messages in thread
From: Lukas Wunner @ 2025-08-29  7:25 UTC (permalink / raw)
  To: Bjorn Helgaas, Jonathan Corbet
  Cc: Terry Bowman, Ilpo Jarvinen, Sathyanarayanan Kuppuswamy,
	Niklas Schnelle, Linas Vepstas, Mahesh J Salgaonkar,
	Oliver OHalloran, linuxppc-dev, linux-pci, linux-doc,
	Brian Norris

The PCIe Advanced Error Reporting driver has evolved over the years but
its documentation hasn't.  Catch up with past code changes:

* The documentation claims that Correctable Errors are logged with
  KERN_INFO severity, but the code uses KERN_WARNING.

  It had used KERN_WARNING from the beginning with commit 6c2b374d7485
  ("PCI-Express AER implemetation: AER core and aerdriver").  In 2013,
  commit 2cced2d95961 ("aerdrv: Cleanup log output for AER") switched to
  KERN_ERR, until 2020 when it was switched back to KERN_WARNING by commit
  e83e2ca3c395 ("PCI/AER: Log correctable errors as warning, not error").

* An example log message in the documentation uses the term "Uncorrected",
  but the code uses "Uncorrectable" since commit 02a06f5f1a6a ("PCI/AER:
  Use 'Correctable' and 'Uncorrectable' spec terms for errors").

* The example contains the Requester ID "id=0500", which is omitted since
  commit 010caed4ccb6 ("PCI/AER: Decode Error Source Requester ID").

* The example contains the error name "Unsupported Request", which is
  instead reported as "UnsupReq" since commit bd237801fef2 ("PCI/AER:
  Adopt lspci names for AER error decoding").

* The example doesn't prepend "0x" to hex values from the TLP Header Log,
  as introduced by commit f68ea779d98a ("PCI: Add pcie_print_tlp_log() to
  print TLP Header and Prefix Log").

* The documentation refers to a reset_link callback which was removed by
  commit b6cf1a42f916 ("PCI/ERR: Remove service dependency in
  pcie_do_recovery()").

* Commit 579086225502 ("PCI/ERR: Recover from RCiEP AER errors") added
  support to recover Root Complex Integrated Endpoints by applying a
  Function Level Reset instead of the Secondary Bus Reset which is
  applied otherwise.

* On non-fatal errors, a reset was previously never performed.  But the
  AER driver has just been amended to allow drivers to opt in to a reset.

* The documentation claims that a warning message is logged if a driver
  lacks pci_error_handlers.  But the message has been informational
  (logged with KERN_INFO severity) since its introduction with commit
  01daacfb9035 ("PCI/AER: Log which device prevents error recovery").

  The documentation claims that the message is only logged for fatal
  errors, which is incorrect.  Moreover it refers to "section 3", even
  though the documentation no longer contains section numbers since commit
  4e37f055a92e ("Documentation: PCI: convert pcieaer-howto.txt to reST").
  Section 3 is titled "Developer Guide".  That's the same section where
  the reference is located, so it is self-referential and can be dropped.
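
To illustrate the reset rules for non-fatal versus fatal errors, here is
a minimal user-space sketch.  It is a simplified model, not the kernel's
implementation: enum vote, the VOTE_* names and reset_performed() are
hypothetical stand-ins for the PCI_ERS_RESULT_* codes and the recovery
logic, and the handling of VOTE_DISCONNECT is deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the PCI_ERS_RESULT_* codes a driver returns. */
enum vote {
	VOTE_CAN_RECOVER,	/* recovery without a reset is possible */
	VOTE_NEED_RESET,	/* driver asks for a reset */
	VOTE_DISCONNECT,	/* driver considers the device unrecoverable */
};

/*
 * Decide whether a reset is performed for the affected hierarchy.
 * Fatal errors always lead to a reset, drivers cannot opt out.
 * For non-fatal errors, a single NEED_RESET vote overrides all others.
 */
bool reset_performed(bool fatal, const enum vote *votes, size_t n)
{
	size_t i;

	assert(votes != NULL || n == 0);

	if (fatal)
		return true;

	for (i = 0; i < n; i++)
		if (votes[i] == VOTE_NEED_RESET)
			return true;

	return false;
}
```

With two drivers voting VOTE_CAN_RECOVER on a non-fatal error, the
sketch skips the reset.  As soon as one of them votes VOTE_NEED_RESET,
or the error is fatal, the reset is performed.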

Signed-off-by: Lukas Wunner <lukas@wunner.de>
---
 Documentation/PCI/pcieaer-howto.rst | 81 ++++++++++++++---------------
 1 file changed, 38 insertions(+), 43 deletions(-)

diff --git a/Documentation/PCI/pcieaer-howto.rst b/Documentation/PCI/pcieaer-howto.rst
index 4b71e2f43ca7..d448efe572c8 100644
--- a/Documentation/PCI/pcieaer-howto.rst
+++ b/Documentation/PCI/pcieaer-howto.rst
@@ -70,16 +70,16 @@ AER error output
 ----------------
 
 When a PCIe AER error is captured, an error message will be output to
-console. If it's a correctable error, it is output as an info message.
+console. If it's a correctable error, it is output as a warning message.
 Otherwise, it is printed as an error. So users could choose different
 log level to filter out correctable error messages.
 
 Below shows an example::
 
-  0000:50:00.0: PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, id=0500(Requester ID)
+  0000:50:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Requester ID)
   0000:50:00.0:   device [8086:0329] error status/mask=00100000/00000000
-  0000:50:00.0:    [20] Unsupported Request    (First)
-  0000:50:00.0:   TLP Header: 04000001 00200a03 05010000 00050100
+  0000:50:00.0:    [20] UnsupReq               (First)
+  0000:50:00.0:   TLP Header: 0x04000001 0x00200a03 0x05010000 0x00050100
 
 In the example, 'Requester ID' means the ID of the device that sent
 the error message to the Root Port. Please refer to PCIe specs for other
@@ -152,18 +152,6 @@ the device driver.
 Provide callbacks
 -----------------
 
-callback reset_link to reset PCIe link
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-This callback is used to reset the PCIe physical link when a
-fatal error happens. The Root Port AER service driver provides a
-default reset_link function, but different Upstream Ports might
-have different specifications to reset the PCIe link, so
-Upstream Port drivers may provide their own reset_link functions.
-
-Section 3.2.2.2 provides more detailed info on when to call
-reset_link.
-
 PCI error-recovery callbacks
 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
@@ -174,8 +162,8 @@ when performing error recovery actions.
 Data struct pci_driver has a pointer, err_handler, to point to
 pci_error_handlers who consists of a couple of callback function
 pointers. The AER driver follows the rules defined in
-pci-error-recovery.rst except PCIe-specific parts (e.g.
-reset_link). Please refer to pci-error-recovery.rst for detailed
+pci-error-recovery.rst except PCIe-specific parts (see
+below). Please refer to pci-error-recovery.rst for detailed
 definitions of the callbacks.
 
 The sections below specify when to call the error callback functions.
@@ -189,10 +177,21 @@ software intervention or any loss of data. These errors do not
 require any recovery actions. The AER driver clears the device's
 correctable error status register accordingly and logs these errors.
 
-Non-correctable (non-fatal and fatal) errors
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Uncorrectable (non-fatal and fatal) errors
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
 
-If an error message indicates a non-fatal error, performing link reset
+The AER driver performs a Secondary Bus Reset to recover from
+uncorrectable errors. The reset is applied at the port above
+the originating device: If the originating device is an Endpoint,
+only the Endpoint is reset. If on the other hand the originating
+device has subordinate devices, those are all affected by the
+reset as well.
+
+If the originating device is a Root Complex Integrated Endpoint,
+there's no port above where a Secondary Bus Reset could be applied.
+In this case, the AER driver instead applies a Function Level Reset.
+
+If an error message indicates a non-fatal error, performing a reset
 at upstream is not required. The AER driver calls error_detected(dev,
 pci_channel_io_normal) to all drivers associated within a hierarchy in
 question. For example::
@@ -204,38 +203,34 @@ Downstream Port B and Endpoint.
 
 A driver may return PCI_ERS_RESULT_CAN_RECOVER,
 PCI_ERS_RESULT_DISCONNECT, or PCI_ERS_RESULT_NEED_RESET, depending on
-whether it can recover or the AER driver calls mmio_enabled as next.
+whether it can recover without a reset, considers the device unrecoverable
+or needs a reset for recovery. If all affected drivers agree that they can
+recover without a reset, it is skipped. Should one driver request a reset,
+it overrides all other drivers.
 
 If an error message indicates a fatal error, kernel will broadcast
 error_detected(dev, pci_channel_io_frozen) to all drivers within
-a hierarchy in question. Then, performing link reset at upstream is
-necessary. As different kinds of devices might use different approaches
-to reset link, AER port service driver is required to provide the
-function to reset link via callback parameter of pcie_do_recovery()
-function. If reset_link is not NULL, recovery function will use it
-to reset the link. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER
-and reset_link returns PCI_ERS_RESULT_RECOVERED, the error handling goes
-to mmio_enabled.
+a hierarchy in question. Then, performing a reset at upstream is
+necessary. If error_detected returns PCI_ERS_RESULT_CAN_RECOVER
+to indicate that recovery without a reset is possible, the error
+handling goes to mmio_enabled, but afterwards a reset is still
+performed.
 
-Frequent Asked Questions
-------------------------
+In other words, for non-fatal errors, drivers may opt in to a reset.
+But for fatal errors, they cannot opt out of a reset, based on the
+assumption that the link is unreliable.
+
+Frequently Asked Questions
+--------------------------
 
 Q:
   What happens if a PCIe device driver does not provide an
   error recovery handler (pci_driver->err_handler is equal to NULL)?
 
 A:
-  The devices attached with the driver won't be recovered. If the
-  error is fatal, kernel will print out warning messages. Please refer
-  to section 3 for more information.
-
-Q:
-  What happens if an upstream port service driver does not provide
-  callback reset_link?
-
-A:
-  Fatal error recovery will fail if the errors are reported by the
-  upstream ports who are attached by the service driver.
+  The devices attached with the driver won't be recovered.
+  The kernel will print out informational messages to identify
+  unrecoverable devices.
 
 
 Software error injection
-- 
2.47.2


* [PATCH 2/4] PCI/ERR: Sync documentation with code
  2025-08-29  7:25 [PATCH 0/4] PCI: Update error recovery documentation Lukas Wunner
  2025-08-29  7:25 ` [PATCH 1/4] PCI/AER: Sync documentation with code Lukas Wunner
@ 2025-08-29  7:25 ` Lukas Wunner
  2025-08-29  7:25 ` [PATCH 3/4] PCI/ERR: Amend documentation with DPC and AER specifics Lukas Wunner
  2025-08-29  7:25 ` [PATCH 4/4] PCI/ERR: Tidy documentation's PCIe nomenclature Lukas Wunner
  3 siblings, 0 replies; 8+ messages in thread
From: Lukas Wunner @ 2025-08-29  7:25 UTC (permalink / raw)
  To: Bjorn Helgaas, Jonathan Corbet
  Cc: Terry Bowman, Ilpo Jarvinen, Sathyanarayanan Kuppuswamy,
	Niklas Schnelle, Linas Vepstas, Mahesh J Salgaonkar,
	Oliver OHalloran, linuxppc-dev, linux-pci, linux-doc,
	Brian Norris

Amend the documentation on PCI error recovery to fix minor inaccuracies
vis-à-vis the actual code:

* The documentation claims that a missing ->resume() or ->mmio_enabled()
  callback always leads to recovery through reset.  But none of the
  implementations do this (pcie_do_recovery(), eeh_handle_normal_event(),
  zpci_event_do_error_state_clear()).

  Drop the claim to align the documentation with the code.

* The documentation does not list PCI_ERS_RESULT_RECOVERED as a valid
  return value from ->error_detected().  But none of the implementations
  forbid this and some drivers are returning it, e.g.:
  drivers/bus/mhi/host/pci_generic.c
  drivers/infiniband/hw/hfi1/pcie.c

  Further down in the documentation it is implied that the return value is
  in fact allowed:
  "The platform will call the resume() callback on all affected device
  drivers if all drivers on the segment have returned
  PCI_ERS_RESULT_RECOVERED from one of the 3 previous callbacks."

  The "3 previous callbacks" being ->error_detected(), ->mmio_enabled()
  and ->slot_reset().

  Add it to the valid return values for consistency.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
---
 Documentation/PCI/pci-error-recovery.rst | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/Documentation/PCI/pci-error-recovery.rst b/Documentation/PCI/pci-error-recovery.rst
index 42e1e78353f3..d5c661baa87f 100644
--- a/Documentation/PCI/pci-error-recovery.rst
+++ b/Documentation/PCI/pci-error-recovery.rst
@@ -108,8 +108,8 @@ A driver does not have to implement all of these callbacks; however,
 if it implements any, it must implement error_detected(). If a callback
 is not implemented, the corresponding feature is considered unsupported.
 For example, if mmio_enabled() and resume() aren't there, then it
-is assumed that the driver is not doing any direct recovery and requires
-a slot reset.  Typically a driver will want to know about
+is assumed that the driver does not need these callbacks
+for recovery.  Typically a driver will want to know about
 a slot_reset().
 
 The actual steps taken by a platform to recover from a PCI error
@@ -141,6 +141,9 @@ shouldn't do any new IOs. Called in task context. This is sort of a
 All drivers participating in this system must implement this call.
 The driver must return one of the following result codes:
 
+  - PCI_ERS_RESULT_RECOVERED
+      Driver returns this if it thinks the device is usable despite
+      the error and does not need further intervention.
   - PCI_ERS_RESULT_CAN_RECOVER
       Driver returns this if it thinks it might be able to recover
       the HW by just banging IOs or if it wants to be given
-- 
2.47.2


* [PATCH 3/4] PCI/ERR: Amend documentation with DPC and AER specifics
  2025-08-29  7:25 [PATCH 0/4] PCI: Update error recovery documentation Lukas Wunner
  2025-08-29  7:25 ` [PATCH 1/4] PCI/AER: Sync documentation with code Lukas Wunner
  2025-08-29  7:25 ` [PATCH 2/4] PCI/ERR: " Lukas Wunner
@ 2025-08-29  7:25 ` Lukas Wunner
  2025-08-29  8:38   ` Niklas Schnelle
  2025-08-29 23:25   ` Linas Vepstas
  2025-08-29  7:25 ` [PATCH 4/4] PCI/ERR: Tidy documentation's PCIe nomenclature Lukas Wunner
  3 siblings, 2 replies; 8+ messages in thread
From: Lukas Wunner @ 2025-08-29  7:25 UTC (permalink / raw)
  To: Bjorn Helgaas, Jonathan Corbet
  Cc: Terry Bowman, Ilpo Jarvinen, Sathyanarayanan Kuppuswamy,
	Niklas Schnelle, Linas Vepstas, Mahesh J Salgaonkar,
	Oliver OHalloran, linuxppc-dev, linux-pci, linux-doc,
	Brian Norris

Amend the documentation on PCI error recovery with specifics about
Downstream Port Containment and Advanced Error Reporting:

* Explain that with DPC, devices are inaccessible upon an error (similar
  to EEH on powerpc) and do not become accessible until the link is
  re-enabled.

* Explain that with AER, although devices may already be accessible in the
  ->error_detected() callback, accesses should be deferred to the
  ->mmio_enabled() callback for compatibility with EEH on powerpc.
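
The accessibility check mentioned in the note added by this patch can be
sketched in user space as follows.  This is only a model under stated
assumptions: struct fake_dev, fake_read32() and possible_error32() are
hypothetical names, the latter mirroring the idea behind
PCI_POSSIBLE_ERROR() for 32-bit values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a device that may be isolated after an error. */
struct fake_dev {
	bool accessible;	/* false while the link or slot is disabled */
	uint32_t reg;		/* register contents when accessible */
};

/* Reads from an inaccessible device return all 1's. */
uint32_t fake_read32(const struct fake_dev *dev)
{
	return dev->accessible ? dev->reg : UINT32_MAX;
}

/* All 1's *may* indicate an error; it can also be a valid value. */
bool possible_error32(uint32_t val)
{
	return val == UINT32_MAX;
}

/* Usage: an isolated device reads as all 1's, an accessible one does not. */
void fake_dev_demo(void)
{
	struct fake_dev dev = { .accessible = false, .reg = 0x12345678 };

	assert(possible_error32(fake_read32(&dev)));

	dev.accessible = true;
	assert(!possible_error32(fake_read32(&dev)));
}
```

Because all 1's can also be legitimate register contents, such a check
only hints at an inaccessible device.  A driver would typically read a
register which is known to never contain all 1's.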

Signed-off-by: Lukas Wunner <lukas@wunner.de>
---
 Documentation/PCI/pci-error-recovery.rst | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/Documentation/PCI/pci-error-recovery.rst b/Documentation/PCI/pci-error-recovery.rst
index d5c661baa87f..c88c304b2103 100644
--- a/Documentation/PCI/pci-error-recovery.rst
+++ b/Documentation/PCI/pci-error-recovery.rst
@@ -122,6 +122,10 @@ A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
 is isolated, in that all I/O is blocked: all reads return 0xffffffff,
 all writes are ignored.
 
+Similarly, on platforms supporting Downstream Port Containment
+(PCIe r7.0 sec 6.2.11), the link to the sub-hierarchy with the
+faulting device is disabled. Any device in the sub-hierarchy
+becomes inaccessible.
 
 STEP 1: Notification
 --------------------
@@ -204,6 +208,23 @@ link reset was performed by the HW. If the platform can't just re-enable IOs
 without a slot reset or a link reset, it will not call this callback, and
 instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
 
+.. note::
+
+   On platforms supporting Advanced Error Reporting (PCIe r7.0 sec 6.2),
+   the faulting device may already be accessible in STEP 1 (Notification).
+   Drivers should nevertheless defer accesses to STEP 2 (MMIO Enabled)
+   to be compatible with EEH on powerpc.
+
+   On platforms supporting Downstream Port Containment, the link to the
+   sub-hierarchy with the faulting device is re-enabled in STEP 3 (Link
+   Reset). Hence devices in the sub-hierarchy are inaccessible until
+   STEP 4 (Slot Reset).
+
+   For errors such as Surprise Down (PCIe r7.0 sec 6.2.7), the device
+   may not even be accessible in STEP 4 (Slot Reset). Drivers can detect
+   accessibility by checking whether reads from the device return all 1's
+   (PCI_POSSIBLE_ERROR()).
+
 .. note::
 
    The following is proposed; no platform implements this yet:
-- 
2.47.2


* [PATCH 4/4] PCI/ERR: Tidy documentation's PCIe nomenclature
  2025-08-29  7:25 [PATCH 0/4] PCI: Update error recovery documentation Lukas Wunner
                   ` (2 preceding siblings ...)
  2025-08-29  7:25 ` [PATCH 3/4] PCI/ERR: Amend documentation with DPC and AER specifics Lukas Wunner
@ 2025-08-29  7:25 ` Lukas Wunner
  3 siblings, 0 replies; 8+ messages in thread
From: Lukas Wunner @ 2025-08-29  7:25 UTC (permalink / raw)
  To: Bjorn Helgaas, Jonathan Corbet
  Cc: Terry Bowman, Ilpo Jarvinen, Sathyanarayanan Kuppuswamy,
	Niklas Schnelle, Linas Vepstas, Mahesh J Salgaonkar,
	Oliver OHalloran, linuxppc-dev, linux-pci, linux-doc,
	Brian Norris

Commit 11502feab423 ("Documentation: PCI: Tidy AER documentation")
replaced the terms "PCI-E", "PCI-Express" and "PCI Express" with "PCIe"
in the AER documentation.

Do the same in the documentation on PCI error recovery.  While at it,
add a missing period and a missing blank.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
---
 Documentation/PCI/pci-error-recovery.rst | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/Documentation/PCI/pci-error-recovery.rst b/Documentation/PCI/pci-error-recovery.rst
index c88c304b2103..500d4e9b2143 100644
--- a/Documentation/PCI/pci-error-recovery.rst
+++ b/Documentation/PCI/pci-error-recovery.rst
@@ -13,7 +13,7 @@ PCI Error Recovery
 Many PCI bus controllers are able to detect a variety of hardware
 PCI errors on the bus, such as parity errors on the data and address
 buses, as well as SERR and PERR errors.  Some of the more advanced
-chipsets are able to deal with these errors; these include PCI-E chipsets,
+chipsets are able to deal with these errors; these include PCIe chipsets,
 and the PCI-host bridges found on IBM Power4, Power5 and Power6-based
 pSeries boxes. A typical action taken is to disconnect the affected device,
 halting all I/O to it.  The goal of a disconnection is to avoid system
@@ -206,7 +206,7 @@ reset or some such, but not restart operations. This callback is made if
 all drivers on a segment agree that they can try to recover and if no automatic
 link reset was performed by the HW. If the platform can't just re-enable IOs
 without a slot reset or a link reset, it will not call this callback, and
-instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
+instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset).
 
 .. note::
 
@@ -258,14 +258,14 @@ The driver should return one of the following result codes:
 
 The next step taken depends on the results returned by the drivers.
 If all drivers returned PCI_ERS_RESULT_RECOVERED, then the platform
-proceeds to either STEP3 (Link Reset) or to STEP 5 (Resume Operations).
+proceeds to either STEP 3 (Link Reset) or to STEP 5 (Resume Operations).
 
 If any driver returned PCI_ERS_RESULT_NEED_RESET, then the platform
 proceeds to STEP 4 (Slot Reset)
 
 STEP 3: Link Reset
 ------------------
-The platform resets the link.  This is a PCI-Express specific step
+The platform resets the link.  This is a PCIe specific step
 and is done whenever a fatal error has been detected that can be
 "solved" by resetting the link.
 
@@ -287,13 +287,13 @@ that is equivalent to what it would be after a fresh system
 power-on followed by power-on BIOS/system firmware initialization.
 Soft reset is also known as hot-reset.
 
-Powerpc fundamental reset is supported by PCI Express cards only
+Powerpc fundamental reset is supported by PCIe cards only
 and results in device's state machines, hardware logic, port states and
 configuration registers to initialize to their default conditions.
 
 For most PCI devices, a soft reset will be sufficient for recovery.
 Optional fundamental reset is provided to support a limited number
-of PCI Express devices for which a soft reset is not sufficient
+of PCIe devices for which a soft reset is not sufficient
 for recovery.
 
 If the platform supports PCI hotplug, then the reset might be
@@ -337,7 +337,7 @@ Result codes:
 	- PCI_ERS_RESULT_DISCONNECT
 	  Same as above.
 
-Drivers for PCI Express cards that require a fundamental reset must
+Drivers for PCIe cards that require a fundamental reset must
 set the needs_freset bit in the pci_dev structure in their probe function.
 For example, the QLogic qla2xxx driver sets the needs_freset bit for certain
 PCI card types::
-- 
2.47.2


* Re: [PATCH 3/4] PCI/ERR: Amend documentation with DPC and AER specifics
  2025-08-29  7:25 ` [PATCH 3/4] PCI/ERR: Amend documentation with DPC and AER specifics Lukas Wunner
@ 2025-08-29  8:38   ` Niklas Schnelle
  2025-08-29 23:25   ` Linas Vepstas
  1 sibling, 0 replies; 8+ messages in thread
From: Niklas Schnelle @ 2025-08-29  8:38 UTC (permalink / raw)
  To: Lukas Wunner, Bjorn Helgaas, Jonathan Corbet
  Cc: Terry Bowman, Ilpo Jarvinen, Sathyanarayanan Kuppuswamy,
	Linas Vepstas, Mahesh J Salgaonkar, Oliver OHalloran,
	linuxppc-dev, linux-pci, linux-doc, Brian Norris

On Fri, 2025-08-29 at 09:25 +0200, Lukas Wunner wrote:
> Amend the documentation on PCI error recovery with specifics about
> Downstream Port Containment and Advanced Error Reporting:
> 
> * Explain that with DPC, devices are inaccessible upon an error (similar
>   to EEH on powerpc) and do not become accessible until the link is
>   re-enabled.
> 
> * Explain that with AER, although devices may already be accessible in the
>   ->error_detected() callback, accesses should be deferred to the
>   ->mmio_enabled() callback for compatibility with EEH on powerpc.
> 
> Signed-off-by: Lukas Wunner <lukas@wunner.de>
> ---
>  Documentation/PCI/pci-error-recovery.rst | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> diff --git a/Documentation/PCI/pci-error-recovery.rst b/Documentation/PCI/pci-error-recovery.rst
> index d5c661baa87f..c88c304b2103 100644
> --- a/Documentation/PCI/pci-error-recovery.rst
> +++ b/Documentation/PCI/pci-error-recovery.rst
> @@ -122,6 +122,10 @@ A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
>  is isolated, in that all I/O is blocked: all reads return 0xffffffff,
>  all writes are ignored.
>  
> +Similarly, on platforms supporting Downstream Port Containment
> +(PCIe r7.0 sec 6.2.11), the link to the sub-hierarchy with the
> +faulting device is disabled. Any device in the sub-hierarchy
> +becomes inaccessible.
>  
>  STEP 1: Notification
>  --------------------
> @@ -204,6 +208,23 @@ link reset was performed by the HW. If the platform can't just re-enable IOs
>  without a slot reset or a link reset, it will not call this callback, and
>  instead will have gone directly to STEP 3 (Link Reset) or STEP 4 (Slot Reset)
>  
> +.. note::
> +
> +   On platforms supporting Advanced Error Reporting (PCIe r7.0 sec 6.2),
> +   the faulting device may already be accessible in STEP 1 (Notification).
> +   Drivers should nevertheless defer accesses to STEP 2 (MMIO Enabled)
> +   to be compatible with EEH on powerpc.

I'm biased of course but I'd prefer either "with error recovery support
on powerpc and s390" or simply "with systems where devices are
inaccessible until MMIO is re-enabled explicitly or a reset occurs.".

* Re: [PATCH 3/4] PCI/ERR: Amend documentation with DPC and AER specifics
  2025-08-29  7:25 ` [PATCH 3/4] PCI/ERR: Amend documentation with DPC and AER specifics Lukas Wunner
  2025-08-29  8:38   ` Niklas Schnelle
@ 2025-08-29 23:25   ` Linas Vepstas
  2025-08-30  8:12     ` Lukas Wunner
  1 sibling, 1 reply; 8+ messages in thread
From: Linas Vepstas @ 2025-08-29 23:25 UTC (permalink / raw)
  To: Lukas Wunner
  Cc: Bjorn Helgaas, Jonathan Corbet, Terry Bowman, Ilpo Jarvinen,
	Sathyanarayanan Kuppuswamy, Niklas Schnelle, Mahesh J Salgaonkar,
	Oliver OHalloran, linuxppc-dev, linux-pci, linux-doc,
	Brian Norris

On Fri, Aug 29, 2025 at 2:41 AM Lukas Wunner <lukas@wunner.de> wrote:
>
> +   On platforms supporting Downstream Port Containment, the link to the
> +   sub-hierarchy with the faulting device is re-enabled in STEP 3 (Link
> +   Reset). Hence devices in the sub-hierarchy are inaccessible until
> +   STEP 4 (Slot Reset).

I'm confused. In the good old days, w/EEH, a slot reset was literally turning
the power off and on again to the device, for that slot. So it's not so much
that the device becomes "accessible again", but that it is now fresh, clean
but also unconfigured. I have not studied DPC, but the way this is worded
here makes me think that something else is happening.

-- Linas

* Re: [PATCH 3/4] PCI/ERR: Amend documentation with DPC and AER specifics
  2025-08-29 23:25   ` Linas Vepstas
@ 2025-08-30  8:12     ` Lukas Wunner
  0 siblings, 0 replies; 8+ messages in thread
From: Lukas Wunner @ 2025-08-30  8:12 UTC (permalink / raw)
  To: Linas Vepstas
  Cc: Bjorn Helgaas, Jonathan Corbet, Terry Bowman, Ilpo Jarvinen,
	Sathyanarayanan Kuppuswamy, Niklas Schnelle, Mahesh J Salgaonkar,
	Oliver OHalloran, linuxppc-dev, linux-pci, linux-doc,
	Brian Norris

On Fri, Aug 29, 2025 at 06:25:08PM -0500, Linas Vepstas wrote:
> On Fri, Aug 29, 2025 at 2:41AM Lukas Wunner <lukas@wunner.de> wrote:
> >
> > +   On platforms supporting Downstream Port Containment, the link to the
> > +   sub-hierarchy with the faulting device is re-enabled in STEP 3 (Link
> > +   Reset). Hence devices in the sub-hierarchy are inaccessible until
> > +   STEP 4 (Slot Reset).
> 
> I'm confused. In the good old days, w/EEH, a slot reset was literally turning
> the power off and on again to the device, for that slot. So it's not so much
> that the device becomes "accessible again", but that it is now fresh, clean
> but also unconfigured. I have not studied DPC, but the way this is worded
> here makes me think that something else is happening.

With DPC, when a Downstream Port (or Root Port) detects an error,
it immediately disables the downstream link, thereby preventing
corrupted data from reaching the rest of the system.  So the error
is "contained" at the Downstream Port.

It is then necessary for system software (i.e. drivers/pci/pcie/dpc.c)
to "release" the Downstream Port out of containment by re-enabling the
link.  This happens in dpc_reset_link() by writing (and thus clearing)
the PCI_EXP_DPC_STATUS_TRIGGER bit in the PCI_EXP_DPC_STATUS register.

In-between, the devices downstream are inaccessible.

Disabling the link results in a Hot Reset being propagated down the
hierarchy below the Downstream Port.  So there's no power cycle
involved.  After the link is re-enabled, devices are in power state
D0_uninitialized and need to be re-initialized by the driver in
->slot_reset() and/or ->resume().
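
The containment and release sequence can be modelled in user space as
sketched below.  This is not the code in drivers/pci/pcie/dpc.c, only
an illustration of the write-1-to-clear behavior described above:
struct model_port, MODEL_DPC_TRIGGER and the model_*() helpers are
hypothetical names and the bit value is merely illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_DPC_TRIGGER 0x0001u	/* illustrative trigger status bit */

/* Hypothetical model of a Downstream Port with a DPC status register. */
struct model_port {
	uint16_t dpc_status;	/* writing 1 to a set bit clears it */
};

/* Hardware side: an error triggers containment, the link is disabled. */
void model_trigger(struct model_port *p)
{
	p->dpc_status |= MODEL_DPC_TRIGGER;
}

/* Register write with write-1-to-clear semantics. */
void model_write_status(struct model_port *p, uint16_t val)
{
	p->dpc_status &= (uint16_t)~val;
}

/* Devices below the port are reachable only outside of containment. */
bool model_link_enabled(const struct model_port *p)
{
	return !(p->dpc_status & MODEL_DPC_TRIGGER);
}

/* Usage: trigger containment, then release it by writing the bit. */
void model_demo(void)
{
	struct model_port port = { 0 };

	model_trigger(&port);
	assert(!model_link_enabled(&port));

	model_write_status(&port, 0);	/* writing 0 has no effect */
	assert(!model_link_enabled(&port));

	model_write_status(&port, MODEL_DPC_TRIGGER);
	assert(model_link_enabled(&port));
}
```

The model only covers link accessibility.  It does not capture the Hot
Reset or the need to re-initialize devices afterwards.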

If you feel the above-quoted paragraph isn't accurate or complete
or doesn't capture this sequence of events properly, please let me
know what specifically should be rephrased / amended.

Thanks for taking a look!

Lukas
