public inbox for linux-kernel@vger.kernel.org
* [RFC PATCH 0/6] Add support for root port RAS error handling
@ 2024-03-13  8:35 Li Ming
From: Li Ming @ 2024-03-13  8:35 UTC
  To: dan.j.williams, rrichter, terry.bowman; +Cc: linux-cxl, linux-kernel, Li Ming

Protocol errors signaled to a CXL root port may be captured by a Root
Complex Event Collector (RCEC). If those errors are not cleared and
reported, the system owner loses forensic information for system failure
analysis.

Per CXL r3.1 section 9.18.1.5, the specification's recommendation for
this case is the 'Else' statement in the 'IMPLEMENTATION NOTE' under
'Table 9-24 RDPAS Structure':

	"Probe all CXL Downstream Ports and determine whether they have logged an
	error in the CXL.io or CXL.cachemem status registers."
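
As a rough illustration of that probing step, here is a minimal userspace
sketch. It is not code from this series: struct model_dport and the
model_* helpers are hypothetical names, and it assumes that CXL.io errors
show up in the PCIe AER status registers while CXL.cachemem errors show
up in the CXL RAS capability status registers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the status registers of one CXL downstream port. */
struct model_dport {
	uint32_t aer_uncor_status;	/* CXL.io: AER uncorrectable status */
	uint32_t aer_cor_status;	/* CXL.io: AER correctable status */
	uint32_t ras_uncor_status;	/* CXL.cachemem: RAS uncorrectable status */
	uint32_t ras_cor_status;	/* CXL.cachemem: RAS correctable status */
};

/* Bit flags describing where a port has logged an error. */
enum model_err_source {
	MODEL_ERR_NONE		= 0,
	MODEL_ERR_CXL_IO	= 1 << 0,
	MODEL_ERR_CXL_CACHEMEM	= 1 << 1,
};

/* Report which protocol status registers of one port are non-zero. */
static unsigned int model_dport_error_sources(const struct model_dport *p)
{
	unsigned int src = MODEL_ERR_NONE;

	if (p->aer_uncor_status || p->aer_cor_status)
		src |= MODEL_ERR_CXL_IO;
	if (p->ras_uncor_status || p->ras_cor_status)
		src |= MODEL_ERR_CXL_CACHEMEM;

	return src;
}

/*
 * Probe every port, store the per-port result in sources[], and return
 * the number of ports that logged at least one error.
 */
static size_t model_probe_dports(const struct model_dport *ports, size_t n,
				 unsigned int *sources)
{
	size_t hits = 0;

	for (size_t i = 0; i < n; i++) {
		sources[i] = model_dport_error_sources(&ports[i]);
		if (sources[i])
			hits++;
	}

	return hits;
}
```

A real handler would read these registers from the device and clear them
after reporting; the sketch only shows the "probe all ports, check both
register sets" shape of the recommendation.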

The CXL subsystem already supports RCH RAS error handling, which depends
on the RCEC. Reuse and extend that RCH topology support to handle
reported errors in the VH topology case. The implementation is
composed of:
* Provide a new interface on the RCEC side to walk all devices under the
  RCEC and the RCEC's associated bus ranges. The PCIe AER core uses this
  interface to walk all CXL endpoints and all CXL root ports under those
  bus ranges.

* Update the PCIe AER core to enable Uncorrectable Internal Error and
  Correctable Internal Error reporting for root ports.

* Invoke the cxl_pci error handler for RCEC reported errors.

* Handle root-port errors in the cxl_pci handler when the device is
  directly attached.
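
The walk described in the first item can be sketched as follows. This is
a simplified userspace model under stated assumptions, not the
pcie_walk_rcec_all() implementation from patch 1: struct model_dev,
struct model_rcec and the model_* functions are hypothetical, and the
associated bus range is reduced to a single [next_bus, last_bus] pair.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a PCI device. */
struct model_dev {
	uint8_t bus;
	bool is_cxl_root_port;
	bool is_cxl_endpoint;
	bool has_logged_error;	/* models a non-zero RAS status register */
};

/* Hypothetical RCEC with one associated bus range (inclusive). */
struct model_rcec {
	uint8_t next_bus;
	uint8_t last_bus;
};

typedef int (*model_walk_cb)(struct model_dev *dev, void *userdata);

/*
 * Call cb for every device whose bus number lies in the RCEC's
 * associated bus range. Stop and return early if cb returns non-zero.
 */
static int model_walk_rcec_range(const struct model_rcec *rcec,
				 struct model_dev *devs, size_t ndevs,
				 model_walk_cb cb, void *userdata)
{
	for (size_t i = 0; i < ndevs; i++) {
		int rc;

		if (devs[i].bus < rcec->next_bus ||
		    devs[i].bus > rcec->last_bus)
			continue;

		rc = cb(&devs[i], userdata);
		if (rc)
			return rc;
	}

	return 0;
}

/*
 * Example callback: count CXL root ports and endpoints that logged an
 * error, and clear the modeled status so it is not reported twice.
 */
static int model_count_cxl_errors(struct model_dev *dev, void *userdata)
{
	unsigned int *count = userdata;

	if (!dev->is_cxl_root_port && !dev->is_cxl_endpoint)
		return 0;

	if (dev->has_logged_error) {
		(*count)++;
		dev->has_logged_error = false;
	}

	return 0;
}
```

Passing the callback and an opaque userdata pointer keeps the walker
independent of what the AER core or cxl_pci does per device, which is
the same shape the existing pcie_walk_rcec() callback interface uses.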

This implementation covers only the case above, without a CXL switch.
Two open questions remain to be discussed:
1. Is it compatible with CXL switch port error handling?
The CXL switch port error handling proposal has not been finalized yet.
We should confirm that this implementation will be compatible with it.

2. How should the case where a CXL root port itself reports CXL.CM
protocol errors be handled?
This patchset does not support that case at present. My opinion is that
the cxl_pci handler should be invoked to deal with it as well.

base-commit: 73bf93edeeea866b0b6efbc8d2595bdaaba7f1a5 branch: next

Li Ming (6):
  PCI/RCEC: Introduce pcie_walk_rcec_all()
  PCI/CXL: A new attribute to indicate if a host bridge is CXL capable
  PCI/AER: Enable RCEC to report internal error for CXL root port
  PCI/AER: Support to handle errors detected by CXL root port
  cxl: Use __free() for cxl_pci/mem_find_port() to drop put_device()
  cxl/pci: Add support for the RAS handling of RCEC captured errors on
    RP

 drivers/acpi/pci_root.c |  1 +
 drivers/cxl/core/pci.c  | 89 +++++++++++++++++++++++++++--------------
 drivers/cxl/core/port.c |  9 +++++
 drivers/cxl/cxl.h       |  2 +
 drivers/cxl/mem.c       |  5 +--
 drivers/cxl/pci.c       | 12 +++---
 drivers/pci/pci.h       |  6 +++
 drivers/pci/pcie/aer.c  | 44 +++++++++++++-------
 drivers/pci/pcie/rcec.c | 44 +++++++++++++++++++-
 include/linux/pci.h     |  1 +
 10 files changed, 155 insertions(+), 58 deletions(-)

-- 
2.40.1



end of thread, newest message: 2024-04-23  2:33 UTC

Thread overview: 31+ messages
2024-03-13  8:35 [RFC PATCH 0/6] Add support for root port RAS error handling Li Ming
2024-03-13  8:35 ` [RFC PATCH 1/6] PCI/RCEC: Introduce pcie_walk_rcec_all() Li Ming
2024-03-25 20:15   ` Terry Bowman
2024-04-16  4:39     ` Dan Williams
2024-04-22 14:34       ` Terry Bowman
2024-04-22 23:03         ` Dan Williams
2024-04-23  2:33           ` Li, Ming
2024-04-16  7:23     ` Li, Ming
2024-03-13  8:35 ` [RFC PATCH 2/6] PCI/CXL: A new attribute to indicate CXL-capable host bridge Li Ming
2024-03-13  8:35 ` [RFC PATCH 3/6] PCI/AER: Enable RCEC to report internal error for CXL root port Li Ming
2024-03-25 19:42   ` Terry Bowman
2024-04-16  7:27     ` Li, Ming
2024-04-16 14:46       ` Terry Bowman
2024-04-18  5:53         ` Li, Ming
2024-04-18 14:57           ` Dan Williams
2024-04-22  2:06             ` Li, Ming
2024-04-22 23:01               ` Dan Williams
2024-03-13  8:36 ` [RFC PATCH 4/6] PCI/AER: Extend RCH RAS error handling to support VH topology case Li Ming
2024-03-15  2:30   ` Dan Williams
2024-03-15  3:43     ` Li, Ming
2024-03-15  4:05       ` Dan Williams
2024-03-15  5:08         ` Li, Ming
2024-03-25 19:14   ` Terry Bowman
2024-03-13  8:36 ` [RFC PATCH 5/6] cxl: Use __free() for cxl_pci/mem_find_port() to drop put_device() Li Ming
2024-03-15  2:24   ` Dan Williams
2024-03-15  4:05     ` Li, Ming
2024-03-13  8:36 ` [RFC PATCH 6/6] cxl/pci: Support to handle root port RAS errors captured by RCEC Li Ming
2024-03-15  1:45 ` [RFC PATCH 0/6] Add support for root port RAS error handling Dan Williams
2024-03-15  8:40   ` Li, Ming
2024-03-15 18:21     ` Dan Williams
2024-03-20 12:48       ` Li, Ming
