From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: Dave Jiang <dave.jiang@intel.com>
Cc: <linux-cxl@vger.kernel.org>, <linux-pci@vger.kernel.org>,
<dan.j.williams@intel.com>, <ira.weiny@intel.com>,
<vishal.l.verma@intel.com>, <alison.schofield@intel.com>,
<rostedt@goodmis.org>, <terry.bowman@amd.com>,
<bhelgaas@google.com>,
<sathyanarayanan.kuppuswamy@linux.intel.com>,
<shiju.jose@huawei.com>
Subject: Re: [PATCH v4 00/11] cxl/pci: Add fundamental error handling
Date: Tue, 13 Dec 2022 15:17:44 +0000
Message-ID: <20221213151744.00003e58@Huawei.com>
In-Reply-To: <166974401763.1608150.5424589924034481387.stgit@djiang5-desk3.ch.intel.com>
On Tue, 29 Nov 2022 10:48:06 -0700
Dave Jiang <dave.jiang@intel.com> wrote:
> Hi Bjorn,
> I added a new optional callback to the AER error handler to allow the PCI
> device driver to do additional logging. Please Ack the patch if it looks
> reasonable to you so that Dan can take the series through the cxl tree.
> Thank you!
>
> Hi Steve,
> Please review the trace event implementation and Ack if it looks ok.
> Thank you!
>
In the interests of avoiding possible duplication, this is a quick note that
we are looking into the associated RAS daemon support for these errors.
Jonathan
> v4:
> - Change header log for eventtrace to static array (Steve)
> - Fix CE status bits (Shiju)
> - Fix ECC capitalization (Shiju)
> - Add PCI error handler callback documentation (Sathyanarayanan)
> - Clarify callback as additional information capture only (Jonathan)
> - Clarify need of callback to clear CE by CXL device (Jonathan)
> - Fix 0-day complaint of __force __le32.
>
> v3:
> - Copy header log in 32bit chunks (Jonathan)
> - Export header log whole as raw data (Jonathan)
> - Added callback in PCI AER err handler for correctable errors (Jonathan)
> - Tested on qemu thanks to Jonathan's CXL AER injection enabling!
>
> v2:
> - Convert error reporting via printk to trace events
> - Drop ".rmap =" initialization (Jonathan)
> - Return PCI_ERS_RESULT_NEED_RESET for UE in pci_channel_io_normal (Shiju)
>
> Add a 'struct pci_error_handlers' instance for the cxl_pci driver.
> Section 8.2.4.16 "CXL RAS Capability Structure" of the CXL rev3.0
> specification defines the error sources considered in this
> implementation. The RAS Capability Structure defines protocol, link and
> internal errors which are distinct from memory poison errors that are
> conveyed via direct consumption and/or media scanning.
>
> The errors reported by the RAS registers are categorized into
> correctable and uncorrectable errors, where the uncorrectable errors are
> optionally steered to either fatal or non-fatal AER events. Table 12-2
> "Device Specific Error Reporting and Nomenclature Guidelines" in the CXL
> rev3.0 specification outlines that the remediation for uncorrectable errors
> is a reset to recover. This matches how the Linux PCIe AER core treats
> uncorrectable errors as occasions to reset the device to recover
> operation.
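
As a rough illustration of the policy described above (an uncorrectable
error leads to a reset request), here is a minimal userspace sketch. The
enum and function names are local stand-ins invented for this example,
loosely modelled on the kernel's channel-state and recovery-result types.
The handling of the frozen and permanent-failure states is an assumption
on my part, not something taken from the series.

```c
#include <stdbool.h>

/* Local stand-ins (assumption): not the kernel's actual definitions. */
enum sketch_channel_state {
	SKETCH_IO_NORMAL,       /* non-fatal event, device still reachable */
	SKETCH_IO_FROZEN,       /* fatal event, I/O to the device is blocked */
	SKETCH_IO_PERM_FAILURE, /* device is considered lost */
};

enum sketch_ers_result {
	SKETCH_CAN_RECOVER,
	SKETCH_NEED_RESET,
	SKETCH_DISCONNECT,
};

/*
 * Map the reported channel state, plus whether an uncorrectable error
 * (UE) was found in the RAS status, to a recovery request.
 */
static enum sketch_ers_result
sketch_error_detected(enum sketch_channel_state state, bool ue_logged)
{
	switch (state) {
	case SKETCH_IO_NORMAL:
		/* A UE asks for a reset even though the channel is usable. */
		return ue_logged ? SKETCH_NEED_RESET : SKETCH_CAN_RECOVER;
	case SKETCH_IO_FROZEN:
		return SKETCH_NEED_RESET;
	case SKETCH_IO_PERM_FAILURE:
		return SKETCH_DISCONNECT;
	}
	/* Unknown state: fall back to the conservative answer. */
	return SKETCH_NEED_RESET;
}
```

The point of the sketch is only that a UE seen in the "normal" channel
state is still answered with a reset request rather than "can recover".
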
>
> While the specification notes "CXL Reset" or "Secondary Bus Reset" as
> theoretical recovery options, they are not feasible in practice since
> in-flight CXL.mem operations may not terminate and may cause knock-on
> system-fatal events. Reset is only reliable for recovering CXL.io; it is
> not reliable for recovering CXL.mem. Assuming the system survives, a reset
> causes CXL.mem operation to restart from scratch.
>
> The "ECN: Error Isolation on CXL.mem and CXL.cache" [1] document
> recognizes the CXL Reset vs CXL.mem operational conflict and helps to at
> least provide a mechanism for the Root Port to terminate in-flight
> CXL.mem operations with completions. That still poses problems in
> practice if the kernel is running out of "System RAM" backed by the CXL
> device and poison is used to convey the data lost to the protocol error.
>
> Regardless of whether the reset and restart of CXL.mem operations is
> feasible / successful, the logging is still useful. So, the
> implementation reads, reports, and clears the status in the RAS
> Capability Structure registers, and it notifies the 'struct cxl_memdev'
> associated with the given PCIe endpoint to reattach to its driver over
> the reset so that the HDM decoder configuration can be reconstructed.
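
The read, report and clear sequence described above can be sketched in
userspace as follows. This assumes the status register has
write-1-to-clear semantics, which the sketch models with a plain struct
instead of real MMIO. All names here (sketch_ras_regs, sketch_handle_ras
and so on) are hypothetical and are not the identifiers used in the
series.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model of a write-1-to-clear (RW1C) status register. */
struct sketch_ras_regs {
	uint32_t uncor_status;
};

static uint32_t sketch_read_status(const struct sketch_ras_regs *regs)
{
	return regs->uncor_status;
}

/* Writing a 1 clears that bit; bits written as 0 are left untouched. */
static void sketch_write_status(struct sketch_ras_regs *regs, uint32_t val)
{
	regs->uncor_status &= ~val;
}

/* Read the status, report it, then clear exactly the bits reported. */
static uint32_t sketch_handle_ras(struct sketch_ras_regs *regs)
{
	uint32_t status = sketch_read_status(regs);

	if (status) {
		printf("sketch_ras: uncorrectable status 0x%08x\n",
		       (unsigned int)status);
		sketch_write_status(regs, status);
	}
	return status;
}
```

Writing back the value that was read, rather than all ones, means a bit
that becomes set between the read and the write is not cleared before it
has been reported.
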
>
> The first half of the series reworks component register mapping so that
> the cxl_pci driver can own the RAS Capability while the cxl_port driver
> continues to own the HDM Decoder Capability. The last half implements
> the RAS Capability Structure mapping and reporting via 'struct
> pci_error_handlers'.
>
> The reporting of error information is done through event tracing. A new
> cxl_ras event is introduced to report the Uncorrectable and Correctable
> errors raised by CXL. The expectation is a monitoring user daemon such as
> "cxl monitor" will harvest those events and record them in a log in a
> format (JSON) that's consumable by management applications.
>
> For correctable errors, the current Linux implementation does not provide
> any means to reach the PCI device driver. Add an optional callback to the
> PCI AER error handler to allow the PCI device driver to log additional
> information from the device.
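
The optional-callback pattern described above can be illustrated with a
small userspace sketch: a handler table in which the correctable-error
hook may be left NULL, and a core routine that calls it only when a
driver has provided one. The struct, field and function names are
invented for this example and are not the ones added to 'struct
pci_error_handlers' by the series.

```c
#include <stddef.h>

struct sketch_dev;

/* Hypothetical handler table; the correctable-error hook is optional. */
struct sketch_err_handlers {
	int (*error_detected)(struct sketch_dev *dev);
	void (*cor_error_log)(struct sketch_dev *dev); /* may be NULL */
};

struct sketch_dev {
	const struct sketch_err_handlers *handlers; /* NULL if none */
	unsigned int cor_logged; /* count of correctable errors logged */
};

/* Core side: call into the driver only if it opted in. */
static void sketch_handle_correctable(struct sketch_dev *dev)
{
	const struct sketch_err_handlers *h = dev->handlers;

	if (h && h->cor_error_log)
		h->cor_error_log(dev);
}

/* Driver side: capture extra information only, no recovery attempted. */
static void sketch_cxl_cor_log(struct sketch_dev *dev)
{
	dev->cor_logged++;
}
```

Drivers that leave the hook unset see no change in behaviour, which is
what makes the callback optional.
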
>
> [1]: https://www.computeexpresslink.org/spec-landing
>
> ---
>
> Dan Williams (8):
> cxl/pci: Cleanup repeated code in cxl_probe_regs() helpers
> cxl/pci: Cleanup cxl_map_device_regs()
> cxl/pci: Kill cxl_map_regs()
> cxl/core/regs: Make cxl_map_{component, device}_regs() device generic
> cxl/port: Limit the port driver to just the HDM Decoder Capability
> cxl/pci: Prepare for mapping RAS Capability Structure
> cxl/pci: Find and map the RAS Capability Structure
> cxl/pci: Add (hopeful) error handling support
>
> Dave Jiang (3):
> cxl/pci: add tracepoint events for CXL RAS
> PCI/AER: Add optional logging callback for correctable error
> cxl/pci: Add callback to log AER correctable error
>
>
> Documentation/PCI/pci-error-recovery.rst | 7 +
> drivers/cxl/core/hdm.c | 33 ++--
> drivers/cxl/core/memdev.c | 1 +
> drivers/cxl/core/pci.c | 3 +-
> drivers/cxl/core/port.c | 2 +-
> drivers/cxl/core/regs.c | 172 ++++++++++--------
> drivers/cxl/cxl.h | 38 +++-
> drivers/cxl/cxlmem.h | 2 +
> drivers/cxl/cxlpci.h | 9 -
> drivers/cxl/pci.c | 213 ++++++++++++++++++-----
> drivers/pci/pcie/aer.c | 8 +-
> include/linux/pci.h | 3 +
> include/trace/events/cxl.h | 112 ++++++++++++
> 13 files changed, 453 insertions(+), 150 deletions(-)
> create mode 100644 include/trace/events/cxl.h
>
> --
>