From: Jonathan Cameron <jic23@kernel.org>
To: Terry Bowman <terry.bowman@amd.com>
Cc: <dave@stgolabs.net>, <dave.jiang@intel.com>,
	<alison.schofield@intel.com>, <djbw@kernel.org>,
	<bhelgaas@google.com>, <shiju.jose@huawei.com>,
	<ming.li@zohomail.com>, <Smita.KoralahalliChannabasappa@amd.com>,
	<rrichter@amd.com>, <dan.carpenter@linaro.org>,
	<PradeepVineshReddy.Kodamati@amd.com>, <lukas@wunner.de>,
	<Benjamin.Cheatham@amd.com>,
	<sathyanarayanan.kuppuswamy@linux.intel.com>,
	<vishal.l.verma@intel.com>, <alucerop@amd.com>,
	<ira.weiny@intel.com>, <corbet@lwn.net>, <rafael@kernel.org>,
	<xueshuai@linux.alibaba.com>, <linux-cxl@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <linux-pci@vger.kernel.org>,
	<linux-acpi@vger.kernel.org>, <linux-doc@vger.kernel.org>
Subject: Re: [PATCH v17 11/11] Documentation: cxl: Document CXL protocol error handling
Date: Thu, 7 May 2026 19:51:56 +0100
Message-ID: <20260507195156.3757a20b@jic23-huawei>
In-Reply-To: <20260505173029.2718246-12-terry.bowman@amd.com>

On Tue, 5 May 2026 12:30:29 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> Add Documentation/driver-api/cxl/linux/protocol-error-handling.rst
> describing the end-to-end CXL protocol error path: AER ingress, the
> AER-CXL kfifo handoff, the cxl_core consumer worker, RCD/RCH special
> cases, severity policy, trace events, and a source code map.
> 
> This documents the architecture introduced by the preceding patches in
> this series.
> 
> This was generated by claude-opus-4.7.

Maybe too much?  I got bored reading it and stopped, which is probably
not the best sign.

A few formatting-related comments inline.

Thanks,

J
> 
> Assisted-by: Claude:claude-opus-4.7
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  Documentation/driver-api/cxl/index.rst        |   1 +
>  .../cxl/linux/protocol-error-handling.rst     | 440 ++++++++++++++++++
>  2 files changed, 441 insertions(+)
>  create mode 100644 Documentation/driver-api/cxl/linux/protocol-error-handling.rst
> 
> diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
> index 3dfae1d310ca..6861b2e5726a 100644
> --- a/Documentation/driver-api/cxl/index.rst
> +++ b/Documentation/driver-api/cxl/index.rst
> @@ -42,6 +42,7 @@ that have impacts on each other.  The docs here break up configurations steps.
>     linux/dax-driver
>     linux/memory-hotplug
>     linux/access-coordinates
> +   linux/protocol-error-handling
>  
>  .. toctree::
>     :maxdepth: 2
> diff --git a/Documentation/driver-api/cxl/linux/protocol-error-handling.rst b/Documentation/driver-api/cxl/linux/protocol-error-handling.rst
> new file mode 100644
> index 000000000000..4d6f33f0ed31
> --- /dev/null
> +++ b/Documentation/driver-api/cxl/linux/protocol-error-handling.rst
> @@ -0,0 +1,440 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +==============================
> +CXL Protocol Error Handling
> +==============================
> +
> +This document describes how the kernel detects, classifies, dispatches,
> +logs, and recovers from CXL protocol errors signaled through the PCIe
> +Advanced Error Reporting (AER) interface. It covers both Virtual
> +Hierarchy (VH) topologies (Root Ports, Upstream/Downstream Switch
> +Ports, and Endpoints) and Restricted CXL Host (RCH) topologies
> +(Root Complex Event Collectors driving Restricted CXL Devices).

Odd drifting wrapping. I thought only humans did that. I guess it's common
enough in kernel docs that maybe it learned it!  Anyhow, I think Docs have
an 80 char limit, in which case something like:

This document describes how the kernel detects, classifies, dispatches, logs,
and recovers from CXL protocol errors signaled through the PCIe Advanced Error
Reporting (AER) interface. It covers both Virtual Hierarchy (VH) topologies
(Root Ports, Upstream/Downstream Switch Ports, and Endpoints) and Restricted
CXL Host (RCH) topologies (Root Complex Event Collectors driving Restricted
CXL Devices).

Maybe it was intentional to keep lines of similar length and the brackets on
the last one? I'm not sure.

> +
> +It is intended for kernel developers maintaining or extending
> +``drivers/pci/pcie/aer*.c``, ``drivers/cxl/core/ras.c``, and the
> +related plumbing in ``include/linux/aer.h``.
> +
> +
> +Background
> +==========
> +
> +A CXL device reports protocol-layer failures (CXL.cachemem RAS) as
> +PCIe AER **Internal Errors**: ``PCI_ERR_COR_INTERNAL`` for correctable
> +events and ``PCI_ERR_UNC_INTN`` for uncorrectable events. From the AER
> +core's point of view these look like ordinary PCIe AER messages, but
> +their semantics are CXL-specific: the actual fault information lives
> +in CXL RAS capability registers, not in the PCIe AER status registers.
> +
> +Historically, native CXL.cachemem RAS handling was implemented only
> +for CXL Endpoints and for RCH Downstream Ports. CXL Root Ports,
> +Upstream Switch Ports, and Downstream Switch Ports were not covered.
> +This left the kernel unable to log or react to protocol errors
> +signaled by switch components.

I'd drop the historical bit.  Not sure it adds value and these tend to
become stale (like all the 'New Courts' in my local Uni. Some of those are
500+ years old :)

> +
> +The unified CXL protocol error path closes that gap by routing every
> +CXL Internal Error through a single producer/consumer pipeline shared
> +by all CXL device types.

The unified CXL Protocol path routes every ...
(so no historical gap - as we don't care now you fixed it ;)

The same applies to some other parts - I might not have called them all out.

> +
> +
> +Architecture overview
> +=====================
> +
> +CXL protocol error handling is implemented as a distinct error plane
> +layered on top of the existing PCIe AER infrastructure. The two planes

(drop 'existing' - same 'why do we need the history' theme)

> +are kept separate:
> +
> +* The **PCIe AER plane** continues to handle native PCIe errors
"The **PCIe AER plane** handles native PCIe errors" (drop 'continues to')
> +  (Receiver overflows, malformed TLPs, completion timeouts, and so
> +  on). This is unchanged.
> +
> +* The **CXL protocol error plane** owns CXL Internal Errors. The AER
> +  core forwards them to ``cxl_core`` via a dedicated kfifo; ``cxl_core``
> +  then dispatches to CE/UE handlers and drives the recovery and
> +  panic policy.
> +
> +The boundary between the two planes is ``is_cxl_error()`` in

I think you can drop the `` and the automarkup.py magic in the kernel docs build
will turn that into :c:func:`is_cxl_error` or something along those lines, to
both pretty print it and hopefully link to the auto-built kernel-doc (assuming
we include it anywhere for cxl)


> +===============
> +
> +The diagram below shows the full path from an AER interrupt through
> +producer classification, kfifo handoff, and consumer dispatch.
> +
> +.. code-block:: text
> +
> +   +-------------------------------------------------------------------------+
> +   |                  CXL Internal Error Packet Flow                         |
> +   |    From PCIe AER Interrupt to CXL Protocol Error Handling and Logging   |
> +   +-------------------------------------------------------------------------+
> +
> +      CXL device (RP / USP / DSP / EP / RCD) raises AER Internal Error
> +      (correctable PCI_ERR_COR_INTERNAL or uncorrectable PCI_ERR_UNC_INTN)
> +                      |
> +                      v
> +      +-------------------------------------------------------------+
> +      |    PCIe Root Port AER MSI/MSI-X interrupt fires             |
> +      +-------------------------------------------------------------+
> +                      |
> +      ============= drivers/pci/pcie/aer.c (AER core) =============
> +                      |
> +                      v
> +           +---------------------------------+
> +           |  aer_irq()  /  aer_isr()        |  (top + threaded handler)
> +           +---------------------------------+
> +                      |
> +                      v
> +           +---------------------------------+
> +           |  aer_isr_one_error()            |
> +           |  aer_isr_one_error_type()       |
> +           +---------------------------------+
> +                      |
> +                      v
> +          +------------------------------------------+
> +          |  aer_get_device_error_info()             |
> +          |  - reads PCI_ERR_COR_STATUS              |
> +          |  - reads PCI_ERR_UNCOR_STATUS  (*if RP/  |
> +          |    RCEC/DSP, or non-fatal severity)      |
> +          |  - sets info->is_cxl = pcie_is_cxl(dev)  |
> +          +------------------------------------------+
> +                      |
> +                      v
> +           +---------------------------------+
> +           |  handle_error_source(dev, info) |
> +           +---------------------------------+
> +              |                          |
> +              |  is_cxl_error()          +--->  pci_aer_handle_error()
> +              |  (CXL device + Internal)        (native PCIe AER path,
> +              v                                  not covered here)
> +      +-------------------------------------------------------------+
> +      | Topology dispatch within AER core:                          |
> +      |                                                             |
> +      |   - VH topology  (RP / USP / DSP / EP)                      |
> +      |     -> drivers/pci/pcie/aer_cxl_vh.c                        |
> +      |                                                             |
> +      |   - RCH topology (RCEC iterates RCDs under it)              |
> +      |     -> drivers/pci/pcie/aer_cxl_rch.c                       |
> +      +-------------------------------------------------------------+
> +           |                                            |
> +           | VH path                            RCH path (RCEC AER)
> +           v                                            v
> +      ============= aer_cxl_vh.c (VH      ============= aer_cxl_rch.c (RCH
> +                    producer) =============              producer) ==========
> +           |                                            |
> +           v                                            v
> +      +-----------------------------+         +-------------------------------+
> +      | cxl_forward_error(pdev,info)|         | cxl_rch_handle_error_iter()   |
> +      |  - if AER_CORRECTABLE:      |         |  - iterate each RCD pdev      |
> +      |     clear PCI_ERR_COR_STATUS|         |    beneath the RCEC           |
> +      |  - pci_dev_get(pdev)        |         |  - call cxl_forward_error()   |
> +      |  - build cxl_proto_err_     |         |    for each RCD               |
> +      |    work_data                |         |    (same producer helper as   |
> +      |    { pdev, severity }       |         |     the VH path uses)         |
> +      |  - kfifo_in_spinlocked(...) |         +-------------------------------+
> +      |  - schedule_work(...)       |                       |
> +      +-----------------------------+                       |
> +              |                                             |
> +              +-----------------+---------------------------+
> +                                |
> +                                v
> +                    +--------------------------+
> +                    |     AER-CXL kfifo        |
> +                    |     (work_struct)        |
> +                    +--------------------------+
> +                                |
> +                                v
> +      ============= drivers/cxl/core/ras.c (consumer worker) =======
> +                                |
> +                                v
> +      +-------------------------------------------------------------+
> +      | cxl_proto_err_work_fn() (workqueue handler)                 |
> +      |   for_each_cxl_proto_err(&wd, __cxl_proto_err_work_fn)      |
> +      +-------------------------------------------------------------+
> +                      |
> +                      v
> +      +-------------------------------------------------------------+
> +      | __cxl_proto_err_work_fn(wd)                                 |
> +      |   port = find_cxl_port_by_dev(&pdev->dev, &dport)           |
> +      |   cxl_handle_proto_error(pdev, port, dport, severity)       |
> +      |   pci_dev_put(pdev)                                         |
> +      +-------------------------------------------------------------+
> +                      |
> +                      v
> +      +-------------------------------------------------------------+
> +      | cxl_handle_proto_error()                                    |
> +      +-------------------------------------------------------------+
> +           |                                            |
> +      pci_pcie_type ==                          pci_pcie_type !=
> +      PCI_EXP_TYPE_RC_END                       PCI_EXP_TYPE_RC_END
> +      (RCD Endpoint)                            (VH: RP/USP/DSP/EP)
> +           |                                            |
> +           v                                            |
> +      +-------------------------------------+           |
> +      | cxl_handle_rdport_errors(pdev)      |           |
> +      |   - process RCH Downstream Port's   |           |
> +      |     RAS register block first        |           |
> +      |   - cxl_handle_cor_ras() for CE     |           |
> +      |   - cxl_handle_ras() for UE         |           |
> +      |     (log only; does NOT panic)      |           |
> +      +-------------------------------------+           |
> +           |                                            |
> +           +--------------------+-----------------------+
> +                                |
> +                                v
> +                   +-----------------------------+
> +                   | severity == AER_CORRECTABLE |
> +                   +-----------------------------+
> +                         |                  |
> +                         yes                no
> +                         v                  v
> +            +----------------------+   +-------------------------+
> +            | cxl_handle_cor_ras() |   | cxl_do_recovery()       |
> +            |  - emit cxl_aer_     |   | (described below)       |
> +            |    correctable_      |   +-------------------------+
> +            |    error trace       |
> +            | pcie_clear_device_   |
> +            |   status()           |
> +            +----------------------+
> +
> +                    +-------------------------------+
> +                    | cxl_do_recovery()             |
> +                    |  if pci_dev_is_disconnected:  |
> +                    |    panic("CXL cachemem err.") |
> +                    |                               |
> +                    |  ue = cxl_handle_ras()        |
> +                    |    -> emit                    |
> +                    |       cxl_aer_uncorrectable_  |
> +                    |       error trace event       |
> +                    |                               |
> +                    |  if (ue):                     |
> +                    |    panic("CXL cachemem err.") |
> +                    |                               |
> +                    |  pcie_clear_device_status()   |
> +                    |  pci_aer_clear_nonfatal_status|
> +                    |  pci_aer_clear_fatal_status   |
> +                    +-------------------------------+

Pretty diagram but maybe far too much given we have the code?
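
If the aim is only to convey the shape of the flow, something much shorter
might do. Below is a rough userspace sketch of what the diagram describes:
classify at the plane boundary, queue a small work item, then drain and
dispatch on severity. It is not the kernel code. Every sketch_* name, the
ROUTE_* values and the fixed-size ring (standing in for the kfifo) are made up
for illustration; only the two PCI_ERR_* bit values are taken from
include/uapi/linux/pci_regs.h. Locking, the workqueue, pci_dev refcounting,
the RAS register reads and the panic policy are all left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bit values as in include/uapi/linux/pci_regs.h. */
#define PCI_ERR_COR_INTERNAL	0x00004000u	/* Corrected Internal Error */
#define PCI_ERR_UNC_INTN	0x00400000u	/* Uncorrectable Internal Error */

/* Everything below is a hypothetical stand-in, not a kernel type. */
enum sketch_sev { SKETCH_CORRECTABLE, SKETCH_UNCORRECTABLE };
enum sketch_route { ROUTE_PCIE_AER, ROUTE_CXL, ROUTE_DROPPED };

struct sketch_err_info {
	bool is_cxl;		/* diagram: info->is_cxl = pcie_is_cxl(dev) */
	enum sketch_sev severity;
	uint32_t status;	/* AER status word for that severity */
};

struct sketch_work {		/* diagram: { pdev, severity } */
	int dev_id;
	enum sketch_sev severity;
};

struct sketch_counts { unsigned int ce, ue; };

#define SKETCH_FIFO_SIZE 8u	/* power of two so index wrap is a mask */
static struct sketch_work sketch_fifo[SKETCH_FIFO_SIZE];
static unsigned int sketch_head, sketch_tail;

/* Plane boundary: a CXL device reporting an Internal Error. */
bool sketch_is_cxl_error(const struct sketch_err_info *info)
{
	uint32_t bit;

	assert(info != NULL);
	bit = info->severity == SKETCH_CORRECTABLE ?
	      PCI_ERR_COR_INTERNAL : PCI_ERR_UNC_INTN;
	return info->is_cxl && (info->status & bit) != 0;
}

/* Producer side: classify, then queue for the consumer or leave to PCIe AER. */
enum sketch_route sketch_handle_error_source(int dev_id,
					     const struct sketch_err_info *info)
{
	struct sketch_work *slot;

	if (!sketch_is_cxl_error(info))
		return ROUTE_PCIE_AER;
	if (sketch_head - sketch_tail == SKETCH_FIFO_SIZE)
		return ROUTE_DROPPED;	/* queue full */

	slot = &sketch_fifo[sketch_head++ & (SKETCH_FIFO_SIZE - 1)];
	slot->dev_id = dev_id;
	slot->severity = info->severity;
	return ROUTE_CXL;
}

/* Consumer side: drain the queue and dispatch on severity. */
struct sketch_counts sketch_drain(void)
{
	struct sketch_counts n = { 0, 0 };

	while (sketch_tail != sketch_head) {
		const struct sketch_work *wd =
			&sketch_fifo[sketch_tail++ & (SKETCH_FIFO_SIZE - 1)];

		if (wd->severity == SKETCH_CORRECTABLE)
			n.ce++;	/* diagram: cxl_handle_cor_ras() */
		else
			n.ue++;	/* diagram: cxl_do_recovery() */
	}
	return n;
}
```

Queueing one correctable and one uncorrectable Internal Error from CXL
devices and then calling sketch_drain() gives one CE and one UE dispatch;
anything from a non-CXL device, or without the Internal Error bit set, stays
on the PCIe AER side.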

> +
> +
> +Severity policy
> +===============
> +
> +The kernel's response to a CXL protocol error depends on the AER
> +severity reported by the device and on the result of inspecting the
> +CXL RAS registers.
> +

