From: Jonathan Cameron <jic23@kernel.org>
To: Terry Bowman <terry.bowman@amd.com>
Cc: <dave@stgolabs.net>, <dave.jiang@intel.com>,
<alison.schofield@intel.com>, <djbw@kernel.org>,
<bhelgaas@google.com>, <shiju.jose@huawei.com>,
<ming.li@zohomail.com>, <Smita.KoralahalliChannabasappa@amd.com>,
<rrichter@amd.com>, <dan.carpenter@linaro.org>,
<PradeepVineshReddy.Kodamati@amd.com>, <lukas@wunner.de>,
<Benjamin.Cheatham@amd.com>,
<sathyanarayanan.kuppuswamy@linux.intel.com>,
<vishal.l.verma@intel.com>, <alucerop@amd.com>,
<ira.weiny@intel.com>, <corbet@lwn.net>, <rafael@kernel.org>,
<xueshuai@linux.alibaba.com>, <linux-cxl@vger.kernel.org>,
<linux-kernel@vger.kernel.org>, <linux-pci@vger.kernel.org>,
<linux-acpi@vger.kernel.org>, <linux-doc@vger.kernel.org>
Subject: Re: [PATCH v17 06/11] PCI: Establish common CXL Port protocol error flow
Date: Thu, 7 May 2026 19:22:10 +0100
Message-ID: <20260507192210.766d54fd@jic23-huawei>
In-Reply-To: <20260505173029.2718246-7-terry.bowman@amd.com>
On Tue, 5 May 2026 12:30:24 -0500
Terry Bowman <terry.bowman@amd.com> wrote:
> Add CXL Port protocol error handling callbacks to unify detection,
> logging, and recovery across CXL Ports and Endpoints. Establish a
> common flow for correctable and uncorrectable CXL protocol errors.
> RCH Downstream Port error handling is added in a following patch.
>
> Add cxl_handle_proto_error() to dispatch correctable and uncorrectable
> errors through the CXL RAS helpers. Add cxl_do_recovery() to coordinate
> uncorrectable recovery. Panic via panic() on any uncorrectable CXL RAS
> error. CXL.cachemem traffic cannot be safely recovered from an
> uncorrectable protocol error in software, so panic regardless of the
> AER severity reported. Gate error handling on the port driver being
> bound to avoid processing errors on disabled devices.
>
> Panic explicitly on pci_dev_is_disconnected() before accessing the RAS
> registers. A CXL device disconnecting during an uncorrectable error event
> is itself unrecoverable, particularly for devices in interleaved HDM
> regions. Relying on the status readl() returning ~0u to trip the existing
> panic path leaves the cause ambiguous.
>
> The panic policy applies to the RAS register block of the device whose
> error triggered the recovery: Root/Downstream Port RAS for VH Ports,
> Endpoint Port RAS for VH Endpoints and RCDs. Upstream RCH Downstream
> Port RAS UEs handled via cxl_handle_rdport_errors() are logged only, as
> before this series. Only the RCD Endpoint's own RAS UE drives the panic.
>
> Add to_ras_base() to centralize the RAS base lookup. It selects
> dport->regs.ras for Root/Downstream Ports and port->regs.ras for
> Upstream Ports and Endpoints.
>
> Export pcie_clear_device_status() and pci_aer_clear_fatal_status() so
> cxl_core can clear PCIe/AER state during recovery.
>
> Wire the AER core to the kfifo in this commit by adding the
> is_cxl_error() switch in handle_error_source() alongside the consumer
> registration. This way the producer and consumer go live in the same
> commit, so CXL errors are not silently dropped during bisect.
>
> The correctable AER status is cleared by the producer in
> cxl_forward_error().
>
> Co-developed-by: Dan Williams <djbw@kernel.org>
> Signed-off-by: Dan Williams <djbw@kernel.org>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>
A few trivial things inline. With those tidied up
Reviewed-by: Jonathan Cameron <jic23@kernel.org>
> + * find_cxl_port_by_dev - Use @dev as hint to do a _by_dport or _by_uport lookup
> + * @dev: generic device that may either be a companion of port or target dport
> + * @dport: output parameter; set to the matched dport for dport-class
> + * lookups (Root Port, Downstream Port), NULL otherwise.
> + *
> + * Return a 'struct cxl_port' with an elevated reference if found. Use
> + * __free(put_cxl_port) to release.
> + */
> +static struct cxl_port *find_cxl_port_by_dev(struct device *dev, struct cxl_dport **dport)
> +{
> + struct pci_dev *pdev;
> +
> + *dport = NULL;
> + if (!dev_is_pci(dev))
> + return NULL;
> +
> + pdev = to_pci_dev(dev);
pdev is only used once, so there is little point in the local variable...
> +
> + switch (pci_pcie_type(pdev)) {
switch (pci_pcie_type(to_pci_dev(dev))) {
looks readable enough to me.
> + case PCI_EXP_TYPE_ROOT_PORT:
> + case PCI_EXP_TYPE_DOWNSTREAM:
> + return find_cxl_port_by_dport(dev, dport);
> + case PCI_EXP_TYPE_UPSTREAM:
> + case PCI_EXP_TYPE_ENDPOINT:
> + case PCI_EXP_TYPE_RC_END:
> + return find_cxl_port_by_uport(dev);
> + }
> +
> + return NULL;
> +}
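
To illustrate the suggestion above, here is a minimal, self-contained sketch
of the same dispatch with the temporary dropped. Everything prefixed mock_ or
MOCK_ is a hypothetical stand-in invented for this sketch (the real types and
helpers live in the kernel and are not reproduced here); only the shape of
the switch is taken from the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the PCIe port types; values are arbitrary. */
enum { MOCK_ENDPOINT, MOCK_ROOT_PORT, MOCK_UPSTREAM, MOCK_DOWNSTREAM,
       MOCK_RC_END, MOCK_OTHER };

/* Hypothetical stand-ins for the kernel structures. */
struct mock_device { int is_pci; };
struct mock_pci_dev { int pcie_type; struct mock_device dev; };
struct mock_cxl_dport { int id; };
struct mock_cxl_port { int id; };

static struct mock_cxl_port mock_port = { 1 };
static struct mock_cxl_dport mock_dport = { 2 };

/* Models to_pci_dev(): recover the containing PCI device. */
static struct mock_pci_dev *mock_to_pci_dev(struct mock_device *dev)
{
	return (struct mock_pci_dev *)((char *)dev -
				       offsetof(struct mock_pci_dev, dev));
}

/* Models pci_pcie_type(). */
static int mock_pcie_type(const struct mock_pci_dev *pdev)
{
	return pdev->pcie_type;
}

/* Models the _by_dport lookup: always matches and reports the dport. */
static struct mock_cxl_port *mock_find_by_dport(struct mock_device *dev,
						struct mock_cxl_dport **dport)
{
	(void)dev;
	*dport = &mock_dport;
	return &mock_port;
}

/* Models the _by_uport lookup: always matches, no dport involved. */
static struct mock_cxl_port *mock_find_by_uport(struct mock_device *dev)
{
	(void)dev;
	return &mock_port;
}

static struct mock_cxl_port *mock_find_port_by_dev(struct mock_device *dev,
						   struct mock_cxl_dport **dport)
{
	assert(dport != NULL);	/* output parameter must be provided */

	*dport = NULL;
	if (!dev->is_pci)
		return NULL;

	/* No local pdev: the conversion is folded into the switch. */
	switch (mock_pcie_type(mock_to_pci_dev(dev))) {
	case MOCK_ROOT_PORT:
	case MOCK_DOWNSTREAM:
		return mock_find_by_dport(dev, dport);
	case MOCK_UPSTREAM:
	case MOCK_ENDPOINT:
	case MOCK_RC_END:
		return mock_find_by_uport(dev);
	}

	return NULL;
}
```

The behaviour is unchanged from the quoted version: dport-class devices set
*dport, uport-class devices leave it NULL, and anything else returns NULL.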
> +
> +static void cxl_do_recovery(struct pci_dev *pdev, struct cxl_port *port, struct cxl_dport *dport)
> +{
> + struct device *dev = &pdev->dev;
> + bool ue;
> +
> + if (pci_dev_is_disconnected(pdev))
> + panic("CXL cachemem error: device disconnected during UE recovery");
> +
> + ue = cxl_handle_ras(dev, pci_get_dsn(pdev),
> + to_ras_base(port, dport));
My lazy (or maybe busy) nature means I haven't checked, but if this remains
the same for the rest of the series, it fits on one line of around 78 chars.
> + if (ue)
> + panic("CXL cachemem error.");
> +
> + pcie_clear_device_status(pdev);
> + pci_aer_clear_nonfatal_status(pdev);
> + pci_aer_clear_fatal_status(pdev);
> +}
> +int cxl_ras_init(void)
> +{
> + cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> + cxl_register_proto_err_work(&cxl_proto_err_work);
> +
> + return 0;
void cxl_ras_init() as per earlier suggestion still looks good ;)
> +}
> +
> +void cxl_ras_exit(void)
> +{
> + cxl_cper_unregister_prot_err_work();
> + cxl_unregister_proto_err_work();
> +}