Linux PCI subsystem development
From: Jonathan Cameron <jic23@kernel.org>
To: Terry Bowman <terry.bowman@amd.com>
Cc: <dave@stgolabs.net>, <dave.jiang@intel.com>,
	<alison.schofield@intel.com>, <djbw@kernel.org>,
	<bhelgaas@google.com>, <shiju.jose@huawei.com>,
	<ming.li@zohomail.com>, <Smita.KoralahalliChannabasappa@amd.com>,
	<rrichter@amd.com>, <dan.carpenter@linaro.org>,
	<PradeepVineshReddy.Kodamati@amd.com>, <lukas@wunner.de>,
	<Benjamin.Cheatham@amd.com>,
	<sathyanarayanan.kuppuswamy@linux.intel.com>,
	<vishal.l.verma@intel.com>, <alucerop@amd.com>,
	<ira.weiny@intel.com>, <corbet@lwn.net>, <rafael@kernel.org>,
	<xueshuai@linux.alibaba.com>, <linux-cxl@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>, <linux-pci@vger.kernel.org>,
	<linux-acpi@vger.kernel.org>, <linux-doc@vger.kernel.org>
Subject: Re: [PATCH v17 06/11] PCI: Establish common CXL Port protocol error flow
Date: Thu, 7 May 2026 19:22:10 +0100
Message-ID: <20260507192210.766d54fd@jic23-huawei>
In-Reply-To: <20260505173029.2718246-7-terry.bowman@amd.com>

On Tue, 5 May 2026 12:30:24 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> Add CXL Port protocol error handling callbacks to unify detection,
> logging, and recovery across CXL Ports and Endpoints. Establish a
> common flow for correctable and uncorrectable CXL protocol errors.
> RCH Downstream Port error handling is added in a following patch.
> 
> Add cxl_handle_proto_error() to dispatch correctable and uncorrectable
> errors through the CXL RAS helpers. Add cxl_do_recovery() to coordinate
> uncorrectable recovery. Panic via panic() on any uncorrectable CXL RAS
> error. CXL.cachemem traffic cannot be safely recovered from an
> uncorrectable protocol error in software, so panic regardless of the
> AER severity reported. Gate error handling on the port driver being
> bound to avoid processing errors on disabled devices.
> 
> Panic explicitly on pci_dev_is_disconnected() before accessing the RAS
> registers. A CXL device disconnecting during an uncorrectable error event
> is itself unrecoverable, particularly for devices in interleaved HDM
> regions. Relying on the status readl() returning ~0u to trip the existing
> panic path leaves the cause ambiguous.
> 
> The panic policy applies to the RAS register block of the device whose
> error triggered the recovery: Root/Downstream Port RAS for VH Ports,
> Endpoint Port RAS for VH Endpoints and RCDs. Upstream RCH Downstream
> Port RAS UEs handled via cxl_handle_rdport_errors() are logged only, as
> before this series. Only the RCD Endpoint's own RAS UE drives the panic.
> 
> Add to_ras_base() to centralize the RAS base lookup. It selects
> dport->regs.ras for Root/Downstream Ports and port->regs.ras for
> Upstream Ports and Endpoints.
> 
> Export pcie_clear_device_status() and pci_aer_clear_fatal_status() so
> cxl_core can clear PCIe/AER state during recovery.
> 
> Wire the AER core to the kfifo in this commit by adding the
> is_cxl_error() switch in handle_error_source() alongside the consumer
> registration. This way the producer and consumer go live in the same
> commit, so CXL errors are not silently dropped during bisect.
> 
> The correctable AER status is cleared by the producer in
> cxl_forward_error().
> 
> Co-developed-by: Dan Williams <djbw@kernel.org>
> Signed-off-by: Dan Williams <djbw@kernel.org>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> 
A few trivial things inline. With those tidied up:
Reviewed-by: Jonathan Cameron <jic23@kernel.org>

> + * find_cxl_port_by_dev - Use @dev as hint to do a _by_dport or _by_uport lookup
> + * @dev: generic device that may either be a companion of port or target dport
> + * @dport: output parameter; set to the matched dport for dport-class
> + * lookups (Root Port, Downstream Port), NULL otherwise.
> + *
> + * Return a 'struct cxl_port' with an elevated reference if found. Use
> + * __free(put_cxl_port) to release.
> + */
> +static struct cxl_port *find_cxl_port_by_dev(struct device *dev, struct cxl_dport **dport)
> +{
> +	struct pci_dev *pdev;
> +
> +	*dport = NULL;
> +	if (!dev_is_pci(dev))
> +		return NULL;
> +
> +	pdev = to_pci_dev(dev);

pdev is only used once, so there is little point in this local variable...

> +
> +	switch (pci_pcie_type(pdev)) {
	switch (pci_pcie_type(to_pci_dev(dev))) {

looks readable enough to me.
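
To make the suggestion concrete, here is a minimal userspace sketch of the
function with the temporary dropped. Everything except the body of
find_cxl_port_by_dev() is a stand-in invented for illustration: the struct
layouts, the enum values, dev_is_pci(), pci_pcie_type(), and the two mock
lookups are assumptions, not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for kernel types and constants; values are placeholders. */
enum {
	PCI_EXP_TYPE_ENDPOINT,
	PCI_EXP_TYPE_ROOT_PORT,
	PCI_EXP_TYPE_UPSTREAM,
	PCI_EXP_TYPE_DOWNSTREAM,
	PCI_EXP_TYPE_RC_END,
};

struct device { int is_pci; };
struct pci_dev { int pcie_type; struct device dev; };
struct cxl_dport { int id; };
struct cxl_port { int id; };

/* container_of-style conversion, mirroring what to_pci_dev() does. */
#define to_pci_dev(d) \
	((struct pci_dev *)((char *)(d) - offsetof(struct pci_dev, dev)))

static int dev_is_pci(const struct device *dev) { return dev->is_pci; }
static int pci_pcie_type(const struct pci_dev *pdev) { return pdev->pcie_type; }

/* Mock lookups: always "find" the same port, and dport where relevant. */
static struct cxl_port mock_port = { .id = 1 };
static struct cxl_dport mock_dport = { .id = 2 };

static struct cxl_port *find_cxl_port_by_dport(struct device *dev,
					       struct cxl_dport **dport)
{
	(void)dev;
	*dport = &mock_dport;
	return &mock_port;
}

static struct cxl_port *find_cxl_port_by_uport(struct device *dev)
{
	(void)dev;
	return &mock_port;
}

/* Same logic as the quoted hunk, with the conversion folded into the switch. */
static struct cxl_port *find_cxl_port_by_dev(struct device *dev,
					     struct cxl_dport **dport)
{
	assert(dev != NULL && dport != NULL);

	*dport = NULL;
	if (!dev_is_pci(dev))
		return NULL;

	switch (pci_pcie_type(to_pci_dev(dev))) {
	case PCI_EXP_TYPE_ROOT_PORT:
	case PCI_EXP_TYPE_DOWNSTREAM:
		return find_cxl_port_by_dport(dev, dport);
	case PCI_EXP_TYPE_UPSTREAM:
	case PCI_EXP_TYPE_ENDPOINT:
	case PCI_EXP_TYPE_RC_END:
		return find_cxl_port_by_uport(dev);
	}

	return NULL;
}
```

The dev_is_pci() check still comes before the conversion, so dropping the
local does not change when to_pci_dev() is evaluated.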

> +	case PCI_EXP_TYPE_ROOT_PORT:
> +	case PCI_EXP_TYPE_DOWNSTREAM:
> +		return find_cxl_port_by_dport(dev, dport);
> +	case PCI_EXP_TYPE_UPSTREAM:
> +	case PCI_EXP_TYPE_ENDPOINT:
> +	case PCI_EXP_TYPE_RC_END:
> +		return find_cxl_port_by_uport(dev);
> +	}
> +
> +	return NULL;
> +}

> +
> +static void cxl_do_recovery(struct pci_dev *pdev, struct cxl_port *port, struct cxl_dport *dport)
> +{
> +	struct device *dev = &pdev->dev;
> +	bool ue;
> +
> +	if (pci_dev_is_disconnected(pdev))
> +		panic("CXL cachemem error: device disconnected during UE recovery");
> +
> +	ue = cxl_handle_ras(dev, pci_get_dsn(pdev),
> +			    to_ras_base(port, dport));

I haven't checked (lazy, or maybe busy), but if this call stays the same
for the rest of the series, it fits on one line of around 78 chars.
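
For reference, the width can be checked mechanically. The helper below is a
small userspace sketch, not kernel code: visual_width() and joined_call are
names made up here, and the 8-column tab stop is the usual kernel coding
style assumption. With one leading tab, the joined call comes to 78 columns.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Visual width of a single source line, expanding each tab to the next
 * 8-column tab stop. Stops at the end of the string or the first newline.
 */
static size_t visual_width(const char *line)
{
	size_t col = 0;

	assert(line != NULL);

	for (; *line != '\0' && *line != '\n'; line++) {
		if (*line == '\t')
			col += 8 - (col % 8);
		else
			col++;
	}

	return col;
}

/* The two quoted lines joined into one, at one level of indentation. */
static const char joined_call[] =
	"\tue = cxl_handle_ras(dev, pci_get_dsn(pdev), to_ras_base(port, dport));";
```

Calling visual_width(joined_call) gives the column count to compare against
the 80-column limit.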

> +	if (ue)
> +		panic("CXL cachemem error.");
> +
> +	pcie_clear_device_status(pdev);
> +	pci_aer_clear_nonfatal_status(pdev);
> +	pci_aer_clear_fatal_status(pdev);
> +}

> +int cxl_ras_init(void)
> +{
> +	cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> +	cxl_register_proto_err_work(&cxl_proto_err_work);
> +
> +	return 0;

Making this void cxl_ras_init(), as per the earlier suggestion, still looks good ;)
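
Roughly what that would look like, as a userspace sketch. The work_struct
stand-in, the counter, and the void-returning mock registration helpers are
assumptions for illustration only; the quoted hunk does not check any return
value from them, which is why the int return carries no information.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel's struct work_struct. */
struct work_struct { int unused; };

/* Counts mock registrations so the effect of init is observable. */
static int registered;

/* Mock registration helpers; the real ones live in the CXL/AER code. */
static void cxl_cper_register_prot_err_work(struct work_struct *work)
{
	assert(work != NULL);
	registered++;
}

static void cxl_register_proto_err_work(struct work_struct *work)
{
	assert(work != NULL);
	registered++;
}

static struct work_struct cxl_cper_prot_err_work;
static struct work_struct cxl_proto_err_work;

/* Nothing here can fail, so there is nothing useful to return. */
static void cxl_ras_init(void)
{
	cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
	cxl_register_proto_err_work(&cxl_proto_err_work);
}
```

Any caller that currently tests the return value would need its error
handling dropped at the same time.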

> +}
> +
> +void cxl_ras_exit(void)
> +{
> +	cxl_cper_unregister_prot_err_work();
> +	cxl_unregister_proto_err_work();
> +}

