Linux PCI subsystem development
From: sashiko-bot@kernel.org
To: "Terry Bowman" <terry.bowman@amd.com>
Cc: linux-pci@vger.kernel.org
Subject: Re: [PATCH v17 06/11] PCI: Establish common CXL Port protocol error flow
Date: Tue, 05 May 2026 22:28:52 +0000
Message-ID: <20260505222853.00913C2BCB4@smtp.kernel.org>
In-Reply-To: <20260505173029.2718246-7-terry.bowman@amd.com>

Thank you for your contribution! Sashiko AI review found 3 potential issues to consider:
- [High] Errors that mix a CXL Internal Error with a standard AER uncorrectable error, and Fatal errors, bypass standard PCIe error recovery.
- [High] `cxl_forward_error()` unconditionally writes to `pdev->aer_cap + PCI_ERR_COR_STATUS`, which can corrupt BAR0 if `aer_cap` is 0.
- [Medium] AER Uncorrectable Error status is left uncleared if the CXL port driver is unbound.
--

commit 40776581efad3966047d440ec936219af21ff33c
Author: Terry Bowman <terry.bowman@amd.com>

PCI: Establish common CXL Port protocol error flow

This commit adds CXL Port protocol error handling callbacks to unify
detection, logging, and recovery across CXL Ports and Endpoints. It also
routes CXL protocol errors through the CXL RAS helpers and wires the AER
core to a kfifo for error processing.

> The correctable AER status is cleared by the producer in
> cxl_forward_error().

The patch 'PCI/CXL: Add RCH support to CXL handlers' updates
cxl_rch_handle_error_iter() to call cxl_forward_error() directly for
RCD Endpoints (RCiEPs).

If cxl_forward_error() unconditionally writes to
pdev->aer_cap + PCI_ERR_COR_STATUS to clear the correctable AER status,
what happens if an RCD Endpoint does not expose an AER capability to the
OS and pdev->aer_cap is 0?

Would writing to 0 + PCI_ERR_COR_STATUS (offset 0x10) inadvertently
overwrite Base Address Register 0 (BAR0), which sits at the same config
space offset, and corrupt the device's memory mapping?

The original pci_aer_handle_error() explicitly guarded this write by
checking that the AER capability offset is non-zero.
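
The guard can be illustrated outside the kernel. What follows is a minimal
user-space sketch, not the kernel code: struct mock_dev,
mock_write_config_dword(), mock_read_config_dword() and clear_cor_status()
are hypothetical names, and a byte array stands in for config space. Only
the two offsets (both 0x10) are taken from pci_regs.h. The mock stores the
written value as-is and does not model write-1-to-clear semantics.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PCI_BASE_ADDRESS_0	0x10	/* BAR0 in the standard config header */
#define PCI_ERR_COR_STATUS	0x10	/* offset within the AER capability */
#define CFG_SPACE_SIZE		4096

/* Hypothetical stand-in for struct pci_dev: only what the sketch needs. */
struct mock_dev {
	uint8_t cfg[CFG_SPACE_SIZE];	/* simulated config space */
	uint16_t aer_cap;		/* 0 means no AER capability was found */
};

static void mock_write_config_dword(struct mock_dev *dev, int where,
				    uint32_t val)
{
	assert(where >= 0 && where + 4 <= CFG_SPACE_SIZE);
	memcpy(&dev->cfg[where], &val, sizeof(val));
}

static uint32_t mock_read_config_dword(const struct mock_dev *dev, int where)
{
	uint32_t val;

	assert(where >= 0 && where + 4 <= CFG_SPACE_SIZE);
	memcpy(&val, &dev->cfg[where], sizeof(val));
	return val;
}

/*
 * Write the status bits back only when an AER capability exists.
 * Returns 1 if the write was issued, 0 if it was skipped.
 */
static int clear_cor_status(struct mock_dev *dev, uint32_t status)
{
	if (!dev->aer_cap)
		return 0;	/* 0 + PCI_ERR_COR_STATUS would alias BAR0 */

	mock_write_config_dword(dev, dev->aer_cap + PCI_ERR_COR_STATUS, status);
	return 1;
}
```

With aer_cap == 0 the helper skips the write and BAR0 keeps its value; with
a non-zero aer_cap (for example 0x100) the write lands inside the AER
capability instead.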

> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c

[ ... ]

> @@ -1186,7 +1187,10 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>  {
>  	cxl_rch_handle_error(dev, info);
> -	pci_aer_handle_error(dev, info);
> +	if (is_cxl_error(dev, info))
> +		cxl_forward_error(dev, info);
> +	else
> +		pci_aer_handle_error(dev, info);
>  	pci_dev_put(dev);
>  }

Since is_cxl_error() returns true when the CXL Internal Error bit is set,
does this bypass standard PCIe error recovery if a device experiences a
concurrent CXL Internal Error and a standard PCIe Uncorrectable Error
(such as a Data Link Protocol Error)?

Additionally, if the error is fatal, the PCIe link is frozen. Standard
handling in pci_aer_handle_error() would normally call pcie_do_recovery()
to perform a link reset.

Because cxl_do_recovery() only clears the AER status and returns without
resetting the link, could the device be left permanently frozen?
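
To make the concern concrete, here is a small user-space sketch of one
possible routing policy. It is an assumption for discussion, not what the
patch does: enum mock_severity, enum mock_route and route_uncorrectable()
are hypothetical names. Only the two status bit values come from
pci_regs.h. The idea is that the CXL path is taken for the Internal Error
bit, while any other uncorrectable bit, or Fatal severity, still requests
standard PCIe recovery.

```c
#include <assert.h>
#include <stdint.h>

#define PCI_ERR_UNC_DLP		0x00000010	/* Data Link Protocol Error */
#define PCI_ERR_UNC_INTN	0x00400000	/* Uncorrectable Internal Error */

/* Hypothetical severity and routing types for this sketch only. */
enum mock_severity { MOCK_CORRECTABLE, MOCK_NONFATAL, MOCK_FATAL };

enum mock_route {
	ROUTE_PCIE_ONLY,	/* standard AER recovery */
	ROUTE_CXL_ONLY,		/* CXL protocol error handling */
	ROUTE_CXL_AND_PCIE,	/* both: CXL logging plus PCIe recovery */
};

/* Decide where an uncorrectable error should go, based on its status bits. */
static enum mock_route route_uncorrectable(uint32_t uncor_status,
					   enum mock_severity sev)
{
	uint32_t internal = uncor_status & PCI_ERR_UNC_INTN;
	uint32_t other = uncor_status & ~(uint32_t)PCI_ERR_UNC_INTN;

	assert(sev != MOCK_CORRECTABLE);	/* uncorrectable errors only */

	if (!internal)
		return ROUTE_PCIE_ONLY;

	/* Concurrent non-CXL bits or a Fatal error still need PCIe recovery. */
	if (other || sev == MOCK_FATAL)
		return ROUTE_CXL_AND_PCIE;

	return ROUTE_CXL_ONLY;
}
```

Under this policy a status of PCI_ERR_UNC_INTN | PCI_ERR_UNC_DLP is never
treated as CXL-only, which is the case the question above is asking about.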

> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c

[ ... ]

> +static int __cxl_proto_err_work_fn(struct cxl_proto_err_work_data *wd)
> +{
> +	struct cxl_dport *dport;
> +	struct cxl_port *port __free(put_cxl_port) =
> +		find_cxl_port_by_dev(&wd->pdev->dev, &dport);
> +
> +	if (!port) {
> +		dev_err_ratelimited(&wd->pdev->dev,
> +				    "Failed to find parent port device in CXL topology\n");
> +		return 0;
> +	}
> +
> +	/*
> +	 * Hold the port device lock and verify a driver is bound before
> +	 * handling errors. Protects against NULL deref if an error is
> +	 * dispatched before probe completion or after driver removal.
> +	 */
> +	guard(device)(&port->dev);
> +	if (!port->dev.driver) {
> +		dev_err_ratelimited(&port->dev,
> +				    "Port device is unbound, abort error handling\n");
> +		return 0;
> +	}
> +
> +	cxl_handle_proto_error(wd->pdev, port, dport, wd->severity);
> +
> +	return 0;
> +}

If the port is not found or the driver is unbound, this function aborts
and returns early before calling cxl_handle_proto_error().

Since handle_error_source() already bypassed the standard PCIe error
recovery, does returning early here leave the AER Uncorrectable Error
status uncleared on the device?

Standard PCIe handling normally ensures these registers are cleared even
if no driver is bound. Could leaving them uncleared block future error
reporting?
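
One way to avoid that, sketched in user space under stated assumptions:
struct mock_port, struct mock_pdev, mock_rw1c_write() and
mock_proto_err_work() are hypothetical names, and the status register is
modelled as a plain write-1-to-clear word. The sketch only shows the shape
of the control flow, where the status is cleared on every exit path and the
protocol handler runs only when a bound port exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the CXL port and the PCI device. */
struct mock_port {
	bool driver_bound;
};

struct mock_pdev {
	uint32_t uncor_status;	/* models a write-1-to-clear register */
	int handled;		/* number of protocol-handler invocations */
};

/* Write-1-to-clear: every bit set in val is cleared in the register. */
static void mock_rw1c_write(uint32_t *reg, uint32_t val)
{
	*reg &= ~val;
}

/* port may be NULL, which models "port not found in the CXL topology". */
static int mock_proto_err_work(struct mock_pdev *pdev, struct mock_port *port)
{
	uint32_t status;

	assert(pdev);
	status = pdev->uncor_status;

	if (port && port->driver_bound)
		pdev->handled++;	/* stand-in for the protocol handler */

	/* Reached on every path, including missing or unbound ports. */
	mock_rw1c_write(&pdev->uncor_status, status);
	return 0;
}
```

Calling mock_proto_err_work() with a NULL or unbound port leaves handled
untouched but still clears uncor_status, so a later error would not be
masked by stale bits.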

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260505173029.2718246-1-terry.bowman@amd.com?part=6

Thread overview: 40+ messages
2026-05-05 17:30 [PATCH v17 00/11] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
2026-05-05 17:30 ` [PATCH v17 01/11] PCI/AER: Introduce AER-CXL Kfifo Terry Bowman
2026-05-05 20:26   ` sashiko-bot
2026-05-05 21:17   ` Dave Jiang
2026-05-07 17:53   ` Jonathan Cameron
2026-05-07 18:26     ` Bowman, Terry
2026-05-05 17:30 ` [PATCH v17 02/11] cxl/ras: Unify Endpoint and Port AER trace events Terry Bowman
2026-05-05 21:07   ` sashiko-bot
2026-05-05 21:46   ` Dave Jiang
2026-05-07 18:08   ` Jonathan Cameron
2026-05-07 18:33     ` Bowman, Terry
2026-05-08 14:05       ` Jonathan Cameron
2026-05-09  3:49         ` Dan Williams (nvidia)
2026-05-05 17:30 ` [PATCH v17 03/11] cxl: Use common CPER handling for all CXL devices Terry Bowman
2026-05-05 21:30   ` sashiko-bot
2026-05-05 22:02   ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 04/11] cxl: Rename find_cxl_port() to find_cxl_port_by_dport() Terry Bowman
2026-05-05 22:06   ` Dave Jiang
2026-05-07 18:11     ` Jonathan Cameron
2026-05-05 17:30 ` [PATCH v17 05/11] cxl: Limit CXL-CPER kfifo registration functions scope Terry Bowman
2026-05-05 21:52   ` sashiko-bot
2026-05-05 22:16   ` Dave Jiang
2026-05-07 18:14   ` Jonathan Cameron
2026-05-05 17:30 ` [PATCH v17 06/11] PCI: Establish common CXL Port protocol error flow Terry Bowman
2026-05-05 22:28   ` sashiko-bot [this message]
2026-05-07 18:22   ` Jonathan Cameron
2026-05-05 17:30 ` [PATCH v17 07/11] PCI/CXL: Add RCH support to CXL handlers Terry Bowman
2026-05-05 23:34   ` sashiko-bot
2026-05-05 23:59   ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 08/11] cxl: Remove Endpoint AER correctable handler Terry Bowman
2026-05-05 17:30 ` [PATCH v17 09/11] cxl: Update Endpoint AER uncorrectable handler Terry Bowman
2026-05-06 17:43   ` Dave Jiang
2026-05-07 18:25     ` Jonathan Cameron
2026-05-05 17:30 ` [PATCH v17 10/11] PCI/CXL: Mask/Unmask CXL protocol errors Terry Bowman
2026-05-06  1:01   ` sashiko-bot
2026-05-06 18:00   ` Dave Jiang
2026-05-07 18:29   ` Jonathan Cameron
2026-05-05 17:30 ` [PATCH v17 11/11] Documentation: cxl: Document CXL protocol error handling Terry Bowman
2026-05-06 18:34   ` Dave Jiang
2026-05-07 18:51   ` Jonathan Cameron
