Linux PCI subsystem development
 help / color / mirror / Atom feed
From: sashiko-bot@kernel.org
To: "Terry Bowman" <terry.bowman@amd.com>
Cc: linux-pci@vger.kernel.org
Subject: Re: [PATCH v17 07/11] PCI/CXL: Add RCH support to CXL handlers
Date: Tue, 05 May 2026 23:34:39 +0000	[thread overview]
Message-ID: <20260505233439.B22ABC2BCB4@smtp.kernel.org> (raw)
In-Reply-To: <20260505173029.2718246-8-terry.bowman@amd.com>

Thank you for your contribution! Sashiko AI review found 4 potential issue(s) to consider:
- [High] Uncorrectable protocol errors on RCH Downstream Ports are swallowed without triggering the intended panic, resulting in silent data corruption.
- [High] cxl_forward_error() corrupts the RCD Endpoint's AER state and potentially BAR0 by misapplying the RCEC's AER Correctable Status mask.
- [High] Forwarding RCH protocol errors to the asynchronous CXL kfifo introduces a race condition with PCIe AER recovery, leading to concurrent MMIO accesses during device reset.
- [High] Direct uncorrectable CXL protocol errors from RCDs (RC_END) bypass the unified AER-CXL kfifo error flow and fail to panic as intended.
--

commit 71e85613fcf97cc6374754fbf172de84d9fe7371
Author: Terry Bowman <terry.bowman@amd.com>

PCI/CXL: Add RCH support to CXL handlers

This commit folds Restricted CXL Host (RCH) error handling into the new CXL
Port error handling flow. RCH protocol errors are now forwarded through the
AER-CXL kfifo, and uncorrectable errors are intended to panic via
cxl_do_recovery().

> Behavior change: an RCD uncorrectable CXL RAS error now panics via
> cxl_do_recovery(). Before this patch the RCH path returned
> PCI_ERS_RESULT_NEED_RESET via cxl_pci's err_handler. After this patch
> the same condition panics.

Do direct uncorrectable CXL protocol errors from RCDs (RC_END) bypass the
unified AER-CXL kfifo error flow and fail to panic as intended?

When an RCD (PCI_EXP_TYPE_RC_END) signals an error directly,
handle_error_source() invokes is_cxl_error(dev, info). However,
is_cxl_error() checks the PCIe type but only matches PCI_EXP_TYPE_ENDPOINT,
completely omitting PCI_EXP_TYPE_RC_END.

Consequently, is_cxl_error() returns false, bypassing cxl_forward_error()
and the kfifo. The error falls back to pci_aer_handle_error(), which
triggers the legacy cxl_error_detected() handler.

The legacy handler returns PCI_ERS_RESULT_NEED_RESET rather than triggering
cxl_do_recovery() to panic, violating the stated safety policy for fatal
cachemem errors.

> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 0a552d5a236e7..1f1dd20623f6d 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
[ ... ]
> @@ -329,6 +324,15 @@ EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
>  static void cxl_handle_proto_error(struct pci_dev *pdev, struct cxl_port *port,
>  				   struct cxl_dport *dport, int severity)
>  {
> +	/*
> +	 * An RC_END device is an RCD (Restricted CXL Device). Its AER
> +	 * interrupt is shared with the RCH Downstream Port, so handle RCH
> +	 * Downstream Port protocol errors first before processing the RCD's
> +	 * own errors. See CXL spec r3.1 s12.2.
> +	 */
> +	if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END)
> +		cxl_handle_rdport_errors(pdev);
> +

Are uncorrectable protocol errors on RCH Downstream Ports swallowed here
without triggering the intended panic?

cxl_handle_rdport_errors() correctly logs the error via cxl_handle_ras(),
which returns true for uncorrectable errors, but the return value is ignored.
Execution then proceeds unconditionally to cxl_do_recovery():

cxl_handle_proto_error() {
    ...
    if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END)
        cxl_handle_rdport_errors(pdev);
    
    if (severity == AER_CORRECTABLE) {
        ...
    } else {
        cxl_do_recovery(pdev, port, dport);
    }
}

Because dport evaluates to NULL for RCDs in this context, cxl_do_recovery()
checks only the RCD Endpoint's RAS capability. If the uncorrectable error
occurred exclusively in the RCH Downstream Port, cxl_do_recovery() observes
the uncorrectable error flag as false and silently returns without panicking.

Could this allow the system to continue using corrupted CXL.cachemem traffic?

> diff --git a/drivers/pci/pcie/aer_cxl_rch.c b/drivers/pci/pcie/aer_cxl_rch.c
> index e471eefec9c40..83142eac0cabf 100644
> --- a/drivers/pci/pcie/aer_cxl_rch.c
> +++ b/drivers/pci/pcie/aer_cxl_rch.c
[ ... ]
> @@ -37,26 +37,11 @@ static bool cxl_error_is_native(struct pci_dev *dev)
>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>  {
>  	struct aer_err_info *info = (struct aer_err_info *)data;
>  
>  	if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev))
>  		return 0;
>  
> -	guard(device)(&dev->dev);
> -
> -	err_handler = dev->driver ? dev->driver->err_handler : NULL;
> -	if (!err_handler)
> -		return 0;
> -
> -	if (info->severity == AER_CORRECTABLE) {
> -		if (err_handler->cor_error_detected)
> -			err_handler->cor_error_detected(dev);
> -	} else if (err_handler->error_detected) {
> -		if (info->severity == AER_NONFATAL)
> -			err_handler->error_detected(dev, pci_channel_io_normal);
> -		else if (info->severity == AER_FATAL)
> -			err_handler->error_detected(dev, pci_channel_io_frozen);
> -	}
> +	cxl_forward_error(dev, info);
>  	return 0;
>  }

Does cxl_forward_error() corrupt the RCD Endpoint's AER state and potentially
BAR0 by misapplying the RCEC's AER Correctable Status mask?

When an RCH Downstream Port signals an error, it is logged as an internal
error by the RCEC. This code iterates over the associated RCD Endpoints and
passes the RCD device (dev) along with the RCEC's error record (info) to
cxl_forward_error().

For correctable errors, cxl_forward_error() executes:

    pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_STATUS, info->status);

This writes the RCEC's error bitmask into the RCD's Write-1-to-Clear (W1C)
AER capability register, which will silently clear unrelated pending
correctable errors on the RCD.

Additionally, if the RCD lacks a native PCIe AER capability (pdev->aer_cap
is 0), the write targets offset 0x10 (BAR0), which might silently corrupt the
device's MMIO configuration.


Does forwarding RCH protocol errors to the asynchronous CXL kfifo introduce a
race condition with PCIe AER recovery?

Before this patch, cxl_rch_handle_error_iter() synchronously called the RCD
driver's error handler before the AER driver proceeded with recovery. This
patch delegates the handling to cxl_forward_error(), which queues a work item
and returns immediately.

Because is_cxl_error() evaluates to false for the RCEC (PCI_EXP_TYPE_RC_EC),
the AER driver proceeds directly to pci_aer_handle_error(), invoking
pcie_do_recovery() and issuing resets.

Concurrently, the asynchronous work queues wake up and access the RCH
Downstream Port's AER and RAS MMIO capabilities. Accessing MMIO regions while
the device or its parent is actively undergoing reset can trigger Master
Aborts, return ~0 data, or freeze the system.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260505173029.2718246-1-terry.bowman@amd.com?part=7

  reply	other threads:[~2026-05-05 23:34 UTC|newest]

Thread overview: 40+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-05 17:30 [PATCH v17 00/11] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
2026-05-05 17:30 ` [PATCH v17 01/11] PCI/AER: Introduce AER-CXL Kfifo Terry Bowman
2026-05-05 20:26   ` sashiko-bot
2026-05-05 21:17   ` Dave Jiang
2026-05-07 17:53   ` Jonathan Cameron
2026-05-07 18:26     ` Bowman, Terry
2026-05-05 17:30 ` [PATCH v17 02/11] cxl/ras: Unify Endpoint and Port AER trace events Terry Bowman
2026-05-05 21:07   ` sashiko-bot
2026-05-05 21:46   ` Dave Jiang
2026-05-07 18:08   ` Jonathan Cameron
2026-05-07 18:33     ` Bowman, Terry
2026-05-08 14:05       ` Jonathan Cameron
2026-05-09  3:49         ` Dan Williams (nvidia)
2026-05-05 17:30 ` [PATCH v17 03/11] cxl: Use common CPER handling for all CXL devices Terry Bowman
2026-05-05 21:30   ` sashiko-bot
2026-05-05 22:02   ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 04/11] cxl: Rename find_cxl_port() to find_cxl_port_by_dport() Terry Bowman
2026-05-05 22:06   ` Dave Jiang
2026-05-07 18:11     ` Jonathan Cameron
2026-05-05 17:30 ` [PATCH v17 05/11] cxl: Limit CXL-CPER kfifo registration functions scope Terry Bowman
2026-05-05 21:52   ` sashiko-bot
2026-05-05 22:16   ` Dave Jiang
2026-05-07 18:14   ` Jonathan Cameron
2026-05-05 17:30 ` [PATCH v17 06/11] PCI: Establish common CXL Port protocol error flow Terry Bowman
2026-05-05 22:28   ` sashiko-bot
2026-05-07 18:22   ` Jonathan Cameron
2026-05-05 17:30 ` [PATCH v17 07/11] PCI/CXL: Add RCH support to CXL handlers Terry Bowman
2026-05-05 23:34   ` sashiko-bot [this message]
2026-05-05 23:59   ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 08/11] cxl: Remove Endpoint AER correctable handler Terry Bowman
2026-05-05 17:30 ` [PATCH v17 09/11] cxl: Update Endpoint AER uncorrectable handler Terry Bowman
2026-05-06 17:43   ` Dave Jiang
2026-05-07 18:25     ` Jonathan Cameron
2026-05-05 17:30 ` [PATCH v17 10/11] PCI/CXL: Mask/Unmask CXL protocol errors Terry Bowman
2026-05-06  1:01   ` sashiko-bot
2026-05-06 18:00   ` Dave Jiang
2026-05-07 18:29   ` Jonathan Cameron
2026-05-05 17:30 ` [PATCH v17 11/11] Documentation: cxl: Document CXL protocol error handling Terry Bowman
2026-05-06 18:34   ` Dave Jiang
2026-05-07 18:51   ` Jonathan Cameron

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260505233439.B22ABC2BCB4@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=sashiko@lists.linux.dev \
    --cc=terry.bowman@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox