From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 70A5B37B014 for ; Tue, 5 May 2026 22:28:53 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778020133; cv=none; b=b9dgej61ph520zcr20gwiWberg1/EClYGKaT5F7sushskb30CB6Eh2FGu65kZnQ0rw8a2V9k+87gO4skNc+xggOUtwvI0uHEnZ6ojVGvmyqasWulwl2wiAvHIl9jHh3mIMgcl977xuhOI8dSBkGmmRYVOzZRmaCbGR1sZkHrsWA= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1778020133; c=relaxed/simple; bh=NcJrR+AoDc1OAKzaL766DaGMVHu0D+hJUS0ew21RXaA=; h=From:Subject:To:Cc:In-Reply-To:References:Content-Type:Date: Message-Id; b=qRIDVhDEd0IhYB6KXZlJb65XCQWggclWXmKGhICD79QyFp989k/IoOM9Wt0eXtBTPb252eYHsQdhi70707LwmjTz3LqayFagYJqQdxv2oqJC+lBD1dsRhEPo7+d5jdAxRH1TUVJo38kScce2xQFvz7zlKgSKxrhRDcdcY0jve/k= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=eUsz642U; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="eUsz642U" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 00913C2BCB4; Tue, 5 May 2026 22:28:52 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1778020133; bh=NcJrR+AoDc1OAKzaL766DaGMVHu0D+hJUS0ew21RXaA=; h=From:Subject:Reply-To:To:Cc:In-Reply-To:References:Date:From; b=eUsz642UQWOAq3jJsXYvu6qi1QBEwd287p38VeCVt9YG/lz9UgjslD6H40wBjYlmx Df337SfIzUbYFu38v+Ff1S7v1rTzrDl5gJ6TVeXr64/TQpHPjcu60oPJL3NDYUQZZa ohQQUj501BgLVt8iq/4kiuIfOD0AgYu8WHSrHektWnmTGNeV0lxknsFXHgjkHsOesp RjMsX+qLMdjjLtNIOB4fft2DgXqkLlf+Somql1673eL3b/cszvfsNm3jeH4Zw/uDoh fJtHr/etf+JwAHEKLjMkBQoHMeFCdcHVZvO+JwutG2F6UTACM28kyYzaBlRLf9H3eD 9hv0tQg2z6S4g== From: sashiko-bot@kernel.org Subject: Re: [PATCH v17 06/11] PCI: Establish common CXL Port protocol error flow Reply-To: sashiko@lists.linux.dev To: "Terry Bowman" Cc: linux-pci@vger.kernel.org In-Reply-To: <20260505173029.2718246-7-terry.bowman@amd.com> References: <20260505173029.2718246-7-terry.bowman@amd.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: quoted-printable Date: Tue, 05 May 2026 22:28:52 +0000 Message-Id: <20260505222853.00913C2BCB4@smtp.kernel.org> Precedence: bulk X-Mailing-List: linux-pci@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: Thank you for your contribution! Sashiko AI review found 3 potential issue(= s) to consider: - [High] Mixed AER uncorrectable errors and Fatal errors bypass standard PC= Ie error recovery. - [High] `cxl_forward_error()` unconditionally writes to `pdev->aer_cap + P= CI_ERR_COR_STATUS`, which can corrupt BAR0 if `aer_cap` is 0. - [Medium] AER Uncorrectable Error status is left uncleared if the CXL port= driver is unbound. -- commit 40776581efad3966047d440ec936219af21ff33c Author: Terry Bowman PCI: Establish common CXL Port protocol error flow This commit adds CXL Port protocol error handling callbacks to unify detection, logging, and recovery across CXL Ports and Endpoints. It also routes CXL protocol errors through the CXL RAS helpers and wires the AER core to a kfifo for error processing. > The correctable AER status is cleared by the producer in > cxl_forward_error(). The patch 'PCI/CXL: Add RCH support to CXL handlers' updates cxl_rch_handle_error_iter() to call cxl_forward_error() directly for RCD Endpoints (RCiEPs). If cxl_forward_error() unconditionally writes to pdev->aer_cap + PCI_ERR_COR_STATUS to clear the correctable AER status, what happens if an RCD Endpoint does not expose an AER capability to the OS and dev->aer_cap is 0? Would writing to 0 + PCI_ERR_COR_STATUS (offset 0x10) inadvertently overwrite the PCI Base Address Register 0 (BAR0) and corrupt the device's memory mapping?=20 The original pci_aer_handle_error() explicitly guarded this write by checking if the aer offset is non-zero. > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c [ ... ] > @@ -1186,7 +1187,10 @@ static void pci_aer_handle_error(struct pci_dev *d= ev, struct aer_err_info *info) > static void handle_error_source(struct pci_dev *dev, struct aer_err_info= *info) > { > cxl_rch_handle_error(dev, info); > - pci_aer_handle_error(dev, info); > + if (is_cxl_error(dev, info)) > + cxl_forward_error(dev, info); > + else > + pci_aer_handle_error(dev, info); > pci_dev_put(dev); > } Since is_cxl_error() returns true when the CXL Internal Error bit is set, does this bypass standard PCIe error recovery if a device experiences a concurrent CXL Internal Error and a standard PCIe Uncorrectable Error (such as a Data Link Protocol Error)? Additionally, if the error is fatal, the PCIe link is frozen. Standard handling in pci_aer_handle_error() would normally call pcie_do_recovery() to perform a link reset. Because cxl_do_recovery() only clears the AER status and returns without resetting the link, could the device be left permanently frozen? > diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c > --- a/drivers/cxl/core/ras.c > +++ b/drivers/cxl/core/ras.c [ ... ] > +static int __cxl_proto_err_work_fn(struct cxl_proto_err_work_data *wd) > +{ > + struct cxl_dport *dport; > + struct cxl_port *port __free(put_cxl_port) =3D > + find_cxl_port_by_dev(&wd->pdev->dev, &dport); > + > + if (!port) { > + dev_err_ratelimited(&wd->pdev->dev, > + "Failed to find parent port device in CXL topology\n"); > + return 0; > + } > + > + /* > + * Hold the port device lock and verify a driver is bound before > + * handling errors. Protects against NULL deref if an error is > + * dispatched before probe completion or after driver removal. > + */ > + guard(device)(&port->dev); > + if (!port->dev.driver) { > + dev_err_ratelimited(&port->dev, > + "Port device is unbound, abort error handling\n"); > + return 0; > + } > + > + cxl_handle_proto_error(wd->pdev, port, dport, wd->severity); > + > + return 0; > +} If the port is not found or the driver is unbound, this function aborts and returns early before calling cxl_handle_proto_error(). Since handle_error_source() already bypassed the standard PCIe error recovery, does returning early here leave the AER Uncorrectable Error status uncleared on the device? Standard PCIe handling normally ensures these registers are cleared even if no driver is bound. Could leaving them uncleared block future error reporting? --=20 Sashiko AI review =C2=B7 https://sashiko.dev/#/patchset/20260505173029.2718= 246-1-terry.bowman@amd.com?part=3D6