From: Terry Bowman <terry.bowman@amd.com>
To: <linux-cxl@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
<linux-pci@vger.kernel.org>, <nifan.cxl@gmail.com>,
<dave@stgolabs.net>, <jonathan.cameron@huawei.com>,
<dave.jiang@intel.com>, <alison.schofield@intel.com>,
<vishal.l.verma@intel.com>, <dan.j.williams@intel.com>,
<bhelgaas@google.com>, <mahesh@linux.ibm.com>,
<ira.weiny@intel.com>, <oohall@gmail.com>,
<Benjamin.Cheatham@amd.com>, <rrichter@amd.com>,
<nathan.fontenot@amd.com>, <terry.bowman@amd.com>,
<Smita.KoralahalliChannabasappa@amd.com>, <lukas@wunner.de>,
<ming.li@zohomail.com>, <PradeepVineshReddy.Kodamati@amd.com>
Subject: [PATCH v6 06/17] PCI/AER: Add CXL PCIe Port uncorrectable error recovery in AER service driver
Date: Fri, 7 Feb 2025 18:29:30 -0600 [thread overview]
Message-ID: <20250208002941.4135321-7-terry.bowman@amd.com> (raw)
In-Reply-To: <20250208002941.4135321-1-terry.bowman@amd.com>
Existing recovery procedure for PCIe uncorrectable errors (UCE) does not
apply to CXL devices. Recovery can not be used for CXL devices because of
potential corruption on what can be system memory. Also, current PCIe UCE
recovery, in the case of a Root Port (RP) or Downstream Switch Port (DSP),
does not begin at the RP/DSP but begins at the first downstream device.
This will miss handling CXL Protocol Errors in a CXL RP or DSP. A separate
CXL recovery is needed because of the different handling requirements
Add a new function, cxl_do_recovery() using the following.
Add cxl_walk_bridge() to iterate the detected error's sub-topology.
cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
will begin iteration at the RP or DSP rather than beginning at the
first downstream device.
pci_walk_bridge() is candidate to possibly reuse cxl_walk_bridge() but
needs further investigation. This will be left for future improvement
to make the CXL and PCI handling paths more common.
Add cxl_report_error_detected() as an analog to report_error_detected().
It will call pci_driver::cxl_err_handlers for each iterated downstream
device. The pci_driver::cxl_err_handler's UCE handler returns a boolean
indicating if there was a UCE error detected during handling.
cxl_do_recovery() uses the status from cxl_report_error_detected() to
determine how to proceed. Non-fatal CXL UCE errors will be treated as
fatal. If a UCE was present during handling then cxl_do_recovery()
will kernel panic.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/cxl/core/pci.c | 12 ++++++---
drivers/pci/pci.h | 3 +++
drivers/pci/pcie/aer.c | 4 +++
drivers/pci/pcie/err.c | 58 ++++++++++++++++++++++++++++++++++++++++++
include/linux/pci.h | 3 +++
5 files changed, 77 insertions(+), 3 deletions(-)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index a5c65f79db18..eda532f7440c 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -661,10 +661,16 @@ static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
status = readl(addr);
- if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
- writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
- trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
+ if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK)) {
+ dev_err(cxl_dev, "%s():%d: CE Status is empty\n", __func__, __LINE__);
+ return;
}
+ writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
+
+ if (is_cxl_memdev(cxl_dev))
+ trace_cxl_aer_correctable_error(to_cxl_memdev(cxl_dev), status);
+ else if (is_cxl_port(cxl_dev))
+ trace_cxl_port_aer_correctable_error(cxl_dev, pcie_dev, status);
}
static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 01e51db8d285..deb193b387af 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -722,6 +722,9 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
pci_channel_state_t state,
pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev));
+/* CXL error reporting and handling */
+void cxl_do_recovery(struct pci_dev *dev);
+
bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
int pcie_retrain_link(struct pci_dev *pdev, bool use_lt);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 34ec0958afff..ee38db08d005 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1012,6 +1012,8 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
err_handler->error_detected(dev, pci_channel_io_normal);
else if (info->severity == AER_FATAL)
err_handler->error_detected(dev, pci_channel_io_frozen);
+
+ cxl_do_recovery(dev);
}
out:
device_unlock(&dev->dev);
@@ -1041,6 +1043,8 @@ static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
pdrv->cxl_err_handler->cor_error_detected(dev);
pcie_clear_device_status(dev);
+ } else {
+ cxl_do_recovery(dev);
}
}
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 31090770fffc..05f2d1ef4c36 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -24,6 +24,9 @@
static pci_ers_result_t merge_result(enum pci_ers_result orig,
enum pci_ers_result new)
{
+ if (new == PCI_ERS_RESULT_PANIC)
+ return PCI_ERS_RESULT_PANIC;
+
if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
return PCI_ERS_RESULT_NO_AER_DRIVER;
@@ -276,3 +279,58 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
return status;
}
+
+static void cxl_walk_bridge(struct pci_dev *bridge,
+ int (*cb)(struct pci_dev *, void *),
+ void *userdata)
+{
+ if (cb(bridge, userdata))
+ return;
+
+ if (bridge->subordinate)
+ pci_walk_bus(bridge->subordinate, cb, userdata);
+}
+
+static int cxl_report_error_detected(struct pci_dev *dev, void *data)
+{
+ const struct cxl_error_handlers *cxl_err_handler;
+ pci_ers_result_t vote, *result = data;
+ struct pci_driver *pdrv;
+
+ device_lock(&dev->dev);
+ pdrv = dev->driver;
+ if (!pdrv || !pdrv->cxl_err_handler ||
+ !pdrv->cxl_err_handler->error_detected)
+ goto out;
+
+ cxl_err_handler = pdrv->cxl_err_handler;
+ vote = cxl_err_handler->error_detected(dev);
+ *result = merge_result(*result, vote);
+out:
+ device_unlock(&dev->dev);
+ return 0;
+}
+
+void cxl_do_recovery(struct pci_dev *dev)
+{
+ struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
+ pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
+
+ cxl_walk_bridge(dev, cxl_report_error_detected, &status);
+ if (status == PCI_ERS_RESULT_PANIC)
+ panic("CXL cachemem error.");
+
+ /*
+ * If we have native control of AER, clear error status in the device
+ * that detected the error. If the platform retained control of AER,
+ * it is responsible for clearing this status. In that case, the
+ * signaling device may not even be visible to the OS.
+ */
+ if (host->native_aer || pcie_ports_native) {
+ pcie_clear_device_status(dev);
+ pci_aer_clear_nonfatal_status(dev);
+ pci_aer_clear_fatal_status(dev);
+ }
+
+ pci_info(dev, "CXL uncorrectable error.\n");
+}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 82a0401c58d3..5b539b5bf0d1 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -864,6 +864,9 @@ enum pci_ers_result {
/* No AER capabilities registered for the driver */
PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6,
+
+ /* System is unstable, panic */
+ PCI_ERS_RESULT_PANIC = (__force pci_ers_result_t) 7,
};
/* PCI bus error event callbacks */
--
2.34.1
next prev parent reply other threads:[~2025-02-08 0:31 UTC|newest]
Thread overview: 22+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-02-08 0:29 [PATCH v6 00/17] Enable CXL PCIe port protocol error handling and logging Terry Bowman
2025-02-08 0:29 ` [PATCH v6 01/17] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver' Terry Bowman
2025-02-08 0:29 ` [PATCH v6 02/17] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe Port support Terry Bowman
2025-02-08 0:29 ` [PATCH v6 03/17] CXL/PCI: Introduce PCIe helper functions pcie_is_cxl() and pcie_is_cxl_port() Terry Bowman
2025-02-08 0:29 ` [PATCH v6 04/17] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
2025-02-08 0:29 ` [PATCH v6 05/17] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver Terry Bowman
2025-02-08 0:29 ` Terry Bowman [this message]
2025-02-10 18:43 ` [PATCH v6 06/17] PCI/AER: Add CXL PCIe Port uncorrectable error recovery " Gregory Price
2025-02-10 20:13 ` Gregory Price
2025-02-08 0:29 ` [PATCH v6 07/17] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers Terry Bowman
2025-02-08 0:29 ` [PATCH v6 08/17] cxl/pci: Map CXL PCIe Upstream " Terry Bowman
2025-02-08 0:29 ` [PATCH v6 09/17] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports Terry Bowman
2025-02-10 20:16 ` Gregory Price
2025-02-10 20:36 ` Bowman, Terry
2025-02-08 0:29 ` [PATCH v6 10/17] cxl/pci: Add log message for unmapped registers in existing RAS handlers Terry Bowman
2025-02-08 0:29 ` [PATCH v6 11/17] cxl/pci: Change find_cxl_port() to non-static Terry Bowman
2025-02-08 0:29 ` [PATCH v6 12/17] cxl/pci: Add error handler for CXL PCIe Port RAS errors Terry Bowman
2025-02-08 0:29 ` [PATCH v6 13/17] cxl/pci: Add trace logging " Terry Bowman
2025-02-08 0:29 ` [PATCH v6 14/17] cxl/pci: Update CXL Port RAS logging to also display PCIe SBDF Terry Bowman
2025-02-08 0:29 ` [PATCH v6 15/17] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers Terry Bowman
2025-02-08 0:29 ` [PATCH v6 16/17] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports Terry Bowman
2025-02-08 0:29 ` [PATCH v6 17/17] cxl/pci: Handle CXL Endpoint and RCH protocol errors separately from PCIe errors Terry Bowman
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250208002941.4135321-7-terry.bowman@amd.com \
--to=terry.bowman@amd.com \
--cc=Benjamin.Cheatham@amd.com \
--cc=PradeepVineshReddy.Kodamati@amd.com \
--cc=Smita.KoralahalliChannabasappa@amd.com \
--cc=alison.schofield@intel.com \
--cc=bhelgaas@google.com \
--cc=dan.j.williams@intel.com \
--cc=dave.jiang@intel.com \
--cc=dave@stgolabs.net \
--cc=ira.weiny@intel.com \
--cc=jonathan.cameron@huawei.com \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=lukas@wunner.de \
--cc=mahesh@linux.ibm.com \
--cc=ming.li@zohomail.com \
--cc=nathan.fontenot@amd.com \
--cc=nifan.cxl@gmail.com \
--cc=oohall@gmail.com \
--cc=rrichter@amd.com \
--cc=vishal.l.verma@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox