Linux CXL
 help / color / mirror / Atom feed
From: Terry Bowman <terry.bowman@amd.com>
To: <linux-cxl@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<linux-pci@vger.kernel.org>, <nifan.cxl@gmail.com>,
	<ming4.li@intel.com>, <dave@stgolabs.net>,
	<jonathan.cameron@huawei.com>, <dave.jiang@intel.com>,
	<alison.schofield@intel.com>, <vishal.l.verma@intel.com>,
	<dan.j.williams@intel.com>, <bhelgaas@google.com>,
	<mahesh@linux.ibm.com>, <ira.weiny@intel.com>, <oohall@gmail.com>,
	<Benjamin.Cheatham@amd.com>, <rrichter@amd.com>,
	<nathan.fontenot@amd.com>, <terry.bowman@amd.com>,
	<Smita.KoralahalliChannabasappa@amd.com>, <lukas@wunner.de>
Subject: [PATCH v3 07/15] PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service driver
Date: Wed, 13 Nov 2024 15:54:21 -0600	[thread overview]
Message-ID: <20241113215429.3177981-8-terry.bowman@amd.com> (raw)
In-Reply-To: <20241113215429.3177981-1-terry.bowman@amd.com>

Existing recovery procedure for PCIe uncorrectable errors (UCE) does not
apply to CXL devices. Recovery can not be used for CXL devices because of
the potential for corruption on what can be system memory. Also, current
PCIe UCE recovery does not begin at the bridge but begins at the bridge's
first downstream device. This will miss handling CXL protocol errors in a
CXL root port. A separate CXL recovery is needed because of the different
handling requirements

Add a new function, cxl_do_recovery() using the following.

Add cxl_walk_bridge() to iterate the detected error's sub-topology.
cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
will begin iteration at the bridge rather than beginning at the
bridge's first downstream child.

Add cxl_report_error_detected() as an analog to report_error_detected().
It will call pci_driver::cxl_err_handlers for each iterated downstream
child. The pci_driver::cxl_err_handlers UCE handler returns a boolean
indicating if there was a UCE error detected during handling.

cxl_do_recovery() uses the status from cxl_report_error_detected() to
determine how to proceed. Non-fatal CXL UCE errors will be treated as
fatal. If a UCE was present during handling then cxl_do_recovery()
will kernel panic.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/pci/pci.h      |  3 +++
 drivers/pci/pcie/aer.c |  5 +++-
 drivers/pci/pcie/err.c | 54 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 14d00ce45bfa..5a67e41919d8 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -658,6 +658,9 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 		pci_channel_state_t state,
 		pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev));
 
+/* CXL error reporting and handling */
+void cxl_do_recovery(struct pci_dev *dev);
+
 bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
 int pcie_retrain_link(struct pci_dev *pdev, bool use_lt);
 
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index bb34e205e082..87fddd514030 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1048,7 +1048,10 @@ static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 			pdrv->cxl_err_handler->cor_error_detected(dev);
 
 		pcie_clear_device_status(dev);
-	}
+	} else if (info->severity == AER_NONFATAL)
+		cxl_do_recovery(dev);
+	else if (info->severity == AER_FATAL)
+		cxl_do_recovery(dev);
 }
 
 static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 31090770fffc..3785f4ca5103 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -276,3 +276,57 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 
 	return status;
 }
+
+static void cxl_walk_bridge(struct pci_dev *bridge,
+			    int (*cb)(struct pci_dev *, void *),
+			    void *userdata)
+{
+	bool *status = userdata;
+
+	cb(bridge, status);
+	if (bridge->subordinate && !*status)
+		pci_walk_bus(bridge->subordinate, cb, status);
+}
+
+static int cxl_report_error_detected(struct pci_dev *dev, void *data)
+{
+	struct pci_driver *pdrv = dev->driver;
+	bool *status = data;
+
+	device_lock(&dev->dev);
+	if (pdrv && pdrv->cxl_err_handler &&
+	    pdrv->cxl_err_handler->error_detected) {
+		const struct cxl_error_handlers *cxl_err_handler =
+			pdrv->cxl_err_handler;
+		*status |= cxl_err_handler->error_detected(dev);
+	}
+	device_unlock(&dev->dev);
+	return *status;
+}
+
+void cxl_do_recovery(struct pci_dev *dev)
+{
+	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
+	int type = pci_pcie_type(dev);
+	struct pci_dev *bridge;
+	int status;
+
+	if (type == PCI_EXP_TYPE_ROOT_PORT ||
+	    type == PCI_EXP_TYPE_DOWNSTREAM ||
+	    type == PCI_EXP_TYPE_UPSTREAM ||
+	    type == PCI_EXP_TYPE_ENDPOINT)
+		bridge = dev;
+	else
+		bridge = pci_upstream_bridge(dev);
+
+	cxl_walk_bridge(bridge, cxl_report_error_detected, &status);
+	if (status)
+		panic("CXL cachemem error. Invoking panic");
+
+	if (host->native_aer || pcie_ports_native) {
+		pcie_clear_device_status(dev);
+		pci_aer_clear_nonfatal_status(dev);
+	}
+
+	pci_info(bridge, "No uncorrectable error found. Continuing.\n");
+}
-- 
2.34.1


  parent reply	other threads:[~2024-11-13 21:55 UTC|newest]

Thread overview: 48+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-11-13 21:54 [PATCH v3 0/15] Enable CXL PCIe port protocol error handling and logging Terry Bowman
2024-11-13 21:54 ` [PATCH v3 01/15] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver' Terry Bowman
2024-11-13 21:54 ` [PATCH v3 02/15] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port support Terry Bowman
2024-11-13 21:54 ` [PATCH v3 03/15] cxl/pci: Introduce PCIe helper functions pcie_is_cxl() and pcie_is_cxl_port() Terry Bowman
2024-11-14 15:45   ` Lukas Wunner
2024-11-14 16:45     ` Bowman, Terry
2024-11-14 16:52       ` Lukas Wunner
2024-11-14 17:07         ` Bowman, Terry
2024-11-15  8:47           ` Lukas Wunner
2024-11-15 13:54             ` Bowman, Terry
2024-11-17 17:02               ` Lukas Wunner
2024-11-19 12:20                 ` Bowman, Terry
2024-11-13 21:54 ` [PATCH v3 04/15] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
2024-11-13 21:54 ` [PATCH v3 05/15] PCI/AER: Add CXL PCIe port correctable error support in AER service driver Terry Bowman
2024-11-14 16:44   ` Lukas Wunner
2024-11-14 18:41     ` Bowman, Terry
2024-11-15  8:51       ` Lukas Wunner
2024-11-15 13:56         ` Bowman, Terry
2024-11-15 14:49       ` Li Ming
2024-11-15 19:46         ` Bowman, Terry
2024-11-17  7:38           ` Li Ming
2024-11-27 17:03   ` Jonathan Cameron
2024-11-27 20:29     ` Bowman, Terry
2024-11-13 21:54 ` [PATCH v3 06/15] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe port devices Terry Bowman
2024-11-15  9:35   ` Lukas Wunner
2024-11-21 20:24     ` Bowman, Terry
2024-11-27 17:05       ` Jonathan Cameron
2024-11-27 20:53         ` Bowman, Terry
2024-11-13 21:54 ` Terry Bowman [this message]
2024-11-18 10:37   ` [PATCH v3 07/15] PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service driver Lukas Wunner
2024-11-19 12:23     ` Bowman, Terry
2024-11-13 21:54 ` [PATCH v3 08/15] cxl/pci: Map CXL PCIe root port and downstream switch port RAS registers Terry Bowman
2024-11-15 15:28   ` Li Ming
2024-11-15 19:33     ` Bowman, Terry
2024-11-16 14:49   ` kernel test robot
2024-11-17  7:45   ` Li Ming
2024-11-18  2:21     ` Li Ming
2024-11-19 12:28       ` Bowman, Terry
2024-11-13 21:54 ` [PATCH v3 09/15] cxl/pci: Map CXL PCIe upstream " Terry Bowman
2024-11-13 21:54 ` [PATCH v3 10/15] cxl/pci: Update RAS handler interfaces to also support CXL PCIe ports Terry Bowman
2024-11-13 21:54 ` [PATCH v3 11/15] cxl/pci: Change find_cxl_port() to non-static Terry Bowman
2024-11-13 21:54 ` [PATCH v3 12/15] cxl/pci: Add error handler for CXL PCIe port RAS errors Terry Bowman
2024-11-13 21:54 ` [PATCH v3 13/15] cxl/pci: Add trace logging " Terry Bowman
2024-11-13 21:54 ` [PATCH v3 14/15] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers Terry Bowman
2024-11-13 21:54 ` [PATCH v3 15/15] PCI/AER: Enable internal errors for CXL upstream and downstream switch ports Terry Bowman
2024-11-18 11:54   ` Lukas Wunner
2024-11-21 22:25     ` Bowman, Terry
2024-11-21 22:32       ` Lukas Wunner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20241113215429.3177981-8-terry.bowman@amd.com \
    --to=terry.bowman@amd.com \
    --cc=Benjamin.Cheatham@amd.com \
    --cc=Smita.KoralahalliChannabasappa@amd.com \
    --cc=alison.schofield@intel.com \
    --cc=bhelgaas@google.com \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=ira.weiny@intel.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=lukas@wunner.de \
    --cc=mahesh@linux.ibm.com \
    --cc=ming4.li@intel.com \
    --cc=nathan.fontenot@amd.com \
    --cc=nifan.cxl@gmail.com \
    --cc=oohall@gmail.com \
    --cc=rrichter@amd.com \
    --cc=vishal.l.verma@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox