From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: Terry Bowman <terry.bowman@amd.com>
Cc: <ming4.li@intel.com>, <linux-cxl@vger.kernel.org>,
<linux-kernel@vger.kernel.org>, <linux-pci@vger.kernel.org>,
<dave@stgolabs.net>, <dave.jiang@intel.com>,
<alison.schofield@intel.com>, <vishal.l.verma@intel.com>,
<dan.j.williams@intel.com>, <bhelgaas@google.com>,
<mahesh@linux.ibm.com>, <oohall@gmail.com>,
<Benjamin.Cheatham@amd.com>, <rrichter@amd.com>,
<nathan.fontenot@amd.com>,
<smita.koralahallichannabasappa@amd.com>
Subject: Re: [PATCH 07/15] cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER service driver
Date: Wed, 16 Oct 2024 17:54:26 +0100
Message-ID: <20241016175426.0000411e@Huawei.com>
In-Reply-To: <20241008221657.1130181-8-terry.bowman@amd.com>
On Tue, 8 Oct 2024 17:16:49 -0500
Terry Bowman <terry.bowman@amd.com> wrote:
> The current pcie_do_recovery() handles device recovery as result of
> uncorrectable errors (UCE). But, CXL port devices require unique
> recovery handling.
>
> Create a cxl_do_recovery() function parallel to pcie_do_recovery(). Add CXL
> specific handling to the new recovery function.
>
> The CXL port UCE recovery must invoke the AER service driver's CXL port
> UCE callback. This is different than the standard pcie_do_recovery()
> recovery that calls the pci_driver::err_handler UCE handler instead.
>
> Treat all CXL PCIe port UCE errors as fatal and call kernel panic to
> "recover" the error. A panic is called instead of attempting recovery
> to avoid potential system corruption.
>
> The uncorrectable support added here will be used to complete CXL PCIe
> port error handling in the future.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Hi Terry,
I'm a little bothered by the subtle differences in the bus walks
in here vs the existing cases. If we need them, comments are needed
to explain why.

If we are going to have separate handling, see if you can share
a lot more of the code by factoring out common functions for
the pci and cxl handling, with callbacks to handle the differences.

I've managed to get my head around this code a few times in the past
(I think!) and really don't fancy having two subtle variants to
consider next time we get a bug :( The RC_EC additions hurt my head.
Jonathan
> static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 31090770fffc..de12f2eb19ef 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -86,6 +86,63 @@ static int report_error_detected(struct pci_dev *dev,
> return 0;
> }
>
> +static int cxl_report_error_detected(struct pci_dev *dev,
> +				     pci_channel_state_t state,
> +				     enum pci_ers_result *result)
> +{
> +	struct cxl_port_err_hndlrs *cxl_port_hndlrs;
> +	struct pci_driver *pdrv;
> +	pci_ers_result_t vote;
> +
> +	device_lock(&dev->dev);
> +	cxl_port_hndlrs = find_cxl_port_hndlrs();
Can we refactor to have a common function under this and report_error_detected()?
> +	pdrv = dev->driver;
> +	if (pci_dev_is_disconnected(dev)) {
> +		vote = PCI_ERS_RESULT_DISCONNECT;
> +	} else if (!pci_dev_set_io_state(dev, state)) {
> +		pci_info(dev, "can't recover (state transition %u -> %u invalid)\n",
> +			 dev->error_state, state);
> +		vote = PCI_ERS_RESULT_NONE;
> +	} else if (!cxl_port_hndlrs || !cxl_port_hndlrs->error_detected) {
> +		if (dev->hdr_type != PCI_HEADER_TYPE_BRIDGE) {
> +			vote = PCI_ERS_RESULT_NO_AER_DRIVER;
> +			pci_info(dev, "can't recover (no error_detected callback)\n");
> +		} else {
> +			vote = PCI_ERS_RESULT_NONE;
> +		}
> +	} else {
> +		vote = cxl_port_hndlrs->error_detected(dev, state);
> +	}
> +	pci_uevent_ers(dev, vote);
> +	*result = merge_result(*result, vote);
> +	device_unlock(&dev->dev);
> +	return 0;
> +}
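
A minimal user-space sketch of the kind of shared helper asked about above, assuming the only real difference between the PCI and CXL paths is where the error_detected handler comes from. Everything here is a toy stand-in: `toy_dev`, `toy_vote`, `toy_report_common()` and the lookup callbacks are hypothetical names, not kernel APIs, and the device lock, the pci_dev_set_io_state() check, the uevent and merge_result() are left out for brevity.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy vote values loosely modelled on pci_ers_result_t (hypothetical). */
enum toy_vote { TOY_NONE, TOY_NO_AER_DRIVER, TOY_DISCONNECT, TOY_CAN_RECOVER, TOY_PANIC };

/* Toy stand-in for struct pci_dev; not the kernel type. */
struct toy_dev {
	bool disconnected;
	bool is_bridge;
	/* Per-driver handler, NULL when the driver provides none. */
	enum toy_vote (*drv_error_detected)(struct toy_dev *dev);
};

typedef enum toy_vote (*toy_handler)(struct toy_dev *dev);
/* The only part that differs between the two flavours: handler lookup. */
typedef toy_handler (*toy_lookup)(struct toy_dev *dev);

/* Shared body: the disconnect and missing-handler cases are identical. */
static enum toy_vote toy_report_common(struct toy_dev *dev, toy_lookup lookup)
{
	toy_handler handler;

	if (dev->disconnected)
		return TOY_DISCONNECT;

	handler = lookup(dev);
	if (!handler)
		return dev->is_bridge ? TOY_NONE : TOY_NO_AER_DRIVER;

	return handler(dev);
}

/* PCI flavour: use whatever the bound driver registered. */
static toy_handler toy_pci_lookup(struct toy_dev *dev)
{
	return dev->drv_error_detected;
}

/* CXL port flavour: one global handler; here it always votes for panic. */
static enum toy_vote toy_cxl_port_error_detected(struct toy_dev *dev)
{
	(void)dev;
	return TOY_PANIC;
}

static toy_handler toy_cxl_lookup(struct toy_dev *dev)
{
	(void)dev;
	return toy_cxl_port_error_detected;
}
```

With this shape, each existing entry point would shrink to a call such as `toy_report_common(dev, toy_pci_lookup)` or `toy_report_common(dev, toy_cxl_lookup)`, leaving a single copy of the vote logic to reason about.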
> static int pci_pm_runtime_get_sync(struct pci_dev *pdev, void *data)
> {
> pm_runtime_get_sync(&pdev->dev);
> @@ -188,6 +245,28 @@ static void pci_walk_bridge(struct pci_dev *bridge,
> cb(bridge, userdata);
> }
>
> +/**
> + * cxl_walk_bridge - walk bridges potentially AER affected
> + * @bridge: bridge which may be a Port, an RCEC, or an RCiEP
> + * @cb: callback to be called for each device found
> + * @userdata: arbitrary pointer to be passed to callback
> + *
> + * If the device provided is a bridge, walk the subordinate bus, including
> + * the device itself and any bridged devices on buses under this bus. Call
> + * the provided callback on each device found.
> + *
> + * If the device provided has no subordinate bus, e.g., an RCEC or RCiEP,
> + * call the callback on the device itself.
Make this "only call the callback on the device itself"
(as stated above, you call it on the device itself either way).
> + */
> +static void cxl_walk_bridge(struct pci_dev *bridge,
> +			    int (*cb)(struct pci_dev *, void *),
> +			    void *userdata)
> +{
> +	cb(bridge, userdata);
> +	if (bridge->subordinate)
> +		pci_walk_bus(bridge->subordinate, cb, userdata);
The difference between this and pci_walk_bridge() is subtle and
I'd like to avoid having both if we can.
> +}
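
To make the difference concrete, here is a small user-space model of the two walks folded into one function. It assumes, from my reading of the existing pci_walk_bridge(), that it calls the callback on the bridge itself only when there is no subordinate bus, whereas cxl_walk_bridge() above always calls it on the bridge first. `toy_node`, `toy_walk_bus()` and `toy_walk_bridge()` are hypothetical names, and unlike pci_walk_bus() the toy walk ignores the callback's return value.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct pci_dev with a subordinate bus; not the kernel type. */
struct toy_node {
	const char *name;
	struct toy_node *children;	/* devices on the subordinate bus, or NULL */
	size_t nr_children;
};

typedef int (*toy_cb)(struct toy_node *dev, void *userdata);

/* Rough analogue of pci_walk_bus(): visit every device below, depth first. */
static void toy_walk_bus(struct toy_node *devs, size_t n, toy_cb cb, void *userdata)
{
	for (size_t i = 0; i < n; i++) {
		cb(&devs[i], userdata);
		if (devs[i].children)
			toy_walk_bus(devs[i].children, devs[i].nr_children,
				     cb, userdata);
	}
}

/*
 * One walk covering both behaviours:
 *   include_self == false: bridge visited only if it has no subordinate bus
 *   include_self == true:  bridge always visited, then its subordinate bus
 */
static void toy_walk_bridge(struct toy_node *bridge, bool include_self,
			    toy_cb cb, void *userdata)
{
	bool has_bus = bridge->children != NULL;

	if (include_self || !has_bus)
		cb(bridge, userdata);
	if (has_bus)
		toy_walk_bus(bridge->children, bridge->nr_children, cb, userdata);
}

/* Example callback: count visited devices through userdata. */
static int toy_count(struct toy_node *dev, void *userdata)
{
	(void)dev;
	(*(int *)userdata)++;
	return 0;
}
```

For a port with two endpoints below it, the first mode visits two devices and the second visits three; for a device with no subordinate bus both modes visit exactly one. A single flag, or a comment on a single function, would then document the one place where the behaviours diverge.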
> +
> pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> pci_channel_state_t state,
> pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
> @@ -276,3 +355,74 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>
> return status;
> }
> +
> +pci_ers_result_t cxl_do_recovery(struct pci_dev *bridge,
> +				 pci_channel_state_t state,
> +				 pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev))
> +{
> +	struct pci_host_bridge *host = pci_find_host_bridge(bridge->bus);
> +	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> +	int type = pci_pcie_type(bridge);
> +
> +	if ((type != PCI_EXP_TYPE_ROOT_PORT) &&
> +	    (type != PCI_EXP_TYPE_RC_EC) &&
> +	    (type != PCI_EXP_TYPE_DOWNSTREAM) &&
> +	    (type != PCI_EXP_TYPE_UPSTREAM)) {
> +		pci_dbg(bridge, "Unsupported device type (%x)\n", type);
> +		return status;
> +	}
> +
Would a similar trick to the one in pcie_do_recovery() work here: for the
upstream and downstream ports use pci_upstream_bridge(), and for the others
pass the dev into pci_walk_bridge()?
> +	cxl_walk_bridge(bridge, pci_pm_runtime_get_sync, NULL);
> +
> +	pci_dbg(bridge, "broadcast error_detected message\n");
> +	if (state == pci_channel_io_frozen) {
> +		cxl_walk_bridge(bridge, cxl_report_frozen_detected, &status);
> +		if (reset_subordinates(bridge) != PCI_ERS_RESULT_RECOVERED) {
> +			pci_warn(bridge, "subordinate device reset failed\n");
> +			goto failed;
> +		}
> +	} else {
> +		cxl_walk_bridge(bridge, cxl_report_normal_detected, &status);
> +	}
> +
> +	if (status == PCI_ERS_RESULT_PANIC)
> +		panic("CXL cachemem error. Invoking panic");
> +
> +	if (status == PCI_ERS_RESULT_CAN_RECOVER) {
> +		status = PCI_ERS_RESULT_RECOVERED;
> +		pci_dbg(bridge, "broadcast mmio_enabled message\n");
> +		cxl_walk_bridge(bridge, report_mmio_enabled, &status);
> +	}
> +
> +	if (status == PCI_ERS_RESULT_NEED_RESET) {
> +		status = PCI_ERS_RESULT_RECOVERED;
> +		pci_dbg(bridge, "broadcast slot_reset message\n");
> +		report_slot_reset(bridge, &status);
> +		pci_walk_bridge(bridge, report_slot_reset, &status);
> +	}
> +
> +	if (status != PCI_ERS_RESULT_RECOVERED)
> +		goto failed;
> +
> +	pci_dbg(bridge, "broadcast resume message\n");
> +	cxl_walk_bridge(bridge, report_resume, &status);
> +
> +	if (host->native_aer || pcie_ports_native) {
> +		pcie_clear_device_status(bridge);
> +		pci_aer_clear_nonfatal_status(bridge);
> +	}
> +
> +	cxl_walk_bridge(bridge, pci_pm_runtime_put, NULL);
> +
> +	pci_info(bridge, "device recovery successful\n");
> +	return status;
> +
> +failed:
> +	cxl_walk_bridge(bridge, pci_pm_runtime_put, NULL);
> +
> +	pci_uevent_ers(bridge, PCI_ERS_RESULT_DISCONNECT);
> +
> +	pci_info(bridge, "device recovery failed\n");
> +
> +	return status;
> +}