From: Dave Jiang <dave.jiang@intel.com>
To: "Bowman, Terry" <terry.bowman@amd.com>,
dave@stgolabs.net, jonathan.cameron@huawei.com,
alison.schofield@intel.com, dan.j.williams@intel.com,
bhelgaas@google.com, shiju.jose@huawei.com, ming.li@zohomail.com,
Smita.KoralahalliChannabasappa@amd.com, rrichter@amd.com,
dan.carpenter@linaro.org, PradeepVineshReddy.Kodamati@amd.com,
lukas@wunner.de, Benjamin.Cheatham@amd.com,
sathyanarayanan.kuppuswamy@linux.intel.com,
linux-cxl@vger.kernel.org, alucerop@amd.com, ira.weiny@intel.com
Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
Subject: Re: [PATCH v12 23/25] CXL/PCI: Introduce CXL uncorrectable protocol error recovery
Date: Tue, 30 Sep 2025 09:46:34 -0700 [thread overview]
Message-ID: <ab001a63-47e4-4d85-b536-8103835a5b39@intel.com> (raw)
In-Reply-To: <20351ea0-4bb7-4b5e-b097-42ef145dea68@amd.com>
On 9/30/25 9:43 AM, Bowman, Terry wrote:
>
>
> On 9/30/2025 11:13 AM, Dave Jiang wrote:
>>
>> On 9/30/25 7:38 AM, Bowman, Terry wrote:
>>>
>>> On 9/29/2025 7:26 PM, Dave Jiang wrote:
>>>> On 9/25/25 3:34 PM, Terry Bowman wrote:
>>>>> Populate the cxl_do_recovery() function with uncorrectable protocol error (UCE)
>>>>> handling. Follow similar design as found in PCIe error driver,
>>>>> pcie_do_recovery(). One difference is cxl_do_recovery() will treat all UCEs
>>>>> as fatal with a kernel panic. This is to prevent corruption on CXL memory.
>>>>>
>>>>> Introduce cxl_walk_port(). Make this analogous to pci_walk_bridge() but walking
>>>>> CXL ports instead. This will iterate through the CXL topology from the
>>>>> erroring device through the downstream CXL Ports and Endpoints.
>>>>>
>>>>> Export pci_aer_clear_fatal_status() for CXL to use if a UCE is not found.
>>>>>
>>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>>>
>>>>> ---
>>>>>
>>>>> Changes in v11->v12:
>>>>> - Cleaned up port discovery in cxl_do_recovery() (Dave)
>>>>> - Added PCI_EXP_TYPE_RC_END to type check in cxl_report_error_detected()
>>>>>
>>>>> Changes in v10->v11:
>>>>> - pci_ers_merge_results() - Move to earlier patch
>>>>> ---
>>>>> drivers/cxl/core/ras.c | 111 +++++++++++++++++++++++++++++++++++++++++
>>>>> 1 file changed, 111 insertions(+)
>>>>>
>>>>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>>>>> index 7e8d63c32d72..45f92defca64 100644
>>>>> --- a/drivers/cxl/core/ras.c
>>>>> +++ b/drivers/cxl/core/ras.c
>>>>> @@ -443,8 +443,119 @@ void cxl_endpoint_port_init_ras(struct cxl_port *ep)
>>>>> }
>>>>> EXPORT_SYMBOL_NS_GPL(cxl_endpoint_port_init_ras, "CXL");
>>>>>
>>>>> +static int cxl_report_error_detected(struct device *dev, void *data)
>>>>> +{
>>>>> + struct pci_dev *pdev = to_pci_dev(dev);
>>>>> + pci_ers_result_t vote, *result = data;
>>>>> +
>>>>> + guard(device)(dev);
>>>>> +
>>>>> + if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ENDPOINT) ||
>>>>> + (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END)) {
>>>>> + if (!cxl_pci_drv_bound(pdev))
>>>>> + return 0;
>>>>> +
>>>>> + vote = cxl_error_detected(dev);
>>>>> + } else {
>>>>> + vote = cxl_port_error_detected(dev);
>>>>> + }
>>>>> +
>>>>> + *result = pci_ers_merge_result(*result, vote);
>>>>> +
>>>>> + return 0;
>>>>> +}
>>>>> +
>>>>> +static int match_port_by_parent_dport(struct device *dev, const void *dport_dev)
>>>>> +{
>>>>> + struct cxl_port *port;
>>>>> +
>>>>> + if (!is_cxl_port(dev))
>>>>> + return 0;
>>>>> +
>>>>> + port = to_cxl_port(dev);
>>>>> +
>>>>> + return port->parent_dport->dport_dev == dport_dev;
>>>>> +}
>>>>> +
>>>>> +static void cxl_walk_port(struct device *port_dev,
>>>>> + int (*cb)(struct device *, void *),
>>>>> + void *userdata)
>>>>> +{
>>>>> + struct cxl_dport *dport = NULL;
>>>>> + struct cxl_port *port;
>>>>> + unsigned long index;
>>>>> +
>>>>> + if (!port_dev)
>>>>> + return;
>>>>> +
>>>>> + port = to_cxl_port(port_dev);
>>>>> + if (port->uport_dev && dev_is_pci(port->uport_dev))
>>>>> + cb(port->uport_dev, userdata);
>>>> Could use some comments on what is being walked. Also an explanation of what is happening here would be good.
>>> Ok
>>>> If this is an endpoint port, this would be the PCI endpoint device.
>>>> If it's a switch port, then this is the upstream port.
>>>> If it's a root port, this is skipped.
>>>>
>>>>> +
>>>>> + xa_for_each(&port->dports, index, dport)
>>>>> + {
>>>>> + struct device *child_port_dev __free(put_device) =
>>>>> + bus_find_device(&cxl_bus_type, &port->dev, dport->dport_dev,
>>>>> + match_port_by_parent_dport);
>>>>> +
>>>>> + cb(dport->dport_dev, userdata);
>>>> This is going through all the downstream ports
>>>>> +
>>>>> + cxl_walk_port(child_port_dev, cxl_report_error_detected, userdata);
>>>>> + }
>>>>> +
>>>>> + if (is_cxl_endpoint(port))
>>>>> + cb(port->uport_dev->parent, userdata);
>>>> And this is the downstream parent port of the endpoint device
>>>>
>>>> Why not move this before the xa_for_each() and return early? endpoint ports don't have dports, no need to even try to run that block above.
>>> Sure, I'll change that.
>>>> So in the current implementation,
>>>> 1. Endpoint. It checks the device, and then it checks the downstream parent port for errors. Is checking the parent dport necessary?
>>>> 2. Switch. It checks the upstream port, then it checks all the downstream ports for errors.
>>>> 3. Root port. It checks all the downstream ports for errors.
>>>> Is this the correct understanding of what this function does?
>>> Yes. The ordering is different as you pointed out. I can move the endpoint
>>> check earlier with an early return.
>> As the endpoint, what is the reason the check the parent dport? Pardon my ignorance.
>
> There is none. An endpoint port will not have downstream ports.
parent dport. It would be the root port or the switch downstream port. This is what the current code is doing:
>>>>> + if (is_cxl_endpoint(port))
>>>>> + cb(port->uport_dev->parent, userdata);
DJ
>
>>>>> +}
>>>>> +
>>>>> static void cxl_do_recovery(struct device *dev)
>>>>> {
>>>>> + pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>>>>> + struct pci_dev *pdev = to_pci_dev(dev);
>>>>> + struct cxl_port *port = NULL;
>>>>> +
>>>>> + if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
>>>>> + (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
>>>>> + struct cxl_dport *dport;
>>>>> + struct cxl_port *rp_port __free(put_cxl_port) = find_cxl_port(&pdev->dev, &dport);
>>>>> +
>>>>> + port = rp_port;
>>>>> +
>>>>> + } else if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {
>>>>> + struct cxl_port *us_port __free(put_cxl_port) = find_cxl_port_by_uport(&pdev->dev);
>>>>> +
>>>>> + port = us_port;
>>>>> +
>>>>> + } else if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ENDPOINT) ||
>>>>> + (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END)) {
>>>>> + struct cxl_dev_state *cxlds;
>>>>> +
>>>>> + if (!cxl_pci_drv_bound(pdev))
>>>>> + return;
>>>> Need to have the pci dev lock before checking driver bound.
>>>> DJ
>>> Ok, I'll try to add that into cxl_pci_drv_bound(). Terry
>> Do you need the lock beyond just checking the driver data? Maybe do it outside cxl_pci_drv_bound(). I would have an assert in the function though to ensure lock is held when calling this function.
>
> Ok, good idea.
>
> Terry
>> DJ
>>>>> +
>>>>> + cxlds = pci_get_drvdata(pdev);
>>>>> + port = cxlds->cxlmd->endpoint;
>>>>> + }
>>>>> +
>>>>> + if (!port) {
>>>>> + dev_err(dev, "Failed to find the CXL device\n");
>>>>> + return;
>>>>> + }
>>>>> +
>>>>> + cxl_walk_port(&port->dev, cxl_report_error_detected, &status);
>>>>> + if (status == PCI_ERS_RESULT_PANIC)
>>>>> + panic("CXL cachemem error.");
>>>>> +
>>>>> + /*
>>>>> + * If we have native control of AER, clear error status in the device
>>>>> + * that detected the error. If the platform retained control of AER,
>>>>> + * it is responsible for clearing this status. In that case, the
>>>>> + * signaling device may not even be visible to the OS.
>>>>> + */
>>>>> + if (cxl_error_is_native(pdev)) {
>>>>> + pcie_clear_device_status(pdev);
>>>>> + pci_aer_clear_nonfatal_status(pdev);
>>>>> + pci_aer_clear_fatal_status(pdev);
>>>>> + }
>>>>> }
>>>>>
>>>>> static void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
>>>
>
next prev parent reply other threads:[~2025-09-30 16:46 UTC|newest]
Thread overview: 92+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-25 22:34 [PATCH v12 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
2025-09-25 22:34 ` [PATCH v12 01/25] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
2025-09-25 22:34 ` [PATCH v12 02/25] cxl/pci: Remove unnecessary CXL RCH " Terry Bowman
2025-10-01 15:09 ` Jonathan Cameron
2025-09-25 22:34 ` [PATCH v12 03/25] cxl: Remove ifdef blocks of CONFIG_PCIEAER_CXL from core/pci.c Terry Bowman
2025-10-03 20:11 ` Cheatham, Benjamin
2025-09-25 22:34 ` [PATCH v12 04/25] CXL/AER: Remove CONFIG_PCIEAER_CXL and replace with CONFIG_CXL_RAS Terry Bowman
2025-09-25 23:17 ` Dave Jiang
2025-10-01 15:11 ` Jonathan Cameron
2025-10-03 20:11 ` Cheatham, Benjamin
2025-09-25 22:34 ` [PATCH v12 05/25] cxl: Move CXL driver RCH error handling into CONFIG_CXL_RCH_RAS conditional block Terry Bowman
2025-09-25 23:31 ` Dave Jiang
2025-10-01 15:23 ` Jonathan Cameron
2025-10-03 20:11 ` Cheatham, Benjamin
2025-10-06 18:52 ` Bowman, Terry
2025-09-25 22:34 ` [PATCH v12 06/25] CXL/AER: Introduce aer_cxl_rch.c into AER driver for handling CXL RCH errors Terry Bowman
2025-09-25 23:36 ` Dave Jiang
2025-09-26 12:32 ` kernel test robot
2025-10-01 15:42 ` Jonathan Cameron
2025-10-03 20:11 ` Cheatham, Benjamin
2025-09-25 22:34 ` [PATCH v12 07/25] CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h Terry Bowman
2025-09-25 23:53 ` Dave Jiang
2025-10-01 15:58 ` Jonathan Cameron
2025-10-02 15:25 ` Bowman, Terry
2025-10-03 20:11 ` Cheatham, Benjamin
2025-09-25 22:34 ` [PATCH v12 08/25] PCI/CXL: Introduce pcie_is_cxl() Terry Bowman
2025-10-03 20:11 ` Cheatham, Benjamin
2025-09-25 22:34 ` [PATCH v12 09/25] PCI/AER: Report CXL or PCIe bus error type in trace logging Terry Bowman
2025-10-03 20:11 ` Cheatham, Benjamin
2025-10-06 19:59 ` Bowman, Terry
2025-09-25 22:34 ` [PATCH v12 10/25] CXL/AER: Update PCI class code check to use FIELD_GET() Terry Bowman
2025-09-26 0:02 ` Dave Jiang
2025-10-01 16:12 ` Jonathan Cameron
2025-10-02 7:40 ` Lukas Wunner
2025-10-30 17:16 ` Bowman, Terry
2025-10-31 5:30 ` Lukas Wunner
2025-09-25 22:34 ` [PATCH v12 11/25] cxl/pci: Update RAS handler interfaces to also support CXL Ports Terry Bowman
2025-10-03 20:11 ` Cheatham, Benjamin
2025-09-25 22:34 ` [PATCH v12 12/25] cxl/pci: Log message if RAS registers are unmapped Terry Bowman
2025-10-03 20:11 ` Cheatham, Benjamin
2025-09-25 22:34 ` [PATCH v12 13/25] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports Terry Bowman
2025-09-26 20:44 ` Dave Jiang
2025-09-25 22:34 ` [PATCH v12 14/25] cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors Terry Bowman
2025-10-03 20:11 ` Cheatham, Benjamin
2025-09-25 22:34 ` [PATCH v12 15/25] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers Terry Bowman
2025-09-26 21:10 ` Dave Jiang
2025-10-24 10:25 ` Alejandro Lucero Palau
2025-10-24 17:15 ` Dave Jiang
2025-10-24 19:40 ` Bowman, Terry
2025-10-27 16:33 ` Alejandro Lucero Palau
2025-09-25 22:34 ` [PATCH v12 16/25] CXL/PCI: Introduce PCI_ERS_RESULT_PANIC Terry Bowman
2025-09-26 21:26 ` Dave Jiang
2025-10-01 16:14 ` Jonathan Cameron
2025-10-03 20:11 ` Cheatham, Benjamin
2025-09-25 22:34 ` [PATCH v12 17/25] cxl/pci: Introduce CXL Endpoint protocol error handlers Terry Bowman
2025-09-26 22:04 ` Dave Jiang
2025-09-30 14:06 ` Bowman, Terry
2025-09-30 16:09 ` Dave Jiang
2025-10-03 20:12 ` Cheatham, Benjamin
2025-10-06 21:07 ` Bowman, Terry
2025-09-25 22:34 ` [PATCH v12 18/25] CXL/AER: Introduce aer_cxl_vh.c in AER driver for forwarding CXL errors Terry Bowman
2025-09-26 22:56 ` Dave Jiang
2025-10-03 20:12 ` Cheatham, Benjamin
2025-09-25 22:34 ` [PATCH v12 19/25] cxl: Introduce cxl_pci_drv_bound() to check for bound driver Terry Bowman
2025-09-26 23:02 ` Dave Jiang
2025-10-02 12:27 ` Jonathan Cameron
2025-10-03 20:12 ` Cheatham, Benjamin
2025-09-25 22:34 ` [PATCH v12 20/25] PCI/AER: Dequeue forwarded CXL error Terry Bowman
2025-09-26 23:26 ` Dave Jiang
2025-10-03 20:12 ` Cheatham, Benjamin
2025-10-06 20:17 ` Dave Jiang
2025-09-25 22:34 ` [PATCH v12 21/25] CXL/PCI: Introduce CXL Port protocol error handlers Terry Bowman
2025-09-29 23:32 ` Dave Jiang
2025-10-03 20:12 ` Cheatham, Benjamin
2025-10-06 21:28 ` Bowman, Terry
2025-09-25 22:34 ` [PATCH v12 22/25] CXL/PCI: Export and rename merge_result() to pci_ers_merge_result() Terry Bowman
2025-09-26 15:01 ` kernel test robot
2025-09-26 18:10 ` kernel test robot
2025-09-25 22:34 ` [PATCH v12 23/25] CXL/PCI: Introduce CXL uncorrectable protocol error recovery Terry Bowman
2025-09-30 0:26 ` Dave Jiang
2025-09-30 14:38 ` Bowman, Terry
2025-09-30 16:13 ` Dave Jiang
2025-09-30 16:43 ` Bowman, Terry
2025-09-30 16:46 ` Dave Jiang [this message]
2025-10-01 13:58 ` Bowman, Terry
2025-10-01 15:33 ` Dave Jiang
2025-10-03 20:12 ` Cheatham, Benjamin
2025-09-25 22:34 ` [PATCH v12 24/25] CXL/PCI: Enable CXL protocol errors during CXL Port probe Terry Bowman
2025-09-30 0:28 ` Dave Jiang
2025-10-03 20:12 ` Cheatham, Benjamin
2025-09-25 22:34 ` [PATCH v12 25/25] CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup Terry Bowman
2025-10-03 20:12 ` Cheatham, Benjamin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ab001a63-47e4-4d85-b536-8103835a5b39@intel.com \
--to=dave.jiang@intel.com \
--cc=Benjamin.Cheatham@amd.com \
--cc=PradeepVineshReddy.Kodamati@amd.com \
--cc=Smita.KoralahalliChannabasappa@amd.com \
--cc=alison.schofield@intel.com \
--cc=alucerop@amd.com \
--cc=bhelgaas@google.com \
--cc=dan.carpenter@linaro.org \
--cc=dan.j.williams@intel.com \
--cc=dave@stgolabs.net \
--cc=ira.weiny@intel.com \
--cc=jonathan.cameron@huawei.com \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=lukas@wunner.de \
--cc=ming.li@zohomail.com \
--cc=rrichter@amd.com \
--cc=sathyanarayanan.kuppuswamy@linux.intel.com \
--cc=shiju.jose@huawei.com \
--cc=terry.bowman@amd.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox