Re: [RESEND v13 23/25] CXL/PCI: Introduce CXL uncorrectable protocol error recovery

public inbox for linux-pci@vger.kernel.org
 help / color / mirror / Atom feed

From: "Bowman, Terry" <terry.bowman@amd.com>
To: Dave Jiang <dave.jiang@intel.com>,
	Jonathan Cameron <jonathan.cameron@huawei.com>
Cc: dave@stgolabs.net, alison.schofield@intel.com,
	dan.j.williams@intel.com, bhelgaas@google.com,
	shiju.jose@huawei.com, ming.li@zohomail.com,
	Smita.KoralahalliChannabasappa@amd.com, rrichter@amd.com,
	dan.carpenter@linaro.org, PradeepVineshReddy.Kodamati@amd.com,
	lukas@wunner.de, Benjamin.Cheatham@amd.com,
	sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-cxl@vger.kernel.org, alucerop@amd.com, ira.weiny@intel.com,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org
Subject: Re: [RESEND v13 23/25] CXL/PCI: Introduce CXL uncorrectable protocol error recovery
Date: Wed, 5 Nov 2025 08:59:00 -0600	[thread overview]
Message-ID: <5246e21b-d226-4faf-936b-d3dffe2cc45e@amd.com> (raw)
In-Reply-To: <10115294-8be9-42af-a466-40a194cfa4e8@intel.com>



On 11/4/2025 5:43 PM, Dave Jiang wrote:
>
> On 11/4/25 11:47 AM, Jonathan Cameron wrote:
>> On Tue, 4 Nov 2025 11:03:03 -0600
>> Terry Bowman <terry.bowman@amd.com> wrote:
>>
>>> Implement cxl_do_recovery() to handle uncorrectable protocol
>>> errors (UCE), following the design of pcie_do_recovery(). Unlike PCIe,
>>> all CXL UCEs are treated as fatal and trigger a kernel panic to avoid
>>> potential CXL memory corruption.
>>>
>>> Add cxl_walk_port(), analogous to pci_walk_bridge(), to traverse the
>>> CXL topology from the error source through downstream CXL ports and
>>> endpoints.
>>>
>>> Introduce cxl_report_error_detected(), mirroring PCI's
>>> report_error_detected(), and implement device locking for the affected
>>> subtree. Endpoints require locking the PCI device (pdev->dev) and the
>>> CXL memdev (cxlmd->dev). CXL ports require locking the PCI
>>> device (pdev->dev) and the parent CXL port.
>>>
>>> The device locks should be taken early where possible. The initially
>>> reporting device will be locked after kfifo dequeue. Iterated devices
>>> will be locked in cxl_report_error_detected() and must lock the
>>> iterated devices except for the first device as it has already been
>>> locked.
>>>
>>> Export pci_aer_clear_fatal_status() for use when a UCE is not present.
>>>
>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Follow on comments around the locking stuff. If that has been there
>> a while and I didn't notice before, sorry!
>>
>>> ---
>>>
>>> Changes in v12->v13:
>>> - Add guard() before calling cxl_pci_drv_bound() (Dave Jiang)
>>> - Add guard() calls for EP (cxlds->cxlmd->dev & pdev->dev) and ports
>>>   (pdev->dev & parent cxl_port) in cxl_report_error_detected() and
>>>   cxl_handle_proto_error() (Terry)
>>> - Remove unnecessary check for endpoint port. (Dave Jiang)
>>> - Remove check for RCIEP EP in cxl_report_error_detected(). (Terry)
>>>
>>> Changes in v11->v12:
>>> - Clean up port discovery in cxl_do_recovery() (Dave)
>>> - Add PCI_EXP_TYPE_RC_END to type check in cxl_report_error_detected()
>>>
>>> Changes in v10->v11:
>>> - pci_ers_merge_results() - Move to earlier patch
>>> ---
>>>  drivers/cxl/core/ras.c | 135 ++++++++++++++++++++++++++++++++++++++++-
>>>  drivers/pci/pci.h      |   1 -
>>>  drivers/pci/pcie/aer.c |   1 +
>>>  include/linux/aer.h    |   2 +
>>>  4 files changed, 135 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>>> index 5bc144cde0ee..52c6f19564b6 100644
>>> --- a/drivers/cxl/core/ras.c
>>> +++ b/drivers/cxl/core/ras.c
>>> @@ -259,8 +259,138 @@ static void device_unlock_if(struct device *dev, bool take)
>>>  		device_unlock(dev);
>>>  }
>>>  
>>> +/**
>>> + * cxl_report_error_detected
>>> + * @dev: Device being reported
>>> + * @data: Result
>>> + * @err_pdev: Device with initial detected error. Is locked immediately
>>> + *            after KFIFO dequeue.
>>> + */
>>> +static int cxl_report_error_detected(struct device *dev, void *data, struct pci_dev *err_pdev)
>>> +{
>>> +	bool need_lock = (dev != &err_pdev->dev);
>> Add a comment on why this controls need for locking.
>> The resulting code is complex enough I'd be tempted to split the whole
>> thing into locked and unlocked variants.
> May not be a bad idea. Terry, can you see if this would reduce the complexity?
>
> DJ 

I agree and will split into 2 functions. Do you have naming suggestions for a function copy 
without locks? Is cxl_report_error_detected_nolock() OK to go along with existing 
cxl_report_error_detected()? 

Terry

>>> +	pci_ers_result_t vote, *result = data;
>>> +	struct pci_dev *pdev;
>>> +
>>> +	if (!dev || !dev_is_pci(dev))
>>> +		return 0;
>>> +	pdev = to_pci_dev(dev);
>>> +
>>> +	device_lock_if(&pdev->dev, need_lock);
>>> +	if (is_pcie_endpoint(pdev) && !cxl_pci_drv_bound(pdev)) {
>>> +		device_unlock_if(&pdev->dev, need_lock);
>>> +		return PCI_ERS_RESULT_NONE;
>>> +	}
>>> +
>>> +	if (pdev->aer_cap)
>>> +		pci_clear_and_set_config_dword(pdev,
>>> +					       pdev->aer_cap + PCI_ERR_COR_STATUS,
>>> +					       0, PCI_ERR_COR_INTERNAL);
>>> +
>>> +	if (is_pcie_endpoint(pdev)) {
>>> +		struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>>> +
>>> +		device_lock_if(&cxlds->cxlmd->dev, need_lock);
>>> +		vote = cxl_error_detected(&cxlds->cxlmd->dev);
>>> +		device_unlock_if(&cxlds->cxlmd->dev, need_lock);
>>> +	} else {
>>> +		vote = cxl_port_error_detected(dev);
>>> +	}
>>> +
>>> +	pcie_clear_device_status(pdev);
>>> +	*result = pcie_ers_merge_result(*result, vote);
>>> +	device_unlock_if(&pdev->dev, need_lock);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +/**
>>> + * cxl_walk_port
>> Needs a short description I think to count as valid kernel-doc and
>> stop the tool moaning if anyone runs it on this.
>>
>>> + *
>>> + * @port: Port be traversed into
>>> + * @cb: Callback for handling the CXL Ports
>>> + * @userdata: Result
>>> + * @err_pdev: Device with initial detected error. Is locked immediately
>>> + *            after KFIFO dequeue.
>>> + */
>>> +static void cxl_walk_port(struct cxl_port *port,
>>> +			  int (*cb)(struct device *, void *, struct pci_dev *),
>>> +			  void *userdata,
>>> +			  struct pci_dev *err_pdev)
>>> +{
>>> +	struct cxl_port *err_port __free(put_cxl_port) = get_cxl_port(err_pdev);
>>> +	bool need_lock = (port != err_port);
>>> +	struct cxl_dport *dport = NULL;
>>> +	unsigned long index;
>>> +
>>> +	device_lock_if(&port->dev, need_lock);
>>> +	if (is_cxl_endpoint(port)) {
>>> +		cb(port->uport_dev->parent, userdata, err_pdev);
>>> +		device_unlock_if(&port->dev, need_lock);
>>> +		return;
>>> +	}
>>> +
>>> +	if (port->uport_dev && dev_is_pci(port->uport_dev))
>>> +		cb(port->uport_dev, userdata, err_pdev);
>>> +
>>> +	/*
>>> +	 * Iterate over the set of Downstream Ports recorded in port->dports (XArray):
>>> +	 *  - For each dport, attempt to find a child CXL Port whose parent dport
>>> +	 *    match.
>>> +	 *  - Invoke the provided callback on the dport's device.
>>> +	 *  - If a matching child CXL Port device is found, recurse into that port to
>>> +	 *    continue the walk.
>>> +	 */
>>> +	xa_for_each(&port->dports, index, dport)
>>> +	{
>> Move that to line above for normal kernel loop formatting.
>>
>> 	xa_for_each(&port->dports, index, dport) {
>>
>>> +		struct device *child_port_dev __free(put_device) =
>>> +			bus_find_device(&cxl_bus_type, &port->dev, dport->dport_dev,
>>> +					match_port_by_parent_dport);
>>> +
>>> +		cb(dport->dport_dev, userdata, err_pdev);
>>> +		if (child_port_dev)
>>> +			cxl_walk_port(to_cxl_port(child_port_dev), cb, userdata, err_pdev);
>>> +	}
>>> +	device_unlock_if(&port->dev, need_lock);
>>> +}
>>> +
>>>  
>>>  void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
>>> @@ -483,16 +613,15 @@ static void cxl_proto_err_work_fn(struct work_struct *work)
>>>  			if (!cxl_pci_drv_bound(pdev))
>>>  				return;
>>>  			cxlmd_dev = &cxlds->cxlmd->dev;
>>> -			device_lock_if(cxlmd_dev, cxlmd_dev);
>>>  		} else {
>>>  			cxlmd_dev = NULL;
>>>  		}
>>>  
>>> +		/* Lock the CXL parent Port */
>>>  		struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
>>> -		if (!port)
>>> -			return;
>>>  		guard(device)(&port->dev);
>>>  
>>> +		device_lock_if(cxlmd_dev, cxlmd_dev);
>>>  		cxl_handle_proto_error(&wd);
>>>  		device_unlock_if(cxlmd_dev, cxlmd_dev);
>> Same issue on these helpers, but I'm also not sure why moving them in this
>> patch makes sense. I'm not sure what changed.
>>
>> Perhaps this is stuff that ended up in wrong patch?
>>>  	}
>>

next prev parent reply	other threads:[~2025-11-05 14:59 UTC|newest]

Thread overview: 103+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
2025-11-04 17:02 ` [RESEND v13 01/25] CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h Terry Bowman
2025-11-04 17:50   ` Jonathan Cameron
2025-11-19  3:19   ` dan.j.williams
2025-12-08 18:04   ` Bjorn Helgaas
2025-12-08 22:13     ` Bowman, Terry
2025-11-04 17:02 ` [RESEND v13 02/25] PCI/CXL: Introduce pcie_is_cxl() Terry Bowman
2025-11-04 17:52   ` Jonathan Cameron
2025-11-19  3:19   ` dan.j.williams
2025-11-19 15:55     ` Bowman, Terry
2025-11-19 23:34       ` dan.j.williams
2025-11-21 20:31   ` Gregory Price
2025-11-04 17:02 ` [RESEND v13 03/25] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
2025-11-04 17:53   ` Jonathan Cameron
2025-11-19  3:20   ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 04/25] cxl/pci: Remove unnecessary CXL RCH " Terry Bowman
2025-11-19  3:20   ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 05/25] cxl: Remove CXL VH handling in CONFIG_PCIEAER_CXL conditional blocks from core/pci.c Terry Bowman
2025-11-19  3:20   ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 06/25] cxl: Move CXL driver's RCH error handling into core/ras_rch.c Terry Bowman
2025-11-04 18:03   ` Jonathan Cameron
2025-11-19  3:20   ` dan.j.williams
2025-11-19 16:07     ` Bowman, Terry
2025-11-04 17:02 ` [RESEND v13 07/25] CXL/AER: Replace device_lock() in cxl_rch_handle_error_iter() with guard() lock Terry Bowman
2025-11-04 18:05   ` Jonathan Cameron
2025-11-04 19:53   ` Dave Jiang
2025-11-19  3:20   ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 08/25] CXL/AER: Move AER drivers RCH error handling into pcie/aer_cxl_rch.c Terry Bowman
2025-11-19  3:20   ` dan.j.williams
2025-11-19  8:26     ` Lukas Wunner
2025-11-19 23:36       ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 09/25] PCI/AER: Report CXL or PCIe bus error type in trace logging Terry Bowman
2025-11-04 18:08   ` Jonathan Cameron
2025-11-04 18:26   ` Bjorn Helgaas
2025-11-04 17:02 ` [RESEND v13 10/25] cxl/pci: Update RAS handler interfaces to also support CXL Ports Terry Bowman
2025-11-04 18:10   ` Jonathan Cameron
2025-11-11  8:17   ` Alison Schofield
2025-11-19  3:19   ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 11/25] cxl/pci: Log message if RAS registers are unmapped Terry Bowman
2025-11-19  3:27   ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 12/25] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports Terry Bowman
2025-11-19 21:23   ` dan.j.williams
2025-11-19 22:02     ` Bowman, Terry
2025-11-19 23:40       ` dan.j.williams
2025-11-21 14:56         ` Bowman, Terry
2025-11-04 17:02 ` [RESEND v13 13/25] cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors Terry Bowman
2025-11-05  8:30   ` Alejandro Lucero Palau
2025-11-19 22:00   ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 14/25] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers Terry Bowman
2025-11-04 18:15   ` Jonathan Cameron
2025-11-04 20:03   ` Dave Jiang
2025-11-11  8:23   ` Alison Schofield
2025-11-04 17:02 ` [RESEND v13 15/25] CXL/PCI: Introduce PCI_ERS_RESULT_PANIC Terry Bowman
2025-11-04 19:03   ` Bjorn Helgaas
2025-11-20  0:17   ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 16/25] CXL/AER: Introduce pcie/aer_cxl_vh.c in AER driver for forwarding CXL errors Terry Bowman
2025-11-20  0:44   ` dan.j.williams
2025-11-20  0:53   ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 17/25] cxl: Introduce cxl_pci_drv_bound() to check for bound driver Terry Bowman
2025-11-05 17:51   ` Gregory Price
2025-11-05 19:03     ` Gregory Price
2025-11-05 22:26       ` Gregory Price
2025-11-06 17:11         ` Gregory Price
2025-11-06 23:32         ` Bowman, Terry
2025-11-11  8:33   ` Alison Schofield
2025-11-13 21:42     ` Alison Schofield
2025-11-13 22:39       ` Bowman, Terry
2025-11-20  1:24   ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 18/25] cxl: Change CXL handlers to use guard() instead of scoped_guard() Terry Bowman
2025-11-04 18:18   ` Jonathan Cameron
2025-11-04 20:15   ` Dave Jiang
2025-11-04 17:02 ` [RESEND v13 19/25] cxl/pci: Introduce CXL protocol error handlers for Endpoints Terry Bowman
2025-11-04 18:29   ` Jonathan Cameron
2025-11-04 19:09   ` Bjorn Helgaas
2025-11-04 17:03 ` [RESEND v13 20/25] CXL/PCI: Introduce CXL Port protocol error handlers Terry Bowman
2025-11-04 18:32   ` Jonathan Cameron
2025-11-04 21:20   ` Dave Jiang
2025-11-04 21:27     ` Bowman, Terry
2025-11-04 23:39       ` Dave Jiang
2025-11-04 17:03 ` [RESEND v13 21/25] PCI/AER: Dequeue forwarded CXL error Terry Bowman
2025-11-04 18:40   ` Jonathan Cameron
2025-11-04 18:45   ` Bjorn Helgaas
2025-11-20  3:33   ` dan.j.williams
2025-11-04 17:03 ` [RESEND v13 22/25] CXL/PCI: Export and rename merge_result() to pci_ers_merge_result() Terry Bowman
2025-11-04 18:41   ` Jonathan Cameron
2025-11-04 19:03   ` Bjorn Helgaas
2025-11-14 15:20     ` Bowman, Terry
2025-11-14 16:09       ` Jonathan Cameron
2025-11-04 17:03 ` [RESEND v13 23/25] CXL/PCI: Introduce CXL uncorrectable protocol error recovery Terry Bowman
2025-11-04 18:47   ` Jonathan Cameron
2025-11-04 23:43     ` Dave Jiang
2025-11-05 14:59       ` Bowman, Terry [this message]
2025-11-05 16:10         ` Dave Jiang
2025-11-11  8:37   ` Alison Schofield
2025-12-08 18:40   ` Bjorn Helgaas
2025-11-04 17:03 ` [RESEND v13 24/25] CXL/PCI: Enable CXL protocol errors during CXL Port probe Terry Bowman
2025-11-04 17:03 ` [RESEND v13 25/25] CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup Terry Bowman
2025-11-20  3:10   ` dan.j.williams
2025-11-04 19:11 ` [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Bjorn Helgaas
2025-11-04 21:54   ` Bowman, Terry
2025-11-04 22:12     ` Bjorn Helgaas
2025-12-04 17:30       ` Bowman, Terry
2025-12-08 18:42         ` Bjorn Helgaas

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5246e21b-d226-4faf-936b-d3dffe2cc45e@amd.com \
    --to=terry.bowman@amd.com \
    --cc=Benjamin.Cheatham@amd.com \
    --cc=PradeepVineshReddy.Kodamati@amd.com \
    --cc=Smita.KoralahalliChannabasappa@amd.com \
    --cc=alison.schofield@intel.com \
    --cc=alucerop@amd.com \
    --cc=bhelgaas@google.com \
    --cc=dan.carpenter@linaro.org \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=ira.weiny@intel.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=lukas@wunner.de \
    --cc=ming.li@zohomail.com \
    --cc=rrichter@amd.com \
    --cc=sathyanarayanan.kuppuswamy@linux.intel.com \
    --cc=shiju.jose@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox