All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
To: Terry Bowman <terry.bowman@amd.com>
Cc: <dave@stgolabs.net>, <dave.jiang@intel.com>,
	<alison.schofield@intel.com>, <dan.j.williams@intel.com>,
	<bhelgaas@google.com>, <shiju.jose@huawei.com>,
	<ming.li@zohomail.com>, <Smita.KoralahalliChannabasappa@amd.com>,
	<rrichter@amd.com>, <dan.carpenter@linaro.org>,
	<PradeepVineshReddy.Kodamati@amd.com>, <lukas@wunner.de>,
	<Benjamin.Cheatham@amd.com>,
	<sathyanarayanan.kuppuswamy@linux.intel.com>,
	<linux-cxl@vger.kernel.org>, <linux-kernel@vger.kernel.org>,
	<linux-pci@vger.kernel.org>
Subject: Re: [PATCH v10 06/17] PCI/AER: Dequeue forwarded CXL error
Date: Fri, 27 Jun 2025 12:00:50 +0100	[thread overview]
Message-ID: <20250627120050.00001461@huawei.com> (raw)
In-Reply-To: <20250626224252.1415009-7-terry.bowman@amd.com>

On Thu, 26 Jun 2025 17:42:41 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER driver is now designed to forward CXL protocol errors to the CXL
> driver. Update the CXL driver with functionality to dequeue the forwarded
> CXL error from the kfifo. Also, update the CXL driver to begin the protocol
> error handling processing using the work received from the FIFO.
> 
> Introduce function cxl_proto_err_work_fn() to dequeue work forwarded by the

After earlier update it already exists, you are just filling it in here.
So reword this.

> AER service driver. This will begin the CXL protocol error processing with
> a call to cxl_handle_proto_error().
> 
> Update cxl/core/native_ras.c by adding cxl_rch_handle_error_iter() that was
> previously in the AER driver. Add check that Endpoint is bound to a CXL
> driver.
> 
> Introduce logic to take the SBDF values from 'struct cxl_proto_error_info'
> and use in discovering the erring PCI device. The call to pci_get_domain_bus_and_slot()
> will return a reference counted 'struct pci_dev *'. This will serve as
> reference count to prevent releasing the CXL Endpoint's mapped RAS while
> handling the error. Use scope base __free() to put the reference count.
> This will change when adding support for CXL port devices in the future.
> 
> Implement cxl_handle_proto_error() to differentiate between Restricted CXL
> Host (RCH) protocol errors and CXL virtual host (VH) protocol errors. RCH
> errors will be processed with a call to walk the associated Root Complex
> Event Collector's (RCEC) secondary bus looking for the Root Complex
> Integrated Endpoint (RCiEP) to handle the RCH error. Export pcie_walk_rcec()
> so the CXL driver can walk the RCEC's downstream bus, searching for the
> RCiEP.

I'd drop the RCiEP description beyond saying 'handle it as before'
as I think there is no major change in this.

> 
> VH correctable error (CE) processing will call the CXL CE handler. VH
> uncorrectable errors (UCE) will call cxl_do_recovery(), implemented as a
> stub for now and to be updated in future patch. Export pci_aer_clean_fatal_status()
> and pci_clean_device_status() used to clean up AER status after handling.
> 
> Maintain the locking logic found in the original AER driver. Replace the
> existing device_lock() in cxl_rch_handle_error_iter() to use guard(device)
> lock for maintainability.

This change is fine, but it is an AND change in a patch doing quite a few other
things.  So do it in a trivial precursor patch.  Look at the other things in this
description and see if they can be factored out too so that the guts of this
patch are much easier to spot.


> CE errors did not include locking in previous driver
> implementation. Leave the updated CE handling path as-is.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
A few comments inline.

Jonathan

> ---
>  drivers/cxl/core/native_ras.c | 87 +++++++++++++++++++++++++++++++++++
>  drivers/cxl/cxlpci.h          |  1 +
>  drivers/cxl/pci.c             |  6 +++
>  drivers/pci/pci.c             |  1 +
>  drivers/pci/pci.h             |  7 ---
>  drivers/pci/pcie/aer.c        |  1 +
>  drivers/pci/pcie/cxl_aer.c    | 41 -----------------
>  drivers/pci/pcie/rcec.c       |  1 +
>  include/linux/aer.h           |  2 +
>  include/linux/pci.h           | 10 ++++
>  10 files changed, 109 insertions(+), 48 deletions(-)
> 
> diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
> index 011815ddaae3..5bd79d5019e7 100644
> --- a/drivers/cxl/core/native_ras.c
> +++ b/drivers/cxl/core/native_ras.c
> @@ -6,9 +6,96 @@
>  #include <cxl/event.h>
>  #include <cxlmem.h>
>  #include <core/core.h>
> +#include <cxlpci.h>
> +
> +static void cxl_do_recovery(struct pci_dev *pdev)
> +{
> +}
> +
> +static bool is_cxl_rcd(struct pci_dev *pdev)
> +{
> +	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END)
> +		return false;
> +
> +	/*
> +	 * The capability, status, and control fields in Device 0,
> +	 * Function 0 DVSEC control the CXL functionality of the
> +	 * entire device (CXL 3.2, 8.1.3).
> +	 */
> +	if (pdev->devfn != PCI_DEVFN(0, 0))
> +		return false;
> +
> +	/*
> +	 * CXL Memory Devices must have the 502h class code set (CXL

Short wrap.

> +	 * 3.2, 8.1.12.1).
> +	 */
> +	if (FIELD_GET(PCI_CLASS_CODE_MASK, pdev->class) != PCI_CLASS_MEMORY_CXL)
> +		return false;
> +
> +	return true;


If this isn't going to get more complex

	return FIELD_GET(...)

> +}
> +
> +static int cxl_rch_handle_error_iter(struct pci_dev *pdev, void *data)
> +{
> +	struct cxl_proto_error_info *err_info = data;
> +
> +	guard(device)(&pdev->dev);
> +
> +	if (!is_cxl_rcd(pdev) || !cxl_pci_drv_bound(pdev))
> +		return 0;
> +
> +	if (err_info->severity == AER_CORRECTABLE)
> +		cxl_cor_error_detected(pdev);
> +	else
> +		cxl_error_detected(pdev, pci_channel_io_frozen);
> +
> +	return 1;
> +}
> +
> +static void cxl_handle_proto_error(struct cxl_proto_error_info *err_info)
> +{
> +	struct pci_dev *pdev __free(pci_dev_put) =
> +		pci_get_domain_bus_and_slot(err_info->segment,
> +					    err_info->bus,
> +					    err_info->devfn);
> +
> +	if (!pdev) {
> +		pr_err("Failed to find the CXL device (SBDF=%x:%x:%x:%x)\n",
> +		       err_info->segment, err_info->bus, PCI_SLOT(err_info->devfn),
> +		       PCI_FUNC(err_info->devfn));
> +		return;
> +	}
> +
> +	/*
> +	 * Internal errors of an RCEC indicate an AER error in an
> +	 * RCH's downstream port. Check and handle them in the CXL.mem
> +	 * device driver.

I don't think the reference here to the CXL.mem driver is that helpful
given the code is immediate above. Maybe move the comment?


> +	 */
> +	if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_EC)
> +		return pcie_walk_rcec(pdev, cxl_rch_handle_error_iter, err_info);
> +
> +	if (err_info->severity == AER_CORRECTABLE) {
> +		int aer = pdev->aer_cap;
> +
> +		if (aer)
> +			pci_clear_and_set_config_dword(pdev,
> +						       aer + PCI_ERR_COR_STATUS,
> +						       0, PCI_ERR_COR_INTERNAL);
> +
> +		cxl_cor_error_detected(pdev);
> +
> +		pcie_clear_device_status(pdev);
> +	} else {
> +		cxl_do_recovery(pdev);
> +	}
> +}



  reply	other threads:[~2025-06-27 11:01 UTC|newest]

Thread overview: 87+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
2025-06-26 22:42 ` [PATCH v10 01/17] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
2025-07-18 17:55   ` Dave Jiang
2025-07-23 21:58   ` dan.j.williams
2025-07-23 22:15     ` Dave Jiang
2025-06-26 22:42 ` [PATCH v10 02/17] PCI/CXL: Add pcie_is_cxl() Terry Bowman
2025-07-23 22:30   ` dan.j.williams
2025-07-23 23:21     ` Bowman, Terry
2025-07-24 18:00       ` dan.j.williams
2025-08-09 10:56   ` Alejandro Lucero Palau
2025-08-11 19:14     ` Bowman, Terry
2025-08-11 23:14       ` dan.j.williams
2025-06-26 22:42 ` [PATCH v10 03/17] PCI/AER: Report CXL or PCIe bus error type in trace logging Terry Bowman
2025-06-26 23:25   ` Sathyanarayanan Kuppuswamy
2025-06-27 14:14     ` Bowman, Terry
2025-06-27  9:53   ` Jonathan Cameron
2025-07-02 16:00     ` Bowman, Terry
2025-06-27 11:32   ` Shiju Jose
2025-06-27 14:24     ` Bowman, Terry
2025-07-01 21:27   ` Dave Jiang
2025-07-23 22:56   ` dan.j.williams
2025-06-26 22:42 ` [PATCH v10 04/17] CXL/AER: Introduce CXL specific AER driver file Terry Bowman
2025-06-26 23:42   ` Sathyanarayanan Kuppuswamy
2025-06-27 10:12     ` Jonathan Cameron
2025-06-27 14:29     ` Bowman, Terry
2025-07-24  0:01   ` dan.j.williams
2025-07-24 17:06     ` Bowman, Terry
2025-07-24 20:32       ` dan.j.williams
2025-07-24  1:16   ` dan.j.williams
2025-07-24 17:02     ` Bowman, Terry
2025-07-24 20:23       ` dan.j.williams
2025-06-26 22:42 ` [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors Terry Bowman
2025-06-27 10:24   ` Jonathan Cameron
2025-07-02 16:21     ` Bowman, Terry
2025-07-02 19:54       ` Dan Carpenter
2025-07-02 19:57         ` Bowman, Terry
2025-07-03 10:06       ` Jonathan Cameron
2025-07-01 21:53   ` Dave Jiang
2025-07-02 17:10     ` Bowman, Terry
2025-07-24  2:01   ` dan.j.williams
2025-07-24 17:21     ` Bowman, Terry
2025-07-24 20:55       ` dan.j.williams
2025-06-26 22:42 ` [PATCH v10 06/17] PCI/AER: Dequeue forwarded CXL error Terry Bowman
2025-06-27 11:00   ` Jonathan Cameron [this message]
2025-07-02 17:51     ` Bowman, Terry
2025-07-01 23:04   ` Dave Jiang
2025-07-02 17:56     ` Bowman, Terry
2025-07-03 10:11       ` Jonathan Cameron
2025-07-25  0:38   ` dan.j.williams
2025-06-26 22:42 ` [PATCH v10 07/17] CXL/PCI: Introduce CXL uncorrectable protocol error recovery Terry Bowman
2025-06-27 11:05   ` Jonathan Cameron
2025-07-02 21:06     ` Bowman, Terry
2025-06-27 12:27   ` Shiju Jose
2025-07-02 21:34     ` Bowman, Terry
2025-06-26 22:42 ` [PATCH v10 08/17] cxl/pci: Move RAS initialization to cxl_port driver Terry Bowman
2025-06-27 11:12   ` Jonathan Cameron
2025-07-18 18:01   ` Dave Jiang
2025-06-26 22:42 ` [PATCH v10 09/17] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers Terry Bowman
2025-06-27 11:17   ` Jonathan Cameron
2025-07-02 21:41     ` Bowman, Terry
2025-07-18 21:28   ` Dave Jiang
2025-07-18 21:55     ` Bowman, Terry
2025-07-18 22:01       ` Dave Jiang
2025-07-18 22:40         ` Bowman, Terry
2025-07-18 22:45           ` Dave Jiang
2025-06-26 22:42 ` [PATCH v10 10/17] cxl/pci: Update RAS handler interfaces to also support CXL Ports Terry Bowman
2025-06-26 22:42 ` [PATCH v10 11/17] cxl/pci: Log message if RAS registers are unmapped Terry Bowman
2025-07-21 21:56   ` Dave Jiang
2025-06-26 22:42 ` [PATCH v10 12/17] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports Terry Bowman
2025-06-27 12:22   ` Shiju Jose
2025-07-02  1:18     ` Alison Schofield
2025-07-02 22:07       ` Bowman, Terry
2025-07-02 21:56     ` Bowman, Terry
2025-06-26 22:42 ` [PATCH v10 13/17] cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors Terry Bowman
2025-06-27 11:48   ` Jonathan Cameron
2025-07-21 22:17   ` Dave Jiang
2025-06-26 22:42 ` [PATCH v10 14/17] cxl/pci: Introduce CXL Endpoint protocol error handlers Terry Bowman
2025-06-27 11:52   ` Jonathan Cameron
2025-06-27 12:27   ` Shiju Jose
2025-07-21 22:35   ` Dave Jiang
2025-07-22 18:23     ` Bowman, Terry
2025-06-26 22:42 ` [PATCH v10 15/17] CXL/PCI: Introduce CXL Port " Terry Bowman
2025-06-26 22:42 ` [PATCH v10 16/17] CXL/PCI: Enable CXL protocol errors during CXL Port probe Terry Bowman
2025-06-26 22:42 ` [PATCH v10 17/17] CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup Terry Bowman
2025-07-23 21:55 ` [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging dan.j.williams
2025-07-24 15:58   ` Bowman, Terry
2025-08-18 15:18 ` Joshua Hahn

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250627120050.00001461@huawei.com \
    --to=jonathan.cameron@huawei.com \
    --cc=Benjamin.Cheatham@amd.com \
    --cc=PradeepVineshReddy.Kodamati@amd.com \
    --cc=Smita.KoralahalliChannabasappa@amd.com \
    --cc=alison.schofield@intel.com \
    --cc=bhelgaas@google.com \
    --cc=dan.carpenter@linaro.org \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=lukas@wunner.de \
    --cc=ming.li@zohomail.com \
    --cc=rrichter@amd.com \
    --cc=sathyanarayanan.kuppuswamy@linux.intel.com \
    --cc=shiju.jose@huawei.com \
    --cc=terry.bowman@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.