All of lore.kernel.org
 help / color / mirror / Atom feed
From: Dave Jiang <dave.jiang@intel.com>
To: Terry Bowman <terry.bowman@amd.com>,
	PradeepVineshReddy.Kodamati@amd.com, dave@stgolabs.net,
	jonathan.cameron@huawei.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, ira.weiny@intel.com,
	dan.j.williams@intel.com, bhelgaas@google.com, bp@alien8.de,
	ming.li@zohomail.com, shiju.jose@huawei.com,
	dan.carpenter@linaro.org, Smita.KoralahalliChannabasappa@amd.com,
	kobayashi.da-06@fujitsu.com, yanfei.xu@intel.com,
	rrichter@amd.com, peterz@infradead.org, coly.li@suse.de,
	uaisheng.ye@intel.com, fabio.m.de.francesco@linux.intel.com,
	ilpo.jarvinen@linux.intel.com, yazen.ghannam@amd.com,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org
Subject: Re: [PATCH v9 04/16] PCI/AER: Dequeue forwarded CXL error
Date: Fri, 6 Jun 2025 08:57:54 -0700	[thread overview]
Message-ID: <81214183-fd94-428b-abeb-3ec3d2688030@intel.com> (raw)
In-Reply-To: <20250603172239.159260-5-terry.bowman@amd.com>



On 6/3/25 10:22 AM, Terry Bowman wrote:
> The AER driver is now designed to forward CXL protocol errors to the CXL
> driver. Update the CXL driver with functionality to dequeue the forwarded
> CXL error from the kfifo. Also, update the CXL driver to begin the protocol
> error handling processing using the work received from the FIFO.
> 
> Introduce function cxl_prot_err_work_fn() to dequeue work forwarded by the
> AER service driver. This will begin the CXL protocol error processing
> with a call to cxl_handle_prot_error().
> 
> Update cxl/core/ras.c by adding cxl_rch_handle_error_iter() that was
> previously in the AER driver.
> 
> Introduce sbdf_to_pci() to take the SBDF values from 'struct cxl_prot_error_info'
> and use in discovering the erring PCI device. Make scope based reference
> increments/decrements for the discovered PCI device and the associated
> CXL device.
> 
> Implement cxl_handle_prot_error() to differentiate between Restricted CXL
> Host (RCH) protocol errors and CXL virtual host (VH) protocol errors.
> RCH errors will be processed with a call to walk the associated Root
> Complex Event Collector's (RCEC) secondary bus looking for the Root Complex
> Integrated Endpoint (RCiEP) to handle the RCH error. Export pcie_walk_rcec()
> so the CXL driver can walk the RCEC's downstream bus, searching for
> the RCiEP.
> 
> VH correctable error (CE) processing will call the CXL CE handler. VH
> uncorrectable errors (UCE) will call cxl_do_recovery(), implemented as a
> stub for now and to be updated in future patch. Export pci_aer_clean_fatal_status()
> and pci_clean_device_status() used to clean up AER status after handling.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/core/ras.c  | 92 +++++++++++++++++++++++++++++++++++++++++
>  drivers/pci/pci.c       |  1 +
>  drivers/pci/pci.h       |  8 ----
>  drivers/pci/pcie/aer.c  |  1 +
>  drivers/pci/pcie/rcec.c |  1 +
>  include/linux/aer.h     |  2 +
>  include/linux/pci.h     | 10 +++++
>  7 files changed, 107 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index d35525e79e04..9ed5c682e128 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -110,8 +110,100 @@ static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
>  
>  #ifdef CONFIG_PCIEAER_CXL
>  
> +static void cxl_do_recovery(struct pci_dev *pdev)
> +{
> +}
> +
> +static int cxl_rch_handle_error_iter(struct pci_dev *pdev, void *data)
> +{
> +	struct cxl_prot_error_info *err_info = data;
> +	struct pci_dev *pdev_ref __free(pci_dev_put) = pci_dev_get(pdev);
> +	struct cxl_dev_state *cxlds;
> +
> +	/*
> +	 * The capability, status, and control fields in Device 0,
> +	 * Function 0 DVSEC control the CXL functionality of the
> +	 * entire device (CXL 3.0, 8.1.3).
> +	 */
> +	if (pdev->devfn != PCI_DEVFN(0, 0))
> +		return 0;
> +
> +	/*
> +	 * CXL Memory Devices must have the 502h class code set (CXL
> +	 * 3.0, 8.1.12.1).
> +	 */
> +	if ((pdev->class >> 8) != PCI_CLASS_MEMORY_CXL)

Should use FIELD_GET() to be consistent with the rest of CXL code base

> +		return 0;
> +
> +	if (!is_cxl_memdev(&pdev->dev) || !pdev->dev.driver)

I think you need to hold the pdev->dev lock while checking if the driver exists.

> +		return 0;
> +
> +	cxlds = pci_get_drvdata(pdev);
> +	struct device *dev __free(put_device) = get_device(&cxlds->cxlmd->dev);

Maybe a comment on why cxlmd->dev ref is needed here.

> +
> +	if (err_info->severity == AER_CORRECTABLE)
> +		cxl_cor_error_detected(pdev);
> +	else
> +		cxl_do_recovery(pdev);
> +
> +	return 1;
> +}
> +
> +static struct pci_dev *sbdf_to_pci(struct cxl_prot_error_info *err_info)
> +{
> +	unsigned int devfn = PCI_DEVFN(err_info->device,
> +				       err_info->function);
> +	struct pci_dev *pdev __free(pci_dev_put) =
> +		pci_get_domain_bus_and_slot(err_info->segment,
> +					    err_info->bus,
> +					    devfn);

Looks like DanC already caught that. Maybe have this function return with a ref held. I would also add a comment for the function mention that the caller need to put the device.

> +	return pdev;
> +}
> +
> +static void cxl_handle_prot_error(struct cxl_prot_error_info *err_info)
> +{
> +	struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(sbdf_to_pci(err_info));
> +
> +	if (!pdev) {
> +		pr_err("Failed to find the CXL device\n");
> +		return;
> +	}
> +
> +	/*
> +	 * Internal errors of an RCEC indicate an AER error in an
> +	 * RCH's downstream port. Check and handle them in the CXL.mem
> +	 * device driver.
> +	 */
> +	if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_EC)
> +		return pcie_walk_rcec(pdev, cxl_rch_handle_error_iter, err_info);
> +

cxl_rch_handle_error_iter() holds the pdev device lock when handling errors. Does the code block below need locking?

DJ

> +	if (err_info->severity == AER_CORRECTABLE) {
> +		int aer = pdev->aer_cap;
> +		struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> +		struct device *dev __free(put_device) = get_device(&cxlds->cxlmd->dev);
> +
> +		if (aer)
> +			pci_clear_and_set_config_dword(pdev,
> +						       aer + PCI_ERR_COR_STATUS,
> +						       0, PCI_ERR_COR_INTERNAL);
> +
> +		cxl_cor_error_detected(pdev);
> +
> +		pcie_clear_device_status(pdev);
> +	} else {
> +		cxl_do_recovery(pdev);
> +	}
> +}
> +
>  static void cxl_prot_err_work_fn(struct work_struct *work)
>  {
> +	struct cxl_prot_err_work_data wd;
> +
> +	while (cxl_prot_err_kfifo_get(&wd)) {
> +		struct cxl_prot_error_info *err_info = &wd.err_info;
> +
> +		cxl_handle_prot_error(err_info);
> +	}
>  }
>  
>  #else
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e77d5b53c0ce..524ac32b744a 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2328,6 +2328,7 @@ void pcie_clear_device_status(struct pci_dev *dev)
>  	pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &sta);
>  	pcie_capability_write_word(dev, PCI_EXP_DEVSTA, sta);
>  }
> +EXPORT_SYMBOL_NS_GPL(pcie_clear_device_status, "CXL");
>  #endif
>  
>  /**
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index d6296500b004..3c54a5ed803e 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -649,16 +649,10 @@ static inline bool pci_dpc_recovered(struct pci_dev *pdev) { return false; }
>  void pci_rcec_init(struct pci_dev *dev);
>  void pci_rcec_exit(struct pci_dev *dev);
>  void pcie_link_rcec(struct pci_dev *rcec);
> -void pcie_walk_rcec(struct pci_dev *rcec,
> -		    int (*cb)(struct pci_dev *, void *),
> -		    void *userdata);
>  #else
>  static inline void pci_rcec_init(struct pci_dev *dev) { }
>  static inline void pci_rcec_exit(struct pci_dev *dev) { }
>  static inline void pcie_link_rcec(struct pci_dev *rcec) { }
> -static inline void pcie_walk_rcec(struct pci_dev *rcec,
> -				  int (*cb)(struct pci_dev *, void *),
> -				  void *userdata) { }
>  #endif
>  
>  #ifdef CONFIG_PCI_ATS
> @@ -967,7 +961,6 @@ void pci_no_aer(void);
>  void pci_aer_init(struct pci_dev *dev);
>  void pci_aer_exit(struct pci_dev *dev);
>  extern const struct attribute_group aer_stats_attr_group;
> -void pci_aer_clear_fatal_status(struct pci_dev *dev);
>  int pci_aer_clear_status(struct pci_dev *dev);
>  int pci_aer_raw_clear_status(struct pci_dev *dev);
>  void pci_save_aer_state(struct pci_dev *dev);
> @@ -976,7 +969,6 @@ void pci_restore_aer_state(struct pci_dev *dev);
>  static inline void pci_no_aer(void) { }
>  static inline void pci_aer_init(struct pci_dev *d) { }
>  static inline void pci_aer_exit(struct pci_dev *d) { }
> -static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
>  static inline int pci_aer_clear_status(struct pci_dev *dev) { return -EINVAL; }
>  static inline int pci_aer_raw_clear_status(struct pci_dev *dev) { return -EINVAL; }
>  static inline void pci_save_aer_state(struct pci_dev *dev) { }
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 5350fa5be784..6e88331c6303 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -290,6 +290,7 @@ void pci_aer_clear_fatal_status(struct pci_dev *dev)
>  	if (status)
>  		pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, status);
>  }
> +EXPORT_SYMBOL_GPL(pci_aer_clear_fatal_status);
>  
>  /**
>   * pci_aer_raw_clear_status - Clear AER error registers.
> diff --git a/drivers/pci/pcie/rcec.c b/drivers/pci/pcie/rcec.c
> index d0bcd141ac9c..fb6cf6449a1d 100644
> --- a/drivers/pci/pcie/rcec.c
> +++ b/drivers/pci/pcie/rcec.c
> @@ -145,6 +145,7 @@ void pcie_walk_rcec(struct pci_dev *rcec, int (*cb)(struct pci_dev *, void *),
>  
>  	walk_rcec(walk_rcec_helper, &rcec_data);
>  }
> +EXPORT_SYMBOL_NS_GPL(pcie_walk_rcec, "CXL");
>  
>  void pci_rcec_init(struct pci_dev *dev)
>  {
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 550407240ab5..c9a18eca16f8 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -77,12 +77,14 @@ struct cxl_prot_err_work_data {
>  
>  #if defined(CONFIG_PCIEAER)
>  int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
> +void pci_aer_clear_fatal_status(struct pci_dev *dev);
>  int pcie_aer_is_native(struct pci_dev *dev);
>  #else
>  static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
>  {
>  	return -EINVAL;
>  }
> +static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
>  static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
>  #endif
>  
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index bff3009f9ff0..cd53715d53f3 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1806,6 +1806,9 @@ extern bool pcie_ports_native;
>  
>  int pcie_set_target_speed(struct pci_dev *port, enum pci_bus_speed speed_req,
>  			  bool use_lt);
> +void pcie_walk_rcec(struct pci_dev *rcec,
> +		    int (*cb)(struct pci_dev *, void *),
> +		    void *userdata);
>  #else
>  #define pcie_ports_disabled	true
>  #define pcie_ports_native	false
> @@ -1816,8 +1819,15 @@ static inline int pcie_set_target_speed(struct pci_dev *port,
>  {
>  	return -EOPNOTSUPP;
>  }
> +
> +static inline void pcie_walk_rcec(struct pci_dev *rcec,
> +				  int (*cb)(struct pci_dev *, void *),
> +				  void *userdata) { }
> +
>  #endif
>  
> +void pcie_clear_device_status(struct pci_dev *dev);
> +
>  #define PCIE_LINK_STATE_L0S		(BIT(0) | BIT(1)) /* Upstr/dwnstr L0s */
>  #define PCIE_LINK_STATE_L1		BIT(2)	/* L1 state */
>  #define PCIE_LINK_STATE_L1_1		BIT(3)	/* ASPM L1.1 state */


  parent reply	other threads:[~2025-06-06 15:58 UTC|newest]

Thread overview: 90+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-03 17:22 [PATCH v9 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
2025-06-03 17:22 ` [PATCH v9 01/16] PCI/CXL: Add pcie_is_cxl() Terry Bowman
2025-06-04 19:06   ` Sathyanarayanan Kuppuswamy
2025-06-04 19:18     ` Bowman, Terry
2025-06-05 23:24   ` Dave Jiang
2025-06-03 17:22 ` [PATCH v9 02/16] PCI/AER: Report CXL or PCIe bus error type in trace logging Terry Bowman
2025-06-03 22:02   ` Sathyanarayanan Kuppuswamy
2025-06-04 14:32     ` Bowman, Terry
2025-06-04 19:24       ` Sathyanarayanan Kuppuswamy
2025-06-04 21:30         ` Bowman, Terry
2025-06-05 23:28   ` Dave Jiang
2025-06-03 17:22 ` [PATCH v9 03/16] CXL/AER: Introduce kfifo for forwarding CXL errors Terry Bowman
2025-06-04  6:01   ` Dan Carpenter
2025-06-04 14:37     ` Bowman, Terry
2025-06-04 17:24       ` Dan Carpenter
2025-06-04 19:21         ` Bowman, Terry
2025-06-04 22:50   ` Sathyanarayanan Kuppuswamy
2025-06-05 14:04     ` Bowman, Terry
2025-06-06  0:27   ` Dave Jiang
2025-06-06 14:27     ` Bowman, Terry
2025-06-06 14:36       ` Dave Jiang
2025-06-12 11:04   ` Jonathan Cameron
2025-06-12 14:29     ` Bowman, Terry
2025-06-03 17:22 ` [PATCH v9 04/16] PCI/AER: Dequeue forwarded CXL error Terry Bowman
2025-06-04  6:05   ` Dan Carpenter
2025-06-04 14:38     ` Bowman, Terry
2025-06-04 23:58   ` Sathyanarayanan Kuppuswamy
2025-06-06 15:57   ` Dave Jiang [this message]
2025-06-06 18:14     ` Bowman, Terry
2025-06-06 22:43       ` Dave Jiang
2025-06-09 19:57         ` Bowman, Terry
2025-06-09 20:34           ` Dave Jiang
2025-06-12 11:17             ` Jonathan Cameron
2025-06-06 21:08     ` Bowman, Terry
2025-06-06 23:15     ` Bowman, Terry
2025-06-09 20:17       ` Dave Jiang
2025-06-10  4:15   ` Lukas Wunner
2025-06-10 18:07     ` Bowman, Terry
2025-06-10 21:20       ` Bowman, Terry
2025-06-11  4:38         ` Lukas Wunner
2025-06-17 16:08           ` Dave Jiang
2025-06-17 18:20             ` Robert Richter
2025-06-12 11:36   ` Jonathan Cameron
2025-06-12 18:35     ` Bowman, Terry
2025-06-03 17:22 ` [PATCH v9 05/16] CXL/PCI: Introduce CXL uncorrectable protocol error recovery Terry Bowman
2025-06-05 15:14   ` Sathyanarayanan Kuppuswamy
2025-06-05 16:01     ` Bowman, Terry
2025-06-06 16:45   ` Dave Jiang
2025-06-06 18:16     ` Bowman, Terry
2025-06-12 16:06   ` Jonathan Cameron
2025-06-12 16:29     ` Bowman, Terry
2025-06-03 17:22 ` [PATCH v9 06/16] cxl/pci: Move RAS initialization to cxl_port driver Terry Bowman
2025-06-06 17:04   ` Dave Jiang
2025-06-06 18:17     ` Bowman, Terry
2025-06-03 17:22 ` [PATCH v9 07/16] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers Terry Bowman
2025-06-03 17:22 ` [PATCH v9 08/16] cxl/pci: Update RAS handler interfaces to also support CXL Ports Terry Bowman
2025-06-05 16:42   ` Sathyanarayanan Kuppuswamy
2025-06-03 17:22 ` [PATCH v9 09/16] cxl/pci: Log message if RAS registers are unmapped Terry Bowman
2025-06-05 16:42   ` Sathyanarayanan Kuppuswamy
2025-06-06 17:27   ` Dave Jiang
2025-06-03 17:22 ` [PATCH v9 10/16] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports Terry Bowman
2025-06-05 16:49   ` Sathyanarayanan Kuppuswamy
2025-06-06  9:08   ` Shiju Jose
2025-06-06 14:41     ` Bowman, Terry
2025-06-06 15:24       ` Bowman, Terry
2025-06-12 16:25         ` Jonathan Cameron
2025-06-03 17:22 ` [PATCH v9 11/16] cxl/pci: Update __cxl_handle_cor_ras() to return early if no RAS errors Terry Bowman
2025-06-05 18:37   ` Sathyanarayanan Kuppuswamy
2025-06-06 20:30   ` Dave Jiang
2025-06-06 20:55     ` Bowman, Terry
2025-06-06 22:38       ` Dave Jiang
2025-06-12 16:46   ` Jonathan Cameron
2025-06-16 20:30     ` Bowman, Terry
2025-06-03 17:22 ` [PATCH v9 12/16] cxl/pci: Introduce CXL Endpoint protocol error handlers Terry Bowman
2025-06-06  0:22   ` Sathyanarayanan Kuppuswamy
2025-06-12 16:55   ` Jonathan Cameron
2025-06-03 17:22 ` [PATCH v9 13/16] cxl/pci: Introduce CXL Port " Terry Bowman
2025-06-06  0:50   ` Sathyanarayanan Kuppuswamy
2025-06-12 17:14   ` Jonathan Cameron
2025-06-16 22:17     ` Bowman, Terry
2025-06-03 17:22 ` [PATCH v9 14/16] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
2025-06-06  0:50   ` Sathyanarayanan Kuppuswamy
2025-06-12 17:16   ` Jonathan Cameron
2025-06-03 17:22 ` [PATCH v9 15/16] CXL/PCI: Enable CXL protocol errors during CXL Port probe Terry Bowman
2025-06-06  0:51   ` Sathyanarayanan Kuppuswamy
2025-06-03 17:22 ` [PATCH v9 16/16] CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup Terry Bowman
2025-06-06  0:52   ` Sathyanarayanan Kuppuswamy
2025-06-06 13:51     ` Bowman, Terry
2025-06-06 22:59   ` Dave Jiang
2025-06-12 17:19   ` Jonathan Cameron

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=81214183-fd94-428b-abeb-3ec3d2688030@intel.com \
    --to=dave.jiang@intel.com \
    --cc=PradeepVineshReddy.Kodamati@amd.com \
    --cc=Smita.KoralahalliChannabasappa@amd.com \
    --cc=alison.schofield@intel.com \
    --cc=bhelgaas@google.com \
    --cc=bp@alien8.de \
    --cc=coly.li@suse.de \
    --cc=dan.carpenter@linaro.org \
    --cc=dan.j.williams@intel.com \
    --cc=dave@stgolabs.net \
    --cc=fabio.m.de.francesco@linux.intel.com \
    --cc=ilpo.jarvinen@linux.intel.com \
    --cc=ira.weiny@intel.com \
    --cc=jonathan.cameron@huawei.com \
    --cc=kobayashi.da-06@fujitsu.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=ming.li@zohomail.com \
    --cc=peterz@infradead.org \
    --cc=rrichter@amd.com \
    --cc=shiju.jose@huawei.com \
    --cc=terry.bowman@amd.com \
    --cc=uaisheng.ye@intel.com \
    --cc=vishal.l.verma@intel.com \
    --cc=yanfei.xu@intel.com \
    --cc=yazen.ghannam@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.