public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Dave Jiang <dave.jiang@intel.com>
To: Terry Bowman <terry.bowman@amd.com>,
	dave@stgolabs.net, jic23@kernel.org, alison.schofield@intel.com,
	djbw@kernel.org, bhelgaas@google.com, shiju.jose@huawei.com,
	ming.li@zohomail.com, Smita.KoralahalliChannabasappa@amd.com,
	rrichter@amd.com, dan.carpenter@linaro.org,
	PradeepVineshReddy.Kodamati@amd.com, lukas@wunner.de,
	Benjamin.Cheatham@amd.com,
	sathyanarayanan.kuppuswamy@linux.intel.com,
	vishal.l.verma@intel.com, alucerop@amd.com, ira.weiny@intel.com,
	corbet@lwn.net, rafael@kernel.org, xueshuai@linux.alibaba.com,
	linux-cxl@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-acpi@vger.kernel.org, linux-doc@vger.kernel.org
Subject: Re: [PATCH v17 01/11] PCI/AER: Introduce AER-CXL Kfifo
Date: Tue, 5 May 2026 14:17:28 -0700	[thread overview]
Message-ID: <3220622a-9241-450c-aedf-d80211eaf561@intel.com> (raw)
In-Reply-To: <20260505173029.2718246-2-terry.bowman@amd.com>



On 5/5/26 10:30 AM, Terry Bowman wrote:
> CXL virtual hierarchy (VH) native RAS handling for CXL Port devices will be
> added soon. This requires a notification mechanism for the AER driver to
> share the AER interrupt with the CXL driver. The CXL drivers use the
> notification to handle and log the CXL RAS errors.
> 
> Note, 'CXL protocol error' terminology refers to CXL VH and not CXL RCH
> errors unless specifically noted going forward.
> 
> Introduce a new file in the AER driver to handle the CXL protocol
> errors: pci/pcie/aer_cxl_vh.c.
> 
> Add a kfifo work queue to be used by the AER and CXL drivers. Multiple
> AER IRQ worker threads can be running and enqueueing concurrently, so
> include write path synchronization. Pack the kfifo, the spinlock, the
> rwsem, and the work pointer into a single structure. Initialize the
> kfifo with INIT_KFIFO() from a subsys_initcall so its mask, esize and
> data fields are valid before any producer or consumer runs.
> 
> Add CXL work queue handler registration functions in the AER driver.
> Export them so the CXL driver can assign or clear the work handler.
> 
> Introduce 'struct cxl_proto_err_work_data' to serve as the kfifo work
> data. It contains a reference to the PCI error source device and the
> error severity. The cxl_core driver uses this when dequeuing the work.
> 
> Introduce cxl_forward_error() to add a given CXL protocol error to a
> work structure and push it onto the AER-CXL kfifo. This function takes
> a pci_dev_get() on the source device. The kfifo consumer is responsible
> for the matching pci_dev_put() after dequeue. On enqueue failure
> cxl_forward_error() does the put itself.
> 
> Synchronize accesses to the work function pointer during registration,
> deregistration, enqueue, and dequeue.
> 
> handle_error_source() is intentionally not changed here. The is_cxl_error()
> switch that routes errors to cxl_forward_error() is added in a later patch
> together with the kfifo consumer registration. This way the producer and
> consumer land in the same commit, so CXL errors are not silently dropped
> during bisect.
> 
> Also add MAINTAINERS entries for both drivers/pci/pcie/aer_cxl_vh.c
> (new in this patch) and drivers/pci/pcie/aer_cxl_rch.c (already in tree
> but previously unlisted) under the existing CXL entry. This way the CXL
> maintainers are CC'd on changes to the AER-CXL bridging code.
> 
> Co-developed-by: Dan Williams <djbw@kernel.org>
> Signed-off-by: Dan Williams <djbw@kernel.org>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>


> 
> ---
> 
> Changes in v16->v17:
> - Reword "kfifo semaphore" to "kfifo spinlock" to match fifo_lock.
> - Defer the handle_error_source() is_cxl_error() switch to the patch that
>   registers the kfifo consumer to keep each commit bisect-safe.
> - Rename rwsema to rwsem
> - Change CPER exports to use EXPORT_SYMBOL_FOR_MODULES.
> - Add work cancel function.
> - Replace kfifo_put() with kfifo_in_spinlocked() for multiple producers
> - Add fifo_lock spinlock for concurrent producer serialisation
> - Initialize the embedded kfifo with INIT_KFIFO() in a subsys_initcall so
>   kfifo->mask, ->esize and ->data are set before first use.
> - Clear PCI_ERR_COR_STATUS in cxl_forward_error() before enqueue so the
>   device is acked for correctable events even when the consumer drops the
>   event. Uncorrectable status is left for cxl_do_recovery() to clear after
>   recovery completes, mirroring the AER core convention.
> - WARN on double-registration in cxl_register_proto_err_work() to make an
>   unintended second consumer visible at runtime.
> - Add direct rwsem.h, cleanup.h and workqueue.h includes for symbols used
>   in aer_cxl_vh.c
> - Add MAINTAINERS entries for drivers/pci/pcie/aer_cxl_*.c
> - Update message
> 
> Changes in v15->v16:
> - Add pci_dev_put() and comment in pci_dev_get() (Dan)
> - /rw_sema/rwsema/ (Dan)
> - Split validation checks in cxl_forward_error() to allow
>   for meaningful reason in log (Terry)
> - Shorten commit title to remove wordiness (Terry)
> - Remove bitfield.h include, unnecessary. (Terry)
> 
> Changes in v14->v15:
> - Moved pci_dev_get() call to this patch (Dave)
> 
> Changes in v13 -> v14:
> - Replaced workqueue_types.h include with 'struct work_struct'
>   predeclaration (Bjorn)
> - Update error message (Bjorn)
> - Reordered 'struct cxl_proto_err_work_data' (Bjorn)
> - Remove export of cxl_error_is_native() here (Bjorn)
> 
> Changes in v12->v13:
> - Added Dave Jiang's review-by
> - Update error message (Ben)
> 
> Changes in v11->v12:
> - None
> ---
>  MAINTAINERS                   |   2 +
>  drivers/pci/pcie/Makefile     |   1 +
>  drivers/pci/pcie/aer.c        |  10 ---
>  drivers/pci/pcie/aer_cxl_vh.c | 142 ++++++++++++++++++++++++++++++++++
>  drivers/pci/pcie/portdrv.h    |   4 +
>  include/linux/aer.h           |  28 +++++++
>  6 files changed, 177 insertions(+), 10 deletions(-)
>  create mode 100644 drivers/pci/pcie/aer_cxl_vh.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 882214b0e7db..93d4e43bb90d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6433,6 +6433,8 @@ S:	Maintained
>  F:	Documentation/driver-api/cxl
>  F:	Documentation/userspace-api/fwctl/fwctl-cxl.rst
>  F:	drivers/cxl/
> +F:	drivers/pci/pcie/aer_cxl_rch.c
> +F:	drivers/pci/pcie/aer_cxl_vh.c
>  F:	include/cxl/
>  F:	include/uapi/linux/cxl_mem.h
>  F:	tools/testing/cxl/
> diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
> index b0b43a18c304..62d3d3c69a5d 100644
> --- a/drivers/pci/pcie/Makefile
> +++ b/drivers/pci/pcie/Makefile
> @@ -9,6 +9,7 @@ obj-$(CONFIG_PCIEPORTBUS)	+= pcieportdrv.o bwctrl.o
>  obj-y				+= aspm.o
>  obj-$(CONFIG_PCIEAER)		+= aer.o err.o tlp.o
>  obj-$(CONFIG_CXL_RAS)		+= aer_cxl_rch.o
> +obj-$(CONFIG_CXL_RAS)		+= aer_cxl_vh.o
>  obj-$(CONFIG_PCIEAER_INJECT)	+= aer_inject.o
>  obj-$(CONFIG_PCIE_PME)		+= pme.o
>  obj-$(CONFIG_PCIE_DPC)		+= dpc.o
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index c4fd9c0b2a54..c5bce25df51c 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1150,16 +1150,6 @@ void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>   */
>  EXPORT_SYMBOL_FOR_MODULES(pci_aer_unmask_internal_errors, "cxl_core");
>  
> -#ifdef CONFIG_CXL_RAS
> -bool is_aer_internal_error(struct aer_err_info *info)
> -{
> -	if (info->severity == AER_CORRECTABLE)
> -		return info->status & PCI_ERR_COR_INTERNAL;
> -
> -	return info->status & PCI_ERR_UNC_INTN;
> -}
> -#endif
> -
>  /**
>   * pci_aer_handle_error - handle logging error into an event log
>   * @dev: pointer to pci_dev data structure of error source device
> diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
> new file mode 100644
> index 000000000000..c0fea2c2b9bc
> --- /dev/null
> +++ b/drivers/pci/pcie/aer_cxl_vh.c
> @@ -0,0 +1,142 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2026 AMD Corporation. All rights reserved. */
> +
> +#include <linux/aer.h>
> +#include <linux/cleanup.h>
> +#include <linux/init.h>
> +#include <linux/kfifo.h>
> +#include <linux/rwsem.h>
> +#include <linux/workqueue.h>
> +#include "../pci.h"
> +#include "portdrv.h"
> +
> +#define CXL_ERROR_SOURCES_MAX          128
> +
> +struct cxl_proto_err_kfifo {
> +	struct work_struct *work;
> +	struct rw_semaphore rwsem;
> +	spinlock_t fifo_lock;
> +	DECLARE_KFIFO(fifo, struct cxl_proto_err_work_data,
> +		      CXL_ERROR_SOURCES_MAX);
> +};
> +
> +static struct cxl_proto_err_kfifo cxl_proto_err_kfifo = {
> +	.rwsem = __RWSEM_INITIALIZER(cxl_proto_err_kfifo.rwsem),
> +	.fifo_lock = __SPIN_LOCK_UNLOCKED(cxl_proto_err_kfifo.fifo_lock),
> +};
> +
> +static int __init cxl_proto_err_kfifo_init(void)
> +{
> +	INIT_KFIFO(cxl_proto_err_kfifo.fifo);
> +	return 0;
> +}
> +subsys_initcall(cxl_proto_err_kfifo_init);
> +
> +bool is_aer_internal_error(struct aer_err_info *info)
> +{
> +	if (info->severity == AER_CORRECTABLE)
> +		return info->status & PCI_ERR_COR_INTERNAL;
> +
> +	return info->status & PCI_ERR_UNC_INTN;
> +}
> +
> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
> +{
> +	if (!info || !info->is_cxl)
> +		return false;
> +
> +	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT)
> +		return false;
> +
> +	return is_aer_internal_error(info);
> +}
> +
> +void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info)
> +{
> +	struct cxl_proto_err_work_data wd = {
> +		.severity = info->severity,
> +		.pdev = pdev,
> +	};
> +
> +	if (info->severity == AER_CORRECTABLE)
> +		pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_STATUS,
> +				       info->status);
> +
> +	guard(rwsem_read)(&cxl_proto_err_kfifo.rwsem);
> +
> +	if (!cxl_proto_err_kfifo.work) {
> +		dev_err_ratelimited(&pdev->dev, "AER-CXL kfifo reader not registered\n");
> +		return;
> +	}
> +
> +	/*
> +	 * Reference discipline: the AER caller (handle_error_source())
> +	 * holds a ref on @pdev for the duration of this call and releases
> +	 * it on return. Take a fresh ref here so the pdev stays live while
> +	 * queued in the kfifo; the consumer (for_each_cxl_proto_err())
> +	 * drops that ref after handling. On enqueue failure below, drop
> +	 * the ref we just took to avoid a leak.
> +	 */
> +	pci_dev_get(pdev);
> +
> +	/* Serialize concurrent kfifo writers: multiple AER threaded IRQs */
> +	if (!kfifo_in_spinlocked(&cxl_proto_err_kfifo.fifo, &wd, 1,
> +				 &cxl_proto_err_kfifo.fifo_lock)) {
> +		dev_err_ratelimited(&pdev->dev, "AER-CXL kfifo add failed\n");
> +		pci_dev_put(pdev);
> +		return;
> +	}
> +
> +	schedule_work(cxl_proto_err_kfifo.work);
> +}
> +
> +void cxl_register_proto_err_work(struct work_struct *work)
> +{
> +	guard(rwsem_write)(&cxl_proto_err_kfifo.rwsem);
> +	WARN_ONCE(cxl_proto_err_kfifo.work,
> +		  "AER-CXL kfifo consumer already registered\n");
> +	cxl_proto_err_kfifo.work = work;
> +}
> +EXPORT_SYMBOL_FOR_MODULES(cxl_register_proto_err_work, "cxl_core");
> +
> +static struct work_struct *cancel_cxl_proto_err(void)
> +{
> +	struct work_struct *work;
> +	struct cxl_proto_err_work_data wd;
> +
> +	guard(rwsem_write)(&cxl_proto_err_kfifo.rwsem);
> +	work = cxl_proto_err_kfifo.work;
> +	cxl_proto_err_kfifo.work = NULL;
> +	while (kfifo_get(&cxl_proto_err_kfifo.fifo, &wd)) {
> +		dev_err_ratelimited(&wd.pdev->dev,
> +				    "AER-CXL error report canceled\n");
> +		pci_dev_put(wd.pdev);
> +	}
> +	return work;
> +}
> +
> +void cxl_unregister_proto_err_work(void)
> +{
> +	struct work_struct *work = cancel_cxl_proto_err();
> +
> +	if (work)
> +		cancel_work_sync(work);
> +}
> +EXPORT_SYMBOL_FOR_MODULES(cxl_unregister_proto_err_work, "cxl_core");
> +
> +int for_each_cxl_proto_err(struct cxl_proto_err_work_data *wd,
> +			   cxl_proto_err_fn_t fn)
> +{
> +	int rc;
> +
> +	guard(rwsem_read)(&cxl_proto_err_kfifo.rwsem);
> +	while (kfifo_get(&cxl_proto_err_kfifo.fifo, wd)) {
> +		rc = fn(wd);
> +		pci_dev_put(wd->pdev);
> +		if (rc)
> +			return rc;
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_FOR_MODULES(for_each_cxl_proto_err, "cxl_core");
> diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
> index cc58bf2f2c84..66a6b8099c96 100644
> --- a/drivers/pci/pcie/portdrv.h
> +++ b/drivers/pci/pcie/portdrv.h
> @@ -130,9 +130,13 @@ struct aer_err_info;
>  bool is_aer_internal_error(struct aer_err_info *info);
>  void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info);
>  void cxl_rch_enable_rcec(struct pci_dev *rcec);
> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info);
> +void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info);
>  #else
>  static inline bool is_aer_internal_error(struct aer_err_info *info) { return false; }
>  static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { }
>  static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
> +static inline bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info) { return false; }
> +static inline void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info) { }
>  #endif /* CONFIG_CXL_RAS */
>  #endif /* _PORTDRV_H_ */
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index df0f5c382286..78841cf4268c 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -25,6 +25,7 @@
>  #define PCIE_STD_MAX_TLP_HEADERLOG	(PCIE_STD_NUM_TLP_HEADERLOG + 10)
>  
>  struct pci_dev;
> +struct work_struct;
>  
>  struct pcie_tlp_log {
>  	union {
> @@ -53,6 +54,18 @@ struct aer_capability_regs {
>  	u16 uncor_err_source;
>  };
>  
> +/**
> + * struct cxl_proto_err_work_data - Error information used in CXL error handling
> + * @pdev: PCI device detecting the error
> + * @severity: AER severity
> + */
> +struct cxl_proto_err_work_data {
> +	struct pci_dev *pdev;
> +	int severity;
> +};
> +
> +typedef int (*cxl_proto_err_fn_t)(struct cxl_proto_err_work_data *wd);
> +
>  #if defined(CONFIG_PCIEAER)
>  int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
>  int pcie_aer_is_native(struct pci_dev *dev);
> @@ -66,6 +79,21 @@ static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
>  static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
>  #endif
>  
> +#ifdef CONFIG_CXL_RAS
> +void cxl_register_proto_err_work(struct work_struct *work);
> +int for_each_cxl_proto_err(struct cxl_proto_err_work_data *wd,
> +			   cxl_proto_err_fn_t fn);
> +void cxl_unregister_proto_err_work(void);
> +#else
> +static inline void cxl_register_proto_err_work(struct work_struct *work) { }
> +static inline int for_each_cxl_proto_err(struct cxl_proto_err_work_data *wd,
> +					 cxl_proto_err_fn_t fn)
> +{
> +	return 0;
> +}
> +static inline void cxl_unregister_proto_err_work(void) { }
> +#endif
> +
>  void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  		    struct aer_capability_regs *aer);
>  int cper_severity_to_aer(int cper_severity);


  reply	other threads:[~2026-05-05 21:17 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-05 17:30 [PATCH v17 00/11] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
2026-05-05 17:30 ` [PATCH v17 01/11] PCI/AER: Introduce AER-CXL Kfifo Terry Bowman
2026-05-05 21:17   ` Dave Jiang [this message]
2026-05-05 17:30 ` [PATCH v17 02/11] cxl/ras: Unify Endpoint and Port AER trace events Terry Bowman
2026-05-05 21:46   ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 03/11] cxl: Use common CPER handling for all CXL devices Terry Bowman
2026-05-05 22:02   ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 04/11] cxl: Rename find_cxl_port() to find_cxl_port_by_dport() Terry Bowman
2026-05-05 22:06   ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 05/11] cxl: Limit CXL-CPER kfifo registration functions scope Terry Bowman
2026-05-05 22:16   ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 06/11] PCI: Establish common CXL Port protocol error flow Terry Bowman
2026-05-05 17:30 ` [PATCH v17 07/11] PCI/CXL: Add RCH support to CXL handlers Terry Bowman
2026-05-05 23:59   ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 08/11] cxl: Remove Endpoint AER correctable handler Terry Bowman
2026-05-05 17:30 ` [PATCH v17 09/11] cxl: Update Endpoint AER uncorrectable handler Terry Bowman
2026-05-06 17:43   ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 10/11] PCI/CXL: Mask/Unmask CXL protocol errors Terry Bowman
2026-05-06 18:00   ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 11/11] Documentation: cxl: Document CXL protocol error handling Terry Bowman
2026-05-06 18:34   ` Dave Jiang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3220622a-9241-450c-aedf-d80211eaf561@intel.com \
    --to=dave.jiang@intel.com \
    --cc=Benjamin.Cheatham@amd.com \
    --cc=PradeepVineshReddy.Kodamati@amd.com \
    --cc=Smita.KoralahalliChannabasappa@amd.com \
    --cc=alison.schofield@intel.com \
    --cc=alucerop@amd.com \
    --cc=bhelgaas@google.com \
    --cc=corbet@lwn.net \
    --cc=dan.carpenter@linaro.org \
    --cc=dave@stgolabs.net \
    --cc=djbw@kernel.org \
    --cc=ira.weiny@intel.com \
    --cc=jic23@kernel.org \
    --cc=linux-acpi@vger.kernel.org \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pci@vger.kernel.org \
    --cc=lukas@wunner.de \
    --cc=ming.li@zohomail.com \
    --cc=rafael@kernel.org \
    --cc=rrichter@amd.com \
    --cc=sathyanarayanan.kuppuswamy@linux.intel.com \
    --cc=shiju.jose@huawei.com \
    --cc=terry.bowman@amd.com \
    --cc=vishal.l.verma@intel.com \
    --cc=xueshuai@linux.alibaba.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox