From: Terry Bowman <terry.bowman@amd.com>
To: <dave@stgolabs.net>, <jic23@kernel.org>, <dave.jiang@intel.com>,
<alison.schofield@intel.com>, <djbw@kernel.org>,
<bhelgaas@google.com>, <shiju.jose@huawei.com>,
<ming.li@zohomail.com>, <Smita.KoralahalliChannabasappa@amd.com>,
<rrichter@amd.com>, <dan.carpenter@linaro.org>,
<PradeepVineshReddy.Kodamati@amd.com>, <lukas@wunner.de>,
<Benjamin.Cheatham@amd.com>,
<sathyanarayanan.kuppuswamy@linux.intel.com>,
<vishal.l.verma@intel.com>, <alucerop@amd.com>,
<ira.weiny@intel.com>, <corbet@lwn.net>, <rafael@kernel.org>,
<xueshuai@linux.alibaba.com>, <linux-cxl@vger.kernel.org>
Cc: <linux-kernel@vger.kernel.org>, <linux-pci@vger.kernel.org>,
<linux-acpi@vger.kernel.org>, <linux-doc@vger.kernel.org>,
<terry.bowman@amd.com>
Subject: [PATCH v17 01/11] PCI/AER: Introduce AER-CXL Kfifo
Date: Tue, 5 May 2026 12:30:19 -0500 [thread overview]
Message-ID: <20260505173029.2718246-2-terry.bowman@amd.com> (raw)
In-Reply-To: <20260505173029.2718246-1-terry.bowman@amd.com>
CXL virtual hierarchy (VH) native RAS handling for CXL Port devices will be
added soon. This requires a notification mechanism for the AER driver to
share the AER interrupt with the CXL driver. The CXL drivers use the
notification to handle and log the CXL RAS errors.
Note, 'CXL protocol error' terminology refers to CXL VH and not CXL RCH
errors unless specifically noted going forward.
Introduce a new file in the AER driver to handle the CXL protocol
errors: pci/pcie/aer_cxl_vh.c.
Add a kfifo work queue to be used by the AER and CXL drivers. Multiple
AER IRQ worker threads can be running and enqueueing concurrently, so
include write path synchronization. Pack the kfifo, the spinlock, the
rwsem, and the work pointer into a single structure. Initialize the
kfifo with INIT_KFIFO() from a subsys_initcall so its mask, esize and
data fields are valid before any producer or consumer runs.
Add CXL work queue handler registration functions in the AER driver.
Export them so the CXL driver can assign or clear the work handler.
Introduce 'struct cxl_proto_err_work_data' to serve as the kfifo work
data. It contains a reference to the PCI error source device and the
error severity. The cxl_core driver uses this when dequeuing the work.
Introduce cxl_forward_error() to add a given CXL protocol error to a
work structure and push it onto the AER-CXL kfifo. This function takes
a pci_dev_get() on the source device. The kfifo consumer is responsible
for the matching pci_dev_put() after dequeue. On enqueue failure
cxl_forward_error() does the put itself.
Synchronize accesses to the work function pointer during registration,
deregistration, enqueue, and dequeue.
handle_error_source() is intentionally not changed here. The is_cxl_error()
switch that routes errors to cxl_forward_error() is added in a later patch
together with the kfifo consumer registration. This way the producer and
consumer land in the same commit, so CXL errors are not silently dropped
during bisect.
Also add MAINTAINERS entries for both drivers/pci/pcie/aer_cxl_vh.c
(new in this patch) and drivers/pci/pcie/aer_cxl_rch.c (already in tree
but previously unlisted) under the existing CXL entry. This way the CXL
maintainers are CC'd on changes to the AER-CXL bridging code.
Co-developed-by: Dan Williams <djbw@kernel.org>
Signed-off-by: Dan Williams <djbw@kernel.org>
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
Changes in v16->v17:
- Reword "kfifo semaphore" to "kfifo spinlock" to match fifo_lock.
- Defer the handle_error_source() is_cxl_error() switch to the patch that
registers the kfifo consumer to keep each commit bisect-safe.
- Rename rwsema to rwsem
- Change CPER exports to use EXPORT_SYMBOL_FOR_MODULES.
- Add work cancel function.
- Replace kfifo_put() with kfifo_in_spinlocked() for multiple producers
- Add fifo_lock spinlock for concurrent producer serialisation
- Initialize the embedded kfifo with INIT_KFIFO() in a subsys_initcall so
kfifo->mask, ->esize and ->data are set before first use.
- Clear PCI_ERR_COR_STATUS in cxl_forward_error() before enqueue so the
device is acked for correctable events even when the consumer drops the
event. Uncorrectable status is left for cxl_do_recovery() to clear after
recovery completes, mirroring the AER core convention.
- WARN on double-registration in cxl_register_proto_err_work() to make an
unintended second consumer visible at runtime.
- Add direct rwsem.h, cleanup.h and workqueue.h includes for symbols used
in aer_cxl_vh.c
- Add MAINTAINERS entries for drivers/pci/pcie/aer_cxl_*.c
- Update message
Changes in v15->v16:
- Add pci_dev_put() and comment in pci_dev_get() (Dan)
- /rw_sema/rwsema/ (Dan)
- Split validation checks in cxl_forward_error() to allow
for meaningful reason in log (Terry)
- Shorten commit title to remove wordiness (Terry)
- Remove bitfield.h include, unnecessary. (Terry)
Changes in v14->v15:
- Moved pci_dev_get() call to this patch (Dave)
Changes in v13 -> v14:
- Replaced workqueue_types.h include with 'struct work_struct'
predeclaration (Bjorn)
- Update error message (Bjorn)
- Reordered 'struct cxl_proto_err_work_data' (Bjorn)
- Remove export of cxl_error_is_native() here (Bjorn)
Changes in v12->v13:
- Added Dave Jiang's review-by
- Update error message (Ben)
Changes in v11->v12:
- None
---
MAINTAINERS | 2 +
drivers/pci/pcie/Makefile | 1 +
drivers/pci/pcie/aer.c | 10 ---
drivers/pci/pcie/aer_cxl_vh.c | 142 ++++++++++++++++++++++++++++++++++
drivers/pci/pcie/portdrv.h | 4 +
include/linux/aer.h | 28 +++++++
6 files changed, 177 insertions(+), 10 deletions(-)
create mode 100644 drivers/pci/pcie/aer_cxl_vh.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 882214b0e7db..93d4e43bb90d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6433,6 +6433,8 @@ S: Maintained
F: Documentation/driver-api/cxl
F: Documentation/userspace-api/fwctl/fwctl-cxl.rst
F: drivers/cxl/
+F: drivers/pci/pcie/aer_cxl_rch.c
+F: drivers/pci/pcie/aer_cxl_vh.c
F: include/cxl/
F: include/uapi/linux/cxl_mem.h
F: tools/testing/cxl/
diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index b0b43a18c304..62d3d3c69a5d 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_PCIEPORTBUS) += pcieportdrv.o bwctrl.o
obj-y += aspm.o
obj-$(CONFIG_PCIEAER) += aer.o err.o tlp.o
obj-$(CONFIG_CXL_RAS) += aer_cxl_rch.o
+obj-$(CONFIG_CXL_RAS) += aer_cxl_vh.o
obj-$(CONFIG_PCIEAER_INJECT) += aer_inject.o
obj-$(CONFIG_PCIE_PME) += pme.o
obj-$(CONFIG_PCIE_DPC) += dpc.o
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index c4fd9c0b2a54..c5bce25df51c 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1150,16 +1150,6 @@ void pci_aer_unmask_internal_errors(struct pci_dev *dev)
*/
EXPORT_SYMBOL_FOR_MODULES(pci_aer_unmask_internal_errors, "cxl_core");
-#ifdef CONFIG_CXL_RAS
-bool is_aer_internal_error(struct aer_err_info *info)
-{
- if (info->severity == AER_CORRECTABLE)
- return info->status & PCI_ERR_COR_INTERNAL;
-
- return info->status & PCI_ERR_UNC_INTN;
-}
-#endif
-
/**
* pci_aer_handle_error - handle logging error into an event log
* @dev: pointer to pci_dev data structure of error source device
diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
new file mode 100644
index 000000000000..c0fea2c2b9bc
--- /dev/null
+++ b/drivers/pci/pcie/aer_cxl_vh.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2026 AMD Corporation. All rights reserved. */
+
+#include <linux/aer.h>
+#include <linux/cleanup.h>
+#include <linux/init.h>
+#include <linux/kfifo.h>
+#include <linux/rwsem.h>
+#include <linux/workqueue.h>
+#include "../pci.h"
+#include "portdrv.h"
+
+#define CXL_ERROR_SOURCES_MAX 128
+
+struct cxl_proto_err_kfifo {
+ struct work_struct *work;
+ struct rw_semaphore rwsem;
+ spinlock_t fifo_lock;
+ DECLARE_KFIFO(fifo, struct cxl_proto_err_work_data,
+ CXL_ERROR_SOURCES_MAX);
+};
+
+static struct cxl_proto_err_kfifo cxl_proto_err_kfifo = {
+ .rwsem = __RWSEM_INITIALIZER(cxl_proto_err_kfifo.rwsem),
+ .fifo_lock = __SPIN_LOCK_UNLOCKED(cxl_proto_err_kfifo.fifo_lock),
+};
+
+static int __init cxl_proto_err_kfifo_init(void)
+{
+ INIT_KFIFO(cxl_proto_err_kfifo.fifo);
+ return 0;
+}
+subsys_initcall(cxl_proto_err_kfifo_init);
+
+bool is_aer_internal_error(struct aer_err_info *info)
+{
+ if (info->severity == AER_CORRECTABLE)
+ return info->status & PCI_ERR_COR_INTERNAL;
+
+ return info->status & PCI_ERR_UNC_INTN;
+}
+
+bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
+{
+ if (!info || !info->is_cxl)
+ return false;
+
+ if (pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT)
+ return false;
+
+ return is_aer_internal_error(info);
+}
+
+void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info)
+{
+ struct cxl_proto_err_work_data wd = {
+ .severity = info->severity,
+ .pdev = pdev,
+ };
+
+ if (info->severity == AER_CORRECTABLE)
+ pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_STATUS,
+ info->status);
+
+ guard(rwsem_read)(&cxl_proto_err_kfifo.rwsem);
+
+ if (!cxl_proto_err_kfifo.work) {
+ dev_err_ratelimited(&pdev->dev, "AER-CXL kfifo reader not registered\n");
+ return;
+ }
+
+ /*
+ * Reference discipline: the AER caller (handle_error_source())
+ * holds a ref on @pdev for the duration of this call and releases
+ * it on return. Take a fresh ref here so the pdev stays live while
+ * queued in the kfifo; the consumer (for_each_cxl_proto_err())
+ * drops that ref after handling. On enqueue failure below, drop
+ * the ref we just took to avoid a leak.
+ */
+ pci_dev_get(pdev);
+
+ /* Serialize concurrent kfifo writers: multiple AER threaded IRQs */
+ if (!kfifo_in_spinlocked(&cxl_proto_err_kfifo.fifo, &wd, 1,
+ &cxl_proto_err_kfifo.fifo_lock)) {
+ dev_err_ratelimited(&pdev->dev, "AER-CXL kfifo add failed\n");
+ pci_dev_put(pdev);
+ return;
+ }
+
+ schedule_work(cxl_proto_err_kfifo.work);
+}
+
+void cxl_register_proto_err_work(struct work_struct *work)
+{
+ guard(rwsem_write)(&cxl_proto_err_kfifo.rwsem);
+ WARN_ONCE(cxl_proto_err_kfifo.work,
+ "AER-CXL kfifo consumer already registered\n");
+ cxl_proto_err_kfifo.work = work;
+}
+EXPORT_SYMBOL_FOR_MODULES(cxl_register_proto_err_work, "cxl_core");
+
+static struct work_struct *cancel_cxl_proto_err(void)
+{
+ struct work_struct *work;
+ struct cxl_proto_err_work_data wd;
+
+ guard(rwsem_write)(&cxl_proto_err_kfifo.rwsem);
+ work = cxl_proto_err_kfifo.work;
+ cxl_proto_err_kfifo.work = NULL;
+ while (kfifo_get(&cxl_proto_err_kfifo.fifo, &wd)) {
+ dev_err_ratelimited(&wd.pdev->dev,
+ "AER-CXL error report canceled\n");
+ pci_dev_put(wd.pdev);
+ }
+ return work;
+}
+
+void cxl_unregister_proto_err_work(void)
+{
+ struct work_struct *work = cancel_cxl_proto_err();
+
+ if (work)
+ cancel_work_sync(work);
+}
+EXPORT_SYMBOL_FOR_MODULES(cxl_unregister_proto_err_work, "cxl_core");
+
+int for_each_cxl_proto_err(struct cxl_proto_err_work_data *wd,
+ cxl_proto_err_fn_t fn)
+{
+ int rc;
+
+ guard(rwsem_read)(&cxl_proto_err_kfifo.rwsem);
+ while (kfifo_get(&cxl_proto_err_kfifo.fifo, wd)) {
+ rc = fn(wd);
+ pci_dev_put(wd->pdev);
+ if (rc)
+ return rc;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_FOR_MODULES(for_each_cxl_proto_err, "cxl_core");
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index cc58bf2f2c84..66a6b8099c96 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -130,9 +130,13 @@ struct aer_err_info;
bool is_aer_internal_error(struct aer_err_info *info);
void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info);
void cxl_rch_enable_rcec(struct pci_dev *rcec);
+bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info);
+void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info);
#else
static inline bool is_aer_internal_error(struct aer_err_info *info) { return false; }
static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { }
static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
+static inline bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info) { return false; }
+static inline void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info) { }
#endif /* CONFIG_CXL_RAS */
#endif /* _PORTDRV_H_ */
diff --git a/include/linux/aer.h b/include/linux/aer.h
index df0f5c382286..78841cf4268c 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -25,6 +25,7 @@
#define PCIE_STD_MAX_TLP_HEADERLOG (PCIE_STD_NUM_TLP_HEADERLOG + 10)
struct pci_dev;
+struct work_struct;
struct pcie_tlp_log {
union {
@@ -53,6 +54,18 @@ struct aer_capability_regs {
u16 uncor_err_source;
};
+/**
+ * struct cxl_proto_err_work_data - Error information used in CXL error handling
+ * @pdev: PCI device detecting the error
+ * @severity: AER severity
+ */
+struct cxl_proto_err_work_data {
+ struct pci_dev *pdev;
+ int severity;
+};
+
+typedef int (*cxl_proto_err_fn_t)(struct cxl_proto_err_work_data *wd);
+
#if defined(CONFIG_PCIEAER)
int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
int pcie_aer_is_native(struct pci_dev *dev);
@@ -66,6 +79,21 @@ static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
#endif
+#ifdef CONFIG_CXL_RAS
+void cxl_register_proto_err_work(struct work_struct *work);
+int for_each_cxl_proto_err(struct cxl_proto_err_work_data *wd,
+ cxl_proto_err_fn_t fn);
+void cxl_unregister_proto_err_work(void);
+#else
+static inline void cxl_register_proto_err_work(struct work_struct *work) { }
+static inline int for_each_cxl_proto_err(struct cxl_proto_err_work_data *wd,
+ cxl_proto_err_fn_t fn)
+{
+ return 0;
+}
+static inline void cxl_unregister_proto_err_work(void) { }
+#endif
+
void pci_print_aer(struct pci_dev *dev, int aer_severity,
struct aer_capability_regs *aer);
int cper_severity_to_aer(int cper_severity);
--
2.34.1
next prev parent reply other threads:[~2026-05-05 17:31 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-05 17:30 [PATCH v17 00/11] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
2026-05-05 17:30 ` Terry Bowman [this message]
2026-05-05 21:17 ` [PATCH v17 01/11] PCI/AER: Introduce AER-CXL Kfifo Dave Jiang
2026-05-05 17:30 ` [PATCH v17 02/11] cxl/ras: Unify Endpoint and Port AER trace events Terry Bowman
2026-05-05 21:46 ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 03/11] cxl: Use common CPER handling for all CXL devices Terry Bowman
2026-05-05 22:02 ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 04/11] cxl: Rename find_cxl_port() to find_cxl_port_by_dport() Terry Bowman
2026-05-05 22:06 ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 05/11] cxl: Limit CXL-CPER kfifo registration functions scope Terry Bowman
2026-05-05 22:16 ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 06/11] PCI: Establish common CXL Port protocol error flow Terry Bowman
2026-05-05 17:30 ` [PATCH v17 07/11] PCI/CXL: Add RCH support to CXL handlers Terry Bowman
2026-05-05 23:59 ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 08/11] cxl: Remove Endpoint AER correctable handler Terry Bowman
2026-05-05 17:30 ` [PATCH v17 09/11] cxl: Update Endpoint AER uncorrectable handler Terry Bowman
2026-05-06 17:43 ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 10/11] PCI/CXL: Mask/Unmask CXL protocol errors Terry Bowman
2026-05-06 18:00 ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 11/11] Documentation: cxl: Document CXL protocol error handling Terry Bowman
2026-05-06 18:34 ` Dave Jiang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260505173029.2718246-2-terry.bowman@amd.com \
--to=terry.bowman@amd.com \
--cc=Benjamin.Cheatham@amd.com \
--cc=PradeepVineshReddy.Kodamati@amd.com \
--cc=Smita.KoralahalliChannabasappa@amd.com \
--cc=alison.schofield@intel.com \
--cc=alucerop@amd.com \
--cc=bhelgaas@google.com \
--cc=corbet@lwn.net \
--cc=dan.carpenter@linaro.org \
--cc=dave.jiang@intel.com \
--cc=dave@stgolabs.net \
--cc=djbw@kernel.org \
--cc=ira.weiny@intel.com \
--cc=jic23@kernel.org \
--cc=linux-acpi@vger.kernel.org \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-doc@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=lukas@wunner.de \
--cc=ming.li@zohomail.com \
--cc=rafael@kernel.org \
--cc=rrichter@amd.com \
--cc=sathyanarayanan.kuppuswamy@linux.intel.com \
--cc=shiju.jose@huawei.com \
--cc=vishal.l.verma@intel.com \
--cc=xueshuai@linux.alibaba.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox