* [PATCH v17 01/11] PCI/AER: Introduce AER-CXL Kfifo
2026-05-05 17:30 [PATCH v17 00/11] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
@ 2026-05-05 17:30 ` Terry Bowman
2026-05-05 21:17 ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 02/11] cxl/ras: Unify Endpoint and Port AER trace events Terry Bowman
` (9 subsequent siblings)
10 siblings, 1 reply; 21+ messages in thread
From: Terry Bowman @ 2026-05-05 17:30 UTC (permalink / raw)
To: dave, jic23, dave.jiang, alison.schofield, djbw, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, vishal.l.verma,
alucerop, ira.weiny, corbet, rafael, xueshuai, linux-cxl
Cc: linux-kernel, linux-pci, linux-acpi, linux-doc, terry.bowman
CXL virtual hierarchy (VH) native RAS handling for CXL Port devices will be
added soon. This requires a notification mechanism for the AER driver to
share the AER interrupt with the CXL driver. The CXL drivers use the
notification to handle and log the CXL RAS errors.
Note: going forward, 'CXL protocol error' refers to CXL VH errors, not CXL
RCH errors, unless specifically noted.
Introduce a new file in the AER driver to handle the CXL protocol
errors: drivers/pci/pcie/aer_cxl_vh.c.
Add a kfifo work queue to be used by the AER and CXL drivers. Multiple
AER IRQ worker threads can be running and enqueueing concurrently, so
include write path synchronization. Pack the kfifo, the spinlock, the
rwsem, and the work pointer into a single structure. Initialize the
kfifo with INIT_KFIFO() from a subsys_initcall so its mask, esize and
data fields are valid before any producer or consumer runs.
Add CXL work queue handler registration functions in the AER driver.
Export them so the CXL driver can assign or clear the work handler.
Introduce 'struct cxl_proto_err_work_data' to serve as the kfifo work
data. It contains a reference to the PCI error source device and the
error severity. The cxl_core driver uses this when dequeuing the work.
Introduce cxl_forward_error() to package a given CXL protocol error into a
work data structure and push it onto the AER-CXL kfifo. This function takes
a reference on the source device with pci_dev_get(). The kfifo consumer is
responsible for the matching pci_dev_put() after dequeue. On enqueue
failure cxl_forward_error() drops the reference itself.
Synchronize accesses to the work function pointer during registration,
deregistration, enqueue, and dequeue.
handle_error_source() is intentionally not changed here. The is_cxl_error()
switch that routes errors to cxl_forward_error() is added in a later patch
together with the kfifo consumer registration. This way the producer and
consumer land in the same commit, so CXL errors are not silently dropped
during bisect.
Also add MAINTAINERS entries for both drivers/pci/pcie/aer_cxl_vh.c
(new in this patch) and drivers/pci/pcie/aer_cxl_rch.c (already in tree
but previously unlisted) under the existing CXL entry. This way the CXL
maintainers are CC'd on changes to the AER-CXL bridging code.
Co-developed-by: Dan Williams <djbw@kernel.org>
Signed-off-by: Dan Williams <djbw@kernel.org>
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
Changes in v16->v17:
- Reword "kfifo semaphore" to "kfifo spinlock" to match fifo_lock.
- Defer the handle_error_source() is_cxl_error() switch to the patch that
registers the kfifo consumer to keep each commit bisect-safe.
- Rename rwsema to rwsem
- Change CPER exports to use EXPORT_SYMBOL_FOR_MODULES.
- Add work cancel function.
- Replace kfifo_put() with kfifo_in_spinlocked() for multiple producers
- Add fifo_lock spinlock for concurrent producer serialization
- Initialize the embedded kfifo with INIT_KFIFO() in a subsys_initcall so
kfifo->mask, ->esize and ->data are set before first use.
- Clear PCI_ERR_COR_STATUS in cxl_forward_error() before enqueue so the
device is acked for correctable events even when the consumer drops the
event. Uncorrectable status is left for cxl_do_recovery() to clear after
recovery completes, mirroring the AER core convention.
- WARN on double-registration in cxl_register_proto_err_work() to make an
unintended second consumer visible at runtime.
- Add direct rwsem.h, cleanup.h and workqueue.h includes for symbols used
in aer_cxl_vh.c
- Add MAINTAINERS entries for drivers/pci/pcie/aer_cxl_*.c
- Update message
Changes in v15->v16:
- Add pci_dev_put() and comment in pci_dev_get() (Dan)
- /rw_sema/rwsema/ (Dan)
- Split validation checks in cxl_forward_error() to allow
for meaningful reason in log (Terry)
- Shorten commit title to remove wordiness (Terry)
- Remove bitfield.h include, unnecessary. (Terry)
Changes in v14->v15:
- Moved pci_dev_get() call to this patch (Dave)
Changes in v13 -> v14:
- Replaced workqueue_types.h include with 'struct work_struct'
predeclaration (Bjorn)
- Update error message (Bjorn)
- Reordered 'struct cxl_proto_err_work_data' (Bjorn)
- Remove export of cxl_error_is_native() here (Bjorn)
Changes in v12->v13:
- Added Dave Jiang's review-by
- Update error message (Ben)
Changes in v11->v12:
- None
---
MAINTAINERS | 2 +
drivers/pci/pcie/Makefile | 1 +
drivers/pci/pcie/aer.c | 10 ---
drivers/pci/pcie/aer_cxl_vh.c | 142 ++++++++++++++++++++++++++++++++++
drivers/pci/pcie/portdrv.h | 4 +
include/linux/aer.h | 28 +++++++
6 files changed, 177 insertions(+), 10 deletions(-)
create mode 100644 drivers/pci/pcie/aer_cxl_vh.c
diff --git a/MAINTAINERS b/MAINTAINERS
index 882214b0e7db..93d4e43bb90d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6433,6 +6433,8 @@ S: Maintained
F: Documentation/driver-api/cxl
F: Documentation/userspace-api/fwctl/fwctl-cxl.rst
F: drivers/cxl/
+F: drivers/pci/pcie/aer_cxl_rch.c
+F: drivers/pci/pcie/aer_cxl_vh.c
F: include/cxl/
F: include/uapi/linux/cxl_mem.h
F: tools/testing/cxl/
diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index b0b43a18c304..62d3d3c69a5d 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_PCIEPORTBUS) += pcieportdrv.o bwctrl.o
obj-y += aspm.o
obj-$(CONFIG_PCIEAER) += aer.o err.o tlp.o
obj-$(CONFIG_CXL_RAS) += aer_cxl_rch.o
+obj-$(CONFIG_CXL_RAS) += aer_cxl_vh.o
obj-$(CONFIG_PCIEAER_INJECT) += aer_inject.o
obj-$(CONFIG_PCIE_PME) += pme.o
obj-$(CONFIG_PCIE_DPC) += dpc.o
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index c4fd9c0b2a54..c5bce25df51c 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1150,16 +1150,6 @@ void pci_aer_unmask_internal_errors(struct pci_dev *dev)
*/
EXPORT_SYMBOL_FOR_MODULES(pci_aer_unmask_internal_errors, "cxl_core");
-#ifdef CONFIG_CXL_RAS
-bool is_aer_internal_error(struct aer_err_info *info)
-{
- if (info->severity == AER_CORRECTABLE)
- return info->status & PCI_ERR_COR_INTERNAL;
-
- return info->status & PCI_ERR_UNC_INTN;
-}
-#endif
-
/**
* pci_aer_handle_error - handle logging error into an event log
* @dev: pointer to pci_dev data structure of error source device
diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
new file mode 100644
index 000000000000..c0fea2c2b9bc
--- /dev/null
+++ b/drivers/pci/pcie/aer_cxl_vh.c
@@ -0,0 +1,142 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2026 AMD Corporation. All rights reserved. */
+
+#include <linux/aer.h>
+#include <linux/cleanup.h>
+#include <linux/init.h>
+#include <linux/kfifo.h>
+#include <linux/rwsem.h>
+#include <linux/workqueue.h>
+#include "../pci.h"
+#include "portdrv.h"
+
+#define CXL_ERROR_SOURCES_MAX 128
+
+struct cxl_proto_err_kfifo {
+ struct work_struct *work;
+ struct rw_semaphore rwsem;
+ spinlock_t fifo_lock;
+ DECLARE_KFIFO(fifo, struct cxl_proto_err_work_data,
+ CXL_ERROR_SOURCES_MAX);
+};
+
+static struct cxl_proto_err_kfifo cxl_proto_err_kfifo = {
+ .rwsem = __RWSEM_INITIALIZER(cxl_proto_err_kfifo.rwsem),
+ .fifo_lock = __SPIN_LOCK_UNLOCKED(cxl_proto_err_kfifo.fifo_lock),
+};
+
+static int __init cxl_proto_err_kfifo_init(void)
+{
+ INIT_KFIFO(cxl_proto_err_kfifo.fifo);
+ return 0;
+}
+subsys_initcall(cxl_proto_err_kfifo_init);
+
+bool is_aer_internal_error(struct aer_err_info *info)
+{
+ if (info->severity == AER_CORRECTABLE)
+ return info->status & PCI_ERR_COR_INTERNAL;
+
+ return info->status & PCI_ERR_UNC_INTN;
+}
+
+bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
+{
+ if (!info || !info->is_cxl)
+ return false;
+
+ if (pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT)
+ return false;
+
+ return is_aer_internal_error(info);
+}
+
+void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info)
+{
+ struct cxl_proto_err_work_data wd = {
+ .severity = info->severity,
+ .pdev = pdev,
+ };
+
+ if (info->severity == AER_CORRECTABLE)
+ pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_STATUS,
+ info->status);
+
+ guard(rwsem_read)(&cxl_proto_err_kfifo.rwsem);
+
+ if (!cxl_proto_err_kfifo.work) {
+ dev_err_ratelimited(&pdev->dev, "AER-CXL kfifo reader not registered\n");
+ return;
+ }
+
+ /*
+ * Reference discipline: the AER caller (handle_error_source())
+ * holds a ref on @pdev for the duration of this call and releases
+ * it on return. Take a fresh ref here so the pdev stays live while
+ * queued in the kfifo; the consumer (for_each_cxl_proto_err())
+ * drops that ref after handling. On enqueue failure below, drop
+ * the ref we just took to avoid a leak.
+ */
+ pci_dev_get(pdev);
+
+ /* Serialize concurrent kfifo writers: multiple AER threaded IRQs */
+ if (!kfifo_in_spinlocked(&cxl_proto_err_kfifo.fifo, &wd, 1,
+ &cxl_proto_err_kfifo.fifo_lock)) {
+ dev_err_ratelimited(&pdev->dev, "AER-CXL kfifo add failed\n");
+ pci_dev_put(pdev);
+ return;
+ }
+
+ schedule_work(cxl_proto_err_kfifo.work);
+}
+
+void cxl_register_proto_err_work(struct work_struct *work)
+{
+ guard(rwsem_write)(&cxl_proto_err_kfifo.rwsem);
+ WARN_ONCE(cxl_proto_err_kfifo.work,
+ "AER-CXL kfifo consumer already registered\n");
+ cxl_proto_err_kfifo.work = work;
+}
+EXPORT_SYMBOL_FOR_MODULES(cxl_register_proto_err_work, "cxl_core");
+
+static struct work_struct *cancel_cxl_proto_err(void)
+{
+ struct work_struct *work;
+ struct cxl_proto_err_work_data wd;
+
+ guard(rwsem_write)(&cxl_proto_err_kfifo.rwsem);
+ work = cxl_proto_err_kfifo.work;
+ cxl_proto_err_kfifo.work = NULL;
+ while (kfifo_get(&cxl_proto_err_kfifo.fifo, &wd)) {
+ dev_err_ratelimited(&wd.pdev->dev,
+ "AER-CXL error report canceled\n");
+ pci_dev_put(wd.pdev);
+ }
+ return work;
+}
+
+void cxl_unregister_proto_err_work(void)
+{
+ struct work_struct *work = cancel_cxl_proto_err();
+
+ if (work)
+ cancel_work_sync(work);
+}
+EXPORT_SYMBOL_FOR_MODULES(cxl_unregister_proto_err_work, "cxl_core");
+
+int for_each_cxl_proto_err(struct cxl_proto_err_work_data *wd,
+ cxl_proto_err_fn_t fn)
+{
+ int rc;
+
+ guard(rwsem_read)(&cxl_proto_err_kfifo.rwsem);
+ while (kfifo_get(&cxl_proto_err_kfifo.fifo, wd)) {
+ rc = fn(wd);
+ pci_dev_put(wd->pdev);
+ if (rc)
+ return rc;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL_FOR_MODULES(for_each_cxl_proto_err, "cxl_core");
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index cc58bf2f2c84..66a6b8099c96 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -130,9 +130,13 @@ struct aer_err_info;
bool is_aer_internal_error(struct aer_err_info *info);
void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info);
void cxl_rch_enable_rcec(struct pci_dev *rcec);
+bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info);
+void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info);
#else
static inline bool is_aer_internal_error(struct aer_err_info *info) { return false; }
static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { }
static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
+static inline bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info) { return false; }
+static inline void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info) { }
#endif /* CONFIG_CXL_RAS */
#endif /* _PORTDRV_H_ */
diff --git a/include/linux/aer.h b/include/linux/aer.h
index df0f5c382286..78841cf4268c 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -25,6 +25,7 @@
#define PCIE_STD_MAX_TLP_HEADERLOG (PCIE_STD_NUM_TLP_HEADERLOG + 10)
struct pci_dev;
+struct work_struct;
struct pcie_tlp_log {
union {
@@ -53,6 +54,18 @@ struct aer_capability_regs {
u16 uncor_err_source;
};
+/**
+ * struct cxl_proto_err_work_data - Error information used in CXL error handling
+ * @pdev: PCI device detecting the error
+ * @severity: AER severity
+ */
+struct cxl_proto_err_work_data {
+ struct pci_dev *pdev;
+ int severity;
+};
+
+typedef int (*cxl_proto_err_fn_t)(struct cxl_proto_err_work_data *wd);
+
#if defined(CONFIG_PCIEAER)
int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
int pcie_aer_is_native(struct pci_dev *dev);
@@ -66,6 +79,21 @@ static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
#endif
+#ifdef CONFIG_CXL_RAS
+void cxl_register_proto_err_work(struct work_struct *work);
+int for_each_cxl_proto_err(struct cxl_proto_err_work_data *wd,
+ cxl_proto_err_fn_t fn);
+void cxl_unregister_proto_err_work(void);
+#else
+static inline void cxl_register_proto_err_work(struct work_struct *work) { }
+static inline int for_each_cxl_proto_err(struct cxl_proto_err_work_data *wd,
+ cxl_proto_err_fn_t fn)
+{
+ return 0;
+}
+static inline void cxl_unregister_proto_err_work(void) { }
+#endif
+
void pci_print_aer(struct pci_dev *dev, int aer_severity,
struct aer_capability_regs *aer);
int cper_severity_to_aer(int cper_severity);
--
2.34.1
^ permalink raw reply related [flat|nested] 21+ messages in thread

* Re: [PATCH v17 01/11] PCI/AER: Introduce AER-CXL Kfifo
2026-05-05 17:30 ` [PATCH v17 01/11] PCI/AER: Introduce AER-CXL Kfifo Terry Bowman
@ 2026-05-05 21:17 ` Dave Jiang
0 siblings, 0 replies; 21+ messages in thread
From: Dave Jiang @ 2026-05-05 21:17 UTC (permalink / raw)
To: Terry Bowman, dave, jic23, alison.schofield, djbw, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, vishal.l.verma,
alucerop, ira.weiny, corbet, rafael, xueshuai, linux-cxl
Cc: linux-kernel, linux-pci, linux-acpi, linux-doc
On 5/5/26 10:30 AM, Terry Bowman wrote:
> [..]
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>
> ---
>
> Changes in v16->v17:
> - Reword "kfifo semaphore" to "kfifo spinlock" to match fifo_lock.
> - Defer the handle_error_source() is_cxl_error() switch to the patch that
> registers the kfifo consumer to keep each commit bisect-safe.
> - Rename rwsema to rwsem
> - Change CPER exports to use EXPORT_SYMBOL_FOR_MODULES.
> - Add work cancel function.
> - Replace kfifo_put() with kfifo_in_spinlocked() for multiple producers
> - Add fifo_lock spinlock for concurrent producer serialisation
> - Initialize the embedded kfifo with INIT_KFIFO() in a subsys_initcall so
> kfifo->mask, ->esize and ->data are set before first use.
> - Clear PCI_ERR_COR_STATUS in cxl_forward_error() before enqueue so the
> device is acked for correctable events even when the consumer drops the
> event. Uncorrectable status is left for cxl_do_recovery() to clear after
> recovery completes, mirroring the AER core convention.
> - WARN on double-registration in cxl_register_proto_err_work() to make an
> unintended second consumer visible at runtime.
> - Add direct rwsem.h, cleanup.h and workqueue.h includes for symbols used
> in aer_cxl_vh.c
> - Add MAINTAINERS entries for drivers/pci/pcie/aer_cxl_*.c
> - Update message
>
> Changes in v15->v16:
> - Add pci_dev_put() and comment in pci_dev_get() (Dan)
> - /rw_sema/rwsema/ (Dan)
> - Split validation checks in cxl_forward_error() to allow
> for meaningful reason in log (Terry)
> - Shorten commit title to remove wordiness (Terry)
> - Remove bitfield.h include, unnecessary. (Terry)
>
> Changes in v14->v15:
> - Moved pci_dev_get() call to this patch (Dave)
>
> Changes in v13 -> v14:
> - Replaced workqueue_types.h include with 'struct work_struct'
> predeclaration (Bjorn)
> - Update error message (Bjorn)
> - Reordered 'struct cxl_proto_err_work_data' (Bjorn)
> - Remove export of cxl_error_is_native() here (Bjorn)
>
> Changes in v12->v13:
> - Added Dave Jiang's review-by
> - Update error message (Ben)
>
> Changes in v11->v12:
> - None
> ---
> MAINTAINERS | 2 +
> drivers/pci/pcie/Makefile | 1 +
> drivers/pci/pcie/aer.c | 10 ---
> drivers/pci/pcie/aer_cxl_vh.c | 142 ++++++++++++++++++++++++++++++++++
> drivers/pci/pcie/portdrv.h | 4 +
> include/linux/aer.h | 28 +++++++
> 6 files changed, 177 insertions(+), 10 deletions(-)
> create mode 100644 drivers/pci/pcie/aer_cxl_vh.c
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 882214b0e7db..93d4e43bb90d 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -6433,6 +6433,8 @@ S: Maintained
> F: Documentation/driver-api/cxl
> F: Documentation/userspace-api/fwctl/fwctl-cxl.rst
> F: drivers/cxl/
> +F: drivers/pci/pcie/aer_cxl_rch.c
> +F: drivers/pci/pcie/aer_cxl_vh.c
> F: include/cxl/
> F: include/uapi/linux/cxl_mem.h
> F: tools/testing/cxl/
> diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
> index b0b43a18c304..62d3d3c69a5d 100644
> --- a/drivers/pci/pcie/Makefile
> +++ b/drivers/pci/pcie/Makefile
> @@ -9,6 +9,7 @@ obj-$(CONFIG_PCIEPORTBUS) += pcieportdrv.o bwctrl.o
> obj-y += aspm.o
> obj-$(CONFIG_PCIEAER) += aer.o err.o tlp.o
> obj-$(CONFIG_CXL_RAS) += aer_cxl_rch.o
> +obj-$(CONFIG_CXL_RAS) += aer_cxl_vh.o
> obj-$(CONFIG_PCIEAER_INJECT) += aer_inject.o
> obj-$(CONFIG_PCIE_PME) += pme.o
> obj-$(CONFIG_PCIE_DPC) += dpc.o
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index c4fd9c0b2a54..c5bce25df51c 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1150,16 +1150,6 @@ void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> */
> EXPORT_SYMBOL_FOR_MODULES(pci_aer_unmask_internal_errors, "cxl_core");
>
> -#ifdef CONFIG_CXL_RAS
> -bool is_aer_internal_error(struct aer_err_info *info)
> -{
> - if (info->severity == AER_CORRECTABLE)
> - return info->status & PCI_ERR_COR_INTERNAL;
> -
> - return info->status & PCI_ERR_UNC_INTN;
> -}
> -#endif
> -
> /**
> * pci_aer_handle_error - handle logging error into an event log
> * @dev: pointer to pci_dev data structure of error source device
> diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
> new file mode 100644
> index 000000000000..c0fea2c2b9bc
> --- /dev/null
> +++ b/drivers/pci/pcie/aer_cxl_vh.c
> @@ -0,0 +1,142 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2026 AMD Corporation. All rights reserved. */
> +
> +#include <linux/aer.h>
> +#include <linux/cleanup.h>
> +#include <linux/init.h>
> +#include <linux/kfifo.h>
> +#include <linux/rwsem.h>
> +#include <linux/workqueue.h>
> +#include "../pci.h"
> +#include "portdrv.h"
> +
> +#define CXL_ERROR_SOURCES_MAX 128
> +
> +struct cxl_proto_err_kfifo {
> + struct work_struct *work;
> + struct rw_semaphore rwsem;
> + spinlock_t fifo_lock;
> + DECLARE_KFIFO(fifo, struct cxl_proto_err_work_data,
> + CXL_ERROR_SOURCES_MAX);
> +};
> +
> +static struct cxl_proto_err_kfifo cxl_proto_err_kfifo = {
> + .rwsem = __RWSEM_INITIALIZER(cxl_proto_err_kfifo.rwsem),
> + .fifo_lock = __SPIN_LOCK_UNLOCKED(cxl_proto_err_kfifo.fifo_lock),
> +};
> +
> +static int __init cxl_proto_err_kfifo_init(void)
> +{
> + INIT_KFIFO(cxl_proto_err_kfifo.fifo);
> + return 0;
> +}
> +subsys_initcall(cxl_proto_err_kfifo_init);
> +
> +bool is_aer_internal_error(struct aer_err_info *info)
> +{
> + if (info->severity == AER_CORRECTABLE)
> + return info->status & PCI_ERR_COR_INTERNAL;
> +
> + return info->status & PCI_ERR_UNC_INTN;
> +}
> +
> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
> +{
> + if (!info || !info->is_cxl)
> + return false;
> +
> + if (pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT)
> + return false;
> +
> + return is_aer_internal_error(info);
> +}
> +
> +void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info)
> +{
> + struct cxl_proto_err_work_data wd = {
> + .severity = info->severity,
> + .pdev = pdev,
> + };
> +
> + if (info->severity == AER_CORRECTABLE)
> + pci_write_config_dword(pdev, pdev->aer_cap + PCI_ERR_COR_STATUS,
> + info->status);
> +
> + guard(rwsem_read)(&cxl_proto_err_kfifo.rwsem);
> +
> + if (!cxl_proto_err_kfifo.work) {
> + dev_err_ratelimited(&pdev->dev, "AER-CXL kfifo reader not registered\n");
> + return;
> + }
> +
> + /*
> + * Reference discipline: the AER caller (handle_error_source())
> + * holds a ref on @pdev for the duration of this call and releases
> + * it on return. Take a fresh ref here so the pdev stays live while
> + * queued in the kfifo; the consumer (for_each_cxl_proto_err())
> + * drops that ref after handling. On enqueue failure below, drop
> + * the ref we just took to avoid a leak.
> + */
> + pci_dev_get(pdev);
> +
> + /* Serialize concurrent kfifo writers: multiple AER threaded IRQs */
> + if (!kfifo_in_spinlocked(&cxl_proto_err_kfifo.fifo, &wd, 1,
> + &cxl_proto_err_kfifo.fifo_lock)) {
> + dev_err_ratelimited(&pdev->dev, "AER-CXL kfifo add failed\n");
> + pci_dev_put(pdev);
> + return;
> + }
> +
> + schedule_work(cxl_proto_err_kfifo.work);
> +}
> +
> +void cxl_register_proto_err_work(struct work_struct *work)
> +{
> + guard(rwsem_write)(&cxl_proto_err_kfifo.rwsem);
> + WARN_ONCE(cxl_proto_err_kfifo.work,
> + "AER-CXL kfifo consumer already registered\n");
> + cxl_proto_err_kfifo.work = work;
> +}
> +EXPORT_SYMBOL_FOR_MODULES(cxl_register_proto_err_work, "cxl_core");
> +
> +static struct work_struct *cancel_cxl_proto_err(void)
> +{
> + struct work_struct *work;
> + struct cxl_proto_err_work_data wd;
> +
> + guard(rwsem_write)(&cxl_proto_err_kfifo.rwsem);
> + work = cxl_proto_err_kfifo.work;
> + cxl_proto_err_kfifo.work = NULL;
> + while (kfifo_get(&cxl_proto_err_kfifo.fifo, &wd)) {
> + dev_err_ratelimited(&wd.pdev->dev,
> + "AER-CXL error report canceled\n");
> + pci_dev_put(wd.pdev);
> + }
> + return work;
> +}
> +
> +void cxl_unregister_proto_err_work(void)
> +{
> + struct work_struct *work = cancel_cxl_proto_err();
> +
> + if (work)
> + cancel_work_sync(work);
> +}
> +EXPORT_SYMBOL_FOR_MODULES(cxl_unregister_proto_err_work, "cxl_core");
> +
> +int for_each_cxl_proto_err(struct cxl_proto_err_work_data *wd,
> + cxl_proto_err_fn_t fn)
> +{
> + int rc;
> +
> + guard(rwsem_read)(&cxl_proto_err_kfifo.rwsem);
> + while (kfifo_get(&cxl_proto_err_kfifo.fifo, wd)) {
> + rc = fn(wd);
> + pci_dev_put(wd->pdev);
> + if (rc)
> + return rc;
> + }
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_FOR_MODULES(for_each_cxl_proto_err, "cxl_core");
> diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
> index cc58bf2f2c84..66a6b8099c96 100644
> --- a/drivers/pci/pcie/portdrv.h
> +++ b/drivers/pci/pcie/portdrv.h
> @@ -130,9 +130,13 @@ struct aer_err_info;
> bool is_aer_internal_error(struct aer_err_info *info);
> void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info);
> void cxl_rch_enable_rcec(struct pci_dev *rcec);
> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info);
> +void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info);
> #else
> static inline bool is_aer_internal_error(struct aer_err_info *info) { return false; }
> static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { }
> static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
> +static inline bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info) { return false; }
> +static inline void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info) { }
> #endif /* CONFIG_CXL_RAS */
> #endif /* _PORTDRV_H_ */
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index df0f5c382286..78841cf4268c 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -25,6 +25,7 @@
> #define PCIE_STD_MAX_TLP_HEADERLOG (PCIE_STD_NUM_TLP_HEADERLOG + 10)
>
> struct pci_dev;
> +struct work_struct;
>
> struct pcie_tlp_log {
> union {
> @@ -53,6 +54,18 @@ struct aer_capability_regs {
> u16 uncor_err_source;
> };
>
> +/**
> + * struct cxl_proto_err_work_data - Error information used in CXL error handling
> + * @pdev: PCI device detecting the error
> + * @severity: AER severity
> + */
> +struct cxl_proto_err_work_data {
> + struct pci_dev *pdev;
> + int severity;
> +};
> +
> +typedef int (*cxl_proto_err_fn_t)(struct cxl_proto_err_work_data *wd);
> +
> #if defined(CONFIG_PCIEAER)
> int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
> int pcie_aer_is_native(struct pci_dev *dev);
> @@ -66,6 +79,21 @@ static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
> static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
> #endif
>
> +#ifdef CONFIG_CXL_RAS
> +void cxl_register_proto_err_work(struct work_struct *work);
> +int for_each_cxl_proto_err(struct cxl_proto_err_work_data *wd,
> + cxl_proto_err_fn_t fn);
> +void cxl_unregister_proto_err_work(void);
> +#else
> +static inline void cxl_register_proto_err_work(struct work_struct *work) { }
> +static inline int for_each_cxl_proto_err(struct cxl_proto_err_work_data *wd,
> + cxl_proto_err_fn_t fn)
> +{
> + return 0;
> +}
> +static inline void cxl_unregister_proto_err_work(void) { }
> +#endif
> +
> void pci_print_aer(struct pci_dev *dev, int aer_severity,
> struct aer_capability_regs *aer);
> int cper_severity_to_aer(int cper_severity);
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH v17 02/11] cxl/ras: Unify Endpoint and Port AER trace events
2026-05-05 17:30 [PATCH v17 00/11] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
2026-05-05 17:30 ` [PATCH v17 01/11] PCI/AER: Introduce AER-CXL Kfifo Terry Bowman
@ 2026-05-05 17:30 ` Terry Bowman
2026-05-05 21:46 ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 03/11] cxl: Use common CPER handling for all CXL devices Terry Bowman
` (8 subsequent siblings)
10 siblings, 1 reply; 21+ messages in thread
From: Terry Bowman @ 2026-05-05 17:30 UTC (permalink / raw)
To: dave, jic23, dave.jiang, alison.schofield, djbw, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, vishal.l.verma,
alucerop, ira.weiny, corbet, rafael, xueshuai, linux-cxl
Cc: linux-kernel, linux-pci, linux-acpi, linux-doc, terry.bowman
From: Dan Williams <djbw@kernel.org>
CXL protocol error logging uses two parallel sets of trace events. The
cxl_port_aer_correctable_error() and cxl_port_aer_uncorrectable_error()
events are used by CPER for CXL Port devices. The cxl_aer_correctable_error()
and cxl_aer_uncorrectable_error() events are used for CXL Endpoints. Update
the trace routines to use the latter for all CXL devices on both the CPER
and native AER paths.
Generalize cxl_aer_correctable_error()/cxl_aer_uncorrectable_error() to
take a struct device * and a u64 serial argument supplied by the caller.
cxl_handle_ras() and cxl_handle_cor_ras() gain the new u64 serial parameter,
sourced from pci_get_dsn().
The CPER path keeps its existing Port-vs-Endpoint dispatch and passes the
new arguments to the unified trace events. The CPER path will be folded
together in a following patch.
Remove the now-unused cxl_port_aer_correctable_error() and
cxl_port_aer_uncorrectable_error().
**WARNING: ABI BREAK**
Rename the trace event field "memdev" to "device" so all CXL device types
(Ports and Endpoints) can be reported under a common field name. Note this
is an ABI break for userspace tools that key off the old "memdev" field.
Specifically, rasdaemon's ras-cxl-handler.c looks up "memdev" and bails on
NULL, so an unmodified rasdaemon will drop every CXL CE/UCE event once this
kernel ships. A rasdaemon update is needed in a separate series.
The need for the field rename was discussed in v16 review [1].
Also, for a CXL Upstream Switch Port (USP) or Endpoint (EP) fatal UCE, the
cxl_aer_uncorrectable_error trace event is not emitted. The AER core only
reads PCI_ERR_UNCOR_STATUS for Root Ports, RCECs, and Downstream Ports, or
for non-fatal severities; PCI config reads from the source device are
expected to fail otherwise. Because the status word is never read,
is_cxl_error() does not classify the event as CXL, so the AER handler
consumes the event and logs it as a plain AER error without calling the
CXL RAS handlers or trace logging.
Before this patch, Endpoint and Port devices emitted different events:
# Endpoint (cxl_aer_*):
cxl_aer_correctable_error: memdev=mem0 host=0000:0c:00.0 serial=0: status: 'CRC Threshold Hit'
cxl_aer_uncorrectable_error: memdev=mem0 host=0000:0c:00.0 serial=0: status: 'Cache Data ECC Error | Memory Data ECC Error' first_error: 'Cache Data ECC Error'
# Port (cxl_port_aer_*, no serial field):
cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='CRC Threshold Hit'
cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Cache Data ECC Error | Memory Data ECC Error' first_error: 'Cache Data ECC Error'
After this patch, all CXL devices emit the unified cxl_aer_* events
with the same field layout:
cxl_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c serial=0 status: 'CRC Threshold Hit'
cxl_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c serial=0 status: 'Cache Data ECC Error | Memory Data ECC Error' first_error: 'Cache Data ECC Error'
[1] https://lore.kernel.org/linux-cxl/69cb2d5ba3111_178904100b7@dwillia2-mobl4.notmuch/
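
A userspace consumer can tolerate both layouts above by preferring the new
"device" field and falling back to "memdev". A minimal sketch with a toy
field table (the struct and helper are illustrative; a real consumer such
as rasdaemon would use libtraceevent's field lookup instead):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for a trace event's parsed field table */
struct toy_field {
	const char *name;
	const char *value;
};

/*
 * Hypothetical compatibility lookup: return the "device" field when
 * present (post-rename kernels), else fall back to the old "memdev"
 * name so one parser handles both trace event layouts.
 */
static const char *find_dev_field(const struct toy_field *fields, size_t n)
{
	const char *fallback = NULL;

	for (size_t i = 0; i < n; i++) {
		if (!strcmp(fields[i].name, "device"))
			return fields[i].value;
		if (!strcmp(fields[i].name, "memdev"))
			fallback = fields[i].value;
	}
	return fallback;
}
```
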
Co-developed-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Dan Williams <djbw@kernel.org>
---
Changes in v16->v17:
- Replace cxlds->serial with pci_get_dsn()
- Change 'memdev' to 'device' (Dan)
- Update commit message

Changes in v15->v16:
- Add Dan's review-by
- Incorporate Dan's comment into commit message:
  "Add the serial number at the end to preserve compatibility with
  libtraceevent parsing of the parameters."

Changes in v14->v15:
- Update commit message.
- Moved cxl_handle_ras/cxl_handle_cor_ras() changes to future patch (terry)

Changes in v13->v14:
- Update commit headline (Bjorn)

Changes in v12->v13:
- Added Dave Jiang's review-by

Changes in v11 -> v12:
- Correct parameters to call trace_cxl_aer_correctable_error()
- Add reviewed-by for Jonathan and Shiju

Changes in v10->v11:
- Updated CE and UCE trace routines to maintain consistent TP_Struct ABI
  and unchanged TP_printk() logging.
---
drivers/cxl/core/core.h | 11 ++++--
drivers/cxl/core/ras.c | 39 +++++++++++--------
drivers/cxl/core/ras_rch.c | 6 ++-
drivers/cxl/core/trace.h | 76 ++++++++------------------------------
4 files changed, 49 insertions(+), 83 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 82ca3a476708..132ac9c1ebf4 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -183,8 +183,9 @@ static inline struct device *dport_to_host(struct cxl_dport *dport)
#ifdef CONFIG_CXL_RAS
int cxl_ras_init(void);
void cxl_ras_exit(void);
-bool cxl_handle_ras(struct device *dev, void __iomem *ras_base);
-void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base);
+bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base);
+void cxl_handle_cor_ras(struct device *dev, u64 serial,
+ void __iomem *ras_base);
void cxl_dport_map_rch_aer(struct cxl_dport *dport);
void cxl_disable_rch_root_ints(struct cxl_dport *dport);
void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds);
@@ -195,11 +196,13 @@ static inline int cxl_ras_init(void)
return 0;
}
static inline void cxl_ras_exit(void) { }
-static inline bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
+static inline bool cxl_handle_ras(struct device *dev, u64 serial,
+ void __iomem *ras_base)
{
return false;
}
-static inline void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base) { }
+static inline void cxl_handle_cor_ras(struct device *dev, u64 serial,
+ void __iomem *ras_base) { }
static inline void cxl_dport_map_rch_aer(struct cxl_dport *dport) { }
static inline void cxl_disable_rch_root_ints(struct cxl_dport *dport) { }
static inline void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds) { }
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 006c6ffc2f56..d7081caaf5d3 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -13,7 +13,7 @@ static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
{
u32 status = ras_cap.cor_status & ~ras_cap.cor_mask;
- trace_cxl_port_aer_correctable_error(&pdev->dev, status);
+ trace_cxl_aer_correctable_error(&pdev->dev, status, pci_get_dsn(pdev));
}
static void cxl_cper_trace_uncorr_port_prot_err(struct pci_dev *pdev,
@@ -28,20 +28,24 @@ static void cxl_cper_trace_uncorr_port_prot_err(struct pci_dev *pdev,
else
fe = status;
- trace_cxl_port_aer_uncorrectable_error(&pdev->dev, status, fe,
- ras_cap.header_log);
+ trace_cxl_aer_uncorrectable_error(&pdev->dev, status, fe,
+ ras_cap.header_log,
+ pci_get_dsn(pdev));
}
-static void cxl_cper_trace_corr_prot_err(struct cxl_memdev *cxlmd,
+static void cxl_cper_trace_corr_prot_err(struct pci_dev *pdev,
+ struct cxl_memdev *cxlmd,
struct cxl_ras_capability_regs ras_cap)
{
u32 status = ras_cap.cor_status & ~ras_cap.cor_mask;
- trace_cxl_aer_correctable_error(cxlmd, status);
+ trace_cxl_aer_correctable_error(&cxlmd->dev, status,
+ pci_get_dsn(pdev));
}
static void
-cxl_cper_trace_uncorr_prot_err(struct cxl_memdev *cxlmd,
+cxl_cper_trace_uncorr_prot_err(struct pci_dev *pdev,
+ struct cxl_memdev *cxlmd,
struct cxl_ras_capability_regs ras_cap)
{
u32 status = ras_cap.uncor_status & ~ras_cap.uncor_mask;
@@ -53,8 +57,9 @@ cxl_cper_trace_uncorr_prot_err(struct cxl_memdev *cxlmd,
else
fe = status;
- trace_cxl_aer_uncorrectable_error(cxlmd, status, fe,
- ras_cap.header_log);
+ trace_cxl_aer_uncorrectable_error(&cxlmd->dev, status, fe,
+ ras_cap.header_log,
+ pci_get_dsn(pdev));
}
static int match_memdev_by_parent(struct device *dev, const void *uport)
@@ -101,9 +106,9 @@ void cxl_cper_handle_prot_err(struct cxl_cper_prot_err_work_data *data)
cxlmd = to_cxl_memdev(mem_dev);
if (data->severity == AER_CORRECTABLE)
- cxl_cper_trace_corr_prot_err(cxlmd, data->ras_cap);
+ cxl_cper_trace_corr_prot_err(pdev, cxlmd, data->ras_cap);
else
- cxl_cper_trace_uncorr_prot_err(cxlmd, data->ras_cap);
+ cxl_cper_trace_uncorr_prot_err(pdev, cxlmd, data->ras_cap);
}
EXPORT_SYMBOL_GPL(cxl_cper_handle_prot_err);
@@ -183,7 +188,7 @@ void devm_cxl_port_ras_setup(struct cxl_port *port)
}
EXPORT_SYMBOL_NS_GPL(devm_cxl_port_ras_setup, "CXL");
-void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base)
+void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
{
void __iomem *addr;
u32 status;
@@ -195,7 +200,7 @@ void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base)
status = readl(addr);
if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
- trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
+ trace_cxl_aer_correctable_error(dev, status, serial);
}
}
@@ -220,7 +225,7 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
* Log the state of the RAS status registers and prepare them to log the
* next error status. Return 1 if reset needed.
*/
-bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
+bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
{
u32 hl[CXL_HEADERLOG_SIZE_U32];
void __iomem *addr;
@@ -247,7 +252,7 @@ bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
}
header_log_copy(ras_base, hl);
- trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
+ trace_cxl_aer_uncorrectable_error(dev, status, fe, hl, serial);
writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
return true;
@@ -270,7 +275,8 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
if (cxlds->rcd)
cxl_handle_rdport_errors(cxlds);
- cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlmd->endpoint->regs.ras);
+ cxl_handle_cor_ras(&cxlds->cxlmd->dev, pci_get_dsn(pdev),
+ cxlmd->endpoint->regs.ras);
}
}
EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
@@ -299,7 +305,8 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
* chance the situation is recoverable dump the status of the RAS
* capability registers and bounce the active state of the memdev.
*/
- ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlmd->endpoint->regs.ras);
+ ue = cxl_handle_ras(&cxlds->cxlmd->dev, pci_get_dsn(pdev),
+ cxlmd->endpoint->regs.ras);
}
switch (state) {
diff --git a/drivers/cxl/core/ras_rch.c b/drivers/cxl/core/ras_rch.c
index 0a8b3b9b6388..61835fbafc0f 100644
--- a/drivers/cxl/core/ras_rch.c
+++ b/drivers/cxl/core/ras_rch.c
@@ -115,7 +115,9 @@ void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds)
pci_print_aer(pdev, severity, &aer_regs);
if (severity == AER_CORRECTABLE)
- cxl_handle_cor_ras(&cxlds->cxlmd->dev, dport->regs.ras);
+ cxl_handle_cor_ras(&cxlds->cxlmd->dev, pci_get_dsn(pdev),
+ dport->regs.ras);
else
- cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
+ cxl_handle_ras(&cxlds->cxlmd->dev, pci_get_dsn(pdev),
+ dport->regs.ras);
}
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index a972e4ef1936..6f3957b3c3af 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -48,49 +48,22 @@
{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" } \
)
-TRACE_EVENT(cxl_port_aer_uncorrectable_error,
- TP_PROTO(struct device *dev, u32 status, u32 fe, u32 *hl),
- TP_ARGS(dev, status, fe, hl),
+TRACE_EVENT(cxl_aer_uncorrectable_error,
+ TP_PROTO(const struct device *dev, u32 status, u32 fe, u32 *hl,
+ u64 serial),
+ TP_ARGS(dev, status, fe, hl, serial),
TP_STRUCT__entry(
__string(device, dev_name(dev))
__string(host, dev_name(dev->parent))
- __field(u32, status)
- __field(u32, first_error)
- __array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
- ),
- TP_fast_assign(
- __assign_str(device);
- __assign_str(host);
- __entry->status = status;
- __entry->first_error = fe;
- /*
- * Embed the 512B headerlog data for user app retrieval and
- * parsing, but no need to print this in the trace buffer.
- */
- memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
- ),
- TP_printk("device=%s host=%s status: '%s' first_error: '%s'",
- __get_str(device), __get_str(host),
- show_uc_errs(__entry->status),
- show_uc_errs(__entry->first_error)
- )
-);
-
-TRACE_EVENT(cxl_aer_uncorrectable_error,
- TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
- TP_ARGS(cxlmd, status, fe, hl),
- TP_STRUCT__entry(
- __string(memdev, dev_name(&cxlmd->dev))
- __string(host, dev_name(cxlmd->dev.parent))
__field(u64, serial)
__field(u32, status)
__field(u32, first_error)
__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
),
TP_fast_assign(
- __assign_str(memdev);
+ __assign_str(device);
__assign_str(host);
- __entry->serial = cxlmd->cxlds->serial;
+ __entry->serial = serial;
__entry->status = status;
__entry->first_error = fe;
/*
@@ -99,8 +72,8 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
*/
memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
),
- TP_printk("memdev=%s host=%s serial=%lld: status: '%s' first_error: '%s'",
- __get_str(memdev), __get_str(host), __entry->serial,
+ TP_printk("device=%s host=%s serial=%lld status: '%s' first_error: '%s'",
+ __get_str(device), __get_str(host), __entry->serial,
show_uc_errs(__entry->status),
show_uc_errs(__entry->first_error)
)
@@ -124,42 +97,23 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" } \
)
-TRACE_EVENT(cxl_port_aer_correctable_error,
- TP_PROTO(struct device *dev, u32 status),
- TP_ARGS(dev, status),
+TRACE_EVENT(cxl_aer_correctable_error,
+ TP_PROTO(const struct device *dev, u32 status, u64 serial),
+ TP_ARGS(dev, status, serial),
TP_STRUCT__entry(
__string(device, dev_name(dev))
__string(host, dev_name(dev->parent))
- __field(u32, status)
- ),
- TP_fast_assign(
- __assign_str(device);
- __assign_str(host);
- __entry->status = status;
- ),
- TP_printk("device=%s host=%s status='%s'",
- __get_str(device), __get_str(host),
- show_ce_errs(__entry->status)
- )
-);
-
-TRACE_EVENT(cxl_aer_correctable_error,
- TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
- TP_ARGS(cxlmd, status),
- TP_STRUCT__entry(
- __string(memdev, dev_name(&cxlmd->dev))
- __string(host, dev_name(cxlmd->dev.parent))
__field(u64, serial)
__field(u32, status)
),
TP_fast_assign(
- __assign_str(memdev);
+ __assign_str(device);
__assign_str(host);
- __entry->serial = cxlmd->cxlds->serial;
+ __entry->serial = serial;
__entry->status = status;
),
- TP_printk("memdev=%s host=%s serial=%lld: status: '%s'",
- __get_str(memdev), __get_str(host), __entry->serial,
+ TP_printk("device=%s host=%s serial=%lld status: '%s'",
+ __get_str(device), __get_str(host), __entry->serial,
show_ce_errs(__entry->status)
)
);
--
2.34.1
^ permalink raw reply related [flat|nested] 21+ messages in thread

* Re: [PATCH v17 02/11] cxl/ras: Unify Endpoint and Port AER trace events
2026-05-05 17:30 ` [PATCH v17 02/11] cxl/ras: Unify Endpoint and Port AER trace events Terry Bowman
@ 2026-05-05 21:46 ` Dave Jiang
0 siblings, 0 replies; 21+ messages in thread
From: Dave Jiang @ 2026-05-05 21:46 UTC (permalink / raw)
To: Terry Bowman, dave, jic23, alison.schofield, djbw, bhelgaas,
ming.li, Smita.KoralahalliChannabasappa, rrichter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, vishal.l.verma, alucerop, ira.weiny,
corbet, rafael, xueshuai, linux-cxl
Cc: linux-kernel, linux-pci, linux-acpi, linux-doc
On 5/5/26 10:30 AM, Terry Bowman wrote:
> From: Dan Williams <djbw@kernel.org>
>
> [..]
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
^ permalink raw reply [flat|nested] 21+ messages in thread
* [PATCH v17 03/11] cxl: Use common CPER handling for all CXL devices
2026-05-05 17:30 [PATCH v17 00/11] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
2026-05-05 17:30 ` [PATCH v17 01/11] PCI/AER: Introduce AER-CXL Kfifo Terry Bowman
2026-05-05 17:30 ` [PATCH v17 02/11] cxl/ras: Unify Endpoint and Port AER trace events Terry Bowman
@ 2026-05-05 17:30 ` Terry Bowman
2026-05-05 22:02 ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 04/11] cxl: Rename find_cxl_port() to find_cxl_port_by_dport() Terry Bowman
` (7 subsequent siblings)
10 siblings, 1 reply; 21+ messages in thread
From: Terry Bowman @ 2026-05-05 17:30 UTC (permalink / raw)
To: dave, jic23, dave.jiang, alison.schofield, djbw, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, vishal.l.verma,
alucerop, ira.weiny, corbet, rafael, xueshuai, linux-cxl
Cc: linux-kernel, linux-pci, linux-acpi, linux-doc, terry.bowman
Fold the Port and Endpoint specific paths in cxl_cper_handle_prot_err()
into a single code path. Drop the PCI type dispatch block as both Port
and Endpoint devices now go through the same code path.

Extend the pdev->dev.driver != NULL gate to Port devices, which previously
bypassed it. This check and the existing device lock ensure the CXL
device remains accessible while in scope.

Recent trace event changes generalize the interface to take a
struct device * for all CXL devices. Update the Endpoint CPER path
to pass &pdev->dev (the PCI device) instead of &cxlmd->dev (the
memdev). This makes the trace event's "device=" field show the PCI
BDF for all CPER callers, replacing the prior "device=memN" output
for Endpoints. Userspace consumers correlating CPER trace events to
memdev names must map the PCI BDF back via /sys/bus/cxl/devices/.

Remove the bus_find_device(&cxl_bus_type, ..., match_memdev_by_parent)
lookup along with the match_memdev_by_parent() helper.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
Changes in v16->v17:
- New commit
---
drivers/cxl/core/ras.c | 81 +++++++-----------------------------------
1 file changed, 13 insertions(+), 68 deletions(-)
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index d7081caaf5d3..56611da8357a 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -8,65 +8,28 @@
#include <cxlpci.h>
#include "trace.h"
-static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
- struct cxl_ras_capability_regs ras_cap)
+static void cxl_cper_trace_corr_prot_err(struct pci_dev *pdev, u64 serial,
+ struct cxl_ras_capability_regs *ras_cap)
{
- u32 status = ras_cap.cor_status & ~ras_cap.cor_mask;
+ u32 status = ras_cap->cor_status & ~ras_cap->cor_mask;
- trace_cxl_aer_correctable_error(&pdev->dev, status, pci_get_dsn(pdev));
+ trace_cxl_aer_correctable_error(&pdev->dev, status, serial);
}
-static void cxl_cper_trace_uncorr_port_prot_err(struct pci_dev *pdev,
- struct cxl_ras_capability_regs ras_cap)
+static void cxl_cper_trace_uncorr_prot_err(struct pci_dev *pdev, u64 serial,
+ struct cxl_ras_capability_regs *ras_cap)
{
- u32 status = ras_cap.uncor_status & ~ras_cap.uncor_mask;
+ u32 status = ras_cap->uncor_status & ~ras_cap->uncor_mask;
u32 fe;
if (hweight32(status) > 1)
fe = BIT(FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK,
- ras_cap.cap_control));
+ ras_cap->cap_control));
else
fe = status;
trace_cxl_aer_uncorrectable_error(&pdev->dev, status, fe,
- ras_cap.header_log,
- pci_get_dsn(pdev));
-}
-
-static void cxl_cper_trace_corr_prot_err(struct pci_dev *pdev,
- struct cxl_memdev *cxlmd,
- struct cxl_ras_capability_regs ras_cap)
-{
- u32 status = ras_cap.cor_status & ~ras_cap.cor_mask;
-
- trace_cxl_aer_correctable_error(&cxlmd->dev, status,
- pci_get_dsn(pdev));
-}
-
-static void
-cxl_cper_trace_uncorr_prot_err(struct pci_dev *pdev,
- struct cxl_memdev *cxlmd,
- struct cxl_ras_capability_regs ras_cap)
-{
- u32 status = ras_cap.uncor_status & ~ras_cap.uncor_mask;
- u32 fe;
-
- if (hweight32(status) > 1)
- fe = BIT(FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK,
- ras_cap.cap_control));
- else
- fe = status;
-
- trace_cxl_aer_uncorrectable_error(&cxlmd->dev, status, fe,
- ras_cap.header_log,
- pci_get_dsn(pdev));
-}
-
-static int match_memdev_by_parent(struct device *dev, const void *uport)
-{
- if (is_cxl_memdev(dev) && dev->parent == uport)
- return 1;
- return 0;
+ ras_cap->header_log, serial);
}
void cxl_cper_handle_prot_err(struct cxl_cper_prot_err_work_data *data)
@@ -77,38 +40,20 @@ void cxl_cper_handle_prot_err(struct cxl_cper_prot_err_work_data *data)
pci_get_domain_bus_and_slot(data->prot_err.agent_addr.segment,
data->prot_err.agent_addr.bus,
devfn);
- struct cxl_memdev *cxlmd;
- int port_type;
if (!pdev)
return;
- port_type = pci_pcie_type(pdev);
- if (port_type == PCI_EXP_TYPE_ROOT_PORT ||
- port_type == PCI_EXP_TYPE_DOWNSTREAM ||
- port_type == PCI_EXP_TYPE_UPSTREAM) {
- if (data->severity == AER_CORRECTABLE)
- cxl_cper_trace_corr_port_prot_err(pdev, data->ras_cap);
- else
- cxl_cper_trace_uncorr_port_prot_err(pdev, data->ras_cap);
-
- return;
- }
-
guard(device)(&pdev->dev);
if (!pdev->dev.driver)
return;
- struct device *mem_dev __free(put_device) = bus_find_device(
- &cxl_bus_type, NULL, pdev, match_memdev_by_parent);
- if (!mem_dev)
- return;
-
- cxlmd = to_cxl_memdev(mem_dev);
if (data->severity == AER_CORRECTABLE)
- cxl_cper_trace_corr_prot_err(pdev, cxlmd, data->ras_cap);
+ cxl_cper_trace_corr_prot_err(pdev, pci_get_dsn(pdev),
+ &data->ras_cap);
else
- cxl_cper_trace_uncorr_prot_err(pdev, cxlmd, data->ras_cap);
+ cxl_cper_trace_uncorr_prot_err(pdev, pci_get_dsn(pdev),
+ &data->ras_cap);
}
EXPORT_SYMBOL_GPL(cxl_cper_handle_prot_err);
--
2.34.1
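The masked-status and first-error selection that the unified trace helpers in the diff above perform can be sketched as plain userspace C. This is a hypothetical illustration, not the kernel code: the field-mask value and helper names are assumptions, and `__builtin_popcount()` stands in for `hweight32()`:

```c
#include <stdint.h>

/* Assume the First Error Pointer occupies the low bits of cap_control;
 * this mirrors FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK, cap_control) for
 * a low-bit mask. The exact mask value is an assumption here. */
#define RAS_CAP_CONTROL_FE_MASK 0x3fu

static uint32_t masked_status(uint32_t status, uint32_t mask)
{
	/* A bit is reportable only when set in status and not masked. */
	return status & ~mask;
}

static uint32_t first_error(uint32_t status, uint32_t cap_control)
{
	/* With multiple unmasked bits set, the First Error Pointer in
	 * the RAS Capability Control register names the first error;
	 * with a single bit set, that bit is the first error. */
	if (__builtin_popcount(status) > 1)
		return 1u << (cap_control & RAS_CAP_CONTROL_FE_MASK);
	return status;
}
```

The same two-step computation (mask, then pick the first error) backs both the correctable and uncorrectable trace paths after the fold.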
^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH v17 03/11] cxl: Use common CPER handling for all CXL devices
2026-05-05 17:30 ` [PATCH v17 03/11] cxl: Use common CPER handling for all CXL devices Terry Bowman
@ 2026-05-05 22:02 ` Dave Jiang
0 siblings, 0 replies; 21+ messages in thread
From: Dave Jiang @ 2026-05-05 22:02 UTC (permalink / raw)
To: Terry Bowman, dave, jic23, alison.schofield, djbw, bhelgaas,
ming.li, Smita.KoralahalliChannabasappa, rrichter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, vishal.l.verma, alucerop, ira.weiny,
corbet, rafael, xueshuai, linux-cxl
Cc: linux-kernel, linux-pci, linux-acpi, linux-doc
On 5/5/26 10:30 AM, Terry Bowman wrote:
> Fold the Port and Endpoint specific paths in cxl_cper_handle_prot_err()
> into a single code path. Drop the PCI type dispatch block as both Port
> and Endpoint devices now go through the same code path.
>
> Extend the pdev->dev.driver != NULL gate to Port devices, which previously
> bypassed it. This check and the existing device lock will ensure the CXL
> device remains accessible while in scope.
>
> Recent trace event changes generalize the interface to take a
> struct device * for all CXL devices. Update the Endpoint CPER path
> to pass &pdev->dev (the PCI device) instead of &cxlmd->dev (the
> memdev). This makes the trace event's "device=" field show the PCI
> BDF for all CPER callers, replacing the prior "device=memN" output
> for Endpoints. Userspace consumers correlating CPER trace events to
> memdev names must map the PCI BDF back via /sys/bus/cxl/devices/.
>
> Remove the bus_find_device(&cxl_bus_type, ..., match_memdev_by_parent)
> lookup along with the match_memdev_by_parent() helper.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>
> ---
>
> Changes in v16->v17:
> - New commit
> ---
> drivers/cxl/core/ras.c | 81 +++++++-----------------------------------
> 1 file changed, 13 insertions(+), 68 deletions(-)
>
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index d7081caaf5d3..56611da8357a 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -8,65 +8,28 @@
> #include <cxlpci.h>
> #include "trace.h"
>
> -static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
> - struct cxl_ras_capability_regs ras_cap)
> +static void cxl_cper_trace_corr_prot_err(struct pci_dev *pdev, u64 serial,
> + struct cxl_ras_capability_regs *ras_cap)
> {
> - u32 status = ras_cap.cor_status & ~ras_cap.cor_mask;
> + u32 status = ras_cap->cor_status & ~ras_cap->cor_mask;
>
> - trace_cxl_aer_correctable_error(&pdev->dev, status, pci_get_dsn(pdev));
> + trace_cxl_aer_correctable_error(&pdev->dev, status, serial);
> }
>
> -static void cxl_cper_trace_uncorr_port_prot_err(struct pci_dev *pdev,
> - struct cxl_ras_capability_regs ras_cap)
> +static void cxl_cper_trace_uncorr_prot_err(struct pci_dev *pdev, u64 serial,
> + struct cxl_ras_capability_regs *ras_cap)
> {
> - u32 status = ras_cap.uncor_status & ~ras_cap.uncor_mask;
> + u32 status = ras_cap->uncor_status & ~ras_cap->uncor_mask;
> u32 fe;
>
> if (hweight32(status) > 1)
> fe = BIT(FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK,
> - ras_cap.cap_control));
> + ras_cap->cap_control));
> else
> fe = status;
>
> trace_cxl_aer_uncorrectable_error(&pdev->dev, status, fe,
> - ras_cap.header_log,
> - pci_get_dsn(pdev));
> -}
> -
> -static void cxl_cper_trace_corr_prot_err(struct pci_dev *pdev,
> - struct cxl_memdev *cxlmd,
> - struct cxl_ras_capability_regs ras_cap)
> -{
> - u32 status = ras_cap.cor_status & ~ras_cap.cor_mask;
> -
> - trace_cxl_aer_correctable_error(&cxlmd->dev, status,
> - pci_get_dsn(pdev));
> -}
> -
> -static void
> -cxl_cper_trace_uncorr_prot_err(struct pci_dev *pdev,
> - struct cxl_memdev *cxlmd,
> - struct cxl_ras_capability_regs ras_cap)
> -{
> - u32 status = ras_cap.uncor_status & ~ras_cap.uncor_mask;
> - u32 fe;
> -
> - if (hweight32(status) > 1)
> - fe = BIT(FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK,
> - ras_cap.cap_control));
> - else
> - fe = status;
> -
> - trace_cxl_aer_uncorrectable_error(&cxlmd->dev, status, fe,
> - ras_cap.header_log,
> - pci_get_dsn(pdev));
> -}
> -
> -static int match_memdev_by_parent(struct device *dev, const void *uport)
> -{
> - if (is_cxl_memdev(dev) && dev->parent == uport)
> - return 1;
> - return 0;
> + ras_cap->header_log, serial);
> }
>
> void cxl_cper_handle_prot_err(struct cxl_cper_prot_err_work_data *data)
> @@ -77,38 +40,20 @@ void cxl_cper_handle_prot_err(struct cxl_cper_prot_err_work_data *data)
> pci_get_domain_bus_and_slot(data->prot_err.agent_addr.segment,
> data->prot_err.agent_addr.bus,
> devfn);
> - struct cxl_memdev *cxlmd;
> - int port_type;
>
> if (!pdev)
> return;
>
> - port_type = pci_pcie_type(pdev);
> - if (port_type == PCI_EXP_TYPE_ROOT_PORT ||
> - port_type == PCI_EXP_TYPE_DOWNSTREAM ||
> - port_type == PCI_EXP_TYPE_UPSTREAM) {
> - if (data->severity == AER_CORRECTABLE)
> - cxl_cper_trace_corr_port_prot_err(pdev, data->ras_cap);
> - else
> - cxl_cper_trace_uncorr_port_prot_err(pdev, data->ras_cap);
> -
> - return;
> - }
> -
> guard(device)(&pdev->dev);
> if (!pdev->dev.driver)
> return;
>
> - struct device *mem_dev __free(put_device) = bus_find_device(
> - &cxl_bus_type, NULL, pdev, match_memdev_by_parent);
> - if (!mem_dev)
> - return;
> -
> - cxlmd = to_cxl_memdev(mem_dev);
> if (data->severity == AER_CORRECTABLE)
> - cxl_cper_trace_corr_prot_err(pdev, cxlmd, data->ras_cap);
> + cxl_cper_trace_corr_prot_err(pdev, pci_get_dsn(pdev),
> + &data->ras_cap);
> else
> - cxl_cper_trace_uncorr_prot_err(pdev, cxlmd, data->ras_cap);
> + cxl_cper_trace_uncorr_prot_err(pdev, pci_get_dsn(pdev),
> + &data->ras_cap);
> }
> EXPORT_SYMBOL_GPL(cxl_cper_handle_prot_err);
>
* [PATCH v17 04/11] cxl: Rename find_cxl_port() to find_cxl_port_by_dport()
2026-05-05 17:30 [PATCH v17 00/11] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (2 preceding siblings ...)
2026-05-05 17:30 ` [PATCH v17 03/11] cxl: Use common CPER handling for all CXL devices Terry Bowman
@ 2026-05-05 17:30 ` Terry Bowman
2026-05-05 22:06 ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 05/11] cxl: Limit CXL-CPER kfifo registration functions scope Terry Bowman
` (6 subsequent siblings)
10 siblings, 1 reply; 21+ messages in thread
From: Terry Bowman @ 2026-05-05 17:30 UTC (permalink / raw)
To: dave, jic23, dave.jiang, alison.schofield, djbw, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, vishal.l.verma,
alucerop, ira.weiny, corbet, rafael, xueshuai, linux-cxl
Cc: linux-kernel, linux-pci, linux-acpi, linux-doc, terry.bowman
From: Dan Williams <djbw@kernel.org>
find_cxl_port() and find_cxl_port_by_uport() are internal port lookup
functions that search the CXL bus by dport and uport respectively, but
their names do not make the lookup method clear.
Rename find_cxl_port() to find_cxl_port_by_dport() to make the lookup
method explicit and consistent with find_cxl_port_by_uport(). Both
functions remain static to port.c; the upcoming patch that adds the
first cross-file caller will widen their scope.
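The lookup contract these helpers document, returning the object with an elevated reference that the caller must drop (the kernel uses `__free(put_cxl_port)`), can be illustrated with a minimal userspace sketch. All names and the match condition below are hypothetical, not the kernel API:

```c
#include <stddef.h>

struct obj { int refcount; };

/* A single registered object standing in for a device on the bus. */
static struct obj registry = { .refcount = 1 };

static struct obj *find_obj_by_key(int key)
{
	if (key != 42)		/* illustrative match condition */
		return NULL;
	registry.refcount++;	/* elevated reference for the caller */
	return &registry;
}

static void put_obj(struct obj *o)
{
	/* Caller drops the reference taken by the find helper. */
	if (o)
		o->refcount--;
}
```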
Co-developed-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Dan Williams <djbw@kernel.org>
---
Changes in v16->v17:
- New commit
---
drivers/cxl/core/port.c | 20 ++++++++++++++------
1 file changed, 14 insertions(+), 6 deletions(-)
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index c5aacd7054f1..b35a9016fc81 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1377,7 +1377,7 @@ static int match_port_by_dport(struct device *dev, const void *data)
return dport != NULL;
}
-static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
+static struct cxl_port *__find_cxl_port_by_dport(struct cxl_find_port_ctx *ctx)
{
struct device *dev;
@@ -1390,8 +1390,16 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
return NULL;
}
-static struct cxl_port *find_cxl_port(struct device *dport_dev,
- struct cxl_dport **dport)
+/**
+ * find_cxl_port_by_dport - find a cxl_port by one of its targets
+ * @dport_dev: device representing the dport target
+ * @dport: optional output of the 'struct cxl_dport' companion of the @dport_dev
+ *
+ * Return a 'struct cxl_port' with an elevated reference if found. Use
+ * __free(put_cxl_port) to release.
+ */
+static struct cxl_port *find_cxl_port_by_dport(struct device *dport_dev,
+ struct cxl_dport **dport)
{
struct cxl_find_port_ctx ctx = {
.dport_dev = dport_dev,
@@ -1399,7 +1407,7 @@ static struct cxl_port *find_cxl_port(struct device *dport_dev,
};
struct cxl_port *port;
- port = __find_cxl_port(&ctx);
+ port = __find_cxl_port_by_dport(&ctx);
return port;
}
@@ -1893,14 +1901,14 @@ EXPORT_SYMBOL_NS_GPL(devm_cxl_enumerate_ports, "CXL");
struct cxl_port *cxl_pci_find_port(struct pci_dev *pdev,
struct cxl_dport **dport)
{
- return find_cxl_port(pdev->dev.parent, dport);
+ return find_cxl_port_by_dport(pdev->dev.parent, dport);
}
EXPORT_SYMBOL_NS_GPL(cxl_pci_find_port, "CXL");
struct cxl_port *cxl_mem_find_port(struct cxl_memdev *cxlmd,
struct cxl_dport **dport)
{
- return find_cxl_port(grandparent(&cxlmd->dev), dport);
+ return find_cxl_port_by_dport(grandparent(&cxlmd->dev), dport);
}
EXPORT_SYMBOL_NS_GPL(cxl_mem_find_port, "CXL");
--
2.34.1
* Re: [PATCH v17 04/11] cxl: Rename find_cxl_port() to find_cxl_port_by_dport()
2026-05-05 17:30 ` [PATCH v17 04/11] cxl: Rename find_cxl_port() to find_cxl_port_by_dport() Terry Bowman
@ 2026-05-05 22:06 ` Dave Jiang
0 siblings, 0 replies; 21+ messages in thread
From: Dave Jiang @ 2026-05-05 22:06 UTC (permalink / raw)
To: Terry Bowman, dave, jic23, alison.schofield, djbw, bhelgaas,
ming.li, Smita.KoralahalliChannabasappa, rrichter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, vishal.l.verma, alucerop, ira.weiny,
corbet, rafael, xueshuai, linux-cxl
Cc: linux-kernel, linux-pci, linux-acpi, linux-doc
On 5/5/26 10:30 AM, Terry Bowman wrote:
> From: Dan Williams <djbw@kernel.org>
>
> find_cxl_port() and find_cxl_port_by_uport() are internal port lookup
> functions that search the CXL bus by dport and uport respectively, but
> their names do not make the lookup method clear.
>
> Rename find_cxl_port() to find_cxl_port_by_dport() to make the lookup
> method explicit and consistent with find_cxl_port_by_uport(). Both
> functions remain static to port.c; the upcoming patch that adds the
> first cross-file caller will widen their scope.
>
> Co-developed-by: Terry Bowman <terry.bowman@amd.com>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Signed-off-by: Dan Williams <djbw@kernel.org>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>
> ---
>
> Changes in v16->v17:
> - New commit
> ---
> drivers/cxl/core/port.c | 20 ++++++++++++++------
> 1 file changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index c5aacd7054f1..b35a9016fc81 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -1377,7 +1377,7 @@ static int match_port_by_dport(struct device *dev, const void *data)
> return dport != NULL;
> }
>
> -static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
> +static struct cxl_port *__find_cxl_port_by_dport(struct cxl_find_port_ctx *ctx)
> {
> struct device *dev;
>
> @@ -1390,8 +1390,16 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
> return NULL;
> }
>
> -static struct cxl_port *find_cxl_port(struct device *dport_dev,
> - struct cxl_dport **dport)
> +/**
> + * find_cxl_port_by_dport - find a cxl_port by one of its targets
> + * @dport_dev: device representing the dport target
> + * @dport: optional output of the 'struct cxl_dport' companion of the @dport_dev
> + *
> + * Return a 'struct cxl_port' with an elevated reference if found. Use
> + * __free(put_cxl_port) to release.
> + */
> +static struct cxl_port *find_cxl_port_by_dport(struct device *dport_dev,
> + struct cxl_dport **dport)
> {
> struct cxl_find_port_ctx ctx = {
> .dport_dev = dport_dev,
> @@ -1399,7 +1407,7 @@ static struct cxl_port *find_cxl_port(struct device *dport_dev,
> };
> struct cxl_port *port;
>
> - port = __find_cxl_port(&ctx);
> + port = __find_cxl_port_by_dport(&ctx);
> return port;
> }
>
> @@ -1893,14 +1901,14 @@ EXPORT_SYMBOL_NS_GPL(devm_cxl_enumerate_ports, "CXL");
> struct cxl_port *cxl_pci_find_port(struct pci_dev *pdev,
> struct cxl_dport **dport)
> {
> - return find_cxl_port(pdev->dev.parent, dport);
> + return find_cxl_port_by_dport(pdev->dev.parent, dport);
> }
> EXPORT_SYMBOL_NS_GPL(cxl_pci_find_port, "CXL");
>
> struct cxl_port *cxl_mem_find_port(struct cxl_memdev *cxlmd,
> struct cxl_dport **dport)
> {
> - return find_cxl_port(grandparent(&cxlmd->dev), dport);
> + return find_cxl_port_by_dport(grandparent(&cxlmd->dev), dport);
> }
> EXPORT_SYMBOL_NS_GPL(cxl_mem_find_port, "CXL");
>
* [PATCH v17 05/11] cxl: Limit CXL-CPER kfifo registration functions scope
2026-05-05 17:30 [PATCH v17 00/11] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (3 preceding siblings ...)
2026-05-05 17:30 ` [PATCH v17 04/11] cxl: Rename find_cxl_port() to find_cxl_port_by_dport() Terry Bowman
@ 2026-05-05 17:30 ` Terry Bowman
2026-05-05 22:16 ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 06/11] PCI: Establish common CXL Port protocol error flow Terry Bowman
` (5 subsequent siblings)
10 siblings, 1 reply; 21+ messages in thread
From: Terry Bowman @ 2026-05-05 17:30 UTC (permalink / raw)
To: dave, jic23, dave.jiang, alison.schofield, djbw, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, vishal.l.verma,
alucerop, ira.weiny, corbet, rafael, xueshuai, linux-cxl
Cc: linux-kernel, linux-pci, linux-acpi, linux-doc, terry.bowman
From: Dan Williams <djbw@kernel.org>
Some CPER functions used by CXL drivers are exported using the
EXPORT_SYMBOL_NS_GPL(fn, ns) macro. This doesn't provide compile time
enforcement or visibility of the consumers.
This can be improved by using EXPORT_SYMBOL_FOR_MODULES() instead.
EXPORT_SYMBOL_FOR_MODULES() explicitly names the modules that can access
the function. This provides more precise control and visibility of symbol
exposure than the namespace macro. It also provides compile time checking.
To improve control and clarity, update cxl_cper_register_prot_err_work(),
cxl_cper_unregister_prot_err_work(), and cxl_cper_prot_err_kfifo_get()
to use EXPORT_SYMBOL_FOR_MODULES(). Also, update the register and unregister
functions to return void.
Update the CPER kfifo unregister to clear the registered work pointer
under the lock, then synchronously cancel any in-flight work.
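A minimal userspace analogue of that unregister ordering, clear the published pointer under the lock and only then cancel outside it, might look like the following. `cancel_work()` is a stub for `cancel_work_sync()` and all names are stand-ins, not the kernel API:

```c
#include <pthread.h>
#include <stddef.h>

static void *registered_work;
static pthread_mutex_t work_lock = PTHREAD_MUTEX_INITIALIZER;
static int cancelled;

/* Stub for cancel_work_sync(): just count cancellations. */
static void cancel_work(void *work)
{
	(void)work;
	cancelled++;
}

static void register_work(void *work)
{
	pthread_mutex_lock(&work_lock);
	/* The kernel patch WARN_ONCEs here on double registration. */
	registered_work = work;
	pthread_mutex_unlock(&work_lock);
}

static void unregister_work(void)
{
	void *work;

	/* Unpublish under the lock so no new dispatch sees it... */
	pthread_mutex_lock(&work_lock);
	work = registered_work;
	registered_work = NULL;
	pthread_mutex_unlock(&work_lock);

	/* ...then cancel outside the lock, since it may sleep. */
	if (work)
		cancel_work(work);
}
```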
Co-developed-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Dan Williams <djbw@kernel.org>
---
Changes in v16->v17:
- Split from v16 02/10 ("Update unregistration for AER-CXL and
CPER-CXL kfifos"); AER-CXL half folded into v17 01/10.
- Convert exports to EXPORT_SYMBOL_FOR_MODULES("cxl_core").
- Change register/unregister return type from int to void.
- Drop work_struct argument from cxl_cper_unregister_prot_err_work();
it now cancels its own work.
- Remove now-redundant cancel_work_sync() from cxl_ras_exit().
- Add WARN_ONCE() in cxl_cper_register_prot_err_work() for
double-registration.
---
drivers/acpi/apei/ghes.c | 27 ++++++++++++++-------------
drivers/cxl/core/ras.c | 6 +++---
include/cxl/event.h | 10 ++++------
3 files changed, 21 insertions(+), 22 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 3236a3ce79d6..dd0a073af93c 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -778,33 +778,34 @@ static void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
#endif
}
-int cxl_cper_register_prot_err_work(struct work_struct *work)
+void cxl_cper_register_prot_err_work(struct work_struct *work)
{
- if (cxl_cper_prot_err_work)
- return -EINVAL;
-
guard(spinlock)(&cxl_cper_prot_err_work_lock);
+ WARN_ONCE(cxl_cper_prot_err_work,
+ "CPER-CXL kfifo consumer already registered\n");
cxl_cper_prot_err_work = work;
- return 0;
}
-EXPORT_SYMBOL_NS_GPL(cxl_cper_register_prot_err_work, "CXL");
+EXPORT_SYMBOL_FOR_MODULES(cxl_cper_register_prot_err_work, "cxl_core");
-int cxl_cper_unregister_prot_err_work(struct work_struct *work)
+void cxl_cper_unregister_prot_err_work(void)
{
- if (cxl_cper_prot_err_work != work)
- return -EINVAL;
+ struct work_struct *work;
- guard(spinlock)(&cxl_cper_prot_err_work_lock);
+ spin_lock(&cxl_cper_prot_err_work_lock);
+ work = cxl_cper_prot_err_work;
cxl_cper_prot_err_work = NULL;
- return 0;
+ spin_unlock(&cxl_cper_prot_err_work_lock);
+
+ if (work)
+ cancel_work_sync(work);
}
-EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_prot_err_work, "CXL");
+EXPORT_SYMBOL_FOR_MODULES(cxl_cper_unregister_prot_err_work, "cxl_core");
int cxl_cper_prot_err_kfifo_get(struct cxl_cper_prot_err_work_data *wd)
{
return kfifo_get(&cxl_cper_prot_err_fifo, wd);
}
-EXPORT_SYMBOL_NS_GPL(cxl_cper_prot_err_kfifo_get, "CXL");
+EXPORT_SYMBOL_FOR_MODULES(cxl_cper_prot_err_kfifo_get, "cxl_core");
/* Room for 8 entries for each of the 4 event log queues */
#define CXL_CPER_FIFO_DEPTH 32
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 56611da8357a..9193dac4e507 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -68,13 +68,13 @@ static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
int cxl_ras_init(void)
{
- return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
+ cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
+ return 0;
}
void cxl_ras_exit(void)
{
- cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
- cancel_work_sync(&cxl_cper_prot_err_work);
+ cxl_cper_unregister_prot_err_work();
}
static void cxl_dport_map_ras(struct cxl_dport *dport)
diff --git a/include/cxl/event.h b/include/cxl/event.h
index ff97fea718d2..51acedb0d683 100644
--- a/include/cxl/event.h
+++ b/include/cxl/event.h
@@ -289,8 +289,8 @@ struct cxl_cper_prot_err_work_data {
int cxl_cper_register_work(struct work_struct *work);
int cxl_cper_unregister_work(struct work_struct *work);
int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd);
-int cxl_cper_register_prot_err_work(struct work_struct *work);
-int cxl_cper_unregister_prot_err_work(struct work_struct *work);
+void cxl_cper_register_prot_err_work(struct work_struct *work);
+void cxl_cper_unregister_prot_err_work(void);
int cxl_cper_prot_err_kfifo_get(struct cxl_cper_prot_err_work_data *wd);
#else
static inline int cxl_cper_register_work(struct work_struct *work)
@@ -306,13 +306,11 @@ static inline int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd)
{
return 0;
}
-static inline int cxl_cper_register_prot_err_work(struct work_struct *work)
+static inline void cxl_cper_register_prot_err_work(struct work_struct *work)
{
- return 0;
}
-static inline int cxl_cper_unregister_prot_err_work(struct work_struct *work)
+static inline void cxl_cper_unregister_prot_err_work(void)
{
- return 0;
}
static inline int cxl_cper_prot_err_kfifo_get(struct cxl_cper_prot_err_work_data *wd)
{
--
2.34.1
* Re: [PATCH v17 05/11] cxl: Limit CXL-CPER kfifo registration functions scope
2026-05-05 17:30 ` [PATCH v17 05/11] cxl: Limit CXL-CPER kfifo registration functions scope Terry Bowman
@ 2026-05-05 22:16 ` Dave Jiang
0 siblings, 0 replies; 21+ messages in thread
From: Dave Jiang @ 2026-05-05 22:16 UTC (permalink / raw)
To: Terry Bowman, dave, jic23, alison.schofield, djbw, bhelgaas,
ming.li, Smita.KoralahalliChannabasappa, rrichter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, vishal.l.verma, alucerop, ira.weiny,
corbet, rafael, xueshuai, linux-cxl
Cc: linux-kernel, linux-pci, linux-acpi, linux-doc
On 5/5/26 10:30 AM, Terry Bowman wrote:
> From: Dan Williams <djbw@kernel.org>
>
> Some CPER functions used by CXL drivers are exported using the
> EXPORT_SYMBOL_NS_GPL(fn, ns) macro. This doesn't provide compile time
> enforcement or visibility of the consumers.
>
> This can be improved by using EXPORT_SYMBOL_FOR_MODULES() instead.
> EXPORT_SYMBOL_FOR_MODULES() explicitly names the modules that can access
> the function. This provides more precise control and visibility of symbol
> exposure than the namespace macro. It also provides compile time checking.
>
> To improve control and clarity, update cxl_cper_register_prot_err_work(),
> cxl_cper_unregister_prot_err_work(), and cxl_cper_prot_err_kfifo_get()
> to use EXPORT_SYMBOL_FOR_MODULES(). Also, update the register and unregister
> functions to return void type.
>
> Update the CPER kfifo unregister to cancel work while using
> synchronization.
>
> Co-developed-by: Terry Bowman <terry.bowman@amd.com>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Signed-off-by: Dan Williams <djbw@kernel.org>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>
> ---
>
> Changes in v16->v17:
> - Split from v16 02/10 ("Update unregistration for AER-CXL and
> CPER-CXL kfifos"); AER-CXL half folded into v17 01/10.
> - Convert exports to EXPORT_SYMBOL_FOR_MODULES("cxl_core").
> - Change register/unregister return type from int to void.
> - Drop work_struct argument from cxl_cper_unregister_prot_err_work();
> it now cancels its own work.
> - Remove now-redundant cancel_work_sync() from cxl_ras_exit().
> - Add WARN_ONCE() in cxl_cper_register_prot_err_work() for
> double-registration.
> ---
> drivers/acpi/apei/ghes.c | 27 ++++++++++++++-------------
> drivers/cxl/core/ras.c | 6 +++---
> include/cxl/event.h | 10 ++++------
> 3 files changed, 21 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 3236a3ce79d6..dd0a073af93c 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -778,33 +778,34 @@ static void cxl_cper_post_prot_err(struct cxl_cper_sec_prot_err *prot_err,
> #endif
> }
>
> -int cxl_cper_register_prot_err_work(struct work_struct *work)
> +void cxl_cper_register_prot_err_work(struct work_struct *work)
> {
> - if (cxl_cper_prot_err_work)
> - return -EINVAL;
> -
> guard(spinlock)(&cxl_cper_prot_err_work_lock);
> + WARN_ONCE(cxl_cper_prot_err_work,
> + "CPER-CXL kfifo consumer already registered\n");
> cxl_cper_prot_err_work = work;
> - return 0;
> }
> -EXPORT_SYMBOL_NS_GPL(cxl_cper_register_prot_err_work, "CXL");
> +EXPORT_SYMBOL_FOR_MODULES(cxl_cper_register_prot_err_work, "cxl_core");
>
> -int cxl_cper_unregister_prot_err_work(struct work_struct *work)
> +void cxl_cper_unregister_prot_err_work(void)
> {
> - if (cxl_cper_prot_err_work != work)
> - return -EINVAL;
> + struct work_struct *work;
>
> - guard(spinlock)(&cxl_cper_prot_err_work_lock);
> + spin_lock(&cxl_cper_prot_err_work_lock);
> + work = cxl_cper_prot_err_work;
> cxl_cper_prot_err_work = NULL;
> - return 0;
> + spin_unlock(&cxl_cper_prot_err_work_lock);
> +
> + if (work)
> + cancel_work_sync(work);
> }
> -EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_prot_err_work, "CXL");
> +EXPORT_SYMBOL_FOR_MODULES(cxl_cper_unregister_prot_err_work, "cxl_core");
>
> int cxl_cper_prot_err_kfifo_get(struct cxl_cper_prot_err_work_data *wd)
> {
> return kfifo_get(&cxl_cper_prot_err_fifo, wd);
> }
> -EXPORT_SYMBOL_NS_GPL(cxl_cper_prot_err_kfifo_get, "CXL");
> +EXPORT_SYMBOL_FOR_MODULES(cxl_cper_prot_err_kfifo_get, "cxl_core");
>
> /* Room for 8 entries for each of the 4 event log queues */
> #define CXL_CPER_FIFO_DEPTH 32
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 56611da8357a..9193dac4e507 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -68,13 +68,13 @@ static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
>
> int cxl_ras_init(void)
> {
> - return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> + cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> + return 0;
> }
>
> void cxl_ras_exit(void)
> {
> - cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
> - cancel_work_sync(&cxl_cper_prot_err_work);
> + cxl_cper_unregister_prot_err_work();
> }
>
> static void cxl_dport_map_ras(struct cxl_dport *dport)
> diff --git a/include/cxl/event.h b/include/cxl/event.h
> index ff97fea718d2..51acedb0d683 100644
> --- a/include/cxl/event.h
> +++ b/include/cxl/event.h
> @@ -289,8 +289,8 @@ struct cxl_cper_prot_err_work_data {
> int cxl_cper_register_work(struct work_struct *work);
> int cxl_cper_unregister_work(struct work_struct *work);
> int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd);
> -int cxl_cper_register_prot_err_work(struct work_struct *work);
> -int cxl_cper_unregister_prot_err_work(struct work_struct *work);
> +void cxl_cper_register_prot_err_work(struct work_struct *work);
> +void cxl_cper_unregister_prot_err_work(void);
> int cxl_cper_prot_err_kfifo_get(struct cxl_cper_prot_err_work_data *wd);
> #else
> static inline int cxl_cper_register_work(struct work_struct *work)
> @@ -306,13 +306,11 @@ static inline int cxl_cper_kfifo_get(struct cxl_cper_work_data *wd)
> {
> return 0;
> }
> -static inline int cxl_cper_register_prot_err_work(struct work_struct *work)
> +static inline void cxl_cper_register_prot_err_work(struct work_struct *work)
> {
> - return 0;
> }
> -static inline int cxl_cper_unregister_prot_err_work(struct work_struct *work)
> +static inline void cxl_cper_unregister_prot_err_work(void)
> {
> - return 0;
> }
> static inline int cxl_cper_prot_err_kfifo_get(struct cxl_cper_prot_err_work_data *wd)
> {
* [PATCH v17 06/11] PCI: Establish common CXL Port protocol error flow
2026-05-05 17:30 [PATCH v17 00/11] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (4 preceding siblings ...)
2026-05-05 17:30 ` [PATCH v17 05/11] cxl: Limit CXL-CPER kfifo registration functions scope Terry Bowman
@ 2026-05-05 17:30 ` Terry Bowman
2026-05-05 17:30 ` [PATCH v17 07/11] PCI/CXL: Add RCH support to CXL handlers Terry Bowman
` (4 subsequent siblings)
10 siblings, 0 replies; 21+ messages in thread
From: Terry Bowman @ 2026-05-05 17:30 UTC (permalink / raw)
To: dave, jic23, dave.jiang, alison.schofield, djbw, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, vishal.l.verma,
alucerop, ira.weiny, corbet, rafael, xueshuai, linux-cxl
Cc: linux-kernel, linux-pci, linux-acpi, linux-doc, terry.bowman
Add CXL Port protocol error handling callbacks to unify detection,
logging, and recovery across CXL Ports and Endpoints. Establish a
common flow for correctable and uncorrectable CXL protocol errors.
RCH Downstream Port error handling is added in a following patch.
Add cxl_handle_proto_error() to dispatch correctable and uncorrectable
errors through the CXL RAS helpers. Add cxl_do_recovery() to coordinate
uncorrectable recovery. Panic via panic() on any uncorrectable CXL RAS
error. CXL.cachemem traffic cannot be safely recovered from an
uncorrectable protocol error in software, so panic regardless of the
AER severity reported. Gate error handling on the port driver being
bound to avoid processing errors on disabled devices.
Panic explicitly on pci_dev_is_disconnected() before accessing the RAS
registers. A CXL device disconnecting during an uncorrectable error event
is itself unrecoverable, particularly for devices in interleaved HDM
regions. Relying on the status readl() returning ~0u to trip the existing
panic path leaves the cause ambiguous.
The panic policy applies to the RAS register block of the device whose
error triggered the recovery: Root/Downstream Port RAS for VH Ports,
Endpoint Port RAS for VH Endpoints and RCDs. Upstream RCH Downstream
Port RAS UEs handled via cxl_handle_rdport_errors() are logged only, as
before this series. Only the RCD Endpoint's own RAS UE drives the panic.
Add to_ras_base() to centralize the RAS base lookup. It selects
dport->regs.ras for Root/Downstream Ports and port->regs.ras for
Upstream Ports and Endpoints.
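That RAS base selection amounts to a small dispatch on the port type. A hypothetical userspace sketch of the rule described above, with structure and field names invented for illustration only:

```c
#include <stddef.h>

enum port_type { ROOT_PORT, DOWNSTREAM_PORT, UPSTREAM_PORT, ENDPOINT };

struct regs { void *ras; };
struct dport_s { struct regs regs; };	/* stand-in for cxl_dport */
struct port_s { struct regs regs; };	/* stand-in for cxl_port */

static void *to_ras_base(enum port_type type,
			 struct dport_s *dport, struct port_s *port)
{
	switch (type) {
	case ROOT_PORT:
	case DOWNSTREAM_PORT:
		/* Root/Downstream Ports use the dport RAS block. */
		return dport ? dport->regs.ras : NULL;
	case UPSTREAM_PORT:
	case ENDPOINT:
		/* Upstream Ports and Endpoints use the port RAS block. */
		return port ? port->regs.ras : NULL;
	}
	return NULL;
}
```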
Export pcie_clear_device_status() and pci_aer_clear_fatal_status() so
cxl_core can clear PCIe/AER state during recovery.
Wire the AER core to the kfifo in this commit by adding the
is_cxl_error() switch in handle_error_source() alongside the consumer
registration. This way the producer and consumer go live in the same
commit, so CXL errors are not silently dropped during bisect.
The correctable AER status is cleared by the producer in
cxl_forward_error().
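The producer/consumer handoff described above, where multiple AER IRQ threads enqueue into a fixed-depth fifo that a single registered work handler drains, can be sketched in userspace C. The depth matches the series' CXL_CPER_FIFO_DEPTH choice, but the names, fields, and single-consumer locking scheme are illustrative assumptions, not the kernel kfifo implementation:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_DEPTH 32	/* power of two, as kfifo requires */

struct proto_err_wd {
	int severity;	/* e.g. correctable vs. uncorrectable */
	void *edev;	/* stand-in for the PCI error source device */
};

static struct proto_err_wd fifo[FIFO_DEPTH];
static unsigned int fifo_in, fifo_out;
static pthread_mutex_t fifo_lock = PTHREAD_MUTEX_INITIALIZER;

static bool fifo_put(struct proto_err_wd wd)
{
	bool ok = false;

	/* Multiple IRQ threads may enqueue, so serialize producers. */
	pthread_mutex_lock(&fifo_lock);
	if (fifo_in - fifo_out < FIFO_DEPTH) {
		fifo[fifo_in++ % FIFO_DEPTH] = wd;
		ok = true;
	}
	pthread_mutex_unlock(&fifo_lock);
	return ok;	/* dropped when full, as kfifo_put() reports */
}

static bool fifo_get(struct proto_err_wd *wd)
{
	/* Single consumer (the registered work handler) in this
	 * sketch, so only the producer side takes the lock. */
	if (fifo_in == fifo_out)
		return false;
	*wd = fifo[fifo_out++ % FIFO_DEPTH];
	return true;
}
```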
Co-developed-by: Dan Williams <djbw@kernel.org>
Signed-off-by: Dan Williams <djbw@kernel.org>
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
Changes in v16->v17:
- get_cxl_port() -> find_cxl_port_by_dev()
- Simplified find_cxl_port_by_dev()
- Replace and remove cxl_serial_number() w/ pci_get_dsn()
- cxl_get_ras_base() -> to_ras_base()
- Drop dependency on PCI_ERS_RESULT_PANIC; cxl_do_recovery() panics
directly. (PANIC enum patch dropped from series.)
- Clarify panic semantics: panic on any uncorrectable CXL RAS error, not
only AER-FATAL severities.
- Drop the redundant PCI_ERR_COR_STATUS RMW in cxl_handle_proto_error();
cxl_forward_error() already acks the correctable AER status.
- Add is_cxl_error() switch in handle_error_source() here, paired with the
kfifo consumer registration, to keep each commit bisect-safe.
- Drop pcie_aer_is_native() guard in cxl_do_recovery() (always native).
- Swap order with the "Limit" patch for bisectability w/ cxl_ras_exit()
- Reword for "any uncorrectable" CXL RAS error panics.
- Restore log messages for port-not-found and port-unbound cases.
- Whitespace cleanup (Jonathan)
- Update to get_cxl_port() documentation (Terry)
- Fix __cxl_proto_err_work_fn() to return 0 for transient errors.
- Drop !port check in cxl_do_recovery(), caller already validated
- Fix kerneldoc @pdev -> @dev in find_cxl_port_by_dev()
- Fix missing space in pr_err_ratelimited()
- Add disconnect check before access
- Made pcie_clear_device_status() and pci_aer_clear_fatal_status()
EXPORT_SYMBOL_FOR_MODULES("cxl_core") (Dan)
- Move find_cxl_port_by_dport() and find_cxl_port_by_uport()
de-staticisation and core.h declarations from the rename patch to
here, where the first cross-file callers in find_cxl_port_by_dev()
land.
Changes in v15->v16:
- get_ras_base(), initialize dport to NULL (Jonathan)
- Remove guard(device)(&cxlmd->dev) (Jonathan)
- Fix dev_warns() (Jonathan)
- Remove comment in cxl_port_error_detected() (Dan)
- Update switch-case brackets to follow clang-format (Dan)
- Add PCI_EXP_TYPE_RC_END for cxl_get_ras_base() (Terry)
- Add NULL port check in cxl_serial_number() (Terry)
Changes in v14->v15:
- Update commit message and title. Added Bjorn's ack.
- Move CE and UCE handling logic here
Changes in v13->v14:
- Add Dave Jiang's review-by
- Update commit message & headline (Bjorn)
- Refactor cxl_port_error_detected()/cxl_port_cor_error_detected() to
one line (Jonathan)
- Remove cxl_walk_port() (Dan)
- Remove cxl_pci_drv_bound(). Check for 'is_cxl' parent port is
sufficient (Dan)
- Remove device_lock_if()
- Combined CE and UCE here (Terry)
Changes in v12->v13:
- Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
patch (Terry)
- Remove EP case in cxl_get_ras_base(), not used. (Terry)
- Remove check for dport->dport_dev (Dave)
- Remove whitespace (Terry)
Changes in v11->v12:
- Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
pci_to_cxl_dev()
- Change cxl_error_detected() -> cxl_cor_error_detected()
- Remove NULL variable assignments
- Replace bus_find_device() with find_cxl_port_by_uport() for upstream
port searches.
Changes in v10->v11:
- None
---
drivers/cxl/core/core.h | 3 +
drivers/cxl/core/port.c | 6 +-
drivers/cxl/core/ras.c | 139 +++++++++++++++++++++++++++++++---
drivers/pci/pci.c | 1 +
drivers/pci/pci.h | 2 -
drivers/pci/pcie/aer.c | 6 +-
drivers/pci/pcie/aer_cxl_vh.c | 9 ++-
include/linux/aer.h | 2 +
include/linux/pci.h | 2 +
9 files changed, 152 insertions(+), 18 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 132ac9c1ebf4..bc36cd1575a4 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -210,6 +210,9 @@ static inline void devm_cxl_dport_ras_setup(struct cxl_dport *dport) { }
#endif /* CONFIG_CXL_RAS */
int cxl_gpf_port_setup(struct cxl_dport *dport);
+struct cxl_port *find_cxl_port_by_dport(struct device *dport_dev,
+ struct cxl_dport **dport);
+struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev);
struct cxl_hdm;
int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index b35a9016fc81..bf417a6aeade 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1398,8 +1398,8 @@ static struct cxl_port *__find_cxl_port_by_dport(struct cxl_find_port_ctx *ctx)
* Return a 'struct cxl_port' with an elevated reference if found. Use
* __free(put_cxl_port) to release.
*/
-static struct cxl_port *find_cxl_port_by_dport(struct device *dport_dev,
- struct cxl_dport **dport)
+struct cxl_port *find_cxl_port_by_dport(struct device *dport_dev,
+ struct cxl_dport **dport)
{
struct cxl_find_port_ctx ctx = {
.dport_dev = dport_dev,
@@ -1594,7 +1594,7 @@ static int match_port_by_uport(struct device *dev, const void *data)
* Function takes a device reference on the port device. Caller should do a
* put_device() when done.
*/
-static struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
+struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
{
struct device *dev;
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 9193dac4e507..0a552d5a236e 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -66,17 +66,6 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
}
static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
-int cxl_ras_init(void)
-{
- cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
- return 0;
-}
-
-void cxl_ras_exit(void)
-{
- cxl_cper_unregister_prot_err_work();
-}
-
static void cxl_dport_map_ras(struct cxl_dport *dport)
{
struct cxl_register_map *map = &dport->reg_map;
@@ -133,6 +122,67 @@ void devm_cxl_port_ras_setup(struct cxl_port *port)
}
EXPORT_SYMBOL_NS_GPL(devm_cxl_port_ras_setup, "CXL");
+/**
+ * find_cxl_port_by_dev - Use @dev as hint to do a _by_dport or _by_uport lookup
+ * @dev: generic device that may be either a port's companion device or a target dport
+ * @dport: output parameter; set to the matched dport for dport-class
+ * lookups (Root Port, Downstream Port), NULL otherwise.
+ *
+ * Return a 'struct cxl_port' with an elevated reference if found. Use
+ * __free(put_cxl_port) to release.
+ */
+static struct cxl_port *find_cxl_port_by_dev(struct device *dev, struct cxl_dport **dport)
+{
+ struct pci_dev *pdev;
+
+ *dport = NULL;
+ if (!dev_is_pci(dev))
+ return NULL;
+
+ pdev = to_pci_dev(dev);
+
+ switch (pci_pcie_type(pdev)) {
+ case PCI_EXP_TYPE_ROOT_PORT:
+ case PCI_EXP_TYPE_DOWNSTREAM:
+ return find_cxl_port_by_dport(dev, dport);
+ case PCI_EXP_TYPE_UPSTREAM:
+ case PCI_EXP_TYPE_ENDPOINT:
+ case PCI_EXP_TYPE_RC_END:
+ return find_cxl_port_by_uport(dev);
+ }
+
+ return NULL;
+}
+
+static void __iomem *to_ras_base(struct cxl_port *port, struct cxl_dport *dport)
+{
+ if (!port)
+ return NULL;
+
+ if (dport)
+ return dport->regs.ras;
+
+ return port->regs.ras;
+}
+
+static void cxl_do_recovery(struct pci_dev *pdev, struct cxl_port *port, struct cxl_dport *dport)
+{
+ struct device *dev = &pdev->dev;
+ bool ue;
+
+ if (pci_dev_is_disconnected(pdev))
+ panic("CXL cachemem error: device disconnected during UE recovery");
+
+ ue = cxl_handle_ras(dev, pci_get_dsn(pdev),
+ to_ras_base(port, dport));
+ if (ue)
+ panic("CXL cachemem error.");
+
+ pcie_clear_device_status(pdev);
+ pci_aer_clear_nonfatal_status(pdev);
+ pci_aer_clear_fatal_status(pdev);
+}
+
void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
{
void __iomem *addr;
@@ -275,3 +325,70 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
return PCI_ERS_RESULT_NEED_RESET;
}
EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
+
+static void cxl_handle_proto_error(struct pci_dev *pdev, struct cxl_port *port,
+ struct cxl_dport *dport, int severity)
+{
+ if (severity == AER_CORRECTABLE) {
+ cxl_handle_cor_ras(&pdev->dev, pci_get_dsn(pdev),
+ to_ras_base(port, dport));
+ pcie_clear_device_status(pdev);
+ } else {
+ cxl_do_recovery(pdev, port, dport);
+ }
+}
+
+static int __cxl_proto_err_work_fn(struct cxl_proto_err_work_data *wd)
+{
+ struct cxl_dport *dport;
+ struct cxl_port *port __free(put_cxl_port) =
+ find_cxl_port_by_dev(&wd->pdev->dev, &dport);
+
+ if (!port) {
+ dev_err_ratelimited(&wd->pdev->dev,
+ "Failed to find parent port device in CXL topology\n");
+ return 0;
+ }
+
+ /*
+ * Hold the port device lock and verify a driver is bound before
+ * handling errors. Protects against NULL deref if an error is
+ * dispatched before probe completion or after driver removal.
+ */
+ guard(device)(&port->dev);
+ if (!port->dev.driver) {
+ dev_err_ratelimited(&port->dev,
+ "Port device is unbound, abort error handling\n");
+ return 0;
+ }
+
+ cxl_handle_proto_error(wd->pdev, port, dport, wd->severity);
+
+ return 0;
+}
+
+static void cxl_proto_err_work_fn(struct work_struct *work)
+{
+ struct cxl_proto_err_work_data wd;
+ int rc;
+
+ rc = for_each_cxl_proto_err(&wd, __cxl_proto_err_work_fn);
+ if (rc)
+ pr_err_ratelimited("Failed to handle the CXL error (%d)\n", rc);
+}
+
+static DECLARE_WORK(cxl_proto_err_work, cxl_proto_err_work_fn);
+
+int cxl_ras_init(void)
+{
+ cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
+ cxl_register_proto_err_work(&cxl_proto_err_work);
+
+ return 0;
+}
+
+void cxl_ras_exit(void)
+{
+ cxl_cper_unregister_prot_err_work();
+ cxl_unregister_proto_err_work();
+}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8f7cfcc00090..e4b225dd6075 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2245,6 +2245,7 @@ void pcie_clear_device_status(struct pci_dev *dev)
PCI_EXP_DEVSTA_CED | PCI_EXP_DEVSTA_NFED |
PCI_EXP_DEVSTA_FED | PCI_EXP_DEVSTA_URD);
}
+EXPORT_SYMBOL_FOR_MODULES(pcie_clear_device_status, "cxl_core");
#endif
/**
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 4a14f88e543a..29e588f5289e 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -265,7 +265,6 @@ void pci_refresh_power_state(struct pci_dev *dev);
int pci_power_up(struct pci_dev *dev);
void pci_disable_enabled_device(struct pci_dev *dev);
int pci_finish_runtime_suspend(struct pci_dev *dev);
-void pcie_clear_device_status(struct pci_dev *dev);
void pcie_clear_root_pme_status(struct pci_dev *dev);
bool pci_check_pme_status(struct pci_dev *dev);
void pci_pme_wakeup_bus(struct pci_bus *bus);
@@ -1296,7 +1295,6 @@ void pci_restore_aer_state(struct pci_dev *dev);
static inline void pci_no_aer(void) { }
static inline void pci_aer_init(struct pci_dev *d) { }
static inline void pci_aer_exit(struct pci_dev *d) { }
-static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
static inline int pci_aer_clear_status(struct pci_dev *dev) { return -EINVAL; }
static inline int pci_aer_raw_clear_status(struct pci_dev *dev) { return -EINVAL; }
static inline void pci_save_aer_state(struct pci_dev *dev) { }
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index c5bce25df51c..b9c6c7b97217 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -295,6 +295,7 @@ void pci_aer_clear_fatal_status(struct pci_dev *dev)
if (status)
pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, status);
}
+EXPORT_SYMBOL_FOR_MODULES(pci_aer_clear_fatal_status, "cxl_core");
/**
* pci_aer_raw_clear_status - Clear AER error registers.
@@ -1186,7 +1187,10 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
{
cxl_rch_handle_error(dev, info);
- pci_aer_handle_error(dev, info);
+ if (is_cxl_error(dev, info))
+ cxl_forward_error(dev, info);
+ else
+ pci_aer_handle_error(dev, info);
pci_dev_put(dev);
}
diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
index c0fea2c2b9bc..3c54c1647417 100644
--- a/drivers/pci/pcie/aer_cxl_vh.c
+++ b/drivers/pci/pcie/aer_cxl_vh.c
@@ -45,8 +45,15 @@ bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
if (!info || !info->is_cxl)
return false;
- if (pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT)
+ switch (pci_pcie_type(pdev)) {
+ case PCI_EXP_TYPE_ENDPOINT:
+ case PCI_EXP_TYPE_ROOT_PORT:
+ case PCI_EXP_TYPE_UPSTREAM:
+ case PCI_EXP_TYPE_DOWNSTREAM:
+ break;
+ default:
return false;
+ }
return is_aer_internal_error(info);
}
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 78841cf4268c..979ed2f9fd38 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -68,6 +68,7 @@ typedef int (*cxl_proto_err_fn_t)(struct cxl_proto_err_work_data *wd);
#if defined(CONFIG_PCIEAER)
int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
+void pci_aer_clear_fatal_status(struct pci_dev *dev);
int pcie_aer_is_native(struct pci_dev *dev);
void pci_aer_unmask_internal_errors(struct pci_dev *dev);
#else
@@ -75,6 +76,7 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
{
return -EINVAL;
}
+static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
#endif
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 2c4454583c11..39a386871bcb 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1941,8 +1941,10 @@ static inline void pci_hp_unignore_link_change(struct pci_dev *pdev) { }
#ifdef CONFIG_PCIEAER
bool pci_aer_available(void);
+void pcie_clear_device_status(struct pci_dev *dev);
#else
static inline bool pci_aer_available(void) { return false; }
+static inline void pcie_clear_device_status(struct pci_dev *dev) { }
#endif
bool pci_ats_disabled(void);
--
2.34.1
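[Editor's illustration, not part of the patch: the type-based dispatch in
find_cxl_port_by_dev() above reduces to a small classification of the PCIe
device type. The enum below is invented for illustration; the PCI_EXP_TYPE_*
values mirror the encodings in include/uapi/linux/pci_regs.h.]

```c
/*
 * Sketch of the lookup dispatch in find_cxl_port_by_dev(): dport-class
 * devices (Root Port, Downstream Port) resolve via a _by_dport lookup,
 * uport-class devices (Upstream Port, Endpoint, RC Endpoint) via a
 * _by_uport lookup. Anything else is not resolvable this way.
 */
#include <assert.h>

enum lookup_kind { LOOKUP_NONE, LOOKUP_BY_DPORT, LOOKUP_BY_UPORT };

/* PCIe device/port types as encoded in the PCIe capability. */
#define PCI_EXP_TYPE_ENDPOINT	0x0
#define PCI_EXP_TYPE_ROOT_PORT	0x4
#define PCI_EXP_TYPE_UPSTREAM	0x5
#define PCI_EXP_TYPE_DOWNSTREAM	0x6
#define PCI_EXP_TYPE_RC_END	0x9

static enum lookup_kind cxl_lookup_kind(int pcie_type)
{
	switch (pcie_type) {
	case PCI_EXP_TYPE_ROOT_PORT:
	case PCI_EXP_TYPE_DOWNSTREAM:
		return LOOKUP_BY_DPORT;	/* dev is a dport of some cxl_port */
	case PCI_EXP_TYPE_UPSTREAM:
	case PCI_EXP_TYPE_ENDPOINT:
	case PCI_EXP_TYPE_RC_END:
		return LOOKUP_BY_UPORT;	/* dev is (or hangs off) a port's uport */
	}
	return LOOKUP_NONE;	/* e.g. bridges: no CXL port lookup applies */
}
```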
* [PATCH v17 07/11] PCI/CXL: Add RCH support to CXL handlers
2026-05-05 17:30 [PATCH v17 00/11] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (5 preceding siblings ...)
2026-05-05 17:30 ` [PATCH v17 06/11] PCI: Establish common CXL Port protocol error flow Terry Bowman
@ 2026-05-05 17:30 ` Terry Bowman
2026-05-05 23:59 ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 08/11] cxl: Remove Endpoint AER correctable handler Terry Bowman
` (3 subsequent siblings)
10 siblings, 1 reply; 21+ messages in thread
From: Terry Bowman @ 2026-05-05 17:30 UTC (permalink / raw)
To: dave, jic23, dave.jiang, alison.schofield, djbw, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, vishal.l.verma,
alucerop, ira.weiny, corbet, rafael, xueshuai, linux-cxl
Cc: linux-kernel, linux-pci, linux-acpi, linux-doc, terry.bowman
Restricted CXL Host (RCH) error handling is a separate path from the
new CXL Port error handling flow. Fold RCH error handling into the
Port flow so both share a common entry point.
Update cxl_rch_handle_error_iter() to forward RCH protocol errors
through the AER-CXL kfifo.
Update cxl_handle_proto_error() to dispatch RCH errors via
cxl_handle_rdport_errors(). cxl_handle_rdport_errors() handles both
correctable and uncorrectable RCH protocol errors.
Behavior change: an RCD uncorrectable CXL RAS error now panics via
cxl_do_recovery(). Before this patch the RCH path returned
PCI_ERS_RESULT_NEED_RESET via cxl_pci's err_handler. After this patch
the same condition panics. This matches the panic policy added in the
common CXL Port protocol error flow: software cannot safely recover
CXL.cachemem traffic after an uncorrectable protocol error.
Change cxl_handle_rdport_errors() to take a PCI device instead of a
CXL device state, matching the new caller context. The error trace events
emitted from this path now report device=<PCI BDF> instead of device=<memN>,
matching the rest of the unified CXL trace events. Userspace consumers keyed
off the memdev name need to map the PCI BDF back to a memdev.
Include the RCD Endpoint serial number in RCH log messages so the RCH
can be associated with its RCD.
Remove the cxlds->rcd check from cxl_cor_error_detected() and
cxl_error_detected(). RCH errors are now forwarded by
cxl_rch_handle_error_iter() through the AER-CXL kfifo to
cxl_handle_proto_error(), so cxl_pci's err_handler no longer sees
them.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
Changes in v16->v17:
- Drop now-dead cxlds->rcd branches from cxl_{cor_,}error_detected().
- Drop duplicate subject line from commit body.
- Document panic-on-uncorrectable behavior change for RCD path.
- Document trace event device-name change (memN -> PCI BDF) for RCH path.
- Rewrite cxl_handle_proto_error() RC_END comment to clarify RCD/RCH shared
interrupt relationship
- Rewrite commit message
Changes in v16:
- New commit
---
drivers/cxl/core/core.h | 4 ++--
drivers/cxl/core/ras.c | 14 +++++++++-----
drivers/cxl/core/ras_rch.c | 8 +++-----
drivers/pci/pcie/aer_cxl_rch.c | 17 +----------------
4 files changed, 15 insertions(+), 28 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index bc36cd1575a4..2c7387506dfb 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -188,7 +188,7 @@ void cxl_handle_cor_ras(struct device *dev, u64 serial,
void __iomem *ras_base);
void cxl_dport_map_rch_aer(struct cxl_dport *dport);
void cxl_disable_rch_root_ints(struct cxl_dport *dport);
-void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds);
+void cxl_handle_rdport_errors(struct pci_dev *pdev);
void devm_cxl_dport_ras_setup(struct cxl_dport *dport);
#else
static inline int cxl_ras_init(void)
@@ -205,7 +205,7 @@ static inline void cxl_handle_cor_ras(struct device *dev, u64 serial,
void __iomem *ras_base) { }
static inline void cxl_dport_map_rch_aer(struct cxl_dport *dport) { }
static inline void cxl_disable_rch_root_ints(struct cxl_dport *dport) { }
-static inline void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds) { }
+static inline void cxl_handle_rdport_errors(struct pci_dev *pdev) { }
static inline void devm_cxl_dport_ras_setup(struct cxl_dport *dport) { }
#endif /* CONFIG_CXL_RAS */
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 0a552d5a236e..1f1dd20623f6 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -267,9 +267,6 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
return;
}
- if (cxlds->rcd)
- cxl_handle_rdport_errors(cxlds);
-
cxl_handle_cor_ras(&cxlds->cxlmd->dev, pci_get_dsn(pdev),
cxlmd->endpoint->regs.ras);
}
@@ -292,8 +289,6 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
return PCI_ERS_RESULT_DISCONNECT;
}
- if (cxlds->rcd)
- cxl_handle_rdport_errors(cxlds);
/*
* A frozen channel indicates an impending reset which is fatal to
* CXL.mem operation, and will likely crash the system. On the off
@@ -329,6 +324,15 @@ EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
static void cxl_handle_proto_error(struct pci_dev *pdev, struct cxl_port *port,
struct cxl_dport *dport, int severity)
{
+ /*
+ * An RC_END device is an RCD (Restricted CXL Device). Its AER
+ * interrupt is shared with the RCH Downstream Port, so handle RCH
+ * Downstream Port protocol errors first before processing the RCD's
+ * own errors. See CXL spec r3.1 s12.2.
+ */
+ if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END)
+ cxl_handle_rdport_errors(pdev);
+
if (severity == AER_CORRECTABLE) {
cxl_handle_cor_ras(&pdev->dev, pci_get_dsn(pdev),
to_ras_base(port, dport));
diff --git a/drivers/cxl/core/ras_rch.c b/drivers/cxl/core/ras_rch.c
index 61835fbafc0f..cbd02cabefbc 100644
--- a/drivers/cxl/core/ras_rch.c
+++ b/drivers/cxl/core/ras_rch.c
@@ -1,7 +1,6 @@
// SPDX-License-Identifier: GPL-2.0-only
/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
-#include <linux/types.h>
#include <linux/aer.h>
#include "cxl.h"
#include "core.h"
@@ -95,9 +94,8 @@ static bool cxl_rch_get_aer_severity(struct aer_capability_regs *aer_regs,
return false;
}
-void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds)
+void cxl_handle_rdport_errors(struct pci_dev *pdev)
{
- struct pci_dev *pdev = to_pci_dev(cxlds->dev);
struct aer_capability_regs aer_regs;
struct cxl_dport *dport;
int severity;
@@ -115,9 +113,9 @@ void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds)
pci_print_aer(pdev, severity, &aer_regs);
if (severity == AER_CORRECTABLE)
- cxl_handle_cor_ras(&cxlds->cxlmd->dev, pci_get_dsn(pdev),
+ cxl_handle_cor_ras(&pdev->dev, pci_get_dsn(pdev),
dport->regs.ras);
else
- cxl_handle_ras(&cxlds->cxlmd->dev, pci_get_dsn(pdev),
+ cxl_handle_ras(&pdev->dev, pci_get_dsn(pdev),
dport->regs.ras);
}
diff --git a/drivers/pci/pcie/aer_cxl_rch.c b/drivers/pci/pcie/aer_cxl_rch.c
index e471eefec9c4..83142eac0cab 100644
--- a/drivers/pci/pcie/aer_cxl_rch.c
+++ b/drivers/pci/pcie/aer_cxl_rch.c
@@ -37,26 +37,11 @@ static bool cxl_error_is_native(struct pci_dev *dev)
static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
{
struct aer_err_info *info = (struct aer_err_info *)data;
- const struct pci_error_handlers *err_handler;
if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev))
return 0;
- guard(device)(&dev->dev);
-
- err_handler = dev->driver ? dev->driver->err_handler : NULL;
- if (!err_handler)
- return 0;
-
- if (info->severity == AER_CORRECTABLE) {
- if (err_handler->cor_error_detected)
- err_handler->cor_error_detected(dev);
- } else if (err_handler->error_detected) {
- if (info->severity == AER_NONFATAL)
- err_handler->error_detected(dev, pci_channel_io_normal);
- else if (info->severity == AER_FATAL)
- err_handler->error_detected(dev, pci_channel_io_frozen);
- }
+ cxl_forward_error(dev, info);
return 0;
}
--
2.34.1
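[Editor's illustration, not part of the patch: the severity dispatch in
cxl_handle_proto_error() plus the panic policy of cxl_do_recovery() can be
sketched as a pure decision function. The action enum and the
ras_ue_latched flag are invented here for illustration; in the kernel the
"latched" result comes from cxl_handle_ras() reading the RAS capability.]

```c
/*
 * Sketch of the dispatch: correctable errors are logged via
 * cxl_handle_cor_ras() and the device status cleared in place; all
 * other severities go through cxl_do_recovery(), which panics if an
 * uncorrectable CXL RAS error is actually latched -- regardless of
 * whether AER reported it as nonfatal or fatal -- and otherwise just
 * clears the AER/device status.
 */
#include <assert.h>
#include <stdbool.h>

enum aer_severity { AER_CORRECTABLE, AER_NONFATAL, AER_FATAL };
enum proto_err_action { ACT_LOG_AND_CLEAR, ACT_PANIC, ACT_CLEAR_STATUS };

static enum proto_err_action
proto_err_action(enum aer_severity sev, bool ras_ue_latched)
{
	if (sev == AER_CORRECTABLE)
		return ACT_LOG_AND_CLEAR;	/* cxl_handle_cor_ras() path */

	/* cxl_do_recovery(): any latched uncorrectable RAS error panics */
	return ras_ue_latched ? ACT_PANIC : ACT_CLEAR_STATUS;
}
```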
* Re: [PATCH v17 07/11] PCI/CXL: Add RCH support to CXL handlers
2026-05-05 17:30 ` [PATCH v17 07/11] PCI/CXL: Add RCH support to CXL handlers Terry Bowman
@ 2026-05-05 23:59 ` Dave Jiang
0 siblings, 0 replies; 21+ messages in thread
From: Dave Jiang @ 2026-05-05 23:59 UTC (permalink / raw)
To: Terry Bowman, dave, jic23, alison.schofield, djbw, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, vishal.l.verma,
alucerop, ira.weiny, corbet, rafael, xueshuai, linux-cxl
Cc: linux-kernel, linux-pci, linux-acpi, linux-doc
On 5/5/26 10:30 AM, Terry Bowman wrote:
> [...]
> @@ -329,6 +324,15 @@ EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
> static void cxl_handle_proto_error(struct pci_dev *pdev, struct cxl_port *port,
> struct cxl_dport *dport, int severity)
> {
> + /*
> + * An RC_END device is an RCD (Restricted CXL Device). Its AER
> + * interrupt is shared with the RCH Downstream Port, so handle RCH
> + * Downstream Port protocol errors first before processing the RCD's
> + * own errors. See CXL spec r3.1 s12.2.
> + */
> + if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END)
May as well use is_cxl_restricted(pdev).
DJ
> + cxl_handle_rdport_errors(pdev);
> +
> if (severity == AER_CORRECTABLE) {
> cxl_handle_cor_ras(&pdev->dev, pci_get_dsn(pdev),
> to_ras_base(port, dport));
> [...]
* [PATCH v17 08/11] cxl: Remove Endpoint AER correctable handler
2026-05-05 17:30 [PATCH v17 00/11] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (6 preceding siblings ...)
2026-05-05 17:30 ` [PATCH v17 07/11] PCI/CXL: Add RCH support to CXL handlers Terry Bowman
@ 2026-05-05 17:30 ` Terry Bowman
2026-05-05 17:30 ` [PATCH v17 09/11] cxl: Update Endpoint AER uncorrectable handler Terry Bowman
` (2 subsequent siblings)
10 siblings, 0 replies; 21+ messages in thread
From: Terry Bowman @ 2026-05-05 17:30 UTC (permalink / raw)
To: dave, jic23, dave.jiang, alison.schofield, djbw, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, vishal.l.verma,
alucerop, ira.weiny, corbet, rafael, xueshuai, linux-cxl
Cc: linux-kernel, linux-pci, linux-acpi, linux-doc, terry.bowman
CXL drivers no longer need their own correctable PCI AER handler. The
PCIe AER correctable status is logged and cleared by the AER driver,
and CXL RAS correctable status is now logged and cleared via the new
common CXL protocol error flow: cxl_handle_proto_error() invokes
cxl_handle_cor_ras() for VH Endpoints, and dispatches to
cxl_handle_rdport_errors() for RCDs (which calls cxl_handle_cor_ras()
with the RCH dport's RAS register block). Both paths are reached via
the AER-CXL kfifo, so the .cor_error_detected callback in the CXL PCI
driver is redundant.
Remove cxl_cor_error_detected() and drop the .cor_error_detected entry
from cxl_pci's pci_error_handlers.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
Changes in v16->v17:
- Update commit message
- Add Reviewed-by from Jonathan and DaveJ
Changes in v15->v16:
- None
Changes in v14->v15:
- Remove cxl_pci_cor_error_detected(). Is not needed. AER is logged
in the AER driver. (Dan)
- Update commit message (Terry)
Changes in v13->v14:
- New commit
- Change cxl_cor_error_detected() parameter to &pdev->dev device from
memdev device. (Terry)
- Updated commit message (Terry)
---
drivers/cxl/core/ras.c | 20 --------------------
drivers/cxl/cxlpci.h | 3 ---
drivers/cxl/pci.c | 1 -
3 files changed, 24 deletions(-)
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 1f1dd20623f6..5cc4087c2807 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -253,26 +253,6 @@ bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
return true;
}
-void cxl_cor_error_detected(struct pci_dev *pdev)
-{
- struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
- struct cxl_memdev *cxlmd = cxlds->cxlmd;
- struct device *dev = &cxlds->cxlmd->dev;
-
- scoped_guard(device, dev) {
- if (!dev->driver) {
- dev_warn(&pdev->dev,
- "%s: memdev disabled, abort error handling\n",
- dev_name(dev));
- return;
- }
-
- cxl_handle_cor_ras(&cxlds->cxlmd->dev, pci_get_dsn(pdev),
- cxlmd->endpoint->regs.ras);
- }
-}
-EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
-
pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
pci_channel_state_t state)
{
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index b826eb53cf7b..06c46adcf0f6 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -89,14 +89,11 @@ struct cxl_dev_state;
void read_cdat_data(struct cxl_port *port);
#ifdef CONFIG_CXL_RAS
-void cxl_cor_error_detected(struct pci_dev *pdev);
pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
pci_channel_state_t state);
void devm_cxl_dport_rch_ras_setup(struct cxl_dport *dport);
void devm_cxl_port_ras_setup(struct cxl_port *port);
#else
-static inline void cxl_cor_error_detected(struct pci_dev *pdev) { }
-
static inline pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
pci_channel_state_t state)
{
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index bace662dc988..5eb64ced0de5 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -1004,7 +1004,6 @@ static const struct pci_error_handlers cxl_error_handlers = {
.error_detected = cxl_error_detected,
.slot_reset = cxl_slot_reset,
.resume = cxl_error_resume,
- .cor_error_detected = cxl_cor_error_detected,
.reset_done = cxl_reset_done,
};
--
2.34.1
^ permalink raw reply related [flat|nested] 21+ messages in thread
* [PATCH v17 09/11] cxl: Update Endpoint AER uncorrectable handler
2026-05-05 17:30 [PATCH v17 00/11] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (7 preceding siblings ...)
2026-05-05 17:30 ` [PATCH v17 08/11] cxl: Remove Endpoint AER correctable handler Terry Bowman
@ 2026-05-05 17:30 ` Terry Bowman
2026-05-06 17:43 ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 10/11] PCI/CXL: Mask/Unmask CXL protocol errors Terry Bowman
2026-05-05 17:30 ` [PATCH v17 11/11] Documentation: cxl: Document CXL protocol error handling Terry Bowman
10 siblings, 1 reply; 21+ messages in thread
From: Terry Bowman @ 2026-05-05 17:30 UTC (permalink / raw)
To: dave, jic23, dave.jiang, alison.schofield, djbw, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, vishal.l.verma,
alucerop, ira.weiny, corbet, rafael, xueshuai, linux-cxl
Cc: linux-kernel, linux-pci, linux-acpi, linux-doc, terry.bowman
The CXL cxl_core driver now implements protocol RAS support. PCI
uncorrectable (UCE) protocol errors, however, continue to be reported via
the AER capability and must still be handled by a PCI error recovery callback.
UCE handling is required to provide direction for recovery.
Replace the existing cxl_error_detected() callback in cxl/pci.c with a new
cxl_pci_error_detected() implementation that handles uncorrectable AER PCI
protocol errors.
The handler decides solely based on the pci_channel_state_t parameter and
does not access PCIe AER capability registers from .error_detected, matching
the pattern used by other drivers including the NVMe and ixgbe drivers.
CXL.cachemem-corrupting protocol errors are routed separately through the
AER-CXL kfifo to cxl_handle_proto_error(), so cxl_pci does not need to
second-guess the AER core's classification.
claude-opus-4.7 was used for research on PCI error state transitions and
requirements.
Assisted-by: Claude:claude-opus-4.7
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
Changes in v16->v17:
- Rename pci_error_handlers struct instance to cxl_pci_error_handlers to
avoid shadowing the struct type tag.
- Restore scoped_guard(device) and dev->driver check around AER read.
- NULL-check find_cxl_port_by_dev() before deref of port->uport_dev.
- Updated commit message. (Terry)
- Add scope cleanup for port variable in cxl_pci_error_detected() (Terry)
- Drop cxl_uncor_aer_present(), rely on AER state
Changes in v15->v16:
- Update commit message (DaveJ)
- s/cxl_handle_aer()/cxl_uncor_aer_present()/g (Jonathan)
- cxl_uncor_aer_present(): Leave original result calculation based on
if a UCE is present and the provided state (Terry)
- Add call to pci_print_aer(). AER fails to log because it is an upstream
link (Terry)
Changes in v14->v15:
- Update commit message and title. Added Bjorn's ack.
- Move CE and UCE handling logic here
Changes in v13->v14:
- Add Dave Jiang's review-by
- Update commit message & headline (Bjorn)
- Refactor cxl_port_error_detected()/cxl_port_cor_error_detected() to
one line (Jonathan)
- Remove cxl_walk_port() (Dan)
- Remove cxl_pci_drv_bound(). Check for 'is_cxl' parent port is
sufficient (Dan)
- Remove device_lock_if()
- Combined CE and UCE here (Terry)
Changes in v12->v13:
- Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
patch (Terry)
- Remove EP case in cxl_get_ras_base(), not used. (Terry)
- Remove check for dport->dport_dev (Dave)
- Remove whitespace (Terry)
Changes in v11->v12:
- Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
pci_to_cxl_dev()
- Change cxl_error_detected() -> cxl_cor_error_detected()
- Remove NULL variable assignments
- Replace bus_find_device() with find_cxl_port_by_uport() for upstream
port searches.
Changes in v10->v11:
- None
---
drivers/cxl/core/ras.c | 43 ++++++++++++++++--------------------------
drivers/cxl/cxlpci.h | 8 ++++----
drivers/cxl/pci.c | 6 +++---
3 files changed, 23 insertions(+), 34 deletions(-)
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 5cc4087c2807..a98ce0f412ad 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -253,38 +253,27 @@ bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
return true;
}
-pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
- pci_channel_state_t state)
+pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
+ pci_channel_state_t state)
{
- struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
- struct cxl_memdev *cxlmd = cxlds->cxlmd;
- struct device *dev = &cxlmd->dev;
- bool ue;
+ struct cxl_dport *dport;
+ struct cxl_port *port __free(put_cxl_port) =
+ find_cxl_port_by_dev(&pdev->dev, &dport);
+ struct cxl_memdev *cxlmd;
+ struct device *dev;
- scoped_guard(device, dev) {
- if (!dev->driver) {
- dev_warn(&pdev->dev,
- "%s: memdev disabled, abort error handling\n",
- dev_name(dev));
- return PCI_ERS_RESULT_DISCONNECT;
- }
+ if (!port)
+ return PCI_ERS_RESULT_DISCONNECT;
- /*
- * A frozen channel indicates an impending reset which is fatal to
- * CXL.mem operation, and will likely crash the system. On the off
- * chance the situation is recoverable dump the status of the RAS
- * capability registers and bounce the active state of the memdev.
- */
- ue = cxl_handle_ras(&cxlds->cxlmd->dev, pci_get_dsn(pdev),
- cxlmd->endpoint->regs.ras);
- }
+ cxlmd = to_cxl_memdev(port->uport_dev);
+ dev = &cxlmd->dev;
switch (state) {
case pci_channel_io_normal:
- if (ue) {
- device_release_driver(dev);
- return PCI_ERS_RESULT_NEED_RESET;
- }
+ /*
+ * Non-fatal CXL protocol errors are handled asynchronously
+ * by the AER-CXL kfifo worker (cxl_proto_err_work_fn).
+ */
return PCI_ERS_RESULT_CAN_RECOVER;
case pci_channel_io_frozen:
dev_warn(&pdev->dev,
@@ -299,7 +288,7 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
}
return PCI_ERS_RESULT_NEED_RESET;
}
-EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
+EXPORT_SYMBOL_NS_GPL(cxl_pci_error_detected, "CXL");
static void cxl_handle_proto_error(struct pci_dev *pdev, struct cxl_port *port,
struct cxl_dport *dport, int severity)
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 06c46adcf0f6..8aeb80a4e573 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -89,13 +89,13 @@ struct cxl_dev_state;
void read_cdat_data(struct cxl_port *port);
#ifdef CONFIG_CXL_RAS
-pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
- pci_channel_state_t state);
+pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
+ pci_channel_state_t state);
void devm_cxl_dport_rch_ras_setup(struct cxl_dport *dport);
void devm_cxl_port_ras_setup(struct cxl_port *port);
#else
-static inline pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
- pci_channel_state_t state)
+static inline pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
+ pci_channel_state_t state)
{
return PCI_ERS_RESULT_NONE;
}
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 5eb64ced0de5..6459f94f8fa8 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -1000,8 +1000,8 @@ static void cxl_reset_done(struct pci_dev *pdev)
}
}
-static const struct pci_error_handlers cxl_error_handlers = {
- .error_detected = cxl_error_detected,
+static const struct pci_error_handlers cxl_pci_error_handlers = {
+ .error_detected = cxl_pci_error_detected,
.slot_reset = cxl_slot_reset,
.resume = cxl_error_resume,
.reset_done = cxl_reset_done,
@@ -1011,7 +1011,7 @@ static struct pci_driver cxl_pci_driver = {
.name = KBUILD_MODNAME,
.id_table = cxl_mem_pci_tbl,
.probe = cxl_pci_probe,
- .err_handler = &cxl_error_handlers,
+ .err_handler = &cxl_pci_error_handlers,
.dev_groups = cxl_rcd_groups,
.driver = {
.probe_type = PROBE_PREFER_ASYNCHRONOUS,
--
2.34.1
* Re: [PATCH v17 09/11] cxl: Update Endpoint AER uncorrectable handler
2026-05-05 17:30 ` [PATCH v17 09/11] cxl: Update Endpoint AER uncorrectable handler Terry Bowman
@ 2026-05-06 17:43 ` Dave Jiang
0 siblings, 0 replies; 21+ messages in thread
From: Dave Jiang @ 2026-05-06 17:43 UTC (permalink / raw)
To: Terry Bowman, dave, jic23, alison.schofield, djbw, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, vishal.l.verma,
alucerop, ira.weiny, corbet, rafael, xueshuai, linux-cxl
Cc: linux-kernel, linux-pci, linux-acpi, linux-doc
On 5/5/26 10:30 AM, Terry Bowman wrote:
> The CXL cxl_core driver now implements protocol RAS support. PCI
> uncorrectable (UCE) protocol errors, however, continue to be reported via
> the AER capability and must still be handled by a PCI error recovery callback.
> UCE handling is required to provide direction for recovery.
>
> Replace the existing cxl_error_detected() callback in cxl/pci.c with a new
> cxl_pci_error_detected() implementation that handles uncorrectable AER PCI
> protocol errors.
>
> The handler decides solely based on the pci_channel_state_t parameter and
> does not access PCIe AER capability registers from .error_detected, matching
> the pattern used by other drivers including the NVMe and ixgbe drivers.
> CXL.cachemem-corrupting protocol errors are routed separately through the
> AER-CXL kfifo to cxl_handle_proto_error(), so cxl_pci does not need to
> second-guess the AER core's classification.
>
> claude-opus-4.7 was used for research on PCI error state transitions and
> requirements.
>
> Assisted-by: Claude:claude-opus-4.7
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>
> ---
>
> Changes in v16->v17:
> - Rename pci_error_handlers struct instance to cxl_pci_error_handlers to
> avoid shadowing the struct type tag.
> - Restore scoped_guard(device) and dev->driver check around AER read.
> - NULL-check find_cxl_port_by_dev() before deref of port->uport_dev.
> - Updated commit message. (Terry)
> - Add scope cleanup for port variable in cxl_pci_error_detected() (Terry)
> - Drop cxl_uncor_aer_present(), rely on AER state
>
> Changes in v15->v16:
> - Update commit message (DaveJ)
> - s/cxl_handle_aer()/cxl_uncor_aer_present()/g (Jonathan)
> - cxl_uncor_aer_present(): Leave original result calculation based on
> if a UCE is present and the provided state (Terry)
> - Add call to pci_print_aer(). AER fails to log because it is an upstream
> link (Terry)
>
> Changes in v14->v15:
> - Update commit message and title. Added Bjorn's ack.
> - Move CE and UCE handling logic here
>
> Changes in v13->v14:
> - Add Dave Jiang's review-by
> - Update commit message & headline (Bjorn)
> - Refactor cxl_port_error_detected()/cxl_port_cor_error_detected() to
> one line (Jonathan)
> - Remove cxl_walk_port() (Dan)
> - Remove cxl_pci_drv_bound(). Check for 'is_cxl' parent port is
> sufficient (Dan)
> - Remove device_lock_if()
> - Combined CE and UCE here (Terry)
>
> Changes in v12->v13:
> - Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
> patch (Terry)
> - Remove EP case in cxl_get_ras_base(), not used. (Terry)
> - Remove check for dport->dport_dev (Dave)
> - Remove whitespace (Terry)
>
> Changes in v11->v12:
> - Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
> pci_to_cxl_dev()
> - Change cxl_error_detected() -> cxl_cor_error_detected()
> - Remove NULL variable assignments
> - Replace bus_find_device() with find_cxl_port_by_uport() for upstream
> port searches.
>
> Changes in v10->v11:
> - None
> ---
> drivers/cxl/core/ras.c | 43 ++++++++++++++++--------------------------
> drivers/cxl/cxlpci.h | 8 ++++----
> drivers/cxl/pci.c | 6 +++---
> 3 files changed, 23 insertions(+), 34 deletions(-)
>
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 5cc4087c2807..a98ce0f412ad 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -253,38 +253,27 @@ bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> return true;
> }
>
> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> - pci_channel_state_t state)
> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
> + pci_channel_state_t state)
> {
> - struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> - struct cxl_memdev *cxlmd = cxlds->cxlmd;
> - struct device *dev = &cxlmd->dev;
> - bool ue;
> + struct cxl_dport *dport;
> + struct cxl_port *port __free(put_cxl_port) =
> + find_cxl_port_by_dev(&pdev->dev, &dport);
Move this to right before 'port' is being checked. It's ok to do inline var declaration with __free().
DJ
> + struct cxl_memdev *cxlmd;
> + struct device *dev;
>
> - scoped_guard(device, dev) {
> - if (!dev->driver) {
> - dev_warn(&pdev->dev,
> - "%s: memdev disabled, abort error handling\n",
> - dev_name(dev));
> - return PCI_ERS_RESULT_DISCONNECT;
> - }
> + if (!port)
> + return PCI_ERS_RESULT_DISCONNECT;
>
> - /*
> - * A frozen channel indicates an impending reset which is fatal to
> - * CXL.mem operation, and will likely crash the system. On the off
> - * chance the situation is recoverable dump the status of the RAS
> - * capability registers and bounce the active state of the memdev.
> - */
> - ue = cxl_handle_ras(&cxlds->cxlmd->dev, pci_get_dsn(pdev),
> - cxlmd->endpoint->regs.ras);
> - }
> + cxlmd = to_cxl_memdev(port->uport_dev);
> + dev = &cxlmd->dev;
>
> switch (state) {
> case pci_channel_io_normal:
> - if (ue) {
> - device_release_driver(dev);
> - return PCI_ERS_RESULT_NEED_RESET;
> - }
> + /*
> + * Non-fatal CXL protocol errors are handled asynchronously
> + * by the AER-CXL kfifo worker (cxl_proto_err_work_fn).
> + */
> return PCI_ERS_RESULT_CAN_RECOVER;
> case pci_channel_io_frozen:
> dev_warn(&pdev->dev,
> @@ -299,7 +288,7 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> }
> return PCI_ERS_RESULT_NEED_RESET;
> }
> -EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
> +EXPORT_SYMBOL_NS_GPL(cxl_pci_error_detected, "CXL");
>
> static void cxl_handle_proto_error(struct pci_dev *pdev, struct cxl_port *port,
> struct cxl_dport *dport, int severity)
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 06c46adcf0f6..8aeb80a4e573 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -89,13 +89,13 @@ struct cxl_dev_state;
> void read_cdat_data(struct cxl_port *port);
>
> #ifdef CONFIG_CXL_RAS
> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> - pci_channel_state_t state);
> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
> + pci_channel_state_t state);
> void devm_cxl_dport_rch_ras_setup(struct cxl_dport *dport);
> void devm_cxl_port_ras_setup(struct cxl_port *port);
> #else
> -static inline pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> - pci_channel_state_t state)
> +static inline pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
> + pci_channel_state_t state)
> {
> return PCI_ERS_RESULT_NONE;
> }
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 5eb64ced0de5..6459f94f8fa8 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -1000,8 +1000,8 @@ static void cxl_reset_done(struct pci_dev *pdev)
> }
> }
>
> -static const struct pci_error_handlers cxl_error_handlers = {
> - .error_detected = cxl_error_detected,
> +static const struct pci_error_handlers cxl_pci_error_handlers = {
> + .error_detected = cxl_pci_error_detected,
> .slot_reset = cxl_slot_reset,
> .resume = cxl_error_resume,
> .reset_done = cxl_reset_done,
> @@ -1011,7 +1011,7 @@ static struct pci_driver cxl_pci_driver = {
> .name = KBUILD_MODNAME,
> .id_table = cxl_mem_pci_tbl,
> .probe = cxl_pci_probe,
> - .err_handler = &cxl_error_handlers,
> + .err_handler = &cxl_pci_error_handlers,
> .dev_groups = cxl_rcd_groups,
> .driver = {
> .probe_type = PROBE_PREFER_ASYNCHRONOUS,
* [PATCH v17 10/11] PCI/CXL: Mask/Unmask CXL protocol errors
2026-05-05 17:30 [PATCH v17 00/11] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (8 preceding siblings ...)
2026-05-05 17:30 ` [PATCH v17 09/11] cxl: Update Endpoint AER uncorrectable handler Terry Bowman
@ 2026-05-05 17:30 ` Terry Bowman
2026-05-06 18:00 ` Dave Jiang
2026-05-05 17:30 ` [PATCH v17 11/11] Documentation: cxl: Document CXL protocol error handling Terry Bowman
10 siblings, 1 reply; 21+ messages in thread
From: Terry Bowman @ 2026-05-05 17:30 UTC (permalink / raw)
To: dave, jic23, dave.jiang, alison.schofield, djbw, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, vishal.l.verma,
alucerop, ira.weiny, corbet, rafael, xueshuai, linux-cxl
Cc: linux-kernel, linux-pci, linux-acpi, linux-doc, terry.bowman
CXL protocol errors are not enabled for all CXL devices after boot. They
must be enabled in order to process CXL protocol errors. Provide matching
teardown helpers so the masks are restored when a CXL Port or Downstream
Port goes away.
Add pci_aer_mask_internal_errors() as the symmetric counterpart to
pci_aer_unmask_internal_errors() and export both for the cxl_core module.
Introduce cxl_unmask_proto_interrupts() and cxl_mask_proto_interrupts()
in cxl_core to wrap the PCI helpers with the dev_is_pci() and
pcie_aer_is_native() gating CXL needs. Both helpers tolerate a NULL
@dev so teardown callers do not have to special-case it.
Wire cxl_unmask_proto_interrupts() into the success path of
cxl_dport_map_ras() and devm_cxl_port_ras_setup() so the unmask only
runs when the RAS register block was actually mapped. Pair each unmask
with a devm_add_action_or_reset() registration of
cxl_mask_proto_interrupts() scoped to the cxl_port device. The mask is
then restored when the cxl_port device releases its devres. This
applies to Endpoints, Upstream Switch Ports, Downstream Switch Ports,
and Root Ports.
Co-developed-by: Dan Williams <djbw@kernel.org>
Signed-off-by: Dan Williams <djbw@kernel.org>
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
Changes in v16->v17:
- Drop redundant cxl_mask_proto_interrupts() calls from unregister_port()
and cxl_dport_remove(); the devres action registered alongside the unmask
is the sole mask path.
- Update title
- Remove unnecessary check for aer_capabilities
- Gate cxl_unmask_proto_interrupts() on pcie_aer_is_native()
- Add pci_aer_mask_internal_errors() and cxl_mask_proto_interrupts()
- Only unmask on successful cxl_map_component_regs()
- NULL-check @dev in cxl_{un,}mask_proto_interrupts()
- Drop static and declare in core/core.h
Change in v15 -> v16:
- None
Change in v14 -> v15:
- None
Changes in v13->v14:
- Update commit title's prefix (Bjorn)
Changes in v12->v13:
- Add dev and dev_is_pci() NULL checks in cxl_unmask_proto_interrupts() (Terry)
- Add Dave Jiang's and Ben's review-by
Changes in v11->v12:
- None
---
drivers/cxl/core/core.h | 4 +++
drivers/cxl/core/ras.c | 63 ++++++++++++++++++++++++++++++++++++++---
drivers/pci/pcie/aer.c | 25 ++++++++++++++++
include/linux/aer.h | 2 ++
4 files changed, 90 insertions(+), 4 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 2c7387506dfb..ff39985d363f 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -190,6 +190,8 @@ void cxl_dport_map_rch_aer(struct cxl_dport *dport);
void cxl_disable_rch_root_ints(struct cxl_dport *dport);
void cxl_handle_rdport_errors(struct pci_dev *pdev);
void devm_cxl_dport_ras_setup(struct cxl_dport *dport);
+void cxl_unmask_proto_interrupts(struct device *dev);
+void cxl_mask_proto_interrupts(struct device *dev);
#else
static inline int cxl_ras_init(void)
{
@@ -207,6 +209,8 @@ static inline void cxl_dport_map_rch_aer(struct cxl_dport *dport) { }
static inline void cxl_disable_rch_root_ints(struct cxl_dport *dport) { }
static inline void cxl_handle_rdport_errors(struct pci_dev *pdev) { }
static inline void devm_cxl_dport_ras_setup(struct cxl_dport *dport) { }
+static inline void cxl_unmask_proto_interrupts(struct device *dev) { }
+static inline void cxl_mask_proto_interrupts(struct device *dev) { }
#endif /* CONFIG_CXL_RAS */
int cxl_gpf_port_setup(struct cxl_dport *dport);
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index a98ce0f412ad..b45e2b539b5f 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -66,16 +66,59 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
}
static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
+void cxl_unmask_proto_interrupts(struct device *dev)
+{
+ struct pci_dev *pdev;
+
+ if (!dev || !dev_is_pci(dev))
+ return;
+
+ pdev = to_pci_dev(dev);
+ if (!pcie_aer_is_native(pdev))
+ return;
+
+ pci_aer_unmask_internal_errors(pdev);
+}
+
+void cxl_mask_proto_interrupts(struct device *dev)
+{
+ struct pci_dev *pdev;
+
+ if (!dev || !dev_is_pci(dev))
+ return;
+
+ pdev = to_pci_dev(dev);
+ if (!pcie_aer_is_native(pdev))
+ return;
+
+ pci_aer_mask_internal_errors(pdev);
+}
+
+static void cxl_mask_proto_irqs(void *dev)
+{
+ cxl_mask_proto_interrupts(dev);
+}
+
static void cxl_dport_map_ras(struct cxl_dport *dport)
{
struct cxl_register_map *map = &dport->reg_map;
struct device *dev = dport->dport_dev;
- if (!map->component_map.ras.valid)
+ if (!map->component_map.ras.valid) {
dev_dbg(dev, "RAS registers not found\n");
- else if (cxl_map_component_regs(map, &dport->regs.component,
- BIT(CXL_CM_CAP_CAP_ID_RAS)))
+ return;
+ }
+
+ if (cxl_map_component_regs(map, &dport->regs.component,
+ BIT(CXL_CM_CAP_CAP_ID_RAS))) {
dev_dbg(dev, "Failed to map RAS capability.\n");
+ return;
+ }
+
+ cxl_unmask_proto_interrupts(dev);
+ if (devm_add_action_or_reset(dport_to_host(dport),
+ cxl_mask_proto_irqs, dev))
+ dev_warn(dev, "failed to register CXL proto-irq mask cleanup\n");
}
/**
@@ -109,6 +152,7 @@ EXPORT_SYMBOL_NS_GPL(devm_cxl_dport_rch_ras_setup, "CXL");
void devm_cxl_port_ras_setup(struct cxl_port *port)
{
struct cxl_register_map *map = &port->reg_map;
+ struct device *dev;
if (!map->component_map.ras.valid) {
dev_dbg(&port->dev, "RAS registers not found\n");
@@ -117,8 +161,19 @@ void devm_cxl_port_ras_setup(struct cxl_port *port)
map->host = &port->dev;
if (cxl_map_component_regs(map, &port->regs,
- BIT(CXL_CM_CAP_CAP_ID_RAS)))
+ BIT(CXL_CM_CAP_CAP_ID_RAS))) {
dev_dbg(&port->dev, "Failed to map RAS capability\n");
+ return;
+ }
+
+ dev = is_cxl_endpoint(port) ? port->uport_dev->parent : port->uport_dev;
+ if (!dev_is_pci(dev))
+ return;
+
+ cxl_unmask_proto_interrupts(dev);
+ if (devm_add_action_or_reset(&port->dev, cxl_mask_proto_irqs, dev))
+ dev_warn(&port->dev,
+ "Failed to register CXL proto-irq mask cleanup\n");
}
EXPORT_SYMBOL_NS_GPL(devm_cxl_port_ras_setup, "CXL");
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index b9c6c7b97217..eaa36fe0eb31 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1151,6 +1151,31 @@ void pci_aer_unmask_internal_errors(struct pci_dev *dev)
*/
EXPORT_SYMBOL_FOR_MODULES(pci_aer_unmask_internal_errors, "cxl_core");
+/**
+ * pci_aer_mask_internal_errors - mask internal errors
+ * @dev: pointer to the pci_dev data structure
+ *
+ * Mask internal errors in the Uncorrectable and Correctable Error
+ * Mask registers.
+ *
+ * Note: AER must be enabled and supported by the device which must be
+ * checked in advance, e.g. with pcie_aer_is_native().
+ */
+void pci_aer_mask_internal_errors(struct pci_dev *dev)
+{
+ int aer = dev->aer_cap;
+ u32 mask;
+
+ pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &mask);
+ mask |= PCI_ERR_UNC_INTN;
+ pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, mask);
+
+ pci_read_config_dword(dev, aer + PCI_ERR_COR_MASK, &mask);
+ mask |= PCI_ERR_COR_INTERNAL;
+ pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
+}
+EXPORT_SYMBOL_FOR_MODULES(pci_aer_mask_internal_errors, "cxl_core");
+
/**
* pci_aer_handle_error - handle logging error into an event log
* @dev: pointer to pci_dev data structure of error source device
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 979ed2f9fd38..c52db62d4c7e 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -71,6 +71,7 @@ int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
void pci_aer_clear_fatal_status(struct pci_dev *dev);
int pcie_aer_is_native(struct pci_dev *dev);
void pci_aer_unmask_internal_errors(struct pci_dev *dev);
+void pci_aer_mask_internal_errors(struct pci_dev *dev);
#else
static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
{
@@ -79,6 +80,7 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
+static inline void pci_aer_mask_internal_errors(struct pci_dev *dev) { }
#endif
#ifdef CONFIG_CXL_RAS
--
2.34.1
* Re: [PATCH v17 10/11] PCI/CXL: Mask/Unmask CXL protocol errors
2026-05-05 17:30 ` [PATCH v17 10/11] PCI/CXL: Mask/Unmask CXL protocol errors Terry Bowman
@ 2026-05-06 18:00 ` Dave Jiang
0 siblings, 0 replies; 21+ messages in thread
From: Dave Jiang @ 2026-05-06 18:00 UTC (permalink / raw)
To: Terry Bowman, dave, jic23, alison.schofield, djbw, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, vishal.l.verma,
alucerop, ira.weiny, corbet, rafael, xueshuai, linux-cxl
Cc: linux-kernel, linux-pci, linux-acpi, linux-doc
On 5/5/26 10:30 AM, Terry Bowman wrote:
> CXL protocol errors are not enabled for all CXL devices after boot. They
> must be enabled in order to process CXL protocol errors. Provide matching
> teardown helpers so the masks are restored when a CXL Port or Downstream
> Port goes away.
>
> Add pci_aer_mask_internal_errors() as the symmetric counterpart to
> pci_aer_unmask_internal_errors() and export both for the cxl_core module.
>
> Introduce cxl_unmask_proto_interrupts() and cxl_mask_proto_interrupts()
> in cxl_core to wrap the PCI helpers with the dev_is_pci() and
> pcie_aer_is_native() gating CXL needs. Both helpers tolerate a NULL
> @dev so teardown callers do not have to special-case it.
>
> Wire cxl_unmask_proto_interrupts() into the success path of
> cxl_dport_map_ras() and devm_cxl_port_ras_setup() so the unmask only
> runs when the RAS register block was actually mapped. Pair each unmask
> with a devm_add_action_or_reset() registration of
> cxl_mask_proto_interrupts() scoped to the cxl_port device. The mask is
> then restored when the cxl_port device releases its devres. This
> applies to Endpoints, Upstream Switch Ports, Downstream Switch Ports,
> and Root Ports.
>
> Co-developed-by: Dan Williams <djbw@kernel.org>
> Signed-off-by: Dan Williams <djbw@kernel.org>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
I do wonder if we should save the original mask values and write those back rather than blindly remask everything when we are done.
>
> ---
>
> Changes in v16->v17:
> - Drop redundant cxl_mask_proto_interrupts() calls from unregister_port()
> and cxl_dport_remove(); the devres action registered alongside the unmask
> is the sole mask path.
> - Update title
> - Remove unnecessary check for aer_capabilities
> - Gate cxl_unmask_proto_interrupts() on pcie_aer_is_native()
> - Add pci_aer_mask_internal_errors() and cxl_mask_proto_interrupts()
> - Only unmask on successful cxl_map_component_regs()
> - NULL-check @dev in cxl_{un,}mask_proto_interrupts()
> - Drop static and declare in core/core.h
>
> Change in v15 -> v16:
> - None
>
> Change in v14 -> v15:
> - None
>
> Changes in v13->v14:
> - Update commit title's prefix (Bjorn)
>
> Changes in v12->v13:
> - Add dev and dev_is_pci() NULL checks in cxl_unmask_proto_interrupts() (Terry)
> - Add Dave Jiang's and Ben's review-by
>
> Changes in v11->v12:
> - None
> ---
> drivers/cxl/core/core.h | 4 +++
> drivers/cxl/core/ras.c | 63 ++++++++++++++++++++++++++++++++++++++---
> drivers/pci/pcie/aer.c | 25 ++++++++++++++++
> include/linux/aer.h | 2 ++
> 4 files changed, 90 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 2c7387506dfb..ff39985d363f 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -190,6 +190,8 @@ void cxl_dport_map_rch_aer(struct cxl_dport *dport);
> void cxl_disable_rch_root_ints(struct cxl_dport *dport);
> void cxl_handle_rdport_errors(struct pci_dev *pdev);
> void devm_cxl_dport_ras_setup(struct cxl_dport *dport);
> +void cxl_unmask_proto_interrupts(struct device *dev);
> +void cxl_mask_proto_interrupts(struct device *dev);
> #else
> static inline int cxl_ras_init(void)
> {
> @@ -207,6 +209,8 @@ static inline void cxl_dport_map_rch_aer(struct cxl_dport *dport) { }
> static inline void cxl_disable_rch_root_ints(struct cxl_dport *dport) { }
> static inline void cxl_handle_rdport_errors(struct pci_dev *pdev) { }
> static inline void devm_cxl_dport_ras_setup(struct cxl_dport *dport) { }
> +static inline void cxl_unmask_proto_interrupts(struct device *dev) { }
> +static inline void cxl_mask_proto_interrupts(struct device *dev) { }
> #endif /* CONFIG_CXL_RAS */
>
> int cxl_gpf_port_setup(struct cxl_dport *dport);
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index a98ce0f412ad..b45e2b539b5f 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -66,16 +66,59 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
> }
> static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
>
> +void cxl_unmask_proto_interrupts(struct device *dev)
> +{
> + struct pci_dev *pdev;
> +
> + if (!dev || !dev_is_pci(dev))
> + return;
> +
> + pdev = to_pci_dev(dev);
> + if (!pcie_aer_is_native(pdev))
> + return;
> +
> + pci_aer_unmask_internal_errors(pdev);
> +}
> +
> +void cxl_mask_proto_interrupts(struct device *dev)
> +{
> + struct pci_dev *pdev;
> +
> + if (!dev || !dev_is_pci(dev))
> + return;
> +
> + pdev = to_pci_dev(dev);
> + if (!pcie_aer_is_native(pdev))
> + return;
> +
> + pci_aer_mask_internal_errors(pdev);
> +}
> +
> +static void cxl_mask_proto_irqs(void *dev)
> +{
> + cxl_mask_proto_interrupts(dev);
> +}
> +
> static void cxl_dport_map_ras(struct cxl_dport *dport)
> {
> struct cxl_register_map *map = &dport->reg_map;
> struct device *dev = dport->dport_dev;
>
> - if (!map->component_map.ras.valid)
> + if (!map->component_map.ras.valid) {
> dev_dbg(dev, "RAS registers not found\n");
> - else if (cxl_map_component_regs(map, &dport->regs.component,
> - BIT(CXL_CM_CAP_CAP_ID_RAS)))
> + return;
> + }
> +
> + if (cxl_map_component_regs(map, &dport->regs.component,
> + BIT(CXL_CM_CAP_CAP_ID_RAS))) {
> dev_dbg(dev, "Failed to map RAS capability.\n");
> + return;
> + }
> +
> + cxl_unmask_proto_interrupts(dev);
> + if (devm_add_action_or_reset(dport_to_host(dport),
> + cxl_mask_proto_irqs, dev))
> + dev_warn(dev, "failed to register CXL proto-irq mask cleanup\n");
> }
>
> /**
> @@ -109,6 +152,7 @@ EXPORT_SYMBOL_NS_GPL(devm_cxl_dport_rch_ras_setup, "CXL");
> void devm_cxl_port_ras_setup(struct cxl_port *port)
> {
> struct cxl_register_map *map = &port->reg_map;
> + struct device *dev;
>
> if (!map->component_map.ras.valid) {
> dev_dbg(&port->dev, "RAS registers not found\n");
> @@ -117,8 +161,19 @@ void devm_cxl_port_ras_setup(struct cxl_port *port)
>
> map->host = &port->dev;
> if (cxl_map_component_regs(map, &port->regs,
> - BIT(CXL_CM_CAP_CAP_ID_RAS)))
> + BIT(CXL_CM_CAP_CAP_ID_RAS))) {
> dev_dbg(&port->dev, "Failed to map RAS capability\n");
> + return;
> + }
> +
> + dev = is_cxl_endpoint(port) ? port->uport_dev->parent : port->uport_dev;
> + if (!dev_is_pci(dev))
> + return;
> +
> + cxl_unmask_proto_interrupts(dev);
> + if (devm_add_action_or_reset(&port->dev, cxl_mask_proto_irqs, dev))
> + dev_warn(&port->dev,
> + "Failed to register CXL proto-irq mask cleanup\n");
> }
> EXPORT_SYMBOL_NS_GPL(devm_cxl_port_ras_setup, "CXL");
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index b9c6c7b97217..eaa36fe0eb31 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1151,6 +1151,31 @@ void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> */
> EXPORT_SYMBOL_FOR_MODULES(pci_aer_unmask_internal_errors, "cxl_core");
>
> +/**
> + * pci_aer_mask_internal_errors - mask internal errors
> + * @dev: pointer to the pci_dev data structure
> + *
> + * Mask internal errors in the Uncorrectable and Correctable Error
> + * Mask registers.
> + *
> + * Note: AER must be enabled and supported by the device which must be
> + * checked in advance, e.g. with pcie_aer_is_native().
> + */
> +void pci_aer_mask_internal_errors(struct pci_dev *dev)
> +{
> + int aer = dev->aer_cap;
> + u32 mask;
> +
> + pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &mask);
> + mask |= PCI_ERR_UNC_INTN;
> + pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, mask);
> +
> + pci_read_config_dword(dev, aer + PCI_ERR_COR_MASK, &mask);
> + mask |= PCI_ERR_COR_INTERNAL;
> + pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
> +}
> +EXPORT_SYMBOL_FOR_MODULES(pci_aer_mask_internal_errors, "cxl_core");
> +
> /**
> * pci_aer_handle_error - handle logging error into an event log
> * @dev: pointer to pci_dev data structure of error source device
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 979ed2f9fd38..c52db62d4c7e 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -71,6 +71,7 @@ int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
> void pci_aer_clear_fatal_status(struct pci_dev *dev);
> int pcie_aer_is_native(struct pci_dev *dev);
> void pci_aer_unmask_internal_errors(struct pci_dev *dev);
> +void pci_aer_mask_internal_errors(struct pci_dev *dev);
> #else
> static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
> {
> @@ -79,6 +80,7 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
> static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
> static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
> static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
> +static inline void pci_aer_mask_internal_errors(struct pci_dev *dev) { }
> #endif
>
> #ifdef CONFIG_CXL_RAS
* [PATCH v17 11/11] Documentation: cxl: Document CXL protocol error handling
2026-05-05 17:30 [PATCH v17 00/11] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (9 preceding siblings ...)
2026-05-05 17:30 ` [PATCH v17 10/11] PCI/CXL: Mask/Unmask CXL protocol errors Terry Bowman
@ 2026-05-05 17:30 ` Terry Bowman
2026-05-06 18:34 ` Dave Jiang
10 siblings, 1 reply; 21+ messages in thread
From: Terry Bowman @ 2026-05-05 17:30 UTC (permalink / raw)
To: dave, jic23, dave.jiang, alison.schofield, djbw, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, vishal.l.verma,
alucerop, ira.weiny, corbet, rafael, xueshuai, linux-cxl
Cc: linux-kernel, linux-pci, linux-acpi, linux-doc, terry.bowman
Add Documentation/driver-api/cxl/linux/protocol-error-handling.rst
describing the end-to-end CXL protocol error path: AER ingress, the
AER-CXL kfifo handoff, the cxl_core consumer worker, RCD/RCH special
cases, severity policy, trace events, and a source code map.
This documents the architecture introduced by the preceding patches in
this series.
This was generated by claude-opus-4.7.
Assisted-by: Claude:claude-opus-4.7
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
Documentation/driver-api/cxl/index.rst | 1 +
.../cxl/linux/protocol-error-handling.rst | 440 ++++++++++++++++++
2 files changed, 441 insertions(+)
create mode 100644 Documentation/driver-api/cxl/linux/protocol-error-handling.rst
diff --git a/Documentation/driver-api/cxl/index.rst b/Documentation/driver-api/cxl/index.rst
index 3dfae1d310ca..6861b2e5726a 100644
--- a/Documentation/driver-api/cxl/index.rst
+++ b/Documentation/driver-api/cxl/index.rst
@@ -42,6 +42,7 @@ that have impacts on each other. The docs here break up configurations steps.
linux/dax-driver
linux/memory-hotplug
linux/access-coordinates
+ linux/protocol-error-handling
.. toctree::
:maxdepth: 2
diff --git a/Documentation/driver-api/cxl/linux/protocol-error-handling.rst b/Documentation/driver-api/cxl/linux/protocol-error-handling.rst
new file mode 100644
index 000000000000..4d6f33f0ed31
--- /dev/null
+++ b/Documentation/driver-api/cxl/linux/protocol-error-handling.rst
@@ -0,0 +1,440 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==============================
+CXL Protocol Error Handling
+==============================
+
+This document describes how the kernel detects, classifies, dispatches,
+logs, and recovers from CXL protocol errors signaled through the PCIe
+Advanced Error Reporting (AER) interface. It covers both Virtual
+Hierarchy (VH) topologies (Root Ports, Upstream/Downstream Switch
+Ports, and Endpoints) and Restricted CXL Host (RCH) topologies
+(Root Complex Event Collectors driving Restricted CXL Devices).
+
+It is intended for kernel developers maintaining or extending
+``drivers/pci/pcie/aer*.c``, ``drivers/cxl/core/ras.c``, and the
+related plumbing in ``include/linux/aer.h``.
+
+
+Background
+==========
+
+A CXL device reports protocol-layer failures (CXL.cachemem RAS) as
+PCIe AER **Internal Errors**: ``PCI_ERR_COR_INTERNAL`` for correctable
+events and ``PCI_ERR_UNC_INTN`` for uncorrectable events. From the AER
+core's point of view these look like ordinary PCIe AER messages, but
+their semantics are CXL-specific: the actual fault information lives
+in CXL RAS capability registers, not in the PCIe AER status registers.
+
+Historically, native CXL.cachemem RAS handling was implemented only
+for CXL Endpoints and for RCH Downstream Ports. CXL Root Ports,
+Upstream Switch Ports, and Downstream Switch Ports were not covered.
+This left the kernel unable to log or react to protocol errors
+signaled by switch components.
+
+The unified CXL protocol error path closes that gap by routing every
+CXL Internal Error through a single producer/consumer pipeline shared
+by all CXL device types.
+
+
+Architecture overview
+=====================
+
+CXL protocol error handling is implemented as a distinct error plane
+layered on top of the existing PCIe AER infrastructure. The two planes
+are kept separate:
+
+* The **PCIe AER plane** continues to handle native PCIe errors
+ (Receiver overflows, malformed TLPs, completion timeouts, and so
+ on). This is unchanged.
+
+* The **CXL protocol error plane** owns CXL Internal Errors. The AER
+ core forwards them to ``cxl_core`` via a dedicated kfifo; ``cxl_core``
+ then dispatches to CE/UE handlers and drives the recovery and
+ panic policy.
+
+The boundary between the two planes is ``is_cxl_error()`` in
+``drivers/pci/pcie/aer_cxl_vh.c``, which inspects ``info->is_cxl``
+(set from ``pcie_is_cxl()``) together with the PCIe device type and
+the AER status word. When ``is_cxl_error()`` returns true the event
+is enqueued into the AER-CXL kfifo; otherwise the event flows through
+``pci_aer_handle_error()`` as before.
+
+The pipeline has three layers:
+
+1. **Producer** (``aer_cxl_vh.c``, ``aer_cxl_rch.c``) - runs in AER
+ IRQ/threaded context, classifies, clears the AER CE status, and
+ enqueues ``struct cxl_proto_err_work_data``.
+2. **Queue** - the AER-CXL kfifo plus a backing ``struct work_struct``.
+3. **Consumer** (``cxl_core/ras.c``) - workqueue-context worker that
+ resolves the CXL Port topology and dispatches to CE/UE handlers.
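The three layers above can be modeled in ordinary userspace C. In this sketch a fixed power-of-two ring and a pthread mutex stand in for the kernel kfifo and the producer-side spinlock, and ``proto_err_work_data`` mirrors the ``struct cxl_proto_err_work_data`` payload; the function names and depth are invented for illustration. In the kernel the single consumer can dequeue without the producer lock; here both sides take it for simplicity.

```c
#include <pthread.h>
#include <stdbool.h>

/* Userspace model of the AER-CXL handoff: a bounded ring stands in
 * for the kernel kfifo, a mutex for kfifo_in_spinlocked(). Not
 * kernel code; names are illustrative. */
enum severity { CE, UE_NONFATAL, UE_FATAL };

struct proto_err_work_data {
	void *pdev;		/* reference held by the producer */
	enum severity sev;
};

#define FIFO_DEPTH 32		/* must be a power of two, like kfifo */

static struct {
	struct proto_err_work_data buf[FIFO_DEPTH];
	unsigned int head, tail;	/* free-running indices */
	pthread_mutex_t lock;		/* producer-side serialization */
} fifo = { .lock = PTHREAD_MUTEX_INITIALIZER };

/* Producer side: what cxl_forward_error() does after classification. */
static bool fifo_in(struct proto_err_work_data wd)
{
	bool ok = false;

	pthread_mutex_lock(&fifo.lock);
	if (fifo.head - fifo.tail < FIFO_DEPTH) {
		fifo.buf[fifo.head++ & (FIFO_DEPTH - 1)] = wd;
		ok = true;	/* schedule_work() would follow here */
	}
	pthread_mutex_unlock(&fifo.lock);
	return ok;
}

/* Consumer side: one iteration of the dequeue loop. */
static bool fifo_out(struct proto_err_work_data *wd)
{
	pthread_mutex_lock(&fifo.lock);
	bool ok = fifo.head != fifo.tail;
	if (ok)
		*wd = fifo.buf[fifo.tail++ & (FIFO_DEPTH - 1)];
	pthread_mutex_unlock(&fifo.lock);
	return ok;
}
```

A full ring returning false corresponds to a dropped event; the producer clears the CE status before enqueue precisely so a drop cannot leave the device unacknowledged.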
+
+
+Topologies
+==========
+
+Two topologies are supported, and both feed the same kfifo.
+
+Virtual Hierarchy (VH)
+----------------------
+
+A standard CXL VH consists of a CXL Root Port (RP), an optional CXL
+Upstream Switch Port (USP), one or more CXL Downstream Switch Ports
+(DSPs), and CXL Endpoints (EPs) attached to the DSPs. Each component
+is a regular PCIe device with a CXL DVSEC and a CXL RAS capability,
+and it raises Internal Errors directly to the AER subsystem via the
+RP's MSI/MSI-X interrupt.
+
+The VH producer is ``cxl_forward_error()`` in
+``drivers/pci/pcie/aer_cxl_vh.c``.
+
+Restricted CXL Host (RCH)
+-------------------------
+
+In the RCH topology, a Root Complex Event Collector (RCEC) aggregates
+errors from one or more Restricted CXL Devices (RCDs) attached as
+Root Complex Integrated Endpoints. The RCEC delivers the AER
+interrupt; the AER driver iterates the RCDs beneath it.
+
+The RCH producer is ``cxl_rch_handle_error_iter()`` in
+``drivers/pci/pcie/aer_cxl_rch.c``. For each RCD it finds, it calls
+``cxl_forward_error()`` (the same producer helper used by the VH
+path), so RCH events end up in the same AER-CXL kfifo as VH events.
+
+
+End-to-end flow
+===============
+
+The diagram below shows the full path from an AER interrupt through
+producer classification, kfifo handoff, and consumer dispatch.
+
+.. code-block:: text
+
+ +-------------------------------------------------------------------------+
+ | CXL Internal Error Packet Flow |
+ | From PCIe AER Interrupt to CXL Protocol Error Handling and Logging |
+ +-------------------------------------------------------------------------+
+
+ CXL device (RP / USP / DSP / EP / RCD) raises AER Internal Error
+ (correctable PCI_ERR_COR_INTERNAL or uncorrectable PCI_ERR_UNC_INTN)
+ |
+ v
+ +-------------------------------------------------------------+
+ | PCIe Root Port AER MSI/MSI-X interrupt fires |
+ +-------------------------------------------------------------+
+ |
+ ============= drivers/pci/pcie/aer.c (AER core) =============
+ |
+ v
+ +---------------------------------+
+ | aer_irq() / aer_isr() | (top + threaded handler)
+ +---------------------------------+
+ |
+ v
+ +---------------------------------+
+ | aer_isr_one_error() |
+ | aer_isr_one_error_type() |
+ +---------------------------------+
+ |
+ v
+ +------------------------------------------+
+ | aer_get_device_error_info() |
+ | - reads PCI_ERR_COR_STATUS |
+ | - reads PCI_ERR_UNCOR_STATUS (*if RP/ |
+ | RCEC/DSP, or non-fatal severity) |
+ | - sets info->is_cxl = pcie_is_cxl(dev) |
+ +------------------------------------------+
+ |
+ v
+ +---------------------------------+
+ | handle_error_source(dev, info) |
+ +---------------------------------+
+ | |
+ | is_cxl_error() +---> pci_aer_handle_error()
+ | (CXL device + Internal) (native PCIe AER path,
+ v not covered here)
+ +-------------------------------------------------------------+
+ | Topology dispatch within AER core: |
+ | |
+ | - VH topology (RP / USP / DSP / EP) |
+ | -> drivers/pci/pcie/aer_cxl_vh.c |
+ | |
+ | - RCH topology (RCEC iterates RCDs under it) |
+ | -> drivers/pci/pcie/aer_cxl_rch.c |
+ +-------------------------------------------------------------+
+ | |
+ | VH path RCH path (RCEC AER)
+ v v
+ ============= aer_cxl_vh.c (VH ============= aer_cxl_rch.c (RCH
+ producer) ============= producer) ==========
+ | |
+ v v
+ +-----------------------------+ +-------------------------------+
+ | cxl_forward_error(pdev,info)| | cxl_rch_handle_error_iter() |
+ | - if AER_CORRECTABLE: | | - iterate each RCD pdev |
+ | clear PCI_ERR_COR_STATUS| | beneath the RCEC |
+ | - pci_dev_get(pdev) | | - call cxl_forward_error() |
+ | - build cxl_proto_err_ | | for each RCD |
+ | work_data | | (same producer helper as |
+ | { pdev, severity } | | the VH path uses) |
+ | - kfifo_in_spinlocked(...) | +-------------------------------+
+ | - schedule_work(...) | |
+ +-----------------------------+ |
+ | |
+ +-----------------+---------------------------+
+ |
+ v
+ +--------------------------+
+ | AER-CXL kfifo |
+ | (work_struct) |
+ +--------------------------+
+ |
+ v
+ ============= drivers/cxl/core/ras.c (consumer worker) =======
+ |
+ v
+ +-------------------------------------------------------------+
+ | cxl_proto_err_work_fn() (workqueue handler) |
+ | for_each_cxl_proto_err(&wd, __cxl_proto_err_work_fn) |
+ +-------------------------------------------------------------+
+ |
+ v
+ +-------------------------------------------------------------+
+ | __cxl_proto_err_work_fn(wd) |
+ | port = find_cxl_port_by_dev(&pdev->dev, &dport) |
+ | cxl_handle_proto_error(pdev, port, dport, severity) |
+ | pci_dev_put(pdev) |
+ +-------------------------------------------------------------+
+ |
+ v
+ +-------------------------------------------------------------+
+ | cxl_handle_proto_error() |
+ +-------------------------------------------------------------+
+ | |
+ pci_pcie_type == pci_pcie_type !=
+ PCI_EXP_TYPE_RC_END PCI_EXP_TYPE_RC_END
+ (RCD Endpoint) (VH: RP/USP/DSP/EP)
+ | |
+ v |
+ +-------------------------------------+ |
+ | cxl_handle_rdport_errors(pdev) | |
+ | - process RCH Downstream Port's | |
+ | RAS register block first | |
+ | - cxl_handle_cor_ras() for CE | |
+ | - cxl_handle_ras() for UE | |
+ | (log only; does NOT panic) | |
+ +-------------------------------------+ |
+ | |
+ +--------------------+-----------------------+
+ |
+ v
+ +-----------------------------+
+ | severity == AER_CORRECTABLE |
+ +-----------------------------+
+ | |
+ yes no
+ v v
+ +----------------------+ +-------------------------+
+ | cxl_handle_cor_ras() | | cxl_do_recovery() |
+ | - emit cxl_aer_ | | (described below) |
+ | correctable_ | +-------------------------+
+ | error trace |
+ | pcie_clear_device_ |
+ | status() |
+ +----------------------+
+
+ +-------------------------------+
+ | cxl_do_recovery() |
+ | if pci_dev_is_disconnected: |
+ | panic("CXL cachemem err.") |
+ | |
+ | ue = cxl_handle_ras() |
+ | -> emit |
+ | cxl_aer_uncorrectable_ |
+ | error trace event |
+ | |
+ | if (ue): |
+ | panic("CXL cachemem err.") |
+ | |
+ | pcie_clear_device_status() |
+ | pci_aer_clear_nonfatal_status|
+ | pci_aer_clear_fatal_status |
+ +-------------------------------+
+
+
+Severity policy
+===============
+
+The kernel's response to a CXL protocol error depends on the AER
+severity reported by the device and on the result of inspecting the
+CXL RAS registers.
+
+Correctable Error (CE)
+----------------------
+
+* The AER driver clears ``PCI_ERR_COR_STATUS`` in the producer
+ (``cxl_forward_error()``) before enqueue, so the device is
+ acknowledged even if the consumer drops the event.
+* The consumer's ``cxl_handle_cor_ras()`` reads and clears the CXL
+ RAS correctable status and emits a ``cxl_aer_correctable_error``
+ trace event.
+* No recovery action is taken.
+
+Uncorrectable Error (UE), non-fatal
+-----------------------------------
+
+* The producer enqueues the event without clearing the AER UCE
+ status.
+* The consumer enters ``cxl_do_recovery()``.
+* ``cxl_handle_ras()`` reads the CXL RAS uncorrectable status and
+ emits a ``cxl_aer_uncorrectable_error`` trace event.
+* If ``cxl_handle_ras()`` returns true (a CXL RAS UE bit was set),
+ the kernel panics with ``"CXL cachemem error."``. CXL.cachemem
+ traffic cannot be safely recovered in software once corruption is
+ observed; continuing risks silent data loss across all devices in
+ an interleaved HDM region.
+* If ``cxl_handle_ras()`` returns false (no CXL RAS bit set, i.e.
+ the AER UCE was a PCIe-side issue rather than a CXL.cachemem
+ issue), the AER UCE status is cleared and execution continues.
+
+Uncorrectable Error (UE), fatal
+-------------------------------
+
+Fatal severity follows the same recovery path as non-fatal in
+``cxl_do_recovery()``, with one important caveat: the AER core only
+reads ``PCI_ERR_UNCOR_STATUS`` for Root Ports, RCECs, Downstream
+Ports, or non-fatal severities (see ``aer_get_device_error_info()``
+in ``drivers/pci/pcie/aer.c``). For a fatal UE signaled by an
+upstream component, PCI config reads to the source device are
+expected to fail, so ``UNCOR_STATUS`` is never retrieved and
+``info->status`` stays zero.
+
+The practical consequence: a fatal UE on an Upstream Switch Port or
+Endpoint is **not** classified as a CXL error by ``is_cxl_error()``.
+It falls through to ``pci_aer_handle_error()`` and is processed by
+the standard AER recovery flow. Only the CXL trace events emitted by
+the AER core (``aer_event``) appear; the CXL-specific
+``cxl_aer_uncorrectable_error`` event is not emitted on this path.
+
+Disconnect during recovery
+--------------------------
+
> +``cxl_do_recovery()`` checks ``pci_dev_is_disconnected(pdev)`` before
> +touching the RAS registers. A device that disconnects during an
> +uncorrectable error event is itself unrecoverable, particularly when
> +the device backs an interleaved HDM region; in that case the kernel
> +panics directly rather than letting the RAS reads return ``~0`` and
> +mask the root cause.
+
+
+RCD/RCH special cases
+=====================
+
+RCD Endpoint flow
+-----------------
+
+When ``cxl_handle_proto_error()`` sees ``pci_pcie_type(pdev) ==
+PCI_EXP_TYPE_RC_END`` (i.e. an RCD Endpoint), it calls
+``cxl_handle_rdport_errors()`` first. This processes the RAS state
+of the RCH Downstream Port that hosts the RCD before falling through
+to the common CE/UE dispatch on the RCD Endpoint itself.
+
+The RCH Downstream Port's RAS UE is **logged only**: it emits the
+trace event but does not panic. The panic decision is taken on the
+RCD Endpoint's own RAS in ``cxl_do_recovery()``.
+
+This split mirrors the structure of an RCH topology: the RCH dport
+is functionally a CXL infrastructure component (similar to a switch
+port), while the RCD itself is the actual CXL.cachemem source whose
+corruption drives the recovery decision.
+
+RCH ingress aggregation
+-----------------------
+
+RCH errors do not arrive on a per-RCD interrupt. The RCEC is the AER
+source, and the AER driver drives ``cxl_rch_handle_error_iter()`` to
+walk each RCD beneath it and forward an event per RCD through the
+shared kfifo. From the consumer's point of view, RCH-originated
+events are indistinguishable from VH events.
+
+
+Trace events
+============
+
+Two unified trace events are emitted from ``cxl_handle_cor_ras()``
+and ``cxl_handle_ras()`` and are used by every CXL device type and
+both topologies:
+
+* ``cxl_aer_correctable_error`` - emitted when a CXL RAS CE bit is
+ set; carries the human-readable status string.
+* ``cxl_aer_uncorrectable_error`` - emitted when a CXL RAS UE bit is
+ set; carries both the current status and the first-error pointer.
+
+Common fields:
+
+* ``device=<PCI BDF>`` - the source device (always a PCI BDF, even
+ for RCH paths where the trace was historically a memdev name).
+* ``host=<bridge>`` - the parent host bridge or PCI host BDF.
+* ``serial=<u64>`` - the device serial from ``pci_get_dsn()``.
+
+The ``device`` field replaces the older ``memdev`` field that earlier
+revisions emitted on Endpoint events. Userspace consumers
+(rasdaemon's ``ras-cxl-handler.c``) need a corresponding update to
+read the new field name.
+
+
+Source code map
+===============
+
+============================================ ==============================
+File Role
+============================================ ==============================
+``drivers/pci/pcie/aer.c`` AER core; receives the IRQ,
+ builds ``aer_err_info``,
+ dispatches to either the CXL
+ path (``is_cxl_error()``) or
+ ``pci_aer_handle_error()``.
+``drivers/pci/pcie/aer_cxl_vh.c`` VH producer; provides
+ ``is_cxl_error()``,
+ ``cxl_forward_error()``, the
+ AER-CXL kfifo, and the
+ consumer registration
+ helpers.
+``drivers/pci/pcie/aer_cxl_rch.c`` RCH producer; iterates RCDs
+ under an RCEC and forwards
+ each via
+ ``cxl_forward_error()``.
+``drivers/cxl/core/ras.c`` Consumer; defines
+ ``cxl_proto_err_work_fn()``,
+ ``cxl_handle_proto_error()``,
+ ``cxl_handle_rdport_errors()``,
+ ``cxl_do_recovery()``,
+ ``cxl_handle_cor_ras()`` and
+ ``cxl_handle_ras()``.
+``include/linux/aer.h`` Public declarations:
+ ``struct cxl_proto_err_work_data``,
+ ``cxl_proto_err_fn_t``,
+ ``cxl_register_proto_err_work()``
+ and ``for_each_cxl_proto_err()``.
+============================================ ==============================
+
+
+Limitations and future work
+===========================
+
+* **USP/EP fatal UCE is not classified as CXL.** As described under
+ `Severity policy`_, the AER core never retrieves
+ ``PCI_ERR_UNCOR_STATUS`` in this scenario, so ``is_cxl_error()``
+ cannot tag the event as CXL. The event is handled by the AER path
+ only. Resolving this requires either an AER-core change to attempt
+ a config read with link-validity gating, or a separate CXL-side
+ notification mechanism for upstream-signaled fatal events.
+* **User-defined status masks** are not yet supported. All CE and UE
+ status bits are reported as they appear in the RAS register.
> +* **Port traversal in ``cxl_do_recovery()``** is not yet implemented; a
+ CXL UE today is reported and acted on at the source device only,
+ not propagated to ancestor ports.
+* The RCH producer (``aer_cxl_rch.c``) currently lives under
+ ``drivers/pci/pcie/`` for historical reasons. Moving it to
+ ``drivers/cxl/core/ras_rch.c`` is on the roadmap.
+
--
2.34.1
* Re: [PATCH v17 11/11] Documentation: cxl: Document CXL protocol error handling
2026-05-05 17:30 ` [PATCH v17 11/11] Documentation: cxl: Document CXL protocol error handling Terry Bowman
@ 2026-05-06 18:34 ` Dave Jiang
0 siblings, 0 replies; 21+ messages in thread
From: Dave Jiang @ 2026-05-06 18:34 UTC (permalink / raw)
To: Terry Bowman, dave, jic23, alison.schofield, djbw, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, vishal.l.verma,
alucerop, ira.weiny, corbet, rafael, xueshuai, linux-cxl
Cc: linux-kernel, linux-pci, linux-acpi, linux-doc
On 5/5/26 10:30 AM, Terry Bowman wrote:
> +Virtual Hierarchy (VH)
> +----------------------
> +
> +A standard CXL VH consists of a CXL Root Port (RP), an optional CXL
> +Upstream Switch Port (USP), one or more CXL Downstream Switch Ports
I think it's clearer if you say "an optional CXL Upstream Switch Port (USP)
with one or more CXL Downstream Switch Ports (DSP)" to indicate that this is
a wholly contained component. Otherwise it reads that only the USP is
optional.
DJ
> +(DSPs), and CXL Endpoints (EPs) attached to the DSPs. Each component
> +is a regular PCIe device with a CXL DVSEC and a CXL RAS capability,
> +and it raises Internal Errors directly to the AER subsystem via the
> +RP's MSI/MSI-X interrupt.
> +
> +The VH producer is ``cxl_forward_error()`` in
> +``drivers/pci/pcie/aer_cxl_vh.c``.
> +
> +Restricted CXL Host (RCH)
> +-------------------------
> +
> +In the RCH topology, a Root Complex Event Collector (RCEC) aggregates
> +errors from one or more Restricted CXL Devices (RCDs) attached as
> +Root Complex Integrated Endpoints. The RCEC delivers the AER
> +interrupt; the AER driver iterates the RCDs beneath it.
> +
> +The RCH producer is ``cxl_rch_handle_error_iter()`` in
> +``drivers/pci/pcie/aer_cxl_rch.c``. For each RCD it finds, it calls
> +``cxl_forward_error()`` (the same producer helper used by the VH
> +path), so RCH events end up in the same AER-CXL kfifo as VH events.
> +
> +
> +End-to-end flow
> +===============
> +
> +The diagram below shows the full path from an AER interrupt through
> +producer classification, kfifo handoff, and consumer dispatch.
> +
> +.. code-block:: text
> +
> + +-------------------------------------------------------------------------+
> + | CXL Internal Error Packet Flow |
> + | From PCIe AER Interrupt to CXL Protocol Error Handling and Logging |
> + +-------------------------------------------------------------------------+
> +
> + CXL device (RP / USP / DSP / EP / RCD) raises AER Internal Error
> + (correctable PCI_ERR_COR_INTERNAL or uncorrectable PCI_ERR_UNC_INTN)
> + |
> + v
> + +-------------------------------------------------------------+
> + | PCIe Root Port AER MSI/MSI-X interrupt fires |
> + +-------------------------------------------------------------+
> + |
> + ============= drivers/pci/pcie/aer.c (AER core) =============
> + |
> + v
> + +---------------------------------+
> + | aer_irq() / aer_isr() | (top + threaded handler)
> + +---------------------------------+
> + |
> + v
> + +---------------------------------+
> + | aer_isr_one_error() |
> + | aer_isr_one_error_type() |
> + +---------------------------------+
> + |
> + v
> + +------------------------------------------+
> + | aer_get_device_error_info() |
> + | - reads PCI_ERR_COR_STATUS |
> + | - reads PCI_ERR_UNCOR_STATUS (*if RP/ |
> + | RCEC/DSP, or non-fatal severity) |
> + | - sets info->is_cxl = pcie_is_cxl(dev) |
> + +------------------------------------------+
> + |
> + v
> + +---------------------------------+
> + | handle_error_source(dev, info) |
> + +---------------------------------+
> + | |
> + | is_cxl_error() +---> pci_aer_handle_error()
> + | (CXL device + Internal) (native PCIe AER path,
> + v not covered here)
> + +-------------------------------------------------------------+
> + | Topology dispatch within AER core: |
> + | |
> + | - VH topology (RP / USP / DSP / EP) |
> + | -> drivers/pci/pcie/aer_cxl_vh.c |
> + | |
> + | - RCH topology (RCEC iterates RCDs under it) |
> + | -> drivers/pci/pcie/aer_cxl_rch.c |
> + +-------------------------------------------------------------+
> + | |
> + | VH path RCH path (RCEC AER)
> + v v
> + ============= aer_cxl_vh.c (VH ============= aer_cxl_rch.c (RCH
> + producer) ============= producer) ==========
> + | |
> + v v
> + +-----------------------------+ +-------------------------------+
> + | cxl_forward_error(pdev,info)| | cxl_rch_handle_error_iter() |
> + | - if AER_CORRECTABLE: | | - iterate each RCD pdev |
> + | clear PCI_ERR_COR_STATUS| | beneath the RCEC |
> + | - pci_dev_get(pdev) | | - call cxl_forward_error() |
> + | - build cxl_proto_err_ | | for each RCD |
> + | work_data | | (same producer helper as |
> + | { pdev, severity } | | the VH path uses) |
> + | - kfifo_in_spinlocked(...) | +-------------------------------+
> + | - schedule_work(...) | |
> + +-----------------------------+ |
> + | |
> + +-----------------+---------------------------+
> + |
> + v
> + +--------------------------+
> + | AER-CXL kfifo |
> + | (work_struct) |
> + +--------------------------+
> + |
> + v
> + ============= drivers/cxl/core/ras.c (consumer worker) =======
> + |
> + v
> + +-------------------------------------------------------------+
> + | cxl_proto_err_work_fn() (workqueue handler) |
> + | for_each_cxl_proto_err(&wd, __cxl_proto_err_work_fn) |
> + +-------------------------------------------------------------+
> + |
> + v
> + +-------------------------------------------------------------+
> + | __cxl_proto_err_work_fn(wd) |
> + | port = find_cxl_port_by_dev(&pdev->dev, &dport) |
> + | cxl_handle_proto_error(pdev, port, dport, severity) |
> + | pci_dev_put(pdev) |
> + +-------------------------------------------------------------+
> + |
> + v
> + +-------------------------------------------------------------+
> + | cxl_handle_proto_error() |
> + +-------------------------------------------------------------+
> + | |
> + pci_pcie_type == pci_pcie_type !=
> + PCI_EXP_TYPE_RC_END PCI_EXP_TYPE_RC_END
> + (RCD Endpoint) (VH: RP/USP/DSP/EP)
> + | |
> + v |
> + +-------------------------------------+ |
> + | cxl_handle_rdport_errors(pdev) | |
> + | - process RCH Downstream Port's | |
> + | RAS register block first | |
> + | - cxl_handle_cor_ras() for CE | |
> + | - cxl_handle_ras() for UE | |
> + | (log only; does NOT panic) | |
> + +-------------------------------------+ |
> + | |
> + +--------------------+-----------------------+
> + |
> + v
> + +-----------------------------+
> + | severity == AER_CORRECTABLE |
> + +-----------------------------+
> + | |
> + yes no
> + v v
> + +----------------------+ +-------------------------+
> + | cxl_handle_cor_ras() | | cxl_do_recovery() |
> + | - emit cxl_aer_ | | (described below) |
> + | correctable_ | +-------------------------+
> + | error trace |
> + | pcie_clear_device_ |
> + | status() |
> + +----------------------+
> +
> + +-------------------------------+
> + | cxl_do_recovery() |
> + | if pci_dev_is_disconnected: |
> + | panic("CXL cachemem err.") |
> + | |
> + | ue = cxl_handle_ras() |
> + | -> emit |
> + | cxl_aer_uncorrectable_ |
> + | error trace event |
> + | |
> + | if (ue): |
> + | panic("CXL cachemem err.") |
> + | |
> + | pcie_clear_device_status() |
> + | pci_aer_clear_nonfatal_status|
> + | pci_aer_clear_fatal_status |
> + +-------------------------------+
> +
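> +The producer step in the diagram above (build the work data, then
> +``kfifo_in_spinlocked()`` and ``schedule_work()``) can be modeled in
> +userspace as a spinlock-guarded power-of-two ring buffer. The sketch
> +below is illustrative only: the ``fifo_*`` and ``*_proto_err`` names
> +are invented, and the kernel side uses the real kfifo API, a
> +``spinlock_t``, and a workqueue rather than a busy-wait flag.

```c
/* Illustrative userspace model of the producer/consumer path shown in
 * the diagram: kfifo_in_spinlocked() reduced to a power-of-two ring
 * buffer guarded by a spinlock.  All names here are invented for the
 * sketch; the kernel uses struct kfifo, spinlock_t, schedule_work(). */
#include <stdatomic.h>
#include <stdbool.h>

#define FIFO_SIZE 32u                 /* power of two, as kfifo requires */

struct cxl_proto_err_work_data {
    int pdev_id;                      /* stand-in for struct pci_dev * */
    int severity;                     /* AER_CORRECTABLE, AER_NONFATAL, ... */
};

static struct cxl_proto_err_work_data fifo_buf[FIFO_SIZE];
static unsigned int fifo_in, fifo_out;
static atomic_flag fifo_lock = ATOMIC_FLAG_INIT;  /* models the spinlock */

/* Returns true if queued; false when full (kfifo_in() reports 0 copied). */
bool enqueue_proto_err(struct cxl_proto_err_work_data wd)
{
    bool ok = false;

    while (atomic_flag_test_and_set(&fifo_lock))
        ;                             /* spin, as kfifo_in_spinlocked() would */
    if (fifo_in - fifo_out < FIFO_SIZE) {
        fifo_buf[fifo_in++ % FIFO_SIZE] = wd;
        ok = true;
    }
    atomic_flag_clear(&fifo_lock);
    /* the kernel producer would now schedule_work() on the consumer */
    return ok;
}

/* Consumer side; the real consumer drains via for_each_cxl_proto_err(). */
bool dequeue_proto_err(struct cxl_proto_err_work_data *wd)
{
    bool ok = false;

    while (atomic_flag_test_and_set(&fifo_lock))
        ;
    if (fifo_out != fifo_in) {
        *wd = fifo_buf[fifo_out++ % FIFO_SIZE];
        ok = true;
    }
    atomic_flag_clear(&fifo_lock);
    return ok;
}
```

> +The lock covers only the index/copy step, mirroring why multiple AER
> +IRQ worker threads can enqueue concurrently while a single worker
> +drains the fifo.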
> +
> +Severity policy
> +===============
> +
> +The kernel's response to a CXL protocol error depends on the AER
> +severity reported by the device and on the result of inspecting the
> +CXL RAS registers.
> +
> +Correctable Error (CE)
> +----------------------
> +
> +* The AER driver clears ``PCI_ERR_COR_STATUS`` in the producer
> + (``cxl_forward_error()``) before enqueue, so the device is
> + acknowledged even if the consumer drops the event.
> +* The consumer's ``cxl_handle_cor_ras()`` reads and clears the CXL
> + RAS correctable status and emits a ``cxl_aer_correctable_error``
> + trace event.
> +* No recovery action is taken.
> +
> +Uncorrectable Error (UE), non-fatal
> +-----------------------------------
> +
> +* The producer enqueues the event without clearing the AER UCE
> + status.
> +* The consumer enters ``cxl_do_recovery()``.
> +* ``cxl_handle_ras()`` reads the CXL RAS uncorrectable status and
> + emits a ``cxl_aer_uncorrectable_error`` trace event.
> +* If ``cxl_handle_ras()`` returns true (a CXL RAS UE bit was set),
> + the kernel panics with ``"CXL cachemem error."``. CXL.cachemem
> + traffic cannot be safely recovered in software once corruption is
> + observed; continuing risks silent data loss across all devices in
> + an interleaved HDM region.
> +* If ``cxl_handle_ras()`` returns false (no CXL RAS bit set, i.e.
> + the AER UCE was a PCIe-side issue rather than a CXL.cachemem
> + issue), the AER UCE status is cleared and execution continues.
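> +The CE/UE policy above can be condensed into a small decision
> +function. This is a userspace sketch, not the kernel code: the enum
> +values and the function name are invented, and the disconnect check
> +(covered separately below) is omitted here.

```c
/* Userspace sketch of the CE/UE dispatch described above.  The enum
 * values and function name are invented for illustration; the real
 * logic lives in drivers/cxl/core/ras.c. */
#include <stdbool.h>

enum severity { AER_CORRECTABLE, AER_NONFATAL, AER_FATAL };
enum outcome  { LOGGED, RECOVERED, PANIC };

/* ras_ue_set models cxl_handle_ras() finding a CXL RAS UE bit set. */
enum outcome handle_proto_error(enum severity sev, bool ras_ue_set)
{
    if (sev == AER_CORRECTABLE)
        return LOGGED;           /* cxl_handle_cor_ras(): trace only */
    if (ras_ue_set)
        return PANIC;            /* CXL.cachemem corruption observed */
    return RECOVERED;            /* PCIe-side UE: clear status, continue */
}
```

> +Note the asymmetry: the severity chooses the path, but the panic
> +decision rests on what the CXL RAS registers report, not on the AER
> +severity alone.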
> +
> +Uncorrectable Error (UE), fatal
> +-------------------------------
> +
> +Fatal severity follows the same recovery path as non-fatal in
> +``cxl_do_recovery()``, with one important caveat: the AER core only
> +reads ``PCI_ERR_UNCOR_STATUS`` for Root Ports, RCECs, Downstream
> +Ports, or non-fatal severities (see ``aer_get_device_error_info()``
> +in ``drivers/pci/pcie/aer.c``). For a fatal UE signaled by an
> +upstream component, PCI config reads to the source device are
> +expected to fail, so ``UNCOR_STATUS`` is never retrieved and
> +``info->status`` stays zero.
> +
> +The practical consequence: a fatal UE on an Upstream Switch Port or
> +Endpoint is **not** classified as a CXL error by ``is_cxl_error()``.
> +It falls through to ``pci_aer_handle_error()`` and is processed by
> +the standard AER recovery flow. Only the generic AER trace events
> +(``aer_event``) emitted by the AER core appear; the CXL-specific
> +``cxl_aer_uncorrectable_error`` event is not emitted on this path.
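> +The gating condition can be sketched as a predicate. The
> +``PCI_EXP_TYPE_*`` values below match
> +``include/uapi/linux/pci_regs.h``; the helper names are invented for
> +this sketch and do not exist in the kernel.

```c
/* Sketch of the aer_get_device_error_info() gating described above:
 * PCI_ERR_UNCOR_STATUS is only read for downstream-facing ports or
 * for non-fatal severity, so a fatal UE signaled for a USP or EP
 * leaves info->status at zero and is_cxl_error() cannot match.
 * Constants match include/uapi/linux/pci_regs.h; helper names are
 * invented. */
#include <stdbool.h>

#define PCI_EXP_TYPE_ROOT_PORT  0x4
#define PCI_EXP_TYPE_UPSTREAM   0x5
#define PCI_EXP_TYPE_DOWNSTREAM 0x6
#define PCI_EXP_TYPE_RC_EC      0xa

enum severity { AER_CORRECTABLE, AER_NONFATAL, AER_FATAL };

/* True when the AER core will retrieve PCI_ERR_UNCOR_STATUS at all. */
bool uncor_status_retrieved(int pcie_type, enum severity sev)
{
    return pcie_type == PCI_EXP_TYPE_ROOT_PORT ||
           pcie_type == PCI_EXP_TYPE_RC_EC ||
           pcie_type == PCI_EXP_TYPE_DOWNSTREAM ||
           sev != AER_FATAL;
}

/* is_cxl_error() needs a non-zero status to classify the event. */
bool classified_as_cxl(int pcie_type, enum severity sev, bool cxl_dev)
{
    return cxl_dev && uncor_status_retrieved(pcie_type, sev);
}
```

> +A fatal UE on an Upstream Port therefore falls through to the
> +standard AER flow even on a CXL device, which is exactly the
> +limitation recorded at the end of this document.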
> +
> +Disconnect during recovery
> +--------------------------
> +
> +``cxl_do_recovery()`` checks ``pci_dev_is_disconnected(pdev)`` before
> +touching the RAS registers. A device disconnecting during an
> +uncorrectable error event is itself unrecoverable, particularly when
> +the device backs an interleaved HDM region; in that case the kernel
> +panics directly rather than letting ``readl()`` return ``~0u`` and
> +mask the real cause.
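> +The ordering matters because MMIO reads from a disconnected PCI
> +device complete as all ones, which is indistinguishable from a RAS
> +status register with every bit set. A userspace sketch (all names
> +invented) of the check-before-read pattern:

```c
/* Sketch of the ordering described above: check for disconnect before
 * reading RAS status.  A disconnected device's MMIO reads complete as
 * all ones (~0u).  All names are invented for this sketch. */
#include <stdbool.h>
#include <stdint.h>

/* Models readl() on the RAS status register of a possibly
 * disconnected device. */
static uint32_t ras_readl(bool disconnected, uint32_t hw_status)
{
    return disconnected ? ~0u : hw_status;
}

/* Returns false when the caller must panic instead of trusting the
 * read; mirrors cxl_do_recovery() calling pci_dev_is_disconnected()
 * before touching the registers. */
bool read_ras_status(bool disconnected, uint32_t hw_status,
                     uint32_t *status)
{
    if (disconnected)
        return false;          /* panic path: do not mask the cause */
    *status = ras_readl(false, hw_status);
    return true;
}
```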
> +
> +
> +RCD/RCH special cases
> +=====================
> +
> +RCD Endpoint flow
> +-----------------
> +
> +When ``cxl_handle_proto_error()`` sees ``pci_pcie_type(pdev) ==
> +PCI_EXP_TYPE_RC_END`` (i.e. an RCD Endpoint), it calls
> +``cxl_handle_rdport_errors()`` first. This processes the RAS state
> +of the RCH Downstream Port that hosts the RCD before falling through
> +to the common CE/UE dispatch on the RCD Endpoint itself.
> +
> +The RCH Downstream Port's RAS UE is **logged only**: it emits the
> +trace event but does not panic. The panic decision is taken on the
> +RCD Endpoint's own RAS in ``cxl_do_recovery()``.
> +
> +This split mirrors the structure of an RCH topology: the RCH dport
> +is functionally a CXL infrastructure component (similar to a switch
> +port), while the RCD itself is the actual CXL.cachemem source whose
> +corruption drives the recovery decision.
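> +The type dispatch can be sketched as follows. The
> +``PCI_EXP_TYPE_RC_END`` value matches
> +``include/uapi/linux/pci_regs.h``; the context structure and function
> +shape are invented for illustration.

```c
/* Sketch of the RCD dispatch in cxl_handle_proto_error(): an RCD
 * Endpoint first gets its RCH Downstream Port RAS logged (no panic),
 * then falls through to the common CE/UE handling.  The struct and
 * function are invented for this sketch. */
#include <stdbool.h>

#define PCI_EXP_TYPE_RC_END 0x9   /* Root Complex Integrated Endpoint */

struct err_ctx {
    int  pcie_type;               /* pci_pcie_type(pdev) */
    bool rdport_logged;           /* RCH dport RAS handled (log only) */
    bool common_path_run;         /* shared CE/UE dispatch ran */
};

void dispatch_proto_error(struct err_ctx *ctx)
{
    if (ctx->pcie_type == PCI_EXP_TYPE_RC_END)
        ctx->rdport_logged = true;   /* cxl_handle_rdport_errors() */
    ctx->common_path_run = true;     /* CE/UE dispatch on the device */
}
```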
> +
> +RCH ingress aggregation
> +-----------------------
> +
> +RCH errors do not arrive on a per-RCD interrupt. The RCEC is the AER
> +source, and the AER driver drives ``cxl_rch_handle_error_iter()`` to
> +walk each RCD beneath it and forward an event per RCD through the
> +shared kfifo. From the consumer's point of view, RCH-originated
> +events are indistinguishable from VH events.
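> +The fan-out can be sketched in userspace: one RCEC AER event becomes
> +one forwarded work item per RCD beneath it. The array walk below
> +stands in for the kernel's bus iteration, and all names are invented
> +for this sketch.

```c
/* Userspace sketch of the RCEC fan-out described above.  Copies one
 * entry per RCD into the shared fifo, the way each RCD gets its own
 * cxl_forward_error() call; returns how many were forwarded. */
#include <stddef.h>

size_t forward_rcec_errors(const int *rcd_ids, size_t nr_rcds,
                           int *fifo, size_t fifo_cap)
{
    size_t n = 0;

    for (size_t i = 0; i < nr_rcds && n < fifo_cap; i++)
        fifo[n++] = rcd_ids[i];   /* one event per RCD */
    return n;
}
```

> +Because each RCD is forwarded through the same helper and the same
> +fifo, the consumer needs no topology awareness.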
> +
> +
> +Trace events
> +============
> +
> +Two unified trace events are emitted from ``cxl_handle_cor_ras()``
> +and ``cxl_handle_ras()`` and are used by every CXL device type and
> +both topologies:
> +
> +* ``cxl_aer_correctable_error`` - emitted when a CXL RAS CE bit is
> + set; carries the human-readable status string.
> +* ``cxl_aer_uncorrectable_error`` - emitted when a CXL RAS UE bit is
> + set; carries both the current status and the first-error pointer.
> +
> +Common fields:
> +
> +* ``device=<PCI BDF>`` - the source device (always a PCI BDF, even
> + for RCH paths where the trace was historically a memdev name).
> +* ``host=<bridge>`` - the parent host bridge or PCI host BDF.
> +* ``serial=<u64>`` - the device serial from ``pci_get_dsn()``.
> +
> +The ``device`` field replaces the older ``memdev`` field that earlier
> +revisions emitted on Endpoint events. Userspace consumers
> +(rasdaemon's ``ras-cxl-handler.c``) need a corresponding update to
> +read the new field name.
> +
> +
> +Source code map
> +===============
> +
> +============================================ ==============================
> +File Role
> +============================================ ==============================
> +``drivers/pci/pcie/aer.c`` AER core; receives the IRQ,
> + builds ``aer_err_info``,
> + dispatches to either the CXL
> + path (``is_cxl_error()``) or
> + ``pci_aer_handle_error()``.
> +``drivers/pci/pcie/aer_cxl_vh.c`` VH producer; provides
> + ``is_cxl_error()``,
> + ``cxl_forward_error()``, the
> + AER-CXL kfifo, and the
> + consumer registration
> + helpers.
> +``drivers/pci/pcie/aer_cxl_rch.c`` RCH producer; iterates RCDs
> + under an RCEC and forwards
> + each via
> + ``cxl_forward_error()``.
> +``drivers/cxl/core/ras.c`` Consumer; defines
> + ``cxl_proto_err_work_fn()``,
> + ``cxl_handle_proto_error()``,
> + ``cxl_handle_rdport_errors()``,
> + ``cxl_do_recovery()``,
> + ``cxl_handle_cor_ras()`` and
> + ``cxl_handle_ras()``.
> +``include/linux/aer.h`` Public declarations:
> + ``struct cxl_proto_err_work_data``,
> + ``cxl_proto_err_fn_t``,
> + ``cxl_register_proto_err_work()``
> + and ``for_each_cxl_proto_err()``.
> +============================================ ==============================
> +
> +
> +Limitations and future work
> +===========================
> +
> +* **USP/EP fatal UCE is not classified as CXL.** As described under
> + `Severity policy`_, the AER core never retrieves
> + ``PCI_ERR_UNCOR_STATUS`` in this scenario, so ``is_cxl_error()``
> + cannot tag the event as CXL. The event is handled by the AER path
> + only. Resolving this requires either an AER-core change to attempt
> + a config read with link-validity gating, or a separate CXL-side
> + notification mechanism for upstream-signaled fatal events.
> +* **User-defined status masks** are not yet supported. All CE and UE
> + status bits are reported as they appear in the RAS register.
> +* **Port traversal in cxl_do_recovery()** is not yet implemented; a
> + CXL UE today is reported and acted on at the source device only,
> + not propagated to ancestor ports.
> +* The RCH producer (``aer_cxl_rch.c``) currently lives under
> + ``drivers/pci/pcie/`` for historical reasons. Moving it to
> + ``drivers/cxl/core/ras_rch.c`` is on the roadmap.
> +