* [PATCH v16 01/10] PCI/AER: Introduce AER-CXL Kfifo
2026-03-02 20:36 [PATCH v16 00/10] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
@ 2026-03-02 20:36 ` Terry Bowman
2026-03-09 12:20 ` Jonathan Cameron
2026-03-28 0:28 ` Dan Williams
2026-03-02 20:36 ` [PATCH v16 02/10] PCI/CXL: Update unregistration for AER-CXL and CPER-CXL kfifos Terry Bowman
` (8 subsequent siblings)
9 siblings, 2 replies; 26+ messages in thread
From: Terry Bowman @ 2026-03-02 20:36 UTC (permalink / raw)
To: dave, jonathan.cameron, dave.jiang, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
CXL virtual hierarchy (VH) RAS handling for CXL Port devices will be added
soon. This requires a notification mechanism for the AER driver to share
the AER interrupt with the CXL driver. The notification will be used as an
indication for the CXL drivers to handle and log the CXL RAS errors.
Note, 'CXL protocol error' terminology will refer to CXL VH and not
CXL RCH errors unless specifically noted going forward.
Introduce a new file in the AER driver to handle the CXL protocol errors
named pci/pcie/aer_cxl_vh.c.
Add a kfifo work queue to be used by the AER and CXL drivers. The AER
driver will be the sole kfifo producer adding work and the cxl_core will be
the sole kfifo consumer removing work. Add the boilerplate kfifo support.
Encapsulate the kfifo, RW semaphore, and work pointer in a single structure.
Add CXL work queue handler registration functions in the AER driver. Export
the functions allowing CXL driver to access. Implement registration
functions for the CXL driver to assign or clear the work handler function.
Introduce 'struct cxl_proto_err_work_data' to serve as the kfifo work data.
This will contain a reference to the PCI error source device and the error
severity. This will be used when the work is dequeued by the cxl_core driver.
Introduce cxl_forward_error() to take a given CXL protocol error and add it
to a work structure before pushing onto the AER-CXL kfifo. This function
takes a reference count increment of the PCI device. The kfifo consumer is
responsible for reference decrementing. If there is an error on adding the
work then this function must decrement the reference count.
Synchronize accesses to the work function pointer during registration,
deregistration, enqueue, and dequeue. Further synchronization fixes will
be added in the following patch.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
Changes in v15->v16:
- Add pci_dev_put() and comment in pci_dev_get() (Dan)
- /rw_sema/rwsema/ (Dan)
- Split validation checks in cxl_forward_error() to allow
for meaningful reason in log (Terry)
- Shorten commit title to remove wordiness (Terry)
- Remove bitfield.h include, unnecessary. (Terry)
Changes in v14->v15:
- Moved pci_dev_get() call to this patch (Dave)
Changes in v13 -> v14:
- Replaced workqueue_types.h include with 'struct work_struct'
predeclaration (Bjorn)
- Update error message (Bjorn)
- Reordered 'struct cxl_proto_err_work_data' (Bjorn)
- Remove export of cxl_error_is_native() here (Bjorn)
Changes in v12->v13:
- Added Dave Jiang's review-by
- Update error message (Ben)
Changes in v11->v12:
- None
---
drivers/pci/pcie/Makefile | 1 +
drivers/pci/pcie/aer.c | 15 ++----
drivers/pci/pcie/aer_cxl_vh.c | 87 +++++++++++++++++++++++++++++++++++
drivers/pci/pcie/portdrv.h | 4 ++
include/linux/aer.h | 22 +++++++++
5 files changed, 118 insertions(+), 11 deletions(-)
create mode 100644 drivers/pci/pcie/aer_cxl_vh.c
diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index b0b43a18c304..62d3d3c69a5d 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_PCIEPORTBUS) += pcieportdrv.o bwctrl.o
obj-y += aspm.o
obj-$(CONFIG_PCIEAER) += aer.o err.o tlp.o
obj-$(CONFIG_CXL_RAS) += aer_cxl_rch.o
+obj-$(CONFIG_CXL_RAS) += aer_cxl_vh.o
obj-$(CONFIG_PCIEAER_INJECT) += aer_inject.o
obj-$(CONFIG_PCIE_PME) += pme.o
obj-$(CONFIG_PCIE_DPC) += dpc.o
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d916378bc707..2e996e339d7c 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1152,16 +1152,6 @@ void pci_aer_unmask_internal_errors(struct pci_dev *dev)
*/
EXPORT_SYMBOL_FOR_MODULES(pci_aer_unmask_internal_errors, "cxl_core");
-#ifdef CONFIG_CXL_RAS
-bool is_aer_internal_error(struct aer_err_info *info)
-{
- if (info->severity == AER_CORRECTABLE)
- return info->status & PCI_ERR_COR_INTERNAL;
-
- return info->status & PCI_ERR_UNC_INTN;
-}
-#endif
-
/**
* pci_aer_handle_error - handle logging error into an event log
* @dev: pointer to pci_dev data structure of error source device
@@ -1198,7 +1188,10 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
{
cxl_rch_handle_error(dev, info);
- pci_aer_handle_error(dev, info);
+ if (is_cxl_error(dev, info))
+ cxl_forward_error(dev, info);
+ else
+ pci_aer_handle_error(dev, info);
pci_dev_put(dev);
}
diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
new file mode 100644
index 000000000000..7e2bc1894395
--- /dev/null
+++ b/drivers/pci/pcie/aer_cxl_vh.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
+
+#include <linux/types.h>
+#include <linux/kfifo.h>
+#include <linux/aer.h>
+#include "../pci.h"
+#include "portdrv.h"
+
+#define CXL_ERROR_SOURCES_MAX 128
+
+struct cxl_proto_err_kfifo {
+ struct work_struct *work;
+ struct rw_semaphore rwsema;
+ DECLARE_KFIFO(fifo, struct cxl_proto_err_work_data,
+ CXL_ERROR_SOURCES_MAX);
+};
+
+static struct cxl_proto_err_kfifo cxl_proto_err_kfifo = {
+ .rwsema = __RWSEM_INITIALIZER(cxl_proto_err_kfifo.rwsema)
+};
+
+bool is_aer_internal_error(struct aer_err_info *info)
+{
+ if (info->severity == AER_CORRECTABLE)
+ return info->status & PCI_ERR_COR_INTERNAL;
+
+ return info->status & PCI_ERR_UNC_INTN;
+}
+
+bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
+{
+ if (!info || !info->is_cxl)
+ return false;
+
+ if (pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT)
+ return false;
+
+ return is_aer_internal_error(info);
+}
+
+void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info)
+{
+ struct cxl_proto_err_work_data wd = (struct cxl_proto_err_work_data) {
+ .severity = info->severity,
+ .pdev = pdev
+ };
+
+ guard(rwsem_read)(&cxl_proto_err_kfifo.rwsema);
+
+ if (!cxl_proto_err_kfifo.work) {
+ dev_err_ratelimited(&pdev->dev, "AER-CXL kfifo reader not registered");
+ return;
+ }
+
+ /* The reference is held as long as the pdev is live in the kfifo */
+ pci_dev_get(pdev);
+
+ if (!kfifo_put(&cxl_proto_err_kfifo.fifo, wd)) {
+ dev_err_ratelimited(&pdev->dev, "AER-CXL kfifo add failed");
+ pci_dev_put(pdev);
+ return;
+ }
+
+ schedule_work(cxl_proto_err_kfifo.work);
+}
+
+void cxl_register_proto_err_work(struct work_struct *work)
+{
+ guard(rwsem_write)(&cxl_proto_err_kfifo.rwsema);
+ cxl_proto_err_kfifo.work = work;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_register_proto_err_work, "CXL");
+
+void cxl_unregister_proto_err_work(void)
+{
+ guard(rwsem_write)(&cxl_proto_err_kfifo.rwsema);
+ cxl_proto_err_kfifo.work = NULL;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_unregister_proto_err_work, "CXL");
+
+int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd)
+{
+ guard(rwsem_read)(&cxl_proto_err_kfifo.rwsema);
+ return kfifo_get(&cxl_proto_err_kfifo.fifo, wd);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_proto_err_kfifo_get, "CXL");
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index cc58bf2f2c84..66a6b8099c96 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -130,9 +130,13 @@ struct aer_err_info;
bool is_aer_internal_error(struct aer_err_info *info);
void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info);
void cxl_rch_enable_rcec(struct pci_dev *rcec);
+bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info);
+void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info);
#else
static inline bool is_aer_internal_error(struct aer_err_info *info) { return false; }
static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { }
static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
+static inline bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info) { return false; }
+static inline void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info) { }
#endif /* CONFIG_CXL_RAS */
#endif /* _PORTDRV_H_ */
diff --git a/include/linux/aer.h b/include/linux/aer.h
index df0f5c382286..f351e41dd979 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -53,6 +53,16 @@ struct aer_capability_regs {
u16 uncor_err_source;
};
+/**
+ * struct cxl_proto_err_work_data - Error information used in CXL error handling
+ * @pdev: PCI device detecting the error
+ * @severity: AER severity
+ */
+struct cxl_proto_err_work_data {
+ struct pci_dev *pdev;
+ int severity;
+};
+
#if defined(CONFIG_PCIEAER)
int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
int pcie_aer_is_native(struct pci_dev *dev);
@@ -66,6 +76,18 @@ static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
#endif
+struct work_struct;
+
+#ifdef CONFIG_CXL_RAS
+int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd);
+void cxl_register_proto_err_work(struct work_struct *work);
+void cxl_unregister_proto_err_work(void);
+#else
+static inline int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd) { return 0; }
+static inline void cxl_register_proto_err_work(struct work_struct *work) { }
+static inline void cxl_unregister_proto_err_work(void) { }
+#endif
+
void pci_print_aer(struct pci_dev *dev, int aer_severity,
struct aer_capability_regs *aer);
int cper_severity_to_aer(int cper_severity);
--
2.34.1
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v16 01/10] PCI/AER: Introduce AER-CXL Kfifo
2026-03-02 20:36 ` [PATCH v16 01/10] PCI/AER: Introduce AER-CXL Kfifo Terry Bowman
@ 2026-03-09 12:20 ` Jonathan Cameron
2026-03-28 0:28 ` Dan Williams
1 sibling, 0 replies; 26+ messages in thread
From: Jonathan Cameron @ 2026-03-09 12:20 UTC (permalink / raw)
To: Terry Bowman
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On Mon, 2 Mar 2026 14:36:39 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> CXL virtual hierarchy (VH) RAS handling for CXL Port devices will be added
> soon. This requires a notification mechanism for the AER driver to share
> the AER interrupt with the CXL driver. The notification will be used as an
> indication for the CXL drivers to handle and log the CXL RAS errors.
>
> Note, 'CXL protocol error' terminology will refer to CXL VH and not
> CXL RCH errors unless specifically noted going forward.
>
> Introduce a new file in the AER driver to handle the CXL protocol errors
> named pci/pcie/aer_cxl_vh.c.
>
> Add a kfifo work queue to be used by the AER and CXL drivers. The AER
> driver will be the sole kfifo producer adding work and the cxl_core will be
> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
> Encapsulate the kfifo, RW semaphore, and work pointer in a single structure.
>
> Add CXL work queue handler registration functions in the AER driver. Export
> the functions allowing CXL driver to access. Implement registration
> functions for the CXL driver to assign or clear the work handler function.
>
> Introduce 'struct cxl_proto_err_work_data' to serve as the kfifo work data.
> This will contain a reference to the PCI error source device and the error
> severity. This will be used when the work is dequeued by the cxl_core driver.
>
> Introduce cxl_forward_error() to take a given CXL protocol error and add it
> to a work structure before pushing onto the AER-CXL kfifo. This function
> takes a reference count increment of the PCI device. The kfifo consumer is
> responsible for reference decrementing. If there is an error on adding the
> work then this function must decrement the reference count.
>
> Synchronize accesses to the work function pointer during registration,
> deregistration, enqueue, and dequeue. Further synchronization fixes will
> be added in the following patch.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
>
Hi Terry,
Just some include related comments. There are a few missing that
should be included from the c file.
Thanks,
Jonathan
> diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
> new file mode 100644
> index 000000000000..7e2bc1894395
> --- /dev/null
> +++ b/drivers/pci/pcie/aer_cxl_vh.c
> @@ -0,0 +1,87 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
> +
> +#include <linux/types.h>
> +#include <linux/kfifo.h>
> +#include <linux/aer.h>
What motivated the decisions on which includes to have here?
Generally follow include what you use, with exceptions for some
cases where there is something inherent that means a header will
always include another one.
I'd definitely expect rwsem.h and cleanup.h (for guard()) for instance.
Also workqueue.h for schedule_work()
> +#include "../pci.h"
> +#include "portdrv.h"
> +
> +#define CXL_ERROR_SOURCES_MAX 128
> +
> +struct cxl_proto_err_kfifo {
> + struct work_struct *work;
> + struct rw_semaphore rwsema;
> + DECLARE_KFIFO(fifo, struct cxl_proto_err_work_data,
> + CXL_ERROR_SOURCES_MAX);
> +};
> +
> +static struct cxl_proto_err_kfifo cxl_proto_err_kfifo = {
> + .rwsema = __RWSEM_INITIALIZER(cxl_proto_err_kfifo.rwsema)
> +};
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v16 01/10] PCI/AER: Introduce AER-CXL Kfifo
2026-03-02 20:36 ` [PATCH v16 01/10] PCI/AER: Introduce AER-CXL Kfifo Terry Bowman
2026-03-09 12:20 ` Jonathan Cameron
@ 2026-03-28 0:28 ` Dan Williams
1 sibling, 0 replies; 26+ messages in thread
From: Dan Williams @ 2026-03-28 0:28 UTC (permalink / raw)
To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
Terry Bowman wrote:
> CXL virtual hierarchy (VH) RAS handling for CXL Port devices will be added
> soon. This requires a notification mechanism for the AER driver to share
> the AER interrupt with the CXL driver. The notification will be used as an
> indication for the CXL drivers to handle and log the CXL RAS errors.
>
> Note, 'CXL protocol error' terminology will refer to CXL VH and not
> CXL RCH errors unless specifically noted going forward.
>
> Introduce a new file in the AER driver to handle the CXL protocol errors
> named pci/pcie/aer_cxl_vh.c.
>
> Add a kfifo work queue to be used by the AER and CXL drivers. The AER
> driver will be the sole kfifo producer adding work and the cxl_core will be
> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
> Encapsulate the kfifo, RW semaphore, and work pointer in a single structure.
>
> Add CXL work queue handler registration functions in the AER driver. Export
> the functions allowing CXL driver to access. Implement registration
> functions for the CXL driver to assign or clear the work handler function.
>
> Introduce 'struct cxl_proto_err_work_data' to serve as the kfifo work data.
> This will contain a reference to the PCI error source device and the error
> severity. This will be used when the work is dequeued by the cxl_core driver.
>
> Introduce cxl_forward_error() to take a given CXL protocol error and add it
> to a work structure before pushing onto the AER-CXL kfifo. This function
> takes a reference count increment of the PCI device. The kfifo consumer is
> responsible for reference decrementing. If there is an error on adding the
> work then this function must decrement the reference count.
>
> Synchronize accesses to the work function pointer during registration,
> deregistration, enqueue, and dequeue. Further synchronization fixes will
> be added in the following patch.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
>
> ---
>
> Changes in v15->v16:
> - Add pci_dev_put() and comment in pci_dev_get() (Dan)
> - /rw_sema/rwsema/ (Dan)
To be clear I asked for s/rw_sema/rwsem/, easy enough to fix up.
[..]
> drivers/pci/pcie/aer_cxl_vh.c | 87 +++++++++++++++++++++++++++++++++++
> create mode 100644 drivers/pci/pcie/aer_cxl_vh.c
Bjorn, do you want this additional burden to fall on PCI core reviewers,
or perhaps make this change to the "COMPUTE EXPRESS LINK (CXL)" entry?
diff --git a/MAINTAINERS b/MAINTAINERS
index 61bf550fd37c..1b46ac52839d 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -6438,6 +6438,8 @@ S: Maintained
F: Documentation/driver-api/cxl
F: Documentation/userspace-api/fwctl/fwctl-cxl.rst
F: drivers/cxl/
+F: drivers/pci/pcie/aer_cxl_rch.c
+F: drivers/pci/pcie/aer_cxl_vh.c
F: include/cxl/
F: include/uapi/linux/cxl_mem.h
F: tools/testing/cxl/
[..]
> diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
> new file mode 100644
> index 000000000000..7e2bc1894395
> --- /dev/null
> +++ b/drivers/pci/pcie/aer_cxl_vh.c
[..]
> +void cxl_register_proto_err_work(struct work_struct *work)
> +{
> + guard(rwsem_write)(&cxl_proto_err_kfifo.rwsema);
> + cxl_proto_err_kfifo.work = work;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_register_proto_err_work, "CXL");
Probably more appropriate to make all of these exports:
EXPORT_SYMBOL_FOR_MODULES(..., "cxl_core");
> +void cxl_unregister_proto_err_work(void)
> +{
> + guard(rwsem_write)(&cxl_proto_err_kfifo.rwsema);
> + cxl_proto_err_kfifo.work = NULL;
Where is the work cancellation for this?
Oh, patch2 has it, that should really be in this patch. I would fold in
that fix... but it needs a bit more, see below.
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_unregister_proto_err_work, "CXL");
> +
> +int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd)
> +{
> + guard(rwsem_read)(&cxl_proto_err_kfifo.rwsema);
> + return kfifo_get(&cxl_proto_err_kfifo.fifo, wd);
I just realized that this reacquires the semaphore on every invocation,
and leaks stranded references.
How about something like below, which I think also addresses Jonathan's
concern about the awkwardness of the consumer needing to manage the
producer's refcounting. With this change all the refcounting is internal
to the producer, and it is properly cleaned up so as not to leave PCI
devices with dangling refcounts if someone unloads the CXL core. The
rule is that errors are dropped on the floor when the work handler is
missing, and any errors that race unregistration are also dropped on the
floor.
diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
index 348374859ee4..cc4e443511d0 100644
--- a/drivers/pci/pcie/aer_cxl_vh.c
+++ b/drivers/pci/pcie/aer_cxl_vh.c
@@ -72,16 +72,43 @@ void cxl_register_proto_err_work(struct work_struct *work)
}
EXPORT_SYMBOL_FOR_MODULES(cxl_register_proto_err_work, "cxl_core");
-void cxl_unregister_proto_err_work(void)
+static struct work_struct *cancel_cxl_proto_err(void)
{
+ struct work_struct *work;
+ struct cxl_proto_err_work_data wd;
+
guard(rwsem_write)(&cxl_proto_err_kfifo.rwsem);
+ work = cxl_proto_err_kfifo.work;
cxl_proto_err_kfifo.work = NULL;
+ while (kfifo_get(&cxl_proto_err_kfifo.fifo, &wd)) {
+ dev_err_ratelimited(&wd.pdev->dev,
+ "AER-CXL error report canceled\n");
+ pci_dev_put(wd.pdev);
+ }
+ return work;
+}
+
+void cxl_unregister_proto_err_work(void)
+{
+ struct work_struct *work = cancel_cxl_proto_err();
+
+ if (work)
+ cancel_work_sync(work);
}
EXPORT_SYMBOL_FOR_MODULES(cxl_unregister_proto_err_work, "cxl_core");
-int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd)
+int for_each_cxl_proto_err(struct cxl_proto_err_work_data *wd, int (*fn)(struct cxl_proto_err_work_data *))
{
+ int rc;
+
guard(rwsem_read)(&cxl_proto_err_kfifo.rwsem);
- return kfifo_get(&cxl_proto_err_kfifo.fifo, wd);
+ while (kfifo_get(&cxl_proto_err_kfifo.fifo, wd)) {
+ rc = fn(wd);
+ pci_dev_put(wd->pdev);
+ if (rc)
+ return rc;
+ }
+
+ return 0;
}
-EXPORT_SYMBOL_FOR_MODULES(cxl_proto_err_kfifo_get, "cxl_core");
+EXPORT_SYMBOL_FOR_MODULES(for_each_cxl_proto_err, "cxl_core");
diff --git a/include/linux/aer.h b/include/linux/aer.h
index f351e41dd979..8d60fd97ed67 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -79,12 +79,19 @@ static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
struct work_struct;
#ifdef CONFIG_CXL_RAS
-int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd);
void cxl_register_proto_err_work(struct work_struct *work);
+int for_each_cxl_proto_err(struct cxl_proto_err_work_data *wd,
+ int (*fn)(struct cxl_proto_err_work_data *));
void cxl_unregister_proto_err_work(void);
#else
static inline int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd) { return 0; }
static inline void cxl_register_proto_err_work(struct work_struct *work) { }
+static inline int
+for_each_cxl_proto_err(struct cxl_proto_err_work_data *wd,
+ int (*fn)(struct cxl_proto_err_work_data *))
+{
+ return 0;
+}
static inline void cxl_unregister_proto_err_work(void) { }
#endif
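
The ownership rule in the suggestion above can be modeled outside the
kernel. What follows is a minimal userspace sketch, not kernel code: every
name in it (fake_dev, fake_fifo, fifo_forward(), fifo_for_each(),
fifo_unregister(), count_handled()) is hypothetical, a plain int stands in
for the pci_dev_get()/pci_dev_put() refcount, and locking is left out. It
only illustrates that the queue code takes and drops every reference
itself, and that unregistration drains the queue so nothing is stranded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev: only the refcount matters. */
struct fake_dev {
	int refs;
};

#define FIFO_DEPTH 4	/* power of two, as a kfifo requires */

struct fake_fifo {
	struct fake_dev *slot[FIFO_DEPTH];
	unsigned int in, out;
	bool handler_registered;
};

static int handled;	/* number of entries seen by the example handler */

/* Example consumer callback: just counts what it is handed. */
static int count_handled(struct fake_dev *dev)
{
	(void)dev;
	handled++;
	return 0;
}

/* Producer: drop the error if nobody listens or the fifo is full. */
static bool fifo_forward(struct fake_fifo *f, struct fake_dev *dev)
{
	if (!f->handler_registered)
		return false;
	if (f->in - f->out == FIFO_DEPTH)
		return false;
	dev->refs++;	/* reference is held while dev sits in the fifo */
	f->slot[f->in++ % FIFO_DEPTH] = dev;
	return true;
}

static bool fifo_get(struct fake_fifo *f, struct fake_dev **dev)
{
	if (f->in == f->out)
		return false;
	*dev = f->slot[f->out++ % FIFO_DEPTH];
	return true;
}

/* Iterator: the queue code, not the callback, drops each reference. */
static int fifo_for_each(struct fake_fifo *f, int (*fn)(struct fake_dev *))
{
	struct fake_dev *dev;
	int rc;

	while (fifo_get(f, &dev)) {
		rc = fn(dev);
		dev->refs--;
		if (rc)
			return rc;
	}
	return 0;
}

/* Unregister: clear the handler, then drain and release what is left. */
static void fifo_unregister(struct fake_fifo *f)
{
	struct fake_dev *dev;

	f->handler_registered = false;
	while (fifo_get(f, &dev))
		dev->refs--;
}
```

With this shape the callback never touches the refcount, which is the
point of moving the iteration into the producer's file.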
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v16 02/10] PCI/CXL: Update unregistration for AER-CXL and CPER-CXL kfifos
2026-03-02 20:36 [PATCH v16 00/10] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
2026-03-02 20:36 ` [PATCH v16 01/10] PCI/AER: Introduce AER-CXL Kfifo Terry Bowman
@ 2026-03-02 20:36 ` Terry Bowman
2026-03-09 12:27 ` Jonathan Cameron
2026-03-09 18:30 ` Dave Jiang
2026-03-02 20:36 ` [PATCH v16 03/10] cxl: Update CXL Endpoint tracing Terry Bowman
` (7 subsequent siblings)
9 siblings, 2 replies; 26+ messages in thread
From: Terry Bowman @ 2026-03-02 20:36 UTC (permalink / raw)
To: dave, jonathan.cameron, dave.jiang, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
The current AER-CXL kfifo unregistration does not cancel pending work after
clearing the work function pointer. In addition, cancel_work_sync() is
called on behalf of the CPER-CXL kfifo in cxl_ras_exit() and should be
moved into the kfifo deregistration function.
Add logic to cancel the AER-CXL kfifo's pending work in
cxl_unregister_proto_err_work().
Move the CPER-CXL kfifo cancel call from cxl_ras_exit() to
cxl_cper_unregister_prot_err_work(). Release the CPER-CXL spinlock
before calling cancel_work_sync() to avoid deadlock.
In both kfifo unregistration cases, add the necessary synchronization
to enforce proper lock ordering: protect pointer updates under the
lock, and clear the work pointer, then cancel any outstanding work
after the lock is released.
Link: https://lore.kernel.org/linux-cxl/6982ca54e094b_55fa1005@dwillia2-mobl4.notmuch/
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Assisted-by: Azure:gtp-4.1-nano-key
----
Changes in v16:
- New commit
---
drivers/acpi/apei/ghes.c | 6 +++++-
drivers/cxl/core/ras.c | 1 -
drivers/pci/pcie/aer_cxl_vh.c | 9 ++++++++-
3 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 8acd2742bb27..de935e0e1dcf 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -776,8 +776,12 @@ int cxl_cper_unregister_prot_err_work(struct work_struct *work)
if (cxl_cper_prot_err_work != work)
return -EINVAL;
- guard(spinlock)(&cxl_cper_prot_err_work_lock);
+ spin_lock(&cxl_cper_prot_err_work_lock);
cxl_cper_prot_err_work = NULL;
+ spin_unlock(&cxl_cper_prot_err_work_lock);
+
+ cancel_work_sync(work);
+
return 0;
}
EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_prot_err_work, "CXL");
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 006c6ffc2f56..949d8c8ecdfe 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -124,7 +124,6 @@ int cxl_ras_init(void)
void cxl_ras_exit(void)
{
cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
- cancel_work_sync(&cxl_cper_prot_err_work);
}
static void cxl_dport_map_ras(struct cxl_dport *dport)
diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
index 7e2bc1894395..ebca1112652a 100644
--- a/drivers/pci/pcie/aer_cxl_vh.c
+++ b/drivers/pci/pcie/aer_cxl_vh.c
@@ -74,8 +74,15 @@ EXPORT_SYMBOL_NS_GPL(cxl_register_proto_err_work, "CXL");
void cxl_unregister_proto_err_work(void)
{
- guard(rwsem_write)(&cxl_proto_err_kfifo.rwsema);
+ struct work_struct *work;
+
+ down_write(&cxl_proto_err_kfifo.rwsema);
+ work = cxl_proto_err_kfifo.work;
cxl_proto_err_kfifo.work = NULL;
+ up_write(&cxl_proto_err_kfifo.rwsema);
+
+ if (work)
+ cancel_work_sync(work);
}
EXPORT_SYMBOL_NS_GPL(cxl_unregister_proto_err_work, "CXL");
--
2.34.1
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v16 02/10] PCI/CXL: Update unregistration for AER-CXL and CPER-CXL kfifos
2026-03-02 20:36 ` [PATCH v16 02/10] PCI/CXL: Update unregistration for AER-CXL and CPER-CXL kfifos Terry Bowman
@ 2026-03-09 12:27 ` Jonathan Cameron
2026-03-11 15:03 ` Bowman, Terry
2026-03-09 18:30 ` Dave Jiang
1 sibling, 1 reply; 26+ messages in thread
From: Jonathan Cameron @ 2026-03-09 12:27 UTC (permalink / raw)
To: Terry Bowman
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On Mon, 2 Mar 2026 14:36:40 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> The current AER-CXL kfifo unregistration does not cancel pending work after
> clearing the work function pointer. In addition, cancel_work_sync() is
> called on behalf of the CPER-CXL kfifo in cxl_ras_exit() and should be
> moved into the kfifo deregistration function.
>
> Add logic to cancel the AER-CXL kfifo's pending work in
> cxl_unregister_proto_err_work().
>
> Move the CPER-CXL kfifo cancel call from cxl_ras_exit() to
> cxl_cper_unregister_prot_err_work(). Release the CPER-CXL spinlock
> before calling cancel_work_sync() to avoid deadlock.
>
> In both kfifo unregistration cases, add the necessary synchronization
> to enforce proper lock ordering: protect pointer updates under the
> lock, and clear the work pointer, then cancel any outstanding work
> after the lock is released.
From that description, this feels like it's walking the edge of
being a fix? If so should call it out as such. If not, make it
clear there isn't a known bug being fixed up.
Otherwise seems sensible to me
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>
> Link: https://lore.kernel.org/linux-cxl/6982ca54e094b_55fa1005@dwillia2-mobl4.notmuch/
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Assisted-by: Azure:gtp-4.1-nano-key
>
> ----
>
> Changes in v16:
> - New commit
> ---
> drivers/acpi/apei/ghes.c | 6 +++++-
> drivers/cxl/core/ras.c | 1 -
> drivers/pci/pcie/aer_cxl_vh.c | 9 ++++++++-
> 3 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 8acd2742bb27..de935e0e1dcf 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -776,8 +776,12 @@ int cxl_cper_unregister_prot_err_work(struct work_struct *work)
> if (cxl_cper_prot_err_work != work)
> return -EINVAL;
>
> - guard(spinlock)(&cxl_cper_prot_err_work_lock);
> + spin_lock(&cxl_cper_prot_err_work_lock);
> cxl_cper_prot_err_work = NULL;
> + spin_unlock(&cxl_cper_prot_err_work_lock);
> +
> + cancel_work_sync(work);
> +
> return 0;
> }
> EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_prot_err_work, "CXL");
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 006c6ffc2f56..949d8c8ecdfe 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -124,7 +124,6 @@ int cxl_ras_init(void)
> void cxl_ras_exit(void)
> {
> cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
> - cancel_work_sync(&cxl_cper_prot_err_work);
> }
>
> static void cxl_dport_map_ras(struct cxl_dport *dport)
> diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
> index 7e2bc1894395..ebca1112652a 100644
> --- a/drivers/pci/pcie/aer_cxl_vh.c
> +++ b/drivers/pci/pcie/aer_cxl_vh.c
> @@ -74,8 +74,15 @@ EXPORT_SYMBOL_NS_GPL(cxl_register_proto_err_work, "CXL");
>
> void cxl_unregister_proto_err_work(void)
> {
> - guard(rwsem_write)(&cxl_proto_err_kfifo.rwsema);
> + struct work_struct *work;
> +
> + down_write(&cxl_proto_err_kfifo.rwsema);
> + work = cxl_proto_err_kfifo.work;
> cxl_proto_err_kfifo.work = NULL;
> + up_write(&cxl_proto_err_kfifo.rwsema);
> +
> + if (work)
> + cancel_work_sync(work);
> }
> EXPORT_SYMBOL_NS_GPL(cxl_unregister_proto_err_work, "CXL");
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v16 02/10] PCI/CXL: Update unregistration for AER-CXL and CPER-CXL kfifos
2026-03-09 12:27 ` Jonathan Cameron
@ 2026-03-11 15:03 ` Bowman, Terry
0 siblings, 0 replies; 26+ messages in thread
From: Bowman, Terry @ 2026-03-11 15:03 UTC (permalink / raw)
To: Jonathan Cameron
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On 3/9/2026 7:27 AM, Jonathan Cameron wrote:
> On Mon, 2 Mar 2026 14:36:40 -0600
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The current AER-CXL kfifo unregistration does not cancel pending work after
>> clearing the work function pointer. In addition, cancel_work_sync() is
>> called on behalf of the CPER-CXL kfifo in cxl_ras_exit() and should be
>> moved into the kfifo deregistration function.
>>
>> Add logic to cancel the AER-CXL kfifo's pending work in
>> cxl_unregister_proto_err_work().
>>
>> Move the CPER-CXL kfifo cancel call from cxl_ras_exit() to
>> cxl_cper_unregister_prot_err_work(). Release the CPER-CXL spinlock
>> before calling cancel_work_sync() to avoid deadlock.
>>
>> In both kfifo unregistration cases, add the necessary synchronization
>> to enforce proper lock ordering: protect pointer updates under the
>> lock, and clear the work pointer, then cancel any outstanding work
>> after the lock is released.
>
> From that description, this feels like it's walking the edge of
> being a fix? If so should call it out as such. If not, make it
> clear there isn't a known bug being fixed up.
>
> Otherwise seems sensible to me
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>
I'll add an explanation detailing that a call to cancel_work_sync() will
prevent the possibility of using the work routine after it's unregistered.
Thanks for reviewing.
- Terry
>>
>> Link: https://lore.kernel.org/linux-cxl/6982ca54e094b_55fa1005@dwillia2-mobl4.notmuch/
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Assisted-by: Azure:gtp-4.1-nano-key
>>
>> ----
>>
>> Changes in v16:
>> - New commit
>> ---
>> drivers/acpi/apei/ghes.c | 6 +++++-
>> drivers/cxl/core/ras.c | 1 -
>> drivers/pci/pcie/aer_cxl_vh.c | 9 ++++++++-
>> 3 files changed, 13 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
>> index 8acd2742bb27..de935e0e1dcf 100644
>> --- a/drivers/acpi/apei/ghes.c
>> +++ b/drivers/acpi/apei/ghes.c
>> @@ -776,8 +776,12 @@ int cxl_cper_unregister_prot_err_work(struct work_struct *work)
>> if (cxl_cper_prot_err_work != work)
>> return -EINVAL;
>>
>> - guard(spinlock)(&cxl_cper_prot_err_work_lock);
>> + spin_lock(&cxl_cper_prot_err_work_lock);
>> cxl_cper_prot_err_work = NULL;
>> + spin_unlock(&cxl_cper_prot_err_work_lock);
>> +
>> + cancel_work_sync(work);
>> +
>> return 0;
>> }
>> EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_prot_err_work, "CXL");
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index 006c6ffc2f56..949d8c8ecdfe 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
>> @@ -124,7 +124,6 @@ int cxl_ras_init(void)
>> void cxl_ras_exit(void)
>> {
>> cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
>> - cancel_work_sync(&cxl_cper_prot_err_work);
>> }
>>
>> static void cxl_dport_map_ras(struct cxl_dport *dport)
>> diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
>> index 7e2bc1894395..ebca1112652a 100644
>> --- a/drivers/pci/pcie/aer_cxl_vh.c
>> +++ b/drivers/pci/pcie/aer_cxl_vh.c
>> @@ -74,8 +74,15 @@ EXPORT_SYMBOL_NS_GPL(cxl_register_proto_err_work, "CXL");
>>
>> void cxl_unregister_proto_err_work(void)
>> {
>> - guard(rwsem_write)(&cxl_proto_err_kfifo.rwsema);
>> + struct work_struct *work;
>> +
>> + down_write(&cxl_proto_err_kfifo.rwsema);
>> + work = cxl_proto_err_kfifo.work;
>> cxl_proto_err_kfifo.work = NULL;
>> + up_write(&cxl_proto_err_kfifo.rwsema);
>> +
>> + if (work)
>> + cancel_work_sync(work);
>> }
>> EXPORT_SYMBOL_NS_GPL(cxl_unregister_proto_err_work, "CXL");
>>
>
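The ordering discussed in this patch (clear the registered pointer under the lock, then wait for outstanding work only after the lock is released) can be sketched in userspace. This is only an illustration under stated assumptions, not the kernel code: every demo_* name is hypothetical, a pthread mutex stands in for the spinlock/rwsem, and pthread_join() stands in for cancel_work_sync().

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a registered work item. */
struct demo_work {
	pthread_t thread;
	bool started;	/* a worker thread was created for this item */
	int runs;	/* times the handler ran; written under demo_lock */
};

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static struct demo_work *demo_registered;	/* protected by demo_lock */

/* Handler body: it takes demo_lock, so nobody may wait for it while holding the lock. */
static void *demo_work_fn(void *arg)
{
	struct demo_work *work = arg;

	pthread_mutex_lock(&demo_lock);
	work->runs++;
	pthread_mutex_unlock(&demo_lock);
	return NULL;
}

static int demo_register(struct demo_work *work)
{
	int ret = -1;

	pthread_mutex_lock(&demo_lock);
	if (!demo_registered) {
		demo_registered = work;
		ret = 0;
	}
	pthread_mutex_unlock(&demo_lock);
	return ret;
}

/* Producer side: start the handler only while a work item is registered. */
static int demo_schedule(void)
{
	struct demo_work *work;

	pthread_mutex_lock(&demo_lock);
	work = demo_registered;
	pthread_mutex_unlock(&demo_lock);

	if (!work || work->started)
		return -1;
	if (pthread_create(&work->thread, NULL, demo_work_fn, work))
		return -1;
	work->started = true;
	return 0;
}

static int demo_unregister(void)
{
	struct demo_work *work;

	/* Step 1: snapshot and clear the pointer under the lock. */
	pthread_mutex_lock(&demo_lock);
	work = demo_registered;
	demo_registered = NULL;
	pthread_mutex_unlock(&demo_lock);

	if (!work)
		return -1;

	/* Step 2: wait only after dropping the lock; joining under it could deadlock. */
	if (work->started) {
		pthread_join(work->thread, NULL);
		work->started = false;
	}
	return 0;
}
```

If demo_unregister() joined the thread while still holding demo_lock, the handler could block forever on the same lock. The patch avoids the same shape of problem by moving cancel_work_sync() after the unlock.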
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v16 02/10] PCI/CXL: Update unregistration for AER-CXL and CPER-CXL kfifos
2026-03-02 20:36 ` [PATCH v16 02/10] PCI/CXL: Update unregistration for AER-CXL and CPER-CXL kfifos Terry Bowman
2026-03-09 12:27 ` Jonathan Cameron
@ 2026-03-09 18:30 ` Dave Jiang
1 sibling, 0 replies; 26+ messages in thread
From: Dave Jiang @ 2026-03-09 18:30 UTC (permalink / raw)
To: Terry Bowman, dave, jonathan.cameron, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci
On 3/2/26 1:36 PM, Terry Bowman wrote:
> The current AER-CXL kfifo unregistration does not cancel pending work after
> clearing the work function pointer. In addition, cancel_work_sync() is
> called on behalf of the CPER-CXL kfifo in cxl_ras_exit() and should be
> moved into the kfifo deregistration function.
>
> Add logic to cancel the AER-CXL kfifo's pending work in
> cxl_unregister_proto_err_work().
>
> Move the CPER-CXL kfifo cancel call from cxl_ras_exit() to
> cxl_cper_unregister_prot_err_work(). Release the CPER-CXL spinlock
> before calling cancel_work_sync() to avoid deadlock.
>
> In both kfifo unregistration cases, add the necessary synchronization
> to enforce proper lock ordering: protect pointer updates under the
> lock, and clear the work pointer, then cancel any outstanding work
> after the lock is released.
>
> Link: https://lore.kernel.org/linux-cxl/6982ca54e094b_55fa1005@dwillia2-mobl4.notmuch/
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Assisted-by: Azure:gtp-4.1-nano-key
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>
> ----
>
> Changes in v16:
> - New commit
> ---
> drivers/acpi/apei/ghes.c | 6 +++++-
> drivers/cxl/core/ras.c | 1 -
> drivers/pci/pcie/aer_cxl_vh.c | 9 ++++++++-
> 3 files changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 8acd2742bb27..de935e0e1dcf 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -776,8 +776,12 @@ int cxl_cper_unregister_prot_err_work(struct work_struct *work)
> if (cxl_cper_prot_err_work != work)
> return -EINVAL;
>
> - guard(spinlock)(&cxl_cper_prot_err_work_lock);
> + spin_lock(&cxl_cper_prot_err_work_lock);
> cxl_cper_prot_err_work = NULL;
> + spin_unlock(&cxl_cper_prot_err_work_lock);
> +
> + cancel_work_sync(work);
> +
> return 0;
> }
> EXPORT_SYMBOL_NS_GPL(cxl_cper_unregister_prot_err_work, "CXL");
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 006c6ffc2f56..949d8c8ecdfe 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -124,7 +124,6 @@ int cxl_ras_init(void)
> void cxl_ras_exit(void)
> {
> cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
> - cancel_work_sync(&cxl_cper_prot_err_work);
> }
>
> static void cxl_dport_map_ras(struct cxl_dport *dport)
> diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
> index 7e2bc1894395..ebca1112652a 100644
> --- a/drivers/pci/pcie/aer_cxl_vh.c
> +++ b/drivers/pci/pcie/aer_cxl_vh.c
> @@ -74,8 +74,15 @@ EXPORT_SYMBOL_NS_GPL(cxl_register_proto_err_work, "CXL");
>
> void cxl_unregister_proto_err_work(void)
> {
> - guard(rwsem_write)(&cxl_proto_err_kfifo.rwsema);
> + struct work_struct *work;
> +
> + down_write(&cxl_proto_err_kfifo.rwsema);
> + work = cxl_proto_err_kfifo.work;
> cxl_proto_err_kfifo.work = NULL;
> + up_write(&cxl_proto_err_kfifo.rwsema);
> +
> + if (work)
> + cancel_work_sync(work);
> }
> EXPORT_SYMBOL_NS_GPL(cxl_unregister_proto_err_work, "CXL");
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v16 03/10] cxl: Update CXL Endpoint tracing
2026-03-02 20:36 [PATCH v16 00/10] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
2026-03-02 20:36 ` [PATCH v16 01/10] PCI/AER: Introduce AER-CXL Kfifo Terry Bowman
2026-03-02 20:36 ` [PATCH v16 02/10] PCI/CXL: Update unregistration for AER-CXL and CPER-CXL kfifos Terry Bowman
@ 2026-03-02 20:36 ` Terry Bowman
2026-03-02 20:36 ` [PATCH v16 04/10] PCI/ERR: Introduce PCI_ERS_RESULT_PANIC Terry Bowman
` (6 subsequent siblings)
9 siblings, 0 replies; 26+ messages in thread
From: Terry Bowman @ 2026-03-02 20:36 UTC (permalink / raw)
To: dave, jonathan.cameron, dave.jiang, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
CXL protocol error handling will soon be expanded to include CXL Port
support along with the existing Endpoint support. Two updates are needed first:
- Update calling interfaces to use 'struct device*'
- Log endpoint serial number
Add serial number parameter to the trace logging. This is used for EPs
and 0 is provided for CXL port devices without a serial number. Add the
serial number at the end to preserve compatibility with libtraceevent
parsing of the parameters.
Leave the correctable and uncorrectable trace routines' TP_STRUCT__entry()
unchanged with respect to member data types and order.
Below is output of correctable and uncorrectable protocol error logging.
CXL Root Port and CXL Endpoint examples are included below.
The tracing support for CXL Port devices and Endpoints is already implemented.
Update cxl_handle_ras() & cxl_handle_cor_ras() to also call the CXL trace
routines.
Root Port:
cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c serial: 0 status='CRC Threshold Hit'
cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c serial: 0 status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
Endpoint:
cxl_aer_correctable_error: memdev=mem3 host=0000:0f:00.0 serial=0 status='CRC Threshold Hit'
cxl_aer_uncorrectable_error: memdev=mem3 host=0000:0f:00.0 serial: 0 status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
Changes in v15->v16:
- Add Dan's review-by
- Incorporate Dan's comment into commit message:
"Add the serial number at the end to preserve compatibility with
libtraceevent parsing of the parameters."
Changes in v14->v15:
- Update commit message.
- Moved cxl_handle_ras/cxl_handle_cor_ras() changes to future patch (terry)
Changes in v13->v14:
- Update commit headline (Bjorn)
Changes in v12->v13:
- Added Dave Jiang's review-by
Changes in v11 -> v12:
- Correct parameters to call trace_cxl_aer_correctable_error()
- Add reviewed-by for Jonathan and Shiju
Changes in v10->v11:
- Updated CE and UCE trace routines to maintain consistent TP_Struct ABI
and unchanged TP_printk() logging.
---
drivers/cxl/core/core.h | 11 +++++++----
drivers/cxl/core/ras.c | 23 ++++++++++++++---------
drivers/cxl/core/ras_rch.c | 6 ++++--
drivers/cxl/core/trace.h | 21 +++++++++++----------
4 files changed, 36 insertions(+), 25 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 5b0570df0fd9..5051800882c5 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -181,8 +181,9 @@ static inline struct device *dport_to_host(struct cxl_dport *dport)
#ifdef CONFIG_CXL_RAS
int cxl_ras_init(void);
void cxl_ras_exit(void);
-bool cxl_handle_ras(struct device *dev, void __iomem *ras_base);
-void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base);
+bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base);
+void cxl_handle_cor_ras(struct device *dev, u64 serial,
+ void __iomem *ras_base);
void cxl_dport_map_rch_aer(struct cxl_dport *dport);
void cxl_disable_rch_root_ints(struct cxl_dport *dport);
void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds);
@@ -193,11 +194,13 @@ static inline int cxl_ras_init(void)
return 0;
}
static inline void cxl_ras_exit(void) { }
-static inline bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
+static inline bool cxl_handle_ras(struct device *dev, u64 serial,
+ void __iomem *ras_base)
{
return false;
}
-static inline void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base) { }
+static inline void cxl_handle_cor_ras(struct device *dev, u64 serial,
+ void __iomem *ras_base) { }
static inline void cxl_dport_map_rch_aer(struct cxl_dport *dport) { }
static inline void cxl_disable_rch_root_ints(struct cxl_dport *dport) { }
static inline void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds) { }
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 949d8c8ecdfe..44791f6d7d50 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -37,7 +37,8 @@ static void cxl_cper_trace_corr_prot_err(struct cxl_memdev *cxlmd,
{
u32 status = ras_cap.cor_status & ~ras_cap.cor_mask;
- trace_cxl_aer_correctable_error(cxlmd, status);
+ trace_cxl_aer_correctable_error(&cxlmd->dev, status,
+ cxlmd->cxlds->serial);
}
static void
@@ -45,6 +46,7 @@ cxl_cper_trace_uncorr_prot_err(struct cxl_memdev *cxlmd,
struct cxl_ras_capability_regs ras_cap)
{
u32 status = ras_cap.uncor_status & ~ras_cap.uncor_mask;
+ struct cxl_dev_state *cxlds = cxlmd->cxlds;
u32 fe;
if (hweight32(status) > 1)
@@ -53,8 +55,9 @@ cxl_cper_trace_uncorr_prot_err(struct cxl_memdev *cxlmd,
else
fe = status;
- trace_cxl_aer_uncorrectable_error(cxlmd, status, fe,
- ras_cap.header_log);
+ trace_cxl_aer_uncorrectable_error(&cxlmd->dev, status, fe,
+ ras_cap.header_log,
+ cxlds->serial);
}
static int match_memdev_by_parent(struct device *dev, const void *uport)
@@ -182,7 +185,7 @@ void devm_cxl_port_ras_setup(struct cxl_port *port)
}
EXPORT_SYMBOL_NS_GPL(devm_cxl_port_ras_setup, "CXL");
-void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base)
+void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
{
void __iomem *addr;
u32 status;
@@ -194,7 +197,7 @@ void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base)
status = readl(addr);
if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
- trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
+ trace_cxl_aer_correctable_error(dev, status, serial);
}
}
@@ -219,7 +222,7 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
* Log the state of the RAS status registers and prepare them to log the
* next error status. Return 1 if reset needed.
*/
-bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
+bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
{
u32 hl[CXL_HEADERLOG_SIZE_U32];
void __iomem *addr;
@@ -246,7 +249,7 @@ bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
}
header_log_copy(ras_base, hl);
- trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
+ trace_cxl_aer_uncorrectable_error(dev, status, fe, hl, serial);
writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
return true;
@@ -269,7 +272,8 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
if (cxlds->rcd)
cxl_handle_rdport_errors(cxlds);
- cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlmd->endpoint->regs.ras);
+ cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->serial,
+ cxlmd->endpoint->regs.ras);
}
}
EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
@@ -298,7 +302,8 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
* chance the situation is recoverable dump the status of the RAS
* capability registers and bounce the active state of the memdev.
*/
- ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlmd->endpoint->regs.ras);
+ ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial,
+ cxlmd->endpoint->regs.ras);
}
switch (state) {
diff --git a/drivers/cxl/core/ras_rch.c b/drivers/cxl/core/ras_rch.c
index 0a8b3b9b6388..5771abfc16de 100644
--- a/drivers/cxl/core/ras_rch.c
+++ b/drivers/cxl/core/ras_rch.c
@@ -115,7 +115,9 @@ void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds)
pci_print_aer(pdev, severity, &aer_regs);
if (severity == AER_CORRECTABLE)
- cxl_handle_cor_ras(&cxlds->cxlmd->dev, dport->regs.ras);
+ cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->serial,
+ dport->regs.ras);
else
- cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
+ cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial,
+ dport->regs.ras);
}
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index a972e4ef1936..5f630543b720 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -77,11 +77,12 @@ TRACE_EVENT(cxl_port_aer_uncorrectable_error,
);
TRACE_EVENT(cxl_aer_uncorrectable_error,
- TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
- TP_ARGS(cxlmd, status, fe, hl),
+ TP_PROTO(const struct device *cxlmd, u32 status, u32 fe, u32 *hl,
+ u64 serial),
+ TP_ARGS(cxlmd, status, fe, hl, serial),
TP_STRUCT__entry(
- __string(memdev, dev_name(&cxlmd->dev))
- __string(host, dev_name(cxlmd->dev.parent))
+ __string(memdev, dev_name(cxlmd))
+ __string(host, dev_name(cxlmd->parent))
__field(u64, serial)
__field(u32, status)
__field(u32, first_error)
@@ -90,7 +91,7 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
TP_fast_assign(
__assign_str(memdev);
__assign_str(host);
- __entry->serial = cxlmd->cxlds->serial;
+ __entry->serial = serial;
__entry->status = status;
__entry->first_error = fe;
/*
@@ -144,18 +145,18 @@ TRACE_EVENT(cxl_port_aer_correctable_error,
);
TRACE_EVENT(cxl_aer_correctable_error,
- TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
- TP_ARGS(cxlmd, status),
+ TP_PROTO(const struct device *cxlmd, u32 status, u64 serial),
+ TP_ARGS(cxlmd, status, serial),
TP_STRUCT__entry(
- __string(memdev, dev_name(&cxlmd->dev))
- __string(host, dev_name(cxlmd->dev.parent))
+ __string(memdev, dev_name(cxlmd))
+ __string(host, dev_name(cxlmd->parent))
__field(u64, serial)
__field(u32, status)
),
TP_fast_assign(
__assign_str(memdev);
__assign_str(host);
- __entry->serial = cxlmd->cxlds->serial;
+ __entry->serial = serial;
__entry->status = status;
),
TP_printk("memdev=%s host=%s serial=%lld: status: '%s'",
--
2.34.1
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v16 04/10] PCI/ERR: Introduce PCI_ERS_RESULT_PANIC
2026-03-02 20:36 [PATCH v16 00/10] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (2 preceding siblings ...)
2026-03-02 20:36 ` [PATCH v16 03/10] cxl: Update CXL Endpoint tracing Terry Bowman
@ 2026-03-02 20:36 ` Terry Bowman
2026-03-02 20:36 ` [PATCH v16 05/10] PCI: Establish common CXL Port protocol error flow Terry Bowman
` (5 subsequent siblings)
9 siblings, 0 replies; 26+ messages in thread
From: Terry Bowman @ 2026-03-02 20:36 UTC (permalink / raw)
To: dave, jonathan.cameron, dave.jiang, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
The CXL driver's uncorrectable (UCE) protocol error handling will be updated
in the future. One required change is for the CXL error handlers to force a
system panic when a UCE is detected.
Introduce PCI_ERS_RESULT_PANIC as an 'enum pci_ers_result' value. This will
be used by CXL UCE fatal and non-fatal recovery in future patches. Update
PCIe recovery documentation with details of PCI_ERS_RESULT_PANIC.
To clarify, PCI's merge_result() implemented in err.c is not to be changed.
merge_result() is not aware of PCI_ERS_RESULT_PANIC and will not return
PCI_ERS_RESULT_PANIC.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
Changes in v15 -> v16:
- None
Changes in v14 -> v15:
- None
Changes in v13 -> v14:
- Add review-by for Dan
- Update Title prefix (Bjorn)
- Removed merge_result. Only logging error for device reporting the
error (Dan)
Changes in v12->v13:
- Add Dave Jiang's, Jonathan's, Ben's review-by
- Typo fix (Ben)
Changes v11 -> v12:
- Documentation requested (Lukas)
---
Documentation/PCI/pci-error-recovery.rst | 2 ++
include/linux/pci.h | 3 +++
2 files changed, 5 insertions(+)
diff --git a/Documentation/PCI/pci-error-recovery.rst b/Documentation/PCI/pci-error-recovery.rst
index 43838723fde9..55be63f1a649 100644
--- a/Documentation/PCI/pci-error-recovery.rst
+++ b/Documentation/PCI/pci-error-recovery.rst
@@ -102,6 +102,8 @@ Possible return values are::
PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */
PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */
PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */
+ PCI_ERS_RESULT_NO_AER_DRIVER, /* No AER capabilities registered for the driver */
+ PCI_ERS_RESULT_PANIC, /* System is unstable, panic. Is CXL specific */
};
A driver does not have to implement all of these callbacks; however,
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 1c270f1d5123..0d6ad11e3422 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -933,6 +933,9 @@ enum pci_ers_result {
/* No AER capabilities registered for the driver */
PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6,
+
+ /* System is unstable, panic. Is CXL specific */
+ PCI_ERS_RESULT_PANIC = (__force pci_ers_result_t) 7,
};
/* PCI bus error event callbacks */
--
2.34.1
^ permalink raw reply related [flat|nested] 26+ messages in thread
* [PATCH v16 05/10] PCI: Establish common CXL Port protocol error flow
2026-03-02 20:36 [PATCH v16 00/10] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (3 preceding siblings ...)
2026-03-02 20:36 ` [PATCH v16 04/10] PCI/ERR: Introduce PCI_ERS_RESULT_PANIC Terry Bowman
@ 2026-03-02 20:36 ` Terry Bowman
2026-03-09 12:45 ` Jonathan Cameron
2026-03-02 20:36 ` [PATCH v16 06/10] PCI/CXL: Add RCH support to CXL handlers Terry Bowman
` (4 subsequent siblings)
9 siblings, 1 reply; 26+ messages in thread
From: Terry Bowman @ 2026-03-02 20:36 UTC (permalink / raw)
To: dave, jonathan.cameron, dave.jiang, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
Introduce CXL Port protocol error handling callbacks to unify detection,
logging, and recovery across CXL Ports and Endpoints. Establish a consistent
flow for correctable and uncorrectable CXL protocol errors. Support for RCH
Downstream Port error handling will be added in a future patch.
Add cxl_port_cor_error_detected() and cxl_port_error_detected() to route
correctable and uncorrectable errors through the CXL RAS helpers. Coordinate
uncorrectable recovery in cxl_do_recovery() and panic when the handler
returns PCI_ERS_RESULT_PANIC to preserve fatal cachemem behavior. Gate
Endpoint handling on the Endpoint driver being bound to avoid processing
errors on disabled devices.
Centralize the RAS base lookup in cxl_get_ras_base(), selecting the
downstream-port dport->regs.ras for Root/Downstream Ports and port->regs.ras
for Upstream Ports/Endpoints.
Export pcie_clear_device_status() and pci_aer_clear_fatal_status() to enable
cxl_core to clear PCIe/AER state in these flows.
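The RAS base selection described above can be sketched as a plain switch on the PCIe port type. This is a simplified userspace illustration, not the patch's code: the demo_* names are hypothetical, and the numeric values are assumed to mirror the PCI_EXP_TYPE_* constants in include/uapi/linux/pci_regs.h.

```c
/* Assumed to mirror PCI_EXP_TYPE_* from include/uapi/linux/pci_regs.h. */
enum demo_pcie_type {
	DEMO_TYPE_ENDPOINT   = 0x0,
	DEMO_TYPE_ROOT_PORT  = 0x4,
	DEMO_TYPE_UPSTREAM   = 0x5,
	DEMO_TYPE_DOWNSTREAM = 0x6,
	DEMO_TYPE_RC_END     = 0x9,
};

/* Which structure the RAS register base would be read from. */
enum demo_ras_source {
	DEMO_RAS_NONE,	/* unsupported device type */
	DEMO_RAS_DPORT,	/* dport->regs.ras */
	DEMO_RAS_PORT,	/* port->regs.ras */
};

static enum demo_ras_source demo_ras_source(int pcie_type)
{
	switch (pcie_type) {
	case DEMO_TYPE_ROOT_PORT:
	case DEMO_TYPE_DOWNSTREAM:
		/* Root and Downstream Ports are looked up as a dport. */
		return DEMO_RAS_DPORT;
	case DEMO_TYPE_UPSTREAM:
	case DEMO_TYPE_ENDPOINT:
	case DEMO_TYPE_RC_END:
		/* Upstream Ports and Endpoints are looked up by uport. */
		return DEMO_RAS_PORT;
	default:
		return DEMO_RAS_NONE;
	}
}
```

Any type outside the five handled cases falls through to DEMO_RAS_NONE, matching the "unsupported device type" path that returns NULL in the patch.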
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
---
Changes in v15->v16:
- get_ras_base(), initialize dport to NULL (Jonathan)
- Remove guard(device)(&cxlmd->dev) (Jonathan)
- Fix dev_warns() (Jonathan)
- Remove comment in cxl_port_error_detected() (Dan)
- Made pcie_clear_device_status() and pci_aer_clear_fatal_status()
"CXL" Export namespace (Dan)
- Update switch-case brackets to follow clang-format (Dan)
- Add PCI_EXP_TYPE_RC_END for cxl_get_ras_base() (Terry)
- Add NULL port check in cxl_serial_number() (Terry)
Changes in v14->v15:
- Update commit message and title. Added Bjorn's ack.
- Move CE and UCE handling logic here
Changes in v13->v14:
- Add Dave Jiang's review-by
- Update commit message & headline (Bjorn)
- Refactor cxl_port_error_detected()/cxl_port_cor_error_detected() to
one line (Jonathan)
- Remove cxl_walk_port() (Dan)
- Remove cxl_pci_drv_bound(). Check for 'is_cxl' parent port is
sufficient (Dan)
- Remove device_lock_if()
- Combined CE and UCE here (Terry)
Changes in v12->v13:
- Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
patch (Terry)
- Remove EP case in cxl_get_ras_base(), not used. (Terry)
- Remove check for dport->dport_dev (Dave)
- Remove whitespace (Terry)
Changes in v11->v12:
- Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
pci_to_cxl_dev()
- Change cxl_error_detected() -> cxl_cor_error_detected()
- Remove NULL variable assignments
- Replace bus_find_device() with find_cxl_port_by_uport() for upstream
port searches.
Changes in v10->v11:
- None
---
drivers/cxl/core/core.h | 3 +
drivers/cxl/core/port.c | 6 +-
drivers/cxl/core/ras.c | 189 ++++++++++++++++++++++++++++++++--
drivers/pci/pci.c | 1 +
drivers/pci/pci.h | 2 -
drivers/pci/pcie/aer.c | 1 +
drivers/pci/pcie/aer_cxl_vh.c | 5 +-
include/linux/aer.h | 2 +
include/linux/pci.h | 2 +
9 files changed, 195 insertions(+), 16 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 5051800882c5..0eb2e28bb2c2 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -208,6 +208,9 @@ static inline void devm_cxl_dport_ras_setup(struct cxl_dport *dport) { }
#endif /* CONFIG_CXL_RAS */
int cxl_gpf_port_setup(struct cxl_dport *dport);
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+ struct cxl_dport **dport);
+struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev);
struct cxl_hdm;
int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 0c5957d1d329..27271402915f 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1386,8 +1386,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
return NULL;
}
-static struct cxl_port *find_cxl_port(struct device *dport_dev,
- struct cxl_dport **dport)
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+ struct cxl_dport **dport)
{
struct cxl_find_port_ctx ctx = {
.dport_dev = dport_dev,
@@ -1582,7 +1582,7 @@ static int match_port_by_uport(struct device *dev, const void *data)
* Function takes a device reference on the port device. Caller should do a
* put_device() when done.
*/
-static struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
+struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
{
struct device *dev;
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 44791f6d7d50..1d4be2d78469 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -119,16 +119,6 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
}
static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
-int cxl_ras_init(void)
-{
- return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
-}
-
-void cxl_ras_exit(void)
-{
- cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
-}
-
static void cxl_dport_map_ras(struct cxl_dport *dport)
{
struct cxl_register_map *map = &dport->reg_map;
@@ -185,6 +175,117 @@ void devm_cxl_port_ras_setup(struct cxl_port *port)
}
EXPORT_SYMBOL_NS_GPL(devm_cxl_port_ras_setup, "CXL");
+/*
+ * get_cxl_port - Return the parent CXL Port of a PCI device
+ * @pdev: PCI device whose parent CXL Port is being queried
+ *
+ * Looks up and returns the parent CXL Port associated with @pdev. On
+ * success, the returned port has its reference count incremented and must
+ * be released by the caller. Returns NULL if no associated CXL port is
+ * found.
+ *
+ * Return: Pointer to the parent &struct cxl_port or NULL on failure
+ */
+static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
+{
+ switch (pci_pcie_type(pdev)) {
+ case PCI_EXP_TYPE_ROOT_PORT:
+ case PCI_EXP_TYPE_DOWNSTREAM: {
+ struct cxl_dport *dport;
+ struct cxl_port *port = find_cxl_port(&pdev->dev, &dport);
+
+ if (!port) {
+ pci_err(pdev, "Failed to find the CXL device");
+ return NULL;
+ }
+ return port;
+ }
+ case PCI_EXP_TYPE_UPSTREAM:
+ case PCI_EXP_TYPE_ENDPOINT:
+ case PCI_EXP_TYPE_RC_END: {
+ struct cxl_port *port = find_cxl_port_by_uport(&pdev->dev);
+
+ if (!port) {
+ pci_err(pdev, "Failed to find the CXL device");
+ return NULL;
+ }
+ return port;
+ }
+ }
+
+ pr_err_ratelimited("%s: Error - Unsupported device type (%#x)",
+ pci_name(pdev), pci_pcie_type(pdev));
+ return NULL;
+}
+
+static u64 cxl_serial_number(struct device *dev)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
+ struct device *port_dev = port ? port->uport_dev : NULL;
+ struct cxl_memdev *cxlmd;
+
+ if (!port_dev || !is_cxl_memdev(dev))
+ return 0;
+
+ cxlmd = to_cxl_memdev(port_dev);
+ return cxlmd->cxlds->serial;
+}
+
+static void __iomem *cxl_get_ras_base(struct device *dev)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ switch (pci_pcie_type(pdev)) {
+ case PCI_EXP_TYPE_ROOT_PORT:
+ case PCI_EXP_TYPE_DOWNSTREAM: {
+ struct cxl_dport *dport = NULL;
+ struct cxl_port *port __free(put_cxl_port) = find_cxl_port(&pdev->dev, &dport);
+
+ if (!dport) {
+ pci_err(pdev, "Failed to find the CXL device");
+ return NULL;
+ }
+ return dport->regs.ras;
+ }
+ case PCI_EXP_TYPE_UPSTREAM:
+ case PCI_EXP_TYPE_ENDPOINT:
+ case PCI_EXP_TYPE_RC_END: {
+ struct cxl_port *port __free(put_cxl_port) = find_cxl_port_by_uport(&pdev->dev);
+
+ if (!port) {
+ pci_err(pdev, "Failed to find the CXL device");
+ return NULL;
+ }
+ return port->regs.ras;
+ }
+ }
+ dev_warn_once(dev, "Error: Unsupported device type (%#x)", pci_pcie_type(pdev));
+ return NULL;
+}
+
+static void cxl_do_recovery(struct pci_dev *pdev)
+{
+ struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
+ struct device *dev = &pdev->dev;
+ pci_ers_result_t status;
+
+ if (!port) {
+ pci_err(pdev, "Failed to find the CXL device\n");
+ return;
+ }
+
+ status = cxl_handle_ras(dev, cxl_serial_number(dev), cxl_get_ras_base(dev));
+ if (status == PCI_ERS_RESULT_PANIC)
+ panic("CXL cachemem error.");
+
+ if (pcie_aer_is_native(pdev)) {
+ pcie_clear_device_status(pdev);
+ pci_aer_clear_nonfatal_status(pdev);
+ pci_aer_clear_fatal_status(pdev);
+ }
+}
+
void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
{
void __iomem *addr;
@@ -327,3 +428,71 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
return PCI_ERS_RESULT_NEED_RESET;
}
EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
+
+static void cxl_handle_proto_error(struct pci_dev *pdev, int severity)
+{
+ if (severity == AER_CORRECTABLE) {
+ struct device *dev = &pdev->dev;
+
+ if (!pcie_aer_is_native(pdev))
+ return;
+
+ if (pdev->aer_cap)
+ pci_clear_and_set_config_dword(pdev,
+ pdev->aer_cap + PCI_ERR_COR_STATUS,
+ 0, PCI_ERR_COR_INTERNAL);
+
+ cxl_handle_cor_ras(dev, cxl_serial_number(dev),
+ cxl_get_ras_base(dev));
+ pcie_clear_device_status(pdev);
+ } else {
+ cxl_do_recovery(pdev);
+ }
+}
+
+static void cxl_proto_err_work_fn(struct work_struct *work)
+{
+ struct cxl_proto_err_work_data wd;
+
+ /*
+ * Dequeue work forwarded from the AER driver
+ * See cxl_forward_error() for matching pci_dev_get()
+ */
+ while (cxl_proto_err_kfifo_get(&wd)) {
+ struct pci_dev *pdev __free(pci_dev_put) = wd.pdev;
+ struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
+
+ if (!port) {
+ pr_err_ratelimited("%s: Failed to find parent port device in CXL topology\n",
+ pci_name(pdev));
+ continue;
+ }
+
+ guard(device)(&port->dev);
+ if (!port->dev.driver) {
+ pr_err_ratelimited("%s: Port device is unbound, abort error handling\n",
+ dev_name(&port->dev));
+ continue;
+ }
+
+ cxl_handle_proto_error(pdev, wd.severity);
+ }
+}
+
+static DECLARE_WORK(cxl_proto_err_work, cxl_proto_err_work_fn);
+
+int cxl_ras_init(void)
+{
+ if (cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work))
+ pr_err("Failed to initialize CXL RAS CPER\n");
+
+ cxl_register_proto_err_work(&cxl_proto_err_work);
+
+ return 0;
+}
+
+void cxl_ras_exit(void)
+{
+ cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
+ cxl_unregister_proto_err_work();
+}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 8479c2e1f74f..2c4bad5ad2b1 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2246,6 +2246,7 @@ void pcie_clear_device_status(struct pci_dev *dev)
pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &sta);
pcie_capability_write_word(dev, PCI_EXP_DEVSTA, sta);
}
+EXPORT_SYMBOL_NS_GPL(pcie_clear_device_status, "CXL");
#endif
/**
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 13d998fbacce..780f262d2c3c 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -263,7 +263,6 @@ void pci_refresh_power_state(struct pci_dev *dev);
int pci_power_up(struct pci_dev *dev);
void pci_disable_enabled_device(struct pci_dev *dev);
int pci_finish_runtime_suspend(struct pci_dev *dev);
-void pcie_clear_device_status(struct pci_dev *dev);
void pcie_clear_root_pme_status(struct pci_dev *dev);
bool pci_check_pme_status(struct pci_dev *dev);
void pci_pme_wakeup_bus(struct pci_bus *bus);
@@ -1291,7 +1290,6 @@ void pci_restore_aer_state(struct pci_dev *dev);
static inline void pci_no_aer(void) { }
static inline void pci_aer_init(struct pci_dev *d) { }
static inline void pci_aer_exit(struct pci_dev *d) { }
-static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
static inline int pci_aer_clear_status(struct pci_dev *dev) { return -EINVAL; }
static inline int pci_aer_raw_clear_status(struct pci_dev *dev) { return -EINVAL; }
static inline void pci_save_aer_state(struct pci_dev *dev) { }
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 2e996e339d7c..871fa633b4da 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -295,6 +295,7 @@ void pci_aer_clear_fatal_status(struct pci_dev *dev)
if (status)
pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, status);
}
+EXPORT_SYMBOL_NS_GPL(pci_aer_clear_fatal_status, "CXL");
/**
* pci_aer_raw_clear_status - Clear AER error registers.
diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
index ebca1112652a..818ec0d0a012 100644
--- a/drivers/pci/pcie/aer_cxl_vh.c
+++ b/drivers/pci/pcie/aer_cxl_vh.c
@@ -33,7 +33,10 @@ bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
if (!info || !info->is_cxl)
return false;
- if (pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT)
+ if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
+ (pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) &&
+ (pci_pcie_type(pdev) != PCI_EXP_TYPE_UPSTREAM) &&
+ (pci_pcie_type(pdev) != PCI_EXP_TYPE_DOWNSTREAM))
return false;
return is_aer_internal_error(info);
diff --git a/include/linux/aer.h b/include/linux/aer.h
index f351e41dd979..c1aef7859d0a 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -65,6 +65,7 @@ struct cxl_proto_err_work_data {
#if defined(CONFIG_PCIEAER)
int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
+void pci_aer_clear_fatal_status(struct pci_dev *dev);
int pcie_aer_is_native(struct pci_dev *dev);
void pci_aer_unmask_internal_errors(struct pci_dev *dev);
#else
@@ -72,6 +73,7 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
{
return -EINVAL;
}
+static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
#endif
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 0d6ad11e3422..e7ed8da4844f 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1938,8 +1938,10 @@ static inline void pci_hp_unignore_link_change(struct pci_dev *pdev) { }
#ifdef CONFIG_PCIEAER
bool pci_aer_available(void);
+void pcie_clear_device_status(struct pci_dev *dev);
#else
static inline bool pci_aer_available(void) { return false; }
+static inline void pcie_clear_device_status(struct pci_dev *dev) { }
#endif
bool pci_ats_disabled(void);
--
2.34.1
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v16 05/10] PCI: Establish common CXL Port protocol error flow
2026-03-02 20:36 ` [PATCH v16 05/10] PCI: Establish common CXL Port protocol error flow Terry Bowman
@ 2026-03-09 12:45 ` Jonathan Cameron
0 siblings, 0 replies; 26+ messages in thread
From: Jonathan Cameron @ 2026-03-09 12:45 UTC (permalink / raw)
To: Terry Bowman
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On Mon, 2 Mar 2026 14:36:43 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> Introduce CXL Port protocol error handling callbacks to unify detection,
> logging, and recovery across CXL Ports and Endpoints. Establish a consistent
> flow for correctable and uncorrectable CXL protocol errors. Support for RCH
> Downstream Port error handling will be added in a future patch.
>
> Provide the solution by adding cxl_port_cor_error_detected() and
> cxl_port_error_detected() to handle correctable and uncorrectable handling
> through CXL RAS helpers, coordinating uncorrectable recovery in
> cxl_do_recovery(), and panicking when the handler returns PCI_ERS_RESULT_PANIC
> to preserve fatal cachemem behavior. Gate Endpoint handling on the Endpoint
> driver being bound to avoid processing errors on disabled devices.
>
> Centralize the RAS base lookup in cxl_get_ras_base(), selecting the
> downstream-port dport->regs.ras for Root/Downstream Ports and port->regs.ras
> for Upstream Ports/Endpoints.
>
> Export pcie_clear_device_status() and pci_aer_clear_fatal_status() to enable
> cxl_core to clear PCIe/AER state in these flows.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>
Hi Terry,
It's been long enough I've pretty much forgotten what this does, so
fresh review. Might well be commenting on things that have come up
before!
Jonathan
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 44791f6d7d50..1d4be2d78469 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> +static u64 cxl_serial_number(struct device *dev)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> + struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
> + struct device *port_dev = port ? port->uport_dev : NULL;
Maybe someone else asked for this, but to my eyes it's too complex.
Would be simpler to drop this local variable and
> + struct cxl_memdev *cxlmd;
> +
if (!port || !port->uport_dev || !is_cxl_memdev(dev))
return 0;
cxlmd = to_cxl_memdev(port->uport_dev);
..
> + if (!port_dev || !is_cxl_memdev(dev))
> + return 0;
> +
> + cxlmd = to_cxl_memdev(port_dev);
> + return cxlmd->cxlds->serial;
> +}
> +static void cxl_do_recovery(struct pci_dev *pdev)
> +{
> + struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
> + struct device *dev = &pdev->dev;
> + pci_ers_result_t status;
> +
> + if (!port) {
> + pci_err(pdev, "Failed to find the CXL device\n");
> + return;
> + }
> +
> + status = cxl_handle_ras(dev, cxl_serial_number(dev), cxl_get_ras_base(dev));
Extra space after =
> + if (status == PCI_ERS_RESULT_PANIC)
> + panic("CXL cachemem error.");
> +
> + if (pcie_aer_is_native(pdev)) {
> + pcie_clear_device_status(pdev);
> + pci_aer_clear_nonfatal_status(pdev);
> + pci_aer_clear_fatal_status(pdev);
> + }
> +}
> +
> void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> {
> void __iomem *addr;
> @@ -327,3 +428,71 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> +
> +static void cxl_proto_err_work_fn(struct work_struct *work)
> +{
> + struct cxl_proto_err_work_data wd;
> +
> + /*
> + * Dequeue work forwarded from the AER driver
> + * See cxl_forward_error() for matching pci_dev_get()
> + */
> + while (cxl_proto_err_kfifo_get(&wd)) {
> + struct pci_dev *pdev __free(pci_dev_put) = wd.pdev;
I'm not particularly keen on the lack of constructor / destructor pairing this is giving
us but the alternatives are fiddly. You could make cxl_proto_err_kfifo_get() return wd
/ ERR_PTR() and use a new DEFINE_FREE() for that structure. But it would require
memory allocation or a dance.
This would become something like
struct cxl_proto_err_work_data scratch;
while (true) {
struct cxl_proto_err_work_data *wd __free(cxl_proto_err_work_d) =
cxl_proto_err_kfifo_get(&scratch);
if (IS_ERR(wd))
return;
struct cxl_port *port __free(put_cxl_port) = get_cxl_port(wd->pdev);
...
}
Hmm. Also not exactly elegant. Unless others have a better idea, let us stick
to what you have.
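For readers less familiar with the scope-based cleanup under discussion, the pairing concern can be modelled in userspace C. This is only a sketch under stated assumptions: `fake_dev`, `fake_work`, `fake_dev_get()`, `fake_dev_put()` and `consume_one()` are hypothetical stand-ins rather than kernel APIs, and the GCC/Clang `cleanup` attribute is used directly in place of the kernel's `__free()` wrapper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for struct pci_dev. */
struct fake_dev {
	int refcount;
};

/* Producer side: take a reference before queueing the work item. */
static struct fake_dev *fake_dev_get(struct fake_dev *dev)
{
	if (dev)
		dev->refcount++;
	return dev;
}

static void fake_dev_put(struct fake_dev *dev)
{
	if (dev)
		dev->refcount--;
}

/* Cleanup handlers receive a pointer to the annotated variable. */
static void fake_dev_put_cleanup(struct fake_dev **devp)
{
	fake_dev_put(*devp);
}

/* Hypothetical work item carrying a reference taken by the producer. */
struct fake_work {
	struct fake_dev *dev;
	int severity;
};

/*
 * Consumer side: the queued reference is adopted by a scoped variable,
 * so every exit path (including the early return) drops it exactly once.
 * Returns 1 if the item was handled, 0 if it was skipped.
 */
static int consume_one(struct fake_work *wd, int bound)
{
	struct fake_dev *dev __attribute__((cleanup(fake_dev_put_cleanup))) = wd->dev;

	if (!bound)
		return 0;	/* the reference is still dropped here */

	(void)dev;		/* real code would handle the error here */
	return 1;
}
```

The get happens in one function and the put is attached in another, which is the constructor/destructor mismatch noted above; the sketch only shows that the reference count stays balanced on both the early-exit and normal paths.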
> + struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
> +
> + if (!port) {
> + pr_err_ratelimited("%s: Failed to find parent port device in CXL topology\n",
> + pci_name(pdev));
> + continue;
> + }
> +
> + guard(device)(&port->dev);
> + if (!port->dev.driver) {
> + pr_err_ratelimited("%s: Port device is unbound, abort error handling\n",
> + dev_name(&port->dev));
> + continue;
> + }
> +
> + cxl_handle_proto_error(pdev, wd.severity);
> + }
> +}
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v16 06/10] PCI/CXL: Add RCH support to CXL handlers
2026-03-02 20:36 [PATCH v16 00/10] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (4 preceding siblings ...)
2026-03-02 20:36 ` [PATCH v16 05/10] PCI: Establish common CXL Port protocol error flow Terry Bowman
@ 2026-03-02 20:36 ` Terry Bowman
2026-03-09 14:00 ` Jonathan Cameron
2026-03-02 20:36 ` [PATCH v16 07/10] cxl: Update error handlers to support CXL Port devices Terry Bowman
` (3 subsequent siblings)
9 siblings, 1 reply; 26+ messages in thread
From: Terry Bowman @ 2026-03-02 20:36 UTC (permalink / raw)
To: dave, jonathan.cameron, dave.jiang, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
Restricted CXL Host (RCH) error handling is not currently supported by the
CXL Port error handling flow. Integrate the existing RCH error handling
into the new Port error handling.
Update cxl_rch_handle_error_iter() to forward the RCH protocol error using
the AER-CXL kfifo.
Update cxl_handle_proto_error() to begin the RCH error handling with a call
to cxl_handle_rdport_errors(). This function handles both correctable and
uncorrectable RCH protocol errors.
Change the cxl_handle_rdport_errors() function parameter from a CXL device
state to a PCI device.
Report the serial number of the RCD Endpoint in the RCH logging. This
is used to associate the RCH with the RCD in the logs.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
Changes in v16:
- New commit
---
drivers/cxl/core/core.h | 6 ++++--
drivers/cxl/core/ras.c | 15 ++++++++++++---
drivers/cxl/core/ras_rch.c | 13 +++++++------
drivers/pci/pcie/aer_cxl_rch.c | 17 +----------------
4 files changed, 24 insertions(+), 27 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 0eb2e28bb2c2..76d2593e68c6 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -186,8 +186,9 @@ void cxl_handle_cor_ras(struct device *dev, u64 serial,
void __iomem *ras_base);
void cxl_dport_map_rch_aer(struct cxl_dport *dport);
void cxl_disable_rch_root_ints(struct cxl_dport *dport);
-void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds);
+void cxl_handle_rdport_errors(struct pci_dev *pdev);
void devm_cxl_dport_ras_setup(struct cxl_dport *dport);
+u64 cxl_serial_number(struct device *dev);
#else
static inline int cxl_ras_init(void)
{
@@ -203,8 +204,9 @@ static inline void cxl_handle_cor_ras(struct device *dev, u64 serial,
void __iomem *ras_base) { }
static inline void cxl_dport_map_rch_aer(struct cxl_dport *dport) { }
static inline void cxl_disable_rch_root_ints(struct cxl_dport *dport) { }
-static inline void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds) { }
+static inline void cxl_handle_rdport_errors(struct pci_dev *pdev) { }
static inline void devm_cxl_dport_ras_setup(struct cxl_dport *dport) { }
+static inline u64 cxl_serial_number(struct device *dev) { return 0; }
#endif /* CONFIG_CXL_RAS */
int cxl_gpf_port_setup(struct cxl_dport *dport);
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 1d4be2d78469..48d3ef7cbb92 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -218,7 +218,7 @@ static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
return NULL;
}
-static u64 cxl_serial_number(struct device *dev)
+u64 cxl_serial_number(struct device *dev)
{
struct pci_dev *pdev = to_pci_dev(dev);
struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
@@ -371,7 +371,7 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
}
if (cxlds->rcd)
- cxl_handle_rdport_errors(cxlds);
+ cxl_handle_rdport_errors(pdev);
cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->serial,
cxlmd->endpoint->regs.ras);
@@ -396,7 +396,7 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
}
if (cxlds->rcd)
- cxl_handle_rdport_errors(cxlds);
+ cxl_handle_rdport_errors(pdev);
/*
* A frozen channel indicates an impending reset which is fatal to
* CXL.mem operation, and will likely crash the system. On the off
@@ -431,6 +431,15 @@ EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
static void cxl_handle_proto_error(struct pci_dev *pdev, int severity)
{
+ /*
+ * CXL RCD's AER error interrupt is used for reporting RCD and RCH
+ * Downstream Port protocol errors. RCH protocol errors are handled
+ * using a unique procedure separate from CXL Port devices.
+ * See CXL spec r4.0, 12.2 CXL Error Handling
+ */
+ if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END)
+ cxl_handle_rdport_errors(pdev);
+
if (severity == AER_CORRECTABLE) {
struct device *dev = &pdev->dev;
diff --git a/drivers/cxl/core/ras_rch.c b/drivers/cxl/core/ras_rch.c
index 5771abfc16de..184b7877f700 100644
--- a/drivers/cxl/core/ras_rch.c
+++ b/drivers/cxl/core/ras_rch.c
@@ -95,17 +95,20 @@ static bool cxl_rch_get_aer_severity(struct aer_capability_regs *aer_regs,
return false;
}
-void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds)
+void cxl_handle_rdport_errors(struct pci_dev *pdev)
{
- struct pci_dev *pdev = to_pci_dev(cxlds->dev);
struct aer_capability_regs aer_regs;
+ struct device *dev = &pdev->dev;
+ u64 serial = cxl_serial_number(dev);
struct cxl_dport *dport;
+ void __iomem *ras_base;
int severity;
struct cxl_port *port __free(put_cxl_port) =
cxl_pci_find_port(pdev, &dport);
if (!port)
return;
+ ras_base = dport->regs.ras;
if (!cxl_rch_get_aer_info(dport->regs.dport_aer, &aer_regs))
return;
@@ -115,9 +118,7 @@ void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds)
pci_print_aer(pdev, severity, &aer_regs);
if (severity == AER_CORRECTABLE)
- cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->serial,
- dport->regs.ras);
+ cxl_handle_cor_ras(dev, serial, ras_base);
else
- cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial,
- dport->regs.ras);
+ cxl_handle_ras(dev, serial, ras_base);
}
diff --git a/drivers/pci/pcie/aer_cxl_rch.c b/drivers/pci/pcie/aer_cxl_rch.c
index e471eefec9c4..83142eac0cab 100644
--- a/drivers/pci/pcie/aer_cxl_rch.c
+++ b/drivers/pci/pcie/aer_cxl_rch.c
@@ -37,26 +37,11 @@ static bool cxl_error_is_native(struct pci_dev *dev)
static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
{
struct aer_err_info *info = (struct aer_err_info *)data;
- const struct pci_error_handlers *err_handler;
if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev))
return 0;
- guard(device)(&dev->dev);
-
- err_handler = dev->driver ? dev->driver->err_handler : NULL;
- if (!err_handler)
- return 0;
-
- if (info->severity == AER_CORRECTABLE) {
- if (err_handler->cor_error_detected)
- err_handler->cor_error_detected(dev);
- } else if (err_handler->error_detected) {
- if (info->severity == AER_NONFATAL)
- err_handler->error_detected(dev, pci_channel_io_normal);
- else if (info->severity == AER_FATAL)
- err_handler->error_detected(dev, pci_channel_io_frozen);
- }
+ cxl_forward_error(dev, info);
return 0;
}
--
2.34.1
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v16 06/10] PCI/CXL: Add RCH support to CXL handlers
2026-03-02 20:36 ` [PATCH v16 06/10] PCI/CXL: Add RCH support to CXL handlers Terry Bowman
@ 2026-03-09 14:00 ` Jonathan Cameron
2026-03-11 15:21 ` Bowman, Terry
0 siblings, 1 reply; 26+ messages in thread
From: Jonathan Cameron @ 2026-03-09 14:00 UTC (permalink / raw)
To: Terry Bowman
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On Mon, 2 Mar 2026 14:36:44 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> Restricted CXL Host (RCH) error handling is not currently supported by the
> CXL Port error handling flow. Integrate the existing RCH error handling
> into the new Port error handling.
>
> Update cxl_rch_handle_error_iter() to forward the RCH protocol error using
> the AER-CXL kfifo.
>
> Update cxl_handle_proto_error() to begin the RCH error handling with a call
> to cxl_handle_rdport_errors(). This function handles both correctable and
> uncorrectable RCH protocol errors.
>
> Change the cxl_handle_rdport_errors() function parameter from a CXL device
> state to a PCI device.
>
> Report the serial number of the RCD Endpoint in the RCH logging. This
> is used to associate the RCH with the RCD in the logs.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
One question inline.
+ a comment on a bit of neighboring code.
J
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 1d4be2d78469..48d3ef7cbb92 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> static void cxl_handle_proto_error(struct pci_dev *pdev, int severity)
> {
> + /*
> + * CXL RCD's AER error interrupt is used for reporting RCD and RCH
> + * Downstream Port protocol errors. RCH protocol errors are handled
> + * using a unique procedure separate from CXL Port devices.
> + * See CXL spec r4.0, 12.2 CXL Error Handling
> + */
> + if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END)
> + cxl_handle_rdport_errors(pdev);
Maybe I'm missing something but why do we want to carry on running the rest
of this function after this? Superficially seems like we will be doing
at least some stuff that didn't happen before.
> +
> if (severity == AER_CORRECTABLE) {
> struct device *dev = &pdev->dev;
> diff --git a/drivers/pci/pcie/aer_cxl_rch.c b/drivers/pci/pcie/aer_cxl_rch.c
> index e471eefec9c4..83142eac0cab 100644
> --- a/drivers/pci/pcie/aer_cxl_rch.c
> +++ b/drivers/pci/pcie/aer_cxl_rch.c
> @@ -37,26 +37,11 @@ static bool cxl_error_is_native(struct pci_dev *dev)
> static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
> {
> struct aer_err_info *info = (struct aer_err_info *)data;
Not related to this patch but that cast isn't needed.
> - const struct pci_error_handlers *err_handler;
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v16 06/10] PCI/CXL: Add RCH support to CXL handlers
2026-03-09 14:00 ` Jonathan Cameron
@ 2026-03-11 15:21 ` Bowman, Terry
0 siblings, 0 replies; 26+ messages in thread
From: Bowman, Terry @ 2026-03-11 15:21 UTC (permalink / raw)
To: Jonathan Cameron
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On 3/9/2026 9:00 AM, Jonathan Cameron wrote:
> On Mon, 2 Mar 2026 14:36:44 -0600
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> Restricted CXL Host (RCH) error handling is not currently supported by the
>> CXL Port error handling flow. Integrate the existing RCH error handling
>> into the new Port error handling.
>>
>> Update cxl_rch_handle_error_iter() to forward the RCH protocol error using
>> the AER-CXL kfifo.
>>
>> Update cxl_handle_proto_error() to begin the RCH error handling with a call
>> to cxl_handle_rdport_errors(). This function handles both correctable and
>> uncorrectable RCH protocol errors.
>>
>> Change the cxl_handle_rdport_errors() function parameter from a CXL device
>> state to a PCI device.
>>
>> Report the serial number of the RCD Endpoint in the RCH logging. This
>> is used to associate the RCH with the RCD in the logs.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> One question inline.
>
> + a comment on a bit of neighboring code.
>
> J
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index 1d4be2d78469..48d3ef7cbb92 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
>
>> static void cxl_handle_proto_error(struct pci_dev *pdev, int severity)
>> {
>> + /*
>> + * CXL RCD's AER error interrupt is used for reporting RCD and RCH
>> + * Downstream Port protocol errors. RCH protocol errors are handled
>> + * using a unique procedure separate from CXL Port devices.
>> + * See CXL spec r4.0, 12.2 CXL Error Handling
>> + */
>> + if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END)
>> + cxl_handle_rdport_errors(pdev);
>
> Maybe I'm missing something but why do we want to carry on running the rest
> of this function after this? Superficially seems like we will be doing
> at least some stuff that didn't happen before.
>
Hi Jonathan,
Before this series, a CXL RCiEP's internal AER interrupt resulted in calls
to both the RCH handler (cxl_handle_rdport_errors()) and the EP handler.
The EP handler was:
scoped_guard(device, dev) {
if (!dev->driver) {
dev_warn(&pdev->dev,
"%s: memdev disabled, abort error handling\n",
dev_name(dev));
return PCI_ERS_RESULT_DISCONNECT;
}
if (cxlds->rcd)
cxl_handle_rdport_errors(cxlds); <== RCH handling
/*
* A frozen channel indicates an impending reset which is fatal to
* CXL.mem operation, and will likely crash the system. On the off
* chance the situation is recoverable dump the status of the RAS
* capability registers and bounce the active state of the memdev.
*/
ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlmd->endpoint->regs.ras); <== EP handling
}
-Terry
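The ordering described above (RCH Downstream Port handling first, then the device's own handling, with no early return in between) can be modelled in a few lines of userspace C. This is a sketch only: `fake_pcie_type`, `fake_step`, `fake_trace` and `fake_handle_proto_error()` are hypothetical names chosen for illustration and do not correspond to kernel symbols.

```c
#include <assert.h>

/* Hypothetical device types; only the RCiEP case is special. */
enum fake_pcie_type { FAKE_TYPE_ENDPOINT, FAKE_TYPE_RC_END };

/* Hypothetical handling steps recorded in call order. */
enum fake_step { FAKE_STEP_RCH_DPORT = 1, FAKE_STEP_DEVICE };

struct fake_trace {
	enum fake_step steps[2];
	int nsteps;
};

/*
 * Model of the dispatch: an RCiEP first gets RCH Downstream Port
 * handling, then falls through to the common device handling.
 */
static void fake_handle_proto_error(enum fake_pcie_type type,
				    struct fake_trace *t)
{
	if (type == FAKE_TYPE_RC_END)
		t->steps[t->nsteps++] = FAKE_STEP_RCH_DPORT;

	t->steps[t->nsteps++] = FAKE_STEP_DEVICE;
}
```

Under this model an RCiEP records two steps and any other device records one, which matches the pre-series behaviour quoted above, where the EP handler called the RCH handling and then its own.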
>> +
>> if (severity == AER_CORRECTABLE) {
>> struct device *dev = &pdev->dev;
>
>> diff --git a/drivers/pci/pcie/aer_cxl_rch.c b/drivers/pci/pcie/aer_cxl_rch.c
>> index e471eefec9c4..83142eac0cab 100644
>> --- a/drivers/pci/pcie/aer_cxl_rch.c
>> +++ b/drivers/pci/pcie/aer_cxl_rch.c
>> @@ -37,26 +37,11 @@ static bool cxl_error_is_native(struct pci_dev *dev)
>> static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>> {
>> struct aer_err_info *info = (struct aer_err_info *)data;
>
> Not related to this patch but that cast isn't needed.
>
>> - const struct pci_error_handlers *err_handler;
>>
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v16 07/10] cxl: Update error handlers to support CXL Port devices
2026-03-02 20:36 [PATCH v16 00/10] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (5 preceding siblings ...)
2026-03-02 20:36 ` [PATCH v16 06/10] PCI/CXL: Add RCH support to CXL handlers Terry Bowman
@ 2026-03-02 20:36 ` Terry Bowman
2026-03-09 14:05 ` Jonathan Cameron
2026-03-02 20:36 ` [PATCH v16 08/10] cxl: Update Endpoint AER uncorrectable handler Terry Bowman
` (2 subsequent siblings)
9 siblings, 1 reply; 26+ messages in thread
From: Terry Bowman @ 2026-03-02 20:36 UTC (permalink / raw)
To: dave, jonathan.cameron, dave.jiang, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
CXL Protocol trace logging is called for Endpoints in cxl_handle_ras() and
cxl_handle_cor_ras(). Trace logging support for CXL Port devices is missing.
CXL Endpoint trace logging utilizes a separate trace routine than CXL Port
device handling. Using is_cxl_memdev(), determine if the device is a CXL EP
or one of the CXL Port devices.
Update cxl_handle_ras() and cxl_handle_cor_ras() to call the CXL Port trace
logging function. Change cxl_handle_ras() return values to be pci_ers_result_t
type.
Check for invalid ras_base and add log messages if NULL.
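As a rough illustration of the changes described above, the return-value mapping and the Endpoint/Port trace selection can be modelled in userspace C. This is a sketch, not the kernel code: `fake_ers_result`, `fake_trace_counts`, `fake_handle_ras()` and `FAKE_UNCOR_STATUS_MASK` are hypothetical, the mask value is arbitrary, and the write-1-to-clear register is approximated with a plain variable.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type; the kernel's pci_ers_result_t has more values. */
enum fake_ers_result { FAKE_ERS_NONE, FAKE_ERS_PANIC };

/* Arbitrary mask for illustration only, not the real register layout. */
#define FAKE_UNCOR_STATUS_MASK 0xffu

/* Counts which trace routine would have been called. */
struct fake_trace_counts {
	int endpoint;
	int port;
};

static enum fake_ers_result
fake_handle_ras(uint32_t *status_reg, int is_memdev,
		struct fake_trace_counts *tc)
{
	uint32_t status;

	if (!status_reg)
		return FAKE_ERS_NONE;	/* registers not mapped */

	status = *status_reg & FAKE_UNCOR_STATUS_MASK;
	if (!status)
		return FAKE_ERS_NONE;	/* no uncorrectable error logged */

	/* Endpoints and Port devices use different trace routines. */
	if (is_memdev)
		tc->endpoint++;
	else
		tc->port++;

	/* Model write-1-to-clear: clear the bits that were just logged. */
	*status_reg &= ~status;

	return FAKE_ERS_PANIC;
}
```

In this model the old `false` return corresponds to `FAKE_ERS_NONE` and the old `true` return to `FAKE_ERS_PANIC`, so a caller can compare against a named result instead of interpreting a bare boolean.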
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
Changes in v15 -> v16:
- New commit
---
drivers/cxl/core/core.h | 10 ++++++----
drivers/cxl/core/ras.c | 36 +++++++++++++++++++++++++-----------
2 files changed, 31 insertions(+), 15 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 76d2593e68c6..984cc37be186 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -6,6 +6,7 @@
#include <cxl/mailbox.h>
#include <linux/rwsem.h>
+#include <linux/pci.h>
extern const struct device_type cxl_nvdimm_bridge_type;
extern const struct device_type cxl_nvdimm_type;
@@ -181,7 +182,8 @@ static inline struct device *dport_to_host(struct cxl_dport *dport)
#ifdef CONFIG_CXL_RAS
int cxl_ras_init(void);
void cxl_ras_exit(void);
-bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base);
+pci_ers_result_t cxl_handle_ras(struct device *dev, u64 serial,
+ void __iomem *ras_base);
void cxl_handle_cor_ras(struct device *dev, u64 serial,
void __iomem *ras_base);
void cxl_dport_map_rch_aer(struct cxl_dport *dport);
@@ -195,10 +197,10 @@ static inline int cxl_ras_init(void)
return 0;
}
static inline void cxl_ras_exit(void) { }
-static inline bool cxl_handle_ras(struct device *dev, u64 serial,
- void __iomem *ras_base)
+static inline pci_ers_result_t cxl_handle_ras(struct device *dev, u64 serial,
+ void __iomem *ras_base)
{
- return false;
+ return PCI_ERS_RESULT_NONE;
}
static inline void cxl_handle_cor_ras(struct device *dev, u64 serial,
void __iomem *ras_base) { }
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 48d3ef7cbb92..254144d19764 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -291,15 +291,22 @@ void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
void __iomem *addr;
u32 status;
- if (!ras_base)
+ if (!ras_base) {
+ pr_err_ratelimited("%s: CXL RAS registers aren't mapped\n",
+ dev_name(dev));
return;
+ }
addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
status = readl(addr);
- if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
- writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
+ if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
+ return;
+
+ writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
+ if (is_cxl_memdev(dev))
trace_cxl_aer_correctable_error(dev, status, serial);
- }
+ else
+ trace_cxl_port_aer_correctable_error(dev, status);
}
/* CXL spec rev3.0 8.2.4.16.1 */
@@ -321,22 +328,26 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
/*
* Log the state of the RAS status registers and prepare them to log the
- * next error status. Return 1 if reset needed.
+ * next error status. Return PCI_ERS_RESULT_PANIC if reset needed.
*/
-bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
+pci_ers_result_t
+cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
{
u32 hl[CXL_HEADERLOG_SIZE_U32];
void __iomem *addr;
u32 status;
u32 fe;
- if (!ras_base)
- return false;
+ if (!ras_base) {
+ pr_err_ratelimited("%s: CXL RAS registers aren't mapped\n",
+ dev_name(dev));
+ return PCI_ERS_RESULT_NONE;
+ }
addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
status = readl(addr);
if (!(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK))
- return false;
+ return PCI_ERS_RESULT_NONE;
/* If multiple errors, log header points to first error from ctrl reg */
if (hweight32(status) > 1) {
@@ -350,10 +361,13 @@ bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
}
header_log_copy(ras_base, hl);
- trace_cxl_aer_uncorrectable_error(dev, status, fe, hl, serial);
+ if (is_cxl_memdev(dev))
+ trace_cxl_aer_uncorrectable_error(dev, status, fe, hl, serial);
+ else
+ trace_cxl_port_aer_uncorrectable_error(dev, status, fe, hl);
writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
- return true;
+ return PCI_ERS_RESULT_PANIC;
}
void cxl_cor_error_detected(struct pci_dev *pdev)
--
2.34.1
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v16 07/10] cxl: Update error handlers to support CXL Port devices
2026-03-02 20:36 ` [PATCH v16 07/10] cxl: Update error handlers to support CXL Port devices Terry Bowman
@ 2026-03-09 14:05 ` Jonathan Cameron
2026-03-11 15:37 ` Bowman, Terry
0 siblings, 1 reply; 26+ messages in thread
From: Jonathan Cameron @ 2026-03-09 14:05 UTC (permalink / raw)
To: Terry Bowman
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On Mon, 2 Mar 2026 14:36:45 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> CXL Protocol trace logging is called for Endpoints in cxl_handle_ras() and
> cxl_handle_cor_ras(). Trace logging support for CXL Port devices is missing.
>
> CXL Endpoint trace logging utilizes a separate trace routine than CXL Port
> device handling. Using is_cxl_memdev(), determine if the device is a CXL EP
> or one of the CXL Port devices.
>
> Update cxl_handle_ras() and cxl_handle_cor_ras() to call the CXL Port trace
> logging function. Change cxl_handle_ras() return values to be pci_ers_result_t
> type.
Why this last bit?
>
> Check for invalid ras_base and add log messages if NULL.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
A few comments inline.
Thanks,
Jonathan
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 48d3ef7cbb92..254144d19764 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -291,15 +291,22 @@ void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> void __iomem *addr;
> u32 status;
>
> - if (!ras_base)
> + if (!ras_base) {
> + pr_err_ratelimited("%s: CXL RAS registers aren't mapped\n",
> + dev_name(dev));
This print isn't mentioned in the commit message. Probably needs some comment
on why all paths that get here are error paths.
> return;
> + }
>
> addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
> status = readl(addr);
> - if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
> - writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> + if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
> + return;
> +
> + writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> + if (is_cxl_memdev(dev))
> trace_cxl_aer_correctable_error(dev, status, serial);
> - }
> + else
> + trace_cxl_port_aer_correctable_error(dev, status);
> }
>
> /* CXL spec rev3.0 8.2.4.16.1 */
> @@ -321,22 +328,26 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
>
> /*
> * Log the state of the RAS status registers and prepare them to log the
> - * next error status. Return 1 if reset needed.
> + * next error status. Return PCI_ERS_RESULT_PANIC if reset needed.
This seems odd as normally PANIC implies more than reset. I guess system reset,
kind of...
> */
> -bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> +pci_ers_result_t
> +cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> {
> u32 hl[CXL_HEADERLOG_SIZE_U32];
> void __iomem *addr;
> u32 status;
> u32 fe;
>
> - if (!ras_base)
> - return false;
> + if (!ras_base) {
> + pr_err_ratelimited("%s: CXL RAS registers aren't mapped\n",
> + dev_name(dev));
> + return PCI_ERS_RESULT_NONE;
> + }
>
> addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
> status = readl(addr);
> if (!(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK))
> - return false;
> + return PCI_ERS_RESULT_NONE;
>
> /* If multiple errors, log header points to first error from ctrl reg */
> if (hweight32(status) > 1) {
> @@ -350,10 +361,13 @@ bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> }
>
> header_log_copy(ras_base, hl);
> - trace_cxl_aer_uncorrectable_error(dev, status, fe, hl, serial);
> + if (is_cxl_memdev(dev))
> + trace_cxl_aer_uncorrectable_error(dev, status, fe, hl, serial);
> + else
> + trace_cxl_port_aer_uncorrectable_error(dev, status, fe, hl);
> writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>
> - return true;
> + return PCI_ERS_RESULT_PANIC;
> }
>
> void cxl_cor_error_detected(struct pci_dev *pdev)
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v16 07/10] cxl: Update error handlers to support CXL Port devices
2026-03-09 14:05 ` Jonathan Cameron
@ 2026-03-11 15:37 ` Bowman, Terry
2026-03-12 13:05 ` Jonathan Cameron
0 siblings, 1 reply; 26+ messages in thread
From: Bowman, Terry @ 2026-03-11 15:37 UTC (permalink / raw)
To: Jonathan Cameron
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On 3/9/2026 9:05 AM, Jonathan Cameron wrote:
> On Mon, 2 Mar 2026 14:36:45 -0600
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> CXL Protocol trace logging is called for Endpoints in cxl_handle_ras() and
>> cxl_handle_cor_ras(). Trace logging support for CXL Port devices is missing.
>>
>> CXL Endpoint trace logging utilizes a separate trace routine than CXL Port
>> device handling. Using is_cxl_memdev(), determine if the device is a CXL EP
>> or one of the CXL Port devices.
>>
>> Update cxl_handle_ras() and cxl_handle_cor_ras() to call the CXL Port trace
>> logging function. Change cxl_handle_ras() return values to be pci_ers_result_t
>> type.
>
> Why this last bit?
>
You requested in a previous review that this return a value more meaningful than bool.
I changed it to return pci_ers_result_t.
https://lore.kernel.org/linux-cxl/20260205171346.00001e6b@huawei.com/
>>
>> Check for invalid ras_base and add log messages if NULL.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>
> A few comments inline.
>
> Thanks,
>
> Jonathan
>
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index 48d3ef7cbb92..254144d19764 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
>> @@ -291,15 +291,22 @@ void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
>> void __iomem *addr;
>> u32 status;
>>
>> - if (!ras_base)
>> + if (!ras_base) {
>> + pr_err_ratelimited("%s: CXL RAS registers aren't mapped\n",
>> + dev_name(dev));
>
> This print isn't mentioned in the commit message. Probably needs some comment
> on why all paths that get here are error paths.
>
Good idea. I'll update with those details.
-Terry
>> return;
>> + }
>>
>> addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>> status = readl(addr);
>> - if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
>> - writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>> + if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
>> + return;
>> +
>> + writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>> + if (is_cxl_memdev(dev))
>> trace_cxl_aer_correctable_error(dev, status, serial);
>> - }
>> + else
>> + trace_cxl_port_aer_correctable_error(dev, status);
>> }
>>
>> /* CXL spec rev3.0 8.2.4.16.1 */
>> @@ -321,22 +328,26 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
>>
>> /*
>> * Log the state of the RAS status registers and prepare them to log the
>> - * next error status. Return 1 if reset needed.
>> + * next error status. Return PCI_ERS_RESULT_PANIC if reset needed.
>
> This seems odd as normally PANIC implies more than reset. I guess system reset,
> kind of...
>
>> */
>> -bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
>> +pci_ers_result_t
>> +cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
>> {
>> u32 hl[CXL_HEADERLOG_SIZE_U32];
>> void __iomem *addr;
>> u32 status;
>> u32 fe;
>>
>> - if (!ras_base)
>> - return false;
>> + if (!ras_base) {
>> + pr_err_ratelimited("%s: CXL RAS registers aren't mapped\n",
>> + dev_name(dev));
>> + return PCI_ERS_RESULT_NONE;
>> + }
>>
>> addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
>> status = readl(addr);
>> if (!(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK))
>> - return false;
>> + return PCI_ERS_RESULT_NONE;
>>
>> /* If multiple errors, log header points to first error from ctrl reg */
>> if (hweight32(status) > 1) {
>> @@ -350,10 +361,13 @@ bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
>> }
>>
>> header_log_copy(ras_base, hl);
>> - trace_cxl_aer_uncorrectable_error(dev, status, fe, hl, serial);
>> + if (is_cxl_memdev(dev))
>> + trace_cxl_aer_uncorrectable_error(dev, status, fe, hl, serial);
>> + else
>> + trace_cxl_port_aer_uncorrectable_error(dev, status, fe, hl);
>> writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>>
>> - return true;
>> + return PCI_ERS_RESULT_PANIC;
>> }
>>
>> void cxl_cor_error_detected(struct pci_dev *pdev)
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v16 07/10] cxl: Update error handlers to support CXL Port devices
2026-03-11 15:37 ` Bowman, Terry
@ 2026-03-12 13:05 ` Jonathan Cameron
0 siblings, 0 replies; 26+ messages in thread
From: Jonathan Cameron @ 2026-03-12 13:05 UTC (permalink / raw)
To: Bowman, Terry
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On Wed, 11 Mar 2026 10:37:33 -0500
"Bowman, Terry" <terry.bowman@amd.com> wrote:
> On 3/9/2026 9:05 AM, Jonathan Cameron wrote:
> > On Mon, 2 Mar 2026 14:36:45 -0600
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >
> >> CXL Protocol trace logging is called for Endpoints in cxl_handle_ras() and
> >> cxl_handle_cor_ras(). Trace logging support for CXL Port devices is missing.
> >>
> >> CXL Endpoint trace logging utilizes a separate trace routine than CXL Port
> >> device handling. Using is_cxl_memdev(), determine if the device is a CXL EP
> >> or one of the CXL Port devices.
> >>
> >> Update cxl_handle_ras() and cxl_handle_cor_ras() to call the CXL Port trace
> >> logging function. Change cxl_handle_ras() return values to be pci_ers_result_t
> >> type.
> >
> > Why this last bit?
> >
>
> You requested in a previous review that this return a value more meaningful than bool.
> I changed it to return pci_ers_result_t.
>
> https://lore.kernel.org/linux-cxl/20260205171346.00001e6b@huawei.com/
Ah. I was probably thinking errnos.
Maybe not appropriate. Given how it is used, a bool was probably the right answer.
Sorry!
J
>
>
> >>
> >> Check for invalid ras_base and add log messages if NULL.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v16 08/10] cxl: Update Endpoint AER uncorrectable handler
2026-03-02 20:36 [PATCH v16 00/10] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (6 preceding siblings ...)
2026-03-02 20:36 ` [PATCH v16 07/10] cxl: Update error handlers to support CXL Port devices Terry Bowman
@ 2026-03-02 20:36 ` Terry Bowman
2026-03-09 14:12 ` Jonathan Cameron
2026-03-02 20:36 ` [PATCH v16 09/10] cxl: Remove Endpoint AER correctable handler Terry Bowman
2026-03-02 20:36 ` [PATCH v16 10/10] cxl: Enable CXL protocol error reporting Terry Bowman
9 siblings, 1 reply; 26+ messages in thread
From: Terry Bowman @ 2026-03-02 20:36 UTC (permalink / raw)
To: dave, jonathan.cameron, dave.jiang, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
CXL drivers now implement protocol RAS support. PCI protocol errors,
however, continue to be reported via the AER capability and must still be
handled by a PCI error recovery callback.
Replace the existing cxl_error_detected() callback in cxl/pci.c with a
new cxl_pci_error_detected() implementation that handles uncorrectable
AER PCI protocol errors. Changes for PCI Correctable protocol errors will
be added in a future patch.
Introduce function cxl_uncor_aer_present() to handle and log the CXL
Endpoint's AER errors. Endpoint fatal AER errors are not currently logged by
the AER driver and require logging here with a call to pci_print_aer().
This cleanly separates CXL protocol error handling from PCI AER handling
and ensures that each subsystem processes only the errors it is
responsible for.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Assisted-by: Azure:gpt4.1-nano-key
---
Changes in v15->v16:
- Update commit message (DaveJ)
- s/cxl_handle_aer()/cxl_uncor_aer_present()/g (Jonathan)
- cxl_uncor_aer_present(): Leave original result calculation based on
if a UCE is present and the provided state (Terry)
- Add call to pci_print_aer(). AER fails to log because is upstream
link (Terry)
Changes in v14->v15:
- Update commit message and title. Added Bjorn's ack.
- Move CE and UCE handling logic here
Changes in v13->v14:
- Add Dave Jiang's review-by
- Update commit message & headline (Bjorn)
- Refactor cxl_port_error_detected()/cxl_port_cor_error_detected() to
one line (Jonathan)
- Remove cxl_walk_port() (Dan)
- Remove cxl_pci_drv_bound(). Check for 'is_cxl' parent port is
sufficient (Dan)
- Remove device_lock_if()
- Combined CE and UCE here (Terry)
Changes in v12->v13:
- Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
patch (Terry)
- Remove EP case in cxl_get_ras_base(), not used. (Terry)
- Remove check for dport->dport_dev (Dave)
- Remove whitespace (Terry)
Changes in v11->v12:
- Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
pci_to_cxl_dev()
- Change cxl_error_detected() -> cxl_cor_error_detected()
- Remove NULL variable assignments
- Replace bus_find_device() with find_cxl_port_by_uport() for upstream
port searches.
Changes in v10->v11:
- None
---
drivers/cxl/core/ras.c | 57 ++++++++++++++++++++++++------------------
drivers/cxl/cxlpci.h | 9 +++----
drivers/cxl/pci.c | 6 ++---
3 files changed, 39 insertions(+), 33 deletions(-)
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 254144d19764..884e40c66638 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -393,34 +393,41 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
}
EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
-pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
- pci_channel_state_t state)
+static bool cxl_uncor_aer_present(struct pci_dev *pdev)
{
- struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
- struct cxl_memdev *cxlmd = cxlds->cxlmd;
- struct device *dev = &cxlmd->dev;
- bool ue;
-
- scoped_guard(device, dev) {
- if (!dev->driver) {
- dev_warn(&pdev->dev,
- "%s: memdev disabled, abort error handling\n",
- dev_name(dev));
- return PCI_ERS_RESULT_DISCONNECT;
- }
+ struct aer_capability_regs aer_regs;
+ u32 fatal, aer_cap = pdev->aer_cap;
- if (cxlds->rcd)
- cxl_handle_rdport_errors(pdev);
- /*
- * A frozen channel indicates an impending reset which is fatal to
- * CXL.mem operation, and will likely crash the system. On the off
- * chance the situation is recoverable dump the status of the RAS
- * capability registers and bounce the active state of the memdev.
- */
- ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial,
- cxlmd->endpoint->regs.ras);
+ if (!aer_cap) {
+ pr_warn_ratelimited("%s: AER capability isn't present\n",
+ pci_name(pdev));
+ return false;
}
+ pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_STATUS,
+ &aer_regs.uncor_status);
+ pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_MASK,
+ &aer_regs.uncor_mask);
+ pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_SEVER,
+ &aer_regs.uncor_severity);
+
+ fatal = (aer_regs.uncor_status & aer_regs.uncor_severity);
+ pci_print_aer(pdev, fatal ? AER_FATAL : AER_NONFATAL, &aer_regs);
+
+ pci_aer_clear_nonfatal_status(pdev);
+ pci_aer_clear_fatal_status(pdev);
+
+ return aer_regs.uncor_status & ~aer_regs.uncor_mask;
+}
+
+pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
+ pci_channel_state_t state)
+{
+ bool ue = cxl_uncor_aer_present(pdev);
+ struct cxl_port *port = get_cxl_port(pdev);
+ struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
+ struct device *dev = &cxlmd->dev;
+
switch (state) {
case pci_channel_io_normal:
if (ue) {
@@ -441,7 +448,7 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
}
return PCI_ERS_RESULT_NEED_RESET;
}
-EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
+EXPORT_SYMBOL_NS_GPL(cxl_pci_error_detected, "CXL");
static void cxl_handle_proto_error(struct pci_dev *pdev, int severity)
{
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 0cf64218aa16..86029d96d6bb 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -79,15 +79,14 @@ void read_cdat_data(struct cxl_port *port);
#ifdef CONFIG_CXL_RAS
void cxl_cor_error_detected(struct pci_dev *pdev);
-pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
- pci_channel_state_t state);
void devm_cxl_dport_rch_ras_setup(struct cxl_dport *dport);
+pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
+ pci_channel_state_t error);
void devm_cxl_port_ras_setup(struct cxl_port *port);
#else
static inline void cxl_cor_error_detected(struct pci_dev *pdev) { }
-
-static inline pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
- pci_channel_state_t state)
+static inline pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
+ pci_channel_state_t state)
{
return PCI_ERS_RESULT_NONE;
}
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index fbb300a01830..b57f4727af53 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -1051,8 +1051,8 @@ static void cxl_reset_done(struct pci_dev *pdev)
}
}
-static const struct pci_error_handlers cxl_error_handlers = {
- .error_detected = cxl_error_detected,
+static const struct pci_error_handlers pci_error_handlers = {
+ .error_detected = cxl_pci_error_detected,
.slot_reset = cxl_slot_reset,
.resume = cxl_error_resume,
.cor_error_detected = cxl_cor_error_detected,
@@ -1063,7 +1063,7 @@ static struct pci_driver cxl_pci_driver = {
.name = KBUILD_MODNAME,
.id_table = cxl_mem_pci_tbl,
.probe = cxl_pci_probe,
- .err_handler = &cxl_error_handlers,
+ .err_handler = &pci_error_handlers,
.dev_groups = cxl_rcd_groups,
.driver = {
.probe_type = PROBE_PREFER_ASYNCHRONOUS,
--
2.34.1
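
As a reading aid for cxl_uncor_aer_present(), here is a minimal userspace
sketch of the usual AER convention: an uncorrectable error is pending when an
unmasked status bit is set, and it is fatal when a logged status bit also has
its severity bit set. The struct and function names are hypothetical and the
registers are plain integers rather than config space reads.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the three uncorrectable AER registers. */
struct aer_uncor_regs {
	uint32_t status;   /* Uncorrectable Error Status */
	uint32_t mask;     /* Uncorrectable Error Mask */
	uint32_t severity; /* Uncorrectable Error Severity */
};

enum aer_sev { SEV_NONFATAL, SEV_FATAL };

/* An error is reportable only if its status bit is not masked. */
static bool uncor_error_present(const struct aer_uncor_regs *r)
{
	return (r->status & ~r->mask) != 0;
}

/* Fatal if any logged error has its severity bit set, else non-fatal. */
static enum aer_sev uncor_severity(const struct aer_uncor_regs *r)
{
	return (r->status & r->severity) ? SEV_FATAL : SEV_NONFATAL;
}
```

Keeping the "present" and "severity" checks separate lets a caller log with
the right severity and still base the recovery decision on unmasked errors
only.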
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v16 08/10] cxl: Update Endpoint AER uncorrectable handler
2026-03-02 20:36 ` [PATCH v16 08/10] cxl: Update Endpoint AER uncorrectable handler Terry Bowman
@ 2026-03-09 14:12 ` Jonathan Cameron
2026-03-11 15:58 ` Bowman, Terry
0 siblings, 1 reply; 26+ messages in thread
From: Jonathan Cameron @ 2026-03-09 14:12 UTC (permalink / raw)
To: Terry Bowman
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On Mon, 2 Mar 2026 14:36:46 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> CXL drivers now implement protocol RAS support. PCI protocol errors,
> however, continue to be reported via the AER capability and must still be
> handled by a PCI error recovery callback.
>
> Replace the existing cxl_error_detected() callback in cxl/pci.c with a
> new cxl_pci_error_detected() implementation that handles uncorrectable
> AER PCI protocol errors. Changes for PCI Correctable protocol errors will
> be added in a future patch.
>
> Introduce function cxl_uncor_aer_present() to handle and log the CXL
> Endpoint's AER errors. Endpoint fatal AER errors are not currently logged by
> the AER driver and require logging here with a call to pci_print_aer().
>
> This cleanly separates CXL protocol error handling from PCI AER handling
> and ensures that each subsystem processes only the errors it is
> responsible for.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Assisted-by: Azure:gpt4.1-nano-key
One question inline.
>
> ---
>
> Changes in v15->v16:
> - Update commit message (DaveJ)
> - s/cxl_handle_aer()/cxl_uncor_aer_present()/g (Jonathan)
> - cxl_uncor_aer_present(): Leave original result calculation based on
> if a UCE is present and the provided state (Terry)
> - Add call to pci_print_aer(). AER fails to log because is upstream
> link (Terry)
>
> Changes in v14->v15:
> - Update commit message and title. Added Bjorn's ack.
> - Move CE and UCE handling logic here
>
> Changes in v13->v14:
> - Add Dave Jiang's review-by
> - Update commit message & headline (Bjorn)
> - Refactor cxl_port_error_detected()/cxl_port_cor_error_detected() to
> one line (Jonathan)
> - Remove cxl_walk_port() (Dan)
> - Remove cxl_pci_drv_bound(). Check for 'is_cxl' parent port is
> sufficient (Dan)
> - Remove device_lock_if()
> - Combined CE and UCE here (Terry)
>
> Changes in v12->v13:
> - Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
> patch (Terry)
> - Remove EP case in cxl_get_ras_base(), not used. (Terry)
> - Remove check for dport->dport_dev (Dave)
> - Remove whitespace (Terry)
>
> Changes in v11->v12:
> - Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
> pci_to_cxl_dev()
> - Change cxl_error_detected() -> cxl_cor_error_detected()
> - Remove NULL variable assignments
> - Replace bus_find_device() with find_cxl_port_by_uport() for upstream
> port searches.
>
> Changes in v10->v11:
> - None
> ---
> drivers/cxl/core/ras.c | 57 ++++++++++++++++++++++++------------------
> drivers/cxl/cxlpci.h | 9 +++----
> drivers/cxl/pci.c | 6 ++---
> 3 files changed, 39 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 254144d19764..884e40c66638 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
...
> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
> + pci_channel_state_t state)
> +{
> + bool ue = cxl_uncor_aer_present(pdev);
> + struct cxl_port *port = get_cxl_port(pdev);
This got a reference that wasn't (I think) previously taken.
I'm not spotting where that is released. If it is somewhere beyond
this function, good to add a comment saying where.
> + struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
> + struct device *dev = &cxlmd->dev;
> +
> switch (state) {
> case pci_channel_io_normal:
> if (ue) {
> @@ -441,7 +448,7 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> }
> return PCI_ERS_RESULT_NEED_RESET;
> }
> -EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
> +EXPORT_SYMBOL_NS_GPL(cxl_pci_error_detected, "CXL");
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v16 08/10] cxl: Update Endpoint AER uncorrectable handler
2026-03-09 14:12 ` Jonathan Cameron
@ 2026-03-11 15:58 ` Bowman, Terry
0 siblings, 0 replies; 26+ messages in thread
From: Bowman, Terry @ 2026-03-11 15:58 UTC (permalink / raw)
To: Jonathan Cameron
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On 3/9/2026 9:12 AM, Jonathan Cameron wrote:
> On Mon, 2 Mar 2026 14:36:46 -0600
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> CXL drivers now implement protocol RAS support. PCI protocol errors,
>> however, continue to be reported via the AER capability and must still be
>> handled by a PCI error recovery callback.
>>
>> Replace the existing cxl_error_detected() callback in cxl/pci.c with a
>> new cxl_pci_error_detected() implementation that handles uncorrectable
>> AER PCI protocol errors. Changes for PCI Correctable protocol errors will
>> be added in a future patch.
>>
>> Introduce function cxl_uncor_aer_present() to handle and log the CXL
>> Endpoint's AER errors. Endpoint fatal AER errors are not currently logged by
>> the AER driver and require logging here with a call to pci_print_aer().
>>
>> This cleanly separates CXL protocol error handling from PCI AER handling
>> and ensures that each subsystem processes only the errors it is
>> responsible for.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Assisted-by: Azure:gpt4.1-nano-key
> One question inline.
>
>>
>> ---
>>
>> Changes in v15->v16:
>> - Update commit message (DaveJ)
>> - s/cxl_handle_aer()/cxl_uncor_aer_present()/g (Jonathan)
>> - cxl_uncor_aer_present(): Leave original result calculation based on
>> if a UCE is present and the provided state (Terry)
>> - Add call to pci_print_aer(). AER fails to log because is upstream
>> link (Terry)
>>
>> Changes in v14->v15:
>> - Update commit message and title. Added Bjorn's ack.
>> - Move CE and UCE handling logic here
>>
>> Changes in v13->v14:
>> - Add Dave Jiang's review-by
>> - Update commit message & headline (Bjorn)
>> - Refactor cxl_port_error_detected()/cxl_port_cor_error_detected() to
>> one line (Jonathan)
>> - Remove cxl_walk_port() (Dan)
>> - Remove cxl_pci_drv_bound(). Check for 'is_cxl' parent port is
>> sufficient (Dan)
>> - Remove device_lock_if()
>> - Combined CE and UCE here (Terry)
>>
>> Changes in v12->v13:
>> - Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
>> patch (Terry)
>> - Remove EP case in cxl_get_ras_base(), not used. (Terry)
>> - Remove check for dport->dport_dev (Dave)
>> - Remove whitespace (Terry)
>>
>> Changes in v11->v12:
>> - Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
>> pci_to_cxl_dev()
>> - Change cxl_error_detected() -> cxl_cor_error_detected()
>> - Remove NULL variable assignments
>> - Replace bus_find_device() with find_cxl_port_by_uport() for upstream
>> port searches.
>>
>> Changes in v10->v11:
>> - None
>> ---
>> drivers/cxl/core/ras.c | 57 ++++++++++++++++++++++++------------------
>> drivers/cxl/cxlpci.h | 9 +++----
>> drivers/cxl/pci.c | 6 ++---
>> 3 files changed, 39 insertions(+), 33 deletions(-)
>>
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index 254144d19764..884e40c66638 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
> ...
>
>
>> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
>> + pci_channel_state_t state)
>> +{
>> + bool ue = cxl_uncor_aer_present(pdev);
>> + struct cxl_port *port = get_cxl_port(pdev);
>
> This got a reference that wasn't (I think) previously taken.
> I'm not spotting where that is released. If it is somewhere beyond
> this function, good to add a comment saying where.
>
>
This should be using the scope cleanup. I will change. Thanks.
-Terry
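
The scope cleanup mentioned above can be illustrated with a small userspace
sketch. It uses the GCC/Clang cleanup attribute, which is what the kernel's
scope-based helpers are built on. The port type, the get/put helpers and the
AUTO_PUT_PORT macro are hypothetical stand-ins, not the names in the series.

```c
#include <stddef.h>

/* Hypothetical refcounted object standing in for a CXL port. */
struct port {
	int refcount;
};

/* Take a reference; NULL is passed through untouched. */
static struct port *port_get(struct port *p)
{
	if (p)
		p->refcount++;
	return p;
}

/* Drop a reference. */
static void port_put(struct port *p)
{
	if (p)
		p->refcount--;
}

/* The cleanup attribute hands the handler a pointer to the variable. */
static void port_put_cleanup(struct port **pp)
{
	port_put(*pp);
}

/* Annotate a local so the reference is dropped when it leaves scope. */
#define AUTO_PUT_PORT __attribute__((cleanup(port_put_cleanup)))

static int handle_error(struct port *shared, int fail_early)
{
	struct port *p AUTO_PUT_PORT = port_get(shared);

	if (!p)
		return -1; /* nothing was taken, cleanup sees NULL */
	if (fail_early)
		return -2; /* early return still drops the reference */

	return 0; /* normal return drops it too */
}
```

Every return path releases the reference, so the count seen by the caller is
the same before and after handle_error() runs.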
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v16 09/10] cxl: Remove Endpoint AER correctable handler
2026-03-02 20:36 [PATCH v16 00/10] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (7 preceding siblings ...)
2026-03-02 20:36 ` [PATCH v16 08/10] cxl: Update Endpoint AER uncorrectable handler Terry Bowman
@ 2026-03-02 20:36 ` Terry Bowman
2026-03-09 14:13 ` Jonathan Cameron
2026-03-09 18:55 ` Dave Jiang
2026-03-02 20:36 ` [PATCH v16 10/10] cxl: Enable CXL protocol error reporting Terry Bowman
9 siblings, 2 replies; 26+ messages in thread
From: Terry Bowman @ 2026-03-02 20:36 UTC (permalink / raw)
To: dave, jonathan.cameron, dave.jiang, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
CXL drivers don't require a correctable PCI AER handler. Correctable AER
errors reported by CXL devices are logged and cleared in the AER driver.
This makes the correctable AER handler callback in the CXL driver
unnecessary.
Remove cxl_cor_error_detected() and drop the .cor_error_detected callback
from the CXL PCI error handlers.
This consolidates correctable error reporting under the CXL RAS infrastructure
and avoids redundant or conflicting logging with the AER driver.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
Changes in v15->v16:
- None
Changes in v14->v15:
- Remove cxl_pci_cor_error_detected(). Is not needed. AER is logged
in the AER driver. (Dan)
- Update commit message (Terry)
Changes in v13->v14:
- New commit
- Change cxl_cor_error_detected() parameter to &pdev->dev device from
memdev device. (Terry)
- Updated commit message (Terry)
---
drivers/cxl/core/ras.c | 23 -----------------------
drivers/cxl/cxlpci.h | 2 --
drivers/cxl/pci.c | 1 -
3 files changed, 26 deletions(-)
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 884e40c66638..d6112b812c82 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -370,29 +370,6 @@ cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
return PCI_ERS_RESULT_PANIC;
}
-void cxl_cor_error_detected(struct pci_dev *pdev)
-{
- struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
- struct cxl_memdev *cxlmd = cxlds->cxlmd;
- struct device *dev = &cxlds->cxlmd->dev;
-
- scoped_guard(device, dev) {
- if (!dev->driver) {
- dev_warn(&pdev->dev,
- "%s: memdev disabled, abort error handling\n",
- dev_name(dev));
- return;
- }
-
- if (cxlds->rcd)
- cxl_handle_rdport_errors(pdev);
-
- cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->serial,
- cxlmd->endpoint->regs.ras);
- }
-}
-EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
-
static bool cxl_uncor_aer_present(struct pci_dev *pdev)
{
struct aer_capability_regs aer_regs;
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 86029d96d6bb..184a95e96ea9 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -78,13 +78,11 @@ struct cxl_dev_state;
void read_cdat_data(struct cxl_port *port);
#ifdef CONFIG_CXL_RAS
-void cxl_cor_error_detected(struct pci_dev *pdev);
void devm_cxl_dport_rch_ras_setup(struct cxl_dport *dport);
pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
pci_channel_state_t error);
void devm_cxl_port_ras_setup(struct cxl_port *port);
#else
-static inline void cxl_cor_error_detected(struct pci_dev *pdev) { }
static inline pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
pci_channel_state_t state)
{
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index b57f4727af53..77a2ee57222b 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -1055,7 +1055,6 @@ static const struct pci_error_handlers pci_error_handlers = {
.error_detected = cxl_pci_error_detected,
.slot_reset = cxl_slot_reset,
.resume = cxl_error_resume,
- .cor_error_detected = cxl_cor_error_detected,
.reset_done = cxl_reset_done,
};
--
2.34.1
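
Dropping an optional callback from a handler table is safe when the core
checks for it before calling. The sketch below models that pattern in
userspace: the core always does its own logging and only notifies the driver
when the callback is set. All names here are hypothetical and the counters
exist only to make the behavior observable.

```c
#include <stddef.h>

/* Hypothetical handler table with one optional callback. */
struct err_handlers {
	void (*cor_error_detected)(void *ctx); /* may be NULL */
};

/* Counters used to observe what ran. */
struct stats {
	int core_logged;
	int driver_notified;
};

/* Example driver callback that only records that it was called. */
static void count_notify(void *ctx)
{
	((struct stats *)ctx)->driver_notified++;
}

/* Core path: log unconditionally, notify the driver only if it asked. */
static void core_handle_correctable(const struct err_handlers *h,
				    struct stats *s)
{
	s->core_logged++;
	if (h && h->cor_error_detected)
		h->cor_error_detected(s);
}
```

With the callback left NULL the error is still counted by the core, which is
the behavior the commit message relies on.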
^ permalink raw reply related [flat|nested] 26+ messages in thread
* Re: [PATCH v16 09/10] cxl: Remove Endpoint AER correctable handler
2026-03-02 20:36 ` [PATCH v16 09/10] cxl: Remove Endpoint AER correctable handler Terry Bowman
@ 2026-03-09 14:13 ` Jonathan Cameron
2026-03-09 18:55 ` Dave Jiang
1 sibling, 0 replies; 26+ messages in thread
From: Jonathan Cameron @ 2026-03-09 14:13 UTC (permalink / raw)
To: Terry Bowman
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On Mon, 2 Mar 2026 14:36:47 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> CXL drivers don't require a correctable PCI AER handler. Correctable AER
> errors reported by CXL devices are logged and cleared in the AER driver.
> This makes the correctable AER handler callback in the CXL driver
> unnecessary.
>
> Remove cxl_cor_error_detected() and drop the .cor_error_detected callback
> from the CXL PCI error handlers.
>
> This consolidates correctable error reporting under the CXL RAS infrastructure
> and avoids redundant or conflicting logging with the AER driver.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Nice.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH v16 09/10] cxl: Remove Endpoint AER correctable handler
2026-03-02 20:36 ` [PATCH v16 09/10] cxl: Remove Endpoint AER correctable handler Terry Bowman
2026-03-09 14:13 ` Jonathan Cameron
@ 2026-03-09 18:55 ` Dave Jiang
1 sibling, 0 replies; 26+ messages in thread
From: Dave Jiang @ 2026-03-09 18:55 UTC (permalink / raw)
To: Terry Bowman, dave, jonathan.cameron, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci
On 3/2/26 1:36 PM, Terry Bowman wrote:
> CXL drivers don't require a correctable PCI AER handler. Correctable AER
> errors reported by CXL devices are logged and cleared in the AER driver.
> This makes the correctable AER handler callback in the CXL driver
> unnecessary.
>
> Remove cxl_cor_error_detected() and drop the .cor_error_detected callback
> from the CXL PCI error handlers.
>
> This consolidates correctable error reporting under the CXL RAS infrastructure
> and avoids redundant or conflicting logging with the AER driver.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>
> ---
>
> Changes in v15->v16:
> - None
>
> Changes in v14->v15:
> - Remove cxl_pci_cor_error_detected(). Is not needed. AER is logged
> in the AER driver. (Dan)
> - Update commit message (Terry)
>
> Changes in v13->v14:
> - New commit
> - Change cxl_cor_error_detected() parameter to &pdev->dev device from
> memdev device. (Terry)
> - Updated commit message (Terry)
> ---
> drivers/cxl/core/ras.c | 23 -----------------------
> drivers/cxl/cxlpci.h | 2 --
> drivers/cxl/pci.c | 1 -
> 3 files changed, 26 deletions(-)
>
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 884e40c66638..d6112b812c82 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -370,29 +370,6 @@ cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> return PCI_ERS_RESULT_PANIC;
> }
>
> -void cxl_cor_error_detected(struct pci_dev *pdev)
> -{
> - struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> - struct cxl_memdev *cxlmd = cxlds->cxlmd;
> - struct device *dev = &cxlds->cxlmd->dev;
> -
> - scoped_guard(device, dev) {
> - if (!dev->driver) {
> - dev_warn(&pdev->dev,
> - "%s: memdev disabled, abort error handling\n",
> - dev_name(dev));
> - return;
> - }
> -
> - if (cxlds->rcd)
> - cxl_handle_rdport_errors(pdev);
> -
> - cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->serial,
> - cxlmd->endpoint->regs.ras);
> - }
> -}
> -EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
> -
> static bool cxl_uncor_aer_present(struct pci_dev *pdev)
> {
> struct aer_capability_regs aer_regs;
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 86029d96d6bb..184a95e96ea9 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -78,13 +78,11 @@ struct cxl_dev_state;
> void read_cdat_data(struct cxl_port *port);
>
> #ifdef CONFIG_CXL_RAS
> -void cxl_cor_error_detected(struct pci_dev *pdev);
> void devm_cxl_dport_rch_ras_setup(struct cxl_dport *dport);
> pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
> pci_channel_state_t error);
> void devm_cxl_port_ras_setup(struct cxl_port *port);
> #else
> -static inline void cxl_cor_error_detected(struct pci_dev *pdev) { }
> static inline pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
> pci_channel_state_t state)
> {
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index b57f4727af53..77a2ee57222b 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -1055,7 +1055,6 @@ static const struct pci_error_handlers pci_error_handlers = {
> .error_detected = cxl_pci_error_detected,
> .slot_reset = cxl_slot_reset,
> .resume = cxl_error_resume,
> - .cor_error_detected = cxl_cor_error_detected,
> .reset_done = cxl_reset_done,
> };
>
^ permalink raw reply [flat|nested] 26+ messages in thread
* [PATCH v16 10/10] cxl: Enable CXL protocol error reporting
2026-03-02 20:36 [PATCH v16 00/10] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (8 preceding siblings ...)
2026-03-02 20:36 ` [PATCH v16 09/10] cxl: Remove Endpoint AER correctable handler Terry Bowman
@ 2026-03-02 20:36 ` Terry Bowman
9 siblings, 0 replies; 26+ messages in thread
From: Terry Bowman @ 2026-03-02 20:36 UTC (permalink / raw)
To: dave, jonathan.cameron, dave.jiang, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
CXL protocol errors are not enabled for all CXL devices after boot. They
must be enabled in order to process CXL protocol errors.
Introduce cxl_unmask_proto_interrupts() to call pci_aer_unmask_internal_errors().
pci_aer_unmask_internal_errors() expects pdev->aer_cap to be initialized,
but pdev->aer_cap is not initialized for CXL Upstream Switch Ports and CXL
Downstream Switch Ports. Initialize pdev->aer_cap if necessary. Enable AER
correctable internal errors and uncorrectable internal errors for all CXL
devices.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
---
Changes in v15->v16:
- None
Changes in v14->v15:
- None
Changes in v13->v14:
- Update commit title's prefix (Bjorn)
Changes in v12->v13:
- Add dev and dev_is_pci() NULL checks in cxl_unmask_proto_interrupts() (Terry)
- Add Dave Jiang's and Ben's Reviewed-by tags
Changes in v11->v12:
- None
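For readers following the commit message, the lazy aer_cap lookup followed by the internal-error unmask can be sketched in userspace as shown here. This is only an illustration under stated assumptions, not the kernel implementation: struct mock_pdev, cfg_read32(), cfg_write32() and the mock_* functions are hypothetical stand-ins for struct pci_dev and the PCI config accessors, the int return value is added purely so the result can be checked, and the register offsets and bit values are assumed to match include/uapi/linux/pci_regs.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed to mirror include/uapi/linux/pci_regs.h. */
#define PCI_CFG_SPACE_EXP_SIZE	4096
#define PCI_EXT_CAP_START	0x100
#define PCI_EXT_CAP_ID_ERR	0x0001
#define PCI_ERR_UNCOR_MASK	0x08
#define PCI_ERR_UNC_INTN	0x00400000u
#define PCI_ERR_COR_MASK	0x14
#define PCI_ERR_COR_INTERNAL	0x00004000u

/* Hypothetical stand-in for struct pci_dev: config space as a byte array. */
struct mock_pdev {
	uint8_t cfg[PCI_CFG_SPACE_EXP_SIZE];
	uint16_t aer_cap;
};

static uint32_t cfg_read32(const struct mock_pdev *pdev, uint16_t off)
{
	uint32_t v;

	assert((size_t)off + sizeof(v) <= sizeof(pdev->cfg));
	memcpy(&v, &pdev->cfg[off], sizeof(v));
	return v;
}

static void cfg_write32(struct mock_pdev *pdev, uint16_t off, uint32_t v)
{
	assert((size_t)off + sizeof(v) <= sizeof(pdev->cfg));
	memcpy(&pdev->cfg[off], &v, sizeof(v));
}

/* Walk the extended capability list: ID in bits 15:0, next offset in 31:20. */
static uint16_t mock_find_ext_capability(const struct mock_pdev *pdev,
					 uint16_t id)
{
	uint16_t pos = PCI_EXT_CAP_START;
	int ttl = (PCI_CFG_SPACE_EXP_SIZE - PCI_EXT_CAP_START) / 8;

	while (pos >= PCI_EXT_CAP_START && ttl-- > 0) {
		uint32_t hdr = cfg_read32(pdev, pos);

		if (hdr == 0)
			break;
		if ((hdr & 0xffff) == id)
			return pos;
		pos = (hdr >> 20) & 0xffc;
	}
	return 0;
}

/* Clear the internal-error bits in both AER mask registers. */
static void mock_unmask_internal_errors(struct mock_pdev *pdev)
{
	uint16_t aer = pdev->aer_cap;
	uint32_t mask;

	mask = cfg_read32(pdev, aer + PCI_ERR_UNCOR_MASK);
	cfg_write32(pdev, aer + PCI_ERR_UNCOR_MASK, mask & ~PCI_ERR_UNC_INTN);

	mask = cfg_read32(pdev, aer + PCI_ERR_COR_MASK);
	cfg_write32(pdev, aer + PCI_ERR_COR_MASK, mask & ~PCI_ERR_COR_INTERNAL);
}

/* Same shape as the patch: find the AER capability only if not cached. */
static int mock_unmask_proto_interrupts(struct mock_pdev *pdev)
{
	if (!pdev)
		return -1;

	if (!pdev->aer_cap) {
		pdev->aer_cap = mock_find_ext_capability(pdev,
							 PCI_EXT_CAP_ID_ERR);
		if (!pdev->aer_cap)
			return -1;
	}

	mock_unmask_internal_errors(pdev);
	return 0;
}
```

In this sketch a device without an AER extended capability is left untouched, which corresponds to the early return in cxl_unmask_proto_interrupts().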
---
drivers/cxl/core/port.c | 2 ++
drivers/cxl/core/ras.c | 22 ++++++++++++++++++++++
drivers/cxl/cxlpci.h | 4 ++++
3 files changed, 28 insertions(+)
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 27271402915f..c33d58fb7264 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1852,6 +1852,8 @@ int devm_cxl_enumerate_ports(struct cxl_memdev *cxlmd)
rc = cxl_add_ep(dport, &cxlmd->dev);
+ cxl_unmask_proto_interrupts(cxlmd->cxlds->dev);
+
/*
* If the endpoint already exists in the port's list,
* that's ok, it was added on a previous pass.
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index d6112b812c82..bfe6cb35154e 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -119,6 +119,24 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
}
static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
+void cxl_unmask_proto_interrupts(struct device *dev)
+{
+ if (!dev || !dev_is_pci(dev))
+ return;
+
+ struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(to_pci_dev(dev));
+
+ if (!pdev->aer_cap) {
+ pdev->aer_cap = pci_find_ext_capability(pdev,
+ PCI_EXT_CAP_ID_ERR);
+ if (!pdev->aer_cap)
+ return;
+ }
+
+ pci_aer_unmask_internal_errors(pdev);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_unmask_proto_interrupts, "CXL");
+
static void cxl_dport_map_ras(struct cxl_dport *dport)
{
struct cxl_register_map *map = &dport->reg_map;
@@ -129,6 +147,8 @@ static void cxl_dport_map_ras(struct cxl_dport *dport)
else if (cxl_map_component_regs(map, &dport->regs.component,
BIT(CXL_CM_CAP_CAP_ID_RAS)))
dev_dbg(dev, "Failed to map RAS capability.\n");
+
+ cxl_unmask_proto_interrupts(dev);
}
/**
@@ -172,6 +192,8 @@ void devm_cxl_port_ras_setup(struct cxl_port *port)
if (cxl_map_component_regs(map, &port->regs,
BIT(CXL_CM_CAP_CAP_ID_RAS)))
dev_dbg(&port->dev, "Failed to map RAS capability\n");
+
+ cxl_unmask_proto_interrupts(port->uport_dev);
}
EXPORT_SYMBOL_NS_GPL(devm_cxl_port_ras_setup, "CXL");
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 184a95e96ea9..b7cf9d6137b3 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -82,6 +82,7 @@ void devm_cxl_dport_rch_ras_setup(struct cxl_dport *dport);
pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
pci_channel_state_t error);
void devm_cxl_port_ras_setup(struct cxl_port *port);
+void cxl_unmask_proto_interrupts(struct device *dev);
#else
static inline pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
pci_channel_state_t state)
@@ -96,6 +97,9 @@ static inline void devm_cxl_dport_rch_ras_setup(struct cxl_dport *dport)
static inline void devm_cxl_port_ras_setup(struct cxl_port *port)
{
}
+static inline void cxl_unmask_proto_interrupts(struct device *dev)
+{
+}
#endif
#endif /* __CXL_PCI_H__ */
--
2.34.1
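As a side note on the `__free(pci_dev_put)` declaration in cxl_unmask_proto_interrupts(): the reference taken by pci_dev_get() is dropped automatically on every return path, including the early return when no AER capability is found. A rough userspace analogue, assuming GCC or Clang and the compiler's cleanup attribute, might look like the following. All names here (struct mock_dev, mock_dev_get(), mock_dev_put(), MOCK_FREE_DEV, mock_touch_device()) are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted device, standing in for struct pci_dev. */
struct mock_dev {
	int refcount;
	int aer_cap;
};

static struct mock_dev *mock_dev_get(struct mock_dev *dev)
{
	if (dev)
		dev->refcount++;
	return dev;
}

static void mock_dev_put(struct mock_dev *dev)
{
	if (dev) {
		assert(dev->refcount > 0);
		dev->refcount--;
	}
}

/* The cleanup attribute passes a pointer to the annotated variable. */
static void mock_dev_put_cleanup(struct mock_dev **devp)
{
	mock_dev_put(*devp);
}

#define MOCK_FREE_DEV __attribute__((cleanup(mock_dev_put_cleanup)))

static int mock_touch_device(struct mock_dev *dev)
{
	if (!dev)
		return -1;

	/* Declared after the NULL check, as in the patch. */
	struct mock_dev *held MOCK_FREE_DEV = mock_dev_get(dev);

	if (!held->aer_cap)
		return -1;	/* reference dropped on this early return */

	return 0;		/* and on the normal return */
}
```

Declaring the variable only after the NULL check keeps the get and put paired: no reference is taken, and no cleanup runs, for the path that bails out before the declaration.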