* [PATCH v15 0/9] Enable CXL PCIe Port Protocol Error handling and logging
@ 2026-02-03 2:52 Terry Bowman
From: Terry Bowman @ 2026-02-03 2:52 UTC
To: dave, jonathan.cameron, dave.jiang, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
This patch series enables CXL protocol error handling for both CXL Ports
and CXL Endpoints (EP). The previous revision can be found here:
https://lore.kernel.org/linux-cxl/20260114182055.46029-1-terry.bowman@amd.com/
The base for this series is: 9a8920ca8ebf (for-7.0/cxl-aer-prep) +
376e2f17f3fc (Dan's v2 series).[1]
This introduces CXL error handling as an error plane implemented on top of
the existing PCIe AER handling. The CXL handling relies on the alternate
protocol training and pcie_is_cxl() to properly decode the error type and
forward the CXL protocol error to the cxl_core driver via a kfifo.
The cxl_core driver then dequeues the work (containing the CXL protocol
error details) from the kfifo. The correctable (CE) or uncorrectable (UCE)
handler is called for handling and logging. This all follows a flow similar
to that of the AER driver.
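The forward/dequeue/dispatch flow described above can be sketched in plain
userspace C. This is only an illustration under assumptions: every name
below (sketch_fifo, sketch_forward_error(), sketch_drain(), and so on) is
hypothetical, a fixed-size ring stands in for the kernel kfifo, and the
handlers are reduced to counters. It is not the code in this series.

```c
#include <stdbool.h>

/* Hypothetical types for illustration; not the series' definitions. */
enum sketch_severity { SKETCH_CORRECTABLE, SKETCH_UNCORRECTABLE };

struct sketch_proto_err {
	unsigned int bdf;          /* device that reported the error */
	enum sketch_severity sev;  /* decides which handler runs */
};

#define SKETCH_FIFO_DEPTH 8        /* power of two so index masking works */

struct sketch_fifo {
	struct sketch_proto_err slots[SKETCH_FIFO_DEPTH];
	unsigned int in;           /* count of entries ever queued */
	unsigned int out;          /* count of entries ever dequeued */
};

/* Producer side (the AER driver's role): queue one error, fail if full. */
static bool sketch_forward_error(struct sketch_fifo *f,
				 struct sketch_proto_err e)
{
	if (f->in - f->out == SKETCH_FIFO_DEPTH)
		return false;
	f->slots[f->in++ & (SKETCH_FIFO_DEPTH - 1)] = e;
	return true;
}

/* Remove the oldest queued error, if any. */
static bool sketch_dequeue(struct sketch_fifo *f, struct sketch_proto_err *e)
{
	if (f->in == f->out)
		return false;
	*e = f->slots[f->out++ & (SKETCH_FIFO_DEPTH - 1)];
	return true;
}

struct sketch_counts {
	unsigned int ce;           /* correctable handler invocations */
	unsigned int uce;          /* uncorrectable handler invocations */
};

/* Consumer side (the cxl_core role): drain the FIFO and dispatch by severity. */
static void sketch_drain(struct sketch_fifo *f, struct sketch_counts *c)
{
	struct sketch_proto_err e;

	while (sketch_dequeue(f, &e)) {
		if (e.sev == SKETCH_CORRECTABLE)
			c->ce++;   /* stand-in for the CE handler */
		else
			c->uce++;  /* stand-in for the UCE handler */
	}
}
```

In this sketch the producer never blocks: when the ring is full the error is
dropped and the caller is told so, which is one possible policy and an
assumption here, not a statement about the series.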
PCIe AER CE and UCE callback handling changes are also required. The CXL
PCIe CE handler is unnecessary and can be removed because the AER driver
already provides the handling and logging. Update the PCIe UCE handler to
return the correct pcie_ers_result_t based on the AER status and remove
the unnecessary logging.
== Patch Details ==
The first patch introduces the AER-CXL kfifo instance and all the
required related kfifo changes.
Patch#2 adds the serial value to the RAS trace log. It also updates
the parameter list for use with CXL Port devices.
Patch#3 introduces PCI_ERS_RESULT_PANIC. This version does not
use merge_result() and avoids the related exporting. This series handles
and logs errors only for the erroring device and does not iterate the topology.
Patch#4 dequeues work from the AER-CXL kfifo.
Patch#5 adds infrastructure directing CXL protocol errors to correctable
or uncorrectable handlers.
Patch#6 updates the 2 CXL RAS handlers to include handling and logging
for CXL Port devices.
Patch#7 updates the UCE handler to 'handle' the AER status.
Patch#8 removes the PCIe AER CE handler because the handling and
logging are already performed by the AER driver.
== Notes ==
One side effect to note is that an EP or USP fatal UCE will be handled by
CXL AER callbacks instead of CXL protocol handler callbacks. This is
because the upstream link is in an unknown state, which prevents access to
the device's AER capability. Without the AER info on hand, the AER driver
assumes the error is an AER error. Another patch series is in flight that
addresses this by adding a link validity check in the AER driver:
https://lore.kernel.org/linux-pci/20260124074557.73961-1-xueshuai@linux.alibaba.com/
Dan's related series addressing RAS setup has more details.[1]
[1] cxl/port: Unify RAS setup across port types
https://lore.kernel.org/linux-cxl/20260131000403.2135324-1-dan.j.williams@intel.com/
== Testing ==
Below are the testing results while using QEMU. The QEMU testing uses a
CXL Root Port, CXL Upstream Switch Port, CXL Downstream Switch Port and CXL
Endpoint as given below. I've attached the QEMU startup command line used.
This testing uses protocol error injection at all the devices.
The sub-topology for the QEMU testing is:
---------------------
| CXL RP - 0C:00.0 |
---------------------
|
---------------------
| CXL USP - 0D:00.0 |
---------------------
|
---------------------
| CXL DSP - 0E:00.0 |
---------------------
|
---------------------
| CXL EP - 0F:00.0 |
---------------------
root@tbowman-cxl:~# lspci -t
-+-[0000:00]- -00.0
| +-01.0
| +-02.0
| +-03.0
| +-1f.0
| +-1f.2
| \-1f.3
\-[0000:0c]---00.0-[0d-0f]----00.0-[0e-0f]----00.0-[0f]----00.0
The topology was created with:
${qemu} -boot menu=on \
-cpu host \
-nographic \
-monitor telnet:127.0.0.1:1234,server,nowait \
-M virt,cxl=on \
-chardev stdio,id=s1,signal=off,mux=on -serial none \
-device isa-serial,chardev=s1 -mon chardev=s1,mode=readline \
-machine q35,cxl=on \
-m 16G,maxmem=24G,slots=8 \
-cpu EPYC-v3 \
-smp 32 \
-accel kvm \
-drive file=${img},format=raw,index=0,media=disk \
-device e1000,netdev=user.0 \
-netdev user,id=user.0,hostfwd=tcp::5555-:22 \
-object memory-backend-file,id=cxl-mem0,share=on,mem-path=/tmp/cxltest.raw,size=256M \
-object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/cxltest1.raw,size=256M \
-object memory-backend-file,id=cxl-mem2,share=on,mem-path=/tmp/cxltest2.raw,size=256M \
-object memory-backend-file,id=cxl-mem3,share=on,mem-path=/tmp/cxltest3.raw,size=256M \
-object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa0.raw,size=256M \
-object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/lsa1.raw,size=256M \
-object memory-backend-file,id=cxl-lsa2,share=on,mem-path=/tmp/lsa2.raw,size=256M \
-object memory-backend-file,id=cxl-lsa3,share=on,mem-path=/tmp/lsa3.raw,size=256M \
-device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
-device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
-device cxl-upstream,bus=root_port0,id=us0 \
-device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
-device cxl-type3,bus=swport0,volatile-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-vmem0 \
-device cxl-downstream,port=1,bus=us0,id=swport1,chassis=0,slot=5 \
-device cxl-type3,bus=swport1,volatile-memdev=cxl-mem1,lsa=cxl-lsa1,id=cxl-vmem1 \
-device cxl-downstream,port=2,bus=us0,id=swport2,chassis=0,slot=6 \
-device cxl-type3,bus=swport2,volatile-memdev=cxl-mem2,lsa=cxl-lsa2,id=cxl-vmem2 \
-device cxl-downstream,port=3,bus=us0,id=swport3,chassis=0,slot=7 \
-device cxl-type3,bus=swport3,volatile-memdev=cxl-mem3,lsa=cxl-lsa3,id=cxl-vmem3 \
-M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k
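Note that pxb-cxl takes bus_nr=12 in decimal, while lspci and the kernel
logs print the bus in hexadecimal (0c). A small sketch of the
domain:bus:device.function formatting makes the mapping explicit. The helper
name sketch_format_bdf() is hypothetical and exists only for this
illustration.

```c
#include <stdio.h>

/*
 * Format a PCI address the way lspci and the kernel logs print it:
 * 4 hex digits of domain, 2 of bus, 2 of device, 1 of function.
 * Returns the snprintf() result.
 */
static int sketch_format_bdf(char *buf, size_t len, unsigned int domain,
			     unsigned int bus, unsigned int dev,
			     unsigned int fn)
{
	return snprintf(buf, len, "%04x:%02x:%02x.%x", domain, bus, dev, fn);
}
```

Calling sketch_format_bdf(buf, sizeof(buf), 0, 12, 0, 0) yields the Root
Port address seen in the logs, 0000:0c:00.0.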
=== Root Port ===
root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00004000/0000a000
pcieport 0000:0c:00.0: [14] CorrIntErr
cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='CRC Threshold Hit'
root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
pcieport 0000:0c:00.0: [22] UncorrIntErr
cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
Kernel panic - not syncing: CXL cachemem error.
CPU: 26 UID: 0 PID: 176 Comm: kworker/26:0 Tainted: G E 6.19.0-rc5-00036-g95c8d7385a48 #776 PREEMPT(voluntary)
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Workqueue: events cxl_proto_err_work_fn [cxl_core]
Call Trace:
<TASK>
dump_stack_lvl+0x26/0xc0
dump_stack+0x10/0x20
vpanic+0x35e/0x3b0
panic+0x57/0x60
cxl_proto_err_work_fn+0x430/0x440 [cxl_core]
process_one_work+0x22b/0x600
worker_thread+0x195/0x350
kthread+0x119/0x230
? __pfx_worker_thread+0x10/0x10
? __pfx_kthread+0x10/0x10
ret_from_fork+0x261/0x2e0
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Kernel Offset: disabled
---[ end Kernel panic - not syncing: CXL cachemem error. ]---
=== Upstream Switch Port ===
root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
aer_event: 0000:0d:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00004000/0000a000
pcieport 0000:0d:00.0: [14] CorrIntErr
cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='CRC Threshold Hit'
root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
aer_event: 0000:0d:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
pcieport 0000:0d:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
pcieport 0000:0c:00.0: unlocked secondary bus reset via: pciehp_reset_slot+0x50/0x80
pcieport 0000:0c:00.0: AER: Root Port link has been reset (0)
cxl_pci 0000:0f:00.0: mem0: restart CXL.mem after slot reset
cxl_pci 0000:10:00.0: mem1: restart CXL.mem after slot reset
cxl_pci 0000:11:00.0: mem3: restart CXL.mem after slot reset
cxl_pci 0000:12:00.0: mem2: restart CXL.mem after slot reset
cxl_pci 0000:0f:00.0: mem0: error resume successful
cxl_pci 0000:10:00.0: mem1: error resume successful
cxl_pci 0000:11:00.0: mem3: error resume successful
cxl_pci 0000:12:00.0: mem2: error resume successful
pcieport 0000:0c:00.0: AER: device recovery successful
=== Downstream Switch Port ===
root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00004000/0000a000
pcieport 0000:0e:00.0: [14] CorrIntErr
cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='CRC Threshold Hit'
root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00400000/02000000
pcieport 0000:0e:00.0: [22] UncorrIntErr
cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
Kernel panic - not syncing: CXL cachemem error.
CPU: 26 UID: 0 PID: 274 Comm: kworker/26:1 Tainted: G E 6.19.0-rc5-00036-g95c8d7385a48 #776 PREEMPT(voluntary)
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Workqueue: events cxl_proto_err_work_fn [cxl_core]
Call Trace:
<TASK>
dump_stack_lvl+0x26/0xc0
dump_stack+0x10/0x20
vpanic+0x35e/0x3b0
panic+0x57/0x60
cxl_proto_err_work_fn+0x430/0x440 [cxl_core]
process_one_work+0x22b/0x600
worker_thread+0x195/0x350
kthread+0x119/0x230
? __pfx_worker_thread+0x10/0x10
? __pfx_kthread+0x10/0x10
ret_from_fork+0x261/0x2e0
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Kernel Offset: disabled
---[ end Kernel panic - not syncing: CXL cachemem error. ]---
=== Endpoint ===
root@tbowman-cxl:~/aer-inject# ./ep-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0f:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0f:00.0
aer_event: 0000:0f:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
cxl_pci 0000:0f:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
cxl_pci 0000:0f:00.0: device [8086:0d93] error status/mask=00004000/0000a000
cxl_pci 0000:0f:00.0: [14] CorrIntErr
cxl_port_aer_correctable_error: device=0000:0f:00.0 host=0000:0e:00.0 status='CRC Threshold Hit'
root@tbowman-cxl:~/aer-inject# ./ep-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0f:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0f:00.0
aer_event: 0000:0f:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
cxl_pci 0000:0f:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
pcieport 0000:0e:00.0: unlocked secondary bus reset via: pciehp_reset_slot+0x50/0x80
pcieport 0000:0e:00.0: AER: Downstream Port link has been reset (0)
cxl_pci 0000:0f:00.0: mem0: error resume successful
pcieport 0000:0e:00.0: AER: device recovery
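For readers matching the injected values to the decoded lines in the logs
above: 00004000 is a single-bit correctable status mask (bit 14, printed as
CorrIntErr) and 00400000 is a single-bit uncorrectable status mask (bit 22,
printed as UncorrIntErr). The sketch below shows that decoding. The helper
names are hypothetical, and the name lookups cover only the two bits
exercised in this testing.

```c
#include <stdint.h>

/* Return the index of the lowest set bit in a status mask, or -1 if none. */
static int sketch_first_set_bit(uint32_t status)
{
	for (int bit = 0; bit < 32; bit++)
		if (status & (UINT32_C(1) << bit))
			return bit;
	return -1;
}

/* Name of a correctable status bit; only bit 14 is covered here. */
static const char *sketch_cor_name(int bit)
{
	return bit == 14 ? "CorrIntErr" : "unknown";
}

/* Name of an uncorrectable status bit; only bit 22 is covered here. */
static const char *sketch_uncor_name(int bit)
{
	return bit == 22 ? "UncorrIntErr" : "unknown";
}
```

With these helpers, sketch_first_set_bit(0x00004000) gives 14 and
sketch_first_set_bit(0x00400000) gives 22, matching the "[14] CorrIntErr"
and "[22] UncorrIntErr" lines.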
== Changes ==
Changes in v14->v15:
PCI/AER: Introduce AER-CXL Kfifo in new file, pcie/aer_cxl_vh.c
- Move pci_dev_get() call to this patch (Dave)
cxl: Update CXL Endpoint tracing
- Update commit message
- Moved cxl_handle_ras/cxl_handle_cor_ras() changes to future patch (Terry)
PCI/ERR: Introduce PCI_ERS_RESULT_PANIC
- None
PCI/AER: Dequeue forwarded CXL error
- Move pci_dev_get() to cxl_forward_error() (Dave)
- Move in is_cxl_error() change from later patch (Terry)
PCI: Establish common CXL Port protocol error flow
- Update commit message and title. Added Bjorn's ack.
- Move CE and UCE handling logic here (Terry)
cxl: Update error handlers to support CXL Port protocol errors
- New commit (Terry)
cxl: Update Endpoint AER uncorrectable handler
- Title update (Terry)
- Change cxl_pci_error_detected() to handle & log AER (Terry)
- Update commit message (Terry)
- Moved cxl_handle_ras()/cxl_handle_cor_ras() to earlier patch (Terry)
cxl: Remove Endpoint AER correctable handler
- Remove cxl_pci_cor_error_detected(). Is not needed. AER is logged
in the AER driver. (Dan)
- Update commit message (Terry)
cxl: Enable CXL protocol error reporting
- Update commit title's prefix (Bjorn)
Changes in v13->v14:
PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
- Add Jonathan's and Dan's review-by
- Update commit title prefix (Bjorn)
- Revert format fix for cxl_sbr_masked() (Jonathan)
- Update 'Compute Express Link' comment block (Jonathan)
- Move PCI_DVSEC_CXL_FLEXBUS definitions to later patch where
used (Jonathan)
- Removed stray change (Bjorn)
PCI: Update CXL DVSEC definitions
- New patch. Split from previous patch such that there is now a separate
move patch and a format fix patch.
- Formatting update requested (Bjorn)
- Remove PCI_DVSEC_HEADER1_LENGTH_MASK because it duplicates
PCI_DVSEC_HEADER1_LEN() (Bjorn)
- Add Dan's review-by
PCI: Introduce pcie_is_cxl()
- Move FLEXBUS_STATUS DVSEC here (Jonathan)
- Remove check for EP and USP (Dan)
- Update commit message (Bjorn)
- Fix writing past 80 columns (Bjorn)
- Add pci_is_pcie() parent bridge check at beginning of function (Bjorn)
PCI: Replace cxl_error_is_native() with pcie_aer_is_native()
- New commit
cxl/pci: Move CXL driver's RCH error handling into core/ras_rch.c
- Add sign-off for Dan and Jonathan
- Revert inadvertent formatting of cxl_dport_map_rch_aer() (Jonathan)
- Remove default value for CXL_RCH_RAS (Dan)
- Remove unnecessary pci.h include in core.h & ras_rch.c (Jonathan)
- Add linux/types.h include in ras_rch.c (Jonathan)
- Change CONFIG_CXL_RCH_RAS -> CONFIG_CXL_RAS (Dan)
PCI/AER: Export pci_aer_unmask_internal_errors
- New commit. Bjorn requested separating out and adding immediately
before being used. This is called from cxl_rch_enable_rcec() in
following patch.
PCI/AER: Update is_internal_error() to be non-static is_aer_internal_error()
- New commit
PCI/AER: Move CXL RCH error handling to aer_cxl_rch.c
- Add review-by and signed-off for Dan
- Commit message fixup (Dan)
- Update commit message with use-case description (Dan, Lukas)
- Make cxl_error_is_native() static (Dan)
- Make is_internal_error() non-static, non-export (Terry)
PCI/AER: Use guard() in cxl_rch_handle_error_iter()
- Add review-by for Jonathan, Dave Jiang, Dan Williams, and Bjorn
- Remove cleanup.h (Jonathan)
- Reverted comment removal (Bjorn)
- Move this patch after pci/pcie/aer_cxl_rch.c creation (Bjorn)
PCI/AER: Replace PCIEAER_CXL symbol with CXL_RAS
- New commit
PCI/AER: Report CXL or PCIe bus type in AER trace logging
- Merged with Dan's commit. Changes are moving bus_type the last
parameter in function calls (Dan)
- Removed all DCOs because of changes (Terry)
- Update commit message (Bjorn)
- Add Bjorn's ack-by
PCI/AER: Update struct aer_err_info with kernel-doc formatting
- New commit
cxl/mem: Clarify @host for devm_cxl_add_nvdimm()
- New commit
cxl/port: Remove "enumerate dports" helpers
- New commit
cxl/port: Fix devm resource leaks around with dport management
- New commit
cxl/port: Move dport operations to a driver event
- New commit
cxl/port: Move dport RAS reporting to a port resource
- New commit
cxl: Map CXL Endpoint Port and CXL Switch Port RAS registers
- Correct message spelling (Terry)
cxl/port: Move endpoint component register management to cxl_port
- Correct message spelling (Terry)
cxl/port: Map Port component registers before switchport init
- Updates to use cxl_port_setup_regs() (Dan)
cxl: Change CXL handlers to use guard() instead of scoped_guard()
- Add reviewed-by for Jonathan and Dave Jiang
PCI/ERR: Introduce PCI_ERS_RESULT_PANIC
- Add review-by for Dan
- Update Title prefix (Bjorn)
- Removed merge_result. Only logging error for device reporting the
error (Dan)
- Remove PCI_ERS_RESULT_PANIC paragraph in pci-error-recovery.rst (Bjorn)
PCI/AER: Move AER driver's CXL VH handling to pcie/aer_cxl_vh.c
- Replaced workqueue_types.h include with 'struct work_struct'
predeclaration (Bjorn)
- Update error message (Bjorn)
- Reordered 'struct cxl_proto_err_work_data' (Bjorn)
- Remove export of cxl_error_is_native() here (Bjorn)
cxl/port: Unify endpoint and switch port lookup
- New patch
PCI/AER: Dequeue forwarded CXL error
- Update commit title's prefix (Bjorn)
- Add pdev ref get in AER driver before enqueue and add pdev ref put in
CXL driver after dequeue and handling (Dan)
- Removed handling to simplify patch context (Terry)
PCI: Introduce CXL Port protocol error handlers
- Add Dave Jiang's review-by
- Update commit message & headline (Bjorn)
- Refactor cxl_port_error_detected()/cxl_port_cor_error_detected() to
one line (Jonathan)
- Remove cxl_walk_port(). Only log the erroring device. No port walking. (Dan)
- Remove cxl_pci_drv_bound(). Check for 'is_cxl' parent port is
sufficient (Dan)
- Remove device_lock_if()
- Combine CE and UCE here (Terry)
cxl: Update Endpoint uncorrectable protocol error handling
- Update commit headline (Bjorn)
- Rename pci_error_detected()/pci_cor_error_detected() ->
cxl_pci_error_detected/cxl_pci_cor_error_detected() (Jonathan)
- Remove now-invalid comment in cxl_error_detected() (Jonathan)
- Split into separate patches for UCE and CE (Terry)
cxl: Update Endpoint correctable protocol error handling
- New commit
- Change cxl_cor_error_detected() parameter to &pdev->dev device from
memdev device. (Terry)
cxl: Enable CXL protocol errors during CXL Port probe
- Update commit title's prefix (Bjorn)
Changes in v12->v13:
CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
- Add Dave Jiang's reviewed-by
- Remove changes to existing PCI_DVSEC_CXL_PORT* defines. Update commit
message. (Jonathan)
PCI/CXL: Introduce pcie_is_cxl()
- Add Ben's "reviewed-by"
cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
- None
cxl/pci: Remove unnecessary CXL RCH handling helper functions
- None
cxl: Remove CXL VH handling in CONFIG_PCIEAER_CXL conditional blocks from core
- None
cxl: Move CXL driver's RCH error handling into core/ras_rch.c
- None
CXL/AER: Replace device_lock() in cxl_rch_handle_error_iter() with guard() lock
- New patch
CXL/AER: Move AER drivers RCH error handling into pcie/aer_cxl_rch.c
- Add forward declaration of 'struct aer_err_info' in pci/pci.h (Terry)
- Changed copyright date from 2025 to 2023 (Jonathan)
- Add Dave Jiang's, Jonathan's, and Ben's review-by
- Readd 'struct aer_err_info' (Bot)
PCI/AER: Report CXL or PCIe bus error type in trace logging
- Remove duplicated aer_err_info inline comments. Is already in the
kernel-doc header (Ben)
cxl/pci: Update RAS handler interfaces to also support CXL Ports
- None
cxl/pci: Log message if RAS registers are unmapped
- Added Ben's review-by
cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
- Added Dave Jiang's review-by
cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
- Add Ben's review-by
cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
- Change as result of dport delay fix. No longer need switchport and
endport approach. Refactor. (Terry)
CXL/PCI: Introduce PCI_ERS_RESULT_PANIC
- Add Dave Jiang's, Jonathan's, Ben's review-by
- Typo fix (Ben)
CXL/AER: Introduce pcie/aer_cxl_vh.c in AER driver for forwarding CXL errors
- Add Dave Jiang's review-by
- Update error message (Ben)
cxl: Introduce cxl_pci_drv_bound() to check for bound driver
- Add Dave Jiang's review-by.
cxl: Change CXL handlers to use guard() instead of scoped_guard()
- New patch
cxl/pci: Introduce CXL protocol error handlers for endpoints
- Updated all the implementation and commit message. (Terry)
- Refactored cxl_cor_error_detected()/cxl_error_detected() to remove
pdev (Dave Jiang)
CXL/PCI: Introduce CXL Port protocol error handlers
- Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
patch (Terry)
- Remove EP case in cxl_get_ras_base(), not used. (Terry)
- Remove check for dport->dport_dev (Dave)
- Remove whitespace (Terry)
PCI/AER: Dequeue forwarded CXL error
- Rewrite cxl_handle_proto_error() and cxl_proto_err_work_fn() (Terry)
- Rename get_cxl_host dev() to be get_cxl_port() (Terry)
- Remove exporting of unused function, pci_aer_clear_fatal_status() (Dave Jiang)
- Change pr_err() calls to ratelimited. (Terry)
- Update commit message. (Terry)
- Remove namespace qualifier from pcie_clear_device_status()
export (Dave Jiang)
- Move locks into cxl_proto_err_work_fn() (Dave)
- Update log messages in cxl_forward_error() (Ben)
CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
- Renamed pci_ers_merge_result() to pcie_ers_merge_result().
pci_ers_merge_result() is already used in eeh driver. (Bot)
CXL/PCI: Introduce CXL uncorrectable protocol error recovery
- Rewrite report_error_detected() and cxl_walk_port (Terry)
- Add guard() before calling cxl_pci_drv_bound() (Dave Jiang)
- Add guard() calls for EP (cxlds->cxlmd->dev & pdev->dev) and ports
(pdev->dev & parent cxl_port) in cxl_report_error_detected() and
cxl_handle_proto_error() (Terry)
- Remove unnecessary check for endpoint port. (Dave Jiang)
- Remove check for RCIEP EP in cxl_report_error_detected() (Terry)
CXL/PCI: Enable CXL protocol errors during CXL Port probe
- Add dev and dev_is_pci() NULL checks in cxl_unmask_proto_interrupts() (Terry)
- Add Dave Jiang's and Ben's review-by
CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup
- Added dev and dev_is_pci() checks in cxl_mask_proto_interrupts() (Terry)
Changes in v11 -> v12:
cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
- Added Dave Jiang's review by
- Moved to front of series
cxl/pci: Remove unnecessary CXL RCH handling helper functions
- Add reviewed-by for Alejandro & Dave Jiang
- Moved to front of series
cxl: Remove ifdef blocks of CONFIG_PCIEAER_CXL from core/pci.c
- Update CONFIG_CXL_RAS in CXL Kconfig to have CXL_PCI dependency (Terry)
CXL/AER: Remove CONFIG_PCIEAER_CXL and replace with CONFIG_CXL_RAS
- Added review-by for Sathyanarayanan
- Changed Kconfig dependency from PCIEAER_CXL to PCIEAER. Moved
this backwards into this patch.
cxl: Move CXL driver RCH error handling into CONFIG_CXL_RCH_RAS conditio
- Moved CXL_RCH_RAS Kconfig definition here from following commit
CXL/AER: Introduce aer_cxl_rch.c into AER driver for handling CXL RCH errors
- Rename drivers/pci/pcie/cxl_rch.c to drivers/pci/pcie/aer_cxl_rch.c (Lukas)
- Removed forward declaration of 'struct aer_err_info' in pci/pci.h (Terry)
CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
- Change formatting to be same as existing definitions
- Change GENMASK() -> __GENMASK() and BIT() to _BITUL()
PCI/CXL: Introduce pcie_is_cxl()
- Add review-by for Alejandro
- Add comment in set_pcie_cxl() explaining why updating parent status.
PCI/AER: Report CXL or PCIe bus error type in trace logging
- Change aer_err_info::is_cxl to be a bool bitfield. Update structure padding. (Lukas)
- Add kernel-doc for 'struct aer_err_info' (Lukas)
cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
- Correct parameters to call trace_cxl_aer_correctable_error() (Shiju)
- Add reviewed-by for Jonathan and Shiju
cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
- Add check for dport_parent->rch before calling cxl_dport_init_ras_reporting().
- RCH dports are initialized from cxl_dport_init_ras_reporting cxl_mem_probe().
CXL/PCI: Introduce PCI_ERS_RESULT_PANIC
- Documentation requested by (Lukas)
CXL/AER: Introduce aer_cxl_vh.c in AER driver for forwarding CXL errors
- Rename drivers/pci/pcie/cxl_aer.c to drivers/pci/pcie/aer_cxl_vh.c (Lukas)
cxl: Introduce cxl_pci_drv_bound() to check for bound driver
- New patch
PCI/AER: Dequeue forwarded CXL error
- Add guard for CE case in cxl_handle_proto_error() (Dave)
- Updated commit message (Terry)
CXL/PCI: Introduce CXL Port protocol error handlers
- Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
pci_to_cxl_dev() (Lukas)
- Change cxl_error_detected() -> cxl_cor_error_detected() (Terry)
- Remove NULL variable assignments (Jonathan)
- Replace bus_find_device() with find_cxl_port_by_uport() for upstream
port searches. (Dave)
CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
- Remove static inline pci_ers_merge_result() definition for !CONFIG_PCIEAER.
Is not needed. (Lukas)
CXL/PCI: Introduce CXL uncorrectable protocol error recovery
- Clean up port discovery in cxl_do_recovery() (Dave)
- Add PCI_EXP_TYPE_RC_END to type check in cxl_report_error_detected()
Changes in v10 -> v11:
cxl: Remove ifdef blocks of CONFIG_PCIEAER_CXL from core/pci.c
- New patch
CXL/AER: Remove CONFIG_PCIEAER_CXL and replace with CONFIG_CXL_RAS
- New patch
cxl/pci: Remove unnecessary CXL RCH handling helper functions
- New patch
cxl: Move CXL driver RCH error handling into CONFIG_CXL_RCH_RAS conditional block
- New patch
CXL/AER: Introduce rch_aer.c into AER driver for handling CXL RCH errors
- Remove changes in code-split and move to earlier, new patch
- Add #include <linux/bitfield.h> to cxl_ras.c
- Move cxl_rch_handle_error() & cxl_rch_enable_rcec() declarations from pci.h
to aer.h, more localized.
- Introduce CONFIG_CXL_RCH_RAS, includes Makefile changes, ras.c ifdef changes
CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
- New patch
PCI/CXL: Introduce pcie_is_cxl()
- Amended set_pcie_cxl() to check for Upstream Port's and EP's parent
downstream port by calling set_pcie_cxl(). (Dan)
- Retitle patch: 'Add' -> 'Introduce'
- Add check for CXL.mem and CXL.cache (Alejandro, Dan)
PCI/AER: Report CXL or PCIe bus error type in trace logging
- Remove duplicate call to trace_aer_event() (Shiju)
- Added Dan Williams' and Dave Jiang's reviewed-by
CXL/AER: Update PCI class code check to use FIELD_GET()
- Add #include <linux/bitfield.h> to cxl_ras.c (Terry)
- Removed line wrapping at "(CXL 3.2, 8.1.12.1)". (Jonathan)
cxl/pci: Log message if RAS registers are unmapped
- Added Dave Jiang's review-by (Terry)
cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
- Updated CE and UCE trace routines to maintain consistent TP_Struct ABI
and unchanged TP_printk() logging. (Shiju, Alison)
cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
- Added Dave Jiang and Jonathan Cameron's review-by
- Changes moved to core/ras.c
cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
- Use local pointer for readability in cxl_switch_port_init_ras() (Jonathan Cameron)
- Rename port to be ep in cxl_endpoint_port_init_ras() (Dave Jiang)
- Rename dport to be parent_dport in cxl_endpoint_port_init_ras()
and cxl_switch_port_init_ras() (Dave Jiang)
- Port helper changes were in cxl/port.c, now in core/ras.c (Dave Jiang)
cxl/pci: Introduce CXL Endpoint protocol error handlers
- cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
- cxl_error_detected() - Remove extra line (Shiju)
- Changes moved to core/ras.c (Terry)
- cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
- Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
- Move #include "pci.h" from cxl.h to core.h (Terry)
- Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)
CXL/AER: Introduce cxl_aer.c into AER driver for forwarding CXL errors
- Move RCH implementation to cxl_rch.c and RCH declarations to pci/pci.h. (Terry)
- Introduce 'struct cxl_proto_err_kfifo' containing semaphore, fifo,
and work struct. (Dan)
- Remove embedded struct from cxl_proto_err_work (Dan)
- Make 'struct work_struct *cxl_proto_err_work' definition static (Jonathan)
- Add check for NULL cxl_proto_err_kfifo to determine if CXL driver is
not registered for workqueue. (Dan)
PCI/AER: Dequeue forwarded CXL error
- Reword patch commit message to remove RCiEP details (Jonathan)
- Add #include <linux/bitfield.h> (Terry)
- is_cxl_rcd() - Fix short comment message wrap (Jonathan)
- is_cxl_rcd() - Combine return calls into 1 (Jonathan)
- cxl_handle_proto_error() - Move comment earlier (Jonathan)
- Use FIELD_GET() in discovering class code (Jonathan)
- Remove BDF from cxl_proto_err_work_data. Use 'struct pci_dev *' (Dan)
CXL/PCI: Introduce CXL Port protocol error handlers
- Removed check for PCI_EXP_TYPE_RC_END in cxl_report_error_detected() (Terry)
- Update is_cxl_error() to check for acceptable PCI EP and port types
CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
- pci_ers_merge_result() - Change export to non-namespace and rename
to be pci_ers_merge_result() (Jonathan)
- Move pci_ers_merge_result() definition to pci.h. Needs pci_ers_result (Terry)
CXL/PCI: Introduce CXL uncorrectable protocol error recovery
- pci_ers_merge_results() - Move to earlier patch
CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup
- Remove guard() in cxl_mask_proto_interrupts(). Observed device lockup/block
during testing. (Terry)
Changes in v9 -> v10:
- Add drivers/pci/pcie/cxl_aer.c
- Add drivers/cxl/core/native_ras.c
- Change cxl_register_prot_err_work()/cxl_unregister_prot_err_work to return void
- Check for pcie_ports_native in cxl_do_recovery()
- Remove debug logging in cxl_do_recovery()
- Update PCI_ERS_RESULT_PANIC definition to indicate is CXL specific
- Revert trace logging changes: name,parent -> memdev,host.
- Use FIELD_GET() to check for EP class code (cxl_aer.c & native_ras.c).
- Change _prot_ to _proto_ everywhere
- cxl_rch_handle_error_iter(), check if driver is cxl_pci_driver
- Remove cxl_create_prot_error_info(). Move logic into forward_cxl_error()
- Remove sbdf_to_pci() and move logic into cxl_handle_proto_error()
- Simplify/refactor get_pci_cxl_host_dev()
- Simplify/refactor cxl_get_ras_base()
- Move patch 'Remove unnecessary CXL Endpoint handling helper functions' to front
- Update description for 'CXL/PCI: Introduce CXL Port protocol error
handlers' with why state is not used to determine handling
- Introduce cxl_pci_drv_bound() and call from cxl_rch_handle_error_iter()
Changes in v8 -> v9:
- Updated reference counting to use pci_get_device()/pci_put_device() in
cxl_disable_prot_errors()/cxl_enable_prot_errors()
- Refactored cxl_create_prot_err_info() to fix reference counting
- Removed 'struct cxl_port' driver changes for error handler. Instead
check for CXL device type (EP or Port device) and call handler
- Make pcie_is_cxl() static inline in include/linux/linux.h
- Remove NULL check in create_prot_err_info()
- Change success return in cxl_ras_init() to use hardcoded 0
- Changed 'struct work_struct cxl_prot_err_work' declaration to static
- Change to use rate limited log with dev anchor in forward_cxl_error()
- Refactored forward_cxl_error() to remove severity auto variable
- Changed pci_aer_clear_nonfatal_status() to be static inline for
!(CONFIG_PCIEAER)
- Renamed merge_result() to be cxl_merge_result()
- Removed 'ue' condition in cxl_error_detected()
- Updated 2nd parameter in call to __cxl_handle_cor_ras()/__cxl_handle_ras()
in unify patch
- Added log message for failure while assigning interrupt disable callback
- Updated pci_aer_mask_internal_errors() to use pci_clear_and_set_config_dword()
- Simplified patch titles for clarity
- Moved CXL error interrupt disabling into cxl/core/port.c with CXL Port
teardown
- Updated 'struct cxl_port_err_info' to only contain sbdf and severity
Removed everything else.
- Added pdev and CXL device get_device()/put_device() before calling handlers
Changes in v7 -> v8:
[Dan] Use kfifo. Move handling to CXL driver. AER forwards error to CXL
driver
[Dan] Add device reference incrementors where needed throughout
[Dan] Initiate CXL Port RAS init from Switch Port and Endpoint Port init
[Dan] Combine CXL Port and CXL Endpoint trace routine
[Dan] Introduce aer_info::is_cxl. Use to indicate CXL or PCI errors
[Jonathan] Add serial number for all devices in trace
[DaveJ] Move find_cxl_port() change into patch using it
[Terry] Move CXL Port RAS init into cxl/port.c
[Terry] Moved kfifo functions into cxl/core/ras.c
Changes in v6 -> v7:
[Terry] Move updated trace routine call to later patch. Was causing build
error.
Changes in v5 -> v6:
[Ira] Move pcie_is_cxl(dev) define to an inline function
[Ira] Update returning value from pcie_is_cxl_port() to bool w/o cast
[Ira] Change cxl_report_error_detected() cleanup to return correct bool
[Ira] Introduce and use PCI_ERS_RESULT_PANIC
[Ira] Reuse comment for PCIe and CXL recovery paths
[Jonathan] Add type check in for cxl_handle_cor_ras() and cxl_handle_ras()
[Jonathan] cxl_uport/dport_init_ras_reporting(), added a mutex.
[Jonathan] Add logging example to patches updating trace output
[Jonathan] Make parameter 'const' to eliminate the need for a cast in match_uport()
[Jonathan] Use __free() in cxl_pci_port_ras()
[Terry] Add patch to log the PCIe SBDF along with CXL device name
[Terry] Add patch to handle CXL endpoint and RCH DP errors as CXL errors
[Terry] Remove patch w USP UCE fatal support @ aer_get_device_error_info()
[Terry] Rebase to cxl/next commit 5585e342e8d3 ("cxl/memdev: Remove unused partition values")
[Gregory] Pre-initialize pointer to NULL in cxl_pci_port_ras()
[Gregory] Move AER driver bus name detection to a static function
Changes in v4 -> v5:
[Alejandro] Refactor cxl_walk_bridge to simplify 'status' variable usage
[Alejandro] Add WARN_ONCE() in __cxl_handle_ras() and cxl_handle_cor_ras()
[Ming] Remove unnecessary NULL check in cxl_pci_port_ras()
[Terry] Add failure check for call to to_cxl_port() in cxl_pci_port_ras()
[Ming] Use port->dev for call to devm_add_action_or_reset() in
cxl_dport_init_ras_reporting() and cxl_uport_init_ras_reporting()
[Jonathan] Use get_device()/put_device() to prevent race condition in
cxl_clear_port_error_handlers() and cxl_clear_port_error_handlers()
[Terry] Commit message cleanup. Capitalize keywords from CXL and PCI
specifications
Changes in v3 -> v4:
[Lukas] Capitalize PCIe and CXL device names as in specifications
[Lukas] Move call to pcie_is_cxl() into cxl_port_devsec()
[Lukas] Correct namespace spelling
[Lukas] Removed export from pcie_is_cxl_port()
[Lukas] Simplify 'if' blocks in cxl_handle_error()
[Lukas] Change panic message to remove redundant 'panic' text
[Ming] Update to call cxl_dport_init_ras_reporting() in RCH case
[lkp@intel] 'host' parameter is already removed. Remove parameter description too.
[Terry] Added field description for cxl_err_handlers in pci.h comment block
Changes in v1 -> v2:
[Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
[Jonathan] Update description to DSP map patch description
[Jonathan] Update cxl_pci_port_ras() to check for NULL port
[Jonathan] Don't call handler before handler port changes are present (patch order)
[Bjorn] Fix linebreak in cover sheet URL
[Bjorn] Remove timestamps from test logs in cover sheet
[Bjorn] Retitle AER commits to use "PCI/AER:"
[Bjorn] Retitle patch#3 to use renaming instead of refactoring
[Bjorn] Fix base commit-id on cover sheet
[Bjorn] Add VH spec reference/citation
[Terry] Removed last 2 patches to enable internal errors. They are not needed
because internal errors are enabled in AER driver.
[Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
[Dan] Use kernel panic in CXL recovery
[Dan] cxl_port_hndlrs -> cxl_port_error_handlers
Terry Bowman (9):
PCI/AER: Introduce AER-CXL Kfifo in new file, pcie/aer_cxl_vh.c
cxl: Update CXL Endpoint tracing
PCI/ERR: Introduce PCI_ERS_RESULT_PANIC
PCI/AER: Dequeue forwarded CXL error
PCI: Establish common CXL Port protocol error flow
cxl: Update error handlers to support CXL Port protocol errors
cxl: Update Endpoint AER uncorrectable handler
cxl: Remove Endpoint AER correctable handler
cxl: Enable CXL protocol error reporting
Documentation/PCI/pci-error-recovery.rst | 2 +
drivers/cxl/core/core.h | 18 +-
drivers/cxl/core/port.c | 8 +-
drivers/cxl/core/ras.c | 336 +++++++++++++++++++----
drivers/cxl/core/ras_rch.c | 6 +-
drivers/cxl/core/trace.h | 21 +-
drivers/cxl/cxlpci.h | 15 +-
drivers/cxl/pci.c | 7 +-
drivers/pci/pci.c | 1 +
drivers/pci/pci.h | 2 -
drivers/pci/pcie/Makefile | 1 +
drivers/pci/pcie/aer.c | 16 +-
drivers/pci/pcie/aer_cxl_vh.c | 82 ++++++
drivers/pci/pcie/portdrv.h | 4 +
include/linux/aer.h | 24 ++
include/linux/pci.h | 5 +
16 files changed, 444 insertions(+), 104 deletions(-)
create mode 100644 drivers/pci/pcie/aer_cxl_vh.c
Uses 9a8920ca8ebf (cxl/for-7.0/cxl-aer-prep) +
376e2f17f3fca84bf5a2e707d0c47ba22665df6d (Dan's series) as prerequisite
base-commit: 376e2f17f3fca84bf5a2e707d0c47ba22665df6d
--
2.34.1
* [PATCH v15 1/9] PCI/AER: Introduce AER-CXL Kfifo in new file, pcie/aer_cxl_vh.c
2026-02-03 2:52 [PATCH v15 0/9] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
@ 2026-02-03 2:52 ` Terry Bowman
2026-02-04 4:25 ` dan.j.williams
2026-02-03 2:52 ` [PATCH v15 2/9] cxl: Update CXL Endpoint tracing Terry Bowman
` (7 subsequent siblings)
8 siblings, 1 reply; 31+ messages in thread
From: Terry Bowman @ 2026-02-03 2:52 UTC (permalink / raw)
To: dave, jonathan.cameron, dave.jiang, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
CXL virtual hierarchy (VH) RAS handling for CXL Port devices will be added
soon. This requires a notification mechanism for the AER driver to share
the AER interrupt with the CXL driver. The notification will be used as an
indication for the CXL drivers to handle and log the CXL RAS errors.
Note, 'CXL protocol error' terminology will refer to CXL VH and not
CXL RCH errors unless specifically noted going forward.
Introduce a new file in the AER driver, drivers/pci/pcie/aer_cxl_vh.c, to
handle the CXL protocol errors.
Add a kfifo work queue to be used by the AER and CXL drivers. The AER
driver will be the sole kfifo producer adding work and the cxl_core will be
the sole kfifo consumer removing work. Add the boilerplate kfifo support.
Encapsulate the kfifo, RW semaphore, and work pointer in a single structure.
Add CXL work queue handler registration functions in the AER driver and
export them so the CXL driver can assign or clear the work handler function.
Synchronize accesses using the RW semaphore.
Introduce 'struct cxl_proto_err_work_data' to serve as the kfifo work data.
This will contain a reference to the PCI error source device and the error
severity. This will be used when the work is dequeued by the cxl_core driver.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
---
Changes in v14->v15:
- Moved pci_dev_get() call to this patch (Dave)
Changes in v13 -> v14:
- Replaced workqueue_types.h include with 'struct work_struct'
predeclaration (Bjorn)
- Update error message (Bjorn)
- Reordered 'struct cxl_proto_err_work_data' (Bjorn)
- Remove export of cxl_error_is_native() here (Bjorn)
Changes in v12->v13:
- Added Dave Jiang's review-by
- Update error message (Ben)
Changes in v11->v12:
- None
Changes in v10->v11:
- cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
- cxl_error_detected() - Remove extra line (Shiju)
- Changes moved to core/ras.c (Terry)
- cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
- Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
- Move #include "pci.h" from cxl.h to core.h (Terry)
- Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)
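For illustration only, and not part of this patch: the single-producer /
single-consumer hand-off described in the commit message can be modelled in
plain userspace C roughly as below. Every demo_* name is hypothetical, the
registered-consumer flag stands in for the work pointer, and the RW
semaphore and schedule_work() are left out. The sketch takes the device
reference only once it knows the item will be queued, so its failure path
has nothing to undo; that ordering is a choice made for the sketch and is
not a description of the code in this patch.

```c
/* Userspace model of the AER -> CXL hand-off; not kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_FIFO_SIZE 4u	/* power of two, as a kfifo would use */

struct demo_dev {
	int refcount;		/* stands in for the pci_dev reference count */
};

struct demo_work_data {
	struct demo_dev *dev;
	int severity;
};

struct demo_fifo {
	bool consumer_registered;	/* stands in for the work pointer */
	unsigned int in, out;		/* free-running write/read indices */
	struct demo_work_data slot[DEMO_FIFO_SIZE];
};

/* Consumer side: announce or withdraw the handler. */
static void demo_register(struct demo_fifo *f, bool registered)
{
	assert(f != NULL);
	f->consumer_registered = registered;
}

/* Producer side: false when no consumer is registered or the FIFO is full. */
static bool demo_forward(struct demo_fifo *f, struct demo_dev *dev, int severity)
{
	assert(f != NULL && dev != NULL);
	if (!f->consumer_registered || f->in - f->out == DEMO_FIFO_SIZE)
		return false;

	dev->refcount++;	/* the reference travels with the queued item */
	f->slot[f->in++ % DEMO_FIFO_SIZE] =
		(struct demo_work_data){ .dev = dev, .severity = severity };
	return true;
}

/* Consumer side: false when the FIFO is empty. */
static bool demo_get(struct demo_fifo *f, struct demo_work_data *wd)
{
	assert(f != NULL && wd != NULL);
	if (f->in == f->out)
		return false;

	*wd = f->slot[f->out++ % DEMO_FIFO_SIZE];
	return true;
}
```

A caller would register first, forward from the producer context, and later
drain with demo_get(), dropping one reference per item it receives.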
---
drivers/pci/pcie/Makefile | 1 +
drivers/pci/pcie/aer.c | 15 ++-----
drivers/pci/pcie/aer_cxl_vh.c | 79 +++++++++++++++++++++++++++++++++++
drivers/pci/pcie/portdrv.h | 4 ++
include/linux/aer.h | 22 ++++++++++
5 files changed, 110 insertions(+), 11 deletions(-)
create mode 100644 drivers/pci/pcie/aer_cxl_vh.c
diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index b0b43a18c304..62d3d3c69a5d 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_PCIEPORTBUS) += pcieportdrv.o bwctrl.o
obj-y += aspm.o
obj-$(CONFIG_PCIEAER) += aer.o err.o tlp.o
obj-$(CONFIG_CXL_RAS) += aer_cxl_rch.o
+obj-$(CONFIG_CXL_RAS) += aer_cxl_vh.o
obj-$(CONFIG_PCIEAER_INJECT) += aer_inject.o
obj-$(CONFIG_PCIE_PME) += pme.o
obj-$(CONFIG_PCIE_DPC) += dpc.o
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 49a4bd13c2d2..7af10a74da34 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1155,16 +1155,6 @@ void pci_aer_unmask_internal_errors(struct pci_dev *dev)
*/
EXPORT_SYMBOL_FOR_MODULES(pci_aer_unmask_internal_errors, "cxl_core");
-#ifdef CONFIG_CXL_RAS
-bool is_aer_internal_error(struct aer_err_info *info)
-{
- if (info->severity == AER_CORRECTABLE)
- return info->status & PCI_ERR_COR_INTERNAL;
-
- return info->status & PCI_ERR_UNC_INTN;
-}
-#endif
-
/**
* pci_aer_handle_error - handle logging error into an event log
* @dev: pointer to pci_dev data structure of error source device
@@ -1201,7 +1191,10 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
{
cxl_rch_handle_error(dev, info);
- pci_aer_handle_error(dev, info);
+ if (is_cxl_error(dev, info))
+ cxl_forward_error(dev, info);
+ else
+ pci_aer_handle_error(dev, info);
pci_dev_put(dev);
}
diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
new file mode 100644
index 000000000000..de8bca383159
--- /dev/null
+++ b/drivers/pci/pcie/aer_cxl_vh.c
@@ -0,0 +1,79 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
+
+#include <linux/types.h>
+#include <linux/bitfield.h>
+#include <linux/kfifo.h>
+#include <linux/aer.h>
+#include "../pci.h"
+#include "portdrv.h"
+
+#define CXL_ERROR_SOURCES_MAX 128
+
+struct cxl_proto_err_kfifo {
+ struct work_struct *work;
+ struct rw_semaphore rw_sema;
+ DECLARE_KFIFO(fifo, struct cxl_proto_err_work_data,
+ CXL_ERROR_SOURCES_MAX);
+};
+
+static struct cxl_proto_err_kfifo cxl_proto_err_kfifo = {
+ .rw_sema = __RWSEM_INITIALIZER(cxl_proto_err_kfifo.rw_sema)
+};
+
+bool is_aer_internal_error(struct aer_err_info *info)
+{
+ if (info->severity == AER_CORRECTABLE)
+ return info->status & PCI_ERR_COR_INTERNAL;
+
+ return info->status & PCI_ERR_UNC_INTN;
+}
+
+bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
+{
+ if (!info || !info->is_cxl)
+ return false;
+
+ if (pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT)
+ return false;
+
+ return is_aer_internal_error(info);
+}
+
+void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info)
+{
+ struct cxl_proto_err_work_data wd = (struct cxl_proto_err_work_data) {
+ .severity = info->severity,
+ .pdev = pdev
+ };
+
+ guard(rwsem_read)(&cxl_proto_err_kfifo.rw_sema);
+ pci_dev_get(pdev);
+ if (!cxl_proto_err_kfifo.work || !kfifo_put(&cxl_proto_err_kfifo.fifo, wd)) {
+ dev_err_ratelimited(&pdev->dev, "AER-CXL kfifo error");
+ return;
+ }
+
+ schedule_work(cxl_proto_err_kfifo.work);
+}
+
+void cxl_register_proto_err_work(struct work_struct *work)
+{
+ guard(rwsem_write)(&cxl_proto_err_kfifo.rw_sema);
+ cxl_proto_err_kfifo.work = work;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_register_proto_err_work, "CXL");
+
+void cxl_unregister_proto_err_work(void)
+{
+ guard(rwsem_write)(&cxl_proto_err_kfifo.rw_sema);
+ cxl_proto_err_kfifo.work = NULL;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_unregister_proto_err_work, "CXL");
+
+int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd)
+{
+ guard(rwsem_read)(&cxl_proto_err_kfifo.rw_sema);
+ return kfifo_get(&cxl_proto_err_kfifo.fifo, wd);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_proto_err_kfifo_get, "CXL");
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index cc58bf2f2c84..66a6b8099c96 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -130,9 +130,13 @@ struct aer_err_info;
bool is_aer_internal_error(struct aer_err_info *info);
void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info);
void cxl_rch_enable_rcec(struct pci_dev *rcec);
+bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info);
+void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info);
#else
static inline bool is_aer_internal_error(struct aer_err_info *info) { return false; }
static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { }
static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
+static inline bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info) { return false; }
+static inline void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info) { }
#endif /* CONFIG_CXL_RAS */
#endif /* _PORTDRV_H_ */
diff --git a/include/linux/aer.h b/include/linux/aer.h
index df0f5c382286..f351e41dd979 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -53,6 +53,16 @@ struct aer_capability_regs {
u16 uncor_err_source;
};
+/**
+ * struct cxl_proto_err_work_data - Error information used in CXL error handling
+ * @pdev: PCI device detecting the error
+ * @severity: AER severity
+ */
+struct cxl_proto_err_work_data {
+ struct pci_dev *pdev;
+ int severity;
+};
+
#if defined(CONFIG_PCIEAER)
int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
int pcie_aer_is_native(struct pci_dev *dev);
@@ -66,6 +76,18 @@ static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
#endif
+struct work_struct;
+
+#ifdef CONFIG_CXL_RAS
+int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd);
+void cxl_register_proto_err_work(struct work_struct *work);
+void cxl_unregister_proto_err_work(void);
+#else
+static inline int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd) { return 0; }
+static inline void cxl_register_proto_err_work(struct work_struct *work) { }
+static inline void cxl_unregister_proto_err_work(void) { }
+#endif
+
void pci_print_aer(struct pci_dev *dev, int aer_severity,
struct aer_capability_regs *aer);
int cper_severity_to_aer(int cper_severity);
--
2.34.1
* [PATCH v15 2/9] cxl: Update CXL Endpoint tracing
2026-02-03 2:52 [PATCH v15 0/9] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
2026-02-03 2:52 ` [PATCH v15 1/9] PCI/AER: Introduce AER-CXL Kfifo in new file, pcie/aer_cxl_vh.c Terry Bowman
@ 2026-02-03 2:52 ` Terry Bowman
2026-02-04 4:29 ` dan.j.williams
2026-02-03 2:52 ` [PATCH v15 3/9] PCI/ERR: Introduce PCI_ERS_RESULT_PANIC Terry Bowman
` (6 subsequent siblings)
8 siblings, 1 reply; 31+ messages in thread
From: Terry Bowman @ 2026-02-03 2:52 UTC (permalink / raw)
To: dave, jonathan.cameron, dave.jiang, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
CXL protocol error handling will soon be expanded to include CXL Port
support along with the existing Endpoint support. Two updates are needed
first:
- Update calling interfaces to use 'struct device *'
- Log endpoint serial number
Add a serial number parameter to the trace logging. This is used for EPs,
and 0 is provided for CXL Port devices without a serial number.
Leave the correctable and uncorrectable trace routines' TP_STRUCT__entry()
unchanged with respect to member data types and order.
The tracing support for CXL Port devices and Endpoints is already implemented.
Update cxl_handle_ras() and cxl_handle_cor_ras() to also call the CXL trace
routines.
Example output of correctable and uncorrectable protocol error logging for
a CXL Root Port and a CXL Endpoint is shown below.
Root Port:
cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c serial: 0 status='CRC Threshold Hit'
cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c serial: 0 status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
Endpoint:
cxl_aer_correctable_error: memdev=mem3 host=0000:0f:00.0 serial=0 status='CRC Threshold Hit'
cxl_aer_uncorrectable_error: memdev=mem3 host=0000:0f:00.0 serial: 0 status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
---
Changes in v14->v15:
- Update commit message.
- Moved cxl_handle_ras/cxl_handle_cor_ras() changes to future patch (terry)
Changes in v13->v14:
- Update commit headline (Bjorn)
Changes in v12->v13:
- Added Dave Jiang's review-by
Changes in v11 -> v12:
- Correct parameters to call trace_cxl_aer_correctable_error()
- Add reviewed-by for Jonathan and Shiju
Changes in v10->v11:
- Updated CE and UCE trace routines to maintain consistent TP_Struct ABI
and unchanged TP_printk() logging.
---
drivers/cxl/core/core.h | 11 +++++++----
drivers/cxl/core/ras.c | 23 ++++++++++++++---------
drivers/cxl/core/ras_rch.c | 6 ++++--
drivers/cxl/core/trace.h | 21 +++++++++++----------
4 files changed, 36 insertions(+), 25 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index be3c7b137115..c6cfaf2720e1 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -155,8 +155,9 @@ static inline struct device *dport_to_host(struct cxl_dport *dport)
#ifdef CONFIG_CXL_RAS
int cxl_ras_init(void);
void cxl_ras_exit(void);
-bool cxl_handle_ras(struct device *dev, void __iomem *ras_base);
-void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base);
+bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base);
+void cxl_handle_cor_ras(struct device *dev, u64 serial,
+ void __iomem *ras_base);
void cxl_dport_map_rch_aer(struct cxl_dport *dport);
void cxl_disable_rch_root_ints(struct cxl_dport *dport);
void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds);
@@ -167,11 +168,13 @@ static inline int cxl_ras_init(void)
return 0;
}
static inline void cxl_ras_exit(void) { }
-static inline bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
+static inline bool cxl_handle_ras(struct device *dev, u64 serial,
+ void __iomem *ras_base)
{
return false;
}
-static inline void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base) { }
+static inline void cxl_handle_cor_ras(struct device *dev, u64 serial,
+ void __iomem *ras_base) { }
static inline void cxl_dport_map_rch_aer(struct cxl_dport *dport) { }
static inline void cxl_disable_rch_root_ints(struct cxl_dport *dport) { }
static inline void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds) { }
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index f6a8f4a355f1..74df561ed32e 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -37,7 +37,8 @@ static void cxl_cper_trace_corr_prot_err(struct cxl_memdev *cxlmd,
{
u32 status = ras_cap.cor_status & ~ras_cap.cor_mask;
- trace_cxl_aer_correctable_error(cxlmd, status);
+ trace_cxl_aer_correctable_error(&cxlmd->dev, status,
+ cxlmd->cxlds->serial);
}
static void
@@ -45,6 +46,7 @@ cxl_cper_trace_uncorr_prot_err(struct cxl_memdev *cxlmd,
struct cxl_ras_capability_regs ras_cap)
{
u32 status = ras_cap.uncor_status & ~ras_cap.uncor_mask;
+ struct cxl_dev_state *cxlds = cxlmd->cxlds;
u32 fe;
if (hweight32(status) > 1)
@@ -53,8 +55,9 @@ cxl_cper_trace_uncorr_prot_err(struct cxl_memdev *cxlmd,
else
fe = status;
- trace_cxl_aer_uncorrectable_error(cxlmd, status, fe,
- ras_cap.header_log);
+ trace_cxl_aer_uncorrectable_error(&cxlmd->dev, status, fe,
+ ras_cap.header_log,
+ cxlds->serial);
}
static int match_memdev_by_parent(struct device *dev, const void *uport)
@@ -182,7 +185,7 @@ void devm_cxl_port_ras_setup(struct cxl_port *port)
}
EXPORT_SYMBOL_NS_GPL(devm_cxl_port_ras_setup, "CXL");
-void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base)
+void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
{
void __iomem *addr;
u32 status;
@@ -194,7 +197,7 @@ void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base)
status = readl(addr);
if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
- trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
+ trace_cxl_aer_correctable_error(dev, status, serial);
}
}
@@ -219,7 +222,7 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
* Log the state of the RAS status registers and prepare them to log the
* next error status. Return 1 if reset needed.
*/
-bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
+bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
{
u32 hl[CXL_HEADERLOG_SIZE_U32];
void __iomem *addr;
@@ -246,7 +249,7 @@ bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
}
header_log_copy(ras_base, hl);
- trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
+ trace_cxl_aer_uncorrectable_error(dev, status, fe, hl, serial);
writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
return true;
@@ -269,7 +272,8 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
if (cxlds->rcd)
cxl_handle_rdport_errors(cxlds);
- cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlmd->endpoint->regs.ras);
+ cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->serial,
+ cxlmd->endpoint->regs.ras);
}
}
EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
@@ -298,7 +302,8 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
* chance the situation is recoverable dump the status of the RAS
* capability registers and bounce the active state of the memdev.
*/
- ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlmd->endpoint->regs.ras);
+ ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial,
+ cxlmd->endpoint->regs.ras);
}
switch (state) {
diff --git a/drivers/cxl/core/ras_rch.c b/drivers/cxl/core/ras_rch.c
index 0a8b3b9b6388..5771abfc16de 100644
--- a/drivers/cxl/core/ras_rch.c
+++ b/drivers/cxl/core/ras_rch.c
@@ -115,7 +115,9 @@ void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds)
pci_print_aer(pdev, severity, &aer_regs);
if (severity == AER_CORRECTABLE)
- cxl_handle_cor_ras(&cxlds->cxlmd->dev, dport->regs.ras);
+ cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->serial,
+ dport->regs.ras);
else
- cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
+ cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial,
+ dport->regs.ras);
}
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index a972e4ef1936..5f630543b720 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -77,11 +77,12 @@ TRACE_EVENT(cxl_port_aer_uncorrectable_error,
);
TRACE_EVENT(cxl_aer_uncorrectable_error,
- TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
- TP_ARGS(cxlmd, status, fe, hl),
+ TP_PROTO(const struct device *cxlmd, u32 status, u32 fe, u32 *hl,
+ u64 serial),
+ TP_ARGS(cxlmd, status, fe, hl, serial),
TP_STRUCT__entry(
- __string(memdev, dev_name(&cxlmd->dev))
- __string(host, dev_name(cxlmd->dev.parent))
+ __string(memdev, dev_name(cxlmd))
+ __string(host, dev_name(cxlmd->parent))
__field(u64, serial)
__field(u32, status)
__field(u32, first_error)
@@ -90,7 +91,7 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
TP_fast_assign(
__assign_str(memdev);
__assign_str(host);
- __entry->serial = cxlmd->cxlds->serial;
+ __entry->serial = serial;
__entry->status = status;
__entry->first_error = fe;
/*
@@ -144,18 +145,18 @@ TRACE_EVENT(cxl_port_aer_correctable_error,
);
TRACE_EVENT(cxl_aer_correctable_error,
- TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
- TP_ARGS(cxlmd, status),
+ TP_PROTO(const struct device *cxlmd, u32 status, u64 serial),
+ TP_ARGS(cxlmd, status, serial),
TP_STRUCT__entry(
- __string(memdev, dev_name(&cxlmd->dev))
- __string(host, dev_name(cxlmd->dev.parent))
+ __string(memdev, dev_name(cxlmd))
+ __string(host, dev_name(cxlmd->parent))
__field(u64, serial)
__field(u32, status)
),
TP_fast_assign(
__assign_str(memdev);
__assign_str(host);
- __entry->serial = cxlmd->cxlds->serial;
+ __entry->serial = serial;
__entry->status = status;
),
TP_printk("memdev=%s host=%s serial=%lld: status: '%s'",
--
2.34.1
* [PATCH v15 3/9] PCI/ERR: Introduce PCI_ERS_RESULT_PANIC
2026-02-03 2:52 [PATCH v15 0/9] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
2026-02-03 2:52 ` [PATCH v15 1/9] PCI/AER: Introduce AER-CXL Kfifo in new file, pcie/aer_cxl_vh.c Terry Bowman
2026-02-03 2:52 ` [PATCH v15 2/9] cxl: Update CXL Endpoint tracing Terry Bowman
@ 2026-02-03 2:52 ` Terry Bowman
2026-02-03 2:52 ` [PATCH v15 4/9] PCI/AER: Dequeue forwarded CXL error Terry Bowman
` (5 subsequent siblings)
8 siblings, 0 replies; 31+ messages in thread
From: Terry Bowman @ 2026-02-03 2:52 UTC (permalink / raw)
To: dave, jonathan.cameron, dave.jiang, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
The CXL driver's uncorrectable (UCE) protocol error handling will be updated
in the future. One required change is for the error handlers to force a
system panic when a UCE is detected.
Introduce PCI_ERS_RESULT_PANIC as an 'enum pci_ers_result' value. This will
be used by CXL UCE fatal and non-fatal recovery in future patches. Update
PCIe recovery documentation with details of PCI_ERS_RESULT_PANIC.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
---
Changes in v14 -> v15:
- None
Changes in v13 -> v14:
- Add review-by for Dan
- Update Title prefix (Bjorn)
- Removed merge_result. Only logging error for device reporting the
error (Dan)
Changes in v12->v13:
- Add Dave Jiang's, Jonathan's, Ben's review-by
- Typo fix (Ben)
Changes v11 -> v12:
- Documentation requested (Lukas)
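For illustration only, and not part of this patch: a userspace sketch of how
a UCE handler might select the new result. The demo_* names are
hypothetical. The enum mirrors only the values needed here. Treating any
unmasked uncorrectable status bit as a panic request follows the commit
message. Returning RECOVERED when nothing is pending is a placeholder
assumed for the sketch, since this patch does not define that case.

```c
/* Userspace mirror for illustration; the real enum is in include/linux/pci.h. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum demo_ers_result {
	DEMO_ERS_RESULT_RECOVERED = 5,
	DEMO_ERS_RESULT_NO_AER_DRIVER = 6,
	DEMO_ERS_RESULT_PANIC = 7,	/* value added by this patch */
};

static_assert(DEMO_ERS_RESULT_PANIC > DEMO_ERS_RESULT_NO_AER_DRIVER,
	      "PANIC must not collide with an existing result value");

/*
 * Hypothetical UCE policy: any unmasked uncorrectable status bit asks for
 * a panic. RECOVERED for the "nothing pending" case is a placeholder
 * chosen for this sketch.
 */
static enum demo_ers_result demo_uce_result(uint32_t status, uint32_t mask)
{
	return (status & ~mask) ? DEMO_ERS_RESULT_PANIC
				: DEMO_ERS_RESULT_RECOVERED;
}

/* Hypothetical check a recovery path could make on the handler's result. */
static bool demo_must_panic(enum demo_ers_result result)
{
	return result == DEMO_ERS_RESULT_PANIC;
}
```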
---
Documentation/PCI/pci-error-recovery.rst | 2 ++
include/linux/pci.h | 3 +++
2 files changed, 5 insertions(+)
diff --git a/Documentation/PCI/pci-error-recovery.rst b/Documentation/PCI/pci-error-recovery.rst
index 43bc4e3665b4..82ee2c8c0450 100644
--- a/Documentation/PCI/pci-error-recovery.rst
+++ b/Documentation/PCI/pci-error-recovery.rst
@@ -102,6 +102,8 @@ Possible return values are::
PCI_ERS_RESULT_NEED_RESET, /* Device driver wants slot to be reset. */
PCI_ERS_RESULT_DISCONNECT, /* Device has completely failed, is unrecoverable */
PCI_ERS_RESULT_RECOVERED, /* Device driver is fully recovered and operational */
+ PCI_ERS_RESULT_NO_AER_DRIVER, /* No AER capabilities registered for the driver */
+ PCI_ERS_RESULT_PANIC, /* System is unstable, panic. Is CXL specific */
};
A driver does not have to implement all of these callbacks; however,
diff --git a/include/linux/pci.h b/include/linux/pci.h
index f8e8b3df794d..ee05d5925b13 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -921,6 +921,9 @@ enum pci_ers_result {
/* No AER capabilities registered for the driver */
PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6,
+
+ /* System is unstable, panic. Is CXL specific */
+ PCI_ERS_RESULT_PANIC = (__force pci_ers_result_t) 7,
};
/* PCI bus error event callbacks */
--
2.34.1
* [PATCH v15 4/9] PCI/AER: Dequeue forwarded CXL error
2026-02-03 2:52 [PATCH v15 0/9] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (2 preceding siblings ...)
2026-02-03 2:52 ` [PATCH v15 3/9] PCI/ERR: Introduce PCI_ERS_RESULT_PANIC Terry Bowman
@ 2026-02-03 2:52 ` Terry Bowman
2026-02-03 15:26 ` Jonathan Cameron
2026-02-04 4:46 ` dan.j.williams
2026-02-03 2:52 ` [PATCH v15 5/9] PCI: Establish common CXL Port protocol error flow Terry Bowman
` (4 subsequent siblings)
8 siblings, 2 replies; 31+ messages in thread
From: Terry Bowman @ 2026-02-03 2:52 UTC (permalink / raw)
To: dave, jonathan.cameron, dave.jiang, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
The AER driver now forwards CXL protocol errors to the CXL driver via a
kfifo. The CXL driver must consume these work items and initiate protocol
error handling while ensuring the device's RAS mappings remain valid
throughout processing.
Implement cxl_proto_err_work_fn() to dequeue work items forwarded by the
AER service driver. Lock the parent CXL Port device to ensure the CXL
device's RAS registers are accessible during handling. Add a pdev reference
put to match the reference get in the AER driver, so that pdev remains valid
after the kfifo dequeue. These changes apply to CXL Ports and CXL Endpoints.
Update is_cxl_error() to recognize CXL Port devices with errors.
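As a rough illustration of the get/put pairing described above, here is a
user-space sketch only. All model_* names are hypothetical stand-ins and not
kernel APIs: struct model_dev replaces struct pci_dev, and a small ring buffer
replaces the kfifo. Undoing the reference when the enqueue fails is an
assumption of this sketch, not something this patch shows.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev: only a reference count. */
struct model_dev {
	int refcount;
};

/* Hypothetical stand-in for struct cxl_proto_err_work_data. */
struct model_work {
	struct model_dev *dev;
	int severity;
};

#define MODEL_FIFO_SIZE 8	/* power of two, like a kfifo */

struct model_fifo {
	struct model_work slots[MODEL_FIFO_SIZE];
	unsigned int in, out;
};

static bool model_fifo_put(struct model_fifo *f, struct model_work w)
{
	if (f->in - f->out == MODEL_FIFO_SIZE)
		return false;	/* full */
	f->slots[f->in++ % MODEL_FIFO_SIZE] = w;
	return true;
}

static bool model_fifo_get(struct model_fifo *f, struct model_work *w)
{
	if (f->in == f->out)
		return false;	/* empty */
	*w = f->slots[f->out++ % MODEL_FIFO_SIZE];
	return true;
}

/* Producer side: take a reference, then enqueue the work item. */
static bool model_forward_error(struct model_fifo *f, struct model_dev *dev,
				int severity)
{
	struct model_work w = { .dev = dev, .severity = severity };

	dev->refcount++;
	if (!model_fifo_put(f, w)) {
		dev->refcount--;	/* assumption: undo the get on failure */
		return false;
	}
	return true;
}

/* Consumer side: handle each dequeued item, then drop its reference. */
static int model_drain(struct model_fifo *f)
{
	struct model_work w;
	int handled = 0;

	while (model_fifo_get(f, &w)) {
		if (!w.dev)
			continue;	/* nothing to handle or release */
		handled++;		/* error handling would run here */
		w.dev->refcount--;	/* matches the producer's get */
	}
	return handled;
}
```

The point of the pairing is that the device cannot be freed while a work item
referring to it is still queued: the count only returns to its original value
once every queued item has been drained.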
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
---
Changes in v14->v15:
- Move pci_dev_get() to first patch (Dave)
- Move in is_cxl_error() change from later patch (Terry)
- Use pr_err_ratelimited() with PCI device name (Terry)
Changes in v13->v14:
- Update commit title's prefix (Bjorn)
- Add pdev ref get in AER driver before enqueue and add pdev ref put in
CXL driver after dequeue and handling (Dan)
- Removed handling to simplify patch context (Terry)
Changes in v12->v13:
- Add cxlmd lock using guard() (Terry)
- Remove exporting of unused function, pci_aer_clear_fatal_status() (Dave Jiang)
- Change pr_err() calls to ratelimited. (Terry)
- Update commit message. (Terry)
- Remove namespace qualifier from pcie_clear_device_status()
export (Dave Jiang)
- Move locks into cxl_proto_err_work_fn() (Dave)
- Update log messages in cxl_forward_error() (Ben)
Changes in v11->v12:
- Add guard for CE case in cxl_handle_proto_error() (Dave)
Changes in v10->v11:
- Reword patch commit message to remove RCiEP details (Jonathan)
- Add #include <linux/bitfield.h> (Terry)
- is_cxl_rcd() - Fix short comment message wrap (Jonathan)
- is_cxl_rcd() - Combine return calls into 1 (Jonathan)
- cxl_handle_proto_error() - Move comment earlier (Jonathan)
- Use FIELD_GET() in discovering class code (Jonathan)
- Remove BDF from cxl_proto_err_work_data. Use 'struct
pci_dev *' (Dan)
---
drivers/cxl/core/core.h | 3 +
drivers/cxl/core/port.c | 6 +-
drivers/cxl/core/ras.c | 106 ++++++++++++++++++++++++++++++----
drivers/pci/pcie/aer_cxl_vh.c | 5 +-
4 files changed, 105 insertions(+), 15 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index c6cfaf2720e1..92aea110817d 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -182,6 +182,9 @@ static inline void devm_cxl_dport_ras_setup(struct cxl_dport *dport) { }
#endif /* CONFIG_CXL_RAS */
int cxl_gpf_port_setup(struct cxl_dport *dport);
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+ struct cxl_dport **dport);
+struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev);
struct cxl_hdm;
int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index ee7d14528867..8e30a3e7f610 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1402,8 +1402,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
return NULL;
}
-static struct cxl_port *find_cxl_port(struct device *dport_dev,
- struct cxl_dport **dport)
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+ struct cxl_dport **dport)
{
struct cxl_find_port_ctx ctx = {
.dport_dev = dport_dev,
@@ -1607,7 +1607,7 @@ static int match_port_by_uport(struct device *dev, const void *data)
* Function takes a device reference on the port device. Caller should do a
* put_device() when done.
*/
-static struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
+struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
{
struct device *dev;
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 74df561ed32e..a6c0bc6d7203 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -118,17 +118,6 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
}
static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
-int cxl_ras_init(void)
-{
- return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
-}
-
-void cxl_ras_exit(void)
-{
- cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
- cancel_work_sync(&cxl_cper_prot_err_work);
-}
-
static void cxl_dport_map_ras(struct cxl_dport *dport)
{
struct cxl_register_map *map = &dport->reg_map;
@@ -185,6 +174,50 @@ void devm_cxl_port_ras_setup(struct cxl_port *port)
}
EXPORT_SYMBOL_NS_GPL(devm_cxl_port_ras_setup, "CXL");
+/*
+ * get_cxl_port - Return the parent CXL Port of a PCI device
+ * @pdev: PCI device whose parent CXL Port is being queried
+ *
+ * Looks up and returns the parent CXL Port associated with @pdev. On
+ * success, the returned port has its reference count incremented and must
+ * be released by the caller. Returns NULL if no associated CXL port is
+ * found.
+ *
+ * Return: Pointer to the parent &struct cxl_port or NULL on failure
+ */
+static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
+{
+ switch (pci_pcie_type(pdev)) {
+ case PCI_EXP_TYPE_ROOT_PORT:
+ case PCI_EXP_TYPE_DOWNSTREAM:
+ {
+ struct cxl_dport *dport;
+ struct cxl_port *port = find_cxl_port(&pdev->dev, &dport);
+
+ if (!port) {
+ pci_err(pdev, "Failed to find the CXL device");
+ return NULL;
+ }
+ return port;
+ }
+ case PCI_EXP_TYPE_UPSTREAM:
+ case PCI_EXP_TYPE_ENDPOINT:
+ {
+ struct cxl_port *port = find_cxl_port_by_uport(&pdev->dev);
+
+ if (!port) {
+ pci_err(pdev, "Failed to find the CXL device");
+ return NULL;
+ }
+ return port;
+ }
+ }
+
+ pr_err_ratelimited("%s: Error - Unsupported device type (%#x)",
+ pci_name(pdev), pci_pcie_type(pdev));
+ return NULL;
+}
+
void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
{
void __iomem *addr;
@@ -327,3 +360,54 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
return PCI_ERS_RESULT_NEED_RESET;
}
EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
+
+static void cxl_handle_proto_error(struct cxl_proto_err_work_data *err_info)
+{
+}
+
+static void cxl_proto_err_work_fn(struct work_struct *work)
+{
+ struct cxl_proto_err_work_data wd;
+
+ while (cxl_proto_err_kfifo_get(&wd)) {
+ struct pci_dev *pdev __free(pci_dev_put) = wd.pdev;
+
+ if (!pdev) {
+ pr_err_ratelimited("%s: NULL PCI device passed in AER-CXL KFifo\n",
+ __func__);
+ continue;
+ }
+
+ struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
+ if (!port) {
+ pr_err_ratelimited("%s: Failed to find parent port device in CXL topology\n",
+ pci_name(pdev));
+ continue;
+ }
+ guard(device)(&port->dev);
+
+ cxl_handle_proto_error(&wd);
+ }
+}
+
+static struct work_struct cxl_proto_err_work;
+static DECLARE_WORK(cxl_proto_err_work, cxl_proto_err_work_fn);
+
+int cxl_ras_init(void)
+{
+ if (cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work))
+ pr_err("Failed to initialize CXL RAS CPER\n");
+
+ cxl_register_proto_err_work(&cxl_proto_err_work);
+
+ return 0;
+}
+
+void cxl_ras_exit(void)
+{
+ cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
+ cancel_work_sync(&cxl_cper_prot_err_work);
+
+ cxl_unregister_proto_err_work();
+ cancel_work_sync(&cxl_proto_err_work);
+}
diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
index de8bca383159..6bcd6271afbf 100644
--- a/drivers/pci/pcie/aer_cxl_vh.c
+++ b/drivers/pci/pcie/aer_cxl_vh.c
@@ -34,7 +34,10 @@ bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
if (!info || !info->is_cxl)
return false;
- if (pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT)
+ if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
+ (pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) &&
+ (pci_pcie_type(pdev) != PCI_EXP_TYPE_UPSTREAM) &&
+ (pci_pcie_type(pdev) != PCI_EXP_TYPE_DOWNSTREAM))
return false;
return is_aer_internal_error(info);
--
2.34.1
* [PATCH v15 5/9] PCI: Establish common CXL Port protocol error flow
2026-02-03 2:52 [PATCH v15 0/9] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (3 preceding siblings ...)
2026-02-03 2:52 ` [PATCH v15 4/9] PCI/AER: Dequeue forwarded CXL error Terry Bowman
@ 2026-02-03 2:52 ` Terry Bowman
2026-02-03 15:40 ` Jonathan Cameron
2026-02-04 5:08 ` dan.j.williams
2026-02-03 2:52 ` [PATCH v15 6/9] cxl: Update error handlers to support CXL Port protocol errors Terry Bowman
` (3 subsequent siblings)
8 siblings, 2 replies; 31+ messages in thread
From: Terry Bowman @ 2026-02-03 2:52 UTC (permalink / raw)
To: dave, jonathan.cameron, dave.jiang, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
Introduce CXL Port protocol error handling callbacks to unify detection,
logging, and recovery across CXL Ports and Endpoints, including RCH
downstream ports. Establish a consistent flow for correctable and
uncorrectable CXL protocol errors.
Add cxl_port_cor_error_detected() and cxl_port_error_detected() to handle
correctable and uncorrectable errors through the CXL RAS helpers. Coordinate
uncorrectable recovery in cxl_do_recovery() and panic when the handler returns
PCI_ERS_RESULT_PANIC to preserve fatal cachemem behavior. Gate endpoint
handling on the endpoint driver being bound to avoid processing errors on
disabled devices.
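The correctable/uncorrectable split can be pictured with the following
user-space sketch. It is a simplified decision table, not the kernel code:
every model_* name is hypothetical, and the handler outcome is passed in as a
parameter instead of being read from hardware.

```c
#include <assert.h>
#include <stdbool.h>

enum model_severity { MODEL_CORRECTABLE, MODEL_UNCORRECTABLE };

/* Hypothetical reduction of pci_ers_result_t to the two cases used here. */
enum model_result { MODEL_RESULT_NONE, MODEL_RESULT_PANIC };

enum model_action {
	MODEL_ACTION_SKIP,		/* error is not handled at all */
	MODEL_ACTION_LOG_ONLY,		/* handler runs, status is left alone */
	MODEL_ACTION_LOG_AND_CLEAR,	/* handler runs, then status is cleared */
	MODEL_ACTION_PANIC,		/* system is considered unstable */
};

static enum model_action model_handle_proto_error(enum model_severity sev,
						  bool aer_native,
						  enum model_result uce_result)
{
	/* Correctable errors are only processed with native AER control. */
	if (sev == MODEL_CORRECTABLE)
		return aer_native ? MODEL_ACTION_LOG_AND_CLEAR
				  : MODEL_ACTION_SKIP;

	/* Uncorrectable: the handler always runs and may request a panic. */
	if (uce_result == MODEL_RESULT_PANIC)
		return MODEL_ACTION_PANIC;

	/* Status is cleared only when the OS owns AER. */
	return aer_native ? MODEL_ACTION_LOG_AND_CLEAR
			  : MODEL_ACTION_LOG_ONLY;
}
```

When the platform retains AER control, the uncorrectable handler still logs,
but clearing the status is left to firmware.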
Centralize the RAS base lookup in cxl_get_ras_base(), selecting the
downstream-port dport->regs.ras for Root/Downstream Ports and port->regs.ras
for Upstream Ports/Endpoints.
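The selection rule for the RAS base can be sketched in user space as below.
The model_* names and struct model_regs are hypothetical. The type values are
the PCIe device/port type encodings, and the two pointers stand in for
dport->regs.ras and port->regs.ras.

```c
#include <assert.h>
#include <stddef.h>

/* PCIe device/port type encodings (PCI Express Capabilities register). */
#define MODEL_TYPE_ENDPOINT	0x0
#define MODEL_TYPE_ROOT_PORT	0x4
#define MODEL_TYPE_UPSTREAM	0x5
#define MODEL_TYPE_DOWNSTREAM	0x6

/* Hypothetical holder for the two candidate RAS register mappings. */
struct model_regs {
	void *dport_ras;	/* stands in for dport->regs.ras */
	void *port_ras;		/* stands in for port->regs.ras */
};

static void *model_get_ras_base(unsigned int type,
				const struct model_regs *regs)
{
	switch (type) {
	case MODEL_TYPE_ROOT_PORT:
	case MODEL_TYPE_DOWNSTREAM:
		return regs->dport_ras;
	case MODEL_TYPE_UPSTREAM:
	case MODEL_TYPE_ENDPOINT:
		return regs->port_ras;
	default:
		return NULL;	/* unsupported device type */
	}
}
```

Callers are expected to treat a NULL result as "RAS registers not mapped" and
skip the register access.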
Export pcie_clear_device_status() and pci_aer_clear_fatal_status() to enable
cxl_core to clear PCIe/AER state in these flows.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Acked-by: Bjorn Helgaas <bhelgaas@google.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
---
Changes in v14->v15:
- Update commit message and title. Added Bjorn's ack.
- Move CE and UCE handling logic here
Changes in v13->v14:
- Add Dave Jiang's review-by
- Update commit message & headline (Bjorn)
- Refactor cxl_port_error_detected()/cxl_port_cor_error_detected() to
one line (Jonathan)
- Remove cxl_walk_port() (Dan)
- Remove cxl_pci_drv_bound(). Check for 'is_cxl' parent port is
sufficient (Dan)
- Remove device_lock_if()
- Combined CE and UCE here (Terry)
Changes in v12->v13:
- Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
patch (Terry)
- Remove EP case in cxl_get_ras_base(), not used. (Terry)
- Remove check for dport->dport_dev (Dave)
- Remove whitespace (Terry)
Changes in v11->v12:
- Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
pci_to_cxl_dev()
- Change cxl_error_detected() -> cxl_cor_error_detected()
- Remove NULL variable assignments
- Replace bus_find_device() with find_cxl_port_by_uport() for upstream
port searches.
Changes in v10->v11:
- None
---
drivers/cxl/core/ras.c | 134 +++++++++++++++++++++++++++++++++++++++++
drivers/pci/pci.c | 1 +
drivers/pci/pci.h | 2 -
drivers/pci/pcie/aer.c | 1 +
include/linux/aer.h | 2 +
include/linux/pci.h | 2 +
6 files changed, 140 insertions(+), 2 deletions(-)
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index a6c0bc6d7203..0216dafa6118 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -218,6 +218,68 @@ static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
return NULL;
}
+static void __iomem *cxl_get_ras_base(struct device *dev)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+
+ switch (pci_pcie_type(pdev)) {
+ case PCI_EXP_TYPE_ROOT_PORT:
+ case PCI_EXP_TYPE_DOWNSTREAM:
+ {
+ struct cxl_dport *dport;
+ struct cxl_port *port __free(put_cxl_port) = find_cxl_port(&pdev->dev, &dport);
+
+ if (!dport) {
+ pci_err(pdev, "Failed to find the CXL device");
+ return NULL;
+ }
+ return dport->regs.ras;
+ }
+ case PCI_EXP_TYPE_UPSTREAM:
+ case PCI_EXP_TYPE_ENDPOINT:
+ {
+ struct cxl_port *port __free(put_cxl_port) = find_cxl_port_by_uport(&pdev->dev);
+
+ if (!port) {
+ pci_err(pdev, "Failed to find the CXL device");
+ return NULL;
+ }
+ return port->regs.ras;
+ }
+ }
+ dev_warn_once(dev, "Error: Unsupported device type (%#x)", pci_pcie_type(pdev));
+ return NULL;
+}
+
+static pci_ers_result_t cxl_port_error_detected(struct device *dev);
+
+static void cxl_do_recovery(struct pci_dev *pdev)
+{
+ struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
+ pci_ers_result_t status;
+
+ if (!port) {
+ pci_err(pdev, "Failed to find the CXL device\n");
+ return;
+ }
+
+ status = cxl_port_error_detected(&pdev->dev);
+ if (status == PCI_ERS_RESULT_PANIC)
+ panic("CXL cachemem error.");
+
+ /*
+ * If we have native control of AER, clear error status in the device
+ * that detected the error. If the platform retained control of AER,
+ * it is responsible for clearing this status. In that case, the
+ * signaling device may not even be visible to the OS.
+ */
+ if (pcie_aer_is_native(pdev)) {
+ pcie_clear_device_status(pdev);
+ pci_aer_clear_nonfatal_status(pdev);
+ pci_aer_clear_fatal_status(pdev);
+ }
+}
+
void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
{
void __iomem *addr;
@@ -288,6 +350,60 @@ bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
return true;
}
+static void cxl_port_cor_error_detected(struct device *dev)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
+
+ if (is_cxl_endpoint(port)) {
+ struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
+ struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+ guard(device)(&cxlmd->dev);
+
+ if (!dev->driver) {
+ dev_warn(&pdev->dev,
+ "%s: memdev disabled, abort error handling\n",
+ dev_name(dev));
+ return;
+ }
+
+ if (cxlds->rcd)
+ cxl_handle_rdport_errors(cxlds);
+
+ cxl_handle_cor_ras(dev, cxlds->serial, cxl_get_ras_base(dev));
+ } else {
+ cxl_handle_cor_ras(dev, 0, cxl_get_ras_base(dev));
+ }
+}
+
+static pci_ers_result_t cxl_port_error_detected(struct device *dev)
+{
+ struct pci_dev *pdev = to_pci_dev(dev);
+ struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
+
+ if (is_cxl_endpoint(port)) {
+ struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
+ struct cxl_dev_state *cxlds = cxlmd->cxlds;
+
+ guard(device)(&cxlmd->dev);
+
+ if (!dev->driver) {
+ dev_warn(&pdev->dev,
+ "%s: memdev disabled, abort error handling\n",
+ dev_name(dev));
+ return PCI_ERS_RESULT_NONE;
+ }
+
+ if (cxlds->rcd)
+ cxl_handle_rdport_errors(cxlds);
+
+ return cxl_handle_ras(dev, cxlds->serial, cxl_get_ras_base(dev));
+ } else {
+ return cxl_handle_ras(dev, 0, cxl_get_ras_base(dev));
+ }
+}
+
void cxl_cor_error_detected(struct pci_dev *pdev)
{
struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
@@ -363,6 +479,24 @@ EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
static void cxl_handle_proto_error(struct cxl_proto_err_work_data *err_info)
{
+ struct pci_dev *pdev = err_info->pdev;
+
+ if (err_info->severity == AER_CORRECTABLE) {
+
+ if (!pcie_aer_is_native(pdev))
+ return;
+
+ if (pdev->aer_cap)
+ pci_clear_and_set_config_dword(pdev,
+ pdev->aer_cap + PCI_ERR_COR_STATUS,
+ 0, PCI_ERR_COR_INTERNAL);
+
+ cxl_port_cor_error_detected(&pdev->dev);
+
+ pcie_clear_device_status(pdev);
+ } else {
+ cxl_do_recovery(pdev);
+ }
}
static void cxl_proto_err_work_fn(struct work_struct *work)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 13dbb405dc31..b7bfefdaf990 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2248,6 +2248,7 @@ void pcie_clear_device_status(struct pci_dev *dev)
pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &sta);
pcie_capability_write_word(dev, PCI_EXP_DEVSTA, sta);
}
+EXPORT_SYMBOL_GPL(pcie_clear_device_status);
#endif
/**
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 8ccb3ba61e11..d81c4170f595 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -229,7 +229,6 @@ void pci_refresh_power_state(struct pci_dev *dev);
int pci_power_up(struct pci_dev *dev);
void pci_disable_enabled_device(struct pci_dev *dev);
int pci_finish_runtime_suspend(struct pci_dev *dev);
-void pcie_clear_device_status(struct pci_dev *dev);
void pcie_clear_root_pme_status(struct pci_dev *dev);
bool pci_check_pme_status(struct pci_dev *dev);
void pci_pme_wakeup_bus(struct pci_bus *bus);
@@ -1198,7 +1197,6 @@ void pci_restore_aer_state(struct pci_dev *dev);
static inline void pci_no_aer(void) { }
static inline void pci_aer_init(struct pci_dev *d) { }
static inline void pci_aer_exit(struct pci_dev *d) { }
-static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
static inline int pci_aer_clear_status(struct pci_dev *dev) { return -EINVAL; }
static inline int pci_aer_raw_clear_status(struct pci_dev *dev) { return -EINVAL; }
static inline void pci_save_aer_state(struct pci_dev *dev) { }
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 7af10a74da34..4fc9de4c78f8 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -298,6 +298,7 @@ void pci_aer_clear_fatal_status(struct pci_dev *dev)
if (status)
pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, status);
}
+EXPORT_SYMBOL_GPL(pci_aer_clear_fatal_status);
/**
* pci_aer_raw_clear_status - Clear AER error registers.
diff --git a/include/linux/aer.h b/include/linux/aer.h
index f351e41dd979..c1aef7859d0a 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -65,6 +65,7 @@ struct cxl_proto_err_work_data {
#if defined(CONFIG_PCIEAER)
int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
+void pci_aer_clear_fatal_status(struct pci_dev *dev);
int pcie_aer_is_native(struct pci_dev *dev);
void pci_aer_unmask_internal_errors(struct pci_dev *dev);
#else
@@ -72,6 +73,7 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
{
return -EINVAL;
}
+static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
#endif
diff --git a/include/linux/pci.h b/include/linux/pci.h
index ee05d5925b13..1ef4743bf151 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1921,8 +1921,10 @@ static inline void pci_hp_unignore_link_change(struct pci_dev *pdev) { }
#ifdef CONFIG_PCIEAER
bool pci_aer_available(void);
+void pcie_clear_device_status(struct pci_dev *dev);
#else
static inline bool pci_aer_available(void) { return false; }
+static inline void pcie_clear_device_status(struct pci_dev *dev) { }
#endif
bool pci_ats_disabled(void);
--
2.34.1
* [PATCH v15 6/9] cxl: Update error handlers to support CXL Port protocol errors
2026-02-03 2:52 [PATCH v15 0/9] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (4 preceding siblings ...)
2026-02-03 2:52 ` [PATCH v15 5/9] PCI: Establish common CXL Port protocol error flow Terry Bowman
@ 2026-02-03 2:52 ` Terry Bowman
2026-02-03 15:54 ` Jonathan Cameron
2026-02-03 2:52 ` [PATCH v15 7/9] cxl: Update Endpoint AER uncorrectable handler Terry Bowman
` (2 subsequent siblings)
8 siblings, 1 reply; 31+ messages in thread
From: Terry Bowman @ 2026-02-03 2:52 UTC (permalink / raw)
To: dave, jonathan.cameron, dave.jiang, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
CXL Protocol errors are logged for Endpoints in cxl_handle_ras() and
cxl_handle_cor_ras(). The same is missing for CXL Port devices. The CXL
Port logging function is already present but needs a call added from
the handlers.
Update cxl_handle_ras() and cxl_handle_cor_ras() to call the CXL Port
trace logging function.
Also, add log messages for the case where 'ras_base' is NULL.
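A user-space sketch of the updated correctable path follows. All model_* names
are hypothetical, the mask value is illustrative only, and the status register
is modeled as write-1-to-clear, which is an assumption of this sketch. The
return value stands in for which trace event would be emitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative mask only; not taken from the kernel headers. */
#define MODEL_COR_STATUS_MASK	0x7fu

enum model_trace {
	MODEL_TRACE_NONE,	/* nothing logged */
	MODEL_TRACE_MEMDEV,	/* memdev (Endpoint) trace event */
	MODEL_TRACE_PORT,	/* CXL Port trace event */
};

/* Model of a write-1-to-clear register: written 1 bits are cleared. */
static void model_w1c(uint32_t *reg, uint32_t val)
{
	*reg &= ~val;
}

static enum model_trace model_handle_cor_ras(uint32_t *status_reg,
					     bool is_memdev)
{
	uint32_t status;

	if (!status_reg)
		return MODEL_TRACE_NONE;	/* RAS registers not mapped */

	status = *status_reg;
	if (!(status & MODEL_COR_STATUS_MASK))
		return MODEL_TRACE_NONE;	/* no error bits of interest */

	/* Clear only the bits that were observed and are in the mask. */
	model_w1c(status_reg, status & MODEL_COR_STATUS_MASK);

	return is_memdev ? MODEL_TRACE_MEMDEV : MODEL_TRACE_PORT;
}
```

The early return on a clear status keeps the trace selection flat, which is the
same shape the diff gives cxl_handle_cor_ras().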
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
Changes in v14 -> v15:
- New commit
---
drivers/cxl/core/core.h | 10 ++++++----
drivers/cxl/core/ras.c | 30 ++++++++++++++++++++++--------
2 files changed, 28 insertions(+), 12 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 92aea110817d..3b232e991b12 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -6,6 +6,7 @@
#include <cxl/mailbox.h>
#include <linux/rwsem.h>
+#include <linux/pci.h>
extern const struct device_type cxl_nvdimm_bridge_type;
extern const struct device_type cxl_nvdimm_type;
@@ -155,7 +156,8 @@ static inline struct device *dport_to_host(struct cxl_dport *dport)
#ifdef CONFIG_CXL_RAS
int cxl_ras_init(void);
void cxl_ras_exit(void);
-bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base);
+pci_ers_result_t cxl_handle_ras(struct device *dev, u64 serial,
+ void __iomem *ras_base);
void cxl_handle_cor_ras(struct device *dev, u64 serial,
void __iomem *ras_base);
void cxl_dport_map_rch_aer(struct cxl_dport *dport);
@@ -168,10 +170,10 @@ static inline int cxl_ras_init(void)
return 0;
}
static inline void cxl_ras_exit(void) { }
-static inline bool cxl_handle_ras(struct device *dev, u64 serial,
- void __iomem *ras_base)
+static inline pci_ers_result_t cxl_handle_ras(struct device *dev, u64 serial,
+ void __iomem *ras_base)
{
- return false;
+ return PCI_ERS_RESULT_NONE;
}
static inline void cxl_handle_cor_ras(struct device *dev, u64 serial,
void __iomem *ras_base) { }
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 0216dafa6118..970ff3df442c 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -285,15 +285,22 @@ void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
void __iomem *addr;
u32 status;
- if (!ras_base)
+ if (!ras_base) {
+ pr_err_ratelimited("%s: CXL RAS registers aren't mapped\n",
+ dev_name(dev));
return;
+ }
addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
status = readl(addr);
- if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
- writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
+ if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
+ return;
+
+ writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
+ if (is_cxl_memdev(dev))
trace_cxl_aer_correctable_error(dev, status, serial);
- }
+ else
+ trace_cxl_port_aer_correctable_error(dev, status);
}
/* CXL spec rev3.0 8.2.4.16.1 */
@@ -317,15 +324,19 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
* Log the state of the RAS status registers and prepare them to log the
* next error status. Return 1 if reset needed.
*/
-bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
+pci_ers_result_t
+cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
{
u32 hl[CXL_HEADERLOG_SIZE_U32];
void __iomem *addr;
u32 status;
u32 fe;
- if (!ras_base)
+ if (!ras_base) {
+ pr_err_ratelimited("%s: CXL RAS registers aren't mapped\n",
+ dev_name(dev));
return false;
+ }
addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
status = readl(addr);
@@ -344,10 +355,13 @@ bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
}
header_log_copy(ras_base, hl);
- trace_cxl_aer_uncorrectable_error(dev, status, fe, hl, serial);
+ if (is_cxl_memdev(dev))
+ trace_cxl_aer_uncorrectable_error(dev, status, fe, hl, serial);
+ else
+ trace_cxl_port_aer_uncorrectable_error(dev, status, fe, hl);
writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
- return true;
+ return PCI_ERS_RESULT_PANIC;
}
static void cxl_port_cor_error_detected(struct device *dev)
--
2.34.1
* [PATCH v15 7/9] cxl: Update Endpoint AER uncorrectable handler
2026-02-03 2:52 [PATCH v15 0/9] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (5 preceding siblings ...)
2026-02-03 2:52 ` [PATCH v15 6/9] cxl: Update error handlers to support CXL Port protocol errors Terry Bowman
@ 2026-02-03 2:52 ` Terry Bowman
2026-02-03 16:18 ` Jonathan Cameron
2026-02-03 17:31 ` Dave Jiang
2026-02-03 2:52 ` [PATCH v15 8/9] cxl: Remove Endpoint AER correctable handler Terry Bowman
2026-02-03 2:52 ` [PATCH v15 9/9] cxl: Enable CXL protocol error reporting Terry Bowman
8 siblings, 2 replies; 31+ messages in thread
From: Terry Bowman @ 2026-02-03 2:52 UTC (permalink / raw)
To: dave, jonathan.cameron, dave.jiang, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
CXL drivers now implement protocol RAS support. PCI protocol errors,
however, continue to be reported via the AER capability and must still be
handled by a PCI error recovery callback.
Replace the existing cxl_error_detected() callback in cxl/pci.c with a
new cxl_pci_error_detected() implementation that handles only uncorrectable
PCI protocol errors reported through AER.
Introduce a helper, cxl_handle_aer(), to handle and log the CXL device's
AER error.
This cleanly separates CXL protocol error handling from PCI AER handling
and ensures that each subsystem processes only the errors it is
responsible for.
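The result mapping of the new callback can be sketched in user space as below.
The model_* names are hypothetical, and the register values are passed in
rather than read from config space. The expression follows the diff as
written, and a missing AER capability maps to "can recover" as it does there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical reduction of pci_ers_result_t to the two values returned. */
enum model_ers {
	MODEL_ERS_CAN_RECOVER,
	MODEL_ERS_DISCONNECT,
};

static enum model_ers model_pci_error_detected(bool has_aer_cap,
					       uint32_t uncor_status,
					       uint32_t uncor_mask)
{
	/* Without an AER capability there is no status to evaluate. */
	if (!has_aer_cap)
		return MODEL_ERS_CAN_RECOVER;

	/* Same test as the diff: status combined with the mask register. */
	return (uncor_status & uncor_mask) ? MODEL_ERS_DISCONNECT
					   : MODEL_ERS_CAN_RECOVER;
}
```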
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
Changes in v14->v15:
- Title update (Terry)
- Change cxl_pci_error_detected() to handle & log AER (Terry)
- Update commit message (Terry)
- Moved cxl_handle_ras()/cxl_handle_cor_ras() to earlier patch (Terry)
Changes in v13->v14:
- Update commit headline (Bjorn)
- Rename pci_error_detected()/pci_cor_error_detected() ->
cxl_pci_error_detected/cxl_pci_cor_error_detected() (Jonathan)
- Remove now-invalid comment in cxl_error_detected() (Jonathan)
- Split into separate patches for UCE and CE (Terry)
Changes in v12->v13:
- Update commit message (Terry)
- Updated all the implementation and commit message. (Terry)
- Refactored cxl_cor_error_detected()/cxl_error_detected() to remove
pdev (Dave Jiang)
Changes in v11->v12:
- None
Changes in v10->v11:
- cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
- cxl_error_detected() - Remove extra line (Shiju)
- Changes moved to core/ras.c (Terry)
- cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
- Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
- Move #include "pci.h" from cxl.h to core.h (Terry)
- Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)
---
drivers/cxl/core/ras.c | 68 +++++++++++++++---------------------------
drivers/cxl/cxlpci.h | 9 +++---
drivers/cxl/pci.c | 6 ++--
3 files changed, 31 insertions(+), 52 deletions(-)
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 970ff3df442c..061e6aaec176 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -441,55 +441,35 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
}
EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
-pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
- pci_channel_state_t state)
+static bool cxl_handle_aer(struct pci_dev *pdev)
{
- struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
- struct cxl_memdev *cxlmd = cxlds->cxlmd;
- struct device *dev = &cxlmd->dev;
- bool ue;
-
- scoped_guard(device, dev) {
- if (!dev->driver) {
- dev_warn(&pdev->dev,
- "%s: memdev disabled, abort error handling\n",
- dev_name(dev));
- return PCI_ERS_RESULT_DISCONNECT;
- }
+ struct aer_capability_regs aer;
+ u32 aer_cap = pdev->aer_cap;
- if (cxlds->rcd)
- cxl_handle_rdport_errors(cxlds);
- /*
- * A frozen channel indicates an impending reset which is fatal to
- * CXL.mem operation, and will likely crash the system. On the off
- * chance the situation is recoverable dump the status of the RAS
- * capability registers and bounce the active state of the memdev.
- */
- ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial,
- cxlmd->endpoint->regs.ras);
+ if (!aer_cap) {
+ pr_warn_ratelimited("%s: AER capability isn't present\n",
+ pci_name(pdev));
+ return false;
}
- switch (state) {
- case pci_channel_io_normal:
- if (ue) {
- device_release_driver(dev);
- return PCI_ERS_RESULT_NEED_RESET;
- }
- return PCI_ERS_RESULT_CAN_RECOVER;
- case pci_channel_io_frozen:
- dev_warn(&pdev->dev,
- "%s: frozen state error detected, disable CXL.mem\n",
- dev_name(dev));
- device_release_driver(dev);
- return PCI_ERS_RESULT_NEED_RESET;
- case pci_channel_io_perm_failure:
- dev_warn(&pdev->dev,
- "failure state error detected, request disconnect\n");
- return PCI_ERS_RESULT_DISCONNECT;
- }
- return PCI_ERS_RESULT_NEED_RESET;
+ pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_STATUS, &aer.uncor_status);
+ pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_MASK, &aer.uncor_mask);
+
+ /* The AER driver logged the error */
+ pci_aer_clear_nonfatal_status(pdev);
+ pci_aer_clear_fatal_status(pdev);
+
+ return (aer.uncor_status & aer.uncor_mask);
+}
+
+pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
+ pci_channel_state_t error)
+{
+ u32 rc = cxl_handle_aer(pdev);
+
+ return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_CAN_RECOVER;
}
-EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
+EXPORT_SYMBOL_NS_GPL(cxl_pci_error_detected, "CXL");
static void cxl_handle_proto_error(struct cxl_proto_err_work_data *err_info)
{
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 970add0256e9..5534422b496c 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -79,15 +79,14 @@ void read_cdat_data(struct cxl_port *port);
#ifdef CONFIG_CXL_RAS
void cxl_cor_error_detected(struct pci_dev *pdev);
-pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
- pci_channel_state_t state);
void devm_cxl_dport_rch_ras_setup(struct cxl_dport *dport);
+pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
+ pci_channel_state_t error);
void devm_cxl_port_ras_setup(struct cxl_port *port);
#else
static inline void cxl_cor_error_detected(struct pci_dev *pdev) { }
-
-static inline pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
- pci_channel_state_t state)
+static inline pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
+ pci_channel_state_t state)
{
return PCI_ERS_RESULT_NONE;
}
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index acb0eb2a13c3..ff741adc7c7f 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -1051,8 +1051,8 @@ static void cxl_reset_done(struct pci_dev *pdev)
}
}
-static const struct pci_error_handlers cxl_error_handlers = {
- .error_detected = cxl_error_detected,
+static const struct pci_error_handlers pci_error_handlers = {
+ .error_detected = cxl_pci_error_detected,
.slot_reset = cxl_slot_reset,
.resume = cxl_error_resume,
.cor_error_detected = cxl_cor_error_detected,
@@ -1063,7 +1063,7 @@ static struct pci_driver cxl_pci_driver = {
.name = KBUILD_MODNAME,
.id_table = cxl_mem_pci_tbl,
.probe = cxl_pci_probe,
- .err_handler = &cxl_error_handlers,
+ .err_handler = &pci_error_handlers,
.dev_groups = cxl_rcd_groups,
.driver = {
.probe_type = PROBE_PREFER_ASYNCHRONOUS,
--
2.34.1
* [PATCH v15 8/9] cxl: Remove Endpoint AER correctable handler
2026-02-03 2:52 [PATCH v15 0/9] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (6 preceding siblings ...)
2026-02-03 2:52 ` [PATCH v15 7/9] cxl: Update Endpoint AER uncorrectable handler Terry Bowman
@ 2026-02-03 2:52 ` Terry Bowman
2026-02-03 16:27 ` Jonathan Cameron
2026-02-03 2:52 ` [PATCH v15 9/9] cxl: Enable CXL protocol error reporting Terry Bowman
8 siblings, 1 reply; 31+ messages in thread
From: Terry Bowman @ 2026-02-03 2:52 UTC (permalink / raw)
To: dave, jonathan.cameron, dave.jiang, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
CXL drivers don't require a correctable PCI AER handler. Correctable AER
errors reported by CXL devices are logged and cleared in the AER driver.
This makes the correctable AER handler callback in the CXL driver
unnecessary.
Remove cxl_cor_error_detected() and drop the .cor_error_detected callback
from the CXL PCI error handlers.
This consolidates correctable error reporting under the CXL RAS infrastructure
and avoids redundant or conflicting logging with the AER driver.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
Changes in v14->v15:
- Remove cxl_pci_cor_error_detected(). Is not needed. AER is logged
in the AER driver. (Dan)
- Update commit message (Terry)
Changes in v13->v14:
- New commit
- Change cxl_cor_error_detected() parameter to &pdev->dev device from
memdev device. (Terry)
- Updated commit message (Terry)
---
drivers/cxl/core/ras.c | 23 -----------------------
drivers/cxl/cxlpci.h | 2 --
drivers/cxl/pci.c | 1 -
3 files changed, 26 deletions(-)
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 061e6aaec176..e5a0d0283d3f 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -418,29 +418,6 @@ static pci_ers_result_t cxl_port_error_detected(struct device *dev)
}
}
-void cxl_cor_error_detected(struct pci_dev *pdev)
-{
- struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
- struct cxl_memdev *cxlmd = cxlds->cxlmd;
- struct device *dev = &cxlds->cxlmd->dev;
-
- scoped_guard(device, dev) {
- if (!dev->driver) {
- dev_warn(&pdev->dev,
- "%s: memdev disabled, abort error handling\n",
- dev_name(dev));
- return;
- }
-
- if (cxlds->rcd)
- cxl_handle_rdport_errors(cxlds);
-
- cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->serial,
- cxlmd->endpoint->regs.ras);
- }
-}
-EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
-
static bool cxl_handle_aer(struct pci_dev *pdev)
{
struct aer_capability_regs aer;
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 5534422b496c..e3388dffdd75 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -78,13 +78,11 @@ struct cxl_dev_state;
void read_cdat_data(struct cxl_port *port);
#ifdef CONFIG_CXL_RAS
-void cxl_cor_error_detected(struct pci_dev *pdev);
void devm_cxl_dport_rch_ras_setup(struct cxl_dport *dport);
pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
pci_channel_state_t error);
void devm_cxl_port_ras_setup(struct cxl_port *port);
#else
-static inline void cxl_cor_error_detected(struct pci_dev *pdev) { }
static inline pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
pci_channel_state_t state)
{
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index ff741adc7c7f..c6b2966f5fda 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -1055,7 +1055,6 @@ static const struct pci_error_handlers pci_error_handlers = {
.error_detected = cxl_pci_error_detected,
.slot_reset = cxl_slot_reset,
.resume = cxl_error_resume,
- .cor_error_detected = cxl_cor_error_detected,
.reset_done = cxl_reset_done,
};
--
2.34.1
^ permalink raw reply related [flat|nested] 31+ messages in thread
* [PATCH v15 9/9] cxl: Enable CXL protocol error reporting
2026-02-03 2:52 [PATCH v15 0/9] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
` (7 preceding siblings ...)
2026-02-03 2:52 ` [PATCH v15 8/9] cxl: Remove Endpoint AER correctable handler Terry Bowman
@ 2026-02-03 2:52 ` Terry Bowman
8 siblings, 0 replies; 31+ messages in thread
From: Terry Bowman @ 2026-02-03 2:52 UTC (permalink / raw)
To: dave, jonathan.cameron, dave.jiang, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
CXL protocol errors are not enabled for all CXL devices after boot. These
must be enabled in order to process CXL protocol errors.
Introduce cxl_unmask_proto_interrupts() to call pci_aer_unmask_internal_errors().
pci_aer_unmask_internal_errors() expects pdev->aer_cap to be initialized.
However, pdev->aer_cap is not initialized for CXL Upstream Switch Ports and CXL
Downstream Switch Ports. Initialize pdev->aer_cap if necessary. Enable AER
correctable internal errors and uncorrectable internal errors for all CXL
devices.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
---
Changes in v13->v14:
- Update commit title's prefix (Bjorn)
Changes in v12->v13:
- Add dev and dev_is_pci() NULL checks in cxl_unmask_proto_interrupts() (Terry)
- Add Dave Jiang's and Ben's review-by
Changes in v11->v12:
- None
Changes in v10->v11:
- Added check for valid PCI devices in is_cxl_error() (Terry)
- Removed check for RCiEP in cxl_handle_proto_err() and
cxl_report_error_detected() (Terry)
---
drivers/cxl/core/port.c | 2 ++
drivers/cxl/core/ras.c | 22 ++++++++++++++++++++++
drivers/cxl/cxlpci.h | 4 ++++
3 files changed, 28 insertions(+)
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 8e30a3e7f610..b63e8117d937 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1870,6 +1870,8 @@ int devm_cxl_enumerate_ports(struct cxl_memdev *cxlmd)
rc = cxl_add_ep(dport, &cxlmd->dev);
+ cxl_unmask_proto_interrupts(cxlmd->cxlds->dev);
+
/*
* If the endpoint already exists in the port's list,
* that's ok, it was added on a previous pass.
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index e5a0d0283d3f..d6c2fd4ae067 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -118,6 +118,24 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
}
static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
+void cxl_unmask_proto_interrupts(struct device *dev)
+{
+ if (!dev || !dev_is_pci(dev))
+ return;
+
+ struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(to_pci_dev(dev));
+
+ if (!pdev->aer_cap) {
+ pdev->aer_cap = pci_find_ext_capability(pdev,
+ PCI_EXT_CAP_ID_ERR);
+ if (!pdev->aer_cap)
+ return;
+ }
+
+ pci_aer_unmask_internal_errors(pdev);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_unmask_proto_interrupts, "CXL");
+
static void cxl_dport_map_ras(struct cxl_dport *dport)
{
struct cxl_register_map *map = &dport->reg_map;
@@ -128,6 +146,8 @@ static void cxl_dport_map_ras(struct cxl_dport *dport)
else if (cxl_map_component_regs(map, &dport->regs.component,
BIT(CXL_CM_CAP_CAP_ID_RAS)))
dev_dbg(dev, "Failed to map RAS capability.\n");
+
+ cxl_unmask_proto_interrupts(dev);
}
/**
@@ -171,6 +191,8 @@ void devm_cxl_port_ras_setup(struct cxl_port *port)
if (cxl_map_component_regs(map, &port->regs,
BIT(CXL_CM_CAP_CAP_ID_RAS)))
dev_dbg(&port->dev, "Failed to map RAS capability\n");
+
+ cxl_unmask_proto_interrupts(port->uport_dev);
}
EXPORT_SYMBOL_NS_GPL(devm_cxl_port_ras_setup, "CXL");
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index e3388dffdd75..b5fea624b2cc 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -82,6 +82,7 @@ void devm_cxl_dport_rch_ras_setup(struct cxl_dport *dport);
pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
pci_channel_state_t error);
void devm_cxl_port_ras_setup(struct cxl_port *port);
+void cxl_unmask_proto_interrupts(struct device *dev);
#else
static inline pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
pci_channel_state_t state)
@@ -94,6 +95,9 @@ static inline void devm_cxl_dport_rch_ras_setup(struct cxl_dport *dport)
static inline void devm_cxl_port_ras_setup(struct cxl_port *port)
{
}
+static inline void cxl_unmask_proto_interrupts(struct device *dev)
+{
+}
#endif
#endif /* __CXL_PCI_H__ */
--
2.34.1
^ permalink raw reply related [flat|nested] 31+ messages in thread
* Re: [PATCH v15 4/9] PCI/AER: Dequeue forwarded CXL error
2026-02-03 2:52 ` [PATCH v15 4/9] PCI/AER: Dequeue forwarded CXL error Terry Bowman
@ 2026-02-03 15:26 ` Jonathan Cameron
2026-02-03 17:00 ` Bowman, Terry
2026-02-04 4:46 ` dan.j.williams
1 sibling, 1 reply; 31+ messages in thread
From: Jonathan Cameron @ 2026-02-03 15:26 UTC (permalink / raw)
To: Terry Bowman
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On Mon, 2 Feb 2026 20:52:39 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> The AER driver now forwards CXL protocol errors to the CXL driver via a
> kfifo. The CXL driver must consume these work items and initiate protocol
> error handling while ensuring the device's RAS mappings remain valid
> throughout processing.
>
> Implement cxl_proto_err_work_fn() to dequeue work items forwarded by the
> AER service driver. Lock the parent CXL Port device to ensure the CXL
> device's RAS registers are accessible during handling. Add pdev reference-put
> to match reference-get in AER driver. This will ensure pdev access after
> kfifo dequeue. These changes apply to CXL Ports and CXL Endpoints.
>
> Update is_cxl_error() to recognize CXL Port devices with errors.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
There are some small functional changes to existing paths (maybe)
that I think need explanations in this commit message.
Otherwise, one small suggested simplification.
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 74df561ed32e..a6c0bc6d7203 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -118,17 +118,6 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
> }
> static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
>
> -int cxl_ras_init(void)
> -{
> - return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> -}
> -
> -void cxl_ras_exit(void)
> -{
> - cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
> - cancel_work_sync(&cxl_cper_prot_err_work);
> -}
> -
> static void cxl_dport_map_ras(struct cxl_dport *dport)
> {
> struct cxl_register_map *map = &dport->reg_map;
> @@ -185,6 +174,50 @@ void devm_cxl_port_ras_setup(struct cxl_port *port)
> }
> EXPORT_SYMBOL_NS_GPL(devm_cxl_port_ras_setup, "CXL");
>
> +/*
> + * get_cxl_port - Return the parent CXL Port of a PCI device
> + * @pdev: PCI device whose parent CXL Port is being queried
> + *
> + * Looks up and returns the parent CXL Port associated with @pdev. On
> + * success, the returned port has its reference count incremented and must
> + * be released by the caller. Returns NULL if no associated CXL port is
> + * found.
> + *
> + * Return: Pointer to the parent &struct cxl_port or NULL on failure
> + */
> +static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
> +{
> + switch (pci_pcie_type(pdev)) {
> + case PCI_EXP_TYPE_ROOT_PORT:
> + case PCI_EXP_TYPE_DOWNSTREAM:
> + {
> + struct cxl_dport *dport;
> + struct cxl_port *port = find_cxl_port(&pdev->dev, &dport);
Can you pass NULL for dport? Looks like it to me as that ultimately ends
up in match_port_by_dport() and
if (ctx->dport)
*ctx->dport = dport;
where passing NULL means ctx->dport == NULL, so the assignment is skipped.
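To illustrate the optional out-parameter pattern being described, here is a
minimal userspace sketch. The types and find_port() are hypothetical stand-ins,
not the real cxl_core structures or find_cxl_port(); the only assumption carried
over is that the out pointer is written solely when the caller supplies one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct cxl_dport / struct cxl_port. */
struct dport { int id; };
struct port { struct dport dport; };

/*
 * Mimics the "if (ctx->dport) *ctx->dport = dport;" pattern: the out
 * pointer is only written when the caller passed a non-NULL address.
 */
static struct port *find_port(struct port *candidate, struct dport **dport_out)
{
	if (!candidate)
		return NULL;		/* lookup failed, nothing written */
	if (dport_out)
		*dport_out = &candidate->dport;
	return candidate;
}
```

With a lookup shaped like this, a caller that only needs the port can pass NULL
and drop the local dport variable entirely.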
> +
> + if (!port) {
> + pci_err(pdev, "Failed to find the CXL device");
> + return NULL;
> + }
> + return port;
> + }
> + case PCI_EXP_TYPE_UPSTREAM:
> + case PCI_EXP_TYPE_ENDPOINT:
> + {
> + struct cxl_port *port = find_cxl_port_by_uport(&pdev->dev);
> +
> + if (!port) {
> + pci_err(pdev, "Failed to find the CXL device");
> + return NULL;
> + }
> + return port;
> + }
> + }
> +
> + pr_err_ratelimited("%s: Error - Unsupported device type (%#x)",
> + pci_name(pdev), pci_pcie_type(pdev));
> + return NULL;
> +}
> +int cxl_ras_init(void)
> +{
> + if (cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work))
> + pr_err("Failed to initialize CXL RAS CPER\n");
Why introduce a new error print? I don't particularly mind
but wasn't obvious to me why one has become appropriate and why only
for the first call here.
More importantly - if this failed it would previously have resulted
in cxl_core_init() failing and things getting torn down.
> +
> + cxl_register_proto_err_work(&cxl_proto_err_work);
> +
> + return 0;
> +}
> +
> +void cxl_ras_exit(void)
> +{
> + cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
> + cancel_work_sync(&cxl_cper_prot_err_work);
> +
> + cxl_unregister_proto_err_work();
> + cancel_work_sync(&cxl_proto_err_work);
> +}
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v15 5/9] PCI: Establish common CXL Port protocol error flow
2026-02-03 2:52 ` [PATCH v15 5/9] PCI: Establish common CXL Port protocol error flow Terry Bowman
@ 2026-02-03 15:40 ` Jonathan Cameron
2026-02-03 18:21 ` Bowman, Terry
2026-02-04 5:08 ` dan.j.williams
1 sibling, 1 reply; 31+ messages in thread
From: Jonathan Cameron @ 2026-02-03 15:40 UTC (permalink / raw)
To: Terry Bowman
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On Mon, 2 Feb 2026 20:52:40 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> Introduce CXL Port protocol error handling callbacks to unify detection,
> logging, and recovery across CXL Ports and Endpoints, including RCH
> downstream ports. Establish a consistent flow for correctable and
> uncorrectable CXL protocol errors.
>
> Provide the solution by adding cxl_port_cor_error_detected() and
> cxl_port_error_detected() to handle correctable and uncorrectable handling
> through CXL RAS helpers, coordinating uncorrectable recovery in
> cxl_do_recovery(), and panicking when the handler returns PCI_ERS_RESULT_PANIC
> to preserve fatal cachemem behavior. Gate endpoint handling on the endpoint
> driver being bound to avoid processing errors on disabled devices.
>
> Centralize the RAS base lookup in cxl_get_ras_base(), selecting the
> downstream-port dport->regs.ras for Root/Downstream Ports and port->regs.ras
> for Upstream Ports/Endpoints.
>
> Export pcie_clear_device_status() and pci_aer_clear_fatal_status() to enable
> cxl_core to clear PCIe/AER state in these flows.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Hi Terry,
A few comments inline.
Thanks,
Jonathan
>
> ---
>
> Changes in v14->v15:
> - Update commit message and title. Added Bjorn's ack.
> - Move CE and UCE handling logic here
>
> Changes in v13->v14:
> - Add Dave Jiang's review-by
> - Update commit message & headline (Bjorn)
> - Refactor cxl_port_error_detected()/cxl_port_cor_error_detected() to
> one line (Jonathan)
> - Remove cxl_walk_port() (Dan)
> - Remove cxl_pci_drv_bound(). Check for 'is_cxl' parent port is
> sufficient (Dan)
> - Remove device_lock_if()
> - Combined CE and UCE here (Terry)
>
> Changes in v12->v13:
> - Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
> patch (Terry)
> - Remove EP case in cxl_get_ras_base(), not used. (Terry)
> - Remove check for dport->dport_dev (Dave)
> - Remove whitespace (Terry)
>
> Changes in v11->v12:
> - Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
> pci_to_cxl_dev()
> - Change cxl_error_detected() -> cxl_cor_error_detected()
> - Remove NULL variable assignments
> - Replace bus_find_device() with find_cxl_port_by_uport() for upstream
> port searches.
>
> Changes in v10->v11:
> - None
> ---
> drivers/cxl/core/ras.c | 134 +++++++++++++++++++++++++++++++++++++++++
> drivers/pci/pci.c | 1 +
> drivers/pci/pci.h | 2 -
> drivers/pci/pcie/aer.c | 1 +
> include/linux/aer.h | 2 +
> include/linux/pci.h | 2 +
> 6 files changed, 140 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index a6c0bc6d7203..0216dafa6118 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -218,6 +218,68 @@ static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
> return NULL;
> }
>
> +static void __iomem *cxl_get_ras_base(struct device *dev)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> +
> + switch (pci_pcie_type(pdev)) {
> + case PCI_EXP_TYPE_ROOT_PORT:
> + case PCI_EXP_TYPE_DOWNSTREAM:
> + {
> + struct cxl_dport *dport;
struct cxl_dport *dport = NULL;
> + struct cxl_port *port __free(put_cxl_port) = find_cxl_port(&pdev->dev, &dport);
as if this failed, dport is not written. The alternative is to check port, not dport, as port
will always be initialized whether or not find_cxl_port() fails.
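A minimal userspace sketch of the two options, assuming a lookup that leaves the
out-parameter untouched on failure. All names here (find_port(), get_ras_base(),
the struct layouts) are hypothetical stand-ins rather than the cxl_core code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct cxl_dport / struct cxl_port. */
struct dport { void *ras; };
struct port { struct dport dport; };

/* On failure this returns NULL and does not touch *dport_out. */
static struct port *find_port(struct port *candidate, struct dport **dport_out)
{
	if (!candidate)
		return NULL;
	*dport_out = &candidate->dport;
	return candidate;
}

static void *get_ras_base(struct port *candidate)
{
	struct dport *dport = NULL;	/* option 1: defined value on failure */
	struct port *port = find_port(candidate, &dport);

	/* option 2: test the return value, which is always assigned */
	if (!port)
		return NULL;
	return dport->ras;
}
```

Either change on its own avoids reading an indeterminate pointer; the sketch
applies both only to show them side by side.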
> +
> + if (!dport) {
> + pci_err(pdev, "Failed to find the CXL device");
> + return NULL;
> + }
> + return dport->regs.ras;
> + }
> + case PCI_EXP_TYPE_UPSTREAM:
> + case PCI_EXP_TYPE_ENDPOINT:
> + {
> + struct cxl_port *port __free(put_cxl_port) = find_cxl_port_by_uport(&pdev->dev);
> +
> + if (!port) {
> + pci_err(pdev, "Failed to find the CXL device");
> + return NULL;
> + }
> + return port->regs.ras;
> + }
> + }
> + dev_warn_once(dev, "Error: Unsupported device type (%#x)", pci_pcie_type(pdev));
> + return NULL;
> +}
> void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> {
> void __iomem *addr;
> @@ -288,6 +350,60 @@ bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> return true;
> }
>
> +static void cxl_port_cor_error_detected(struct device *dev)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> + struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
> +
> + if (is_cxl_endpoint(port)) {
> + struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> + guard(device)(&cxlmd->dev);
Maybe add a comment on why this lock needs to be held and then why dev->driver
below needs to be set.
> +
> + if (!dev->driver) {
> + dev_warn(&pdev->dev,
> + "%s: memdev disabled, abort error handling\n",
> + dev_name(dev));
Same question as below on why pdev->dev / dev_name(dev) here.
Maybe pci_warn() is more appropriate.
> + return;
> + }
> +
> + if (cxlds->rcd)
> + cxl_handle_rdport_errors(cxlds);
> +
> + cxl_handle_cor_ras(dev, cxlds->serial, cxl_get_ras_base(dev));
> + } else {
> + cxl_handle_cor_ras(dev, 0, cxl_get_ras_base(dev));
> + }
> +}
> +
> +static pci_ers_result_t cxl_port_error_detected(struct device *dev)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> + struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
> +
> + if (is_cxl_endpoint(port)) {
> + struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> + guard(device)(&cxlmd->dev);
> +
> + if (!dev->driver) {
> + dev_warn(&pdev->dev,
Somewhat circular. dev_warn() will print the device name anyway I think and
this pdev->dev == dev here so might as well use that.
Or was intent to use different devices?
> + "%s: memdev disabled, abort error handling\n",
> + dev_name(dev));
> + return PCI_ERS_RESULT_NONE;
> + }
> +
> + if (cxlds->rcd)
> + cxl_handle_rdport_errors(cxlds);
> +
> + return cxl_handle_ras(dev, cxlds->serial, cxl_get_ras_base(dev));
> + } else {
> + return cxl_handle_ras(dev, 0, cxl_get_ras_base(dev));
> + }
> +}
>
> static void cxl_proto_err_work_fn(struct work_struct *work)
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v15 6/9] cxl: Update error handlers to support CXL Port protocol errors
2026-02-03 2:52 ` [PATCH v15 6/9] cxl: Update error handlers to support CXL Port protocol errors Terry Bowman
@ 2026-02-03 15:54 ` Jonathan Cameron
0 siblings, 0 replies; 31+ messages in thread
From: Jonathan Cameron @ 2026-02-03 15:54 UTC (permalink / raw)
To: Terry Bowman
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On Mon, 2 Feb 2026 20:52:41 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> CXL Protocol errors are logged for Endpoints in cxl_handle_ras() and
> cxl_handle_cor_ras(). The same is missing for CXL Port devices. The CXL
> Port logging function is already present but needs a call added from
> the handlers.
>
> Update cxl_handle_ras() and cxl_handle_cor_ras() to call the CXL Port
> trace logging function.
>
> Also, add log messages in the case 'ras_base' is NULL. And, add calls to
> the existing CXL Port tracing in the same functions.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
The error type was already wrongly documented for cxl_handle_ras().
This makes that comment inaccurate in a different way, particularly as you return a bool
value for a pci_ers_result_t.
>
> ---
>
> Changes in v14 -> v15:
> - New commit
> ---
> drivers/cxl/core/core.h | 10 ++++++----
> drivers/cxl/core/ras.c | 30 ++++++++++++++++++++++--------
> 2 files changed, 28 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 92aea110817d..3b232e991b12 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 0216dafa6118..970ff3df442c 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> /* CXL spec rev3.0 8.2.4.16.1 */
> @@ -317,15 +324,19 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
> * Log the state of the RAS status registers and prepare them to log the
> * next error status. Return 1 if reset needed.
It didn't return 1 previously, and now it fails to do so in a different way.
So the comment needs an update.
> */
> -bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> +pci_ers_result_t
> +cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> {
> u32 hl[CXL_HEADERLOG_SIZE_U32];
> void __iomem *addr;
> u32 status;
> u32 fe;
>
> - if (!ras_base)
> + if (!ras_base) {
> + pr_err_ratelimited("%s: CXL RAS registers aren't mapped\n",
> + dev_name(dev));
> return false;
Returning false as a pci_ers_result_t?
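A small userspace sketch of why this matters. The typedef and constants are
stand-ins modeled on include/linux/pci.h, where the named results start at 1;
treat the exact values as an assumption. Which named result is appropriate for
an unmapped ras_base is a separate question, and ERS_RESULT_NONE below is only
one possible choice.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for pci_ers_result_t; values assumed, not copied from the tree. */
typedef unsigned int ers_result_t;

enum {
	ERS_RESULT_NONE = 1,
	ERS_RESULT_NEED_RESET = 3,
};

/* Shape of the quoted hunk: a leftover bool return in a non-bool function. */
static ers_result_t handle_ras_bool_return(const void *ras_base)
{
	if (!ras_base)
		return false;	/* converts to 0, which is no named result */
	return ERS_RESULT_NEED_RESET;
}

/* Same early exit, returning a named result instead. */
static ers_result_t handle_ras_named_return(const void *ras_base)
{
	if (!ras_base)
		return ERS_RESULT_NONE;
	return ERS_RESULT_NEED_RESET;
}
```

Under these assumed values, the bool variant hands callers a 0 that matches none
of the named results, so any switch on the result falls through to its default
case.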
> + }
>
> addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
> status = readl(addr);
> @@ -344,10 +355,13 @@ bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> }
>
> header_log_copy(ras_base, hl);
> - trace_cxl_aer_uncorrectable_error(dev, status, fe, hl, serial);
> + if (is_cxl_memdev(dev))
> + trace_cxl_aer_uncorrectable_error(dev, status, fe, hl, serial);
> + else
> + trace_cxl_port_aer_uncorrectable_error(dev, status, fe, hl);
> writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>
> - return true;
> + return PCI_ERS_RESULT_PANIC;
> }
>
> static void cxl_port_cor_error_detected(struct device *dev)
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v15 7/9] cxl: Update Endpoint AER uncorrectable handler
2026-02-03 2:52 ` [PATCH v15 7/9] cxl: Update Endpoint AER uncorrectable handler Terry Bowman
@ 2026-02-03 16:18 ` Jonathan Cameron
2026-02-03 17:31 ` Dave Jiang
1 sibling, 0 replies; 31+ messages in thread
From: Jonathan Cameron @ 2026-02-03 16:18 UTC (permalink / raw)
To: Terry Bowman
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On Mon, 2 Feb 2026 20:52:42 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> CXL drivers now implement protocol RAS support. PCI protocol errors,
> however, continue to be reported via the AER capability and must still be
> handled by a PCI error recovery callback.
>
> Replace the existing cxl_error_detected() callback in cxl/pci.c with a
> new cxl_pci_error_detected() implementation that handles only uncorrectable
> PCI protocol errors reported through AER.
>
> Introduce a helper named cxl_handle_aer() to handle and
> log the CXL device's AER error.
>
> This cleanly separates CXL protocol error handling from PCI AER handling
> and ensures that each subsystem processes only the errors it is
> responsible for.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>
> ---
>
> Changes in v14->v15:
> - Title update (Terry)
> - Change cxl_pci_error_detected() to handle & log AER (Terry)
> - Update commit message (Terry)
> - Moved cxl_handle_ras()/cxl_handle_cor_ras() to earlier patch (Terry)
>
> Changes in v13->v14:
> - Update commit headline (Bjorn)
> - Rename pci_error_detected()/pci_cor_error_detected() ->
> cxl_pci_error_detected/cxl_pci_cor_error_detected() (Jonathan)
> - Remove now-invalid comment in cxl_error_detected() (Jonathan)
> - Split into separate patches for UCE and CE (Terry)
>
> Changes in v12->v13:
> - Update commit message (Terry)
> - Updated all the implementation and commit message. (Terry)
> - Refactored cxl_cor_error_detected()/cxl_error_detected() to remove
> pdev (Dave Jiang)
>
> Changes in v11->v12:
> - None
>
> Changes in v10->v11:
> - cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
> - cxl_error_detected() - Remove extra line (Shiju)
> - Changes moved to core/ras.c (Terry)
> - cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
> - Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
> - Move #include "pci.h from cxl.h to core.h (Terry)
> - Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)
> ---
> drivers/cxl/core/ras.c | 68 +++++++++++++++---------------------------
> drivers/cxl/cxlpci.h | 9 +++---
> drivers/cxl/pci.c | 6 ++--
> 3 files changed, 31 insertions(+), 52 deletions(-)
>
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 970ff3df442c..061e6aaec176 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -441,55 +441,35 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
>
> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> - pci_channel_state_t state)
> +static bool cxl_handle_aer(struct pci_dev *pdev)
> {
> - struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> - struct cxl_memdev *cxlmd = cxlds->cxlmd;
> - struct device *dev = &cxlmd->dev;
> - bool ue;
> -
> - scoped_guard(device, dev) {
> - if (!dev->driver) {
> - dev_warn(&pdev->dev,
> - "%s: memdev disabled, abort error handling\n",
> - dev_name(dev));
> - return PCI_ERS_RESULT_DISCONNECT;
> - }
> + struct aer_capability_regs aer;
I don't see a strong reason to use this structure given you just want two
of the registers and read into them one by one.
> + u32 aer_cap = pdev->aer_cap;
>
> - if (cxlds->rcd)
> - cxl_handle_rdport_errors(cxlds);
> - /*
> - * A frozen channel indicates an impending reset which is fatal to
> - * CXL.mem operation, and will likely crash the system. On the off
> - * chance the situation is recoverable dump the status of the RAS
> - * capability registers and bounce the active state of the memdev.
> - */
> - ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial,
> - cxlmd->endpoint->regs.ras);
> + if (!aer_cap) {
> + pr_warn_ratelimited("%s: AER capability isn't present\n",
> + pci_name(pdev));
These could use dev_warn_ratelimited()
or even add a wrapper similar to pci_info_ratelimited()
> + return false;
> }
>
> - switch (state) {
> - case pci_channel_io_normal:
> - if (ue) {
> - device_release_driver(dev);
> - return PCI_ERS_RESULT_NEED_RESET;
> - }
> - return PCI_ERS_RESULT_CAN_RECOVER;
> - case pci_channel_io_frozen:
> - dev_warn(&pdev->dev,
> - "%s: frozen state error detected, disable CXL.mem\n",
> - dev_name(dev));
> - device_release_driver(dev);
> - return PCI_ERS_RESULT_NEED_RESET;
> - case pci_channel_io_perm_failure:
> - dev_warn(&pdev->dev,
> - "failure state error detected, request disconnect\n");
> - return PCI_ERS_RESULT_DISCONNECT;
> - }
> - return PCI_ERS_RESULT_NEED_RESET;
> + pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_STATUS, &aer.uncor_status);
> + pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_MASK, &aer.uncor_mask);
> +
> + /* The AER driver logged the error */
> + pci_aer_clear_nonfatal_status(pdev);
> + pci_aer_clear_fatal_status(pdev);
> +
> + return (aer.uncor_status & aer.uncor_mask);
> +}
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [PATCH v15 8/9] cxl: Remove Endpoint AER correctable handler
2026-02-03 2:52 ` [PATCH v15 8/9] cxl: Remove Endpoint AER correctable handler Terry Bowman
@ 2026-02-03 16:27 ` Jonathan Cameron
0 siblings, 0 replies; 31+ messages in thread
From: Jonathan Cameron @ 2026-02-03 16:27 UTC (permalink / raw)
To: Terry Bowman
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On Mon, 2 Feb 2026 20:52:43 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> CXL drivers don't require a correctable PCI AER handler. Correctable AER
> errors reported by CXL devices are logged and cleared in the AER driver.
> This makes the correctable AER handler callback in the CXL driver
> unnecessary.
>
> Remove cxl_cor_error_detected() and drop the .cor_error_detected callback
> from the CXL PCI error handlers.
>
> This consolidates correctable error reporting under the CXL RAS infrastructure
> and avoids redundant or conflicting logging with the AER driver.
Please add a before and after log so we know what the redundant info was that has
been dropped.
Jonathan
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>
> ---
>
> Changes in v14->v15:
> - Remove cxl_pci_cor_error_detected(). Is not needed. AER is logged
> in the AER driver. (Dan)
> - Update commit message (Terry)
>
> Changes in v13->v14:
> - New commit
> - Change cxl_cor_error_detected() parameter to &pdev->dev device from
> memdev device. (Terry)
> - Updated commit message (Terry)
> ---
> drivers/cxl/core/ras.c | 23 -----------------------
> drivers/cxl/cxlpci.h | 2 --
> drivers/cxl/pci.c | 1 -
> 3 files changed, 26 deletions(-)
>
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 061e6aaec176..e5a0d0283d3f 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -418,29 +418,6 @@ static pci_ers_result_t cxl_port_error_detected(struct device *dev)
> }
> }
>
> -void cxl_cor_error_detected(struct pci_dev *pdev)
> -{
> - struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> - struct cxl_memdev *cxlmd = cxlds->cxlmd;
> - struct device *dev = &cxlds->cxlmd->dev;
> -
> - scoped_guard(device, dev) {
> - if (!dev->driver) {
> - dev_warn(&pdev->dev,
> - "%s: memdev disabled, abort error handling\n",
> - dev_name(dev));
> - return;
> - }
> -
> - if (cxlds->rcd)
> - cxl_handle_rdport_errors(cxlds);
> -
> - cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->serial,
> - cxlmd->endpoint->regs.ras);
> - }
> -}
> -EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
> -
> static bool cxl_handle_aer(struct pci_dev *pdev)
> {
> struct aer_capability_regs aer;
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 5534422b496c..e3388dffdd75 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -78,13 +78,11 @@ struct cxl_dev_state;
> void read_cdat_data(struct cxl_port *port);
>
> #ifdef CONFIG_CXL_RAS
> -void cxl_cor_error_detected(struct pci_dev *pdev);
> void devm_cxl_dport_rch_ras_setup(struct cxl_dport *dport);
> pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
> pci_channel_state_t error);
> void devm_cxl_port_ras_setup(struct cxl_port *port);
> #else
> -static inline void cxl_cor_error_detected(struct pci_dev *pdev) { }
> static inline pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
> pci_channel_state_t state)
> {
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index ff741adc7c7f..c6b2966f5fda 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -1055,7 +1055,6 @@ static const struct pci_error_handlers pci_error_handlers = {
> .error_detected = cxl_pci_error_detected,
> .slot_reset = cxl_slot_reset,
> .resume = cxl_error_resume,
> - .cor_error_detected = cxl_cor_error_detected,
> .reset_done = cxl_reset_done,
> };
>
* Re: [PATCH v15 4/9] PCI/AER: Dequeue forwarded CXL error
2026-02-03 15:26 ` Jonathan Cameron
@ 2026-02-03 17:00 ` Bowman, Terry
2026-02-05 17:13 ` Jonathan Cameron
0 siblings, 1 reply; 31+ messages in thread
From: Bowman, Terry @ 2026-02-03 17:00 UTC (permalink / raw)
To: Jonathan Cameron
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On 2/3/2026 9:26 AM, Jonathan Cameron wrote:
> On Mon, 2 Feb 2026 20:52:39 -0600
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The AER driver now forwards CXL protocol errors to the CXL driver via a
>> kfifo. The CXL driver must consume these work items and initiate protocol
>> error handling while ensuring the device's RAS mappings remain valid
>> throughout processing.
>>
>> Implement cxl_proto_err_work_fn() to dequeue work items forwarded by the
>> AER service driver. Lock the parent CXL Port device to ensure the CXL
>> device's RAS registers are accessible during handling. Add pdev reference-put
>> to match reference-get in AER driver. This will ensure pdev access after
>> kfifo dequeue. These changes apply to CXL Ports and CXL Endpoints.
>>
>> Update is_cxl_error() to recognize CXL Port devices with errors.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
>
> There are some small functional changes to existing paths (maybe)
> that I think need explanations in this commit message.
>
> Otherwise, one suggested small simplification.
>
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index 74df561ed32e..a6c0bc6d7203 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
>> @@ -118,17 +118,6 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
>> }
>> static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
>>
>> -int cxl_ras_init(void)
>> -{
>> - return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
>> -}
>> -
>> -void cxl_ras_exit(void)
>> -{
>> - cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
>> - cancel_work_sync(&cxl_cper_prot_err_work);
>> -}
>> -
>> static void cxl_dport_map_ras(struct cxl_dport *dport)
>> {
>> struct cxl_register_map *map = &dport->reg_map;
>> @@ -185,6 +174,50 @@ void devm_cxl_port_ras_setup(struct cxl_port *port)
>> }
>> EXPORT_SYMBOL_NS_GPL(devm_cxl_port_ras_setup, "CXL");
>>
>> +/*
>> + * get_cxl_port - Return the parent CXL Port of a PCI device
>> + * @pdev: PCI device whose parent CXL Port is being queried
>> + *
>> + * Looks up and returns the parent CXL Port associated with @pdev. On
>> + * success, the returned port has its reference count incremented and must
>> + * be released by the caller. Returns NULL if no associated CXL port is
>> + * found.
>> + *
>> + * Return: Pointer to the parent &struct cxl_port or NULL on failure
>> + */
>> +static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
>> +{
>> + switch (pci_pcie_type(pdev)) {
>> + case PCI_EXP_TYPE_ROOT_PORT:
>> + case PCI_EXP_TYPE_DOWNSTREAM:
>> + {
>> + struct cxl_dport *dport;
>> + struct cxl_port *port = find_cxl_port(&pdev->dev, &dport);
>
> Can you pass NULL for dport? Looks like it to me as that ultimately ends
> up in match_port_by_dport() and
> if (ctx->dport)
> *ctx->dport = dport;
>
> where with this as null means ctx->dport == NULL.
>
Yes.
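A minimal userspace sketch of the optional out-parameter pattern discussed
above. The demo_* types and demo_find_port() are hypothetical stand-ins for
illustration only, not the kernel's find_cxl_port() or match_port_by_dport():

```c
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's struct cxl_port/cxl_dport. */
struct demo_dport {
	int id;
};

struct demo_port {
	struct demo_dport dports[2];
	size_t nr_dports;
};

/*
 * Return @port if it owns a dport with @id, else NULL. @dport_out is
 * optional: it is written only when the caller passed a non-NULL pointer,
 * so callers that only need the port can pass NULL.
 */
static struct demo_port *demo_find_port(struct demo_port *port, int id,
					struct demo_dport **dport_out)
{
	for (size_t i = 0; i < port->nr_dports; i++) {
		if (port->dports[i].id != id)
			continue;
		if (dport_out)
			*dport_out = &port->dports[i];
		return port;
	}
	return NULL;
}
```

With this shape, a caller that never uses the dport does not need a local
variable for it.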
>> +
>> + if (!port) {
>> + pci_err(pdev, "Failed to find the CXL device");
>> + return NULL;
>> + }
>> + return port;
>> + }
>> + case PCI_EXP_TYPE_UPSTREAM:
>> + case PCI_EXP_TYPE_ENDPOINT:
>> + {
>> + struct cxl_port *port = find_cxl_port_by_uport(&pdev->dev);
>> +
>> + if (!port) {
>> + pci_err(pdev, "Failed to find the CXL device");
>> + return NULL;
>> + }
>> + return port;
>> + }
>> + }
>> +
>> + pr_err_ratelimited("%s: Error - Unsupported device type (%#x)",
>> + pci_name(pdev), pci_pcie_type(pdev));
>> + return NULL;
>> +}
>
>
>> +int cxl_ras_init(void)
>> +{
>> + if (cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work))
>> + pr_err("Failed to initialize CXL RAS CPER\n");
>
> Why introduce a new error print? I don't particularly mind
> but wasn't obvious to me why one has become appropriate and why only
> for the first call here.
>
This was introduced before v10. A RAS initialization failure should not
fail cxl_core probe. OS-first AER support was added in this series, in this
file next to CPER. There is only one log because CPER initialization can
fail and OS-first registration cannot.
When I look at this block of code I'm drawn to the return value. It looks
like it should be a void function. Thoughts?
- Terry
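A rough userspace sketch of what a void variant could look like, assuming
hypothetical demo_* stubs in place of the CPER and AER-kfifo registration
calls. It only illustrates "log the failure and carry on"; it is not the
proposed kernel change:

```c
#include <stdio.h>

/* Hypothetical stubs standing in for the two registration calls. */
static int demo_cper_register_rc;	/* non-zero simulates a CPER failure */
static int demo_proto_registered;

static int demo_register_cper_work(void)
{
	return demo_cper_register_rc;
}

static void demo_register_proto_work(void)
{
	demo_proto_registered = 1;
}

/*
 * Void variant: a CPER registration failure is logged but does not stop
 * the second registration (which cannot fail) or the caller's init.
 */
static void demo_ras_init(void)
{
	if (demo_register_cper_work())
		fprintf(stderr, "Failed to initialize CXL RAS CPER\n");

	demo_register_proto_work();
}
```

The trade-off Jonathan points out still applies: with no return value the
caller can no longer tear down on a CPER registration failure.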
> More importantly - if this failed it would previously have resulted
> in cxl_core_init() failing and things getting torn down.
>
>> +
>> + cxl_register_proto_err_work(&cxl_proto_err_work);
>> +
>> + return 0;
>> +}
>> +
>> +void cxl_ras_exit(void)
>> +{
>> + cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
>> + cancel_work_sync(&cxl_cper_prot_err_work);
>> +
>> + cxl_unregister_proto_err_work();
>> + cancel_work_sync(&cxl_proto_err_work);
>> +}
>
>
* Re: [PATCH v15 7/9] cxl: Update Endpoint AER uncorrectable handler
2026-02-03 2:52 ` [PATCH v15 7/9] cxl: Update Endpoint AER uncorrectable handler Terry Bowman
2026-02-03 16:18 ` Jonathan Cameron
@ 2026-02-03 17:31 ` Dave Jiang
2026-02-03 18:35 ` Bowman, Terry
1 sibling, 1 reply; 31+ messages in thread
From: Dave Jiang @ 2026-02-03 17:31 UTC (permalink / raw)
To: Terry Bowman, dave, jonathan.cameron, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci
On 2/2/26 7:52 PM, Terry Bowman wrote:
> CXL drivers now implement protocol RAS support. PCI protocol errors,
> however, continue to be reported via the AER capability and must still be
> handled by a PCI error recovery callback.
>
> Replace the existing cxl_error_detected() callback in cxl/pci.c with a
> new cxl_pci_error_detected() implementation that handles only uncorrectable
> PCI protocol errors reported through AER.
Do we need to explain why only uncorrectable is handled?
>
> Introduce a helper named cxl_handle_aer() to handle and
> log the CXL device's AER error.
>
> This cleanly separates CXL protocol error handling from PCI AER handling
> and ensures that each subsystem processes only the errors it is
> responsible for.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>
> ---
>
> Changes in v14->v15:
> - Title update (Terry)
> - Change cxl_pci_error_detected() to handle & log AER (Terry)
> - Update commit message (Terry)
> - Moved cxl_handle_ras()/cxl_handle_cor_ras() to earlier patch (Terry)
>
> Changes in v13->v14:
> - Update commit headline (Bjorn)
> - Rename pci_error_detected()/pci_cor_error_detected() ->
> cxl_pci_error_detected/cxl_pci_cor_error_detected() (Jonathan)
> - Remove now-invalid comment in cxl_error_detected() (Jonathan)
> - Split into separate patches for UCE and CE (Terry)
>
> Changes in v12->v13:
> - Update commit message (Terry)
> - Updated all the implementation and commit message. (Terry)
> - Refactored cxl_cor_error_detected()/cxl_error_detected() to remove
> pdev (Dave Jiang)
>
> Changes in v11->v12:
> - None
>
> Changes in v10->v11:
> - cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
> - cxl_error_detected() - Remove extra line (Shiju)
> - Changes moved to core/ras.c (Terry)
> - cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
> - Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
> - Move #include "pci.h from cxl.h to core.h (Terry)
> - Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)
> ---
> drivers/cxl/core/ras.c | 68 +++++++++++++++---------------------------
> drivers/cxl/cxlpci.h | 9 +++---
> drivers/cxl/pci.c | 6 ++--
> 3 files changed, 31 insertions(+), 52 deletions(-)
>
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 970ff3df442c..061e6aaec176 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -441,55 +441,35 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
>
> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> - pci_channel_state_t state)
> +static bool cxl_handle_aer(struct pci_dev *pdev)
For a function that returns a bool, the function name doesn't sound quite right. Maybe cxl_uncor_aer_present()?
DJ
> {
> - struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> - struct cxl_memdev *cxlmd = cxlds->cxlmd;
> - struct device *dev = &cxlmd->dev;
> - bool ue;
> -
> - scoped_guard(device, dev) {
> - if (!dev->driver) {
> - dev_warn(&pdev->dev,
> - "%s: memdev disabled, abort error handling\n",
> - dev_name(dev));
> - return PCI_ERS_RESULT_DISCONNECT;
> - }
> + struct aer_capability_regs aer;
> + u32 aer_cap = pdev->aer_cap;
>
> - if (cxlds->rcd)
> - cxl_handle_rdport_errors(cxlds);
> - /*
> - * A frozen channel indicates an impending reset which is fatal to
> - * CXL.mem operation, and will likely crash the system. On the off
> - * chance the situation is recoverable dump the status of the RAS
> - * capability registers and bounce the active state of the memdev.
> - */
> - ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial,
> - cxlmd->endpoint->regs.ras);
> + if (!aer_cap) {
> + pr_warn_ratelimited("%s: AER capability isn't present\n",
> + pci_name(pdev));
> + return false;
> }
>
> - switch (state) {
> - case pci_channel_io_normal:
> - if (ue) {
> - device_release_driver(dev);
> - return PCI_ERS_RESULT_NEED_RESET;
> - }
> - return PCI_ERS_RESULT_CAN_RECOVER;
> - case pci_channel_io_frozen:
> - dev_warn(&pdev->dev,
> - "%s: frozen state error detected, disable CXL.mem\n",
> - dev_name(dev));
> - device_release_driver(dev);
> - return PCI_ERS_RESULT_NEED_RESET;
> - case pci_channel_io_perm_failure:
> - dev_warn(&pdev->dev,
> - "failure state error detected, request disconnect\n");
> - return PCI_ERS_RESULT_DISCONNECT;
> - }
> - return PCI_ERS_RESULT_NEED_RESET;
> + pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_STATUS, &aer.uncor_status);
> + pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_MASK, &aer.uncor_mask);
> +
> + /* The AER driver logged the error */
> + pci_aer_clear_nonfatal_status(pdev);
> + pci_aer_clear_fatal_status(pdev);
> +
> + return (aer.uncor_status & aer.uncor_mask);
> +}
> +
> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
> + pci_channel_state_t error)
> +{
> + u32 rc = cxl_handle_aer(pdev);
> +
> + return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_CAN_RECOVER;
> }
> -EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
> +EXPORT_SYMBOL_NS_GPL(cxl_pci_error_detected, "CXL");
>
> static void cxl_handle_proto_error(struct cxl_proto_err_work_data *err_info)
> {
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 970add0256e9..5534422b496c 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -79,15 +79,14 @@ void read_cdat_data(struct cxl_port *port);
>
> #ifdef CONFIG_CXL_RAS
> void cxl_cor_error_detected(struct pci_dev *pdev);
> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> - pci_channel_state_t state);
> void devm_cxl_dport_rch_ras_setup(struct cxl_dport *dport);
> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
> + pci_channel_state_t error);
> void devm_cxl_port_ras_setup(struct cxl_port *port);
> #else
> static inline void cxl_cor_error_detected(struct pci_dev *pdev) { }
> -
> -static inline pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> - pci_channel_state_t state)
> +static inline pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
> + pci_channel_state_t state)
> {
> return PCI_ERS_RESULT_NONE;
> }
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index acb0eb2a13c3..ff741adc7c7f 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -1051,8 +1051,8 @@ static void cxl_reset_done(struct pci_dev *pdev)
> }
> }
>
> -static const struct pci_error_handlers cxl_error_handlers = {
> - .error_detected = cxl_error_detected,
> +static const struct pci_error_handlers pci_error_handlers = {
> + .error_detected = cxl_pci_error_detected,
> .slot_reset = cxl_slot_reset,
> .resume = cxl_error_resume,
> .cor_error_detected = cxl_cor_error_detected,
> @@ -1063,7 +1063,7 @@ static struct pci_driver cxl_pci_driver = {
> .name = KBUILD_MODNAME,
> .id_table = cxl_mem_pci_tbl,
> .probe = cxl_pci_probe,
> - .err_handler = &cxl_error_handlers,
> + .err_handler = &pci_error_handlers,
> .dev_groups = cxl_rcd_groups,
> .driver = {
> .probe_type = PROBE_PREFER_ASYNCHRONOUS,
* Re: [PATCH v15 5/9] PCI: Establish common CXL Port protocol error flow
2026-02-03 15:40 ` Jonathan Cameron
@ 2026-02-03 18:21 ` Bowman, Terry
2026-02-05 17:16 ` Jonathan Cameron
0 siblings, 1 reply; 31+ messages in thread
From: Bowman, Terry @ 2026-02-03 18:21 UTC (permalink / raw)
To: Jonathan Cameron
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On 2/3/2026 9:40 AM, Jonathan Cameron wrote:
> On Mon, 2 Feb 2026 20:52:40 -0600
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> Introduce CXL Port protocol error handling callbacks to unify detection,
>> logging, and recovery across CXL Ports and Endpoints, including RCH
>> downstream ports. Establish a consistent flow for correctable and
>> uncorrectable CXL protocol errors.
>>
>> Provide the solution by adding cxl_port_cor_error_detected() and
>> cxl_port_error_detected() to handle correctable and uncorrectable handling
>> through CXL RAS helpers, coordinating uncorrectable recovery in
>> cxl_do_recovery(), and panicking when the handler returns PCI_ERS_RESULT_PANIC
>> to preserve fatal cachemem behavior. Gate endpoint handling on the endpoint
>> driver being bound to avoid processing errors on disabled devices.
>>
>> Centralize the RAS base lookup in cxl_get_ras_base(), selecting the
>> downstream-port dport->regs.ras for Root/Downstream Ports and port->regs.ras
>> for Upstream Ports/Endpoints.
>>
>> Export pcie_clear_device_status() and pci_aer_clear_fatal_status() to enable
>> cxl_core to clear PCIe/AER state in these flows.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>
> Hi Terry,
>
> A few comments inline.
>
> Thanks,
>
> Jonathan
>
Thanks for reviewing.
>>
>> ---
>>
>> Changes in v14->v15:
>> - Update commit message and title. Added Bjorn's ack.
>> - Move CE and UCE handling logic here
>>
>> Changes in v13->v14:
>> - Add Dave Jiang's review-by
>> - Update commit message & headline (Bjorn)
>> - Refactor cxl_port_error_detected()/cxl_port_cor_error_detected() to
>> one line (Jonathan)
>> - Remove cxl_walk_port() (Dan)
>> - Remove cxl_pci_drv_bound(). Check for 'is_cxl' parent port is
>> sufficient (Dan)
>> - Remove device_lock_if()
>> - Combined CE and UCE here (Terry)
>>
>> Changes in v12->v13:
>> - Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
>> patch (Terry)
>> - Remove EP case in cxl_get_ras_base(), not used. (Terry)
>> - Remove check for dport->dport_dev (Dave)
>> - Remove whitespace (Terry)
>>
>> Changes in v11->v12:
>> - Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
>> pci_to_cxl_dev()
>> - Change cxl_error_detected() -> cxl_cor_error_detected()
>> - Remove NULL variable assignments
>> - Replace bus_find_device() with find_cxl_port_by_uport() for upstream
>> port searches.
>>
>> Changes in v10->v11:
>> - None
>> ---
>> drivers/cxl/core/ras.c | 134 +++++++++++++++++++++++++++++++++++++++++
>> drivers/pci/pci.c | 1 +
>> drivers/pci/pci.h | 2 -
>> drivers/pci/pcie/aer.c | 1 +
>> include/linux/aer.h | 2 +
>> include/linux/pci.h | 2 +
>> 6 files changed, 140 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index a6c0bc6d7203..0216dafa6118 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
>> @@ -218,6 +218,68 @@ static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
>> return NULL;
>> }
>>
>> +static void __iomem *cxl_get_ras_base(struct device *dev)
>> +{
>> + struct pci_dev *pdev = to_pci_dev(dev);
>> +
>> + switch (pci_pcie_type(pdev)) {
>> + case PCI_EXP_TYPE_ROOT_PORT:
>> + case PCI_EXP_TYPE_DOWNSTREAM:
>> + {
>> + struct cxl_dport *dport;
>
> struct cxl_dport *dport = NULL;
>
>> + struct cxl_port *port __free(put_cxl_port) = find_cxl_port(&pdev->dev, &dport);
>
> as if this failed, dport is not written. Alternative is check port, not dport as port
> will always be initialized whether or not failure occurs in find_cxl_port()
>
Ok.
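A small userspace sketch of the two options Jonathan describes, combined:
initialize the out-parameter and test the returned port. The ex_* names are
hypothetical and only model a lookup that leaves its out-parameter untouched
on failure:

```c
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's CXL structures. */
struct ex_dport {
	void *ras;
};

struct ex_port {
	struct ex_dport dport;
};

/* Hypothetical lookup: writes *dport_out only on success. */
static struct ex_port *ex_find_port(struct ex_port *port, int found,
				    struct ex_dport **dport_out)
{
	if (!found)
		return NULL;
	*dport_out = &port->dport;
	return port;
}

static void *ex_get_ras_base(struct ex_port *candidate, int found)
{
	/* Defined value even when the lookup fails. */
	struct ex_dport *dport = NULL;
	struct ex_port *port = ex_find_port(candidate, found, &dport);

	/* The return value is always assigned, so it is the safe thing to test. */
	if (!port)
		return NULL;
	return dport->ras;
}
```

Either change alone avoids reading an uninitialized pointer; doing both is
just belt and braces.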
>
>> +
>> + if (!dport) {
>> + pci_err(pdev, "Failed to find the CXL device");
>> + return NULL;
>> + }
>> + return dport->regs.ras;
>> + }
>> + case PCI_EXP_TYPE_UPSTREAM:
>> + case PCI_EXP_TYPE_ENDPOINT:
>> + {
>> + struct cxl_port *port __free(put_cxl_port) = find_cxl_port_by_uport(&pdev->dev);
>> +
>> + if (!port) {
>> + pci_err(pdev, "Failed to find the CXL device");
>> + return NULL;
>> + }
>> + return port->regs.ras;
>> + }
>> + }
>> + dev_warn_once(dev, "Error: Unsupported device type (%#x)", pci_pcie_type(pdev));
>> + return NULL;
>> +}
>
>> void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
>> {
>> void __iomem *addr;
>> @@ -288,6 +350,60 @@ bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
>> return true;
>> }
>>
>> +static void cxl_port_cor_error_detected(struct device *dev)
>> +{
>> + struct pci_dev *pdev = to_pci_dev(dev);
>> + struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
>> +
>> + if (is_cxl_endpoint(port)) {
>> + struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
>> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
>> +
>> + guard(device)(&cxlmd->dev);
>
> Maybe add a comment on why this lock needs to be held and then why the dev->driver
> below needs to be true.
>
This can be removed. It was added to ensure the EP's RAS registers remained accessible
during handling, back when the mapped RAS registers were owned by the CXL memory
device. That has changed: the EP RAS registers are now owned by the EP Port, and
the Endpoint Port is already locked in cxl_proto_err_work_fn() before calling this function.
>> +
>> + if (!dev->driver) {
>> + dev_warn(&pdev->dev,
>> + "%s: memdev disabled, abort error handling\n",
>> + dev_name(dev));
>
> Same question as below on why pdev->dev / dev_name(dev) here.
> Maybe pci_warn() is more appropriate.
>
I believe the driver check can be removed but would like your input. The driver
check is another piece of code that dates from when the handler was accessing
the memdev's RAS registers. It was a last check to make certain the device was bound
to a driver before accessing them. EP RAS is now owned by the Endpoint Port.
Ok, I'll change these to pci_warn().
>> + return;
>> + }
>> +
>> + if (cxlds->rcd)
>> + cxl_handle_rdport_errors(cxlds);
>> +
>> + cxl_handle_cor_ras(dev, cxlds->serial, cxl_get_ras_base(dev));
>> + } else {
>> + cxl_handle_cor_ras(dev, 0, cxl_get_ras_base(dev));
>> + }
>> +}
>> +
>> +static pci_ers_result_t cxl_port_error_detected(struct device *dev)
>> +{
>> + struct pci_dev *pdev = to_pci_dev(dev);
>> + struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
>> +
>> + if (is_cxl_endpoint(port)) {
>> + struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
>> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
>> +
>> + guard(device)(&cxlmd->dev);
>> +
>> + if (!dev->driver) {
>> + dev_warn(&pdev->dev,
>
> Somewhat circular. dev_warn() will print the device name anyway I think and
> this pdev->dev == dev here so might as well use that.
>
> Or was intent to use different devices?
>
No. Same device. I made a last minute change to "make logging more useful with
device name" and the change wasn't necessary here. I'll use pci_warn() as you
mentioned above.
-Terry
>> + "%s: memdev disabled, abort error handling\n",
>> + dev_name(dev));
>> + return PCI_ERS_RESULT_NONE;
>> + }
>> +
>> + if (cxlds->rcd)
>> + cxl_handle_rdport_errors(cxlds);
>> +
>> + return cxl_handle_ras(dev, cxlds->serial, cxl_get_ras_base(dev));
>> + } else {
>> + return cxl_handle_ras(dev, 0, cxl_get_ras_base(dev));
>> + }
>> +}
>
>>
>> static void cxl_proto_err_work_fn(struct work_struct *work)
* Re: [PATCH v15 7/9] cxl: Update Endpoint AER uncorrectable handler
2026-02-03 17:31 ` Dave Jiang
@ 2026-02-03 18:35 ` Bowman, Terry
2026-02-03 18:49 ` Dave Jiang
0 siblings, 1 reply; 31+ messages in thread
From: Bowman, Terry @ 2026-02-03 18:35 UTC (permalink / raw)
To: Dave Jiang, dave, jonathan.cameron, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci
On 2/3/2026 11:31 AM, Dave Jiang wrote:
>
>
> On 2/2/26 7:52 PM, Terry Bowman wrote:
>> CXL drivers now implement protocol RAS support. PCI protocol errors,
>> however, continue to be reported via the AER capability and must still be
>> handled by a PCI error recovery callback.
>>
>> Replace the existing cxl_error_detected() callback in cxl/pci.c with a
>> new cxl_pci_error_detected() implementation that handles only uncorrectable
>> PCI protocol errors reported through AER.
>
> Do we need to explain why only uncorrectable is handled?
>
Would it be OK if I removed "only" with s/only// ?
Having mentioned an important detail, I should elaborate. But how about I
remove it and not refer to CE at all here? CE shouldn't be mentioned in a
primarily UCE patch without good reason.
- Terry
>>
>> Introduce a helper named cxl_handle_aer() to handle and
>> log the CXL device's AER error.
>>
>> This cleanly separates CXL protocol error handling from PCI AER handling
>> and ensures that each subsystem processes only the errors it is
>> responsible for.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>
>> ---
>>
>> Changes in v14->v15:
>> - Title update (Terry)
>> - Change cxl_pci_error_detected() to handle & log AER (Terry)
>> - Update commit message (Terry)
>> - Moved cxl_handle_ras()/cxl_handle_cor_ras() to earlier patch (Terry)
>>
>> Changes in v13->v14:
>> - Update commit headline (Bjorn)
>> - Rename pci_error_detected()/pci_cor_error_detected() ->
>> cxl_pci_error_detected/cxl_pci_cor_error_detected() (Jonathan)
>> - Remove now-invalid comment in cxl_error_detected() (Jonathan)
>> - Split into separate patches for UCE and CE (Terry)
>>
>> Changes in v12->v13:
>> - Update commit message (Terry)
>> - Updated all the implementation and commit message. (Terry)
>> - Refactored cxl_cor_error_detected()/cxl_error_detected() to remove
>> pdev (Dave Jiang)
>>
>> Changes in v11->v12:
>> - None
>>
>> Changes in v10->v11:
>> - cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
>> - cxl_error_detected() - Remove extra line (Shiju)
>> - Changes moved to core/ras.c (Terry)
>> - cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
>> - Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
>> - Move #include "pci.h from cxl.h to core.h (Terry)
>> - Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)
>> ---
>> drivers/cxl/core/ras.c | 68 +++++++++++++++---------------------------
>> drivers/cxl/cxlpci.h | 9 +++---
>> drivers/cxl/pci.c | 6 ++--
>> 3 files changed, 31 insertions(+), 52 deletions(-)
>>
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index 970ff3df442c..061e6aaec176 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
>> @@ -441,55 +441,35 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
>> }
>> EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
>>
>> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>> - pci_channel_state_t state)
>> +static bool cxl_handle_aer(struct pci_dev *pdev)
>
> For a function that returns a bool, the function name doesn't sound quite right. Maybe cxl_uncor_aer_present()?
>
> DJ
>
I was trying to follow the pattern where the detected() function calls the
handle() function, as done for cxl_handle_ras() and cxl_handle_cor_ras().
I will change it to cxl_uncor_aer_present().
-Terry
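A minimal userspace sketch of the predicate-style naming, assuming
hypothetical demo_* names and a two-value result enum. The status/mask
expression simply mirrors the quoted patch and is not meant as a statement
about AER mask semantics:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result codes; the kernel's pci_ers_result_t has more values. */
enum demo_ers_result {
	DEMO_CAN_RECOVER,
	DEMO_DISCONNECT,
};

/*
 * Predicate-style helper: the name says what "true" means. The expression
 * mirrors the quoted patch (status & mask), made explicitly boolean.
 */
static bool demo_uncor_aer_present(uint32_t uncor_status, uint32_t uncor_mask)
{
	return (uncor_status & uncor_mask) != 0;
}

/* The caller keeps the bool as a bool and maps it to a recovery result. */
static enum demo_ers_result demo_error_detected(uint32_t status, uint32_t mask)
{
	return demo_uncor_aer_present(status, mask) ? DEMO_DISCONNECT
						     : DEMO_CAN_RECOVER;
}
```

Using the bool directly in the ternary also avoids holding the result in an
intermediate u32.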
>> {
>> - struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>> - struct cxl_memdev *cxlmd = cxlds->cxlmd;
>> - struct device *dev = &cxlmd->dev;
>> - bool ue;
>> -
>> - scoped_guard(device, dev) {
>> - if (!dev->driver) {
>> - dev_warn(&pdev->dev,
>> - "%s: memdev disabled, abort error handling\n",
>> - dev_name(dev));
>> - return PCI_ERS_RESULT_DISCONNECT;
>> - }
>> + struct aer_capability_regs aer;
>> + u32 aer_cap = pdev->aer_cap;
>>
>> - if (cxlds->rcd)
>> - cxl_handle_rdport_errors(cxlds);
>> - /*
>> - * A frozen channel indicates an impending reset which is fatal to
>> - * CXL.mem operation, and will likely crash the system. On the off
>> - * chance the situation is recoverable dump the status of the RAS
>> - * capability registers and bounce the active state of the memdev.
>> - */
>> - ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial,
>> - cxlmd->endpoint->regs.ras);
>> + if (!aer_cap) {
>> + pr_warn_ratelimited("%s: AER capability isn't present\n",
>> + pci_name(pdev));
>> + return false;
>> }
>>
>> - switch (state) {
>> - case pci_channel_io_normal:
>> - if (ue) {
>> - device_release_driver(dev);
>> - return PCI_ERS_RESULT_NEED_RESET;
>> - }
>> - return PCI_ERS_RESULT_CAN_RECOVER;
>> - case pci_channel_io_frozen:
>> - dev_warn(&pdev->dev,
>> - "%s: frozen state error detected, disable CXL.mem\n",
>> - dev_name(dev));
>> - device_release_driver(dev);
>> - return PCI_ERS_RESULT_NEED_RESET;
>> - case pci_channel_io_perm_failure:
>> - dev_warn(&pdev->dev,
>> - "failure state error detected, request disconnect\n");
>> - return PCI_ERS_RESULT_DISCONNECT;
>> - }
>> - return PCI_ERS_RESULT_NEED_RESET;
>> + pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_STATUS, &aer.uncor_status);
>> + pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_MASK, &aer.uncor_mask);
>> +
>> + /* The AER driver logged the error */
>> + pci_aer_clear_nonfatal_status(pdev);
>> + pci_aer_clear_fatal_status(pdev);
>> +
>> + return (aer.uncor_status & aer.uncor_mask);
>> +}
>> +
>> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
>> + pci_channel_state_t error)
>> +{
>> + u32 rc = cxl_handle_aer(pdev);
>> +
>> + return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_CAN_RECOVER;
>> }
>> -EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
>> +EXPORT_SYMBOL_NS_GPL(cxl_pci_error_detected, "CXL");
>>
>> static void cxl_handle_proto_error(struct cxl_proto_err_work_data *err_info)
>> {
>> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
>> index 970add0256e9..5534422b496c 100644
>> --- a/drivers/cxl/cxlpci.h
>> +++ b/drivers/cxl/cxlpci.h
>> @@ -79,15 +79,14 @@ void read_cdat_data(struct cxl_port *port);
>>
>> #ifdef CONFIG_CXL_RAS
>> void cxl_cor_error_detected(struct pci_dev *pdev);
>> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>> - pci_channel_state_t state);
>> void devm_cxl_dport_rch_ras_setup(struct cxl_dport *dport);
>> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
>> + pci_channel_state_t error);
>> void devm_cxl_port_ras_setup(struct cxl_port *port);
>> #else
>> static inline void cxl_cor_error_detected(struct pci_dev *pdev) { }
>> -
>> -static inline pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>> - pci_channel_state_t state)
>> +static inline pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
>> + pci_channel_state_t state)
>> {
>> return PCI_ERS_RESULT_NONE;
>> }
>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>> index acb0eb2a13c3..ff741adc7c7f 100644
>> --- a/drivers/cxl/pci.c
>> +++ b/drivers/cxl/pci.c
>> @@ -1051,8 +1051,8 @@ static void cxl_reset_done(struct pci_dev *pdev)
>> }
>> }
>>
>> -static const struct pci_error_handlers cxl_error_handlers = {
>> - .error_detected = cxl_error_detected,
>> +static const struct pci_error_handlers pci_error_handlers = {
>> + .error_detected = cxl_pci_error_detected,
>> .slot_reset = cxl_slot_reset,
>> .resume = cxl_error_resume,
>> .cor_error_detected = cxl_cor_error_detected,
>> @@ -1063,7 +1063,7 @@ static struct pci_driver cxl_pci_driver = {
>> .name = KBUILD_MODNAME,
>> .id_table = cxl_mem_pci_tbl,
>> .probe = cxl_pci_probe,
>> - .err_handler = &cxl_error_handlers,
>> + .err_handler = &pci_error_handlers,
>> .dev_groups = cxl_rcd_groups,
>> .driver = {
>> .probe_type = PROBE_PREFER_ASYNCHRONOUS,
>
* Re: [PATCH v15 7/9] cxl: Update Endpoint AER uncorrectable handler
2026-02-03 18:35 ` Bowman, Terry
@ 2026-02-03 18:49 ` Dave Jiang
2026-02-03 20:21 ` Dave Jiang
0 siblings, 1 reply; 31+ messages in thread
From: Dave Jiang @ 2026-02-03 18:49 UTC (permalink / raw)
To: Bowman, Terry, dave, jonathan.cameron, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci
On 2/3/26 11:35 AM, Bowman, Terry wrote:
> On 2/3/2026 11:31 AM, Dave Jiang wrote:
>>
>>
>> On 2/2/26 7:52 PM, Terry Bowman wrote:
>>> CXL drivers now implement protocol RAS support. PCI protocol errors,
>>> however, continue to be reported via the AER capability and must still be
>>> handled by a PCI error recovery callback.
>>>
>>> Replace the existing cxl_error_detected() callback in cxl/pci.c with a
>>> new cxl_pci_error_detected() implementation that handles only uncorrectable
>>> PCI protocol errors reported through AER.
>>
>> Do we need to explain why only uncorrectable is handled?
>>
>
> Would it be Ok if I removed "only" with s/only// ?
>
> After mentioning an important detail I shoud elaborate. But, how about if
> remove it and not refer to the CE at all here? CE shouldnt be mentioned unless
> good reason in a primarily UCE patch.
Is CE handling added later? Maybe just say that.
DJ
>
> - Terry
>
>>>
>>> Introduce a helper named cxl_handle_aer() to handle and
>>> log the CXL device's AER error.
>>>
>>> This cleanly separates CXL protocol error handling from PCI AER handling
>>> and ensures that each subsystem processes only the errors it is
>>> responsible for.
>>>
>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>
>>> ---
>>>
>>> Changes in v14->v15:
>>> - Title update (Terry)
>>> - Change cxl_pci_error_detected() to handle & log AER (Terry)
>>> - Update commit message (Terry)
>>> - Moved cxl_handle_ras()/cxl_handle_cor_ras() to earlier patch (Terry)
>>>
>>> Changes in v13->v14:
>>> - Update commit headline (Bjorn)
>>> - Rename pci_error_detected()/pci_cor_error_detected() ->
>>> cxl_pci_error_detected/cxl_pci_cor_error_detected() (Jonathan)
>>> - Remove now-invalid comment in cxl_error_detected() (Jonathan)
>>> - Split into separate patches for UCE and CE (Terry)
>>>
>>> Changes in v12->v13:
>>> - Update commit message (Terry)
>>> - Updated all the implementation and commit message. (Terry)
>>> - Refactored cxl_cor_error_detected()/cxl_error_detected() to remove
>>> pdev (Dave Jiang)
>>>
>>> Changes in v11->v12:
>>> - None
>>>
>>> Changes in v10->v11:
>>> - cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
>>> - cxl_error_detected() - Remove extra line (Shiju)
>>> - Changes moved to core/ras.c (Terry)
>>> - cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
>>> - Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
>>> - Move #include "pci.h" from cxl.h to core.h (Terry)
>>> - Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)
>>> ---
>>> drivers/cxl/core/ras.c | 68 +++++++++++++++---------------------------
>>> drivers/cxl/cxlpci.h | 9 +++---
>>> drivers/cxl/pci.c | 6 ++--
>>> 3 files changed, 31 insertions(+), 52 deletions(-)
>>>
>>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>>> index 970ff3df442c..061e6aaec176 100644
>>> --- a/drivers/cxl/core/ras.c
>>> +++ b/drivers/cxl/core/ras.c
>>> @@ -441,55 +441,35 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
>>> }
>>> EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
>>>
>>> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>>> - pci_channel_state_t state)
>>> +static bool cxl_handle_aer(struct pci_dev *pdev)
>>
>> For a function that returns a bool, the function name doesn't sound quite right. Maybe cxl_uncor_aer_present()?
>>
>> DJ
>>
>
> I was trying to follow the pattern of detected() function calls the
> handle() function as done for cxl_handle_ras() and cxl_handle_cor_ras().
>
> I will change to cxl_uncor_aer_present().
>
> -Terry
>
>>> {
>>> - struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>>> - struct cxl_memdev *cxlmd = cxlds->cxlmd;
>>> - struct device *dev = &cxlmd->dev;
>>> - bool ue;
>>> -
>>> - scoped_guard(device, dev) {
>>> - if (!dev->driver) {
>>> - dev_warn(&pdev->dev,
>>> - "%s: memdev disabled, abort error handling\n",
>>> - dev_name(dev));
>>> - return PCI_ERS_RESULT_DISCONNECT;
>>> - }
>>> + struct aer_capability_regs aer;
>>> + u32 aer_cap = pdev->aer_cap;
>>>
>>> - if (cxlds->rcd)
>>> - cxl_handle_rdport_errors(cxlds);
>>> - /*
>>> - * A frozen channel indicates an impending reset which is fatal to
>>> - * CXL.mem operation, and will likely crash the system. On the off
>>> - * chance the situation is recoverable dump the status of the RAS
>>> - * capability registers and bounce the active state of the memdev.
>>> - */
>>> - ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial,
>>> - cxlmd->endpoint->regs.ras);
>>> + if (!aer_cap) {
>>> + pr_warn_ratelimited("%s: AER capability isn't present\n",
>>> + pci_name(pdev));
>>> + return false;
>>> }
>>>
>>> - switch (state) {
>>> - case pci_channel_io_normal:
>>> - if (ue) {
>>> - device_release_driver(dev);
>>> - return PCI_ERS_RESULT_NEED_RESET;
>>> - }
>>> - return PCI_ERS_RESULT_CAN_RECOVER;
>>> - case pci_channel_io_frozen:
>>> - dev_warn(&pdev->dev,
>>> - "%s: frozen state error detected, disable CXL.mem\n",
>>> - dev_name(dev));
>>> - device_release_driver(dev);
>>> - return PCI_ERS_RESULT_NEED_RESET;
>>> - case pci_channel_io_perm_failure:
>>> - dev_warn(&pdev->dev,
>>> - "failure state error detected, request disconnect\n");
>>> - return PCI_ERS_RESULT_DISCONNECT;
>>> - }
>>> - return PCI_ERS_RESULT_NEED_RESET;
>>> + pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_STATUS, &aer.uncor_status);
>>> + pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_MASK, &aer.uncor_mask);
>>> +
>>> + /* The AER driver logged the error */
>>> + pci_aer_clear_nonfatal_status(pdev);
>>> + pci_aer_clear_fatal_status(pdev);
>>> +
>>> + return (aer.uncor_status & aer.uncor_mask);
>>> +}
>>> +
>>> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
>>> + pci_channel_state_t error)
>>> +{
>>> + u32 rc = cxl_handle_aer(pdev);
>>> +
>>> + return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_CAN_RECOVER;
>>> }
>>> -EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
>>> +EXPORT_SYMBOL_NS_GPL(cxl_pci_error_detected, "CXL");
>>>
>>> static void cxl_handle_proto_error(struct cxl_proto_err_work_data *err_info)
>>> {
>>> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
>>> index 970add0256e9..5534422b496c 100644
>>> --- a/drivers/cxl/cxlpci.h
>>> +++ b/drivers/cxl/cxlpci.h
>>> @@ -79,15 +79,14 @@ void read_cdat_data(struct cxl_port *port);
>>>
>>> #ifdef CONFIG_CXL_RAS
>>> void cxl_cor_error_detected(struct pci_dev *pdev);
>>> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>>> - pci_channel_state_t state);
>>> void devm_cxl_dport_rch_ras_setup(struct cxl_dport *dport);
>>> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
>>> + pci_channel_state_t error);
>>> void devm_cxl_port_ras_setup(struct cxl_port *port);
>>> #else
>>> static inline void cxl_cor_error_detected(struct pci_dev *pdev) { }
>>> -
>>> -static inline pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>>> - pci_channel_state_t state)
>>> +static inline pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
>>> + pci_channel_state_t state)
>>> {
>>> return PCI_ERS_RESULT_NONE;
>>> }
>>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>>> index acb0eb2a13c3..ff741adc7c7f 100644
>>> --- a/drivers/cxl/pci.c
>>> +++ b/drivers/cxl/pci.c
>>> @@ -1051,8 +1051,8 @@ static void cxl_reset_done(struct pci_dev *pdev)
>>> }
>>> }
>>>
>>> -static const struct pci_error_handlers cxl_error_handlers = {
>>> - .error_detected = cxl_error_detected,
>>> +static const struct pci_error_handlers pci_error_handlers = {
>>> + .error_detected = cxl_pci_error_detected,
>>> .slot_reset = cxl_slot_reset,
>>> .resume = cxl_error_resume,
>>> .cor_error_detected = cxl_cor_error_detected,
>>> @@ -1063,7 +1063,7 @@ static struct pci_driver cxl_pci_driver = {
>>> .name = KBUILD_MODNAME,
>>> .id_table = cxl_mem_pci_tbl,
>>> .probe = cxl_pci_probe,
>>> - .err_handler = &cxl_error_handlers,
>>> + .err_handler = &pci_error_handlers,
>>> .dev_groups = cxl_rcd_groups,
>>> .driver = {
>>> .probe_type = PROBE_PREFER_ASYNCHRONOUS,
>>
>
>
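The naming point above can be illustrated with a small user-space sketch. This is only a model, not kernel code: `mock_uncor_aer_present()`, `mock_pci_error_detected()` and `enum mock_ers_result` are hypothetical stand-ins for the renamed helper, its caller and `pci_ers_result_t`. The bit test is copied from the quoted patch as-is; the sketch only shows a predicate-style name returning `bool` and the caller consuming that `bool` directly instead of storing it in a `u32`.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two pci_ers_result_t values used here. */
enum mock_ers_result {
	MOCK_ERS_CAN_RECOVER,
	MOCK_ERS_DISCONNECT,
};

/*
 * Predicate-style helper: the name asks a yes/no question and the
 * return type is bool. The expression mirrors the quoted patch; the
 * "!= 0" only makes the conversion to bool explicit.
 */
static bool mock_uncor_aer_present(uint32_t uncor_status, uint32_t uncor_mask)
{
	return (uncor_status & uncor_mask) != 0;
}

/* The caller uses the predicate directly, with no intermediate u32. */
static enum mock_ers_result mock_pci_error_detected(uint32_t uncor_status,
						    uint32_t uncor_mask)
{
	if (mock_uncor_aer_present(uncor_status, uncor_mask))
		return MOCK_ERS_DISCONNECT;

	return MOCK_ERS_CAN_RECOVER;
}
```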
* Re: [PATCH v15 7/9] cxl: Update Endpoint AER uncorrectable handler
2026-02-03 18:49 ` Dave Jiang
@ 2026-02-03 20:21 ` Dave Jiang
0 siblings, 0 replies; 31+ messages in thread
From: Dave Jiang @ 2026-02-03 20:21 UTC (permalink / raw)
To: Bowman, Terry, dave, jonathan.cameron, alison.schofield,
dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci
On 2/3/26 11:49 AM, Dave Jiang wrote:
>
>
> On 2/3/26 11:35 AM, Bowman, Terry wrote:
>> On 2/3/2026 11:31 AM, Dave Jiang wrote:
>>>
>>>
>>> On 2/2/26 7:52 PM, Terry Bowman wrote:
>>>> CXL drivers now implement protocol RAS support. PCI protocol errors,
>>>> however, continue to be reported via the AER capability and must still be
>>>> handled by a PCI error recovery callback.
>>>>
>>>> Replace the existing cxl_error_detected() callback in cxl/pci.c with a
>>>> new cxl_pci_error_detected() implementation that handles only uncorrectable
>>>> PCI protocol errors reported through AER.
>>>
>>> Do we need to explain why only uncorrectable is handled?
>>>
>>
>> Would it be Ok if I removed "only" with s/only// ?
>>
>> After mentioning an important detail I should elaborate. But how about if I
>> remove it and not refer to the CE at all here? CE shouldn't be mentioned
>> without good reason in a primarily UCE patch.
>
> Is CE handling added later? Maybe just say that.
So it's explained in the commit log of patch 8/9. Maybe just add a line here and say that CE is not needed.
>
> DJ
>
>>
>> - Terry
>>
>>>>
>>>> Introduce a helper named cxl_handle_aer() and implement it to handle and
>>>> log the CXL device's AER error.
>>>>
>>>> This cleanly separates CXL protocol error handling from PCI AER handling
>>>> and ensures that each subsystem processes only the errors it is
>>>> responsible for.
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>>
>>>> ---
>>>>
>>>> Changes in v14->v15:
>>>> - Title update (Terry)
>>>> - Change cxl_pci_error_detected() to handle & log AER (Terry)
>>>> - Update commit message (Terry)
>>>> - Moved cxl_handle_ras()/cxl_handle_cor_ras() to earlier patch (Terry)
>>>>
>>>> Changes in v13->v14:
>>>> - Update commit headline (Bjorn)
>>>> - Rename pci_error_detected()/pci_cor_error_detected() ->
>>>> cxl_pci_error_detected/cxl_pci_cor_error_detected() (Jonathan)
>>>> - Remove now-invalid comment in cxl_error_detected() (Jonathan)
>>>> - Split into separate patches for UCE and CE (Terry)
>>>>
>>>> Changes in v12->v13:
>>>> - Update commit message (Terry)
>>>> - Updated all the implementation and commit message. (Terry)
>>>> - Refactored cxl_cor_error_detected()/cxl_error_detected() to remove
>>>> pdev (Dave Jiang)
>>>>
>>>> Changes in v11->v12:
>>>> - None
>>>>
>>>> Changes in v10->v11:
>>>> - cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
>>>> - cxl_error_detected() - Remove extra line (Shiju)
>>>> - Changes moved to core/ras.c (Terry)
>>>> - cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
>>>> - Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
>>>> - Move #include "pci.h" from cxl.h to core.h (Terry)
>>>> - Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)
>>>> ---
>>>> drivers/cxl/core/ras.c | 68 +++++++++++++++---------------------------
>>>> drivers/cxl/cxlpci.h | 9 +++---
>>>> drivers/cxl/pci.c | 6 ++--
>>>> 3 files changed, 31 insertions(+), 52 deletions(-)
>>>>
>>>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>>>> index 970ff3df442c..061e6aaec176 100644
>>>> --- a/drivers/cxl/core/ras.c
>>>> +++ b/drivers/cxl/core/ras.c
>>>> @@ -441,55 +441,35 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
>>>> }
>>>> EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
>>>>
>>>> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>>>> - pci_channel_state_t state)
>>>> +static bool cxl_handle_aer(struct pci_dev *pdev)
>>>
>>> For a function that returns a bool, the function name doesn't sound quite right. Maybe cxl_uncor_aer_present()?
>>>
>>> DJ
>>>
>>
>> I was trying to follow the pattern of detected() function calls the
>> handle() function as done for cxl_handle_ras() and cxl_handle_cor_ras().
>>
>> I will change to cxl_uncor_aer_present().
>>
>> -Terry
>>
>>>> {
>>>> - struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>>>> - struct cxl_memdev *cxlmd = cxlds->cxlmd;
>>>> - struct device *dev = &cxlmd->dev;
>>>> - bool ue;
>>>> -
>>>> - scoped_guard(device, dev) {
>>>> - if (!dev->driver) {
>>>> - dev_warn(&pdev->dev,
>>>> - "%s: memdev disabled, abort error handling\n",
>>>> - dev_name(dev));
>>>> - return PCI_ERS_RESULT_DISCONNECT;
>>>> - }
>>>> + struct aer_capability_regs aer;
>>>> + u32 aer_cap = pdev->aer_cap;
>>>>
>>>> - if (cxlds->rcd)
>>>> - cxl_handle_rdport_errors(cxlds);
>>>> - /*
>>>> - * A frozen channel indicates an impending reset which is fatal to
>>>> - * CXL.mem operation, and will likely crash the system. On the off
>>>> - * chance the situation is recoverable dump the status of the RAS
>>>> - * capability registers and bounce the active state of the memdev.
>>>> - */
>>>> - ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial,
>>>> - cxlmd->endpoint->regs.ras);
>>>> + if (!aer_cap) {
>>>> + pr_warn_ratelimited("%s: AER capability isn't present\n",
>>>> + pci_name(pdev));
>>>> + return false;
>>>> }
>>>>
>>>> - switch (state) {
>>>> - case pci_channel_io_normal:
>>>> - if (ue) {
>>>> - device_release_driver(dev);
>>>> - return PCI_ERS_RESULT_NEED_RESET;
>>>> - }
>>>> - return PCI_ERS_RESULT_CAN_RECOVER;
>>>> - case pci_channel_io_frozen:
>>>> - dev_warn(&pdev->dev,
>>>> - "%s: frozen state error detected, disable CXL.mem\n",
>>>> - dev_name(dev));
>>>> - device_release_driver(dev);
>>>> - return PCI_ERS_RESULT_NEED_RESET;
>>>> - case pci_channel_io_perm_failure:
>>>> - dev_warn(&pdev->dev,
>>>> - "failure state error detected, request disconnect\n");
>>>> - return PCI_ERS_RESULT_DISCONNECT;
>>>> - }
>>>> - return PCI_ERS_RESULT_NEED_RESET;
>>>> + pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_STATUS, &aer.uncor_status);
>>>> + pci_read_config_dword(pdev, aer_cap + PCI_ERR_UNCOR_MASK, &aer.uncor_mask);
>>>> +
>>>> + /* The AER driver logged the error */
>>>> + pci_aer_clear_nonfatal_status(pdev);
>>>> + pci_aer_clear_fatal_status(pdev);
>>>> +
>>>> + return (aer.uncor_status & aer.uncor_mask);
>>>> +}
>>>> +
>>>> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
>>>> + pci_channel_state_t error)
>>>> +{
>>>> + u32 rc = cxl_handle_aer(pdev);
>>>> +
>>>> + return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_CAN_RECOVER;
>>>> }
>>>> -EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
>>>> +EXPORT_SYMBOL_NS_GPL(cxl_pci_error_detected, "CXL");
>>>>
>>>> static void cxl_handle_proto_error(struct cxl_proto_err_work_data *err_info)
>>>> {
>>>> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
>>>> index 970add0256e9..5534422b496c 100644
>>>> --- a/drivers/cxl/cxlpci.h
>>>> +++ b/drivers/cxl/cxlpci.h
>>>> @@ -79,15 +79,14 @@ void read_cdat_data(struct cxl_port *port);
>>>>
>>>> #ifdef CONFIG_CXL_RAS
>>>> void cxl_cor_error_detected(struct pci_dev *pdev);
>>>> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>>>> - pci_channel_state_t state);
>>>> void devm_cxl_dport_rch_ras_setup(struct cxl_dport *dport);
>>>> +pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
>>>> + pci_channel_state_t error);
>>>> void devm_cxl_port_ras_setup(struct cxl_port *port);
>>>> #else
>>>> static inline void cxl_cor_error_detected(struct pci_dev *pdev) { }
>>>> -
>>>> -static inline pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>>>> - pci_channel_state_t state)
>>>> +static inline pci_ers_result_t cxl_pci_error_detected(struct pci_dev *pdev,
>>>> + pci_channel_state_t state)
>>>> {
>>>> return PCI_ERS_RESULT_NONE;
>>>> }
>>>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>>>> index acb0eb2a13c3..ff741adc7c7f 100644
>>>> --- a/drivers/cxl/pci.c
>>>> +++ b/drivers/cxl/pci.c
>>>> @@ -1051,8 +1051,8 @@ static void cxl_reset_done(struct pci_dev *pdev)
>>>> }
>>>> }
>>>>
>>>> -static const struct pci_error_handlers cxl_error_handlers = {
>>>> - .error_detected = cxl_error_detected,
>>>> +static const struct pci_error_handlers pci_error_handlers = {
>>>> + .error_detected = cxl_pci_error_detected,
>>>> .slot_reset = cxl_slot_reset,
>>>> .resume = cxl_error_resume,
>>>> .cor_error_detected = cxl_cor_error_detected,
>>>> @@ -1063,7 +1063,7 @@ static struct pci_driver cxl_pci_driver = {
>>>> .name = KBUILD_MODNAME,
>>>> .id_table = cxl_mem_pci_tbl,
>>>> .probe = cxl_pci_probe,
>>>> - .err_handler = &cxl_error_handlers,
>>>> + .err_handler = &pci_error_handlers,
>>>> .dev_groups = cxl_rcd_groups,
>>>> .driver = {
>>>> .probe_type = PROBE_PREFER_ASYNCHRONOUS,
>>>
>>
>>
>
>
* Re: [PATCH v15 1/9] PCI/AER: Introduce AER-CXL Kfifo in new file, pcie/aer_cxl_vh.c
2026-02-03 2:52 ` [PATCH v15 1/9] PCI/AER: Introduce AER-CXL Kfifo in new file, pcie/aer_cxl_vh.c Terry Bowman
@ 2026-02-04 4:25 ` dan.j.williams
0 siblings, 0 replies; 31+ messages in thread
From: dan.j.williams @ 2026-02-04 4:25 UTC (permalink / raw)
To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
Terry Bowman wrote:
> CXL virtual hierarchy (VH) RAS handling for CXL Port devices will be added
> soon. This requires a notification mechanism for the AER driver to share
> the AER interrupt with the CXL driver. The notification will be used as an
> indication for the CXL drivers to handle and log the CXL RAS errors.
>
> Note, 'CXL protocol error' terminology will refer to CXL VH and not
> CXL RCH errors unless specifically noted going forward.
>
> Introduce a new file in the AER driver to handle the CXL protocol errors
> named pci/pcie/aer_cxl_vh.c.
>
> Add a kfifo work queue to be used by the AER and CXL drivers. The AER
> driver will be the sole kfifo producer adding work and the cxl_core will be
> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
> Encapsulate the kfifo, RW semaphore, and work pointer in a single structure.
>
> Add CXL work queue handler registration functions in the AER driver. Export
> the functions allowing CXL driver to access. Implement registration
> functions for the CXL driver to assign or clear the work handler function.
> Synchronize accesses using the RW semaphore.
>
> Introduce 'struct cxl_proto_err_work_data' to serve as the kfifo work data.
> This will contain a reference to the PCI error source device and the error
> severity. This will be used when the work is dequeued by the cxl_core driver.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>
[..]
> diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
> new file mode 100644
> index 000000000000..de8bca383159
> --- /dev/null
> +++ b/drivers/pci/pcie/aer_cxl_vh.c
> @@ -0,0 +1,79 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
> +
> +#include <linux/types.h>
> +#include <linux/bitfield.h>
> +#include <linux/kfifo.h>
> +#include <linux/aer.h>
> +#include "../pci.h"
> +#include "portdrv.h"
> +
> +#define CXL_ERROR_SOURCES_MAX 128
> +
> +struct cxl_proto_err_kfifo {
> + struct work_struct *work;
> + struct rw_semaphore rw_sema;
> + DECLARE_KFIFO(fifo, struct cxl_proto_err_work_data,
> + CXL_ERROR_SOURCES_MAX);
> +};
> +
> +static struct cxl_proto_err_kfifo cxl_proto_err_kfifo = {
> + .rw_sema = __RWSEM_INITIALIZER(cxl_proto_err_kfifo.rw_sema)
> +};
Minor nit, I have never seen "rw_sema" as an identifier for an
rw_semaphore, would have expected "rwsem". Note I am only commenting on
this because there is something real to fix below.
> +
> +bool is_aer_internal_error(struct aer_err_info *info)
> +{
> + if (info->severity == AER_CORRECTABLE)
> + return info->status & PCI_ERR_COR_INTERNAL;
> +
> + return info->status & PCI_ERR_UNC_INTN;
> +}
> +
> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
> +{
> + if (!info || !info->is_cxl)
> + return false;
> +
> + if (pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT)
> + return false;
> +
> + return is_aer_internal_error(info);
> +}
> +
> +void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info)
> +{
> + struct cxl_proto_err_work_data wd = (struct cxl_proto_err_work_data) {
> + .severity = info->severity,
> + .pdev = pdev
> + };
> +
> + guard(rwsem_read)(&cxl_proto_err_kfifo.rw_sema);
> + pci_dev_get(pdev);
If the work item is not registered, nothing will drop this reference.
I would add a comment that the reference is held as long as the pdev is
live in the kfifo.
> + if (!cxl_proto_err_kfifo.work || !kfifo_put(&cxl_proto_err_kfifo.fifo, wd)) {
...while fixing the above, go ahead and wrap this at 80 columns.
> + dev_err_ratelimited(&pdev->dev, "AER-CXL kfifo error");
At this point we know the pdev did not go live in the kfifo so ref can be
dropped here.
> + return;
> + }
> +
> + schedule_work(cxl_proto_err_kfifo.work);
> +}
> +
> +void cxl_register_proto_err_work(struct work_struct *work)
> +{
> + guard(rwsem_write)(&cxl_proto_err_kfifo.rw_sema);
> + cxl_proto_err_kfifo.work = work;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_register_proto_err_work, "CXL");
> +
> +void cxl_unregister_proto_err_work(void)
> +{
> + guard(rwsem_write)(&cxl_proto_err_kfifo.rw_sema);
> + cxl_proto_err_kfifo.work = NULL;
I prefer the cancel_work_sync() inside this function rather than in
cxl_ras_exit(), so that the semantic is that no invocations spill outside
of the "unregister" barrier.
I realize cxl_cper_unregister_prot_err_work() originally made this a bit
messy so I am ok if a follow-on cleanup fixes up both.
Maybe Dave can fix the above issues up on applying?
With those addressed you can add:
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
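The reference-count comments on cxl_forward_error() above can be sketched in user space as follows. This is a minimal model under stated assumptions, not the kernel implementation: `struct mock_pdev`, `struct mock_fifo` and every `mock_*` function are hypothetical stand-ins for `struct pci_dev`, the AER-CXL kfifo, and pci_dev_get()/pci_dev_put()/kfifo_put(). It shows only the rule being requested: the reference is held while the pdev is live in the fifo, and is dropped immediately if the enqueue does not happen.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev: only the refcount matters. */
struct mock_pdev {
	int refcount;
};

#define MOCK_FIFO_DEPTH 2

/* Hypothetical stand-in for the AER-CXL kfifo plus its work pointer. */
struct mock_fifo {
	struct mock_pdev *slots[MOCK_FIFO_DEPTH];
	size_t count;
	bool work_registered;
};

static void mock_pdev_get(struct mock_pdev *pdev)
{
	pdev->refcount++;
}

static void mock_pdev_put(struct mock_pdev *pdev)
{
	pdev->refcount--;
}

/* Returns false when the fifo is full, modeling a failed kfifo_put(). */
static bool mock_fifo_put(struct mock_fifo *fifo, struct mock_pdev *pdev)
{
	if (fifo->count == MOCK_FIFO_DEPTH)
		return false;

	fifo->slots[fifo->count++] = pdev;
	return true;
}

/*
 * The reference taken here is held for as long as the pdev is live in
 * the fifo. If no work is registered or the fifo is full, the pdev
 * never goes live, so the reference is dropped before returning.
 */
static bool mock_forward_error(struct mock_fifo *fifo, struct mock_pdev *pdev)
{
	mock_pdev_get(pdev);

	if (!fifo->work_registered || !mock_fifo_put(fifo, pdev)) {
		mock_pdev_put(pdev);
		return false;
	}

	return true;
}
```

In this model the refcount after a failed forward equals the refcount before it, which is the property the review asks for.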
* Re: [PATCH v15 2/9] cxl: Update CXL Endpoint tracing
2026-02-03 2:52 ` [PATCH v15 2/9] cxl: Update CXL Endpoint tracing Terry Bowman
@ 2026-02-04 4:29 ` dan.j.williams
0 siblings, 0 replies; 31+ messages in thread
From: dan.j.williams @ 2026-02-04 4:29 UTC (permalink / raw)
To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
Terry Bowman wrote:
> CXL protocol error handling will be expanded to soon include CXL Port
> support along with existing Endpoint support. 2 updates are needed first:
> - Update calling interfaces to use 'struct device*'
> - Log endpoint serial number
>
> Add serial number parameter to the trace logging. This is used for EPs
> and 0 is provided for CXL port devices without a serial number.
>
> Leave the correctable and uncorrectable trace routines' TP_STRUCT__entry()
> unchanged with respect to member data types and order.
>
> Below is output of correctable and uncorrectable protocol error logging.
> CXL Root Port and CXL Endpoint examples are included below.
>
> The tracing support for CXL Port devices and Endpoints is already implemented.
> Update cxl_handle_ras() & cxl_handle_cor_ras() to also call the CXL trace
> routines.
>
> Root Port:
> cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c serial: 0 status='CRC Threshold Hit'
> cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c serial: 0 status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
>
> Endpoint:
> cxl_aer_correctable_error: memdev=mem3 host=0000:0f:00.0 serial=0 status='CRC Threshold Hit'
> cxl_aer_uncorrectable_error: memdev=mem3 host=0000:0f:00.0 serial: 0 status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
Looks good, adding the serial number at the end should preserve
compatibility with libtraceevent parsing of the parameters.
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
* Re: [PATCH v15 4/9] PCI/AER: Dequeue forwarded CXL error
2026-02-03 2:52 ` [PATCH v15 4/9] PCI/AER: Dequeue forwarded CXL error Terry Bowman
2026-02-03 15:26 ` Jonathan Cameron
@ 2026-02-04 4:46 ` dan.j.williams
1 sibling, 0 replies; 31+ messages in thread
From: dan.j.williams @ 2026-02-04 4:46 UTC (permalink / raw)
To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
Terry Bowman wrote:
> The AER driver now forwards CXL protocol errors to the CXL driver via a
> kfifo. The CXL driver must consume these work items and initiate protocol
> error handling while ensuring the device's RAS mappings remain valid
> throughout processing.
>
> Implement cxl_proto_err_work_fn() to dequeue work items forwarded by the
> AER service driver. Lock the parent CXL Port device to ensure the CXL
> device's RAS registers are accessible during handling. Add pdev reference-put
> to match reference-get in AER driver. This will ensure pdev access after
> kfifo dequeue. These changes apply to CXL Ports and CXL Endpoints.
>
> Update is_cxl_error() to recognize CXL Port devices with errors.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
>
> ---
>
> Changes in v14->v15:
> - Move pci_dev_get() to first patch (Dave)
> - Move in is_cxl_error() change from later patch (Terry)
> - Use pr_err_ratelimited() with PCI device name (Terry)
>
> Changes in v13->v14:
> - Update commit title's prefix (Bjorn)
> - Add pdev ref get in AER driver before enqueue and add pdev ref put in
> CXL driver after dequeue and handling (Dan)
> - Removed handling to simplify patch context (Terry)
>
> Changes in v12->v13:
> - Add cxlmd lock using guard() (Terry)
> - Remove exporting of unused function, pci_aer_clear_fatal_status() (Dave Jiang)
> - Change pr_err() calls to ratelimited. (Terry)
> - Update commit message. (Terry)
> - Remove namespace qualifier from pcie_clear_device_status()
> export (Dave Jiang)
> - Move locks into cxl_proto_err_work_fn() (Dave)
> - Update log messages in cxl_forward_error() (Ben)
>
> Changes in v11->v12:
> - Add guard for CE case in cxl_handle_proto_error() (Dave)
>
> Changes in v10->v11:
> - Reword patch commit message to remove RCiEP details (Jonathan)
> - Add #include <linux/bitfield.h> (Terry)
> - is_cxl_rcd() - Fix short comment message wrap (Jonathan)
> - is_cxl_rcd() - Combine return calls into 1 (Jonathan)
> - cxl_handle_proto_error() - Move comment earlier (Jonathan)
> - Use FIELD_GET() in discovering class code (Jonathan)
> - Remove BDF from cxl_proto_err_work_data. Use 'struct
> pci_dev *' (Dan)
> ---
> drivers/cxl/core/core.h | 3 +
> drivers/cxl/core/port.c | 6 +-
> drivers/cxl/core/ras.c | 106 ++++++++++++++++++++++++++++++----
> drivers/pci/pcie/aer_cxl_vh.c | 5 +-
> 4 files changed, 105 insertions(+), 15 deletions(-)
>
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index c6cfaf2720e1..92aea110817d 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -182,6 +182,9 @@ static inline void devm_cxl_dport_ras_setup(struct cxl_dport *dport) { }
> #endif /* CONFIG_CXL_RAS */
>
> int cxl_gpf_port_setup(struct cxl_dport *dport);
> +struct cxl_port *find_cxl_port(struct device *dport_dev,
> + struct cxl_dport **dport);
> +struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev);
>
> struct cxl_hdm;
> int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index ee7d14528867..8e30a3e7f610 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -1402,8 +1402,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
> return NULL;
> }
>
> -static struct cxl_port *find_cxl_port(struct device *dport_dev,
> - struct cxl_dport **dport)
> +struct cxl_port *find_cxl_port(struct device *dport_dev,
> + struct cxl_dport **dport)
> {
> struct cxl_find_port_ctx ctx = {
> .dport_dev = dport_dev,
> @@ -1607,7 +1607,7 @@ static int match_port_by_uport(struct device *dev, const void *data)
> * Function takes a device reference on the port device. Caller should do a
> * put_device() when done.
> */
> -static struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
> +struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
> {
> struct device *dev;
>
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 74df561ed32e..a6c0bc6d7203 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -118,17 +118,6 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
> }
> static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
>
> -int cxl_ras_init(void)
> -{
> - return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> -}
> -
> -void cxl_ras_exit(void)
> -{
> - cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
> - cancel_work_sync(&cxl_cper_prot_err_work);
> -}
> -
> static void cxl_dport_map_ras(struct cxl_dport *dport)
> {
> struct cxl_register_map *map = &dport->reg_map;
> @@ -185,6 +174,50 @@ void devm_cxl_port_ras_setup(struct cxl_port *port)
> }
> EXPORT_SYMBOL_NS_GPL(devm_cxl_port_ras_setup, "CXL");
>
> +/*
> + * get_cxl_port - Return the parent CXL Port of a PCI device
> + * @pdev: PCI device whose parent CXL Port is being queried
> + *
> + * Looks up and returns the parent CXL Port associated with @pdev. On
> + * success, the returned port has its reference count incremented and must
> + * be released by the caller. Returns NULL if no associated CXL port is
> + * found.
> + *
> + * Return: Pointer to the parent &struct cxl_port or NULL on failure
> + */
> +static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
> +{
> + switch (pci_pcie_type(pdev)) {
> + case PCI_EXP_TYPE_ROOT_PORT:
> + case PCI_EXP_TYPE_DOWNSTREAM:
> + {
> + struct cxl_dport *dport;
> + struct cxl_port *port = find_cxl_port(&pdev->dev, &dport);
> +
> + if (!port) {
> + pci_err(pdev, "Failed to find the CXL device");
> + return NULL;
> + }
> + return port;
> + }
> + case PCI_EXP_TYPE_UPSTREAM:
> + case PCI_EXP_TYPE_ENDPOINT:
> + {
> + struct cxl_port *port = find_cxl_port_by_uport(&pdev->dev);
> +
> + if (!port) {
> + pci_err(pdev, "Failed to find the CXL device");
> + return NULL;
> + }
> + return port;
> + }
> + }
> +
> + pr_err_ratelimited("%s: Error - Unsupported device type (%#x)",
> + pci_name(pdev), pci_pcie_type(pdev));
> + return NULL;
> +}
> +
> void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> {
> void __iomem *addr;
> @@ -327,3 +360,54 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> return PCI_ERS_RESULT_NEED_RESET;
> }
> EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
> +
> +static void cxl_handle_proto_error(struct cxl_proto_err_work_data *err_info)
> +{
> +}
> +
> +static void cxl_proto_err_work_fn(struct work_struct *work)
> +{
> + struct cxl_proto_err_work_data wd;
> +
> + while (cxl_proto_err_kfifo_get(&wd)) {
> + struct pci_dev *pdev __free(pci_dev_put) = wd.pdev;
This is a bit clever, might want a comment that it pairs with the
pci_dev_get() in cxl_forward_error().
> +
> + if (!pdev) {
There is no way for pdev to be NULL in this path.
cxl_cper_handle_prot_err() is different because the CPER record is not
100% reliable and pci_get_domain_bus_and_slot() can fail. No worries
about that here.
> + pr_err_ratelimited("%s: NULL PCI device passed in AER-CXL KFifo\n",
> + pci_name(pdev));
> + continue;
> + }
> +
> + struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
> + if (!port) {
> + pr_err_ratelimited("%s: Failed to find parent port device in CXL topology\n",
> + pci_name(pdev));
> + continue;
> + }
> + guard(device)(&port->dev);
I expect this also wants to check port->dev.driver?
> + cxl_handle_proto_error(&wd);
It feels odd to keep passing @wd when the pdev and port have already been
extracted. ...but that is minor.
> + }
> +}
> +
> +static struct work_struct cxl_proto_err_work;
> +static DECLARE_WORK(cxl_proto_err_work, cxl_proto_err_work_fn);
> +
> +int cxl_ras_init(void)
> +{
> + if (cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work))
> + pr_err("Failed to initialize CXL RAS CPER\n");
cxl_cper_register_prot_err_work() should return void like
cxl_register_proto_err_work(). If someone registers a NULL work that is
the same as not registering anything, caller gets to keep the pieces.
No real bugs that need addressing, so:
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
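The consumer-side comments above (the put that pairs with the producer's get, and the suggested port->dev.driver check) can be modeled in user space like this. It is a sketch under assumptions, not the kernel code: `struct mock_pdev`, `struct mock_port`, `struct mock_work_data` and `mock_proto_err_work_fn()` are hypothetical, and the port is carried in the work item here only to keep the model short, whereas the quoted patch looks it up from the pdev.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev. */
struct mock_pdev {
	int refcount;
};

/* Hypothetical stand-in for struct cxl_port and its bound-driver state. */
struct mock_port {
	bool driver_bound;
	int handled;
};

/* Hypothetical work item; the port pointer is a modeling shortcut. */
struct mock_work_data {
	struct mock_pdev *pdev;
	struct mock_port *port;
};

/*
 * Drain the queued items. An item is handled only when a port was
 * found and a driver is bound to it. The pdev reference is dropped on
 * every path, pairing with the get taken at enqueue time.
 * Returns the number of items actually handled.
 */
static size_t mock_proto_err_work_fn(struct mock_work_data *items, size_t n)
{
	size_t handled = 0;

	for (size_t i = 0; i < n; i++) {
		struct mock_pdev *pdev = items[i].pdev;
		struct mock_port *port = items[i].port;

		if (port && port->driver_bound) {
			port->handled++;
			handled++;
		}

		/* Pairs with the reference taken by the producer. */
		pdev->refcount--;
	}

	return handled;
}
```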
* Re: [PATCH v15 5/9] PCI: Establish common CXL Port protocol error flow
2026-02-03 2:52 ` [PATCH v15 5/9] PCI: Establish common CXL Port protocol error flow Terry Bowman
2026-02-03 15:40 ` Jonathan Cameron
@ 2026-02-04 5:08 ` dan.j.williams
2026-02-04 17:11 ` Bowman, Terry
1 sibling, 1 reply; 31+ messages in thread
From: dan.j.williams @ 2026-02-04 5:08 UTC (permalink / raw)
To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci, terry.bowman
Terry Bowman wrote:
> Introduce CXL Port protocol error handling callbacks to unify detection,
> logging, and recovery across CXL Ports and Endpoints, including RCH
> downstream ports. Establish a consistent flow for correctable and
> uncorrectable CXL protocol errors.
>
> Provide the solution by adding cxl_port_cor_error_detected() and
> cxl_port_error_detected() to handle correctable and uncorrectable handling
> through CXL RAS helpers, coordinating uncorrectable recovery in
> cxl_do_recovery(), and panicking when the handler returns PCI_ERS_RESULT_PANIC
> to preserve fatal cachemem behavior. Gate endpoint handling on the endpoint
> driver being bound to avoid processing errors on disabled devices.
>
> Centralize the RAS base lookup in cxl_get_ras_base(), selecting the
> downstream-port dport->regs.ras for Root/Downstream Ports and port->regs.ras
> for Upstream Ports/Endpoints.
>
> Export pcie_clear_device_status() and pci_aer_clear_fatal_status() to enable
> cxl_core to clear PCIe/AER state in these flows.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> Reviewed-by: Dave Jiang dave.jiang@intel.com
>
> ---
>
> Changes in v14->v15:
> - Update commit message and title. Added Bjorn's ack.
> - Move CE and UCE handling logic here
>
> Changes in v13->v14:
> - Add Dave Jiang's review-by
> - Update commit message & headline (Bjorn)
> - Refactor cxl_port_error_detected()/cxl_port_cor_error_detected() to
> one line (Jonathan)
> - Remove cxl_walk_port() (Dan)
> - Remove cxl_pci_drv_bound(). Check for 'is_cxl' parent port is
> sufficient (Dan)
> - Remove device_lock_if()
> - Combined CE and UCE here (Terry)
>
> Changes in v12->v13:
> - Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
> patch (Terry)
> - Remove EP case in cxl_get_ras_base(), not used. (Terry)
> - Remove check for dport->dport_dev (Dave)
> - Remove whitespace (Terry)
>
> Changes in v11->v12:
> - Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
> pci_to_cxl_dev()
> - Change cxl_error_detected() -> cxl_cor_error_detected()
> - Remove NULL variable assignments
> - Replace bus_find_device() with find_cxl_port_by_uport() for upstream
> port searches.
>
> Changes in v10->v11:
> - None
> ---
> drivers/cxl/core/ras.c | 134 +++++++++++++++++++++++++++++++++++++++++
> drivers/pci/pci.c | 1 +
> drivers/pci/pci.h | 2 -
> drivers/pci/pcie/aer.c | 1 +
> include/linux/aer.h | 2 +
> include/linux/pci.h | 2 +
> 6 files changed, 140 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index a6c0bc6d7203..0216dafa6118 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -218,6 +218,68 @@ static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
> return NULL;
> }
>
> +static void __iomem *cxl_get_ras_base(struct device *dev)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> +
> + switch (pci_pcie_type(pdev)) {
> + case PCI_EXP_TYPE_ROOT_PORT:
> + case PCI_EXP_TYPE_DOWNSTREAM:
> + {
Nit, clang-format puts that { on the same line because coding style says
only functions get newlines for open brackets.
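
A small standalone sketch of the brace placement being requested. The enum and lookup_kind() are invented for illustration and only mirror the shape of the switch in cxl_get_ras_base(); the block-scoped locals stand in for the __free() variables that make the braces necessary in the first place.

```c
/* Hypothetical port types, standing in for the PCI_EXP_TYPE_* values. */
enum port_type {
	TYPE_ROOT_PORT,
	TYPE_DOWNSTREAM,
	TYPE_UPSTREAM,
	TYPE_ENDPOINT,
	TYPE_OTHER,
};

/* Returns 1 for dport-style lookups, 2 for uport-style lookups, 0 otherwise. */
static int lookup_kind(enum port_type type)
{
	switch (type) {
	case TYPE_ROOT_PORT:
	case TYPE_DOWNSTREAM: {
		int kind = 1; /* block-scoped local, opening brace on the case line */

		return kind;
	}
	case TYPE_UPSTREAM:
	case TYPE_ENDPOINT: {
		int kind = 2;

		return kind;
	}
	default:
		return 0;
	}
}
```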
> + struct cxl_dport *dport;
> + struct cxl_port *port __free(put_cxl_port) = find_cxl_port(&pdev->dev, &dport);
> +
> + if (!dport) {
> + pci_err(pdev, "Failed to find the CXL device");
> + return NULL;
> + }
> + return dport->regs.ras;
> + }
> + case PCI_EXP_TYPE_UPSTREAM:
> + case PCI_EXP_TYPE_ENDPOINT:
> + {
> + struct cxl_port *port __free(put_cxl_port) = find_cxl_port_by_uport(&pdev->dev);
> +
> + if (!port) {
> + pci_err(pdev, "Failed to find the CXL device");
> + return NULL;
> + }
> + return port->regs.ras;
> + }
> + }
> + dev_warn_once(dev, "Error: Unsupported device type (%#x)", pci_pcie_type(pdev));
> + return NULL;
> +}
> +
> +static pci_ers_result_t cxl_port_error_detected(struct device *dev);
> +
> +static void cxl_do_recovery(struct pci_dev *pdev)
> +{
> + struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
> + pci_ers_result_t status;
> +
> + if (!port) {
> + pci_err(pdev, "Failed to find the CXL device\n");
> + return;
> + }
> +
> + status = cxl_port_error_detected(&pdev->dev);
> + if (status == PCI_ERS_RESULT_PANIC)
> + panic("CXL cachemem error.");
> +
> + /*
> + * If we have native control of AER, clear error status in the device
> + * that detected the error. If the platform retained control of AER,
> + * it is responsible for clearing this status. In that case, the
> + * signaling device may not even be visible to the OS.
> + */
This comment feels more appropriate as documentation for
pcie_aer_is_native(). CXL is just using it for the same purpose as all the
other callers. You can maybe reference "See pcie_aer_is_native() for
expectations on clearing errors", but I otherwise would not expect CXL to
carry its own paragraph.
> + if (pcie_aer_is_native(pdev)) {
> + pcie_clear_device_status(pdev);
> + pci_aer_clear_nonfatal_status(pdev);
> + pci_aer_clear_fatal_status(pdev);
> + }
> +}
> +
> void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> {
> void __iomem *addr;
> @@ -288,6 +350,60 @@ bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> return true;
> }
>
> +static void cxl_port_cor_error_detected(struct device *dev)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> + struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
> +
> + if (is_cxl_endpoint(port)) {
> + struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> + guard(device)(&cxlmd->dev);
> +
> + if (!dev->driver) {
> + dev_warn(&pdev->dev,
> + "%s: memdev disabled, abort error handling\n",
> + dev_name(dev));
> + return;
> + }
> +
> + if (cxlds->rcd)
> + cxl_handle_rdport_errors(cxlds);
Isn't this dead code? Only VH topologies will ever get a forwarded CXL
error, right? I realize it gets deleted in a future patch, but then why
leave dead code in the git history?
> +
> + cxl_handle_cor_ras(dev, cxlds->serial, cxl_get_ras_base(dev));
> + } else {
> + cxl_handle_cor_ras(dev, 0, cxl_get_ras_base(dev));
> + }
> +}
> +
> +static pci_ers_result_t cxl_port_error_detected(struct device *dev)
> +{
> + struct pci_dev *pdev = to_pci_dev(dev);
> + struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
> +
> + if (is_cxl_endpoint(port)) {
> + struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +
> + guard(device)(&cxlmd->dev);
> +
> + if (!dev->driver) {
> + dev_warn(&pdev->dev,
> + "%s: memdev disabled, abort error handling\n",
> + dev_name(dev));
> + return PCI_ERS_RESULT_NONE;
> + }
> +
> + if (cxlds->rcd)
> + cxl_handle_rdport_errors(cxlds);
> +
> + return cxl_handle_ras(dev, cxlds->serial, cxl_get_ras_base(dev));
> + } else {
> + return cxl_handle_ras(dev, 0, cxl_get_ras_base(dev));
> + }
> +}
> +
> void cxl_cor_error_detected(struct pci_dev *pdev)
> {
> struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> @@ -363,6 +479,24 @@ EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
>
> static void cxl_handle_proto_error(struct cxl_proto_err_work_data *err_info)
> {
> + struct pci_dev *pdev = err_info->pdev;
> +
> + if (err_info->severity == AER_CORRECTABLE) {
> +
> + if (!pcie_aer_is_native(pdev))
> + return;
> +
> + if (pdev->aer_cap)
> + pci_clear_and_set_config_dword(pdev,
> + pdev->aer_cap + PCI_ERR_COR_STATUS,
> + 0, PCI_ERR_COR_INTERNAL);
> +
> + cxl_port_cor_error_detected(&pdev->dev);
> +
> + pcie_clear_device_status(pdev);
> + } else {
> + cxl_do_recovery(pdev);
> + }
> }
>
> static void cxl_proto_err_work_fn(struct work_struct *work)
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 13dbb405dc31..b7bfefdaf990 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2248,6 +2248,7 @@ void pcie_clear_device_status(struct pci_dev *dev)
> pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &sta);
> pcie_capability_write_word(dev, PCI_EXP_DEVSTA, sta);
> }
> +EXPORT_SYMBOL_GPL(pcie_clear_device_status);
No reason to open up this symbol to the world. Only cxl_core.ko needs
this exported, and hopefully we never see another bus that abuses PCI
like CXL does ever again.
[..]
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 7af10a74da34..4fc9de4c78f8 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -298,6 +298,7 @@ void pci_aer_clear_fatal_status(struct pci_dev *dev)
> if (status)
> pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, status);
> }
> +EXPORT_SYMBOL_GPL(pci_aer_clear_fatal_status);
ditto, too wide of an export.
* Re: [PATCH v15 5/9] PCI: Establish common CXL Port protocol error flow
2026-02-04 5:08 ` dan.j.williams
@ 2026-02-04 17:11 ` Bowman, Terry
2026-02-04 21:22 ` dan.j.williams
0 siblings, 1 reply; 31+ messages in thread
From: Bowman, Terry @ 2026-02-04 17:11 UTC (permalink / raw)
To: dan.j.williams, dave, jonathan.cameron, dave.jiang,
alison.schofield, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci
On 2/3/2026 11:08 PM, dan.j.williams@intel.com wrote:
> Terry Bowman wrote:
>> Introduce CXL Port protocol error handling callbacks to unify detection,
>> logging, and recovery across CXL Ports and Endpoints, including RCH
>> downstream ports. Establish a consistent flow for correctable and
>> uncorrectable CXL protocol errors.
>>
>> Provide the solution by adding cxl_port_cor_error_detected() and
>> cxl_port_error_detected() to handle correctable and uncorrectable handling
>> through CXL RAS helpers, coordinating uncorrectable recovery in
>> cxl_do_recovery(), and panicking when the handler returns PCI_ERS_RESULT_PANIC
>> to preserve fatal cachemem behavior. Gate endpoint handling on the endpoint
>> driver being bound to avoid processing errors on disabled devices.
>>
>> Centralize the RAS base lookup in cxl_get_ras_base(), selecting the
>> downstream-port dport->regs.ras for Root/Downstream Ports and port->regs.ras
>> for Upstream Ports/Endpoints.
>>
>> Export pcie_clear_device_status() and pci_aer_clear_fatal_status() to enable
>> cxl_core to clear PCIe/AER state in these flows.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
>> Reviewed-by: Dave Jiang dave.jiang@intel.com
>>
>> ---
>>
>> Changes in v14->v15:
>> - Update commit message and title. Added Bjorn's ack.
>> - Move CE and UCE handling logic here
>>
>> Changes in v13->v14:
>> - Add Dave Jiang's review-by
>> - Update commit message & headline (Bjorn)
>> - Refactor cxl_port_error_detected()/cxl_port_cor_error_detected() to
>> one line (Jonathan)
>> - Remove cxl_walk_port() (Dan)
>> - Remove cxl_pci_drv_bound(). Check for 'is_cxl' parent port is
>> sufficient (Dan)
>> - Remove device_lock_if()
>> - Combined CE and UCE here (Terry)
>>
>> Changes in v12->v13:
>> - Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
>> patch (Terry)
>> - Remove EP case in cxl_get_ras_base(), not used. (Terry)
>> - Remove check for dport->dport_dev (Dave)
>> - Remove whitespace (Terry)
>>
>> Changes in v11->v12:
>> - Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
>> pci_to_cxl_dev()
>> - Change cxl_error_detected() -> cxl_cor_error_detected()
>> - Remove NULL variable assignments
>> - Replace bus_find_device() with find_cxl_port_by_uport() for upstream
>> port searches.
>>
>> Changes in v10->v11:
>> - None
>> ---
>> drivers/cxl/core/ras.c | 134 +++++++++++++++++++++++++++++++++++++++++
>> drivers/pci/pci.c | 1 +
>> drivers/pci/pci.h | 2 -
>> drivers/pci/pcie/aer.c | 1 +
>> include/linux/aer.h | 2 +
>> include/linux/pci.h | 2 +
>> 6 files changed, 140 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index a6c0bc6d7203..0216dafa6118 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
>> @@ -218,6 +218,68 @@ static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
>> return NULL;
>> }
>>
>> +static void __iomem *cxl_get_ras_base(struct device *dev)
>> +{
>> + struct pci_dev *pdev = to_pci_dev(dev);
>> +
>> + switch (pci_pcie_type(pdev)) {
>> + case PCI_EXP_TYPE_ROOT_PORT:
>> + case PCI_EXP_TYPE_DOWNSTREAM:
>> + {
>
> Nit, clang-format puts that { on the same line because coding style says
> only functions get newlines for open brackets.
>
Hi Dan,
Thanks for the note. Would you like every switch-case to be updated to match the clang
recommended format?
>> + struct cxl_dport *dport;
>> + struct cxl_port *port __free(put_cxl_port) = find_cxl_port(&pdev->dev, &dport);
>> +
>> + if (!dport) {
>> + pci_err(pdev, "Failed to find the CXL device");
>> + return NULL;
>> + }
>> + return dport->regs.ras;
>> + }
>> + case PCI_EXP_TYPE_UPSTREAM:
>> + case PCI_EXP_TYPE_ENDPOINT:
>> + {
>> + struct cxl_port *port __free(put_cxl_port) = find_cxl_port_by_uport(&pdev->dev);
>> +
>> + if (!port) {
>> + pci_err(pdev, "Failed to find the CXL device");
>> + return NULL;
>> + }
>> + return port->regs.ras;
>> + }
>> + }
>> + dev_warn_once(dev, "Error: Unsupported device type (%#x)", pci_pcie_type(pdev));
>> + return NULL;
>> +}
>> +
>> +static pci_ers_result_t cxl_port_error_detected(struct device *dev);
>> +
>> +static void cxl_do_recovery(struct pci_dev *pdev)
>> +{
>> + struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
>> + pci_ers_result_t status;
>> +
>> + if (!port) {
>> + pci_err(pdev, "Failed to find the CXL device\n");
>> + return;
>> + }
>> +
>> + status = cxl_port_error_detected(&pdev->dev);
>> + if (status == PCI_ERS_RESULT_PANIC)
>> + panic("CXL cachemem error.");
>> +
>> + /*
>> + * If we have native control of AER, clear error status in the device
>> + * that detected the error. If the platform retained control of AER,
>> + * it is responsible for clearing this status. In that case, the
>> + * signaling device may not even be visible to the OS.
>> + */
>
> This comment feels more appropriate as documentation for
> pcie_aer_is_native(). CXL is just using it for the same purpose as all the
> other callers. You can maybe reference "See pcie_aer_is_native() for
> expectations on clearing errors", but I otherwise would not expect CXL to
> carry its own paragraph.
>
Agreed. I’ll drop the local comment and rely on pcie_aer_is_native() semantics,
with a brief reference if needed.
>> + if (pcie_aer_is_native(pdev)) {
>> + pcie_clear_device_status(pdev);
>> + pci_aer_clear_nonfatal_status(pdev);
>> + pci_aer_clear_fatal_status(pdev);
>> + }
>> +}
>> +
>> void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
>> {
>> void __iomem *addr;
>> @@ -288,6 +350,60 @@ bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
>> return true;
>> }
>>
>> +static void cxl_port_cor_error_detected(struct device *dev)
>> +{
>> + struct pci_dev *pdev = to_pci_dev(dev);
>> + struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
>> +
>> + if (is_cxl_endpoint(port)) {
>> + struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
>> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
>> +
>> + guard(device)(&cxlmd->dev);
>> +
>> + if (!dev->driver) {
>> + dev_warn(&pdev->dev,
>> + "%s: memdev disabled, abort error handling\n",
>> + dev_name(dev));
>> + return;
>> + }
>> +
>> + if (cxlds->rcd)
>> + cxl_handle_rdport_errors(cxlds);
>
> Isn't this dead code? Only VH topologies will ever get a forwarded CXL
> error, right? I realize it gets deleted in a future patch, but then why
> leave dead code in the git history?
>
Yes, agreed - I'll remove. Correct, only VH is forwarded. My understanding is the
cxl_memdev guard and driver check are no longer required here. The memdev is only
used to source the serial number, so I’ll refactor accordingly. Please correct
me if I’m wrong.
I see an additional fix needed: cxl_rch_handle_error_iter() in pci/pcie/aer.c
also needs its callbacks updated. The RCH/RCD path previously invoked the EP
PCIe handlers, but with RAS now handled at the port level, those callbacks no
longer reach the correct logic.
I had a couple of ideas. One option is for the CXL logic to make a callback into a
cxl_core exported function such as cxl_handle_rdport_errors(). BTW, the CXL logic in
AER and the CXL driver's RAS are both built with the CONFIG_CXL_RAS config.
Another option is updating the CXL PCIe callbacks. The cxl_pci PCI error callbacks
currently support only AER and could be updated to also support RCH/RCD (no VH) with
something along the lines of the below?
static bool cxl_pci_detected(struct pci_dev *pdev)
{
...
if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_EC &&
is_aer_internal_error(info))
cxl_handle_rdport_errors();
In this case we would also need a cxl_pci_cor_detected().
>> +
>> + cxl_handle_cor_ras(dev, cxlds->serial, cxl_get_ras_base(dev));
>> + } else {
>> + cxl_handle_cor_ras(dev, 0, cxl_get_ras_base(dev));
>> + }
>> +}
>> +
>> +static pci_ers_result_t cxl_port_error_detected(struct device *dev)
>> +{
>> + struct pci_dev *pdev = to_pci_dev(dev);
>> + struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
>> +
>> + if (is_cxl_endpoint(port)) {
>> + struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
>> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
>> +
>> + guard(device)(&cxlmd->dev);
>> +
>> + if (!dev->driver) {
>> + dev_warn(&pdev->dev,
>> + "%s: memdev disabled, abort error handling\n",
>> + dev_name(dev));
>> + return PCI_ERS_RESULT_NONE;
>> + }
>> +
>> + if (cxlds->rcd)
>> + cxl_handle_rdport_errors(cxlds);
>> +
>> + return cxl_handle_ras(dev, cxlds->serial, cxl_get_ras_base(dev));
>> + } else {
>> + return cxl_handle_ras(dev, 0, cxl_get_ras_base(dev));
>> + }
>> +}
>> +
>> void cxl_cor_error_detected(struct pci_dev *pdev)
>> {
>> struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>> @@ -363,6 +479,24 @@ EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
>>
>> static void cxl_handle_proto_error(struct cxl_proto_err_work_data *err_info)
>> {
>> + struct pci_dev *pdev = err_info->pdev;
>> +
>> + if (err_info->severity == AER_CORRECTABLE) {
>> +
>> + if (!pcie_aer_is_native(pdev))
>> + return;
>> +
>> + if (pdev->aer_cap)
>> + pci_clear_and_set_config_dword(pdev,
>> + pdev->aer_cap + PCI_ERR_COR_STATUS,
>> + 0, PCI_ERR_COR_INTERNAL);
>> +
>> + cxl_port_cor_error_detected(&pdev->dev);
>> +
>> + pcie_clear_device_status(pdev);
>> + } else {
>> + cxl_do_recovery(pdev);
>> + }
>> }
>>
>> static void cxl_proto_err_work_fn(struct work_struct *work)
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index 13dbb405dc31..b7bfefdaf990 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -2248,6 +2248,7 @@ void pcie_clear_device_status(struct pci_dev *dev)
>> pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &sta);
>> pcie_capability_write_word(dev, PCI_EXP_DEVSTA, sta);
>> }
>> +EXPORT_SYMBOL_GPL(pcie_clear_device_status);
>
> No reason to open up this symbol to the world. Only cxl_core.ko needs
> this exported, and hopefully we never see another bus that abuses PCI
> like CXL does ever again.
>
> [..]
Understood. I’ll switch this to a CXL‑scoped export.
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 7af10a74da34..4fc9de4c78f8 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -298,6 +298,7 @@ void pci_aer_clear_fatal_status(struct pci_dev *dev)
>> if (status)
>> pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, status);
>> }
>> +EXPORT_SYMBOL_GPL(pci_aer_clear_fatal_status);
>
> ditto, too wide of an export.
OK
-Terry
* Re: [PATCH v15 5/9] PCI: Establish common CXL Port protocol error flow
2026-02-04 17:11 ` Bowman, Terry
@ 2026-02-04 21:22 ` dan.j.williams
2026-02-05 16:07 ` Bowman, Terry
0 siblings, 1 reply; 31+ messages in thread
From: dan.j.williams @ 2026-02-04 21:22 UTC (permalink / raw)
To: Bowman, Terry, dan.j.williams, dave, jonathan.cameron, dave.jiang,
alison.schofield, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci
Bowman, Terry wrote:
[..]
> >> +static void __iomem *cxl_get_ras_base(struct device *dev)
> >> +{
> >> + struct pci_dev *pdev = to_pci_dev(dev);
> >> +
> >> + switch (pci_pcie_type(pdev)) {
> >> + case PCI_EXP_TYPE_ROOT_PORT:
> >> + case PCI_EXP_TYPE_DOWNSTREAM:
> >> + {
> >
> > Nit, clang-format puts that { on the same line because coding style says
> > only functions get newlines for open brackets.
> >
>
> Hi Dan,
>
> Thanks for the note. Would you like every switch-case to be updated
> to match the clang recommended format?
Yes, please. See "git grep case.*:\ {" for all the other examples.
[..]
> > Isn't this dead code? Only VH topologies will ever get a forwarded CXL
> > error, right? I realize it gets deleted in a future patch, but then why
> > leave dead code in the git history?
> >
>
> Yes, agreed - I'll remove. Correct, only VH is forwarded. My
> understanding is the cxl_memdev guard and driver check are no longer
> required here. The memdev is only used to source the serial number, so
> I’ll refactor accordingly. Please correct me if I’m wrong.
You do not need the memdev to get the serial number, and I note that the
serial number is only mandated for CXL memory class devices. I would
rather stop worrying about serial / pass 0 than add endpoint special
casing. The consumer of the tracepoint can always get the serial number
from sysfs, or this can call "pci_get_dsn(pdev)".
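
To make the "pass 0" point concrete, here is a userspace sketch of how a Device Serial Number value is assembled from two 32-bit dwords. It assumes, as I understand pci_get_dsn() to behave, that a device without the DSN capability simply yields 0. struct fake_dsn_cap and fake_get_dsn() are hypothetical names used only for this illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a device's Device Serial Number capability. */
struct fake_dsn_cap {
	bool present;   /* whether the device exposes the capability at all */
	uint32_t lower; /* first dword of the serial number */
	uint32_t upper; /* second dword of the serial number */
};

/*
 * Combine the two dwords into one 64-bit serial. A device without the
 * capability yields 0, which is the same value a caller would pass when
 * it does not care about the serial.
 */
static uint64_t fake_get_dsn(const struct fake_dsn_cap *cap)
{
	if (!cap->present)
		return 0;

	return (uint64_t)cap->lower | ((uint64_t)cap->upper << 32);
}
```

With that shape, a generic handler never needs to know whether the device is a memory class device: it either gets a serial or it gets 0.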
Overall, I expect that this generic error handling is device-type
independent. The aim is it does not need to be touched again when/if
Linux ever sees CXL.cache devices without CXL.mem or the "serial is
mandated" edict for memory class devices.
> I see an additional fix needed: cxl_rch_handle_error_iter() in pci/pcie/aer.c
> also needs its callbacks updated. The RCH/RCD path previously invoked the EP
> PCIe handlers, but with RAS now handled at the port level, those callbacks no
> longer reach the correct logic.
>
> I had a couple of ideas. One option is for the CXL logic to make a
> callback into a cxl_core exported function such as
> cxl_handle_rdport_errors(). BTW, the CXL logic in AER and the CXL
> driver's RAS are both built with the CONFIG_CXL_RAS config.
That destroys the modularity of cxl_core.ko.
> Another option is updating the CXL PCIe callbacks. The cxl_pci PCI
> error callbacks currently support only AER and could be updated to
> also support RCH/RCD (no VH) with something along the lines of below?
This continues the abuse of PCI error handlers for what is an odd CXL
aberration.
The answer that feels consistent with unburdening the PCI core with the
vagaries of CXL is to include RCH errors in the class of notifications that
get forwarded. Arrange for cxl_proto_err_work_data to carry whether it
is an RCH or VH error and then dispatch either
cxl_handle_rdport_errors() or cxl_handle_proto_error().
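
Roughly, the dispatch could look like the userspace sketch below. Everything in it is hypothetical: the enum, the trimmed-down struct work_data and the two handlers only stand in for cxl_proto_err_work_data, cxl_handle_rdport_errors() and cxl_handle_proto_error(), and the series does not define a topology field today.

```c
#include <stddef.h>

/* Hypothetical tag recorded by the producer when the error is forwarded. */
enum err_topology {
	ERR_TOPO_VH,  /* virtual hierarchy: regular CXL ports and endpoints */
	ERR_TOPO_RCH, /* restricted CXL host: RCH downstream port / RCD */
};

/* Hypothetical, trimmed-down stand-in for struct cxl_proto_err_work_data. */
struct work_data {
	enum err_topology topo;
	int severity;
};

/* Counters so the example can observe which handler ran. */
static int vh_handled;
static int rch_handled;

/* Stand-in for cxl_handle_proto_error(). */
static void handle_vh_error(const struct work_data *wd)
{
	(void)wd;
	vh_handled++;
}

/* Stand-in for cxl_handle_rdport_errors(). */
static void handle_rch_error(const struct work_data *wd)
{
	(void)wd;
	rch_handled++;
}

/* Consumer side: pick the handler from the tag carried in the work item. */
static void dispatch_error(const struct work_data *wd)
{
	switch (wd->topo) {
	case ERR_TOPO_RCH:
		handle_rch_error(wd);
		break;
	case ERR_TOPO_VH:
		handle_vh_error(wd);
		break;
	}
}
```

The point of the shape is that the PCI side only has to record which kind of error it saw; all of the CXL-specific handling stays behind the queue.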
* Re: [PATCH v15 5/9] PCI: Establish common CXL Port protocol error flow
2026-02-04 21:22 ` dan.j.williams
@ 2026-02-05 16:07 ` Bowman, Terry
2026-02-05 21:17 ` dan.j.williams
0 siblings, 1 reply; 31+ messages in thread
From: Bowman, Terry @ 2026-02-05 16:07 UTC (permalink / raw)
To: dan.j.williams, dave, jonathan.cameron, dave.jiang,
alison.schofield, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci
On 2/4/2026 3:22 PM, dan.j.williams@intel.com wrote:
> Bowman, Terry wrote:
> [..]
>>>> +static void __iomem *cxl_get_ras_base(struct device *dev)
>>>> +{
>>>> + struct pci_dev *pdev = to_pci_dev(dev);
>>>> +
>>>> + switch (pci_pcie_type(pdev)) {
>>>> + case PCI_EXP_TYPE_ROOT_PORT:
>>>> + case PCI_EXP_TYPE_DOWNSTREAM:
>>>> + {
>>>
>>> Nit, clang-format puts that { on the same line because coding style says
>>> only functions get newlines for open brackets.
>>>
>>
>> Hi Dan,
>>
>> Thanks for the note. Would you like every switch-case to be updated
>> to match the clang recommended format?
>
> Yes, please. See "git grep case.*:\ {" for all the other examples.
>
> [..]
>>> Isn't this dead code? Only VH topologies will ever get a forwarded CXL
>>> error, right? I realize it gets deleted in a future patch, but then why
>>> leave dead code in the git history?
>>>
>>
>> Yes, agreed - I'll remove. Correct, only VH is forwarded. My
>> understanding is the cxl_memdev guard and driver check are no longer
>> required here. The memdev is only used to source the serial number, so
>> I’ll refactor accordingly. Please correct me if I’m wrong.
>
> You do not need the memdev to get the serial number, and I note that the
> serial number is only mandated for CXL memory class devices. I would
> rather stop worrying about serial / pass 0 than add endpoint special
> casing. The consumer of the tracepoint can always get the serial number
> from sysfs, or this can call "pci_get_dsn(pdev)".
>
> Overall, I expect that this generic error handling is device-type
> independent. The aim is it does not need to be touched again when/if
> Linux ever sees CXL.cache devices without CXL.mem or the "serial is
> mandated" edict for memory class devices.
>
>> I see an additional fix needed: cxl_rch_handle_error_iter() in pci/pcie/aer.c
>> also needs its callbacks updated. The RCH/RCD path previously invoked the EP
>> PCIe handlers, but with RAS now handled at the port level, those callbacks no
>> longer reach the correct logic.
>>
>> I had a couple of ideas. One option is for the CXL logic to make a
>> callback into a cxl_core exported function such as
>> cxl_handle_rdport_errors(). BTW, the CXL logic in AER and the CXL
>> driver's RAS are both built with the CONFIG_CXL_RAS config.
>
> That destroys the modularity of cxl_core.ko.
>
>> Another option is updating the CXL PCIe callbacks. The cxl_pci PCI
>> error callbacks currently support only AER and could be updated to
>> also support RCH/RCD (no VH) with something along the lines of below?
>
> This continues the abuse of PCI error handlers for what is an odd CXL
> aberration.
>
> The answer that feels consistent with unburdening the PCI core with the
> vagaries of CXL is to include RCH errors in the class of notifications that
> get forwarded. Arrange for cxl_proto_err_work_data to carry whether it
> is an RCH or VH error and then dispatch either
> cxl_handle_rdport_errors() or cxl_handle_proto_error().
That approach makes sense to me.
Would you like to keep the RCH's RCiEP traversal in the AER driver for now? In
that model, the RCiEP PCI device ID would be passed via cxl_proto_err_work_data.
This would be a relatively small change — updating cxl_rch_handle_error_iter()
and pcie/aer_cxl_rch.c to call cxl_forward_error().
A cleaner long-term approach would be to move all of the logic in aer_cxl_rch.c
into cxl/core/rch_ras.c. In that case, an RCEC (reporting on behalf of the RCH
error) would be passed in cxl_proto_err_work_data, and RCiEP iteration would be
handled by the CXL driver after the work item surfaces from the kfifo.
The second approach improves PCI/CXL separation, but it may be harder to land
late in the series. Would it be acceptable to proceed with the first approach
initially, followed immediately by a cleanup series moving pcie/aer_cxl_rch.c
into cxl/core/rch_ras.c?
- Terry
* Re: [PATCH v15 4/9] PCI/AER: Dequeue forwarded CXL error
2026-02-03 17:00 ` Bowman, Terry
@ 2026-02-05 17:13 ` Jonathan Cameron
0 siblings, 0 replies; 31+ messages in thread
From: Jonathan Cameron @ 2026-02-05 17:13 UTC (permalink / raw)
To: Bowman, Terry
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On Tue, 3 Feb 2026 11:00:40 -0600
"Bowman, Terry" <terry.bowman@amd.com> wrote:
> On 2/3/2026 9:26 AM, Jonathan Cameron wrote:
> > On Mon, 2 Feb 2026 20:52:39 -0600
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >
> >> The AER driver now forwards CXL protocol errors to the CXL driver via a
> >> kfifo. The CXL driver must consume these work items and initiate protocol
> >> error handling while ensuring the device's RAS mappings remain valid
> >> throughout processing.
> >>
> >> Implement cxl_proto_err_work_fn() to dequeue work items forwarded by the
> >> AER service driver. Lock the parent CXL Port device to ensure the CXL
> >> device's RAS registers are accessible during handling. Add pdev reference-put
> >> to match reference-get in AER driver. This will ensure pdev access after
> >> kfifo dequeue. These changes apply to CXL Ports and CXL Endpoints.
> >>
> >> Update is_cxl_error() to recognize CXL Port devices with errors.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> >
> > There are some small functional changes to existing paths (maybe)
> > that I think need explanations in this commit message.
> >
> > Otherwise, one suggests small simplification.
> >
> >> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> >> index 74df561ed32e..a6c0bc6d7203 100644
> >> --- a/drivers/cxl/core/ras.c
> >> +++ b/drivers/cxl/core/ras.c
> >> @@ -118,17 +118,6 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
> >> }
> >> static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
> >>
> >> -int cxl_ras_init(void)
> >> -{
> >> - return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> >> -}
> >> -
> >> -void cxl_ras_exit(void)
> >> -{
> >> - cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
> >> - cancel_work_sync(&cxl_cper_prot_err_work);
> >> -}
> >> -
> >> static void cxl_dport_map_ras(struct cxl_dport *dport)
> >> {
> >> struct cxl_register_map *map = &dport->reg_map;
> >> @@ -185,6 +174,50 @@ void devm_cxl_port_ras_setup(struct cxl_port *port)
> >> }
> >> EXPORT_SYMBOL_NS_GPL(devm_cxl_port_ras_setup, "CXL");
> >>
> >> +/*
> >> + * get_cxl_port - Return the parent CXL Port of a PCI device
> >> + * @pdev: PCI device whose parent CXL Port is being queried
> >> + *
> >> + * Looks up and returns the parent CXL Port associated with @pdev. On
> >> + * success, the returned port has its reference count incremented and must
> >> + * be released by the caller. Returns NULL if no associated CXL port is
> >> + * found.
> >> + *
> >> + * Return: Pointer to the parent &struct cxl_port or NULL on failure
> >> + */
> >> +static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
> >> +{
> >> + switch (pci_pcie_type(pdev)) {
> >> + case PCI_EXP_TYPE_ROOT_PORT:
> >> + case PCI_EXP_TYPE_DOWNSTREAM:
> >> + {
> >> + struct cxl_dport *dport;
> >> + struct cxl_port *port = find_cxl_port(&pdev->dev, &dport);
> >
> > Can you pass NULL for dport? Looks like it to me as that ultimately ends
> > up in match_port_by_dport() and
> > if (ctx->dport)
> > *ctx->dport = dport;
> >
> > where with this as null means ctx->dport == NULL.
> >
>
> Yes.
>
>
> >> +
> >> + if (!port) {
> >> + pci_err(pdev, "Failed to find the CXL device");
> >> + return NULL;
> >> + }
> >> + return port;
> >> + }
> >> + case PCI_EXP_TYPE_UPSTREAM:
> >> + case PCI_EXP_TYPE_ENDPOINT:
> >> + {
> >> + struct cxl_port *port = find_cxl_port_by_uport(&pdev->dev);
> >> +
> >> + if (!port) {
> >> + pci_err(pdev, "Failed to find the CXL device");
> >> + return NULL;
> >> + }
> >> + return port;
> >> + }
> >> + }
> >> +
> >> + pr_err_ratelimited("%s: Error - Unsupported device type (%#x)",
> >> + pci_name(pdev), pci_pcie_type(pdev));
> >> + return NULL;
> >> +}
> >
> >
> >> +int cxl_ras_init(void)
> >> +{
> >> + if (cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work))
> >> + pr_err("Failed to initialize CXL RAS CPER\n");
> >
> > Why introduce a new error print? I don't particularly mind,
> > but it wasn't obvious to me why one has become appropriate and why only
> > for the first call here.
> >
>
> This was introduced before v10.
>
> RAS initialization failure should not fail cxl_core probe.
>
> OS-first AER support was added in this series, in this file, next to CPER.
> CPER initialization can fail and OS-first cannot, which is the reason for
> only one log.
>
> When I look at this block of code I'm drawn to the return value. It looks
> like it should be a void function. Thoughts?
I'd return an error code, then have the caller decide not to treat that
as a failure case. That gives a clear place to add a print + maybe
a comment that says - yes it's an error, but for 'reasons' we carry on
anyway.
Jonathan
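
A minimal user-space sketch of that suggestion, under stated assumptions:
every name below (fake_register_cper_work(), sketch_ras_init(),
sketch_core_init()) is hypothetical and merely stands in for the kernel
functions discussed above. It only illustrates the shape of "the init
helper returns the error, the caller logs it and chooses to continue"; it
is not the code from the series.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for cxl_cper_register_prot_err_work(); fails on demand. */
static int fake_register_cper_work(int should_fail)
{
	return should_fail ? -ENODEV : 0;
}

/* Hypothetical RAS init: report the failure to the caller instead of hiding it. */
static int sketch_ras_init(int cper_should_fail)
{
	return fake_register_cper_work(cper_should_fail);
}

/* Hypothetical core init: log the RAS failure, then deliberately carry on. */
static int sketch_core_init(int cper_should_fail)
{
	int rc = sketch_ras_init(cper_should_fail);

	/*
	 * Yes, this is an error, but a RAS setup failure should not fail
	 * core initialization, so log it and continue.
	 */
	if (rc)
		fprintf(stderr, "RAS CPER init failed (%d), continuing\n", rc);

	return 0;
}
```

With this split the decision to ignore the failure is visible at the call
site, next to the print and the comment explaining it, rather than being
buried in a function that always returns 0.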
>
> - Terry
>
>
> > More importantly - if this failed it would previously have resulted
> > in cxl_core_init() failing and things getting torn down.
> >
> >> +
> >> + cxl_register_proto_err_work(&cxl_proto_err_work);
> >> +
> >> + return 0;
> >> +}
> >> +
> >> +void cxl_ras_exit(void)
> >> +{
> >> + cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
> >> + cancel_work_sync(&cxl_cper_prot_err_work);
> >> +
> >> + cxl_unregister_proto_err_work();
> >> + cancel_work_sync(&cxl_proto_err_work);
> >> +}
> >
> >
>
>
* Re: [PATCH v15 5/9] PCI: Establish common CXL Port protocol error flow
2026-02-03 18:21 ` Bowman, Terry
@ 2026-02-05 17:16 ` Jonathan Cameron
0 siblings, 0 replies; 31+ messages in thread
From: Jonathan Cameron @ 2026-02-05 17:16 UTC (permalink / raw)
To: Bowman, Terry
Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
vishal.l.verma, alucerop, ira.weiny, linux-kernel, linux-pci
On Tue, 3 Feb 2026 12:21:56 -0600
"Bowman, Terry" <terry.bowman@amd.com> wrote:
> On 2/3/2026 9:40 AM, Jonathan Cameron wrote:
> > On Mon, 2 Feb 2026 20:52:40 -0600
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >
> >> Introduce CXL Port protocol error handling callbacks to unify detection,
> >> logging, and recovery across CXL Ports and Endpoints, including RCH
> >> downstream ports. Establish a consistent flow for correctable and
> >> uncorrectable CXL protocol errors.
> >>
> >> Provide the solution by adding cxl_port_cor_error_detected() and
> >> cxl_port_error_detected() to handle correctable and uncorrectable handling
> >> through CXL RAS helpers, coordinating uncorrectable recovery in
> >> cxl_do_recovery(), and panicking when the handler returns PCI_ERS_RESULT_PANIC
> >> to preserve fatal cachemem behavior. Gate endpoint handling on the endpoint
> >> driver being bound to avoid processing errors on disabled devices.
> >>
> >> Centralize the RAS base lookup in cxl_get_ras_base(), selecting the
> >> downstream-port dport->regs.ras for Root/Downstream Ports and port->regs.ras
> >> for Upstream Ports/Endpoints.
> >>
> >> Export pcie_clear_device_status() and pci_aer_clear_fatal_status() to enable
> >> cxl_core to clear PCIe/AER state in these flows.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> Acked-by: Bjorn Helgaas <bhelgaas@google.com>
> >> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> >
> > Hi Terry,
> >
> > A few comments inline.
> >
> > Thanks,
> >
> > Jonathan
> >
>
> Thanks for reviewing.
>
> >>
> >> ---
> >>
> >> Changes in v14->v15:
> >> - Update commit message and title. Added Bjorn's ack.
> >> - Move CE and UCE handling logic here
> >>
> >> Changes in v13->v14:
> >> - Add Dave Jiang's review-by
> >> - Update commit message & headline (Bjorn)
> >> - Refactor cxl_port_error_detected()/cxl_port_cor_error_detected() to
> >> one line (Jonathan)
> >> - Remove cxl_walk_port() (Dan)
> >> - Remove cxl_pci_drv_bound(). Check for 'is_cxl' parent port is
> >> sufficient (Dan)
> >> - Remove device_lock_if()
> >> - Combined CE and UCE here (Terry)
> >>
> >> Changes in v12->v13:
> >> - Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
> >> patch (Terry)
> >> - Remove EP case in cxl_get_ras_base(), not used. (Terry)
> >> - Remove check for dport->dport_dev (Dave)
> >> - Remove whitespace (Terry)
> >>
> >> Changes in v11->v12:
> >> - Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
> >> pci_to_cxl_dev()
> >> - Change cxl_error_detected() -> cxl_cor_error_detected()
> >> - Remove NULL variable assignments
> >> - Replace bus_find_device() with find_cxl_port_by_uport() for upstream
> >> port searches.
> >>
> >> Changes in v10->v11:
> >> - None
> >> ---
> >> drivers/cxl/core/ras.c | 134 +++++++++++++++++++++++++++++++++++++++++
> >> drivers/pci/pci.c | 1 +
> >> drivers/pci/pci.h | 2 -
> >> drivers/pci/pcie/aer.c | 1 +
> >> include/linux/aer.h | 2 +
> >> include/linux/pci.h | 2 +
> >> 6 files changed, 140 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> >> index a6c0bc6d7203..0216dafa6118 100644
> >> --- a/drivers/cxl/core/ras.c
> >> +++ b/drivers/cxl/core/ras.c
> >> @@ -218,6 +218,68 @@ static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
> >> return NULL;
> >> }
> >>
> >> +static void __iomem *cxl_get_ras_base(struct device *dev)
> >> +{
> >> + struct pci_dev *pdev = to_pci_dev(dev);
> >> +
> >> + switch (pci_pcie_type(pdev)) {
> >> + case PCI_EXP_TYPE_ROOT_PORT:
> >> + case PCI_EXP_TYPE_DOWNSTREAM:
> >> + {
> >> + struct cxl_dport *dport;
> >
> > struct cxl_dport *dport = NULL;
> >
> >> + struct cxl_port *port __free(put_cxl_port) = find_cxl_port(&pdev->dev, &dport);
> >
> > as if this failed, dport is not written. The alternative is to check port, not dport, as port
> > will always be initialized whether or not find_cxl_port() fails.
> >
>
> Ok.
>
> >
> >> +
> >> + if (!dport) {
> >> + pci_err(pdev, "Failed to find the CXL device");
> >> + return NULL;
> >> + }
> >> + return dport->regs.ras;
> >> + }
> >> + case PCI_EXP_TYPE_UPSTREAM:
> >> + case PCI_EXP_TYPE_ENDPOINT:
> >> + {
> >> + struct cxl_port *port __free(put_cxl_port) = find_cxl_port_by_uport(&pdev->dev);
> >> +
> >> + if (!port) {
> >> + pci_err(pdev, "Failed to find the CXL device");
> >> + return NULL;
> >> + }
> >> + return port->regs.ras;
> >> + }
> >> + }
> >> + dev_warn_once(dev, "Error: Unsupported device type (%#x)", pci_pcie_type(pdev));
> >> + return NULL;
> >> +}
> >
> >> void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> >> {
> >> void __iomem *addr;
> >> @@ -288,6 +350,60 @@ bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> >> return true;
> >> }
> >>
> >> +static void cxl_port_cor_error_detected(struct device *dev)
> >> +{
> >> + struct pci_dev *pdev = to_pci_dev(dev);
> >> + struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
> >> +
> >> + if (is_cxl_endpoint(port)) {
> >> + struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
> >> + struct cxl_dev_state *cxlds = cxlmd->cxlds;
> >> +
> >> + guard(device)(&cxlmd->dev);
> >
> > Maybe add a comment on why this lock needs to be held and then why the dev->driver
> > below needs to be true.
> >
> This can be removed. This was added to ensure the EP's RAS registers remained accessible
> during handling, back when the mapped RAS registers were owned by the CXL memory
> device. That has changed: the EP RAS registers are now owned by the EP Port, and
> the Endpoint Port is already locked in cxl_proto_err_work_fn() before calling this function.
>
> >> +
> >> + if (!dev->driver) {
> >> + dev_warn(&pdev->dev,
> >> + "%s: memdev disabled, abort error handling\n",
> >> + dev_name(dev));
> >
> > Same question as below on why pdev->dev / dev_name(dev) here.
> > Maybe pci_warn() is more appropriate.
> >
>
> I believe the driver check can be removed but would like your input. The check
> for the driver is another piece of code specifically for when the handler was accessing
> the memdev's RAS registers. It was a last check to make certain the device is bound
> to a driver before accessing. EP RAS is now owned by the Endpoint Port.
Sounds fine to me to drop this check if we don't need anything that
belongs to that driver any more.
Jonathan
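
The out-parameter point raised earlier in this reply (dport is not written
when the lookup fails) can be sketched in plain C. This is a hedged
illustration only: sketch_find_port(), sketch_get_ras() and both struct
types are hypothetical stand-ins, not the kernel's find_cxl_port() or its
data structures. The stand-in lookup also accepts a NULL out-pointer,
mirroring the `if (ctx->dport)` check quoted in the patch 4 review.

```c
#include <stddef.h>

struct sketch_dport { int ras; };
struct sketch_port { struct sketch_dport dport; };

/* Hypothetical lookup: writes *dport only on success; dport may be NULL. */
static struct sketch_port *sketch_find_port(struct sketch_port *candidate,
					    struct sketch_dport **dport)
{
	if (!candidate)
		return NULL;	/* failure: *dport is left untouched */
	if (dport)
		*dport = &candidate->dport;
	return candidate;
}

/* Returns the address of the dport's ras field, or NULL when the lookup fails. */
static int *sketch_get_ras(struct sketch_port *candidate)
{
	/* Initialized so the check below never reads an indeterminate pointer. */
	struct sketch_dport *dport = NULL;
	struct sketch_port *port = sketch_find_port(candidate, &dport);

	/* Checking port alone would also be safe, since it is always assigned. */
	if (!port || !dport)
		return NULL;
	return &dport->ras;
}
```

Either fix works on its own: initializing the out-parameter makes the
`!dport` test well defined, while testing the returned port avoids relying
on the out-parameter at all.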
* Re: [PATCH v15 5/9] PCI: Establish common CXL Port protocol error flow
2026-02-05 16:07 ` Bowman, Terry
@ 2026-02-05 21:17 ` dan.j.williams
0 siblings, 0 replies; 31+ messages in thread
From: dan.j.williams @ 2026-02-05 21:17 UTC (permalink / raw)
To: Bowman, Terry, dan.j.williams, dave, jonathan.cameron, dave.jiang,
alison.schofield, bhelgaas, shiju.jose, ming.li,
Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
sathyanarayanan.kuppuswamy, linux-cxl, vishal.l.verma, alucerop,
ira.weiny
Cc: linux-kernel, linux-pci
Bowman, Terry wrote:
[..]
> > The answer that feels consistent with unburdening the PCI core with the
> > vagaries of CXL is to include RCH errors in the class of notifications that
> > get forwarded. Arrange for cxl_proto_err_work_data to carry whether it
> > is an RCH or VH error and then dispatch either
> > cxl_handle_rdport_errors() or cxl_handle_proto_error().
>
>
> That approach makes sense to me.
>
> Would you like to keep the RCH's RCiEP traversal in the AER driver for now? In
> that model, the RCiEP PCI device ID would be passed via cxl_proto_err_work_data.
> This would be a relatively small change — updating cxl_rch_handle_error_iter()
> and pcie/aer_cxl_rch.c to call cxl_forward_error().
>
> A cleaner long-term approach would be to move all of the logic in aer_cxl_rch.c
> into cxl/core/rch_ras.c. In that case, an RCEC (reporting on behalf of the RCH
> error) would be passed in cxl_proto_err_work_data, and RCiEP iteration would be
> handled by the CXL driver after the work item surfaces from the kfifo.
>
> The second approach improves PCI/CXL separation, but it may be harder to land
> late in the series. Would it be acceptable to proceed with the first approach
> initially, followed immediately by a cleanup series moving pcie/aer_cxl_rch.c
> into cxl/core/rch_ras.c?
I think it is fine to do this incrementally. Keep RCiEP traversal in AER
for now. Later move more of that logic to the CXL core so PCI does not
need to worry about that complication. That later move makes it easier
to add consideration for details like the "RCEC Downstream Port
Association Structure" (RDPAS) without thrashing the PCI core.
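
A rough sketch of the dispatch described above, assuming the work item
gains a flag that distinguishes RCH from VH errors. The struct layout, the
is_rch field and all sketch_* names are hypothetical; the real
cxl_proto_err_work_data is not shown in this thread, and the two handlers
below are stand-ins for cxl_handle_rdport_errors() and
cxl_handle_proto_error().

```c
#include <stdbool.h>

/* Hypothetical work item carried through the kfifo. */
struct sketch_proto_err_work_data {
	bool is_rch;	/* assumed flag: true for an RCH error, false for VH */
	int severity;
};

enum sketch_path { SKETCH_PATH_RCH, SKETCH_PATH_VH };

/* Stand-in for cxl_handle_rdport_errors(): the RCH path. */
static enum sketch_path
sketch_handle_rdport_errors(const struct sketch_proto_err_work_data *wd)
{
	(void)wd;
	return SKETCH_PATH_RCH;
}

/* Stand-in for cxl_handle_proto_error(): the VH path. */
static enum sketch_path
sketch_handle_proto_error(const struct sketch_proto_err_work_data *wd)
{
	(void)wd;
	return SKETCH_PATH_VH;
}

/* Consumer side: pick the handler from the flag carried in the work item. */
static enum sketch_path
sketch_dispatch(const struct sketch_proto_err_work_data *wd)
{
	return wd->is_rch ? sketch_handle_rdport_errors(wd)
			  : sketch_handle_proto_error(wd);
}
```

Carrying the RCH/VH distinction in the queued data keeps the producer (the
AER side) limited to classifying and forwarding, with the handling choice
made entirely on the CXL side after dequeue.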
Thread overview: 31+ messages
2026-02-03 2:52 [PATCH v15 0/9] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
2026-02-03 2:52 ` [PATCH v15 1/9] PCI/AER: Introduce AER-CXL Kfifo in new file, pcie/aer_cxl_vh.c Terry Bowman
2026-02-04 4:25 ` dan.j.williams
2026-02-03 2:52 ` [PATCH v15 2/9] cxl: Update CXL Endpoint tracing Terry Bowman
2026-02-04 4:29 ` dan.j.williams
2026-02-03 2:52 ` [PATCH v15 3/9] PCI/ERR: Introduce PCI_ERS_RESULT_PANIC Terry Bowman
2026-02-03 2:52 ` [PATCH v15 4/9] PCI/AER: Dequeue forwarded CXL error Terry Bowman
2026-02-03 15:26 ` Jonathan Cameron
2026-02-03 17:00 ` Bowman, Terry
2026-02-05 17:13 ` Jonathan Cameron
2026-02-04 4:46 ` dan.j.williams
2026-02-03 2:52 ` [PATCH v15 5/9] PCI: Establish common CXL Port protocol error flow Terry Bowman
2026-02-03 15:40 ` Jonathan Cameron
2026-02-03 18:21 ` Bowman, Terry
2026-02-05 17:16 ` Jonathan Cameron
2026-02-04 5:08 ` dan.j.williams
2026-02-04 17:11 ` Bowman, Terry
2026-02-04 21:22 ` dan.j.williams
2026-02-05 16:07 ` Bowman, Terry
2026-02-05 21:17 ` dan.j.williams
2026-02-03 2:52 ` [PATCH v15 6/9] cxl: Update error handlers to support CXL Port protocol errors Terry Bowman
2026-02-03 15:54 ` Jonathan Cameron
2026-02-03 2:52 ` [PATCH v15 7/9] cxl: Update Endpoint AER uncorrectable handler Terry Bowman
2026-02-03 16:18 ` Jonathan Cameron
2026-02-03 17:31 ` Dave Jiang
2026-02-03 18:35 ` Bowman, Terry
2026-02-03 18:49 ` Dave Jiang
2026-02-03 20:21 ` Dave Jiang
2026-02-03 2:52 ` [PATCH v15 8/9] cxl: Remove Endpoint AER correctable handler Terry Bowman
2026-02-03 16:27 ` Jonathan Cameron
2026-02-03 2:52 ` [PATCH v15 9/9] cxl: Enable CXL protocol error reporting Terry Bowman