public inbox for linux-kernel@vger.kernel.org
* [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging
@ 2025-11-04 17:02 Terry Bowman
From: Terry Bowman @ 2025-11-04 17:02 UTC
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

This patchset updates CXL Protocol Error handling for CXL Ports and CXL
Endpoints (EP). Previous versions of this series can be found here:
https://lore.kernel.org/linux-cxl/20250925223440.3539069-1-terry.bowman@amd.com/

The first 2 patches, the pcie_is_cxl() patch and the DVSEC definition patch,
were moved to the front of the series in this revision in case Alejandro needs
them for his Type2 series.

The next 6 patches prepare and move files. This includes Dave Jiang's patch
moving CXL RAS related code from cxl/core/pci.c to cxl/core/ras.c. Restricted
CXL host (RCH) related RAS code is moved to cxl/core/ras_rch.c. AER driver
related RCH code is moved within the AER driver from pci/pcie/aer.c to
pci/pcie/aer_cxl_rch.c.

Patches 9-15 are mostly fixups in preparation for following protocol
handling changes. This includes introducing the new PCI_ERS_RESULT_PANIC
result type, improvements to AER logging for bus type (CXL or PCI), function
handler interface updates supporting both Endpoints and CXL Port devices, and
logging a message if RAS is NULL. The patch "CXL/AER: Update PCI class code
check to use FIELD_GET()" was removed from this group per Lukas's request.

Patches 16-17 move more code. The AER driver's virtual hierarchy (VH) RAS
related code is moved to pci/pcie/aer_cxl_vh.c in patch 17. Patch 18 introduces
cxl_pci_drv_bound() to identify if an EP is using the CXL EP driver. This
is to support cases where the CXL driver is not used (e.g. VFIO). Calling
cxl_pci_drv_bound() in cxl/pci.c from cxl_core fails with circular build
dependencies, so cxl/pci.c (containing cxl_pci_drv_bound()) is moved
to cxl/core/pci_drv.c.
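
As a rough illustration of the bound-driver check described above, here is a
minimal userspace sketch. It is not the code from the series: the structures
are simplified stand-ins for the kernel's, and every sketch_* name is
hypothetical. The only idea taken from this cover letter is that the check
asks whether the device's bound driver is the CXL EP driver.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct sketch_pci_driver {
	const char *name;
};

struct sketch_pci_dev {
	const struct sketch_pci_driver *driver;	/* NULL when unbound */
};

/* Stand-ins for the CXL EP driver and an unrelated driver such as VFIO. */
static const struct sketch_pci_driver sketch_cxl_pci_driver = { "cxl_pci" };
static const struct sketch_pci_driver sketch_vfio_pci_driver = { "vfio-pci" };

/* True only when the device is bound to the CXL EP driver. */
static bool sketch_cxl_pci_drv_bound(const struct sketch_pci_dev *pdev)
{
	return pdev && pdev->driver == &sketch_cxl_pci_driver;
}
```

A caller would skip the CXL-specific handling whenever this returns false, for
example when the endpoint is assigned to VFIO or has no driver bound.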

Patches 18-20 create CXL Endpoint error handlers alongside the existing CXL
PCI error handlers. Both CXL and PCI error handlers are added for CXL Port devices.

Patches 21-23 implement the kernel kfifo dequeue and logic for calling the
correctable or uncorrectable handlers. Significant changes were made in the
uncorrectable error recovery patch:
 - Updated locking. The endpoint and port devices lock the following:
   EP - pdev->dev (same as cxlds->dev) and cxlmd->dev
   RP/USP/DSP - pdev->dev and parent cxl_port
 - Move locking out of handlers and into cxl_handle_proto_error() and
   report_error_detected(). Lock as soon as possible after kfifo dequeue.
 - Devices reporting UCEs are locked after kfifo dequeue. A condition check is
   required to avoid locking the reporting device again during iteration in
   do_recovery().
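
The dequeue and locking order listed above can be sketched in userspace as
follows. This is only an illustration under stated assumptions: a small ring
buffer stands in for the kernel kfifo, a boolean stands in for the device lock,
and all sketch_* names are hypothetical rather than taken from the patches.
It shows the reporting device being locked right after dequeue, the dispatch
to the correctable or uncorrectable path, and the condition check that skips
re-locking the reporting device during the recovery walk.

```c
#include <stdbool.h>
#include <stddef.h>

enum sketch_severity { SKETCH_CE, SKETCH_UCE };

/* Hypothetical device: 'locked' stands in for the real device lock. */
struct sketch_dev {
	bool locked;
	int double_locks;	/* incremented if a held lock is taken again */
	int ce_handled;
	int uce_reported;
};

struct sketch_work {
	struct sketch_dev *dev;		/* device that reported the error */
	enum sketch_severity severity;
};

#define SKETCH_FIFO_SIZE 8u	/* power of two, as kfifo also requires */

struct sketch_fifo {
	struct sketch_work slot[SKETCH_FIFO_SIZE];
	unsigned int in, out;
};

static bool sketch_fifo_put(struct sketch_fifo *f, struct sketch_work w)
{
	if (f->in - f->out == SKETCH_FIFO_SIZE)
		return false;	/* full */
	f->slot[f->in++ % SKETCH_FIFO_SIZE] = w;
	return true;
}

static bool sketch_fifo_get(struct sketch_fifo *f, struct sketch_work *w)
{
	if (f->in == f->out)
		return false;	/* empty */
	*w = f->slot[f->out++ % SKETCH_FIFO_SIZE];
	return true;
}

static void sketch_lock(struct sketch_dev *d)
{
	if (d->locked)
		d->double_locks++;	/* a real lock would deadlock here */
	d->locked = true;
}

static void sketch_unlock(struct sketch_dev *d)
{
	d->locked = false;
}

/* Walk every device; the reporting device is already locked by the caller. */
static void sketch_do_recovery(struct sketch_dev *devs, size_t n,
			       struct sketch_dev *reporting)
{
	for (size_t i = 0; i < n; i++) {
		bool need_lock = &devs[i] != reporting;

		if (need_lock)
			sketch_lock(&devs[i]);
		devs[i].uce_reported++;
		if (need_lock)
			sketch_unlock(&devs[i]);
	}
}

/* Drain the FIFO; returns the number of work items handled. */
static int sketch_proto_err_work_fn(struct sketch_fifo *f,
				    struct sketch_dev *devs, size_t n)
{
	struct sketch_work w;
	int handled = 0;

	while (sketch_fifo_get(f, &w)) {
		sketch_lock(w.dev);	/* lock right after dequeue */
		if (w.severity == SKETCH_CE)
			w.dev->ce_handled++;
		else
			sketch_do_recovery(devs, n, w.dev);
		sketch_unlock(w.dev);
		handled++;
	}
	return handled;
}
```

Without the need_lock check, the walk would try to take the reporting device's
lock a second time, which is the situation the last bullet above guards against.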

Patches 24-25 enable/disable protocol error interrupt masks.
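
The mask arithmetic behind enabling and disabling an error interrupt can be
sketched as below. This is an assumption-laden illustration, not the code from
patches 24-25: the helper names are hypothetical, and the two bit values are
simply the internal error bits that appear in the test logs later in this
message ([14] CorrIntErr as 00004000 and [22] UncorrIntErr as 00400000). An
error is signalled only when its status bit is set and its mask bit is clear.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as they appear in the test logs in this message. */
#define SKETCH_COR_INTERNAL	0x00004000u	/* bit 14, CorrIntErr */
#define SKETCH_UNC_INTERNAL	0x00400000u	/* bit 22, UncorrIntErr */

/* Clearing a mask bit enables reporting of that error. */
static uint32_t sketch_unmask(uint32_t mask, uint32_t bit)
{
	return mask & ~bit;
}

/* Setting a mask bit disables reporting of that error. */
static uint32_t sketch_mask(uint32_t mask, uint32_t bit)
{
	return mask | bit;
}

/* An error is reported when its status bit is set and it is not masked. */
static bool sketch_is_reported(uint32_t status, uint32_t mask, uint32_t bit)
{
	return (status & ~mask & bit) != 0;
}
```

For example, a status/mask pair of 00004000/0000a000 reports the correctable
internal error because bit 14 is clear in the mask; setting that bit changes
the mask to 0000e000 and suppresses the report.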


== Testing ==
Below are the testing results while using QEMU. The QEMU testing uses a CXL Root
Port, CXL Upstream Switch Port, CXL Downstream Switch Port and CXL Endpoint as
given below. I've attached the QEMU startup command line used. This testing uses
protocol error injection at all the devices.

The sub-topology for the QEMU testing is:
                    ---------------------
                    | CXL RP - 0C:00.0  |
                    ---------------------
                              |
                    ---------------------
                    | CXL USP - 0D:00.0 |
                    ---------------------
                              |
                    ---------------------
                    | CXL DSP - 0E:00.0 |
                    ---------------------
                              |
                    ---------------------
                    | CXL EP - 0F:00.0  |
                    ---------------------
		    
 root@tbowman-cxl:~# lspci -t
 -+-[0000:00]- -00.0
  |           +-01.0
  |           +-02.0
  |           +-03.0
  |           +-1f.0
  |           +-1f.2
  |           \-1f.3
  \-[0000:0c]---00.0-[0d-0f]----00.0-[0e-0f]----00.0-[0f]----00.0

 The topology was created with:
  ${qemu} -boot menu=on \
             -cpu host \
             -nographic \
             -monitor telnet:127.0.0.1:1234,server,nowait \
             -M virt,cxl=on \
             -chardev stdio,id=s1,signal=off,mux=on -serial none \
             -device isa-serial,chardev=s1 -mon chardev=s1,mode=readline \
             -machine q35,cxl=on \
             -m 16G,maxmem=24G,slots=8 \
             -cpu EPYC-v3 \
             -smp 16 \
             -accel kvm \
             -drive file=${img},format=raw,index=0,media=disk \
             -device e1000,netdev=user.0 \
             -netdev user,id=user.0,hostfwd=tcp::5555-:22 \
             -object memory-backend-file,id=cxl-mem0,share=on,mem-path=/tmp/cxltest.raw,size=256M \
             -object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa0.raw,size=256M \
             -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
             -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
             -device cxl-upstream,bus=root_port0,id=us0 \
             -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
             -device cxl-type3,bus=swport0,volatile-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-vmem0 \
             -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k


=== Root Port ===
root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
pcieport 0000:0c:00.0:    [14] CorrIntErr
cxl_aer_correctable_error: memdev=0000:0c:00.0 host=pci0000:0c serial=0: status: 'CRC Threshold Hit'

root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
pcieport 0000:0c:00.0:    [22] UncorrIntErr
cxl_aer_uncorrectable_error: memdev=0000:0c:00.0 host=pci0000:0c serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
cxl_aer_uncorrectable_error: memdev=0000:0d:00.0 host=0000:0c:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
cxl_aer_uncorrectable_error: memdev=0000:0e:03.0 host=0000:0d:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
cxl_aer_uncorrectable_error: memdev=mem3 host=0000:12:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
cxl_aer_uncorrectable_error: memdev=0000:0e:02.0 host=0000:0d:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
cxl_aer_uncorrectable_error: memdev=mem2 host=0000:11:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
cxl_aer_uncorrectable_error: memdev=0000:0e:01.0 host=0000:0d:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
cxl_aer_uncorrectable_error: memdev=mem0 host=0000:10:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
cxl_aer_uncorrectable_error: memdev=0000:0e:00.0 host=0000:0d:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
cxl_aer_uncorrectable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 150 Comm: kworker/10:1 Tainted: G            E       6.18.0-rc2-00029-g7d4bdf85dccf #3518 PREEMPT(voluntary)
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Workqueue: events cxl_proto_err_work_fn [cxl_core]
Call Trace:
 <TASK>
 vpanic+0x3a0/0x410
 panic+0x5b/0x60
 ? xa_find_after+0x134/0x250
 ? xa_find_after+0x86/0x250
 cxl_proto_err_work_fn+0x316/0x320 [cxl_core]
 ? lock_release+0x1e4/0x3f0
 process_one_work+0x22c/0x650
 worker_thread+0x188/0x330
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x102/0x210
 ? __pfx_kthread+0x10/0x10
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x278/0x2e0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Kernel Offset: disabled
---[ end Kernel panic - not syncing: CXL cachemem error. ]---


=== Upstream Port ===
root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00004000/0000a000
pcieport 0000:0d:00.0:    [14] CorrIntErr
cxl_aer_correctable_error: memdev=0000:0d:00.0 host=0000:0c:00.0 serial=0: status: 'CRC Threshold Hit'

root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
aer_event: 0000:0d:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
pcieport 0000:0d:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
cxl_aer_uncorrectable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 159 Comm: irq/24-aerdrv Tainted: G            E       6.18.0-rc2-00029-g7d4bdf85dccf #3518 PREEMPT(voluntary)
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 vpanic+0x3a0/0x410
 panic+0x5b/0x60
 pci_error_detected+0xb4/0xc0 [cxl_core]
 report_error_detected+0xbf/0x190
 ? __pfx_report_frozen_detected+0x10/0x10
 __pci_walk_bus+0x4c/0x70
 ? __pfx_report_frozen_detected+0x10/0x10
 __pci_walk_bus+0x34/0x70
 ? __pfx_report_frozen_detected+0x10/0x10
 __pci_walk_bus+0x34/0x70
 ? __pfx_report_frozen_detected+0x10/0x10
 pci_walk_bus+0x31/0x50
 pcie_do_recovery+0x300/0x430
 aer_isr_one_error_type+0x20f/0x3c0
 aer_isr_one_error+0x117/0x140
 aer_isr+0x4c/0x80
 irq_thread_fn+0x24/0x60
 irq_thread+0x1a0/0x2b0
 ? __pfx_irq_thread_fn+0x10/0x10
 ? __pfx_irq_thread_dtor+0x10/0x10
 ? __pfx_irq_thread+0x10/0x10
 kthread+0x102/0x210
 ? __pfx_kthread+0x10/0x10
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x278/0x2e0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Kernel Offset: disabled
---[ end Kernel panic - not syncing: CXL cachemem error. ]---

=== Downstream Port ===
root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
pcieport 0000:0e:00.0:    [14] CorrIntErr
cxl_aer_correctable_error: memdev=0000:0e:00.0 host=0000:0d:00.0 serial=0: status: 'CRC Threshold Hit'

root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00400000/02000000
pcieport 0000:0e:00.0:    [22] UncorrIntErr
cxl_aer_uncorrectable_error: memdev=0000:0d:00.0 host=0000:0c:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
cxl_aer_uncorrectable_error: memdev=0000:0e:00.0 host=0000:0d:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
cxl_aer_uncorrectable_error: memdev=mem0 host=0000:0f:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
cxl_aer_uncorrectable_error: memdev=0000:0e:01.0 host=0000:0d:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
cxl_aer_uncorrectable_error: memdev=mem1 host=0000:10:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
cxl_aer_uncorrectable_error: memdev=0000:0e:03.0 host=0000:0d:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
cxl_aer_uncorrectable_error: memdev=mem2 host=0000:12:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
cxl_aer_uncorrectable_error: memdev=0000:0e:02.0 host=0000:0d:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
cxl_aer_uncorrectable_error: memdev=mem3 host=0000:11:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 683 Comm: kworker/10:2 Tainted: G            E       6.18.0-rc2-00029-g2e23f5f37fac #3552 PREEMPT(voluntary)
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Workqueue: events cxl_proto_err_work_fn [cxl_core]
Call Trace:
 <TASK>
 vpanic+0x3a0/0x410
 panic+0x5b/0x60
 ? xa_find_after+0x134/0x250
 ? xa_find_after+0x86/0x250
 cxl_proto_err_work_fn+0x352/0x360 [cxl_core]
 ? lock_release+0x1e4/0x3f0
 process_one_work+0x22c/0x650
 worker_thread+0x188/0x330
 ? __pfx_worker_thread+0x10/0x10
 kthread+0x102/0x210
 ? __pfx_kthread+0x10/0x10
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x278/0x2e0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Kernel Offset: disabled
---[ end Kernel panic - not syncing: CXL cachemem error. ]---


=== Endpoint ===
root@tbowman-cxl:~/aer-inject# ./ep-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0f:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0f:00.0
aer_event: 0000:0f:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
cxl_core 0000:0f:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
cxl_core 0000:0f:00.0:   device [8086:0d93] error status/mask=00004000/00000000
cxl_core 0000:0f:00.0:    [14] CorrIntErr
cxl_aer_correctable_error: memdev=mem3 host=0000:0f:00.0 serial=0: status: 'CRC Threshold Hit'

root@tbowman-cxl:~/aer-inject# ./ep-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0f:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0f:00.0
aer_event: 0000:0f:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
cxl_core 0000:0f:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
cxl_aer_uncorrectable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 159 Comm: irq/24-aerdrv Tainted: G            E       6.18.0-rc2-00029-g7d4bdf85dccf #3518 PREEMPT(voluntary)
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 vpanic+0x3a0/0x410
 panic+0x5b/0x60
 pci_error_detected+0xb4/0xc0 [cxl_core]
 report_error_detected+0xbf/0x190
 ? __pfx_report_frozen_detected+0x10/0x10
 __pci_walk_bus+0x4c/0x70
 ? __pfx_report_frozen_detected+0x10/0x10
 pci_walk_bus+0x31/0x50
 pcie_do_recovery+0x300/0x430
 aer_isr_one_error_type+0x20f/0x3c0
 aer_isr_one_error+0x117/0x140
 aer_isr+0x4c/0x80
 irq_thread_fn+0x24/0x60
 irq_thread+0x1a0/0x2b0
 ? __pfx_irq_thread_fn+0x10/0x10
 ? __pfx_irq_thread_dtor+0x10/0x10
 ? __pfx_irq_thread+0x10/0x10
 kthread+0x102/0x210
 ? __pfx_kthread+0x10/0x10
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x278/0x2e0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Kernel Offset: disabled
---[ end Kernel panic - not syncing: CXL cachemem error. ]---

== Changes ==

 Changes in v12 -> v13:
 CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
 - Add Dave Jiang's reviewed-by
 - Remove changes to existing PCI_DVSEC_CXL_PORT* defines. Update commit
   message. (Jonathan)
 PCI/CXL: Introduce pcie_is_cxl()
 - Add Ben's "reviewed-by"
 cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
 - None
 cxl/pci: Remove unnecessary CXL RCH handling helper functions
 - None
 cxl: Remove CXL VH handling in CONFIG_PCIEAER_CXL conditional blocks from core
 - None
 cxl: Move CXL driver's RCH error handling into core/ras_rch.c
 - None
 CXL/AER: Replace device_lock() in cxl_rch_handle_error_iter() with guard() lock
 - New patch
 CXL/AER: Move AER drivers RCH error handling into pcie/aer_cxl_rch.c
 - Add forward declaration of 'struct aer_err_info' in pci/pci.h (Terry)
 - Changed copyright date from 2025 to 2023 (Jonathan)
 - Add Dave Jiang's, Jonathan's, and Ben's review-by
 - Readd 'struct aer_err_info' (Bot)
 PCI/AER: Report CXL or PCIe bus error type in trace logging
 - Remove duplicated aer_err_info inline comments. Is already in the
   kernel-doc header (Ben)
 cxl/pci: Update RAS handler interfaces to also support CXL Ports
 - None
 cxl/pci: Log message if RAS registers are unmapped
 - Added Ben's review-by
 cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
 - Added Dave Jiang's review-by
 cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
 - Add Ben's review-by
 cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
 - Change as result of dport delay fix. No longer need switchport and
   endport approach. Refactor. (Terry)
 CXL/PCI: Introduce PCI_ERS_RESULT_PANIC
 - Add Dave Jiang's, Jonathan's, Ben's review-by
 - Typo fix (Ben)
 CXL/AER: Introduce pcie/aer_cxl_vh.c in AER driver for forwarding CXL errors
 - Add Dave Jiang's review-by
 - Update error message (Ben)
 cxl: Introduce cxl_pci_drv_bound() to check for bound driver
 - Add Dave Jiang's review-by.
 cxl: Change CXL handlers to use guard() instead of scoped_guard()
 - New patch
cxl/pci: Introduce CXL protocol error handlers for endpoints
 - Updated all the implementation and commit message. (Terry)
 - Refactored cxl_cor_error_detected()/cxl_error_detected() to remove
   pdev (Dave Jiang)
CXL/PCI: Introduce CXL Port protocol error handlers
 - Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
   patch (Terry)
 - Remove EP case in cxl_get_ras_base(), not used. (Terry)
 - Remove check for dport->dport_dev (Dave)
 - Remove whitespace (Terry)
PCI/AER: Dequeue forwarded CXL error
 - Rewrite cxl_handle_proto_error() and cxl_proto_err_work_fn() (Terry)
 - Rename get_cxl_host_dev() to be get_cxl_port() (Terry)
 - Remove exporting of unused function, pci_aer_clear_fatal_status() (Dave Jiang)
 - Change pr_err() calls to ratelimited. (Terry)
 - Update commit message. (Terry)
 - Remove namespace qualifier from pcie_clear_device_status()
   export (Dave Jiang)
 - Move locks into cxl_proto_err_work_fn() (Dave)
 - Update log messages in cxl_forward_error() (Ben)
CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
 - Renamed pci_ers_merge_result() to pcie_ers_merge_result().
   pci_ers_merge_result() is already used in eeh driver. (Bot)
CXL/PCI: Introduce CXL uncorrectable protocol error recovery
 - Rewrite report_error_detected() and cxl_walk_port (Terry)
 - Add guard() before calling cxl_pci_drv_bound() (Dave Jiang)
 - Add guard() calls for EP (cxlds->cxlmd->dev & pdev->dev) and ports
   (pdev->dev & parent cxl_port) in cxl_report_error_detected() and
   cxl_handle_proto_error() (Terry)
 - Remove unnecessary check for endpoint port. (Dave Jiang)
 - Remove check for RCIEP EP in cxl_report_error_detected() (Terry)
CXL/PCI: Enable CXL protocol errors during CXL Port probe
 - Add dev and dev_is_pci() NULL checks in cxl_unmask_proto_interrupts() (Terry)
 - Add Dave Jiang's and Ben's review-by
CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup
 - Added dev and dev_is_pci() checks in cxl_mask_proto_interrupts() (Terry)

Changes in v11 -> v12:
cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
 - Added Dave Jiang's review by
 - Moved to front of series
cxl/pci: Remove unnecessary CXL RCH handling helper functions
 - Add reviewed-by for Alejandro & Dave Jiang
 - Moved to front of series
cxl: Remove ifdef blocks of CONFIG_PCIEAER_CXL from core/pci.c
 - Update CONFIG_CXL_RAS in CXL Kconfig to have CXL_PCI dependency (Terry)
CXL/AER: Remove CONFIG_PCIEAER_CXL and replace with CONFIG_CXL_RAS 
 - Added review-by for Sathyanarayanan
 - Changed Kconfig dependency from PCIEAER_CXL to PCIEAER. Moved
   this backwards into this patch.
cxl: Move CXL driver RCH error handling into CONFIG_CXL_RCH_RAS conditio
 - Moved CXL_RCH_RAS Kconfig definition here from following commit
CXL/AER: Introduce aer_cxl_rch.c into AER driver for handling CXL RCH errors
 - Rename drivers/pci/pcie/cxl_rch.c to drivers/pci/pcie/aer_cxl_rch.c (Lukas)
 - Removed forward declaration of 'struct aer_err_info' in pci/pci.h (Terry)
CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
 - Change formatting to be same as existing definitions
 - Change GENMASK() -> __GENMASK() and BIT() to _BITUL()
PCI/CXL: Introduce pcie_is_cxl()
 - Add review-by for Alejandro
 - Add comment in set_pcie_cxl() explaining why updating parent status.
PCI/AER: Report CXL or PCIe bus error type in trace logging
 - Change aer_err_info::is_cxl to be a bool bitfield. Update structure padding. (Lukas)
 - Add kernel-doc for 'struct aer_err_info' (Lukas)
cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports 
 - Correct parameters to call trace_cxl_aer_correctable_error() (Shiju)
 - Add reviewed-by for Jonathan and Shiju
cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
 - Add check for dport_parent->rch before calling cxl_dport_init_ras_reporting().
 - RCH dports are initialized from cxl_dport_init_ras_reporting cxl_mem_probe().  
CXL/PCI: Introduce PCI_ERS_RESULT_PANIC 
 - Documentation requested by (Lukas)
CXL/AER: Introduce aer_cxl_vh.c in AER driver for forwarding CXL errors
 - Rename drivers/pci/pcie/cxl_aer.c to drivers/pci/pcie/aer_cxl_vh.c (Lukas)
cxl: Introduce cxl_pci_drv_bound() to check for bound driver
 - New patch
PCI/AER: Dequeue forwarded CXL error
 - Add guard for CE case in cxl_handle_proto_error() (Dave)
 - Updated commit message (Terry)
CXL/PCI: Introduce CXL Port protocol error handlers
 - Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
   pci_to_cxl_dev() (Lukas)
 - Change cxl_error_detected() -> cxl_cor_error_detected() (Terry)
 - Remove NULL variable assignments (Jonathan)
 - Replace bus_find_device() with find_cxl_port_by_uport() for upstream
   port searches. (Dave)
CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
 - Remove static inline pci_ers_merge_result() definition for !CONFIG_PCIEAER.
   Is not needed. (Lukas)
CXL/PCI: Introduce CXL uncorrectable protocol error recovery
 - Clean up port discovery in cxl_do_recovery() (Dave)
 - Add PCI_EXP_TYPE_RC_END to type check in cxl_report_error_detected()

Changes in v10 -> v11:
 cxl: Remove ifdef blocks of CONFIG_PCIEAER_CXL from core/pci.c
 - New patch
 CXL/AER: Remove CONFIG_PCIEAER_CXL and replace with CONFIG_CXL_RAS
 - New patch
 cxl/pci: Remove unnecessary CXL RCH handling helper functions
 - New patch
 cxl: Move CXL driver RCH error handling into CONFIG_CXL_RCH_RAS conditional block
 - New patch
 CXL/AER: Introduce rch_aer.c into AER driver for handling CXL RCH errors
 - Remove changes in code-split and move to earlier, new patch
 - Add #include <linux/bitfield.h> to cxl_ras.c
 - Move cxl_rch_handle_error() & cxl_rch_enable_rcec() declarations from pci.h
   to aer.h, more localized.
 - Introduce CONFIG_CXL_RCH_RAS, includes Makefile changes, ras.c ifdef changes
 CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
 - New patch
 PCI/CXL: Introduce pcie_is_cxl()
 - Amended set_pcie_cxl() to check for Upstream Port's and EP's parent
   downstream port by calling set_pcie_cxl(). (Dan)
 - Retitle patch: 'Add' -> 'Introduce'
 - Add check for CXL.mem and CXL.cache (Alejandro, Dan)
 PCI/AER: Report CXL or PCIe bus error type in trace logging
 - Remove duplicate call to trace_aer_event() (Shiju)
 - Added Dan William's and Dave Jiang's reviewed-by
 CXL/AER: Update PCI class code check to use FIELD_GET()
 - Add #include <linux/bitfield.h> to cxl_ras.c (Terry)
 - Removed line wrapping at "(CXL 3.2, 8.1.12.1)". (Jonathan)
 cxl/pci: Log message if RAS registers are unmapped
 - Added Dave Jiang's review-by (Terry)
 cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
 - Updated CE and UCE trace routines to maintain consistent TP_Struct ABI
   and unchanged TP_printk() logging. (Shiju, Alison)
 cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
 - Added Dave Jiang and Jonathan Cameron's review-by
 - Changes moved to core/ras.c
 cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
 - Use local pointer for readability in cxl_switch_port_init_ras() (Jonathan Cameron)
 - Rename port to be ep in cxl_endpoint_port_init_ras() (Dave Jiang)
 - Rename dport to be parent_dport in cxl_endpoint_port_init_ras()
   and cxl_switch_port_init_ras() (Dave Jiang)
 - Port helper changes were in cxl/port.c, now in core/ras.c (Dave Jiang)
 cxl/pci: Introduce CXL Endpoint protocol error handlers
 - cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
 - cxl_error_detected() - Remove extra line (Shiju)
 - Changes moved to core/ras.c (Terry)
 - cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
 - Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
 - Move #include "pci.h" from cxl.h to core.h (Terry)
 - Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)   
 CXL/AER: Introduce cxl_aer.c into AER driver for forwarding CXL errors
 - Move RCH implementation to cxl_rch.c and RCH declarations to pci/pci.h. (Terry)
 - Introduce 'struct cxl_proto_err_kfifo' containing semaphore, fifo,
   and work struct. (Dan)
 - Remove embedded struct from cxl_proto_err_work (Dan)
 - Make 'struct work_struct *cxl_proto_err_work' definition static (Jonathan)
 - Add check for NULL cxl_proto_err_kfifo to determine if CXL driver is
   not registered for workqueue. (Dan)
 PCI/AER: Dequeue forwarded CXL error
 - Reword patch commit message to remove RCiEP details (Jonathan)
 - Add #include <linux/bitfield.h> (Terry)
 - is_cxl_rcd() - Fix short comment message wrap  (Jonathan)
 - is_cxl_rcd() - Combine return calls into 1  (Jonathan)
 - cxl_handle_proto_error() - Move comment earlier  (Jonathan)
 - Use FIELD_GET() in discovering class code (Jonathan)
 - Remove BDF from cxl_proto_err_work_data. Use 'struct pci_dev *' (Dan)
 CXL/PCI: Introduce CXL Port protocol error handlers
 - Removed check for PCI_EXP_TYPE_RC_END in cxl_report_error_detected() (Terry)
 - Update is_cxl_error() to check for acceptable PCI EP and port types
 CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
 - pci_ers_merge_result() - Change export to non-namespace and rename
   to be pci_ers_merge_result() (Jonathan)
 - Move pci_ers_merge_result() definition to pci.h. Needs pci_ers_result (Terry)
 CXL/PCI: Introduce CXL uncorrectable protocol error recovery
 - pci_ers_merge_results() - Move to earlier patch
 CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup
 - Remove guard() in cxl_mask_proto_interrupts(). Observed device lockup/block
   during testing. (Terry)
  
 Changes in v9 -> v10:
 - Add drivers/pci/pcie/cxl_aer.c
 - Add drivers/cxl/core/native_ras.c
 - Change cxl_register_prot_err_work()/cxl_unregister_prot_err_work to return void
 - Check for pcie_ports_native in cxl_do_recovery()
 - Remove debug logging in cxl_do_recovery()
 - Update PCI_ERS_RESULT_PANIC definition to indicate is CXL specific
 - Revert trace logging changes: name,parent -> memdev,host.
 - Use FIELD_GET() to check for EP class code (cxl_aer.c & native_ras.c).
 - Change _prot_ to _proto_ everywhere
 - cxl_rch_handle_error_iter(), check if driver is cxl_pci_driver
 - Remove cxl_create_prot_error_info(). Move logic into forward_cxl_error()
 - Remove sbdf_to_pci() and move logic into cxl_handle_proto_error()
 - Simplify/refactor get_pci_cxl_host_dev()
 - Simplify/refactor cxl_get_ras_base()
 - Move patch 'Remove unnecessary CXL Endpoint handling helper functions' to front
 - Update description for 'CXL/PCI: Introduce CXL Port protocol error
   handlers' with why state is not used to determine handling
 - Introduce cxl_pci_drv_bound() and call from cxl_rch_handle_error_iter()

 Changes in v8 -> v9:
 - Updated reference counting to use pci_get_device()/pci_put_device() in
   cxl_disable_prot_errors()/cxl_enable_prot_errors
 - Refactored cxl_create_prot_err_info() to fix reference counting
 - Removed 'struct cxl_port' driver changes for error handler. Instead
   check for CXL device type (EP or Port device) and call handler
 - Make pcie_is_cxl() static inline in include/linux/linux.h
 - Remove NULL check in create_prot_err_info()
 - Change success return in cxl_ras_init() to use hardcoded 0
 - Changed 'struct work_struct cxl_prot_err_work' declaration to static
 - Change to use rate limited log with dev anchor in forward_cxl_error()
 - Refactored forward_cxl_error() to remove severity auto variable
 - Changed pci_aer_clear_nonfatal_status() to be static inline for
   !(CONFIG_PCIEAER)
 - Renamed merge_result() to be cxl_merge_result()
 - Removed 'ue' condition in cxl_error_detected()
 - Updated 2nd parameter in call to __cxl_handle_cor_ras()/__cxl_handle_ras()
   in unify patch
 - Added log message for failure while assigning interrupt disable callback
 - Updated pci_aer_mask_internal_errors() to use pci_clear_and_set_config_dword()
 - Simplified patch titles for clarity
 - Moved CXL error interrupt disabling into cxl/core/port.c with CXL Port
   teardown
 - Updated 'struct cxl_port_err_info' to only contain sbdf and severity.
   Removed everything else.
 - Added pdev and CXL device get_device()/put_device() before calling handlers
 
 Changes in v7 -> v8:
 [Dan] Use kfifo. Move handling to CXL driver. AER forwards error to CXL
 driver
 [Dan] Add device reference incrementors where needed throughout
 [Dan] Initiate CXL Port RAS init from Switch Port and Endpoint Port init 
 [Dan] Combine CXL Port and CXL Endpoint trace routine
 [Dan] Introduce aer_info::is_cxl. Use to indicate CXL or PCI errors
 [Jonathan] Add serial number for all devices in trace
 [DaveJ] Move find_cxl_port() change into patch using it
 [Terry] Move CXL Port RAS init into cxl/port.c
 [Terry] Moved kfifo functions into cxl/core/ras.c 
 
 Changes in v6 -> v7:
 [Terry] Move updated trace routine call to later patch. Was causing build
 error.
 
 Changes in v5 -> v6:
 [Ira] Move pcie_is_cxl(dev) define to an inline function
 [Ira] Update returning value from pcie_is_cxl_port() to bool w/o cast
 [Ira] Change cxl_report_error_detected() cleanup to return correct bool
 [Ira] Introduce and use PCI_ERS_RESULT_PANIC
 [Ira] Reuse comment for PCIe and CXL recovery paths
 [Jonathan] Add type check in cxl_handle_cor_ras() and cxl_handle_ras()
 [Jonathan] cxl_uport/dport_init_ras_reporting(), added a mutex.
 [Jonathan] Add logging example to patches updating trace output
 [Jonathan] Make parameter 'const' to eliminate need for cast in match_uport()
 [Jonathan] Use __free() in cxl_pci_port_ras()
 [Terry] Add patch to log the PCIe SBDF along with CXL device name
 [Terry] Add patch to handle CXL endpoint and RCH DP errors as CXL errors
 [Terry] Remove patch w USP UCE fatal support @ aer_get_device_error_info()
 [Terry] Rebase to cxl/next commit 5585e342e8d3 ("cxl/memdev: Remove unused partition values")
 [Gregory] Pre-initialize pointer to NULL in cxl_pci_port_ras()
 [Gregory] Move AER driver bus name detection to a static function

 Changes in v4 -> v5:
 [Alejandro] Refactor cxl_walk_bridge to simplify 'status' variable usage
 [Alejandro] Add WARN_ONCE() in __cxl_handle_ras() and cxl_handle_cor_ras()
 [Ming] Remove unnecessary NULL check in cxl_pci_port_ras()
 [Terry] Add failure check for call to to_cxl_port() in cxl_pci_port_ras()
 [Ming] Use port->dev for call to devm_add_action_or_reset() in
 cxl_dport_init_ras_reporting() and cxl_uport_init_ras_reporting()
 [Jonathan] Use get_device()/put_device() to prevent race condition in
 cxl_clear_port_error_handlers() and cxl_clear_port_error_handlers()
 [Terry] Commit message cleanup. Capitalize keywords from CXL and PCI
 specifications

 Changes in v3 -> v4:
 [Lukas] Capitalize PCIe and CXL device names as in specifications
 [Lukas] Move call to pcie_is_cxl() into cxl_port_devsec()
 [Lukas] Correct namespace spelling
 [Lukas] Removed export from pcie_is_cxl_port()
 [Lukas] Simplify 'if' blocks in cxl_handle_error()
 [Lukas] Change panic message to remove redundant 'panic' text
 [Ming] Update to call cxl_dport_init_ras_reporting() in RCH case
 [lkp@intel] 'host' parameter is already removed. Remove parameter description too.
 [Terry] Added field description for cxl_err_handlers in pci.h comment block

 Changes in v1 -> v2:
 [Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
 [Jonathan] Update DSP map patch description
 [Jonathan] Update cxl_pci_port_ras() to check for NULL port
 [Jonathan] Don't call handler before handler port changes are present (patch order)
 [Bjorn] Fix linebreak in cover sheet URL
 [Bjorn] Remove timestamps from test logs in cover sheet
 [Bjorn] Retitle AER commits to use "PCI/AER:"
 [Bjorn] Retitle patch#3 to use renaming instead of refactoring
 [Bjorn] Fix base commit-id on cover sheet
 [Bjorn] Add VH spec reference/citation
 [Terry] Removed last 2 patches to enable internal errors. Is not needed
 because internal errors are enabled in AER driver.
 [Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
 [Dan] Use kernel panic in CXL recovery
 [Dan] cxl_port_hndlrs -> cxl_port_error_handlers

Dave Jiang (1):
  cxl: Remove CXL VH handling in CONFIG_PCIEAER_CXL conditional blocks
    from core/pci.c

Terry Bowman (24):
  CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
  PCI/CXL: Introduce pcie_is_cxl()
  cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
  cxl/pci: Remove unnecessary CXL RCH handling helper functions
  cxl: Move CXL driver's RCH error handling into core/ras_rch.c
  CXL/AER: Replace device_lock() in cxl_rch_handle_error_iter() with
    guard() lock
  CXL/AER: Move AER drivers RCH error handling into pcie/aer_cxl_rch.c
  PCI/AER: Report CXL or PCIe bus error type in trace logging
  cxl/pci: Update RAS handler interfaces to also support CXL Ports
  cxl/pci: Log message if RAS registers are unmapped
  cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
  cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
  cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
  CXL/PCI: Introduce PCI_ERS_RESULT_PANIC
  CXL/AER: Introduce pcie/aer_cxl_vh.c in AER driver for forwarding CXL
    errors
  cxl: Introduce cxl_pci_drv_bound() to check for bound driver
  cxl: Change CXL handlers to use guard() instead of scoped_guard()
  cxl/pci: Introduce CXL protocol error handlers for Endpoints
  CXL/PCI: Introduce CXL Port protocol error handlers
  PCI/AER: Dequeue forwarded CXL error
  CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
  CXL/PCI: Introduce CXL uncorrectable protocol error recovery
  CXL/PCI: Enable CXL protocol errors during CXL Port probe
  CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup

 Documentation/PCI/pci-error-recovery.rst |   6 +
 drivers/cxl/Kconfig                      |  17 +-
 drivers/cxl/Makefile                     |   2 -
 drivers/cxl/core/Makefile                |   4 +-
 drivers/cxl/core/core.h                  |  66 +++
 drivers/cxl/core/pci.c                   | 380 ++-------------
 drivers/cxl/{pci.c => core/pci_drv.c}    |  32 +-
 drivers/cxl/core/port.c                  |  28 +-
 drivers/cxl/core/ras.c                   | 572 ++++++++++++++++++++++-
 drivers/cxl/core/ras_rch.c               | 120 +++++
 drivers/cxl/core/regs.c                  |  12 +-
 drivers/cxl/core/trace.h                 |  68 +--
 drivers/cxl/cxl.h                        |  10 +-
 drivers/cxl/cxlpci.h                     |  68 +--
 drivers/cxl/mem.c                        |   3 +-
 drivers/pci/pci.c                        |   5 +-
 drivers/pci/pci.h                        |  59 ++-
 drivers/pci/pcie/Makefile                |   2 +
 drivers/pci/pcie/aer.c                   | 155 ++----
 drivers/pci/pcie/aer_cxl_rch.c           |  96 ++++
 drivers/pci/pcie/aer_cxl_vh.c            |  98 ++++
 drivers/pci/pcie/err.c                   |  14 +-
 drivers/pci/probe.c                      |  29 ++
 include/linux/aer.h                      |  29 ++
 include/linux/pci.h                      |  18 +
 include/ras/ras_event.h                  |   9 +-
 include/uapi/linux/pci_regs.h            |  63 ++-
 tools/testing/cxl/Kbuild                 |   4 +-
 28 files changed, 1320 insertions(+), 649 deletions(-)
 rename drivers/cxl/{pci.c => core/pci_drv.c} (98%)
 create mode 100644 drivers/cxl/core/ras_rch.c
 create mode 100644 drivers/pci/pcie/aer_cxl_rch.c
 create mode 100644 drivers/pci/pcie/aer_cxl_vh.c


base-commit: 211ddde0823f1442e4ad052a2f30f050145ccada
-- 
2.34.1



* [RESEND v13 01/25] CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
@ 2025-11-04 17:02 ` Terry Bowman
  2025-11-04 17:50   ` Jonathan Cameron
                     ` (2 more replies)
  2025-11-04 17:02 ` [RESEND v13 02/25] PCI/CXL: Introduce pcie_is_cxl() Terry Bowman
                   ` (24 subsequent siblings)
  25 siblings, 3 replies; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:02 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

The CXL DVSECs are currently defined in drivers/cxl/cxlpci.h. These are not
accessible to other subsystems. Move these to uapi/linux/pci_regs.h.

Change DVSEC name formatting to follow the existing PCI format in
pci_regs.h. The current format uses CXL_DVSEC_XYZ and the CXL defines must
be changed to be PCI_DVSEC_CXL_XYZ to match existing pci_regs.h. Leave
PCI_DVSEC_CXL_PORT* defines as-is because they are already defined and may
be in use by userspace application(s).

Update existing usage to match the name change.

Update the inline documentation to refer to latest CXL spec version.
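
As an illustration only (not part of the patch), the field decoding that the
renamed masks support can be sketched in standalone C. The EX_* names below
are hypothetical local stand-ins so the sketch builds without the updated
header; the bit positions are copied from the definitions this patch adds to
pci_regs.h, and the size computation mirrors the high/low dword combination
in cxl_dvsec_rr_decode().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the uapi _BITUL()/__GENMASK() helpers. */
#define EX_BITUL(n)	(1UL << (n))
#define EX_GENMASK(h, l) \
	((~0UL << (l)) & (~0UL >> (sizeof(unsigned long) * 8 - 1 - (h))))

/* Bit positions as listed in this patch for the CXL Device DVSEC. */
#define EX_DVSEC_CXL_MEM_CAPABLE	EX_BITUL(2)
#define EX_DVSEC_CXL_HDM_COUNT_MASK	EX_GENMASK(5, 4)
#define EX_DVSEC_CXL_MEM_SIZE_LOW_MASK	EX_GENMASK(31, 28)

/* Compile-time sanity checks on the stand-in masks. */
static_assert(EX_DVSEC_CXL_HDM_COUNT_MASK == 0x30UL, "HDM count is bits 5:4");
static_assert(EX_DVSEC_CXL_MEM_SIZE_LOW_MASK == 0xF0000000UL,
	      "size low is bits 31:28");

/* Extract the HDM count field from a capability register value. */
static unsigned int ex_hdm_count(uint16_t cap)
{
	return (cap & EX_DVSEC_CXL_HDM_COUNT_MASK) >> 4;
}

/* Return 1 if the Mem_Capable bit is set in the capability value. */
static int ex_mem_capable(uint16_t cap)
{
	return (cap & EX_DVSEC_CXL_MEM_CAPABLE) != 0;
}

/* Combine the range size high dword with the masked low dword. */
static uint64_t ex_range_size(uint32_t size_high, uint32_t size_low)
{
	return ((uint64_t)size_high << 32) |
	       (size_low & EX_DVSEC_CXL_MEM_SIZE_LOW_MASK);
}
```

A caller would pass in values already read from config space, for example
ex_hdm_count(cap) on the word read at the capability offset. How the values
are read is outside the scope of this sketch.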

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

----

Changes in v12->v13:
- Add Dave Jiang's reviewed-by
- Remove changes to existing PCI_DVSEC_CXL_PORT* defines. Update commit
  message. (Jonathan)

Changes in v11 -> v12:
- Change formatting to be same as existing definitions
- Change GENMASK() -> __GENMASK() and BIT() to _BITUL()

Changes in v10 -> v11:
- New commit
---
 drivers/cxl/core/pci.c        | 62 +++++++++++++++++-----------------
 drivers/cxl/core/regs.c       | 12 +++----
 drivers/cxl/cxlpci.h          | 53 -----------------------------
 drivers/cxl/pci.c             |  2 +-
 drivers/pci/pci.c             |  4 ++-
 include/uapi/linux/pci_regs.h | 63 ++++++++++++++++++++++++++++++++---
 6 files changed, 100 insertions(+), 96 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 18825e1505d6..cbc8defa6848 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -158,19 +158,19 @@ static int cxl_dvsec_mem_range_valid(struct cxl_dev_state *cxlds, int id)
 	int rc, i;
 	u32 temp;
 
-	if (id > CXL_DVSEC_RANGE_MAX)
+	if (id > PCI_DVSEC_CXL_RANGE_MAX)
 		return -EINVAL;
 
 	/* Check MEM INFO VALID bit first, give up after 1s */
 	i = 1;
 	do {
 		rc = pci_read_config_dword(pdev,
-					   d + CXL_DVSEC_RANGE_SIZE_LOW(id),
+					   d + PCI_DVSEC_CXL_RANGE_SIZE_LOW(id),
 					   &temp);
 		if (rc)
 			return rc;
 
-		valid = FIELD_GET(CXL_DVSEC_MEM_INFO_VALID, temp);
+		valid = FIELD_GET(PCI_DVSEC_CXL_MEM_INFO_VALID, temp);
 		if (valid)
 			break;
 		msleep(1000);
@@ -194,17 +194,17 @@ static int cxl_dvsec_mem_range_active(struct cxl_dev_state *cxlds, int id)
 	int rc, i;
 	u32 temp;
 
-	if (id > CXL_DVSEC_RANGE_MAX)
+	if (id > PCI_DVSEC_CXL_RANGE_MAX)
 		return -EINVAL;
 
 	/* Check MEM ACTIVE bit, up to 60s timeout by default */
 	for (i = media_ready_timeout; i; i--) {
 		rc = pci_read_config_dword(
-			pdev, d + CXL_DVSEC_RANGE_SIZE_LOW(id), &temp);
+			pdev, d + PCI_DVSEC_CXL_RANGE_SIZE_LOW(id), &temp);
 		if (rc)
 			return rc;
 
-		active = FIELD_GET(CXL_DVSEC_MEM_ACTIVE, temp);
+		active = FIELD_GET(PCI_DVSEC_CXL_MEM_ACTIVE, temp);
 		if (active)
 			break;
 		msleep(1000);
@@ -233,11 +233,11 @@ int cxl_await_media_ready(struct cxl_dev_state *cxlds)
 	u16 cap;
 
 	rc = pci_read_config_word(pdev,
-				  d + CXL_DVSEC_CAP_OFFSET, &cap);
+				  d + PCI_DVSEC_CXL_CAP_OFFSET, &cap);
 	if (rc)
 		return rc;
 
-	hdm_count = FIELD_GET(CXL_DVSEC_HDM_COUNT_MASK, cap);
+	hdm_count = FIELD_GET(PCI_DVSEC_CXL_HDM_COUNT_MASK, cap);
 	for (i = 0; i < hdm_count; i++) {
 		rc = cxl_dvsec_mem_range_valid(cxlds, i);
 		if (rc)
@@ -265,16 +265,16 @@ static int cxl_set_mem_enable(struct cxl_dev_state *cxlds, u16 val)
 	u16 ctrl;
 	int rc;
 
-	rc = pci_read_config_word(pdev, d + CXL_DVSEC_CTRL_OFFSET, &ctrl);
+	rc = pci_read_config_word(pdev, d + PCI_DVSEC_CXL_CTRL_OFFSET, &ctrl);
 	if (rc < 0)
 		return rc;
 
-	if ((ctrl & CXL_DVSEC_MEM_ENABLE) == val)
+	if ((ctrl & PCI_DVSEC_CXL_MEM_ENABLE) == val)
 		return 1;
-	ctrl &= ~CXL_DVSEC_MEM_ENABLE;
+	ctrl &= ~PCI_DVSEC_CXL_MEM_ENABLE;
 	ctrl |= val;
 
-	rc = pci_write_config_word(pdev, d + CXL_DVSEC_CTRL_OFFSET, ctrl);
+	rc = pci_write_config_word(pdev, d + PCI_DVSEC_CXL_CTRL_OFFSET, ctrl);
 	if (rc < 0)
 		return rc;
 
@@ -290,7 +290,7 @@ static int devm_cxl_enable_mem(struct device *host, struct cxl_dev_state *cxlds)
 {
 	int rc;
 
-	rc = cxl_set_mem_enable(cxlds, CXL_DVSEC_MEM_ENABLE);
+	rc = cxl_set_mem_enable(cxlds, PCI_DVSEC_CXL_MEM_ENABLE);
 	if (rc < 0)
 		return rc;
 	if (rc > 0)
@@ -352,11 +352,11 @@ int cxl_dvsec_rr_decode(struct cxl_dev_state *cxlds,
 		return -ENXIO;
 	}
 
-	rc = pci_read_config_word(pdev, d + CXL_DVSEC_CAP_OFFSET, &cap);
+	rc = pci_read_config_word(pdev, d + PCI_DVSEC_CXL_CAP_OFFSET, &cap);
 	if (rc)
 		return rc;
 
-	if (!(cap & CXL_DVSEC_MEM_CAPABLE)) {
+	if (!(cap & PCI_DVSEC_CXL_MEM_CAPABLE)) {
 		dev_dbg(dev, "Not MEM Capable\n");
 		return -ENXIO;
 	}
@@ -367,7 +367,7 @@ int cxl_dvsec_rr_decode(struct cxl_dev_state *cxlds,
 	 * driver is for a spec defined class code which must be CXL.mem
 	 * capable, there is no point in continuing to enable CXL.mem.
 	 */
-	hdm_count = FIELD_GET(CXL_DVSEC_HDM_COUNT_MASK, cap);
+	hdm_count = FIELD_GET(PCI_DVSEC_CXL_HDM_COUNT_MASK, cap);
 	if (!hdm_count || hdm_count > 2)
 		return -EINVAL;
 
@@ -376,11 +376,11 @@ int cxl_dvsec_rr_decode(struct cxl_dev_state *cxlds,
 	 * disabled, and they will remain moot after the HDM Decoder
 	 * capability is enabled.
 	 */
-	rc = pci_read_config_word(pdev, d + CXL_DVSEC_CTRL_OFFSET, &ctrl);
+	rc = pci_read_config_word(pdev, d + PCI_DVSEC_CXL_CTRL_OFFSET, &ctrl);
 	if (rc)
 		return rc;
 
-	info->mem_enabled = FIELD_GET(CXL_DVSEC_MEM_ENABLE, ctrl);
+	info->mem_enabled = FIELD_GET(PCI_DVSEC_CXL_MEM_ENABLE, ctrl);
 	if (!info->mem_enabled)
 		return 0;
 
@@ -393,35 +393,35 @@ int cxl_dvsec_rr_decode(struct cxl_dev_state *cxlds,
 			return rc;
 
 		rc = pci_read_config_dword(
-			pdev, d + CXL_DVSEC_RANGE_SIZE_HIGH(i), &temp);
+			pdev, d + PCI_DVSEC_CXL_RANGE_SIZE_HIGH(i), &temp);
 		if (rc)
 			return rc;
 
 		size = (u64)temp << 32;
 
 		rc = pci_read_config_dword(
-			pdev, d + CXL_DVSEC_RANGE_SIZE_LOW(i), &temp);
+			pdev, d + PCI_DVSEC_CXL_RANGE_SIZE_LOW(i), &temp);
 		if (rc)
 			return rc;
 
-		size |= temp & CXL_DVSEC_MEM_SIZE_LOW_MASK;
+		size |= temp & PCI_DVSEC_CXL_MEM_SIZE_LOW_MASK;
 		if (!size) {
 			continue;
 		}
 
 		rc = pci_read_config_dword(
-			pdev, d + CXL_DVSEC_RANGE_BASE_HIGH(i), &temp);
+			pdev, d + PCI_DVSEC_CXL_RANGE_BASE_HIGH(i), &temp);
 		if (rc)
 			return rc;
 
 		base = (u64)temp << 32;
 
 		rc = pci_read_config_dword(
-			pdev, d + CXL_DVSEC_RANGE_BASE_LOW(i), &temp);
+			pdev, d + PCI_DVSEC_CXL_RANGE_BASE_LOW(i), &temp);
 		if (rc)
 			return rc;
 
-		base |= temp & CXL_DVSEC_MEM_BASE_LOW_MASK;
+		base |= temp & PCI_DVSEC_CXL_MEM_BASE_LOW_MASK;
 
 		info->dvsec_range[ranges++] = (struct range) {
 			.start = base,
@@ -1147,7 +1147,7 @@ u16 cxl_gpf_get_dvsec(struct device *dev)
 		is_port = false;
 
 	dvsec = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
-			is_port ? CXL_DVSEC_PORT_GPF : CXL_DVSEC_DEVICE_GPF);
+			is_port ? PCI_DVSEC_CXL_PORT_GPF : PCI_DVSEC_CXL_DEVICE_GPF);
 	if (!dvsec)
 		dev_warn(dev, "%s GPF DVSEC not present\n",
 			 is_port ? "Port" : "Device");
@@ -1163,14 +1163,14 @@ static int update_gpf_port_dvsec(struct pci_dev *pdev, int dvsec, int phase)
 
 	switch (phase) {
 	case 1:
-		offset = CXL_DVSEC_PORT_GPF_PHASE_1_CONTROL_OFFSET;
-		base = CXL_DVSEC_PORT_GPF_PHASE_1_TMO_BASE_MASK;
-		scale = CXL_DVSEC_PORT_GPF_PHASE_1_TMO_SCALE_MASK;
+		offset = PCI_DVSEC_CXL_PORT_GPF_PHASE_1_CONTROL_OFFSET;
+		base = PCI_DVSEC_CXL_PORT_GPF_PHASE_1_TMO_BASE_MASK;
+		scale = PCI_DVSEC_CXL_PORT_GPF_PHASE_1_TMO_SCALE_MASK;
 		break;
 	case 2:
-		offset = CXL_DVSEC_PORT_GPF_PHASE_2_CONTROL_OFFSET;
-		base = CXL_DVSEC_PORT_GPF_PHASE_2_TMO_BASE_MASK;
-		scale = CXL_DVSEC_PORT_GPF_PHASE_2_TMO_SCALE_MASK;
+		offset = PCI_DVSEC_CXL_PORT_GPF_PHASE_2_CONTROL_OFFSET;
+		base = PCI_DVSEC_CXL_PORT_GPF_PHASE_2_TMO_BASE_MASK;
+		scale = PCI_DVSEC_CXL_PORT_GPF_PHASE_2_TMO_SCALE_MASK;
 		break;
 	default:
 		return -EINVAL;
diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index 5ca7b0eed568..fb70ffbba72d 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -271,10 +271,10 @@ EXPORT_SYMBOL_NS_GPL(cxl_map_device_regs, "CXL");
 static bool cxl_decode_regblock(struct pci_dev *pdev, u32 reg_lo, u32 reg_hi,
 				struct cxl_register_map *map)
 {
-	u8 reg_type = FIELD_GET(CXL_DVSEC_REG_LOCATOR_BLOCK_ID_MASK, reg_lo);
-	int bar = FIELD_GET(CXL_DVSEC_REG_LOCATOR_BIR_MASK, reg_lo);
+	u8 reg_type = FIELD_GET(PCI_DVSEC_CXL_REG_LOCATOR_BLOCK_ID_MASK, reg_lo);
+	int bar = FIELD_GET(PCI_DVSEC_CXL_REG_LOCATOR_BIR_MASK, reg_lo);
 	u64 offset = ((u64)reg_hi << 32) |
-		     (reg_lo & CXL_DVSEC_REG_LOCATOR_BLOCK_OFF_LOW_MASK);
+		     (reg_lo & PCI_DVSEC_CXL_REG_LOCATOR_BLOCK_OFF_LOW_MASK);
 
 	if (offset > pci_resource_len(pdev, bar)) {
 		dev_warn(&pdev->dev,
@@ -311,15 +311,15 @@ static int __cxl_find_regblock_instance(struct pci_dev *pdev, enum cxl_regloc_ty
 	};
 
 	regloc = pci_find_dvsec_capability(pdev, PCI_VENDOR_ID_CXL,
-					   CXL_DVSEC_REG_LOCATOR);
+					   PCI_DVSEC_CXL_REG_LOCATOR);
 	if (!regloc)
 		return -ENXIO;
 
 	pci_read_config_dword(pdev, regloc + PCI_DVSEC_HEADER1, &regloc_size);
 	regloc_size = FIELD_GET(PCI_DVSEC_HEADER1_LENGTH_MASK, regloc_size);
 
-	regloc += CXL_DVSEC_REG_LOCATOR_BLOCK1_OFFSET;
-	regblocks = (regloc_size - CXL_DVSEC_REG_LOCATOR_BLOCK1_OFFSET) / 8;
+	regloc += PCI_DVSEC_CXL_REG_LOCATOR_BLOCK1_OFFSET;
+	regblocks = (regloc_size - PCI_DVSEC_CXL_REG_LOCATOR_BLOCK1_OFFSET) / 8;
 
 	for (i = 0; i < regblocks; i++, regloc += 8) {
 		u32 reg_lo, reg_hi;
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 7ae621e618e7..4985dbd90069 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -7,59 +7,6 @@
 
 #define CXL_MEMORY_PROGIF	0x10
 
-/*
- * See section 8.1 Configuration Space Registers in the CXL 2.0
- * Specification. Names are taken straight from the specification with "CXL" and
- * "DVSEC" redundancies removed. When obvious, abbreviations may be used.
- */
-#define PCI_DVSEC_HEADER1_LENGTH_MASK	GENMASK(31, 20)
-
-/* CXL 2.0 8.1.3: PCIe DVSEC for CXL Device */
-#define CXL_DVSEC_PCIE_DEVICE					0
-#define   CXL_DVSEC_CAP_OFFSET		0xA
-#define     CXL_DVSEC_MEM_CAPABLE	BIT(2)
-#define     CXL_DVSEC_HDM_COUNT_MASK	GENMASK(5, 4)
-#define   CXL_DVSEC_CTRL_OFFSET		0xC
-#define     CXL_DVSEC_MEM_ENABLE	BIT(2)
-#define   CXL_DVSEC_RANGE_SIZE_HIGH(i)	(0x18 + (i * 0x10))
-#define   CXL_DVSEC_RANGE_SIZE_LOW(i)	(0x1C + (i * 0x10))
-#define     CXL_DVSEC_MEM_INFO_VALID	BIT(0)
-#define     CXL_DVSEC_MEM_ACTIVE	BIT(1)
-#define     CXL_DVSEC_MEM_SIZE_LOW_MASK	GENMASK(31, 28)
-#define   CXL_DVSEC_RANGE_BASE_HIGH(i)	(0x20 + (i * 0x10))
-#define   CXL_DVSEC_RANGE_BASE_LOW(i)	(0x24 + (i * 0x10))
-#define     CXL_DVSEC_MEM_BASE_LOW_MASK	GENMASK(31, 28)
-
-#define CXL_DVSEC_RANGE_MAX		2
-
-/* CXL 2.0 8.1.4: Non-CXL Function Map DVSEC */
-#define CXL_DVSEC_FUNCTION_MAP					2
-
-/* CXL 2.0 8.1.5: CXL 2.0 Extensions DVSEC for Ports */
-#define CXL_DVSEC_PORT_EXTENSIONS				3
-
-/* CXL 2.0 8.1.6: GPF DVSEC for CXL Port */
-#define CXL_DVSEC_PORT_GPF					4
-#define   CXL_DVSEC_PORT_GPF_PHASE_1_CONTROL_OFFSET		0x0C
-#define     CXL_DVSEC_PORT_GPF_PHASE_1_TMO_BASE_MASK		GENMASK(3, 0)
-#define     CXL_DVSEC_PORT_GPF_PHASE_1_TMO_SCALE_MASK		GENMASK(11, 8)
-#define   CXL_DVSEC_PORT_GPF_PHASE_2_CONTROL_OFFSET		0xE
-#define     CXL_DVSEC_PORT_GPF_PHASE_2_TMO_BASE_MASK		GENMASK(3, 0)
-#define     CXL_DVSEC_PORT_GPF_PHASE_2_TMO_SCALE_MASK		GENMASK(11, 8)
-
-/* CXL 2.0 8.1.7: GPF DVSEC for CXL Device */
-#define CXL_DVSEC_DEVICE_GPF					5
-
-/* CXL 2.0 8.1.8: PCIe DVSEC for Flex Bus Port */
-#define CXL_DVSEC_PCIE_FLEXBUS_PORT				7
-
-/* CXL 2.0 8.1.9: Register Locator DVSEC */
-#define CXL_DVSEC_REG_LOCATOR					8
-#define   CXL_DVSEC_REG_LOCATOR_BLOCK1_OFFSET			0xC
-#define     CXL_DVSEC_REG_LOCATOR_BIR_MASK			GENMASK(2, 0)
-#define	    CXL_DVSEC_REG_LOCATOR_BLOCK_ID_MASK			GENMASK(15, 8)
-#define     CXL_DVSEC_REG_LOCATOR_BLOCK_OFF_LOW_MASK		GENMASK(31, 16)
-
 /*
  * NOTE: Currently all the functions which are enabled for CXL require their
  * vectors to be in the first 16.  Use this as the default max.
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index bd100ac31672..bd95be1f3d5c 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -933,7 +933,7 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	cxlds->rcd = is_cxl_restricted(pdev);
 	cxlds->serial = pci_get_dsn(pdev);
 	cxlds->cxl_dvsec = pci_find_dvsec_capability(
-		pdev, PCI_VENDOR_ID_CXL, CXL_DVSEC_PCIE_DEVICE);
+		pdev, PCI_VENDOR_ID_CXL, PCI_DVSEC_CXL_DEVICE);
 	if (!cxlds->cxl_dvsec)
 		dev_warn(&pdev->dev,
 			 "Device DVSEC not present, skip CXL.mem init\n");
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index b14dd064006c..53a49bb32514 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5002,7 +5002,9 @@ static bool cxl_sbr_masked(struct pci_dev *dev)
 	if (!dvsec)
 		return false;
 
-	rc = pci_read_config_word(dev, dvsec + PCI_DVSEC_CXL_PORT_CTL, &reg);
+	rc = pci_read_config_word(dev,
+				  dvsec + PCI_DVSEC_CXL_PORT_CTL,
+				  &reg);
 	if (rc || PCI_POSSIBLE_ERROR(reg))
 		return false;
 
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index 07e06aafec50..279b92f01d08 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -1244,9 +1244,64 @@
 /* Deprecated old name, replaced with PCI_DOE_DATA_OBJECT_DISC_RSP_3_TYPE */
 #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		PCI_DOE_DATA_OBJECT_DISC_RSP_3_TYPE
 
-/* Compute Express Link (CXL r3.1, sec 8.1.5) */
-#define PCI_DVSEC_CXL_PORT				3
-#define PCI_DVSEC_CXL_PORT_CTL				0x0c
-#define PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR		0x00000001
+/* Compute Express Link (CXL r3.2, sec 8.1)
+ *
+ * Note that CXL DVSEC id 3 and 7 to be ignored when the CXL link state
+ * is "disconnected" (CXL r3.2, sec 9.12.3). Re-enumerate these
+ * registers on downstream link-up events.
+ */
+
+#define PCI_DVSEC_HEADER1_LENGTH_MASK  __GENMASK(31, 20)
+
+/* CXL 3.2 8.1.3: PCIe DVSEC for CXL Device */
+#define PCI_DVSEC_CXL_DEVICE			0
+#define  PCI_DVSEC_CXL_CAP_OFFSET		0xA
+#define   PCI_DVSEC_CXL_MEM_CAPABLE		_BITUL(2)
+#define   PCI_DVSEC_CXL_HDM_COUNT_MASK		__GENMASK(5, 4)
+#define  PCI_DVSEC_CXL_CTRL_OFFSET		0xC
+#define   PCI_DVSEC_CXL_MEM_ENABLE		_BITUL(2)
+#define  PCI_DVSEC_CXL_RANGE_SIZE_HIGH(i)	(0x18 + (i * 0x10))
+#define  PCI_DVSEC_CXL_RANGE_SIZE_LOW(i)	(0x1C + (i * 0x10))
+#define   PCI_DVSEC_CXL_MEM_INFO_VALID		_BITUL(0)
+#define   PCI_DVSEC_CXL_MEM_ACTIVE		_BITUL(1)
+#define   PCI_DVSEC_CXL_MEM_SIZE_LOW_MASK	__GENMASK(31, 28)
+#define  PCI_DVSEC_CXL_RANGE_BASE_HIGH(i)	(0x20 + (i * 0x10))
+#define  PCI_DVSEC_CXL_RANGE_BASE_LOW(i)	(0x24 + (i * 0x10))
+#define   PCI_DVSEC_CXL_MEM_BASE_LOW_MASK	__GENMASK(31, 28)
+
+#define PCI_DVSEC_CXL_RANGE_MAX			2
+
+/* CXL 3.2 8.1.4: Non-CXL Function Map DVSEC */
+#define PCI_DVSEC_CXL_FUNCTION_MAP				2
+
+/* CXL 3.2 8.1.5: Extensions DVSEC for Ports */
+#define PCI_DVSEC_CXL_PORT					3
+#define   PCI_DVSEC_CXL_PORT_CTL				0x0c
+#define    PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR		0x00000001
+
+/* CXL 3.2 8.1.6: GPF DVSEC for CXL Port */
+#define PCI_DVSEC_CXL_PORT_GPF					4
+#define  PCI_DVSEC_CXL_PORT_GPF_PHASE_1_CONTROL_OFFSET		0x0C
+#define   PCI_DVSEC_CXL_PORT_GPF_PHASE_1_TMO_BASE_MASK		__GENMASK(3, 0)
+#define   PCI_DVSEC_CXL_PORT_GPF_PHASE_1_TMO_SCALE_MASK		__GENMASK(11, 8)
+#define  PCI_DVSEC_CXL_PORT_GPF_PHASE_2_CONTROL_OFFSET		0xE
+#define   PCI_DVSEC_CXL_PORT_GPF_PHASE_2_TMO_BASE_MASK		__GENMASK(3, 0)
+#define   PCI_DVSEC_CXL_PORT_GPF_PHASE_2_TMO_SCALE_MASK		__GENMASK(11, 8)
+
+/* CXL 3.2 8.1.7: GPF DVSEC for CXL Device */
+#define PCI_DVSEC_CXL_DEVICE_GPF				5
+
+/* CXL 3.2 8.1.8: PCIe DVSEC for Flex Bus Port */
+#define PCI_DVSEC_CXL_FLEXBUS_PORT				7
+#define  PCI_DVSEC_CXL_FLEXBUS_STATUS_OFFSET			0xE
+#define   PCI_DVSEC_CXL_FLEXBUS_STATUS_CACHE_MASK		_BITUL(0)
+#define   PCI_DVSEC_CXL_FLEXBUS_STATUS_MEM_MASK			_BITUL(2)
+
+/* CXL 3.2 8.1.9: Register Locator DVSEC */
+#define PCI_DVSEC_CXL_REG_LOCATOR				8
+#define  PCI_DVSEC_CXL_REG_LOCATOR_BLOCK1_OFFSET		0xC
+#define   PCI_DVSEC_CXL_REG_LOCATOR_BIR_MASK			__GENMASK(2, 0)
+#define   PCI_DVSEC_CXL_REG_LOCATOR_BLOCK_ID_MASK		__GENMASK(15, 8)
+#define   PCI_DVSEC_CXL_REG_LOCATOR_BLOCK_OFF_LOW_MASK		__GENMASK(31, 16)
 
 #endif /* LINUX_PCI_REGS_H */
-- 
2.34.1



* [RESEND v13 02/25] PCI/CXL: Introduce pcie_is_cxl()
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
  2025-11-04 17:02 ` [RESEND v13 01/25] CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h Terry Bowman
@ 2025-11-04 17:02 ` Terry Bowman
  2025-11-04 17:52   ` Jonathan Cameron
                     ` (2 more replies)
  2025-11-04 17:02 ` [RESEND v13 03/25] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
                   ` (23 subsequent siblings)
  25 siblings, 3 replies; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:02 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

CXL and AER drivers need the ability to identify CXL devices.

Introduce set_pcie_cxl() with logic checking for CXL.mem or CXL.cache
status in the CXL Flexbus DVSEC status register. The CXL Flexbus DVSEC
presence is used because it is required for all the CXL PCIe devices.[1]

Add boolean 'struct pci_dev::is_cxl' to cache the CXL.cache and CXL.mem
status.

In the case the device is an EP or USP, call set_pcie_cxl() on behalf of
the parent downstream device. Once a device is created there is a
possibility the parent training or CXL state was updated as well. This
will make certain the correct parent CXL state is cached.

Add function pcie_is_cxl() to return 'struct pci_dev::is_cxl'.

[1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
    Capability (DVSEC) ID Assignment, Table 8-2
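
As an illustration only (not part of the patch), the status check described
above can be sketched in standalone C. The EX_* and ex_* names are
hypothetical; the bit positions follow the Flexbus status masks added to
pci_regs.h in patch 01. The sketch assumes a device without the Flexbus
DVSEC is reported as not CXL, which corresponds to is_cxl keeping its zero
default.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the Flexbus status bits from patch 01. */
#define EX_FLEXBUS_STATUS_CACHE	(1u << 0)
#define EX_FLEXBUS_STATUS_MEM	(1u << 2)

static_assert((EX_FLEXBUS_STATUS_CACHE | EX_FLEXBUS_STATUS_MEM) == 0x5,
	      "cache is bit 0, mem is bit 2");

/*
 * Decide whether a device counts as CXL from its Flexbus status word.
 * has_flexbus_dvsec models the result of the DVSEC lookup; status is the
 * 16-bit value read from the Flexbus status register.
 */
static bool ex_status_is_cxl(bool has_flexbus_dvsec, uint16_t status)
{
	if (!has_flexbus_dvsec)
		return false;

	/* Either CXL.cache or CXL.mem being active is sufficient. */
	return (status & (EX_FLEXBUS_STATUS_CACHE | EX_FLEXBUS_STATUS_MEM)) != 0;
}
```

The sketch covers only the bit test. The parent update for EP and USP
devices is not modeled here.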

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Alejandro Lucero <alucerop@amd.com>
Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>

---

Changes in v12->v13:
- Add Ben's "reviewed-by"

Changes in v11->v12:
- Add review-by for Alejandro
- Add comment in set_pcie_cxl() explaining why updating parent status.

Changes in v10->v11:
- Amend set_pcie_cxl() to check for Upstream Port's and EP's parent
  downstream port by calling set_pcie_cxl(). (Dan)
- Retitle patch: 'Add' -> 'Introduce'
- Add check for CXL.mem and CXL.cache (Alejandro, Dan)
---
 drivers/pci/probe.c | 29 +++++++++++++++++++++++++++++
 include/linux/pci.h |  6 ++++++
 2 files changed, 35 insertions(+)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 0ce98e18b5a8..63124651f865 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1709,6 +1709,33 @@ static void set_pcie_thunderbolt(struct pci_dev *dev)
 		dev->is_thunderbolt = 1;
 }
 
+static void set_pcie_cxl(struct pci_dev *dev)
+{
+	struct pci_dev *parent;
+	u16 dvsec = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
+					      PCI_DVSEC_CXL_FLEXBUS_PORT);
+	if (dvsec) {
+		u16 cap;
+
+		pci_read_config_word(dev, dvsec + PCI_DVSEC_CXL_FLEXBUS_STATUS_OFFSET, &cap);
+
+		dev->is_cxl = FIELD_GET(PCI_DVSEC_CXL_FLEXBUS_STATUS_CACHE_MASK, cap) ||
+			FIELD_GET(PCI_DVSEC_CXL_FLEXBUS_STATUS_MEM_MASK, cap);
+	}
+
+	if (!pci_is_pcie(dev) ||
+	    !(pci_pcie_type(dev) == PCI_EXP_TYPE_ENDPOINT ||
+	      pci_pcie_type(dev) == PCI_EXP_TYPE_UPSTREAM))
+		return;
+
+	/*
+	 * Update parent's CXL state because alternate protocol training
+	 * may have changed
+	 */
+	parent = pci_upstream_bridge(dev);
+	set_pcie_cxl(parent);
+}
+
 static void set_pcie_untrusted(struct pci_dev *dev)
 {
 	struct pci_dev *parent = pci_upstream_bridge(dev);
@@ -2039,6 +2066,8 @@ int pci_setup_device(struct pci_dev *dev)
 	/* Need to have dev->cfg_size ready */
 	set_pcie_thunderbolt(dev);
 
+	set_pcie_cxl(dev);
+
 	set_pcie_untrusted(dev);
 
 	if (pci_is_pcie(dev))
diff --git a/include/linux/pci.h b/include/linux/pci.h
index d1fdf81fbe1e..5c4759078d2f 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -460,6 +460,7 @@ struct pci_dev {
 	unsigned int	is_pciehp:1;
 	unsigned int	shpc_managed:1;		/* SHPC owned by shpchp */
 	unsigned int	is_thunderbolt:1;	/* Thunderbolt controller */
+	unsigned int	is_cxl:1;               /* Compute Express Link (CXL) */
 	/*
 	 * Devices marked being untrusted are the ones that can potentially
 	 * execute DMA attacks and similar. They are typically connected
@@ -766,6 +767,11 @@ static inline bool pci_is_display(struct pci_dev *pdev)
 	return (pdev->class >> 16) == PCI_BASE_CLASS_DISPLAY;
 }
 
+static inline bool pcie_is_cxl(struct pci_dev *pci_dev)
+{
+	return pci_dev->is_cxl;
+}
+
 #define for_each_pci_bridge(dev, bus)				\
 	list_for_each_entry(dev, &bus->devices, bus_list)	\
 		if (!pci_is_bridge(dev)) {} else
-- 
2.34.1



* [RESEND v13 03/25] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
  2025-11-04 17:02 ` [RESEND v13 01/25] CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h Terry Bowman
  2025-11-04 17:02 ` [RESEND v13 02/25] PCI/CXL: Introduce pcie_is_cxl() Terry Bowman
@ 2025-11-04 17:02 ` Terry Bowman
  2025-11-04 17:53   ` Jonathan Cameron
  2025-11-19  3:20   ` dan.j.williams
  2025-11-04 17:02 ` [RESEND v13 04/25] cxl/pci: Remove unnecessary CXL RCH " Terry Bowman
                   ` (22 subsequent siblings)
  25 siblings, 2 replies; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:02 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

The CXL driver's cxl_handle_endpoint_cor_ras()/cxl_handle_endpoint_ras()
are unnecessary helper functions used only for Endpoints. Remove these
functions as they are not common for all CXL devices and do not provide
value for EP handling.

Rename __cxl_handle_ras() to cxl_handle_ras() and __cxl_handle_cor_ras()
to cxl_handle_cor_ras().

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

Changes in v12->v13:
- None

Changes in v11->v12:
- Added Dave Jiang's review by
- Moved to front of series

Changes in v10->v11:
- None
---
 drivers/cxl/core/pci.c | 26 ++++++++------------------
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index cbc8defa6848..3ac90ff6e3d3 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -711,8 +711,8 @@ void read_cdat_data(struct cxl_port *port)
 }
 EXPORT_SYMBOL_NS_GPL(read_cdat_data, "CXL");
 
-static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
-				 void __iomem *ras_base)
+static void cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
+			       void __iomem *ras_base)
 {
 	void __iomem *addr;
 	u32 status;
@@ -728,11 +728,6 @@ static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
 	}
 }
 
-static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
-{
-	return __cxl_handle_cor_ras(cxlds, cxlds->regs.ras);
-}
-
 /* CXL spec rev3.0 8.2.4.16.1 */
 static void header_log_copy(void __iomem *ras_base, u32 *log)
 {
@@ -754,8 +749,8 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
  * Log the state of the RAS status registers and prepare them to log the
  * next error status. Return 1 if reset needed.
  */
-static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
-				  void __iomem *ras_base)
+static bool cxl_handle_ras(struct cxl_dev_state *cxlds,
+			   void __iomem *ras_base)
 {
 	u32 hl[CXL_HEADERLOG_SIZE_U32];
 	void __iomem *addr;
@@ -788,11 +783,6 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
 	return true;
 }
 
-static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
-{
-	return __cxl_handle_ras(cxlds, cxlds->regs.ras);
-}
-
 #ifdef CONFIG_PCIEAER_CXL
 
 static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
@@ -871,13 +861,13 @@ EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
 static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
 					  struct cxl_dport *dport)
 {
-	return __cxl_handle_cor_ras(cxlds, dport->regs.ras);
+	return cxl_handle_cor_ras(cxlds, dport->regs.ras);
 }
 
 static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
 				       struct cxl_dport *dport)
 {
-	return __cxl_handle_ras(cxlds, dport->regs.ras);
+	return cxl_handle_ras(cxlds, dport->regs.ras);
 }
 
 /*
@@ -974,7 +964,7 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
 		if (cxlds->rcd)
 			cxl_handle_rdport_errors(cxlds);
 
-		cxl_handle_endpoint_cor_ras(cxlds);
+		cxl_handle_cor_ras(cxlds, cxlds->regs.ras);
 	}
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
@@ -1003,7 +993,7 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 		 * chance the situation is recoverable dump the status of the RAS
 		 * capability registers and bounce the active state of the memdev.
 		 */
-		ue = cxl_handle_endpoint_ras(cxlds);
+		ue = cxl_handle_ras(cxlds, cxlds->regs.ras);
 	}
 
 
-- 
2.34.1



* [RESEND v13 04/25] cxl/pci: Remove unnecessary CXL RCH handling helper functions
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (2 preceding siblings ...)
  2025-11-04 17:02 ` [RESEND v13 03/25] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
@ 2025-11-04 17:02 ` Terry Bowman
  2025-11-19  3:20   ` dan.j.williams
  2025-11-04 17:02 ` [RESEND v13 05/25] cxl: Remove CXL VH handling in CONFIG_PCIEAER_CXL conditional blocks from core/pci.c Terry Bowman
                   ` (21 subsequent siblings)
  25 siblings, 1 reply; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:02 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

cxl_handle_rdport_cor_ras() and cxl_handle_rdport_ras() are specific
to Restricted CXL Host (RCH) handling. Improve readability and
maintainability by replacing these and instead using the common
cxl_handle_cor_ras() and cxl_handle_ras() functions.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Alejandro Lucero <alucerop@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

---

Changes in v12->v13:
- None

Changes in v11->v12:
- Add reviewed-by for Alejandro & Dave Jiang
- Moved to front of series

Changes in v10->v11:
- New patch
---
 drivers/cxl/core/pci.c | 16 ++--------------
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 3ac90ff6e3d3..a0f53a20fa61 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -858,18 +858,6 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
 
-static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
-					  struct cxl_dport *dport)
-{
-	return cxl_handle_cor_ras(cxlds, dport->regs.ras);
-}
-
-static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
-				       struct cxl_dport *dport)
-{
-	return cxl_handle_ras(cxlds, dport->regs.ras);
-}
-
 /*
  * Copy the AER capability registers using 32 bit read accesses.
  * This is necessary because RCRB AER capability is MMIO mapped. Clear the
@@ -939,9 +927,9 @@ static void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds)
 	pci_print_aer(pdev, severity, &aer_regs);
 
 	if (severity == AER_CORRECTABLE)
-		cxl_handle_rdport_cor_ras(cxlds, dport);
+		cxl_handle_cor_ras(cxlds, dport->regs.ras);
 	else
-		cxl_handle_rdport_ras(cxlds, dport);
+		cxl_handle_ras(cxlds, dport->regs.ras);
 }
 
 #else
-- 
2.34.1



* [RESEND v13 05/25] cxl: Remove CXL VH handling in CONFIG_PCIEAER_CXL conditional blocks from core/pci.c
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (3 preceding siblings ...)
  2025-11-04 17:02 ` [RESEND v13 04/25] cxl/pci: Remove unnecessary CXL RCH " Terry Bowman
@ 2025-11-04 17:02 ` Terry Bowman
  2025-11-19  3:20   ` dan.j.williams
  2025-11-04 17:02 ` [RESEND v13 06/25] cxl: Move CXL driver's RCH error handling into core/ras_rch.c Terry Bowman
                   ` (20 subsequent siblings)
  25 siblings, 1 reply; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:02 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

From: Dave Jiang <dave.jiang@intel.com>

Create new config CONFIG_CXL_RAS and put all CXL RAS items behind the
config. The config will depend on CPER and PCIE AER to build. Move the
related VH RAS code from core/pci.c to core/ras.c.

Restricted CXL host (RCH) RAS functions will be moved in a future patch.
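
The config-gated header pattern this relies on can be sketched in a single
userspace file as follows. MOCK_CONFIG_CXL_RAS and the mock_* functions are
hypothetical names used only for illustration. With the macro undefined, the
inline stub reports "no error" and callers build unchanged.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Build with -DMOCK_CONFIG_CXL_RAS to select the real handler; leave it
 * undefined to get the no-op stub, mirroring a "#ifdef CONFIG_..." header.
 */
#ifdef MOCK_CONFIG_CXL_RAS
static bool mock_cxl_handle_ras(const uint32_t *ras_base)
{
	/* Assumption for this sketch: any set status bit is an error. */
	return ras_base && *ras_base != 0;
}
#else
static inline bool mock_cxl_handle_ras(const uint32_t *ras_base)
{
	(void)ras_base;
	return false;	/* RAS support compiled out: never report an error */
}
#endif

/* Caller code is identical in both configurations. */
static int mock_error_detected(const uint32_t *ras_base)
{
	return mock_cxl_handle_ras(ras_base) ? 1 : 0;
}
```

Keeping the stubs in the header avoids scattering #ifdef blocks through the
call sites.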

Cc: Robert Richter <rrichter@amd.com>
Cc: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Co-developed-by: Terry Bowman <terry.bowman@amd.com>
Signed-off-by: Terry Bowman <terry.bowman@amd.com>

---

Changes in v11->v12:
- None

Changes in v10->v11:
- New patch
- Updated by Terry Bowman to use (ACPI_APEI_GHES && PCIEAER_CXL) dependency
  in Kconfig. Otherwise checks will be required for CONFIG_PCIEAER because
  AER driver functions are called.
---
 drivers/cxl/Kconfig       |   4 +
 drivers/cxl/core/Makefile |   2 +-
 drivers/cxl/core/core.h   |  31 +++++++
 drivers/cxl/core/pci.c    | 189 +-------------------------------------
 drivers/cxl/core/ras.c    | 176 +++++++++++++++++++++++++++++++++++
 drivers/cxl/cxl.h         |   8 --
 drivers/cxl/cxlpci.h      |  16 ++++
 tools/testing/cxl/Kbuild  |   2 +-
 8 files changed, 233 insertions(+), 195 deletions(-)

diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 48b7314afdb8..217888992c88 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -233,4 +233,8 @@ config CXL_MCE
 	def_bool y
 	depends on X86_MCE && MEMORY_FAILURE
 
+config CXL_RAS
+	def_bool y
+	depends on ACPI_APEI_GHES && PCIEAER && CXL_PCI
+
 endif
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index 5ad8fef210b5..b2930cc54f8b 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -14,9 +14,9 @@ cxl_core-y += pci.o
 cxl_core-y += hdm.o
 cxl_core-y += pmu.o
 cxl_core-y += cdat.o
-cxl_core-y += ras.o
 cxl_core-$(CONFIG_TRACING) += trace.o
 cxl_core-$(CONFIG_CXL_REGION) += region.o
 cxl_core-$(CONFIG_CXL_MCE) += mce.o
 cxl_core-$(CONFIG_CXL_FEATURES) += features.o
 cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
+cxl_core-$(CONFIG_CXL_RAS) += ras.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 1fb66132b777..bc818de87ccc 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -144,8 +144,39 @@ int cxl_pci_get_bandwidth(struct pci_dev *pdev, struct access_coordinate *c);
 int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
 					struct access_coordinate *c);
 
+#ifdef CONFIG_CXL_RAS
 int cxl_ras_init(void);
 void cxl_ras_exit(void);
+bool cxl_handle_ras(struct cxl_dev_state *cxlds, void __iomem *ras_base);
+void cxl_handle_cor_ras(struct cxl_dev_state *cxlds, void __iomem *ras_base);
+#else
+static inline int cxl_ras_init(void)
+{
+	return 0;
+}
+
+static inline void cxl_ras_exit(void)
+{
+}
+
+static inline bool cxl_handle_ras(struct cxl_dev_state *cxlds, void __iomem *ras_base)
+{
+	return false;
+}
+static inline void cxl_handle_cor_ras(struct cxl_dev_state *cxlds, void __iomem *ras_base) { }
+#endif /* CONFIG_CXL_RAS */
+
+/* Restricted CXL Host specific RAS functions */
+#ifdef CONFIG_CXL_RAS
+void cxl_dport_map_rch_aer(struct cxl_dport *dport);
+void cxl_disable_rch_root_ints(struct cxl_dport *dport);
+void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds);
+#else
+static inline void cxl_dport_map_rch_aer(struct cxl_dport *dport) { }
+static inline void cxl_disable_rch_root_ints(struct cxl_dport *dport) { }
+static inline void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds) { }
+#endif /* CONFIG_CXL_RAS */
+
 int cxl_gpf_port_setup(struct cxl_dport *dport);
 
 struct cxl_hdm;
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index a0f53a20fa61..cd73cea93282 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -711,81 +711,8 @@ void read_cdat_data(struct cxl_port *port)
 }
 EXPORT_SYMBOL_NS_GPL(read_cdat_data, "CXL");
 
-static void cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
-			       void __iomem *ras_base)
-{
-	void __iomem *addr;
-	u32 status;
-
-	if (!ras_base)
-		return;
-
-	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
-	status = readl(addr);
-	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
-		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
-		trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
-	}
-}
-
-/* CXL spec rev3.0 8.2.4.16.1 */
-static void header_log_copy(void __iomem *ras_base, u32 *log)
-{
-	void __iomem *addr;
-	u32 *log_addr;
-	int i, log_u32_size = CXL_HEADERLOG_SIZE / sizeof(u32);
-
-	addr = ras_base + CXL_RAS_HEADER_LOG_OFFSET;
-	log_addr = log;
-
-	for (i = 0; i < log_u32_size; i++) {
-		*log_addr = readl(addr);
-		log_addr++;
-		addr += sizeof(u32);
-	}
-}
-
-/*
- * Log the state of the RAS status registers and prepare them to log the
- * next error status. Return 1 if reset needed.
- */
-static bool cxl_handle_ras(struct cxl_dev_state *cxlds,
-			   void __iomem *ras_base)
-{
-	u32 hl[CXL_HEADERLOG_SIZE_U32];
-	void __iomem *addr;
-	u32 status;
-	u32 fe;
-
-	if (!ras_base)
-		return false;
-
-	addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
-	status = readl(addr);
-	if (!(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK))
-		return false;
-
-	/* If multiple errors, log header points to first error from ctrl reg */
-	if (hweight32(status) > 1) {
-		void __iomem *rcc_addr =
-			ras_base + CXL_RAS_CAP_CONTROL_OFFSET;
-
-		fe = BIT(FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK,
-				   readl(rcc_addr)));
-	} else {
-		fe = status;
-	}
-
-	header_log_copy(ras_base, hl);
-	trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe, hl);
-	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
-
-	return true;
-}
-
-#ifdef CONFIG_PCIEAER_CXL
-
-static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
+#ifdef CONFIG_CXL_RAS
+void cxl_dport_map_rch_aer(struct cxl_dport *dport)
 {
 	resource_size_t aer_phys;
 	struct device *host;
@@ -800,19 +727,7 @@ static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
 	}
 }
 
-static void cxl_dport_map_ras(struct cxl_dport *dport)
-{
-	struct cxl_register_map *map = &dport->reg_map;
-	struct device *dev = dport->dport_dev;
-
-	if (!map->component_map.ras.valid)
-		dev_dbg(dev, "RAS registers not found\n");
-	else if (cxl_map_component_regs(map, &dport->regs.component,
-					BIT(CXL_CM_CAP_CAP_ID_RAS)))
-		dev_dbg(dev, "Failed to map RAS capability.\n");
-}
-
-static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
+void cxl_disable_rch_root_ints(struct cxl_dport *dport)
 {
 	void __iomem *aer_base = dport->regs.dport_aer;
 	u32 aer_cmd_mask, aer_cmd;
@@ -836,28 +751,6 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
 	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
 }
 
-/**
- * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
- * @dport: the cxl_dport that needs to be initialized
- * @host: host device for devm operations
- */
-void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
-{
-	dport->reg_map.host = host;
-	cxl_dport_map_ras(dport);
-
-	if (dport->rch) {
-		struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
-
-		if (!host_bridge->native_aer)
-			return;
-
-		cxl_dport_map_rch_aer(dport);
-		cxl_disable_rch_root_ints(dport);
-	}
-}
-EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
-
 /*
  * Copy the AER capability registers using 32 bit read accesses.
  * This is necessary because RCRB AER capability is MMIO mapped. Clear the
@@ -906,7 +799,7 @@ static bool cxl_rch_get_aer_severity(struct aer_capability_regs *aer_regs,
 	return false;
 }
 
-static void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds)
+void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds)
 {
 	struct pci_dev *pdev = to_pci_dev(cxlds->dev);
 	struct aer_capability_regs aer_regs;
@@ -931,82 +824,8 @@ static void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds)
 	else
 		cxl_handle_ras(cxlds, dport->regs.ras);
 }
-
-#else
-static void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds) { }
 #endif
 
-void cxl_cor_error_detected(struct pci_dev *pdev)
-{
-	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
-	struct device *dev = &cxlds->cxlmd->dev;
-
-	scoped_guard(device, dev) {
-		if (!dev->driver) {
-			dev_warn(&pdev->dev,
-				 "%s: memdev disabled, abort error handling\n",
-				 dev_name(dev));
-			return;
-		}
-
-		if (cxlds->rcd)
-			cxl_handle_rdport_errors(cxlds);
-
-		cxl_handle_cor_ras(cxlds, cxlds->regs.ras);
-	}
-}
-EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
-
-pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
-				    pci_channel_state_t state)
-{
-	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
-	struct cxl_memdev *cxlmd = cxlds->cxlmd;
-	struct device *dev = &cxlmd->dev;
-	bool ue;
-
-	scoped_guard(device, dev) {
-		if (!dev->driver) {
-			dev_warn(&pdev->dev,
-				 "%s: memdev disabled, abort error handling\n",
-				 dev_name(dev));
-			return PCI_ERS_RESULT_DISCONNECT;
-		}
-
-		if (cxlds->rcd)
-			cxl_handle_rdport_errors(cxlds);
-		/*
-		 * A frozen channel indicates an impending reset which is fatal to
-		 * CXL.mem operation, and will likely crash the system. On the off
-		 * chance the situation is recoverable dump the status of the RAS
-		 * capability registers and bounce the active state of the memdev.
-		 */
-		ue = cxl_handle_ras(cxlds, cxlds->regs.ras);
-	}
-
-
-	switch (state) {
-	case pci_channel_io_normal:
-		if (ue) {
-			device_release_driver(dev);
-			return PCI_ERS_RESULT_NEED_RESET;
-		}
-		return PCI_ERS_RESULT_CAN_RECOVER;
-	case pci_channel_io_frozen:
-		dev_warn(&pdev->dev,
-			 "%s: frozen state error detected, disable CXL.mem\n",
-			 dev_name(dev));
-		device_release_driver(dev);
-		return PCI_ERS_RESULT_NEED_RESET;
-	case pci_channel_io_perm_failure:
-		dev_warn(&pdev->dev,
-			 "failure state error detected, request disconnect\n");
-		return PCI_ERS_RESULT_DISCONNECT;
-	}
-	return PCI_ERS_RESULT_NEED_RESET;
-}
-EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
-
 static int cxl_flit_size(struct pci_dev *pdev)
 {
 	if (cxl_pci_flit_256(pdev))
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 2731ba3a0799..b933030b8e1e 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -5,6 +5,7 @@
 #include <linux/aer.h>
 #include <cxl/event.h>
 #include <cxlmem.h>
+#include <cxlpci.h>
 #include "trace.h"
 
 static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
@@ -124,3 +125,178 @@ void cxl_ras_exit(void)
 	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
 	cancel_work_sync(&cxl_cper_prot_err_work);
 }
+
+static void cxl_dport_map_ras(struct cxl_dport *dport)
+{
+	struct cxl_register_map *map = &dport->reg_map;
+	struct device *dev = dport->dport_dev;
+
+	if (!map->component_map.ras.valid)
+		dev_dbg(dev, "RAS registers not found\n");
+	else if (cxl_map_component_regs(map, &dport->regs.component,
+					BIT(CXL_CM_CAP_CAP_ID_RAS)))
+		dev_dbg(dev, "Failed to map RAS capability.\n");
+}
+
+/**
+ * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
+ * @dport: the cxl_dport that needs to be initialized
+ * @host: host device for devm operations
+ */
+void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
+{
+	dport->reg_map.host = host;
+	cxl_dport_map_ras(dport);
+
+	if (dport->rch) {
+		struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
+
+		if (!host_bridge->native_aer)
+			return;
+
+		cxl_dport_map_rch_aer(dport);
+		cxl_disable_rch_root_ints(dport);
+	}
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
+
+void cxl_handle_cor_ras(struct cxl_dev_state *cxlds, void __iomem *ras_base)
+{
+	void __iomem *addr;
+	u32 status;
+
+	if (!ras_base)
+		return;
+
+	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
+	status = readl(addr);
+	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
+		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
+		trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
+	}
+}
+
+/* CXL spec rev3.0 8.2.4.16.1 */
+static void header_log_copy(void __iomem *ras_base, u32 *log)
+{
+	void __iomem *addr;
+	u32 *log_addr;
+	int i, log_u32_size = CXL_HEADERLOG_SIZE / sizeof(u32);
+
+	addr = ras_base + CXL_RAS_HEADER_LOG_OFFSET;
+	log_addr = log;
+
+	for (i = 0; i < log_u32_size; i++) {
+		*log_addr = readl(addr);
+		log_addr++;
+		addr += sizeof(u32);
+	}
+}
+
+/*
+ * Log the state of the RAS status registers and prepare them to log the
+ * next error status. Return 1 if reset needed.
+ */
+bool cxl_handle_ras(struct cxl_dev_state *cxlds, void __iomem *ras_base)
+{
+	u32 hl[CXL_HEADERLOG_SIZE_U32];
+	void __iomem *addr;
+	u32 status;
+	u32 fe;
+
+	if (!ras_base)
+		return false;
+
+	addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
+	status = readl(addr);
+	if (!(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK))
+		return false;
+
+	/* If multiple errors, log header points to first error from ctrl reg */
+	if (hweight32(status) > 1) {
+		void __iomem *rcc_addr =
+			ras_base + CXL_RAS_CAP_CONTROL_OFFSET;
+
+		fe = BIT(FIELD_GET(CXL_RAS_CAP_CONTROL_FE_MASK,
+				   readl(rcc_addr)));
+	} else {
+		fe = status;
+	}
+
+	header_log_copy(ras_base, hl);
+	trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe, hl);
+	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
+
+	return true;
+}
+
+void cxl_cor_error_detected(struct pci_dev *pdev)
+{
+	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
+	struct device *dev = &cxlds->cxlmd->dev;
+
+	scoped_guard(device, dev) {
+		if (!dev->driver) {
+			dev_warn(&pdev->dev,
+				 "%s: memdev disabled, abort error handling\n",
+				 dev_name(dev));
+			return;
+		}
+
+		if (cxlds->rcd)
+			cxl_handle_rdport_errors(cxlds);
+
+		cxl_handle_cor_ras(cxlds, cxlds->regs.ras);
+	}
+}
+EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
+
+pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
+				    pci_channel_state_t state)
+{
+	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
+	struct cxl_memdev *cxlmd = cxlds->cxlmd;
+	struct device *dev = &cxlmd->dev;
+	bool ue;
+
+	scoped_guard(device, dev) {
+		if (!dev->driver) {
+			dev_warn(&pdev->dev,
+				 "%s: memdev disabled, abort error handling\n",
+				 dev_name(dev));
+			return PCI_ERS_RESULT_DISCONNECT;
+		}
+
+		if (cxlds->rcd)
+			cxl_handle_rdport_errors(cxlds);
+		/*
+		 * A frozen channel indicates an impending reset which is fatal to
+		 * CXL.mem operation, and will likely crash the system. On the off
+		 * chance the situation is recoverable dump the status of the RAS
+		 * capability registers and bounce the active state of the memdev.
+		 */
+		ue = cxl_handle_ras(cxlds, cxlds->regs.ras);
+	}
+
+
+	switch (state) {
+	case pci_channel_io_normal:
+		if (ue) {
+			device_release_driver(dev);
+			return PCI_ERS_RESULT_NEED_RESET;
+		}
+		return PCI_ERS_RESULT_CAN_RECOVER;
+	case pci_channel_io_frozen:
+		dev_warn(&pdev->dev,
+			 "%s: frozen state error detected, disable CXL.mem\n",
+			 dev_name(dev));
+		device_release_driver(dev);
+		return PCI_ERS_RESULT_NEED_RESET;
+	case pci_channel_io_perm_failure:
+		dev_warn(&pdev->dev,
+			 "failure state error detected, request disconnect\n");
+		return PCI_ERS_RESULT_DISCONNECT;
+	}
+	return PCI_ERS_RESULT_NEED_RESET;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 231ddccf8977..259ed4b676e1 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -776,14 +776,6 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
 					 struct device *dport_dev, int port_id,
 					 resource_size_t rcrb);
 
-#ifdef CONFIG_PCIEAER_CXL
-void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport);
-void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host);
-#else
-static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport,
-						struct device *host) { }
-#endif
-
 struct cxl_decoder *to_cxl_decoder(struct device *dev);
 struct cxl_root_decoder *to_cxl_root_decoder(struct device *dev);
 struct cxl_switch_decoder *to_cxl_switch_decoder(struct device *dev);
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 4985dbd90069..0c8b6ee7b6de 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -77,7 +77,23 @@ static inline bool cxl_pci_flit_256(struct pci_dev *pdev)
 int devm_cxl_port_enumerate_dports(struct cxl_port *port);
 struct cxl_dev_state;
 void read_cdat_data(struct cxl_port *port);
+
+#ifdef CONFIG_CXL_RAS
 void cxl_cor_error_detected(struct pci_dev *pdev);
 pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 				    pci_channel_state_t state);
+void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host);
+#else
+static inline void cxl_cor_error_detected(struct pci_dev *pdev) { }
+
+static inline pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
+						  pci_channel_state_t state)
+{
+	return PCI_ERS_RESULT_NONE;
+}
+
+static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport,
+						struct device *host) { }
+#endif
+
 #endif /* __CXL_PCI_H__ */
diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
index 0d5ce4b74b9f..927fbb6c061f 100644
--- a/tools/testing/cxl/Kbuild
+++ b/tools/testing/cxl/Kbuild
@@ -58,12 +58,12 @@ cxl_core-y += $(CXL_CORE_SRC)/pci.o
 cxl_core-y += $(CXL_CORE_SRC)/hdm.o
 cxl_core-y += $(CXL_CORE_SRC)/pmu.o
 cxl_core-y += $(CXL_CORE_SRC)/cdat.o
-cxl_core-y += $(CXL_CORE_SRC)/ras.o
 cxl_core-$(CONFIG_TRACING) += $(CXL_CORE_SRC)/trace.o
 cxl_core-$(CONFIG_CXL_REGION) += $(CXL_CORE_SRC)/region.o
 cxl_core-$(CONFIG_CXL_MCE) += $(CXL_CORE_SRC)/mce.o
 cxl_core-$(CONFIG_CXL_FEATURES) += $(CXL_CORE_SRC)/features.o
 cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += $(CXL_CORE_SRC)/edac.o
+cxl_core-$(CONFIG_CXL_RAS) += $(CXL_CORE_SRC)/ras.o
 cxl_core-y += config_check.o
 cxl_core-y += cxl_core_test.o
 cxl_core-y += cxl_core_exports.o
-- 
2.34.1



* [RESEND v13 06/25] cxl: Move CXL driver's RCH error handling into core/ras_rch.c
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (4 preceding siblings ...)
  2025-11-04 17:02 ` [RESEND v13 05/25] cxl: Remove CXL VH handling in CONFIG_PCIEAER_CXL conditional blocks from core/pci.c Terry Bowman
@ 2025-11-04 17:02 ` Terry Bowman
  2025-11-04 18:03   ` Jonathan Cameron
  2025-11-19  3:20   ` dan.j.williams
  2025-11-04 17:02 ` [RESEND v13 07/25] CXL/AER: Replace device_lock() in cxl_rch_handle_error_iter() with guard() lock Terry Bowman
                   ` (19 subsequent siblings)
  25 siblings, 2 replies; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:02 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Restricted CXL Host (RCH) protocol error handling uses a procedure distinct
from the CXL Virtual Hierarchy (VH) handling. This is because of the
differences in the RCH and VH topologies. Improve maintainability and
add the ability to enable or disable RCH handling.

Move and combine the RCH handling code into a single block conditionally
compiled with the CONFIG_CXL_RCH_RAS kernel config.
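
One RCH-specific step that moves with this code is classifying the AER
severity from a copy of the downstream port's AER registers. A minimal
userspace sketch of that decision is shown below. The mock_* types are
hypothetical, and MOCK_FATAL_RCV is an assumed stand-in value for the bit
tested by the moved code, not taken from the kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical severity values; the kernel uses its own AER_* constants. */
enum mock_severity {
	MOCK_AER_NONFATAL,
	MOCK_AER_FATAL,
	MOCK_AER_CORRECTABLE,
};

/* Assumed stand-in for the "fatal received" bit checked by the handler. */
#define MOCK_FATAL_RCV 0x40u

/* Only the four registers the classification reads. */
struct mock_aer_regs {
	uint32_t uncor_status;
	uint32_t uncor_mask;
	uint32_t cor_status;
	uint32_t cor_mask;
};

/* Return false if no unmasked error is pending; otherwise set *sev. */
static bool mock_get_aer_severity(const struct mock_aer_regs *regs,
				  enum mock_severity *sev)
{
	/* Unmasked uncorrectable errors take priority over correctable ones. */
	if (regs->uncor_status & ~regs->uncor_mask) {
		if (regs->uncor_status & MOCK_FATAL_RCV)
			*sev = MOCK_AER_FATAL;
		else
			*sev = MOCK_AER_NONFATAL;
		return true;
	}

	if (regs->cor_status & ~regs->cor_mask) {
		*sev = MOCK_AER_CORRECTABLE;
		return true;
	}

	return false;
}
```

In the moved handler, a correctable result leads to cxl_handle_cor_ras() and
any other severity leads to cxl_handle_ras(), both called with the dport's
RAS base.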

Signed-off-by: Terry Bowman <terry.bowman@amd.com>

---

Changes in v12->v13:
- None

Changes v11->v12:
- Moved CXL_RCH_RAS Kconfig definition here from following commit.

Changes v10->v11:
- New patch
---
 drivers/cxl/Kconfig        |   7 +++
 drivers/cxl/core/Makefile  |   1 +
 drivers/cxl/core/core.h    |   5 +-
 drivers/cxl/core/pci.c     | 115 -----------------------------------
 drivers/cxl/core/ras_rch.c | 120 +++++++++++++++++++++++++++++++++++++
 tools/testing/cxl/Kbuild   |   1 +
 6 files changed, 132 insertions(+), 117 deletions(-)
 create mode 100644 drivers/cxl/core/ras_rch.c

diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 217888992c88..ffe6ad981434 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -237,4 +237,11 @@ config CXL_RAS
 	def_bool y
 	depends on ACPI_APEI_GHES && PCIEAER && CXL_PCI
 
+config CXL_RCH_RAS
+	bool "CXL: Restricted CXL Host (RCH) protocol error handling"
+	def_bool n
+	depends on CXL_RAS
+	help
+	  RAS support for Restricted CXL Host (RCH) defined in CXL1.1.
+
 endif
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index b2930cc54f8b..fa1d4aed28b9 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -20,3 +20,4 @@ cxl_core-$(CONFIG_CXL_MCE) += mce.o
 cxl_core-$(CONFIG_CXL_FEATURES) += features.o
 cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
 cxl_core-$(CONFIG_CXL_RAS) += ras.o
+cxl_core-$(CONFIG_CXL_RCH_RAS) += ras_rch.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index bc818de87ccc..c30ab7c25a92 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -4,6 +4,7 @@
 #ifndef __CXL_CORE_H__
 #define __CXL_CORE_H__
 
+#include <linux/pci.h>
 #include <cxl/mailbox.h>
 #include <linux/rwsem.h>
 
@@ -167,7 +168,7 @@ static inline void cxl_handle_cor_ras(struct cxl_dev_state *cxlds, void __iomem
 #endif /* CONFIG_CXL_RAS */
 
 /* Restricted CXL Host specific RAS functions */
-#ifdef CONFIG_CXL_RAS
+#ifdef CONFIG_CXL_RCH_RAS
 void cxl_dport_map_rch_aer(struct cxl_dport *dport);
 void cxl_disable_rch_root_ints(struct cxl_dport *dport);
 void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds);
@@ -175,7 +176,7 @@ void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds);
 static inline void cxl_dport_map_rch_aer(struct cxl_dport *dport) { }
 static inline void cxl_disable_rch_root_ints(struct cxl_dport *dport) { }
 static inline void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds) { }
-#endif /* CONFIG_CXL_RAS */
+#endif /* CONFIG_CXL_RCH_RAS */
 
 int cxl_gpf_port_setup(struct cxl_dport *dport);
 
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index cd73cea93282..a66f7a84b5c8 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -711,121 +711,6 @@ void read_cdat_data(struct cxl_port *port)
 }
 EXPORT_SYMBOL_NS_GPL(read_cdat_data, "CXL");
 
-#ifdef CONFIG_CXL_RAS
-void cxl_dport_map_rch_aer(struct cxl_dport *dport)
-{
-	resource_size_t aer_phys;
-	struct device *host;
-	u16 aer_cap;
-
-	aer_cap = cxl_rcrb_to_aer(dport->dport_dev, dport->rcrb.base);
-	if (aer_cap) {
-		host = dport->reg_map.host;
-		aer_phys = aer_cap + dport->rcrb.base;
-		dport->regs.dport_aer = devm_cxl_iomap_block(host, aer_phys,
-						sizeof(struct aer_capability_regs));
-	}
-}
-
-void cxl_disable_rch_root_ints(struct cxl_dport *dport)
-{
-	void __iomem *aer_base = dport->regs.dport_aer;
-	u32 aer_cmd_mask, aer_cmd;
-
-	if (!aer_base)
-		return;
-
-	/*
-	 * Disable RCH root port command interrupts.
-	 * CXL 3.0 12.2.1.1 - RCH Downstream Port-detected Errors
-	 *
-	 * This sequence may not be necessary. CXL spec states disabling
-	 * the root cmd register's interrupts is required. But, PCI spec
-	 * shows these are disabled by default on reset.
-	 */
-	aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN |
-			PCI_ERR_ROOT_CMD_NONFATAL_EN |
-			PCI_ERR_ROOT_CMD_FATAL_EN);
-	aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND);
-	aer_cmd &= ~aer_cmd_mask;
-	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
-}
-
-/*
- * Copy the AER capability registers using 32 bit read accesses.
- * This is necessary because RCRB AER capability is MMIO mapped. Clear the
- * status after copying.
- *
- * @aer_base: base address of AER capability block in RCRB
- * @aer_regs: destination for copying AER capability
- */
-static bool cxl_rch_get_aer_info(void __iomem *aer_base,
-				 struct aer_capability_regs *aer_regs)
-{
-	int read_cnt = sizeof(struct aer_capability_regs) / sizeof(u32);
-	u32 *aer_regs_buf = (u32 *)aer_regs;
-	int n;
-
-	if (!aer_base)
-		return false;
-
-	/* Use readl() to guarantee 32-bit accesses */
-	for (n = 0; n < read_cnt; n++)
-		aer_regs_buf[n] = readl(aer_base + n * sizeof(u32));
-
-	writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS);
-	writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS);
-
-	return true;
-}
-
-/* Get AER severity. Return false if there is no error. */
-static bool cxl_rch_get_aer_severity(struct aer_capability_regs *aer_regs,
-				     int *severity)
-{
-	if (aer_regs->uncor_status & ~aer_regs->uncor_mask) {
-		if (aer_regs->uncor_status & PCI_ERR_ROOT_FATAL_RCV)
-			*severity = AER_FATAL;
-		else
-			*severity = AER_NONFATAL;
-		return true;
-	}
-
-	if (aer_regs->cor_status & ~aer_regs->cor_mask) {
-		*severity = AER_CORRECTABLE;
-		return true;
-	}
-
-	return false;
-}
-
-void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds)
-{
-	struct pci_dev *pdev = to_pci_dev(cxlds->dev);
-	struct aer_capability_regs aer_regs;
-	struct cxl_dport *dport;
-	int severity;
-
-	struct cxl_port *port __free(put_cxl_port) =
-		cxl_pci_find_port(pdev, &dport);
-	if (!port)
-		return;
-
-	if (!cxl_rch_get_aer_info(dport->regs.dport_aer, &aer_regs))
-		return;
-
-	if (!cxl_rch_get_aer_severity(&aer_regs, &severity))
-		return;
-
-	pci_print_aer(pdev, severity, &aer_regs);
-
-	if (severity == AER_CORRECTABLE)
-		cxl_handle_cor_ras(cxlds, dport->regs.ras);
-	else
-		cxl_handle_ras(cxlds, dport->regs.ras);
-}
-#endif
-
 static int cxl_flit_size(struct pci_dev *pdev)
 {
 	if (cxl_pci_flit_256(pdev))
diff --git a/drivers/cxl/core/ras_rch.c b/drivers/cxl/core/ras_rch.c
new file mode 100644
index 000000000000..f6de5492a8b7
--- /dev/null
+++ b/drivers/cxl/core/ras_rch.c
@@ -0,0 +1,120 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
+
+#include <linux/pci.h>
+#include <linux/aer.h>
+#include <cxl/event.h>
+#include <cxlmem.h>
+#include "trace.h"
+
+void cxl_dport_map_rch_aer(struct cxl_dport *dport)
+{
+	resource_size_t aer_phys;
+	struct device *host;
+	u16 aer_cap;
+
+	aer_cap = cxl_rcrb_to_aer(dport->dport_dev, dport->rcrb.base);
+	if (aer_cap) {
+		host = dport->reg_map.host;
+		aer_phys = aer_cap + dport->rcrb.base;
+		dport->regs.dport_aer = devm_cxl_iomap_block(host, aer_phys,
+							     sizeof(struct aer_capability_regs));
+	}
+}
+
+void cxl_disable_rch_root_ints(struct cxl_dport *dport)
+{
+	void __iomem *aer_base = dport->regs.dport_aer;
+	u32 aer_cmd_mask, aer_cmd;
+
+	if (!aer_base)
+		return;
+
+	/*
+	 * Disable RCH root port command interrupts.
+	 * CXL 3.0 12.2.1.1 - RCH Downstream Port-detected Errors
+	 *
+	 * This sequence may not be necessary. CXL spec states disabling
+	 * the root cmd register's interrupts is required. But, PCI spec
+	 * shows these are disabled by default on reset.
+	 */
+	aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN |
+			PCI_ERR_ROOT_CMD_NONFATAL_EN |
+			PCI_ERR_ROOT_CMD_FATAL_EN);
+	aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND);
+	aer_cmd &= ~aer_cmd_mask;
+	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
+}
+
+/*
+ * Copy the AER capability registers using 32 bit read accesses.
+ * This is necessary because RCRB AER capability is MMIO mapped. Clear the
+ * status after copying.
+ *
+ * @aer_base: base address of AER capability block in RCRB
+ * @aer_regs: destination for copying AER capability
+ */
+static bool cxl_rch_get_aer_info(void __iomem *aer_base,
+				 struct aer_capability_regs *aer_regs)
+{
+	int read_cnt = sizeof(struct aer_capability_regs) / sizeof(u32);
+	u32 *aer_regs_buf = (u32 *)aer_regs;
+	int n;
+
+	if (!aer_base)
+		return false;
+
+	/* Use readl() to guarantee 32-bit accesses */
+	for (n = 0; n < read_cnt; n++)
+		aer_regs_buf[n] = readl(aer_base + n * sizeof(u32));
+
+	writel(aer_regs->uncor_status, aer_base + PCI_ERR_UNCOR_STATUS);
+	writel(aer_regs->cor_status, aer_base + PCI_ERR_COR_STATUS);
+
+	return true;
+}
+
+/* Get AER severity. Return false if there is no error. */
+static bool cxl_rch_get_aer_severity(struct aer_capability_regs *aer_regs,
+				     int *severity)
+{
+	if (aer_regs->uncor_status & ~aer_regs->uncor_mask) {
+		if (aer_regs->uncor_status & PCI_ERR_ROOT_FATAL_RCV)
+			*severity = AER_FATAL;
+		else
+			*severity = AER_NONFATAL;
+		return true;
+	}
+
+	if (aer_regs->cor_status & ~aer_regs->cor_mask) {
+		*severity = AER_CORRECTABLE;
+		return true;
+	}
+
+	return false;
+}
+
+void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds)
+{
+	struct pci_dev *pdev = to_pci_dev(cxlds->dev);
+	struct aer_capability_regs aer_regs;
+	struct cxl_dport *dport;
+	int severity;
+
+	struct cxl_port *port __free(put_cxl_port) =
+		cxl_pci_find_port(pdev, &dport);
+	if (!port)
+		return;
+
+	if (!cxl_rch_get_aer_info(dport->regs.dport_aer, &aer_regs))
+		return;
+
+	if (!cxl_rch_get_aer_severity(&aer_regs, &severity))
+		return;
+
+	pci_print_aer(pdev, severity, &aer_regs);
+	if (severity == AER_CORRECTABLE)
+		cxl_handle_cor_ras(cxlds, dport->regs.ras);
+	else
+		cxl_handle_ras(cxlds, dport->regs.ras);
+}
diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
index 927fbb6c061f..6905f8e710ab 100644
--- a/tools/testing/cxl/Kbuild
+++ b/tools/testing/cxl/Kbuild
@@ -64,6 +64,7 @@ cxl_core-$(CONFIG_CXL_MCE) += $(CXL_CORE_SRC)/mce.o
 cxl_core-$(CONFIG_CXL_FEATURES) += $(CXL_CORE_SRC)/features.o
 cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += $(CXL_CORE_SRC)/edac.o
 cxl_core-$(CONFIG_CXL_RAS) += $(CXL_CORE_SRC)/ras.o
+cxl_core-$(CONFIG_CXL_RCH_RAS) += $(CXL_CORE_SRC)/ras_rch.o
 cxl_core-y += config_check.o
 cxl_core-y += cxl_core_test.o
 cxl_core-y += cxl_core_exports.o
-- 
2.34.1



* [RESEND v13 07/25] CXL/AER: Replace device_lock() in cxl_rch_handle_error_iter() with guard() lock
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (5 preceding siblings ...)
  2025-11-04 17:02 ` [RESEND v13 06/25] cxl: Move CXL driver's RCH error handling into core/ras_rch.c Terry Bowman
@ 2025-11-04 17:02 ` Terry Bowman
  2025-11-04 18:05   ` Jonathan Cameron
                     ` (2 more replies)
  2025-11-04 17:02 ` [RESEND v13 08/25] CXL/AER: Move AER drivers RCH error handling into pcie/aer_cxl_rch.c Terry Bowman
                   ` (18 subsequent siblings)
  25 siblings, 3 replies; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:02 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

cxl_rch_handle_error_iter() includes a call to device_lock() using a goto
for multiple return paths. Improve readability and maintainability by
using the guard() lock variant.
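
For readers unfamiliar with scope-based locking, the following is a minimal
user-space sketch of the idea, not the kernel implementation. It assumes a
GCC or Clang compiler (for the cleanup attribute); demo_dev, demo_lock(),
demo_unlock() and DEMO_GUARD() are hypothetical names used only for
illustration, and a plain flag stands in for the real device lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a device with a lock and a bound driver. */
struct demo_dev {
	int locked;		/* 1 while the "lock" is held */
	const char *driver;	/* NULL when no driver is bound */
};

static struct demo_dev *demo_lock(struct demo_dev *dev)
{
	assert(!dev->locked);	/* the sketch does not support nesting */
	dev->locked = 1;
	return dev;
}

/* Cleanup callback: runs when the guard variable goes out of scope. */
static void demo_unlock(struct demo_dev **devp)
{
	(*devp)->locked = 0;
}

/* Take the lock now and release it automatically on every exit path. */
#define DEMO_GUARD(dev) \
	struct demo_dev *demo_guard_ \
		__attribute__((cleanup(demo_unlock), unused)) = demo_lock(dev)

static int demo_handle_error(struct demo_dev *dev)
{
	DEMO_GUARD(dev);

	if (!dev->driver)
		return 0;	/* early return, no goto or explicit unlock */

	/* ... a real handler would call into the driver here ... */
	return 1;
}
```

Both return paths in demo_handle_error() drop the lock without an 'out:'
label, which is the readability gain this patch is after.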

Signed-off-by: Terry Bowman <terry.bowman@amd.com>

---

Changes in v12->v13:
- New patch
---
 drivers/pci/pcie/aer.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 0b5ed4722ac3..cbaed65577d9 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1187,12 +1187,11 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
 	if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev))
 		return 0;
 
-	/* Protect dev->driver */
-	device_lock(&dev->dev);
+	guard(device)(&dev->dev);
 
 	err_handler = dev->driver ? dev->driver->err_handler : NULL;
 	if (!err_handler)
-		goto out;
+		return 0;
 
 	if (info->severity == AER_CORRECTABLE) {
 		if (err_handler->cor_error_detected)
@@ -1203,8 +1202,6 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
 		else if (info->severity == AER_FATAL)
 			err_handler->error_detected(dev, pci_channel_io_frozen);
 	}
-out:
-	device_unlock(&dev->dev);
 	return 0;
 }
 
-- 
2.34.1



* [RESEND v13 08/25] CXL/AER: Move AER drivers RCH error handling into pcie/aer_cxl_rch.c
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (6 preceding siblings ...)
  2025-11-04 17:02 ` [RESEND v13 07/25] CXL/AER: Replace device_lock() in cxl_rch_handle_error_iter() with guard() lock Terry Bowman
@ 2025-11-04 17:02 ` Terry Bowman
  2025-11-19  3:20   ` dan.j.williams
  2025-11-04 17:02 ` [RESEND v13 09/25] PCI/AER: Report CXL or PCIe bus error type in trace logging Terry Bowman
                   ` (17 subsequent siblings)
  25 siblings, 1 reply; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:02 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

The restricted CXL Host (RCH) AER error handling logic currently resides
in the AER driver file, drivers/pci/pcie/aer.c. CXL specific changes are
conditionally compiled using #ifdefs.

Improve the AER driver maintainability by separating the RCH specific logic
from the AER driver's core functionality and removing the ifdefs. Introduce
drivers/pci/pcie/aer_cxl_rch.c to hold the RCH AER logic. Conditionally
compile the file using the CONFIG_CXL_RCH_RAS Kconfig option.
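
As a rough illustration of how callers can stay free of ifdefs, the sketch
below shows the declaration-or-stub header pattern in isolation. It is a
simplified assumption-laden example: DEMO_CONFIG_RCH_RAS stands in for a
Kconfig symbol, and demo_pci_dev, demo_rch_enable_rcec() and demo_probe()
are hypothetical names, not kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical device type used only for this sketch. */
struct demo_pci_dev {
	int internal_errors_unmasked;
};

/*
 * DEMO_CONFIG_RCH_RAS is a made-up switch. When it is defined, the real
 * function lives in its own source file; otherwise an empty inline stub
 * is provided so that callers compile unchanged.
 */
#ifdef DEMO_CONFIG_RCH_RAS
void demo_rch_enable_rcec(struct demo_pci_dev *rcec);
#else
static inline void demo_rch_enable_rcec(struct demo_pci_dev *rcec)
{
	(void)rcec;	/* feature compiled out: nothing to do */
}
#endif

/* The caller contains no conditional compilation at all. */
static void demo_probe(struct demo_pci_dev *rcec)
{
	assert(rcec);
	demo_rch_enable_rcec(rcec);
}
```

With the switch undefined, demo_probe() still builds and the call compiles
away, which mirrors how the stubs in this patch let aer.c drop its ifdefs.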

Move the CXL logic into the new file but leave the helper functions in aer.c
for now, as they will be moved in a future patch for CXL virtual hierarchy
handling. Export the handler functions as needed. Export
pci_aer_unmask_internal_errors() so that all subsystems can use it. To avoid
moving declarations more than once, export cxl_error_is_native() now to
allow access from cxl_core.

Other changes are required to keep the code compiling after the move.
Make cxl_rch_handle_error() and cxl_rch_enable_rcec() non-static so that
the AER driver in aer.c can call them.

Add the SPDX identifier and a 2023 AMD copyright notice to the new file
because the RCH bits were initially contributed in 2023 by AMD.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>

---

Changes in v12->v13:
- Add forward declaration of 'struct aer_err_info' in pci/pci.h (Terry)
- Changed copyright date from 2025 to 2023 (Jonathan)
- Add Dave Jiang's, Jonathan's, and Ben's Reviewed-by
- Re-add 'struct aer_err_info' (Bot)

Changes in v11->v12:
- Rename drivers/pci/pcie/cxl_rch.c to drivers/pci/pcie/aer_cxl_rch.c (Lukas)
- Removed forward declaration of 'struct aer_err_info' in pci/pci.h (Terry)

Changes in v10->v11:
- Remove changes in code-split and move to earlier, new patch
- Add #include <linux/bitfield.h> to cxl_ras.c
- Move cxl_rch_handle_error() & cxl_rch_enable_rcec() declarations from pci.h
to aer.h, more localized.
- Introduce CONFIG_CXL_RCH_RAS, includes Makefile changes, ras.c
ifdef changes
---
 drivers/pci/pci.h              |  16 +++++
 drivers/pci/pcie/Makefile      |   1 +
 drivers/pci/pcie/aer.c         | 105 +++------------------------------
 drivers/pci/pcie/aer_cxl_rch.c |  96 ++++++++++++++++++++++++++++++
 include/linux/aer.h            |   8 +++
 5 files changed, 128 insertions(+), 98 deletions(-)
 create mode 100644 drivers/pci/pcie/aer_cxl_rch.c

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 4492b809094b..d23430e3eea0 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -1295,4 +1295,20 @@ static inline int pci_msix_write_tph_tag(struct pci_dev *pdev, unsigned int inde
 	(PCI_CONF1_ADDRESS(bus, dev, func, reg) | \
 	 PCI_CONF1_EXT_REG(reg))
 
+struct aer_err_info;
+
+#ifdef CONFIG_CXL_RCH_RAS
+void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info);
+void cxl_rch_enable_rcec(struct pci_dev *rcec);
+#else
+static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { }
+static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
+#endif
+
+#ifdef CONFIG_CXL_RAS
+bool is_internal_error(struct aer_err_info *info);
+#else
+static inline bool is_internal_error(struct aer_err_info *info) { return false; }
+#endif
+
 #endif /* DRIVERS_PCI_H */
diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index 173829aa02e6..970e7cbc5b34 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_PCIEPORTBUS)	+= pcieportdrv.o bwctrl.o
 
 obj-y				+= aspm.o
 obj-$(CONFIG_PCIEAER)		+= aer.o err.o tlp.o
+obj-$(CONFIG_CXL_RCH_RAS)	+= aer_cxl_rch.o
 obj-$(CONFIG_PCIEAER_INJECT)	+= aer_inject.o
 obj-$(CONFIG_PCIE_PME)		+= pme.o
 obj-$(CONFIG_PCIE_DPC)		+= dpc.o
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index cbaed65577d9..f5f22216bb41 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1130,7 +1130,7 @@ static bool find_source_device(struct pci_dev *parent,
  * Note: AER must be enabled and supported by the device which must be
  * checked in advance, e.g. with pcie_aer_is_native().
  */
-static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
+void pci_aer_unmask_internal_errors(struct pci_dev *dev)
 {
 	int aer = dev->aer_cap;
 	u32 mask;
@@ -1143,116 +1143,25 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
 	mask &= ~PCI_ERR_COR_INTERNAL;
 	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
 }
+EXPORT_SYMBOL_GPL(pci_aer_unmask_internal_errors);
 
-static bool is_cxl_mem_dev(struct pci_dev *dev)
-{
-	/*
-	 * The capability, status, and control fields in Device 0,
-	 * Function 0 DVSEC control the CXL functionality of the
-	 * entire device (CXL 3.0, 8.1.3).
-	 */
-	if (dev->devfn != PCI_DEVFN(0, 0))
-		return false;
-
-	/*
-	 * CXL Memory Devices must have the 502h class code set (CXL
-	 * 3.0, 8.1.12.1).
-	 */
-	if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL)
-		return false;
-
-	return true;
-}
-
-static bool cxl_error_is_native(struct pci_dev *dev)
+bool cxl_error_is_native(struct pci_dev *dev)
 {
 	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
 
 	return (pcie_ports_native || host->native_aer);
 }
+EXPORT_SYMBOL_NS_GPL(cxl_error_is_native, "CXL");
 
-static bool is_internal_error(struct aer_err_info *info)
+bool is_internal_error(struct aer_err_info *info)
 {
 	if (info->severity == AER_CORRECTABLE)
 		return info->status & PCI_ERR_COR_INTERNAL;
 
 	return info->status & PCI_ERR_UNC_INTN;
 }
-
-static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
-{
-	struct aer_err_info *info = (struct aer_err_info *)data;
-	const struct pci_error_handlers *err_handler;
-
-	if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev))
-		return 0;
-
-	guard(device)(&dev->dev);
-
-	err_handler = dev->driver ? dev->driver->err_handler : NULL;
-	if (!err_handler)
-		return 0;
-
-	if (info->severity == AER_CORRECTABLE) {
-		if (err_handler->cor_error_detected)
-			err_handler->cor_error_detected(dev);
-	} else if (err_handler->error_detected) {
-		if (info->severity == AER_NONFATAL)
-			err_handler->error_detected(dev, pci_channel_io_normal);
-		else if (info->severity == AER_FATAL)
-			err_handler->error_detected(dev, pci_channel_io_frozen);
-	}
-	return 0;
-}
-
-static void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
-{
-	/*
-	 * Internal errors of an RCEC indicate an AER error in an
-	 * RCH's downstream port. Check and handle them in the CXL.mem
-	 * device driver.
-	 */
-	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
-	    is_internal_error(info))
-		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
-}
-
-static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
-{
-	bool *handles_cxl = data;
-
-	if (!*handles_cxl)
-		*handles_cxl = is_cxl_mem_dev(dev) && cxl_error_is_native(dev);
-
-	/* Non-zero terminates iteration */
-	return *handles_cxl;
-}
-
-static bool handles_cxl_errors(struct pci_dev *rcec)
-{
-	bool handles_cxl = false;
-
-	if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
-	    pcie_aer_is_native(rcec))
-		pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
-
-	return handles_cxl;
-}
-
-static void cxl_rch_enable_rcec(struct pci_dev *rcec)
-{
-	if (!handles_cxl_errors(rcec))
-		return;
-
-	pci_aer_unmask_internal_errors(rcec);
-	pci_info(rcec, "CXL: Internal errors unmasked");
-}
-
-#else
-static inline void cxl_rch_enable_rcec(struct pci_dev *dev) { }
-static inline void cxl_rch_handle_error(struct pci_dev *dev,
-					struct aer_err_info *info) { }
-#endif
+EXPORT_SYMBOL_NS_GPL(is_internal_error, "CXL");
+#endif /* CONFIG_CXL_RAS */
 
 /**
  * pci_aer_handle_error - handle logging error into an event log
diff --git a/drivers/pci/pcie/aer_cxl_rch.c b/drivers/pci/pcie/aer_cxl_rch.c
new file mode 100644
index 000000000000..f4d160f18169
--- /dev/null
+++ b/drivers/pci/pcie/aer_cxl_rch.c
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2023 AMD Corporation. All rights reserved. */
+
+#include <linux/pci.h>
+#include <linux/aer.h>
+#include <linux/bitfield.h>
+#include "../pci.h"
+
+static bool is_cxl_mem_dev(struct pci_dev *dev)
+{
+	/*
+	 * The capability, status, and control fields in Device 0,
+	 * Function 0 DVSEC control the CXL functionality of the
+	 * entire device (CXL 3.0, 8.1.3).
+	 */
+	if (dev->devfn != PCI_DEVFN(0, 0))
+		return false;
+
+	/*
+	 * CXL Memory Devices must have the 502h class code set (CXL
+	 * 3.0, 8.1.12.1).
+	 */
+	if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL)
+		return false;
+
+	return true;
+}
+
+static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
+{
+	struct aer_err_info *info = (struct aer_err_info *)data;
+	const struct pci_error_handlers *err_handler;
+
+	if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev))
+		return 0;
+
+	guard(device)(&dev->dev);
+
+	err_handler = dev->driver ? dev->driver->err_handler : NULL;
+	if (!err_handler)
+		return 0;
+
+	if (info->severity == AER_CORRECTABLE) {
+		if (err_handler->cor_error_detected)
+			err_handler->cor_error_detected(dev);
+	} else if (err_handler->error_detected) {
+		if (info->severity == AER_NONFATAL)
+			err_handler->error_detected(dev, pci_channel_io_normal);
+		else if (info->severity == AER_FATAL)
+			err_handler->error_detected(dev, pci_channel_io_frozen);
+	}
+	return 0;
+}
+
+void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
+{
+	/*
+	 * Internal errors of an RCEC indicate an AER error in an
+	 * RCH's downstream port. Check and handle them in the CXL.mem
+	 * device driver.
+	 */
+	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
+	    is_internal_error(info))
+		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
+}
+
+static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
+{
+	bool *handles_cxl = data;
+
+	if (!*handles_cxl)
+		*handles_cxl = is_cxl_mem_dev(dev) && cxl_error_is_native(dev);
+
+	/* Non-zero terminates iteration */
+	return *handles_cxl;
+}
+
+static bool handles_cxl_errors(struct pci_dev *rcec)
+{
+	bool handles_cxl = false;
+
+	if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
+	    pcie_aer_is_native(rcec))
+		pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
+
+	return handles_cxl;
+}
+
+void cxl_rch_enable_rcec(struct pci_dev *rcec)
+{
+	if (!handles_cxl_errors(rcec))
+		return;
+
+	pci_aer_unmask_internal_errors(rcec);
+	pci_info(rcec, "CXL: Internal errors unmasked");
+}
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 02940be66324..2ef820563996 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -56,12 +56,20 @@ struct aer_capability_regs {
 #if defined(CONFIG_PCIEAER)
 int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
 int pcie_aer_is_native(struct pci_dev *dev);
+void pci_aer_unmask_internal_errors(struct pci_dev *dev);
 #else
 static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
 {
 	return -EINVAL;
 }
 static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
+static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
+#endif
+
+#ifdef CONFIG_CXL_RAS
+bool cxl_error_is_native(struct pci_dev *dev);
+#else
+static inline bool cxl_error_is_native(struct pci_dev *dev) { return false; }
 #endif
 
 void pci_print_aer(struct pci_dev *dev, int aer_severity,
-- 
2.34.1



* [RESEND v13 09/25] PCI/AER: Report CXL or PCIe bus error type in trace logging
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (7 preceding siblings ...)
  2025-11-04 17:02 ` [RESEND v13 08/25] CXL/AER: Move AER drivers RCH error handling into pcie/aer_cxl_rch.c Terry Bowman
@ 2025-11-04 17:02 ` Terry Bowman
  2025-11-04 18:08   ` Jonathan Cameron
  2025-11-04 18:26   ` Bjorn Helgaas
  2025-11-04 17:02 ` [RESEND v13 10/25] cxl/pci: Update RAS handler interfaces to also support CXL Ports Terry Bowman
                   ` (16 subsequent siblings)
  25 siblings, 2 replies; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:02 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
for all errors. Update the driver and aer_event tracing to log 'CXL Bus
Type' for CXL device errors.

This requires that the AER driver can distinguish between PCIe errors and
CXL errors.

Add a boolean 'is_cxl' to 'struct aer_err_info' and assign it in
aer_get_device_error_info() and pci_print_aer().

Update the aer_event trace routine to accept a bus type string parameter.
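
The effect on the log prefix can be sketched in user space as follows. This
is only an illustration under stated assumptions: demo_err_info is a
cut-down, hypothetical stand-in for 'struct aer_err_info', and
demo_err_bus() and demo_format_error() are made-up names that mimic the
helper and message format introduced by this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical, reduced version of the error info bitfields. */
struct demo_err_info {
	unsigned int first_error:5;
	unsigned int pad:1;
	bool is_cxl:1;			/* 0: PCIe bus error, 1: CXL bus error */
	unsigned int tlp_header_valid:1;
};

/* Select the bus type string from the flag. */
static const char *demo_err_bus(const struct demo_err_info *info)
{
	assert(info);
	return info->is_cxl ? "CXL" : "PCIe";
}

/* Build the message prefix; returns the length snprintf() reports. */
static int demo_format_error(char *buf, size_t len,
			     const struct demo_err_info *info,
			     const char *severity)
{
	return snprintf(buf, len, "%s Bus Error: severity=%s",
			demo_err_bus(info), severity);
}
```

Keeping the string selection in one helper means the printk paths and the
trace event cannot drift apart in how they name the bus.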

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

---

Changes in v12->v13:
- Remove duplicated aer_err_info inline comments. Is already in the
  kernel-doc header (Ben)

Changes in v11->v12:
 - Change aer_err_info::is_cxl to be a bool bitfield. Update structure
 padding. (Lukas)
 - Add kernel-doc for 'struct aer_err_info' (Lukas)

Changes in v10->v11:
 - Remove duplicate call to trace_aer_event() (Shiju)
 - Added Dan Williams' and Dave Jiang's Reviewed-by
---
 drivers/pci/pci.h       | 37 ++++++++++++++++++++++++++++++-------
 drivers/pci/pcie/aer.c  | 18 ++++++++++++------
 include/ras/ras_event.h |  9 ++++++---
 3 files changed, 48 insertions(+), 16 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index d23430e3eea0..446251892bb7 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -701,31 +701,54 @@ static inline bool pci_dev_binding_disallowed(struct pci_dev *dev)
 
 #define AER_MAX_MULTI_ERR_DEVICES	5	/* Not likely to have more */
 
+/**
+ * struct aer_err_info - AER Error Information
+ * @dev: Devices reporting error
+ * @ratelimit_print: Flag to log or not log the devices' error. 0=NotLog/1=Log
+ * @error_devnum: Number of devices reporting an error
+ * @level: printk level to use in logging
+ * @id: Value from register PCI_ERR_ROOT_ERR_SRC
+ * @severity: AER severity, 0-UNCOR Non-fatal, 1-UNCOR fatal, 2-COR
+ * @root_ratelimit_print: Flag to log or not log the root's error. 0=NotLog/1=Log
+ * @multi_error_valid: If multiple errors are reported
+ * @first_error: First reported error
+ * @is_cxl: Bus type error: 0-PCI Bus error, 1-CXL Bus error
+ * @tlp_header_valid: Indicates if TLP field contains error information
+ * @status: COR/UNCOR error status
+ * @mask: COR/UNCOR mask
+ * @tlp: Transaction packet information
+ */
 struct aer_err_info {
 	struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
 	int ratelimit_print[AER_MAX_MULTI_ERR_DEVICES];
 	int error_dev_num;
-	const char *level;		/* printk level */
+	const char *level;
 
 	unsigned int id:16;
 
-	unsigned int severity:2;	/* 0:NONFATAL | 1:FATAL | 2:COR */
-	unsigned int root_ratelimit_print:1;	/* 0=skip, 1=print */
+	unsigned int severity:2;
+	unsigned int root_ratelimit_print:1;
 	unsigned int __pad1:4;
 	unsigned int multi_error_valid:1;
 
 	unsigned int first_error:5;
-	unsigned int __pad2:2;
+	unsigned int __pad2:1;
+	bool is_cxl:1;
 	unsigned int tlp_header_valid:1;
 
-	unsigned int status;		/* COR/UNCOR Error Status */
-	unsigned int mask;		/* COR/UNCOR Error Mask */
-	struct pcie_tlp_log tlp;	/* TLP Header */
+	unsigned int status;
+	unsigned int mask;
+	struct pcie_tlp_log tlp;
 };
 
 int aer_get_device_error_info(struct aer_err_info *info, int i);
 void aer_print_error(struct aer_err_info *info, int i);
 
+static inline const char *aer_err_bus(struct aer_err_info *info)
+{
+	return info->is_cxl ? "CXL" : "PCIe";
+}
+
 int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
 		      unsigned int tlp_len, bool flit,
 		      struct pcie_tlp_log *log);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index f5f22216bb41..39e99f438563 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -868,6 +868,7 @@ void aer_print_error(struct aer_err_info *info, int i)
 	struct pci_dev *dev;
 	int layer, agent, id;
 	const char *level = info->level;
+	const char *bus_type = aer_err_bus(info);
 
 	if (WARN_ON_ONCE(i >= AER_MAX_MULTI_ERR_DEVICES))
 		return;
@@ -876,23 +877,23 @@ void aer_print_error(struct aer_err_info *info, int i)
 	id = pci_dev_id(dev);
 
 	pci_dev_aer_stats_incr(dev, info);
-	trace_aer_event(pci_name(dev), (info->status & ~info->mask),
+	trace_aer_event(pci_name(dev), bus_type, (info->status & ~info->mask),
 			info->severity, info->tlp_header_valid, &info->tlp);
 
 	if (!info->ratelimit_print[i])
 		return;
 
 	if (!info->status) {
-		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
-			aer_error_severity_string[info->severity]);
+		pci_err(dev, "%s Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
+			bus_type, aer_error_severity_string[info->severity]);
 		goto out;
 	}
 
 	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
 	agent = AER_GET_AGENT(info->severity, info->status);
 
-	aer_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
-		   aer_error_severity_string[info->severity],
+	aer_printk(level, dev, "%s Bus Error: severity=%s, type=%s, (%s)\n",
+		   bus_type, aer_error_severity_string[info->severity],
 		   aer_error_layer[layer], aer_agent_string[agent]);
 
 	aer_printk(level, dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
@@ -926,6 +927,7 @@ EXPORT_SYMBOL_GPL(cper_severity_to_aer);
 void pci_print_aer(struct pci_dev *dev, int aer_severity,
 		   struct aer_capability_regs *aer)
 {
+	const char *bus_type;
 	int layer, agent, tlp_header_valid = 0;
 	u32 status, mask;
 	struct aer_err_info info = {
@@ -946,9 +948,12 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 
 	info.status = status;
 	info.mask = mask;
+	info.is_cxl = pcie_is_cxl(dev);
+
+	bus_type = aer_err_bus(&info);
 
 	pci_dev_aer_stats_incr(dev, &info);
-	trace_aer_event(pci_name(dev), (status & ~mask),
+	trace_aer_event(pci_name(dev), bus_type, (status & ~mask),
 			aer_severity, tlp_header_valid, &aer->header_log);
 
 	if (!aer_ratelimit(dev, info.severity))
@@ -1309,6 +1314,7 @@ int aer_get_device_error_info(struct aer_err_info *info, int i)
 	/* Must reset in this function */
 	info->status = 0;
 	info->tlp_header_valid = 0;
+	info->is_cxl = pcie_is_cxl(dev);
 
 	/* The device might not support AER */
 	if (!aer)
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index c8cd0f00c845..85dbafec6ad1 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -298,15 +298,17 @@ TRACE_EVENT(non_standard_event,
 
 TRACE_EVENT(aer_event,
 	TP_PROTO(const char *dev_name,
+		 const char *bus_type,
 		 const u32 status,
 		 const u8 severity,
 		 const u8 tlp_header_valid,
 		 struct pcie_tlp_log *tlp),
 
-	TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp),
+	TP_ARGS(dev_name, bus_type, status, severity, tlp_header_valid, tlp),
 
 	TP_STRUCT__entry(
 		__string(	dev_name,	dev_name	)
+		__string(	bus_type,	bus_type	)
 		__field(	u32,		status		)
 		__field(	u8,		severity	)
 		__field(	u8, 		tlp_header_valid)
@@ -315,6 +317,7 @@ TRACE_EVENT(aer_event,
 
 	TP_fast_assign(
 		__assign_str(dev_name);
+		__assign_str(bus_type);
 		__entry->status		= status;
 		__entry->severity	= severity;
 		__entry->tlp_header_valid = tlp_header_valid;
@@ -326,8 +329,8 @@ TRACE_EVENT(aer_event,
 		}
 	),
 
-	TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s\n",
-		__get_str(dev_name),
+	TP_printk("%s %s Bus Error: severity=%s, %s, TLP Header=%s\n",
+		__get_str(dev_name), __get_str(bus_type),
 		__entry->severity == AER_CORRECTABLE ? "Corrected" :
 			__entry->severity == AER_FATAL ?
 			"Fatal" : "Uncorrected, non-fatal",
-- 
2.34.1



* [RESEND v13 10/25] cxl/pci: Update RAS handler interfaces to also support CXL Ports
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (8 preceding siblings ...)
  2025-11-04 17:02 ` [RESEND v13 09/25] PCI/AER: Report CXL or PCIe bus error type in trace logging Terry Bowman
@ 2025-11-04 17:02 ` Terry Bowman
  2025-11-04 18:10   ` Jonathan Cameron
                     ` (2 more replies)
  2025-11-04 17:02 ` [RESEND v13 11/25] cxl/pci: Log message if RAS registers are unmapped Terry Bowman
                   ` (15 subsequent siblings)
  25 siblings, 3 replies; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:02 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

CXL PCIe Port Protocol Error handling support will be added to the
CXL drivers in the future. In preparation, update the existing
interfaces to support handling all CXL PCIe Port Protocol Errors.

The driver's RAS support functions currently rely on a 'struct
cxl_dev_state' type parameter, which is not available for CXL Port
devices. However, since the same CXL RAS capability structure is
needed across most CXL components and devices [1], a common handling
approach should be adopted.

To accommodate this, update the cxl_handle_cor_ras() and
cxl_handle_ras() functions to use a `struct device` instead of
`struct cxl_dev_state`.
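
The interface change can be pictured with the small user-space sketch
below. It assumes that converting a generic device back to its memdev is a
container_of-style lookup on an embedded member; demo_device, demo_memdev,
demo_to_memdev() and demo_handle_cor_ras() are hypothetical names chosen
for illustration and are not the driver's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical generic device, embedded in more specific objects. */
struct demo_device {
	const char *name;
};

/* Hypothetical memory device that embeds the generic device. */
struct demo_memdev {
	int id;
	struct demo_device dev;
};

/* Recover the enclosing structure from a pointer to its member. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct demo_memdev *demo_to_memdev(struct demo_device *dev)
{
	return demo_container_of(dev, struct demo_memdev, dev);
}

/*
 * The handler takes the generic device, so callers that have no
 * memdev-specific state can share the same signature. Returns the id
 * of the memdev that would be traced, or -1 when no status bit is set.
 */
static int demo_handle_cor_ras(struct demo_device *dev, unsigned int status)
{
	assert(dev);
	if (!status)
		return -1;
	return demo_to_memdev(dev)->id;
}
```

Note that the conversion is only valid when the device really is embedded
in a memdev, so a handler shared with Port devices has to decide which
kind of device it was handed before converting.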

No functional changes are introduced.

[1] CXL 3.1 Spec, 8.2.4 CXL.cache and CXL.mem Registers

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Alejandro Lucero <alucerop@amd.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>

---

Changes in v12->v13:
- Added Ben's review-by
---
 drivers/cxl/core/core.h    | 15 ++++++---------
 drivers/cxl/core/ras.c     | 12 ++++++------
 drivers/cxl/core/ras_rch.c |  4 ++--
 3 files changed, 14 insertions(+), 17 deletions(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index c30ab7c25a92..1a419b35fa59 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -7,6 +7,7 @@
 #include <linux/pci.h>
 #include <cxl/mailbox.h>
 #include <linux/rwsem.h>
+#include <linux/pci.h>
 
 extern const struct device_type cxl_nvdimm_bridge_type;
 extern const struct device_type cxl_nvdimm_type;
@@ -148,23 +149,19 @@ int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
 #ifdef CONFIG_CXL_RAS
 int cxl_ras_init(void);
 void cxl_ras_exit(void);
-bool cxl_handle_ras(struct cxl_dev_state *cxlds, void __iomem *ras_base);
-void cxl_handle_cor_ras(struct cxl_dev_state *cxlds, void __iomem *ras_base);
+bool cxl_handle_ras(struct device *dev, void __iomem *ras_base);
+void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base);
 #else
 static inline int cxl_ras_init(void)
 {
 	return 0;
 }
-
-static inline void cxl_ras_exit(void)
-{
-}
-
-static inline bool cxl_handle_ras(struct cxl_dev_state *cxlds, void __iomem *ras_base)
+static inline void cxl_ras_exit(void) { }
+static inline bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
 {
 	return false;
 }
-static inline void cxl_handle_cor_ras(struct cxl_dev_state *cxlds, void __iomem *ras_base) { }
+static inline void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base) { }
 #endif /* CONFIG_CXL_RAS */
 
 /* Restricted CXL Host specific RAS functions */
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index b933030b8e1e..72908f3ced77 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -160,7 +160,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
 
-void cxl_handle_cor_ras(struct cxl_dev_state *cxlds, void __iomem *ras_base)
+void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base)
 {
 	void __iomem *addr;
 	u32 status;
@@ -172,7 +172,7 @@ void cxl_handle_cor_ras(struct cxl_dev_state *cxlds, void __iomem *ras_base)
 	status = readl(addr);
 	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
 		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
-		trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
+		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
 	}
 }
 
@@ -197,7 +197,7 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
  * Log the state of the RAS status registers and prepare them to log the
  * next error status. Return 1 if reset needed.
  */
-bool cxl_handle_ras(struct cxl_dev_state *cxlds, void __iomem *ras_base)
+bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
 {
 	u32 hl[CXL_HEADERLOG_SIZE_U32];
 	void __iomem *addr;
@@ -224,7 +224,7 @@ bool cxl_handle_ras(struct cxl_dev_state *cxlds, void __iomem *ras_base)
 	}
 
 	header_log_copy(ras_base, hl);
-	trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe, hl);
+	trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
 	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
 
 	return true;
@@ -246,7 +246,7 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
 		if (cxlds->rcd)
 			cxl_handle_rdport_errors(cxlds);
 
-		cxl_handle_cor_ras(cxlds, cxlds->regs.ras);
+		cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
 	}
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
@@ -275,7 +275,7 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 		 * chance the situation is recoverable dump the status of the RAS
 		 * capability registers and bounce the active state of the memdev.
 		 */
-		ue = cxl_handle_ras(cxlds, cxlds->regs.ras);
+		ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
 	}
 
 
diff --git a/drivers/cxl/core/ras_rch.c b/drivers/cxl/core/ras_rch.c
index f6de5492a8b7..4d2babe8d206 100644
--- a/drivers/cxl/core/ras_rch.c
+++ b/drivers/cxl/core/ras_rch.c
@@ -114,7 +114,7 @@ void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds)
 
 	pci_print_aer(pdev, severity, &aer_regs);
 	if (severity == AER_CORRECTABLE)
-		cxl_handle_cor_ras(cxlds, dport->regs.ras);
+		cxl_handle_cor_ras(&cxlds->cxlmd->dev, dport->regs.ras);
 	else
-		cxl_handle_ras(cxlds, dport->regs.ras);
+		cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [RESEND v13 11/25] cxl/pci: Log message if RAS registers are unmapped
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (9 preceding siblings ...)
  2025-11-04 17:02 ` [RESEND v13 10/25] cxl/pci: Update RAS handler interfaces to also support CXL Ports Terry Bowman
@ 2025-11-04 17:02 ` Terry Bowman
  2025-11-19  3:27   ` dan.j.williams
  2025-11-04 17:02 ` [RESEND v13 12/25] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports Terry Bowman
                   ` (14 subsequent siblings)
  25 siblings, 1 reply; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:02 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

The CXL RAS handlers do not currently log if the RAS registers are
unmapped. This is needed in order to help debug CXL error handling. Update
the CXL driver to log a warning message if the RAS register block is
unmapped during RAS error handling.
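
The guard described above can be sketched in userspace as follows. This is a minimal illustration, not the driver code: warn_once(), warn_count and handle_ras() are hypothetical names, and the stand-in warns once per process, whereas the kernel's dev_warn_once() warns once per call site.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical counter so the "once" behaviour can be observed. */
static int warn_count;

/* Userspace stand-in for dev_warn_once(): prints on the first call only. */
static void warn_once(const char *dev_name, const char *msg)
{
	static bool warned;

	if (warned)
		return;
	warned = true;
	warn_count++;
	fprintf(stderr, "%s: %s\n", dev_name, msg);
}

/*
 * Returns true if the (simulated) status word reports an error.
 * A NULL base models an unmapped RAS register block: warn and bail out
 * before any register access is attempted.
 */
static bool handle_ras(const char *dev_name, const uint32_t *ras_base)
{
	assert(dev_name);

	if (!ras_base) {
		warn_once(dev_name, "CXL RAS register block is not mapped");
		return false;
	}

	return *ras_base != 0;
}
```

Repeated calls with a NULL base return false each time but emit the message only once, which keeps a repeatedly firing handler from flooding the log.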

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>

---

Changes in v12->v13:
- Added Ben's review-by
---
 drivers/cxl/core/ras.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 72908f3ced77..0320c391f201 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -165,8 +165,10 @@ void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base)
 	void __iomem *addr;
 	u32 status;
 
-	if (!ras_base)
+	if (!ras_base) {
+		dev_warn_once(dev, "CXL RAS register block is not mapped\n");
 		return;
+	}
 
 	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
 	status = readl(addr);
@@ -204,8 +206,10 @@ bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
 	u32 status;
 	u32 fe;
 
-	if (!ras_base)
+	if (!ras_base) {
+		dev_warn_once(dev, "CXL RAS register block is not mapped\n");
 		return false;
+	}
 
 	addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
 	status = readl(addr);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [RESEND v13 12/25] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (10 preceding siblings ...)
  2025-11-04 17:02 ` [RESEND v13 11/25] cxl/pci: Log message if RAS registers are unmapped Terry Bowman
@ 2025-11-04 17:02 ` Terry Bowman
  2025-11-19 21:23   ` dan.j.williams
  2025-11-04 17:02 ` [RESEND v13 13/25] cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors Terry Bowman
                   ` (13 subsequent siblings)
  25 siblings, 1 reply; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:02 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

CXL currently has separate trace routines for CXL Port errors and CXL
Endpoint errors. This is inconvenient for the user because they must enable
2 sets of trace routines. Make updates to the trace logging such that a
single trace routine logs both CXL Endpoint and CXL Port protocol errors.

Keep the trace log fields 'memdev' and 'host'. While these are not accurate
for non-Endpoints, the fields will remain as-is to prevent breaking
userspace RAS trace consumers.

Add a serial number parameter to the trace logging. This is used for EPs
and 0 is provided for CXL Port devices without a serial number.

Leave the correctable and uncorrectable trace routines' TP_STRUCT__entry()
unchanged with respect to member data types and order.

Below is output of correctable and uncorrectable protocol error logging.
CXL Root Port and CXL Endpoint examples are included below.

Root Port:
cxl_aer_correctable_error: memdev=0000:0c:00.0 host=pci0000:0c serial: 0 status='CRC Threshold Hit'
cxl_aer_uncorrectable_error: memdev=0000:0c:00.0 host=pci0000:0c serial: 0 status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'

Endpoint:
cxl_aer_correctable_error: memdev=mem3 host=0000:0f:00.0 serial=0 status='CRC Threshold Hit'
cxl_aer_uncorrectable_error: memdev=mem3 host=0000:0f:00.0 serial: 0 status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
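
As a rough userspace sketch of the unified interface: one routine takes a generic device plus a caller-supplied serial number, so the same code path serves Endpoints and Ports. The names struct fake_device and format_cor_event() are hypothetical, the status value in any caller is arbitrary, and the sketch prints the raw status in hex rather than the decoded error names the real trace event shows.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct device. */
struct fake_device {
	const char *name;
	const struct fake_device *parent;
};

/*
 * Format one correctable-error record. The device name fills 'memdev' and
 * the parent name fills 'host', whatever kind of device is passed in.
 * Callers pass the Endpoint serial number, or 0 for a Port.
 */
static int format_cor_event(char *buf, size_t len,
			    const struct fake_device *dev,
			    uint32_t status, uint64_t serial)
{
	assert(buf && dev && dev->parent);

	return snprintf(buf, len,
			"memdev=%s host=%s serial=%" PRIu64 " status=0x%08" PRIx32,
			dev->name, dev->parent->name, serial, status);
}
```

Passing the serial as a parameter, instead of dereferencing a memdev inside the routine, is what lets a device with no memdev state use the same event.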

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

---

Changes in v12->v13:
- Added Dave Jiang's review-by

Changes in v11 -> v12:
- Correct parameters to call trace_cxl_aer_correctable_error()
- Add reviewed-by for Jonathan and Shiju

Changes in v10->v11:
- Updated CE and UCE trace routines to maintain consistent TP_Struct ABI
and unchanged TP_printk() logging.
---
 drivers/cxl/core/core.h    |  4 +--
 drivers/cxl/core/ras.c     | 26 ++++++++-------
 drivers/cxl/core/ras_rch.c |  4 +--
 drivers/cxl/core/trace.h   | 68 ++++++--------------------------------
 4 files changed, 29 insertions(+), 73 deletions(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 1a419b35fa59..e47ae7365ce0 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -149,8 +149,8 @@ int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
 #ifdef CONFIG_CXL_RAS
 int cxl_ras_init(void);
 void cxl_ras_exit(void);
-bool cxl_handle_ras(struct device *dev, void __iomem *ras_base);
-void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base);
+bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base);
+void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base);
 #else
 static inline int cxl_ras_init(void)
 {
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 0320c391f201..599c88f0b376 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -13,7 +13,7 @@ static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
 {
 	u32 status = ras_cap.cor_status & ~ras_cap.cor_mask;
 
-	trace_cxl_port_aer_correctable_error(&pdev->dev, status);
+	trace_cxl_aer_correctable_error(&pdev->dev, status, 0);
 }
 
 static void cxl_cper_trace_uncorr_port_prot_err(struct pci_dev *pdev,
@@ -28,8 +28,8 @@ static void cxl_cper_trace_uncorr_port_prot_err(struct pci_dev *pdev,
 	else
 		fe = status;
 
-	trace_cxl_port_aer_uncorrectable_error(&pdev->dev, status, fe,
-					       ras_cap.header_log);
+	trace_cxl_aer_uncorrectable_error(&pdev->dev, status, fe,
+					  ras_cap.header_log, 0);
 }
 
 static void cxl_cper_trace_corr_prot_err(struct cxl_memdev *cxlmd,
@@ -37,7 +37,7 @@ static void cxl_cper_trace_corr_prot_err(struct cxl_memdev *cxlmd,
 {
 	u32 status = ras_cap.cor_status & ~ras_cap.cor_mask;
 
-	trace_cxl_aer_correctable_error(cxlmd, status);
+	trace_cxl_aer_correctable_error(&cxlmd->dev, status, cxlmd->cxlds->serial);
 }
 
 static void
@@ -45,6 +45,7 @@ cxl_cper_trace_uncorr_prot_err(struct cxl_memdev *cxlmd,
 			       struct cxl_ras_capability_regs ras_cap)
 {
 	u32 status = ras_cap.uncor_status & ~ras_cap.uncor_mask;
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 	u32 fe;
 
 	if (hweight32(status) > 1)
@@ -53,8 +54,9 @@ cxl_cper_trace_uncorr_prot_err(struct cxl_memdev *cxlmd,
 	else
 		fe = status;
 
-	trace_cxl_aer_uncorrectable_error(cxlmd, status, fe,
-					  ras_cap.header_log);
+	trace_cxl_aer_uncorrectable_error(&cxlmd->dev, status, fe,
+					  ras_cap.header_log,
+					  cxlds->serial);
 }
 
 static int match_memdev_by_parent(struct device *dev, const void *uport)
@@ -160,7 +162,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
 
-void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base)
+void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
 {
 	void __iomem *addr;
 	u32 status;
@@ -174,7 +176,7 @@ void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base)
 	status = readl(addr);
 	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
 		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
-		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
+		trace_cxl_aer_correctable_error(dev, status, serial);
 	}
 }
 
@@ -199,7 +201,7 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
  * Log the state of the RAS status registers and prepare them to log the
  * next error status. Return 1 if reset needed.
  */
-bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
+bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
 {
 	u32 hl[CXL_HEADERLOG_SIZE_U32];
 	void __iomem *addr;
@@ -228,7 +230,7 @@ bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
 	}
 
 	header_log_copy(ras_base, hl);
-	trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
+	trace_cxl_aer_uncorrectable_error(dev, status, fe, hl, serial);
 	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
 
 	return true;
@@ -250,7 +252,7 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
 		if (cxlds->rcd)
 			cxl_handle_rdport_errors(cxlds);
 
-		cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
+		cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->serial, cxlds->regs.ras);
 	}
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
@@ -279,7 +281,7 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 		 * chance the situation is recoverable dump the status of the RAS
 		 * capability registers and bounce the active state of the memdev.
 		 */
-		ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
+		ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial, cxlds->regs.ras);
 	}
 
 
diff --git a/drivers/cxl/core/ras_rch.c b/drivers/cxl/core/ras_rch.c
index 4d2babe8d206..421dd1bcfc9c 100644
--- a/drivers/cxl/core/ras_rch.c
+++ b/drivers/cxl/core/ras_rch.c
@@ -114,7 +114,7 @@ void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds)
 
 	pci_print_aer(pdev, severity, &aer_regs);
 	if (severity == AER_CORRECTABLE)
-		cxl_handle_cor_ras(&cxlds->cxlmd->dev, dport->regs.ras);
+		cxl_handle_cor_ras(&cxlds->cxlmd->dev, 0, dport->regs.ras);
 	else
-		cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
+		cxl_handle_ras(&cxlds->cxlmd->dev, 0, dport->regs.ras);
 }
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index a972e4ef1936..69f8a0efd924 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -48,40 +48,13 @@
 	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
 )
 
-TRACE_EVENT(cxl_port_aer_uncorrectable_error,
-	TP_PROTO(struct device *dev, u32 status, u32 fe, u32 *hl),
-	TP_ARGS(dev, status, fe, hl),
-	TP_STRUCT__entry(
-		__string(device, dev_name(dev))
-		__string(host, dev_name(dev->parent))
-		__field(u32, status)
-		__field(u32, first_error)
-		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
-	),
-	TP_fast_assign(
-		__assign_str(device);
-		__assign_str(host);
-		__entry->status = status;
-		__entry->first_error = fe;
-		/*
-		 * Embed the 512B headerlog data for user app retrieval and
-		 * parsing, but no need to print this in the trace buffer.
-		 */
-		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
-	),
-	TP_printk("device=%s host=%s status: '%s' first_error: '%s'",
-		  __get_str(device), __get_str(host),
-		  show_uc_errs(__entry->status),
-		  show_uc_errs(__entry->first_error)
-	)
-);
-
 TRACE_EVENT(cxl_aer_uncorrectable_error,
-	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
-	TP_ARGS(cxlmd, status, fe, hl),
+	TP_PROTO(const struct device *cxlmd, u32 status, u32 fe, u32 *hl,
+		 u64 serial),
+	TP_ARGS(cxlmd, status, fe, hl, serial),
 	TP_STRUCT__entry(
-		__string(memdev, dev_name(&cxlmd->dev))
-		__string(host, dev_name(cxlmd->dev.parent))
+		__string(memdev, dev_name(cxlmd))
+		__string(host, dev_name(cxlmd->parent))
 		__field(u64, serial)
 		__field(u32, status)
 		__field(u32, first_error)
@@ -90,7 +63,7 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
 	TP_fast_assign(
 		__assign_str(memdev);
 		__assign_str(host);
-		__entry->serial = cxlmd->cxlds->serial;
+		__entry->serial = serial;
 		__entry->status = status;
 		__entry->first_error = fe;
 		/*
@@ -124,38 +97,19 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
 	{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" }	\
 )
 
-TRACE_EVENT(cxl_port_aer_correctable_error,
-	TP_PROTO(struct device *dev, u32 status),
-	TP_ARGS(dev, status),
-	TP_STRUCT__entry(
-		__string(device, dev_name(dev))
-		__string(host, dev_name(dev->parent))
-		__field(u32, status)
-	),
-	TP_fast_assign(
-		__assign_str(device);
-		__assign_str(host);
-		__entry->status = status;
-	),
-	TP_printk("device=%s host=%s status='%s'",
-		  __get_str(device), __get_str(host),
-		  show_ce_errs(__entry->status)
-	)
-);
-
 TRACE_EVENT(cxl_aer_correctable_error,
-	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
-	TP_ARGS(cxlmd, status),
+	TP_PROTO(const struct device *cxlmd, u32 status, u64 serial),
+	TP_ARGS(cxlmd, status, serial),
 	TP_STRUCT__entry(
-		__string(memdev, dev_name(&cxlmd->dev))
-		__string(host, dev_name(cxlmd->dev.parent))
+		__string(memdev, dev_name(cxlmd))
+		__string(host, dev_name(cxlmd->parent))
 		__field(u64, serial)
 		__field(u32, status)
 	),
 	TP_fast_assign(
 		__assign_str(memdev);
 		__assign_str(host);
-		__entry->serial = cxlmd->cxlds->serial;
+		__entry->serial = serial;
 		__entry->status = status;
 	),
 	TP_printk("memdev=%s host=%s serial=%lld: status: '%s'",
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [RESEND v13 13/25] cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (11 preceding siblings ...)
  2025-11-04 17:02 ` [RESEND v13 12/25] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports Terry Bowman
@ 2025-11-04 17:02 ` Terry Bowman
  2025-11-05  8:30   ` Alejandro Lucero Palau
  2025-11-19 22:00   ` dan.j.williams
  2025-11-04 17:02 ` [RESEND v13 14/25] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers Terry Bowman
                   ` (12 subsequent siblings)
  25 siblings, 2 replies; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:02 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Update cxl_handle_cor_ras() to exit early when no RAS errors are detected
after applying the status mask. This change makes the correctable
handler's implementation consistent with the uncorrectable handler,
cxl_handle_ras().
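
The early-return shape can be illustrated with a simulated register. This is a sketch under assumptions: COR_STATUS_MASK is a made-up value (the real CXL_RAS_CORRECTABLE_STATUS_MASK lives in the CXL driver headers), struct w1c_reg models write-1-to-clear semantics in plain memory, and the 'traced' output stands in for the trace call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mask, chosen only for this illustration. */
#define COR_STATUS_MASK 0x7fu

/* Simulated write-1-to-clear status register. */
struct w1c_reg {
	uint32_t value;
};

static uint32_t reg_read(const struct w1c_reg *reg)
{
	return reg->value;
}

/* Writing a 1 to a bit position clears that bit. */
static void reg_write(struct w1c_reg *reg, uint32_t bits)
{
	reg->value &= ~bits;
}

/*
 * Returns true if an error was handled. When no masked status bits are
 * set, return before touching the register or reporting anything.
 */
static bool handle_cor(struct w1c_reg *reg, uint32_t *traced)
{
	uint32_t status;

	assert(reg && traced);

	status = reg_read(reg);
	if (!(status & COR_STATUS_MASK))
		return false;

	reg_write(reg, status & COR_STATUS_MASK);
	*traced = status;	/* stand-in for the trace event */
	return true;
}
```

Bits outside the mask are left untouched in both paths, and the reported value is the full status as read, matching the order of operations in the hunk below.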

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>

---

Changes v12->v13:
- Added Ben's review-by

Changes v11->v12:
- None

Changes v10->v11:
- Added Dave Jiang and Jonathan Cameron's review-by
- Changes moved to core/ras.c
---
 drivers/cxl/core/ras.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 599c88f0b376..246dfe56617a 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -174,10 +174,11 @@ void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
 
 	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
 	status = readl(addr);
-	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
-		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
-		trace_cxl_aer_correctable_error(dev, status, serial);
-	}
+	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
+		return;
+	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
+
+	trace_cxl_aer_correctable_error(dev, status, serial);
 }
 
 /* CXL spec rev3.0 8.2.4.16.1 */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [RESEND v13 14/25] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (12 preceding siblings ...)
  2025-11-04 17:02 ` [RESEND v13 13/25] cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors Terry Bowman
@ 2025-11-04 17:02 ` Terry Bowman
  2025-11-04 18:15   ` Jonathan Cameron
                     ` (2 more replies)
  2025-11-04 17:02 ` [RESEND v13 15/25] CXL/PCI: Introduce PCI_ERS_RESULT_PANIC Terry Bowman
                   ` (11 subsequent siblings)
  25 siblings, 3 replies; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:02 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

CXL Endpoint (EP) Ports may include Root Ports (RP) or Downstream Switch
Ports (DSP). CXL RPs and DSPs contain RAS registers that require memory
mapping to enable RAS logging. This initialization is currently missing and
must be added for CXL RPs and DSPs.

Update cxl_dport_init_ras_reporting() to support RP and DSP RAS mapping.
Add it alongside the existing Restricted CXL Host Downstream Port RAS mapping.

Update cxl_endpoint_port_probe() to invoke cxl_dport_init_ras_reporting().
This will initiate the RAS mapping for CXL RPs and DSPs when each CXL EP is
created and added to the EP port.

Make a call to cxl_port_setup_regs() in cxl_port_add(). This will probe the
Upstream Port's CXL capabilities' physical location to be used in mapping
the RAS registers.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>

---

Changes in v12->v13:
- Change as result of dport delay fix. No longer need switchport and
endport approach. (Terry)

Changes in v11->v12:
- Add check for dport_parent->rch before calling cxl_dport_init_ras_reporting().
RCH dports are initialized by cxl_dport_init_ras_reporting() from cxl_mem_probe().

Changes in v10->v11:
- Use local pointer for readability in cxl_switch_port_init_ras() (Jonathan Cameron)
- Rename port to be ep in cxl_endpoint_port_init_ras() (Dave Jiang)
- Rename dport to be parent_dport in cxl_endpoint_port_init_ras()
  and cxl_switch_port_init_ras() (Dave Jiang)
- Port helper changes were in cxl/port.c, now in core/ras.c (Dave Jiang)
---
 drivers/cxl/core/port.c |  4 ++++
 drivers/cxl/core/ras.c  | 12 ++++++++++++
 drivers/cxl/cxl.h       |  2 ++
 drivers/cxl/cxlpci.h    |  4 ++++
 drivers/cxl/mem.c       |  3 ++-
 5 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 8128fd2b5b31..48f6a1492544 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1194,6 +1194,8 @@ __devm_cxl_add_dport(struct cxl_port *port, struct device *dport_dev,
 			return ERR_PTR(rc);
 		}
 		port->component_reg_phys = CXL_RESOURCE_NONE;
+		if (!is_cxl_endpoint(port) && dev_is_pci(port->uport_dev))
+			cxl_uport_init_ras_reporting(port, &port->dev);
 	}
 
 	get_device(dport_dev);
@@ -1623,6 +1625,8 @@ static struct cxl_dport *cxl_port_add_dport(struct cxl_port *port,
 
 	cxl_switch_parse_cdat(new_dport);
 
+	cxl_dport_init_ras_reporting(new_dport, &port->dev);
+
 	if (ida_is_empty(&port->decoder_ida)) {
 		rc = devm_cxl_switch_port_decoders_setup(port);
 		if (rc)
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 246dfe56617a..19d9ffe885bf 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -162,6 +162,18 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
 
+void cxl_uport_init_ras_reporting(struct cxl_port *port,
+				  struct device *host)
+{
+	struct cxl_register_map *map = &port->reg_map;
+
+	map->host = host;
+	if (cxl_map_component_regs(map, &port->uport_regs,
+				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
+		dev_dbg(&port->dev, "Failed to map RAS capability\n");
+}
+EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, "CXL");
+
 void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
 {
 	void __iomem *addr;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 259ed4b676e1..b7654d40dc9e 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -599,6 +599,7 @@ struct cxl_dax_region {
  * @parent_dport: dport that points to this port in the parent
  * @decoder_ida: allocator for decoder ids
  * @reg_map: component and ras register mapping parameters
+ * @uport_regs: mapped component registers
  * @nr_dports: number of entries in @dports
  * @hdm_end: track last allocated HDM decoder instance for allocation ordering
  * @commit_end: cursor to track highest committed decoder for commit ordering
@@ -620,6 +621,7 @@ struct cxl_port {
 	struct cxl_dport *parent_dport;
 	struct ida decoder_ida;
 	struct cxl_register_map reg_map;
+	struct cxl_component_regs uport_regs;
 	int nr_dports;
 	int hdm_end;
 	int commit_end;
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 0c8b6ee7b6de..a0a491e7b5b9 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -83,6 +83,8 @@ void cxl_cor_error_detected(struct pci_dev *pdev);
 pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 				    pci_channel_state_t state);
 void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host);
+void cxl_uport_init_ras_reporting(struct cxl_port *port,
+				  struct device *host);
 #else
 static inline void cxl_cor_error_detected(struct pci_dev *pdev) { }
 
@@ -94,6 +96,8 @@ static inline pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 
 static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport,
 						struct device *host) { }
+static inline void cxl_uport_init_ras_reporting(struct cxl_port *port,
+						struct device *host) { }
 #endif
 
 #endif /* __CXL_PCI_H__ */
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 6e6777b7bafb..d2155f45240d 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -166,7 +166,8 @@ static int cxl_mem_probe(struct device *dev)
 	else
 		endpoint_parent = &parent_port->dev;
 
-	cxl_dport_init_ras_reporting(dport, dev);
+	if (dport->rch)
+		cxl_dport_init_ras_reporting(dport, dev);
 
 	scoped_guard(device, endpoint_parent) {
 		if (!endpoint_parent->driver) {
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [RESEND v13 15/25] CXL/PCI: Introduce PCI_ERS_RESULT_PANIC
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (13 preceding siblings ...)
  2025-11-04 17:02 ` [RESEND v13 14/25] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers Terry Bowman
@ 2025-11-04 17:02 ` Terry Bowman
  2025-11-04 19:03   ` Bjorn Helgaas
  2025-11-20  0:17   ` dan.j.williams
  2025-11-04 17:02 ` [RESEND v13 16/25] CXL/AER: Introduce pcie/aer_cxl_vh.c in AER driver for forwarding CXL errors Terry Bowman
                   ` (10 subsequent siblings)
  25 siblings, 2 replies; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:02 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

The CXL driver's error handling for uncorrectable errors (UCE) will be
updated in the future. A required change is for the error handlers to
force a system panic when a UCE is detected.

Introduce PCI_ERS_RESULT_PANIC as an 'enum pci_ers_result' value. This will
be used by CXL UCE fatal and non-fatal recovery in future patches. Update
PCIe recovery documentation with details of PCI_ERS_RESULT_PANIC.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>

---

Changes in v12->v13:
- Add Dave Jiang's, Jonathan's, Ben's review-by
- Typo fix (Ben)

Changes v11 -> v12:
- Documentation requested (Lukas)
---
 Documentation/PCI/pci-error-recovery.rst | 6 ++++++
 include/linux/pci.h                      | 3 +++
 2 files changed, 9 insertions(+)

diff --git a/Documentation/PCI/pci-error-recovery.rst b/Documentation/PCI/pci-error-recovery.rst
index 5df481ac6193..83505a585116 100644
--- a/Documentation/PCI/pci-error-recovery.rst
+++ b/Documentation/PCI/pci-error-recovery.rst
@@ -102,6 +102,8 @@ Possible return values are::
 		PCI_ERS_RESULT_NEED_RESET,  /* Device driver wants slot to be reset. */
 		PCI_ERS_RESULT_DISCONNECT,  /* Device has completely failed, is unrecoverable */
 		PCI_ERS_RESULT_RECOVERED,   /* Device driver is fully recovered and operational */
+		PCI_ERS_RESULT_NO_AER_DRIVER, /* No AER capabilities registered for the driver */
+		PCI_ERS_RESULT_PANIC,       /* System is unstable, panic. Is CXL specific */
 	};
 
 A driver does not have to implement all of these callbacks; however,
@@ -116,6 +118,10 @@ The actual steps taken by a platform to recover from a PCI error
 event will be platform-dependent, but will follow the general
 sequence described below.
 
+PCI_ERS_RESULT_PANIC is currently unique to CXL and handled in CXL
+cxl_do_recovery(). The PCI pcie_do_recovery() routine does not report or
+handle PCI_ERS_RESULT_PANIC.
+
 STEP 0: Error Event
 -------------------
 A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 5c4759078d2f..cffa5535f28d 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -890,6 +890,9 @@ enum pci_ers_result {
 
 	/* No AER capabilities registered for the driver */
 	PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6,
+
+	/* System is unstable, panic. Is CXL specific */
+	PCI_ERS_RESULT_PANIC = (__force pci_ers_result_t) 7,
 };
 
 /* PCI bus error event callbacks */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [RESEND v13 16/25] CXL/AER: Introduce pcie/aer_cxl_vh.c in AER driver for forwarding CXL errors
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (14 preceding siblings ...)
  2025-11-04 17:02 ` [RESEND v13 15/25] CXL/PCI: Introduce PCI_ERS_RESULT_PANIC Terry Bowman
@ 2025-11-04 17:02 ` Terry Bowman
  2025-11-20  0:44   ` dan.j.williams
  2025-11-20  0:53   ` dan.j.williams
  2025-11-04 17:02 ` [RESEND v13 17/25] cxl: Introduce cxl_pci_drv_bound() to check for bound driver Terry Bowman
                   ` (9 subsequent siblings)
  25 siblings, 2 replies; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:02 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

CXL virtual hierarchy (VH) RAS handling for CXL Port devices will be added
soon. This requires a notification mechanism for the AER driver to share
the AER interrupt with the CXL driver. The notification will be used as an
indication for the CXL drivers to handle and log the CXL RAS errors.

Note, 'CXL protocol error' terminology will refer to CXL VH and not
CXL RCH errors unless specifically noted going forward.

Introduce a new file in the AER driver to handle the CXL protocol errors
named pci/pcie/aer_cxl_vh.c.

Add a kfifo work queue to be used by the AER and CXL drivers. The AER
driver will be the sole kfifo producer adding work and the cxl_core will be
the sole kfifo consumer removing work. Add the boilerplate kfifo support.
Encapsulate the kfifo, RW semaphore, and work pointer in a single structure.

Add CXL work queue handler registration functions in the AER driver and
export them for use by the CXL driver. The CXL driver calls them to assign
or clear the work handler function. Synchronize accesses using the RW
semaphore.

Introduce 'struct cxl_proto_err_work_data' to serve as the kfifo work data.
This will contain a reference to the erring PCI device and the error
severity. This will be used when the work is dequeued by the cxl_core driver.
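
For illustration only (not part of this patch), the producer/consumer flow
described above can be modeled in userspace C. This is a minimal
single-threaded sketch: struct work_data, struct err_fifo, err_fifo_put()
and err_fifo_get() are hypothetical stand-ins for the kernel kfifo and
struct cxl_proto_err_work_data, and the locking and work scheduling are
left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Power of two so the index arithmetic stays consistent on wraparound. */
#define FIFO_SIZE 128

/* Userspace stand-in for struct cxl_proto_err_work_data. */
struct work_data {
	int severity;	/* AER severity of the error */
	void *pdev;	/* opaque handle for the erring device */
};

struct err_fifo {
	struct work_data slots[FIFO_SIZE];
	unsigned int in;	/* advanced only by the producer */
	unsigned int out;	/* advanced only by the consumer */
};

/* Producer side: returns false when the FIFO is full (overflow). */
static bool err_fifo_put(struct err_fifo *f, struct work_data wd)
{
	if (f->in - f->out == FIFO_SIZE)
		return false;
	f->slots[f->in % FIFO_SIZE] = wd;
	f->in++;
	return true;
}

/* Consumer side: returns false when there is nothing left to dequeue. */
static bool err_fifo_get(struct err_fifo *f, struct work_data *wd)
{
	assert(wd != NULL);
	if (f->in == f->out)
		return false;
	*wd = f->slots[f->out % FIFO_SIZE];
	f->out++;
	return true;
}
```

In this model a failed put corresponds to the "kfifo overflow" case in
cxl_forward_error(), where the error is reported and dropped rather than
blocking the AER path.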

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

---

Changes in v12->v13:
- Added Dave Jiang's Reviewed-by
- Update error message (Ben)

Changes in v11->v12:
- None

Changes in v10->v11:
- cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
- cxl_error_detected() - Remove extra line (Shiju)
- Changes moved to core/ras.c (Terry)
- cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
- Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
- Move #include "pci.h" from cxl.h to core.h (Terry)
- Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)
---
 drivers/pci/pci.h             |  4 ++
 drivers/pci/pcie/Makefile     |  1 +
 drivers/pci/pcie/aer.c        | 25 ++-------
 drivers/pci/pcie/aer_cxl_vh.c | 95 +++++++++++++++++++++++++++++++++++
 include/linux/aer.h           | 17 +++++++
 5 files changed, 121 insertions(+), 21 deletions(-)
 create mode 100644 drivers/pci/pcie/aer_cxl_vh.c

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 446251892bb7..a398e489318c 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -1330,8 +1330,12 @@ static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
 
 #ifdef CONFIG_CXL_RAS
 bool is_internal_error(struct aer_err_info *info);
+bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info);
+void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info);
 #else
 static inline bool is_internal_error(struct aer_err_info *info) { return false; }
+static inline bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info) { return false; }
+static inline void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info) { }
 #endif
 
 #endif /* DRIVERS_PCI_H */
diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index 970e7cbc5b34..72992b3ea417 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -9,6 +9,7 @@ obj-$(CONFIG_PCIEPORTBUS)	+= pcieportdrv.o bwctrl.o
 obj-y				+= aspm.o
 obj-$(CONFIG_PCIEAER)		+= aer.o err.o tlp.o
 obj-$(CONFIG_CXL_RCH_RAS)	+= aer_cxl_rch.o
+obj-$(CONFIG_CXL_RAS)		+= aer_cxl_vh.o
 obj-$(CONFIG_PCIEAER_INJECT)	+= aer_inject.o
 obj-$(CONFIG_PCIE_PME)		+= pme.o
 obj-$(CONFIG_PCIE_DPC)		+= dpc.o
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 39e99f438563..e806fa05280b 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1123,8 +1123,6 @@ static bool find_source_device(struct pci_dev *parent,
 	return true;
 }
 
-#ifdef CONFIG_PCIEAER_CXL
-
 /**
  * pci_aer_unmask_internal_errors - unmask internal errors
  * @dev: pointer to the pci_dev data structure
@@ -1150,24 +1148,6 @@ void pci_aer_unmask_internal_errors(struct pci_dev *dev)
 }
 EXPORT_SYMBOL_GPL(pci_aer_unmask_internal_errors);
 
-bool cxl_error_is_native(struct pci_dev *dev)
-{
-	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
-
-	return (pcie_ports_native || host->native_aer);
-}
-EXPORT_SYMBOL_NS_GPL(cxl_error_is_native, "CXL");
-
-bool is_internal_error(struct aer_err_info *info)
-{
-	if (info->severity == AER_CORRECTABLE)
-		return info->status & PCI_ERR_COR_INTERNAL;
-
-	return info->status & PCI_ERR_UNC_INTN;
-}
-EXPORT_SYMBOL_NS_GPL(is_internal_error, "CXL");
-#endif /* CONFIG_CXL_RAS */
-
 /**
  * pci_aer_handle_error - handle logging error into an event log
  * @dev: pointer to pci_dev data structure of error source device
@@ -1204,7 +1184,10 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
 {
 	cxl_rch_handle_error(dev, info);
-	pci_aer_handle_error(dev, info);
+	if (is_cxl_error(dev, info))
+		cxl_forward_error(dev, info);
+	else
+		pci_aer_handle_error(dev, info);
 	pci_dev_put(dev);
 }
 
diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
new file mode 100644
index 000000000000..5dbc81341dc4
--- /dev/null
+++ b/drivers/pci/pcie/aer_cxl_vh.c
@@ -0,0 +1,95 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
+
+#include <linux/pci.h>
+#include <linux/aer.h>
+#include <linux/pci.h>
+#include <linux/bitfield.h>
+#include <linux/kfifo.h>
+#include "../pci.h"
+
+#define CXL_ERROR_SOURCES_MAX          128
+
+struct cxl_proto_err_kfifo {
+	struct work_struct *work;
+	struct rw_semaphore rw_sema;
+	DECLARE_KFIFO(fifo, struct cxl_proto_err_work_data,
+		      CXL_ERROR_SOURCES_MAX);
+};
+
+static struct cxl_proto_err_kfifo cxl_proto_err_kfifo = {
+	.rw_sema = __RWSEM_INITIALIZER(cxl_proto_err_kfifo.rw_sema)
+};
+
+bool cxl_error_is_native(struct pci_dev *dev)
+{
+	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
+
+	return (pcie_ports_native || host->native_aer);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_error_is_native, "CXL");
+
+bool is_internal_error(struct aer_err_info *info)
+{
+	if (info->severity == AER_CORRECTABLE)
+		return info->status & PCI_ERR_COR_INTERNAL;
+
+	return info->status & PCI_ERR_UNC_INTN;
+}
+EXPORT_SYMBOL_NS_GPL(is_internal_error, "CXL");
+
+bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
+{
+	if (!info || !info->is_cxl)
+		return false;
+
+	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT)
+		return false;
+
+	return is_internal_error(info);
+}
+EXPORT_SYMBOL_NS_GPL(is_cxl_error, "CXL");
+
+void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info)
+{
+	struct cxl_proto_err_work_data wd = (struct cxl_proto_err_work_data) {
+		.severity = info->severity,
+		.pdev = pdev
+	};
+
+	guard(rwsem_write)(&cxl_proto_err_kfifo.rw_sema);
+
+	if (!cxl_proto_err_kfifo.work) {
+		dev_warn_once(&pdev->dev, "CXL driver is unregistered. Unable to forward error.\n");
+		return;
+	}
+
+	if (!kfifo_put(&cxl_proto_err_kfifo.fifo, wd)) {
+		dev_err_ratelimited(&pdev->dev, "AER-CXL kfifo overflow\n");
+		return;
+	}
+
+	schedule_work(cxl_proto_err_kfifo.work);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_forward_error, "CXL");
+
+void cxl_register_proto_err_work(struct work_struct *work)
+{
+	guard(rwsem_write)(&cxl_proto_err_kfifo.rw_sema);
+	cxl_proto_err_kfifo.work = work;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_register_proto_err_work, "CXL");
+
+void cxl_unregister_proto_err_work(void)
+{
+	guard(rwsem_write)(&cxl_proto_err_kfifo.rw_sema);
+	cxl_proto_err_kfifo.work = NULL;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_unregister_proto_err_work, "CXL");
+
+int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd)
+{
+	guard(rwsem_read)(&cxl_proto_err_kfifo.rw_sema);
+	return kfifo_get(&cxl_proto_err_kfifo.fifo, wd);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_proto_err_kfifo_get, "CXL");
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 2ef820563996..6b2c87d1b5b6 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -10,6 +10,7 @@
 
 #include <linux/errno.h>
 #include <linux/types.h>
+#include <linux/workqueue_types.h>
 
 #define AER_NONFATAL			0
 #define AER_FATAL			1
@@ -53,6 +54,16 @@ struct aer_capability_regs {
 	u16 uncor_err_source;
 };
 
+/**
+ * struct cxl_proto_err_work_data - Error information used in CXL error handling
+ * @severity: AER severity
+ * @pdev: PCI device detecting the error
+ */
+struct cxl_proto_err_work_data {
+	int severity;
+	struct pci_dev *pdev;
+};
+
 #if defined(CONFIG_PCIEAER)
 int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
 int pcie_aer_is_native(struct pci_dev *dev);
@@ -68,8 +79,14 @@ static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
 
 #ifdef CONFIG_CXL_RAS
 bool cxl_error_is_native(struct pci_dev *dev);
+int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd);
+void cxl_register_proto_err_work(struct work_struct *work);
+void cxl_unregister_proto_err_work(void);
 #else
 static inline bool cxl_error_is_native(struct pci_dev *dev) { return false; }
+static inline int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd) { return 0; }
+static inline void cxl_register_proto_err_work(struct work_struct *work) { }
+static inline void cxl_unregister_proto_err_work(void) { }
 #endif
 
 void pci_print_aer(struct pci_dev *dev, int aer_severity,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [RESEND v13 17/25] cxl: Introduce cxl_pci_drv_bound() to check for bound driver
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (15 preceding siblings ...)
  2025-11-04 17:02 ` [RESEND v13 16/25] CXL/AER: Introduce pcie/aer_cxl_vh.c in AER driver for forwarding CXL errors Terry Bowman
@ 2025-11-04 17:02 ` Terry Bowman
  2025-11-05 17:51   ` Gregory Price
                     ` (2 more replies)
  2025-11-04 17:02 ` [RESEND v13 18/25] cxl: Change CXL handlers to use guard() instead of scoped_guard() Terry Bowman
                   ` (8 subsequent siblings)
  25 siblings, 3 replies; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:02 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

CXL devices handle protocol errors via driver-specific callbacks rather
than the generic pci_driver::err_handlers by default. The callbacks are
implemented in the cxl_pci driver and are not part of struct pci_driver, so
cxl_core must verify that a device is actually bound to the cxl_pci
module's driver before invoking the callbacks (the device could be bound
to another driver, e.g. VFIO).

However, cxl_core cannot reference symbols in the cxl_pci module because
doing so would create a circular dependency. This prevents cxl_core from
checking the EP's bound driver and calling the callbacks.

To fix this, move drivers/cxl/pci.c into drivers/cxl/core/pci_drv.c and
build it as part of the cxl_core module. Compile into cxl_core using
CXL_PCI and CXL_CORE Kconfig dependencies. This removes the standalone
cxl_pci module, consolidates the cxl_pci driver code into cxl_core, and
eliminates the circular dependency so cxl_core can safely perform
bound-driver checks and invoke the CXL PCI callbacks.

Introduce cxl_pci_drv_bound(), which returns true if the given PCI EP is
bound to the cxl_pci driver. This will be used in a future patch when
dequeuing work from the kfifo.
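
For illustration only (not part of this patch), the bound-driver check
amounts to comparing the device's driver pointer against one specific
driver instance. The sketch below is a userspace model; struct driver,
struct device, cxl_driver, vfio_driver and drv_bound() are hypothetical
stand-ins, and the device lock assertion is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins for struct pci_driver and struct pci_dev. */
struct driver {
	const char *name;
};

struct device {
	const struct driver *driver;	/* NULL when no driver is bound */
};

/* Hypothetical driver instances used only for this illustration. */
static const struct driver cxl_driver = { .name = "cxl_pci" };
static const struct driver vfio_driver = { .name = "vfio-pci" };

/* True only when @dev is bound to this specific driver instance. */
static bool drv_bound(const struct device *dev)
{
	assert(dev != NULL);
	return dev->driver == &cxl_driver;
}
```

Comparing pointers rather than driver names only works when the checker
can see the driver object itself, which is why the driver code has to be
built into the same module as the check.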

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

---

Changes in v12->v13:
- Add Dave Jiang's Reviewed-by.

Changes in v11->v12:
- Add device_lock_assert() in cxl_pci_drv_bound() (Dave Jiang)
- Add Jonathan's Reviewed-by

Changes in v11->v12:
- None

Changes in v10->v11:
- cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
- cxl_error_detected() - Remove extra line (Shiju)
- Changes moved to core/ras.c (Terry)
- cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
- Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
- Move #include "pci.h" from cxl.h to core.h (Terry)
- Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)
---
 drivers/cxl/Kconfig                   |  6 +++---
 drivers/cxl/Makefile                  |  2 --
 drivers/cxl/core/Makefile             |  1 +
 drivers/cxl/core/core.h               |  9 +++++++++
 drivers/cxl/{pci.c => core/pci_drv.c} | 21 +++++++++++++--------
 drivers/cxl/core/port.c               |  3 +++
 tools/testing/cxl/Kbuild              |  1 +
 7 files changed, 30 insertions(+), 13 deletions(-)
 rename drivers/cxl/{pci.c => core/pci_drv.c} (99%)

diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index ffe6ad981434..360c78fa7e97 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -20,7 +20,7 @@ menuconfig CXL_BUS
 if CXL_BUS
 
 config CXL_PCI
-	tristate "PCI manageability"
+	bool "PCI manageability"
 	default CXL_BUS
 	help
 	  The CXL specification defines a "CXL memory device" sub-class in the
@@ -29,12 +29,12 @@ config CXL_PCI
 	  memory to be mapped into the system address map (Host-managed Device
 	  Memory (HDM)).
 
-	  Say 'y/m' to enable a driver that will attach to CXL memory expander
+	  Say 'y' to enable a driver that will attach to CXL memory expander
 	  devices enumerated by the memory device class code for configuration
 	  and management primarily via the mailbox interface. See Chapter 2.3
 	  Type 3 CXL Device in the CXL 2.0 specification for more details.
 
-	  If unsure say 'm'.
+	  If unsure say 'y'.
 
 config CXL_MEM_RAW_COMMANDS
 	bool "RAW Command Interface for Memory Devices"
diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
index 2caa90fa4bf2..ff6add88b6ae 100644
--- a/drivers/cxl/Makefile
+++ b/drivers/cxl/Makefile
@@ -12,10 +12,8 @@ obj-$(CONFIG_CXL_PORT) += cxl_port.o
 obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
 obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o
 obj-$(CONFIG_CXL_MEM) += cxl_mem.o
-obj-$(CONFIG_CXL_PCI) += cxl_pci.o
 
 cxl_port-y := port.o
 cxl_acpi-y := acpi.o
 cxl_pmem-y := pmem.o security.o
 cxl_mem-y := mem.o
-cxl_pci-y := pci.o
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index fa1d4aed28b9..2937d0ddcce2 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -21,3 +21,4 @@ cxl_core-$(CONFIG_CXL_FEATURES) += features.o
 cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
 cxl_core-$(CONFIG_CXL_RAS) += ras.o
 cxl_core-$(CONFIG_CXL_RCH_RAS) += ras_rch.o
+cxl_core-$(CONFIG_CXL_PCI) += pci_drv.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index e47ae7365ce0..61c6726744d7 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -195,4 +195,13 @@ int cxl_set_feature(struct cxl_mailbox *cxl_mbox, const uuid_t *feat_uuid,
 		    u16 *return_code);
 #endif
 
+#ifdef CONFIG_CXL_PCI
+bool cxl_pci_drv_bound(struct pci_dev *pdev);
+int cxl_pci_driver_init(void);
+void cxl_pci_driver_exit(void);
+#else
+static inline bool cxl_pci_drv_bound(struct pci_dev *pdev) { return false; }
+static inline int cxl_pci_driver_init(void) { return 0; }
+static inline void cxl_pci_driver_exit(void) { }
+#endif
 #endif /* __CXL_CORE_H__ */
diff --git a/drivers/cxl/pci.c b/drivers/cxl/core/pci_drv.c
similarity index 99%
rename from drivers/cxl/pci.c
rename to drivers/cxl/core/pci_drv.c
index bd95be1f3d5c..06f2fd993cb0 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/core/pci_drv.c
@@ -1131,6 +1131,17 @@ static struct pci_driver cxl_pci_driver = {
 	},
 };
 
+bool cxl_pci_drv_bound(struct pci_dev *pdev)
+{
+	device_lock_assert(&pdev->dev);
+
+	if (pdev->driver != &cxl_pci_driver)
+		pr_err_ratelimited("%s device not bound to CXL PCI driver\n",
+				   pci_name(pdev));
+
+	return (pdev->driver == &cxl_pci_driver);
+}
+
 #define CXL_EVENT_HDR_FLAGS_REC_SEVERITY GENMASK(1, 0)
 static void cxl_handle_cper_event(enum cxl_event_type ev_type,
 				  struct cxl_cper_event_rec *rec)
@@ -1177,7 +1188,7 @@ static void cxl_cper_work_fn(struct work_struct *work)
 }
 static DECLARE_WORK(cxl_cper_work, cxl_cper_work_fn);
 
-static int __init cxl_pci_driver_init(void)
+int __init cxl_pci_driver_init(void)
 {
 	int rc;
 
@@ -1192,15 +1203,9 @@ static int __init cxl_pci_driver_init(void)
 	return rc;
 }
 
-static void __exit cxl_pci_driver_exit(void)
+void cxl_pci_driver_exit(void)
 {
 	cxl_cper_unregister_work(&cxl_cper_work);
 	cancel_work_sync(&cxl_cper_work);
 	pci_unregister_driver(&cxl_pci_driver);
 }
-
-module_init(cxl_pci_driver_init);
-module_exit(cxl_pci_driver_exit);
-MODULE_DESCRIPTION("CXL: PCI manageability");
-MODULE_LICENSE("GPL v2");
-MODULE_IMPORT_NS("CXL");
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 48f6a1492544..b70e1b505b5c 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -2507,6 +2507,8 @@ static __init int cxl_core_init(void)
 	if (rc)
 		goto err_ras;
 
+	cxl_pci_driver_init();
+
 	return 0;
 
 err_ras:
@@ -2522,6 +2524,7 @@ static __init int cxl_core_init(void)
 
 static void cxl_core_exit(void)
 {
+	cxl_pci_driver_exit();
 	cxl_ras_exit();
 	cxl_region_exit();
 	bus_unregister(&cxl_bus_type);
diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
index 6905f8e710ab..d8b8272ef87b 100644
--- a/tools/testing/cxl/Kbuild
+++ b/tools/testing/cxl/Kbuild
@@ -65,6 +65,7 @@ cxl_core-$(CONFIG_CXL_FEATURES) += $(CXL_CORE_SRC)/features.o
 cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += $(CXL_CORE_SRC)/edac.o
 cxl_core-$(CONFIG_CXL_RAS) += $(CXL_CORE_SRC)/ras.o
 cxl_core-$(CONFIG_CXL_RCH_RAS) += $(CXL_CORE_SRC)/ras_rch.o
+cxl_core-$(CONFIG_CXL_PCI) += $(CXL_CORE_SRC)/pci_drv.o
 cxl_core-y += config_check.o
 cxl_core-y += cxl_core_test.o
 cxl_core-y += cxl_core_exports.o
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [RESEND v13 18/25] cxl: Change CXL handlers to use guard() instead of scoped_guard()
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (16 preceding siblings ...)
  2025-11-04 17:02 ` [RESEND v13 17/25] cxl: Introduce cxl_pci_drv_bound() to check for bound driver Terry Bowman
@ 2025-11-04 17:02 ` Terry Bowman
  2025-11-04 18:18   ` Jonathan Cameron
  2025-11-04 20:15   ` Dave Jiang
  2025-11-04 17:02 ` [RESEND v13 19/25] cxl/pci: Introduce CXL protocol error handlers for Endpoints Terry Bowman
                   ` (7 subsequent siblings)
  25 siblings, 2 replies; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:02 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

The CXL protocol error handlers use scoped_guard() to guarantee access to
the underlying CXL memory device. Improve readability and reduce complexity
by replacing the existing scoped_guard() with guard().
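
For illustration only (not part of this patch), the function-scope guard
pattern can be modeled in userspace with the GCC/Clang cleanup attribute.
This is a sketch under that assumption: struct toy_lock, TOY_GUARD() and
handle_error() are hypothetical names, and the counter merely stands in
for the device lock.

```c
#include <assert.h>

/* Toy lock: a counter standing in for the device lock. */
struct toy_lock {
	int held;
};

static struct toy_lock *toy_lock_acquire(struct toy_lock *l)
{
	l->held++;
	return l;
}

/* Cleanup handlers receive a pointer to the guarded variable. */
static void toy_lock_release(struct toy_lock **l)
{
	(*l)->held--;
}

/* Hypothetical guard macro: release runs when the variable leaves scope. */
#define TOY_GUARD(lock) \
	struct toy_lock *toy_guard_ \
		__attribute__((cleanup(toy_lock_release))) = \
		toy_lock_acquire(lock)

/* Returns the lock depth observed inside the critical section. */
static int handle_error(struct toy_lock *l, int enabled)
{
	TOY_GUARD(l);

	if (!enabled)
		return -1;	/* early return still releases the lock */

	return l->held;
}
```

Because the lock is held until the end of the function, the body needs no
extra nesting level, which is the readability gain this patch is after.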

Signed-off-by: Terry Bowman <terry.bowman@amd.com>

---

Changes in v12->v13:
- New patch
---
 drivers/cxl/core/ras.c | 53 +++++++++++++++++++++---------------------
 1 file changed, 26 insertions(+), 27 deletions(-)

diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 19d9ffe885bf..cb712772de5c 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -254,19 +254,19 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
 	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
 	struct device *dev = &cxlds->cxlmd->dev;
 
-	scoped_guard(device, dev) {
-		if (!dev->driver) {
-			dev_warn(&pdev->dev,
-				 "%s: memdev disabled, abort error handling\n",
-				 dev_name(dev));
-			return;
-		}
-
-		if (cxlds->rcd)
-			cxl_handle_rdport_errors(cxlds);
+	guard(device)(dev);
 
-		cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->serial, cxlds->regs.ras);
+	if (!dev->driver) {
+		dev_warn(&pdev->dev,
+			 "%s: memdev disabled, abort error handling\n",
+			 dev_name(dev));
+		return;
 	}
+
+	if (cxlds->rcd)
+		cxl_handle_rdport_errors(cxlds);
+
+	cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->serial, cxlds->regs.ras);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
 
@@ -278,25 +278,24 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 	struct device *dev = &cxlmd->dev;
 	bool ue;
 
-	scoped_guard(device, dev) {
-		if (!dev->driver) {
-			dev_warn(&pdev->dev,
-				 "%s: memdev disabled, abort error handling\n",
-				 dev_name(dev));
-			return PCI_ERS_RESULT_DISCONNECT;
-		}
+	guard(device)(dev);
 
-		if (cxlds->rcd)
-			cxl_handle_rdport_errors(cxlds);
-		/*
-		 * A frozen channel indicates an impending reset which is fatal to
-		 * CXL.mem operation, and will likely crash the system. On the off
-		 * chance the situation is recoverable dump the status of the RAS
-		 * capability registers and bounce the active state of the memdev.
-		 */
-		ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial, cxlds->regs.ras);
+	if (!dev->driver) {
+		dev_warn(&pdev->dev,
+			 "%s: memdev disabled, abort error handling\n",
+			 dev_name(dev));
+		return PCI_ERS_RESULT_DISCONNECT;
 	}
 
+	if (cxlds->rcd)
+		cxl_handle_rdport_errors(cxlds);
+	/*
+	 * A frozen channel indicates an impending reset which is fatal to
+	 * CXL.mem operation, and will likely crash the system. On the off
+	 * chance the situation is recoverable dump the status of the RAS
+	 * capability registers and bounce the active state of the memdev.
+	 */
+	ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial, cxlds->regs.ras);
 
 	switch (state) {
 	case pci_channel_io_normal:
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [RESEND v13 19/25] cxl/pci: Introduce CXL protocol error handlers for Endpoints
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (17 preceding siblings ...)
  2025-11-04 17:02 ` [RESEND v13 18/25] cxl: Change CXL handlers to use guard() instead of scoped_guard() Terry Bowman
@ 2025-11-04 17:02 ` Terry Bowman
  2025-11-04 18:29   ` Jonathan Cameron
  2025-11-04 19:09   ` Bjorn Helgaas
  2025-11-04 17:03 ` [RESEND v13 20/25] CXL/PCI: Introduce CXL Port protocol error handlers Terry Bowman
                   ` (6 subsequent siblings)
  25 siblings, 2 replies; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:02 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

CXL Endpoint protocol errors are currently handled by generic PCI error
handlers. However, uncorrectable errors (UCEs) require CXL.mem protocol-
specific handling logic that the PCI handlers cannot provide.

Add dedicated CXL protocol error handlers for CXL Endpoints. Rename the
existing cxl_error_handlers to pci_error_handlers to better reflect their
purpose and maintain naming consistency. Update the PCI error handlers to
invoke the new CXL protocol handlers when the endpoint is operating in
CXL.mem mode.

Implement cxl_handle_ras() to return PCI_ERS_RESULT_NONE or
PCI_ERS_RESULT_PANIC. Remove unnecessary result checks from the previous
endpoint UCE handler since CXL UCE recovery is not implemented in this
patch.

Add device lock assertions to protect against concurrent device or RAS
register removal during error handling. Two devices require locking for
CXL endpoints:

1. The PCI device (pdev->dev) - RAS registers are allocated and mapped
   using devm_* functions with this device as the host. Locking prevents
   the RAS registers from being unmapped until after error handling
   completes.

2. The CXL memory device (cxlmd->dev) - Holds a reference to the RAS
   registers accessed during error handling. Locking prevents the memory
   device and its RAS register references from being removed during error
   handling.

The lock assertions added here will be satisfied by device locks
introduced in a subsequent patch. A future patch will extend the CXL UCE
handler to support full UCE recovery.
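
For illustration only (not part of this patch), the result mapping
described for cxl_handle_ras() can be sketched in userspace C. The enum
values, UNCORRECTABLE_STATUS_MASK and handle_ras() are hypothetical
placeholders and do not reflect the real register layout; logging and
clearing of the status register are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative subset of pci_ers_result_t; values are placeholders. */
enum ers_result {
	ERS_RESULT_NONE,
	ERS_RESULT_PANIC,
};

/* Hypothetical mask of uncorrectable status bits. */
#define UNCORRECTABLE_STATUS_MASK 0x1ffffu

/*
 * @ras_status models the mapped uncorrectable status register; NULL
 * models an unmapped RAS register block.
 */
static enum ers_result handle_ras(const uint32_t *ras_status)
{
	if (!ras_status)
		return ERS_RESULT_NONE;

	if (!(*ras_status & UNCORRECTABLE_STATUS_MASK))
		return ERS_RESULT_NONE;

	/* Any uncorrectable status bit is treated as fatal in this sketch. */
	return ERS_RESULT_PANIC;
}
```

In the patch itself the caller, pci_error_detected(), is the one that acts
on the PANIC result, so the RAS helper only classifies the error.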

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

---

Changes in v12->v13:
- Update commit message (Terry)
- Updated all the implementation and commit message. (Terry)
- Refactored cxl_cor_error_detected()/cxl_error_detected() to remove
  pdev (Dave Jiang)

Changes in v11->v12:
- None

Changes in v10->v11:
- cxl_error_detected() - Change handlers' scoped_guard() to guard() (Jonathan)
- cxl_error_detected() - Remove extra line (Shiju)
- Changes moved to core/ras.c (Terry)
- cxl_error_detected(), remove 'ue' and return with function call. (Jonathan)
- Remove extra space in documentation for PCI_ERS_RESULT_PANIC definition
- Move #include "pci.h" from cxl.h to core.h (Terry)
- Remove unnecessary includes of cxl.h and core.h in mem.c (Terry)
---
 drivers/cxl/core/core.h    | 22 +++++++--
 drivers/cxl/core/pci_drv.c |  9 ++--
 drivers/cxl/core/ras.c     | 97 +++++++++++++++++++++++---------------
 drivers/cxl/cxlpci.h       | 11 -----
 4 files changed, 82 insertions(+), 57 deletions(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 61c6726744d7..b2c0ccd6803f 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -149,19 +149,33 @@ int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
 #ifdef CONFIG_CXL_RAS
 int cxl_ras_init(void);
 void cxl_ras_exit(void);
-bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base);
+pci_ers_result_t cxl_handle_ras(struct device *dev, u64 serial,
+			       void __iomem *ras_base);
 void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base);
+pci_ers_result_t cxl_error_detected(struct device *dev);
+void cxl_cor_error_detected(struct device *dev);
+pci_ers_result_t pci_error_detected(struct pci_dev *pdev,
+				    pci_channel_state_t error);
+void pci_cor_error_detected(struct pci_dev *pdev);
 #else
 static inline int cxl_ras_init(void)
 {
 	return 0;
 }
 static inline void cxl_ras_exit(void) { }
-static inline bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
+static inline pci_ers_result_t cxl_handle_ras(struct device *dev, u64 serial,
+					      void __iomem *ras_base)
 {
-	return false;
+	return PCI_ERS_RESULT_NONE;
 }
-static inline void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base) { }
+static inline void cxl_handle_cor_ras(struct device *dev, u64 serial,
+				      void __iomem *ras_base) { }
+static inline pci_ers_result_t pci_error_detected(struct pci_dev *pdev,
+						  pci_channel_state_t error)
+{
+	return PCI_ERS_RESULT_NONE;
+}
+static inline void pci_cor_error_detected(struct pci_dev *pdev) { }
 #endif /* CONFIG_CXL_RAS */
 
 /* Restricted CXL Host specific RAS functions */
diff --git a/drivers/cxl/core/pci_drv.c b/drivers/cxl/core/pci_drv.c
index 06f2fd993cb0..bc3c959f7eb6 100644
--- a/drivers/cxl/core/pci_drv.c
+++ b/drivers/cxl/core/pci_drv.c
@@ -16,6 +16,7 @@
 #include "cxlpci.h"
 #include "cxl.h"
 #include "pmu.h"
+#include "core/core.h"
 
 /**
  * DOC: cxl pci
@@ -1112,11 +1113,11 @@ static void cxl_reset_done(struct pci_dev *pdev)
 	}
 }
 
-static const struct pci_error_handlers cxl_error_handlers = {
-	.error_detected	= cxl_error_detected,
+static const struct pci_error_handlers pci_error_handlers = {
+	.error_detected	= pci_error_detected,
 	.slot_reset	= cxl_slot_reset,
 	.resume		= cxl_error_resume,
-	.cor_error_detected	= cxl_cor_error_detected,
+	.cor_error_detected	= pci_cor_error_detected,
 	.reset_done	= cxl_reset_done,
 };
 
@@ -1124,7 +1125,7 @@ static struct pci_driver cxl_pci_driver = {
 	.name			= KBUILD_MODNAME,
 	.id_table		= cxl_mem_pci_tbl,
 	.probe			= cxl_pci_probe,
-	.err_handler		= &cxl_error_handlers,
+	.err_handler		= &pci_error_handlers,
 	.dev_groups		= cxl_rcd_groups,
 	.driver	= {
 		.probe_type	= PROBE_PREFER_ASYNCHRONOUS,
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index cb712772de5c..beb142054bda 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -128,6 +128,11 @@ void cxl_ras_exit(void)
 	cancel_work_sync(&cxl_cper_prot_err_work);
 }
 
+static bool is_pcie_endpoint(struct pci_dev *pdev)
+{
+	return pci_pcie_type(pdev) == PCI_EXP_TYPE_ENDPOINT;
+}
+
 static void cxl_dport_map_ras(struct cxl_dport *dport)
 {
 	struct cxl_register_map *map = &dport->reg_map;
@@ -214,7 +219,7 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
  * Log the state of the RAS status registers and prepare them to log the
  * next error status. Return 1 if reset needed.
  */
-bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
+pci_ers_result_t cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
 {
 	u32 hl[CXL_HEADERLOG_SIZE_U32];
 	void __iomem *addr;
@@ -223,13 +228,13 @@ bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
 
 	if (!ras_base) {
 		dev_warn_once(dev, "CXL RAS register block is not mapped");
-		return false;
+		return PCI_ERS_RESULT_NONE;
 	}
 
 	addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
 	status = readl(addr);
 	if (!(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK))
-		return false;
+		return PCI_ERS_RESULT_NONE;
 
 	/* If multiple errors, log header points to first error from ctrl reg */
 	if (hweight32(status) > 1) {
@@ -246,18 +251,19 @@ bool cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ras_base)
 	trace_cxl_aer_uncorrectable_error(dev, status, fe, hl, serial);
 	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
 
-	return true;
+	return PCI_ERS_RESULT_PANIC;
 }
 
-void cxl_cor_error_detected(struct pci_dev *pdev)
+void cxl_cor_error_detected(struct device *dev)
 {
-	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
-	struct device *dev = &cxlds->cxlmd->dev;
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 
-	guard(device)(dev);
+	device_lock_assert(cxlds->dev);
+	device_lock_assert(&cxlmd->dev);
 
 	if (!dev->driver) {
-		dev_warn(&pdev->dev,
+		dev_warn(cxlds->dev,
 			 "%s: memdev disabled, abort error handling\n",
 			 dev_name(dev));
 		return;
@@ -270,18 +276,31 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
 
-pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
-				    pci_channel_state_t state)
+void pci_cor_error_detected(struct pci_dev *pdev)
+{
+	struct cxl_dev_state *cxlds;
+
+	device_lock_assert(&pdev->dev);
+	if (!cxl_pci_drv_bound(pdev))
+		return;
+
+	cxlds = pci_get_drvdata(pdev);
+	guard(device)(&cxlds->cxlmd->dev);
+
+	cxl_cor_error_detected(&cxlds->cxlmd->dev);
+}
+EXPORT_SYMBOL_NS_GPL(pci_cor_error_detected, "CXL");
+
+pci_ers_result_t cxl_error_detected(struct device *dev)
 {
-	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
-	struct cxl_memdev *cxlmd = cxlds->cxlmd;
-	struct device *dev = &cxlmd->dev;
-	bool ue;
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 
-	guard(device)(dev);
+	device_lock_assert(cxlds->dev);
+	device_lock_assert(&cxlmd->dev);
 
 	if (!dev->driver) {
-		dev_warn(&pdev->dev,
+		dev_warn(cxlds->dev,
 			 "%s: memdev disabled, abort error handling\n",
 			 dev_name(dev));
 		return PCI_ERS_RESULT_DISCONNECT;
@@ -289,32 +308,34 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 
 	if (cxlds->rcd)
 		cxl_handle_rdport_errors(cxlds);
+
 	/*
 	 * A frozen channel indicates an impending reset which is fatal to
 	 * CXL.mem operation, and will likely crash the system. On the off
 	 * chance the situation is recoverable dump the status of the RAS
 	 * capability registers and bounce the active state of the memdev.
 	 */
-	ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial, cxlds->regs.ras);
-
-	switch (state) {
-	case pci_channel_io_normal:
-		if (ue) {
-			device_release_driver(dev);
-			return PCI_ERS_RESULT_NEED_RESET;
-		}
-		return PCI_ERS_RESULT_CAN_RECOVER;
-	case pci_channel_io_frozen:
-		dev_warn(&pdev->dev,
-			 "%s: frozen state error detected, disable CXL.mem\n",
-			 dev_name(dev));
-		device_release_driver(dev);
-		return PCI_ERS_RESULT_NEED_RESET;
-	case pci_channel_io_perm_failure:
-		dev_warn(&pdev->dev,
-			 "failure state error detected, request disconnect\n");
-		return PCI_ERS_RESULT_DISCONNECT;
-	}
-	return PCI_ERS_RESULT_NEED_RESET;
+	return cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial, cxlds->regs.ras);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
+
+pci_ers_result_t pci_error_detected(struct pci_dev *pdev,
+				    pci_channel_state_t error)
+{
+	struct cxl_dev_state *cxlds;
+	pci_ers_result_t rc;
+
+	device_lock_assert(&pdev->dev);
+	if (!cxl_pci_drv_bound(pdev))
+		return PCI_ERS_RESULT_NONE;
+
+	cxlds = pci_get_drvdata(pdev);
+	guard(device)(&cxlds->cxlmd->dev);
+
+	rc = cxl_error_detected(&cxlds->cxlmd->dev);
+	if (rc == PCI_ERS_RESULT_PANIC)
+		panic("CXL cachemem error.");
+
+	return rc;
+}
+EXPORT_SYMBOL_NS_GPL(pci_error_detected, "CXL");
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index a0a491e7b5b9..3526e6d75f79 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -79,21 +79,10 @@ struct cxl_dev_state;
 void read_cdat_data(struct cxl_port *port);
 
 #ifdef CONFIG_CXL_RAS
-void cxl_cor_error_detected(struct pci_dev *pdev);
-pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
-				    pci_channel_state_t state);
 void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host);
 void cxl_uport_init_ras_reporting(struct cxl_port *port,
 				  struct device *host);
 #else
-static inline void cxl_cor_error_detected(struct pci_dev *pdev) { }
-
-static inline pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
-						  pci_channel_state_t state)
-{
-	return PCI_ERS_RESULT_NONE;
-}
-
 static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport,
 						struct device *host) { }
 static inline void cxl_uport_init_ras_reporting(struct cxl_port *port,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [RESEND v13 20/25] CXL/PCI: Introduce CXL Port protocol error handlers
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (18 preceding siblings ...)
  2025-11-04 17:02 ` [RESEND v13 19/25] cxl/pci: Introduce CXL protocol error handlers for Endpoints Terry Bowman
@ 2025-11-04 17:03 ` Terry Bowman
  2025-11-04 18:32   ` Jonathan Cameron
  2025-11-04 21:20   ` Dave Jiang
  2025-11-04 17:03 ` [RESEND v13 21/25] PCI/AER: Dequeue forwarded CXL error Terry Bowman
                   ` (5 subsequent siblings)
  25 siblings, 2 replies; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:03 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Add CXL protocol error handlers for CXL Port devices (Root Ports,
Downstream Ports, and Upstream Ports). Implement cxl_port_cor_error_detected()
and cxl_port_error_detected() to handle correctable and uncorrectable errors
respectively.

Introduce cxl_get_ras_base() to retrieve the cached RAS register base
address for a given CXL port. This function supports CXL Root Ports,
Downstream Ports, and Upstream Ports by returning their previously mapped
RAS register addresses.

Add device lock assertions to protect against concurrent device or RAS
register removal during error handling. The port error handlers require
two device locks:

1. The port's CXL parent device - RAS registers are mapped using devm_*
   functions with the parent port as the host. Locking the parent prevents
   the RAS registers from being unmapped during error handling.

2. The PCI device (pdev->dev) - Locking prevents concurrent modifications
   to the PCI device structure during error handling.

The lock assertions added here will be satisfied by device locks introduced
in a subsequent patch.

Introduce get_pci_cxl_host_dev() to return the device responsible for
managing the RAS register mapping. This function increments the reference
count on the host device to prevent premature resource release during error
handling. The caller is responsible for decrementing the reference count.
For CXL endpoints, which manage resources without a separate host device,
this function returns NULL.

Update the AER driver's is_cxl_error() to recognize CXL Port devices in
addition to CXL Endpoints, as both now have CXL-specific error handlers.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

---

Changes in v12->v13:
- Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
  patch (Terry)
- Remove EP case in cxl_get_ras_base(), not used. (Terry)
- Remove check for dport->dport_dev (Dave)
- Remove whitespace (Terry)

Changes in v11->v12:
- Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
  pci_to_cxl_dev()
- Change cxl_error_detected() -> cxl_cor_error_detected()
- Remove NULL variable assignments
- Replace bus_find_device() with find_cxl_port_by_uport() for upstream
  port searches.

Changes in v10->v11:
- None
---
 drivers/cxl/core/core.h       | 10 +++++++
 drivers/cxl/core/port.c       |  7 ++---
 drivers/cxl/core/ras.c        | 49 +++++++++++++++++++++++++++++++++++
 drivers/pci/pcie/aer_cxl_vh.c |  5 +++-
 4 files changed, 67 insertions(+), 4 deletions(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index b2c0ccd6803f..046ec65ed147 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -157,6 +157,8 @@ void cxl_cor_error_detected(struct device *dev);
 pci_ers_result_t pci_error_detected(struct pci_dev *pdev,
 				    pci_channel_state_t error);
 void pci_cor_error_detected(struct pci_dev *pdev);
+pci_ers_result_t cxl_port_error_detected(struct device *dev);
+void cxl_port_cor_error_detected(struct device *dev);
 #else
 static inline int cxl_ras_init(void)
 {
@@ -176,6 +178,11 @@ static inline pci_ers_result_t pci_error_detected(struct pci_dev *pdev,
 	return PCI_ERS_RESULT_NONE;
 }
 static inline void pci_cor_error_detected(struct pci_dev *pdev) { }
+static inline void cxl_port_cor_error_detected(struct device *dev) { }
+static inline pci_ers_result_t cxl_port_error_detected(struct device *dev)
+{
+	return PCI_ERS_RESULT_NONE;
+}
 #endif /* CONFIG_CXL_RAS */
 
 /* Restricted CXL Host specific RAS functions */
@@ -190,6 +197,9 @@ static inline void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds) { }
 #endif /* CONFIG_CXL_RCH_RAS */
 
 int cxl_gpf_port_setup(struct cxl_dport *dport);
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+			       struct cxl_dport **dport);
+struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev);
 
 struct cxl_hdm;
 int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index b70e1b505b5c..d060f864cf2e 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1360,8 +1360,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
 	return NULL;
 }
 
-static struct cxl_port *find_cxl_port(struct device *dport_dev,
-				      struct cxl_dport **dport)
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+			       struct cxl_dport **dport)
 {
 	struct cxl_find_port_ctx ctx = {
 		.dport_dev = dport_dev,
@@ -1564,7 +1564,7 @@ static int match_port_by_uport(struct device *dev, const void *data)
  * Function takes a device reference on the port device. Caller should do a
  * put_device() when done.
  */
-static struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
+struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
 {
 	struct device *dev;
 
@@ -1573,6 +1573,7 @@ static struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
 		return to_cxl_port(dev);
 	return NULL;
 }
+EXPORT_SYMBOL_NS_GPL(find_cxl_port_by_uport, "CXL");
 
 static int update_decoder_targets(struct device *dev, void *data)
 {
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index beb142054bda..142ca8794107 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -145,6 +145,39 @@ static void cxl_dport_map_ras(struct cxl_dport *dport)
 		dev_dbg(dev, "Failed to map RAS capability.\n");
 }
 
+static void __iomem *cxl_get_ras_base(struct device *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	switch (pci_pcie_type(pdev)) {
+	case PCI_EXP_TYPE_ROOT_PORT:
+	case PCI_EXP_TYPE_DOWNSTREAM:
+	{
+		struct cxl_dport *dport;
+		struct cxl_port *port __free(put_cxl_port) = find_cxl_port(&pdev->dev, &dport);
+
+		if (!dport) {
+			pci_err(pdev, "Failed to find the CXL device");
+			return NULL;
+		}
+		return dport->regs.ras;
+	}
+	case PCI_EXP_TYPE_UPSTREAM:
+	{
+		struct cxl_port *port __free(put_cxl_port) = find_cxl_port_by_uport(&pdev->dev);
+
+		if (!port) {
+			pci_err(pdev, "Failed to find the CXL device");
+			return NULL;
+		}
+		return port->uport_regs.ras;
+	}
+	}
+
+	dev_warn_once(dev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));
+	return NULL;
+}
+
 /**
  * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
  * @dport: the cxl_dport that needs to be initialized
@@ -254,6 +287,22 @@ pci_ers_result_t cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ra
 	return PCI_ERS_RESULT_PANIC;
 }
 
+void cxl_port_cor_error_detected(struct device *dev)
+{
+	void __iomem *ras_base = cxl_get_ras_base(dev);
+
+	cxl_handle_cor_ras(dev, 0, ras_base);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_port_cor_error_detected, "CXL");
+
+pci_ers_result_t cxl_port_error_detected(struct device *dev)
+{
+	void __iomem *ras_base = cxl_get_ras_base(dev);
+
+	return cxl_handle_ras(dev, 0, ras_base);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_port_error_detected, "CXL");
+
 void cxl_cor_error_detected(struct device *dev)
 {
 	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
index 5dbc81341dc4..25f9512b57f7 100644
--- a/drivers/pci/pcie/aer_cxl_vh.c
+++ b/drivers/pci/pcie/aer_cxl_vh.c
@@ -43,7 +43,10 @@ bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
 	if (!info || !info->is_cxl)
 		return false;
 
-	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT)
+	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
+	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) &&
+	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_UPSTREAM) &&
+	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_DOWNSTREAM))
 		return false;
 
 	return is_internal_error(info);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [RESEND v13 21/25] PCI/AER: Dequeue forwarded CXL error
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (19 preceding siblings ...)
  2025-11-04 17:03 ` [RESEND v13 20/25] CXL/PCI: Introduce CXL Port protocol error handlers Terry Bowman
@ 2025-11-04 17:03 ` Terry Bowman
  2025-11-04 18:40   ` Jonathan Cameron
                     ` (2 more replies)
  2025-11-04 17:03 ` [RESEND v13 22/25] CXL/PCI: Export and rename merge_result() to pci_ers_merge_result() Terry Bowman
                   ` (4 subsequent siblings)
  25 siblings, 3 replies; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:03 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

The AER driver now forwards CXL protocol errors to the CXL driver via a
kfifo. The CXL driver must consume these work items, initiate protocol
error handling, and ensure RAS mappings remain valid throughout processing.

Implement cxl_proto_err_work_fn() to dequeue work items forwarded by the
AER service driver and begin protocol error processing by calling
cxl_handle_proto_error().
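
As a rough illustration of the dequeue-and-dispatch flow described above,
the following userspace sketch drains a small FIFO and routes each item by
severity. Every name in it (mock_fifo, mock_work_data, mock_work_fn, and so
on) is hypothetical and stands in for the kernel kfifo and the CXL handlers.
It models the control flow only, not locking or the real APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; none of these names exist in the kernel. */
enum mock_severity { MOCK_CORRECTABLE, MOCK_UNCORRECTABLE };

struct mock_work_data {
	int dev_id;
	enum mock_severity severity;
};

#define MOCK_FIFO_SIZE 8	/* power of two so the index mask works */

struct mock_fifo {
	struct mock_work_data slots[MOCK_FIFO_SIZE];
	unsigned int in, out;	/* free-running producer/consumer counters */
};

struct mock_stats {
	int corrected;		/* items routed to the correctable path */
	int recoveries;		/* items routed to the recovery path */
};

/* Queue one item; returns false when the FIFO is full. */
static bool mock_fifo_put(struct mock_fifo *f, struct mock_work_data wd)
{
	if (f->in - f->out == MOCK_FIFO_SIZE)
		return false;
	f->slots[f->in++ & (MOCK_FIFO_SIZE - 1)] = wd;
	return true;
}

/* Dequeue one item; returns false when the FIFO is empty. */
static bool mock_fifo_get(struct mock_fifo *f, struct mock_work_data *wd)
{
	if (f->in == f->out)
		return false;
	*wd = f->slots[f->out++ & (MOCK_FIFO_SIZE - 1)];
	return true;
}

/* Drain the FIFO and dispatch each item by severity. */
static void mock_work_fn(struct mock_fifo *f, struct mock_stats *st)
{
	struct mock_work_data wd;

	assert(f != NULL && st != NULL);

	while (mock_fifo_get(f, &wd)) {
		if (wd.severity == MOCK_CORRECTABLE)
			st->corrected++;	/* models the correctable handlers */
		else
			st->recoveries++;	/* models cxl_do_recovery() */
	}
}
```

The worker keeps looping until the FIFO reports empty, so a single scheduled
work item can service several queued errors.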

Add a PCI device lock on &pdev->dev within cxl_proto_err_work_fn() to
keep the PCI device structure valid during handling. Locking an Endpoint
will also defer RAS unmapping until the device is unlocked.

For Endpoints, add a lock on the CXL memory device, cxlds->cxlmd->dev. The
CXL memory device structure holds the RAS register reference needed during
error handling.

For Root Ports, Downstream Ports, and Upstream Ports, add a lock on the
parent CXL Port to prevent destruction of the structures holding mapped RAS
addresses while they are in use.

Invoke cxl_do_recovery() for uncorrectable errors. Treat this as a stub for
now; implement its functionality in a future patch.

Export pcie_clear_device_status() to enable cleanup of the PCIe device
status following error handling.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

---
Changes in v12->v13:
- Add cxlmd lock using guard() (Terry)
- Remove exporting of unused function, pci_aer_clear_fatal_status() (Dave Jiang)
- Change pr_err() calls to ratelimited. (Terry)
- Update commit message. (Terry)
- Remove namespace qualifier from pcie_clear_device_status()
  export (Dave Jiang)
- Move locks into cxl_proto_err_work_fn() (Dave)
- Update log messages in cxl_forward_error() (Ben)

Changes in v11->v12:
- Add guard for CE case in cxl_handle_proto_error() (Dave)

Changes in v10->v11:
- Reword patch commit message to remove RCiEP details (Jonathan)
- Add #include <linux/bitfield.h> (Terry)
- is_cxl_rcd() - Fix short comment message wrap (Jonathan)
- is_cxl_rcd() - Combine return calls into 1 (Jonathan)
- cxl_handle_proto_error() - Move comment earlier (Jonathan)
- Use FIELD_GET() in discovering class code (Jonathan)
- Remove BDF from cxl_proto_err_work_data. Use 'struct
  pci_dev *' (Dan)
---
 drivers/cxl/core/ras.c | 153 ++++++++++++++++++++++++++++++++++++++---
 drivers/pci/pci.c      |   1 +
 drivers/pci/pci.h      |   1 -
 include/linux/pci.h    |   2 +
 4 files changed, 145 insertions(+), 12 deletions(-)

diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 142ca8794107..5bc144cde0ee 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -117,17 +117,6 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
 }
 static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
 
-int cxl_ras_init(void)
-{
-	return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
-}
-
-void cxl_ras_exit(void)
-{
-	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
-	cancel_work_sync(&cxl_cper_prot_err_work);
-}
-
 static bool is_pcie_endpoint(struct pci_dev *pdev)
 {
 	return pci_pcie_type(pdev) == PCI_EXP_TYPE_ENDPOINT;
@@ -178,6 +167,51 @@ static void __iomem *cxl_get_ras_base(struct device *dev)
 	return NULL;
 }
 
+/*
+ * Return 'struct cxl_port *' parent CXL port of dev's
+ *
+ * Reference count increments on success
+ *
+ * dev: Find the parent port of this dev
+ */
+static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
+{
+	switch (pci_pcie_type(pdev)) {
+	case PCI_EXP_TYPE_ROOT_PORT:
+	case PCI_EXP_TYPE_DOWNSTREAM:
+	{
+		struct cxl_dport *dport;
+		struct cxl_port *port = find_cxl_port(&pdev->dev, &dport);
+
+		if (!port) {
+			pci_err(pdev, "Failed to find the CXL device");
+			return NULL;
+		}
+		return port;
+	}
+	case PCI_EXP_TYPE_UPSTREAM:
+	{
+		struct cxl_port *port = find_cxl_port_by_uport(&pdev->dev);
+
+		if (!port) {
+			pci_err(pdev, "Failed to find the CXL device");
+			return NULL;
+		}
+		return port;
+	}
+	case PCI_EXP_TYPE_ENDPOINT:
+	{
+		struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
+		struct cxl_port *port = cxlds->cxlmd->endpoint;
+
+		get_device(&port->dev);
+		return port;
+	}
+	}
+	pci_warn_once(pdev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));
+	return NULL;
+}
+
 /**
  * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
  * @dport: the cxl_dport that needs to be initialized
@@ -212,6 +246,23 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port,
 }
 EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, "CXL");
 
+static bool device_lock_if(struct device *dev, bool cond)
+{
+	if (cond)
+		device_lock(dev);
+	return cond;
+}
+
+static void device_unlock_if(struct device *dev, bool take)
+{
+	if (take)
+		device_unlock(dev);
+}
+
+static void cxl_do_recovery(struct pci_dev *pdev)
+{
+}
+
 void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
 {
 	void __iomem *addr;
@@ -388,3 +439,83 @@ pci_ers_result_t pci_error_detected(struct pci_dev *pdev,
 	return rc;
 }
 EXPORT_SYMBOL_NS_GPL(pci_error_detected, "CXL");
+
+static void cxl_handle_proto_error(struct cxl_proto_err_work_data *err_info)
+{
+	struct pci_dev *pdev = err_info->pdev;
+	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
+
+	if (err_info->severity == AER_CORRECTABLE) {
+
+		if (pdev->aer_cap)
+			pci_clear_and_set_config_dword(pdev,
+						       pdev->aer_cap + PCI_ERR_COR_STATUS,
+						       0, PCI_ERR_COR_INTERNAL);
+
+		if (is_pcie_endpoint(pdev))
+			cxl_cor_error_detected(&cxlds->cxlmd->dev);
+		else
+			cxl_port_cor_error_detected(&pdev->dev);
+
+		pcie_clear_device_status(pdev);
+	} else {
+		cxl_do_recovery(pdev);
+	}
+}
+
+static void cxl_proto_err_work_fn(struct work_struct *work)
+{
+	struct cxl_proto_err_work_data wd;
+
+	while (cxl_proto_err_kfifo_get(&wd)) {
+		struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(wd.pdev);
+		struct device *cxlmd_dev;
+
+		if (!pdev) {
+			pr_err_ratelimited("NULL PCI device passed in AER-CXL KFIFO\n");
+			continue;
+		}
+
+		guard(device)(&pdev->dev);
+		if (is_pcie_endpoint(pdev)) {
+			struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
+
+			if (!cxl_pci_drv_bound(pdev))
+				return;
+			cxlmd_dev = &cxlds->cxlmd->dev;
+			device_lock_if(cxlmd_dev, cxlmd_dev);
+		} else {
+			cxlmd_dev = NULL;
+		}
+
+		struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
+		if (!port)
+			return;
+		guard(device)(&port->dev);
+
+		cxl_handle_proto_error(&wd);
+		device_unlock_if(cxlmd_dev, cxlmd_dev);
+	}
+}
+
+static struct work_struct cxl_proto_err_work;
+static DECLARE_WORK(cxl_proto_err_work, cxl_proto_err_work_fn);
+
+int cxl_ras_init(void)
+{
+	if (cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work))
+		pr_err("Failed to initialize CXL RAS CPER\n");
+
+	cxl_register_proto_err_work(&cxl_proto_err_work);
+
+	return 0;
+}
+
+void cxl_ras_exit(void)
+{
+	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
+	cancel_work_sync(&cxl_cper_prot_err_work);
+
+	cxl_unregister_proto_err_work();
+	cancel_work_sync(&cxl_proto_err_work);
+}
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 53a49bb32514..6341ca6515a5 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2277,6 +2277,7 @@ void pcie_clear_device_status(struct pci_dev *dev)
 	pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &sta);
 	pcie_capability_write_word(dev, PCI_EXP_DEVSTA, sta);
 }
+EXPORT_SYMBOL_GPL(pcie_clear_device_status);
 #endif
 
 /**
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index a398e489318c..2af6ea82526d 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -229,7 +229,6 @@ void pci_refresh_power_state(struct pci_dev *dev);
 int pci_power_up(struct pci_dev *dev);
 void pci_disable_enabled_device(struct pci_dev *dev);
 int pci_finish_runtime_suspend(struct pci_dev *dev);
-void pcie_clear_device_status(struct pci_dev *dev);
 void pcie_clear_root_pme_status(struct pci_dev *dev);
 bool pci_check_pme_status(struct pci_dev *dev);
 void pci_pme_wakeup_bus(struct pci_bus *bus);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index cffa5535f28d..33d16b212e0d 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1886,8 +1886,10 @@ static inline void pci_hp_unignore_link_change(struct pci_dev *pdev) { }
 
 #ifdef CONFIG_PCIEAER
 bool pci_aer_available(void);
+void pcie_clear_device_status(struct pci_dev *dev);
 #else
 static inline bool pci_aer_available(void) { return false; }
+static inline void pcie_clear_device_status(struct pci_dev *dev) { }
 #endif
 
 bool pci_ats_disabled(void);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [RESEND v13 22/25] CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (20 preceding siblings ...)
  2025-11-04 17:03 ` [RESEND v13 21/25] PCI/AER: Dequeue forwarded CXL error Terry Bowman
@ 2025-11-04 17:03 ` Terry Bowman
  2025-11-04 18:41   ` Jonathan Cameron
  2025-11-04 19:03   ` Bjorn Helgaas
  2025-11-04 17:03 ` [RESEND v13 23/25] CXL/PCI: Introduce CXL uncorrectable protocol error recovery Terry Bowman
                   ` (3 subsequent siblings)
  25 siblings, 2 replies; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:03 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

CXL uncorrectable errors (UCE) will soon be handled separately from the PCI
AER handling. The merge_result() function can be shared by both handling
paths.

Rename the PCI subsystem's merge_result() to pcie_ers_merge_result().
Export pcie_ers_merge_result() to make it available to the CXL driver and
other drivers.

Update pcie_ers_merge_result() to support the recently introduced
PCI_ERS_RESULT_PANIC result.
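
A minimal userspace sketch of the merge rule may help when reviewing the
precedence of the new result. The enum and function names below are
hypothetical stand-ins, not the kernel definitions. The switch statement is
an assumption about the unchanged part of the function that the hunks below
do not show, so check it against drivers/pci/pcie/err.c.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins for the kernel's pci_ers_result values. */
enum ers_result {
	ERS_NONE,
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
	ERS_NO_AER_DRIVER,
	ERS_PANIC,
};

/* Merge one new vote into the accumulated result. */
static enum ers_result ers_merge(enum ers_result orig, enum ers_result vote)
{
	/* A panic vote overrides whatever has been accumulated so far. */
	if (vote == ERS_PANIC)
		return ERS_PANIC;

	if (vote == ERS_NO_AER_DRIVER)
		return ERS_NO_AER_DRIVER;

	/* A device without an opinion leaves the result unchanged. */
	if (vote == ERS_NONE)
		return orig;

	/* Assumed body of the unchanged part of the function. */
	switch (orig) {
	case ERS_CAN_RECOVER:
	case ERS_RECOVERED:
		orig = vote;
		break;
	case ERS_DISCONNECT:
		if (vote == ERS_NEED_RESET)
			orig = ERS_NEED_RESET;
		break;
	default:
		break;
	}

	return orig;
}

/* Fold a list of per-device votes into a single result. */
static enum ers_result ers_merge_votes(const enum ers_result *votes, size_t n)
{
	enum ers_result result = ERS_CAN_RECOVER;

	assert(votes != NULL || n == 0);

	for (size_t i = 0; i < n; i++)
		result = ers_merge(result, votes[i]);

	return result;
}
```

In this model a panic vote survives later recoverable votes because an
accumulated panic result falls through the default case of the switch.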

Signed-off-by: Terry Bowman <terry.bowman@amd.com>

---

Changes in v12->v13:
- Renamed pci_ers_merge_result() to pcie_ers_merge_result().
  pci_ers_merge_result() is already used in eeh driver. (Bot)

Changes in v11->v12:
- Remove static inline pci_ers_merge_result() definition for !CONFIG_PCIEAER.
  Is not needed. (Lukas)

Changes in v10->v11:
- New patch
- pci_ers_merge_result() - Change export to non-namespace and rename
  to be pci_ers_merge_result()
- Move pci_ers_merge_result() definition to pci.h. Needs pci_ers_result
---
 drivers/pci/pcie/err.c | 14 +++++++++-----
 include/linux/pci.h    |  7 +++++++
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index bebe4bc111d7..9394bbdcf0fb 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -21,9 +21,12 @@
 #include "portdrv.h"
 #include "../pci.h"
 
-static pci_ers_result_t merge_result(enum pci_ers_result orig,
-				  enum pci_ers_result new)
+pci_ers_result_t pcie_ers_merge_result(enum pci_ers_result orig,
+				       enum pci_ers_result new)
 {
+	if (new == PCI_ERS_RESULT_PANIC)
+		return PCI_ERS_RESULT_PANIC;
+
 	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
 		return PCI_ERS_RESULT_NO_AER_DRIVER;
 
@@ -45,6 +48,7 @@ static pci_ers_result_t merge_result(enum pci_ers_result orig,
 
 	return orig;
 }
+EXPORT_SYMBOL(pcie_ers_merge_result);
 
 static int report_error_detected(struct pci_dev *dev,
 				 pci_channel_state_t state,
@@ -81,7 +85,7 @@ static int report_error_detected(struct pci_dev *dev,
 		vote = err_handler->error_detected(dev, state);
 	}
 	pci_uevent_ers(dev, vote);
-	*result = merge_result(*result, vote);
+	*result = pcie_ers_merge_result(*result, vote);
 	device_unlock(&dev->dev);
 	return 0;
 }
@@ -139,7 +143,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data)
 
 	err_handler = pdrv->err_handler;
 	vote = err_handler->mmio_enabled(dev);
-	*result = merge_result(*result, vote);
+	*result = pcie_ers_merge_result(*result, vote);
 out:
 	device_unlock(&dev->dev);
 	return 0;
@@ -159,7 +163,7 @@ static int report_slot_reset(struct pci_dev *dev, void *data)
 
 	err_handler = pdrv->err_handler;
 	vote = err_handler->slot_reset(dev);
-	*result = merge_result(*result, vote);
+	*result = pcie_ers_merge_result(*result, vote);
 out:
 	device_unlock(&dev->dev);
 	return 0;
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 33d16b212e0d..d3e3300f79ec 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1887,9 +1887,16 @@ static inline void pci_hp_unignore_link_change(struct pci_dev *pdev) { }
 #ifdef CONFIG_PCIEAER
 bool pci_aer_available(void);
 void pcie_clear_device_status(struct pci_dev *dev);
+pci_ers_result_t pcie_ers_merge_result(enum pci_ers_result orig,
+				       enum pci_ers_result new);
 #else
 static inline bool pci_aer_available(void) { return false; }
 static inline void pcie_clear_device_status(struct pci_dev *dev) { }
+static inline pci_ers_result_t pcie_ers_merge_result(enum pci_ers_result orig,
+						     enum pci_ers_result new)
+{
+	return PCI_ERS_RESULT_NONE;
+}
 #endif
 
 bool pci_ats_disabled(void);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [RESEND v13 23/25] CXL/PCI: Introduce CXL uncorrectable protocol error recovery
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (21 preceding siblings ...)
  2025-11-04 17:03 ` [RESEND v13 22/25] CXL/PCI: Export and rename merge_result() to pci_ers_merge_result() Terry Bowman
@ 2025-11-04 17:03 ` Terry Bowman
  2025-11-04 18:47   ` Jonathan Cameron
                     ` (2 more replies)
  2025-11-04 17:03 ` [RESEND v13 24/25] CXL/PCI: Enable CXL protocol errors during CXL Port probe Terry Bowman
                   ` (2 subsequent siblings)
  25 siblings, 3 replies; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:03 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Implement cxl_do_recovery() to handle uncorrectable protocol
errors (UCE), following the design of pcie_do_recovery(). Unlike PCIe,
all CXL UCEs are treated as fatal and trigger a kernel panic to avoid
potential CXL memory corruption.

Add cxl_walk_port(), analogous to pci_walk_bridge(), to traverse the
CXL topology from the error source through downstream CXL ports and
endpoints.

Introduce cxl_report_error_detected(), mirroring PCI's
report_error_detected(), and implement device locking for the affected
subtree. Endpoints require locking the PCI device (pdev->dev) and the
CXL memdev (cxlmd->dev). CXL ports require locking the PCI
device (pdev->dev) and the parent CXL port.

Take the device locks as early as possible. The device that initially
reported the error is locked immediately after the kfifo dequeue. Devices
visited during the walk are locked in cxl_report_error_detected(), except
for the reporting device, which is already locked at that point.
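
The conditional locking rule above can be sketched in userspace as follows.
This is a simplified model, not the kernel code: struct mock_dev and the
mock_* helpers are hypothetical, and a boolean flag stands in for the device
lock. It only shows why the reporting device is skipped when taking locks
during the walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct device; 'locked' models the device lock. */
struct mock_dev {
	const char *name;
	bool locked;
	int visits;
};

/* Take the lock only when cond is true, in the spirit of device_lock_if(). */
static bool mock_lock_if(struct mock_dev *dev, bool cond)
{
	if (cond) {
		assert(!dev->locked);	/* a second lock would deadlock */
		dev->locked = true;
	}
	return cond;
}

/* Release the lock only when it was taken by mock_lock_if(). */
static void mock_unlock_if(struct mock_dev *dev, bool cond)
{
	if (cond) {
		assert(dev->locked);
		dev->locked = false;
	}
}

/* Visit one device; skip locking for the already-locked reporting device. */
static void mock_report(struct mock_dev *dev, struct mock_dev *err_dev)
{
	bool need_lock = (dev != err_dev);

	mock_lock_if(dev, need_lock);
	assert(dev->locked);	/* the handler runs with the lock held either way */
	dev->visits++;
	mock_unlock_if(dev, need_lock);
}

/* Walk every device in the affected subtree. */
static void mock_walk(struct mock_dev *devs, size_t n, struct mock_dev *err_dev)
{
	for (size_t i = 0; i < n; i++)
		mock_report(&devs[i], err_dev);
}
```

After the walk, only the reporting device remains locked, which matches the
expectation that the caller which locked it after dequeue also unlocks it.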

Export pci_aer_clear_fatal_status() for use when a UCE is not present.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>

---

Changes in v12->v13:
- Add guard() before calling cxl_pci_drv_bound() (Dave Jiang)
- Add guard() calls for EP (cxlds->cxlmd->dev & pdev->dev) and ports
  (pdev->dev & parent cxl_port) in cxl_report_error_detected() and
  cxl_handle_proto_error() (Terry)
- Remove unnecessary check for endpoint port. (Dave Jiang)
- Remove check for RCIEP EP in cxl_report_error_detected(). (Terry)

Changes in v11->v12:
- Clean up port discovery in cxl_do_recovery() (Dave)
- Add PCI_EXP_TYPE_RC_END to type check in cxl_report_error_detected()

Changes in v10->v11:
- pci_ers_merge_results() - Move to earlier patch
---
 drivers/cxl/core/ras.c | 135 ++++++++++++++++++++++++++++++++++++++++-
 drivers/pci/pci.h      |   1 -
 drivers/pci/pcie/aer.c |   1 +
 include/linux/aer.h    |   2 +
 4 files changed, 135 insertions(+), 4 deletions(-)

diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 5bc144cde0ee..52c6f19564b6 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -259,8 +259,138 @@ static void device_unlock_if(struct device *dev, bool take)
 		device_unlock(dev);
 }
 
+/**
+ * cxl_report_error_detected
+ * @dev: Device being reported
+ * @data: Result
+ * @err_pdev: Device with initial detected error. Is locked immediately
+ *            after KFIFO dequeue.
+ */
+static int cxl_report_error_detected(struct device *dev, void *data, struct pci_dev *err_pdev)
+{
+	bool need_lock = (dev != &err_pdev->dev);
+	pci_ers_result_t vote, *result = data;
+	struct pci_dev *pdev;
+
+	if (!dev || !dev_is_pci(dev))
+		return 0;
+	pdev = to_pci_dev(dev);
+
+	device_lock_if(&pdev->dev, need_lock);
+	if (is_pcie_endpoint(pdev) && !cxl_pci_drv_bound(pdev)) {
+		device_unlock_if(&pdev->dev, need_lock);
+		return PCI_ERS_RESULT_NONE;
+	}
+
+	if (pdev->aer_cap)
+		pci_clear_and_set_config_dword(pdev,
+					       pdev->aer_cap + PCI_ERR_COR_STATUS,
+					       0, PCI_ERR_COR_INTERNAL);
+
+	if (is_pcie_endpoint(pdev)) {
+		struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
+
+		device_lock_if(&cxlds->cxlmd->dev, need_lock);
+		vote = cxl_error_detected(&cxlds->cxlmd->dev);
+		device_unlock_if(&cxlds->cxlmd->dev, need_lock);
+	} else {
+		vote = cxl_port_error_detected(dev);
+	}
+
+	pcie_clear_device_status(pdev);
+	*result = pcie_ers_merge_result(*result, vote);
+	device_unlock_if(&pdev->dev, need_lock);
+
+	return 0;
+}
+
+static int match_port_by_parent_dport(struct device *dev, const void *dport_dev)
+{
+	struct cxl_port *port;
+
+	if (!is_cxl_port(dev))
+		return 0;
+
+	port = to_cxl_port(dev);
+
+	return port->parent_dport->dport_dev == dport_dev;
+}
+
+/**
+ * cxl_walk_port
+ *
+ * @port: Port be traversed into
+ * @cb: Callback for handling the CXL Ports
+ * @userdata: Result
+ * @err_pdev: Device with initial detected error. Is locked immediately
+ *            after KFIFO dequeue.
+ */
+static void cxl_walk_port(struct cxl_port *port,
+			  int (*cb)(struct device *, void *, struct pci_dev *),
+			  void *userdata,
+			  struct pci_dev *err_pdev)
+{
+	struct cxl_port *err_port __free(put_cxl_port) = get_cxl_port(err_pdev);
+	bool need_lock = (port != err_port);
+	struct cxl_dport *dport = NULL;
+	unsigned long index;
+
+	device_lock_if(&port->dev, need_lock);
+	if (is_cxl_endpoint(port)) {
+		cb(port->uport_dev->parent, userdata, err_pdev);
+		device_unlock_if(&port->dev, need_lock);
+		return;
+	}
+
+	if (port->uport_dev && dev_is_pci(port->uport_dev))
+		cb(port->uport_dev, userdata, err_pdev);
+
+	/*
+	 * Iterate over the set of Downstream Ports recorded in port->dports (XArray):
+	 *  - For each dport, attempt to find a child CXL Port whose parent dport
+	 *    match.
+	 *  - Invoke the provided callback on the dport's device.
+	 *  - If a matching child CXL Port device is found, recurse into that port to
+	 *    continue the walk.
+	 */
+	xa_for_each(&port->dports, index, dport)
+	{
+		struct device *child_port_dev __free(put_device) =
+			bus_find_device(&cxl_bus_type, &port->dev, dport->dport_dev,
+					match_port_by_parent_dport);
+
+		cb(dport->dport_dev, userdata, err_pdev);
+		if (child_port_dev)
+			cxl_walk_port(to_cxl_port(child_port_dev), cb, userdata, err_pdev);
+	}
+	device_unlock_if(&port->dev, need_lock);
+}
+
 static void cxl_do_recovery(struct pci_dev *pdev)
 {
+	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
+	struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
+
+	if (!port) {
+		pci_err(pdev, "Failed to find the CXL device\n");
+		return;
+	}
+
+	cxl_walk_port(port, cxl_report_error_detected, &status, pdev);
+	if (status == PCI_ERS_RESULT_PANIC)
+		panic("CXL cachemem error.");
+
+	/*
+	 * If we have native control of AER, clear error status in the device
+	 * that detected the error.  If the platform retained control of AER,
+	 * it is responsible for clearing this status.  In that case, the
+	 * signaling device may not even be visible to the OS.
+	 */
+	if (cxl_error_is_native(pdev)) {
+		pcie_clear_device_status(pdev);
+		pci_aer_clear_nonfatal_status(pdev);
+		pci_aer_clear_fatal_status(pdev);
+	}
 }
 
 void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
@@ -483,16 +613,15 @@ static void cxl_proto_err_work_fn(struct work_struct *work)
 			if (!cxl_pci_drv_bound(pdev))
 				return;
 			cxlmd_dev = &cxlds->cxlmd->dev;
-			device_lock_if(cxlmd_dev, cxlmd_dev);
 		} else {
 			cxlmd_dev = NULL;
 		}
 
+		/* Lock the CXL parent Port */
 		struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
-		if (!port)
-			return;
 		guard(device)(&port->dev);
 
+		device_lock_if(cxlmd_dev, cxlmd_dev);
 		cxl_handle_proto_error(&wd);
 		device_unlock_if(cxlmd_dev, cxlmd_dev);
 	}
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 2af6ea82526d..3637996d37ab 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -1174,7 +1174,6 @@ void pci_restore_aer_state(struct pci_dev *dev);
 static inline void pci_no_aer(void) { }
 static inline void pci_aer_init(struct pci_dev *d) { }
 static inline void pci_aer_exit(struct pci_dev *d) { }
-static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
 static inline int pci_aer_clear_status(struct pci_dev *dev) { return -EINVAL; }
 static inline int pci_aer_raw_clear_status(struct pci_dev *dev) { return -EINVAL; }
 static inline void pci_save_aer_state(struct pci_dev *dev) { }
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e806fa05280b..4cf44297bb24 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -297,6 +297,7 @@ void pci_aer_clear_fatal_status(struct pci_dev *dev)
 	if (status)
 		pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, status);
 }
+EXPORT_SYMBOL_GPL(pci_aer_clear_fatal_status);
 
 /**
  * pci_aer_raw_clear_status - Clear AER error registers.
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 6b2c87d1b5b6..64aef69fb546 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -66,6 +66,7 @@ struct cxl_proto_err_work_data {
 
 #if defined(CONFIG_PCIEAER)
 int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
+void pci_aer_clear_fatal_status(struct pci_dev *dev);
 int pcie_aer_is_native(struct pci_dev *dev);
 void pci_aer_unmask_internal_errors(struct pci_dev *dev);
 #else
@@ -73,6 +74,7 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
 {
 	return -EINVAL;
 }
+static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
 static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
 static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
 #endif
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [RESEND v13 24/25] CXL/PCI: Enable CXL protocol errors during CXL Port probe
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (22 preceding siblings ...)
  2025-11-04 17:03 ` [RESEND v13 23/25] CXL/PCI: Introduce CXL uncorrectable protocol error recovery Terry Bowman
@ 2025-11-04 17:03 ` Terry Bowman
  2025-11-04 17:03 ` [RESEND v13 25/25] CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup Terry Bowman
  2025-11-04 19:11 ` [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Bjorn Helgaas
  25 siblings, 0 replies; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:03 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

CXL protocol errors are not enabled for all CXL devices after boot. These
must be enabled in order to process CXL protocol errors.

Introduce cxl_unmask_proto_interrupts() to call pci_aer_unmask_internal_errors().
pci_aer_unmask_internal_errors() expects pdev->aer_cap to be initialized.
However, pdev->aer_cap is not initialized for CXL Upstream Switch Ports and CXL
Downstream Switch Ports. Initialize pdev->aer_cap if necessary. Enable AER
correctable internal errors and uncorrectable internal errors for all CXL
devices.
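
For illustration only, the unmask step can be modelled in userspace as
clearing two bits in the AER mask registers. This is a sketch, not the
kernel code: struct fake_aer_regs and fake_unmask_internal_errors() are
hypothetical names, and the registers are plain integers instead of PCI
config space. The two bit values are the ones defined in
include/uapi/linux/pci_regs.h.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/pci_regs.h */
#define PCI_ERR_UNC_INTN	0x00400000u	/* Uncorrectable Internal Error */
#define PCI_ERR_COR_INTERNAL	0x00004000u	/* Corrected Internal Error */

/* Hypothetical stand-in for a device's AER mask registers */
struct fake_aer_regs {
	uint32_t uncor_mask;	/* models PCI_ERR_UNCOR_MASK */
	uint32_t cor_mask;	/* models PCI_ERR_COR_MASK */
};

/*
 * A set mask bit suppresses reporting, so unmasking means clearing the
 * internal-error bit in both registers and leaving all other bits alone.
 */
static void fake_unmask_internal_errors(struct fake_aer_regs *regs)
{
	assert(regs);
	regs->uncor_mask &= ~PCI_ERR_UNC_INTN;
	regs->cor_mask &= ~PCI_ERR_COR_INTERNAL;
}
```

Starting from uncor_mask = 0x00462030 and cor_mask = 0x0000e000, only the
internal-error bit changes in each register.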

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>

---

Changes in v12->v13:
- Add dev and dev_is_pci() NULL checks in cxl_unmask_proto_interrupts() (Terry)
- Add Dave Jiang's and Ben's Reviewed-by

Changes in v11->v12:
- None

Changes in v10->v11:
- Added check for valid PCI devices in is_cxl_error() (Terry)
- Removed check for RCiEP in cxl_handle_proto_err() and
  cxl_report_error_detected() (Terry)
---
 drivers/cxl/core/core.h |  4 ++++
 drivers/cxl/core/port.c |  4 ++++
 drivers/cxl/core/ras.c  | 26 +++++++++++++++++++++++++-
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 046ec65ed147..a7a0838c8f23 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -159,6 +159,8 @@ pci_ers_result_t pci_error_detected(struct pci_dev *pdev,
 void pci_cor_error_detected(struct pci_dev *pdev);
 pci_ers_result_t cxl_port_error_detected(struct device *dev);
 void cxl_port_cor_error_detected(struct device *dev);
+void cxl_mask_proto_interrupts(struct device *dev);
+void cxl_unmask_proto_interrupts(struct device *dev);
 #else
 static inline int cxl_ras_init(void)
 {
@@ -183,6 +185,8 @@ static inline pci_ers_result_t cxl_port_error_detected(struct device *dev)
 {
 	return PCI_ERS_RESULT_NONE;
 }
+static inline void cxl_unmask_proto_interrupts(struct device *dev) { }
+static inline void cxl_mask_proto_interrupts(struct device *dev) { }
 #endif /* CONFIG_CXL_RAS */
 
 /* Restricted CXL Host specific RAS functions */
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index d060f864cf2e..a23c742eb670 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1747,6 +1747,8 @@ static int add_port_attach_ep(struct cxl_memdev *cxlmd,
 		rc = -ENXIO;
 	}
 
+	cxl_unmask_proto_interrupts(cxlmd->cxlds->dev);
+
 	return rc;
 }
 
@@ -1833,6 +1835,8 @@ int devm_cxl_enumerate_ports(struct cxl_memdev *cxlmd)
 
 			rc = cxl_add_ep(dport, &cxlmd->dev);
 
+			cxl_unmask_proto_interrupts(cxlmd->cxlds->dev);
+
 			/*
 			 * If the endpoint already exists in the port's list,
 			 * that's ok, it was added on a previous pass.
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 52c6f19564b6..101e55723785 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -122,6 +122,23 @@ static bool is_pcie_endpoint(struct pci_dev *pdev)
 	return pci_pcie_type(pdev) == PCI_EXP_TYPE_ENDPOINT;
 }
 
+void cxl_unmask_proto_interrupts(struct device *dev)
+{
+	if (!dev || !dev_is_pci(dev))
+		return;
+
+	struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(to_pci_dev(dev));
+
+	if (!pdev->aer_cap) {
+		pdev->aer_cap = pci_find_ext_capability(pdev,
+							PCI_EXT_CAP_ID_ERR);
+		if (!pdev->aer_cap)
+			return;
+	}
+
+	pci_aer_unmask_internal_errors(pdev);
+}
+
 static void cxl_dport_map_ras(struct cxl_dport *dport)
 {
 	struct cxl_register_map *map = &dport->reg_map;
@@ -230,7 +247,10 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
 
 		cxl_dport_map_rch_aer(dport);
 		cxl_disable_rch_root_ints(dport);
+		return;
 	}
+
+	cxl_unmask_proto_interrupts(dport->dport_dev);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
 
@@ -241,8 +261,12 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port,
 
 	map->host = host;
 	if (cxl_map_component_regs(map, &port->uport_regs,
-				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
+				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
 		dev_dbg(&port->dev, "Failed to map RAS capability\n");
+		return;
+	}
+
+	cxl_unmask_proto_interrupts(port->uport_dev);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, "CXL");
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* [RESEND v13 25/25] CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (23 preceding siblings ...)
  2025-11-04 17:03 ` [RESEND v13 24/25] CXL/PCI: Enable CXL protocol errors during CXL Port probe Terry Bowman
@ 2025-11-04 17:03 ` Terry Bowman
  2025-11-20  3:10   ` dan.j.williams
  2025-11-04 19:11 ` [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Bjorn Helgaas
  25 siblings, 1 reply; 103+ messages in thread
From: Terry Bowman @ 2025-11-04 17:03 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

During CXL device cleanup, the CXL PCIe Port device interrupts remain
enabled. This potentially allows unnecessary interrupt processing for
CXL errors while the device is being destroyed.

Disable CXL protocol errors by setting the CXL devices' AER mask register.

Introduce pci_aer_mask_internal_errors(), similar to pci_aer_unmask_internal_errors().
Add it to the AER service driver so that other subsystems can use it.

Introduce cxl_mask_proto_interrupts() to call pci_aer_mask_internal_errors().
Add calls to cxl_mask_proto_interrupts() within CXL Port teardown for CXL
Root Ports, CXL Downstream Switch Ports, CXL Upstream Switch Ports, and CXL
Endpoints. Follow the same "bottom-up" approach used during CXL Port
teardown.
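
As a rough userspace sketch of the masking step (not the kernel
implementation): the patch below sets the internal-error bits through a
clear-and-set read-modify-write. fake_clear_and_set() and
fake_mask_internal_errors() are hypothetical names that operate on plain
integers instead of PCI config space. The bit values are the ones defined
in include/uapi/linux/pci_regs.h.

```c
#include <assert.h>
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/pci_regs.h */
#define PCI_ERR_UNC_INTN	0x00400000u	/* Uncorrectable Internal Error */
#define PCI_ERR_COR_INTERNAL	0x00004000u	/* Corrected Internal Error */

/* Read-modify-write: drop the 'clear' bits, then add the 'set' bits */
static uint32_t fake_clear_and_set(uint32_t val, uint32_t clear, uint32_t set)
{
	return (val & ~clear) | set;
}

/* Masking sets the internal-error bit in both mask registers */
static void fake_mask_internal_errors(uint32_t *uncor_mask, uint32_t *cor_mask)
{
	assert(uncor_mask && cor_mask);
	*uncor_mask = fake_clear_and_set(*uncor_mask, 0, PCI_ERR_UNC_INTN);
	*cor_mask = fake_clear_and_set(*cor_mask, 0, PCI_ERR_COR_INTERNAL);
}
```

Passing 0 as the clear argument leaves every other mask bit as it was, so
masking and unmasking are symmetric.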

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

---

Changes in v12->v13:
- Added dev and dev_is_pci() checks in cxl_mask_proto_interrupts() (Terry)

Changes in v11->v12:
- Keep pci_aer_mask_internal_errors() in drivers/pci/pcie/aer.c (Lukas)
- Update commit description for pci_aer_mask_internal_errors()
- Add check `if (port->parent_dport)` in delete_switch_port() (Terry)

Changes in v10->v11:
- Removed guard() from cxl_mask_proto_interrupts(). RP was blocking during
  testing. (Terry)
---
 drivers/cxl/core/port.c | 10 +++++++++-
 drivers/cxl/core/ras.c  | 10 ++++++++++
 drivers/pci/pcie/aer.c  | 21 +++++++++++++++++++++
 include/linux/aer.h     |  2 ++
 4 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index a23c742eb670..d19ebf052d76 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1437,6 +1437,10 @@ EXPORT_SYMBOL_NS_GPL(cxl_endpoint_autoremove, "CXL");
  */
 static void delete_switch_port(struct cxl_port *port)
 {
+	cxl_mask_proto_interrupts(port->uport_dev);
+	if (port->parent_dport)
+		cxl_mask_proto_interrupts(port->parent_dport->dport_dev);
+
 	devm_release_action(port->dev.parent, cxl_unlink_parent_dport, port);
 	devm_release_action(port->dev.parent, cxl_unlink_uport, port);
 	devm_release_action(port->dev.parent, unregister_port, port);
@@ -1458,8 +1462,10 @@ static void del_dports(struct cxl_port *port)
 
 	device_lock_assert(&port->dev);
 
-	xa_for_each(&port->dports, index, dport)
+	xa_for_each(&port->dports, index, dport) {
+		cxl_mask_proto_interrupts(dport->dport_dev);
 		del_dport(dport);
+	}
 }
 
 struct detach_ctx {
@@ -1486,6 +1492,8 @@ static void cxl_detach_ep(void *data)
 {
 	struct cxl_memdev *cxlmd = data;
 
+	cxl_mask_proto_interrupts(cxlmd->cxlds->dev);
+
 	for (int i = cxlmd->depth - 1; i >= 1; i--) {
 		struct cxl_port *port, *parent_port;
 		struct detach_ctx ctx = {
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 101e55723785..6dccbe66c9ac 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -139,6 +139,16 @@ void cxl_unmask_proto_interrupts(struct device *dev)
 	pci_aer_unmask_internal_errors(pdev);
 }
 
+void cxl_mask_proto_interrupts(struct device *dev)
+{
+	if (!dev || !dev_is_pci(dev))
+		return;
+
+	struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(to_pci_dev(dev));
+
+	pci_aer_mask_internal_errors(pdev);
+}
+
 static void cxl_dport_map_ras(struct cxl_dport *dport)
 {
 	struct cxl_register_map *map = &dport->reg_map;
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 4cf44297bb24..fcc2f43c3383 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1149,6 +1149,27 @@ void pci_aer_unmask_internal_errors(struct pci_dev *dev)
 }
 EXPORT_SYMBOL_GPL(pci_aer_unmask_internal_errors);
 
+/**
+ * pci_aer_mask_internal_errors - mask internal errors
+ * @dev: pointer to the pci_dev data structure
+ *
+ * Masks internal errors in the Uncorrectable and Correctable Error
+ * Mask registers.
+ *
+ * Note: AER must be enabled and supported by the device which must be
+ * checked in advance, e.g. with pcie_aer_is_native().
+ */
+void pci_aer_mask_internal_errors(struct pci_dev *dev)
+{
+	int aer = dev->aer_cap;
+
+	pci_clear_and_set_config_dword(dev, aer + PCI_ERR_UNCOR_MASK,
+				       0, PCI_ERR_UNC_INTN);
+	pci_clear_and_set_config_dword(dev, aer + PCI_ERR_COR_MASK,
+				       0, PCI_ERR_COR_INTERNAL);
+}
+EXPORT_SYMBOL_GPL(pci_aer_mask_internal_errors);
+
 /**
  * pci_aer_handle_error - handle logging error into an event log
  * @dev: pointer to pci_dev data structure of error source device
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 64aef69fb546..2b89bd940ac1 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -69,6 +69,7 @@ int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
 void pci_aer_clear_fatal_status(struct pci_dev *dev);
 int pcie_aer_is_native(struct pci_dev *dev);
 void pci_aer_unmask_internal_errors(struct pci_dev *dev);
+void pci_aer_mask_internal_errors(struct pci_dev *dev);
 #else
 static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
 {
@@ -77,6 +78,7 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
 static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
 static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
 static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
+static inline void pci_aer_mask_internal_errors(struct pci_dev *dev) { }
 #endif
 
 #ifdef CONFIG_CXL_RAS
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 01/25] CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
  2025-11-04 17:02 ` [RESEND v13 01/25] CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h Terry Bowman
@ 2025-11-04 17:50   ` Jonathan Cameron
  2025-11-19  3:19   ` dan.j.williams
  2025-12-08 18:04   ` Bjorn Helgaas
  2 siblings, 0 replies; 103+ messages in thread
From: Jonathan Cameron @ 2025-11-04 17:50 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci

On Tue, 4 Nov 2025 11:02:41 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> The CXL DVSECs are currently defined in cxl/core/cxlpci.h. These are not
> accessible to other subsystems. Move these to uapi/linux/pci_regs.h.
> 
> Change DVSEC name formatting to follow the existing PCI format in
> pci_regs.h. The current format uses CXL_DVSEC_XYZ and the CXL defines must
> be changed to be PCI_DVSEC_CXL_XYZ to match existing pci_regs.h. Leave
> PCI_DVSEC_CXL_PORT* defines as-is because they are already defined and may
> be in use by userspace application(s).
> 
> Update existing usage to match the name change.
> 
> Update the inline documentation to refer to latest CXL spec version.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> 
Hi Terry,

A few minor things inline.

I'll assume you'll resolve those for the next version, and as they are
really minor:
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>


> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b14dd064006c..53a49bb32514 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5002,7 +5002,9 @@ static bool cxl_sbr_masked(struct pci_dev *dev)
>  	if (!dvsec)
>  		return false;
>  
> -	rc = pci_read_config_word(dev, dvsec + PCI_DVSEC_CXL_PORT_CTL, &reg);
> +	rc = pci_read_config_word(dev,
> +				  dvsec + PCI_DVSEC_CXL_PORT_CTL,
> +				  &reg);
Looks like a leftover from an earlier version where that define was longer?
It shouldn't still be here, given the two versions are (I think?) identical
other than some premature line wrapping.
>  	if (rc || PCI_POSSIBLE_ERROR(reg))
>  		return false;
>  
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index 07e06aafec50..279b92f01d08 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -1244,9 +1244,64 @@
>  /* Deprecated old name, replaced with PCI_DOE_DATA_OBJECT_DISC_RSP_3_TYPE */
>  #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		PCI_DOE_DATA_OBJECT_DISC_RSP_3_TYPE
>  
> -/* Compute Express Link (CXL r3.1, sec 8.1.5) */
> -#define PCI_DVSEC_CXL_PORT				3
> -#define PCI_DVSEC_CXL_PORT_CTL				0x0c
> -#define PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR		0x00000001
> +/* Compute Express Link (CXL r3.2, sec 8.1)
Follow local comment style.
/*
 * Compute Express Link (CXL r3.2, sec 8.1)
 *...

> + *
> + * Note that CXL DVSEC id 3 and 7 to be ignored when the CXL link state
> + * is "disconnected" (CXL r3.2, sec 9.12.3). Re-enumerate these
> + * registers on downstream link-up events.
> + */


> +/* CXL 3.2 8.1.8: PCIe DVSEC for Flex Bus Port */
> +#define PCI_DVSEC_CXL_FLEXBUS_PORT				7
> +#define  PCI_DVSEC_CXL_FLEXBUS_STATUS_OFFSET			0xE

I wonder if you should keep the _PORT in the naming for consistency.
These are also new defines rather than moves / renames.  I wonder if it
makes sense to bury them in this patch. Instead bring them in where they
are used?  That will also make it more obvious why only a fairly random
looking subset of this structure is used.


> +#define   PCI_DVSEC_CXL_FLEXBUS_STATUS_CACHE_MASK		_BITUL(0)
> +#define   PCI_DVSEC_CXL_FLEXBUS_STATUS_MEM_MASK			_BITUL(2)
> +
> +/* CXL 3.2 8.1.9: Register Locator DVSEC */
> +#define PCI_DVSEC_CXL_REG_LOCATOR				8
> +#define  PCI_DVSEC_CXL_REG_LOCATOR_BLOCK1_OFFSET		0xC
> +#define   PCI_DVSEC_CXL_REG_LOCATOR_BIR_MASK			__GENMASK(2, 0)
> +#define   PCI_DVSEC_CXL_REG_LOCATOR_BLOCK_ID_MASK		__GENMASK(15, 8)
> +#define   PCI_DVSEC_CXL_REG_LOCATOR_BLOCK_OFF_LOW_MASK		__GENMASK(31, 16)
>  
>  #endif /* LINUX_PCI_REGS_H */


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 02/25] PCI/CXL: Introduce pcie_is_cxl()
  2025-11-04 17:02 ` [RESEND v13 02/25] PCI/CXL: Introduce pcie_is_cxl() Terry Bowman
@ 2025-11-04 17:52   ` Jonathan Cameron
  2025-11-19  3:19   ` dan.j.williams
  2025-11-21 20:31   ` Gregory Price
  2 siblings, 0 replies; 103+ messages in thread
From: Jonathan Cameron @ 2025-11-04 17:52 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci

On Tue, 4 Nov 2025 11:02:42 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> CXL and AER drivers need the ability to identify CXL devices.
> 
> Introduce set_pcie_cxl() with logic checking for CXL.mem or CXL.cache
> status in the CXL Flexbus DVSEC status register. The CXL Flexbus DVSEC
> presence is used because it is required for all the CXL PCIe devices.[1]
> 
> Add boolean 'struct pci_dev::is_cxl' to cache the CXL.cache and
> CXL.mem status.
> 
> In the case the device is an EP or USP, call set_pcie_cxl() on behalf of
> the parent downstream device. Once a device is created there is a
> possibility the parent training or CXL state was updated as well. This
> will make certain the correct parent CXL state is cached.
> 
> Add function pcie_is_cxl() to return 'struct pci_dev::is_cxl'.
> 
> [1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
>     Capability (DVSEC) ID Assignment, Table 8-2
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Alejandro Lucero <alucerop@amd.com>
> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
> 
Hi Terry,

Drag the FLEXBUS_STATUS defines from previous patch to this one and we
are all good I think. At least it wasn't far :)

Jonathan
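
For reference, a minimal userspace sketch of the check described in the
quoted commit message: a device counts as CXL when the Flex Bus DVSEC is
present and its status register reports CXL.cache or CXL.mem. The bit
positions (0 and 2) are taken from the FLEXBUS_STATUS masks in patch 01.
fake_is_cxl() and the FAKE_ macro names are hypothetical, and the DVSEC
lookup is reduced to a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions from the FLEXBUS_STATUS masks quoted in patch 01 */
#define FAKE_FLEXBUS_STATUS_CACHE	(1u << 0)	/* CXL.cache enabled */
#define FAKE_FLEXBUS_STATUS_MEM		(1u << 2)	/* CXL.mem enabled */

/*
 * has_dvsec models "the Flex Bus Port DVSEC was found"; status models the
 * 16-bit value read from the DVSEC status offset.
 */
static bool fake_is_cxl(bool has_dvsec, uint16_t status)
{
	if (!has_dvsec)
		return false;

	return (status & FAKE_FLEXBUS_STATUS_CACHE) ||
	       (status & FAKE_FLEXBUS_STATUS_MEM);
}
```

A status with only bit 1 set is not treated as CXL here, because neither
the cache bit nor the mem bit is set.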

> ---
> 
> Changes in v12->v13:
> - Add Ben's "reviewed-by"
> 
> Changes in v11->v12:
> - Add review-by for Alejandro
> - Add comment in set_pcie_cxl() explaining why updating parent status.
> 
> Changes in v10->v11:
> - Amend set_pcie_cxl() to check for Upstream Port's and EP's parent
>   downstream port by calling set_pcie_cxl(). (Dan)
> - Retitle patch: 'Add' -> 'Introduce'
> - Add check for CXL.mem and CXL.cache (Alejandro, Dan)
> ---
>  drivers/pci/probe.c | 29 +++++++++++++++++++++++++++++
>  include/linux/pci.h |  6 ++++++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 0ce98e18b5a8..63124651f865 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1709,6 +1709,33 @@ static void set_pcie_thunderbolt(struct pci_dev *dev)
>  		dev->is_thunderbolt = 1;
>  }
>  
> +static void set_pcie_cxl(struct pci_dev *dev)
> +{
> +	struct pci_dev *parent;
> +	u16 dvsec = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
> +					      PCI_DVSEC_CXL_FLEXBUS_PORT);
> +	if (dvsec) {
> +		u16 cap;
> +
> +		pci_read_config_word(dev, dvsec + PCI_DVSEC_CXL_FLEXBUS_STATUS_OFFSET, &cap);
> +
> +		dev->is_cxl = FIELD_GET(PCI_DVSEC_CXL_FLEXBUS_STATUS_CACHE_MASK, cap) ||
> +			FIELD_GET(PCI_DVSEC_CXL_FLEXBUS_STATUS_MEM_MASK, cap);
> +	}
> +
> +	if (!pci_is_pcie(dev) ||
> +	    !(pci_pcie_type(dev) == PCI_EXP_TYPE_ENDPOINT ||
> +	      pci_pcie_type(dev) == PCI_EXP_TYPE_UPSTREAM))
> +		return;
> +
> +	/*
> +	 * Update parent's CXL state because alternate protocol training
> +	 * may have changed
> +	 */
> +	parent = pci_upstream_bridge(dev);
> +	set_pcie_cxl(parent);
> +}


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 03/25] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
  2025-11-04 17:02 ` [RESEND v13 03/25] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
@ 2025-11-04 17:53   ` Jonathan Cameron
  2025-11-19  3:20   ` dan.j.williams
  1 sibling, 0 replies; 103+ messages in thread
From: Jonathan Cameron @ 2025-11-04 17:53 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci

On Tue, 4 Nov 2025 11:02:43 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> The CXL driver's cxl_handle_endpoint_cor_ras()/cxl_handle_endpoint_ras()
> are unnecessary helper functions used only for Endpoints. Remove these
> functions as they are not common for all CXL devices and do not provide
> value for EP handling.
> 
> Rename __cxl_handle_ras to cxl_handle_ras() and __cxl_handle_cor_ras()
> to cxl_handle_cor_ras().
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> 
> Changes in v12->v13:
Change log needs to go below the --- as we don't want it in the git history.

> - None

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 06/25] cxl: Move CXL driver's RCH error handling into core/ras_rch.c
  2025-11-04 17:02 ` [RESEND v13 06/25] cxl: Move CXL driver's RCH error handling into core/ras_rch.c Terry Bowman
@ 2025-11-04 18:03   ` Jonathan Cameron
  2025-11-19  3:20   ` dan.j.williams
  1 sibling, 0 replies; 103+ messages in thread
From: Jonathan Cameron @ 2025-11-04 18:03 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci

On Tue, 4 Nov 2025 11:02:46 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> Restricted CXL Host (RCH) protocol error handling uses a procedure distinct
> from the CXL Virtual Hierarchy (VH) handling. This is because of the
> differences in the RCH and VH topologies. Improve the maintainability and
> add ability to enable/disable RCH handling.
> 
> Move and combine the RCH handling code into a single block conditionally
> compiled with the CONFIG_CXL_RCH_RAS kernel config.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> 
> ---
> 
Fairly sure this patch had changes, seeing as the code is now in a different file.

A few minor comments inline. With those tidied up
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>


J

> Changes in v12->v13:
> - None
> 
> Changes v11->v12:
> - Moved CXL_RCH_RAS Kconfig definition here from following commit.
> 
> Changes v10->v11:
> - New patch
> ---
>  drivers/cxl/Kconfig        |   7 +++
>  drivers/cxl/core/Makefile  |   1 +
>  drivers/cxl/core/core.h    |   5 +-
>  drivers/cxl/core/pci.c     | 115 -----------------------------------
>  drivers/cxl/core/ras_rch.c | 120 +++++++++++++++++++++++++++++++++++++
>  tools/testing/cxl/Kbuild   |   1 +
>  6 files changed, 132 insertions(+), 117 deletions(-)
>  create mode 100644 drivers/cxl/core/ras_rch.c
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index 217888992c88..ffe6ad981434 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -237,4 +237,11 @@ config CXL_RAS
>  	def_bool y
>  	depends on ACPI_APEI_GHES && PCIEAER && CXL_PCI
>  
> +config CXL_RCH_RAS
> +	bool "CXL: Restricted CXL Host (RCH) protocol error handling"
> +	def_bool n

Don't need to specify a default of no as that is always the default if
not overridden.

> +	depends on CXL_RAS
> +	help
> +	  RAS support for Restricted CXL Host (RCH) defined in CXL1.1.
> +
>  endif
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index b2930cc54f8b..fa1d4aed28b9 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -20,3 +20,4 @@ cxl_core-$(CONFIG_CXL_MCE) += mce.o
>  cxl_core-$(CONFIG_CXL_FEATURES) += features.o
>  cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
>  cxl_core-$(CONFIG_CXL_RAS) += ras.o
> +cxl_core-$(CONFIG_CXL_RCH_RAS) += ras_rch.o
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index bc818de87ccc..c30ab7c25a92 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -4,6 +4,7 @@
>  #ifndef __CXL_CORE_H__
>  #define __CXL_CORE_H__
>  
> +#include <linux/pci.h>
Why this include? I'm not spotting anything PCI-specific being added to this header.
If it should already have been here, it belongs in a separate patch.

>  #include <cxl/mailbox.h>
>  #include <linux/rwsem.h>
>  
> @@ -167,7 +168,7 @@ static inline void cxl_handle_cor_ras(struct cxl_dev_state *cxlds, void __iomem
>  #endif /* CONFIG_CXL_RAS */
>  
>  /* Restricted CXL Host specific RAS functions */
> -#ifdef CONFIG_CXL_RAS
> +#ifdef CONFIG_CXL_RCH_RAS
>  void cxl_dport_map_rch_aer(struct cxl_dport *dport);
>  void cxl_disable_rch_root_ints(struct cxl_dport *dport);
>  void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds);
> @@ -175,7 +176,7 @@ void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds);
>  static inline void cxl_dport_map_rch_aer(struct cxl_dport *dport) { }
>  static inline void cxl_disable_rch_root_ints(struct cxl_dport *dport) { }
>  static inline void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds) { }
> -#endif /* CONFIG_CXL_RAS */
> +#endif /* CONFIG_CXL_RCH_RAS */
>  
>  int cxl_gpf_port_setup(struct cxl_dport *dport);

> diff --git a/drivers/cxl/core/ras_rch.c b/drivers/cxl/core/ras_rch.c
> new file mode 100644
> index 000000000000..f6de5492a8b7
> --- /dev/null
> +++ b/drivers/cxl/core/ras_rch.c
> @@ -0,0 +1,120 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
> +
> +#include <linux/pci.h>
> +#include <linux/aer.h>
> +#include <cxl/event.h>
> +#include <cxlmem.h>
> +#include "trace.h"
We should be trying to follow include-what-you-use principles,
so linux/types.h for resource_size_t.
Probably a forward declaration for struct device as well.


> +
> +void cxl_dport_map_rch_aer(struct cxl_dport *dport)
> +{
> +	resource_size_t aer_phys;
> +	struct device *host;
> +	u16 aer_cap;
> +
> +	aer_cap = cxl_rcrb_to_aer(dport->dport_dev, dport->rcrb.base);
> +	if (aer_cap) {
> +		host = dport->reg_map.host;
> +		aer_phys = aer_cap + dport->rcrb.base;
> +		dport->regs.dport_aer = devm_cxl_iomap_block(host, aer_phys,
> +							     sizeof(struct aer_capability_regs));
Maybe keep original alignment or even something like
		dport->regs.dport_aer =
			devm_cxl_iomap_block(host, aer_phys,
					     sizeof(struct aer_capability_regs));


> +	}
> +}
> +




^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 07/25] CXL/AER: Replace device_lock() in cxl_rch_handle_error_iter() with guard() lock
  2025-11-04 17:02 ` [RESEND v13 07/25] CXL/AER: Replace device_lock() in cxl_rch_handle_error_iter() with guard() lock Terry Bowman
@ 2025-11-04 18:05   ` Jonathan Cameron
  2025-11-04 19:53   ` Dave Jiang
  2025-11-19  3:20   ` dan.j.williams
  2 siblings, 0 replies; 103+ messages in thread
From: Jonathan Cameron @ 2025-11-04 18:05 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci

On Tue, 4 Nov 2025 11:02:47 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> cxl_rch_handle_error_iter() includes a call to device_lock() using a goto
> for multiple return paths. Improve readability and maintainability by
> using the guard() lock variant.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
I don't think there is any existing use of cleanup.h in aer.c?
If not, you should add

#include <linux/cleanup.h> in the appropriate place.

Other than that LGTM

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
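
For readers unfamiliar with the pattern, a minimal userspace sketch of
scope-based unlocking, assuming the GCC/Clang __attribute__((cleanup))
extension. This is not the kernel's guard() macro: FAKE_GUARD, struct
fake_lock and handle() are hypothetical, and a counter stands in for a
real lock.

```c
#include <assert.h>

/* Hypothetical lock: a counter stands in for a real mutex */
struct fake_lock {
	int held;
};

static struct fake_lock *fake_lock_acquire(struct fake_lock *l)
{
	l->held++;
	return l;
}

/* Cleanup handler: receives a pointer to the guarded variable */
static void fake_unlock(struct fake_lock **l)
{
	if (*l)
		(*l)->held--;
}

/* Acquire now; release automatically when the variable leaves scope */
#define FAKE_GUARD(l) \
	struct fake_lock *guard_ __attribute__((cleanup(fake_unlock))) = \
		fake_lock_acquire(l)

static int handle(struct fake_lock *l, int have_handler)
{
	FAKE_GUARD(l);

	assert(l->held == 1);
	if (!have_handler)
		return 0;	/* early return, no goto needed to unlock */

	return 1;
}
```

Both return paths in handle() leave the lock released, which is the
property that lets the patch drop the "out:" label.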



> 
> ---
> 
> Changes in v12->v13:
> - New patch
> ---
>  drivers/pci/pcie/aer.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 0b5ed4722ac3..cbaed65577d9 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1187,12 +1187,11 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>  	if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev))
>  		return 0;
>  
> -	/* Protect dev->driver */
> -	device_lock(&dev->dev);
> +	guard(device)(&dev->dev);
>  
>  	err_handler = dev->driver ? dev->driver->err_handler : NULL;
>  	if (!err_handler)
> -		goto out;
> +		return 0;
>  
>  	if (info->severity == AER_CORRECTABLE) {
>  		if (err_handler->cor_error_detected)
> @@ -1203,8 +1202,6 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>  		else if (info->severity == AER_FATAL)
>  			err_handler->error_detected(dev, pci_channel_io_frozen);
>  	}
> -out:
> -	device_unlock(&dev->dev);
>  	return 0;
>  }
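
For readers less familiar with cleanup.h: the guard() conversion above
works because the compiler runs a cleanup callback whenever the guarded
variable leaves scope, so every return path drops the lock without a
"goto out". Below is a minimal userspace sketch of that idea, assuming
GCC or Clang. fake_lock, fake_guard() and handle_error() are
hypothetical names for illustration only and are not the kernel's
implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a device lock; it only counts lock depth. */
struct fake_lock {
	int depth;
};

static void fake_lock_acquire(struct fake_lock *l) { l->depth++; }
static void fake_lock_release(struct fake_lock *l) { l->depth--; }

/* Cleanup callback: receives a pointer to the guarded variable. */
static void fake_guard_release(struct fake_lock **lp)
{
	if (*lp)
		fake_lock_release(*lp);
}

/* Take the lock now; release it automatically when the scope ends. */
#define fake_guard(lock) \
	struct fake_lock *guard_var_ \
		__attribute__((cleanup(fake_guard_release))) = \
		(fake_lock_acquire(lock), (lock))

/* Both the early return and the normal return drop the lock. */
static int handle_error(struct fake_lock *lock, const void *err_handler)
{
	fake_guard(lock);

	if (!err_handler)
		return 0;	/* lock released on this path */

	/* The lock is still held while the return value is computed. */
	return lock->depth;
}
```

With this shape the lock depth is back to zero after handle_error()
returns, whichever branch was taken.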
>  



* Re: [RESEND v13 09/25] PCI/AER: Report CXL or PCIe bus error type in trace logging
  2025-11-04 17:02 ` [RESEND v13 09/25] PCI/AER: Report CXL or PCIe bus error type in trace logging Terry Bowman
@ 2025-11-04 18:08   ` Jonathan Cameron
  2025-11-04 18:26   ` Bjorn Helgaas
  1 sibling, 0 replies; 103+ messages in thread
From: Jonathan Cameron @ 2025-11-04 18:08 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci

On Tue, 4 Nov 2025 11:02:49 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
> for all errors. Update the driver and aer_event tracing to log 'CXL Bus
> Type' for CXL device errors.
> 
> This requires the AER can identify and distinguish between PCIe errors and
> CXL errors.
> 
> Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
> aer_get_device_error_info() and pci_print_aer().
> 
> Update the aer_event trace routine to accept a bus type string parameter.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> 
Hi Terry,

A couple of things from a fresh look inline.

> ---
> 
> Changes in v12->v13:
> - Remove duplicated aer_err_info inline comments. Is already in the
>   kernel-doc header (Ben)
> 
> Changes in v11->v12:
>  - Change aer_err_info::is_cxl to be bool a bitfield. Update structure
>  padding. (Lukas)
>  - Add kernel-doc for 'struct aer_err_info' (Lukas)
> 
> Changes in v10->v11:
>  - Remove duplicate call to trace_aer_event() (Shiju)
>  - Added Dan William's and Dave Jiang's reviewed-by
> ---
>  drivers/pci/pci.h       | 37 ++++++++++++++++++++++++++++++-------
>  drivers/pci/pcie/aer.c  | 18 ++++++++++++------
>  include/ras/ras_event.h |  9 ++++++---
>  3 files changed, 48 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index d23430e3eea0..446251892bb7 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -701,31 +701,54 @@ static inline bool pci_dev_binding_disallowed(struct pci_dev *dev)
>  
>  #define AER_MAX_MULTI_ERR_DEVICES	5	/* Not likely to have more */
>  
> +/**
> + * struct aer_err_info - AER Error Information
> + * @dev: Devices reporting error
> + * @ratelimit_print: Flag to log or not log the devices' error. 0=NotLog/1=Log
> + * @error_devnum: Number of devices reporting an error
Typo: this should be @error_dev_num.

Run the kernel-doc script over this to find things like this.

> + * @level: printk level to use in logging
> + * @id: Value from register PCI_ERR_ROOT_ERR_SRC
> + * @severity: AER severity, 0-UNCOR Non-fatal, 1-UNCOR fatal, 2-COR
> + * @root_ratelimit_print: Flag to log or not log the root's error. 0=NotLog/1=Log
> + * @multi_error_valid: If multiple errors are reported
> + * @first_error: First reported error
> + * @is_cxl: Bus type error: 0-PCI Bus error, 1-CXL Bus error
> + * @tlp_header_valid: Indicates if TLP field contains error information
> + * @status: COR/UNCOR error status
> + * @mask: COR/UNCOR mask
> + * @tlp: Transaction packet information
> + */
>  struct aer_err_info {
>  	struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
>  	int ratelimit_print[AER_MAX_MULTI_ERR_DEVICES];
>  	int error_dev_num;
> -	const char *level;		/* printk level */
> +	const char *level;
>  
>  	unsigned int id:16;
>  
> -	unsigned int severity:2;	/* 0:NONFATAL | 1:FATAL | 2:COR */
> -	unsigned int root_ratelimit_print:1;	/* 0=skip, 1=print */
> +	unsigned int severity:2;
> +	unsigned int root_ratelimit_print:1;
>  	unsigned int __pad1:4;
>  	unsigned int multi_error_valid:1;
>  
>  	unsigned int first_error:5;
> -	unsigned int __pad2:2;
> +	unsigned int __pad2:1;
> +	bool is_cxl:1;
Stick to unsigned int for the bit field just for consistency.

>  	unsigned int tlp_header_valid:1;
>  
> -	unsigned int status;		/* COR/UNCOR Error Status */
> -	unsigned int mask;		/* COR/UNCOR Error Mask */
> -	struct pcie_tlp_log tlp;	/* TLP Header */
> +	unsigned int status;
> +	unsigned int mask;
> +	struct pcie_tlp_log tlp;
>  };

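To illustrate the bit field consistency point: a minimal sketch of the
flag word with is_cxl declared as unsigned int like its neighbours.
The struct name aer_flags_sketch and the helper mark_cxl() are
hypothetical; field names and widths are copied from the quoted patch.
The size check assumes a common ABI with a 32-bit unsigned int, since
bit field layout is implementation-defined.

```c
#include <assert.h>

/* Hypothetical copy of the flag bits only, not the full aer_err_info. */
struct aer_flags_sketch {
	unsigned int id:16;

	unsigned int severity:2;
	unsigned int root_ratelimit_print:1;
	unsigned int __pad1:4;
	unsigned int multi_error_valid:1;

	unsigned int first_error:5;
	unsigned int __pad2:1;
	unsigned int is_cxl:1;
	unsigned int tlp_header_valid:1;
};

/* 16 + (2+1+4+1) + (5+1+1+1) = 32 bits in total. */
_Static_assert(sizeof(struct aer_flags_sketch) == sizeof(unsigned int),
	       "flag bits should pack into a single unsigned int");

/* Set the one-bit flag and read it back. */
static unsigned int mark_cxl(struct aer_flags_sketch *f)
{
	f->is_cxl = 1;
	return f->is_cxl;
}
```

Shrinking __pad2 from 2 bits to 1 is what keeps the total at 32 bits
once is_cxl is added.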

* Re: [RESEND v13 10/25] cxl/pci: Update RAS handler interfaces to also support CXL Ports
  2025-11-04 17:02 ` [RESEND v13 10/25] cxl/pci: Update RAS handler interfaces to also support CXL Ports Terry Bowman
@ 2025-11-04 18:10   ` Jonathan Cameron
  2025-11-11  8:17   ` Alison Schofield
  2025-11-19  3:19   ` dan.j.williams
  2 siblings, 0 replies; 103+ messages in thread
From: Jonathan Cameron @ 2025-11-04 18:10 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci

On Tue, 4 Nov 2025 11:02:50 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> CXL PCIe Port Protocol Error handling support will be added to the
> CXL drivers in the future. In preparation, rename the existing
> interfaces to support handling all CXL PCIe Port Protocol Errors.
> 
> The driver's RAS support functions currently rely on a 'struct
> cxl_dev_state' type parameter, which is not available for CXL Port
> devices. However, since the same CXL RAS capability structure is
> needed across most CXL components and devices, a common handling
> approach should be adopted.
> 
> To accommodate this, update the __cxl_handle_cor_ras() and
> __cxl_handle_ras() functions to use a `struct device` instead of
> `struct cxl_dev_state`.
> 
> No functional changes are introduced.
> 
> [1] CXL 3.1 Spec, 8.2.4 CXL.cache and CXL.mem Registers
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Alejandro Lucero <alucerop@amd.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Gregory Price <gourry@gourry.net>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
> 
One additional comment inline.

> ---
> 
> Changes in v12->v13:
> - Added Ben's review-by
> ---
>  drivers/cxl/core/core.h    | 15 ++++++---------
>  drivers/cxl/core/ras.c     | 12 ++++++------
>  drivers/cxl/core/ras_rch.c |  4 ++--
>  3 files changed, 14 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index c30ab7c25a92..1a419b35fa59 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -7,6 +7,7 @@
>  #include <linux/pci.h>
>  #include <cxl/mailbox.h>
>  #include <linux/rwsem.h>
> +#include <linux/pci.h>

Similar to earlier. Not seeing what is new here that is PCI specific
that wasn't before.  Maybe a forward declaration of
struct device is needed?
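
To illustrate the forward-declaration point: a prototype that only
passes a pointer needs the struct name, not its full definition. A
minimal userspace sketch follows. sketch_handle_ras() and the
one-member struct device are hypothetical stand-ins, and the kernel's
__iomem annotation is dropped so the snippet builds outside the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Forward declaration: enough for prototypes that only pass pointers. */
struct device;

/* Hypothetical prototype with the same shape as the quoted ones. */
bool sketch_handle_ras(struct device *dev, void *ras_base);

/* The full definition is only needed where members are accessed. */
struct device {
	const char *name;
};

bool sketch_handle_ras(struct device *dev, void *ras_base)
{
	/* Report "handled" only when a named device and a mapping exist. */
	return dev && dev->name && ras_base;
}
```

A header written this way does not have to pull in another header just
to name the pointer type.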

>  
>  extern const struct device_type cxl_nvdimm_bridge_type;
>  extern const struct device_type cxl_nvdimm_type;
> @@ -148,23 +149,19 @@ int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
>  #ifdef CONFIG_CXL_RAS
>  int cxl_ras_init(void);
>  void cxl_ras_exit(void);
> -bool cxl_handle_ras(struct cxl_dev_state *cxlds, void __iomem *ras_base);
> -void cxl_handle_cor_ras(struct cxl_dev_state *cxlds, void __iomem *ras_base);
> +bool cxl_handle_ras(struct device *dev, void __iomem *ras_base);
> +void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base);
>  #else
>  static inline int cxl_ras_init(void)
>  {
>  	return 0;
>  }
> -
> -static inline void cxl_ras_exit(void)
> -{
> -}
> -
> -static inline bool cxl_handle_ras(struct cxl_dev_state *cxlds, void __iomem *ras_base)
> +static inline void cxl_ras_exit(void) { }
> +static inline bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
>  {
>  	return false;
>  }
> -static inline void cxl_handle_cor_ras(struct cxl_dev_state *cxlds, void __iomem *ras_base) { }
> +static inline void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base) { }
>  #endif /* CONFIG_CXL_RAS */
>  
>  /* Restricted CXL Host specific RAS functions */



* Re: [RESEND v13 14/25] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
  2025-11-04 17:02 ` [RESEND v13 14/25] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers Terry Bowman
@ 2025-11-04 18:15   ` Jonathan Cameron
  2025-11-04 20:03   ` Dave Jiang
  2025-11-11  8:23   ` Alison Schofield
  2 siblings, 0 replies; 103+ messages in thread
From: Jonathan Cameron @ 2025-11-04 18:15 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci

On Tue, 4 Nov 2025 11:02:54 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> CXL Endpoint (EP) Ports may include Root Ports (RP) or Downstream Switch
> Ports (DSP). CXL RPs and DSPs contain RAS registers that require memory
> mapping to enable RAS logging. This initialization is currently missing and
> must be added for CXL RPs and DSPs.
> 
> Update cxl_dport_init_ras_reporting() to support RP and DSP RAS mapping.
> Add alongside the existing Restricted CXL Host Downstream Port RAS mapping.
> 
> Update cxl_endpoint_port_probe() to invoke cxl_dport_init_ras_reporting().
> This will initiate the RAS mapping for CXL RPs and DSPs when each CXL EP is
> created and added to the EP port.
> 
> Make a call to cxl_port_setup_regs() in cxl_port_add(). This will probe the
> Upstream Port's CXL capabilities' physical location to be used in mapping
> the RAS registers.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>



* Re: [RESEND v13 18/25] cxl: Change CXL handlers to use guard() instead of scoped_guard()
  2025-11-04 17:02 ` [RESEND v13 18/25] cxl: Change CXL handlers to use guard() instead of scoped_guard() Terry Bowman
@ 2025-11-04 18:18   ` Jonathan Cameron
  2025-11-04 20:15   ` Dave Jiang
  1 sibling, 0 replies; 103+ messages in thread
From: Jonathan Cameron @ 2025-11-04 18:18 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci

On Tue, 4 Nov 2025 11:02:58 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> The CXL protocol error handlers use scoped_guard() to guarantee access to
> the underlying CXL memory device. Improve readability and reduce complexity
> by changing the current scoped_guard() to be guard().
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Nice
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>



* Re: [RESEND v13 09/25] PCI/AER: Report CXL or PCIe bus error type in trace logging
  2025-11-04 17:02 ` [RESEND v13 09/25] PCI/AER: Report CXL or PCIe bus error type in trace logging Terry Bowman
  2025-11-04 18:08   ` Jonathan Cameron
@ 2025-11-04 18:26   ` Bjorn Helgaas
  1 sibling, 0 replies; 103+ messages in thread
From: Bjorn Helgaas @ 2025-11-04 18:26 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci

On Tue, Nov 04, 2025 at 11:02:49AM -0600, Terry Bowman wrote:
> The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
> for all errors. Update the driver and aer_event tracing to log 'CXL Bus
> Type' for CXL device errors.
> 
> This requires the AER can identify and distinguish between PCIe errors and
> CXL errors.

s/requires the AER/requires that AER/

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

> +/**
> + * struct aer_err_info - AER Error Information
> + * @dev: Devices reporting error
> + * @ratelimit_print: Flag to log or not log the devices' error. 0=NotLog/1=Log
> + * @error_devnum: Number of devices reporting an error
> + * @level: printk level to use in logging
> + * @id: Value from register PCI_ERR_ROOT_ERR_SRC
> + * @severity: AER severity, 0-UNCOR Non-fatal, 1-UNCOR fatal, 2-COR
> + * @root_ratelimit_print: Flag to log or not log the root's error. 0=NotLog/1=Log
> + * @multi_error_valid: If multiple errors are reported
> + * @first_error: First reported error
> + * @is_cxl: Bus type error: 0-PCI Bus error, 1-CXL Bus error
> + * @tlp_header_valid: Indicates if TLP field contains error information
> + * @status: COR/UNCOR error status
> + * @mask: COR/UNCOR mask
> + * @tlp: Transaction packet information
> + */

Would you mind splitting this kernel-doc addition and comment move to
its own patch that only does that?  That will make the functional
changes more obvious.

>  struct aer_err_info {
>  	struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
>  	int ratelimit_print[AER_MAX_MULTI_ERR_DEVICES];
>  	int error_dev_num;
> -	const char *level;		/* printk level */
> +	const char *level;
>  
>  	unsigned int id:16;
>  
> -	unsigned int severity:2;	/* 0:NONFATAL | 1:FATAL | 2:COR */
> -	unsigned int root_ratelimit_print:1;	/* 0=skip, 1=print */
> +	unsigned int severity:2;
> +	unsigned int root_ratelimit_print:1;
>  	unsigned int __pad1:4;
>  	unsigned int multi_error_valid:1;
>  
>  	unsigned int first_error:5;
> -	unsigned int __pad2:2;
> +	unsigned int __pad2:1;
> +	bool is_cxl:1;
>  	unsigned int tlp_header_valid:1;
>  
> -	unsigned int status;		/* COR/UNCOR Error Status */
> -	unsigned int mask;		/* COR/UNCOR Error Mask */
> -	struct pcie_tlp_log tlp;	/* TLP Header */
> +	unsigned int status;
> +	unsigned int mask;
> +	struct pcie_tlp_log tlp;
>  };


* Re: [RESEND v13 19/25] cxl/pci: Introduce CXL protocol error handlers for Endpoints
  2025-11-04 17:02 ` [RESEND v13 19/25] cxl/pci: Introduce CXL protocol error handlers for Endpoints Terry Bowman
@ 2025-11-04 18:29   ` Jonathan Cameron
  2025-11-04 19:09   ` Bjorn Helgaas
  1 sibling, 0 replies; 103+ messages in thread
From: Jonathan Cameron @ 2025-11-04 18:29 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci

On Tue, 4 Nov 2025 11:02:59 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> CXL Endpoint protocol errors are currently handled by generic PCI error
> handlers. However, uncorrectable errors (UCEs) require CXL.mem protocol-
> specific handling logic that the PCI handlers cannot provide.
> 
> Add dedicated CXL protocol error handlers for CXL Endpoints. Rename the
> existing cxl_error_handlers to pci_error_handlers to better reflect their
> purpose and maintain naming consistency. Update the PCI error handlers to
> invoke the new CXL protocol handlers when the endpoint is operating in
> CXL.mem mode.
> 
> Implement cxl_handle_ras() to return PCI_ERS_RESULT_NONE or
> PCI_ERS_RESULT_PANIC. Remove unnecessary result checks from the previous
> endpoint UCE handler since CXL UCE recovery is not implemented in this
> patch.
> 
> Add device lock assertions to protect against concurrent device or RAS
> register removal during error handling. Two devices require locking for
> CXL endpoints:
> 
> 1. The PCI device (pdev->dev) - RAS registers are allocated and mapped
>    using devm_* functions with this device as the host. Locking prevents
>    the RAS registers from being unmapped until after error handling
>    completes.
> 
> 2. The CXL memory device (cxlmd->dev) - Holds a reference to the RAS
>    registers accessed during error handling. Locking prevents the memory
>    device and its RAS register references from being removed during error
>    handling.
> 
> The lock assertions added here will be satisfied by device locks
> introduced in a subsequent patch. A future patch will extend the CXL UCE
> handler to support full UCE recovery.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
Hi Terry,

A few comments inline.

Thanks,

Jonathan


> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index cb712772de5c..beb142054bda 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -128,6 +128,11 @@ void cxl_ras_exit(void)
>  	cancel_work_sync(&cxl_cper_prot_err_work);
>  }
>  
> +static bool is_pcie_endpoint(struct pci_dev *pdev)
> +{
> +	return pci_pcie_type(pdev) == PCI_EXP_TYPE_ENDPOINT;
> +}

Not used that I can see. Maybe should be in a different patch?


>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
>  
> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> -				    pci_channel_state_t state)
> +void pci_cor_error_detected(struct pci_dev *pdev)
> +{
> +	struct cxl_dev_state *cxlds;
> +
> +	device_lock_assert(&pdev->dev);
> +	if (!cxl_pci_drv_bound(pdev))
> +		return;
> +
> +	cxlds = pci_get_drvdata(pdev);
> +	guard(device)(&cxlds->cxlmd->dev);
> +
> +	cxl_cor_error_detected(&pdev->dev);
> +}
> +EXPORT_SYMBOL_NS_GPL(pci_cor_error_detected, "CXL");

Similarly to below.  I'm not keen on exporting such generic PCI
sounding functions even in the CXL namespace.

> +
> +pci_ers_result_t cxl_error_detected(struct device *dev)
>  {
> -	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> -	struct cxl_memdev *cxlmd = cxlds->cxlmd;
> -	struct device *dev = &cxlmd->dev;
> -	bool ue;
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  
> -	guard(device)(dev);
> +	device_lock_assert(cxlds->dev);
> +	device_lock_assert(&cxlmd->dev);
>  
>  	if (!dev->driver) {
> -		dev_warn(&pdev->dev,
> +		dev_warn(cxlds->dev,
>  			 "%s: memdev disabled, abort error handling\n",
>  			 dev_name(dev));
>  		return PCI_ERS_RESULT_DISCONNECT;
> @@ -289,32 +308,34 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>  
>  	if (cxlds->rcd)
>  		cxl_handle_rdport_errors(cxlds);
> +
I'd drop this blank line addition as it doesn't matter much and it does
add noise to the patch.

>  	/*
>  	 * A frozen channel indicates an impending reset which is fatal to
>  	 * CXL.mem operation, and will likely crash the system. On the off
>  	 * chance the situation is recoverable dump the status of the RAS
>  	 * capability registers and bounce the active state of the memdev.
>  	 */

Mind you - I think this comment wants to go away as it's talking about code
that is no longer here.


> -	ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial, cxlds->regs.ras);
> -
> -	switch (state) {
> -	case pci_channel_io_normal:
> -		if (ue) {
> -			device_release_driver(dev);
> -			return PCI_ERS_RESULT_NEED_RESET;
> -		}
> -		return PCI_ERS_RESULT_CAN_RECOVER;
> -	case pci_channel_io_frozen:
> -		dev_warn(&pdev->dev,
> -			 "%s: frozen state error detected, disable CXL.mem\n",
> -			 dev_name(dev));
> -		device_release_driver(dev);
> -		return PCI_ERS_RESULT_NEED_RESET;
> -	case pci_channel_io_perm_failure:
> -		dev_warn(&pdev->dev,
> -			 "failure state error detected, request disconnect\n");
> -		return PCI_ERS_RESULT_DISCONNECT;
> -	}
> -	return PCI_ERS_RESULT_NEED_RESET;
> +	return cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial, cxlds->regs.ras);
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
> +
> +pci_ers_result_t pci_error_detected(struct pci_dev *pdev,
> +				    pci_channel_state_t error)
> +{
> +	struct cxl_dev_state *cxlds;
> +	pci_ers_result_t rc;
> +
> +	device_lock_assert(&pdev->dev);
> +	if (!cxl_pci_drv_bound(pdev))
> +		return PCI_ERS_RESULT_NONE;
> +
> +	cxlds = pci_get_drvdata(pdev);
> +	guard(device)(&cxlds->cxlmd->dev);
> +
> +	rc = cxl_error_detected(&cxlds->cxlmd->dev);
> +	if (rc == PCI_ERS_RESULT_PANIC)
> +		panic("CXL cachemem error.");
> +
> +	return rc;
> +}
> +EXPORT_SYMBOL_NS_GPL(pci_error_detected, "CXL");

Whilst the symbol is namespaced, I'm not sure I want to see
an exported CXL specific function that sounds so generic pci.

Maybe cxl_pci_error_detected() or something like that?

> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index a0a491e7b5b9..3526e6d75f79 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -79,21 +79,10 @@ struct cxl_dev_state;
>  void read_cdat_data(struct cxl_port *port);
>  
>  #ifdef CONFIG_CXL_RAS
> -void cxl_cor_error_detected(struct pci_dev *pdev);
> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> -				    pci_channel_state_t state);
>  void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host);
>  void cxl_uport_init_ras_reporting(struct cxl_port *port,
>  				  struct device *host);
>  #else
> -static inline void cxl_cor_error_detected(struct pci_dev *pdev) { }
> -
> -static inline pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> -						  pci_channel_state_t state)
> -{
> -	return PCI_ERS_RESULT_NONE;
> -}
> -
>  static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport,
>  						struct device *host) { }
>  static inline void cxl_uport_init_ras_reporting(struct cxl_port *port,



* Re: [RESEND v13 20/25] CXL/PCI: Introduce CXL Port protocol error handlers
  2025-11-04 17:03 ` [RESEND v13 20/25] CXL/PCI: Introduce CXL Port protocol error handlers Terry Bowman
@ 2025-11-04 18:32   ` Jonathan Cameron
  2025-11-04 21:20   ` Dave Jiang
  1 sibling, 0 replies; 103+ messages in thread
From: Jonathan Cameron @ 2025-11-04 18:32 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci

On Tue, 4 Nov 2025 11:03:00 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> Add CXL protocol error handlers for CXL Port devices (Root Ports,
> Downstream Ports, and Upstream Ports). Implement cxl_port_cor_error_detected()
> and cxl_port_error_detected() to handle correctable and uncorrectable errors
> respectively.
> 
> Introduce cxl_get_ras_base() to retrieve the cached RAS register base
> address for a given CXL port. This function supports CXL Root Ports,
> Downstream Ports, and Upstream Ports by returning their previously mapped
> RAS register addresses.
> 
> Add device lock assertions to protect against concurrent device or RAS
> register removal during error handling. The port error handlers require
> two device locks:
> 
> 1. The port's CXL parent device - RAS registers are mapped using devm_*
>    functions with the parent port as the host. Locking the parent prevents
>    the RAS registers from being unmapped during error handling.
> 
> 2. The PCI device (pdev->dev) - Locking prevents concurrent modifications
>    to the PCI device structure during error handling.
> 
> The lock assertions added here will be satisfied by device locks introduced
> in a subsequent patch.
> 
> Introduce get_pci_cxl_host_dev() to return the device responsible for
> managing the RAS register mapping. This function increments the reference
> count on the host device to prevent premature resource release during error
> handling. The caller is responsible for decrementing the reference count.
> For CXL endpoints, which manage resources without a separate host device,
> this function returns NULL.
> 
> Update the AER driver's is_cxl_error() to recognize CXL Port devices in
> addition to CXL Endpoints, as both now have CXL-specific error handlers.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> ---
> 
> Changes in v12->v13:
> - Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
>   patch (Terry)
> - Remove EP case in cxl_get_ras_base(), not used. (Terry)
> - Remove check for dport->dport_dev (Dave)
> - Remove whitespace (Terry)
Really trivial comment follows.

> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index beb142054bda..142ca8794107 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c

>  /**
>   * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
>   * @dport: the cxl_dport that needs to be initialized
> @@ -254,6 +287,22 @@ pci_ers_result_t cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ra
>  	return PCI_ERS_RESULT_PANIC;
>  }
>  
> +void cxl_port_cor_error_detected(struct device *dev)
> +{
> +	void __iomem *ras_base = cxl_get_ras_base(dev);
> +
> +	cxl_handle_cor_ras(dev, 0, ras_base);
To me there is no significant loss of readability in doing

	cxl_handle_cor_ras(dev, 0, cxl_get_ras_base(dev));

I don't really care much so feel free to ignore.

> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_port_cor_error_detected, "CXL");
> +
> +pci_ers_result_t cxl_port_error_detected(struct device *dev)
> +{
> +	void __iomem *ras_base = cxl_get_ras_base(dev);
> +
> +	return cxl_handle_ras(dev, 0, ras_base);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_port_error_detected, "CXL");
> +
>  void cxl_cor_error_detected(struct device *dev)
>  {
>  	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
> index 5dbc81341dc4..25f9512b57f7 100644
> --- a/drivers/pci/pcie/aer_cxl_vh.c
> +++ b/drivers/pci/pcie/aer_cxl_vh.c
> @@ -43,7 +43,10 @@ bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
>  	if (!info || !info->is_cxl)
>  		return false;
>  
> -	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT)
> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) &&
> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_UPSTREAM) &&
> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_DOWNSTREAM))
>  		return false;
>  
>  	return is_internal_error(info);



* Re: [RESEND v13 21/25] PCI/AER: Dequeue forwarded CXL error
  2025-11-04 17:03 ` [RESEND v13 21/25] PCI/AER: Dequeue forwarded CXL error Terry Bowman
@ 2025-11-04 18:40   ` Jonathan Cameron
  2025-11-04 18:45   ` Bjorn Helgaas
  2025-11-20  3:33   ` dan.j.williams
  2 siblings, 0 replies; 103+ messages in thread
From: Jonathan Cameron @ 2025-11-04 18:40 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci

On Tue, 4 Nov 2025 11:03:01 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER driver now forwards CXL protocol errors to the CXL driver via a
> kfifo. The CXL driver must consume these work items, initiate protocol
> error handling, and ensure RAS mappings remain valid throughout processing.
> 
> Implement cxl_proto_err_work_fn() to dequeue work items forwarded by the
> AER service driver and begin protocol error processing by calling
> cxl_handle_proto_error().
> 
> Add a PCI device lock on &pdev->dev within cxl_proto_err_work_fn() to
> keep the PCI device structure valid during handling. Locking an Endpoint
> will also defer RAS unmapping until the device is unlocked.
> 
> For Endpoints, add a lock on CXL memory device cxlds->dev. The CXL memory
> device structure holds the RAS register reference needed during error
> handling.
> 
> Add lock for the parent CXL Port for Root Ports, Downstream Ports, and
> Upstream Ports to prevent destruction of structures holding mapped RAS
> addresses while they are in use.
> 
> Invoke cxl_do_recovery() for uncorrectable errors. Treat this as a stub for
> now; implement its functionality in a future patch.
> 
> Export pci_clean_device_status() to enable cleanup of AER status following
> error handling.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
Various comments inline.
> 
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 142ca8794107..5bc144cde0ee 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -117,17 +117,6 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
>  }
>  static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
>  
> -int cxl_ras_init(void)
> -{
> -	return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> -}
> -
> -void cxl_ras_exit(void)
> -{
> -	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
> -	cancel_work_sync(&cxl_cper_prot_err_work);
> -}
> -
>  static bool is_pcie_endpoint(struct pci_dev *pdev)
>  {
>  	return pci_pcie_type(pdev) == PCI_EXP_TYPE_ENDPOINT;
> @@ -178,6 +167,51 @@ static void __iomem *cxl_get_ras_base(struct device *dev)
>  	return NULL;
>  }
>  
> +/*
> + * Return 'struct cxl_port *' parent CXL port of dev's
> + *
> + * Reference count increments on success
> + *
> + * dev: Find the parent port of this dev

This should be pdev.

Generally I'd prefer kernel-doc style even for non exported
/ exposed functions.  Makes it easy to check for stuff like
this as the script will moan at you.

> + */
> +static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
> +{
> +	switch (pci_pcie_type(pdev)) {
> +	case PCI_EXP_TYPE_ROOT_PORT:
> +	case PCI_EXP_TYPE_DOWNSTREAM:
> +	{
> +		struct cxl_dport *dport;
> +		struct cxl_port *port = find_cxl_port(&pdev->dev, &dport);
> +
> +		if (!port) {
> +			pci_err(pdev, "Failed to find the CXL device");
> +			return NULL;
> +		}
> +		return port;
> +	}
> +	case PCI_EXP_TYPE_UPSTREAM:
> +	{
> +		struct cxl_port *port = find_cxl_port_by_uport(&pdev->dev);
> +
> +		if (!port) {
> +			pci_err(pdev, "Failed to find the CXL device");
> +			return NULL;
> +		}
> +		return port;
> +	}
> +	case PCI_EXP_TYPE_ENDPOINT:
> +	{
> +		struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> +		struct cxl_port *port = cxlds->cxlmd->endpoint;
> +
> +		get_device(&port->dev);
> +		return port;
> +	}
> +	}
> +	pci_warn_once(pdev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));
> +	return NULL;
> +}
> +
>  /**
>   * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
>   * @dport: the cxl_dport that needs to be initialized
> @@ -212,6 +246,23 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port,
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, "CXL");
>  
> +static bool device_lock_if(struct device *dev, bool cond)
> +{
> +	if (cond)
> +		device_lock(dev);
> +	return cond;
> +}
> +
> +static void device_unlock_if(struct device *dev, bool take)
> +{
> +	if (take)
> +		device_unlock(dev);
> +}

See below. To me these are too weird to wrap up.  Open code them inline
where we can see what they are doing.

> +static void cxl_proto_err_work_fn(struct work_struct *work)
> +{
> +	struct cxl_proto_err_work_data wd;
> +
> +	while (cxl_proto_err_kfifo_get(&wd)) {
> +		struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(wd.pdev);
> +		struct device *cxlmd_dev;
> +
> +		if (!pdev) {
> +			pr_err_ratelimited("NULL PCI device passed in AER-CXL KFIFO\n");
> +			continue;
> +		}
> +
> +		guard(device)(&pdev->dev);
> +		if (is_pcie_endpoint(pdev)) {
> +			struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> +
> +			if (!cxl_pci_drv_bound(pdev))
> +				return;
> +			cxlmd_dev = &cxlds->cxlmd->dev;
> +			device_lock_if(cxlmd_dev, cxlmd_dev);

As below. Too odd.  Also needs comments to explain why conditionally locking it
would be useful.

> +		} else {
> +			cxlmd_dev = NULL;

Set it to NULL at declaration and drop this else leg.

> +		}
> +
> +		struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
> +		if (!port)
> +			return;
> +		guard(device)(&port->dev);
> +
> +		cxl_handle_proto_error(&wd);
> +		device_unlock_if(cxlmd_dev, cxlmd_dev);

This is too odd to wrap up like that.  Particularly given the
very generic sounding device_unlock_if() naming.
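
To illustrate, an open-coded version could look roughly like the sketch
below. It is a userspace model only: struct fake_device and the
fake_device_lock()/fake_device_unlock() helpers are made-up stand-ins for
struct device and device_lock()/device_unlock(), and handle_one_error() is a
hypothetical name for one iteration of the loop body.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for struct device; lock_depth models the device lock. */
struct fake_device {
	int lock_depth;
};

static void fake_device_lock(struct fake_device *dev)
{
	dev->lock_depth++;
}

static void fake_device_unlock(struct fake_device *dev)
{
	dev->lock_depth--;
}

/*
 * Model of one loop iteration. The memdev pointer is NULL at declaration,
 * so no else leg is needed, and the conditional lock/unlock is spelled out
 * where it happens.
 *
 * Returns the lock depth seen on the memdev while the error is "handled",
 * or -1 if there was no memdev to lock.
 */
int handle_one_error(bool is_endpoint, struct fake_device *memdev)
{
	struct fake_device *cxlmd_dev = NULL;
	int depth_during_handling = -1;

	if (is_endpoint)
		cxlmd_dev = memdev;

	/* Only Endpoints have a memdev whose lock must be held. */
	if (cxlmd_dev)
		fake_device_lock(cxlmd_dev);

	/* cxl_handle_proto_error(&wd) would run here in the real code. */
	if (cxlmd_dev)
		depth_during_handling = cxlmd_dev->lock_depth;

	if (cxlmd_dev)
		fake_device_unlock(cxlmd_dev);

	return depth_during_handling;
}
```

The two "if (cxlmd_dev)" checks cost a couple of lines but make it obvious
at the call site that the lock is only taken for Endpoints.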

> +	}
> +}



* Re: [RESEND v13 22/25] CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
  2025-11-04 17:03 ` [RESEND v13 22/25] CXL/PCI: Export and rename merge_result() to pci_ers_merge_result() Terry Bowman
@ 2025-11-04 18:41   ` Jonathan Cameron
  2025-11-04 19:03   ` Bjorn Helgaas
  1 sibling, 0 replies; 103+ messages in thread
From: Jonathan Cameron @ 2025-11-04 18:41 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci

On Tue, 4 Nov 2025 11:03:02 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> CXL uncorrectable errors (UCE) will soon be handled separately from the PCI
> AER handling. The merge_result() function can be made common to use in both
> handling paths.
> 
> Rename the PCI subsystem's merge_result() to be pci_ers_merge_result().
> Export pci_ers_merge_result() to make available for the CXL and other
> drivers to use.
> 
> Update pci_ers_merge_result() to support recently introduced PCI_ERS_RESULT_PANIC
> result.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>


* Re: [RESEND v13 21/25] PCI/AER: Dequeue forwarded CXL error
  2025-11-04 17:03 ` [RESEND v13 21/25] PCI/AER: Dequeue forwarded CXL error Terry Bowman
  2025-11-04 18:40   ` Jonathan Cameron
@ 2025-11-04 18:45   ` Bjorn Helgaas
  2025-11-20  3:33   ` dan.j.williams
  2 siblings, 0 replies; 103+ messages in thread
From: Bjorn Helgaas @ 2025-11-04 18:45 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci

On Tue, Nov 04, 2025 at 11:03:01AM -0600, Terry Bowman wrote:
> The AER driver now forwards CXL protocol errors to the CXL driver via a
> kfifo. The CXL driver must consume these work items, initiate protocol
> error handling, and ensure RAS mappings remain valid throughout processing.
> 
> Implement cxl_proto_err_work_fn() to dequeue work items forwarded by the
> AER service driver and begin protocol error processing by calling
> cxl_handle_proto_error().
> 
> Add a PCI device lock on &pdev->dev within cxl_proto_err_work_fn() to
> keep the PCI device structure valid during handling. Locking an Endpoint
> will also defer RAS unmapping until the device is unlocked.
> 
> For Endpoints, add a lock on CXL memory device cxlds->dev. The CXL memory
> device structure holds the RAS register reference needed during error
> handling.
> 
> Add lock for the parent CXL Port for Root Ports, Downstream Ports, and
> Upstream Ports to prevent destruction of structures holding mapped RAS
> addresses while they are in use.
> 
> Invoke cxl_do_recovery() for uncorrectable errors. Treat this as a stub for
> now; implement its functionality in a future patch.
> 
> Export pci_clean_device_status() to enable cleanup of AER status following
> error handling.

s/pci_clean_device_status/pcie_clear_device_status/

Acked-by: Bjorn Helgaas <bhelgaas@google.com>

>  drivers/cxl/core/ras.c | 153 ++++++++++++++++++++++++++++++++++++++---
>  drivers/pci/pci.c      |   1 +
>  drivers/pci/pci.h      |   1 -
>  include/linux/pci.h    |   2 +

Looks like this is primarily a CXL change, and the PCI part is
minimal, so I question the "PCI/AER:" prefix in the subject.

> +static struct cxl_port *get_cxl_port(struct pci_dev *pdev)
> +{
> +	switch (pci_pcie_type(pdev)) {
> +	case PCI_EXP_TYPE_ROOT_PORT:
> +	case PCI_EXP_TYPE_DOWNSTREAM:
> +	{
> +		struct cxl_dport *dport;
> +		struct cxl_port *port = find_cxl_port(&pdev->dev, &dport);
> +
> +		if (!port) {
> +			pci_err(pdev, "Failed to find the CXL device");
> +			return NULL;
> +		}
> +		return port;
> +	}
> +	case PCI_EXP_TYPE_UPSTREAM:
> +	{
> +		struct cxl_port *port = find_cxl_port_by_uport(&pdev->dev);
> +
> +		if (!port) {
> +			pci_err(pdev, "Failed to find the CXL device");
> +			return NULL;
> +		}
> +		return port;
> +	}
> +	case PCI_EXP_TYPE_ENDPOINT:
> +	{
> +		struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> +		struct cxl_port *port = cxlds->cxlmd->endpoint;
> +
> +		get_device(&port->dev);
> +		return port;
> +	}
> +	}
> +	pci_warn_once(pdev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));

Maybe use "%#x" so it's clear that this is hex?  PCI typically uses
lower-case hex; maybe the CXL convention is different.
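
A quick userspace illustration of the difference, using standard printf()
formatting (printk() accepts the same '#' flag). format_port_type() is a
hypothetical helper written only for this example, and 0xa is just a sample
value chosen because it renders as a letter in hex.

```c
#include <stdio.h>

/*
 * Format a PCIe port type both ways so the difference is visible.
 * Both output buffers must be at least len bytes long.
 */
void format_port_type(unsigned int type, char *upper, char *prefixed,
		      size_t len)
{
	snprintf(upper, len, "%X", type);	/* as in the patch */
	snprintf(prefixed, len, "%#x", type);	/* suggested form */
}
```

For type 0xa the first buffer holds "A" and the second holds "0xa", so only
the second is unambiguous when it shows up alone in a log line.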

> +static void cxl_handle_proto_error(struct cxl_proto_err_work_data *err_info)
> +{
> +	struct pci_dev *pdev = err_info->pdev;
> +	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> +
> +	if (err_info->severity == AER_CORRECTABLE) {
> +
> +		if (pdev->aer_cap)
> +			pci_clear_and_set_config_dword(pdev,
> +						       pdev->aer_cap + PCI_ERR_COR_STATUS,
> +						       0, PCI_ERR_COR_INTERNAL);
> +
> +		if (is_pcie_endpoint(pdev))
> +			cxl_cor_error_detected(&cxlds->cxlmd->dev);
> +		else
> +			cxl_port_cor_error_detected(&pdev->dev);
> +
> +		pcie_clear_device_status(pdev);

The AER clear above and pcie_clear_device_status() require
ownership of the PCIe Capability and the AER Capability, typically
granted by _OSC.

I suppose it's obvious that the OS does own these Capabilities if we
get here, but I'm not familiar with this code.



* Re: [RESEND v13 23/25] CXL/PCI: Introduce CXL uncorrectable protocol error recovery
  2025-11-04 17:03 ` [RESEND v13 23/25] CXL/PCI: Introduce CXL uncorrectable protocol error recovery Terry Bowman
@ 2025-11-04 18:47   ` Jonathan Cameron
  2025-11-04 23:43     ` Dave Jiang
  2025-11-11  8:37   ` Alison Schofield
  2025-12-08 18:40   ` Bjorn Helgaas
  2 siblings, 1 reply; 103+ messages in thread
From: Jonathan Cameron @ 2025-11-04 18:47 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci

On Tue, 4 Nov 2025 11:03:03 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> Implement cxl_do_recovery() to handle uncorrectable protocol
> errors (UCE), following the design of pcie_do_recovery(). Unlike PCIe,
> all CXL UCEs are treated as fatal and trigger a kernel panic to avoid
> potential CXL memory corruption.
> 
> Add cxl_walk_port(), analogous to pci_walk_bridge(), to traverse the
> CXL topology from the error source through downstream CXL ports and
> endpoints.
> 
> Introduce cxl_report_error_detected(), mirroring PCI's
> report_error_detected(), and implement device locking for the affected
> subtree. Endpoints require locking the PCI device (pdev->dev) and the
> CXL memdev (cxlmd->dev). CXL ports require locking the PCI
> device (pdev->dev) and the parent CXL port.
> 
> The device locks should be taken early where possible. The initially
> reporting device will be locked after kfifo dequeue. Iterated devices
> will be locked in cxl_report_error_detected() and must lock the
> iterated devices except for the first device as it has already been
> locked.
> 
> Export pci_aer_clear_fatal_status() for use when a UCE is not present.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

Follow on comments around the locking stuff. If that has been there
a while and I didn't notice before, sorry!

> 
> ---
> 
> Changes in v12->v13:
> - Add guard() before calling cxl_pci_drv_bound() (Dave Jiang)
> - Add guard() calls for EP (cxlds->cxlmd->dev & pdev->dev) and ports
>   (pdev->dev & parent cxl_port) in cxl_report_error_detected() and
>   cxl_handle_proto_error() (Terry)
> - Remove unnecessary check for endpoint port. (Dave Jiang)
> - Remove check for RCIEP EP in cxl_report_error_detected(). (Terry)
> 
> Changes in v11->v12:
> - Clean up port discovery in cxl_do_recovery() (Dave)
> - Add PCI_EXP_TYPE_RC_END to type check in cxl_report_error_detected()
> 
> Changes in v10->v11:
> - pci_ers_merge_results() - Move to earlier patch
> ---
>  drivers/cxl/core/ras.c | 135 ++++++++++++++++++++++++++++++++++++++++-
>  drivers/pci/pci.h      |   1 -
>  drivers/pci/pcie/aer.c |   1 +
>  include/linux/aer.h    |   2 +
>  4 files changed, 135 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 5bc144cde0ee..52c6f19564b6 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -259,8 +259,138 @@ static void device_unlock_if(struct device *dev, bool take)
>  		device_unlock(dev);
>  }
>  
> +/**
> + * cxl_report_error_detected
> + * @dev: Device being reported
> + * @data: Result
> + * @err_pdev: Device with initial detected error. Is locked immediately
> + *            after KFIFO dequeue.
> + */
> +static int cxl_report_error_detected(struct device *dev, void *data, struct pci_dev *err_pdev)
> +{
> +	bool need_lock = (dev != &err_pdev->dev);

Add a comment on why this controls need for locking.
The resulting code is complex enough I'd be tempted to split the whole
thing into locked and unlocked variants.
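
Roughly what I have in mind, as a userspace model rather than real kernel
code. struct model_dev, report_error_locked() and report_error() are all
hypothetical names; lock_depth stands in for the device lock.

```c
/* Userspace stand-in for a device; lock_depth models the device lock. */
struct model_dev {
	int lock_depth;
	int reports;
};

/*
 * Core reporting logic. The caller must already hold the device lock,
 * which is the case for the device that raised the error (locked right
 * after the kfifo dequeue). Returns the lock depth seen while reporting.
 */
int report_error_locked(struct model_dev *dev)
{
	dev->reports++;
	return dev->lock_depth;
}

/*
 * Variant for devices discovered while walking the subtree: takes the
 * lock itself, then shares the core logic above.
 */
int report_error(struct model_dev *dev)
{
	int depth;

	dev->lock_depth++;		/* device_lock() in the real code */
	depth = report_error_locked(dev);
	dev->lock_depth--;		/* device_unlock() */

	return depth;
}
```

The walk would then pick the variant once per device, and neither function
needs a need_lock flag threaded through every lock and unlock call.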

> +	pci_ers_result_t vote, *result = data;
> +	struct pci_dev *pdev;
> +
> +	if (!dev || !dev_is_pci(dev))
> +		return 0;
> +	pdev = to_pci_dev(dev);
> +
> +	device_lock_if(&pdev->dev, need_lock);
> +	if (is_pcie_endpoint(pdev) && !cxl_pci_drv_bound(pdev)) {
> +		device_unlock_if(&pdev->dev, need_lock);
> +		return PCI_ERS_RESULT_NONE;
> +	}
> +
> +	if (pdev->aer_cap)
> +		pci_clear_and_set_config_dword(pdev,
> +					       pdev->aer_cap + PCI_ERR_COR_STATUS,
> +					       0, PCI_ERR_COR_INTERNAL);
> +
> +	if (is_pcie_endpoint(pdev)) {
> +		struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> +
> +		device_lock_if(&cxlds->cxlmd->dev, need_lock);
> +		vote = cxl_error_detected(&cxlds->cxlmd->dev);
> +		device_unlock_if(&cxlds->cxlmd->dev, need_lock);
> +	} else {
> +		vote = cxl_port_error_detected(dev);
> +	}
> +
> +	pcie_clear_device_status(pdev);
> +	*result = pcie_ers_merge_result(*result, vote);
> +	device_unlock_if(&pdev->dev, need_lock);
> +
> +	return 0;
> +}

> +
> +/**
> + * cxl_walk_port

Needs a short description I think to count as valid kernel-doc and
stop the tool moaning if anyone runs it on this.

> + *
> + * @port: Port be traversed into
> + * @cb: Callback for handling the CXL Ports
> + * @userdata: Result
> + * @err_pdev: Device with initial detected error. Is locked immediately
> + *            after KFIFO dequeue.
> + */
> +static void cxl_walk_port(struct cxl_port *port,
> +			  int (*cb)(struct device *, void *, struct pci_dev *),
> +			  void *userdata,
> +			  struct pci_dev *err_pdev)
> +{
> +	struct cxl_port *err_port __free(put_cxl_port) = get_cxl_port(err_pdev);
> +	bool need_lock = (port != err_port);
> +	struct cxl_dport *dport = NULL;
> +	unsigned long index;
> +
> +	device_lock_if(&port->dev, need_lock);
> +	if (is_cxl_endpoint(port)) {
> +		cb(port->uport_dev->parent, userdata, err_pdev);
> +		device_unlock_if(&port->dev, need_lock);
> +		return;
> +	}
> +
> +	if (port->uport_dev && dev_is_pci(port->uport_dev))
> +		cb(port->uport_dev, userdata, err_pdev);
> +
> +	/*
> +	 * Iterate over the set of Downstream Ports recorded in port->dports (XArray):
> +	 *  - For each dport, attempt to find a child CXL Port whose parent dport
> +	 *    match.
> +	 *  - Invoke the provided callback on the dport's device.
> +	 *  - If a matching child CXL Port device is found, recurse into that port to
> +	 *    continue the walk.
> +	 */
> +	xa_for_each(&port->dports, index, dport)
> +	{

Move that brace to the line above for normal kernel loop formatting.

	xa_for_each(&port->dports, index, dport) {

> +		struct device *child_port_dev __free(put_device) =
> +			bus_find_device(&cxl_bus_type, &port->dev, dport->dport_dev,
> +					match_port_by_parent_dport);
> +
> +		cb(dport->dport_dev, userdata, err_pdev);
> +		if (child_port_dev)
> +			cxl_walk_port(to_cxl_port(child_port_dev), cb, userdata, err_pdev);
> +	}
> +	device_unlock_if(&port->dev, need_lock);
> +}
> +

>  
>  void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> @@ -483,16 +613,15 @@ static void cxl_proto_err_work_fn(struct work_struct *work)
>  			if (!cxl_pci_drv_bound(pdev))
>  				return;
>  			cxlmd_dev = &cxlds->cxlmd->dev;
> -			device_lock_if(cxlmd_dev, cxlmd_dev);
>  		} else {
>  			cxlmd_dev = NULL;
>  		}
>  
> +		/* Lock the CXL parent Port */
>  		struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
> -		if (!port)
> -			return;
>  		guard(device)(&port->dev);
>  
> +		device_lock_if(cxlmd_dev, cxlmd_dev);
>  		cxl_handle_proto_error(&wd);
>  		device_unlock_if(cxlmd_dev, cxlmd_dev);

Same issue on these helpers, but I'm also not sure why moving them in this
patch makes sense. I'm not sure what changed.

Perhaps this is stuff that ended up in the wrong patch?

>  	}



* Re: [RESEND v13 22/25] CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
  2025-11-04 17:03 ` [RESEND v13 22/25] CXL/PCI: Export and rename merge_result() to pci_ers_merge_result() Terry Bowman
  2025-11-04 18:41   ` Jonathan Cameron
@ 2025-11-04 19:03   ` Bjorn Helgaas
  2025-11-14 15:20     ` Bowman, Terry
  1 sibling, 1 reply; 103+ messages in thread
From: Bjorn Helgaas @ 2025-11-04 19:03 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci

On Tue, Nov 04, 2025 at 11:03:02AM -0600, Terry Bowman wrote:
> CXL uncorrectable errors (UCE) will soon be handled separately from the PCI
> AER handling. The merge_result() function can be made common to use in both
> handling paths.
> 
> Rename the PCI subsystem's merge_result() to be pci_ers_merge_result().
> Export pci_ers_merge_result() to make available for the CXL and other
> drivers to use.
> 
> Update pci_ers_merge_result() to support recently introduced PCI_ERS_RESULT_PANIC
> result.

Seems like this merge_result() change maybe should be in the same
patch that added PCI_ERS_RESULT_PANIC?  That would also solve the
problem that the subject line doesn't mention this important
functional change.

I haven't seen the user(s) of pci_ers_merge_result() yet, but this
seems like it might be a little too low level to be exported to
modules and in include/linux/pci.h.  Maybe there's no other way.

Wrap commit log to fit in 75 columns.

Suggest possible subject prefix of "PCI/ERR" since the only CXL
connection is that you want to *use* this from CXL.

> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> 
> ---
> 
> Changes in v12->v13:
> - Renamed pci_ers_merge_result() to pcie_ers_merge_result().
>   pci_ers_merge_result() is already used in eeh driver. (Bot)
> 
> Changes in v11->v12:
> - Remove static inline pci_ers_merge_result() definition for !CONFIG_PCIEAER.
>   Is not needed. (Lukas)
> 
> Changes in v10->v11:
> - New patch
> - pci_ers_merge_result() - Change export to non-namespace and rename
>   to be pci_ers_merge_result()
> - Move pci_ers_merge_result() definition to pci.h. Needs pci_ers_result
> ---
>  drivers/pci/pcie/err.c | 14 +++++++++-----
>  include/linux/pci.h    |  7 +++++++
>  2 files changed, 16 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index bebe4bc111d7..9394bbdcf0fb 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -21,9 +21,12 @@
>  #include "portdrv.h"
>  #include "../pci.h"
>  
> -static pci_ers_result_t merge_result(enum pci_ers_result orig,
> -				  enum pci_ers_result new)
> +pci_ers_result_t pcie_ers_merge_result(enum pci_ers_result orig,
> +				       enum pci_ers_result new)
>  {
> +	if (new == PCI_ERS_RESULT_PANIC)
> +		return PCI_ERS_RESULT_PANIC;
> +
>  	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
>  		return PCI_ERS_RESULT_NO_AER_DRIVER;
>  
> @@ -45,6 +48,7 @@ static pci_ers_result_t merge_result(enum pci_ers_result orig,
>  
>  	return orig;
>  }
> +EXPORT_SYMBOL(pcie_ers_merge_result);
>  
>  static int report_error_detected(struct pci_dev *dev,
>  				 pci_channel_state_t state,
> @@ -81,7 +85,7 @@ static int report_error_detected(struct pci_dev *dev,
>  		vote = err_handler->error_detected(dev, state);
>  	}
>  	pci_uevent_ers(dev, vote);
> -	*result = merge_result(*result, vote);
> +	*result = pcie_ers_merge_result(*result, vote);
>  	device_unlock(&dev->dev);
>  	return 0;
>  }
> @@ -139,7 +143,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data)
>  
>  	err_handler = pdrv->err_handler;
>  	vote = err_handler->mmio_enabled(dev);
> -	*result = merge_result(*result, vote);
> +	*result = pcie_ers_merge_result(*result, vote);
>  out:
>  	device_unlock(&dev->dev);
>  	return 0;
> @@ -159,7 +163,7 @@ static int report_slot_reset(struct pci_dev *dev, void *data)
>  
>  	err_handler = pdrv->err_handler;
>  	vote = err_handler->slot_reset(dev);
> -	*result = merge_result(*result, vote);
> +	*result = pcie_ers_merge_result(*result, vote);
>  out:
>  	device_unlock(&dev->dev);
>  	return 0;
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 33d16b212e0d..d3e3300f79ec 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1887,9 +1887,16 @@ static inline void pci_hp_unignore_link_change(struct pci_dev *pdev) { }
>  #ifdef CONFIG_PCIEAER
>  bool pci_aer_available(void);
>  void pcie_clear_device_status(struct pci_dev *dev);
> +pci_ers_result_t pcie_ers_merge_result(enum pci_ers_result orig,
> +				       enum pci_ers_result new);
>  #else
>  static inline bool pci_aer_available(void) { return false; }
>  static inline void pcie_clear_device_status(struct pci_dev *dev) { }
> +static inline pci_ers_result_t pcie_ers_merge_result(enum pci_ers_result orig,
> +						     enum pci_ers_result new)
> +{
> +	return PCI_ERS_RESULT_NONE;
> +}
>  #endif
>  
>  bool pci_ats_disabled(void);
> -- 
> 2.34.1
> 
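
The precedence rules in the err.c hunks above can be modelled in userspace
as below. This is a sketch under assumptions: the ERS_* names and
ers_merge_result() are hypothetical, the numeric values below 6 and
everything after the NO_AER_DRIVER check are not part of the quoted hunks
and are filled in from my reading of the existing merge_result(), so check
them against the tree before relying on them.

```c
/* Values mirror enum pci_ers_result, with PANIC = 7 as added by the series. */
enum ers_result {
	ERS_NONE = 1,
	ERS_CAN_RECOVER = 2,
	ERS_NEED_RESET = 3,
	ERS_DISCONNECT = 4,
	ERS_RECOVERED = 5,
	ERS_NO_AER_DRIVER = 6,
	ERS_PANIC = 7,
};

enum ers_result ers_merge_result(enum ers_result orig, enum ers_result vote)
{
	/* Added by this patch: a PANIC vote overrides the running result. */
	if (vote == ERS_PANIC)
		return ERS_PANIC;

	if (vote == ERS_NO_AER_DRIVER)
		return ERS_NO_AER_DRIVER;

	/* From here on: outside the quoted hunks (assumed, see above). */
	if (vote == ERS_NONE)
		return orig;

	switch (orig) {
	case ERS_CAN_RECOVER:
	case ERS_RECOVERED:
		orig = vote;
		break;
	case ERS_DISCONNECT:
		if (vote == ERS_NEED_RESET)
			orig = ERS_NEED_RESET;
		break;
	default:
		break;
	}

	return orig;
}
```

In this model a PANIC result survives later NONE or RECOVERED votes, but a
later NO_AER_DRIVER vote still replaces it because that check returns
unconditionally. Whether that ordering can occur in the CXL walk is not
something the sketch answers.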


* Re: [RESEND v13 15/25] CXL/PCI: Introduce PCI_ERS_RESULT_PANIC
  2025-11-04 17:02 ` [RESEND v13 15/25] CXL/PCI: Introduce PCI_ERS_RESULT_PANIC Terry Bowman
@ 2025-11-04 19:03   ` Bjorn Helgaas
  2025-11-20  0:17   ` dan.j.williams
  1 sibling, 0 replies; 103+ messages in thread
From: Bjorn Helgaas @ 2025-11-04 19:03 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci

On Tue, Nov 04, 2025 at 11:02:55AM -0600, Terry Bowman wrote:
> The CXL driver's error handling for uncorrectable errors (UCE) will be
> updated in the future. A required change is for the error handlers to
> force a system panic when a UCE is detected.
> 
> Introduce PCI_ERS_RESULT_PANIC as a 'enum pci_ers_result' type. This will
> be used by CXL UCE fatal and non-fatal recovery in future patches. Update
> PCIe recovery documentation with details of PCI_ERS_RESULT_PANIC.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>

This patch doesn't actually *do* anything.  There's no possibility of a
bisect landing on it.  I think it would be better to combine this with
something that *uses* PCI_ERS_RESULT_PANIC, maybe the merge_result()
update?

Suggest possible subject prefix of "PCI/ERR" since this really isn't
CXL-specific; it just so happens that you don't know of uses outside
CXL.

> +++ b/Documentation/PCI/pci-error-recovery.rst
> @@ -102,6 +102,8 @@ Possible return values are::
>  		PCI_ERS_RESULT_NEED_RESET,  /* Device driver wants slot to be reset. */
>  		PCI_ERS_RESULT_DISCONNECT,  /* Device has completely failed, is unrecoverable */
>  		PCI_ERS_RESULT_RECOVERED,   /* Device driver is fully recovered and operational */
> +		PCI_ERS_RESULT_NO_AER_DRIVER, /* No AER capabilities registered for the driver */

"AER capabilities" is confusingly similar to the PCIe AER Capability.

I think this really means "there's no
pci_error_handlers.error_detected() callback".

> +		PCI_ERS_RESULT_PANIC,       /* System is unstable, panic. Is CXL specific */
>  	};
>  
>  A driver does not have to implement all of these callbacks; however,
> @@ -116,6 +118,10 @@ The actual steps taken by a platform to recover from a PCI error
>  event will be platform-dependent, but will follow the general
>  sequence described below.
>  
> +PCI_ERS_RESULT_PANIC is currently unique to CXL and handled in CXL
> +cxl_do_recovery(). The PCI pcie_do_recovery() routine does not report or
> +handle PCI_ERS_RESULT_PANIC.

I'm not sure all these mentions of being CXL specific are really
helpful.  I don't think they are actionable to driver writers.

>  STEP 0: Error Event
>  -------------------
>  A PCI bus error is detected by the PCI hardware.  On powerpc, the slot
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 5c4759078d2f..cffa5535f28d 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -890,6 +890,9 @@ enum pci_ers_result {
>  
>  	/* No AER capabilities registered for the driver */
>  	PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6,
> +
> +	/* System is unstable, panic. Is CXL specific */
> +	PCI_ERS_RESULT_PANIC = (__force pci_ers_result_t) 7,
>  };
>  
>  /* PCI bus error event callbacks */
> -- 
> 2.34.1
> 


* Re: [RESEND v13 19/25] cxl/pci: Introduce CXL protocol error handlers for Endpoints
  2025-11-04 17:02 ` [RESEND v13 19/25] cxl/pci: Introduce CXL protocol error handlers for Endpoints Terry Bowman
  2025-11-04 18:29   ` Jonathan Cameron
@ 2025-11-04 19:09   ` Bjorn Helgaas
  1 sibling, 0 replies; 103+ messages in thread
From: Bjorn Helgaas @ 2025-11-04 19:09 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci

On Tue, Nov 04, 2025 at 11:02:59AM -0600, Terry Bowman wrote:
> CXL Endpoint protocol errors are currently handled by generic PCI error
> handlers. However, uncorrectable errors (UCEs) require CXL.mem protocol-
> specific handling logic that the PCI handlers cannot provide.

> +++ b/drivers/cxl/core/ras.c

> +static bool is_pcie_endpoint(struct pci_dev *pdev)
> +{
> +	return pci_pcie_type(pdev) == PCI_EXP_TYPE_ENDPOINT;
> +}

Seems like a weird place for this since it's not CXL related.


* Re: [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging
  2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (24 preceding siblings ...)
  2025-11-04 17:03 ` [RESEND v13 25/25] CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup Terry Bowman
@ 2025-11-04 19:11 ` Bjorn Helgaas
  2025-11-04 21:54   ` Bowman, Terry
  25 siblings, 1 reply; 103+ messages in thread
From: Bjorn Helgaas @ 2025-11-04 19:11 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci

On Tue, Nov 04, 2025 at 11:02:40AM -0600, Terry Bowman wrote:
> This patchset updates CXL Protocol Error handling for CXL Ports and CXL
> Endpoints (EP). Previous versions of this series can be found here:
> https://lore.kernel.org/linux-cxl/20250925223440.3539069-1-terry.bowman@amd.com/
> ...

> Terry Bowman (24):
>   CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
>   PCI/CXL: Introduce pcie_is_cxl()
>   cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
>   cxl/pci: Remove unnecessary CXL RCH handling helper functions
>   cxl: Move CXL driver's RCH error handling into core/ras_rch.c
>   CXL/AER: Replace device_lock() in cxl_rch_handle_error_iter() with
>     guard() lock
>   CXL/AER: Move AER drivers RCH error handling into pcie/aer_cxl_rch.c
>   PCI/AER: Report CXL or PCIe bus error type in trace logging
>   cxl/pci: Update RAS handler interfaces to also support CXL Ports
>   cxl/pci: Log message if RAS registers are unmapped
>   cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
>   cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
>   cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
>   CXL/PCI: Introduce PCI_ERS_RESULT_PANIC
>   CXL/AER: Introduce pcie/aer_cxl_vh.c in AER driver for forwarding CXL
>     errors
>   cxl: Introduce cxl_pci_drv_bound() to check for bound driver
>   cxl: Change CXL handlers to use guard() instead of scoped_guard()
>   cxl/pci: Introduce CXL protocol error handlers for Endpoints
>   CXL/PCI: Introduce CXL Port protocol error handlers
>   PCI/AER: Dequeue forwarded CXL error
>   CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
>   CXL/PCI: Introduce CXL uncorrectable protocol error recovery
>   CXL/PCI: Enable CXL protocol errors during CXL Port probe
>   CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup

Is the mix of "CXL/PCI" vs "cxl/pci" in the above telling me
something, or should they all match?

As a rule of thumb, I'm going to look at things that start with "PCI"
and skip most of the rest on the assumption that the rest only have
incidental effects on PCI.


* Re: [RESEND v13 07/25] CXL/AER: Replace device_lock() in cxl_rch_handle_error_iter() with guard() lock
  2025-11-04 17:02 ` [RESEND v13 07/25] CXL/AER: Replace device_lock() in cxl_rch_handle_error_iter() with guard() lock Terry Bowman
  2025-11-04 18:05   ` Jonathan Cameron
@ 2025-11-04 19:53   ` Dave Jiang
  2025-11-19  3:20   ` dan.j.williams
  2 siblings, 0 replies; 103+ messages in thread
From: Dave Jiang @ 2025-11-04 19:53 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci



On 11/4/25 10:02 AM, Terry Bowman wrote:
> cxl_rch_handle_error_iter() includes a call to device_lock() using a goto
> for multiple return paths. Improve readability and maintainability by
> using the guard() lock variant.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> 
> ---
> 
> Changes in v12->v13:
> - New patch
> ---
>  drivers/pci/pcie/aer.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 0b5ed4722ac3..cbaed65577d9 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1187,12 +1187,11 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>  	if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev))
>  		return 0;
>  
> -	/* Protect dev->driver */
> -	device_lock(&dev->dev);
> +	guard(device)(&dev->dev);
>  
>  	err_handler = dev->driver ? dev->driver->err_handler : NULL;
>  	if (!err_handler)
> -		goto out;
> +		return 0;
>  
>  	if (info->severity == AER_CORRECTABLE) {
>  		if (err_handler->cor_error_detected)
> @@ -1203,8 +1202,6 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>  		else if (info->severity == AER_FATAL)
>  			err_handler->error_detected(dev, pci_channel_io_frozen);
>  	}
> -out:
> -	device_unlock(&dev->dev);
>  	return 0;
>  }
>  
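
For anyone unfamiliar with guard(): it is built on the compiler's cleanup
attribute, which is why the goto/unlock pair can go away. The sketch below
is a userspace model of that mechanism and needs GCC or Clang. struct
model_device, MODEL_GUARD() and handle_error_iter() are hypothetical names
and do not match the kernel's cleanup.h definitions.

```c
/* Userspace stand-in for a device and its lock. */
struct model_device {
	int lock_depth;
	int has_handler;
};

/* Cleanup callback: receives a pointer to the guarded variable. */
static void model_unlock(struct model_device **devp)
{
	(*devp)->lock_depth--;
}

/*
 * Poor man's guard(): take the lock now and let the compiler drop it
 * when the variable goes out of scope, on every return path.
 */
#define MODEL_GUARD(dev)						\
	struct model_device *guard_ptr					\
		__attribute__((unused, cleanup(model_unlock))) =	\
		((dev)->lock_depth++, (dev))

int handle_error_iter(struct model_device *dev)
{
	MODEL_GUARD(dev);

	if (!dev->has_handler)
		return 0;	/* early return, lock is still dropped */

	/* The return value is evaluated while the lock is still held. */
	return dev->lock_depth;
}
```

Both return paths leave lock_depth back at its starting value without any
explicit unlock, which is the property the patch relies on.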



* Re: [RESEND v13 14/25] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
  2025-11-04 17:02 ` [RESEND v13 14/25] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers Terry Bowman
  2025-11-04 18:15   ` Jonathan Cameron
@ 2025-11-04 20:03   ` Dave Jiang
  2025-11-11  8:23   ` Alison Schofield
  2 siblings, 0 replies; 103+ messages in thread
From: Dave Jiang @ 2025-11-04 20:03 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci



On 11/4/25 10:02 AM, Terry Bowman wrote:
> CXL Endpoint (EP) Ports may include Root Ports (RP) or Downstream Switch
> Ports (DSP). CXL RPs and DSPs contain RAS registers that require memory
> mapping to enable RAS logging. This initialization is currently missing and
> must be added for CXL RPs and DSPs.
> 
> Update cxl_dport_init_ras_reporting() to support RP and DSP RAS mapping.
> Add alongside the existing Restricted CXL Host Downstream Port RAS mapping.
> 
> Update cxl_endpoint_port_probe() to invoke cxl_dport_init_ras_reporting().
> This will initiate the RAS mapping for CXL RPs and DSPs when each CXL EP is
> created and added to the EP port.
> 
> Make a call to cxl_port_setup_regs() in cxl_port_add(). This will probe the
> Upstream Port's CXL capabilities' physical location to be used in mapping
> the RAS registers.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> 
> ---
> 
> Changes in v12->v13:
> - Change as result of dport delay fix. No longer need switchport and
> endport approach. (Terry)
> 
> Changes in v11->v12:
> - Add check for dport_parent->rch before calling cxl_dport_init_ras_reporting().
> RCH dports are initialized from cxl_dport_init_ras_reporting cxl_mem_probe().
> 
> Changes in v10->v11:
> - Use local pointer for readability in cxl_switch_port_init_ras() (Jonathan Cameron)
> - Rename port to be ep in cxl_endpoint_port_init_ras() (Dave Jiang)
> - Rename dport to be parent_dport in cxl_endpoint_port_init_ras()
>   and cxl_switch_port_init_ras() (Dave Jiang)
> - Port helper changes were in cxl/port.c, now in core/ras.c (Dave Jiang)
> ---
>  drivers/cxl/core/port.c |  4 ++++
>  drivers/cxl/core/ras.c  | 12 ++++++++++++
>  drivers/cxl/cxl.h       |  2 ++
>  drivers/cxl/cxlpci.h    |  4 ++++
>  drivers/cxl/mem.c       |  3 ++-
>  5 files changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 8128fd2b5b31..48f6a1492544 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -1194,6 +1194,8 @@ __devm_cxl_add_dport(struct cxl_port *port, struct device *dport_dev,
>  			return ERR_PTR(rc);
>  		}
>  		port->component_reg_phys = CXL_RESOURCE_NONE;
> +		if (!is_cxl_endpoint(port) && dev_is_pci(port->uport_dev))
> +			cxl_uport_init_ras_reporting(port, &port->dev);
>  	}
>  
>  	get_device(dport_dev);
> @@ -1623,6 +1625,8 @@ static struct cxl_dport *cxl_port_add_dport(struct cxl_port *port,
>  
>  	cxl_switch_parse_cdat(new_dport);
>  
> +	cxl_dport_init_ras_reporting(new_dport, &port->dev);
> +
>  	if (ida_is_empty(&port->decoder_ida)) {
>  		rc = devm_cxl_switch_port_decoders_setup(port);
>  		if (rc)
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 246dfe56617a..19d9ffe885bf 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -162,6 +162,18 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
>  
> +void cxl_uport_init_ras_reporting(struct cxl_port *port,
> +				  struct device *host)
> +{
> +	struct cxl_register_map *map = &port->reg_map;
> +
> +	map->host = host;
> +	if (cxl_map_component_regs(map, &port->uport_regs,
> +				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
> +		dev_dbg(&port->dev, "Failed to map RAS capability\n");
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, "CXL");
> +
>  void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
>  {
>  	void __iomem *addr;
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 259ed4b676e1..b7654d40dc9e 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -599,6 +599,7 @@ struct cxl_dax_region {
>   * @parent_dport: dport that points to this port in the parent
>   * @decoder_ida: allocator for decoder ids
>   * @reg_map: component and ras register mapping parameters
> + * @uport_regs: mapped component registers
>   * @nr_dports: number of entries in @dports
>   * @hdm_end: track last allocated HDM decoder instance for allocation ordering
>   * @commit_end: cursor to track highest committed decoder for commit ordering
> @@ -620,6 +621,7 @@ struct cxl_port {
>  	struct cxl_dport *parent_dport;
>  	struct ida decoder_ida;
>  	struct cxl_register_map reg_map;
> +	struct cxl_component_regs uport_regs;
>  	int nr_dports;
>  	int hdm_end;
>  	int commit_end;
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 0c8b6ee7b6de..a0a491e7b5b9 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -83,6 +83,8 @@ void cxl_cor_error_detected(struct pci_dev *pdev);
>  pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>  				    pci_channel_state_t state);
>  void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host);
> +void cxl_uport_init_ras_reporting(struct cxl_port *port,
> +				  struct device *host);
>  #else
>  static inline void cxl_cor_error_detected(struct pci_dev *pdev) { }
>  
> @@ -94,6 +96,8 @@ static inline pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>  
>  static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport,
>  						struct device *host) { }
> +static inline void cxl_uport_init_ras_reporting(struct cxl_port *port,
> +						struct device *host) { }
>  #endif
>  
>  #endif /* __CXL_PCI_H__ */
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 6e6777b7bafb..d2155f45240d 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -166,7 +166,8 @@ static int cxl_mem_probe(struct device *dev)
>  	else
>  		endpoint_parent = &parent_port->dev;
>  
> -	cxl_dport_init_ras_reporting(dport, dev);
> +	if (dport->rch)
> +		cxl_dport_init_ras_reporting(dport, dev);
>  
>  	scoped_guard(device, endpoint_parent) {
>  		if (!endpoint_parent->driver) {


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 18/25] cxl: Change CXL handlers to use guard() instead of scoped_guard()
  2025-11-04 17:02 ` [RESEND v13 18/25] cxl: Change CXL handlers to use guard() instead of scoped_guard() Terry Bowman
  2025-11-04 18:18   ` Jonathan Cameron
@ 2025-11-04 20:15   ` Dave Jiang
  1 sibling, 0 replies; 103+ messages in thread
From: Dave Jiang @ 2025-11-04 20:15 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci



On 11/4/25 10:02 AM, Terry Bowman wrote:
> The CXL protocol error handlers use scoped_guard() to guarantee access to
> the underlying CXL memory device. Improve readability and reduce complexity
> by changing the current scoped_guard() to be guard().
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> 
> ---
> 
> Changes in v12->v13:
> - New patch
> ---
>  drivers/cxl/core/ras.c | 53 +++++++++++++++++++++---------------------
>  1 file changed, 26 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 19d9ffe885bf..cb712772de5c 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -254,19 +254,19 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
>  	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>  	struct device *dev = &cxlds->cxlmd->dev;
>  
> -	scoped_guard(device, dev) {
> -		if (!dev->driver) {
> -			dev_warn(&pdev->dev,
> -				 "%s: memdev disabled, abort error handling\n",
> -				 dev_name(dev));
> -			return;
> -		}
> -
> -		if (cxlds->rcd)
> -			cxl_handle_rdport_errors(cxlds);
> +	guard(device)(dev);
>  
> -		cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->serial, cxlds->regs.ras);
> +	if (!dev->driver) {
> +		dev_warn(&pdev->dev,
> +			 "%s: memdev disabled, abort error handling\n",
> +			 dev_name(dev));
> +		return;
>  	}
> +
> +	if (cxlds->rcd)
> +		cxl_handle_rdport_errors(cxlds);
> +
> +	cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->serial, cxlds->regs.ras);
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
>  
> @@ -278,25 +278,24 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>  	struct device *dev = &cxlmd->dev;
>  	bool ue;
>  
> -	scoped_guard(device, dev) {
> -		if (!dev->driver) {
> -			dev_warn(&pdev->dev,
> -				 "%s: memdev disabled, abort error handling\n",
> -				 dev_name(dev));
> -			return PCI_ERS_RESULT_DISCONNECT;
> -		}
> +	guard(device)(dev);
>  
> -		if (cxlds->rcd)
> -			cxl_handle_rdport_errors(cxlds);
> -		/*
> -		 * A frozen channel indicates an impending reset which is fatal to
> -		 * CXL.mem operation, and will likely crash the system. On the off
> -		 * chance the situation is recoverable dump the status of the RAS
> -		 * capability registers and bounce the active state of the memdev.
> -		 */
> -		ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial, cxlds->regs.ras);
> +	if (!dev->driver) {
> +		dev_warn(&pdev->dev,
> +			 "%s: memdev disabled, abort error handling\n",
> +			 dev_name(dev));
> +		return PCI_ERS_RESULT_DISCONNECT;
>  	}
>  
> +	if (cxlds->rcd)
> +		cxl_handle_rdport_errors(cxlds);
> +	/*
> +	 * A frozen channel indicates an impending reset which is fatal to
> +	 * CXL.mem operation, and will likely crash the system. On the off
> +	 * chance the situation is recoverable dump the status of the RAS
> +	 * capability registers and bounce the active state of the memdev.
> +	 */
> +	ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial, cxlds->regs.ras);
>  
>  	switch (state) {
>  	case pci_channel_io_normal:
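
A small userspace sketch of the scoping difference behind this change, assuming GCC or Clang and using hypothetical names (toy_lock, SCOPE_GUARD). A block-scoped guard releases at the closing brace of its block, while a function-scoped guard holds the lock until the function returns. One consequence visible in the diff above is that code following the former scoped_guard() block, such as the switch on state, now also runs with the lock held.

```c
#include <assert.h>

/* Hypothetical toy lock; "held" is 1 while the lock is taken. */
struct toy_lock {
	int held;
};

static void toy_release(struct toy_lock **l)
{
	(*l)->held = 0;
}

/* Acquire now; release when the enclosing scope ends. */
#define SCOPE_GUARD(l) \
	struct toy_lock *scope_guard_ __attribute__((cleanup(toy_release))) = \
		((l)->held = 1, (l))

/* Block-scoped style: the lock drops at the inner closing brace. */
static int held_after_scoped(struct toy_lock *l)
{
	{
		SCOPE_GUARD(l);
		/* protected work would sit here, one indent level deeper */
	}
	return l->held;		/* lock already released */
}

/* Function-scoped style: the lock is held until the function returns. */
static int held_before_return(struct toy_lock *l)
{
	SCOPE_GUARD(l);
	/* protected work sits at the function's top indent level */
	return l->held;		/* evaluated while the lock is still held */
}
```

The second form trades a slightly longer critical section for one less level of nesting, which is the readability gain the commit message describes.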


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 20/25] CXL/PCI: Introduce CXL Port protocol error handlers
  2025-11-04 17:03 ` [RESEND v13 20/25] CXL/PCI: Introduce CXL Port protocol error handlers Terry Bowman
  2025-11-04 18:32   ` Jonathan Cameron
@ 2025-11-04 21:20   ` Dave Jiang
  2025-11-04 21:27     ` Bowman, Terry
  1 sibling, 1 reply; 103+ messages in thread
From: Dave Jiang @ 2025-11-04 21:20 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci



On 11/4/25 10:03 AM, Terry Bowman wrote:
> Add CXL protocol error handlers for CXL Port devices (Root Ports,
> Downstream Ports, and Upstream Ports). Implement cxl_port_cor_error_detected()
> and cxl_port_error_detected() to handle correctable and uncorrectable errors
> respectively.
> 
> Introduce cxl_get_ras_base() to retrieve the cached RAS register base
> address for a given CXL port. This function supports CXL Root Ports,
> Downstream Ports, and Upstream Ports by returning their previously mapped
> RAS register addresses.
> 
> Add device lock assertions to protect against concurrent device or RAS
> register removal during error handling. The port error handlers require
> two device locks:
> 
> 1. The port's CXL parent device - RAS registers are mapped using devm_*
>    functions with the parent port as the host. Locking the parent prevents
>    the RAS registers from being unmapped during error handling.
> 
> 2. The PCI device (pdev->dev) - Locking prevents concurrent modifications
>    to the PCI device structure during error handling.
> 
> The lock assertions added here will be satisfied by device locks introduced
> in a subsequent patch.
> 
> Introduce get_pci_cxl_host_dev() to return the device responsible for
> managing the RAS register mapping. This function increments the reference
> count on the host device to prevent premature resource release during error
> handling. The caller is responsible for decrementing the reference count.
> For CXL endpoints, which manage resources without a separate host device,
> this function returns NULL.
> 
> Update the AER driver's is_cxl_error() to recognize CXL Port devices in
> addition to CXL Endpoints, as both now have CXL-specific error handlers.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> ---
> 
> Changes in v12->v13:
> - Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
>   patch (Terry)
> - Remove EP case in cxl_get_ras_base(), not used. (Terry)
> - Remove check for dport->dport_dev (Dave)
> - Remove whitespace (Terry)
> 
> Changes in v11->v12:
> - Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
>   pci_to_cxl_dev()
> - Change cxl_error_detected() -> cxl_cor_error_detected()
> - Remove NULL variable assignments
> - Replace bus_find_device() with find_cxl_port_by_uport() for upstream
>   port searches.
> 
> Changes in v10->v11:
> - None
> ---
>  drivers/cxl/core/core.h       | 10 +++++++
>  drivers/cxl/core/port.c       |  7 ++---
>  drivers/cxl/core/ras.c        | 49 +++++++++++++++++++++++++++++++++++
>  drivers/pci/pcie/aer_cxl_vh.c |  5 +++-
>  4 files changed, 67 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index b2c0ccd6803f..046ec65ed147 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -157,6 +157,8 @@ void cxl_cor_error_detected(struct device *dev);
>  pci_ers_result_t pci_error_detected(struct pci_dev *pdev,
>  				    pci_channel_state_t error);
>  void pci_cor_error_detected(struct pci_dev *pdev);
> +pci_ers_result_t cxl_port_error_detected(struct device *dev);
> +void cxl_port_cor_error_detected(struct device *dev);
>  #else
>  static inline int cxl_ras_init(void)
>  {
> @@ -176,6 +178,11 @@ static inline pci_ers_result_t pci_error_detected(struct pci_dev *pdev,
>  	return PCI_ERS_RESULT_NONE;
>  }
>  static inline void pci_cor_error_detected(struct pci_dev *pdev) { }
> +static inline void cxl_port_cor_error_detected(struct device *dev) { }
> +static inline pci_ers_result_t cxl_port_error_detected(struct device *dev)
> +{
> +	return PCI_ERS_RESULT_NONE;
> +}
>  #endif /* CONFIG_CXL_RAS */
>  
>  /* Restricted CXL Host specific RAS functions */
> @@ -190,6 +197,9 @@ static inline void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds) { }
>  #endif /* CONFIG_CXL_RCH_RAS */
>  
>  int cxl_gpf_port_setup(struct cxl_dport *dport);
> +struct cxl_port *find_cxl_port(struct device *dport_dev,
> +			       struct cxl_dport **dport);
> +struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev);
>  
>  struct cxl_hdm;
>  int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index b70e1b505b5c..d060f864cf2e 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -1360,8 +1360,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
>  	return NULL;
>  }
>  
> -static struct cxl_port *find_cxl_port(struct device *dport_dev,
> -				      struct cxl_dport **dport)
> +struct cxl_port *find_cxl_port(struct device *dport_dev,
> +			       struct cxl_dport **dport)
>  {
>  	struct cxl_find_port_ctx ctx = {
>  		.dport_dev = dport_dev,
> @@ -1564,7 +1564,7 @@ static int match_port_by_uport(struct device *dev, const void *data)
>   * Function takes a device reference on the port device. Caller should do a
>   * put_device() when done.
>   */
> -static struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
> +struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
>  {
>  	struct device *dev;
>  
> @@ -1573,6 +1573,7 @@ static struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
>  		return to_cxl_port(dev);
>  	return NULL;
>  }
> +EXPORT_SYMBOL_NS_GPL(find_cxl_port_by_uport, "CXL");
>  
>  static int update_decoder_targets(struct device *dev, void *data)
>  {
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index beb142054bda..142ca8794107 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -145,6 +145,39 @@ static void cxl_dport_map_ras(struct cxl_dport *dport)
>  		dev_dbg(dev, "Failed to map RAS capability.\n");
>  }
>  
> +static void __iomem *cxl_get_ras_base(struct device *dev)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +
> +	switch (pci_pcie_type(pdev)) {
> +	case PCI_EXP_TYPE_ROOT_PORT:
> +	case PCI_EXP_TYPE_DOWNSTREAM:
> +	{
> +		struct cxl_dport *dport;
> +		struct cxl_port *port __free(put_cxl_port) = find_cxl_port(&pdev->dev, &dport);
> +
> +		if (!dport) {
> +			pci_err(pdev, "Failed to find the CXL device");
> +			return NULL;
> +		}
> +		return dport->regs.ras;

The RAS MMIO mapping is done via devm_cxl_iomap_block() and is a devres against the device. Without holding the device lock, the port driver can unbind and the address mapping may go away before or while cxl_handle_cor_ras()/cxl_handle_ras() is called. I think you'll have to hold the port lock here and make sure that the port driver is bound before reading the RAS register? I think the dport RAS should be covered under the port umbrella.

> +	}
> +	case PCI_EXP_TYPE_UPSTREAM:
> +	{
> +		struct cxl_port *port __free(put_cxl_port) = find_cxl_port_by_uport(&pdev->dev);
> +
> +		if (!port) {
> +			pci_err(pdev, "Failed to find the CXL device");
> +			return NULL;
> +		}
> +		return port->uport_regs.ras;

same here

DJ
> +	}
> +	}
> +
> +	dev_warn_once(dev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));
> +	return NULL;
> +}
> +
>  /**
>   * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
>   * @dport: the cxl_dport that needs to be initialized
> @@ -254,6 +287,22 @@ pci_ers_result_t cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ra
>  	return PCI_ERS_RESULT_PANIC;
>  }
>  
> +void cxl_port_cor_error_detected(struct device *dev)
> +{
> +	void __iomem *ras_base = cxl_get_ras_base(dev);
> +
> +	cxl_handle_cor_ras(dev, 0, ras_base);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_port_cor_error_detected, "CXL");
> +
> +pci_ers_result_t cxl_port_error_detected(struct device *dev)
> +{
> +	void __iomem *ras_base = cxl_get_ras_base(dev);
> +
> +	return cxl_handle_ras(dev, 0, ras_base);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_port_error_detected, "CXL");
> +
>  void cxl_cor_error_detected(struct device *dev)
>  {
>  	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
> index 5dbc81341dc4..25f9512b57f7 100644
> --- a/drivers/pci/pcie/aer_cxl_vh.c
> +++ b/drivers/pci/pcie/aer_cxl_vh.c
> @@ -43,7 +43,10 @@ bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
>  	if (!info || !info->is_cxl)
>  		return false;
>  
> -	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT)
> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) &&
> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_UPSTREAM) &&
> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_DOWNSTREAM))
>  		return false;
>  
>  	return is_internal_error(info);


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 20/25] CXL/PCI: Introduce CXL Port protocol error handlers
  2025-11-04 21:20   ` Dave Jiang
@ 2025-11-04 21:27     ` Bowman, Terry
  2025-11-04 23:39       ` Dave Jiang
  0 siblings, 1 reply; 103+ messages in thread
From: Bowman, Terry @ 2025-11-04 21:27 UTC (permalink / raw)
  To: Dave Jiang, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci



On 11/4/2025 3:20 PM, Dave Jiang wrote:
>
> On 11/4/25 10:03 AM, Terry Bowman wrote:
>> Add CXL protocol error handlers for CXL Port devices (Root Ports,
>> Downstream Ports, and Upstream Ports). Implement cxl_port_cor_error_detected()
>> and cxl_port_error_detected() to handle correctable and uncorrectable errors
>> respectively.
>>
>> Introduce cxl_get_ras_base() to retrieve the cached RAS register base
>> address for a given CXL port. This function supports CXL Root Ports,
>> Downstream Ports, and Upstream Ports by returning their previously mapped
>> RAS register addresses.
>>
>> Add device lock assertions to protect against concurrent device or RAS
>> register removal during error handling. The port error handlers require
>> two device locks:
>>
>> 1. The port's CXL parent device - RAS registers are mapped using devm_*
>>    functions with the parent port as the host. Locking the parent prevents
>>    the RAS registers from being unmapped during error handling.
>>
>> 2. The PCI device (pdev->dev) - Locking prevents concurrent modifications
>>    to the PCI device structure during error handling.
>>
>> The lock assertions added here will be satisfied by device locks introduced
>> in a subsequent patch.
>>
>> Introduce get_pci_cxl_host_dev() to return the device responsible for
>> managing the RAS register mapping. This function increments the reference
>> count on the host device to prevent premature resource release during error
>> handling. The caller is responsible for decrementing the reference count.
>> For CXL endpoints, which manage resources without a separate host device,
>> this function returns NULL.
>>
>> Update the AER driver's is_cxl_error() to recognize CXL Port devices in
>> addition to CXL Endpoints, as both now have CXL-specific error handlers.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>
>> ---
>>
>> Changes in v12->v13:
>> - Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
>>   patch (Terry)
>> - Remove EP case in cxl_get_ras_base(), not used. (Terry)
>> - Remove check for dport->dport_dev (Dave)
>> - Remove whitespace (Terry)
>>
>> Changes in v11->v12:
>> - Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
>>   pci_to_cxl_dev()
>> - Change cxl_error_detected() -> cxl_cor_error_detected()
>> - Remove NULL variable assignments
>> - Replace bus_find_device() with find_cxl_port_by_uport() for upstream
>>   port searches.
>>
>> Changes in v10->v11:
>> - None
>> ---
>>  drivers/cxl/core/core.h       | 10 +++++++
>>  drivers/cxl/core/port.c       |  7 ++---
>>  drivers/cxl/core/ras.c        | 49 +++++++++++++++++++++++++++++++++++
>>  drivers/pci/pcie/aer_cxl_vh.c |  5 +++-
>>  4 files changed, 67 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
>> index b2c0ccd6803f..046ec65ed147 100644
>> --- a/drivers/cxl/core/core.h
>> +++ b/drivers/cxl/core/core.h
>> @@ -157,6 +157,8 @@ void cxl_cor_error_detected(struct device *dev);
>>  pci_ers_result_t pci_error_detected(struct pci_dev *pdev,
>>  				    pci_channel_state_t error);
>>  void pci_cor_error_detected(struct pci_dev *pdev);
>> +pci_ers_result_t cxl_port_error_detected(struct device *dev);
>> +void cxl_port_cor_error_detected(struct device *dev);
>>  #else
>>  static inline int cxl_ras_init(void)
>>  {
>> @@ -176,6 +178,11 @@ static inline pci_ers_result_t pci_error_detected(struct pci_dev *pdev,
>>  	return PCI_ERS_RESULT_NONE;
>>  }
>>  static inline void pci_cor_error_detected(struct pci_dev *pdev) { }
>> +static inline void cxl_port_cor_error_detected(struct device *dev) { }
>> +static inline pci_ers_result_t cxl_port_error_detected(struct device *dev)
>> +{
>> +	return PCI_ERS_RESULT_NONE;
>> +}
>>  #endif /* CONFIG_CXL_RAS */
>>  
>>  /* Restricted CXL Host specific RAS functions */
>> @@ -190,6 +197,9 @@ static inline void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds) { }
>>  #endif /* CONFIG_CXL_RCH_RAS */
>>  
>>  int cxl_gpf_port_setup(struct cxl_dport *dport);
>> +struct cxl_port *find_cxl_port(struct device *dport_dev,
>> +			       struct cxl_dport **dport);
>> +struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev);
>>  
>>  struct cxl_hdm;
>>  int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
>> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
>> index b70e1b505b5c..d060f864cf2e 100644
>> --- a/drivers/cxl/core/port.c
>> +++ b/drivers/cxl/core/port.c
>> @@ -1360,8 +1360,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
>>  	return NULL;
>>  }
>>  
>> -static struct cxl_port *find_cxl_port(struct device *dport_dev,
>> -				      struct cxl_dport **dport)
>> +struct cxl_port *find_cxl_port(struct device *dport_dev,
>> +			       struct cxl_dport **dport)
>>  {
>>  	struct cxl_find_port_ctx ctx = {
>>  		.dport_dev = dport_dev,
>> @@ -1564,7 +1564,7 @@ static int match_port_by_uport(struct device *dev, const void *data)
>>   * Function takes a device reference on the port device. Caller should do a
>>   * put_device() when done.
>>   */
>> -static struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
>> +struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
>>  {
>>  	struct device *dev;
>>  
>> @@ -1573,6 +1573,7 @@ static struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
>>  		return to_cxl_port(dev);
>>  	return NULL;
>>  }
>> +EXPORT_SYMBOL_NS_GPL(find_cxl_port_by_uport, "CXL");
>>  
>>  static int update_decoder_targets(struct device *dev, void *data)
>>  {
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index beb142054bda..142ca8794107 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
>> @@ -145,6 +145,39 @@ static void cxl_dport_map_ras(struct cxl_dport *dport)
>>  		dev_dbg(dev, "Failed to map RAS capability.\n");
>>  }
>>  
>> +static void __iomem *cxl_get_ras_base(struct device *dev)
>> +{
>> +	struct pci_dev *pdev = to_pci_dev(dev);
>> +
>> +	switch (pci_pcie_type(pdev)) {
>> +	case PCI_EXP_TYPE_ROOT_PORT:
>> +	case PCI_EXP_TYPE_DOWNSTREAM:
>> +	{
>> +		struct cxl_dport *dport;
>> +		struct cxl_port *port __free(put_cxl_port) = find_cxl_port(&pdev->dev, &dport);
>> +
>> +		if (!dport) {
>> +			pci_err(pdev, "Failed to find the CXL device");
>> +			return NULL;
>> +		}
>> +		return dport->regs.ras;
> The RAS MMIO mapping is done via devm_cxl_iomap_block() and is a devres against the device. Without holding the device lock, the port driver can unbind and the address mapping may go away before or while cxl_handle_cor_ras()/cxl_handle_ras() is called. I think you'll have to hold the port lock here and make sure that the port driver is bound before reading the RAS register? I think the dport RAS should be covered under the port umbrella.
>
>> +	}
>> +	case PCI_EXP_TYPE_UPSTREAM:
>> +	{
>> +		struct cxl_port *port __free(put_cxl_port) = find_cxl_port_by_uport(&pdev->dev);
>> +
>> +		if (!port) {
>> +			pci_err(pdev, "Failed to find the CXL device");
>> +			return NULL;
>> +		}
>> +		return port->uport_regs.ras;
> same here
>
> DJ
>> +	}


The cxl_port parent of the reported device is already locked before this is called. Locking for the CE case is added in the next patch, and the UCE locking is in patch 23. All of the locking is done as soon as possible after dequeueing.

Terry
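
A userspace sketch of the division of labor described in this exchange, under stated assumptions: the caller takes the lock right after dequeueing, and the lookup helper only asserts that the lock is held and checks that a driver is bound before handing out the mapping. The names toy_port, toy_get_ras_base and toy_handle_error are hypothetical; the assert() stands in for a kernel lock assertion such as device_lock_assert(). It assumes GCC or Clang for the cleanup attribute.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a CXL port; not a kernel type. */
struct toy_port {
	int locked;		/* 1 while the toy device lock is held */
	int bound;		/* 1 while a driver is bound */
	unsigned int *ras;	/* mapping; only trusted while bound */
};

static void toy_unlock(struct toy_port **p)
{
	(*p)->locked = 0;
}

/* Callee: never touches the mapping unless the caller holds the lock. */
static unsigned int *toy_get_ras_base(struct toy_port *port)
{
	assert(port->locked);	/* stands in for a lock assertion */
	return port->bound ? port->ras : NULL;
}

/* Caller: lock first, then look up and read the mapping under the lock. */
static int toy_handle_error(struct toy_port *port, unsigned int *status)
{
	port->locked = 1;
	struct toy_port *g __attribute__((cleanup(toy_unlock))) = port;
	unsigned int *ras = toy_get_ras_base(port);

	if (!ras)
		return -1;	/* driver unbound: the mapping may be gone */

	*status = *ras;		/* read happens while the lock is held */
	return 0;
}
```

Because the lookup and the register read both happen inside the caller's critical section, an unbind that needs the same lock cannot tear down the mapping between the two steps.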

>> +	}
>> +
>> +	dev_warn_once(dev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));
>> +	return NULL;
>> +}
>> +
>>  /**
>>   * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
>>   * @dport: the cxl_dport that needs to be initialized
>> @@ -254,6 +287,22 @@ pci_ers_result_t cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ra
>>  	return PCI_ERS_RESULT_PANIC;
>>  }
>>  
>> +void cxl_port_cor_error_detected(struct device *dev)
>> +{
>> +	void __iomem *ras_base = cxl_get_ras_base(dev);
>> +
>> +	cxl_handle_cor_ras(dev, 0, ras_base);
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_port_cor_error_detected, "CXL");
>> +
>> +pci_ers_result_t cxl_port_error_detected(struct device *dev)
>> +{
>> +	void __iomem *ras_base = cxl_get_ras_base(dev);
>> +
>> +	return cxl_handle_ras(dev, 0, ras_base);
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_port_error_detected, "CXL");
>> +
>>  void cxl_cor_error_detected(struct device *dev)
>>  {
>>  	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>> diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
>> index 5dbc81341dc4..25f9512b57f7 100644
>> --- a/drivers/pci/pcie/aer_cxl_vh.c
>> +++ b/drivers/pci/pcie/aer_cxl_vh.c
>> @@ -43,7 +43,10 @@ bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
>>  	if (!info || !info->is_cxl)
>>  		return false;
>>  
>> -	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT)
>> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
>> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) &&
>> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_UPSTREAM) &&
>> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_DOWNSTREAM))
>>  		return false;
>>  
>>  	return is_internal_error(info);


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging
  2025-11-04 19:11 ` [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Bjorn Helgaas
@ 2025-11-04 21:54   ` Bowman, Terry
  2025-11-04 22:12     ` Bjorn Helgaas
  0 siblings, 1 reply; 103+ messages in thread
From: Bowman, Terry @ 2025-11-04 21:54 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci, terry.bowman



On 11/4/2025 1:11 PM, Bjorn Helgaas wrote:
> On Tue, Nov 04, 2025 at 11:02:40AM -0600, Terry Bowman wrote:
>> This patchset updates CXL Protocol Error handling for CXL Ports and CXL
>> Endpoints (EP). Previous versions of this series can be found here:
>> https://lore.kernel.org/linux-cxl/20250925223440.3539069-1-terry.bowman@amd.com/
>> ...
>> Terry Bowman (24):
>>   CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
>>   PCI/CXL: Introduce pcie_is_cxl()
>>   cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
>>   cxl/pci: Remove unnecessary CXL RCH handling helper functions
>>   cxl: Move CXL driver's RCH error handling into core/ras_rch.c
>>   CXL/AER: Replace device_lock() in cxl_rch_handle_error_iter() with
>>     guard() lock
>>   CXL/AER: Move AER drivers RCH error handling into pcie/aer_cxl_rch.c
>>   PCI/AER: Report CXL or PCIe bus error type in trace logging
>>   cxl/pci: Update RAS handler interfaces to also support CXL Ports
>>   cxl/pci: Log message if RAS registers are unmapped
>>   cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
>>   cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
>>   cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
>>   CXL/PCI: Introduce PCI_ERS_RESULT_PANIC
>>   CXL/AER: Introduce pcie/aer_cxl_vh.c in AER driver for forwarding CXL
>>     errors
>>   cxl: Introduce cxl_pci_drv_bound() to check for bound driver
>>   cxl: Change CXL handlers to use guard() instead of scoped_guard()
>>   cxl/pci: Introduce CXL protocol error handlers for Endpoints
>>   CXL/PCI: Introduce CXL Port protocol error handlers
>>   PCI/AER: Dequeue forwarded CXL error
>>   CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
>>   CXL/PCI: Introduce CXL uncorrectable protocol error recovery
>>   CXL/PCI: Enable CXL protocol errors during CXL Port probe
>>   CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup
> Is the mix of "CXL/PCI" vs "cxl/pci" in the above telling me
> something, or should they all match?
>
> As a rule of thumb, I'm going to look at things that start with "PCI"
> and skip most of the rest on the assumption that the rest only have
> incidental effects on PCI.
I think there was logic behind the (un)capitalized prefixes, but I forget the reasoning. It's
better to keep it simple. I'll change them to use PCI/CXL and AER/CXL.

Terry

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging
  2025-11-04 21:54   ` Bowman, Terry
@ 2025-11-04 22:12     ` Bjorn Helgaas
  2025-12-04 17:30       ` Bowman, Terry
  0 siblings, 1 reply; 103+ messages in thread
From: Bjorn Helgaas @ 2025-11-04 22:12 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci

On Tue, Nov 04, 2025 at 03:54:21PM -0600, Bowman, Terry wrote:
> 
> 
> On 11/4/2025 1:11 PM, Bjorn Helgaas wrote:
> > On Tue, Nov 04, 2025 at 11:02:40AM -0600, Terry Bowman wrote:
> >> This patchset updates CXL Protocol Error handling for CXL Ports and CXL
> >> Endpoints (EP). Previous versions of this series can be found here:
> >> https://lore.kernel.org/linux-cxl/20250925223440.3539069-1-terry.bowman@amd.com/
> >> ...
> >> Terry Bowman (24):
> >>   CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
> >>   PCI/CXL: Introduce pcie_is_cxl()
> >>   cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
> >>   cxl/pci: Remove unnecessary CXL RCH handling helper functions
> >>   cxl: Move CXL driver's RCH error handling into core/ras_rch.c
> >>   CXL/AER: Replace device_lock() in cxl_rch_handle_error_iter() with
> >>     guard() lock
> >>   CXL/AER: Move AER drivers RCH error handling into pcie/aer_cxl_rch.c
> >>   PCI/AER: Report CXL or PCIe bus error type in trace logging
> >>   cxl/pci: Update RAS handler interfaces to also support CXL Ports
> >>   cxl/pci: Log message if RAS registers are unmapped
> >>   cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
> >>   cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
> >>   cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
> >>   CXL/PCI: Introduce PCI_ERS_RESULT_PANIC
> >>   CXL/AER: Introduce pcie/aer_cxl_vh.c in AER driver for forwarding CXL
> >>     errors
> >>   cxl: Introduce cxl_pci_drv_bound() to check for bound driver
> >>   cxl: Change CXL handlers to use guard() instead of scoped_guard()
> >>   cxl/pci: Introduce CXL protocol error handlers for Endpoints
> >>   CXL/PCI: Introduce CXL Port protocol error handlers
> >>   PCI/AER: Dequeue forwarded CXL error
> >>   CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
> >>   CXL/PCI: Introduce CXL uncorrectable protocol error recovery
> >>   CXL/PCI: Enable CXL protocol errors during CXL Port probe
> >>   CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup
> > Is the mix of "CXL/PCI" vs "cxl/pci" in the above telling me
> > something, or should they all match?
> >
> > As a rule of thumb, I'm going to look at things that start with "PCI"
> > and skip most of the rest on the assumption that the rest only have
> > incidental effects on PCI.
>
> I think there was logic behind the (un)capitalized but I forget the
> reasoning. It's better to keep it simple. I'll change to use
> PCI/CXL and AER/CXL.

I don't know what "AER/CXL" means.  I think "PCI" and "CXL" are the
big chunks here and one of them should be first in the prefix.

I do think there's value in using "PCI/AER" for things specific to AER
and "PCI/ERR" for more generic PCI error handling, and maybe "PCI/CXL"
for significant CXL-related things in drivers/pci/.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 20/25] CXL/PCI: Introduce CXL Port protocol error handlers
  2025-11-04 21:27     ` Bowman, Terry
@ 2025-11-04 23:39       ` Dave Jiang
  0 siblings, 0 replies; 103+ messages in thread
From: Dave Jiang @ 2025-11-04 23:39 UTC (permalink / raw)
  To: Bowman, Terry, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci



On 11/4/25 2:27 PM, Bowman, Terry wrote:
> 
> 
> On 11/4/2025 3:20 PM, Dave Jiang wrote:
>>
>> On 11/4/25 10:03 AM, Terry Bowman wrote:
>>> Add CXL protocol error handlers for CXL Port devices (Root Ports,
>>> Downstream Ports, and Upstream Ports). Implement cxl_port_cor_error_detected()
>>> and cxl_port_error_detected() to handle correctable and uncorrectable errors
>>> respectively.
>>>
>>> Introduce cxl_get_ras_base() to retrieve the cached RAS register base
>>> address for a given CXL port. This function supports CXL Root Ports,
>>> Downstream Ports, and Upstream Ports by returning their previously mapped
>>> RAS register addresses.
>>>
>>> Add device lock assertions to protect against concurrent device or RAS
>>> register removal during error handling. The port error handlers require
>>> two device locks:
>>>
>>> 1. The port's CXL parent device - RAS registers are mapped using devm_*
>>>    functions with the parent port as the host. Locking the parent prevents
>>>    the RAS registers from being unmapped during error handling.
>>>
>>> 2. The PCI device (pdev->dev) - Locking prevents concurrent modifications
>>>    to the PCI device structure during error handling.
>>>
>>> The lock assertions added here will be satisfied by device locks introduced
>>> in a subsequent patch.
>>>
>>> Introduce get_pci_cxl_host_dev() to return the device responsible for
>>> managing the RAS register mapping. This function increments the reference
>>> count on the host device to prevent premature resource release during error
>>> handling. The caller is responsible for decrementing the reference count.
>>> For CXL endpoints, which manage resources without a separate host device,
>>> this function returns NULL.
>>>
>>> Update the AER driver's is_cxl_error() to recognize CXL Port devices in
>>> addition to CXL Endpoints, as both now have CXL-specific error handlers.
>>>
>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>>>
>>> ---
>>>
>>> Changes in v12->v13:
>>> - Move get_pci_cxl_host_dev() and cxl_handle_proto_error() to Dequeue
>>>   patch (Terry)
>>> - Remove EP case in cxl_get_ras_base(), not used. (Terry)
>>> - Remove check for dport->dport_dev (Dave)
>>> - Remove whitespace (Terry)
>>>
>>> Changes in v11->v12:
>>> - Add call to cxl_pci_drv_bound() in cxl_handle_proto_error() and
>>>   pci_to_cxl_dev()
>>> - Change cxl_error_detected() -> cxl_cor_error_detected()
>>> - Remove NULL variable assignments
>>> - Replace bus_find_device() with find_cxl_port_by_uport() for upstream
>>>   port searches.
>>>
>>> Changes in v10->v11:
>>> - None
>>> ---
>>>  drivers/cxl/core/core.h       | 10 +++++++
>>>  drivers/cxl/core/port.c       |  7 ++---
>>>  drivers/cxl/core/ras.c        | 49 +++++++++++++++++++++++++++++++++++
>>>  drivers/pci/pcie/aer_cxl_vh.c |  5 +++-
>>>  4 files changed, 67 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
>>> index b2c0ccd6803f..046ec65ed147 100644
>>> --- a/drivers/cxl/core/core.h
>>> +++ b/drivers/cxl/core/core.h
>>> @@ -157,6 +157,8 @@ void cxl_cor_error_detected(struct device *dev);
>>>  pci_ers_result_t pci_error_detected(struct pci_dev *pdev,
>>>  				    pci_channel_state_t error);
>>>  void pci_cor_error_detected(struct pci_dev *pdev);
>>> +pci_ers_result_t cxl_port_error_detected(struct device *dev);
>>> +void cxl_port_cor_error_detected(struct device *dev);
>>>  #else
>>>  static inline int cxl_ras_init(void)
>>>  {
>>> @@ -176,6 +178,11 @@ static inline pci_ers_result_t pci_error_detected(struct pci_dev *pdev,
>>>  	return PCI_ERS_RESULT_NONE;
>>>  }
>>>  static inline void pci_cor_error_detected(struct pci_dev *pdev) { }
>>> +static inline void cxl_port_cor_error_detected(struct device *dev) { }
>>> +static inline pci_ers_result_t cxl_port_error_detected(struct device *dev)
>>> +{
>>> +	return PCI_ERS_RESULT_NONE;
>>> +}
>>>  #endif /* CONFIG_CXL_RAS */
>>>  
>>>  /* Restricted CXL Host specific RAS functions */
>>> @@ -190,6 +197,9 @@ static inline void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds) { }
>>>  #endif /* CONFIG_CXL_RCH_RAS */
>>>  
>>>  int cxl_gpf_port_setup(struct cxl_dport *dport);
>>> +struct cxl_port *find_cxl_port(struct device *dport_dev,
>>> +			       struct cxl_dport **dport);
>>> +struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev);
>>>  
>>>  struct cxl_hdm;
>>>  int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
>>> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
>>> index b70e1b505b5c..d060f864cf2e 100644
>>> --- a/drivers/cxl/core/port.c
>>> +++ b/drivers/cxl/core/port.c
>>> @@ -1360,8 +1360,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
>>>  	return NULL;
>>>  }
>>>  
>>> -static struct cxl_port *find_cxl_port(struct device *dport_dev,
>>> -				      struct cxl_dport **dport)
>>> +struct cxl_port *find_cxl_port(struct device *dport_dev,
>>> +			       struct cxl_dport **dport)
>>>  {
>>>  	struct cxl_find_port_ctx ctx = {
>>>  		.dport_dev = dport_dev,
>>> @@ -1564,7 +1564,7 @@ static int match_port_by_uport(struct device *dev, const void *data)
>>>   * Function takes a device reference on the port device. Caller should do a
>>>   * put_device() when done.
>>>   */
>>> -static struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
>>> +struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
>>>  {
>>>  	struct device *dev;
>>>  
>>> @@ -1573,6 +1573,7 @@ static struct cxl_port *find_cxl_port_by_uport(struct device *uport_dev)
>>>  		return to_cxl_port(dev);
>>>  	return NULL;
>>>  }
>>> +EXPORT_SYMBOL_NS_GPL(find_cxl_port_by_uport, "CXL");
>>>  
>>>  static int update_decoder_targets(struct device *dev, void *data)
>>>  {
>>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>>> index beb142054bda..142ca8794107 100644
>>> --- a/drivers/cxl/core/ras.c
>>> +++ b/drivers/cxl/core/ras.c
>>> @@ -145,6 +145,39 @@ static void cxl_dport_map_ras(struct cxl_dport *dport)
>>>  		dev_dbg(dev, "Failed to map RAS capability.\n");
>>>  }
>>>  
>>> +static void __iomem *cxl_get_ras_base(struct device *dev)
>>> +{
>>> +	struct pci_dev *pdev = to_pci_dev(dev);
>>> +
>>> +	switch (pci_pcie_type(pdev)) {
>>> +	case PCI_EXP_TYPE_ROOT_PORT:
>>> +	case PCI_EXP_TYPE_DOWNSTREAM:
>>> +	{
>>> +		struct cxl_dport *dport;
>>> +		struct cxl_port *port __free(put_cxl_port) = find_cxl_port(&pdev->dev, &dport);
>>> +
>>> +		if (!dport) {
>>> +			pci_err(pdev, "Failed to find the CXL device");
>>> +			return NULL;
>>> +		}
>>> +		return dport->regs.ras;
>> The RAS MMIO mapping is done via devm_cxl_iomap_block() and is a devres
>> against the device. Without holding the device lock, the port driver can
>> unbind and the address mapping may go away before or while
>> cxl_handle_cor_ras()/cxl_handle_ras() is called. I think you'll have to
>> hold the port lock here and make sure that the port driver is bound
>> before reading the RAS register? I think the dport RAS should be covered
>> under the port umbrella.
>>
>>> +	}
>>> +	case PCI_EXP_TYPE_UPSTREAM:
>>> +	{
>>> +		struct cxl_port *port __free(put_cxl_port) = find_cxl_port_by_uport(&pdev->dev);
>>> +
>>> +		if (!port) {
>>> +			pci_err(pdev, "Failed to find the CXL device");
>>> +			return NULL;
>>> +		}
>>> +		return port->uport_regs.ras;
>> same here
>>
>> DJ
>>> +	}
> 
> 
> The parent cxl_port of each reported device is already locked by the time
> these handlers run. Locking for the CE case is added in the next patch,
> and the UCE locking is in patch 23. All of the locking is done as soon as
> possible after dequeueing.

Ok I see them.

Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> 
> Terry
> 
>>> +	}
>>> +
>>> +	dev_warn_once(dev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));
>>> +	return NULL;
>>> +}
>>> +
>>>  /**
>>>   * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
>>>   * @dport: the cxl_dport that needs to be initialized
>>> @@ -254,6 +287,22 @@ pci_ers_result_t cxl_handle_ras(struct device *dev, u64 serial, void __iomem *ra
>>>  	return PCI_ERS_RESULT_PANIC;
>>>  }
>>>  
>>> +void cxl_port_cor_error_detected(struct device *dev)
>>> +{
>>> +	void __iomem *ras_base = cxl_get_ras_base(dev);
>>> +
>>> +	cxl_handle_cor_ras(dev, 0, ras_base);
>>> +}
>>> +EXPORT_SYMBOL_NS_GPL(cxl_port_cor_error_detected, "CXL");
>>> +
>>> +pci_ers_result_t cxl_port_error_detected(struct device *dev)
>>> +{
>>> +	void __iomem *ras_base = cxl_get_ras_base(dev);
>>> +
>>> +	return cxl_handle_ras(dev, 0, ras_base);
>>> +}
>>> +EXPORT_SYMBOL_NS_GPL(cxl_port_error_detected, "CXL");
>>> +
>>>  void cxl_cor_error_detected(struct device *dev)
>>>  {
>>>  	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>>> diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
>>> index 5dbc81341dc4..25f9512b57f7 100644
>>> --- a/drivers/pci/pcie/aer_cxl_vh.c
>>> +++ b/drivers/pci/pcie/aer_cxl_vh.c
>>> @@ -43,7 +43,10 @@ bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
>>>  	if (!info || !info->is_cxl)
>>>  		return false;
>>>  
>>> -	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT)
>>> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
>>> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_ROOT_PORT) &&
>>> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_UPSTREAM) &&
>>> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_DOWNSTREAM))
>>>  		return false;
>>>  
>>>  	return is_internal_error(info);
> 
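
For anyone following the locking discussion above, here is a minimal
userspace sketch of the pattern being asked for: take the port lock,
confirm the driver is still bound, and only then touch the mapping. The
toy_* names and the boolean "lock" are made up for illustration and are
not part of the patch; real code would use the device lock and a
driver-bound check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a CXL port; not a kernel structure. */
struct toy_port {
	bool locked;	/* models the port's device lock */
	bool bound;	/* models "port driver is bound" */
	uint32_t *ras;	/* models the devres-managed RAS mapping */
};

static void toy_port_lock(struct toy_port *p)
{
	assert(!p->locked);	/* single-threaded model: no nesting */
	p->locked = true;
}

static void toy_port_unlock(struct toy_port *p)
{
	assert(p->locked);
	p->locked = false;
}

/* Unbind drops the mapping, and does so under the same lock. */
static void toy_port_unbind(struct toy_port *p)
{
	toy_port_lock(p);
	p->bound = false;
	p->ras = NULL;
	toy_port_unlock(p);
}

/* Read the RAS status only while the lock pins the mapping in place. */
static bool toy_port_read_ras(struct toy_port *p, uint32_t *status)
{
	bool ok = false;

	toy_port_lock(p);
	if (p->bound && p->ras) {
		*status = *p->ras;
		ok = true;
	}
	toy_port_unlock(p);
	return ok;
}
```

Because unbind and the read both take the same lock, the read either sees
a valid mapping or reports failure; it never dereferences a mapping that
was torn down underneath it.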


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 23/25] CXL/PCI: Introduce CXL uncorrectable protocol error recovery
  2025-11-04 18:47   ` Jonathan Cameron
@ 2025-11-04 23:43     ` Dave Jiang
  2025-11-05 14:59       ` Bowman, Terry
  0 siblings, 1 reply; 103+ messages in thread
From: Dave Jiang @ 2025-11-04 23:43 UTC (permalink / raw)
  To: Jonathan Cameron, Terry Bowman
  Cc: dave, alison.schofield, dan.j.williams, bhelgaas, shiju.jose,
	ming.li, Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci



On 11/4/25 11:47 AM, Jonathan Cameron wrote:
> On Tue, 4 Nov 2025 11:03:03 -0600
> Terry Bowman <terry.bowman@amd.com> wrote:
> 
>> Implement cxl_do_recovery() to handle uncorrectable protocol
>> errors (UCE), following the design of pcie_do_recovery(). Unlike PCIe,
>> all CXL UCEs are treated as fatal and trigger a kernel panic to avoid
>> potential CXL memory corruption.
>>
>> Add cxl_walk_port(), analogous to pci_walk_bridge(), to traverse the
>> CXL topology from the error source through downstream CXL ports and
>> endpoints.
>>
>> Introduce cxl_report_error_detected(), mirroring PCI's
>> report_error_detected(), and implement device locking for the affected
>> subtree. Endpoints require locking the PCI device (pdev->dev) and the
>> CXL memdev (cxlmd->dev). CXL ports require locking the PCI
>> device (pdev->dev) and the parent CXL port.
>>
>> The device locks should be taken early where possible. The initially
>> reporting device is locked right after the kfifo dequeue. Devices
>> visited during the walk are locked in cxl_report_error_detected(),
>> except for the first device, which has already been locked.
>>
>> Export pci_aer_clear_fatal_status() for use when a UCE is not present.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> 
> Follow on comments around the locking stuff. If that has been there
> a while and I didn't notice before, sorry!
> 
>>
>> ---
>>
>> Changes in v12->v13:
>> - Add guard() before calling cxl_pci_drv_bound() (Dave Jiang)
>> - Add guard() calls for EP (cxlds->cxlmd->dev & pdev->dev) and ports
>>   (pdev->dev & parent cxl_port) in cxl_report_error_detected() and
>>   cxl_handle_proto_error() (Terry)
>> - Remove unnecessary check for endpoint port. (Dave Jiang)
>> - Remove check for RCIEP EP in cxl_report_error_detected(). (Terry)
>>
>> Changes in v11->v12:
>> - Clean up port discovery in cxl_do_recovery() (Dave)
>> - Add PCI_EXP_TYPE_RC_END to type check in cxl_report_error_detected()
>>
>> Changes in v10->v11:
>> - pci_ers_merge_results() - Move to earlier patch
>> ---
>>  drivers/cxl/core/ras.c | 135 ++++++++++++++++++++++++++++++++++++++++-
>>  drivers/pci/pci.h      |   1 -
>>  drivers/pci/pcie/aer.c |   1 +
>>  include/linux/aer.h    |   2 +
>>  4 files changed, 135 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index 5bc144cde0ee..52c6f19564b6 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
>> @@ -259,8 +259,138 @@ static void device_unlock_if(struct device *dev, bool take)
>>  		device_unlock(dev);
>>  }
>>  
>> +/**
>> + * cxl_report_error_detected
>> + * @dev: Device being reported
>> + * @data: Result
>> + * @err_pdev: Device with initial detected error. Is locked immediately
>> + *            after KFIFO dequeue.
>> + */
>> +static int cxl_report_error_detected(struct device *dev, void *data, struct pci_dev *err_pdev)
>> +{
>> +	bool need_lock = (dev != &err_pdev->dev);
> 
> Add a comment on why this controls need for locking.
> The resulting code is complex enough I'd be tempted to split the whole
> thing into locked and unlocked variants.

May not be a bad idea. Terry, can you see if this would reduce the complexity?

DJ 

> 
>> +	pci_ers_result_t vote, *result = data;
>> +	struct pci_dev *pdev;
>> +
>> +	if (!dev || !dev_is_pci(dev))
>> +		return 0;
>> +	pdev = to_pci_dev(dev);
>> +
>> +	device_lock_if(&pdev->dev, need_lock);
>> +	if (is_pcie_endpoint(pdev) && !cxl_pci_drv_bound(pdev)) {
>> +		device_unlock_if(&pdev->dev, need_lock);
>> +		return PCI_ERS_RESULT_NONE;
>> +	}
>> +
>> +	if (pdev->aer_cap)
>> +		pci_clear_and_set_config_dword(pdev,
>> +					       pdev->aer_cap + PCI_ERR_COR_STATUS,
>> +					       0, PCI_ERR_COR_INTERNAL);
>> +
>> +	if (is_pcie_endpoint(pdev)) {
>> +		struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>> +
>> +		device_lock_if(&cxlds->cxlmd->dev, need_lock);
>> +		vote = cxl_error_detected(&cxlds->cxlmd->dev);
>> +		device_unlock_if(&cxlds->cxlmd->dev, need_lock);
>> +	} else {
>> +		vote = cxl_port_error_detected(dev);
>> +	}
>> +
>> +	pcie_clear_device_status(pdev);
>> +	*result = pcie_ers_merge_result(*result, vote);
>> +	device_unlock_if(&pdev->dev, need_lock);
>> +
>> +	return 0;
>> +}
> 
>> +
>> +/**
>> + * cxl_walk_port
> Needs a short description I think to count as valid kernel-doc and
> stop the tool moaning if anyone runs it on this.
> 
>> + *
>> + * @port: Port to be traversed into
>> + * @cb: Callback for handling the CXL Ports
>> + * @userdata: Result
>> + * @err_pdev: Device with initial detected error. Is locked immediately
>> + *            after KFIFO dequeue.
>> + */
>> +static void cxl_walk_port(struct cxl_port *port,
>> +			  int (*cb)(struct device *, void *, struct pci_dev *),
>> +			  void *userdata,
>> +			  struct pci_dev *err_pdev)
>> +{
>> +	struct cxl_port *err_port __free(put_cxl_port) = get_cxl_port(err_pdev);
>> +	bool need_lock = (port != err_port);
>> +	struct cxl_dport *dport = NULL;
>> +	unsigned long index;
>> +
>> +	device_lock_if(&port->dev, need_lock);
>> +	if (is_cxl_endpoint(port)) {
>> +		cb(port->uport_dev->parent, userdata, err_pdev);
>> +		device_unlock_if(&port->dev, need_lock);
>> +		return;
>> +	}
>> +
>> +	if (port->uport_dev && dev_is_pci(port->uport_dev))
>> +		cb(port->uport_dev, userdata, err_pdev);
>> +
>> +	/*
>> +	 * Iterate over the set of Downstream Ports recorded in port->dports (XArray):
>> +	 *  - For each dport, attempt to find a child CXL Port whose parent dport
>> +	 *    matches.
>> +	 *  - Invoke the provided callback on the dport's device.
>> +	 *  - If a matching child CXL Port device is found, recurse into that port to
>> +	 *    continue the walk.
>> +	 */
>> +	xa_for_each(&port->dports, index, dport)
>> +	{
> 
> Move that to line above for normal kernel loop formatting.
> 
> 	xa_for_each(&port->dports, index, dport) {
> 
>> +		struct device *child_port_dev __free(put_device) =
>> +			bus_find_device(&cxl_bus_type, &port->dev, dport->dport_dev,
>> +					match_port_by_parent_dport);
>> +
>> +		cb(dport->dport_dev, userdata, err_pdev);
>> +		if (child_port_dev)
>> +			cxl_walk_port(to_cxl_port(child_port_dev), cb, userdata, err_pdev);
>> +	}
>> +	device_unlock_if(&port->dev, need_lock);
>> +}
>> +
> 
>>  
>>  void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
>> @@ -483,16 +613,15 @@ static void cxl_proto_err_work_fn(struct work_struct *work)
>>  			if (!cxl_pci_drv_bound(pdev))
>>  				return;
>>  			cxlmd_dev = &cxlds->cxlmd->dev;
>> -			device_lock_if(cxlmd_dev, cxlmd_dev);
>>  		} else {
>>  			cxlmd_dev = NULL;
>>  		}
>>  
>> +		/* Lock the CXL parent Port */
>>  		struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
>> -		if (!port)
>> -			return;
>>  		guard(device)(&port->dev);
>>  
>> +		device_lock_if(cxlmd_dev, cxlmd_dev);
>>  		cxl_handle_proto_error(&wd);
>>  		device_unlock_if(cxlmd_dev, cxlmd_dev);
> Same issue on these helpers, but I'm also not sure why moving them in this
> patch makes sense. I'm not sure what changed.
> 
> Perhaps this is stuff that ended up in wrong patch?
>>  	}
> 
> 
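
As a rough illustration of the walk described in the commit message
(visit the error source, then every port and endpoint below it, merging
each vote into one result), here is a small userspace sketch. All names
are hypothetical, and the "most severe vote wins" merge rule is an
assumption made for the example; it is not the rule implemented by
pci_ers_merge_result().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes; a higher value means more severe. */
enum toy_result { TOY_NONE = 0, TOY_RECOVERED = 1, TOY_PANIC = 2 };

#define TOY_MAX_CHILDREN 4

/* Hypothetical topology node standing in for a port or endpoint. */
struct toy_node {
	enum toy_result vote;
	size_t nr_children;
	struct toy_node *children[TOY_MAX_CHILDREN];
};

typedef void (*toy_cb)(struct toy_node *node, void *userdata);

/* Assumed merge rule for this sketch: keep the more severe result. */
static enum toy_result toy_merge(enum toy_result orig, enum toy_result vote)
{
	return vote > orig ? vote : orig;
}

/* Callback: fold this node's vote into the running result. */
static void toy_report(struct toy_node *node, void *userdata)
{
	enum toy_result *result = userdata;

	*result = toy_merge(*result, node->vote);
}

/* Depth-first walk: report the node itself, then recurse into children. */
static void toy_walk(struct toy_node *node, toy_cb cb, void *userdata)
{
	assert(node->nr_children <= TOY_MAX_CHILDREN);

	cb(node, userdata);
	for (size_t i = 0; i < node->nr_children; i++)
		toy_walk(node->children[i], cb, userdata);
}
```

Passing the result through the opaque userdata pointer mirrors the
callback signature in the patch, where @data carries the running
pci_ers_result_t.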


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 13/25] cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
  2025-11-04 17:02 ` [RESEND v13 13/25] cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors Terry Bowman
@ 2025-11-05  8:30   ` Alejandro Lucero Palau
  2025-11-19 22:00   ` dan.j.williams
  1 sibling, 0 replies; 103+ messages in thread
From: Alejandro Lucero Palau @ 2025-11-05  8:30 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, ira.weiny
  Cc: linux-kernel, linux-pci


On 11/4/25 17:02, Terry Bowman wrote:
> Update cxl_handle_cor_ras() to exit early when no RAS errors are
> detected after applying the status mask. This change will make
> the correctable handler's implementation consistent with the uncorrectable
> handler, cxl_handle_ras().
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>


Reviewed-by: Alejandro Lucero <alucerop@amd.com>


> ---
>
> Changes v12->v13:
> - Added Ben's review-by
>
> Changes v11->v12:
> - None
>
> Changes v10->v11:
> - Added Dave Jiang and Jonathan Cameron's review-by
> - Changes moved to core/ras.c
> ---
>   drivers/cxl/core/ras.c | 9 +++++----
>   1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 599c88f0b376..246dfe56617a 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -174,10 +174,11 @@ void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
>   
>   	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>   	status = readl(addr);
> -	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
> -		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> -		trace_cxl_aer_correctable_error(dev, status, serial);
> -	}
> +	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
> +		return;
> +	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> +
> +	trace_cxl_aer_correctable_error(dev, status, serial);
>   }
>   
>   /* CXL spec rev3.0 8.2.4.16.1 */
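
The early-return shape in the diff above can be tried out in userspace
with a mocked register. This is only a sketch: the toy_* names, the mask
value, and the write-1-to-clear behaviour of the mocked register are
assumptions made for the example and are not taken from the patch or the
kernel headers.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder mask for the sketch; not taken from the kernel headers. */
#define TOY_COR_STATUS_MASK 0x7fu

struct toy_ras {
	uint32_t cor_status;	/* mocked correctable status register */
	unsigned int traces;	/* how many trace events were emitted */
};

/* Assumed write-1-to-clear behaviour for the mocked register. */
static void toy_writel(struct toy_ras *ras, uint32_t val)
{
	ras->cor_status &= ~val;
}

static void toy_handle_cor_ras(struct toy_ras *ras)
{
	uint32_t status;

	assert(ras);
	status = ras->cor_status;
	if (!(status & TOY_COR_STATUS_MASK))
		return;		/* nothing logged: no write, no trace */

	toy_writel(ras, status & TOY_COR_STATUS_MASK);
	ras->traces++;		/* stands in for the trace event */
}
```

With the guard clause first, the clear and the trace sit at the same
indentation level as the rest of the function, which is the consistency
with the uncorrectable handler that the commit message is after.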

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 23/25] CXL/PCI: Introduce CXL uncorrectable protocol error recovery
  2025-11-04 23:43     ` Dave Jiang
@ 2025-11-05 14:59       ` Bowman, Terry
  2025-11-05 16:10         ` Dave Jiang
  0 siblings, 1 reply; 103+ messages in thread
From: Bowman, Terry @ 2025-11-05 14:59 UTC (permalink / raw)
  To: Dave Jiang, Jonathan Cameron
  Cc: dave, alison.schofield, dan.j.williams, bhelgaas, shiju.jose,
	ming.li, Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci



On 11/4/2025 5:43 PM, Dave Jiang wrote:
>
> On 11/4/25 11:47 AM, Jonathan Cameron wrote:
>> On Tue, 4 Nov 2025 11:03:03 -0600
>> Terry Bowman <terry.bowman@amd.com> wrote:
>>
>>> Implement cxl_do_recovery() to handle uncorrectable protocol
>>> errors (UCE), following the design of pcie_do_recovery(). Unlike PCIe,
>>> all CXL UCEs are treated as fatal and trigger a kernel panic to avoid
>>> potential CXL memory corruption.
>>>
>>> Add cxl_walk_port(), analogous to pci_walk_bridge(), to traverse the
>>> CXL topology from the error source through downstream CXL ports and
>>> endpoints.
>>>
>>> Introduce cxl_report_error_detected(), mirroring PCI's
>>> report_error_detected(), and implement device locking for the affected
>>> subtree. Endpoints require locking the PCI device (pdev->dev) and the
>>> CXL memdev (cxlmd->dev). CXL ports require locking the PCI
>>> device (pdev->dev) and the parent CXL port.
>>>
>>> The device locks should be taken early where possible. The initially
>>> reporting device is locked right after the kfifo dequeue. Devices
>>> visited during the walk are locked in cxl_report_error_detected(),
>>> except for the first device, which has already been locked.
>>>
>>> Export pci_aer_clear_fatal_status() for use when a UCE is not present.
>>>
>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Follow on comments around the locking stuff. If that has been there
>> a while and I didn't notice before, sorry!
>>
>>> ---
>>>
>>> Changes in v12->v13:
>>> - Add guard() before calling cxl_pci_drv_bound() (Dave Jiang)
>>> - Add guard() calls for EP (cxlds->cxlmd->dev & pdev->dev) and ports
>>>   (pdev->dev & parent cxl_port) in cxl_report_error_detected() and
>>>   cxl_handle_proto_error() (Terry)
>>> - Remove unnecessary check for endpoint port. (Dave Jiang)
>>> - Remove check for RCIEP EP in cxl_report_error_detected(). (Terry)
>>>
>>> Changes in v11->v12:
>>> - Clean up port discovery in cxl_do_recovery() (Dave)
>>> - Add PCI_EXP_TYPE_RC_END to type check in cxl_report_error_detected()
>>>
>>> Changes in v10->v11:
>>> - pci_ers_merge_results() - Move to earlier patch
>>> ---
>>>  drivers/cxl/core/ras.c | 135 ++++++++++++++++++++++++++++++++++++++++-
>>>  drivers/pci/pci.h      |   1 -
>>>  drivers/pci/pcie/aer.c |   1 +
>>>  include/linux/aer.h    |   2 +
>>>  4 files changed, 135 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>>> index 5bc144cde0ee..52c6f19564b6 100644
>>> --- a/drivers/cxl/core/ras.c
>>> +++ b/drivers/cxl/core/ras.c
>>> @@ -259,8 +259,138 @@ static void device_unlock_if(struct device *dev, bool take)
>>>  		device_unlock(dev);
>>>  }
>>>  
>>> +/**
>>> + * cxl_report_error_detected
>>> + * @dev: Device being reported
>>> + * @data: Result
>>> + * @err_pdev: Device with initial detected error. Is locked immediately
>>> + *            after KFIFO dequeue.
>>> + */
>>> +static int cxl_report_error_detected(struct device *dev, void *data, struct pci_dev *err_pdev)
>>> +{
>>> +	bool need_lock = (dev != &err_pdev->dev);
>> Add a comment on why this controls need for locking.
>> The resulting code is complex enough I'd be tempted to split the whole
>> thing into locked and unlocked variants.
> May not be a bad idea. Terry, can you see if this would reduce the complexity?
>
> DJ 

I agree and will split this into two functions. Do you have a naming suggestion for the
variant without locks? Is cxl_report_error_detected_nolock() OK to go alongside the
existing cxl_report_error_detected()?

Terry

>>> +	pci_ers_result_t vote, *result = data;
>>> +	struct pci_dev *pdev;
>>> +
>>> +	if (!dev || !dev_is_pci(dev))
>>> +		return 0;
>>> +	pdev = to_pci_dev(dev);
>>> +
>>> +	device_lock_if(&pdev->dev, need_lock);
>>> +	if (is_pcie_endpoint(pdev) && !cxl_pci_drv_bound(pdev)) {
>>> +		device_unlock_if(&pdev->dev, need_lock);
>>> +		return PCI_ERS_RESULT_NONE;
>>> +	}
>>> +
>>> +	if (pdev->aer_cap)
>>> +		pci_clear_and_set_config_dword(pdev,
>>> +					       pdev->aer_cap + PCI_ERR_COR_STATUS,
>>> +					       0, PCI_ERR_COR_INTERNAL);
>>> +
>>> +	if (is_pcie_endpoint(pdev)) {
>>> +		struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>>> +
>>> +		device_lock_if(&cxlds->cxlmd->dev, need_lock);
>>> +		vote = cxl_error_detected(&cxlds->cxlmd->dev);
>>> +		device_unlock_if(&cxlds->cxlmd->dev, need_lock);
>>> +	} else {
>>> +		vote = cxl_port_error_detected(dev);
>>> +	}
>>> +
>>> +	pcie_clear_device_status(pdev);
>>> +	*result = pcie_ers_merge_result(*result, vote);
>>> +	device_unlock_if(&pdev->dev, need_lock);
>>> +
>>> +	return 0;
>>> +}
>>> +
>>> +/**
>>> + * cxl_walk_port
>> Needs a short description I think to count as valid kernel-doc and
>> stop the tool moaning if anyone runs it on this.
>>
>>> + *
>>> + * @port: Port to be traversed into
>>> + * @cb: Callback for handling the CXL Ports
>>> + * @userdata: Result
>>> + * @err_pdev: Device with initial detected error. Is locked immediately
>>> + *            after KFIFO dequeue.
>>> + */
>>> +static void cxl_walk_port(struct cxl_port *port,
>>> +			  int (*cb)(struct device *, void *, struct pci_dev *),
>>> +			  void *userdata,
>>> +			  struct pci_dev *err_pdev)
>>> +{
>>> +	struct cxl_port *err_port __free(put_cxl_port) = get_cxl_port(err_pdev);
>>> +	bool need_lock = (port != err_port);
>>> +	struct cxl_dport *dport = NULL;
>>> +	unsigned long index;
>>> +
>>> +	device_lock_if(&port->dev, need_lock);
>>> +	if (is_cxl_endpoint(port)) {
>>> +		cb(port->uport_dev->parent, userdata, err_pdev);
>>> +		device_unlock_if(&port->dev, need_lock);
>>> +		return;
>>> +	}
>>> +
>>> +	if (port->uport_dev && dev_is_pci(port->uport_dev))
>>> +		cb(port->uport_dev, userdata, err_pdev);
>>> +
>>> +	/*
>>> +	 * Iterate over the set of Downstream Ports recorded in port->dports (XArray):
>>> +	 *  - For each dport, attempt to find a child CXL Port whose parent dport
>>> +	 *    matches.
>>> +	 *  - Invoke the provided callback on the dport's device.
>>> +	 *  - If a matching child CXL Port device is found, recurse into that port to
>>> +	 *    continue the walk.
>>> +	 */
>>> +	xa_for_each(&port->dports, index, dport)
>>> +	{
>> Move that to line above for normal kernel loop formatting.
>>
>> 	xa_for_each(&port->dports, index, dport) {
>>
>>> +		struct device *child_port_dev __free(put_device) =
>>> +			bus_find_device(&cxl_bus_type, &port->dev, dport->dport_dev,
>>> +					match_port_by_parent_dport);
>>> +
>>> +		cb(dport->dport_dev, userdata, err_pdev);
>>> +		if (child_port_dev)
>>> +			cxl_walk_port(to_cxl_port(child_port_dev), cb, userdata, err_pdev);
>>> +	}
>>> +	device_unlock_if(&port->dev, need_lock);
>>> +}
>>> +
>>>  
>>>  void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
>>> @@ -483,16 +613,15 @@ static void cxl_proto_err_work_fn(struct work_struct *work)
>>>  			if (!cxl_pci_drv_bound(pdev))
>>>  				return;
>>>  			cxlmd_dev = &cxlds->cxlmd->dev;
>>> -			device_lock_if(cxlmd_dev, cxlmd_dev);
>>>  		} else {
>>>  			cxlmd_dev = NULL;
>>>  		}
>>>  
>>> +		/* Lock the CXL parent Port */
>>>  		struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
>>> -		if (!port)
>>> -			return;
>>>  		guard(device)(&port->dev);
>>>  
>>> +		device_lock_if(cxlmd_dev, cxlmd_dev);
>>>  		cxl_handle_proto_error(&wd);
>>>  		device_unlock_if(cxlmd_dev, cxlmd_dev);
>> Same issue on these helpers, but I'm also not sure why moving them in this
>> patch makes sense. I'm not sure what changed.
>>
>> Perhaps this is stuff that ended up in wrong patch?
>>>  	}
>>


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 23/25] CXL/PCI: Introduce CXL uncorrectable protocol error recovery
  2025-11-05 14:59       ` Bowman, Terry
@ 2025-11-05 16:10         ` Dave Jiang
  0 siblings, 0 replies; 103+ messages in thread
From: Dave Jiang @ 2025-11-05 16:10 UTC (permalink / raw)
  To: Bowman, Terry, Jonathan Cameron
  Cc: dave, alison.schofield, dan.j.williams, bhelgaas, shiju.jose,
	ming.li, Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci



On 11/5/25 7:59 AM, Bowman, Terry wrote:
> 
> 
> On 11/4/2025 5:43 PM, Dave Jiang wrote:
>>
>> On 11/4/25 11:47 AM, Jonathan Cameron wrote:
>>> On Tue, 4 Nov 2025 11:03:03 -0600
>>> Terry Bowman <terry.bowman@amd.com> wrote:

<snip>

>>>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>>>> index 5bc144cde0ee..52c6f19564b6 100644
>>>> --- a/drivers/cxl/core/ras.c
>>>> +++ b/drivers/cxl/core/ras.c
>>>> @@ -259,8 +259,138 @@ static void device_unlock_if(struct device *dev, bool take)
>>>>  		device_unlock(dev);
>>>>  }
>>>>  
>>>> +/**
>>>> + * cxl_report_error_detected
>>>> + * @dev: Device being reported
>>>> + * @data: Result
>>>> + * @err_pdev: Device with initial detected error. Is locked immediately
>>>> + *            after KFIFO dequeue.
>>>> + */
>>>> +static int cxl_report_error_detected(struct device *dev, void *data, struct pci_dev *err_pdev)
>>>> +{
>>>> +	bool need_lock = (dev != &err_pdev->dev);
>>> Add a comment on why this controls need for locking.
>>> The resulting code is complex enough I'd be tempted to split the whole
>>> thing into locked and unlocked variants.
>> May not be a bad idea. Terry, can you see if this would reduce the complexity?
>>
>> DJ 
> 
> I agree and will split into 2 functions. Do you have naming suggestions for a function copy 
> without locks? Is cxl_report_error_detected_nolock() OK to go along with existing 
> cxl_report_error_detected()? 

Maybe cxl_report_error_detected_lock() vs cxl_report_error_detected().
I think there's also precedent for __cxl_report_error_detected() as the raw
function that takes no lock, with cxl_report_error_detected() as the variant
that takes the lock.
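
A rough user-space sketch of that split, for illustration only. Every
demo_* name is hypothetical, and the toy lock is a stand-in for
device_lock(), not the kernel API. The raw helper corresponds to the
"__" flavour and assumes the caller already holds the lock; the plain
name takes the lock itself. The walk picks the raw helper only for the
device that was locked before the walk started.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device with a toy, non-recursive lock (not a kernel type). */
struct demo_dev {
	bool locked;
	int reports;
};

void demo_lock(struct demo_dev *dev)
{
	assert(!dev->locked);	/* taking it twice would self-deadlock */
	dev->locked = true;
}

void demo_unlock(struct demo_dev *dev)
{
	assert(dev->locked);
	dev->locked = false;
}

/* Raw variant (the "__" flavour): caller must already hold dev's lock. */
void demo_report_error_raw(struct demo_dev *dev)
{
	assert(dev->locked);
	dev->reports++;
}

/* Locked variant: takes the lock around the raw helper. */
void demo_report_error(struct demo_dev *dev)
{
	demo_lock(dev);
	demo_report_error_raw(dev);
	demo_unlock(dev);
}

/* Walk all devices; err_dev was locked by the caller before the walk. */
void demo_report_all(struct demo_dev **devs, size_t n, struct demo_dev *err_dev)
{
	for (size_t i = 0; i < n; i++) {
		if (devs[i] == err_dev)
			demo_report_error_raw(devs[i]);
		else
			demo_report_error(devs[i]);
	}
}
```

With two functions the "who holds the lock" decision sits at the call
site instead of in a need_lock flag threaded through the callee.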

DJ

> 
> Terry



^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 17/25] cxl: Introduce cxl_pci_drv_bound() to check for bound driver
  2025-11-04 17:02 ` [RESEND v13 17/25] cxl: Introduce cxl_pci_drv_bound() to check for bound driver Terry Bowman
@ 2025-11-05 17:51   ` Gregory Price
  2025-11-05 19:03     ` Gregory Price
  2025-11-11  8:33   ` Alison Schofield
  2025-11-20  1:24   ` dan.j.williams
  2 siblings, 1 reply; 103+ messages in thread
From: Gregory Price @ 2025-11-05 17:51 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci

On Tue, Nov 04, 2025 at 11:02:57AM -0600, Terry Bowman wrote:
> CXL devices handle protocol errors via driver-specific callbacks rather
> than the generic pci_driver::err_handlers by default. The callbacks are
> implemented in the cxl_pci driver and are not part of struct pci_driver, so
> cxl_core must verify that a device is actually bound to the cxl_pci
> module's driver before invoking the callbacks (the device could be bound
> to another driver, e.g. VFIO).
> 
> However, cxl_core can not reference symbols in the cxl_pci module because
> it creates a circular dependency. This prevents cxl_core from checking the
> EP's bound driver and calling the callbacks.
> 
> To fix this, move drivers/cxl/pci.c into drivers/cxl/core/pci_drv.c and
> build it as part of the cxl_core module. Compile into cxl_core using
> CXL_PCI and CXL_CORE Kconfig dependencies. This removes the standalone
> cxl_pci module, consolidates the cxl_pci driver code into cxl_core, and
> eliminates the circular dependency so cxl_core can safely perform
> bound-driver checks and invoke the CXL PCI callbacks.
> 
> Introduce cxl_pci_drv_bound() to return boolean depending on if the PCI EP
> parameter is bound to a CXL driver instance. This will be used in future
> patch when dequeuing work from the kfifo.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> 
> ---

This commit causes my QEMU basic expander setup and a real device setup
to fail to probe the cxl_core driver.

[    2.697094] cxl_core 0000:0d:00.0: BAR 0 [mem 0xfe800000-0xfe80ffff 64bit]: not claimed; can't enable device
[    2.697098] cxl_core 0000:0d:00.0: probe with driver cxl_core failed with error -22

Probe order issue when CXL drivers are built-in maybe?

~Gregory

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 17/25] cxl: Introduce cxl_pci_drv_bound() to check for bound driver
  2025-11-05 17:51   ` Gregory Price
@ 2025-11-05 19:03     ` Gregory Price
  2025-11-05 22:26       ` Gregory Price
  0 siblings, 1 reply; 103+ messages in thread
From: Gregory Price @ 2025-11-05 19:03 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci

On Wed, Nov 05, 2025 at 12:51:04PM -0500, Gregory Price wrote:
> On Tue, Nov 04, 2025 at 11:02:57AM -0600, Terry Bowman wrote:
> > CXL devices handle protocol errors via driver-specific callbacks rather
> > than the generic pci_driver::err_handlers by default. The callbacks are
> > implemented in the cxl_pci driver and are not part of struct pci_driver, so
> > cxl_core must verify that a device is actually bound to the cxl_pci
> > module's driver before invoking the callbacks (the device could be bound
> > to another driver, e.g. VFIO).
> > 
> > However, cxl_core can not reference symbols in the cxl_pci module because
> > it creates a circular dependency. This prevents cxl_core from checking the
> > EP's bound driver and calling the callbacks.
> > 
> > To fix this, move drivers/cxl/pci.c into drivers/cxl/core/pci_drv.c and
> > build it as part of the cxl_core module. Compile into cxl_core using
> > CXL_PCI and CXL_CORE Kconfig dependencies. This removes the standalone
> > cxl_pci module, consolidates the cxl_pci driver code into cxl_core, and
> > eliminates the circular dependency so cxl_core can safely perform
> > bound-driver checks and invoke the CXL PCI callbacks.
> > 
> > Introduce cxl_pci_drv_bound() to return boolean depending on if the PCI EP
> > parameter is bound to a CXL driver instance. This will be used in future
> > patch when dequeuing work from the kfifo.
> > 
> > Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> > Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> > Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
> > Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> > 
> > ---
> 
> This commit causes my QEMU basic expander setup and a real device setup
> to fail to probe the cxl_core driver.
> 
> [    2.697094] cxl_core 0000:0d:00.0: BAR 0 [mem 0xfe800000-0xfe80ffff 64bit]: not claimed; can't enable device
> [    2.697098] cxl_core 0000:0d:00.0: probe with driver cxl_core failed with error -22
> 
> Probe order issue when CXL drivers are built-in maybe?
> 

I've narrowed it down to:

Works
-----
CONFIG_CXL_BUS=m
CONFIG_CXL_MEM=m

Fails
-----
CONFIG_CXL_BUS=y
CONFIG_CXL_MEM=y
or BUS ^ MEM

This commit moves pci.o to pci_drv.o and builds it into cxl_core, which
puts it ahead of cxl_mem, but note the comment in the Makefile:

# Order is important here for the built-in case:
# - 'core' first for fundamental init
# - 'port' before platform root drivers like 'acpi' so that CXL-root ports
#   are immediately enabled
# - 'mem' and 'pmem' before endpoint drivers so that memdevs are
#   immediately enabled
# - 'pci' last, also mirrors the hardware enumeration hierarchy

~Gregory

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 17/25] cxl: Introduce cxl_pci_drv_bound() to check for bound driver
  2025-11-05 19:03     ` Gregory Price
@ 2025-11-05 22:26       ` Gregory Price
  2025-11-06 17:11         ` Gregory Price
  2025-11-06 23:32         ` Bowman, Terry
  0 siblings, 2 replies; 103+ messages in thread
From: Gregory Price @ 2025-11-05 22:26 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci

On Wed, Nov 05, 2025 at 02:03:31PM -0500, Gregory Price wrote:
> On Wed, Nov 05, 2025 at 12:51:04PM -0500, Gregory Price wrote:
> > 
> > [    2.697094] cxl_core 0000:0d:00.0: BAR 0 [mem 0xfe800000-0xfe80ffff 64bit]: not claimed; can't enable device
> > [    2.697098] cxl_core 0000:0d:00.0: probe with driver cxl_core failed with error -22
> > 
> > Probe order issue when CXL drivers are built-in maybe?
> > 
> 

Moving it back but leaving the function seemed to work for me. I don't
know what the implications of this are, though (i.e. it's unclear to me
why you moved it from point A to point B in the first place).

(only tested this on QEMU)
---

diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
index ff6add88b6ae..2caa90fa4bf2 100644
--- a/drivers/cxl/Makefile
+++ b/drivers/cxl/Makefile
@@ -12,8 +12,10 @@ obj-$(CONFIG_CXL_PORT) += cxl_port.o
 obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
 obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o
 obj-$(CONFIG_CXL_MEM) += cxl_mem.o
+obj-$(CONFIG_CXL_PCI) += cxl_pci.o

 cxl_port-y := port.o
 cxl_acpi-y := acpi.o
 cxl_pmem-y := pmem.o security.o
 cxl_mem-y := mem.o
+cxl_pci-y := pci.o
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index 2937d0ddcce2..fa1d4aed28b9 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -21,4 +21,3 @@ cxl_core-$(CONFIG_CXL_FEATURES) += features.o
 cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
 cxl_core-$(CONFIG_CXL_RAS) += ras.o
 cxl_core-$(CONFIG_CXL_RCH_RAS) += ras_rch.o
-cxl_core-$(CONFIG_CXL_PCI) += pci_drv.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index a7a0838c8f23..7c287b4fa699 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -223,13 +223,4 @@ int cxl_set_feature(struct cxl_mailbox *cxl_mbox, const uuid_t *feat_uuid,
 		    u16 *return_code);
 #endif

-#ifdef CONFIG_CXL_PCI
-bool cxl_pci_drv_bound(struct pci_dev *pdev);
-int cxl_pci_driver_init(void);
-void cxl_pci_driver_exit(void);
-#else
-static inline bool cxl_pci_drv_bound(struct pci_dev *pdev) { return false; };
-static inline int cxl_pci_driver_init(void) { return 0; }
-static inline void cxl_pci_driver_exit(void) { }
-#endif
 #endif /* __CXL_CORE_H__ */
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index d19ebf052d76..ca02ad58fc57 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -2520,8 +2520,6 @@ static __init int cxl_core_init(void)
 	if (rc)
 		goto err_ras;

-	cxl_pci_driver_init();
-
 	return 0;

 err_ras:
@@ -2537,7 +2535,6 @@ static __init int cxl_core_init(void)

 static void cxl_core_exit(void)
 {
-	cxl_pci_driver_exit();
 	cxl_ras_exit();
 	cxl_region_exit();
 	bus_unregister(&cxl_bus_type);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 97e6c187e048..a2660d64c6eb 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -941,4 +941,10 @@ u16 cxl_gpf_get_dvsec(struct device *dev);
 #define devm_cxl_switch_port_decoders_setup DECLARE_TESTABLE(devm_cxl_switch_port_decoders_setup)
 #endif

+#ifdef CONFIG_CXL_PCI
+bool cxl_pci_drv_bound(struct pci_dev *pdev);
+#else
+static inline bool cxl_pci_drv_bound(struct pci_dev *pdev) { return false; };
+#endif
+
 #endif /* __CXL_H__ */
diff --git a/drivers/cxl/core/pci_drv.c b/drivers/cxl/pci.c
similarity index 99%
rename from drivers/cxl/core/pci_drv.c
rename to drivers/cxl/pci.c
index bc3c959f7eb6..e6d741e15ac2 100644
--- a/drivers/cxl/core/pci_drv.c
+++ b/drivers/cxl/pci.c
@@ -1189,7 +1189,7 @@ static void cxl_cper_work_fn(struct work_struct *work)
 }
 static DECLARE_WORK(cxl_cper_work, cxl_cper_work_fn);

-int __init cxl_pci_driver_init(void)
+static int __init cxl_pci_driver_init(void)
 {
 	int rc;

@@ -1204,9 +1204,15 @@ int __init cxl_pci_driver_init(void)
 	return rc;
 }

-void cxl_pci_driver_exit(void)
+static void cxl_pci_driver_exit(void)
 {
 	cxl_cper_unregister_work(&cxl_cper_work);
 	cancel_work_sync(&cxl_cper_work);
 	pci_unregister_driver(&cxl_pci_driver);
 }
+
+module_init(cxl_pci_driver_init);
+module_exit(cxl_pci_driver_exit);
+MODULE_DESCRIPTION("CXL: PCI manageability");
+MODULE_LICENSE("GPL v2");
+MODULE_IMPORT_NS("CXL");

^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 17/25] cxl: Introduce cxl_pci_drv_bound() to check for bound driver
  2025-11-05 22:26       ` Gregory Price
@ 2025-11-06 17:11         ` Gregory Price
  2025-11-06 23:32         ` Bowman, Terry
  1 sibling, 0 replies; 103+ messages in thread
From: Gregory Price @ 2025-11-06 17:11 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci

On Wed, Nov 05, 2025 at 05:26:01PM -0500, Gregory Price wrote:
> On Wed, Nov 05, 2025 at 02:03:31PM -0500, Gregory Price wrote:
> > On Wed, Nov 05, 2025 at 12:51:04PM -0500, Gregory Price wrote:
> > > 
> > > [    2.697094] cxl_core 0000:0d:00.0: BAR 0 [mem 0xfe800000-0xfe80ffff 64bit]: not claimed; can't enable device
> > > [    2.697098] cxl_core 0000:0d:00.0: probe with driver cxl_core failed with error -22
> > > 
> > > Probe order issue when CXL drivers are built-in maybe?
> > > 
> > 
> 
> moving it back but leaving the function seemed to work for me, i don't
> know what the implication of this is though (i.e. it's unclear to me
> why you moved it from point a to point b in the first place).
> 
> (only tested this on QEMU)

also tested on Zen5 systems and others.  Seems stable to me.

~Gregory

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 17/25] cxl: Introduce cxl_pci_drv_bound() to check for bound driver
  2025-11-05 22:26       ` Gregory Price
  2025-11-06 17:11         ` Gregory Price
@ 2025-11-06 23:32         ` Bowman, Terry
  1 sibling, 0 replies; 103+ messages in thread
From: Bowman, Terry @ 2025-11-06 23:32 UTC (permalink / raw)
  To: Gregory Price
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci



On 11/5/2025 4:26 PM, Gregory Price wrote:
> On Wed, Nov 05, 2025 at 02:03:31PM -0500, Gregory Price wrote:
>> On Wed, Nov 05, 2025 at 12:51:04PM -0500, Gregory Price wrote:
>>> [    2.697094] cxl_core 0000:0d:00.0: BAR 0 [mem 0xfe800000-0xfe80ffff 64bit]: not claimed; can't enable device
>>> [    2.697098] cxl_core 0000:0d:00.0: probe with driver cxl_core failed with error -22
>>>
>>> Probe order issue when CXL drivers are built-in maybe?
>>>
> moving it back but leaving the function seemed to work for me, i don't
> know what the implication of this is though (i.e. it's unclear to me
> why you moved it from point a to point b in the first place).
>
> (only tested this on QEMU)
Thanks for pointing this out.

I expect your changes will not work when using loadable modules (m instead of y). 

It appears cxl_pci_probe() is being called earlier due to the changes. The call stack is:
cxl_pci_probe()     
  pcim_enable_device(pdev);     <---- Silent exit here because cxl_pci_probe() fails below 
    pci_enable_device(struct pci_dev *dev)
      pci_enable_device_flags(struct pci_dev *dev, unsigned long flags)
        cxl_pci_probe()         <---- Returns failure due to resource reservation failure

Brief testing shows the following works:
@@ -922,7 +924,8 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 
        rc = pcim_enable_device(pdev);
        if (rc)
-               return rc;
+               return -EPROBE_DEFER;
+
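
The effect of the change above can be sketched in user-space C. This is
a simulation under stated assumptions, not the driver code: every demo_*
name is hypothetical, demo_enable_device() stands in for
pcim_enable_device(), and demo_driver_attach() stands in for the driver
core's handling of a deferred probe. The value 517 is EPROBE_DEFER from
include/linux/errno.h, and 22 matches the "error -22" in the log above.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_EINVAL		22
#define DEMO_EPROBE_DEFER	517	/* EPROBE_DEFER in include/linux/errno.h */

/* Hypothetical stand-in for a PCI device; not struct pci_dev. */
struct demo_pdev {
	bool bar_claimed;	/* has the BAR resource been claimed yet? */
	bool bound;		/* did probe complete? */
	int probe_calls;
};

/* Stand-in for pcim_enable_device(): fails while the BAR is unclaimed. */
int demo_enable_device(const struct demo_pdev *pdev)
{
	return pdev->bar_claimed ? 0 : -DEMO_EINVAL;
}

/* Probe that maps an enable failure to a deferral, as in the diff above. */
int demo_probe(struct demo_pdev *pdev)
{
	pdev->probe_calls++;
	if (demo_enable_device(pdev))
		return -DEMO_EPROBE_DEFER;
	pdev->bound = true;
	return 0;
}

/* Returns true when the device should stay queued for a later retry. */
bool demo_driver_attach(struct demo_pdev *pdev)
{
	return demo_probe(pdev) == -DEMO_EPROBE_DEFER;
}
```

One caveat with this approach: mapping every enable failure to a
deferral also retries failures that will never resolve, so a narrower
check may be preferable.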


Terry

> ---
>
> diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
> index ff6add88b6ae..2caa90fa4bf2 100644
> --- a/drivers/cxl/Makefile
> +++ b/drivers/cxl/Makefile
> @@ -12,8 +12,10 @@ obj-$(CONFIG_CXL_PORT) += cxl_port.o
>  obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
>  obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o
>  obj-$(CONFIG_CXL_MEM) += cxl_mem.o
> +obj-$(CONFIG_CXL_PCI) += cxl_pci.o
>
>  cxl_port-y := port.o
>  cxl_acpi-y := acpi.o
>  cxl_pmem-y := pmem.o security.o
>  cxl_mem-y := mem.o
> +cxl_pci-y := pci.o
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index 2937d0ddcce2..fa1d4aed28b9 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -21,4 +21,3 @@ cxl_core-$(CONFIG_CXL_FEATURES) += features.o
>  cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
>  cxl_core-$(CONFIG_CXL_RAS) += ras.o
>  cxl_core-$(CONFIG_CXL_RCH_RAS) += ras_rch.o
> -cxl_core-$(CONFIG_CXL_PCI) += pci_drv.o
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index a7a0838c8f23..7c287b4fa699 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -223,13 +223,4 @@ int cxl_set_feature(struct cxl_mailbox *cxl_mbox, const uuid_t *feat_uuid,
>  		    u16 *return_code);
>  #endif
>
> -#ifdef CONFIG_CXL_PCI
> -bool cxl_pci_drv_bound(struct pci_dev *pdev);
> -int cxl_pci_driver_init(void);
> -void cxl_pci_driver_exit(void);
> -#else
> -static inline bool cxl_pci_drv_bound(struct pci_dev *pdev) { return false; };
> -static inline int cxl_pci_driver_init(void) { return 0; }
> -static inline void cxl_pci_driver_exit(void) { }
> -#endif
>  #endif /* __CXL_CORE_H__ */
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index d19ebf052d76..ca02ad58fc57 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -2520,8 +2520,6 @@ static __init int cxl_core_init(void)
>  	if (rc)
>  		goto err_ras;
>
> -	cxl_pci_driver_init();
> -
>  	return 0;
>
>  err_ras:
> @@ -2537,7 +2535,6 @@ static __init int cxl_core_init(void)
>
>  static void cxl_core_exit(void)
>  {
> -	cxl_pci_driver_exit();
>  	cxl_ras_exit();
>  	cxl_region_exit();
>  	bus_unregister(&cxl_bus_type);
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 97e6c187e048..a2660d64c6eb 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -941,4 +941,10 @@ u16 cxl_gpf_get_dvsec(struct device *dev);
>  #define devm_cxl_switch_port_decoders_setup DECLARE_TESTABLE(devm_cxl_switch_port_decoders_setup)
>  #endif
>
> +#ifdef CONFIG_CXL_PCI
> +bool cxl_pci_drv_bound(struct pci_dev *pdev);
> +#else
> +static inline bool cxl_pci_drv_bound(struct pci_dev *pdev) { return false; };
> +#endif
> +
>  #endif /* __CXL_H__ */
> diff --git a/drivers/cxl/core/pci_drv.c b/drivers/cxl/pci.c
> similarity index 99%
> rename from drivers/cxl/core/pci_drv.c
> rename to drivers/cxl/pci.c
> index bc3c959f7eb6..e6d741e15ac2 100644
> --- a/drivers/cxl/core/pci_drv.c
> +++ b/drivers/cxl/pci.c
> @@ -1189,7 +1189,7 @@ static void cxl_cper_work_fn(struct work_struct *work)
>  }
>  static DECLARE_WORK(cxl_cper_work, cxl_cper_work_fn);
>
> -int __init cxl_pci_driver_init(void)
> +static int __init cxl_pci_driver_init(void)
>  {
>  	int rc;
>
> @@ -1204,9 +1204,15 @@ int __init cxl_pci_driver_init(void)
>  	return rc;
>  }
>
> -void cxl_pci_driver_exit(void)
> +static void cxl_pci_driver_exit(void)
>  {
>  	cxl_cper_unregister_work(&cxl_cper_work);
>  	cancel_work_sync(&cxl_cper_work);
>  	pci_unregister_driver(&cxl_pci_driver);
>  }
> +
> +module_init(cxl_pci_driver_init);
> +module_exit(cxl_pci_driver_exit);
> +MODULE_DESCRIPTION("CXL: PCI manageability");
> +MODULE_LICENSE("GPL v2");
> +MODULE_IMPORT_NS("CXL");


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 10/25] cxl/pci: Update RAS handler interfaces to also support CXL Ports
  2025-11-04 17:02 ` [RESEND v13 10/25] cxl/pci: Update RAS handler interfaces to also support CXL Ports Terry Bowman
  2025-11-04 18:10   ` Jonathan Cameron
@ 2025-11-11  8:17   ` Alison Schofield
  2025-11-19  3:19   ` dan.j.williams
  2 siblings, 0 replies; 103+ messages in thread
From: Alison Schofield @ 2025-11-11  8:17 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, jonathan.cameron, dave.jiang, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci

On Tue, Nov 04, 2025 at 11:02:50AM -0600, Terry Bowman wrote:
> CXL PCIe Port Protocol Error handling support will be added to the
> CXL drivers in the future. In preparation, rename the existing
> interfaces to support handling all CXL PCIe Port Protocol Errors.
> 
> The driver's RAS support functions currently rely on a 'struct
> cxl_dev_state' type parameter, which is not available for CXL Port
> devices. However, since the same CXL RAS capability structure is
> needed across most CXL components and devices, a common handling
> approach should be adopted.
> 
> To accommodate this, update the __cxl_handle_cor_ras() and
> __cxl_handle_ras() functions to use a `struct device` instead of
> `struct cxl_dev_state`.
> 
> No functional changes are introduced.
> 
> [1] CXL 3.1 Spec, 8.2.4 CXL.cache and CXL.mem Registers
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Alejandro Lucero <alucerop@amd.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Gregory Price <gourry@gourry.net>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
> 
> ---
> 
> Changes in v12->v13:
> - Added Ben's review-by
> ---
>  drivers/cxl/core/core.h    | 15 ++++++---------
>  drivers/cxl/core/ras.c     | 12 ++++++------
>  drivers/cxl/core/ras_rch.c |  4 ++--
>  3 files changed, 14 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index c30ab7c25a92..1a419b35fa59 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -7,6 +7,7 @@
>  #include <linux/pci.h>
>  #include <cxl/mailbox.h>
>  #include <linux/rwsem.h>
> +#include <linux/pci.h>

Duplicate include above.

snip

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 14/25] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
  2025-11-04 17:02 ` [RESEND v13 14/25] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers Terry Bowman
  2025-11-04 18:15   ` Jonathan Cameron
  2025-11-04 20:03   ` Dave Jiang
@ 2025-11-11  8:23   ` Alison Schofield
  2 siblings, 0 replies; 103+ messages in thread
From: Alison Schofield @ 2025-11-11  8:23 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, jonathan.cameron, dave.jiang, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci

On Tue, Nov 04, 2025 at 11:02:54AM -0600, Terry Bowman wrote:
> CXL Endpoint (EP) Ports may include Root Ports (RP) or Downstream Switch
> Ports (DSP). CXL RPs and DSPs contain RAS registers that require memory
> mapping to enable RAS logging. This initialization is currently missing and
> must be added for CXL RPs and DSPs.
> 
> Update cxl_dport_init_ras_reporting() to support RP and DSP RAS mapping.
> Add alongside the existing Restricted CXL Host Downstream Port RAS mapping.
> 
> Update cxl_endpoint_port_probe() to invoke cxl_dport_init_ras_reporting().
> This will initiate the RAS mapping for CXL RPs and DSPs when each CXL EP is
> created and added to the EP port.
> 
> Make a call to cxl_port_setup_regs() in cxl_port_add(). This will probe the
> Upstream Port's CXL capabilities' physical location to be used in mapping
> the RAS registers.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>


Terry,

This patch needed some cxl-test support:

Attaching what is needed to "Make cxl_*_init_ras_reporting() work
with cxl-test".

It adds a mock version of cxl_uport_init_ras_reporting(), simply
following what existed for cxl_dport_init_ras_reporting().

The other changes apply a method that avoids the circular dependency
that the above patch introduced: cxl_mock->cxl_core->cxl_mock.
This method is a Dan invention that DaveJ first applied here:
d96eb90d9ca6 ("cxl/test: Add mock version of devm_cxl_add_dport_by_dev()")
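
A stripped-down user-space sketch of the indirection, for readers who
have not seen it. The demo_* names are hypothetical and only mirror the
roles in the diff below: demo_init_ras_real() plays the part of
__cxl_dport_init_ras_reporting(), demo_init_ras_hook the part of the
exported _cxl_dport_init_ras_reporting pointer, and demo_init_ras() the
wrapper that callers keep using. The mock side only writes the pointer,
so the core side never needs a symbol from the mock.

```c
#include <assert.h>

/* Hypothetical stand-in for a dport; it only counts who initialized it. */
struct demo_dport {
	int real_inits;
	int mock_inits;
};

typedef void (*demo_init_ras_fn)(struct demo_dport *dport);

/* Core implementation (the "__" symbol in the patch). */
void demo_init_ras_real(struct demo_dport *dport)
{
	dport->real_inits++;
}

/* Exported hook (the "_" pointer in the patch); defaults to the real one. */
demo_init_ras_fn demo_init_ras_hook = demo_init_ras_real;

/* Wrapper that all callers use; it always goes through the hook. */
void demo_init_ras(struct demo_dport *dport)
{
	demo_init_ras_hook(dport);
}

/* Mock side: swaps the hook on load and restores it on unload. */
void demo_init_ras_mock(struct demo_dport *dport)
{
	dport->mock_inits++;
}

void demo_install_mock(void)
{
	demo_init_ras_hook = demo_init_ras_mock;
}

void demo_remove_mock(void)
{
	demo_init_ras_hook = demo_init_ras_real;
}
```

In the real patch the redirect helper in mock.c decides per device
whether to mock or fall through to the real function; the sketch leaves
that decision out.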

In my tree, I inserted and tested this after this patch I'm replying
to, but I think you'll need to combine them, or split some other way
so no patch introduces breakage.

--Alison


Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---
 drivers/cxl/core/port.c              |  4 ++--
 drivers/cxl/core/ras.c               | 12 ++++++------
 drivers/cxl/cxl.h                    |  5 +++++
 drivers/cxl/cxlpci.h                 |  6 ++++--
 drivers/cxl/mem.c                    |  2 +-
 tools/testing/cxl/Kbuild             |  1 -
 tools/testing/cxl/cxl_core_exports.c | 19 +++++++++++++++++++
 tools/testing/cxl/exports.h          |  8 ++++++++
 tools/testing/cxl/test/mock.c        | 25 +++++++++++++++++++++++--
 9 files changed, 68 insertions(+), 14 deletions(-)

diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 48f6a1492544..f0fc917f9575 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1195,7 +1195,7 @@ __devm_cxl_add_dport(struct cxl_port *port, struct device *dport_dev,
 		}
 		port->component_reg_phys = CXL_RESOURCE_NONE;
 		if (!is_cxl_endpoint(port) && dev_is_pci(port->uport_dev))
-			cxl_uport_init_ras_reporting(port, &port->dev);
+			__cxl_uport_init_ras_reporting(port, &port->dev);
 	}
 
 	get_device(dport_dev);
@@ -1625,7 +1625,7 @@ static struct cxl_dport *cxl_port_add_dport(struct cxl_port *port,
 
 	cxl_switch_parse_cdat(new_dport);
 
-	cxl_dport_init_ras_reporting(new_dport, &port->dev);
+	__cxl_dport_init_ras_reporting(new_dport, &port->dev);
 
 	if (ida_is_empty(&port->decoder_ida)) {
 		rc = devm_cxl_switch_port_decoders_setup(port);
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 19d9ffe885bf..90bfb32cc3c5 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -141,11 +141,12 @@ static void cxl_dport_map_ras(struct cxl_dport *dport)
 }
 
 /**
- * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
+ * __cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
  * @dport: the cxl_dport that needs to be initialized
  * @host: host device for devm operations
  */
-void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
+void __cxl_dport_init_ras_reporting(struct cxl_dport *dport,
+				    struct device *host)
 {
 	dport->reg_map.host = host;
 	cxl_dport_map_ras(dport);
@@ -160,10 +161,9 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
 		cxl_disable_rch_root_ints(dport);
 	}
 }
-EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
+EXPORT_SYMBOL_NS_GPL(__cxl_dport_init_ras_reporting, "CXL");
 
-void cxl_uport_init_ras_reporting(struct cxl_port *port,
-				  struct device *host)
+void __cxl_uport_init_ras_reporting(struct cxl_port *port, struct device *host)
 {
 	struct cxl_register_map *map = &port->reg_map;
 
@@ -172,7 +172,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port,
 				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
 		dev_dbg(&port->dev, "Failed to map RAS capability\n");
 }
-EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, "CXL");
+EXPORT_SYMBOL_NS_GPL(__cxl_uport_init_ras_reporting, "CXL");
 
 void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
 {
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index b7654d40dc9e..995e20a88d96 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -940,6 +940,11 @@ u16 cxl_gpf_get_dvsec(struct device *dev);
 #define DECLARE_TESTABLE(x) __##x
 #define devm_cxl_add_dport_by_dev DECLARE_TESTABLE(devm_cxl_add_dport_by_dev)
 #define devm_cxl_switch_port_decoders_setup DECLARE_TESTABLE(devm_cxl_switch_port_decoders_setup)
+#define cxl_dport_init_ras_reporting \
+	DECLARE_TESTABLE(cxl_dport_init_ras_reporting)
+#define cxl_uport_init_ras_reporting \
+	DECLARE_TESTABLE(cxl_uport_init_ras_reporting)
+
 #endif
 
 #endif /* __CXL_H__ */
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index a0a491e7b5b9..846cf0935252 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -82,9 +82,11 @@ void read_cdat_data(struct cxl_port *port);
 void cxl_cor_error_detected(struct pci_dev *pdev);
 pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 				    pci_channel_state_t state);
+void __cxl_dport_init_ras_reporting(struct cxl_dport *dport,
+				    struct device *host);
 void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host);
-void cxl_uport_init_ras_reporting(struct cxl_port *port,
-				  struct device *host);
+void __cxl_uport_init_ras_reporting(struct cxl_port *port, struct device *host);
+void cxl_uport_init_ras_reporting(struct cxl_port *port, struct device *host);
 #else
 static inline void cxl_cor_error_detected(struct pci_dev *pdev) { }
 
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index d2155f45240d..782fdb552865 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -167,7 +167,7 @@ static int cxl_mem_probe(struct device *dev)
 		endpoint_parent = &parent_port->dev;
 
 	if (dport->rch)
-		cxl_dport_init_ras_reporting(dport, dev);
+		__cxl_dport_init_ras_reporting(dport, dev);
 
 	scoped_guard(device, endpoint_parent) {
 		if (!endpoint_parent->driver) {
diff --git a/tools/testing/cxl/Kbuild b/tools/testing/cxl/Kbuild
index 6905f8e710ab..fe80a811fdef 100644
--- a/tools/testing/cxl/Kbuild
+++ b/tools/testing/cxl/Kbuild
@@ -9,7 +9,6 @@ ldflags-y += --wrap=cxl_await_media_ready
 ldflags-y += --wrap=devm_cxl_add_rch_dport
 ldflags-y += --wrap=cxl_rcd_component_reg_phys
 ldflags-y += --wrap=cxl_endpoint_parse_cdat
-ldflags-y += --wrap=cxl_dport_init_ras_reporting
 ldflags-y += --wrap=devm_cxl_endpoint_decoders_setup
 
 DRIVERS := ../../../drivers
diff --git a/tools/testing/cxl/cxl_core_exports.c b/tools/testing/cxl/cxl_core_exports.c
index 6754de35598d..5a071afa46fd 100644
--- a/tools/testing/cxl/cxl_core_exports.c
+++ b/tools/testing/cxl/cxl_core_exports.c
@@ -3,6 +3,7 @@
 
 #include "cxl.h"
 #include "exports.h"
+#include "cxlpci.h"
 
 /* Exporting of cxl_core symbols that are only used by cxl_test */
 EXPORT_SYMBOL_NS_GPL(cxl_num_decoders_committed, "CXL");
@@ -27,3 +28,21 @@ int devm_cxl_switch_port_decoders_setup(struct cxl_port *port)
 	return _devm_cxl_switch_port_decoders_setup(port);
 }
 EXPORT_SYMBOL_NS_GPL(devm_cxl_switch_port_decoders_setup, "CXL");
+
+cxl_dport_init_ras_reporting_fn _cxl_dport_init_ras_reporting =
+	__cxl_dport_init_ras_reporting;
+EXPORT_SYMBOL_NS_GPL(_cxl_dport_init_ras_reporting, "CXL");
+
+void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
+{
+	return _cxl_dport_init_ras_reporting(dport, host);
+}
+
+cxl_uport_init_ras_reporting_fn _cxl_uport_init_ras_reporting =
+	__cxl_uport_init_ras_reporting;
+EXPORT_SYMBOL_NS_GPL(_cxl_uport_init_ras_reporting, "CXL");
+
+void cxl_uport_init_ras_reporting(struct cxl_port *port, struct device *host)
+{
+	return _cxl_uport_init_ras_reporting(port, host);
+}
diff --git a/tools/testing/cxl/exports.h b/tools/testing/cxl/exports.h
index 7ebee7c0bd67..f3bcba8bc11b 100644
--- a/tools/testing/cxl/exports.h
+++ b/tools/testing/cxl/exports.h
@@ -10,4 +10,12 @@ extern cxl_add_dport_by_dev_fn _devm_cxl_add_dport_by_dev;
 typedef int(*cxl_switch_decoders_setup_fn)(struct cxl_port *port);
 extern cxl_switch_decoders_setup_fn _devm_cxl_switch_port_decoders_setup;
 
+typedef void (*cxl_dport_init_ras_reporting_fn)(struct cxl_dport *dport,
+						struct device *host);
+extern cxl_dport_init_ras_reporting_fn _cxl_dport_init_ras_reporting;
+
+typedef void (*cxl_uport_init_ras_reporting_fn)(struct cxl_port *port,
+						struct device *host);
+extern cxl_uport_init_ras_reporting_fn _cxl_uport_init_ras_reporting;
+
 #endif
diff --git a/tools/testing/cxl/test/mock.c b/tools/testing/cxl/test/mock.c
index 995269a75cbd..776b951aab1a 100644
--- a/tools/testing/cxl/test/mock.c
+++ b/tools/testing/cxl/test/mock.c
@@ -18,6 +18,10 @@ static struct cxl_dport *
 redirect_devm_cxl_add_dport_by_dev(struct cxl_port *port,
 				   struct device *dport_dev);
 static int redirect_devm_cxl_switch_port_decoders_setup(struct cxl_port *port);
+static void redirect_cxl_dport_init_ras_reporting(struct cxl_dport *dport,
+						  struct device *host);
+static void redirect_cxl_uport_init_ras_reporting(struct cxl_port *port,
+						  struct device *host);
 
 void register_cxl_mock_ops(struct cxl_mock_ops *ops)
 {
@@ -25,6 +29,8 @@ void register_cxl_mock_ops(struct cxl_mock_ops *ops)
 	_devm_cxl_add_dport_by_dev = redirect_devm_cxl_add_dport_by_dev;
 	_devm_cxl_switch_port_decoders_setup =
 		redirect_devm_cxl_switch_port_decoders_setup;
+	_cxl_dport_init_ras_reporting = redirect_cxl_dport_init_ras_reporting;
+	_cxl_uport_init_ras_reporting = redirect_cxl_uport_init_ras_reporting;
 }
 EXPORT_SYMBOL_GPL(register_cxl_mock_ops);
 
@@ -35,6 +41,9 @@ void unregister_cxl_mock_ops(struct cxl_mock_ops *ops)
 	_devm_cxl_switch_port_decoders_setup =
 		__devm_cxl_switch_port_decoders_setup;
 	_devm_cxl_add_dport_by_dev = __devm_cxl_add_dport_by_dev;
+	_cxl_dport_init_ras_reporting = __cxl_dport_init_ras_reporting;
+	_cxl_uport_init_ras_reporting = __cxl_uport_init_ras_reporting;
+
 	list_del_rcu(&ops->list);
 	synchronize_srcu(&cxl_mock_srcu);
 }
@@ -257,7 +266,8 @@ void __wrap_cxl_endpoint_parse_cdat(struct cxl_port *port)
 }
 EXPORT_SYMBOL_NS_GPL(__wrap_cxl_endpoint_parse_cdat, "CXL");
 
-void __wrap_cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
+void redirect_cxl_dport_init_ras_reporting(struct cxl_dport *dport,
+					   struct device *host)
 {
 	int index;
 	struct cxl_mock_ops *ops = get_cxl_mock_ops(&index);
@@ -267,7 +277,18 @@ void __wrap_cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device
 
 	put_cxl_mock_ops(index);
 }
-EXPORT_SYMBOL_NS_GPL(__wrap_cxl_dport_init_ras_reporting, "CXL");
+
+void redirect_cxl_uport_init_ras_reporting(struct cxl_port *port,
+					   struct device *host)
+{
+	int index;
+	struct cxl_mock_ops *ops = get_cxl_mock_ops(&index);
+
+	if (!ops || !ops->is_mock_port(port->uport_dev))
+		cxl_uport_init_ras_reporting(port, host);
+
+	put_cxl_mock_ops(index);
+}
 
 struct cxl_dport *redirect_devm_cxl_add_dport_by_dev(struct cxl_port *port,
 						     struct device *dport_dev)
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 17/25] cxl: Introduce cxl_pci_drv_bound() to check for bound driver
  2025-11-04 17:02 ` [RESEND v13 17/25] cxl: Introduce cxl_pci_drv_bound() to check for bound driver Terry Bowman
  2025-11-05 17:51   ` Gregory Price
@ 2025-11-11  8:33   ` Alison Schofield
  2025-11-13 21:42     ` Alison Schofield
  2025-11-20  1:24   ` dan.j.williams
  2 siblings, 1 reply; 103+ messages in thread
From: Alison Schofield @ 2025-11-11  8:33 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, jonathan.cameron, dave.jiang, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci

On Tue, Nov 04, 2025 at 11:02:57AM -0600, Terry Bowman wrote:
> CXL devices handle protocol errors via driver-specific callbacks rather
> than the generic pci_driver::err_handlers by default. The callbacks are
> implemented in the cxl_pci driver and are not part of struct pci_driver, so
> cxl_core must verify that a device is actually bound to the cxl_pci
> module's driver before invoking the callbacks (the device could be bound
> to another driver, e.g. VFIO).
> 
> However, cxl_core can not reference symbols in the cxl_pci module because
> it creates a circular dependency. This prevents cxl_core from checking the
> EP's bound driver and calling the callbacks.
> 
> To fix this, move drivers/cxl/pci.c into drivers/cxl/core/pci_drv.c and
> build it as part of the cxl_core module. Compile into cxl_core using
> CXL_PCI and CXL_CORE Kconfig dependencies. This removes the standalone
> cxl_pci module, consolidates the cxl_pci driver code into cxl_core, and
> eliminates the circular dependency so cxl_core can safely perform
> bound-driver checks and invoke the CXL PCI callbacks.
> 
> Introduce cxl_pci_drv_bound() to return boolean depending on if the PCI EP
> parameter is bound to a CXL driver instance. This will be used in future
> patch when dequeuing work from the kfifo.

This one was troublesome in cxl-test, more circular dependencies.
I noticed you and GregP chatting about it, so I simply removed it from
the set for now (made all callsites true).

With it gone, the set builds cxl-test and passes the test suite.
I'll watch what happens with this one, and can take another look at
the cxl-test issues if they persist.

A bit below...

snip

> diff --git a/drivers/cxl/pci.c b/drivers/cxl/core/pci_drv.c
> similarity index 99%
> rename from drivers/cxl/pci.c
> rename to drivers/cxl/core/pci_drv.c
> index bd95be1f3d5c..06f2fd993cb0 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/core/pci_drv.c

Needs:
+#include "core.h"

Compiler is warning: no previous prototypes.

snip

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 23/25] CXL/PCI: Introduce CXL uncorrectable protocol error recovery
  2025-11-04 17:03 ` [RESEND v13 23/25] CXL/PCI: Introduce CXL uncorrectable protocol error recovery Terry Bowman
  2025-11-04 18:47   ` Jonathan Cameron
@ 2025-11-11  8:37   ` Alison Schofield
  2025-12-08 18:40   ` Bjorn Helgaas
  2 siblings, 0 replies; 103+ messages in thread
From: Alison Schofield @ 2025-11-11  8:37 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, jonathan.cameron, dave.jiang, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci

On Tue, Nov 04, 2025 at 11:03:03AM -0600, Terry Bowman wrote:
> Implement cxl_do_recovery() to handle uncorrectable protocol
> errors (UCE), following the design of pcie_do_recovery(). Unlike PCIe,
> all CXL UCEs are treated as fatal and trigger a kernel panic to avoid
> potential CXL memory corruption.
> 
> Add cxl_walk_port(), analogous to pci_walk_bridge(), to traverse the
> CXL topology from the error source through downstream CXL ports and
> endpoints.
> 
> Introduce cxl_report_error_detected(), mirroring PCI's
> report_error_detected(), and implement device locking for the affected
> subtree. Endpoints require locking the PCI device (pdev->dev) and the
> CXL memdev (cxlmd->dev). CXL ports require locking the PCI
> device (pdev->dev) and the parent CXL port.
> 
> The device locks should be taken early where possible. The initially
> reporting device will be locked after kfifo dequeue. Iterated devices
> will be locked in cxl_report_error_detected() and must lock the
> iterated devices except for the first device as it has already been
> locked.
> 
> Export pci_aer_clear_fatal_status() for use when a UCE is not present.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> 
> ---
> 
snip

> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 5bc144cde0ee..52c6f19564b6 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c

snip

> +static int cxl_report_error_detected(struct device *dev, void *data, struct pci_dev *err_pdev)
> +{
> +	bool need_lock = (dev != &err_pdev->dev);
> +	pci_ers_result_t vote, *result = data;
> +	struct pci_dev *pdev;
> +
> +	if (!dev || !dev_is_pci(dev))
> +		return 0;
> +	pdev = to_pci_dev(dev);
> +
> +	device_lock_if(&pdev->dev, need_lock);
> +	if (is_pcie_endpoint(pdev) && !cxl_pci_drv_bound(pdev)) {
> +		device_unlock_if(&pdev->dev, need_lock);
> +		return PCI_ERS_RESULT_NONE;

sparse warns:
drivers/cxl/core/ras.c:316:24: warning: incorrect type in return expression (different base types)
drivers/cxl/core/ras.c:316:24:    expected int
drivers/cxl/core/ras.c:316:24:    got restricted pci_ers_result_t
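
The warning arises because the callback is declared to return int while
PCI_ERS_RESULT_NONE is a sparse-restricted pci_ers_result_t. One possible
shape for a fix, sketched here in userspace C, is to keep the int return
value purely as walk control and report the vote through the data
pointer, as PCI's report_error_detected() does. The enum and function
names below are hypothetical stand-ins, and whether an unbound endpoint
should leave the running result untouched is an assumption, not
something the patch states.

```c
/* Hypothetical stand-in for the kernel's pci_ers_result_t values. */
enum ers_result {
	ERS_RESULT_NONE = 1,
	ERS_RESULT_CAN_RECOVER,
	ERS_RESULT_PANIC,
};

/*
 * Walk callback sketch: the int return value only means "keep walking"
 * (0) or "stop" (non-zero). The recovery vote is written through @data,
 * so no ers_result value is ever returned as an int.
 */
static int report_error_detected_sketch(int drv_bound, void *data)
{
	enum ers_result *result = data;

	if (!drv_bound) {
		/* Assumption: no bound driver means no vote; keep walking. */
		return 0;
	}

	*result = ERS_RESULT_PANIC;
	return 0;
}
```

With this shape the return type and the returned expression agree, so
the type mismatch sparse reports would not arise.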


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 17/25] cxl: Introduce cxl_pci_drv_bound() to check for bound driver
  2025-11-11  8:33   ` Alison Schofield
@ 2025-11-13 21:42     ` Alison Schofield
  2025-11-13 22:39       ` Bowman, Terry
  0 siblings, 1 reply; 103+ messages in thread
From: Alison Schofield @ 2025-11-13 21:42 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, jonathan.cameron, dave.jiang, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci

On Tue, Nov 11, 2025 at 12:33:53AM -0800, Alison Schofield wrote:
> On Tue, Nov 04, 2025 at 11:02:57AM -0600, Terry Bowman wrote:
> > CXL devices handle protocol errors via driver-specific callbacks rather
> > than the generic pci_driver::err_handlers by default. The callbacks are
> > implemented in the cxl_pci driver and are not part of struct pci_driver, so
> > cxl_core must verify that a device is actually bound to the cxl_pci
> > module's driver before invoking the callbacks (the device could be bound
> > to another driver, e.g. VFIO).
> > 
> > However, cxl_core can not reference symbols in the cxl_pci module because
> > it creates a circular dependency. This prevents cxl_core from checking the
> > EP's bound driver and calling the callbacks.
> > 
> > To fix this, move drivers/cxl/pci.c into drivers/cxl/core/pci_drv.c and
> > build it as part of the cxl_core module. Compile into cxl_core using
> > CXL_PCI and CXL_CORE Kconfig dependencies. This removes the standalone
> > cxl_pci module, consolidates the cxl_pci driver code into cxl_core, and
> > eliminates the circular dependency so cxl_core can safely perform
> > bound-driver checks and invoke the CXL PCI callbacks.
> > 
> > Introduce cxl_pci_drv_bound() to return boolean depending on if the PCI EP
> > parameter is bound to a CXL driver instance. This will be used in future
> > patch when dequeuing work from the kfifo.
> 
> This one was troublesome in cxl-test, more circular dependencies.
> I noticed you and GregP chatting about it, so I simply removed it from
> the set for now (made all callsites true).
> 
> With it gone, the set builds cxl-test and passes the test suite.
> I'll watch what happens with this one, and can take another look at
> the cxl-test issues if they persist.

Hi Terry -

I took another look, suspecting the circle issue started with the
move of pci.c into the core, and not necessarily your new additions.
There are two functions that are wrapped in cxl-test and now with
this move are being called from the core and creating the 'circle':

cxl_await_media_ready()
cxl_rcd_component_reg_phys()

Both those need the 'redirect' method, like for Patch 14.

Once that is resolved, the new function cxl_pci_drv_bound()
seems like it needs mocking and will require the same treatment.

Suggest doing it in separate patches. First patch does the move
and the cxl-test work.  Then a second patch adds the new function
and its cxl-test support.
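
For reference, the pointer-based approach this series already uses for
cxl_dport_init_ras_reporting() in tools/testing/cxl (presumably the
method meant above) can be reduced to the userspace sketch below. All
names are hypothetical; only the shape (a core-owned function pointer
that defaults to the real implementation and is swapped by the test
module) follows the cxl_core_exports.c and mock.c changes.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the real core implementation. */
static bool real_drv_bound(int dev_id)
{
	return dev_id >= 0;
}

/* Hypothetical mock supplied by a test module. */
static bool mock_drv_bound(int dev_id)
{
	(void)dev_id;
	return true;
}

typedef bool (*drv_bound_fn)(int dev_id);

/* Core-owned pointer: defaults to the real implementation. */
static drv_bound_fn _drv_bound = real_drv_bound;

/* Public entry point: callers in the core always go through the pointer. */
static bool drv_bound(int dev_id)
{
	return _drv_bound(dev_id);
}

/* The test module swaps the pointer on register, restores it on unregister. */
static void register_mock(void)
{
	_drv_bound = mock_drv_bound;
}

static void unregister_mock(void)
{
	_drv_bound = real_drv_bound;
}
```

Because the core calls through a pointer it owns, the test module can
intercept calls made from inside the core, which is the case that
linker --wrap does not cover once the caller and callee live in the
same module.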

--Alison


> 
> A bit below...
> 
> snip
> 
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/core/pci_drv.c
> > similarity index 99%
> > rename from drivers/cxl/pci.c
> > rename to drivers/cxl/core/pci_drv.c
> > index bd95be1f3d5c..06f2fd993cb0 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/core/pci_drv.c
> 
> Needs:
> +#include "core.h"
> 
> Compiler is warning: no previous prototypes.
> 
> snip
> 

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 17/25] cxl: Introduce cxl_pci_drv_bound() to check for bound driver
  2025-11-13 21:42     ` Alison Schofield
@ 2025-11-13 22:39       ` Bowman, Terry
  0 siblings, 0 replies; 103+ messages in thread
From: Bowman, Terry @ 2025-11-13 22:39 UTC (permalink / raw)
  To: Alison Schofield
  Cc: dave, jonathan.cameron, dave.jiang, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci



On 11/13/2025 3:42 PM, Alison Schofield wrote:
> On Tue, Nov 11, 2025 at 12:33:53AM -0800, Alison Schofield wrote:
>> On Tue, Nov 04, 2025 at 11:02:57AM -0600, Terry Bowman wrote:
>>> CXL devices handle protocol errors via driver-specific callbacks rather
>>> than the generic pci_driver::err_handlers by default. The callbacks are
>>> implemented in the cxl_pci driver and are not part of struct pci_driver, so
>>> cxl_core must verify that a device is actually bound to the cxl_pci
>>> module's driver before invoking the callbacks (the device could be bound
>>> to another driver, e.g. VFIO).
>>>
>>> However, cxl_core can not reference symbols in the cxl_pci module because
>>> it creates a circular dependency. This prevents cxl_core from checking the
>>> EP's bound driver and calling the callbacks.
>>>
>>> To fix this, move drivers/cxl/pci.c into drivers/cxl/core/pci_drv.c and
>>> build it as part of the cxl_core module. Compile into cxl_core using
>>> CXL_PCI and CXL_CORE Kconfig dependencies. This removes the standalone
>>> cxl_pci module, consolidates the cxl_pci driver code into cxl_core, and
>>> eliminates the circular dependency so cxl_core can safely perform
>>> bound-driver checks and invoke the CXL PCI callbacks.
>>>
>>> Introduce cxl_pci_drv_bound() to return boolean depending on if the PCI EP
>>> parameter is bound to a CXL driver instance. This will be used in future
>>> patch when dequeuing work from the kfifo.
>> This one was troublesome in cxl-test, more circular dependencies.
>> I noticed you and GregP chatting about it, so I simply removed it from
>> the set for now (made all callsites true).
>>
>> With it gone, the set builds cxl-test and passes the test suite.
>> I'll watch what happens with this one, and can take another look at
>> the cxl-test issues if they persist.
> Hi Terry -
>
> I took another look, suspecting the circle issue started with the
> move of pci.c into the core, and not necessarily your new additions.
> There are two functions that are wrapped in cxl-test and now with
> this move are being called from the core and creating the 'circle':
>
> cxl_await_media_ready()
> cxl_rcd_component_reg_phys()
>
> Both those need the 'redirect' method, like for Patch 14.
>
> Once that is resolved, the new function cxl_pci_drv_bound()
> seems like it needs mocking and will require the same treatment.
>
> Suggest doing it in separate patches. First patch does the move
> and the cxl-test work.  Then a second patch adds the new function
> and its cxl-test support.
>
> --Alison
>

Hi Alison,

Thanks for finding the issue. I'll start on the fix using the changes 
you described.

-Terry


>> A bit below...
>>
>> snip
>>
>>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/core/pci_drv.c
>>> similarity index 99%
>>> rename from drivers/cxl/pci.c
>>> rename to drivers/cxl/core/pci_drv.c
>>> index bd95be1f3d5c..06f2fd993cb0 100644
>>> --- a/drivers/cxl/pci.c
>>> +++ b/drivers/cxl/core/pci_drv.c
>> Needs:
>> +#include "core.h"
>>
>> Compiler is warning: no previous prototypes.
>>
>> snip
>>


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 22/25] CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
  2025-11-04 19:03   ` Bjorn Helgaas
@ 2025-11-14 15:20     ` Bowman, Terry
  2025-11-14 16:09       ` Jonathan Cameron
  0 siblings, 1 reply; 103+ messages in thread
From: Bowman, Terry @ 2025-11-14 15:20 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci



On 11/4/2025 1:03 PM, Bjorn Helgaas wrote:
> On Tue, Nov 04, 2025 at 11:03:02AM -0600, Terry Bowman wrote:
>> CXL uncorrectable errors (UCE) will soon be handled separately from the PCI
>> AER handling. The merge_result() function can be made common to use in both
>> handling paths.
>>
>> Rename the PCI subsystem's merge_result() to be pci_ers_merge_result().
>> Export pci_ers_merge_result() to make available for the CXL and other
>> drivers to use.
>>
>> Update pci_ers_merge_result() to support recently introduced PCI_ERS_RESULT_PANIC
>> result.
> Seems like this merge_result() change maybe should be in the same
> patch that added PCI_ERS_RESULT_PANIC?  That would also solve the
> problem that the subject line doesn't mention this important
> functional change.
>
> I haven't seen the user(s) of pci_ers_merge_result() yet, but this
> seems like it might be a little too low level to be exported to
> modules and in include/linux/pci.h.  Maybe there's no other way.

This is used in the UCE handling patch. I will move it there.

Jonathan suggested updating merge_result() to handle both PCIe and CXL error
cases with shared logic. The only other option I see is to remove the export here 
and duplicate the function in the CXL drivers?

- Terry


> Wrap commit log to fit in 75 columns.
>
> Suggest possible subject prefix of "PCI/ERR" since the only CXL
> connection is that you want to *use* this from CXL.
>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>
>> Changes in v12->v13:
>> - Renamed pci_ers_merge_result() to pcie_ers_merge_result().
>>   pci_ers_merge_result() is already used in eeh driver. (Bot)
>>
>> Changes in v11->v12:
>> - Remove static inline pci_ers_merge_result() definition for
>>   !CONFIG_PCIEAER. Is not needed. (Lukas)
>>
>> Changes in v10->v11:
>> - New patch
>> - pci_ers_merge_result() - Change export to non-namespace and rename
>>   to be pci_ers_merge_result()
>> - Move pci_ers_merge_result() definition to pci.h. Needs pci_ers_result
>> ---
>>  drivers/pci/pcie/err.c | 14 +++++++++-----
>>  include/linux/pci.h    |  7 +++++++
>>  2 files changed, 16 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index bebe4bc111d7..9394bbdcf0fb 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -21,9 +21,12 @@
>>  #include "portdrv.h"
>>  #include "../pci.h"
>>
>> -static pci_ers_result_t merge_result(enum pci_ers_result orig,
>> -				  enum pci_ers_result new)
>> +pci_ers_result_t pcie_ers_merge_result(enum pci_ers_result orig,
>> +				       enum pci_ers_result new)
>>  {
>> +	if (new == PCI_ERS_RESULT_PANIC)
>> +		return PCI_ERS_RESULT_PANIC;
>> +
>>  	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
>>  		return PCI_ERS_RESULT_NO_AER_DRIVER;
>>
>> @@ -45,6 +48,7 @@ static pci_ers_result_t merge_result(enum pci_ers_result orig,
>>
>>  	return orig;
>>  }
>> +EXPORT_SYMBOL(pcie_ers_merge_result);
>>
>>  static int report_error_detected(struct pci_dev *dev,
>>  				 pci_channel_state_t state,
>> @@ -81,7 +85,7 @@ static int report_error_detected(struct pci_dev *dev,
>>  		vote = err_handler->error_detected(dev, state);
>>  	}
>>  	pci_uevent_ers(dev, vote);
>> -	*result = merge_result(*result, vote);
>> +	*result = pcie_ers_merge_result(*result, vote);
>>  	device_unlock(&dev->dev);
>>  	return 0;
>>  }
>> @@ -139,7 +143,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data)
>>
>>  	err_handler = pdrv->err_handler;
>>  	vote = err_handler->mmio_enabled(dev);
>> -	*result = merge_result(*result, vote);
>> +	*result = pcie_ers_merge_result(*result, vote);
>>  out:
>>  	device_unlock(&dev->dev);
>>  	return 0;
>> @@ -159,7 +163,7 @@ static int report_slot_reset(struct pci_dev *dev, void *data)
>>
>>  	err_handler = pdrv->err_handler;
>>  	vote = err_handler->slot_reset(dev);
>> -	*result = merge_result(*result, vote);
>> +	*result = pcie_ers_merge_result(*result, vote);
>>  out:
>>  	device_unlock(&dev->dev);
>>  	return 0;
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 33d16b212e0d..d3e3300f79ec 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -1887,9 +1887,16 @@ static inline void pci_hp_unignore_link_change(struct pci_dev *pdev) { }
>>  #ifdef CONFIG_PCIEAER
>>  bool pci_aer_available(void);
>>  void pcie_clear_device_status(struct pci_dev *dev);
>> +pci_ers_result_t pcie_ers_merge_result(enum pci_ers_result orig,
>> +				       enum pci_ers_result new);
>>  #else
>>  static inline bool pci_aer_available(void) { return false; }
>>  static inline void pcie_clear_device_status(struct pci_dev *dev) { }
>> +static inline pci_ers_result_t pcie_ers_merge_result(enum pci_ers_result orig,
>> +						     enum pci_ers_result new)
>> +{
>> +	return PCI_ERS_RESULT_NONE;
>> +}
>>  #endif
>>
>>  bool pci_ats_disabled(void);
>> --
>> 2.34.1


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 22/25] CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
  2025-11-14 15:20     ` Bowman, Terry
@ 2025-11-14 16:09       ` Jonathan Cameron
  0 siblings, 0 replies; 103+ messages in thread
From: Jonathan Cameron @ 2025-11-14 16:09 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: Bjorn Helgaas, dave, dave.jiang, alison.schofield, dan.j.williams,
	bhelgaas, shiju.jose, ming.li, Smita.KoralahalliChannabasappa,
	rrichter, dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	alucerop, ira.weiny, linux-kernel, linux-pci

On Fri, 14 Nov 2025 09:20:08 -0600
"Bowman, Terry" <terry.bowman@amd.com> wrote:

> On 11/4/2025 1:03 PM, Bjorn Helgaas wrote:
> > On Tue, Nov 04, 2025 at 11:03:02AM -0600, Terry Bowman wrote:  
> >> CXL uncorrectable errors (UCE) will soon be handled separately from the PCI
> >> AER handling. The merge_result() function can be made common to use in both
> >> handling paths.
> >>
> >> Rename the PCI subsystem's merge_result() to be pci_ers_merge_result().
> >> Export pci_ers_merge_result() to make available for the CXL and other
> >> drivers to use.
> >>
> >> Update pci_ers_merge_result() to support recently introduced PCI_ERS_RESULT_PANIC
> >> result.  
> > Seems like this merge_result() change maybe should be in the same
> > patch that added PCI_ERS_RESULT_PANIC?  That would also solve the
> > problem that the subject line doesn't mention this important
> > functional change.
> >
> > I haven't seen the user(s) of pci_ers_merge_result() yet, but this
> > seems like it might be a little too low level to be exported to
> > modules and in include/linux/pci.h.  Maybe there's no other way.  
> 
> This is used in the UCE handling patch. I will move there.
> 
> Jonathan suggested updating merge_result() to handle both PCIe and CXL error
> cases with shared logic. The only other option I see is to remove the export here 
> and duplicate the function in the CXL drivers?

I don't mind if it turns out we do need to duplicate this little bit of code.
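
To give a sense of how little code that would be, here is a userspace
sketch of a duplicated merge helper. The enum is a hypothetical local
mirror of pci_ers_result, and the switch in the middle is not shown in
the quoted hunks; it follows what I believe merge_result() in
drivers/pci/pcie/err.c does today, so treat it as an assumption.

```c
/* Hypothetical userspace mirror of the kernel's pci_ers_result values. */
enum ers_result {
	ERS_NONE = 1,
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
	ERS_NO_AER_DRIVER,
	ERS_PANIC,
};

static enum ers_result cxl_merge_result_sketch(enum ers_result orig,
					       enum ers_result vote)
{
	/* A panic vote from any device overrides everything else. */
	if (vote == ERS_PANIC)
		return ERS_PANIC;

	if (vote == ERS_NO_AER_DRIVER)
		return ERS_NO_AER_DRIVER;

	/* A device with no opinion leaves the running result alone. */
	if (vote == ERS_NONE)
		return orig;

	switch (orig) {
	case ERS_CAN_RECOVER:
	case ERS_RECOVERED:
		/* A benign running result is replaced by the new vote. */
		orig = vote;
		break;
	case ERS_DISCONNECT:
		/* Only a reset request replaces a disconnect. */
		if (vote == ERS_NEED_RESET)
			orig = ERS_NEED_RESET;
		break;
	default:
		break;
	}

	return orig;
}
```

Note that once the running result is ERS_PANIC, later votes other than
ERS_NO_AER_DRIVER fall into the default case and leave it in place.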

Jonathan
> 
> - Terry
> 
> 
> > Wrap commit log to fit in 75 columns.
> >
> > Suggest possible subject prefix of "PCI/ERR" since the only CXL
> > connection is that you want to *use* this from CXL.
> >  
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> ---
> >>
> >> Changes in v12->v13:
> >> - Renamed pci_ers_merge_result() to pcie_ers_merge_result().
> >>   pci_ers_merge_result() is already used in eeh driver. (Bot)
> >>
> >> Changes in v11->v12:
> >> - Remove static inline pci_ers_merge_result() definition for
> >>   !CONFIG_PCIEAER. Is not needed. (Lukas)
> >>
> >> Changes in v10->v11:
> >> - New patch
> >> - pci_ers_merge_result() - Change export to non-namespace and rename
> >>   to be pci_ers_merge_result()
> >> - Move pci_ers_merge_result() definition to pci.h. Needs pci_ers_result
> >> ---
> >>  drivers/pci/pcie/err.c | 14 +++++++++-----
> >>  include/linux/pci.h    |  7 +++++++
> >>  2 files changed, 16 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> >> index bebe4bc111d7..9394bbdcf0fb 100644
> >> --- a/drivers/pci/pcie/err.c
> >> +++ b/drivers/pci/pcie/err.c
> >> @@ -21,9 +21,12 @@
> >>  #include "portdrv.h"
> >>  #include "../pci.h"
> >>
> >> -static pci_ers_result_t merge_result(enum pci_ers_result orig,
> >> -				  enum pci_ers_result new)
> >> +pci_ers_result_t pcie_ers_merge_result(enum pci_ers_result orig,
> >> +				       enum pci_ers_result new)
> >>  {
> >> +	if (new == PCI_ERS_RESULT_PANIC)
> >> +		return PCI_ERS_RESULT_PANIC;
> >> +
> >>  	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
> >>  		return PCI_ERS_RESULT_NO_AER_DRIVER;
> >>
> >> @@ -45,6 +48,7 @@ static pci_ers_result_t merge_result(enum pci_ers_result orig,
> >>
> >>  	return orig;
> >>  }
> >> +EXPORT_SYMBOL(pcie_ers_merge_result);
> >>
> >>  static int report_error_detected(struct pci_dev *dev,
> >>  				 pci_channel_state_t state,
> >> @@ -81,7 +85,7 @@ static int report_error_detected(struct pci_dev *dev,
> >>  		vote = err_handler->error_detected(dev, state);
> >>  	}
> >>  	pci_uevent_ers(dev, vote);
> >> -	*result = merge_result(*result, vote);
> >> +	*result = pcie_ers_merge_result(*result, vote);
> >>  	device_unlock(&dev->dev);
> >>  	return 0;
> >>  }
> >> @@ -139,7 +143,7 @@ static int report_mmio_enabled(struct pci_dev *dev, void *data)
> >>
> >>  	err_handler = pdrv->err_handler;
> >>  	vote = err_handler->mmio_enabled(dev);
> >> -	*result = merge_result(*result, vote);
> >> +	*result = pcie_ers_merge_result(*result, vote);
> >>  out:
> >>  	device_unlock(&dev->dev);
> >>  	return 0;
> >> @@ -159,7 +163,7 @@ static int report_slot_reset(struct pci_dev *dev, void *data)
> >>
> >>  	err_handler = pdrv->err_handler;
> >>  	vote = err_handler->slot_reset(dev);
> >> -	*result = merge_result(*result, vote);
> >> +	*result = pcie_ers_merge_result(*result, vote);
> >>  out:
> >>  	device_unlock(&dev->dev);
> >>  	return 0;
> >> diff --git a/include/linux/pci.h b/include/linux/pci.h
> >> index 33d16b212e0d..d3e3300f79ec 100644
> >> --- a/include/linux/pci.h
> >> +++ b/include/linux/pci.h
> >> @@ -1887,9 +1887,16 @@ static inline void pci_hp_unignore_link_change(struct pci_dev *pdev) { }
> >>  #ifdef CONFIG_PCIEAER
> >>  bool pci_aer_available(void);
> >>  void pcie_clear_device_status(struct pci_dev *dev);
> >> +pci_ers_result_t pcie_ers_merge_result(enum pci_ers_result orig,
> >> +				       enum pci_ers_result new);
> >>  #else
> >>  static inline bool pci_aer_available(void) { return false; }
> >>  static inline void pcie_clear_device_status(struct pci_dev *dev) { }
> >> +static inline pci_ers_result_t pcie_ers_merge_result(enum pci_ers_result orig,
> >> +						     enum pci_ers_result new)
> >> +{
> >> +	return PCI_ERS_RESULT_NONE;
> >> +}
> >>  #endif
> >>
> >>  bool pci_ats_disabled(void);
> >> --
> >> 2.34.1
> 


^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 10/25] cxl/pci: Update RAS handler interfaces to also support CXL Ports
  2025-11-04 17:02 ` [RESEND v13 10/25] cxl/pci: Update RAS handler interfaces to also support CXL Ports Terry Bowman
  2025-11-04 18:10   ` Jonathan Cameron
  2025-11-11  8:17   ` Alison Schofield
@ 2025-11-19  3:19   ` dan.j.williams
  2 siblings, 0 replies; 103+ messages in thread
From: dan.j.williams @ 2025-11-19  3:19 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Terry Bowman wrote:
> CXL PCIe Port Protocol Error handling support will be added to the
> CXL drivers in the future. In preparation, rename the existing
> interfaces to support handling all CXL PCIe Port Protocol Errors.
> 
> The driver's RAS support functions currently rely on a 'struct
> cxl_dev_state' type parameter, which is not available for CXL Port
> devices. However, since the same CXL RAS capability structure is
> needed across most CXL components and devices, a common handling
> approach should be adopted.
> 
> To accommodate this, update the __cxl_handle_cor_ras() and
> __cxl_handle_ras() functions to use a `struct device` instead of
> `struct cxl_dev_state`.
> 
> No functional changes are introduced.
> 
> [1] CXL 3.1 Spec, 8.2.4 CXL.cache and CXL.mem Registers
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Alejandro Lucero <alucerop@amd.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Gregory Price <gourry@gourry.net>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
[..]
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index b933030b8e1e..72908f3ced77 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -160,7 +160,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
>  
> -void cxl_handle_cor_ras(struct cxl_dev_state *cxlds, void __iomem *ras_base)
> +void cxl_handle_cor_ras(struct device *dev, void __iomem *ras_base)
>  {
>  	void __iomem *addr;
>  	u32 status;
> @@ -172,7 +172,7 @@ void cxl_handle_cor_ras(struct cxl_dev_state *cxlds, void __iomem *ras_base)
>  	status = readl(addr);
>  	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
>  		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> -		trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
> +		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);

This indeed looks like an equivalent conversion, I just worry it does
not work if this function gets re-used for protocol errors on non-memdev
(port) devices.
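
To illustrate the concern: a container_of() style conversion is only
valid when the device really is embedded in the expected structure. A
minimal userspace sketch of a checked conversion follows; the struct
layouts and the helper name are hypothetical, and in the kernel the
equivalent guard would presumably be an is_cxl_memdev() test before
calling to_cxl_memdev().

```c
#include <stddef.h>

struct device_type { const char *name; };
struct device { const struct device_type *type; };

static const struct device_type memdev_type = { "memdev" };
static const struct device_type port_type = { "port" };

/* Hypothetical containers, each embedding a struct device. */
struct memdev { int id; struct device dev; };
struct port { struct device dev; };

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Convert only when the device type matches; otherwise return NULL. */
static struct memdev *dev_to_memdev_checked(struct device *dev)
{
	if (!dev || dev->type != &memdev_type)
		return NULL;

	return container_of(dev, struct memdev, dev);
}
```

A caller handling a port device would then get NULL back and could skip
or adapt the memdev-specific trace event instead of passing along a
pointer computed from the wrong container type.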

For now, at this stage of the series:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 01/25] CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
  2025-11-04 17:02 ` [RESEND v13 01/25] CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h Terry Bowman
  2025-11-04 17:50   ` Jonathan Cameron
@ 2025-11-19  3:19   ` dan.j.williams
  2025-12-08 18:04   ` Bjorn Helgaas
  2 siblings, 0 replies; 103+ messages in thread
From: dan.j.williams @ 2025-11-19  3:19 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Terry Bowman wrote:
> The CXL DVSECs are currently defined in cxl/core/cxlpci.h. These are not
> accessible to other subsystems. Move these to uapi/linux/pci_regs.h.
> 
> Change DVSEC name formatting to follow the existing PCI format in
> pci_regs.h. The current format uses CXL_DVSEC_XYZ and the CXL defines must
> be changed to be PCI_DVSEC_CXL_XYZ to match existing pci_regs.h. Leave
> PCI_DVSEC_CXL_PORT* defines as-is because they are already defined and may
> be in use by userspace application(s).
> 
> Update existing usage to match the name change.
> 
> Update the inline documentation to refer to latest CXL spec version.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> 
[..]
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index b14dd064006c..53a49bb32514 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5002,7 +5002,9 @@ static bool cxl_sbr_masked(struct pci_dev *dev)
>  	if (!dvsec)
>  		return false;
>  
> -	rc = pci_read_config_word(dev, dvsec + PCI_DVSEC_CXL_PORT_CTL, &reg);
> +	rc = pci_read_config_word(dev,
> +				  dvsec + PCI_DVSEC_CXL_PORT_CTL,
> +				  &reg);

Patch looks ok,

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

...but going forward please be careful to not reflow whitespace when no
other content is changing (no need to respin for this comment).

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 02/25] PCI/CXL: Introduce pcie_is_cxl()
  2025-11-04 17:02 ` [RESEND v13 02/25] PCI/CXL: Introduce pcie_is_cxl() Terry Bowman
  2025-11-04 17:52   ` Jonathan Cameron
@ 2025-11-19  3:19   ` dan.j.williams
  2025-11-19 15:55     ` Bowman, Terry
  2025-11-21 20:31   ` Gregory Price
  2 siblings, 1 reply; 103+ messages in thread
From: dan.j.williams @ 2025-11-19  3:19 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Terry Bowman wrote:
> CXL and AER drivers need the ability to identify CXL devices.
> 
> Introduce set_pcie_cxl() with logic checking for CXL.mem or CXL.cache
> status in the CXL Flexbus DVSEC status register. The CXL Flexbus DVSEC
> presence is used because it is required for all the CXL PCIe devices.[1]
> 
> Add boolean 'struct pci_dev::is_cxl' to cache the CXL.cache and
> CXL.mem status.
> 
> In the case the device is an EP or USP, call set_pcie_cxl() on behalf of
> the parent downstream device. Once a device is created there is a
> possibility the parent training or CXL state was updated as well. This
> will make certain the correct parent CXL state is cached.
> 
> Add function pcie_is_cxl() to return 'struct pci_dev::is_cxl'.
> 
> [1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
>     Capability (DVSEC) ID Assignment, Table 8-2
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Alejandro Lucero <alucerop@amd.com>
> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
> 
> ---
> 
> Changes in v12->v13:
> - Add Ben's "reviewed-by"
> 
> Changes in v11->v12:
> - Add review-by for Alejandro
> - Add comment in set_pcie_cxl() explaining why updating parent status.
> 
> Changes in v10->v11:
> - Amend set_pcie_cxl() to check for Upstream Port's and EP's parent
>   downstream port by calling set_pcie_cxl(). (Dan)
> - Retitle patch: 'Add' -> 'Introduce'
> - Add check for CXL.mem and CXL.cache (Alejandro, Dan)
[..]
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 0ce98e18b5a8..63124651f865 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1709,6 +1709,33 @@ static void set_pcie_thunderbolt(struct pci_dev *dev)
>  		dev->is_thunderbolt = 1;
>  }
>  
> +static void set_pcie_cxl(struct pci_dev *dev)
> +{
> +	struct pci_dev *parent;
> +	u16 dvsec = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
> +					      PCI_DVSEC_CXL_FLEXBUS_PORT);
> +	if (dvsec) {
> +		u16 cap;
> +
> +		pci_read_config_word(dev, dvsec + PCI_DVSEC_CXL_FLEXBUS_STATUS_OFFSET, &cap);
> +
> +		dev->is_cxl = FIELD_GET(PCI_DVSEC_CXL_FLEXBUS_STATUS_CACHE_MASK, cap) ||
> +			FIELD_GET(PCI_DVSEC_CXL_FLEXBUS_STATUS_MEM_MASK, cap);
> +	}
> +
> +	if (!pci_is_pcie(dev) ||
> +	    !(pci_pcie_type(dev) == PCI_EXP_TYPE_ENDPOINT ||
> +	      pci_pcie_type(dev) == PCI_EXP_TYPE_UPSTREAM))
> +		return;

Why are downstream ports excluded?

> +
> +	/*
> +	 * Update parent's CXL state because alternate protocol training
> +	 * may have changed
> +	 */
> +	parent = pci_upstream_bridge(dev);

This parent is a downstream port...

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 03/25] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
  2025-11-04 17:02 ` [RESEND v13 03/25] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
  2025-11-04 17:53   ` Jonathan Cameron
@ 2025-11-19  3:20   ` dan.j.williams
  1 sibling, 0 replies; 103+ messages in thread
From: dan.j.williams @ 2025-11-19  3:20 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Terry Bowman wrote:
> The CXL driver's cxl_handle_endpoint_cor_ras()/cxl_handle_endpoint_ras()
> are unnecessary helper functions used only for Endpoints. Remove these
> functions as they are not common for all CXL devices and do not provide
> value for EP handling.
> 
> Rename __cxl_handle_ras to cxl_handle_ras() and __cxl_handle_cor_ras()
> to cxl_handle_cor_ras().
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>

LGTM

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 04/25] cxl/pci: Remove unnecessary CXL RCH handling helper functions
  2025-11-04 17:02 ` [RESEND v13 04/25] cxl/pci: Remove unnecessary CXL RCH " Terry Bowman
@ 2025-11-19  3:20   ` dan.j.williams
  0 siblings, 0 replies; 103+ messages in thread
From: dan.j.williams @ 2025-11-19  3:20 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Terry Bowman wrote:
> cxl_handle_rdport_cor_ras() and cxl_handle_rdport_ras() are specific
> to Restricted CXL Host (RCH) handling. Improve readability and
> maintainability by replacing these and instead using the common
> cxl_handle_cor_ras() and cxl_handle_ras() functions.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Alejandro Lucero <alucerop@amd.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

LGTM

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 05/25] cxl: Remove CXL VH handling in CONFIG_PCIEAER_CXL conditional blocks from core/pci.c
  2025-11-04 17:02 ` [RESEND v13 05/25] cxl: Remove CXL VH handling in CONFIG_PCIEAER_CXL conditional blocks from core/pci.c Terry Bowman
@ 2025-11-19  3:20   ` dan.j.williams
  0 siblings, 0 replies; 103+ messages in thread
From: dan.j.williams @ 2025-11-19  3:20 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Terry Bowman wrote:
> From: Dave Jiang <dave.jiang@intel.com>
> 
> Create new config CONFIG_CXL_RAS and put all CXL RAS items behind the
> config. The config will depend on CPER and PCIE AER to build. Move the
> related VH RAS code from core/pci.c to core/ras.c.
> 
> Restricted CXL host (RCH) RAS functions will be moved in a future patch.
> 
> Cc: Robert Richter <rrichter@amd.com>
> Cc: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Signed-off-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> Reviewed-by: Alison Schofield <alison.schofield@intel.com>
> Co-developed-by: Terry Bowman <terry.bowman@amd.com>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

LGTM

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 06/25] cxl: Move CXL driver's RCH error handling into core/ras_rch.c
  2025-11-04 17:02 ` [RESEND v13 06/25] cxl: Move CXL driver's RCH error handling into core/ras_rch.c Terry Bowman
  2025-11-04 18:03   ` Jonathan Cameron
@ 2025-11-19  3:20   ` dan.j.williams
  2025-11-19 16:07     ` Bowman, Terry
  1 sibling, 1 reply; 103+ messages in thread
From: dan.j.williams @ 2025-11-19  3:20 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Terry Bowman wrote:
> Restricted CXL Host (RCH) protocol error handling uses a procedure distinct
> from the CXL Virtual Hierarchy (VH) handling. This is because of the
> differences in the RCH and VH topologies. Improve the maintainability and
> add ability to enable/disable RCH handling.
> 
> Move and combine the RCH handling code into a single block conditionally
> compiled with the CONFIG_CXL_RCH_RAS kernel config.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> 
> ---
> 
> Changes in v12->v13:
> - None
> 
> Changes v11->v12:
> - Moved CXL_RCH_RAS Kconfig definition here from following commit.
> 
> Changes v10->v11:
> - New patch
> ---
>  drivers/cxl/Kconfig        |   7 +++
>  drivers/cxl/core/Makefile  |   1 +
>  drivers/cxl/core/core.h    |   5 +-
>  drivers/cxl/core/pci.c     | 115 -----------------------------------
>  drivers/cxl/core/ras_rch.c | 120 +++++++++++++++++++++++++++++++++++++
>  tools/testing/cxl/Kbuild   |   1 +
>  6 files changed, 132 insertions(+), 117 deletions(-)
>  create mode 100644 drivers/cxl/core/ras_rch.c
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index 217888992c88..ffe6ad981434 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -237,4 +237,11 @@ config CXL_RAS
>  	def_bool y
>  	depends on ACPI_APEI_GHES && PCIEAER && CXL_PCI
>  
> +config CXL_RCH_RAS
> +	bool "CXL: Restricted CXL Host (RCH) protocol error handling"
> +	def_bool n

"n" is already the default... but I think this optionality should be
scrapped.

> +	depends on CXL_RAS
> +	help
> +	  RAS support for Restricted CXL Host (RCH) defined in CXL1.1.

I can not imagine an end user or distro ever knowing that they need to
disable or enable this option. What is the motivation for making this
support optional going forward and defaulting RCH error handling off
after all this time?

...does it get in the way of VH error handling?

Otherwise the decluttering of adding a ras_rch.c file looks ok on its
own.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 07/25] CXL/AER: Replace device_lock() in cxl_rch_handle_error_iter() with guard() lock
  2025-11-04 17:02 ` [RESEND v13 07/25] CXL/AER: Replace device_lock() in cxl_rch_handle_error_iter() with guard() lock Terry Bowman
  2025-11-04 18:05   ` Jonathan Cameron
  2025-11-04 19:53   ` Dave Jiang
@ 2025-11-19  3:20   ` dan.j.williams
  2 siblings, 0 replies; 103+ messages in thread
From: dan.j.williams @ 2025-11-19  3:20 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Terry Bowman wrote:
> cxl_rch_handle_error_iter() includes a call to device_lock() using a goto
> for multiple return paths. Improve readability and maintainability by
> using the guard() lock variant.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

LGTM

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 08/25] CXL/AER: Move AER drivers RCH error handling into pcie/aer_cxl_rch.c
  2025-11-04 17:02 ` [RESEND v13 08/25] CXL/AER: Move AER drivers RCH error handling into pcie/aer_cxl_rch.c Terry Bowman
@ 2025-11-19  3:20   ` dan.j.williams
  2025-11-19  8:26     ` Lukas Wunner
  0 siblings, 1 reply; 103+ messages in thread
From: dan.j.williams @ 2025-11-19  3:20 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Terry Bowman wrote:
> The restricted CXL Host (RCH) AER error handling logic currently resides
> in the AER driver file, drivers/pci/pcie/aer.c. CXL specific changes are
> conditionally compiled using #ifdefs.
> 
> Improve the AER driver maintainability by separating the RCH specific logic
> from the AER driver's core functionality and removing the ifdefs. Introduce
> drivers/pci/pcie/aer_cxl_rch.c for moving the RCH AER logic into.

Understood what you meant, but:

"Introduce drivers/pci/pcie/aer_cxl_rch.c for the RCH AER logic."

> Conditionally compile the file using the CONFIG_CXL_RCH_RAS Kconfig.
> 
> Move the CXL logic into the new file but leave helper functions in aer.c
> for now as they will be moved in a future patch for CXL virtual hierarchy
> handling. Export the handler functions as needed. Export
> pci_aer_unmask_internal_errors() so that all subsystems can use it.
> Avoid multiple declaration moves and export cxl_error_is_native() now to
> allow access from cxl_core.
> 
> In order to maintain compilation after the move, other changes are required.
> Change cxl_rch_handle_error() & cxl_rch_enable_rcec() to be non-static
> in order to access them from the AER driver in aer.c.
> 
> Update the new file with the SPDX and 2023 AMD copyright notations because
> the RCH bits were initially contributed in 2023 by AMD.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
> 
> ---
[..]
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index cbaed65577d9..f5f22216bb41 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1130,7 +1130,7 @@ static bool find_source_device(struct pci_dev *parent,
>   * Note: AER must be enabled and supported by the device which must be
>   * checked in advance, e.g. with pcie_aer_is_native().
>   */
> -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> +void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>  {
>  	int aer = dev->aer_cap;
>  	u32 mask;
> @@ -1143,116 +1143,25 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>  	mask &= ~PCI_ERR_COR_INTERNAL;
>  	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
>  }
> +EXPORT_SYMBOL_GPL(pci_aer_unmask_internal_errors);

I can not imagine any other driver but the CXL core consuming this
symbol, so how about:

EXPORT_SYMBOL_FOR_MODULES(pci_aer_unmask_internal_errors, "cxl_core")

...ditto for all the new exports.

[..]
> +EXPORT_SYMBOL_NS_GPL(is_internal_error, "CXL");

Perhaps pci_aer_is_internal()?

Otherwise "is_internal_error()" seems too generic a name for a new
global symbol.

With those fixups:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 11/25] cxl/pci: Log message if RAS registers are unmapped
  2025-11-04 17:02 ` [RESEND v13 11/25] cxl/pci: Log message if RAS registers are unmapped Terry Bowman
@ 2025-11-19  3:27   ` dan.j.williams
  0 siblings, 0 replies; 103+ messages in thread
From: dan.j.williams @ 2025-11-19  3:27 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Terry Bowman wrote:
> The CXL RAS handlers do not currently log if the RAS registers are
> unmapped. This is needed in order to help debug CXL error handling. Update
> the CXL driver to log a warning message if the RAS register block is
> unmapped during RAS error handling.

That does not tell me anything about why this patch is needed, how this
scenario is entered and why catching this late is ok.

I do not have a problem with the change:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

...but I would steer away from patches that just say "add debug, because
debug helps debug".

What is more interesting is a story like:

"I lost a bunch of time figuring out why error handling was not working
only to find that in $scenario the RAS registers are not mapped. Save
the next person time by logging this condition".

Otherwise, if I NAK this patch I have no sense that Linux is any worse
off, and fewer patches is a virtue worth considering.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 08/25] CXL/AER: Move AER drivers RCH error handling into pcie/aer_cxl_rch.c
  2025-11-19  3:20   ` dan.j.williams
@ 2025-11-19  8:26     ` Lukas Wunner
  2025-11-19 23:36       ` dan.j.williams
  0 siblings, 1 reply; 103+ messages in thread
From: Lukas Wunner @ 2025-11-19  8:26 UTC (permalink / raw)
  To: dan.j.williams
  Cc: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci

On Tue, Nov 18, 2025 at 07:20:22PM -0800, dan.j.williams@intel.com wrote:
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -1130,7 +1130,7 @@ static bool find_source_device(struct pci_dev *parent,
> >   * Note: AER must be enabled and supported by the device which must be
> >   * checked in advance, e.g. with pcie_aer_is_native().
> >   */
> > -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> > +void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> >  {
> >  	int aer = dev->aer_cap;
> >  	u32 mask;
> > @@ -1143,116 +1143,25 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> >  	mask &= ~PCI_ERR_COR_INTERNAL;
> >  	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
> >  }
> > +EXPORT_SYMBOL_GPL(pci_aer_unmask_internal_errors);
> 
> I can not imagine any other driver but the CXL core consuming this
> symbol, so how about:
> 
> EXPORT_SYMBOL_FOR_MODULES(pci_aer_unmask_internal_errors, "cxl_core")

The "xe" driver needs to unmask Uncorrectable Internal Errors
(the default is "masked" per PCIe r7.0 sec 7.8.4.3) and could
take advantage of this helper, so I've asked Terry to keep it
available for anyone to use:

https://lore.kernel.org/all/aK66OcdL4Meb0wFt@wunner.de/
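
For readers unfamiliar with the helper, unmasking amounts to a
read-modify-write that clears the "internal error" bit in each AER mask
register. Below is a user-space sketch of that bit manipulation only. The
struct and function names are hypothetical, and the two bit values are
assumed to match PCI_ERR_UNC_INTN and PCI_ERR_COR_INTERNAL in pci_regs.h.
The real helper reads and writes config space instead of a struct.

```c
#include <stdint.h>

/* Assumed to mirror PCI_ERR_UNC_INTN and PCI_ERR_COR_INTERNAL. */
#define ERR_UNC_INTN     0x00400000u
#define ERR_COR_INTERNAL 0x00004000u

/* Hypothetical model of the two AER mask registers. */
struct aer_masks {
	uint32_t uncor_mask;
	uint32_t cor_mask;
};

/* A set mask bit suppresses reporting, so clearing it unmasks the error. */
static void unmask_internal_errors(struct aer_masks *m)
{
	m->uncor_mask &= ~ERR_UNC_INTN;
	m->cor_mask &= ~ERR_COR_INTERNAL;
}
```

All other mask bits are left untouched, which is why the helper can be
shared by any driver that wants internal errors reported.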

Thanks,

Lukas

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 02/25] PCI/CXL: Introduce pcie_is_cxl()
  2025-11-19  3:19   ` dan.j.williams
@ 2025-11-19 15:55     ` Bowman, Terry
  2025-11-19 23:34       ` dan.j.williams
  0 siblings, 1 reply; 103+ messages in thread
From: Bowman, Terry @ 2025-11-19 15:55 UTC (permalink / raw)
  To: dan.j.williams, dave, jonathan.cameron, dave.jiang,
	alison.schofield, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci



On 11/18/2025 9:19 PM, dan.j.williams@intel.com wrote:
> Terry Bowman wrote:
>> CXL and AER drivers need the ability to identify CXL devices.
>>
>> Introduce set_pcie_cxl() with logic checking for CXL.mem or CXL.cache
>> status in the CXL Flexbus DVSEC status register. The CXL Flexbus DVSEC
>> presence is used because it is required for all the CXL PCIe devices.[1]
>>
>> Add boolean 'struct pci_dev::is_cxl' to cache the CXL.cache and
>> CXL.mem status.
>>
>> In the case the device is an EP or USP, call set_pcie_cxl() on behalf of
>> the parent downstream device. Once a device is created there is a
>> possibility the parent training or CXL state was updated as well. This
>> will make certain the correct parent CXL state is cached.
>>
>> Add function pcie_is_cxl() to return 'struct pci_dev::is_cxl'.
>>
>> [1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
>>     Capability (DVSEC) ID Assignment, Table 8-2
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
>> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Reviewed-by: Alejandro Lucero <alucerop@amd.com>
>> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
>>
>> ---
>>
>> Changes in v12->v13:
>> - Add Ben's "reviewed-by"
>>
>> Changes in v11->v12:
>> - Add review-by for Alejandro
>> - Add comment in set_pcie_cxl() explaining why updating parent status.
>>
>> Changes in v10->v11:
>> - Amend set_pcie_cxl() to check for Upstream Port's and EP's parent
>>   downstream port by calling set_pcie_cxl(). (Dan)
>> - Retitle patch: 'Add' -> 'Introduce'
>> - Add check for CXL.mem and CXL.cache (Alejandro, Dan)
> [..]
>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>> index 0ce98e18b5a8..63124651f865 100644
>> --- a/drivers/pci/probe.c
>> +++ b/drivers/pci/probe.c
>> @@ -1709,6 +1709,33 @@ static void set_pcie_thunderbolt(struct pci_dev *dev)
>>  		dev->is_thunderbolt = 1;
>>  }
>>  
>> +static void set_pcie_cxl(struct pci_dev *dev)
>> +{
>> +	struct pci_dev *parent;
>> +	u16 dvsec = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
>> +					      PCI_DVSEC_CXL_FLEXBUS_PORT);
>> +	if (dvsec) {
>> +		u16 cap;
>> +
>> +		pci_read_config_word(dev, dvsec + PCI_DVSEC_CXL_FLEXBUS_STATUS_OFFSET, &cap);
>> +
>> +		dev->is_cxl = FIELD_GET(PCI_DVSEC_CXL_FLEXBUS_STATUS_CACHE_MASK, cap) ||
>> +			FIELD_GET(PCI_DVSEC_CXL_FLEXBUS_STATUS_MEM_MASK, cap);
>> +	}
>> +
>> +	if (!pci_is_pcie(dev) ||
>> +	    !(pci_pcie_type(dev) == PCI_EXP_TYPE_ENDPOINT ||
>> +	      pci_pcie_type(dev) == PCI_EXP_TYPE_UPSTREAM))
>> +		return;
> Why are downstream ports excluded?
I thought we only need to check the upstream 'parent' if dev is an EP
or USP, as those are the only PCIe types in CXL that interface directly
to the upstream dport device. And it's the upstream dport device that must
be checked to ensure it has the correct is_cxl setting.
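
To make that reasoning concrete, here is a user-space sketch of the two
decisions in set_pcie_cxl() as I understand them. It is not the patch
itself. The bit positions (cache at bit 0, mem at bit 2 of the Flex Bus
Port Status register) are an assumption to verify against the spec, and
the enum and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit positions in the Flex Bus Port Status register. */
#define FLEXBUS_STATUS_CACHE (1u << 0)
#define FLEXBUS_STATUS_MEM   (1u << 2)

/* Hypothetical stand-in for the PCIe port type. */
enum port_type {
	TYPE_ENDPOINT,
	TYPE_ROOT_PORT,
	TYPE_UPSTREAM,
	TYPE_DOWNSTREAM,
};

/* A device counts as CXL if either CXL.cache or CXL.mem is enabled. */
static bool status_is_cxl(uint16_t status)
{
	return (status & (FLEXBUS_STATUS_CACHE | FLEXBUS_STATUS_MEM)) != 0;
}

/* Only EPs and USPs sit directly below a downstream/root port. */
static bool should_refresh_parent(enum port_type type)
{
	return type == TYPE_ENDPOINT || type == TYPE_UPSTREAM;
}
```

The second function is the filter being questioned here: a DSP or Root
Port enumerated on its own never triggers a refresh of anything above it.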

Do I need to update is_cxl for USP in the case of DSP-USP topology? 

Terry
>> +
>> +	/*
>> +	 * Update parent's CXL state because alternate protocol training
>> +	 * may have changed
>> +	 */
>> +	parent = pci_upstream_bridge(dev);
> This parent is a downstream port...


Terry

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 06/25] cxl: Move CXL driver's RCH error handling into core/ras_rch.c
  2025-11-19  3:20   ` dan.j.williams
@ 2025-11-19 16:07     ` Bowman, Terry
  0 siblings, 0 replies; 103+ messages in thread
From: Bowman, Terry @ 2025-11-19 16:07 UTC (permalink / raw)
  To: dan.j.williams, dave, jonathan.cameron, dave.jiang,
	alison.schofield, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci



On 11/18/2025 9:20 PM, dan.j.williams@intel.com wrote:
> Terry Bowman wrote:
>> Restricted CXL Host (RCH) protocol error handling uses a procedure distinct
>> from the CXL Virtual Hierarchy (VH) handling. This is because of the
>> differences in the RCH and VH topologies. Improve the maintainability and
>> add ability to enable/disable RCH handling.
>>
>> Move and combine the RCH handling code into a single block conditionally
>> compiled with the CONFIG_CXL_RCH_RAS kernel config.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>
>> ---
>>
>> Changes in v12->v13:
>> - None
>>
>> Changes v11->v12:
>> - Moved CXL_RCH_RAS Kconfig definition here from following commit.
>>
>> Changes v10->v11:
>> - New patch
>> ---
>>  drivers/cxl/Kconfig        |   7 +++
>>  drivers/cxl/core/Makefile  |   1 +
>>  drivers/cxl/core/core.h    |   5 +-
>>  drivers/cxl/core/pci.c     | 115 -----------------------------------
>>  drivers/cxl/core/ras_rch.c | 120 +++++++++++++++++++++++++++++++++++++
>>  tools/testing/cxl/Kbuild   |   1 +
>>  6 files changed, 132 insertions(+), 117 deletions(-)
>>  create mode 100644 drivers/cxl/core/ras_rch.c
>>
>> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
>> index 217888992c88..ffe6ad981434 100644
>> --- a/drivers/cxl/Kconfig
>> +++ b/drivers/cxl/Kconfig
>> @@ -237,4 +237,11 @@ config CXL_RAS
>>  	def_bool y
>>  	depends on ACPI_APEI_GHES && PCIEAER && CXL_PCI
>>  
>> +config CXL_RCH_RAS
>> +	bool "CXL: Restricted CXL Host (RCH) protocol error handling"
>> +	def_bool n
> "n" is already the default... but I think this optionality should be
> scrapped.

Ok
>> +	depends on CXL_RAS
>> +	help
>> +	  RAS support for Restricted CXL Host (RCH) defined in CXL1.1.
> I can not imagine an end user or distro ever knowing that they need to
> disable or enable this option. What is the motivation for making this
> support optional going forward and defaulting RCH error handling off
> after all this time?
>
> ...does it get in the way of VH error handling?
>
> Otherwise the decluttering of adding a ras_rch.c file looks ok on its
> own.
No, it does not get in the way of VH. I wasn't certain which to use, 'y' or 'n'. 
I will remove the option and use default as you mentioned.

Terry

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 12/25] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
  2025-11-04 17:02 ` [RESEND v13 12/25] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports Terry Bowman
@ 2025-11-19 21:23   ` dan.j.williams
  2025-11-19 22:02     ` Bowman, Terry
  0 siblings, 1 reply; 103+ messages in thread
From: dan.j.williams @ 2025-11-19 21:23 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Terry Bowman wrote:
> CXL currently has separate trace routines for CXL Port errors and CXL
> Endpoint errors. This is inconvenient for the user because they must enable
> 2 sets of trace routines. Make updates to the trace logging such that a
> single trace routine logs both CXL Endpoint and CXL Port protocol errors.

No, this is not inconvenient, this is required for compatible evolution of
tracepoints. The change in this patch breaks compatibility as it
violates the expectation that the type and order of TP_ARGS does not
change from one kernel to the next.

> Keep the trace log fields 'memdev' and 'host'. While these are not accurate
> for non-Endpoints the fields will remain as-is to prevent breaking
> userspace RAS trace consumers.
> 
> Add serial number parameter to the trace logging. This is used for EPs
> and 0 is provided for CXL port devices without a serial number.
> 
> Leave the correctable and uncorrectable trace routines' TP_STRUCT__entry()
> unchanged with respect to member data types and order.
> 
> Below is output of correctable and uncorrectable protocol error logging.
> CXL Root Port and CXL Endpoint examples are included below.
> 
> Root Port:
> cxl_aer_correctable_error: memdev=0000:0c:00.0 host=pci0000:0c serial: 0 status='CRC Threshold Hit'
> cxl_aer_uncorrectable_error: memdev=0000:0c:00.0 host=pci0000:0c serial: 0 status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'

A root port is not a "memdev", another awkward side effect of trying to
combine 2 trace points with different use cases.

So a NAK from me for this change (unless there is a strong reason for
Linux to inflict the compat breakage). Please keep the separate
tracepoints; they are for distinctly different use cases. A memdev
protocol error is contained to that memdev, a port protocol error
implicates every CXL.cachemem descendant of that port.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 13/25] cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
  2025-11-04 17:02 ` [RESEND v13 13/25] cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors Terry Bowman
  2025-11-05  8:30   ` Alejandro Lucero Palau
@ 2025-11-19 22:00   ` dan.j.williams
  1 sibling, 0 replies; 103+ messages in thread
From: dan.j.williams @ 2025-11-19 22:00 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Terry Bowman wrote:
> Update cxl_handle_cor_ras() to exit early in the case there are no RAS
> errors detected after applying the status mask. This change will make
> the correctable handler's implementation consistent with the uncorrectable
> handler, cxl_handle_ras().
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
> 
> ---
> 
> Changes v12->v13:
> - Added Ben's review-by
> 
> Changes v11->v12:
> - None
> 
> Changes v10->v11:
> - Added Dave Jiang and Jonathan Cameron's review-by
> - Changes moved to core/ras.c
> ---
>  drivers/cxl/core/ras.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)

Is there more motivation to this besides cxl_handle_ras() symmetry?
Something like, "in preparation for adding more logic when errors are
present..."
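
For reference, the early-return shape under discussion looks roughly like
the user-space sketch below. The register is modeled as a plain variable,
the seven-bit mask is an assumption about CXL_RAS_CORRECTABLE_STATUS_MASK,
and handle_cor_status() is a hypothetical name.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed width of the correctable status field (low seven bits). */
#define COR_STATUS_MASK 0x7fu

/*
 * Returns true if an error was handled. Exits early when no status
 * bit inside the mask is set, leaving room for more logic in the
 * error-present path below.
 */
static bool handle_cor_status(uint32_t *reg)
{
	uint32_t status = *reg & COR_STATUS_MASK;

	if (!status)
		return false;

	/* Models write-1-to-clear: drop exactly the bits we observed. */
	*reg &= ~status;
	return true;
}
```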

Otherwise, LGTM

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 12/25] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
  2025-11-19 21:23   ` dan.j.williams
@ 2025-11-19 22:02     ` Bowman, Terry
  2025-11-19 23:40       ` dan.j.williams
  0 siblings, 1 reply; 103+ messages in thread
From: Bowman, Terry @ 2025-11-19 22:02 UTC (permalink / raw)
  To: dan.j.williams, dave, jonathan.cameron, dave.jiang,
	alison.schofield, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci



On 11/19/2025 3:23 PM, dan.j.williams@intel.com wrote:
> Terry Bowman wrote:
>> CXL currently has separate trace routines for CXL Port errors and CXL
>> Endpoint errors. This is inconvenient for the user because they must enable
>> 2 sets of trace routines. Make updates to the trace logging such that a
>> single trace routine logs both CXL Endpoint and CXL Port protocol errors.
> No, this is not inconvenient, this is required for compatible evolution of
> tracepoints. The change in this patch breaks compatibility as it
> violates the expectation that the type and order of TP_ARGS do not change
> from one kernel to the next.
>
>> Keep the trace log fields 'memdev' and 'host'. While these are not accurate
>> for non-Endpoints the fields will remain as-is to prevent breaking
>> userspace RAS trace consumers.
>>
>> Add serial number parameter to the trace logging. This is used for EPs
>> and 0 is provided for CXL port devices without a serial number.
>>
>> Leave the correctable and uncorrectable trace routines' TP_STRUCT__entry()
>> unchanged with respect to member data types and order.
>>
>> Below is output of correctable and uncorrectable protocol error logging.
>> CXL Root Port and CXL Endpoint examples are included below.
>>
>> Root Port:
>> cxl_aer_correctable_error: memdev=0000:0c:00.0 host=pci0000:0c serial: 0 status='CRC Threshold Hit'
>> cxl_aer_uncorrectable_error: memdev=0000:0c:00.0 host=pci0000:0c serial: 0 status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
> A root port is not a "memdev", another awkward side effect of trying to
> combine 2 trace points with different use cases.
>
> So a NAK from me for this change (unless there is a strong reason for
> Linux to inflict the compat breakage), please keep the separate
> tracepoints; they are for distinctly different use cases. A memdev
> protocol error is contained to that memdev, a port protocol error
> implicates every CXL.cachemem descendant of that port.
I misunderstood this comment from previous code review:
https://lore.kernel.org/linux-cxl/67aea897cfe55_2d1e294ca@dwillia2-xfh.jf.intel.com.notmuch/#t

Are you OK with the following format for Port devices? Or let me know what format is needed.
cxl_port_aer_correctable_error: device=port1 parent=root0 status='Received Error From Physical Layer'
cxl_port_aer_uncorrectable_error: device=port1 parent=root0 status: 'Memory Byte Enable Parity Error' first_error: 'Memory Byte Enable Parity Erro'

Terry





^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 02/25] PCI/CXL: Introduce pcie_is_cxl()
  2025-11-19 15:55     ` Bowman, Terry
@ 2025-11-19 23:34       ` dan.j.williams
  0 siblings, 0 replies; 103+ messages in thread
From: dan.j.williams @ 2025-11-19 23:34 UTC (permalink / raw)
  To: Bowman, Terry, dan.j.williams, dave, jonathan.cameron, dave.jiang,
	alison.schofield, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci

Bowman, Terry wrote:
> 
> 
> On 11/18/2025 9:19 PM, dan.j.williams@intel.com wrote:
> > Terry Bowman wrote:
> >> CXL and AER drivers need the ability to identify CXL devices.
> >>
> >> Introduce set_pcie_cxl() with logic checking for CXL.mem or CXL.cache
> >> status in the CXL Flexbus DVSEC status register. The CXL Flexbus DVSEC
> >> presence is used because it is required for all the CXL PCIe devices.[1]
> >>
> >> Add boolean 'struct pci_dev::is_cxl' to cache the CXL.cache and
> >> CXL.mem status.
> >>
> >> In the case the device is an EP or USP, call set_pcie_cxl() on behalf of
> >> the parent downstream device. Once a device is created there is a
> >> possibility the parent training or CXL state was updated as well. This
> >> will make certain the correct parent CXL state is cached.
> >>
> >> Add function pcie_is_cxl() to return 'struct pci_dev::is_cxl'.
> >>
> >> [1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
> >>     Capability (DVSEC) ID Assignment, Table 8-2
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> >> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> >> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> >> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >> Reviewed-by: Alejandro Lucero <alucerop@amd.com>
> >> Reviewed-by: Ben Cheatham <benjamin.cheatham@amd.com>
> >>
> >> ---
> >>
> >> Changes in v12->v13:
> >> - Add Ben's "reviewed-by"
> >>
> >> Changes in v11->v12:
> >> - Add review-by for Alejandro
> >> - Add comment in set_pcie_cxl() explaining why updating parent status.
> >>
> >> Changes in v10->v11:
> >> - Amend set_pcie_cxl() to check for Upstream Port's and EP's parent
> >>   downstream port by calling set_pcie_cxl(). (Dan)
> >> - Retitle patch: 'Add' -> 'Introduce'
> >> - Add check for CXL.mem and CXL.cache (Alejandro, Dan)
> > [..]
> >> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> >> index 0ce98e18b5a8..63124651f865 100644
> >> --- a/drivers/pci/probe.c
> >> +++ b/drivers/pci/probe.c
> >> @@ -1709,6 +1709,33 @@ static void set_pcie_thunderbolt(struct pci_dev *dev)
> >>  		dev->is_thunderbolt = 1;
> >>  }
> >>  
> >> +static void set_pcie_cxl(struct pci_dev *dev)
> >> +{
> >> +	struct pci_dev *parent;
> >> +	u16 dvsec = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
> >> +					      PCI_DVSEC_CXL_FLEXBUS_PORT);
> >> +	if (dvsec) {
> >> +		u16 cap;
> >> +
> >> +		pci_read_config_word(dev, dvsec + PCI_DVSEC_CXL_FLEXBUS_STATUS_OFFSET, &cap);
> >> +
> >> +		dev->is_cxl = FIELD_GET(PCI_DVSEC_CXL_FLEXBUS_STATUS_CACHE_MASK, cap) ||
> >> +			FIELD_GET(PCI_DVSEC_CXL_FLEXBUS_STATUS_MEM_MASK, cap);
> >> +	}
> >> +
> >> +	if (!pci_is_pcie(dev) ||
> >> +	    !(pci_pcie_type(dev) == PCI_EXP_TYPE_ENDPOINT ||
> >> +	      pci_pcie_type(dev) == PCI_EXP_TYPE_UPSTREAM))
> >> +		return;
> > Why are downstream ports excluded?
> I thought we only need to check the upstream 'parent' if dev is an EP 
> or USP as those are the only PCIe types in CXL that interface directly 
> to the upstream dport device.

Yes, but in all cases the device upstream of an endpoint or an upstream
port is a PCI_EXP_TYPE_ROOT_PORT or PCI_EXP_TYPE_DOWNSTREAM device. So I
do not understand why this function needs to make any exclusions.

> And it's the upstream dport device that must be checked to ensure it
> has the correct is_cxl setting.

...but it will never have the correct is_cxl setting because that 'if
()' statement precludes the update.

> Do I need to update is_cxl for USP in the case of DSP-USP topology? 

I think the only change needed is to drop that early return 'if ()'
statement altogether.
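
To make the suggestion concrete, here is a rough userspace model of the
helper with the port-type filter removed. It is only a sketch under
stated assumptions: 'struct sketch_dev', the 'sketch_' functions, and
the two status bit values are hypothetical stand-ins, and the part of
set_pcie_cxl() that follows the early return is not quoted above, so
the "refresh the upstream bridge" step is a guess at its intent rather
than the patch's actual code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed bit positions for Cache_Enabled and Mem_Enabled; illustrative. */
#define SKETCH_STATUS_CACHE	0x0001u
#define SKETCH_STATUS_MEM	0x0004u

/* Hypothetical stand-in for the few struct pci_dev fields involved. */
struct sketch_dev {
	struct sketch_dev *upstream;	/* NULL at the top of the hierarchy */
	bool has_flexbus_dvsec;		/* models pci_find_dvsec_capability() != 0 */
	uint16_t flexbus_status;	/* models the Flexbus DVSEC status word */
	bool is_cxl;			/* the cached result */
};

/* Re-read the device's own status word and cache the result. */
static void sketch_read_cxl_state(struct sketch_dev *dev)
{
	if (!dev->has_flexbus_dvsec)
		return;

	dev->is_cxl = dev->flexbus_status &
		      (SKETCH_STATUS_CACHE | SKETCH_STATUS_MEM);
}

static void sketch_set_pcie_cxl(struct sketch_dev *dev)
{
	sketch_read_cxl_state(dev);

	/* No port-type filter: refresh the upstream bridge whenever one exists. */
	if (dev->upstream)
		sketch_read_cxl_state(dev->upstream);
}
```

In this model a bridge that was enumerated before link training
completed picks up the right is_cxl value as soon as any device below
it is enumerated, regardless of that device's port type.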

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 08/25] CXL/AER: Move AER drivers RCH error handling into pcie/aer_cxl_rch.c
  2025-11-19  8:26     ` Lukas Wunner
@ 2025-11-19 23:36       ` dan.j.williams
  0 siblings, 0 replies; 103+ messages in thread
From: dan.j.williams @ 2025-11-19 23:36 UTC (permalink / raw)
  To: Lukas Wunner, dan.j.williams
  Cc: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci

Lukas Wunner wrote:
> On Tue, Nov 18, 2025 at 07:20:22PM -0800, dan.j.williams@intel.com wrote:
> > > +++ b/drivers/pci/pcie/aer.c
> > > @@ -1130,7 +1130,7 @@ static bool find_source_device(struct pci_dev *parent,
> > >   * Note: AER must be enabled and supported by the device which must be
> > >   * checked in advance, e.g. with pcie_aer_is_native().
> > >   */
> > > -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> > > +void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> > >  {
> > >  	int aer = dev->aer_cap;
> > >  	u32 mask;
> > > @@ -1143,116 +1143,25 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> > >  	mask &= ~PCI_ERR_COR_INTERNAL;
> > >  	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
> > >  }
> > > +EXPORT_SYMBOL_GPL(pci_aer_unmask_internal_errors);
> > 
> > I can not imagine any other driver but the CXL core consuming this
> > symbol, so how about:
> > 
> > EXPORT_SYMBOL_FOR_MODULES(pci_aer_unmask_internal_errors, "cxl_core")
> 
> The "xe" driver needs to unmask Uncorrectable Internal Errors
> (the default is "masked" per PCIe r7.0 sec 7.8.4.3) and could
> take advantage of this helper, so I've asked Terry to keep it
> available for anyone to use:
> 
> https://lore.kernel.org/all/aK66OcdL4Meb0wFt@wunner.de/

Ok. I would not say no to capturing that future use case detail in the
changelog.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 12/25] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
  2025-11-19 22:02     ` Bowman, Terry
@ 2025-11-19 23:40       ` dan.j.williams
  2025-11-21 14:56         ` Bowman, Terry
  0 siblings, 1 reply; 103+ messages in thread
From: dan.j.williams @ 2025-11-19 23:40 UTC (permalink / raw)
  To: Bowman, Terry, dan.j.williams, dave, jonathan.cameron, dave.jiang,
	alison.schofield, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci

Bowman, Terry wrote:
> 
> 
> On 11/19/2025 3:23 PM, dan.j.williams@intel.com wrote:
> > Terry Bowman wrote:
> >> CXL currently has separate trace routines for CXL Port errors and CXL
> >> Endpoint errors. This is inconvenient for the user because they must enable
> >> 2 sets of trace routines. Make updates to the trace logging such that a
> >> single trace routine logs both CXL Endpoint and CXL Port protocol errors.
> > No, this is not inconvenient, this is required for compatible evolution of
> > tracepoints. The change in this patch breaks compatibility as it
> > violates the expectation that the type and order of TP_ARGS do not change
> > from one kernel to the next.
> >
> >> Keep the trace log fields 'memdev' and 'host'. While these are not accurate
> >> for non-Endpoints the fields will remain as-is to prevent breaking
> >> userspace RAS trace consumers.
> >>
> >> Add serial number parameter to the trace logging. This is used for EPs
> >> and 0 is provided for CXL port devices without a serial number.
> >>
> >> Leave the correctable and uncorrectable trace routines' TP_STRUCT__entry()
> >> unchanged with respect to member data types and order.
> >>
> >> Below is output of correctable and uncorrectable protocol error logging.
> >> CXL Root Port and CXL Endpoint examples are included below.
> >>
> >> Root Port:
> >> cxl_aer_correctable_error: memdev=0000:0c:00.0 host=pci0000:0c serial: 0 status='CRC Threshold Hit'
> >> cxl_aer_uncorrectable_error: memdev=0000:0c:00.0 host=pci0000:0c serial: 0 status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
> > A root port is not a "memdev", another awkward side effect of trying to
> > combine 2 trace points with different use cases.
> >
> > So a NAK from me for this change (unless there is a strong reason for
> > Linux to inflict the compat breakage), please keep the separate
> > tracepoints; they are for distinctly different use cases. A memdev
> > protocol error is contained to that memdev, a port protocol error
> > implicates every CXL.cachemem descendant of that port.
> I misunderstood this comment from previous code review:
> https://lore.kernel.org/linux-cxl/67aea897cfe55_2d1e294ca@dwillia2-xfh.jf.intel.com.notmuch/#t

No, you did not misunderstand, I just did not realize at the time I was
asking for compatibility breakage with that suggestion. Apologies for
that thrash.

> Are you OK with the following format for Port devices? Or let me know what format is needed.
> cxl_port_aer_correctable_error: device=port1 parent=root0 status='Received Error From Physical Layer'
> cxl_port_aer_uncorrectable_error: device=port1 parent=root0 status: 'Memory Byte Enable Parity Error' first_error: 'Memory Byte Enable Parity Erro'

That looks good to me.

Also, I realize this patch set has gone through many revisions. We
really need to get at least some of these pre-req patches into a topic
branch so they do not need to keep being sent out in this large series.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 15/25] CXL/PCI: Introduce PCI_ERS_RESULT_PANIC
  2025-11-04 17:02 ` [RESEND v13 15/25] CXL/PCI: Introduce PCI_ERS_RESULT_PANIC Terry Bowman
  2025-11-04 19:03   ` Bjorn Helgaas
@ 2025-11-20  0:17   ` dan.j.williams
  1 sibling, 0 replies; 103+ messages in thread
From: dan.j.williams @ 2025-11-20  0:17 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Terry Bowman wrote:
> The CXL driver's error handling for uncorrectable errors (UCE) will be
> updated in the future. A required change is for the error handlers to
> force a system panic when a UCE is detected.
> 
> Introduce PCI_ERS_RESULT_PANIC as a 'enum pci_ers_result' type. This will
> be used by CXL UCE fatal and non-fatal recovery in future patches. Update
> PCIe recovery documentation with details of PCI_ERS_RESULT_PANIC.

I would also note that this starts to bring Linux PCI core native AER
handling in line with ACPI GHES CPER_SEV_FATAL handling.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 16/25] CXL/AER: Introduce pcie/aer_cxl_vh.c in AER driver for forwarding CXL errors
  2025-11-04 17:02 ` [RESEND v13 16/25] CXL/AER: Introduce pcie/aer_cxl_vh.c in AER driver for forwarding CXL errors Terry Bowman
@ 2025-11-20  0:44   ` dan.j.williams
  2025-11-20  0:53   ` dan.j.williams
  1 sibling, 0 replies; 103+ messages in thread
From: dan.j.williams @ 2025-11-20  0:44 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Terry Bowman wrote:
> CXL virtual hierarchy (VH) RAS handling for CXL Port devices will be added
> soon. This requires a notification mechanism for the AER driver to share
> the AER interrupt with the CXL driver. The notification will be used as an
> indication for the CXL drivers to handle and log the CXL RAS errors.
> 
> Note, 'CXL protocol error' terminology will refer to CXL VH and not
> CXL RCH errors unless specifically noted going forward.
> 
> Introduce a new file in the AER driver to handle the CXL protocol errors
> named pci/pcie/aer_cxl_vh.c.
> 
> Add a kfifo work queue to be used by the AER and CXL drivers. The AER
> driver will be the sole kfifo producer adding work and the cxl_core will be
> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
> Encapsulate the kfifo, RW semaphore, and work pointer in a single structure.
> 
> Add CXL work queue handler registration functions in the AER driver. Export
> the functions allowing CXL driver to access. Implement registration
> functions for the CXL driver to assign or clear the work handler function.
> Synchronize accesses using the RW semaphore.
> 
> Introduce 'struct cxl_proto_err_work_data' to serve as the kfifo work data.
> This will contain a reference to the erring PCI device and the error
> severity. This will be used when the work is dequeued by the cxl_core driver.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>

Some small things to fixup.

> diff --git a/drivers/pci/pcie/aer_cxl_vh.c b/drivers/pci/pcie/aer_cxl_vh.c
> new file mode 100644
> index 000000000000..5dbc81341dc4
> --- /dev/null
> +++ b/drivers/pci/pcie/aer_cxl_vh.c
> @@ -0,0 +1,95 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
> +
> +#include <linux/pci.h>
> +#include <linux/aer.h>
> +#include <linux/pci.h>
> +#include <linux/bitfield.h>
> +#include <linux/kfifo.h>
> +#include "../pci.h"
> +
> +#define CXL_ERROR_SOURCES_MAX          128
> +
> +struct cxl_proto_err_kfifo {
> +	struct work_struct *work;
> +	struct rw_semaphore rw_sema;
> +	DECLARE_KFIFO(fifo, struct cxl_proto_err_work_data,
> +		      CXL_ERROR_SOURCES_MAX);
> +};
> +
> +static struct cxl_proto_err_kfifo cxl_proto_err_kfifo = {
> +	.rw_sema = __RWSEM_INITIALIZER(cxl_proto_err_kfifo.rw_sema)
> +};
> +
> +bool cxl_error_is_native(struct pci_dev *dev)
> +{
> +	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> +
> +	return (pcie_ports_native || host->native_aer);

This function always confuses me because there is zero "cxl" inside this
function. Something to comment on later so I am not scratching my head
the next time this function is touched.

> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_error_is_native, "CXL");

Why is this exported? All of the consumers are local to
drivers/pci/pcie/built-in.a.

> +
> +bool is_internal_error(struct aer_err_info *info)
> +{
> +	if (info->severity == AER_CORRECTABLE)
> +		return info->status & PCI_ERR_COR_INTERNAL;
> +
> +	return info->status & PCI_ERR_UNC_INTN;
> +}
> +EXPORT_SYMBOL_NS_GPL(is_internal_error, "CXL");

Ditto on the export, and I do not see it getting used anywhere later in
the series.

Also, this is so tiny that if anything else wanted to use it just make
it a static inline.
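
For illustration only, a static inline version could look roughly like
the following. This is a standalone userspace sketch, not the patch:
'struct aer_err_info_sketch' is a hypothetical stand-in for the two
fields of 'struct aer_err_info' that the helper reads, and the bit and
severity values are copied here on the assumption that they match
include/uapi/linux/pci_regs.h and include/linux/aer.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the kernel definitions; verify against the tree. */
#define PCI_ERR_UNC_INTN	0x00400000u	/* Uncorrectable Internal Error */
#define PCI_ERR_COR_INTERNAL	0x00004000u	/* Corrected Internal Error */
#define AER_CORRECTABLE		2

/* Hypothetical stand-in for the fields of struct aer_err_info used here. */
struct aer_err_info_sketch {
	unsigned int severity;
	uint32_t status;
};

/* Small enough to live in a header shared by the AER-internal users. */
static inline bool is_internal_error(const struct aer_err_info_sketch *info)
{
	if (info->severity == AER_CORRECTABLE)
		return info->status & PCI_ERR_COR_INTERNAL;

	return info->status & PCI_ERR_UNC_INTN;
}
```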

> +
> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
> +{
> +	if (!info || !info->is_cxl)
> +		return false;
> +
> +	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT)
> +		return false;
> +
> +	return is_internal_error(info);
> +}
> +EXPORT_SYMBOL_NS_GPL(is_cxl_error, "CXL");

No consumers for this exported symbol.

> +
> +void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info)
> +{
> +	struct cxl_proto_err_work_data wd = (struct cxl_proto_err_work_data) {
> +		.severity = info->severity,
> +		.pdev = pdev
> +	};
> +
> +	guard(rwsem_write)(&cxl_proto_err_kfifo.rw_sema);

This guard can be downgraded to rwsem_read. This only needs to make sure
that the kfifo remains registered for the duration of the function.

> +
> +	if (!cxl_proto_err_kfifo.work) {
> +		dev_warn_once(&pdev->dev, "CXL driver is unregistered. Unable to forward error.");

I would combine this with the following ratelimited message because they
are effectively the same thing. "Hey admin, I see some errors but the
driver to handle them is gone, or out to lunch." The reason to combine
them is that you probably want this message to catch dropped errors
without failure, and this dev_warn_once() starts failing after the first
invocation.

> +		return;
> +	}
> +
> +	if (!kfifo_put(&cxl_proto_err_kfifo.fifo, wd)) {
> +		dev_err_ratelimited(&pdev->dev, "AER-CXL kfifo overflow\n");
> +		return;
> +	}
> +
> +	schedule_work(cxl_proto_err_kfifo.work);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_forward_error, "CXL");

No consumer for this export.

> +
> +void cxl_register_proto_err_work(struct work_struct *work)
> +{
> +	guard(rwsem_write)(&cxl_proto_err_kfifo.rw_sema);
> +	cxl_proto_err_kfifo.work = work;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_register_proto_err_work, "CXL");

Oh hey, the rest of these exports make sense.

...but I do think you can go back and remove

 bool is_internal_error(struct aer_err_info *info);
 bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info);
 void cxl_forward_error(struct pci_dev *pdev, struct aer_err_info *info);

...from pci.h, and move them to an aer internal header like
drivers/pci/pcie/portdrv.h.
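
As a rough model of the cxl_forward_error() comment about combining
the two messages, here is a standalone userspace sketch in which both
failure cases (no handler registered, fifo full) take one reporting
path. Everything in it is hypothetical: the 'sketch_' names, the
fixed-size ring standing in for the kfifo, and the 'dropped' counter
standing in for a single ratelimited message. Locking is left out; in
the kernel the function body would run under the read side of the
rwsem as noted above.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_FIFO_SIZE 4	/* small on purpose; the patch uses 128 */

struct sketch_work_data {
	int severity;
	const void *pdev;	/* opaque stand-in for struct pci_dev * */
};

struct sketch_fifo {
	bool handler_registered;	/* models cxl_proto_err_kfifo.work != NULL */
	size_t head, tail;		/* head: next write, tail: next read */
	struct sketch_work_data slots[SKETCH_FIFO_SIZE];
	unsigned long dropped;		/* models the one ratelimited message */
};

/* Returns false when the ring is full, like kfifo_put() returning 0. */
static bool sketch_fifo_put(struct sketch_fifo *f, struct sketch_work_data wd)
{
	if (f->head - f->tail == SKETCH_FIFO_SIZE)
		return false;

	f->slots[f->head++ % SKETCH_FIFO_SIZE] = wd;
	return true;
}

/* Returns true if the error was queued for the consumer. */
static bool sketch_forward_error(struct sketch_fifo *f,
				 struct sketch_work_data wd)
{
	/* Either failure means the error is dropped, so report both the same way. */
	if (!f->handler_registered || !sketch_fifo_put(f, wd)) {
		f->dropped++;	/* kernel: a single dev_err_ratelimited() */
		return false;
	}

	return true;	/* kernel: schedule_work() would follow here */
}
```

With a single path, every dropped error is accounted for, instead of
only the first one as with dev_warn_once().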

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 16/25] CXL/AER: Introduce pcie/aer_cxl_vh.c in AER driver for forwarding CXL errors
  2025-11-04 17:02 ` [RESEND v13 16/25] CXL/AER: Introduce pcie/aer_cxl_vh.c in AER driver for forwarding CXL errors Terry Bowman
  2025-11-20  0:44   ` dan.j.williams
@ 2025-11-20  0:53   ` dan.j.williams
  1 sibling, 0 replies; 103+ messages in thread
From: dan.j.williams @ 2025-11-20  0:53 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Terry Bowman wrote:
> CXL virtual hierarchy (VH) RAS handling for CXL Port devices will be added
> soon. This requires a notification mechanism for the AER driver to share
> the AER interrupt with the CXL driver. The notification will be used as an
> indication for the CXL drivers to handle and log the CXL RAS errors.
> 
> Note, 'CXL protocol error' terminology will refer to CXL VH and not
> CXL RCH errors unless specifically noted going forward.
> 
> Introduce a new file in the AER driver to handle the CXL protocol errors
> named pci/pcie/aer_cxl_vh.c.
> 
> Add a kfifo work queue to be used by the AER and CXL drivers. The AER
> driver will be the sole kfifo producer adding work and the cxl_core will be
> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
> Encapsulate the kfifo, RW semaphore, and work pointer in a single structure.
> 
> Add CXL work queue handler registration functions in the AER driver. Export
> the functions allowing CXL driver to access. Implement registration
> functions for the CXL driver to assign or clear the work handler function.
> Synchronize accesses using the RW semaphore.
> 
> Introduce 'struct cxl_proto_err_work_data' to serve as the kfifo work data.
> This will contain a reference to the erring PCI device and the error
> severity. This will be used when the work is dequeued by the cxl_core driver.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
[..]
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 2ef820563996..6b2c87d1b5b6 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -10,6 +10,7 @@
>  
>  #include <linux/errno.h>
>  #include <linux/types.h>
> +#include <linux/workqueue_types.h>

This is minor, but nothing in this header needs anything from
linux/workqueue_types.h.

I would just forward declare and be done, i.e.:

struct work_struct;
void cxl_register_proto_err_work(struct work_struct *work);

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 17/25] cxl: Introduce cxl_pci_drv_bound() to check for bound driver
  2025-11-04 17:02 ` [RESEND v13 17/25] cxl: Introduce cxl_pci_drv_bound() to check for bound driver Terry Bowman
  2025-11-05 17:51   ` Gregory Price
  2025-11-11  8:33   ` Alison Schofield
@ 2025-11-20  1:24   ` dan.j.williams
  2 siblings, 0 replies; 103+ messages in thread
From: dan.j.williams @ 2025-11-20  1:24 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Terry Bowman wrote:
> CXL devices handle protocol errors via driver-specific callbacks rather
> than the generic pci_driver::err_handlers by default. The callbacks are
> implemented in the cxl_pci driver and are not part of struct pci_driver, so
> cxl_core must verify that a device is actually bound to the cxl_pci
> module's driver before invoking the callbacks (the device could be bound
> to another driver, e.g. VFIO).
> 
> However, cxl_core can not reference symbols in the cxl_pci module because
> it creates a circular dependency. This prevents cxl_core from checking the
> EP's bound driver and calling the callbacks.
> 
> To fix this, move drivers/cxl/pci.c into drivers/cxl/core/pci_drv.c and
> build it as part of the cxl_core module. Compile into cxl_core using
> CXL_PCI and CXL_CORE Kconfig dependencies. This removes the standalone
> cxl_pci module, consolidates the cxl_pci driver code into cxl_core, and
> eliminates the circular dependency so cxl_core can safely perform
> bound-driver checks and invoke the CXL PCI callbacks.
> 
> Introduce cxl_pci_drv_bound() to return boolean depending on if the PCI EP
> parameter is bound to a CXL driver instance. This will be used in future
> patch when dequeuing work from the kfifo.

I am thoroughly confused about what this patch is trying to do. The
whole point of a cxl_core is to separate the potential shared mechanics
across CXL device types from the specific special case of the CXL memory
class device.

This would be like saying that all PCI drivers need to be built-into the
PCI core to satisfy PCI error handling.

If the core needs to verify the driver before calling the handler then
the design is broken.

The design should accommodate a case of *only* a CXL accelerator driver
loading and not a CXL memory expander driver. Let me go take a look at
how cxl_pci_drv_bound() is used. There must be a simple misunderstanding
that we can resolve quickly.

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 25/25] CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup
  2025-11-04 17:03 ` [RESEND v13 25/25] CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup Terry Bowman
@ 2025-11-20  3:10   ` dan.j.williams
  0 siblings, 0 replies; 103+ messages in thread
From: dan.j.williams @ 2025-11-20  3:10 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Terry Bowman wrote:
> During CXL device cleanup the CXL PCIe Port device interrupts remain
> enabled. This potentially allows unnecessary interrupt processing on
> behalf of the CXL errors while the device is destroyed.
> 
> Disable CXL protocol errors by setting the CXL devices' AER mask register.
> 
> Introduce pci_aer_mask_internal_errors() similar to pci_aer_unmask_internal_errors().
> Add to the AER service driver allowing other subsystems to use.
> 
> Introduce cxl_mask_proto_interrupts() to call pci_aer_mask_internal_errors().
> Add calls to cxl_mask_proto_interrupts() within CXL Port teardown for CXL
> Root Ports, CXL Downstream Switch Ports, CXL Upstream Switch Ports, and CXL
> Endpoints. Follow the same "bottom-up" approach used during CXL Port
> teardown.

This comes across as too much special case sprinkling.

If it is just the cxl_port teardown case then the simple answer is only
enable interrupts at cxl_port::probe() time and disable them at
cxl_port::remove() time.

Can you clarify the exact nature of the unwanted interrupts because the
PCI core does not manage AER interrupts in the same way. It only ever
enables interrupts for root ports, so I am confused why CXL needs to
manage interrupts on a per-dport basis.

Maybe cxl_mask_proto_interrupts() is misnamed because the device is
never enabled to send interrupts. I think it just masks AER events,
right?

In any case I think this belongs right next to the code that maps and
unmaps dports. The endpoint case should be handled by the endpoint probe
and remove handlers.

Also, if this is really a problem this patch should be moved earlier in
the series before the kernel starts unmasking these new event sources.
Otherwise, a bisect run will start hitting spurious events if it lands
in the middle of this series.
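
A rough userspace model of the probe/remove pairing suggested above,
assuming the mask helper is the mirror image of
pci_aer_unmask_internal_errors(). 'struct sketch_aer_regs' and the
'sketch_' functions are hypothetical stand-ins for the AER mask
registers and the cxl_port probe and remove hooks; the bit values are
assumed to match include/uapi/linux/pci_regs.h.

```c
#include <stdint.h>

/* Assumed to mirror the kernel definitions; verify against the tree. */
#define PCI_ERR_UNC_INTN	0x00400000u	/* Uncorrectable Internal Error */
#define PCI_ERR_COR_INTERNAL	0x00004000u	/* Corrected Internal Error */

/* Hypothetical stand-in for the AER mask registers of one device. */
struct sketch_aer_regs {
	uint32_t uncor_mask;	/* models PCI_ERR_UNCOR_MASK */
	uint32_t cor_mask;	/* models PCI_ERR_COR_MASK */
};

/* Clear the internal-error mask bits so the events are reported. */
static void sketch_unmask_internal_errors(struct sketch_aer_regs *r)
{
	r->uncor_mask &= ~PCI_ERR_UNC_INTN;
	r->cor_mask &= ~PCI_ERR_COR_INTERNAL;
}

/* Set the same bits again so the events are no longer reported. */
static void sketch_mask_internal_errors(struct sketch_aer_regs *r)
{
	r->uncor_mask |= PCI_ERR_UNC_INTN;
	r->cor_mask |= PCI_ERR_COR_INTERNAL;
}

/* Events are only unmasked while the port driver is bound. */
static void sketch_port_probe(struct sketch_aer_regs *r)
{
	sketch_unmask_internal_errors(r);
}

static void sketch_port_remove(struct sketch_aer_regs *r)
{
	sketch_mask_internal_errors(r);
}
```

Tying both operations to the driver lifetime keeps the masking in one
place instead of spreading it across the teardown paths.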

^ permalink raw reply	[flat|nested] 103+ messages in thread

* Re: [RESEND v13 21/25] PCI/AER: Dequeue forwarded CXL error
  2025-11-04 17:03 ` [RESEND v13 21/25] PCI/AER: Dequeue forwarded CXL error Terry Bowman
  2025-11-04 18:40   ` Jonathan Cameron
  2025-11-04 18:45   ` Bjorn Helgaas
@ 2025-11-20  3:33   ` dan.j.williams
  2 siblings, 0 replies; 103+ messages in thread
From: dan.j.williams @ 2025-11-20  3:33 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci, terry.bowman

Terry Bowman wrote:
> The AER driver now forwards CXL protocol errors to the CXL driver via a
> kfifo. The CXL driver must consume these work items, initiate protocol
> error handling, and ensure RAS mappings remain valid throughout processing.
> 
> Implement cxl_proto_err_work_fn() to dequeue work items forwarded by the
> AER service driver and begin protocol error processing by calling
> cxl_handle_proto_error().
> 
> Add a PCI device lock on &pdev->dev within cxl_proto_err_work_fn() to
> keep the PCI device structure valid during handling. Locking an Endpoint
> will also defer RAS unmapping until the device is unlocked.
> 
> For Endpoints, add a lock on CXL memory device cxlds->dev. The CXL memory
> device structure holds the RAS register reference needed during error
> handling.
> 
> Add lock for the parent CXL Port for Root Ports, Downstream Ports, and
> Upstream Ports to prevent destruction of structures holding mapped RAS
> addresses while they are in use.
> 
> Invoke cxl_do_recovery() for uncorrectable errors. Treat this as a stub for
> now; implement its functionality in a future patch.
> 
> Export pcie_clear_device_status() to enable cleanup of AER status following
> error handling.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> 
> ---
> Changes in v12->v13:
> - Add cxlmd lock using guard() (Terry)
> - Remove exporting of unused function, pci_aer_clear_fatal_status() (Dave Jiang)
> - Change pr_err() calls to ratelimited. (Terry)
> - Update commit message. (Terry)
> - Remove namespace qualifier from pcie_clear_device_status()
>   export (Dave Jiang)
> - Move locks into cxl_proto_err_work_fn() (Dave)
> - Update log messages in cxl_forward_error() (Ben)
> 
> Changes in v11->v12:
> - Add guard for CE case in cxl_handle_proto_error() (Dave)
> 
> Changes in v10->v11:
> - Reword patch commit message to remove RCiEP details (Jonathan)
> - Add #include <linux/bitfield.h> (Terry)
> - is_cxl_rcd() - Fix short comment message wrap  (Jonathan)
> - is_cxl_rcd() - Combine return calls into 1  (Jonathan)
> - cxl_handle_proto_error() - Move comment earlier  (Jonathan)
> - Use FIELD_GET() in discovering class code (Jonathan)
> - Remove BDF from cxl_proto_err_work_data. Use 'struct pci_dev *' (Dan)
> ---
>  drivers/cxl/core/ras.c | 153 ++++++++++++++++++++++++++++++++++++++---
>  drivers/pci/pci.c      |   1 +
>  drivers/pci/pci.h      |   1 -
>  include/linux/pci.h    |   2 +
>  4 files changed, 145 insertions(+), 12 deletions(-)
[..]
> +static void cxl_proto_err_work_fn(struct work_struct *work)
> +{
> +	struct cxl_proto_err_work_data wd;
> +
> +	while (cxl_proto_err_kfifo_get(&wd)) {
> +		struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(wd.pdev);

Why does this function need its own device reference? I think this
handler should match PCI AER semantics where the device validity is
caller guaranteed.

> +		struct device *cxlmd_dev;
> +
> +		if (!pdev) {
> +			pr_err_ratelimited("NULL PCI device passed in AER-CXL KFIFO\n");
> +			continue;
> +		}
> +
> +		guard(device)(&pdev->dev);
> +		if (is_pcie_endpoint(pdev)) {
> +			struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> +
> +			if (!cxl_pci_drv_bound(pdev))
> +				return;
> +			cxlmd_dev = &cxlds->cxlmd->dev;
> +			device_lock_if(cxlmd_dev, cxlmd_dev);

Ok, I think this demonstrates the problematic usage of
cxl_pci_drv_bound() and the presence of conditional locking is also a
tell that this is broken.

My expectation is that CXL protocol errors are exclusively reported to
cxl_ports. That means that all RAS register mapping must be exclusively
relative to the cxl_port::probe() / cxl_port::remove() lifetime. Once that
is in place this endpoint case melts away. The endpoint's job is to
register an endpoint-port to get protocol error services.

Given that time is short for v6.19, I might take a quick stab at this to
demonstrate the proposal (or otherwise try to quickly discover why the
suggestion cannot work).
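
For illustration only, the lifetime rule described above could be modeled in
userspace roughly as follows. This is a sketch under stated assumptions, not
kernel code and not the eventual implementation: every model_* name is
hypothetical, the "RAS registers" are a plain array, and the device lock that
would keep remove() from racing with the handler is only described in a
comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a cxl_port; not the kernel structure. */
struct model_port {
	uint32_t regs[2];	/* fake RAS block: [0] uncorrectable status */
	uint32_t *ras;		/* non-NULL only between probe and remove */
};

/* Stand-in for cxl_port::probe(): the port maps its own RAS registers. */
static void model_port_probe(struct model_port *port)
{
	port->ras = port->regs;
}

/* Stand-in for cxl_port::remove(): the mapping goes away with the driver. */
static void model_port_remove(struct model_port *port)
{
	port->ras = NULL;
}

/*
 * Stand-in for a protocol error handler. It is always handed a port (an
 * endpoint would be represented by its endpoint-port), so there is no
 * endpoint-specific branch and no conditional lock. In the kernel, holding
 * the port's device lock is what would keep remove() from running
 * concurrently; this single-threaded model omits locking.
 *
 * Returns true if an error was found and cleared.
 */
static bool model_port_handle_error(struct model_port *port)
{
	uint32_t status;

	assert(port);
	if (!port->ras)
		return false;	/* port driver not bound: nothing is mapped */

	status = port->ras[0];
	if (!status)
		return false;

	port->ras[0] = 0;	/* models write-1-to-clear by zeroing */
	return true;
}
```

The point of the model is that the handler only ever dereferences a mapping
owned by the port, so "is the endpoint driver bound?" never has to be asked.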


* Re: [RESEND v13 12/25] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
  2025-11-19 23:40       ` dan.j.williams
@ 2025-11-21 14:56         ` Bowman, Terry
  0 siblings, 0 replies; 103+ messages in thread
From: Bowman, Terry @ 2025-11-21 14:56 UTC (permalink / raw)
  To: dan.j.williams, dave, jonathan.cameron, dave.jiang,
	alison.schofield, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny
  Cc: linux-kernel, linux-pci



On 11/19/2025 5:40 PM, dan.j.williams@intel.com wrote:
> Bowman, Terry wrote:
>>
>> On 11/19/2025 3:23 PM, dan.j.williams@intel.com wrote:
>>> Terry Bowman wrote:
>>>> CXL currently has separate trace routines for CXL Port errors and CXL
>>>> Endpoint errors. This is inconvenient for the user because they must enable
>>>> 2 sets of trace routines. Make updates to the trace logging such that a
>>>> single trace routine logs both CXL Endpoint and CXL Port protocol errors.
>>> No, this is not inconvenient, this is required for compatible evolution of
>>> tracepoints. The change in this patch breaks compatibility as it
>>> violates the expectation that the type and order of TP_ARGS do not change
>>> from one kernel to the next.
>>>
>>>> Keep the trace log fields 'memdev' and 'host'. While these are not accurate
>>>> for non-Endpoints the fields will remain as-is to prevent breaking
>>>> userspace RAS trace consumers.
>>>>
>>>> Add serial number parameter to the trace logging. This is used for EPs
>>>> and 0 is provided for CXL port devices without a serial number.
>>>>
>>>> Leave the correctable and uncorrectable trace routines' TP_STRUCT__entry()
>>>> unchanged with respect to member data types and order.
>>>>
>>>> Below is output of correctable and uncorrectable protocol error logging.
>>>> CXL Root Port and CXL Endpoint examples are included below.
>>>>
>>>> Root Port:
>>>> cxl_aer_correctable_error: memdev=0000:0c:00.0 host=pci0000:0c serial: 0 status='CRC Threshold Hit'
>>>> cxl_aer_uncorrectable_error: memdev=0000:0c:00.0 host=pci0000:0c serial: 0 status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
>>> A root port is not a "memdev", another awkward side effect of trying to
>>> combine 2 trace points with different use cases.
>>>
>>> So a NAK from me for this change (unless there is a strong reason for
>>> Linux to inflict the compat breakage), please keep the separate
>>> tracepoints they are for distinctly different use cases. A memdev
>>> protocol error is contained to that memdev, a port protocol error
>>> implicates every CXL.cachemem descendant of that port.
>> I misunderstood this comment from previous code review:
>> https://lore.kernel.org/linux-cxl/67aea897cfe55_2d1e294ca@dwillia2-xfh.jf.intel.com.notmuch/#t
> No, you did not misunderstand, I just did not realize at the time I was
> asking for compatibility breakage with that suggestion. Apologies for
> that thrash.
>
>> Are you OK with the following format for Port devices? Or let me know what format is needed.
>> cxl_port_aer_correctable_error: device=port1 parent=root0 status='Received Error From Physical Layer'
>> cxl_port_aer_uncorrectable_error: device=port1 parent=root0 status: 'Memory Byte Enable Parity Error' first_error: 'Memory Byte Enable Parity Error'
> That looks good to me.
>
> Also, I realize this patch set has gone through many revisions. We
> really need to get at least some of these pre-req patches into a topic
> branch so they do not need to keep being sent out in this large series.


Do we want a serial number field in the port log (missing above)? CXL USPs and DSPs will have a
serial number from the Identify command. It would make the port logs look like:

cxl_port_aer_correctable_error: device=port1 parent=root0 serial=0 status='Received Error From Physical Layer'
cxl_port_aer_uncorrectable_error: device=port1 parent=root0 serial=0 status: 'Memory Byte Enable Parity Error' first_error: 'Memory Byte 

-Terry



* Re: [RESEND v13 02/25] PCI/CXL: Introduce pcie_is_cxl()
  2025-11-04 17:02 ` [RESEND v13 02/25] PCI/CXL: Introduce pcie_is_cxl() Terry Bowman
  2025-11-04 17:52   ` Jonathan Cameron
  2025-11-19  3:19   ` dan.j.williams
@ 2025-11-21 20:31   ` Gregory Price
  2 siblings, 0 replies; 103+ messages in thread
From: Gregory Price @ 2025-11-21 20:31 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci

On Tue, Nov 04, 2025 at 11:02:42AM -0600, Terry Bowman wrote:
> CXL and AER drivers need the ability to identify CXL devices.
> 
> Introduce set_pcie_cxl() with logic checking for CXL.mem or CXL.cache
> status in the CXL Flexbus DVSEC status register. The CXL Flexbus DVSEC
> presence is used because it is required for all the CXL PCIe devices.[1]
> 
------>8
>  
> +static void set_pcie_cxl(struct pci_dev *dev)
> +{
> +	struct pci_dev *parent;
...
> +	parent = pci_upstream_bridge(dev);
> +	set_pcie_cxl(parent);
> +}
...
> +static inline bool pcie_is_cxl(struct pci_dev *pci_dev)
> +{
> +	return pci_dev->is_cxl;
> +}
> +

We have encountered a crash on QEMU where parent is NULL here;
pci_upstream_bridge() returns NULL for a device on a root bus:

static inline struct pci_dev *pci_upstream_bridge(struct pci_dev *dev)
{
        dev = pci_physfn(dev);
        if (pci_is_root_bus(dev->bus))
                return NULL;

        return dev->bus->self;
}
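
One possible way to avoid the NULL dereference is to stop the upward walk
when there is no upstream bridge. The userspace model below is only a sketch
of that idea: the model_* types and the cxl_trained flag are hypothetical
stand-ins (the real function reads the Flexbus DVSEC, and the real
pci_upstream_bridge() also calls pci_physfn()), and the actual fix in the
series may look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_dev;

/* Hypothetical stand-in for struct pci_bus. */
struct model_bus {
	struct model_bus *parent;	/* NULL for a root bus */
	struct model_dev *self;		/* bridge device leading to this bus */
};

/* Hypothetical stand-in for struct pci_dev. */
struct model_dev {
	struct model_bus *bus;
	bool cxl_trained;	/* stand-in for the Flexbus DVSEC status check */
	bool is_cxl;
};

/* Same shape as pci_upstream_bridge(): NULL when the device is on a root bus. */
static struct model_dev *model_upstream_bridge(struct model_dev *dev)
{
	if (!dev->bus->parent)
		return NULL;
	return dev->bus->self;
}

/* Refreshes is_cxl and walks upward, stopping when there is no bridge. */
static void model_set_cxl(struct model_dev *dev)
{
	struct model_dev *parent;

	assert(dev && dev->bus);
	dev->is_cxl = dev->cxl_trained;

	parent = model_upstream_bridge(dev);
	if (!parent)
		return;	/* without this check the recursion dereferences NULL */

	model_set_cxl(parent);
}
```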

~Gregory


* Re: [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging
  2025-11-04 22:12     ` Bjorn Helgaas
@ 2025-12-04 17:30       ` Bowman, Terry
  2025-12-08 18:42         ` Bjorn Helgaas
  0 siblings, 1 reply; 103+ messages in thread
From: Bowman, Terry @ 2025-12-04 17:30 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci

On 11/4/2025 4:12 PM, Bjorn Helgaas wrote:
> On Tue, Nov 04, 2025 at 03:54:21PM -0600, Bowman, Terry wrote:
>>
>>
>> On 11/4/2025 1:11 PM, Bjorn Helgaas wrote:
>>> On Tue, Nov 04, 2025 at 11:02:40AM -0600, Terry Bowman wrote:
>>>> This patchset updates CXL Protocol Error handling for CXL Ports and CXL
>>>> Endpoints (EP). Previous versions of this series can be found here:
>>>> https://lore.kernel.org/linux-cxl/20250925223440.3539069-1-terry.bowman@amd.com/
>>>> ...
>>>> Terry Bowman (24):
>>>>   CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
>>>>   PCI/CXL: Introduce pcie_is_cxl()
>>>>   cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
>>>>   cxl/pci: Remove unnecessary CXL RCH handling helper functions
>>>>   cxl: Move CXL driver's RCH error handling into core/ras_rch.c
>>>>   CXL/AER: Replace device_lock() in cxl_rch_handle_error_iter() with
>>>>     guard() lock
>>>>   CXL/AER: Move AER drivers RCH error handling into pcie/aer_cxl_rch.c
>>>>   PCI/AER: Report CXL or PCIe bus error type in trace logging
>>>>   cxl/pci: Update RAS handler interfaces to also support CXL Ports
>>>>   cxl/pci: Log message if RAS registers are unmapped
>>>>   cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
>>>>   cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
>>>>   cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
>>>>   CXL/PCI: Introduce PCI_ERS_RESULT_PANIC
>>>>   CXL/AER: Introduce pcie/aer_cxl_vh.c in AER driver for forwarding CXL
>>>>     errors
>>>>   cxl: Introduce cxl_pci_drv_bound() to check for bound driver
>>>>   cxl: Change CXL handlers to use guard() instead of scoped_guard()
>>>>   cxl/pci: Introduce CXL protocol error handlers for Endpoints
>>>>   CXL/PCI: Introduce CXL Port protocol error handlers
>>>>   PCI/AER: Dequeue forwarded CXL error
>>>>   CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
>>>>   CXL/PCI: Introduce CXL uncorrectable protocol error recovery
>>>>   CXL/PCI: Enable CXL protocol errors during CXL Port probe
>>>>   CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup
>>> Is the mix of "CXL/PCI" vs "cxl/pci" in the above telling me
>>> something, or should they all match?
>>>
>>> As a rule of thumb, I'm going to look at things that start with "PCI"
>>> and skip most of the rest on the assumption that the rest only have
>>> incidental effects on PCI.
>>
>> I think there was logic behind the (un)capitalized but I forget the
>> reasoning. It's  better to keep it simple. I'll change to use
>> PCI/CXL and AER/CXL.
> 
> I don't know what "AER/CXL" means.  I think "PCI" and "CXL" are the
> big chunks here and one of them should be first in the prefix.
> 
> I do think there's value in using "PCI/AER" for things specific to AER
> and "PCI/ERR" for more generic PCI error handling, and maybe "PCI/CXL"
> for significant CXL-related things in drivers/pci/.

I was informed any patch touching PCI files requires a PCI maintainer 
review or acknowledgment. I misunderstood how to communicate this.

In my workflow, I used uppercase tags like PCI or AER to indicate that 
a patch needed PCI review or ack. For example, when I wrote CXL/PCI, I 
intended to signal that the patch was primarily CXL-related but in a 
PCI context, and therefore might need PCI review.

To avoid confusion in the future, can you advise on the best way to 
indicate a patch needs your PCI review—even if the PCI changes are
minor and don’t warrant leading with the PCI label?

Also, can you review the following patches?
[RESEND v13 01/25] CXL-PCI-Move-CXL-DVSEC-definitions-into-uapi-lin
[RESEND v13 02/25] PCI-CXL-Introduce-pcie_is_cxl
[RESEND v13 07/25] CXL-AER-Replace-device_lock-in-cxl_rch_handle_er
[RESEND v13 08/25] CXL-AER-Move-AER-drivers-RCH-error-handling-into
[RESEND v13 16/25] CXL-AER-Introduce-pcie-aer_cxl_vh.c-in-AER-drive
[RESEND v13 20/25] CXL-PCI-Introduce-CXL-Port-protocol-error-handle
[RESEND v13 22/25] CXL-PCI-Export-and-rename-merge_result-to-pci_er
[RESEND v13 23/25] CXL-PCI-Introduce-CXL-uncorrectable-protocol-err
[RESEND v13 25/25] CXL-PCI-Disable-CXL-protocol-error-interrupts-du

-Terry


* Re: [RESEND v13 01/25] CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
  2025-11-04 17:02 ` [RESEND v13 01/25] CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h Terry Bowman
  2025-11-04 17:50   ` Jonathan Cameron
  2025-11-19  3:19   ` dan.j.williams
@ 2025-12-08 18:04   ` Bjorn Helgaas
  2025-12-08 22:13     ` Bowman, Terry
  2 siblings, 1 reply; 103+ messages in thread
From: Bjorn Helgaas @ 2025-12-08 18:04 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci

On Tue, Nov 04, 2025 at 11:02:41AM -0600, Terry Bowman wrote:
> The CXL DVSECs are currently defined in cxl/core/cxlpci.h. These are not
> accessible to other subsystems. Move these to uapi/linux/pci_regs.h.
> 
> Change DVSEC name formatting to follow the existing PCI format in
> pci_regs.h. The current format uses CXL_DVSEC_XYZ and the CXL defines must
> be changed to be PCI_DVSEC_CXL_XYZ to match existing pci_regs.h. Leave
> PCI_DVSEC_CXL_PORT* defines as-is because they are already defined and may
> be in use by userspace application(s).
> 
> Update existing usage to match the name change.
> 
> Update the inline documentation to refer to latest CXL spec version.

Regrettably, r3.2 is no longer the latest ;)

> +++ b/include/uapi/linux/pci_regs.h
> @@ -1244,9 +1244,64 @@
>  /* Deprecated old name, replaced with PCI_DOE_DATA_OBJECT_DISC_RSP_3_TYPE */
>  #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		PCI_DOE_DATA_OBJECT_DISC_RSP_3_TYPE
>  
> -/* Compute Express Link (CXL r3.1, sec 8.1.5) */
> -#define PCI_DVSEC_CXL_PORT				3
> -#define PCI_DVSEC_CXL_PORT_CTL				0x0c
> -#define PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR		0x00000001
> +/* Compute Express Link (CXL r3.2, sec 8.1)
> + *
> + * Note that CXL DVSEC id 3 and 7 to be ignored when the CXL link state
> + * is "disconnected" (CXL r3.2, sec 9.12.3). Re-enumerate these
> + * registers on downstream link-up events.
> + */
> +
> +#define PCI_DVSEC_HEADER1_LENGTH_MASK  __GENMASK(31, 20)

I think PCI_DVSEC_HEADER1_LEN() could be used instead of adding a new
definition.
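
As a quick userspace check of why the new mask looks redundant, the sketch
below compares a mask-and-shift over bits 31:20 with the existing helper. The
PCI_DVSEC_HEADER1_LEN() body is reproduced here as I understand the current
pci_regs.h definition, and the MODEL_* name and model_* functions are
hypothetical; treat it as an illustration rather than a statement of what the
header contains.

```c
#include <assert.h>
#include <stdint.h>

/* Existing helper, reproduced from include/uapi/linux/pci_regs.h (assumed). */
#define PCI_DVSEC_HEADER1_LEN(x)	(((x) >> 20) & 0xfff)

/* Userspace spelling of the proposed __GENMASK(31, 20) length mask. */
#define MODEL_DVSEC_HEADER1_LENGTH_MASK	0xfff00000u

/* Length extracted the way a mask plus shift (FIELD_GET-style) would do it. */
static uint32_t model_dvsec_len_from_mask(uint32_t hdr1)
{
	return (hdr1 & MODEL_DVSEC_HEADER1_LENGTH_MASK) >> 20;
}

/* Length extracted with the existing helper; checks both forms agree. */
static uint32_t model_dvsec_len(uint32_t hdr1)
{
	uint32_t len = PCI_DVSEC_HEADER1_LEN(hdr1);

	assert(len == model_dvsec_len_from_mask(hdr1));
	return len;
}
```

Both forms read the same 12-bit field from DVSEC Header 1, so callers could
use the helper directly on the value read from config space.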

> +/* CXL 3.2 8.1.3: PCIe DVSEC for CXL Device */

Can you use "CXL r4.0, sec 8.1.3" and similar so it refers to the most
recent revision and matches the typical style for PCIe spec references?


* Re: [RESEND v13 23/25] CXL/PCI: Introduce CXL uncorrectable protocol error recovery
  2025-11-04 17:03 ` [RESEND v13 23/25] CXL/PCI: Introduce CXL uncorrectable protocol error recovery Terry Bowman
  2025-11-04 18:47   ` Jonathan Cameron
  2025-11-11  8:37   ` Alison Schofield
@ 2025-12-08 18:40   ` Bjorn Helgaas
  2 siblings, 0 replies; 103+ messages in thread
From: Bjorn Helgaas @ 2025-12-08 18:40 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci

On Tue, Nov 04, 2025 at 11:03:03AM -0600, Terry Bowman wrote:
> Implement cxl_do_recovery() to handle uncorrectable protocol
> errors (UCE), following the design of pcie_do_recovery(). Unlike PCIe,
> all CXL UCEs are treated as fatal and trigger a kernel panic to avoid
> potential CXL memory corruption.
> 
> Add cxl_walk_port(), analogous to pci_walk_bridge(), to traverse the
> CXL topology from the error source through downstream CXL ports and
> endpoints.
> 
> Introduce cxl_report_error_detected(), mirroring PCI's
> report_error_detected(), and implement device locking for the affected
> subtree. Endpoints require locking the PCI device (pdev->dev) and the
> CXL memdev (cxlmd->dev). CXL ports require locking the PCI
> device (pdev->dev) and the parent CXL port.
> 
> The device locks should be taken early where possible. The initially
> reporting device will be locked after kfifo dequeue. Iterated devices
> are locked in cxl_report_error_detected(), except for the first device,
> which has already been locked at that point.
> 
> Export pci_aer_clear_fatal_status() for use when a UCE is not present.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

Acked-by: Bjorn Helgaas <bhelgaas@google.com> # drivers/pci/

> ---
> 
> Changes in v12->v13:
> - Add guard() before calling cxl_pci_drv_bound() (Dave Jiang)
> - Add guard() calls for EP (cxlds->cxlmd->dev & pdev->dev) and ports
>   (pdev->dev & parent cxl_port) in cxl_report_error_detected() and
>   cxl_handle_proto_error() (Terry)
> - Remove unnecessary check for endpoint port. (Dave Jiang)
> - Remove check for RCIEP EP in cxl_report_error_detected(). (Terry)
> 
> Changes in v11->v12:
> - Clean up port discovery in cxl_do_recovery() (Dave)
> - Add PCI_EXP_TYPE_RC_END to type check in cxl_report_error_detected()
> 
> Changes in v10->v11:
> - pci_ers_merge_results() - Move to earlier patch
> ---
>  drivers/cxl/core/ras.c | 135 ++++++++++++++++++++++++++++++++++++++++-
>  drivers/pci/pci.h      |   1 -
>  drivers/pci/pcie/aer.c |   1 +
>  include/linux/aer.h    |   2 +
>  4 files changed, 135 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 5bc144cde0ee..52c6f19564b6 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -259,8 +259,138 @@ static void device_unlock_if(struct device *dev, bool take)
>  		device_unlock(dev);
>  }
>  
> +/**
> + * cxl_report_error_detected
> + * @dev: Device being reported
> + * @data: Result
> + * @err_pdev: Device with initial detected error. Is locked immediately
> + *            after KFIFO dequeue.
> + */
> +static int cxl_report_error_detected(struct device *dev, void *data, struct pci_dev *err_pdev)
> +{
> +	bool need_lock = (dev != &err_pdev->dev);
> +	pci_ers_result_t vote, *result = data;
> +	struct pci_dev *pdev;
> +
> +	if (!dev || !dev_is_pci(dev))
> +		return 0;
> +	pdev = to_pci_dev(dev);
> +
> +	device_lock_if(&pdev->dev, need_lock);
> +	if (is_pcie_endpoint(pdev) && !cxl_pci_drv_bound(pdev)) {
> +		device_unlock_if(&pdev->dev, need_lock);
> +		return PCI_ERS_RESULT_NONE;
> +	}
> +
> +	if (pdev->aer_cap)
> +		pci_clear_and_set_config_dword(pdev,
> +					       pdev->aer_cap + PCI_ERR_COR_STATUS,
> +					       0, PCI_ERR_COR_INTERNAL);
> +
> +	if (is_pcie_endpoint(pdev)) {
> +		struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> +
> +		device_lock_if(&cxlds->cxlmd->dev, need_lock);
> +		vote = cxl_error_detected(&cxlds->cxlmd->dev);
> +		device_unlock_if(&cxlds->cxlmd->dev, need_lock);
> +	} else {
> +		vote = cxl_port_error_detected(dev);
> +	}
> +
> +	pcie_clear_device_status(pdev);
> +	*result = pcie_ers_merge_result(*result, vote);
> +	device_unlock_if(&pdev->dev, need_lock);
> +
> +	return 0;
> +}
> +
> +static int match_port_by_parent_dport(struct device *dev, const void *dport_dev)
> +{
> +	struct cxl_port *port;
> +
> +	if (!is_cxl_port(dev))
> +		return 0;
> +
> +	port = to_cxl_port(dev);
> +
> +	return port->parent_dport->dport_dev == dport_dev;
> +}
> +
> +/**
> + * cxl_walk_port
> + *
> + * @port: Port be traversed into
> + * @cb: Callback for handling the CXL Ports
> + * @userdata: Result
> + * @err_pdev: Device with initial detected error. Is locked immediately
> + *            after KFIFO dequeue.
> + */
> +static void cxl_walk_port(struct cxl_port *port,
> +			  int (*cb)(struct device *, void *, struct pci_dev *),
> +			  void *userdata,
> +			  struct pci_dev *err_pdev)
> +{
> +	struct cxl_port *err_port __free(put_cxl_port) = get_cxl_port(err_pdev);
> +	bool need_lock = (port != err_port);
> +	struct cxl_dport *dport = NULL;
> +	unsigned long index;
> +
> +	device_lock_if(&port->dev, need_lock);
> +	if (is_cxl_endpoint(port)) {
> +		cb(port->uport_dev->parent, userdata, err_pdev);
> +		device_unlock_if(&port->dev, need_lock);
> +		return;
> +	}
> +
> +	if (port->uport_dev && dev_is_pci(port->uport_dev))
> +		cb(port->uport_dev, userdata, err_pdev);
> +
> +	/*
> +	 * Iterate over the set of Downstream Ports recorded in port->dports (XArray):
> +	 *  - For each dport, attempt to find a child CXL Port whose parent dport
> +	 *    match.
> +	 *  - Invoke the provided callback on the dport's device.
> +	 *  - If a matching child CXL Port device is found, recurse into that port to
> +	 *    continue the walk.
> +	 */
> +	xa_for_each(&port->dports, index, dport)
> +	{
> +		struct device *child_port_dev __free(put_device) =
> +			bus_find_device(&cxl_bus_type, &port->dev, dport->dport_dev,
> +					match_port_by_parent_dport);
> +
> +		cb(dport->dport_dev, userdata, err_pdev);
> +		if (child_port_dev)
> +			cxl_walk_port(to_cxl_port(child_port_dev), cb, userdata, err_pdev);
> +	}
> +	device_unlock_if(&port->dev, need_lock);
> +}
> +
>  static void cxl_do_recovery(struct pci_dev *pdev)
>  {
> +	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> +	struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
> +
> +	if (!port) {
> +		pci_err(pdev, "Failed to find the CXL device\n");
> +		return;
> +	}
> +
> +	cxl_walk_port(port, cxl_report_error_detected, &status, pdev);
> +	if (status == PCI_ERS_RESULT_PANIC)
> +		panic("CXL cachemem error.");
> +
> +	/*
> +	 * If we have native control of AER, clear error status in the device
> +	 * that detected the error.  If the platform retained control of AER,
> +	 * it is responsible for clearing this status.  In that case, the
> +	 * signaling device may not even be visible to the OS.
> +	 */
> +	if (cxl_error_is_native(pdev)) {
> +		pcie_clear_device_status(pdev);
> +		pci_aer_clear_nonfatal_status(pdev);
> +		pci_aer_clear_fatal_status(pdev);
> +	}
>  }
>  
>  void cxl_handle_cor_ras(struct device *dev, u64 serial, void __iomem *ras_base)
> @@ -483,16 +613,15 @@ static void cxl_proto_err_work_fn(struct work_struct *work)
>  			if (!cxl_pci_drv_bound(pdev))
>  				return;
>  			cxlmd_dev = &cxlds->cxlmd->dev;
> -			device_lock_if(cxlmd_dev, cxlmd_dev);
>  		} else {
>  			cxlmd_dev = NULL;
>  		}
>  
> +		/* Lock the CXL parent Port */
>  		struct cxl_port *port __free(put_cxl_port) = get_cxl_port(pdev);
> -		if (!port)
> -			return;
>  		guard(device)(&port->dev);
>  
> +		device_lock_if(cxlmd_dev, cxlmd_dev);
>  		cxl_handle_proto_error(&wd);
>  		device_unlock_if(cxlmd_dev, cxlmd_dev);
>  	}
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 2af6ea82526d..3637996d37ab 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -1174,7 +1174,6 @@ void pci_restore_aer_state(struct pci_dev *dev);
>  static inline void pci_no_aer(void) { }
>  static inline void pci_aer_init(struct pci_dev *d) { }
>  static inline void pci_aer_exit(struct pci_dev *d) { }
> -static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
>  static inline int pci_aer_clear_status(struct pci_dev *dev) { return -EINVAL; }
>  static inline int pci_aer_raw_clear_status(struct pci_dev *dev) { return -EINVAL; }
>  static inline void pci_save_aer_state(struct pci_dev *dev) { }
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index e806fa05280b..4cf44297bb24 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -297,6 +297,7 @@ void pci_aer_clear_fatal_status(struct pci_dev *dev)
>  	if (status)
>  		pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, status);
>  }
> +EXPORT_SYMBOL_GPL(pci_aer_clear_fatal_status);
>  
>  /**
>   * pci_aer_raw_clear_status - Clear AER error registers.
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 6b2c87d1b5b6..64aef69fb546 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -66,6 +66,7 @@ struct cxl_proto_err_work_data {
>  
>  #if defined(CONFIG_PCIEAER)
>  int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
> +void pci_aer_clear_fatal_status(struct pci_dev *dev);
>  int pcie_aer_is_native(struct pci_dev *dev);
>  void pci_aer_unmask_internal_errors(struct pci_dev *dev);
>  #else
> @@ -73,6 +74,7 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
>  {
>  	return -EINVAL;
>  }
> +static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
>  static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
>  static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
>  #endif
> -- 
> 2.34.1
> 
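
For readers following the walk described in the commit message, here is a
small userspace model of the traversal and vote merging. It is a sketch under
loud assumptions: the model_* names are hypothetical, the tree is a fixed
array of children rather than an XArray of dports, locking is omitted, and
the merge rule is simplified to "most severe vote wins", which is not the
kernel's actual merge logic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical vote values, ordered by severity for this model only. */
enum model_vote { MODEL_NONE = 0, MODEL_CAN_RECOVER, MODEL_PANIC };

/* Hypothetical stand-in for a CXL port with child ports. */
struct model_port {
	enum model_vote vote;
	struct model_port *children[4];
	size_t nr_children;
};

/* Simplified merge: the most severe vote wins (an assumption of this sketch). */
static enum model_vote model_merge(enum model_vote orig, enum model_vote new_vote)
{
	return new_vote > orig ? new_vote : orig;
}

/* Callback in the spirit of cxl_report_error_detected(): fold one vote in. */
static int model_report(struct model_port *port, void *data)
{
	enum model_vote *result = data;

	*result = model_merge(*result, port->vote);
	return 0;
}

/* Pre-order walk: visit the port itself, then recurse into each child. */
static void model_walk_port(struct model_port *port,
			    int (*cb)(struct model_port *, void *),
			    void *userdata)
{
	size_t i;

	assert(port && cb);
	cb(port, userdata);
	for (i = 0; i < port->nr_children; i++)
		model_walk_port(port->children[i], cb, userdata);
}
```

Starting the result at the "can recover" value mirrors how cxl_do_recovery()
in the patch initializes status before the walk; a single panic vote anywhere
in the subtree then dominates the merged result.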


* Re: [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging
  2025-12-04 17:30       ` Bowman, Terry
@ 2025-12-08 18:42         ` Bjorn Helgaas
  0 siblings, 0 replies; 103+ messages in thread
From: Bjorn Helgaas @ 2025-12-08 18:42 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci

On Thu, Dec 04, 2025 at 11:30:45AM -0600, Bowman, Terry wrote:
> On 11/4/2025 4:12 PM, Bjorn Helgaas wrote:
> > On Tue, Nov 04, 2025 at 03:54:21PM -0600, Bowman, Terry wrote:
> >>
> >>
> >> On 11/4/2025 1:11 PM, Bjorn Helgaas wrote:
> >>> On Tue, Nov 04, 2025 at 11:02:40AM -0600, Terry Bowman wrote:
> >>>> This patchset updates CXL Protocol Error handling for CXL Ports and CXL
> >>>> Endpoints (EP). Previous versions of this series can be found here:
> >>>> https://lore.kernel.org/linux-cxl/20250925223440.3539069-1-terry.bowman@amd.com/
> >>>> ...
> >>>> Terry Bowman (24):
> >>>>   CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
> >>>>   PCI/CXL: Introduce pcie_is_cxl()
> >>>>   cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
> >>>>   cxl/pci: Remove unnecessary CXL RCH handling helper functions
> >>>>   cxl: Move CXL driver's RCH error handling into core/ras_rch.c
> >>>>   CXL/AER: Replace device_lock() in cxl_rch_handle_error_iter() with
> >>>>     guard() lock
> >>>>   CXL/AER: Move AER drivers RCH error handling into pcie/aer_cxl_rch.c
> >>>>   PCI/AER: Report CXL or PCIe bus error type in trace logging
> >>>>   cxl/pci: Update RAS handler interfaces to also support CXL Ports
> >>>>   cxl/pci: Log message if RAS registers are unmapped
> >>>>   cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
> >>>>   cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
> >>>>   cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
> >>>>   CXL/PCI: Introduce PCI_ERS_RESULT_PANIC
> >>>>   CXL/AER: Introduce pcie/aer_cxl_vh.c in AER driver for forwarding CXL
> >>>>     errors
> >>>>   cxl: Introduce cxl_pci_drv_bound() to check for bound driver
> >>>>   cxl: Change CXL handlers to use guard() instead of scoped_guard()
> >>>>   cxl/pci: Introduce CXL protocol error handlers for Endpoints
> >>>>   CXL/PCI: Introduce CXL Port protocol error handlers
> >>>>   PCI/AER: Dequeue forwarded CXL error
> >>>>   CXL/PCI: Export and rename merge_result() to pci_ers_merge_result()
> >>>>   CXL/PCI: Introduce CXL uncorrectable protocol error recovery
> >>>>   CXL/PCI: Enable CXL protocol errors during CXL Port probe
> >>>>   CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup
> >>> Is the mix of "CXL/PCI" vs "cxl/pci" in the above telling me
> >>> something, or should they all match?
> >>>
> >>> As a rule of thumb, I'm going to look at things that start with "PCI"
> >>> and skip most of the rest on the assumption that the rest only have
> >>> incidental effects on PCI.
> >>
> >> I think there was logic behind the (un)capitalized but I forget the
> >> reasoning. It's  better to keep it simple. I'll change to use
> >> PCI/CXL and AER/CXL.
> > 
> > I don't know what "AER/CXL" means.  I think "PCI" and "CXL" are the
> > big chunks here and one of them should be first in the prefix.
> > 
> > I do think there's value in using "PCI/AER" for things specific to AER
> > and "PCI/ERR" for more generic PCI error handling, and maybe "PCI/CXL"
> > for significant CXL-related things in drivers/pci/.
> 
> I was informed any patch touching PCI files requires a PCI maintainer 
> review or acknowledgment. I misunderstood how to communicate this.
> 
> In my workflow, I used uppercase tags like PCI or AER to indicate that 
> a patch needed PCI review or ack. For example, when I wrote CXL/PCI, I 
> intended to signal that the patch was primarily CXL-related but in a 
> PCI context, and therefore might need PCI review.
> 
> To avoid confusion in the future, can you advise on the best way to 
> indicate a patch needs your PCI review—even if the PCI changes are
> minor and don’t warrant leading with the PCI label?
> 
> Also, can you review the following patches?
> [RESEND v13 01/25] CXL-PCI-Move-CXL-DVSEC-definitions-into-uapi-lin
> [RESEND v13 02/25] PCI-CXL-Introduce-pcie_is_cxl
> [RESEND v13 07/25] CXL-AER-Replace-device_lock-in-cxl_rch_handle_er
> [RESEND v13 08/25] CXL-AER-Move-AER-drivers-RCH-error-handling-into
> [RESEND v13 16/25] CXL-AER-Introduce-pcie-aer_cxl_vh.c-in-AER-drive
> [RESEND v13 20/25] CXL-PCI-Introduce-CXL-Port-protocol-error-handle
> [RESEND v13 22/25] CXL-PCI-Export-and-rename-merge_result-to-pci_er
> [RESEND v13 23/25] CXL-PCI-Introduce-CXL-uncorrectable-protocol-err
> [RESEND v13 25/25] CXL-PCI-Disable-CXL-protocol-error-interrupts-du

Sorry, I responded to most of the first v13 series because I didn't
notice the resend, so this got a little fragmented.  Let me know if
there's more I should look at.

Bjorn


* Re: [RESEND v13 01/25] CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h
  2025-12-08 18:04   ` Bjorn Helgaas
@ 2025-12-08 22:13     ` Bowman, Terry
  0 siblings, 0 replies; 103+ messages in thread
From: Bowman, Terry @ 2025-12-08 22:13 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, alucerop, ira.weiny,
	linux-kernel, linux-pci

On 12/8/2025 12:04 PM, Bjorn Helgaas wrote:
> On Tue, Nov 04, 2025 at 11:02:41AM -0600, Terry Bowman wrote:
>> The CXL DVSECs are currently defined in cxl/core/cxlpci.h. These are not
>> accessible to other subsystems. Move these to uapi/linux/pci_regs.h.
>>
>> Change DVSEC name formatting to follow the existing PCI format in
>> pci_regs.h. The current format uses CXL_DVSEC_XYZ and the CXL defines must
>> be changed to be PCI_DVSEC_CXL_XYZ to match existing pci_regs.h. Leave
>> PCI_DVSEC_CXL_PORT* defines as-is because they are already defined and may
>> be in use by userspace application(s).
>>
>> Update existing usage to match the name change.
>>
>> Update the inline documentation to refer to latest CXL spec version.
> 
> Regrettably, r3.2 is no longer the latest ;)
> 

Yes, I'll update.

>> +++ b/include/uapi/linux/pci_regs.h
>> @@ -1244,9 +1244,64 @@
>>  /* Deprecated old name, replaced with PCI_DOE_DATA_OBJECT_DISC_RSP_3_TYPE */
>>  #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		PCI_DOE_DATA_OBJECT_DISC_RSP_3_TYPE
>>  
>> -/* Compute Express Link (CXL r3.1, sec 8.1.5) */
>> -#define PCI_DVSEC_CXL_PORT				3
>> -#define PCI_DVSEC_CXL_PORT_CTL				0x0c
>> -#define PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR		0x00000001
>> +/* Compute Express Link (CXL r3.2, sec 8.1)
>> + *
>> + * Note that CXL DVSEC id 3 and 7 to be ignored when the CXL link state
>> + * is "disconnected" (CXL r3.2, sec 9.12.3). Re-enumerate these
>> + * registers on downstream link-up events.
>> + */
>> +
>> +#define PCI_DVSEC_HEADER1_LENGTH_MASK  __GENMASK(31, 20)
> 
> I think PCI_DVSEC_HEADER1_LEN() could be used instead of adding a new
> definition.
> 
>> +/* CXL 3.2 8.1.3: PCIe DVSEC for CXL Device */
> 
> Can you use "CXL r4.0, sec 8.1.3" and similar so it refers to the most
> recent revision and matches the typical style for PCIe spec references?

Yes, I'll update all spec references to point at CXL r4.0 and the specific section.

-Terry
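
For reference, a minimal userspace sketch of why the existing
PCI_DVSEC_HEADER1_LEN() helper covers the same field as the proposed
PCI_DVSEC_HEADER1_LENGTH_MASK. Both select bits 31:20 of DVSEC Header 1.
Assumptions: the PCI_DVSEC_HEADER1_LEN() body below is reproduced from
pci_regs.h as recalled rather than copied from the tree, __GENMASK(31, 20)
is expanded by hand to 0xfff00000, and the two dvsec_len_*() helpers are
hypothetical names used only for this comparison.

```c
#include <stdint.h>

/* Existing helper in include/uapi/linux/pci_regs.h (assumed definition,
 * reproduced here so the sketch builds outside the kernel tree). */
#define PCI_DVSEC_HEADER1_LEN(x)	(((x) >> 20) & 0xfff)

/* Definition proposed in the patch, with __GENMASK(31, 20) expanded by hand. */
#define PCI_DVSEC_HEADER1_LENGTH_MASK	0xfff00000u

/* Hypothetical helper: extract the DVSEC length with the existing macro. */
static inline uint32_t dvsec_len_existing(uint32_t hdr1)
{
	return PCI_DVSEC_HEADER1_LEN(hdr1);
}

/* Hypothetical helper: mask bits 31:20 and shift down, which is what
 * FIELD_GET() with the proposed mask would compute in kernel code. */
static inline uint32_t dvsec_len_masked(uint32_t hdr1)
{
	return (hdr1 & PCI_DVSEC_HEADER1_LENGTH_MASK) >> 20;
}
```

Both helpers return the same value for any Header 1 dword, so callers that
only need to read the length could use the existing macro without a new
definition. A separate mask would only be needed if something has to write
the field or pass it to FIELD_PREP().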


end of thread, other threads:[~2025-12-08 22:13 UTC | newest]

Thread overview: 103+ messages
2025-11-04 17:02 [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
2025-11-04 17:02 ` [RESEND v13 01/25] CXL/PCI: Move CXL DVSEC definitions into uapi/linux/pci_regs.h Terry Bowman
2025-11-04 17:50   ` Jonathan Cameron
2025-11-19  3:19   ` dan.j.williams
2025-12-08 18:04   ` Bjorn Helgaas
2025-12-08 22:13     ` Bowman, Terry
2025-11-04 17:02 ` [RESEND v13 02/25] PCI/CXL: Introduce pcie_is_cxl() Terry Bowman
2025-11-04 17:52   ` Jonathan Cameron
2025-11-19  3:19   ` dan.j.williams
2025-11-19 15:55     ` Bowman, Terry
2025-11-19 23:34       ` dan.j.williams
2025-11-21 20:31   ` Gregory Price
2025-11-04 17:02 ` [RESEND v13 03/25] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
2025-11-04 17:53   ` Jonathan Cameron
2025-11-19  3:20   ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 04/25] cxl/pci: Remove unnecessary CXL RCH " Terry Bowman
2025-11-19  3:20   ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 05/25] cxl: Remove CXL VH handling in CONFIG_PCIEAER_CXL conditional blocks from core/pci.c Terry Bowman
2025-11-19  3:20   ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 06/25] cxl: Move CXL driver's RCH error handling into core/ras_rch.c Terry Bowman
2025-11-04 18:03   ` Jonathan Cameron
2025-11-19  3:20   ` dan.j.williams
2025-11-19 16:07     ` Bowman, Terry
2025-11-04 17:02 ` [RESEND v13 07/25] CXL/AER: Replace device_lock() in cxl_rch_handle_error_iter() with guard() lock Terry Bowman
2025-11-04 18:05   ` Jonathan Cameron
2025-11-04 19:53   ` Dave Jiang
2025-11-19  3:20   ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 08/25] CXL/AER: Move AER drivers RCH error handling into pcie/aer_cxl_rch.c Terry Bowman
2025-11-19  3:20   ` dan.j.williams
2025-11-19  8:26     ` Lukas Wunner
2025-11-19 23:36       ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 09/25] PCI/AER: Report CXL or PCIe bus error type in trace logging Terry Bowman
2025-11-04 18:08   ` Jonathan Cameron
2025-11-04 18:26   ` Bjorn Helgaas
2025-11-04 17:02 ` [RESEND v13 10/25] cxl/pci: Update RAS handler interfaces to also support CXL Ports Terry Bowman
2025-11-04 18:10   ` Jonathan Cameron
2025-11-11  8:17   ` Alison Schofield
2025-11-19  3:19   ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 11/25] cxl/pci: Log message if RAS registers are unmapped Terry Bowman
2025-11-19  3:27   ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 12/25] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports Terry Bowman
2025-11-19 21:23   ` dan.j.williams
2025-11-19 22:02     ` Bowman, Terry
2025-11-19 23:40       ` dan.j.williams
2025-11-21 14:56         ` Bowman, Terry
2025-11-04 17:02 ` [RESEND v13 13/25] cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors Terry Bowman
2025-11-05  8:30   ` Alejandro Lucero Palau
2025-11-19 22:00   ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 14/25] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers Terry Bowman
2025-11-04 18:15   ` Jonathan Cameron
2025-11-04 20:03   ` Dave Jiang
2025-11-11  8:23   ` Alison Schofield
2025-11-04 17:02 ` [RESEND v13 15/25] CXL/PCI: Introduce PCI_ERS_RESULT_PANIC Terry Bowman
2025-11-04 19:03   ` Bjorn Helgaas
2025-11-20  0:17   ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 16/25] CXL/AER: Introduce pcie/aer_cxl_vh.c in AER driver for forwarding CXL errors Terry Bowman
2025-11-20  0:44   ` dan.j.williams
2025-11-20  0:53   ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 17/25] cxl: Introduce cxl_pci_drv_bound() to check for bound driver Terry Bowman
2025-11-05 17:51   ` Gregory Price
2025-11-05 19:03     ` Gregory Price
2025-11-05 22:26       ` Gregory Price
2025-11-06 17:11         ` Gregory Price
2025-11-06 23:32         ` Bowman, Terry
2025-11-11  8:33   ` Alison Schofield
2025-11-13 21:42     ` Alison Schofield
2025-11-13 22:39       ` Bowman, Terry
2025-11-20  1:24   ` dan.j.williams
2025-11-04 17:02 ` [RESEND v13 18/25] cxl: Change CXL handlers to use guard() instead of scoped_guard() Terry Bowman
2025-11-04 18:18   ` Jonathan Cameron
2025-11-04 20:15   ` Dave Jiang
2025-11-04 17:02 ` [RESEND v13 19/25] cxl/pci: Introduce CXL protocol error handlers for Endpoints Terry Bowman
2025-11-04 18:29   ` Jonathan Cameron
2025-11-04 19:09   ` Bjorn Helgaas
2025-11-04 17:03 ` [RESEND v13 20/25] CXL/PCI: Introduce CXL Port protocol error handlers Terry Bowman
2025-11-04 18:32   ` Jonathan Cameron
2025-11-04 21:20   ` Dave Jiang
2025-11-04 21:27     ` Bowman, Terry
2025-11-04 23:39       ` Dave Jiang
2025-11-04 17:03 ` [RESEND v13 21/25] PCI/AER: Dequeue forwarded CXL error Terry Bowman
2025-11-04 18:40   ` Jonathan Cameron
2025-11-04 18:45   ` Bjorn Helgaas
2025-11-20  3:33   ` dan.j.williams
2025-11-04 17:03 ` [RESEND v13 22/25] CXL/PCI: Export and rename merge_result() to pci_ers_merge_result() Terry Bowman
2025-11-04 18:41   ` Jonathan Cameron
2025-11-04 19:03   ` Bjorn Helgaas
2025-11-14 15:20     ` Bowman, Terry
2025-11-14 16:09       ` Jonathan Cameron
2025-11-04 17:03 ` [RESEND v13 23/25] CXL/PCI: Introduce CXL uncorrectable protocol error recovery Terry Bowman
2025-11-04 18:47   ` Jonathan Cameron
2025-11-04 23:43     ` Dave Jiang
2025-11-05 14:59       ` Bowman, Terry
2025-11-05 16:10         ` Dave Jiang
2025-11-11  8:37   ` Alison Schofield
2025-12-08 18:40   ` Bjorn Helgaas
2025-11-04 17:03 ` [RESEND v13 24/25] CXL/PCI: Enable CXL protocol errors during CXL Port probe Terry Bowman
2025-11-04 17:03 ` [RESEND v13 25/25] CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup Terry Bowman
2025-11-20  3:10   ` dan.j.williams
2025-11-04 19:11 ` [RESEND v13 00/25] Enable CXL PCIe Port Protocol Error handling and logging Bjorn Helgaas
2025-11-04 21:54   ` Bowman, Terry
2025-11-04 22:12     ` Bjorn Helgaas
2025-12-04 17:30       ` Bowman, Terry
2025-12-08 18:42         ` Bjorn Helgaas
