[PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging

linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging
@ 2025-06-26 22:42 Terry Bowman
  2025-06-26 22:42 ` [PATCH v10 01/17] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
                   ` (18 more replies)
  0 siblings, 19 replies; 82+ messages in thread
From: Terry Bowman @ 2025-06-26 22:42 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

This patchset updates CXL Protocol Error handling for CXL Ports and CXL
Endpoints (EP). The reach of this patchset grew from CXL Ports to include
EPs as well.

This patchset is a continuation of v9 found here:
https://lore.kernel.org/linux-cxl/20250603172239.159260-1-terry.bowman@amd.com/

The first patch is a small cleanup change to reduce amount of code. 

The next 2 patches introduce pci_dev::is_cxl, aer_info::is_cxl, and add
bus string to AER log tracing. aer_info::is_cxl will be used to indicate a
CXL or PCI error and will be used to direct the error handling flow in
later patches.

The next patch introduces a new driver file, pci/pcie/cxl_aer.c, to move
the existing CXL AER logic into.

The next 3 patches update the AER driver and CXL driver to use a kfifo. 
The kfifo is added to offload CXL-AER protocol error work to the CXL
driver. These patches provide the kfifo work add and work remove. 

The next 5 patches prepare the CXL driver for adding the updated protocol
error handlers. This includes adding CXL Port RAS mapping and updating
interfaces for common support.

The final 5 patches add the CXL error handlers for CXL EPs and CXL Ports.
CXL EPs keep the PCIe error handler for cases the EP error is interpreted
as a PCIe error. These patches also add logic to unmask CXL Protocol Errors
during port probing, and mask CXL Protocol Errors during port device
cleanup.

== Testing ==
Testing results below shows the Upstream Switch Port UCE and EP UCE errors
are handled as PCI errors. This is because aer_get_device_error_info() does
not populate the AER error severity and status in the case of FATAL UCE on
Upstream Ports and Endpoints. This is intended because the USP link to
access the device can be compromised. The check for is_cxl_error() and
is_internal_error() fail as a result and then processes the error as a PCI
error. Also, the AER event logging is missing the PCIe AER status.

The sub-topology for the QEMU testing is:
                    ---------------------
                    | CXL RP - 0C:00.0  |
                    ---------------------
                              |
                    ---------------------
                    | CXL USP - 0D:00.0 |
                    ---------------------
                              |
                    ---------------------
                    | CXL DSP - 0E:00.0 |
                    ---------------------
                              |
                    ---------------------
                    | CXL EP - 0F:00.0  |
                    ---------------------

 root@tbowman-cxl:~# lspci -t
 -+-[0000:00]-+-00.0
  |           +-01.0
  |           +-02.0
  |           +-03.0
  |           +-1f.0
  |           +-1f.2
  |           \-1f.3
  \-[0000:0c]---00.0-[0d-0f]----00.0-[0e-0f]----00.0-[0f]----00.0

 The topology was created with:
  ${qemu} -boot menu=on \
             -cpu host \
             -nographic \
             -monitor telnet:127.0.0.1:1234,server,nowait \
             -M virt,cxl=on \
             -chardev stdio,id=s1,signal=off,mux=on -serial none \
             -device isa-serial,chardev=s1 -mon chardev=s1,mode=readline \
             -machine q35,cxl=on \
             -m 16G,maxmem=24G,slots=8 \
             -cpu EPYC-v3 \
             -smp 16 \
             -accel kvm \
             -drive file=${img},format=raw,index=0,media=disk \
             -device e1000,netdev=user.0 \
             -netdev user,id=user.0,hostfwd=tcp::5555-:22 \
             -object memory-backend-file,id=cxl-mem0,share=on,mem-path=/tmp/cxltest.raw,size=256M \
             -object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa0.raw,size=256M \
             -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
             -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
             -device cxl-upstream,bus=root_port0,id=us0 \
             -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
             -device cxl-type3,bus=swport0,volatile-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-vmem0 \
             -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k

== Root Port ==
root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
pcieport 0000:0c:00.0:    [14] CorrIntErr    
cxl_aer_correctable_error: memdev=0000:0c:00.0 host=pci0000:0c serial=0 status='CRC Threshold Hit'

root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
pcieport 0000:0c:00.0:    [22] UncorrIntErr          
cxl_aer_uncorrectable_error: memdev=mem3 host=0000:0f:00.0 serial=0 status='Cache Byte Enable Parity Error' first_error='Cache Byte Enable Pari'
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 214 Comm: kworker/10:1 Tainted: G            E       6.16.0-rc1-gec6a70ce29c1-dirty #423 PREEMPT(voluntary) 
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Workqueue: events cxl_proto_err_work_fn
Call Trace:
 <TASK>
 panic+0x364/0x3d0
 ? __pfx_cxl_report_error_detected+0x10/0x10
 cxl_do_recovery+0xa3/0xb0
 cxl_proto_err_work_fn+0xf5/0x180
 process_scheduled_works+0xa8/0x420
 ? __pfx_worker_thread+0x10/0x10
 worker_thread+0x11c/0x260
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xfe/0x210
 ? __pfx_kthread+0x10/0x10
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x84/0xf0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Kernel Offset: 0xc000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: CXL cachemem error. ]---

== Upstream Port ==
root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
aer_event: 0000:0d:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00004000/0000a000
pcieport 0000:0d:00.0:    [14] CorrIntErr            
cxl_aer_correctable_error: memdev=0000:0d:00.0 host=0000:0c:00.0 serial=0 status='CRC Threshold Hit'

root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
AER: aer_print_error():850: status=0, mask=0
aer_event: 0000:0d:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
pcieport 0000:0d:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
cxl_aer_uncorrectable_error: memdev=mem0 host=0000:0f:00.0 serial=0 status='Cache Byte Enable Parity Error' first_error='Cache Byte Enable Pari'
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G            E       6.16.0-rc1-gec6a70ce29c1-dirty #427 PREEMPT(voluntary) 
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 panic+0x364/0x3d0
 ? __pfx_aer_root_reset+0x10/0x10
 pci_error_detected+0x2b/0x30
 report_error_detected+0xf7/0x170
 ? __pfx_report_frozen_detected+0x10/0x10
 __pci_walk_bus+0x4c/0x70
 ? __pfx_report_frozen_detected+0x10/0x10
 __pci_walk_bus+0x34/0x70
 ? __pfx_report_frozen_detected+0x10/0x10
 __pci_walk_bus+0x34/0x70
 ? __pfx_report_frozen_detected+0x10/0x10
 pci_walk_bus+0x31/0x50
 pcie_do_recovery+0x163/0x2b0
 aer_isr_one_error_type+0x1e8/0x380
 ? __pfx_irq_thread_fn+0x10/0x10
 aer_isr_one_error+0x11d/0x140
 aer_isr+0x4c/0x80
 irq_thread_fn+0x24/0x70
 irq_thread+0x188/0x280
 ? __pfx_irq_thread_dtor+0x10/0x10
 ? __pfx_irq_thread+0x10/0x10
 kthread+0xfe/0x210
 ? __pfx_kthread+0x10/0x10
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x84/0xf0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Kernel Offset: 0x12400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: CXL cachemem error. ]---

== Downstream Port ==
root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
pcieport 0000:0e:00.0:    [14] CorrIntErr            
cxl_aer_correctable_error: memdev=0000:0e:00.0 host=0000:0d:00.0 serial=0 status='CRC Threshold Hit'

root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
AER: aer_print_error():850: status=400000, mask=2000000
aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00400000/02000000
pcieport 0000:0e:00.0:    [22] UncorrIntErr          
cxl_aer_uncorrectable_error: memdev=0000:0e:00.0 host=0000:0d:00.0 serial=0 status='Cache Byte Enable Parity Error' first_error='Cache Byte Ena'
cxl_aer_uncorrectable_error: memdev=mem3 host=0000:0f:00.0 serial=0 status='Cache Byte Enable Parity Error' first_error='Cache Byte Enable Pari'
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 218 Comm: kworker/10:1 Tainted: G            E       6.16.0-rc1-gec6a70ce29c1-dirty #428 PREEMPT(voluntary) 
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Workqueue: events cxl_proto_err_work_fn
Call Trace:
 <TASK>
 panic+0x364/0x3d0
 cxl_do_recovery+0xa3/0xb0
 cxl_proto_err_work_fn+0xf5/0x180
 process_scheduled_works+0xa8/0x420
 ? __pfx_worker_thread+0x10/0x10
 worker_thread+0x11c/0x260
 ? __pfx_worker_thread+0x10/0x10
 kthread+0xfe/0x210
 ? __pfx_kthread+0x10/0x10
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x84/0xf0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Kernel Offset: 0x3b400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: CXL cachemem error. ]---

== Endpoint ==
root@tbowman-cxl:~/aer-inject# ./ep-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0f:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0f:00.0
aer_event: 0000:0f:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not avaie
cxl_pci 0000:0f:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
cxl_pci 0000:0f:00.0:   device [8086:0d93] error status/mask=00004000/00000000
cxl_pci 0000:0f:00.0:    [14] CorrIntErr            
cxl_aer_correctable_error: memdev=mem3 host=0000:0f:00.0 serial=0 status='CRC Threshold Hit'

root@tbowman-cxl:~/aer-inject# ./ep-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0f:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0f:00.0
aer_event: 0000:0f:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
cxl_pci 0000:0f:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
cxl_aer_uncorrectable_error: memdev=mem1 host=0000:0f:00.0 serial=0 status='Cache Byte Enable Parity Error' first_error='Cache Byte Enable Pari'
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 147 Comm: irq/24-aerdrv Tainted: G            E       6.16.0-rc1-gec6a70ce29c1-dirty #423 PREEMPT(voluntary) 
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 panic+0x364/0x3d0
 ? __pfx_aer_root_reset+0x10/0x10
 pci_error_detected+0x2b/0x30
 report_error_detected+0xf7/0x170
 ? __pfx_report_frozen_detected+0x10/0x10
 __pci_walk_bus+0x4c/0x70
 ? __pfx_report_frozen_detected+0x10/0x10
 pci_walk_bus+0x31/0x50
 pcie_do_recovery+0x163/0x2b0
 aer_isr_one_error_type+0x1e8/0x380
 ? __pfx_irq_thread_fn+0x10/0x10
 aer_isr_one_error+0x11d/0x140
 aer_isr+0x4c/0x80
 irq_thread_fn+0x24/0x70
 irq_thread+0x188/0x280
 ? __pfx_irq_thread_dtor+0x10/0x10
 ? __pfx_irq_thread+0x10/0x10
 kthread+0xfe/0x210
 ? __pfx_kthread+0x10/0x10
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x84/0xf0
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Kernel Offset: 0x2b200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: CXL cachemem error. ]---

Changes
=======

Changes in v9 -> v10:
 - Add drivers/pci/pcie/cxl_aer.c
 - Add drivers/cxl/core/native_ras.c
 - Change cxl_register_prot_err_work()/cxl_unregister_prot_err_work to return void
 - Check for pcie_ports_native in cxl_do_recovery()
 - Remove debug logging in cxl_do_recovery()
 - Update PCI_ERS_RESULT_PANIC definition to indicate is CXL specific
 - Revert trace logging changes: name,parent -> memdev,host.
 - Use FIELD_GET() to check for EP class code (cxl_aer.c & native_ras.c).
 - Change _prot_ to _proto_ everywhere
 - cxl_rch_handle_error_iter(), check if driver is cxl_pci_driver
 - Remove cxl_create_prot_error_info(). Move logic into forward_cxl_error()
 - Remove sbdf_to_pci() and move logic into cxl_handle_proto_error()
 - Simplify/refactor get_pci_cxl_host_dev()
 - Simplify/refactor cxl_get_ras_base()
 - Move patch 'Remove unnecessary CXL Endpoint handling helper functions' to front
 - Update description for 'CXL/PCI: Introduce CXL Port protocol error
   handlers' with why state is not used to determine handling
 - Introduce cxl_pci_drv_bound() and call from cxl_rch_handle_error_iter()

Changes in v8 -> v9:
 - Updated reference counting to use pci_get_device()/pci_put_device() in
   cxl_disable_prot_errors()/cxl_enable_prot_errors
 - Refactored cxl_create_prot_err_info() to fix reference counting
 - Removed 'struct cxl_port' driver changes for error handler. Instead
   check for CXL device type (EP or Port device) and call handler
 - Make pcie_is_cxl() static inline in include/linux/linux.h
 - Remove NULL check in create_prot_err_info()
 - Change success return in cxl_ras_init() to use hardcoded 0
 - Changed 'struct work_struct cxl_prot_err_work' declaration to static
 - Change to use rate limited log with dev anchor in forward_cxl_error()
 - Refactored forward-cxl_error() to remove severity auto variable
 - Changed pci_aer_clear_nonfatal_status() to be static inline for
   !(CONFIG_PCIEAER)
 - Renamed merge_result() to be cxl_merge_result()
 - Removed 'ue' condition in cxl_error_detected()
 - Updated 2nd parameter in call to __cxl_handle_cor_ras()/__cxl_handle_ras()
   in unify patch
 - Added log message for failure while assigning interrupt disable callback
 - Updated pci_aer_mask_internal_errors() to use pci_clear_and_set_config_dword()
 - Simplified patch titles for clarity
 - Moved CXL error interrupt disabling into cxl/core/port.c with CXL Port
 teardown
 - Updated 'struct cxl_port_err_info' to only contain sbdf and severity
 Removed everything else.
 - Added pdev and CXL device get_device()/put_device() before calling handlers
 
Changes in v7 -> v8:
 [Dan] Use kfifo. Move handling to CXL driver. AER forwards error to CXL
driver
 [Dan] Add device reference incrementors where needed throughout
 [Dan] Initiate CXL Port RAS init from Switch Port and Endpoint Port init 
 [Dan] Combine CXL Port and CXL Endpoint trace routine
 [Dan] Introduce aer_info::is_cxl. Use to indicate CXL or PCI errors
 [Jonathan] Add serial number for all devices in trace
 [DaveJ] Move find_cxl_port() change into patch using it
 [Terry] Move CXL Port RAS init into cxl/port.c
 [Terry] Moved kfifo functions into cxl/core/ras.c 
 
 Changes in v6 -> v7:
 [Terry] Move updated trace routine call to later patch. Was causing build
 error.
 
 Changes in v5 -> v6:
 [Ira] Move pcie_is_cxl(dev) define to a inline function
 [Ira] Update returning value from pcie_is_cxl_port() to bool w/o cast
 [Ira] Change cxl_report_error_detected() cleanup to return correct bool
 [Ira] Introduce and use PCI_ERS_RESULT_PANIC
 [Ira] Reuse comment for PCIe and CXL recovery paths
 [Jonathan] Add type check in for cxl_handle_cor_ras() and cxl_handle_ras()
 [Jonathan] cxl_uport/dport_init_ras_reporting(), added a mutex.
 [Jonathan] Add logging example to patches updating trace output
 [Jonathan] Make parameter 'const' to eliminate for cast in match_uport()
 [Jonathan] Use __free() in cxl_pci_port_ras()
 [Terry] Add patch to log the PCIe SBDF along with CXL device name
 [Terry] Add patch to handle CXL endpoint and RCH DP errors as CXL errors
 [Terry] Remove patch w USP UCE fatal support @ aer_get_device_error_info()
 [Terry] Rebase to cxl/next commit 5585e342e8d3 ("cxl/memdev: Remove unused partition values")
 [Gregory] Pre-initialize pointer to NULL in cxl_pci_port_ras()
 [Gregory] Move AER driver bus name detection to a static function

 Changes in v4 -> v5:
 [Alejandro] Refactor cxl_walk_bridge to simplify 'status' variable usage
 [Alejandro] Add WARN_ONCE() in __cxl_handle_ras() and cxl_handle_cor_ras()
 [Ming] Remove unnecessary NULL check in cxl_pci_port_ras()
 [Terry] Add failure check for call to to_cxl_port() in cxl_pci_port_ras()
 [Ming] Use port->dev for call to devm_add_action_or_reset() in
 cxl_dport_init_ras_reporting() and cxl_uport_init_ras_reporting()
 [Jonathan] Use get_device()/put_device() to prevent race condition in
 cxl_clear_port_error_handlers() and cxl_clear_port_error_handlers()
 [Terry] Commit message cleanup. Capitalize keywords from CXL and PCI
 specifications

 Changes in v3 -> v4:
 [Lukas] Capitalize PCIe and CXL device names as in specifications
 [Lukas] Move call to pcie_is_cxl() into cxl_port_devsec()
 [Lukas] Correct namespace spelling
 [Lukas] Removed export from pcie_is_cxl_port()
 [Lukas] Simplify 'if' blocks in cxl_handle_error()
 [Lukas] Change panic message to remove redundant 'panic' text
 [Ming] Update to call cxl_dport_init_ras_reporting() in RCH case
 [lkp@intel] 'host' parameter is already removed. Remove parameter description too.
 [Terry] Added field description for cxl_err_handlers in pci.h comment block

 Changes in v1 -> v2:
 [Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
 [Jonathan] Update description to DSP map patch description
 [Jonathan] Update cxl_pci_port_ras() to check for NULL port
 [Jonathan] Dont call handler before handler port changes are present (patch order)
 [Bjorn] Fix linebreak in cover sheet URL
 [Bjorn] Remove timestamps from test logs in cover sheet
 [Bjorn] Retitle AER commits to use "PCI/AER:"
 [Bjorn] Retitle patch#3 to use renaming instead of refactoring
 [Bjorn] Fix base commit-id on cover sheet
 [Bjorn] Add VH spec reference/citation
 [Terry] Removed last 2 patches to enable internal errors. Is not needed
 because internal errors are enabled in AER driver.
 [Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
 [Dan] Use kernel panic in CXL recovery
 [Dan] cxl_port_hndlrs -> cxl_port_error_handlers
 
Terry Bowman (17):
  cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
  PCI/CXL: Add pcie_is_cxl()
  PCI/AER: Report CXL or PCIe bus error type in trace logging
  CXL/AER: Introduce CXL specific AER driver file
  CXL/AER: Introduce kfifo for forwarding CXL errors
  PCI/AER: Dequeue forwarded CXL error
  CXL/PCI: Introduce CXL uncorrectable protocol error recovery
  cxl/pci: Move RAS initialization to cxl_port driver
  cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
  cxl/pci: Update RAS handler interfaces to also support CXL Ports
  cxl/pci: Log message if RAS registers are unmapped
  cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
  cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
  cxl/pci: Introduce CXL Endpoint protocol error handlers
  CXL/PCI: Introduce CXL Port protocol error handlers
  CXL/PCI: Enable CXL protocol errors during CXL Port probe
  CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup

 drivers/cxl/Kconfig           |  14 +++
 drivers/cxl/core/Makefile     |   1 +
 drivers/cxl/core/core.h       |  10 ++
 drivers/cxl/core/native_ras.c | 231 ++++++++++++++++++++++++++++++++++
 drivers/cxl/core/pci.c        | 231 ++++++++++++++++------------------
 drivers/cxl/core/port.c       |  12 +-
 drivers/cxl/core/ras.c        |  15 ++-
 drivers/cxl/core/regs.c       |   2 +
 drivers/cxl/core/trace.h      |  84 +++----------
 drivers/cxl/cxl.h             |  20 +++
 drivers/cxl/cxlpci.h          |   8 +-
 drivers/cxl/mem.c             |   3 +-
 drivers/cxl/pci.c             |  14 ++-
 drivers/cxl/port.c            | 159 +++++++++++++++++++++++
 drivers/pci/pci.c             |   1 +
 drivers/pci/pci.h             |  25 ++--
 drivers/pci/pcie/Makefile     |   1 +
 drivers/pci/pcie/aer.c        | 167 ++++--------------------
 drivers/pci/pcie/cxl_aer.c    | 179 ++++++++++++++++++++++++++
 drivers/pci/pcie/err.c        |   8 +-
 drivers/pci/pcie/rcec.c       |   1 +
 drivers/pci/probe.c           |  10 ++
 include/linux/aer.h           |  46 +++++++
 include/linux/pci.h           |  19 +++
 include/linux/pci_ids.h       |   2 +
 include/ras/ras_event.h       |   9 +-
 include/uapi/linux/pci_regs.h |   8 +-
 27 files changed, 916 insertions(+), 364 deletions(-)
 create mode 100644 drivers/cxl/core/native_ras.c
 create mode 100644 drivers/pci/pcie/cxl_aer.c


base-commit: 716ba3023561ccacfaa28f988d26717535b8fed1
-- 
2.34.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* [PATCH v10 01/17] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
  2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
@ 2025-06-26 22:42 ` Terry Bowman
  2025-07-18 17:55   ` Dave Jiang
  2025-07-23 21:58   ` dan.j.williams
  2025-06-26 22:42 ` [PATCH v10 02/17] PCI/CXL: Add pcie_is_cxl() Terry Bowman
                   ` (17 subsequent siblings)
  18 siblings, 2 replies; 82+ messages in thread
From: Terry Bowman @ 2025-06-26 22:42 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

The CXL driver's cxl_handle_endpoint_cor_ras()/cxl_handle_endpoint_ras()
are unnecessary helper functions used only for Endpoints. Remove these
functions as they are not common for all CXL devices and do not provide
value for EP handling.

Rename __cxl_handle_ras to cxl_handle_ras() and __cxl_handle_cor_ras()
to cxl_handle_cor_ras().

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
---
 drivers/cxl/core/pci.c | 26 ++++++++------------------
 1 file changed, 8 insertions(+), 18 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index b50551601c2e..06464a25d8bd 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -664,8 +664,8 @@ void read_cdat_data(struct cxl_port *port)
 }
 EXPORT_SYMBOL_NS_GPL(read_cdat_data, "CXL");
 
-static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
-				 void __iomem *ras_base)
+static void cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
+			       void __iomem *ras_base)
 {
 	void __iomem *addr;
 	u32 status;
@@ -681,11 +681,6 @@ static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
 	}
 }
 
-static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
-{
-	return __cxl_handle_cor_ras(cxlds, cxlds->regs.ras);
-}
-
 /* CXL spec rev3.0 8.2.4.16.1 */
 static void header_log_copy(void __iomem *ras_base, u32 *log)
 {
@@ -707,8 +702,8 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
  * Log the state of the RAS status registers and prepare them to log the
  * next error status. Return 1 if reset needed.
  */
-static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
-				  void __iomem *ras_base)
+static bool cxl_handle_ras(struct cxl_dev_state *cxlds,
+			   void __iomem *ras_base)
 {
 	u32 hl[CXL_HEADERLOG_SIZE_U32];
 	void __iomem *addr;
@@ -741,11 +736,6 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
 	return true;
 }
 
-static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
-{
-	return __cxl_handle_ras(cxlds, cxlds->regs.ras);
-}
-
 #ifdef CONFIG_PCIEAER_CXL
 
 static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
@@ -824,13 +814,13 @@ EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
 static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
 					  struct cxl_dport *dport)
 {
-	return __cxl_handle_cor_ras(cxlds, dport->regs.ras);
+	return cxl_handle_cor_ras(cxlds, dport->regs.ras);
 }
 
 static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
 				       struct cxl_dport *dport)
 {
-	return __cxl_handle_ras(cxlds, dport->regs.ras);
+	return cxl_handle_ras(cxlds, dport->regs.ras);
 }
 
 /*
@@ -927,7 +917,7 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
 		if (cxlds->rcd)
 			cxl_handle_rdport_errors(cxlds);
 
-		cxl_handle_endpoint_cor_ras(cxlds);
+		cxl_handle_cor_ras(cxlds, cxlds->regs.ras);
 	}
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
@@ -956,7 +946,7 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 		 * chance the situation is recoverable dump the status of the RAS
 		 * capability registers and bounce the active state of the memdev.
 		 */
-		ue = cxl_handle_endpoint_ras(cxlds);
+		ue = cxl_handle_ras(cxlds, cxlds->regs.ras);
 	}
 
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v10 02/17] PCI/CXL: Add pcie_is_cxl()
  2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
  2025-06-26 22:42 ` [PATCH v10 01/17] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
@ 2025-06-26 22:42 ` Terry Bowman
  2025-07-23 22:30   ` dan.j.williams
  2025-08-09 10:56   ` Alejandro Lucero Palau
  2025-06-26 22:42 ` [PATCH v10 03/17] PCI/AER: Report CXL or PCIe bus error type in trace logging Terry Bowman
                   ` (16 subsequent siblings)
  18 siblings, 2 replies; 82+ messages in thread
From: Terry Bowman @ 2025-06-26 22:42 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

CXL and AER drivers need the ability to identify CXL devices.

Add set_pcie_cxl() with logic checking for CXL Flexbus DVSEC presence. The
CXL Flexbus DVSEC presence is used because it is required for all the CXL
PCIe devices.[1]

Add boolean 'struct pci_dev::is_cxl' with the purpose to cache the CXL
Flexbus presence.

Add function pcie_is_cxl() to return 'struct pci_dev::is_cxl'.

[1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
    Capability (DVSEC) ID Assignment, Table 8-2

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/pci/probe.c           | 10 ++++++++++
 include/linux/pci.h           |  6 ++++++
 include/uapi/linux/pci_regs.h |  8 +++++++-
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 4b8693ec9e4c..5d3548648d5c 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1691,6 +1691,14 @@ static void set_pcie_thunderbolt(struct pci_dev *dev)
 		dev->is_thunderbolt = 1;
 }
 
+static void set_pcie_cxl(struct pci_dev *dev)
+{
+	u16 dvsec = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
+					      PCI_DVSEC_CXL_FLEXBUS);
+	if (dvsec)
+		dev->is_cxl = 1;
+}
+
 static void set_pcie_untrusted(struct pci_dev *dev)
 {
 	struct pci_dev *parent = pci_upstream_bridge(dev);
@@ -2021,6 +2029,8 @@ int pci_setup_device(struct pci_dev *dev)
 	/* Need to have dev->cfg_size ready */
 	set_pcie_thunderbolt(dev);
 
+	set_pcie_cxl(dev);
+
 	set_pcie_untrusted(dev);
 
 	if (pci_is_pcie(dev))
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 05e68f35f392..79878243b681 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -453,6 +453,7 @@ struct pci_dev {
 	unsigned int	is_hotplug_bridge:1;
 	unsigned int	shpc_managed:1;		/* SHPC owned by shpchp */
 	unsigned int	is_thunderbolt:1;	/* Thunderbolt controller */
+	unsigned int	is_cxl:1;               /* Compute Express Link (CXL) */
 	/*
 	 * Devices marked being untrusted are the ones that can potentially
 	 * execute DMA attacks and similar. They are typically connected
@@ -744,6 +745,11 @@ static inline bool pci_is_vga(struct pci_dev *pdev)
 	return false;
 }
 
+static inline bool pcie_is_cxl(struct pci_dev *pci_dev)
+{
+	return pci_dev->is_cxl;
+}
+
 #define for_each_pci_bridge(dev, bus)				\
 	list_for_each_entry(dev, &bus->devices, bus_list)	\
 		if (!pci_is_bridge(dev)) {} else
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index a3a3e942dedf..fb9d77c69d5f 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -1225,9 +1225,15 @@
 /* Deprecated old name, replaced with PCI_DOE_DATA_OBJECT_DISC_RSP_3_TYPE */
 #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		PCI_DOE_DATA_OBJECT_DISC_RSP_3_TYPE
 
-/* Compute Express Link (CXL r3.1, sec 8.1.5) */
+/* Compute Express Link (CXL r3.2, sec 8.1)
+ *
+ * Note that CXL DVSEC id 3 and 7 to be ignored when the CXL link state
+ * is "disconnected" (CXL r3.2, sec 9.12.3). Re-enumerate these
+ * registers on downstream link-up events.
+ */
 #define PCI_DVSEC_CXL_PORT				3
 #define PCI_DVSEC_CXL_PORT_CTL				0x0c
 #define PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR		0x00000001
+#define PCI_DVSEC_CXL_FLEXBUS				7
 
 #endif /* LINUX_PCI_REGS_H */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v10 03/17] PCI/AER: Report CXL or PCIe bus error type in trace logging
  2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
  2025-06-26 22:42 ` [PATCH v10 01/17] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
  2025-06-26 22:42 ` [PATCH v10 02/17] PCI/CXL: Add pcie_is_cxl() Terry Bowman
@ 2025-06-26 22:42 ` Terry Bowman
  2025-06-26 23:25   ` Sathyanarayanan Kuppuswamy
                     ` (4 more replies)
  2025-06-26 22:42 ` [PATCH v10 04/17] CXL/AER: Introduce CXL specific AER driver file Terry Bowman
                   ` (15 subsequent siblings)
  18 siblings, 5 replies; 82+ messages in thread
From: Terry Bowman @ 2025-06-26 22:42 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
for all errors. Update the driver and aer_event tracing to log 'CXL Bus
Type' for CXL device errors.

This requires the AER can identify and distinguish between PCIe errors and
CXL errors.

Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
aer_get_device_error_info() and pci_print_aer().

Update the aer_event trace routine to accept a bus type string parameter.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
---
 drivers/pci/pci.h       |  6 ++++++
 drivers/pci/pcie/aer.c  | 21 +++++++++++++++------
 include/ras/ras_event.h |  9 ++++++---
 3 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 12215ee72afb..a0d1e59b5666 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -608,6 +608,7 @@ struct aer_err_info {
 	int ratelimit_print[AER_MAX_MULTI_ERR_DEVICES];
 	int error_dev_num;
 	const char *level;		/* printk level */
+	bool is_cxl;
 
 	unsigned int id:16;
 
@@ -628,6 +629,11 @@ struct aer_err_info {
 int aer_get_device_error_info(struct aer_err_info *info, int i);
 void aer_print_error(struct aer_err_info *info, int i);
 
+static inline const char *aer_err_bus(struct aer_err_info *info)
+{
+	return info->is_cxl ? "CXL" : "PCIe";
+}
+
 int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
 		      unsigned int tlp_len, bool flit,
 		      struct pcie_tlp_log *log);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 70ac66188367..a2df9456595a 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -837,6 +837,7 @@ void aer_print_error(struct aer_err_info *info, int i)
 	struct pci_dev *dev;
 	int layer, agent, id;
 	const char *level = info->level;
+	const char *bus_type = aer_err_bus(info);
 
 	if (WARN_ON_ONCE(i >= AER_MAX_MULTI_ERR_DEVICES))
 		return;
@@ -845,23 +846,23 @@ void aer_print_error(struct aer_err_info *info, int i)
 	id = pci_dev_id(dev);
 
 	pci_dev_aer_stats_incr(dev, info);
-	trace_aer_event(pci_name(dev), (info->status & ~info->mask),
+	trace_aer_event(pci_name(dev), bus_type, (info->status & ~info->mask),
 			info->severity, info->tlp_header_valid, &info->tlp);
 
 	if (!info->ratelimit_print[i])
 		return;
 
 	if (!info->status) {
-		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
-			aer_error_severity_string[info->severity]);
+		pci_err(dev, "%s Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
+			bus_type, aer_error_severity_string[info->severity]);
 		goto out;
 	}
 
 	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
 	agent = AER_GET_AGENT(info->severity, info->status);
 
-	aer_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
-		   aer_error_severity_string[info->severity],
+	aer_printk(level, dev, "%s Bus Error: severity=%s, type=%s, (%s)\n",
+		   bus_type, aer_error_severity_string[info->severity],
 		   aer_error_layer[layer], aer_agent_string[agent]);
 
 	aer_printk(level, dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
@@ -895,6 +896,7 @@ EXPORT_SYMBOL_GPL(cper_severity_to_aer);
 void pci_print_aer(struct pci_dev *dev, int aer_severity,
 		   struct aer_capability_regs *aer)
 {
+	const char *bus_type;
 	int layer, agent, tlp_header_valid = 0;
 	u32 status, mask;
 	struct aer_err_info info = {
@@ -915,9 +917,12 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 
 	info.status = status;
 	info.mask = mask;
+	info.is_cxl = pcie_is_cxl(dev);
+
+	bus_type = aer_err_bus(&info);
 
 	pci_dev_aer_stats_incr(dev, &info);
-	trace_aer_event(pci_name(dev), (status & ~mask),
+	trace_aer_event(pci_name(dev), bus_type, (status & ~mask),
 			aer_severity, tlp_header_valid, &aer->header_log);
 
 	if (!aer_ratelimit(dev, info.severity))
@@ -939,6 +944,9 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 	if (tlp_header_valid)
 		pcie_print_tlp_log(dev, &aer->header_log, info.level,
 				   dev_fmt("  "));
+
+	trace_aer_event(dev_name(&dev->dev), bus_type, (status & ~mask),
+			aer_severity, tlp_header_valid, &aer->header_log);
 }
 EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
 
@@ -1371,6 +1379,7 @@ int aer_get_device_error_info(struct aer_err_info *info, int i)
 	/* Must reset in this function */
 	info->status = 0;
 	info->tlp_header_valid = 0;
+	info->is_cxl = pcie_is_cxl(dev);
 
 	/* The device might not support AER */
 	if (!aer)
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 14c9f943d53f..080829d59c36 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -297,15 +297,17 @@ TRACE_EVENT(non_standard_event,
 
 TRACE_EVENT(aer_event,
 	TP_PROTO(const char *dev_name,
+		 const char *bus_type,
 		 const u32 status,
 		 const u8 severity,
 		 const u8 tlp_header_valid,
 		 struct pcie_tlp_log *tlp),
 
-	TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp),
+	TP_ARGS(dev_name, bus_type, status, severity, tlp_header_valid, tlp),
 
 	TP_STRUCT__entry(
 		__string(	dev_name,	dev_name	)
+		__string(	bus_type,	bus_type	)
 		__field(	u32,		status		)
 		__field(	u8,		severity	)
 		__field(	u8, 		tlp_header_valid)
@@ -314,6 +316,7 @@ TRACE_EVENT(aer_event,
 
 	TP_fast_assign(
 		__assign_str(dev_name);
+		__assign_str(bus_type);
 		__entry->status		= status;
 		__entry->severity	= severity;
 		__entry->tlp_header_valid = tlp_header_valid;
@@ -325,8 +328,8 @@ TRACE_EVENT(aer_event,
 		}
 	),
 
-	TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s\n",
-		__get_str(dev_name),
+	TP_printk("%s %s Bus Error: severity=%s, %s, TLP Header=%s\n",
+		__get_str(dev_name), __get_str(bus_type),
 		__entry->severity == AER_CORRECTABLE ? "Corrected" :
 			__entry->severity == AER_FATAL ?
 			"Fatal" : "Uncorrected, non-fatal",
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v10 04/17] CXL/AER: Introduce CXL specific AER driver file
  2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (2 preceding siblings ...)
  2025-06-26 22:42 ` [PATCH v10 03/17] PCI/AER: Report CXL or PCIe bus error type in trace logging Terry Bowman
@ 2025-06-26 22:42 ` Terry Bowman
  2025-06-26 23:42   ` Sathyanarayanan Kuppuswamy
                     ` (2 more replies)
  2025-06-26 22:42 ` [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors Terry Bowman
                   ` (14 subsequent siblings)
  18 siblings, 3 replies; 82+ messages in thread
From: Terry Bowman @ 2025-06-26 22:42 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

The CXL AER error handling logic currently resides in the AER driver file,
drivers/pci/pcie/aer.c. CXL specific changes are conditionally compiled
using #ifdefs.

Improve the AER driver maintainability by separating the CXL specific logic
from the AER driver's core functionality and removing the #ifdefs.
Introduce drivers/pci/pcie/cxl_aer.c and move the CXL AER logic into the
new file.

Update the makefile to conditionally compile the CXL file using the
existing CONFIG_PCIEAER_CXL Kconfig.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/pci/pci.h          |   8 +++
 drivers/pci/pcie/Makefile  |   1 +
 drivers/pci/pcie/aer.c     | 138 -------------------------------------
 drivers/pci/pcie/cxl_aer.c | 138 +++++++++++++++++++++++++++++++++++++
 include/linux/pci_ids.h    |   2 +
 5 files changed, 149 insertions(+), 138 deletions(-)
 create mode 100644 drivers/pci/pcie/cxl_aer.c

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index a0d1e59b5666..91b583cf18eb 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -1029,6 +1029,14 @@ static inline void pci_save_aer_state(struct pci_dev *dev) { }
 static inline void pci_restore_aer_state(struct pci_dev *dev) { }
 #endif
 
+#ifdef CONFIG_PCIEAER_CXL
+void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info);
+void cxl_rch_enable_rcec(struct pci_dev *rcec);
+#else
+static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { }
+static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
+#endif
+
 #ifdef CONFIG_ACPI
 bool pci_acpi_preserve_config(struct pci_host_bridge *bridge);
 int pci_acpi_program_hp_params(struct pci_dev *dev);
diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
index 173829aa02e6..cd2cb925dbd5 100644
--- a/drivers/pci/pcie/Makefile
+++ b/drivers/pci/pcie/Makefile
@@ -8,6 +8,7 @@ obj-$(CONFIG_PCIEPORTBUS)	+= pcieportdrv.o bwctrl.o
 
 obj-y				+= aspm.o
 obj-$(CONFIG_PCIEAER)		+= aer.o err.o tlp.o
+obj-$(CONFIG_PCIEAER_CXL)	+= cxl_aer.o
 obj-$(CONFIG_PCIEAER_INJECT)	+= aer_inject.o
 obj-$(CONFIG_PCIE_PME)		+= pme.o
 obj-$(CONFIG_PCIE_DPC)		+= dpc.o
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index a2df9456595a..0b4d721980ef 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1094,144 +1094,6 @@ static bool find_source_device(struct pci_dev *parent,
 	return true;
 }
 
-#ifdef CONFIG_PCIEAER_CXL
-
-/**
- * pci_aer_unmask_internal_errors - unmask internal errors
- * @dev: pointer to the pci_dev data structure
- *
- * Unmask internal errors in the Uncorrectable and Correctable Error
- * Mask registers.
- *
- * Note: AER must be enabled and supported by the device which must be
- * checked in advance, e.g. with pcie_aer_is_native().
- */
-static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
-{
-	int aer = dev->aer_cap;
-	u32 mask;
-
-	pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &mask);
-	mask &= ~PCI_ERR_UNC_INTN;
-	pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, mask);
-
-	pci_read_config_dword(dev, aer + PCI_ERR_COR_MASK, &mask);
-	mask &= ~PCI_ERR_COR_INTERNAL;
-	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
-}
-
-static bool is_cxl_mem_dev(struct pci_dev *dev)
-{
-	/*
-	 * The capability, status, and control fields in Device 0,
-	 * Function 0 DVSEC control the CXL functionality of the
-	 * entire device (CXL 3.0, 8.1.3).
-	 */
-	if (dev->devfn != PCI_DEVFN(0, 0))
-		return false;
-
-	/*
-	 * CXL Memory Devices must have the 502h class code set (CXL
-	 * 3.0, 8.1.12.1).
-	 */
-	if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL)
-		return false;
-
-	return true;
-}
-
-static bool cxl_error_is_native(struct pci_dev *dev)
-{
-	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
-
-	return (pcie_ports_native || host->native_aer);
-}
-
-static bool is_internal_error(struct aer_err_info *info)
-{
-	if (info->severity == AER_CORRECTABLE)
-		return info->status & PCI_ERR_COR_INTERNAL;
-
-	return info->status & PCI_ERR_UNC_INTN;
-}
-
-static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
-{
-	struct aer_err_info *info = (struct aer_err_info *)data;
-	const struct pci_error_handlers *err_handler;
-
-	if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev))
-		return 0;
-
-	/* Protect dev->driver */
-	device_lock(&dev->dev);
-
-	err_handler = dev->driver ? dev->driver->err_handler : NULL;
-	if (!err_handler)
-		goto out;
-
-	if (info->severity == AER_CORRECTABLE) {
-		if (err_handler->cor_error_detected)
-			err_handler->cor_error_detected(dev);
-	} else if (err_handler->error_detected) {
-		if (info->severity == AER_NONFATAL)
-			err_handler->error_detected(dev, pci_channel_io_normal);
-		else if (info->severity == AER_FATAL)
-			err_handler->error_detected(dev, pci_channel_io_frozen);
-	}
-out:
-	device_unlock(&dev->dev);
-	return 0;
-}
-
-static void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
-{
-	/*
-	 * Internal errors of an RCEC indicate an AER error in an
-	 * RCH's downstream port. Check and handle them in the CXL.mem
-	 * device driver.
-	 */
-	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
-	    is_internal_error(info))
-		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
-}
-
-static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
-{
-	bool *handles_cxl = data;
-
-	if (!*handles_cxl)
-		*handles_cxl = is_cxl_mem_dev(dev) && cxl_error_is_native(dev);
-
-	/* Non-zero terminates iteration */
-	return *handles_cxl;
-}
-
-static bool handles_cxl_errors(struct pci_dev *rcec)
-{
-	bool handles_cxl = false;
-
-	if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
-	    pcie_aer_is_native(rcec))
-		pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
-
-	return handles_cxl;
-}
-
-static void cxl_rch_enable_rcec(struct pci_dev *rcec)
-{
-	if (!handles_cxl_errors(rcec))
-		return;
-
-	pci_aer_unmask_internal_errors(rcec);
-	pci_info(rcec, "CXL: Internal errors unmasked");
-}
-
-#else
-static inline void cxl_rch_enable_rcec(struct pci_dev *dev) { }
-static inline void cxl_rch_handle_error(struct pci_dev *dev,
-					struct aer_err_info *info) { }
-#endif
 
 /**
  * pci_aer_handle_error - handle logging error into an event log
diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
new file mode 100644
index 000000000000..b2ea14f70055
--- /dev/null
+++ b/drivers/pci/pcie/cxl_aer.c
@@ -0,0 +1,138 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
+
+#include <linux/pci.h>
+#include <linux/aer.h>
+#include "../pci.h"
+
+/**
+ * pci_aer_unmask_internal_errors - unmask internal errors
+ * @dev: pointer to the pci_dev data structure
+ *
+ * Unmask internal errors in the Uncorrectable and Correctable Error
+ * Mask registers.
+ *
+ * Note: AER must be enabled and supported by the device which must be
+ * checked in advance, e.g. with pcie_aer_is_native().
+ */
+static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
+{
+	int aer = dev->aer_cap;
+	u32 mask;
+
+	pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &mask);
+	mask &= ~PCI_ERR_UNC_INTN;
+	pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, mask);
+
+	pci_read_config_dword(dev, aer + PCI_ERR_COR_MASK, &mask);
+	mask &= ~PCI_ERR_COR_INTERNAL;
+	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
+}
+
+static bool is_cxl_mem_dev(struct pci_dev *dev)
+{
+	/*
+	 * The capability, status, and control fields in Device 0,
+	 * Function 0 DVSEC control the CXL functionality of the
+	 * entire device (CXL 3.2, 8.1.3).
+	 */
+	if (dev->devfn != PCI_DEVFN(0, 0))
+		return false;
+
+	/*
+	 * CXL Memory Devices must have the 502h class code set (CXL
+	 * 3.2, 8.1.12.1).
+	 */
+	if (FIELD_GET(PCI_CLASS_CODE_MASK, dev->class) != PCI_CLASS_MEMORY_CXL)
+		return false;
+
+	return true;
+}
+
+static bool cxl_error_is_native(struct pci_dev *dev)
+{
+	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
+
+	return (pcie_ports_native || host->native_aer);
+}
+
+static bool is_internal_error(struct aer_err_info *info)
+{
+	if (info->severity == AER_CORRECTABLE)
+		return info->status & PCI_ERR_COR_INTERNAL;
+
+	return info->status & PCI_ERR_UNC_INTN;
+}
+
+static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
+{
+	struct aer_err_info *info = (struct aer_err_info *)data;
+	const struct pci_error_handlers *err_handler;
+
+	if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev))
+		return 0;
+
+	/* Protect dev->driver */
+	device_lock(&dev->dev);
+
+	err_handler = dev->driver ? dev->driver->err_handler : NULL;
+	if (!err_handler)
+		goto out;
+
+	if (info->severity == AER_CORRECTABLE) {
+		if (err_handler->cor_error_detected)
+			err_handler->cor_error_detected(dev);
+	} else if (err_handler->error_detected) {
+		if (info->severity == AER_NONFATAL)
+			err_handler->error_detected(dev, pci_channel_io_normal);
+		else if (info->severity == AER_FATAL)
+			err_handler->error_detected(dev, pci_channel_io_frozen);
+	}
+out:
+	device_unlock(&dev->dev);
+	return 0;
+}
+
+void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
+{
+	/*
+	 * Internal errors of an RCEC indicate an AER error in an
+	 * RCH's downstream port. Check and handle them in the CXL.mem
+	 * device driver.
+	 */
+	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
+	    is_internal_error(info))
+		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
+}
+
+static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
+{
+	bool *handles_cxl = data;
+
+	if (!*handles_cxl)
+		*handles_cxl = is_cxl_mem_dev(dev) && cxl_error_is_native(dev);
+
+	/* Non-zero terminates iteration */
+	return *handles_cxl;
+}
+
+static bool handles_cxl_errors(struct pci_dev *rcec)
+{
+	bool handles_cxl = false;
+
+	if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
+	    pcie_aer_is_native(rcec))
+		pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
+
+	return handles_cxl;
+}
+
+void cxl_rch_enable_rcec(struct pci_dev *rcec)
+{
+	if (!handles_cxl_errors(rcec))
+		return;
+
+	pci_aer_unmask_internal_errors(rcec);
+	pci_info(rcec, "CXL: Internal errors unmasked");
+}
+
diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
index e2d71b6fdd84..31b3935bf189 100644
--- a/include/linux/pci_ids.h
+++ b/include/linux/pci_ids.h
@@ -12,6 +12,8 @@
 
 /* Device classes and subclasses */
 
+#define PCI_CLASS_CODE_MASK             0xFFFF00
+
 #define PCI_CLASS_NOT_DEFINED		0x0000
 #define PCI_CLASS_NOT_DEFINED_VGA	0x0001
 
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
  2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (3 preceding siblings ...)
  2025-06-26 22:42 ` [PATCH v10 04/17] CXL/AER: Introduce CXL specific AER driver file Terry Bowman
@ 2025-06-26 22:42 ` Terry Bowman
  2025-06-27 10:24   ` Jonathan Cameron
                     ` (2 more replies)
  2025-06-26 22:42 ` [PATCH v10 06/17] PCI/AER: Dequeue forwarded CXL error Terry Bowman
                   ` (13 subsequent siblings)
  18 siblings, 3 replies; 82+ messages in thread
From: Terry Bowman @ 2025-06-26 22:42 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

CXL error handling will soon be moved from the AER driver into the CXL
driver. This requires a notification mechanism for the AER driver to share
the AER interrupt with the CXL driver. The notification will be used
as an indication for the CXL drivers to handle and log the CXL RAS errors.

First, introduce cxl/core/native_ras.c to contain changes for the CXL
driver's RAS native handling. This as an alternative to dropping the
changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
conditionally compile the new file.

Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
driver will be the sole kfifo producer adding work and the cxl_core will be
the sole kfifo consumer removing work. Add the boilerplate kfifo support.

Add CXL work queue handler registration functions in the AER driver. Export
the functions allowing CXL driver to access. Implement registration
functions for the CXL driver to assign or clear the work handler function.

Introduce 'struct cxl_proto_err_info' to serve as the kfifo work data. This
will contain the erring device's PCI SBDF details used to rediscover the
device after the CXL driver dequeues the kfifo work. The device rediscovery
will be introduced along with the CXL handling in future patches.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/Kconfig           | 14 ++++++++
 drivers/cxl/core/Makefile     |  1 +
 drivers/cxl/core/core.h       |  8 +++++
 drivers/cxl/core/native_ras.c | 26 +++++++++++++++
 drivers/cxl/core/port.c       |  2 ++
 drivers/cxl/core/ras.c        |  1 +
 drivers/cxl/cxlpci.h          |  1 +
 drivers/pci/pci.h             |  4 +++
 drivers/pci/pcie/aer.c        |  7 ++--
 drivers/pci/pcie/cxl_aer.c    | 60 +++++++++++++++++++++++++++++++++++
 include/linux/aer.h           | 31 ++++++++++++++++++
 11 files changed, 153 insertions(+), 2 deletions(-)
 create mode 100644 drivers/cxl/core/native_ras.c

diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 48b7314afdb8..57274de54a45 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -233,4 +233,18 @@ config CXL_MCE
 	def_bool y
 	depends on X86_MCE && MEMORY_FAILURE
 
+config CXL_NATIVE_RAS
+	bool "CXL: Enable CXL RAS native handling"
+	depends on PCIEAER_CXL
+	default CXL_BUS
+	help
+	  Enable native CXL RAS protocol error handling and logging in the CXL
+	  drivers. This functionality relies on the AER service driver being
+	  enabled, as the AER interrupt is used to inform the operating system
+	  of CXL RAS protocol errors. The platform must be configured to
+	  utilize AER reporting for interrupts.
+
+	  If unsure, or if this kernel is meant for production environments,
+	  say Y.
+
 endif
diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index 79e2ef81fde8..16f5832e5cc4 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -21,3 +21,4 @@ cxl_core-$(CONFIG_CXL_REGION) += region.o
 cxl_core-$(CONFIG_CXL_MCE) += mce.o
 cxl_core-$(CONFIG_CXL_FEATURES) += features.o
 cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
+cxl_core-$(CONFIG_CXL_NATIVE_RAS) += native_ras.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 29b61828a847..4c08bb92e2f9 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -123,6 +123,14 @@ int cxl_gpf_port_setup(struct cxl_dport *dport);
 int cxl_acpi_get_extended_linear_cache_size(struct resource *backing_res,
 					    int nid, resource_size_t *size);
 
+#ifdef CONFIG_PCIEAER_CXL
+void cxl_native_ras_init(void);
+void cxl_native_ras_exit(void);
+#else
+static inline void cxl_native_ras_init(void) { };
+static inline void cxl_native_ras_exit(void) { };
+#endif
+
 #ifdef CONFIG_CXL_FEATURES
 struct cxl_feat_entry *
 cxl_feature_info(struct cxl_features_state *cxlfs, const uuid_t *uuid);
diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
new file mode 100644
index 000000000000..011815ddaae3
--- /dev/null
+++ b/drivers/cxl/core/native_ras.c
@@ -0,0 +1,26 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
+
+#include <linux/pci.h>
+#include <linux/aer.h>
+#include <cxl/event.h>
+#include <cxlmem.h>
+#include <core/core.h>
+
+static void cxl_proto_err_work_fn(struct work_struct *work)
+{
+}
+
+static struct work_struct cxl_proto_err_work;
+static DECLARE_WORK(cxl_proto_err_work, cxl_proto_err_work_fn);
+
+void cxl_native_ras_init(void)
+{
+	cxl_register_proto_err_work(&cxl_proto_err_work);
+}
+
+void cxl_native_ras_exit(void)
+{
+	cxl_unregister_proto_err_work();
+	cancel_work_sync(&cxl_proto_err_work);
+}
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index eb46c6764d20..8e8f21197c86 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -2345,6 +2345,8 @@ static __init int cxl_core_init(void)
 	if (rc)
 		goto err_ras;
 
+	cxl_native_ras_init();
+
 	return 0;
 
 err_ras:
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 485a831695c7..962dc94fed8c 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -5,6 +5,7 @@
 #include <linux/aer.h>
 #include <cxl/event.h>
 #include <cxlmem.h>
+#include <cxlpci.h>
 #include "trace.h"
 
 static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 54e219b0049e..6f1396ef7b77 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -4,6 +4,7 @@
 #define __CXL_PCI_H__
 #include <linux/pci.h>
 #include "cxl.h"
+#include "linux/aer.h"
 
 #define CXL_MEMORY_PROGIF	0x10
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 91b583cf18eb..29c11c7136d3 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -1032,9 +1032,13 @@ static inline void pci_restore_aer_state(struct pci_dev *dev) { }
 #ifdef CONFIG_PCIEAER_CXL
 void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info);
 void cxl_rch_enable_rcec(struct pci_dev *rcec);
+bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info);
+void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info);
 #else
 static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { }
 static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
+static inline bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info) { return false; }
+static inline void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info) { }
 #endif
 
 #ifdef CONFIG_ACPI
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 0b4d721980ef..8417a49c71f2 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1130,8 +1130,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 
 static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
 {
-	cxl_rch_handle_error(dev, info);
-	pci_aer_handle_error(dev, info);
+	if (is_cxl_error(dev, info))
+		forward_cxl_error(dev, info);
+	else
+		pci_aer_handle_error(dev, info);
+
 	pci_dev_put(dev);
 }
 
diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
index b2ea14f70055..846ab55d747c 100644
--- a/drivers/pci/pcie/cxl_aer.c
+++ b/drivers/pci/pcie/cxl_aer.c
@@ -3,8 +3,11 @@
 
 #include <linux/pci.h>
 #include <linux/aer.h>
+#include <linux/kfifo.h>
 #include "../pci.h"
 
+#define CXL_ERROR_SOURCES_MAX          128
+
 /**
  * pci_aer_unmask_internal_errors - unmask internal errors
  * @dev: pointer to the pci_dev data structure
@@ -64,6 +67,19 @@ static bool is_internal_error(struct aer_err_info *info)
 	return info->status & PCI_ERR_UNC_INTN;
 }
 
+bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
+{
+	if (!info || !info->is_cxl)
+		return false;
+
+	/* Only CXL Endpoints are currently supported */
+	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
+	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC))
+		return false;
+
+	return is_internal_error(info);
+}
+
 static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
 {
 	struct aer_err_info *info = (struct aer_err_info *)data;
@@ -136,3 +152,47 @@ void cxl_rch_enable_rcec(struct pci_dev *rcec)
 	pci_info(rcec, "CXL: Internal errors unmasked");
 }
 
+static DEFINE_KFIFO(cxl_proto_err_fifo, struct cxl_proto_err_work_data,
+		    CXL_ERROR_SOURCES_MAX);
+static DEFINE_SPINLOCK(cxl_proto_err_fifo_lock);
+struct work_struct *cxl_proto_err_work;
+
+void cxl_register_proto_err_work(struct work_struct *work)
+{
+	guard(spinlock)(&cxl_proto_err_fifo_lock);
+	cxl_proto_err_work = work;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_register_proto_err_work, "CXL");
+
+void cxl_unregister_proto_err_work(void)
+{
+	guard(spinlock)(&cxl_proto_err_fifo_lock);
+	cxl_proto_err_work = NULL;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_unregister_proto_err_work, "CXL");
+
+int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd)
+{
+	return kfifo_get(&cxl_proto_err_fifo, wd);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_proto_err_kfifo_get, "CXL");
+
+void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info)
+{
+	struct cxl_proto_err_work_data wd;
+
+	wd.err_info = (struct cxl_proto_error_info) {
+		.severity = aer_err_info->severity,
+		.devfn = pdev->devfn,
+		.bus = pdev->bus->number,
+		.segment = pci_domain_nr(pdev->bus)
+	};
+
+	if (!kfifo_put(&cxl_proto_err_fifo, wd)) {
+		dev_err_ratelimited(&pdev->dev, "CXL kfifo overflow\n");
+		return;
+	}
+
+	schedule_work(cxl_proto_err_work);
+}
+
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 02940be66324..24c3d9e18ad5 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -10,6 +10,7 @@
 
 #include <linux/errno.h>
 #include <linux/types.h>
+#include <linux/workqueue_types.h>
 
 #define AER_NONFATAL			0
 #define AER_FATAL			1
@@ -53,6 +54,26 @@ struct aer_capability_regs {
 	u16 uncor_err_source;
 };
 
+/**
+ * struct cxl_proto_err_info - Error information used in CXL error handling
+ * @severity: AER severity
+ * @function: Device's PCI function
+ * @device: Device's PCI device
+ * @bus: Device's PCI bus
+ * @segment: Device's PCI segment
+ */
+struct cxl_proto_error_info {
+	int severity;
+
+	u8 devfn;
+	u8 bus;
+	u16 segment;
+};
+
+struct cxl_proto_err_work_data {
+	struct cxl_proto_error_info err_info;
+};
+
 #if defined(CONFIG_PCIEAER)
 int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
 int pcie_aer_is_native(struct pci_dev *dev);
@@ -64,6 +85,16 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
 static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
 #endif
 
+#if defined(CONFIG_PCIEAER_CXL)
+void cxl_register_proto_err_work(struct work_struct *work);
+void cxl_unregister_proto_err_work(void);
+int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd);
+#else
+static inline void cxl_register_proto_err_work(struct work_struct *work) { }
+static inline void cxl_unregister_proto_err_work(void) { }
+static inline int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd) { return 0; }
+#endif
+
 void pci_print_aer(struct pci_dev *dev, int aer_severity,
 		    struct aer_capability_regs *aer);
 int cper_severity_to_aer(int cper_severity);
-- 
2.34.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v10 06/17] PCI/AER: Dequeue forwarded CXL error
  2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (4 preceding siblings ...)
  2025-06-26 22:42 ` [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors Terry Bowman
@ 2025-06-26 22:42 ` Terry Bowman
  2025-06-27 11:00   ` Jonathan Cameron
                     ` (2 more replies)
  2025-06-26 22:42 ` [PATCH v10 07/17] CXL/PCI: Introduce CXL uncorrectable protocol error recovery Terry Bowman
                   ` (12 subsequent siblings)
  18 siblings, 3 replies; 82+ messages in thread
From: Terry Bowman @ 2025-06-26 22:42 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

The AER driver is now designed to forward CXL protocol errors to the CXL
driver. Update the CXL driver with functionality to dequeue the forwarded
CXL error from the kfifo. Also, update the CXL driver to begin the protocol
error handling processing using the work received from the FIFO.

Introduce function cxl_proto_err_work_fn() to dequeue work forwarded by the
AER service driver. This will begin the CXL protocol error processing with
a call to cxl_handle_proto_error().

Update cxl/core/native_ras.c by adding cxl_rch_handle_error_iter() that was
previously in the AER driver. Add check that Endpoint is bound to a CXL
driver.

Introduce logic to take the SBDF values from 'struct cxl_proto_error_info'
and use in discovering the erring PCI device. The call to pci_get_domain_bus_and_slot()
will return a reference counted 'struct pci_dev *'. This will serve as
reference count to prevent releasing the CXL Endpoint's mapped RAS while
handling the error. Use scope base __free() to put the reference count.
This will change when adding support for CXL port devices in the future.

Implement cxl_handle_proto_error() to differentiate between Restricted CXL
Host (RCH) protocol errors and CXL virtual host (VH) protocol errors. RCH
errors will be processed with a call to walk the associated Root Complex
Event Collector's (RCEC) secondary bus looking for the Root Complex
Integrated Endpoint (RCiEP) to handle the RCH error. Export pcie_walk_rcec()
so the CXL driver can walk the RCEC's downstream bus, searching for the
RCiEP.

VH correctable error (CE) processing will call the CXL CE handler. VH
uncorrectable errors (UCE) will call cxl_do_recovery(), implemented as a
stub for now and to be updated in future patch. Export pci_aer_clean_fatal_status()
and pci_clean_device_status() used to clean up AER status after handling.

Maintain the locking logic found in the original AER driver. Replace the
existing device_lock() in cxl_rch_handle_error_iter() to use guard(device)
lock for maintainability. CE errors did not include locking in previous driver
implementation. Leave the updated CE handling path as-is.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/cxl/core/native_ras.c | 87 +++++++++++++++++++++++++++++++++++
 drivers/cxl/cxlpci.h          |  1 +
 drivers/cxl/pci.c             |  6 +++
 drivers/pci/pci.c             |  1 +
 drivers/pci/pci.h             |  7 ---
 drivers/pci/pcie/aer.c        |  1 +
 drivers/pci/pcie/cxl_aer.c    | 41 -----------------
 drivers/pci/pcie/rcec.c       |  1 +
 include/linux/aer.h           |  2 +
 include/linux/pci.h           | 10 ++++
 10 files changed, 109 insertions(+), 48 deletions(-)

diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
index 011815ddaae3..5bd79d5019e7 100644
--- a/drivers/cxl/core/native_ras.c
+++ b/drivers/cxl/core/native_ras.c
@@ -6,9 +6,96 @@
 #include <cxl/event.h>
 #include <cxlmem.h>
 #include <core/core.h>
+#include <cxlpci.h>
+
+static void cxl_do_recovery(struct pci_dev *pdev)
+{
+}
+
+static bool is_cxl_rcd(struct pci_dev *pdev)
+{
+	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END)
+		return false;
+
+	/*
+	 * The capability, status, and control fields in Device 0,
+	 * Function 0 DVSEC control the CXL functionality of the
+	 * entire device (CXL 3.2, 8.1.3).
+	 */
+	if (pdev->devfn != PCI_DEVFN(0, 0))
+		return false;
+
+	/*
+	 * CXL Memory Devices must have the 502h class code set (CXL
+	 * 3.2, 8.1.12.1).
+	 */
+	if (FIELD_GET(PCI_CLASS_CODE_MASK, pdev->class) != PCI_CLASS_MEMORY_CXL)
+		return false;
+
+	return true;
+}
+
+static int cxl_rch_handle_error_iter(struct pci_dev *pdev, void *data)
+{
+	struct cxl_proto_error_info *err_info = data;
+
+	guard(device)(&pdev->dev);
+
+	if (!is_cxl_rcd(pdev) || !cxl_pci_drv_bound(pdev))
+		return 0;
+
+	if (err_info->severity == AER_CORRECTABLE)
+		cxl_cor_error_detected(pdev);
+	else
+		cxl_error_detected(pdev, pci_channel_io_frozen);
+
+	return 1;
+}
+
+static void cxl_handle_proto_error(struct cxl_proto_error_info *err_info)
+{
+	struct pci_dev *pdev __free(pci_dev_put) =
+		pci_get_domain_bus_and_slot(err_info->segment,
+					    err_info->bus,
+					    err_info->devfn);
+
+	if (!pdev) {
+		pr_err("Failed to find the CXL device (SBDF=%x:%x:%x:%x)\n",
+		       err_info->segment, err_info->bus, PCI_SLOT(err_info->devfn),
+		       PCI_FUNC(err_info->devfn));
+		return;
+	}
+
+	/*
+	 * Internal errors of an RCEC indicate an AER error in an
+	 * RCH's downstream port. Check and handle them in the CXL.mem
+	 * device driver.
+	 */
+	if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_EC)
+		return pcie_walk_rcec(pdev, cxl_rch_handle_error_iter, err_info);
+
+	if (err_info->severity == AER_CORRECTABLE) {
+		int aer = pdev->aer_cap;
+
+		if (aer)
+			pci_clear_and_set_config_dword(pdev,
+						       aer + PCI_ERR_COR_STATUS,
+						       0, PCI_ERR_COR_INTERNAL);
+
+		cxl_cor_error_detected(pdev);
+
+		pcie_clear_device_status(pdev);
+	} else {
+		cxl_do_recovery(pdev);
+	}
+}
 
 static void cxl_proto_err_work_fn(struct work_struct *work)
 {
+	struct cxl_proto_err_work_data wd;
+
+	while (cxl_proto_err_kfifo_get(&wd))
+		cxl_handle_proto_error(&wd.err_info);
 }
 
 static struct work_struct cxl_proto_err_work;
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 6f1396ef7b77..ed3c9701b79f 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -136,4 +136,5 @@ void read_cdat_data(struct cxl_port *port);
 void cxl_cor_error_detected(struct pci_dev *pdev);
 pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 				    pci_channel_state_t state);
+bool cxl_pci_drv_bound(struct pci_dev *pdev);
 #endif /* __CXL_PCI_H__ */
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index bd100ac31672..cae049f9ae3e 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -1131,6 +1131,12 @@ static struct pci_driver cxl_pci_driver = {
 	},
 };
 
+bool cxl_pci_drv_bound(struct pci_dev *pdev)
+{
+	return (pdev->driver == &cxl_pci_driver);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_pci_drv_bound, "CXL");
+
 #define CXL_EVENT_HDR_FLAGS_REC_SEVERITY GENMASK(1, 0)
 static void cxl_handle_cper_event(enum cxl_event_type ev_type,
 				  struct cxl_cper_event_rec *rec)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index e9448d55113b..8d78d882bf78 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2328,6 +2328,7 @@ void pcie_clear_device_status(struct pci_dev *dev)
 	pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &sta);
 	pcie_capability_write_word(dev, PCI_EXP_DEVSTA, sta);
 }
+EXPORT_SYMBOL_NS_GPL(pcie_clear_device_status, "CXL");
 #endif
 
 /**
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 29c11c7136d3..c7fc86d93bea 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -671,16 +671,10 @@ static inline bool pci_dpc_recovered(struct pci_dev *pdev) { return false; }
 void pci_rcec_init(struct pci_dev *dev);
 void pci_rcec_exit(struct pci_dev *dev);
 void pcie_link_rcec(struct pci_dev *rcec);
-void pcie_walk_rcec(struct pci_dev *rcec,
-		    int (*cb)(struct pci_dev *, void *),
-		    void *userdata);
 #else
 static inline void pci_rcec_init(struct pci_dev *dev) { }
 static inline void pci_rcec_exit(struct pci_dev *dev) { }
 static inline void pcie_link_rcec(struct pci_dev *rcec) { }
-static inline void pcie_walk_rcec(struct pci_dev *rcec,
-				  int (*cb)(struct pci_dev *, void *),
-				  void *userdata) { }
 #endif
 
 #ifdef CONFIG_PCI_ATS
@@ -1022,7 +1016,6 @@ void pci_restore_aer_state(struct pci_dev *dev);
 static inline void pci_no_aer(void) { }
 static inline void pci_aer_init(struct pci_dev *d) { }
 static inline void pci_aer_exit(struct pci_dev *d) { }
-static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
 static inline int pci_aer_clear_status(struct pci_dev *dev) { return -EINVAL; }
 static inline int pci_aer_raw_clear_status(struct pci_dev *dev) { return -EINVAL; }
 static inline void pci_save_aer_state(struct pci_dev *dev) { }
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 8417a49c71f2..5999d90dfdcb 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -287,6 +287,7 @@ void pci_aer_clear_fatal_status(struct pci_dev *dev)
 	if (status)
 		pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, status);
 }
+EXPORT_SYMBOL_GPL(pci_aer_clear_fatal_status);
 
 /**
  * pci_aer_raw_clear_status - Clear AER error registers.
diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
index 846ab55d747c..939438a7161a 100644
--- a/drivers/pci/pcie/cxl_aer.c
+++ b/drivers/pci/pcie/cxl_aer.c
@@ -80,47 +80,6 @@ bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
 	return is_internal_error(info);
 }
 
-static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
-{
-	struct aer_err_info *info = (struct aer_err_info *)data;
-	const struct pci_error_handlers *err_handler;
-
-	if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev))
-		return 0;
-
-	/* Protect dev->driver */
-	device_lock(&dev->dev);
-
-	err_handler = dev->driver ? dev->driver->err_handler : NULL;
-	if (!err_handler)
-		goto out;
-
-	if (info->severity == AER_CORRECTABLE) {
-		if (err_handler->cor_error_detected)
-			err_handler->cor_error_detected(dev);
-	} else if (err_handler->error_detected) {
-		if (info->severity == AER_NONFATAL)
-			err_handler->error_detected(dev, pci_channel_io_normal);
-		else if (info->severity == AER_FATAL)
-			err_handler->error_detected(dev, pci_channel_io_frozen);
-	}
-out:
-	device_unlock(&dev->dev);
-	return 0;
-}
-
-void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
-{
-	/*
-	 * Internal errors of an RCEC indicate an AER error in an
-	 * RCH's downstream port. Check and handle them in the CXL.mem
-	 * device driver.
-	 */
-	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
-	    is_internal_error(info))
-		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
-}
-
 static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
 {
 	bool *handles_cxl = data;
diff --git a/drivers/pci/pcie/rcec.c b/drivers/pci/pcie/rcec.c
index d0bcd141ac9c..fb6cf6449a1d 100644
--- a/drivers/pci/pcie/rcec.c
+++ b/drivers/pci/pcie/rcec.c
@@ -145,6 +145,7 @@ void pcie_walk_rcec(struct pci_dev *rcec, int (*cb)(struct pci_dev *, void *),
 
 	walk_rcec(walk_rcec_helper, &rcec_data);
 }
+EXPORT_SYMBOL_NS_GPL(pcie_walk_rcec, "CXL");
 
 void pci_rcec_init(struct pci_dev *dev)
 {
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 24c3d9e18ad5..0aafcc678e45 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -76,12 +76,14 @@ struct cxl_proto_err_work_data {
 
 #if defined(CONFIG_PCIEAER)
 int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
+void pci_aer_clear_fatal_status(struct pci_dev *dev);
 int pcie_aer_is_native(struct pci_dev *dev);
 #else
 static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
 {
 	return -EINVAL;
 }
+static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
 static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
 #endif
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 79878243b681..79326358f641 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1801,6 +1801,9 @@ extern bool pcie_ports_native;
 
 int pcie_set_target_speed(struct pci_dev *port, enum pci_bus_speed speed_req,
 			  bool use_lt);
+void pcie_walk_rcec(struct pci_dev *rcec,
+		    int (*cb)(struct pci_dev *, void *),
+		    void *userdata);
 #else
 #define pcie_ports_disabled	true
 #define pcie_ports_native	false
@@ -1811,8 +1814,15 @@ static inline int pcie_set_target_speed(struct pci_dev *port,
 {
 	return -EOPNOTSUPP;
 }
+
+static inline void pcie_walk_rcec(struct pci_dev *rcec,
+				  int (*cb)(struct pci_dev *, void *),
+				  void *userdata) { }
+
 #endif
 
+void pcie_clear_device_status(struct pci_dev *dev);
+
 #define PCIE_LINK_STATE_L0S		(BIT(0) | BIT(1)) /* Upstr/dwnstr L0s */
 #define PCIE_LINK_STATE_L1		BIT(2)	/* L1 state */
 #define PCIE_LINK_STATE_L1_1		BIT(3)	/* ASPM L1.1 state */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v10 07/17] CXL/PCI: Introduce CXL uncorrectable protocol error recovery
  2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (5 preceding siblings ...)
  2025-06-26 22:42 ` [PATCH v10 06/17] PCI/AER: Dequeue forwarded CXL error Terry Bowman
@ 2025-06-26 22:42 ` Terry Bowman
  2025-06-27 11:05   ` Jonathan Cameron
  2025-06-27 12:27   ` Shiju Jose
  2025-06-26 22:42 ` [PATCH v10 08/17] cxl/pci: Move RAS initialization to cxl_port driver Terry Bowman
                   ` (11 subsequent siblings)
  18 siblings, 2 replies; 82+ messages in thread
From: Terry Bowman @ 2025-06-26 22:42 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

Create cxl_do_recovery() to provide uncorrectable protocol error (UCE)
handling. Follow similar design as found in PCIe error driver,
pcie_do_recovery(). One difference is cxl_do_recovery() will treat all UCEs
as fatal with a kernel panic. This is to prevent corruption on CXL memory.

Export the PCI error driver's merge_result() to CXL namespace. Introduce
PCI_ERS_RESULT_PANIC and add support in merge_result() routine. This will
be used by CXL to panic the system in the case of uncorrectable protocol
errors. PCI error handling is not currently expected to use the
PCI_ERS_RESULT_PANIC.

Copy pci_walk_bridge() to cxl_walk_bridge(). Make a change to walk the
first device in all cases.

Copy the PCI error driver's report_error_detected() to cxl_report_error_detected().
Note, only CXL Endpoints and RCH Downstream Ports(RCH DSP) are currently
supported. Add locking for PCI device as done in PCI's report_error_detected().
This is necessary to prevent the RAS registers from disappearing before
logging is completed.

Call panic() to halt the system in the case of uncorrectable errors (UCE)
in cxl_do_recovery(). Export pci_aer_clear_fatal_status() for CXL to use
if a UCE is not found. In this case the AER status must be cleared and
uses pci_aer_clear_fatal_status().

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/native_ras.c | 44 +++++++++++++++++++++++++++++++++++
 drivers/pci/pcie/cxl_aer.c    |  3 ++-
 drivers/pci/pcie/err.c        |  8 +++++--
 include/linux/aer.h           | 11 +++++++++
 include/linux/pci.h           |  3 +++
 5 files changed, 66 insertions(+), 3 deletions(-)

diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
index 5bd79d5019e7..19f8f2ac8376 100644
--- a/drivers/cxl/core/native_ras.c
+++ b/drivers/cxl/core/native_ras.c
@@ -8,8 +8,52 @@
 #include <core/core.h>
 #include <cxlpci.h>
 
+static int cxl_report_error_detected(struct pci_dev *pdev, void *data)
+{
+	pci_ers_result_t vote, *result = data;
+
+	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
+	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END))
+		return 0;
+
+	guard(device)(&pdev->dev);
+
+	vote = cxl_error_detected(pdev, pci_channel_io_frozen);
+	*result = merge_result(*result, vote);
+
+	return 0;
+}
+
+static void cxl_walk_bridge(struct pci_dev *bridge,
+			    int (*cb)(struct pci_dev *, void *),
+			    void *userdata)
+{
+	if (cb(bridge, userdata))
+		return;
+
+	if (bridge->subordinate)
+		pci_walk_bus(bridge->subordinate, cb, userdata);
+}
+
 static void cxl_do_recovery(struct pci_dev *pdev)
 {
+	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
+
+	cxl_walk_bridge(pdev, cxl_report_error_detected, &status);
+	if (status == PCI_ERS_RESULT_PANIC)
+		panic("CXL cachemem error.");
+
+	/*
+	 * If we have native control of AER, clear error status in the device
+	 * that detected the error.  If the platform retained control of AER,
+	 * it is responsible for clearing this status.  In that case, the
+	 * signaling device may not even be visible to the OS.
+	 */
+	if (cxl_error_is_native(pdev)) {
+		pcie_clear_device_status(pdev);
+		pci_aer_clear_nonfatal_status(pdev);
+		pci_aer_clear_fatal_status(pdev);
+	}
 }
 
 static bool is_cxl_rcd(struct pci_dev *pdev)
diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
index 939438a7161a..b238791b7101 100644
--- a/drivers/pci/pcie/cxl_aer.c
+++ b/drivers/pci/pcie/cxl_aer.c
@@ -52,12 +52,13 @@ static bool is_cxl_mem_dev(struct pci_dev *dev)
 	return true;
 }
 
-static bool cxl_error_is_native(struct pci_dev *dev)
+bool cxl_error_is_native(struct pci_dev *dev)
 {
 	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
 
 	return (pcie_ports_native || host->native_aer);
 }
+EXPORT_SYMBOL_NS_GPL(cxl_error_is_native, "CXL");
 
 static bool is_internal_error(struct aer_err_info *info)
 {
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index de6381c690f5..63fceb3e8613 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -21,9 +21,12 @@
 #include "portdrv.h"
 #include "../pci.h"
 
-static pci_ers_result_t merge_result(enum pci_ers_result orig,
-				  enum pci_ers_result new)
+pci_ers_result_t merge_result(enum pci_ers_result orig,
+			      enum pci_ers_result new)
 {
+	if (new == PCI_ERS_RESULT_PANIC)
+		return PCI_ERS_RESULT_PANIC;
+
 	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
 		return PCI_ERS_RESULT_NO_AER_DRIVER;
 
@@ -45,6 +48,7 @@ static pci_ers_result_t merge_result(enum pci_ers_result orig,
 
 	return orig;
 }
+EXPORT_SYMBOL_NS_GPL(merge_result, "CXL");
 
 static int report_error_detected(struct pci_dev *dev,
 				 pci_channel_state_t state,
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 0aafcc678e45..f14db635ef90 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -10,6 +10,7 @@
 
 #include <linux/errno.h>
 #include <linux/types.h>
+#include <linux/pci.h>
 #include <linux/workqueue_types.h>
 
 #define AER_NONFATAL			0
@@ -78,6 +79,8 @@ struct cxl_proto_err_work_data {
 int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
 void pci_aer_clear_fatal_status(struct pci_dev *dev);
 int pcie_aer_is_native(struct pci_dev *dev);
+pci_ers_result_t merge_result(enum pci_ers_result orig,
+			      enum pci_ers_result new);
 #else
 static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
 {
@@ -85,16 +88,24 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
 }
 static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
 static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
+static inline pci_ers_result_t merge_result(enum pci_ers_result orig,
+					    enum pci_ers_result new)
+{
+	return PCI_ERS_RESULT_NONE;
+}
+
 #endif
 
 #if defined(CONFIG_PCIEAER_CXL)
 void cxl_register_proto_err_work(struct work_struct *work);
 void cxl_unregister_proto_err_work(void);
 int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd);
+bool cxl_error_is_native(struct pci_dev *dev);
 #else
 static inline void cxl_register_proto_err_work(struct work_struct *work) { }
 static inline void cxl_unregister_proto_err_work(void) { }
 static inline int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd) { return 0; }
+static inline bool cxl_error_is_native(struct pci_dev *dev) { return 0; }
 #endif
 
 void pci_print_aer(struct pci_dev *dev, int aer_severity,
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 79326358f641..16a8310e0373 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -868,6 +868,9 @@ enum pci_ers_result {
 
 	/* No AER capabilities registered for the driver */
 	PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6,
+
+	/* System is unstable, panic. Is CXL specific  */
+	PCI_ERS_RESULT_PANIC = (__force pci_ers_result_t) 7,
 };
 
 /* PCI bus error event callbacks */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v10 08/17] cxl/pci: Move RAS initialization to cxl_port driver
  2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (6 preceding siblings ...)
  2025-06-26 22:42 ` [PATCH v10 07/17] CXL/PCI: Introduce CXL uncorrectable protocol error recovery Terry Bowman
@ 2025-06-26 22:42 ` Terry Bowman
  2025-06-27 11:12   ` Jonathan Cameron
  2025-07-18 18:01   ` Dave Jiang
  2025-06-26 22:42 ` [PATCH v10 09/17] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers Terry Bowman
                   ` (10 subsequent siblings)
  18 siblings, 2 replies; 82+ messages in thread
From: Terry Bowman @ 2025-06-26 22:42 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

The cxl_port driver is intended to manage CXL Endpoint Ports and CXL Switch
Ports. Move existing RAS initialization to the cxl_port driver.

Restricted CXL Host (RCH) Downstream Port RAS initialization currently
resides in cxl/core/pci.c. The PCI source file is not otherwise associated
with CXL Port management.

Additional CXL Port RAS initialization will be added in future patches to
support a CXL Port device's CXL errors.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/pci.c  | 73 --------------------------------------
 drivers/cxl/core/regs.c |  2 ++
 drivers/cxl/cxl.h       |  6 ++++
 drivers/cxl/port.c      | 78 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 86 insertions(+), 73 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 06464a25d8bd..35c9c50534bf 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -738,79 +738,6 @@ static bool cxl_handle_ras(struct cxl_dev_state *cxlds,
 
 #ifdef CONFIG_PCIEAER_CXL
 
-static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
-{
-	resource_size_t aer_phys;
-	struct device *host;
-	u16 aer_cap;
-
-	aer_cap = cxl_rcrb_to_aer(dport->dport_dev, dport->rcrb.base);
-	if (aer_cap) {
-		host = dport->reg_map.host;
-		aer_phys = aer_cap + dport->rcrb.base;
-		dport->regs.dport_aer = devm_cxl_iomap_block(host, aer_phys,
-						sizeof(struct aer_capability_regs));
-	}
-}
-
-static void cxl_dport_map_ras(struct cxl_dport *dport)
-{
-	struct cxl_register_map *map = &dport->reg_map;
-	struct device *dev = dport->dport_dev;
-
-	if (!map->component_map.ras.valid)
-		dev_dbg(dev, "RAS registers not found\n");
-	else if (cxl_map_component_regs(map, &dport->regs.component,
-					BIT(CXL_CM_CAP_CAP_ID_RAS)))
-		dev_dbg(dev, "Failed to map RAS capability.\n");
-}
-
-static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
-{
-	void __iomem *aer_base = dport->regs.dport_aer;
-	u32 aer_cmd_mask, aer_cmd;
-
-	if (!aer_base)
-		return;
-
-	/*
-	 * Disable RCH root port command interrupts.
-	 * CXL 3.0 12.2.1.1 - RCH Downstream Port-detected Errors
-	 *
-	 * This sequence may not be necessary. CXL spec states disabling
-	 * the root cmd register's interrupts is required. But, PCI spec
-	 * shows these are disabled by default on reset.
-	 */
-	aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN |
-			PCI_ERR_ROOT_CMD_NONFATAL_EN |
-			PCI_ERR_ROOT_CMD_FATAL_EN);
-	aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND);
-	aer_cmd &= ~aer_cmd_mask;
-	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
-}
-
-/**
- * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
- * @dport: the cxl_dport that needs to be initialized
- * @host: host device for devm operations
- */
-void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
-{
-	dport->reg_map.host = host;
-	cxl_dport_map_ras(dport);
-
-	if (dport->rch) {
-		struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
-
-		if (!host_bridge->native_aer)
-			return;
-
-		cxl_dport_map_rch_aer(dport);
-		cxl_disable_rch_root_ints(dport);
-	}
-}
-EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
-
 static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
 					  struct cxl_dport *dport)
 {
diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index 5ca7b0eed568..b8e767a9571c 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -199,6 +199,7 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
 
 	return ret_val;
 }
+EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, "CXL");
 
 int cxl_map_component_regs(const struct cxl_register_map *map,
 			   struct cxl_component_regs *regs,
@@ -517,6 +518,7 @@ u16 cxl_rcrb_to_aer(struct device *dev, resource_size_t rcrb)
 
 	return offset;
 }
+EXPORT_SYMBOL_NS_GPL(cxl_rcrb_to_aer, "CXL");
 
 static resource_size_t cxl_rcrb_to_linkcap(struct device *dev, struct cxl_dport *dport)
 {
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 3f1695c96abc..c57c160f3e5e 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -313,6 +313,12 @@ int cxl_setup_regs(struct cxl_register_map *map);
 struct cxl_dport;
 resource_size_t cxl_rcd_component_reg_phys(struct device *dev,
 					   struct cxl_dport *dport);
+
+u16 cxl_rcrb_to_aer(struct device *dev, resource_size_t rcrb);
+
+void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
+				   resource_size_t length);
+
 int cxl_dport_map_rcd_linkcap(struct pci_dev *pdev, struct cxl_dport *dport);
 
 #define CXL_RESOURCE_NONE ((resource_size_t) -1)
diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
index fe4b593331da..021f35145c65 100644
--- a/drivers/cxl/port.c
+++ b/drivers/cxl/port.c
@@ -6,6 +6,7 @@
 
 #include "cxlmem.h"
 #include "cxlpci.h"
+#include "cxl.h"
 
 /**
  * DOC: cxl port
@@ -57,6 +58,83 @@ static int discover_region(struct device *dev, void *unused)
 	return 0;
 }
 
+#ifdef CONFIG_PCIEAER_CXL
+
+static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
+{
+	resource_size_t aer_phys;
+	struct device *host;
+	u16 aer_cap;
+
+	aer_cap = cxl_rcrb_to_aer(dport->dport_dev, dport->rcrb.base);
+	if (aer_cap) {
+		host = dport->reg_map.host;
+		aer_phys = aer_cap + dport->rcrb.base;
+		dport->regs.dport_aer = devm_cxl_iomap_block(host, aer_phys,
+						sizeof(struct aer_capability_regs));
+	}
+}
+
+static void cxl_dport_map_ras(struct cxl_dport *dport)
+{
+	struct cxl_register_map *map = &dport->reg_map;
+	struct device *dev = dport->dport_dev;
+
+	if (!map->component_map.ras.valid)
+		dev_dbg(dev, "RAS registers not found\n");
+	else if (cxl_map_component_regs(map, &dport->regs.component,
+					BIT(CXL_CM_CAP_CAP_ID_RAS)))
+		dev_dbg(dev, "Failed to map RAS capability.\n");
+}
+
+static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
+{
+	void __iomem *aer_base = dport->regs.dport_aer;
+	u32 aer_cmd_mask, aer_cmd;
+
+	if (!aer_base)
+		return;
+
+	/*
+	 * Disable RCH root port command interrupts.
+	 * CXL 3.2 12.2.1.1 - RCH Downstream Port-detected Errors
+	 *
+	 * This sequence may not be necessary. CXL spec states disabling
+	 * the root cmd register's interrupts is required. But, PCI spec
+	 * shows these are disabled by default on reset.
+	 */
+	aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN |
+			PCI_ERR_ROOT_CMD_NONFATAL_EN |
+			PCI_ERR_ROOT_CMD_FATAL_EN);
+	aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND);
+	aer_cmd &= ~aer_cmd_mask;
+	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
+}
+
+/**
+ * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
+ * @dport: the cxl_dport that needs to be initialized
+ * @host: host device for devm operations
+ */
+void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
+{
+	dport->reg_map.host = host;
+	cxl_dport_map_ras(dport);
+
+	if (dport->rch) {
+		struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
+
+		if (!host_bridge->native_aer)
+			return;
+
+		cxl_dport_map_rch_aer(dport);
+		cxl_disable_rch_root_ints(dport);
+	}
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
+
+#endif /* CONFIG_PCIEAER_CXL */
+
 static int cxl_switch_port_probe(struct cxl_port *port)
 {
 	struct cxl_hdm *cxlhdm;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v10 09/17] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
  2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (7 preceding siblings ...)
  2025-06-26 22:42 ` [PATCH v10 08/17] cxl/pci: Move RAS initialization to cxl_port driver Terry Bowman
@ 2025-06-26 22:42 ` Terry Bowman
  2025-06-27 11:17   ` Jonathan Cameron
  2025-07-18 21:28   ` Dave Jiang
  2025-06-26 22:42 ` [PATCH v10 10/17] cxl/pci: Update RAS handler interfaces to also support CXL Ports Terry Bowman
                   ` (9 subsequent siblings)
  18 siblings, 2 replies; 82+ messages in thread
From: Terry Bowman @ 2025-06-26 22:42 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

CXL Endpoint (EP) Ports may include Root Ports (RP) or Downstream Switch
Ports (DSP). CXL RPs and DSPs contain RAS registers that require memory
mapping to enable RAS logging. This initialization is currently missing and
must be added for CXL RPs and DSPs.

Update cxl_dport_init_ras_reporting() to support RP and DSP RAS mapping.
Add alongside the existing Restricted CXL Host Downstream Port RAS mapping.

Update cxl_endpoint_port_probe() to invoke cxl_dport_init_ras_reporting().
This will initiate the RAS mapping for CXL RPs and DSPs when each CXL EP is
created and added to the EP port.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/cxl.h  |  2 ++
 drivers/cxl/mem.c  |  3 ++-
 drivers/cxl/port.c | 58 +++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index c57c160f3e5e..d696d419bd5a 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -590,6 +590,7 @@ struct cxl_dax_region {
  * @parent_dport: dport that points to this port in the parent
  * @decoder_ida: allocator for decoder ids
  * @reg_map: component and ras register mapping parameters
+ * @uport_regs: mapped component registers
  * @nr_dports: number of entries in @dports
  * @hdm_end: track last allocated HDM decoder instance for allocation ordering
  * @commit_end: cursor to track highest committed decoder for commit ordering
@@ -610,6 +611,7 @@ struct cxl_port {
 	struct cxl_dport *parent_dport;
 	struct ida decoder_ida;
 	struct cxl_register_map reg_map;
+	struct cxl_component_regs uport_regs;
 	int nr_dports;
 	int hdm_end;
 	int commit_end;
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 6e6777b7bafb..d2155f45240d 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -166,7 +166,8 @@ static int cxl_mem_probe(struct device *dev)
 	else
 		endpoint_parent = &parent_port->dev;
 
-	cxl_dport_init_ras_reporting(dport, dev);
+	if (dport->rch)
+		cxl_dport_init_ras_reporting(dport, dev);
 
 	scoped_guard(device, endpoint_parent) {
 		if (!endpoint_parent->driver) {
diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
index 021f35145c65..b52f82925891 100644
--- a/drivers/cxl/port.c
+++ b/drivers/cxl/port.c
@@ -111,6 +111,17 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
 	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
 }
 
+static void cxl_uport_init_ras_reporting(struct cxl_port *port,
+					 struct device *host)
+{
+	struct cxl_register_map *map = &port->reg_map;
+
+	map->host = host;
+	if (cxl_map_component_regs(map, &port->uport_regs,
+				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
+		dev_dbg(&port->dev, "Failed to map RAS capability\n");
+}
+
 /**
  * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
  * @dport: the cxl_dport that needs to be initialized
@@ -119,7 +130,6 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
 void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
 {
 	dport->reg_map.host = host;
-	cxl_dport_map_ras(dport);
 
 	if (dport->rch) {
 		struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
@@ -127,12 +137,54 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
 		if (!host_bridge->native_aer)
 			return;
 
+		cxl_dport_map_ras(dport);
 		cxl_dport_map_rch_aer(dport);
 		cxl_disable_rch_root_ints(dport);
+		return;
 	}
+
+	if (cxl_map_component_regs(&dport->reg_map, &dport->regs.component,
+				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
+		dev_dbg(dport->dport_dev, "Failed to map RAS capability\n");
+
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
 
+static void cxl_switch_port_init_ras(struct cxl_port *port)
+{
+	if (is_cxl_root(to_cxl_port(port->dev.parent)))
+		return;
+
+	/* May have upstream DSP or RP */
+	if (port->parent_dport && dev_is_pci(port->parent_dport->dport_dev)) {
+		struct pci_dev *pdev = to_pci_dev(port->parent_dport->dport_dev);
+
+		if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
+		    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM))
+			cxl_dport_init_ras_reporting(port->parent_dport, &port->dev);
+	}
+
+	cxl_uport_init_ras_reporting(port, &port->dev);
+}
+
+static void cxl_endpoint_port_init_ras(struct cxl_port *port)
+{
+	struct cxl_dport *dport;
+	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
+	struct cxl_port *parent_port __free(put_cxl_port) =
+		cxl_mem_find_port(cxlmd, &dport);
+
+	if (!dport || !dev_is_pci(dport->dport_dev)) {
+		dev_err(&port->dev, "CXL port topology not found\n");
+		return;
+	}
+
+	cxl_dport_init_ras_reporting(dport, cxlmd->cxlds->dev);
+}
+
+#else
+static void cxl_endpoint_port_init_ras(struct cxl_port *port) { }
+static void cxl_switch_port_init_ras(struct cxl_port *port) { }
 #endif /* CONFIG_PCIEAER_CXL */
 
 static int cxl_switch_port_probe(struct cxl_port *port)
@@ -149,6 +201,8 @@ static int cxl_switch_port_probe(struct cxl_port *port)
 
 	cxl_switch_parse_cdat(port);
 
+	cxl_switch_port_init_ras(port);
+
 	cxlhdm = devm_cxl_setup_hdm(port, NULL);
 	if (!IS_ERR(cxlhdm))
 		return devm_cxl_enumerate_decoders(cxlhdm, NULL);
@@ -203,6 +257,8 @@ static int cxl_endpoint_port_probe(struct cxl_port *port)
 	if (rc)
 		return rc;
 
+	cxl_endpoint_port_init_ras(port);
+
 	/*
 	 * Now that all endpoint decoders are successfully enumerated, try to
 	 * assemble regions from committed decoders
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v10 10/17] cxl/pci: Update RAS handler interfaces to also support CXL Ports
  2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (8 preceding siblings ...)
  2025-06-26 22:42 ` [PATCH v10 09/17] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers Terry Bowman
@ 2025-06-26 22:42 ` Terry Bowman
  2025-06-26 22:42 ` [PATCH v10 11/17] cxl/pci: Log message if RAS registers are unmapped Terry Bowman
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 82+ messages in thread
From: Terry Bowman @ 2025-06-26 22:42 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

CXL PCIe Port Protocol Error handling support will be added to the
CXL drivers in the future. In preparation, rename the existing
interfaces to support handling all CXL PCIe Port Protocol Errors.

The driver's RAS support functions currently rely on a 'struct
cxl_dev_state' type parameter, which is not available for CXL Port
devices. However, since the same CXL RAS capability structure is
needed across most CXL components and devices, a common handling
approach should be adopted.

To accommodate this, update the __cxl_handle_cor_ras() and
__cxl_handle_ras() functions to use a `struct device` instead of
`struct cxl_dev_state`.

No functional changes are introduced.

[1] CXL 3.1 Spec, 8.2.4 CXL.cache and CXL.mem Registers

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Alejandro Lucero <alucerop@amd.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/cxl/core/pci.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 35c9c50534bf..9b464f9c55c1 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -664,8 +664,8 @@ void read_cdat_data(struct cxl_port *port)
 }
 EXPORT_SYMBOL_NS_GPL(read_cdat_data, "CXL");
 
-static void cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
-			       void __iomem *ras_base)
+static void cxl_handle_cor_ras(struct device *dev,
+				 void __iomem *ras_base)
 {
 	void __iomem *addr;
 	u32 status;
@@ -677,7 +677,7 @@ static void cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
 	status = readl(addr);
 	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
 		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
-		trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
+		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
 	}
 }
 
@@ -702,8 +702,7 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
  * Log the state of the RAS status registers and prepare them to log the
  * next error status. Return 1 if reset needed.
  */
-static bool cxl_handle_ras(struct cxl_dev_state *cxlds,
-			   void __iomem *ras_base)
+static bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
 {
 	u32 hl[CXL_HEADERLOG_SIZE_U32];
 	void __iomem *addr;
@@ -730,7 +729,7 @@ static bool cxl_handle_ras(struct cxl_dev_state *cxlds,
 	}
 
 	header_log_copy(ras_base, hl);
-	trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe, hl);
+	trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
 	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
 
 	return true;
@@ -741,13 +740,13 @@ static bool cxl_handle_ras(struct cxl_dev_state *cxlds,
 static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
 					  struct cxl_dport *dport)
 {
-	return cxl_handle_cor_ras(cxlds, dport->regs.ras);
+	cxl_handle_cor_ras(&cxlds->cxlmd->dev, dport->regs.ras);
 }
 
 static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
 				       struct cxl_dport *dport)
 {
-	return cxl_handle_ras(cxlds, dport->regs.ras);
+	return cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
 }
 
 /*
@@ -844,7 +843,7 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
 		if (cxlds->rcd)
 			cxl_handle_rdport_errors(cxlds);
 
-		cxl_handle_cor_ras(cxlds, cxlds->regs.ras);
+		cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
 	}
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
@@ -873,7 +872,7 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 		 * chance the situation is recoverable dump the status of the RAS
 		 * capability registers and bounce the active state of the memdev.
 		 */
-		ue = cxl_handle_ras(cxlds, cxlds->regs.ras);
+		ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
 	}
 
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v10 11/17] cxl/pci: Log message if RAS registers are unmapped
  2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (9 preceding siblings ...)
  2025-06-26 22:42 ` [PATCH v10 10/17] cxl/pci: Update RAS handler interfaces to also support CXL Ports Terry Bowman
@ 2025-06-26 22:42 ` Terry Bowman
  2025-07-21 21:56   ` Dave Jiang
  2025-06-26 22:42 ` [PATCH v10 12/17] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports Terry Bowman
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 82+ messages in thread
From: Terry Bowman @ 2025-06-26 22:42 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

The CXL RAS handlers do not currently log if the RAS registers are
unmapped. This is needed in order to help debug CXL error handling. Update
the CXL driver to log a warning message if the RAS register block is
unmapped during RAS error handling.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/cxl/core/pci.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 9b464f9c55c1..c9a4b528e0b8 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -670,8 +670,10 @@ static void cxl_handle_cor_ras(struct device *dev,
 	void __iomem *addr;
 	u32 status;
 
-	if (!ras_base)
+	if (!ras_base) {
+		dev_warn_once(dev, "CXL RAS register block is not mapped");
 		return;
+	}
 
 	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
 	status = readl(addr);
@@ -709,8 +711,10 @@ static bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
 	u32 status;
 	u32 fe;
 
-	if (!ras_base)
+	if (!ras_base) {
+		dev_warn_once(dev, "CXL RAS register block is not mapped");
 		return false;
+	}
 
 	addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
 	status = readl(addr);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v10 12/17] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
  2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (10 preceding siblings ...)
  2025-06-26 22:42 ` [PATCH v10 11/17] cxl/pci: Log message if RAS registers are unmapped Terry Bowman
@ 2025-06-26 22:42 ` Terry Bowman
  2025-06-27 12:22   ` Shiju Jose
  2025-06-26 22:42 ` [PATCH v10 13/17] cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors Terry Bowman
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 82+ messages in thread
From: Terry Bowman @ 2025-06-26 22:42 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

CXL currently has separate trace routines for CXL Port errors and CXL
Endpoint errors. This is inconvenient for the user because they must enable
2 sets of trace routines. Make updates to the trace logging such that a
single trace routine logs both CXL Endpoint and CXL Port protocol errors.

Keep the trace log fields 'memdev' and 'host'. While these are not accurate
for non-Endpoints the fields will remain as-is to prevent breaking userspace
RAS trace consumers.

Add serial number parameter to the trace logging. This is used for EPs
and 0 is provided for CXL port devices without a serial number.

Below is output of correctable and uncorrectable protocol error logging.
CXL Root Port and CXL Endpoint examples are included below.

Root Port:
cxl_aer_correctable_error: memdev=0000:0c:00.0 host=pci0000:0c serial: 0 status='CRC Threshold Hit'
cxl_aer_uncorrectable_error: memdev=0000:0c:00.0 host=pci0000:0c serial: 0 status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'

Endpoint:
cxl_aer_correctable_error: memdev=mem3 host=0000:0f:00.0 serial=0 status='CRC Threshold Hit'
cxl_aer_uncorrectable_error: memdev=mem3 host=0000:0f:00.0 serial: 0 status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/pci.c   | 19 ++++-----
 drivers/cxl/core/ras.c   | 14 ++++---
 drivers/cxl/core/trace.h | 84 +++++++++-------------------------------
 3 files changed, 37 insertions(+), 80 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index c9a4b528e0b8..156ce094a8b9 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -664,8 +664,8 @@ void read_cdat_data(struct cxl_port *port)
 }
 EXPORT_SYMBOL_NS_GPL(read_cdat_data, "CXL");
 
-static void cxl_handle_cor_ras(struct device *dev,
-				 void __iomem *ras_base)
+static void cxl_handle_cor_ras(struct device *dev, u64 serial,
+			       void __iomem *ras_base)
 {
 	void __iomem *addr;
 	u32 status;
@@ -679,7 +679,7 @@ static void cxl_handle_cor_ras(struct device *dev,
 	status = readl(addr);
 	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
 		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
-		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
+		trace_cxl_aer_correctable_error(dev, serial, status);
 	}
 }
 
@@ -704,7 +704,8 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
  * Log the state of the RAS status registers and prepare them to log the
  * next error status. Return 1 if reset needed.
  */
-static bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
+static bool cxl_handle_ras(struct device *dev, u64 serial,
+			   void __iomem *ras_base)
 {
 	u32 hl[CXL_HEADERLOG_SIZE_U32];
 	void __iomem *addr;
@@ -733,7 +734,7 @@ static bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
 	}
 
 	header_log_copy(ras_base, hl);
-	trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
+	trace_cxl_aer_uncorrectable_error(dev, serial, status, fe, hl);
 	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
 
 	return true;
@@ -744,13 +745,13 @@ static bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
 static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
 					  struct cxl_dport *dport)
 {
-	cxl_handle_cor_ras(&cxlds->cxlmd->dev, dport->regs.ras);
+	cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->serial, dport->regs.ras);
 }
 
 static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
 				       struct cxl_dport *dport)
 {
-	return cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
+	return cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial, dport->regs.ras);
 }
 
 /*
@@ -847,7 +848,7 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
 		if (cxlds->rcd)
 			cxl_handle_rdport_errors(cxlds);
 
-		cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
+		cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->serial, cxlds->regs.ras);
 	}
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
@@ -876,7 +877,7 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 		 * chance the situation is recoverable dump the status of the RAS
 		 * capability registers and bounce the active state of the memdev.
 		 */
-		ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
+		ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial, cxlds->regs.ras);
 	}
 
 
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 962dc94fed8c..9588b39faabd 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -13,7 +13,7 @@ static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
 {
 	u32 status = ras_cap.cor_status & ~ras_cap.cor_mask;
 
-	trace_cxl_port_aer_correctable_error(&pdev->dev, status);
+	trace_cxl_aer_correctable_error(&pdev->dev, 0, status);
 }
 
 static void cxl_cper_trace_uncorr_port_prot_err(struct pci_dev *pdev,
@@ -28,8 +28,8 @@ static void cxl_cper_trace_uncorr_port_prot_err(struct pci_dev *pdev,
 	else
 		fe = status;
 
-	trace_cxl_port_aer_uncorrectable_error(&pdev->dev, status, fe,
-					       ras_cap.header_log);
+	trace_cxl_aer_uncorrectable_error(&pdev->dev, 0, status, fe,
+					  ras_cap.header_log);
 }
 
 static void cxl_cper_trace_corr_prot_err(struct pci_dev *pdev,
@@ -42,7 +42,8 @@ static void cxl_cper_trace_corr_prot_err(struct pci_dev *pdev,
 	if (!cxlds)
 		return;
 
-	trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
+	trace_cxl_aer_correctable_error(&cxlds->cxlmd->dev, cxlds->serial,
+					status);
 }
 
 static void cxl_cper_trace_uncorr_prot_err(struct pci_dev *pdev,
@@ -62,8 +63,9 @@ static void cxl_cper_trace_uncorr_prot_err(struct pci_dev *pdev,
 	else
 		fe = status;
 
-	trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe,
-					  ras_cap.header_log);
+	trace_cxl_aer_uncorrectable_error(&cxlds->cxlmd->dev,
+					  cxlds->serial, status,
+					  fe, ras_cap.header_log);
 }
 
 static void cxl_cper_handle_prot_err(struct cxl_cper_prot_err_work_data *data)
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index 25ebfbc1616c..494d6db461a7 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -48,49 +48,22 @@
 	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
 )
 
-TRACE_EVENT(cxl_port_aer_uncorrectable_error,
-	TP_PROTO(struct device *dev, u32 status, u32 fe, u32 *hl),
-	TP_ARGS(dev, status, fe, hl),
-	TP_STRUCT__entry(
-		__string(device, dev_name(dev))
-		__string(host, dev_name(dev->parent))
-		__field(u32, status)
-		__field(u32, first_error)
-		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
-	),
-	TP_fast_assign(
-		__assign_str(device);
-		__assign_str(host);
-		__entry->status = status;
-		__entry->first_error = fe;
-		/*
-		 * Embed the 512B headerlog data for user app retrieval and
-		 * parsing, but no need to print this in the trace buffer.
-		 */
-		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
-	),
-	TP_printk("device=%s host=%s status: '%s' first_error: '%s'",
-		  __get_str(device), __get_str(host),
-		  show_uc_errs(__entry->status),
-		  show_uc_errs(__entry->first_error)
-	)
-);
-
 TRACE_EVENT(cxl_aer_uncorrectable_error,
-	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
-	TP_ARGS(cxlmd, status, fe, hl),
+	TP_PROTO(struct device *dev, u64 serial, u32 status, u32 fe,
+		 u32 *hl),
+	TP_ARGS(dev, serial, status, fe, hl),
 	TP_STRUCT__entry(
-		__string(memdev, dev_name(&cxlmd->dev))
-		__string(host, dev_name(cxlmd->dev.parent))
+		__string(name, dev_name(dev))
+		__string(parent, dev_name(dev->parent))
 		__field(u64, serial)
 		__field(u32, status)
 		__field(u32, first_error)
 		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
 	),
 	TP_fast_assign(
-		__assign_str(memdev);
-		__assign_str(host);
-		__entry->serial = cxlmd->cxlds->serial;
+		__assign_str(name);
+		__assign_str(parent);
+		__entry->serial = serial;
 		__entry->status = status;
 		__entry->first_error = fe;
 		/*
@@ -99,8 +72,8 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
 		 */
 		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
 	),
-	TP_printk("memdev=%s host=%s serial=%lld: status: '%s' first_error: '%s'",
-		  __get_str(memdev), __get_str(host), __entry->serial,
+	TP_printk("memdev=%s host=%s serial=%lld status='%s' first_error='%s'",
+		  __get_str(name), __get_str(parent), __entry->serial,
 		  show_uc_errs(__entry->status),
 		  show_uc_errs(__entry->first_error)
 	)
@@ -124,42 +97,23 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
 	{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" }	\
 )
 
-TRACE_EVENT(cxl_port_aer_correctable_error,
-	TP_PROTO(struct device *dev, u32 status),
-	TP_ARGS(dev, status),
-	TP_STRUCT__entry(
-		__string(device, dev_name(dev))
-		__string(host, dev_name(dev->parent))
-		__field(u32, status)
-	),
-	TP_fast_assign(
-		__assign_str(device);
-		__assign_str(host);
-		__entry->status = status;
-	),
-	TP_printk("device=%s host=%s status='%s'",
-		  __get_str(device), __get_str(host),
-		  show_ce_errs(__entry->status)
-	)
-);
-
 TRACE_EVENT(cxl_aer_correctable_error,
-	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
-	TP_ARGS(cxlmd, status),
+	TP_PROTO(struct device *dev, u64 serial, u32 status),
+	TP_ARGS(dev, serial, status),
 	TP_STRUCT__entry(
-		__string(memdev, dev_name(&cxlmd->dev))
-		__string(host, dev_name(cxlmd->dev.parent))
+		__string(name, dev_name(dev))
+		__string(parent, dev_name(dev->parent))
 		__field(u64, serial)
 		__field(u32, status)
 	),
 	TP_fast_assign(
-		__assign_str(memdev);
-		__assign_str(host);
-		__entry->serial = cxlmd->cxlds->serial;
+		__assign_str(name);
+		__assign_str(parent);
+		__entry->serial = serial;
 		__entry->status = status;
 	),
-	TP_printk("memdev=%s host=%s serial=%lld: status: '%s'",
-		  __get_str(memdev), __get_str(host), __entry->serial,
+	TP_printk("memdev=%s host=%s serial=%lld status='%s'",
+		  __get_str(name), __get_str(parent), __entry->serial,
 		  show_ce_errs(__entry->status)
 	)
 );
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v10 13/17] cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
  2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (11 preceding siblings ...)
  2025-06-26 22:42 ` [PATCH v10 12/17] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports Terry Bowman
@ 2025-06-26 22:42 ` Terry Bowman
  2025-06-27 11:48   ` Jonathan Cameron
  2025-07-21 22:17   ` Dave Jiang
  2025-06-26 22:42 ` [PATCH v10 14/17] cxl/pci: Introduce CXL Endpoint protocol error handlers Terry Bowman
                   ` (5 subsequent siblings)
  18 siblings, 2 replies; 82+ messages in thread
From: Terry Bowman @ 2025-06-26 22:42 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

Update cxl_handle_cor_ras() to exit early in the case there is no RAS
errors detected after applying the status mask. This change will make
the correctable handler's implementation consistent with the uncorrectable
handler, cxl_handle_ras().

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/cxl/core/pci.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 156ce094a8b9..887b54cf3395 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -677,10 +677,11 @@ static void cxl_handle_cor_ras(struct device *dev, u64 serial,
 
 	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
 	status = readl(addr);
-	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
-		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
-		trace_cxl_aer_correctable_error(dev, serial, status);
-	}
+	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
+		return;
+	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
+
+	trace_cxl_aer_correctable_error(dev, serial, status);
 }
 
 /* CXL spec rev3.0 8.2.4.16.1 */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v10 14/17] cxl/pci: Introduce CXL Endpoint protocol error handlers
  2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (12 preceding siblings ...)
  2025-06-26 22:42 ` [PATCH v10 13/17] cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors Terry Bowman
@ 2025-06-26 22:42 ` Terry Bowman
  2025-06-27 11:52   ` Jonathan Cameron
                     ` (2 more replies)
  2025-06-26 22:42 ` [PATCH v10 15/17] CXL/PCI: Introduce CXL Port " Terry Bowman
                   ` (4 subsequent siblings)
  18 siblings, 3 replies; 82+ messages in thread
From: Terry Bowman @ 2025-06-26 22:42 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

CXL Endpoint protocol errors are currently handled using PCI error
handlers. The CXL Endpoint requires CXL specific handling in the case of
uncorrectable error (UCE) handling not provided by the PCI handlers.

Add CXL specific handlers for CXL Endpoints. Rename the existing
cxl_error_handlers to be pci_error_handlers to more correctly indicate
the error type and follow naming consistency.

The PCI handlers will be called if the CXL device is not trained for
alternate protocol (CXL). Update the CXL Endpoint PCI handlers to call the
CXL UCE handlers.

The existing EP UCE handler includes checks for various results. These are
no longer needed because CXL UCE recovery will not be attempted. Implement
cxl_handle_ras() to return PCI_ERS_RESULT_NONE or PCI_ERS_RESULT_PANIC. The
CXL UCE handler is called by cxl_do_recovery() that acts on the return
value. In the case of the PCI handler path, call panic() if the result is
PCI_ERS_RESULT_PANIC.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/cxl/core/native_ras.c | 15 ++++---
 drivers/cxl/core/pci.c        | 77 ++++++++++++++++++-----------------
 drivers/cxl/cxl.h             |  4 ++
 drivers/cxl/cxlpci.h          |  6 +--
 drivers/cxl/pci.c             |  8 ++--
 5 files changed, 59 insertions(+), 51 deletions(-)

diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
index 19f8f2ac8376..89b65a35f2c0 100644
--- a/drivers/cxl/core/native_ras.c
+++ b/drivers/cxl/core/native_ras.c
@@ -7,18 +7,20 @@
 #include <cxlmem.h>
 #include <core/core.h>
 #include <cxlpci.h>
+#include <core/core.h>
 
 static int cxl_report_error_detected(struct pci_dev *pdev, void *data)
 {
 	pci_ers_result_t vote, *result = data;
+	struct device *dev = &pdev->dev;
 
 	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
 	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END))
 		return 0;
 
-	guard(device)(&pdev->dev);
+	guard(device)(dev);
 
-	vote = cxl_error_detected(pdev, pci_channel_io_frozen);
+	vote = cxl_error_detected(dev);
 	*result = merge_result(*result, vote);
 
 	return 0;
@@ -82,16 +84,17 @@ static bool is_cxl_rcd(struct pci_dev *pdev)
 static int cxl_rch_handle_error_iter(struct pci_dev *pdev, void *data)
 {
 	struct cxl_proto_error_info *err_info = data;
+	struct device *dev = &pdev->dev;
 
-	guard(device)(&pdev->dev);
+	guard(device)(dev);
 
 	if (!is_cxl_rcd(pdev) || !cxl_pci_drv_bound(pdev))
 		return 0;
 
 	if (err_info->severity == AER_CORRECTABLE)
-		cxl_cor_error_detected(pdev);
+		cxl_cor_error_detected(dev);
 	else
-		cxl_error_detected(pdev, pci_channel_io_frozen);
+		cxl_error_detected(dev);
 
 	return 1;
 }
@@ -126,7 +129,7 @@ static void cxl_handle_proto_error(struct cxl_proto_error_info *err_info)
 						       aer + PCI_ERR_COR_STATUS,
 						       0, PCI_ERR_COR_INTERNAL);
 
-		cxl_cor_error_detected(pdev);
+		cxl_cor_error_detected(&pdev->dev);
 
 		pcie_clear_device_status(pdev);
 	} else {
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 887b54cf3395..7209ffb5c2fe 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -705,8 +705,8 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
  * Log the state of the RAS status registers and prepare them to log the
  * next error status. Return 1 if reset needed.
  */
-static bool cxl_handle_ras(struct device *dev, u64 serial,
-			   void __iomem *ras_base)
+static pci_ers_result_t cxl_handle_ras(struct device *dev, u64 serial,
+				       void __iomem *ras_base)
 {
 	u32 hl[CXL_HEADERLOG_SIZE_U32];
 	void __iomem *addr;
@@ -715,13 +715,13 @@ static bool cxl_handle_ras(struct device *dev, u64 serial,
 
 	if (!ras_base) {
 		dev_warn_once(dev, "CXL RAS register block is not mapped");
-		return false;
+		return PCI_ERS_RESULT_NONE;
 	}
 
 	addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
 	status = readl(addr);
 	if (!(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK))
-		return false;
+		return PCI_ERS_RESULT_NONE;
 
 	/* If multiple errors, log header points to first error from ctrl reg */
 	if (hweight32(status) > 1) {
@@ -738,7 +738,7 @@ static bool cxl_handle_ras(struct device *dev, u64 serial,
 	trace_cxl_aer_uncorrectable_error(dev, serial, status, fe, hl);
 	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
 
-	return true;
+	return PCI_ERS_RESULT_PANIC;
 }
 
 #ifdef CONFIG_PCIEAER_CXL
@@ -833,13 +833,14 @@ static void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds)
 static void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds) { }
 #endif
 
-void cxl_cor_error_detected(struct pci_dev *pdev)
+void cxl_cor_error_detected(struct device *dev)
 {
+	struct pci_dev *pdev = to_pci_dev(dev);
 	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
-	struct device *dev = &cxlds->cxlmd->dev;
+	struct device *cxlmd_dev = &cxlds->cxlmd->dev;
 
-	scoped_guard(device, dev) {
-		if (!dev->driver) {
+	scoped_guard(device, cxlmd_dev) {
+		if (!cxlmd_dev->driver) {
 			dev_warn(&pdev->dev,
 				 "%s: memdev disabled, abort error handling\n",
 				 dev_name(dev));
@@ -854,20 +855,26 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
 
-pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
-				    pci_channel_state_t state)
+void pci_cor_error_detected(struct pci_dev *pdev)
 {
-	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
-	struct cxl_memdev *cxlmd = cxlds->cxlmd;
-	struct device *dev = &cxlmd->dev;
-	bool ue;
+	cxl_cor_error_detected(&pdev->dev);
+}
+EXPORT_SYMBOL_NS_GPL(pci_cor_error_detected, "CXL");
 
-	scoped_guard(device, dev) {
-		if (!dev->driver) {
+pci_ers_result_t cxl_error_detected(struct device *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
+	struct device *cxlmd_dev = &cxlds->cxlmd->dev;
+	pci_ers_result_t ue;
+
+	scoped_guard(device, cxlmd_dev) {
+
+		if (!cxlmd_dev->driver) {
 			dev_warn(&pdev->dev,
 				 "%s: memdev disabled, abort error handling\n",
 				 dev_name(dev));
-			return PCI_ERS_RESULT_DISCONNECT;
+			return PCI_ERS_RESULT_PANIC;
 		}
 
 		if (cxlds->rcd)
@@ -881,29 +888,23 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 		ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial, cxlds->regs.ras);
 	}
 
-
-	switch (state) {
-	case pci_channel_io_normal:
-		if (ue) {
-			device_release_driver(dev);
-			return PCI_ERS_RESULT_NEED_RESET;
-		}
-		return PCI_ERS_RESULT_CAN_RECOVER;
-	case pci_channel_io_frozen:
-		dev_warn(&pdev->dev,
-			 "%s: frozen state error detected, disable CXL.mem\n",
-			 dev_name(dev));
-		device_release_driver(dev);
-		return PCI_ERS_RESULT_NEED_RESET;
-	case pci_channel_io_perm_failure:
-		dev_warn(&pdev->dev,
-			 "failure state error detected, request disconnect\n");
-		return PCI_ERS_RESULT_DISCONNECT;
-	}
-	return PCI_ERS_RESULT_NEED_RESET;
+	return ue;
 }
 EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
 
+pci_ers_result_t pci_error_detected(struct pci_dev *pdev,
+				    pci_channel_state_t error)
+{
+	pci_ers_result_t rc;
+
+	rc = cxl_error_detected(&pdev->dev);
+	if (rc == PCI_ERS_RESULT_PANIC)
+		panic("CXL cachemem error.");
+
+	return rc;
+}
+EXPORT_SYMBOL_NS_GPL(pci_error_detected, "CXL");
+
 static int cxl_flit_size(struct pci_dev *pdev)
 {
 	if (cxl_pci_flit_256(pdev))
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index d696d419bd5a..a2eedc8a82e8 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -11,6 +11,7 @@
 #include <linux/log2.h>
 #include <linux/node.h>
 #include <linux/io.h>
+#include <linux/pci.h>
 
 extern const struct nvdimm_security_ops *cxl_security_ops;
 
@@ -797,6 +798,9 @@ static inline int cxl_root_decoder_autoremove(struct device *host,
 }
 int cxl_endpoint_autoremove(struct cxl_memdev *cxlmd, struct cxl_port *endpoint);
 
+void cxl_cor_error_detected(struct device *dev);
+pci_ers_result_t cxl_error_detected(struct device *dev);
+
 /**
  * struct cxl_endpoint_dvsec_info - Cached DVSEC info
  * @mem_enabled: cached value of mem_enabled in the DVSEC at init time
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index ed3c9701b79f..e69a47f0cd94 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -133,8 +133,8 @@ struct cxl_dev_state;
 int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
 			struct cxl_endpoint_dvsec_info *info);
 void read_cdat_data(struct cxl_port *port);
-void cxl_cor_error_detected(struct pci_dev *pdev);
-pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
-				    pci_channel_state_t state);
+void pci_cor_error_detected(struct pci_dev *pdev);
+pci_ers_result_t pci_error_detected(struct pci_dev *pdev,
+				    pci_channel_state_t error);
 bool cxl_pci_drv_bound(struct pci_dev *pdev);
 #endif /* __CXL_PCI_H__ */
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index cae049f9ae3e..91fab33094a9 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -1112,11 +1112,11 @@ static void cxl_reset_done(struct pci_dev *pdev)
 	}
 }
 
-static const struct pci_error_handlers cxl_error_handlers = {
-	.error_detected	= cxl_error_detected,
+static const struct pci_error_handlers pci_error_handlers = {
+	.error_detected = pci_error_detected,
 	.slot_reset	= cxl_slot_reset,
 	.resume		= cxl_error_resume,
-	.cor_error_detected	= cxl_cor_error_detected,
+	.cor_error_detected	= pci_cor_error_detected,
 	.reset_done	= cxl_reset_done,
 };
 
@@ -1124,7 +1124,7 @@ static struct pci_driver cxl_pci_driver = {
 	.name			= KBUILD_MODNAME,
 	.id_table		= cxl_mem_pci_tbl,
 	.probe			= cxl_pci_probe,
-	.err_handler		= &cxl_error_handlers,
+	.err_handler		= &pci_error_handlers,
 	.dev_groups		= cxl_rcd_groups,
 	.driver	= {
 		.probe_type	= PROBE_PREFER_ASYNCHRONOUS,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v10 15/17] CXL/PCI: Introduce CXL Port protocol error handlers
  2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (13 preceding siblings ...)
  2025-06-26 22:42 ` [PATCH v10 14/17] cxl/pci: Introduce CXL Endpoint protocol error handlers Terry Bowman
@ 2025-06-26 22:42 ` Terry Bowman
  2025-06-26 22:42 ` [PATCH v10 16/17] CXL/PCI: Enable CXL protocol errors during CXL Port probe Terry Bowman
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 82+ messages in thread
From: Terry Bowman @ 2025-06-26 22:42 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

Introduce CXL error handlers for CXL Port devices.

Add functions cxl_port_cor_error_detected() and cxl_port_error_detected().
These will serve as the handlers for all CXL Port devices. Introduce
cxl_get_ras_base() to provide the RAS base address needed by the handlers.

Update cxl_handle_proto_error() to call the CXL Port or CXL Endpoint
handler depending on which CXL device reports the error.

Implement cxl_get_ras_base() to return the cached RAS register address of a
CXL Root Port, CXL Downstream Port, or CXL Upstream Port.

Introduce get_pci_cxl_host_dev() to return the host responsible for
releasing the RAS mapped resources. CXL endpoints do not use a host to
manage its resources, allow for NULL in the case of an EP. Use reference
count increment on the host to prevent resource release. Make the caller
responsible for the reference decrement.

Update the AER driver's is_cxl_error() to remove the filter PCI type check
because CXL Port devices are now supported.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/cxl/core/core.h       |  2 +
 drivers/cxl/core/native_ras.c | 74 ++++++++++++++++++++++++++++++++---
 drivers/cxl/core/pci.c        | 60 ++++++++++++++++++++++++++++
 drivers/cxl/core/port.c       |  4 +-
 drivers/cxl/cxl.h             |  5 +++
 drivers/pci/pcie/cxl_aer.c    | 17 ++++----
 6 files changed, 145 insertions(+), 17 deletions(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 4c08bb92e2f9..f5a2571fc208 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -122,6 +122,8 @@ void cxl_ras_exit(void);
 int cxl_gpf_port_setup(struct cxl_dport *dport);
 int cxl_acpi_get_extended_linear_cache_size(struct resource *backing_res,
 					    int nid, resource_size_t *size);
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+			       struct cxl_dport **dport);
 
 #ifdef CONFIG_PCIEAER_CXL
 void cxl_native_ras_init(void);
diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
index 89b65a35f2c0..c7f9a237a1a2 100644
--- a/drivers/cxl/core/native_ras.c
+++ b/drivers/cxl/core/native_ras.c
@@ -9,18 +9,74 @@
 #include <cxlpci.h>
 #include <core/core.h>
 
+int match_uport(struct device *dev, const void *data)
+{
+	const struct device *uport_dev = data;
+	struct cxl_port *port;
+
+	if (!is_cxl_port(dev))
+		return 0;
+
+	port = to_cxl_port(dev);
+
+	return port->uport_dev == uport_dev;
+}
+EXPORT_SYMBOL_NS_GPL(match_uport, "CXL");
+
+/*
+ * Return 'struct device *' responsible for freeing pdev's CXL resources.
+ * Caller is responsible for reference count decrementing the return
+ * 'struct device *'.
+ *
+ * pdev: CXL PCI device to find the host device.
+ */
+static struct device *get_pci_cxl_host_dev(struct pci_dev *pdev)
+{
+	switch (pci_pcie_type(pdev)) {
+	case PCI_EXP_TYPE_ROOT_PORT:
+	case PCI_EXP_TYPE_DOWNSTREAM:
+	{
+		struct cxl_dport *dport = NULL;
+		struct cxl_port *port = find_cxl_port(&pdev->dev, &dport);
+
+		if (!port)
+			return NULL;
+
+		return &port->dev;
+	}
+	case PCI_EXP_TYPE_UPSTREAM:
+	{
+		struct device *port_dev = bus_find_device(&cxl_bus_type, NULL,
+							  &pdev->dev, match_uport);
+
+		if (!port_dev || !is_cxl_port(port_dev))
+			return NULL;
+
+		return port_dev;
+	}
+	/* Endpoint resources are managed by endpoint itself */
+	case PCI_EXP_TYPE_ENDPOINT:
+	case PCI_EXP_TYPE_RC_END:
+		return NULL;
+	}
+
+	pci_warn_once(pdev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));
+	return NULL;
+}
+
 static int cxl_report_error_detected(struct pci_dev *pdev, void *data)
 {
+	struct device *host_dev __free(put_device) = get_pci_cxl_host_dev(pdev);
 	pci_ers_result_t vote, *result = data;
 	struct device *dev = &pdev->dev;
 
-	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
-	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END))
-		return 0;
-
 	guard(device)(dev);
 
-	vote = cxl_error_detected(dev);
+	if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ENDPOINT) ||
+	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END))
+		vote = cxl_error_detected(dev);
+	else
+		vote = cxl_port_error_detected(dev);
 	*result = merge_result(*result, vote);
 
 	return 0;
@@ -112,6 +168,7 @@ static void cxl_handle_proto_error(struct cxl_proto_error_info *err_info)
 		       PCI_FUNC(err_info->devfn));
 		return;
 	}
+	struct device *host_dev __free(put_device) = get_pci_cxl_host_dev(pdev);
 
 	/*
 	 * Internal errors of an RCEC indicate an AER error in an
@@ -122,6 +179,7 @@ static void cxl_handle_proto_error(struct cxl_proto_error_info *err_info)
 		return pcie_walk_rcec(pdev, cxl_rch_handle_error_iter, err_info);
 
 	if (err_info->severity == AER_CORRECTABLE) {
+		struct device *dev = &pdev->dev;
 		int aer = pdev->aer_cap;
 
 		if (aer)
@@ -129,7 +187,11 @@ static void cxl_handle_proto_error(struct cxl_proto_error_info *err_info)
 						       aer + PCI_ERR_COR_STATUS,
 						       0, PCI_ERR_COR_INTERNAL);
 
-		cxl_cor_error_detected(&pdev->dev);
+		if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ENDPOINT) ||
+		    (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_END))
+			cxl_cor_error_detected(dev);
+		else
+			cxl_port_cor_error_detected(dev);
 
 		pcie_clear_device_status(pdev);
 	} else {
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 7209ffb5c2fe..7a83408ad0bb 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -743,6 +743,66 @@ static pci_ers_result_t cxl_handle_ras(struct device *dev, u64 serial,
 
 #ifdef CONFIG_PCIEAER_CXL
 
+static void __iomem *cxl_get_ras_base(struct device *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	switch (pci_pcie_type(pdev)) {
+	case PCI_EXP_TYPE_ROOT_PORT:
+	case PCI_EXP_TYPE_DOWNSTREAM:
+	{
+		struct cxl_dport *dport = NULL;
+		struct cxl_port *port __free(put_cxl_port) = find_cxl_port(&pdev->dev, &dport);
+
+		if (!dport || !dport->dport_dev) {
+			pci_err(pdev, "Failed to find the CXL device");
+			return NULL;
+		}
+
+		if (!dport)
+			return NULL;
+
+		return dport->regs.ras;
+	}
+	case PCI_EXP_TYPE_UPSTREAM:
+	{
+		struct cxl_port *port;
+		struct device *dev __free(put_device) = bus_find_device(&cxl_bus_type, NULL,
+									&pdev->dev, match_uport);
+
+		if (!dev || !is_cxl_port(dev)) {
+			pci_err(pdev, "Failed to find the CXL device");
+			return NULL;
+		}
+
+		port = to_cxl_port(dev);
+		if (!port)
+			return NULL;
+
+		return port->uport_regs.ras;
+	}
+	}
+
+	pci_warn_once(pdev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));
+	return NULL;
+}
+
+void cxl_port_cor_error_detected(struct device *dev)
+{
+	void __iomem *ras_base = cxl_get_ras_base(dev);
+
+	cxl_handle_cor_ras(dev, 0, ras_base);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_port_cor_error_detected, "CXL");
+
+pci_ers_result_t cxl_port_error_detected(struct device *dev)
+{
+	void __iomem *ras_base = cxl_get_ras_base(dev);
+
+	return cxl_handle_ras(dev, 0, ras_base);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_port_error_detected, "CXL");
+
 static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
 					  struct cxl_dport *dport)
 {
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 8e8f21197c86..c76a1289e24e 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1341,8 +1341,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
 	return NULL;
 }
 
-static struct cxl_port *find_cxl_port(struct device *dport_dev,
-				      struct cxl_dport **dport)
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+			       struct cxl_dport **dport)
 {
 	struct cxl_find_port_ctx ctx = {
 		.dport_dev = dport_dev,
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index a2eedc8a82e8..19eb33a7de6a 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -801,6 +801,9 @@ int cxl_endpoint_autoremove(struct cxl_memdev *cxlmd, struct cxl_port *endpoint)
 void cxl_cor_error_detected(struct device *dev);
 pci_ers_result_t cxl_error_detected(struct device *dev);
 
+void cxl_port_cor_error_detected(struct device *dev);
+pci_ers_result_t cxl_port_error_detected(struct device *dev);
+
 /**
  * struct cxl_endpoint_dvsec_info - Cached DVSEC info
  * @mem_enabled: cached value of mem_enabled in the DVSEC at init time
@@ -915,6 +918,8 @@ void cxl_coordinates_combine(struct access_coordinate *out,
 
 bool cxl_endpoint_decoder_reset_detected(struct cxl_port *port);
 
+int match_uport(struct device *dev, const void *data);
+
 /*
  * Unit test builds overrides this to __weak, find the 'strong' version
  * of these symbols in tools/testing/cxl/.
diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
index b238791b7101..38dc82df0baf 100644
--- a/drivers/pci/pcie/cxl_aer.c
+++ b/drivers/pci/pcie/cxl_aer.c
@@ -73,11 +73,6 @@ bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
 	if (!info || !info->is_cxl)
 		return false;
 
-	/* Only CXL Endpoints are currently supported */
-	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
-	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC))
-		return false;
-
 	return is_internal_error(info);
 }
 
@@ -92,13 +87,17 @@ static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
 	return *handles_cxl;
 }
 
-static bool handles_cxl_errors(struct pci_dev *rcec)
+static bool handles_cxl_errors(struct pci_dev *dev)
 {
 	bool handles_cxl = false;
 
-	if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
-	    pcie_aer_is_native(rcec))
-		pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
+	if (!pcie_aer_is_native(dev))
+		return false;
+
+	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
+		pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
+	else
+		handles_cxl = pcie_is_cxl(dev);
 
 	return handles_cxl;
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v10 16/17] CXL/PCI: Enable CXL protocol errors during CXL Port probe
  2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (14 preceding siblings ...)
  2025-06-26 22:42 ` [PATCH v10 15/17] CXL/PCI: Introduce CXL Port " Terry Bowman
@ 2025-06-26 22:42 ` Terry Bowman
  2025-06-26 22:42 ` [PATCH v10 17/17] CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup Terry Bowman
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 82+ messages in thread
From: Terry Bowman @ 2025-06-26 22:42 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

CXL protocol errors are not enabled for all CXL devices after boot. These
must be enabled inorder to process CXL protocol errors.

Export the AER service driver's pci_aer_unmask_internal_errors().

Introduce cxl_unmask_proto_interrupts() to call pci_aer_unmask_internal_errors().
pci_aer_unmask_internal_errors() expects the pdev->aer_cap is initialized.
But, dev->aer_cap is not initialized for CXL Upstream Switch Ports and CXL
Downstream Switch Ports. Initialize the dev->aer_cap if necessary. Enable AER
correctable internal errors and uncorrectable internal errors for all CXL
devices.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/cxl/port.c         | 29 +++++++++++++++++++++++++++--
 drivers/pci/pcie/cxl_aer.c |  3 ++-
 include/linux/aer.h        |  1 +
 3 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
index b52f82925891..b90f5efa5904 100644
--- a/drivers/cxl/port.c
+++ b/drivers/cxl/port.c
@@ -3,6 +3,7 @@
 #include <linux/device.h>
 #include <linux/module.h>
 #include <linux/slab.h>
+#include <linux/pci.h>
 
 #include "cxlmem.h"
 #include "cxlpci.h"
@@ -60,6 +61,21 @@ static int discover_region(struct device *dev, void *unused)
 
 #ifdef CONFIG_PCIEAER_CXL
 
+static void cxl_unmask_proto_interrupts(struct device *dev)
+{
+	struct pci_dev *pdev __free(pci_dev_put) =
+		pci_dev_get(to_pci_dev(dev));
+
+	if (!pdev->aer_cap) {
+		pdev->aer_cap = pci_find_ext_capability(pdev,
+							PCI_EXT_CAP_ID_ERR);
+		if (!pdev->aer_cap)
+			return;
+	}
+
+	pci_aer_unmask_internal_errors(pdev);
+}
+
 static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
 {
 	resource_size_t aer_phys;
@@ -118,8 +134,12 @@ static void cxl_uport_init_ras_reporting(struct cxl_port *port,
 
 	map->host = host;
 	if (cxl_map_component_regs(map, &port->uport_regs,
-				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
+				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
 		dev_dbg(&port->dev, "Failed to map RAS capability\n");
+		return;
+	}
+
+	cxl_unmask_proto_interrupts(port->uport_dev);
 }
 
 /**
@@ -144,9 +164,12 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
 	}
 
 	if (cxl_map_component_regs(&dport->reg_map, &dport->regs.component,
-				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
+				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
 		dev_dbg(dport->dport_dev, "Failed to map RAS capability\n");
+		return;
+	}
 
+	cxl_unmask_proto_interrupts(dport->dport_dev);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
 
@@ -180,6 +203,8 @@ static void cxl_endpoint_port_init_ras(struct cxl_port *port)
 	}
 
 	cxl_dport_init_ras_reporting(dport, cxlmd->cxlds->dev);
+
+	cxl_unmask_proto_interrupts(cxlmd->cxlds->dev);
 }
 
 #else
diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
index 38dc82df0baf..3c5bf162607c 100644
--- a/drivers/pci/pcie/cxl_aer.c
+++ b/drivers/pci/pcie/cxl_aer.c
@@ -18,7 +18,7 @@
  * Note: AER must be enabled and supported by the device which must be
  * checked in advance, e.g. with pcie_aer_is_native().
  */
-static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
+void pci_aer_unmask_internal_errors(struct pci_dev *dev)
 {
 	int aer = dev->aer_cap;
 	u32 mask;
@@ -31,6 +31,7 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
 	mask &= ~PCI_ERR_COR_INTERNAL;
 	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
 }
+EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, "CXL");
 
 static bool is_cxl_mem_dev(struct pci_dev *dev)
 {
diff --git a/include/linux/aer.h b/include/linux/aer.h
index f14db635ef90..8fb1eca97c37 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -113,5 +113,6 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 int cper_severity_to_aer(int cper_severity);
 void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
 		       int severity, struct aer_capability_regs *aer_regs);
+void pci_aer_unmask_internal_errors(struct pci_dev *dev);
 #endif //_AER_H_
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH v10 17/17] CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup
  2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (15 preceding siblings ...)
  2025-06-26 22:42 ` [PATCH v10 16/17] CXL/PCI: Enable CXL protocol errors during CXL Port probe Terry Bowman
@ 2025-06-26 22:42 ` Terry Bowman
  2025-07-23 21:55 ` [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging dan.j.williams
  2025-08-18 15:18 ` Joshua Hahn
  18 siblings, 0 replies; 82+ messages in thread
From: Terry Bowman @ 2025-06-26 22:42 UTC (permalink / raw)
  To: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

During CXL device cleanup the CXL PCIe Port device interrupts remain
enabled. This potentially allows unnecessary interrupt processing on
behalf of the CXL errors while the device is destroyed.

Disable CXL protocol errors by setting the CXL devices' AER mask register.

Introduce pci_aer_mask_internal_errors() similar to pci_aer_unmask_internal_errors().

Introduce cxl_mask_proto_interrupts() to call pci_aer_mask_internal_errors().
Add calls to cxl_mask_proto_interrupts() within CXL Port teardown for CXL
Root Ports, CXL Downstream Switch Ports, CXL Upstream Switch Ports, and CXL
Endpoints. Follow the same "bottom-up" approach used during CXL Port
teardown.

Implement cxl_mask_proto_interrupts() in a header file to avoid introducing
Kconfig ifdefs in cxl/core/port.c.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
---
 drivers/cxl/core/native_ras.c |  9 +++++++++
 drivers/cxl/core/port.c       |  6 ++++++
 drivers/cxl/cxl.h             |  3 +++
 drivers/pci/pcie/cxl_aer.c    | 21 +++++++++++++++++++++
 include/linux/aer.h           |  1 +
 5 files changed, 40 insertions(+)

diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
index c7f9a237a1a2..8b8c7b3fba6c 100644
--- a/drivers/cxl/core/native_ras.c
+++ b/drivers/cxl/core/native_ras.c
@@ -9,6 +9,15 @@
 #include <cxlpci.h>
 #include <core/core.h>
 
+void cxl_mask_proto_interrupts(struct device *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+
+	guard(device)(dev);
+
+	pci_aer_mask_internal_errors(pdev);
+}
+
 int match_uport(struct device *dev, const void *data)
 {
 	const struct device *uport_dev = data;
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index c76a1289e24e..786a036d33cc 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1433,6 +1433,9 @@ EXPORT_SYMBOL_NS_GPL(cxl_endpoint_autoremove, "CXL");
  */
 static void delete_switch_port(struct cxl_port *port)
 {
+	cxl_mask_proto_interrupts(port->uport_dev);
+	cxl_mask_proto_interrupts(port->parent_dport->dport_dev);
+
 	devm_release_action(port->dev.parent, cxl_unlink_parent_dport, port);
 	devm_release_action(port->dev.parent, cxl_unlink_uport, port);
 	devm_release_action(port->dev.parent, unregister_port, port);
@@ -1446,6 +1449,7 @@ static void reap_dports(struct cxl_port *port)
 	device_lock_assert(&port->dev);
 
 	xa_for_each(&port->dports, index, dport) {
+		cxl_mask_proto_interrupts(dport->dport_dev);
 		devm_release_action(&port->dev, cxl_dport_unlink, dport);
 		devm_release_action(&port->dev, cxl_dport_remove, dport);
 		devm_kfree(&port->dev, dport);
@@ -1476,6 +1480,8 @@ static void cxl_detach_ep(void *data)
 {
 	struct cxl_memdev *cxlmd = data;
 
+	cxl_mask_proto_interrupts(cxlmd->cxlds->dev);
+
 	for (int i = cxlmd->depth - 1; i >= 1; i--) {
 		struct cxl_port *port, *parent_port;
 		struct detach_ctx ctx = {
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 19eb33a7de6a..95843428be92 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -12,6 +12,7 @@
 #include <linux/node.h>
 #include <linux/io.h>
 #include <linux/pci.h>
+#include <linux/aer.h>
 
 extern const struct nvdimm_security_ops *cxl_security_ops;
 
@@ -771,9 +772,11 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
 #ifdef CONFIG_PCIEAER_CXL
 void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport);
 void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host);
+void cxl_mask_proto_interrupts(struct device *dev);
 #else
 static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport,
 						struct device *host) { }
+static inline void cxl_mask_proto_interrupts(struct device *dev) { }
 #endif
 
 struct cxl_decoder *to_cxl_decoder(struct device *dev);
diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
index 3c5bf162607c..bd6660eab87b 100644
--- a/drivers/pci/pcie/cxl_aer.c
+++ b/drivers/pci/pcie/cxl_aer.c
@@ -33,6 +33,27 @@ void pci_aer_unmask_internal_errors(struct pci_dev *dev)
 }
 EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, "CXL");
 
+/**
+ * pci_aer_mask_internal_errors - mask internal errors
+ * @dev: pointer to the pcie_dev data structure
+ *
+ * Masks internal errors in the Uncorrectable and Correctable Error
+ * Mask registers.
+ *
+ * Note: AER must be enabled and supported by the device which must be
+ * checked in advance, e.g. with pcie_aer_is_native().
+ */
+void pci_aer_mask_internal_errors(struct pci_dev *dev)
+{
+	int aer = dev->aer_cap;
+
+	pci_clear_and_set_config_dword(dev, aer + PCI_ERR_UNCOR_MASK,
+				       0, PCI_ERR_UNC_INTN);
+	pci_clear_and_set_config_dword(dev, aer + PCI_ERR_COR_MASK,
+				       0, PCI_ERR_COR_INTERNAL);
+}
+EXPORT_SYMBOL_NS_GPL(pci_aer_mask_internal_errors, "CXL");
+
 static bool is_cxl_mem_dev(struct pci_dev *dev)
 {
 	/*
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 8fb1eca97c37..c0fd627328ae 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -114,5 +114,6 @@ int cper_severity_to_aer(int cper_severity);
 void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
 		       int severity, struct aer_capability_regs *aer_regs);
 void pci_aer_unmask_internal_errors(struct pci_dev *dev);
+void pci_aer_mask_internal_errors(struct pci_dev *dev);
 #endif //_AER_H_
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 03/17] PCI/AER: Report CXL or PCIe bus error type in trace logging
  2025-06-26 22:42 ` [PATCH v10 03/17] PCI/AER: Report CXL or PCIe bus error type in trace logging Terry Bowman
@ 2025-06-26 23:25   ` Sathyanarayanan Kuppuswamy
  2025-06-27  9:53   ` Jonathan Cameron
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 82+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-06-26 23:25 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham, linux-cxl
  Cc: linux-kernel, linux-pci


On 6/26/25 3:42 PM, Terry Bowman wrote:
> The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
> for all errors. Update the driver and aer_event tracing to log 'CXL Bus
> Type' for CXL device errors.
>
> This requires the AER can identify and distinguish between PCIe errors and
> CXL errors.
>
> Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
> aer_get_device_error_info() and pci_print_aer().
>
> Update the aer_event trace routine to accept a bus type string parameter.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> ---

Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

>   drivers/pci/pci.h       |  6 ++++++
>   drivers/pci/pcie/aer.c  | 21 +++++++++++++++------
>   include/ras/ras_event.h |  9 ++++++---
>   3 files changed, 27 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 12215ee72afb..a0d1e59b5666 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -608,6 +608,7 @@ struct aer_err_info {
>   	int ratelimit_print[AER_MAX_MULTI_ERR_DEVICES];
>   	int error_dev_num;
>   	const char *level;		/* printk level */
> +	bool is_cxl;
>   
>   	unsigned int id:16;
>   
> @@ -628,6 +629,11 @@ struct aer_err_info {
>   int aer_get_device_error_info(struct aer_err_info *info, int i);
>   void aer_print_error(struct aer_err_info *info, int i);
>   
> +static inline const char *aer_err_bus(struct aer_err_info *info)
> +{
> +	return info->is_cxl ? "CXL" : "PCIe";
> +}
> +
>   int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
>   		      unsigned int tlp_len, bool flit,
>   		      struct pcie_tlp_log *log);
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 70ac66188367..a2df9456595a 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -837,6 +837,7 @@ void aer_print_error(struct aer_err_info *info, int i)
>   	struct pci_dev *dev;
>   	int layer, agent, id;
>   	const char *level = info->level;
> +	const char *bus_type = aer_err_bus(info);
>   
>   	if (WARN_ON_ONCE(i >= AER_MAX_MULTI_ERR_DEVICES))
>   		return;
> @@ -845,23 +846,23 @@ void aer_print_error(struct aer_err_info *info, int i)
>   	id = pci_dev_id(dev);
>   
>   	pci_dev_aer_stats_incr(dev, info);
> -	trace_aer_event(pci_name(dev), (info->status & ~info->mask),
> +	trace_aer_event(pci_name(dev), bus_type, (info->status & ~info->mask),
>   			info->severity, info->tlp_header_valid, &info->tlp);
>   
>   	if (!info->ratelimit_print[i])
>   		return;
>   
>   	if (!info->status) {
> -		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> -			aer_error_severity_string[info->severity]);
> +		pci_err(dev, "%s Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> +			bus_type, aer_error_severity_string[info->severity]);
>   		goto out;
>   	}
>   
>   	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
>   	agent = AER_GET_AGENT(info->severity, info->status);
>   
> -	aer_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> -		   aer_error_severity_string[info->severity],
> +	aer_printk(level, dev, "%s Bus Error: severity=%s, type=%s, (%s)\n",
> +		   bus_type, aer_error_severity_string[info->severity],
>   		   aer_error_layer[layer], aer_agent_string[agent]);
>   
>   	aer_printk(level, dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
> @@ -895,6 +896,7 @@ EXPORT_SYMBOL_GPL(cper_severity_to_aer);
>   void pci_print_aer(struct pci_dev *dev, int aer_severity,
>   		   struct aer_capability_regs *aer)
>   {
> +	const char *bus_type;
>   	int layer, agent, tlp_header_valid = 0;
>   	u32 status, mask;
>   	struct aer_err_info info = {
> @@ -915,9 +917,12 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>   
>   	info.status = status;
>   	info.mask = mask;
> +	info.is_cxl = pcie_is_cxl(dev);
> +
> +	bus_type = aer_err_bus(&info);
>   
>   	pci_dev_aer_stats_incr(dev, &info);
> -	trace_aer_event(pci_name(dev), (status & ~mask),
> +	trace_aer_event(pci_name(dev), bus_type, (status & ~mask),
>   			aer_severity, tlp_header_valid, &aer->header_log);
>   
>   	if (!aer_ratelimit(dev, info.severity))
> @@ -939,6 +944,9 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>   	if (tlp_header_valid)
>   		pcie_print_tlp_log(dev, &aer->header_log, info.level,
>   				   dev_fmt("  "));
> +
> +	trace_aer_event(dev_name(&dev->dev), bus_type, (status & ~mask),
> +			aer_severity, tlp_header_valid, &aer->header_log);
>   }
>   EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
>   
> @@ -1371,6 +1379,7 @@ int aer_get_device_error_info(struct aer_err_info *info, int i)
>   	/* Must reset in this function */
>   	info->status = 0;
>   	info->tlp_header_valid = 0;
> +	info->is_cxl = pcie_is_cxl(dev);
>   
>   	/* The device might not support AER */
>   	if (!aer)
> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
> index 14c9f943d53f..080829d59c36 100644
> --- a/include/ras/ras_event.h
> +++ b/include/ras/ras_event.h
> @@ -297,15 +297,17 @@ TRACE_EVENT(non_standard_event,
>   
>   TRACE_EVENT(aer_event,
>   	TP_PROTO(const char *dev_name,
> +		 const char *bus_type,
>   		 const u32 status,
>   		 const u8 severity,
>   		 const u8 tlp_header_valid,
>   		 struct pcie_tlp_log *tlp),
>   
> -	TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp),
> +	TP_ARGS(dev_name, bus_type, status, severity, tlp_header_valid, tlp),
>   
>   	TP_STRUCT__entry(
>   		__string(	dev_name,	dev_name	)
> +		__string(	bus_type,	bus_type	)
>   		__field(	u32,		status		)
>   		__field(	u8,		severity	)
>   		__field(	u8, 		tlp_header_valid)
> @@ -314,6 +316,7 @@ TRACE_EVENT(aer_event,
>   
>   	TP_fast_assign(
>   		__assign_str(dev_name);
> +		__assign_str(bus_type);
>   		__entry->status		= status;
>   		__entry->severity	= severity;
>   		__entry->tlp_header_valid = tlp_header_valid;
> @@ -325,8 +328,8 @@ TRACE_EVENT(aer_event,
>   		}
>   	),
>   
> -	TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s\n",
> -		__get_str(dev_name),
> +	TP_printk("%s %s Bus Error: severity=%s, %s, TLP Header=%s\n",
> +		__get_str(dev_name), __get_str(bus_type),
>   		__entry->severity == AER_CORRECTABLE ? "Corrected" :
>   			__entry->severity == AER_FATAL ?
>   			"Fatal" : "Uncorrected, non-fatal",

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 04/17] CXL/AER: Introduce CXL specific AER driver file
  2025-06-26 22:42 ` [PATCH v10 04/17] CXL/AER: Introduce CXL specific AER driver file Terry Bowman
@ 2025-06-26 23:42   ` Sathyanarayanan Kuppuswamy
  2025-06-27 10:12     ` Jonathan Cameron
  2025-06-27 14:29     ` Bowman, Terry
  2025-07-24  0:01   ` dan.j.williams
  2025-07-24  1:16   ` dan.j.williams
  2 siblings, 2 replies; 82+ messages in thread
From: Sathyanarayanan Kuppuswamy @ 2025-06-26 23:42 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham, linux-cxl
  Cc: linux-kernel, linux-pci


On 6/26/25 3:42 PM, Terry Bowman wrote:
> The CXL AER error handling logic currently resides in the AER driver file,
> drivers/pci/pcie/aer.c. CXL specific changes are conditionally compiled
> using #ifdefs.
>
> Improve the AER driver maintainability by separating the CXL specific logic
> from the AER driver's core functionality and removing the #ifdefs.
> Introduce drivers/pci/pcie/cxl_aer.c and move the CXL AER logic into the
> new file.
>
> Update the makefile to conditionally compile the CXL file using the
> existing CONFIG_PCIEAER_CXL Kconfig.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---

After moving the code , you seem to have updated it with your own
changes. May be you split it into two patches.

>   drivers/pci/pci.h          |   8 +++
>   drivers/pci/pcie/Makefile  |   1 +
>   drivers/pci/pcie/aer.c     | 138 -------------------------------------
>   drivers/pci/pcie/cxl_aer.c | 138 +++++++++++++++++++++++++++++++++++++
>   include/linux/pci_ids.h    |   2 +
>   5 files changed, 149 insertions(+), 138 deletions(-)
>   create mode 100644 drivers/pci/pcie/cxl_aer.c
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index a0d1e59b5666..91b583cf18eb 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -1029,6 +1029,14 @@ static inline void pci_save_aer_state(struct pci_dev *dev) { }
>   static inline void pci_restore_aer_state(struct pci_dev *dev) { }
>   #endif
>   
> +#ifdef CONFIG_PCIEAER_CXL
> +void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info);
> +void cxl_rch_enable_rcec(struct pci_dev *rcec);
> +#else
> +static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { }
> +static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
> +#endif
> +
>   #ifdef CONFIG_ACPI
>   bool pci_acpi_preserve_config(struct pci_host_bridge *bridge);
>   int pci_acpi_program_hp_params(struct pci_dev *dev);
> diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
> index 173829aa02e6..cd2cb925dbd5 100644
> --- a/drivers/pci/pcie/Makefile
> +++ b/drivers/pci/pcie/Makefile
> @@ -8,6 +8,7 @@ obj-$(CONFIG_PCIEPORTBUS)	+= pcieportdrv.o bwctrl.o
>   
>   obj-y				+= aspm.o
>   obj-$(CONFIG_PCIEAER)		+= aer.o err.o tlp.o
> +obj-$(CONFIG_PCIEAER_CXL)	+= cxl_aer.o
>   obj-$(CONFIG_PCIEAER_INJECT)	+= aer_inject.o
>   obj-$(CONFIG_PCIE_PME)		+= pme.o
>   obj-$(CONFIG_PCIE_DPC)		+= dpc.o
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index a2df9456595a..0b4d721980ef 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1094,144 +1094,6 @@ static bool find_source_device(struct pci_dev *parent,
>   	return true;
>   }
>   
> -#ifdef CONFIG_PCIEAER_CXL
> -
> -/**
> - * pci_aer_unmask_internal_errors - unmask internal errors
> - * @dev: pointer to the pci_dev data structure
> - *
> - * Unmask internal errors in the Uncorrectable and Correctable Error
> - * Mask registers.
> - *
> - * Note: AER must be enabled and supported by the device which must be
> - * checked in advance, e.g. with pcie_aer_is_native().
> - */
> -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> -{
> -	int aer = dev->aer_cap;
> -	u32 mask;
> -
> -	pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &mask);
> -	mask &= ~PCI_ERR_UNC_INTN;
> -	pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, mask);
> -
> -	pci_read_config_dword(dev, aer + PCI_ERR_COR_MASK, &mask);
> -	mask &= ~PCI_ERR_COR_INTERNAL;
> -	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
> -}
> -
> -static bool is_cxl_mem_dev(struct pci_dev *dev)
> -{
> -	/*
> -	 * The capability, status, and control fields in Device 0,
> -	 * Function 0 DVSEC control the CXL functionality of the
> -	 * entire device (CXL 3.0, 8.1.3).
> -	 */
> -	if (dev->devfn != PCI_DEVFN(0, 0))
> -		return false;
> -
> -	/*
> -	 * CXL Memory Devices must have the 502h class code set (CXL
> -	 * 3.0, 8.1.12.1).
> -	 */
> -	if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL)
> -		return false;
> -
> -	return true;
> -}
> -
> -static bool cxl_error_is_native(struct pci_dev *dev)
> -{
> -	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> -
> -	return (pcie_ports_native || host->native_aer);
> -}
> -
> -static bool is_internal_error(struct aer_err_info *info)
> -{
> -	if (info->severity == AER_CORRECTABLE)
> -		return info->status & PCI_ERR_COR_INTERNAL;
> -
> -	return info->status & PCI_ERR_UNC_INTN;
> -}
> -
> -static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
> -{
> -	struct aer_err_info *info = (struct aer_err_info *)data;
> -	const struct pci_error_handlers *err_handler;
> -
> -	if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev))
> -		return 0;
> -
> -	/* Protect dev->driver */
> -	device_lock(&dev->dev);
> -
> -	err_handler = dev->driver ? dev->driver->err_handler : NULL;
> -	if (!err_handler)
> -		goto out;
> -
> -	if (info->severity == AER_CORRECTABLE) {
> -		if (err_handler->cor_error_detected)
> -			err_handler->cor_error_detected(dev);
> -	} else if (err_handler->error_detected) {
> -		if (info->severity == AER_NONFATAL)
> -			err_handler->error_detected(dev, pci_channel_io_normal);
> -		else if (info->severity == AER_FATAL)
> -			err_handler->error_detected(dev, pci_channel_io_frozen);
> -	}
> -out:
> -	device_unlock(&dev->dev);
> -	return 0;
> -}
> -
> -static void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
> -{
> -	/*
> -	 * Internal errors of an RCEC indicate an AER error in an
> -	 * RCH's downstream port. Check and handle them in the CXL.mem
> -	 * device driver.
> -	 */
> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> -	    is_internal_error(info))
> -		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
> -}
> -
> -static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
> -{
> -	bool *handles_cxl = data;
> -
> -	if (!*handles_cxl)
> -		*handles_cxl = is_cxl_mem_dev(dev) && cxl_error_is_native(dev);
> -
> -	/* Non-zero terminates iteration */
> -	return *handles_cxl;
> -}
> -
> -static bool handles_cxl_errors(struct pci_dev *rcec)
> -{
> -	bool handles_cxl = false;
> -
> -	if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
> -	    pcie_aer_is_native(rcec))
> -		pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
> -
> -	return handles_cxl;
> -}
> -
> -static void cxl_rch_enable_rcec(struct pci_dev *rcec)
> -{
> -	if (!handles_cxl_errors(rcec))
> -		return;
> -
> -	pci_aer_unmask_internal_errors(rcec);
> -	pci_info(rcec, "CXL: Internal errors unmasked");
> -}
> -
> -#else
> -static inline void cxl_rch_enable_rcec(struct pci_dev *dev) { }
> -static inline void cxl_rch_handle_error(struct pci_dev *dev,
> -					struct aer_err_info *info) { }
> -#endif
>   
>   /**
>    * pci_aer_handle_error - handle logging error into an event log
> diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
> new file mode 100644
> index 000000000000..b2ea14f70055
> --- /dev/null
> +++ b/drivers/pci/pcie/cxl_aer.c
> @@ -0,0 +1,138 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
> +
> +#include <linux/pci.h>
> +#include <linux/aer.h>
> +#include "../pci.h"
> +
> +/**
> + * pci_aer_unmask_internal_errors - unmask internal errors
> + * @dev: pointer to the pci_dev data structure
> + *
> + * Unmask internal errors in the Uncorrectable and Correctable Error
> + * Mask registers.
> + *
> + * Note: AER must be enabled and supported by the device which must be
> + * checked in advance, e.g. with pcie_aer_is_native().
> + */
> +static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> +{
> +	int aer = dev->aer_cap;
> +	u32 mask;
> +
> +	pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &mask);
> +	mask &= ~PCI_ERR_UNC_INTN;
> +	pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, mask);
> +
> +	pci_read_config_dword(dev, aer + PCI_ERR_COR_MASK, &mask);
> +	mask &= ~PCI_ERR_COR_INTERNAL;
> +	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
> +}
> +
> +static bool is_cxl_mem_dev(struct pci_dev *dev)
> +{
> +	/*
> +	 * The capability, status, and control fields in Device 0,
> +	 * Function 0 DVSEC control the CXL functionality of the
> +	 * entire device (CXL 3.2, 8.1.3).
> +	 */
> +	if (dev->devfn != PCI_DEVFN(0, 0))
> +		return false;
> +
> +	/*
> +	 * CXL Memory Devices must have the 502h class code set (CXL
> +	 * 3.2, 8.1.12.1).
> +	 */
> +	if (FIELD_GET(PCI_CLASS_CODE_MASK, dev->class) != PCI_CLASS_MEMORY_CXL)
> +		return false;
> +
> +	return true;
> +}
> +
> +static bool cxl_error_is_native(struct pci_dev *dev)
> +{
> +	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> +
> +	return (pcie_ports_native || host->native_aer);
> +}
> +
> +static bool is_internal_error(struct aer_err_info *info)
> +{
> +	if (info->severity == AER_CORRECTABLE)
> +		return info->status & PCI_ERR_COR_INTERNAL;
> +
> +	return info->status & PCI_ERR_UNC_INTN;
> +}
> +
> +static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
> +{
> +	struct aer_err_info *info = (struct aer_err_info *)data;
> +	const struct pci_error_handlers *err_handler;
> +
> +	if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev))
> +		return 0;
> +
> +	/* Protect dev->driver */
> +	device_lock(&dev->dev);
> +
> +	err_handler = dev->driver ? dev->driver->err_handler : NULL;
> +	if (!err_handler)
> +		goto out;
> +
> +	if (info->severity == AER_CORRECTABLE) {
> +		if (err_handler->cor_error_detected)
> +			err_handler->cor_error_detected(dev);
> +	} else if (err_handler->error_detected) {
> +		if (info->severity == AER_NONFATAL)
> +			err_handler->error_detected(dev, pci_channel_io_normal);
> +		else if (info->severity == AER_FATAL)
> +			err_handler->error_detected(dev, pci_channel_io_frozen);
> +	}
> +out:
> +	device_unlock(&dev->dev);
> +	return 0;
> +}
> +
> +void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
> +{
> +	/*
> +	 * Internal errors of an RCEC indicate an AER error in an
> +	 * RCH's downstream port. Check and handle them in the CXL.mem
> +	 * device driver.
> +	 */
> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> +	    is_internal_error(info))
> +		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
> +}
> +
> +static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
> +{
> +	bool *handles_cxl = data;
> +
> +	if (!*handles_cxl)
> +		*handles_cxl = is_cxl_mem_dev(dev) && cxl_error_is_native(dev);
> +
> +	/* Non-zero terminates iteration */
> +	return *handles_cxl;
> +}
> +
> +static bool handles_cxl_errors(struct pci_dev *rcec)
> +{
> +	bool handles_cxl = false;
> +
> +	if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
> +	    pcie_aer_is_native(rcec))
> +		pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
> +
> +	return handles_cxl;
> +}
> +
> +void cxl_rch_enable_rcec(struct pci_dev *rcec)
> +{
> +	if (!handles_cxl_errors(rcec))
> +		return;
> +
> +	pci_aer_unmask_internal_errors(rcec);
> +	pci_info(rcec, "CXL: Internal errors unmasked");
> +}
> +
> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
> index e2d71b6fdd84..31b3935bf189 100644
> --- a/include/linux/pci_ids.h
> +++ b/include/linux/pci_ids.h
> @@ -12,6 +12,8 @@
>   
>   /* Device classes and subclasses */
>   
> +#define PCI_CLASS_CODE_MASK             0xFFFF00
> +
>   #define PCI_CLASS_NOT_DEFINED		0x0000
>   #define PCI_CLASS_NOT_DEFINED_VGA	0x0001
>   

-- 
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 03/17] PCI/AER: Report CXL or PCIe bus error type in trace logging
  2025-06-26 22:42 ` [PATCH v10 03/17] PCI/AER: Report CXL or PCIe bus error type in trace logging Terry Bowman
  2025-06-26 23:25   ` Sathyanarayanan Kuppuswamy
@ 2025-06-27  9:53   ` Jonathan Cameron
  2025-07-02 16:00     ` Bowman, Terry
  2025-06-27 11:32   ` Shiju Jose
                     ` (2 subsequent siblings)
  4 siblings, 1 reply; 82+ messages in thread
From: Jonathan Cameron @ 2025-06-27  9:53 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, linux-pci

On Thu, 26 Jun 2025 17:42:38 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
> for all errors. Update the driver and aer_event tracing to log 'CXL Bus
> Type' for CXL device errors.
> 
> This requires the AER can identify and distinguish between PCIe errors and
> CXL errors.
> 
> Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
> aer_get_device_error_info() and pci_print_aer().
> 
> Update the aer_event trace routine to accept a bus type string parameter.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 04/17] CXL/AER: Introduce CXL specific AER driver file
  2025-06-26 23:42   ` Sathyanarayanan Kuppuswamy
@ 2025-06-27 10:12     ` Jonathan Cameron
  2025-06-27 14:29     ` Bowman, Terry
  1 sibling, 0 replies; 82+ messages in thread
From: Jonathan Cameron @ 2025-06-27 10:12 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy
  Cc: Terry Bowman, dave, dave.jiang, alison.schofield, dan.j.williams,
	bhelgaas, shiju.jose, ming.li, Smita.KoralahalliChannabasappa,
	rrichter, dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, linux-cxl, linux-kernel, linux-pci

On Thu, 26 Jun 2025 16:42:09 -0700
Sathyanarayanan Kuppuswamy <sathyanarayanan.kuppuswamy@linux.intel.com> wrote:

> On 6/26/25 3:42 PM, Terry Bowman wrote:
> > The CXL AER error handling logic currently resides in the AER driver file,
> > drivers/pci/pcie/aer.c. CXL specific changes are conditionally compiled
> > using #ifdefs.
> >
> > Improve the AER driver maintainability by separating the CXL specific logic
> > from the AER driver's core functionality and removing the #ifdefs.
> > Introduce drivers/pci/pcie/cxl_aer.c and move the CXL AER logic into the
> > new file.
> >
> > Update the makefile to conditionally compile the CXL file using the
> > existing CONFIG_PCIEAER_CXL Kconfig.
> >
> > Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> > ---  
> 
> After moving the code , you seem to have updated it with your own
> changes. May be you split it into two patches.

Agreed. I think the changes are small but a direct code move followed
by cleanup (I think it's the mask and the comment update only?)
would be better.

Assuming you do that, for both resulting patches:

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
  2025-06-26 22:42 ` [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors Terry Bowman
@ 2025-06-27 10:24   ` Jonathan Cameron
  2025-07-02 16:21     ` Bowman, Terry
  2025-07-01 21:53   ` Dave Jiang
  2025-07-24  2:01   ` dan.j.williams
  2 siblings, 1 reply; 82+ messages in thread
From: Jonathan Cameron @ 2025-06-27 10:24 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, linux-pci

On Thu, 26 Jun 2025 17:42:40 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> CXL error handling will soon be moved from the AER driver into the CXL
> driver. This requires a notification mechanism for the AER driver to share
> the AER interrupt with the CXL driver. The notification will be used
> as an indication for the CXL drivers to handle and log the CXL RAS errors.
> 
> First, introduce cxl/core/native_ras.c to contain changes for the CXL
> driver's RAS native handling. This as an alternative to dropping the
> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
> conditionally compile the new file.
> 
> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
> driver will be the sole kfifo producer adding work and the cxl_core will be
> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
> 
> Add CXL work queue handler registration functions in the AER driver. Export
> the functions allowing CXL driver to access. Implement registration
> functions for the CXL driver to assign or clear the work handler function.
> 
> Introduce 'struct cxl_proto_err_info' to serve as the kfifo work data. This
> will contain the erring device's PCI SBDF details used to rediscover the
> device after the CXL driver dequeues the kfifo work. The device rediscovery
> will be introduced along with the CXL handling in future patches.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Hi Terry,

Whilst it obviously makes patch preparation a bit more time consuming
for series like this with many patches it can be useful to add a brief
change log to the individual patches as well as the cover letter.
That helps reviewers figure out where they need to look again.

A few trivial things inline.

With those fixed up
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

Jonathan


> ---
>  drivers/cxl/Kconfig           | 14 ++++++++
>  drivers/cxl/core/Makefile     |  1 +
>  drivers/cxl/core/core.h       |  8 +++++
>  drivers/cxl/core/native_ras.c | 26 +++++++++++++++
>  drivers/cxl/core/port.c       |  2 ++
>  drivers/cxl/core/ras.c        |  1 +
>  drivers/cxl/cxlpci.h          |  1 +
>  drivers/pci/pci.h             |  4 +++
>  drivers/pci/pcie/aer.c        |  7 ++--
>  drivers/pci/pcie/cxl_aer.c    | 60 +++++++++++++++++++++++++++++++++++
>  include/linux/aer.h           | 31 ++++++++++++++++++
>  11 files changed, 153 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/cxl/core/native_ras.c


>  static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 54e219b0049e..6f1396ef7b77 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -4,6 +4,7 @@
>  #define __CXL_PCI_H__
>  #include <linux/pci.h>
>  #include "cxl.h"
> +#include "linux/aer.h"

Why?  There are no changes in this header other than the include and the changes
to linux/aer.h are new stuff so I can't see how it becomes necessary if it
wasn't before.

Might well have always been missing and should have been here. If so separate
patch to tidy that up.

>  
>  #define CXL_MEMORY_PROGIF	0x10
>  


> diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
> index b2ea14f70055..846ab55d747c 100644
> --- a/drivers/pci/pcie/cxl_aer.c
> +++ b/drivers/pci/pcie/cxl_aer.c

>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>  {
>  	struct aer_err_info *info = (struct aer_err_info *)data;
> @@ -136,3 +152,47 @@ void cxl_rch_enable_rcec(struct pci_dev *rcec)
>  	pci_info(rcec, "CXL: Internal errors unmasked");
>  }
>  
> +static DEFINE_KFIFO(cxl_proto_err_fifo, struct cxl_proto_err_work_data,
> +		    CXL_ERROR_SOURCES_MAX);
> +static DEFINE_SPINLOCK(cxl_proto_err_fifo_lock);
> +struct work_struct *cxl_proto_err_work;

I'm not seeing a declaration for this in the headers, so can it be static?

This is made a little more confusing as in this patch we have both
a structure called cxl_proto_err_work and a pointer to it with exactly the
same name.  Maybe rename this so it's subtly different.  cxl_protocol_err_work
or something silly like that just to make reviewers life a tiny bit easier!

> +

> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 02940be66324..24c3d9e18ad5 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -10,6 +10,7 @@
>  
>  #include <linux/errno.h>
>  #include <linux/types.h>
> +#include <linux/workqueue_types.h>
>  
>  #define AER_NONFATAL			0
>  #define AER_FATAL			1
> @@ -53,6 +54,26 @@ struct aer_capability_regs {
>  	u16 uncor_err_source;
>  };
>  
> +/**
> + * struct cxl_proto_err_info - Error information used in CXL error handling
> + * @severity: AER severity
> + * @function: Device's PCI function

Run kernel-doc over the files and fix errors / warning.
Missed updating this to devfn which it would have shouted about.

> + * @device: Device's PCI device
> + * @bus: Device's PCI bus
> + * @segment: Device's PCI segment
> + */
> +struct cxl_proto_error_info {
> +	int severity;
> +
> +	u8 devfn;
> +	u8 bus;
> +	u16 segment;
> +};


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 06/17] PCI/AER: Dequeue forwarded CXL error
  2025-06-26 22:42 ` [PATCH v10 06/17] PCI/AER: Dequeue forwarded CXL error Terry Bowman
@ 2025-06-27 11:00   ` Jonathan Cameron
  2025-07-02 17:51     ` Bowman, Terry
  2025-07-01 23:04   ` Dave Jiang
  2025-07-25  0:38   ` dan.j.williams
  2 siblings, 1 reply; 82+ messages in thread
From: Jonathan Cameron @ 2025-06-27 11:00 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, linux-pci

On Thu, 26 Jun 2025 17:42:41 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER driver is now designed to forward CXL protocol errors to the CXL
> driver. Update the CXL driver with functionality to dequeue the forwarded
> CXL error from the kfifo. Also, update the CXL driver to begin the protocol
> error handling processing using the work received from the FIFO.
> 
> Introduce function cxl_proto_err_work_fn() to dequeue work forwarded by the

After earlier update it already exists, you are just filling it in here.
So reword this.

> AER service driver. This will begin the CXL protocol error processing with
> a call to cxl_handle_proto_error().
> 
> Update cxl/core/native_ras.c by adding cxl_rch_handle_error_iter() that was
> previously in the AER driver. Add check that Endpoint is bound to a CXL
> driver.
> 
> Introduce logic to take the SBDF values from 'struct cxl_proto_error_info'
> and use in discovering the erring PCI device. The call to pci_get_domain_bus_and_slot()
> will return a reference counted 'struct pci_dev *'. This will serve as
> reference count to prevent releasing the CXL Endpoint's mapped RAS while
> handling the error. Use scope base __free() to put the reference count.
> This will change when adding support for CXL port devices in the future.
> 
> Implement cxl_handle_proto_error() to differentiate between Restricted CXL
> Host (RCH) protocol errors and CXL virtual host (VH) protocol errors. RCH
> errors will be processed with a call to walk the associated Root Complex
> Event Collector's (RCEC) secondary bus looking for the Root Complex
> Integrated Endpoint (RCiEP) to handle the RCH error. Export pcie_walk_rcec()
> so the CXL driver can walk the RCEC's downstream bus, searching for the
> RCiEP.

I'd drop the RCiEP description beyond saying 'handle it as before'
as I think there is no major change in this.

> 
> VH correctable error (CE) processing will call the CXL CE handler. VH
> uncorrectable errors (UCE) will call cxl_do_recovery(), implemented as a
> stub for now and to be updated in future patch. Export pci_aer_clean_fatal_status()
> and pci_clean_device_status() used to clean up AER status after handling.
> 
> Maintain the locking logic found in the original AER driver. Replace the
> existing device_lock() in cxl_rch_handle_error_iter() to use guard(device)
> lock for maintainability.

This change is fine, but it is an AND change in a patch doing quite a few other
things.  So do it in a trivial precursor patch.  Look at the other things in this
description and see if they can be factored out too so that the guts of this
patch are much easier to spot.


> CE errors did not include locking in previous driver
> implementation. Leave the updated CE handling path as-is.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
A few comments inline.

Jonathan

> ---
>  drivers/cxl/core/native_ras.c | 87 +++++++++++++++++++++++++++++++++++
>  drivers/cxl/cxlpci.h          |  1 +
>  drivers/cxl/pci.c             |  6 +++
>  drivers/pci/pci.c             |  1 +
>  drivers/pci/pci.h             |  7 ---
>  drivers/pci/pcie/aer.c        |  1 +
>  drivers/pci/pcie/cxl_aer.c    | 41 -----------------
>  drivers/pci/pcie/rcec.c       |  1 +
>  include/linux/aer.h           |  2 +
>  include/linux/pci.h           | 10 ++++
>  10 files changed, 109 insertions(+), 48 deletions(-)
> 
> diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
> index 011815ddaae3..5bd79d5019e7 100644
> --- a/drivers/cxl/core/native_ras.c
> +++ b/drivers/cxl/core/native_ras.c
> @@ -6,9 +6,96 @@
>  #include <cxl/event.h>
>  #include <cxlmem.h>
>  #include <core/core.h>
> +#include <cxlpci.h>
> +
> +static void cxl_do_recovery(struct pci_dev *pdev)
> +{
> +}
> +
> +static bool is_cxl_rcd(struct pci_dev *pdev)
> +{
> +	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END)
> +		return false;
> +
> +	/*
> +	 * The capability, status, and control fields in Device 0,
> +	 * Function 0 DVSEC control the CXL functionality of the
> +	 * entire device (CXL 3.2, 8.1.3).
> +	 */
> +	if (pdev->devfn != PCI_DEVFN(0, 0))
> +		return false;
> +
> +	/*
> +	 * CXL Memory Devices must have the 502h class code set (CXL

Short wrap.

> +	 * 3.2, 8.1.12.1).
> +	 */
> +	if (FIELD_GET(PCI_CLASS_CODE_MASK, pdev->class) != PCI_CLASS_MEMORY_CXL)
> +		return false;
> +
> +	return true;


If this isn't going to get more complex

	return FIELD_GET(...)

> +}
> +
> +static int cxl_rch_handle_error_iter(struct pci_dev *pdev, void *data)
> +{
> +	struct cxl_proto_error_info *err_info = data;
> +
> +	guard(device)(&pdev->dev);
> +
> +	if (!is_cxl_rcd(pdev) || !cxl_pci_drv_bound(pdev))
> +		return 0;
> +
> +	if (err_info->severity == AER_CORRECTABLE)
> +		cxl_cor_error_detected(pdev);
> +	else
> +		cxl_error_detected(pdev, pci_channel_io_frozen);
> +
> +	return 1;
> +}
> +
> +static void cxl_handle_proto_error(struct cxl_proto_error_info *err_info)
> +{
> +	struct pci_dev *pdev __free(pci_dev_put) =
> +		pci_get_domain_bus_and_slot(err_info->segment,
> +					    err_info->bus,
> +					    err_info->devfn);
> +
> +	if (!pdev) {
> +		pr_err("Failed to find the CXL device (SBDF=%x:%x:%x:%x)\n",
> +		       err_info->segment, err_info->bus, PCI_SLOT(err_info->devfn),
> +		       PCI_FUNC(err_info->devfn));
> +		return;
> +	}
> +
> +	/*
> +	 * Internal errors of an RCEC indicate an AER error in an
> +	 * RCH's downstream port. Check and handle them in the CXL.mem
> +	 * device driver.

I don't think the reference here to the CXL.mem driver is that helpful
given the code is immediate above. Maybe move the comment?


> +	 */
> +	if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_EC)
> +		return pcie_walk_rcec(pdev, cxl_rch_handle_error_iter, err_info);
> +
> +	if (err_info->severity == AER_CORRECTABLE) {
> +		int aer = pdev->aer_cap;
> +
> +		if (aer)
> +			pci_clear_and_set_config_dword(pdev,
> +						       aer + PCI_ERR_COR_STATUS,
> +						       0, PCI_ERR_COR_INTERNAL);
> +
> +		cxl_cor_error_detected(pdev);
> +
> +		pcie_clear_device_status(pdev);
> +	} else {
> +		cxl_do_recovery(pdev);
> +	}
> +}



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 07/17] CXL/PCI: Introduce CXL uncorrectable protocol error recovery
  2025-06-26 22:42 ` [PATCH v10 07/17] CXL/PCI: Introduce CXL uncorrectable protocol error recovery Terry Bowman
@ 2025-06-27 11:05   ` Jonathan Cameron
  2025-07-02 21:06     ` Bowman, Terry
  2025-06-27 12:27   ` Shiju Jose
  1 sibling, 1 reply; 82+ messages in thread
From: Jonathan Cameron @ 2025-06-27 11:05 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, linux-pci

On Thu, 26 Jun 2025 17:42:42 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> Create cxl_do_recovery() to provide uncorrectable protocol error (UCE)
> handling. Follow similar design as found in PCIe error driver,
> pcie_do_recovery(). One difference is cxl_do_recovery() will treat all UCEs
> as fatal with a kernel panic. This is to prevent corruption on CXL memory.
> 
> Export the PCI error driver's merge_result() to CXL namespace.

I think this may be a confusion from earlier review.  Anyhow, it should
be namespaced in the sense of not exporting something the vague name of
merge_result but it's PCI code, not CXL code and we don't have the dangerous
interface argument to justify putting it in the CXL namespace so I think
a namespaced EXPORT makes little sense for this one.

Jonathan


> Introduce
> PCI_ERS_RESULT_PANIC and add support in merge_result() routine. This will
> be used by CXL to panic the system in the case of uncorrectable protocol
> errors. PCI error handling is not currently expected to use the
> PCI_ERS_RESULT_PANIC.
> 
> Copy pci_walk_bridge() to cxl_walk_bridge(). Make a change to walk the
> first device in all cases.
> 
> Copy the PCI error driver's report_error_detected() to cxl_report_error_detected().
> Note, only CXL Endpoints and RCH Downstream Ports(RCH DSP) are currently
> supported. Add locking for PCI device as done in PCI's report_error_detected().
> This is necessary to prevent the RAS registers from disappearing before
> logging is completed.
> 
> Call panic() to halt the system in the case of uncorrectable errors (UCE)
> in cxl_do_recovery(). Export pci_aer_clear_fatal_status() for CXL to use
> if a UCE is not found. In this case the AER status must be cleared and
> uses pci_aer_clear_fatal_status().
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>


> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index de6381c690f5..63fceb3e8613 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -21,9 +21,12 @@
>  #include "portdrv.h"
>  #include "../pci.h"
>  
> -static pci_ers_result_t merge_result(enum pci_ers_result orig,
> -				  enum pci_ers_result new)
> +pci_ers_result_t merge_result(enum pci_ers_result orig,
> +			      enum pci_ers_result new)
>  {
> +	if (new == PCI_ERS_RESULT_PANIC)
> +		return PCI_ERS_RESULT_PANIC;
> +
>  	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
>  		return PCI_ERS_RESULT_NO_AER_DRIVER;
>  
> @@ -45,6 +48,7 @@ static pci_ers_result_t merge_result(enum pci_ers_result orig,
>  
>  	return orig;
>  }
> +EXPORT_SYMBOL_NS_GPL(merge_result, "CXL"); 

Do we care about namespacing this?  I think not given it is PCIe code
and hardly destructive for other drivers to mess with it if they like.

I would namespace it in the sense of renaming it to make it clear
it's about pci errors though.

pci_ers_merge_result() perhaps?

Do that as a percursor patch.


>  
>  static int report_error_detected(struct pci_dev *dev,
>  				 pci_channel_state_t state,


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 08/17] cxl/pci: Move RAS initialization to cxl_port driver
  2025-06-26 22:42 ` [PATCH v10 08/17] cxl/pci: Move RAS initialization to cxl_port driver Terry Bowman
@ 2025-06-27 11:12   ` Jonathan Cameron
  2025-07-18 18:01   ` Dave Jiang
  1 sibling, 0 replies; 82+ messages in thread
From: Jonathan Cameron @ 2025-06-27 11:12 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, linux-pci

On Thu, 26 Jun 2025 17:42:43 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> The cxl_port driver is intended to manage CXL Endpoint Ports and CXL Switch
> Ports. Move existing RAS initialization to the cxl_port driver.
> 
> Restricted CXL Host (RCH) Downstream Port RAS initialization currently
> resides in cxl/core/pci.c. The PCI source file is not otherwise associated
> with CXL Port management.
> 
> Additional CXL Port RAS initialization will be added in future patches to
> support a CXL Port device's CXL errors.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
One small thing inline.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
> index fe4b593331da..021f35145c65 100644
> --- a/drivers/cxl/port.c
> +++ b/drivers/cxl/port.c
> @@ -6,6 +6,7 @@

> +
> +static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
> +{
> +	void __iomem *aer_base = dport->regs.dport_aer;
> +	u32 aer_cmd_mask, aer_cmd;
> +
> +	if (!aer_base)
> +		return;
> +
> +	/*
> +	 * Disable RCH root port command interrupts.
> +	 * CXL 3.2 12.2.1.1 - RCH Downstream Port-detected Errors
Don't update spec versions in a code move patch.  That's a separate change
appropriate for doing in a separate patch.

For this we just want to see code moved with zero changes at all.
> +	 *
> +	 * This sequence may not be necessary. CXL spec states disabling
> +	 * the root cmd register's interrupts is required. But, PCI spec
> +	 * shows these are disabled by default on reset.
> +	 */
> +	aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN |
> +			PCI_ERR_ROOT_CMD_NONFATAL_EN |
> +			PCI_ERR_ROOT_CMD_FATAL_EN);
> +	aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND);
> +	aer_cmd &= ~aer_cmd_mask;
> +	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
> +}
> +


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 09/17] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
  2025-06-26 22:42 ` [PATCH v10 09/17] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers Terry Bowman
@ 2025-06-27 11:17   ` Jonathan Cameron
  2025-07-02 21:41     ` Bowman, Terry
  2025-07-18 21:28   ` Dave Jiang
  1 sibling, 1 reply; 82+ messages in thread
From: Jonathan Cameron @ 2025-06-27 11:17 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, linux-pci

On Thu, 26 Jun 2025 17:42:44 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> CXL Endpoint (EP) Ports may include Root Ports (RP) or Downstream Switch
> Ports (DSP). CXL RPs and DSPs contain RAS registers that require memory
> mapping to enable RAS logging. This initialization is currently missing and
> must be added for CXL RPs and DSPs.
> 
> Update cxl_dport_init_ras_reporting() to support RP and DSP RAS mapping.
> Add alongside the existing Restricted CXL Host Downstream Port RAS mapping.
> 
> Update cxl_endpoint_port_probe() to invoke cxl_dport_init_ras_reporting().
> This will initiate the RAS mapping for CXL RPs and DSPs when each CXL EP is
> created and added to the EP port.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

One trivial comment inline.  I'm not super confident that I follow exactly
what is going on here so more eyes needed.  However I think it's fine.

Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
> index 021f35145c65..b52f82925891 100644
> --- a/drivers/cxl/port.c
> +++ b/drivers/cxl/port.c

>  
> +static void cxl_switch_port_init_ras(struct cxl_port *port)
> +{
> +	if (is_cxl_root(to_cxl_port(port->dev.parent)))
> +		return;
> +
> +	/* May have upstream DSP or RP */
> +	if (port->parent_dport && dev_is_pci(port->parent_dport->dport_dev)) {

 A lot of port->parent_dport in here. Maybe a local variable for that with
a suitable name to describe that its the next port in the upstream direction.

> +		struct pci_dev *pdev = to_pci_dev(port->parent_dport->dport_dev);
> +
> +		if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
> +		    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM))
> +			cxl_dport_init_ras_reporting(port->parent_dport, &port->dev);
> +	}
> +
> +	cxl_uport_init_ras_reporting(port, &port->dev);
> +}
> +
> +static void cxl_endpoint_port_init_ras(struct cxl_port *port)
> +{
> +	struct cxl_dport *dport;
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
> +	struct cxl_port *parent_port __free(put_cxl_port) =
> +		cxl_mem_find_port(cxlmd, &dport);
> +
> +	if (!dport || !dev_is_pci(dport->dport_dev)) {
> +		dev_err(&port->dev, "CXL port topology not found\n");
> +		return;
> +	}
> +
> +	cxl_dport_init_ras_reporting(dport, cxlmd->cxlds->dev);
> +}
>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [PATCH v10 03/17] PCI/AER: Report CXL or PCIe bus error type in trace logging
  2025-06-26 22:42 ` [PATCH v10 03/17] PCI/AER: Report CXL or PCIe bus error type in trace logging Terry Bowman
  2025-06-26 23:25   ` Sathyanarayanan Kuppuswamy
  2025-06-27  9:53   ` Jonathan Cameron
@ 2025-06-27 11:32   ` Shiju Jose
  2025-07-01 21:27   ` Dave Jiang
  2025-07-23 22:56   ` dan.j.williams
  4 siblings, 0 replies; 82+ messages in thread
From: Shiju Jose @ 2025-06-27 11:32 UTC (permalink / raw)
  To: Terry Bowman, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, alison.schofield@intel.com,
	dan.j.williams@intel.com, bhelgaas@google.com,
	ming.li@zohomail.com, Smita.KoralahalliChannabasappa@amd.com,
	rrichter@amd.com, dan.carpenter@linaro.org,
	PradeepVineshReddy.Kodamati@amd.com, lukas@wunner.de,
	Benjamin.Cheatham@amd.com,
	sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-cxl@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org

>-----Original Message-----
>From: Terry Bowman <terry.bowman@amd.com>
>Sent: 26 June 2025 23:43
>To: dave@stgolabs.net; Jonathan Cameron <jonathan.cameron@huawei.com>;
>dave.jiang@intel.com; alison.schofield@intel.com; dan.j.williams@intel.com;
>bhelgaas@google.com; Shiju Jose <shiju.jose@huawei.com>;
>ming.li@zohomail.com; Smita.KoralahalliChannabasappa@amd.com;
>rrichter@amd.com; dan.carpenter@linaro.org;
>PradeepVineshReddy.Kodamati@amd.com; lukas@wunner.de;
>Benjamin.Cheatham@amd.com;
>sathyanarayanan.kuppuswamy@linux.intel.com; terry.bowman@amd.com;
>linux-cxl@vger.kernel.org
>Cc: linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org
>Subject: [PATCH v10 03/17] PCI/AER: Report CXL or PCIe bus error type in trace
>logging
>
>The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
>for all errors. Update the driver and aer_event tracing to log 'CXL Bus Type' for
>CXL device errors.
>
>This requires the AER can identify and distinguish between PCIe errors and CXL
>errors.
>
>Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
>aer_get_device_error_info() and pci_print_aer().
>
>Update the aer_event trace routine to accept a bus type string parameter.
>
>Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>Reviewed-by: Ira Weiny <ira.weiny@intel.com>
>---
> drivers/pci/pci.h       |  6 ++++++
> drivers/pci/pcie/aer.c  | 21 +++++++++++++++------  include/ras/ras_event.h |  9
>++++++---
> 3 files changed, 27 insertions(+), 9 deletions(-)
>
>diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h index
>12215ee72afb..a0d1e59b5666 100644
>--- a/drivers/pci/pci.h
>+++ b/drivers/pci/pci.h
>@@ -608,6 +608,7 @@ struct aer_err_info {
> 	int ratelimit_print[AER_MAX_MULTI_ERR_DEVICES];
> 	int error_dev_num;
> 	const char *level;		/* printk level */
>+	bool is_cxl;
>
> 	unsigned int id:16;
>
>@@ -628,6 +629,11 @@ struct aer_err_info {  int
>aer_get_device_error_info(struct aer_err_info *info, int i);  void
>aer_print_error(struct aer_err_info *info, int i);
>
>+static inline const char *aer_err_bus(struct aer_err_info *info) {
>+	return info->is_cxl ? "CXL" : "PCIe";
>+}
>+
> int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
> 		      unsigned int tlp_len, bool flit,
> 		      struct pcie_tlp_log *log);
>diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c index
>70ac66188367..a2df9456595a 100644
>--- a/drivers/pci/pcie/aer.c
>+++ b/drivers/pci/pcie/aer.c
>@@ -837,6 +837,7 @@ void aer_print_error(struct aer_err_info *info, int i)
> 	struct pci_dev *dev;
> 	int layer, agent, id;
> 	const char *level = info->level;
>+	const char *bus_type = aer_err_bus(info);
>
> 	if (WARN_ON_ONCE(i >= AER_MAX_MULTI_ERR_DEVICES))
> 		return;
>@@ -845,23 +846,23 @@ void aer_print_error(struct aer_err_info *info, int i)
> 	id = pci_dev_id(dev);
>
> 	pci_dev_aer_stats_incr(dev, info);
>-	trace_aer_event(pci_name(dev), (info->status & ~info->mask),
>+	trace_aer_event(pci_name(dev), bus_type, (info->status & ~info->mask),
> 			info->severity, info->tlp_header_valid, &info->tlp);
>
> 	if (!info->ratelimit_print[i])
> 		return;
>
> 	if (!info->status) {
>-		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible,
>(Unregistered Agent ID)\n",
>-			aer_error_severity_string[info->severity]);
>+		pci_err(dev, "%s Bus Error: severity=%s, type=Inaccessible,
>(Unregistered Agent ID)\n",
>+			bus_type, aer_error_severity_string[info->severity]);
> 		goto out;
> 	}
>
> 	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
> 	agent = AER_GET_AGENT(info->severity, info->status);
>
>-	aer_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
>-		   aer_error_severity_string[info->severity],
>+	aer_printk(level, dev, "%s Bus Error: severity=%s, type=%s, (%s)\n",
>+		   bus_type, aer_error_severity_string[info->severity],
> 		   aer_error_layer[layer], aer_agent_string[agent]);
>
> 	aer_printk(level, dev, "  device [%04x:%04x] error
>status/mask=%08x/%08x\n", @@ -895,6 +896,7 @@
>EXPORT_SYMBOL_GPL(cper_severity_to_aer);
> void pci_print_aer(struct pci_dev *dev, int aer_severity,
> 		   struct aer_capability_regs *aer)
> {
>+	const char *bus_type;
> 	int layer, agent, tlp_header_valid = 0;
> 	u32 status, mask;
> 	struct aer_err_info info = {
>@@ -915,9 +917,12 @@ void pci_print_aer(struct pci_dev *dev, int
>aer_severity,
>
> 	info.status = status;
> 	info.mask = mask;
>+	info.is_cxl = pcie_is_cxl(dev);
>+
>+	bus_type = aer_err_bus(&info);
>
> 	pci_dev_aer_stats_incr(dev, &info);
>-	trace_aer_event(pci_name(dev), (status & ~mask),
>+	trace_aer_event(pci_name(dev), bus_type, (status & ~mask),
> 			aer_severity, tlp_header_valid, &aer->header_log);
>
> 	if (!aer_ratelimit(dev, info.severity)) @@ -939,6 +944,9 @@ void
>pci_print_aer(struct pci_dev *dev, int aer_severity,
> 	if (tlp_header_valid)
> 		pcie_print_tlp_log(dev, &aer->header_log, info.level,
> 				   dev_fmt("  "));
>+
>+	trace_aer_event(dev_name(&dev->dev), bus_type, (status & ~mask),
>+			aer_severity, tlp_header_valid, &aer->header_log);
Hi Terry,

It looks like an extra trace_aer_event() is called here along with the above
trace_aer_event(pci_name(dev),...? 

Thanks,
Shiju
> }
> EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
>
>@@ -1371,6 +1379,7 @@ int aer_get_device_error_info(struct aer_err_info
>*info, int i)
> 	/* Must reset in this function */
> 	info->status = 0;
> 	info->tlp_header_valid = 0;
>+	info->is_cxl = pcie_is_cxl(dev);
>
> 	/* The device might not support AER */
> 	if (!aer)
>diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h index
>14c9f943d53f..080829d59c36 100644
>--- a/include/ras/ras_event.h
>+++ b/include/ras/ras_event.h
>@@ -297,15 +297,17 @@ TRACE_EVENT(non_standard_event,
>
> TRACE_EVENT(aer_event,
> 	TP_PROTO(const char *dev_name,
>+		 const char *bus_type,
> 		 const u32 status,
> 		 const u8 severity,
> 		 const u8 tlp_header_valid,
> 		 struct pcie_tlp_log *tlp),
>
>-	TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp),
>+	TP_ARGS(dev_name, bus_type, status, severity, tlp_header_valid, tlp),
>
> 	TP_STRUCT__entry(
> 		__string(	dev_name,	dev_name	)
>+		__string(	bus_type,	bus_type	)
> 		__field(	u32,		status		)
> 		__field(	u8,		severity	)
> 		__field(	u8, 		tlp_header_valid)
>@@ -314,6 +316,7 @@ TRACE_EVENT(aer_event,
>
> 	TP_fast_assign(
> 		__assign_str(dev_name);
>+		__assign_str(bus_type);
> 		__entry->status		= status;
> 		__entry->severity	= severity;
> 		__entry->tlp_header_valid = tlp_header_valid; @@ -325,8
>+328,8 @@ TRACE_EVENT(aer_event,
> 		}
> 	),
>
>-	TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s\n",
>-		__get_str(dev_name),
>+	TP_printk("%s %s Bus Error: severity=%s, %s, TLP Header=%s\n",
>+		__get_str(dev_name), __get_str(bus_type),
> 		__entry->severity == AER_CORRECTABLE ? "Corrected" :
> 			__entry->severity == AER_FATAL ?
> 			"Fatal" : "Uncorrected, non-fatal",
>--
>2.34.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 13/17] cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
  2025-06-26 22:42 ` [PATCH v10 13/17] cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors Terry Bowman
@ 2025-06-27 11:48   ` Jonathan Cameron
  2025-07-21 22:17   ` Dave Jiang
  1 sibling, 0 replies; 82+ messages in thread
From: Jonathan Cameron @ 2025-06-27 11:48 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, linux-pci

On Thu, 26 Jun 2025 17:42:48 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> Update cxl_handle_cor_ras() to exit early in the case there is no RAS
> errors detected after applying the status mask. This change will make
> the correctable handler's implementation consistent with the uncorrectable
> handler, cxl_handle_ras().
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

> ---
>  drivers/cxl/core/pci.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 156ce094a8b9..887b54cf3395 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -677,10 +677,11 @@ static void cxl_handle_cor_ras(struct device *dev, u64 serial,
>  
>  	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>  	status = readl(addr);
> -	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
> -		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> -		trace_cxl_aer_correctable_error(dev, serial, status);
> -	}
> +	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
> +		return;
> +	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> +
> +	trace_cxl_aer_correctable_error(dev, serial, status);
>  }
>  
>  /* CXL spec rev3.0 8.2.4.16.1 */


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 14/17] cxl/pci: Introduce CXL Endpoint protocol error handlers
  2025-06-26 22:42 ` [PATCH v10 14/17] cxl/pci: Introduce CXL Endpoint protocol error handlers Terry Bowman
@ 2025-06-27 11:52   ` Jonathan Cameron
  2025-06-27 12:27   ` Shiju Jose
  2025-07-21 22:35   ` Dave Jiang
  2 siblings, 0 replies; 82+ messages in thread
From: Jonathan Cameron @ 2025-06-27 11:52 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, linux-pci

On Thu, 26 Jun 2025 17:42:49 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> CXL Endpoint protocol errors are currently handled using PCI error
> handlers. The CXL Endpoint requires CXL specific handling in the case of
> uncorrectable error (UCE) handling not provided by the PCI handlers.
> 
> Add CXL specific handlers for CXL Endpoints. Rename the existing
> cxl_error_handlers to be pci_error_handlers to more correctly indicate
> the error type and follow naming consistency.
> 
> The PCI handlers will be called if the CXL device is not trained for
> alternate protocol (CXL). Update the CXL Endpoint PCI handlers to call the
> CXL UCE handlers.
> 
> The existing EP UCE handler includes checks for various results. These are
> no longer needed because CXL UCE recovery will not be attempted. Implement
> cxl_handle_ras() to return PCI_ERS_RESULT_NONE or PCI_ERS_RESULT_PANIC. The
> CXL UCE handler is called by cxl_do_recovery() that acts on the return
> value. In the case of the PCI handler path, call panic() if the result is
> PCI_ERS_RESULT_PANIC.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

A few minor comments inline.

J
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 887b54cf3395..7209ffb5c2fe 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c


>  
> -	scoped_guard(device, dev) {
> -		if (!dev->driver) {
> +pci_ers_result_t cxl_error_detected(struct device *dev)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> +	struct device *cxlmd_dev = &cxlds->cxlmd->dev;
> +	pci_ers_result_t ue;
> +
> +	scoped_guard(device, cxlmd_dev) {
I think there is nothing much happening after this (maybe introduced in later
patches in which case ignore this comment).

So can you just use a guard and reduce the indent of the rest?

> +
> +		if (!cxlmd_dev->driver) {
>  			dev_warn(&pdev->dev,
>  				 "%s: memdev disabled, abort error handling\n",
>  				 dev_name(dev));
> -			return PCI_ERS_RESULT_DISCONNECT;
> +			return PCI_ERS_RESULT_PANIC;
>  		}
>  
>  		if (cxlds->rcd)
> @@ -881,29 +888,23 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>  		ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial, cxlds->regs.ras);

little hard to tell from this code blob but can you return here?

>  	}
>  
> -
> -	switch (state) {
> -	case pci_channel_io_normal:
> -		if (ue) {
> -			device_release_driver(dev);
> -			return PCI_ERS_RESULT_NEED_RESET;
> -		}
> -		return PCI_ERS_RESULT_CAN_RECOVER;
> -	case pci_channel_io_frozen:
> -		dev_warn(&pdev->dev,
> -			 "%s: frozen state error detected, disable CXL.mem\n",
> -			 dev_name(dev));
> -		device_release_driver(dev);
> -		return PCI_ERS_RESULT_NEED_RESET;
> -	case pci_channel_io_perm_failure:
> -		dev_warn(&pdev->dev,
> -			 "failure state error detected, request disconnect\n");
> -		return PCI_ERS_RESULT_DISCONNECT;
> -	}
> -	return PCI_ERS_RESULT_NEED_RESET;
> +	return ue;
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [PATCH v10 12/17] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
  2025-06-26 22:42 ` [PATCH v10 12/17] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports Terry Bowman
@ 2025-06-27 12:22   ` Shiju Jose
  2025-07-02  1:18     ` Alison Schofield
  2025-07-02 21:56     ` Bowman, Terry
  0 siblings, 2 replies; 82+ messages in thread
From: Shiju Jose @ 2025-06-27 12:22 UTC (permalink / raw)
  To: Terry Bowman, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, alison.schofield@intel.com,
	dan.j.williams@intel.com, bhelgaas@google.com,
	ming.li@zohomail.com, Smita.KoralahalliChannabasappa@amd.com,
	rrichter@amd.com, dan.carpenter@linaro.org,
	PradeepVineshReddy.Kodamati@amd.com, lukas@wunner.de,
	Benjamin.Cheatham@amd.com,
	sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-cxl@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org

>-----Original Message-----
>From: Terry Bowman <terry.bowman@amd.com>
>Sent: 26 June 2025 23:43
>To: dave@stgolabs.net; Jonathan Cameron <jonathan.cameron@huawei.com>;
>dave.jiang@intel.com; alison.schofield@intel.com; dan.j.williams@intel.com;
>bhelgaas@google.com; Shiju Jose <shiju.jose@huawei.com>;
>ming.li@zohomail.com; Smita.KoralahalliChannabasappa@amd.com;
>rrichter@amd.com; dan.carpenter@linaro.org;
>PradeepVineshReddy.Kodamati@amd.com; lukas@wunner.de;
>Benjamin.Cheatham@amd.com;
>sathyanarayanan.kuppuswamy@linux.intel.com; terry.bowman@amd.com;
>linux-cxl@vger.kernel.org
>Cc: linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org
>Subject: [PATCH v10 12/17] cxl/pci: Unify CXL trace logging for CXL Endpoints
>and CXL Ports
>
>CXL currently has separate trace routines for CXL Port errors and CXL Endpoint
>errors. This is inconvenient for the user because they must enable
>2 sets of trace routines. Make updates to the trace logging such that a single
>trace routine logs both CXL Endpoint and CXL Port protocol errors.
>
>Keep the trace log fields 'memdev' and 'host'. While these are not accurate for
>non-Endpoints the fields will remain as-is to prevent breaking userspace RAS
>trace consumers.
>
>Add serial number parameter to the trace logging. This is used for EPs and 0 is
>provided for CXL port devices without a serial number.
>
>Below is output of correctable and uncorrectable protocol error logging.
>CXL Root Port and CXL Endpoint examples are included below.
>
>Root Port:
>cxl_aer_correctable_error: memdev=0000:0c:00.0 host=pci0000:0c serial: 0
>status='CRC Threshold Hit'
>cxl_aer_uncorrectable_error: memdev=0000:0c:00.0 host=pci0000:0c serial: 0
>status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity
>Error'
>
>Endpoint:
>cxl_aer_correctable_error: memdev=mem3 host=0000:0f:00.0 serial=0
>status='CRC Threshold Hit'
>cxl_aer_uncorrectable_error: memdev=mem3 host=0000:0f:00.0 serial: 0 status:
>'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
>
>Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>---
> drivers/cxl/core/pci.c   | 19 ++++-----
> drivers/cxl/core/ras.c   | 14 ++++---
> drivers/cxl/core/trace.h | 84 +++++++++-------------------------------
> 3 files changed, 37 insertions(+), 80 deletions(-)
>
[...]
>
> static void cxl_cper_handle_prot_err(struct cxl_cper_prot_err_work_data
>*data) diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h index
>25ebfbc1616c..494d6db461a7 100644
>--- a/drivers/cxl/core/trace.h
>+++ b/drivers/cxl/core/trace.h
>@@ -48,49 +48,22 @@
> 	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
> )
>
>-TRACE_EVENT(cxl_port_aer_uncorrectable_error,
>-	TP_PROTO(struct device *dev, u32 status, u32 fe, u32 *hl),
>-	TP_ARGS(dev, status, fe, hl),
>-	TP_STRUCT__entry(
>-		__string(device, dev_name(dev))
>-		__string(host, dev_name(dev->parent))
>-		__field(u32, status)
>-		__field(u32, first_error)
>-		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
>-	),
>-	TP_fast_assign(
>-		__assign_str(device);
>-		__assign_str(host);
>-		__entry->status = status;
>-		__entry->first_error = fe;
>-		/*
>-		 * Embed the 512B headerlog data for user app retrieval and
>-		 * parsing, but no need to print this in the trace buffer.
>-		 */
>-		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
>-	),
>-	TP_printk("device=%s host=%s status: '%s' first_error: '%s'",
>-		  __get_str(device), __get_str(host),
>-		  show_uc_errs(__entry->status),
>-		  show_uc_errs(__entry->first_error)
>-	)
>-);
>-
> TRACE_EVENT(cxl_aer_uncorrectable_error,
>-	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32
>*hl),
>-	TP_ARGS(cxlmd, status, fe, hl),
>+	TP_PROTO(struct device *dev, u64 serial, u32 status, u32 fe,
>+		 u32 *hl),
>+	TP_ARGS(dev, serial, status, fe, hl),
> 	TP_STRUCT__entry(
>-		__string(memdev, dev_name(&cxlmd->dev))
>-		__string(host, dev_name(cxlmd->dev.parent))
>+		__string(name, dev_name(dev))
>+		__string(parent, dev_name(dev->parent))

Hi Terry,

Thanks for considering the feedback given in v9 regarding the compatibility issue
with the rasdaemon.
https://lore.kernel.org/all/959acc682e6e4b52ac0283b37ee21026@huawei.com/

Probably some confusion w.r.t the feedback.
Unfortunately  TP_printk(...) is not an ABI that we need to keep stable, 
it's this structure, TP_STRUCT__entry(..) , that matters to the rasdaemon.

> 		__field(u64, serial)
> 		__field(u32, status)
> 		__field(u32, first_error)
> 		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
> 	),
> 	TP_fast_assign(
>-		__assign_str(memdev);
>-		__assign_str(host);
>-		__entry->serial = cxlmd->cxlds->serial;
>+		__assign_str(name);
>+		__assign_str(parent);
>+		__entry->serial = serial;
> 		__entry->status = status;
> 		__entry->first_error = fe;
> 		/*
>@@ -99,8 +72,8 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
> 		 */
> 		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
> 	),
>-	TP_printk("memdev=%s host=%s serial=%lld: status: '%s' first_error:
>'%s'",
>-		  __get_str(memdev), __get_str(host), __entry->serial,
>+	TP_printk("memdev=%s host=%s serial=%lld status='%s'
>first_error='%s'",
>+		  __get_str(name), __get_str(parent), __entry->serial,
> 		  show_uc_errs(__entry->status),
> 		  show_uc_errs(__entry->first_error)
> 	)
>@@ -124,42 +97,23 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
> 	{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer"
>}	\
> )
>
>-TRACE_EVENT(cxl_port_aer_correctable_error,
>-	TP_PROTO(struct device *dev, u32 status),
>-	TP_ARGS(dev, status),
>-	TP_STRUCT__entry(
>-		__string(device, dev_name(dev))
>-		__string(host, dev_name(dev->parent))
>-		__field(u32, status)
>-	),
>-	TP_fast_assign(
>-		__assign_str(device);
>-		__assign_str(host);
>-		__entry->status = status;
>-	),
>-	TP_printk("device=%s host=%s status='%s'",
>-		  __get_str(device), __get_str(host),
>-		  show_ce_errs(__entry->status)
>-	)
>-);
>-
> TRACE_EVENT(cxl_aer_correctable_error,
>-	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
>-	TP_ARGS(cxlmd, status),
>+	TP_PROTO(struct device *dev, u64 serial, u32 status),
>+	TP_ARGS(dev, serial, status),
> 	TP_STRUCT__entry(
>-		__string(memdev, dev_name(&cxlmd->dev))
>-		__string(host, dev_name(cxlmd->dev.parent))
>+		__string(name, dev_name(dev))
>+		__string(parent, dev_name(dev->parent))

Same as above.
> 		__field(u64, serial)
> 		__field(u32, status)
> 	),
> 	TP_fast_assign(
>-		__assign_str(memdev);
>-		__assign_str(host);
>-		__entry->serial = cxlmd->cxlds->serial;
>+		__assign_str(name);
>+		__assign_str(parent);
>+		__entry->serial = serial;
> 		__entry->status = status;
> 	),
>-	TP_printk("memdev=%s host=%s serial=%lld: status: '%s'",
>-		  __get_str(memdev), __get_str(host), __entry->serial,
>+	TP_printk("memdev=%s host=%s serial=%lld status='%s'",
>+		  __get_str(name), __get_str(parent), __entry->serial,
> 		  show_ce_errs(__entry->status)
> 	)
> );
>--
>2.34.1

Thanks,
Shiju

^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [PATCH v10 07/17] CXL/PCI: Introduce CXL uncorrectable protocol error recovery
  2025-06-26 22:42 ` [PATCH v10 07/17] CXL/PCI: Introduce CXL uncorrectable protocol error recovery Terry Bowman
  2025-06-27 11:05   ` Jonathan Cameron
@ 2025-06-27 12:27   ` Shiju Jose
  2025-07-02 21:34     ` Bowman, Terry
  1 sibling, 1 reply; 82+ messages in thread
From: Shiju Jose @ 2025-06-27 12:27 UTC (permalink / raw)
  To: Terry Bowman, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, alison.schofield@intel.com,
	dan.j.williams@intel.com, bhelgaas@google.com,
	ming.li@zohomail.com, Smita.KoralahalliChannabasappa@amd.com,
	rrichter@amd.com, dan.carpenter@linaro.org,
	PradeepVineshReddy.Kodamati@amd.com, lukas@wunner.de,
	Benjamin.Cheatham@amd.com,
	sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-cxl@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org

>-----Original Message-----
>From: Terry Bowman <terry.bowman@amd.com>
>Sent: 26 June 2025 23:43
>To: dave@stgolabs.net; Jonathan Cameron <jonathan.cameron@huawei.com>;
>dave.jiang@intel.com; alison.schofield@intel.com; dan.j.williams@intel.com;
>bhelgaas@google.com; Shiju Jose <shiju.jose@huawei.com>;
>ming.li@zohomail.com; Smita.KoralahalliChannabasappa@amd.com;
>rrichter@amd.com; dan.carpenter@linaro.org;
>PradeepVineshReddy.Kodamati@amd.com; lukas@wunner.de;
>Benjamin.Cheatham@amd.com;
>sathyanarayanan.kuppuswamy@linux.intel.com; terry.bowman@amd.com;
>linux-cxl@vger.kernel.org
>Cc: linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org
>Subject: [PATCH v10 07/17] CXL/PCI: Introduce CXL uncorrectable protocol error
>recovery
>
>Create cxl_do_recovery() to provide uncorrectable protocol error (UCE)
>handling. Follow similar design as found in PCIe error driver,
>pcie_do_recovery(). One difference is cxl_do_recovery() will treat all UCEs as
>fatal with a kernel panic. This is to prevent corruption on CXL memory.
>
>Export the PCI error driver's merge_result() to CXL namespace. Introduce
>PCI_ERS_RESULT_PANIC and add support in merge_result() routine. This will be
>used by CXL to panic the system in the case of uncorrectable protocol errors. PCI
>error handling is not currently expected to use the PCI_ERS_RESULT_PANIC.
>
>Copy pci_walk_bridge() to cxl_walk_bridge(). Make a change to walk the first
>device in all cases.
>
>Copy the PCI error driver's report_error_detected() to
>cxl_report_error_detected().
>Note, only CXL Endpoints and RCH Downstream Ports(RCH DSP) are currently
>supported. Add locking for PCI device as done in PCI's report_error_detected().
>This is necessary to prevent the RAS registers from disappearing before logging
>is completed.
>
>Call panic() to halt the system in the case of uncorrectable errors (UCE) in
>cxl_do_recovery(). Export pci_aer_clear_fatal_status() for CXL to use if a UCE is
>not found. In this case the AER status must be cleared and uses
>pci_aer_clear_fatal_status().
>
>Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>---
> drivers/cxl/core/native_ras.c | 44 +++++++++++++++++++++++++++++++++++
> drivers/pci/pcie/cxl_aer.c    |  3 ++-
> drivers/pci/pcie/err.c        |  8 +++++--
> include/linux/aer.h           | 11 +++++++++
> include/linux/pci.h           |  3 +++
> 5 files changed, 66 insertions(+), 3 deletions(-)
>
[...]
>
> void pci_print_aer(struct pci_dev *dev, int aer_severity, diff --git
>a/include/linux/pci.h b/include/linux/pci.h index 79326358f641..16a8310e0373
>100644
>--- a/include/linux/pci.h
>+++ b/include/linux/pci.h
>@@ -868,6 +868,9 @@ enum pci_ers_result {
>
> 	/* No AER capabilities registered for the driver */
> 	PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6,
>+
>+	/* System is unstable, panic. Is CXL specific  */
>+	PCI_ERS_RESULT_PANIC = (__force pci_ers_result_t) 7,
Extra space is present after casting?
> };
>
> /* PCI bus error event callbacks */
>--
>2.34.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* RE: [PATCH v10 14/17] cxl/pci: Introduce CXL Endpoint protocol error handlers
  2025-06-26 22:42 ` [PATCH v10 14/17] cxl/pci: Introduce CXL Endpoint protocol error handlers Terry Bowman
  2025-06-27 11:52   ` Jonathan Cameron
@ 2025-06-27 12:27   ` Shiju Jose
  2025-07-21 22:35   ` Dave Jiang
  2 siblings, 0 replies; 82+ messages in thread
From: Shiju Jose @ 2025-06-27 12:27 UTC (permalink / raw)
  To: Terry Bowman, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, alison.schofield@intel.com,
	dan.j.williams@intel.com, bhelgaas@google.com,
	ming.li@zohomail.com, Smita.KoralahalliChannabasappa@amd.com,
	rrichter@amd.com, dan.carpenter@linaro.org,
	PradeepVineshReddy.Kodamati@amd.com, lukas@wunner.de,
	Benjamin.Cheatham@amd.com,
	sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-cxl@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org

>-----Original Message-----
>From: Terry Bowman <terry.bowman@amd.com>
>Sent: 26 June 2025 23:43
>To: dave@stgolabs.net; Jonathan Cameron <jonathan.cameron@huawei.com>;
>dave.jiang@intel.com; alison.schofield@intel.com; dan.j.williams@intel.com;
>bhelgaas@google.com; Shiju Jose <shiju.jose@huawei.com>;
>ming.li@zohomail.com; Smita.KoralahalliChannabasappa@amd.com;
>rrichter@amd.com; dan.carpenter@linaro.org;
>PradeepVineshReddy.Kodamati@amd.com; lukas@wunner.de;
>Benjamin.Cheatham@amd.com;
>sathyanarayanan.kuppuswamy@linux.intel.com; terry.bowman@amd.com;
>linux-cxl@vger.kernel.org
>Cc: linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org
>Subject: [PATCH v10 14/17] cxl/pci: Introduce CXL Endpoint protocol error
>handlers
>
>CXL Endpoint protocol errors are currently handled using PCI error handlers. The
>CXL Endpoint requires CXL specific handling in the case of uncorrectable error
>(UCE) handling not provided by the PCI handlers.
>
>Add CXL specific handlers for CXL Endpoints. Rename the existing
>cxl_error_handlers to be pci_error_handlers to more correctly indicate the
>error type and follow naming consistency.
>
>The PCI handlers will be called if the CXL device is not trained for alternate
>protocol (CXL). Update the CXL Endpoint PCI handlers to call the CXL UCE
>handlers.
>
>The existing EP UCE handler includes checks for various results. These are no
>longer needed because CXL UCE recovery will not be attempted. Implement
>cxl_handle_ras() to return PCI_ERS_RESULT_NONE or PCI_ERS_RESULT_PANIC.
>The CXL UCE handler is called by cxl_do_recovery() that acts on the return
>value. In the case of the PCI handler path, call panic() if the result is
>PCI_ERS_RESULT_PANIC.
>
>Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>Reviewed-by: Kuppuswamy Sathyanarayanan
><sathyanarayanan.kuppuswamy@linux.intel.com>
>---
> drivers/cxl/core/native_ras.c | 15 ++++---
> drivers/cxl/core/pci.c        | 77 ++++++++++++++++++-----------------
> drivers/cxl/cxl.h             |  4 ++
> drivers/cxl/cxlpci.h          |  6 +--
> drivers/cxl/pci.c             |  8 ++--
> 5 files changed, 59 insertions(+), 51 deletions(-)
>
[...]
>diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c index
>887b54cf3395..7209ffb5c2fe 100644
>--- a/drivers/cxl/core/pci.c
>+++ b/drivers/cxl/core/pci.c
>@@ -705,8 +705,8 @@ static void header_log_copy(void __iomem *ras_base,
>u32 *log)
>  * Log the state of the RAS status registers and prepare them to log the
>  * next error status. Return 1 if reset needed.
>  */
>-static bool cxl_handle_ras(struct device *dev, u64 serial,
>-			   void __iomem *ras_base)
>+static pci_ers_result_t cxl_handle_ras(struct device *dev, u64 serial,
>+				       void __iomem *ras_base)
> {
> 	u32 hl[CXL_HEADERLOG_SIZE_U32];
> 	void __iomem *addr;
>@@ -715,13 +715,13 @@ static bool cxl_handle_ras(struct device *dev, u64
>serial,
>
> 	if (!ras_base) {
> 		dev_warn_once(dev, "CXL RAS register block is not mapped");
>-		return false;
>+		return PCI_ERS_RESULT_NONE;
> 	}
>
> 	addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
> 	status = readl(addr);
> 	if (!(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK))
>-		return false;
>+		return PCI_ERS_RESULT_NONE;
>
> 	/* If multiple errors, log header points to first error from ctrl reg */
> 	if (hweight32(status) > 1) {
>@@ -738,7 +738,7 @@ static bool cxl_handle_ras(struct device *dev, u64 serial,
> 	trace_cxl_aer_uncorrectable_error(dev, serial, status, fe, hl);
> 	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>
>-	return true;
>+	return PCI_ERS_RESULT_PANIC;
> }
>
> #ifdef CONFIG_PCIEAER_CXL
>@@ -833,13 +833,14 @@ static void cxl_handle_rdport_errors(struct
>cxl_dev_state *cxlds)  static void cxl_handle_rdport_errors(struct cxl_dev_state
>*cxlds) { }  #endif
>
>-void cxl_cor_error_detected(struct pci_dev *pdev)
>+void cxl_cor_error_detected(struct device *dev)
> {
>+	struct pci_dev *pdev = to_pci_dev(dev);
> 	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>-	struct device *dev = &cxlds->cxlmd->dev;
>+	struct device *cxlmd_dev = &cxlds->cxlmd->dev;
>
>-	scoped_guard(device, dev) {
>-		if (!dev->driver) {
>+	scoped_guard(device, cxlmd_dev) {
>+		if (!cxlmd_dev->driver) {
> 			dev_warn(&pdev->dev,
> 				 "%s: memdev disabled, abort error
>handling\n",
> 				 dev_name(dev));
>@@ -854,20 +855,26 @@ void cxl_cor_error_detected(struct pci_dev *pdev)  }
>EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
>
>-pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>-				    pci_channel_state_t state)
>+void pci_cor_error_detected(struct pci_dev *pdev)
> {
>-	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>-	struct cxl_memdev *cxlmd = cxlds->cxlmd;
>-	struct device *dev = &cxlmd->dev;
>-	bool ue;
>+	cxl_cor_error_detected(&pdev->dev);
>+}
>+EXPORT_SYMBOL_NS_GPL(pci_cor_error_detected, "CXL");
>
>-	scoped_guard(device, dev) {
>-		if (!dev->driver) {
>+pci_ers_result_t cxl_error_detected(struct device *dev) {
>+	struct pci_dev *pdev = to_pci_dev(dev);
>+	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
>+	struct device *cxlmd_dev = &cxlds->cxlmd->dev;
>+	pci_ers_result_t ue;
>+
>+	scoped_guard(device, cxlmd_dev) {
>+
Please remove the extra blank line.

>+		if (!cxlmd_dev->driver) {
> 			dev_warn(&pdev->dev,
> 				 "%s: memdev disabled, abort error
>handling\n",
> 				 dev_name(dev));

Thanks,
Shiju

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 04/17] CXL/AER: Introduce CXL specific AER driver file
  2025-06-26 23:42   ` Sathyanarayanan Kuppuswamy
  2025-06-27 10:12     ` Jonathan Cameron
@ 2025-06-27 14:29     ` Bowman, Terry
  1 sibling, 0 replies; 82+ messages in thread
From: Bowman, Terry @ 2025-06-27 14:29 UTC (permalink / raw)
  To: Sathyanarayanan Kuppuswamy, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham, linux-cxl
  Cc: linux-kernel, linux-pci



On 6/26/2025 6:42 PM, Sathyanarayanan Kuppuswamy wrote:
> On 6/26/25 3:42 PM, Terry Bowman wrote:
>> The CXL AER error handling logic currently resides in the AER driver file,
>> drivers/pci/pcie/aer.c. CXL specific changes are conditionally compiled
>> using #ifdefs.
>>
>> Improve the AER driver maintainability by separating the CXL specific logic
>> from the AER driver's core functionality and removing the #ifdefs.
>> Introduce drivers/pci/pcie/cxl_aer.c and move the CXL AER logic into the
>> new file.
>>
>> Update the makefile to conditionally compile the CXL file using the
>> existing CONFIG_PCIEAER_CXL Kconfig.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
> After moving the code , you seem to have updated it with your own
> changes. May be you split it into two patches.
Yes, I'll update to make a copy and add the changes in the following patch(es).

-Terry

>>   drivers/pci/pci.h          |   8 +++
>>   drivers/pci/pcie/Makefile  |   1 +
>>   drivers/pci/pcie/aer.c     | 138 -------------------------------------
>>   drivers/pci/pcie/cxl_aer.c | 138 +++++++++++++++++++++++++++++++++++++
>>   include/linux/pci_ids.h    |   2 +
>>   5 files changed, 149 insertions(+), 138 deletions(-)
>>   create mode 100644 drivers/pci/pcie/cxl_aer.c
>>
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index a0d1e59b5666..91b583cf18eb 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -1029,6 +1029,14 @@ static inline void pci_save_aer_state(struct pci_dev *dev) { }
>>   static inline void pci_restore_aer_state(struct pci_dev *dev) { }
>>   #endif
>>   
>> +#ifdef CONFIG_PCIEAER_CXL
>> +void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info);
>> +void cxl_rch_enable_rcec(struct pci_dev *rcec);
>> +#else
>> +static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { }
>> +static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
>> +#endif
>> +
>>   #ifdef CONFIG_ACPI
>>   bool pci_acpi_preserve_config(struct pci_host_bridge *bridge);
>>   int pci_acpi_program_hp_params(struct pci_dev *dev);
>> diff --git a/drivers/pci/pcie/Makefile b/drivers/pci/pcie/Makefile
>> index 173829aa02e6..cd2cb925dbd5 100644
>> --- a/drivers/pci/pcie/Makefile
>> +++ b/drivers/pci/pcie/Makefile
>> @@ -8,6 +8,7 @@ obj-$(CONFIG_PCIEPORTBUS)	+= pcieportdrv.o bwctrl.o
>>   
>>   obj-y				+= aspm.o
>>   obj-$(CONFIG_PCIEAER)		+= aer.o err.o tlp.o
>> +obj-$(CONFIG_PCIEAER_CXL)	+= cxl_aer.o
>>   obj-$(CONFIG_PCIEAER_INJECT)	+= aer_inject.o
>>   obj-$(CONFIG_PCIE_PME)		+= pme.o
>>   obj-$(CONFIG_PCIE_DPC)		+= dpc.o
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index a2df9456595a..0b4d721980ef 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -1094,144 +1094,6 @@ static bool find_source_device(struct pci_dev *parent,
>>   	return true;
>>   }
>>   
>> -#ifdef CONFIG_PCIEAER_CXL
>> -
>> -/**
>> - * pci_aer_unmask_internal_errors - unmask internal errors
>> - * @dev: pointer to the pci_dev data structure
>> - *
>> - * Unmask internal errors in the Uncorrectable and Correctable Error
>> - * Mask registers.
>> - *
>> - * Note: AER must be enabled and supported by the device which must be
>> - * checked in advance, e.g. with pcie_aer_is_native().
>> - */
>> -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>> -{
>> -	int aer = dev->aer_cap;
>> -	u32 mask;
>> -
>> -	pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &mask);
>> -	mask &= ~PCI_ERR_UNC_INTN;
>> -	pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, mask);
>> -
>> -	pci_read_config_dword(dev, aer + PCI_ERR_COR_MASK, &mask);
>> -	mask &= ~PCI_ERR_COR_INTERNAL;
>> -	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
>> -}
>> -
>> -static bool is_cxl_mem_dev(struct pci_dev *dev)
>> -{
>> -	/*
>> -	 * The capability, status, and control fields in Device 0,
>> -	 * Function 0 DVSEC control the CXL functionality of the
>> -	 * entire device (CXL 3.0, 8.1.3).
>> -	 */
>> -	if (dev->devfn != PCI_DEVFN(0, 0))
>> -		return false;
>> -
>> -	/*
>> -	 * CXL Memory Devices must have the 502h class code set (CXL
>> -	 * 3.0, 8.1.12.1).
>> -	 */
>> -	if ((dev->class >> 8) != PCI_CLASS_MEMORY_CXL)
>> -		return false;
>> -
>> -	return true;
>> -}
>> -
>> -static bool cxl_error_is_native(struct pci_dev *dev)
>> -{
>> -	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>> -
>> -	return (pcie_ports_native || host->native_aer);
>> -}
>> -
>> -static bool is_internal_error(struct aer_err_info *info)
>> -{
>> -	if (info->severity == AER_CORRECTABLE)
>> -		return info->status & PCI_ERR_COR_INTERNAL;
>> -
>> -	return info->status & PCI_ERR_UNC_INTN;
>> -}
>> -
>> -static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>> -{
>> -	struct aer_err_info *info = (struct aer_err_info *)data;
>> -	const struct pci_error_handlers *err_handler;
>> -
>> -	if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev))
>> -		return 0;
>> -
>> -	/* Protect dev->driver */
>> -	device_lock(&dev->dev);
>> -
>> -	err_handler = dev->driver ? dev->driver->err_handler : NULL;
>> -	if (!err_handler)
>> -		goto out;
>> -
>> -	if (info->severity == AER_CORRECTABLE) {
>> -		if (err_handler->cor_error_detected)
>> -			err_handler->cor_error_detected(dev);
>> -	} else if (err_handler->error_detected) {
>> -		if (info->severity == AER_NONFATAL)
>> -			err_handler->error_detected(dev, pci_channel_io_normal);
>> -		else if (info->severity == AER_FATAL)
>> -			err_handler->error_detected(dev, pci_channel_io_frozen);
>> -	}
>> -out:
>> -	device_unlock(&dev->dev);
>> -	return 0;
>> -}
>> -
>> -static void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>> -{
>> -	/*
>> -	 * Internal errors of an RCEC indicate an AER error in an
>> -	 * RCH's downstream port. Check and handle them in the CXL.mem
>> -	 * device driver.
>> -	 */
>> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
>> -	    is_internal_error(info))
>> -		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>> -}
>> -
>> -static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>> -{
>> -	bool *handles_cxl = data;
>> -
>> -	if (!*handles_cxl)
>> -		*handles_cxl = is_cxl_mem_dev(dev) && cxl_error_is_native(dev);
>> -
>> -	/* Non-zero terminates iteration */
>> -	return *handles_cxl;
>> -}
>> -
>> -static bool handles_cxl_errors(struct pci_dev *rcec)
>> -{
>> -	bool handles_cxl = false;
>> -
>> -	if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
>> -	    pcie_aer_is_native(rcec))
>> -		pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
>> -
>> -	return handles_cxl;
>> -}
>> -
>> -static void cxl_rch_enable_rcec(struct pci_dev *rcec)
>> -{
>> -	if (!handles_cxl_errors(rcec))
>> -		return;
>> -
>> -	pci_aer_unmask_internal_errors(rcec);
>> -	pci_info(rcec, "CXL: Internal errors unmasked");
>> -}
>> -
>> -#else
>> -static inline void cxl_rch_enable_rcec(struct pci_dev *dev) { }
>> -static inline void cxl_rch_handle_error(struct pci_dev *dev,
>> -					struct aer_err_info *info) { }
>> -#endif
>>   
>>   /**
>>    * pci_aer_handle_error - handle logging error into an event log
>> diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
>> new file mode 100644
>> index 000000000000..b2ea14f70055
>> --- /dev/null
>> +++ b/drivers/pci/pcie/cxl_aer.c
>> @@ -0,0 +1,138 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
>> +
>> +#include <linux/pci.h>
>> +#include <linux/aer.h>
>> +#include "../pci.h"
>> +
>> +/**
>> + * pci_aer_unmask_internal_errors - unmask internal errors
>> + * @dev: pointer to the pci_dev data structure
>> + *
>> + * Unmask internal errors in the Uncorrectable and Correctable Error
>> + * Mask registers.
>> + *
>> + * Note: AER must be enabled and supported by the device which must be
>> + * checked in advance, e.g. with pcie_aer_is_native().
>> + */
>> +static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>> +{
>> +	int aer = dev->aer_cap;
>> +	u32 mask;
>> +
>> +	pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &mask);
>> +	mask &= ~PCI_ERR_UNC_INTN;
>> +	pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, mask);
>> +
>> +	pci_read_config_dword(dev, aer + PCI_ERR_COR_MASK, &mask);
>> +	mask &= ~PCI_ERR_COR_INTERNAL;
>> +	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
>> +}
>> +
>> +static bool is_cxl_mem_dev(struct pci_dev *dev)
>> +{
>> +	/*
>> +	 * The capability, status, and control fields in Device 0,
>> +	 * Function 0 DVSEC control the CXL functionality of the
>> +	 * entire device (CXL 3.2, 8.1.3).
>> +	 */
>> +	if (dev->devfn != PCI_DEVFN(0, 0))
>> +		return false;
>> +
>> +	/*
>> +	 * CXL Memory Devices must have the 502h class code set (CXL
>> +	 * 3.2, 8.1.12.1).
>> +	 */
>> +	if (FIELD_GET(PCI_CLASS_CODE_MASK, dev->class) != PCI_CLASS_MEMORY_CXL)
>> +		return false;
>> +
>> +	return true;
>> +}
>> +
>> +static bool cxl_error_is_native(struct pci_dev *dev)
>> +{
>> +	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>> +
>> +	return (pcie_ports_native || host->native_aer);
>> +}
>> +
>> +static bool is_internal_error(struct aer_err_info *info)
>> +{
>> +	if (info->severity == AER_CORRECTABLE)
>> +		return info->status & PCI_ERR_COR_INTERNAL;
>> +
>> +	return info->status & PCI_ERR_UNC_INTN;
>> +}
>> +
>> +static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>> +{
>> +	struct aer_err_info *info = (struct aer_err_info *)data;
>> +	const struct pci_error_handlers *err_handler;
>> +
>> +	if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev))
>> +		return 0;
>> +
>> +	/* Protect dev->driver */
>> +	device_lock(&dev->dev);
>> +
>> +	err_handler = dev->driver ? dev->driver->err_handler : NULL;
>> +	if (!err_handler)
>> +		goto out;
>> +
>> +	if (info->severity == AER_CORRECTABLE) {
>> +		if (err_handler->cor_error_detected)
>> +			err_handler->cor_error_detected(dev);
>> +	} else if (err_handler->error_detected) {
>> +		if (info->severity == AER_NONFATAL)
>> +			err_handler->error_detected(dev, pci_channel_io_normal);
>> +		else if (info->severity == AER_FATAL)
>> +			err_handler->error_detected(dev, pci_channel_io_frozen);
>> +	}
>> +out:
>> +	device_unlock(&dev->dev);
>> +	return 0;
>> +}
>> +
>> +void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>> +{
>> +	/*
>> +	 * Internal errors of an RCEC indicate an AER error in an
>> +	 * RCH's downstream port. Check and handle them in the CXL.mem
>> +	 * device driver.
>> +	 */
>> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
>> +	    is_internal_error(info))
>> +		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>> +}
>> +
>> +static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>> +{
>> +	bool *handles_cxl = data;
>> +
>> +	if (!*handles_cxl)
>> +		*handles_cxl = is_cxl_mem_dev(dev) && cxl_error_is_native(dev);
>> +
>> +	/* Non-zero terminates iteration */
>> +	return *handles_cxl;
>> +}
>> +
>> +static bool handles_cxl_errors(struct pci_dev *rcec)
>> +{
>> +	bool handles_cxl = false;
>> +
>> +	if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
>> +	    pcie_aer_is_native(rcec))
>> +		pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
>> +
>> +	return handles_cxl;
>> +}
>> +
>> +void cxl_rch_enable_rcec(struct pci_dev *rcec)
>> +{
>> +	if (!handles_cxl_errors(rcec))
>> +		return;
>> +
>> +	pci_aer_unmask_internal_errors(rcec);
>> +	pci_info(rcec, "CXL: Internal errors unmasked");
>> +}
>> +
>> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
>> index e2d71b6fdd84..31b3935bf189 100644
>> --- a/include/linux/pci_ids.h
>> +++ b/include/linux/pci_ids.h
>> @@ -12,6 +12,8 @@
>>   
>>   /* Device classes and subclasses */
>>   
>> +#define PCI_CLASS_CODE_MASK             0xFFFF00
>> +
>>   #define PCI_CLASS_NOT_DEFINED		0x0000
>>   #define PCI_CLASS_NOT_DEFINED_VGA	0x0001
>>   


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 03/17] PCI/AER: Report CXL or PCIe bus error type in trace logging
  2025-06-26 22:42 ` [PATCH v10 03/17] PCI/AER: Report CXL or PCIe bus error type in trace logging Terry Bowman
                     ` (2 preceding siblings ...)
  2025-06-27 11:32   ` Shiju Jose
@ 2025-07-01 21:27   ` Dave Jiang
  2025-07-23 22:56   ` dan.j.williams
  4 siblings, 0 replies; 82+ messages in thread
From: Dave Jiang @ 2025-07-01 21:27 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci



On 6/26/25 3:42 PM, Terry Bowman wrote:
> The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
> for all errors. Update the driver and aer_event tracing to log 'CXL Bus
> Type' for CXL device errors.
> 
> This requires the AER can identify and distinguish between PCIe errors and
> CXL errors.
> 
> Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
> aer_get_device_error_info() and pci_print_aer().
> 
> Update the aer_event trace routine to accept a bus type string parameter.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
>  drivers/pci/pci.h       |  6 ++++++
>  drivers/pci/pcie/aer.c  | 21 +++++++++++++++------
>  include/ras/ras_event.h |  9 ++++++---
>  3 files changed, 27 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 12215ee72afb..a0d1e59b5666 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -608,6 +608,7 @@ struct aer_err_info {
>  	int ratelimit_print[AER_MAX_MULTI_ERR_DEVICES];
>  	int error_dev_num;
>  	const char *level;		/* printk level */
> +	bool is_cxl;
>  
>  	unsigned int id:16;
>  
> @@ -628,6 +629,11 @@ struct aer_err_info {
>  int aer_get_device_error_info(struct aer_err_info *info, int i);
>  void aer_print_error(struct aer_err_info *info, int i);
>  
> +static inline const char *aer_err_bus(struct aer_err_info *info)
> +{
> +	return info->is_cxl ? "CXL" : "PCIe";
> +}
> +
>  int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
>  		      unsigned int tlp_len, bool flit,
>  		      struct pcie_tlp_log *log);
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 70ac66188367..a2df9456595a 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -837,6 +837,7 @@ void aer_print_error(struct aer_err_info *info, int i)
>  	struct pci_dev *dev;
>  	int layer, agent, id;
>  	const char *level = info->level;
> +	const char *bus_type = aer_err_bus(info);
>  
>  	if (WARN_ON_ONCE(i >= AER_MAX_MULTI_ERR_DEVICES))
>  		return;
> @@ -845,23 +846,23 @@ void aer_print_error(struct aer_err_info *info, int i)
>  	id = pci_dev_id(dev);
>  
>  	pci_dev_aer_stats_incr(dev, info);
> -	trace_aer_event(pci_name(dev), (info->status & ~info->mask),
> +	trace_aer_event(pci_name(dev), bus_type, (info->status & ~info->mask),
>  			info->severity, info->tlp_header_valid, &info->tlp);
>  
>  	if (!info->ratelimit_print[i])
>  		return;
>  
>  	if (!info->status) {
> -		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> -			aer_error_severity_string[info->severity]);
> +		pci_err(dev, "%s Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> +			bus_type, aer_error_severity_string[info->severity]);
>  		goto out;
>  	}
>  
>  	layer = AER_GET_LAYER_ERROR(info->severity, info->status);
>  	agent = AER_GET_AGENT(info->severity, info->status);
>  
> -	aer_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> -		   aer_error_severity_string[info->severity],
> +	aer_printk(level, dev, "%s Bus Error: severity=%s, type=%s, (%s)\n",
> +		   bus_type, aer_error_severity_string[info->severity],
>  		   aer_error_layer[layer], aer_agent_string[agent]);
>  
>  	aer_printk(level, dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
> @@ -895,6 +896,7 @@ EXPORT_SYMBOL_GPL(cper_severity_to_aer);
>  void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  		   struct aer_capability_regs *aer)
>  {
> +	const char *bus_type;
>  	int layer, agent, tlp_header_valid = 0;
>  	u32 status, mask;
>  	struct aer_err_info info = {
> @@ -915,9 +917,12 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  
>  	info.status = status;
>  	info.mask = mask;
> +	info.is_cxl = pcie_is_cxl(dev);
> +
> +	bus_type = aer_err_bus(&info);
>  
>  	pci_dev_aer_stats_incr(dev, &info);
> -	trace_aer_event(pci_name(dev), (status & ~mask),
> +	trace_aer_event(pci_name(dev), bus_type, (status & ~mask),
>  			aer_severity, tlp_header_valid, &aer->header_log);
>  
>  	if (!aer_ratelimit(dev, info.severity))
> @@ -939,6 +944,9 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  	if (tlp_header_valid)
>  		pcie_print_tlp_log(dev, &aer->header_log, info.level,
>  				   dev_fmt("  "));
> +
> +	trace_aer_event(dev_name(&dev->dev), bus_type, (status & ~mask),
> +			aer_severity, tlp_header_valid, &aer->header_log);
>  }
>  EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
>  
> @@ -1371,6 +1379,7 @@ int aer_get_device_error_info(struct aer_err_info *info, int i)
>  	/* Must reset in this function */
>  	info->status = 0;
>  	info->tlp_header_valid = 0;
> +	info->is_cxl = pcie_is_cxl(dev);
>  
>  	/* The device might not support AER */
>  	if (!aer)
> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
> index 14c9f943d53f..080829d59c36 100644
> --- a/include/ras/ras_event.h
> +++ b/include/ras/ras_event.h
> @@ -297,15 +297,17 @@ TRACE_EVENT(non_standard_event,
>  
>  TRACE_EVENT(aer_event,
>  	TP_PROTO(const char *dev_name,
> +		 const char *bus_type,
>  		 const u32 status,
>  		 const u8 severity,
>  		 const u8 tlp_header_valid,
>  		 struct pcie_tlp_log *tlp),
>  
> -	TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp),
> +	TP_ARGS(dev_name, bus_type, status, severity, tlp_header_valid, tlp),
>  
>  	TP_STRUCT__entry(
>  		__string(	dev_name,	dev_name	)
> +		__string(	bus_type,	bus_type	)
>  		__field(	u32,		status		)
>  		__field(	u8,		severity	)
>  		__field(	u8, 		tlp_header_valid)
> @@ -314,6 +316,7 @@ TRACE_EVENT(aer_event,
>  
>  	TP_fast_assign(
>  		__assign_str(dev_name);
> +		__assign_str(bus_type);
>  		__entry->status		= status;
>  		__entry->severity	= severity;
>  		__entry->tlp_header_valid = tlp_header_valid;
> @@ -325,8 +328,8 @@ TRACE_EVENT(aer_event,
>  		}
>  	),
>  
> -	TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s\n",
> -		__get_str(dev_name),
> +	TP_printk("%s %s Bus Error: severity=%s, %s, TLP Header=%s\n",
> +		__get_str(dev_name), __get_str(bus_type),
>  		__entry->severity == AER_CORRECTABLE ? "Corrected" :
>  			__entry->severity == AER_FATAL ?
>  			"Fatal" : "Uncorrected, non-fatal",


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
  2025-06-26 22:42 ` [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors Terry Bowman
  2025-06-27 10:24   ` Jonathan Cameron
@ 2025-07-01 21:53   ` Dave Jiang
  2025-07-02 17:10     ` Bowman, Terry
  2025-07-24  2:01   ` dan.j.williams
  2 siblings, 1 reply; 82+ messages in thread
From: Dave Jiang @ 2025-07-01 21:53 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci



On 6/26/25 3:42 PM, Terry Bowman wrote:
> CXL error handling will soon be moved from the AER driver into the CXL
> driver. This requires a notification mechanism for the AER driver to share
> the AER interrupt with the CXL driver. The notification will be used
> as an indication for the CXL drivers to handle and log the CXL RAS errors.
> 
> First, introduce cxl/core/native_ras.c to contain changes for the CXL
> driver's RAS native handling. This as an alternative to dropping the
> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
> conditionally compile the new file.
> 
> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
> driver will be the sole kfifo producer adding work and the cxl_core will be
> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
> 
> Add CXL work queue handler registration functions in the AER driver. Export
> the functions allowing CXL driver to access. Implement registration
> functions for the CXL driver to assign or clear the work handler function.
> 
> Introduce 'struct cxl_proto_err_info' to serve as the kfifo work data. This
> will contain the erring device's PCI SBDF details used to rediscover the
> device after the CXL driver dequeues the kfifo work. The device rediscovery
> will be introduced along with the CXL handling in future patches.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/Kconfig           | 14 ++++++++
>  drivers/cxl/core/Makefile     |  1 +
>  drivers/cxl/core/core.h       |  8 +++++
>  drivers/cxl/core/native_ras.c | 26 +++++++++++++++

With the addition of a new file to cxl_core, can you please also fix up tools/testing/cxl/Kbuild?

DJ

>  drivers/cxl/core/port.c       |  2 ++
>  drivers/cxl/core/ras.c        |  1 +
>  drivers/cxl/cxlpci.h          |  1 +
>  drivers/pci/pci.h             |  4 +++
>  drivers/pci/pcie/aer.c        |  7 ++--
>  drivers/pci/pcie/cxl_aer.c    | 60 +++++++++++++++++++++++++++++++++++
>  include/linux/aer.h           | 31 ++++++++++++++++++
>  11 files changed, 153 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/cxl/core/native_ras.c
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index 48b7314afdb8..57274de54a45 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -233,4 +233,18 @@ config CXL_MCE
>  	def_bool y
>  	depends on X86_MCE && MEMORY_FAILURE
>  
> +config CXL_NATIVE_RAS
> +	bool "CXL: Enable CXL RAS native handling"
> +	depends on PCIEAER_CXL
> +	default CXL_BUS
> +	help
> +	  Enable native CXL RAS protocol error handling and logging in the CXL
> +	  drivers. This functionality relies on the AER service driver being
> +	  enabled, as the AER interrupt is used to inform the operating system
> +	  of CXL RAS protocol errors. The platform must be configured to
> +	  utilize AER reporting for interrupts.
> +
> +	  If unsure, or if this kernel is meant for production environments,
> +	  say Y.
> +
>  endif
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index 79e2ef81fde8..16f5832e5cc4 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -21,3 +21,4 @@ cxl_core-$(CONFIG_CXL_REGION) += region.o
>  cxl_core-$(CONFIG_CXL_MCE) += mce.o
>  cxl_core-$(CONFIG_CXL_FEATURES) += features.o
>  cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
> +cxl_core-$(CONFIG_CXL_NATIVE_RAS) += native_ras.o
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 29b61828a847..4c08bb92e2f9 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -123,6 +123,14 @@ int cxl_gpf_port_setup(struct cxl_dport *dport);
>  int cxl_acpi_get_extended_linear_cache_size(struct resource *backing_res,
>  					    int nid, resource_size_t *size);
>  
> +#ifdef CONFIG_PCIEAER_CXL
> +void cxl_native_ras_init(void);
> +void cxl_native_ras_exit(void);
> +#else
> +static inline void cxl_native_ras_init(void) { };
> +static inline void cxl_native_ras_exit(void) { };
> +#endif
> +
>  #ifdef CONFIG_CXL_FEATURES
>  struct cxl_feat_entry *
>  cxl_feature_info(struct cxl_features_state *cxlfs, const uuid_t *uuid);
> diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
> new file mode 100644
> index 000000000000..011815ddaae3
> --- /dev/null
> +++ b/drivers/cxl/core/native_ras.c
> @@ -0,0 +1,26 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
> +
> +#include <linux/pci.h>
> +#include <linux/aer.h>
> +#include <cxl/event.h>
> +#include <cxlmem.h>
> +#include <core/core.h>
> +
> +static void cxl_proto_err_work_fn(struct work_struct *work)
> +{
> +}
> +
> +static struct work_struct cxl_proto_err_work;
> +static DECLARE_WORK(cxl_proto_err_work, cxl_proto_err_work_fn);
> +
> +void cxl_native_ras_init(void)
> +{
> +	cxl_register_proto_err_work(&cxl_proto_err_work);
> +}
> +
> +void cxl_native_ras_exit(void)
> +{
> +	cxl_unregister_proto_err_work();
> +	cancel_work_sync(&cxl_proto_err_work);
> +}
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index eb46c6764d20..8e8f21197c86 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -2345,6 +2345,8 @@ static __init int cxl_core_init(void)
>  	if (rc)
>  		goto err_ras;
>  
> +	cxl_native_ras_init();
> +
>  	return 0;
>  
>  err_ras:
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 485a831695c7..962dc94fed8c 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -5,6 +5,7 @@
>  #include <linux/aer.h>
>  #include <cxl/event.h>
>  #include <cxlmem.h>
> +#include <cxlpci.h>
>  #include "trace.h"
>  
>  static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 54e219b0049e..6f1396ef7b77 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -4,6 +4,7 @@
>  #define __CXL_PCI_H__
>  #include <linux/pci.h>
>  #include "cxl.h"
> +#include "linux/aer.h"
>  
>  #define CXL_MEMORY_PROGIF	0x10
>  
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 91b583cf18eb..29c11c7136d3 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -1032,9 +1032,13 @@ static inline void pci_restore_aer_state(struct pci_dev *dev) { }
>  #ifdef CONFIG_PCIEAER_CXL
>  void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info);
>  void cxl_rch_enable_rcec(struct pci_dev *rcec);
> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info);
> +void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info);
>  #else
>  static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { }
>  static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
> +static inline bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info) { return false; }
> +static inline void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info) { }
>  #endif
>  
>  #ifdef CONFIG_ACPI
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 0b4d721980ef..8417a49c71f2 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1130,8 +1130,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  
>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>  {
> -	cxl_rch_handle_error(dev, info);
> -	pci_aer_handle_error(dev, info);
> +	if (is_cxl_error(dev, info))
> +		forward_cxl_error(dev, info);
> +	else
> +		pci_aer_handle_error(dev, info);
> +
>  	pci_dev_put(dev);
>  }
>  
> diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
> index b2ea14f70055..846ab55d747c 100644
> --- a/drivers/pci/pcie/cxl_aer.c
> +++ b/drivers/pci/pcie/cxl_aer.c
> @@ -3,8 +3,11 @@
>  
>  #include <linux/pci.h>
>  #include <linux/aer.h>
> +#include <linux/kfifo.h>
>  #include "../pci.h"
>  
> +#define CXL_ERROR_SOURCES_MAX          128
> +
>  /**
>   * pci_aer_unmask_internal_errors - unmask internal errors
>   * @dev: pointer to the pci_dev data structure
> @@ -64,6 +67,19 @@ static bool is_internal_error(struct aer_err_info *info)
>  	return info->status & PCI_ERR_UNC_INTN;
>  }
>  
> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
> +{
> +	if (!info || !info->is_cxl)
> +		return false;
> +
> +	/* Only CXL Endpoints are currently supported */
> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC))
> +		return false;
> +
> +	return is_internal_error(info);
> +}
> +
>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>  {
>  	struct aer_err_info *info = (struct aer_err_info *)data;
> @@ -136,3 +152,47 @@ void cxl_rch_enable_rcec(struct pci_dev *rcec)
>  	pci_info(rcec, "CXL: Internal errors unmasked");
>  }
>  
> +static DEFINE_KFIFO(cxl_proto_err_fifo, struct cxl_proto_err_work_data,
> +		    CXL_ERROR_SOURCES_MAX);
> +static DEFINE_SPINLOCK(cxl_proto_err_fifo_lock);
> +struct work_struct *cxl_proto_err_work;
> +
> +void cxl_register_proto_err_work(struct work_struct *work)
> +{
> +	guard(spinlock)(&cxl_proto_err_fifo_lock);
> +	cxl_proto_err_work = work;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_register_proto_err_work, "CXL");
> +
> +void cxl_unregister_proto_err_work(void)
> +{
> +	guard(spinlock)(&cxl_proto_err_fifo_lock);
> +	cxl_proto_err_work = NULL;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_unregister_proto_err_work, "CXL");
> +
> +int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd)
> +{
> +	return kfifo_get(&cxl_proto_err_fifo, wd);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_proto_err_kfifo_get, "CXL");
> +
> +void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info)
> +{
> +	struct cxl_proto_err_work_data wd;
> +
> +	wd.err_info = (struct cxl_proto_error_info) {
> +		.severity = aer_err_info->severity,
> +		.devfn = pdev->devfn,
> +		.bus = pdev->bus->number,
> +		.segment = pci_domain_nr(pdev->bus)
> +	};
> +
> +	if (!kfifo_put(&cxl_proto_err_fifo, wd)) {
> +		dev_err_ratelimited(&pdev->dev, "CXL kfifo overflow\n");
> +		return;
> +	}
> +
> +	schedule_work(cxl_proto_err_work);
> +}
> +
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 02940be66324..24c3d9e18ad5 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -10,6 +10,7 @@
>  
>  #include <linux/errno.h>
>  #include <linux/types.h>
> +#include <linux/workqueue_types.h>
>  
>  #define AER_NONFATAL			0
>  #define AER_FATAL			1
> @@ -53,6 +54,26 @@ struct aer_capability_regs {
>  	u16 uncor_err_source;
>  };
>  
> +/**
> + * struct cxl_proto_err_info - Error information used in CXL error handling
> + * @severity: AER severity
> + * @function: Device's PCI function
> + * @device: Device's PCI device
> + * @bus: Device's PCI bus
> + * @segment: Device's PCI segment
> + */
> +struct cxl_proto_error_info {
> +	int severity;
> +
> +	u8 devfn;
> +	u8 bus;
> +	u16 segment;
> +};
> +
> +struct cxl_proto_err_work_data {
> +	struct cxl_proto_error_info err_info;
> +};
> +
>  #if defined(CONFIG_PCIEAER)
>  int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
>  int pcie_aer_is_native(struct pci_dev *dev);
> @@ -64,6 +85,16 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
>  static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
>  #endif
>  
> +#if defined(CONFIG_PCIEAER_CXL)
> +void cxl_register_proto_err_work(struct work_struct *work);
> +void cxl_unregister_proto_err_work(void);
> +int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd);
> +#else
> +static inline void cxl_register_proto_err_work(struct work_struct *work) { }
> +static inline void cxl_unregister_proto_err_work(void) { }
> +static inline int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd) { return 0; }
> +#endif
> +
>  void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  		    struct aer_capability_regs *aer);
>  int cper_severity_to_aer(int cper_severity);


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 06/17] PCI/AER: Dequeue forwarded CXL error
  2025-06-26 22:42 ` [PATCH v10 06/17] PCI/AER: Dequeue forwarded CXL error Terry Bowman
  2025-06-27 11:00   ` Jonathan Cameron
@ 2025-07-01 23:04   ` Dave Jiang
  2025-07-02 17:56     ` Bowman, Terry
  2025-07-25  0:38   ` dan.j.williams
  2 siblings, 1 reply; 82+ messages in thread
From: Dave Jiang @ 2025-07-01 23:04 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci



On 6/26/25 3:42 PM, Terry Bowman wrote:
> The AER driver is now designed to forward CXL protocol errors to the CXL
> driver. Update the CXL driver with functionality to dequeue the forwarded
> CXL error from the kfifo. Also, update the CXL driver to begin the protocol
> error handling processing using the work received from the FIFO.
> 
> Introduce function cxl_proto_err_work_fn() to dequeue work forwarded by the
> AER service driver. This will begin the CXL protocol error processing with
> a call to cxl_handle_proto_error().
> 
> Update cxl/core/native_ras.c by adding cxl_rch_handle_error_iter() that was
> previously in the AER driver. Add check that Endpoint is bound to a CXL
> driver.
> 
> Introduce logic to take the SBDF values from 'struct cxl_proto_error_info'
> and use in discovering the erring PCI device. The call to pci_get_domain_bus_and_slot()
> will return a reference counted 'struct pci_dev *'. This will serve as
> reference count to prevent releasing the CXL Endpoint's mapped RAS while
> handling the error. Use scope base __free() to put the reference count.
> This will change when adding support for CXL port devices in the future.
> 
> Implement cxl_handle_proto_error() to differentiate between Restricted CXL
> Host (RCH) protocol errors and CXL virtual host (VH) protocol errors. RCH
> errors will be processed with a call to walk the associated Root Complex
> Event Collector's (RCEC) secondary bus looking for the Root Complex
> Integrated Endpoint (RCiEP) to handle the RCH error. Export pcie_walk_rcec()
> so the CXL driver can walk the RCEC's downstream bus, searching for the
> RCiEP.
> 
> VH correctable error (CE) processing will call the CXL CE handler. VH
> uncorrectable errors (UCE) will call cxl_do_recovery(), implemented as a
> stub for now and to be updated in future patch. Export pci_aer_clean_fatal_status()
> and pci_clean_device_status() used to clean up AER status after handling.
> 
> Maintain the locking logic found in the original AER driver. Replace the
> existing device_lock() in cxl_rch_handle_error_iter() to use guard(device)
> lock for maintainability. CE errors did not include locking in previous driver
> implementation. Leave the updated CE handling path as-is.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Couple minor comments below. Otherwise
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> ---
>  drivers/cxl/core/native_ras.c | 87 +++++++++++++++++++++++++++++++++++
>  drivers/cxl/cxlpci.h          |  1 +
>  drivers/cxl/pci.c             |  6 +++
>  drivers/pci/pci.c             |  1 +
>  drivers/pci/pci.h             |  7 ---
>  drivers/pci/pcie/aer.c        |  1 +
>  drivers/pci/pcie/cxl_aer.c    | 41 -----------------
>  drivers/pci/pcie/rcec.c       |  1 +
>  include/linux/aer.h           |  2 +
>  include/linux/pci.h           | 10 ++++
>  10 files changed, 109 insertions(+), 48 deletions(-)
> 
> diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
> index 011815ddaae3..5bd79d5019e7 100644
> --- a/drivers/cxl/core/native_ras.c
> +++ b/drivers/cxl/core/native_ras.c
> @@ -6,9 +6,96 @@
>  #include <cxl/event.h>
>  #include <cxlmem.h>
>  #include <core/core.h>
> +#include <cxlpci.h>
> +
> +static void cxl_do_recovery(struct pci_dev *pdev)
> +{
> +}
> +
> +static bool is_cxl_rcd(struct pci_dev *pdev)
> +{
> +	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END)
> +		return false;
> +
> +	/*
> +	 * The capability, status, and control fields in Device 0,
> +	 * Function 0 DVSEC control the CXL functionality of the
> +	 * entire device (CXL 3.2, 8.1.3).
> +	 */
> +	if (pdev->devfn != PCI_DEVFN(0, 0))
> +		return false;
> +
> +	/*
> +	 * CXL Memory Devices must have the 502h class code set (CXL
> +	 * 3.2, 8.1.12.1).
> +	 */
> +	if (FIELD_GET(PCI_CLASS_CODE_MASK, pdev->class) != PCI_CLASS_MEMORY_CXL)
> +		return false;
> +
> +	return true;

return FIELD_GET(PCI_CLASS_CODE_MASK, pdev->class) == PCI_CLASS_MEMORY_CXL;

> +}
> +
> +static int cxl_rch_handle_error_iter(struct pci_dev *pdev, void *data)
> +{
> +	struct cxl_proto_error_info *err_info = data;
> +
> +	guard(device)(&pdev->dev);
> +
> +	if (!is_cxl_rcd(pdev) || !cxl_pci_drv_bound(pdev))
> +		return 0;
> +
> +	if (err_info->severity == AER_CORRECTABLE)
> +		cxl_cor_error_detected(pdev);
> +	else
> +		cxl_error_detected(pdev, pci_channel_io_frozen);
> +
> +	return 1;
> +}
> +
> +static void cxl_handle_proto_error(struct cxl_proto_error_info *err_info)
> +{
> +	struct pci_dev *pdev __free(pci_dev_put) =
> +		pci_get_domain_bus_and_slot(err_info->segment,
> +					    err_info->bus,
> +					    err_info->devfn);
> +
> +	if (!pdev) {
> +		pr_err("Failed to find the CXL device (SBDF=%x:%x:%x:%x)\n",
> +		       err_info->segment, err_info->bus, PCI_SLOT(err_info->devfn),
> +		       PCI_FUNC(err_info->devfn));
> +		return;
> +	}
> +
> +	/*
> +	 * Internal errors of an RCEC indicate an AER error in an
> +	 * RCH's downstream port. Check and handle them in the CXL.mem
> +	 * device driver.
> +	 */
> +	if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_EC)
> +		return pcie_walk_rcec(pdev, cxl_rch_handle_error_iter, err_info);
> +
> +	if (err_info->severity == AER_CORRECTABLE) {
> +		int aer = pdev->aer_cap;
> +
> +		if (aer)
> +			pci_clear_and_set_config_dword(pdev,
> +						       aer + PCI_ERR_COR_STATUS,
> +						       0, PCI_ERR_COR_INTERNAL);
> +
> +		cxl_cor_error_detected(pdev);
> +
> +		pcie_clear_device_status(pdev);
> +	} else {
> +		cxl_do_recovery(pdev);
> +	}
> +}
>  
>  static void cxl_proto_err_work_fn(struct work_struct *work)
>  {
> +	struct cxl_proto_err_work_data wd;
> +
> +	while (cxl_proto_err_kfifo_get(&wd))
> +		cxl_handle_proto_error(&wd.err_info);
>  }
>  
>  static struct work_struct cxl_proto_err_work;
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 6f1396ef7b77..ed3c9701b79f 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -136,4 +136,5 @@ void read_cdat_data(struct cxl_port *port);
>  void cxl_cor_error_detected(struct pci_dev *pdev);
>  pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>  				    pci_channel_state_t state);
> +bool cxl_pci_drv_bound(struct pci_dev *pdev);
>  #endif /* __CXL_PCI_H__ */
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index bd100ac31672..cae049f9ae3e 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -1131,6 +1131,12 @@ static struct pci_driver cxl_pci_driver = {
>  	},
>  };
>  
> +bool cxl_pci_drv_bound(struct pci_dev *pdev)
> +{
> +	return (pdev->driver == &cxl_pci_driver);

Maybe () not needed?

DJ

> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_pci_drv_bound, "CXL");
> +
>  #define CXL_EVENT_HDR_FLAGS_REC_SEVERITY GENMASK(1, 0)
>  static void cxl_handle_cper_event(enum cxl_event_type ev_type,
>  				  struct cxl_cper_event_rec *rec)
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e9448d55113b..8d78d882bf78 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -2328,6 +2328,7 @@ void pcie_clear_device_status(struct pci_dev *dev)
>  	pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &sta);
>  	pcie_capability_write_word(dev, PCI_EXP_DEVSTA, sta);
>  }
> +EXPORT_SYMBOL_NS_GPL(pcie_clear_device_status, "CXL");
>  #endif
>  
>  /**
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 29c11c7136d3..c7fc86d93bea 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -671,16 +671,10 @@ static inline bool pci_dpc_recovered(struct pci_dev *pdev) { return false; }
>  void pci_rcec_init(struct pci_dev *dev);
>  void pci_rcec_exit(struct pci_dev *dev);
>  void pcie_link_rcec(struct pci_dev *rcec);
> -void pcie_walk_rcec(struct pci_dev *rcec,
> -		    int (*cb)(struct pci_dev *, void *),
> -		    void *userdata);
>  #else
>  static inline void pci_rcec_init(struct pci_dev *dev) { }
>  static inline void pci_rcec_exit(struct pci_dev *dev) { }
>  static inline void pcie_link_rcec(struct pci_dev *rcec) { }
> -static inline void pcie_walk_rcec(struct pci_dev *rcec,
> -				  int (*cb)(struct pci_dev *, void *),
> -				  void *userdata) { }
>  #endif
>  
>  #ifdef CONFIG_PCI_ATS
> @@ -1022,7 +1016,6 @@ void pci_restore_aer_state(struct pci_dev *dev);
>  static inline void pci_no_aer(void) { }
>  static inline void pci_aer_init(struct pci_dev *d) { }
>  static inline void pci_aer_exit(struct pci_dev *d) { }
> -static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
>  static inline int pci_aer_clear_status(struct pci_dev *dev) { return -EINVAL; }
>  static inline int pci_aer_raw_clear_status(struct pci_dev *dev) { return -EINVAL; }
>  static inline void pci_save_aer_state(struct pci_dev *dev) { }
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 8417a49c71f2..5999d90dfdcb 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -287,6 +287,7 @@ void pci_aer_clear_fatal_status(struct pci_dev *dev)
>  	if (status)
>  		pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, status);
>  }
> +EXPORT_SYMBOL_GPL(pci_aer_clear_fatal_status);
>  
>  /**
>   * pci_aer_raw_clear_status - Clear AER error registers.
> diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
> index 846ab55d747c..939438a7161a 100644
> --- a/drivers/pci/pcie/cxl_aer.c
> +++ b/drivers/pci/pcie/cxl_aer.c
> @@ -80,47 +80,6 @@ bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
>  	return is_internal_error(info);
>  }
>  
> -static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
> -{
> -	struct aer_err_info *info = (struct aer_err_info *)data;
> -	const struct pci_error_handlers *err_handler;
> -
> -	if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev))
> -		return 0;
> -
> -	/* Protect dev->driver */
> -	device_lock(&dev->dev);
> -
> -	err_handler = dev->driver ? dev->driver->err_handler : NULL;
> -	if (!err_handler)
> -		goto out;
> -
> -	if (info->severity == AER_CORRECTABLE) {
> -		if (err_handler->cor_error_detected)
> -			err_handler->cor_error_detected(dev);
> -	} else if (err_handler->error_detected) {
> -		if (info->severity == AER_NONFATAL)
> -			err_handler->error_detected(dev, pci_channel_io_normal);
> -		else if (info->severity == AER_FATAL)
> -			err_handler->error_detected(dev, pci_channel_io_frozen);
> -	}
> -out:
> -	device_unlock(&dev->dev);
> -	return 0;
> -}
> -
> -void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
> -{
> -	/*
> -	 * Internal errors of an RCEC indicate an AER error in an
> -	 * RCH's downstream port. Check and handle them in the CXL.mem
> -	 * device driver.
> -	 */
> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> -	    is_internal_error(info))
> -		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
> -}
> -
>  static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>  {
>  	bool *handles_cxl = data;
> diff --git a/drivers/pci/pcie/rcec.c b/drivers/pci/pcie/rcec.c
> index d0bcd141ac9c..fb6cf6449a1d 100644
> --- a/drivers/pci/pcie/rcec.c
> +++ b/drivers/pci/pcie/rcec.c
> @@ -145,6 +145,7 @@ void pcie_walk_rcec(struct pci_dev *rcec, int (*cb)(struct pci_dev *, void *),
>  
>  	walk_rcec(walk_rcec_helper, &rcec_data);
>  }
> +EXPORT_SYMBOL_NS_GPL(pcie_walk_rcec, "CXL");
>  
>  void pci_rcec_init(struct pci_dev *dev)
>  {
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 24c3d9e18ad5..0aafcc678e45 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -76,12 +76,14 @@ struct cxl_proto_err_work_data {
>  
>  #if defined(CONFIG_PCIEAER)
>  int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
> +void pci_aer_clear_fatal_status(struct pci_dev *dev);
>  int pcie_aer_is_native(struct pci_dev *dev);
>  #else
>  static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
>  {
>  	return -EINVAL;
>  }
> +static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
>  static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
>  #endif
>  
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 79878243b681..79326358f641 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1801,6 +1801,9 @@ extern bool pcie_ports_native;
>  
>  int pcie_set_target_speed(struct pci_dev *port, enum pci_bus_speed speed_req,
>  			  bool use_lt);
> +void pcie_walk_rcec(struct pci_dev *rcec,
> +		    int (*cb)(struct pci_dev *, void *),
> +		    void *userdata);
>  #else
>  #define pcie_ports_disabled	true
>  #define pcie_ports_native	false
> @@ -1811,8 +1814,15 @@ static inline int pcie_set_target_speed(struct pci_dev *port,
>  {
>  	return -EOPNOTSUPP;
>  }
> +
> +static inline void pcie_walk_rcec(struct pci_dev *rcec,
> +				  int (*cb)(struct pci_dev *, void *),
> +				  void *userdata) { }
> +
>  #endif
>  
> +void pcie_clear_device_status(struct pci_dev *dev);
> +
>  #define PCIE_LINK_STATE_L0S		(BIT(0) | BIT(1)) /* Upstr/dwnstr L0s */
>  #define PCIE_LINK_STATE_L1		BIT(2)	/* L1 state */
>  #define PCIE_LINK_STATE_L1_1		BIT(3)	/* ASPM L1.1 state */


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 12/17] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
  2025-06-27 12:22   ` Shiju Jose
@ 2025-07-02  1:18     ` Alison Schofield
  2025-07-02 22:07       ` Bowman, Terry
  2025-07-02 21:56     ` Bowman, Terry
  1 sibling, 1 reply; 82+ messages in thread
From: Alison Schofield @ 2025-07-02  1:18 UTC (permalink / raw)
  To: Shiju Jose
  Cc: Terry Bowman, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, dan.j.williams@intel.com,
	bhelgaas@google.com, ming.li@zohomail.com,
	Smita.KoralahalliChannabasappa@amd.com, rrichter@amd.com,
	dan.carpenter@linaro.org, PradeepVineshReddy.Kodamati@amd.com,
	lukas@wunner.de, Benjamin.Cheatham@amd.com,
	sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org

On Fri, Jun 27, 2025 at 12:22:39PM +0000, Shiju Jose wrote:
> >-----Original Message-----
> >From: Terry Bowman <terry.bowman@amd.com>
> >Sent: 26 June 2025 23:43
> >To: dave@stgolabs.net; Jonathan Cameron <jonathan.cameron@huawei.com>;
> >dave.jiang@intel.com; alison.schofield@intel.com; dan.j.williams@intel.com;
> >bhelgaas@google.com; Shiju Jose <shiju.jose@huawei.com>;
> >ming.li@zohomail.com; Smita.KoralahalliChannabasappa@amd.com;
> >rrichter@amd.com; dan.carpenter@linaro.org;
> >PradeepVineshReddy.Kodamati@amd.com; lukas@wunner.de;
> >Benjamin.Cheatham@amd.com;
> >sathyanarayanan.kuppuswamy@linux.intel.com; terry.bowman@amd.com;
> >linux-cxl@vger.kernel.org
> >Cc: linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org
> >Subject: [PATCH v10 12/17] cxl/pci: Unify CXL trace logging for CXL Endpoints
> >and CXL Ports
> >

big snip -

> >-);
> >-
> > TRACE_EVENT(cxl_aer_uncorrectable_error,
> >-	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32
> >*hl),
> >-	TP_ARGS(cxlmd, status, fe, hl),
> >+	TP_PROTO(struct device *dev, u64 serial, u32 status, u32 fe,
> >+		 u32 *hl),
> >+	TP_ARGS(dev, serial, status, fe, hl),
> > 	TP_STRUCT__entry(
> >-		__string(memdev, dev_name(&cxlmd->dev))
> >-		__string(host, dev_name(cxlmd->dev.parent))
> >+		__string(name, dev_name(dev))
> >+		__string(parent, dev_name(dev->parent))
> 
> Hi Terry,
> 
> Thanks for considering the feedback given in v9 regarding the compatibility issue
> with the rasdaemon.
> https://lore.kernel.org/all/959acc682e6e4b52ac0283b37ee21026@huawei.com/
> 
> Probably some confusion w.r.t the feedback.
> Unfortunately  TP_printk(...) is not an ABI that we need to keep stable, 
> it's this structure, TP_STRUCT__entry(..) , that matters to the rasdaemon.
> 

I'm not so sure you should be letting him off the hook for TP_printk ;)
It seems TP_printk should be kept aligned w TP_STRUCT_entry(). As a
user who often looks at TP_printk output, I'd say keep them all in
sync, and consider them ABI - ie. add to but don't modify.



> > 		__field(u64, serial)
> > 		__field(u32, status)
> > 		__field(u32, first_error)
> > 		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
> > 	),
> > 	TP_fast_assign(
> >-		__assign_str(memdev);
> >-		__assign_str(host);
> >-		__entry->serial = cxlmd->cxlds->serial;
> >+		__assign_str(name);
> >+		__assign_str(parent);
> >+		__entry->serial = serial;
> > 		__entry->status = status;
> > 		__entry->first_error = fe;
> > 		/*
> >@@ -99,8 +72,8 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
> > 		 */
> > 		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
> > 	),
> >-	TP_printk("memdev=%s host=%s serial=%lld: status: '%s' first_error:
> >'%s'",
> >-		  __get_str(memdev), __get_str(host), __entry->serial,
> >+	TP_printk("memdev=%s host=%s serial=%lld status='%s'
> >first_error='%s'",
> >+		  __get_str(name), __get_str(parent), __entry->serial,
> > 		  show_uc_errs(__entry->status),
> > 		  show_uc_errs(__entry->first_error)
> > 	)

snip


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 03/17] PCI/AER: Report CXL or PCIe bus error type in trace logging
  2025-06-27  9:53   ` Jonathan Cameron
@ 2025-07-02 16:00     ` Bowman, Terry
  0 siblings, 0 replies; 82+ messages in thread
From: Bowman, Terry @ 2025-07-02 16:00 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, linux-pci



On 6/27/2025 4:53 AM, Jonathan Cameron wrote:
> On Thu, 26 Jun 2025 17:42:38 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
>> for all errors. Update the driver and aer_event tracing to log 'CXL Bus
>> Type' for CXL device errors.
>>
>> This requires the AER can identify and distinguish between PCIe errors and
>> CXL errors.
>>
>> Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
>> aer_get_device_error_info() and pci_print_aer().
>>
>> Update the aer_event trace routine to accept a bus type string parameter.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Thanks Jonathan.

-Terry

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
  2025-06-27 10:24   ` Jonathan Cameron
@ 2025-07-02 16:21     ` Bowman, Terry
  2025-07-02 19:54       ` Dan Carpenter
  2025-07-03 10:06       ` Jonathan Cameron
  0 siblings, 2 replies; 82+ messages in thread
From: Bowman, Terry @ 2025-07-02 16:21 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, linux-pci



On 6/27/2025 5:24 AM, Jonathan Cameron wrote:
> On Thu, 26 Jun 2025 17:42:40 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> CXL error handling will soon be moved from the AER driver into the CXL
>> driver. This requires a notification mechanism for the AER driver to share
>> the AER interrupt with the CXL driver. The notification will be used
>> as an indication for the CXL drivers to handle and log the CXL RAS errors.
>>
>> First, introduce cxl/core/native_ras.c to contain changes for the CXL
>> driver's RAS native handling. This as an alternative to dropping the
>> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
>> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
>> conditionally compile the new file.
>>
>> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
>> driver will be the sole kfifo producer adding work and the cxl_core will be
>> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
>>
>> Add CXL work queue handler registration functions in the AER driver. Export
>> the functions allowing CXL driver to access. Implement registration
>> functions for the CXL driver to assign or clear the work handler function.
>>
>> Introduce 'struct cxl_proto_err_info' to serve as the kfifo work data. This
>> will contain the erring device's PCI SBDF details used to rediscover the
>> device after the CXL driver dequeues the kfifo work. The device rediscovery
>> will be introduced along with the CXL handling in future patches.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Hi Terry,
>
> Whilst it obviously makes patch preparation a bit more time consuming
> for series like this with many patches it can be useful to add a brief
> change log to the individual patches as well as the cover letter.
> That helps reviewers figure out where they need to look again.
>
> A few trivial things inline.
>
> With those fixed up
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>
> Jonathan

Hi Jonathan,

Do you have an example you can point me to with a change log in the
individual patch? I want to make certain I change correctly.

 
>
>> ---
>>  drivers/cxl/Kconfig           | 14 ++++++++
>>  drivers/cxl/core/Makefile     |  1 +
>>  drivers/cxl/core/core.h       |  8 +++++
>>  drivers/cxl/core/native_ras.c | 26 +++++++++++++++
>>  drivers/cxl/core/port.c       |  2 ++
>>  drivers/cxl/core/ras.c        |  1 +
>>  drivers/cxl/cxlpci.h          |  1 +
>>  drivers/pci/pci.h             |  4 +++
>>  drivers/pci/pcie/aer.c        |  7 ++--
>>  drivers/pci/pcie/cxl_aer.c    | 60 +++++++++++++++++++++++++++++++++++
>>  include/linux/aer.h           | 31 ++++++++++++++++++
>>  11 files changed, 153 insertions(+), 2 deletions(-)
>>  create mode 100644 drivers/cxl/core/native_ras.c
>
>>  static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
>> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
>> index 54e219b0049e..6f1396ef7b77 100644
>> --- a/drivers/cxl/cxlpci.h
>> +++ b/drivers/cxl/cxlpci.h
>> @@ -4,6 +4,7 @@
>>  #define __CXL_PCI_H__
>>  #include <linux/pci.h>
>>  #include "cxl.h"
>> +#include "linux/aer.h"
> Why?  There are no changes in this header other than the include and the changes
> to linux/aer.h are new stuff so I can't see how it becomes necessary if it
> wasn't before.
>
> Might well have always been missing and should have been here. If so separate
> patch to tidy that up.
You're correct, this can be removed and added later.

>>  
>>  #define CXL_MEMORY_PROGIF	0x10
>>  
>
>> diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
>> index b2ea14f70055..846ab55d747c 100644
>> --- a/drivers/pci/pcie/cxl_aer.c
>> +++ b/drivers/pci/pcie/cxl_aer.c
>>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>  {
>>  	struct aer_err_info *info = (struct aer_err_info *)data;
>> @@ -136,3 +152,47 @@ void cxl_rch_enable_rcec(struct pci_dev *rcec)
>>  	pci_info(rcec, "CXL: Internal errors unmasked");
>>  }
>>  
>> +static DEFINE_KFIFO(cxl_proto_err_fifo, struct cxl_proto_err_work_data,
>> +		    CXL_ERROR_SOURCES_MAX);
>> +static DEFINE_SPINLOCK(cxl_proto_err_fifo_lock);
>> +struct work_struct *cxl_proto_err_work;
> I'm not seeing a declaration for this in the headers, so can it be static?
>
> This is made a little more confusing as in this patch we have both
> a structure called cxl_proto_err_work and a pointer to it with exactly the
> same name.  Maybe rename this so it's subtly different.  cxl_protocol_err_work
> or something silly like that just to make reviewers life a tiny bit easier!
Yes, I'll make 'static' and rename to be cxl_protocol_err_work.

>> +
>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>> index 02940be66324..24c3d9e18ad5 100644
>> --- a/include/linux/aer.h
>> +++ b/include/linux/aer.h
>> @@ -10,6 +10,7 @@
>>  
>>  #include <linux/errno.h>
>>  #include <linux/types.h>
>> +#include <linux/workqueue_types.h>
>>  
>>  #define AER_NONFATAL			0
>>  #define AER_FATAL			1
>> @@ -53,6 +54,26 @@ struct aer_capability_regs {
>>  	u16 uncor_err_source;
>>  };
>>  
>> +/**
>> + * struct cxl_proto_err_info - Error information used in CXL error handling
>> + * @severity: AER severity
>> + * @function: Device's PCI function
> Run kernel-doc over the files and fix errors / warning.
> Missed updating this to devfn which it would have shouted about.
I haven't used kernel-doc, obviously. :) Ill add that to the list of checks before sending. Thanks.

-Terry

>> + * @device: Device's PCI device
>> + * @bus: Device's PCI bus
>> + * @segment: Device's PCI segment
>> + */
>> +struct cxl_proto_error_info {
>> +	int severity;
>> +
>> +	u8 devfn;
>> +	u8 bus;
>> +	u16 segment;
>> +};


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
  2025-07-01 21:53   ` Dave Jiang
@ 2025-07-02 17:10     ` Bowman, Terry
  0 siblings, 0 replies; 82+ messages in thread
From: Bowman, Terry @ 2025-07-02 17:10 UTC (permalink / raw)
  To: Dave Jiang, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci



On 7/1/2025 4:53 PM, Dave Jiang wrote:
>
> On 6/26/25 3:42 PM, Terry Bowman wrote:
>> CXL error handling will soon be moved from the AER driver into the CXL
>> driver. This requires a notification mechanism for the AER driver to share
>> the AER interrupt with the CXL driver. The notification will be used
>> as an indication for the CXL drivers to handle and log the CXL RAS errors.
>>
>> First, introduce cxl/core/native_ras.c to contain changes for the CXL
>> driver's RAS native handling. This as an alternative to dropping the
>> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
>> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
>> conditionally compile the new file.
>>
>> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
>> driver will be the sole kfifo producer adding work and the cxl_core will be
>> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
>>
>> Add CXL work queue handler registration functions in the AER driver. Export
>> the functions allowing CXL driver to access. Implement registration
>> functions for the CXL driver to assign or clear the work handler function.
>>
>> Introduce 'struct cxl_proto_err_info' to serve as the kfifo work data. This
>> will contain the erring device's PCI SBDF details used to rediscover the
>> device after the CXL driver dequeues the kfifo work. The device rediscovery
>> will be introduced along with the CXL handling in future patches.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>  drivers/cxl/Kconfig           | 14 ++++++++
>>  drivers/cxl/core/Makefile     |  1 +
>>  drivers/cxl/core/core.h       |  8 +++++
>>  drivers/cxl/core/native_ras.c | 26 +++++++++++++++
> With the addition of a new file to cxl_core, can you please also fix up tools/testing/cxl/Kbuild?
>
> DJ
Hi Dave,

Yes, I'll update the CXL test's Kbuild.

-Terry

>>  drivers/cxl/core/port.c       |  2 ++
>>  drivers/cxl/core/ras.c        |  1 +
>>  drivers/cxl/cxlpci.h          |  1 +
>>  drivers/pci/pci.h             |  4 +++
>>  drivers/pci/pcie/aer.c        |  7 ++--
>>  drivers/pci/pcie/cxl_aer.c    | 60 +++++++++++++++++++++++++++++++++++
>>  include/linux/aer.h           | 31 ++++++++++++++++++
>>  11 files changed, 153 insertions(+), 2 deletions(-)
>>  create mode 100644 drivers/cxl/core/native_ras.c
>>
>> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
>> index 48b7314afdb8..57274de54a45 100644
>> --- a/drivers/cxl/Kconfig
>> +++ b/drivers/cxl/Kconfig
>> @@ -233,4 +233,18 @@ config CXL_MCE
>>  	def_bool y
>>  	depends on X86_MCE && MEMORY_FAILURE
>>  
>> +config CXL_NATIVE_RAS
>> +	bool "CXL: Enable CXL RAS native handling"
>> +	depends on PCIEAER_CXL
>> +	default CXL_BUS
>> +	help
>> +	  Enable native CXL RAS protocol error handling and logging in the CXL
>> +	  drivers. This functionality relies on the AER service driver being
>> +	  enabled, as the AER interrupt is used to inform the operating system
>> +	  of CXL RAS protocol errors. The platform must be configured to
>> +	  utilize AER reporting for interrupts.
>> +
>> +	  If unsure, or if this kernel is meant for production environments,
>> +	  say Y.
>> +
>>  endif
>> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
>> index 79e2ef81fde8..16f5832e5cc4 100644
>> --- a/drivers/cxl/core/Makefile
>> +++ b/drivers/cxl/core/Makefile
>> @@ -21,3 +21,4 @@ cxl_core-$(CONFIG_CXL_REGION) += region.o
>>  cxl_core-$(CONFIG_CXL_MCE) += mce.o
>>  cxl_core-$(CONFIG_CXL_FEATURES) += features.o
>>  cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
>> +cxl_core-$(CONFIG_CXL_NATIVE_RAS) += native_ras.o
>> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
>> index 29b61828a847..4c08bb92e2f9 100644
>> --- a/drivers/cxl/core/core.h
>> +++ b/drivers/cxl/core/core.h
>> @@ -123,6 +123,14 @@ int cxl_gpf_port_setup(struct cxl_dport *dport);
>>  int cxl_acpi_get_extended_linear_cache_size(struct resource *backing_res,
>>  					    int nid, resource_size_t *size);
>>  
>> +#ifdef CONFIG_PCIEAER_CXL
>> +void cxl_native_ras_init(void);
>> +void cxl_native_ras_exit(void);
>> +#else
>> +static inline void cxl_native_ras_init(void) { };
>> +static inline void cxl_native_ras_exit(void) { };
>> +#endif
>> +
>>  #ifdef CONFIG_CXL_FEATURES
>>  struct cxl_feat_entry *
>>  cxl_feature_info(struct cxl_features_state *cxlfs, const uuid_t *uuid);
>> diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
>> new file mode 100644
>> index 000000000000..011815ddaae3
>> --- /dev/null
>> +++ b/drivers/cxl/core/native_ras.c
>> @@ -0,0 +1,26 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
>> +
>> +#include <linux/pci.h>
>> +#include <linux/aer.h>
>> +#include <cxl/event.h>
>> +#include <cxlmem.h>
>> +#include <core/core.h>
>> +
>> +static void cxl_proto_err_work_fn(struct work_struct *work)
>> +{
>> +}
>> +
>> +static struct work_struct cxl_proto_err_work;
>> +static DECLARE_WORK(cxl_proto_err_work, cxl_proto_err_work_fn);
>> +
>> +void cxl_native_ras_init(void)
>> +{
>> +	cxl_register_proto_err_work(&cxl_proto_err_work);
>> +}
>> +
>> +void cxl_native_ras_exit(void)
>> +{
>> +	cxl_unregister_proto_err_work();
>> +	cancel_work_sync(&cxl_proto_err_work);
>> +}
>> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
>> index eb46c6764d20..8e8f21197c86 100644
>> --- a/drivers/cxl/core/port.c
>> +++ b/drivers/cxl/core/port.c
>> @@ -2345,6 +2345,8 @@ static __init int cxl_core_init(void)
>>  	if (rc)
>>  		goto err_ras;
>>  
>> +	cxl_native_ras_init();
>> +
>>  	return 0;
>>  
>>  err_ras:
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index 485a831695c7..962dc94fed8c 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
>> @@ -5,6 +5,7 @@
>>  #include <linux/aer.h>
>>  #include <cxl/event.h>
>>  #include <cxlmem.h>
>> +#include <cxlpci.h>
>>  #include "trace.h"
>>  
>>  static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
>> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
>> index 54e219b0049e..6f1396ef7b77 100644
>> --- a/drivers/cxl/cxlpci.h
>> +++ b/drivers/cxl/cxlpci.h
>> @@ -4,6 +4,7 @@
>>  #define __CXL_PCI_H__
>>  #include <linux/pci.h>
>>  #include "cxl.h"
>> +#include "linux/aer.h"
>>  
>>  #define CXL_MEMORY_PROGIF	0x10
>>  
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index 91b583cf18eb..29c11c7136d3 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -1032,9 +1032,13 @@ static inline void pci_restore_aer_state(struct pci_dev *dev) { }
>>  #ifdef CONFIG_PCIEAER_CXL
>>  void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info);
>>  void cxl_rch_enable_rcec(struct pci_dev *rcec);
>> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info);
>> +void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info);
>>  #else
>>  static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { }
>>  static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
>> +static inline bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info) { return false; }
>> +static inline void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info) { }
>>  #endif
>>  
>>  #ifdef CONFIG_ACPI
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 0b4d721980ef..8417a49c71f2 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -1130,8 +1130,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>  
>>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>>  {
>> -	cxl_rch_handle_error(dev, info);
>> -	pci_aer_handle_error(dev, info);
>> +	if (is_cxl_error(dev, info))
>> +		forward_cxl_error(dev, info);
>> +	else
>> +		pci_aer_handle_error(dev, info);
>> +
>>  	pci_dev_put(dev);
>>  }
>>  
>> diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
>> index b2ea14f70055..846ab55d747c 100644
>> --- a/drivers/pci/pcie/cxl_aer.c
>> +++ b/drivers/pci/pcie/cxl_aer.c
>> @@ -3,8 +3,11 @@
>>  
>>  #include <linux/pci.h>
>>  #include <linux/aer.h>
>> +#include <linux/kfifo.h>
>>  #include "../pci.h"
>>  
>> +#define CXL_ERROR_SOURCES_MAX          128
>> +
>>  /**
>>   * pci_aer_unmask_internal_errors - unmask internal errors
>>   * @dev: pointer to the pci_dev data structure
>> @@ -64,6 +67,19 @@ static bool is_internal_error(struct aer_err_info *info)
>>  	return info->status & PCI_ERR_UNC_INTN;
>>  }
>>  
>> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
>> +{
>> +	if (!info || !info->is_cxl)
>> +		return false;
>> +
>> +	/* Only CXL Endpoints are currently supported */
>> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
>> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC))
>> +		return false;
>> +
>> +	return is_internal_error(info);
>> +}
>> +
>>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>  {
>>  	struct aer_err_info *info = (struct aer_err_info *)data;
>> @@ -136,3 +152,47 @@ void cxl_rch_enable_rcec(struct pci_dev *rcec)
>>  	pci_info(rcec, "CXL: Internal errors unmasked");
>>  }
>>  
>> +static DEFINE_KFIFO(cxl_proto_err_fifo, struct cxl_proto_err_work_data,
>> +		    CXL_ERROR_SOURCES_MAX);
>> +static DEFINE_SPINLOCK(cxl_proto_err_fifo_lock);
>> +struct work_struct *cxl_proto_err_work;
>> +
>> +void cxl_register_proto_err_work(struct work_struct *work)
>> +{
>> +	guard(spinlock)(&cxl_proto_err_fifo_lock);
>> +	cxl_proto_err_work = work;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_register_proto_err_work, "CXL");
>> +
>> +void cxl_unregister_proto_err_work(void)
>> +{
>> +	guard(spinlock)(&cxl_proto_err_fifo_lock);
>> +	cxl_proto_err_work = NULL;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_unregister_proto_err_work, "CXL");
>> +
>> +int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd)
>> +{
>> +	return kfifo_get(&cxl_proto_err_fifo, wd);
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_proto_err_kfifo_get, "CXL");
>> +
>> +void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info)
>> +{
>> +	struct cxl_proto_err_work_data wd;
>> +
>> +	wd.err_info = (struct cxl_proto_error_info) {
>> +		.severity = aer_err_info->severity,
>> +		.devfn = pdev->devfn,
>> +		.bus = pdev->bus->number,
>> +		.segment = pci_domain_nr(pdev->bus)
>> +	};
>> +
>> +	if (!kfifo_put(&cxl_proto_err_fifo, wd)) {
>> +		dev_err_ratelimited(&pdev->dev, "CXL kfifo overflow\n");
>> +		return;
>> +	}
>> +
>> +	schedule_work(cxl_proto_err_work);
>> +}
>> +
>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>> index 02940be66324..24c3d9e18ad5 100644
>> --- a/include/linux/aer.h
>> +++ b/include/linux/aer.h
>> @@ -10,6 +10,7 @@
>>  
>>  #include <linux/errno.h>
>>  #include <linux/types.h>
>> +#include <linux/workqueue_types.h>
>>  
>>  #define AER_NONFATAL			0
>>  #define AER_FATAL			1
>> @@ -53,6 +54,26 @@ struct aer_capability_regs {
>>  	u16 uncor_err_source;
>>  };
>>  
>> +/**
>> + * struct cxl_proto_err_info - Error information used in CXL error handling
>> + * @severity: AER severity
>> + * @function: Device's PCI function
>> + * @device: Device's PCI device
>> + * @bus: Device's PCI bus
>> + * @segment: Device's PCI segment
>> + */
>> +struct cxl_proto_error_info {
>> +	int severity;
>> +
>> +	u8 devfn;
>> +	u8 bus;
>> +	u16 segment;
>> +};
>> +
>> +struct cxl_proto_err_work_data {
>> +	struct cxl_proto_error_info err_info;
>> +};
>> +
>>  #if defined(CONFIG_PCIEAER)
>>  int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
>>  int pcie_aer_is_native(struct pci_dev *dev);
>> @@ -64,6 +85,16 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
>>  static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
>>  #endif
>>  
>> +#if defined(CONFIG_PCIEAER_CXL)
>> +void cxl_register_proto_err_work(struct work_struct *work);
>> +void cxl_unregister_proto_err_work(void);
>> +int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd);
>> +#else
>> +static inline void cxl_register_proto_err_work(struct work_struct *work) { }
>> +static inline void cxl_unregister_proto_err_work(void) { }
>> +static inline int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd) { return 0; }
>> +#endif
>> +
>>  void pci_print_aer(struct pci_dev *dev, int aer_severity,
>>  		    struct aer_capability_regs *aer);
>>  int cper_severity_to_aer(int cper_severity);


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 06/17] PCI/AER: Dequeue forwarded CXL error
  2025-06-27 11:00   ` Jonathan Cameron
@ 2025-07-02 17:51     ` Bowman, Terry
  0 siblings, 0 replies; 82+ messages in thread
From: Bowman, Terry @ 2025-07-02 17:51 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, linux-pci



On 6/27/2025 6:00 AM, Jonathan Cameron wrote:
> On Thu, 26 Jun 2025 17:42:41 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The AER driver is now designed to forward CXL protocol errors to the CXL
>> driver. Update the CXL driver with functionality to dequeue the forwarded
>> CXL error from the kfifo. Also, update the CXL driver to begin the protocol
>> error handling processing using the work received from the FIFO.
>>
>> Introduce function cxl_proto_err_work_fn() to dequeue work forwarded by the
> After earlier update it already exists, you are just filling it in here.
> So reword this.
Ok.
>> AER service driver. This will begin the CXL protocol error processing with
>> a call to cxl_handle_proto_error().
>>
>> Update cxl/core/native_ras.c by adding cxl_rch_handle_error_iter() that was
>> previously in the AER driver. Add check that Endpoint is bound to a CXL
>> driver.
>>
>> Introduce logic to take the SBDF values from 'struct cxl_proto_error_info'
>> and use in discovering the erring PCI device. The call to pci_get_domain_bus_and_slot()
>> will return a reference counted 'struct pci_dev *'. This will serve as
>> reference count to prevent releasing the CXL Endpoint's mapped RAS while
>> handling the error. Use scope base __free() to put the reference count.
>> This will change when adding support for CXL port devices in the future.
>>
>> Implement cxl_handle_proto_error() to differentiate between Restricted CXL
>> Host (RCH) protocol errors and CXL virtual host (VH) protocol errors. RCH
>> errors will be processed with a call to walk the associated Root Complex
>> Event Collector's (RCEC) secondary bus looking for the Root Complex
>> Integrated Endpoint (RCiEP) to handle the RCH error. Export pcie_walk_rcec()
>> so the CXL driver can walk the RCEC's downstream bus, searching for the
>> RCiEP.
> I'd drop the RCiEP description beyond saying 'handle it as before'
> as I think there is no major change in this.
Ok.
>> VH correctable error (CE) processing will call the CXL CE handler. VH
>> uncorrectable errors (UCE) will call cxl_do_recovery(), implemented as a
>> stub for now and to be updated in future patch. Export pci_aer_clean_fatal_status()
>> and pci_clean_device_status() used to clean up AER status after handling.
>>
>> Maintain the locking logic found in the original AER driver. Replace the
>> existing device_lock() in cxl_rch_handle_error_iter() to use guard(device)
>> lock for maintainability.
> This change is fine, but it is an AND change in a patch doing quite a few other
> things.  So do it in a trivial precursor patch.  Look at the other things in this
> description and see if they can be factored out too so that the guts of this
> patch are much easier to spot.
>

I agree. I'll revisit to try and simplify the amount of changes in this patch.

>> CE errors did not include locking in previous driver
>> implementation. Leave the updated CE handling path as-is.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> A few comments inline.
>
> Jonathan
>
>> ---
>>  drivers/cxl/core/native_ras.c | 87 +++++++++++++++++++++++++++++++++++
>>  drivers/cxl/cxlpci.h          |  1 +
>>  drivers/cxl/pci.c             |  6 +++
>>  drivers/pci/pci.c             |  1 +
>>  drivers/pci/pci.h             |  7 ---
>>  drivers/pci/pcie/aer.c        |  1 +
>>  drivers/pci/pcie/cxl_aer.c    | 41 -----------------
>>  drivers/pci/pcie/rcec.c       |  1 +
>>  include/linux/aer.h           |  2 +
>>  include/linux/pci.h           | 10 ++++
>>  10 files changed, 109 insertions(+), 48 deletions(-)
>>
>> diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
>> index 011815ddaae3..5bd79d5019e7 100644
>> --- a/drivers/cxl/core/native_ras.c
>> +++ b/drivers/cxl/core/native_ras.c
>> @@ -6,9 +6,96 @@
>>  #include <cxl/event.h>
>>  #include <cxlmem.h>
>>  #include <core/core.h>
>> +#include <cxlpci.h>
>> +
>> +static void cxl_do_recovery(struct pci_dev *pdev)
>> +{
>> +}
>> +
>> +static bool is_cxl_rcd(struct pci_dev *pdev)
>> +{
>> +	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END)
>> +		return false;
>> +
>> +	/*
>> +	 * The capability, status, and control fields in Device 0,
>> +	 * Function 0 DVSEC control the CXL functionality of the
>> +	 * entire device (CXL 3.2, 8.1.3).
>> +	 */
>> +	if (pdev->devfn != PCI_DEVFN(0, 0))
>> +		return false;
>> +
>> +	/*
>> +	 * CXL Memory Devices must have the 502h class code set (CXL
> Short wrap.
Ok.
>> +	 * 3.2, 8.1.12.1).
>> +	 */
>> +	if (FIELD_GET(PCI_CLASS_CODE_MASK, pdev->class) != PCI_CLASS_MEMORY_CXL)
>> +		return false;
>> +
>> +	return true;
>
> If this isn't going to get more complex
>
> 	return FIELD_GET(...)
Good idea. I overlooked this.

>> +}
>> +
>> +static int cxl_rch_handle_error_iter(struct pci_dev *pdev, void *data)
>> +{
>> +	struct cxl_proto_error_info *err_info = data;
>> +
>> +	guard(device)(&pdev->dev);
>> +
>> +	if (!is_cxl_rcd(pdev) || !cxl_pci_drv_bound(pdev))
>> +		return 0;
>> +
>> +	if (err_info->severity == AER_CORRECTABLE)
>> +		cxl_cor_error_detected(pdev);
>> +	else
>> +		cxl_error_detected(pdev, pci_channel_io_frozen);
>> +
>> +	return 1;
>> +}
>> +
>> +static void cxl_handle_proto_error(struct cxl_proto_error_info *err_info)
>> +{
>> +	struct pci_dev *pdev __free(pci_dev_put) =
>> +		pci_get_domain_bus_and_slot(err_info->segment,
>> +					    err_info->bus,
>> +					    err_info->devfn);
>> +
>> +	if (!pdev) {
>> +		pr_err("Failed to find the CXL device (SBDF=%x:%x:%x:%x)\n",
>> +		       err_info->segment, err_info->bus, PCI_SLOT(err_info->devfn),
>> +		       PCI_FUNC(err_info->devfn));
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * Internal errors of an RCEC indicate an AER error in an
>> +	 * RCH's downstream port. Check and handle them in the CXL.mem
>> +	 * device driver.
> I don't think the reference here to the CXL.mem driver is that helpful
> given the code is immediate above. Maybe move the comment?
>
Ok.

- Terry

>> +	 */
>> +	if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_EC)
>> +		return pcie_walk_rcec(pdev, cxl_rch_handle_error_iter, err_info);
>> +
>> +	if (err_info->severity == AER_CORRECTABLE) {
>> +		int aer = pdev->aer_cap;
>> +
>> +		if (aer)
>> +			pci_clear_and_set_config_dword(pdev,
>> +						       aer + PCI_ERR_COR_STATUS,
>> +						       0, PCI_ERR_COR_INTERNAL);
>> +
>> +		cxl_cor_error_detected(pdev);
>> +
>> +		pcie_clear_device_status(pdev);
>> +	} else {
>> +		cxl_do_recovery(pdev);
>> +	}
>> +}
>


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 06/17] PCI/AER: Dequeue forwarded CXL error
  2025-07-01 23:04   ` Dave Jiang
@ 2025-07-02 17:56     ` Bowman, Terry
  2025-07-03 10:11       ` Jonathan Cameron
  0 siblings, 1 reply; 82+ messages in thread
From: Bowman, Terry @ 2025-07-02 17:56 UTC (permalink / raw)
  To: Dave Jiang, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci



On 7/1/2025 6:04 PM, Dave Jiang wrote:
>
> On 6/26/25 3:42 PM, Terry Bowman wrote:
>> The AER driver is now designed to forward CXL protocol errors to the CXL
>> driver. Update the CXL driver with functionality to dequeue the forwarded
>> CXL error from the kfifo. Also, update the CXL driver to begin the protocol
>> error handling processing using the work received from the FIFO.
>>
>> Introduce function cxl_proto_err_work_fn() to dequeue work forwarded by the
>> AER service driver. This will begin the CXL protocol error processing with
>> a call to cxl_handle_proto_error().
>>
>> Update cxl/core/native_ras.c by adding cxl_rch_handle_error_iter() that was
>> previously in the AER driver. Add check that Endpoint is bound to a CXL
>> driver.
>>
>> Introduce logic to take the SBDF values from 'struct cxl_proto_error_info'
>> and use in discovering the erring PCI device. The call to pci_get_domain_bus_and_slot()
>> will return a reference counted 'struct pci_dev *'. This will serve as
>> reference count to prevent releasing the CXL Endpoint's mapped RAS while
>> handling the error. Use scope base __free() to put the reference count.
>> This will change when adding support for CXL port devices in the future.
>>
>> Implement cxl_handle_proto_error() to differentiate between Restricted CXL
>> Host (RCH) protocol errors and CXL virtual host (VH) protocol errors. RCH
>> errors will be processed with a call to walk the associated Root Complex
>> Event Collector's (RCEC) secondary bus looking for the Root Complex
>> Integrated Endpoint (RCiEP) to handle the RCH error. Export pcie_walk_rcec()
>> so the CXL driver can walk the RCEC's downstream bus, searching for the
>> RCiEP.
>>
>> VH correctable error (CE) processing will call the CXL CE handler. VH
>> uncorrectable errors (UCE) will call cxl_do_recovery(), implemented as a
>> stub for now and to be updated in future patch. Export pci_aer_clean_fatal_status()
>> and pci_clean_device_status() used to clean up AER status after handling.
>>
>> Maintain the locking logic found in the original AER driver. Replace the
>> existing device_lock() in cxl_rch_handle_error_iter() to use guard(device)
>> lock for maintainability. CE errors did not include locking in previous driver
>> implementation. Leave the updated CE handling path as-is.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Couple minor comments below. Otherwise
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Thanks Dave.
>> ---
>>  drivers/cxl/core/native_ras.c | 87 +++++++++++++++++++++++++++++++++++
>>  drivers/cxl/cxlpci.h          |  1 +
>>  drivers/cxl/pci.c             |  6 +++
>>  drivers/pci/pci.c             |  1 +
>>  drivers/pci/pci.h             |  7 ---
>>  drivers/pci/pcie/aer.c        |  1 +
>>  drivers/pci/pcie/cxl_aer.c    | 41 -----------------
>>  drivers/pci/pcie/rcec.c       |  1 +
>>  include/linux/aer.h           |  2 +
>>  include/linux/pci.h           | 10 ++++
>>  10 files changed, 109 insertions(+), 48 deletions(-)
>>
>> diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
>> index 011815ddaae3..5bd79d5019e7 100644
>> --- a/drivers/cxl/core/native_ras.c
>> +++ b/drivers/cxl/core/native_ras.c
>> @@ -6,9 +6,96 @@
>>  #include <cxl/event.h>
>>  #include <cxlmem.h>
>>  #include <core/core.h>
>> +#include <cxlpci.h>
>> +
>> +static void cxl_do_recovery(struct pci_dev *pdev)
>> +{
>> +}
>> +
>> +static bool is_cxl_rcd(struct pci_dev *pdev)
>> +{
>> +	if (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END)
>> +		return false;
>> +
>> +	/*
>> +	 * The capability, status, and control fields in Device 0,
>> +	 * Function 0 DVSEC control the CXL functionality of the
>> +	 * entire device (CXL 3.2, 8.1.3).
>> +	 */
>> +	if (pdev->devfn != PCI_DEVFN(0, 0))
>> +		return false;
>> +
>> +	/*
>> +	 * CXL Memory Devices must have the 502h class code set (CXL
>> +	 * 3.2, 8.1.12.1).
>> +	 */
>> +	if (FIELD_GET(PCI_CLASS_CODE_MASK, pdev->class) != PCI_CLASS_MEMORY_CXL)
>> +		return false;
>> +
>> +	return true;
> return FIELD_GET(PCI_CLASS_CODE_MASK, pdev->class) == PCI_CLASS_MEMORY_CXL;
Ok.
>> +}
>> +
>> +static int cxl_rch_handle_error_iter(struct pci_dev *pdev, void *data)
>> +{
>> +	struct cxl_proto_error_info *err_info = data;
>> +
>> +	guard(device)(&pdev->dev);
>> +
>> +	if (!is_cxl_rcd(pdev) || !cxl_pci_drv_bound(pdev))
>> +		return 0;
>> +
>> +	if (err_info->severity == AER_CORRECTABLE)
>> +		cxl_cor_error_detected(pdev);
>> +	else
>> +		cxl_error_detected(pdev, pci_channel_io_frozen);
>> +
>> +	return 1;
>> +}
>> +
>> +static void cxl_handle_proto_error(struct cxl_proto_error_info *err_info)
>> +{
>> +	struct pci_dev *pdev __free(pci_dev_put) =
>> +		pci_get_domain_bus_and_slot(err_info->segment,
>> +					    err_info->bus,
>> +					    err_info->devfn);
>> +
>> +	if (!pdev) {
>> +		pr_err("Failed to find the CXL device (SBDF=%x:%x:%x:%x)\n",
>> +		       err_info->segment, err_info->bus, PCI_SLOT(err_info->devfn),
>> +		       PCI_FUNC(err_info->devfn));
>> +		return;
>> +	}
>> +
>> +	/*
>> +	 * Internal errors of an RCEC indicate an AER error in an
>> +	 * RCH's downstream port. Check and handle them in the CXL.mem
>> +	 * device driver.
>> +	 */
>> +	if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_EC)
>> +		return pcie_walk_rcec(pdev, cxl_rch_handle_error_iter, err_info);
>> +
>> +	if (err_info->severity == AER_CORRECTABLE) {
>> +		int aer = pdev->aer_cap;
>> +
>> +		if (aer)
>> +			pci_clear_and_set_config_dword(pdev,
>> +						       aer + PCI_ERR_COR_STATUS,
>> +						       0, PCI_ERR_COR_INTERNAL);
>> +
>> +		cxl_cor_error_detected(pdev);
>> +
>> +		pcie_clear_device_status(pdev);
>> +	} else {
>> +		cxl_do_recovery(pdev);
>> +	}
>> +}
>>  
>>  static void cxl_proto_err_work_fn(struct work_struct *work)
>>  {
>> +	struct cxl_proto_err_work_data wd;
>> +
>> +	while (cxl_proto_err_kfifo_get(&wd))
>> +		cxl_handle_proto_error(&wd.err_info);
>>  }
>>  
>>  static struct work_struct cxl_proto_err_work;
>> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
>> index 6f1396ef7b77..ed3c9701b79f 100644
>> --- a/drivers/cxl/cxlpci.h
>> +++ b/drivers/cxl/cxlpci.h
>> @@ -136,4 +136,5 @@ void read_cdat_data(struct cxl_port *port);
>>  void cxl_cor_error_detected(struct pci_dev *pdev);
>>  pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>>  				    pci_channel_state_t state);
>> +bool cxl_pci_drv_bound(struct pci_dev *pdev);
>>  #endif /* __CXL_PCI_H__ */
>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>> index bd100ac31672..cae049f9ae3e 100644
>> --- a/drivers/cxl/pci.c
>> +++ b/drivers/cxl/pci.c
>> @@ -1131,6 +1131,12 @@ static struct pci_driver cxl_pci_driver = {
>>  	},
>>  };
>>  
>> +bool cxl_pci_drv_bound(struct pci_dev *pdev)
>> +{
>> +	return (pdev->driver == &cxl_pci_driver);
> Maybe () not needed?
>
> DJ
Ok.

-Terry
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_pci_drv_bound, "CXL");
>> +
>>  #define CXL_EVENT_HDR_FLAGS_REC_SEVERITY GENMASK(1, 0)
>>  static void cxl_handle_cper_event(enum cxl_event_type ev_type,
>>  				  struct cxl_cper_event_rec *rec)
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index e9448d55113b..8d78d882bf78 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -2328,6 +2328,7 @@ void pcie_clear_device_status(struct pci_dev *dev)
>>  	pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &sta);
>>  	pcie_capability_write_word(dev, PCI_EXP_DEVSTA, sta);
>>  }
>> +EXPORT_SYMBOL_NS_GPL(pcie_clear_device_status, "CXL");
>>  #endif
>>  
>>  /**
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index 29c11c7136d3..c7fc86d93bea 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -671,16 +671,10 @@ static inline bool pci_dpc_recovered(struct pci_dev *pdev) { return false; }
>>  void pci_rcec_init(struct pci_dev *dev);
>>  void pci_rcec_exit(struct pci_dev *dev);
>>  void pcie_link_rcec(struct pci_dev *rcec);
>> -void pcie_walk_rcec(struct pci_dev *rcec,
>> -		    int (*cb)(struct pci_dev *, void *),
>> -		    void *userdata);
>>  #else
>>  static inline void pci_rcec_init(struct pci_dev *dev) { }
>>  static inline void pci_rcec_exit(struct pci_dev *dev) { }
>>  static inline void pcie_link_rcec(struct pci_dev *rcec) { }
>> -static inline void pcie_walk_rcec(struct pci_dev *rcec,
>> -				  int (*cb)(struct pci_dev *, void *),
>> -				  void *userdata) { }
>>  #endif
>>  
>>  #ifdef CONFIG_PCI_ATS
>> @@ -1022,7 +1016,6 @@ void pci_restore_aer_state(struct pci_dev *dev);
>>  static inline void pci_no_aer(void) { }
>>  static inline void pci_aer_init(struct pci_dev *d) { }
>>  static inline void pci_aer_exit(struct pci_dev *d) { }
>> -static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
>>  static inline int pci_aer_clear_status(struct pci_dev *dev) { return -EINVAL; }
>>  static inline int pci_aer_raw_clear_status(struct pci_dev *dev) { return -EINVAL; }
>>  static inline void pci_save_aer_state(struct pci_dev *dev) { }
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 8417a49c71f2..5999d90dfdcb 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -287,6 +287,7 @@ void pci_aer_clear_fatal_status(struct pci_dev *dev)
>>  	if (status)
>>  		pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, status);
>>  }
>> +EXPORT_SYMBOL_GPL(pci_aer_clear_fatal_status);
>>  
>>  /**
>>   * pci_aer_raw_clear_status - Clear AER error registers.
>> diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
>> index 846ab55d747c..939438a7161a 100644
>> --- a/drivers/pci/pcie/cxl_aer.c
>> +++ b/drivers/pci/pcie/cxl_aer.c
>> @@ -80,47 +80,6 @@ bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
>>  	return is_internal_error(info);
>>  }
>>  
>> -static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>> -{
>> -	struct aer_err_info *info = (struct aer_err_info *)data;
>> -	const struct pci_error_handlers *err_handler;
>> -
>> -	if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev))
>> -		return 0;
>> -
>> -	/* Protect dev->driver */
>> -	device_lock(&dev->dev);
>> -
>> -	err_handler = dev->driver ? dev->driver->err_handler : NULL;
>> -	if (!err_handler)
>> -		goto out;
>> -
>> -	if (info->severity == AER_CORRECTABLE) {
>> -		if (err_handler->cor_error_detected)
>> -			err_handler->cor_error_detected(dev);
>> -	} else if (err_handler->error_detected) {
>> -		if (info->severity == AER_NONFATAL)
>> -			err_handler->error_detected(dev, pci_channel_io_normal);
>> -		else if (info->severity == AER_FATAL)
>> -			err_handler->error_detected(dev, pci_channel_io_frozen);
>> -	}
>> -out:
>> -	device_unlock(&dev->dev);
>> -	return 0;
>> -}
>> -
>> -void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>> -{
>> -	/*
>> -	 * Internal errors of an RCEC indicate an AER error in an
>> -	 * RCH's downstream port. Check and handle them in the CXL.mem
>> -	 * device driver.
>> -	 */
>> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
>> -	    is_internal_error(info))
>> -		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>> -}
>> -
>>  static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>>  {
>>  	bool *handles_cxl = data;
>> diff --git a/drivers/pci/pcie/rcec.c b/drivers/pci/pcie/rcec.c
>> index d0bcd141ac9c..fb6cf6449a1d 100644
>> --- a/drivers/pci/pcie/rcec.c
>> +++ b/drivers/pci/pcie/rcec.c
>> @@ -145,6 +145,7 @@ void pcie_walk_rcec(struct pci_dev *rcec, int (*cb)(struct pci_dev *, void *),
>>  
>>  	walk_rcec(walk_rcec_helper, &rcec_data);
>>  }
>> +EXPORT_SYMBOL_NS_GPL(pcie_walk_rcec, "CXL");
>>  
>>  void pci_rcec_init(struct pci_dev *dev)
>>  {
>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>> index 24c3d9e18ad5..0aafcc678e45 100644
>> --- a/include/linux/aer.h
>> +++ b/include/linux/aer.h
>> @@ -76,12 +76,14 @@ struct cxl_proto_err_work_data {
>>  
>>  #if defined(CONFIG_PCIEAER)
>>  int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
>> +void pci_aer_clear_fatal_status(struct pci_dev *dev);
>>  int pcie_aer_is_native(struct pci_dev *dev);
>>  #else
>>  static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
>>  {
>>  	return -EINVAL;
>>  }
>> +static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
>>  static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
>>  #endif
>>  
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index 79878243b681..79326358f641 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -1801,6 +1801,9 @@ extern bool pcie_ports_native;
>>  
>>  int pcie_set_target_speed(struct pci_dev *port, enum pci_bus_speed speed_req,
>>  			  bool use_lt);
>> +void pcie_walk_rcec(struct pci_dev *rcec,
>> +		    int (*cb)(struct pci_dev *, void *),
>> +		    void *userdata);
>>  #else
>>  #define pcie_ports_disabled	true
>>  #define pcie_ports_native	false
>> @@ -1811,8 +1814,15 @@ static inline int pcie_set_target_speed(struct pci_dev *port,
>>  {
>>  	return -EOPNOTSUPP;
>>  }
>> +
>> +static inline void pcie_walk_rcec(struct pci_dev *rcec,
>> +				  int (*cb)(struct pci_dev *, void *),
>> +				  void *userdata) { }
>> +
>>  #endif
>>  
>> +void pcie_clear_device_status(struct pci_dev *dev);
>> +
>>  #define PCIE_LINK_STATE_L0S		(BIT(0) | BIT(1)) /* Upstr/dwnstr L0s */
>>  #define PCIE_LINK_STATE_L1		BIT(2)	/* L1 state */
>>  #define PCIE_LINK_STATE_L1_1		BIT(3)	/* ASPM L1.1 state */


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
  2025-07-02 16:21     ` Bowman, Terry
@ 2025-07-02 19:54       ` Dan Carpenter
  2025-07-02 19:57         ` Bowman, Terry
  2025-07-03 10:06       ` Jonathan Cameron
  1 sibling, 1 reply; 82+ messages in thread
From: Dan Carpenter @ 2025-07-02 19:54 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: Jonathan Cameron, dave, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel, linux-pci

On Wed, Jul 02, 2025 at 11:21:20AM -0500, Bowman, Terry wrote:
> 
> 
> On 6/27/2025 5:24 AM, Jonathan Cameron wrote:
> > On Thu, 26 Jun 2025 17:42:40 -0500
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >
> >> CXL error handling will soon be moved from the AER driver into the CXL
> >> driver. This requires a notification mechanism for the AER driver to share
> >> the AER interrupt with the CXL driver. The notification will be used
> >> as an indication for the CXL drivers to handle and log the CXL RAS errors.
> >>
> >> First, introduce cxl/core/native_ras.c to contain changes for the CXL
> >> driver's RAS native handling. This as an alternative to dropping the
> >> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
> >> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
> >> conditionally compile the new file.
> >>
> >> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
> >> driver will be the sole kfifo producer adding work and the cxl_core will be
> >> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
> >>
> >> Add CXL work queue handler registration functions in the AER driver. Export
> >> the functions allowing CXL driver to access. Implement registration
> >> functions for the CXL driver to assign or clear the work handler function.
> >>
> >> Introduce 'struct cxl_proto_err_info' to serve as the kfifo work data. This
> >> will contain the erring device's PCI SBDF details used to rediscover the
> >> device after the CXL driver dequeues the kfifo work. The device rediscovery
> >> will be introduced along with the CXL handling in future patches.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> > Hi Terry,
> >
> > Whilst it obviously makes patch preparation a bit more time consuming
> > for series like this with many patches it can be useful to add a brief
> > change log to the individual patches as well as the cover letter.
> > That helps reviewers figure out where they need to look again.
> >
> > A few trivial things inline.
> >
> > With those fixed up
> > Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> >
> > Jonathan
> 
> Hi Jonathan,
> 
> Do you have an example you can point me to with a change log in the
> individual patch? I want to make certain I change correctly.
> 

https://staticthinking.wordpress.com/2022/07/27/how-to-send-a-v2-patch/

Just put a:
---
v2: white space changes

or whatever.

regards,
dan carpenter


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
  2025-07-02 19:54       ` Dan Carpenter
@ 2025-07-02 19:57         ` Bowman, Terry
  0 siblings, 0 replies; 82+ messages in thread
From: Bowman, Terry @ 2025-07-02 19:57 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Jonathan Cameron, dave, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel, linux-pci



On 7/2/2025 2:54 PM, Dan Carpenter wrote:
> On Wed, Jul 02, 2025 at 11:21:20AM -0500, Bowman, Terry wrote:
>>
>> On 6/27/2025 5:24 AM, Jonathan Cameron wrote:
>>> On Thu, 26 Jun 2025 17:42:40 -0500
>>> Terry Bowman <terry.bowman@amd.com> wrote:
>>>
>>>> CXL error handling will soon be moved from the AER driver into the CXL
>>>> driver. This requires a notification mechanism for the AER driver to share
>>>> the AER interrupt with the CXL driver. The notification will be used
>>>> as an indication for the CXL drivers to handle and log the CXL RAS errors.
>>>>
>>>> First, introduce cxl/core/native_ras.c to contain changes for the CXL
>>>> driver's RAS native handling. This as an alternative to dropping the
>>>> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
>>>> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
>>>> conditionally compile the new file.
>>>>
>>>> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
>>>> driver will be the sole kfifo producer adding work and the cxl_core will be
>>>> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
>>>>
>>>> Add CXL work queue handler registration functions in the AER driver. Export
>>>> the functions allowing CXL driver to access. Implement registration
>>>> functions for the CXL driver to assign or clear the work handler function.
>>>>
>>>> Introduce 'struct cxl_proto_err_info' to serve as the kfifo work data. This
>>>> will contain the erring device's PCI SBDF details used to rediscover the
>>>> device after the CXL driver dequeues the kfifo work. The device rediscovery
>>>> will be introduced along with the CXL handling in future patches.
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>> Hi Terry,
>>>
>>> Whilst it obviously makes patch preparation a bit more time consuming
>>> for series like this with many patches it can be useful to add a brief
>>> change log to the individual patches as well as the cover letter.
>>> That helps reviewers figure out where they need to look again.
>>>
>>> A few trivial things inline.
>>>
>>> With those fixed up
>>> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>>>
>>> Jonathan
>> Hi Jonathan,
>>
>> Do you have an example you can point me to with a change log in the
>> individual patch? I want to make certain I change correctly.
>>
> https://staticthinking.wordpress.com/2022/07/27/how-to-send-a-v2-patch/
>
> Just put a:
> ---
> v2: white space changes
>
> or whatever.
>
> regards,
> dan carpenter
>
Thanks Dan Carpenter.

-Terry

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 07/17] CXL/PCI: Introduce CXL uncorrectable protocol error recovery
  2025-06-27 11:05   ` Jonathan Cameron
@ 2025-07-02 21:06     ` Bowman, Terry
  0 siblings, 0 replies; 82+ messages in thread
From: Bowman, Terry @ 2025-07-02 21:06 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, linux-pci



On 6/27/2025 6:05 AM, Jonathan Cameron wrote:
> On Thu, 26 Jun 2025 17:42:42 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> Create cxl_do_recovery() to provide uncorrectable protocol error (UCE)
>> handling. Follow similar design as found in PCIe error driver,
>> pcie_do_recovery(). One difference is cxl_do_recovery() will treat all UCEs
>> as fatal with a kernel panic. This is to prevent corruption on CXL memory.
>>
>> Export the PCI error driver's merge_result() to CXL namespace.
> I think this may be a confusion from earlier review.  Anyhow, it should
> be namespaced in the sense of not exporting something the vague name of
> merge_result but it's PCI code, not CXL code and we don't have the dangerous
> interface argument to justify putting it in the CXL namespace so I think
> a namespaced EXPORT makes little sense for this one.
>
> Jonathan
>
>
>> Introduce
>> PCI_ERS_RESULT_PANIC and add support in merge_result() routine. This will
>> be used by CXL to panic the system in the case of uncorrectable protocol
>> errors. PCI error handling is not currently expected to use the
>> PCI_ERS_RESULT_PANIC.
>>
>> Copy pci_walk_bridge() to cxl_walk_bridge(). Make a change to walk the
>> first device in all cases.
>>
>> Copy the PCI error driver's report_error_detected() to cxl_report_error_detected().
>> Note, only CXL Endpoints and RCH Downstream Ports(RCH DSP) are currently
>> supported. Add locking for PCI device as done in PCI's report_error_detected().
>> This is necessary to prevent the RAS registers from disappearing before
>> logging is completed.
>>
>> Call panic() to halt the system in the case of uncorrectable errors (UCE)
>> in cxl_do_recovery(). Export pci_aer_clear_fatal_status() for CXL to use
>> if a UCE is not found. In this case the AER status must be cleared and
>> uses pci_aer_clear_fatal_status().
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index de6381c690f5..63fceb3e8613 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -21,9 +21,12 @@
>>  #include "portdrv.h"
>>  #include "../pci.h"
>>  
>> -static pci_ers_result_t merge_result(enum pci_ers_result orig,
>> -				  enum pci_ers_result new)
>> +pci_ers_result_t merge_result(enum pci_ers_result orig,
>> +			      enum pci_ers_result new)
>>  {
>> +	if (new == PCI_ERS_RESULT_PANIC)
>> +		return PCI_ERS_RESULT_PANIC;
>> +
>>  	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
>>  		return PCI_ERS_RESULT_NO_AER_DRIVER;
>>  
>> @@ -45,6 +48,7 @@ static pci_ers_result_t merge_result(enum pci_ers_result orig,
>>  
>>  	return orig;
>>  }
>> +EXPORT_SYMBOL_NS_GPL(merge_result, "CXL"); 
> Do we care about namespacing this?  I think not given it is PCIe code
> and hardly destructive for other drivers to mess with it if they like.
>
> I would namespace it in the sense of renaming it to make it clear
> it's about pci errors though.
>
> pci_ers_merge_result() perhaps?
>
> Do that as a percursor patch.
>

Good idea. There is a lot of changes related to just exporting this and changing
the name. I've changed the namespace export to be:

EXPORT_SYMBOL(pci_ers_merge_result);

I moved this and its related required changes into an earlier patch.

-Terry

>>  
>>  static int report_error_detected(struct pci_dev *dev,
>>  				 pci_channel_state_t state,


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 07/17] CXL/PCI: Introduce CXL uncorrectable protocol error recovery
  2025-06-27 12:27   ` Shiju Jose
@ 2025-07-02 21:34     ` Bowman, Terry
  0 siblings, 0 replies; 82+ messages in thread
From: Bowman, Terry @ 2025-07-02 21:34 UTC (permalink / raw)
  To: Shiju Jose, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, alison.schofield@intel.com,
	dan.j.williams@intel.com, bhelgaas@google.com,
	ming.li@zohomail.com, Smita.KoralahalliChannabasappa@amd.com,
	rrichter@amd.com, dan.carpenter@linaro.org,
	PradeepVineshReddy.Kodamati@amd.com, lukas@wunner.de,
	Benjamin.Cheatham@amd.com,
	sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-cxl@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org



On 6/27/2025 7:27 AM, Shiju Jose wrote:
>> -----Original Message-----
>> From: Terry Bowman <terry.bowman@amd.com>
>> Sent: 26 June 2025 23:43
>> To: dave@stgolabs.net; Jonathan Cameron <jonathan.cameron@huawei.com>;
>> dave.jiang@intel.com; alison.schofield@intel.com; dan.j.williams@intel.com;
>> bhelgaas@google.com; Shiju Jose <shiju.jose@huawei.com>;
>> ming.li@zohomail.com; Smita.KoralahalliChannabasappa@amd.com;
>> rrichter@amd.com; dan.carpenter@linaro.org;
>> PradeepVineshReddy.Kodamati@amd.com; lukas@wunner.de;
>> Benjamin.Cheatham@amd.com;
>> sathyanarayanan.kuppuswamy@linux.intel.com; terry.bowman@amd.com;
>> linux-cxl@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org
>> Subject: [PATCH v10 07/17] CXL/PCI: Introduce CXL uncorrectable protocol error
>> recovery
>>
>> Create cxl_do_recovery() to provide uncorrectable protocol error (UCE)
>> handling. Follow similar design as found in PCIe error driver,
>> pcie_do_recovery(). One difference is cxl_do_recovery() will treat all UCEs as
>> fatal with a kernel panic. This is to prevent corruption on CXL memory.
>>
>> Export the PCI error driver's merge_result() to CXL namespace. Introduce
>> PCI_ERS_RESULT_PANIC and add support in merge_result() routine. This will be
>> used by CXL to panic the system in the case of uncorrectable protocol errors. PCI
>> error handling is not currently expected to use the PCI_ERS_RESULT_PANIC.
>>
>> Copy pci_walk_bridge() to cxl_walk_bridge(). Make a change to walk the first
>> device in all cases.
>>
>> Copy the PCI error driver's report_error_detected() to
>> cxl_report_error_detected().
>> Note, only CXL Endpoints and RCH Downstream Ports(RCH DSP) are currently
>> supported. Add locking for PCI device as done in PCI's report_error_detected().
>> This is necessary to prevent the RAS registers from disappearing before logging
>> is completed.
>>
>> Call panic() to halt the system in the case of uncorrectable errors (UCE) in
>> cxl_do_recovery(). Export pci_aer_clear_fatal_status() for CXL to use if a UCE is
>> not found. In this case the AER status must be cleared and uses
>> pci_aer_clear_fatal_status().
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>> drivers/cxl/core/native_ras.c | 44 +++++++++++++++++++++++++++++++++++
>> drivers/pci/pcie/cxl_aer.c    |  3 ++-
>> drivers/pci/pcie/err.c        |  8 +++++--
>> include/linux/aer.h           | 11 +++++++++
>> include/linux/pci.h           |  3 +++
>> 5 files changed, 66 insertions(+), 3 deletions(-)
>>
> [...]
>> void pci_print_aer(struct pci_dev *dev, int aer_severity, diff --git
>> a/include/linux/pci.h b/include/linux/pci.h index 79326358f641..16a8310e0373
>> 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -868,6 +868,9 @@ enum pci_ers_result {
>>
>> 	/* No AER capabilities registered for the driver */
>> 	PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6,
>> +
>> +	/* System is unstable, panic. Is CXL specific  */
>> +	PCI_ERS_RESULT_PANIC = (__force pci_ers_result_t) 7,
> Extra space is present after casting?
>> };

Hi Shiju,

I see the existing PCIE_ERS_RESULT entries have a space before the number.
For example,

PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6,
                                                         ^

I do see that I had an extra space in my comment that I will fix.
Please let me know if you agree or if I'm missing something?

-Terry
>>
>> /* PCI bus error event callbacks */
>> --
>> 2.34.1


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 09/17] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
  2025-06-27 11:17   ` Jonathan Cameron
@ 2025-07-02 21:41     ` Bowman, Terry
  0 siblings, 0 replies; 82+ messages in thread
From: Bowman, Terry @ 2025-07-02 21:41 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, linux-pci



On 6/27/2025 6:17 AM, Jonathan Cameron wrote:
> On Thu, 26 Jun 2025 17:42:44 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> CXL Endpoint (EP) Ports may include Root Ports (RP) or Downstream Switch
>> Ports (DSP). CXL RPs and DSPs contain RAS registers that require memory
>> mapping to enable RAS logging. This initialization is currently missing and
>> must be added for CXL RPs and DSPs.
>>
>> Update cxl_dport_init_ras_reporting() to support RP and DSP RAS mapping.
>> Add alongside the existing Restricted CXL Host Downstream Port RAS mapping.
>>
>> Update cxl_endpoint_port_probe() to invoke cxl_dport_init_ras_reporting().
>> This will initiate the RAS mapping for CXL RPs and DSPs when each CXL EP is
>> created and added to the EP port.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> One trivial comment inline.  I'm not super confident that I follow exactly
> what is going on here so more eyes needed.  However I think it's fine.
>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

Thanks.
>> diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
>> index 021f35145c65..b52f82925891 100644
>> --- a/drivers/cxl/port.c
>> +++ b/drivers/cxl/port.c
>>  
>> +static void cxl_switch_port_init_ras(struct cxl_port *port)
>> +{
>> +	if (is_cxl_root(to_cxl_port(port->dev.parent)))
>> +		return;
>> +
>> +	/* May have upstream DSP or RP */
>> +	if (port->parent_dport && dev_is_pci(port->parent_dport->dport_dev)) {
>  A lot of port->parent_dport in here. Maybe a local variable for that with
> a suitable name to describe that its the next port in the upstream direction.
Ok, I'll add a local pointer to make it more readable.

-Terry
>> +		struct pci_dev *pdev = to_pci_dev(port->parent_dport->dport_dev);
>> +
>> +		if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
>> +		    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM))
>> +			cxl_dport_init_ras_reporting(port->parent_dport, &port->dev);
>> +	}
>> +
>> +	cxl_uport_init_ras_reporting(port, &port->dev);
>> +}
>> +
>> +static void cxl_endpoint_port_init_ras(struct cxl_port *port)
>> +{
>> +	struct cxl_dport *dport;
>> +	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
>> +	struct cxl_port *parent_port __free(put_cxl_port) =
>> +		cxl_mem_find_port(cxlmd, &dport);
>> +
>> +	if (!dport || !dev_is_pci(dport->dport_dev)) {
>> +		dev_err(&port->dev, "CXL port topology not found\n");
>> +		return;
>> +	}
>> +
>> +	cxl_dport_init_ras_reporting(dport, cxlmd->cxlds->dev);
>> +}
>>


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 12/17] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
  2025-06-27 12:22   ` Shiju Jose
  2025-07-02  1:18     ` Alison Schofield
@ 2025-07-02 21:56     ` Bowman, Terry
  1 sibling, 0 replies; 82+ messages in thread
From: Bowman, Terry @ 2025-07-02 21:56 UTC (permalink / raw)
  To: Shiju Jose, dave@stgolabs.net, Jonathan Cameron,
	dave.jiang@intel.com, alison.schofield@intel.com,
	dan.j.williams@intel.com, bhelgaas@google.com,
	ming.li@zohomail.com, Smita.KoralahalliChannabasappa@amd.com,
	rrichter@amd.com, dan.carpenter@linaro.org,
	PradeepVineshReddy.Kodamati@amd.com, lukas@wunner.de,
	Benjamin.Cheatham@amd.com,
	sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-cxl@vger.kernel.org
  Cc: linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org



On 6/27/2025 7:22 AM, Shiju Jose wrote:
>> -----Original Message-----
>> From: Terry Bowman <terry.bowman@amd.com>
>> Sent: 26 June 2025 23:43
>> To: dave@stgolabs.net; Jonathan Cameron <jonathan.cameron@huawei.com>;
>> dave.jiang@intel.com; alison.schofield@intel.com; dan.j.williams@intel.com;
>> bhelgaas@google.com; Shiju Jose <shiju.jose@huawei.com>;
>> ming.li@zohomail.com; Smita.KoralahalliChannabasappa@amd.com;
>> rrichter@amd.com; dan.carpenter@linaro.org;
>> PradeepVineshReddy.Kodamati@amd.com; lukas@wunner.de;
>> Benjamin.Cheatham@amd.com;
>> sathyanarayanan.kuppuswamy@linux.intel.com; terry.bowman@amd.com;
>> linux-cxl@vger.kernel.org
>> Cc: linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org
>> Subject: [PATCH v10 12/17] cxl/pci: Unify CXL trace logging for CXL Endpoints
>> and CXL Ports
>>
>> CXL currently has separate trace routines for CXL Port errors and CXL Endpoint
>> errors. This is inconvenient for the user because they must enable
>> 2 sets of trace routines. Make updates to the trace logging such that a single
>> trace routine logs both CXL Endpoint and CXL Port protocol errors.
>>
>> Keep the trace log fields 'memdev' and 'host'. While these are not accurate for
>> non-Endpoints the fields will remain as-is to prevent breaking userspace RAS
>> trace consumers.
>>
>> Add serial number parameter to the trace logging. This is used for EPs and 0 is
>> provided for CXL port devices without a serial number.
>>
>> Below is output of correctable and uncorrectable protocol error logging.
>> CXL Root Port and CXL Endpoint examples are included below.
>>
>> Root Port:
>> cxl_aer_correctable_error: memdev=0000:0c:00.0 host=pci0000:0c serial: 0
>> status='CRC Threshold Hit'
>> cxl_aer_uncorrectable_error: memdev=0000:0c:00.0 host=pci0000:0c serial: 0
>> status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity
>> Error'
>>
>> Endpoint:
>> cxl_aer_correctable_error: memdev=mem3 host=0000:0f:00.0 serial=0
>> status='CRC Threshold Hit'
>> cxl_aer_uncorrectable_error: memdev=mem3 host=0000:0f:00.0 serial: 0 status:
>> 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>> drivers/cxl/core/pci.c   | 19 ++++-----
>> drivers/cxl/core/ras.c   | 14 ++++---
>> drivers/cxl/core/trace.h | 84 +++++++++-------------------------------
>> 3 files changed, 37 insertions(+), 80 deletions(-)
>>
> [...]
>> static void cxl_cper_handle_prot_err(struct cxl_cper_prot_err_work_data
>> *data) diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h index
>> 25ebfbc1616c..494d6db461a7 100644
>> --- a/drivers/cxl/core/trace.h
>> +++ b/drivers/cxl/core/trace.h
>> @@ -48,49 +48,22 @@
>> 	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
>> )
>>
>> -TRACE_EVENT(cxl_port_aer_uncorrectable_error,
>> -	TP_PROTO(struct device *dev, u32 status, u32 fe, u32 *hl),
>> -	TP_ARGS(dev, status, fe, hl),
>> -	TP_STRUCT__entry(
>> -		__string(device, dev_name(dev))
>> -		__string(host, dev_name(dev->parent))
>> -		__field(u32, status)
>> -		__field(u32, first_error)
>> -		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
>> -	),
>> -	TP_fast_assign(
>> -		__assign_str(device);
>> -		__assign_str(host);
>> -		__entry->status = status;
>> -		__entry->first_error = fe;
>> -		/*
>> -		 * Embed the 512B headerlog data for user app retrieval and
>> -		 * parsing, but no need to print this in the trace buffer.
>> -		 */
>> -		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
>> -	),
>> -	TP_printk("device=%s host=%s status: '%s' first_error: '%s'",
>> -		  __get_str(device), __get_str(host),
>> -		  show_uc_errs(__entry->status),
>> -		  show_uc_errs(__entry->first_error)
>> -	)
>> -);
>> -
>> TRACE_EVENT(cxl_aer_uncorrectable_error,
>> -	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32
>> *hl),
>> -	TP_ARGS(cxlmd, status, fe, hl),
>> +	TP_PROTO(struct device *dev, u64 serial, u32 status, u32 fe,
>> +		 u32 *hl),
>> +	TP_ARGS(dev, serial, status, fe, hl),
>> 	TP_STRUCT__entry(
>> -		__string(memdev, dev_name(&cxlmd->dev))
>> -		__string(host, dev_name(cxlmd->dev.parent))
>> +		__string(name, dev_name(dev))
>> +		__string(parent, dev_name(dev->parent))
> Hi Terry,
>
> Thanks for considering the feedback given in v9 regarding the compatibility issue
> with the rasdaemon.
> https://lore.kernel.org/all/959acc682e6e4b52ac0283b37ee21026@huawei.com/
>
> Probably some confusion w.r.t the feedback.
> Unfortunately  TP_printk(...) is not an ABI that we need to keep stable, 
> it's this structure, TP_STRUCT__entry(..) , that matters to the rasdaemon.
>
>> 		

Oh. Apologies, I didn't realize TP_STRUCT was an ABI requirement. I will change back
the TP_STRUCT as well.

-Terry

>> __field(u64, serial)
>> 		__field(u32, status)
>> 		__field(u32, first_error)
>> 		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
>> 	),
>> 	TP_fast_assign(
>> -		__assign_str(memdev);
>> -		__assign_str(host);
>> -		__entry->serial = cxlmd->cxlds->serial;
>> +		__assign_str(name);
>> +		__assign_str(parent);
>> +		__entry->serial = serial;
>> 		__entry->status = status;
>> 		__entry->first_error = fe;
>> 		/*
>> @@ -99,8 +72,8 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
>> 		 */
>> 		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
>> 	),
>> -	TP_printk("memdev=%s host=%s serial=%lld: status: '%s' first_error:
>> '%s'",
>> -		  __get_str(memdev), __get_str(host), __entry->serial,
>> +	TP_printk("memdev=%s host=%s serial=%lld status='%s'
>> first_error='%s'",
>> +		  __get_str(name), __get_str(parent), __entry->serial,
>> 		  show_uc_errs(__entry->status),
>> 		  show_uc_errs(__entry->first_error)
>> 	)
>> @@ -124,42 +97,23 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
>> 	{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer"
>> }	\
>> )
>>
>> -TRACE_EVENT(cxl_port_aer_correctable_error,
>> -	TP_PROTO(struct device *dev, u32 status),
>> -	TP_ARGS(dev, status),
>> -	TP_STRUCT__entry(
>> -		__string(device, dev_name(dev))
>> -		__string(host, dev_name(dev->parent))
>> -		__field(u32, status)
>> -	),
>> -	TP_fast_assign(
>> -		__assign_str(device);
>> -		__assign_str(host);
>> -		__entry->status = status;
>> -	),
>> -	TP_printk("device=%s host=%s status='%s'",
>> -		  __get_str(device), __get_str(host),
>> -		  show_ce_errs(__entry->status)
>> -	)
>> -);
>> -
>> TRACE_EVENT(cxl_aer_correctable_error,
>> -	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
>> -	TP_ARGS(cxlmd, status),
>> +	TP_PROTO(struct device *dev, u64 serial, u32 status),
>> +	TP_ARGS(dev, serial, status),
>> 	TP_STRUCT__entry(
>> -		__string(memdev, dev_name(&cxlmd->dev))
>> -		__string(host, dev_name(cxlmd->dev.parent))
>> +		__string(name, dev_name(dev))
>> +		__string(parent, dev_name(dev->parent))
> Same as above.
>> 		__field(u64, serial)
>> 		__field(u32, status)
>> 	),
>> 	TP_fast_assign(
>> -		__assign_str(memdev);
>> -		__assign_str(host);
>> -		__entry->serial = cxlmd->cxlds->serial;
>> +		__assign_str(name);
>> +		__assign_str(parent);
>> +		__entry->serial = serial;
>> 		__entry->status = status;
>> 	),
>> -	TP_printk("memdev=%s host=%s serial=%lld: status: '%s'",
>> -		  __get_str(memdev), __get_str(host), __entry->serial,
>> +	TP_printk("memdev=%s host=%s serial=%lld status='%s'",
>> +		  __get_str(name), __get_str(parent), __entry->serial,
>> 		  show_ce_errs(__entry->status)
>> 	)
>> );
>> --
>> 2.34.1
> Thanks,
> Shiju


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 12/17] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports
  2025-07-02  1:18     ` Alison Schofield
@ 2025-07-02 22:07       ` Bowman, Terry
  0 siblings, 0 replies; 82+ messages in thread
From: Bowman, Terry @ 2025-07-02 22:07 UTC (permalink / raw)
  To: Alison Schofield, Shiju Jose
  Cc: dave@stgolabs.net, Jonathan Cameron, dave.jiang@intel.com,
	dan.j.williams@intel.com, bhelgaas@google.com,
	ming.li@zohomail.com, Smita.KoralahalliChannabasappa@amd.com,
	rrichter@amd.com, dan.carpenter@linaro.org,
	PradeepVineshReddy.Kodamati@amd.com, lukas@wunner.de,
	Benjamin.Cheatham@amd.com,
	sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org



On 7/1/2025 8:18 PM, Alison Schofield wrote:
> On Fri, Jun 27, 2025 at 12:22:39PM +0000, Shiju Jose wrote:
>>> -----Original Message-----
>>> From: Terry Bowman <terry.bowman@amd.com>
>>> Sent: 26 June 2025 23:43
>>> To: dave@stgolabs.net; Jonathan Cameron <jonathan.cameron@huawei.com>;
>>> dave.jiang@intel.com; alison.schofield@intel.com; dan.j.williams@intel.com;
>>> bhelgaas@google.com; Shiju Jose <shiju.jose@huawei.com>;
>>> ming.li@zohomail.com; Smita.KoralahalliChannabasappa@amd.com;
>>> rrichter@amd.com; dan.carpenter@linaro.org;
>>> PradeepVineshReddy.Kodamati@amd.com; lukas@wunner.de;
>>> Benjamin.Cheatham@amd.com;
>>> sathyanarayanan.kuppuswamy@linux.intel.com; terry.bowman@amd.com;
>>> linux-cxl@vger.kernel.org
>>> Cc: linux-kernel@vger.kernel.org; linux-pci@vger.kernel.org
>>> Subject: [PATCH v10 12/17] cxl/pci: Unify CXL trace logging for CXL Endpoints
>>> and CXL Ports
>>>
> big snip -
>
>>> -);
>>> -
>>> TRACE_EVENT(cxl_aer_uncorrectable_error,
>>> -	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32
>>> *hl),
>>> -	TP_ARGS(cxlmd, status, fe, hl),
>>> +	TP_PROTO(struct device *dev, u64 serial, u32 status, u32 fe,
>>> +		 u32 *hl),
>>> +	TP_ARGS(dev, serial, status, fe, hl),
>>> 	TP_STRUCT__entry(
>>> -		__string(memdev, dev_name(&cxlmd->dev))
>>> -		__string(host, dev_name(cxlmd->dev.parent))
>>> +		__string(name, dev_name(dev))
>>> +		__string(parent, dev_name(dev->parent))
>> Hi Terry,
>>
>> Thanks for considering the feedback given in v9 regarding the compatibility issue
>> with the rasdaemon.
>> https://lore.kernel.org/all/959acc682e6e4b52ac0283b37ee21026@huawei.com/
>>
>> Probably some confusion w.r.t the feedback.
>> Unfortunately  TP_printk(...) is not an ABI that we need to keep stable, 
>> it's this structure, TP_STRUCT__entry(..) , that matters to the rasdaemon.
>>
> I'm not so sure you should be letting him off the hook for TP_printk ;)
> It seems TP_printk should be kept aligned w TP_STRUCT_entry(). As a
> user who often looks at TP_printk output, I'd say keep them all in
> sync, and consider them ABI - ie. add to but don't modify.
>
>
I agree. I will keep TP_printk and TP_STRUCT aligned by using 'memdev' and 'host'.
The only change here will be TP_PROTO and will be:
TP_PROTO(struct device *dev, u64 serial, u32 status),

Let me know if that's not ok.

-Terry

>>> 		__field(u64, serial)
>>> 		__field(u32, status)
>>> 		__field(u32, first_error)
>>> 		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
>>> 	),
>>> 	TP_fast_assign(
>>> -		__assign_str(memdev);
>>> -		__assign_str(host);
>>> -		__entry->serial = cxlmd->cxlds->serial;
>>> +		__assign_str(name);
>>> +		__assign_str(parent);
>>> +		__entry->serial = serial;
>>> 		__entry->status = status;
>>> 		__entry->first_error = fe;
>>> 		/*
>>> @@ -99,8 +72,8 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
>>> 		 */
>>> 		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
>>> 	),
>>> -	TP_printk("memdev=%s host=%s serial=%lld: status: '%s' first_error:
>>> '%s'",
>>> -		  __get_str(memdev), __get_str(host), __entry->serial,
>>> +	TP_printk("memdev=%s host=%s serial=%lld status='%s'
>>> first_error='%s'",
>>> +		  __get_str(name), __get_str(parent), __entry->serial,
>>> 		  show_uc_errs(__entry->status),
>>> 		  show_uc_errs(__entry->first_error)
>>> 	)
> snip
>


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
  2025-07-02 16:21     ` Bowman, Terry
  2025-07-02 19:54       ` Dan Carpenter
@ 2025-07-03 10:06       ` Jonathan Cameron
  1 sibling, 0 replies; 82+ messages in thread
From: Jonathan Cameron @ 2025-07-03 10:06 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: dave, dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, linux-pci

On Wed, 2 Jul 2025 11:21:20 -0500
"Bowman, Terry" <terry.bowman@amd.com> wrote:

> On 6/27/2025 5:24 AM, Jonathan Cameron wrote:
> > On Thu, 26 Jun 2025 17:42:40 -0500
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >  
> >> CXL error handling will soon be moved from the AER driver into the CXL
> >> driver. This requires a notification mechanism for the AER driver to share
> >> the AER interrupt with the CXL driver. The notification will be used
> >> as an indication for the CXL drivers to handle and log the CXL RAS errors.
> >>
> >> First, introduce cxl/core/native_ras.c to contain changes for the CXL
> >> driver's RAS native handling. This as an alternative to dropping the
> >> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
> >> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
> >> conditionally compile the new file.
> >>
> >> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
> >> driver will be the sole kfifo producer adding work and the cxl_core will be
> >> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
> >>
> >> Add CXL work queue handler registration functions in the AER driver. Export
> >> the functions allowing CXL driver to access. Implement registration
> >> functions for the CXL driver to assign or clear the work handler function.
> >>
> >> Introduce 'struct cxl_proto_err_info' to serve as the kfifo work data. This
> >> will contain the erring device's PCI SBDF details used to rediscover the
> >> device after the CXL driver dequeues the kfifo work. The device rediscovery
> >> will be introduced along with the CXL handling in future patches.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>  
> > Hi Terry,
> >
> > Whilst it obviously makes patch preparation a bit more time consuming
> > for series like this with many patches it can be useful to add a brief
> > change log to the individual patches as well as the cover letter.
> > That helps reviewers figure out where they need to look again.
> >
> > A few trivial things inline.
> >
> > With those fixed up
> > Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> >
> > Jonathan  
> 
> Hi Jonathan,
> 
> Do you have an example you can point me to with a change log in the
> individual patch? I want to make certain I change correctly.
> 
>  
> >  
> >> ---

You just put it here.  No particular formatting standard for it other than making
sure its after the --- so it doesn't end up in the git log.
https://lore.kernel.org/all/2025070138-vigorous-negative-eae7@gregkh/

Is first example I scrolled down to.  This is a single patch series but
the principle is the same.



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 06/17] PCI/AER: Dequeue forwarded CXL error
  2025-07-02 17:56     ` Bowman, Terry
@ 2025-07-03 10:11       ` Jonathan Cameron
  0 siblings, 0 replies; 82+ messages in thread
From: Jonathan Cameron @ 2025-07-03 10:11 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: Dave Jiang, dave, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, linux-pci

On Wed, 2 Jul 2025 12:56:29 -0500
"Bowman, Terry" <terry.bowman@amd.com> wrote:

> On 7/1/2025 6:04 PM, Dave Jiang wrote:
> >
> > On 6/26/25 3:42 PM, Terry Bowman wrote:  
> >> The AER driver is now designed to forward CXL protocol errors to the CXL
> >> driver. Update the CXL driver with functionality to dequeue the forwarded
> >> CXL error from the kfifo. Also, update the CXL driver to begin the protocol
> >> error handling processing using the work received from the FIFO.
> >>
> >> Introduce function cxl_proto_err_work_fn() to dequeue work forwarded by the
> >> AER service driver. This will begin the CXL protocol error processing with
> >> a call to cxl_handle_proto_error().
> >>
> >> Update cxl/core/native_ras.c by adding cxl_rch_handle_error_iter() that was
> >> previously in the AER driver. Add check that Endpoint is bound to a CXL
> >> driver.
> >>
> >> Introduce logic to take the SBDF values from 'struct cxl_proto_error_info'
> >> and use in discovering the erring PCI device. The call to pci_get_domain_bus_and_slot()
> >> will return a reference counted 'struct pci_dev *'. This will serve as
> >> reference count to prevent releasing the CXL Endpoint's mapped RAS while
> >> handling the error. Use scope base __free() to put the reference count.
> >> This will change when adding support for CXL port devices in the future.
> >>
> >> Implement cxl_handle_proto_error() to differentiate between Restricted CXL
> >> Host (RCH) protocol errors and CXL virtual host (VH) protocol errors. RCH
> >> errors will be processed with a call to walk the associated Root Complex
> >> Event Collector's (RCEC) secondary bus looking for the Root Complex
> >> Integrated Endpoint (RCiEP) to handle the RCH error. Export pcie_walk_rcec()
> >> so the CXL driver can walk the RCEC's downstream bus, searching for the
> >> RCiEP.
> >>
> >> VH correctable error (CE) processing will call the CXL CE handler. VH
> >> uncorrectable errors (UCE) will call cxl_do_recovery(), implemented as a
> >> stub for now and to be updated in future patch. Export pci_aer_clean_fatal_status()
> >> and pci_clean_device_status() used to clean up AER status after handling.
> >>
> >> Maintain the locking logic found in the original AER driver. Replace the
> >> existing device_lock() in cxl_rch_handle_error_iter() to use guard(device)
> >> lock for maintainability. CE errors did not include locking in previous driver
> >> implementation. Leave the updated CE handling path as-is.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>  
> > Couple minor comments below. Otherwise
> > Reviewed-by: Dave Jiang <dave.jiang@intel.com>  
> Thanks Dave.

Hi Terry,

Picking a random patch, another small request process wise.

If you agree with all suggestions in a review, don't reply to that email
just put your thanks and what changed in the change log of the next version.

Skipping that reply cuts down on the volume of emails that need scrolling
through and generally helps people focus on the emails that matter where there
is a question or similar.

This one gets a lot of contributors because it feels rude to not reply
but doing it via the next version is more efficient for everyone!

Jonathan

p.s. I only bother moaning about this to contributors who are sending
     quite a bit of useful stuff!


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 01/17] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
  2025-06-26 22:42 ` [PATCH v10 01/17] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
@ 2025-07-18 17:55   ` Dave Jiang
  2025-07-23 21:58   ` dan.j.williams
  1 sibling, 0 replies; 82+ messages in thread
From: Dave Jiang @ 2025-07-18 17:55 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci



On 6/26/25 3:42 PM, Terry Bowman wrote:
> The CXL driver's cxl_handle_endpoint_cor_ras()/cxl_handle_endpoint_ras()
> are unnecessary helper functions used only for Endpoints. Remove these
> functions as they are not common for all CXL devices and do not provide
> value for EP handling.
> 
> Rename __cxl_handle_ras to cxl_handle_ras() and __cxl_handle_cor_ras()
> to cxl_handle_cor_ras().
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
>  drivers/cxl/core/pci.c | 26 ++++++++------------------
>  1 file changed, 8 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index b50551601c2e..06464a25d8bd 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -664,8 +664,8 @@ void read_cdat_data(struct cxl_port *port)
>  }
>  EXPORT_SYMBOL_NS_GPL(read_cdat_data, "CXL");
>  
> -static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
> -				 void __iomem *ras_base)
> +static void cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
> +			       void __iomem *ras_base)
>  {
>  	void __iomem *addr;
>  	u32 status;
> @@ -681,11 +681,6 @@ static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
>  	}
>  }
>  
> -static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
> -{
> -	return __cxl_handle_cor_ras(cxlds, cxlds->regs.ras);
> -}
> -
>  /* CXL spec rev3.0 8.2.4.16.1 */
>  static void header_log_copy(void __iomem *ras_base, u32 *log)
>  {
> @@ -707,8 +702,8 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
>   * Log the state of the RAS status registers and prepare them to log the
>   * next error status. Return 1 if reset needed.
>   */
> -static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
> -				  void __iomem *ras_base)
> +static bool cxl_handle_ras(struct cxl_dev_state *cxlds,
> +			   void __iomem *ras_base)
>  {
>  	u32 hl[CXL_HEADERLOG_SIZE_U32];
>  	void __iomem *addr;
> @@ -741,11 +736,6 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
>  	return true;
>  }
>  
> -static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
> -{
> -	return __cxl_handle_ras(cxlds, cxlds->regs.ras);
> -}
> -
>  #ifdef CONFIG_PCIEAER_CXL
>  
>  static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
> @@ -824,13 +814,13 @@ EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
>  static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
>  					  struct cxl_dport *dport)
>  {
> -	return __cxl_handle_cor_ras(cxlds, dport->regs.ras);
> +	return cxl_handle_cor_ras(cxlds, dport->regs.ras);
>  }
>  
>  static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
>  				       struct cxl_dport *dport)
>  {
> -	return __cxl_handle_ras(cxlds, dport->regs.ras);
> +	return cxl_handle_ras(cxlds, dport->regs.ras);
>  }
>  
>  /*
> @@ -927,7 +917,7 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
>  		if (cxlds->rcd)
>  			cxl_handle_rdport_errors(cxlds);
>  
> -		cxl_handle_endpoint_cor_ras(cxlds);
> +		cxl_handle_cor_ras(cxlds, cxlds->regs.ras);
>  	}
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
> @@ -956,7 +946,7 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>  		 * chance the situation is recoverable dump the status of the RAS
>  		 * capability registers and bounce the active state of the memdev.
>  		 */
> -		ue = cxl_handle_endpoint_ras(cxlds);
> +		ue = cxl_handle_ras(cxlds, cxlds->regs.ras);
>  	}
>  
>  


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 08/17] cxl/pci: Move RAS initialization to cxl_port driver
  2025-06-26 22:42 ` [PATCH v10 08/17] cxl/pci: Move RAS initialization to cxl_port driver Terry Bowman
  2025-06-27 11:12   ` Jonathan Cameron
@ 2025-07-18 18:01   ` Dave Jiang
  1 sibling, 0 replies; 82+ messages in thread
From: Dave Jiang @ 2025-07-18 18:01 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci



On 6/26/25 3:42 PM, Terry Bowman wrote:
> The cxl_port driver is intended to manage CXL Endpoint Ports and CXL Switch
> Ports. Move existing RAS initialization to the cxl_port driver.
> 
> Restricted CXL Host (RCH) Downstream Port RAS initialization currently
> resides in cxl/core/pci.c. The PCI source file is not otherwise associated
> with CXL Port management.
> 
> Additional CXL Port RAS initialization will be added in future patches to
> support a CXL Port device's CXL errors.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
>  drivers/cxl/core/pci.c  | 73 --------------------------------------
>  drivers/cxl/core/regs.c |  2 ++
>  drivers/cxl/cxl.h       |  6 ++++
>  drivers/cxl/port.c      | 78 +++++++++++++++++++++++++++++++++++++++++
>  4 files changed, 86 insertions(+), 73 deletions(-)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 06464a25d8bd..35c9c50534bf 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -738,79 +738,6 @@ static bool cxl_handle_ras(struct cxl_dev_state *cxlds,
>  
>  #ifdef CONFIG_PCIEAER_CXL
>  
> -static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
> -{
> -	resource_size_t aer_phys;
> -	struct device *host;
> -	u16 aer_cap;
> -
> -	aer_cap = cxl_rcrb_to_aer(dport->dport_dev, dport->rcrb.base);
> -	if (aer_cap) {
> -		host = dport->reg_map.host;
> -		aer_phys = aer_cap + dport->rcrb.base;
> -		dport->regs.dport_aer = devm_cxl_iomap_block(host, aer_phys,
> -						sizeof(struct aer_capability_regs));
> -	}
> -}
> -
> -static void cxl_dport_map_ras(struct cxl_dport *dport)
> -{
> -	struct cxl_register_map *map = &dport->reg_map;
> -	struct device *dev = dport->dport_dev;
> -
> -	if (!map->component_map.ras.valid)
> -		dev_dbg(dev, "RAS registers not found\n");
> -	else if (cxl_map_component_regs(map, &dport->regs.component,
> -					BIT(CXL_CM_CAP_CAP_ID_RAS)))
> -		dev_dbg(dev, "Failed to map RAS capability.\n");
> -}
> -
> -static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
> -{
> -	void __iomem *aer_base = dport->regs.dport_aer;
> -	u32 aer_cmd_mask, aer_cmd;
> -
> -	if (!aer_base)
> -		return;
> -
> -	/*
> -	 * Disable RCH root port command interrupts.
> -	 * CXL 3.0 12.2.1.1 - RCH Downstream Port-detected Errors
> -	 *
> -	 * This sequence may not be necessary. CXL spec states disabling
> -	 * the root cmd register's interrupts is required. But, PCI spec
> -	 * shows these are disabled by default on reset.
> -	 */
> -	aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN |
> -			PCI_ERR_ROOT_CMD_NONFATAL_EN |
> -			PCI_ERR_ROOT_CMD_FATAL_EN);
> -	aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND);
> -	aer_cmd &= ~aer_cmd_mask;
> -	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
> -}
> -
> -/**
> - * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
> - * @dport: the cxl_dport that needs to be initialized
> - * @host: host device for devm operations
> - */
> -void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
> -{
> -	dport->reg_map.host = host;
> -	cxl_dport_map_ras(dport);
> -
> -	if (dport->rch) {
> -		struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
> -
> -		if (!host_bridge->native_aer)
> -			return;
> -
> -		cxl_dport_map_rch_aer(dport);
> -		cxl_disable_rch_root_ints(dport);
> -	}
> -}
> -EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
> -
>  static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
>  					  struct cxl_dport *dport)
>  {
> diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
> index 5ca7b0eed568..b8e767a9571c 100644
> --- a/drivers/cxl/core/regs.c
> +++ b/drivers/cxl/core/regs.c
> @@ -199,6 +199,7 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
>  
>  	return ret_val;
>  }
> +EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, "CXL");
>  
>  int cxl_map_component_regs(const struct cxl_register_map *map,
>  			   struct cxl_component_regs *regs,
> @@ -517,6 +518,7 @@ u16 cxl_rcrb_to_aer(struct device *dev, resource_size_t rcrb)
>  
>  	return offset;
>  }
> +EXPORT_SYMBOL_NS_GPL(cxl_rcrb_to_aer, "CXL");
>  
>  static resource_size_t cxl_rcrb_to_linkcap(struct device *dev, struct cxl_dport *dport)
>  {
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 3f1695c96abc..c57c160f3e5e 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -313,6 +313,12 @@ int cxl_setup_regs(struct cxl_register_map *map);
>  struct cxl_dport;
>  resource_size_t cxl_rcd_component_reg_phys(struct device *dev,
>  					   struct cxl_dport *dport);
> +
> +u16 cxl_rcrb_to_aer(struct device *dev, resource_size_t rcrb);
> +
> +void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
> +				   resource_size_t length);
> +
>  int cxl_dport_map_rcd_linkcap(struct pci_dev *pdev, struct cxl_dport *dport);
>  
>  #define CXL_RESOURCE_NONE ((resource_size_t) -1)
> diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
> index fe4b593331da..021f35145c65 100644
> --- a/drivers/cxl/port.c
> +++ b/drivers/cxl/port.c
> @@ -6,6 +6,7 @@
>  
>  #include "cxlmem.h"
>  #include "cxlpci.h"
> +#include "cxl.h"
>  
>  /**
>   * DOC: cxl port
> @@ -57,6 +58,83 @@ static int discover_region(struct device *dev, void *unused)
>  	return 0;
>  }
>  
> +#ifdef CONFIG_PCIEAER_CXL
> +
> +static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
> +{
> +	resource_size_t aer_phys;
> +	struct device *host;
> +	u16 aer_cap;
> +
> +	aer_cap = cxl_rcrb_to_aer(dport->dport_dev, dport->rcrb.base);
> +	if (aer_cap) {
> +		host = dport->reg_map.host;
> +		aer_phys = aer_cap + dport->rcrb.base;
> +		dport->regs.dport_aer = devm_cxl_iomap_block(host, aer_phys,
> +						sizeof(struct aer_capability_regs));
> +	}
> +}
> +
> +static void cxl_dport_map_ras(struct cxl_dport *dport)
> +{
> +	struct cxl_register_map *map = &dport->reg_map;
> +	struct device *dev = dport->dport_dev;
> +
> +	if (!map->component_map.ras.valid)
> +		dev_dbg(dev, "RAS registers not found\n");
> +	else if (cxl_map_component_regs(map, &dport->regs.component,
> +					BIT(CXL_CM_CAP_CAP_ID_RAS)))
> +		dev_dbg(dev, "Failed to map RAS capability.\n");
> +}
> +
> +static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
> +{
> +	void __iomem *aer_base = dport->regs.dport_aer;
> +	u32 aer_cmd_mask, aer_cmd;
> +
> +	if (!aer_base)
> +		return;
> +
> +	/*
> +	 * Disable RCH root port command interrupts.
> +	 * CXL 3.2 12.2.1.1 - RCH Downstream Port-detected Errors
> +	 *
> +	 * This sequence may not be necessary. CXL spec states disabling
> +	 * the root cmd register's interrupts is required. But, PCI spec
> +	 * shows these are disabled by default on reset.
> +	 */
> +	aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN |
> +			PCI_ERR_ROOT_CMD_NONFATAL_EN |
> +			PCI_ERR_ROOT_CMD_FATAL_EN);
> +	aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND);
> +	aer_cmd &= ~aer_cmd_mask;
> +	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
> +}
> +
> +/**
> + * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
> + * @dport: the cxl_dport that needs to be initialized
> + * @host: host device for devm operations
> + */
> +void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
> +{
> +	dport->reg_map.host = host;
> +	cxl_dport_map_ras(dport);
> +
> +	if (dport->rch) {
> +		struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
> +
> +		if (!host_bridge->native_aer)
> +			return;
> +
> +		cxl_dport_map_rch_aer(dport);
> +		cxl_disable_rch_root_ints(dport);
> +	}
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
> +
> +#endif /* CONFIG_PCIEAER_CXL */
> +
>  static int cxl_switch_port_probe(struct cxl_port *port)
>  {
>  	struct cxl_hdm *cxlhdm;


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 09/17] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
  2025-06-26 22:42 ` [PATCH v10 09/17] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers Terry Bowman
  2025-06-27 11:17   ` Jonathan Cameron
@ 2025-07-18 21:28   ` Dave Jiang
  2025-07-18 21:55     ` Bowman, Terry
  1 sibling, 1 reply; 82+ messages in thread
From: Dave Jiang @ 2025-07-18 21:28 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci



On 6/26/25 3:42 PM, Terry Bowman wrote:
> CXL Endpoint (EP) Ports may include Root Ports (RP) or Downstream Switch
> Ports (DSP). CXL RPs and DSPs contain RAS registers that require memory
> mapping to enable RAS logging. This initialization is currently missing and
> must be added for CXL RPs and DSPs.
> 
> Update cxl_dport_init_ras_reporting() to support RP and DSP RAS mapping.
> Add alongside the existing Restricted CXL Host Downstream Port RAS mapping.
> 
> Update cxl_endpoint_port_probe() to invoke cxl_dport_init_ras_reporting().
> This will initiate the RAS mapping for CXL RPs and DSPs when each CXL EP is
> created and added to the EP port.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/cxl.h  |  2 ++
>  drivers/cxl/mem.c  |  3 ++-
>  drivers/cxl/port.c | 58 +++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 61 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index c57c160f3e5e..d696d419bd5a 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -590,6 +590,7 @@ struct cxl_dax_region {
>   * @parent_dport: dport that points to this port in the parent
>   * @decoder_ida: allocator for decoder ids
>   * @reg_map: component and ras register mapping parameters
> + * @uport_regs: mapped component registers
>   * @nr_dports: number of entries in @dports
>   * @hdm_end: track last allocated HDM decoder instance for allocation ordering
>   * @commit_end: cursor to track highest committed decoder for commit ordering
> @@ -610,6 +611,7 @@ struct cxl_port {
>  	struct cxl_dport *parent_dport;
>  	struct ida decoder_ida;
>  	struct cxl_register_map reg_map;
> +	struct cxl_component_regs uport_regs;
>  	int nr_dports;
>  	int hdm_end;
>  	int commit_end;
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 6e6777b7bafb..d2155f45240d 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -166,7 +166,8 @@ static int cxl_mem_probe(struct device *dev)
>  	else
>  		endpoint_parent = &parent_port->dev;
>  
> -	cxl_dport_init_ras_reporting(dport, dev);
> +	if (dport->rch)
> +		cxl_dport_init_ras_reporting(dport, dev);
>  
>  	scoped_guard(device, endpoint_parent) {
>  		if (!endpoint_parent->driver) {
> diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
> index 021f35145c65..b52f82925891 100644
> --- a/drivers/cxl/port.c
> +++ b/drivers/cxl/port.c
> @@ -111,6 +111,17 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>  	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
>  }
>  
> +static void cxl_uport_init_ras_reporting(struct cxl_port *port,
> +					 struct device *host)
> +{
> +	struct cxl_register_map *map = &port->reg_map;
> +
> +	map->host = host;
> +	if (cxl_map_component_regs(map, &port->uport_regs,
> +				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
> +		dev_dbg(&port->dev, "Failed to map RAS capability\n");
> +}
> +
>  /**
>   * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
>   * @dport: the cxl_dport that needs to be initialized
> @@ -119,7 +130,6 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>  void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
>  {
>  	dport->reg_map.host = host;
> -	cxl_dport_map_ras(dport);
>  
>  	if (dport->rch) {
>  		struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
> @@ -127,12 +137,54 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
>  		if (!host_bridge->native_aer)
>  			return;
>  
> +		cxl_dport_map_ras(dport);
>  		cxl_dport_map_rch_aer(dport);
>  		cxl_disable_rch_root_ints(dport);
> +		return;
>  	}
> +
> +	if (cxl_map_component_regs(&dport->reg_map, &dport->regs.component,
> +				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
> +		dev_dbg(dport->dport_dev, "Failed to map RAS capability\n");
> +
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
>  
> +static void cxl_switch_port_init_ras(struct cxl_port *port)
> +{
> +	if (is_cxl_root(to_cxl_port(port->dev.parent)))
> +		return;
> +
> +	/* May have upstream DSP or RP */
> +	if (port->parent_dport && dev_is_pci(port->parent_dport->dport_dev)) {
> +		struct pci_dev *pdev = to_pci_dev(port->parent_dport->dport_dev);
> +
> +		if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
> +		    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM))
> +			cxl_dport_init_ras_reporting(port->parent_dport, &port->dev);
> +	}
> +
> +	cxl_uport_init_ras_reporting(port, &port->dev);
> +}
> +
> +static void cxl_endpoint_port_init_ras(struct cxl_port *port)

Maybe rename 'port' to 'ep' to be explicit

> +{
> +	struct cxl_dport *dport;

parent_dport would be clearer. I was thinking why does the endpoint have a dport for a second there.

> +	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
> +	struct cxl_port *parent_port __free(put_cxl_port) =
> +		cxl_mem_find_port(cxlmd, &dport);
> +
> +	if (!dport || !dev_is_pci(dport->dport_dev)) {
> +		dev_err(&port->dev, "CXL port topology not found\n");> +		return;
> +	}
> +
> +	cxl_dport_init_ras_reporting(dport, cxlmd->cxlds->dev);
> +}
> +
> +#else
> +static void cxl_endpoint_port_init_ras(struct cxl_port *port) { }
> +static void cxl_switch_port_init_ras(struct cxl_port *port) { }
>  #endif /* CONFIG_PCIEAER_CXL */

I cc'd you on the new patch to move all the AER stuff to core/pci_aer.c. That should take care of ifdef CONFIG_PCIEAER_CXL in pci.c and port.c.

DJ

>  >  static int cxl_switch_port_probe(struct cxl_port *port)
> @@ -149,6 +201,8 @@ static int cxl_switch_port_probe(struct cxl_port *port)
>  
>  	cxl_switch_parse_cdat(port);
>  
> +	cxl_switch_port_init_ras(port);
> +
>  	cxlhdm = devm_cxl_setup_hdm(port, NULL);
>  	if (!IS_ERR(cxlhdm))
>  		return devm_cxl_enumerate_decoders(cxlhdm, NULL);
> @@ -203,6 +257,8 @@ static int cxl_endpoint_port_probe(struct cxl_port *port)
>  	if (rc)
>  		return rc;
>  
> +	cxl_endpoint_port_init_ras(port);
> +
>  	/*
>  	 * Now that all endpoint decoders are successfully enumerated, try to
>  	 * assemble regions from committed decoders


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 09/17] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
  2025-07-18 21:28   ` Dave Jiang
@ 2025-07-18 21:55     ` Bowman, Terry
  2025-07-18 22:01       ` Dave Jiang
  0 siblings, 1 reply; 82+ messages in thread
From: Bowman, Terry @ 2025-07-18 21:55 UTC (permalink / raw)
  To: Dave Jiang, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci



On 7/18/2025 4:28 PM, Dave Jiang wrote:
>
> On 6/26/25 3:42 PM, Terry Bowman wrote:
>> CXL Endpoint (EP) Ports may include Root Ports (RP) or Downstream Switch
>> Ports (DSP). CXL RPs and DSPs contain RAS registers that require memory
>> mapping to enable RAS logging. This initialization is currently missing and
>> must be added for CXL RPs and DSPs.
>>
>> Update cxl_dport_init_ras_reporting() to support RP and DSP RAS mapping.
>> Add alongside the existing Restricted CXL Host Downstream Port RAS mapping.
>>
>> Update cxl_endpoint_port_probe() to invoke cxl_dport_init_ras_reporting().
>> This will initiate the RAS mapping for CXL RPs and DSPs when each CXL EP is
>> created and added to the EP port.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>  drivers/cxl/cxl.h  |  2 ++
>>  drivers/cxl/mem.c  |  3 ++-
>>  drivers/cxl/port.c | 58 +++++++++++++++++++++++++++++++++++++++++++++-
>>  3 files changed, 61 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
>> index c57c160f3e5e..d696d419bd5a 100644
>> --- a/drivers/cxl/cxl.h
>> +++ b/drivers/cxl/cxl.h
>> @@ -590,6 +590,7 @@ struct cxl_dax_region {
>>   * @parent_dport: dport that points to this port in the parent
>>   * @decoder_ida: allocator for decoder ids
>>   * @reg_map: component and ras register mapping parameters
>> + * @uport_regs: mapped component registers
>>   * @nr_dports: number of entries in @dports
>>   * @hdm_end: track last allocated HDM decoder instance for allocation ordering
>>   * @commit_end: cursor to track highest committed decoder for commit ordering
>> @@ -610,6 +611,7 @@ struct cxl_port {
>>  	struct cxl_dport *parent_dport;
>>  	struct ida decoder_ida;
>>  	struct cxl_register_map reg_map;
>> +	struct cxl_component_regs uport_regs;
>>  	int nr_dports;
>>  	int hdm_end;
>>  	int commit_end;
>> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
>> index 6e6777b7bafb..d2155f45240d 100644
>> --- a/drivers/cxl/mem.c
>> +++ b/drivers/cxl/mem.c
>> @@ -166,7 +166,8 @@ static int cxl_mem_probe(struct device *dev)
>>  	else
>>  		endpoint_parent = &parent_port->dev;
>>  
>> -	cxl_dport_init_ras_reporting(dport, dev);
>> +	if (dport->rch)
>> +		cxl_dport_init_ras_reporting(dport, dev);
>>  
>>  	scoped_guard(device, endpoint_parent) {
>>  		if (!endpoint_parent->driver) {
>> diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
>> index 021f35145c65..b52f82925891 100644
>> --- a/drivers/cxl/port.c
>> +++ b/drivers/cxl/port.c
>> @@ -111,6 +111,17 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>>  	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
>>  }
>>  
>> +static void cxl_uport_init_ras_reporting(struct cxl_port *port,
>> +					 struct device *host)
>> +{
>> +	struct cxl_register_map *map = &port->reg_map;
>> +
>> +	map->host = host;
>> +	if (cxl_map_component_regs(map, &port->uport_regs,
>> +				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
>> +		dev_dbg(&port->dev, "Failed to map RAS capability\n");
>> +}
>> +
>>  /**
>>   * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
>>   * @dport: the cxl_dport that needs to be initialized
>> @@ -119,7 +130,6 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>>  void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
>>  {
>>  	dport->reg_map.host = host;
>> -	cxl_dport_map_ras(dport);
>>  
>>  	if (dport->rch) {
>>  		struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
>> @@ -127,12 +137,54 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
>>  		if (!host_bridge->native_aer)
>>  			return;
>>  
>> +		cxl_dport_map_ras(dport);
>>  		cxl_dport_map_rch_aer(dport);
>>  		cxl_disable_rch_root_ints(dport);
>> +		return;
>>  	}
>> +
>> +	if (cxl_map_component_regs(&dport->reg_map, &dport->regs.component,
>> +				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
>> +		dev_dbg(dport->dport_dev, "Failed to map RAS capability\n");
>> +
>>  }
>>  EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
>>  
>> +static void cxl_switch_port_init_ras(struct cxl_port *port)
>> +{
>> +	if (is_cxl_root(to_cxl_port(port->dev.parent)))
>> +		return;
>> +
>> +	/* May have upstream DSP or RP */
>> +	if (port->parent_dport && dev_is_pci(port->parent_dport->dport_dev)) {
>> +		struct pci_dev *pdev = to_pci_dev(port->parent_dport->dport_dev);
>> +
>> +		if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
>> +		    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM))
>> +			cxl_dport_init_ras_reporting(port->parent_dport, &port->dev);
>> +	}
>> +
>> +	cxl_uport_init_ras_reporting(port, &port->dev);
>> +}
>> +
>> +static void cxl_endpoint_port_init_ras(struct cxl_port *port)
> Maybe rename 'port' to 'ep' to be explicit
Ok
>> +{
>> +	struct cxl_dport *dport;
> parent_dport would be clearer. I was thinking why does the endpoint have a dport for a second there.
Ok
>> +	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
>> +	struct cxl_port *parent_port __free(put_cxl_port) =
>> +		cxl_mem_find_port(cxlmd, &dport);
>> +
>> +	if (!dport || !dev_is_pci(dport->dport_dev)) {
>> +		dev_err(&port->dev, "CXL port topology not found\n");> +		return;
>> +	}
>> +
>> +	cxl_dport_init_ras_reporting(dport, cxlmd->cxlds->dev);
>> +}
>> +
>> +#else
>> +static void cxl_endpoint_port_init_ras(struct cxl_port *port) { }
>> +static void cxl_switch_port_init_ras(struct cxl_port *port) { }
>>  #endif /* CONFIG_PCIEAER_CXL */
> I cc'd you on the new patch to move all the AER stuff to core/pci_aer.c. That should take care of ifdef CONFIG_PCIEAER_CXL in pci.c and port.c.
>
> DJ

Move to core/native_ras.c introduced in "Dequeue forwarded CXL error", right? I just want to be certain.

Regards,
Terry

>>  >  static int cxl_switch_port_probe(struct cxl_port *port)
>> @@ -149,6 +201,8 @@ static int cxl_switch_port_probe(struct cxl_port *port)
>>  
>>  	cxl_switch_parse_cdat(port);
>>  
>> +	cxl_switch_port_init_ras(port);
>> +
>>  	cxlhdm = devm_cxl_setup_hdm(port, NULL);
>>  	if (!IS_ERR(cxlhdm))
>>  		return devm_cxl_enumerate_decoders(cxlhdm, NULL);
>> @@ -203,6 +257,8 @@ static int cxl_endpoint_port_probe(struct cxl_port *port)
>>  	if (rc)
>>  		return rc;
>>  
>> +	cxl_endpoint_port_init_ras(port);
>> +
>>  	/*
>>  	 * Now that all endpoint decoders are successfully enumerated, try to
>>  	 * assemble regions from committed decoders


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 09/17] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
  2025-07-18 21:55     ` Bowman, Terry
@ 2025-07-18 22:01       ` Dave Jiang
  2025-07-18 22:40         ` Bowman, Terry
  0 siblings, 1 reply; 82+ messages in thread
From: Dave Jiang @ 2025-07-18 22:01 UTC (permalink / raw)
  To: Bowman, Terry, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci



On 7/18/25 2:55 PM, Bowman, Terry wrote:
> 
> 
> On 7/18/2025 4:28 PM, Dave Jiang wrote:
>>
>> On 6/26/25 3:42 PM, Terry Bowman wrote:
>>> CXL Endpoint (EP) Ports may include Root Ports (RP) or Downstream Switch
>>> Ports (DSP). CXL RPs and DSPs contain RAS registers that require memory
>>> mapping to enable RAS logging. This initialization is currently missing and
>>> must be added for CXL RPs and DSPs.
>>>
>>> Update cxl_dport_init_ras_reporting() to support RP and DSP RAS mapping.
>>> Add alongside the existing Restricted CXL Host Downstream Port RAS mapping.
>>>
>>> Update cxl_endpoint_port_probe() to invoke cxl_dport_init_ras_reporting().
>>> This will initiate the RAS mapping for CXL RPs and DSPs when each CXL EP is
>>> created and added to the EP port.
>>>
>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>> ---
>>>  drivers/cxl/cxl.h  |  2 ++
>>>  drivers/cxl/mem.c  |  3 ++-
>>>  drivers/cxl/port.c | 58 +++++++++++++++++++++++++++++++++++++++++++++-
>>>  3 files changed, 61 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
>>> index c57c160f3e5e..d696d419bd5a 100644
>>> --- a/drivers/cxl/cxl.h
>>> +++ b/drivers/cxl/cxl.h
>>> @@ -590,6 +590,7 @@ struct cxl_dax_region {
>>>   * @parent_dport: dport that points to this port in the parent
>>>   * @decoder_ida: allocator for decoder ids
>>>   * @reg_map: component and ras register mapping parameters
>>> + * @uport_regs: mapped component registers
>>>   * @nr_dports: number of entries in @dports
>>>   * @hdm_end: track last allocated HDM decoder instance for allocation ordering
>>>   * @commit_end: cursor to track highest committed decoder for commit ordering
>>> @@ -610,6 +611,7 @@ struct cxl_port {
>>>  	struct cxl_dport *parent_dport;
>>>  	struct ida decoder_ida;
>>>  	struct cxl_register_map reg_map;
>>> +	struct cxl_component_regs uport_regs;
>>>  	int nr_dports;
>>>  	int hdm_end;
>>>  	int commit_end;
>>> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
>>> index 6e6777b7bafb..d2155f45240d 100644
>>> --- a/drivers/cxl/mem.c
>>> +++ b/drivers/cxl/mem.c
>>> @@ -166,7 +166,8 @@ static int cxl_mem_probe(struct device *dev)
>>>  	else
>>>  		endpoint_parent = &parent_port->dev;
>>>  
>>> -	cxl_dport_init_ras_reporting(dport, dev);
>>> +	if (dport->rch)
>>> +		cxl_dport_init_ras_reporting(dport, dev);
>>>  
>>>  	scoped_guard(device, endpoint_parent) {
>>>  		if (!endpoint_parent->driver) {
>>> diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
>>> index 021f35145c65..b52f82925891 100644
>>> --- a/drivers/cxl/port.c
>>> +++ b/drivers/cxl/port.c
>>> @@ -111,6 +111,17 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>>>  	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
>>>  }
>>>  
>>> +static void cxl_uport_init_ras_reporting(struct cxl_port *port,
>>> +					 struct device *host)
>>> +{
>>> +	struct cxl_register_map *map = &port->reg_map;
>>> +
>>> +	map->host = host;
>>> +	if (cxl_map_component_regs(map, &port->uport_regs,
>>> +				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
>>> +		dev_dbg(&port->dev, "Failed to map RAS capability\n");
>>> +}
>>> +
>>>  /**
>>>   * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
>>>   * @dport: the cxl_dport that needs to be initialized
>>> @@ -119,7 +130,6 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>>>  void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
>>>  {
>>>  	dport->reg_map.host = host;
>>> -	cxl_dport_map_ras(dport);
>>>  
>>>  	if (dport->rch) {
>>>  		struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
>>> @@ -127,12 +137,54 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
>>>  		if (!host_bridge->native_aer)
>>>  			return;
>>>  
>>> +		cxl_dport_map_ras(dport);
>>>  		cxl_dport_map_rch_aer(dport);
>>>  		cxl_disable_rch_root_ints(dport);
>>> +		return;
>>>  	}
>>> +
>>> +	if (cxl_map_component_regs(&dport->reg_map, &dport->regs.component,
>>> +				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
>>> +		dev_dbg(dport->dport_dev, "Failed to map RAS capability\n");
>>> +
>>>  }
>>>  EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
>>>  
>>> +static void cxl_switch_port_init_ras(struct cxl_port *port)
>>> +{
>>> +	if (is_cxl_root(to_cxl_port(port->dev.parent)))
>>> +		return;
>>> +
>>> +	/* May have upstream DSP or RP */
>>> +	if (port->parent_dport && dev_is_pci(port->parent_dport->dport_dev)) {
>>> +		struct pci_dev *pdev = to_pci_dev(port->parent_dport->dport_dev);
>>> +
>>> +		if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
>>> +		    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM))
>>> +			cxl_dport_init_ras_reporting(port->parent_dport, &port->dev);
>>> +	}
>>> +
>>> +	cxl_uport_init_ras_reporting(port, &port->dev);
>>> +}
>>> +
>>> +static void cxl_endpoint_port_init_ras(struct cxl_port *port)
>> Maybe rename 'port' to 'ep' to be explicit
> Ok
>>> +{
>>> +	struct cxl_dport *dport;
>> parent_dport would be clearer. I was thinking why does the endpoint have a dport for a second there.
> Ok
>>> +	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
>>> +	struct cxl_port *parent_port __free(put_cxl_port) =
>>> +		cxl_mem_find_port(cxlmd, &dport);
>>> +
>>> +	if (!dport || !dev_is_pci(dport->dport_dev)) {
>>> +		dev_err(&port->dev, "CXL port topology not found\n");> +		return;
>>> +	}
>>> +
>>> +	cxl_dport_init_ras_reporting(dport, cxlmd->cxlds->dev);
>>> +}
>>> +
>>> +#else
>>> +static void cxl_endpoint_port_init_ras(struct cxl_port *port) { }
>>> +static void cxl_switch_port_init_ras(struct cxl_port *port) { }
>>>  #endif /* CONFIG_PCIEAER_CXL */
>> I cc'd you on the new patch to move all the AER stuff to core/pci_aer.c. That should take care of ifdef CONFIG_PCIEAER_CXL in pci.c and port.c.
>>
>> DJ
> 
> Move to core/native_ras.c introduced in "Dequeue forwarded CXL error", right? I just want to be certain.

I just posted this [1] to clean up what's there. Do you prefer me to just rename the file to native_ras.c? Or maybe pci_ras.c?

[1]: https://lore.kernel.org/linux-cxl/20250718212452.2100663-1-dave.jiang@intel.com/

DJ

> 
> Regards,
> Terry
> 
>>>  >  static int cxl_switch_port_probe(struct cxl_port *port)
>>> @@ -149,6 +201,8 @@ static int cxl_switch_port_probe(struct cxl_port *port)
>>>  
>>>  	cxl_switch_parse_cdat(port);
>>>  
>>> +	cxl_switch_port_init_ras(port);
>>> +
>>>  	cxlhdm = devm_cxl_setup_hdm(port, NULL);
>>>  	if (!IS_ERR(cxlhdm))
>>>  		return devm_cxl_enumerate_decoders(cxlhdm, NULL);
>>> @@ -203,6 +257,8 @@ static int cxl_endpoint_port_probe(struct cxl_port *port)
>>>  	if (rc)
>>>  		return rc;
>>>  
>>> +	cxl_endpoint_port_init_ras(port);
>>> +
>>>  	/*
>>>  	 * Now that all endpoint decoders are successfully enumerated, try to
>>>  	 * assemble regions from committed decoders
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 09/17] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
  2025-07-18 22:01       ` Dave Jiang
@ 2025-07-18 22:40         ` Bowman, Terry
  2025-07-18 22:45           ` Dave Jiang
  0 siblings, 1 reply; 82+ messages in thread
From: Bowman, Terry @ 2025-07-18 22:40 UTC (permalink / raw)
  To: Dave Jiang, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci

On 7/18/2025 5:01 PM, Dave Jiang wrote:
> 
> 
> On 7/18/25 2:55 PM, Bowman, Terry wrote:
>>
>>
>> On 7/18/2025 4:28 PM, Dave Jiang wrote:
>>>
>>> On 6/26/25 3:42 PM, Terry Bowman wrote:
>>>> CXL Endpoint (EP) Ports may include Root Ports (RP) or Downstream Switch
>>>> Ports (DSP). CXL RPs and DSPs contain RAS registers that require memory
>>>> mapping to enable RAS logging. This initialization is currently missing and
>>>> must be added for CXL RPs and DSPs.
>>>>
>>>> Update cxl_dport_init_ras_reporting() to support RP and DSP RAS mapping.
>>>> Add alongside the existing Restricted CXL Host Downstream Port RAS mapping.
>>>>
>>>> Update cxl_endpoint_port_probe() to invoke cxl_dport_init_ras_reporting().
>>>> This will initiate the RAS mapping for CXL RPs and DSPs when each CXL EP is
>>>> created and added to the EP port.
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>> ---
>>>>  drivers/cxl/cxl.h  |  2 ++
>>>>  drivers/cxl/mem.c  |  3 ++-
>>>>  drivers/cxl/port.c | 58 +++++++++++++++++++++++++++++++++++++++++++++-
>>>>  3 files changed, 61 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
>>>> index c57c160f3e5e..d696d419bd5a 100644
>>>> --- a/drivers/cxl/cxl.h
>>>> +++ b/drivers/cxl/cxl.h
>>>> @@ -590,6 +590,7 @@ struct cxl_dax_region {
>>>>   * @parent_dport: dport that points to this port in the parent
>>>>   * @decoder_ida: allocator for decoder ids
>>>>   * @reg_map: component and ras register mapping parameters
>>>> + * @uport_regs: mapped component registers
>>>>   * @nr_dports: number of entries in @dports
>>>>   * @hdm_end: track last allocated HDM decoder instance for allocation ordering
>>>>   * @commit_end: cursor to track highest committed decoder for commit ordering
>>>> @@ -610,6 +611,7 @@ struct cxl_port {
>>>>  	struct cxl_dport *parent_dport;
>>>>  	struct ida decoder_ida;
>>>>  	struct cxl_register_map reg_map;
>>>> +	struct cxl_component_regs uport_regs;
>>>>  	int nr_dports;
>>>>  	int hdm_end;
>>>>  	int commit_end;
>>>> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
>>>> index 6e6777b7bafb..d2155f45240d 100644
>>>> --- a/drivers/cxl/mem.c
>>>> +++ b/drivers/cxl/mem.c
>>>> @@ -166,7 +166,8 @@ static int cxl_mem_probe(struct device *dev)
>>>>  	else
>>>>  		endpoint_parent = &parent_port->dev;
>>>>  
>>>> -	cxl_dport_init_ras_reporting(dport, dev);
>>>> +	if (dport->rch)
>>>> +		cxl_dport_init_ras_reporting(dport, dev);
>>>>  
>>>>  	scoped_guard(device, endpoint_parent) {
>>>>  		if (!endpoint_parent->driver) {
>>>> diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
>>>> index 021f35145c65..b52f82925891 100644
>>>> --- a/drivers/cxl/port.c
>>>> +++ b/drivers/cxl/port.c
>>>> @@ -111,6 +111,17 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>>>>  	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
>>>>  }
>>>>  
>>>> +static void cxl_uport_init_ras_reporting(struct cxl_port *port,
>>>> +					 struct device *host)
>>>> +{
>>>> +	struct cxl_register_map *map = &port->reg_map;
>>>> +
>>>> +	map->host = host;
>>>> +	if (cxl_map_component_regs(map, &port->uport_regs,
>>>> +				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
>>>> +		dev_dbg(&port->dev, "Failed to map RAS capability\n");
>>>> +}
>>>> +
>>>>  /**
>>>>   * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
>>>>   * @dport: the cxl_dport that needs to be initialized
>>>> @@ -119,7 +130,6 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>>>>  void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
>>>>  {
>>>>  	dport->reg_map.host = host;
>>>> -	cxl_dport_map_ras(dport);
>>>>  
>>>>  	if (dport->rch) {
>>>>  		struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
>>>> @@ -127,12 +137,54 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
>>>>  		if (!host_bridge->native_aer)
>>>>  			return;
>>>>  
>>>> +		cxl_dport_map_ras(dport);
>>>>  		cxl_dport_map_rch_aer(dport);
>>>>  		cxl_disable_rch_root_ints(dport);
>>>> +		return;
>>>>  	}
>>>> +
>>>> +	if (cxl_map_component_regs(&dport->reg_map, &dport->regs.component,
>>>> +				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
>>>> +		dev_dbg(dport->dport_dev, "Failed to map RAS capability\n");
>>>> +
>>>>  }
>>>>  EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
>>>>  
>>>> +static void cxl_switch_port_init_ras(struct cxl_port *port)
>>>> +{
>>>> +	if (is_cxl_root(to_cxl_port(port->dev.parent)))
>>>> +		return;
>>>> +
>>>> +	/* May have upstream DSP or RP */
>>>> +	if (port->parent_dport && dev_is_pci(port->parent_dport->dport_dev)) {
>>>> +		struct pci_dev *pdev = to_pci_dev(port->parent_dport->dport_dev);
>>>> +
>>>> +		if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
>>>> +		    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM))
>>>> +			cxl_dport_init_ras_reporting(port->parent_dport, &port->dev);
>>>> +	}
>>>> +
>>>> +	cxl_uport_init_ras_reporting(port, &port->dev);
>>>> +}
>>>> +
>>>> +static void cxl_endpoint_port_init_ras(struct cxl_port *port)
>>> Maybe rename 'port' to 'ep' to be explicit
>> Ok
>>>> +{
>>>> +	struct cxl_dport *dport;
>>> parent_dport would be clearer. I was thinking why does the endpoint have a dport for a second there.
>> Ok
>>>> +	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
>>>> +	struct cxl_port *parent_port __free(put_cxl_port) =
>>>> +		cxl_mem_find_port(cxlmd, &dport);
>>>> +
>>>> +	if (!dport || !dev_is_pci(dport->dport_dev)) {
>>>> +		dev_err(&port->dev, "CXL port topology not found\n");> +		return;
>>>> +	}
>>>> +
>>>> +	cxl_dport_init_ras_reporting(dport, cxlmd->cxlds->dev);
>>>> +}
>>>> +
>>>> +#else
>>>> +static void cxl_endpoint_port_init_ras(struct cxl_port *port) { }
>>>> +static void cxl_switch_port_init_ras(struct cxl_port *port) { }
>>>>  #endif /* CONFIG_PCIEAER_CXL */
>>> I cc'd you on the new patch to move all the AER stuff to core/pci_aer.c. That should take care of ifdef CONFIG_PCIEAER_CXL in pci.c and port.c.
>>>
>>> DJ
>>
>> Move to core/native_ras.c introduced in "Dequeue forwarded CXL error", right? I just want to be certain.
> 
> I just posted this [1] to clean up what's there. Do you prefer me to just rename the file to native_ras.c? Or maybe pci_ras.c?
> 
> [1]: https://lore.kernel.org/linux-cxl/20250718212452.2100663-1-dave.jiang@intel.com/
> 
> DJ
> 

Hi Dave,

I think leaving as you have it is fine. 

Would you like to see my v10 changes in core/native_ras.c stay as-is or move into pci_aer.c?

-Terry

>>
>> Regards,
>> Terry
>>
>>>>  >  static int cxl_switch_port_probe(struct cxl_port *port)
>>>> @@ -149,6 +201,8 @@ static int cxl_switch_port_probe(struct cxl_port *port)
>>>>  
>>>>  	cxl_switch_parse_cdat(port);
>>>>  
>>>> +	cxl_switch_port_init_ras(port);
>>>> +
>>>>  	cxlhdm = devm_cxl_setup_hdm(port, NULL);
>>>>  	if (!IS_ERR(cxlhdm))
>>>>  		return devm_cxl_enumerate_decoders(cxlhdm, NULL);
>>>> @@ -203,6 +257,8 @@ static int cxl_endpoint_port_probe(struct cxl_port *port)
>>>>  	if (rc)
>>>>  		return rc;
>>>>  
>>>> +	cxl_endpoint_port_init_ras(port);
>>>> +
>>>>  	/*
>>>>  	 * Now that all endpoint decoders are successfully enumerated, try to
>>>>  	 * assemble regions from committed decoders
>>
> 



^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 09/17] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
  2025-07-18 22:40         ` Bowman, Terry
@ 2025-07-18 22:45           ` Dave Jiang
  0 siblings, 0 replies; 82+ messages in thread
From: Dave Jiang @ 2025-07-18 22:45 UTC (permalink / raw)
  To: Bowman, Terry, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci



On 7/18/25 3:40 PM, Bowman, Terry wrote:
> On 7/18/2025 5:01 PM, Dave Jiang wrote:
>>
>>
>> On 7/18/25 2:55 PM, Bowman, Terry wrote:
>>>
>>>
>>> On 7/18/2025 4:28 PM, Dave Jiang wrote:
>>>>
>>>> On 6/26/25 3:42 PM, Terry Bowman wrote:
>>>>> CXL Endpoint (EP) Ports may include Root Ports (RP) or Downstream Switch
>>>>> Ports (DSP). CXL RPs and DSPs contain RAS registers that require memory
>>>>> mapping to enable RAS logging. This initialization is currently missing and
>>>>> must be added for CXL RPs and DSPs.
>>>>>
>>>>> Update cxl_dport_init_ras_reporting() to support RP and DSP RAS mapping.
>>>>> Add alongside the existing Restricted CXL Host Downstream Port RAS mapping.
>>>>>
>>>>> Update cxl_endpoint_port_probe() to invoke cxl_dport_init_ras_reporting().
>>>>> This will initiate the RAS mapping for CXL RPs and DSPs when each CXL EP is
>>>>> created and added to the EP port.
>>>>>
>>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>>> ---
>>>>>  drivers/cxl/cxl.h  |  2 ++
>>>>>  drivers/cxl/mem.c  |  3 ++-
>>>>>  drivers/cxl/port.c | 58 +++++++++++++++++++++++++++++++++++++++++++++-
>>>>>  3 files changed, 61 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
>>>>> index c57c160f3e5e..d696d419bd5a 100644
>>>>> --- a/drivers/cxl/cxl.h
>>>>> +++ b/drivers/cxl/cxl.h
>>>>> @@ -590,6 +590,7 @@ struct cxl_dax_region {
>>>>>   * @parent_dport: dport that points to this port in the parent
>>>>>   * @decoder_ida: allocator for decoder ids
>>>>>   * @reg_map: component and ras register mapping parameters
>>>>> + * @uport_regs: mapped component registers
>>>>>   * @nr_dports: number of entries in @dports
>>>>>   * @hdm_end: track last allocated HDM decoder instance for allocation ordering
>>>>>   * @commit_end: cursor to track highest committed decoder for commit ordering
>>>>> @@ -610,6 +611,7 @@ struct cxl_port {
>>>>>  	struct cxl_dport *parent_dport;
>>>>>  	struct ida decoder_ida;
>>>>>  	struct cxl_register_map reg_map;
>>>>> +	struct cxl_component_regs uport_regs;
>>>>>  	int nr_dports;
>>>>>  	int hdm_end;
>>>>>  	int commit_end;
>>>>> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
>>>>> index 6e6777b7bafb..d2155f45240d 100644
>>>>> --- a/drivers/cxl/mem.c
>>>>> +++ b/drivers/cxl/mem.c
>>>>> @@ -166,7 +166,8 @@ static int cxl_mem_probe(struct device *dev)
>>>>>  	else
>>>>>  		endpoint_parent = &parent_port->dev;
>>>>>  
>>>>> -	cxl_dport_init_ras_reporting(dport, dev);
>>>>> +	if (dport->rch)
>>>>> +		cxl_dport_init_ras_reporting(dport, dev);
>>>>>  
>>>>>  	scoped_guard(device, endpoint_parent) {
>>>>>  		if (!endpoint_parent->driver) {
>>>>> diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
>>>>> index 021f35145c65..b52f82925891 100644
>>>>> --- a/drivers/cxl/port.c
>>>>> +++ b/drivers/cxl/port.c
>>>>> @@ -111,6 +111,17 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>>>>>  	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
>>>>>  }
>>>>>  
>>>>> +static void cxl_uport_init_ras_reporting(struct cxl_port *port,
>>>>> +					 struct device *host)
>>>>> +{
>>>>> +	struct cxl_register_map *map = &port->reg_map;
>>>>> +
>>>>> +	map->host = host;
>>>>> +	if (cxl_map_component_regs(map, &port->uport_regs,
>>>>> +				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
>>>>> +		dev_dbg(&port->dev, "Failed to map RAS capability\n");
>>>>> +}
>>>>> +
>>>>>  /**
>>>>>   * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
>>>>>   * @dport: the cxl_dport that needs to be initialized
>>>>> @@ -119,7 +130,6 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>>>>>  void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
>>>>>  {
>>>>>  	dport->reg_map.host = host;
>>>>> -	cxl_dport_map_ras(dport);
>>>>>  
>>>>>  	if (dport->rch) {
>>>>>  		struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
>>>>> @@ -127,12 +137,54 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
>>>>>  		if (!host_bridge->native_aer)
>>>>>  			return;
>>>>>  
>>>>> +		cxl_dport_map_ras(dport);
>>>>>  		cxl_dport_map_rch_aer(dport);
>>>>>  		cxl_disable_rch_root_ints(dport);
>>>>> +		return;
>>>>>  	}
>>>>> +
>>>>> +	if (cxl_map_component_regs(&dport->reg_map, &dport->regs.component,
>>>>> +				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
>>>>> +		dev_dbg(dport->dport_dev, "Failed to map RAS capability\n");
>>>>> +
>>>>>  }
>>>>>  EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
>>>>>  
>>>>> +static void cxl_switch_port_init_ras(struct cxl_port *port)
>>>>> +{
>>>>> +	if (is_cxl_root(to_cxl_port(port->dev.parent)))
>>>>> +		return;
>>>>> +
>>>>> +	/* May have upstream DSP or RP */
>>>>> +	if (port->parent_dport && dev_is_pci(port->parent_dport->dport_dev)) {
>>>>> +		struct pci_dev *pdev = to_pci_dev(port->parent_dport->dport_dev);
>>>>> +
>>>>> +		if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
>>>>> +		    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM))
>>>>> +			cxl_dport_init_ras_reporting(port->parent_dport, &port->dev);
>>>>> +	}
>>>>> +
>>>>> +	cxl_uport_init_ras_reporting(port, &port->dev);
>>>>> +}
>>>>> +
>>>>> +static void cxl_endpoint_port_init_ras(struct cxl_port *port)
>>>> Maybe rename 'port' to 'ep' to be explicit
>>> Ok
>>>>> +{
>>>>> +	struct cxl_dport *dport;
>>>> parent_dport would be clearer. I was thinking why does the endpoint have a dport for a second there.
>>> Ok
>>>>> +	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
>>>>> +	struct cxl_port *parent_port __free(put_cxl_port) =
>>>>> +		cxl_mem_find_port(cxlmd, &dport);
>>>>> +
>>>>> +	if (!dport || !dev_is_pci(dport->dport_dev)) {
>>>>> +		dev_err(&port->dev, "CXL port topology not found\n");> +		return;
>>>>> +	}
>>>>> +
>>>>> +	cxl_dport_init_ras_reporting(dport, cxlmd->cxlds->dev);
>>>>> +}
>>>>> +
>>>>> +#else
>>>>> +static void cxl_endpoint_port_init_ras(struct cxl_port *port) { }
>>>>> +static void cxl_switch_port_init_ras(struct cxl_port *port) { }
>>>>>  #endif /* CONFIG_PCIEAER_CXL */
>>>> I cc'd you on the new patch to move all the AER stuff to core/pci_aer.c. That should take care of ifdef CONFIG_PCIEAER_CXL in pci.c and port.c.
>>>>
>>>> DJ
>>>
>>> Move to core/native_ras.c introduced in "Dequeue forwarded CXL error", right? I just want to be certain.
>>
>> I just posted this [1] to clean up what's there. Do you prefer me to just rename the file to native_ras.c? Or maybe pci_ras.c?
>>
>> [1]: https://lore.kernel.org/linux-cxl/20250718212452.2100663-1-dave.jiang@intel.com/
>>
>> DJ
>>
> 
> Hi Dave,
> 
> I think leaving as you have it is fine. 
> 
> Would you like to see my v10 changes in core/native_ras.c stay as-is or move into pci_aer.c?

Yes. Let's move it all to pci_aer.c in v11 if no one objects to my changes. Thanks!

DJ

> 
> -Terry
> 
>>>
>>> Regards,
>>> Terry
>>>
>>>>>  >  static int cxl_switch_port_probe(struct cxl_port *port)
>>>>> @@ -149,6 +201,8 @@ static int cxl_switch_port_probe(struct cxl_port *port)
>>>>>  
>>>>>  	cxl_switch_parse_cdat(port);
>>>>>  
>>>>> +	cxl_switch_port_init_ras(port);
>>>>> +
>>>>>  	cxlhdm = devm_cxl_setup_hdm(port, NULL);
>>>>>  	if (!IS_ERR(cxlhdm))
>>>>>  		return devm_cxl_enumerate_decoders(cxlhdm, NULL);
>>>>> @@ -203,6 +257,8 @@ static int cxl_endpoint_port_probe(struct cxl_port *port)
>>>>>  	if (rc)
>>>>>  		return rc;
>>>>>  
>>>>> +	cxl_endpoint_port_init_ras(port);
>>>>> +
>>>>>  	/*
>>>>>  	 * Now that all endpoint decoders are successfully enumerated, try to
>>>>>  	 * assemble regions from committed decoders
>>>
>>
> 
> 


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 11/17] cxl/pci: Log message if RAS registers are unmapped
  2025-06-26 22:42 ` [PATCH v10 11/17] cxl/pci: Log message if RAS registers are unmapped Terry Bowman
@ 2025-07-21 21:56   ` Dave Jiang
  0 siblings, 0 replies; 82+ messages in thread
From: Dave Jiang @ 2025-07-21 21:56 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci



On 6/26/25 3:42 PM, Terry Bowman wrote:
> The CXL RAS handlers do not currently log if the RAS registers are
> unmapped. This is needed in order to help debug CXL error handling. Update
> the CXL driver to log a warning message if the RAS register block is
> unmapped during RAS error handling.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
>  drivers/cxl/core/pci.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 9b464f9c55c1..c9a4b528e0b8 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -670,8 +670,10 @@ static void cxl_handle_cor_ras(struct device *dev,
>  	void __iomem *addr;
>  	u32 status;
>  
> -	if (!ras_base)
> +	if (!ras_base) {
> +		dev_warn_once(dev, "CXL RAS register block is not mapped");
>  		return;
> +	}
>  
>  	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>  	status = readl(addr);
> @@ -709,8 +711,10 @@ static bool cxl_handle_ras(struct device *dev, void __iomem *ras_base)
>  	u32 status;
>  	u32 fe;
>  
> -	if (!ras_base)
> +	if (!ras_base) {
> +		dev_warn_once(dev, "CXL RAS register block is not mapped");
>  		return false;
> +	}
>  
>  	addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
>  	status = readl(addr);


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 13/17] cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors
  2025-06-26 22:42 ` [PATCH v10 13/17] cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors Terry Bowman
  2025-06-27 11:48   ` Jonathan Cameron
@ 2025-07-21 22:17   ` Dave Jiang
  1 sibling, 0 replies; 82+ messages in thread
From: Dave Jiang @ 2025-07-21 22:17 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci



On 6/26/25 3:42 PM, Terry Bowman wrote:
> Update cxl_handle_cor_ras() to exit early in the case there is no RAS
> errors detected after applying the status mask. This change will make
> the correctable handler's implementation consistent with the uncorrectable
> handler, cxl_handle_ras().
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
>  drivers/cxl/core/pci.c | 9 +++++----
>  1 file changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 156ce094a8b9..887b54cf3395 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -677,10 +677,11 @@ static void cxl_handle_cor_ras(struct device *dev, u64 serial,
>  
>  	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>  	status = readl(addr);
> -	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
> -		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> -		trace_cxl_aer_correctable_error(dev, serial, status);
> -	}
> +	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
> +		return;
> +	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> +
> +	trace_cxl_aer_correctable_error(dev, serial, status);
>  }
>  
>  /* CXL spec rev3.0 8.2.4.16.1 */


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 14/17] cxl/pci: Introduce CXL Endpoint protocol error handlers
  2025-06-26 22:42 ` [PATCH v10 14/17] cxl/pci: Introduce CXL Endpoint protocol error handlers Terry Bowman
  2025-06-27 11:52   ` Jonathan Cameron
  2025-06-27 12:27   ` Shiju Jose
@ 2025-07-21 22:35   ` Dave Jiang
  2025-07-22 18:23     ` Bowman, Terry
  2 siblings, 1 reply; 82+ messages in thread
From: Dave Jiang @ 2025-07-21 22:35 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci



On 6/26/25 3:42 PM, Terry Bowman wrote:
> CXL Endpoint protocol errors are currently handled using PCI error
> handlers. The CXL Endpoint requires CXL specific handling in the case of
> uncorrectable error (UCE) handling not provided by the PCI handlers.
> 
> Add CXL specific handlers for CXL Endpoints. Rename the existing
> cxl_error_handlers to be pci_error_handlers to more correctly indicate
> the error type and follow naming consistency.
> 
> The PCI handlers will be called if the CXL device is not trained for
> alternate protocol (CXL). Update the CXL Endpoint PCI handlers to call the
> CXL UCE handlers.

Would the CXL device still be functional if it can't train the CXL protocols? Just wondering if we still need the standard PCI handlers in that case at all. 

DJ

> 
> The existing EP UCE handler includes checks for various results. These are
> no longer needed because CXL UCE recovery will not be attempted. Implement
> cxl_handle_ras() to return PCI_ERS_RESULT_NONE or PCI_ERS_RESULT_PANIC. The
> CXL UCE handler is called by cxl_do_recovery() that acts on the return
> value. In the case of the PCI handler path, call panic() if the result is
> PCI_ERS_RESULT_PANIC.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  drivers/cxl/core/native_ras.c | 15 ++++---
>  drivers/cxl/core/pci.c        | 77 ++++++++++++++++++-----------------
>  drivers/cxl/cxl.h             |  4 ++
>  drivers/cxl/cxlpci.h          |  6 +--
>  drivers/cxl/pci.c             |  8 ++--
>  5 files changed, 59 insertions(+), 51 deletions(-)
> 
> diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
> index 19f8f2ac8376..89b65a35f2c0 100644
> --- a/drivers/cxl/core/native_ras.c
> +++ b/drivers/cxl/core/native_ras.c
> @@ -7,18 +7,20 @@
>  #include <cxlmem.h>
>  #include <core/core.h>
>  #include <cxlpci.h>
> +#include <core/core.h>
>  
>  static int cxl_report_error_detected(struct pci_dev *pdev, void *data)
>  {
>  	pci_ers_result_t vote, *result = data;
> +	struct device *dev = &pdev->dev;
>  
>  	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
>  	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END))
>  		return 0;
>  
> -	guard(device)(&pdev->dev);
> +	guard(device)(dev);
>  
> -	vote = cxl_error_detected(pdev, pci_channel_io_frozen);
> +	vote = cxl_error_detected(dev);
>  	*result = merge_result(*result, vote);
>  
>  	return 0;
> @@ -82,16 +84,17 @@ static bool is_cxl_rcd(struct pci_dev *pdev)
>  static int cxl_rch_handle_error_iter(struct pci_dev *pdev, void *data)
>  {
>  	struct cxl_proto_error_info *err_info = data;
> +	struct device *dev = &pdev->dev;
>  
> -	guard(device)(&pdev->dev);
> +	guard(device)(dev);
>  
>  	if (!is_cxl_rcd(pdev) || !cxl_pci_drv_bound(pdev))
>  		return 0;
>  
>  	if (err_info->severity == AER_CORRECTABLE)
> -		cxl_cor_error_detected(pdev);
> +		cxl_cor_error_detected(dev);
>  	else
> -		cxl_error_detected(pdev, pci_channel_io_frozen);
> +		cxl_error_detected(dev);
>  
>  	return 1;
>  }
> @@ -126,7 +129,7 @@ static void cxl_handle_proto_error(struct cxl_proto_error_info *err_info)
>  						       aer + PCI_ERR_COR_STATUS,
>  						       0, PCI_ERR_COR_INTERNAL);
>  
> -		cxl_cor_error_detected(pdev);
> +		cxl_cor_error_detected(&pdev->dev);
>  
>  		pcie_clear_device_status(pdev);
>  	} else {
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 887b54cf3395..7209ffb5c2fe 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -705,8 +705,8 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
>   * Log the state of the RAS status registers and prepare them to log the
>   * next error status. Return 1 if reset needed.
>   */
> -static bool cxl_handle_ras(struct device *dev, u64 serial,
> -			   void __iomem *ras_base)
> +static pci_ers_result_t cxl_handle_ras(struct device *dev, u64 serial,
> +				       void __iomem *ras_base)
>  {
>  	u32 hl[CXL_HEADERLOG_SIZE_U32];
>  	void __iomem *addr;
> @@ -715,13 +715,13 @@ static bool cxl_handle_ras(struct device *dev, u64 serial,
>  
>  	if (!ras_base) {
>  		dev_warn_once(dev, "CXL RAS register block is not mapped");
> -		return false;
> +		return PCI_ERS_RESULT_NONE;
>  	}
>  
>  	addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
>  	status = readl(addr);
>  	if (!(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK))
> -		return false;
> +		return PCI_ERS_RESULT_NONE;
>  
>  	/* If multiple errors, log header points to first error from ctrl reg */
>  	if (hweight32(status) > 1) {
> @@ -738,7 +738,7 @@ static bool cxl_handle_ras(struct device *dev, u64 serial,
>  	trace_cxl_aer_uncorrectable_error(dev, serial, status, fe, hl);
>  	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>  
> -	return true;
> +	return PCI_ERS_RESULT_PANIC;
>  }
>  
>  #ifdef CONFIG_PCIEAER_CXL
> @@ -833,13 +833,14 @@ static void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds)
>  static void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds) { }
>  #endif
>  
> -void cxl_cor_error_detected(struct pci_dev *pdev)
> +void cxl_cor_error_detected(struct device *dev)
>  {
> +	struct pci_dev *pdev = to_pci_dev(dev);
>  	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> -	struct device *dev = &cxlds->cxlmd->dev;
> +	struct device *cxlmd_dev = &cxlds->cxlmd->dev;
>  
> -	scoped_guard(device, dev) {
> -		if (!dev->driver) {
> +	scoped_guard(device, cxlmd_dev) {
> +		if (!cxlmd_dev->driver) {
>  			dev_warn(&pdev->dev,
>  				 "%s: memdev disabled, abort error handling\n",
>  				 dev_name(dev));
> @@ -854,20 +855,26 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
>  
> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> -				    pci_channel_state_t state)
> +void pci_cor_error_detected(struct pci_dev *pdev)
>  {
> -	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> -	struct cxl_memdev *cxlmd = cxlds->cxlmd;
> -	struct device *dev = &cxlmd->dev;
> -	bool ue;
> +	cxl_cor_error_detected(&pdev->dev);
> +}
> +EXPORT_SYMBOL_NS_GPL(pci_cor_error_detected, "CXL");
>  
> -	scoped_guard(device, dev) {
> -		if (!dev->driver) {
> +pci_ers_result_t cxl_error_detected(struct device *dev)
> +{
> +	struct pci_dev *pdev = to_pci_dev(dev);
> +	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> +	struct device *cxlmd_dev = &cxlds->cxlmd->dev;
> +	pci_ers_result_t ue;
> +
> +	scoped_guard(device, cxlmd_dev) {
> +
> +		if (!cxlmd_dev->driver) {
>  			dev_warn(&pdev->dev,
>  				 "%s: memdev disabled, abort error handling\n",
>  				 dev_name(dev));
> -			return PCI_ERS_RESULT_DISCONNECT;
> +			return PCI_ERS_RESULT_PANIC;
>  		}
>  
>  		if (cxlds->rcd)
> @@ -881,29 +888,23 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>  		ue = cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->serial, cxlds->regs.ras);
>  	}
>  
> -
> -	switch (state) {
> -	case pci_channel_io_normal:
> -		if (ue) {
> -			device_release_driver(dev);
> -			return PCI_ERS_RESULT_NEED_RESET;
> -		}
> -		return PCI_ERS_RESULT_CAN_RECOVER;
> -	case pci_channel_io_frozen:
> -		dev_warn(&pdev->dev,
> -			 "%s: frozen state error detected, disable CXL.mem\n",
> -			 dev_name(dev));
> -		device_release_driver(dev);
> -		return PCI_ERS_RESULT_NEED_RESET;
> -	case pci_channel_io_perm_failure:
> -		dev_warn(&pdev->dev,
> -			 "failure state error detected, request disconnect\n");
> -		return PCI_ERS_RESULT_DISCONNECT;
> -	}
> -	return PCI_ERS_RESULT_NEED_RESET;
> +	return ue;
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
>  
> +pci_ers_result_t pci_error_detected(struct pci_dev *pdev,
> +				    pci_channel_state_t error)
> +{
> +	pci_ers_result_t rc;
> +
> +	rc = cxl_error_detected(&pdev->dev);
> +	if (rc == PCI_ERS_RESULT_PANIC)
> +		panic("CXL cachemem error.");
> +
> +	return rc;
> +}
> +EXPORT_SYMBOL_NS_GPL(pci_error_detected, "CXL");
> +
>  static int cxl_flit_size(struct pci_dev *pdev)
>  {
>  	if (cxl_pci_flit_256(pdev))
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index d696d419bd5a..a2eedc8a82e8 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -11,6 +11,7 @@
>  #include <linux/log2.h>
>  #include <linux/node.h>
>  #include <linux/io.h>
> +#include <linux/pci.h>
>  
>  extern const struct nvdimm_security_ops *cxl_security_ops;
>  
> @@ -797,6 +798,9 @@ static inline int cxl_root_decoder_autoremove(struct device *host,
>  }
>  int cxl_endpoint_autoremove(struct cxl_memdev *cxlmd, struct cxl_port *endpoint);
>  
> +void cxl_cor_error_detected(struct device *dev);
> +pci_ers_result_t cxl_error_detected(struct device *dev);
> +
>  /**
>   * struct cxl_endpoint_dvsec_info - Cached DVSEC info
>   * @mem_enabled: cached value of mem_enabled in the DVSEC at init time
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index ed3c9701b79f..e69a47f0cd94 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -133,8 +133,8 @@ struct cxl_dev_state;
>  int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
>  			struct cxl_endpoint_dvsec_info *info);
>  void read_cdat_data(struct cxl_port *port);
> -void cxl_cor_error_detected(struct pci_dev *pdev);
> -pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
> -				    pci_channel_state_t state);
> +void pci_cor_error_detected(struct pci_dev *pdev);
> +pci_ers_result_t pci_error_detected(struct pci_dev *pdev,
> +				    pci_channel_state_t error);
>  bool cxl_pci_drv_bound(struct pci_dev *pdev);
>  #endif /* __CXL_PCI_H__ */
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index cae049f9ae3e..91fab33094a9 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -1112,11 +1112,11 @@ static void cxl_reset_done(struct pci_dev *pdev)
>  	}
>  }
>  
> -static const struct pci_error_handlers cxl_error_handlers = {
> -	.error_detected	= cxl_error_detected,
> +static const struct pci_error_handlers pci_error_handlers = {
> +	.error_detected = pci_error_detected,
>  	.slot_reset	= cxl_slot_reset,
>  	.resume		= cxl_error_resume,
> -	.cor_error_detected	= cxl_cor_error_detected,
> +	.cor_error_detected	= pci_cor_error_detected,
>  	.reset_done	= cxl_reset_done,
>  };
>  
> @@ -1124,7 +1124,7 @@ static struct pci_driver cxl_pci_driver = {
>  	.name			= KBUILD_MODNAME,
>  	.id_table		= cxl_mem_pci_tbl,
>  	.probe			= cxl_pci_probe,
> -	.err_handler		= &cxl_error_handlers,
> +	.err_handler		= &pci_error_handlers,
>  	.dev_groups		= cxl_rcd_groups,
>  	.driver	= {
>  		.probe_type	= PROBE_PREFER_ASYNCHRONOUS,


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 14/17] cxl/pci: Introduce CXL Endpoint protocol error handlers
  2025-07-21 22:35   ` Dave Jiang
@ 2025-07-22 18:23     ` Bowman, Terry
  0 siblings, 0 replies; 82+ messages in thread
From: Bowman, Terry @ 2025-07-22 18:23 UTC (permalink / raw)
  To: Dave Jiang, dave, jonathan.cameron, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci



On 7/21/2025 5:35 PM, Dave Jiang wrote:
>
> On 6/26/25 3:42 PM, Terry Bowman wrote:
>> CXL Endpoint protocol errors are currently handled using PCI error
>> handlers. The CXL Endpoint requires CXL specific handling in the case of
>> uncorrectable error (UCE) handling not provided by the PCI handlers.
>>
>> Add CXL specific handlers for CXL Endpoints. Rename the existing
>> cxl_error_handlers to be pci_error_handlers to more correctly indicate
>> the error type and follow naming consistency.
>>
>> The PCI handlers will be called if the CXL device is not trained for
>> alternate protocol (CXL). Update the CXL Endpoint PCI handlers to call the
>> CXL UCE handlers.
> Would the CXL device still be functional if it can't train the CXL protocols? Just wondering if we still need the standard PCI handlers in that case at all. 
>
> DJ

A CXL EP failing training will not support CXL functionality. 

Once training fails the RAS registers may be unavailable. I'm concerned accesses to the 
MMIO RAS registers could possibly cause a MCE if the PCIe device doesn't respond. It will 
depend on how the training fails. This a reason to remove the PCIe handlers.

BTW, the AER status will be logged by the AER driver before a PCIe handler is called.

A while back Dan mentioned we should leave the PCIe EP handlers. He may have an opinion 
or more to add.

-Terry

[snip]


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging
  2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (16 preceding siblings ...)
  2025-06-26 22:42 ` [PATCH v10 17/17] CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup Terry Bowman
@ 2025-07-23 21:55 ` dan.j.williams
  2025-08-18 15:18 ` Joshua Hahn
  18 siblings, 0 replies; 82+ messages in thread
From: dan.j.williams @ 2025-07-23 21:55 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

Terry Bowman wrote:
> This patchset updates CXL Protocol Error handling for CXL Ports and CXL
> Endpoints (EP). The reach of this patchset grew from CXL Ports to include
> EPs as well.
[..]
> == Testing ==
> Testing results below shows the Upstream Switch Port UCE and EP UCE errors
> are handled as PCI errors. This is because aer_get_device_error_info() does
> not populate the AER error severity and status in the case of FATAL UCE on
> Upstream Ports and Endpoints. This is intended because the USP link to
> access the device can be compromised. The check for is_cxl_error() and
> is_internal_error() fail as a result and then processes the error as a PCI
> error. Also, the AER event logging is missing the PCIe AER status.

Are those issues "TODO" or permanent quirks of the implementation?

Although looking at the error message they all seem to correctly say "CXL
Bus Error", I guess I am not seting the end user visible problem of the
details you are pointing out here. I.e. LGTM.

[..]
> == Root Port ==
> root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh

Where can I find these inject scripts?

> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
> pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
> aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
> pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
> pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
> pcieport 0000:0c:00.0:    [14] CorrIntErr    
> cxl_aer_correctable_error: memdev=0000:0c:00.0 host=pci0000:0c serial=0 status='CRC Threshold Hit'

Hmm, why "memdev=" for a root port error? Will take a look at what
cxl_aer_correctable_error() is doing.

[..] 
> base-commit: 716ba3023561ccacfaa28f988d26717535b8fed1

I cannot find this commit in mainline nor linux-next. Please do try to
base series on mainline tags, or otherwise push a public baseline branch
somewhere. Helps reviewers and build bots.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 01/17] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
  2025-06-26 22:42 ` [PATCH v10 01/17] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
  2025-07-18 17:55   ` Dave Jiang
@ 2025-07-23 21:58   ` dan.j.williams
  2025-07-23 22:15     ` Dave Jiang
  1 sibling, 1 reply; 82+ messages in thread
From: dan.j.williams @ 2025-07-23 21:58 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

Terry Bowman wrote:
> The CXL driver's cxl_handle_endpoint_cor_ras()/cxl_handle_endpoint_ras()
> are unnecessary helper functions used only for Endpoints. Remove these
> functions as they are not common for all CXL devices and do not provide
> value for EP handling.
> 
> Rename __cxl_handle_ras to cxl_handle_ras() and __cxl_handle_cor_ras()
> to cxl_handle_cor_ras().
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>

Looks good to me:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

Perhaps this and any other pure cleanups can go into a topic branch for
6.17 so that it does not need to be sent again if this set gets respun.
Dave?

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 01/17] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
  2025-07-23 21:58   ` dan.j.williams
@ 2025-07-23 22:15     ` Dave Jiang
  0 siblings, 0 replies; 82+ messages in thread
From: Dave Jiang @ 2025-07-23 22:15 UTC (permalink / raw)
  To: dan.j.williams, Terry Bowman, dave, jonathan.cameron,
	alison.schofield, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci



On 7/23/25 2:58 PM, dan.j.williams@intel.com wrote:
> Terry Bowman wrote:
>> The CXL driver's cxl_handle_endpoint_cor_ras()/cxl_handle_endpoint_ras()
>> are unnecessary helper functions used only for Endpoints. Remove these
>> functions as they are not common for all CXL devices and do not provide
>> value for EP handling.
>>
>> Rename __cxl_handle_ras to cxl_handle_ras() and __cxl_handle_cor_ras()
>> to cxl_handle_cor_ras().
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> 
> Looks good to me:
> 
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
> 
> Perhaps this and any other pure cleanups can go into a topic branch for
> 6.17 so that it does not need to be sent again if this set gets respun.
> Dave?

Sure. I can pick them up once you are done reviewing this series. Probably should cut things off by end of this week though. 

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 02/17] PCI/CXL: Add pcie_is_cxl()
  2025-06-26 22:42 ` [PATCH v10 02/17] PCI/CXL: Add pcie_is_cxl() Terry Bowman
@ 2025-07-23 22:30   ` dan.j.williams
  2025-08-09 10:56   ` Alejandro Lucero Palau
  1 sibling, 0 replies; 82+ messages in thread
From: dan.j.williams @ 2025-07-23 22:30 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

Terry Bowman wrote:
> CXL and AER drivers need the ability to identify CXL devices.
> 
> Add set_pcie_cxl() with logic checking for CXL Flexbus DVSEC presence. The
> CXL Flexbus DVSEC presence is used because it is required for all the CXL
> PCIe devices.[1]
> 
> Add boolean 'struct pci_dev::is_cxl' with the purpose to cache the CXL
> Flexbus presence.
> 
> Add function pcie_is_cxl() to return 'struct pci_dev::is_cxl'.
> 
> [1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
>     Capability (DVSEC) ID Assignment, Table 8-2
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
>  drivers/pci/probe.c           | 10 ++++++++++
>  include/linux/pci.h           |  6 ++++++
>  include/uapi/linux/pci_regs.h |  8 +++++++-
>  3 files changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 4b8693ec9e4c..5d3548648d5c 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1691,6 +1691,14 @@ static void set_pcie_thunderbolt(struct pci_dev *dev)
>  		dev->is_thunderbolt = 1;
>  }
>  
> +static void set_pcie_cxl(struct pci_dev *dev)
> +{
> +	u16 dvsec = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
> +					      PCI_DVSEC_CXL_FLEXBUS);
> +	if (dvsec)
> +		dev->is_cxl = 1;
> +}
> +
>  static void set_pcie_untrusted(struct pci_dev *dev)
>  {
>  	struct pci_dev *parent = pci_upstream_bridge(dev);
> @@ -2021,6 +2029,8 @@ int pci_setup_device(struct pci_dev *dev)
>  	/* Need to have dev->cfg_size ready */
>  	set_pcie_thunderbolt(dev);
>  
> +	set_pcie_cxl(dev);

Per the comment in the header below, in the case of upstream ports and
endpoints, this should walk to the parent downstream port and make sure
the cxl setting matches. I.e. with hotplug the downstream port may
transition from not-cxl to is-cxl. Update downstream-port parents at the
beginning of life of their CXL child-devices.

> +
>  	set_pcie_untrusted(dev);
>  
>  	if (pci_is_pcie(dev))
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 05e68f35f392..79878243b681 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -453,6 +453,7 @@ struct pci_dev {
>  	unsigned int	is_hotplug_bridge:1;
>  	unsigned int	shpc_managed:1;		/* SHPC owned by shpchp */
>  	unsigned int	is_thunderbolt:1;	/* Thunderbolt controller */
> +	unsigned int	is_cxl:1;               /* Compute Express Link (CXL) */
>  	/*
>  	 * Devices marked being untrusted are the ones that can potentially
>  	 * execute DMA attacks and similar. They are typically connected
> @@ -744,6 +745,11 @@ static inline bool pci_is_vga(struct pci_dev *pdev)
>  	return false;
>  }
>  
> +static inline bool pcie_is_cxl(struct pci_dev *pci_dev)
> +{
> +	return pci_dev->is_cxl;
> +}
> +
>  #define for_each_pci_bridge(dev, bus)				\
>  	list_for_each_entry(dev, &bus->devices, bus_list)	\
>  		if (!pci_is_bridge(dev)) {} else
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index a3a3e942dedf..fb9d77c69d5f 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -1225,9 +1225,15 @@
>  /* Deprecated old name, replaced with PCI_DOE_DATA_OBJECT_DISC_RSP_3_TYPE */
>  #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		PCI_DOE_DATA_OBJECT_DISC_RSP_3_TYPE
>  
> -/* Compute Express Link (CXL r3.1, sec 8.1.5) */
> +/* Compute Express Link (CXL r3.2, sec 8.1)
> + *
> + * Note that CXL DVSEC id 3 and 7 to be ignored when the CXL link state
> + * is "disconnected" (CXL r3.2, sec 9.12.3). Re-enumerate these
> + * registers on downstream link-up events.
> + */
>  #define PCI_DVSEC_CXL_PORT				3
>  #define PCI_DVSEC_CXL_PORT_CTL				0x0c
>  #define PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR		0x00000001
> +#define PCI_DVSEC_CXL_FLEXBUS				7

Please, let us not end up with multiple definitions for the same thing
in multiple places. I note that Alejandro introduces include/cxl/pci.h
with some of the CXL DVSEC definitions from drivers/cxl/cxlpci.h.
Although not all of them, I think a precursor patch to move all of them
in this patch set is the way to go. I.e. drop all these PCI_DVSEC_CXL_*
versions in favor of their existing CXL_DVSEC_* versions.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 03/17] PCI/AER: Report CXL or PCIe bus error type in trace logging
  2025-06-26 22:42 ` [PATCH v10 03/17] PCI/AER: Report CXL or PCIe bus error type in trace logging Terry Bowman
                     ` (3 preceding siblings ...)
  2025-07-01 21:27   ` Dave Jiang
@ 2025-07-23 22:56   ` dan.j.williams
  4 siblings, 0 replies; 82+ messages in thread
From: dan.j.williams @ 2025-07-23 22:56 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

Terry Bowman wrote:
> The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
> for all errors. Update the driver and aer_event tracing to log 'CXL Bus
> Type' for CXL device errors.
> 
> This requires the AER can identify and distinguish between PCIe errors and
> CXL errors.
> 
> Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
> aer_get_device_error_info() and pci_print_aer().
> 
> Update the aer_event trace routine to accept a bus type string parameter.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>

Looks good to me, and agree with Shiju's remove the extra trace comment.

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 04/17] CXL/AER: Introduce CXL specific AER driver file
  2025-06-26 22:42 ` [PATCH v10 04/17] CXL/AER: Introduce CXL specific AER driver file Terry Bowman
  2025-06-26 23:42   ` Sathyanarayanan Kuppuswamy
@ 2025-07-24  0:01   ` dan.j.williams
  2025-07-24 17:06     ` Bowman, Terry
  2025-07-24  1:16   ` dan.j.williams
  2 siblings, 1 reply; 82+ messages in thread
From: dan.j.williams @ 2025-07-24  0:01 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

Terry Bowman wrote:
> The CXL AER error handling logic currently resides in the AER driver file,
> drivers/pci/pcie/aer.c. CXL specific changes are conditionally compiled
> using #ifdefs.
> 
> Improve the AER driver maintainability by separating the CXL specific logic
> from the AER driver's core functionality and removing the #ifdefs.
> Introduce drivers/pci/pcie/cxl_aer.c and move the CXL AER logic into the
> new file.
> 
> Update the makefile to conditionally compile the CXL file using the
> existing CONFIG_PCIEAER_CXL Kconfig.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

[..]
> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
> index e2d71b6fdd84..31b3935bf189 100644
> --- a/include/linux/pci_ids.h
> +++ b/include/linux/pci_ids.h
> @@ -12,6 +12,8 @@
>  
>  /* Device classes and subclasses */
>  
> +#define PCI_CLASS_CODE_MASK             0xFFFF00

Per other comments do not add this updated in the same patch as the
move.

When / if you submit it separately it likely also belongs next to
PCI_CLASS_REVISION in include/uapi/linux/pci_regs.h defined with
__GENMASK(23, 8).

Otherwise, with this change dropped you can add:

Reviewed-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 04/17] CXL/AER: Introduce CXL specific AER driver file
  2025-06-26 22:42 ` [PATCH v10 04/17] CXL/AER: Introduce CXL specific AER driver file Terry Bowman
  2025-06-26 23:42   ` Sathyanarayanan Kuppuswamy
  2025-07-24  0:01   ` dan.j.williams
@ 2025-07-24  1:16   ` dan.j.williams
  2025-07-24 17:02     ` Bowman, Terry
  2 siblings, 1 reply; 82+ messages in thread
From: dan.j.williams @ 2025-07-24  1:16 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

Terry Bowman wrote:
> The CXL AER error handling logic currently resides in the AER driver file,
> drivers/pci/pcie/aer.c. CXL specific changes are conditionally compiled
> using #ifdefs.
> 
> Improve the AER driver maintainability by separating the CXL specific logic
> from the AER driver's core functionality and removing the #ifdefs.
> Introduce drivers/pci/pcie/cxl_aer.c and move the CXL AER logic into the
> new file.
> 
> Update the makefile to conditionally compile the CXL file using the
> existing CONFIG_PCIEAER_CXL Kconfig.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---

After reading patch5 I want to qualify my Reviewed-by:...

>  drivers/pci/pci.h          |   8 +++
>  drivers/pci/pcie/Makefile  |   1 +
>  drivers/pci/pcie/aer.c     | 138 -------------------------------------
>  drivers/pci/pcie/cxl_aer.c | 138 +++++++++++++++++++++++++++++++++++++

This is a poor name for this file because the functionality only relates to
code that supports a dead-end generation of RCH / RCD hardware platforms. 

I do agree that it should be removed from aer.c so typical PCIe AER
maintenance does not need to trip over that cruft.

Please call it something like rch_aer.c so it is tucked out of the way,
sticks out as odd in any future diffstat, and does not confuse from the
CXL VH error handling that supports current and future generation
hardware.

Perhaps even move it to its own silent Kconfig symbol with a deprecation
warning, something like below, so someone remembers to delete it.

diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
index 17919b99fa66..da88358bbb4f 100644
--- a/drivers/pci/pcie/Kconfig
+++ b/drivers/pci/pcie/Kconfig
@@ -58,6 +58,13 @@ config PCIEAER_CXL
 
 	  If unsure, say Y.
 
+# Restricted CXL Host (RCH) error handling supports first generation CXL
+# hardware and can be deprecated in 7-10 years when only CXL Virtual Host
+# (CXL specification version 2+) hardware remains in service
+config RCH_AER
+	def_bool y
+	depends on PCIEAER_CXL
+
 #
 # PCI Express ECRC
 #

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
  2025-06-26 22:42 ` [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors Terry Bowman
  2025-06-27 10:24   ` Jonathan Cameron
  2025-07-01 21:53   ` Dave Jiang
@ 2025-07-24  2:01   ` dan.j.williams
  2025-07-24 17:21     ` Bowman, Terry
  2 siblings, 1 reply; 82+ messages in thread
From: dan.j.williams @ 2025-07-24  2:01 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

Terry Bowman wrote:
> CXL error handling will soon be moved from the AER driver into the CXL
> driver. This requires a notification mechanism for the AER driver to share
> the AER interrupt with the CXL driver. The notification will be used
> as an indication for the CXL drivers to handle and log the CXL RAS errors.
> 
> First, introduce cxl/core/native_ras.c to contain changes for the CXL
> driver's RAS native handling. This as an alternative to dropping the
> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
> conditionally compile the new file.

I see no daylight between CXL_NATIVE_RAS and PCIEAER_CXL, one of those
needs to subsume the other. I also do not understand the point of
"NATIVE" in the name. Will not CPER notified protocol errors be routed
to the same CXL error handling infrastructure as AER notified protocol
errors? I.e. the aer_recover_queue() path?

> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
> driver will be the sole kfifo producer adding work and the cxl_core will be
> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
> 
> Add CXL work queue handler registration functions in the AER driver. Export
> the functions allowing CXL driver to access. Implement registration
> functions for the CXL driver to assign or clear the work handler function.
> 
> Introduce 'struct cxl_proto_err_info' to serve as the kfifo work data. This
> will contain the erring device's PCI SBDF details used to rediscover the
> device after the CXL driver dequeues the kfifo work. The device rediscovery
> will be introduced along with the CXL handling in future patches.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/Kconfig           | 14 ++++++++
>  drivers/cxl/core/Makefile     |  1 +
>  drivers/cxl/core/core.h       |  8 +++++
>  drivers/cxl/core/native_ras.c | 26 +++++++++++++++
>  drivers/cxl/core/port.c       |  2 ++
>  drivers/cxl/core/ras.c        |  1 +
>  drivers/cxl/cxlpci.h          |  1 +
>  drivers/pci/pci.h             |  4 +++
>  drivers/pci/pcie/aer.c        |  7 ++--
>  drivers/pci/pcie/cxl_aer.c    | 60 +++++++++++++++++++++++++++++++++++
>  include/linux/aer.h           | 31 ++++++++++++++++++
>  11 files changed, 153 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/cxl/core/native_ras.c
> 
> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
> index 48b7314afdb8..57274de54a45 100644
> --- a/drivers/cxl/Kconfig
> +++ b/drivers/cxl/Kconfig
> @@ -233,4 +233,18 @@ config CXL_MCE
>  	def_bool y
>  	depends on X86_MCE && MEMORY_FAILURE
>  
> +config CXL_NATIVE_RAS
> +	bool "CXL: Enable CXL RAS native handling"
> +	depends on PCIEAER_CXL

This nice helpful option is hidden if someone forgets to set the
PCIEAER_CXL option which does not have helpful text. Given the
dependencies, I am leaning towards drop this new option, move the
help text to PCIEAER_CXL... but let me read the rest of the patch first.

You can move PCIEAER_CXL to drivers/cxl/Kconfig if you want to keep all
the CXL options in the CXL menu.

> +	default CXL_BUS
> +	help
> +	  Enable native CXL RAS protocol error handling and logging in the CXL
> +	  drivers. This functionality relies on the AER service driver being
> +	  enabled,

No need to put dependencies in the help text the tool will tell them
that PCIEAER=y is a dependency.

>         as the AER interrupt is used to inform the operating system
> +	  of CXL RAS protocol errors. The platform must be configured to
> +	  utilize AER reporting for interrupts.

Per above, does any of CXL CPER reporting make its way into this path?

> +
> +	  If unsure, or if this kernel is meant for production environments,
> +	  say Y.

I think: "If unsure, say Y" is sufficient.

> +
>  endif
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index 79e2ef81fde8..16f5832e5cc4 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -21,3 +21,4 @@ cxl_core-$(CONFIG_CXL_REGION) += region.o
>  cxl_core-$(CONFIG_CXL_MCE) += mce.o
>  cxl_core-$(CONFIG_CXL_FEATURES) += features.o
>  cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
> +cxl_core-$(CONFIG_CXL_NATIVE_RAS) += native_ras.o
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 29b61828a847..4c08bb92e2f9 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -123,6 +123,14 @@ int cxl_gpf_port_setup(struct cxl_dport *dport);
>  int cxl_acpi_get_extended_linear_cache_size(struct resource *backing_res,
>  					    int nid, resource_size_t *size);
>  
> +#ifdef CONFIG_PCIEAER_CXL
> +void cxl_native_ras_init(void);
> +void cxl_native_ras_exit(void);
> +#else
> +static inline void cxl_native_ras_init(void) { };
> +static inline void cxl_native_ras_exit(void) { };
> +#endif
> +
>  #ifdef CONFIG_CXL_FEATURES
>  struct cxl_feat_entry *
>  cxl_feature_info(struct cxl_features_state *cxlfs, const uuid_t *uuid);
> diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
> new file mode 100644
> index 000000000000..011815ddaae3
> --- /dev/null
> +++ b/drivers/cxl/core/native_ras.c
> @@ -0,0 +1,26 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
> +
> +#include <linux/pci.h>
> +#include <linux/aer.h>
> +#include <cxl/event.h>
> +#include <cxlmem.h>
> +#include <core/core.h>
> +
> +static void cxl_proto_err_work_fn(struct work_struct *work)
> +{
> +}
> +
> +static struct work_struct cxl_proto_err_work;
> +static DECLARE_WORK(cxl_proto_err_work, cxl_proto_err_work_fn);
> +
> +void cxl_native_ras_init(void)
> +{
> +	cxl_register_proto_err_work(&cxl_proto_err_work);
> +}
> +
> +void cxl_native_ras_exit(void)
> +{
> +	cxl_unregister_proto_err_work();
> +	cancel_work_sync(&cxl_proto_err_work);
> +}
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index eb46c6764d20..8e8f21197c86 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -2345,6 +2345,8 @@ static __init int cxl_core_init(void)
>  	if (rc)
>  		goto err_ras;
>  
> +	cxl_native_ras_init();
> +
>  	return 0;
>  
>  err_ras:
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 485a831695c7..962dc94fed8c 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -5,6 +5,7 @@
>  #include <linux/aer.h>
>  #include <cxl/event.h>
>  #include <cxlmem.h>
> +#include <cxlpci.h>
>  #include "trace.h"
>  
>  static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 54e219b0049e..6f1396ef7b77 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -4,6 +4,7 @@
>  #define __CXL_PCI_H__
>  #include <linux/pci.h>
>  #include "cxl.h"
> +#include "linux/aer.h"
>  
>  #define CXL_MEMORY_PROGIF	0x10
>  
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 91b583cf18eb..29c11c7136d3 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -1032,9 +1032,13 @@ static inline void pci_restore_aer_state(struct pci_dev *dev) { }
>  #ifdef CONFIG_PCIEAER_CXL
>  void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info);
>  void cxl_rch_enable_rcec(struct pci_dev *rcec);
> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info);
> +void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info);
>  #else
>  static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { }
>  static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
> +static inline bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info) { return false; }
> +static inline void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info) { }
>  #endif
>  
>  #ifdef CONFIG_ACPI
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 0b4d721980ef..8417a49c71f2 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1130,8 +1130,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  
>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>  {
> -	cxl_rch_handle_error(dev, info);

No, can not just drop what was working before, even if you restore the
functionality in a later patch in the same series.

I would expect that this patch at a minimum maintains RCH handling and
forwards anything else to the CXL core for VH handling.

> -	pci_aer_handle_error(dev, info);
> +	if (is_cxl_error(dev, info))
> +		forward_cxl_error(dev, info);
> +	else
> +		pci_aer_handle_error(dev, info);
> +
>  	pci_dev_put(dev);
>  }
>  
> diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
> index b2ea14f70055..846ab55d747c 100644
> --- a/drivers/pci/pcie/cxl_aer.c
> +++ b/drivers/pci/pcie/cxl_aer.c

With the RCH bits moved to its own file then this file would be 100%
concerned with typical CXL VH error handling and deserve to carry the
"cxl_aer.c" name.

> @@ -3,8 +3,11 @@
>  
>  #include <linux/pci.h>
>  #include <linux/aer.h>
> +#include <linux/kfifo.h>
>  #include "../pci.h"
>  
> +#define CXL_ERROR_SOURCES_MAX          128
> +
>  /**
>   * pci_aer_unmask_internal_errors - unmask internal errors
>   * @dev: pointer to the pci_dev data structure
> @@ -64,6 +67,19 @@ static bool is_internal_error(struct aer_err_info *info)
>  	return info->status & PCI_ERR_UNC_INTN;
>  }
>  
> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
> +{
> +	if (!info || !info->is_cxl)
> +		return false;
> +
> +	/* Only CXL Endpoints are currently supported */
> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC))
> +		return false;
> +
> +	return is_internal_error(info);
> +}
> +
>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>  {
>  	struct aer_err_info *info = (struct aer_err_info *)data;
> @@ -136,3 +152,47 @@ void cxl_rch_enable_rcec(struct pci_dev *rcec)
>  	pci_info(rcec, "CXL: Internal errors unmasked");
>  }
>  
> +static DEFINE_KFIFO(cxl_proto_err_fifo, struct cxl_proto_err_work_data,
> +		    CXL_ERROR_SOURCES_MAX);
> +static DEFINE_SPINLOCK(cxl_proto_err_fifo_lock);
> +struct work_struct *cxl_proto_err_work;

Please make this one combo object with one registration entry point. 

struct cxl_prot_err_work {
        struct work_struct work;
        DECLARE_KFIFO(fifo, struct cxl_proto_err_work_data,
                      CXL_ERROR_SOURCES_MAX);
};      

Bonus points to go back and clean up the CPER code to do the same to
reduce the amount of "registration" APIs.

> +
> +void cxl_register_proto_err_work(struct work_struct *work)
> +{
> +	guard(spinlock)(&cxl_proto_err_fifo_lock);

This lock acquisition is not protecting anything. 'unsigned long'
assignments are already atomic and forward_cxl_error() looks like it
happily de-references NULL pointers without checking the lock.

I would make it an rwsem. Hold the rwsem for write at registration /
unregistration...

> +	cxl_proto_err_work = work;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_register_proto_err_work, "CXL");
> +
> +void cxl_unregister_proto_err_work(void)
> +{
> +	guard(spinlock)(&cxl_proto_err_fifo_lock);
> +	cxl_proto_err_work = NULL;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_unregister_proto_err_work, "CXL");
> +
> +int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd)
> +{
> +	return kfifo_get(&cxl_proto_err_fifo, wd);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_proto_err_kfifo_get, "CXL");
> +
> +void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info)
> +{
> +	struct cxl_proto_err_work_data wd;
> +
> +	wd.err_info = (struct cxl_proto_error_info) {
> +		.severity = aer_err_info->severity,
> +		.devfn = pdev->devfn,
> +		.bus = pdev->bus->number,
> +		.segment = pci_domain_nr(pdev->bus)
> +	};

...hold the rwsem for read when de-referencing a 'struct
cxl_prot_err_work *'

> +
> +	if (!kfifo_put(&cxl_proto_err_fifo, wd)) {
> +		dev_err_ratelimited(&pdev->dev, "CXL kfifo overflow\n");

In the case that 'struct cxl_prot_err_work *' is NULL, perhaps this
should be a dev_warn_once() to say "hey, we're seeing CXL errors, but
nobody registered the CXL core!?".

> +		return;
> +	}
> +
> +	schedule_work(cxl_proto_err_work);
> +}
> +
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 02940be66324..24c3d9e18ad5 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -10,6 +10,7 @@
>  
>  #include <linux/errno.h>
>  #include <linux/types.h>
> +#include <linux/workqueue_types.h>
>  
>  #define AER_NONFATAL			0
>  #define AER_FATAL			1
> @@ -53,6 +54,26 @@ struct aer_capability_regs {
>  	u16 uncor_err_source;
>  };
>  
> +/**
> + * struct cxl_proto_err_info - Error information used in CXL error handling
> + * @severity: AER severity
> + * @function: Device's PCI function
> + * @device: Device's PCI device
> + * @bus: Device's PCI bus
> + * @segment: Device's PCI segment
> + */
> +struct cxl_proto_error_info {
> +	int severity;
> +
> +	u8 devfn;
> +	u8 bus;
> +	u16 segment;
> +};
> +
> +struct cxl_proto_err_work_data {
> +	struct cxl_proto_error_info err_info;
> +};

Why not use cxl_proto_error_info directly?

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 04/17] CXL/AER: Introduce CXL specific AER driver file
  2025-07-24  1:16   ` dan.j.williams
@ 2025-07-24 17:02     ` Bowman, Terry
  2025-07-24 20:23       ` dan.j.williams
  0 siblings, 1 reply; 82+ messages in thread
From: Bowman, Terry @ 2025-07-24 17:02 UTC (permalink / raw)
  To: dan.j.williams, dave, jonathan.cameron, dave.jiang,
	alison.schofield, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci



On 7/23/2025 8:16 PM, dan.j.williams@intel.com wrote:
> Terry Bowman wrote:
>> The CXL AER error handling logic currently resides in the AER driver file,
>> drivers/pci/pcie/aer.c. CXL specific changes are conditionally compiled
>> using #ifdefs.
>>
>> Improve the AER driver maintainability by separating the CXL specific logic
>> from the AER driver's core functionality and removing the #ifdefs.
>> Introduce drivers/pci/pcie/cxl_aer.c and move the CXL AER logic into the
>> new file.
>>
>> Update the makefile to conditionally compile the CXL file using the
>> existing CONFIG_PCIEAER_CXL Kconfig.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
> After reading patch5 I want to qualify my Reviewed-by:...
>
>>  drivers/pci/pci.h          |   8 +++
>>  drivers/pci/pcie/Makefile  |   1 +
>>  drivers/pci/pcie/aer.c     | 138 -------------------------------------
>>  drivers/pci/pcie/cxl_aer.c | 138 +++++++++++++++++++++++++++++++++++++
> This is a poor name for this file because the functionality only relates to
> code that supports a dead-end generation of RCH / RCD hardware platforms. 
>
> I do agree that it should be removed from aer.c so typical PCIe AER
> maintenance does not need to trip over that cruft.
>
> Please call it something like rch_aer.c so it is tucked out of the way,
> sticks out as odd in any future diffstat, and does not confuse from the
> CXL VH error handling that supports current and future generation
> hardware.
>
> Perhaps even move it to its own silent Kconfig symbol with a deprecation
> warning, something like below, so someone remembers to delete it.

cxl_rch_handle_error_iter() and cxl_rch_handle_error() need to be moved from pci/pcie/cxl_aer.c
into cxl/core/native_ras.c introduced in this series. There is no RCH or VH handling in cxl_aer.c. 
cxl_aer.c serves to detect if an error is a CXL error and if it is then it forwards it to the 
CXL drivers using the kfifo introduced later. I will update the commit message stating more 
will be added later.

Dave Jiang introduced cxl/core/pci_aer.c I understand the name is still up for possible change.
The native_ras.c changes in this series is planned to be moved into cxl/core/pci_aer.c for v11. 
The files were created with the same purpose but we used different filenames and need to converge.

Let me know if you still want the rename to rch_aer.c.

-Terry

> diff --git a/drivers/pci/pcie/Kconfig b/drivers/pci/pcie/Kconfig
> index 17919b99fa66..da88358bbb4f 100644
> --- a/drivers/pci/pcie/Kconfig
> +++ b/drivers/pci/pcie/Kconfig
> @@ -58,6 +58,13 @@ config PCIEAER_CXL
>  
>  	  If unsure, say Y.
>  
> +# Restricted CXL Host (RCH) error handling supports first generation CXL
> +# hardware and can be deprecated in 7-10 years when only CXL Virtual Host
> +# (CXL specification version 2+) hardware remains in service
> +config RCH_AER
> +	def_bool y
> +	depends on PCIEAER_CXL
> +
>  #
>  # PCI Express ECRC
>  #


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 04/17] CXL/AER: Introduce CXL specific AER driver file
  2025-07-24  0:01   ` dan.j.williams
@ 2025-07-24 17:06     ` Bowman, Terry
  2025-07-24 20:32       ` dan.j.williams
  0 siblings, 1 reply; 82+ messages in thread
From: Bowman, Terry @ 2025-07-24 17:06 UTC (permalink / raw)
  To: dan.j.williams, dave, jonathan.cameron, dave.jiang,
	alison.schofield, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci



On 7/23/2025 7:01 PM, dan.j.williams@intel.com wrote:
> Terry Bowman wrote:
>> The CXL AER error handling logic currently resides in the AER driver file,
>> drivers/pci/pcie/aer.c. CXL specific changes are conditionally compiled
>> using #ifdefs.
>>
>> Improve the AER driver maintainability by separating the CXL specific logic
>> from the AER driver's core functionality and removing the #ifdefs.
>> Introduce drivers/pci/pcie/cxl_aer.c and move the CXL AER logic into the
>> new file.
>>
>> Update the makefile to conditionally compile the CXL file using the
>> existing CONFIG_PCIEAER_CXL Kconfig.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> [..]
>> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
>> index e2d71b6fdd84..31b3935bf189 100644
>> --- a/include/linux/pci_ids.h
>> +++ b/include/linux/pci_ids.h
>> @@ -12,6 +12,8 @@
>>  
>>  /* Device classes and subclasses */
>>  
>> +#define PCI_CLASS_CODE_MASK             0xFFFF00
> Per other comments do not add this updated in the same patch as the
> move.
>
> When / if you submit it separately it likely also belongs next to
> PCI_CLASS_REVISION in include/uapi/linux/pci_regs.h defined with
> __GENMASK(23, 8).

include/uapi/linux/pci_regs.h appears to use all values without using GENMASK().
Just adding as a note. I'm making the change.

> Otherwise, with this change dropped you can add:
>
> Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Your next email pauses this. I'll respond there.

-Terry

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
  2025-07-24  2:01   ` dan.j.williams
@ 2025-07-24 17:21     ` Bowman, Terry
  2025-07-24 20:55       ` dan.j.williams
  0 siblings, 1 reply; 82+ messages in thread
From: Bowman, Terry @ 2025-07-24 17:21 UTC (permalink / raw)
  To: dan.j.williams, dave, jonathan.cameron, dave.jiang,
	alison.schofield, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci



On 7/23/2025 9:01 PM, dan.j.williams@intel.com wrote:
> Terry Bowman wrote:
>> CXL error handling will soon be moved from the AER driver into the CXL
>> driver. This requires a notification mechanism for the AER driver to share
>> the AER interrupt with the CXL driver. The notification will be used
>> as an indication for the CXL drivers to handle and log the CXL RAS errors.
>>
>> First, introduce cxl/core/native_ras.c to contain changes for the CXL
>> driver's RAS native handling. This as an alternative to dropping the
>> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
>> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
>> conditionally compile the new file.
> I see no daylight between CXL_NATIVE_RAS and PCIEAER_CXL, one of those
> needs to subsume the other. I also do not understand the point of
> "NATIVE" in the name. Will not CPER notified protocol errors be routed
> to the same CXL error handling infrastructure as AER notified protocol
> errors? I.e. the aer_recover_queue() path?

This change and comment is planned to be removed in v11. Instead of introducing this 
as a new file. The same changes will instead be added to pci_aer.c/pci_ras.c Dave Jiang
is introducing here:

https://lore.kernel.org/linux-cxl/20250721170415.285961-1-dave.jiang@intel.com/
>> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
>> driver will be the sole kfifo producer adding work and the cxl_core will be
>> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
>>
>> Add CXL work queue handler registration functions in the AER driver. Export
>> the functions allowing CXL driver to access. Implement registration
>> functions for the CXL driver to assign or clear the work handler function.
>>
>> Introduce 'struct cxl_proto_err_info' to serve as the kfifo work data. This
>> will contain the erring device's PCI SBDF details used to rediscover the
>> device after the CXL driver dequeues the kfifo work. The device rediscovery
>> will be introduced along with the CXL handling in future patches.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>  drivers/cxl/Kconfig           | 14 ++++++++
>>  drivers/cxl/core/Makefile     |  1 +
>>  drivers/cxl/core/core.h       |  8 +++++
>>  drivers/cxl/core/native_ras.c | 26 +++++++++++++++
>>  drivers/cxl/core/port.c       |  2 ++
>>  drivers/cxl/core/ras.c        |  1 +
>>  drivers/cxl/cxlpci.h          |  1 +
>>  drivers/pci/pci.h             |  4 +++
>>  drivers/pci/pcie/aer.c        |  7 ++--
>>  drivers/pci/pcie/cxl_aer.c    | 60 +++++++++++++++++++++++++++++++++++
>>  include/linux/aer.h           | 31 ++++++++++++++++++
>>  11 files changed, 153 insertions(+), 2 deletions(-)
>>  create mode 100644 drivers/cxl/core/native_ras.c
>>
>> diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
>> index 48b7314afdb8..57274de54a45 100644
>> --- a/drivers/cxl/Kconfig
>> +++ b/drivers/cxl/Kconfig
>> @@ -233,4 +233,18 @@ config CXL_MCE
>>  	def_bool y
>>  	depends on X86_MCE && MEMORY_FAILURE
>>  
>> +config CXL_NATIVE_RAS
>> +	bool "CXL: Enable CXL RAS native handling"
>> +	depends on PCIEAER_CXL
> This nice helpful option is hidden if someone forgets to set the
> PCIEAER_CXL option which does not have helpful text. Given the
> dependencies, I am leaning towards drop this new option, move the
> help text to PCIEAER_CXL... but let me read the rest of the patch first.
>
> You can move PCIEAER_CXL to drivers/cxl/Kconfig if you want to keep all
> the CXL options in the CXL menu.
>
>> +	default CXL_BUS
>> +	help
>> +	  Enable native CXL RAS protocol error handling and logging in the CXL
>> +	  drivers. This functionality relies on the AER service driver being
>> +	  enabled,
> No need to put dependencies in the help text the tool will tell them
> that PCIEAER=y is a dependency.
>
>>         as the AER interrupt is used to inform the operating system
>> +	  of CXL RAS protocol errors. The platform must be configured to
>> +	  utilize AER reporting for interrupts.
> Per above, does any of CXL CPER reporting make its way into this path?
>
>> +
>> +	  If unsure, or if this kernel is meant for production environments,
>> +	  say Y.
> I think: "If unsure, say Y" is sufficient.
>
>> +
>>  endif
>> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
>> index 79e2ef81fde8..16f5832e5cc4 100644
>> --- a/drivers/cxl/core/Makefile
>> +++ b/drivers/cxl/core/Makefile
>> @@ -21,3 +21,4 @@ cxl_core-$(CONFIG_CXL_REGION) += region.o
>>  cxl_core-$(CONFIG_CXL_MCE) += mce.o
>>  cxl_core-$(CONFIG_CXL_FEATURES) += features.o
>>  cxl_core-$(CONFIG_CXL_EDAC_MEM_FEATURES) += edac.o
>> +cxl_core-$(CONFIG_CXL_NATIVE_RAS) += native_ras.o
>> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
>> index 29b61828a847..4c08bb92e2f9 100644
>> --- a/drivers/cxl/core/core.h
>> +++ b/drivers/cxl/core/core.h
>> @@ -123,6 +123,14 @@ int cxl_gpf_port_setup(struct cxl_dport *dport);
>>  int cxl_acpi_get_extended_linear_cache_size(struct resource *backing_res,
>>  					    int nid, resource_size_t *size);
>>  
>> +#ifdef CONFIG_PCIEAER_CXL
>> +void cxl_native_ras_init(void);
>> +void cxl_native_ras_exit(void);
>> +#else
>> +static inline void cxl_native_ras_init(void) { };
>> +static inline void cxl_native_ras_exit(void) { };
>> +#endif
>> +
>>  #ifdef CONFIG_CXL_FEATURES
>>  struct cxl_feat_entry *
>>  cxl_feature_info(struct cxl_features_state *cxlfs, const uuid_t *uuid);
>> diff --git a/drivers/cxl/core/native_ras.c b/drivers/cxl/core/native_ras.c
>> new file mode 100644
>> index 000000000000..011815ddaae3
>> --- /dev/null
>> +++ b/drivers/cxl/core/native_ras.c
>> @@ -0,0 +1,26 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/* Copyright(c) 2025 AMD Corporation. All rights reserved. */
>> +
>> +#include <linux/pci.h>
>> +#include <linux/aer.h>
>> +#include <cxl/event.h>
>> +#include <cxlmem.h>
>> +#include <core/core.h>
>> +
>> +static void cxl_proto_err_work_fn(struct work_struct *work)
>> +{
>> +}
>> +
>> +static struct work_struct cxl_proto_err_work;
>> +static DECLARE_WORK(cxl_proto_err_work, cxl_proto_err_work_fn);
>> +
>> +void cxl_native_ras_init(void)
>> +{
>> +	cxl_register_proto_err_work(&cxl_proto_err_work);
>> +}
>> +
>> +void cxl_native_ras_exit(void)
>> +{
>> +	cxl_unregister_proto_err_work();
>> +	cancel_work_sync(&cxl_proto_err_work);
>> +}
>> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
>> index eb46c6764d20..8e8f21197c86 100644
>> --- a/drivers/cxl/core/port.c
>> +++ b/drivers/cxl/core/port.c
>> @@ -2345,6 +2345,8 @@ static __init int cxl_core_init(void)
>>  	if (rc)
>>  		goto err_ras;
>>  
>> +	cxl_native_ras_init();
>> +
>>  	return 0;
>>  
>>  err_ras:
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index 485a831695c7..962dc94fed8c 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
>> @@ -5,6 +5,7 @@
>>  #include <linux/aer.h>
>>  #include <cxl/event.h>
>>  #include <cxlmem.h>
>> +#include <cxlpci.h>
>>  #include "trace.h"
>>  
>>  static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
>> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
>> index 54e219b0049e..6f1396ef7b77 100644
>> --- a/drivers/cxl/cxlpci.h
>> +++ b/drivers/cxl/cxlpci.h
>> @@ -4,6 +4,7 @@
>>  #define __CXL_PCI_H__
>>  #include <linux/pci.h>
>>  #include "cxl.h"
>> +#include "linux/aer.h"
>>  
>>  #define CXL_MEMORY_PROGIF	0x10
>>  
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index 91b583cf18eb..29c11c7136d3 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -1032,9 +1032,13 @@ static inline void pci_restore_aer_state(struct pci_dev *dev) { }
>>  #ifdef CONFIG_PCIEAER_CXL
>>  void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info);
>>  void cxl_rch_enable_rcec(struct pci_dev *rcec);
>> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info);
>> +void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info);
>>  #else
>>  static inline void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info) { }
>>  static inline void cxl_rch_enable_rcec(struct pci_dev *rcec) { }
>> +static inline bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info) { return false; }
>> +static inline void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info) { }
>>  #endif
>>  
>>  #ifdef CONFIG_ACPI
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 0b4d721980ef..8417a49c71f2 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -1130,8 +1130,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>  
>>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>>  {
>> -	cxl_rch_handle_error(dev, info);
> No, can not just drop what was working before, even if you restore the
> functionality in a later patch in the same series.
>
> I would expect that this patch at a minimum maintains RCH handling and
> forwards anything else to the CXL core for VH handling.

You want RCH handling to stay here in the AER driver or does the patch changes need 
to be reworked to present the change better?

>> -	pci_aer_handle_error(dev, info);
>> +	if (is_cxl_error(dev, info))
>> +		forward_cxl_error(dev, info);
>> +	else
>> +		pci_aer_handle_error(dev, info);
>> +
>>  	pci_dev_put(dev);
>>  }
>>  
>> diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
>> index b2ea14f70055..846ab55d747c 100644
>> --- a/drivers/pci/pcie/cxl_aer.c
>> +++ b/drivers/pci/pcie/cxl_aer.c
> With the RCH bits moved to its own file then this file would be 100%
> concerned with typical CXL VH error handling and deserve to carry the
> "cxl_aer.c" name.
The plan was to move all the handling to cxl/core/pci_aer.c or whatever it is renamed.
>> @@ -3,8 +3,11 @@
>>  
>>  #include <linux/pci.h>
>>  #include <linux/aer.h>
>> +#include <linux/kfifo.h>
>>  #include "../pci.h"
>>  
>> +#define CXL_ERROR_SOURCES_MAX          128
>> +
>>  /**
>>   * pci_aer_unmask_internal_errors - unmask internal errors
>>   * @dev: pointer to the pci_dev data structure
>> @@ -64,6 +67,19 @@ static bool is_internal_error(struct aer_err_info *info)
>>  	return info->status & PCI_ERR_UNC_INTN;
>>  }
>>  
>> +bool is_cxl_error(struct pci_dev *pdev, struct aer_err_info *info)
>> +{
>> +	if (!info || !info->is_cxl)
>> +		return false;
>> +
>> +	/* Only CXL Endpoints are currently supported */
>> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
>> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_EC))
>> +		return false;
>> +
>> +	return is_internal_error(info);
>> +}
>> +
>>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>  {
>>  	struct aer_err_info *info = (struct aer_err_info *)data;
>> @@ -136,3 +152,47 @@ void cxl_rch_enable_rcec(struct pci_dev *rcec)
>>  	pci_info(rcec, "CXL: Internal errors unmasked");
>>  }
>>  
>> +static DEFINE_KFIFO(cxl_proto_err_fifo, struct cxl_proto_err_work_data,
>> +		    CXL_ERROR_SOURCES_MAX);
>> +static DEFINE_SPINLOCK(cxl_proto_err_fifo_lock);
>> +struct work_struct *cxl_proto_err_work;
> Please make this one combo object with one registration entry point. 
>
> struct cxl_prot_err_work {
>         struct work_struct work;
>         DECLARE_KFIFO(fifo, struct cxl_proto_err_work_data,
>                       CXL_ERROR_SOURCES_MAX);
> };      
>
> Bonus points to go back and clean up the CPER code to do the same to
> reduce the amount of "registration" APIs.

Ok.

>> +
>> +void cxl_register_proto_err_work(struct work_struct *work)
>> +{
>> +	guard(spinlock)(&cxl_proto_err_fifo_lock);
> This lock acquisition is not protecting anything. 'unsigned long'
> assignments are already atomic and forward_cxl_error() looks like it
> happily de-references NULL pointers without checking the lock.
>
> I would make it an rwsem. Hold the rwsem for write at registration /
> unregistration...

Ok.

>> +	cxl_proto_err_work = work;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_register_proto_err_work, "CXL");
>> +
>> +void cxl_unregister_proto_err_work(void)
>> +{
>> +	guard(spinlock)(&cxl_proto_err_fifo_lock);
>> +	cxl_proto_err_work = NULL;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_unregister_proto_err_work, "CXL");
>> +
>> +int cxl_proto_err_kfifo_get(struct cxl_proto_err_work_data *wd)
>> +{
>> +	return kfifo_get(&cxl_proto_err_fifo, wd);
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_proto_err_kfifo_get, "CXL");
>> +
>> +void forward_cxl_error(struct pci_dev *pdev, struct aer_err_info *aer_err_info)
>> +{
>> +	struct cxl_proto_err_work_data wd;
>> +
>> +	wd.err_info = (struct cxl_proto_error_info) {
>> +		.severity = aer_err_info->severity,
>> +		.devfn = pdev->devfn,
>> +		.bus = pdev->bus->number,
>> +		.segment = pci_domain_nr(pdev->bus)
>> +	};
> ...hold the rwsem for read when de-referencing a 'struct
> cxl_prot_err_work *'
>
>> +
>> +	if (!kfifo_put(&cxl_proto_err_fifo, wd)) {
>> +		dev_err_ratelimited(&pdev->dev, "CXL kfifo overflow\n");
> In the case that 'struct cxl_prot_err_work *' is NULL, perhaps this
> should be a dev_warn_once() to say "hey, we're seeing CXL errors, but
> nobody registered the CXL core!?".

Ok.

>> +		return;
>> +	}
>> +
>> +	schedule_work(cxl_proto_err_work);
>> +}
>> +
>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>> index 02940be66324..24c3d9e18ad5 100644
>> --- a/include/linux/aer.h
>> +++ b/include/linux/aer.h
>> @@ -10,6 +10,7 @@
>>  
>>  #include <linux/errno.h>
>>  #include <linux/types.h>
>> +#include <linux/workqueue_types.h>
>>  
>>  #define AER_NONFATAL			0
>>  #define AER_FATAL			1
>> @@ -53,6 +54,26 @@ struct aer_capability_regs {
>>  	u16 uncor_err_source;
>>  };
>>  
>> +/**
>> + * struct cxl_proto_err_info - Error information used in CXL error handling
>> + * @severity: AER severity
>> + * @function: Device's PCI function
>> + * @device: Device's PCI device
>> + * @bus: Device's PCI bus
>> + * @segment: Device's PCI segment
>> + */
>> +struct cxl_proto_error_info {
>> +	int severity;
>> +
>> +	u8 devfn;
>> +	u8 bus;
>> +	u16 segment;
>> +};
>> +
>> +struct cxl_proto_err_work_data {
>> +	struct cxl_proto_error_info err_info;
>> +};
> Why not use cxl_proto_error_info directly?
At one point I thought there was a good reason for using it later in another case.
I'll use cxl_proto_error_info directly.

-Terry


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 04/17] CXL/AER: Introduce CXL specific AER driver file
  2025-07-24 17:02     ` Bowman, Terry
@ 2025-07-24 20:23       ` dan.j.williams
  0 siblings, 0 replies; 82+ messages in thread
From: dan.j.williams @ 2025-07-24 20:23 UTC (permalink / raw)
  To: Bowman, Terry, dan.j.williams, dave, jonathan.cameron, dave.jiang,
	alison.schofield, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci

Bowman, Terry wrote:
> On 7/23/2025 8:16 PM, dan.j.williams@intel.com wrote:
> > Terry Bowman wrote:
> >> The CXL AER error handling logic currently resides in the AER driver file,
> >> drivers/pci/pcie/aer.c. CXL specific changes are conditionally compiled
> >> using #ifdefs.
> >>
> >> Improve the AER driver maintainability by separating the CXL specific logic
> >> from the AER driver's core functionality and removing the #ifdefs.
> >> Introduce drivers/pci/pcie/cxl_aer.c and move the CXL AER logic into the
> >> new file.
> >>
> >> Update the makefile to conditionally compile the CXL file using the
> >> existing CONFIG_PCIEAER_CXL Kconfig.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> ---
> > After reading patch5 I want to qualify my Reviewed-by:...
> >
> >>  drivers/pci/pci.h          |   8 +++
> >>  drivers/pci/pcie/Makefile  |   1 +
> >>  drivers/pci/pcie/aer.c     | 138 -------------------------------------
> >>  drivers/pci/pcie/cxl_aer.c | 138 +++++++++++++++++++++++++++++++++++++
> > This is a poor name for this file because the functionality only relates to
> > code that supports a dead-end generation of RCH / RCD hardware platforms. 
> >
> > I do agree that it should be removed from aer.c so typical PCIe AER
> > maintenance does not need to trip over that cruft.
> >
> > Please call it something like rch_aer.c so it is tucked out of the way,
> > sticks out as odd in any future diffstat, and does not confuse from the
> > CXL VH error handling that supports current and future generation
> > hardware.
> >
> > Perhaps even move it to its own silent Kconfig symbol with a deprecation
> > warning, something like below, so someone remembers to delete it.
> 
> cxl_rch_handle_error_iter() and cxl_rch_handle_error() need to be moved from pci/pcie/cxl_aer.c
> into cxl/core/native_ras.c introduced in this series. There is no RCH or VH handling in cxl_aer.c. 
> cxl_aer.c serves to detect if an error is a CXL error and if it is then it forwards it to the 
> CXL drivers using the kfifo introduced later. I will update the commit message stating more 
> will be added later.

Wait, this set moves the same function to a new file twice in the same
set? I had not gotten that far along, but that's not acceptable.

The reasons I had assumed that the rch bits would remain as a vestigial
drivers/pci/pcie/rch_aer.c file to be cut from the kernel later are:

- The goal of forwarding protocol errors to the cxl_core is that the
  cxl_core maintains a cxl_port hierarchy. For the RCH case there is no
  hierarchy and little to no value in being able disposition or decorate
  error reports with the cxl_port driver.

- The RCH code requires a series of new PCI core exports for this
  one-off unfortunate mistake of history where the CXL specification
  tried way too hard to hide the presence of CXL. If this code is
  already on a deprecation path, that contraindicates new exports.

> Dave Jiang introduced cxl/core/pci_aer.c I understand the name is still up for possible change.
> The native_ras.c changes in this series is planned to be moved into cxl/core/pci_aer.c for v11. 
> The files were created with the same purpose but we used different filenames and need to converge.

Why not put this stuff in the existing cxl/core/ras.c? I do expect that
we want to route CPER reports to cxl_port objects at some point, so the
"native" distinction is more confusing than beneficial as far as I can
see.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 04/17] CXL/AER: Introduce CXL specific AER driver file
  2025-07-24 17:06     ` Bowman, Terry
@ 2025-07-24 20:32       ` dan.j.williams
  0 siblings, 0 replies; 82+ messages in thread
From: dan.j.williams @ 2025-07-24 20:32 UTC (permalink / raw)
  To: Bowman, Terry, dan.j.williams, dave, jonathan.cameron, dave.jiang,
	alison.schofield, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci

Bowman, Terry wrote:
> 
> 
> On 7/23/2025 7:01 PM, dan.j.williams@intel.com wrote:
> > Terry Bowman wrote:
> >> The CXL AER error handling logic currently resides in the AER driver file,
> >> drivers/pci/pcie/aer.c. CXL specific changes are conditionally compiled
> >> using #ifdefs.
> >>
> >> Improve the AER driver maintainability by separating the CXL specific logic
> >> from the AER driver's core functionality and removing the #ifdefs.
> >> Introduce drivers/pci/pcie/cxl_aer.c and move the CXL AER logic into the
> >> new file.
> >>
> >> Update the makefile to conditionally compile the CXL file using the
> >> existing CONFIG_PCIEAER_CXL Kconfig.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> > [..]
> >> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
> >> index e2d71b6fdd84..31b3935bf189 100644
> >> --- a/include/linux/pci_ids.h
> >> +++ b/include/linux/pci_ids.h
> >> @@ -12,6 +12,8 @@
> >>  
> >>  /* Device classes and subclasses */
> >>  
> >> +#define PCI_CLASS_CODE_MASK             0xFFFF00
> > Per other comments do not add this updated in the same patch as the
> > move.
> >
> > When / if you submit it separately it likely also belongs next to
> > PCI_CLASS_REVISION in include/uapi/linux/pci_regs.h defined with
> > __GENMASK(23, 8).
> 
> include/uapi/linux/pci_regs.h appears to use all values without using GENMASK().
> Just adding as a note. I'm making the change.

pci_regs.h is in include/uapi/. Historically that meant that it was
unable to use GENMASK() from include/linux/. That changed "recently"
(compared to the age of pci_regs.h) with commit:

3c7a8e190bc5 uapi: introduce uapi-friendly macros for GENMASK

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors
  2025-07-24 17:21     ` Bowman, Terry
@ 2025-07-24 20:55       ` dan.j.williams
  0 siblings, 0 replies; 82+ messages in thread
From: dan.j.williams @ 2025-07-24 20:55 UTC (permalink / raw)
  To: Bowman, Terry, dan.j.williams, dave, jonathan.cameron, dave.jiang,
	alison.schofield, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci

Bowman, Terry wrote:
> On 7/23/2025 9:01 PM, dan.j.williams@intel.com wrote:
> > Terry Bowman wrote:
> >> CXL error handling will soon be moved from the AER driver into the CXL
> >> driver. This requires a notification mechanism for the AER driver to share
> >> the AER interrupt with the CXL driver. The notification will be used
> >> as an indication for the CXL drivers to handle and log the CXL RAS errors.
> >>
> >> First, introduce cxl/core/native_ras.c to contain changes for the CXL
> >> driver's RAS native handling. This as an alternative to dropping the
> >> changes into existing cxl/core/ras.c file with purpose to avoid #ifdefs.
> >> Introduce CXL Kconfig CXL_NATIVE_RAS, dependent on PCIEAER_CXL, to
> >> conditionally compile the new file.
> > I see no daylight between CXL_NATIVE_RAS and PCIEAER_CXL, one of those
> > needs to subsume the other. I also do not understand the point of
> > "NATIVE" in the name. Will not CPER notified protocol errors be routed
> > to the same CXL error handling infrastructure as AER notified protocol
> > errors? I.e. the aer_recover_queue() path?
> 
> This change and comment is planned to be removed in v11. Instead of introducing this 
> as a new file. The same changes will instead be added to pci_aer.c/pci_ras.c Dave Jiang
> is introducing here:
> 
> https://lore.kernel.org/linux-cxl/20250721170415.285961-1-dave.jiang@intel.com/

Lets just put it all in cxl/core/ras.c, I don't think we need to have
fine grained file distinctions between the "native", "aer", and "ras
component registers" cases.

"Want CXL error handling? Need ras.c."

[..]
<trim three pages of context that had no new comments to respond, please
 trim your replies>

> >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> >> index 0b4d721980ef..8417a49c71f2 100644
> >> --- a/drivers/pci/pcie/aer.c
> >> +++ b/drivers/pci/pcie/aer.c
> >> @@ -1130,8 +1130,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
> >>  
> >>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
> >>  {
> >> -	cxl_rch_handle_error(dev, info);
> > No, can not just drop what was working before, even if you restore the
> > functionality in a later patch in the same series.
> >
> > I would expect that this patch at a minimum maintains RCH handling and
> > forwards anything else to the CXL core for VH handling.
> 
> You want RCH handling to stay here in the AER driver or does the patch changes need 
> to be reworked to present the change better?

Stay here. The cost of exporting PCI core functionality for this one-off
seems not worth it.

> >> -	pci_aer_handle_error(dev, info);
> >> +	if (is_cxl_error(dev, info))
> >> +		forward_cxl_error(dev, info);
> >> +	else
> >> +		pci_aer_handle_error(dev, info);
> >> +
> >>  	pci_dev_put(dev);
> >>  }
> >>  
> >> diff --git a/drivers/pci/pcie/cxl_aer.c b/drivers/pci/pcie/cxl_aer.c
> >> index b2ea14f70055..846ab55d747c 100644
> >> --- a/drivers/pci/pcie/cxl_aer.c
> >> +++ b/drivers/pci/pcie/cxl_aer.c
> > With the RCH bits moved to its own file then this file would be 100%
> > concerned with typical CXL VH error handling and deserve to carry the
> > "cxl_aer.c" name.
> The plan was to move all the handling to cxl/core/pci_aer.c or whatever it is renamed.

It would need more justification to overcome the perception of "new
exports for code that is already on a deprecation watch"

[..]
> >> diff --git a/include/linux/aer.h b/include/linux/aer.h
> >> index 02940be66324..24c3d9e18ad5 100644
> >> --- a/include/linux/aer.h
> >> +++ b/include/linux/aer.h
> >> @@ -10,6 +10,7 @@
> >>  
> >>  #include <linux/errno.h>
> >>  #include <linux/types.h>
> >> +#include <linux/workqueue_types.h>
> >>  
> >>  #define AER_NONFATAL			0
> >>  #define AER_FATAL			1
> >> @@ -53,6 +54,26 @@ struct aer_capability_regs {
> >>  	u16 uncor_err_source;
> >>  };
> >>  
> >> +/**
> >> + * struct cxl_proto_err_info - Error information used in CXL error handling
> >> + * @severity: AER severity
> >> + * @function: Device's PCI function
> >> + * @device: Device's PCI device
> >> + * @bus: Device's PCI bus
> >> + * @segment: Device's PCI segment
> >> + */
> >> +struct cxl_proto_error_info {
> >> +	int severity;
> >> +
> >> +	u8 devfn;
> >> +	u8 bus;
> >> +	u16 segment;
> >> +};
> >> +
> >> +struct cxl_proto_err_work_data {
> >> +	struct cxl_proto_error_info err_info;
> >> +};
> > Why not use cxl_proto_error_info directly?
> At one point I thought there was a good reason for using it later in another case.
> I'll use cxl_proto_error_info directly.

Maybe that was more the CPER case where the CXL standard mailbox payload
structure is wrapped with a CPER-record envelope? In any event, I think
it is safe to drop that indirection.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 06/17] PCI/AER: Dequeue forwarded CXL error
  2025-06-26 22:42 ` [PATCH v10 06/17] PCI/AER: Dequeue forwarded CXL error Terry Bowman
  2025-06-27 11:00   ` Jonathan Cameron
  2025-07-01 23:04   ` Dave Jiang
@ 2025-07-25  0:38   ` dan.j.williams
  2 siblings, 0 replies; 82+ messages in thread
From: dan.j.williams @ 2025-07-25  0:38 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, terry.bowman, linux-cxl
  Cc: linux-kernel, linux-pci

Terry Bowman wrote:
> The AER driver is now designed to forward CXL protocol errors to the CXL
> driver. Update the CXL driver with functionality to dequeue the forwarded
> CXL error from the kfifo. Also, update the CXL driver to begin the protocol
> error handling processing using the work received from the FIFO.
> 
> Introduce function cxl_proto_err_work_fn() to dequeue work forwarded by the
> AER service driver. This will begin the CXL protocol error processing with
> a call to cxl_handle_proto_error().
> 
> Update cxl/core/native_ras.c by adding cxl_rch_handle_error_iter() that was
> previously in the AER driver. Add check that Endpoint is bound to a CXL
> driver.
[..]
> +static void cxl_handle_proto_error(struct cxl_proto_error_info *err_info)
> +{
> +	struct pci_dev *pdev __free(pci_dev_put) =
> +		pci_get_domain_bus_and_slot(err_info->segment,
> +					    err_info->bus,
> +					    err_info->devfn);

So this patch in its current form is about restoring the RCH error
handling code which we already talked about should probably stay as a
special case in drivers/pci/pcie/.

For v11, where this code can 100% focus on VH error handling, my
expectation is to not see any PCI topology walking, i.e. no
pci_get_domain_bus_and_slot() no pci_walk_bridge() etc. If all we cared
about were PCI details this code could have remained in the PCI core.

Instead, my expectation is that motive for a kfifo and calling back into
the cxl_core is cxl_core has a parallel universe of software objects
('struct cxl_port') that can experience errors independent of the errors
the PCIe core cares about. It also has a cxl_port driver model that
knows the lifetime of when RAS registers are mapped that the PCIe AER
core can not know about.

So, the PCIe core has already done the device lookup before this point.
Just pass that device to the cxl_core directly, and then use that
device to lookup a cxl_port and/or cxl_dport directly.

A useful property of passing a 'struct device *' to identify the error
source device is that it supports cxl_test emulation of CXL port
protocol error injection.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 02/17] PCI/CXL: Add pcie_is_cxl()
  2025-06-26 22:42 ` [PATCH v10 02/17] PCI/CXL: Add pcie_is_cxl() Terry Bowman
  2025-07-23 22:30   ` dan.j.williams
@ 2025-08-09 10:56   ` Alejandro Lucero Palau
  2025-08-11 19:14     ` Bowman, Terry
  1 sibling, 1 reply; 82+ messages in thread
From: Alejandro Lucero Palau @ 2025-08-09 10:56 UTC (permalink / raw)
  To: Terry Bowman, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci


On 6/26/25 23:42, Terry Bowman wrote:
> CXL and AER drivers need the ability to identify CXL devices.
>
> Add set_pcie_cxl() with logic checking for CXL Flexbus DVSEC presence. The
> CXL Flexbus DVSEC presence is used because it is required for all the CXL
> PCIe devices.[1]
>
> Add boolean 'struct pci_dev::is_cxl' with the purpose to cache the CXL
> Flexbus presence.
>
> Add function pcie_is_cxl() to return 'struct pci_dev::is_cxl'.
>
> [1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
>      Capability (DVSEC) ID Assignment, Table 8-2
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
>   drivers/pci/probe.c           | 10 ++++++++++
>   include/linux/pci.h           |  6 ++++++
>   include/uapi/linux/pci_regs.h |  8 +++++++-
>   3 files changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 4b8693ec9e4c..5d3548648d5c 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1691,6 +1691,14 @@ static void set_pcie_thunderbolt(struct pci_dev *dev)
>   		dev->is_thunderbolt = 1;
>   }
>   
> +static void set_pcie_cxl(struct pci_dev *dev)
> +{
> +	u16 dvsec = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
> +					      PCI_DVSEC_CXL_FLEXBUS);
> +	if (dvsec)
> +		dev->is_cxl = 1;


Hi Terry,


Should not this check for the bits about port status like IO_Enabled?


Maybe I'm confused with the goal, but we can know a device is a cxl one 
without specifically looking at the FLEXBUS capability presence or not. 
What this capability can tell, and what I think it is more interesting, 
is if the link was successfully trained as CXL.io. From problems we have 
had while testing our type2 device, if the training for CXL.io fails, 
the device will use pcie and this bit will be 0, same if the device can 
be configured for using pcie only.


If this is what we really need to know, changing is_cxl to 
is_cxl_enabled could be clearer.


I come here after Dan suggesting to use this functionality for ensuring 
the CXL device functionality is on, and it would require to inspect the 
status instead of just the capability being present. Maybe I'm confused 
because I remember some patches from Robert Richter dealing with 
checking link up for enabling downstream ports, but I think the goal 
here is different.


> +}
> +
>   static void set_pcie_untrusted(struct pci_dev *dev)
>   {
>   	struct pci_dev *parent = pci_upstream_bridge(dev);
> @@ -2021,6 +2029,8 @@ int pci_setup_device(struct pci_dev *dev)
>   	/* Need to have dev->cfg_size ready */
>   	set_pcie_thunderbolt(dev);
>   
> +	set_pcie_cxl(dev);
> +
>   	set_pcie_untrusted(dev);
>   
>   	if (pci_is_pcie(dev))
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 05e68f35f392..79878243b681 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -453,6 +453,7 @@ struct pci_dev {
>   	unsigned int	is_hotplug_bridge:1;
>   	unsigned int	shpc_managed:1;		/* SHPC owned by shpchp */
>   	unsigned int	is_thunderbolt:1;	/* Thunderbolt controller */
> +	unsigned int	is_cxl:1;               /* Compute Express Link (CXL) */
>   	/*
>   	 * Devices marked being untrusted are the ones that can potentially
>   	 * execute DMA attacks and similar. They are typically connected
> @@ -744,6 +745,11 @@ static inline bool pci_is_vga(struct pci_dev *pdev)
>   	return false;
>   }
>   
> +static inline bool pcie_is_cxl(struct pci_dev *pci_dev)
> +{
> +	return pci_dev->is_cxl;
> +}
> +
>   #define for_each_pci_bridge(dev, bus)				\
>   	list_for_each_entry(dev, &bus->devices, bus_list)	\
>   		if (!pci_is_bridge(dev)) {} else
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index a3a3e942dedf..fb9d77c69d5f 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -1225,9 +1225,15 @@
>   /* Deprecated old name, replaced with PCI_DOE_DATA_OBJECT_DISC_RSP_3_TYPE */
>   #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		PCI_DOE_DATA_OBJECT_DISC_RSP_3_TYPE
>   
> -/* Compute Express Link (CXL r3.1, sec 8.1.5) */
> +/* Compute Express Link (CXL r3.2, sec 8.1)
> + *
> + * Note that CXL DVSEC id 3 and 7 to be ignored when the CXL link state
> + * is "disconnected" (CXL r3.2, sec 9.12.3). Re-enumerate these
> + * registers on downstream link-up events.
> + */
>   #define PCI_DVSEC_CXL_PORT				3
>   #define PCI_DVSEC_CXL_PORT_CTL				0x0c
>   #define PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR		0x00000001
> +#define PCI_DVSEC_CXL_FLEXBUS				7
>   
>   #endif /* LINUX_PCI_REGS_H */

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 02/17] PCI/CXL: Add pcie_is_cxl()
  2025-08-09 10:56   ` Alejandro Lucero Palau
@ 2025-08-11 19:14     ` Bowman, Terry
  2025-08-11 23:14       ` dan.j.williams
  0 siblings, 1 reply; 82+ messages in thread
From: Bowman, Terry @ 2025-08-11 19:14 UTC (permalink / raw)
  To: Alejandro Lucero Palau, dave, jonathan.cameron, dave.jiang,
	alison.schofield, dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci

On 8/9/2025 5:56 AM, Alejandro Lucero Palau wrote:
> 
> On 6/26/25 23:42, Terry Bowman wrote:
>> CXL and AER drivers need the ability to identify CXL devices.
>>
>> Add set_pcie_cxl() with logic checking for CXL Flexbus DVSEC presence. The
>> CXL Flexbus DVSEC presence is used because it is required for all the CXL
>> PCIe devices.[1]
>>
>> Add boolean 'struct pci_dev::is_cxl' with the purpose to cache the CXL
>> Flexbus presence.
>>
>> Add function pcie_is_cxl() to return 'struct pci_dev::is_cxl'.
>>
>> [1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
>>      Capability (DVSEC) ID Assignment, Table 8-2
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
>> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> ---
>>   drivers/pci/probe.c           | 10 ++++++++++
>>   include/linux/pci.h           |  6 ++++++
>>   include/uapi/linux/pci_regs.h |  8 +++++++-
>>   3 files changed, 23 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
>> index 4b8693ec9e4c..5d3548648d5c 100644
>> --- a/drivers/pci/probe.c
>> +++ b/drivers/pci/probe.c
>> @@ -1691,6 +1691,14 @@ static void set_pcie_thunderbolt(struct pci_dev *dev)
>>   		dev->is_thunderbolt = 1;
>>   }
>>   
>> +static void set_pcie_cxl(struct pci_dev *dev)
>> +{
>> +	u16 dvsec = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
>> +					      PCI_DVSEC_CXL_FLEXBUS);
>> +	if (dvsec)
>> +		dev->is_cxl = 1;
> 
> 
> Hi Terry,
> 
> 
> Should not this check for the bits about port status like IO_Enabled?
> 
> 
> Maybe I'm confused with the goal, but we can know a device is a cxl one 
> without specifically looking at the FLEXBUS capability presence or not. 
> What this capability can tell, and what I think it is more interesting, 
> is if the link was successfully trained as CXL.io. From problems we have 
> had while testing our type2 device, if the training for CXL.io fails, 
> the device will use pcie and this bit will be 0, same if the device can 
> be configured for using pcie only.
> 
> 
> If this is what we really need to know, changing is_cxl to 
> is_cxl_enabled could be clearer.
> 
> 
> I come here after Dan suggesting to use this functionality for ensuring 
> the CXL device functionality is on, and it would require to inspect the 
> status instead of just the capability being present. Maybe I'm confused 
> because I remember some patches from Robert Richter dealing with 
> checking link up for enabling downstream ports, but I think the goal 
> here is different.
> 

Hi Alejandro,

I agree in large part. We need to check for training complete and any change 
in training needs to be reflected in is_cxl(). My understanding is this 
will be be added later in a following patch series. 

Dan, can you add your thoughts ?

-Terry


^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 02/17] PCI/CXL: Add pcie_is_cxl()
  2025-08-11 19:14     ` Bowman, Terry
@ 2025-08-11 23:14       ` dan.j.williams
  0 siblings, 0 replies; 82+ messages in thread
From: dan.j.williams @ 2025-08-11 23:14 UTC (permalink / raw)
  To: Bowman, Terry, Alejandro Lucero Palau, dave, jonathan.cameron,
	dave.jiang, alison.schofield, dan.j.williams, bhelgaas,
	shiju.jose, ming.li, Smita.KoralahalliChannabasappa, rrichter,
	dan.carpenter, PradeepVineshReddy.Kodamati, lukas,
	Benjamin.Cheatham, sathyanarayanan.kuppuswamy, linux-cxl
  Cc: linux-kernel, linux-pci

Bowman, Terry wrote:
[..]
> On 8/9/2025 5:56 AM, Alejandro Lucero Palau wrote:
> > I come here after Dan suggesting to use this functionality for ensuring 
> > the CXL device functionality is on, and it would require to inspect the 
> > status instead of just the capability being present. Maybe I'm confused 
> > because I remember some patches from Robert Richter dealing with 
> > checking link up for enabling downstream ports, but I think the goal 
> > here is different.
> > 
> 
> Hi Alejandro,
> 
> I agree in large part. We need to check for training complete and any change 
> in training needs to be reflected in is_cxl(). My understanding is this 
> will be be added later in a following patch series. 
> 
> Dan, can you add your thoughts ?

Training completion is implicit in the fact that the device can be
targeted by configuration requests. DVSEC 7 absence means you know that
CXL is disabled. It is true that DVSEC 7 presence is inconclusive for
whether the device supports CXL.mem or CXL.cache. I have seen some
implementations hide DVSEC 7, but that can not be relied upon for
is_cxl() purposes. CXL.io support is not interesting.

So is_cxl() should probably indicate (CXL.mem || CXL.cache) because the
presence of those needs CXL subsystem infrastructure.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging
  2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
                   ` (17 preceding siblings ...)
  2025-07-23 21:55 ` [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging dan.j.williams
@ 2025-08-18 15:18 ` Joshua Hahn
  18 siblings, 0 replies; 82+ messages in thread
From: Joshua Hahn @ 2025-08-18 15:18 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dave, jonathan.cameron, dave.jiang, alison.schofield,
	dan.j.williams, bhelgaas, shiju.jose, ming.li,
	Smita.KoralahalliChannabasappa, rrichter, dan.carpenter,
	PradeepVineshReddy.Kodamati, lukas, Benjamin.Cheatham,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel, linux-pci

On Thu, 26 Jun 2025 17:42:35 -0500 Terry Bowman <terry.bowman@amd.com> wrote:

> This patchset updates CXL Protocol Error handling for CXL Ports and CXL
> Endpoints (EP). The reach of this patchset grew from CXL Ports to include
> EPs as well.
> 
> This patchset is a continuation of v9 found here:
> https://lore.kernel.org/linux-cxl/20250603172239.159260-1-terry.bowman@amd.com/
> 
> The first patch is a small cleanup change to reduce amount of code. 
> 
> The next 2 patches introduce pci_dev::is_cxl, aer_info::is_cxl, and add
> bus string to AER log tracing. aer_info::is_cxl will be used to indicate a
> CXL or PCI error and will be used to direct the error handling flow in
> later patches.
> 
> The next patch introduces a new driver file, pci/pcie/cxl_aer.c, to move
> the existing CXL AER logic into.
> 
> The next 3 patches update the AER driver and CXL driver to use a kfifo. 
> The kfifo is added to offload CXL-AER protocol error work to the CXL
> driver. These patches provide the kfifo work add and work remove. 
> 
> The next 5 patches prepare the CXL driver for adding the updated protocol
> error handlers. This includes adding CXL Port RAS mapping and updating
> interfaces for common support.
> 
> The final 5 patches add the CXL error handlers for CXL EPs and CXL Ports.
> CXL EPs keep the PCIe error handler for cases the EP error is interpreted
> as a PCIe error. These patches also add logic to unmask CXL Protocol Errors
> during port probing, and mask CXL Protocol Errors during port device
> cleanup.

Hello Terry,

Thank you for this new version. I just wanted to add that I have been testing
this new version on a few machines, and it fixes an issue that I was seeing
on v8 of the patchset.

Previously, booting a kernel with the parameter pcie_ports=compat would lead
to a kernel crash caused by a NULL pointer dereference. After I rebased the
kernel to use v10 instead, this went away and I can use pcie_ports=compat
without any complications. I tried looking in to see what the change that
led to this fix was, but couldn't find anything specific. 

It seems like a use-after-free bug and happens specifically in
cxl_dport_init_ras_reporting. Since this new version fixes this issue, pleae
feel free to add my tested-by tag in future versions.

Thank you again for your work on this series! I hope you have a great day.
Joshua Hahn

Tested-by: Joshua Hahn <joshua.hahnjy@gmail.com>

Sent using hkml (https://github.com/sjp38/hackermail)

^ permalink raw reply	[flat|nested] 82+ messages in thread

end of thread, other threads:[~2025-08-18 15:19 UTC | newest]

Thread overview: 82+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-26 22:42 [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging Terry Bowman
2025-06-26 22:42 ` [PATCH v10 01/17] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
2025-07-18 17:55   ` Dave Jiang
2025-07-23 21:58   ` dan.j.williams
2025-07-23 22:15     ` Dave Jiang
2025-06-26 22:42 ` [PATCH v10 02/17] PCI/CXL: Add pcie_is_cxl() Terry Bowman
2025-07-23 22:30   ` dan.j.williams
2025-08-09 10:56   ` Alejandro Lucero Palau
2025-08-11 19:14     ` Bowman, Terry
2025-08-11 23:14       ` dan.j.williams
2025-06-26 22:42 ` [PATCH v10 03/17] PCI/AER: Report CXL or PCIe bus error type in trace logging Terry Bowman
2025-06-26 23:25   ` Sathyanarayanan Kuppuswamy
2025-06-27  9:53   ` Jonathan Cameron
2025-07-02 16:00     ` Bowman, Terry
2025-06-27 11:32   ` Shiju Jose
2025-07-01 21:27   ` Dave Jiang
2025-07-23 22:56   ` dan.j.williams
2025-06-26 22:42 ` [PATCH v10 04/17] CXL/AER: Introduce CXL specific AER driver file Terry Bowman
2025-06-26 23:42   ` Sathyanarayanan Kuppuswamy
2025-06-27 10:12     ` Jonathan Cameron
2025-06-27 14:29     ` Bowman, Terry
2025-07-24  0:01   ` dan.j.williams
2025-07-24 17:06     ` Bowman, Terry
2025-07-24 20:32       ` dan.j.williams
2025-07-24  1:16   ` dan.j.williams
2025-07-24 17:02     ` Bowman, Terry
2025-07-24 20:23       ` dan.j.williams
2025-06-26 22:42 ` [PATCH v10 05/17] CXL/AER: Introduce kfifo for forwarding CXL errors Terry Bowman
2025-06-27 10:24   ` Jonathan Cameron
2025-07-02 16:21     ` Bowman, Terry
2025-07-02 19:54       ` Dan Carpenter
2025-07-02 19:57         ` Bowman, Terry
2025-07-03 10:06       ` Jonathan Cameron
2025-07-01 21:53   ` Dave Jiang
2025-07-02 17:10     ` Bowman, Terry
2025-07-24  2:01   ` dan.j.williams
2025-07-24 17:21     ` Bowman, Terry
2025-07-24 20:55       ` dan.j.williams
2025-06-26 22:42 ` [PATCH v10 06/17] PCI/AER: Dequeue forwarded CXL error Terry Bowman
2025-06-27 11:00   ` Jonathan Cameron
2025-07-02 17:51     ` Bowman, Terry
2025-07-01 23:04   ` Dave Jiang
2025-07-02 17:56     ` Bowman, Terry
2025-07-03 10:11       ` Jonathan Cameron
2025-07-25  0:38   ` dan.j.williams
2025-06-26 22:42 ` [PATCH v10 07/17] CXL/PCI: Introduce CXL uncorrectable protocol error recovery Terry Bowman
2025-06-27 11:05   ` Jonathan Cameron
2025-07-02 21:06     ` Bowman, Terry
2025-06-27 12:27   ` Shiju Jose
2025-07-02 21:34     ` Bowman, Terry
2025-06-26 22:42 ` [PATCH v10 08/17] cxl/pci: Move RAS initialization to cxl_port driver Terry Bowman
2025-06-27 11:12   ` Jonathan Cameron
2025-07-18 18:01   ` Dave Jiang
2025-06-26 22:42 ` [PATCH v10 09/17] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers Terry Bowman
2025-06-27 11:17   ` Jonathan Cameron
2025-07-02 21:41     ` Bowman, Terry
2025-07-18 21:28   ` Dave Jiang
2025-07-18 21:55     ` Bowman, Terry
2025-07-18 22:01       ` Dave Jiang
2025-07-18 22:40         ` Bowman, Terry
2025-07-18 22:45           ` Dave Jiang
2025-06-26 22:42 ` [PATCH v10 10/17] cxl/pci: Update RAS handler interfaces to also support CXL Ports Terry Bowman
2025-06-26 22:42 ` [PATCH v10 11/17] cxl/pci: Log message if RAS registers are unmapped Terry Bowman
2025-07-21 21:56   ` Dave Jiang
2025-06-26 22:42 ` [PATCH v10 12/17] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports Terry Bowman
2025-06-27 12:22   ` Shiju Jose
2025-07-02  1:18     ` Alison Schofield
2025-07-02 22:07       ` Bowman, Terry
2025-07-02 21:56     ` Bowman, Terry
2025-06-26 22:42 ` [PATCH v10 13/17] cxl/pci: Update cxl_handle_cor_ras() to return early if no RAS errors Terry Bowman
2025-06-27 11:48   ` Jonathan Cameron
2025-07-21 22:17   ` Dave Jiang
2025-06-26 22:42 ` [PATCH v10 14/17] cxl/pci: Introduce CXL Endpoint protocol error handlers Terry Bowman
2025-06-27 11:52   ` Jonathan Cameron
2025-06-27 12:27   ` Shiju Jose
2025-07-21 22:35   ` Dave Jiang
2025-07-22 18:23     ` Bowman, Terry
2025-06-26 22:42 ` [PATCH v10 15/17] CXL/PCI: Introduce CXL Port " Terry Bowman
2025-06-26 22:42 ` [PATCH v10 16/17] CXL/PCI: Enable CXL protocol errors during CXL Port probe Terry Bowman
2025-06-26 22:42 ` [PATCH v10 17/17] CXL/PCI: Disable CXL protocol error interrupts during CXL Port cleanup Terry Bowman
2025-07-23 21:55 ` [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging dan.j.williams
2025-08-18 15:18 ` Joshua Hahn

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).