[PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging

* [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging
@ 2025-03-27  1:47 Terry Bowman
  2025-03-27  1:47 ` [PATCH v8 01/16] PCI/CXL: Introduce PCIe helper function pcie_is_cxl() Terry Bowman
                   ` (17 more replies)
  0 siblings, 18 replies; 76+ messages in thread
From: Terry Bowman @ 2025-03-27  1:47 UTC
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

This patchset updates CXL Protocol Error handling for CXL Ports and CXL
Endpoints (EP). The reach of this patchset grew from CXL Ports to include
EPs as well because updating the handling for all devices is preferable
to supporting multiple handling paths.

This patchset is a continuation of v7, which can be found here:
https://lore.kernel.org/linux-cxl/20250211192444.2292833-1-terry.bowman@amd.com/

The differences between v7 and v8 include a significant refactor of the CXL
error handling. v8 further defines the difference between CXL errors and
PCIe errors by moving the CXL error handling to the CXL driver. The intent
is to isolate the CXL error handling from the AER driver as much as
possible.  The AER driver will continue to handle AER interrupts but will
now forward an error to the CXL driver if it's a CXL error. If an error is
not a CXL error then the existing PCIe flow is used to handle the error.

Another change from v7->v8 is the error handlers themselves. v8 introduces
the error handlers as 'struct cxl_driver::err_handler' instead of as
'struct pci_dev::cxl_err_handlers' done in v7. 
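
As a rough user-space illustration of hanging the error handlers off the
driver rather than off each device, the sketch below models a handler table
and a dispatcher. It is not code from this series: every demo_* name, the
table layout, and the result codes are hypothetical and chosen only for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes, loosely modeled on pci_ers_result_t. */
enum demo_result { DEMO_RESULT_NONE, DEMO_RESULT_RECOVERED, DEMO_RESULT_PANIC };

struct demo_device { const char *name; };

/* Hypothetical handler table; the real layout in the series may differ. */
struct demo_err_handlers {
	void (*cor_error_detected)(struct demo_device *dev);
	enum demo_result (*error_detected)(struct demo_device *dev);
};

/* The handlers are reached through the driver, not through the device. */
struct demo_driver {
	const char *name;
	const struct demo_err_handlers *err_handler;
};

/* Example handlers: count correctable errors, escalate uncorrectable ones. */
static int demo_cor_count;

static void demo_cor(struct demo_device *dev)
{
	(void)dev;
	demo_cor_count++;
}

static enum demo_result demo_uce(struct demo_device *dev)
{
	(void)dev;
	return DEMO_RESULT_PANIC;
}

static const struct demo_err_handlers demo_handlers = {
	.cor_error_detected = demo_cor,
	.error_detected = demo_uce,
};

/* Dispatch an error to the bound driver, tolerating missing handlers. */
static enum demo_result demo_dispatch(const struct demo_driver *drv,
				      struct demo_device *dev, bool correctable)
{
	assert(dev != NULL);
	if (!drv || !drv->err_handler)
		return DEMO_RESULT_NONE;

	if (correctable) {
		if (drv->err_handler->cor_error_detected)
			drv->err_handler->cor_error_detected(dev);
		return DEMO_RESULT_RECOVERED;
	}

	if (!drv->err_handler->error_detected)
		return DEMO_RESULT_NONE;
	return drv->err_handler->error_detected(dev);
}
```

A device bound to a driver without a handler table simply yields
DEMO_RESULT_NONE here; how the real code treats that case is not shown in
this cover letter.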

Most of the Acked-by and Reviewed-by tags had to be dropped because of the
changes.

= Patch Descriptions =
The first 2 patches introduce pci_dev::is_cxl, aer_info::is_cxl, and add
bus string to AER log tracing. aer_info::is_cxl will be used to indicate a
CXL or PCI error and will affect the error handling flow in later patches.

The next 6 patches add a kfifo for forwarding CXL errors to the CXL driver.
These patches also move the CXL handling from the AER service driver into
the CXL driver and add the necessary plumbing. This subset of patches also
introduces CXL UCE handling never present in the AER service driver. 
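
The forwarding idea can be sketched in plain user-space C as a fixed-size
FIFO with one producer (standing in for the AER side) and one consumer
(standing in for the CXL worker). This is only a model under stated
assumptions: the kernel's kfifo is not used here, there is no locking or
work scheduling, and struct demo_prot_err and the demo_fifo_* helpers are
hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_FIFO_DEPTH 8u	/* must be a power of two */

/* Hypothetical record handed from the producer to the consumer. */
struct demo_prot_err {
	uint16_t bdf;		/* bus/device/function of the reporting device */
	bool fatal;		/* uncorrectable (true) or correctable (false) */
};

struct demo_fifo {
	struct demo_prot_err slot[DEMO_FIFO_DEPTH];
	unsigned int in;	/* total records ever written */
	unsigned int out;	/* total records ever read */
};

/* Producer side: returns false (record dropped) when the FIFO is full. */
static bool demo_fifo_put(struct demo_fifo *f, struct demo_prot_err e)
{
	assert(f != NULL);
	if (f->in - f->out == DEMO_FIFO_DEPTH)
		return false;
	f->slot[f->in & (DEMO_FIFO_DEPTH - 1)] = e;
	f->in++;
	return true;
}

/* Consumer side: returns false when there is nothing to dequeue. */
static bool demo_fifo_get(struct demo_fifo *f, struct demo_prot_err *e)
{
	assert(f != NULL && e != NULL);
	if (f->in == f->out)
		return false;
	*e = f->slot[f->out & (DEMO_FIFO_DEPTH - 1)];
	f->out++;
	return true;
}
```

Records come out in the order they were queued, and a full FIFO rejects new
records instead of overwriting old ones. Whether the real implementation
drops or logs on overflow is not stated in this cover letter.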

The next 3 patches add the CXL Port RAS mapping and interface updates to
support addition of CXL error handlers.

The final 5 patches add the CXL error handlers for CXL EPs and CXL Ports.
CXL EPs keep the PCIe error handler for cases where the EP error is interpreted
as a PCIe error (see the USP and EP UCE testing below). These patches also
add logic to assign the CXL error handlers to a CXL device, unmask CXL
Protocol Errors during port probing, and mask CXL Protocol Errors during
port device cleanup. 

= Testing =
 Below are test results for this patchset using QEMU with a CXL Root
 Port (RP, 0C:00.0), CXL Upstream Switch Port (USP, 0D:00.0), CXL
 Downstream Switch Port (DSP, 0E:00.0), and CXL Endpoint (EP, 0F:00.0).

 The sub-topology for testing is:
                    ---------------------
                    | CXL RP - 0C:00.0  |
                    ---------------------
                              |
                    ---------------------
                    | CXL USP - 0D:00.0 |
                    ---------------------
                              |
                    ---------------------
                    | CXL DSP - 0E:00.0 |
                    ---------------------
                              |
                    ---------------------
                    | CXL EP - 0F:00.0  |
                    ---------------------

 root@tbowman-cxl:~# lspci -t
 -+-[0000:00]-+-00.0
  |           +-01.0
  |           +-02.0
  |           +-03.0
  |           +-1f.0
  |           +-1f.2
  |           \-1f.3
  \-[0000:0c]---00.0-[0d-0f]----00.0-[0e-0f]----00.0-[0f]----00.0

 The topology was created with:
  ${qemu} -boot menu=on \
             -cpu host \
             -nographic \
             -monitor telnet:127.0.0.1:1234,server,nowait \
             -M virt,cxl=on \
             -chardev stdio,id=s1,signal=off,mux=on -serial none \
             -device isa-serial,chardev=s1 -mon chardev=s1,mode=readline \
             -machine q35,cxl=on \
             -m 16G,maxmem=24G,slots=8 \
             -cpu EPYC-v3 \
             -smp 16 \
             -accel kvm \
             -drive file=${img},format=raw,index=0,media=disk \
             -device e1000,netdev=user.0 \
             -netdev user,id=user.0,hostfwd=tcp::5555-:22 \
             -object memory-backend-file,id=cxl-mem0,share=on,mem-path=/tmp/cxltest.raw,size=256M \
             -object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa0.raw,size=256M \
             -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
             -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
             -device cxl-upstream,bus=root_port0,id=us0 \
             -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
             -device cxl-type3,bus=swport0,volatile-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-vmem0 \
             -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k

 NOTE: The USP and EP results below include a UCE injection that is handled
 as a PCIe error instead of a CXL error. The stack trace shows handling from
 pcie_do_recovery() instead of cxl_do_recovery(). This is because the AER
 info is not read from EPs or USPs when the UCE is AER_FATAL. As a result,
 the CXL RAS is not logged. However, panic is still called when the PCIe
 report_error_detected() calls the CXL EP PCIe error handler, which contains
 the panic().
 
== CXL Root Port ==
root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
pcieport 0000:0c:00.0:    [14] CorrIntErr            
aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
cxl_aer_correctable_error: device=port1 (0000:0c:00.0) parent=root0 (pci0000:0c) serieal=0 status='CRC Threshold Hit'

root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
pcieport 0000:0c:00.0:    [22] UncorrIntErr      
aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
cxl_aer_uncorrectable_error: device=port1 (0000:0c:00.0) parent=root0 (pci0000:0c) serial: 0 status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enable Parity Error'
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 287 Comm: kworker/10:1 Tainted: G            E      6.14.0-rc1-hp-debug+ #199
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Workqueue: events cxl_prot_err_work_fn [cxl_core]
Call Trace:
 <TASK>
 dump_stack_lvl+0x27/0x90
 dump_stack+0x10/0x20
 panic+0x33e/0x380
 ? preempt_count_add+0x4b/0xc0
 cxl_do_recovery+0xc9/0xd0 [cxl_core]
 cxl_prot_err_work_fn+0x74/0x190 [cxl_core]
 process_scheduled_works+0xa6/0x420
 worker_thread+0x121/0x260
 kthread+0x10b/0x220
 ? __pfx_worker_thread+0x10/0x10
 ? _raw_spin_unlock_irq+0x1f/0x40
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x3c/0x60
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Kernel Offset: 0x4a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: CXL cachemem error. ]---

== CXL Upstream Switch Port ==
root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00004000/0000a000
pcieport 0000:0d:00.0:    [14] CorrIntErr            
aer_event: 0000:0d:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
cxl_aer_correctable_error: device=port2 (0000:0d:00.0) parent=port1 (0000:0c:00.0) serieal=0 status='CRC Threshold Hit'

root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
pcieport 0000:0d:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
aer_event: 0000:0d:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 147 Comm: irq/24-aerdrv Tainted: G            E      6.14.0-rc1-hp-debug+ #200
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x27/0x90
 dump_stack+0x10/0x20
 panic+0x33e/0x380
 ? __pfx_report_frozen_detected+0x10/0x10
 pci_error_detected+0x6d/0x70 [cxl_core]
 report_error_detected+0xc2/0x180
 ? __pm_runtime_resume+0x60/0x90
 ? __pfx_report_frozen_detected+0x10/0x10
 report_frozen_detected+0x16/0x20
 __pci_walk_bus+0x50/0x70
 ? __pfx_report_frozen_detected+0x10/0x10
 __pci_walk_bus+0x39/0x70
 ? __pfx_report_frozen_detected+0x10/0x10
 __pci_walk_bus+0x39/0x70
 ? __pfx_report_frozen_detected+0x10/0x10
 pci_walk_bus+0x32/0x50
 pci_walk_bridge+0x1d/0x40
 pcie_do_recovery+0x175/0x2b0
 ? __pfx_aer_root_reset+0x10/0x10
 aer_isr_one_error.isra.0+0x656/0x720
 ? srso_return_thunk+0x5/0x5f
 ? _raw_spin_unlock+0x19/0x40
 ? srso_return_thunk+0x5/0x5f
 ? __switch_to+0x115/0x420
 ? srso_return_thunk+0x5/0x5f
 ? __schedule+0x4d1/0x1190
 aer_isr+0x4d/0x80
 irq_thread_fn+0x28/0x70
 irq_thread+0x179/0x240
 ? srso_return_thunk+0x5/0x5f
 ? __pfx_irq_thread_fn+0x10/0x10
 ? __pfx_irq_thread_dtor+0x10/0x10
 kthread+0x10b/0x220
 ? __pfx_irq_thread+0x10/0x10
 ? _raw_spin_unlock_irq+0x1f/0x40
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x3c/0x60
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Kernel Offset: 0x6e00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: CXL cachemem error. ]---

== CXL Downstream Switch Port ==
root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
pcieport 0000:0e:00.0:    [14] CorrIntErr            
aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
cxl_aer_correctable_error: device=port2 (0000:0e:00.0) parent=port1 (0000:0d:00.0) serieal=0 status='CRC Threshold Hit'

root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00400000/02000000
pcieport 0000:0e:00.0:    [22] UncorrIntErr          
aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
cxl_aer_uncorrectable_error: device=port2 (0000:0e:00.0) parent=port1 (0000:0d:00.0) serial: 0 status: 'Cache Byte Enable Parity Error' first_error: 'Cache Byte Enabl'
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 81 Comm: kworker/10:0 Tainted: G            E      6.14.0-rc1-hp-debug+ #201
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Workqueue: events cxl_prot_err_work_fn [cxl_core]
Call Trace:
 <TASK>
 dump_stack_lvl+0x27/0x90
 dump_stack+0x10/0x20
 panic+0x33e/0x380
 ? preempt_count_add+0x4b/0xc0
 cxl_do_recovery+0xc9/0xd0 [cxl_core]
 cxl_prot_err_work_fn+0x74/0x190 [cxl_core]
 process_scheduled_works+0xa6/0x420
 worker_thread+0x121/0x260
 kthread+0x10b/0x220
 ? __pfx_worker_thread+0x10/0x10
 ? _raw_spin_unlock_irq+0x1f/0x40
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x3c/0x60
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Kernel Offset: 0x29c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: CXL cachemem error. ]---

== CXL Endpoint ==
root@tbowman-cxl:~/aer-inject# ./ep-ce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0f:00.0
 pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0f:00.0
 cxl_pci 0000:0f:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
 cxl_pci 0000:0f:00.0:   device [8086:0d93] error status/mask=00004000/00000000
 cxl_pci 0000:0f:00.0:    [14] CorrIntErr            
 aer_event: 0000:0f:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available 
 cxl_aer_correctable_error: device=mem3 (0000:0f:00.0) parent=0000:0f:00.0 (0000:0e:00.0) serieal=0 status='CRC Threshold Hit'

root@tbowman-cxl:~/aer-inject# ./ep-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0f:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0f:00.0
cxl_pci 0000:0f:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
aer_event: 0000:0f:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
cxl_aer_uncorrectable_error: device=mem3 (0000:0f:00.0) parent=0000:0f:00.0 (0000:0e:00.0) serial: 0 status: 'Cache Byte Enable Parity '
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G            E      6.14.0-rc1-hp-debug+ #203
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
 <TASK>
 dump_stack_lvl+0x27/0x90
 dump_stack+0x10/0x20
 panic+0x33e/0x380
 pci_error_detected+0x6d/0x70 [cxl_core]
 report_error_detected+0xc2/0x180
 ? __pm_runtime_resume+0x60/0x90
 ? __pfx_report_frozen_detected+0x10/0x10
 report_frozen_detected+0x16/0x20
 __pci_walk_bus+0x50/0x70
 ? __pfx_report_frozen_detected+0x10/0x10
 pci_walk_bus+0x32/0x50
 pci_walk_bridge+0x1d/0x40
 pcie_do_recovery+0x175/0x2b0
 ? __pfx_aer_root_reset+0x10/0x10
 aer_isr_one_error.isra.0+0x656/0x720
 ? srso_return_thunk+0x5/0x5f
 ? _raw_spin_unlock+0x19/0x40
 ? srso_return_thunk+0x5/0x5f
 ? __switch_to+0x115/0x420
 ? srso_return_thunk+0x5/0x5f
 ? __schedule+0x4d1/0x1190
 aer_isr+0x4d/0x80
 irq_thread_fn+0x28/0x70
 irq_thread+0x179/0x240
 ? srso_return_thunk+0x5/0x5f
 ? __pfx_irq_thread_fn+0x10/0x10
 ? __pfx_irq_thread_dtor+0x10/0x10
 kthread+0x10b/0x220
 ? __pfx_irq_thread+0x10/0x10
 ? _raw_spin_unlock_irq+0x1f/0x40
 ? __pfx_kthread+0x10/0x10
 ret_from_fork+0x3c/0x60
 ? __pfx_kthread+0x10/0x10
 ret_from_fork_asm+0x1a/0x30
 </TASK>
Kernel Offset: 0x32200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: CXL cachemem error. ]---

Changes
=======
 Changes in v7 -> v8:
 [Dan] Use kfifo. Move handling to CXL driver. AER forwards error to CXL
 driver
 [Dan] Add device reference incrementors where needed throughout
 [Dan] Initiate CXL Port RAS init from Switch Port and Endpoint Port init 
 [Dan] Combine CXL Port and CXL Endpoint trace routine
 [Dan] Introduce aer_info::is_cxl. Use to indicate CXL or PCI errors
 [Jonathan] Add serial number for all devices in trace
 [DaveJ] Move find_cxl_port() change into patch using it
 [Terry] Move CXL Port RAS init into cxl/port.c
 [Terry] Moved kfifo functions into cxl/core/ras.c 
 
 Changes in v6 -> v7:
 [Terry] Move updated trace routine call to later patch. Was causing build
 error.
 
 Changes in v5 -> v6:
 [Ira] Move pcie_is_cxl(dev) define to an inline function
 [Ira] Update returning value from pcie_is_cxl_port() to bool w/o cast
 [Ira] Change cxl_report_error_detected() cleanup to return correct bool
 [Ira] Introduce and use PCI_ERS_RESULT_PANIC
 [Ira] Reuse comment for PCIe and CXL recovery paths
 [Jonathan] Add type check in for cxl_handle_cor_ras() and cxl_handle_ras()
 [Jonathan] cxl_uport/dport_init_ras_reporting(), added a mutex.
 [Jonathan] Add logging example to patches updating trace output
 [Jonathan] Make parameter 'const' to eliminate the need for a cast in match_uport()
 [Jonathan] Use __free() in cxl_pci_port_ras()
 [Terry] Add patch to log the PCIe SBDF along with CXL device name
 [Terry] Add patch to handle CXL endpoint and RCH DP errors as CXL errors
 [Terry] Remove patch w USP UCE fatal support @ aer_get_device_error_info()
 [Terry] Rebase to cxl/next commit 5585e342e8d3 ("cxl/memdev: Remove unused partition values")
 [Gregory] Pre-initialize pointer to NULL in cxl_pci_port_ras()
 [Gregory] Move AER driver bus name detection to a static function

 Changes in v4 -> v5:
 [Alejandro] Refactor cxl_walk_bridge to simplify 'status' variable usage
 [Alejandro] Add WARN_ONCE() in __cxl_handle_ras() and cxl_handle_cor_ras()
 [Ming] Remove unnecessary NULL check in cxl_pci_port_ras()
 [Terry] Add failure check for call to to_cxl_port() in cxl_pci_port_ras()
 [Ming] Use port->dev for call to devm_add_action_or_reset() in
 cxl_dport_init_ras_reporting() and cxl_uport_init_ras_reporting()
 [Jonathan] Use get_device()/put_device() to prevent race condition in
 cxl_clear_port_error_handlers() and cxl_clear_port_error_handlers()
 [Terry] Commit message cleanup. Capitalize keywords from CXL and PCI
 specifications

 Changes in v3 -> v4:
 [Lukas] Capitalize PCIe and CXL device names as in specifications
 [Lukas] Move call to pcie_is_cxl() into cxl_port_devsec()
 [Lukas] Correct namespace spelling
 [Lukas] Removed export from pcie_is_cxl_port()
 [Lukas] Simplify 'if' blocks in cxl_handle_error()
 [Lukas] Change panic message to remove redundant 'panic' text
 [Ming] Update to call cxl_dport_init_ras_reporting() in RCH case
 [lkp@intel] 'host' parameter is already removed. Remove parameter description too.
 [Terry] Added field description for cxl_err_handlers in pci.h comment block

 Changes in v1 -> v2:
 [Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
 [Jonathan] Update description to DSP map patch description
 [Jonathan] Update cxl_pci_port_ras() to check for NULL port
 [Jonathan] Don't call handler before handler port changes are present (patch order)
 [Bjorn] Fix linebreak in cover sheet URL
 [Bjorn] Remove timestamps from test logs in cover sheet
 [Bjorn] Retitle AER commits to use "PCI/AER:"
 [Bjorn] Retitle patch#3 to use renaming instead of refactoring
 [Bjorn] Fix base commit-id on cover sheet
 [Bjorn] Add VH spec reference/citation
 [Terry] Removed last 2 patches to enable internal errors. Is not needed
 because internal errors are enabled in AER driver.
 [Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
 [Dan] Use kernel panic in CXL recovery
 [Dan] cxl_port_hndlrs -> cxl_port_error_handlers

Terry Bowman (16):
  PCI/CXL: Introduce PCIe helper function pcie_is_cxl()
  PCI/AER: Modify AER driver logging to report CXL or PCIe bus error
    type
  CXL/AER: Introduce Kfifo for forwarding CXL errors
  cxl/aer: AER service driver forwards CXL error to CXL driver
  PCI/AER: CXL driver dequeues CXL error forwarded from AER service
    driver
  CXL/PCI: Introduce CXL uncorrectable protocol error 'recovery'
  cxl/pci: Move existing CXL RAS initialization to CXL's cxl_port driver
  cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
  cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports
  cxl/pci: Add log message if RAS registers are not mapped
  cxl/pci: Unifi CXL trace logging for CXL Endpoints and CXL Ports
  cxl/pci: Assign CXL Port protocol error handlers
  cxl/pci: Assign CXL Endpoint protocol error handlers
  cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
  CXL/PCI: Enable CXL protocol errors during CXL Port probe
  CXL/PCI: Disable CXL protocol errors during CXL Port cleanup

 drivers/cxl/core/core.h       |   2 +
 drivers/cxl/core/pci.c        | 195 +++++++++-------------
 drivers/cxl/core/port.c       |   4 +-
 drivers/cxl/core/ras.c        | 304 +++++++++++++++++++++++++++++++++-
 drivers/cxl/core/regs.c       |   2 +
 drivers/cxl/core/trace.h      | 100 ++++-------
 drivers/cxl/cxl.h             |  37 +++++
 drivers/cxl/cxlpci.h          |   7 +-
 drivers/cxl/mem.c             |   3 +-
 drivers/cxl/pci.c             |   8 +-
 drivers/cxl/port.c            | 202 ++++++++++++++++++++++
 drivers/pci/pci.c             |   6 +
 drivers/pci/pci.h             |  14 +-
 drivers/pci/pcie/aer.c        | 180 ++++++++++++++------
 drivers/pci/pcie/rcec.c       |   1 +
 drivers/pci/probe.c           |  10 ++
 include/linux/aer.h           |  41 +++++
 include/linux/pci.h           |  18 ++
 include/ras/ras_event.h       |   9 +-
 include/uapi/linux/pci_regs.h |   8 +-
 20 files changed, 886 insertions(+), 265 deletions(-)


base-commit: aae0594a7053c60b82621136257c8b648c67b512
-- 
2.34.1



* [PATCH v8 01/16] PCI/CXL: Introduce PCIe helper function pcie_is_cxl()
  2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
@ 2025-03-27  1:47 ` Terry Bowman
  2025-03-27 15:11   ` Ira Weiny
  2025-03-27  1:47 ` [PATCH v8 02/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 76+ messages in thread
From: Terry Bowman @ 2025-03-27  1:47 UTC
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

CXL and AER drivers need the ability to identify CXL devices.

Add set_pcie_cxl() with logic checking for CXL Flexbus DVSEC presence. The
CXL Flexbus DVSEC presence is used because it is required for all the CXL
PCIe devices.[1]

Add boolean 'struct pci_dev::is_cxl' with the purpose to cache the CXL
Flexbus presence.

Add function pcie_is_cxl() to return 'struct pci_dev::is_cxl'.

[1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
    Capability (DVSEC) ID Assignment, Table 8-2

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/pci/pci.c             |  5 +++++
 drivers/pci/probe.c           | 10 ++++++++++
 include/linux/pci.h           |  3 +++
 include/uapi/linux/pci_regs.h |  8 +++++++-
 4 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 869d204a70a3..a1d75f40017e 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5032,6 +5032,11 @@ static u16 cxl_port_dvsec(struct pci_dev *dev)
 					 PCI_DVSEC_CXL_PORT);
 }
 
+inline bool pcie_is_cxl(struct pci_dev *pci_dev)
+{
+	return pci_dev->is_cxl;
+}
+
 static bool cxl_sbr_masked(struct pci_dev *dev)
 {
 	u16 dvsec, reg;
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index b6536ed599c3..7737b9ce7a83 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1676,6 +1676,14 @@ static void set_pcie_thunderbolt(struct pci_dev *dev)
 		dev->is_thunderbolt = 1;
 }
 
+static void set_pcie_cxl(struct pci_dev *dev)
+{
+	u16 dvsec = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
+					      PCI_DVSEC_CXL_FLEXBUS);
+	if (dvsec)
+		dev->is_cxl = 1;
+}
+
 static void set_pcie_untrusted(struct pci_dev *dev)
 {
 	struct pci_dev *parent = pci_upstream_bridge(dev);
@@ -2006,6 +2014,8 @@ int pci_setup_device(struct pci_dev *dev)
 	/* Need to have dev->cfg_size ready */
 	set_pcie_thunderbolt(dev);
 
+	set_pcie_cxl(dev);
+
 	set_pcie_untrusted(dev);
 
 	if (pci_is_pcie(dev))
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 47b31ad724fa..af83230bef1a 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -452,6 +452,7 @@ struct pci_dev {
 	unsigned int	is_hotplug_bridge:1;
 	unsigned int	shpc_managed:1;		/* SHPC owned by shpchp */
 	unsigned int	is_thunderbolt:1;	/* Thunderbolt controller */
+	unsigned int	is_cxl:1;               /* Compute Express Link (CXL) */
 	/*
 	 * Devices marked being untrusted are the ones that can potentially
 	 * execute DMA attacks and similar. They are typically connected
@@ -741,6 +742,8 @@ static inline bool pci_is_vga(struct pci_dev *pdev)
 	return false;
 }
 
+bool pcie_is_cxl(struct pci_dev *pci_dev);
+
 #define for_each_pci_bridge(dev, bus)				\
 	list_for_each_entry(dev, &bus->devices, bus_list)	\
 		if (!pci_is_bridge(dev)) {} else
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index 3445c4970e4d..7ccb3b2fcc38 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -1208,9 +1208,15 @@
 #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
 #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000
 
-/* Compute Express Link (CXL r3.1, sec 8.1.5) */
+/* Compute Express Link (CXL r3.1, sec 8.1)
+ *
+ * Note that CXL DVSEC id 3 and 7 to be ignored when the CXL link state
+ * is "disconnected" (CXL r3.1, sec 9.12.3). Re-enumerate these
+ * registers on downstream link-up events.
+ */
 #define PCI_DVSEC_CXL_PORT				3
 #define PCI_DVSEC_CXL_PORT_CTL				0x0c
 #define PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR		0x00000001
+#define PCI_DVSEC_CXL_FLEXBUS				7
 
 #endif /* LINUX_PCI_REGS_H */
-- 
2.34.1
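
For readers unfamiliar with DVSEC discovery, the lookup that set_pcie_cxl()
relies on can be approximated in user space by walking the extended
capability list in a config-space image. This is a sketch, not the kernel's
pci_find_dvsec_capability() implementation: struct demo_dev,
demo_find_dvsec() and demo_set_cxl() are hypothetical names, and the image
is a plain byte array rather than real config space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CFG_SPACE_SIZE	4096u
#define DEMO_EXT_CAP_START	0x100u	/* first extended capability */
#define DEMO_EXT_CAP_ID_DVSEC	0x23u	/* Designated Vendor-Specific */
#define DEMO_VENDOR_ID_CXL	0x1e98u
#define DEMO_DVSEC_CXL_FLEXBUS	7u	/* DVSEC ID added by this patch */

/* Hypothetical device model: a config-space image plus the cached flag. */
struct demo_dev {
	uint8_t cfg[DEMO_CFG_SPACE_SIZE];
	unsigned int is_cxl:1;
};

/* Read a little-endian 32-bit value from the config-space image. */
static uint32_t demo_cfg_read32(const uint8_t *cfg, unsigned int off)
{
	return (uint32_t)cfg[off] | ((uint32_t)cfg[off + 1] << 8) |
	       ((uint32_t)cfg[off + 2] << 16) | ((uint32_t)cfg[off + 3] << 24);
}

/* Return the offset of the matching DVSEC, or 0 if it is absent. */
static unsigned int demo_find_dvsec(const uint8_t *cfg, unsigned int vendor,
				    unsigned int id)
{
	unsigned int pos = DEMO_EXT_CAP_START;
	/* Bound the walk so a malformed (looping) list terminates. */
	int ttl = (DEMO_CFG_SPACE_SIZE - DEMO_EXT_CAP_START) / 8;

	assert(cfg != NULL);
	while (pos >= DEMO_EXT_CAP_START &&
	       pos <= DEMO_CFG_SPACE_SIZE - 12 && ttl-- > 0) {
		uint32_t hdr = demo_cfg_read32(cfg, pos);

		if (hdr == 0)
			break;
		/* Capability ID, then DVSEC vendor ID (+4) and DVSEC ID (+8). */
		if ((hdr & 0xffff) == DEMO_EXT_CAP_ID_DVSEC &&
		    (demo_cfg_read32(cfg, pos + 4) & 0xffff) == vendor &&
		    (demo_cfg_read32(cfg, pos + 8) & 0xffff) == id)
			return pos;
		pos = (hdr >> 20) & 0xffc;	/* next capability offset */
	}
	return 0;
}

/* Cache the result once, in the spirit of set_pcie_cxl() in this patch. */
static void demo_set_cxl(struct demo_dev *dev)
{
	assert(dev != NULL);
	if (demo_find_dvsec(dev->cfg, DEMO_VENDOR_ID_CXL, DEMO_DVSEC_CXL_FLEXBUS))
		dev->is_cxl = 1;
}
```

Caching the flag means later callers only test one bit instead of repeating
the capability walk, which is the stated purpose of pci_dev::is_cxl.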



* Re: [PATCH v8 01/16] PCI/CXL: Introduce PCIe helper function pcie_is_cxl()
  2025-03-27  1:47 ` [PATCH v8 01/16] PCI/CXL: Introduce PCIe helper function pcie_is_cxl() Terry Bowman
@ 2025-03-27 15:11   ` Ira Weiny
  2025-03-27 15:30     ` Bowman, Terry
  0 siblings, 1 reply; 76+ messages in thread
From: Ira Weiny @ 2025-03-27 15:11 UTC
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

Terry Bowman wrote:
> CXL and AER drivers need the ability to identify CXL devices.
> 
> Add set_pcie_cxl() with logic checking for CXL Flexbus DVSEC presence. The
> CXL Flexbus DVSEC presence is used because it is required for all the CXL
> PCIe devices.[1]
> 
> Add boolean 'struct pci_dev::is_cxl' with the purpose to cache the CXL
> Flexbus presence.
> 
> Add function pcie_is_cxl() to return 'struct pci_dev::is_cxl'.
> 
> [1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
>     Capability (DVSEC) ID Assignment, Table 8-2
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/pci/pci.c             |  5 +++++
>  drivers/pci/probe.c           | 10 ++++++++++
>  include/linux/pci.h           |  3 +++
>  include/uapi/linux/pci_regs.h |  8 +++++++-
>  4 files changed, 25 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 869d204a70a3..a1d75f40017e 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5032,6 +5032,11 @@ static u16 cxl_port_dvsec(struct pci_dev *dev)
>  					 PCI_DVSEC_CXL_PORT);
>  }
>  
> +inline bool pcie_is_cxl(struct pci_dev *pci_dev)
> +{
> +	return pci_dev->is_cxl;
> +}

Shouldn't this just be a static inline in include/linux/pci.h?

> +
>  static bool cxl_sbr_masked(struct pci_dev *dev)
>  {
>  	u16 dvsec, reg;

[snip]

> @@ -741,6 +742,8 @@ static inline bool pci_is_vga(struct pci_dev *pdev)
>  	return false;
>  }
>  
> +bool pcie_is_cxl(struct pci_dev *pci_dev);
> +
>  #define for_each_pci_bridge(dev, bus)				\
>  	list_for_each_entry(dev, &bus->devices, bus_list)	\
>  		if (!pci_is_bridge(dev)) {} else
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index 3445c4970e4d..7ccb3b2fcc38 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -1208,9 +1208,15 @@
>  #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
>  #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000
>  
> -/* Compute Express Link (CXL r3.1, sec 8.1.5) */
> +/* Compute Express Link (CXL r3.1, sec 8.1)
				r3.2

> + *
> + * Note that CXL DVSEC id 3 and 7 to be ignored when the CXL link state
> + * is "disconnected" (CXL r3.1, sec 9.12.3). Re-enumerate these
                             r3.2
Same sections.  :-D

With changes:

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

Ira

[snip]


* Re: [PATCH v8 01/16] PCI/CXL: Introduce PCIe helper function pcie_is_cxl()
  2025-03-27 15:11   ` Ira Weiny
@ 2025-03-27 15:30     ` Bowman, Terry
  0 siblings, 0 replies; 76+ messages in thread
From: Bowman, Terry @ 2025-03-27 15:30 UTC
  To: Ira Weiny, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
	rrichter, nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
	ming.li, PradeepVineshReddy.Kodamati



On 3/27/2025 10:11 AM, Ira Weiny wrote:
> Terry Bowman wrote:
>> CXL and AER drivers need the ability to identify CXL devices.
>>
>> Add set_pcie_cxl() with logic checking for CXL Flexbus DVSEC presence. The
>> CXL Flexbus DVSEC presence is used because it is required for all the CXL
>> PCIe devices.[1]
>>
>> Add boolean 'struct pci_dev::is_cxl' with the purpose to cache the CXL
>> Flexbus presence.
>>
>> Add function pcie_is_cxl() to return 'struct pci_dev::is_cxl'.
>>
>> [1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
>>     Capability (DVSEC) ID Assignment, Table 8-2
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>  drivers/pci/pci.c             |  5 +++++
>>  drivers/pci/probe.c           | 10 ++++++++++
>>  include/linux/pci.h           |  3 +++
>>  include/uapi/linux/pci_regs.h |  8 +++++++-
>>  4 files changed, 25 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index 869d204a70a3..a1d75f40017e 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -5032,6 +5032,11 @@ static u16 cxl_port_dvsec(struct pci_dev *dev)
>>  					 PCI_DVSEC_CXL_PORT);
>>  }
>>  
>> +inline bool pcie_is_cxl(struct pci_dev *pci_dev)
>> +{
>> +	return pci_dev->is_cxl;
>> +}
> Shouldn't this just be a static inline in include/linux/pci.h?
Ok, I'll make that change.
>> +
>>  static bool cxl_sbr_masked(struct pci_dev *dev)
>>  {
>>  	u16 dvsec, reg;
> [snip]
>
>> @@ -741,6 +742,8 @@ static inline bool pci_is_vga(struct pci_dev *pdev)
>>  	return false;
>>  }
>>  
>> +bool pcie_is_cxl(struct pci_dev *pci_dev);
>> +
>>  #define for_each_pci_bridge(dev, bus)				\
>>  	list_for_each_entry(dev, &bus->devices, bus_list)	\
>>  		if (!pci_is_bridge(dev)) {} else
>> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
>> index 3445c4970e4d..7ccb3b2fcc38 100644
>> --- a/include/uapi/linux/pci_regs.h
>> +++ b/include/uapi/linux/pci_regs.h
>> @@ -1208,9 +1208,15 @@
>>  #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
>>  #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000
>>  
>> -/* Compute Express Link (CXL r3.1, sec 8.1.5) */
>> +/* Compute Express Link (CXL r3.1, sec 8.1)
> 				r3.2
>
>> + *
>> + * Note that CXL DVSEC id 3 and 7 to be ignored when the CXL link state
>> + * is "disconnected" (CXL r3.1, sec 9.12.3). Re-enumerate these
>                              r3.2
> Same sections.  :-D

Got it. I'll change the spec release to 3.2.
> With changes:
>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
>
> Ira
>
> [snip]
Thanks Ira.

* [PATCH v8 02/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type
  2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
  2025-03-27  1:47 ` [PATCH v8 01/16] PCI/CXL: Introduce PCIe helper function pcie_is_cxl() Terry Bowman
@ 2025-03-27  1:47 ` Terry Bowman
  2025-03-27 16:48   ` Bjorn Helgaas
  2025-03-27 16:58   ` Ira Weiny
  2025-03-27  1:47 ` [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors Terry Bowman
                   ` (15 subsequent siblings)
  17 siblings, 2 replies; 76+ messages in thread
From: Terry Bowman @ 2025-03-27  1:47 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

The AER service driver and aer_event tracing currently log 'PCIe Bus Error'
for all errors. Update the driver and aer_event tracing to log 'CXL Bus
Error' for CXL device errors.

This requires that the AER driver can identify and distinguish between PCIe
errors and CXL errors.

Add boolean 'is_cxl' to 'struct aer_err_info'. Assign it in
aer_get_device_error_info() and pci_print_aer().

Update the aer_event trace routine to accept a bus type string parameter.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/pci/pci.h       |  6 ++++++
 drivers/pci/pcie/aer.c  | 18 ++++++++++++------
 include/ras/ras_event.h |  9 ++++++---
 3 files changed, 24 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 01e51db8d285..eed098c134a6 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -533,6 +533,7 @@ static inline bool pci_dev_test_and_set_removed(struct pci_dev *dev)
 struct aer_err_info {
 	struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
 	int error_dev_num;
+	bool is_cxl;
 
 	unsigned int id:16;
 
@@ -549,6 +550,11 @@ struct aer_err_info {
 	struct pcie_tlp_log tlp;	/* TLP Header */
 };
 
+static inline const char *aer_err_bus(struct aer_err_info *info)
+{
+	return info->is_cxl ? "CXL" : "PCIe";
+}
+
 int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
 
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 508474e17183..83f2069f111e 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -694,13 +694,14 @@ static void __aer_print_error(struct pci_dev *dev,
 
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 {
+	const char *bus_type = aer_err_bus(info);
 	int layer, agent;
 	int id = pci_dev_id(dev);
 	const char *level;
 
 	if (!info->status) {
-		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
-			aer_error_severity_string[info->severity]);
+		pci_err(dev, "%s Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
+			bus_type, aer_error_severity_string[info->severity]);
 		goto out;
 	}
 
@@ -709,8 +710,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 
 	level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
 
-	pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
-		   aer_error_severity_string[info->severity],
+	pci_printk(level, dev, "%s Bus Error: severity=%s, type=%s, (%s)\n",
+		   bus_type, aer_error_severity_string[info->severity],
 		   aer_error_layer[layer], aer_agent_string[agent]);
 
 	pci_printk(level, dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
@@ -725,7 +726,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 	if (info->id && info->error_dev_num > 1 && info->id == id)
 		pci_err(dev, "  Error of this Agent is reported first\n");
 
-	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
+	trace_aer_event(dev_name(&dev->dev), bus_type, (info->status & ~info->mask),
 			info->severity, info->tlp_header_valid, &info->tlp);
 }
 
@@ -759,6 +760,7 @@ EXPORT_SYMBOL_GPL(cper_severity_to_aer);
 void pci_print_aer(struct pci_dev *dev, int aer_severity,
 		   struct aer_capability_regs *aer)
 {
+	const char *bus_type;
 	int layer, agent, tlp_header_valid = 0;
 	u32 status, mask;
 	struct aer_err_info info;
@@ -780,6 +782,9 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 	info.status = status;
 	info.mask = mask;
 	info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
+	info.is_cxl = pcie_is_cxl(dev);
+
+	bus_type = aer_err_bus(&info);
 
 	pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
 	__aer_print_error(dev, &info);
@@ -793,7 +798,7 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 	if (tlp_header_valid)
 		pcie_print_tlp_log(dev, &aer->header_log, dev_fmt("  "));
 
-	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
+	trace_aer_event(dev_name(&dev->dev), bus_type, (status & ~mask),
 			aer_severity, tlp_header_valid, &aer->header_log);
 }
 EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
@@ -1211,6 +1216,7 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
 	/* Must reset in this function */
 	info->status = 0;
 	info->tlp_header_valid = 0;
+	info->is_cxl = pcie_is_cxl(dev);
 
 	/* The device might not support AER */
 	if (!aer)
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index e5f7ee0864e7..1bf8e7050ba8 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -297,15 +297,17 @@ TRACE_EVENT(non_standard_event,
 
 TRACE_EVENT(aer_event,
 	TP_PROTO(const char *dev_name,
+		 const char *bus_type,
 		 const u32 status,
 		 const u8 severity,
 		 const u8 tlp_header_valid,
 		 struct pcie_tlp_log *tlp),
 
-	TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp),
+	TP_ARGS(dev_name, bus_type, status, severity, tlp_header_valid, tlp),
 
 	TP_STRUCT__entry(
 		__string(	dev_name,	dev_name	)
+		__string(	bus_type,	bus_type	)
 		__field(	u32,		status		)
 		__field(	u8,		severity	)
 		__field(	u8, 		tlp_header_valid)
@@ -314,6 +316,7 @@ TRACE_EVENT(aer_event,
 
 	TP_fast_assign(
 		__assign_str(dev_name);
+		__assign_str(bus_type);
 		__entry->status		= status;
 		__entry->severity	= severity;
 		__entry->tlp_header_valid = tlp_header_valid;
@@ -325,8 +328,8 @@ TRACE_EVENT(aer_event,
 		}
 	),
 
-	TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s\n",
-		__get_str(dev_name),
+	TP_printk("%s %s Bus Error: severity=%s, %s, TLP Header=%s\n",
+		__get_str(dev_name), __get_str(bus_type),
 		__entry->severity == AER_CORRECTABLE ? "Corrected" :
 			__entry->severity == AER_FATAL ?
 			"Fatal" : "Uncorrected, non-fatal",
-- 
2.34.1


* Re: [PATCH v8 02/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type
  2025-03-27  1:47 ` [PATCH v8 02/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
@ 2025-03-27 16:48   ` Bjorn Helgaas
  2025-03-27 17:15     ` Bowman, Terry
  2025-03-27 16:58   ` Ira Weiny
  1 sibling, 1 reply; 76+ messages in thread
From: Bjorn Helgaas @ 2025-03-27 16:48 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Wed, Mar 26, 2025 at 08:47:03PM -0500, Terry Bowman wrote:
> The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
> for all errors. Update the driver and aer_event tracing to log 'CXL Bus
> Type' for CXL device errors.
> 
> This requires the AER can identify and distinguish between PCIe errors and
> CXL errors.
> 
> Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
> aer_get_device_error_info() and pci_print_aer().
> 
> Update the aer_event trace routine to accept a bus type string parameter.

> +++ b/drivers/pci/pci.h
> @@ -533,6 +533,7 @@ static inline bool pci_dev_test_and_set_removed(struct pci_dev *dev)
>  struct aer_err_info {
>  	struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
>  	int error_dev_num;
> +	bool is_cxl;
>  
>  	unsigned int id:16;
>  
> @@ -549,6 +550,11 @@ struct aer_err_info {
>  	struct pcie_tlp_log tlp;	/* TLP Header */
>  };
>  
> +static inline const char *aer_err_bus(struct aer_err_info *info)
> +{
> +	return info->is_cxl ? "CXL" : "PCIe";

I don't really see the point in adding struct aer_err_info.is_cxl.
Every place where we call aer_err_bus() to look at it, we also have
the struct pci_dev pointer, so we could just as easily use
pcie_is_cxl() here.
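
Roughly, and only as a sketch, that alternative could look like the
following. The 'struct pci_dev' here is a stand-in with just the 'is_cxl'
member so the snippet builds outside the kernel, and the changed
aer_err_bus() signature is an assumption for illustration, not something
taken from this series:

```c
#include <stdbool.h>

/* Stand-in for the kernel's struct pci_dev; only is_cxl is modeled. */
struct pci_dev {
	bool is_cxl;
};

/* Accessor from patch 01/16, shown here as a static inline. */
static inline bool pcie_is_cxl(struct pci_dev *pci_dev)
{
	return pci_dev->is_cxl;
}

/* Hypothetical variant: derive the bus string from the device itself. */
static inline const char *aer_err_bus(struct pci_dev *dev)
{
	return pcie_is_cxl(dev) ? "CXL" : "PCIe";
}
```

With this shape the callers would pass 'dev' instead of 'info', and no new
member would be needed in 'struct aer_err_info'.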

> +}
> +
>  int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);
>  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
>  
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 508474e17183..83f2069f111e 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -694,13 +694,14 @@ static void __aer_print_error(struct pci_dev *dev,
>  
>  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  {
> +	const char *bus_type = aer_err_bus(info);
>  	int layer, agent;
>  	int id = pci_dev_id(dev);
>  	const char *level;
>  
>  	if (!info->status) {
> -		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> -			aer_error_severity_string[info->severity]);
> +		pci_err(dev, "%s Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> +			bus_type, aer_error_severity_string[info->severity]);
>  		goto out;
>  	}
>  
> @@ -709,8 +710,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  
>  	level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
>  
> -	pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> -		   aer_error_severity_string[info->severity],
> +	pci_printk(level, dev, "%s Bus Error: severity=%s, type=%s, (%s)\n",
> +		   bus_type, aer_error_severity_string[info->severity],
>  		   aer_error_layer[layer], aer_agent_string[agent]);
>  
>  	pci_printk(level, dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
> @@ -725,7 +726,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  	if (info->id && info->error_dev_num > 1 && info->id == id)
>  		pci_err(dev, "  Error of this Agent is reported first\n");
>  
> -	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
> +	trace_aer_event(dev_name(&dev->dev), bus_type, (info->status & ~info->mask),
>  			info->severity, info->tlp_header_valid, &info->tlp);
>  }
>  
> @@ -759,6 +760,7 @@ EXPORT_SYMBOL_GPL(cper_severity_to_aer);
>  void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  		   struct aer_capability_regs *aer)
>  {
> +	const char *bus_type;
>  	int layer, agent, tlp_header_valid = 0;
>  	u32 status, mask;
>  	struct aer_err_info info;
> @@ -780,6 +782,9 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  	info.status = status;
>  	info.mask = mask;
>  	info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
> +	info.is_cxl = pcie_is_cxl(dev);
> +
> +	bus_type = aer_err_bus(&info);
>  
>  	pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
>  	__aer_print_error(dev, &info);
> @@ -793,7 +798,7 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  	if (tlp_header_valid)
>  		pcie_print_tlp_log(dev, &aer->header_log, dev_fmt("  "));
>  
> -	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
> +	trace_aer_event(dev_name(&dev->dev), bus_type, (status & ~mask),
>  			aer_severity, tlp_header_valid, &aer->header_log);
>  }
>  EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
> @@ -1211,6 +1216,7 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
>  	/* Must reset in this function */
>  	info->status = 0;
>  	info->tlp_header_valid = 0;
> +	info->is_cxl = pcie_is_cxl(dev);
>  
>  	/* The device might not support AER */
>  	if (!aer)
> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
> index e5f7ee0864e7..1bf8e7050ba8 100644
> --- a/include/ras/ras_event.h
> +++ b/include/ras/ras_event.h
> @@ -297,15 +297,17 @@ TRACE_EVENT(non_standard_event,
>  
>  TRACE_EVENT(aer_event,
>  	TP_PROTO(const char *dev_name,
> +		 const char *bus_type,
>  		 const u32 status,
>  		 const u8 severity,
>  		 const u8 tlp_header_valid,
>  		 struct pcie_tlp_log *tlp),
>  
> -	TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp),
> +	TP_ARGS(dev_name, bus_type, status, severity, tlp_header_valid, tlp),
>  
>  	TP_STRUCT__entry(
>  		__string(	dev_name,	dev_name	)
> +		__string(	bus_type,	bus_type	)
>  		__field(	u32,		status		)
>  		__field(	u8,		severity	)
>  		__field(	u8, 		tlp_header_valid)
> @@ -314,6 +316,7 @@ TRACE_EVENT(aer_event,
>  
>  	TP_fast_assign(
>  		__assign_str(dev_name);
> +		__assign_str(bus_type);
>  		__entry->status		= status;
>  		__entry->severity	= severity;
>  		__entry->tlp_header_valid = tlp_header_valid;
> @@ -325,8 +328,8 @@ TRACE_EVENT(aer_event,
>  		}
>  	),
>  
> -	TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s\n",
> -		__get_str(dev_name),
> +	TP_printk("%s %s Bus Error: severity=%s, %s, TLP Header=%s\n",
> +		__get_str(dev_name), __get_str(bus_type),
>  		__entry->severity == AER_CORRECTABLE ? "Corrected" :
>  			__entry->severity == AER_FATAL ?
>  			"Fatal" : "Uncorrected, non-fatal",
> -- 
> 2.34.1
> 

* Re: [PATCH v8 02/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type
  2025-03-27 16:48   ` Bjorn Helgaas
@ 2025-03-27 17:15     ` Bowman, Terry
  2025-03-27 17:49       ` Bjorn Helgaas
  0 siblings, 1 reply; 76+ messages in thread
From: Bowman, Terry @ 2025-03-27 17:15 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati



On 3/27/2025 11:48 AM, Bjorn Helgaas wrote:
> On Wed, Mar 26, 2025 at 08:47:03PM -0500, Terry Bowman wrote:
>> The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
>> for all errors. Update the driver and aer_event tracing to log 'CXL Bus
>> Type' for CXL device errors.
>>
>> This requires the AER can identify and distinguish between PCIe errors and
>> CXL errors.
>>
>> Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
>> aer_get_device_error_info() and pci_print_aer().
>>
>> Update the aer_event trace routine to accept a bus type string parameter.
>> +++ b/drivers/pci/pci.h
>> @@ -533,6 +533,7 @@ static inline bool pci_dev_test_and_set_removed(struct pci_dev *dev)
>>  struct aer_err_info {
>>  	struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
>>  	int error_dev_num;
>> +	bool is_cxl;
>>  
>>  	unsigned int id:16;
>>  
>> @@ -549,6 +550,11 @@ struct aer_err_info {
>>  	struct pcie_tlp_log tlp;	/* TLP Header */
>>  };
>>  
>> +static inline const char *aer_err_bus(struct aer_err_info *info)
>> +{
>> +	return info->is_cxl ? "CXL" : "PCIe";
> I don't really see the point in adding struct aer_err_info.is_cxl.
> Every place where we call aer_err_bus() to look at it, we also have
> the struct pci_dev pointer, so we could just as easily use
> pcie_is_cxl() here.

Hi Bjorn,

My understanding is that Dan wanted aer_err_info::is_cxl to capture the type
of error (CXL or PCIe) and cache it because the type could change. That is,
the CXL device's alternate protocol could transition down before the error is
handled, but the CXL driver must keep knowledge of the original state for
reporting. However, the alternate protocol transition is not currently
detected or used to update pci_dev::is_cxl; pci_dev::is_cxl is only assigned
during device creation.
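
To make the caching idea concrete, here is a small userspace sketch. The
structs are stand-ins that carry only the fields under discussion, and
aer_snapshot_bus_type() is a hypothetical helper name that mirrors the
single assignment this patch adds to aer_get_device_error_info():

```c
#include <stdbool.h>

/* Stand-ins for the kernel structures; only the fields discussed here. */
struct pci_dev {
	bool is_cxl;		/* live state, set at device creation */
};

struct aer_err_info {
	bool is_cxl;		/* snapshot taken when the error is collected */
};

/* Hypothetical helper: copy the live state into the error record. */
static void aer_snapshot_bus_type(struct pci_dev *dev,
				  struct aer_err_info *info)
{
	info->is_cxl = dev->is_cxl;
}

/* Reporting reads the snapshot, not the device. */
static const char *aer_err_bus(const struct aer_err_info *info)
{
	return info->is_cxl ? "CXL" : "PCIe";
}
```

If pci_dev::is_cxl were ever cleared after the snapshot (which nothing does
today), aer_err_bus() would still report the state seen when the error was
collected.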

DanW, please add detail and/or correct me.

Regards,
Terry
>> +}
>> +
>>  int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info);
>>  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
>>  
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 508474e17183..83f2069f111e 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -694,13 +694,14 @@ static void __aer_print_error(struct pci_dev *dev,
>>  
>>  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>>  {
>> +	const char *bus_type = aer_err_bus(info);
>>  	int layer, agent;
>>  	int id = pci_dev_id(dev);
>>  	const char *level;
>>  
>>  	if (!info->status) {
>> -		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
>> -			aer_error_severity_string[info->severity]);
>> +		pci_err(dev, "%s Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
>> +			bus_type, aer_error_severity_string[info->severity]);
>>  		goto out;
>>  	}
>>  
>> @@ -709,8 +710,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>>  
>>  	level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
>>  
>> -	pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
>> -		   aer_error_severity_string[info->severity],
>> +	pci_printk(level, dev, "%s Bus Error: severity=%s, type=%s, (%s)\n",
>> +		   bus_type, aer_error_severity_string[info->severity],
>>  		   aer_error_layer[layer], aer_agent_string[agent]);
>>  
>>  	pci_printk(level, dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
>> @@ -725,7 +726,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>>  	if (info->id && info->error_dev_num > 1 && info->id == id)
>>  		pci_err(dev, "  Error of this Agent is reported first\n");
>>  
>> -	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
>> +	trace_aer_event(dev_name(&dev->dev), bus_type, (info->status & ~info->mask),
>>  			info->severity, info->tlp_header_valid, &info->tlp);
>>  }
>>  
>> @@ -759,6 +760,7 @@ EXPORT_SYMBOL_GPL(cper_severity_to_aer);
>>  void pci_print_aer(struct pci_dev *dev, int aer_severity,
>>  		   struct aer_capability_regs *aer)
>>  {
>> +	const char *bus_type;
>>  	int layer, agent, tlp_header_valid = 0;
>>  	u32 status, mask;
>>  	struct aer_err_info info;
>> @@ -780,6 +782,9 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>>  	info.status = status;
>>  	info.mask = mask;
>>  	info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
>> +	info.is_cxl = pcie_is_cxl(dev);
>> +
>> +	bus_type = aer_err_bus(&info);
>>  
>>  	pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
>>  	__aer_print_error(dev, &info);
>> @@ -793,7 +798,7 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>>  	if (tlp_header_valid)
>>  		pcie_print_tlp_log(dev, &aer->header_log, dev_fmt("  "));
>>  
>> -	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
>> +	trace_aer_event(dev_name(&dev->dev), bus_type, (status & ~mask),
>>  			aer_severity, tlp_header_valid, &aer->header_log);
>>  }
>>  EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
>> @@ -1211,6 +1216,7 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
>>  	/* Must reset in this function */
>>  	info->status = 0;
>>  	info->tlp_header_valid = 0;
>> +	info->is_cxl = pcie_is_cxl(dev);
>>  
>>  	/* The device might not support AER */
>>  	if (!aer)
>> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
>> index e5f7ee0864e7..1bf8e7050ba8 100644
>> --- a/include/ras/ras_event.h
>> +++ b/include/ras/ras_event.h
>> @@ -297,15 +297,17 @@ TRACE_EVENT(non_standard_event,
>>  
>>  TRACE_EVENT(aer_event,
>>  	TP_PROTO(const char *dev_name,
>> +		 const char *bus_type,
>>  		 const u32 status,
>>  		 const u8 severity,
>>  		 const u8 tlp_header_valid,
>>  		 struct pcie_tlp_log *tlp),
>>  
>> -	TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp),
>> +	TP_ARGS(dev_name, bus_type, status, severity, tlp_header_valid, tlp),
>>  
>>  	TP_STRUCT__entry(
>>  		__string(	dev_name,	dev_name	)
>> +		__string(	bus_type,	bus_type	)
>>  		__field(	u32,		status		)
>>  		__field(	u8,		severity	)
>>  		__field(	u8, 		tlp_header_valid)
>> @@ -314,6 +316,7 @@ TRACE_EVENT(aer_event,
>>  
>>  	TP_fast_assign(
>>  		__assign_str(dev_name);
>> +		__assign_str(bus_type);
>>  		__entry->status		= status;
>>  		__entry->severity	= severity;
>>  		__entry->tlp_header_valid = tlp_header_valid;
>> @@ -325,8 +328,8 @@ TRACE_EVENT(aer_event,
>>  		}
>>  	),
>>  
>> -	TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s\n",
>> -		__get_str(dev_name),
>> +	TP_printk("%s %s Bus Error: severity=%s, %s, TLP Header=%s\n",
>> +		__get_str(dev_name), __get_str(bus_type),
>>  		__entry->severity == AER_CORRECTABLE ? "Corrected" :
>>  			__entry->severity == AER_FATAL ?
>>  			"Fatal" : "Uncorrected, non-fatal",
>> -- 
>> 2.34.1
>>


* Re: [PATCH v8 02/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type
  2025-03-27 17:15     ` Bowman, Terry
@ 2025-03-27 17:49       ` Bjorn Helgaas
  0 siblings, 0 replies; 76+ messages in thread
From: Bjorn Helgaas @ 2025-03-27 17:49 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Thu, Mar 27, 2025 at 12:15:10PM -0500, Bowman, Terry wrote:
> On 3/27/2025 11:48 AM, Bjorn Helgaas wrote:
> > On Wed, Mar 26, 2025 at 08:47:03PM -0500, Terry Bowman wrote:
> >> The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
> >> for all errors. Update the driver and aer_event tracing to log 'CXL Bus
> >> Type' for CXL device errors.
> >>
> >> This requires the AER can identify and distinguish between PCIe errors and
> >> CXL errors.
> >>
> >> Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
> >> aer_get_device_error_info() and pci_print_aer().
> >>
> >> Update the aer_event trace routine to accept a bus type string parameter.
> >> +++ b/drivers/pci/pci.h
> >> @@ -533,6 +533,7 @@ static inline bool pci_dev_test_and_set_removed(struct pci_dev *dev)
> >>  struct aer_err_info {
> >>  	struct pci_dev *dev[AER_MAX_MULTI_ERR_DEVICES];
> >>  	int error_dev_num;
> >> +	bool is_cxl;
> >>  
> >>  	unsigned int id:16;
> >>  
> >> @@ -549,6 +550,11 @@ struct aer_err_info {
> >>  	struct pcie_tlp_log tlp;	/* TLP Header */
> >>  };
> >>  
> >> +static inline const char *aer_err_bus(struct aer_err_info *info)
> >> +{
> >> +	return info->is_cxl ? "CXL" : "PCIe";
> >
> > I don't really see the point in adding struct aer_err_info.is_cxl.
> > Every place where we call aer_err_bus() to look at it, we also have
> > the struct pci_dev pointer, so we could just as easily use
> > pcie_is_cxl() here.
> 
> My understanding is Dan wanted the decorated aer_err_info::is_cxl to
> capture the type of error (CXL or PCIe) and cache it because it
> could change. That is, the CXL device's alternate protocol could
> transition down before handling but the CXL driver must keep
> knowledge of the original state for reporting. But, the alternate
> protocol transition is not currently detected and used to update
> pci_dev::is_cxl. pci_dev::is_cxl is currently only assigned during
> device creation at the moment.

Sounds like there's always a race here if the CXL/PCIe type isn't
captured as part of a hardware error log register.

Bjorn

* Re: [PATCH v8 02/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type
  2025-03-27  1:47 ` [PATCH v8 02/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
  2025-03-27 16:48   ` Bjorn Helgaas
@ 2025-03-27 16:58   ` Ira Weiny
  2025-03-27 17:17     ` Bowman, Terry
  1 sibling, 1 reply; 76+ messages in thread
From: Ira Weiny @ 2025-03-27 16:58 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

Terry Bowman wrote:
> The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
> for all errors. Update the driver and aer_event tracing to log 'CXL Bus
> Type' for CXL device errors.
> 
> This requires the AER can identify and distinguish between PCIe errors and
> CXL errors.
> 
> Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
> aer_get_device_error_info() and pci_print_aer().
> 
> Update the aer_event trace routine to accept a bus type string parameter.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

I'm unsure how Karolina's new populate_aer_err_info() will get the new
is_cxl flag.[1]

Generally though this looks ok.

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[1] https://lore.kernel.org/all/81c040d54209627de2d8b150822636b415834c7f.1742900213.git.karolina.stolarek@oracle.com/

[snip]

* Re: [PATCH v8 02/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type
  2025-03-27 16:58   ` Ira Weiny
@ 2025-03-27 17:17     ` Bowman, Terry
  0 siblings, 0 replies; 76+ messages in thread
From: Bowman, Terry @ 2025-03-27 17:17 UTC (permalink / raw)
  To: Ira Weiny, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
	rrichter, nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
	ming.li, PradeepVineshReddy.Kodamati



On 3/27/2025 11:58 AM, Ira Weiny wrote:
> Terry Bowman wrote:
>> The AER service driver and aer_event tracing currently log 'PCIe Bus Type'
>> for all errors. Update the driver and aer_event tracing to log 'CXL Bus
>> Type' for CXL device errors.
>>
>> This requires the AER can identify and distinguish between PCIe errors and
>> CXL errors.
>>
>> Introduce boolean 'is_cxl' to 'struct aer_err_info'. Add assignment in
>> aer_get_device_error_info() and pci_print_aer().
>>
>> Update the aer_event trace routine to accept a bus type string parameter.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> I'm unsure how Karolina's new populate_aer_err_info() will get the new
> is_cxl flag.[1]

It looks like Karolina's patchset and this one will likely land in the same
merge window. I will need to take a closer look and possibly make changes
for a simpler merge.

> Generally though this looks ok.
>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
>
> [1] https://lore.kernel.org/all/81c040d54209627de2d8b150822636b415834c7f.1742900213.git.karolina.stolarek@oracle.com/
>
> [snip]
Thanks Ira.

* [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors
  2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
  2025-03-27  1:47 ` [PATCH v8 01/16] PCI/CXL: Introduce PCIe helper function pcie_is_cxl() Terry Bowman
  2025-03-27  1:47 ` [PATCH v8 02/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
@ 2025-03-27  1:47 ` Terry Bowman
  2025-03-27 17:08   ` Bjorn Helgaas
                     ` (5 more replies)
  2025-03-27  1:47 ` [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver Terry Bowman
                   ` (14 subsequent siblings)
  17 siblings, 6 replies; 76+ messages in thread
From: Terry Bowman @ 2025-03-27  1:47 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

CXL error handling will soon be moved from the AER driver into the CXL
driver. This requires a notification mechanism for the AER driver to share
the AER interrupt details with the CXL driver. The notification is required
for the CXL drivers to then handle CXL RAS errors.

Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
driver will be the sole kfifo producer adding work. The cxl_core will be
the sole kfifo consumer removing work. Add the boilerplate kfifo support.
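
The single-producer, single-consumer arrangement can be pictured with the
userspace ring buffer below. It is only a sketch of the queueing behaviour:
the demo_* names and the payload are made up for illustration, and it does
not use the kernel's kfifo API or its memory ordering:

```c
#include <stdbool.h>

#define DEMO_FIFO_SIZE 8	/* power of two, so masking can wrap the index */

/* Hypothetical payload; the real one would carry the AER error details. */
struct demo_work {
	int severity;
};

struct demo_fifo {
	struct demo_work slots[DEMO_FIFO_SIZE];
	unsigned int in;	/* advanced only by the producer */
	unsigned int out;	/* advanced only by the consumer */
};

/* Producer side (the AER driver's role): returns false when full. */
static bool demo_fifo_put(struct demo_fifo *f, struct demo_work w)
{
	if (f->in - f->out == DEMO_FIFO_SIZE)
		return false;
	f->slots[f->in & (DEMO_FIFO_SIZE - 1)] = w;
	f->in++;
	return true;
}

/* Consumer side (the cxl_core role): returns false when empty. */
static bool demo_fifo_get(struct demo_fifo *f, struct demo_work *w)
{
	if (f->in == f->out)
		return false;
	*w = f->slots[f->out & (DEMO_FIFO_SIZE - 1)];
	f->out++;
	return true;
}
```

Because each index is written by exactly one side, the two sides never
update the same counter. The sketch is single-threaded and omits the
barriers and locking that a real concurrent FIFO needs.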

Add CXL work queue handler registration functions in the AER driver. Export
the functions so the CXL driver can access them. Implement the registration
functions for the CXL driver to assign or clear the work handler function.

Create a work queue handler function, cxl_prot_err_work_fn(), as a stub for
now. The CXL-specific handling will be added in a future patch.

Introduce 'struct cxl_prot_error_info'. This structure caches CXL error
details used in completing error handling. This avoids duplicating some
function calls and allows the error to be treated generically when
possible.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/ras.c | 54 +++++++++++++++++++++++++++++++++++++++++-
 drivers/cxl/cxlpci.h   |  3 +++
 drivers/pci/pcie/aer.c | 39 ++++++++++++++++++++++++++++++
 include/linux/aer.h    | 37 +++++++++++++++++++++++++++++
 4 files changed, 132 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 485a831695c7..ecb60a5962de 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -5,6 +5,7 @@
 #include <linux/aer.h>
 #include <cxl/event.h>
 #include <cxlmem.h>
+#include <cxlpci.h>
 #include "trace.h"
 
 static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
@@ -107,13 +108,64 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
 }
 static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
 
+int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
+			     struct cxl_prot_error_info *err_info)
+{
+	struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(_pdev);
+	struct cxl_dev_state *cxlds;
+
+	if (!pdev || !err_info) {
+		pr_warn_once("Error: parameter is NULL\n");
+		return -ENODEV;
+	}
+
+	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
+	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END)) {
+		pci_warn_once(pdev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));
+		return -ENODEV;
+	}
+
+	cxlds = pci_get_drvdata(pdev);
+	struct device *dev __free(put_device) = get_device(&cxlds->cxlmd->dev);
+
+	if (!dev)
+		return -ENODEV;
+
+	*err_info = (struct cxl_prot_error_info){ 0 };
+	err_info->ras_base = cxlds->regs.ras;
+	err_info->severity = severity;
+	err_info->pdev = pdev;
+	err_info->dev = dev;
+
+	return 0;
+}
+
+struct work_struct cxl_prot_err_work;
+
 int cxl_ras_init(void)
 {
-	return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
+	int rc;
+
+	rc = cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
+	if (rc) {
+		pr_err("Failed to register CPER kfifo with AER driver");
+		return rc;
+	}
+
+	rc = cxl_register_prot_err_work(&cxl_prot_err_work, cxl_create_prot_err_info);
+	if (rc) {
+		pr_err("Failed to register kfifo with AER driver");
+		return rc;
+	}
+
+	return rc;
 }
 
 void cxl_ras_exit(void)
 {
 	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
 	cancel_work_sync(&cxl_cper_prot_err_work);
+
+	cxl_unregister_prot_err_work();
+	cancel_work_sync(&cxl_prot_err_work);
 }
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 54e219b0049e..92d72c0423ab 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -4,6 +4,7 @@
 #define __CXL_PCI_H__
 #include <linux/pci.h>
 #include "cxl.h"
+#include "linux/aer.h"
 
 #define CXL_MEMORY_PROGIF	0x10
 
@@ -135,4 +136,6 @@ void read_cdat_data(struct cxl_port *port);
 void cxl_cor_error_detected(struct pci_dev *pdev);
 pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 				    pci_channel_state_t state);
+int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
+			     struct cxl_prot_error_info *err_info);
 #endif /* __CXL_PCI_H__ */
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 83f2069f111e..46123b70f496 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -110,6 +110,16 @@ struct aer_stats {
 static int pcie_aer_disable;
 static pci_ers_result_t aer_root_reset(struct pci_dev *dev);
 
+#if defined(CONFIG_PCIEAER_CXL)
+#define CXL_ERROR_SOURCES_MAX          128
+static DEFINE_KFIFO(cxl_prot_err_fifo, struct cxl_prot_err_work_data,
+		    CXL_ERROR_SOURCES_MAX);
+static DEFINE_SPINLOCK(cxl_prot_err_fifo_lock);
+struct work_struct *cxl_prot_err_work;
+static int (*cxl_create_prot_err_info)(struct pci_dev*, int severity,
+				       struct cxl_prot_error_info*);
+#endif
+
 void pci_no_aer(void)
 {
 	pcie_aer_disable = 1;
@@ -1577,6 +1587,35 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
 	return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
 }
 
+
+#if defined(CONFIG_PCIEAER_CXL)
+int cxl_register_prot_err_work(struct work_struct *work,
+			       int (*_cxl_create_prot_err_info)(struct pci_dev*, int,
+								struct cxl_prot_error_info*))
+{
+	guard(spinlock)(&cxl_prot_err_fifo_lock);
+	cxl_prot_err_work = work;
+	cxl_create_prot_err_info = _cxl_create_prot_err_info;
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_register_prot_err_work, "CXL");
+
+int cxl_unregister_prot_err_work(void)
+{
+	guard(spinlock)(&cxl_prot_err_fifo_lock);
+	cxl_prot_err_work = NULL;
+	cxl_create_prot_err_info = NULL;
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_unregister_prot_err_work, "CXL");
+
+int cxl_prot_err_kfifo_get(struct cxl_prot_err_work_data *wd)
+{
+	return kfifo_get(&cxl_prot_err_fifo, wd);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_prot_err_kfifo_get, "CXL");
+#endif
+
 static struct pcie_port_service_driver aerdriver = {
 	.name		= "aer",
 	.port_type	= PCIE_ANY_PORT,
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 947b63091902..761d6f5cd792 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -10,6 +10,7 @@
 
 #include <linux/errno.h>
 #include <linux/types.h>
+#include <linux/workqueue_types.h>
 
 #define AER_NONFATAL			0
 #define AER_FATAL			1
@@ -45,6 +46,24 @@ struct aer_capability_regs {
 	u16 uncor_err_source;
 };
 
+/**
+ * struct cxl_prot_error_info - Error information used in CXL error handling
+ * @pdev: PCI device with CXL error
+ * @dev: CXL device with error. From CXL topology using ACPI/platform discovery
+ * @ras_base: Mapped address of CXL RAS registers
+ * @severity: CXL AER/RAS severity: AER_CORRECTABLE, AER_FATAL, AER_NONFATAL
+ */
+struct cxl_prot_error_info {
+	struct pci_dev *pdev;
+	struct device *dev;
+	void __iomem *ras_base;
+	int severity;
+};
+
+struct cxl_prot_err_work_data {
+	struct cxl_prot_error_info err_info;
+};
+
 #if defined(CONFIG_PCIEAER)
 int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
 int pcie_aer_is_native(struct pci_dev *dev);
@@ -56,6 +75,24 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
 static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
 #endif
 
+#if defined(CONFIG_PCIEAER_CXL)
+int cxl_register_prot_err_work(struct work_struct *work,
+			       int (*_cxl_create_proto_err_info)(struct pci_dev*, int,
+								 struct cxl_prot_error_info*));
+int cxl_unregister_prot_err_work(void);
+int cxl_prot_err_kfifo_get(struct cxl_prot_err_work_data *wd);
+#else
+static inline int
+cxl_register_prot_err_work(struct work_struct *work,
+			   int (*_cxl_create_proto_err_info)(struct pci_dev*, int,
+							     struct cxl_prot_error_info*))
+{
+	return 0;
+}
+static inline int cxl_unregister_prot_err_work(void) { return 0; }
+static inline int cxl_prot_err_kfifo_get(struct cxl_prot_err_work_data *wd) { return 0; }
+#endif
+
 void pci_print_aer(struct pci_dev *dev, int aer_severity,
 		    struct aer_capability_regs *aer);
 int cper_severity_to_aer(int cper_severity);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors
  2025-03-27  1:47 ` [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors Terry Bowman
@ 2025-03-27 17:08   ` Bjorn Helgaas
  2025-03-27 18:12     ` Bowman, Terry
  2025-03-28 17:01   ` Ira Weiny
                     ` (4 subsequent siblings)
  5 siblings, 1 reply; 76+ messages in thread
From: Bjorn Helgaas @ 2025-03-27 17:08 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Wed, Mar 26, 2025 at 08:47:04PM -0500, Terry Bowman wrote:
> CXL error handling will soon be moved from the AER driver into the CXL
> driver. This requires a notification mechanism for the AER driver to share
> the AER interrupt details with the CXL driver. The notification is required
> for the CXL drivers to then handle CXL RAS errors.
> 
> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
> driver will be the sole kfifo producer adding work. The cxl_core will be
> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
> 
> Add CXL work queue handler registration functions in the AER driver. Export
> the functions so the CXL driver can call them. Implement the registration
> functions for the CXL driver to assign or clear the work handler function.
> 
> Create a work queue handler function, cxl_prot_err_work_fn(), as a stub for
> now. The CXL-specific handling will be added in a future patch.
> 
> Introduce 'struct cxl_prot_error_info'. This structure caches CXL error
> details used in completing error handling. This avoids duplicating some
> function calls and allows the error to be treated generically when
> possible.
> ...

> +++ b/drivers/cxl/core/ras.c
> @@ -5,6 +5,7 @@
>  #include <linux/aer.h>
>  #include <cxl/event.h>
>  #include <cxlmem.h>
> +#include <cxlpci.h>
>  #include "trace.h"
>  
>  static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
> @@ -107,13 +108,64 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
>  }
>  static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
>  
> +int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
> +			     struct cxl_prot_error_info *err_info)
> +{
> +	struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(_pdev);
> +	struct cxl_dev_state *cxlds;
> +
> +	if (!pdev || !err_info) {
> +		pr_warn_once("Error: parameter is NULL");
> +		return -ENODEV;

This is CXL code, so your call, but I'm always skeptical about testing
for NULL and basically ignoring a code defect that got us here with a
NULL pointer.  I would rather take the NULL pointer dereference fault
and force a fix in the caller.

> +	}
> +
> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END)) {
> +		pci_warn_once(pdev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));
> +		return -ENODEV;

Similar.  A pci_warn_once() here seems like a debugging aid during
development, not necessarily a production kind of thing.

Thanks for printing the type.  I would use "%#x" to make it clear that
it's hex.  There are about 1900 %X uses compared with 33K
%x uses, but maybe you have a reason to capitalize it?
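
For reference, a small userspace sketch of how the three conversions differ.
It uses snprintf() as a stand-in for the kernel's printk formatting, which is
an assumption made for illustration, and fmt_type() is a hypothetical helper,
not something from the patch.

```c
#include <stdio.h>

/* Hypothetical helper: format the same value with %X, %x and %#x. */
static void fmt_type(unsigned int type, char *upper, char *lower, char *alt,
		     size_t len)
{
	snprintf(upper, len, "%X", type);	/* bare uppercase hex, as in the patch */
	snprintf(lower, len, "%x", type);	/* bare lowercase hex */
	snprintf(alt, len, "%#x", type);	/* "0x" prefix for non-zero values */
}
```

With a value of 0xa the three buffers hold "A", "a" and "0xa"; only the last
one is unambiguous as hex when it appears in a log line.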

> +	}
> +
> +	cxlds = pci_get_drvdata(pdev);
> +	struct device *dev __free(put_device) = get_device(&cxlds->cxlmd->dev);
> +
> +	if (!dev)
> +		return -ENODEV;
> +
> +	*err_info = (struct cxl_prot_error_info){ 0 };

Neat, I hadn't seen this idiom.
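
For anyone else who has not seen it: assigning a compound literal resets every
member of the struct in one statement. A minimal sketch, assuming a simplified
stand-in struct (struct err_info and err_info_init() are hypothetical names,
and the pointer types are reduced to void *):

```c
/* Hypothetical stand-in for the patch's struct; field types simplified. */
struct err_info {
	void *pdev;
	void *dev;
	void *ras_base;
	int severity;
};

/* Reset every member, then fill in only the ones that are known. */
static void err_info_init(struct err_info *info, int severity)
{
	*info = (struct err_info){ 0 };	/* all members zeroed / NULL */
	info->severity = severity;
}
```

Compared with memset(), the compiler checks the type on both sides of the
assignment, and members not named in the initializer are zero-initialized.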

> +	err_info->ras_base = cxlds->regs.ras;
> +	err_info->severity = severity;
> +	err_info->pdev = pdev;
> +	err_info->dev = dev;
> +
> +	return 0;
> +}
> +
> +struct work_struct cxl_prot_err_work;
> +
>  int cxl_ras_init(void)
>  {
> -	return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> +	int rc;
> +
> +	rc = cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> +	if (rc) {
> +		pr_err("Failed to register CPER kfifo with AER driver");
> +		return rc;
> +	}
> +
> +	rc = cxl_register_prot_err_work(&cxl_prot_err_work, cxl_create_prot_err_info);
> +	if (rc) {
> +		pr_err("Failed to register kfifo with AER driver");
> +		return rc;
> +	}
> +
> +	return rc;
>  }
>  
>  void cxl_ras_exit(void)
>  {
>  	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
>  	cancel_work_sync(&cxl_cper_prot_err_work);
> +
> +	cxl_unregister_prot_err_work();
> +	cancel_work_sync(&cxl_prot_err_work);
>  }
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 54e219b0049e..92d72c0423ab 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -4,6 +4,7 @@
>  #define __CXL_PCI_H__
>  #include <linux/pci.h>
>  #include "cxl.h"
> +#include "linux/aer.h"
>  
>  #define CXL_MEMORY_PROGIF	0x10
>  
> @@ -135,4 +136,6 @@ void read_cdat_data(struct cxl_port *port);
>  void cxl_cor_error_detected(struct pci_dev *pdev);
>  pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>  				    pci_channel_state_t state);
> +int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
> +			     struct cxl_prot_error_info *err_info);

What does the "_" in "_pdev" signify?  Looks unnecessarily different
than the decls above.

>  #endif /* __CXL_PCI_H__ */
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 83f2069f111e..46123b70f496 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -110,6 +110,16 @@ struct aer_stats {
>  static int pcie_aer_disable;
>  static pci_ers_result_t aer_root_reset(struct pci_dev *dev);
>  
> +#if defined(CONFIG_PCIEAER_CXL)
> +#define CXL_ERROR_SOURCES_MAX          128
> +static DEFINE_KFIFO(cxl_prot_err_fifo, struct cxl_prot_err_work_data,
> +		    CXL_ERROR_SOURCES_MAX);
> +static DEFINE_SPINLOCK(cxl_prot_err_fifo_lock);
> +struct work_struct *cxl_prot_err_work;
> +static int (*cxl_create_prot_err_info)(struct pci_dev*, int severity,
> +				       struct cxl_prot_error_info*);

Space before "*" in the parameters.

> +#endif
> +
>  void pci_no_aer(void)
>  {
>  	pcie_aer_disable = 1;
> @@ -1577,6 +1587,35 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>  	return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
>  }
>  
> +
> +#if defined(CONFIG_PCIEAER_CXL)
> +int cxl_register_prot_err_work(struct work_struct *work,
> +			       int (*_cxl_create_prot_err_info)(struct pci_dev*, int,
> +								struct cxl_prot_error_info*))

Ditto.  Rewrap to fit in 80 columns, unindent this function pointer
decl to make it fit.  Same below in aer.h.

> +{
> +	guard(spinlock)(&cxl_prot_err_fifo_lock);
> +	cxl_prot_err_work = work;
> +	cxl_create_prot_err_info = _cxl_create_prot_err_info;
> +	return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_register_prot_err_work, "CXL");
> +
> +int cxl_unregister_prot_err_work(void)
> +{
> +	guard(spinlock)(&cxl_prot_err_fifo_lock);
> +	cxl_prot_err_work = NULL;
> +	cxl_create_prot_err_info = NULL;
> +	return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_unregister_prot_err_work, "CXL");
> +
> +int cxl_prot_err_kfifo_get(struct cxl_prot_err_work_data *wd)
> +{
> +	return kfifo_get(&cxl_prot_err_fifo, wd);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_prot_err_kfifo_get, "CXL");
> +#endif
> +
>  static struct pcie_port_service_driver aerdriver = {
>  	.name		= "aer",
>  	.port_type	= PCIE_ANY_PORT,
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 947b63091902..761d6f5cd792 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -10,6 +10,7 @@
>  
>  #include <linux/errno.h>
>  #include <linux/types.h>
> +#include <linux/workqueue_types.h>
>  
>  #define AER_NONFATAL			0
>  #define AER_FATAL			1
> @@ -45,6 +46,24 @@ struct aer_capability_regs {
>  	u16 uncor_err_source;
>  };
>  
> +/**
> + * struct cxl_prot_err_info - Error information used in CXL error handling
> + * @pdev: PCI device with CXL error
> + * @dev: CXL device with error. From CXL topology using ACPI/platform discovery
> + * @ras_base: Mapped address of CXL RAS registers
> + * @severity: CXL AER/RAS severity: AER_CORRECTABLE, AER_FATAL, AER_NONFATAL
> + */
> +struct cxl_prot_error_info {
> +	struct pci_dev *pdev;
> +	struct device *dev;
> +	void __iomem *ras_base;
> +	int severity;

What does the "prot" in "cxl_prot_error_info" refer to?

There's basically no error info here other than "severity".

I guess "dev" and "pdev" are separate devices (otherwise you would
just use "&pdev->dev"), but I don't have any intuition about how they
might be related, which is a little disconcerting.

I would have thought that "ras_base" would be a property of "dev" (the
CXL device) and wouldn't need to be separate.

From above, I guess "ras_base" is a property of cxlds, not
cxlds->cxlmd->dev.  Maybe we should be keeping &cxlds here instead and
letting the consumer look up cxlds->cxlmd->dev?

> +};
> +
> +struct cxl_prot_err_work_data {
> +	struct cxl_prot_error_info err_info;
> +};
> +
>  #if defined(CONFIG_PCIEAER)
>  int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
>  int pcie_aer_is_native(struct pci_dev *dev);
> @@ -56,6 +75,24 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
>  static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
>  #endif
>  
> +#if defined(CONFIG_PCIEAER_CXL)
> +int cxl_register_prot_err_work(struct work_struct *work,
> +			       int (*_cxl_create_proto_err_info)(struct pci_dev*, int,
> +								 struct cxl_prot_error_info*));
> +int cxl_unregister_prot_err_work(void);
> +int cxl_prot_err_kfifo_get(struct cxl_prot_err_work_data *wd);
> +#else
> +static inline int
> +cxl_register_prot_err_work(struct work_struct *work,
> +			   int (*_cxl_create_proto_err_info)(struct pci_dev*, int,
> +							     struct cxl_prot_error_info*))
> +{
> +	return 0;
> +}
> +static inline int cxl_unregister_prot_err_work(void) { return 0; }
> +static inline int cxl_prot_err_kfifo_get(struct cxl_prot_err_work_data *wd) { return 0; }
> +#endif
> +
>  void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  		    struct aer_capability_regs *aer);
>  int cper_severity_to_aer(int cper_severity);
> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors
  2025-03-27 17:08   ` Bjorn Helgaas
@ 2025-03-27 18:12     ` Bowman, Terry
  2025-03-28 17:02       ` Bjorn Helgaas
  0 siblings, 1 reply; 76+ messages in thread
From: Bowman, Terry @ 2025-03-27 18:12 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati



On 3/27/2025 12:08 PM, Bjorn Helgaas wrote:
> On Wed, Mar 26, 2025 at 08:47:04PM -0500, Terry Bowman wrote:
>> CXL error handling will soon be moved from the AER driver into the CXL
>> driver. This requires a notification mechanism for the AER driver to share
>> the AER interrupt details with the CXL driver. The notification is required
>> for the CXL drivers to then handle CXL RAS errors.
>>
>> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
>> driver will be the sole kfifo producer adding work. The cxl_core will be
>> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
>>
>> Add CXL work queue handler registration functions in the AER driver. Export
>> the functions so the CXL driver can call them. Implement the registration
>> functions for the CXL driver to assign or clear the work handler function.
>>
>> Create a work queue handler function, cxl_prot_err_work_fn(), as a stub for
>> now. The CXL-specific handling will be added in a future patch.
>>
>> Introduce 'struct cxl_prot_error_info'. This structure caches CXL error
>> details used in completing error handling. This avoids duplicating some
>> function calls and allows the error to be treated generically when
>> possible.
>> ...
>> +++ b/drivers/cxl/core/ras.c
>> @@ -5,6 +5,7 @@
>>  #include <linux/aer.h>
>>  #include <cxl/event.h>
>>  #include <cxlmem.h>
>> +#include <cxlpci.h>
>>  #include "trace.h"
>>  
>>  static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
>> @@ -107,13 +108,64 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
>>  }
>>  static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
>>  
>> +int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
>> +			     struct cxl_prot_error_info *err_info)
>> +{
>> +	struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(_pdev);
>> +	struct cxl_dev_state *cxlds;
>> +
>> +	if (!pdev || !err_info) {
>> +		pr_warn_once("Error: parameter is NULL");
>> +		return -ENODEV;
> This is CXL code, so your call, but I'm always skeptical about testing
> for NULL and basically ignoring a code defect that got us here with a
> NULL pointer.  I would rather take the NULL pointer dereference fault
> and force a fix in the caller.
I tend to add too much parameter validation, especially in new code, and it
often raises the valid question of "how can this happen, and does it need to
be checked?". Some of it borders on paranoia left over from initial
development. Thanks for the feedback here, I will keep this in mind.

>> +	}
>> +
>> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
>> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END)) {
>> +		pci_warn_once(pdev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));
>> +		return -ENODEV;
> Similar.  A pci_warn_once() here seems like a debugging aid during
> development, not necessarily a production kind of thing.
>
> Thanks for printing the type.  I would use "%#x" to make it clear that
> it's hex.  There are about 1900 %X uses compared with 33K
> %x uses, but maybe you have a reason to capitalize it?
Got it "%x". Would you recommend the pci_warn_once() is removed?
>
>> +	}
>> +
>> +	cxlds = pci_get_drvdata(pdev);
>> +	struct device *dev __free(put_device) = get_device(&cxlds->cxlmd->dev);
>> +
>> +	if (!dev)
>> +		return -ENODEV;
>> +
>> +	*err_info = (struct cxl_prot_error_info){ 0 };
> Neat, I hadn't seen this idiom.
>
>> +	err_info->ras_base = cxlds->regs.ras;
>> +	err_info->severity = severity;
>> +	err_info->pdev = pdev;
>> +	err_info->dev = dev;
>> +
>> +	return 0;
>> +}
>> +
>> +struct work_struct cxl_prot_err_work;
>> +
>>  int cxl_ras_init(void)
>>  {
>> -	return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
>> +	int rc;
>> +
>> +	rc = cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
>> +	if (rc) {
>> +		pr_err("Failed to register CPER kfifo with AER driver");
>> +		return rc;
>> +	}
>> +
>> +	rc = cxl_register_prot_err_work(&cxl_prot_err_work, cxl_create_prot_err_info);
>> +	if (rc) {
>> +		pr_err("Failed to register kfifo with AER driver");
>> +		return rc;
>> +	}
>> +
>> +	return rc;
>>  }
>>  
>>  void cxl_ras_exit(void)
>>  {
>>  	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
>>  	cancel_work_sync(&cxl_cper_prot_err_work);
>> +
>> +	cxl_unregister_prot_err_work();
>> +	cancel_work_sync(&cxl_prot_err_work);
>>  }
>> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
>> index 54e219b0049e..92d72c0423ab 100644
>> --- a/drivers/cxl/cxlpci.h
>> +++ b/drivers/cxl/cxlpci.h
>> @@ -4,6 +4,7 @@
>>  #define __CXL_PCI_H__
>>  #include <linux/pci.h>
>>  #include "cxl.h"
>> +#include "linux/aer.h"
>>  
>>  #define CXL_MEMORY_PROGIF	0x10
>>  
>> @@ -135,4 +136,6 @@ void read_cdat_data(struct cxl_port *port);
>>  void cxl_cor_error_detected(struct pci_dev *pdev);
>>  pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>>  				    pci_channel_state_t state);
>> +int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
>> +			     struct cxl_prot_error_info *err_info);
> What does the "_" in "_pdev" signify?  Looks unnecessarily different
> than the decls above.
_pdev is the incoming parameter; the local pdev holds a reference taken from it.
In a previous patchset review Dan asked me to take a reference because much of
this logic runs during error handling and devices can go away. Long story short,
I was using the following throughout where needed:


int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
                             struct cxl_prot_error_info *err_info)
{
        struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(_pdev);
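
The same take-a-reference-on-entry, drop-it-on-exit shape can be sketched in
userspace with the GCC/Clang cleanup attribute, which is what the kernel's
__free() builds on. Everything named here (struct obj, obj_get(), obj_put(),
use_obj()) is hypothetical and only stands in for pci_dev_get()/pci_dev_put().

```c
#include <stddef.h>

/* Hypothetical refcounted object standing in for struct pci_dev. */
struct obj {
	int refcount;
};

static struct obj *obj_get(struct obj *o)
{
	if (o)
		o->refcount++;
	return o;
}

static void obj_put(struct obj *o)
{
	if (o)
		o->refcount--;
}

/* Cleanup handlers receive a pointer to the annotated variable. */
static void obj_put_cleanup(struct obj **op)
{
	obj_put(*op);
}

/* Returns the refcount observed while the scoped reference is held. */
static int use_obj(struct obj *_o)
{
	struct obj *o __attribute__((cleanup(obj_put_cleanup))) = obj_get(_o);

	if (!o)
		return -1;
	return o->refcount;	/* reference is dropped as the function returns */
}
```

Note that in this sketch the reference only lasts for the duration of the
function. A pointer copied out to a longer-lived structure is no longer pinned
once the function returns, so a caller that keeps the pointer would need its
own reference.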

>>  #endif /* __CXL_PCI_H__ */
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 83f2069f111e..46123b70f496 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -110,6 +110,16 @@ struct aer_stats {
>>  static int pcie_aer_disable;
>>  static pci_ers_result_t aer_root_reset(struct pci_dev *dev);
>>  
>> +#if defined(CONFIG_PCIEAER_CXL)
>> +#define CXL_ERROR_SOURCES_MAX          128
>> +static DEFINE_KFIFO(cxl_prot_err_fifo, struct cxl_prot_err_work_data,
>> +		    CXL_ERROR_SOURCES_MAX);
>> +static DEFINE_SPINLOCK(cxl_prot_err_fifo_lock);
>> +struct work_struct *cxl_prot_err_work;
>> +static int (*cxl_create_prot_err_info)(struct pci_dev*, int severity,
>> +				       struct cxl_prot_error_info*);
> Space before "*" in the parameters.
I'm surprised checkpatch didn't catch this. Maybe it's because the parameter
name itself is not present. Thanks!
>> +#endif
>> +
>>  void pci_no_aer(void)
>>  {
>>  	pcie_aer_disable = 1;
>> @@ -1577,6 +1587,35 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>>  	return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
>>  }
>>  
>> +
>> +#if defined(CONFIG_PCIEAER_CXL)
>> +int cxl_register_prot_err_work(struct work_struct *work,
>> +			       int (*_cxl_create_prot_err_info)(struct pci_dev*, int,
>> +								struct cxl_prot_error_info*))
> Ditto.  Rewrap to fit in 80 columns, unindent this function pointer
> decl to make it fit.  Same below in aer.h.
Ok, got it. Without using typedefs, right ?
>> +{
>> +	guard(spinlock)(&cxl_prot_err_fifo_lock);
>> +	cxl_prot_err_work = work;
>> +	cxl_create_prot_err_info = _cxl_create_prot_err_info;
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_register_prot_err_work, "CXL");
>> +
>> +int cxl_unregister_prot_err_work(void)
>> +{
>> +	guard(spinlock)(&cxl_prot_err_fifo_lock);
>> +	cxl_prot_err_work = NULL;
>> +	cxl_create_prot_err_info = NULL;
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_unregister_prot_err_work, "CXL");
>> +
>> +int cxl_prot_err_kfifo_get(struct cxl_prot_err_work_data *wd)
>> +{
>> +	return kfifo_get(&cxl_prot_err_fifo, wd);
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_prot_err_kfifo_get, "CXL");
>> +#endif
>> +
>>  static struct pcie_port_service_driver aerdriver = {
>>  	.name		= "aer",
>>  	.port_type	= PCIE_ANY_PORT,
>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>> index 947b63091902..761d6f5cd792 100644
>> --- a/include/linux/aer.h
>> +++ b/include/linux/aer.h
>> @@ -10,6 +10,7 @@
>>  
>>  #include <linux/errno.h>
>>  #include <linux/types.h>
>> +#include <linux/workqueue_types.h>
>>  
>>  #define AER_NONFATAL			0
>>  #define AER_FATAL			1
>> @@ -45,6 +46,24 @@ struct aer_capability_regs {
>>  	u16 uncor_err_source;
>>  };
>>  
>> +/**
>> + * struct cxl_prot_err_info - Error information used in CXL error handling
>> + * @pdev: PCI device with CXL error
>> + * @dev: CXL device with error. From CXL topology using ACPI/platform discovery
>> + * @ras_base: Mapped address of CXL RAS registers
>> + * @severity: CXL AER/RAS severity: AER_CORRECTABLE, AER_FATAL, AER_NONFATAL
>> + */
>> +struct cxl_prot_error_info {
>> +	struct pci_dev *pdev;
>> +	struct device *dev;
>> +	void __iomem *ras_base;
>> +	int severity;
> What does the "prot" in "cxl_prot_error_info" refer to?
Protocol. As in CXL Protocol Error Information. I suppose it needs to be renamed
if it wasn't obvious.
>
> There's basically no error info here other than "severity".
Correct. It's more accurately "CXL Protocol Error Context" but I didn't
want to re-use 'context' because 'context' is used for thread/process
statefulness. Also, I followed the existing CPER parallel work that uses
a similar kfifo etc. Thoughts on rename?

> I guess "dev" and "pdev" are separate devices (otherwise you would
> just use "&pdev->dev"), but I don't have any intuition about how they
> might be related, which is a little disconcerting.
"pdev" represents a PCIe device: RP, USP, DSP, or EP. "dev" is the same
device as "pdev", but as found in the CXL topology. "dev" is discovered through
ACPI/platform enumeration and is connected to other CXL "devs" by upstream
and downstream links. Moving back and forth between "pdev" and its CXL "dev"
requires a search that depends on the device type and on where the search starts.

BTW, the CXL "dev" devices discussed here are the underlying devices for
'struct cxl_port', 'struct cxl_dport', and CXL upstream ports.

The 'struct cxl_prot_error_info' could possibly be removed so that only a
'pdev' or 'dev' and the AER severity are cached. When I started implementing
the redesign I found that caching this information made the implementation
simpler. This could be revisited. But there are 2 caveats to consider:
1. Removing the cached data will require more SW calls for "pdev"->"dev"
   conversions, converting "dev" to a CXL port, etc.
2. The AER severity will need to be saved somewhere, at a minimum.
> I would have thought that "ras_base" would be a property of "dev" (the
> CXL device) and wouldn't need to be separate.
"ras_base" is a common member of the CXL Port, CXL Downstream Port, CXL Upstream Port,
and CXL EP. If one wants the "ras_base" for a given CXL "dev" then the "dev" must be
converted to CXL Port, Downstream Port, or Upstream Port.
> From above, I guess "ras_base" is a property of cxlds, not
> cxlds->cxlmd->dev.  Maybe we should be keeping &cxlds here instead and
> letting the consumer look up cxlds->cxlmd->dev?
Yes, at this "point" in the patchset. It is updated later when support is added
for CXL Port devices.

Terry
>> +};
>> +
>> +struct cxl_prot_err_work_data {
>> +	struct cxl_prot_error_info err_info;
>> +};
>> +
>>  #if defined(CONFIG_PCIEAER)
>>  int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
>>  int pcie_aer_is_native(struct pci_dev *dev);
>> @@ -56,6 +75,24 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
>>  static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
>>  #endif
>>  
>> +#if defined(CONFIG_PCIEAER_CXL)
>> +int cxl_register_prot_err_work(struct work_struct *work,
>> +			       int (*_cxl_create_proto_err_info)(struct pci_dev*, int,
>> +								 struct cxl_prot_error_info*));
>> +int cxl_unregister_prot_err_work(void);
>> +int cxl_prot_err_kfifo_get(struct cxl_prot_err_work_data *wd);
>> +#else
>> +static inline int
>> +cxl_register_prot_err_work(struct work_struct *work,
>> +			   int (*_cxl_create_proto_err_info)(struct pci_dev*, int,
>> +							     struct cxl_prot_error_info*))
>> +{
>> +	return 0;
>> +}
>> +static inline int cxl_unregister_prot_err_work(void) { return 0; }
>> +static inline int cxl_prot_err_kfifo_get(struct cxl_prot_err_work_data *wd) { return 0; }
>> +#endif
>> +
>>  void pci_print_aer(struct pci_dev *dev, int aer_severity,
>>  		    struct aer_capability_regs *aer);
>>  int cper_severity_to_aer(int cper_severity);
>> -- 
>> 2.34.1
>>


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors
  2025-03-27 18:12     ` Bowman, Terry
@ 2025-03-28 17:02       ` Bjorn Helgaas
  2025-03-28 17:36         ` Bowman, Terry
  0 siblings, 1 reply; 76+ messages in thread
From: Bjorn Helgaas @ 2025-03-28 17:02 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

What does this series apply to?  I default to the current -rc1
(v6.14-rc1), but this doesn't apply there, and I don't have the
base-commit: aae0594a7053c60b82621136257c8b648c67b512 mentioned in the
cover letter.

Sometimes things make more sense when I can see everything as applied.

On Thu, Mar 27, 2025 at 01:12:30PM -0500, Bowman, Terry wrote:
> On 3/27/2025 12:08 PM, Bjorn Helgaas wrote:
> > On Wed, Mar 26, 2025 at 08:47:04PM -0500, Terry Bowman wrote:
> >> CXL error handling will soon be moved from the AER driver into the CXL
> >> driver. This requires a notification mechanism for the AER driver to share
> >> the AER interrupt details with the CXL driver. The notification is required
> >> for the CXL drivers to then handle CXL RAS errors.
> >>
> >> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
> >> driver will be the sole kfifo producer adding work. The cxl_core will be
> >> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
> >>
> >> Add CXL work queue handler registration functions in the AER driver. Export
> >> the functions so the CXL driver can call them. Implement the registration
> >> functions for the CXL driver to assign or clear the work handler function.
> >>
> >> Create a work queue handler function, cxl_prot_err_work_fn(), as a stub for
> >> now. The CXL-specific handling will be added in a future patch.
> >>
> >> Introduce 'struct cxl_prot_error_info'. This structure caches CXL error
> >> details used in completing error handling. This avoids duplicating some
> >> function calls and allows the error to be treated generically when
> >> possible.

> >> +int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
> >> +			     struct cxl_prot_error_info *err_info)
> >> +{
> ...

> >> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
> >> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END)) {
> >> +		pci_warn_once(pdev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));
> >> +		return -ENODEV;
> >
> > Similar.  A pci_warn_once() here seems like a debugging aid during
> > development, not necessarily a production kind of thing.
> >
> > Thanks for printing the type.  I would use "%#x" to make it clear that
> > it's hex.  There are about 1900 %X uses compared with 33K
> > %x uses, but maybe you have a reason to capitalize it?
>
> Got it "%x". Would you recommend the pci_warn_once() is removed?

The dependency on pdev being an endpoint is not clear here, so I would
just remove the check altogether or move it to the place that breaks
if pdev is not an endpoint.
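
On the format specifier mentioned above, a small userspace sketch of how "%X", "%x" and "%#x" differ may help. The helper name format_type() is made up for illustration, and 0xa is just an example value chosen because it contains a hex letter:

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: format the same value three ways so the
 * difference between the specifiers is visible side by side.
 */
void format_type(unsigned int type, char *upper, char *lower,
		 char *prefixed, size_t len)
{
	assert(upper && lower && prefixed);

	snprintf(upper, len, "%X", type);	/* uppercase, no prefix */
	snprintf(lower, len, "%x", type);	/* lowercase, no prefix */
	snprintf(prefixed, len, "%#x", type);	/* lowercase with "0x" */
}
```

Only the "%#x" form makes it obvious in a log line that the number is hexadecimal. Note that "%#x" prints a zero value as plain "0" without the prefix.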

> >> +#if defined(CONFIG_PCIEAER_CXL)
> >> +int cxl_register_prot_err_work(struct work_struct *work,
> >> +			       int (*_cxl_create_prot_err_info)(struct pci_dev*, int,
> >> +								struct cxl_prot_error_info*))
> >
> > Ditto.  Rewrap to fit in 80 columns, unindent this function
> > pointer decl to make it fit.  Same below in aer.h.
>
> Ok, got it. Without using typedefs, right ?

A typedef would be fine with me.
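
For illustration only, a typedef could shorten the prototype roughly as sketched below. The typedef name cxl_create_prot_err_info_fn and the example callback are assumptions, not something from the patch, and the spinlock guard from the original function is left out so the sketch builds in userspace:

```c
#include <assert.h>
#include <stddef.h>

/* Forward declarations standing in for the real kernel types. */
struct pci_dev;
struct work_struct;
struct cxl_prot_error_info;

/* Hypothetical typedef name, not taken from the patch. */
typedef int (*cxl_create_prot_err_info_fn)(struct pci_dev *pdev, int severity,
					   struct cxl_prot_error_info *err_info);

static struct work_struct *registered_work;
static cxl_create_prot_err_info_fn registered_create_fn;

/* With the typedef the prototype fits in 80 columns without odd wrapping. */
int cxl_register_prot_err_work(struct work_struct *work,
			       cxl_create_prot_err_info_fn create_fn)
{
	registered_work = work;
	registered_create_fn = create_fn;
	return 0;
}

/* Example callback matching the typedef; it only echoes the severity. */
int example_create_info(struct pci_dev *pdev, int severity,
			struct cxl_prot_error_info *err_info)
{
	(void)pdev;
	(void)err_info;
	return severity;
}
```

The same typedef could then be reused for the static function pointer in aer.c and for both prototypes in aer.h, so the signature is spelled out in one place.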

> >> +struct cxl_prot_error_info {
> >> +	struct pci_dev *pdev;
> >> +	struct device *dev;
> >> +	void __iomem *ras_base;
> >> +	int severity;
> >
> > What does the "prot" in "cxl_prot_error_info" refer to?
>
> Protocol. As in CXL Protocol Error Information. I suppose it needs
> to be renamed if it wasn't obvious.

Unless there are CXL non-protocol errors that need to be
distinguished, I would just omit "prot" altogether.

> > There's basically no error info here other than "severity".
>
> Correct. It's more accurately "CXL Protocol Error Context" but I didn't
> want to re-use 'context' because 'context' is used for thread/process
> statefulness. Also, I followed the existing CPER parallel work that uses
> a similar kfifo etc. Thoughts on rename?

What's the name of the corresponding CPER struct?

> > I guess "dev" and "pdev" are separate devices (otherwise you would
> > just use "&pdev->dev"), but I don't have any intuition about how they
> > might be related, which is a little disconcerting.
>
> "pdev" represents a PCIe device: RP, USP, DSP, or EP.  "dev" is the
> same device as "pdev" but "dev" is found in CXL topology. "dev" is
> discovered through ACPI/platform enumeration and interconnected with
> other CXL "devs" using upstream and downstream links. Moving back
> and forth between "pdev" and its CXL "dev" requires a search unique
> to the device type and point beginning the search.

It seems weird to me to have two device pointers here.  Seems like we
should use a single pointer to identify the device, and if we need to
get from PCI to CXL or vice versa, there should be a pointer somewhere
so we don't have to search all the time.

> > I would have thought that "ras_base" would be a property of "dev"
> > (the CXL device) and wouldn't need to be separate.
>
> "ras_base" is a common member of the CXL Port, CXL Downstream Port,
> CXL Upstream Port, and CXL EP. If one wants the "ras_base" for a
> given CXL "dev" then the "dev" must be converted to CXL Port,
> Downstream Port, or Upstream Port.

Passing around ras_base seems dodgy to me.  I think it's better to
pass around a real entity like a pci_dev or cxl_port or cxl_dport or
whatever.  Code that needs to deal with ras_base presumably should
know about the internals of the device ras_base belongs to.
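
A rough userspace sketch of that idea, using toy structures only. None of the toy_* names exist in the kernel, and the back-pointer from the PCI side to the CXL side is purely hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the kernel's real structures. */
struct toy_cxl_port {
	void *ras_base;			/* owned by the CXL-side object */
};

struct toy_pci_dev {
	struct toy_cxl_port *cxl_port;	/* hypothetical back-pointer */
};

struct toy_err_ctx {
	struct toy_pci_dev *pdev;	/* the only device reference carried */
	int severity;
};

/* The handler, which knows the device internals, looks up ras_base itself. */
void *toy_err_ras_base(const struct toy_err_ctx *ctx)
{
	if (!ctx || !ctx->pdev || !ctx->pdev->cxl_port)
		return NULL;
	return ctx->pdev->cxl_port->ras_base;
}
```

The error record then carries one device pointer plus the severity, and the register base is derived at the point of use instead of being cached in the record.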

Bjorn

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors
  2025-03-28 17:02       ` Bjorn Helgaas
@ 2025-03-28 17:36         ` Bowman, Terry
  0 siblings, 0 replies; 76+ messages in thread
From: Bowman, Terry @ 2025-03-28 17:36 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati



On 3/28/2025 12:02 PM, Bjorn Helgaas wrote:
> What does this series apply to?  I default to the current -rc1
> (v6.14-rc1), but this doesn't apply there, and I don't have the
> base-commit: aae0594a7053c60b82621136257c8b648c67b512 mentioned in the
> cover letter.
>
> Sometimes things make more sense when I can see everything as applied.

This base commit is from cxl/next and can be found here:
https://web.git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl.git/log/?h=next

Terry

> On Thu, Mar 27, 2025 at 01:12:30PM -0500, Bowman, Terry wrote:
>> On 3/27/2025 12:08 PM, Bjorn Helgaas wrote:
>>> On Wed, Mar 26, 2025 at 08:47:04PM -0500, Terry Bowman wrote:
>>>> CXL error handling will soon be moved from the AER driver into the CXL
>>>> driver. This requires a notification mechanism for the AER driver to share
>>>> the AER interrupt details with CXL driver. The notification is required for
>>>> the CXL drivers to then handle CXL RAS errors.
>>>>
>>>> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
>>>> driver will be the sole kfifo producer adding work. The cxl_core will be
>>>> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
>>>>
>>>> Add CXL work queue handler registration functions in the AER driver. Export
>>>> the functions allowing CXL driver to access. Implement the registration
>>>> functions for the CXL driver to assign or clear the work handler function.
>>>>
>>>> Create a work queue handler function, cxl_prot_err_work_fn(), as a stub for
>>>> now. The CXL specific handling will be added in future patch.
>>>>
>>>> Introduce 'struct cxl_prot_err_info'. This structure caches CXL error
>>>> details used in completing error handling. This avoid duplicating some
>>>> function calls and allows the error to be treated generically when
>>>> possible.
>>>> +int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
>>>> +			     struct cxl_prot_error_info *err_info)
>>>> +{
>> ...
>>>> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
>>>> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END)) {
>>>> +		pci_warn_once(pdev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));
>>>> +		return -ENODEV;
>>> Similar.  A pci_warn_once() here seems like a debugging aid during
>>> development, not necessarily a production kind of thing.
>>>
>>> Thanks for printing the type.  I would use "%#x" to make it clear that
>>> it's hex.  There are about 1900 %X uses compared with 33K
>>> %x uses, but maybe you have a reason to capitalize it?
>> Got it "%x". Would you recommend the pci_warn_once() is removed?
> The dependency on pdev being an endpoint is not clear here, so I would
> just remove the check altogether or move it to the place that breaks
> if pdev is not an endpoint.
>
>>>> +#if defined(CONFIG_PCIEAER_CXL)
>>>> +int cxl_register_prot_err_work(struct work_struct *work,
>>>> +			       int (*_cxl_create_prot_err_info)(struct pci_dev*, int,
>>>> +								struct cxl_prot_error_info*))
>>> Ditto.  Rewrap to fit in 80 columns, unindent this function
>>> pointer decl to make it fit.  Same below in aer.h.
>> Ok, got it. Without using typedefs, right ?
> A typedef would be fine with me.
>
>>>> +struct cxl_prot_error_info {
>>>> +	struct pci_dev *pdev;
>>>> +	struct device *dev;
>>>> +	void __iomem *ras_base;
>>>> +	int severity;
>>> What does the "prot" in "cxl_prot_error_info" refer to?
>> Protocol. As in CXL Protocol Error Information. I suppose it needs
>> to be renamed if it wasn't obvious.
> Unless there are CXL non-protocol errors that need to be
> distinguished, I would just omit "prot" altogether.
>
>>> There's basically no error info here other than "severity".
>> Correct. It's more accurately "CXL Protocol Error Context" but I didn't
>> want to re-use 'context' because 'context' is used for thread/process
>> statefulness. Also, I followed the existing CPER parallel work that uses
>> a similar kfifo etc. Thoughts on rename?
> What's the name of the corresponding CPER struct?
>
>>> I guess "dev" and "pdev" are separate devices (otherwise you would
>>> just use "&pdev->dev"), but I don't have any intuition about how they
>>> might be related, which is a little disconcerting.
>> "pdev" represents a PCIe device: RP, USP, DSP, or EP.  "dev" is the
>> same device as "pdev" but "dev" is found in CXL topology. "dev" is
>> discovered through ACPI/platform enumeration and interconnected with
>> other CXL "devs" using upstream and downstream links. Moving back
>> and forth between "pdev" and its CXL "dev" requires a search unique
>> to the device type and point beginning the search.
> It seems weird to me to have two device pointers here.  Seems like we
> should use a single pointer to identify the device, and if we need to
> get from PCI to CXL or vice versa, there should be a pointer somewhere
> so we don't have to search all the time.
>
>>> I would have thought that "ras_base" would be a property of "dev"
>>> (the CXL device) and wouldn't need to be separate.
>> "ras_base" is a common member of the CXL Port, CXL Downstream Port,
>> CXL Upstream Port, and CXL EP. If one wants the "ras_base" for a
>> given CXL "dev" then the "dev" must be converted to CXL Port,
>> Downstream Port, or Upstream Port.
> Passing around ras_base seems dodgy to me.  I think it's better to
> pass around a real entity like a pci_dev or cxl_port or cxl_dport or
> whatever.  Code that needs to deal with ras_base presumably should
> know about the internals of the device ras_base belongs to.
>
> Bjorn


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors
  2025-03-27  1:47 ` [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors Terry Bowman
  2025-03-27 17:08   ` Bjorn Helgaas
@ 2025-03-28 17:01   ` Ira Weiny
  2025-04-07 13:43     ` Bowman, Terry
  2025-04-04 16:53   ` Jonathan Cameron
                     ` (3 subsequent siblings)
  5 siblings, 1 reply; 76+ messages in thread
From: Ira Weiny @ 2025-03-28 17:01 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

Terry Bowman wrote:
> CXL error handling will soon be moved from the AER driver into the CXL
> driver. This requires a notification mechanism for the AER driver to share
> the AER interrupt details with CXL driver. The notification is required for
> the CXL drivers to then handle CXL RAS errors.
> 
> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
> driver will be the sole kfifo producer adding work. The cxl_core will be
> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
> 
> Add CXL work queue handler registration functions in the AER driver. Export
> the functions allowing CXL driver to access. Implement the registration
> functions for the CXL driver to assign or clear the work handler function.
> 
> Create a work queue handler function, cxl_prot_err_work_fn(), as a stub for
> now. The CXL specific handling will be added in future patch.

This part of the message is no longer valid.

> 
> Introduce 'struct cxl_prot_err_info'. This structure caches CXL error

                    cxl_prot_error_info

> details used in completing error handling. This avoid duplicating some
> function calls and allows the error to be treated generically when
> possible.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/core/ras.c | 54 +++++++++++++++++++++++++++++++++++++++++-
>  drivers/cxl/cxlpci.h   |  3 +++
>  drivers/pci/pcie/aer.c | 39 ++++++++++++++++++++++++++++++
>  include/linux/aer.h    | 37 +++++++++++++++++++++++++++++
>  4 files changed, 132 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 485a831695c7..ecb60a5962de 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -5,6 +5,7 @@
>  #include <linux/aer.h>
>  #include <cxl/event.h>
>  #include <cxlmem.h>
> +#include <cxlpci.h>
>  #include "trace.h"
>  
>  static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
> @@ -107,13 +108,64 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
>  }
>  static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
>  
> +int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
> +			     struct cxl_prot_error_info *err_info)
> +{
> +	struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(_pdev);
> +	struct cxl_dev_state *cxlds;
> +
> +	if (!pdev || !err_info) {
> +		pr_warn_once("Error: parameter is NULL");
> +		return -ENODEV;
> +	}
> +
> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END)) {
> +		pci_warn_once(pdev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));
> +		return -ENODEV;
> +	}
> +
> +	cxlds = pci_get_drvdata(pdev);
> +	struct device *dev __free(put_device) = get_device(&cxlds->cxlmd->dev);
> +
> +	if (!dev)
> +		return -ENODEV;
> +
> +	*err_info = (struct cxl_prot_error_info){ 0 };
> +	err_info->ras_base = cxlds->regs.ras;
> +	err_info->severity = severity;
> +	err_info->pdev = pdev;
> +	err_info->dev = dev;

How are pdev and dev guaranteed to be valid after the put_device() and
pci_dev_put() free functions are called on return?
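
To make the concern concrete, here is a minimal userspace sketch. It relies on the compiler cleanup attribute, which is what __free() is built on; the toy_* names and the refcount model are made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Toy refcounted object standing in for struct pci_dev / struct device. */
struct toy_dev {
	int refcount;
};

struct toy_dev *toy_get(struct toy_dev *d)
{
	if (d)
		d->refcount++;
	return d;
}

void toy_put(struct toy_dev *d)
{
	if (d) {
		assert(d->refcount > 0);
		d->refcount--;
	}
}

/* Cleanup callback: receives a pointer to the annotated variable. */
static void toy_put_cleanup(struct toy_dev **p)
{
	toy_put(*p);
}
#define toy_free __attribute__((cleanup(toy_put_cleanup)))

/* Mirrors the patch: the reference is dropped on return, pointer is stashed. */
void stash_borrowed(struct toy_dev *in, struct toy_dev **out)
{
	struct toy_dev *d toy_free = toy_get(in);

	*out = d;	/* no reference backs *out once we return */
}

/* Alternative: hand the reference to the caller, who must toy_put() later. */
void stash_owned(struct toy_dev *in, struct toy_dev **out)
{
	struct toy_dev *d toy_free = toy_get(in);

	*out = d;
	d = NULL;	/* cleanup now sees NULL, so the reference is kept */
}
```

In the first variant the count is back to its starting value when the function returns, so nothing pins the object for whoever later reads the stashed pointer. The second variant is similar in spirit to using no_free_ptr() in the kernel, at the cost of the consumer having to drop the reference.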

> +
> +	return 0;
> +}
> +
> +struct work_struct cxl_prot_err_work;

Why is this not declared with DECLARE_WORK()?

Without that I don't know what cancel_work_sync() will do with this in the
!CONFIG_PCIEAER_CXL case.

Ah... ok looks like that comes in 5/16.  :-/

I got sidetracked looking at the rest of the series after I found this
change in 5/16.

I'll send these questions out but I'm thinking Bjorn is correct that
passing cxlds or something might work better than stashing pdev/dev.  But
even then I'm not sure about the object lifetimes.

Ira

> +
>  int cxl_ras_init(void)
>  {
> -	return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> +	int rc;
> +
> +	rc = cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> +	if (rc) {
> +		pr_err("Failed to register CPER kfifo with AER driver");
> +		return rc;
> +	}
> +
> +	rc = cxl_register_prot_err_work(&cxl_prot_err_work, cxl_create_prot_err_info);
> +	if (rc) {
> +		pr_err("Failed to register kfifo with AER driver");
> +		return rc;
> +	}
> +
> +	return rc;
>  }
>  
>  void cxl_ras_exit(void)
>  {
>  	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
>  	cancel_work_sync(&cxl_cper_prot_err_work);
> +
> +	cxl_unregister_prot_err_work();
> +	cancel_work_sync(&cxl_prot_err_work);
>  }
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 54e219b0049e..92d72c0423ab 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -4,6 +4,7 @@
>  #define __CXL_PCI_H__
>  #include <linux/pci.h>
>  #include "cxl.h"
> +#include "linux/aer.h"
>  
>  #define CXL_MEMORY_PROGIF	0x10
>  
> @@ -135,4 +136,6 @@ void read_cdat_data(struct cxl_port *port);
>  void cxl_cor_error_detected(struct pci_dev *pdev);
>  pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>  				    pci_channel_state_t state);
> +int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
> +			     struct cxl_prot_error_info *err_info);
>  #endif /* __CXL_PCI_H__ */
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 83f2069f111e..46123b70f496 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -110,6 +110,16 @@ struct aer_stats {
>  static int pcie_aer_disable;
>  static pci_ers_result_t aer_root_reset(struct pci_dev *dev);
>  
> +#if defined(CONFIG_PCIEAER_CXL)
> +#define CXL_ERROR_SOURCES_MAX          128
> +static DEFINE_KFIFO(cxl_prot_err_fifo, struct cxl_prot_err_work_data,
> +		    CXL_ERROR_SOURCES_MAX);
> +static DEFINE_SPINLOCK(cxl_prot_err_fifo_lock);
> +struct work_struct *cxl_prot_err_work;
> +static int (*cxl_create_prot_err_info)(struct pci_dev*, int severity,
> +				       struct cxl_prot_error_info*);
> +#endif
> +
>  void pci_no_aer(void)
>  {
>  	pcie_aer_disable = 1;
> @@ -1577,6 +1587,35 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>  	return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
>  }
>  
> +
> +#if defined(CONFIG_PCIEAER_CXL)
> +int cxl_register_prot_err_work(struct work_struct *work,
> +			       int (*_cxl_create_prot_err_info)(struct pci_dev*, int,
> +								struct cxl_prot_error_info*))
> +{
> +	guard(spinlock)(&cxl_prot_err_fifo_lock);
> +	cxl_prot_err_work = work;
> +	cxl_create_prot_err_info = _cxl_create_prot_err_info;
> +	return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_register_prot_err_work, "CXL");
> +
> +int cxl_unregister_prot_err_work(void)
> +{
> +	guard(spinlock)(&cxl_prot_err_fifo_lock);
> +	cxl_prot_err_work = NULL;
> +	cxl_create_prot_err_info = NULL;
> +	return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_unregister_prot_err_work, "CXL");
> +
> +int cxl_prot_err_kfifo_get(struct cxl_prot_err_work_data *wd)
> +{
> +	return kfifo_get(&cxl_prot_err_fifo, wd);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_prot_err_kfifo_get, "CXL");
> +#endif
> +
>  static struct pcie_port_service_driver aerdriver = {
>  	.name		= "aer",
>  	.port_type	= PCIE_ANY_PORT,
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 947b63091902..761d6f5cd792 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -10,6 +10,7 @@
>  
>  #include <linux/errno.h>
>  #include <linux/types.h>
> +#include <linux/workqueue_types.h>
>  
>  #define AER_NONFATAL			0
>  #define AER_FATAL			1
> @@ -45,6 +46,24 @@ struct aer_capability_regs {
>  	u16 uncor_err_source;
>  };
>  
> +/**
> + * struct cxl_prot_err_info - Error information used in CXL error handling
> + * @pdev: PCI device with CXL error
> + * @dev: CXL device with error. From CXL topology using ACPI/platform discovery
> + * @ras_base: Mapped address of CXL RAS registers
> + * @severity: CXL AER/RAS severity: AER_CORRECTABLE, AER_FATAL, AER_NONFATAL
> + */
> +struct cxl_prot_error_info {
> +	struct pci_dev *pdev;
> +	struct device *dev;
> +	void __iomem *ras_base;
> +	int severity;
> +};
> +
> +struct cxl_prot_err_work_data {
> +	struct cxl_prot_error_info err_info;
> +};
> +
>  #if defined(CONFIG_PCIEAER)
>  int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
>  int pcie_aer_is_native(struct pci_dev *dev);
> @@ -56,6 +75,24 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
>  static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
>  #endif
>  
> +#if defined(CONFIG_PCIEAER_CXL)
> +int cxl_register_prot_err_work(struct work_struct *work,
> +			       int (*_cxl_create_proto_err_info)(struct pci_dev*, int,
> +								 struct cxl_prot_error_info*));
> +int cxl_unregister_prot_err_work(void);
> +int cxl_prot_err_kfifo_get(struct cxl_prot_err_work_data *wd);
> +#else
> +static inline int
> +cxl_register_prot_err_work(struct work_struct *work,
> +			   int (*_cxl_create_proto_err_info)(struct pci_dev*, int,
> +							     struct cxl_prot_error_info*))
> +{
> +	return 0;
> +}
> +static inline int cxl_unregister_prot_err_work(void) { return 0; }
> +static inline int cxl_prot_err_kfifo_get(struct cxl_prot_err_work_data *wd) { return 0; }
> +#endif
> +
>  void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  		    struct aer_capability_regs *aer);
>  int cper_severity_to_aer(int cper_severity);
> -- 
> 2.34.1
> 



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors
  2025-03-28 17:01   ` Ira Weiny
@ 2025-04-07 13:43     ` Bowman, Terry
  0 siblings, 0 replies; 76+ messages in thread
From: Bowman, Terry @ 2025-04-07 13:43 UTC (permalink / raw)
  To: Ira Weiny, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
	rrichter, nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
	ming.li, PradeepVineshReddy.Kodamati



On 3/28/2025 12:01 PM, Ira Weiny wrote:
> Terry Bowman wrote:
>> CXL error handling will soon be moved from the AER driver into the CXL
>> driver. This requires a notification mechanism for the AER driver to share
>> the AER interrupt details with CXL driver. The notification is required for
>> the CXL drivers to then handle CXL RAS errors.
>>
>> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
>> driver will be the sole kfifo producer adding work. The cxl_core will be
>> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
>>
>> Add CXL work queue handler registration functions in the AER driver. Export
>> the functions allowing CXL driver to access. Implement the registration
>> functions for the CXL driver to assign or clear the work handler function.
>>
>> Create a work queue handler function, cxl_prot_err_work_fn(), as a stub for
>> now. The CXL specific handling will be added in future patch.
> This part of the message is no longer valid.

Yes, I'll remove that.

>> Introduce 'struct cxl_prot_err_info'. This structure caches CXL error
>                     cxl_prot_error_info
>
>> details used in completing error handling. This avoid duplicating some
>> function calls and allows the error to be treated generically when
>> possible.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>  drivers/cxl/core/ras.c | 54 +++++++++++++++++++++++++++++++++++++++++-
>>  drivers/cxl/cxlpci.h   |  3 +++
>>  drivers/pci/pcie/aer.c | 39 ++++++++++++++++++++++++++++++
>>  include/linux/aer.h    | 37 +++++++++++++++++++++++++++++
>>  4 files changed, 132 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index 485a831695c7..ecb60a5962de 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
>> @@ -5,6 +5,7 @@
>>  #include <linux/aer.h>
>>  #include <cxl/event.h>
>>  #include <cxlmem.h>
>> +#include <cxlpci.h>
>>  #include "trace.h"
>>  
>>  static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
>> @@ -107,13 +108,64 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
>>  }
>>  static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
>>  
>> +int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
>> +			     struct cxl_prot_error_info *err_info)
>> +{
>> +	struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(_pdev);
>> +	struct cxl_dev_state *cxlds;
>> +
>> +	if (!pdev || !err_info) {
>> +		pr_warn_once("Error: parameter is NULL");
>> +		return -ENODEV;
>> +	}
>> +
>> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
>> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END)) {
>> +		pci_warn_once(pdev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));
>> +		return -ENODEV;
>> +	}
>> +
>> +	cxlds = pci_get_drvdata(pdev);
>> +	struct device *dev __free(put_device) = get_device(&cxlds->cxlmd->dev);
>> +
>> +	if (!dev)
>> +		return -ENODEV;
>> +
>> +	*err_info = (struct cxl_prot_error_info){ 0 };
>> +	err_info->ras_base = cxlds->regs.ras;
>> +	err_info->severity = severity;
>> +	err_info->pdev = pdev;
>> +	err_info->dev = dev;
> How are pdev and dev guaranteed to be valid after the put_device() and
> pci_dev_put() free functions are called on return?
>
>> +
>> +	return 0;
>> +}
>> +
>> +struct work_struct cxl_prot_err_work;
> Why is this not declared with DECLARE_WORK()?
>
> Without that I don't know what cancel_work_sync() will do with this in the
> !CONFIG_PCIEAER_CXL case.
>
> Ah... ok looks like that comes in 5/16.  :-/
>
> I got side tracked looking at the rest of the series after I found this
> change in 5/16.
>
> I'll send these questions out but I'm thinking Bjorn is correct that
> passing cxlds or something might work better than stashing pdev/dev.  But
> even then I'm not sure about the object lifetimes.
>
> Ira

The problem is cxlds only works for CXL EPs and doesn't scale to CXL Port
devices. Port devices are not associated with a cxlds.

We should consider adding a cached dereferenced status in 'struct err_info'
as well. This was something I wanted to bring up in the cover letter for
discussion.

Terry
>> +
>>  int cxl_ras_init(void)
>>  {
>> -	return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
>> +	int rc;
>> +
>> +	rc = cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
>> +	if (rc) {
>> +		pr_err("Failed to register CPER kfifo with AER driver");
>> +		return rc;
>> +	}
>> +
>> +	rc = cxl_register_prot_err_work(&cxl_prot_err_work, cxl_create_prot_err_info);
>> +	if (rc) {
>> +		pr_err("Failed to register kfifo with AER driver");
>> +		return rc;
>> +	}
>> +
>> +	return rc;
>>  }
>>  
>>  void cxl_ras_exit(void)
>>  {
>>  	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
>>  	cancel_work_sync(&cxl_cper_prot_err_work);
>> +
>> +	cxl_unregister_prot_err_work();
>> +	cancel_work_sync(&cxl_prot_err_work);
>>  }
>> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
>> index 54e219b0049e..92d72c0423ab 100644
>> --- a/drivers/cxl/cxlpci.h
>> +++ b/drivers/cxl/cxlpci.h
>> @@ -4,6 +4,7 @@
>>  #define __CXL_PCI_H__
>>  #include <linux/pci.h>
>>  #include "cxl.h"
>> +#include "linux/aer.h"
>>  
>>  #define CXL_MEMORY_PROGIF	0x10
>>  
>> @@ -135,4 +136,6 @@ void read_cdat_data(struct cxl_port *port);
>>  void cxl_cor_error_detected(struct pci_dev *pdev);
>>  pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>>  				    pci_channel_state_t state);
>> +int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
>> +			     struct cxl_prot_error_info *err_info);
>>  #endif /* __CXL_PCI_H__ */
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 83f2069f111e..46123b70f496 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -110,6 +110,16 @@ struct aer_stats {
>>  static int pcie_aer_disable;
>>  static pci_ers_result_t aer_root_reset(struct pci_dev *dev);
>>  
>> +#if defined(CONFIG_PCIEAER_CXL)
>> +#define CXL_ERROR_SOURCES_MAX          128
>> +static DEFINE_KFIFO(cxl_prot_err_fifo, struct cxl_prot_err_work_data,
>> +		    CXL_ERROR_SOURCES_MAX);
>> +static DEFINE_SPINLOCK(cxl_prot_err_fifo_lock);
>> +struct work_struct *cxl_prot_err_work;
>> +static int (*cxl_create_prot_err_info)(struct pci_dev*, int severity,
>> +				       struct cxl_prot_error_info*);
>> +#endif
>> +
>>  void pci_no_aer(void)
>>  {
>>  	pcie_aer_disable = 1;
>> @@ -1577,6 +1587,35 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>>  	return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
>>  }
>>  
>> +
>> +#if defined(CONFIG_PCIEAER_CXL)
>> +int cxl_register_prot_err_work(struct work_struct *work,
>> +			       int (*_cxl_create_prot_err_info)(struct pci_dev*, int,
>> +								struct cxl_prot_error_info*))
>> +{
>> +	guard(spinlock)(&cxl_prot_err_fifo_lock);
>> +	cxl_prot_err_work = work;
>> +	cxl_create_prot_err_info = _cxl_create_prot_err_info;
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_register_prot_err_work, "CXL");
>> +
>> +int cxl_unregister_prot_err_work(void)
>> +{
>> +	guard(spinlock)(&cxl_prot_err_fifo_lock);
>> +	cxl_prot_err_work = NULL;
>> +	cxl_create_prot_err_info = NULL;
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_unregister_prot_err_work, "CXL");
>> +
>> +int cxl_prot_err_kfifo_get(struct cxl_prot_err_work_data *wd)
>> +{
>> +	return kfifo_get(&cxl_prot_err_fifo, wd);
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_prot_err_kfifo_get, "CXL");
>> +#endif
>> +
>>  static struct pcie_port_service_driver aerdriver = {
>>  	.name		= "aer",
>>  	.port_type	= PCIE_ANY_PORT,
>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>> index 947b63091902..761d6f5cd792 100644
>> --- a/include/linux/aer.h
>> +++ b/include/linux/aer.h
>> @@ -10,6 +10,7 @@
>>  
>>  #include <linux/errno.h>
>>  #include <linux/types.h>
>> +#include <linux/workqueue_types.h>
>>  
>>  #define AER_NONFATAL			0
>>  #define AER_FATAL			1
>> @@ -45,6 +46,24 @@ struct aer_capability_regs {
>>  	u16 uncor_err_source;
>>  };
>>  
>> +/**
>> + * struct cxl_prot_err_info - Error information used in CXL error handling
>> + * @pdev: PCI device with CXL error
>> + * @dev: CXL device with error. From CXL topology using ACPI/platform discovery
>> + * @ras_base: Mapped address of CXL RAS registers
>> + * @severity: CXL AER/RAS severity: AER_CORRECTABLE, AER_FATAL, AER_NONFATAL
>> + */
>> +struct cxl_prot_error_info {
>> +	struct pci_dev *pdev;
>> +	struct device *dev;
>> +	void __iomem *ras_base;
>> +	int severity;
>> +};
>> +
>> +struct cxl_prot_err_work_data {
>> +	struct cxl_prot_error_info err_info;
>> +};
>> +
>>  #if defined(CONFIG_PCIEAER)
>>  int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
>>  int pcie_aer_is_native(struct pci_dev *dev);
>> @@ -56,6 +75,24 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
>>  static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
>>  #endif
>>  
>> +#if defined(CONFIG_PCIEAER_CXL)
>> +int cxl_register_prot_err_work(struct work_struct *work,
>> +			       int (*_cxl_create_proto_err_info)(struct pci_dev*, int,
>> +								 struct cxl_prot_error_info*));
>> +int cxl_unregister_prot_err_work(void);
>> +int cxl_prot_err_kfifo_get(struct cxl_prot_err_work_data *wd);
>> +#else
>> +static inline int
>> +cxl_register_prot_err_work(struct work_struct *work,
>> +			   int (*_cxl_create_proto_err_info)(struct pci_dev*, int,
>> +							     struct cxl_prot_error_info*))
>> +{
>> +	return 0;
>> +}
>> +static inline int cxl_unregister_prot_err_work(void) { return 0; }
>> +static inline int cxl_prot_err_kfifo_get(struct cxl_prot_err_work_data *wd) { return 0; }
>> +#endif
>> +
>>  void pci_print_aer(struct pci_dev *dev, int aer_severity,
>>  		    struct aer_capability_regs *aer);
>>  int cper_severity_to_aer(int cper_severity);
>> -- 
>> 2.34.1
>>
>


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors
  2025-03-27  1:47 ` [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors Terry Bowman
  2025-03-27 17:08   ` Bjorn Helgaas
  2025-03-28 17:01   ` Ira Weiny
@ 2025-04-04 16:53   ` Jonathan Cameron
  2025-04-23 14:33   ` Jonathan Cameron
                     ` (2 subsequent siblings)
  5 siblings, 0 replies; 76+ messages in thread
From: Jonathan Cameron @ 2025-04-04 16:53 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati


>  int cxl_ras_init(void)
>  {
> -	return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> +	int rc;
> +
> +	rc = cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> +	if (rc) {
> +		pr_err("Failed to register CPER kfifo with AER driver");
> +		return rc;
> +	}
> +
> +	rc = cxl_register_prot_err_work(&cxl_prot_err_work, cxl_create_prot_err_info);
> +	if (rc) {
> +		pr_err("Failed to register kfifo with AER driver");
> +		return rc;
> +	}
> +
> +	return rc;
	return 0;

Good to be explicit if we know we can only get to a return with a particular
value.

>  }
>  
>

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors
  2025-03-27  1:47 ` [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors Terry Bowman
                     ` (2 preceding siblings ...)
  2025-04-04 16:53   ` Jonathan Cameron
@ 2025-04-23 14:33   ` Jonathan Cameron
  2025-04-23 15:04   ` Jonathan Cameron
  2025-04-23 22:12   ` Gregory Price
  5 siblings, 0 replies; 76+ messages in thread
From: Jonathan Cameron @ 2025-04-23 14:33 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati


> +int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
> +			     struct cxl_prot_error_info *err_info)
> +{
> +	struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(_pdev);
> +	struct cxl_dev_state *cxlds;
> +
> +	if (!pdev || !err_info) {
> +		pr_warn_once("Error: parameter is NULL");
> +		return -ENODEV;
> +	}
> +
> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END)) {
> +		pci_warn_once(pdev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));
> +		return -ENODEV;
> +	}
> +
> +	cxlds = pci_get_drvdata(pdev);
> +	struct device *dev __free(put_device) = get_device(&cxlds->cxlmd->dev);
> +
> +	if (!dev)
> +		return -ENODEV;
> +
> +	*err_info = (struct cxl_prot_error_info){ 0 };
> +	err_info->ras_base = cxlds->regs.ras;
> +	err_info->severity = severity;
> +	err_info->pdev = pdev;
> +	err_info->dev = dev;

I missed this before but might as well do...

	*err_info = (struct cxl_prot_error_info) {
		.ras_base = cxlds->regs.ras,
		.severity = severity,
	...
	};
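
For illustration, a small userspace sketch of the two styles (an
assumption-laden model, not the kernel code: the struct below is a
simplified stand-in with void * members, and the function names are
hypothetical). Members not named in a designated initializer are
zero-initialized, so the separate "{ 0 }" assignment becomes
unnecessary:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct cxl_prot_error_info. */
struct prot_error_info {
	void *pdev;
	void *dev;
	void *ras_base;
	int severity;
};

/* Style used in the patch: zero the struct, then assign field by field. */
void fill_stepwise(struct prot_error_info *info, void *pdev, void *dev,
		   void *ras_base, int severity)
{
	*info = (struct prot_error_info){ 0 };
	info->ras_base = ras_base;
	info->severity = severity;
	info->pdev = pdev;
	info->dev = dev;
}

/* Suggested style: one compound literal with designated initializers. */
void fill_designated(struct prot_error_info *info, void *pdev, void *dev,
		     void *ras_base, int severity)
{
	*info = (struct prot_error_info) {
		.ras_base = ras_base,
		.severity = severity,
		.pdev = pdev,
		.dev = dev,
	};
}

/* Only one member is named; all other members end up zero/NULL. */
void fill_partial(struct prot_error_info *info, int severity)
{
	*info = (struct prot_error_info) { .severity = severity };
}
```

Both fill functions leave the struct in the same state; the second
form just does it in a single statement.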

> +
> +	return 0;
> +}


* Re: [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors
  2025-03-27  1:47 ` [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors Terry Bowman
                     ` (3 preceding siblings ...)
  2025-04-23 14:33   ` Jonathan Cameron
@ 2025-04-23 15:04   ` Jonathan Cameron
  2025-04-23 22:12   ` Gregory Price
  5 siblings, 0 replies; 76+ messages in thread
From: Jonathan Cameron @ 2025-04-23 15:04 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Wed, 26 Mar 2025 20:47:04 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> CXL error handling will soon be moved from the AER driver into the CXL
> driver. This requires a notification mechanism for the AER driver to share
> the AER interrupt details with CXL driver. The notification is required for
> the CXL drivers to then handle CXL RAS errors.
> 
> Add a kfifo work queue to be used by the AER driver and CXL driver. The AER
> driver will be the sole kfifo producer adding work. The cxl_core will be
> the sole kfifo consumer removing work. Add the boilerplate kfifo support.
> 
> Add CXL work queue handler registration functions in the AER driver. Export
> the functions allowing CXL driver to access. Implement the registration
> functions for the CXL driver to assign or clear the work handler function.
> 
> Create a work queue handler function, cxl_prot_err_work_fn(), as a stub for
> now. The CXL specific handling will be added in future patch.
> 
> Introduce 'struct cxl_prot_err_info'. This structure caches CXL error
> details used in completing error handling. This avoid duplicating some
> function calls and allows the error to be treated generically when
> possible.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

It's been a while since I messed around with work queues, but
don't we need to initialize the work_struct somewhere?

Probably with DECLARE_WORK(), similar to cxl_cper_prot_err_work.

Having both the pointer and the instance of the work struct called
cxl_prot_err_work is confusing me...
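
To make the concern concrete, here is a userspace model (everything in
it is a hypothetical stand-in: struct work_model, DECLARE_WORK_MODEL()
and run_work() only mimic the shape of the kernel's struct work_struct
and DECLARE_WORK(), they are not the real API). A bare file-scope
definition is zero-initialized and has no handler bound, whereas the
declare-with-handler form does:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for struct work_struct: just the handler pointer. */
struct work_model {
	void (*func)(struct work_model *work);
};

/* Mimics the shape of DECLARE_WORK(): define the item and bind its handler. */
#define DECLARE_WORK_MODEL(name, fn) struct work_model name = { .func = (fn) }

int handled_count;

/* Stub handler, comparable to the stub work function the patch describes. */
void prot_err_work_fn(struct work_model *work)
{
	(void)work;
	handled_count++;
}

/* Bare definition as in the patch: zero-initialized, so func is NULL. */
struct work_model bare_work;

/* Declared together with its handler. */
DECLARE_WORK_MODEL(bound_work, prot_err_work_fn);

/* Stand-in for a worker running an item; -1 means no handler was bound. */
int run_work(struct work_model *work)
{
	if (!work->func)
		return -1;
	work->func(work);
	return 0;
}
```

In this model run_work() can detect the missing handler; the real
workqueue code expects an item that was set up with DECLARE_WORK() or
INIT_WORK() before it is scheduled.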


> ---
>  drivers/cxl/core/ras.c | 54 +++++++++++++++++++++++++++++++++++++++++-
>  drivers/cxl/cxlpci.h   |  3 +++
>  drivers/pci/pcie/aer.c | 39 ++++++++++++++++++++++++++++++
>  include/linux/aer.h    | 37 +++++++++++++++++++++++++++++
>  4 files changed, 132 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 485a831695c7..ecb60a5962de 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -5,6 +5,7 @@
>  #include <linux/aer.h>
>  #include <cxl/event.h>
>  #include <cxlmem.h>
> +#include <cxlpci.h>
>  #include "trace.h"
>  
>  static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
> @@ -107,13 +108,64 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
>  }
>  static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
>  
> +int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
> +			     struct cxl_prot_error_info *err_info)
> +{
> +	struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(_pdev);
> +	struct cxl_dev_state *cxlds;
> +
> +	if (!pdev || !err_info) {
> +		pr_warn_once("Error: parameter is NULL");
> +		return -ENODEV;
> +	}
> +
> +	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
> +	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END)) {
> +		pci_warn_once(pdev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));
> +		return -ENODEV;
> +	}
> +
> +	cxlds = pci_get_drvdata(pdev);
> +	struct device *dev __free(put_device) = get_device(&cxlds->cxlmd->dev);
> +
> +	if (!dev)
> +		return -ENODEV;
> +
> +	*err_info = (struct cxl_prot_error_info){ 0 };
> +	err_info->ras_base = cxlds->regs.ras;
> +	err_info->severity = severity;
> +	err_info->pdev = pdev;
> +	err_info->dev = dev;
> +
> +	return 0;
> +}
> +
> +struct work_struct cxl_prot_err_work;

This is the one I'm not seeing initialized anywhere...

> +
>  int cxl_ras_init(void)
>  {
> -	return cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> +	int rc;
> +
> +	rc = cxl_cper_register_prot_err_work(&cxl_cper_prot_err_work);
> +	if (rc) {
> +		pr_err("Failed to register CPER kfifo with AER driver");
> +		return rc;
> +	}
> +
> +	rc = cxl_register_prot_err_work(&cxl_prot_err_work, cxl_create_prot_err_info);
> +	if (rc) {
> +		pr_err("Failed to register kfifo with AER driver");
> +		return rc;
> +	}
> +
> +	return rc;
>  }
>  
>  void cxl_ras_exit(void)
>  {
>  	cxl_cper_unregister_prot_err_work(&cxl_cper_prot_err_work);
>  	cancel_work_sync(&cxl_cper_prot_err_work);
> +
> +	cxl_unregister_prot_err_work();
> +	cancel_work_sync(&cxl_prot_err_work);
>  }
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 54e219b0049e..92d72c0423ab 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -4,6 +4,7 @@
>  #define __CXL_PCI_H__
>  #include <linux/pci.h>
>  #include "cxl.h"
> +#include "linux/aer.h"
>  
>  #define CXL_MEMORY_PROGIF	0x10
>  
> @@ -135,4 +136,6 @@ void read_cdat_data(struct cxl_port *port);
>  void cxl_cor_error_detected(struct pci_dev *pdev);
>  pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>  				    pci_channel_state_t state);
> +int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
> +			     struct cxl_prot_error_info *err_info);
>  #endif /* __CXL_PCI_H__ */
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 83f2069f111e..46123b70f496 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -110,6 +110,16 @@ struct aer_stats {
>  static int pcie_aer_disable;
>  static pci_ers_result_t aer_root_reset(struct pci_dev *dev);
>  
> +#if defined(CONFIG_PCIEAER_CXL)
> +#define CXL_ERROR_SOURCES_MAX          128
> +static DEFINE_KFIFO(cxl_prot_err_fifo, struct cxl_prot_err_work_data,
> +		    CXL_ERROR_SOURCES_MAX);
> +static DEFINE_SPINLOCK(cxl_prot_err_fifo_lock);
> +struct work_struct *cxl_prot_err_work;
> +static int (*cxl_create_prot_err_info)(struct pci_dev*, int severity,
> +				       struct cxl_prot_error_info*);
> +#endif
> +
>  void pci_no_aer(void)
>  {
>  	pcie_aer_disable = 1;
> @@ -1577,6 +1587,35 @@ static pci_ers_result_t aer_root_reset(struct pci_dev *dev)
>  	return rc ? PCI_ERS_RESULT_DISCONNECT : PCI_ERS_RESULT_RECOVERED;
>  }
>  
> +
> +#if defined(CONFIG_PCIEAER_CXL)
> +int cxl_register_prot_err_work(struct work_struct *work,
> +			       int (*_cxl_create_prot_err_info)(struct pci_dev*, int,
> +								struct cxl_prot_error_info*))
> +{
> +	guard(spinlock)(&cxl_prot_err_fifo_lock);
> +	cxl_prot_err_work = work;
> +	cxl_create_prot_err_info = _cxl_create_prot_err_info;
> +	return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_register_prot_err_work, "CXL");
> +
> +int cxl_unregister_prot_err_work(void)
> +{
> +	guard(spinlock)(&cxl_prot_err_fifo_lock);
> +	cxl_prot_err_work = NULL;
> +	cxl_create_prot_err_info = NULL;
> +	return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_unregister_prot_err_work, "CXL");
> +
> +int cxl_prot_err_kfifo_get(struct cxl_prot_err_work_data *wd)
> +{
> +	return kfifo_get(&cxl_prot_err_fifo, wd);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_prot_err_kfifo_get, "CXL");
> +#endif
> +
>  static struct pcie_port_service_driver aerdriver = {
>  	.name		= "aer",
>  	.port_type	= PCIE_ANY_PORT,
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 947b63091902..761d6f5cd792 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -10,6 +10,7 @@
>  
>  #include <linux/errno.h>
>  #include <linux/types.h>
> +#include <linux/workqueue_types.h>
>  
>  #define AER_NONFATAL			0
>  #define AER_FATAL			1
> @@ -45,6 +46,24 @@ struct aer_capability_regs {
>  	u16 uncor_err_source;
>  };
>  
> +/**
> + * struct cxl_prot_err_info - Error information used in CXL error handling
> + * @pdev: PCI device with CXL error
> + * @dev: CXL device with error. From CXL topology using ACPI/platform discovery
> + * @ras_base: Mapped address of CXL RAS registers
> + * @severity: CXL AER/RAS severity: AER_CORRECTABLE, AER_FATAL, AER_NONFATAL
> + */
> +struct cxl_prot_error_info {
> +	struct pci_dev *pdev;
> +	struct device *dev;
> +	void __iomem *ras_base;
> +	int severity;
> +};
> +
> +struct cxl_prot_err_work_data {
> +	struct cxl_prot_error_info err_info;
> +};
> +
>  #if defined(CONFIG_PCIEAER)
>  int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
>  int pcie_aer_is_native(struct pci_dev *dev);
> @@ -56,6 +75,24 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
>  static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
>  #endif
>  
> +#if defined(CONFIG_PCIEAER_CXL)
> +int cxl_register_prot_err_work(struct work_struct *work,
> +			       int (*_cxl_create_proto_err_info)(struct pci_dev*, int,
> +								 struct cxl_prot_error_info*));
> +int cxl_unregister_prot_err_work(void);
> +int cxl_prot_err_kfifo_get(struct cxl_prot_err_work_data *wd);
> +#else
> +static inline int
> +cxl_register_prot_err_work(struct work_struct *work,
> +			   int (*_cxl_create_proto_err_info)(struct pci_dev*, int,
> +							     struct cxl_prot_error_info*))
> +{
> +	return 0;
> +}
> +static inline int cxl_unregister_prot_err_work(void) { return 0; }
> +static inline int cxl_prot_err_kfifo_get(struct cxl_prot_err_work_data *wd) { return 0; }
> +#endif
> +
>  void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  		    struct aer_capability_regs *aer);
>  int cper_severity_to_aer(int cper_severity);



* Re: [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors
  2025-03-27  1:47 ` [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors Terry Bowman
                     ` (4 preceding siblings ...)
  2025-04-23 15:04   ` Jonathan Cameron
@ 2025-04-23 22:12   ` Gregory Price
  5 siblings, 0 replies; 76+ messages in thread
From: Gregory Price @ 2025-04-23 22:12 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Wed, Mar 26, 2025 at 08:47:04PM -0500, Terry Bowman wrote:
> index 485a831695c7..ecb60a5962de 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
... snip ...
> +
> +struct work_struct cxl_prot_err_work;

This changes in patch 5, but this commit fails to build when the drivers
are built-in.  This should be static.

~Gregory


* [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver
  2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (2 preceding siblings ...)
  2025-03-27  1:47 ` [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors Terry Bowman
@ 2025-03-27  1:47 ` Terry Bowman
  2025-03-27 17:13   ` Bjorn Helgaas
                     ` (2 more replies)
  2025-03-27  1:47 ` [PATCH v8 05/16] PCI/AER: CXL driver dequeues CXL error forwarded from AER service driver Terry Bowman
                   ` (13 subsequent siblings)
  17 siblings, 3 replies; 76+ messages in thread
From: Terry Bowman @ 2025-03-27  1:47 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

The AER service driver includes a CXL-specific kfifo, intended to forward
CXL errors to the CXL driver. However, the forwarding functionality is
currently unimplemented. Update the AER driver to enable error forwarding
to the CXL driver.

Modify the AER service driver's handle_error_source(), which is called from
process_aer_err_devices(), to distinguish between PCIe and CXL errors.

Introduce is_cxl_error(), which checks both the
'struct aer_err_info::is_cxl' flag and, through is_internal_error(), the
AER internal error status.

If the error is a standard PCIe error then continue calling
pci_aer_handle_error() as done in the current AER driver.

If the error is a CXL-related error then forward it to the CXL driver for
handling using the kfifo mechanism.

Introduce a new function forward_cxl_error(), which constructs a CXL
protocol error context using cxl_create_prot_err_info(). This context is
then passed to the CXL driver via kfifo using a 'struct work_struct'.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/pci/pcie/aer.c | 61 +++++++++++++++++++++++++++++++++++++-----
 1 file changed, 55 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 46123b70f496..d1df751cfe4b 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1010,6 +1010,14 @@ static bool is_internal_error(struct aer_err_info *info)
 	return info->status & PCI_ERR_UNC_INTN;
 }
 
+static bool is_cxl_error(struct aer_err_info *info)
+{
+	if (!info || !info->is_cxl)
+		return false;
+
+	return is_internal_error(info);
+}
+
 static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
 {
 	struct aer_err_info *info = (struct aer_err_info *)data;
@@ -1062,13 +1070,17 @@ static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
 	return *handles_cxl;
 }
 
-static bool handles_cxl_errors(struct pci_dev *rcec)
+static bool handles_cxl_errors(struct pci_dev *dev)
 {
 	bool handles_cxl = false;
 
-	if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
-	    pcie_aer_is_native(rcec))
-		pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
+	if (!pcie_aer_is_native(dev))
+		return false;
+
+	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
+		pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
+	else
+		handles_cxl = pcie_is_cxl(dev);
 
 	return handles_cxl;
 }
@@ -1082,10 +1094,44 @@ static void cxl_rch_enable_rcec(struct pci_dev *rcec)
 	pci_info(rcec, "CXL: Internal errors unmasked");
 }
 
+static void forward_cxl_error(struct pci_dev *_pdev, struct aer_err_info *info)
+{
+	int severity = info->severity;
+	struct cxl_prot_err_work_data wd;
+	struct cxl_prot_error_info *err_info = &wd.err_info;
+	struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(_pdev);
+
+	if (!cxl_create_prot_err_info) {
+		pci_err(pdev, "Failed. CXL-AER interface not initialized.");
+		return;
+	}
+
+	if (cxl_create_prot_err_info(pdev, severity, err_info)) {
+		pci_err(pdev, "Failed to create CXL protocol error information");
+		return;
+	}
+
+	struct device *cxl_dev __free(put_device) = get_device(err_info->dev);
+
+	if (!kfifo_put(&cxl_prot_err_fifo, wd)) {
+		pr_err_ratelimited("CXL kfifo overflow\n");
+		return;
+	}
+
+	schedule_work(cxl_prot_err_work);
+}
+
 #else
 static inline void cxl_rch_enable_rcec(struct pci_dev *dev) { }
 static inline void cxl_rch_handle_error(struct pci_dev *dev,
 					struct aer_err_info *info) { }
+static inline void forward_cxl_error(struct pci_dev *dev,
+				    struct aer_err_info *info) { }
+static inline bool handles_cxl_errors(struct pci_dev *dev)
+{
+	return false;
+}
+static inline bool is_cxl_error(struct aer_err_info *info) { return false; }
 #endif
 
 /**
@@ -1123,8 +1169,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 
 static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
 {
-	cxl_rch_handle_error(dev, info);
-	pci_aer_handle_error(dev, info);
+	if (is_cxl_error(info))
+		forward_cxl_error(dev, info);
+	else
+		pci_aer_handle_error(dev, info);
+
 	pci_dev_put(dev);
 }
 
-- 
2.34.1



* Re: [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver
  2025-03-27  1:47 ` [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver Terry Bowman
@ 2025-03-27 17:13   ` Bjorn Helgaas
  2025-04-07 14:00     ` Bowman, Terry
  2025-04-23 15:04   ` Jonathan Cameron
  2025-04-23 22:21   ` Gregory Price
  2 siblings, 1 reply; 76+ messages in thread
From: Bjorn Helgaas @ 2025-03-27 17:13 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Wed, Mar 26, 2025 at 08:47:05PM -0500, Terry Bowman wrote:
> The AER service driver includes a CXL-specific kfifo, intended to forward
> CXL errors to the CXL driver. However, the forwarding functionality is
> currently unimplemented. Update the AER driver to enable error forwarding
> to the CXL driver.
> 
> Modify the AER service driver's handle_error_source(), which is called from
> process_aer_err_devices(), to distinguish between PCIe and CXL errors.
> 
> Rename and update is_internal_error() to is_cxl_error(). Ensuring it
> checks both the 'struct aer_info::is_cxl' flag and the AER internal error
> masks.
> 
> If the error is a standard PCIe error then continue calling pcie_aer_handle_error()
> as done in the current AER driver.
> 
> If the error is a CXL-related error then forward it to the CXL driver for
> handling using the kfifo mechanism.
> 
> Introduce a new function forward_cxl_error(), which constructs a CXL
> protocol error context using cxl_create_prot_err_info(). This context is
> then passed to the CXL driver via kfifo using a 'struct work_struct'.

This only touches drivers/pci, so I would make the subject prefix be
"PCI/AER".

> +static void forward_cxl_error(struct pci_dev *_pdev, struct aer_err_info *info)
> +{
> +	int severity = info->severity;
> +	struct cxl_prot_err_work_data wd;
> +	struct cxl_prot_error_info *err_info = &wd.err_info;
> +	struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(_pdev);
> +
> +	if (!cxl_create_prot_err_info) {
> +		pci_err(pdev, "Failed. CXL-AER interface not initialized.");
> +		return;
> +	}
> +
> +	if (cxl_create_prot_err_info(pdev, severity, err_info)) {
> +		pci_err(pdev, "Failed to create CXL protocol error information");
> +		return;
> +	}
> +
> +	struct device *cxl_dev __free(put_device) = get_device(err_info->dev);
> +
> +	if (!kfifo_put(&cxl_prot_err_fifo, wd)) {
> +		pr_err_ratelimited("CXL kfifo overflow\n");

Needs a dev identifier here to anchor the message to something.

> +		return;
> +	}
> +
> +	schedule_work(cxl_prot_err_work);
> +}


* Re: [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver
  2025-03-27 17:13   ` Bjorn Helgaas
@ 2025-04-07 14:00     ` Bowman, Terry
  0 siblings, 0 replies; 76+ messages in thread
From: Bowman, Terry @ 2025-04-07 14:00 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati



On 3/27/2025 12:13 PM, Bjorn Helgaas wrote:
> On Wed, Mar 26, 2025 at 08:47:05PM -0500, Terry Bowman wrote:
>> The AER service driver includes a CXL-specific kfifo, intended to forward
>> CXL errors to the CXL driver. However, the forwarding functionality is
>> currently unimplemented. Update the AER driver to enable error forwarding
>> to the CXL driver.
>>
>> Modify the AER service driver's handle_error_source(), which is called from
>> process_aer_err_devices(), to distinguish between PCIe and CXL errors.
>>
>> Rename and update is_internal_error() to is_cxl_error(). Ensuring it
>> checks both the 'struct aer_info::is_cxl' flag and the AER internal error
>> masks.
>>
>> If the error is a standard PCIe error then continue calling pcie_aer_handle_error()
>> as done in the current AER driver.
>>
>> If the error is a CXL-related error then forward it to the CXL driver for
>> handling using the kfifo mechanism.
>>
>> Introduce a new function forward_cxl_error(), which constructs a CXL
>> protocol error context using cxl_create_prot_err_info(). This context is
>> then passed to the CXL driver via kfifo using a 'struct work_struct'.
> This only touches drivers/pci, so I would make the subject prefix be
> "PCI/AER".
Got it. Thanks Bjorn.

>> +static void forward_cxl_error(struct pci_dev *_pdev, struct aer_err_info *info)
>> +{
>> +	int severity = info->severity;
>> +	struct cxl_prot_err_work_data wd;
>> +	struct cxl_prot_error_info *err_info = &wd.err_info;
>> +	struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(_pdev);
>> +
>> +	if (!cxl_create_prot_err_info) {
>> +		pci_err(pdev, "Failed. CXL-AER interface not initialized.");
>> +		return;
>> +	}
>> +
>> +	if (cxl_create_prot_err_info(pdev, severity, err_info)) {
>> +		pci_err(pdev, "Failed to create CXL protocol error information");
>> +		return;
>> +	}
>> +
>> +	struct device *cxl_dev __free(put_device) = get_device(err_info->dev);
>> +
>> +	if (!kfifo_put(&cxl_prot_err_fifo, wd)) {
>> +		pr_err_ratelimited("CXL kfifo overflow\n");
> Needs a dev identifier here to anchor the message to something.
Ok.

Regards,
Terry

>> +		return;
>> +	}
>> +
>> +	schedule_work(cxl_prot_err_work);
>> +}



* Re: [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver
  2025-03-27  1:47 ` [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver Terry Bowman
  2025-03-27 17:13   ` Bjorn Helgaas
@ 2025-04-23 15:04   ` Jonathan Cameron
  2025-04-24 14:17     ` Bowman, Terry
  2025-04-23 22:21   ` Gregory Price
  2 siblings, 1 reply; 76+ messages in thread
From: Jonathan Cameron @ 2025-04-23 15:04 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Wed, 26 Mar 2025 20:47:05 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER service driver includes a CXL-specific kfifo, intended to forward
> CXL errors to the CXL driver. However, the forwarding functionality is
> currently unimplemented. Update the AER driver to enable error forwarding
> to the CXL driver.
> 
> Modify the AER service driver's handle_error_source(), which is called from
> process_aer_err_devices(), to distinguish between PCIe and CXL errors.
> 
> Rename and update is_internal_error() to is_cxl_error(). Ensuring it
> checks both the 'struct aer_info::is_cxl' flag and the AER internal error
> masks.
> 
> If the error is a standard PCIe error then continue calling pcie_aer_handle_error()
> as done in the current AER driver.
> 
> If the error is a CXL-related error then forward it to the CXL driver for
> handling using the kfifo mechanism.
> 
> Introduce a new function forward_cxl_error(), which constructs a CXL
> protocol error context using cxl_create_prot_err_info(). This context is
> then passed to the CXL driver via kfifo using a 'struct work_struct'.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Hi Terry,

Finally got back to this.  I'm not following how some of the reference
counting in here is working.  It might be fine, but there is a lot of
taking and then dropping device references, some of which are taken again
later.

> @@ -1082,10 +1094,44 @@ static void cxl_rch_enable_rcec(struct pci_dev *rcec)
>  	pci_info(rcec, "CXL: Internal errors unmasked");
>  }
>  
> +static void forward_cxl_error(struct pci_dev *_pdev, struct aer_err_info *info)
> +{
> +	int severity = info->severity;

So far this variable isn't really justified.  Maybe it makes sense later in the
series?

> +	struct cxl_prot_err_work_data wd;
> +	struct cxl_prot_error_info *err_info = &wd.err_info;

Similarly. Why not just use this directly in the call below?

> +	struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(_pdev);
Can you talk me through the reference counting?  You take one pci device reference
here.... 
> +
> +	if (!cxl_create_prot_err_info) {
> +		pci_err(pdev, "Failed. CXL-AER interface not initialized.");
> +		return;
> +	}
> +
> +	if (cxl_create_prot_err_info(pdev, severity, err_info)) {

...but the implementation of this also takes one internally.  Can we skip that
internal one and document that it is always taken by the caller?

> +		pci_err(pdev, "Failed to create CXL protocol error information");
> +		return;
> +	}
> +
> +	struct device *cxl_dev __free(put_device) = get_device(err_info->dev);

Also this one.  A reference was acquired and dropped in cxl_create_prot_err_info()
and is then retaken here.  How do we know the device is still around at this
call, and again once we pull it off the kfifo later?

> +
> +	if (!kfifo_put(&cxl_prot_err_fifo, wd)) {
> +		pr_err_ratelimited("CXL kfifo overflow\n");
> +		return;
> +	}
> +
> +	schedule_work(cxl_prot_err_work);
> +}
> +

Thanks,

Jonathan



* Re: [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver
  2025-04-23 15:04   ` Jonathan Cameron
@ 2025-04-24 14:17     ` Bowman, Terry
  2025-04-25 13:18       ` Jonathan Cameron
  0 siblings, 1 reply; 76+ messages in thread
From: Bowman, Terry @ 2025-04-24 14:17 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati



On 4/23/2025 10:04 AM, Jonathan Cameron wrote:
> On Wed, 26 Mar 2025 20:47:05 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The AER service driver includes a CXL-specific kfifo, intended to forward
>> CXL errors to the CXL driver. However, the forwarding functionality is
>> currently unimplemented. Update the AER driver to enable error forwarding
>> to the CXL driver.
>>
>> Modify the AER service driver's handle_error_source(), which is called from
>> process_aer_err_devices(), to distinguish between PCIe and CXL errors.
>>
>> Rename and update is_internal_error() to is_cxl_error(). Ensuring it
>> checks both the 'struct aer_info::is_cxl' flag and the AER internal error
>> masks.
>>
>> If the error is a standard PCIe error then continue calling pcie_aer_handle_error()
>> as done in the current AER driver.
>>
>> If the error is a CXL-related error then forward it to the CXL driver for
>> handling using the kfifo mechanism.
>>
>> Introduce a new function forward_cxl_error(), which constructs a CXL
>> protocol error context using cxl_create_prot_err_info(). This context is
>> then passed to the CXL driver via kfifo using a 'struct work_struct'.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Hi Terry,
>
> Finally got back to this.  I'm not following how some of the reference
> counting in here is working.  It might be fine but there is a lot
> taking then dropping device references - some of which are taken again later.
>
>> @@ -1082,10 +1094,44 @@ static void cxl_rch_enable_rcec(struct pci_dev *rcec)
>>  	pci_info(rcec, "CXL: Internal errors unmasked");
>>  }
>>  
>> +static void forward_cxl_error(struct pci_dev *_pdev, struct aer_err_info *info)
>> +{
>> +	int severity = info->severity;
> So far this variable isn't really justified.  Maybe it makes sense later in the
> series?

This is used below in the call to cxl_create_prot_err_info().

>> +	struct cxl_prot_err_work_data wd;
>> +	struct cxl_prot_error_info *err_info = &wd.err_info;
> Similarly. Why not just use this directly in the call below?

The pointer is assigned so that err_info can be filled in by
cxl_create_prot_err_info() below and then passed along as part of the work
data structure.

>> +	struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(_pdev);
> Can you talk me through the reference counting?  You take one pci device reference
> here.... 

This takes a reference on the PCI device (not the CXL device) to prevent it
from being destroyed until the scope-exit cleanup.

>> +
>> +	if (!cxl_create_prot_err_info) {
>> +		pci_err(pdev, "Failed. CXL-AER interface not initialized.");
>> +		return;
>> +	}
>> +
>> +	if (cxl_create_prot_err_info(pdev, severity, err_info)) {
> ...but the implementation of this also takes once internally.  Can we skip that
> internal one and document that it is always take by the caller?

Yes, it can. I will have to verify that the other 5 or 6 callers do the same
before calling. I wanted to take the reference as early as possible,
immediately after error detection, without forcing callers to do the same
setup everywhere beforehand. I see your point and will consolidate them.

>> +		pci_err(pdev, "Failed to create CXL protocol error information");
>> +		return;
>> +	}
>> +
>> +	struct device *cxl_dev __free(put_device) = get_device(err_info->dev);
> Also this one.  A reference was acquired and dropped in cxl_create_prot_err_info()
> followed by retaking it here.  How do we know it is still about by this call
> and once we pull it off the kfifo later?

Yes, this is a problem I realized after sending the series.

The device reference increments could be changed for all the devices to the
non-cleanup variety. The caller would then take the reference after calling
cxl_create_prot_err_info(). I need to look at the other calls to
cxl_create_prot_err_info() as well.

In addition, I think we should consider adding the CXL RAS status into the struct cxl_prot_err_info.
This would eliminate the need for further accesses to the CXL device after being dequeued from the
fifo. Thoughts?
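
As a rough userspace sketch of the reference handoff discussed above
(all names here are hypothetical: struct obj stands in for a refcounted
struct device and struct queue for the kfifo; this models the idea, not
the kernel API). The producer takes a reference that belongs to the
queued entry and only drops it if the enqueue fails; the consumer drops
it after handling the entry, so the object stays pinned while it sits
in the queue:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical refcounted object standing in for struct device. */
struct obj {
	int refs;
};

struct obj *obj_get(struct obj *o)
{
	if (o)
		o->refs++;
	return o;
}

void obj_put(struct obj *o)
{
	if (o)
		o->refs--;
}

#define QUEUE_DEPTH 4

/* Tiny fixed-size FIFO standing in for the kfifo. */
struct queue {
	struct obj *slots[QUEUE_DEPTH];
	size_t head, tail, count;
};

bool queue_put(struct queue *q, struct obj *o)
{
	if (q->count == QUEUE_DEPTH)
		return false;
	q->slots[q->tail] = o;
	q->tail = (q->tail + 1) % QUEUE_DEPTH;
	q->count++;
	return true;
}

struct obj *queue_get(struct queue *q)
{
	struct obj *o;

	if (!q->count)
		return NULL;
	o = q->slots[q->head];
	q->head = (q->head + 1) % QUEUE_DEPTH;
	q->count--;
	return o;
}

/* Producer: the reference taken here travels with the queued entry. */
bool forward_error(struct queue *q, struct obj *o)
{
	obj_get(o);
	if (!queue_put(q, o)) {
		obj_put(o);	/* overflow: nothing queued, so drop it again */
		return false;
	}
	return true;
}

/* Consumer: handle one entry, then drop the producer's reference. */
bool handle_one(struct queue *q)
{
	struct obj *o = queue_get(q);

	if (!o)
		return false;
	/* ... error handling would use o here ... */
	obj_put(o);
	return true;
}
```

With a scope-based cleanup on the producer side, the reference would
instead be dropped when the producer returns, before the consumer ever
runs, which is the gap the review comment points at.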

>> +
>> +	if (!kfifo_put(&cxl_prot_err_fifo, wd)) {
>> +		pr_err_ratelimited("CXL kfifo overflow\n");
>> +		return;
>> +	}
>> +
>> +	schedule_work(cxl_prot_err_work);
>> +}
>> +
> Thanks,
>
> Jonathan
>
Thanks for reviewing Jonathan.

-Terry


* Re: [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver
  2025-04-24 14:17     ` Bowman, Terry
@ 2025-04-25 13:18       ` Jonathan Cameron
  2025-04-25 21:03         ` Bowman, Terry
  2025-05-15 21:52         ` Bowman, Terry
  0 siblings, 2 replies; 76+ messages in thread
From: Jonathan Cameron @ 2025-04-25 13:18 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Thu, 24 Apr 2025 09:17:45 -0500
"Bowman, Terry" <terry.bowman@amd.com> wrote:

> On 4/23/2025 10:04 AM, Jonathan Cameron wrote:
> > On Wed, 26 Mar 2025 20:47:05 -0500
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >  
> >> The AER service driver includes a CXL-specific kfifo, intended to forward
> >> CXL errors to the CXL driver. However, the forwarding functionality is
> >> currently unimplemented. Update the AER driver to enable error forwarding
> >> to the CXL driver.
> >>
> >> Modify the AER service driver's handle_error_source(), which is called from
> >> process_aer_err_devices(), to distinguish between PCIe and CXL errors.
> >>
> >> Rename and update is_internal_error() to is_cxl_error(). Ensuring it
> >> checks both the 'struct aer_info::is_cxl' flag and the AER internal error
> >> masks.
> >>
> >> If the error is a standard PCIe error then continue calling pcie_aer_handle_error()
> >> as done in the current AER driver.
> >>
> >> If the error is a CXL-related error then forward it to the CXL driver for
> >> handling using the kfifo mechanism.
> >>
> >> Introduce a new function forward_cxl_error(), which constructs a CXL
> >> protocol error context using cxl_create_prot_err_info(). This context is
> >> then passed to the CXL driver via kfifo using a 'struct work_struct'.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>  
> > Hi Terry,
> >
> > Finally got back to this.  I'm not following how some of the reference
> > counting in here is working.  It might be fine but there is a lot
> > taking then dropping device references - some of which are taken again later.
> >  
> >> @@ -1082,10 +1094,44 @@ static void cxl_rch_enable_rcec(struct pci_dev *rcec)
> >>  	pci_info(rcec, "CXL: Internal errors unmasked");
> >>  }
> >>  
> >> +static void forward_cxl_error(struct pci_dev *_pdev, struct aer_err_info *info)
> >> +{
> >> +	int severity = info->severity;  
> > So far this variable isn't really justified.  Maybe it makes sense later in the
> > series?  
> 
> This is used below in call to cxl_create_prot_err_info().
Sure, but why not just do

if (cxl_create_prot_error_info(pdev, info->severity, &wd.err_info)) {

There isn't anything modifying info->severity in between so that local
variable is just padding out the code to no real benefit.


> 
> >> +		pci_err(pdev, "Failed to create CXL protocol error information");
> >> +		return;
> >> +	}
> >> +
> >> +	struct device *cxl_dev __free(put_device) = get_device(err_info->dev);  
> > Also this one.  A reference was acquired and dropped in cxl_create_prot_err_info()
> > followed by retaking it here.  How do we know it is still about by this call
> > and once we pull it off the kfifo later?  
> 
> Yes, this is a problem I realized after sending the series.
> 
> The device reference incr could be changed for all the devices to the non-cleanup
> variety. Then would add the reference incr in the caller after calling cxl_create_prot_err_info().
> I need to look at the other calls to to cxl_create_prot_err_info() as well.
> 
> In addition, I think we should consider adding the CXL RAS status into the struct cxl_prot_err_info.
> This would eliminate the need for further accesses to the CXL device after being dequeued from the
> fifo. Thoughts?

That sounds like a reasonable solution to me.

Jonathan


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver
  2025-04-25 13:18       ` Jonathan Cameron
@ 2025-04-25 21:03         ` Bowman, Terry
  2025-05-15 21:52         ` Bowman, Terry
  1 sibling, 0 replies; 76+ messages in thread
From: Bowman, Terry @ 2025-04-25 21:03 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati



On 4/25/2025 8:18 AM, Jonathan Cameron wrote:
> On Thu, 24 Apr 2025 09:17:45 -0500
> "Bowman, Terry" <terry.bowman@amd.com> wrote:
>
>> On 4/23/2025 10:04 AM, Jonathan Cameron wrote:
>>> On Wed, 26 Mar 2025 20:47:05 -0500
>>> Terry Bowman <terry.bowman@amd.com> wrote:
>>>  
>>>> The AER service driver includes a CXL-specific kfifo, intended to forward
>>>> CXL errors to the CXL driver. However, the forwarding functionality is
>>>> currently unimplemented. Update the AER driver to enable error forwarding
>>>> to the CXL driver.
>>>>
>>>> Modify the AER service driver's handle_error_source(), which is called from
>>>> process_aer_err_devices(), to distinguish between PCIe and CXL errors.
>>>>
>>>> Rename and update is_internal_error() to is_cxl_error(). Ensuring it
>>>> checks both the 'struct aer_info::is_cxl' flag and the AER internal error
>>>> masks.
>>>>
>>>> If the error is a standard PCIe error then continue calling pcie_aer_handle_error()
>>>> as done in the current AER driver.
>>>>
>>>> If the error is a CXL-related error then forward it to the CXL driver for
>>>> handling using the kfifo mechanism.
>>>>
>>>> Introduce a new function forward_cxl_error(), which constructs a CXL
>>>> protocol error context using cxl_create_prot_err_info(). This context is
>>>> then passed to the CXL driver via kfifo using a 'struct work_struct'.
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>  
>>> Hi Terry,
>>>
>>> Finally got back to this.  I'm not following how some of the reference
>>> counting in here is working.  It might be fine but there is a lot
>>> taking then dropping device references - some of which are taken again later.
>>>  
>>>> @@ -1082,10 +1094,44 @@ static void cxl_rch_enable_rcec(struct pci_dev *rcec)
>>>>  	pci_info(rcec, "CXL: Internal errors unmasked");
>>>>  }
>>>>  
>>>> +static void forward_cxl_error(struct pci_dev *_pdev, struct aer_err_info *info)
>>>> +{
>>>> +	int severity = info->severity;  
>>> So far this variable isn't really justified.  Maybe it makes sense later in the
>>> series?  
>> This is used below in call to cxl_create_prot_err_info().
> Sure, but why not just do
>
> if (cxl_create_prot_error_info(pdev, info->severity, &wd.err_info)) {
>
> There isn't anything modifying info->severity in between so that local
> variable is just padding out the code to no real benefit.
>

I was following a common pattern I have seen, where a struct member is first assigned to a
local variable and the local is then passed as the function argument. I suppose it helps
readability, but it is not necessary here.

Sure, I'll make that change.

>>>> +		pci_err(pdev, "Failed to create CXL protocol error information");
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	struct device *cxl_dev __free(put_device) = get_device(err_info->dev);  
>>> Also this one.  A reference was acquired and dropped in cxl_create_prot_err_info()
>>> followed by retaking it here.  How do we know it is still about by this call
>>> and once we pull it off the kfifo later?  
>> Yes, this is a problem I realized after sending the series.
>>
>> The device reference incr could be changed for all the devices to the non-cleanup
>> variety. Then would add the reference incr in the caller after calling cxl_create_prot_err_info().
>> I need to look at the other calls to to cxl_create_prot_err_info() as well.
>>
>> In addition, I think we should consider adding the CXL RAS status into the struct cxl_prot_err_info.
>> This would eliminate the need for further accesses to the CXL device after being dequeued from the
>> fifo. Thoughts?
> That sounds like a reasonable solution to me.
>
> Jonathan
>
Ok.

-Terry

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver
  2025-04-25 13:18       ` Jonathan Cameron
  2025-04-25 21:03         ` Bowman, Terry
@ 2025-05-15 21:52         ` Bowman, Terry
  2025-05-20 11:04           ` Jonathan Cameron
  1 sibling, 1 reply; 76+ messages in thread
From: Bowman, Terry @ 2025-05-15 21:52 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, terry.bowman



On 4/25/2025 8:18 AM, Jonathan Cameron wrote:
> On Thu, 24 Apr 2025 09:17:45 -0500
> "Bowman, Terry" <terry.bowman@amd.com> wrote:
>
>> On 4/23/2025 10:04 AM, Jonathan Cameron wrote:
>>> On Wed, 26 Mar 2025 20:47:05 -0500
>>> Terry Bowman <terry.bowman@amd.com> wrote:
>>>  
>>>> The AER service driver includes a CXL-specific kfifo, intended to forward
>>>> CXL errors to the CXL driver. However, the forwarding functionality is
>>>> currently unimplemented. Update the AER driver to enable error forwarding
>>>> to the CXL driver.
>>>>
>>>> Modify the AER service driver's handle_error_source(), which is called from
>>>> process_aer_err_devices(), to distinguish between PCIe and CXL errors.
>>>>
>>>> Rename and update is_internal_error() to is_cxl_error(). Ensuring it
>>>> checks both the 'struct aer_info::is_cxl' flag and the AER internal error
>>>> masks.
>>>>
>>>> If the error is a standard PCIe error then continue calling pcie_aer_handle_error()
>>>> as done in the current AER driver.
>>>>
>>>> If the error is a CXL-related error then forward it to the CXL driver for
>>>> handling using the kfifo mechanism.
>>>>
>>>> Introduce a new function forward_cxl_error(), which constructs a CXL
>>>> protocol error context using cxl_create_prot_err_info(). This context is
>>>> then passed to the CXL driver via kfifo using a 'struct work_struct'.
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>  
>>> Hi Terry,
>>>
>>> Finally got back to this.  I'm not following how some of the reference
>>> counting in here is working.  It might be fine but there is a lot
>>> taking then dropping device references - some of which are taken again later.
>>>  
>>>> @@ -1082,10 +1094,44 @@ static void cxl_rch_enable_rcec(struct pci_dev *rcec)
>>>>  	pci_info(rcec, "CXL: Internal errors unmasked");
>>>>  }
>>>>  
>>>> +static void forward_cxl_error(struct pci_dev *_pdev, struct aer_err_info *info)
>>>> +{
>>>> +	int severity = info->severity;  
>>> So far this variable isn't really justified.  Maybe it makes sense later in the
>>> series?  
>> This is used below in call to cxl_create_prot_err_info().
> Sure, but why not just do
>
> if (cxl_create_prot_error_info(pdev, info->severity, &wd.err_info)) {
>
> There isn't anything modifying info->severity in between so that local
> variable is just padding out the code to no real benefit.
>
>
>>>> +		pci_err(pdev, "Failed to create CXL protocol error information");
>>>> +		return;
>>>> +	}
>>>> +
>>>> +	struct device *cxl_dev __free(put_device) = get_device(err_info->dev);  
>>> Also this one.  A reference was acquired and dropped in cxl_create_prot_err_info()
>>> followed by retaking it here.  How do we know it is still about by this call
>>> and once we pull it off the kfifo later?  
>> Yes, this is a problem I realized after sending the series.
>>
>> The device reference incr could be changed for all the devices to the non-cleanup
>> variety. Then would add the reference incr in the caller after calling cxl_create_prot_err_info().
>> I need to look at the other calls to to cxl_create_prot_err_info() as well.
>>
>> In addition, I think we should consider adding the CXL RAS status into the struct cxl_prot_err_info.
>> This would eliminate the need for further accesses to the CXL device after being dequeued from the
>> fifo. Thoughts?
> That sounds like a reasonable solution to me.
>
> Jonathan
Hi Jonathan,

Is it sufficient to rely on correctly implemented reference counting instead of caching
the RAS status I mentioned earlier?

I have the next revision coded to 'get' a reference on the erring CXL device in the AER
driver before enqueuing in the kfifo, and to 'put' it in the CXL driver after dequeuing and
handling/logging. This is an alternative to reading the RAS status and caching it, as I
mentioned earlier. One more question: is it OK to implement the get and put (of the same
object) in different drivers?

If we need to read and cache the RAS status before the kfifo enqueue there will be some other
details to work through.
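
For reference, the get-in-producer/put-in-consumer scheme looks roughly like the userspace
sketch below. The toy_* names and the one-entry queue are hypothetical; the "AER side" and
"CXL side" comments only mark which driver would own each half.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_dev {
	int refcount;
	int handled;
};

/* One-entry queue standing in for the kfifo. */
static struct toy_dev *queue_slot;

/* "AER side": take the reference, then hand the pointer to the queue. */
static bool toy_enqueue(struct toy_dev *d)
{
	d->refcount++;		/* get: the queue entry now owns a reference */
	if (queue_slot) {
		d->refcount--;	/* queue full: undo the get before bailing out */
		return false;
	}
	queue_slot = d;
	return true;
}

/* "CXL side": handle the entry, then drop the reference it carried. */
static bool toy_dequeue_and_handle(void)
{
	struct toy_dev *d = queue_slot;

	if (!d)
		return false;
	queue_slot = NULL;
	d->handled++;
	d->refcount--;		/* put: pairs with the get in toy_enqueue() */
	return true;
}
```

Note that the overflow path has to drop the reference too, otherwise a full queue leaks it.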

-Terry



^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver
  2025-05-15 21:52         ` Bowman, Terry
@ 2025-05-20 11:04           ` Jonathan Cameron
  2025-05-20 13:21             ` Bowman, Terry
  0 siblings, 1 reply; 76+ messages in thread
From: Jonathan Cameron @ 2025-05-20 11:04 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Thu, 15 May 2025 16:52:15 -0500
"Bowman, Terry" <terry.bowman@amd.com> wrote:

> On 4/25/2025 8:18 AM, Jonathan Cameron wrote:
> > On Thu, 24 Apr 2025 09:17:45 -0500
> > "Bowman, Terry" <terry.bowman@amd.com> wrote:
> >  
> >> On 4/23/2025 10:04 AM, Jonathan Cameron wrote:  
> >>> On Wed, 26 Mar 2025 20:47:05 -0500
> >>> Terry Bowman <terry.bowman@amd.com> wrote:
> >>>    
> >>>> The AER service driver includes a CXL-specific kfifo, intended to forward
> >>>> CXL errors to the CXL driver. However, the forwarding functionality is
> >>>> currently unimplemented. Update the AER driver to enable error forwarding
> >>>> to the CXL driver.
> >>>>
> >>>> Modify the AER service driver's handle_error_source(), which is called from
> >>>> process_aer_err_devices(), to distinguish between PCIe and CXL errors.
> >>>>
> >>>> Rename and update is_internal_error() to is_cxl_error(). Ensuring it
> >>>> checks both the 'struct aer_info::is_cxl' flag and the AER internal error
> >>>> masks.
> >>>>
> >>>> If the error is a standard PCIe error then continue calling pcie_aer_handle_error()
> >>>> as done in the current AER driver.
> >>>>
> >>>> If the error is a CXL-related error then forward it to the CXL driver for
> >>>> handling using the kfifo mechanism.
> >>>>
> >>>> Introduce a new function forward_cxl_error(), which constructs a CXL
> >>>> protocol error context using cxl_create_prot_err_info(). This context is
> >>>> then passed to the CXL driver via kfifo using a 'struct work_struct'.
> >>>>
> >>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>    
> >>> Hi Terry,
> >>>
> >>> Finally got back to this.  I'm not following how some of the reference
> >>> counting in here is working.  It might be fine but there is a lot
> >>> taking then dropping device references - some of which are taken again later.
> >>>    
> >>>> @@ -1082,10 +1094,44 @@ static void cxl_rch_enable_rcec(struct pci_dev *rcec)
> >>>>  	pci_info(rcec, "CXL: Internal errors unmasked");
> >>>>  }
> >>>>  
> >>>> +static void forward_cxl_error(struct pci_dev *_pdev, struct aer_err_info *info)
> >>>> +{
> >>>> +	int severity = info->severity;    
> >>> So far this variable isn't really justified.  Maybe it makes sense later in the
> >>> series?    
> >> This is used below in call to cxl_create_prot_err_info().  
> > Sure, but why not just do
> >
> > if (cxl_create_prot_error_info(pdev, info->severity, &wd.err_info)) {
> >
> > There isn't anything modifying info->severity in between so that local
> > variable is just padding out the code to no real benefit.
> >
> >  
> >>>> +		pci_err(pdev, "Failed to create CXL protocol error information");
> >>>> +		return;
> >>>> +	}
> >>>> +
> >>>> +	struct device *cxl_dev __free(put_device) = get_device(err_info->dev);    
> >>> Also this one.  A reference was acquired and dropped in cxl_create_prot_err_info()
> >>> followed by retaking it here.  How do we know it is still about by this call
> >>> and once we pull it off the kfifo later?    
> >> Yes, this is a problem I realized after sending the series.
> >>
> >> The device reference incr could be changed for all the devices to the non-cleanup
> >> variety. Then would add the reference incr in the caller after calling cxl_create_prot_err_info().
> >> I need to look at the other calls to to cxl_create_prot_err_info() as well.
> >>
> >> In addition, I think we should consider adding the CXL RAS status into the struct cxl_prot_err_info.
> >> This would eliminate the need for further accesses to the CXL device after being dequeued from the
> >> fifo. Thoughts?  
> > That sounds like a reasonable solution to me.
> >
> > Jonathan  
> Hi Jonathan,
Hi Terry,

Sorry for delay - travel etc...

> 
> Is it sufficient to rely on correctly implemented reference counting implementation instead
> of caching the RAS status I mentioned earlier?
> 
> I have the next revision coded to 'get' the CXL erring device's reference count in the AER
> driver before enqueuing in the kfifo and then added a reference count 'put' in the CXL driver
> after dequeuing and handling/logging. This is an alternative to what I mentioned earlier reading
> the RAS status and caching it. One more question: is it OK to implement the get and put (of
> the same object) in different drivers?

It's definitely unusual.  I'd be happier if there were something similar to point at,
rather than this 'innovation' showing up here first.

> 
> If we need to read and cache the RAS status before the kfifo enqueue there will be some other
> details to work through.
This still smells like the cleaner solution to me, but it depends on those details.

Jonathan

> 
> -Terry
> 
> 


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver
  2025-05-20 11:04           ` Jonathan Cameron
@ 2025-05-20 13:21             ` Bowman, Terry
  2025-05-21 18:34               ` Jonathan Cameron
  0 siblings, 1 reply; 76+ messages in thread
From: Bowman, Terry @ 2025-05-20 13:21 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, terry.bowman



On 5/20/2025 6:04 AM, Jonathan Cameron wrote:
> On Thu, 15 May 2025 16:52:15 -0500
> "Bowman, Terry" <terry.bowman@amd.com> wrote:
>
>> On 4/25/2025 8:18 AM, Jonathan Cameron wrote:
>>> On Thu, 24 Apr 2025 09:17:45 -0500
>>> "Bowman, Terry" <terry.bowman@amd.com> wrote:
>>>  
>>>> On 4/23/2025 10:04 AM, Jonathan Cameron wrote:  
>>>>> On Wed, 26 Mar 2025 20:47:05 -0500
>>>>> Terry Bowman <terry.bowman@amd.com> wrote:
>>>>>    
>>>>>> The AER service driver includes a CXL-specific kfifo, intended to forward
>>>>>> CXL errors to the CXL driver. However, the forwarding functionality is
>>>>>> currently unimplemented. Update the AER driver to enable error forwarding
>>>>>> to the CXL driver.
>>>>>>
>>>>>> Modify the AER service driver's handle_error_source(), which is called from
>>>>>> process_aer_err_devices(), to distinguish between PCIe and CXL errors.
>>>>>>
>>>>>> Rename and update is_internal_error() to is_cxl_error(). Ensuring it
>>>>>> checks both the 'struct aer_info::is_cxl' flag and the AER internal error
>>>>>> masks.
>>>>>>
>>>>>> If the error is a standard PCIe error then continue calling pcie_aer_handle_error()
>>>>>> as done in the current AER driver.
>>>>>>
>>>>>> If the error is a CXL-related error then forward it to the CXL driver for
>>>>>> handling using the kfifo mechanism.
>>>>>>
>>>>>> Introduce a new function forward_cxl_error(), which constructs a CXL
>>>>>> protocol error context using cxl_create_prot_err_info(). This context is
>>>>>> then passed to the CXL driver via kfifo using a 'struct work_struct'.
>>>>>>
>>>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>    
>>>>> Hi Terry,
>>>>>
>>>>> Finally got back to this.  I'm not following how some of the reference
>>>>> counting in here is working.  It might be fine but there is a lot
>>>>> taking then dropping device references - some of which are taken again later.
>>>>>    
>>>>>> @@ -1082,10 +1094,44 @@ static void cxl_rch_enable_rcec(struct pci_dev *rcec)
>>>>>>  	pci_info(rcec, "CXL: Internal errors unmasked");
>>>>>>  }
>>>>>>  
>>>>>> +static void forward_cxl_error(struct pci_dev *_pdev, struct aer_err_info *info)
>>>>>> +{
>>>>>> +	int severity = info->severity;    
>>>>> So far this variable isn't really justified.  Maybe it makes sense later in the
>>>>> series?    
>>>> This is used below in call to cxl_create_prot_err_info().  
>>> Sure, but why not just do
>>>
>>> if (cxl_create_prot_error_info(pdev, info->severity, &wd.err_info)) {
>>>
>>> There isn't anything modifying info->severity in between so that local
>>> variable is just padding out the code to no real benefit.
>>>
>>>  
>>>>>> +		pci_err(pdev, "Failed to create CXL protocol error information");
>>>>>> +		return;
>>>>>> +	}
>>>>>> +
>>>>>> +	struct device *cxl_dev __free(put_device) = get_device(err_info->dev);    
>>>>> Also this one.  A reference was acquired and dropped in cxl_create_prot_err_info()
>>>>> followed by retaking it here.  How do we know it is still about by this call
>>>>> and once we pull it off the kfifo later?    
>>>> Yes, this is a problem I realized after sending the series.
>>>>
>>>> The device reference incr could be changed for all the devices to the non-cleanup
>>>> variety. Then would add the reference incr in the caller after calling cxl_create_prot_err_info().
>>>> I need to look at the other calls to to cxl_create_prot_err_info() as well.
>>>>
>>>> In addition, I think we should consider adding the CXL RAS status into the struct cxl_prot_err_info.
>>>> This would eliminate the need for further accesses to the CXL device after being dequeued from the
>>>> fifo. Thoughts?  
>>> That sounds like a reasonable solution to me.
>>>
>>> Jonathan  
>> Hi Jonathan,
> Hi Terry,
>
> Sorry for delay - travel etc...
>
>> Is it sufficient to rely on correctly implemented reference counting implementation instead
>> of caching the RAS status I mentioned earlier?
>>
>> I have the next revision coded to 'get' the CXL erring device's reference count in the AER
>> driver before enqueuing in the kfifo and then added a reference count 'put' in the CXL driver
>> after dequeuing and handling/logging. This is an alternative to what I mentioned earlier reading
>> the RAS status and caching it. One more question: is it OK to implement the get and put (of
>> the same object) in different drivers?
> It's definitely unusual.  If there is anything similar to point at I'd be happier than
> this 'innovation' showing up here first. 
>
>> If we need to read and cache the RAS status before the kfifo enqueue there will be some other
>> details to work through.
> This still smells like the cleaner solution to me, but depends on those details..
>
> Jonathan

In this case I believe we will need to move the CE handling (RAS status reading and clearing) before
the kfifo enqueue. I think this is necessary because CXL errors may continue to be received and we
don't want their statuses combined when reading or clearing. I can refactor cxl_handle_ras()/
cxl_handle_cor_ras() to return the RAS status value and remove the trace logging (to be called
after the kfifo dequeue instead).
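
A small userspace sketch of the read-then-clear step, assuming the status register behaves
as write-1-to-clear. The toy_* names are hypothetical and the struct is a plain-memory model,
not real MMIO access.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a write-1-to-clear status register. */
struct toy_ras_reg {
	uint32_t bits;
};

static uint32_t toy_reg_read(const struct toy_ras_reg *r)
{
	return r->bits;
}

/* Writing 1 to a bit clears it; bits written as 0 are left alone. */
static void toy_reg_write_w1c(struct toy_ras_reg *r, uint32_t val)
{
	r->bits &= ~val;
}

/* Return the status seen now and clear only the bits that were read. */
static uint32_t toy_read_and_clear(struct toy_ras_reg *r)
{
	uint32_t status = toy_reg_read(r);

	toy_reg_write_w1c(r, status);
	return status;
}
```

Writing back only the bits that were read means an error arriving afterwards shows up in
the next read on its own, rather than being merged with or cleared by the first one.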

This leaves the UCE case. It's worth mentioning that the UCE flow is different from the CE case
because it uses the top-bottom traversal starting at the erring device. Correct me if I'm wrong, but
this would be handled before the kfifo as well. The handling and logging in the UCE case are
baked together. The UCE flow would therefore need to include the trace logging during handling.

Another flow is the PCIe EP errors. The PCIe EP CE and UCE handlers remain and can call the
refactored cxl_handle_ras()/cxl_handle_cor_ras() and then trace log afterwards. This is not an
issue.

This leaves only CE trace logging to be called after the kfifo dequeue. This is what doesn't
feel right, and what I wanted to draw attention to.

All this to say: very little work will be done after the kfifo dequeue. Most of the work in
the kfifo implementation would happen before the kfifo enqueue, in the CXL create_prot_error_info()
callback. I am concerned that the balance of work done before and after the kfifo enqueue/dequeue
will be very asymmetric, with little value provided by the kfifo.

-Terry


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver
  2025-05-20 13:21             ` Bowman, Terry
@ 2025-05-21 18:34               ` Jonathan Cameron
  2025-05-21 23:30                 ` Bowman, Terry
  0 siblings, 1 reply; 76+ messages in thread
From: Jonathan Cameron @ 2025-05-21 18:34 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Tue, 20 May 2025 08:21:18 -0500
"Bowman, Terry" <terry.bowman@amd.com> wrote:

> On 5/20/2025 6:04 AM, Jonathan Cameron wrote:
> > On Thu, 15 May 2025 16:52:15 -0500
> > "Bowman, Terry" <terry.bowman@amd.com> wrote:
> >  
> >> On 4/25/2025 8:18 AM, Jonathan Cameron wrote:  
> >>> On Thu, 24 Apr 2025 09:17:45 -0500
> >>> "Bowman, Terry" <terry.bowman@amd.com> wrote:
> >>>    
> >>>> On 4/23/2025 10:04 AM, Jonathan Cameron wrote:    
> >>>>> On Wed, 26 Mar 2025 20:47:05 -0500
> >>>>> Terry Bowman <terry.bowman@amd.com> wrote:
> >>>>>      
> >>>>>> The AER service driver includes a CXL-specific kfifo, intended to forward
> >>>>>> CXL errors to the CXL driver. However, the forwarding functionality is
> >>>>>> currently unimplemented. Update the AER driver to enable error forwarding
> >>>>>> to the CXL driver.
> >>>>>>
> >>>>>> Modify the AER service driver's handle_error_source(), which is called from
> >>>>>> process_aer_err_devices(), to distinguish between PCIe and CXL errors.
> >>>>>>
> >>>>>> Rename and update is_internal_error() to is_cxl_error(). Ensuring it
> >>>>>> checks both the 'struct aer_info::is_cxl' flag and the AER internal error
> >>>>>> masks.
> >>>>>>
> >>>>>> If the error is a standard PCIe error then continue calling pcie_aer_handle_error()
> >>>>>> as done in the current AER driver.
> >>>>>>
> >>>>>> If the error is a CXL-related error then forward it to the CXL driver for
> >>>>>> handling using the kfifo mechanism.
> >>>>>>
> >>>>>> Introduce a new function forward_cxl_error(), which constructs a CXL
> >>>>>> protocol error context using cxl_create_prot_err_info(). This context is
> >>>>>> then passed to the CXL driver via kfifo using a 'struct work_struct'.
> >>>>>>
> >>>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>      
> >>>>> Hi Terry,
> >>>>>
> >>>>> Finally got back to this.  I'm not following how some of the reference
> >>>>> counting in here is working.  It might be fine but there is a lot
> >>>>> taking then dropping device references - some of which are taken again later.
> >>>>>      
> >>>>>> @@ -1082,10 +1094,44 @@ static void cxl_rch_enable_rcec(struct pci_dev *rcec)
> >>>>>>  	pci_info(rcec, "CXL: Internal errors unmasked");
> >>>>>>  }
> >>>>>>  
> >>>>>> +static void forward_cxl_error(struct pci_dev *_pdev, struct aer_err_info *info)
> >>>>>> +{
> >>>>>> +	int severity = info->severity;      
> >>>>> So far this variable isn't really justified.  Maybe it makes sense later in the
> >>>>> series?      
> >>>> This is used below in call to cxl_create_prot_err_info().    
> >>> Sure, but why not just do
> >>>
> >>> if (cxl_create_prot_error_info(pdev, info->severity, &wd.err_info)) {
> >>>
> >>> There isn't anything modifying info->severity in between so that local
> >>> variable is just padding out the code to no real benefit.
> >>>
> >>>    
> >>>>>> +		pci_err(pdev, "Failed to create CXL protocol error information");
> >>>>>> +		return;
> >>>>>> +	}
> >>>>>> +
> >>>>>> +	struct device *cxl_dev __free(put_device) = get_device(err_info->dev);      
> >>>>> Also this one.  A reference was acquired and dropped in cxl_create_prot_err_info()
> >>>>> followed by retaking it here.  How do we know it is still about by this call
> >>>>> and once we pull it off the kfifo later?      
> >>>> Yes, this is a problem I realized after sending the series.
> >>>>
> >>>> The device reference incr could be changed for all the devices to the non-cleanup
> >>>> variety. Then would add the reference incr in the caller after calling cxl_create_prot_err_info().
> >>>> I need to look at the other calls to to cxl_create_prot_err_info() as well.
> >>>>
> >>>> In addition, I think we should consider adding the CXL RAS status into the struct cxl_prot_err_info.
> >>>> This would eliminate the need for further accesses to the CXL device after being dequeued from the
> >>>> fifo. Thoughts?    
> >>> That sounds like a reasonable solution to me.
> >>>
> >>> Jonathan    
> >> Hi Jonathan,  
> > Hi Terry,
> >
> > Sorry for delay - travel etc...
> >  
> >> Is it sufficient to rely on correctly implemented reference counting implementation instead
> >> of caching the RAS status I mentioned earlier?
> >>
> >> I have the next revision coded to 'get' the CXL erring device's reference count in the AER
> >> driver before enqueuing in the kfifo and then added a reference count 'put' in the CXL driver
> >> after dequeuing and handling/logging. This is an alternative to what I mentioned earlier reading
> >> the RAS status and caching it. One more question: is it OK to implement the get and put (of
> >> the same object) in different drivers?  
> > It's definitely unusual.  If there is anything similar to point at I'd be happier than
> > this 'innovation' showing up here first. 
> >  
> >> If we need to read and cache the RAS status before the kfifo enqueue there will be some other
> >> details to work through.  
> > This still smells like the cleaner solution to me, but depends on those details..
> >
> > Jonathan  
> 
> In this case I believe we will need to move the CE handling (RAS status reading and clearing) before
> the kfifo enqueue. I think this is necessary because CXL errors may continue to be received and we
> don't want their status's combined when reading or clearing. I can refactor cxl_handle_ras()/
> cxl_handle_cor_ras() to return the RAS status value and remove the trace logging (to instead be
> called after kfifo dequeue).
> 
> This leaves the UCE case. It's worth mentioning the UCE flow is different than the the CE case
> because it uses the top-bottom traversal starting at the erring device. Correct me if I'm wrong
> this would be handled before the kfifo as well. The handling and logging in the UCE case are
> baked together. The UCE flow would therefore need to include the trace logging during handling.
> 
> Another flow is the PCIe EP errors. The PCIe EP CE and UCE handlers remain and can call the
> refactored cxl_handle_ras()/cxl_handle_cor_ras() and then trace log afterwards. This is no
> issue.
> 
> This leaves only CE trace logging to be called after the kfifo dequeue. This is what doesn't
> feel right and what I wanted to draw attention to.
> 
> All this to say: very little work will be done after the kfifo dequeue. Most of the work in
> the kfifo implementation would be before the kfifo enqueuing in the CXL create_prot_error_info()
> callback. I am concerned the balance of work done before and after the kfifo enqueue/dequeue
> will be very asymmetric with little value provided from the kfifo.
> 
As per the Discord chat - if you look up the device again from BDF or similar and get this
info once you have the right locks post kfifo, all should be fine, as any race will be easy
to resolve by doing nothing if the driver has gone away.
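
A minimal userspace sketch of that idea, assuming the kfifo entry carries only
a BDF-style key rather than a device pointer. All names here (struct fake_dev,
fake_bus, lookup_by_bdf(), handle_dequeued_error()) are hypothetical stand-ins,
not kernel APIs; real code would use the proper PCI lookup and take the device
lock before checking the driver:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a device; not a kernel structure. */
struct fake_dev {
	uint16_t bdf;         /* bus/device/function key stored in the queue entry */
	bool driver_bound;    /* models dev->driver != NULL */
	unsigned int handled; /* number of errors handled for this device */
};

/* Hypothetical device table; models the bus the worker searches. */
static struct fake_dev fake_bus[] = {
	{ .bdf = 0x0100, .driver_bound = true },
	{ .bdf = 0x0200, .driver_bound = false },
};

/* Look the device up again from its key; NULL if it has gone away. */
static struct fake_dev *lookup_by_bdf(uint16_t bdf)
{
	for (size_t i = 0; i < sizeof(fake_bus) / sizeof(fake_bus[0]); i++)
		if (fake_bus[i].bdf == bdf)
			return &fake_bus[i];
	return NULL;
}

/*
 * Post-dequeue handling. Returns true if the error was handled. If the
 * device or its driver has gone away, the race is resolved by doing nothing.
 */
static bool handle_dequeued_error(uint16_t bdf)
{
	struct fake_dev *dev = lookup_by_bdf(bdf);

	if (!dev || !dev->driver_bound)
		return false;

	dev->handled++;
	return true;
}
```

The point of the sketch is only that the queue entry holds no pointer whose
lifetime needs managing across the enqueue/dequeue boundary.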

Jonathan

> -Terry
> 


* Re: [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver
  2025-05-21 18:34               ` Jonathan Cameron
@ 2025-05-21 23:30                 ` Bowman, Terry
  0 siblings, 0 replies; 76+ messages in thread
From: Bowman, Terry @ 2025-05-21 23:30 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On 5/21/2025 1:34 PM, Jonathan Cameron wrote:
> On Tue, 20 May 2025 08:21:18 -0500
> "Bowman, Terry" <terry.bowman@amd.com> wrote:
> 
>> On 5/20/2025 6:04 AM, Jonathan Cameron wrote:
>>> On Thu, 15 May 2025 16:52:15 -0500
>>> "Bowman, Terry" <terry.bowman@amd.com> wrote:
>>>  
>>>> On 4/25/2025 8:18 AM, Jonathan Cameron wrote:  
>>>>> On Thu, 24 Apr 2025 09:17:45 -0500
>>>>> "Bowman, Terry" <terry.bowman@amd.com> wrote:
>>>>>    
>>>>>> On 4/23/2025 10:04 AM, Jonathan Cameron wrote:    
>>>>>>> On Wed, 26 Mar 2025 20:47:05 -0500
>>>>>>> Terry Bowman <terry.bowman@amd.com> wrote:
>>>>>>>      
>>>>>>>> The AER service driver includes a CXL-specific kfifo, intended to forward
>>>>>>>> CXL errors to the CXL driver. However, the forwarding functionality is
>>>>>>>> currently unimplemented. Update the AER driver to enable error forwarding
>>>>>>>> to the CXL driver.
>>>>>>>>
>>>>>>>> Modify the AER service driver's handle_error_source(), which is called from
>>>>>>>> process_aer_err_devices(), to distinguish between PCIe and CXL errors.
>>>>>>>>
>>>>>>>> Rename and update is_internal_error() to is_cxl_error(). Ensuring it
>>>>>>>> checks both the 'struct aer_info::is_cxl' flag and the AER internal error
>>>>>>>> masks.
>>>>>>>>
>>>>>>>> If the error is a standard PCIe error then continue calling pcie_aer_handle_error()
>>>>>>>> as done in the current AER driver.
>>>>>>>>
>>>>>>>> If the error is a CXL-related error then forward it to the CXL driver for
>>>>>>>> handling using the kfifo mechanism.
>>>>>>>>
>>>>>>>> Introduce a new function forward_cxl_error(), which constructs a CXL
>>>>>>>> protocol error context using cxl_create_prot_err_info(). This context is
>>>>>>>> then passed to the CXL driver via kfifo using a 'struct work_struct'.
>>>>>>>>
>>>>>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>      
>>>>>>> Hi Terry,
>>>>>>>
>>>>>>> Finally got back to this.  I'm not following how some of the reference
>>>>>>> counting in here is working.  It might be fine but there is a lot
>>>>>>> taking then dropping device references - some of which are taken again later.
>>>>>>>      
>>>>>>>> @@ -1082,10 +1094,44 @@ static void cxl_rch_enable_rcec(struct pci_dev *rcec)
>>>>>>>>  	pci_info(rcec, "CXL: Internal errors unmasked");
>>>>>>>>  }
>>>>>>>>  
>>>>>>>> +static void forward_cxl_error(struct pci_dev *_pdev, struct aer_err_info *info)
>>>>>>>> +{
>>>>>>>> +	int severity = info->severity;      
>>>>>>> So far this variable isn't really justified.  Maybe it makes sense later in the
>>>>>>> series?      
>>>>>> This is used below in call to cxl_create_prot_err_info().    
>>>>> Sure, but why not just do
>>>>>
>>>>> if (cxl_create_prot_error_info(pdev, info->severity, &wd.err_info)) {
>>>>>
>>>>> There isn't anything modifying info->severity in between so that local
>>>>> variable is just padding out the code to no real benefit.
>>>>>
>>>>>    
>>>>>>>> +		pci_err(pdev, "Failed to create CXL protocol error information");
>>>>>>>> +		return;
>>>>>>>> +	}
>>>>>>>> +
>>>>>>>> +	struct device *cxl_dev __free(put_device) = get_device(err_info->dev);      
>>>>>>> Also this one.  A reference was acquired and dropped in cxl_create_prot_err_info()
>>>>>>> followed by retaking it here.  How do we know it is still about by this call
>>>>>>> and once we pull it off the kfifo later?      
>>>>>> Yes, this is a problem I realized after sending the series.
>>>>>>
>>>>>> The device reference incr could be changed for all the devices to the non-cleanup
>>>>>> variety. Then would add the reference incr in the caller after calling cxl_create_prot_err_info().
>>>>>> I need to look at the other calls to cxl_create_prot_err_info() as well.
>>>>>>
>>>>>> In addition, I think we should consider adding the CXL RAS status into the struct cxl_prot_err_info.
>>>>>> This would eliminate the need for further accesses to the CXL device after being dequeued from the
>>>>>> fifo. Thoughts?    
>>>>> That sounds like a reasonable solution to me.
>>>>>
>>>>> Jonathan    
>>>> Hi Jonathan,  
>>> Hi Terry,
>>>
>>> Sorry for delay - travel etc...
>>>  
>>>> Is it sufficient to rely on a correctly implemented reference counting scheme instead
>>>> of caching the RAS status I mentioned earlier?
>>>>
>>>> I have the next revision coded to 'get' the CXL erring device's reference count in the AER
>>>> driver before enqueuing in the kfifo and then added a reference count 'put' in the CXL driver
>>>> after dequeuing and handling/logging. This is an alternative to what I mentioned earlier reading
>>>> the RAS status and caching it. One more question: is it OK to implement the get and put (of
>>>> the same object) in different drivers?  
>>> It's definitely unusual.  If there is anything similar to point at I'd be happier than
>>> this 'innovation' showing up here first. 
>>>  
>>>> If we need to read and cache the RAS status before the kfifo enqueue there will be some other
>>>> details to work through.  
>>> This still smells like the cleaner solution to me, but depends on those details..
>>>
>>> Jonathan  
>>
>> In this case I believe we will need to move the CE handling (RAS status reading and clearing) before
>> the kfifo enqueue. I think this is necessary because CXL errors may continue to be received and we
>> don't want their statuses combined when reading or clearing. I can refactor cxl_handle_ras()/
>> cxl_handle_cor_ras() to return the RAS status value and remove the trace logging (to instead be
>> called after kfifo dequeue).
>>
>> This leaves the UCE case. It's worth mentioning the UCE flow is different from the CE case
>> because it uses the top-bottom traversal starting at the erring device. Correct me if I'm wrong,
>> but this would be handled before the kfifo as well. The handling and logging in the UCE case are
>> baked together. The UCE flow would therefore need to include the trace logging during handling.
>>
>> Another flow is the PCIe EP errors. The PCIe EP CE and UCE handlers remain and can call the
>> refactored cxl_handle_ras()/cxl_handle_cor_ras() and then trace log afterwards. This is no
>> issue.
>>
>> This leaves only CE trace logging to be called after the kfifo dequeue. This is what doesn't
>> feel right and what I wanted to draw attention to.
>>
>> All this to say: very little work will be done after the kfifo dequeue. Most of the work in
>> the kfifo implementation would be before the kfifo enqueuing in the CXL create_prot_error_info()
>> callback. I am concerned the balance of work done before and after the kfifo enqueue/dequeue
>> will be very asymmetric with little value provided from the kfifo.
>>
> As per the Discord chat - if you look up the device again from BDF or similar and get this
> info once you have the right locks post kfifo, all should be fine, as any race will be easy
> to resolve by doing nothing if the driver has gone away.
> 
> Jonathan
> 
>> -Terry
>>
> 

Ok. I understand. Thanks.

-Terry

* Re: [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver
  2025-03-27  1:47 ` [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver Terry Bowman
  2025-03-27 17:13   ` Bjorn Helgaas
  2025-04-23 15:04   ` Jonathan Cameron
@ 2025-04-23 22:21   ` Gregory Price
  2 siblings, 0 replies; 76+ messages in thread
From: Gregory Price @ 2025-04-23 22:21 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Wed, Mar 26, 2025 at 08:47:05PM -0500, Terry Bowman wrote:
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 46123b70f496..d1df751cfe4b 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1123,8 +1169,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  
>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>  {
> -	cxl_rch_handle_error(dev, info);
> -	pci_aer_handle_error(dev, info);


This appears to remove the last reference to cxl_rch_handle_error(), and the
build throws a warning saying as much.

I see in patch 5 it's removed, should probably be removed in this patch
instead.

~Gregory

* [PATCH v8 05/16] PCI/AER: CXL driver dequeues CXL error forwarded from AER service driver
  2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (3 preceding siblings ...)
  2025-03-27  1:47 ` [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver Terry Bowman
@ 2025-03-27  1:47 ` Terry Bowman
  2025-03-27  4:43   ` kernel test robot
  2025-04-23 16:28   ` Jonathan Cameron
  2025-03-27  1:47 ` [PATCH v8 06/16] CXL/PCI: Introduce CXL uncorrectable protocol error 'recovery' Terry Bowman
                   ` (12 subsequent siblings)
  17 siblings, 2 replies; 76+ messages in thread
From: Terry Bowman @ 2025-03-27  1:47 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

The AER driver is now designed to forward CXL protocol errors to the CXL
driver. Update the CXL driver with functionality to dequeue the forwarded
CXL error from the kfifo. Also, update the CXL driver to process the CXL
protocol errors using CXL protocol error handlers.

First, move cxl_rch_handle_error_iter() from aer.c to cxl/core/ras.c.
Remove cxl_rch_handle_error() from aer.c as it is no longer needed.

Introduce function cxl_prot_err_work_fn() to dequeue work forwarded by the
AER service driver. This will begin the CXL protocol error processing
with the call to cxl_handle_prot_error().

Introduce cxl_handle_prot_error() to differentiate between Restricted CXL
Host (RCH) protocol errors and CXL virtual host (VH) protocol errors.
RCH errors will be processed with a call to walk the associated Root
Complex Event Collector's (RCEC) secondary bus looking for the Root Complex
Integrated Endpoint (RCiEP) to handle the RCH error. Export pcie_walk_rcec()
so the CXL driver can walk the RCEC's downstream bus, searching for
the RCiEP.

VH correctable error (CE) processing will call the CXL CE handler if
present. VH uncorrectable errors (UCE) will call cxl_do_recovery(),
implemented as a stub for now and to be updated in a future patch. Export
pci_aer_clear_fatal_status() and pcie_clear_device_status(), used to clean up
AER status after handling.

Create cxl_driver::err_handler ('struct cxl_error_handlers') similar to
pci_driver::err_handler. Add handlers for CE and UCE CXL.io errors. Add
'struct cxl_prot_error_info' as a parameter to the CXL CE and UCE error
handlers.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/ras.c  | 102 +++++++++++++++++++++++++++++++++++++++-
 drivers/cxl/cxl.h       |  17 +++++++
 drivers/pci/pci.c       |   1 +
 drivers/pci/pci.h       |   6 ---
 drivers/pci/pcie/aer.c  |  42 +----------------
 drivers/pci/pcie/rcec.c |   1 +
 include/linux/aer.h     |   2 +
 include/linux/pci.h     |  10 ++++
 8 files changed, 133 insertions(+), 48 deletions(-)

diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index ecb60a5962de..eca8f11a05d9 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -139,8 +139,108 @@ int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
 
 	return 0;
 }
+EXPORT_SYMBOL_NS_GPL(cxl_create_prot_err_info, "CXL");
 
-struct work_struct cxl_prot_err_work;
+static void cxl_do_recovery(struct pci_dev *pdev) { }
+
+static int cxl_rch_handle_error_iter(struct pci_dev *pdev, void *data)
+{
+	struct cxl_prot_error_info *err_info = data;
+	const struct cxl_error_handlers *err_handler;
+	struct device *dev = err_info->dev;
+	struct cxl_driver *pdrv;
+
+	/*
+	 * The capability, status, and control fields in Device 0,
+	 * Function 0 DVSEC control the CXL functionality of the
+	 * entire device (CXL 3.0, 8.1.3).
+	 */
+	if (pdev->devfn != PCI_DEVFN(0, 0))
+		return 0;
+
+	/*
+	 * CXL Memory Devices must have the 502h class code set (CXL
+	 * 3.0, 8.1.12.1).
+	 */
+	if ((pdev->class >> 8) != PCI_CLASS_MEMORY_CXL)
+		return 0;
+
+	if (!is_cxl_memdev(dev) || !dev->driver)
+		return 0;
+
+	pdrv = to_cxl_drv(dev->driver);
+	if (!pdrv || !pdrv->err_handler)
+		return 0;
+
+	err_handler = pdrv->err_handler;
+	if (err_info->severity == AER_CORRECTABLE) {
+		if (err_handler->cor_error_detected)
+			err_handler->cor_error_detected(dev, err_info);
+	} else if (err_handler->error_detected) {
+		cxl_do_recovery(pdev);
+	}
+
+	return 0;
+}
+
+static void cxl_handle_prot_error(struct pci_dev *pdev, struct cxl_prot_error_info *err_info)
+{
+	if (!pdev || !err_info)
+		return;
+
+	/*
+	 * Internal errors of an RCEC indicate an AER error in an
+	 * RCH's downstream port. Check and handle them in the CXL.mem
+	 * device driver.
+	 */
+	if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_EC)
+		return pcie_walk_rcec(pdev, cxl_rch_handle_error_iter, err_info);
+
+	if (err_info->severity == AER_CORRECTABLE) {
+		struct device *dev __free(put_device) = get_device(err_info->dev);
+		struct cxl_driver *pdrv;
+		int aer = pdev->aer_cap;
+
+		if (!dev || !dev->driver)
+			return;
+
+		if (aer) {
+			int ras_status;
+
+			pci_read_config_dword(pdev, aer + PCI_ERR_COR_STATUS, &ras_status);
+			pci_write_config_dword(pdev, aer + PCI_ERR_COR_STATUS,
+					       ras_status);
+		}
+
+		pdrv = to_cxl_drv(dev->driver);
+		if (!pdrv || !pdrv->err_handler ||
+		    !pdrv->err_handler->cor_error_detected)
+			return;
+
+		pdrv->err_handler->cor_error_detected(dev, err_info);
+		pcie_clear_device_status(pdev);
+	} else {
+		cxl_do_recovery(pdev);
+	}
+}
+
+static void cxl_prot_err_work_fn(struct work_struct *work)
+{
+	struct cxl_prot_err_work_data wd;
+
+	while (cxl_prot_err_kfifo_get(&wd)) {
+		struct cxl_prot_error_info *err_info = &wd.err_info;
+		struct device *dev __free(put_device) = get_device(err_info->dev);
+		struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(err_info->pdev);
+
+		if (!dev || !pdev)
+			continue;
+
+		cxl_handle_prot_error(pdev, err_info);
+	}
+}
+
+static DECLARE_WORK(cxl_prot_err_work, cxl_prot_err_work_fn);
 
 int cxl_ras_init(void)
 {
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index be8a7dc77719..73cddd2c921e 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -11,6 +11,8 @@
 #include <linux/log2.h>
 #include <linux/node.h>
 #include <linux/io.h>
+#include <linux/pci.h>
+#include <linux/aer.h>
 
 extern const struct nvdimm_security_ops *cxl_security_ops;
 
@@ -786,6 +788,20 @@ static inline int cxl_root_decoder_autoremove(struct device *host,
 }
 int cxl_endpoint_autoremove(struct cxl_memdev *cxlmd, struct cxl_port *endpoint);
 
+int cxl_create_prot_err_info(struct pci_dev *pdev, int severity,
+			     struct cxl_prot_error_info *err_info);
+
+/* CXL bus error event callbacks */
+struct cxl_error_handlers {
+	/* CXL bus error detected on this device */
+	pci_ers_result_t (*error_detected)(struct device *dev,
+					   struct cxl_prot_error_info *err_info);
+
+	/* Allow device driver to record more details of a correctable error */
+	void (*cor_error_detected)(struct device *dev,
+				   struct cxl_prot_error_info *err_info);
+};
+
 /**
  * struct cxl_endpoint_dvsec_info - Cached DVSEC info
  * @mem_enabled: cached value of mem_enabled in the DVSEC at init time
@@ -820,6 +836,7 @@ struct cxl_driver {
 	void (*remove)(struct device *dev);
 	struct device_driver drv;
 	int id;
+	const struct cxl_error_handlers *err_handler;
 };
 
 #define to_cxl_drv(__drv)	container_of_const(__drv, struct cxl_driver, drv)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index a1d75f40017e..d80c705d683c 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -2320,6 +2320,7 @@ void pcie_clear_device_status(struct pci_dev *dev)
 	pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &sta);
 	pcie_capability_write_word(dev, PCI_EXP_DEVSTA, sta);
 }
+EXPORT_SYMBOL_NS_GPL(pcie_clear_device_status, "CXL");
 #endif
 
 /**
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index eed098c134a6..c32eab22c0b2 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -593,16 +593,10 @@ static inline bool pci_dpc_recovered(struct pci_dev *pdev) { return false; }
 void pci_rcec_init(struct pci_dev *dev);
 void pci_rcec_exit(struct pci_dev *dev);
 void pcie_link_rcec(struct pci_dev *rcec);
-void pcie_walk_rcec(struct pci_dev *rcec,
-		    int (*cb)(struct pci_dev *, void *),
-		    void *userdata);
 #else
 static inline void pci_rcec_init(struct pci_dev *dev) { }
 static inline void pci_rcec_exit(struct pci_dev *dev) { }
 static inline void pcie_link_rcec(struct pci_dev *rcec) { }
-static inline void pcie_walk_rcec(struct pci_dev *rcec,
-				  int (*cb)(struct pci_dev *, void *),
-				  void *userdata) { }
 #endif
 
 #ifdef CONFIG_PCI_ATS
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d1df751cfe4b..763ec6aa1a9a 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -288,6 +288,7 @@ void pci_aer_clear_fatal_status(struct pci_dev *dev)
 	if (status)
 		pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, status);
 }
+EXPORT_SYMBOL_GPL(pci_aer_clear_fatal_status);
 
 /**
  * pci_aer_raw_clear_status - Clear AER error registers.
@@ -1018,47 +1019,6 @@ static bool is_cxl_error(struct aer_err_info *info)
 	return is_internal_error(info);
 }
 
-static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
-{
-	struct aer_err_info *info = (struct aer_err_info *)data;
-	const struct pci_error_handlers *err_handler;
-
-	if (!is_cxl_mem_dev(dev) || !cxl_error_is_native(dev))
-		return 0;
-
-	/* protect dev->driver */
-	device_lock(&dev->dev);
-
-	err_handler = dev->driver ? dev->driver->err_handler : NULL;
-	if (!err_handler)
-		goto out;
-
-	if (info->severity == AER_CORRECTABLE) {
-		if (err_handler->cor_error_detected)
-			err_handler->cor_error_detected(dev);
-	} else if (err_handler->error_detected) {
-		if (info->severity == AER_NONFATAL)
-			err_handler->error_detected(dev, pci_channel_io_normal);
-		else if (info->severity == AER_FATAL)
-			err_handler->error_detected(dev, pci_channel_io_frozen);
-	}
-out:
-	device_unlock(&dev->dev);
-	return 0;
-}
-
-static void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
-{
-	/*
-	 * Internal errors of an RCEC indicate an AER error in an
-	 * RCH's downstream port. Check and handle them in the CXL.mem
-	 * device driver.
-	 */
-	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
-	    is_internal_error(info))
-		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
-}
-
 static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
 {
 	bool *handles_cxl = data;
diff --git a/drivers/pci/pcie/rcec.c b/drivers/pci/pcie/rcec.c
index d0bcd141ac9c..fb6cf6449a1d 100644
--- a/drivers/pci/pcie/rcec.c
+++ b/drivers/pci/pcie/rcec.c
@@ -145,6 +145,7 @@ void pcie_walk_rcec(struct pci_dev *rcec, int (*cb)(struct pci_dev *, void *),
 
 	walk_rcec(walk_rcec_helper, &rcec_data);
 }
+EXPORT_SYMBOL_NS_GPL(pcie_walk_rcec, "CXL");
 
 void pci_rcec_init(struct pci_dev *dev)
 {
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 761d6f5cd792..8f815f34d447 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -66,12 +66,14 @@ struct cxl_prot_err_work_data {
 
 #if defined(CONFIG_PCIEAER)
 int pci_aer_clear_nonfatal_status(struct pci_dev *dev);
+void pci_aer_clear_fatal_status(struct pci_dev *dev);
 int pcie_aer_is_native(struct pci_dev *dev);
 #else
 static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
 {
 	return -EINVAL;
 }
+static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
 static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
 #endif
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index af83230bef1a..56015721be22 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1802,6 +1802,9 @@ extern bool pcie_ports_native;
 
 int pcie_set_target_speed(struct pci_dev *port, enum pci_bus_speed speed_req,
 			  bool use_lt);
+void pcie_walk_rcec(struct pci_dev *rcec,
+		    int (*cb)(struct pci_dev *, void *),
+		    void *userdata);
 #else
 #define pcie_ports_disabled	true
 #define pcie_ports_native	false
@@ -1812,8 +1815,15 @@ static inline int pcie_set_target_speed(struct pci_dev *port,
 {
 	return -EOPNOTSUPP;
 }
+
+static inline void pcie_walk_rcec(struct pci_dev *rcec,
+				  int (*cb)(struct pci_dev *, void *),
+				  void *userdata) { }
+
 #endif
 
+void pcie_clear_device_status(struct pci_dev *dev);
+
 #define PCIE_LINK_STATE_L0S		(BIT(0) | BIT(1)) /* Upstr/dwnstr L0s */
 #define PCIE_LINK_STATE_L1		BIT(2)	/* L1 state */
 #define PCIE_LINK_STATE_L1_1		BIT(3)	/* ASPM L1.1 state */
-- 
2.34.1


* Re: [PATCH v8 05/16] PCI/AER: CXL driver dequeues CXL error forwarded from AER service driver
  2025-03-27  1:47 ` [PATCH v8 05/16] PCI/AER: CXL driver dequeues CXL error forwarded from AER service driver Terry Bowman
@ 2025-03-27  4:43   ` kernel test robot
  2025-04-23 16:28   ` Jonathan Cameron
  1 sibling, 0 replies; 76+ messages in thread
From: kernel test robot @ 2025-03-27  4:43 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati
  Cc: oe-kbuild-all

Hi Terry,

kernel test robot noticed the following build errors:

[auto build test ERROR on aae0594a7053c60b82621136257c8b648c67b512]

url:    https://github.com/intel-lab-lkp/linux/commits/Terry-Bowman/PCI-CXL-Introduce-PCIe-helper-function-pcie_is_cxl/20250327-095738
base:   aae0594a7053c60b82621136257c8b648c67b512
patch link:    https://lore.kernel.org/r/20250327014717.2988633-6-terry.bowman%40amd.com
patch subject: [PATCH v8 05/16] PCI/AER: CXL driver dequeues CXL error forwarded from AER service driver
config: loongarch-randconfig-001-20250327 (https://download.01.org/0day-ci/archive/20250327/202503271234.IKMoGynt-lkp@intel.com/config)
compiler: loongarch64-linux-gcc (GCC) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250327/202503271234.IKMoGynt-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202503271234.IKMoGynt-lkp@intel.com/

All errors (new ones prefixed by >>):

   drivers/cxl/core/ras.c: In function 'cxl_handle_prot_error':
>> drivers/cxl/core/ras.c:202:33: error: 'struct pci_dev' has no member named 'aer_cap'; did you mean 'ats_cap'?
     202 |                 int aer = pdev->aer_cap;
         |                                 ^~~~~~~
         |                                 ats_cap


vim +202 drivers/cxl/core/ras.c

   185	
   186	static void cxl_handle_prot_error(struct pci_dev *pdev, struct cxl_prot_error_info *err_info)
   187	{
   188		if (!pdev || !err_info)
   189			return;
   190	
   191		/*
   192		 * Internal errors of an RCEC indicate an AER error in an
   193		 * RCH's downstream port. Check and handle them in the CXL.mem
   194		 * device driver.
   195		 */
   196		if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_EC)
   197			return pcie_walk_rcec(pdev, cxl_rch_handle_error_iter, err_info);
   198	
   199		if (err_info->severity == AER_CORRECTABLE) {
   200			struct device *dev __free(put_device) = get_device(err_info->dev);
   201			struct cxl_driver *pdrv;
 > 202			int aer = pdev->aer_cap;
   203	
   204			if (!dev || !dev->driver)
   205				return;
   206	
   207			if (aer) {
   208				int ras_status;
   209	
   210				pci_read_config_dword(pdev, aer + PCI_ERR_COR_STATUS, &ras_status);
   211				pci_write_config_dword(pdev, aer + PCI_ERR_COR_STATUS,
   212						       ras_status);
   213			}
   214	
   215			pdrv = to_cxl_drv(dev->driver);
   216			if (!pdrv || !pdrv->err_handler ||
   217			    !pdrv->err_handler->cor_error_detected)
   218				return;
   219	
   220			pdrv->err_handler->cor_error_detected(dev, err_info);
   221			pcie_clear_device_status(pdev);
   222		} else {
   223			cxl_do_recovery(pdev);
   224		}
   225	}
   226	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

* Re: [PATCH v8 05/16] PCI/AER: CXL driver dequeues CXL error forwarded from AER service driver
  2025-03-27  1:47 ` [PATCH v8 05/16] PCI/AER: CXL driver dequeues CXL error forwarded from AER service driver Terry Bowman
  2025-03-27  4:43   ` kernel test robot
@ 2025-04-23 16:28   ` Jonathan Cameron
  2025-04-24 15:03     ` Bowman, Terry
  1 sibling, 1 reply; 76+ messages in thread
From: Jonathan Cameron @ 2025-04-23 16:28 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Wed, 26 Mar 2025 20:47:06 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER driver is now designed to forward CXL protocol errors to the CXL
> driver. Update the CXL driver with functionality to dequeue the forwarded
> CXL error from the kfifo. Also, update the CXL driver to process the CXL
> protocol errors using CXL protocol error handlers.
> 
> First, move cxl_rch_handle_error_iter() from aer.c to cxl/core/ras.c.
> Remove cxl_rch_handle_error() from aer.c as it is no longer needed.
> 
> Introduce function cxl_prot_err_work_fn() to dequeue work forwarded by the
> AER service driver. This will begin the CXL protocol error processing
> with the call to cxl_handle_prot_error().
> 
> Introduce cxl_handle_prot_error() to differentiate between Restricted CXL
> Host (RCH) protocol errors and CXL virtual host (VH) protocol errors.
> RCH errors will be processed with a call to walk the associated Root
> Complex Event Collector's (RCEC) secondary bus looking for the Root Complex
> Integrated Endpoint (RCiEP) to handle the RCH error. Export pcie_walk_rcec()
> so the CXL driver can walk the RCEC's downstream bus, searching for
> the RCiEP.
> 
> VH correctable error (CE) processing will call the CXL CE handler if
> present. VH uncorrectable errors (UCE) will call cxl_do_recovery(),
> implemented as a stub for now and to be updated in a future patch. Export
> pci_aer_clear_fatal_status() and pcie_clear_device_status(), used to clean up
> AER status after handling.
> 
> Create cxl_driver::err_handler ('struct cxl_error_handlers') similar to
> pci_driver::err_handler. Add handlers for CE and UCE CXL.io errors. Add
> 'struct cxl_prot_error_info' as a parameter to the CXL CE and UCE error
> handlers.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/core/ras.c  | 102 +++++++++++++++++++++++++++++++++++++++-
>  drivers/cxl/cxl.h       |  17 +++++++
>  drivers/pci/pci.c       |   1 +
>  drivers/pci/pci.h       |   6 ---
>  drivers/pci/pcie/aer.c  |  42 +----------------
>  drivers/pci/pcie/rcec.c |   1 +
>  include/linux/aer.h     |   2 +
>  include/linux/pci.h     |  10 ++++
>  8 files changed, 133 insertions(+), 48 deletions(-)
> 
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index ecb60a5962de..eca8f11a05d9 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -139,8 +139,108 @@ int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
>  
>  	return 0;
>  }
> +EXPORT_SYMBOL_NS_GPL(cxl_create_prot_err_info, "CXL");
>  
> -struct work_struct cxl_prot_err_work;
> +static void cxl_do_recovery(struct pci_dev *pdev) { }
> +
> +static int cxl_rch_handle_error_iter(struct pci_dev *pdev, void *data)
> +{
> +	struct cxl_prot_error_info *err_info = data;
> +	const struct cxl_error_handlers *err_handler;
> +	struct device *dev = err_info->dev;
> +	struct cxl_driver *pdrv;
> +
> +	/*
> +	 * The capability, status, and control fields in Device 0,
> +	 * Function 0 DVSEC control the CXL functionality of the
> +	 * entire device (CXL 3.0, 8.1.3).
> +	 */
> +	if (pdev->devfn != PCI_DEVFN(0, 0))
> +		return 0;
> +
> +	/*
> +	 * CXL Memory Devices must have the 502h class code set (CXL
> +	 * 3.0, 8.1.12.1).
> +	 */
> +	if ((pdev->class >> 8) != PCI_CLASS_MEMORY_CXL)
> +		return 0;
> +
> +	if (!is_cxl_memdev(dev) || !dev->driver)
> +		return 0;
> +
> +	pdrv = to_cxl_drv(dev->driver);
> +	if (!pdrv || !pdrv->err_handler)
> +		return 0;
> +
> +	err_handler = pdrv->err_handler;
> +	if (err_info->severity == AER_CORRECTABLE) {
> +		if (err_handler->cor_error_detected)
> +			err_handler->cor_error_detected(dev, err_info);
> +	} else if (err_handler->error_detected) {
> +		cxl_do_recovery(pdev);
> +	}
> +
> +	return 0;
> +}
> +
> +static void cxl_handle_prot_error(struct pci_dev *pdev, struct cxl_prot_error_info *err_info)
> +{
> +	if (!pdev || !err_info)

Are these real potential conditions?  If so can we have a comment on why.
If this is defensive only, do we need it? 
Looks like the caller below checked pdev already.

> +		return;
> +
> +	/*
> +	 * Internal errors of an RCEC indicate an AER error in an
> +	 * RCH's downstream port. Check and handle them in the CXL.mem
> +	 * device driver.
> +	 */
> +	if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_EC)
> +		return pcie_walk_rcec(pdev, cxl_rch_handle_error_iter, err_info);
> +
> +	if (err_info->severity == AER_CORRECTABLE) {
> +		struct device *dev __free(put_device) = get_device(err_info->dev);

Similar question around lifetimes. The caller already got this. Why again?

> +		struct cxl_driver *pdrv;

calling a cxl driver pdrv seems odd.  cdrv maybe?

> +		int aer = pdev->aer_cap;
> +
> +		if (!dev || !dev->driver)
> +			return;
> +
> +		if (aer) {
> +			int ras_status;
> +
> +			pci_read_config_dword(pdev, aer + PCI_ERR_COR_STATUS, &ras_status);

If we get multiple bits set in this register, can this wipe out ones we haven't noticed
anywhere else in the handling?  Bad TLP etc.  Maybe we need to ensure this only clears
the internal error bit?

> +			pci_write_config_dword(pdev, aer + PCI_ERR_COR_STATUS,
> +					       ras_status);
> +		}
> +
> +		pdrv = to_cxl_drv(dev->driver);
> +		if (!pdrv || !pdrv->err_handler ||
> +		    !pdrv->err_handler->cor_error_detected)
> +			return;
> +
> +		pdrv->err_handler->cor_error_detected(dev, err_info);
> +		pcie_clear_device_status(pdev);
> +	} else {
> +		cxl_do_recovery(pdev);
> +	}
> +}
> +
> +static void cxl_prot_err_work_fn(struct work_struct *work)
> +{
> +	struct cxl_prot_err_work_data wd;
> +
> +	while (cxl_prot_err_kfifo_get(&wd)) {
> +		struct cxl_prot_error_info *err_info = &wd.err_info;
> +		struct device *dev __free(put_device) = get_device(err_info->dev);
> +		struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(err_info->pdev);
> +
> +		if (!dev || !pdev)
> +			continue;
> +
> +		cxl_handle_prot_error(pdev, err_info);
> +	}
> +}
> +
> +static DECLARE_WORK(cxl_prot_err_work, cxl_prot_err_work_fn);

Ah! Here it is... I think this can be in patch 3. With a stub of the function
(which is what the patch 3 description claims is there).

>  

Jonathan



* Re: [PATCH v8 05/16] PCI/AER: CXL driver dequeues CXL error forwarded from AER service driver
  2025-04-23 16:28   ` Jonathan Cameron
@ 2025-04-24 15:03     ` Bowman, Terry
  0 siblings, 0 replies; 76+ messages in thread
From: Bowman, Terry @ 2025-04-24 15:03 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati



On 4/23/2025 11:28 AM, Jonathan Cameron wrote:
> On Wed, 26 Mar 2025 20:47:06 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The AER driver is now designed to forward CXL protocol errors to the CXL
>> driver. Update the CXL driver with functionality to dequeue the forwarded
>> CXL error from the kfifo. Also, update the CXL driver to process the CXL
>> protocol errors using CXL protocol error handlers.
>>
>> First, move cxl_rch_handle_error_iter() from aer.c to cxl/core/ras.c.
>> Drop cxl_rch_handle_error() from aer.c as it is no longer needed.
>>
>> Introduce function cxl_prot_err_work_fn() to dequeue work forwarded by the
>> AER service driver. This will begin the CXL protocol error processing
>> with the call to cxl_handle_prot_error().
>>
>> Introduce cxl_handle_prot_error() to differentiate between Restricted CXL
>> Host (RCH) protocol errors and CXL virtual host (VH) protocol errors.
>> RCH errors will be processed with a call to walk the associated Root
>> Complex Event Collector's (RCEC) secondary bus looking for the Root Complex
>> Integrated Endpoint (RCiEP) to handle the RCH error. Export pcie_walk_rcec()
>> so the CXL driver can walk the RCEC's downstream bus, searching for
>> the RCiEP.
>>
>> VH correctable error (CE) processing will call the CXL CE handler if
>> present. VH uncorrectable errors (UCE) will call cxl_do_recovery(),
>> implemented as a stub for now and to be updated in a future patch. Export
>> pci_aer_clear_fatal_status() and pcie_clear_device_status(), which are used
>> to clear AER status after handling.
>>
>> Create cxl_driver::err_handler structure similar to
>> pci_driver::err_handler. Add handlers for CE and UCE CXL.io errors. Add
>> 'struct cxl_prot_error_info' as a parameter to the CXL CE and UCE error
>> handlers.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>  drivers/cxl/core/ras.c  | 102 +++++++++++++++++++++++++++++++++++++++-
>>  drivers/cxl/cxl.h       |  17 +++++++
>>  drivers/pci/pci.c       |   1 +
>>  drivers/pci/pci.h       |   6 ---
>>  drivers/pci/pcie/aer.c  |  42 +----------------
>>  drivers/pci/pcie/rcec.c |   1 +
>>  include/linux/aer.h     |   2 +
>>  include/linux/pci.h     |  10 ++++
>>  8 files changed, 133 insertions(+), 48 deletions(-)
>>
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index ecb60a5962de..eca8f11a05d9 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
>> @@ -139,8 +139,108 @@ int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
>>  
>>  	return 0;
>>  }
>> +EXPORT_SYMBOL_NS_GPL(cxl_create_prot_err_info, "CXL");
>>  
>> -struct work_struct cxl_prot_err_work;
>> +static void cxl_do_recovery(struct pci_dev *pdev) { }
>> +
>> +static int cxl_rch_handle_error_iter(struct pci_dev *pdev, void *data)
>> +{
>> +	struct cxl_prot_error_info *err_info = data;
>> +	const struct cxl_error_handlers *err_handler;
>> +	struct device *dev = err_info->dev;
>> +	struct cxl_driver *pdrv;
>> +
>> +	/*
>> +	 * The capability, status, and control fields in Device 0,
>> +	 * Function 0 DVSEC control the CXL functionality of the
>> +	 * entire device (CXL 3.0, 8.1.3).
>> +	 */
>> +	if (pdev->devfn != PCI_DEVFN(0, 0))
>> +		return 0;
>> +
>> +	/*
>> +	 * CXL Memory Devices must have the 502h class code set (CXL
>> +	 * 3.0, 8.1.12.1).
>> +	 */
>> +	if ((pdev->class >> 8) != PCI_CLASS_MEMORY_CXL)
>> +		return 0;
>> +
>> +	if (!is_cxl_memdev(dev) || !dev->driver)
>> +		return 0;
>> +
>> +	pdrv = to_cxl_drv(dev->driver);
>> +	if (!pdrv || !pdrv->err_handler)
>> +		return 0;
>> +
>> +	err_handler = pdrv->err_handler;
>> +	if (err_info->severity == AER_CORRECTABLE) {
>> +		if (err_handler->cor_error_detected)
>> +			err_handler->cor_error_detected(dev, err_info);
>> +	} else if (err_handler->error_detected) {
>> +		cxl_do_recovery(pdev);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static void cxl_handle_prot_error(struct pci_dev *pdev, struct cxl_prot_error_info *err_info)
>> +{
>> +	if (!pdev || !err_info)
> Are these real potential conditions?  If so can we have a comment on why.
> If this is defensive only, do we need it? 
> Looks like the caller below checked pdev already.
Yes, these checks can be removed.

>> +		return;
>> +
>> +	/*
>> +	 * Internal errors of an RCEC indicate an AER error in an
>> +	 * RCH's downstream port. Check and handle them in the CXL.mem
>> +	 * device driver.
>> +	 */
>> +	if (pci_pcie_type(pdev) == PCI_EXP_TYPE_RC_EC)
>> +		return pcie_walk_rcec(pdev, cxl_rch_handle_error_iter, err_info);
>> +
>> +	if (err_info->severity == AER_CORRECTABLE) {
>> +		struct device *dev __free(put_device) = get_device(err_info->dev);
> Similar question around lifetimes. The caller already got this. Why again?
Not necessary. I'll remove.
>> +		struct cxl_driver *pdrv;
> calling a cxl driver pdrv seems odd.  cdrv maybe?
Ok. I'll rename it to cdrv.

>> +		int aer = pdev->aer_cap;
>> +
>> +		if (!dev || !dev->driver)
>> +			return;
>> +
>> +		if (aer) {
>> +			int ras_status;
>> +
>> +			pci_read_config_dword(pdev, aer + PCI_ERR_COR_STATUS, &ras_status);
> If we get multiple bits set in this register, can this wipe out ones we haven't noticed
> anywhere else in the handling?  Bad TLP etc.  Maybe we need to ensure this only clears
> the internal error bit?
Good point. I'll fix. I'm going to rename 'ras_status' to 'aer_status' as well.
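
As a rough userspace sketch of the idea (not the actual change): the
Correctable Error Status register is write-1-to-clear, so writing back only
the internal error bit leaves other pending bits such as Bad TLP untouched.
The two bit values match include/uapi/linux/pci_regs.h; 'struct
fake_cor_status', fake_rw1c_write() and clear_cor_internal_only() are
hypothetical names used only to model the register here.

```c
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/pci_regs.h */
#define PCI_ERR_COR_BAD_TLP   0x00000040u
#define PCI_ERR_COR_INTERNAL  0x00004000u

/* Hypothetical stand-in for the RW1C Correctable Error Status register */
struct fake_cor_status {
	uint32_t bits;
};

/* Model a config write to an RW1C register: each 1 written clears that bit */
static void fake_rw1c_write(struct fake_cor_status *reg, uint32_t val)
{
	reg->bits &= ~val;
}

/*
 * Read the status, then write back only the internal error bit so that
 * any other correctable error bits remain set for their own handling.
 * Returns the status value that was read.
 */
static uint32_t clear_cor_internal_only(struct fake_cor_status *reg)
{
	uint32_t aer_status = reg->bits;

	fake_rw1c_write(reg, aer_status & PCI_ERR_COR_INTERNAL);
	return aer_status;
}
```

In the handler this would presumably come down to writing
'aer_status & PCI_ERR_COR_INTERNAL' back to PCI_ERR_COR_STATUS instead of
the raw value that was read.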

>> +			pci_write_config_dword(pdev, aer + PCI_ERR_COR_STATUS,
>> +					       ras_status);
>> +		}
>> +
>> +		pdrv = to_cxl_drv(dev->driver);
>> +		if (!pdrv || !pdrv->err_handler ||
>> +		    !pdrv->err_handler->cor_error_detected)
>> +			return;
>> +
>> +		pdrv->err_handler->cor_error_detected(dev, err_info);
>> +		pcie_clear_device_status(pdev);
>> +	} else {
>> +		cxl_do_recovery(pdev);
>> +	}
>> +}
>> +
>> +static void cxl_prot_err_work_fn(struct work_struct *work)
>> +{
>> +	struct cxl_prot_err_work_data wd;
>> +
>> +	while (cxl_prot_err_kfifo_get(&wd)) {
>> +		struct cxl_prot_error_info *err_info = &wd.err_info;
>> +		struct device *dev __free(put_device) = get_device(err_info->dev);
>> +		struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(err_info->pdev);
>> +
>> +		if (!dev || !pdev)
>> +			continue;
>> +
>> +		cxl_handle_prot_error(pdev, err_info);
>> +	}
>> +}
>> +
>> +static DECLARE_WORK(cxl_prot_err_work, cxl_prot_err_work_fn);
> Ah! Here it is... I think this can be in patch 3. With a stub of the function
> (which is what the patch 3 description claims is there).
I'll move it.
>>  
> Jonathan
>
-Terry



* [PATCH v8 06/16] CXL/PCI: Introduce CXL uncorrectable protocol error 'recovery'
  2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (4 preceding siblings ...)
  2025-03-27  1:47 ` [PATCH v8 05/16] PCI/AER: CXL driver dequeues CXL error forwarded from AER service driver Terry Bowman
@ 2025-03-27  1:47 ` Terry Bowman
  2025-03-27  3:37   ` kernel test robot
                     ` (2 more replies)
  2025-03-27  1:47 ` [PATCH v8 07/16] cxl/pci: Move existing CXL RAS initialization to CXL's cxl_port driver Terry Bowman
                   ` (11 subsequent siblings)
  17 siblings, 3 replies; 76+ messages in thread
From: Terry Bowman @ 2025-03-27  1:47 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

Create cxl_do_recovery() to provide uncorrectable protocol error (UCE)
handling. Follow a design similar to the PCIe error driver's
pcie_do_recovery(). One difference is that cxl_do_recovery() will treat all
UCEs as fatal with a kernel panic. This is to prevent corruption of CXL
memory.

Copy merge_result() from the PCIe error handling code. Introduce
PCI_ERS_RESULT_PANIC and add support for it in the merge_result() routine.

Copy pci_walk_bridge() to cxl_walk_bridge(). Change it to walk the first
device in all cases.

Copy report_error_detected() to cxl_report_error_detected(). Update this
function to populate the CXL error information structure, 'struct
cxl_prot_error_info', before calling the device error handler.

Call panic() to halt the system in the case of uncorrectable errors (UCE)
in cxl_do_recovery(). Export pci_aer_clear_fatal_status() for CXL to use
if a UCE is not found. In that case the AER status must still be cleared,
which is done with pci_aer_clear_fatal_status().

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/ras.c | 92 +++++++++++++++++++++++++++++++++++++++++-
 drivers/pci/pci.h      |  2 -
 include/linux/pci.h    |  5 +++
 3 files changed, 96 insertions(+), 3 deletions(-)

diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index eca8f11a05d9..1f94fc08e72b 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -141,7 +141,97 @@ int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
 }
 EXPORT_SYMBOL_NS_GPL(cxl_create_prot_err_info, "CXL");
 
-static void cxl_do_recovery(struct pci_dev *pdev) { }
+
+static pci_ers_result_t merge_result(enum pci_ers_result orig,
+				     enum pci_ers_result new)
+{
+	if (new == PCI_ERS_RESULT_PANIC)
+		return PCI_ERS_RESULT_PANIC;
+
+	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
+		return PCI_ERS_RESULT_NO_AER_DRIVER;
+
+	if (new == PCI_ERS_RESULT_NONE)
+		return orig;
+
+	switch (orig) {
+	case PCI_ERS_RESULT_CAN_RECOVER:
+	case PCI_ERS_RESULT_RECOVERED:
+		orig = new;
+		break;
+	case PCI_ERS_RESULT_DISCONNECT:
+		if (new == PCI_ERS_RESULT_NEED_RESET)
+			orig = PCI_ERS_RESULT_NEED_RESET;
+		break;
+	default:
+		break;
+	}
+
+	return orig;
+}
+
+static void cxl_walk_bridge(struct pci_dev *bridge,
+			    int (*cb)(struct pci_dev *, void *),
+			    void *userdata)
+{
+	if (cb(bridge, userdata))
+		return;
+
+	if (bridge->subordinate)
+		pci_walk_bus(bridge->subordinate, cb, userdata);
+}
+
+
+static int cxl_report_error_detected(struct pci_dev *pdev, void *data)
+{
+	struct cxl_driver *pdrv;
+	pci_ers_result_t vote, *result = data;
+	struct cxl_prot_error_info err_info = { 0 };
+	const struct cxl_error_handlers *cxl_err_handler;
+
+	if (cxl_create_prot_err_info(pdev, AER_FATAL, &err_info))
+		return 0;
+
+	struct device *dev __free(put_device) = get_device(err_info.dev);
+	if (!dev)
+		return 0;
+
+	pdrv = to_cxl_drv(dev->driver);
+	if (!pdrv || !pdrv->err_handler ||
+	    !pdrv->err_handler->error_detected)
+		return 0;
+
+	cxl_err_handler = pdrv->err_handler;
+	vote = cxl_err_handler->error_detected(dev, &err_info);
+
+	*result = merge_result(*result, vote);
+
+	return 0;
+}
+
+static void cxl_do_recovery(struct pci_dev *pdev)
+{
+	struct pci_host_bridge *host = pci_find_host_bridge(pdev->bus);
+	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
+
+	cxl_walk_bridge(pdev, cxl_report_error_detected, &status);
+	if (status == PCI_ERS_RESULT_PANIC)
+		panic("CXL cachemem error.");
+
+	/*
+	 * If we have native control of AER, clear error status in the device
+	 * that detected the error.  If the platform retained control of AER,
+	 * it is responsible for clearing this status.  In that case, the
+	 * signaling device may not even be visible to the OS.
+	 */
+	if (host->native_aer) {
+		pcie_clear_device_status(pdev);
+		pci_aer_clear_nonfatal_status(pdev);
+		pci_aer_clear_fatal_status(pdev);
+	}
+
+	pci_info(pdev, "CXL uncorrectable error.\n");
+}
 
 static int cxl_rch_handle_error_iter(struct pci_dev *pdev, void *data)
 {
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index c32eab22c0b2..1354c7cfedeb 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -886,7 +886,6 @@ void pci_no_aer(void);
 void pci_aer_init(struct pci_dev *dev);
 void pci_aer_exit(struct pci_dev *dev);
 extern const struct attribute_group aer_stats_attr_group;
-void pci_aer_clear_fatal_status(struct pci_dev *dev);
 int pci_aer_clear_status(struct pci_dev *dev);
 int pci_aer_raw_clear_status(struct pci_dev *dev);
 void pci_save_aer_state(struct pci_dev *dev);
@@ -895,7 +894,6 @@ void pci_restore_aer_state(struct pci_dev *dev);
 static inline void pci_no_aer(void) { }
 static inline void pci_aer_init(struct pci_dev *d) { }
 static inline void pci_aer_exit(struct pci_dev *d) { }
-static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
 static inline int pci_aer_clear_status(struct pci_dev *dev) { return -EINVAL; }
 static inline int pci_aer_raw_clear_status(struct pci_dev *dev) { return -EINVAL; }
 static inline void pci_save_aer_state(struct pci_dev *dev) { }
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 56015721be22..0aee5846b95c 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -862,6 +862,9 @@ enum pci_ers_result {
 
 	/* No AER capabilities registered for the driver */
 	PCI_ERS_RESULT_NO_AER_DRIVER = (__force pci_ers_result_t) 6,
+
+	/* System is unstable, panic  */
+	PCI_ERS_RESULT_PANIC = (__force pci_ers_result_t) 7,
 };
 
 /* PCI bus error event callbacks */
@@ -1864,8 +1867,10 @@ static inline bool pcie_aspm_enabled(struct pci_dev *pdev) { return false; }
 
 #ifdef CONFIG_PCIEAER
 bool pci_aer_available(void);
+void pci_aer_clear_fatal_status(struct pci_dev *dev);
 #else
 static inline bool pci_aer_available(void) { return false; }
+void pci_aer_clear_fatal_status(struct pci_dev *dev) { };
 #endif
 
 bool pci_ats_disabled(void);
-- 
2.34.1



* Re: [PATCH v8 06/16] CXL/PCI: Introduce CXL uncorrectable protocol error 'recovery'
  2025-03-27  1:47 ` [PATCH v8 06/16] CXL/PCI: Introduce CXL uncorrectable protocol error 'recovery' Terry Bowman
@ 2025-03-27  3:37   ` kernel test robot
  2025-03-27  4:19   ` kernel test robot
  2025-04-23 16:35   ` Jonathan Cameron
  2 siblings, 0 replies; 76+ messages in thread
From: kernel test robot @ 2025-03-27  3:37 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati
  Cc: oe-kbuild-all

Hi Terry,

kernel test robot noticed the following build warnings:

[auto build test WARNING on aae0594a7053c60b82621136257c8b648c67b512]

url:    https://github.com/intel-lab-lkp/linux/commits/Terry-Bowman/PCI-CXL-Introduce-PCIe-helper-function-pcie_is_cxl/20250327-095738
base:   aae0594a7053c60b82621136257c8b648c67b512
patch link:    https://lore.kernel.org/r/20250327014717.2988633-7-terry.bowman%40amd.com
patch subject: [PATCH v8 06/16] CXL/PCI: Introduce CXL uncorrectable protocol error 'recovery'
config: x86_64-defconfig (https://download.01.org/0day-ci/archive/20250327/202503271130.BAWzwRWM-lkp@intel.com/config)
compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250327/202503271130.BAWzwRWM-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202503271130.BAWzwRWM-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from lib/iomap.c:7:
>> include/linux/pci.h:1873:6: warning: no previous prototype for 'pci_aer_clear_fatal_status' [-Wmissing-prototypes]
    1873 | void pci_aer_clear_fatal_status(struct pci_dev *dev) { };
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~
--
   In file included from drivers/pci/probe.c:9:
>> include/linux/pci.h:1873:6: warning: no previous prototype for 'pci_aer_clear_fatal_status' [-Wmissing-prototypes]
    1873 | void pci_aer_clear_fatal_status(struct pci_dev *dev) { };
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~
   In file included from drivers/pci/probe.c:16:
   include/linux/aer.h:76:20: error: redefinition of 'pci_aer_clear_fatal_status'
      76 | static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
         |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~
   In file included from drivers/pci/probe.c:9:
   include/linux/pci.h:1873:6: note: previous definition of 'pci_aer_clear_fatal_status' with type 'void(struct pci_dev *)'
    1873 | void pci_aer_clear_fatal_status(struct pci_dev *dev) { };
         |      ^~~~~~~~~~~~~~~~~~~~~~~~~~


vim +/pci_aer_clear_fatal_status +1873 include/linux/pci.h

  1867	
  1868	#ifdef CONFIG_PCIEAER
  1869	bool pci_aer_available(void);
  1870	void pci_aer_clear_fatal_status(struct pci_dev *dev);
  1871	#else
  1872	static inline bool pci_aer_available(void) { return false; }
> 1873	void pci_aer_clear_fatal_status(struct pci_dev *dev) { };
  1874	#endif
  1875	
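
One possible way to resolve both diagnostics, sketched here as a standalone
fragment and not taken from the patch: a stub defined in a header needs to
be 'static inline', and it should exist in only one header. Since
include/linux/aer.h:76 already carries a 'static inline' stub, the
assumption below is that a single stub of that form is kept and the
non-static one at pci.h:1873 is dropped. The forward declaration of
'struct pci_dev' is only there to make the fragment compile on its own.

```c
/* Forward declaration only; the fragment needs nothing but the pointer type */
struct pci_dev;

#ifdef CONFIG_PCIEAER
/* Real implementation is provided by the AER driver */
void pci_aer_clear_fatal_status(struct pci_dev *dev);
#else
/*
 * 'static inline' gives the stub internal linkage, so including the header
 * from many files neither trips -Wmissing-prototypes nor produces multiple
 * external definitions.
 */
static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
#endif
```

Built without CONFIG_PCIEAER defined, callers compile against the empty
stub and the call is expected to be optimized away.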

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v8 06/16] CXL/PCI: Introduce CXL uncorrectable protocol error 'recovery'
  2025-03-27  1:47 ` [PATCH v8 06/16] CXL/PCI: Introduce CXL uncorrectable protocol error 'recovery' Terry Bowman
  2025-03-27  3:37   ` kernel test robot
@ 2025-03-27  4:19   ` kernel test robot
  2025-04-23 16:35   ` Jonathan Cameron
  2 siblings, 0 replies; 76+ messages in thread
From: kernel test robot @ 2025-03-27  4:19 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati
  Cc: llvm, oe-kbuild-all

Hi Terry,

kernel test robot noticed the following build warnings:

[auto build test WARNING on aae0594a7053c60b82621136257c8b648c67b512]

url:    https://github.com/intel-lab-lkp/linux/commits/Terry-Bowman/PCI-CXL-Introduce-PCIe-helper-function-pcie_is_cxl/20250327-095738
base:   aae0594a7053c60b82621136257c8b648c67b512
patch link:    https://lore.kernel.org/r/20250327014717.2988633-7-terry.bowman%40amd.com
patch subject: [PATCH v8 06/16] CXL/PCI: Introduce CXL uncorrectable protocol error 'recovery'
config: arm-mv78xx0_defconfig (https://download.01.org/0day-ci/archive/20250327/202503271128.zMRuNISx-lkp@intel.com/config)
compiler: clang version 19.1.7 (https://github.com/llvm/llvm-project cd708029e0b2869e80abe31ddb175f7c35361f90)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250327/202503271128.zMRuNISx-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202503271128.zMRuNISx-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from drivers/pci/access.c:2:
>> include/linux/pci.h:1873:6: warning: no previous prototype for function 'pci_aer_clear_fatal_status' [-Wmissing-prototypes]
    1873 | void pci_aer_clear_fatal_status(struct pci_dev *dev) { };
         |      ^
   include/linux/pci.h:1873:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
    1873 | void pci_aer_clear_fatal_status(struct pci_dev *dev) { };
         | ^
         | static 
   1 warning generated.
--
   In file included from drivers/pci/probe.c:9:
>> include/linux/pci.h:1873:6: warning: no previous prototype for function 'pci_aer_clear_fatal_status' [-Wmissing-prototypes]
    1873 | void pci_aer_clear_fatal_status(struct pci_dev *dev) { };
         |      ^
   include/linux/pci.h:1873:1: note: declare 'static' if the function is not intended to be used outside of this translation unit
    1873 | void pci_aer_clear_fatal_status(struct pci_dev *dev) { };
         | ^
         | static 
   In file included from drivers/pci/probe.c:16:
   include/linux/aer.h:76:20: error: static declaration of 'pci_aer_clear_fatal_status' follows non-static declaration
      76 | static inline void pci_aer_clear_fatal_status(struct pci_dev *dev) { }
         |                    ^
   include/linux/pci.h:1873:6: note: previous definition is here
    1873 | void pci_aer_clear_fatal_status(struct pci_dev *dev) { };
         |      ^
   1 warning and 1 error generated.


vim +/pci_aer_clear_fatal_status +1873 include/linux/pci.h

  1867	
  1868	#ifdef CONFIG_PCIEAER
  1869	bool pci_aer_available(void);
  1870	void pci_aer_clear_fatal_status(struct pci_dev *dev);
  1871	#else
  1872	static inline bool pci_aer_available(void) { return false; }
> 1873	void pci_aer_clear_fatal_status(struct pci_dev *dev) { };
  1874	#endif
  1875	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v8 06/16] CXL/PCI: Introduce CXL uncorrectable protocol error 'recovery'
  2025-03-27  1:47 ` [PATCH v8 06/16] CXL/PCI: Introduce CXL uncorrectable protocol error 'recovery' Terry Bowman
  2025-03-27  3:37   ` kernel test robot
  2025-03-27  4:19   ` kernel test robot
@ 2025-04-23 16:35   ` Jonathan Cameron
  2025-04-24 14:22     ` Bowman, Terry
  2 siblings, 1 reply; 76+ messages in thread
From: Jonathan Cameron @ 2025-04-23 16:35 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Wed, 26 Mar 2025 20:47:07 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> Create cxl_do_recovery() to provide uncorrectable protocol error (UCE)
> handling. Follow a design similar to the PCIe error driver's
> pcie_do_recovery(). One difference is that cxl_do_recovery() will treat all
> UCEs as fatal with a kernel panic. This is to prevent corruption of CXL
> memory.
> 
> Copy merge_result() from the PCIe error handling code. Introduce
> PCI_ERS_RESULT_PANIC and add support for it in the merge_result() routine.
> 
> Copy pci_walk_bridge() to cxl_walk_bridge(). Change it to walk the first
> device in all cases.
> 
> Copy report_error_detected() to cxl_report_error_detected(). Update this
> function to populate the CXL error information structure, 'struct
> cxl_prot_error_info', before calling the device error handler.
> 
> Call panic() to halt the system in the case of uncorrectable errors (UCE)
> in cxl_do_recovery(). Export pci_aer_clear_fatal_status() for CXL to use
> if a UCE is not found. In that case the AER status must still be cleared,
> which is done with pci_aer_clear_fatal_status().
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/core/ras.c | 92 +++++++++++++++++++++++++++++++++++++++++-
>  drivers/pci/pci.h      |  2 -
>  include/linux/pci.h    |  5 +++
>  3 files changed, 96 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index eca8f11a05d9..1f94fc08e72b 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -141,7 +141,97 @@ int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_create_prot_err_info, "CXL");
>  
> -static void cxl_do_recovery(struct pci_dev *pdev) { }
> +
> +static pci_ers_result_t merge_result(enum pci_ers_result orig,

Rename perhaps to avoid confusion / grep clashes...

> +				     enum pci_ers_result new)
> +{
> +	if (new == PCI_ERS_RESULT_PANIC)
> +		return PCI_ERS_RESULT_PANIC;
> +
> +	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
> +		return PCI_ERS_RESULT_NO_AER_DRIVER;
> +
> +	if (new == PCI_ERS_RESULT_NONE)
> +		return orig;
> +
> +	switch (orig) {
> +	case PCI_ERS_RESULT_CAN_RECOVER:
> +	case PCI_ERS_RESULT_RECOVERED:
> +		orig = new;
> +		break;
> +	case PCI_ERS_RESULT_DISCONNECT:
> +		if (new == PCI_ERS_RESULT_NEED_RESET)
> +			orig = PCI_ERS_RESULT_NEED_RESET;
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	return orig;
> +}
> +
> +static void cxl_walk_bridge(struct pci_dev *bridge,
> +			    int (*cb)(struct pci_dev *, void *),
> +			    void *userdata)
> +{
> +	if (cb(bridge, userdata))
> +		return;
> +
> +	if (bridge->subordinate)
> +		pci_walk_bus(bridge->subordinate, cb, userdata);
> +}
> +

Trivial but seems there are two blank lines where one will do.

> +
> +static int cxl_report_error_detected(struct pci_dev *pdev, void *data)
> +{
> +	struct cxl_driver *pdrv;
> +	pci_ers_result_t vote, *result = data;
> +	struct cxl_prot_error_info err_info = { 0 };
> +	const struct cxl_error_handlers *cxl_err_handler;
> +
> +	if (cxl_create_prot_err_info(pdev, AER_FATAL, &err_info))
> +		return 0;
> +
> +	struct device *dev __free(put_device) = get_device(err_info.dev);
> +	if (!dev)
> +		return 0;
> +
> +	pdrv = to_cxl_drv(dev->driver);
> +	if (!pdrv || !pdrv->err_handler ||
> +	    !pdrv->err_handler->error_detected)
> +		return 0;
> +
> +	cxl_err_handler = pdrv->err_handler;
> +	vote = cxl_err_handler->error_detected(dev, &err_info);
> +
> +	*result = merge_result(*result, vote);
> +
> +	return 0;
> +}
> +
> +static void cxl_do_recovery(struct pci_dev *pdev)
> +{
> +	struct pci_host_bridge *host = pci_find_host_bridge(pdev->bus);
> +	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> +
> +	cxl_walk_bridge(pdev, cxl_report_error_detected, &status);
> +	if (status == PCI_ERS_RESULT_PANIC)
> +		panic("CXL cachemem error.");
> +
> +	/*
> +	 * If we have native control of AER, clear error status in the device
> +	 * that detected the error.  If the platform retained control of AER,
> +	 * it is responsible for clearing this status.  In that case, the
> +	 * signaling device may not even be visible to the OS.
> +	 */
> +	if (host->native_aer) {
> +		pcie_clear_device_status(pdev);
> +		pci_aer_clear_nonfatal_status(pdev);
> +		pci_aer_clear_fatal_status(pdev);
> +	}
> +
> +	pci_info(pdev, "CXL uncorrectable error.\n");
> +}
>  
>  static int cxl_rch_handle_error_iter(struct pci_dev *pdev, void *data)
>  {




* Re: [PATCH v8 06/16] CXL/PCI: Introduce CXL uncorrectable protocol error 'recovery'
  2025-04-23 16:35   ` Jonathan Cameron
@ 2025-04-24 14:22     ` Bowman, Terry
  0 siblings, 0 replies; 76+ messages in thread
From: Bowman, Terry @ 2025-04-24 14:22 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati



On 4/23/2025 11:35 AM, Jonathan Cameron wrote:
> On Wed, 26 Mar 2025 20:47:07 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> Create cxl_do_recovery() to provide uncorrectable protocol error (UCE)
>> handling. Follow a design similar to the PCIe error driver's
>> pcie_do_recovery(). One difference is that cxl_do_recovery() will treat all
>> UCEs as fatal with a kernel panic. This is to prevent corruption of CXL
>> memory.
>>
>> Copy merge_result() from the PCIe error handling code. Introduce
>> PCI_ERS_RESULT_PANIC and add support for it in the merge_result() routine.
>>
>> Copy pci_walk_bridge() to cxl_walk_bridge(). Change it to walk the first
>> device in all cases.
>>
>> Copy report_error_detected() to cxl_report_error_detected(). Update this
>> function to populate the CXL error information structure, 'struct
>> cxl_prot_error_info', before calling the device error handler.
>>
>> Call panic() to halt the system in the case of uncorrectable errors (UCE)
>> in cxl_do_recovery(). Export pci_aer_clear_fatal_status() for CXL to use
>> if a UCE is not found. In that case the AER status must still be cleared,
>> which is done with pci_aer_clear_fatal_status().
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>  drivers/cxl/core/ras.c | 92 +++++++++++++++++++++++++++++++++++++++++-
>>  drivers/pci/pci.h      |  2 -
>>  include/linux/pci.h    |  5 +++
>>  3 files changed, 96 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
>> index eca8f11a05d9..1f94fc08e72b 100644
>> --- a/drivers/cxl/core/ras.c
>> +++ b/drivers/cxl/core/ras.c
>> @@ -141,7 +141,97 @@ int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
>>  }
>>  EXPORT_SYMBOL_NS_GPL(cxl_create_prot_err_info, "CXL");
>>  
>> -static void cxl_do_recovery(struct pci_dev *pdev) { }
>> +
>> +static pci_ers_result_t merge_result(enum pci_ers_result orig,
> Rename perhaps to avoid confusion / grep clashes...
Ok. I'll rename it to cxl_merge_results().
>> +				     enum pci_ers_result new)
>> +{
>> +	if (new == PCI_ERS_RESULT_PANIC)
>> +		return PCI_ERS_RESULT_PANIC;
>> +
>> +	if (new == PCI_ERS_RESULT_NO_AER_DRIVER)
>> +		return PCI_ERS_RESULT_NO_AER_DRIVER;
>> +
>> +	if (new == PCI_ERS_RESULT_NONE)
>> +		return orig;
>> +
>> +	switch (orig) {
>> +	case PCI_ERS_RESULT_CAN_RECOVER:
>> +	case PCI_ERS_RESULT_RECOVERED:
>> +		orig = new;
>> +		break;
>> +	case PCI_ERS_RESULT_DISCONNECT:
>> +		if (new == PCI_ERS_RESULT_NEED_RESET)
>> +			orig = PCI_ERS_RESULT_NEED_RESET;
>> +		break;
>> +	default:
>> +		break;
>> +	}
>> +
>> +	return orig;
>> +}
>> +
>> +static void cxl_walk_bridge(struct pci_dev *bridge,
>> +			    int (*cb)(struct pci_dev *, void *),
>> +			    void *userdata)
>> +{
>> +	if (cb(bridge, userdata))
>> +		return;
>> +
>> +	if (bridge->subordinate)
>> +		pci_walk_bus(bridge->subordinate, cb, userdata);
>> +}
>> +
> Trivial but seems there are two blank lines where one will do.
Ok

-Terry

>> +
>> +static int cxl_report_error_detected(struct pci_dev *pdev, void *data)
>> +{
>> +	struct cxl_driver *pdrv;
>> +	pci_ers_result_t vote, *result = data;
>> +	struct cxl_prot_error_info err_info = { 0 };
>> +	const struct cxl_error_handlers *cxl_err_handler;
>> +
>> +	if (cxl_create_prot_err_info(pdev, AER_FATAL, &err_info))
>> +		return 0;
>> +
>> +	struct device *dev __free(put_device) = get_device(err_info.dev);
>> +	if (!dev)
>> +		return 0;
>> +
>> +	pdrv = to_cxl_drv(dev->driver);
>> +	if (!pdrv || !pdrv->err_handler ||
>> +	    !pdrv->err_handler->error_detected)
>> +		return 0;
>> +
>> +	cxl_err_handler = pdrv->err_handler;
>> +	vote = cxl_err_handler->error_detected(dev, &err_info);
>> +
>> +	*result = merge_result(*result, vote);
>> +
>> +	return 0;
>> +}
>> +
>> +static void cxl_do_recovery(struct pci_dev *pdev)
>> +{
>> +	struct pci_host_bridge *host = pci_find_host_bridge(pdev->bus);
>> +	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>> +
>> +	cxl_walk_bridge(pdev, cxl_report_error_detected, &status);
>> +	if (status == PCI_ERS_RESULT_PANIC)
>> +		panic("CXL cachemem error.");
>> +
>> +	/*
>> +	 * If we have native control of AER, clear error status in the device
>> +	 * that detected the error.  If the platform retained control of AER,
>> +	 * it is responsible for clearing this status.  In that case, the
>> +	 * signaling device may not even be visible to the OS.
>> +	 */
>> +	if (host->native_aer) {
>> +		pcie_clear_device_status(pdev);
>> +		pci_aer_clear_nonfatal_status(pdev);
>> +		pci_aer_clear_fatal_status(pdev);
>> +	}
>> +
>> +	pci_info(pdev, "CXL uncorrectable error.\n");
>> +}
>>  
>>  static int cxl_rch_handle_error_iter(struct pci_dev *pdev, void *data)
>>  {
>
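
For readers following the vote-merging logic quoted above, here is a minimal userspace sketch of how merge_result() appears to combine per-device votes. The enum is a stand-in for the kernel's pci_ers_result values (the PANIC value is the one this series introduces); the names ers_result, merge_votes() and fold_votes() are hypothetical and the numeric values are arbitrary.

```c
#include <assert.h>

/* Stand-in for pci_ers_result; names and numeric values are illustrative. */
enum ers_result {
	ERS_NONE,
	ERS_CAN_RECOVER,
	ERS_NEED_RESET,
	ERS_DISCONNECT,
	ERS_RECOVERED,
	ERS_NO_AER_DRIVER,
	ERS_PANIC,
};

/* Mirrors the branch structure of merge_result() in the quoted hunk. */
static enum ers_result merge_votes(enum ers_result orig, enum ers_result new)
{
	if (new == ERS_PANIC)
		return ERS_PANIC;
	if (new == ERS_NO_AER_DRIVER)
		return ERS_NO_AER_DRIVER;
	if (new == ERS_NONE)
		return orig;

	switch (orig) {
	case ERS_CAN_RECOVER:
	case ERS_RECOVERED:
		orig = new;
		break;
	case ERS_DISCONNECT:
		if (new == ERS_NEED_RESET)
			orig = ERS_NEED_RESET;
		break;
	default:
		break;
	}
	return orig;
}

/* Fold a list of votes, starting from CAN_RECOVER as cxl_do_recovery() does. */
static enum ers_result fold_votes(const enum ers_result *votes, int n)
{
	enum ers_result status = ERS_CAN_RECOVER;

	for (int i = 0; i < n; i++)
		status = merge_votes(status, votes[i]);
	return status;
}
```

In this model a PANIC vote is sticky: once one device votes PANIC, later RECOVERED votes fall into the default case and do not downgrade it.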



* [PATCH v8 07/16] cxl/pci: Move existing CXL RAS initialization to CXL's cxl_port driver
  2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (5 preceding siblings ...)
  2025-03-27  1:47 ` [PATCH v8 06/16] CXL/PCI: Introduce CXL uncorrectable protocol error 'recovery' Terry Bowman
@ 2025-03-27  1:47 ` Terry Bowman
  2025-04-17 10:18   ` Jonathan Cameron
  2025-03-27  1:47 ` [PATCH v8 08/16] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers Terry Bowman
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 76+ messages in thread
From: Terry Bowman @ 2025-03-27  1:47 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

Restricted CXL Host (RCH) Downstream Port RAS initialization currently
resides in cxl/core/pci.c. The PCI source file is not otherwise associated
with CXL port management.

Additional CXL Port RAS initialization will be added in future patches to
support CXL Port devices' CXL errors.

Move existing RAS initialization to the cxl_port driver. The cxl_port
driver is intended to manage CXL Endpoint and CXL Switch Ports.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/pci.c  | 73 --------------------------------------
 drivers/cxl/core/regs.c |  2 ++
 drivers/cxl/cxl.h       |  6 ++++
 drivers/cxl/port.c      | 78 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 86 insertions(+), 73 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 96fecb799cbc..27ef3d55a6f1 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -734,79 +734,6 @@ static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
 
 #ifdef CONFIG_PCIEAER_CXL
 
-static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
-{
-	resource_size_t aer_phys;
-	struct device *host;
-	u16 aer_cap;
-
-	aer_cap = cxl_rcrb_to_aer(dport->dport_dev, dport->rcrb.base);
-	if (aer_cap) {
-		host = dport->reg_map.host;
-		aer_phys = aer_cap + dport->rcrb.base;
-		dport->regs.dport_aer = devm_cxl_iomap_block(host, aer_phys,
-						sizeof(struct aer_capability_regs));
-	}
-}
-
-static void cxl_dport_map_ras(struct cxl_dport *dport)
-{
-	struct cxl_register_map *map = &dport->reg_map;
-	struct device *dev = dport->dport_dev;
-
-	if (!map->component_map.ras.valid)
-		dev_dbg(dev, "RAS registers not found\n");
-	else if (cxl_map_component_regs(map, &dport->regs.component,
-					BIT(CXL_CM_CAP_CAP_ID_RAS)))
-		dev_dbg(dev, "Failed to map RAS capability.\n");
-}
-
-static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
-{
-	void __iomem *aer_base = dport->regs.dport_aer;
-	u32 aer_cmd_mask, aer_cmd;
-
-	if (!aer_base)
-		return;
-
-	/*
-	 * Disable RCH root port command interrupts.
-	 * CXL 3.0 12.2.1.1 - RCH Downstream Port-detected Errors
-	 *
-	 * This sequence may not be necessary. CXL spec states disabling
-	 * the root cmd register's interrupts is required. But, PCI spec
-	 * shows these are disabled by default on reset.
-	 */
-	aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN |
-			PCI_ERR_ROOT_CMD_NONFATAL_EN |
-			PCI_ERR_ROOT_CMD_FATAL_EN);
-	aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND);
-	aer_cmd &= ~aer_cmd_mask;
-	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
-}
-
-/**
- * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
- * @dport: the cxl_dport that needs to be initialized
- * @host: host device for devm operations
- */
-void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
-{
-	dport->reg_map.host = host;
-	cxl_dport_map_ras(dport);
-
-	if (dport->rch) {
-		struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
-
-		if (!host_bridge->native_aer)
-			return;
-
-		cxl_dport_map_rch_aer(dport);
-		cxl_disable_rch_root_ints(dport);
-	}
-}
-EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
-
 static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
 					  struct cxl_dport *dport)
 {
diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index 117c2e94c761..f3f85a753460 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -199,6 +199,7 @@ void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
 
 	return ret_val;
 }
+EXPORT_SYMBOL_NS_GPL(devm_cxl_iomap_block, "CXL");
 
 int cxl_map_component_regs(const struct cxl_register_map *map,
 			   struct cxl_component_regs *regs,
@@ -517,6 +518,7 @@ u16 cxl_rcrb_to_aer(struct device *dev, resource_size_t rcrb)
 
 	return offset;
 }
+EXPORT_SYMBOL_NS_GPL(cxl_rcrb_to_aer, "CXL");
 
 static resource_size_t cxl_rcrb_to_linkcap(struct device *dev, struct cxl_dport *dport)
 {
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 73cddd2c921e..b2b55083886a 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -315,6 +315,12 @@ int cxl_setup_regs(struct cxl_register_map *map);
 struct cxl_dport;
 resource_size_t cxl_rcd_component_reg_phys(struct device *dev,
 					   struct cxl_dport *dport);
+
+u16 cxl_rcrb_to_aer(struct device *dev, resource_size_t rcrb);
+
+void __iomem *devm_cxl_iomap_block(struct device *dev, resource_size_t addr,
+				   resource_size_t length);
+
 int cxl_dport_map_rcd_linkcap(struct pci_dev *pdev, struct cxl_dport *dport);
 
 #define CXL_RESOURCE_NONE ((resource_size_t) -1)
diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
index d2bfd1ff5492..d5ea8b04400b 100644
--- a/drivers/cxl/port.c
+++ b/drivers/cxl/port.c
@@ -6,6 +6,7 @@
 
 #include "cxlmem.h"
 #include "cxlpci.h"
+#include "cxl.h"
 
 /**
  * DOC: cxl port
@@ -57,6 +58,83 @@ static int discover_region(struct device *dev, void *root)
 	return 0;
 }
 
+#ifdef CONFIG_PCIEAER_CXL
+
+static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
+{
+	resource_size_t aer_phys;
+	struct device *host;
+	u16 aer_cap;
+
+	aer_cap = cxl_rcrb_to_aer(dport->dport_dev, dport->rcrb.base);
+	if (aer_cap) {
+		host = dport->reg_map.host;
+		aer_phys = aer_cap + dport->rcrb.base;
+		dport->regs.dport_aer = devm_cxl_iomap_block(host, aer_phys,
+						sizeof(struct aer_capability_regs));
+	}
+}
+
+static void cxl_dport_map_ras(struct cxl_dport *dport)
+{
+	struct cxl_register_map *map = &dport->reg_map;
+	struct device *dev = dport->dport_dev;
+
+	if (!map->component_map.ras.valid)
+		dev_dbg(dev, "RAS registers not found\n");
+	else if (cxl_map_component_regs(map, &dport->regs.component,
+					BIT(CXL_CM_CAP_CAP_ID_RAS)))
+		dev_dbg(dev, "Failed to map RAS capability.\n");
+}
+
+static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
+{
+	void __iomem *aer_base = dport->regs.dport_aer;
+	u32 aer_cmd_mask, aer_cmd;
+
+	if (!aer_base)
+		return;
+
+	/*
+	 * Disable RCH root port command interrupts.
+	 * CXL 3.0 12.2.1.1 - RCH Downstream Port-detected Errors
+	 *
+	 * This sequence may not be necessary. CXL spec states disabling
+	 * the root cmd register's interrupts is required. But, PCI spec
+	 * shows these are disabled by default on reset.
+	 */
+	aer_cmd_mask = (PCI_ERR_ROOT_CMD_COR_EN |
+			PCI_ERR_ROOT_CMD_NONFATAL_EN |
+			PCI_ERR_ROOT_CMD_FATAL_EN);
+	aer_cmd = readl(aer_base + PCI_ERR_ROOT_COMMAND);
+	aer_cmd &= ~aer_cmd_mask;
+	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
+}
+
+/**
+ * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
+ * @dport: the cxl_dport that needs to be initialized
+ * @host: host device for devm operations
+ */
+void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
+{
+	dport->reg_map.host = host;
+	cxl_dport_map_ras(dport);
+
+	if (dport->rch) {
+		struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
+
+		if (!host_bridge->native_aer)
+			return;
+
+		cxl_dport_map_rch_aer(dport);
+		cxl_disable_rch_root_ints(dport);
+	}
+}
+EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
+
+#endif /* CONFIG_PCIEAER_CXL */
+
 static int cxl_switch_port_probe(struct cxl_port *port)
 {
 	struct cxl_hdm *cxlhdm;
-- 
2.34.1
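
The read-modify-write in cxl_disable_rch_root_ints() can be illustrated outside the kernel. The sketch below operates on a plain integer instead of an MMIO register; the three bit values are, to my understanding, the ones defined in include/uapi/linux/pci_regs.h, and disable_root_ints() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdint.h>

/* Root Error Command enable bits (believed to match pci_regs.h). */
#define PCI_ERR_ROOT_CMD_COR_EN		0x00000001
#define PCI_ERR_ROOT_CMD_NONFATAL_EN	0x00000002
#define PCI_ERR_ROOT_CMD_FATAL_EN	0x00000004

/* Clear the three interrupt-enable bits and leave all other bits untouched. */
static uint32_t disable_root_ints(uint32_t aer_cmd)
{
	uint32_t mask = PCI_ERR_ROOT_CMD_COR_EN |
			PCI_ERR_ROOT_CMD_NONFATAL_EN |
			PCI_ERR_ROOT_CMD_FATAL_EN;

	return aer_cmd & ~mask;
}
```

In the driver the input comes from readl() and the result goes back through writel() at the PCI_ERR_ROOT_COMMAND offset; only the masking step is modeled here.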



* Re: [PATCH v8 07/16] cxl/pci: Move existing CXL RAS initialization to CXL's cxl_port driver
  2025-03-27  1:47 ` [PATCH v8 07/16] cxl/pci: Move existing CXL RAS initialization to CXL's cxl_port driver Terry Bowman
@ 2025-04-17 10:18   ` Jonathan Cameron
  2025-04-24 14:25     ` Bowman, Terry
  2025-05-12 14:47     ` Bowman, Terry
  0 siblings, 2 replies; 76+ messages in thread
From: Jonathan Cameron @ 2025-04-17 10:18 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Wed, 26 Mar 2025 20:47:08 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> Restricted CXL Host (RCH) Downstream Port RAS initialization currently
> resides in cxl/core/pci.c. The PCI source file is not otherwise associated
> with CXL port management.
> 
> Additional CXL Port RAS initialization will be added in future patches to
> support CXL Port devices' CXL errors.
> 
> Move existing RAS initialization to the cxl_port driver. The cxl_port
> driver is intended to manage CXL Endpoint and CXL Switch Ports.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Hi Terry,

Sorry for the interrupt nature of reviews on this. Crazy week.

Anyhow getting back to this series...

I'm not a fan of ifdefs in a c file.  Maybe we should consider
a port_aer.c and stubbing in the header as needed?

I think it ends up cleaner both in this patch and even more so later
in the series.

Jonathan

p.s. And now I need to run again.  I'll be back!
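
The header-stub approach suggested above might look roughly like the following. This is a single-file sketch under assumptions: struct port and port_init_ras() are hypothetical names, and in a real split the prototype and stub would live in a header while the definition would live in the proposed port_aer.c.

```c
#include <assert.h>
#include <stdbool.h>

struct port {
	bool ras_mapped;
};

/* --- would live in a header --- */
#ifdef CONFIG_PCIEAER_CXL
/* Real implementation, built only when the option is enabled. */
void port_init_ras(struct port *port);
#else
/* Empty stub: callers need no #ifdef and the compiler drops the call. */
static inline void port_init_ras(struct port *port)
{
	(void)port;
}
#endif

/* --- caller in the .c file stays free of conditionals --- */
static int port_probe(struct port *port)
{
	port_init_ras(port);
	return 0;
}
```

Built without CONFIG_PCIEAER_CXL defined, the stub is selected and the probe path compiles unchanged. Defining the macro would additionally require linking in the separate file that provides port_init_ras().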


* Re: [PATCH v8 07/16] cxl/pci: Move existing CXL RAS initialization to CXL's cxl_port driver
  2025-04-17 10:18   ` Jonathan Cameron
@ 2025-04-24 14:25     ` Bowman, Terry
  2025-05-12 14:47     ` Bowman, Terry
  1 sibling, 0 replies; 76+ messages in thread
From: Bowman, Terry @ 2025-04-24 14:25 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati



On 4/17/2025 5:18 AM, Jonathan Cameron wrote:
> On Wed, 26 Mar 2025 20:47:08 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> Restricted CXL Host (RCH) Downstream Port RAS initialization currently
>> resides in cxl/core/pci.c. The PCI source file is not otherwise associated
>> with CXL port management.
>>
>> Additional CXL Port RAS initialization will be added in future patches to
>> support CXL Port devices' CXL errors.
>>
>> Move existing RAS initialization to the cxl_port driver. The cxl_port
>> driver is intended to manage CXL Endpoint and CXL Switch Ports.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Hi Terry,
>
> Sorry for the interrupt nature of reviews on this. Crazy week.
>
> Anyhow getting back to this series...
>
> I'm not a fan of ifdefs in a c file.  Maybe we should consider
> a port_aer.c and stubbing in the header as needed?
>
> I think it ends up cleaner both in this patch and even more so later
> in the series.
>
> Jonathan
>
> p.s. And now I need to run again.  I'll be back!
Hi Jonathan,

Sorry, I missed this email earlier. I understand your point. I'll keep that in
mind for new changes and revisit the series to remove those where possible.

-Terry


* Re: [PATCH v8 07/16] cxl/pci: Move existing CXL RAS initialization to CXL's cxl_port driver
  2025-04-17 10:18   ` Jonathan Cameron
  2025-04-24 14:25     ` Bowman, Terry
@ 2025-05-12 14:47     ` Bowman, Terry
  1 sibling, 0 replies; 76+ messages in thread
From: Bowman, Terry @ 2025-05-12 14:47 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati



On 4/17/2025 5:18 AM, Jonathan Cameron wrote:
> On Wed, 26 Mar 2025 20:47:08 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> Restricted CXL Host (RCH) Downstream Port RAS initialization currently
>> resides in cxl/core/pci.c. The PCI source file is not otherwise associated
>> with CXL port management.
>>
>> Additional CXL Port RAS initialization will be added in future patches to
>> support CXL Port devices' CXL errors.
>>
>> Move existing RAS initialization to the cxl_port driver. The cxl_port
>> driver is intended to manage CXL Endpoint and CXL Switch Ports.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Hi Terry,
>
> Sorry for the interrupt nature of reviews on this. Crazy week.
>
> Anyhow getting back to this series...
>
> I'm not a fan of ifdefs in a c file.  Maybe we should consider
> a port_aer.c and stubbing in the header as needed?
>
> I think it ends up cleaner both in this patch and even more so later
> in the series.
>
> Jonathan
>
> p.s. And now I need to run again.  I'll be back!

Yes, I will try to add this in the next revision.

-Terry


* [PATCH v8 08/16] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
  2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (6 preceding siblings ...)
  2025-03-27  1:47 ` [PATCH v8 07/16] cxl/pci: Move existing CXL RAS initialization to CXL's cxl_port driver Terry Bowman
@ 2025-03-27  1:47 ` Terry Bowman
  2025-03-27  1:47 ` [PATCH v8 09/16] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports Terry Bowman
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 76+ messages in thread
From: Terry Bowman @ 2025-03-27  1:47 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

CXL Endpoint (EP) Ports may include Root Ports (RP) or Downstream Switch
Ports (DSP). CXL RPs and DSPs contain RAS registers that require memory
mapping to enable RAS logging. This initialization is currently missing and
must be added for CXL RPs and DSPs.

Update cxl_dport_init_ras_reporting() to support RP and DSP RAS mapping.
Add this alongside the existing Restricted CXL Host (RCH) Downstream Port
RAS mapping.

Update cxl_endpoint_port_probe() to invoke cxl_dport_init_ras_reporting().
This will initiate the RAS mapping for CXL RPs and DSPs when each CXL EP is
created and added to the EP port.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/cxl.h  |  2 ++
 drivers/cxl/mem.c  |  3 ++-
 drivers/cxl/port.c | 56 +++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 59 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index b2b55083886a..0d05d5449f97 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -592,6 +592,7 @@ struct cxl_dax_region {
  * @parent_dport: dport that points to this port in the parent
  * @decoder_ida: allocator for decoder ids
  * @reg_map: component and ras register mapping parameters
+ * @uport_regs: mapped component registers
  * @nr_dports: number of entries in @dports
  * @hdm_end: track last allocated HDM decoder instance for allocation ordering
  * @commit_end: cursor to track highest committed decoder for commit ordering
@@ -613,6 +614,7 @@ struct cxl_port {
 	struct cxl_dport *parent_dport;
 	struct ida decoder_ida;
 	struct cxl_register_map reg_map;
+	struct cxl_component_regs uport_regs;
 	int nr_dports;
 	int hdm_end;
 	int commit_end;
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 9675243bd05b..29dc4a624b15 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -166,7 +166,8 @@ static int cxl_mem_probe(struct device *dev)
 	else
 		endpoint_parent = &parent_port->dev;
 
-	cxl_dport_init_ras_reporting(dport, dev);
+	if (dport->rch)
+		cxl_dport_init_ras_reporting(dport, dev);
 
 	scoped_guard(device, endpoint_parent) {
 		if (!endpoint_parent->driver) {
diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
index d5ea8b04400b..1b8dc161428f 100644
--- a/drivers/cxl/port.c
+++ b/drivers/cxl/port.c
@@ -111,6 +111,17 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
 	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
 }
 
+static void cxl_uport_init_ras_reporting(struct cxl_port *port,
+					 struct device *host)
+{
+	struct cxl_register_map *map = &port->reg_map;
+
+	map->host = host;
+	if (cxl_map_component_regs(map, &port->uport_regs,
+				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
+		dev_dbg(&port->dev, "Failed to map RAS capability\n");
+}
+
 /**
  * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
  * @dport: the cxl_dport that needs to be initialized
@@ -119,7 +130,6 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
 void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
 {
 	dport->reg_map.host = host;
-	cxl_dport_map_ras(dport);
 
 	if (dport->rch) {
 		struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
@@ -127,12 +137,52 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
 		if (!host_bridge->native_aer)
 			return;
 
+		cxl_dport_map_ras(dport);
 		cxl_dport_map_rch_aer(dport);
 		cxl_disable_rch_root_ints(dport);
+		return;
 	}
+
+	if (cxl_map_component_regs(&dport->reg_map, &dport->regs.component,
+				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
+		dev_dbg(dport->dport_dev, "Failed to map RAS capability\n");
+
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
 
+static void cxl_switch_port_init_ras(struct cxl_port *port)
+{
+	struct device *dev __free(put_device) = get_device(&port->dev);
+
+	if (is_cxl_root(to_cxl_port(port->dev.parent)))
+		return;
+
+	/* Check for parent DSP */
+	if (port->parent_dport)
+		cxl_dport_init_ras_reporting(port->parent_dport, dev);
+
+	cxl_uport_init_ras_reporting(port, dev);
+}
+
+static void cxl_endpoint_port_init_ras(struct cxl_port *port)
+{
+	struct cxl_dport *dport;
+	struct cxl_memdev *cxlmd = to_cxl_memdev(port->uport_dev);
+	struct cxl_port *parent_port __free(put_cxl_port) =
+		cxl_mem_find_port(cxlmd, &dport);
+	struct device *cxlmd_dev __free(put_device) = get_device(&cxlmd->dev);
+
+	if (!dport || !dev_is_pci(dport->dport_dev)) {
+		dev_err(&port->dev, "CXL port topology not found\n");
+		return;
+	}
+
+	cxl_dport_init_ras_reporting(dport, cxlmd_dev);
+}
+
+#else
+static void cxl_endpoint_port_init_ras(struct cxl_port *port) { }
+static void cxl_switch_port_init_ras(struct cxl_port *port) { }
 #endif /* CONFIG_PCIEAER_CXL */
 
 static int cxl_switch_port_probe(struct cxl_port *port)
@@ -149,6 +199,8 @@ static int cxl_switch_port_probe(struct cxl_port *port)
 
 	cxl_switch_parse_cdat(port);
 
+	cxl_switch_port_init_ras(port);
+
 	cxlhdm = devm_cxl_setup_hdm(port, NULL);
 	if (!IS_ERR(cxlhdm))
 		return devm_cxl_enumerate_decoders(cxlhdm, NULL);
@@ -204,6 +256,8 @@ static int cxl_endpoint_port_probe(struct cxl_port *port)
 	if (rc)
 		return rc;
 
+	cxl_endpoint_port_init_ras(port);
+
 	/*
 	 * This can't fail in practice as CXL root exit unregisters all
 	 * descendant ports and that in turn synchronizes with cxl_port_probe()
-- 
2.34.1
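
As a reading aid for the control flow of cxl_dport_init_ras_reporting() after this patch, here is a small userspace model. It only tracks which steps would run; struct dport_model and dport_init_ras_model() are hypothetical, and the real function maps registers rather than setting flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Inputs (rch, native_aer) and the steps the init path would perform. */
struct dport_model {
	bool rch;
	bool native_aer;
	bool ras_mapped;
	bool aer_mapped;
	bool root_ints_disabled;
};

static void dport_init_ras_model(struct dport_model *d)
{
	if (d->rch) {
		/* RCH: nothing is mapped unless the OS owns AER. */
		if (!d->native_aer)
			return;
		d->ras_mapped = true;
		d->aer_mapped = true;
		d->root_ints_disabled = true;
		return;
	}

	/* Root Port or Downstream Switch Port: map the RAS capability only. */
	d->ras_mapped = true;
}
```

One consequence, as the hunk reads: an RCH dport without native AER no longer has its RAS registers mapped, whereas before this patch cxl_dport_map_ras() ran ahead of the native_aer check.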



* [PATCH v8 09/16] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports
  2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (7 preceding siblings ...)
  2025-03-27  1:47 ` [PATCH v8 08/16] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers Terry Bowman
@ 2025-03-27  1:47 ` Terry Bowman
  2025-03-27  1:47 ` [PATCH v8 10/16] cxl/pci: Add log message if RAS registers are not mapped Terry Bowman
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 76+ messages in thread
From: Terry Bowman @ 2025-03-27  1:47 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

CXL PCIe Port Protocol Error handling support will be added to the
CXL drivers in the future. In preparation, rename the existing
interfaces to support handling all CXL PCIe Port Protocol Errors.

The driver's RAS support functions currently rely on a 'struct
cxl_dev_state' type parameter, which is not available for CXL Port
devices. However, since the same CXL RAS capability structure [1] is
needed across most CXL components and devices, a common handling
approach should be adopted.

To accommodate this, update the __cxl_handle_cor_ras() and
__cxl_handle_ras() functions to use a `struct device` instead of
`struct cxl_dev_state`.

No functional changes are introduced.

[1] CXL 3.1 Spec, 8.2.4 CXL.cache and CXL.mem Registers

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Alejandro Lucero <alucerop@amd.com>
Reviewed-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/cxl/core/pci.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 27ef3d55a6f1..1cf1ab4d9160 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -650,7 +650,7 @@ void read_cdat_data(struct cxl_port *port)
 }
 EXPORT_SYMBOL_NS_GPL(read_cdat_data, "CXL");
 
-static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
+static void __cxl_handle_cor_ras(struct device *dev,
 				 void __iomem *ras_base)
 {
 	void __iomem *addr;
@@ -663,13 +663,13 @@ static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
 	status = readl(addr);
 	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
 		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
-		trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
+		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
 	}
 }
 
 static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
 {
-	return __cxl_handle_cor_ras(cxlds, cxlds->regs.ras);
+	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
 }
 
 /* CXL spec rev3.0 8.2.4.16.1 */
@@ -693,8 +693,7 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
  * Log the state of the RAS status registers and prepare them to log the
  * next error status. Return 1 if reset needed.
  */
-static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
-				  void __iomem *ras_base)
+static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
 {
 	u32 hl[CXL_HEADERLOG_SIZE_U32];
 	void __iomem *addr;
@@ -721,7 +720,7 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
 	}
 
 	header_log_copy(ras_base, hl);
-	trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe, hl);
+	trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
 	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
 
 	return true;
@@ -729,7 +728,7 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
 
 static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
 {
-	return __cxl_handle_ras(cxlds, cxlds->regs.ras);
+	return __cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
 }
 
 #ifdef CONFIG_PCIEAER_CXL
@@ -737,13 +736,13 @@ static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
 static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
 					  struct cxl_dport *dport)
 {
-	return __cxl_handle_cor_ras(cxlds, dport->regs.ras);
+	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, dport->regs.ras);
 }
 
 static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
 				       struct cxl_dport *dport)
 {
-	return __cxl_handle_ras(cxlds, dport->regs.ras);
+	return __cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
 }
 
 /*
-- 
2.34.1
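
The interface change above relies on recovering the containing object from an embedded 'struct device'. The sketch below shows that pattern in userspace with simplified stand-in types; struct memdev, to_memdev() and handle_ras() are hypothetical, and the container_of() macro is a simplified version without the kernel's type checking.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for struct device and struct cxl_memdev. */
struct device {
	const char *name;
};

struct memdev {
	int id;
	struct device dev;	/* embedded, as in the kernel's cxl_memdev */
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Analogue of to_cxl_memdev(): valid only if dev is embedded in a memdev. */
static struct memdev *to_memdev(struct device *dev)
{
	return container_of(dev, struct memdev, dev);
}

/* Generic handler: takes the base device, recovers the memdev for logging. */
static int handle_ras(struct device *dev)
{
	return to_memdev(dev)->id;
}
```

The conversion is only meaningful when the device passed in really is embedded in a memdev, which holds for the endpoint and RCH wrappers touched by this patch. Passing a port's device through the same conversion would not be valid.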



* [PATCH v8 10/16] cxl/pci: Add log message if RAS registers are not mapped
  2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (8 preceding siblings ...)
  2025-03-27  1:47 ` [PATCH v8 09/16] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports Terry Bowman
@ 2025-03-27  1:47 ` Terry Bowman
  2025-04-23 16:41   ` Jonathan Cameron
  2025-03-27  1:47 ` [PATCH v8 11/16] cxl/pci: Unify CXL trace logging for CXL Endpoints and CXL Ports Terry Bowman
                   ` (7 subsequent siblings)
  17 siblings, 1 reply; 76+ messages in thread
From: Terry Bowman @ 2025-03-27  1:47 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

The CXL RAS handlers do not currently log if the RAS registers are
unmapped. This is needed in order to help debug CXL error handling. Update
the CXL driver to log a warning message if the RAS register block is
unmapped during RAS error handling.

Also, refactor the status check in __cxl_handle_cor_ras() to return early
when no status bits are set. This makes it consistent with the status check
in __cxl_handle_ras().

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/pci.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 1cf1ab4d9160..4770810b2138 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -656,15 +656,18 @@ static void __cxl_handle_cor_ras(struct device *dev,
 	void __iomem *addr;
 	u32 status;
 
-	if (!ras_base)
+	if (!ras_base) {
+		dev_warn_once(dev, "CXL RAS register block is not mapped\n");
 		return;
+	}
 
 	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
 	status = readl(addr);
-	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
-		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
-		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
-	}
+	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
+		return;
+	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
+
+	trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
 }
 
 static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
@@ -700,8 +703,10 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
 	u32 status;
 	u32 fe;
 
-	if (!ras_base)
+	if (!ras_base) {
+		dev_warn_once(dev, "CXL RAS register block is not mapped\n");
 		return false;
+	}
 
 	addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
 	status = readl(addr);
-- 
2.34.1
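
The early-return form of the correctable status check can be exercised against a fake register. This is a sketch under assumptions: the status bits are modeled as write-1-to-clear, the mask value is taken to be the low seven bits, and struct fake_ras, fake_readl(), fake_writel() and handle_cor_ras() are hypothetical userspace stand-ins.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define COR_STATUS_MASK	0x7fu	/* assumption: low seven bits are status */

struct fake_ras {
	uint32_t cor_status;
};

static uint32_t fake_readl(const struct fake_ras *ras)
{
	return ras->cor_status;
}

/* Write-1-to-clear: every bit set in val clears that status bit. */
static void fake_writel(struct fake_ras *ras, uint32_t val)
{
	ras->cor_status &= ~val;
}

/* Returns true if an error was found, cleared and reported via *logged. */
static bool handle_cor_ras(struct fake_ras *ras, uint32_t *logged)
{
	uint32_t status;

	if (!ras)
		return false;	/* the kernel version warns once here */

	status = fake_readl(ras);
	if (!(status & COR_STATUS_MASK))
		return false;

	fake_writel(ras, status & COR_STATUS_MASK);
	*logged = status;
	return true;
}
```

A second call after a successful one finds no status bits set and returns early, which is the behavior the refactored check makes explicit.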



* Re: [PATCH v8 10/16] cxl/pci: Add log message if RAS registers are not mapped
  2025-03-27  1:47 ` [PATCH v8 10/16] cxl/pci: Add log message if RAS registers are not mapped Terry Bowman
@ 2025-04-23 16:41   ` Jonathan Cameron
  2025-04-24 14:30     ` Bowman, Terry
  0 siblings, 1 reply; 76+ messages in thread
From: Jonathan Cameron @ 2025-04-23 16:41 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Wed, 26 Mar 2025 20:47:11 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> The CXL RAS handlers do not currently log if the RAS registers are
> unmapped. This is needed in order to help debug CXL error handling. Update
> the CXL driver to log a warning message if the RAS register block is
> unmapped during RAS error handling.
> 
> Also, refactor the status check in __cxl_handle_cor_ras() to return early
> when no status bits are set. This makes it consistent with the status check
> in __cxl_handle_ras().

Not keen on an 'also' bit in here.  Seems entirely separable
into its own patch.

Two trivial one-thing patches seem better than one slightly larger one.
Actual changes seem fine to me so feel free to add
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
to resulting pair of patches.

> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/core/pci.c | 17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 1cf1ab4d9160..4770810b2138 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -656,15 +656,18 @@ static void __cxl_handle_cor_ras(struct device *dev,
>  	void __iomem *addr;
>  	u32 status;
>  
> -	if (!ras_base)
> +	if (!ras_base) {
> +		dev_warn_once(dev, "CXL RAS register block is not mapped");
>  		return;
> +	}
>  
>  	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>  	status = readl(addr);
> -	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
> -		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> -		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
> -	}
> +	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
> +		return;
> +	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> +
> +	trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
>  }
>  
>  static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
> @@ -700,8 +703,10 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
>  	u32 status;
>  	u32 fe;
>  
> -	if (!ras_base)
> +	if (!ras_base) {
> +		dev_warn_once(dev, "CXL RAS register block is not mapped");
>  		return false;
> +	}
>  
>  	addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
>  	status = readl(addr);


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v8 10/16] cxl/pci: Add log message if RAS registers are not mapped
  2025-04-23 16:41   ` Jonathan Cameron
@ 2025-04-24 14:30     ` Bowman, Terry
  0 siblings, 0 replies; 76+ messages in thread
From: Bowman, Terry @ 2025-04-24 14:30 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati



On 4/23/2025 11:41 AM, Jonathan Cameron wrote:
> On Wed, 26 Mar 2025 20:47:11 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The CXL RAS handlers do not currently log if the RAS registers are
>> unmapped. This is needed in order to help debug CXL error handling. Update
>> the CXL driver to log a warning message if the RAS register block is
>> unmapped during RAS error handling.
>>
>> Also, refactor the __cxl_handle_cor_ras() function's check for status.
>> Change it to be consistent with the same status check in
>> __cxl_handle_ras().
> Not keen on an 'also' bit in here.  Seems entirely separable
> into its own patch.
>
> Two trivial single-purpose patches seem better than one slightly larger one.
> Actual changes seem fine to me so feel free to add
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> to resulting pair of patches.

Hi Jonathan,

I will split the patch as you recommend.

-Terry

>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>  drivers/cxl/core/pci.c | 17 +++++++++++------
>>  1 file changed, 11 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 1cf1ab4d9160..4770810b2138 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -656,15 +656,18 @@ static void __cxl_handle_cor_ras(struct device *dev,
>>  	void __iomem *addr;
>>  	u32 status;
>>  
>> -	if (!ras_base)
>> +	if (!ras_base) {
>> +		dev_warn_once(dev, "CXL RAS register block is not mapped");
>>  		return;
>> +	}
>>  
>>  	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>>  	status = readl(addr);
>> -	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
>> -		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>> -		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
>> -	}
>> +	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
>> +		return;
>> +	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>> +
>> +	trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
>>  }
>>  
>>  static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
>> @@ -700,8 +703,10 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
>>  	u32 status;
>>  	u32 fe;
>>  
>> -	if (!ras_base)
>> +	if (!ras_base) {
>> +		dev_warn_once(dev, "CXL RAS register block is not mapped");
>>  		return false;
>> +	}
>>  
>>  	addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
>>  	status = readl(addr);


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH v8 11/16] cxl/pci: Unifi CXL trace logging for CXL Endpoints and CXL Ports
  2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (9 preceding siblings ...)
  2025-03-27  1:47 ` [PATCH v8 10/16] cxl/pci: Add log message if RAS registers are not mapped Terry Bowman
@ 2025-03-27  1:47 ` Terry Bowman
  2025-04-23 16:44   ` Jonathan Cameron
  2025-03-27  1:47 ` [PATCH v8 12/16] cxl/pci: Assign CXL Port protocol error handlers Terry Bowman
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 76+ messages in thread
From: Terry Bowman @ 2025-03-27  1:47 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

CXL currently has separate trace routines for CXL Port errors and CXL
Endpoint errors. This is inconvenient for the user because they must
enable 2 sets of trace routines. Make updates to the trace logging such
that a single trace routine logs both CXL Endpoint and CXL Port protocol
errors.

Also, CXL RAS errors are currently logged using the associated CXL port's
name returned from dev_name(). They are typically named with 'port1',
'port2', etc. to indicate the hierarchical location in the CXL topology.
But, this doesn't clearly indicate the CXL card or slot reporting the
error.

Update the logging to also log the corresponding PCIe device name. This
will give a PCIe SBDF or ACPI object name (in case of CXL HB). This will
provide details helping users understand which physical slot and card has
the error.

Below is example output after making these changes.

Correctable error example output:
cxl_port_aer_correctable_error: device=port1 (0000:0c:00.0) parent=root0 (pci0000:0c) status='Received Error From Physical Layer'

Uncorrectable error example output:
cxl_port_aer_uncorrectable_error: device=port1 (0000:0c:00.0) parent=root0 (pci0000:0c) status: 'Memory Byte Enable Parity Error' first_error: 'Memory Byte Enable Parity Error'
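
As a rough userspace illustration of the single-format idea described above, the sketch below builds one correctable-error line for either an Endpoint or a Port. The layout mirrors the correctable-event TP_printk() format string in this patch; everything else (struct err_source, name_or_unknown(), format_cor_error(), the "unknown" placeholder) is hypothetical and not part of the kernel or of this series.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical description of an error source; not a kernel structure. */
struct err_source {
	const char *cxl_name;		/* e.g. "port1" or "mem0" */
	const char *pcie_name;		/* e.g. "0000:0c:00.0" */
	const char *cxl_parent;		/* e.g. "root0" */
	const char *pcie_parent;	/* e.g. "pci0000:0c" */
	unsigned long long serial;	/* 0 when the source has no serial number */
};

/* Substitute a placeholder when a name is missing or empty (an assumption). */
static const char *name_or_unknown(const char *name)
{
	return (name && strlen(name) > 0) ? name : "unknown";
}

/*
 * Format one correctable-error line for either an Endpoint or a Port.
 * Returns what snprintf() returns: the length the full line needs.
 */
static int format_cor_error(char *buf, size_t len,
			    const struct err_source *src, const char *status)
{
	assert(buf && src && status);

	return snprintf(buf, len,
			"device=%s (%s) parent=%s (%s) serial=%llu status='%s'",
			name_or_unknown(src->cxl_name),
			name_or_unknown(src->pcie_name),
			name_or_unknown(src->cxl_parent),
			name_or_unknown(src->pcie_parent),
			src->serial, status);
}
```

Calling format_cor_error() with the port1 values from the example output and a serial of 0 would yield a line of the same shape, with the serial field included.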

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/pci.c   |  29 ++++++------
 drivers/cxl/core/ras.c   |  14 +++---
 drivers/cxl/core/trace.h | 100 +++++++++++++--------------------------
 3 files changed, 55 insertions(+), 88 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 4770810b2138..10b2abfb0e64 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -650,14 +650,14 @@ void read_cdat_data(struct cxl_port *port)
 }
 EXPORT_SYMBOL_NS_GPL(read_cdat_data, "CXL");
 
-static void __cxl_handle_cor_ras(struct device *dev,
-				 void __iomem *ras_base)
+static void __cxl_handle_cor_ras(struct device *cxl_dev, struct device *pcie_dev,
+				 u64 serial, void __iomem *ras_base)
 {
 	void __iomem *addr;
 	u32 status;
 
 	if (!ras_base) {
-		dev_warn_once(dev, "CXL RAS register block is not mapped");
+		dev_warn_once(cxl_dev, "CXL RAS register block is not mapped");
 		return;
 	}
 
@@ -667,12 +667,12 @@ static void __cxl_handle_cor_ras(struct device *dev,
 		return;
 	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
 
-	trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
+	trace_cxl_aer_correctable_error(cxl_dev, pcie_dev, serial, status);
 }
 
 static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
 {
-	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
+	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, NULL, cxlds->serial, cxlds->regs.ras);
 }
 
 /* CXL spec rev3.0 8.2.4.16.1 */
@@ -696,7 +696,8 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
  * Log the state of the RAS status registers and prepare them to log the
  * next error status. Return 1 if reset needed.
  */
-static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
+static pci_ers_result_t __cxl_handle_ras(struct device *cxl_dev, struct device *pcie_dev,
+					 u64 serial, void __iomem *ras_base)
 {
 	u32 hl[CXL_HEADERLOG_SIZE_U32];
 	void __iomem *addr;
@@ -704,14 +705,14 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
 	u32 fe;
 
 	if (!ras_base) {
-		dev_warn_once(dev, "CXL RAS register block is not mapped");
-		return false;
+		dev_warn_once(cxl_dev, "CXL RAS register block is not mapped");
+		return PCI_ERS_RESULT_NONE;
 	}
 
 	addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
 	status = readl(addr);
 	if (!(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK))
-		return false;
+		return PCI_ERS_RESULT_NONE;
 
 	/* If multiple errors, log header points to first error from ctrl reg */
 	if (hweight32(status) > 1) {
@@ -725,15 +726,15 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
 	}
 
 	header_log_copy(ras_base, hl);
-	trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
+	trace_cxl_aer_uncorrectable_error(cxl_dev, pcie_dev, serial, status, fe, hl);
 	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
 
-	return true;
+	return PCI_ERS_RESULT_PANIC;
 }
 
 static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
 {
-	return __cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
+	return __cxl_handle_ras(&cxlds->cxlmd->dev, NULL, cxlds->serial, cxlds->regs.ras);
 }
 
 #ifdef CONFIG_PCIEAER_CXL
@@ -741,13 +742,13 @@ static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
 static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
 					  struct cxl_dport *dport)
 {
-	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, dport->regs.ras);
+	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, NULL, cxlds->serial, dport->regs.ras);
 }
 
 static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
 				       struct cxl_dport *dport)
 {
-	return __cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
+	return __cxl_handle_ras(&cxlds->cxlmd->dev, NULL, cxlds->serial, dport->regs.ras);
 }
 
 /*
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index 1f94fc08e72b..f18cb568eabd 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -13,7 +13,7 @@ static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
 {
 	u32 status = ras_cap.cor_status & ~ras_cap.cor_mask;
 
-	trace_cxl_port_aer_correctable_error(&pdev->dev, status);
+	trace_cxl_aer_correctable_error(&pdev->dev, &pdev->dev, 0, status);
 }
 
 static void cxl_cper_trace_uncorr_port_prot_err(struct pci_dev *pdev,
@@ -28,8 +28,8 @@ static void cxl_cper_trace_uncorr_port_prot_err(struct pci_dev *pdev,
 	else
 		fe = status;
 
-	trace_cxl_port_aer_uncorrectable_error(&pdev->dev, status, fe,
-					       ras_cap.header_log);
+	trace_cxl_aer_uncorrectable_error(&pdev->dev, &pdev->dev, 0,
+					  status, fe, ras_cap.header_log);
 }
 
 static void cxl_cper_trace_corr_prot_err(struct pci_dev *pdev,
@@ -42,7 +42,8 @@ static void cxl_cper_trace_corr_prot_err(struct pci_dev *pdev,
 	if (!cxlds)
 		return;
 
-	trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
+	trace_cxl_aer_correctable_error(&cxlds->cxlmd->dev, &pdev->dev,
+					cxlds->serial, status);
 }
 
 static void cxl_cper_trace_uncorr_prot_err(struct pci_dev *pdev,
@@ -62,8 +63,9 @@ static void cxl_cper_trace_uncorr_prot_err(struct pci_dev *pdev,
 	else
 		fe = status;
 
-	trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe,
-					  ras_cap.header_log);
+	trace_cxl_aer_uncorrectable_error(&cxlds->cxlmd->dev, &pdev->dev,
+					  cxlds->serial, status,
+					  fe, ras_cap.header_log);
 }
 
 static void cxl_cper_handle_prot_err(struct cxl_cper_prot_err_work_data *data)
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index 25ebfbc1616c..399e0b8bf0f2 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -48,49 +48,26 @@
 	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
 )
 
-TRACE_EVENT(cxl_port_aer_uncorrectable_error,
-	TP_PROTO(struct device *dev, u32 status, u32 fe, u32 *hl),
-	TP_ARGS(dev, status, fe, hl),
-	TP_STRUCT__entry(
-		__string(device, dev_name(dev))
-		__string(host, dev_name(dev->parent))
-		__field(u32, status)
-		__field(u32, first_error)
-		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
-	),
-	TP_fast_assign(
-		__assign_str(device);
-		__assign_str(host);
-		__entry->status = status;
-		__entry->first_error = fe;
-		/*
-		 * Embed the 512B headerlog data for user app retrieval and
-		 * parsing, but no need to print this in the trace buffer.
-		 */
-		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
-	),
-	TP_printk("device=%s host=%s status: '%s' first_error: '%s'",
-		  __get_str(device), __get_str(host),
-		  show_uc_errs(__entry->status),
-		  show_uc_errs(__entry->first_error)
-	)
-);
-
 TRACE_EVENT(cxl_aer_uncorrectable_error,
-	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
-	TP_ARGS(cxlmd, status, fe, hl),
+	TP_PROTO(struct device *cxl_dev, struct device *pcie_dev, u64 serial,
+		 u32 status, u32 fe, u32 *hl),
+	TP_ARGS(cxl_dev, pcie_dev, serial, status, fe, hl),
 	TP_STRUCT__entry(
-		__string(memdev, dev_name(&cxlmd->dev))
-		__string(host, dev_name(cxlmd->dev.parent))
+		__string(cxl_name, dev_name(cxl_dev))
+		__string(cxl_parent_name, dev_name(cxl_dev->parent))
+		__string(pcie_name, dev_name(pcie_dev))
+		__string(pcie_parent_name, dev_name(pcie_dev->parent))
 		__field(u64, serial)
 		__field(u32, status)
 		__field(u32, first_error)
 		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
 	),
 	TP_fast_assign(
-		__assign_str(memdev);
-		__assign_str(host);
-		__entry->serial = cxlmd->cxlds->serial;
+		__assign_str(cxl_name);
+		__assign_str(cxl_parent_name);
+		__assign_str(pcie_name);
+		__assign_str(pcie_parent_name);
+		__entry->serial = serial;
 		__entry->status = status;
 		__entry->first_error = fe;
 		/*
@@ -99,10 +76,11 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
 		 */
 		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
 	),
-	TP_printk("memdev=%s host=%s serial=%lld: status: '%s' first_error: '%s'",
-		  __get_str(memdev), __get_str(host), __entry->serial,
-		  show_uc_errs(__entry->status),
-		  show_uc_errs(__entry->first_error)
+	TP_printk("device=%s (%s) parent=%s (%s) serial: %lld status: '%s' first_error: '%s'",
+		__get_str(cxl_name), __get_str(pcie_name),
+		__get_str(cxl_parent_name), __get_str(pcie_parent_name),
+		__entry->serial, show_uc_errs(__entry->status),
+		show_uc_errs(__entry->first_error)
 	)
 );
 
@@ -124,43 +102,29 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
 	{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" }	\
 )
 
-TRACE_EVENT(cxl_port_aer_correctable_error,
-	TP_PROTO(struct device *dev, u32 status),
-	TP_ARGS(dev, status),
-	TP_STRUCT__entry(
-		__string(device, dev_name(dev))
-		__string(host, dev_name(dev->parent))
-		__field(u32, status)
-	),
-	TP_fast_assign(
-		__assign_str(device);
-		__assign_str(host);
-		__entry->status = status;
-	),
-	TP_printk("device=%s host=%s status='%s'",
-		  __get_str(device), __get_str(host),
-		  show_ce_errs(__entry->status)
-	)
-);
-
 TRACE_EVENT(cxl_aer_correctable_error,
-	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
-	TP_ARGS(cxlmd, status),
+	TP_PROTO(struct device *cxl_dev, struct device *pcie_dev, u64 serial, u32 status),
+	TP_ARGS(cxl_dev, pcie_dev, serial, status),
 	TP_STRUCT__entry(
-		__string(memdev, dev_name(&cxlmd->dev))
-		__string(host, dev_name(cxlmd->dev.parent))
+		__string(cxl_name, dev_name(cxl_dev))
+		__string(cxl_parent_name, dev_name(cxl_dev->parent))
+		__string(pcie_name, dev_name(pcie_dev))
+		__string(pcie_parent_name, dev_name(pcie_dev->parent))
 		__field(u64, serial)
 		__field(u32, status)
 	),
 	TP_fast_assign(
-		__assign_str(memdev);
-		__assign_str(host);
-		__entry->serial = cxlmd->cxlds->serial;
+		__assign_str(cxl_name);
+		__assign_str(cxl_parent_name);
+		__assign_str(pcie_name);
+		__assign_str(pcie_parent_name);
+		__entry->serial = serial;
 		__entry->status = status;
 	),
-	TP_printk("memdev=%s host=%s serial=%lld: status: '%s'",
-		  __get_str(memdev), __get_str(host), __entry->serial,
-		  show_ce_errs(__entry->status)
+	TP_printk("device=%s (%s) parent=%s (%s) serial=%lld status='%s'",
+		__get_str(cxl_name), __get_str(pcie_name),
+		__get_str(cxl_parent_name), __get_str(pcie_parent_name),
+		__entry->serial, show_ce_errs(__entry->status)
 	)
 );
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v8 11/16] cxl/pci: Unifi CXL trace logging for CXL Endpoints and CXL Ports
  2025-03-27  1:47 ` [PATCH v8 11/16] cxl/pci: Unifi CXL trace logging for CXL Endpoints and CXL Ports Terry Bowman
@ 2025-04-23 16:44   ` Jonathan Cameron
  2025-05-07 16:28     ` Shiju Jose
  0 siblings, 1 reply; 76+ messages in thread
From: Jonathan Cameron @ 2025-04-23 16:44 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, shiju.jose

On Wed, 26 Mar 2025 20:47:12 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

Unify.


> CXL currently has separate trace routines for CXL Port errors and CXL
> Endpoint errors. This is inconvenient for the user because they must
> enable 2 sets of trace routines. Make updates to the trace logging such
> that a single trace routine logs both CXL Endpoint and CXL Port protocol
> errors.
> 
> Also, CXL RAS errors are currently logged using the associated CXL port's
> name returned from dev_name(). They are typically named with 'port1',
> 'port2', etc. to indicate the hierarchical location in the CXL topology.
> But, this doesn't clearly indicate the CXL card or slot reporting the
> error.
> 
> Update the logging to also log the corresponding PCIe device name. This
> will give a PCIe SBDF or ACPI object name (in case of CXL HB). This will
> provide details helping users understand which physical slot and card has
> the error.
> 
> Below is example output after making these changes.
> 
> Correctable error example output:
> cxl_port_aer_correctable_error: device=port1 (0000:0c:00.0) parent=root0 (pci0000:0c) status='Received Error From Physical Layer'
> 
> Uncorrectable error example output:
> cxl_port_aer_uncorrectable_error: device=port1 (0000:0c:00.0) parent=root0 (pci0000:0c) status: 'Memory Byte Enable Parity Error' first_error: 'Memory Byte Enable Parity Error'

I'm not sure the pcie parent is adding much... Why bother with that?

Shiju, is this going to affect rasdaemon handling?

I'd assume we can't just rename fields in the tracepoints and
combining them will also presumably make a mess?

Jonathan


> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/core/pci.c   |  29 ++++++------
>  drivers/cxl/core/ras.c   |  14 +++---
>  drivers/cxl/core/trace.h | 100 +++++++++++++--------------------------
>  3 files changed, 55 insertions(+), 88 deletions(-)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 4770810b2138..10b2abfb0e64 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -650,14 +650,14 @@ void read_cdat_data(struct cxl_port *port)
>  }
>  EXPORT_SYMBOL_NS_GPL(read_cdat_data, "CXL");
>  
> -static void __cxl_handle_cor_ras(struct device *dev,
> -				 void __iomem *ras_base)
> +static void __cxl_handle_cor_ras(struct device *cxl_dev, struct device *pcie_dev,
> +				 u64 serial, void __iomem *ras_base)
>  {
>  	void __iomem *addr;
>  	u32 status;
>  
>  	if (!ras_base) {
> -		dev_warn_once(dev, "CXL RAS register block is not mapped");
> +		dev_warn_once(cxl_dev, "CXL RAS register block is not mapped");
>  		return;
>  	}
>  
> @@ -667,12 +667,12 @@ static void __cxl_handle_cor_ras(struct device *dev,
>  		return;
>  	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>  
> -	trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
> +	trace_cxl_aer_correctable_error(cxl_dev, pcie_dev, serial, status);
>  }
>  
>  static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
>  {
> -	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
> +	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, NULL, cxlds->serial, cxlds->regs.ras);
>  }
>  
>  /* CXL spec rev3.0 8.2.4.16.1 */
> @@ -696,7 +696,8 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
>   * Log the state of the RAS status registers and prepare them to log the
>   * next error status. Return 1 if reset needed.
>   */
> -static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
> +static pci_ers_result_t __cxl_handle_ras(struct device *cxl_dev, struct device *pcie_dev,
> +					 u64 serial, void __iomem *ras_base)
>  {
>  	u32 hl[CXL_HEADERLOG_SIZE_U32];
>  	void __iomem *addr;
> @@ -704,14 +705,14 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
>  	u32 fe;
>  
>  	if (!ras_base) {
> -		dev_warn_once(dev, "CXL RAS register block is not mapped");
> -		return false;
> +		dev_warn_once(cxl_dev, "CXL RAS register block is not mapped");
> +		return PCI_ERS_RESULT_NONE;
>  	}
>  
>  	addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
>  	status = readl(addr);
>  	if (!(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK))
> -		return false;
> +		return PCI_ERS_RESULT_NONE;
>  
>  	/* If multiple errors, log header points to first error from ctrl reg */
>  	if (hweight32(status) > 1) {
> @@ -725,15 +726,15 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
>  	}
>  
>  	header_log_copy(ras_base, hl);
> -	trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
> +	trace_cxl_aer_uncorrectable_error(cxl_dev, pcie_dev, serial, status, fe, hl);
>  	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>  
> -	return true;
> +	return PCI_ERS_RESULT_PANIC;
>  }
>  
>  static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
>  {
> -	return __cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
> +	return __cxl_handle_ras(&cxlds->cxlmd->dev, NULL, cxlds->serial, cxlds->regs.ras);
>  }
>  
>  #ifdef CONFIG_PCIEAER_CXL
> @@ -741,13 +742,13 @@ static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
>  static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
>  					  struct cxl_dport *dport)
>  {
> -	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, dport->regs.ras);
> +	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, NULL, cxlds->serial, dport->regs.ras);
>  }
>  
>  static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
>  				       struct cxl_dport *dport)
>  {
> -	return __cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
> +	return __cxl_handle_ras(&cxlds->cxlmd->dev, NULL, cxlds->serial, dport->regs.ras);
>  }
>  
>  /*
> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index 1f94fc08e72b..f18cb568eabd 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -13,7 +13,7 @@ static void cxl_cper_trace_corr_port_prot_err(struct pci_dev *pdev,
>  {
>  	u32 status = ras_cap.cor_status & ~ras_cap.cor_mask;
>  
> -	trace_cxl_port_aer_correctable_error(&pdev->dev, status);
> +	trace_cxl_aer_correctable_error(&pdev->dev, &pdev->dev, 0, status);
>  }
>  
>  static void cxl_cper_trace_uncorr_port_prot_err(struct pci_dev *pdev,
> @@ -28,8 +28,8 @@ static void cxl_cper_trace_uncorr_port_prot_err(struct pci_dev *pdev,
>  	else
>  		fe = status;
>  
> -	trace_cxl_port_aer_uncorrectable_error(&pdev->dev, status, fe,
> -					       ras_cap.header_log);
> +	trace_cxl_aer_uncorrectable_error(&pdev->dev, &pdev->dev, 0,
> +					  status, fe, ras_cap.header_log);
>  }
>  
>  static void cxl_cper_trace_corr_prot_err(struct pci_dev *pdev,
> @@ -42,7 +42,8 @@ static void cxl_cper_trace_corr_prot_err(struct pci_dev *pdev,
>  	if (!cxlds)
>  		return;
>  
> -	trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
> +	trace_cxl_aer_correctable_error(&cxlds->cxlmd->dev, &pdev->dev,
> +					cxlds->serial, status);
>  }
>  
>  static void cxl_cper_trace_uncorr_prot_err(struct pci_dev *pdev,
> @@ -62,8 +63,9 @@ static void cxl_cper_trace_uncorr_prot_err(struct pci_dev *pdev,
>  	else
>  		fe = status;
>  
> -	trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe,
> -					  ras_cap.header_log);
> +	trace_cxl_aer_uncorrectable_error(&cxlds->cxlmd->dev, &pdev->dev,
> +					  cxlds->serial, status,
> +					  fe, ras_cap.header_log);
>  }
>  
>  static void cxl_cper_handle_prot_err(struct cxl_cper_prot_err_work_data *data)
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index 25ebfbc1616c..399e0b8bf0f2 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -48,49 +48,26 @@
>  	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
>  )
>  
> -TRACE_EVENT(cxl_port_aer_uncorrectable_error,
> -	TP_PROTO(struct device *dev, u32 status, u32 fe, u32 *hl),
> -	TP_ARGS(dev, status, fe, hl),
> -	TP_STRUCT__entry(
> -		__string(device, dev_name(dev))
> -		__string(host, dev_name(dev->parent))
> -		__field(u32, status)
> -		__field(u32, first_error)
> -		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
> -	),
> -	TP_fast_assign(
> -		__assign_str(device);
> -		__assign_str(host);
> -		__entry->status = status;
> -		__entry->first_error = fe;
> -		/*
> -		 * Embed the 512B headerlog data for user app retrieval and
> -		 * parsing, but no need to print this in the trace buffer.
> -		 */
> -		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
> -	),
> -	TP_printk("device=%s host=%s status: '%s' first_error: '%s'",
> -		  __get_str(device), __get_str(host),
> -		  show_uc_errs(__entry->status),
> -		  show_uc_errs(__entry->first_error)
> -	)
> -);
> -
>  TRACE_EVENT(cxl_aer_uncorrectable_error,
> -	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
> -	TP_ARGS(cxlmd, status, fe, hl),
> +	TP_PROTO(struct device *cxl_dev, struct device *pcie_dev, u64 serial,
> +		 u32 status, u32 fe, u32 *hl),
> +	TP_ARGS(cxl_dev, pcie_dev, serial, status, fe, hl),
>  	TP_STRUCT__entry(
> -		__string(memdev, dev_name(&cxlmd->dev))
> -		__string(host, dev_name(cxlmd->dev.parent))
> +		__string(cxl_name, dev_name(cxl_dev))
> +		__string(cxl_parent_name, dev_name(cxl_dev->parent))
> +		__string(pcie_name, dev_name(pcie_dev))
> +		__string(pcie_parent_name, dev_name(pcie_dev->parent))
>  		__field(u64, serial)
>  		__field(u32, status)
>  		__field(u32, first_error)
>  		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
>  	),
>  	TP_fast_assign(
> -		__assign_str(memdev);
> -		__assign_str(host);
> -		__entry->serial = cxlmd->cxlds->serial;
> +		__assign_str(cxl_name);
> +		__assign_str(cxl_parent_name);
> +		__assign_str(pcie_name);
> +		__assign_str(pcie_parent_name);
> +		__entry->serial = serial;
>  		__entry->status = status;
>  		__entry->first_error = fe;
>  		/*
> @@ -99,10 +76,11 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
>  		 */
>  		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
>  	),
> -	TP_printk("memdev=%s host=%s serial=%lld: status: '%s' first_error: '%s'",
> -		  __get_str(memdev), __get_str(host), __entry->serial,
> -		  show_uc_errs(__entry->status),
> -		  show_uc_errs(__entry->first_error)
> +	TP_printk("device=%s (%s) parent=%s (%s) serial: %lld status: '%s' first_error: '%s'",
> +		__get_str(cxl_name), __get_str(pcie_name),
> +		__get_str(cxl_parent_name), __get_str(pcie_parent_name),
> +		__entry->serial, show_uc_errs(__entry->status),
> +		show_uc_errs(__entry->first_error)
>  	)
>  );
>  
> @@ -124,43 +102,29 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
>  	{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" }	\
>  )
>  
> -TRACE_EVENT(cxl_port_aer_correctable_error,
> -	TP_PROTO(struct device *dev, u32 status),
> -	TP_ARGS(dev, status),
> -	TP_STRUCT__entry(
> -		__string(device, dev_name(dev))
> -		__string(host, dev_name(dev->parent))
> -		__field(u32, status)
> -	),
> -	TP_fast_assign(
> -		__assign_str(device);
> -		__assign_str(host);
> -		__entry->status = status;
> -	),
> -	TP_printk("device=%s host=%s status='%s'",
> -		  __get_str(device), __get_str(host),
> -		  show_ce_errs(__entry->status)
> -	)
> -);
> -
>  TRACE_EVENT(cxl_aer_correctable_error,
> -	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
> -	TP_ARGS(cxlmd, status),
> +	TP_PROTO(struct device *cxl_dev, struct device *pcie_dev, u64 serial, u32 status),
> +	TP_ARGS(cxl_dev, pcie_dev, serial, status),
>  	TP_STRUCT__entry(
> -		__string(memdev, dev_name(&cxlmd->dev))
> -		__string(host, dev_name(cxlmd->dev.parent))
> +		__string(cxl_name, dev_name(cxl_dev))
> +		__string(cxl_parent_name, dev_name(cxl_dev->parent))
> +		__string(pcie_name, dev_name(pcie_dev))
> +		__string(pcie_parent_name, dev_name(pcie_dev->parent))
>  		__field(u64, serial)
>  		__field(u32, status)
>  	),
>  	TP_fast_assign(
> -		__assign_str(memdev);
> -		__assign_str(host);
> -		__entry->serial = cxlmd->cxlds->serial;
> +		__assign_str(cxl_name);
> +		__assign_str(cxl_parent_name);
> +		__assign_str(pcie_name);
> +		__assign_str(pcie_parent_name);
> +		__entry->serial = serial;
>  		__entry->status = status;
>  	),
> -	TP_printk("memdev=%s host=%s serial=%lld: status: '%s'",
> -		  __get_str(memdev), __get_str(host), __entry->serial,
> -		  show_ce_errs(__entry->status)
> +	TP_printk("device=%s (%s) parent=%s (%s) serial=%lld status='%s'",
> +		__get_str(cxl_name), __get_str(pcie_name),
> +		__get_str(cxl_parent_name), __get_str(pcie_parent_name),
> +		__entry->serial, show_ce_errs(__entry->status)
>  	)
>  );
>  


^ permalink raw reply	[flat|nested] 76+ messages in thread

* RE: [PATCH v8 11/16] cxl/pci: Unifi CXL trace logging for CXL Endpoints and CXL Ports
  2025-04-23 16:44   ` Jonathan Cameron
@ 2025-05-07 16:28     ` Shiju Jose
  2025-05-07 18:30       ` Bowman, Terry
  0 siblings, 1 reply; 76+ messages in thread
From: Shiju Jose @ 2025-05-07 16:28 UTC (permalink / raw)
  To: Jonathan Cameron, Terry Bowman
  Cc: linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, nifan.cxl@gmail.com, dave@stgolabs.net,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, dan.j.williams@intel.com,
	bhelgaas@google.com, mahesh@linux.ibm.com, ira.weiny@intel.com,
	oohall@gmail.com, Benjamin.Cheatham@amd.com, rrichter@amd.com,
	nathan.fontenot@amd.com, Smita.KoralahalliChannabasappa@amd.com,
	lukas@wunner.de, ming.li@zohomail.com,
	PradeepVineshReddy.Kodamati@amd.com

>-----Original Message-----
>From: Jonathan Cameron <jonathan.cameron@huawei.com>
>Sent: 23 April 2025 17:45
>To: Terry Bowman <terry.bowman@amd.com>
>Cc: linux-cxl@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
>pci@vger.kernel.org; nifan.cxl@gmail.com; dave@stgolabs.net;
>dave.jiang@intel.com; alison.schofield@intel.com; vishal.l.verma@intel.com;
>dan.j.williams@intel.com; bhelgaas@google.com; mahesh@linux.ibm.com;
>ira.weiny@intel.com; oohall@gmail.com; Benjamin.Cheatham@amd.com;
>rrichter@amd.com; nathan.fontenot@amd.com;
>Smita.KoralahalliChannabasappa@amd.com; lukas@wunner.de;
>ming.li@zohomail.com; PradeepVineshReddy.Kodamati@amd.com; Shiju Jose
><shiju.jose@huawei.com>
>Subject: Re: [PATCH v8 11/16] cxl/pci: Unifi CXL trace logging for CXL Endpoints
>and CXL Ports
>
>On Wed, 26 Mar 2025 20:47:12 -0500
>Terry Bowman <terry.bowman@amd.com> wrote:
>
>Unify.
>
>
>> CXL currently has separate trace routines for CXL Port errors and CXL
>> Endpoint errors. This is inconvenient for the user because they must
>> enable 2 sets of trace routines. Make updates to the trace logging
>> such that a single trace routine logs both CXL Endpoint and CXL Port
>> protocol errors.
>>
>> Also, CXL RAS errors are currently logged using the associated CXL
>> port's name returned from dev_name(). They are typically named with
>> 'port1', 'port2', etc. to indicate the hierarchical location in the CXL topology.
>> But, this doesn't clearly indicate the CXL card or slot reporting the
>> error.
>>
>> Update the logging to also log the corresponding PCIe device name. This
>> will give a PCIe SBDF or ACPI object name (in case of CXL HB). This
>> will provide details helping users understand which physical slot and
>> card has the error.
>>
>> Below is example output after making these changes.
>>
>> Correctable error example output:
>> cxl_port_aer_correctable_error: device=port1 (0000:0c:00.0) parent=root0 (pci0000:0c) status='Received Error From Physical Layer'
>>
>> Uncorrectable error example output:
>> cxl_port_aer_uncorrectable_error: device=port1 (0000:0c:00.0) parent=root0 (pci0000:0c) status: 'Memory Byte Enable Parity Error' first_error: 'Memory Byte Enable Parity Error'
>
>I'm not sure the pcie parent is adding much... Why bother with that?
>
>Shiju, is this going to affect rasdaemon handling?

Hi Jonathan,

Yes. Renaming the existing fields in the trace events will cause rasdaemon to
fail when parsing those fields.
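
To illustrate the failure mode, here is a minimal userspace sketch of a
consumer that looks up trace event fields by name. It is not rasdaemon
code: the struct, the helper names and the field names "memdev", "device"
and "status" are assumptions chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified view of one named field in a trace record. */
struct trace_field {
	const char *name;
	const char *value;
};

/* Return the value of the named field, or NULL if the record lacks it. */
static const char *find_field(const struct trace_field *fields, size_t n,
			      const char *name)
{
	assert(fields != NULL && name != NULL);

	for (size_t i = 0; i < n; i++) {
		if (strcmp(fields[i].name, name) == 0)
			return fields[i].value;
	}
	return NULL;
}

/*
 * A consumer written against the old layout asks for "memdev" by name.
 * Returns 0 on success, -1 if a required field is missing.
 */
static int parse_cor_error(const struct trace_field *fields, size_t n,
			   const char **memdev, const char **status)
{
	*memdev = find_field(fields, n, "memdev");
	*status = find_field(fields, n, "status");

	return (*memdev && *status) ? 0 : -1;
}
```

With a record that carries "memdev" and "status" the parse succeeds. If the
same event is emitted with the first field renamed (say to "device"), the
lookup returns NULL and the consumer has to drop or mis-report the event
until it is updated.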

>
>I'd assume we can't just rename fields in the tracepoints and combining them
>will also presumably make a mess?
>
>Jonathan
>
[...]
>>

Thanks,
Shiju


^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v8 11/16] cxl/pci: Unifi CXL trace logging for CXL Endpoints and CXL Ports
  2025-05-07 16:28     ` Shiju Jose
@ 2025-05-07 18:30       ` Bowman, Terry
  0 siblings, 0 replies; 76+ messages in thread
From: Bowman, Terry @ 2025-05-07 18:30 UTC (permalink / raw)
  To: Shiju Jose, Jonathan Cameron
  Cc: linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, nifan.cxl@gmail.com, dave@stgolabs.net,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, dan.j.williams@intel.com,
	bhelgaas@google.com, mahesh@linux.ibm.com, ira.weiny@intel.com,
	oohall@gmail.com, Benjamin.Cheatham@amd.com, rrichter@amd.com,
	nathan.fontenot@amd.com, Smita.KoralahalliChannabasappa@amd.com,
	lukas@wunner.de, ming.li@zohomail.com,
	PradeepVineshReddy.Kodamati@amd.com



On 5/7/2025 11:28 AM, Shiju Jose wrote:
>> -----Original Message-----
>> From: Jonathan Cameron <jonathan.cameron@huawei.com>
>> Sent: 23 April 2025 17:45
>> To: Terry Bowman <terry.bowman@amd.com>
>> Cc: linux-cxl@vger.kernel.org; linux-kernel@vger.kernel.org; linux-
>> pci@vger.kernel.org; nifan.cxl@gmail.com; dave@stgolabs.net;
>> dave.jiang@intel.com; alison.schofield@intel.com; vishal.l.verma@intel.com;
>> dan.j.williams@intel.com; bhelgaas@google.com; mahesh@linux.ibm.com;
>> ira.weiny@intel.com; oohall@gmail.com; Benjamin.Cheatham@amd.com;
>> rrichter@amd.com; nathan.fontenot@amd.com;
>> Smita.KoralahalliChannabasappa@amd.com; lukas@wunner.de;
>> ming.li@zohomail.com; PradeepVineshReddy.Kodamati@amd.com; Shiju Jose
>> <shiju.jose@huawei.com>
>> Subject: Re: [PATCH v8 11/16] cxl/pci: Unifi CXL trace logging for CXL Endpoints
>> and CXL Ports
>>
>> On Wed, 26 Mar 2025 20:47:12 -0500
>> Terry Bowman <terry.bowman@amd.com> wrote:
>>
>> Unify.
>>
>>
>>> CXL currently has separate trace routines for CXL Port errors and CXL
>>> Endpoint errors. This is inconvenient for the user because they must
>>> enable 2 sets of trace routines. Make updates to the trace logging
>>> such that a single trace routine logs both CXL Endpoint and CXL Port
>>> protocol errors.
>>>
>>> Also, CXL RAS errors are currently logged using the associated CXL
>>> port's name returned from devname(). They are typically named with
>>> 'port1', 'port2', etc. to indicate the hierarchical location in the CXL topology.
>>> But, this doesn't clearly indicate the CXL card or slot reporting the
>>> error.
>>>
>>> Update the logging to also log the corresponding PCIe devname. This
>>> will give a PCIe SBDF or ACPI object name (in case of CXL HB). This
>>> will provide details helping users understand which physical slot and
>>> card has the error.
>>>
>>> Below is example output after making these changes.
>>>
>>> Correctable error example output:
>>> cxl_port_aer_correctable_error: device=port1 (0000:0c:00.0) parent=root0
>> (pci0000:0c) status='Received Error From Physical Layer'
>>> Uncorrectable error example output:
>>> cxl_port_aer_uncorrectable_error: device=port1 (0000:0c:00.0) parent=root0
>> (pci0000:0c) status: 'Memory Byte Enable Parity Error' first_error: 'Memory
>> Byte Enable Parity Error'
>>
>> I'm not sure the pcie parent is adding much... Why bother with that?
>>
>> Shiju, is this going to affect rasdaemon handling?
> Hi Jonathan,
>
> Yes. Renaming the existing fields in the trace events will cause rasdaemon to
> fail when parsing those fields.
>
>> I'd assume we can't just rename fields in the tracepoints and combining them
>> will also presumably make a mess?
>>
>> Jonathan
>>
> [...]
> Thanks,
> Shiju
>
Shiju and Jonathan,

I will remove the parent field.

-Terry

^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH v8 12/16] cxl/pci: Assign CXL Port protocol error handlers
  2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (10 preceding siblings ...)
  2025-03-27  1:47 ` [PATCH v8 11/16] cxl/pci: Unifi CXL trace logging for CXL Endpoints and CXL Ports Terry Bowman
@ 2025-03-27  1:47 ` Terry Bowman
  2025-04-23 16:47   ` Jonathan Cameron
  2025-03-27  1:47 ` [PATCH v8 13/16] cxl/pci: Assign CXL Endpoint " Terry Bowman
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 76+ messages in thread
From: Terry Bowman @ 2025-03-27  1:47 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

Introduce CXL error handlers for CXL Port devices. These are needed
to handle and log CXL protocol errors.

Update cxl_create_prot_err_info() with support for CXL Root Ports (RP), CXL
Upstream Switch Ports (USP) and CXL Downstream Switch Ports (DSP).

Add functions cxl_port_error_detected() and cxl_port_cor_error_detected().

Add cxl_assign_error_handlers() and use it to assign the CXL Port error
handlers for CXL RP, CXL USP, and CXL DSP. Make the assignments in
cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting() after
mapping RAS registers.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/core.h |  2 ++
 drivers/cxl/core/pci.c  | 23 +++++++++++++
 drivers/cxl/core/port.c |  4 +--
 drivers/cxl/core/ras.c  | 76 +++++++++++++++++++++++++++++++++--------
 drivers/cxl/cxl.h       |  5 +++
 drivers/cxl/port.c      | 29 ++++++++++++++--
 6 files changed, 120 insertions(+), 19 deletions(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 15699299dc11..5ce7269e5f13 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -122,6 +122,8 @@ void cxl_ras_exit(void);
 int cxl_gpf_port_setup(struct device *dport_dev, struct cxl_port *port);
 int cxl_acpi_get_extended_linear_cache_size(struct resource *backing_res,
 					    int nid, resource_size_t *size);
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+			       struct cxl_dport **dport);
 
 #ifdef CONFIG_CXL_FEATURES
 size_t cxl_get_feature(struct cxl_mailbox *cxl_mbox, const uuid_t *feat_uuid,
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 10b2abfb0e64..9ed6f700e132 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -739,6 +739,29 @@ static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
 
 #ifdef CONFIG_PCIEAER_CXL
 
+
+void cxl_port_cor_error_detected(struct device *cxl_dev,
+				 struct cxl_prot_error_info *err_info)
+{
+	void __iomem *ras_base = err_info->ras_base;
+	struct device *pci_dev = &err_info->pdev->dev;
+	u64 serial = 0;
+
+	__cxl_handle_cor_ras(cxl_dev, pci_dev, serial, ras_base);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_port_cor_error_detected, "CXL");
+
+pci_ers_result_t cxl_port_error_detected(struct device *cxl_dev,
+					 struct cxl_prot_error_info *err_info)
+{
+	void __iomem *ras_base = err_info->ras_base;
+	struct device *pci_dev = &err_info->pdev->dev;
+	u64 serial = 0;
+
+	return  __cxl_handle_ras(cxl_dev, pci_dev, serial, ras_base);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_port_error_detected, "CXL");
+
 static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
 					  struct cxl_dport *dport)
 {
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 0fd6646c1a2e..83d331c82d91 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1348,8 +1348,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
 	return NULL;
 }
 
-static struct cxl_port *find_cxl_port(struct device *dport_dev,
-				      struct cxl_dport **dport)
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+			       struct cxl_dport **dport)
 {
 	struct cxl_find_port_ctx ctx = {
 		.dport_dev = dport_dev,
diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
index f18cb568eabd..fe38e76f2d1a 100644
--- a/drivers/cxl/core/ras.c
+++ b/drivers/cxl/core/ras.c
@@ -110,34 +110,80 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
 }
 static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
 
+static int match_uport(struct device *dev, const void *data)
+{
+	const struct device *uport_dev = data;
+	struct cxl_port *port;
+
+	if (!is_cxl_port(dev))
+		return 0;
+
+	port = to_cxl_port(dev);
+
+	return port->uport_dev == uport_dev;
+}
+
 int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
 			     struct cxl_prot_error_info *err_info)
 {
 	struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(_pdev);
-	struct cxl_dev_state *cxlds;
 
 	if (!pdev || !err_info) {
 		pr_warn_once("Error: parameter is NULL");
 		return -ENODEV;
 	}
 
-	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
-	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END)) {
+	*err_info = (struct cxl_prot_error_info){ 0 };
+	err_info->severity = severity;
+	err_info->pdev = pdev;
+
+	switch (pci_pcie_type(pdev)) {
+	case PCI_EXP_TYPE_ROOT_PORT:
+	case PCI_EXP_TYPE_DOWNSTREAM:
+	{
+		struct cxl_dport *dport = NULL;
+		struct cxl_port *port __free(put_cxl_port) =
+			find_cxl_port(&pdev->dev, &dport);
+
+		if (!port || !is_cxl_port(&port->dev))
+			return -ENODEV;
+
+		err_info->ras_base = dport ? dport->regs.ras : NULL;
+		err_info->dev = &port->dev;
+		break;
+	}
+	case PCI_EXP_TYPE_UPSTREAM:
+	{
+		struct cxl_port *port;
+		struct device *port_dev __free(put_device) =
+			bus_find_device(&cxl_bus_type, NULL, &pdev->dev,
+					match_uport);
+
+		if (!port_dev || !is_cxl_port(port_dev))
+			return -ENODEV;
+
+		port = to_cxl_port(port_dev);
+		err_info->ras_base = port ? port->uport_regs.ras : NULL;
+		err_info->dev = port_dev;
+		break;
+	}
+	case PCI_EXP_TYPE_ENDPOINT:
+	case PCI_EXP_TYPE_RC_END:
+	{
+		struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
+		struct cxl_memdev *cxlmd = cxlds->cxlmd;
+		struct device *dev __free(put_device) = get_device(&cxlmd->dev);
+
+		err_info->ras_base = cxlds->regs.ras;
+		err_info->dev = &cxlds->cxlmd->dev;
+		break;
+	}
+	default:
+	{
 		pci_warn_once(pdev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));
 		return -ENODEV;
 	}
-
-	cxlds = pci_get_drvdata(pdev);
-	struct device *dev __free(put_device) = get_device(&cxlds->cxlmd->dev);
-
-	if (!dev)
-		return -ENODEV;
-
-	*err_info = (struct cxl_prot_error_info){ 0 };
-	err_info->ras_base = cxlds->regs.ras;
-	err_info->severity = severity;
-	err_info->pdev = pdev;
-	err_info->dev = dev;
+	}
 
 	return 0;
 }
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 0d05d5449f97..512cc38892ed 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -810,6 +810,11 @@ struct cxl_error_handlers {
 				   struct cxl_prot_error_info *err_info);
 };
 
+void cxl_port_cor_error_detected(struct device *dev,
+				 struct cxl_prot_error_info *err_info);
+pci_ers_result_t cxl_port_error_detected(struct device *dev,
+					 struct cxl_prot_error_info *err_info);
+
 /**
  * struct cxl_endpoint_dvsec_info - Cached DVSEC info
  * @mem_enabled: cached value of mem_enabled in the DVSEC at init time
diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
index 1b8dc161428f..30a4bdb88c31 100644
--- a/drivers/cxl/port.c
+++ b/drivers/cxl/port.c
@@ -60,6 +60,24 @@ static int discover_region(struct device *dev, void *root)
 
 #ifdef CONFIG_PCIEAER_CXL
 
+static const struct cxl_error_handlers cxl_port_error_handlers = {
+	.error_detected = cxl_port_error_detected,
+	.cor_error_detected = cxl_port_cor_error_detected,
+};
+
+static void cxl_assign_error_handlers(struct device *_dev,
+				      const struct cxl_error_handlers *handlers)
+{
+	struct device *dev __free(put_device) = get_device(_dev);
+	struct cxl_driver *pdrv;
+
+	if (!dev)
+		return;
+
+	pdrv = to_cxl_drv(dev->driver);
+	pdrv->err_handler = handlers;
+}
+
 static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
 {
 	resource_size_t aer_phys;
@@ -118,8 +136,12 @@ static void cxl_uport_init_ras_reporting(struct cxl_port *port,
 
 	map->host = host;
 	if (cxl_map_component_regs(map, &port->uport_regs,
-				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
+				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
 		dev_dbg(&port->dev, "Failed to map RAS capability\n");
+		return;
+	}
+
+	cxl_assign_error_handlers(&port->dev, &cxl_port_error_handlers);
 }
 
 /**
@@ -144,9 +166,12 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
 	}
 
 	if (cxl_map_component_regs(&dport->reg_map, &dport->regs.component,
-				   BIT(CXL_CM_CAP_CAP_ID_RAS)))
+				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
 		dev_dbg(dport->dport_dev, "Failed to map RAS capability\n");
+		return;
+	}
 
+	cxl_assign_error_handlers(dport->dport_dev, &cxl_port_error_handlers);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v8 12/16] cxl/pci: Assign CXL Port protocol error handlers
  2025-03-27  1:47 ` [PATCH v8 12/16] cxl/pci: Assign CXL Port protocol error handlers Terry Bowman
@ 2025-04-23 16:47   ` Jonathan Cameron
  0 siblings, 0 replies; 76+ messages in thread
From: Jonathan Cameron @ 2025-04-23 16:47 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Wed, 26 Mar 2025 20:47:13 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> Introduce CXL error handlers for CXL Port devices. These are needed
> to handle and log CXL protocol errors.
> 
> Update cxl_create_prot_err_info() with support for CXL Root Ports (RP), CXL
> Upstream Switch Ports (USP) and CXL Downstream Switch Ports (DSP).
> 
> Add functions cxl_port_error_detected() and cxl_port_cor_error_detected().
> 
> Add cxl_assign_error_handlers() and use it to assign the CXL Port error
> handlers for CXL RP, CXL USP, and CXL DSP. Make the assignments in
> cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting() after
> mapping RAS registers.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/core/core.h |  2 ++
>  drivers/cxl/core/pci.c  | 23 +++++++++++++
>  drivers/cxl/core/port.c |  4 +--
>  drivers/cxl/core/ras.c  | 76 +++++++++++++++++++++++++++++++++--------
>  drivers/cxl/cxl.h       |  5 +++
>  drivers/cxl/port.c      | 29 ++++++++++++++--
>  6 files changed, 120 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 15699299dc11..5ce7269e5f13 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -122,6 +122,8 @@ void cxl_ras_exit(void);
>  int cxl_gpf_port_setup(struct device *dport_dev, struct cxl_port *port);
>  int cxl_acpi_get_extended_linear_cache_size(struct resource *backing_res,
>  					    int nid, resource_size_t *size);
> +struct cxl_port *find_cxl_port(struct device *dport_dev,
> +			       struct cxl_dport **dport);
>  
>  #ifdef CONFIG_CXL_FEATURES
>  size_t cxl_get_feature(struct cxl_mailbox *cxl_mbox, const uuid_t *feat_uuid,
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 10b2abfb0e64..9ed6f700e132 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -739,6 +739,29 @@ static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
>  
>  #ifdef CONFIG_PCIEAER_CXL
>  
> +
> +void cxl_port_cor_error_detected(struct device *cxl_dev,
> +				 struct cxl_prot_error_info *err_info)
> +{
> +	void __iomem *ras_base = err_info->ras_base;
> +	struct device *pci_dev = &err_info->pdev->dev;
> +	u64 serial = 0;
> +
> +	__cxl_handle_cor_ras(cxl_dev, pci_dev, serial, ras_base);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_port_cor_error_detected, "CXL");
> +
> +pci_ers_result_t cxl_port_error_detected(struct device *cxl_dev,
> +					 struct cxl_prot_error_info *err_info)
> +{
> +	void __iomem *ras_base = err_info->ras_base;
> +	struct device *pci_dev = &err_info->pdev->dev;
> +	u64 serial = 0;

Maybe just put that directly in the call?  Or is it useful to have
it here as a form of documentation?

> +
> +	return  __cxl_handle_ras(cxl_dev, pci_dev, serial, ras_base);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_port_error_detected, "CXL");
> +
>  static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
>  					  struct cxl_dport *dport)
>  {

> diff --git a/drivers/cxl/core/ras.c b/drivers/cxl/core/ras.c
> index f18cb568eabd..fe38e76f2d1a 100644
> --- a/drivers/cxl/core/ras.c
> +++ b/drivers/cxl/core/ras.c
> @@ -110,34 +110,80 @@ static void cxl_cper_prot_err_work_fn(struct work_struct *work)
>  }
>  static DECLARE_WORK(cxl_cper_prot_err_work, cxl_cper_prot_err_work_fn);
>  
> +static int match_uport(struct device *dev, const void *data)
> +{
> +	const struct device *uport_dev = data;
> +	struct cxl_port *port;
> +
> +	if (!is_cxl_port(dev))
> +		return 0;
> +
> +	port = to_cxl_port(dev);
> +
> +	return port->uport_dev == uport_dev;
> +}
> +
>  int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
>  			     struct cxl_prot_error_info *err_info)
>  {
>  	struct pci_dev *pdev __free(pci_dev_put) = pci_dev_get(_pdev);
> -	struct cxl_dev_state *cxlds;
>  
>  	if (!pdev || !err_info) {
>  		pr_warn_once("Error: parameter is NULL");
>  		return -ENODEV;
>  	}
>  
> -	if ((pci_pcie_type(pdev) != PCI_EXP_TYPE_ENDPOINT) &&
> -	    (pci_pcie_type(pdev) != PCI_EXP_TYPE_RC_END)) {
> +	*err_info = (struct cxl_prot_error_info){ 0 };
> +	err_info->severity = severity;
> +	err_info->pdev = pdev;
Can maybe carry forward earlier suggestion for at least these two fields.

	*err_info = (struct cxl_prot_error_info) {
		.severity = ...

	};
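
For reference, a small userspace sketch of that pattern. The struct is a
hypothetical stand-in for struct cxl_prot_error_info (the member list is
assumed from the fields used in this patch), and init_err_info() is an
illustrative helper, not a function in the series. The point is that
assigning a compound literal with designated initializers leaves every
member that is not named zeroed, so the separate { 0 } step is not needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct cxl_prot_error_info. */
struct prot_error_info {
	void *pdev;
	void *dev;
	void *ras_base;
	int severity;
};

/* Illustrative helper: reset *info and set the two known fields at once. */
static void init_err_info(struct prot_error_info *info, void *pdev,
			  int severity)
{
	assert(info != NULL);

	/* Members not named here are zero-initialized (NULL for pointers). */
	*info = (struct prot_error_info) {
		.severity = severity,
		.pdev = pdev,
	};
}
```

Any stale dev or ras_base value in the destination is overwritten with
NULL, which is the same end state as the { 0 } assignment followed by the
two field stores.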
> +
> +	switch (pci_pcie_type(pdev)) {
> +	case PCI_EXP_TYPE_ROOT_PORT:
> +	case PCI_EXP_TYPE_DOWNSTREAM:
> +	{
> +		struct cxl_dport *dport = NULL;
> +		struct cxl_port *port __free(put_cxl_port) =
> +			find_cxl_port(&pdev->dev, &dport);
> +
> +		if (!port || !is_cxl_port(&port->dev))
> +			return -ENODEV;
> +
> +		err_info->ras_base = dport ? dport->regs.ras : NULL;
> +		err_info->dev = &port->dev;
> +		break;
> +	}
> +	case PCI_EXP_TYPE_UPSTREAM:
> +	{
> +		struct cxl_port *port;
> +		struct device *port_dev __free(put_device) =
> +			bus_find_device(&cxl_bus_type, NULL, &pdev->dev,
> +					match_uport);
> +
> +		if (!port_dev || !is_cxl_port(port_dev))
> +			return -ENODEV;
> +
> +		port = to_cxl_port(port_dev);
> +		err_info->ras_base = port ? port->uport_regs.ras : NULL;
> +		err_info->dev = port_dev;
> +		break;
> +	}
> +	case PCI_EXP_TYPE_ENDPOINT:
> +	case PCI_EXP_TYPE_RC_END:
> +	{
> +		struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
> +		struct cxl_memdev *cxlmd = cxlds->cxlmd;
> +		struct device *dev __free(put_device) = get_device(&cxlmd->dev);
> +
> +		err_info->ras_base = cxlds->regs.ras;
> +		err_info->dev = &cxlds->cxlmd->dev;
> +		break;
> +	}
> +	default:
> +	{
>  		pci_warn_once(pdev, "Error: Unsupported device type (%X)", pci_pcie_type(pdev));
>  		return -ENODEV;
>  	}
> -
> -	cxlds = pci_get_drvdata(pdev);
> -	struct device *dev __free(put_device) = get_device(&cxlds->cxlmd->dev);
> -
> -	if (!dev)
> -		return -ENODEV;
> -
> -	*err_info = (struct cxl_prot_error_info){ 0 };
> -	err_info->ras_base = cxlds->regs.ras;
> -	err_info->severity = severity;
> -	err_info->pdev = pdev;
> -	err_info->dev = dev;
> +	}
>  
>  	return 0;
>  }


^ permalink raw reply	[flat|nested] 76+ messages in thread

* [PATCH v8 13/16] cxl/pci: Assign CXL Endpoint protocol error handlers
  2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (11 preceding siblings ...)
  2025-03-27  1:47 ` [PATCH v8 12/16] cxl/pci: Assign CXL Port protocol error handlers Terry Bowman
@ 2025-03-27  1:47 ` Terry Bowman
  2025-03-27 19:46   ` kernel test robot
  2025-04-23 16:49   ` Jonathan Cameron
  2025-03-27  1:47 ` [PATCH v8 14/16] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
                   ` (4 subsequent siblings)
  17 siblings, 2 replies; 76+ messages in thread
From: Terry Bowman @ 2025-03-27  1:47 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

CXL Endpoint protocol errors are currently handled using PCI error
handlers. For uncorrectable errors, the CXL Endpoint requires CXL-specific
handling that the PCI handlers do not provide.

Add CXL specific handlers for CXL Endpoints. Assign the CXL handlers
during Endpoint Port initialization.

Keep the PCI Endpoint handlers. PCI handlers can be called if the CXL
device is not trained for alternate protocol (CXL). Update the CXL
Endpoint PCI handlers to call the CXL handler. If the CXL
uncorrectable handler returns PCI_ERS_RESULT_PANIC then the PCI
handler invokes panic().
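
The layering described above can be sketched in userspace roughly as
follows. This is a simplified model, not the kernel code: the enum, the
struct and the function names are hypothetical, and a counting callback
stands in for panic() so the flow can be exercised outside the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical result codes mirroring the PCI_ERS_RESULT_* names. */
enum ers_result { ERS_CAN_RECOVER, ERS_DISCONNECT, ERS_PANIC };

/* Hypothetical, reduced error context. */
struct err_info {
	bool driver_bound;	/* memdev still has a driver attached */
	bool uncorrectable;	/* an uncorrectable RAS status bit was set */
};

/* Stand-in for panic(): only counts how often it would have fired. */
static int fatal_calls;

static void count_fatal(const char *msg)
{
	(void)msg;
	fatal_calls++;
}

/* Model of the CXL handler: it reports the outcome and never panics. */
static enum ers_result cxl_handler(const struct err_info *info)
{
	assert(info != NULL);

	if (!info->driver_bound)
		return ERS_PANIC;

	return info->uncorrectable ? ERS_PANIC : ERS_CAN_RECOVER;
}

/* Model of the PCI wrapper: build context, delegate, then escalate. */
static enum ers_result pci_wrapper(const struct err_info *info,
				   void (*fatal)(const char *))
{
	enum ers_result rc;

	/* No usable error context: ask for a disconnect. */
	if (!info)
		return ERS_DISCONNECT;

	rc = cxl_handler(info);
	if (rc == ERS_PANIC)
		fatal("CXL cachemem error.");

	return rc;
}
```

Keeping the escalation in the wrapper means the CXL handler itself only
reports a result, so another caller of the same handler can decide for
itself how to act on a PANIC result.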

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/pci.c | 65 ++++++++++++++++++++++++------------------
 drivers/cxl/cxl.h      |  5 ++++
 drivers/cxl/cxlpci.h   |  4 +--
 drivers/cxl/pci.c      |  8 +++---
 drivers/cxl/port.c     |  7 +++++
 5 files changed, 56 insertions(+), 33 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 9ed6f700e132..f2139b382839 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -852,10 +852,10 @@ static void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds)
 static void cxl_handle_rdport_errors(struct cxl_dev_state *cxlds) { }
 #endif
 
-void cxl_cor_error_detected(struct pci_dev *pdev)
+void cxl_cor_error_detected(struct device *dev, struct cxl_prot_error_info *err_info)
 {
+	struct pci_dev *pdev = err_info->pdev;
 	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
-	struct device *dev = &cxlds->cxlmd->dev;
 
 	scoped_guard(device, dev) {
 		if (!dev->driver) {
@@ -873,20 +873,30 @@ void cxl_cor_error_detected(struct pci_dev *pdev)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
 
-pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
-				    pci_channel_state_t state)
+void pci_cor_error_detected(struct pci_dev *pdev)
+{
+	struct cxl_prot_error_info err_info;
+
+	if (cxl_create_prot_err_info(pdev, AER_CORRECTABLE, &err_info))
+		return;
+
+	cxl_cor_error_detected(err_info.dev, &err_info);
+}
+EXPORT_SYMBOL_NS_GPL(pci_cor_error_detected, "CXL");
+
+pci_ers_result_t cxl_error_detected(struct device *dev,
+				    struct cxl_prot_error_info *err_info)
 {
-	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
-	struct cxl_memdev *cxlmd = cxlds->cxlmd;
-	struct device *dev = &cxlmd->dev;
 	bool ue;
+	struct pci_dev *pdev = err_info->pdev;
+	struct cxl_dev_state *cxlds = pci_get_drvdata(pdev);
 
 	scoped_guard(device, dev) {
 		if (!dev->driver) {
 			dev_warn(&pdev->dev,
 				 "%s: memdev disabled, abort error handling\n",
 				 dev_name(dev));
-			return PCI_ERS_RESULT_DISCONNECT;
+			return PCI_ERS_RESULT_PANIC;
 		}
 
 		if (cxlds->rcd)
@@ -900,29 +910,30 @@ pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 		ue = cxl_handle_endpoint_ras(cxlds);
 	}
 
+	if (ue)
+		return PCI_ERS_RESULT_PANIC;
 
-	switch (state) {
-	case pci_channel_io_normal:
-		if (ue) {
-			device_release_driver(dev);
-			return PCI_ERS_RESULT_NEED_RESET;
-		}
-		return PCI_ERS_RESULT_CAN_RECOVER;
-	case pci_channel_io_frozen:
-		dev_warn(&pdev->dev,
-			 "%s: frozen state error detected, disable CXL.mem\n",
-			 dev_name(dev));
-		device_release_driver(dev);
-		return PCI_ERS_RESULT_NEED_RESET;
-	case pci_channel_io_perm_failure:
-		dev_warn(&pdev->dev,
-			 "failure state error detected, request disconnect\n");
-		return PCI_ERS_RESULT_DISCONNECT;
-	}
-	return PCI_ERS_RESULT_NEED_RESET;
+	return PCI_ERS_RESULT_CAN_RECOVER;
 }
 EXPORT_SYMBOL_NS_GPL(cxl_error_detected, "CXL");
 
+pci_ers_result_t pci_error_detected(struct pci_dev *pdev,
+				    pci_channel_state_t error)
+{
+	struct cxl_prot_error_info err_info;
+	pci_ers_result_t rc;
+
+	if (cxl_create_prot_err_info(pdev, AER_FATAL, &err_info))
+		return PCI_ERS_RESULT_DISCONNECT;
+
+	rc = cxl_error_detected(err_info.dev, &err_info);
+	if (rc == PCI_ERS_RESULT_PANIC)
+		panic("CXL cachemem error.");
+
+	return rc;
+}
+EXPORT_SYMBOL_NS_GPL(pci_error_detected, "CXL");
+
 static int cxl_flit_size(struct pci_dev *pdev)
 {
 	if (cxl_pci_flit_256(pdev))
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 512cc38892ed..c1adf8a3cb9e 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -815,6 +815,11 @@ void cxl_port_cor_error_detected(struct device *dev,
 pci_ers_result_t cxl_port_error_detected(struct device *dev,
 					 struct cxl_prot_error_info *err_info);
 
+void cxl_cor_error_detected(struct device *dev,
+			    struct cxl_prot_error_info *err_info);
+pci_ers_result_t cxl_error_detected(struct device *dev,
+				    struct cxl_prot_error_info *err_info);
+
 /**
  * struct cxl_endpoint_dvsec_info - Cached DVSEC info
  * @mem_enabled: cached value of mem_enabled in the DVSEC at init time
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 92d72c0423ab..d277cf048eba 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -133,8 +133,8 @@ struct cxl_dev_state;
 int cxl_hdm_decode_init(struct cxl_dev_state *cxlds, struct cxl_hdm *cxlhdm,
 			struct cxl_endpoint_dvsec_info *info);
 void read_cdat_data(struct cxl_port *port);
-void cxl_cor_error_detected(struct pci_dev *pdev);
-pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
+void pci_cor_error_detected(struct pci_dev *pdev);
+pci_ers_result_t pci_error_detected(struct pci_dev *pdev,
 				    pci_channel_state_t state);
 int cxl_create_prot_err_info(struct pci_dev *_pdev, int severity,
 			     struct cxl_prot_error_info *err_info);
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 4288f4814cc5..c5be4422748e 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -1108,11 +1108,11 @@ static void cxl_reset_done(struct pci_dev *pdev)
 	}
 }
 
-static const struct pci_error_handlers cxl_error_handlers = {
-	.error_detected	= cxl_error_detected,
+static const struct pci_error_handlers pci_error_handlers = {
+	.error_detected = pci_error_detected,
 	.slot_reset	= cxl_slot_reset,
 	.resume		= cxl_error_resume,
-	.cor_error_detected	= cxl_cor_error_detected,
+	.cor_error_detected	= pci_cor_error_detected,
 	.reset_done	= cxl_reset_done,
 };
 
@@ -1120,7 +1120,7 @@ static struct pci_driver cxl_pci_driver = {
 	.name			= KBUILD_MODNAME,
 	.id_table		= cxl_mem_pci_tbl,
 	.probe			= cxl_pci_probe,
-	.err_handler		= &cxl_error_handlers,
+	.err_handler		= &pci_error_handlers,
 	.dev_groups		= cxl_rcd_groups,
 	.driver	= {
 		.probe_type	= PROBE_PREFER_ASYNCHRONOUS,
diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
index 30a4bdb88c31..8e2b70e73582 100644
--- a/drivers/cxl/port.c
+++ b/drivers/cxl/port.c
@@ -65,6 +65,11 @@ static const struct cxl_error_handlers cxl_port_error_handlers = {
 	.cor_error_detected = cxl_port_cor_error_detected,
 };
 
+const struct cxl_error_handlers cxl_ep_error_handlers = {
+	.error_detected = cxl_error_detected,
+	.cor_error_detected = cxl_cor_error_detected,
+};
+
 static void cxl_assign_error_handlers(struct device *_dev,
 				      const struct cxl_error_handlers *handlers)
 {
@@ -203,6 +208,8 @@ static void cxl_endpoint_port_init_ras(struct cxl_port *port)
 	}
 
 	cxl_dport_init_ras_reporting(dport, cxlmd_dev);
+
+	cxl_assign_error_handlers(cxlmd_dev, &cxl_ep_error_handlers);
 }
 
 #else
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 76+ messages in thread

* Re: [PATCH v8 13/16] cxl/pci: Assign CXL Endpoint protocol error handlers
  2025-03-27  1:47 ` [PATCH v8 13/16] cxl/pci: Assign CXL Endpoint " Terry Bowman
@ 2025-03-27 19:46   ` kernel test robot
  2025-04-23 16:49   ` Jonathan Cameron
  1 sibling, 0 replies; 76+ messages in thread
From: kernel test robot @ 2025-03-27 19:46 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati
  Cc: oe-kbuild-all

Hi Terry,

kernel test robot noticed the following build warnings:

[auto build test WARNING on aae0594a7053c60b82621136257c8b648c67b512]

url:    https://github.com/intel-lab-lkp/linux/commits/Terry-Bowman/PCI-CXL-Introduce-PCIe-helper-function-pcie_is_cxl/20250327-095738
base:   aae0594a7053c60b82621136257c8b648c67b512
patch link:    https://lore.kernel.org/r/20250327014717.2988633-14-terry.bowman%40amd.com
patch subject: [PATCH v8 13/16] cxl/pci: Assign CXL Endpoint protocol error handlers
config: csky-randconfig-r122-20250327 (https://download.01.org/0day-ci/archive/20250328/202503280346.euKvcovE-lkp@intel.com/config)
compiler: csky-linux-gcc (GCC) 12.4.0
reproduce: (https://download.01.org/0day-ci/archive/20250328/202503280346.euKvcovE-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202503280346.euKvcovE-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
>> drivers/cxl/port.c:68:33: sparse: sparse: symbol 'cxl_ep_error_handlers' was not declared. Should it be static?

vim +/cxl_ep_error_handlers +68 drivers/cxl/port.c

    67	
  > 68	const struct cxl_error_handlers cxl_ep_error_handlers = {
    69		.error_detected = cxl_error_detected,
    70		.cor_error_detected = cxl_cor_error_detected,
    71	};
    72	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 76+ messages in thread

* Re: [PATCH v8 13/16] cxl/pci: Assign CXL Endpoint protocol error handlers
  2025-03-27  1:47 ` [PATCH v8 13/16] cxl/pci: Assign CXL Endpoint " Terry Bowman
  2025-03-27 19:46   ` kernel test robot
@ 2025-04-23 16:49   ` Jonathan Cameron
  1 sibling, 0 replies; 76+ messages in thread
From: Jonathan Cameron @ 2025-04-23 16:49 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Wed, 26 Mar 2025 20:47:14 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> CXL Endpoint protocol errors are currently handled using PCI error
> handlers. For uncorrectable errors, the CXL Endpoint requires CXL-specific
> handling that the PCI handlers do not provide.
> 
> Add CXL specific handlers for CXL Endpoints. Assign the CXL handlers
> during Endpoint Port initialization.
> 
> Keep the PCI Endpoint handlers. PCI handlers can be called if the CXL
> device is not trained for alternate protocol (CXL). Update the CXL
> Endpoint PCI handlers to call the CXL handler. If the CXL
> uncorrectable handler returns PCI_ERS_RESULT_PANIC then the PCI
> handler invokes panic().
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


* [PATCH v8 14/16] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
  2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (12 preceding siblings ...)
  2025-03-27  1:47 ` [PATCH v8 13/16] cxl/pci: Assign CXL Endpoint " Terry Bowman
@ 2025-03-27  1:47 ` Terry Bowman
  2025-04-17 17:22   ` Jonathan Cameron
  2025-03-27  1:47 ` [PATCH v8 15/16] CXL/PCI: Enable CXL protocol errors during CXL Port probe Terry Bowman
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 76+ messages in thread
From: Terry Bowman @ 2025-03-27  1:47 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

The cxl_handle_endpoint_cor_ras() and cxl_handle_endpoint_ras() functions
are unnecessary helper functions and are only used for Endpoints. Remove
these functions because they are not necessary and do not align with a
common handling API for all CXL devices' errors.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/pci.c | 17 ++++-------------
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index f2139b382839..a67925dfdbe1 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -670,11 +670,6 @@ static void __cxl_handle_cor_ras(struct device *cxl_dev, struct device *pcie_dev
 	trace_cxl_aer_correctable_error(cxl_dev, pcie_dev, serial, status);
 }
 
-static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
-{
-	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, NULL, cxlds->serial, cxlds->regs.ras);
-}
-
 /* CXL spec rev3.0 8.2.4.16.1 */
 static void header_log_copy(void __iomem *ras_base, u32 *log)
 {
@@ -732,14 +727,8 @@ static pci_ers_result_t __cxl_handle_ras(struct device *cxl_dev, struct device *
 	return PCI_ERS_RESULT_PANIC;
 }
 
-static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
-{
-	return __cxl_handle_ras(&cxlds->cxlmd->dev, NULL, cxlds->serial, cxlds->regs.ras);
-}
-
 #ifdef CONFIG_PCIEAER_CXL
 
-
 void cxl_port_cor_error_detected(struct device *cxl_dev,
 				 struct cxl_prot_error_info *err_info)
 {
@@ -868,7 +857,8 @@ void cxl_cor_error_detected(struct device *dev, struct cxl_prot_error_info *err_
 		if (cxlds->rcd)
 			cxl_handle_rdport_errors(cxlds);
 
-		cxl_handle_endpoint_cor_ras(cxlds);
+		__cxl_handle_cor_ras(&cxlds->cxlmd->dev, &pdev->dev,
+				     cxlds->serial, cxlds->regs.ras);
 	}
 }
 EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
@@ -907,7 +897,8 @@ pci_ers_result_t cxl_error_detected(struct device *dev,
 		 * chance the situation is recoverable dump the status of the RAS
 		 * capability registers and bounce the active state of the memdev.
 		 */
-		ue = cxl_handle_endpoint_ras(cxlds);
+		ue = __cxl_handle_ras(&cxlds->cxlmd->dev, &pdev->dev,
+				      cxlds->serial, cxlds->regs.ras);
 	}
 
 	if (ue)
-- 
2.34.1



* Re: [PATCH v8 14/16] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
  2025-03-27  1:47 ` [PATCH v8 14/16] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
@ 2025-04-17 17:22   ` Jonathan Cameron
  0 siblings, 0 replies; 76+ messages in thread
From: Jonathan Cameron @ 2025-04-17 17:22 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Wed, 26 Mar 2025 20:47:15 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> The cxl_handle_endpoint_cor_ras() and cxl_handle_endpoint_ras() functions
> are unnecessary helper functions and are only used for Endpoints. Remove
> these functions because they are not necessary and do not align with a
> common handling API for all CXL devices' errors.
Having done this, what does the double underscore in the naming denote?
I assume original intent was perhaps that only the wrappers should
ever be called.  If that's not the case after this change maybe get
rid of the __ prefix?

> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/core/pci.c | 17 ++++-------------
>  1 file changed, 4 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index f2139b382839..a67925dfdbe1 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -670,11 +670,6 @@ static void __cxl_handle_cor_ras(struct device *cxl_dev, struct device *pcie_dev
>  	trace_cxl_aer_correctable_error(cxl_dev, pcie_dev, serial, status);
>  }
>  
> -static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
> -{
> -	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, NULL, cxlds->serial, cxlds->regs.ras);
Previously second parameter was NULL. After this change you pass &pdev->dev.
That makes it look at least like there is a functional change here.
If this doesn't matter perhaps you should explain why in the description.

> -}
> -
>  /* CXL spec rev3.0 8.2.4.16.1 */
>  static void header_log_copy(void __iomem *ras_base, u32 *log)
>  {
> @@ -732,14 +727,8 @@ static pci_ers_result_t __cxl_handle_ras(struct device *cxl_dev, struct device *
>  	return PCI_ERS_RESULT_PANIC;
>  }
>  
> -static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
> -{
> -	return __cxl_handle_ras(&cxlds->cxlmd->dev, NULL, cxlds->serial, cxlds->regs.ras);
> -}
> -
>  #ifdef CONFIG_PCIEAER_CXL
>  
> -

Unrelated change. I think this ifdef was added earlier in series so avoid
adding the bonus line wherever it came from...

>  void cxl_port_cor_error_detected(struct device *cxl_dev,
>  				 struct cxl_prot_error_info *err_info)
>  {
> @@ -868,7 +857,8 @@ void cxl_cor_error_detected(struct device *dev, struct cxl_prot_error_info *err_
>  		if (cxlds->rcd)
>  			cxl_handle_rdport_errors(cxlds);
>  
> -		cxl_handle_endpoint_cor_ras(cxlds);
> +		__cxl_handle_cor_ras(&cxlds->cxlmd->dev, &pdev->dev,
> +				     cxlds->serial, cxlds->regs.ras);
>  	}
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_cor_error_detected, "CXL");
> @@ -907,7 +897,8 @@ pci_ers_result_t cxl_error_detected(struct device *dev,
>  		 * chance the situation is recoverable dump the status of the RAS
>  		 * capability registers and bounce the active state of the memdev.
>  		 */
> -		ue = cxl_handle_endpoint_ras(cxlds);
> +		ue = __cxl_handle_ras(&cxlds->cxlmd->dev, &pdev->dev,
> +				      cxlds->serial, cxlds->regs.ras);
>  	}
>  
>  	if (ue)



* [PATCH v8 15/16] CXL/PCI: Enable CXL protocol errors during CXL Port probe
  2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (13 preceding siblings ...)
  2025-03-27  1:47 ` [PATCH v8 14/16] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
@ 2025-03-27  1:47 ` Terry Bowman
  2025-04-04 17:05   ` Jonathan Cameron
  2025-03-27  1:47 ` [PATCH v8 16/16] CXL/PCI: Disable CXL protocol errors during CXL Port cleanup Terry Bowman
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 76+ messages in thread
From: Terry Bowman @ 2025-03-27  1:47 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

CXL protocol errors are not enabled for all CXL devices at boot. These
must be enabled in order to process CXL protocol errors.

Export the AER service driver's pci_aer_unmask_internal_errors().

Introduce cxl_enable_prot_errors() to call pci_aer_unmask_internal_errors().
pci_aer_unmask_internal_errors() expects pdev->aer_cap to be initialized,
but pdev->aer_cap is not initialized for CXL Upstream Switch Ports and CXL
Downstream Switch Ports. Initialize pdev->aer_cap if necessary. Enable AER
correctable internal errors and uncorrectable internal errors for all CXL
devices.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/cxl.h      |  2 ++
 drivers/cxl/port.c     | 22 ++++++++++++++++++++++
 drivers/pci/pcie/aer.c |  3 ++-
 include/linux/aer.h    |  1 +
 4 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index c1adf8a3cb9e..473267c19cd0 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -769,9 +769,11 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
 #ifdef CONFIG_PCIEAER_CXL
 void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport);
 void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host);
+void cxl_enable_prot_errors(struct device *dev);
 #else
 static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport,
 						struct device *host) { }
+static inline void cxl_enable_prot_errors(struct device *dev) { }
 #endif
 
 struct cxl_decoder *to_cxl_decoder(struct device *dev);
diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
index 8e2b70e73582..bb7a0526e609 100644
--- a/drivers/cxl/port.c
+++ b/drivers/cxl/port.c
@@ -83,6 +83,24 @@ static void cxl_assign_error_handlers(struct device *_dev,
 	pdrv->err_handler = handlers;
 }
 
+void cxl_enable_prot_errors(struct device *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct device *pci_dev __free(put_device) = get_device(&pdev->dev);
+
+	if (!pci_dev)
+		return;
+
+	if (!pdev->aer_cap) {
+		pdev->aer_cap = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
+		if (!pdev->aer_cap)
+			return;
+	}
+
+	pci_aer_unmask_internal_errors(pdev);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_enable_prot_errors, "CXL");
+
 static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
 {
 	resource_size_t aer_phys;
@@ -147,6 +165,7 @@ static void cxl_uport_init_ras_reporting(struct cxl_port *port,
 	}
 
 	cxl_assign_error_handlers(&port->dev, &cxl_port_error_handlers);
+	cxl_enable_prot_errors(port->uport_dev);
 }
 
 /**
@@ -177,6 +196,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
 	}
 
 	cxl_assign_error_handlers(dport->dport_dev, &cxl_port_error_handlers);
+	cxl_enable_prot_errors(dport->dport_dev);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
 
@@ -201,6 +221,7 @@ static void cxl_endpoint_port_init_ras(struct cxl_port *port)
 	struct cxl_port *parent_port __free(put_cxl_port) =
 		cxl_mem_find_port(cxlmd, &dport);
 	struct device *cxlmd_dev __free(put_device) = &cxlmd->dev;
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 
 	if (!dport || !dev_is_pci(dport->dport_dev)) {
 		dev_err(&port->dev, "CXL port topology not found\n");
@@ -210,6 +231,7 @@ static void cxl_endpoint_port_init_ras(struct cxl_port *port)
 	cxl_dport_init_ras_reporting(dport, cxlmd_dev);
 
 	cxl_assign_error_handlers(cxlmd_dev, &cxl_ep_error_handlers);
+	cxl_enable_prot_errors(cxlds->dev);
 }
 
 #else
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 763ec6aa1a9a..d3068f5cc767 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -962,7 +962,7 @@ static bool find_source_device(struct pci_dev *parent,
  * Note: AER must be enabled and supported by the device which must be
  * checked in advance, e.g. with pcie_aer_is_native().
  */
-static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
+void pci_aer_unmask_internal_errors(struct pci_dev *dev)
 {
 	int aer = dev->aer_cap;
 	u32 mask;
@@ -975,6 +975,7 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
 	mask &= ~PCI_ERR_COR_INTERNAL;
 	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
 }
+EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, "CXL");
 
 static bool is_cxl_mem_dev(struct pci_dev *dev)
 {
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 8f815f34d447..a65fe324fad2 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -100,5 +100,6 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 int cper_severity_to_aer(int cper_severity);
 void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
 		       int severity, struct aer_capability_regs *aer_regs);
+void pci_aer_unmask_internal_errors(struct pci_dev *dev);
 #endif //_AER_H_
 
-- 
2.34.1



* Re: [PATCH v8 15/16] CXL/PCI: Enable CXL protocol errors during CXL Port probe
  2025-03-27  1:47 ` [PATCH v8 15/16] CXL/PCI: Enable CXL protocol errors during CXL Port probe Terry Bowman
@ 2025-04-04 17:05   ` Jonathan Cameron
  2025-04-07 14:34     ` Bowman, Terry
  0 siblings, 1 reply; 76+ messages in thread
From: Jonathan Cameron @ 2025-04-04 17:05 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Wed, 26 Mar 2025 20:47:16 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> CXL protocol errors are not enabled for all CXL devices at boot. These
> must be enabled in order to process CXL protocol errors.
> 
> Export the AER service driver's pci_aer_unmask_internal_errors().
> 
> Introduce cxl_enable_prot_errors() to call pci_aer_unmask_internal_errors().
> pci_aer_unmask_internal_errors() expects pdev->aer_cap to be initialized,
> but pdev->aer_cap is not initialized for CXL Upstream Switch Ports and CXL
> Downstream Switch Ports. Initialize pdev->aer_cap if necessary. Enable AER
> correctable internal errors and uncorrectable internal errors for all CXL
> devices.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


* Re: [PATCH v8 15/16] CXL/PCI: Enable CXL protocol errors during CXL Port probe
  2025-04-04 17:05   ` Jonathan Cameron
@ 2025-04-07 14:34     ` Bowman, Terry
  0 siblings, 0 replies; 76+ messages in thread
From: Bowman, Terry @ 2025-04-07 14:34 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati



On 4/4/2025 12:05 PM, Jonathan Cameron wrote:
> On Wed, 26 Mar 2025 20:47:16 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> CXL protocol errors are not enabled for all CXL devices at boot. These
>> must be enabled in order to process CXL protocol errors.
>>
>> Export the AER service driver's pci_aer_unmask_internal_errors().
>>
>> Introduce cxl_enable_prot_errors() to call pci_aer_unmask_internal_errors().
>> pci_aer_unmask_internal_errors() expects pdev->aer_cap to be initialized,
>> but pdev->aer_cap is not initialized for CXL Upstream Switch Ports and CXL
>> Downstream Switch Ports. Initialize pdev->aer_cap if necessary. Enable AER
>> correctable internal errors and uncorrectable internal errors for all CXL
>> devices.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Thanks for reviewing.

Terry


* [PATCH v8 16/16] CXL/PCI: Disable CXL protocol errors during CXL Port cleanup
  2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (14 preceding siblings ...)
  2025-03-27  1:47 ` [PATCH v8 15/16] CXL/PCI: Enable CXL protocol errors during CXL Port probe Terry Bowman
@ 2025-03-27  1:47 ` Terry Bowman
  2025-03-28  1:18   ` kernel test robot
  2025-04-04 17:04   ` Jonathan Cameron
  2025-03-27 17:16 ` [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Bjorn Helgaas
  2025-05-06 23:06 ` Gregory Price
  17 siblings, 2 replies; 76+ messages in thread
From: Terry Bowman @ 2025-03-27  1:47 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

During CXL device cleanup the CXL PCIe Port device interrupts may remain
enabled. This can potentially allow unnecessary interrupt processing on
behalf of the CXL errors while the device is destroyed.

Disable CXL protocol errors by setting the CXL devices' AER mask register.

Introduce pci_aer_mask_internal_errors() similar to pci_aer_unmask_internal_errors().

Next, introduce cxl_disable_prot_errors() to call pci_aer_mask_internal_errors().
Register cxl_disable_prot_errors() to run at CXL device cleanup.
Register for CXL Root Ports, CXL Downstream Ports, CXL Upstream Ports, and
CXL Endpoints.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/port.c     | 18 +++++++++++++++++-
 drivers/pci/pcie/aer.c | 25 +++++++++++++++++++++++++
 include/linux/aer.h    |  1 +
 3 files changed, 43 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/port.c b/drivers/cxl/port.c
index bb7a0526e609..7e3efd8be8eb 100644
--- a/drivers/cxl/port.c
+++ b/drivers/cxl/port.c
@@ -101,6 +101,19 @@ void cxl_enable_prot_errors(struct device *dev)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_enable_prot_errors, "CXL");
 
+void cxl_disable_prot_errors(void *_dev)
+{
+	struct device *dev = _dev;
+	struct pci_dev *pdev = to_pci_dev(dev);
+	struct device *pci_dev __free(put_device) = get_device(&pdev->dev);
+
+	if (!pci_dev || !pdev->aer_cap)
+		return;
+
+	pci_aer_mask_internal_errors(pdev);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_disable_prot_errors, "CXL");
+
 static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
 {
 	resource_size_t aer_phys;
@@ -166,6 +179,7 @@ static void cxl_uport_init_ras_reporting(struct cxl_port *port,
 
 	cxl_assign_error_handlers(&port->dev, &cxl_port_error_handlers);
 	cxl_enable_prot_errors(port->uport_dev);
+	devm_add_action_or_reset(host, cxl_disable_prot_errors, port->uport_dev);
 }
 
 /**
@@ -197,6 +211,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
 
 	cxl_assign_error_handlers(dport->dport_dev, &cxl_port_error_handlers);
 	cxl_enable_prot_errors(dport->dport_dev);
+	devm_add_action_or_reset(host, cxl_disable_prot_errors, dport->dport_dev);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
 
@@ -223,7 +238,7 @@ static void cxl_endpoint_port_init_ras(struct cxl_port *port)
 	struct device *cxlmd_dev __free(put_device) = &cxlmd->dev;
 	struct cxl_dev_state *cxlds = cxlmd->cxlds;
 
-	if (!dport || !dev_is_pci(dport->dport_dev)) {
+	if (!dport || !dev_is_pci(dport->dport_dev) || !dev_is_pci(cxlds->dev)) {
 		dev_err(&port->dev, "CXL port topology not found\n");
 		return;
 	}
@@ -232,6 +247,7 @@ static void cxl_endpoint_port_init_ras(struct cxl_port *port)
 
 	cxl_assign_error_handlers(cxlmd_dev, &cxl_ep_error_handlers);
 	cxl_enable_prot_errors(cxlds->dev);
+	devm_add_action_or_reset(cxlds->dev, cxl_disable_prot_errors, cxlds->dev);
 }
 
 #else
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d3068f5cc767..d1ef0c676ff8 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -977,6 +977,31 @@ void pci_aer_unmask_internal_errors(struct pci_dev *dev)
 }
 EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, "CXL");
 
+/**
+ * pci_aer_mask_internal_errors - mask internal errors
+ * @dev: pointer to the pcie_dev data structure
+ *
+ * Masks internal errors in the Uncorrectable and Correctable Error
+ * Mask registers.
+ *
+ * Note: AER must be enabled and supported by the device which must be
+ * checked in advance, e.g. with pcie_aer_is_native().
+ */
+void pci_aer_mask_internal_errors(struct pci_dev *dev)
+{
+	int aer = dev->aer_cap;
+	u32 mask;
+
+	pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &mask);
+	mask |= PCI_ERR_UNC_INTN;
+	pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, mask);
+
+	pci_read_config_dword(dev, aer + PCI_ERR_COR_MASK, &mask);
+	mask |= PCI_ERR_COR_INTERNAL;
+	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
+}
+EXPORT_SYMBOL_NS_GPL(pci_aer_mask_internal_errors, "CXL");
+
 static bool is_cxl_mem_dev(struct pci_dev *dev)
 {
 	/*
diff --git a/include/linux/aer.h b/include/linux/aer.h
index a65fe324fad2..f0c84db466e5 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -101,5 +101,6 @@ int cper_severity_to_aer(int cper_severity);
 void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
 		       int severity, struct aer_capability_regs *aer_regs);
 void pci_aer_unmask_internal_errors(struct pci_dev *dev);
+void pci_aer_mask_internal_errors(struct pci_dev *dev);
 #endif //_AER_H_
 
-- 
2.34.1



* Re: [PATCH v8 16/16] CXL/PCI: Disable CXL protocol errors during CXL Port cleanup
  2025-03-27  1:47 ` [PATCH v8 16/16] CXL/PCI: Disable CXL protocol errors during CXL Port cleanup Terry Bowman
@ 2025-03-28  1:18   ` kernel test robot
  2025-04-04 17:04   ` Jonathan Cameron
  1 sibling, 0 replies; 76+ messages in thread
From: kernel test robot @ 2025-03-28  1:18 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati
  Cc: oe-kbuild-all

Hi Terry,

kernel test robot noticed the following build warnings:

[auto build test WARNING on aae0594a7053c60b82621136257c8b648c67b512]

url:    https://github.com/intel-lab-lkp/linux/commits/Terry-Bowman/PCI-CXL-Introduce-PCIe-helper-function-pcie_is_cxl/20250327-095738
base:   aae0594a7053c60b82621136257c8b648c67b512
patch link:    https://lore.kernel.org/r/20250327014717.2988633-17-terry.bowman%40amd.com
patch subject: [PATCH v8 16/16] CXL/PCI: Disable CXL protocol errors during CXL Port cleanup
config: csky-randconfig-r122-20250327 (https://download.01.org/0day-ci/archive/20250328/202503280816.M7DZmSDT-lkp@intel.com/config)
compiler: csky-linux-gcc (GCC) 12.4.0
reproduce: (https://download.01.org/0day-ci/archive/20250328/202503280816.M7DZmSDT-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202503280816.M7DZmSDT-lkp@intel.com/

sparse warnings: (new ones prefixed by >>)
   drivers/cxl/port.c:68:33: sparse: sparse: symbol 'cxl_ep_error_handlers' was not declared. Should it be static?
>> drivers/cxl/port.c:104:6: sparse: sparse: symbol 'cxl_disable_prot_errors' was not declared. Should it be static?

vim +/cxl_disable_prot_errors +104 drivers/cxl/port.c

    67	
  > 68	const struct cxl_error_handlers cxl_ep_error_handlers = {
    69		.error_detected = cxl_error_detected,
    70		.cor_error_detected = cxl_cor_error_detected,
    71	};
    72	
    73	static void cxl_assign_error_handlers(struct device *_dev,
    74					      const struct cxl_error_handlers *handlers)
    75	{
    76		struct device *dev __free(put_device) = get_device(_dev);
    77		struct cxl_driver *pdrv;
    78	
    79		if (!dev)
    80			return;
    81	
    82		pdrv = to_cxl_drv(dev->driver);
    83		pdrv->err_handler = handlers;
    84	}
    85	
    86	void cxl_enable_prot_errors(struct device *dev)
    87	{
    88		struct pci_dev *pdev = to_pci_dev(dev);
    89		struct device *pci_dev __free(put_device) = get_device(&pdev->dev);
    90	
    91		if (!pci_dev)
    92			return;
    93	
    94		if (!pdev->aer_cap) {
    95			pdev->aer_cap = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
    96			if (!pdev->aer_cap)
    97				return;
    98		}
    99	
   100		pci_aer_unmask_internal_errors(pdev);
   101	}
   102	EXPORT_SYMBOL_NS_GPL(cxl_enable_prot_errors, "CXL");
   103	
 > 104	void cxl_disable_prot_errors(void *_dev)
   105	{
   106		struct device *dev = _dev;
   107		struct pci_dev *pdev = to_pci_dev(dev);
   108		struct device *pci_dev __free(put_device) = get_device(&pdev->dev);
   109	
   110		if (!pci_dev || !pdev->aer_cap)
   111			return;
   112	
   113		pci_aer_mask_internal_errors(pdev);
   114	}
   115	EXPORT_SYMBOL_NS_GPL(cxl_disable_prot_errors, "CXL");
   116	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v8 16/16] CXL/PCI: Disable CXL protocol errors during CXL Port cleanup
  2025-03-27  1:47 ` [PATCH v8 16/16] CXL/PCI: Disable CXL protocol errors during CXL Port cleanup Terry Bowman
  2025-03-28  1:18   ` kernel test robot
@ 2025-04-04 17:04   ` Jonathan Cameron
  2025-04-07 14:25     ` Bowman, Terry
  1 sibling, 1 reply; 76+ messages in thread
From: Jonathan Cameron @ 2025-04-04 17:04 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Wed, 26 Mar 2025 20:47:17 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> During CXL device cleanup the CXL PCIe Port device interrupts may remain
> enabled. This can potentially allow unnecessary interrupt processing on
> behalf of the CXL errors while the device is destroyed.
> 
> Disable CXL protocol errors by setting the CXL devices' AER mask register.
> 
> Introduce pci_aer_mask_internal_errors() similar to pci_aer_unmask_internal_errors().
> 
> Next, introduce cxl_disable_prot_errors() to call pci_aer_mask_internal_errors().
> Register cxl_disable_prot_errors() to run at CXL device cleanup.
> Register for CXL Root Ports, CXL Downstream Ports, CXL Upstream Ports, and
> CXL Endpoints.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

A few small comments in here.  I haven't looked through all the rest of the series
as out of time today but this one caught my eye.
>  
> @@ -223,7 +238,7 @@ static void cxl_endpoint_port_init_ras(struct cxl_port *port)
>  	struct device *cxlmd_dev __free(put_device) = &cxlmd->dev;
>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>  
> -	if (!dport || !dev_is_pci(dport->dport_dev)) {
> +	if (!dport || !dev_is_pci(dport->dport_dev) || !dev_is_pci(cxlds->dev)) {
>  		dev_err(&port->dev, "CXL port topology not found\n");
>  		return;
>  	}
> @@ -232,6 +247,7 @@ static void cxl_endpoint_port_init_ras(struct cxl_port *port)
>  
>  	cxl_assign_error_handlers(cxlmd_dev, &cxl_ep_error_handlers);
>  	cxl_enable_prot_errors(cxlds->dev);
> +	devm_add_action_or_reset(cxlds->dev, cxl_disable_prot_errors, cxlds->dev);

This can fail (at least in theory).  Should at least scream that oddly we've
disabled error handling interrupts if it is hard to return anything cleanly.

Same for all the other cases.
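
For illustration only, here is a minimal userspace sketch of the point
above. It assumes devm_add_action_or_reset() runs the action immediately
when it cannot register it and then returns the error, which is why a
failure would leave the protocol errors masked right after they were
enabled. Every name below (fake_dev, model_add_action_or_reset,
model_init_ras, model_disable_prot_errors) is a hypothetical stand-in and
not kernel API; in the driver the equivalent would be checking the return
value at each call site and logging it.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a device: tracks only what the sketch needs. */
struct fake_dev {
	bool errors_masked;		/* models the AER internal-error mask bits */
	bool alloc_fails;		/* forces the registration step to fail */
	void (*cleanup)(void *);	/* registered cleanup action, if any */
	void *cleanup_data;
};

/* Models cxl_disable_prot_errors(): mask the errors again. */
static void model_disable_prot_errors(void *data)
{
	struct fake_dev *dev = data;

	dev->errors_masked = true;
}

/*
 * Models the assumed devm_add_action_or_reset() contract: if the action
 * cannot be registered, run it right away and return the error.
 */
static int model_add_action_or_reset(struct fake_dev *dev,
				     void (*action)(void *), void *data)
{
	assert(action != NULL);

	if (dev->alloc_fails) {
		action(data);
		return -ENOMEM;
	}
	dev->cleanup = action;
	dev->cleanup_data = data;
	return 0;
}

/* Enable errors, register the cleanup, and report a failure loudly. */
static int model_init_ras(struct fake_dev *dev)
{
	int rc;

	dev->errors_masked = false;	/* models cxl_enable_prot_errors() */
	rc = model_add_action_or_reset(dev, model_disable_prot_errors, dev);
	if (rc)
		fprintf(stderr, "protocol errors left disabled: %d\n", rc);
	return rc;
}
```

In this model a failed registration returns -ENOMEM with errors_masked
already set back to true, so the only trace of the problem is the message
printed by the caller.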
>  }
>  
>  #else
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index d3068f5cc767..d1ef0c676ff8 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -977,6 +977,31 @@ void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>  }
>  EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, "CXL");
>  
> +/**
> + * pci_aer_mask_internal_errors - mask internal errors
> + * @dev: pointer to the pcie_dev data structure
> + *
> + * Masks internal errors in the Uncorrectable and Correctable Error
> + * Mask registers.
> + *
> + * Note: AER must be enabled and supported by the device which must be
> + * checked in advance, e.g. with pcie_aer_is_native().
> + */
> +void pci_aer_mask_internal_errors(struct pci_dev *dev)
> +{
> +	int aer = dev->aer_cap;
> +	u32 mask;
> +
> +	pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &mask);
> +	mask |= PCI_ERR_UNC_INTN;
> +	pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, mask);
> +
It does an extra clear we don't need, but....
	pci_clear_and_set_config_dword(dev, aer + PCI_ERR_UNCOR_MASK,
				       0, PCI_ERR_UNC_INTN);

	is at very least shorter than the above 3 lines.
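
As a sanity check on the suggestion above, here is a small userspace
sketch of the read-modify-write pattern. It assumes
pci_clear_and_set_config_dword() reads the dword, clears the 'clear'
bits, ORs in the 'set' bits and writes the result back. The model_*
helpers are hypothetical, and the two bit values are assumed to mirror
PCI_ERR_UNC_INTN and PCI_ERR_COR_INTERNAL rather than taken from this
patch.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit values; assumed to match the kernel's definitions. */
#define MODEL_ERR_UNC_INTN	0x00400000u	/* uncorrectable internal error */
#define MODEL_ERR_COR_INTERNAL	0x00004000u	/* correctable internal error */

/* Hypothetical model of a read-modify-write on one config dword. */
static uint32_t model_clear_and_set(uint32_t reg, uint32_t clear, uint32_t set)
{
	assert((clear & set) == 0);	/* a bit should not be both cleared and set */
	reg &= ~clear;
	reg |= set;
	return reg;
}

/* Mask: set only the requested bit, leave every other bit alone. */
static uint32_t model_mask_internal(uint32_t reg, uint32_t bit)
{
	return model_clear_and_set(reg, 0, bit);
}

/* Unmask: clear only the requested bit. */
static uint32_t model_unmask_internal(uint32_t reg, uint32_t bit)
{
	return model_clear_and_set(reg, bit, 0);
}
```

Under that assumption, passing 0 as the clear argument changes only the
bit being set, so mask bits that were already present in the register
are carried through unchanged.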

> +	pci_read_config_dword(dev, aer + PCI_ERR_COR_MASK, &mask);
> +	mask |= PCI_ERR_COR_INTERNAL;
> +	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
> +}
> +EXPORT_SYMBOL_NS_GPL(pci_aer_mask_internal_errors, "CXL");
> +
>  static bool is_cxl_mem_dev(struct pci_dev *dev)
>  {
>  	/*
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index a65fe324fad2..f0c84db466e5 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -101,5 +101,6 @@ int cper_severity_to_aer(int cper_severity);
>  void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
>  		       int severity, struct aer_capability_regs *aer_regs);
>  void pci_aer_unmask_internal_errors(struct pci_dev *dev);
> +void pci_aer_mask_internal_errors(struct pci_dev *dev);
>  #endif //_AER_H_
>  



* Re: [PATCH v8 16/16] CXL/PCI: Disable CXL protocol errors during CXL Port cleanup
  2025-04-04 17:04   ` Jonathan Cameron
@ 2025-04-07 14:25     ` Bowman, Terry
  2025-04-17 10:13       ` Jonathan Cameron
  0 siblings, 1 reply; 76+ messages in thread
From: Bowman, Terry @ 2025-04-07 14:25 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati



On 4/4/2025 12:04 PM, Jonathan Cameron wrote:
> On Wed, 26 Mar 2025 20:47:17 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> During CXL device cleanup the CXL PCIe Port device interrupts may remain
>> enabled. This can potentially allow unnecessary interrupt processing on
>> behalf of the CXL errors while the device is destroyed.
>>
>> Disable CXL protocol errors by setting the CXL devices' AER mask register.
>>
>> Introduce pci_aer_mask_internal_errors() similar to pci_aer_unmask_internal_errors().
>>
>> Next, introduce cxl_disable_prot_errors() to call pci_aer_mask_internal_errors().
>> Register cxl_disable_prot_errors() to run at CXL device cleanup.
>> Register for CXL Root Ports, CXL Downstream Ports, CXL Upstream Ports, and
>> CXL Endpoints.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> A few small comments in here.  I haven't looked through all the rest of the series
> as out of time today but this one caught my eye.
>>  
>> @@ -223,7 +238,7 @@ static void cxl_endpoint_port_init_ras(struct cxl_port *port)
>>  	struct device *cxlmd_dev __free(put_device) = &cxlmd->dev;
>>  	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>>  
>> -	if (!dport || !dev_is_pci(dport->dport_dev)) {
>> +	if (!dport || !dev_is_pci(dport->dport_dev) || !dev_is_pci(cxlds->dev)) {
>>  		dev_err(&port->dev, "CXL port topology not found\n");
>>  		return;
>>  	}
>> @@ -232,6 +247,7 @@ static void cxl_endpoint_port_init_ras(struct cxl_port *port)
>>  
>>  	cxl_assign_error_handlers(cxlmd_dev, &cxl_ep_error_handlers);
>>  	cxl_enable_prot_errors(cxlds->dev);
>> +	devm_add_action_or_reset(cxlds->dev, cxl_disable_prot_errors, cxlds->dev);
> This can fail (at least in theory).  Should at least scream that oddly we've
> disabled error handling interrupts if it is hard to return anything cleanly.
>
> Same for all the other cases.

Ok. I will add a dev_err() for errors returned by devm_add_action_or_reset().
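
As a rough userspace sketch of that change (not the actual patch): devm_add_action_or_reset() runs the action immediately when registration fails, so on the error path protocol errors end up masked and the message should say so. Everything prefixed fake_ below is a hypothetical stand-in for illustration, and fprintf() stands in for dev_err().

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a device and its protocol error state. */
struct fake_dev {
	bool prot_errors_enabled;
	bool registration_fails;	/* forces the failure path */
	void (*cleanup)(void *);	/* action recorded for teardown */
};

/* Stand-in for cxl_disable_prot_errors(): masks protocol errors. */
static void fake_disable_prot_errors(void *data)
{
	struct fake_dev *dev = data;

	dev->prot_errors_enabled = false;
}

/*
 * Models the documented devm_add_action_or_reset() behaviour: if the
 * action cannot be registered, it is invoked right away and an error
 * is returned to the caller.
 */
static int fake_add_action_or_reset(struct fake_dev *dev,
				    void (*action)(void *), void *data)
{
	assert(action != NULL);

	if (dev->registration_fails) {
		action(data);
		return -ENOMEM;
	}
	dev->cleanup = action;
	return 0;
}

/* Enable protocol errors, register the cleanup, and report a failure. */
static int fake_init_ras(struct fake_dev *dev)
{
	int rc;

	dev->prot_errors_enabled = true;
	rc = fake_add_action_or_reset(dev, fake_disable_prot_errors, dev);
	if (rc)
		fprintf(stderr,
			"cleanup registration failed (%d), protocol errors are masked\n",
			rc);
	return rc;
}
```

The init path here only logs and passes the error up; whether the real init functions can return it cleanly is the open question raised in the review.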
>>  }
>>  
>>  #else
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index d3068f5cc767..d1ef0c676ff8 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -977,6 +977,31 @@ void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>>  }
>>  EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, "CXL");
>>  
>> +/**
>> + * pci_aer_mask_internal_errors - mask internal errors
>> + * @dev: pointer to the pcie_dev data structure
>> + *
>> + * Masks internal errors in the Uncorrectable and Correctable Error
>> + * Mask registers.
>> + *
>> + * Note: AER must be enabled and supported by the device which must be
>> + * checked in advance, e.g. with pcie_aer_is_native().
>> + */
>> +void pci_aer_mask_internal_errors(struct pci_dev *dev)
>> +{
>> +	int aer = dev->aer_cap;
>> +	u32 mask;
>> +
>> +	pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &mask);
>> +	mask |= PCI_ERR_UNC_INTN;
>> +	pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, mask);
>> +
> It does an extra clear we don't need, but....
> 	pci_clear_and_set_config_dword(dev, aer + PCI_ERR_UNCOR_MASK,
> 				       0, PCI_ERR_UNC_INTN);
>
> 	is at very least shorter than the above 3 lines.
Doing so will overwrite the existing mask. CXL normally only uses AER UIE/CIE, but if the
device happens to lose alternate training and no longer identifies as a CXL device, then
this mask value would be critical for reporting PCI AER errors and would need UCE/CE
enabled (other than UIE/CIE).

-Terry

>> +	pci_read_config_dword(dev, aer + PCI_ERR_COR_MASK, &mask);
>> +	mask |= PCI_ERR_COR_INTERNAL;
>> +	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
>> +}
>> +EXPORT_SYMBOL_NS_GPL(pci_aer_mask_internal_errors, "CXL");
>> +
>>  static bool is_cxl_mem_dev(struct pci_dev *dev)
>>  {
>>  	/*
>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>> index a65fe324fad2..f0c84db466e5 100644
>> --- a/include/linux/aer.h
>> +++ b/include/linux/aer.h
>> @@ -101,5 +101,6 @@ int cper_severity_to_aer(int cper_severity);
>>  void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
>>  		       int severity, struct aer_capability_regs *aer_regs);
>>  void pci_aer_unmask_internal_errors(struct pci_dev *dev);
>> +void pci_aer_mask_internal_errors(struct pci_dev *dev);
>>  #endif //_AER_H_
>>  



* Re: [PATCH v8 16/16] CXL/PCI: Disable CXL protocol errors during CXL Port cleanup
  2025-04-07 14:25     ` Bowman, Terry
@ 2025-04-17 10:13       ` Jonathan Cameron
  2025-04-24 16:37         ` Bowman, Terry
  0 siblings, 1 reply; 76+ messages in thread
From: Jonathan Cameron @ 2025-04-17 10:13 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

Hi Terry,

> >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> >> index d3068f5cc767..d1ef0c676ff8 100644
> >> --- a/drivers/pci/pcie/aer.c
> >> +++ b/drivers/pci/pcie/aer.c
> >> @@ -977,6 +977,31 @@ void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> >>  }
> >>  EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, "CXL");
> >>  
> >> +/**
> >> + * pci_aer_mask_internal_errors - mask internal errors
> >> + * @dev: pointer to the pcie_dev data structure
> >> + *
> >> + * Masks internal errors in the Uncorrectable and Correctable Error
> >> + * Mask registers.
> >> + *
> >> + * Note: AER must be enabled and supported by the device which must be
> >> + * checked in advance, e.g. with pcie_aer_is_native().
> >> + */
> >> +void pci_aer_mask_internal_errors(struct pci_dev *dev)
> >> +{
> >> +	int aer = dev->aer_cap;
> >> +	u32 mask;
> >> +
> >> +	pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &mask);
> >> +	mask |= PCI_ERR_UNC_INTN;
> >> +	pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, mask);
> >> +  
> > It does an extra clear we don't need, but....
> > 	pci_clear_and_set_config_dword(dev, aer + PCI_ERR_UNCOR_MASK,
> > 				       0, PCI_ERR_UNC_INTN);
> >
> > 	is at very least shorter than the above 3 lines.  
> Doing so will overwrite the existing mask. CXL normally only uses AER UIE/CIE, but if the
> device happens to lose alternate training and no longer identifies as a CXL device, then
> this mask value would be critical for reporting PCI AER errors and would need UCE/CE
> enabled (other than UIE/CIE).
I'm not seeing that.  Implementation of pci_clear_and_set_config_dword() is:
void pci_clear_and_set_config_dword(const struct pci_dev *dev, int pos,
				    u32 clear, u32 set)
{
	u32 val;

	pci_read_config_dword(dev, pos, &val);
	val &= ~clear;
	val |= set;
	pci_write_config_dword(dev, pos, val);
}

With the clear parameter as zero it will do the same as the open-coded
version you have above, as ~clear will be all 1s and hence
&= ~clear has no effect.

Arguably we could add pci_clear_config_dword() and pci_set_config_dword()
that both take one fewer parameter but I guess that is not worth
the bother.
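
To make the equivalence concrete, here is a small userspace sketch that applies the same arithmetic to a plain 32-bit value instead of config space. FAKE_UNC_INTN is assumed to mirror PCI_ERR_UNC_INTN and is used for illustration only; none of these functions are kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror PCI_ERR_UNC_INTN; illustrative value only. */
#define FAKE_UNC_INTN 0x00400000u

/* Open-coded form from the patch: take the value, OR in the bit. */
static uint32_t open_coded_mask(uint32_t reg)
{
	reg |= FAKE_UNC_INTN;
	return reg;
}

/* Same arithmetic as the helper body quoted above, on a plain value. */
static uint32_t clear_and_set(uint32_t reg, uint32_t clear, uint32_t set)
{
	reg &= ~clear;
	reg |= set;
	return reg;
}

/* Returns 1 when both forms agree for the given starting value. */
static int masks_agree(uint32_t reg)
{
	uint32_t merged = clear_and_set(reg, 0, FAKE_UNC_INTN);

	/* Bits outside FAKE_UNC_INTN must survive untouched. */
	assert((merged & ~FAKE_UNC_INTN) == (reg & ~FAKE_UNC_INTN));
	return merged == open_coded_mask(reg);
}
```

With clear == 0 every previously set mask bit is preserved, so the existing mask is not overwritten.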

Jonathan





* Re: [PATCH v8 16/16] CXL/PCI: Disable CXL protocol errors during CXL Port cleanup
  2025-04-17 10:13       ` Jonathan Cameron
@ 2025-04-24 16:37         ` Bowman, Terry
  0 siblings, 0 replies; 76+ messages in thread
From: Bowman, Terry @ 2025-04-24 16:37 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati



On 4/17/2025 5:13 AM, Jonathan Cameron wrote:
> Hi Terry,
>
>>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>>> index d3068f5cc767..d1ef0c676ff8 100644
>>>> --- a/drivers/pci/pcie/aer.c
>>>> +++ b/drivers/pci/pcie/aer.c
>>>> @@ -977,6 +977,31 @@ void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>>>>  }
>>>>  EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, "CXL");
>>>>  
>>>> +/**
>>>> + * pci_aer_mask_internal_errors - mask internal errors
>>>> + * @dev: pointer to the pcie_dev data structure
>>>> + *
>>>> + * Masks internal errors in the Uncorrectable and Correctable Error
>>>> + * Mask registers.
>>>> + *
>>>> + * Note: AER must be enabled and supported by the device which must be
>>>> + * checked in advance, e.g. with pcie_aer_is_native().
>>>> + */
>>>> +void pci_aer_mask_internal_errors(struct pci_dev *dev)
>>>> +{
>>>> +	int aer = dev->aer_cap;
>>>> +	u32 mask;
>>>> +
>>>> +	pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &mask);
>>>> +	mask |= PCI_ERR_UNC_INTN;
>>>> +	pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, mask);
>>>> +  
>>> It does an extra clear we don't need, but....
>>> 	pci_clear_and_set_config_dword(dev, aer + PCI_ERR_UNCOR_MASK,
>>> 				       0, PCI_ERR_UNC_INTN);
>>>
>>> 	is at very least shorter than the above 3 lines.  
>> Doing so will overwrite the existing mask. CXL normally only uses AER UIE/CIE, but if the
>> device happens to lose alternate training and no longer identifies as a CXL device, then
>> this mask value would be critical for reporting PCI AER errors and would need UCE/CE
>> enabled (other than UIE/CIE).
> I'm not seeing that.  Implementation of pci_clear_and_set_config_dword() is:
> void pci_clear_and_set_config_dword(const struct pci_dev *dev, int pos,
> 				    u32 clear, u32 set)
> {
> 	u32 val;
>
> 	pci_read_config_dword(dev, pos, &val);
> 	val &= ~clear;
> 	val |= set;
> 	pci_write_config_dword(dev, pos, val);
> }
>
> With the clear parameter as zero it will do the same as the open-coded
> version you have above, as ~clear will be all 1s and hence
> &= ~clear has no effect.
>
> Arguably we could add pci_clear_config_dword() and pci_set_config_dword()
> that both take one fewer parameter but I guess that is not worth
> the bother.
>
> Jonathan
>
Got it. I'll change to use pci_clear_and_set_config_dword().

-Terry
>



* Re: [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging
  2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (15 preceding siblings ...)
  2025-03-27  1:47 ` [PATCH v8 16/16] CXL/PCI: Disable CXL protocol errors during CXL Port cleanup Terry Bowman
@ 2025-03-27 17:16 ` Bjorn Helgaas
  2025-03-27 22:04   ` Bowman, Terry
  2025-05-06 23:06 ` Gregory Price
  17 siblings, 1 reply; 76+ messages in thread
From: Bjorn Helgaas @ 2025-03-27 17:16 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Wed, Mar 26, 2025 at 08:47:01PM -0500, Terry Bowman wrote:
> ...

> Terry Bowman (16):
>   PCI/CXL: Introduce PCIe helper function pcie_is_cxl()

Something like "Add pcie_is_cxl()" is probably enough.

>   PCI/AER: Modify AER driver logging to report CXL or PCIe bus error
>     type

No need to repeat "AER" in the subject.  Could start with "Report" or
"Distinguish" since "modify AER driver logging" is kind of low-value
information.

>   CXL/AER: Introduce Kfifo for forwarding CXL errors

>   cxl/aer: AER service driver forwards CXL error to CXL driver
>   PCI/AER: CXL driver dequeues CXL error forwarded from AER service
>     driver

Both should say what the patch changes.  "AER service driver forwards"
and "CXL driver dequeues" could be descriptions of existing behavior
or something else.  Starting with a verb will help make this clearer.

Maybe don't need to repeat "AER" in "CXL/AER: AER ..."

>   CXL/PCI: Introduce CXL uncorrectable protocol error 'recovery'
>   cxl/pci: Move existing CXL RAS initialization to CXL's cxl_port driver

Drop "existing" and at least one "CXL" to increase information density
in subject.

>   cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
>   cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports
>   cxl/pci: Add log message if RAS registers are not mapped
>   cxl/pci: Unifi CXL trace logging for CXL Endpoints and CXL Ports

s/Unifi/Unify/

>   cxl/pci: Assign CXL Port protocol error handlers
>   cxl/pci: Assign CXL Endpoint protocol error handlers
>   cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
>   CXL/PCI: Enable CXL protocol errors during CXL Port probe
>   CXL/PCI: Disable CXL protocol errors during CXL Port cleanup

Don't repost just for any of this, but it looks like there are some
kernel test robot warnings that need to be addressed.  When you do,
tidy up these subject lines so they are capitalized consistently.


* Re: [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging
  2025-03-27 17:16 ` [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Bjorn Helgaas
@ 2025-03-27 22:04   ` Bowman, Terry
  0 siblings, 0 replies; 76+ messages in thread
From: Bowman, Terry @ 2025-03-27 22:04 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati



On 3/27/2025 12:16 PM, Bjorn Helgaas wrote:
> On Wed, Mar 26, 2025 at 08:47:01PM -0500, Terry Bowman wrote:
>> ...
>> Terry Bowman (16):
>>   PCI/CXL: Introduce PCIe helper function pcie_is_cxl()
> Something like "Add pcie_is_cxl()" is probably enough.
>
>>   PCI/AER: Modify AER driver logging to report CXL or PCIe bus error
>>     type
> No need to repeat "AER" in the subject.  Could start with "Report" or
> "Distinguish" since "modify AER driver logging" is kind of low-value
> information.
>
>>   CXL/AER: Introduce Kfifo for forwarding CXL errors
>>   cxl/aer: AER service driver forwards CXL error to CXL driver
>>   PCI/AER: CXL driver dequeues CXL error forwarded from AER service
>>     driver
> Both should say what the patch changes.  "AER service driver forwards"
> and "CXL driver dequeues" could be descriptions of existing behavior
> or something else.  Starting with a verb will help make this clearer.
>
> Maybe don't need to repeat "AER" in "CXL/AER: AER ..."
>
>>   CXL/PCI: Introduce CXL uncorrectable protocol error 'recovery'
>>   cxl/pci: Move existing CXL RAS initialization to CXL's cxl_port driver
> Drop "existing" and at least one "CXL" to increase information density
> in subject.
>
>>   cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers
>>   cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports
>>   cxl/pci: Add log message if RAS registers are not mapped
>>   cxl/pci: Unifi CXL trace logging for CXL Endpoints and CXL Ports
> s/Unifi/Unify/
>
>>   cxl/pci: Assign CXL Port protocol error handlers
>>   cxl/pci: Assign CXL Endpoint protocol error handlers
>>   cxl/pci: Remove unnecessary CXL Endpoint handling helper functions
>>   CXL/PCI: Enable CXL protocol errors during CXL Port probe
>>   CXL/PCI: Disable CXL protocol errors during CXL Port cleanup
> Don't repost just for any of this, but it looks like there are some
> kernel test robot warnings that need to be addressed.  When you do,
> tidy up these subject lines so they are capitalized consistently.
Hi Bjorn,

I added all the changes. The commit titles read much better.

Terry


* Re: [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging
  2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (16 preceding siblings ...)
  2025-03-27 17:16 ` [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Bjorn Helgaas
@ 2025-05-06 23:06 ` Gregory Price
  2025-05-07 18:28   ` Bowman, Terry
  17 siblings, 1 reply; 76+ messages in thread
From: Gregory Price @ 2025-05-06 23:06 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati

On Wed, Mar 26, 2025 at 08:47:01PM -0500, Terry Bowman wrote:
> This patchset updates CXL Protocol Error handling for CXL Ports and CXL
> Endpoints (EP). The reach of this patchset grew from CXL Ports to include
> EPs as well because updating the handling for all devices is preferable
> over supporting multiple handling paths.
> 
> This patchset is a continuation of v7 and can be found here:
> https://lore.kernel.org/linux-cxl/20250211192444.2292833-1-terry.bowman@amd.com/
> 

I've been testing this for stability on a fair number of boxes for some
time - backported to v6.13. Haven't seen any major issues related to
this set in that time. Outside my normal wheelhouse, but for the sake
of runtime stability:

Tested-by: Gregory Price <gourry@gourry.net>

Trying to get more explicit testing feedback from RAS folks.

(note: there appear to be some conflicting changes in v6.15-rc4+ that
are a bit outside my current timeline to forward port and test.)

~Gregory


* Re: [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging
  2025-05-06 23:06 ` Gregory Price
@ 2025-05-07 18:28   ` Bowman, Terry
  0 siblings, 0 replies; 76+ messages in thread
From: Bowman, Terry @ 2025-05-07 18:28 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati



On 5/6/2025 6:06 PM, Gregory Price wrote:
> On Wed, Mar 26, 2025 at 08:47:01PM -0500, Terry Bowman wrote:
>> This patchset updates CXL Protocol Error handling for CXL Ports and CXL
>> Endpoints (EP). The reach of this patchset grew from CXL Ports to include
>> EPs as well because updating the handling for all devices is preferable
>> over supporting multiple handling paths.
>>
>> This patchset is a continuation of v7 and can be found here:
>> https://lore.kernel.org/linux-cxl/20250211192444.2292833-1-terry.bowman@amd.com/
>>
> I've been testing this for stability on a fair number of boxes for some
> time - backported to v6.13. Haven't seen any major issues related to
> this set in that time. Outside my normal wheelhouse, but for the sake
> of runtime stability:
>
> Tested-by: Gregory Price <gourry@gourry.net>
>
> Trying to get more explicit testing feedback from RAS folks.
>
> (note: there appear to be some conflicting changes in v6.15-rc4+ that
> are a bit outside my current timeline to forward port and test.)
>
> ~Gregory
Thanks Greg.

-Terry


end of thread, other threads:[~2025-05-21 23:30 UTC | newest]

Thread overview: 76+ messages
2025-03-27  1:47 [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
2025-03-27  1:47 ` [PATCH v8 01/16] PCI/CXL: Introduce PCIe helper function pcie_is_cxl() Terry Bowman
2025-03-27 15:11   ` Ira Weiny
2025-03-27 15:30     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 02/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
2025-03-27 16:48   ` Bjorn Helgaas
2025-03-27 17:15     ` Bowman, Terry
2025-03-27 17:49       ` Bjorn Helgaas
2025-03-27 16:58   ` Ira Weiny
2025-03-27 17:17     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 03/16] CXL/AER: Introduce Kfifo for forwarding CXL errors Terry Bowman
2025-03-27 17:08   ` Bjorn Helgaas
2025-03-27 18:12     ` Bowman, Terry
2025-03-28 17:02       ` Bjorn Helgaas
2025-03-28 17:36         ` Bowman, Terry
2025-03-28 17:01   ` Ira Weiny
2025-04-07 13:43     ` Bowman, Terry
2025-04-04 16:53   ` Jonathan Cameron
2025-04-23 14:33   ` Jonathan Cameron
2025-04-23 15:04   ` Jonathan Cameron
2025-04-23 22:12   ` Gregory Price
2025-03-27  1:47 ` [PATCH v8 04/16] cxl/aer: AER service driver forwards CXL error to CXL driver Terry Bowman
2025-03-27 17:13   ` Bjorn Helgaas
2025-04-07 14:00     ` Bowman, Terry
2025-04-23 15:04   ` Jonathan Cameron
2025-04-24 14:17     ` Bowman, Terry
2025-04-25 13:18       ` Jonathan Cameron
2025-04-25 21:03         ` Bowman, Terry
2025-05-15 21:52         ` Bowman, Terry
2025-05-20 11:04           ` Jonathan Cameron
2025-05-20 13:21             ` Bowman, Terry
2025-05-21 18:34               ` Jonathan Cameron
2025-05-21 23:30                 ` Bowman, Terry
2025-04-23 22:21   ` Gregory Price
2025-03-27  1:47 ` [PATCH v8 05/16] PCI/AER: CXL driver dequeues CXL error forwarded from AER service driver Terry Bowman
2025-03-27  4:43   ` kernel test robot
2025-04-23 16:28   ` Jonathan Cameron
2025-04-24 15:03     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 06/16] CXL/PCI: Introduce CXL uncorrectable protocol error 'recovery' Terry Bowman
2025-03-27  3:37   ` kernel test robot
2025-03-27  4:19   ` kernel test robot
2025-04-23 16:35   ` Jonathan Cameron
2025-04-24 14:22     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 07/16] cxl/pci: Move existing CXL RAS initialization to CXL's cxl_port driver Terry Bowman
2025-04-17 10:18   ` Jonathan Cameron
2025-04-24 14:25     ` Bowman, Terry
2025-05-12 14:47     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 08/16] cxl/pci: Map CXL Endpoint Port and CXL Switch Port RAS registers Terry Bowman
2025-03-27  1:47 ` [PATCH v8 09/16] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports Terry Bowman
2025-03-27  1:47 ` [PATCH v8 10/16] cxl/pci: Add log message if RAS registers are not mapped Terry Bowman
2025-04-23 16:41   ` Jonathan Cameron
2025-04-24 14:30     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 11/16] cxl/pci: Unifi CXL trace logging for CXL Endpoints and CXL Ports Terry Bowman
2025-04-23 16:44   ` Jonathan Cameron
2025-05-07 16:28     ` Shiju Jose
2025-05-07 18:30       ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 12/16] cxl/pci: Assign CXL Port protocol error handlers Terry Bowman
2025-04-23 16:47   ` Jonathan Cameron
2025-03-27  1:47 ` [PATCH v8 13/16] cxl/pci: Assign CXL Endpoint " Terry Bowman
2025-03-27 19:46   ` kernel test robot
2025-04-23 16:49   ` Jonathan Cameron
2025-03-27  1:47 ` [PATCH v8 14/16] cxl/pci: Remove unnecessary CXL Endpoint handling helper functions Terry Bowman
2025-04-17 17:22   ` Jonathan Cameron
2025-03-27  1:47 ` [PATCH v8 15/16] CXL/PCI: Enable CXL protocol errors during CXL Port probe Terry Bowman
2025-04-04 17:05   ` Jonathan Cameron
2025-04-07 14:34     ` Bowman, Terry
2025-03-27  1:47 ` [PATCH v8 16/16] CXL/PCI: Disable CXL protocol errors during CXL Port cleanup Terry Bowman
2025-03-28  1:18   ` kernel test robot
2025-04-04 17:04   ` Jonathan Cameron
2025-04-07 14:25     ` Bowman, Terry
2025-04-17 10:13       ` Jonathan Cameron
2025-04-24 16:37         ` Bowman, Terry
2025-03-27 17:16 ` [PATCH v8 00/16] Enable CXL PCIe port protocol error handling and logging Bjorn Helgaas
2025-03-27 22:04   ` Bowman, Terry
2025-05-06 23:06 ` Gregory Price
2025-05-07 18:28   ` Bowman, Terry
