[PATCH v5 0/16] Enable CXL PCIe port protocol error handling and logging

* [PATCH v5 0/16] Enable CXL PCIe port protocol error handling and logging
@ 2025-01-07 14:38 Terry Bowman
  2025-01-07 14:38 ` [PATCH v5 01/16] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver' Terry Bowman
                   ` (15 more replies)
  0 siblings, 16 replies; 96+ messages in thread
From: Terry Bowman @ 2025-01-07 14:38 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

This is a continuation of the CXL Port error handling RFC from earlier.[1]
The RFC resulted in the decision to add CXL PCIe Port Protocol Error
handling to the existing RCH Downstream Port handling in the AER service
driver. This patchset adds the CXL PCIe Port Protocol Error handling and
logging.

The first 7 patches update the existing AER service driver to support CXL
PCIe Port Protocol Error handling and reporting. This includes AER service
driver changes for adding correctable and uncorrectable error support, CXL
specific recovery handling, and addition of CXL driver callback handlers.

The following 9 patches address CXL driver support for CXL PCIe Port
Protocol Errors. This includes the following changes to the CXL drivers:
mapping CXL Port and Downstream Port RAS registers, interface updates
common to Restricted CXL Host (RCH) and Virtual Hierarchy (VH) modes,
adding CXL Port Protocol Error handlers, and Protocol Error logging.

Note, this patchset does not completely refactor RCH Protocol Error
handling. The plan is to update the RCH handling in the future to use the
same handling path introduced here. 

[1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/

Testing:
========
Below are test results for this patchset using QEMU with a CXL Root
Port (RP, 0C:00.0), CXL Upstream Switch Port (USP, 0D:00.0), and CXL
Downstream Switch Port (DSP, 0E:00.0).

The topology is:
                   ---------------------
                   | CXL RP - 0C:00.0  |
                   ---------------------
                             |
                   ---------------------
                   | CXL USP - 0D:00.0 |
                   ---------------------
                             |
                   ---------------------
                   | CXL DSP - 0E:00.0 |
                   ---------------------
                             |
                   ---------------------
                   | CXL EP - 0F:00.0  |
                   ---------------------

root@tbowman-cxl:~# lspci -t
-+-[0000:00]-+-00.0
 |           +-01.0
 |           +-02.0
 |           +-03.0
 |           +-1f.0
 |           +-1f.2
 |           \-1f.3
 \-[0000:0c]---00.0-[0d-0f]----00.0-[0e-0f]----00.0-[0f]----00.0

The topology was created with:
 ${qemu} -boot menu=on \
            -cpu host \
            -nographic \
            -monitor telnet:127.0.0.1:1234,server,nowait \
            -M virt,cxl=on \
            -chardev stdio,id=s1,signal=off,mux=on -serial none \
            -device isa-serial,chardev=s1 -mon chardev=s1,mode=readline \
            -machine q35,cxl=on \
            -m 16G,maxmem=24G,slots=8 \
            -cpu EPYC-v3 \
            -smp 16 \
            -accel kvm \
            -drive file=${img},format=raw,index=0,media=disk \
            -device e1000,netdev=user.0 \
            -netdev user,id=user.0,hostfwd=tcp::5555-:22 \
            -object memory-backend-file,id=cxl-mem0,share=on,mem-path=/tmp/cxltest.raw,size=256M \
            -object memory-backend-file,id=cxl-lsa0,share=on,mem-path=/tmp/lsa0.raw,size=256M \
            -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
            -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
            -device cxl-upstream,bus=root_port0,id=us0 \
            -device cxl-downstream,port=0,bus=us0,id=swport0,chassis=0,slot=4 \
            -device cxl-type3,bus=swport0,volatile-memdev=cxl-mem0,lsa=cxl-lsa0,id=cxl-vmem0 \
            -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k

 This was tested using the aer-inject tool, updated to support CE and UCE
 internal Protocol Error injection. CXL Port RAS was set using a test patch
 (not upstreamed, but available on request).

 == Root Port Correctable Error ==
 root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
 pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
 pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
 pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
 pcieport 0000:0c:00.0:    [14] CorrIntErr
 aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
 cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'

 == Root Port UnCorrectable Error ==
 root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
 pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
 pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
 pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
 pcieport 0000:0c:00.0:    [22] UncorrIntErr
 aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
 Kernel panic - not syncing: CXL cachemem error.
 CPU: 10 UID: 0 PID: 149 Comm: irq/24-aerdrv Tainted: G            E      6.13.0-rc2-cxl-port-err-g0161162f683c #4833
 Tainted: [E]=UNSIGNED_MODULE
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x27/0x90
  dump_stack+0x10/0x20
  panic+0x33e/0x380
  ? srso_return_thunk+0x5/0x5f
  cxl_do_recovery+0x117/0x120
  ? srso_return_thunk+0x5/0x5f
  aer_isr+0x64f/0x700
  irq_thread_fn+0x28/0x70
  irq_thread+0x179/0x240
  ? srso_return_thunk+0x5/0x5f
  ? __pfx_irq_thread_fn+0x10/0x10
  ? __pfx_irq_thread_dtor+0x10/0x10
  ? __pfx_irq_thread+0x10/0x10
  kthread+0xf5/0x130
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x3c/0x60
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 Kernel Offset: 0x19c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
 ---[ end Kernel panic - not syncing: CXL cachemem error. ]---

 == Upstream Port Correctable Error ==
 root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
 pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
 pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
 pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00004000/0000a000
 pcieport 0000:0d:00.0:    [14] CorrIntErr
 aer_event: 0000:0d:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
 cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'

 == Upstream Port UnCorrectable Error == 
 root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
 pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
 pcieport 0000:0d:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
 pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00400000/02000000
 pcieport 0000:0d:00.0:    [22] UncorrIntErr
 aer_event: 0000:0d:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
 systemd-journald[483]: Sent WATCHDOG=1 notification.
 cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
 Kernel panic - not syncing: CXL cachemem error.
 CPU: 10 UID: 0 PID: 147 Comm: irq/24-aerdrv Tainted: G            E      6.13.0-rc2-cxl-port-err-g0161162f683c #4833
 Tainted: [E]=UNSIGNED_MODULE
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x27/0x90
  dump_stack+0x10/0x20
  panic+0x33e/0x380
  cxl_do_recovery+0x117/0x120
  aer_isr+0x64f/0x700
  irq_thread_fn+0x28/0x70
  irq_thread+0x179/0x240
  ? srso_return_thunk+0x5/0x5f
  ? __pfx_irq_thread_fn+0x10/0x10
  ? __pfx_irq_thread_dtor+0x10/0x10
  ? __pfx_irq_thread+0x10/0x10
  kthread+0xf5/0x130
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x3c/0x60
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 Kernel Offset: 0x13e00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
 ---[ end Kernel panic - not syncing: CXL cachemem error. ]---

 == Downstream Port Correctable Error ==
 root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0 
 pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
 pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
 pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
 pcieport 0000:0e:00.0:    [14] CorrIntErr
 aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
 cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
 
 == Downstream Port UnCorrectable Error == 
 root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
 pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
 pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
 pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00400000/02000000
 pcieport 0000:0e:00.0:    [22] UncorrIntErr
 aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
 cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
 Kernel panic - not syncing: CXL cachemem error.
 CPU: 10 UID: 0 PID: 148 Comm: irq/24-aerdrv Tainted: G            E      6.13.0-rc2-cxl-port-err-g0161162f683c #4833
 Tainted: [E]=UNSIGNED_MODULE
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x27/0x90
  dump_stack+0x10/0x20
  panic+0x33e/0x380
  cxl_do_recovery+0x117/0x120
  aer_isr+0x64f/0x700
  irq_thread_fn+0x28/0x70
  irq_thread+0x179/0x240
  ? srso_return_thunk+0x5/0x5f
  ? __pfx_irq_thread_fn+0x10/0x10
  ? __pfx_irq_thread_dtor+0x10/0x10
  ? __pfx_irq_thread+0x10/0x10
  kthread+0xf5/0x130
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x3c/0x60
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 Kernel Offset: 0x1bc00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
 ---[ end Kernel panic - not syncing: CXL cachemem error. ]---
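
For readers matching the injected values against the logs above: each
aer-inject status word is a single-bit mask. 00004000 is bit 14, printed as
"[14] CorrIntErr", and 00400000 is bit 22, printed as "[22] UncorrIntErr".
The mask words (0000a000 and 02000000) do not cover those bits, which is
consistent with the errors being reported. The following is a minimal
userspace sketch of that arithmetic only. The helper names are illustrative
and are not part of this series or of the AER driver.

```c
#include <stdint.h>

/* Bit positions as printed in the logs above. */
#define COR_INTERNAL_BIT	14	/* "[14] CorrIntErr" */
#define UNCOR_INTERNAL_BIT	22	/* "[22] UncorrIntErr" */

/* Illustrative helper: index of the lowest set bit, or -1 if none is set. */
static inline int first_set_bit(uint32_t status)
{
	for (int bit = 0; bit < 32; bit++)
		if (status & (UINT32_C(1) << bit))
			return bit;
	return -1;
}

/* Illustrative helper: status bits that are not covered by the mask. */
static inline uint32_t unmasked_status(uint32_t status, uint32_t mask)
{
	return status & ~mask;
}
```

Applied to the "error status/mask" lines above, unmasked_status() leaves both
injected bits set, so neither error is filtered by the mask register.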

 Changes
 =======
 Changes in v4 -> v5:
 [Alejandro] Refactor cxl_walk_bridge to simplify 'status' usage
 [Alejandro] Add WARN_ONCE() in __cxl_handle_ras() and cxl_handle_cor_ras()
 [Ming] Remove unnecessary NULL check in cxl_pci_port_ras()
 [Terry] Add failure check for call to to_cxl_port() in cxl_pci_port_ras()
 [Ming] Use port->dev for call to devm_add_action_or_reset() in
 cxl_dport_init_ras_reporting() and cxl_uport_init_ras_reporting()
 [Jonathan] Use get_device()/put_device() to prevent race condition in
 cxl_assign_port_error_handlers() and cxl_clear_port_error_handlers()
 [Terry] Update commit messages with uppercasing CXL and PCI terms
 
 Changes in v3 -> v4:
 [Lukas] Capitalize PCIe and CXL device names as in specifications
 [Lukas] Move call to pcie_is_cxl() into cxl_port_devsec()
 [Lukas] Correct namespace spelling
 [Lukas] Removed export from pcie_is_cxl_port()
 [Lukas] Simplify 'if' blocks in cxl_handle_error()
 [Lukas] Change panic message to remove redundant 'panic' text
 [Ming] Update to call cxl_dport_init_ras_reporting() in RCH case
 [lkp@intel] 'host' parameter is already removed. Remove parameter description too.
 [Terry] Added field description for cxl_err_handlers in pci.h comment block

 Changes in v2 -> v3:
 [Terry] Add UIE/CIE port enablement patch. Needed because only RPs are enabled by the AER driver.
 [DaveJ] Isolate reading upstream port's AER info to only the CXL path
 [Jonathan, Dan] Add details about separate handling paths for CXL & PCIe
 [Jonathan] Add details to existing comment in devm_cxl_add_endpoint()
 about call to cxl_init_ep_ports_aer()
 [Jonathan] Updated cxl_init_ep_ports_aer() w/ checks for NULL;
 [Jonathan] Move find_cxl_port() patch immediately before patch to create handlers
 [Jonathan] Patch title fix: find_cxl_ports() -> find_cxl_port()
 [Jonathan] Remove 2 unnecessary dev_warns() in cxl_dport_init_ras_reporting() and
 cxl_uport_init_ras_reporting().
 [Jonathan] Remove unnecessary filter on PCIe port devices in dev_is_cxl_pci()
 [Jonathan] Change to use 2 cxl_port declarations in cxl_pci_port_ras()
 [Jonathan] Fix spacing in 'struct cxl_error_handlers' declaration.
 
 Changes in v1 -> v2:
 [Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
 [Jonathan] Update description to DSP map patch description
 [Jonathan] Update cxl_pci_port_ras() to check for NULL port
 [Jonathan] Don't call handler before handler port changes are present (patch order)
 [Bjorn] Fix linebreak in cover sheet URL
 [Bjorn] Remove timestamps from test logs in cover sheet
 [Bjorn] Retitle AER commits to use "PCI/AER:"
 [Bjorn] Retitle patch#3 to use renaming instead of refactoring
 [Bjorn] Fix base commit-id on cover sheet
 [Bjorn] Add VH spec reference/citation
 [Terry] Removed last 2 patches to enable internal errors. Is not needed
 because internal errors are enabled in AER driver.
 [Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
 [Dan] Use kernel panic in CXL recovery
 [Dan] cxl_port_hndlrs -> cxl_port_error_handlers
 [Dan] Move cxl_port_error_handlers to pci_driver. Remove module (un)registration.
 [Terry] Add patch w/ cxl_assign_port_error_handlers() and cxl_clear_port_error_handlers()
 [Terry] Removed PCI_ERS_RESULT_PANIC patch. Is no longer needed because the result type parameter
 is not used in the cxl_err_handlers callbacks.

 Changes in RFC -> v1:
 [Dan] Rename cxl_rch_handle_error() to cxl_handle_error()
 [Dan] Add cxl_do_recovery()
 [Jonathan] Flatten cxl_setup_parent_uport()
 [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
 [Jonathan] Rename cxl_dev_is_pci_type()
 [Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
 replace these find_cxl_port() and device_find_child().
 [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
 [Ming] Don't use endpoint as host to cxl_map_component_regs()
 [Bjorn] Use "PCIe UIE/CIE" instead of "AER UI/CIE"
 [Bjorn] Don't use Kconfig to enable/disable a CXL external interface

Terry Bowman (16):
  PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct
    pci_driver'
  PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe Port
    support
  CXL/PCI: Introduce PCIe helper functions pcie_is_cxl() and
    pcie_is_cxl_port()
  PCI/AER: Modify AER driver logging to report CXL or PCIe bus error
    type
  PCI/AER: Add CXL PCIe Port correctable error support in AER service
    driver
  PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe
    Port devices
  PCI/AER: Add CXL PCIe Port uncorrectable error recovery in AER service
    driver
  cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS
    registers
  cxl/pci: Map CXL PCIe Upstream Switch Port RAS registers
  cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports
  cxl/pci: Add log message for unmapped registers in existing RAS
    handlers
  cxl/pci: Change find_cxl_port() to non-static
  cxl/pci: Add error handler for CXL PCIe Port RAS errors
  cxl/pci: Add trace logging for CXL PCIe Port RAS errors
  cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
  PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch
    Ports

 drivers/cxl/core/core.h       |   3 +
 drivers/cxl/core/pci.c        | 206 ++++++++++++++++++++++++++++------
 drivers/cxl/core/port.c       |   4 +-
 drivers/cxl/core/trace.h      |  47 ++++++++
 drivers/cxl/cxl.h             |  10 +-
 drivers/cxl/mem.c             |  39 ++++++-
 drivers/pci/pci.c             |  13 +++
 drivers/pci/pci.h             |   3 +
 drivers/pci/pcie/aer.c        | 107 +++++++++++-------
 drivers/pci/pcie/err.c        |  54 +++++++++
 drivers/pci/probe.c           |  10 ++
 include/linux/aer.h           |   1 +
 include/linux/pci.h           |  14 +++
 include/ras/ras_event.h       |   9 +-
 include/uapi/linux/pci_regs.h |   3 +-
 15 files changed, 435 insertions(+), 88 deletions(-)


base-commit: 2f84d072bdcb7d6ec66cc4d0de9f37a3dc394cd2
-- 
2.34.1



* [PATCH v5 01/16] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver'
  2025-01-07 14:38 [PATCH v5 0/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
@ 2025-01-07 14:38 ` Terry Bowman
  2025-01-13 23:45   ` Ira Weiny
  2025-02-06 17:01   ` Gregory Price
  2025-01-07 14:38 ` [PATCH v5 02/16] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe Port support Terry Bowman
                   ` (14 subsequent siblings)
  15 siblings, 2 replies; 96+ messages in thread
From: Terry Bowman @ 2025-01-07 14:38 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

CXL.io provides protocol error handling on top of PCIe Protocol Error
handling. However, CXL.io and PCIe have different handling requirements
for uncorrectable errors (UCE).

The PCIe AER service driver may attempt to recover PCIe devices that
report a UCE. Recovery is not used in the CXL.io case because of the
potential for corruption of what may be system memory.

Create 'struct cxl_error_handlers', referenced through the new
pci_driver::cxl_err_handler pointer, similar to 'struct pci_error_handlers'
and pci_driver::err_handler. It provides callbacks for correctable and
uncorrectable CXL.io error handling.

The CXL error handlers will be used in future patches adding CXL PCIe
Port Protocol Error handling.
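
As a rough illustration of how a caller might use the new callbacks, below
is a userspace sketch with stand-in types. It is not the AER driver code
from later patches. Two points are assumptions made for illustration only:
that a 'true' return from error_detected() means the caller should escalate
(the test logs in the cover letter show a panic on UCE), and the names
sketch_cxl_dispatch(), enum sketch_action, and the demo_* handlers, none of
which exist in this series.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; only the fields used here. */
struct pci_dev;

struct cxl_error_handlers {
	bool (*error_detected)(struct pci_dev *dev);
	void (*cor_error_detected)(struct pci_dev *dev);
};

struct pci_driver {
	const struct cxl_error_handlers *cxl_err_handler;
};

struct pci_dev {
	const struct pci_driver *driver;
	unsigned int cor_seen;		/* hypothetical counter for the demo */
};

/* Hypothetical handlers a driver could supply. */
static bool demo_error_detected(struct pci_dev *dev)
{
	(void)dev;
	return true;	/* assumption: true asks the caller to escalate */
}

static void demo_cor_error_detected(struct pci_dev *dev)
{
	dev->cor_seen++;
}

static const struct cxl_error_handlers demo_handlers = {
	.error_detected		= demo_error_detected,
	.cor_error_detected	= demo_cor_error_detected,
};

static const struct pci_driver demo_driver = {
	.cxl_err_handler = &demo_handlers,
};

enum sketch_action { SKETCH_NONE, SKETCH_LOGGED, SKETCH_PANIC };

/* Pick the CXL callback by severity; never fall back to PCIe recovery. */
static inline enum sketch_action sketch_cxl_dispatch(struct pci_dev *dev,
						     bool correctable)
{
	const struct cxl_error_handlers *h;

	if (!dev->driver || !dev->driver->cxl_err_handler)
		return SKETCH_NONE;
	h = dev->driver->cxl_err_handler;

	if (correctable) {
		if (!h->cor_error_detected)
			return SKETCH_NONE;
		h->cor_error_detected(dev);
		return SKETCH_LOGGED;
	}

	if (!h->error_detected)
		return SKETCH_NONE;
	return h->error_detected(dev) ? SKETCH_PANIC : SKETCH_LOGGED;
}
```

The sketch keeps the CXL callbacks separate from pci_driver::err_handler,
mirroring the separate pointer added to 'struct pci_driver' in this patch.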

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
---
 include/linux/pci.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index db9b47ce3eef..e2e36f11205c 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -882,6 +882,14 @@ struct pci_error_handlers {
 	void (*cor_error_detected)(struct pci_dev *dev);
 };
 
+/* Compute Express Link (CXL) bus error event callbacks */
+struct cxl_error_handlers {
+	/* CXL bus error detected on this device */
+	bool (*error_detected)(struct pci_dev *dev);
+
+	/* Allow device driver to record more details of a correctable error */
+	void (*cor_error_detected)(struct pci_dev *dev);
+};
 
 struct module;
 
@@ -927,6 +935,7 @@ struct module;
  * @sriov_get_vf_total_msix: PF driver callback to get the total number of
  *              MSI-X vectors available for distribution to the VFs.
  * @err_handler: See Documentation/PCI/pci-error-recovery.rst
+ * @cxl_err_handler: Compute Express Link specific error handlers.
  * @groups:	Sysfs attribute groups.
  * @dev_groups: Attributes attached to the device that will be
  *              created once it is bound to the driver.
@@ -952,6 +961,7 @@ struct pci_driver {
 	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
 	u32  (*sriov_get_vf_total_msix)(struct pci_dev *pf);
 	const struct pci_error_handlers *err_handler;
+	const struct cxl_error_handlers *cxl_err_handler;
 	const struct attribute_group **groups;
 	const struct attribute_group **dev_groups;
 	struct device_driver	driver;
-- 
2.34.1



* [PATCH v5 02/16] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe Port support
  2025-01-07 14:38 [PATCH v5 0/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
  2025-01-07 14:38 ` [PATCH v5 01/16] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver' Terry Bowman
@ 2025-01-07 14:38 ` Terry Bowman
  2025-01-13 23:45   ` Ira Weiny
  2025-02-06 17:02   ` Gregory Price
  2025-01-07 14:38 ` [PATCH v5 03/16] CXL/PCI: Introduce PCIe helper functions pcie_is_cxl() and pcie_is_cxl_port() Terry Bowman
                   ` (13 subsequent siblings)
  15 siblings, 2 replies; 96+ messages in thread
From: Terry Bowman @ 2025-01-07 14:38 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

The AER service driver already includes support for Restricted CXL host
(RCH) Downstream Port Protocol Error handling. The current implementation
is based on CXL 1.1 using a Root Complex Event Collector.

Rename function interfaces and parameters where necessary to include
virtual hierarchy (VH) mode CXL PCIe Port error handling alongside the RCH
handling.[1] The CXL PCIe Port Protocol Error handling support will be
added in a future patch.

Limit changes to renaming variable and function names. No functional
changes are added.

[1] CXL 3.1 Spec, 9.12.2 CXL Virtual Hierarchy

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
---
 drivers/pci/pcie/aer.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 34ce9f834d0c..0e2478f4fca2 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1030,7 +1030,7 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
 	return 0;
 }
 
-static void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
+static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 {
 	/*
 	 * Internal errors of an RCEC indicate an AER error in an
@@ -1053,30 +1053,30 @@ static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
 	return *handles_cxl;
 }
 
-static bool handles_cxl_errors(struct pci_dev *rcec)
+static bool handles_cxl_errors(struct pci_dev *dev)
 {
 	bool handles_cxl = false;
 
-	if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
-	    pcie_aer_is_native(rcec))
-		pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
+	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
+	    pcie_aer_is_native(dev))
+		pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
 
 	return handles_cxl;
 }
 
-static void cxl_rch_enable_rcec(struct pci_dev *rcec)
+static void cxl_enable_internal_errors(struct pci_dev *dev)
 {
-	if (!handles_cxl_errors(rcec))
+	if (!handles_cxl_errors(dev))
 		return;
 
-	pci_aer_unmask_internal_errors(rcec);
-	pci_info(rcec, "CXL: Internal errors unmasked");
+	pci_aer_unmask_internal_errors(dev);
+	pci_info(dev, "CXL: Internal errors unmasked");
 }
 
 #else
-static inline void cxl_rch_enable_rcec(struct pci_dev *dev) { }
-static inline void cxl_rch_handle_error(struct pci_dev *dev,
-					struct aer_err_info *info) { }
+static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
+static inline void cxl_handle_error(struct pci_dev *dev,
+				    struct aer_err_info *info) { }
 #endif
 
 /**
@@ -1114,7 +1114,7 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 
 static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
 {
-	cxl_rch_handle_error(dev, info);
+	cxl_handle_error(dev, info);
 	pci_aer_handle_error(dev, info);
 	pci_dev_put(dev);
 }
@@ -1494,7 +1494,7 @@ static int aer_probe(struct pcie_device *dev)
 		return status;
 	}
 
-	cxl_rch_enable_rcec(port);
+	cxl_enable_internal_errors(port);
 	aer_enable_rootport(rpc);
 	pci_info(port, "enabled with IRQ %d\n", dev->irq);
 	return 0;
-- 
2.34.1



* [PATCH v5 03/16] CXL/PCI: Introduce PCIe helper functions pcie_is_cxl() and pcie_is_cxl_port()
  2025-01-07 14:38 [PATCH v5 0/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
  2025-01-07 14:38 ` [PATCH v5 01/16] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver' Terry Bowman
  2025-01-07 14:38 ` [PATCH v5 02/16] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe Port support Terry Bowman
@ 2025-01-07 14:38 ` Terry Bowman
  2025-01-13 23:49   ` Ira Weiny
  2025-01-07 14:38 ` [PATCH v5 04/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 96+ messages in thread
From: Terry Bowman @ 2025-01-07 14:38 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

CXL and AER drivers need the ability to identify CXL devices and CXL port
devices.

First, add set_pcie_cxl() with logic checking for CXL Flexbus DVSEC
presence. The CXL Flexbus DVSEC presence is used because it is required
for all the CXL PCIe devices.[1]

Add boolean 'struct pci_dev::is_cxl' to cache the CXL Flexbus DVSEC
presence.

Add pcie_is_cxl() as a macro to return 'struct pci_dev::is_cxl'.

Add pcie_is_cxl_port() to check if a device is a CXL Root Port, CXL
Upstream Switch Port, or CXL Downstream Switch Port. Also, verify the
CXL Extensions DVSEC for Ports is present.[1]

[1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
    Capability (DVSEC) ID Assignment, Table 8-2
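
The checks described above can be modelled in userspace against a mock
configuration space. The sketch below is not the kernel implementation. It
walks PCIe extended capabilities looking for a DVSEC (extended capability
ID 0x23) whose vendor is CXL (0x1e98) and whose DVSEC ID is 3 (Port) or
7 (Flexbus), matching the values used in the diff below. 'struct mock_dev',
the mock_*() helpers, and find_dvsec() are hypothetical names introduced
only for this example.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CFG_SIZE		4096
#define EXT_CAP_START		0x100
#define EXT_CAP_ID_DVSEC	0x23	/* PCI_EXT_CAP_ID_DVSEC */
#define VENDOR_ID_CXL		0x1e98	/* PCI_VENDOR_ID_CXL */
#define DVSEC_CXL_PORT		3
#define DVSEC_CXL_FLEXBUS	7

enum port_type { TYPE_ENDPOINT, TYPE_ROOT_PORT, TYPE_UPSTREAM, TYPE_DOWNSTREAM };

struct mock_dev {
	uint8_t cfg[CFG_SIZE];	/* mock config space, host byte order */
	enum port_type type;
	bool is_cxl;		/* cached Flexbus DVSEC presence */
};

static inline uint32_t cfg_read32(const struct mock_dev *d, unsigned int off)
{
	uint32_t v;

	memcpy(&v, &d->cfg[off], sizeof(v));
	return v;
}

static inline void cfg_write32(struct mock_dev *d, unsigned int off, uint32_t v)
{
	memcpy(&d->cfg[off], &v, sizeof(v));
}

/* Place a DVSEC at 'pos' that links to 'next' (0 terminates the list). */
static inline void mock_add_dvsec(struct mock_dev *d, unsigned int pos,
				  unsigned int next, uint16_t vendor, uint16_t id)
{
	cfg_write32(d, pos, EXT_CAP_ID_DVSEC | (1u << 16) | ((uint32_t)next << 20));
	cfg_write32(d, pos + 4, vendor);	/* DVSEC header 1: vendor ID */
	cfg_write32(d, pos + 8, id);		/* DVSEC header 2: DVSEC ID */
}

/* Return the offset of the matching DVSEC, or 0 if it is not present. */
static inline unsigned int find_dvsec(const struct mock_dev *d,
				      uint16_t vendor, uint16_t id)
{
	unsigned int pos = EXT_CAP_START;
	int ttl = (CFG_SIZE - EXT_CAP_START) / 8;	/* bound the walk */

	while (pos >= EXT_CAP_START && pos + 12 <= CFG_SIZE && ttl-- > 0) {
		uint32_t hdr = cfg_read32(d, pos);

		if (hdr == 0)
			break;
		if ((hdr & 0xffff) == EXT_CAP_ID_DVSEC &&
		    (cfg_read32(d, pos + 4) & 0xffff) == vendor &&
		    (cfg_read32(d, pos + 8) & 0xffff) == id)
			return pos;
		pos = (hdr >> 20) & 0xffc;	/* next capability offset */
	}
	return 0;
}

/* Cache Flexbus DVSEC presence once, as at device setup time. */
static inline void mock_set_is_cxl(struct mock_dev *d)
{
	d->is_cxl = find_dvsec(d, VENDOR_ID_CXL, DVSEC_CXL_FLEXBUS) != 0;
}

/* Port type check first, then the cached flag, then the Port DVSEC. */
static inline bool mock_is_cxl_port(const struct mock_dev *d)
{
	if (d->type != TYPE_ROOT_PORT && d->type != TYPE_UPSTREAM &&
	    d->type != TYPE_DOWNSTREAM)
		return false;
	if (!d->is_cxl)
		return false;
	return find_dvsec(d, VENDOR_ID_CXL, DVSEC_CXL_PORT) != 0;
}
```

Caching the Flexbus result in a flag means later callers avoid repeating
the capability walk, which is the stated purpose of 'is_cxl'.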

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
---
 drivers/pci/pci.c             | 13 +++++++++++++
 drivers/pci/probe.c           | 10 ++++++++++
 include/linux/pci.h           |  4 ++++
 include/uapi/linux/pci_regs.h |  3 ++-
 4 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 661f98c6c63a..9319c62e3488 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5036,10 +5036,23 @@ static int pci_dev_reset_slot_function(struct pci_dev *dev, bool probe)
 
 static u16 cxl_port_dvsec(struct pci_dev *dev)
 {
+	if (!pcie_is_cxl(dev))
+		return 0;
+
 	return pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
 					 PCI_DVSEC_CXL_PORT);
 }
 
+bool pcie_is_cxl_port(struct pci_dev *dev)
+{
+	if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
+	    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM) &&
+	    (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
+		return false;
+
+	return cxl_port_dvsec(dev);
+}
+
 static bool cxl_sbr_masked(struct pci_dev *dev)
 {
 	u16 dvsec, reg;
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 2e81ab0f5a25..ee40a1e2ec75 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1633,6 +1633,14 @@ static void set_pcie_thunderbolt(struct pci_dev *dev)
 		dev->is_thunderbolt = 1;
 }
 
+static void set_pcie_cxl(struct pci_dev *dev)
+{
+	u16 dvsec = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
+					      PCI_DVSEC_CXL_FLEXBUS);
+	if (dvsec)
+		dev->is_cxl = 1;
+}
+
 static void set_pcie_untrusted(struct pci_dev *dev)
 {
 	struct pci_dev *parent = pci_upstream_bridge(dev);
@@ -1963,6 +1971,8 @@ int pci_setup_device(struct pci_dev *dev)
 	/* Need to have dev->cfg_size ready */
 	set_pcie_thunderbolt(dev);
 
+	set_pcie_cxl(dev);
+
 	set_pcie_untrusted(dev);
 
 	if (pci_is_pcie(dev))
diff --git a/include/linux/pci.h b/include/linux/pci.h
index e2e36f11205c..08350302b3e9 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -452,6 +452,7 @@ struct pci_dev {
 	unsigned int	is_hotplug_bridge:1;
 	unsigned int	shpc_managed:1;		/* SHPC owned by shpchp */
 	unsigned int	is_thunderbolt:1;	/* Thunderbolt controller */
+	unsigned int	is_cxl:1;               /* Compute Express Link (CXL) */
 	/*
 	 * Devices marked being untrusted are the ones that can potentially
 	 * execute DMA attacks and similar. They are typically connected
@@ -739,6 +740,9 @@ static inline bool pci_is_vga(struct pci_dev *pdev)
 	return false;
 }
 
+#define pcie_is_cxl(dev) ((dev)->is_cxl)
+bool pcie_is_cxl_port(struct pci_dev *dev);
+
 #define for_each_pci_bridge(dev, bus)				\
 	list_for_each_entry(dev, &bus->devices, bus_list)	\
 		if (!pci_is_bridge(dev)) {} else
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index 1601c7ed5fab..4251af090742 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -1208,9 +1208,10 @@
 #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
 #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000
 
-/* Compute Express Link (CXL r3.1, sec 8.1.5) */
+/* Compute Express Link (CXL r3.1, sec 8.1) */
 #define PCI_DVSEC_CXL_PORT				3
 #define PCI_DVSEC_CXL_PORT_CTL				0x0c
 #define PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR		0x00000001
+#define PCI_DVSEC_CXL_FLEXBUS				7
 
 #endif /* LINUX_PCI_REGS_H */
-- 
2.34.1



* [PATCH v5 04/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type
  2025-01-07 14:38 [PATCH v5 0/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (2 preceding siblings ...)
  2025-01-07 14:38 ` [PATCH v5 03/16] CXL/PCI: Introduce PCIe helper functions pcie_is_cxl() and pcie_is_cxl_port() Terry Bowman
@ 2025-01-07 14:38 ` Terry Bowman
  2025-01-13 23:51   ` Ira Weiny
  2025-02-06 18:18   ` Gregory Price
  2025-01-07 14:38 ` [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver Terry Bowman
                   ` (11 subsequent siblings)
  15 siblings, 2 replies; 96+ messages in thread
From: Terry Bowman @ 2025-01-07 14:38 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

The AER driver and aer_event tracing currently log 'PCIe Bus Error'
for all errors, including errors reported by CXL devices.

Update the driver and aer_event tracing to log 'CXL Bus Error' for CXL
device errors.
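
As a rough illustration of the label selection, here is a minimal
userspace sketch. 'struct toy_pci_dev' and the toy_* helpers are
hypothetical stand-ins rather than kernel interfaces; only the
"CXL"/"PCIe" ternary and the "%s Bus Error" prefix mirror the diff below.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for struct pci_dev; only the flag used here. */
struct toy_pci_dev {
	bool is_cxl;
};

/* Same ternary the patch adds: "CXL" for CXL devices, otherwise "PCIe". */
static const char *toy_bus_type(const struct toy_pci_dev *dev)
{
	return dev->is_cxl ? "CXL" : "PCIe";
}

/* Build the leading part of the log line; returns snprintf()'s result. */
static int toy_format_bus_error(char *buf, size_t len,
				const struct toy_pci_dev *dev,
				const char *severity)
{
	return snprintf(buf, len, "%s Bus Error: severity=%s",
			toy_bus_type(dev), severity);
}
```

With this sketch, a device whose is_cxl flag is set yields a line that
starts with "CXL Bus Error", and any other device keeps the existing
"PCIe Bus Error" prefix.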

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
---
 drivers/pci/pcie/aer.c  | 14 ++++++++------
 include/ras/ras_event.h |  9 ++++++---
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 0e2478f4fca2..f8b3350fcbb4 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -700,13 +700,14 @@ static void __aer_print_error(struct pci_dev *dev,
 
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 {
+	const char *bus_type = pcie_is_cxl(dev) ? "CXL"  : "PCIe";
 	int layer, agent;
 	int id = pci_dev_id(dev);
 	const char *level;
 
 	if (!info->status) {
-		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
-			aer_error_severity_string[info->severity]);
+		pci_err(dev, "%s Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
+			bus_type, aer_error_severity_string[info->severity]);
 		goto out;
 	}
 
@@ -715,8 +716,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 
 	level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
 
-	pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
-		   aer_error_severity_string[info->severity],
+	pci_printk(level, dev, "%s Bus Error: severity=%s, type=%s, (%s)\n",
+		   bus_type, aer_error_severity_string[info->severity],
 		   aer_error_layer[layer], aer_agent_string[agent]);
 
 	pci_printk(level, dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
@@ -731,7 +732,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 	if (info->id && info->error_dev_num > 1 && info->id == id)
 		pci_err(dev, "  Error of this Agent is reported first\n");
 
-	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
+	trace_aer_event(dev_name(&dev->dev), bus_type, (info->status & ~info->mask),
 			info->severity, info->tlp_header_valid, &info->tlp);
 }
 
@@ -765,6 +766,7 @@ EXPORT_SYMBOL_GPL(cper_severity_to_aer);
 void pci_print_aer(struct pci_dev *dev, int aer_severity,
 		   struct aer_capability_regs *aer)
 {
+	const char *bus_type = pcie_is_cxl(dev) ? "CXL"  : "PCIe";
 	int layer, agent, tlp_header_valid = 0;
 	u32 status, mask;
 	struct aer_err_info info;
@@ -799,7 +801,7 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 	if (tlp_header_valid)
 		__print_tlp_header(dev, &aer->header_log);
 
-	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
+	trace_aer_event(dev_name(&dev->dev), bus_type, (status & ~mask),
 			aer_severity, tlp_header_valid, &aer->header_log);
 }
 EXPORT_SYMBOL_NS_GPL(pci_print_aer, "CXL");
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index e5f7ee0864e7..1bf8e7050ba8 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -297,15 +297,17 @@ TRACE_EVENT(non_standard_event,
 
 TRACE_EVENT(aer_event,
 	TP_PROTO(const char *dev_name,
+		 const char *bus_type,
 		 const u32 status,
 		 const u8 severity,
 		 const u8 tlp_header_valid,
 		 struct pcie_tlp_log *tlp),
 
-	TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp),
+	TP_ARGS(dev_name, bus_type, status, severity, tlp_header_valid, tlp),
 
 	TP_STRUCT__entry(
 		__string(	dev_name,	dev_name	)
+		__string(	bus_type,	bus_type	)
 		__field(	u32,		status		)
 		__field(	u8,		severity	)
 		__field(	u8, 		tlp_header_valid)
@@ -314,6 +316,7 @@ TRACE_EVENT(aer_event,
 
 	TP_fast_assign(
 		__assign_str(dev_name);
+		__assign_str(bus_type);
 		__entry->status		= status;
 		__entry->severity	= severity;
 		__entry->tlp_header_valid = tlp_header_valid;
@@ -325,8 +328,8 @@ TRACE_EVENT(aer_event,
 		}
 	),
 
-	TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s\n",
-		__get_str(dev_name),
+	TP_printk("%s %s Bus Error: severity=%s, %s, TLP Header=%s\n",
+		__get_str(dev_name), __get_str(bus_type),
 		__entry->severity == AER_CORRECTABLE ? "Corrected" :
 			__entry->severity == AER_FATAL ?
 			"Fatal" : "Uncorrected, non-fatal",
-- 
2.34.1



* [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver
  2025-01-07 14:38 [PATCH v5 0/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (3 preceding siblings ...)
  2025-01-07 14:38 ` [PATCH v5 04/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
@ 2025-01-07 14:38 ` Terry Bowman
  2025-01-14  6:54   ` Li Ming
                     ` (2 more replies)
  2025-01-07 14:38 ` [PATCH v5 06/16] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe Port devices Terry Bowman
                   ` (10 subsequent siblings)
  15 siblings, 3 replies; 96+ messages in thread
From: Terry Bowman @ 2025-01-07 14:38 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

The AER service driver supports handling Downstream Port Protocol Errors in
Restricted CXL Host (RCH) mode, also known as CXL 1.1. It needs the same
functionality for CXL PCIe Ports operating in Virtual Hierarchy (VH)
mode.[1]

CXL and PCIe Protocol Error handling have different requirements that
necessitate a separate handling path. The AER service driver may try to
recover PCIe uncorrectable non-fatal errors (UCE). The same recovery is not
suitable for CXL PCIe Port devices because of the potential for system
memory corruption. Instead, CXL Protocol Error handling must panic the
kernel in the case of a fatal or non-fatal UCE. The AER driver's PCIe
Protocol Error handling does not panic the kernel in response to a UCE.

Introduce a separate path for CXL Protocol Error handling in the AER
service driver. This will allow CXL Protocol Errors to use CXL specific
handling instead of PCIe handling. Add the CXL specific changes without
affecting or adding functionality in the PCIe handling.

Make this update alongside the existing Downstream Port RCH error handling
logic, extending support to CXL PCIe Ports in VH mode.

is_internal_error() is currently only built when the CONFIG_PCIEAER_CXL
kernel config is enabled. Move is_internal_error() outside of the
CONFIG_PCIEAER_CXL block so that it is always available, regardless of
whether CONFIG_PCIEAER_CXL is enabled or disabled.

The uncorrectable error (UCE) handling will be added in a future patch.
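
The routing decision described above can be modeled in a few lines. The
following is a userspace sketch only: every toy_* name is hypothetical,
the mask values are assumed to match the kernel's PCI_ERR_COR_INTERNAL
and PCI_ERR_UNC_INTN bits, and 'handles_cxl' stands in for the result of
handles_cxl_errors().

```c
#include <stdbool.h>
#include <stdint.h>

enum toy_severity { TOY_NONFATAL, TOY_FATAL, TOY_CORRECTABLE };
enum toy_path { TOY_PATH_PCIE, TOY_PATH_CXL };

/* Assumed bit positions of the AER "internal error" status bits. */
#define TOY_COR_INTERNAL	0x00004000u
#define TOY_UNC_INTN		0x00400000u

struct toy_err_info {
	enum toy_severity severity;
	uint32_t status;
};

/* Pick the status bit to test based on the reported severity. */
static bool toy_is_internal_error(const struct toy_err_info *info)
{
	if (info->severity == TOY_CORRECTABLE)
		return info->status & TOY_COR_INTERNAL;

	return info->status & TOY_UNC_INTN;
}

/* Only internal errors on devices that handle CXL errors take the CXL path. */
static enum toy_path toy_route(const struct toy_err_info *info,
			       bool handles_cxl)
{
	if (toy_is_internal_error(info) && handles_cxl)
		return TOY_PATH_CXL;

	return TOY_PATH_PCIE;
}
```

The point of the sketch is that the two paths are mutually exclusive:
an error is handed to either the CXL handler or the PCIe handler, never
both.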

[1] CXL 3.1 Spec, 12.2.2 CXL Root Ports, Downstream Switch Ports, and
Upstream Switch Ports

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
---
 drivers/pci/pcie/aer.c | 61 +++++++++++++++++++++++++++---------------
 1 file changed, 40 insertions(+), 21 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index f8b3350fcbb4..62be599e3bee 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -942,8 +942,15 @@ static bool find_source_device(struct pci_dev *parent,
 	return true;
 }
 
-#ifdef CONFIG_PCIEAER_CXL
+static bool is_internal_error(struct aer_err_info *info)
+{
+	if (info->severity == AER_CORRECTABLE)
+		return info->status & PCI_ERR_COR_INTERNAL;
 
+	return info->status & PCI_ERR_UNC_INTN;
+}
+
+#ifdef CONFIG_PCIEAER_CXL
 /**
  * pci_aer_unmask_internal_errors - unmask internal errors
  * @dev: pointer to the pcie_dev data structure
@@ -995,14 +1002,6 @@ static bool cxl_error_is_native(struct pci_dev *dev)
 	return (pcie_ports_native || host->native_aer);
 }
 
-static bool is_internal_error(struct aer_err_info *info)
-{
-	if (info->severity == AER_CORRECTABLE)
-		return info->status & PCI_ERR_COR_INTERNAL;
-
-	return info->status & PCI_ERR_UNC_INTN;
-}
-
 static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
 {
 	struct aer_err_info *info = (struct aer_err_info *)data;
@@ -1034,14 +1033,23 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
 
 static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 {
-	/*
-	 * Internal errors of an RCEC indicate an AER error in an
-	 * RCH's downstream port. Check and handle them in the CXL.mem
-	 * device driver.
-	 */
-	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
-	    is_internal_error(info))
-		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
+	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
+		return pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
+
+	if (info->severity == AER_CORRECTABLE) {
+		struct pci_driver *pdrv = dev->driver;
+		int aer = dev->aer_cap;
+
+		if (aer)
+			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
+					       info->status);
+
+		if (pdrv && pdrv->cxl_err_handler &&
+		    pdrv->cxl_err_handler->cor_error_detected)
+			pdrv->cxl_err_handler->cor_error_detected(dev);
+
+		pcie_clear_device_status(dev);
+	}
 }
 
 static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
@@ -1059,9 +1067,13 @@ static bool handles_cxl_errors(struct pci_dev *dev)
 {
 	bool handles_cxl = false;
 
-	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
-	    pcie_aer_is_native(dev))
+	if (!pcie_aer_is_native(dev))
+		return false;
+
+	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
 		pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
+	else
+		handles_cxl = pcie_is_cxl_port(dev);
 
 	return handles_cxl;
 }
@@ -1079,6 +1091,10 @@ static void cxl_enable_internal_errors(struct pci_dev *dev)
 static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
 static inline void cxl_handle_error(struct pci_dev *dev,
 				    struct aer_err_info *info) { }
+static bool handles_cxl_errors(struct pci_dev *dev)
+{
+	return false;
+}
 #endif
 
 /**
@@ -1116,8 +1132,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 
 static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
 {
-	cxl_handle_error(dev, info);
-	pci_aer_handle_error(dev, info);
+	if (is_internal_error(info) && handles_cxl_errors(dev))
+		cxl_handle_error(dev, info);
+	else
+		pci_aer_handle_error(dev, info);
+
 	pci_dev_put(dev);
 }
 
-- 
2.34.1



* [PATCH v5 06/16] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe Port devices
  2025-01-07 14:38 [PATCH v5 0/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (4 preceding siblings ...)
  2025-01-07 14:38 ` [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver Terry Bowman
@ 2025-01-07 14:38 ` Terry Bowman
  2025-01-14 11:32   ` Jonathan Cameron
  2025-01-14 16:57   ` Ira Weiny
  2025-01-07 14:38 ` [PATCH v5 07/16] PCI/AER: Add CXL PCIe Port uncorrectable error recovery in AER service driver Terry Bowman
                   ` (9 subsequent siblings)
  15 siblings, 2 replies; 96+ messages in thread
From: Terry Bowman @ 2025-01-07 14:38 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

The AER service driver's aer_get_device_error_info() function doesn't read
uncorrectable (UCE) fatal error status from PCIe Upstream Port devices,
including CXL Upstream Switch Ports. As a result, fatal errors are not
logged or handled as needed for CXL PCIe Upstream Switch Port devices.

Update the aer_get_device_error_info() function to also read the UCE fatal
status for CXL PCIe Upstream Switch Port devices. Make the change such that
non-CXL devices are not affected.

The fatal error status will be used in future patches implementing
CXL PCIe Port uncorrectable error handling and logging.
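
For readers who want the resulting condition in isolation, here is a
small sketch of the predicate. It models only the 'else if' branch
touched by the diff below (the correctable-error branch is left out),
and the toy_* enums are hypothetical stand-ins for the kernel's
PCI_EXP_TYPE_* and AER_* constants.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the PCI_EXP_TYPE_* port types. */
enum toy_port_type {
	TOY_TYPE_ENDPOINT,
	TOY_TYPE_ROOT_PORT,
	TOY_TYPE_UPSTREAM,
	TOY_TYPE_DOWNSTREAM,
	TOY_TYPE_RC_EC,
};

enum toy_severity { TOY_NONFATAL, TOY_FATAL };

/* True when the uncorrectable status register would be read. */
static bool toy_reads_uncor_status(enum toy_port_type type,
				   enum toy_severity severity,
				   bool is_cxl)
{
	return type == TOY_TYPE_ROOT_PORT ||
	       type == TOY_TYPE_RC_EC ||
	       type == TOY_TYPE_DOWNSTREAM ||
	       severity == TOY_NONFATAL ||
	       (is_cxl && type == TOY_TYPE_UPSTREAM);
}
```

In this model a fatal error on a non-CXL Upstream Port still skips the
read, which is how non-CXL devices remain unaffected.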

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/pci/pcie/aer.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 62be599e3bee..79c828bdcb6d 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1253,7 +1253,8 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
 	} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
 		   type == PCI_EXP_TYPE_RC_EC ||
 		   type == PCI_EXP_TYPE_DOWNSTREAM ||
-		   info->severity == AER_NONFATAL) {
+		   info->severity == AER_NONFATAL ||
+		   (pcie_is_cxl(dev) && type == PCI_EXP_TYPE_UPSTREAM)) {
 
 		/* Link is still healthy for IO reads */
 		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,
-- 
2.34.1



* [PATCH v5 07/16] PCI/AER: Add CXL PCIe Port uncorrectable error recovery in AER service driver
  2025-01-07 14:38 [PATCH v5 0/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (5 preceding siblings ...)
  2025-01-07 14:38 ` [PATCH v5 06/16] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe Port devices Terry Bowman
@ 2025-01-07 14:38 ` Terry Bowman
  2025-01-14 11:33   ` Jonathan Cameron
  2025-01-14 17:27   ` Ira Weiny
  2025-01-07 14:38 ` [PATCH v5 08/16] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers Terry Bowman
                   ` (8 subsequent siblings)
  15 siblings, 2 replies; 96+ messages in thread
From: Terry Bowman @ 2025-01-07 14:38 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

The existing recovery procedure for PCIe uncorrectable errors (UCE) does
not apply to CXL devices. Recovery cannot be used for CXL devices because
of potential corruption of what may be system memory. Also, the current
PCIe UCE recovery, in the case of a Root Port (RP) or Downstream Switch
Port (DSP), does not begin at the RP/DSP but begins at the first downstream
device. This will miss handling CXL Protocol Errors in a CXL RP or DSP. A
separate CXL recovery is needed because of the different handling
requirements.

Add a new function, cxl_do_recovery(), built on the following helpers.

Add cxl_walk_bridge() to iterate the detected error's sub-topology.
cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
will begin iteration at the RP or DSP rather than beginning at the
first downstream device.

Add cxl_report_error_detected() as an analog to report_error_detected().
It will call the pci_driver::cxl_err_handler error_detected() callback for
each iterated device. The callback returns a boolean indicating whether a
UCE was detected during handling.

cxl_do_recovery() uses the status from cxl_report_error_detected() to
determine how to proceed. Non-fatal CXL UCEs are treated as fatal. If a
UCE was present during handling, cxl_do_recovery() will panic the
kernel.
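
The walk order can be illustrated with a toy device tree. This is a
userspace sketch under stated assumptions: 'struct toy_dev' and all
toy_* functions are hypothetical, the recursive child walk only
approximates pci_walk_bus(), and the walk state is explicitly cleared
before use so that a tree without any reporting device yields "no UCE".

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_CHILDREN 4

/* Hypothetical device node; has_uce models what a driver handler reports. */
struct toy_dev {
	bool has_uce;
	struct toy_dev *child[TOY_MAX_CHILDREN];
	size_t nr_children;
};

struct toy_walk_state {
	bool uce_seen;
	unsigned int visited;
};

/* Callback: record a UCE and return nonzero to stop the walk early. */
static int toy_report(struct toy_dev *dev, void *data)
{
	struct toy_walk_state *st = data;

	st->visited++;
	if (dev->has_uce)
		st->uce_seen = true;

	return st->uce_seen;
}

/* Depth-first walk below 'dev', stopping once the callback returns nonzero. */
static int toy_walk_children(struct toy_dev *dev,
			     int (*cb)(struct toy_dev *, void *), void *data)
{
	for (size_t i = 0; i < dev->nr_children; i++) {
		if (cb(dev->child[i], data))
			return 1;
		if (toy_walk_children(dev->child[i], cb, data))
			return 1;
	}

	return 0;
}

/* Unlike a downstream-only walk, visit the bridge itself first. */
static void toy_walk_bridge(struct toy_dev *bridge,
			    int (*cb)(struct toy_dev *, void *), void *data)
{
	if (cb(bridge, data))
		return;

	toy_walk_children(bridge, cb, data);
}

/* Returns true when the recovery path would have to panic. */
static bool toy_must_panic(struct toy_dev *bridge)
{
	struct toy_walk_state st = { .uce_seen = false, .visited = 0 };

	toy_walk_bridge(bridge, toy_report, &st);

	return st.uce_seen;
}
```

Because the bridge is visited before its subordinate bus, an error
latched in the RP or DSP itself is seen even when every downstream
device reports no error.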

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/pci/pci.h      |  3 +++
 drivers/pci/pcie/aer.c |  4 ++++
 drivers/pci/pcie/err.c | 54 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 2e40fc63ba31..566ad527e61f 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -711,6 +711,9 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 		pci_channel_state_t state,
 		pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev));
 
+/* CXL error reporting and handling */
+void cxl_do_recovery(struct pci_dev *dev);
+
 bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
 int pcie_retrain_link(struct pci_dev *pdev, bool use_lt);
 
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 79c828bdcb6d..68e957459008 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1025,6 +1025,8 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
 			err_handler->error_detected(dev, pci_channel_io_normal);
 		else if (info->severity == AER_FATAL)
 			err_handler->error_detected(dev, pci_channel_io_frozen);
+
+		cxl_do_recovery(dev);
 	}
 out:
 	device_unlock(&dev->dev);
@@ -1049,6 +1051,8 @@ static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 			pdrv->cxl_err_handler->cor_error_detected(dev);
 
 		pcie_clear_device_status(dev);
+	} else {
+		cxl_do_recovery(dev);
 	}
 }
 
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 31090770fffc..bfa5dbbc0e1a 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -276,3 +276,57 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 
 	return status;
 }
+
+static void cxl_walk_bridge(struct pci_dev *bridge,
+			    int (*cb)(struct pci_dev *, void *),
+			    void *userdata)
+{
+	if (cb(bridge, userdata))
+		return;
+
+	if (bridge->subordinate)
+		pci_walk_bus(bridge->subordinate, cb, userdata);
+}
+
+static int cxl_report_error_detected(struct pci_dev *dev, void *data)
+{
+	const struct cxl_error_handlers *cxl_err_handler;
+	struct pci_driver *pdrv = dev->driver;
+	bool *status = data;
+
+	device_lock(&dev->dev);
+	if (pdrv && pdrv->cxl_err_handler &&
+	    pdrv->cxl_err_handler->error_detected) {
+		cxl_err_handler = pdrv->cxl_err_handler;
+		*status = cxl_err_handler->error_detected(dev);
+	}
+	device_unlock(&dev->dev);
+	return *status;
+}
+
+void cxl_do_recovery(struct pci_dev *dev)
+{
+	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
+	int type = pci_pcie_type(dev);
+	struct pci_dev *bridge;
+	bool status = false;
+
+	if (type == PCI_EXP_TYPE_ROOT_PORT ||
+	    type == PCI_EXP_TYPE_DOWNSTREAM ||
+	    type == PCI_EXP_TYPE_UPSTREAM ||
+	    type == PCI_EXP_TYPE_ENDPOINT)
+		bridge = dev;
+	else
+		bridge = pci_upstream_bridge(dev);
+
+	cxl_walk_bridge(bridge, cxl_report_error_detected, &status);
+	if (status)
+		panic("CXL cachemem error.");
+
+	if (host->native_aer || pcie_ports_native) {
+		pcie_clear_device_status(dev);
+		pci_aer_clear_nonfatal_status(dev);
+	}
+
+	pci_info(bridge, "CXL uncorrectable error.\n");
+}
-- 
2.34.1



* [PATCH v5 08/16] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers
  2025-01-07 14:38 [PATCH v5 0/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (6 preceding siblings ...)
  2025-01-07 14:38 ` [PATCH v5 07/16] PCI/AER: Add CXL PCIe Port uncorrectable error recovery in AER service driver Terry Bowman
@ 2025-01-07 14:38 ` Terry Bowman
  2025-01-14 21:37   ` Ira Weiny
  2025-02-07  7:30   ` Gregory Price
  2025-01-07 14:38 ` [PATCH v5 09/16] cxl/pci: Map CXL PCIe Upstream " Terry Bowman
                   ` (7 subsequent siblings)
  15 siblings, 2 replies; 96+ messages in thread
From: Terry Bowman @ 2025-01-07 14:38 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

The CXL mem driver (cxl_mem) currently maps and caches a pointer to RAS
registers for the endpoint's Root Port. The same needs to be done for
each of the CXL Downstream Switch Ports and CXL Root Ports found between
the endpoint and CXL Host Bridge.

Introduce cxl_init_ep_ports_aer() to be called for each CXL Port in the
sub-topology between the endpoint and the CXL Host Bridge. This function
will determine if there are CXL Downstream Switch Ports or CXL Root Ports
associated with this Port. The same check will be added in the future for
upstream switch ports.

Move the RAS register map logic from cxl_dport_map_ras() into
cxl_dport_init_ras_reporting(). This eliminates the need for the helper
function, cxl_dport_map_ras().

cxl_init_ep_ports_aer() calls cxl_dport_init_ras_reporting() to map
the RAS registers for CXL Downstream Switch Ports and CXL Root Ports.

cxl_dport_init_ras_reporting() must check for previously mapped registers
before mapping. This is required because multiple endpoints under a CXL
switch may share an upstream CXL Root Port or CXL Downstream Switch Port.
Ensure the RAS registers are only mapped once.
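
The map-once rule amounts to a simple guard. The sketch below is a
userspace model with hypothetical names ('struct toy_dport',
toy_map_ras(), toy_init_ras_reporting()); the counter exists only to
make the effect of the guard observable and has no counterpart in the
driver.

```c
#include <stddef.h>

/* Hypothetical stand-in for the cached dport RAS mapping. */
struct toy_dport {
	void *ras;	/* non-NULL once the RAS block is mapped */
	int map_calls;	/* how many times the pretend mapping ran */
};

static int toy_backing;	/* placeholder for the mapped register block */

/* Pretend to map the RAS registers and cache the resulting pointer. */
static void toy_map_ras(struct toy_dport *dport)
{
	dport->map_calls++;
	dport->ras = &toy_backing;
}

/* Called once per endpoint below the port; maps only on the first call. */
static void toy_init_ras_reporting(struct toy_dport *dport)
{
	if (dport->ras)
		return;

	toy_map_ras(dport);
}
```

Calling toy_init_ras_reporting() for a second endpoint behind the same
port returns early, so the mapping is established exactly once.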

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Alejandro Lucero <alucerop@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/cxl/core/pci.c | 37 +++++++++++++++----------------------
 drivers/cxl/cxl.h      |  6 ++----
 drivers/cxl/mem.c      | 31 +++++++++++++++++++++++++++++--
 3 files changed, 46 insertions(+), 28 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index b3aac9964e0d..1af2d0a14f5d 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -749,18 +749,6 @@ static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
 	}
 }
 
-static void cxl_dport_map_ras(struct cxl_dport *dport)
-{
-	struct cxl_register_map *map = &dport->reg_map;
-	struct device *dev = dport->dport_dev;
-
-	if (!map->component_map.ras.valid)
-		dev_dbg(dev, "RAS registers not found\n");
-	else if (cxl_map_component_regs(map, &dport->regs.component,
-					BIT(CXL_CM_CAP_CAP_ID_RAS)))
-		dev_dbg(dev, "Failed to map RAS capability.\n");
-}
-
 static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
 {
 	void __iomem *aer_base = dport->regs.dport_aer;
@@ -788,22 +776,27 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
 /**
  * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
  * @dport: the cxl_dport that needs to be initialized
- * @host: host device for devm operations
  */
-void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
+void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
 {
-	dport->reg_map.host = host;
-	cxl_dport_map_ras(dport);
-
-	if (dport->rch) {
-		struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
-
-		if (!host_bridge->native_aer)
-			return;
+	struct device *dport_dev = dport->dport_dev;
+	struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
 
+	dport->reg_map.host = dport_dev;
+	if (dport->rch && host_bridge->native_aer) {
 		cxl_dport_map_rch_aer(dport);
 		cxl_disable_rch_root_ints(dport);
 	}
+
+	/* dport may have more than 1 downstream EP. Check if already mapped. */
+	if (dport->regs.ras)
+		return;
+
+	if (cxl_map_component_regs(&dport->reg_map, &dport->regs.component,
+				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
+		dev_err(dport_dev, "Failed to map RAS capability.\n");
+		return;
+	}
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
 
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index fdac3ddb8635..727429dfdaed 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -772,11 +772,9 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
 					 resource_size_t rcrb);
 
 #ifdef CONFIG_PCIEAER_CXL
-void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport);
-void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host);
+void cxl_dport_init_ras_reporting(struct cxl_dport *dport);
 #else
-static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport,
-						struct device *host) { }
+static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport) { }
 #endif
 
 struct cxl_decoder *to_cxl_decoder(struct device *dev);
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 2f03a4d5606e..dd39f4565be2 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -45,6 +45,31 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
 	return 0;
 }
 
+static bool dev_is_cxl_pci(struct device *dev, u32 pcie_type)
+{
+	struct pci_dev *pdev;
+
+	if (!dev || !dev_is_pci(dev))
+		return false;
+
+	pdev = to_pci_dev(dev);
+
+	return (pci_pcie_type(pdev) == pcie_type);
+}
+
+static void cxl_init_ep_ports_aer(struct cxl_ep *ep)
+{
+	struct cxl_dport *dport = ep->dport;
+
+	if (dport) {
+		struct device *dport_dev = dport->dport_dev;
+
+		if (dev_is_cxl_pci(dport_dev, PCI_EXP_TYPE_DOWNSTREAM) ||
+		    dev_is_cxl_pci(dport_dev, PCI_EXP_TYPE_ROOT_PORT))
+			cxl_dport_init_ras_reporting(dport);
+	}
+}
+
 static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
 				 struct cxl_dport *parent_dport)
 {
@@ -52,6 +77,9 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
 	struct cxl_port *endpoint, *iter, *down;
 	int rc;
 
+	if (parent_dport->rch)
+		cxl_dport_init_ras_reporting(parent_dport);
+
 	/*
 	 * Now that the path to the root is established record all the
 	 * intervening ports in the chain.
@@ -62,6 +90,7 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
 
 		ep = cxl_ep_load(iter, cxlmd);
 		ep->next = down;
+		cxl_init_ep_ports_aer(ep);
 	}
 
 	/* Note: endpoint port component registers are derived from @cxlds */
@@ -166,8 +195,6 @@ static int cxl_mem_probe(struct device *dev)
 	else
 		endpoint_parent = &parent_port->dev;
 
-	cxl_dport_init_ras_reporting(dport, dev);
-
 	scoped_guard(device, endpoint_parent) {
 		if (!endpoint_parent->driver) {
 			dev_err(dev, "CXL port topology %s not enabled\n",
-- 
2.34.1



* [PATCH v5 09/16] cxl/pci: Map CXL PCIe Upstream Switch Port RAS registers
  2025-01-07 14:38 [PATCH v5 0/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (7 preceding siblings ...)
  2025-01-07 14:38 ` [PATCH v5 08/16] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers Terry Bowman
@ 2025-01-07 14:38 ` Terry Bowman
  2025-01-14 11:35   ` Jonathan Cameron
                     ` (2 more replies)
  2025-01-07 14:38 ` [PATCH v5 10/16] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports Terry Bowman
                   ` (6 subsequent siblings)
  15 siblings, 3 replies; 96+ messages in thread
From: Terry Bowman @ 2025-01-07 14:38 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Add logic to map CXL PCIe Upstream Switch Port (USP) RAS registers.

Introduce 'struct cxl_regs' member into 'struct cxl_port' to cache a
pointer to the CXL Upstream Port's mapped RAS registers.

Also, introduce cxl_uport_init_ras_reporting() to perform the USP RAS
register mapping. This is similar to the existing
cxl_dport_init_ras_reporting() but for USP devices.

The USP may have multiple downstream endpoints. Before mapping the RAS
registers, check whether they are already mapped.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/pci.c | 15 +++++++++++++++
 drivers/cxl/cxl.h      |  4 ++++
 drivers/cxl/mem.c      |  8 ++++++++
 3 files changed, 27 insertions(+)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 1af2d0a14f5d..97e6a15bea88 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -773,6 +773,21 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
 	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
 }
 
+void cxl_uport_init_ras_reporting(struct cxl_port *port)
+{
+	/* uport may have more than 1 downstream EP. Check if already mapped. */
+	if (port->uport_regs.ras)
+		return;
+
+	port->reg_map.host = &port->dev;
+	if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
+				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
+		dev_err(&port->dev, "Failed to map RAS capability.\n");
+		return;
+	}
+}
+EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, "CXL");
+
 /**
  * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
  * @dport: the cxl_dport that needs to be initialized
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 727429dfdaed..c51735fe75d6 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -601,6 +601,7 @@ struct cxl_dax_region {
  * @parent_dport: dport that points to this port in the parent
  * @decoder_ida: allocator for decoder ids
  * @reg_map: component and ras register mapping parameters
+ * @uport_regs: mapped component registers
  * @nr_dports: number of entries in @dports
  * @hdm_end: track last allocated HDM decoder instance for allocation ordering
  * @commit_end: cursor to track highest committed decoder for commit ordering
@@ -621,6 +622,7 @@ struct cxl_port {
 	struct cxl_dport *parent_dport;
 	struct ida decoder_ida;
 	struct cxl_register_map reg_map;
+	struct cxl_component_regs uport_regs;
 	int nr_dports;
 	int hdm_end;
 	int commit_end;
@@ -773,8 +775,10 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
 
 #ifdef CONFIG_PCIEAER_CXL
 void cxl_dport_init_ras_reporting(struct cxl_dport *dport);
+void cxl_uport_init_ras_reporting(struct cxl_port *port);
 #else
 static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport) { }
+static inline void cxl_uport_init_ras_reporting(struct cxl_port *port) { }
 #endif
 
 struct cxl_decoder *to_cxl_decoder(struct device *dev);
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index dd39f4565be2..97dbca765f4d 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -60,6 +60,7 @@ static bool dev_is_cxl_pci(struct device *dev, u32 pcie_type)
 static void cxl_init_ep_ports_aer(struct cxl_ep *ep)
 {
 	struct cxl_dport *dport = ep->dport;
+	struct cxl_port *port = ep->next;
 
 	if (dport) {
 		struct device *dport_dev = dport->dport_dev;
@@ -68,6 +69,13 @@ static void cxl_init_ep_ports_aer(struct cxl_ep *ep)
 		    dev_is_cxl_pci(dport_dev, PCI_EXP_TYPE_ROOT_PORT))
 			cxl_dport_init_ras_reporting(dport);
 	}
+
+	if (port) {
+		struct device *uport_dev = port->uport_dev;
+
+		if (dev_is_cxl_pci(uport_dev, PCI_EXP_TYPE_UPSTREAM))
+			cxl_uport_init_ras_reporting(port);
+	}
 }
 
 static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
-- 
2.34.1



* [PATCH v5 10/16] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports
  2025-01-07 14:38 [PATCH v5 0/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (8 preceding siblings ...)
  2025-01-07 14:38 ` [PATCH v5 09/16] cxl/pci: Map CXL PCIe Upstream " Terry Bowman
@ 2025-01-07 14:38 ` Terry Bowman
  2025-01-14 11:39   ` Jonathan Cameron
                     ` (2 more replies)
  2025-01-07 14:38 ` [PATCH v5 11/16] cxl/pci: Add log message for unmapped registers in existing RAS handlers Terry Bowman
                   ` (5 subsequent siblings)
  15 siblings, 3 replies; 96+ messages in thread
From: Terry Bowman @ 2025-01-07 14:38 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

CXL PCIe Port Protocol Error handling support will be added to the
CXL drivers in the future. In preparation, update the existing RAS
handler interfaces so they can handle all CXL PCIe Port Protocol Errors.

The driver's RAS support functions currently rely on a 'struct
cxl_dev_state' type parameter, which is not available for CXL Port
devices. However, since the same CXL RAS capability structure is
needed across most CXL components and devices,[1] a common handling
approach should be adopted.

To accommodate this, update the __cxl_handle_cor_ras() and
__cxl_handle_ras() functions to use a 'struct device' instead of
'struct cxl_dev_state'.

No functional changes are introduced.
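
The interface change can be illustrated with a minimal userspace sketch.
This is not the kernel code: every type and name below (the mock_ prefix,
MOCK_COR_STATUS_MASK and its value, the plain pointer standing in for the
__iomem mapping) is a hypothetical stand-in, and the write-1-to-clear
register is modelled by clearing bits in memory. It only shows how a
handler that takes a generic device plus a register pointer can serve an
endpoint wrapper that still starts from the device state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-ins for the kernel types. */
struct device { const char *name; };
struct cxl_memdev { struct device dev; };
struct cxl_dev_state {
	struct cxl_memdev *cxlmd;
	uint32_t *ras;		/* stands in for the mapped RAS status register */
};

#define MOCK_COR_STATUS_MASK 0x7fu	/* illustrative value only */

/* Generic handler: needs only a device and the RAS status register. */
static bool mock_handle_cor_ras(struct device *dev, uint32_t *ras_status)
{
	uint32_t status;

	assert(dev != NULL);
	if (!ras_status)
		return false;

	status = *ras_status;
	if (!(status & MOCK_COR_STATUS_MASK))
		return false;

	/* The real register is write-1-to-clear; model that by clearing. */
	*ras_status &= ~(status & MOCK_COR_STATUS_MASK);
	return true;
}

/* Endpoint wrapper: derives the generic device from the device state. */
static bool mock_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
{
	return mock_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->ras);
}
```

A port caller would pass its own device and register pointer to
mock_handle_cor_ras() directly, without needing a device state object.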

[1] CXL 3.1 Spec, 8.2.4 CXL.cache and CXL.mem Registers

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Alejandro Lucero <alucerop@amd.com>
---
 drivers/cxl/core/pci.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 97e6a15bea88..5699ee5b29df 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -650,7 +650,7 @@ void read_cdat_data(struct cxl_port *port)
 }
 EXPORT_SYMBOL_NS_GPL(read_cdat_data, "CXL");
 
-static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
+static void __cxl_handle_cor_ras(struct device *dev,
 				 void __iomem *ras_base)
 {
 	void __iomem *addr;
@@ -663,13 +663,13 @@ static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
 	status = readl(addr);
 	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
 		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
-		trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
+		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
 	}
 }
 
 static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
 {
-	return __cxl_handle_cor_ras(cxlds, cxlds->regs.ras);
+	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
 }
 
 /* CXL spec rev3.0 8.2.4.16.1 */
@@ -693,8 +693,7 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
  * Log the state of the RAS status registers and prepare them to log the
  * next error status. Return 1 if reset needed.
  */
-static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
-				  void __iomem *ras_base)
+static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
 {
 	u32 hl[CXL_HEADERLOG_SIZE_U32];
 	void __iomem *addr;
@@ -721,7 +720,7 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
 	}
 
 	header_log_copy(ras_base, hl);
-	trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe, hl);
+	trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
 	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
 
 	return true;
@@ -729,7 +728,7 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
 
 static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
 {
-	return __cxl_handle_ras(cxlds, cxlds->regs.ras);
+	return __cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
 }
 
 #ifdef CONFIG_PCIEAER_CXL
@@ -818,13 +817,13 @@ EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
 static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
 					  struct cxl_dport *dport)
 {
-	return __cxl_handle_cor_ras(cxlds, dport->regs.ras);
+	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, dport->regs.ras);
 }
 
 static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
 				       struct cxl_dport *dport)
 {
-	return __cxl_handle_ras(cxlds, dport->regs.ras);
+	return __cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
 }
 
 /*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v5 11/16] cxl/pci: Add log message for unmapped registers in existing RAS handlers
  2025-01-07 14:38 [PATCH v5 0/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (9 preceding siblings ...)
  2025-01-07 14:38 ` [PATCH v5 10/16] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports Terry Bowman
@ 2025-01-07 14:38 ` Terry Bowman
  2025-01-14 11:41   ` Jonathan Cameron
                     ` (2 more replies)
  2025-01-07 14:38 ` [PATCH v5 12/16] cxl/pci: Change find_cxl_port() to non-static Terry Bowman
                   ` (4 subsequent siblings)
  15 siblings, 3 replies; 96+ messages in thread
From: Terry Bowman @ 2025-01-07 14:38 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

The CXL RAS handlers do not currently log if the RAS registers are
unmapped. This is needed in order to help debug CXL error handling. Update
the CXL driver to log a warning message if the RAS register block is
unmapped.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/pci.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 5699ee5b29df..8275b3dc3589 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -656,8 +656,10 @@ static void __cxl_handle_cor_ras(struct device *dev,
 	void __iomem *addr;
 	u32 status;
 
-	if (!ras_base)
+	if (!ras_base) {
+		dev_warn_once(dev, "CXL RAS register block is not mapped\n");
 		return;
+	}
 
 	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
 	status = readl(addr);
@@ -700,8 +702,10 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
 	u32 status;
 	u32 fe;
 
-	if (!ras_base)
+	if (!ras_base) {
+		dev_warn_once(dev, "CXL RAS register block is not mapped\n");
 		return false;
+	}
 
 	addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
 	status = readl(addr);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v5 12/16] cxl/pci: Change find_cxl_port() to non-static
  2025-01-07 14:38 [PATCH v5 0/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (10 preceding siblings ...)
  2025-01-07 14:38 ` [PATCH v5 11/16] cxl/pci: Add log message for unmapped registers in existing RAS handlers Terry Bowman
@ 2025-01-07 14:38 ` Terry Bowman
  2025-01-14 22:23   ` Ira Weiny
  2025-01-07 14:38 ` [PATCH v5 13/16] cxl/pci: Add error handler for CXL PCIe Port RAS errors Terry Bowman
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 96+ messages in thread
From: Terry Bowman @ 2025-01-07 14:38 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

CXL PCIe Port Protocol Error support will be added in the future. This
requires searching for a CXL PCIe Port device in the CXL topology, which
find_cxl_port() provides. However, find_cxl_port() is declared static
and therefore cannot be called from outside its source file.

Update the find_cxl_port() declaration to be non-static.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/cxl/core/core.h | 3 +++
 drivers/cxl/core/port.c | 4 ++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 800466f96a68..eb42a2801f98 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -115,4 +115,7 @@ bool cxl_need_node_perf_attrs_update(int nid);
 int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
 					struct access_coordinate *c);
 
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+			       struct cxl_dport **dport);
+
 #endif /* __CXL_CORE_H__ */
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 78a5c2c25982..1ee408412782 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1342,8 +1342,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
 	return NULL;
 }
 
-static struct cxl_port *find_cxl_port(struct device *dport_dev,
-				      struct cxl_dport **dport)
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+			       struct cxl_dport **dport)
 {
 	struct cxl_find_port_ctx ctx = {
 		.dport_dev = dport_dev,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v5 13/16] cxl/pci: Add error handler for CXL PCIe Port RAS errors
  2025-01-07 14:38 [PATCH v5 0/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (11 preceding siblings ...)
  2025-01-07 14:38 ` [PATCH v5 12/16] cxl/pci: Change find_cxl_port() to non-static Terry Bowman
@ 2025-01-07 14:38 ` Terry Bowman
  2025-01-14 11:46   ` Jonathan Cameron
                     ` (2 more replies)
  2025-01-07 14:38 ` [PATCH v5 14/16] cxl/pci: Add trace logging " Terry Bowman
                   ` (2 subsequent siblings)
  15 siblings, 3 replies; 96+ messages in thread
From: Terry Bowman @ 2025-01-07 14:38 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Introduce correctable and uncorrectable CXL PCIe Port Protocol Error
handlers.

The handlers will be called with a 'struct pci_dev' parameter
indicating the CXL Port device requiring handling. The CXL PCIe Port
device's underlying 'struct device' will match the port device in the
CXL topology.

Use the PCIe Port's device object to find the matching CXL Upstream Switch
Port, CXL Downstream Switch Port, or CXL Root Port in the CXL topology. The
matching CXL Port device should contain a cached reference to the RAS
register block. The cached RAS block will be used when handling the error.

Invoke the existing __cxl_handle_ras() or __cxl_handle_cor_ras() using
a reference to the RAS registers as a parameter. These functions will use
the RAS register reference to indicate an error and clear the device's RAS
status.

Future patches will assign the error handlers and add trace logging.
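
The Upstream Switch Port lookup described above can be sketched in
userspace as a linear search. This is only an illustration under stated
assumptions: the topology is modelled as a flat array rather than a bus,
reference counting is left out, and all names (mock_cxl_port,
mock_find_uport_ras) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a CXL port in the topology. */
struct mock_cxl_port {
	const void *uport_dev;	/* device backing the upstream port */
	uint32_t *uport_ras;	/* cached RAS register block, may be NULL */
};

/*
 * Walk the mock topology and return the cached RAS block of the port
 * whose upstream device matches, or NULL if no port matches.
 */
static uint32_t *mock_find_uport_ras(const struct mock_cxl_port *ports,
				     size_t nr_ports, const void *uport_dev)
{
	assert(ports != NULL || nr_ports == 0);

	for (size_t i = 0; i < nr_ports; i++) {
		if (ports[i].uport_dev == uport_dev)
			return ports[i].uport_ras;
	}

	return NULL;
}
```

A NULL result covers both "no matching port" and "port found but RAS block
never mapped", which is why the handlers must tolerate a NULL register
pointer.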

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/pci.c | 63 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 63 insertions(+)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 8275b3dc3589..411834f7efe0 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -776,6 +776,69 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
 	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
 }
 
+static int match_uport(struct device *dev, const void *data)
+{
+	struct device *uport_dev = (struct device *)data;
+	struct cxl_port *port;
+
+	if (!is_cxl_port(dev))
+		return 0;
+
+	port = to_cxl_port(dev);
+
+	return port->uport_dev == uport_dev;
+}
+
+static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
+{
+	struct cxl_port *port;
+
+	if (!pdev)
+		return NULL;
+
+	if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
+	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
+		struct cxl_dport *dport;
+		void __iomem *ras_base;
+
+		port = find_cxl_port(&pdev->dev, &dport);
+		ras_base = dport ? dport->regs.ras : NULL;
+		if (port)
+			put_device(&port->dev);
+		return ras_base;
+	} else if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {
+		struct device *port_dev;
+
+		port_dev = bus_find_device(&cxl_bus_type, NULL, &pdev->dev,
+					   match_uport);
+		if (!port_dev)
+			return NULL;
+
+		port = to_cxl_port(port_dev);
+		if (!port)
+			return NULL;
+
+		put_device(port_dev);
+		return port->uport_regs.ras;
+	}
+
+	return NULL;
+}
+
+static void cxl_port_cor_error_detected(struct pci_dev *pdev)
+{
+	void __iomem *ras_base = cxl_pci_port_ras(pdev);
+
+	__cxl_handle_cor_ras(&pdev->dev, ras_base);
+}
+
+static bool cxl_port_error_detected(struct pci_dev *pdev)
+{
+	void __iomem *ras_base = cxl_pci_port_ras(pdev);
+
+	return __cxl_handle_ras(&pdev->dev, ras_base);
+}
+
 void cxl_uport_init_ras_reporting(struct cxl_port *port)
 {
 	/* uport may have more than 1 downstream EP. Check if already mapped. */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v5 14/16] cxl/pci: Add trace logging for CXL PCIe Port RAS errors
  2025-01-07 14:38 [PATCH v5 0/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (12 preceding siblings ...)
  2025-01-07 14:38 ` [PATCH v5 13/16] cxl/pci: Add error handler for CXL PCIe Port RAS errors Terry Bowman
@ 2025-01-07 14:38 ` Terry Bowman
  2025-01-14 11:49   ` Jonathan Cameron
  2025-01-14 22:58   ` Ira Weiny
  2025-01-07 14:38 ` [PATCH v5 15/16] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers Terry Bowman
  2025-01-07 14:38 ` [PATCH v5 16/16] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports Terry Bowman
  15 siblings, 2 replies; 96+ messages in thread
From: Terry Bowman @ 2025-01-07 14:38 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

The CXL drivers use kernel trace functions for logging endpoint and
Restricted CXL host (RCH) Downstream Port RAS errors. Similar functionality
is required for CXL Root Ports, CXL Downstream Switch Ports, and CXL
Upstream Switch Ports.

Introduce trace logging functions for both RAS correctable and
uncorrectable errors specific to CXL PCIe Ports. Additionally, update
the CXL Port Protocol Error handlers to invoke these new trace functions.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Alejandro Lucero <alucerop@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/cxl/core/pci.c   | 17 +++++++++++----
 drivers/cxl/core/trace.h | 47 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 60 insertions(+), 4 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 411834f7efe0..3e87fe54a1a2 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -663,10 +663,15 @@ static void __cxl_handle_cor_ras(struct device *dev,
 
 	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
 	status = readl(addr);
-	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
-		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
+	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
+		return;
+
+	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
+
+	if (is_cxl_memdev(dev))
 		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
-	}
+	else
+		trace_cxl_port_aer_correctable_error(dev, status);
 }
 
 static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
@@ -724,7 +729,11 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
 	}
 
 	header_log_copy(ras_base, hl);
-	trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
+	if (is_cxl_memdev(dev))
+		trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
+	else
+		trace_cxl_port_aer_uncorrectable_error(dev, status, fe, hl);
+
 	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
 
 	return true;
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index 8389a94adb1a..681e415ac8f5 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -48,6 +48,34 @@
 	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
 )
 
+TRACE_EVENT(cxl_port_aer_uncorrectable_error,
+	TP_PROTO(struct device *dev, u32 status, u32 fe, u32 *hl),
+	TP_ARGS(dev, status, fe, hl),
+	TP_STRUCT__entry(
+		__string(devname, dev_name(dev))
+		__string(host, dev_name(dev->parent))
+		__field(u32, status)
+		__field(u32, first_error)
+		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
+	),
+	TP_fast_assign(
+		__assign_str(devname);
+		__assign_str(host);
+		__entry->status = status;
+		__entry->first_error = fe;
+		/*
+		 * Embed the 512B headerlog data for user app retrieval and
+		 * parsing, but no need to print this in the trace buffer.
+		 */
+		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
+	),
+	TP_printk("device=%s host=%s status: '%s' first_error: '%s'",
+		  __get_str(devname), __get_str(host),
+		  show_uc_errs(__entry->status),
+		  show_uc_errs(__entry->first_error)
+	)
+);
+
 TRACE_EVENT(cxl_aer_uncorrectable_error,
 	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
 	TP_ARGS(cxlmd, status, fe, hl),
@@ -96,6 +124,25 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
 	{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" }	\
 )
 
+TRACE_EVENT(cxl_port_aer_correctable_error,
+	TP_PROTO(struct device *dev, u32 status),
+	TP_ARGS(dev, status),
+	TP_STRUCT__entry(
+		__string(devname, dev_name(dev))
+		__string(host, dev_name(dev->parent))
+		__field(u32, status)
+	),
+	TP_fast_assign(
+		__assign_str(devname);
+		__assign_str(host);
+		__entry->status = status;
+	),
+	TP_printk("device=%s host=%s status='%s'",
+		  __get_str(devname), __get_str(host),
+		  show_ce_errs(__entry->status)
+	)
+);
+
 TRACE_EVENT(cxl_aer_correctable_error,
 	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
 	TP_ARGS(cxlmd, status),
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v5 15/16] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
  2025-01-07 14:38 [PATCH v5 0/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (13 preceding siblings ...)
  2025-01-07 14:38 ` [PATCH v5 14/16] cxl/pci: Add trace logging " Terry Bowman
@ 2025-01-07 14:38 ` Terry Bowman
  2025-01-14 11:51   ` Jonathan Cameron
                     ` (2 more replies)
  2025-01-07 14:38 ` [PATCH v5 16/16] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports Terry Bowman
  15 siblings, 3 replies; 96+ messages in thread
From: Terry Bowman @ 2025-01-07 14:38 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

pci_driver::cxl_err_handlers are not currently assigned handler callbacks.
The handlers can't be set in the pci_driver static definition because the
CXL PCIe Port devices are bound to the portdrv driver which is not CXL
driver aware.

Add cxl_assign_port_error_handlers() in the cxl_core module. This
function will assign the default handlers for a CXL PCIe Port device.

When the CXL Port (cxl_port or cxl_dport) is destroyed the device's
pci_driver::cxl_err_handlers must be set to NULL indicating they should no
longer be used.

Create cxl_clear_port_error_handlers() and register it to be called
when the CXL Port device (cxl_port or cxl_dport) is destroyed.
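
The assign/clear pairing can be sketched in userspace as follows. All
types and names here are hypothetical mocks, the device reference counting
is omitted, and mock_dispatch_uncorrectable() is only an assumed example
of how a caller might consume the handler table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mock_pci_dev;

/* Hypothetical handler table, loosely modelled on the one described. */
struct mock_cxl_err_handlers {
	bool (*error_detected)(struct mock_pci_dev *pdev);
	void (*cor_error_detected)(struct mock_pci_dev *pdev);
};

struct mock_pci_driver {
	const struct mock_cxl_err_handlers *cxl_err_handlers;
};

struct mock_pci_dev {
	struct mock_pci_driver *driver;
	unsigned int cor_count;
};

static bool mock_error_detected(struct mock_pci_dev *pdev)
{
	(void)pdev;
	return true;	/* pretend an uncorrectable error was found */
}

static void mock_cor_error_detected(struct mock_pci_dev *pdev)
{
	pdev->cor_count++;
}

static const struct mock_cxl_err_handlers mock_port_handlers = {
	.error_detected = mock_error_detected,
	.cor_error_detected = mock_cor_error_detected,
};

/* Assign the default handlers at init time. */
static void mock_assign_handlers(struct mock_pci_dev *pdev)
{
	if (!pdev || !pdev->driver)
		return;
	pdev->driver->cxl_err_handlers = &mock_port_handlers;
}

/* Cleanup action: takes void * so it can be registered as a callback. */
static void mock_clear_handlers(void *data)
{
	struct mock_pci_dev *pdev = data;

	if (!pdev || !pdev->driver)
		return;
	pdev->driver->cxl_err_handlers = NULL;
}

/* A caller must cope with the handlers having been cleared. */
static bool mock_dispatch_uncorrectable(struct mock_pci_dev *pdev)
{
	const struct mock_cxl_err_handlers *h;

	assert(pdev != NULL);
	h = pdev->driver ? pdev->driver->cxl_err_handlers : NULL;
	if (!h || !h->error_detected)
		return false;
	return h->error_detected(pdev);
}
```

Note that in this sketch the table pointer lives in the driver object, so
every device sharing that driver sees the same assignment and the same
clear.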

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/pci.c | 49 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 49 insertions(+)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 3e87fe54a1a2..9c162120f0fe 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -848,8 +848,40 @@ static bool cxl_port_error_detected(struct pci_dev *pdev)
 	return __cxl_handle_ras(&pdev->dev, ras_base);
 }
 
+static const struct cxl_error_handlers cxl_port_error_handlers = {
+	.error_detected	= cxl_port_error_detected,
+	.cor_error_detected = cxl_port_cor_error_detected,
+};
+
+static void cxl_assign_port_error_handlers(struct pci_dev *pdev)
+{
+	struct pci_driver *pdrv;
+
+	if (!pdev || !pdev->driver || !get_device(&pdev->dev))
+		return;
+
+	pdrv = pdev->driver;
+	pdrv->cxl_err_handler = &cxl_port_error_handlers;
+	put_device(&pdev->dev);
+}
+
+static void cxl_clear_port_error_handlers(void *data)
+{
+	struct pci_dev *pdev = data;
+	struct pci_driver *pdrv;
+
+	if (!pdev || !pdev->driver || !get_device(&pdev->dev))
+		return;
+
+	pdrv = pdev->driver;
+	pdrv->cxl_err_handler = NULL;
+	put_device(&pdev->dev);
+}
+
 void cxl_uport_init_ras_reporting(struct cxl_port *port)
 {
+	struct pci_dev *pdev = to_pci_dev(port->uport_dev);
+
 	/* uport may have more than 1 downstream EP. Check if already mapped. */
 	if (port->uport_regs.ras)
 		return;
@@ -860,6 +892,9 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
 		dev_err(&port->dev, "Failed to map RAS capability.\n");
 		return;
 	}
+
+	cxl_assign_port_error_handlers(pdev);
+	devm_add_action_or_reset(&port->dev, cxl_clear_port_error_handlers, pdev);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, "CXL");
 
@@ -871,6 +906,8 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
 {
 	struct device *dport_dev = dport->dport_dev;
 	struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
+	struct pci_dev *pdev = to_pci_dev(dport_dev);
+	struct cxl_port *port;
 
 	dport->reg_map.host = dport_dev;
 	if (dport->rch && host_bridge->native_aer) {
@@ -887,6 +924,18 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
 		dev_err(dport_dev, "Failed to map RAS capability.\n");
 		return;
 	}
+
+	if (dport->rch)
+		return;
+
+	port = find_cxl_port(dport_dev, NULL);
+	if (!port) {
+		dev_err(dport_dev, "Failed to find upstream port\n");
+		return;
+	}
+	cxl_assign_port_error_handlers(pdev);
+	devm_add_action_or_reset(&port->dev, cxl_clear_port_error_handlers, pdev);
+	put_device(&port->dev);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* [PATCH v5 16/16] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports
  2025-01-07 14:38 [PATCH v5 0/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (14 preceding siblings ...)
  2025-01-07 14:38 ` [PATCH v5 15/16] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers Terry Bowman
@ 2025-01-07 14:38 ` Terry Bowman
  2025-01-14 23:26   ` Ira Weiny
  15 siblings, 1 reply; 96+ messages in thread
From: Terry Bowman @ 2025-01-07 14:38 UTC (permalink / raw)
  To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

The AER service driver enables PCIe Uncorrectable Internal Errors (UIE) and
Correctable Internal Errors (CIE) for CXL Root Ports. The UIE and CIE are
used in reporting CXL Protocol Errors. The same UIE/CIE enablement is
needed for CXL Upstream Switch Ports and CXL Downstream Switch Ports
in order to notify the associated Root Port and OS.[1]

Export the AER service driver's pci_aer_unmask_internal_errors() function
to CXL namespace.

Remove the function's dependency on the CONFIG_PCIEAER_CXL kernel config
because it is now an exported function.

Call pci_aer_unmask_internal_errors() during RAS initialization in:
cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting().

[1] PCIe Base Spec r6.2-1.0, 6.2.3.2.2 Masking Individual Errors

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/cxl/core/pci.c | 2 ++
 drivers/pci/pcie/aer.c | 5 +++--
 include/linux/aer.h    | 1 +
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 9c162120f0fe..c62329cd9a87 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -895,6 +895,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
 
 	cxl_assign_port_error_handlers(pdev);
 	devm_add_action_or_reset(&port->dev, cxl_clear_port_error_handlers, pdev);
+	pci_aer_unmask_internal_errors(pdev);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, "CXL");
 
@@ -935,6 +936,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
 	}
 	cxl_assign_port_error_handlers(pdev);
 	devm_add_action_or_reset(&port->dev, cxl_clear_port_error_handlers, pdev);
+	pci_aer_unmask_internal_errors(pdev);
 	put_device(&port->dev);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 68e957459008..e6aaa3bd84f0 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -950,7 +950,6 @@ static bool is_internal_error(struct aer_err_info *info)
 	return info->status & PCI_ERR_UNC_INTN;
 }
 
-#ifdef CONFIG_PCIEAER_CXL
 /**
  * pci_aer_unmask_internal_errors - unmask internal errors
  * @dev: pointer to the pcie_dev data structure
@@ -961,7 +960,7 @@ static bool is_internal_error(struct aer_err_info *info)
  * Note: AER must be enabled and supported by the device which must be
  * checked in advance, e.g. with pcie_aer_is_native().
  */
-static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
+void pci_aer_unmask_internal_errors(struct pci_dev *dev)
 {
 	int aer = dev->aer_cap;
 	u32 mask;
@@ -974,7 +973,9 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
 	mask &= ~PCI_ERR_COR_INTERNAL;
 	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
 }
+EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, "CXL");
 
+#ifdef CONFIG_PCIEAER_CXL
 static bool is_cxl_mem_dev(struct pci_dev *dev)
 {
 	/*
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 4b97f38f3fcf..093293f9f12b 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -55,5 +55,6 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 int cper_severity_to_aer(int cper_severity);
 void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
 		       int severity, struct aer_capability_regs *aer_regs);
+void pci_aer_unmask_internal_errors(struct pci_dev *dev);
 #endif //_AER_H_
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 01/16] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver'
  2025-01-07 14:38 ` [PATCH v5 01/16] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver' Terry Bowman
@ 2025-01-13 23:45   ` Ira Weiny
  2025-02-06 17:01   ` Gregory Price
  1 sibling, 0 replies; 96+ messages in thread
From: Ira Weiny @ 2025-01-13 23:45 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Terry Bowman wrote:
> CXL.io provides protocol error handling on top of PCIe Protocol Error
> handling. But, CXL.io and PCIe have different handling requirements
> for uncorrectable errors (UCE).
> 
> The PCIe AER service driver may attempt recovering PCIe devices with
> UCE while recovery is not used for CXL.io. Recovery is not used in the
> CXL.io case because of potential corruption on what can be system memory.
> 
> Create pci_driver::cxl_err_handlers structure similar to
> pci_driver::error_handler. Create handlers for correctable and
> uncorrectable CXL.io error handling.
> 
> The CXL error handlers will be used in future patches adding CXL PCIe
> Port Protocol Error handling.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Fan Ni <fan.ni@samsung.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 02/16] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe Port support
  2025-01-07 14:38 ` [PATCH v5 02/16] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe Port support Terry Bowman
@ 2025-01-13 23:45   ` Ira Weiny
  2025-02-06 17:02   ` Gregory Price
  1 sibling, 0 replies; 96+ messages in thread
From: Ira Weiny @ 2025-01-13 23:45 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Terry Bowman wrote:
> The AER service driver already includes support for Restricted CXL host
> (RCH) Downstream Port Protocol Error handling. The current implementation
> is based on CXL1.1 using a Root Complex Event Collector.
> 
> Rename function interfaces and parameters where necessary to include
> virtual hierarchy (VH) mode CXL PCIe Port error handling alongside the RCH
> handling.[1] The CXL PCIe Port Protocol Error handling support will be
> added in a future patch.
> 
> Limit changes to renaming variable and function names. No functional
> changes are added.
> 
> [1] CXL 3.1 Spec, 9.12.2 CXL Virtual Hierarchy
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Fan Ni <fan.ni@samsung.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 03/16] CXL/PCI: Introduce PCIe helper functions pcie_is_cxl() and pcie_is_cxl_port()
  2025-01-07 14:38 ` [PATCH v5 03/16] CXL/PCI: Introduce PCIe helper functions pcie_is_cxl() and pcie_is_cxl_port() Terry Bowman
@ 2025-01-13 23:49   ` Ira Weiny
  2025-01-14 15:19     ` Bowman, Terry
  0 siblings, 1 reply; 96+ messages in thread
From: Ira Weiny @ 2025-01-13 23:49 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Terry Bowman wrote:
> CXL and AER drivers need the ability to identify CXL devices and CXL port
> devices.
> 
> First, add set_pcie_cxl() with logic checking for CXL Flexbus DVSEC
> presence. The CXL Flexbus DVSEC presence is used because it is required
> for all the CXL PCIe devices.[1]
> 
> Add boolean 'struct pci_dev::is_cxl' with the purpose to cache the CXL
> Flexbus presence.
> 
> Add pcie_is_cxl() as a macro to return 'struct pci_dev::is_cxl'.
> 
> Add pcie_is_cxl_port() to check if a device is a CXL Root Port, CXL
> Upstream Switch Port, or CXL Downstream Switch Port. Also, verify the
> CXL Extensions DVSEC for Ports is present.[1]
> 
> [1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
>     Capability (DVSEC) ID Assignment, Table 8-2
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Fan Ni <fan.ni@samsung.com>
> ---
>  drivers/pci/pci.c             | 13 +++++++++++++
>  drivers/pci/probe.c           | 10 ++++++++++
>  include/linux/pci.h           |  4 ++++
>  include/uapi/linux/pci_regs.h |  3 ++-
>  4 files changed, 29 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 661f98c6c63a..9319c62e3488 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5036,10 +5036,23 @@ static int pci_dev_reset_slot_function(struct pci_dev *dev, bool probe)
>  
>  static u16 cxl_port_dvsec(struct pci_dev *dev)
>  {
> +	if (!pcie_is_cxl(dev))
> +		return 0;
> +
>  	return pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
>  					 PCI_DVSEC_CXL_PORT);
>  }
>  
> +bool pcie_is_cxl_port(struct pci_dev *dev)
> +{
> +	if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM) &&
> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
> +		return false;
> +
> +	return cxl_port_dvsec(dev);

Returning the u16 result of cxl_port_dvsec() from a function declared
bool is odd, and I don't think it should be coded this way.  It is not
wrong right now, but the pcie_is_cxl() check really ought to be coded
here, leaving cxl_port_dvsec() alone: call cxl_port_dvsec(), check
whether the DVSEC exists, and return a bool.
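
A minimal sketch of that shape, assuming nothing beyond what the patch
shows.  The struct and the two helpers at the top are stand-ins so the
fragment compiles on its own; 'pcie_type' and 'port_dvsec' are
hypothetical mock fields, not real 'struct pci_dev' members.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for kernel definitions so the sketch builds on its own. */
#define PCI_EXP_TYPE_ENDPOINT	0x0
#define PCI_EXP_TYPE_ROOT_PORT	0x4
#define PCI_EXP_TYPE_UPSTREAM	0x5
#define PCI_EXP_TYPE_DOWNSTREAM	0x6

struct pci_dev {
	int pcie_type;		/* mock: value pci_pcie_type() would return */
	uint16_t port_dvsec;	/* mock: DVSEC config offset, 0 if absent */
	unsigned int is_cxl:1;
};

static int pci_pcie_type(const struct pci_dev *dev)
{
	return dev->pcie_type;
}

/* Left alone: returns the DVSEC offset, or 0 when it is not present. */
static uint16_t cxl_port_dvsec(struct pci_dev *dev)
{
	return dev->port_dvsec;	/* mock for pci_find_dvsec_capability() */
}

static bool pcie_is_cxl_port(struct pci_dev *dev)
{
	/* The CXL check lives here instead of inside cxl_port_dvsec(). */
	if (!dev->is_cxl)
		return false;

	switch (pci_pcie_type(dev)) {
	case PCI_EXP_TYPE_ROOT_PORT:
	case PCI_EXP_TYPE_UPSTREAM:
	case PCI_EXP_TYPE_DOWNSTREAM:
		break;
	default:
		return false;
	}

	/* Explicit conversion: a non-zero offset means the DVSEC exists. */
	return cxl_port_dvsec(dev) != 0;
}
```

The behaviour is the same as the posted version; only the u16-to-bool
conversion becomes visible at the return statement.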

> +}
> +

[snip]

> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index e2e36f11205c..08350302b3e9 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -452,6 +452,7 @@ struct pci_dev {
>  	unsigned int	is_hotplug_bridge:1;
>  	unsigned int	shpc_managed:1;		/* SHPC owned by shpchp */
>  	unsigned int	is_thunderbolt:1;	/* Thunderbolt controller */
> +	unsigned int	is_cxl:1;               /* Compute Express Link (CXL) */
>  	/*
>  	 * Devices marked being untrusted are the ones that can potentially
>  	 * execute DMA attacks and similar. They are typically connected
> @@ -739,6 +740,9 @@ static inline bool pci_is_vga(struct pci_dev *pdev)
>  	return false;
>  }
>  
> +#define pcie_is_cxl(dev) (dev->is_cxl)

This should be an inline function which takes struct pci_dev * for type
safety.
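
Something along these lines, as a sketch only.  The one-member struct is
a stand-in for the real 'struct pci_dev' so the fragment compiles on its
own.

```c
#include <stdbool.h>

/* Stand-in for the kernel structure; only the new bit is modelled. */
struct pci_dev {
	unsigned int is_cxl:1;
};

/* The compiler now rejects anything that is not a struct pci_dev *. */
static inline bool pcie_is_cxl(struct pci_dev *pci_dev)
{
	return pci_dev->is_cxl;
}
```

The macro form accepts any expression whose result happens to have an
'is_cxl' member, and because 'dev' is not parenthesised in the expansion
an argument such as 'p + 1' would expand incorrectly.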

Ira

[snip]

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 04/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type
  2025-01-07 14:38 ` [PATCH v5 04/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
@ 2025-01-13 23:51   ` Ira Weiny
  2025-02-06 18:18   ` Gregory Price
  1 sibling, 0 replies; 96+ messages in thread
From: Ira Weiny @ 2025-01-13 23:51 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Terry Bowman wrote:
> The AER driver and aer_event tracing currently log 'PCIe Bus Type'
> for all errors.
> 
> Update the driver and aer_event tracing to log 'CXL Bus Type' for CXL
> device errors.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Fan Ni <fan.ni@samsung.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver
  2025-01-07 14:38 ` [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver Terry Bowman
@ 2025-01-14  6:54   ` Li Ming
  2025-01-14 11:20     ` Jonathan Cameron
  2025-01-14 19:29     ` Bowman, Terry
  2025-01-14 16:35   ` Ira Weiny
  2025-02-06 18:33   ` Gregory Price
  2 siblings, 2 replies; 96+ messages in thread
From: Li Ming @ 2025-01-14  6:54 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas,
	PradeepVineshReddy.Kodamati, alucerop

On 1/7/2025 10:38 PM, Terry Bowman wrote:
> The AER service driver supports handling Downstream Port Protocol Errors in
> Restricted CXL host (RCH) mode also known as CXL1.1. It needs the same
> functionality for CXL PCIe Ports operating in Virtual Hierarchy (VH)
> mode.[1]
>
> CXL and PCIe Protocol Error handling have different requirements that
> necessitate a separate handling path. The AER service driver may try to
> recover PCIe uncorrectable non-fatal errors (UCE). The same recovery is not
> suitable for CXL PCIe Port devices because of potential for system memory
> corruption. Instead, CXL Protocol Error handling must use a kernel panic
> in the case of a fatal or non-fatal UCE. The AER driver's PCIe Protocol
> Error handling does not panic the kernel in response to a UCE.
>
> Introduce a separate path for CXL Protocol Error handling in the AER
> service driver. This will allow CXL Protocol Errors to use CXL specific
> handling instead of PCIe handling. Add the CXL specific changes without
> affecting or adding functionality in the PCIe handling.
>
> Make this update alongside the existing Downstream Port RCH error handling
> logic, extending support to CXL PCIe Ports in VH mode.
>
> is_internal_error() is currently limited by CONFIG_PCIEAER_CXL kernel
> config. Update is_internal_error()'s function declaration such that it is
> always available regardless if CONFIG_PCIEAER_CXL kernel config is enabled
> or disabled.
>
> The uncorrectable error (UCE) handling will be added in a future patch.
>
> [1] CXL 3.1 Spec, 12.2.2 CXL Root Ports, Downstream Switch Ports, and
> Upstream Switch Ports
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
>  drivers/pci/pcie/aer.c | 61 +++++++++++++++++++++++++++---------------
>  1 file changed, 40 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index f8b3350fcbb4..62be599e3bee 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -942,8 +942,15 @@ static bool find_source_device(struct pci_dev *parent,
>  	return true;
>  }
>  
> -#ifdef CONFIG_PCIEAER_CXL
> +static bool is_internal_error(struct aer_err_info *info)
> +{
> +	if (info->severity == AER_CORRECTABLE)
> +		return info->status & PCI_ERR_COR_INTERNAL;
>  
> +	return info->status & PCI_ERR_UNC_INTN;
> +}
> +
> +#ifdef CONFIG_PCIEAER_CXL
>  /**
>   * pci_aer_unmask_internal_errors - unmask internal errors
>   * @dev: pointer to the pcie_dev data structure
> @@ -995,14 +1002,6 @@ static bool cxl_error_is_native(struct pci_dev *dev)
>  	return (pcie_ports_native || host->native_aer);
>  }
>  
> -static bool is_internal_error(struct aer_err_info *info)
> -{
> -	if (info->severity == AER_CORRECTABLE)
> -		return info->status & PCI_ERR_COR_INTERNAL;
> -
> -	return info->status & PCI_ERR_UNC_INTN;
> -}
> -
>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>  {
>  	struct aer_err_info *info = (struct aer_err_info *)data;
> @@ -1034,14 +1033,23 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>  
>  static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  {
> -	/*
> -	 * Internal errors of an RCEC indicate an AER error in an
> -	 * RCH's downstream port. Check and handle them in the CXL.mem
> -	 * device driver.
> -	 */
> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> -	    is_internal_error(info))
> -		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
> +		return pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
> +
> +	if (info->severity == AER_CORRECTABLE) {
> +		struct pci_driver *pdrv = dev->driver;
> +		int aer = dev->aer_cap;
> +
> +		if (aer)
> +			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
> +					       info->status);
> +
> +		if (pdrv && pdrv->cxl_err_handler &&
> +		    pdrv->cxl_err_handler->cor_error_detected)
> +			pdrv->cxl_err_handler->cor_error_detected(dev);
> +
> +		pcie_clear_device_status(dev);
> +	}
>  }
>  
>  static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
> @@ -1059,9 +1067,13 @@ static bool handles_cxl_errors(struct pci_dev *dev)
>  {
>  	bool handles_cxl = false;
>  
> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> -	    pcie_aer_is_native(dev))
> +	if (!pcie_aer_is_native(dev))
> +		return false;
> +
> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
>  		pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
> +	else
> +		handles_cxl = pcie_is_cxl_port(dev);

My understanding is that a CXL RP/USP/DSP operating in PCIe mode may
still expose DVSEC ID 3 (CXL r3.1 section 9.12.3). In that case, the AER
handler should be pci_aer_handle_error() rather than cxl_handle_error().

pcie_is_cxl_port() only checks whether DVSEC ID 3 is present, but I think
it should also check that the port is operating in CXL mode. Does that
make sense?
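
One possible shape for such a mode check, as a sketch under stated
assumptions: the port's negotiated mode is taken from the Flex Bus Port
Status register, with Cache_Enabled, IO_Enabled and Mem_Enabled assumed
to be bits 0, 1 and 2. The bit positions, the macro names and
flexbus_link_is_cxl() are all hypothetical here and would need to be
verified against the spec and the existing register definitions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout of the Flex Bus Port Status register; verify vs. spec. */
#define FLEXBUS_STATUS_CACHE_ENABLED	(1u << 0)
#define FLEXBUS_STATUS_IO_ENABLED	(1u << 1)
#define FLEXBUS_STATUS_MEM_ENABLED	(1u << 2)

/*
 * Hypothetical helper: treat the link as operating in CXL mode only if
 * CXL.cache or CXL.mem was enabled at link training. A port that exposes
 * the DVSEC but trained as plain PCIe reports neither bit.
 */
static bool flexbus_link_is_cxl(uint16_t port_status)
{
	return (port_status & (FLEXBUS_STATUS_CACHE_ENABLED |
			       FLEXBUS_STATUS_MEM_ENABLED)) != 0;
}
```

Whether IO_Enabled alone should also count as CXL mode is a policy
choice this sketch does not settle; it returns false for that case.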


Ming

>  
>  	return handles_cxl;
>  }
> @@ -1079,6 +1091,10 @@ static void cxl_enable_internal_errors(struct pci_dev *dev)
>  static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
>  static inline void cxl_handle_error(struct pci_dev *dev,
>  				    struct aer_err_info *info) { }
> +static bool handles_cxl_errors(struct pci_dev *dev)
> +{
> +	return false;
> +}
>  #endif
>  
>  /**
> @@ -1116,8 +1132,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  
>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>  {
> -	cxl_handle_error(dev, info);
> -	pci_aer_handle_error(dev, info);
> +	if (is_internal_error(info) && handles_cxl_errors(dev))
> +		cxl_handle_error(dev, info);
> +	else
> +		pci_aer_handle_error(dev, info);
> +
>  	pci_dev_put(dev);
>  }
>  



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver
  2025-01-14  6:54   ` Li Ming
@ 2025-01-14 11:20     ` Jonathan Cameron
  2025-01-14 20:10       ` Bowman, Terry
  2025-01-14 19:29     ` Bowman, Terry
  1 sibling, 1 reply; 96+ messages in thread
From: Jonathan Cameron @ 2025-01-14 11:20 UTC (permalink / raw)
  To: Li Ming
  Cc: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	dave.jiang, alison.schofield, vishal.l.verma, dan.j.williams,
	bhelgaas, mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
	PradeepVineshReddy.Kodamati, alucerop

On Tue, 14 Jan 2025 14:54:23 +0800
Li Ming <ming.li@zohomail.com> wrote:

> On 1/7/2025 10:38 PM, Terry Bowman wrote:
> > The AER service driver supports handling Downstream Port Protocol Errors in
> > Restricted CXL host (RCH) mode also known as CXL1.1. It needs the same
> > functionality for CXL PCIe Ports operating in Virtual Hierarchy (VH)
> > mode.[1]
> >
> > CXL and PCIe Protocol Error handling have different requirements that
> > necessitate a separate handling path. The AER service driver may try to
> > recover PCIe uncorrectable non-fatal errors (UCE). The same recovery is not
> > suitable for CXL PCIe Port devices because of potential for system memory
> > corruption. Instead, CXL Protocol Error handling must use a kernel panic
> > in the case of a fatal or non-fatal UCE. The AER driver's PCIe Protocol
> > Error handling does not panic the kernel in response to a UCE.
> >
> > Introduce a separate path for CXL Protocol Error handling in the AER
> > service driver. This will allow CXL Protocol Errors to use CXL specific
> > handling instead of PCIe handling. Add the CXL specific changes without
> > affecting or adding functionality in the PCIe handling.
> >
> > Make this update alongside the existing Downstream Port RCH error handling
> > logic, extending support to CXL PCIe Ports in VH mode.
> >
> > is_internal_error() is currently limited by CONFIG_PCIEAER_CXL kernel
> > config. Update is_internal_error()'s function declaration such that it is
> > always available regardless if CONFIG_PCIEAER_CXL kernel config is enabled
> > or disabled.
> >
> > The uncorrectable error (UCE) handling will be added in a future patch.
> >
> > [1] CXL 3.1 Spec, 12.2.2 CXL Root Ports, Downstream Switch Ports, and
> > Upstream Switch Ports
> >
> > Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> > ---
> >  drivers/pci/pcie/aer.c | 61 +++++++++++++++++++++++++++---------------
> >  1 file changed, 40 insertions(+), 21 deletions(-)
> >
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index f8b3350fcbb4..62be599e3bee 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -942,8 +942,15 @@ static bool find_source_device(struct pci_dev *parent,
> >  	return true;
> >  }
> >  
> > -#ifdef CONFIG_PCIEAER_CXL
> > +static bool is_internal_error(struct aer_err_info *info)
> > +{
> > +	if (info->severity == AER_CORRECTABLE)
> > +		return info->status & PCI_ERR_COR_INTERNAL;
> >  
> > +	return info->status & PCI_ERR_UNC_INTN;
> > +}
> > +
> > +#ifdef CONFIG_PCIEAER_CXL
> >  /**
> >   * pci_aer_unmask_internal_errors - unmask internal errors
> >   * @dev: pointer to the pcie_dev data structure
> > @@ -995,14 +1002,6 @@ static bool cxl_error_is_native(struct pci_dev *dev)
> >  	return (pcie_ports_native || host->native_aer);
> >  }
> >  
> > -static bool is_internal_error(struct aer_err_info *info)
> > -{
> > -	if (info->severity == AER_CORRECTABLE)
> > -		return info->status & PCI_ERR_COR_INTERNAL;
> > -
> > -	return info->status & PCI_ERR_UNC_INTN;
> > -}
> > -
> >  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
> >  {
> >  	struct aer_err_info *info = (struct aer_err_info *)data;
> > @@ -1034,14 +1033,23 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
> >  
> >  static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
> >  {
> > -	/*
> > -	 * Internal errors of an RCEC indicate an AER error in an
> > -	 * RCH's downstream port. Check and handle them in the CXL.mem
> > -	 * device driver.
> > -	 */
> > -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> > -	    is_internal_error(info))
> > -		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
> > +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
> > +		return pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
> > +
> > +	if (info->severity == AER_CORRECTABLE) {
> > +		struct pci_driver *pdrv = dev->driver;
> > +		int aer = dev->aer_cap;
> > +
> > +		if (aer)
> > +			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
> > +					       info->status);
> > +
> > +		if (pdrv && pdrv->cxl_err_handler &&
> > +		    pdrv->cxl_err_handler->cor_error_detected)
> > +			pdrv->cxl_err_handler->cor_error_detected(dev);
> > +
> > +		pcie_clear_device_status(dev);
> > +	}
> >  }
> >  
> >  static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
> > @@ -1059,9 +1067,13 @@ static bool handles_cxl_errors(struct pci_dev *dev)
> >  {
> >  	bool handles_cxl = false;
> >  
> > -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> > -	    pcie_aer_is_native(dev))
> > +	if (!pcie_aer_is_native(dev))
> > +		return false;
> > +
> > +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
> >  		pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
> > +	else
> > +		handles_cxl = pcie_is_cxl_port(dev);  
> 
> My understanding is that a CXL RP/USP/DSP operating in PCIe mode may
> still expose DVSEC ID 3 (CXL r3.1 section 9.12.3). In that case, the AER
> handler should be pci_aer_handle_error() rather than cxl_handle_error().
> 
> pcie_is_cxl_port() only checks whether DVSEC ID 3 is present, but I think
> it should also check that the port is operating in CXL mode. Does that
> make sense?
> 
> 
Good spot.

Agreed a check on the mode makes sense.

Jonathan

> Ming
> 
> >  
> >  	return handles_cxl;
> >  }
> > @@ -1079,6 +1091,10 @@ static void cxl_enable_internal_errors(struct pci_dev *dev)
> >  static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
> >  static inline void cxl_handle_error(struct pci_dev *dev,
> >  				    struct aer_err_info *info) { }
> > +static bool handles_cxl_errors(struct pci_dev *dev)
> > +{
> > +	return false;
> > +}
> >  #endif
> >  
> >  /**
> > @@ -1116,8 +1132,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
> >  
> >  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
> >  {
> > -	cxl_handle_error(dev, info);
> > -	pci_aer_handle_error(dev, info);
> > +	if (is_internal_error(info) && handles_cxl_errors(dev))
> > +		cxl_handle_error(dev, info);
> > +	else
> > +		pci_aer_handle_error(dev, info);
> > +
> >  	pci_dev_put(dev);
> >  }
> >    
> 
> 
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 06/16] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe Port devices
  2025-01-07 14:38 ` [PATCH v5 06/16] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe Port devices Terry Bowman
@ 2025-01-14 11:32   ` Jonathan Cameron
  2025-01-14 20:44     ` Bowman, Terry
  2025-01-28 20:25     ` Bowman, Terry
  2025-01-14 16:57   ` Ira Weiny
  1 sibling, 2 replies; 96+ messages in thread
From: Jonathan Cameron @ 2025-01-14 11:32 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop, Shuai Xue

On Tue, 7 Jan 2025 08:38:42 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER service driver's aer_get_device_error_info() function doesn't read
> uncorrectable (UCE) fatal error status from PCIe Upstream Port devices,
> including CXL Upstream Switch Ports. As a result, fatal errors are not
> logged or handled as needed for CXL PCIe Upstream Switch Port devices.
> 
> Update the aer_get_device_error_info() function to read the UCE fatal
> status for all CXL PCIe devices. Make the change such that non-CXL devices
> are not affected.
> 
> The fatal error status will be used in future patches implementing
> CXL PCIe Port uncorrectable error handling and logging.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

This clashes with Shuai's series adding link healthy checks.
Maybe we can reuse that logic to incorporate the condition we
care about here?


> ---
>  drivers/pci/pcie/aer.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 62be599e3bee..79c828bdcb6d 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1253,7 +1253,8 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
>  	} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
>  		   type == PCI_EXP_TYPE_RC_EC ||
>  		   type == PCI_EXP_TYPE_DOWNSTREAM ||
> -		   info->severity == AER_NONFATAL) {
> +		   info->severity == AER_NONFATAL ||
> +		   (pcie_is_cxl(dev) && type == PCI_EXP_TYPE_UPSTREAM)) {
>  
>  		/* Link is still healthy for IO reads */
>  		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 07/16] PCI/AER: Add CXL PCIe Port uncorrectable error recovery in AER service driver
  2025-01-07 14:38 ` [PATCH v5 07/16] PCI/AER: Add CXL PCIe Port uncorrectable error recovery in AER service driver Terry Bowman
@ 2025-01-14 11:33   ` Jonathan Cameron
  2025-01-14 20:28     ` Bowman, Terry
  2025-01-14 17:27   ` Ira Weiny
  1 sibling, 1 reply; 96+ messages in thread
From: Jonathan Cameron @ 2025-01-14 11:33 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Tue, 7 Jan 2025 08:38:43 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> Existing recovery procedure for PCIe uncorrectable errors (UCE) does not
> apply to CXL devices. Recovery can not be used for CXL devices because of
> potential corruption on what can be system memory. Also, current PCIe UCE
> recovery, in the case of a Root Port (RP) or Downstream Switch Port (DSP),
> does not begin at the RP/DSP but begins at the first downstream device.
> This will miss handling CXL Protocol Errors in a CXL RP or DSP. A separate
> CXL recovery is needed because of the different handling requirements
> 
> Add a new function, cxl_do_recovery() using the following.
> 
> Add cxl_walk_bridge() to iterate the detected error's sub-topology.
> cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
> will begin iteration at the RP or DSP rather than beginning at the
> first downstream device.

I'm still holding out for making pci_walk_bridge() do the same and seeing
what if anything breaks.

Other than that I'm fine with this patch.

> 
> Add cxl_report_error_detected() as an analog to report_error_detected().
> It will call pci_driver::cxl_err_handlers for each iterated downstream
> device. The pci_driver::cxl_err_handler's UCE handler returns a boolean
> indicating if there was a UCE error detected during handling.
> 
> cxl_do_recovery() uses the status from cxl_report_error_detected() to
> determine how to proceed. Non-fatal CXL UCE errors will be treated as
> fatal. If a UCE was present during handling then cxl_do_recovery()
> will kernel panic.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 09/16] cxl/pci: Map CXL PCIe Upstream Switch Port RAS registers
  2025-01-07 14:38 ` [PATCH v5 09/16] cxl/pci: Map CXL PCIe Upstream " Terry Bowman
@ 2025-01-14 11:35   ` Jonathan Cameron
  2025-01-14 15:24     ` Bowman, Terry
  2025-01-14 22:02   ` Ira Weiny
  2025-02-07  7:35   ` Gregory Price
  2 siblings, 1 reply; 96+ messages in thread
From: Jonathan Cameron @ 2025-01-14 11:35 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Tue, 7 Jan 2025 08:38:45 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> Add logic to map CXL PCIe Upstream Switch Port (USP) RAS registers.
> 
> Introduce 'struct cxl_regs' member into 'struct cxl_port' to cache a
> pointer to the CXL Upstream Port's mapped RAS registers.
> 
> Also, introduce cxl_uport_init_ras_reporting() to perform the USP RAS
> register mapping. This is similar to the existing
> cxl_dport_init_ras_reporting() but for USP devices.
> 
> The USP may have multiple downstream endpoints. Before mapping AER
> registers check if the registers are already mapped.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/core/pci.c | 15 +++++++++++++++
>  drivers/cxl/cxl.h      |  4 ++++
>  drivers/cxl/mem.c      |  8 ++++++++
>  3 files changed, 27 insertions(+)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 1af2d0a14f5d..97e6a15bea88 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -773,6 +773,21 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>  	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
>  }
>  
> +void cxl_uport_init_ras_reporting(struct cxl_port *port)
> +{
> +	/* uport may have more than 1 downstream EP. Check if already mapped. */

Is it worth a lockdep check in here on whatever lock is stopping this racing?

> +	if (port->uport_regs.ras)
> +		return;
> +
> +	port->reg_map.host = &port->dev;
> +	if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
> +				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
> +		dev_err(&port->dev, "Failed to map RAS capability.\n");
> +		return;
> +	}
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, "CXL");
> +
>  /**
>   * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
>   * @dport: the cxl_dport that needs to be initialized
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 727429dfdaed..c51735fe75d6 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -601,6 +601,7 @@ struct cxl_dax_region {
>   * @parent_dport: dport that points to this port in the parent
>   * @decoder_ida: allocator for decoder ids
>   * @reg_map: component and ras register mapping parameters
> + * @uport_regs: mapped component registers
>   * @nr_dports: number of entries in @dports
>   * @hdm_end: track last allocated HDM decoder instance for allocation ordering
>   * @commit_end: cursor to track highest committed decoder for commit ordering
> @@ -621,6 +622,7 @@ struct cxl_port {
>  	struct cxl_dport *parent_dport;
>  	struct ida decoder_ida;
>  	struct cxl_register_map reg_map;
> +	struct cxl_component_regs uport_regs;
>  	int nr_dports;
>  	int hdm_end;
>  	int commit_end;
> @@ -773,8 +775,10 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
>  
>  #ifdef CONFIG_PCIEAER_CXL
>  void cxl_dport_init_ras_reporting(struct cxl_dport *dport);
> +void cxl_uport_init_ras_reporting(struct cxl_port *port);
>  #else
>  static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport) { }
> +static inline void cxl_uport_init_ras_reporting(struct cxl_port *port) { }
>  #endif
>  
>  struct cxl_decoder *to_cxl_decoder(struct device *dev);
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index dd39f4565be2..97dbca765f4d 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -60,6 +60,7 @@ static bool dev_is_cxl_pci(struct device *dev, u32 pcie_type)
>  static void cxl_init_ep_ports_aer(struct cxl_ep *ep)
>  {
>  	struct cxl_dport *dport = ep->dport;
> +	struct cxl_port *port = ep->next;
>  
>  	if (dport) {
>  		struct device *dport_dev = dport->dport_dev;
> @@ -68,6 +69,13 @@ static void cxl_init_ep_ports_aer(struct cxl_ep *ep)
>  		    dev_is_cxl_pci(dport_dev, PCI_EXP_TYPE_ROOT_PORT))
>  			cxl_dport_init_ras_reporting(dport);
>  	}
> +
> +	if (port) {
> +		struct device *uport_dev = port->uport_dev;
> +
> +		if (dev_is_cxl_pci(uport_dev, PCI_EXP_TYPE_UPSTREAM))
> +			cxl_uport_init_ras_reporting(port);
> +	}
>  }
>  
>  static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 10/16] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports
  2025-01-07 14:38 ` [PATCH v5 10/16] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports Terry Bowman
@ 2025-01-14 11:39   ` Jonathan Cameron
  2025-01-14 22:20   ` Ira Weiny
  2025-02-07  7:38   ` Gregory Price
  2 siblings, 0 replies; 96+ messages in thread
From: Jonathan Cameron @ 2025-01-14 11:39 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Tue, 7 Jan 2025 08:38:46 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> CXL PCIe Port Protocol Error handling support will be added to the
> CXL drivers in the future. In preparation, rename the existing
> interfaces to support handling all CXL PCIe Port Protocol Errors.
> 
> The driver's RAS support functions currently rely on a 'struct
> cxl_dev_state' type parameter, which is not available for CXL Port
> devices. However, since the same CXL RAS capability structure is
> needed across most CXL components and devices, a common handling
> approach should be adopted.
> 
> To accommodate this, update the __cxl_handle_cor_ras() and
> __cxl_handle_ras() functions to use a `struct device` instead of
> `struct cxl_dev_state`.
> 
> No functional changes are introduced.
> 
> [1] CXL 3.1 Spec, 8.2.4 CXL.cache and CXL.mem Registers
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Alejandro Lucero <alucerop@amd.com>
A few protections you introduce in later patches should be here
to make the change to dev safer.

Otherwise looks fine.

Jonathan

> ---
>  drivers/cxl/core/pci.c | 17 ++++++++---------
>  1 file changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 97e6a15bea88..5699ee5b29df 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -650,7 +650,7 @@ void read_cdat_data(struct cxl_port *port)
>  }
>  EXPORT_SYMBOL_NS_GPL(read_cdat_data, "CXL");
>  
> -static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
> +static void __cxl_handle_cor_ras(struct device *dev,
>  				 void __iomem *ras_base)
>  {
>  	void __iomem *addr;
> @@ -663,13 +663,13 @@ static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
>  	status = readl(addr);
>  	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
>  		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> -		trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
> +		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);

I'd expect the protection on the dev being a memdev to be in this patch
given the relaxation on the interface.  Pull that bit forward from patch 14.
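
To illustrate the kind of protection meant, here is a self-contained
model of checking the device type before converting to the containing
structure. The structs are simplified stand-ins, and
to_cxl_memdev_checked() is a hypothetical name used only for this
sketch; how patch 14 actually does it is not reproduced here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the driver-model and CXL structures. */
struct device_type {
	const char *name;
};

struct device {
	const struct device_type *type;
};

struct cxl_memdev {
	int id;
	struct device dev;
};

static const struct device_type cxl_memdev_type = { .name = "cxl_memdev" };

static bool is_cxl_memdev(const struct device *dev)
{
	return dev->type == &cxl_memdev_type;
}

/*
 * Hypothetical checked conversion: return NULL instead of a bogus
 * pointer when dev is not embedded in a struct cxl_memdev.
 */
static struct cxl_memdev *to_cxl_memdev_checked(struct device *dev)
{
	if (!dev || !is_cxl_memdev(dev))
		return NULL;

	/* Open-coded container_of(dev, struct cxl_memdev, dev). */
	return (struct cxl_memdev *)((char *)dev -
				     offsetof(struct cxl_memdev, dev));
}
```

Once the handlers accept any 'struct device *', an unchecked
container_of() on a port device would yield a pointer into unrelated
memory, which is why the check belongs with the interface change.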


>  	}
>  }
>  
>  static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
>  {
> -	return __cxl_handle_cor_ras(cxlds, cxlds->regs.ras);
> +	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
>  }
>  
>  /* CXL spec rev3.0 8.2.4.16.1 */
> @@ -693,8 +693,7 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
>   * Log the state of the RAS status registers and prepare them to log the
>   * next error status. Return 1 if reset needed.
>   */
> -static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
> -				  void __iomem *ras_base)
> +static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
>  {
>  	u32 hl[CXL_HEADERLOG_SIZE_U32];
>  	void __iomem *addr;
> @@ -721,7 +720,7 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
>  	}
>  
>  	header_log_copy(ras_base, hl);
> -	trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe, hl);
> +	trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
>  	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);

As above.  If the interface takes a dev we need to check it is the right
type of dev.

>  
>  	return true;
> @@ -729,7 +728,7 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
>  
>  static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
>  {
> -	return __cxl_handle_ras(cxlds, cxlds->regs.ras);
> +	return __cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
>  }
>  
>  #ifdef CONFIG_PCIEAER_CXL
> @@ -818,13 +817,13 @@ EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
>  static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
>  					  struct cxl_dport *dport)
>  {
> -	return __cxl_handle_cor_ras(cxlds, dport->regs.ras);
> +	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, dport->regs.ras);
>  }
>  
>  static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
>  				       struct cxl_dport *dport)
>  {
> -	return __cxl_handle_ras(cxlds, dport->regs.ras);
> +	return __cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
>  }
>  
>  /*


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 11/16] cxl/pci: Add log message for umnapped registers in existing RAS handlers
  2025-01-07 14:38 ` [PATCH v5 11/16] cxl/pci: Add log message for umnapped registers in existing RAS handlers Terry Bowman
@ 2025-01-14 11:41   ` Jonathan Cameron
  2025-01-14 22:21   ` Ira Weiny
  2025-02-07  7:39   ` Gregory Price
  2 siblings, 0 replies; 96+ messages in thread
From: Jonathan Cameron @ 2025-01-14 11:41 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Tue, 7 Jan 2025 08:38:47 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> The CXL RAS handlers do not currently log if the RAS registers are
> unmapped. This is needed in order to help debug CXL error handling. Update
> the CXL driver to log a warning message if the RAS register block is
> unmapped.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> ---
>  drivers/cxl/core/pci.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 5699ee5b29df..8275b3dc3589 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -656,8 +656,10 @@ static void __cxl_handle_cor_ras(struct device *dev,
>  	void __iomem *addr;
>  	u32 status;
>  
> -	if (!ras_base)
> +	if (!ras_base) {
> +		dev_warn_once(dev, "CXL RAS register block is not mapped");
>  		return;
> +	}
>  
>  	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>  	status = readl(addr);
> @@ -700,8 +702,10 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
>  	u32 status;
>  	u32 fe;
>  
> -	if (!ras_base)
> +	if (!ras_base) {
> +		dev_warn_once(dev, "CXL RAS register block is not mapped");
>  		return false;
> +	}
>  
>  	addr = ras_base + CXL_RAS_UNCORRECTABLE_STATUS_OFFSET;
>  	status = readl(addr);


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 13/16] cxl/pci: Add error handler for CXL PCIe Port RAS errors
  2025-01-07 14:38 ` [PATCH v5 13/16] cxl/pci: Add error handler for CXL PCIe Port RAS errors Terry Bowman
@ 2025-01-14 11:46   ` Jonathan Cameron
  2025-01-14 21:20     ` Bowman, Terry
  2025-01-14 22:51   ` Ira Weiny
  2025-02-07  8:01   ` Gregory Price
  2 siblings, 1 reply; 96+ messages in thread
From: Jonathan Cameron @ 2025-01-14 11:46 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Tue, 7 Jan 2025 08:38:49 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> Introduce correctable and uncorrectable CXL PCIe Port Protocol Error
> handlers.
> 
> The handlers will be called with a 'struct pci_dev' parameter
> indicating the CXL Port device requiring handling. The CXL PCIe Port
> device's underlying 'struct device' will match the port device in the
> CXL topology.
> 
> Use the PCIe Port's device object to find the matching CXL Upstream Switch
> Port, CXL Downstream Switch Port, or CXL Root Port in the CXL topology. The
> matching CXL Port device should contain a cached reference to the RAS
> register block. The cached RAS block will be used when handling the error.
> 
> Invoke the existing __cxl_handle_ras() or __cxl_handle_cor_ras() using
> a reference to the RAS registers as a parameter. These functions will use
> the RAS register reference to indicate an error and clear the device's RAS
> status.
> 
> Future patches will assign the error handlers and add trace logging.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/core/pci.c | 63 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 63 insertions(+)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 8275b3dc3589..411834f7efe0 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -776,6 +776,69 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>  	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
>  }
>  
> +static int match_uport(struct device *dev, const void *data)
> +{
> +	struct device *uport_dev = (struct device *)data;

It should be const and then no need to cast explicitly.
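
A small sketch of the point, assuming simplified stand-in types (the real
callback takes a struct device * as its first argument and checks
is_cxl_port() first). Because the cookie arrives as const void *, assigning it
to a const-qualified pointer needs no cast in C.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins only; these are not the kernel's definitions. */
struct device {
	int id;
};

struct port {
	const struct device *uport_dev;
};

/*
 * Match callback in the bus_find_device() style: the caller's cookie
 * arrives as const void *, which converts to a const-qualified pointer
 * implicitly in C, so no cast is written.
 */
static int match_uport(const struct port *port, const void *data)
{
	const struct device *uport_dev = data;

	assert(port != NULL);

	return port->uport_dev == uport_dev;
}
```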


> +	struct cxl_port *port;
> +
> +	if (!is_cxl_port(dev))
> +		return 0;
> +
> +	port = to_cxl_port(dev);
> +
> +	return port->uport_dev == uport_dev;
> +}
> +
> +static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
> +{
> +	struct cxl_port *port;
> +
> +	if (!pdev)
> +		return NULL;
> +
> +	if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
> +	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
> +		struct cxl_dport *dport;
> +		void __iomem *ras_base;
> +
> +		port = find_cxl_port(&pdev->dev, &dport);
Maybe use some __free magic on port, as then this can just be
return dport ? dport->regs.ras : NULL;
> +		ras_base = dport ? dport->regs.ras : NULL;
> +		if (port)
> +			put_device(&port->dev);
> +		return ras_base;
> +	} else if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {

	if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {

or maybe just make it a switch statement?
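
A sketch of the switch form, assuming a hypothetical enum in place of the
pci_pcie_type() values and a flattened struct in place of the dport/port
lookups. It only illustrates the control flow, not the reference handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical port kinds standing in for the pci_pcie_type() values. */
enum port_type {
	TYPE_ENDPOINT,
	TYPE_ROOT_PORT,
	TYPE_UPSTREAM,
	TYPE_DOWNSTREAM,
};

struct port_regs {
	enum port_type type;
	void *dport_ras;	/* used by root and downstream ports */
	void *uport_ras;	/* used by upstream ports */
};

/* One case label per port kind; anything else has no RAS block here. */
static void *port_ras(const struct port_regs *p)
{
	assert(p != NULL);

	switch (p->type) {
	case TYPE_ROOT_PORT:
	case TYPE_DOWNSTREAM:
		return p->dport_ras;
	case TYPE_UPSTREAM:
		return p->uport_ras;
	default:
		return NULL;
	}
}
```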

> +		struct device *port_dev;
> +
> +		port_dev = bus_find_device(&cxl_bus_type, NULL, &pdev->dev,
> +					   match_uport);
Likewise, __free magic here would automate the put.

> +		if (!port_dev)
> +			return NULL;
> +
> +		port = to_cxl_port(port_dev);
> +		if (!port)

Why is there no put of the port_dev on this path?
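
The kernel's __free() helper is built on the compiler's cleanup attribute. The
sketch below uses that attribute directly (a GCC/Clang extension) with a toy
reference counter, to show how an early return stops leaking the reference.
The struct, the counter and the AUTO_PUT macro are assumptions made for
illustration; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in with a plain counter instead of a kref. */
struct device {
	int refcount;
	void *ras;
};

static struct device *get_device(struct device *dev)
{
	if (dev)
		dev->refcount++;
	return dev;
}

static void put_device(struct device *dev)
{
	if (!dev)
		return;

	assert(dev->refcount > 0);	/* catch an unbalanced put */
	dev->refcount--;
}

/* The cleanup attribute passes a pointer to the annotated variable. */
static void auto_put_device(struct device **devp)
{
	put_device(*devp);
}

#define AUTO_PUT __attribute__((cleanup(auto_put_device)))

/* Every return path drops the reference taken by get_device(). */
static void *lookup_ras(struct device *candidate)
{
	struct device *dev AUTO_PUT = get_device(candidate);

	if (!dev)
		return NULL;

	if (!dev->ras)
		return NULL;	/* early exit, reference still released */

	return dev->ras;
}
```

The return value is evaluated before the cleanup runs, so the function can
return a field of the object without an explicit put on each path.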

> +			return NULL;
> +
> +		put_device(port_dev);
> +		return port->uport_regs.ras;
> +	}
> +
> +	return NULL;
> +}


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 14/16] cxl/pci: Add trace logging for CXL PCIe Port RAS errors
  2025-01-07 14:38 ` [PATCH v5 14/16] cxl/pci: Add trace logging " Terry Bowman
@ 2025-01-14 11:49   ` Jonathan Cameron
  2025-01-14 20:56     ` Bowman, Terry
  2025-01-14 22:58   ` Ira Weiny
  1 sibling, 1 reply; 96+ messages in thread
From: Jonathan Cameron @ 2025-01-14 11:49 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Tue, 7 Jan 2025 08:38:50 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> The CXL drivers use kernel trace functions for logging endpoint and
> Restricted CXL host (RCH) Downstream Port RAS errors. Similar functionality
> is required for CXL Root Ports, CXL Downstream Switch Ports, and CXL
> Upstream Switch Ports.
> 
> Introduce trace logging functions for both RAS correctable and
> uncorrectable errors specific to CXL PCIe Ports. Additionally, update
> the CXL Port Protocol Error handlers to invoke these new trace functions.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Alejandro Lucero <alucerop@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
An example print in the commit message would help readers understand what the
tracepoints look like.

A few more things inline, following on from earlier comments.

Jonathan
> ---
>  drivers/cxl/core/pci.c   | 17 +++++++++++----
>  drivers/cxl/core/trace.h | 47 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 60 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 411834f7efe0..3e87fe54a1a2 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -663,10 +663,15 @@ static void __cxl_handle_cor_ras(struct device *dev,
>  
>  	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>  	status = readl(addr);
> -	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
> -		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> +	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
> +		return;
> +
> +	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> +
> +	if (is_cxl_memdev(dev))
As below. Drag to earlier patch.
>  		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
> -	}
> +	else

And perhaps check that it's a port, mostly for documentation purposes.


> +		trace_cxl_port_aer_correctable_error(dev, status);
>  }
>  
>  static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
> @@ -724,7 +729,11 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
>  	}
>  
>  	header_log_copy(ras_base, hl);
> -	trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
> +	if (is_cxl_memdev(dev))

As mentioned above, drag this if to the earlier patch.

> +		trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
> +	else

For documentation purposes mostly I'd be tempted to have an is_cxl_port() check
before calling the following.
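
A sketch of the explicit three-way dispatch, assuming a hypothetical kind tag
on the device and counters in place of the real tracepoints. The final else
branch is where an unexpected device type would be caught instead of being
silently traced as a port.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device kinds and counters standing in for tracepoints. */
enum dev_kind { KIND_MEMDEV, KIND_PORT, KIND_OTHER };

struct device {
	enum dev_kind kind;
};

struct trace_counts {
	int memdev_events;
	int port_events;
	int unexpected;
};

static bool is_memdev(const struct device *dev)
{
	return dev->kind == KIND_MEMDEV;
}

static bool is_port(const struct device *dev)
{
	return dev->kind == KIND_PORT;
}

/* Each branch names the type it expects; nothing falls through by default. */
static void trace_uncorrectable(const struct device *dev,
				struct trace_counts *tc)
{
	assert(dev != NULL && tc != NULL);

	if (is_memdev(dev))
		tc->memdev_events++;
	else if (is_port(dev))
		tc->port_events++;
	else
		tc->unexpected++;
}
```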

> +		trace_cxl_port_aer_uncorrectable_error(dev, status, fe, hl);
> +
>  	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>  
>  	return true;
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index 8389a94adb1a..681e415ac8f5 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -48,6 +48,34 @@
>  	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
>  )
>  
> +TRACE_EVENT(cxl_port_aer_uncorrectable_error,
> +	TP_PROTO(struct device *dev, u32 status, u32 fe, u32 *hl),
> +	TP_ARGS(dev, status, fe, hl),
> +	TP_STRUCT__entry(
> +		__string(devname, dev_name(dev))
> +		__string(host, dev_name(dev->parent))
What is host in this case? Perhaps a comment.
> +		__field(u32, status)
> +		__field(u32, first_error)
> +		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
> +	),
> +	TP_fast_assign(
> +		__assign_str(devname);
> +		__assign_str(host);
> +		__entry->status = status;
> +		__entry->first_error = fe;
> +		/*
> +		 * Embed the 512B headerlog data for user app retrieval and
> +		 * parsing, but no need to print this in the trace buffer.
> +		 */
> +		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
> +	),
> +	TP_printk("device=%s host=%s status: '%s' first_error: '%s'",
> +		  __get_str(devname), __get_str(host),
> +		  show_uc_errs(__entry->status),
> +		  show_uc_errs(__entry->first_error)
> +	)
> +);
> +
>  TRACE_EVENT(cxl_aer_uncorrectable_error,
>  	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
>  	TP_ARGS(cxlmd, status, fe, hl),
> @@ -96,6 +124,25 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
>  	{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" }	\
>  )
>  
> +TRACE_EVENT(cxl_port_aer_correctable_error,
> +	TP_PROTO(struct device *dev, u32 status),
> +	TP_ARGS(dev, status),
> +	TP_STRUCT__entry(
> +		__string(devname, dev_name(dev))
> +		__string(host, dev_name(dev->parent))
> +		__field(u32, status)
> +	),
> +	TP_fast_assign(
> +		__assign_str(devname);
> +		__assign_str(host);
> +		__entry->status = status;
> +	),
> +	TP_printk("device=%s host=%s status='%s'",
> +		  __get_str(devname), __get_str(host),
> +		  show_ce_errs(__entry->status)
> +	)
> +);
> +
>  TRACE_EVENT(cxl_aer_correctable_error,
>  	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
>  	TP_ARGS(cxlmd, status),


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 15/16] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
  2025-01-07 14:38 ` [PATCH v5 15/16] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers Terry Bowman
@ 2025-01-14 11:51   ` Jonathan Cameron
  2025-01-14 23:03   ` Ira Weiny
  2025-02-07  8:08   ` Gregory Price
  2 siblings, 0 replies; 96+ messages in thread
From: Jonathan Cameron @ 2025-01-14 11:51 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Tue, 7 Jan 2025 08:38:51 -0600
Terry Bowman <terry.bowman@amd.com> wrote:

> pci_driver::cxl_err_handlers are not currently assigned handler callbacks.
> The handlers can't be set in the pci_driver static definition because the
> CXL PCIe Port devices are bound to the portdrv driver which is not CXL
> driver aware.
> 
> Add cxl_assign_port_error_handlers() in the cxl_core module. This
> function will assign the default handlers for a CXL PCIe Port device.
> 
> When the CXL Port (cxl_port or cxl_dport) is destroyed the device's
> pci_driver::cxl_err_handlers must be set to NULL indicating they should no
> longer be used.
> 
> Create cxl_clear_port_error_handlers() and register it to be called
> when the CXL Port device (cxl_port or cxl_dport) is destroyed.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 03/16] CXL/PCI: Introduce PCIe helper functions pcie_is_cxl() and pcie_is_cxl_port()
  2025-01-13 23:49   ` Ira Weiny
@ 2025-01-14 15:19     ` Bowman, Terry
  2025-01-14 23:33       ` Ira Weiny
  2025-01-15 10:03       ` Lukas Wunner
  0 siblings, 2 replies; 96+ messages in thread
From: Bowman, Terry @ 2025-01-14 15:19 UTC (permalink / raw)
  To: Ira Weiny, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
	rrichter, nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
	ming.li, PradeepVineshReddy.Kodamati, alucerop




On 1/13/2025 5:49 PM, Ira Weiny wrote:
> Terry Bowman wrote:
>> CXL and AER drivers need the ability to identify CXL devices and CXL port
>> devices.
>>
>> First, add set_pcie_cxl() with logic checking for CXL Flexbus DVSEC
>> presence. The CXL Flexbus DVSEC presence is used because it is required
>> for all the CXL PCIe devices.[1]
>>
>> Add boolean 'struct pci_dev::is_cxl' with the purpose to cache the CXL
>> Flexbus presence.
>>
>> Add pcie_is_cxl() as a macro to return 'struct pci_dev::is_cxl'.
>>
>> Add pcie_is_cxl_port() to check if a device is a CXL Root Port, CXL
>> Upstream Switch Port, or CXL Downstream Switch Port. Also, verify the
>> CXL Extensions DVSEC for Ports is present.[1]
>>
>> [1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
>>     Capability (DVSEC) ID Assignment, Table 8-2
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>> Reviewed-by: Fan Ni <fan.ni@samsung.com>
>> ---
>>  drivers/pci/pci.c             | 13 +++++++++++++
>>  drivers/pci/probe.c           | 10 ++++++++++
>>  include/linux/pci.h           |  4 ++++
>>  include/uapi/linux/pci_regs.h |  3 ++-
>>  4 files changed, 29 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>> index 661f98c6c63a..9319c62e3488 100644
>> --- a/drivers/pci/pci.c
>> +++ b/drivers/pci/pci.c
>> @@ -5036,10 +5036,23 @@ static int pci_dev_reset_slot_function(struct pci_dev *dev, bool probe)
>>  
>>  static u16 cxl_port_dvsec(struct pci_dev *dev)
>>  {
>> +	if (!pcie_is_cxl(dev))
>> +		return 0;
>> +
>>  	return pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
>>  					 PCI_DVSEC_CXL_PORT);
>>  }
>>  
>> +bool pcie_is_cxl_port(struct pci_dev *dev)
>> +{
>> +	if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
>> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM) &&
>> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
>> +		return false;
>> +
>> +	return cxl_port_dvsec(dev);
> Returning a u16 from a function declared to return bool is odd and I don't
> think it should be coded this way.  I don't think it is wrong right now, but
> this really ought to do the pcie_is_cxl() check here and leave
> cxl_port_dvsec() alone: call cxl_port_dvsec(), check whether the DVSEC
> exists, and return bool.

Hi Ira,

Thanks for reviewing. Is this what you are looking for here:

+bool pcie_is_cxl_port(struct pci_dev *dev)
+{
+	return (cxl_port_dvsec(dev) > 0);

>> +}
>> +
> [snip]
>
>> diff --git a/include/linux/pci.h b/include/linux/pci.h
>> index e2e36f11205c..08350302b3e9 100644
>> --- a/include/linux/pci.h
>> +++ b/include/linux/pci.h
>> @@ -452,6 +452,7 @@ struct pci_dev {
>>  	unsigned int	is_hotplug_bridge:1;
>>  	unsigned int	shpc_managed:1;		/* SHPC owned by shpchp */
>>  	unsigned int	is_thunderbolt:1;	/* Thunderbolt controller */
>> +	unsigned int	is_cxl:1;               /* Compute Express Link (CXL) */
>>  	/*
>>  	 * Devices marked being untrusted are the ones that can potentially
>>  	 * execute DMA attacks and similar. They are typically connected
>> @@ -739,6 +740,9 @@ static inline bool pci_is_vga(struct pci_dev *pdev)
>>  	return false;
>>  }
>>  
>> +#define pcie_is_cxl(dev) (dev->is_cxl)
> This should be an inline function which takes struct pci_dev * for type
> safety.
>
> Ira
Ok,

Thanks for reviewing the patches.

Regards,
Terry
> [snip]
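
Both review points can be sketched together in userspace C. The struct, the
enum and the port_dvsec field are assumptions standing in for struct pci_dev,
the PCI_EXP_TYPE_* values and the DVSEC lookup; only the shape of the three
helpers is the point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical port kinds; the kernel uses the PCI_EXP_TYPE_* values. */
enum pcie_type {
	TYPE_ENDPOINT,
	TYPE_ROOT_PORT,
	TYPE_UPSTREAM,
	TYPE_DOWNSTREAM,
};

/* Illustrative stand-in for struct pci_dev. */
struct pci_dev {
	unsigned int is_cxl:1;
	enum pcie_type type;
	uint16_t port_dvsec;	/* capability offset, 0 when absent */
};

/* Inline function instead of a macro: the argument type is checked. */
static inline bool pcie_is_cxl(const struct pci_dev *dev)
{
	return dev->is_cxl;
}

/* Stand-in for the DVSEC lookup; returns an offset, not a truth value. */
static uint16_t cxl_port_dvsec(const struct pci_dev *dev)
{
	return dev->port_dvsec;
}

static bool pcie_is_cxl_port(const struct pci_dev *dev)
{
	assert(dev != NULL);

	if (!pcie_is_cxl(dev))
		return false;

	if (dev->type != TYPE_ROOT_PORT &&
	    dev->type != TYPE_UPSTREAM &&
	    dev->type != TYPE_DOWNSTREAM)
		return false;

	/* Explicit comparison makes the u16-to-bool conversion visible. */
	return cxl_port_dvsec(dev) != 0;
}
```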


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 09/16] cxl/pci: Map CXL PCIe Upstream Switch Port RAS registers
  2025-01-14 11:35   ` Jonathan Cameron
@ 2025-01-14 15:24     ` Bowman, Terry
  0 siblings, 0 replies; 96+ messages in thread
From: Bowman, Terry @ 2025-01-14 15:24 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop




On 1/14/2025 5:35 AM, Jonathan Cameron wrote:
> On Tue, 7 Jan 2025 08:38:45 -0600
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> Add logic to map CXL PCIe Upstream Switch Port (USP) RAS registers.
>>
>> Introduce 'struct cxl_regs' member into 'struct cxl_port' to cache a
>> pointer to the CXL Upstream Port's mapped RAS registers.
>>
>> Also, introduce cxl_uport_init_ras_reporting() to perform the USP RAS
>> register mapping. This is similar to the existing
>> cxl_dport_init_ras_reporting() but for USP devices.
>>
>> The USP may have multiple downstream endpoints. Before mapping AER
>> registers check if the registers are already mapped.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>  drivers/cxl/core/pci.c | 15 +++++++++++++++
>>  drivers/cxl/cxl.h      |  4 ++++
>>  drivers/cxl/mem.c      |  8 ++++++++
>>  3 files changed, 27 insertions(+)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 1af2d0a14f5d..97e6a15bea88 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -773,6 +773,21 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>>  	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
>>  }
>>  
>> +void cxl_uport_init_ras_reporting(struct cxl_port *port)
>> +{
>> +	/* uport may have more than 1 downstream EP. Check if already mapped. */
> Is it worth a lockdep check in here on whatever lock is stopping this racing?


Yes, it is. Thanks Jonathan.

Regards,
Terry
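
A toy illustration of the idea, assuming a lock that merely records whether it
is held. In the kernel this role would be played by lockdep_assert_held() on
whichever lock serialises the callers; the thread does not say which lock that
is, so the one below is purely hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock that only records whether it is held; not a real mutex. */
struct toy_lock {
	bool held;
};

struct port {
	struct toy_lock lock;	/* hypothetical: whichever lock serialises callers */
	void *ras;		/* NULL until mapped */
	int map_calls;		/* counts how often the mapping work ran */
};

static int dummy_regs;

/* Map once; later callers see the cached pointer and return early. */
static void uport_init_ras(struct port *port)
{
	/* Plays the role lockdep_assert_held() would play in the kernel. */
	assert(port->lock.held);

	if (port->ras)
		return;

	port->map_calls++;
	port->ras = &dummy_regs;
}
```

The check-then-map sequence is only safe if two callers cannot interleave
between the test and the assignment, which is what the assertion documents.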

>> +	if (port->uport_regs.ras)
>> +		return;
>> +
>> +	port->reg_map.host = &port->dev;
>> +	if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
>> +				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
>> +		dev_err(&port->dev, "Failed to map RAS capability.\n");
>> +		return;
>> +	}
>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, "CXL");
>> +
>>  /**
>>   * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
>>   * @dport: the cxl_dport that needs to be initialized
>> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
>> index 727429dfdaed..c51735fe75d6 100644
>> --- a/drivers/cxl/cxl.h
>> +++ b/drivers/cxl/cxl.h
>> @@ -601,6 +601,7 @@ struct cxl_dax_region {
>>   * @parent_dport: dport that points to this port in the parent
>>   * @decoder_ida: allocator for decoder ids
>>   * @reg_map: component and ras register mapping parameters
>> + * @uport_regs: mapped component registers
>>   * @nr_dports: number of entries in @dports
>>   * @hdm_end: track last allocated HDM decoder instance for allocation ordering
>>   * @commit_end: cursor to track highest committed decoder for commit ordering
>> @@ -621,6 +622,7 @@ struct cxl_port {
>>  	struct cxl_dport *parent_dport;
>>  	struct ida decoder_ida;
>>  	struct cxl_register_map reg_map;
>> +	struct cxl_component_regs uport_regs;
>>  	int nr_dports;
>>  	int hdm_end;
>>  	int commit_end;
>> @@ -773,8 +775,10 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
>>  
>>  #ifdef CONFIG_PCIEAER_CXL
>>  void cxl_dport_init_ras_reporting(struct cxl_dport *dport);
>> +void cxl_uport_init_ras_reporting(struct cxl_port *port);
>>  #else
>>  static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport) { }
>> +static inline void cxl_uport_init_ras_reporting(struct cxl_port *port) { }
>>  #endif
>>  
>>  struct cxl_decoder *to_cxl_decoder(struct device *dev);
>> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
>> index dd39f4565be2..97dbca765f4d 100644
>> --- a/drivers/cxl/mem.c
>> +++ b/drivers/cxl/mem.c
>> @@ -60,6 +60,7 @@ static bool dev_is_cxl_pci(struct device *dev, u32 pcie_type)
>>  static void cxl_init_ep_ports_aer(struct cxl_ep *ep)
>>  {
>>  	struct cxl_dport *dport = ep->dport;
>> +	struct cxl_port *port = ep->next;
>>  
>>  	if (dport) {
>>  		struct device *dport_dev = dport->dport_dev;
>> @@ -68,6 +69,13 @@ static void cxl_init_ep_ports_aer(struct cxl_ep *ep)
>>  		    dev_is_cxl_pci(dport_dev, PCI_EXP_TYPE_ROOT_PORT))
>>  			cxl_dport_init_ras_reporting(dport);
>>  	}
>> +
>> +	if (port) {
>> +		struct device *uport_dev = port->uport_dev;
>> +
>> +		if (dev_is_cxl_pci(uport_dev, PCI_EXP_TYPE_UPSTREAM))
>> +			cxl_uport_init_ras_reporting(port);
>> +	}
>>  }
>>  
>>  static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver
  2025-01-07 14:38 ` [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver Terry Bowman
  2025-01-14  6:54   ` Li Ming
@ 2025-01-14 16:35   ` Ira Weiny
  2025-02-06 18:33   ` Gregory Price
  2 siblings, 0 replies; 96+ messages in thread
From: Ira Weiny @ 2025-01-14 16:35 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Terry Bowman wrote:
> The AER service driver supports handling Downstream Port Protocol Errors in
> Restricted CXL host (RCH) mode also known as CXL1.1. It needs the same
> functionality for CXL PCIe Ports operating in Virtual Hierarchy (VH)
> mode.[1]
> 
> CXL and PCIe Protocol Error handling have different requirements that
> necessitate a separate handling path. The AER service driver may try to
> recover PCIe uncorrectable non-fatal errors (UCE). The same recovery is not
> suitable for CXL PCIe Port devices because of potential for system memory
> corruption. Instead, CXL Protocol Error handling must use a kernel panic
> in the case of a fatal or non-fatal UCE. The AER driver's PCIe Protocol
> Error handling does not panic the kernel in response to a UCE.
> 
> Introduce a separate path for CXL Protocol Error handling in the AER
> service driver. This will allow CXL Protocol Errors to use CXL specific
> handling instead of PCIe handling. Add the CXL specific changes without
> affecting or adding functionality in the PCIe handling.
> 
> Make this update alongside the existing Downstream Port RCH error handling
> logic, extending support to CXL PCIe Ports in VH mode.
> 
> is_internal_error() is currently limited by CONFIG_PCIEAER_CXL kernel
> config. Update is_internal_error()'s function declaration such that it is
> always available regardless if CONFIG_PCIEAER_CXL kernel config is enabled
> or disabled.
> 
> The uncorrectable error (UCE) handling will be added in a future patch.
> 
> [1] CXL 3.1 Spec, 12.2.2 CXL Root Ports, Downstream Switch Ports, and
> Upstream Switch Ports
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 06/16] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe Port devices
  2025-01-07 14:38 ` [PATCH v5 06/16] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe Port devices Terry Bowman
  2025-01-14 11:32   ` Jonathan Cameron
@ 2025-01-14 16:57   ` Ira Weiny
  1 sibling, 0 replies; 96+ messages in thread
From: Ira Weiny @ 2025-01-14 16:57 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Terry Bowman wrote:
> The AER service driver's aer_get_device_error_info() function doesn't read
> uncorrectable (UCE) fatal error status from PCIe Upstream Port devices,
> including CXL Upstream Switch Ports. As a result, fatal errors are not
> logged or handled as needed for CXL PCIe Upstream Switch Port devices.
> 
> Update the aer_get_device_error_info() function to read the UCE fatal
> status for all CXL PCIe devices. Make the change such that non-CXL devices
> are not affected.
> 
> The fatal error status will be used in future patches implementing
> CXL PCIe Port uncorrectable error handling and logging.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 07/16] PCI/AER: Add CXL PCIe Port uncorrectable error recovery in AER service driver
  2025-01-07 14:38 ` [PATCH v5 07/16] PCI/AER: Add CXL PCIe Port uncorrectable error recovery in AER service driver Terry Bowman
  2025-01-14 11:33   ` Jonathan Cameron
@ 2025-01-14 17:27   ` Ira Weiny
  1 sibling, 0 replies; 96+ messages in thread
From: Ira Weiny @ 2025-01-14 17:27 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Terry Bowman wrote:
> Existing recovery procedure for PCIe uncorrectable errors (UCE) does not
> apply to CXL devices. Recovery can not be used for CXL devices because of
> potential corruption on what can be system memory. Also, current PCIe UCE
> recovery, in the case of a Root Port (RP) or Downstream Switch Port (DSP),
> does not begin at the RP/DSP but begins at the first downstream device.
> This will miss handling CXL Protocol Errors in a CXL RP or DSP. A separate
> CXL recovery is needed because of the different handling requirements.
> 
> Add a new function, cxl_do_recovery() using the following.
> 
> Add cxl_walk_bridge() to iterate the detected error's sub-topology.
> cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
> will begin iteration at the RP or DSP rather than beginning at the
> first downstream device.
> 
> Add cxl_report_error_detected() as an analog to report_error_detected().
> It will call pci_driver::cxl_err_handlers for each iterated downstream
> device. The pci_driver::cxl_err_handler's UCE handler returns a boolean
> indicating if there was a UCE error detected during handling.
> 
> cxl_do_recovery() uses the status from cxl_report_error_detected() to
> determine how to proceed. Non-fatal CXL UCE errors will be treated as
> fatal. If a UCE was present during handling then cxl_do_recovery()
> will kernel panic.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---

[snip]

> +
> +static int cxl_report_error_detected(struct pci_dev *dev, void *data)
> +{
> +	const struct cxl_error_handlers *cxl_err_handler;
> +	struct pci_driver *pdrv = dev->driver;
> +	bool *status = data;
> +
> +	device_lock(&dev->dev);
> +	if (pdrv && pdrv->cxl_err_handler &&
> +	    pdrv->cxl_err_handler->error_detected) {
> +		cxl_err_handler = pdrv->cxl_err_handler;
> +		*status = cxl_err_handler->error_detected(dev);
> +	}
> +	device_unlock(&dev->dev);
> +	return *status;

This is probably just another nit on my part but returning bool here for
int may cause issues down the road.

Looking at this I wonder if it would be better to add *_PANIC to
pci_ers_result_t and return that similar to report_error_detected()?
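
A userspace sketch of that shape, assuming a hypothetical two-value result
enum (the PANIC value is only a proposal in this thread) and a toy walker in
place of the bridge walk. The callback's int return steers the walk, while the
outcome travels through a typed out-parameter that is initialised up front. As
posted, status in cxl_do_recovery() appears to be read without having been
written when no handler is registered.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result codes; a PANIC value is only proposed in this thread. */
enum ers_result {
	ERS_RESULT_NONE,
	ERS_RESULT_PANIC,
};

struct dev {
	int uce_pending;	/* stand-in for "handler saw an uncorrectable error" */
};

typedef int (*walk_cb)(struct dev *dev, void *data);

/* Toy walker: a nonzero return from the callback stops the walk. */
static void walk(struct dev *devs, size_t n, walk_cb cb, void *data)
{
	for (size_t i = 0; i < n; i++)
		if (cb(&devs[i], data))
			break;
}

/* Result goes through the typed out-parameter; the int only steers the walk. */
static int report_error_detected(struct dev *dev, void *data)
{
	enum ers_result *result = data;

	assert(result != NULL);

	if (dev->uce_pending)
		*result = ERS_RESULT_PANIC;

	return 0;	/* keep walking so every device is visited */
}

static enum ers_result do_recovery(struct dev *devs, size_t n)
{
	/* Defined even if no handler runs. */
	enum ers_result result = ERS_RESULT_NONE;

	walk(devs, n, report_error_detected, &result);
	return result;
}
```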

> +}
> +
> +void cxl_do_recovery(struct pci_dev *dev)
> +{
> +	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> +	int type = pci_pcie_type(dev);
> +	struct pci_dev *bridge;
> +	int status;
> +
> +	if (type == PCI_EXP_TYPE_ROOT_PORT ||
> +	    type == PCI_EXP_TYPE_DOWNSTREAM ||
> +	    type == PCI_EXP_TYPE_UPSTREAM ||
> +	    type == PCI_EXP_TYPE_ENDPOINT)
> +		bridge = dev;
> +	else
> +		bridge = pci_upstream_bridge(dev);
> +
> +	cxl_walk_bridge(bridge, cxl_report_error_detected, &status);
> +	if (status)
> +		panic("CXL cachemem error.");
> +
> +	if (host->native_aer || pcie_ports_native) {
> +		pcie_clear_device_status(dev);
> +		pci_aer_clear_nonfatal_status(dev);
> +	}

There is a nice informative comment in pcie_do_recovery() about this
block.  I think we should combine this and that block into a new function
which preserves that for both paths.

Ira

> +
> +	pci_info(bridge, "CXL uncorrectable error.\n");
> +}
> -- 
> 2.34.1
> 



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver
  2025-01-14  6:54   ` Li Ming
  2025-01-14 11:20     ` Jonathan Cameron
@ 2025-01-14 19:29     ` Bowman, Terry
  2025-01-15  1:18       ` Li Ming
  1 sibling, 1 reply; 96+ messages in thread
From: Bowman, Terry @ 2025-01-14 19:29 UTC (permalink / raw)
  To: Li Ming, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas,
	PradeepVineshReddy.Kodamati, alucerop



On 1/14/2025 12:54 AM, Li Ming wrote:
> On 1/7/2025 10:38 PM, Terry Bowman wrote:
>> The AER service driver supports handling Downstream Port Protocol Errors in
>> Restricted CXL host (RCH) mode also known as CXL1.1. It needs the same
>> functionality for CXL PCIe Ports operating in Virtual Hierarchy (VH)
>> mode.[1]
>>
>> CXL and PCIe Protocol Error handling have different requirements that
>> necessitate a separate handling path. The AER service driver may try to
>> recover PCIe uncorrectable non-fatal errors (UCE). The same recovery is not
>> suitable for CXL PCIe Port devices because of potential for system memory
>> corruption. Instead, CXL Protocol Error handling must use a kernel panic
>> in the case of a fatal or non-fatal UCE. The AER driver's PCIe Protocol
>> Error handling does not panic the kernel in response to a UCE.
>>
>> Introduce a separate path for CXL Protocol Error handling in the AER
>> service driver. This will allow CXL Protocol Errors to use CXL specific
>> handling instead of PCIe handling. Add the CXL specific changes without
>> affecting or adding functionality in the PCIe handling.
>>
>> Make this update alongside the existing Downstream Port RCH error handling
>> logic, extending support to CXL PCIe Ports in VH mode.
>>
>> is_internal_error() is currently limited by CONFIG_PCIEAER_CXL kernel
>> config. Update is_internal_error()'s function declaration such that it is
>> always available regardless if CONFIG_PCIEAER_CXL kernel config is enabled
>> or disabled.
>>
>> The uncorrectable error (UCE) handling will be added in a future patch.
>>
>> [1] CXL 3.1 Spec, 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>> Upstream Switch Ports
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>> ---
>>  drivers/pci/pcie/aer.c | 61 +++++++++++++++++++++++++++---------------
>>  1 file changed, 40 insertions(+), 21 deletions(-)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index f8b3350fcbb4..62be599e3bee 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -942,8 +942,15 @@ static bool find_source_device(struct pci_dev *parent,
>>  	return true;
>>  }
>>  
>> -#ifdef CONFIG_PCIEAER_CXL
>> +static bool is_internal_error(struct aer_err_info *info)
>> +{
>> +	if (info->severity == AER_CORRECTABLE)
>> +		return info->status & PCI_ERR_COR_INTERNAL;
>>  
>> +	return info->status & PCI_ERR_UNC_INTN;
>> +}
>> +
>> +#ifdef CONFIG_PCIEAER_CXL
>>  /**
>>   * pci_aer_unmask_internal_errors - unmask internal errors
>>   * @dev: pointer to the pcie_dev data structure
>> @@ -995,14 +1002,6 @@ static bool cxl_error_is_native(struct pci_dev *dev)
>>  	return (pcie_ports_native || host->native_aer);
>>  }
>>  
>> -static bool is_internal_error(struct aer_err_info *info)
>> -{
>> -	if (info->severity == AER_CORRECTABLE)
>> -		return info->status & PCI_ERR_COR_INTERNAL;
>> -
>> -	return info->status & PCI_ERR_UNC_INTN;
>> -}
>> -
>>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>  {
>>  	struct aer_err_info *info = (struct aer_err_info *)data;
>> @@ -1034,14 +1033,23 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>  
>>  static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>  {
>> -	/*
>> -	 * Internal errors of an RCEC indicate an AER error in an
>> -	 * RCH's downstream port. Check and handle them in the CXL.mem
>> -	 * device driver.
>> -	 */
>> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
>> -	    is_internal_error(info))
>> -		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
>> +		return pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>> +
>> +	if (info->severity == AER_CORRECTABLE) {
>> +		struct pci_driver *pdrv = dev->driver;
>> +		int aer = dev->aer_cap;
>> +
>> +		if (aer)
>> +			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>> +					       info->status);
>> +
>> +		if (pdrv && pdrv->cxl_err_handler &&
>> +		    pdrv->cxl_err_handler->cor_error_detected)
>> +			pdrv->cxl_err_handler->cor_error_detected(dev);
>> +
>> +		pcie_clear_device_status(dev);
>> +	}
>>  }
>>  
>>  static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>> @@ -1059,9 +1067,13 @@ static bool handles_cxl_errors(struct pci_dev *dev)
>>  {
>>  	bool handles_cxl = false;
>>  
>> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
>> -	    pcie_aer_is_native(dev))
>> +	if (!pcie_aer_is_native(dev))
>> +		return false;
>> +
>> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
>>  		pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
>> +	else
>> +		handles_cxl = pcie_is_cxl_port(dev);
> My understanding is that a CXL RP/USP/DSP operating in PCIe mode may still expose DVSEC ID 3 (CXL r3.1 section 9.12.3). In that case, the AER handler should be pci_aer_handle_error() rather than cxl_handle_error().
>
> pcie_is_cxl_port() only checks whether DVSEC ID 3 is present, but I think it should also check whether the CXL port is operating in CXL mode. Does that make more sense?
>
>
> Ming

Hi Ming and Jonathan,

RCH AER & RCH RAS are currently logged by the CXL driver's RCH handlers.

If the recommended change is made, RCH RAS will not be logged and the
user would miss CXL details about the alternate protocol training failure.
Also, AER is not required by CXL, so in some cases you would only have the
RCEC-forwarded UIE/CIE message logged by the AER driver without any other
logging.

Is there value in *not* logging CXL RAS for errors on an untrained RCH
link? Isn't it more informative to log PCIe AER and CXL RAS in this case?

Regards,
Terry

>>  
>>  	return handles_cxl;
>>  }
>> @@ -1079,6 +1091,10 @@ static void cxl_enable_internal_errors(struct pci_dev *dev)
>>  static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
>>  static inline void cxl_handle_error(struct pci_dev *dev,
>>  				    struct aer_err_info *info) { }
>> +static bool handles_cxl_errors(struct pci_dev *dev)
>> +{
>> +	return false;
>> +}
>>  #endif
>>  
>>  /**
>> @@ -1116,8 +1132,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>  
>>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>>  {
>> -	cxl_handle_error(dev, info);
>> -	pci_aer_handle_error(dev, info);
>> +	if (is_internal_error(info) && handles_cxl_errors(dev))
>> +		cxl_handle_error(dev, info);
>> +	else
>> +		pci_aer_handle_error(dev, info);
>> +
>>  	pci_dev_put(dev);
>>  }
>>  
>
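A minimal user-space sketch of the routing that the last hunk adds to
handle_error_source(), for readers following the discussion above. It is a
model only, under stated assumptions: every model_* name is hypothetical,
handles_cxl stands in for the result of handles_cxl_errors(), and the two
bit values are what I believe include/uapi/linux/pci_regs.h defines rather
than something stated in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values of the AER "internal error" status bits. */
#define PCI_ERR_COR_INTERNAL 0x00004000u
#define PCI_ERR_UNC_INTN     0x00400000u

/* Hypothetical stand-ins for the kernel's severity and error-info types. */
enum model_severity { MODEL_NONFATAL, MODEL_FATAL, MODEL_CORRECTABLE };

struct model_err_info {
	enum model_severity severity;
	uint32_t status;
};

enum model_path { MODEL_PATH_PCIE_AER, MODEL_PATH_CXL };

/* Same test as is_internal_error() in the patch: pick the bit by severity. */
static bool model_is_internal_error(const struct model_err_info *info)
{
	if (info->severity == MODEL_CORRECTABLE)
		return (info->status & PCI_ERR_COR_INTERNAL) != 0;

	return (info->status & PCI_ERR_UNC_INTN) != 0;
}

/* Exactly one path is taken per error, unlike the pre-patch code. */
static enum model_path model_pick_path(const struct model_err_info *info,
				       bool handles_cxl)
{
	if (model_is_internal_error(info) && handles_cxl)
		return MODEL_PATH_CXL;

	return MODEL_PATH_PCIE_AER;
}
```

The point of the model is that the CXL and PCIe paths become mutually
exclusive, so whatever handles_cxl_errors() decides (including any mode
check discussed above) fully determines which handler sees the error.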



* Re: [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver
  2025-01-14 11:20     ` Jonathan Cameron
@ 2025-01-14 20:10       ` Bowman, Terry
  0 siblings, 0 replies; 96+ messages in thread
From: Bowman, Terry @ 2025-01-14 20:10 UTC (permalink / raw)
  To: Jonathan Cameron, Li Ming
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
	PradeepVineshReddy.Kodamati, alucerop




On 1/14/2025 5:20 AM, Jonathan Cameron wrote:
> On Tue, 14 Jan 2025 14:54:23 +0800
> Li Ming <ming.li@zohomail.com> wrote:
>
>> On 1/7/2025 10:38 PM, Terry Bowman wrote:
>>> The AER service driver supports handling Downstream Port Protocol Errors in
>>> Restricted CXL host (RCH) mode also known as CXL1.1. It needs the same
>>> functionality for CXL PCIe Ports operating in Virtual Hierarchy (VH)
>>> mode.[1]
>>>
>>> CXL and PCIe Protocol Error handling have different requirements that
>>> necessitate a separate handling path. The AER service driver may try to
>>> recover PCIe uncorrectable non-fatal errors (UCE). The same recovery is not
>>> suitable for CXL PCIe Port devices because of potential for system memory
>>> corruption. Instead, CXL Protocol Error handling must use a kernel panic
>>> in the case of a fatal or non-fatal UCE. The AER driver's PCIe Protocol
>>> Error handling does not panic the kernel in response to a UCE.
>>>
>>> Introduce a separate path for CXL Protocol Error handling in the AER
>>> service driver. This will allow CXL Protocol Errors to use CXL specific
>>> handling instead of PCIe handling. Add the CXL specific changes without
>>> affecting or adding functionality in the PCIe handling.
>>>
>>> Make this update alongside the existing Downstream Port RCH error handling
>>> logic, extending support to CXL PCIe Ports in VH mode.
>>>
>>> is_internal_error() is currently limited by CONFIG_PCIEAER_CXL kernel
>>> config. Update is_internal_error()'s function declaration such that it is
>>> always available regardless if CONFIG_PCIEAER_CXL kernel config is enabled
>>> or disabled.
>>>
>>> The uncorrectable error (UCE) handling will be added in a future patch.
>>>
>>> [1] CXL 3.1 Spec, 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>>> Upstream Switch Ports
>>>
>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>>> ---
>>>  drivers/pci/pcie/aer.c | 61 +++++++++++++++++++++++++++---------------
>>>  1 file changed, 40 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>> index f8b3350fcbb4..62be599e3bee 100644
>>> --- a/drivers/pci/pcie/aer.c
>>> +++ b/drivers/pci/pcie/aer.c
>>> @@ -942,8 +942,15 @@ static bool find_source_device(struct pci_dev *parent,
>>>  	return true;
>>>  }
>>>  
>>> -#ifdef CONFIG_PCIEAER_CXL
>>> +static bool is_internal_error(struct aer_err_info *info)
>>> +{
>>> +	if (info->severity == AER_CORRECTABLE)
>>> +		return info->status & PCI_ERR_COR_INTERNAL;
>>>  
>>> +	return info->status & PCI_ERR_UNC_INTN;
>>> +}
>>> +
>>> +#ifdef CONFIG_PCIEAER_CXL
>>>  /**
>>>   * pci_aer_unmask_internal_errors - unmask internal errors
>>>   * @dev: pointer to the pcie_dev data structure
>>> @@ -995,14 +1002,6 @@ static bool cxl_error_is_native(struct pci_dev *dev)
>>>  	return (pcie_ports_native || host->native_aer);
>>>  }
>>>  
>>> -static bool is_internal_error(struct aer_err_info *info)
>>> -{
>>> -	if (info->severity == AER_CORRECTABLE)
>>> -		return info->status & PCI_ERR_COR_INTERNAL;
>>> -
>>> -	return info->status & PCI_ERR_UNC_INTN;
>>> -}
>>> -
>>>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>>  {
>>>  	struct aer_err_info *info = (struct aer_err_info *)data;
>>> @@ -1034,14 +1033,23 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>>  
>>>  static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>>  {
>>> -	/*
>>> -	 * Internal errors of an RCEC indicate an AER error in an
>>> -	 * RCH's downstream port. Check and handle them in the CXL.mem
>>> -	 * device driver.
>>> -	 */
>>> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
>>> -	    is_internal_error(info))
>>> -		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>>> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
>>> +		return pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>>> +
>>> +	if (info->severity == AER_CORRECTABLE) {
>>> +		struct pci_driver *pdrv = dev->driver;
>>> +		int aer = dev->aer_cap;
>>> +
>>> +		if (aer)
>>> +			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>>> +					       info->status);
>>> +
>>> +		if (pdrv && pdrv->cxl_err_handler &&
>>> +		    pdrv->cxl_err_handler->cor_error_detected)
>>> +			pdrv->cxl_err_handler->cor_error_detected(dev);
>>> +
>>> +		pcie_clear_device_status(dev);
>>> +	}
>>>  }
>>>  
>>>  static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>>> @@ -1059,9 +1067,13 @@ static bool handles_cxl_errors(struct pci_dev *dev)
>>>  {
>>>  	bool handles_cxl = false;
>>>  
>>> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
>>> -	    pcie_aer_is_native(dev))
>>> +	if (!pcie_aer_is_native(dev))
>>> +		return false;
>>> +
>>> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
>>>  		pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
>>> +	else
>>> +		handles_cxl = pcie_is_cxl_port(dev);  
>> My understanding is that a CXL RP/USP/DSP operating in PCIe mode may still expose DVSEC ID 3 (CXL r3.1 section 9.12.3). In that case, the AER handler should be pci_aer_handle_error() rather than cxl_handle_error().
>>
>> pcie_is_cxl_port() only checks whether DVSEC ID 3 is present, but I think it should also check whether the CXL port is operating in CXL mode. Does that make more sense?
>>
>>
> Good spot.
>
> Agreed a check on the mode makes sense.
>
> Jonathan

Hi Jonathan,

I responded to you and Ming here:

https://lore.kernel.org/linux-cxl/20250107143852.3692571-1-terry.bowman@amd.com/T/#m74f758d744ae446db5d07c541dc84f0a1d57996e

Regards,
Terry
>> Ming
>>
>>>  
>>>  	return handles_cxl;
>>>  }
>>> @@ -1079,6 +1091,10 @@ static void cxl_enable_internal_errors(struct pci_dev *dev)
>>>  static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
>>>  static inline void cxl_handle_error(struct pci_dev *dev,
>>>  				    struct aer_err_info *info) { }
>>> +static bool handles_cxl_errors(struct pci_dev *dev)
>>> +{
>>> +	return false;
>>> +}
>>>  #endif
>>>  
>>>  /**
>>> @@ -1116,8 +1132,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>>  
>>>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>>>  {
>>> -	cxl_handle_error(dev, info);
>>> -	pci_aer_handle_error(dev, info);
>>> +	if (is_internal_error(info) && handles_cxl_errors(dev))
>>> +		cxl_handle_error(dev, info);
>>> +	else
>>> +		pci_aer_handle_error(dev, info);
>>> +
>>>  	pci_dev_put(dev);
>>>  }
>>>    
>>
>>



* Re: [PATCH v5 07/16] PCI/AER: Add CXL PCIe Port uncorrectable error recovery in AER service driver
  2025-01-14 11:33   ` Jonathan Cameron
@ 2025-01-14 20:28     ` Bowman, Terry
  2025-01-15 11:37       ` Jonathan Cameron
  0 siblings, 1 reply; 96+ messages in thread
From: Bowman, Terry @ 2025-01-14 20:28 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop, terry.bowman




On 1/14/2025 5:33 AM, Jonathan Cameron wrote:
> On Tue, 7 Jan 2025 08:38:43 -0600
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> Existing recovery procedure for PCIe uncorrectable errors (UCE) does not
>> apply to CXL devices. Recovery can not be used for CXL devices because of
>> potential corruption on what can be system memory. Also, current PCIe UCE
>> recovery, in the case of a Root Port (RP) or Downstream Switch Port (DSP),
>> does not begin at the RP/DSP but begins at the first downstream device.
>> This will miss handling CXL Protocol Errors in a CXL RP or DSP. A separate
>> CXL recovery is needed because of the different handling requirements.
>>
>> Add a new function, cxl_do_recovery() using the following.
>>
>> Add cxl_walk_bridge() to iterate the detected error's sub-topology.
>> cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
>> will begin iteration at the RP or DSP rather than beginning at the
>> first downstream device.
> I'm still holding out for making pci_walk_bridge() do the same and seeing
> what if anything breaks.

I can test AER fatal UCE on a PCIe device. Do you have any other ideas for specific
testing? A specific device or topology in mind?

Regards,
Terry

> Other than that I'm fine with this patch.
>
>> Add cxl_report_error_detected() as an analog to report_error_detected().
>> It will call pci_driver::cxl_err_handlers for each iterated downstream
>> device. The pci_driver::cxl_err_handler's UCE handler returns a boolean
>> indicating if there was a UCE error detected during handling.
>>
>> cxl_do_recovery() uses the status from cxl_report_error_detected() to
>> determine how to proceed. Non-fatal CXL UCE errors will be treated as
>> fatal. If a UCE was present during handling then cxl_do_recovery()
>> will kernel panic.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
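A rough user-space model of the decision the commit message describes for
cxl_do_recovery(), assuming only what is stated above: the walk starts at
the RP/DSP itself, each visited device's UCE handler returns a boolean, and
any true result leads to a panic. All model_* names are hypothetical, and
the real walk, locking, and panic call are not represented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device UCE handler: true means a UCE was found. */
typedef bool (*model_uce_handler)(void);

struct model_node {
	model_uce_handler handler; /* may be NULL if no driver callback */
};

static bool model_no_uce(void) { return false; }
static bool model_has_uce(void) { return true; }

/*
 * nodes[0] models the RP/DSP where the error was detected, followed by
 * its downstream devices. Returns true when the model would panic.
 */
static bool model_recovery_should_panic(const struct model_node *nodes,
					size_t count)
{
	bool uce_seen = false;

	for (size_t i = 0; i < count; i++) {
		/* Visit every node so each handler can log its own state. */
		if (nodes[i].handler && nodes[i].handler())
			uce_seen = true;
	}

	/* No fatal/non-fatal distinction: any UCE is treated as fatal. */
	return uce_seen;
}
```

Starting the array at the bridge is what distinguishes this from a walk
that begins at the first downstream device: a UCE that only the RP/DSP
handler reports is still seen.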



* Re: [PATCH v5 06/16] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe Port devices
  2025-01-14 11:32   ` Jonathan Cameron
@ 2025-01-14 20:44     ` Bowman, Terry
  2025-01-28 20:25     ` Bowman, Terry
  1 sibling, 0 replies; 96+ messages in thread
From: Bowman, Terry @ 2025-01-14 20:44 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop, Shuai Xue




On 1/14/2025 5:32 AM, Jonathan Cameron wrote:
> On Tue, 7 Jan 2025 08:38:42 -0600
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The AER service driver's aer_get_device_error_info() function doesn't read
>> uncorrectable (UCE) fatal error status from PCIe Upstream Port devices,
>> including CXL Upstream Switch Ports. As a result, fatal errors are not
>> logged or handled as needed for CXL PCIe Upstream Switch Port devices.
>>
>> Update the aer_get_device_error_info() function to read the UCE fatal
>> status for all CXL PCIe devices. Make the change such that non-CXL devices
>> are not affected.
>>
>> The fatal error status will be used in future patches implementing
>> CXL PCIe Port uncorrectable error handling and logging.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> This clashes with Shuai's series adding link health checks.
> Maybe we can reuse that logic to incorporate the condition we
> care about here?
>

I'll add changes to query Upstream Port link status. I'll borrow from Shuai's patch.

Regards,
Terry

>> ---
>>  drivers/pci/pcie/aer.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 62be599e3bee..79c828bdcb6d 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -1253,7 +1253,8 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
>>  	} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
>>  		   type == PCI_EXP_TYPE_RC_EC ||
>>  		   type == PCI_EXP_TYPE_DOWNSTREAM ||
>> -		   info->severity == AER_NONFATAL) {
>> +		   info->severity == AER_NONFATAL ||
>> +		   (pcie_is_cxl(dev) && type == PCI_EXP_TYPE_UPSTREAM)) {
>>  
>>  		/* Link is still healthy for IO reads */
>>  		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,
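For reference, the condition in the v5 hunk above can be written as a
small predicate. This is a user-space sketch with hypothetical model_*
names; it covers only the "else if" branch shown in the diff (the
correctable case is handled by an earlier branch that is not quoted here),
and it does not reflect the link-status approach mentioned in the reply.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the PCIe port types named in the hunk. */
enum model_port_type {
	MODEL_ENDPOINT,
	MODEL_ROOT_PORT,
	MODEL_UPSTREAM,
	MODEL_DOWNSTREAM,
	MODEL_RC_EC,
};

enum model_uncor_severity { MODEL_NONFATAL, MODEL_FATAL };

/* True when the v5 code would read the uncorrectable status register. */
static bool model_reads_uncor_status(enum model_port_type type,
				     enum model_uncor_severity severity,
				     bool is_cxl)
{
	return type == MODEL_ROOT_PORT ||
	       type == MODEL_RC_EC ||
	       type == MODEL_DOWNSTREAM ||
	       severity == MODEL_NONFATAL ||
	       (is_cxl && type == MODEL_UPSTREAM); /* new in this patch */
}
```

Only the fatal case on an Upstream Port changes: it is read for CXL
devices and still skipped for non-CXL ones.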



* Re: [PATCH v5 14/16] cxl/pci: Add trace logging for CXL PCIe Port RAS errors
  2025-01-14 11:49   ` Jonathan Cameron
@ 2025-01-14 20:56     ` Bowman, Terry
  2025-01-15 11:42       ` Jonathan Cameron
  0 siblings, 1 reply; 96+ messages in thread
From: Bowman, Terry @ 2025-01-14 20:56 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop




On 1/14/2025 5:49 AM, Jonathan Cameron wrote:
> On Tue, 7 Jan 2025 08:38:50 -0600
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The CXL drivers use kernel trace functions for logging endpoint and
>> Restricted CXL host (RCH) Downstream Port RAS errors. Similar functionality
>> is required for CXL Root Ports, CXL Downstream Switch Ports, and CXL
>> Upstream Switch Ports.
>>
>> Introduce trace logging functions for both RAS correctable and
>> uncorrectable errors specific to CXL PCIe Ports. Additionally, update
>> the CXL Port Protocol Error handlers to invoke these new trace functions.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Reviewed-by: Alejandro Lucero <alucerop@amd.com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> An example print in commit message would help understand what the tracepoints
> look like.
>
> Few more things inline following on from earlier comments.
>
> Jonathan
>> ---
>>  drivers/cxl/core/pci.c   | 17 +++++++++++----
>>  drivers/cxl/core/trace.h | 47 ++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 60 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 411834f7efe0..3e87fe54a1a2 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -663,10 +663,15 @@ static void __cxl_handle_cor_ras(struct device *dev,
>>  
>>  	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>>  	status = readl(addr);
>> -	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
>> -		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>> +	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
>> +		return;
>> +
>> +	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>> +
>> +	if (is_cxl_memdev(dev))
> As below. Drag to earlier patch.

Ok

>>  		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
>> -	}
>> +	else
> and perhaps check it's a port mostly for documentation purposes.
>

Ok

>> +		trace_cxl_port_aer_correctable_error(dev, status);
>>  }
>>  
>>  static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
>> @@ -724,7 +729,11 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
>>  	}
>>  
>>  	header_log_copy(ras_base, hl);
>> -	trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
>> +	if (is_cxl_memdev(dev))
> As mentioned above, drag this if to the earlier patch.

Ok

>> +		trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
>> +	else
> For documentation purposes mostly I'd be tempted to have an is_cxl_port() check
> before calling the following.
>
>> +		trace_cxl_port_aer_uncorrectable_error(dev, status, fe, hl);
>> +
>>  	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>>  
>>  	return true;
>> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
>> index 8389a94adb1a..681e415ac8f5 100644
>> --- a/drivers/cxl/core/trace.h
>> +++ b/drivers/cxl/core/trace.h
>> @@ -48,6 +48,34 @@
>>  	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
>>  )
>>  
>> +TRACE_EVENT(cxl_port_aer_uncorrectable_error,
>> +	TP_PROTO(struct device *dev, u32 status, u32 fe, u32 *hl),
>> +	TP_ARGS(dev, status, fe, hl),
>> +	TP_STRUCT__entry(
>> +		__string(devname, dev_name(dev))
>> +		__string(host, dev_name(dev->parent))
> What is host in this case? Perhaps a comment.
host is a string initialized with the value from dev_name(dev->parent). What
kind of comment would you like to see here?

Regards,
Terry

>> +		__field(u32, status)
>> +		__field(u32, first_error)
>> +		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
>> +	),
>> +	TP_fast_assign(
>> +		__assign_str(devname);
>> +		__assign_str(host);
>> +		__entry->status = status;
>> +		__entry->first_error = fe;
>> +		/*
>> +		 * Embed the 512B headerlog data for user app retrieval and
>> +		 * parsing, but no need to print this in the trace buffer.
>> +		 */
>> +		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
>> +	),
>> +	TP_printk("device=%s host=%s status: '%s' first_error: '%s'",
>> +		  __get_str(devname), __get_str(host),
>> +		  show_uc_errs(__entry->status),
>> +		  show_uc_errs(__entry->first_error)
>> +	)
>> +);
>> +
>>  TRACE_EVENT(cxl_aer_uncorrectable_error,
>>  	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
>>  	TP_ARGS(cxlmd, status, fe, hl),
>> @@ -96,6 +124,25 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
>>  	{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" }	\
>>  )
>>  
>> +TRACE_EVENT(cxl_port_aer_correctable_error,
>> +	TP_PROTO(struct device *dev, u32 status),
>> +	TP_ARGS(dev, status),
>> +	TP_STRUCT__entry(
>> +		__string(devname, dev_name(dev))
>> +		__string(host, dev_name(dev->parent))
>> +		__field(u32, status)
>> +	),
>> +	TP_fast_assign(
>> +		__assign_str(devname);
>> +		__assign_str(host);
>> +		__entry->status = status;
>> +	),
>> +	TP_printk("device=%s host=%s status='%s'",
>> +		  __get_str(devname), __get_str(host),
>> +		  show_ce_errs(__entry->status)
>> +	)
>> +);
>> +
>>  TRACE_EVENT(cxl_aer_correctable_error,
>>  	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
>>  	TP_ARGS(cxlmd, status),



* Re: [PATCH v5 13/16] cxl/pci: Add error handler for CXL PCIe Port RAS errors
  2025-01-14 11:46   ` Jonathan Cameron
@ 2025-01-14 21:20     ` Bowman, Terry
  0 siblings, 0 replies; 96+ messages in thread
From: Bowman, Terry @ 2025-01-14 21:20 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop




On 1/14/2025 5:46 AM, Jonathan Cameron wrote:
> On Tue, 7 Jan 2025 08:38:49 -0600
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> Introduce correctable and uncorrectable CXL PCIe Port Protocol Error
>> handlers.
>>
>> The handlers will be called with a 'struct pci_dev' parameter
>> indicating the CXL Port device requiring handling. The CXL PCIe Port
>> device's underlying 'struct device' will match the port device in the
>> CXL topology.
>>
>> Use the PCIe Port's device object to find the matching CXL Upstream Switch
>> Port, CXL Downstream Switch Port, or CXL Root Port in the CXL topology. The
>> matching CXL Port device should contain a cached reference to the RAS
>> register block. The cached RAS block will be used when handling the error.
>>
>> Invoke the existing __cxl_handle_ras() or __cxl_handle_cor_ras() using
>> a reference to the RAS registers as a parameter. These functions will use
>> the RAS register reference to indicate an error and clear the device's RAS
>> status.
>>
>> Future patches will assign the error handlers and add trace logging.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>  drivers/cxl/core/pci.c | 63 ++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 63 insertions(+)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 8275b3dc3589..411834f7efe0 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -776,6 +776,69 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>>  	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
>>  }
>>  
>> +static int match_uport(struct device *dev, const void *data)
>> +{
>> +	struct device *uport_dev = (struct device *)data;
> It should be const and then no need to cast explicitly.
>

Ok

>> +	struct cxl_port *port;
>> +
>> +	if (!is_cxl_port(dev))
>> +		return 0;
>> +
>> +	port = to_cxl_port(dev);
>> +
>> +	return port->uport_dev == uport_dev;
>> +}
>> +
>> +static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
>> +{
>> +	struct cxl_port *port;
>> +
>> +	if (!pdev)
>> +		return NULL;
>> +
>> +	if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
>> +	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
>> +		struct cxl_dport *dport;
>> +		void __iomem *ras_base;
>> +
>> +		port = find_cxl_port(&pdev->dev, &dport);
> Maybe some __free magic on port as then can just
> return dport ? dport->regs.ras : NULL;

Ok

>> +		ras_base = dport ? dport->regs.ras : NULL;
>> +		if (port)
>> +			put_device(&port->dev);
>> +		return ras_base;
>> +	} else if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {
> 	if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {
>
> or maybe just make it a switch statement?

I'll add the __free approach back.
>> +		struct device *port_dev;
>> +
>> +		port_dev = bus_find_device(&cxl_bus_type, NULL, &pdev->dev,
>> +					   match_uport);
> Likewise on __free magic to automate the put.
>
>> +		if (!port_dev)
>> +			return NULL;
>> +
>> +		port = to_cxl_port(port_dev);
>> +		if (!port)
> why no put of the port_dev?
I overlooked that. Thanks.

Regards,
Terry
>> +			return NULL;
>> +
>> +		put_device(port_dev);
>> +		return port->uport_regs.ras;
>> +	}
>> +
>> +	return NULL;
>> +}
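The missing put on the early-return path noted above is easier to see in a
reduced form. The following is a user-space sketch, not kernel code: the
model_* names and the plain integer refcount are hypothetical, and it uses
a single exit instead of the scope-based __free() cleanup suggested in the
review.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device with a manual reference count. */
struct model_dev {
	int refcount;
	bool is_port;
	void *ras; /* cached RAS register pointer, may be NULL */
};

/* Models a lookup that returns the device with a reference held. */
static struct model_dev *model_find_get(struct model_dev *dev)
{
	if (!dev)
		return NULL;
	dev->refcount++;
	return dev;
}

static void model_put(struct model_dev *dev)
{
	dev->refcount--;
}

static void *model_port_ras(struct model_dev *candidate)
{
	struct model_dev *dev = model_find_get(candidate);
	void *ras;

	if (!dev)
		return NULL; /* nothing was taken, nothing to drop */

	/* Read the cached pointer while the reference is still held. */
	ras = dev->is_port ? dev->ras : NULL;

	/* Single exit: the reference is dropped on success and failure. */
	model_put(dev);
	return ras;
}
```

With the lookup result funneled through one exit, the "not a port" case
can no longer return with the reference still held.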



* Re: [PATCH v5 08/16] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers
  2025-01-07 14:38 ` [PATCH v5 08/16] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers Terry Bowman
@ 2025-01-14 21:37   ` Ira Weiny
  2025-02-07  7:30   ` Gregory Price
  1 sibling, 0 replies; 96+ messages in thread
From: Ira Weiny @ 2025-01-14 21:37 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Terry Bowman wrote:
> The CXL mem driver (cxl_mem) currently maps and caches a pointer to RAS
> registers for the endpoint's Root Port. The same needs to be done for
> each of the CXL Downstream Switch Ports and CXL Root Ports found between
> the endpoint and CXL Host Bridge.
> 
> Introduce cxl_init_ep_ports_aer() to be called for each CXL Port in the
> sub-topology between the endpoint and the CXL Host Bridge. This function
> will determine if there are CXL Downstream Switch Ports or CXL Root Ports
> associated with this Port. The same check will be added in the future for
> upstream switch ports.
> 
> Move the RAS register map logic from cxl_dport_map_ras() into
> cxl_dport_init_ras_reporting(). This eliminates the need for the helper
> function, cxl_dport_map_ras().
> 
> cxl_init_ep_ports_aer() calls cxl_dport_init_ras_reporting() to map
> the RAS registers for CXL Downstream Switch Ports and CXL Root Ports.
> 
> cxl_dport_init_ras_reporting() must check for previously mapped registers
> before mapping. This is required because multiple endpoints under a CXL
> switch may share an upstream CXL Root Port or CXL Downstream Switch Port.
> Ensure the RAS registers are only mapped once.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Alejandro Lucero <alucerop@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]


* Re: [PATCH v5 09/16] cxl/pci: Map CXL PCIe Upstream Switch Port RAS registers
  2025-01-07 14:38 ` [PATCH v5 09/16] cxl/pci: Map CXL PCIe Upstream " Terry Bowman
  2025-01-14 11:35   ` Jonathan Cameron
@ 2025-01-14 22:02   ` Ira Weiny
  2025-01-14 22:11     ` Bowman, Terry
  2025-02-07  7:35   ` Gregory Price
  2 siblings, 1 reply; 96+ messages in thread
From: Ira Weiny @ 2025-01-14 22:02 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Terry Bowman wrote:
> Add logic to map CXL PCIe Upstream Switch Port (USP) RAS registers.
> 
> Introduce 'struct cxl_regs' member into 'struct cxl_port' to cache a
> pointer to the CXL Upstream Port's mapped RAS registers.
> 
> Also, introduce cxl_uport_init_ras_reporting() to perform the USP RAS
> register mapping. This is similar to the existing
> cxl_dport_init_ras_reporting() but for USP devices.
> 
> The USP may have multiple downstream endpoints. Before mapping AER
> registers check if the registers are already mapped.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/core/pci.c | 15 +++++++++++++++
>  drivers/cxl/cxl.h      |  4 ++++
>  drivers/cxl/mem.c      |  8 ++++++++
>  3 files changed, 27 insertions(+)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 1af2d0a14f5d..97e6a15bea88 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -773,6 +773,21 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>  	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
>  }
>  
> +void cxl_uport_init_ras_reporting(struct cxl_port *port)
> +{
> +	/* uport may have more than 1 downstream EP. Check if already mapped. */
> +	if (port->uport_regs.ras)
> +		return;
> +
> +	port->reg_map.host = &port->dev;
> +	if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
> +				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
> +		dev_err(&port->dev, "Failed to map RAS capability.\n");
> +		return;

Why return here?  Actually I think 8/16 had the same issue now that I see
this.

Other than that:

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]


* Re: [PATCH v5 09/16] cxl/pci: Map CXL PCIe Upstream Switch Port RAS registers
  2025-01-14 22:02   ` Ira Weiny
@ 2025-01-14 22:11     ` Bowman, Terry
  2025-01-14 23:38       ` Ira Weiny
  0 siblings, 1 reply; 96+ messages in thread
From: Bowman, Terry @ 2025-01-14 22:11 UTC (permalink / raw)
  To: Ira Weiny, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
	rrichter, nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
	ming.li, PradeepVineshReddy.Kodamati, alucerop




On 1/14/2025 4:02 PM, Ira Weiny wrote:
> Terry Bowman wrote:
>> Add logic to map CXL PCIe Upstream Switch Port (USP) RAS registers.
>>
>> Introduce 'struct cxl_regs' member into 'struct cxl_port' to cache a
>> pointer to the CXL Upstream Port's mapped RAS registers.
>>
>> Also, introduce cxl_uport_init_ras_reporting() to perform the USP RAS
>> register mapping. This is similar to the existing
>> cxl_dport_init_ras_reporting() but for USP devices.
>>
>> The USP may have multiple downstream endpoints. Before mapping AER
>> registers check if the registers are already mapped.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>  drivers/cxl/core/pci.c | 15 +++++++++++++++
>>  drivers/cxl/cxl.h      |  4 ++++
>>  drivers/cxl/mem.c      |  8 ++++++++
>>  3 files changed, 27 insertions(+)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 1af2d0a14f5d..97e6a15bea88 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -773,6 +773,21 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>>  	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
>>  }
>>  
>> +void cxl_uport_init_ras_reporting(struct cxl_port *port)
>> +{
>> +	/* uport may have more than 1 downstream EP. Check if already mapped. */
>> +	if (port->uport_regs.ras)
>> +		return;
>> +
>> +	port->reg_map.host = &port->dev;
>> +	if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
>> +				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
>> +		dev_err(&port->dev, "Failed to map RAS capability.\n");
>> +		return;
> Why return here?  Actually I think 8/16 had the same issue now that I see
> this.
>
> Other than that:
>
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
>
> [snip]
If mapping the RAS registers fails, return early to skip CXL Port error
handler initialization. The CXL Port error handlers rely on the RAS
registers for logging, and without mapped RAS registers the handlers
would return immediately.

Regards,
Terry
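
The early-return behavior described above can be sketched in plain C. This
is a minimal userspace model, not the kernel code: 'struct sketch_port',
sketch_map_ras() and the 'handlers_set' flag are hypothetical stand-ins
for 'struct cxl_port', cxl_map_component_regs() and the later handler
assignment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the parts of struct cxl_port used here. */
struct sketch_port {
	void *ras;		/* cached RAS mapping, NULL until mapped */
	bool handlers_set;	/* set only after a successful mapping */
	int map_calls;		/* counts mapping attempts */
};

/* Hypothetical stand-in for cxl_map_component_regs(): 0 on success. */
int sketch_map_ras(struct sketch_port *port, bool fail)
{
	static int fake_regs;

	port->map_calls++;
	if (fail)
		return -1;
	port->ras = &fake_regs;
	return 0;
}

void sketch_uport_init_ras(struct sketch_port *port, bool fail)
{
	/* Several endpoints may share one upstream port: map only once. */
	if (port->ras)
		return;

	/* Without RAS registers the handlers have nothing to log. */
	if (sketch_map_ras(port, fail))
		return;

	port->handlers_set = true;
}
```

In this model a failed mapping leaves 'handlers_set' false, and a later
call retries the mapping because 'ras' is still NULL.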


* Re: [PATCH v5 10/16] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports
  2025-01-07 14:38 ` [PATCH v5 10/16] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports Terry Bowman
  2025-01-14 11:39   ` Jonathan Cameron
@ 2025-01-14 22:20   ` Ira Weiny
  2025-02-07  7:38   ` Gregory Price
  2 siblings, 0 replies; 96+ messages in thread
From: Ira Weiny @ 2025-01-14 22:20 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Terry Bowman wrote:
> CXL PCIe Port Protocol Error handling support will be added to the
> CXL drivers in the future. In preparation, rename the existing
> interfaces to support handling all CXL PCIe Port Protocol Errors.
> 
> The driver's RAS support functions currently rely on a 'struct
> cxl_dev_state' type parameter, which is not available for CXL Port
> devices. However, since the same CXL RAS capability structure is
> needed across most CXL components and devices, a common handling
> approach should be adopted.
> 
> To accommodate this, update the __cxl_handle_cor_ras() and
> __cxl_handle_ras() functions to use a `struct device` instead of
> `struct cxl_dev_state`.
> 
> No functional changes are introduced.
> 
> [1] CXL 3.1 Spec, 8.2.4 CXL.cache and CXL.mem Registers
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Alejandro Lucero <alucerop@amd.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]


* Re: [PATCH v5 11/16] cxl/pci: Add log message for umnapped registers in existing RAS handlers
  2025-01-07 14:38 ` [PATCH v5 11/16] cxl/pci: Add log message for umnapped registers in existing RAS handlers Terry Bowman
  2025-01-14 11:41   ` Jonathan Cameron
@ 2025-01-14 22:21   ` Ira Weiny
  2025-02-07  7:39   ` Gregory Price
  2 siblings, 0 replies; 96+ messages in thread
From: Ira Weiny @ 2025-01-14 22:21 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Terry Bowman wrote:
> The CXL RAS handlers do not currently log if the RAS registers are
> unmapped. This is needed inorder to help debug CXL error handling. Update
                           ^^^^^^^
                           in order


Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]


* Re: [PATCH v5 12/16] cxl/pci: Change find_cxl_port() to non-static
  2025-01-07 14:38 ` [PATCH v5 12/16] cxl/pci: Change find_cxl_port() to non-static Terry Bowman
@ 2025-01-14 22:23   ` Ira Weiny
  2025-02-07  7:45     ` Gregory Price
  0 siblings, 1 reply; 96+ messages in thread
From: Ira Weiny @ 2025-01-14 22:23 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Terry Bowman wrote:
> CXL PCIe Port Protocol Error support will be added in the future. This
> requires searching for a CXL PCIe Port device in the CXL topology as
> provided by find_cxl_port(). But, find_cxl_port() is defined static
> and as a result is not callable outside of this source file.
> 
> Update the find_cxl_port() declaration to be non-static.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Generally I think Dan prefers this type of patch to be squashed with the
patch which requires the change.  But I'm ok with the smaller patches...

:-D

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]


* Re: [PATCH v5 13/16] cxl/pci: Add error handler for CXL PCIe Port RAS errors
  2025-01-07 14:38 ` [PATCH v5 13/16] cxl/pci: Add error handler for CXL PCIe Port RAS errors Terry Bowman
  2025-01-14 11:46   ` Jonathan Cameron
@ 2025-01-14 22:51   ` Ira Weiny
  2025-01-14 23:10     ` Bowman, Terry
  2025-01-14 23:42     ` Bowman, Terry
  2025-02-07  8:01   ` Gregory Price
  2 siblings, 2 replies; 96+ messages in thread
From: Ira Weiny @ 2025-01-14 22:51 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Terry Bowman wrote:
> Introduce correctable and uncorrectable CXL PCIe Port Protocol Error
> handlers.
> 
> The handlers will be called with a 'struct pci_dev' parameter
> indicating the CXL Port device requiring handling. The CXL PCIe Port
> device's underlying 'struct device' will match the port device in the
> CXL topology.
> 
> Use the PCIe Port's device object to find the matching CXL Upstream Switch
> Port, CXL Downstream Switch Port, or CXL Root Port in the CXL topology. The
> matching CXL Port device should contain a cached reference to the RAS
> register block. The cached RAS block will be used handling the error.
> 
> Invoke the existing __cxl_handle_ras() or __cxl_handle_cor_ras() using
> a reference to the RAS registers as a parameter. These functions will use
> the RAS register reference to indicate an error and clear the device's RAS
> status.
> 
> Future patches will assign the error handlers and add trace logging.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/core/pci.c | 63 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 63 insertions(+)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 8275b3dc3589..411834f7efe0 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -776,6 +776,69 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>  	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
>  }
>  
> +static int match_uport(struct device *dev, const void *data)
> +{
> +	struct device *uport_dev = (struct device *)data;
> +	struct cxl_port *port;
> +
> +	if (!is_cxl_port(dev))
> +		return 0;
> +
> +	port = to_cxl_port(dev);
> +
> +	return port->uport_dev == uport_dev;
> +}
> +
> +static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
> +{
> +	struct cxl_port *port;
> +
> +	if (!pdev)
> +		return NULL;
> +
> +	if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
> +	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
> +		struct cxl_dport *dport;
> +		void __iomem *ras_base;
> +
> +		port = find_cxl_port(&pdev->dev, &dport);
> +		ras_base = dport ? dport->regs.ras : NULL;
> +		if (port)
> +			put_device(&port->dev);
> +		return ras_base;
> +	} else if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {
> +		struct device *port_dev;
> +
> +		port_dev = bus_find_device(&cxl_bus_type, NULL, &pdev->dev,
> +					   match_uport);
> +		if (!port_dev)
> +			return NULL;
> +
> +		port = to_cxl_port(port_dev);
> +		if (!port)
> +			return NULL;
> +
> +		put_device(port_dev);

Is there any chance the cxl_port (and subsequently the mapping of the ras
registers) could go away between here and their use in
__cxl_handle_*_ras()?

Ira

> +		return port->uport_regs.ras;
> +	}
> +
> +	return NULL;
> +}
> +
> +static void cxl_port_cor_error_detected(struct pci_dev *pdev)
> +{
> +	void __iomem *ras_base = cxl_pci_port_ras(pdev);
> +
> +	__cxl_handle_cor_ras(&pdev->dev, ras_base);
> +}
> +
> +static bool cxl_port_error_detected(struct pci_dev *pdev)
> +{
> +	void __iomem *ras_base = cxl_pci_port_ras(pdev);
> +
> +	return __cxl_handle_ras(&pdev->dev, ras_base);
> +}
> +
>  void cxl_uport_init_ras_reporting(struct cxl_port *port)
>  {
>  	/* uport may have more than 1 downstream EP. Check if already mapped. */
> -- 
> 2.34.1
> 




* Re: [PATCH v5 14/16] cxl/pci: Add trace logging for CXL PCIe Port RAS errors
  2025-01-07 14:38 ` [PATCH v5 14/16] cxl/pci: Add trace logging " Terry Bowman
  2025-01-14 11:49   ` Jonathan Cameron
@ 2025-01-14 22:58   ` Ira Weiny
  1 sibling, 0 replies; 96+ messages in thread
From: Ira Weiny @ 2025-01-14 22:58 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Terry Bowman wrote:
> The CXL drivers use kernel trace functions for logging endpoint and
> Restricted CXL host (RCH) Downstream Port RAS errors. Similar functionality
> is required for CXL Root Ports, CXL Downstream Switch Ports, and CXL
> Upstream Switch Ports.
> 
> Introduce trace logging functions for both RAS correctable and
> uncorrectable errors specific to CXL PCIe Ports. Additionally, update
> the CXL Port Protocol Error handlers to invoke these new trace functions.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Alejandro Lucero <alucerop@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---

Reviewed-by: Ira Weiny <ira.weiny@intel.com>

[snip]


* Re: [PATCH v5 15/16] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
  2025-01-07 14:38 ` [PATCH v5 15/16] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers Terry Bowman
  2025-01-14 11:51   ` Jonathan Cameron
@ 2025-01-14 23:03   ` Ira Weiny
  2025-02-07  8:08   ` Gregory Price
  2 siblings, 0 replies; 96+ messages in thread
From: Ira Weiny @ 2025-01-14 23:03 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Terry Bowman wrote:
> pci_driver::cxl_err_handlers are not currently assigned handler callbacks.
> The handlers can't be set in the pci_driver static definition because the
> CXL PCIe Port devices are bound to the portdrv driver which is not CXL
> driver aware.
> 
> Add cxl_assign_port_error_handlers() in the cxl_core module. This
> function will assign the default handlers for a CXL PCIe Port device.
> 
> When the CXL Port (cxl_port or cxl_dport) is destroyed the device's
> pci_driver::cxl_err_handlers must be set to NULL indicating they should no
> longer be used.
> 
> Create cxl_clear_port_error_handlers() and register it to be called
> when the CXL Port device (cxl_port or cxl_dport) is destroyed.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

Reviewed-by: Ira Weiny <ira.weiny@intel.com>


* Re: [PATCH v5 13/16] cxl/pci: Add error handler for CXL PCIe Port RAS errors
  2025-01-14 22:51   ` Ira Weiny
@ 2025-01-14 23:10     ` Bowman, Terry
  2025-01-14 23:42     ` Bowman, Terry
  1 sibling, 0 replies; 96+ messages in thread
From: Bowman, Terry @ 2025-01-14 23:10 UTC (permalink / raw)
  To: Ira Weiny, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
	rrichter, nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
	ming.li, PradeepVineshReddy.Kodamati, alucerop




On 1/14/2025 4:51 PM, Ira Weiny wrote:
> Terry Bowman wrote:
>> Introduce correctable and uncorrectable CXL PCIe Port Protocol Error
>> handlers.
>>
>> The handlers will be called with a 'struct pci_dev' parameter
>> indicating the CXL Port device requiring handling. The CXL PCIe Port
>> device's underlying 'struct device' will match the port device in the
>> CXL topology.
>>
>> Use the PCIe Port's device object to find the matching CXL Upstream Switch
>> Port, CXL Downstream Switch Port, or CXL Root Port in the CXL topology. The
>> matching CXL Port device should contain a cached reference to the RAS
>> register block. The cached RAS block will be used handling the error.
>>
>> Invoke the existing __cxl_handle_ras() or __cxl_handle_cor_ras() using
>> a reference to the RAS registers as a parameter. These functions will use
>> the RAS register reference to indicate an error and clear the device's RAS
>> status.
>>
>> Future patches will assign the error handlers and add trace logging.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>  drivers/cxl/core/pci.c | 63 ++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 63 insertions(+)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 8275b3dc3589..411834f7efe0 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -776,6 +776,69 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>>  	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
>>  }
>>  
>> +static int match_uport(struct device *dev, const void *data)
>> +{
>> +	struct device *uport_dev = (struct device *)data;
>> +	struct cxl_port *port;
>> +
>> +	if (!is_cxl_port(dev))
>> +		return 0;
>> +
>> +	port = to_cxl_port(dev);
>> +
>> +	return port->uport_dev == uport_dev;
>> +}
>> +
>> +static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
>> +{
>> +	struct cxl_port *port;
>> +
>> +	if (!pdev)
>> +		return NULL;
>> +
>> +	if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
>> +	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
>> +		struct cxl_dport *dport;
>> +		void __iomem *ras_base;
>> +
>> +		port = find_cxl_port(&pdev->dev, &dport);
>> +		ras_base = dport ? dport->regs.ras : NULL;
>> +		if (port)
>> +			put_device(&port->dev);
>> +		return ras_base;
>> +	} else if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {
>> +		struct device *port_dev;
>> +
>> +		port_dev = bus_find_device(&cxl_bus_type, NULL, &pdev->dev,
>> +					   match_uport);
>> +		if (!port_dev)
>> +			return NULL;
>> +
>> +		port = to_cxl_port(port_dev);
>> +		if (!port)
>> +			return NULL;
>> +
>> +		put_device(port_dev);
> Is there any chance the cxl_port (and subsequently the mapping of the ras
> registers) could go away between here and their use in
> __cxl_handle_*_ras()?
>
> Ira

Yes. I believe that is possible.

Regards,
Terry
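
One possible shape for closing that window is to keep the reference until
the error has been handled, rather than dropping it before the RAS pointer
is returned. The sketch below is a userspace model only and is not a fix
proposed in this thread: the 'sketch_' names and the integer refcount are
hypothetical, and whether a device reference alone keeps the real register
mapping alive is not settled here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted port with cached RAS state. */
struct sketch_port {
	int refcount;
	int ras_status;	/* stands in for the mapped RAS registers */
	int released;	/* set when the last reference is dropped */
};

void sketch_get(struct sketch_port *p)
{
	p->refcount++;
}

void sketch_put(struct sketch_port *p)
{
	if (--p->refcount == 0)
		p->released = 1;	/* real code would tear down here */
}

/* Look up the port and return it with a reference held. */
struct sketch_port *sketch_find_port(struct sketch_port *candidate)
{
	sketch_get(candidate);
	return candidate;
}

/* Use the RAS state while the reference is held, then drop it. */
int sketch_handle_error(struct sketch_port *candidate)
{
	struct sketch_port *port = sketch_find_port(candidate);
	int handled;

	handled = !port->released && port->ras_status != 0;
	port->ras_status = 0;	/* clear the logged status */
	sketch_put(port);

	return handled;
}
```

The difference from the quoted code is only the ordering: the put happens
after the last access to the cached RAS state instead of before it.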

>> +		return port->uport_regs.ras;
>> +	}
>> +
>> +	return NULL;
>> +}
>> +
>> +static void cxl_port_cor_error_detected(struct pci_dev *pdev)
>> +{
>> +	void __iomem *ras_base = cxl_pci_port_ras(pdev);
>> +
>> +	__cxl_handle_cor_ras(&pdev->dev, ras_base);
>> +}
>> +
>> +static bool cxl_port_error_detected(struct pci_dev *pdev)
>> +{
>> +	void __iomem *ras_base = cxl_pci_port_ras(pdev);
>> +
>> +	return __cxl_handle_ras(&pdev->dev, ras_base);
>> +}
>> +
>>  void cxl_uport_init_ras_reporting(struct cxl_port *port)
>>  {
>>  	/* uport may have more than 1 downstream EP. Check if already mapped. */
>> -- 
>> 2.34.1
>>
>



* Re: [PATCH v5 16/16] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports
  2025-01-07 14:38 ` [PATCH v5 16/16] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports Terry Bowman
@ 2025-01-14 23:26   ` Ira Weiny
  2025-01-14 23:34     ` Bowman, Terry
  0 siblings, 1 reply; 96+ messages in thread
From: Ira Weiny @ 2025-01-14 23:26 UTC (permalink / raw)
  To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Terry Bowman wrote:
> The AER service driver enables PCIe Uncorrectable Internal Errors (UIE) and
> Correctable Internal errors (CIE) for CXL Root Ports. The UIE and CIE are
> used in reporting CXL Protocol Errors. The same UIE/CIE enablement is
> needed for CXL Upstream Switch Ports and CXL Downstream Switch Ports
> inorder to notify the associated Root Port and OS.[1]
> 
> Export the AER service driver's pci_aer_unmask_internal_errors() function
> to CXL namespace.
> 
> Remove the function's dependency on the CONFIG_PCIEAER_CXL kernel config
> because it is now an exported function.

This seems wrong to me.  As of this patch CXL_PCI requires PCIEAER_CXL for
the AER code to handle the errors which were just enabled.

To keep PCIEAER_CXL optional pci_aer_unmask_internal_errors() should be
stubbed out in aer.h if !CONFIG_PCIEAER_CXL.

Ira
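
The stub approach suggested above follows a common header pattern: a real
definition when the option is enabled and an empty static inline
otherwise. The sketch below is a userspace illustration under assumed
names; SKETCH_CONFIG_AER_CXL, 'struct sketch_dev' and SKETCH_COR_INTERNAL
are hypothetical and do not come from aer.h.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_COR_INTERNAL 0x4000u	/* hypothetical mask bit */

/* Hypothetical stand-in for the device's correctable error mask. */
struct sketch_dev {
	uint32_t cor_mask;
};

#ifdef SKETCH_CONFIG_AER_CXL
/* Option enabled: clear the mask bit so internal errors are reported. */
void sketch_unmask_internal_errors(struct sketch_dev *dev)
{
	dev->cor_mask &= ~SKETCH_COR_INTERNAL;
}
#else
/* Option disabled: callers still compile, but nothing is unmasked. */
static inline void sketch_unmask_internal_errors(struct sketch_dev *dev)
{
	(void)dev;
}
#endif
```

With the option undefined, callers need no #ifdef of their own and the
mask is left untouched, so no errors are enabled without a handler.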

> 
> Call pci_aer_unmask_internal_errors() during RAS initialization in:
> cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting().
> 
> [1] PCIe Base Spec r6.2-1.0, 6.2.3.2.2 Masking Individual Errors
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
>  drivers/cxl/core/pci.c | 2 ++
>  drivers/pci/pcie/aer.c | 5 +++--
>  include/linux/aer.h    | 1 +
>  3 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 9c162120f0fe..c62329cd9a87 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -895,6 +895,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
>  
>  	cxl_assign_port_error_handlers(pdev);
>  	devm_add_action_or_reset(&port->dev, cxl_clear_port_error_handlers, pdev);
> +	pci_aer_unmask_internal_errors(pdev);
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, "CXL");
>  
> @@ -935,6 +936,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>  	}
>  	cxl_assign_port_error_handlers(pdev);
>  	devm_add_action_or_reset(&port->dev, cxl_clear_port_error_handlers, pdev);
> +	pci_aer_unmask_internal_errors(pdev);
>  	put_device(&port->dev);
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 68e957459008..e6aaa3bd84f0 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -950,7 +950,6 @@ static bool is_internal_error(struct aer_err_info *info)
>  	return info->status & PCI_ERR_UNC_INTN;
>  }
>  
> -#ifdef CONFIG_PCIEAER_CXL
>  /**
>   * pci_aer_unmask_internal_errors - unmask internal errors
>   * @dev: pointer to the pcie_dev data structure
> @@ -961,7 +960,7 @@ static bool is_internal_error(struct aer_err_info *info)
>   * Note: AER must be enabled and supported by the device which must be
>   * checked in advance, e.g. with pcie_aer_is_native().
>   */
> -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> +void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>  {
>  	int aer = dev->aer_cap;
>  	u32 mask;
> @@ -974,7 +973,9 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>  	mask &= ~PCI_ERR_COR_INTERNAL;
>  	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
>  }
> +EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, "CXL");
>  
> +#ifdef CONFIG_PCIEAER_CXL
>  static bool is_cxl_mem_dev(struct pci_dev *dev)
>  {
>  	/*
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 4b97f38f3fcf..093293f9f12b 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -55,5 +55,6 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  int cper_severity_to_aer(int cper_severity);
>  void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
>  		       int severity, struct aer_capability_regs *aer_regs);
> +void pci_aer_unmask_internal_errors(struct pci_dev *dev);
>  #endif //_AER_H_
>  
> -- 
> 2.34.1
> 




* Re: [PATCH v5 03/16] CXL/PCI: Introduce PCIe helper functions pcie_is_cxl() and pcie_is_cxl_port()
  2025-01-14 15:19     ` Bowman, Terry
@ 2025-01-14 23:33       ` Ira Weiny
  2025-01-14 23:39         ` Bowman, Terry
  2025-01-15 10:03       ` Lukas Wunner
  1 sibling, 1 reply; 96+ messages in thread
From: Ira Weiny @ 2025-01-14 23:33 UTC (permalink / raw)
  To: Bowman, Terry, Ira Weiny, linux-cxl, linux-kernel, linux-pci,
	nifan.cxl, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, dan.j.williams, bhelgaas, mahesh, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Bowman, Terry wrote:
> 
> 
> 
> On 1/13/2025 5:49 PM, Ira Weiny wrote:
> > Terry Bowman wrote:
> >> CXL and AER drivers need the ability to identify CXL devices and CXL port
> >> devices.
> >>
> >> First, add set_pcie_cxl() with logic checking for CXL Flexbus DVSEC
> >> presence. The CXL Flexbus DVSEC presence is used because it is required
> >> for all the CXL PCIe devices.[1]
> >>
> >> Add boolean 'struct pci_dev::is_cxl' with the purpose to cache the CXL
> >> Flexbus presence.
> >>
> >> Add pcie_is_cxl() as a macro to return 'struct pci_dev::is_cxl'.
> >>
> >> Add pcie_is_cxl_port() to check if a device is a CXL Root Port, CXL
> >> Upstream Switch Port, or CXL Downstream Switch Port. Also, verify the
> >> CXL Extensions DVSEC for Ports is present.[1]
> >>
> >> [1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
> >>     Capability (DVSEC) ID Assignment, Table 8-2
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> >> Reviewed-by: Fan Ni <fan.ni@samsung.com>
> >> ---
> >>  drivers/pci/pci.c             | 13 +++++++++++++
> >>  drivers/pci/probe.c           | 10 ++++++++++
> >>  include/linux/pci.h           |  4 ++++
> >>  include/uapi/linux/pci_regs.h |  3 ++-
> >>  4 files changed, 29 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> >> index 661f98c6c63a..9319c62e3488 100644
> >> --- a/drivers/pci/pci.c
> >> +++ b/drivers/pci/pci.c
> >> @@ -5036,10 +5036,23 @@ static int pci_dev_reset_slot_function(struct pci_dev *dev, bool probe)
> >>  
> >>  static u16 cxl_port_dvsec(struct pci_dev *dev)
> >>  {
> >> +	if (!pcie_is_cxl(dev))
> >> +		return 0;
> >> +
> >>  	return pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
> >>  					 PCI_DVSEC_CXL_PORT);
> >>  }
> >>  
> >> +bool pcie_is_cxl_port(struct pci_dev *dev)
> >> +{
> >> +	if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
> >> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM) &&
> >> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
> >> +		return false;
> >> +
> >> +	return cxl_port_dvsec(dev);
> > Returning bool from a function which returns u16 is odd and I don't think
> > it should be coded this way.  I don't think it is wrong right now but this
> > really ought to code the pcie_is_cxl() here and leave cxl_port_dvsec()
> > alone.  Calling cxl_port_dvsec(), checking for if the dvsec exists, and
> > returning bool.
> 
> Hi Ira,
> 
> Thanks for reviewing. Is this what you are looking for here:
> 
> +bool pcie_is_cxl_port(struct pci_dev *dev)
> +{
> +	return (cxl_port_dvsec(dev) > 0);

With the type checks, yes that is more clear.

Ira

[snip]
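
The agreed shape, with the port type checks kept and the u16 offset
compared explicitly, can be sketched as follows. This is a userspace
model: 'struct sketch_pci_dev', the SKETCH_TYPE_* values and
sketch_port_dvsec() are hypothetical stand-ins for 'struct pci_dev',
pci_pcie_type() and cxl_port_dvsec().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum sketch_type {
	SKETCH_TYPE_ROOT_PORT,
	SKETCH_TYPE_UPSTREAM,
	SKETCH_TYPE_DOWNSTREAM,
	SKETCH_TYPE_ENDPOINT,
};

/* Hypothetical stand-in for the fields of struct pci_dev used here. */
struct sketch_pci_dev {
	enum sketch_type type;
	uint16_t port_dvsec;	/* capability offset, 0 when absent */
};

/* Returns the capability offset, or 0 if the DVSEC is not present. */
uint16_t sketch_port_dvsec(const struct sketch_pci_dev *dev)
{
	return dev->port_dvsec;
}

bool sketch_is_cxl_port(const struct sketch_pci_dev *dev)
{
	if (dev->type != SKETCH_TYPE_ROOT_PORT &&
	    dev->type != SKETCH_TYPE_UPSTREAM &&
	    dev->type != SKETCH_TYPE_DOWNSTREAM)
		return false;

	/* Explicit comparison: an offset is not itself a truth value. */
	return sketch_port_dvsec(dev) > 0;
}
```

The lookup helper keeps returning an offset, and only the predicate
converts it to a bool.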


* Re: [PATCH v5 16/16] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports
  2025-01-14 23:26   ` Ira Weiny
@ 2025-01-14 23:34     ` Bowman, Terry
  2025-01-14 23:45       ` Ira Weiny
  0 siblings, 1 reply; 96+ messages in thread
From: Bowman, Terry @ 2025-01-14 23:34 UTC (permalink / raw)
  To: Ira Weiny, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
	rrichter, nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
	ming.li, PradeepVineshReddy.Kodamati, alucerop




On 1/14/2025 5:26 PM, Ira Weiny wrote:
> Terry Bowman wrote:
>> The AER service driver enables PCIe Uncorrectable Internal Errors (UIE) and
>> Correctable Internal errors (CIE) for CXL Root Ports. The UIE and CIE are
>> used in reporting CXL Protocol Errors. The same UIE/CIE enablement is
>> needed for CXL Upstream Switch Ports and CXL Downstream Switch Ports
>> inorder to notify the associated Root Port and OS.[1]
>>
>> Export the AER service driver's pci_aer_unmask_internal_errors() function
>> to CXL namespace.
>>
>> Remove the function's dependency on the CONFIG_PCIEAER_CXL kernel config
>> because it is now an exported function.
> This seems wrong to me.  As of this patch CXL_PCI requires PCIEAER_CXL for
> the AER code to handle the errors which were just enabled.
>
> To keep PCIEAER_CXL optional pci_aer_unmask_internal_errors() should be
> stubbed out in aer.h if !CONFIG_PCIEAER_CXL.
>
> Ira

Bjorn (I believe in v1 or v2) directed me to remove the
pci_aer_unmask_internal_errors() dependency on PCIEAER_CXL because the
function is now exported. He wants the behavior for other users (and
subsystems) to be consistent with or without the PCIEAER_CXL setting.

Regards,
Terry

>> Call pci_aer_unmask_internal_errors() during RAS initialization in:
>> cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting().
>>
>> [1] PCIe Base Spec r6.2-1.0, 6.2.3.2.2 Masking Individual Errors
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> ---
>>  drivers/cxl/core/pci.c | 2 ++
>>  drivers/pci/pcie/aer.c | 5 +++--
>>  include/linux/aer.h    | 1 +
>>  3 files changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 9c162120f0fe..c62329cd9a87 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -895,6 +895,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
>>  
>>  	cxl_assign_port_error_handlers(pdev);
>>  	devm_add_action_or_reset(&port->dev, cxl_clear_port_error_handlers, pdev);
>> +	pci_aer_unmask_internal_errors(pdev);
>>  }
>>  EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, "CXL");
>>  
>> @@ -935,6 +936,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>>  	}
>>  	cxl_assign_port_error_handlers(pdev);
>>  	devm_add_action_or_reset(&port->dev, cxl_clear_port_error_handlers, pdev);
>> +	pci_aer_unmask_internal_errors(pdev);
>>  	put_device(&port->dev);
>>  }
>>  EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 68e957459008..e6aaa3bd84f0 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -950,7 +950,6 @@ static bool is_internal_error(struct aer_err_info *info)
>>  	return info->status & PCI_ERR_UNC_INTN;
>>  }
>>  
>> -#ifdef CONFIG_PCIEAER_CXL
>>  /**
>>   * pci_aer_unmask_internal_errors - unmask internal errors
>>   * @dev: pointer to the pcie_dev data structure
>> @@ -961,7 +960,7 @@ static bool is_internal_error(struct aer_err_info *info)
>>   * Note: AER must be enabled and supported by the device which must be
>>   * checked in advance, e.g. with pcie_aer_is_native().
>>   */
>> -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>> +void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>>  {
>>  	int aer = dev->aer_cap;
>>  	u32 mask;
>> @@ -974,7 +973,9 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>>  	mask &= ~PCI_ERR_COR_INTERNAL;
>>  	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
>>  }
>> +EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, "CXL");
>>  
>> +#ifdef CONFIG_PCIEAER_CXL
>>  static bool is_cxl_mem_dev(struct pci_dev *dev)
>>  {
>>  	/*
>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>> index 4b97f38f3fcf..093293f9f12b 100644
>> --- a/include/linux/aer.h
>> +++ b/include/linux/aer.h
>> @@ -55,5 +55,6 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>>  int cper_severity_to_aer(int cper_severity);
>>  void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
>>  		       int severity, struct aer_capability_regs *aer_regs);
>> +void pci_aer_unmask_internal_errors(struct pci_dev *dev);
>>  #endif //_AER_H_
>>  
>> -- 
>> 2.34.1
>>
>



* Re: [PATCH v5 09/16] cxl/pci: Map CXL PCIe Upstream Switch Port RAS registers
  2025-01-14 22:11     ` Bowman, Terry
@ 2025-01-14 23:38       ` Ira Weiny
  2025-01-14 23:49         ` Bowman, Terry
  0 siblings, 1 reply; 96+ messages in thread
From: Ira Weiny @ 2025-01-14 23:38 UTC (permalink / raw)
  To: Bowman, Terry, Ira Weiny, linux-cxl, linux-kernel, linux-pci,
	nifan.cxl, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, dan.j.williams, bhelgaas, mahesh, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Bowman, Terry wrote:
> 
> 
> 
> On 1/14/2025 4:02 PM, Ira Weiny wrote:
> > Terry Bowman wrote:
> >> Add logic to map CXL PCIe Upstream Switch Port (USP) RAS registers.
> >>
> >> Introduce 'struct cxl_regs' member into 'struct cxl_port' to cache a
> >> pointer to the CXL Upstream Port's mapped RAS registers.
> >>
> >> Also, introduce cxl_uport_init_ras_reporting() to perform the USP RAS
> >> register mapping. This is similar to the existing
> >> cxl_dport_init_ras_reporting() but for USP devices.
> >>
> >> The USP may have multiple downstream endpoints. Before mapping AER
> >> registers check if the registers are already mapped.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> ---
> >>  drivers/cxl/core/pci.c | 15 +++++++++++++++
> >>  drivers/cxl/cxl.h      |  4 ++++
> >>  drivers/cxl/mem.c      |  8 ++++++++
> >>  3 files changed, 27 insertions(+)
> >>
> >> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> >> index 1af2d0a14f5d..97e6a15bea88 100644
> >> --- a/drivers/cxl/core/pci.c
> >> +++ b/drivers/cxl/core/pci.c
> >> @@ -773,6 +773,21 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
> >>  	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
> >>  }
> >>  
> >> +void cxl_uport_init_ras_reporting(struct cxl_port *port)
> >> +{
> >> +	/* uport may have more than 1 downstream EP. Check if already mapped. */
> >> +	if (port->uport_regs.ras)
> >> +		return;
> >> +
> >> +	port->reg_map.host = &port->dev;
> >> +	if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
> >> +				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
> >> +		dev_err(&port->dev, "Failed to map RAS capability.\n");
> >> +		return;
> > Why return here?  Actually I think 8/16 had the same issue now that I see
> > this.
> >
> > Other than that:
> >
> > Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> >
> > [snip]
> If RAS registers fail mapping then exit to avoid CXL Port error handler initialization.
> The CXL Port error handlers rely on RAS registers for logging and without mapped RAS
> registers the error handlers will return immediately.

Sorry I was not clear and I should not have clipped the text so much.  You
return in a block which is at the end of the function:


+void cxl_uport_init_ras_reporting(struct cxl_port *port)
+{
+       /* uport may have more than 1 downstream EP. Check if already mapped. */
+       if (port->uport_regs.ras)
+               return;
+
+       port->reg_map.host = &port->dev;
+       if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
+                                  BIT(CXL_CM_CAP_CAP_ID_RAS))) {
+               dev_err(&port->dev, "Failed to map RAS capability.\n");
+               return;
+       }
+}

So no need for this specific statement?

Ira
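
To illustrate the point above, here is a minimal user-space sketch (not the
kernel code). The fake_* types and fake_map_ras() are hypothetical stand-ins
for 'struct cxl_port' and cxl_map_component_regs(); only the control flow is
meant to match. Because the failure branch is the last statement of a void
function, dropping the 'return' should change nothing:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct fake_regs {
	void *ras;
};

struct fake_port {
	struct fake_regs uport_regs;
	int map_calls;   /* how often the mapping helper ran */
	bool map_fails;  /* force the mapping helper to fail */
};

static char fake_ras_block[64];

/* Stand-in for cxl_map_component_regs(): returns 0 on success. */
static int fake_map_ras(struct fake_port *port)
{
	port->map_calls++;
	if (port->map_fails)
		return -1;
	port->uport_regs.ras = fake_ras_block;
	return 0;
}

static void fake_uport_init_ras_reporting(struct fake_port *port)
{
	/* Several endpoints may share one upstream port: map only once. */
	if (port->uport_regs.ras)
		return;

	if (fake_map_ras(port))
		fprintf(stderr, "Failed to map RAS capability.\n");
	/* No trailing 'return' needed: nothing follows in this patch. */
}
```

The early return at the top still matters, since it is what keeps a second
endpoint from remapping the registers.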


* Re: [PATCH v5 03/16] CXL/PCI: Introduce PCIe helper functions pcie_is_cxl() and pcie_is_cxl_port()
  2025-01-14 23:33       ` Ira Weiny
@ 2025-01-14 23:39         ` Bowman, Terry
  2025-01-16 15:35           ` Ira Weiny
  0 siblings, 1 reply; 96+ messages in thread
From: Bowman, Terry @ 2025-01-14 23:39 UTC (permalink / raw)
  To: Ira Weiny, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
	rrichter, nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
	ming.li, PradeepVineshReddy.Kodamati, alucerop




On 1/14/2025 5:33 PM, Ira Weiny wrote:
> Bowman, Terry wrote:
>>
>>
>> On 1/13/2025 5:49 PM, Ira Weiny wrote:
>>> Terry Bowman wrote:
>>>> CXL and AER drivers need the ability to identify CXL devices and CXL port
>>>> devices.
>>>>
>>>> First, add set_pcie_cxl() with logic checking for CXL Flexbus DVSEC
>>>> presence. The CXL Flexbus DVSEC presence is used because it is required
>>>> for all the CXL PCIe devices.[1]
>>>>
>>>> Add boolean 'struct pci_dev::is_cxl' with the purpose to cache the CXL
>>>> Flexbus presence.
>>>>
>>>> Add pcie_is_cxl() as a macro to return 'struct pci_dev::is_cxl'.
>>>>
>>>> Add pcie_is_cxl_port() to check if a device is a CXL Root Port, CXL
>>>> Upstream Switch Port, or CXL Downstream Switch Port. Also, verify the
>>>> CXL Extensions DVSEC for Ports is present.[1]
>>>>
>>>> [1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
>>>>     Capability (DVSEC) ID Assignment, Table 8-2
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>>>> Reviewed-by: Fan Ni <fan.ni@samsung.com>
>>>> ---
>>>>  drivers/pci/pci.c             | 13 +++++++++++++
>>>>  drivers/pci/probe.c           | 10 ++++++++++
>>>>  include/linux/pci.h           |  4 ++++
>>>>  include/uapi/linux/pci_regs.h |  3 ++-
>>>>  4 files changed, 29 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
>>>> index 661f98c6c63a..9319c62e3488 100644
>>>> --- a/drivers/pci/pci.c
>>>> +++ b/drivers/pci/pci.c
>>>> @@ -5036,10 +5036,23 @@ static int pci_dev_reset_slot_function(struct pci_dev *dev, bool probe)
>>>>  
>>>>  static u16 cxl_port_dvsec(struct pci_dev *dev)
>>>>  {
>>>> +	if (!pcie_is_cxl(dev))
>>>> +		return 0;
>>>> +
>>>>  	return pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
>>>>  					 PCI_DVSEC_CXL_PORT);
>>>>  }
>>>>  
>>>> +bool pcie_is_cxl_port(struct pci_dev *dev)
>>>> +{
>>>> +	if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
>>>> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM) &&
>>>> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
>>>> +		return false;
>>>> +
>>>> +	return cxl_port_dvsec(dev);
>>> Returning bool from a function which returns u16 is odd and I don't think
>>> it should be coded this way.  I don't think it is wrong right now but this
>>> really ought to code the pcie_is_cxl() here and leave cxl_port_dvsec()
>>> alone.  Calling cxl_port_dvsec(), checking for if the dvsec exists, and
>>> returning bool.
>> Hi Ira,
>>
>> Thanks for reviewing. Is this what you are looking for here:
>>
>> +bool pcie_is_cxl_port(struct pci_dev *dev)
>> +{
>> +	return (cxl_port_dvsec(dev) > 0);
> With the type checks, yes that is more clear.
>
> Ira
>
> [snip]
Since sending the above, I have updated it to:

static u16 cxl_port_dvsec(struct pci_dev *dev)
{
        return pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
                                         PCI_DVSEC_CXL_PORT);
}

inline bool pcie_is_cxl(struct pci_dev *pci_dev)
{
        return pci_dev->is_cxl;
}

bool pcie_is_cxl_port(struct pci_dev *pci_dev)
{
        if (!pcie_is_cxl(pci_dev))
                return false;

        return (cxl_port_dvsec(pci_dev) > 0);
}

I can change it if you see anything else that is needed.

Regards,
Terry
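
For comparison, a small user-space sketch of the variant Ira describes above,
keeping the port type checks and converting the u16 DVSEC offset to bool
explicitly. The fake_* names and enum values are assumptions made up for this
example, not the kernel definitions:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the PCI_EXP_TYPE_* values. */
enum fake_pcie_type {
	FAKE_TYPE_ENDPOINT,
	FAKE_TYPE_ROOT_PORT,
	FAKE_TYPE_UPSTREAM,
	FAKE_TYPE_DOWNSTREAM,
};

/* Hypothetical stand-in for 'struct pci_dev'. */
struct fake_pci_dev {
	enum fake_pcie_type type;
	bool is_cxl;         /* cached Flexbus DVSEC presence */
	uint16_t port_dvsec; /* offset of the port DVSEC, 0 if absent */
};

/* Returns a config space offset, so the return type stays u16-like. */
static uint16_t fake_cxl_port_dvsec(const struct fake_pci_dev *dev)
{
	return dev->port_dvsec;
}

static bool fake_pcie_is_cxl_port(const struct fake_pci_dev *dev)
{
	if (dev->type != FAKE_TYPE_ROOT_PORT &&
	    dev->type != FAKE_TYPE_UPSTREAM &&
	    dev->type != FAKE_TYPE_DOWNSTREAM)
		return false;

	if (!dev->is_cxl)
		return false;

	/* Explicit comparison: offset 0 means "capability not found". */
	return fake_cxl_port_dvsec(dev) > 0;
}
```

Whether the type checks belong in this helper or in its callers is the open
question in this subthread; the sketch only shows that both forms can return
a clean bool.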


* Re: [PATCH v5 13/16] cxl/pci: Add error handler for CXL PCIe Port RAS errors
  2025-01-14 22:51   ` Ira Weiny
  2025-01-14 23:10     ` Bowman, Terry
@ 2025-01-14 23:42     ` Bowman, Terry
  1 sibling, 0 replies; 96+ messages in thread
From: Bowman, Terry @ 2025-01-14 23:42 UTC (permalink / raw)
  To: Ira Weiny, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
	rrichter, nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
	ming.li, PradeepVineshReddy.Kodamati, alucerop




On 1/14/2025 4:51 PM, Ira Weiny wrote:
> Terry Bowman wrote:
>> Introduce correctable and uncorrectable CXL PCIe Port Protocol Error
>> handlers.
>>
>> The handlers will be called with a 'struct pci_dev' parameter
>> indicating the CXL Port device requiring handling. The CXL PCIe Port
>> device's underlying 'struct device' will match the port device in the
>> CXL topology.
>>
>> Use the PCIe Port's device object to find the matching CXL Upstream Switch
>> Port, CXL Downstream Switch Port, or CXL Root Port in the CXL topology. The
>> matching CXL Port device should contain a cached reference to the RAS
>> register block. The cached RAS block will be used handling the error.
>>
>> Invoke the existing __cxl_handle_ras() or __cxl_handle_cor_ras() using
>> a reference to the RAS registers as a parameter. These functions will use
>> the RAS register reference to indicate an error and clear the device's RAS
>> status.
>>
>> Future patches will assign the error handlers and add trace logging.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>  drivers/cxl/core/pci.c | 63 ++++++++++++++++++++++++++++++++++++++++++
>>  1 file changed, 63 insertions(+)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 8275b3dc3589..411834f7efe0 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -776,6 +776,69 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>>  	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
>>  }
>>  
>> +static int match_uport(struct device *dev, const void *data)
>> +{
>> +	struct device *uport_dev = (struct device *)data;
>> +	struct cxl_port *port;
>> +
>> +	if (!is_cxl_port(dev))
>> +		return 0;
>> +
>> +	port = to_cxl_port(dev);
>> +
>> +	return port->uport_dev == uport_dev;
>> +}
>> +
>> +static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
>> +{
>> +	struct cxl_port *port;
>> +
>> +	if (!pdev)
>> +		return NULL;
>> +
>> +	if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
>> +	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
>> +		struct cxl_dport *dport;
>> +		void __iomem *ras_base;
>> +
>> +		port = find_cxl_port(&pdev->dev, &dport);
>> +		ras_base = dport ? dport->regs.ras : NULL;
>> +		if (port)
>> +			put_device(&port->dev);
>> +		return ras_base;
>> +	} else if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {
>> +		struct device *port_dev;
>> +
>> +		port_dev = bus_find_device(&cxl_bus_type, NULL, &pdev->dev,
>> +					   match_uport);
>> +		if (!port_dev)
>> +			return NULL;
>> +
>> +		port = to_cxl_port(port_dev);
>> +		if (!port)
>> +			return NULL;
>> +
>> +		put_device(port_dev);
> Is there any chance the cxl_port (and subsequently the mapping of the ras
> registers) could go away between here and their use in
> __cxl_handle_*_ras()?
>
> Ira
Yes, this could happen.
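
One way to close that window, sketched in user space under made-up names:
keep the reference taken by the lookup until the RAS registers have been read
and cleared, and drop it only afterwards. fake_port, fake_port_find() and
fake_port_put() are hypothetical stand-ins for the cxl_port lookup and
put_device(); this is an illustration, not the fix proposed in the series.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cxl_port with cached RAS registers. */
struct fake_port {
	int refcount;
	uint32_t *ras; /* NULL once unmapped */
};

/* Lookup returns the port with an extra reference held, or NULL. */
static struct fake_port *fake_port_find(struct fake_port *candidate)
{
	if (!candidate || !candidate->ras)
		return NULL;
	candidate->refcount++;
	return candidate;
}

/* Dropping the last reference tears down the mapping. */
static void fake_port_put(struct fake_port *port)
{
	if (--port->refcount == 0)
		port->ras = NULL;
}

/* Returns true if an error status was pending, and clears it. */
static bool fake_port_error_detected(struct fake_port *candidate)
{
	struct fake_port *port = fake_port_find(candidate);
	bool pending;

	if (!port)
		return false;

	pending = *port->ras != 0;
	*port->ras = 0;

	/* Only now is it safe to let the port (and its mapping) go away. */
	fake_port_put(port);
	return pending;
}
```

The difference from the quoted patch is only the ordering: the put happens
after the register access instead of before the pointer is returned.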

>> +		return port->uport_regs.ras;
>> +	}
>> +
>> +	return NULL;
>> +}
>> +
>> +static void cxl_port_cor_error_detected(struct pci_dev *pdev)
>> +{
>> +	void __iomem *ras_base = cxl_pci_port_ras(pdev);
>> +
>> +	__cxl_handle_cor_ras(&pdev->dev, ras_base);
>> +}
>> +
>> +static bool cxl_port_error_detected(struct pci_dev *pdev)
>> +{
>> +	void __iomem *ras_base = cxl_pci_port_ras(pdev);
>> +
>> +	return __cxl_handle_ras(&pdev->dev, ras_base);
>> +}
>> +
>>  void cxl_uport_init_ras_reporting(struct cxl_port *port)
>>  {
>>  	/* uport may have more than 1 downstream EP. Check if already mapped. */
>> -- 
>> 2.34.1
>>
>



* Re: [PATCH v5 16/16] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports
  2025-01-14 23:34     ` Bowman, Terry
@ 2025-01-14 23:45       ` Ira Weiny
  2025-01-15  0:09         ` Bowman, Terry
  2025-01-15  0:20         ` Bowman, Terry
  0 siblings, 2 replies; 96+ messages in thread
From: Ira Weiny @ 2025-01-14 23:45 UTC (permalink / raw)
  To: Bowman, Terry, Ira Weiny, linux-cxl, linux-kernel, linux-pci,
	nifan.cxl, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, dan.j.williams, bhelgaas, mahesh, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Bowman, Terry wrote:
> 
> 
> 
> On 1/14/2025 5:26 PM, Ira Weiny wrote:
> > Terry Bowman wrote:
> >> The AER service driver enables PCIe Uncorrectable Internal Errors (UIE) and
> >> Correctable Internal errors (CIE) for CXL Root Ports. The UIE and CIE are
> >> used in reporting CXL Protocol Errors. The same UIE/CIE enablement is
> >> needed for CXL Upstream Switch Ports and CXL Downstream Switch Ports
> >> inorder to notify the associated Root Port and OS.[1]
> >>
> >> Export the AER service driver's pci_aer_unmask_internal_errors() function
> >> to CXL namespace.
> >>
> >> Remove the function's dependency on the CONFIG_PCIEAER_CXL kernel config
> >> because it is now an exported function.
> > This seems wrong to me.  As of this patch CXL_PCI requires PCIEAER_CXL for
> > the AER code to handle the errors which were just enabled.
> >
> > To keep PCIEAER_CXL optional pci_aer_unmask_internal_errors() should be
> > stubbed out in aer.h if !CONFIG_PCIEAER_CXL.
> >
> > Ira
> 
> Bjorn (I believe in v1 or v2) directed me to remove
> pci_aer_unmask_internal_errors() dependency on PCIEAER_CXL because it is
> now exported. He wants the behavior for other users (and subsystems) to
> be consistent with/without the PCIEAER_CXL setting.
> 

I see...  If PCIEAER_CXL is not enabled why even set the cxl error
handlers and enable these?

I guess this is just adding some code which eventually calls
handles_cxl_errors() which returns false in the !PCIEAER_CXL case?

Ira
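
For reference, the "enable" step under discussion is a read-modify-write that
clears one bit in each AER mask register. A minimal user-space sketch follows;
the FAKE_* constants are believed to mirror the PCI_ERR_* values in
include/uapi/linux/pci_regs.h but should be treated as illustrative, and
'struct fake_aer_cap' with its accessors is a hypothetical stand-in for config
space access:

```c
#include <stdint.h>

/* Believed to match pci_regs.h; treat as illustrative values. */
#define FAKE_ERR_UNCOR_MASK   0x08        /* Uncorrectable Error Mask offset */
#define FAKE_ERR_COR_MASK     0x14        /* Correctable Error Mask offset */
#define FAKE_ERR_UNC_INTN     0x00400000u /* Uncorrectable Internal Error */
#define FAKE_ERR_COR_INTERNAL 0x00004000u /* Corrected Internal Error */

/* Hypothetical stand-in for a device's AER capability registers. */
struct fake_aer_cap {
	uint32_t regs[16];
};

static uint32_t fake_read_dword(const struct fake_aer_cap *cap, int off)
{
	return cap->regs[off / 4];
}

static void fake_write_dword(struct fake_aer_cap *cap, int off, uint32_t val)
{
	cap->regs[off / 4] = val;
}

/* Clear only the internal-error mask bits; leave all other bits alone. */
static void fake_unmask_internal_errors(struct fake_aer_cap *cap)
{
	uint32_t mask;

	mask = fake_read_dword(cap, FAKE_ERR_UNCOR_MASK);
	mask &= ~FAKE_ERR_UNC_INTN;
	fake_write_dword(cap, FAKE_ERR_UNCOR_MASK, mask);

	mask = fake_read_dword(cap, FAKE_ERR_COR_MASK);
	mask &= ~FAKE_ERR_COR_INTERNAL;
	fake_write_dword(cap, FAKE_ERR_COR_MASK, mask);
}
```

Nothing in this step depends on CXL itself, which fits the argument for
dropping the PCIEAER_CXL guard around the exported helper.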

> Regards,
> Terry
> 
> >> Call pci_aer_unmask_internal_errors() during RAS initialization in:
> >> cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting().
> >>
> >> [1] PCIe Base Spec r6.2-1.0, 6.2.3.2.2 Masking Individual Errors
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> >> ---
> >>  drivers/cxl/core/pci.c | 2 ++
> >>  drivers/pci/pcie/aer.c | 5 +++--
> >>  include/linux/aer.h    | 1 +
> >>  3 files changed, 6 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> >> index 9c162120f0fe..c62329cd9a87 100644
> >> --- a/drivers/cxl/core/pci.c
> >> +++ b/drivers/cxl/core/pci.c
> >> @@ -895,6 +895,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
> >>  
> >>  	cxl_assign_port_error_handlers(pdev);
> >>  	devm_add_action_or_reset(&port->dev, cxl_clear_port_error_handlers, pdev);
> >> +	pci_aer_unmask_internal_errors(pdev);
> >>  }
> >>  EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, "CXL");
> >>  
> >> @@ -935,6 +936,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
> >>  	}
> >>  	cxl_assign_port_error_handlers(pdev);
> >>  	devm_add_action_or_reset(&port->dev, cxl_clear_port_error_handlers, pdev);
> >> +	pci_aer_unmask_internal_errors(pdev);
> >>  	put_device(&port->dev);
> >>  }
> >>  EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
> >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> >> index 68e957459008..e6aaa3bd84f0 100644
> >> --- a/drivers/pci/pcie/aer.c
> >> +++ b/drivers/pci/pcie/aer.c
> >> @@ -950,7 +950,6 @@ static bool is_internal_error(struct aer_err_info *info)
> >>  	return info->status & PCI_ERR_UNC_INTN;
> >>  }
> >>  
> >> -#ifdef CONFIG_PCIEAER_CXL
> >>  /**
> >>   * pci_aer_unmask_internal_errors - unmask internal errors
> >>   * @dev: pointer to the pcie_dev data structure
> >> @@ -961,7 +960,7 @@ static bool is_internal_error(struct aer_err_info *info)
> >>   * Note: AER must be enabled and supported by the device which must be
> >>   * checked in advance, e.g. with pcie_aer_is_native().
> >>   */
> >> -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> >> +void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> >>  {
> >>  	int aer = dev->aer_cap;
> >>  	u32 mask;
> >> @@ -974,7 +973,9 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> >>  	mask &= ~PCI_ERR_COR_INTERNAL;
> >>  	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
> >>  }
> >> +EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, "CXL");
> >>  
> >> +#ifdef CONFIG_PCIEAER_CXL
> >>  static bool is_cxl_mem_dev(struct pci_dev *dev)
> >>  {
> >>  	/*
> >> diff --git a/include/linux/aer.h b/include/linux/aer.h
> >> index 4b97f38f3fcf..093293f9f12b 100644
> >> --- a/include/linux/aer.h
> >> +++ b/include/linux/aer.h
> >> @@ -55,5 +55,6 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
> >>  int cper_severity_to_aer(int cper_severity);
> >>  void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
> >>  		       int severity, struct aer_capability_regs *aer_regs);
> >> +void pci_aer_unmask_internal_errors(struct pci_dev *dev);
> >>  #endif //_AER_H_
> >>  
> >> -- 
> >> 2.34.1
> >>
> >
> 




* Re: [PATCH v5 09/16] cxl/pci: Map CXL PCIe Upstream Switch Port RAS registers
  2025-01-14 23:38       ` Ira Weiny
@ 2025-01-14 23:49         ` Bowman, Terry
  2025-01-15 11:40           ` Jonathan Cameron
  0 siblings, 1 reply; 96+ messages in thread
From: Bowman, Terry @ 2025-01-14 23:49 UTC (permalink / raw)
  To: Ira Weiny, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
	rrichter, nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
	ming.li, PradeepVineshReddy.Kodamati, alucerop




On 1/14/2025 5:38 PM, Ira Weiny wrote:
> Bowman, Terry wrote:
>>
>>
>> On 1/14/2025 4:02 PM, Ira Weiny wrote:
>>> Terry Bowman wrote:
>>>> Add logic to map CXL PCIe Upstream Switch Port (USP) RAS registers.
>>>>
>>>> Introduce 'struct cxl_regs' member into 'struct cxl_port' to cache a
>>>> pointer to the CXL Upstream Port's mapped RAS registers.
>>>>
>>>> Also, introduce cxl_uport_init_ras_reporting() to perform the USP RAS
>>>> register mapping. This is similar to the existing
>>>> cxl_dport_init_ras_reporting() but for USP devices.
>>>>
>>>> The USP may have multiple downstream endpoints. Before mapping AER
>>>> registers check if the registers are already mapped.
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>> ---
>>>>  drivers/cxl/core/pci.c | 15 +++++++++++++++
>>>>  drivers/cxl/cxl.h      |  4 ++++
>>>>  drivers/cxl/mem.c      |  8 ++++++++
>>>>  3 files changed, 27 insertions(+)
>>>>
>>>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>>>> index 1af2d0a14f5d..97e6a15bea88 100644
>>>> --- a/drivers/cxl/core/pci.c
>>>> +++ b/drivers/cxl/core/pci.c
>>>> @@ -773,6 +773,21 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>>>>  	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
>>>>  }
>>>>  
>>>> +void cxl_uport_init_ras_reporting(struct cxl_port *port)
>>>> +{
>>>> +	/* uport may have more than 1 downstream EP. Check if already mapped. */
>>>> +	if (port->uport_regs.ras)
>>>> +		return;
>>>> +
>>>> +	port->reg_map.host = &port->dev;
>>>> +	if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
>>>> +				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
>>>> +		dev_err(&port->dev, "Failed to map RAS capability.\n");
>>>> +		return;
>>> Why return here?  Actually I think 8/16 had the same issue now that I see
>>> this.
>>>
>>> Other than that:
>>>
>>> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
>>>
>>> [snip]
>> If RAS registers fail mapping then exit to avoid CXL Port error handler initialization.
>> The CXL Port error handlers rely on RAS registers for logging and without mapped RAS
>> registers the error handlers will return immediately.
> Sorry I was not clear and I should not have clipped the text so much.  You
> return in a block which is at the end of the function:
>
>
> +void cxl_uport_init_ras_reporting(struct cxl_port *port)
> +{
> +       /* uport may have more than 1 downstream EP. Check if already mapped. */
> +       if (port->uport_regs.ras)
> +               return;
> +
> +       port->reg_map.host = &port->dev;
> +       if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
> +                                  BIT(CXL_CM_CAP_CAP_ID_RAS))) {
> +               dev_err(&port->dev, "Failed to map RAS capability.\n");
> +               return;
> +       }
> +}
>
> So no need for this specific statement?
>
> Ira

I wrote it this way so that the handler initialization could be added (after
the return) in a later patch without a diff removal. But you're correct, I can
remove the 'return' statement in this patch and add it back in the later patch
without cluttering the diff.

Thanks. I'll make the change.

Regards,
Terry





* Re: [PATCH v5 16/16] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports
  2025-01-14 23:45       ` Ira Weiny
@ 2025-01-15  0:09         ` Bowman, Terry
  2025-01-15  0:20         ` Bowman, Terry
  1 sibling, 0 replies; 96+ messages in thread
From: Bowman, Terry @ 2025-01-15  0:09 UTC (permalink / raw)
  To: Ira Weiny, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
	rrichter, nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
	ming.li, PradeepVineshReddy.Kodamati, alucerop




On 1/14/2025 5:45 PM, Ira Weiny wrote:
> Bowman, Terry wrote:
>>
>>
>> On 1/14/2025 5:26 PM, Ira Weiny wrote:
>>> Terry Bowman wrote:
>>>> The AER service driver enables PCIe Uncorrectable Internal Errors (UIE) and
>>>> Correctable Internal errors (CIE) for CXL Root Ports. The UIE and CIE are
>>>> used in reporting CXL Protocol Errors. The same UIE/CIE enablement is
>>>> needed for CXL Upstream Switch Ports and CXL Downstream Switch Ports
>>>> inorder to notify the associated Root Port and OS.[1]
>>>>
>>>> Export the AER service driver's pci_aer_unmask_internal_errors() function
>>>> to CXL namespace.
>>>>
>>>> Remove the function's dependency on the CONFIG_PCIEAER_CXL kernel config
>>>> because it is now an exported function.
>>> This seems wrong to me.  As of this patch CXL_PCI requires PCIEAER_CXL for
>>> the AER code to handle the errors which were just enabled.
>>>
>>> To keep PCIEAER_CXL optional pci_aer_unmask_internal_errors() should be
>>> stubbed out in aer.h if !CONFIG_PCIEAER_CXL.
>>>
>>> Ira
>> Bjorn (I believe in v1 or v2) directed me to remove
>> pci_aer_unmask_internal_errors() dependency on PCIEAER_CXL because it is
>> now exported. He wants the behavior for other users (and subsystems) to
>> be consistent with/without the PCIEAER_CXL setting.
>>
> I see...  If PCIEAER_CXL is not enabled why even set the cxl error
> handlers and enable these?
>
> I guess this is just adding some code which eventually calls
> handles_cxl_errors() which returns false in the !PCIEAER_CXL case?
>
> Ira

cxl_dport_init_ras_reporting() and cxl_uport_init_ras_reporting() assign the
error handlers and are within #ifdef PCIEAER_CXL. I need to add the empty stubs
to the #else block.

Correct. handles_cxl_errors() returns false in the !PCIEAER_CXL case.

Terry

>> Regards,
>> Terry
>>
>>>> Call pci_aer_unmask_internal_errors() during RAS initialization in:
>>>> cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting().
>>>>
>>>> [1] PCIe Base Spec r6.2-1.0, 6.2.3.2.2 Masking Individual Errors
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>> ---
>>>>  drivers/cxl/core/pci.c | 2 ++
>>>>  drivers/pci/pcie/aer.c | 5 +++--
>>>>  include/linux/aer.h    | 1 +
>>>>  3 files changed, 6 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>>>> index 9c162120f0fe..c62329cd9a87 100644
>>>> --- a/drivers/cxl/core/pci.c
>>>> +++ b/drivers/cxl/core/pci.c
>>>> @@ -895,6 +895,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
>>>>  
>>>>  	cxl_assign_port_error_handlers(pdev);
>>>>  	devm_add_action_or_reset(&port->dev, cxl_clear_port_error_handlers, pdev);
>>>> +	pci_aer_unmask_internal_errors(pdev);
>>>>  }
>>>>  EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, "CXL");
>>>>  
>>>> @@ -935,6 +936,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>>>>  	}
>>>>  	cxl_assign_port_error_handlers(pdev);
>>>>  	devm_add_action_or_reset(&port->dev, cxl_clear_port_error_handlers, pdev);
>>>> +	pci_aer_unmask_internal_errors(pdev);
>>>>  	put_device(&port->dev);
>>>>  }
>>>>  EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
>>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>>> index 68e957459008..e6aaa3bd84f0 100644
>>>> --- a/drivers/pci/pcie/aer.c
>>>> +++ b/drivers/pci/pcie/aer.c
>>>> @@ -950,7 +950,6 @@ static bool is_internal_error(struct aer_err_info *info)
>>>>  	return info->status & PCI_ERR_UNC_INTN;
>>>>  }
>>>>  
>>>> -#ifdef CONFIG_PCIEAER_CXL
>>>>  /**
>>>>   * pci_aer_unmask_internal_errors - unmask internal errors
>>>>   * @dev: pointer to the pcie_dev data structure
>>>> @@ -961,7 +960,7 @@ static bool is_internal_error(struct aer_err_info *info)
>>>>   * Note: AER must be enabled and supported by the device which must be
>>>>   * checked in advance, e.g. with pcie_aer_is_native().
>>>>   */
>>>> -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>>>> +void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>>>>  {
>>>>  	int aer = dev->aer_cap;
>>>>  	u32 mask;
>>>> @@ -974,7 +973,9 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>>>>  	mask &= ~PCI_ERR_COR_INTERNAL;
>>>>  	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
>>>>  }
>>>> +EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, "CXL");
>>>>  
>>>> +#ifdef CONFIG_PCIEAER_CXL
>>>>  static bool is_cxl_mem_dev(struct pci_dev *dev)
>>>>  {
>>>>  	/*
>>>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>>>> index 4b97f38f3fcf..093293f9f12b 100644
>>>> --- a/include/linux/aer.h
>>>> +++ b/include/linux/aer.h
>>>> @@ -55,5 +55,6 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>>>>  int cper_severity_to_aer(int cper_severity);
>>>>  void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
>>>>  		       int severity, struct aer_capability_regs *aer_regs);
>>>> +void pci_aer_unmask_internal_errors(struct pci_dev *dev);
>>>>  #endif //_AER_H_
>>>>  
>>>> -- 
>>>> 2.34.1
>>>>
>



* Re: [PATCH v5 16/16] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports
  2025-01-14 23:45       ` Ira Weiny
  2025-01-15  0:09         ` Bowman, Terry
@ 2025-01-15  0:20         ` Bowman, Terry
  2025-01-16 21:42           ` Ira Weiny
  1 sibling, 1 reply; 96+ messages in thread
From: Bowman, Terry @ 2025-01-15  0:20 UTC (permalink / raw)
  To: Ira Weiny, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
	rrichter, nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
	ming.li, PradeepVineshReddy.Kodamati, alucerop




On 1/14/2025 5:45 PM, Ira Weiny wrote:
> Bowman, Terry wrote:
>>
>>
>> On 1/14/2025 5:26 PM, Ira Weiny wrote:
>>> Terry Bowman wrote:
>>>> The AER service driver enables PCIe Uncorrectable Internal Errors (UIE) and
>>>> Correctable Internal errors (CIE) for CXL Root Ports. The UIE and CIE are
>>>> used in reporting CXL Protocol Errors. The same UIE/CIE enablement is
>>>> needed for CXL Upstream Switch Ports and CXL Downstream Switch Ports
>>>> inorder to notify the associated Root Port and OS.[1]
>>>>
>>>> Export the AER service driver's pci_aer_unmask_internal_errors() function
>>>> to CXL namespace.
>>>>
>>>> Remove the function's dependency on the CONFIG_PCIEAER_CXL kernel config
>>>> because it is now an exported function.
>>> This seems wrong to me.  As of this patch CXL_PCI requires PCIEAER_CXL for
>>> the AER code to handle the errors which were just enabled.
>>>
>>> To keep PCIEAER_CXL optional pci_aer_unmask_internal_errors() should be
>>> stubbed out in aer.h if !CONFIG_PCIEAER_CXL.
>>>
>>> Ira
>> Bjorn (I believe in v1 or v2) directed me to remove
>> pci_aer_unmask_internal_errors() dependency on PCIEAER_CXL because it is
>> now exported. He wants the behavior for other users (and subsystems) to
>> be consistent with/without the PCIEAER_CXL setting.
>>
> I see...  If PCIEAER_CXL is not enabled why even set the cxl error
> handlers and enable these?
>
> I guess this is just adding some code which eventually calls
> handles_cxl_errors() which returns false in the !PCIEAER_CXL case?
>
> Ira

Re-sending because I somehow sent from Outlook earlier.

cxl_dport_init_ras_reporting() and cxl_uport_init_ras_reporting() assign the error 
handlers and are within #ifdef PCIEAER_CXL. The stubs are in cxl.h.

Correct. handles_cxl_errors() returns false in the !PCIEAER_CXL case.

Terry
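
The stub arrangement described above follows the usual kernel pattern of a
real definition under the config option and an empty static inline otherwise.
A minimal sketch, with FAKE_CONFIG_PCIEAER_CXL and the fake_* functions as
hypothetical stand-ins (the actual stubs in cxl.h are not reproduced here):

```c
#include <stdbool.h>

/* Hypothetical stand-in for the port state touched by RAS init. */
struct fake_port {
	bool handlers_assigned;
};

#ifdef FAKE_CONFIG_PCIEAER_CXL
/* Option enabled: assign handlers and report that CXL errors are handled. */
static void fake_uport_init_ras_reporting(struct fake_port *port)
{
	port->handlers_assigned = true;
}

static bool fake_handles_cxl_errors(void)
{
	return true;
}
#else
/* Option disabled: callers still compile, but nothing is set up. */
static inline void fake_uport_init_ras_reporting(struct fake_port *port)
{
	(void)port;
}

static inline bool fake_handles_cxl_errors(void)
{
	return false;
}
#endif
```

Built without -DFAKE_CONFIG_PCIEAER_CXL, the init call is a no-op and the
query returns false, which matches the !PCIEAER_CXL behaviour confirmed above.
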
>>>> Call pci_aer_unmask_internal_errors() during RAS initialization in:
>>>> cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting().
>>>>
>>>> [1] PCIe Base Spec r6.2-1.0, 6.2.3.2.2 Masking Individual Errors
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>> ---
>>>>  drivers/cxl/core/pci.c | 2 ++
>>>>  drivers/pci/pcie/aer.c | 5 +++--
>>>>  include/linux/aer.h    | 1 +
>>>>  3 files changed, 6 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>>>> index 9c162120f0fe..c62329cd9a87 100644
>>>> --- a/drivers/cxl/core/pci.c
>>>> +++ b/drivers/cxl/core/pci.c
>>>> @@ -895,6 +895,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
>>>>  
>>>>  	cxl_assign_port_error_handlers(pdev);
>>>>  	devm_add_action_or_reset(&port->dev, cxl_clear_port_error_handlers, pdev);
>>>> +	pci_aer_unmask_internal_errors(pdev);
>>>>  }
>>>>  EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, "CXL");
>>>>  
>>>> @@ -935,6 +936,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>>>>  	}
>>>>  	cxl_assign_port_error_handlers(pdev);
>>>>  	devm_add_action_or_reset(&port->dev, cxl_clear_port_error_handlers, pdev);
>>>> +	pci_aer_unmask_internal_errors(pdev);
>>>>  	put_device(&port->dev);
>>>>  }
>>>>  EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, "CXL");
>>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>>> index 68e957459008..e6aaa3bd84f0 100644
>>>> --- a/drivers/pci/pcie/aer.c
>>>> +++ b/drivers/pci/pcie/aer.c
>>>> @@ -950,7 +950,6 @@ static bool is_internal_error(struct aer_err_info *info)
>>>>  	return info->status & PCI_ERR_UNC_INTN;
>>>>  }
>>>>  
>>>> -#ifdef CONFIG_PCIEAER_CXL
>>>>  /**
>>>>   * pci_aer_unmask_internal_errors - unmask internal errors
>>>>   * @dev: pointer to the pcie_dev data structure
>>>> @@ -961,7 +960,7 @@ static bool is_internal_error(struct aer_err_info *info)
>>>>   * Note: AER must be enabled and supported by the device which must be
>>>>   * checked in advance, e.g. with pcie_aer_is_native().
>>>>   */
>>>> -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>>>> +void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>>>>  {
>>>>  	int aer = dev->aer_cap;
>>>>  	u32 mask;
>>>> @@ -974,7 +973,9 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>>>>  	mask &= ~PCI_ERR_COR_INTERNAL;
>>>>  	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
>>>>  }
>>>> +EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, "CXL");
>>>>  
>>>> +#ifdef CONFIG_PCIEAER_CXL
>>>>  static bool is_cxl_mem_dev(struct pci_dev *dev)
>>>>  {
>>>>  	/*
>>>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>>>> index 4b97f38f3fcf..093293f9f12b 100644
>>>> --- a/include/linux/aer.h
>>>> +++ b/include/linux/aer.h
>>>> @@ -55,5 +55,6 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>>>>  int cper_severity_to_aer(int cper_severity);
>>>>  void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
>>>>  		       int severity, struct aer_capability_regs *aer_regs);
>>>> +void pci_aer_unmask_internal_errors(struct pci_dev *dev);
>>>>  #endif //_AER_H_
>>>>  
>>>> -- 
>>>> 2.34.1
>>>>
>



* Re: [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver
  2025-01-14 19:29     ` Bowman, Terry
@ 2025-01-15  1:18       ` Li Ming
  2025-01-15 14:39         ` Bowman, Terry
  0 siblings, 1 reply; 96+ messages in thread
From: Li Ming @ 2025-01-15  1:18 UTC (permalink / raw)
  To: Bowman, Terry, linux-cxl, linux-kernel, linux-pci, nifan.cxl,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, dan.j.williams, bhelgaas, mahesh, ira.weiny,
	oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas,
	PradeepVineshReddy.Kodamati, alucerop

On 1/15/2025 3:29 AM, Bowman, Terry wrote:
>
> On 1/14/2025 12:54 AM, Li Ming wrote:
>> On 1/7/2025 10:38 PM, Terry Bowman wrote:
>>> The AER service driver supports handling Downstream Port Protocol Errors in
>>> Restricted CXL host (RCH) mode also known as CXL1.1. It needs the same
>>> functionality for CXL PCIe Ports operating in Virtual Hierarchy (VH)
>>> mode.[1]
>>>
>>> CXL and PCIe Protocol Error handling have different requirements that
>>> necessitate a separate handling path. The AER service driver may try to
>>> recover PCIe uncorrectable non-fatal errors (UCE). The same recovery is not
>>> suitable for CXL PCIe Port devices because of potential for system memory
>>> corruption. Instead, CXL Protocol Error handling must use a kernel panic
>>> in the case of a fatal or non-fatal UCE. The AER driver's PCIe Protocol
>>> Error handling does not panic the kernel in response to a UCE.
>>>
>>> Introduce a separate path for CXL Protocol Error handling in the AER
>>> service driver. This will allow CXL Protocol Errors to use CXL specific
>>> handling instead of PCIe handling. Add the CXL specific changes without
>>> affecting or adding functionality in the PCIe handling.
>>>
>>> Make this update alongside the existing Downstream Port RCH error handling
>>> logic, extending support to CXL PCIe Ports in VH mode.
>>>
>>> is_internal_error() is currently limited by CONFIG_PCIEAER_CXL kernel
>>> config. Update is_internal_error()'s function declaration such that it is
>>> always available regardless if CONFIG_PCIEAER_CXL kernel config is enabled
>>> or disabled.
>>>
>>> The uncorrectable error (UCE) handling will be added in a future patch.
>>>
>>> [1] CXL 3.1 Spec, 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>>> Upstream Switch Ports
>>>
>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>>> ---
>>>  drivers/pci/pcie/aer.c | 61 +++++++++++++++++++++++++++---------------
>>>  1 file changed, 40 insertions(+), 21 deletions(-)
>>>
>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>> index f8b3350fcbb4..62be599e3bee 100644
>>> --- a/drivers/pci/pcie/aer.c
>>> +++ b/drivers/pci/pcie/aer.c
>>> @@ -942,8 +942,15 @@ static bool find_source_device(struct pci_dev *parent,
>>>  	return true;
>>>  }
>>>  
>>> -#ifdef CONFIG_PCIEAER_CXL
>>> +static bool is_internal_error(struct aer_err_info *info)
>>> +{
>>> +	if (info->severity == AER_CORRECTABLE)
>>> +		return info->status & PCI_ERR_COR_INTERNAL;
>>>  
>>> +	return info->status & PCI_ERR_UNC_INTN;
>>> +}
>>> +
>>> +#ifdef CONFIG_PCIEAER_CXL
>>>  /**
>>>   * pci_aer_unmask_internal_errors - unmask internal errors
>>>   * @dev: pointer to the pcie_dev data structure
>>> @@ -995,14 +1002,6 @@ static bool cxl_error_is_native(struct pci_dev *dev)
>>>  	return (pcie_ports_native || host->native_aer);
>>>  }
>>>  
>>> -static bool is_internal_error(struct aer_err_info *info)
>>> -{
>>> -	if (info->severity == AER_CORRECTABLE)
>>> -		return info->status & PCI_ERR_COR_INTERNAL;
>>> -
>>> -	return info->status & PCI_ERR_UNC_INTN;
>>> -}
>>> -
>>>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>>  {
>>>  	struct aer_err_info *info = (struct aer_err_info *)data;
>>> @@ -1034,14 +1033,23 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>>  
>>>  static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>>  {
>>> -	/*
>>> -	 * Internal errors of an RCEC indicate an AER error in an
>>> -	 * RCH's downstream port. Check and handle them in the CXL.mem
>>> -	 * device driver.
>>> -	 */
>>> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
>>> -	    is_internal_error(info))
>>> -		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>>> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
>>> +		return pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>>> +
>>> +	if (info->severity == AER_CORRECTABLE) {
>>> +		struct pci_driver *pdrv = dev->driver;
>>> +		int aer = dev->aer_cap;
>>> +
>>> +		if (aer)
>>> +			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>>> +					       info->status);
>>> +
>>> +		if (pdrv && pdrv->cxl_err_handler &&
>>> +		    pdrv->cxl_err_handler->cor_error_detected)
>>> +			pdrv->cxl_err_handler->cor_error_detected(dev);
>>>
>>> +		pcie_clear_device_status(dev);
>>> +	}
>>>  }
>>>  
>>>  static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>>> @@ -1059,9 +1067,13 @@ static bool handles_cxl_errors(struct pci_dev *dev)
>>>  {
>>>  	bool handles_cxl = false;
>>>  
>>> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
>>> -	    pcie_aer_is_native(dev))
>>> +	if (!pcie_aer_is_native(dev))
>>> +		return false;
>>> +
>>> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
>>>  		pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
>>> +	else
>>> +		handles_cxl = pcie_is_cxl_port(dev);
>> My understanding is if a cxl RP/USP/DSP is working on PCIe mode, they are also possible to expose a DVSEC ID 3(CXL r3.1 section 9.12.3). In such case, the AER handler should be pci_aer_handle_error() rather than cxl_handle_error().
>>
>> pcie_is_cxl_port() only checks if there is a DVSEC ID 3, but I think it should also check if the cxl port is working on CXL mode, does it make more sense?
>>
>>
>> Ming
> Hi Ming and Jonathan,
>
> RCH AER & RCH RAS are currently logged by the CXL driver's RCH handlers.
>
> If the recommended change is made then RCH RAS will not be logged and the
> user would miss CXL details about the alternate protocol training failure.
> Also, AER is not CXL required and as a result in some cases you would only
> have the RCEC forwarded UIE/CIE message logged by the AER driver without
> any other logging.
>
> Is there value in *not* logging CXL RAS for errors on an untrained RCH
> link? Isn't it more informative to log PCIe AER and CXL RAS in this case?
>
> Regards,
> Terry

Hi Terry,


I don't understand why the recommended change would affect RCH RAS handling. Would you mind giving more details?

My understanding is that the 'pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl)' call above is used for the RCH case.

The 'else' block is used for the VH case, so the change would only check whether the CXL port is operating in CXL mode, either inside pcie_is_cxl_port() or in an extra function called from the 'else' block. I think that will not change RCH AER & RAS handling. Is that right, or am I missing other details?
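
To make the suggestion concrete, here is a minimal userspace sketch of the decision, not kernel code. 'struct port_model', its fields, and the model_*() helpers are hypothetical names used only for illustration. In particular, 'link_is_cxl' stands in for whatever check would report that the port trained in CXL mode; the patch does not define such a check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified model of a port; not the kernel's struct pci_dev. */
struct port_model {
	bool aer_native;     /* OS owns AER, as pcie_aer_is_native() reports */
	bool has_port_dvsec; /* CXL Port DVSEC present, what pcie_is_cxl_port() checks */
	bool link_is_cxl;    /* assumption: link trained in CXL mode, not PCIe mode */
};

enum handler { HANDLER_PCIE_AER, HANDLER_CXL };

/* VH ('else') branch only: require CXL mode in addition to the DVSEC. */
static bool model_handles_cxl_errors(const struct port_model *p)
{
	if (!p->aer_native)
		return false;

	return p->has_port_dvsec && p->link_is_cxl;
}

/* Mirrors the if/else added to handle_error_source() in the patch. */
static enum handler model_pick_handler(const struct port_model *p,
				       bool internal_error)
{
	if (internal_error && model_handles_cxl_errors(p))
		return HANDLER_CXL;

	return HANDLER_PCIE_AER;
}

/* Small self-check that doubles as a usage example. */
static void check_dispatch(void)
{
	struct port_model cxl_mode = {
		.aer_native = true, .has_port_dvsec = true, .link_is_cxl = true,
	};
	struct port_model pcie_mode = {
		.aer_native = true, .has_port_dvsec = true, .link_is_cxl = false,
	};

	assert(model_pick_handler(&cxl_mode, true) == HANDLER_CXL);
	assert(model_pick_handler(&pcie_mode, true) == HANDLER_PCIE_AER);
	assert(model_pick_handler(&cxl_mode, false) == HANDLER_PCIE_AER);
}
```

The RCEC branch is deliberately left out of the model, since the suggestion does not touch it.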


Ming

>>>  
>>>  	return handles_cxl;
>>>  }
>>> @@ -1079,6 +1091,10 @@ static void cxl_enable_internal_errors(struct pci_dev *dev)
>>>  static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
>>>  static inline void cxl_handle_error(struct pci_dev *dev,
>>>  				    struct aer_err_info *info) { }
>>> +static bool handles_cxl_errors(struct pci_dev *dev)
>>> +{
>>> +	return false;
>>> +}
>>>  #endif
>>>  
>>>  /**
>>> @@ -1116,8 +1132,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>>  
>>>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>>>  {
>>> -	cxl_handle_error(dev, info);
>>> -	pci_aer_handle_error(dev, info);
>>> +	if (is_internal_error(info) && handles_cxl_errors(dev))
>>> +		cxl_handle_error(dev, info);
>>> +	else
>>> +		pci_aer_handle_error(dev, info);
>>> +
>>>  	pci_dev_put(dev);
>>>  }
>>>  




* Re: [PATCH v5 03/16] CXL/PCI: Introduce PCIe helper functions pcie_is_cxl() and pcie_is_cxl_port()
  2025-01-14 15:19     ` Bowman, Terry
  2025-01-14 23:33       ` Ira Weiny
@ 2025-01-15 10:03       ` Lukas Wunner
  1 sibling, 0 replies; 96+ messages in thread
From: Lukas Wunner @ 2025-01-15 10:03 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: Ira Weiny, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
	rrichter, nathan.fontenot, Smita.KoralahalliChannabasappa,
	ming.li, PradeepVineshReddy.Kodamati, alucerop

On Tue, Jan 14, 2025 at 09:19:12AM -0600, Bowman, Terry wrote:
> On 1/13/2025 5:49 PM, Ira Weiny wrote:
> > Terry Bowman wrote:
> > > --- a/drivers/pci/pci.c
> > > +++ b/drivers/pci/pci.c
> > > @@ -5036,10 +5036,23 @@ static int pci_dev_reset_slot_function(struct pci_dev *dev, bool probe)
> > >  
> > >  static u16 cxl_port_dvsec(struct pci_dev *dev)
> > >  {
> > > +	if (!pcie_is_cxl(dev))
> > > +		return 0;
> > > +
> > >  	return pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
> > >  					 PCI_DVSEC_CXL_PORT);
> > >  }
> > >  
> > > +bool pcie_is_cxl_port(struct pci_dev *dev)
> > > +{
> > > +	if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
> > > +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM) &&
> > > +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
> > > +		return false;
> > > +
> > > +	return cxl_port_dvsec(dev);
> > 
> > Returning bool from a function which returns u16 is odd and I don't think
> > it should be coded this way.  I don't think it is wrong right now but this
> > really ought to code the pcie_is_cxl() here and leave cxl_port_dvsec()
> > alone.  Calling cxl_port_dvsec(), checking for if the dvsec exists, and
> > returning bool.
> 
> Thanks for reviewing. Is this what you are looking for here:
> 
> +bool pcie_is_cxl_port(struct pci_dev *dev)
> +{
> +	return (cxl_port_dvsec(dev) > 0);

Since cxl_port_dvsec() cannot return a negative integer,
you might as well use:

	return !!cxl_port_dvsec(dev);

However last I checked gcc generates code which implicitly turns
a number bigger than 1 into a 1 if the return type is bool.
(I had to fix a bug caused by this behavior once, see 009f8c90f571).
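
For reference, a minimal userspace sketch of the conversion rule, not kernel code. fake_port_dvsec() is a hypothetical stand-in for cxl_port_dvsec(), and 0x0e00 is an arbitrary made-up offset. The u8 variant is only there for contrast and is not meant to describe the bug fixed by the commit above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for cxl_port_dvsec(): offset if found, 0 if absent. */
static uint16_t fake_port_dvsec(bool present)
{
	return present ? 0x0e00 : 0;
}

/* Implicit conversion to bool: any nonzero value becomes 1. */
static bool is_port_implicit(bool present)
{
	return fake_port_dvsec(present);
}

/* Explicit normalization with !!, same result but visible to the reader. */
static bool is_port_explicit(bool present)
{
	return !!fake_port_dvsec(present);
}

/* Contrast: a narrow integer return type truncates instead of normalizing. */
static uint8_t is_port_u8(bool present)
{
	return fake_port_dvsec(present); /* 0x0e00 truncates to 0x00 */
}

/* Small self-check that doubles as a usage example. */
static void check_conversions(void)
{
	assert(is_port_implicit(true) == 1);
	assert(is_port_explicit(true) == 1);
	assert(is_port_implicit(false) == 0);
	assert(is_port_u8(true) == 0); /* the offset is lost entirely */
}
```

So with a bool return type both spellings behave the same; the !! form mainly documents the intent.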

Thanks,

Lukas


* Re: [PATCH v5 07/16] PCI/AER: Add CXL PCIe Port uncorrectable error recovery in AER service driver
  2025-01-14 20:28     ` Bowman, Terry
@ 2025-01-15 11:37       ` Jonathan Cameron
  0 siblings, 0 replies; 96+ messages in thread
From: Jonathan Cameron @ 2025-01-15 11:37 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Tue, 14 Jan 2025 14:28:13 -0600
"Bowman, Terry" <terry.bowman@amd.com> wrote:

> On 1/14/2025 5:33 AM, Jonathan Cameron wrote:
> > On Tue, 7 Jan 2025 08:38:43 -0600
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >  
> >> Existing recovery procedure for PCIe uncorrectable errors (UCE) does not
> >> apply to CXL devices. Recovery can not be used for CXL devices because of
> >> potential corruption on what can be system memory. Also, current PCIe UCE
> >> recovery, in the case of a Root Port (RP) or Downstream Switch Port (DSP),
> >> does not begin at the RP/DSP but begins at the first downstream device.
> >> This will miss handling CXL Protocol Errors in a CXL RP or DSP. A separate
> >> CXL recovery is needed because of the different handling requirements
> >>
> >> Add a new function, cxl_do_recovery() using the following.
> >>
> >> Add cxl_walk_bridge() to iterate the detected error's sub-topology.
> >> cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
> >> will begin iteration at the RP or DSP rather than beginning at the
> >> first downstream device.  
> > I'm still holding out for making pci_walk_bridge() do the same and seeing
> > what if anything breaks.  
> 
> I can test AER fatal UCE on a PCIe device. Do you have any other ideas for specific
> testing? A specific device or topology in mind ?

It's the interaction with runtime power management usage that worries me and
might need wider testing.  Maybe it is just a case of sending a patch marked
RFT.

The other paths are no-op where it matters.

Jonathan

> 
> Regards,
> Terry
> 
> > Other than that I'm fine with this patch.
> >  
> >> Add cxl_report_error_detected() as an analog to report_error_detected().
> >> It will call pci_driver::cxl_err_handlers for each iterated downstream
> >> device. The pci_driver::cxl_err_handler's UCE handler returns a boolean
> >> indicating if there was a UCE error detected during handling.
> >>
> >> cxl_do_recovery() uses the status from cxl_report_error_detected() to
> >> determine how to proceed. Non-fatal CXL UCE errors will be treated as
> >> fatal. If a UCE was present during handling then cxl_do_recovery()
> >> will kernel panic.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>  
> 



* Re: [PATCH v5 09/16] cxl/pci: Map CXL PCIe Upstream Switch Port RAS registers
  2025-01-14 23:49         ` Bowman, Terry
@ 2025-01-15 11:40           ` Jonathan Cameron
  0 siblings, 0 replies; 96+ messages in thread
From: Jonathan Cameron @ 2025-01-15 11:40 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: Ira Weiny, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	dave.jiang, alison.schofield, vishal.l.verma, dan.j.williams,
	bhelgaas, mahesh, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Tue, 14 Jan 2025 17:49:44 -0600
"Bowman, Terry" <terry.bowman@amd.com> wrote:

> On 1/14/2025 5:38 PM, Ira Weiny wrote:
> > Bowman, Terry wrote:  
> >>
> >>
> >> On 1/14/2025 4:02 PM, Ira Weiny wrote:  
> >>> Terry Bowman wrote:  
> >>>> Add logic to map CXL PCIe Upstream Switch Port (USP) RAS registers.
> >>>>
> >>>> Introduce 'struct cxl_regs' member into 'struct cxl_port' to cache a
> >>>> pointer to the CXL Upstream Port's mapped RAS registers.
> >>>>
> >>>> Also, introduce cxl_uport_init_ras_reporting() to perform the USP RAS
> >>>> register mapping. This is similar to the existing
> >>>> cxl_dport_init_ras_reporting() but for USP devices.
> >>>>
> >>>> The USP may have multiple downstream endpoints. Before mapping AER
> >>>> registers check if the registers are already mapped.
> >>>>
> >>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >>>> ---
> >>>>  drivers/cxl/core/pci.c | 15 +++++++++++++++
> >>>>  drivers/cxl/cxl.h      |  4 ++++
> >>>>  drivers/cxl/mem.c      |  8 ++++++++
> >>>>  3 files changed, 27 insertions(+)
> >>>>
> >>>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> >>>> index 1af2d0a14f5d..97e6a15bea88 100644
> >>>> --- a/drivers/cxl/core/pci.c
> >>>> +++ b/drivers/cxl/core/pci.c
> >>>> @@ -773,6 +773,21 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
> >>>>  	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
> >>>>  }
> >>>>  
> >>>> +void cxl_uport_init_ras_reporting(struct cxl_port *port)
> >>>> +{
> >>>> +	/* uport may have more than 1 downstream EP. Check if already mapped. */
> >>>> +	if (port->uport_regs.ras)
> >>>> +		return;
> >>>> +
> >>>> +	port->reg_map.host = &port->dev;
> >>>> +	if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
> >>>> +				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
> >>>> +		dev_err(&port->dev, "Failed to map RAS capability.\n");
> >>>> +		return;  
> >>> Why return here?  Actually I think 8/16 had the same issue now that I see
> >>> this.
> >>>
> >>> Other than that:
> >>>
> >>> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> >>>
> >>> [snip]  
> >> If RAS registers fail mapping then exit to avoid CXL Port error handler initialization.
> >> The CXL Port error handlers rely on RAS registers for logging and without mapped RAS
> >> registers the error handlers will return immediately.  
> > Sorry I was not clear and I should not have clipped the text so much.  You
> > return in a block which is at the end of the function:
> >
> >
> > +void cxl_uport_init_ras_reporting(struct cxl_port *port)
> > +{
> > +       /* uport may have more than 1 downstream EP. Check if already mapped. */
> > +       if (port->uport_regs.ras)
> > +               return;
> > +
> > +       port->reg_map.host = &port->dev;
> > +       if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
> > +                                  BIT(CXL_CM_CAP_CAP_ID_RAS))) {
> > +               dev_err(&port->dev, "Failed to map RAS capability.\n");
> > +               return;
> > +       }
> > +}
> >
> > So no need for this specific statement?
> >
> > Ira  
> 
> I wrote it this way to add the handler initialization (after the return) in a later patch
> without a diff removal. But you're correct, I can remove the 'return' statement in this patch
> and add it in a later patch without cluttering the diff.
> 
> Thanks. I'll make the change.
> 
Leave it as it stands.  I'm sure Ira doesn't mind given the additions later.
I'd prefer we keep things clean across the series.

Jonathan

> Regards,
> Terry
> 
> 
> 



* Re: [PATCH v5 14/16] cxl/pci: Add trace logging for CXL PCIe Port RAS errors
  2025-01-14 20:56     ` Bowman, Terry
@ 2025-01-15 11:42       ` Jonathan Cameron
  0 siblings, 0 replies; 96+ messages in thread
From: Jonathan Cameron @ 2025-01-15 11:42 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop


> >> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> >> index 8389a94adb1a..681e415ac8f5 100644
> >> --- a/drivers/cxl/core/trace.h
> >> +++ b/drivers/cxl/core/trace.h
> >> @@ -48,6 +48,34 @@
> >>  	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
> >>  )
> >>  
> >> +TRACE_EVENT(cxl_port_aer_uncorrectable_error,
> >> +	TP_PROTO(struct device *dev, u32 status, u32 fe, u32 *hl),
> >> +	TP_ARGS(dev, status, fe, hl),
> >> +	TP_STRUCT__entry(
> >> +		__string(devname, dev_name(dev))
> >> +		__string(host, dev_name(dev->parent))  
> > What is host in this case? Perhaps a comment.  
> host is a string initialized with value from dev_name(dev->parent). What
> kind of comment would you like to see here?
What is that parent in practice?  A port, an EP, a PCI device?

> 
> Regards,
> Terry




* Re: [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver
  2025-01-15  1:18       ` Li Ming
@ 2025-01-15 14:39         ` Bowman, Terry
  2025-01-16  3:15           ` Li Ming
  0 siblings, 1 reply; 96+ messages in thread
From: Bowman, Terry @ 2025-01-15 14:39 UTC (permalink / raw)
  To: Li Ming, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas,
	PradeepVineshReddy.Kodamati, alucerop




On 1/14/2025 7:18 PM, Li Ming wrote:
> On 1/15/2025 3:29 AM, Bowman, Terry wrote:
>> On 1/14/2025 12:54 AM, Li Ming wrote:
>>> On 1/7/2025 10:38 PM, Terry Bowman wrote:
>>>> The AER service driver supports handling Downstream Port Protocol Errors in
>>>> Restricted CXL host (RCH) mode also known as CXL1.1. It needs the same
>>>> functionality for CXL PCIe Ports operating in Virtual Hierarchy (VH)
>>>> mode.[1]
>>>>
>>>> CXL and PCIe Protocol Error handling have different requirements that
>>>> necessitate a separate handling path. The AER service driver may try to
>>>> recover PCIe uncorrectable non-fatal errors (UCE). The same recovery is not
>>>> suitable for CXL PCIe Port devices because of potential for system memory
>>>> corruption. Instead, CXL Protocol Error handling must use a kernel panic
>>>> in the case of a fatal or non-fatal UCE. The AER driver's PCIe Protocol
>>>> Error handling does not panic the kernel in response to a UCE.
>>>>
>>>> Introduce a separate path for CXL Protocol Error handling in the AER
>>>> service driver. This will allow CXL Protocol Errors to use CXL specific
>>>> handling instead of PCIe handling. Add the CXL specific changes without
>>>> affecting or adding functionality in the PCIe handling.
>>>>
>>>> Make this update alongside the existing Downstream Port RCH error handling
>>>> logic, extending support to CXL PCIe Ports in VH mode.
>>>>
>>>> is_internal_error() is currently limited by CONFIG_PCIEAER_CXL kernel
>>>> config. Update is_internal_error()'s function declaration such that it is
>>>> always available regardless if CONFIG_PCIEAER_CXL kernel config is enabled
>>>> or disabled.
>>>>
>>>> The uncorrectable error (UCE) handling will be added in a future patch.
>>>>
>>>> [1] CXL 3.1 Spec, 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>>>> Upstream Switch Ports
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>>>> ---
>>>>  drivers/pci/pcie/aer.c | 61 +++++++++++++++++++++++++++---------------
>>>>  1 file changed, 40 insertions(+), 21 deletions(-)
>>>>
>>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>>> index f8b3350fcbb4..62be599e3bee 100644
>>>> --- a/drivers/pci/pcie/aer.c
>>>> +++ b/drivers/pci/pcie/aer.c
>>>> @@ -942,8 +942,15 @@ static bool find_source_device(struct pci_dev *parent,
>>>>  	return true;
>>>>  }
>>>>  
>>>> -#ifdef CONFIG_PCIEAER_CXL
>>>> +static bool is_internal_error(struct aer_err_info *info)
>>>> +{
>>>> +	if (info->severity == AER_CORRECTABLE)
>>>> +		return info->status & PCI_ERR_COR_INTERNAL;
>>>>  
>>>> +	return info->status & PCI_ERR_UNC_INTN;
>>>> +}
>>>> +
>>>> +#ifdef CONFIG_PCIEAER_CXL
>>>>  /**
>>>>   * pci_aer_unmask_internal_errors - unmask internal errors
>>>>   * @dev: pointer to the pcie_dev data structure
>>>> @@ -995,14 +1002,6 @@ static bool cxl_error_is_native(struct pci_dev *dev)
>>>>  	return (pcie_ports_native || host->native_aer);
>>>>  }
>>>>  
>>>> -static bool is_internal_error(struct aer_err_info *info)
>>>> -{
>>>> -	if (info->severity == AER_CORRECTABLE)
>>>> -		return info->status & PCI_ERR_COR_INTERNAL;
>>>> -
>>>> -	return info->status & PCI_ERR_UNC_INTN;
>>>> -}
>>>> -
>>>>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>>>  {
>>>>  	struct aer_err_info *info = (struct aer_err_info *)data;
>>>> @@ -1034,14 +1033,23 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>>>  
>>>>  static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>>>  {
>>>> -	/*
>>>> -	 * Internal errors of an RCEC indicate an AER error in an
>>>> -	 * RCH's downstream port. Check and handle them in the CXL.mem
>>>> -	 * device driver.
>>>> -	 */
>>>> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
>>>> -	    is_internal_error(info))
>>>> -		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>>>> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
>>>> +		return pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>>>> +
>>>> +	if (info->severity == AER_CORRECTABLE) {
>>>> +		struct pci_driver *pdrv = dev->driver;
>>>> +		int aer = dev->aer_cap;
>>>> +
>>>> +		if (aer)
>>>> +			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>>>> +					       info->status);
>>>> +
>>>> +		if (pdrv && pdrv->cxl_err_handler &&
>>>> +		    pdrv->cxl_err_handler->cor_error_detected)
>>>> +			pdrv->cxl_err_handler->cor_error_detected(dev);
>>>>
>>>> +		pcie_clear_device_status(dev);
>>>> +	}
>>>>  }
>>>>  
>>>>  static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>>>> @@ -1059,9 +1067,13 @@ static bool handles_cxl_errors(struct pci_dev *dev)
>>>>  {
>>>>  	bool handles_cxl = false;
>>>>  
>>>> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
>>>> -	    pcie_aer_is_native(dev))
>>>> +	if (!pcie_aer_is_native(dev))
>>>> +		return false;
>>>> +
>>>> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
>>>>  		pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
>>>> +	else
>>>> +		handles_cxl = pcie_is_cxl_port(dev);
>>> My understanding is if a cxl RP/USP/DSP is working on PCIe mode, they are also possible to expose a DVSEC ID 3(CXL r3.1 section 9.12.3). In such case, the AER handler should be pci_aer_handle_error() rather than cxl_handle_error().
>>>
>>> pcie_is_cxl_port() only checks if there is a DVSEC ID 3, but I think it should also check if the cxl port is working on CXL mode, does it make more sense?
>>>
>>>
>>> Ming
>> Hi Ming and Jonathan,
>>
>> RCH AER & RCH RAS are currently logged by the CXL driver's RCH handlers.
>>
>> If the recommended change is made then RCH RAS will not be logged and the
>> user would miss CXL details about the alternate protocol training failure.
>> Also, AER is not CXL required and as a result in some cases you would only
>> have the RCEC forwarded UIE/CIE message logged by the AER driver without
>> any other logging.
>>
>> Is there value in *not* logging CXL RAS for errors on an untrained RCH
>> link? Isn't it more informative to log PCIe AER and CXL RAS in this case?
>>
>> Regards,
>> Terry
> Hi Terry,
>
>
> I don't understand why the recommended change will influence RCH RAS handling, would you mind giving more details?
>
> My understanding is that above 'pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl)' is used for RCH case.
>
> And the 'else' block is used for VH case, so just check if the cxl port is working on CXL mode in pcie_is_cxl_port() or adding an extra function to check it in the 'else' block. I think it will not change RCH AER & RAS handling, is it right? or do I miss other details?
>
>
> Ming

Hi Ming,

You're recommending that this example case be handled by
pci_aer_handle_error() rather than cxl_handle_error(); correct me if I
misunderstood. I believe it should continue to be handled by
cxl_handle_error(). There are 2 issues with the recommended approach that
deserve to be mentioned.

First, the RCH Downstream Port (DP) is implemented as an RCRB and does not
have an SBDF.[1] The RCH AER error is reported with the RCEC SBDF in the AER
SRC_ID register.[2] The RCEC is used to find the RCH's handlers using a CXL
unique procedure (see cxl_handle_error()).

The logic in pci_aer_handle_error() operates on a 'struct pci_dev' type, and
pci_aer_handle_error() is not plumbed to support searching for the RCH
handlers.

Using pci_aer_handle_error() would require significant changes to support a
CXL RCH in addition to a PCIe device. These changes are already in
cxl_handle_error().

Another issue to note is that the CXL RAS information will (should) not be
logged with this recommended change. pci_aer_handle_error() is PCIe specific
and is not aware of CXL RAS. As a result, pci_aer_handle_error() is not
suited to log the CXL RAS.

The example scenario was that the RCH DP failed training. The user needs to
know why training failed, and these details are stored in the CXL RAS
registers. Again, CXL RAS needs to be logged as well, but CXL specific
awareness shouldn't be added to pci_aer_handle_error().
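
As a rough illustration of the RCEC based lookup, here is a minimal userspace model, not kernel code. The *_model types and model_*() functions are hypothetical; they only mimic the shape of pcie_walk_rcec() and cxl_rch_handle_error_iter(), and the 'handled' counter stands in for calling the CXL driver's handler and logging CXL RAS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a device associated with an RCEC. */
struct rciep_model {
	bool is_cxl_mem; /* assumption: identified as a CXL memory device */
	int handled;     /* counts calls into the (modelled) CXL handler */
};

/* Hypothetical model of the RCEC named by the AER source ID. */
struct rcec_model {
	struct rciep_model *eps;
	size_t nr_eps;
};

/* Per-device callback, in the role of cxl_rch_handle_error_iter(). */
static int model_rch_iter(struct rciep_model *ep, void *data)
{
	(void)data;

	if (ep->is_cxl_mem)
		ep->handled++;

	return 0; /* zero means keep walking */
}

/* Walker, in the role of pcie_walk_rcec(): visit each associated device. */
static void model_walk_rcec(struct rcec_model *rcec,
			    int (*cb)(struct rciep_model *, void *),
			    void *data)
{
	for (size_t i = 0; i < rcec->nr_eps; i++)
		if (cb(&rcec->eps[i], data))
			break;
}

/* Small self-check that doubles as a usage example. */
static void check_rcec_walk(void)
{
	struct rciep_model eps[] = {
		{ .is_cxl_mem = true },
		{ .is_cxl_mem = false },
	};
	struct rcec_model rcec = { .eps = eps, .nr_eps = 2 };

	model_walk_rcec(&rcec, model_rch_iter, NULL);
	assert(eps[0].handled == 1);
	assert(eps[1].handled == 0);
}
```

The point of the model is that the walk starts from the RCEC, since the RCH DP itself has no SBDF to look up.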

Terry

[1] CXL r3.1 - 8.2 Memory Mapped Registers
[2] CXL r3.1 - 12.2.1.1 RCH Downstream Port-detected Errors
>>>>  
>>>>  	return handles_cxl;
>>>>  }
>>>> @@ -1079,6 +1091,10 @@ static void cxl_enable_internal_errors(struct pci_dev *dev)
>>>>  static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
>>>>  static inline void cxl_handle_error(struct pci_dev *dev,
>>>>  				    struct aer_err_info *info) { }
>>>> +static bool handles_cxl_errors(struct pci_dev *dev)
>>>> +{
>>>> +	return false;
>>>> +}
>>>>  #endif
>>>>  
>>>>  /**
>>>> @@ -1116,8 +1132,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>>>  
>>>>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>>>>  {
>>>> -	cxl_handle_error(dev, info);
>>>> -	pci_aer_handle_error(dev, info);
>>>> +	if (is_internal_error(info) && handles_cxl_errors(dev))
>>>> +		cxl_handle_error(dev, info);
>>>> +	else
>>>> +		pci_aer_handle_error(dev, info);
>>>> +
>>>>  	pci_dev_put(dev);
>>>>  }
>>>>  
>



* Re: [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver
  2025-01-15 14:39         ` Bowman, Terry
@ 2025-01-16  3:15           ` Li Ming
  2025-02-05  3:46             ` Bowman, Terry
  0 siblings, 1 reply; 96+ messages in thread
From: Li Ming @ 2025-01-16  3:15 UTC (permalink / raw)
  To: Bowman, Terry, linux-cxl, linux-kernel, linux-pci, nifan.cxl,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, dan.j.williams, bhelgaas, mahesh, ira.weiny,
	oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas,
	PradeepVineshReddy.Kodamati, alucerop

On 1/15/2025 10:39 PM, Bowman, Terry wrote:
>
>
> On 1/14/2025 7:18 PM, Li Ming wrote:
>> On 1/15/2025 3:29 AM, Bowman, Terry wrote:
>>> On 1/14/2025 12:54 AM, Li Ming wrote:
>>>> On 1/7/2025 10:38 PM, Terry Bowman wrote:
>>>>> The AER service driver supports handling Downstream Port Protocol Errors in
>>>>> Restricted CXL host (RCH) mode also known as CXL1.1. It needs the same
>>>>> functionality for CXL PCIe Ports operating in Virtual Hierarchy (VH)
>>>>> mode.[1]
>>>>>
>>>>> CXL and PCIe Protocol Error handling have different requirements that
>>>>> necessitate a separate handling path. The AER service driver may try to
>>>>> recover PCIe uncorrectable non-fatal errors (UCE). The same recovery is not
>>>>> suitable for CXL PCIe Port devices because of potential for system memory
>>>>> corruption. Instead, CXL Protocol Error handling must use a kernel panic
>>>>> in the case of a fatal or non-fatal UCE. The AER driver's PCIe Protocol
>>>>> Error handling does not panic the kernel in response to a UCE.
>>>>>
>>>>> Introduce a separate path for CXL Protocol Error handling in the AER
>>>>> service driver. This will allow CXL Protocol Errors to use CXL specific
>>>>> handling instead of PCIe handling. Add the CXL specific changes without
>>>>> affecting or adding functionality in the PCIe handling.
>>>>>
>>>>> Make this update alongside the existing Downstream Port RCH error handling
>>>>> logic, extending support to CXL PCIe Ports in VH mode.
>>>>>
>>>>> is_internal_error() is currently limited by CONFIG_PCIEAER_CXL kernel
>>>>> config. Update is_internal_error()'s function declaration such that it is
>>>>> always available regardless if CONFIG_PCIEAER_CXL kernel config is enabled
>>>>> or disabled.
>>>>>
>>>>> The uncorrectable error (UCE) handling will be added in a future patch.
>>>>>
>>>>> [1] CXL 3.1 Spec, 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>>>>> Upstream Switch Ports
>>>>>
>>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>>>>> ---
>>>>>  drivers/pci/pcie/aer.c | 61 +++++++++++++++++++++++++++---------------
>>>>>  1 file changed, 40 insertions(+), 21 deletions(-)
>>>>>
>>>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>>>> index f8b3350fcbb4..62be599e3bee 100644
>>>>> --- a/drivers/pci/pcie/aer.c
>>>>> +++ b/drivers/pci/pcie/aer.c
>>>>> @@ -942,8 +942,15 @@ static bool find_source_device(struct pci_dev *parent,
>>>>>  	return true;
>>>>>  }
>>>>>  
>>>>> -#ifdef CONFIG_PCIEAER_CXL
>>>>> +static bool is_internal_error(struct aer_err_info *info)
>>>>> +{
>>>>> +	if (info->severity == AER_CORRECTABLE)
>>>>> +		return info->status & PCI_ERR_COR_INTERNAL;
>>>>>  
>>>>> +	return info->status & PCI_ERR_UNC_INTN;
>>>>> +}
>>>>> +
>>>>> +#ifdef CONFIG_PCIEAER_CXL
>>>>>  /**
>>>>>   * pci_aer_unmask_internal_errors - unmask internal errors
>>>>>   * @dev: pointer to the pcie_dev data structure
>>>>> @@ -995,14 +1002,6 @@ static bool cxl_error_is_native(struct pci_dev *dev)
>>>>>  	return (pcie_ports_native || host->native_aer);
>>>>>  }
>>>>>  
>>>>> -static bool is_internal_error(struct aer_err_info *info)
>>>>> -{
>>>>> -	if (info->severity == AER_CORRECTABLE)
>>>>> -		return info->status & PCI_ERR_COR_INTERNAL;
>>>>> -
>>>>> -	return info->status & PCI_ERR_UNC_INTN;
>>>>> -}
>>>>> -
>>>>>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>>>>  {
>>>>>  	struct aer_err_info *info = (struct aer_err_info *)data;
>>>>> @@ -1034,14 +1033,23 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>>>>  
>>>>>  static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>>>>  {
>>>>> -	/*
>>>>> -	 * Internal errors of an RCEC indicate an AER error in an
>>>>> -	 * RCH's downstream port. Check and handle them in the CXL.mem
>>>>> -	 * device driver.
>>>>> -	 */
>>>>> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
>>>>> -	    is_internal_error(info))
>>>>> -		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>>>>> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
>>>>> +		return pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>>>>> +
>>>>> +	if (info->severity == AER_CORRECTABLE) {
>>>>> +		struct pci_driver *pdrv = dev->driver;
>>>>> +		int aer = dev->aer_cap;
>>>>> +
>>>>> +		if (aer)
>>>>> +			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>>>>> +					       info->status);
>>>>> +
>>>>> +		if (pdrv && pdrv->cxl_err_handler &&
>>>>> +		    pdrv->cxl_err_handler->cor_error_detected)
>>>>> +			pdrv->cxl_err_handler->cor_error_detected(dev);
>>>>>
>>>>> +		pcie_clear_device_status(dev);
>>>>> +	}
>>>>>  }
>>>>>  
>>>>>  static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>>>>> @@ -1059,9 +1067,13 @@ static bool handles_cxl_errors(struct pci_dev *dev)
>>>>>  {
>>>>>  	bool handles_cxl = false;
>>>>>  
>>>>> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
>>>>> -	    pcie_aer_is_native(dev))
>>>>> +	if (!pcie_aer_is_native(dev))
>>>>> +		return false;
>>>>> +
>>>>> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
>>>>>  		pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
>>>>> +	else
>>>>> +		handles_cxl = pcie_is_cxl_port(dev);
>>>> My understanding is that a CXL RP/USP/DSP operating in PCIe mode may still expose a DVSEC ID 3 (CXL r3.1 section 9.12.3). In that case, the AER handler should be pci_aer_handle_error() rather than cxl_handle_error().
>>>>
>>>> pcie_is_cxl_port() only checks whether there is a DVSEC ID 3, but I think it should also check whether the CXL port is operating in CXL mode. Does that make more sense?
>>>>
>>>>
>>>> Ming
>>> Hi Ming and Jonathan,
>>>
>>> RCH AER & RCH RAS are currently logged by the CXL driver's RCH handlers.
>>>
>>> If the recommended change is made then RCH RAS will not be logged and the
>>> user would miss CXL details about the alternate protocol training failure.
>>> Also, AER is not CXL required and as a result in some cases you would only
>>> have the RCEC forwarded UIE/CIE message logged by the AER driver without
>>> any other logging.
>>>
>>> Is there value in *not* logging CXL RAS for errors on an untrained RCH
>>> link? Isn't it more informative to log PCIe AER and CXL RAS in this case?
>>>
>>> Regards,
>>> Terry
>> Hi Terry,
>>
>>
>> I don't understand why the recommended change would influence RCH RAS handling. Would you mind giving more details?
>>
>> My understanding is that the 'pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl)' call above is used for the RCH case.
>>
>> The 'else' block is used for the VH case, so the suggestion is just to check whether the CXL port is operating in CXL mode, either in pcie_is_cxl_port() or in an extra function called from the 'else' block. I think it will not change RCH AER & RAS handling. Is that right, or am I missing other details?
>>
>>
>> Ming
> Hi Ming,
>
> You're recommending this example case is handled by pci_aer_handle_error() rather than cxl_handle_error(). Correct me if I misunderstood. And, I believe this should continue to be handled by cxl_handle_error(). There are 2 issues with the recommended approach that deserve to be mentioned.

I guess you understood the recommended change as using pci_aer_handle_error() to handle CXL RAS issues? If so, that is not what I meant.

handles_cxl_errors() is used to distinguish whether an error is a CXL error or a PCIe error. If handles_cxl_errors() returns 'true', the error is a CXL error. Either cxl_handle_error() or pci_aer_handle_error() is then invoked depending on the returned value. I see no problem in this part.

handles_cxl_errors() uses pcie_is_cxl_port() to identify CXL errors in VH cases. The implementation of pcie_is_cxl_port() only checks whether a DVSEC ID 3 is exposed on the CXL RP/DSP/USP. I think that is not enough.

For example, if a CXL device is connected to a CXL RP, there is no problem: handles_cxl_errors() returns 'true' and cxl_handle_error() is invoked to handle the errors.

If a PCIe device is connected to a CXL RP, the CXL RP operates in PCIe mode, but it may still expose a DVSEC ID 3[1]. If it does, handles_cxl_errors() also returns 'true' and cxl_handle_error() is invoked to handle the error. I think that is not right: the CXL RP is operating in PCIe mode, so the error should be a PCIe error and should be handled by pci_aer_handle_error(). So my suggestion is to check, in pcie_is_cxl_port() for VH cases, whether the CXL RP/DSP/USP is operating in CXL mode.
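
A minimal sketch of the suggested check, written as a user-space model rather than kernel code. `struct port_model` and its `link_in_cxl_mode` field are hypothetical; how the kernel would actually detect that the link came up in CXL mode is not settled in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space model of a port; not the kernel's struct pci_dev. */
struct port_model {
	uint16_t port_dvsec;	/* config offset of CXL DVSEC ID 3, 0 if absent */
	bool link_in_cxl_mode;	/* assumption: true if the link trained in CXL mode */
};

/* v5 behavior for VH ports as discussed here: DVSEC presence alone decides. */
static bool is_cxl_port_dvsec_only(const struct port_model *p)
{
	return p->port_dvsec > 0;
}

/* Suggested behavior: DVSEC present and the port is operating in CXL mode. */
static bool is_cxl_port_mode_aware(const struct port_model *p)
{
	return p->port_dvsec > 0 && p->link_in_cxl_mode;
}
```

In this model, a CXL RP that exposes DVSEC ID 3 but has a PCIe device below it is treated as a CXL port by the first helper and as a PCIe port by the second.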


[1] CXL r3.1 - 9.12.3 Enumerating CXL RPs and DSPs

   "CXL root port or DSP connected to a PCIe device/switch may or may not expose the CXL DVSEC ID 3 and the CXL DVSEC ID 7 capability structures."

>
> First, the RCH Downstream Port (DP) is implemented as an RCRB and does not have a
> SBDF.[1] The RCH AER error is reported with the RCEC SBDF in the AER SRC_ID register.[2] The
> RCEC is used to find the RCH's handlers using a CXL unique procedure (see cxl_handle_error()).
>
> The logic in pci_aer_handle_error() operates on a 'struct pci_dev' type and pci_aer_handle_error() is not plumbed to support searching for the RCH handlers.
>
> Using pci_aer_handle_error would require significant changes to support a CXL RCH
> in addition to a PCIe device. These changes are already in cxl_handle_error().
>  
> Another issue to note is the CXL RAS information will (should) not be logged with this
> recommended change. pci_aer_handle_error() is PCIe specific and is not aware of CXL RAS. As a result, pci_aer_handle_error() is not suited to log the CXL RAS.
>
> The example scenario was the RCH DP failed training. The user needs to know why training
> failed and these details are stored in the CXL RAS registers. Again, CXL RAS needs to be logged
> as well but CXL specific awareness shouldn't be added to pci_aer_handle_error().

Regarding these two issues: handles_cxl_errors() always uses "pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl)" for RCH cases, and I am not proposing any change to that part. The return value of handles_cxl_errors() will be 'true' as expected in the cases you mentioned, and cxl_handle_error() will handle these errors.


Ming

>
> Terry
>
> [1] CXL r3.1 - 8.2 Memory Mapped Registers
> [2] CXL r3.1 - 12.2.1.1 RCH Downstream Port-detected Errors
>>>>>  
>>>>>  	return handles_cxl;
>>>>>  }
>>>>> @@ -1079,6 +1091,10 @@ static void cxl_enable_internal_errors(struct pci_dev *dev)
>>>>>  static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
>>>>>  static inline void cxl_handle_error(struct pci_dev *dev,
>>>>>  				    struct aer_err_info *info) { }
>>>>> +static bool handles_cxl_errors(struct pci_dev *dev)
>>>>> +{
>>>>> +	return false;
>>>>> +}
>>>>>  #endif
>>>>>  
>>>>>  /**
>>>>> @@ -1116,8 +1132,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>>>>  
>>>>>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>>>>>  {
>>>>> -	cxl_handle_error(dev, info);
>>>>> -	pci_aer_handle_error(dev, info);
>>>>> +	if (is_internal_error(info) && handles_cxl_errors(dev))
>>>>> +		cxl_handle_error(dev, info);
>>>>> +	else
>>>>> +		pci_aer_handle_error(dev, info);
>>>>> +
>>>>>  	pci_dev_put(dev);
>>>>>  }
>>>>>  



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 03/16] CXL/PCI: Introduce PCIe helper functions pcie_is_cxl() and pcie_is_cxl_port()
  2025-01-14 23:39         ` Bowman, Terry
@ 2025-01-16 15:35           ` Ira Weiny
  0 siblings, 0 replies; 96+ messages in thread
From: Ira Weiny @ 2025-01-16 15:35 UTC (permalink / raw)
  To: Bowman, Terry, Ira Weiny, linux-cxl, linux-kernel, linux-pci,
	nifan.cxl, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, dan.j.williams, bhelgaas, mahesh, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Bowman, Terry wrote:
> 
> 
> 
> On 1/14/2025 5:33 PM, Ira Weiny wrote:
> > Bowman, Terry wrote:
> >>
> >>
> >> On 1/13/2025 5:49 PM, Ira Weiny wrote:
> >>> Terry Bowman wrote:

[snip]

> >>>> +bool pcie_is_cxl_port(struct pci_dev *dev)
> >>>> +{
> >>>> +	if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
> >>>> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM) &&
> >>>> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
> >>>> +		return false;
> >>>> +
> >>>> +	return cxl_port_dvsec(dev);
> >>> Returning bool from a function which returns u16 is odd and I don't think
> >>> it should be coded this way.  I don't think it is wrong right now but this
> >>> really ought to code the pcie_is_cxl() here and leave cxl_port_dvsec()
> >>> alone.  Calling cxl_port_dvsec(), checking for if the dvsec exists, and
> >>> returning bool.
> >> Hi Ira,
> >>
> >> Thanks for reviewing. Is this what you are looking for here:
> >>
> >> +bool pcie_is_cxl_port(struct pci_dev *dev)
> >> +{
> >> +	return (cxl_port_dvsec(dev) > 0);
> > With the type checks, yes that is more clear.
> >
> > Ira
> >
> > [snip]
> Since sending the above, I updated the code to:
> 
> static u16 cxl_port_dvsec(struct pci_dev *dev)
> {
>         return pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
>                                          PCI_DVSEC_CXL_PORT);
> }
> 
> inline bool pcie_is_cxl(struct pci_dev *pci_dev)
> {
>         return pci_dev->is_cxl;
> }
> 
> bool pcie_is_cxl_port(struct pci_dev *pci_dev)
> {
>         if (!pcie_is_cxl(pci_dev))
>                 return false;
> 
>         return (cxl_port_dvsec(pci_dev) > 0);
> }
> 
> I can change if you see anything is needed.

Looks good thanks!
Ira
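
A minimal user-space sketch of the point settled above: keep the u16 capability lookup on its own and make the conversion to bool explicit. `struct dev_model`, `find_port_dvsec()` and `is_cxl_port()` are hypothetical stand-ins for struct pci_dev, cxl_port_dvsec() and pcie_is_cxl_port(); the offset stored in the model replaces the real config-space search.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct pci_dev. */
struct dev_model {
	bool is_cxl;		/* mirrors pci_dev->is_cxl */
	uint16_t port_dvsec;	/* config offset of the CXL Port DVSEC, 0 if absent */
};

/* Returns a config-space offset, like cxl_port_dvsec(); 0 means "not found". */
static uint16_t find_port_dvsec(const struct dev_model *dev)
{
	return dev->port_dvsec;
}

/* The offset is compared against 0 so the bool result is explicit. */
static bool is_cxl_port(const struct dev_model *dev)
{
	if (!dev->is_cxl)
		return false;

	return find_port_dvsec(dev) > 0;
}
```

The helper that returns an offset stays usable by callers that need the offset itself, while callers that only need a yes/no answer get a real bool.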


[snip]

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 16/16] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports
  2025-01-15  0:20         ` Bowman, Terry
@ 2025-01-16 21:42           ` Ira Weiny
  0 siblings, 0 replies; 96+ messages in thread
From: Ira Weiny @ 2025-01-16 21:42 UTC (permalink / raw)
  To: Bowman, Terry, Ira Weiny, linux-cxl, linux-kernel, linux-pci,
	nifan.cxl, dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, dan.j.williams, bhelgaas, mahesh, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

Bowman, Terry wrote:
> 
> 
> 
> On 1/14/2025 5:45 PM, Ira Weiny wrote:
> > Bowman, Terry wrote:
> >>
> >>
> >> On 1/14/2025 5:26 PM, Ira Weiny wrote:
> >>> Terry Bowman wrote:
> >>>> The AER service driver enables PCIe Uncorrectable Internal Errors (UIE) and
> >>>> Correctable Internal errors (CIE) for CXL Root Ports. The UIE and CIE are
> >>>> used in reporting CXL Protocol Errors. The same UIE/CIE enablement is
> >>>> needed for CXL Upstream Switch Ports and CXL Downstream Switch Ports
> >>>> in order to notify the associated Root Port and OS.[1]
> >>>>
> >>>> Export the AER service driver's pci_aer_unmask_internal_errors() function
> >>>> to CXL namespace.
> >>>>
> >>>> Remove the function's dependency on the CONFIG_PCIEAER_CXL kernel config
> >>>> because it is now an exported function.
> >>> This seems wrong to me.  As of this patch CXL_PCI requires PCIEAER_CXL for
> >>> the AER code to handle the errors which were just enabled.
> >>>
> >>> To keep PCIEAER_CXL optional pci_aer_unmask_internal_errors() should be
> >>> stubbed out in aer.h if !CONFIG_PCIEAER_CXL.
> >>>
> >>> Ira
> >> Bjorn (I believe in v1 or v2) directed me to remove
> >> pci_aer_unmask_internal_errors() dependency on PCIEAER_CXL because it is
> >> now exported. He wants the behavior for other users (and subsystems) to
> >> be consistent with/without the PCIEAER_CXL setting.
> >>
> > I see...  If PCIEAER_CXL is not enabled why even set the cxl error
> > handlers and enable these?
> >
> > I guess this is just adding some code which eventually calls
> > handles_cxl_errors() which returns false in the !PCIEAER_CXL case?
> >
> > Ira
> 
> Re-sending because I somehow sent from Outlook earlier.
> 
> cxl_dport_init_ras_reporting() and cxl_uport_init_ras_reporting() assign the error 
> handlers and are within #ifdef PCIEAER_CXL. The stubs are in cxl.h.
> 
> Correct. handles_cxl_errors() returns false in the !PCIEAER_CXL case.
> 

That is a bit convoluted...  But I'm not sure how to get the cross Kconfig
dependencies set to eliminate the set up.  :-/

So I guess it is fine.

Ira
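
The arrangement described above can be sketched in user space as follows. `MODEL_PCIEAER_CXL` stands in for CONFIG_PCIEAER_CXL, and `struct dev_model`, `enum handler` and `pick_handler()` are hypothetical names used for illustration; only the shape (a real predicate under the config option, a stub returning false otherwise) is taken from the patch.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct pci_dev. */
struct dev_model {
	bool native_aer;	/* OS owns AER for this device */
	bool is_cxl_port;	/* result of the CXL port check */
};

enum handler {
	HANDLER_PCIE,
	HANDLER_CXL,
};

#ifdef MODEL_PCIEAER_CXL
/* Config enabled: the predicate reflects the device. */
static bool handles_cxl_errors(const struct dev_model *dev)
{
	return dev->native_aer && dev->is_cxl_port;
}
#else
/* Config disabled: stub, so the CXL path is never selected. */
static bool handles_cxl_errors(const struct dev_model *dev)
{
	(void)dev;
	return false;
}
#endif

/* Mirrors the branch in handle_error_source() from patch 05/16. */
static enum handler pick_handler(const struct dev_model *dev, bool internal_error)
{
	if (internal_error && handles_cxl_errors(dev))
		return HANDLER_CXL;

	return HANDLER_PCIE;
}
```

Built without `MODEL_PCIEAER_CXL`, every error takes the PCIe path even if handlers were assigned and internal errors were unmasked, which is the behavior confirmed above.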

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 06/16] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe Port devices
  2025-01-14 11:32   ` Jonathan Cameron
  2025-01-14 20:44     ` Bowman, Terry
@ 2025-01-28 20:25     ` Bowman, Terry
  2025-01-29 18:04       ` Jonathan Cameron
  1 sibling, 1 reply; 96+ messages in thread
From: Bowman, Terry @ 2025-01-28 20:25 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop, Shuai Xue




On 1/14/2025 5:32 AM, Jonathan Cameron wrote:
> On Tue, 7 Jan 2025 08:38:42 -0600
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The AER service driver's aer_get_device_error_info() function doesn't read
>> uncorrectable (UCE) fatal error status from PCIe Upstream Port devices,
>> including CXL Upstream Switch Ports. As a result, fatal errors are not
>> logged or handled as needed for CXL PCIe Upstream Switch Port devices.
>>
>> Update the aer_get_device_error_info() function to read the UCE fatal
>> status for all CXL PCIe devices. Make the change such that non-CXL devices
>> are not affected.
>>
>> The fatal error status will be used in future patches implementing
>> CXL PCIe Port uncorrectable error handling and logging.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> This clashes with Shuai's series adding link healthy checks.
> Maybe we can reuse that logic to incorporate the condition we
> care about here?
>

Hi Jonathan, et al.,

After looking at this more closely, I believe we should remove this patch
from the patchset and defer the changes that log USP AER and RAS UCE.

I propose we reintroduce this later as an RFC or RFT in a future patchset.
This will give us the additional time needed for testing.

The only downside to adding it later is in the case of a CXL USP fatal UCE:
AER and RAS will not be logged. But this is the AER driver's existing
behavior and, as a result, isn't a regression.

Your thoughts?

Regards,
Terry
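
For reference, the condition touched by the patch below can be modeled in user space as a small predicate. This is only a sketch: the enums and `reads_uncor_status()` are hypothetical names, and the model covers just the "else if" shown in the diff, not the rest of aer_get_device_error_info().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the PCI_EXP_TYPE_* values used in the diff. */
enum port_type {
	PORT_ENDPOINT,
	PORT_ROOT_PORT,
	PORT_RC_EC,
	PORT_DOWNSTREAM,
	PORT_UPSTREAM,
};

/* Hypothetical stand-ins for the uncorrectable AER severities. */
enum uce_severity {
	UCE_NONFATAL,
	UCE_FATAL,
};

/*
 * Returns true if the uncorrectable status register would be read.
 * with_patch selects between the existing condition and the condition
 * with the deferred CXL Upstream Port clause added.
 */
static bool reads_uncor_status(enum port_type type, enum uce_severity sev,
			       bool is_cxl, bool with_patch)
{
	if (type == PORT_ROOT_PORT || type == PORT_RC_EC ||
	    type == PORT_DOWNSTREAM || sev == UCE_NONFATAL)
		return true;

	return with_patch && is_cxl && type == PORT_UPSTREAM;
}
```

In this model the only input whose result changes with the patch is a fatal UCE on a CXL Upstream Port, which matches the downside noted above for deferring it.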

>> ---
>>  drivers/pci/pcie/aer.c | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 62be599e3bee..79c828bdcb6d 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -1253,7 +1253,8 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
>>  	} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
>>  		   type == PCI_EXP_TYPE_RC_EC ||
>>  		   type == PCI_EXP_TYPE_DOWNSTREAM ||
>> -		   info->severity == AER_NONFATAL) {
>> +		   info->severity == AER_NONFATAL ||
>> +		   (pcie_is_cxl(dev) && type == PCI_EXP_TYPE_UPSTREAM)) {
>>  
>>  		/* Link is still healthy for IO reads */
>>  		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 06/16] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe Port devices
  2025-01-28 20:25     ` Bowman, Terry
@ 2025-01-29 18:04       ` Jonathan Cameron
  0 siblings, 0 replies; 96+ messages in thread
From: Jonathan Cameron @ 2025-01-29 18:04 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop, Shuai Xue

On Tue, 28 Jan 2025 14:25:54 -0600
"Bowman, Terry" <terry.bowman@amd.com> wrote:

> On 1/14/2025 5:32 AM, Jonathan Cameron wrote:
> > On Tue, 7 Jan 2025 08:38:42 -0600
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >  
> >> The AER service driver's aer_get_device_error_info() function doesn't read
> >> uncorrectable (UCE) fatal error status from PCIe Upstream Port devices,
> >> including CXL Upstream Switch Ports. As a result, fatal errors are not
> >> logged or handled as needed for CXL PCIe Upstream Switch Port devices.
> >>
> >> Update the aer_get_device_error_info() function to read the UCE fatal
> >> status for all CXL PCIe devices. Make the change such that non-CXL devices
> >> are not affected.
> >>
> >> The fatal error status will be used in future patches implementing
> >> CXL PCIe Port uncorrectable error handling and logging.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>  
> > This clashes with Shuai's series adding link healthy checks.
> > Maybe we can reuse that logic to incorporate the condition we
> > care about here?
> >  
> 
> Hi Jonathan, et al.,
> 
> After looking at this closer and considering the situation I believe
> we should remove this patch from the patchset and defer adding these
> changes to log USP AER and RAS UCE.
> 
> I propose we reintroduce this later as a RFC or RFT in a future patchset.
> This will give more needed time for testing.
> 
> The only downside to adding later is in the case of CXL USP fatal UCE. AER and
> RAS will not be logged but this was the AER driver's existing behavior and as a
> result isn't a regression.

If we have doubts and it is complex then sure. Let's do this in stages.

Jonathan

> 
> Your thoughts?
> 
> Regards,
> Terry
> 
> >> ---
> >>  drivers/pci/pcie/aer.c | 3 ++-
> >>  1 file changed, 2 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> >> index 62be599e3bee..79c828bdcb6d 100644
> >> --- a/drivers/pci/pcie/aer.c
> >> +++ b/drivers/pci/pcie/aer.c
> >> @@ -1253,7 +1253,8 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
> >>  	} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
> >>  		   type == PCI_EXP_TYPE_RC_EC ||
> >>  		   type == PCI_EXP_TYPE_DOWNSTREAM ||
> >> -		   info->severity == AER_NONFATAL) {
> >> +		   info->severity == AER_NONFATAL ||
> >> +		   (pcie_is_cxl(dev) && type == PCI_EXP_TYPE_UPSTREAM)) {
> >>  
> >>  		/* Link is still healthy for IO reads */
> >>  		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,  
> 
> 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver
  2025-01-16  3:15           ` Li Ming
@ 2025-02-05  3:46             ` Bowman, Terry
  2025-02-05 13:58               ` Li Ming
  0 siblings, 1 reply; 96+ messages in thread
From: Bowman, Terry @ 2025-02-05  3:46 UTC (permalink / raw)
  To: Li Ming, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas,
	PradeepVineshReddy.Kodamati, alucerop



On 1/15/2025 9:15 PM, Li Ming wrote:
> On 1/15/2025 10:39 PM, Bowman, Terry wrote:
>>
>> On 1/14/2025 7:18 PM, Li Ming wrote:
>>> On 1/15/2025 3:29 AM, Bowman, Terry wrote:
>>>> On 1/14/2025 12:54 AM, Li Ming wrote:
>>>>> On 1/7/2025 10:38 PM, Terry Bowman wrote:
>>>>>> The AER service driver supports handling Downstream Port Protocol Errors in
>>>>>> Restricted CXL host (RCH) mode also known as CXL1.1. It needs the same
>>>>>> functionality for CXL PCIe Ports operating in Virtual Hierarchy (VH)
>>>>>> mode.[1]
>>>>>>
>>>>>> CXL and PCIe Protocol Error handling have different requirements that
>>>>>> necessitate a separate handling path. The AER service driver may try to
>>>>>> recover PCIe uncorrectable non-fatal errors (UCE). The same recovery is not
>>>>>> suitable for CXL PCIe Port devices because of potential for system memory
>>>>>> corruption. Instead, CXL Protocol Error handling must use a kernel panic
>>>>>> in the case of a fatal or non-fatal UCE. The AER driver's PCIe Protocol
>>>>>> Error handling does not panic the kernel in response to a UCE.
>>>>>>
>>>>>> Introduce a separate path for CXL Protocol Error handling in the AER
>>>>>> service driver. This will allow CXL Protocol Errors to use CXL specific
>>>>>> handling instead of PCIe handling. Add the CXL specific changes without
>>>>>> affecting or adding functionality in the PCIe handling.
>>>>>>
>>>>>> Make this update alongside the existing Downstream Port RCH error handling
>>>>>> logic, extending support to CXL PCIe Ports in VH mode.
>>>>>>
>>>>>> is_internal_error() is currently limited by CONFIG_PCIEAER_CXL kernel
>>>>>> config. Update is_internal_error()'s function declaration such that it is
>>>>>> always available regardless if CONFIG_PCIEAER_CXL kernel config is enabled
>>>>>> or disabled.
>>>>>>
>>>>>> The uncorrectable error (UCE) handling will be added in a future patch.
>>>>>>
>>>>>> [1] CXL 3.1 Spec, 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>>>>>> Upstream Switch Ports
>>>>>>
>>>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>>>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>>>>>> ---
>>>>>>  drivers/pci/pcie/aer.c | 61 +++++++++++++++++++++++++++---------------
>>>>>>  1 file changed, 40 insertions(+), 21 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>>>>> index f8b3350fcbb4..62be599e3bee 100644
>>>>>> --- a/drivers/pci/pcie/aer.c
>>>>>> +++ b/drivers/pci/pcie/aer.c
>>>>>> @@ -942,8 +942,15 @@ static bool find_source_device(struct pci_dev *parent,
>>>>>>  	return true;
>>>>>>  }
>>>>>>  
>>>>>> -#ifdef CONFIG_PCIEAER_CXL
>>>>>> +static bool is_internal_error(struct aer_err_info *info)
>>>>>> +{
>>>>>> +	if (info->severity == AER_CORRECTABLE)
>>>>>> +		return info->status & PCI_ERR_COR_INTERNAL;
>>>>>>  
>>>>>> +	return info->status & PCI_ERR_UNC_INTN;
>>>>>> +}
>>>>>> +
>>>>>> +#ifdef CONFIG_PCIEAER_CXL
>>>>>>  /**
>>>>>>   * pci_aer_unmask_internal_errors - unmask internal errors
>>>>>>   * @dev: pointer to the pcie_dev data structure
>>>>>> @@ -995,14 +1002,6 @@ static bool cxl_error_is_native(struct pci_dev *dev)
>>>>>>  	return (pcie_ports_native || host->native_aer);
>>>>>>  }
>>>>>>  
>>>>>> -static bool is_internal_error(struct aer_err_info *info)
>>>>>> -{
>>>>>> -	if (info->severity == AER_CORRECTABLE)
>>>>>> -		return info->status & PCI_ERR_COR_INTERNAL;
>>>>>> -
>>>>>> -	return info->status & PCI_ERR_UNC_INTN;
>>>>>> -}
>>>>>> -
>>>>>>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>>>>>  {
>>>>>>  	struct aer_err_info *info = (struct aer_err_info *)data;
>>>>>> @@ -1034,14 +1033,23 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>>>>>  
>>>>>>  static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>>>>>  {
>>>>>> -	/*
>>>>>> -	 * Internal errors of an RCEC indicate an AER error in an
>>>>>> -	 * RCH's downstream port. Check and handle them in the CXL.mem
>>>>>> -	 * device driver.
>>>>>> -	 */
>>>>>> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
>>>>>> -	    is_internal_error(info))
>>>>>> -		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>>>>>> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
>>>>>> +		return pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>>>>>> +
>>>>>> +	if (info->severity == AER_CORRECTABLE) {
>>>>>> +		struct pci_driver *pdrv = dev->driver;
>>>>>> +		int aer = dev->aer_cap;
>>>>>> +
>>>>>> +		if (aer)
>>>>>> +			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>>>>>> +					       info->status);
>>>>>> +
>>>>>> +		if (pdrv && pdrv->cxl_err_handler &&
>>>>>> +		    pdrv->cxl_err_handler->cor_error_detected)
>>>>>> +			pdrv->cxl_err_handler->cor_error_detected(dev);
>>>>>>
>>>>>> +		pcie_clear_device_status(dev);
>>>>>> +	}
>>>>>>  }
>>>>>>  
>>>>>>  static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>>>>>> @@ -1059,9 +1067,13 @@ static bool handles_cxl_errors(struct pci_dev *dev)
>>>>>>  {
>>>>>>  	bool handles_cxl = false;
>>>>>>  
>>>>>> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
>>>>>> -	    pcie_aer_is_native(dev))
>>>>>> +	if (!pcie_aer_is_native(dev))
>>>>>> +		return false;
>>>>>> +
>>>>>> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
>>>>>>  		pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
>>>>>> +	else
>>>>>> +		handles_cxl = pcie_is_cxl_port(dev);
>>>>> My understanding is that a CXL RP/USP/DSP operating in PCIe mode may still expose a DVSEC ID 3 (CXL r3.1 section 9.12.3). In that case, the AER handler should be pci_aer_handle_error() rather than cxl_handle_error().
>>>>>
>>>>> pcie_is_cxl_port() only checks whether there is a DVSEC ID 3, but I think it should also check whether the CXL port is operating in CXL mode. Does that make more sense?
>>>>>
>>>>>
>>>>> Ming
>>>> Hi Ming and Jonathan,
>>>>
>>>> RCH AER & RCH RAS are currently logged by the CXL driver's RCH handlers.
>>>>
>>>> If the recommended change is made then RCH RAS will not be logged and the
>>>> user would miss CXL details about the alternate protocol training failure.
>>>> Also, AER is not CXL required and as a result in some cases you would only
>>>> have the RCEC forwarded UIE/CIE message logged by the AER driver without
>>>> any other logging.
>>>>
>>>> Is there value in *not* logging CXL RAS for errors on an untrained RCH
>>>> link? Isn't it more informative to log PCIe AER and CXL RAS in this case?
>>>>
>>>> Regards,
>>>> Terry
>>> Hi Terry,
>>>
>>>
>>> I don't understand why the recommended change would influence RCH RAS handling. Would you mind giving more details?
>>>
>>> My understanding is that the 'pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl)' call above is used for the RCH case.
>>>
>>> The 'else' block is used for the VH case, so the suggestion is just to check whether the CXL port is operating in CXL mode, either in pcie_is_cxl_port() or in an extra function called from the 'else' block. I think it will not change RCH AER & RAS handling. Is that right, or am I missing other details?
>>>
>>>
>>> Ming
>> Hi Ming,
>>
>> You're recommending this example case is handled by pci_aer_handle_error() rather than cxl_handle_error(). Correct me if I misunderstood. And, I believe this should continue to be handled by cxl_handle_error(). There are 2 issues with the recommended approach that deserve to be mentioned.
> I guess you understood the recommended change as using pci_aer_handle_error() to handle CXL RAS issues? If so, that is not what I meant.
>
> handles_cxl_errors() is used to distinguish whether an error is a CXL error or a PCIe error. If handles_cxl_errors() returns 'true', the error is a CXL error. Either cxl_handle_error() or pci_aer_handle_error() is then invoked depending on the returned value. I see no problem in this part.
>
> handles_cxl_errors() uses pcie_is_cxl_port() to identify CXL errors in VH cases. The implementation of pcie_is_cxl_port() only checks whether a DVSEC ID 3 is exposed on the CXL RP/DSP/USP. I think that is not enough.
>
> For example, if a CXL device is connected to a CXL RP, there is no problem: handles_cxl_errors() returns 'true' and cxl_handle_error() is invoked to handle the errors.
>
> If a PCIe device is connected to a CXL RP, the CXL RP operates in PCIe mode, but it may still expose a DVSEC ID 3[1]. If it does, handles_cxl_errors() also returns 'true' and cxl_handle_error() is invoked to handle the error. I think that is not right: the CXL RP is operating in PCIe mode, so the error should be a PCIe error and should be handled by pci_aer_handle_error(). So my suggestion is to check, in pcie_is_cxl_port() for VH cases, whether the CXL RP/DSP/USP is operating in CXL mode.
>
>
> [1] CXL r3.1 - 9.12.3 Enumerating CXL RPs and DSPs
>
>    "CXL root port or DSP connected to a PCIe device/switch may or may not expose the CXL DVSEC ID 3 and the CXL DVSEC ID 7 capability structures."
>

Hi Ming,

I apologize for the delayed response. Thanks for your patience in explaining.

In your example of an RP with a downstream non-CXL device, the RP's AER CE/UCE
and RAS status will be logged for a protocol error. The RAS status is not helpful
in this case because it's a non-CXL device, but the failure is an alternate
protocol training failure, which can also happen with a CXL endpoint. I expect the
RAS registers to contain details about the failed CXL training in the endpoint case.

I believe we should give the user as much error detail as is reasonable. For CXL
errors reported through AER CE/UCE, this should include the RAS logging. If we rely
on the PCIe handling path, this information will not be logged.

Also, CE/UCE AER is logged in the CXL handling path. The AER driver logs AER status
before calling the CE/UCE CXL handlers.

Are there any other use cases or reasons to use PCIe handling if alternate protocol
training fails? Is there anything lost by using CXL handling?

Terry
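
For reference while weighing the two paths, the routing in handle_error_source() from this patch can be modeled in user space as below. The types and `route_error()` are hypothetical, and the two bit masks are assumed to match PCI_ERR_COR_INTERNAL and PCI_ERR_UNC_INTN in pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match PCI_ERR_COR_INTERNAL and PCI_ERR_UNC_INTN. */
#define MODEL_ERR_COR_INTERNAL	0x00004000u
#define MODEL_ERR_UNC_INTN	0x00400000u

enum severity {
	SEV_CORRECTABLE,
	SEV_NONFATAL,
	SEV_FATAL,
};

enum route {
	ROUTE_PCIE,
	ROUTE_CXL,
};

/* Hypothetical stand-in for struct aer_err_info. */
struct err_info {
	enum severity severity;
	uint32_t status;
};

/* Same test as is_internal_error() in the patch: CIE or UIE bit set. */
static bool is_internal_error(const struct err_info *info)
{
	if (info->severity == SEV_CORRECTABLE)
		return (info->status & MODEL_ERR_COR_INTERNAL) != 0;

	return (info->status & MODEL_ERR_UNC_INTN) != 0;
}

/* port_handles_cxl models the result of handles_cxl_errors(dev). */
static enum route route_error(const struct err_info *info, bool port_handles_cxl)
{
	if (is_internal_error(info) && port_handles_cxl)
		return ROUTE_CXL;

	return ROUTE_PCIE;
}
```

In this model, a port for which the CXL predicate is true sends internal errors to the CXL path regardless of what is attached below it, which is the case being debated here.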
>> First, the RCH Downstream Port (DP) is implemented as an RCRB and does not have a
>> SBDF.[1] The RCH AER error is reported with the RCEC SBDF in the AER SRC_ID register.[2] The
>> RCEC is used to find the RCH's handlers using a CXL unique procedure (see cxl_handle_error()).
>>
>> The logic in pci_aer_handle_error() operates on a 'struct pci_dev' type and pci_aer_handle_error() is not plumbed to support searching for the RCH handlers.
>>
>> Using pci_aer_handle_error would require significant changes to support a CXL RCH
>> in addition to a PCIe device. These changes are already in cxl_handle_error().
>>  
>> Another issue to note is the CXL RAS information will (should) not be logged with this
>> recommended change. pci_aer_handle_error() is PCIe specific and is not aware of CXL RAS. As a result, pci_aer_handle_error() is not suited to log the CXL RAS.
>>
>> The example scenario was the RCH DP failed training. The user needs to know why training
>> failed and these details are stored in the CXL RAS registers. Again, CXL RAS needs to be logged
>> as well but CXL specific awareness shouldn't be added to pci_aer_handle_error().
> For these two issues, handles_cxl_errors() is always using "pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl)" for RCH cases. I believe no change on this part, the return value of handles_cxl_errors() will be 'true' as expected in the cases you mentioned, cxl_handle_error() will help to handle these errors.
>
>
> Ming
>
>> Terry
>>
>> [1] CXL r3.1 - 8.2 Memory Mapped Registers
>> [2] CXL r3.1 - 12.2.1.1 RCH Downstream Port-detected Errors+
>>>>>>  
>>>>>>  	return handles_cxl;
>>>>>>  }
>>>>>> @@ -1079,6 +1091,10 @@ static void cxl_enable_internal_errors(struct pci_dev *dev)
>>>>>>  static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
>>>>>>  static inline void cxl_handle_error(struct pci_dev *dev,
>>>>>>  				    struct aer_err_info *info) { }
>>>>>> +static bool handles_cxl_errors(struct pci_dev *dev)
>>>>>> +{
>>>>>> +	return false;
>>>>>> +}
>>>>>>  #endif
>>>>>>  
>>>>>>  /**
>>>>>> @@ -1116,8 +1132,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>>>>>  
>>>>>>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>>>>>>  {
>>>>>> -	cxl_handle_error(dev, info);
>>>>>> -	pci_aer_handle_error(dev, info);
>>>>>> +	if (is_internal_error(info) && handles_cxl_errors(dev))
>>>>>> +		cxl_handle_error(dev, info);
>>>>>> +	else
>>>>>> +		pci_aer_handle_error(dev, info);
>>>>>> +
>>>>>>  	pci_dev_put(dev);
>>>>>>  }
>>>>>>  
>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver
  2025-02-05  3:46             ` Bowman, Terry
@ 2025-02-05 13:58               ` Li Ming
  2025-02-05 14:22                 ` Bowman, Terry
  0 siblings, 1 reply; 96+ messages in thread
From: Li Ming @ 2025-02-05 13:58 UTC (permalink / raw)
  To: Bowman, Terry, linux-cxl, linux-kernel, linux-pci, nifan.cxl,
	dave, jonathan.cameron, dave.jiang, alison.schofield,
	vishal.l.verma, dan.j.williams, bhelgaas, mahesh, ira.weiny,
	oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas,
	PradeepVineshReddy.Kodamati, alucerop

On 2/5/2025 11:46 AM, Bowman, Terry wrote:
>
> On 1/15/2025 9:15 PM, Li Ming wrote:
>> On 1/15/2025 10:39 PM, Bowman, Terry wrote:
>>> On 1/14/2025 7:18 PM, Li Ming wrote:
>>>> On 1/15/2025 3:29 AM, Bowman, Terry wrote:
>>>>> On 1/14/2025 12:54 AM, Li Ming wrote:
>>>>>> On 1/7/2025 10:38 PM, Terry Bowman wrote:
>>>>>>> The AER service driver supports handling Downstream Port Protocol Errors in
>>>>>>> Restricted CXL host (RCH) mode also known as CXL1.1. It needs the same
>>>>>>> functionality for CXL PCIe Ports operating in Virtual Hierarchy (VH)
>>>>>>> mode.[1]
>>>>>>>
>>>>>>> CXL and PCIe Protocol Error handling have different requirements that
>>>>>>> necessitate a separate handling path. The AER service driver may try to
>>>>>>> recover PCIe uncorrectable non-fatal errors (UCE). The same recovery is not
>>>>>>> suitable for CXL PCIe Port devices because of potential for system memory
>>>>>>> corruption. Instead, CXL Protocol Error handling must use a kernel panic
>>>>>>> in the case of a fatal or non-fatal UCE. The AER driver's PCIe Protocol
>>>>>>> Error handling does not panic the kernel in response to a UCE.
>>>>>>>
>>>>>>> Introduce a separate path for CXL Protocol Error handling in the AER
>>>>>>> service driver. This will allow CXL Protocol Errors to use CXL specific
>>>>>>> handling instead of PCIe handling. Add the CXL specific changes without
>>>>>>> affecting or adding functionality in the PCIe handling.
>>>>>>>
>>>>>>> Make this update alongside the existing Downstream Port RCH error handling
>>>>>>> logic, extending support to CXL PCIe Ports in VH mode.
>>>>>>>
>>>>>>> is_internal_error() is currently limited by CONFIG_PCIEAER_CXL kernel
>>>>>>> config. Update is_internal_error()'s function declaration such that it is
>>>>>>> always available regardless if CONFIG_PCIEAER_CXL kernel config is enabled
>>>>>>> or disabled.
>>>>>>>
>>>>>>> The uncorrectable error (UCE) handling will be added in a future patch.
>>>>>>>
>>>>>>> [1] CXL 3.1 Spec, 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>>>>>>> Upstream Switch Ports
>>>>>>>
>>>>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>>>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>>>>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>>>>>>> ---
>>>>>>>  drivers/pci/pcie/aer.c | 61 +++++++++++++++++++++++++++---------------
>>>>>>>  1 file changed, 40 insertions(+), 21 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>>>>>> index f8b3350fcbb4..62be599e3bee 100644
>>>>>>> --- a/drivers/pci/pcie/aer.c
>>>>>>> +++ b/drivers/pci/pcie/aer.c
>>>>>>> @@ -942,8 +942,15 @@ static bool find_source_device(struct pci_dev *parent,
>>>>>>>  	return true;
>>>>>>>  }
>>>>>>>  
>>>>>>> -#ifdef CONFIG_PCIEAER_CXL
>>>>>>> +static bool is_internal_error(struct aer_err_info *info)
>>>>>>> +{
>>>>>>> +	if (info->severity == AER_CORRECTABLE)
>>>>>>> +		return info->status & PCI_ERR_COR_INTERNAL;
>>>>>>>  
>>>>>>> +	return info->status & PCI_ERR_UNC_INTN;
>>>>>>> +}
>>>>>>> +
>>>>>>> +#ifdef CONFIG_PCIEAER_CXL
>>>>>>>  /**
>>>>>>>   * pci_aer_unmask_internal_errors - unmask internal errors
>>>>>>>   * @dev: pointer to the pcie_dev data structure
>>>>>>> @@ -995,14 +1002,6 @@ static bool cxl_error_is_native(struct pci_dev *dev)
>>>>>>>  	return (pcie_ports_native || host->native_aer);
>>>>>>>  }
>>>>>>>  
>>>>>>> -static bool is_internal_error(struct aer_err_info *info)
>>>>>>> -{
>>>>>>> -	if (info->severity == AER_CORRECTABLE)
>>>>>>> -		return info->status & PCI_ERR_COR_INTERNAL;
>>>>>>> -
>>>>>>> -	return info->status & PCI_ERR_UNC_INTN;
>>>>>>> -}
>>>>>>> -
>>>>>>>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>>>>>>  {
>>>>>>>  	struct aer_err_info *info = (struct aer_err_info *)data;
>>>>>>> @@ -1034,14 +1033,23 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>>>>>>  
>>>>>>>  static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>>>>>>  {
>>>>>>> -	/*
>>>>>>> -	 * Internal errors of an RCEC indicate an AER error in an
>>>>>>> -	 * RCH's downstream port. Check and handle them in the CXL.mem
>>>>>>> -	 * device driver.
>>>>>>> -	 */
>>>>>>> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
>>>>>>> -	    is_internal_error(info))
>>>>>>> -		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>>>>>>> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
>>>>>>> +		return pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>>>>>>> +
>>>>>>> +	if (info->severity == AER_CORRECTABLE) {
>>>>>>> +		struct pci_driver *pdrv = dev->driver;
>>>>>>> +		int aer = dev->aer_cap;
>>>>>>> +
>>>>>>> +		if (aer)
>>>>>>> +			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>>>>>>> +					       info->status);
>>>>>>> +
>>>>>>> +		if (pdrv && pdrv->cxl_err_handler &&
>>>>>>> +		    pdrv->cxl_err_handler->cor_error_detected)
>>>>>>> +			pdrv->cxl_err_handler->cor_error_detected(dev);
>>>>>>>
>>>>>>> +		pcie_clear_device_status(dev);
>>>>>>> +	}
>>>>>>>  }
>>>>>>>  
>>>>>>>  static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>>>>>>> @@ -1059,9 +1067,13 @@ static bool handles_cxl_errors(struct pci_dev *dev)
>>>>>>>  {
>>>>>>>  	bool handles_cxl = false;
>>>>>>>  
>>>>>>> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
>>>>>>> -	    pcie_aer_is_native(dev))
>>>>>>> +	if (!pcie_aer_is_native(dev))
>>>>>>> +		return false;
>>>>>>> +
>>>>>>> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
>>>>>>>  		pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
>>>>>>> +	else
>>>>>>> +		handles_cxl = pcie_is_cxl_port(dev);
>>>>>> My understanding is if a cxl RP/USP/DSP is working on PCIe mode, they are also possible to expose a DVSEC ID 3(CXL r3.1 section 9.12.3). In such case, the AER handler should be pci_aer_handle_error() rather than cxl_handle_error().
>>>>>>
>>>>>> pcie_is_cxl_port() only checks if there is a DVSEC ID 3, but I think it should also check if the cxl port is working on CXL mode, does it make more sense?
>>>>>>
>>>>>>
>>>>>> Ming
>>>>> Hi Ming and Jonathan,
>>>>>
>>>>> RCH AER & RCH RAS are currently logged by the CXL driver's RCH handlers.
>>>>>
>>>>> If the recommended change is made then RCH RAS will not be logged and the
>>>>> user would miss CXL details about the alternate protocol training failure.
>>>>> Also, AER is not CXL required and as a result in some cases you would only
>>>>> have the RCEC forwarded UIE/CIE message logged by the AER driver without
>>>>> any other logging.
>>>>>
>>>>> Is there value in *not* logging CXL RAS for errors on an untrained RCH
>>>>> link? Isn't it more informative to log PCIe AER and CXL RAS in this case?
>>>>>
>>>>> Regards,
>>>>> Terry
>>>> Hi Terry,
>>>>
>>>>
>>>> I don't understand why the recommended change will influence RCH RAS handling, would you mind giving more details?
>>>>
>>>> My understanding is that above 'pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl)' is used for RCH case.
>>>>
>>>> And the 'else' block is used for VH case, so just check if the cxl port is working on CXL mode in pcie_is_cxl_port() or adding an extra function to check it in the 'else' block. I think it will not change RCH AER & RAS handling, is it right? or do I miss other details?
>>>>
>>>>
>>>> Ming
>>> Hi Ming,
>>>
>>> You're recommending this example case is handled by pci_aer_handle_error() rather than cxl_handle_error(). Correct me if I misunderstood. And, I believe this should continue to be handled by cxl_handle_error(). There are 2 issues with the recommended approach that deserve to be mentioned.
>> I guess that what you thought is the recommended change using pci_aer_handle_error() to handle CXL RAS issues? If yes, it is not what I meant.
>>
>> handles_cxl_errors() is used to distinguish whether the error is a CXL error or a PCIe error. If handles_cxl_errors() returns 'true', the error is a CXL error. Either cxl_handle_error() or pci_aer_handle_error() is then invoked depending on the returned value. I see no problem with this part.
>>
>> handles_cxl_errors() uses pcie_is_cxl_port() to identify CXL errors for VH cases. The implementation of pcie_is_cxl_port() only checks whether a DVSEC ID 3 is exposed on the CXL RP/DSP/USP. I think that is not enough.
>>
>> For example, if a CXL device is connected to a CXL RP, there is no problem, because handles_cxl_errors() returns 'true' and cxl_handle_error() is invoked to handle the errors.
>>
>> If a PCIe device is connected to a CXL RP, the CXL RP operates in PCIe mode, and the CXL RP may still expose a DVSEC ID 3.[1] If the CXL RP has a DVSEC ID 3 in that case, handles_cxl_errors() also returns 'true' and cxl_handle_error() is also invoked to handle the error. I think that is not right: the CXL RP is operating in PCIe mode, so the error should be a PCIe error and should be handled by pci_aer_handle_error(). So my suggestion is to check whether the CXL RP/DSP/USP is operating in CXL mode in pcie_is_cxl_port() for VH cases.
>>
>>
>> [1] CXL r3.1 - 9.12.3 Enumerating CXL RPs and DSPs
>>
>>    "CXL root port or DSP connected to a PCIe device/switch may or may not expose the CXL DVSEC ID 3 and the CXL DVSEC ID 7 capability structures."
>>
> Hi Ming,
>
> I apologize for the delayed response. Thanks for the patience in explaining.
>
> In your example using an RP with a downstream non-CXL device, the RP AER will log the
> RP's CE/UCE and RAS status for a protocol error. That is not helpful in this case
> because it's a non-CXL device, but the failure is in alternate protocol training, which can
> also happen with a CXL endpoint. I expect the RAS registers contain details about
> the failed CXL training in the endpoint case.
>
> I believe we should give the user as much error detail as is reasonable. For CXL errors
> reported through AER CE/UCE, this should include the RAS logging. If we rely on the PCIe
> handling path, this information will not be logged.
>
> Also, CE/UCE AER is logged in the CXL handling path. The AER driver logs AER status before
> calling the CE/UCE CXL handlers.
>
> Are there any other use cases or reasons to use PCIe handling if alt. protocol training
> fails? Is there anything lost by using CXL handling?

One problem I realized: if cxl_handle_error() is used instead of pci_aer_handle_error() for the case I described above (a CXL RP operating in PCIe mode because it is connected to a PCIe device), the CXL RP will miss the pcie_do_recovery() call made by pci_aer_handle_error() when the error is a UCE, and it will also miss the PCIe error handler implemented in the PCIe port driver.

It means that the AER handling logic differs between a CXL RP operating in PCIe mode and a PCIe RP. I am not sure whether that is OK.


Although cxl_handle_error() includes cxl_do_recovery(), implemented in patch #7, cxl_do_recovery() seems to be intended only for CXL cases (a CXL RP operating in CXL mode). Is it suitable for PCIe port recovery (a CXL RP operating in PCIe mode)?
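
The concern can be sketched in user space as follows. This is only a model under
stated assumptions: struct model_port, its link_in_cxl_mode field, and the helper
names are hypothetical and invented for this sketch. How the kernel would actually
determine the operating mode of the port is not settled in this thread.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a CXL-capable port; not a kernel structure. */
struct model_port {
	bool has_cxl_dvsec3;   /* what pcie_is_cxl_port() is described as checking */
	bool link_in_cxl_mode; /* assumption: the operating mode is known somehow */
};

enum model_recovery { RECOVERY_PCIE, RECOVERY_CXL };

/* Behaviour as described above: presence of DVSEC ID 3 alone decides. */
static bool is_cxl_port_dvsec_only(const struct model_port *p)
{
	assert(p);
	return p->has_cxl_dvsec3;
}

/* Suggested behaviour: also require the port to be operating in CXL mode. */
static bool is_cxl_port_mode_aware(const struct model_port *p)
{
	assert(p);
	return p->has_cxl_dvsec3 && p->link_in_cxl_mode;
}

/* A port treated as CXL gets CXL recovery; otherwise PCIe recovery applies. */
static enum model_recovery pick_recovery(bool treated_as_cxl)
{
	return treated_as_cxl ? RECOVERY_CXL : RECOVERY_PCIE;
}
```

For a port that exposes DVSEC ID 3 but operates in PCIe mode, the DVSEC-only check
selects RECOVERY_CXL while the mode-aware check selects RECOVERY_PCIE. That
difference is the case I am asking about.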

Please correct me if I am wrong.


Ming

>
> Terry
>>> First, the RCH Downstream Port (DP) is implemented as an RCRB and does not have a
>>> SBDF.[1] The RCH AER error is reported with the RCEC SBDF in the AER SRC_ID register.[2] The
>>> RCEC is used to find the RCH's handlers using a CXL unique procedure (see cxl_handle_error()).
>>>
>>> The logic in pci_aer_handle_error() operates on a 'struct pci_dev' type and pci_aer_handle_error() is not plumbed to support searching for the RCH handlers.
>>>
>>> Using pci_aer_handle_error would require significant changes to support a CXL RCH
>>> in addition to a PCIe device. These changes are already in cxl_handle_error().
>>>  
>>> Another issue to note is the CXL RAS information will (should) not be logged with this
>>> recommended change. pci_aer_handle_error() is PCIe specific and is not aware of CXL RAS. As a result, pci_aer_handle_error() is not suited to log the CXL RAS.
>>>
>>> The example scenario was the RCH DP failed training. The user needs to know why training
>>> failed and these details are stored in the CXL RAS registers. Again, CXL RAS needs to be logged
>>> as well but CXL specific awareness shouldn't be added to pci_aer_handle_error().
>> For these two issues, handles_cxl_errors() is always using "pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl)" for RCH cases. I believe no change on this part, the return value of handles_cxl_errors() will be 'true' as expected in the cases you mentioned, cxl_handle_error() will help to handle these errors.
>>
>>
>> Ming
>>
>>> Terry
>>>
>>> [1] CXL r3.1 - 8.2 Memory Mapped Registers
>>> [2] CXL r3.1 - 12.2.1.1 RCH Downstream Port-detected Errors+
>>>>>>>  
>>>>>>>  	return handles_cxl;
>>>>>>>  }
>>>>>>> @@ -1079,6 +1091,10 @@ static void cxl_enable_internal_errors(struct pci_dev *dev)
>>>>>>>  static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
>>>>>>>  static inline void cxl_handle_error(struct pci_dev *dev,
>>>>>>>  				    struct aer_err_info *info) { }
>>>>>>> +static bool handles_cxl_errors(struct pci_dev *dev)
>>>>>>> +{
>>>>>>> +	return false;
>>>>>>> +}
>>>>>>>  #endif
>>>>>>>  
>>>>>>>  /**
>>>>>>> @@ -1116,8 +1132,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>>>>>>  
>>>>>>>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>>>>>>>  {
>>>>>>> -	cxl_handle_error(dev, info);
>>>>>>> -	pci_aer_handle_error(dev, info);
>>>>>>> +	if (is_internal_error(info) && handles_cxl_errors(dev))
>>>>>>> +		cxl_handle_error(dev, info);
>>>>>>> +	else
>>>>>>> +		pci_aer_handle_error(dev, info);
>>>>>>> +
>>>>>>>  	pci_dev_put(dev);
>>>>>>>  }
>>>>>>>  



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver
  2025-02-05 13:58               ` Li Ming
@ 2025-02-05 14:22                 ` Bowman, Terry
  0 siblings, 0 replies; 96+ messages in thread
From: Bowman, Terry @ 2025-02-05 14:22 UTC (permalink / raw)
  To: Li Ming, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas,
	PradeepVineshReddy.Kodamati, alucerop



On 2/5/2025 7:58 AM, Li Ming wrote:
> On 2/5/2025 11:46 AM, Bowman, Terry wrote:
>> On 1/15/2025 9:15 PM, Li Ming wrote:
>>> On 1/15/2025 10:39 PM, Bowman, Terry wrote:
>>>> On 1/14/2025 7:18 PM, Li Ming wrote:
>>>>> On 1/15/2025 3:29 AM, Bowman, Terry wrote:
>>>>>> On 1/14/2025 12:54 AM, Li Ming wrote:
>>>>>>> On 1/7/2025 10:38 PM, Terry Bowman wrote:
>>>>>>>> The AER service driver supports handling Downstream Port Protocol Errors in
>>>>>>>> Restricted CXL host (RCH) mode also known as CXL1.1. It needs the same
>>>>>>>> functionality for CXL PCIe Ports operating in Virtual Hierarchy (VH)
>>>>>>>> mode.[1]
>>>>>>>>
>>>>>>>> CXL and PCIe Protocol Error handling have different requirements that
>>>>>>>> necessitate a separate handling path. The AER service driver may try to
>>>>>>>> recover PCIe uncorrectable non-fatal errors (UCE). The same recovery is not
>>>>>>>> suitable for CXL PCIe Port devices because of potential for system memory
>>>>>>>> corruption. Instead, CXL Protocol Error handling must use a kernel panic
>>>>>>>> in the case of a fatal or non-fatal UCE. The AER driver's PCIe Protocol
>>>>>>>> Error handling does not panic the kernel in response to a UCE.
>>>>>>>>
>>>>>>>> Introduce a separate path for CXL Protocol Error handling in the AER
>>>>>>>> service driver. This will allow CXL Protocol Errors to use CXL specific
>>>>>>>> handling instead of PCIe handling. Add the CXL specific changes without
>>>>>>>> affecting or adding functionality in the PCIe handling.
>>>>>>>>
>>>>>>>> Make this update alongside the existing Downstream Port RCH error handling
>>>>>>>> logic, extending support to CXL PCIe Ports in VH mode.
>>>>>>>>
>>>>>>>> is_internal_error() is currently limited by CONFIG_PCIEAER_CXL kernel
>>>>>>>> config. Update is_internal_error()'s function declaration such that it is
>>>>>>>> always available regardless if CONFIG_PCIEAER_CXL kernel config is enabled
>>>>>>>> or disabled.
>>>>>>>>
>>>>>>>> The uncorrectable error (UCE) handling will be added in a future patch.
>>>>>>>>
>>>>>>>> [1] CXL 3.1 Spec, 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>>>>>>>> Upstream Switch Ports
>>>>>>>>
>>>>>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>>>>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>>>>>>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>>>>>>>> ---
>>>>>>>>  drivers/pci/pcie/aer.c | 61 +++++++++++++++++++++++++++---------------
>>>>>>>>  1 file changed, 40 insertions(+), 21 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>>>>>>> index f8b3350fcbb4..62be599e3bee 100644
>>>>>>>> --- a/drivers/pci/pcie/aer.c
>>>>>>>> +++ b/drivers/pci/pcie/aer.c
>>>>>>>> @@ -942,8 +942,15 @@ static bool find_source_device(struct pci_dev *parent,
>>>>>>>>  	return true;
>>>>>>>>  }
>>>>>>>>  
>>>>>>>> -#ifdef CONFIG_PCIEAER_CXL
>>>>>>>> +static bool is_internal_error(struct aer_err_info *info)
>>>>>>>> +{
>>>>>>>> +	if (info->severity == AER_CORRECTABLE)
>>>>>>>> +		return info->status & PCI_ERR_COR_INTERNAL;
>>>>>>>>  
>>>>>>>> +	return info->status & PCI_ERR_UNC_INTN;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +#ifdef CONFIG_PCIEAER_CXL
>>>>>>>>  /**
>>>>>>>>   * pci_aer_unmask_internal_errors - unmask internal errors
>>>>>>>>   * @dev: pointer to the pcie_dev data structure
>>>>>>>> @@ -995,14 +1002,6 @@ static bool cxl_error_is_native(struct pci_dev *dev)
>>>>>>>>  	return (pcie_ports_native || host->native_aer);
>>>>>>>>  }
>>>>>>>>  
>>>>>>>> -static bool is_internal_error(struct aer_err_info *info)
>>>>>>>> -{
>>>>>>>> -	if (info->severity == AER_CORRECTABLE)
>>>>>>>> -		return info->status & PCI_ERR_COR_INTERNAL;
>>>>>>>> -
>>>>>>>> -	return info->status & PCI_ERR_UNC_INTN;
>>>>>>>> -}
>>>>>>>> -
>>>>>>>>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>>>>>>>  {
>>>>>>>>  	struct aer_err_info *info = (struct aer_err_info *)data;
>>>>>>>> @@ -1034,14 +1033,23 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>>>>>>>>  
>>>>>>>>  static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>>>>>>>  {
>>>>>>>> -	/*
>>>>>>>> -	 * Internal errors of an RCEC indicate an AER error in an
>>>>>>>> -	 * RCH's downstream port. Check and handle them in the CXL.mem
>>>>>>>> -	 * device driver.
>>>>>>>> -	 */
>>>>>>>> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
>>>>>>>> -	    is_internal_error(info))
>>>>>>>> -		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>>>>>>>> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
>>>>>>>> +		return pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
>>>>>>>> +
>>>>>>>> +	if (info->severity == AER_CORRECTABLE) {
>>>>>>>> +		struct pci_driver *pdrv = dev->driver;
>>>>>>>> +		int aer = dev->aer_cap;
>>>>>>>> +
>>>>>>>> +		if (aer)
>>>>>>>> +			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>>>>>>>> +					       info->status);
>>>>>>>> +
>>>>>>>> +		if (pdrv && pdrv->cxl_err_handler &&
>>>>>>>> +		    pdrv->cxl_err_handler->cor_error_detected)
>>>>>>>> +			pdrv->cxl_err_handler->cor_error_detected(dev);
>>>>>>>>
>>>>>>>> +		pcie_clear_device_status(dev);
>>>>>>>> +	}
>>>>>>>>  }
>>>>>>>>  
>>>>>>>>  static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>>>>>>>> @@ -1059,9 +1067,13 @@ static bool handles_cxl_errors(struct pci_dev *dev)
>>>>>>>>  {
>>>>>>>>  	bool handles_cxl = false;
>>>>>>>>  
>>>>>>>> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
>>>>>>>> -	    pcie_aer_is_native(dev))
>>>>>>>> +	if (!pcie_aer_is_native(dev))
>>>>>>>> +		return false;
>>>>>>>> +
>>>>>>>> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
>>>>>>>>  		pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
>>>>>>>> +	else
>>>>>>>> +		handles_cxl = pcie_is_cxl_port(dev);
>>>>>>> My understanding is if a cxl RP/USP/DSP is working on PCIe mode, they are also possible to expose a DVSEC ID 3(CXL r3.1 section 9.12.3). In such case, the AER handler should be pci_aer_handle_error() rather than cxl_handle_error().
>>>>>>>
>>>>>>> pcie_is_cxl_port() only checks if there is a DVSEC ID 3, but I think it should also check if the cxl port is working on CXL mode, does it make more sense?
>>>>>>>
>>>>>>>
>>>>>>> Ming
>>>>>> Hi Ming and Jonathan,
>>>>>>
>>>>>> RCH AER & RCH RAS are currently logged by the CXL driver's RCH handlers.
>>>>>>
>>>>>> If the recommended change is made then RCH RAS will not be logged and the
>>>>>> user would miss CXL details about the alternate protocol training failure.
>>>>>> Also, AER is not CXL required and as a result in some cases you would only
>>>>>> have the RCEC forwarded UIE/CIE message logged by the AER driver without
>>>>>> any other logging.
>>>>>>
>>>>>> Is there value in *not* logging CXL RAS for errors on an untrained RCH
>>>>>> link? Isn't it more informative to log PCIe AER and CXL RAS in this case?
>>>>>>
>>>>>> Regards,
>>>>>> Terry
>>>>> Hi Terry,
>>>>>
>>>>>
>>>>> I don't understand why the recommended change will influence RCH RAS handling, would you mind giving more details?
>>>>>
>>>>> My understanding is that above 'pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl)' is used for RCH case.
>>>>>
>>>>> And the 'else' block is used for VH case, so just check if the cxl port is working on CXL mode in pcie_is_cxl_port() or adding an extra function to check it in the 'else' block. I think it will not change RCH AER & RAS handling, is it right? or do I miss other details?
>>>>>
>>>>>
>>>>> Ming
>>>> Hi Ming,
>>>>
>>>> You're recommending this example case is handled by pci_aer_handle_error() rather than cxl_handle_error(). Correct me if I misunderstood. And, I believe this should continue to be handled by cxl_handle_error(). There are 2 issues with the recommended approach that deserve to be mentioned.
>>> I guess that what you thought is the recommended change using pci_aer_handle_error() to handle CXL RAS issues? If yes, it is not what I meant.
>>>
>>> handles_cxl_errors() is used to distinguish whether the error is a CXL error or a PCIe error. If handles_cxl_errors() returns 'true', the error is a CXL error. Either cxl_handle_error() or pci_aer_handle_error() is then invoked depending on the returned value. I see no problem with this part.
>>>
>>> handles_cxl_errors() uses pcie_is_cxl_port() to identify CXL errors for VH cases. The implementation of pcie_is_cxl_port() only checks whether a DVSEC ID 3 is exposed on the CXL RP/DSP/USP. I think that is not enough.
>>>
>>> For example, if a CXL device is connected to a CXL RP, there is no problem, because handles_cxl_errors() returns 'true' and cxl_handle_error() is invoked to handle the errors.
>>>
>>> If a PCIe device is connected to a CXL RP, the CXL RP operates in PCIe mode, and the CXL RP may still expose a DVSEC ID 3.[1] If the CXL RP has a DVSEC ID 3 in that case, handles_cxl_errors() also returns 'true' and cxl_handle_error() is also invoked to handle the error. I think that is not right: the CXL RP is operating in PCIe mode, so the error should be a PCIe error and should be handled by pci_aer_handle_error(). So my suggestion is to check whether the CXL RP/DSP/USP is operating in CXL mode in pcie_is_cxl_port() for VH cases.
>>>
>>>
>>> [1] CXL r3.1 - 9.12.3 Enumerating CXL RPs and DSPs
>>>
>>>    "CXL root port or DSP connected to a PCIe device/switch may or may not expose the CXL DVSEC ID 3 and the CXL DVSEC ID 7 capability structures."
>>>
>> Hi Ming,
>>
>> I apologize for the delayed response. Thanks for the patience in explaining.
>>
>> In your example using an RP with a downstream non-CXL device, the RP AER will log the
>> RP's CE/UCE and RAS status for a protocol error. That is not helpful in this case
>> because it's a non-CXL device, but the failure is in alternate protocol training, which can
>> also happen with a CXL endpoint. I expect the RAS registers contain details about
>> the failed CXL training in the endpoint case.
>>
>> I believe we should give the user as much error detail as is reasonable. For CXL errors
>> reported through AER CE/UCE, this should include the RAS logging. If we rely on the PCIe
>> handling path, this information will not be logged.
>>
>> Also, CE/UCE AER is logged in the CXL handling path. The AER driver logs AER status before
>> calling the CE/UCE CXL handlers.
>>
>> Are there any other use cases or reasons to use PCIe handling if alt. protocol training
>> fails? Is there anything lost by using CXL handling?
> One problem I realized: if cxl_handle_error() is used instead of pci_aer_handle_error() for the case I described above (a CXL RP operating in PCIe mode because it is connected to a PCIe device), the CXL RP will miss the pcie_do_recovery() call made by pci_aer_handle_error() when the error is a UCE, and it will also miss the PCIe error handler implemented in the PCIe port driver.
>
> It means that the AER handling logic differs between a CXL RP operating in PCIe mode and a PCIe RP. I am not sure whether that is OK.
>
>
> Although cxl_handle_error() includes cxl_do_recovery(), implemented in patch #7, cxl_do_recovery() seems to be intended only for CXL cases (a CXL RP operating in CXL mode). Is it suitable for PCIe port recovery (a CXL RP operating in PCIe mode)?
>
> Please correct me if I am wrong.
>
>
> Ming

Hi Ming,

Yes, the plan is to handle CXL protocol errors in a CXL path that is separate from the path used for PCIe protocol errors.

You stated this is a problem. Can you elaborate on the issue?

Regards,
Terry

>> Terry
>>>> First, the RCH Downstream Port (DP) is implemented as an RCRB and does not have a
>>>> SBDF.[1] The RCH AER error is reported with the RCEC SBDF in the AER SRC_ID register.[2] The
>>>> RCEC is used to find the RCH's handlers using a CXL unique procedure (see cxl_handle_error()).
>>>>
>>>> The logic in pci_aer_handle_error() operates on a 'struct pci_dev' type and pci_aer_handle_error() is not plumbed to support searching for the RCH handlers.
>>>>
>>>> Using pci_aer_handle_error would require significant changes to support a CXL RCH
>>>> in addition to a PCIe device. These changes are already in cxl_handle_error().
>>>>  
>>>> Another issue to note is the CXL RAS information will (should) not be logged with this
>>>> recommended change. pci_aer_handle_error() is PCIe specific and is not aware of CXL RAS. As a result, pci_aer_handle_error() is not suited to log the CXL RAS.
>>>>
>>>> The example scenario was the RCH DP failed training. The user needs to know why training
>>>> failed and these details are stored in the CXL RAS registers. Again, CXL RAS needs to be logged
>>>> as well but CXL specific awareness shouldn't be added to pci_aer_handle_error().
>>> For these two issues, handles_cxl_errors() is always using "pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl)" for RCH cases. I believe no change on this part, the return value of handles_cxl_errors() will be 'true' as expected in the cases you mentioned, cxl_handle_error() will help to handle these errors.
>>>
>>>
>>> Ming
>>>
>>>> Terry
>>>>
>>>> [1] CXL r3.1 - 8.2 Memory Mapped Registers
>>>> [2] CXL r3.1 - 12.2.1.1 RCH Downstream Port-detected Errors+
>>>>>>>>  
>>>>>>>>  	return handles_cxl;
>>>>>>>>  }
>>>>>>>> @@ -1079,6 +1091,10 @@ static void cxl_enable_internal_errors(struct pci_dev *dev)
>>>>>>>>  static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
>>>>>>>>  static inline void cxl_handle_error(struct pci_dev *dev,
>>>>>>>>  				    struct aer_err_info *info) { }
>>>>>>>> +static bool handles_cxl_errors(struct pci_dev *dev)
>>>>>>>> +{
>>>>>>>> +	return false;
>>>>>>>> +}
>>>>>>>>  #endif
>>>>>>>>  
>>>>>>>>  /**
>>>>>>>> @@ -1116,8 +1132,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>>>>>>>  
>>>>>>>>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>>>>>>>>  {
>>>>>>>> -	cxl_handle_error(dev, info);
>>>>>>>> -	pci_aer_handle_error(dev, info);
>>>>>>>> +	if (is_internal_error(info) && handles_cxl_errors(dev))
>>>>>>>> +		cxl_handle_error(dev, info);
>>>>>>>> +	else
>>>>>>>> +		pci_aer_handle_error(dev, info);
>>>>>>>> +
>>>>>>>>  	pci_dev_put(dev);
>>>>>>>>  }
>>>>>>>>  
>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 01/16] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver'
  2025-01-07 14:38 ` [PATCH v5 01/16] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver' Terry Bowman
  2025-01-13 23:45   ` Ira Weiny
@ 2025-02-06 17:01   ` Gregory Price
  2025-02-07 18:35     ` Bowman, Terry
  1 sibling, 1 reply; 96+ messages in thread
From: Gregory Price @ 2025-02-06 17:01 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Tue, Jan 07, 2025 at 08:38:37AM -0600, Terry Bowman wrote:
> CXL.io provides protocol error handling on top of PCIe Protocol Error
> handling. But, CXL.io and PCIe have different handling requirements
> for uncorrectable errors (UCE).
> 
> The PCIe AER service driver may attempt recovering PCIe devices with
> UCE while recovery is not used for CXL.io. Recovery is not used in the
> CXL.io case because of potential corruption on what can be system memory.
> 
> Create pci_driver::cxl_err_handlers structure similar to
> pci_driver::error_handler. Create handlers for correctable and
> uncorrectable CXL.io error handling.
> 
> The CXL error handlers will be used in future patches adding CXL PCIe
> Port Protocol Error handling.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Fan Ni <fan.ni@samsung.com>

Reviewed-by: Gregory Price <gourry@gourry.net>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 02/16] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe Port support
  2025-01-07 14:38 ` [PATCH v5 02/16] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe Port support Terry Bowman
  2025-01-13 23:45   ` Ira Weiny
@ 2025-02-06 17:02   ` Gregory Price
  1 sibling, 0 replies; 96+ messages in thread
From: Gregory Price @ 2025-02-06 17:02 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Tue, Jan 07, 2025 at 08:38:38AM -0600, Terry Bowman wrote:
> The AER service driver already includes support for Restricted CXL host
> (RCH) Downstream Port Protocol Error handling. The current implementation
> is based on CXL1.1 using a Root Complex Event Collector.
> 
> Rename function interfaces and parameters where necessary to include
> virtual hierarchy (VH) mode CXL PCIe Port error handling alongside the RCH
> handling.[1] The CXL PCIe Port Protocol Error handling support will be
> added in a future patch.
> 
> Limit changes to renaming variable and function names. No functional
> changes are added.
> 
> [1] CXL 3.1 Spec, 9.12.2 CXL Virtual Hierarchy
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Fan Ni <fan.ni@samsung.com>

Reviewed-by: Gregory Price <gourry@gourry.net>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 04/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type
  2025-01-07 14:38 ` [PATCH v5 04/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
  2025-01-13 23:51   ` Ira Weiny
@ 2025-02-06 18:18   ` Gregory Price
  2025-02-07 18:50     ` Bowman, Terry
  1 sibling, 1 reply; 96+ messages in thread
From: Gregory Price @ 2025-02-06 18:18 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Tue, Jan 07, 2025 at 08:38:40AM -0600, Terry Bowman wrote:
> The AER driver and aer_event tracing currently log 'PCIe Bus Type'
> for all errors.
> 
> Update the driver and aer_event tracing to log 'CXL Bus Type' for CXL
> device errors.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Fan Ni <fan.ni@samsung.com>

Worth a macro/static function?  Wonder what else could use this.

Reviewed-by: Gregory Price <gourry@gourry.net>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver
  2025-01-07 14:38 ` [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver Terry Bowman
  2025-01-14  6:54   ` Li Ming
  2025-01-14 16:35   ` Ira Weiny
@ 2025-02-06 18:33   ` Gregory Price
  2025-02-07 17:54     ` Jonathan Cameron
  2025-02-07 19:05     ` Bowman, Terry
  2 siblings, 2 replies; 96+ messages in thread
From: Gregory Price @ 2025-02-06 18:33 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Tue, Jan 07, 2025 at 08:38:41AM -0600, Terry Bowman wrote:
> The AER service driver supports handling Downstream Port Protocol Errors in
> Restricted CXL host (RCH) mode also known as CXL1.1. It needs the same
> functionality for CXL PCIe Ports operating in Virtual Hierarchy (VH)
> mode.[1]
> 
> CXL and PCIe Protocol Error handling have different requirements that
> necessitate a separate handling path. The AER service driver may try to
> recover PCIe uncorrectable non-fatal errors (UCE). The same recovery is not
> suitable for CXL PCIe Port devices because of potential for system memory
> corruption. Instead, CXL Protocol Error handling must use a kernel panic
> in the case of a fatal or non-fatal UCE. The AER driver's PCIe Protocol
> Error handling does not panic the kernel in response to a UCE.
>

Naive question: is a panic actually required if the memory is a userland
resource?

The code in arch/x86/kernel/cpu/mce/core.c suggests we may not panic
if an uncorrectable error occurs in this fashion, but simply send a SIGBUS.

Unless this is down the wrong pipe - in which case disregard.

I'm still digging through background on this patch set so I may be
barking up the wrong tree.

~Gregory

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 08/16] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers
  2025-01-07 14:38 ` [PATCH v5 08/16] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers Terry Bowman
  2025-01-14 21:37   ` Ira Weiny
@ 2025-02-07  7:30   ` Gregory Price
  2025-02-07 19:08     ` Bowman, Terry
  1 sibling, 1 reply; 96+ messages in thread
From: Gregory Price @ 2025-02-07  7:30 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Tue, Jan 07, 2025 at 08:38:44AM -0600, Terry Bowman wrote:
> +static bool dev_is_cxl_pci(struct device *dev, u32 pcie_type)
> +{
> +	struct pci_dev *pdev;
> +
> +	if (!dev || !dev_is_pci(dev))
> +		return false;
> +
> +	pdev = to_pci_dev(dev);
> +
> +	return (pci_pcie_type(pdev) == pcie_type);
> +}
> +
> +static void cxl_init_ep_ports_aer(struct cxl_ep *ep)
> +{
> +	struct cxl_dport *dport = ep->dport;
> +
> +	if (dport) {
> +		struct device *dport_dev = dport->dport_dev;
> +
> +		if (dev_is_cxl_pci(dport_dev, PCI_EXP_TYPE_DOWNSTREAM) ||
> +		    dev_is_cxl_pci(dport_dev, PCI_EXP_TYPE_ROOT_PORT))

Mostly an observation - this kind of comparison seems to be coming up
more.  Wonder if an explicit set of APIs for these checks would be worth
it to clean up the 3 or 4 different comparison variants I've seen.
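
A minimal userspace sketch of what such a helper might look like. The
demo_* names and the struct are hypothetical stand-ins, not existing
kernel APIs; only the numeric values are meant to mirror the PCIe
Device/Port Type encodings behind PCI_EXP_TYPE_ROOT_PORT, _UPSTREAM and
_DOWNSTREAM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-ins for PCI_EXP_TYPE_*; hypothetical names, illustration only. */
enum demo_pcie_type {
	DEMO_TYPE_ROOT_PORT  = 0x4,
	DEMO_TYPE_UPSTREAM   = 0x5,
	DEMO_TYPE_DOWNSTREAM = 0x6,
};

/* Hypothetical stand-in for the parts of struct pci_dev used here. */
struct demo_pci_dev {
	bool is_pcie;
	enum demo_pcie_type type;
};

/* Accessor in the spirit of pci_pcie_type(); caller must pass a device. */
static enum demo_pcie_type demo_pcie_type(const struct demo_pci_dev *pdev)
{
	assert(pdev != NULL);
	return pdev->type;
}

/* One place to answer "is this a Root Port or a Downstream Switch Port?" */
static bool demo_is_cxl_dport_type(const struct demo_pci_dev *pdev)
{
	if (!pdev || !pdev->is_pcie)
		return false;

	switch (demo_pcie_type(pdev)) {
	case DEMO_TYPE_ROOT_PORT:
	case DEMO_TYPE_DOWNSTREAM:
		return true;
	default:
		return false;
	}
}
```

With something like this, the two-way comparison above would collapse
into a single call, and any other open-coded variants could share it.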

Either way

Reviewed-by: Gregory Price <gourry@gourry.net>

~Gregory

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 09/16] cxl/pci: Map CXL PCIe Upstream Switch Port RAS registers
  2025-01-07 14:38 ` [PATCH v5 09/16] cxl/pci: Map CXL PCIe Upstream " Terry Bowman
  2025-01-14 11:35   ` Jonathan Cameron
  2025-01-14 22:02   ` Ira Weiny
@ 2025-02-07  7:35   ` Gregory Price
  2 siblings, 0 replies; 96+ messages in thread
From: Gregory Price @ 2025-02-07  7:35 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Tue, Jan 07, 2025 at 08:38:45AM -0600, Terry Bowman wrote:
> Add logic to map CXL PCIe Upstream Switch Port (USP) RAS registers.
> 
> Introduce 'struct cxl_regs' member into 'struct cxl_port' to cache a
> pointer to the CXL Upstream Port's mapped RAS registers.
> 
> Also, introduce cxl_uport_init_ras_reporting() to perform the USP RAS
> register mapping. This is similar to the existing
> cxl_dport_init_ras_reporting() but for USP devices.
> 
> The USP may have multiple downstream endpoints. Before mapping AER
> registers check if the registers are already mapped.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

Reviewed-by: Gregory Price <gourry@gourry.net>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 10/16] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports
  2025-01-07 14:38 ` [PATCH v5 10/16] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports Terry Bowman
  2025-01-14 11:39   ` Jonathan Cameron
  2025-01-14 22:20   ` Ira Weiny
@ 2025-02-07  7:38   ` Gregory Price
  2 siblings, 0 replies; 96+ messages in thread
From: Gregory Price @ 2025-02-07  7:38 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Tue, Jan 07, 2025 at 08:38:46AM -0600, Terry Bowman wrote:
> CXL PCIe Port Protocol Error handling support will be added to the
> CXL drivers in the future. In preparation, rename the existing
> interfaces to support handling all CXL PCIe Port Protocol Errors.
> 
> The driver's RAS support functions currently rely on a 'struct
> cxl_dev_state' type parameter, which is not available for CXL Port
> devices. However, since the same CXL RAS capability structure is
> needed across most CXL components and devices, a common handling
> approach should be adopted.
> 
> To accommodate this, update the __cxl_handle_cor_ras() and
> __cxl_handle_ras() functions to use a `struct device` instead of
> `struct cxl_dev_state`.
> 
> No functional changes are introduced.
> 
> [1] CXL 3.1 Spec, 8.2.4 CXL.cache and CXL.mem Registers
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Alejandro Lucero <alucerop@amd.com>

Reviewed-by: Gregory Price <gourry@gourry.net>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 11/16] cxl/pci: Add log message for umnapped registers in existing RAS handlers
  2025-01-07 14:38 ` [PATCH v5 11/16] cxl/pci: Add log message for umnapped registers in existing RAS handlers Terry Bowman
  2025-01-14 11:41   ` Jonathan Cameron
  2025-01-14 22:21   ` Ira Weiny
@ 2025-02-07  7:39   ` Gregory Price
  2 siblings, 0 replies; 96+ messages in thread
From: Gregory Price @ 2025-02-07  7:39 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Tue, Jan 07, 2025 at 08:38:47AM -0600, Terry Bowman wrote:
> The CXL RAS handlers do not currently log if the RAS registers are
> unmapped. This is needed in order to help debug CXL error handling. Update
> the CXL driver to log a warning message if the RAS register block is
> unmapped.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

Reviewed-by: Gregory Price <gourry@gourry.net>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 12/16] cxl/pci: Change find_cxl_port() to non-static
  2025-01-14 22:23   ` Ira Weiny
@ 2025-02-07  7:45     ` Gregory Price
  0 siblings, 0 replies; 96+ messages in thread
From: Gregory Price @ 2025-02-07  7:45 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
	rrichter, nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
	ming.li, PradeepVineshReddy.Kodamati, alucerop

On Tue, Jan 14, 2025 at 04:23:56PM -0600, Ira Weiny wrote:
> Terry Bowman wrote:
> > CXL PCIe Port Protocol Error support will be added in the future. This
> > requires searching for a CXL PCIe Port device in the CXL topology as
> > provided by find_cxl_port(). But, find_cxl_port() is defined static
> > and as a result is not callable outside of this source file.
> > 
> > Update the find_cxl_port() declaration to be non-static.
> > 
> > Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> > Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> Generally I think Dan prefers this type of patch to be squashed with the
> patch which requires the change.  But I'm ok with the smaller patches...
> 
> :-D
> 
> Reviewed-by: Ira Weiny <ira.weiny@intel.com>
> 

I've read elsewhere that changelog entries should avoid telling
the future - and simply state what the patch is doing.

I.e.: "do the thing" as opposed to "In the future... so do the thing"

The existence of the patch implies there is a user for it.

anyway

Reviewed-by: Gregory Price <gourry@gourry.net>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 13/16] cxl/pci: Add error handler for CXL PCIe Port RAS errors
  2025-01-07 14:38 ` [PATCH v5 13/16] cxl/pci: Add error handler for CXL PCIe Port RAS errors Terry Bowman
  2025-01-14 11:46   ` Jonathan Cameron
  2025-01-14 22:51   ` Ira Weiny
@ 2025-02-07  8:01   ` Gregory Price
  2025-02-07 19:23     ` Bowman, Terry
  2 siblings, 1 reply; 96+ messages in thread
From: Gregory Price @ 2025-02-07  8:01 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Tue, Jan 07, 2025 at 08:38:49AM -0600, Terry Bowman wrote:
> +static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
> +{
> +	struct cxl_port *port;
> +
> +	if (!pdev)
> +		return NULL;
> +
> +	if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
> +	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
> +		struct cxl_dport *dport;
> +		void __iomem *ras_base;
> +
> +		port = find_cxl_port(&pdev->dev, &dport);
> +		ras_base = dport ? dport->regs.ras : NULL;

I'm fairly certain dport can come back here uninitialized, you
probably want to put this inside the `if (port)` block and 
pre-initialize dport to NULL.

> +		if (port)
> +			put_device(&port->dev);
> +		return ras_base;

You can probably even simplify this down to something like

		struct cxl_dport *dport = NULL;

		port = find_cxl_port(&pdev->dev, &dport);
		if (port)
			put_device(&port->dev);

		return dport ? dport->regs.ras : NULL;
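
To illustrate why the initializer matters, here is a small userspace
sketch. Everything named demo_* is a hypothetical stand-in, and it
assumes the finder writes its out-parameter only on success, which is
my reading of find_cxl_port() rather than something I have verified.
Reference counting (put_device()) is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel structures. */
struct demo_dport { void *ras; };
struct demo_port { const char *name; struct demo_dport dport; };

static char demo_ras_block[16];

static struct demo_port demo_ports[] = {
	{ .name = "port0", .dport = { .ras = demo_ras_block } },
};

/*
 * Hypothetical finder: *dport is written only when a match is found.
 * On failure the caller's variable is left exactly as it was.
 */
static struct demo_port *demo_find_port(const char *name,
					struct demo_dport **dport)
{
	assert(dport != NULL);

	for (size_t i = 0; i < sizeof(demo_ports) / sizeof(demo_ports[0]); i++) {
		if (strcmp(demo_ports[i].name, name) == 0) {
			*dport = &demo_ports[i].dport;
			return &demo_ports[i];
		}
	}
	return NULL;
}

static void *demo_port_ras(const char *name)
{
	/* Without "= NULL" a failed lookup would leave stack garbage here. */
	struct demo_dport *dport = NULL;

	demo_find_port(name, &dport);

	return dport ? dport->ras : NULL;
}
```

A failed lookup then falls through to the NULL return instead of
dereferencing whatever happened to be on the stack.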


~Gregory

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 15/16] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
  2025-01-07 14:38 ` [PATCH v5 15/16] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers Terry Bowman
  2025-01-14 11:51   ` Jonathan Cameron
  2025-01-14 23:03   ` Ira Weiny
@ 2025-02-07  8:08   ` Gregory Price
  2 siblings, 0 replies; 96+ messages in thread
From: Gregory Price @ 2025-02-07  8:08 UTC (permalink / raw)
  To: Terry Bowman
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Tue, Jan 07, 2025 at 08:38:51AM -0600, Terry Bowman wrote:
> pci_driver::cxl_err_handlers are not currently assigned handler callbacks.
> The handlers can't be set in the pci_driver static definition because the
> CXL PCIe Port devices are bound to the portdrv driver which is not CXL
> driver aware.
> 
> Add cxl_assign_port_error_handlers() in the cxl_core module. This
> function will assign the default handlers for a CXL PCIe Port device.
> 
> When the CXL Port (cxl_port or cxl_dport) is destroyed the device's
> pci_driver::cxl_err_handlers must be set to NULL indicating they should no
> longer be used.
> 
> Create cxl_clear_port_error_handlers() and register it to be called
> when the CXL Port device (cxl_port or cxl_dport) is destroyed.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

Reviewed-by: Gregory Price <gourry@gourry.net>


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver
  2025-02-06 18:33   ` Gregory Price
@ 2025-02-07 17:54     ` Jonathan Cameron
  2025-02-07 19:05     ` Bowman, Terry
  1 sibling, 0 replies; 96+ messages in thread
From: Jonathan Cameron @ 2025-02-07 17:54 UTC (permalink / raw)
  To: Gregory Price
  Cc: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	dave.jiang, alison.schofield, vishal.l.verma, dan.j.williams,
	bhelgaas, mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Thu, 6 Feb 2025 13:33:55 -0500
Gregory Price <gourry@gourry.net> wrote:

> On Tue, Jan 07, 2025 at 08:38:41AM -0600, Terry Bowman wrote:
> > The AER service driver supports handling Downstream Port Protocol Errors in
> > Restricted CXL host (RCH) mode also known as CXL1.1. It needs the same
> > functionality for CXL PCIe Ports operating in Virtual Hierarchy (VH)
> > mode.[1]
> > 
> > CXL and PCIe Protocol Error handling have different requirements that
> > necessitate a separate handling path. The AER service driver may try to
> > recover PCIe uncorrectable non-fatal errors (UCE). The same recovery is not
> > suitable for CXL PCIe Port devices because of potential for system memory
> > corruption. Instead, CXL Protocol Error handling must use a kernel panic
> > in the case of a fatal or non-fatal UCE. The AER driver's PCIe Protocol
> > Error handling does not panic the kernel in response to a UCE.
> >  
> 
> Naive question: is a panic actually required if the memory is a userland
> resource?

It's a protocol error, not a contained memory issue.
You'd need to find everything using that memory and kill it.

Maybe longer term, if it's DAX and we know the whole device is allocated
to only a few apps, we can resolve this more smoothly.


> 
> The code in arch/x86/kernel/cpu/mce/core.c suggests we may not panic
> if an uncorrectable error occurs in this fashion, but simply a SIGBUS.
> 
> Unless this is down the wrong pipe - in which case disregard.
> 
> I'm still digging through background on this patch set so I may be
> barking up the wrong tree.
> 
> ~Gregory


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 01/16] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver'
  2025-02-06 17:01   ` Gregory Price
@ 2025-02-07 18:35     ` Bowman, Terry
  0 siblings, 0 replies; 96+ messages in thread
From: Bowman, Terry @ 2025-02-07 18:35 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop



On 2/6/2025 11:01 AM, Gregory Price wrote:
> On Tue, Jan 07, 2025 at 08:38:37AM -0600, Terry Bowman wrote:
>> CXL.io provides protocol error handling on top of PCIe Protocol Error
>> handling. But, CXL.io and PCIe have different handling requirements
>> for uncorrectable errors (UCE).
>>
>> The PCIe AER service driver may attempt recovering PCIe devices with
>> UCE while recovery is not used for CXL.io. Recovery is not used in the
>> CXL.io case because of potential corruption on what can be system memory.
>>
>> Create pci_driver::cxl_err_handlers structure similar to
>> pci_driver::error_handler. Create handlers for correctable and
>> uncorrectable CXL.io error handling.
>>
>> The CXL error handlers will be used in future patches adding CXL PCIe
>> Port Protocol Error handling.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>> Reviewed-by: Fan Ni <fan.ni@samsung.com>
> Reviewed-by: Gregory Price <gourry@gourry.net>
Thanks for reviewing and adding the "Reviewed-by".

Terry

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 04/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type
  2025-02-06 18:18   ` Gregory Price
@ 2025-02-07 18:50     ` Bowman, Terry
  0 siblings, 0 replies; 96+ messages in thread
From: Bowman, Terry @ 2025-02-07 18:50 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop



On 2/6/2025 12:18 PM, Gregory Price wrote:
> On Tue, Jan 07, 2025 at 08:38:40AM -0600, Terry Bowman wrote:
>> The AER driver and aer_event tracing currently log 'PCIe Bus Type'
>> for all errors.
>>
>> Update the driver and aer_event tracing to log 'CXL Bus Type' for CXL
>> device errors.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>> Reviewed-by: Fan Ni <fan.ni@samsung.com>
> Worth a macro/static function?  Wonder what else could use this.
>
> Reviewed-by: Gregory Price <gourry@gourry.net>
Good idea. Thanks for the 'Reviewed-by:'.

Terry

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver
  2025-02-06 18:33   ` Gregory Price
  2025-02-07 17:54     ` Jonathan Cameron
@ 2025-02-07 19:05     ` Bowman, Terry
  1 sibling, 0 replies; 96+ messages in thread
From: Bowman, Terry @ 2025-02-07 19:05 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop



On 2/6/2025 12:33 PM, Gregory Price wrote:
> On Tue, Jan 07, 2025 at 08:38:41AM -0600, Terry Bowman wrote:
>> The AER service driver supports handling Downstream Port Protocol Errors in
>> Restricted CXL host (RCH) mode also known as CXL1.1. It needs the same
>> functionality for CXL PCIe Ports operating in Virtual Hierarchy (VH)
>> mode.[1]
>>
>> CXL and PCIe Protocol Error handling have different requirements that
>> necessitate a separate handling path. The AER service driver may try to
>> recover PCIe uncorrectable non-fatal errors (UCE). The same recovery is not
>> suitable for CXL PCIe Port devices because of potential for system memory
>> corruption. Instead, CXL Protocol Error handling must use a kernel panic
>> in the case of a fatal or non-fatal UCE. The AER driver's PCIe Protocol
>> Error handling does not panic the kernel in response to a UCE.
>>
> Naive question: is a panic actually required if the memory is a userland
> resource?
>
> The code in arch/x86/kernel/cpu/mce/core.c suggests we may not panic
> if an uncorrectable error occurs in this fashion, but simply a SIGBUS.
>
> Unless this is down the wrong pipe - in which case disregard.
>
> I'm still digging through background on this patch set so I may be
> barking up the wrong tree.
>
> ~Gregory
The plan is to panic on any CXL device with uncorrectable errors
regardless of where used. This is to avoid corruption.
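
Roughly, the policy can be sketched as below. This is a simplified
userspace illustration with hypothetical demo_* names, not the AER
driver code; it only encodes what the commit message states (CXL
uncorrectable errors panic, PCIe uncorrectable errors may go through
recovery).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical severity and action names, for illustration only. */
enum demo_severity { DEMO_CORRECTABLE, DEMO_NONFATAL, DEMO_FATAL };
enum demo_action { DEMO_LOG_ONLY, DEMO_TRY_RECOVERY, DEMO_PANIC };

static enum demo_action demo_pick_action(bool is_cxl, enum demo_severity sev)
{
	/* Correctable errors are only logged on both paths. */
	if (sev == DEMO_CORRECTABLE)
		return DEMO_LOG_ONLY;

	/* Anything else must be one of the two uncorrectable severities. */
	assert(sev == DEMO_NONFATAL || sev == DEMO_FATAL);

	/* CXL never attempts recovery for a UCE; PCIe may. */
	return is_cxl ? DEMO_PANIC : DEMO_TRY_RECOVERY;
}
```

Note that where the memory is used (kernel or userland) is not an input
to the decision at all.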

Terry




^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 08/16] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers
  2025-02-07  7:30   ` Gregory Price
@ 2025-02-07 19:08     ` Bowman, Terry
  2025-02-07 19:39       ` Gregory Price
  0 siblings, 1 reply; 96+ messages in thread
From: Bowman, Terry @ 2025-02-07 19:08 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop



On 2/7/2025 1:30 AM, Gregory Price wrote:
> On Tue, Jan 07, 2025 at 08:38:44AM -0600, Terry Bowman wrote:
>> +static bool dev_is_cxl_pci(struct device *dev, u32 pcie_type)
>> +{
>> +	struct pci_dev *pdev;
>> +
>> +	if (!dev || !dev_is_pci(dev))
>> +		return false;
>> +
>> +	pdev = to_pci_dev(dev);
>> +
>> +	return (pci_pcie_type(pdev) == pcie_type);
>> +}
>> +
>> +static void cxl_init_ep_ports_aer(struct cxl_ep *ep)
>> +{
>> +	struct cxl_dport *dport = ep->dport;
>> +
>> +	if (dport) {
>> +		struct device *dport_dev = dport->dport_dev;
>> +
>> +		if (dev_is_cxl_pci(dport_dev, PCI_EXP_TYPE_DOWNSTREAM) ||
>> +		    dev_is_cxl_pci(dport_dev, PCI_EXP_TYPE_ROOT_PORT))
> Mostly an observation - this kind of comparison seems to be coming up
> more.  Wonder if an explicit set of APIs for these checks would be worth
> it to clean up the 3 or 4 different comparison variants I've seen.
>
> Either way
>
> Reviewed-by: Gregory Price <gourry@gourry.net>
>
> ~Gregory
Do you recommend moving dev_is_cxl_pci() to the PCI library as is? Thanks
for the 'Reviewed-by:'.

Terry

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 13/16] cxl/pci: Add error handler for CXL PCIe Port RAS errors
  2025-02-07  8:01   ` Gregory Price
@ 2025-02-07 19:23     ` Bowman, Terry
  2025-02-07 19:41       ` Gregory Price
  0 siblings, 1 reply; 96+ messages in thread
From: Bowman, Terry @ 2025-02-07 19:23 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop



On 2/7/2025 2:01 AM, Gregory Price wrote:
> On Tue, Jan 07, 2025 at 08:38:49AM -0600, Terry Bowman wrote:
>> +static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
>> +{
>> +	struct cxl_port *port;
>> +
>> +	if (!pdev)
>> +		return NULL;
>> +
>> +	if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
>> +	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
>> +		struct cxl_dport *dport;
>> +		void __iomem *ras_base;
>> +
>> +		port = find_cxl_port(&pdev->dev, &dport);
>> +		ras_base = dport ? dport->regs.ras : NULL;
> I'm fairly certain dport can come back here uninitialized, you
> probably want to put this inside the `if (port)` block and 
> pre-initialize dport to NULL.
Right, it can. The NULL dport check here covers this, no?

		ras_base = dport ? dport->regs.ras : NULL;

Terry

>> +		if (port)
>> +			put_device(&port->dev);
>> +		return ras_base;
> You can probably even simplify this down to something like
>
> 		struct cxl_dport *dport = NULL;
>
> 		port = find_cxl_port(&pdev->dev, &dport);
> 		if (port)
> 			put_device(&port->dev);
>
> 		return dport ? dport->regs.ras : NULL;
>
>
> ~Gregory


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 08/16] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers
  2025-02-07 19:08     ` Bowman, Terry
@ 2025-02-07 19:39       ` Gregory Price
  0 siblings, 0 replies; 96+ messages in thread
From: Gregory Price @ 2025-02-07 19:39 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Fri, Feb 07, 2025 at 01:08:35PM -0600, Bowman, Terry wrote:
> 
> >> +
> >> +		if (dev_is_cxl_pci(dport_dev, PCI_EXP_TYPE_DOWNSTREAM) ||
> >> +		    dev_is_cxl_pci(dport_dev, PCI_EXP_TYPE_ROOT_PORT))
> > Mostly an observation - this kind of comparison seems to be coming up
> > more.  Wonder if an explicit set of APIs for these checks would be worth
> > it to clean up the 3 or 4 different comparison variants I've seen.
> >
> > Either way
> >
> > Reviewed-by: Gregory Price <gourry@gourry.net>
> >
> > ~Gregory
> Do you recommend moving dev_is_cxl_pci() to the PCI library as is? Thanks
> for the 'Reviewed-by:'.
>
> Terry

I'd have to go look around and see how much the PCI_EXP_TYPE items are
being used for comparison and how - I was just commenting internally on
this patch set.  If this set is the only place it's being used (so far)
then it's probably not worth extra work yet.

~Gregory

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 13/16] cxl/pci: Add error handler for CXL PCIe Port RAS errors
  2025-02-07 19:23     ` Bowman, Terry
@ 2025-02-07 19:41       ` Gregory Price
  2025-02-07 21:04         ` Bowman, Terry
  0 siblings, 1 reply; 96+ messages in thread
From: Gregory Price @ 2025-02-07 19:41 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop

On Fri, Feb 07, 2025 at 01:23:06PM -0600, Bowman, Terry wrote:
> 
> 
> On 2/7/2025 2:01 AM, Gregory Price wrote:
> > On Tue, Jan 07, 2025 at 08:38:49AM -0600, Terry Bowman wrote:
> >> +static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
> >> +{
> >> +	struct cxl_port *port;
> >> +
> >> +	if (!pdev)
> >> +		return NULL;
> >> +
> >> +	if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
> >> +	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
> >> +		struct cxl_dport *dport;
> >> +		void __iomem *ras_base;
> >> +
> >> +		port = find_cxl_port(&pdev->dev, &dport);
> >> +		ras_base = dport ? dport->regs.ras : NULL;
> > I'm fairly certain dport can come back here uninitialized, you
> > probably want to put this inside the `if (port)` block and 
> > pre-initialize dport to NULL.
> Right, it can. NULL dport check here covers this, no?:
> 
> 		ras_base = dport ? dport->regs.ras : NULL;

dport can be garbage (whatever is on the stack) at this check because
nothing ever initializes it to NULL.

> 
> Terry
> 
> >> +		if (port)
> >> +			put_device(&port->dev);
> >> +		return ras_base;
> > You can probably even simplify this down to something like
> >
> > 		struct cxl_dport *dport = NULL;
> >
> > 		port = find_cxl_port(&pdev->dev, &dport);
> > 		if (port)
> > 			put_device(&port->dev);
> >
> > 		return dport ? dport->regs.ras : NULL;
> >
> >
> > ~Gregory
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [PATCH v5 13/16] cxl/pci: Add error handler for CXL PCIe Port RAS errors
  2025-02-07 19:41       ` Gregory Price
@ 2025-02-07 21:04         ` Bowman, Terry
  0 siblings, 0 replies; 96+ messages in thread
From: Bowman, Terry @ 2025-02-07 21:04 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa, lukas, ming.li,
	PradeepVineshReddy.Kodamati, alucerop



On 2/7/2025 1:41 PM, Gregory Price wrote:
> On Fri, Feb 07, 2025 at 01:23:06PM -0600, Bowman, Terry wrote:
>>
>> On 2/7/2025 2:01 AM, Gregory Price wrote:
>>> On Tue, Jan 07, 2025 at 08:38:49AM -0600, Terry Bowman wrote:
>>>> +static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
>>>> +{
>>>> +	struct cxl_port *port;
>>>> +
>>>> +	if (!pdev)
>>>> +		return NULL;
>>>> +
>>>> +	if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
>>>> +	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
>>>> +		struct cxl_dport *dport;
>>>> +		void __iomem *ras_base;
>>>> +
>>>> +		port = find_cxl_port(&pdev->dev, &dport);
>>>> +		ras_base = dport ? dport->regs.ras : NULL;
>>> I'm fairly certain dport can come back here uninitialized, you
>>> probably want to put this inside the `if (port)` block and 
>>> pre-initialize dport to NULL.
>> Right, it can. NULL dport check here covers this, no?:
>>
>> 		ras_base = dport ? dport->regs.ras : NULL;
> dport can be garbage (whatever is on the stack) at this check because
> nothing ever initializes it to NULL.

Got it. Thanks.
>> Terry
>>
>>>> +		if (port)
>>>> +			put_device(&port->dev);
>>>> +		return ras_base;
>>> You can probably even simplify this down to something like
>>>
>>> 		struct cxl_dport *dport = NULL;
>>>
>>> 		port = find_cxl_port(&pdev->dev, &dport);
>>> 		if (port)
>>> 			put_device(&port->dev);
>>>
>>> 		return dport ? dport->regs.ras : NULL;
>>>
>>>
>>> ~Gregory



end of thread, other threads:[~2025-02-07 21:04 UTC | newest]

Thread overview: 96+ messages
2025-01-07 14:38 [PATCH v5 0/16] Enable CXL PCIe port protocol error handling and logging Terry Bowman
2025-01-07 14:38 ` [PATCH v5 01/16] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver' Terry Bowman
2025-01-13 23:45   ` Ira Weiny
2025-02-06 17:01   ` Gregory Price
2025-02-07 18:35     ` Bowman, Terry
2025-01-07 14:38 ` [PATCH v5 02/16] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe Port support Terry Bowman
2025-01-13 23:45   ` Ira Weiny
2025-02-06 17:02   ` Gregory Price
2025-01-07 14:38 ` [PATCH v5 03/16] CXL/PCI: Introduce PCIe helper functions pcie_is_cxl() and pcie_is_cxl_port() Terry Bowman
2025-01-13 23:49   ` Ira Weiny
2025-01-14 15:19     ` Bowman, Terry
2025-01-14 23:33       ` Ira Weiny
2025-01-14 23:39         ` Bowman, Terry
2025-01-16 15:35           ` Ira Weiny
2025-01-15 10:03       ` Lukas Wunner
2025-01-07 14:38 ` [PATCH v5 04/16] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
2025-01-13 23:51   ` Ira Weiny
2025-02-06 18:18   ` Gregory Price
2025-02-07 18:50     ` Bowman, Terry
2025-01-07 14:38 ` [PATCH v5 05/16] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver Terry Bowman
2025-01-14  6:54   ` Li Ming
2025-01-14 11:20     ` Jonathan Cameron
2025-01-14 20:10       ` Bowman, Terry
2025-01-14 19:29     ` Bowman, Terry
2025-01-15  1:18       ` Li Ming
2025-01-15 14:39         ` Bowman, Terry
2025-01-16  3:15           ` Li Ming
2025-02-05  3:46             ` Bowman, Terry
2025-02-05 13:58               ` Li Ming
2025-02-05 14:22                 ` Bowman, Terry
2025-01-14 16:35   ` Ira Weiny
2025-02-06 18:33   ` Gregory Price
2025-02-07 17:54     ` Jonathan Cameron
2025-02-07 19:05     ` Bowman, Terry
2025-01-07 14:38 ` [PATCH v5 06/16] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe Port devices Terry Bowman
2025-01-14 11:32   ` Jonathan Cameron
2025-01-14 20:44     ` Bowman, Terry
2025-01-28 20:25     ` Bowman, Terry
2025-01-29 18:04       ` Jonathan Cameron
2025-01-14 16:57   ` Ira Weiny
2025-01-07 14:38 ` [PATCH v5 07/16] PCI/AER: Add CXL PCIe Port uncorrectable error recovery in AER service driver Terry Bowman
2025-01-14 11:33   ` Jonathan Cameron
2025-01-14 20:28     ` Bowman, Terry
2025-01-15 11:37       ` Jonathan Cameron
2025-01-14 17:27   ` Ira Weiny
2025-01-07 14:38 ` [PATCH v5 08/16] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers Terry Bowman
2025-01-14 21:37   ` Ira Weiny
2025-02-07  7:30   ` Gregory Price
2025-02-07 19:08     ` Bowman, Terry
2025-02-07 19:39       ` Gregory Price
2025-01-07 14:38 ` [PATCH v5 09/16] cxl/pci: Map CXL PCIe Upstream " Terry Bowman
2025-01-14 11:35   ` Jonathan Cameron
2025-01-14 15:24     ` Bowman, Terry
2025-01-14 22:02   ` Ira Weiny
2025-01-14 22:11     ` Bowman, Terry
2025-01-14 23:38       ` Ira Weiny
2025-01-14 23:49         ` Bowman, Terry
2025-01-15 11:40           ` Jonathan Cameron
2025-02-07  7:35   ` Gregory Price
2025-01-07 14:38 ` [PATCH v5 10/16] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports Terry Bowman
2025-01-14 11:39   ` Jonathan Cameron
2025-01-14 22:20   ` Ira Weiny
2025-02-07  7:38   ` Gregory Price
2025-01-07 14:38 ` [PATCH v5 11/16] cxl/pci: Add log message for umnapped registers in existing RAS handlers Terry Bowman
2025-01-14 11:41   ` Jonathan Cameron
2025-01-14 22:21   ` Ira Weiny
2025-02-07  7:39   ` Gregory Price
2025-01-07 14:38 ` [PATCH v5 12/16] cxl/pci: Change find_cxl_port() to non-static Terry Bowman
2025-01-14 22:23   ` Ira Weiny
2025-02-07  7:45     ` Gregory Price
2025-01-07 14:38 ` [PATCH v5 13/16] cxl/pci: Add error handler for CXL PCIe Port RAS errors Terry Bowman
2025-01-14 11:46   ` Jonathan Cameron
2025-01-14 21:20     ` Bowman, Terry
2025-01-14 22:51   ` Ira Weiny
2025-01-14 23:10     ` Bowman, Terry
2025-01-14 23:42     ` Bowman, Terry
2025-02-07  8:01   ` Gregory Price
2025-02-07 19:23     ` Bowman, Terry
2025-02-07 19:41       ` Gregory Price
2025-02-07 21:04         ` Bowman, Terry
2025-01-07 14:38 ` [PATCH v5 14/16] cxl/pci: Add trace logging " Terry Bowman
2025-01-14 11:49   ` Jonathan Cameron
2025-01-14 20:56     ` Bowman, Terry
2025-01-15 11:42       ` Jonathan Cameron
2025-01-14 22:58   ` Ira Weiny
2025-01-07 14:38 ` [PATCH v5 15/16] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers Terry Bowman
2025-01-14 11:51   ` Jonathan Cameron
2025-01-14 23:03   ` Ira Weiny
2025-02-07  8:08   ` Gregory Price
2025-01-07 14:38 ` [PATCH v5 16/16] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports Terry Bowman
2025-01-14 23:26   ` Ira Weiny
2025-01-14 23:34     ` Bowman, Terry
2025-01-14 23:45       ` Ira Weiny
2025-01-15  0:09         ` Bowman, Terry
2025-01-15  0:20         ` Bowman, Terry
2025-01-16 21:42           ` Ira Weiny
