[PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging


* [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging
@ 2024-10-25 21:02 Terry Bowman
  2024-10-25 21:02 ` [PATCH v2 01/14] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver' Terry Bowman
                   ` (16 more replies)
  0 siblings, 17 replies; 55+ messages in thread
From: Terry Bowman @ 2024-10-25 21:02 UTC
  To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa

This is a continuation of the CXL port error handling RFC from earlier.[1]
The RFC resulted in the decision to add CXL PCIe port error handling to
the existing RCH downstream port handling in the AER service driver. This
patchset adds the CXL PCIe port protocol error handling and logging.

The first 7 patches update the existing AER service driver to support CXL
PCIe port protocol error handling and reporting. This includes AER service
driver changes that add correctable and uncorrectable error support,
CXL-specific recovery handling, and CXL driver callback handlers.

The following 7 patches address CXL driver support for CXL PCIe port
protocol errors. This includes the following changes to the CXL drivers:
mapping CXL port and downstream port RAS registers, interface updates for
common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
adding port-specific error handlers, and protocol error logging.

[1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/

Testing:

Below are test results for this patchset using QEMU with a CXL root
port (0c:00.0), a CXL upstream switch port (0d:00.0), and a CXL downstream
switch port (0e:00.0). CE and UCE logs for a CXL endpoint (0f:00.0) are
also included to show that the existing PCIe endpoint handling is unchanged.

This was tested using aer-inject updated to support CE and UCE internal
error injection. CXL RAS was set using a test patch (not upstreamed, but I
can provide it if needed).
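
For readers matching the injected masks to the decoded bits in the logs
below, here is a minimal userspace sketch. It only restates what the log
lines already show (00400000 is bit 22 UncorrIntErr, 00004000 is bit 14
CorrIntErr, 00000040 is bit 6 BadTLP). The macro and helper names are
hypothetical and are not part of this series or of aer-inject.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions taken from the log lines in this cover letter. */
#define AER_UNCOR_INTERNAL_BIT 22 /* "[22] UncorrIntErr", mask 0x00400000 */
#define AER_COR_INTERNAL_BIT   14 /* "[14] CorrIntErr",   mask 0x00004000 */
#define AER_COR_BAD_TLP_BIT     6 /* "[ 6] BadTLP",       mask 0x00000040 */

/* Hypothetical helper: true if 'bit' is set in an AER status word. */
static bool aer_status_has_bit(uint32_t status, unsigned int bit)
{
	assert(bit < 32);
	return (status >> bit) & 1u;
}

/* Hypothetical helper: index of the lowest set status bit, or -1 if none. */
static int aer_first_status_bit(uint32_t status)
{
	for (int bit = 0; bit < 32; bit++) {
		if (aer_status_has_bit(status, (unsigned int)bit))
			return bit;
	}
	return -1;
}
```

Passing the second field of a UCE "Injecting errors" line (00400000) to
aer_first_status_bit() gives the bit index that the driver prints in
brackets on the following status line.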

 - Root port UCE:
 root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
 pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
 pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
 pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
 pcieport 0000:0c:00.0:    [22] UncorrIntErr
 aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
 cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
 Kernel panic - not syncing: CXL cachemem error. Invoking panic
 CPU: 1 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
 Tainted: [E]=UNSIGNED_MODULE
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x27/0x90
  dump_stack+0x10/0x20
  panic+0x33e/0x380
  cxl_do_recovery+0x116/0x120
  ? srso_return_thunk+0x5/0x5f
  aer_isr+0x3e0/0x710
  irq_thread_fn+0x28/0x70
  irq_thread+0x179/0x240
  ? srso_return_thunk+0x5/0x5f
  ? __pfx_irq_thread_fn+0x10/0x10
  ? __pfx_irq_thread_dtor+0x10/0x10
  ? __pfx_irq_thread+0x10/0x10
  kthread+0xf5/0x130
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x3c/0x60
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
 ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

 - Root port CE:
 root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
 pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
 pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
 pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
 pcieport 0000:0c:00.0:    [14] CorrIntErr
 aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
 cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'

 - Upstream switch port UCE:
 root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
 pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
 pcieport 0000:0d:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
 pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00400000/02000000
 pcieport 0000:0d:00.0:    [22] UncorrIntErr
 aer_event: 0000:0d:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
 cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
 Kernel panic - not syncing: CXL cachemem error. Invoking panic
 CPU: 1 UID: 0 PID: 148 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
 Tainted: [E]=UNSIGNED_MODULE
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x27/0x90
  dump_stack+0x10/0x20
  panic+0x33e/0x380
  cxl_do_recovery+0x116/0x120
  ? srso_return_thunk+0x5/0x5f
  aer_isr+0x3e0/0x710
  ? free_cpumask_var+0x9/0x10
  ? kfree+0x259/0x2e0
  irq_thread_fn+0x28/0x70
  irq_thread+0x179/0x240
  ? srso_return_thunk+0x5/0x5f
  ? __pfx_irq_thread_fn+0x10/0x10
  ? __pfx_irq_thread_dtor+0x10/0x10
  ? __pfx_irq_thread+0x10/0x10
  kthread+0xf5/0x130
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x3c/0x60
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 Kernel Offset: 0x24c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
 ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

 - Upstream switch port CE:
 root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
 pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
 pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
 pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00004000/0000a000
 pcieport 0000:0d:00.0:    [14] CorrIntErr
 aer_event: 0000:0d:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
 cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'

 - Downstream switch port UCE:
 root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
 pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
 pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
 pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00400000/02000000
 pcieport 0000:0e:00.0:    [22] UncorrIntErr
 aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
 cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
 Kernel panic - not syncing: CXL cachemem error. Invoking panic
 CPU: 1 UID: 0 PID: 147 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
 Tainted: [E]=UNSIGNED_MODULE
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl+0x27/0x90
  dump_stack+0x10/0x20
  panic+0x33e/0x380
  cxl_do_recovery+0x116/0x120
  ? srso_return_thunk+0x5/0x5f
  aer_isr+0x3e0/0x710
  irq_thread_fn+0x28/0x70
  irq_thread+0x179/0x240
  ? srso_return_thunk+0x5/0x5f
  ? __pfx_irq_thread_fn+0x10/0x10
  ? __pfx_irq_thread_dtor+0x10/0x10
  ? __pfx_irq_thread+0x10/0x10
  kthread+0xf5/0x130
  ? __pfx_kthread+0x10/0x10
  ret_from_fork+0x3c/0x60
  ? __pfx_kthread+0x10/0x10
  ret_from_fork_asm+0x1a/0x30
  </TASK>
 Kernel Offset: 0x19c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
 ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

 - Downstream switch port CE:
 root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
 pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
 pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
 pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
 pcieport 0000:0e:00.0:    [14] CorrIntErr
 aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
 cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'

 - Endpoint CE:
 root@tbowman-cxl:~/aer-inject# ./ep-ce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000040/00000000 into device 0000:0f:00.0
 pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0f:00.0
 cxl_pci 0000:0f:00.0: CXL Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
 cxl_pci 0000:0f:00.0:   device [8086:0d93] error status/mask=00000040/0000e000
 cxl_pci 0000:0f:00.0:    [ 6] BadTLP
 aer_event: 0000:0f:00.0 CXL Bus Error: severity=Corrected, Bad TLP, TLP Header=Not available
 cxl_aer_correctable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Received Error From Physical Layer'

 - Endpoint UCE:
 root@tbowman-cxl:~/aer-inject# ./ep-uce-inject.sh
 pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00040000 into device 0000:0f:00.0
 pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0f:00.0
 cxl_pci 0000:0f:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
 aer_event: 0000:0f:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
 cxl_aer_uncorrectable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Memory Byte Enable Parity Error' firs'
 cxl_pci 0000:0f:00.0: mem1: frozen state error detected, disable CXL.mem
 cxl_detach_ep: cxl_mem mem1: disconnect mem1 from port2
 cxl_detach_ep: cxl_mem mem1: disconnect mem1 from port1
 pcieport 0000:0e:00.0: unlocked secondary bus reset via: pciehp_reset_slot+0xac/0x160
 pcieport 0000:0e:00.0: AER: Downstream Port link has been reset (0)
 cxl_pci 0000:0f:00.0: mem1: restart CXL.mem after slot reset
 devm_cxl_enumerate_ports: cxl_mem mem1: scan: iter: mem1 dport_dev: 0000:0e:00.0 parent: 0000:0d:00.0
 devm_cxl_enumerate_ports: cxl_mem mem1: found already registered port port2:0000:0d:00.0
 devm_cxl_enumerate_ports: cxl_mem mem1: scan: iter: 0000:0e:00.0 dport_dev: 0000:0c:00.0 parent: pci0000:0c
 devm_cxl_enumerate_ports: cxl_mem mem1: found already registered port port1:pci0000:0c
 __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4500
 cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
 __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4500
 cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
 __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
 cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
 __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
 cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
 <snip>
 cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
 __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
 cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
 __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
 cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
 __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
 cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
 cxl_bus_probe: cxl_nvdimm pmem1: probe: 0
 devm_cxl_add_nvdimm: cxl_mem mem1: register pmem1
 pcieport 0000:0e:00.0: RAS is already mapped
 cxl_port port2: RAS is already mapped
 pcieport 0000:0c:00.0: RAS is already mapped
 cxl_port_alloc: cxl_mem mem1: host-bridge: pci0000:0c
 cxl_cdat_get_length: cxl_port endpoint4: CDAT length 160
 cxl_port_perf_data_calculate: cxl_port endpoint4: Failed to retrieve ep perf coordinates.
 cxl_endpoint_parse_cdat: cxl_port endpoint4: Failed to do perf coord calculations.
 init_hdm_decoder: cxl_port endpoint4: decoder4.0: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
 add_hdm_decoder: cxl decoder4.0: Added to port endpoint4
 init_hdm_decoder: cxl_port endpoint4: decoder4.1: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
 add_hdm_decoder: cxl decoder4.1: Added to port endpoint4
 init_hdm_decoder: cxl_port endpoint4: decoder4.2: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
 add_hdm_decoder: cxl decoder4.2: Added to port endpoint4
 init_hdm_decoder: cxl_port endpoint4: decoder4.3: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
 add_hdm_decoder: cxl decoder4.3: Added to port endpoint4
 cxl_bus_probe: cxl_port endpoint4: probe: 0
 devm_cxl_add_port: cxl_mem mem1: endpoint4 added to port2
 cxl_bus_probe: cxl_mem mem1: probe: 0
 cxl_pci 0000:0f:00.0: mem1: error resume successful
 pcieport 0000:0e:00.0: AER: device recovery successful

Changes in v1 -> v2:
 [Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
 [Jonathan] Update description to DSP map patch description
 [Jonathan] Update cxl_pci_port_ras() to check for NULL port
 [Jonathan] Don't call handler before handler port changes are present (patch order).
 [Bjorn] Fix linebreak in cover sheet URL
 [Bjorn] Remove timestamps from test logs in cover sheet
 [Bjorn] Retitle AER commits to use "PCI/AER:"
 [Bjorn] Retitle patch#3 to use renaming instead of refactoring
 [Bjorn] Fix base commit-id on cover sheet
 [Bjorn] Add VH spec reference/citation
 [Terry] Removed last 2 patches to enable internal errors. They are not needed
 because internal errors are enabled in the AER driver.
 [Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
 [Dan] Use kernel panic in CXL recovery
 [Dan] cxl_port_hndlrs -> cxl_port_error_handlers
 [Dan] Move cxl_port_error_handlers to pci_driver. Remove module (un)registration.
 [Terry] Add patch w/ cxl_assign_port_error_handlers() and cxl_clear_port_error_handlers()
 [Terry] Removed PCI_ERS_RESULT_PANIC patch. It is no longer needed because the result type parameter
 is not used in the cxl_err_handlers callbacks.

Changes in RFC -> v1:
 [Dan] Rename cxl_rch_handle_error() to cxl_handle_error()
 [Dan] Add cxl_do_recovery()
 [Jonathan] Flatten cxl_setup_parent_uport()
 [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
 [Jonathan] Rename cxl_dev_is_pci_type()
 [Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
 replace these find_cxl_port() and device_find_child().
 [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
 [Ming] Don't use endpoint as host to cxl_map_component_regs()
 [Bjorn] Use "PCIe UIR/CIE" instead of "AER UI/CIE"
 [Bjorn] Don't use Kconfig to enable/disable a CXL external interface

Terry Bowman (14):
  PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct
    pci_driver'
  PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port
    support
  cxl/pci: Introduce helper functions pcie_is_cxl() and
    pcie_is_cxl_port()
  PCI/AER: Modify AER driver logging to report CXL or PCIe bus error
    type
  PCI/AER: Add CXL PCIe port correctable error support in AER service
    driver
  PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe
    port devices
  PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service
    driver
  cxl/pci: Change find_cxl_ports() to non-static
  cxl/pci: Map CXL PCIe root port and downstream switch port RAS
    registers
  cxl/pci: Map CXL PCIe upstream switch port RAS registers
  cxl/pci: Rename RAS handler interfaces to also indicate CXL PCIe port
    support
  cxl/pci: Add error handler for CXL PCIe port RAS errors
  cxl/pci: Add trace logging for CXL PCIe port RAS errors
  cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers

 drivers/cxl/core/core.h       |   3 +
 drivers/cxl/core/pci.c        | 180 +++++++++++++++++++++++++++-------
 drivers/cxl/core/port.c       |   4 +-
 drivers/cxl/core/trace.h      |  47 +++++++++
 drivers/cxl/cxl.h             |  10 +-
 drivers/cxl/mem.c             |  29 +++++-
 drivers/pci/pci.c             |  14 +++
 drivers/pci/pci.h             |   3 +
 drivers/pci/pcie/aer.c        |  99 ++++++++++++-------
 drivers/pci/pcie/err.c        |  54 ++++++++++
 drivers/pci/probe.c           |  10 ++
 include/linux/pci.h           |  13 +++
 include/ras/ras_event.h       |   9 +-
 include/uapi/linux/pci_regs.h |   3 +-
 14 files changed, 396 insertions(+), 82 deletions(-)


base-commit: 739a5da7ed744578a9477fb322f04afecafca6b0
-- 
2.34.1



* [PATCH v2 01/14] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver'
  2024-10-25 21:02 [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Terry Bowman
@ 2024-10-25 21:02 ` Terry Bowman
  2024-10-30 15:14   ` Jonathan Cameron
                     ` (2 more replies)
  2024-10-25 21:02 ` [PATCH v2 02/14] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port support Terry Bowman
                   ` (15 subsequent siblings)
  16 siblings, 3 replies; 55+ messages in thread
From: Terry Bowman @ 2024-10-25 21:02 UTC
  To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa

CXL.io provides a PCIe-like protocol error implementation, but CXL.io and
PCIe have different handling requirements.

The PCIe AER service driver may attempt to recover PCIe devices with
uncorrectable errors, but recovery is not used for CXL.io because of the
potential for corrupting what may be system memory.

Create pci_driver::cxl_err_handler, similar to pci_driver::err_handler.
Add callbacks for correctable and uncorrectable CXL.io error
handling.

The CXL error handlers will be used in future patches adding CXL PCIe
port protocol error handling.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 include/linux/pci.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/include/linux/pci.h b/include/linux/pci.h
index 573b4c4c2be6..106ac83e3a7b 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -886,6 +886,14 @@ struct pci_error_handlers {
 	void (*cor_error_detected)(struct pci_dev *dev);
 };
 
+/* CXL bus error event callbacks */
+struct cxl_error_handlers {
+	/* CXL bus error detected on this device */
+	bool (*error_detected)(struct pci_dev *dev);
+
+	/* Allow device driver to record more details of a correctable error */
+	void (*cor_error_detected)(struct pci_dev *dev);
+};
 
 struct module;
 
@@ -956,6 +964,7 @@ struct pci_driver {
 	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
 	u32  (*sriov_get_vf_total_msix)(struct pci_dev *pf);
 	const struct pci_error_handlers *err_handler;
+	const struct cxl_error_handlers *cxl_err_handler;
 	const struct attribute_group **groups;
 	const struct attribute_group **dev_groups;
 	struct device_driver	driver;
-- 
2.34.1
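
As an illustration of how the new callbacks could be wired up, here is a
minimal userspace sketch. The two structures mirror only the fields added
in the diff above; 'struct pci_dev' is a stand-in, and every demo_* name is
hypothetical. The patch does not state what the bool returned by
error_detected() means, so the sketch simply passes it through.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel type; not the real definition. */
struct pci_dev {
	const char *name;
	int cor_count;	/* illustration only: counts correctable reports */
};

/* Mirrors the structure added by this patch. */
struct cxl_error_handlers {
	bool (*error_detected)(struct pci_dev *dev);
	void (*cor_error_detected)(struct pci_dev *dev);
};

/* Reduced to the one field this patch adds. */
struct pci_driver {
	const struct cxl_error_handlers *cxl_err_handler;
};

/* Hypothetical driver callbacks. */
static bool demo_error_detected(struct pci_dev *dev)
{
	(void)dev;
	return true;
}

static void demo_cor_error_detected(struct pci_dev *dev)
{
	dev->cor_count++;
}

static const struct cxl_error_handlers demo_handlers = {
	.error_detected = demo_error_detected,
	.cor_error_detected = demo_cor_error_detected,
};

static const struct pci_driver demo_driver = {
	.cxl_err_handler = &demo_handlers,
};

/* Hypothetical caller: invoke the uncorrectable callback if one is set. */
static bool demo_report_uncorrectable(const struct pci_driver *drv,
				      struct pci_dev *dev)
{
	assert(dev != NULL);
	if (!drv || !drv->cxl_err_handler ||
	    !drv->cxl_err_handler->error_detected)
		return false;
	return drv->cxl_err_handler->error_detected(dev);
}

/* Hypothetical caller: invoke the correctable callback if one is set. */
static void demo_report_correctable(const struct pci_driver *drv,
				    struct pci_dev *dev)
{
	assert(dev != NULL);
	if (drv && drv->cxl_err_handler &&
	    drv->cxl_err_handler->cor_error_detected)
		drv->cxl_err_handler->cor_error_detected(dev);
}
```

The NULL checks reflect that a driver may leave cxl_err_handler, or either
callback, unset; how the AER driver actually handles that case is defined
by later patches in the series, not by this sketch.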



* [PATCH v2 02/14] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port support
  2024-10-25 21:02 [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Terry Bowman
  2024-10-25 21:02 ` [PATCH v2 01/14] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver' Terry Bowman
@ 2024-10-25 21:02 ` Terry Bowman
  2024-10-30 15:13   ` Jonathan Cameron
                     ` (2 more replies)
  2024-10-25 21:02 ` [PATCH v2 03/14] cxl/pci: Introduce helper functions pcie_is_cxl() and pcie_is_cxl_port() Terry Bowman
                   ` (14 subsequent siblings)
  16 siblings, 3 replies; 55+ messages in thread
From: Terry Bowman @ 2024-10-25 21:02 UTC
  To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa

The AER service driver already includes support for CXL restricted host
(RCH) downstream port error handling. The current implementation is based
on CXL 1.1 using a root complex event collector.

Rename function interfaces and parameters where necessary to include
virtual hierarchy (VH) mode CXL PCIe port error handling alongside the RCH
handling.[1] The CXL PCIe port error handling will be added in a future
patch.

Limit changes to renaming variable and function names. No functional
changes are added.

[1] CXL 3.1 Spec, 9.12.2 CXL Virtual Hierarchy

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/pci/pcie/aer.c | 28 ++++++++++++++--------------
 1 file changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 13b8586924ea..fe6edf26279e 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1029,7 +1029,7 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
 	return 0;
 }
 
-static void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
+static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 {
 	/*
 	 * Internal errors of an RCEC indicate an AER error in an
@@ -1052,30 +1052,30 @@ static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
 	return *handles_cxl;
 }
 
-static bool handles_cxl_errors(struct pci_dev *rcec)
+static bool handles_cxl_errors(struct pci_dev *dev)
 {
 	bool handles_cxl = false;
 
-	if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
-	    pcie_aer_is_native(rcec))
-		pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
+	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
+	    pcie_aer_is_native(dev))
+		pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
 
 	return handles_cxl;
 }
 
-static void cxl_rch_enable_rcec(struct pci_dev *rcec)
+static void cxl_enable_internal_errors(struct pci_dev *dev)
 {
-	if (!handles_cxl_errors(rcec))
+	if (!handles_cxl_errors(dev))
 		return;
 
-	pci_aer_unmask_internal_errors(rcec);
-	pci_info(rcec, "CXL: Internal errors unmasked");
+	pci_aer_unmask_internal_errors(dev);
+	pci_info(dev, "CXL: Internal errors unmasked");
 }
 
 #else
-static inline void cxl_rch_enable_rcec(struct pci_dev *dev) { }
-static inline void cxl_rch_handle_error(struct pci_dev *dev,
-					struct aer_err_info *info) { }
+static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
+static inline void cxl_handle_error(struct pci_dev *dev,
+				    struct aer_err_info *info) { }
 #endif
 
 /**
@@ -1113,7 +1113,7 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 
 static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
 {
-	cxl_rch_handle_error(dev, info);
+	cxl_handle_error(dev, info);
 	pci_aer_handle_error(dev, info);
 	pci_dev_put(dev);
 }
@@ -1491,7 +1491,7 @@ static int aer_probe(struct pcie_device *dev)
 		return status;
 	}
 
-	cxl_rch_enable_rcec(port);
+	cxl_enable_internal_errors(port);
 	aer_enable_rootport(rpc);
 	pci_info(port, "enabled with IRQ %d\n", dev->irq);
 	return 0;
-- 
2.34.1



* [PATCH v2 03/14] cxl/pci: Introduce helper functions pcie_is_cxl() and pcie_is_cxl_port()
  2024-10-25 21:02 [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Terry Bowman
  2024-10-25 21:02 ` [PATCH v2 01/14] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver' Terry Bowman
  2024-10-25 21:02 ` [PATCH v2 02/14] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port support Terry Bowman
@ 2024-10-25 21:02 ` Terry Bowman
  2024-10-30 14:57   ` Jonathan Cameron
                     ` (2 more replies)
  2024-10-25 21:02 ` [PATCH v2 04/14] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
                   ` (13 subsequent siblings)
  16 siblings, 3 replies; 55+ messages in thread
From: Terry Bowman @ 2024-10-25 21:02 UTC
  To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa

CXL and AER drivers need the ability to identify CXL devices and CXL port
devices.

First, add set_pcie_cxl() with logic checking for CXL Flexbus DVSEC
presence. The CXL Flexbus DVSEC presence is used because it is required
for all the CXL PCIe devices.[1]

Add boolean 'struct pci_dev::is_cxl' to cache the CXL Flexbus DVSEC
presence.

Add pcie_is_cxl() as a macro to return 'struct pci_dev::is_cxl'.

Add pcie_is_cxl_port() to check if a device is a CXL root port, CXL
upstream switch port, or CXL downstream switch port. Also, verify that
the CXL Extensions DVSEC for Ports is present.[1]

[1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
    Capability (DVSEC) ID Assignment, Table 8-2

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/pci/pci.c             | 14 ++++++++++++++
 drivers/pci/probe.c           | 10 ++++++++++
 include/linux/pci.h           |  4 ++++
 include/uapi/linux/pci_regs.h |  3 ++-
 4 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 7d85c04fbba2..c1b243aec61c 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5034,6 +5034,20 @@ static u16 cxl_port_dvsec(struct pci_dev *dev)
 					 PCI_DVSEC_CXL_PORT);
 }
 
+bool pcie_is_cxl_port(struct pci_dev *dev)
+{
+	if (!pcie_is_cxl(dev))
+		return false;
+
+	if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
+	    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM) &&
+	    (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
+		return false;
+
+	return cxl_port_dvsec(dev);
+}
+EXPORT_SYMBOL_GPL(pcie_is_cxl_port);
+
 static bool cxl_sbr_masked(struct pci_dev *dev)
 {
 	u16 dvsec, reg;
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 4f68414c3086..9324eb345f11 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1631,6 +1631,14 @@ static void set_pcie_thunderbolt(struct pci_dev *dev)
 		dev->is_thunderbolt = 1;
 }
 
+static void set_pcie_cxl(struct pci_dev *dev)
+{
+	u16 dvsec = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
+					      PCI_DVSEC_CXL_FLEXBUS);
+	if (dvsec)
+		dev->is_cxl = 1;
+}
+
 static void set_pcie_untrusted(struct pci_dev *dev)
 {
 	struct pci_dev *parent;
@@ -1945,6 +1953,8 @@ int pci_setup_device(struct pci_dev *dev)
 	/* Need to have dev->cfg_size ready */
 	set_pcie_thunderbolt(dev);
 
+	set_pcie_cxl(dev);
+
 	set_pcie_untrusted(dev);
 
 	/* "Unknown power state" */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 106ac83e3a7b..d3b1af9fb273 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -443,6 +443,7 @@ struct pci_dev {
 	unsigned int	is_hotplug_bridge:1;
 	unsigned int	shpc_managed:1;		/* SHPC owned by shpchp */
 	unsigned int	is_thunderbolt:1;	/* Thunderbolt controller */
+	unsigned int	is_cxl:1;               /* CXL alternate protocol */
 	/*
 	 * Devices marked being untrusted are the ones that can potentially
 	 * execute DMA attacks and similar. They are typically connected
@@ -743,6 +744,9 @@ static inline bool pci_is_vga(struct pci_dev *pdev)
 	return false;
 }
 
+#define pcie_is_cxl(dev) ((dev)->is_cxl)
+bool pcie_is_cxl_port(struct pci_dev *dev);
+
 #define for_each_pci_bridge(dev, bus)				\
 	list_for_each_entry(dev, &bus->devices, bus_list)	\
 		if (!pci_is_bridge(dev)) {} else
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index 12323b3334a9..5df6c74963c5 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -1186,9 +1186,10 @@
 #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
 #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000
 
-/* Compute Express Link (CXL r3.1, sec 8.1.5) */
+/* Compute Express Link (CXL r3.1, sec 8.1) */
 #define PCI_DVSEC_CXL_PORT				3
 #define PCI_DVSEC_CXL_PORT_CTL				0x0c
 #define PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR		0x00000001
+#define PCI_DVSEC_CXL_FLEXBUS				7
 
 #endif /* LINUX_PCI_REGS_H */
-- 
2.34.1



* [PATCH v2 04/14] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type
  2024-10-25 21:02 [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (2 preceding siblings ...)
  2024-10-25 21:02 ` [PATCH v2 03/14] cxl/pci: Introduce helper functions pcie_is_cxl() and pcie_is_cxl_port() Terry Bowman
@ 2024-10-25 21:02 ` Terry Bowman
  2024-10-30 14:56   ` Jonathan Cameron
                     ` (2 more replies)
  2024-10-25 21:02 ` [PATCH v2 05/14] PCI/AER: Add CXL PCIe port correctable error support in AER service driver Terry Bowman
                   ` (12 subsequent siblings)
  16 siblings, 3 replies; 55+ messages in thread
From: Terry Bowman @ 2024-10-25 21:02 UTC
  To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa

The AER driver and aer_event tracing currently log 'PCIe Bus Error'
for all errors.

Update the driver and aer_event tracing to log 'CXL Bus Error' for CXL
devices.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/pci/pcie/aer.c  | 14 ++++++++------
 include/ras/ras_event.h |  9 ++++++---
 2 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index fe6edf26279e..53e9a11f6c0f 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -699,13 +699,14 @@ static void __aer_print_error(struct pci_dev *dev,
 
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 {
+	const char *bus_type = pcie_is_cxl(dev) ? "CXL"  : "PCIe";
 	int layer, agent;
 	int id = pci_dev_id(dev);
 	const char *level;
 
 	if (!info->status) {
-		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
-			aer_error_severity_string[info->severity]);
+		pci_err(dev, "%s Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
+			bus_type, aer_error_severity_string[info->severity]);
 		goto out;
 	}
 
@@ -714,8 +715,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 
 	level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
 
-	pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
-		   aer_error_severity_string[info->severity],
+	pci_printk(level, dev, "%s Bus Error: severity=%s, type=%s, (%s)\n",
+		   bus_type, aer_error_severity_string[info->severity],
 		   aer_error_layer[layer], aer_agent_string[agent]);
 
 	pci_printk(level, dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
@@ -730,7 +731,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 	if (info->id && info->error_dev_num > 1 && info->id == id)
 		pci_err(dev, "  Error of this Agent is reported first\n");
 
-	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
+	trace_aer_event(dev_name(&dev->dev), bus_type, (info->status & ~info->mask),
 			info->severity, info->tlp_header_valid, &info->tlp);
 }
 
@@ -764,6 +765,7 @@ EXPORT_SYMBOL_GPL(cper_severity_to_aer);
 void pci_print_aer(struct pci_dev *dev, int aer_severity,
 		   struct aer_capability_regs *aer)
 {
+	const char *bus_type = pcie_is_cxl(dev) ? "CXL"  : "PCIe";
 	int layer, agent, tlp_header_valid = 0;
 	u32 status, mask;
 	struct aer_err_info info;
@@ -798,7 +800,7 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
 	if (tlp_header_valid)
 		__print_tlp_header(dev, &aer->header_log);
 
-	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
+	trace_aer_event(dev_name(&dev->dev), bus_type, (status & ~mask),
 			aer_severity, tlp_header_valid, &aer->header_log);
 }
 EXPORT_SYMBOL_NS_GPL(pci_print_aer, CXL);
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index e5f7ee0864e7..1bf8e7050ba8 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -297,15 +297,17 @@ TRACE_EVENT(non_standard_event,
 
 TRACE_EVENT(aer_event,
 	TP_PROTO(const char *dev_name,
+		 const char *bus_type,
 		 const u32 status,
 		 const u8 severity,
 		 const u8 tlp_header_valid,
 		 struct pcie_tlp_log *tlp),
 
-	TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp),
+	TP_ARGS(dev_name, bus_type, status, severity, tlp_header_valid, tlp),
 
 	TP_STRUCT__entry(
 		__string(	dev_name,	dev_name	)
+		__string(	bus_type,	bus_type	)
 		__field(	u32,		status		)
 		__field(	u8,		severity	)
 		__field(	u8, 		tlp_header_valid)
@@ -314,6 +316,7 @@ TRACE_EVENT(aer_event,
 
 	TP_fast_assign(
 		__assign_str(dev_name);
+		__assign_str(bus_type);
 		__entry->status		= status;
 		__entry->severity	= severity;
 		__entry->tlp_header_valid = tlp_header_valid;
@@ -325,8 +328,8 @@ TRACE_EVENT(aer_event,
 		}
 	),
 
-	TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s\n",
-		__get_str(dev_name),
+	TP_printk("%s %s Bus Error: severity=%s, %s, TLP Header=%s\n",
+		__get_str(dev_name), __get_str(bus_type),
 		__entry->severity == AER_CORRECTABLE ? "Corrected" :
 			__entry->severity == AER_FATAL ?
 			"Fatal" : "Uncorrected, non-fatal",
-- 
2.34.1



* [PATCH v2 05/14] PCI/AER: Add CXL PCIe port correctable error support in AER service driver
  2024-10-25 21:02 [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (3 preceding siblings ...)
  2024-10-25 21:02 ` [PATCH v2 04/14] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
@ 2024-10-25 21:02 ` Terry Bowman
  2024-10-30 15:13   ` Jonathan Cameron
  2024-10-31 16:37   ` Dave Jiang
  2024-10-25 21:02 ` [PATCH v2 06/14] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe port devices Terry Bowman
                   ` (11 subsequent siblings)
  16 siblings, 2 replies; 55+ messages in thread
From: Terry Bowman @ 2024-10-25 21:02 UTC (permalink / raw)
  To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa

The AER service driver doesn't currently handle CXL protocol errors
reported by CXL root ports, CXL upstream switch ports, and CXL downstream
switch ports.[1] Consequently, RAS protocol errors from CXL PCIe port
devices are not properly logged or handled.

These errors are reported to the OS via the root port's AER correctable
and uncorrectable internal error fields. While the AER driver supports
handling downstream port protocol errors in restricted CXL host (RCH)
mode, also known as CXL 1.1, it lacks the same functionality for CXL PCIe
ports operating in virtual hierarchy (VH) mode.

To address this gap, update the AER driver to handle CXL PCIe port device
protocol correctable errors (CE).

Make this update alongside the existing downstream port RCH error handling
logic, extending support to CXL PCIe ports in VH mode.

is_internal_error() is currently built only when CONFIG_PCIEAER_CXL is
enabled. Move its definition outside the #ifdef so that it is always
available, regardless of whether CONFIG_PCIEAER_CXL is enabled or
disabled.

The uncorrectable error (UCE) handling will be added in a future patch.

[1] CXL 3.1 Spec, 12.2.2 CXL Root Ports, Downstream Switch Ports, and
Upstream Switch Ports

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
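Reader's note (not part of the commit): the routing described above can be
modelled in a few lines of userspace C. This is only a sketch under stated
assumptions. Every sketch_* name is hypothetical, the two bit masks reflect
my reading of PCI_ERR_COR_INTERNAL and PCI_ERR_UNC_INTN, and the
handles_cxl flag stands in for whatever handles_cxl_errors() would return.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in bit masks (assumed to match the kernel's internal error bits). */
#define SKETCH_ERR_COR_INTERNAL 0x00004000u
#define SKETCH_ERR_UNC_INTN     0x00400000u

enum sketch_severity { SKETCH_NONFATAL, SKETCH_FATAL, SKETCH_CORRECTABLE };
enum sketch_path { SKETCH_PATH_PCIE, SKETCH_PATH_CXL };

/* Hypothetical, trimmed-down analog of struct aer_err_info. */
struct sketch_err_info {
	enum sketch_severity severity;
	uint32_t status;
};

/* Pick the internal error bit that matches the reported severity. */
static bool sketch_is_internal_error(const struct sketch_err_info *info)
{
	assert(info);
	if (info->severity == SKETCH_CORRECTABLE)
		return info->status & SKETCH_ERR_COR_INTERNAL;

	return info->status & SKETCH_ERR_UNC_INTN;
}

/* Only internal errors on a CXL-capable device take the CXL path. */
static enum sketch_path sketch_route(const struct sketch_err_info *info,
				     bool handles_cxl)
{
	if (sketch_is_internal_error(info) && handles_cxl)
		return SKETCH_PATH_CXL;

	return SKETCH_PATH_PCIE;
}
```

The point of the sketch is that the two paths are mutually exclusive: an
error goes to either the CXL handler or the PCIe handler, never both.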
 drivers/pci/pcie/aer.c | 59 ++++++++++++++++++++++++++++--------------
 1 file changed, 39 insertions(+), 20 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 53e9a11f6c0f..1d3e5b929661 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -941,8 +941,15 @@ static bool find_source_device(struct pci_dev *parent,
 	return true;
 }
 
-#ifdef CONFIG_PCIEAER_CXL
+static bool is_internal_error(struct aer_err_info *info)
+{
+	if (info->severity == AER_CORRECTABLE)
+		return info->status & PCI_ERR_COR_INTERNAL;
 
+	return info->status & PCI_ERR_UNC_INTN;
+}
+
+#ifdef CONFIG_PCIEAER_CXL
 /**
  * pci_aer_unmask_internal_errors - unmask internal errors
  * @dev: pointer to the pcie_dev data structure
@@ -994,14 +1001,6 @@ static bool cxl_error_is_native(struct pci_dev *dev)
 	return (pcie_ports_native || host->native_aer);
 }
 
-static bool is_internal_error(struct aer_err_info *info)
-{
-	if (info->severity == AER_CORRECTABLE)
-		return info->status & PCI_ERR_COR_INTERNAL;
-
-	return info->status & PCI_ERR_UNC_INTN;
-}
-
 static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
 {
 	struct aer_err_info *info = (struct aer_err_info *)data;
@@ -1033,14 +1032,23 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
 
 static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 {
-	/*
-	 * Internal errors of an RCEC indicate an AER error in an
-	 * RCH's downstream port. Check and handle them in the CXL.mem
-	 * device driver.
-	 */
-	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
-	    is_internal_error(info))
+	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
 		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
+
+	if (info->severity == AER_CORRECTABLE) {
+		struct pci_driver *pdrv = dev->driver;
+		int aer = dev->aer_cap;
+
+		if (aer)
+			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
+					       info->status);
+
+		if (pdrv && pdrv->cxl_err_handler &&
+		    pdrv->cxl_err_handler->cor_error_detected)
+			pdrv->cxl_err_handler->cor_error_detected(dev);
+
+		pcie_clear_device_status(dev);
+	}
 }
 
 static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
@@ -1058,9 +1066,13 @@ static bool handles_cxl_errors(struct pci_dev *dev)
 {
 	bool handles_cxl = false;
 
-	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
-	    pcie_aer_is_native(dev))
+	if (!pcie_aer_is_native(dev))
+		return false;
+
+	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
 		pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
+	else
+		handles_cxl = pcie_is_cxl_port(dev);
 
 	return handles_cxl;
 }
@@ -1078,6 +1090,10 @@ static void cxl_enable_internal_errors(struct pci_dev *dev)
 static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
 static inline void cxl_handle_error(struct pci_dev *dev,
 				    struct aer_err_info *info) { }
+static bool handles_cxl_errors(struct pci_dev *dev)
+{
+	return false;
+}
 #endif
 
 /**
@@ -1115,8 +1131,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 
 static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
 {
-	cxl_handle_error(dev, info);
-	pci_aer_handle_error(dev, info);
+	if (is_internal_error(info) && handles_cxl_errors(dev))
+		cxl_handle_error(dev, info);
+	else
+		pci_aer_handle_error(dev, info);
+
 	pci_dev_put(dev);
 }
 
-- 
2.34.1



* [PATCH v2 06/14] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe port devices
  2024-10-25 21:02 [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (4 preceding siblings ...)
  2024-10-25 21:02 ` [PATCH v2 05/14] PCI/AER: Add CXL PCIe port correctable error support in AER service driver Terry Bowman
@ 2024-10-25 21:02 ` Terry Bowman
  2024-10-30 15:37   ` Jonathan Cameron
  2024-10-31 16:58   ` Dave Jiang
  2024-10-25 21:02 ` [PATCH v2 07/14] PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service driver Terry Bowman
                   ` (10 subsequent siblings)
  16 siblings, 2 replies; 55+ messages in thread
From: Terry Bowman @ 2024-10-25 21:02 UTC (permalink / raw)
  To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa

The AER service driver's aer_get_device_error_info() function doesn't read
uncorrectable (UCE) fatal error status from PCIe upstream port devices,
including CXL upstream switch ports. As a result, fatal errors are not
logged or handled as needed for CXL PCIe upstream switch port devices.

Update the aer_get_device_error_info() function to read the UCE fatal
status for all CXL PCIe port devices.

The fatal error status will be used in future patches implementing
CXL PCIe port uncorrectable error handling and logging.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
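Reader's note (not part of the commit): a userspace sketch of the condition
this patch extends. It assumes, from the hunk context below, that the
correctable case is handled by an earlier branch that reads a different
register. The sketch_* enums are hypothetical stand-ins for the kernel's
PCI_EXP_TYPE_* and AER_* values, not the real definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins; numeric values are not the kernel's. */
enum sketch_port_type {
	SKETCH_TYPE_ENDPOINT,
	SKETCH_TYPE_ROOT_PORT,
	SKETCH_TYPE_UPSTREAM,
	SKETCH_TYPE_DOWNSTREAM,
	SKETCH_TYPE_RC_EC,
};

enum sketch_severity { SKETCH_NONFATAL, SKETCH_FATAL, SKETCH_CORRECTABLE };

/* True when the uncorrectable status would be read from the device. */
static bool sketch_reads_uncor_status(enum sketch_port_type type,
				      enum sketch_severity sev)
{
	assert(sev <= SKETCH_CORRECTABLE); /* reject values outside the model */

	if (sev == SKETCH_CORRECTABLE)
		return false; /* assumed: handled by the correctable branch */

	return type == SKETCH_TYPE_ROOT_PORT ||
	       type == SKETCH_TYPE_RC_EC ||
	       type == SKETCH_TYPE_DOWNSTREAM ||
	       type == SKETCH_TYPE_UPSTREAM || /* the case this patch adds */
	       sev == SKETCH_NONFATAL;
}
```

In this model a fatal error on an upstream port now has its status read,
while a fatal error on an endpoint is still skipped.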
 drivers/pci/pcie/aer.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 1d3e5b929661..d772f123c6a2 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1250,6 +1250,7 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
 	} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
 		   type == PCI_EXP_TYPE_RC_EC ||
 		   type == PCI_EXP_TYPE_DOWNSTREAM ||
+		   type == PCI_EXP_TYPE_UPSTREAM ||
 		   info->severity == AER_NONFATAL) {
 
 		/* Link is still healthy for IO reads */
-- 
2.34.1



* [PATCH v2 07/14] PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service driver
  2024-10-25 21:02 [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (5 preceding siblings ...)
  2024-10-25 21:02 ` [PATCH v2 06/14] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe port devices Terry Bowman
@ 2024-10-25 21:02 ` Terry Bowman
  2024-10-30 15:42   ` Jonathan Cameron
  2024-10-25 21:02 ` [PATCH v2 08/14] cxl/pci: Change find_cxl_ports() to non-static Terry Bowman
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: Terry Bowman @ 2024-10-25 21:02 UTC (permalink / raw)
  To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa

The existing recovery procedure for PCIe uncorrectable errors (UCE) does
not apply to CXL devices. Recovery cannot be used for CXL devices because
of the potential for corruption of what may be system memory. Also, the
current PCIe UCE recovery does not begin at the bridge but at the bridge's
first downstream device. This misses CXL protocol errors in a CXL root
port. A separate CXL recovery is needed because of the different handling
requirements.

Add a new function, cxl_do_recovery(), built from the following helpers.

Add cxl_walk_bridge() to iterate the detected error's sub-topology.
cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
will begin iteration at the bridge rather than beginning at the
bridge's first downstream child.

Add cxl_report_error_detected() as an analog to report_error_detected().
It will call pci_driver::cxl_err_handlers for each iterated downstream
child. The pci_driver::cxl_err_handlers UCE handler returns a boolean
indicating if there was a UCE error detected during handling.

cxl_do_recovery() uses the status from cxl_report_error_detected() to
determine how to proceed. Non-fatal CXL UCE errors will be treated as
fatal. If a UCE was present during handling then cxl_do_recovery()
will kernel panic.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
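Reader's note (not part of the commit): the bridge-first walk described
above can be modelled in userspace as below. This is a sketch, not the
kernel code. The sketch_* types are hypothetical, uce_pending stands in for
the value a driver's error_detected handler would return, and the early
exit assumes the bus walk stops once a callback returns nonzero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device node: children hang off the secondary bus. */
struct sketch_dev {
	const char *name;
	bool uce_pending;            /* what the driver's handler would report */
	struct sketch_dev *children; /* first device below this one */
	struct sketch_dev *sibling;  /* next device on the same bus */
	int visited;                 /* how often the callback ran here */
};

/* Analog of the per-device callback: accumulate the UCE status. */
static bool sketch_report(struct sketch_dev *dev, bool *status)
{
	assert(dev && status);
	dev->visited++;
	*status |= dev->uce_pending;
	return *status;
}

/* Depth-first walk of one bus, stopping once a UCE has been seen. */
static void sketch_walk_bus(struct sketch_dev *first, bool *status)
{
	for (struct sketch_dev *d = first; d; d = d->sibling) {
		if (sketch_report(d, status))
			return;
		sketch_walk_bus(d->children, status);
		if (*status)
			return;
	}
}

/* Start at the bridge itself, then descend only if it was clean. */
static bool sketch_walk_bridge(struct sketch_dev *bridge)
{
	bool status = false; /* must start cleared before accumulating */

	sketch_report(bridge, &status);
	if (!status)
		sketch_walk_bus(bridge->children, &status);

	return status;
}
```

Note that the accumulator is initialised before the walk; the callback only
ever ORs into it, so an uninitialised value would leak into the result.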
 drivers/pci/pci.h      |  3 +++
 drivers/pci/pcie/aer.c |  5 +++-
 drivers/pci/pcie/err.c | 54 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 14d00ce45bfa..5a67e41919d8 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -658,6 +658,9 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 		pci_channel_state_t state,
 		pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev));
 
+/* CXL error reporting and handling */
+void cxl_do_recovery(struct pci_dev *dev);
+
 bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
 int pcie_retrain_link(struct pci_dev *pdev, bool use_lt);
 
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d772f123c6a2..19432ab2cfb6 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1048,7 +1048,10 @@ static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 			pdrv->cxl_err_handler->cor_error_detected(dev);
 
 		pcie_clear_device_status(dev);
-	}
+	} else if (info->severity == AER_NONFATAL)
+		cxl_do_recovery(dev);
+	else if (info->severity == AER_FATAL)
+		cxl_do_recovery(dev);
 }
 
 static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 31090770fffc..3785f4ca5103 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -276,3 +276,57 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 
 	return status;
 }
+
+static void cxl_walk_bridge(struct pci_dev *bridge,
+			    int (*cb)(struct pci_dev *, void *),
+			    void *userdata)
+{
+	bool *status = userdata;
+
+	cb(bridge, status);
+	if (bridge->subordinate && !*status)
+		pci_walk_bus(bridge->subordinate, cb, status);
+}
+
+static int cxl_report_error_detected(struct pci_dev *dev, void *data)
+{
+	struct pci_driver *pdrv = dev->driver;
+	bool *status = data;
+
+	device_lock(&dev->dev);
+	if (pdrv && pdrv->cxl_err_handler &&
+	    pdrv->cxl_err_handler->error_detected) {
+		const struct cxl_error_handlers *cxl_err_handler =
+			pdrv->cxl_err_handler;
+		*status |= cxl_err_handler->error_detected(dev);
+	}
+	device_unlock(&dev->dev);
+	return *status;
+}
+
+void cxl_do_recovery(struct pci_dev *dev)
+{
+	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
+	int type = pci_pcie_type(dev);
+	struct pci_dev *bridge;
+	bool status = false;
+
+	if (type == PCI_EXP_TYPE_ROOT_PORT ||
+	    type == PCI_EXP_TYPE_DOWNSTREAM ||
+	    type == PCI_EXP_TYPE_UPSTREAM ||
+	    type == PCI_EXP_TYPE_ENDPOINT)
+		bridge = dev;
+	else
+		bridge = pci_upstream_bridge(dev);
+
+	cxl_walk_bridge(bridge, cxl_report_error_detected, &status);
+	if (status)
+		panic("CXL cachemem error. Invoking panic");
+
+	if (host->native_aer || pcie_ports_native) {
+		pcie_clear_device_status(dev);
+		pci_aer_clear_nonfatal_status(dev);
+	}
+
+	pci_info(bridge, "No uncorrectable error found. Continuing.\n");
+}
-- 
2.34.1



* [PATCH v2 08/14] cxl/pci: Change find_cxl_ports() to non-static
  2024-10-25 21:02 [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (6 preceding siblings ...)
  2024-10-25 21:02 ` [PATCH v2 07/14] PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service driver Terry Bowman
@ 2024-10-25 21:02 ` Terry Bowman
  2024-10-30 15:45   ` Jonathan Cameron
  2024-10-25 21:03 ` [PATCH v2 09/14] cxl/pci: Map CXL PCIe root port and downstream switch port RAS registers Terry Bowman
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: Terry Bowman @ 2024-10-25 21:02 UTC (permalink / raw)
  To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa

CXL PCIe port protocol error support will be added in the future. This
requires searching for a CXL PCIe port device in the CXL topology, which
find_cxl_port() provides. However, find_cxl_port() is defined static and
as a result is not callable outside of its source file.

Make find_cxl_port() non-static and declare it in core.h.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/core.h | 3 +++
 drivers/cxl/core/port.c | 4 ++--
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 0c62b4069ba0..d81e5ee25f58 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -110,4 +110,7 @@ bool cxl_need_node_perf_attrs_update(int nid);
 int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
 					struct access_coordinate *c);
 
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+			       struct cxl_dport **dport);
+
 #endif /* __CXL_CORE_H__ */
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index e666ec6a9085..2ac835cd4f1b 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1342,8 +1342,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
 	return NULL;
 }
 
-static struct cxl_port *find_cxl_port(struct device *dport_dev,
-				      struct cxl_dport **dport)
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+			       struct cxl_dport **dport)
 {
 	struct cxl_find_port_ctx ctx = {
 		.dport_dev = dport_dev,
-- 
2.34.1



* [PATCH v2 09/14] cxl/pci: Map CXL PCIe root port and downstream switch port RAS registers
  2024-10-25 21:02 [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (7 preceding siblings ...)
  2024-10-25 21:02 ` [PATCH v2 08/14] cxl/pci: Change find_cxl_ports() to non-static Terry Bowman
@ 2024-10-25 21:03 ` Terry Bowman
  2024-10-30 15:55   ` Jonathan Cameron
  2024-10-25 21:03 ` [PATCH v2 10/14] cxl/pci: Map CXL PCIe upstream " Terry Bowman
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: Terry Bowman @ 2024-10-25 21:03 UTC (permalink / raw)
  To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa

Map RAS registers for CXL PCIe root ports and downstream switch ports.

Refactor and rename cxl_setup_parent_dport() to be cxl_init_ep_ports_aer().
Update the function to iterate an endpoint's parent downstream switch
ports and parent root ports. It maps the RAS registers for each
CXL downstream switch port and CXL root port iterated.

Move the RAS register map logic from cxl_dport_map_ras() into
cxl_dport_init_ras_reporting() and remove cxl_dport_map_ras(). This
eliminates an unnecessary helper.

cxl_dport_init_ras_reporting() must check for previously mapped registers
within the topology, particularly with CXL switches. Endpoints under a
CXL switch may share parent ports or downstream ports, so ensure that the
ports' registers are mapped only once.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
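Reader's note (not part of the commit): the map-once rule described above
amounts to a guard on an already populated pointer. A minimal userspace
sketch follows, assuming a hypothetical sketch_dport with a ras pointer; the
static buffer merely stands in for a mapped register window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dport and its mapped RAS registers. */
struct sketch_dport {
	void *ras;     /* NULL until the registers have been mapped */
	int map_calls; /* counts how often a mapping was really made */
};

/* Placeholder for the register window a real mapping would return. */
static char sketch_ras_window[64];

/* Map the RAS registers unless an earlier endpoint already did so. */
static bool sketch_init_ras_reporting(struct sketch_dport *dport)
{
	assert(dport);

	if (dport->ras)
		return false; /* shared port: already mapped, nothing to do */

	dport->map_calls++;
	dport->ras = sketch_ras_window;
	return true;
}
```

Two endpoints below the same switch port would call this twice, and only
the first call would perform the mapping.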
 drivers/cxl/core/pci.c | 38 +++++++++++++++++---------------------
 drivers/cxl/cxl.h      |  6 ++----
 drivers/cxl/mem.c      | 26 ++++++++++++++++++++++++--
 3 files changed, 43 insertions(+), 27 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 5b46bc46aaa9..0bb61e39cf8f 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -749,18 +749,6 @@ static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
 	}
 }
 
-static void cxl_dport_map_ras(struct cxl_dport *dport)
-{
-	struct cxl_register_map *map = &dport->reg_map;
-	struct device *dev = dport->dport_dev;
-
-	if (!map->component_map.ras.valid)
-		dev_dbg(dev, "RAS registers not found\n");
-	else if (cxl_map_component_regs(map, &dport->regs.component,
-					BIT(CXL_CM_CAP_CAP_ID_RAS)))
-		dev_dbg(dev, "Failed to map RAS capability.\n");
-}
-
 static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
 {
 	void __iomem *aer_base = dport->regs.dport_aer;
@@ -790,20 +778,28 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
  * @dport: the cxl_dport that needs to be initialized
  * @host: host device for devm operations
  */
-void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
+void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
 {
-	dport->reg_map.host = host;
-	cxl_dport_map_ras(dport);
-
-	if (dport->rch) {
-		struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
-
-		if (!host_bridge->native_aer)
-			return;
+	struct device *dport_dev = dport->dport_dev;
+	struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
 
+	if (dport->rch && host_bridge->native_aer) {
 		cxl_dport_map_rch_aer(dport);
 		cxl_disable_rch_root_ints(dport);
 	}
+
+	/* dport may have more than 1 downstream EP. Check if already mapped. */
+	if (dport->regs.ras) {
+		dev_warn(dport_dev, "RAS is already mapped\n");
+		return;
+	}
+
+	dport->reg_map.host = dport_dev;
+	if (cxl_map_component_regs(&dport->reg_map, &dport->regs.component,
+				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
+		dev_err(dport_dev, "Failed to map RAS capability.\n");
+		return;
+	}
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
 
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 0d8b810a51f0..787688e81602 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -762,11 +762,9 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
 					 resource_size_t rcrb);
 
 #ifdef CONFIG_PCIEAER_CXL
-void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport);
-void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host);
+void cxl_dport_init_ras_reporting(struct cxl_dport *dport);
 #else
-static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport,
-						struct device *host) { }
+static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport) { }
 #endif
 
 struct cxl_decoder *to_cxl_decoder(struct device *dev);
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index a9fd5cd5a0d2..240d54b22a8c 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -45,6 +45,29 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
 	return 0;
 }
 
+static bool dev_is_cxl_pci(struct device *dev, u32 pcie_type)
+{
+	struct pci_dev *pdev;
+
+	if (!dev_is_pci(dev))
+		return false;
+
+	pdev = to_pci_dev(dev);
+	if (!pcie_is_cxl_port(pdev))
+		return false;
+
+	return (pci_pcie_type(pdev) == pcie_type);
+}
+
+static void cxl_init_ep_ports_aer(struct cxl_ep *ep)
+{
+	struct cxl_dport *dport = ep->dport;
+
+	if (dev_is_cxl_pci(dport->dport_dev, PCI_EXP_TYPE_DOWNSTREAM) ||
+	    dev_is_cxl_pci(dport->dport_dev, PCI_EXP_TYPE_ROOT_PORT))
+		cxl_dport_init_ras_reporting(dport);
+}
+
 static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
 				 struct cxl_dport *parent_dport)
 {
@@ -62,6 +85,7 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
 
 		ep = cxl_ep_load(iter, cxlmd);
 		ep->next = down;
+		cxl_init_ep_ports_aer(ep);
 	}
 
 	/* Note: endpoint port component registers are derived from @cxlds */
@@ -166,8 +190,6 @@ static int cxl_mem_probe(struct device *dev)
 	else
 		endpoint_parent = &parent_port->dev;
 
-	cxl_dport_init_ras_reporting(dport, dev);
-
 	scoped_guard(device, endpoint_parent) {
 		if (!endpoint_parent->driver) {
 			dev_err(dev, "CXL port topology %s not enabled\n",
-- 
2.34.1



* [PATCH v2 10/14] cxl/pci: Map CXL PCIe upstream switch port RAS registers
  2024-10-25 21:02 [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (8 preceding siblings ...)
  2024-10-25 21:03 ` [PATCH v2 09/14] cxl/pci: Map CXL PCIe root port and downstream switch port RAS registers Terry Bowman
@ 2024-10-25 21:03 ` Terry Bowman
  2024-10-30 15:56   ` Jonathan Cameron
  2024-10-25 21:03 ` [PATCH v2 11/14] cxl/pci: Rename RAS handler interfaces to also indicate CXL PCIe port support Terry Bowman
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: Terry Bowman @ 2024-10-25 21:03 UTC (permalink / raw)
  To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa

Add logic to map CXL PCIe upstream switch port (USP) RAS registers.

Introduce a 'struct cxl_component_regs' member, uport_regs, into 'struct
cxl_port' to store a pointer to the upstream port's mapped RAS registers.

The upstream port may have multiple downstream endpoints. Before
mapping the RAS registers, check whether they are already mapped.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/pci.c | 17 +++++++++++++++++
 drivers/cxl/cxl.h      |  4 ++++
 drivers/cxl/mem.c      |  3 +++
 3 files changed, 24 insertions(+)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 0bb61e39cf8f..53ca773557f3 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -773,6 +773,23 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
 	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
 }
 
+void cxl_uport_init_ras_reporting(struct cxl_port *port)
+{
+	/* uport may have more than 1 downstream EP. Check if already mapped. */
+	if (port->uport_regs.ras) {
+		dev_warn(&port->dev, "RAS is already mapped\n");
+		return;
+	}
+
+	port->reg_map.host = &port->dev;
+	if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
+				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
+		dev_err(&port->dev, "Failed to map RAS capability.\n");
+		return;
+	}
+}
+EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, CXL);
+
 /**
  * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
  * @dport: the cxl_dport that needs to be initialized
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 787688e81602..ded6a343c05e 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -592,6 +592,7 @@ struct cxl_dax_region {
  * @parent_dport: dport that points to this port in the parent
  * @decoder_ida: allocator for decoder ids
  * @reg_map: component and ras register mapping parameters
+ * @uport_regs: mapped component registers
  * @nr_dports: number of entries in @dports
  * @hdm_end: track last allocated HDM decoder instance for allocation ordering
  * @commit_end: cursor to track highest committed decoder for commit ordering
@@ -612,6 +613,7 @@ struct cxl_port {
 	struct cxl_dport *parent_dport;
 	struct ida decoder_ida;
 	struct cxl_register_map reg_map;
+	struct cxl_component_regs uport_regs;
 	int nr_dports;
 	int hdm_end;
 	int commit_end;
@@ -763,8 +765,10 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
 
 #ifdef CONFIG_PCIEAER_CXL
 void cxl_dport_init_ras_reporting(struct cxl_dport *dport);
+void cxl_uport_init_ras_reporting(struct cxl_port *port);
 #else
 static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport) { }
+static inline void cxl_uport_init_ras_reporting(struct cxl_port *port) { }
 #endif
 
 struct cxl_decoder *to_cxl_decoder(struct device *dev);
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 240d54b22a8c..067fd6389562 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -66,6 +66,9 @@ static void cxl_init_ep_ports_aer(struct cxl_ep *ep)
 	if (dev_is_cxl_pci(dport->dport_dev, PCI_EXP_TYPE_DOWNSTREAM) ||
 	    dev_is_cxl_pci(dport->dport_dev, PCI_EXP_TYPE_ROOT_PORT))
 		cxl_dport_init_ras_reporting(dport);
+
+	if (dev_is_cxl_pci(dport->port->uport_dev, PCI_EXP_TYPE_UPSTREAM))
+		cxl_uport_init_ras_reporting(dport->port);
 }
 
 static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
-- 
2.34.1



* [PATCH v2 11/14] cxl/pci: Rename RAS handler interfaces to also indicate CXL PCIe port support
  2024-10-25 21:02 [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (9 preceding siblings ...)
  2024-10-25 21:03 ` [PATCH v2 10/14] cxl/pci: Map CXL PCIe upstream " Terry Bowman
@ 2024-10-25 21:03 ` Terry Bowman
  2024-10-30 15:59   ` Jonathan Cameron
  2024-10-25 21:03 ` [PATCH v2 12/14] cxl/pci: Add error handler for CXL PCIe port RAS errors Terry Bowman
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: Terry Bowman @ 2024-10-25 21:03 UTC (permalink / raw)
  To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa

CXL PCIe port protocol error handling support will be added to the
CXL drivers in the future. In preparation, rename the existing
interfaces to support handling all CXL PCIe port protocol errors.

The driver's RAS support functions currently rely on a 'struct
cxl_dev_state' type parameter, which is not available for CXL port
devices. However, since the same CXL RAS capability structure [1] is
needed across most CXL components and devices, a common handling
approach should be adopted.

To accommodate this, update the __cxl_handle_cor_ras() and
__cxl_handle_ras() functions to use a 'struct device' instead of
'struct cxl_dev_state'.

No functional changes are introduced.

[1] CXL 3.1 Spec, 8.2.4 CXL.cache and CXL.mem Registers

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
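Reader's note (not part of the commit): passing a generic device pointer
and recovering the containing object is the usual container_of pattern.
The sketch below models it in userspace. All sketch_* names are
hypothetical, and the conversion is only valid when the device really is
embedded in the outer structure.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace analog of the kernel's container_of(). */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical generic device and a memdev that embeds one. */
struct sketch_device {
	const char *name;
};

struct sketch_memdev {
	int id;
	struct sketch_device dev;
};

/* Recover the memdev from its embedded device. */
static struct sketch_memdev *sketch_to_memdev(struct sketch_device *dev)
{
	assert(dev);
	return sketch_container_of(dev, struct sketch_memdev, dev);
}

/* Handler that takes the generic device, as the renamed helpers do. */
static int sketch_handle_cor_ras(struct sketch_device *dev, unsigned int status)
{
	if (!status)
		return -1; /* nothing to log */

	return sketch_to_memdev(dev)->id; /* id of the device to trace */
}
```

Callers that hold the outer object pass the address of its embedded device,
which mirrors the &cxlds->cxlmd->dev argument used in the diff.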
 drivers/cxl/core/pci.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 53ca773557f3..bb2fd7d04c4f 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -650,7 +650,7 @@ void read_cdat_data(struct cxl_port *port)
 }
 EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
 
-static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
+static void __cxl_handle_cor_ras(struct device *dev,
 				 void __iomem *ras_base)
 {
 	void __iomem *addr;
@@ -663,13 +663,13 @@ static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
 	status = readl(addr);
 	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
 		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
-		trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
+		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
 	}
 }
 
 static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
 {
-	return __cxl_handle_cor_ras(cxlds, cxlds->regs.ras);
+	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
 }
 
 /* CXL spec rev3.0 8.2.4.16.1 */
@@ -693,8 +693,7 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
  * Log the state of the RAS status registers and prepare them to log the
  * next error status. Return 1 if reset needed.
  */
-static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
-				  void __iomem *ras_base)
+static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
 {
 	u32 hl[CXL_HEADERLOG_SIZE_U32];
 	void __iomem *addr;
@@ -721,7 +720,7 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
 	}
 
 	header_log_copy(ras_base, hl);
-	trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe, hl);
+	trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
 	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
 
 	return true;
@@ -729,7 +728,7 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
 
 static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
 {
-	return __cxl_handle_ras(cxlds, cxlds->regs.ras);
+	return __cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
 }
 
 #ifdef CONFIG_PCIEAER_CXL
@@ -823,13 +822,13 @@ EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
 static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
 					  struct cxl_dport *dport)
 {
-	return __cxl_handle_cor_ras(cxlds, dport->regs.ras);
+	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, dport->regs.ras);
 }
 
 static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
 				       struct cxl_dport *dport)
 {
-	return __cxl_handle_ras(cxlds, dport->regs.ras);
+	return __cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
 }
 
 /*
-- 
2.34.1



* [PATCH v2 12/14] cxl/pci: Add error handler for CXL PCIe port RAS errors
  2024-10-25 21:02 [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (10 preceding siblings ...)
  2024-10-25 21:03 ` [PATCH v2 11/14] cxl/pci: Rename RAS handler interfaces to also indicate CXL PCIe port support Terry Bowman
@ 2024-10-25 21:03 ` Terry Bowman
  2024-10-30 16:03   ` Jonathan Cameron
  2024-10-25 21:03 ` [PATCH v2 13/14] cxl/pci: Add trace logging " Terry Bowman
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: Terry Bowman @ 2024-10-25 21:03 UTC (permalink / raw)
  To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa

Introduce correctable and uncorrectable CXL PCIe port handlers.

Use the PCIe port's device object to find the matching port or
downstream port in the CXL topology. The matching port or downstream
port will include the cached RAS register block.

Invoke the existing __cxl_handle_ras() with the RAS registers as a
parameter. __cxl_handle_ras() will log the RAS errors (if present)
and clear the RAS status.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/pci.c | 59 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)
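
The lookup described above (root ports and downstream switch ports resolve
through a dport, upstream switch ports resolve through the port's uport)
can be modelled in user space roughly as follows. All types and names here
are hypothetical simplifications for illustration; the real code walks the
CXL bus and holds device references, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum port_type { TYPE_ROOT_PORT, TYPE_UPSTREAM, TYPE_DOWNSTREAM, TYPE_ENDPOINT };

/* Hypothetical, simplified models of a PCIe device, a dport and a port. */
struct pdev { enum port_type type; };
struct dport { const struct pdev *dport_dev; uint32_t *ras; };
struct port {
	const struct pdev *uport_dev;
	uint32_t *uport_ras;
	struct dport *dports;
	size_t nr_dports;
};

/* Return the cached RAS block for pdev, or NULL if none is found. */
static uint32_t *port_ras_lookup(const struct port *ports, size_t nr_ports,
				 const struct pdev *pdev)
{
	size_t i, j;

	if (!pdev)
		return NULL;
	assert(ports || nr_ports == 0);

	for (i = 0; i < nr_ports; i++) {
		const struct port *port = &ports[i];

		switch (pdev->type) {
		case TYPE_ROOT_PORT:
		case TYPE_DOWNSTREAM:
			/* Downstream-facing devices are found via a dport. */
			for (j = 0; j < port->nr_dports; j++)
				if (port->dports[j].dport_dev == pdev)
					return port->dports[j].ras;
			break;
		case TYPE_UPSTREAM:
			/* Upstream switch ports are the port's uport. */
			if (port->uport_dev == pdev)
				return port->uport_ras;
			break;
		default:
			return NULL;
		}
	}
	return NULL;
}
```

Since the lookup can return NULL, callers in this model are expected to
tolerate a missing RAS block before touching any register.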

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index bb2fd7d04c4f..adb184d346ae 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -772,6 +772,65 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
 	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
 }
 
+static int match_uport(struct device *dev, const void *data)
+{
+	struct device *uport_dev = (struct device *)data;
+	struct cxl_port *port;
+
+	if (!is_cxl_port(dev))
+		return 0;
+
+	port = to_cxl_port(dev);
+
+	return port->uport_dev == uport_dev;
+}
+
+static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
+{
+	struct cxl_port *port __free(put_cxl_port) = NULL;
+	void __iomem *ras_base = NULL;
+
+	if (!pdev)
+		return NULL;
+
+	if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
+	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
+		struct cxl_dport *dport;
+
+		port = find_cxl_port(&pdev->dev, &dport);
+		ras_base = dport ? dport->regs.ras : NULL;
+	} else if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {
+		struct device *port_dev;
+
+		port_dev = bus_find_device(&cxl_bus_type, NULL, &pdev->dev,
+					   match_uport);
+		if (!port_dev)
+			return NULL;
+
+		port = to_cxl_port(port_dev);
+		ras_base = port ? port->uport_regs.ras : NULL;
+	}
+
+	return ras_base;
+}
+
+static void cxl_port_cor_error_detected(struct pci_dev *pdev)
+{
+	void __iomem *ras_base = cxl_pci_port_ras(pdev);
+
+	__cxl_handle_cor_ras(&pdev->dev, ras_base);
+}
+
+static bool cxl_port_error_detected(struct pci_dev *pdev)
+{
+	void __iomem *ras_base = cxl_pci_port_ras(pdev);
+	bool ue;
+
+	ue = __cxl_handle_ras(&pdev->dev, ras_base);
+
+	return ue;
+}
+
 void cxl_uport_init_ras_reporting(struct cxl_port *port)
 {
 	/* uport may have more than 1 downstream EP. Check if already mapped. */
-- 
2.34.1



* [PATCH v2 13/14] cxl/pci: Add trace logging for CXL PCIe port RAS errors
  2024-10-25 21:02 [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (11 preceding siblings ...)
  2024-10-25 21:03 ` [PATCH v2 12/14] cxl/pci: Add error handler for CXL PCIe port RAS errors Terry Bowman
@ 2024-10-25 21:03 ` Terry Bowman
  2024-10-30 16:07   ` Jonathan Cameron
  2024-10-25 21:03 ` [PATCH v2 14/14] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers Terry Bowman
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: Terry Bowman @ 2024-10-25 21:03 UTC (permalink / raw)
  To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa

The CXL drivers use kernel trace functions for logging endpoint and
RCH downstream port RAS errors. Similar functionality is
required for CXL root ports, CXL downstream switch ports, and CXL
upstream switch ports.

Introduce trace logging functions for both RAS correctable and
uncorrectable errors specific to CXL PCIe ports. Additionally, update
the PCIe port error handlers to invoke these new trace functions.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/pci.c   | 16 ++++++++++----
 drivers/cxl/core/trace.h | 47 ++++++++++++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+), 4 deletions(-)
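
The dispatch this patch adds (one event for memdevs, another for PCIe port
devices) can be sketched in user space as below. The 'enum dev_kind' tag,
the struct layout and the output strings are assumptions made for the
example; the kernel uses is_cxl_memdev()/dev_is_pci() and trace events
instead of snprintf().

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical device model with an explicit kind tag. */
enum dev_kind { KIND_MEMDEV, KIND_PCI, KIND_OTHER };

struct device {
	enum dev_kind kind;
	const char *name;
	const struct device *parent;
};

/*
 * Format a correctable-error record according to the device kind.
 * Returns the number of characters written, or 0 if nothing was logged.
 */
static int log_cor_error(const struct device *dev, uint32_t status,
			 char *buf, size_t len)
{
	assert(dev && buf && len);
	buf[0] = '\0';

	switch (dev->kind) {
	case KIND_MEMDEV:
		return snprintf(buf, len, "memdev=%s status=0x%x",
				dev->name, (unsigned int)status);
	case KIND_PCI:
		return snprintf(buf, len, "device=%s host=%s status=0x%x",
				dev->name,
				dev->parent ? dev->parent->name : "(none)",
				(unsigned int)status);
	default:
		/* Neither kind matched: nothing is logged. */
		return 0;
	}
}
```

In this model a device that matches neither kind produces no record, which
is worth keeping in mind when reading the if/else-if chains in the diff.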

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index adb184d346ae..eeb4a64ba5b5 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -661,10 +661,14 @@ static void __cxl_handle_cor_ras(struct device *dev,
 
 	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
 	status = readl(addr);
-	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
-		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
+	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
+		return;
+	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
+
+	if (is_cxl_memdev(dev))
 		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
-	}
+	else if (dev_is_pci(dev))
+		trace_cxl_port_aer_correctable_error(dev, status);
 }
 
 static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
@@ -720,7 +724,11 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
 	}
 
 	header_log_copy(ras_base, hl);
-	trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
+	if (is_cxl_memdev(dev))
+		trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
+	else if (dev_is_pci(dev))
+		trace_cxl_port_aer_uncorrectable_error(dev, status, fe, hl);
+
 	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
 
 	return true;
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index 8672b42ee4d1..1c4368a7b50b 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -48,6 +48,34 @@
 	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
 )
 
+TRACE_EVENT(cxl_port_aer_uncorrectable_error,
+	TP_PROTO(struct device *dev, u32 status, u32 fe, u32 *hl),
+	TP_ARGS(dev, status, fe, hl),
+	TP_STRUCT__entry(
+		__string(devname, dev_name(dev))
+		__string(host, dev_name(dev->parent))
+		__field(u32, status)
+		__field(u32, first_error)
+		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
+	),
+	TP_fast_assign(
+		__assign_str(devname);
+		__assign_str(host);
+		__entry->status = status;
+		__entry->first_error = fe;
+		/*
+		 * Embed the 512B headerlog data for user app retrieval and
+		 * parsing, but no need to print this in the trace buffer.
+		 */
+		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
+	),
+	TP_printk("device=%s host=%s status: '%s' first_error: '%s'",
+		  __get_str(devname), __get_str(host),
+		  show_uc_errs(__entry->status),
+		  show_uc_errs(__entry->first_error)
+	)
+);
+
 TRACE_EVENT(cxl_aer_uncorrectable_error,
 	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
 	TP_ARGS(cxlmd, status, fe, hl),
@@ -96,6 +124,25 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
 	{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" }	\
 )
 
+TRACE_EVENT(cxl_port_aer_correctable_error,
+	TP_PROTO(struct device *dev, u32 status),
+	TP_ARGS(dev, status),
+	TP_STRUCT__entry(
+		__string(devname, dev_name(dev))
+		__string(host, dev_name(dev->parent))
+		__field(u32, status)
+	),
+	TP_fast_assign(
+		__assign_str(devname);
+		__assign_str(host);
+		__entry->status = status;
+	),
+	TP_printk("device=%s host=%s status='%s'",
+		  __get_str(devname), __get_str(host),
+		  show_ce_errs(__entry->status)
+	)
+);
+
 TRACE_EVENT(cxl_aer_correctable_error,
 	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
 	TP_ARGS(cxlmd, status),
-- 
2.34.1



* [PATCH v2 14/14] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
  2024-10-25 21:02 [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (12 preceding siblings ...)
  2024-10-25 21:03 ` [PATCH v2 13/14] cxl/pci: Add trace logging " Terry Bowman
@ 2024-10-25 21:03 ` Terry Bowman
  2024-10-30 16:11   ` Jonathan Cameron
  2024-10-27 16:59 ` [PATCH v2 0/14] Applies to Base commit: 8cf0b93919e1 (tag: v6.12-rc2) Linux 6.12-rc2 Bowman, Terry
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 55+ messages in thread
From: Terry Bowman @ 2024-10-25 21:03 UTC (permalink / raw)
  To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
	Smita.KoralahalliChannabasappa

pci_driver::cxl_err_handlers are not currently assigned handler callbacks.
The handlers can't be set in the pci_driver static definition because the
CXL PCIe port devices are bound to the portdrv driver, which is not CXL
driver aware.

Add cxl_assign_port_error_handlers() in the cxl_core module. This
function will assign the default handlers for a CXL PCIe port device.

When the CXL port (cxl_port or cxl_dport) is destroyed the CXL PCIe port
device's pci_driver::cxl_err_handlers must be set to NULL to prevent future
use. Create cxl_clear_port_error_handlers() and register it to be called
when the CXL port device (cxl_port or cxl_dport) is destroyed.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/pci.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)
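
The assign-then-clear-on-teardown pattern described above can be sketched
in user space as follows. Everything here is a hypothetical model: 'struct
driver', 'struct pdev', the one-slot 'struct cleanup' and the handler
bodies are invented for illustration and only loosely mirror pci_driver
and devm_add_action_or_reset().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct pdev;

/* Hypothetical handler table, modelled on the one in this patch. */
struct err_handlers {
	bool (*error_detected)(struct pdev *pdev);
	void (*cor_error_detected)(struct pdev *pdev);
};

struct driver { const struct err_handlers *cxl_err_handler; };
struct pdev { struct driver *driver; int cor_count; };

static void port_cor_error_detected(struct pdev *pdev) { pdev->cor_count++; }
static bool port_error_detected(struct pdev *pdev) { (void)pdev; return true; }

static const struct err_handlers port_handlers = {
	.error_detected = port_error_detected,
	.cor_error_detected = port_cor_error_detected,
};

/* Install the default handlers on the driver bound to pdev, if any. */
static void assign_port_error_handlers(struct pdev *pdev)
{
	if (!pdev->driver)
		return;
	pdev->driver->cxl_err_handler = &port_handlers;
}

/* Teardown callback: drop the handlers so they cannot be called later. */
static void clear_port_error_handlers(void *data)
{
	struct pdev *pdev = data;

	if (!pdev->driver)
		return;
	pdev->driver->cxl_err_handler = NULL;
}

/* One-slot model of a devm-style cleanup action. */
struct cleanup { void (*fn)(void *data); void *data; };

static void run_cleanup(struct cleanup *c)
{
	assert(c);
	if (c->fn)
		c->fn(c->data);
	c->fn = NULL;
}
```

Note that in this model the handler pointer lives on the driver rather than
on the device, so every device bound to the same driver sees the same
pointer, and clearing it for one device clears it for all of them.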

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index eeb4a64ba5b5..5f7570c6173c 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -839,8 +839,36 @@ static bool cxl_port_error_detected(struct pci_dev *pdev)
 	return ue;
 }
 
+static const struct cxl_error_handlers cxl_port_error_handlers = {
+	.error_detected	= cxl_port_error_detected,
+	.cor_error_detected	= cxl_port_cor_error_detected,
+};
+
+static void cxl_assign_port_error_handlers(struct pci_dev *pdev)
+{
+	struct pci_driver *pdrv = pdev->driver;
+
+	if (!pdrv)
+		return;
+
+	pdrv->cxl_err_handler = &cxl_port_error_handlers;
+}
+
+static void cxl_clear_port_error_handlers(void *data)
+{
+	struct pci_dev *pdev = data;
+	struct pci_driver *pdrv = pdev->driver;
+
+	if (!pdrv)
+		return;
+
+	pdrv->cxl_err_handler = NULL;
+}
+
 void cxl_uport_init_ras_reporting(struct cxl_port *port)
 {
+	struct pci_dev *pdev = to_pci_dev(port->uport_dev);
+
 	/* uport may have more than 1 downstream EP. Check if already mapped. */
 	if (port->uport_regs.ras) {
 		dev_warn(&port->dev, "RAS is already mapped\n");
@@ -853,6 +881,9 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
 		dev_err(&port->dev, "Failed to map RAS capability.\n");
 		return;
 	}
+
+	cxl_assign_port_error_handlers(pdev);
+	devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, CXL);
 
@@ -865,6 +896,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
 {
 	struct device *dport_dev = dport->dport_dev;
 	struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
+	struct pci_dev *pdev = to_pci_dev(dport_dev);
 
 	if (dport->rch && host_bridge->native_aer) {
 		cxl_dport_map_rch_aer(dport);
@@ -883,6 +915,9 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
 		dev_err(dport_dev, "Failed to map RAS capability.\n");
 		return;
 	}
+
+	cxl_assign_port_error_handlers(pdev);
+	devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
 }
 EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
 
-- 
2.34.1



* Re: [PATCH v2 0/14] Applies to Base commit: 8cf0b93919e1 (tag: v6.12-rc2) Linux 6.12-rc2
  2024-10-25 21:02 [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (13 preceding siblings ...)
  2024-10-25 21:03 ` [PATCH v2 14/14] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers Terry Bowman
@ 2024-10-27 16:59 ` Bowman, Terry
  2024-10-28  1:05 ` [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Bowman, Terry
  2024-11-01 18:00 ` Fan Ni
  16 siblings, 0 replies; 55+ messages in thread
From: Bowman, Terry @ 2024-10-27 16:59 UTC (permalink / raw)
  To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa

Correction. This applies to:

8cf0b93919e1 (tag: v6.12-rc2) Linux 6.12-rc2

On 10/25/2024 4:02 PM, Terry Bowman wrote:
> This is a continuation of the CXL port error handling RFC from earlier.[1]
> The RFC resulted in the decision to add CXL PCIe port error handling to
> the existing RCH downstream port handling in the AER service driver. This
> patchset adds the CXL PCIe port protocol error handling and logging.
>
> The first 7 patches update the existing AER service driver to support CXL
> PCIe port protocol error handling and reporting. This includes AER service
> driver changes for adding correctable and uncorrectable error support, CXL
> specific recovery handling, and addition of CXL driver callback handlers.
>
> The following 7 patches address CXL driver support for CXL PCIe port
> protocol errors. This includes the following changes to the CXL drivers:
> mapping CXL port and downstream port RAS registers, interface updates for
> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
> adding port specific error handlers, and protocol error logging.
>
> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
>
> Testing:
>
> Below are test results for this patchset using Qemu with CXL root
> port(0c:00.0), CXL upstream switchport(0d:00.0), CXL downstream
> switchport(0e:00.0). A CXL endpoint(0f:00.0) CE and UCE logs are
> also added to show the existing PCIe endpoint handling is not changed.
>
> This was tested using aer-inject updated to support CE and UCE internal
> error injection. CXL RAS was set using a test patch (not upstreamed but can
> provide if needed).
>
>  - Root port UCE:
>  root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>  pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
>  pcieport 0000:0c:00.0:    [22] UncorrIntErr
>  aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>  cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
>  CPU: 1 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
>  Tainted: [E]=UNSIGNED_MODULE
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x27/0x90
>   dump_stack+0x10/0x20
>   panic+0x33e/0x380
>   cxl_do_recovery+0x116/0x120
>   ? srso_return_thunk+0x5/0x5f
>   aer_isr+0x3e0/0x710
>   irq_thread_fn+0x28/0x70
>   irq_thread+0x179/0x240
>   ? srso_return_thunk+0x5/0x5f
>   ? __pfx_irq_thread_fn+0x10/0x10
>   ? __pfx_irq_thread_dtor+0x10/0x10
>   ? __pfx_irq_thread+0x10/0x10
>   kthread+0xf5/0x130
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x3c/0x60
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
>  - Root port CE:
>  root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
>  pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
>  pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
>  pcieport 0000:0c:00.0:    [14] CorrIntErr
>  aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>  cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'
>
>  - Upstream switch port UCE:
>  root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
>  pcieport 0000:0d:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00400000/02000000
>  pcieport 0000:0d:00.0:    [22] UncorrIntErr
>  aer_event: 0000:0d:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>  cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
>  CPU: 1 UID: 0 PID: 148 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
>  Tainted: [E]=UNSIGNED_MODULE
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x27/0x90
>   dump_stack+0x10/0x20
>   panic+0x33e/0x380
>   cxl_do_recovery+0x116/0x120
>   ? srso_return_thunk+0x5/0x5f
>   aer_isr+0x3e0/0x710
>   ? free_cpumask_var+0x9/0x10
>   ? kfree+0x259/0x2e0
>   irq_thread_fn+0x28/0x70
>   irq_thread+0x179/0x240
>   ? srso_return_thunk+0x5/0x5f
>   ? __pfx_irq_thread_fn+0x10/0x10
>   ? __pfx_irq_thread_dtor+0x10/0x10
>   ? __pfx_irq_thread+0x10/0x10
>   kthread+0xf5/0x130
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x3c/0x60
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  Kernel Offset: 0x24c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
>  - Upstream switch port CE:
>  root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
>  pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
>  pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00004000/0000a000
>  pcieport 0000:0d:00.0:    [14] CorrIntErr
>  aer_event: 0000:0d:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>  cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'
>
>  - Downstream switch port UCE:
>  root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
>  pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00400000/02000000
>  pcieport 0000:0e:00.0:    [22] UncorrIntErr
>  aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>  cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
>  CPU: 1 UID: 0 PID: 147 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
>  Tainted: [E]=UNSIGNED_MODULE
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x27/0x90
>   dump_stack+0x10/0x20
>   panic+0x33e/0x380
>   cxl_do_recovery+0x116/0x120
>   ? srso_return_thunk+0x5/0x5f
>   aer_isr+0x3e0/0x710
>   irq_thread_fn+0x28/0x70
>   irq_thread+0x179/0x240
>   ? srso_return_thunk+0x5/0x5f
>   ? __pfx_irq_thread_fn+0x10/0x10
>   ? __pfx_irq_thread_dtor+0x10/0x10
>   ? __pfx_irq_thread+0x10/0x10
>   kthread+0xf5/0x130
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x3c/0x60
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  Kernel Offset: 0x19c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
>  - Downstream switch port CE:
>  root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
>  pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
>  pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
>  pcieport 0000:0e:00.0:    [14] CorrIntErr
>  aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>  cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
>
>  - Endpoint CE
>  root@tbowman-cxl:~/aer-inject# ./ep-ce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000040/00000000 into device 0000:0f:00.0
>  pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0f:00.0
>  cxl_pci 0000:0f:00.0: CXL Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
>  cxl_pci 0000:0f:00.0:   device [8086:0d93] error status/mask=00000040/0000e000
>  cxl_pci 0000:0f:00.0:    [ 6] BadTLP
>  aer_event: 0000:0f:00.0 CXL Bus Error: severity=Corrected, Bad TLP, TLP Header=Not available
>  cxl_aer_correctable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Received Error From Physical Layer'
>
>  - Endpoint UCE
>  root@tbowman-cxl:~/aer-inject# ./ep-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00040000 into device 0000:0f:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0f:00.0
>  cxl_pci 0000:0f:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
>  aer_event: 0000:0f:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
>  cxl_aer_uncorrectable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Memory Byte Enable Parity Error' firs'
>  cxl_pci 0000:0f:00.0: mem1: frozen state error detected, disable CXL.mem
>  cxl_detach_ep: cxl_mem mem1: disconnect mem1 from port2
>  cxl_detach_ep: cxl_mem mem1: disconnect mem1 from port1
>  pcieport 0000:0e:00.0: unlocked secondary bus reset via: pciehp_reset_slot+0xac/0x160
>  pcieport 0000:0e:00.0: AER: Downstream Port link has been reset (0)
>  cxl_pci 0000:0f:00.0: mem1: restart CXL.mem after slot reset
>  devm_cxl_enumerate_ports: cxl_mem mem1: scan: iter: mem1 dport_dev: 0000:0e:00.0 parent: 0000:0d:00.0
>  devm_cxl_enumerate_ports: cxl_mem mem1: found already registered port port2:0000:0d:00.0
>  devm_cxl_enumerate_ports: cxl_mem mem1: scan: iter: 0000:0e:00.0 dport_dev: 0000:0c:00.0 parent: pci0000:0c
>  devm_cxl_enumerate_ports: cxl_mem mem1: found already registered port port1:pci0000:0c
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4500
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4500
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  <snip>
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  cxl_bus_probe: cxl_nvdimm pmem1: probe: 0
>  devm_cxl_add_nvdimm: cxl_mem mem1: register pmem1
>  pcieport 0000:0e:00.0: RAS is already mapped
>  cxl_port port2: RAS is already mapped
>  pcieport 0000:0c:00.0: RAS is already mapped
>  cxl_port_alloc: cxl_mem mem1: host-bridge: pci0000:0c
>  cxl_cdat_get_length: cxl_port endpoint4: CDAT length 160
>  cxl_port_perf_data_calculate: cxl_port endpoint4: Failed to retrieve ep perf coordinates.
>  cxl_endpoint_parse_cdat: cxl_port endpoint4: Failed to do perf coord calculations.
>  init_hdm_decoder: cxl_port endpoint4: decoder4.0: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
>  add_hdm_decoder: cxl decoder4.0: Added to port endpoint4
>  init_hdm_decoder: cxl_port endpoint4: decoder4.1: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
>  add_hdm_decoder: cxl decoder4.1: Added to port endpoint4
>  init_hdm_decoder: cxl_port endpoint4: decoder4.2: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
>  add_hdm_decoder: cxl decoder4.2: Added to port endpoint4
>  init_hdm_decoder: cxl_port endpoint4: decoder4.3: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
>  add_hdm_decoder: cxl decoder4.3: Added to port endpoint4
>  cxl_bus_probe: cxl_port endpoint4: probe: 0
>  devm_cxl_add_port: cxl_mem mem1: endpoint4 added to port2
>  cxl_bus_probe: cxl_mem mem1: probe: 0
>  cxl_pci 0000:0f:00.0: mem1: error resume successful
>  pcieport 0000:0e:00.0: AER: device recovery successful
>
>  Changes in v1 -> v2
>  [Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
>  [Jonathan] Update description to DSP map patch description
>  [Jonathan] Update cxl_pci_port_ras() to check for NULL port
>  [Jonathan] Don't call handler before handler port changes are present (patch order).
>  [Bjorn] Fix linebreak in cover sheet URL
>  [Bjorn] Remove timestamps from test logs in cover sheet
>  [Bjorn] Retitle AER commits to use "PCI/AER:"
>  [Bjorn] Retitle patch#3 to use renaming instead of refactoring
>  [Bjorn] Fix base commit-id on cover sheet
>  [Bjorn] Add VH spec reference/citation
>  [Terry] Removed last 2 patches to enable internal errors. Is not needed
>  because internal errors are enabled in AER driver.
>  [Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
>  [Dan] Use kernel panic in CXL recovery
>  [Dan] cxl_port_hndlrs -> cxl_port_error_handlers
>  [Dan] Move cxl_port_error_handlers to pci_driver. Remove module (un)registration.
>  [Terry] Add patch w/ cxl_assign_port_error_handlers() and cxl_clear_port_error_handlers()
>  [Terry] Removed PCI_ERS_RESULT_PANIC patch. Is no longer needed because the result type parameter
>  is not used in the cxl_err_handlers callbacks.
>
> Changes in RFC -> v1:
>  [Dan] Rename cxl_rch_handle_error() to cxl_handle_error()
>  [Dan] Add cxl_do_recovery()
>  [Jonathan] Flatten cxl_setup_parent_uport()
>  [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
>  [Jonathan] Rename cxl_dev_is_pci_type()
>  [Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
>  replace these find_cxl_port() and device_find_child().
>  [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
>  [Ming] Don't use endpoint as host to cxl_map_component_regs()
>  [Bjorn] Use "PCIe UIR/CIE" instead of "AER UI/CIE"
>  [Bjorn] Don't use Kconfig to enable/disable a CXL external interface
>
> Terry Bowman (14):
>   PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct
>     pci_driver'
>   PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port
>     support
>   cxl/pci: Introduce helper functions pcie_is_cxl() and
>     pcie_is_cxl_port()
>   PCI/AER: Modify AER driver logging to report CXL or PCIe bus error
>     type
>   PCI/AER: Add CXL PCIe port correctable error support in AER service
>     driver
>   PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe
>     port devices
>   PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service
>     driver
>   cxl/pci: Change find_cxl_ports() to non-static
>   cxl/pci: Map CXL PCIe root port and downstream switch port RAS
>     registers
>   cxl/pci: Map CXL PCIe upstream switch port RAS registers
>   cxl/pci: Rename RAS handler interfaces to also indicate CXL PCIe port
>     support
>   cxl/pci: Add error handler for CXL PCIe port RAS errors
>   cxl/pci: Add trace logging for CXL PCIe port RAS errors
>   cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
>
>  drivers/cxl/core/core.h       |   3 +
>  drivers/cxl/core/pci.c        | 180 +++++++++++++++++++++++++++-------
>  drivers/cxl/core/port.c       |   4 +-
>  drivers/cxl/core/trace.h      |  47 +++++++++
>  drivers/cxl/cxl.h             |  10 +-
>  drivers/cxl/mem.c             |  29 +++++-
>  drivers/pci/pci.c             |  14 +++
>  drivers/pci/pci.h             |   3 +
>  drivers/pci/pcie/aer.c        |  99 ++++++++++++-------
>  drivers/pci/pcie/err.c        |  54 ++++++++++
>  drivers/pci/probe.c           |  10 ++
>  include/linux/pci.h           |  13 +++
>  include/ras/ras_event.h       |   9 +-
>  include/uapi/linux/pci_regs.h |   3 +-
>  14 files changed, 396 insertions(+), 82 deletions(-)
>
>
> base-commit: 739a5da7ed744578a9477fb322f04afecafca6b0


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging
  2024-10-25 21:02 [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (14 preceding siblings ...)
  2024-10-27 16:59 ` [PATCH v2 0/14] Applies to Base commit: 8cf0b93919e1 (tag: v6.12-rc2) Linux 6.12-rc2 Bowman, Terry
@ 2024-10-28  1:05 ` Bowman, Terry
  2024-11-01 18:00 ` Fan Ni
  16 siblings, 0 replies; 55+ messages in thread
From: Bowman, Terry @ 2024-10-28  1:05 UTC (permalink / raw)
  To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa

Correction. This applies to the following base commit:

8cf0b93919e1 (tag: v6.12-rc2) Linux 6.12-rc2


On 10/25/2024 4:02 PM, Terry Bowman wrote:
> This is a continuation of the CXL port error handling RFC from earlier.[1]
> The RFC resulted in the decision to add CXL PCIe port error handling to
> the existing RCH downstream port handling in the AER service driver. This
> patchset adds the CXL PCIe port protocol error handling and logging.
>
> The first 7 patches update the existing AER service driver to support CXL
> PCIe port protocol error handling and reporting. This includes AER service
> driver changes for adding correctable and uncorrectable error support, CXL
> specific recovery handling, and addition of CXL driver callback handlers.
>
> The following 7 patches address CXL driver support for CXL PCIe port
> protocol errors. This includes the following changes to the CXL drivers:
> mapping CXL port and downstream port RAS registers, interface updates for
> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
> adding port specific error handlers, and protocol error logging.
>
> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
>
> Testing:
>
> Below are test results for this patchset using Qemu with CXL root
> port(0c:00.0), CXL upstream switchport(0d:00.0), CXL downstream
> switchport(0e:00.0). A CXL endpoint(0f:00.0) CE and UCE logs are
> also added to show the existing PCIe endpoint handling is not changed.
>
> This was tested using aer-inject updated to support CE and UCE internal
> error injection. CXL RAS was set using a test patch (not upstreamed but can
> provide if needed).
>
>  - Root port UCE:
>  root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>  pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
>  pcieport 0000:0c:00.0:    [22] UncorrIntErr
>  aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>  cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
>  CPU: 1 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
>  Tainted: [E]=UNSIGNED_MODULE
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x27/0x90
>   dump_stack+0x10/0x20
>   panic+0x33e/0x380
>   cxl_do_recovery+0x116/0x120
>   ? srso_return_thunk+0x5/0x5f
>   aer_isr+0x3e0/0x710
>   irq_thread_fn+0x28/0x70
>   irq_thread+0x179/0x240
>   ? srso_return_thunk+0x5/0x5f
>   ? __pfx_irq_thread_fn+0x10/0x10
>   ? __pfx_irq_thread_dtor+0x10/0x10
>   ? __pfx_irq_thread+0x10/0x10
>   kthread+0xf5/0x130
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x3c/0x60
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
>  - Root port CE:
>  root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
>  pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
>  pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
>  pcieport 0000:0c:00.0:    [14] CorrIntErr
>  aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>  cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'
>
>  - Upstream switch port UCE:
>  root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
>  pcieport 0000:0d:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00400000/02000000
>  pcieport 0000:0d:00.0:    [22] UncorrIntErr
>  aer_event: 0000:0d:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>  cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
>  CPU: 1 UID: 0 PID: 148 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
>  Tainted: [E]=UNSIGNED_MODULE
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x27/0x90
>   dump_stack+0x10/0x20
>   panic+0x33e/0x380
>   cxl_do_recovery+0x116/0x120
>   ? srso_return_thunk+0x5/0x5f
>   aer_isr+0x3e0/0x710
>   ? free_cpumask_var+0x9/0x10
>   ? kfree+0x259/0x2e0
>   irq_thread_fn+0x28/0x70
>   irq_thread+0x179/0x240
>   ? srso_return_thunk+0x5/0x5f
>   ? __pfx_irq_thread_fn+0x10/0x10
>   ? __pfx_irq_thread_dtor+0x10/0x10
>   ? __pfx_irq_thread+0x10/0x10
>   kthread+0xf5/0x130
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x3c/0x60
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  Kernel Offset: 0x24c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
>  - Upstream switch port CE:
>  root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
>  pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
>  pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00004000/0000a000
>  pcieport 0000:0d:00.0:    [14] CorrIntErr
>  aer_event: 0000:0d:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>  cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'
>
>  - Downstream switch port UCE:
>  root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
>  pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00400000/02000000
>  pcieport 0000:0e:00.0:    [22] UncorrIntErr
>  aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>  cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
>  CPU: 1 UID: 0 PID: 147 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
>  Tainted: [E]=UNSIGNED_MODULE
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x27/0x90
>   dump_stack+0x10/0x20
>   panic+0x33e/0x380
>   cxl_do_recovery+0x116/0x120
>   ? srso_return_thunk+0x5/0x5f
>   aer_isr+0x3e0/0x710
>   irq_thread_fn+0x28/0x70
>   irq_thread+0x179/0x240
>   ? srso_return_thunk+0x5/0x5f
>   ? __pfx_irq_thread_fn+0x10/0x10
>   ? __pfx_irq_thread_dtor+0x10/0x10
>   ? __pfx_irq_thread+0x10/0x10
>   kthread+0xf5/0x130
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x3c/0x60
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  Kernel Offset: 0x19c00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
>
>  - Downstream switch port CE:
>  root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
>  pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
>  pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
>  pcieport 0000:0e:00.0:    [14] CorrIntErr
>  aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>  cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
>
>  - Endpoint CE
>  root@tbowman-cxl:~/aer-inject# ./ep-ce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000040/00000000 into device 0000:0f:00.0
>  pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0f:00.0
>  cxl_pci 0000:0f:00.0: CXL Bus Error: severity=Correctable, type=Data Link Layer, (Receiver ID)
>  cxl_pci 0000:0f:00.0:   device [8086:0d93] error status/mask=00000040/0000e000
>  cxl_pci 0000:0f:00.0:    [ 6] BadTLP
>  aer_event: 0000:0f:00.0 CXL Bus Error: severity=Corrected, Bad TLP, TLP Header=Not available
>  cxl_aer_correctable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Received Error From Physical Layer'
>
>  - Endpoint UCE
>  root@tbowman-cxl:~/aer-inject# ./ep-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00040000 into device 0000:0f:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0f:00.0
>  cxl_pci 0000:0f:00.0: AER: CXL Bus Error: severity=Uncorrectable (Fatal), type=Inaccessible, (Unregistered Agent ID)
>  aer_event: 0000:0f:00.0 CXL Bus Error: severity=Fatal, , TLP Header=Not available
>  cxl_aer_uncorrectable_error: memdev=mem1 host=0000:0f:00.0 serial=0: status: 'Memory Byte Enable Parity Error' firs'
>  cxl_pci 0000:0f:00.0: mem1: frozen state error detected, disable CXL.mem
>  cxl_detach_ep: cxl_mem mem1: disconnect mem1 from port2
>  cxl_detach_ep: cxl_mem mem1: disconnect mem1 from port1
>  pcieport 0000:0e:00.0: unlocked secondary bus reset via: pciehp_reset_slot+0xac/0x160
>  pcieport 0000:0e:00.0: AER: Downstream Port link has been reset (0)
>  cxl_pci 0000:0f:00.0: mem1: restart CXL.mem after slot reset
>  devm_cxl_enumerate_ports: cxl_mem mem1: scan: iter: mem1 dport_dev: 0000:0e:00.0 parent: 0000:0d:00.0
>  devm_cxl_enumerate_ports: cxl_mem mem1: found already registered port port2:0000:0d:00.0
>  devm_cxl_enumerate_ports: cxl_mem mem1: scan: iter: 0000:0e:00.0 dport_dev: 0000:0c:00.0 parent: pci0000:0c
>  devm_cxl_enumerate_ports: cxl_mem mem1: found already registered port port1:pci0000:0c
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4500
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4500
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  <snip>
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  __cxl_pci_mbox_send_cmd: cxl_pci 0000:0f:00.0: Sending command: 0x4102
>  cxl_pci_mbox_wait_for_doorbell: cxl_pci 0000:0f:00.0: Doorbell wait took 0ms
>  cxl_bus_probe: cxl_nvdimm pmem1: probe: 0
>  devm_cxl_add_nvdimm: cxl_mem mem1: register pmem1
>  pcieport 0000:0e:00.0: RAS is already mapped
>  cxl_port port2: RAS is already mapped
>  pcieport 0000:0c:00.0: RAS is already mapped
>  cxl_port_alloc: cxl_mem mem1: host-bridge: pci0000:0c
>  cxl_cdat_get_length: cxl_port endpoint4: CDAT length 160
>  cxl_port_perf_data_calculate: cxl_port endpoint4: Failed to retrieve ep perf coordinates.
>  cxl_endpoint_parse_cdat: cxl_port endpoint4: Failed to do perf coord calculations.
>  init_hdm_decoder: cxl_port endpoint4: decoder4.0: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
>  add_hdm_decoder: cxl decoder4.0: Added to port endpoint4
>  init_hdm_decoder: cxl_port endpoint4: decoder4.1: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
>  add_hdm_decoder: cxl decoder4.1: Added to port endpoint4
>  init_hdm_decoder: cxl_port endpoint4: decoder4.2: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
>  add_hdm_decoder: cxl decoder4.2: Added to port endpoint4
>  init_hdm_decoder: cxl_port endpoint4: decoder4.3: range: 0x0-0xffffffffffffffff iw: 1 ig: 256
>  add_hdm_decoder: cxl decoder4.3: Added to port endpoint4
>  cxl_bus_probe: cxl_port endpoint4: probe: 0
>  devm_cxl_add_port: cxl_mem mem1: endpoint4 added to port2
>  cxl_bus_probe: cxl_mem mem1: probe: 0
>  cxl_pci 0000:0f:00.0: mem1: error resume successful
>  pcieport 0000:0e:00.0: AER: device recovery successful
>
>  Changes in v1 -> v2
>  [Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
>  [Jonathan] Update description to DSP map patch description
>  [Jonathan] Update cxl_pci_port_ras() to check for NULL port
>  [Jonathan] Don't call handler before handler port changes are present (patch order).
>  [Bjorn] Fix linebreak in cover sheet URL
>  [Bjorn] Remove timestamps from test logs in cover sheet
>  [Bjorn] Retitle AER commits to use "PCI/AER:"
>  [Bjorn] Retitle patch#3 to use renaming instead of refactoring
>  [Bjorn] Fix base commit-id on cover sheet
>  [Bjorn] Add VH spec reference/citation
>  [Terry] Removed the last 2 patches that enabled internal errors. They are not
>  needed because internal errors are enabled in the AER driver.
>  [Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
>  [Dan] Use kernel panic in CXL recovery
>  [Dan] cxl_port_hndlrs -> cxl_port_error_handlers
>  [Dan] Move cxl_port_error_handlers to pci_driver. Remove module (un)registration.
>  [Terry] Add patch w/ qcxl_assign_port_error_handlers() and cxl_clear_port_error_handlers()
>  [Terry] Removed PCI_ERS_RESULT_PANIC patch. It is no longer needed because the result type parameter
>  is not used in the cxl_err_handlers callbacks.
>
> Changes in RFC -> v1:
>  [Dan] Rename cxl_rch_handle_error() to cxl_handle_error()
>  [Dan] Add cxl_do_recovery()
>  [Jonathan] Flatten cxl_setup_parent_uport()
>  [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
>  [Jonathan] Rename cxl_dev_is_pci_type()
>  [Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
>  replace these find_cxl_port() and device_find_child().
>  [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
>  [Ming] Don't use endpoint as host to cxl_map_component_regs()
>  [Bjorn] Use "PCIe UIR/CIE" instead of "AER UI/CIE"
>  [Bjorn] Don't use Kconfig to enable/disable a CXL external interface
>
> Terry Bowman (14):
>   PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct
>     pci_driver'
>   PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port
>     support
>   cxl/pci: Introduce helper functions pcie_is_cxl() and
>     pcie_is_cxl_port()
>   PCI/AER: Modify AER driver logging to report CXL or PCIe bus error
>     type
>   PCI/AER: Add CXL PCIe port correctable error support in AER service
>     driver
>   PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe
>     port devices
>   PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service
>     driver
>   cxl/pci: Change find_cxl_ports() to non-static
>   cxl/pci: Map CXL PCIe root port and downstream switch port RAS
>     registers
>   cxl/pci: Map CXL PCIe upstream switch port RAS registers
>   cxl/pci: Rename RAS handler interfaces to also indicate CXL PCIe port
>     support
>   cxl/pci: Add error handler for CXL PCIe port RAS errors
>   cxl/pci: Add trace logging for CXL PCIe port RAS errors
>   cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
>
>  drivers/cxl/core/core.h       |   3 +
>  drivers/cxl/core/pci.c        | 180 +++++++++++++++++++++++++++-------
>  drivers/cxl/core/port.c       |   4 +-
>  drivers/cxl/core/trace.h      |  47 +++++++++
>  drivers/cxl/cxl.h             |  10 +-
>  drivers/cxl/mem.c             |  29 +++++-
>  drivers/pci/pci.c             |  14 +++
>  drivers/pci/pci.h             |   3 +
>  drivers/pci/pcie/aer.c        |  99 ++++++++++++-------
>  drivers/pci/pcie/err.c        |  54 ++++++++++
>  drivers/pci/probe.c           |  10 ++
>  include/linux/pci.h           |  13 +++
>  include/ras/ras_event.h       |   9 +-
>  include/uapi/linux/pci_regs.h |   3 +-
>  14 files changed, 396 insertions(+), 82 deletions(-)
>
>
> base-commit: 739a5da7ed744578a9477fb322f04afecafca6b0


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 04/14] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type
  2024-10-25 21:02 ` [PATCH v2 04/14] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
@ 2024-10-30 14:56   ` Jonathan Cameron
  2024-10-31 16:27   ` Dave Jiang
  2024-10-31 21:27   ` Fan Ni
  2 siblings, 0 replies; 55+ messages in thread
From: Jonathan Cameron @ 2024-10-30 14:56 UTC (permalink / raw)
  To: Terry Bowman
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, shiju.jose,
	M.Chehab

On Fri, 25 Oct 2024 16:02:55 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER driver and aer_event tracing currently log 'PCIe Bus Type'
> for all errors.
> 
> Update the driver and aer_event tracing to log 'CXL Bus Type' for CXL devices.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Looks fine to me.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

+CC Mauro and Shiju.  I assume this will be easy to add to rasdaemon's
usage of this tracepoint?

> ---
>  drivers/pci/pcie/aer.c  | 14 ++++++++------
>  include/ras/ras_event.h |  9 ++++++---
>  2 files changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index fe6edf26279e..53e9a11f6c0f 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -699,13 +699,14 @@ static void __aer_print_error(struct pci_dev *dev,
>  
>  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  {
> +	const char *bus_type = pcie_is_cxl(dev) ? "CXL"  : "PCIe";
>  	int layer, agent;
>  	int id = pci_dev_id(dev);
>  	const char *level;
>  
>  	if (!info->status) {
> -		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> -			aer_error_severity_string[info->severity]);
> +		pci_err(dev, "%s Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> +			bus_type, aer_error_severity_string[info->severity]);
>  		goto out;
>  	}
>  
> @@ -714,8 +715,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  
>  	level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
>  
> -	pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> -		   aer_error_severity_string[info->severity],
> +	pci_printk(level, dev, "%s Bus Error: severity=%s, type=%s, (%s)\n",
> +		   bus_type, aer_error_severity_string[info->severity],
>  		   aer_error_layer[layer], aer_agent_string[agent]);
>  
>  	pci_printk(level, dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
> @@ -730,7 +731,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  	if (info->id && info->error_dev_num > 1 && info->id == id)
>  		pci_err(dev, "  Error of this Agent is reported first\n");
>  
> -	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
> +	trace_aer_event(dev_name(&dev->dev), bus_type, (info->status & ~info->mask),
>  			info->severity, info->tlp_header_valid, &info->tlp);
>  }
>  
> @@ -764,6 +765,7 @@ EXPORT_SYMBOL_GPL(cper_severity_to_aer);
>  void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  		   struct aer_capability_regs *aer)
>  {
> +	const char *bus_type = pcie_is_cxl(dev) ? "CXL"  : "PCIe";
>  	int layer, agent, tlp_header_valid = 0;
>  	u32 status, mask;
>  	struct aer_err_info info;
> @@ -798,7 +800,7 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  	if (tlp_header_valid)
>  		__print_tlp_header(dev, &aer->header_log);
>  
> -	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
> +	trace_aer_event(dev_name(&dev->dev), bus_type, (status & ~mask),
>  			aer_severity, tlp_header_valid, &aer->header_log);
>  }
>  EXPORT_SYMBOL_NS_GPL(pci_print_aer, CXL);
> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
> index e5f7ee0864e7..1bf8e7050ba8 100644
> --- a/include/ras/ras_event.h
> +++ b/include/ras/ras_event.h
> @@ -297,15 +297,17 @@ TRACE_EVENT(non_standard_event,
>  
>  TRACE_EVENT(aer_event,
>  	TP_PROTO(const char *dev_name,
> +		 const char *bus_type,
>  		 const u32 status,
>  		 const u8 severity,
>  		 const u8 tlp_header_valid,
>  		 struct pcie_tlp_log *tlp),
>  
> -	TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp),
> +	TP_ARGS(dev_name, bus_type, status, severity, tlp_header_valid, tlp),
>  
>  	TP_STRUCT__entry(
>  		__string(	dev_name,	dev_name	)
> +		__string(	bus_type,	bus_type	)
>  		__field(	u32,		status		)
>  		__field(	u8,		severity	)
>  		__field(	u8, 		tlp_header_valid)
> @@ -314,6 +316,7 @@ TRACE_EVENT(aer_event,
>  
>  	TP_fast_assign(
>  		__assign_str(dev_name);
> +		__assign_str(bus_type);
>  		__entry->status		= status;
>  		__entry->severity	= severity;
>  		__entry->tlp_header_valid = tlp_header_valid;
> @@ -325,8 +328,8 @@ TRACE_EVENT(aer_event,
>  		}
>  	),
>  
> -	TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s\n",
> -		__get_str(dev_name),
> +	TP_printk("%s %s Bus Error: severity=%s, %s, TLP Header=%s\n",
> +		__get_str(dev_name), __get_str(bus_type),
>  		__entry->severity == AER_CORRECTABLE ? "Corrected" :
>  			__entry->severity == AER_FATAL ?
>  			"Fatal" : "Uncorrected, non-fatal",
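
For readers skimming the diff, the string selection it introduces can be
sketched in userspace as below. This is only an illustration, not the kernel
code: struct demo_dev, demo_bus_type() and demo_format_error() are
hypothetical stand-ins, and the message format is shortened compared with the
real pci_printk() output.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for struct pci_dev; only what the sketch needs. */
struct demo_dev {
	const char *name;
	bool is_cxl;	/* models the cached CXL flag that pcie_is_cxl() returns */
};

/* Same ternary as in the patch: "CXL" for CXL devices, "PCIe" otherwise. */
static const char *demo_bus_type(const struct demo_dev *dev)
{
	assert(dev);
	return dev->is_cxl ? "CXL" : "PCIe";
}

/* Build a shortened version of the "<bus> Bus Error" line into buf. */
static int demo_format_error(char *buf, size_t len,
			     const struct demo_dev *dev, const char *severity)
{
	return snprintf(buf, len, "%s: %s Bus Error: severity=%s",
			dev->name, demo_bus_type(dev), severity);
}
```

Calling demo_format_error() for a device with is_cxl set yields a line that
starts with the device name followed by "CXL Bus Error", while a device
without the flag keeps the existing "PCIe Bus Error" wording.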


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 03/14] cxl/pci: Introduce helper functions pcie_is_cxl() and pcie_is_cxl_port()
  2024-10-25 21:02 ` [PATCH v2 03/14] cxl/pci: Introduce helper functions pcie_is_cxl() and pcie_is_cxl_port() Terry Bowman
@ 2024-10-30 14:57   ` Jonathan Cameron
  2024-10-31 16:25   ` Dave Jiang
  2024-10-31 21:22   ` Fan Ni
  2 siblings, 0 replies; 55+ messages in thread
From: Jonathan Cameron @ 2024-10-30 14:57 UTC (permalink / raw)
  To: Terry Bowman
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa

On Fri, 25 Oct 2024 16:02:54 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> CXL and AER drivers need the ability to identify CXL devices and CXL port
> devices.
> 
> First, add set_pcie_cxl() with logic checking for CXL Flexbus DVSEC
> presence. The CXL Flexbus DVSEC presence is used because it is required
> for all the CXL PCIe devices.[1]
> 
> Add boolean 'struct pci_dev::is_cxl' with the purpose to cache the CXL
> Flexbus presence.
> 
> Add pcie_is_cxl() as a macro to return 'struct pci_dev::is_cxl'.
> 
> Add pcie_is_cxl_port() to check if a device is a CXL root port, CXL
> upstream switch port, or CXL downstream switch port. Also, verify the
> CXL extensions DVSEC for port is present.[1]
> 
> [1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
>     Capability (DVSEC) ID Assignment, Table 8-2
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Makes sense to improve the trace point info if nothing else.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
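
As a rough illustration of the checks the commit message describes, a
userspace model could look like the following. Everything here is an
assumption made for the sketch: the demo_* names are hypothetical, the DVSEC
lookups are reduced to booleans, and requiring both the cached flag and the
port DVSEC in demo_is_cxl_port() is one plausible reading of the description,
not a statement of what the patch does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical port types; the kernel compares PCI_EXP_TYPE_* values. */
enum demo_port_type {
	DEMO_ENDPOINT,
	DEMO_ROOT_PORT,
	DEMO_UPSTREAM,
	DEMO_DOWNSTREAM,
};

struct demo_dev {
	enum demo_port_type type;
	bool has_flexbus_dvsec;	/* assumed result of a Flexbus DVSEC lookup */
	bool has_port_dvsec;	/* assumed result of a port extensions DVSEC lookup */
	bool is_cxl;		/* cached at probe time */
};

/* Model of set_pcie_cxl(): cache Flexbus DVSEC presence once. */
static void demo_set_cxl(struct demo_dev *dev)
{
	assert(dev);
	dev->is_cxl = dev->has_flexbus_dvsec;
}

/* Model of pcie_is_cxl(): just return the cached flag. */
static bool demo_is_cxl(const struct demo_dev *dev)
{
	return dev->is_cxl;
}

/* Model of pcie_is_cxl_port(): port type check plus port DVSEC presence. */
static bool demo_is_cxl_port(const struct demo_dev *dev)
{
	if (dev->type != DEMO_ROOT_PORT && dev->type != DEMO_UPSTREAM &&
	    dev->type != DEMO_DOWNSTREAM)
		return false;

	return demo_is_cxl(dev) && dev->has_port_dvsec;
}
```

In this model an endpoint with the Flexbus DVSEC is reported as CXL but not
as a CXL port, which matches the split between the two helpers.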

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 05/14] PCI/AER: Add CXL PCIe port correctable error support in AER service driver
  2024-10-25 21:02 ` [PATCH v2 05/14] PCI/AER: Add CXL PCIe port correctable error support in AER service driver Terry Bowman
@ 2024-10-30 15:13   ` Jonathan Cameron
  2024-10-30 15:51     ` Bowman, Terry
  2024-11-04 21:50     ` Dan Williams
  2024-10-31 16:37   ` Dave Jiang
  1 sibling, 2 replies; 55+ messages in thread
From: Jonathan Cameron @ 2024-10-30 15:13 UTC (permalink / raw)
  To: Terry Bowman
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa

On Fri, 25 Oct 2024 16:02:56 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER service driver doesn't currently handle CXL protocol errors
> reported by CXL root ports, CXL upstream switch ports, and CXL downstream
> switch ports. Consequently, RAS protocol errors from CXL PCIe port devices
> are not properly logged or handled.
> 
> These errors are reported to the OS via the root port's AER correctable
> and uncorrectable internal error fields. While the AER driver supports
> handling downstream port protocol errors in restricted CXL host (RCH) mode
> also known as CXL1.1, it lacks the same functionality for CXL PCIe ports
> operating in virtual hierarchy (VH) mode.
> 
> To address this gap, update the AER driver to handle CXL PCIe port device
> protocol correctable errors (CE).
> 
> Make this update alongside the existing downstream port RCH error handling
> logic, extending support to CXL PCIe ports in VH mode.
> 
> is_internal_error() is currently limited by CONFIG_PCIEAER_CXL kernel
> config. Update is_internal_error()'s function declaration such that it is
> always available regardless if CONFIG_PCIEAER_CXL kernel config is enabled
> or disabled.
> 
> The uncorrectable error (UCE) handling will be added in a future patch.
> 
> [1] CXL 3.1 Spec, 12.2.2 CXL Root Ports, Downstream Switch Ports, and
> Upstream Switch Ports
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
This is a fiddly patch to read, but that's at least partly diff going crazy
in a few places.

Anyhow, I think it is fine but I would call out that this changes
things so that the PCI error handlers are no longer called for CXL ports
if it's an internal error.

With a sentence on that:

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

I'm not 100% convinced the path of separate handlers is the way to go
but we can always change things again if that doesn't work out.

Jonathan

> ---
>  drivers/pci/pcie/aer.c | 59 ++++++++++++++++++++++++++++--------------
>  1 file changed, 39 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 53e9a11f6c0f..1d3e5b929661 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -941,8 +941,15 @@ static bool find_source_device(struct pci_dev *parent,
>  	return true;
>  }
>  
> -#ifdef CONFIG_PCIEAER_CXL
> +static bool is_internal_error(struct aer_err_info *info)
> +{
> +	if (info->severity == AER_CORRECTABLE)
> +		return info->status & PCI_ERR_COR_INTERNAL;
>  
> +	return info->status & PCI_ERR_UNC_INTN;
> +}
> +
> +#ifdef CONFIG_PCIEAER_CXL

Diff was having fun.  Maybe put a blank line here? I think that's
what has tripped it up.

>  /**
>   * pci_aer_unmask_internal_errors - unmask internal errors
>   * @dev: pointer to the pcie_dev data structure
> @@ -994,14 +1001,6 @@ static bool cxl_error_is_native(struct pci_dev *dev)
>  	return (pcie_ports_native || host->native_aer);
>  }

> -

>  /**
> @@ -1115,8 +1131,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  
>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>  {
> -	cxl_handle_error(dev, info);
> -	pci_aer_handle_error(dev, info);
> +	if (is_internal_error(info) && handles_cxl_errors(dev))
> +		cxl_handle_error(dev, info);
> +	else
> +		pci_aer_handle_error(dev, info);
Whilst not calling this for the CXL cases probably makes sense, and given
the new code I think it needs to be the case to avoid a double clear,
I would call that change out more explicitly in the patch description.
> +

>  	pci_dev_put(dev);
>  }
>  
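
To make the behaviour change discussed above easier to see, here is a small
userspace sketch of the dispatch. It is not the kernel code: the demo_* names
are hypothetical, handles_cxl_errors(dev) is reduced to a boolean argument
because its body is not shown in this mail, and the two bit masks are written
out with the values of PCI_ERR_COR_INTERNAL (bit 14) and PCI_ERR_UNC_INTN
(bit 22).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_ERR_COR_INTERNAL	0x00004000u	/* Corrected Internal Error, bit 14 */
#define DEMO_ERR_UNC_INTN	0x00400000u	/* Uncorrectable Internal Error, bit 22 */

enum demo_severity { DEMO_CORRECTABLE, DEMO_UNCORRECTABLE };
enum demo_path { DEMO_PATH_CXL, DEMO_PATH_PCI };

/* Hypothetical stand-in for struct aer_err_info. */
struct demo_err_info {
	enum demo_severity severity;
	uint32_t status;
};

/* Same shape as is_internal_error() in the diff. */
static bool demo_is_internal_error(const struct demo_err_info *info)
{
	assert(info);
	if (info->severity == DEMO_CORRECTABLE)
		return (info->status & DEMO_ERR_COR_INTERNAL) != 0;

	return (info->status & DEMO_ERR_UNC_INTN) != 0;
}

/*
 * Same shape as the new handle_error_source() branch: exactly one of the
 * two handlers is chosen, never both.
 */
static enum demo_path demo_pick_handler(const struct demo_err_info *info,
					bool handles_cxl)
{
	if (demo_is_internal_error(info) && handles_cxl)
		return DEMO_PATH_CXL;

	return DEMO_PATH_PCI;
}
```

With this shape, an internal error on a device that handles CXL errors only
reaches the CXL path, and any other status (for example Bad TLP, 0x00000040)
still goes to the PCI AER path.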


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 02/14] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port support
  2024-10-25 21:02 ` [PATCH v2 02/14] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port support Terry Bowman
@ 2024-10-30 15:13   ` Jonathan Cameron
  2024-10-31 16:21   ` Dave Jiang
  2024-10-31 20:25   ` Fan Ni
  2 siblings, 0 replies; 55+ messages in thread
From: Jonathan Cameron @ 2024-10-30 15:13 UTC (permalink / raw)
  To: Terry Bowman
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa

On Fri, 25 Oct 2024 16:02:53 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER service driver already includes support for CXL restricted host
> (RCH) downstream port error handling. The current implementation is based
> on CXL1.1 using a root complex event collector.
> 
> Rename function interfaces and parameters where necessary to include
> virtual hierarchy (VH) mode CXL PCIe port error handling alongside the RCH
> handling.[1] The CXL PCIe port error handling will be added in a future
> patch.
> 
> Limit changes to renaming variable and function names. No functional
> changes are added.
> 
> [1] CXL 3.1 Spec, 9.12.2 CXL Virtual Hierarchy
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


* Re: [PATCH v2 01/14] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver'
  2024-10-25 21:02 ` [PATCH v2 01/14] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver' Terry Bowman
@ 2024-10-30 15:14   ` Jonathan Cameron
  2024-10-30 15:15     ` Bowman, Terry
  2024-10-31 16:20   ` Dave Jiang
  2024-10-31 20:24   ` Fan Ni
  2 siblings, 1 reply; 55+ messages in thread
From: Jonathan Cameron @ 2024-10-30 15:14 UTC (permalink / raw)
  To: Terry Bowman
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa

On Fri, 25 Oct 2024 16:02:52 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> CXL.io provides PCIe like protocol error implementation, but CXL.io and
> PCIe have different handling requirements.
> 
> The PCIe AER service driver may attempt recovering PCIe devices with
> uncorrectable errors while recovery is not used for CXL.io. Recovery is not
> used in the CXL.io recovery because of the potential for corruption on
> what can be system memory.
> 
> Create pci_driver::cxl_err_handlers similar to pci_driver::error_handler.
> Create handlers for correctable and uncorrectable CXL.io error
> handling.
> 
> The CXL error handlers will be used in future patches adding CXL PCIe
> port protocol error handling.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>


* Re: [PATCH v2 01/14] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver'
  2024-10-30 15:14   ` Jonathan Cameron
@ 2024-10-30 15:15     ` Bowman, Terry
  0 siblings, 0 replies; 55+ messages in thread
From: Bowman, Terry @ 2024-10-30 15:15 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa

Hi Jonathan,

Thank you for reviewing.

Regards,
Terry

On 10/30/2024 10:14 AM, Jonathan Cameron wrote:
> On Fri, 25 Oct 2024 16:02:52 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> CXL.io provides PCIe like protocol error implementation, but CXL.io and
>> PCIe have different handling requirements.
>>
>> The PCIe AER service driver may attempt recovering PCIe devices with
>> uncorrectable errors while recovery is not used for CXL.io. Recovery is not
>> used in the CXL.io recovery because of the potential for corruption on
>> what can be system memory.
>>
>> Create pci_driver::cxl_err_handlers similar to pci_driver::error_handler.
>> Create handlers for correctable and uncorrectable CXL.io error
>> handling.
>>
>> The CXL error handlers will be used in future patches adding CXL PCIe
>> port protocol error handling.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>



* Re: [PATCH v2 06/14] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe port devices
  2024-10-25 21:02 ` [PATCH v2 06/14] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe port devices Terry Bowman
@ 2024-10-30 15:37   ` Jonathan Cameron
  2024-10-31 16:58   ` Dave Jiang
  1 sibling, 0 replies; 55+ messages in thread
From: Jonathan Cameron @ 2024-10-30 15:37 UTC (permalink / raw)
  To: Terry Bowman
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa

On Fri, 25 Oct 2024 16:02:57 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER service driver's aer_get_device_error_info() function doesn't read
> uncorrectable (UCE) fatal error status from PCIe upstream port devices,
> including CXL upstream switch ports. As a result, fatal errors are not
> logged or handled as needed for CXL PCIe upstream switch port devices.
> 
> Update the aer_get_device_error_info() function to read the UCE fatal
> status for all CXL PCIe port devices.
> 
> The fatal error status will be used in future patches implementing
> CXL PCIe port uncorrectable error handling and logging.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

I assume this was previously not done because the upstream port
requires a healthy link and maybe the error indicates we don't have one.

So I'd imagine this change may have a bad effect on PCIe devices,
even if we know it's fine for CXL ones in the case of certain protocol errors.

Also, does the error log stuff that follows make much sense for
an upstream port?

> ---
>  drivers/pci/pcie/aer.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 1d3e5b929661..d772f123c6a2 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1250,6 +1250,7 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
>  	} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
>  		   type == PCI_EXP_TYPE_RC_EC ||
>  		   type == PCI_EXP_TYPE_DOWNSTREAM ||
> +		   type == PCI_EXP_TYPE_UPSTREAM ||
>  		   info->severity == AER_NONFATAL) {
>  
>  		/* Link is still healthy for IO reads */
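
To make the scope of the one-line change concrete, here is a small userspace
model of the branch condition only. It is a sketch: reads_uncor_status() and
the with_patch flag are hypothetical, and the port type encodings are assumed
to match PCI_EXP_TYPE_* in pci_regs.h. Note the new term keys off the port
type alone, so it applies to every upstream port and not just CXL ones.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to match the PCI_EXP_TYPE_* encodings in pci_regs.h. */
enum port_type {
	TYPE_ENDPOINT   = 0x0,
	TYPE_ROOT_PORT  = 0x4,
	TYPE_UPSTREAM   = 0x5,
	TYPE_DOWNSTREAM = 0x6,
	TYPE_RC_EC      = 0xa,
};

enum severity { SEV_NONFATAL, SEV_FATAL };

/* Hypothetical helper: true when the uncorrectable status would be read. */
static bool reads_uncor_status(enum port_type type, enum severity sev,
			       bool with_patch)
{
	return type == TYPE_ROOT_PORT ||
	       type == TYPE_RC_EC ||
	       type == TYPE_DOWNSTREAM ||
	       (with_patch && type == TYPE_UPSTREAM) ||
	       sev == SEV_NONFATAL;
}

/* Self-check for the single case the patch changes. */
static void check_upstream_fatal(void)
{
	assert(!reads_uncor_status(TYPE_UPSTREAM, SEV_FATAL, false));
	assert(reads_uncor_status(TYPE_UPSTREAM, SEV_FATAL, true));
}
```

A fatal error on an endpoint is still skipped either way, so the only
behaviour that moves is the fatal case on an upstream port.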



* Re: [PATCH v2 07/14] PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service driver
  2024-10-25 21:02 ` [PATCH v2 07/14] PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service driver Terry Bowman
@ 2024-10-30 15:42   ` Jonathan Cameron
  0 siblings, 0 replies; 55+ messages in thread
From: Jonathan Cameron @ 2024-10-30 15:42 UTC (permalink / raw)
  To: Terry Bowman
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa

On Fri, 25 Oct 2024 16:02:58 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> Existing recovery procedure for PCIe uncorrectable errors (UCE) does not
> apply to CXL devices. Recovery can not be used for CXL devices because of
> the potential for corruption on what can be system memory. Also, current
> PCIe UCE recovery does not begin at the bridge but begins at the bridge's
> first downstream device. 

I'm still stuck on why it is fine for PCIe to skip handling errors in the
root port.  Can't see why it should be different.

I might take a poke at what happens on an emulated PCIe system to try and
understand this better but probably not that quickly.

Without understanding the reasoning behind that I'm reluctant to make
an assessment of whether this code is right.

One trivial comment inline.

Jonathan


> This will miss handling CXL protocol errors in a
> CXL root port. A separate CXL recovery is needed because of the different
> handling requirements
> 
> Add a new function, cxl_do_recovery() using the following.
> 
> Add cxl_walk_bridge() to iterate the detected error's sub-topology.
> cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
> will begin iteration at the bridge rather than beginning at the
> bridge's first downstream child.
> 
> Add cxl_report_error_detected() as an analog to report_error_detected().
> It will call pci_driver::cxl_err_handlers for each iterated downstream
> child. The pci_driver::cxl_err_handlers UCE handler returns a boolean
> indicating if there was a UCE error detected during handling.
> 
> cxl_do_recovery() uses the status from cxl_report_error_detected() to
> determine how to proceed. Non-fatal CXL UCE errors will be treated as
> fatal. If a UCE was present during handling then cxl_do_recovery()
> will kernel panic.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/pci/pci.h      |  3 +++
>  drivers/pci/pcie/aer.c |  5 +++-
>  drivers/pci/pcie/err.c | 54 ++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 61 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 14d00ce45bfa..5a67e41919d8 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -658,6 +658,9 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>  		pci_channel_state_t state,
>  		pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev));
>  
> +/* CXL error reporting and handling */
> +void cxl_do_recovery(struct pci_dev *dev);
> +
>  bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
>  int pcie_retrain_link(struct pci_dev *pdev, bool use_lt);
>  
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index d772f123c6a2..19432ab2cfb6 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1048,7 +1048,10 @@ static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  			pdrv->cxl_err_handler->cor_error_detected(dev);
>  
>  		pcie_clear_device_status(dev);
> -	}
> +	} else if (info->severity == AER_NONFATAL)
> +		cxl_do_recovery(dev);
> +	else if (info->severity == AER_FATAL)
Needs {} on these branches too, as the first branch of the if has them.

> +		cxl_do_recovery(dev);
>  }
>  
>  static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 31090770fffc..3785f4ca5103 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -276,3 +276,57 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>  
>  	return status;
>  }
> +
> +static void cxl_walk_bridge(struct pci_dev *bridge,
> +			    int (*cb)(struct pci_dev *, void *),
> +			    void *userdata)
> +{
> +	bool *status = userdata;
> +
> +	cb(bridge, status);
> +	if (bridge->subordinate && !*status)
> +		pci_walk_bus(bridge->subordinate, cb, status);
> +}
> +
> +static int cxl_report_error_detected(struct pci_dev *dev, void *data)
> +{
> +	struct pci_driver *pdrv = dev->driver;
> +	bool *status = data;
> +
> +	device_lock(&dev->dev);
> +	if (pdrv && pdrv->cxl_err_handler &&
> +	    pdrv->cxl_err_handler->error_detected) {
> +		const struct cxl_error_handlers *cxl_err_handler =
> +			pdrv->cxl_err_handler;
> +		*status |= cxl_err_handler->error_detected(dev);
> +	}
> +	device_unlock(&dev->dev);
> +	return *status;
> +}
> +
> +void cxl_do_recovery(struct pci_dev *dev)
> +{
> +	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> +	int type = pci_pcie_type(dev);
> +	struct pci_dev *bridge;
> +	int status;
> +
> +	if (type == PCI_EXP_TYPE_ROOT_PORT ||
> +	    type == PCI_EXP_TYPE_DOWNSTREAM ||
> +	    type == PCI_EXP_TYPE_UPSTREAM ||
> +	    type == PCI_EXP_TYPE_ENDPOINT)
> +		bridge = dev;
> +	else
> +		bridge = pci_upstream_bridge(dev);
> +
> +	cxl_walk_bridge(bridge, cxl_report_error_detected, &status);
> +	if (status)
> +		panic("CXL cachemem error. Invoking panic");
> +
> +	if (host->native_aer || pcie_ports_native) {
> +		pcie_clear_device_status(dev);
> +		pci_aer_clear_nonfatal_status(dev);
> +	}
> +
> +	pci_info(bridge, "No uncorrectable error found. Continuing.\n");
> +}
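
A userspace sketch of the bridge-first walk described in the commit message,
under stated assumptions: struct dev, walk_bus() and the has_uce flag are
hypothetical stand-ins for pci_dev, pci_walk_bus() and the driver's
error_detected() result. One detail it does differently on purpose: status
is a bool initialised to false. In the patch above it appears to be declared
as an uninitialised int and then accessed through a bool pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device model for illustration only. */
struct dev {
	const char *name;
	bool has_uce;         /* stands in for error_detected() returning true */
	struct dev *children; /* first device on the subordinate bus */
	struct dev *sibling;  /* next device on the same bus */
	int visited;          /* how many times the callback ran here */
};

/* Callback: accumulate the result and return it so the walk can stop. */
static bool report_error_detected(struct dev *d, bool *status)
{
	d->visited++;
	*status |= d->has_uce;
	return *status;
}

/* Depth-first bus walk that stops as soon as the callback returns true. */
static void walk_bus(struct dev *first, bool *status)
{
	for (struct dev *d = first; d; d = d->sibling) {
		if (report_error_detected(d, status))
			return;
		if (d->children) {
			walk_bus(d->children, status);
			if (*status)
				return;
		}
	}
}

/* Bridge-first walk: the bridge itself is reported before its children. */
static bool walk_bridge(struct dev *bridge)
{
	bool status = false; /* explicit initial value */

	assert(bridge != NULL);
	report_error_detected(bridge, &status);
	if (bridge->children && !status)
		walk_bus(bridge->children, &status);

	return status;
}
```

The difference from the existing PCIe walk is only the first callback on the
bridge itself, which is what lets an error logged in a root port be seen.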



* Re: [PATCH v2 08/14] cxl/pci: Change find_cxl_ports() to non-static
  2024-10-25 21:02 ` [PATCH v2 08/14] cxl/pci: Change find_cxl_ports() to non-static Terry Bowman
@ 2024-10-30 15:45   ` Jonathan Cameron
  2024-10-30 15:54     ` Bowman, Terry
  0 siblings, 1 reply; 55+ messages in thread
From: Jonathan Cameron @ 2024-10-30 15:45 UTC (permalink / raw)
  To: Terry Bowman
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa

On Fri, 25 Oct 2024 16:02:59 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

Typo in title: find_cxl_ports() should be find_cxl_port(), not plural.


> CXL PCIe port protocol error support will be added in the future. This
> requires searching for a CXL PCIe port device in the CXL topology as
> provided by find_cxl_port(). But, find_cxl_port() is defined static
> and as a result is not callable outside of this source file.
> 
> Update the find_cxl_port() declaration to be non-static.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Doesn't hugely matter but I'd do this later in the series as it's
not used until patch 12 (I think) and by then reviewers may have forgotten what
it is for.

Fine otherwise,

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> ---
>  drivers/cxl/core/core.h | 3 +++
>  drivers/cxl/core/port.c | 4 ++--
>  2 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index 0c62b4069ba0..d81e5ee25f58 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -110,4 +110,7 @@ bool cxl_need_node_perf_attrs_update(int nid);
>  int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
>  					struct access_coordinate *c);
>  
> +struct cxl_port *find_cxl_port(struct device *dport_dev,
> +			       struct cxl_dport **dport);
> +
>  #endif /* __CXL_CORE_H__ */
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index e666ec6a9085..2ac835cd4f1b 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -1342,8 +1342,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
>  	return NULL;
>  }
>  
> -static struct cxl_port *find_cxl_port(struct device *dport_dev,
> -				      struct cxl_dport **dport)
> +struct cxl_port *find_cxl_port(struct device *dport_dev,
> +			       struct cxl_dport **dport)
>  {
>  	struct cxl_find_port_ctx ctx = {
>  		.dport_dev = dport_dev,



* Re: [PATCH v2 05/14] PCI/AER: Add CXL PCIe port correctable error support in AER service driver
  2024-10-30 15:13   ` Jonathan Cameron
@ 2024-10-30 15:51     ` Bowman, Terry
  2024-11-04 21:50     ` Dan Williams
  1 sibling, 0 replies; 55+ messages in thread
From: Bowman, Terry @ 2024-10-30 15:51 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa

Hi Jonathan,

I added responses below.

On 10/30/2024 10:13 AM, Jonathan Cameron wrote:
> On Fri, 25 Oct 2024 16:02:56 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The AER service driver doesn't currently handle CXL protocol errors
>> reported by CXL root ports, CXL upstream switch ports, and CXL downstream
>> switch ports. Consequently, RAS protocol errors from CXL PCIe port devices
>> are not properly logged or handled.
>>
>> These errors are reported to the OS via the root port's AER correctable
>> and uncorrectable internal error fields. While the AER driver supports
>> handling downstream port protocol errors in restricted CXL host (RCH) mode
>> also known as CXL1.1, it lacks the same functionality for CXL PCIe ports
>> operating in virtual hierarchy (VH) mode.
>>
>> To address this gap, update the AER driver to handle CXL PCIe port device
>> protocol correctable errors (CE).
>>
>> Make this update alongside the existing downstream port RCH error handling
>> logic, extending support to CXL PCIe ports in VH mode.
>>
>> is_internal_error() is currently limited by CONFIG_PCIEAER_CXL kernel
>> config. Update is_internal_error()'s function declaration such that it is
>> always available regardless if CONFIG_PCIEAER_CXL kernel config is enabled
>> or disabled.
>>
>> The uncorrectable error (UCE) handling will be added in a future patch.
>>
>> [1] CXL 3.1 Spec, 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>> Upstream Switch Ports
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> This is a fiddly patch to read, but that's at least partly diff going crazy
> in a few places.
>
> Anyhow, I think it is fine but I would call out that this changes
> things so that the PCI error handlers are no longer called for CXL ports
> if it's an internal error.
>
> With a sentence on that:
>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>
> I'm not 100% convinced the path of separate handlers is the way to go
> but we can always change things again if that doesn't work out.
>
> Jonathan
I will update the patch message to mention that ports using CXL handling
for CIE will not call the PCIe handlers or PCIe recovery.

Note, port devices are bound to the portdrv driver, which has a fairly
generic CIE handler.

Regards,
Terry

>> ---
>>  drivers/pci/pcie/aer.c | 59 ++++++++++++++++++++++++++++--------------
>>  1 file changed, 39 insertions(+), 20 deletions(-)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 53e9a11f6c0f..1d3e5b929661 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -941,8 +941,15 @@ static bool find_source_device(struct pci_dev *parent,
>>  	return true;
>>  }
>>  
>> -#ifdef CONFIG_PCIEAER_CXL
>> +static bool is_internal_error(struct aer_err_info *info)
>> +{
>> +	if (info->severity == AER_CORRECTABLE)
>> +		return info->status & PCI_ERR_COR_INTERNAL;
>>  
>> +	return info->status & PCI_ERR_UNC_INTN;
>> +}
>> +
>> +#ifdef CONFIG_PCIEAER_CXL
> Diff was having fun.  Maybe put a blank line here? I think that's
> what has tripped it up.
>
>>  /**
>>   * pci_aer_unmask_internal_errors - unmask internal errors
>>   * @dev: pointer to the pcie_dev data structure
>> @@ -994,14 +1001,6 @@ static bool cxl_error_is_native(struct pci_dev *dev)
>>  	return (pcie_ports_native || host->native_aer);
>>  }
>> -
>>  /**
>> @@ -1115,8 +1131,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>  
>>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>>  {
>> -	cxl_handle_error(dev, info);
>> -	pci_aer_handle_error(dev, info);
>> +	if (is_internal_error(info) && handles_cxl_errors(dev))
>> +		cxl_handle_error(dev, info);
>> +	else
>> +		pci_aer_handle_error(dev, info);
> Whilst not calling this for the CXL cases probably makes sense and
> given new code needs to be the case to avoid a double clear I think,
> I would call that change out more explicitly in the patch description.
>> +
>>  	pci_dev_put(dev);
>>  }
>>  



* Re: [PATCH v2 08/14] cxl/pci: Change find_cxl_ports() to non-static
  2024-10-30 15:45   ` Jonathan Cameron
@ 2024-10-30 15:54     ` Bowman, Terry
  0 siblings, 0 replies; 55+ messages in thread
From: Bowman, Terry @ 2024-10-30 15:54 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa

Hi Jonathan,

I added comments below.

On 10/30/2024 10:45 AM, Jonathan Cameron wrote:
> On Fri, 25 Oct 2024 16:02:59 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
> Typo in title. Shouldn't be plural ports.

I'll remove the plural. Thanks.

>> CXL PCIe port protocol error support will be added in the future. This
>> requires searching for a CXL PCIe port device in the CXL topology as
>> provided by find_cxl_port(). But, find_cxl_port() is defined static
>> and as a result is not callable outside of this source file.
>>
>> Update the find_cxl_port() declaration to be non-static.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Doesn't hugely matter but I'd do this later in the series as it's
> not used until patch 12 (I think) and by then reviewers may have forgotten what
> it is for.
>
> Fine otherwise,
>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Thanks. I will move this patch to later.

Regards,
Terry



* Re: [PATCH v2 09/14] cxl/pci: Map CXL PCIe root port and downstream switch port RAS registers
  2024-10-25 21:03 ` [PATCH v2 09/14] cxl/pci: Map CXL PCIe root port and downstream switch port RAS registers Terry Bowman
@ 2024-10-30 15:55   ` Jonathan Cameron
  0 siblings, 0 replies; 55+ messages in thread
From: Jonathan Cameron @ 2024-10-30 15:55 UTC (permalink / raw)
  To: Terry Bowman
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa

On Fri, 25 Oct 2024 16:03:00 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> Map RAS registers for CXL PCIe root port and downstream RAS registers.
> 
> Refactor and rename cxl_setup_parent_dport() to be cxl_init_ep_ports_aer().
> Update the function to iterate an endpoint's parent downstream switch
> ports and parent root ports. It maps the RAS registers for each
> CXL downstream switch port and CXL root port iterated.
> 
> Move the RAS register map logic from cxl_dport_map_regs() into
> cxl_dport_init_ras_reporting(). This eliminates an unnecessary helper.
> cxl_dport_map_regs() can be removed.

The helper being removed looks to be called cxl_dport_map_ras(), not
cxl_dport_map_regs().


> 
> cxl_dport_init_ras_reporting() must check for previously mapped registers
> within the topology, particularly with CXL switches. Endpoints under a
> CXL switch may share parent ports or downstream ports, ensure the ports'
> registers are only mapped once.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/core/pci.c | 38 +++++++++++++++++---------------------
>  drivers/cxl/cxl.h      |  6 ++----
>  drivers/cxl/mem.c      | 26 ++++++++++++++++++++++++--
>  3 files changed, 43 insertions(+), 27 deletions(-)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 5b46bc46aaa9..0bb61e39cf8f 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -749,18 +749,6 @@ static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
>  	}
>  }
>  
> -static void cxl_dport_map_ras(struct cxl_dport *dport)
> -{
> -	struct cxl_register_map *map = &dport->reg_map;
> -	struct device *dev = dport->dport_dev;
> -
> -	if (!map->component_map.ras.valid)
> -		dev_dbg(dev, "RAS registers not found\n");
> -	else if (cxl_map_component_regs(map, &dport->regs.component,
> -					BIT(CXL_CM_CAP_CAP_ID_RAS)))
> -		dev_dbg(dev, "Failed to map RAS capability.\n");
> -}
> -
>  static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>  {
>  	void __iomem *aer_base = dport->regs.dport_aer;
> @@ -790,20 +778,28 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>   * @dport: the cxl_dport that needs to be initialized
>   * @host: host device for devm operations
>   */
> -void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
> +void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>  {
> -	dport->reg_map.host = host;
> -	cxl_dport_map_ras(dport);
> -
> -	if (dport->rch) {
> -		struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
> -
> -		if (!host_bridge->native_aer)
> -			return;
> +	struct device *dport_dev = dport->dport_dev;
> +	struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
>  
> +	if (dport->rch && host_bridge->native_aer) {
>  		cxl_dport_map_rch_aer(dport);
>  		cxl_disable_rch_root_ints(dport);
>  	}
> +
> +	/* dport may have more than 1 downstream EP. Check if already mapped. */
> +	if (dport->regs.ras) {
> +		dev_warn(dport_dev, "RAS is already mapped\n");
The comment suggests this is normal? If so why the dev_warn?

> +		return;
> +	}
> +
> +	dport->reg_map.host = dport_dev;
> +	if (cxl_map_component_regs(&dport->reg_map, &dport->regs.component,
> +				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
> +		dev_err(dport_dev, "Failed to map RAS capability.\n");
> +		return;
> +	}
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
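
On the already-mapped check, a minimal userspace sketch of what I would
expect: a second call for a shared port is a normal outcome and returns
quietly, while only a real mapping failure is reported as an error. The
names here (struct dport_model, enum map_result, the mapping argument) are
hypothetical and stand in for the dport and cxl_map_component_regs().

```c
#include <assert.h>
#include <stddef.h>

enum map_result { MAP_DONE, MAP_ALREADY_MAPPED, MAP_FAILED };

/* Hypothetical model of the per-dport register state. */
struct dport_model {
	void *ras;     /* NULL until the RAS block has been mapped */
	int map_calls; /* counts real mapping attempts */
};

/*
 * Map once. 'mapping' stands in for the result of the real mapping call,
 * with NULL meaning that the mapping failed.
 */
static enum map_result init_ras_reporting(struct dport_model *dport,
					  void *mapping)
{
	assert(dport != NULL);

	/* Several endpoints can share this port, so this is expected. */
	if (dport->ras)
		return MAP_ALREADY_MAPPED;

	dport->map_calls++;
	if (!mapping)
		return MAP_FAILED;

	dport->ras = mapping;
	return MAP_DONE;
}
```

The caller can then log MAP_ALREADY_MAPPED at debug level at most, and keep
the louder message for MAP_FAILED.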

>  struct cxl_decoder *to_cxl_decoder(struct device *dev);
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index a9fd5cd5a0d2..240d54b22a8c 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -45,6 +45,29 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
>  	return 0;
>  }
>  
> +static bool dev_is_cxl_pci(struct device *dev, u32 pcie_type)

Seems to only match ports, so that name is a little misleading.

> +{
> +	struct pci_dev *pdev;
> +
> +	if (!dev_is_pci(dev))
> +		return false;
> +
> +	pdev = to_pci_dev(dev);
> +	if (!pcie_is_cxl_port(pdev))
> +		return false;
> +
> +	return (pci_pcie_type(pdev) == pcie_type);
> +}
> +

>  static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
>  				 struct cxl_dport *parent_dport)
>  {
> @@ -62,6 +85,7 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
>  
>  		ep = cxl_ep_load(iter, cxlmd);
>  		ep->next = down;
> +		cxl_init_ep_ports_aer(ep);
The comment above this covers various things, but does not mention that it now
maps the AER registers. It probably needs an addition if this is the appropriate
place to do the mapping.
>  	}
>  
>  	/* Note: endpoint port component registers are derived from @cxlds */
> @@ -166,8 +190,6 @@ static int cxl_mem_probe(struct device *dev)
>  	else
>  		endpoint_parent = &parent_port->dev;
>  
> -	cxl_dport_init_ras_reporting(dport, dev);
> -
>  	scoped_guard(device, endpoint_parent) {
>  		if (!endpoint_parent->driver) {
>  			dev_err(dev, "CXL port topology %s not enabled\n",



* Re: [PATCH v2 10/14] cxl/pci: Map CXL PCIe upstream switch port RAS registers
  2024-10-25 21:03 ` [PATCH v2 10/14] cxl/pci: Map CXL PCIe upstream " Terry Bowman
@ 2024-10-30 15:56   ` Jonathan Cameron
  0 siblings, 0 replies; 55+ messages in thread
From: Jonathan Cameron @ 2024-10-30 15:56 UTC (permalink / raw)
  To: Terry Bowman
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa

On Fri, 25 Oct 2024 16:03:01 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> Add logic to map CXL PCIe upstream switch port (USP) RAS registers.
> 
> Introduce 'struct cxl_regs' member into 'struct cxl_port' to store a
> pointer to the upstream port's mapped RAS registers.
> 
> The upstream port may have multiple downstream endpoints. Before
> mapping AER registers check if the registers are already mapped.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/core/pci.c | 17 +++++++++++++++++
>  drivers/cxl/cxl.h      |  4 ++++
>  drivers/cxl/mem.c      |  3 +++
>  3 files changed, 24 insertions(+)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 0bb61e39cf8f..53ca773557f3 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -773,6 +773,23 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>  	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
>  }
>  
> +void cxl_uport_init_ras_reporting(struct cxl_port *port)
> +{
> +	/* uport may have more than 1 downstream EP. Check if already mapped. */
> +	if (port->uport_regs.ras) {
> +		dev_warn(&port->dev, "RAS is already mapped\n");
As before, dev_warn seems inappropriate given the comment.
> +		return;
> +	}
> +
> +	port->reg_map.host = &port->dev;
> +	if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
> +				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
> +		dev_err(&port->dev, "Failed to map RAS capability.\n");
> +		return;
> +	}
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, CXL);
> +



* Re: [PATCH v2 11/14] cxl/pci: Rename RAS handler interfaces to also indicate CXL PCIe port support
  2024-10-25 21:03 ` [PATCH v2 11/14] cxl/pci: Rename RAS handler interfaces to also indicate CXL PCIe port support Terry Bowman
@ 2024-10-30 15:59   ` Jonathan Cameron
  0 siblings, 0 replies; 55+ messages in thread
From: Jonathan Cameron @ 2024-10-30 15:59 UTC (permalink / raw)
  To: Terry Bowman
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa

On Fri, 25 Oct 2024 16:03:02 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

Patch title looks unconnected to the patch. Cut and paste issue?


> CXL PCIe port protocol error handling support will be added to the
> CXL drivers in the future. In preparation, rename the existing
> interfaces to support handling all CXL PCIe port protocol errors.
> 
> The driver's RAS support functions currently rely on a 'struct
> cxl_dev_state' type parameter, which is not available for CXL port
> devices. However, since the same CXL RAS capability structure is
> needed across most CXL components and devices, a common handling
> approach should be adopted.
> 
> To accommodate this, update the __cxl_handle_cor_ras() and
> __cxl_handle_ras() functions to use a `struct device` instead of
> `struct cxl_dev_state`.
> 
> No functional changes are introduced.
> 
> [1] CXL 3.1 Spec, 8.2.4 CXL.cache and CXL.mem Registers
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Otherwise looks fine



* Re: [PATCH v2 12/14] cxl/pci: Add error handler for CXL PCIe port RAS errors
  2024-10-25 21:03 ` [PATCH v2 12/14] cxl/pci: Add error handler for CXL PCIe port RAS errors Terry Bowman
@ 2024-10-30 16:03   ` Jonathan Cameron
  0 siblings, 0 replies; 55+ messages in thread
From: Jonathan Cameron @ 2024-10-30 16:03 UTC (permalink / raw)
  To: Terry Bowman
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa

On Fri, 25 Oct 2024 16:03:03 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> Introduce correctable and uncorrectable CXL PCIe port handlers.
> 
> Use the PCIe port's device object to find the matching port or
> downstream port in the CXL topology. The matching port or downstream
> port will include the cached RAS register block.
> 
> Invoke the existing __cxl_handle_ras() with the RAS registers as a
> parameter. __cxl_handle_ras() will log the RAS errors (if present)
> and clear the RAS status.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/core/pci.c | 59 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 59 insertions(+)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index bb2fd7d04c4f..adb184d346ae 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -772,6 +772,65 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>  	writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
>  }
>  
> +static int match_uport(struct device *dev, const void *data)
> +{
> +	struct device *uport_dev = (struct device *)data;
> +	struct cxl_port *port;
> +
> +	if (!is_cxl_port(dev))
> +		return 0;
> +
> +	port = to_cxl_port(dev);
> +
> +	return port->uport_dev == uport_dev;
> +}
> +
> +static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
> +{
> +	struct cxl_port *port __free(put_cxl_port) = NULL;
> +	void __iomem *ras_base = NULL;
> +
> +	if (!pdev)
> +		return NULL;
> +
> +	if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
> +	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
> +		struct cxl_dport *dport;
> +
> +		port = find_cxl_port(&pdev->dev, &dport);

Scope of port is messy as the constructor and destructor
are not well associated. I'd drag a copy into each leg so they can
remain closer to each other.
Or don't use __free() as it's not adding much here.
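
The first option can be sketched in userspace as below. This is only an
illustration under assumptions: struct port, get_port(), put_port(),
AUTO_PUT_PORT and lookup_ras() are hypothetical stand-ins, and the GCC/Clang
cleanup attribute plays the role of the kernel's __free() helper. Each leg
declares its own pointer where the reference is taken, so the constructor
and destructor stay next to each other.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a refcounted CXL port. */
struct port {
	int refs;
	int ras;
};

static struct port *get_port(struct port *p)
{
	if (p)
		p->refs++;
	return p;
}

/* Cleanup callback: receives a pointer to the annotated variable. */
static void put_port(struct port **pp)
{
	if (*pp) {
		assert((*pp)->refs > 0);
		(*pp)->refs--;
	}
}

#define AUTO_PUT_PORT __attribute__((cleanup(put_port)))

/* kind 0 mimics the dport lookup leg, kind 1 the uport lookup leg. */
static int lookup_ras(struct port *candidate, int kind)
{
	if (kind == 0) {
		/* Reference is taken and dropped within this leg only. */
		struct port *port AUTO_PUT_PORT = get_port(candidate);

		return port ? port->ras : 0;
	} else if (kind == 1) {
		struct port *port AUTO_PUT_PORT = get_port(candidate);

		return port ? port->ras + 1 : 0;
	}

	return 0;
}
```

The return value is evaluated before the cleanup callback runs, so reading
through the pointer in the return statement is safe in this sketch.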

> +		ras_base = dport ? dport->regs.ras : NULL;	
> +	} else if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {
> +		struct device *port_dev;
> +
> +		port_dev = bus_find_device(&cxl_bus_type, NULL, &pdev->dev,
> +					   match_uport);
> +		if (!port_dev)
> +			return NULL;
> +
> +		port = to_cxl_port(port_dev);
> +		ras_base = port ? port->uport_regs.ras : NULL;
> +	}
> +
> +	return ras_base;
> +}
> +
> +static void cxl_port_cor_error_detected(struct pci_dev *pdev)
> +{
> +	void __iomem *ras_base = cxl_pci_port_ras(pdev);
> +
> +	__cxl_handle_cor_ras(&pdev->dev, ras_base);
> +}
> +
> +static bool cxl_port_error_detected(struct pci_dev *pdev)
> +{
> +	void __iomem *ras_base = cxl_pci_port_ras(pdev);
> +	bool ue;
> +
> +	ue = __cxl_handle_ras(&pdev->dev, ras_base);
> +
> +	return ue;
> +}
> +
>  void cxl_uport_init_ras_reporting(struct cxl_port *port)
>  {
>  	/* uport may have more than 1 downstream EP. Check if already mapped. */


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 13/14] cxl/pci: Add trace logging for CXL PCIe port RAS errors
  2024-10-25 21:03 ` [PATCH v2 13/14] cxl/pci: Add trace logging " Terry Bowman
@ 2024-10-30 16:07   ` Jonathan Cameron
  2024-10-30 21:30     ` Bowman, Terry
  0 siblings, 1 reply; 55+ messages in thread
From: Jonathan Cameron @ 2024-10-30 16:07 UTC (permalink / raw)
  To: Terry Bowman
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, shiju.jose,
	M.Chehab

On Fri, 25 Oct 2024 16:03:04 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> The CXL drivers use kernel trace functions for logging endpoint and
> RCH downstream port RAS errors. Similar functionality is
> required for CXL root ports, CXL downstream switch ports, and CXL
> upstream switch ports.
> 
> Introduce trace logging functions for both RAS correctable and
> uncorrectable errors specific to CXL PCIe ports. Additionally, update
> the PCIe port error handlers to invoke these new trace functions.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
+CC Mauro and Shiju to give the tracepoint a sanity check and for
awareness that we have something new to feed rasdaemon :)

Jonathan

> ---
>  drivers/cxl/core/pci.c   | 16 ++++++++++----
>  drivers/cxl/core/trace.h | 47 ++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 59 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index adb184d346ae..eeb4a64ba5b5 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -661,10 +661,14 @@ static void __cxl_handle_cor_ras(struct device *dev,
>  
>  	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>  	status = readl(addr);
> -	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
> -		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> +	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
> +		return;
> +	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> +
> +	if (is_cxl_memdev(dev))
>  		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
> -	}
> +	else if (dev_is_pci(dev))
How would you get here otherwise? Is it useful to know it is a pci device
here?
> +		trace_cxl_port_aer_correctable_error(dev, status);
>  }
>  
>  static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
> @@ -720,7 +724,11 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
>  	}
>  
>  	header_log_copy(ras_base, hl);
> -	trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
> +	if (is_cxl_memdev(dev))
> +		trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
> +	else if (dev_is_pci(dev))
as above.

> +		trace_cxl_port_aer_uncorrectable_error(dev, status, fe, hl);
> +
>  	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>  
>  	return true;
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index 8672b42ee4d1..1c4368a7b50b 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -48,6 +48,34 @@
>  	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
>  )
>  
> +TRACE_EVENT(cxl_port_aer_uncorrectable_error,
> +	TP_PROTO(struct device *dev, u32 status, u32 fe, u32 *hl),
> +	TP_ARGS(dev, status, fe, hl),
> +	TP_STRUCT__entry(
> +		__string(devname, dev_name(dev))
> +		__string(host, dev_name(dev->parent))
> +		__field(u32, status)
> +		__field(u32, first_error)
> +		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
> +	),
> +	TP_fast_assign(
> +		__assign_str(devname);
> +		__assign_str(host);
> +		__entry->status = status;
> +		__entry->first_error = fe;
> +		/*
> +		 * Embed the 512B headerlog data for user app retrieval and
> +		 * parsing, but no need to print this in the trace buffer.
I'm not sure any printing as such goes on in the trace buffer. I think the
printing is done from the data in the trace buffer.
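
The distinction can be sketched in userspace: the record stored in the
buffer carries the raw header log, while the formatting step that produces
text from that record leaves it out. All names here (struct uc_record,
record_assign(), record_format(), HEADERLOG_WORDS) are hypothetical and only
mimic the shape of a TRACE_EVENT definition; they are not the tracing API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define HEADERLOG_WORDS 16	/* hypothetical size for this sketch */

struct uc_record {
	uint32_t status;
	uint32_t first_error;
	uint32_t header_log[HEADERLOG_WORDS];
};

/* Analog of TP_fast_assign(): copy everything into the record. */
static void record_assign(struct uc_record *rec, uint32_t status,
			  uint32_t fe, const uint32_t *hl)
{
	assert(rec && hl);
	rec->status = status;
	rec->first_error = fe;
	memcpy(rec->header_log, hl, sizeof(rec->header_log));
}

/* Analog of TP_printk(): format text from the record, omitting header_log. */
static int record_format(const struct uc_record *rec, char *buf, size_t len)
{
	return snprintf(buf, len, "status: 0x%08x first_error: 0x%08x",
			(unsigned int)rec->status,
			(unsigned int)rec->first_error);
}
```

A consumer that reads the raw record still sees header_log, even though the
formatted text never mentions it.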

> +		 */
> +		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
> +	),
> +	TP_printk("device=%s host=%s status: '%s' first_error: '%s'",
> +		  __get_str(devname), __get_str(host),
> +		  show_uc_errs(__entry->status),
> +		  show_uc_errs(__entry->first_error)
> +	)
> +);
> +
>  TRACE_EVENT(cxl_aer_uncorrectable_error,
>  	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
>  	TP_ARGS(cxlmd, status, fe, hl),
> @@ -96,6 +124,25 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
>  	{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" }	\
>  )
>  
> +TRACE_EVENT(cxl_port_aer_correctable_error,
> +	TP_PROTO(struct device *dev, u32 status),
> +	TP_ARGS(dev, status),
> +	TP_STRUCT__entry(
> +		__string(devname, dev_name(dev))
> +		__string(host, dev_name(dev->parent))
> +		__field(u32, status)
> +	),
> +	TP_fast_assign(
> +		__assign_str(devname);
> +		__assign_str(host);
> +		__entry->status = status;
> +	),
> +	TP_printk("device=%s host=%s status='%s'",
> +		  __get_str(devname), __get_str(host),
> +		  show_ce_errs(__entry->status)
> +	)
> +);
> +
>  TRACE_EVENT(cxl_aer_correctable_error,
>  	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
>  	TP_ARGS(cxlmd, status),


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 14/14] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
  2024-10-25 21:03 ` [PATCH v2 14/14] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers Terry Bowman
@ 2024-10-30 16:11   ` Jonathan Cameron
  2024-10-30 21:34     ` Bowman, Terry
  0 siblings, 1 reply; 55+ messages in thread
From: Jonathan Cameron @ 2024-10-30 16:11 UTC (permalink / raw)
  To: Terry Bowman
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa

On Fri, 25 Oct 2024 16:03:05 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> pci_driver::cxl_err_handlers are not currently assigned handler callbacks.
> The handlers can't be set in the pci_driver static definition because the
> CXL PCIe port devices are bound to the portdrv driver which is not CXL
> driver aware.
> 
> Add cxl_assign_port_error_handlers() in the cxl_core module. This
> function will assign the default handlers for a CXL PCIe port device.
> 
> When the CXL port (cxl_port or cxl_dport) is destroyed the CXL PCIe port
> device's pci_driver::cxl_err_handlers must be set to NULL to prevent future
> use. Create cxl_clear_port_error_handlers() and register it to be called
> when the CXL port device (cxl_port or cxl_dport) is destroyed.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
One trivial comment inline. 
> ---
>  drivers/cxl/core/pci.c | 35 +++++++++++++++++++++++++++++++++++
>  1 file changed, 35 insertions(+)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index eeb4a64ba5b5..5f7570c6173c 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -839,8 +839,36 @@ static bool cxl_port_error_detected(struct pci_dev *pdev)
>  	return ue;
>  }
>  
> +static const struct cxl_error_handlers cxl_port_error_handlers = {
> +	.error_detected	= cxl_port_error_detected,
> +	.cor_error_detected	= cxl_port_cor_error_detected,
Odd spacing?  I'd just use a single space as aligning these almost
always makes for messy future patches.
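
A userspace sketch of the single-space style, together with the assign and
clear lifecycle the commit message describes. The types below (struct dev,
struct drv, struct err_handlers) and the stub handlers are hypothetical
stand-ins, not the kernel definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct dev;

struct err_handlers {
	bool (*error_detected)(struct dev *dev);
	void (*cor_error_detected)(struct dev *dev);
};

struct drv {
	const struct err_handlers *cxl_err_handler;
};

struct dev {
	struct drv *driver;
};

static bool port_error_detected(struct dev *dev)
{
	(void)dev;
	return true;
}

static void port_cor_error_detected(struct dev *dev)
{
	(void)dev;
}

/* Single space before '=': a longer member added later touches one line. */
static const struct err_handlers port_error_handlers = {
	.error_detected = port_error_detected,
	.cor_error_detected = port_cor_error_detected,
};

static void assign_port_error_handlers(struct dev *dev)
{
	if (!dev->driver)
		return;

	dev->driver->cxl_err_handler = &port_error_handlers;
}

/* Shaped like a devm action callback: opaque pointer in, no return value. */
static void clear_port_error_handlers(void *data)
{
	struct dev *dev = data;

	assert(dev != NULL);
	if (!dev->driver)
		return;

	dev->driver->cxl_err_handler = NULL;
}
```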

> +};
> +
> +static void cxl_assign_port_error_handlers(struct pci_dev *pdev)
> +{
> +	struct pci_driver *pdrv = pdev->driver;
> +
> +	if (!pdrv)
> +		return;
> +
> +	pdrv->cxl_err_handler = &cxl_port_error_handlers;
> +}
> +
> +static void cxl_clear_port_error_handlers(void *data)
> +{
> +	struct pci_dev *pdev = data;
> +	struct pci_driver *pdrv = pdev->driver;
> +
> +	if (!pdrv)
> +		return;
> +
> +	pdrv->cxl_err_handler = NULL;
> +}
> +
>  void cxl_uport_init_ras_reporting(struct cxl_port *port)
>  {
> +	struct pci_dev *pdev = to_pci_dev(port->uport_dev);
> +
>  	/* uport may have more than 1 downstream EP. Check if already mapped. */
>  	if (port->uport_regs.ras) {
>  		dev_warn(&port->dev, "RAS is already mapped\n");
> @@ -853,6 +881,9 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
>  		dev_err(&port->dev, "Failed to map RAS capability.\n");
>  		return;
>  	}
> +
> +	cxl_assign_port_error_handlers(pdev);
> +	devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, CXL);
>  
> @@ -865,6 +896,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>  {
>  	struct device *dport_dev = dport->dport_dev;
>  	struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
> +	struct pci_dev *pdev = to_pci_dev(dport_dev);
>  
>  	if (dport->rch && host_bridge->native_aer) {
>  		cxl_dport_map_rch_aer(dport);
> @@ -883,6 +915,9 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>  		dev_err(dport_dev, "Failed to map RAS capability.\n");
>  		return;
>  	}
> +
> +	cxl_assign_port_error_handlers(pdev);
> +	devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
>  


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 13/14] cxl/pci: Add trace logging for CXL PCIe port RAS errors
  2024-10-30 16:07   ` Jonathan Cameron
@ 2024-10-30 21:30     ` Bowman, Terry
  0 siblings, 0 replies; 55+ messages in thread
From: Bowman, Terry @ 2024-10-30 21:30 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa, shiju.jose,
	M.Chehab

Hi Jonathan,

On 10/30/2024 11:07 AM, Jonathan Cameron wrote:
> On Fri, 25 Oct 2024 16:03:04 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The CXL drivers use kernel trace functions for logging endpoint and
>> RCH downstream port RAS errors. Similar functionality is
>> required for CXL root ports, CXL downstream switch ports, and CXL
>> upstream switch ports.
>>
>> Introduce trace logging functions for both RAS correctable and
>> uncorrectable errors specific to CXL PCIe ports. Additionally, update
>> the PCIe port error handlers to invoke these new trace functions.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> +CC Mauro and Shiju to give the tracepoint a sanity check and for
> awareness that we have something new to feed rasdaemon :)
>
> Jonathan
>
>> ---
>>  drivers/cxl/core/pci.c   | 16 ++++++++++----
>>  drivers/cxl/core/trace.h | 47 ++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 59 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index adb184d346ae..eeb4a64ba5b5 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -661,10 +661,14 @@ static void __cxl_handle_cor_ras(struct device *dev,
>>  
>>  	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>>  	status = readl(addr);
>> -	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
>> -		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>> +	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
>> +		return;
>> +	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>> +
>> +	if (is_cxl_memdev(dev))
>>  		trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
>> -	}
>> +	else if (dev_is_pci(dev))
> How would you get here otherwise? Is it useful to know it is a pci device
> here?
This dev_is_pci() check is not necessary and can be removed.
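
With the check dropped, the dispatch reduces to a plain if/else. A minimal
userspace sketch, assuming hypothetical stand-ins (a struct device with an
is_memdev flag, trace_memdev() and trace_port() in place of the real
tracepoints, and an illustrative COR_STATUS_MASK value):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define COR_STATUS_MASK 0x7fu	/* illustrative mask value for the sketch */

struct device {
	bool is_memdev;
};

enum trace_kind { TRACE_NONE, TRACE_MEMDEV, TRACE_PORT };

/* Records which stub tracepoint fired last, for demonstration only. */
static enum trace_kind last_trace = TRACE_NONE;

static void trace_memdev(const struct device *dev, uint32_t status)
{
	(void)dev;
	(void)status;
	last_trace = TRACE_MEMDEV;
}

static void trace_port(const struct device *dev, uint32_t status)
{
	(void)dev;
	(void)status;
	last_trace = TRACE_PORT;
}

static void handle_cor_ras(const struct device *dev, uint32_t status)
{
	assert(dev != NULL);

	/* Nothing to log when no correctable status bit is set. */
	if (!(status & COR_STATUS_MASK))
		return;

	if (dev->is_memdev)
		trace_memdev(dev, status);
	else
		trace_port(dev, status);	/* anything else is a port */
}
```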

>> +		trace_cxl_port_aer_correctable_error(dev, status);
>>  }
>>  
>>  static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
>> @@ -720,7 +724,11 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
>>  	}
>>  
>>  	header_log_copy(ras_base, hl);
>> -	trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
>> +	if (is_cxl_memdev(dev))
>> +		trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
>> +	else if (dev_is_pci(dev))
> as above.
Got it and thank you.
>> +		trace_cxl_port_aer_uncorrectable_error(dev, status, fe, hl);
>> +
>>  	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>>  
>>  	return true;
>> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
>> index 8672b42ee4d1..1c4368a7b50b 100644
>> --- a/drivers/cxl/core/trace.h
>> +++ b/drivers/cxl/core/trace.h
>> @@ -48,6 +48,34 @@
>>  	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
>>  )
>>  
>> +TRACE_EVENT(cxl_port_aer_uncorrectable_error,
>> +	TP_PROTO(struct device *dev, u32 status, u32 fe, u32 *hl),
>> +	TP_ARGS(dev, status, fe, hl),
>> +	TP_STRUCT__entry(
>> +		__string(devname, dev_name(dev))
>> +		__string(host, dev_name(dev->parent))
>> +		__field(u32, status)
>> +		__field(u32, first_error)
>> +		__array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
>> +	),
>> +	TP_fast_assign(
>> +		__assign_str(devname);
>> +		__assign_str(host);
>> +		__entry->status = status;
>> +		__entry->first_error = fe;
>> +		/*
>> +		 * Embed the 512B headerlog data for user app retrieval and
>> +		 * parsing, but no need to print this in the trace buffer.
> I'm not sure any printing as such goes on in the trace buffer. It is from
> the data in the trace buffer I think.
Right, the comment indicates it is not printed but included here because
the buffer can be accessed by applications.

Regards,
Terry
>> +		 */
>> +		memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
>> +	),
>> +	TP_printk("device=%s host=%s status: '%s' first_error: '%s'",
>> +		  __get_str(devname), __get_str(host),
>> +		  show_uc_errs(__entry->status),
>> +		  show_uc_errs(__entry->first_error)
>> +	)
>> +);
>> +
>>  TRACE_EVENT(cxl_aer_uncorrectable_error,
>>  	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
>>  	TP_ARGS(cxlmd, status, fe, hl),
>> @@ -96,6 +124,25 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
>>  	{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" }	\
>>  )
>>  
>> +TRACE_EVENT(cxl_port_aer_correctable_error,
>> +	TP_PROTO(struct device *dev, u32 status),
>> +	TP_ARGS(dev, status),
>> +	TP_STRUCT__entry(
>> +		__string(devname, dev_name(dev))
>> +		__string(host, dev_name(dev->parent))
>> +		__field(u32, status)
>> +	),
>> +	TP_fast_assign(
>> +		__assign_str(devname);
>> +		__assign_str(host);
>> +		__entry->status = status;
>> +	),
>> +	TP_printk("device=%s host=%s status='%s'",
>> +		  __get_str(devname), __get_str(host),
>> +		  show_ce_errs(__entry->status)
>> +	)
>> +);
>> +
>>  TRACE_EVENT(cxl_aer_correctable_error,
>>  	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
>>  	TP_ARGS(cxlmd, status),


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 14/14] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
  2024-10-30 16:11   ` Jonathan Cameron
@ 2024-10-30 21:34     ` Bowman, Terry
  0 siblings, 0 replies; 55+ messages in thread
From: Bowman, Terry @ 2024-10-30 21:34 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa

Hi Jonathan,

On 10/30/2024 11:11 AM, Jonathan Cameron wrote:
> On Fri, 25 Oct 2024 16:03:05 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> pci_driver::cxl_err_handlers are not currently assigned handler callbacks.
>> The handlers can't be set in the pci_driver static definition because the
>> CXL PCIe port devices are bound to the portdrv driver which is not CXL
>> driver aware.
>>
>> Add cxl_assign_port_error_handlers() in the cxl_core module. This
>> function will assign the default handlers for a CXL PCIe port device.
>>
>> When the CXL port (cxl_port or cxl_dport) is destroyed the CXL PCIe port
>> device's pci_driver::cxl_err_handlers must be set to NULL to prevent future
>> use. Create cxl_clear_port_error_handlers() and register it to be called
>> when the CXL port device (cxl_port or cxl_dport) is destroyed.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> One trivial comment inline. 
>> ---
>>  drivers/cxl/core/pci.c | 35 +++++++++++++++++++++++++++++++++++
>>  1 file changed, 35 insertions(+)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index eeb4a64ba5b5..5f7570c6173c 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -839,8 +839,36 @@ static bool cxl_port_error_detected(struct pci_dev *pdev)
>>  	return ue;
>>  }
>>  
>> +static const struct cxl_error_handlers cxl_port_error_handlers = {
>> +	.error_detected	= cxl_port_error_detected,
>> +	.cor_error_detected	= cxl_port_cor_error_detected,
> Odd spacing?  I'd just use a single space as aligning these almost
> always makes for messy future patches.

Thanks for pointing that out. I'll fix it.

Regards,
Terry

>> +};
>> +
>> +static void cxl_assign_port_error_handlers(struct pci_dev *pdev)
>> +{
>> +	struct pci_driver *pdrv = pdev->driver;
>> +
>> +	if (!pdrv)
>> +		return;
>> +
>> +	pdrv->cxl_err_handler = &cxl_port_error_handlers;
>> +}
>> +
>> +static void cxl_clear_port_error_handlers(void *data)
>> +{
>> +	struct pci_dev *pdev = data;
>> +	struct pci_driver *pdrv = pdev->driver;
>> +
>> +	if (!pdrv)
>> +		return;
>> +
>> +	pdrv->cxl_err_handler = NULL;
>> +}
>> +
>>  void cxl_uport_init_ras_reporting(struct cxl_port *port)
>>  {
>> +	struct pci_dev *pdev = to_pci_dev(port->uport_dev);
>> +
>>  	/* uport may have more than 1 downstream EP. Check if already mapped. */
>>  	if (port->uport_regs.ras) {
>>  		dev_warn(&port->dev, "RAS is already mapped\n");
>> @@ -853,6 +881,9 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
>>  		dev_err(&port->dev, "Failed to map RAS capability.\n");
>>  		return;
>>  	}
>> +
>> +	cxl_assign_port_error_handlers(pdev);
>> +	devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
>>  }
>>  EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, CXL);
>>  
>> @@ -865,6 +896,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>>  {
>>  	struct device *dport_dev = dport->dport_dev;
>>  	struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
>> +	struct pci_dev *pdev = to_pci_dev(dport_dev);
>>  
>>  	if (dport->rch && host_bridge->native_aer) {
>>  		cxl_dport_map_rch_aer(dport);
>> @@ -883,6 +915,9 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>>  		dev_err(dport_dev, "Failed to map RAS capability.\n");
>>  		return;
>>  	}
>> +
>> +	cxl_assign_port_error_handlers(pdev);
>> +	devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
>>  }
>>  EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
>>  


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 01/14] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver'
  2024-10-25 21:02 ` [PATCH v2 01/14] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver' Terry Bowman
  2024-10-30 15:14   ` Jonathan Cameron
@ 2024-10-31 16:20   ` Dave Jiang
  2024-10-31 20:24   ` Fan Ni
  2 siblings, 0 replies; 55+ messages in thread
From: Dave Jiang @ 2024-10-31 16:20 UTC (permalink / raw)
  To: Terry Bowman, ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa



On 10/25/24 2:02 PM, Terry Bowman wrote:
> CXL.io provides a PCIe-like protocol error implementation, but CXL.io and
> PCIe have different handling requirements.
> 
> The PCIe AER service driver may attempt to recover PCIe devices with
> uncorrectable errors, whereas recovery is not used for CXL.io because of
> the potential for corrupting what can be system memory.
> 
> Create pci_driver::cxl_err_handlers similar to pci_driver::error_handler.
> Create handlers for correctable and uncorrectable CXL.io error
> handling.
> 
> The CXL error handlers will be used in future patches adding CXL PCIe
> port protocol error handling.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> ---
>  include/linux/pci.h | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 573b4c4c2be6..106ac83e3a7b 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -886,6 +886,14 @@ struct pci_error_handlers {
>  	void (*cor_error_detected)(struct pci_dev *dev);
>  };
>  
> +/* CXL bus error event callbacks */
> +struct cxl_error_handlers {
> +	/* CXL bus error detected on this device */
> +	bool (*error_detected)(struct pci_dev *dev);
> +
> +	/* Allow device driver to record more details of a correctable error */
> +	void (*cor_error_detected)(struct pci_dev *dev);
> +};
>  
>  struct module;
>  
> @@ -956,6 +964,7 @@ struct pci_driver {
>  	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
>  	u32  (*sriov_get_vf_total_msix)(struct pci_dev *pf);
>  	const struct pci_error_handlers *err_handler;
> +	const struct cxl_error_handlers *cxl_err_handler;
>  	const struct attribute_group **groups;
>  	const struct attribute_group **dev_groups;
>  	struct device_driver	driver;


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 02/14] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port support
  2024-10-25 21:02 ` [PATCH v2 02/14] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port support Terry Bowman
  2024-10-30 15:13   ` Jonathan Cameron
@ 2024-10-31 16:21   ` Dave Jiang
  2024-10-31 20:25   ` Fan Ni
  2 siblings, 0 replies; 55+ messages in thread
From: Dave Jiang @ 2024-10-31 16:21 UTC (permalink / raw)
  To: Terry Bowman, ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa



On 10/25/24 2:02 PM, Terry Bowman wrote:
> The AER service driver already includes support for CXL restricted host
> (RCH) downstream port error handling. The current implementation is based
> on CXL 1.1 using a root complex event collector.
> 
> Rename function interfaces and parameters where necessary to include
> virtual hierarchy (VH) mode CXL PCIe port error handling alongside the RCH
> handling.[1] The CXL PCIe port error handling will be added in a future
> patch.
> 
> Limit changes to renaming variable and function names. No functional
> changes are added.
> 
> [1] CXL 3.1 Spec, 9.12.2 CXL Virtual Hierarchy
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
>  drivers/pci/pcie/aer.c | 28 ++++++++++++++--------------
>  1 file changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 13b8586924ea..fe6edf26279e 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1029,7 +1029,7 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>  	return 0;
>  }
>  
> -static void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
> +static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  {
>  	/*
>  	 * Internal errors of an RCEC indicate an AER error in an
> @@ -1052,30 +1052,30 @@ static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>  	return *handles_cxl;
>  }
>  
> -static bool handles_cxl_errors(struct pci_dev *rcec)
> +static bool handles_cxl_errors(struct pci_dev *dev)
>  {
>  	bool handles_cxl = false;
>  
> -	if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
> -	    pcie_aer_is_native(rcec))
> -		pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> +	    pcie_aer_is_native(dev))
> +		pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
>  
>  	return handles_cxl;
>  }
>  
> -static void cxl_rch_enable_rcec(struct pci_dev *rcec)
> +static void cxl_enable_internal_errors(struct pci_dev *dev)
>  {
> -	if (!handles_cxl_errors(rcec))
> +	if (!handles_cxl_errors(dev))
>  		return;
>  
> -	pci_aer_unmask_internal_errors(rcec);
> -	pci_info(rcec, "CXL: Internal errors unmasked");
> +	pci_aer_unmask_internal_errors(dev);
> +	pci_info(dev, "CXL: Internal errors unmasked");
>  }
>  
>  #else
> -static inline void cxl_rch_enable_rcec(struct pci_dev *dev) { }
> -static inline void cxl_rch_handle_error(struct pci_dev *dev,
> -					struct aer_err_info *info) { }
> +static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
> +static inline void cxl_handle_error(struct pci_dev *dev,
> +				    struct aer_err_info *info) { }
>  #endif
>  
>  /**
> @@ -1113,7 +1113,7 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  
>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>  {
> -	cxl_rch_handle_error(dev, info);
> +	cxl_handle_error(dev, info);
>  	pci_aer_handle_error(dev, info);
>  	pci_dev_put(dev);
>  }
> @@ -1491,7 +1491,7 @@ static int aer_probe(struct pcie_device *dev)
>  		return status;
>  	}
>  
> -	cxl_rch_enable_rcec(port);
> +	cxl_enable_internal_errors(port);
>  	aer_enable_rootport(rpc);
>  	pci_info(port, "enabled with IRQ %d\n", dev->irq);
>  	return 0;


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 03/14] cxl/pci: Introduce helper functions pcie_is_cxl() and pcie_is_cxl_port()
  2024-10-25 21:02 ` [PATCH v2 03/14] cxl/pci: Introduce helper functions pcie_is_cxl() and pcie_is_cxl_port() Terry Bowman
  2024-10-30 14:57   ` Jonathan Cameron
@ 2024-10-31 16:25   ` Dave Jiang
  2024-10-31 21:22   ` Fan Ni
  2 siblings, 0 replies; 55+ messages in thread
From: Dave Jiang @ 2024-10-31 16:25 UTC (permalink / raw)
  To: Terry Bowman, ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa



On 10/25/24 2:02 PM, Terry Bowman wrote:
> CXL and AER drivers need the ability to identify CXL devices and CXL port
> devices.
> 
> First, add set_pcie_cxl() with logic checking for CXL Flexbus DVSEC
> presence. The CXL Flexbus DVSEC presence is used because it is required
> for all the CXL PCIe devices.[1]
> 
> Add boolean 'struct pci_dev::is_cxl' to cache the CXL Flexbus DVSEC
> presence.
> 
> Add pcie_is_cxl() as a macro to return 'struct pci_dev::is_cxl'.
> 
> Add pcie_is_cxl_port() to check if a device is a CXL root port, CXL
> upstream switch port, or CXL downstream switch port. Also, verify the
> CXL extensions DVSEC for port is present.[1]
> 
> [1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
>     Capability (DVSEC) ID Assignment, Table 8-2
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> ---
>  drivers/pci/pci.c             | 14 ++++++++++++++
>  drivers/pci/probe.c           | 10 ++++++++++
>  include/linux/pci.h           |  4 ++++
>  include/uapi/linux/pci_regs.h |  3 ++-
>  4 files changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 7d85c04fbba2..c1b243aec61c 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5034,6 +5034,20 @@ static u16 cxl_port_dvsec(struct pci_dev *dev)
>  					 PCI_DVSEC_CXL_PORT);
>  }
>  
> +bool pcie_is_cxl_port(struct pci_dev *dev)
> +{
> +	if (!pcie_is_cxl(dev))
> +		return false;
> +
> +	if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM) &&
> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
> +		return false;
> +
> +	return cxl_port_dvsec(dev);
> +}
> +EXPORT_SYMBOL_GPL(pcie_is_cxl_port);
> +
>  static bool cxl_sbr_masked(struct pci_dev *dev)
>  {
>  	u16 dvsec, reg;
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 4f68414c3086..9324eb345f11 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1631,6 +1631,14 @@ static void set_pcie_thunderbolt(struct pci_dev *dev)
>  		dev->is_thunderbolt = 1;
>  }
>  
> +static void set_pcie_cxl(struct pci_dev *dev)
> +{
> +	u16 dvsec = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
> +					      PCI_DVSEC_CXL_FLEXBUS);
> +	if (dvsec)
> +		dev->is_cxl = 1;
> +}
> +
>  static void set_pcie_untrusted(struct pci_dev *dev)
>  {
>  	struct pci_dev *parent;
> @@ -1945,6 +1953,8 @@ int pci_setup_device(struct pci_dev *dev)
>  	/* Need to have dev->cfg_size ready */
>  	set_pcie_thunderbolt(dev);
>  
> +	set_pcie_cxl(dev);
> +
>  	set_pcie_untrusted(dev);
>  
>  	/* "Unknown power state" */
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 106ac83e3a7b..d3b1af9fb273 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -443,6 +443,7 @@ struct pci_dev {
>  	unsigned int	is_hotplug_bridge:1;
>  	unsigned int	shpc_managed:1;		/* SHPC owned by shpchp */
>  	unsigned int	is_thunderbolt:1;	/* Thunderbolt controller */
> +	unsigned int	is_cxl:1;               /* CXL alternate protocol */
>  	/*
>  	 * Devices marked being untrusted are the ones that can potentially
>  	 * execute DMA attacks and similar. They are typically connected
> @@ -743,6 +744,9 @@ static inline bool pci_is_vga(struct pci_dev *pdev)
>  	return false;
>  }
>  
> +#define pcie_is_cxl(dev) (dev->is_cxl)
> +bool pcie_is_cxl_port(struct pci_dev *dev);
> +
>  #define for_each_pci_bridge(dev, bus)				\
>  	list_for_each_entry(dev, &bus->devices, bus_list)	\
>  		if (!pci_is_bridge(dev)) {} else
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index 12323b3334a9..5df6c74963c5 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -1186,9 +1186,10 @@
>  #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
>  #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000
>  
> -/* Compute Express Link (CXL r3.1, sec 8.1.5) */
> +/* Compute Express Link (CXL r3.1, sec 8.1) */
>  #define PCI_DVSEC_CXL_PORT				3
>  #define PCI_DVSEC_CXL_PORT_CTL				0x0c
>  #define PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR		0x00000001
> +#define PCI_DVSEC_CXL_FLEXBUS				7
>  
>  #endif /* LINUX_PCI_REGS_H */
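
For illustration only (not part of the posted patch), the three-step check in pcie_is_cxl_port() can be modeled in user space. The sketch assumes a hypothetical struct mock_pci_dev that stands in for the few struct pci_dev fields involved. The port type encodings match include/uapi/linux/pci_regs.h; every mock_* name is invented here.

```c
#include <stdbool.h>
#include <stdint.h>

/* PCIe device/port type encodings, as in include/uapi/linux/pci_regs.h */
#define PCI_EXP_TYPE_ENDPOINT   0x0
#define PCI_EXP_TYPE_ROOT_PORT  0x4
#define PCI_EXP_TYPE_UPSTREAM   0x5
#define PCI_EXP_TYPE_DOWNSTREAM 0x6

/* Hypothetical stand-in for the struct pci_dev state the check consults. */
struct mock_pci_dev {
	int pcie_type;        /* what pci_pcie_type() would return */
	bool is_cxl;          /* cached result of the Flexbus DVSEC lookup */
	uint16_t port_dvsec;  /* config offset of the CXL port DVSEC, 0 if absent */
};

/* Mirrors the patch: CXL device, then port type, then port DVSEC present. */
static bool mock_pcie_is_cxl_port(const struct mock_pci_dev *dev)
{
	if (!dev->is_cxl)
		return false;

	switch (dev->pcie_type) {
	case PCI_EXP_TYPE_ROOT_PORT:
	case PCI_EXP_TYPE_UPSTREAM:
	case PCI_EXP_TYPE_DOWNSTREAM:
		break;
	default:
		return false;
	}

	return dev->port_dvsec != 0;
}
```

The explicit comparison against 0 makes the u16-to-bool conversion that the patch relies on visible; the behavior is the same.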



* Re: [PATCH v2 04/14] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type
  2024-10-25 21:02 ` [PATCH v2 04/14] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
  2024-10-30 14:56   ` Jonathan Cameron
@ 2024-10-31 16:27   ` Dave Jiang
  2024-10-31 21:27   ` Fan Ni
  2 siblings, 0 replies; 55+ messages in thread
From: Dave Jiang @ 2024-10-31 16:27 UTC (permalink / raw)
  To: Terry Bowman, ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa



On 10/25/24 2:02 PM, Terry Bowman wrote:
> The AER driver and aer_event tracing currently log 'PCIe Bus Type'
> for all errors.
> 
> Update the driver and aer_event tracing to log 'CXL Bus Type' for CXL devices.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> ---
>  drivers/pci/pcie/aer.c  | 14 ++++++++------
>  include/ras/ras_event.h |  9 ++++++---
>  2 files changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index fe6edf26279e..53e9a11f6c0f 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -699,13 +699,14 @@ static void __aer_print_error(struct pci_dev *dev,
>  
>  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  {
> +	const char *bus_type = pcie_is_cxl(dev) ? "CXL"  : "PCIe";
>  	int layer, agent;
>  	int id = pci_dev_id(dev);
>  	const char *level;
>  
>  	if (!info->status) {
> -		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> -			aer_error_severity_string[info->severity]);
> +		pci_err(dev, "%s Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> +			bus_type, aer_error_severity_string[info->severity]);
>  		goto out;
>  	}
>  
> @@ -714,8 +715,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  
>  	level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
>  
> -	pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> -		   aer_error_severity_string[info->severity],
> +	pci_printk(level, dev, "%s Bus Error: severity=%s, type=%s, (%s)\n",
> +		   bus_type, aer_error_severity_string[info->severity],
>  		   aer_error_layer[layer], aer_agent_string[agent]);
>  
>  	pci_printk(level, dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
> @@ -730,7 +731,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  	if (info->id && info->error_dev_num > 1 && info->id == id)
>  		pci_err(dev, "  Error of this Agent is reported first\n");
>  
> -	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
> +	trace_aer_event(dev_name(&dev->dev), bus_type, (info->status & ~info->mask),
>  			info->severity, info->tlp_header_valid, &info->tlp);
>  }
>  
> @@ -764,6 +765,7 @@ EXPORT_SYMBOL_GPL(cper_severity_to_aer);
>  void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  		   struct aer_capability_regs *aer)
>  {
> +	const char *bus_type = pcie_is_cxl(dev) ? "CXL"  : "PCIe";
>  	int layer, agent, tlp_header_valid = 0;
>  	u32 status, mask;
>  	struct aer_err_info info;
> @@ -798,7 +800,7 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  	if (tlp_header_valid)
>  		__print_tlp_header(dev, &aer->header_log);
>  
> -	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
> +	trace_aer_event(dev_name(&dev->dev), bus_type, (status & ~mask),
>  			aer_severity, tlp_header_valid, &aer->header_log);
>  }
>  EXPORT_SYMBOL_NS_GPL(pci_print_aer, CXL);
> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
> index e5f7ee0864e7..1bf8e7050ba8 100644
> --- a/include/ras/ras_event.h
> +++ b/include/ras/ras_event.h
> @@ -297,15 +297,17 @@ TRACE_EVENT(non_standard_event,
>  
>  TRACE_EVENT(aer_event,
>  	TP_PROTO(const char *dev_name,
> +		 const char *bus_type,
>  		 const u32 status,
>  		 const u8 severity,
>  		 const u8 tlp_header_valid,
>  		 struct pcie_tlp_log *tlp),
>  
> -	TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp),
> +	TP_ARGS(dev_name, bus_type, status, severity, tlp_header_valid, tlp),
>  
>  	TP_STRUCT__entry(
>  		__string(	dev_name,	dev_name	)
> +		__string(	bus_type,	bus_type	)
>  		__field(	u32,		status		)
>  		__field(	u8,		severity	)
>  		__field(	u8, 		tlp_header_valid)
> @@ -314,6 +316,7 @@ TRACE_EVENT(aer_event,
>  
>  	TP_fast_assign(
>  		__assign_str(dev_name);
> +		__assign_str(bus_type);
>  		__entry->status		= status;
>  		__entry->severity	= severity;
>  		__entry->tlp_header_valid = tlp_header_valid;
> @@ -325,8 +328,8 @@ TRACE_EVENT(aer_event,
>  		}
>  	),
>  
> -	TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s\n",
> -		__get_str(dev_name),
> +	TP_printk("%s %s Bus Error: severity=%s, %s, TLP Header=%s\n",
> +		__get_str(dev_name), __get_str(bus_type),
>  		__entry->severity == AER_CORRECTABLE ? "Corrected" :
>  			__entry->severity == AER_FATAL ?
>  			"Fatal" : "Uncorrected, non-fatal",



* Re: [PATCH v2 05/14] PCI/AER: Add CXL PCIe port correctable error support in AER service driver
  2024-10-25 21:02 ` [PATCH v2 05/14] PCI/AER: Add CXL PCIe port correctable error support in AER service driver Terry Bowman
  2024-10-30 15:13   ` Jonathan Cameron
@ 2024-10-31 16:37   ` Dave Jiang
  1 sibling, 0 replies; 55+ messages in thread
From: Dave Jiang @ 2024-10-31 16:37 UTC (permalink / raw)
  To: Terry Bowman, ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa



On 10/25/24 2:02 PM, Terry Bowman wrote:
> The AER service driver doesn't currently handle CXL protocol errors
> reported by CXL root ports, CXL upstream switch ports, and CXL downstream
> switch ports. Consequently, RAS protocol errors from CXL PCIe port devices
> are not properly logged or handled.
> 
> These errors are reported to the OS via the root port's AER correctable
> and uncorrectable internal error fields. While the AER driver supports
> handling downstream port protocol errors in restricted CXL host (RCH) mode
> also known as CXL1.1, it lacks the same functionality for CXL PCIe ports
> operating in virtual hierarchy (VH) mode.
> 
> To address this gap, update the AER driver to handle CXL PCIe port device
> protocol correctable errors (CE).
> 
> Make this update alongside the existing downstream port RCH error handling
> logic, extending support to CXL PCIe ports in VH mode.
> 
> is_internal_error() is currently limited by CONFIG_PCIEAER_CXL kernel
> config. Update is_internal_error()'s function declaration such that it is
> always available regardless if CONFIG_PCIEAER_CXL kernel config is enabled
> or disabled.
> 
> The uncorrectable error (UCE) handling will be added in a future patch.
> 
> [1] CXL 3.1 Spec, 12.2.2 CXL Root Ports, Downstream Switch Ports, and
> Upstream Switch Ports
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

With the commit log updated as Jonathan suggested,
Reviewed-by: Dave Jiang <dave.jiang@intel.com>

> ---
>  drivers/pci/pcie/aer.c | 59 ++++++++++++++++++++++++++++--------------
>  1 file changed, 39 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 53e9a11f6c0f..1d3e5b929661 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -941,8 +941,15 @@ static bool find_source_device(struct pci_dev *parent,
>  	return true;
>  }
>  
> -#ifdef CONFIG_PCIEAER_CXL
> +static bool is_internal_error(struct aer_err_info *info)
> +{
> +	if (info->severity == AER_CORRECTABLE)
> +		return info->status & PCI_ERR_COR_INTERNAL;
>  
> +	return info->status & PCI_ERR_UNC_INTN;
> +}
> +
> +#ifdef CONFIG_PCIEAER_CXL
>  /**
>   * pci_aer_unmask_internal_errors - unmask internal errors
>   * @dev: pointer to the pcie_dev data structure
> @@ -994,14 +1001,6 @@ static bool cxl_error_is_native(struct pci_dev *dev)
>  	return (pcie_ports_native || host->native_aer);
>  }
>  
> -static bool is_internal_error(struct aer_err_info *info)
> -{
> -	if (info->severity == AER_CORRECTABLE)
> -		return info->status & PCI_ERR_COR_INTERNAL;
> -
> -	return info->status & PCI_ERR_UNC_INTN;
> -}
> -
>  static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>  {
>  	struct aer_err_info *info = (struct aer_err_info *)data;
> @@ -1033,14 +1032,23 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>  
>  static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  {
> -	/*
> -	 * Internal errors of an RCEC indicate an AER error in an
> -	 * RCH's downstream port. Check and handle them in the CXL.mem
> -	 * device driver.
> -	 */
> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> -	    is_internal_error(info))
> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
>  		pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
> +
> +	if (info->severity == AER_CORRECTABLE) {
> +		struct pci_driver *pdrv = dev->driver;
> +		int aer = dev->aer_cap;
> +
> +		if (aer)
> +			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
> +					       info->status);
> +
> +		if (pdrv && pdrv->cxl_err_handler &&
> +		    pdrv->cxl_err_handler->cor_error_detected)
> +			pdrv->cxl_err_handler->cor_error_detected(dev);
> +
> +		pcie_clear_device_status(dev);
> +	}
>  }
>  
>  static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
> @@ -1058,9 +1066,13 @@ static bool handles_cxl_errors(struct pci_dev *dev)
>  {
>  	bool handles_cxl = false;
>  
> -	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> -	    pcie_aer_is_native(dev))
> +	if (!pcie_aer_is_native(dev))
> +		return false;
> +
> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
>  		pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
> +	else
> +		handles_cxl = pcie_is_cxl_port(dev);
>  
>  	return handles_cxl;
>  }
> @@ -1078,6 +1090,10 @@ static void cxl_enable_internal_errors(struct pci_dev *dev)
>  static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
>  static inline void cxl_handle_error(struct pci_dev *dev,
>  				    struct aer_err_info *info) { }
> +static bool handles_cxl_errors(struct pci_dev *dev)
> +{
> +	return false;
> +}
>  #endif
>  
>  /**
> @@ -1115,8 +1131,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  
>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>  {
> -	cxl_handle_error(dev, info);
> -	pci_aer_handle_error(dev, info);
> +	if (is_internal_error(info) && handles_cxl_errors(dev))
> +		cxl_handle_error(dev, info);
> +	else
> +		pci_aer_handle_error(dev, info);
> +
>  	pci_dev_put(dev);
>  }
>  
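
As a rough user-space model of the new dispatch in handle_error_source() (a sketch, not the kernel code): an error takes the CXL path only when it is an internal error and the device handles CXL errors; everything else stays on the PCIe path. The two status bit values are taken from include/uapi/linux/pci_regs.h. The mock_* types and the handles_cxl parameter, which stands in for the result of handles_cxl_errors(dev), are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

#define PCI_ERR_COR_INTERNAL 0x00004000u /* Corrected Internal Error status */
#define PCI_ERR_UNC_INTN     0x00400000u /* Uncorrectable Internal Error status */

/* Hypothetical mirrors of the AER severity values and struct aer_err_info. */
enum mock_severity { MOCK_AER_NONFATAL, MOCK_AER_FATAL, MOCK_AER_CORRECTABLE };
enum mock_path { MOCK_PATH_PCIE, MOCK_PATH_CXL };

struct mock_err_info {
	enum mock_severity severity;
	uint32_t status;
};

/* Same test as is_internal_error(): pick the bit that matches the severity. */
static bool mock_is_internal_error(const struct mock_err_info *info)
{
	if (info->severity == MOCK_AER_CORRECTABLE)
		return (info->status & PCI_ERR_COR_INTERNAL) != 0;

	return (info->status & PCI_ERR_UNC_INTN) != 0;
}

/* handles_cxl models the return value of handles_cxl_errors(dev). */
static enum mock_path mock_route(const struct mock_err_info *info,
				 bool handles_cxl)
{
	if (mock_is_internal_error(info) && handles_cxl)
		return MOCK_PATH_CXL;

	return MOCK_PATH_PCIE;
}
```

Note that the two paths are mutually exclusive in this patch, whereas before it both cxl_handle_error() and pci_aer_handle_error() were called for every error source.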



* Re: [PATCH v2 06/14] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe port devices
  2024-10-25 21:02 ` [PATCH v2 06/14] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe port devices Terry Bowman
  2024-10-30 15:37   ` Jonathan Cameron
@ 2024-10-31 16:58   ` Dave Jiang
  2024-11-01 13:30     ` Bowman, Terry
  1 sibling, 1 reply; 55+ messages in thread
From: Dave Jiang @ 2024-10-31 16:58 UTC (permalink / raw)
  To: Terry Bowman, ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa



On 10/25/24 2:02 PM, Terry Bowman wrote:
> The AER service driver's aer_get_device_error_info() function doesn't read
> uncorrectable (UCE) fatal error status from PCIe upstream port devices,
> including CXL upstream switch ports. As a result, fatal errors are not
> logged or handled as needed for CXL PCIe upstream switch port devices.
> 
> Update the aer_get_device_error_info() function to read the UCE fatal
> status for all CXL PCIe port devices.
> 
> The fatal error status will be used in future patches implementing
> CXL PCIe port uncorrectable error handling and logging.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/pci/pcie/aer.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 1d3e5b929661..d772f123c6a2 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1250,6 +1250,7 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
>  	} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
>  		   type == PCI_EXP_TYPE_RC_EC ||
>  		   type == PCI_EXP_TYPE_DOWNSTREAM ||
> +		   type == PCI_EXP_TYPE_UPSTREAM ||

At a minimum, we should probably do something like
(pcie_is_cxl(dev) && type == PCI_EXP_TYPE_UPSTREAM)
instead, so we don't regress the original PCI behavior?

>  		   info->severity == AER_NONFATAL) {
>  
>  		/* Link is still healthy for IO reads */



* Re: [PATCH v2 01/14] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver'
  2024-10-25 21:02 ` [PATCH v2 01/14] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver' Terry Bowman
  2024-10-30 15:14   ` Jonathan Cameron
  2024-10-31 16:20   ` Dave Jiang
@ 2024-10-31 20:24   ` Fan Ni
  2 siblings, 0 replies; 55+ messages in thread
From: Fan Ni @ 2024-10-31 20:24 UTC (permalink / raw)
  To: Terry Bowman
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa

On Fri, Oct 25, 2024 at 04:02:52PM -0500, Terry Bowman wrote:
> CXL.io provides PCIe like protocol error implementation, but CXL.io and
> PCIe have different handling requirements.
> 
> The PCIe AER service driver may attempt recovering PCIe devices with
> uncorrectable errors while recovery is not used for CXL.io. Recovery is not
> used in the CXL.io recovery because of the potential for corruption on
> what can be system memory.
> 
> Create pci_driver::cxl_err_handlers similar to pci_driver::error_handler.
> Create handlers for correctable and uncorrectable CXL.io error
> handling.
> 
> The CXL error handlers will be used in future patches adding CXL PCIe
> port protocol error handling.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---

Reviewed-by: Fan Ni <fan.ni@samsung.com>

>  include/linux/pci.h | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 573b4c4c2be6..106ac83e3a7b 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -886,6 +886,14 @@ struct pci_error_handlers {
>  	void (*cor_error_detected)(struct pci_dev *dev);
>  };
>  
> +/* CXL bus error event callbacks */
> +struct cxl_error_handlers {
> +	/* CXL bus error detected on this device */
> +	bool (*error_detected)(struct pci_dev *dev);
> +
> +	/* Allow device driver to record more details of a correctable error */
> +	void (*cor_error_detected)(struct pci_dev *dev);
> +};
>  
>  struct module;
>  
> @@ -956,6 +964,7 @@ struct pci_driver {
>  	int  (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
>  	u32  (*sriov_get_vf_total_msix)(struct pci_dev *pf);
>  	const struct pci_error_handlers *err_handler;
> +	const struct cxl_error_handlers *cxl_err_handler;
>  	const struct attribute_group **groups;
>  	const struct attribute_group **dev_groups;
>  	struct device_driver	driver;
> -- 
> 2.34.1
> 

-- 
Fan Ni


* Re: [PATCH v2 02/14] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port support
  2024-10-25 21:02 ` [PATCH v2 02/14] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port support Terry Bowman
  2024-10-30 15:13   ` Jonathan Cameron
  2024-10-31 16:21   ` Dave Jiang
@ 2024-10-31 20:25   ` Fan Ni
  2 siblings, 0 replies; 55+ messages in thread
From: Fan Ni @ 2024-10-31 20:25 UTC (permalink / raw)
  To: Terry Bowman
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa

On Fri, Oct 25, 2024 at 04:02:53PM -0500, Terry Bowman wrote:
> The AER service driver already includes support for CXL restricted host
> (RCH) downstream port error handling. The current implementation is based
> on CXL1.1 using a root complex event collector.
> 
> Rename function interfaces and parameters where necessary to include
> virtual hierarchy (VH) mode CXL PCIe port error handling alongside the RCH
> handling.[1] The CXL PCIe port error handling will be added in a future
> patch.
> 
> Limit changes to renaming variable and function names. No functional
> changes are added.
> 
> [1] CXL 3.1 Spec, 9.12.2 CXL Virtual Hierarchy
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

Reviewed-by: Fan Ni <fan.ni@samsung.com>

> ---
>  drivers/pci/pcie/aer.c | 28 ++++++++++++++--------------
>  1 file changed, 14 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 13b8586924ea..fe6edf26279e 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1029,7 +1029,7 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>  	return 0;
>  }
>  
> -static void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
> +static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  {
>  	/*
>  	 * Internal errors of an RCEC indicate an AER error in an
> @@ -1052,30 +1052,30 @@ static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>  	return *handles_cxl;
>  }
>  
> -static bool handles_cxl_errors(struct pci_dev *rcec)
> +static bool handles_cxl_errors(struct pci_dev *dev)
>  {
>  	bool handles_cxl = false;
>  
> -	if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
> -	    pcie_aer_is_native(rcec))
> -		pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
> +	if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
> +	    pcie_aer_is_native(dev))
> +		pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
>  
>  	return handles_cxl;
>  }
>  
> -static void cxl_rch_enable_rcec(struct pci_dev *rcec)
> +static void cxl_enable_internal_errors(struct pci_dev *dev)
>  {
> -	if (!handles_cxl_errors(rcec))
> +	if (!handles_cxl_errors(dev))
>  		return;
>  
> -	pci_aer_unmask_internal_errors(rcec);
> -	pci_info(rcec, "CXL: Internal errors unmasked");
> +	pci_aer_unmask_internal_errors(dev);
> +	pci_info(dev, "CXL: Internal errors unmasked");
>  }
>  
>  #else
> -static inline void cxl_rch_enable_rcec(struct pci_dev *dev) { }
> -static inline void cxl_rch_handle_error(struct pci_dev *dev,
> -					struct aer_err_info *info) { }
> +static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
> +static inline void cxl_handle_error(struct pci_dev *dev,
> +				    struct aer_err_info *info) { }
>  #endif
>  
>  /**
> @@ -1113,7 +1113,7 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  
>  static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
>  {
> -	cxl_rch_handle_error(dev, info);
> +	cxl_handle_error(dev, info);
>  	pci_aer_handle_error(dev, info);
>  	pci_dev_put(dev);
>  }
> @@ -1491,7 +1491,7 @@ static int aer_probe(struct pcie_device *dev)
>  		return status;
>  	}
>  
> -	cxl_rch_enable_rcec(port);
> +	cxl_enable_internal_errors(port);
>  	aer_enable_rootport(rpc);
>  	pci_info(port, "enabled with IRQ %d\n", dev->irq);
>  	return 0;
> -- 
> 2.34.1
> 

-- 
Fan Ni


* Re: [PATCH v2 03/14] cxl/pci: Introduce helper functions pcie_is_cxl() and pcie_is_cxl_port()
  2024-10-25 21:02 ` [PATCH v2 03/14] cxl/pci: Introduce helper functions pcie_is_cxl() and pcie_is_cxl_port() Terry Bowman
  2024-10-30 14:57   ` Jonathan Cameron
  2024-10-31 16:25   ` Dave Jiang
@ 2024-10-31 21:22   ` Fan Ni
  2 siblings, 0 replies; 55+ messages in thread
From: Fan Ni @ 2024-10-31 21:22 UTC (permalink / raw)
  To: Terry Bowman
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa

On Fri, Oct 25, 2024 at 04:02:54PM -0500, Terry Bowman wrote:
> CXL and AER drivers need the ability to identify CXL devices and CXL port
> devices.
> 
> First, add set_pcie_cxl() with logic checking for CXL Flexbus DVSEC
> presence. The CXL Flexbus DVSEC presence is used because it is required
> for all the CXL PCIe devices.[1]
> 
> Add boolean 'struct pci_dev::is_cxl' with the purpose to cache the CXL
> Flexbus presence.
> 
> Add pcie_is_cxl() as a macro to return 'struct pci_dev::is_cxl',
> 
> Add pcie_is_cxl_port() to check if a device is a CXL root port, CXL
> upstream switch port, or CXL downstream switch port. Also, verify the
> CXL extensions DVSEC for port is present.[1]
> 
> [1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
>     Capability (DVSEC) ID Assignment, Table 8-2
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---

Reviewed-by: Fan Ni <fan.ni@samsung.com>

>  drivers/pci/pci.c             | 14 ++++++++++++++
>  drivers/pci/probe.c           | 10 ++++++++++
>  include/linux/pci.h           |  4 ++++
>  include/uapi/linux/pci_regs.h |  3 ++-
>  4 files changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 7d85c04fbba2..c1b243aec61c 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -5034,6 +5034,20 @@ static u16 cxl_port_dvsec(struct pci_dev *dev)
>  					 PCI_DVSEC_CXL_PORT);
>  }
>  
> +bool pcie_is_cxl_port(struct pci_dev *dev)
> +{
> +	if (!pcie_is_cxl(dev))
> +		return false;
> +
> +	if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM) &&
> +	    (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
> +		return false;
> +
> +	return cxl_port_dvsec(dev);
> +}
> +EXPORT_SYMBOL_GPL(pcie_is_cxl_port);
> +
>  static bool cxl_sbr_masked(struct pci_dev *dev)
>  {
>  	u16 dvsec, reg;
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 4f68414c3086..9324eb345f11 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1631,6 +1631,14 @@ static void set_pcie_thunderbolt(struct pci_dev *dev)
>  		dev->is_thunderbolt = 1;
>  }
>  
> +static void set_pcie_cxl(struct pci_dev *dev)
> +{
> +	u16 dvsec = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
> +					      PCI_DVSEC_CXL_FLEXBUS);
> +	if (dvsec)
> +		dev->is_cxl = 1;
> +}
> +
>  static void set_pcie_untrusted(struct pci_dev *dev)
>  {
>  	struct pci_dev *parent;
> @@ -1945,6 +1953,8 @@ int pci_setup_device(struct pci_dev *dev)
>  	/* Need to have dev->cfg_size ready */
>  	set_pcie_thunderbolt(dev);
>  
> +	set_pcie_cxl(dev);
> +
>  	set_pcie_untrusted(dev);
>  
>  	/* "Unknown power state" */
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 106ac83e3a7b..d3b1af9fb273 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -443,6 +443,7 @@ struct pci_dev {
>  	unsigned int	is_hotplug_bridge:1;
>  	unsigned int	shpc_managed:1;		/* SHPC owned by shpchp */
>  	unsigned int	is_thunderbolt:1;	/* Thunderbolt controller */
> +	unsigned int	is_cxl:1;               /* CXL alternate protocol */
>  	/*
>  	 * Devices marked being untrusted are the ones that can potentially
>  	 * execute DMA attacks and similar. They are typically connected
> @@ -743,6 +744,9 @@ static inline bool pci_is_vga(struct pci_dev *pdev)
>  	return false;
>  }
>  
> +#define pcie_is_cxl(dev) (dev->is_cxl)
> +bool pcie_is_cxl_port(struct pci_dev *dev);
> +
>  #define for_each_pci_bridge(dev, bus)				\
>  	list_for_each_entry(dev, &bus->devices, bus_list)	\
>  		if (!pci_is_bridge(dev)) {} else
> diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
> index 12323b3334a9..5df6c74963c5 100644
> --- a/include/uapi/linux/pci_regs.h
> +++ b/include/uapi/linux/pci_regs.h
> @@ -1186,9 +1186,10 @@
>  #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL		0x00ff0000
>  #define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX	0xff000000
>  
> -/* Compute Express Link (CXL r3.1, sec 8.1.5) */
> +/* Compute Express Link (CXL r3.1, sec 8.1) */
>  #define PCI_DVSEC_CXL_PORT				3
>  #define PCI_DVSEC_CXL_PORT_CTL				0x0c
>  #define PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR		0x00000001
> +#define PCI_DVSEC_CXL_FLEXBUS				7
>  
>  #endif /* LINUX_PCI_REGS_H */
> -- 
> 2.34.1
> 

-- 
Fan Ni


* Re: [PATCH v2 04/14] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type
  2024-10-25 21:02 ` [PATCH v2 04/14] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
  2024-10-30 14:56   ` Jonathan Cameron
  2024-10-31 16:27   ` Dave Jiang
@ 2024-10-31 21:27   ` Fan Ni
  2 siblings, 0 replies; 55+ messages in thread
From: Fan Ni @ 2024-10-31 21:27 UTC (permalink / raw)
  To: Terry Bowman
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa

On Fri, Oct 25, 2024 at 04:02:55PM -0500, Terry Bowman wrote:
> The AER driver and aer_event tracing currently log 'PCIe Bus Type'
> for all errors.
> 
> Update the driver and aer_event tracing to log 'CXL Bus Type' for CXL devices.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---

Reviewed-by: Fan Ni <fan.ni@samsung.com>

>  drivers/pci/pcie/aer.c  | 14 ++++++++------
>  include/ras/ras_event.h |  9 ++++++---
>  2 files changed, 14 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index fe6edf26279e..53e9a11f6c0f 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -699,13 +699,14 @@ static void __aer_print_error(struct pci_dev *dev,
>  
>  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  {
> +	const char *bus_type = pcie_is_cxl(dev) ? "CXL"  : "PCIe";
>  	int layer, agent;
>  	int id = pci_dev_id(dev);
>  	const char *level;
>  
>  	if (!info->status) {
> -		pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> -			aer_error_severity_string[info->severity]);
> +		pci_err(dev, "%s Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> +			bus_type, aer_error_severity_string[info->severity]);
>  		goto out;
>  	}
>  
> @@ -714,8 +715,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  
>  	level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
>  
> -	pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> -		   aer_error_severity_string[info->severity],
> +	pci_printk(level, dev, "%s Bus Error: severity=%s, type=%s, (%s)\n",
> +		   bus_type, aer_error_severity_string[info->severity],
>  		   aer_error_layer[layer], aer_agent_string[agent]);
>  
>  	pci_printk(level, dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
> @@ -730,7 +731,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>  	if (info->id && info->error_dev_num > 1 && info->id == id)
>  		pci_err(dev, "  Error of this Agent is reported first\n");
>  
> -	trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
> +	trace_aer_event(dev_name(&dev->dev), bus_type, (info->status & ~info->mask),
>  			info->severity, info->tlp_header_valid, &info->tlp);
>  }
>  
> @@ -764,6 +765,7 @@ EXPORT_SYMBOL_GPL(cper_severity_to_aer);
>  void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  		   struct aer_capability_regs *aer)
>  {
> +	const char *bus_type = pcie_is_cxl(dev) ? "CXL"  : "PCIe";
>  	int layer, agent, tlp_header_valid = 0;
>  	u32 status, mask;
>  	struct aer_err_info info;
> @@ -798,7 +800,7 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  	if (tlp_header_valid)
>  		__print_tlp_header(dev, &aer->header_log);
>  
> -	trace_aer_event(dev_name(&dev->dev), (status & ~mask),
> +	trace_aer_event(dev_name(&dev->dev), bus_type, (status & ~mask),
>  			aer_severity, tlp_header_valid, &aer->header_log);
>  }
>  EXPORT_SYMBOL_NS_GPL(pci_print_aer, CXL);
> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
> index e5f7ee0864e7..1bf8e7050ba8 100644
> --- a/include/ras/ras_event.h
> +++ b/include/ras/ras_event.h
> @@ -297,15 +297,17 @@ TRACE_EVENT(non_standard_event,
>  
>  TRACE_EVENT(aer_event,
>  	TP_PROTO(const char *dev_name,
> +		 const char *bus_type,
>  		 const u32 status,
>  		 const u8 severity,
>  		 const u8 tlp_header_valid,
>  		 struct pcie_tlp_log *tlp),
>  
> -	TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp),
> +	TP_ARGS(dev_name, bus_type, status, severity, tlp_header_valid, tlp),
>  
>  	TP_STRUCT__entry(
>  		__string(	dev_name,	dev_name	)
> +		__string(	bus_type,	bus_type	)
>  		__field(	u32,		status		)
>  		__field(	u8,		severity	)
>  		__field(	u8, 		tlp_header_valid)
> @@ -314,6 +316,7 @@ TRACE_EVENT(aer_event,
>  
>  	TP_fast_assign(
>  		__assign_str(dev_name);
> +		__assign_str(bus_type);
>  		__entry->status		= status;
>  		__entry->severity	= severity;
>  		__entry->tlp_header_valid = tlp_header_valid;
> @@ -325,8 +328,8 @@ TRACE_EVENT(aer_event,
>  		}
>  	),
>  
> -	TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s\n",
> -		__get_str(dev_name),
> +	TP_printk("%s %s Bus Error: severity=%s, %s, TLP Header=%s\n",
> +		__get_str(dev_name), __get_str(bus_type),
>  		__entry->severity == AER_CORRECTABLE ? "Corrected" :
>  			__entry->severity == AER_FATAL ?
>  			"Fatal" : "Uncorrected, non-fatal",
> -- 
> 2.34.1
> 

-- 
Fan Ni


* Re: [PATCH v2 06/14] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe port devices
  2024-10-31 16:58   ` Dave Jiang
@ 2024-11-01 13:30     ` Bowman, Terry
  0 siblings, 0 replies; 55+ messages in thread
From: Bowman, Terry @ 2024-11-01 13:30 UTC (permalink / raw)
  To: Dave Jiang, ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa

Hi Dave,

On 10/31/2024 11:58 AM, Dave Jiang wrote:
>
> On 10/25/24 2:02 PM, Terry Bowman wrote:
>> The AER service driver's aer_get_device_error_info() function doesn't read
>> uncorrectable (UCE) fatal error status from PCIe upstream port devices,
>> including CXL upstream switch ports. As a result, fatal errors are not
>> logged or handled as needed for CXL PCIe upstream switch port devices.
>>
>> Update the aer_get_device_error_info() function to read the UCE fatal
>> status for all CXL PCIe port devices.
>>
>> The fatal error status will be used in future patches implementing
>> CXL PCIe port uncorrectable error handling and logging.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>  drivers/pci/pcie/aer.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 1d3e5b929661..d772f123c6a2 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -1250,6 +1250,7 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
>>  	} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
>>  		   type == PCI_EXP_TYPE_RC_EC ||
>>  		   type == PCI_EXP_TYPE_DOWNSTREAM ||
>> +		   type == PCI_EXP_TYPE_UPSTREAM ||
> At minimal we probably should do something like
> (pcie_is_cxl(dev) && type == PCI_EXP_TYPE_UPSTREAM)
> instead so we don't regress the original PCI behavior?

Good idea. I'll change the condition as you recommend.
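
For reference, a minimal standalone sketch of the port-type check with the suggested CXL guard. This is not the kernel code: `struct fake_pci_dev`, its `is_cxl` flag, and `port_type_reads_uce_status()` are hypothetical stand-ins, and `pcie_is_cxl()` is assumed to report whether the device operates in CXL mode. Only the `PCI_EXP_TYPE_*` values mirror include/uapi/linux/pci_regs.h, and only the port-type terms visible in the quoted hunk are modeled.

```c
#include <stdbool.h>

/* Port type values as defined in include/uapi/linux/pci_regs.h. */
#define PCI_EXP_TYPE_ENDPOINT   0x0
#define PCI_EXP_TYPE_ROOT_PORT  0x4
#define PCI_EXP_TYPE_UPSTREAM   0x5
#define PCI_EXP_TYPE_DOWNSTREAM 0x6
#define PCI_EXP_TYPE_RC_EC      0xa

/* Hypothetical stand-in for struct pci_dev; not the kernel structure. */
struct fake_pci_dev {
	int type;
	bool is_cxl;
};

/* Assumed behavior: true when the device operates in CXL mode. */
static bool pcie_is_cxl(const struct fake_pci_dev *dev)
{
	return dev->is_cxl;
}

/*
 * Port-type part of the condition only. An upstream port qualifies
 * only when it is a CXL device, so plain PCIe upstream ports keep
 * their original behavior.
 */
static bool port_type_reads_uce_status(const struct fake_pci_dev *dev)
{
	int type = dev->type;

	return type == PCI_EXP_TYPE_ROOT_PORT ||
	       type == PCI_EXP_TYPE_RC_EC ||
	       type == PCI_EXP_TYPE_DOWNSTREAM ||
	       (pcie_is_cxl(dev) && type == PCI_EXP_TYPE_UPSTREAM);
}
```

With this guard, a CXL upstream switch port takes the branch while a non-CXL upstream port does not.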

Regards,
Terry



* Re: [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging
  2024-10-25 21:02 [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Terry Bowman
                   ` (15 preceding siblings ...)
  2024-10-28  1:05 ` [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Bowman, Terry
@ 2024-11-01 18:00 ` Fan Ni
  2024-11-01 18:28   ` Bowman, Terry
  16 siblings, 1 reply; 55+ messages in thread
From: Fan Ni @ 2024-11-01 18:00 UTC (permalink / raw)
  To: Terry Bowman
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa

On Fri, Oct 25, 2024 at 04:02:51PM -0500, Terry Bowman wrote:
> This is a continuation of the CXL port error handling RFC from earlier.[1]
> The RFC resulted in the decision to add CXL PCIe port error handling to
> the existing RCH downstream port handling in the AER service driver. This
> patchset adds the CXL PCIe port protocol error handling and logging.
> 
> The first 7 patches update the existing AER service driver to support CXL
> PCIe port protocol error handling and reporting. This includes AER service
> driver changes for adding correctable and uncorrectable error support, CXL
> specific recovery handling, and addition of CXL driver callback handlers.
> 
> The following 7 patches address CXL driver support for CXL PCIe port
> protocol errors. This includes the following changes to the CXL drivers:
> mapping CXL port and downstream port RAS registers, interface updates for
> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
> adding port specific error handlers, and protocol error logging.
> 
> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
> 
> Testing:

Hi Terry,
I tried to test the patchset with the aer_inject tool (with the patch you
shared in the last version) and hit some issues.
Could you take a look and share some insight? Thanks.

Below is some test setup info and the results.

I tested two topologies:
  a. one memdev directly attached to a HB with only one RP;
  b. a topology with cxl switch:
         HB
        /  \
      RP0   RP1
       |
     switch
       |
 ----------------
 |    |    |    |
mem0 mem1 mem2 mem3

For both topologies, I cannot reproduce the system panic shown in your cover
letter.

By the way, I tried building CXL both as modules and built into the kernel.

Below, I use the direct-attached topology (a) as an example to show what I
tried. I hope to get some clarity about the test and what I missed or did wrong.

-------------------------------------
pci device info on the test VM 
root@fan:~# lspci
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
0c:00.0 PCI bridge: Intel Corporation Device 7075
0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
root@fan:~# 
-------------------------------------

The aer injection input file looks like below,

-------------------------------------
fan:~/cxl/cxl-test-tool$ cat /tmp/internal 
AER
PCI_ID 0000:0c:00.0
UNCOR_STATUS INTERNAL
HEADER_LOG 0 1 2 3
------------------------------------

dmesg after aer injection 

ssh root@localhost -p 2024 "dmesg"
[  613.195352] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
[  613.195830] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
[  613.196253] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
[  613.198199] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
--------------------------------------
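
As a side note on reading the log above, here is a small sketch that decodes the injected mask. It assumes the two hex words in the aer_inject line are the correctable and uncorrectable status values, in that order. `PCI_ERR_UNC_INTN` matches the definition in include/uapi/linux/pci_regs.h; `lowest_set_bit()` is a hypothetical helper written for this illustration.

```c
#include <stdint.h>

/* Uncorrectable Internal Error status bit, as in pci_regs.h. */
#define PCI_ERR_UNC_INTN 0x00400000u

/* Return the index of the lowest set bit, or -1 if no bit is set. */
static int lowest_set_bit(uint32_t status)
{
	for (int bit = 0; bit < 32; bit++) {
		if (status & (UINT32_C(1) << bit))
			return bit;
	}
	return -1;
}
```

Under that assumption, `lowest_set_bit(0x00400000)` gives 22, which lines up with the `[22] UncorrIntErr` line that the AER driver prints when it dumps the status register.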

The problem seems to be related to the cxl_err_handler not being assigned for
the cxlmem device.

in
cxl_do_recovery() {
...
    cxl_walk_bridge(bridge, cxl_report_error_detected, &status);
    if (status)
        panic("CXL cachemem error. Invoking panic");
...
}
The status returned is false, so there is no panic().
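
To make the flow concrete, here is a rough userspace model of the walk. All `fake_*` types and the function names are hypothetical stand-ins, not the kernel structures, and the `bool` return type of `error_detected` is an assumption. It only illustrates that a device whose driver has no `cxl_err_handler` is skipped, and that the final status is true only if some handler reports an error.

```c
#include <stdbool.h>
#include <stddef.h>

struct fake_dev;

/* Hypothetical stand-in for struct cxl_error_handlers. */
struct fake_cxl_error_handlers {
	bool (*error_detected)(struct fake_dev *dev);
};

/* Hypothetical stand-in for struct pci_driver. */
struct fake_driver {
	const struct fake_cxl_error_handlers *cxl_err_handler;
};

/* Hypothetical stand-in for struct pci_dev. */
struct fake_dev {
	const char *name;
	struct fake_driver *driver;
	bool ras_error_pending;	/* assumed source of the handler's answer */
};

/* Example handler: reports an error only if one is pending. */
static bool port_error_detected(struct fake_dev *dev)
{
	return dev->ras_error_pending;
}

/* Models the per-device callback: skip devices without a handler. */
static void report_error_detected(struct fake_dev *dev, bool *status)
{
	struct fake_driver *pdrv = dev->driver;

	if (pdrv && pdrv->cxl_err_handler &&
	    pdrv->cxl_err_handler->error_detected)
		*status |= pdrv->cxl_err_handler->error_detected(dev);
}

/* Models the bridge walk: true means the caller would panic. */
static bool walk_reports_error(struct fake_dev **devs, size_t n)
{
	bool status = false;

	for (size_t i = 0; i < n; i++)
		report_error_detected(devs[i], &status);
	return status;
}
```

In this model, a root port with a handler but no pending error, plus an endpoint with a NULL handler, yields a false status, which matches the "no panic" outcome described above.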

I added some dev_dbg() calls to the code to debug this.
Below are the debug output and the kernel code changes used for debugging.
--------------------------------------
fan:~/cxl/cxl-test-tool$ cxl-tool.py --cmd dmesg | grep XXX
[    1.738909] cxl_mem:cxl_mem_probe:205: cxl_mem mem0: XXX: add endpoint
[    1.739188] cxl_mem:devm_cxl_add_endpoint:85: cxl_port port1: XXX: add endpoint
[    1.739509] cxl_mem:devm_cxl_add_endpoint:92: cxl_mem mem0: XXX: init ep port aer
[    1.739876] cxl_core:cxl_dport_init_ras_reporting:907: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 1
[    1.740338] cxl_core:cxl_dport_init_ras_reporting:913: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 2
[    1.740812] cxl_core:cxl_dport_init_ras_reporting:927: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 3
[    1.741273] cxl_core:cxl_assign_port_error_handlers:851: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____)
[    1.741812] cxl_core:cxl_assign_port_error_handlers:855: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____)
[    1.742263] cxl_core:cxl_assign_port_error_handlers:857: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____) (____ptrval____)
fan:~/cxl/cxl-test-tool$ 
--------------------------------------

dmesg after error injection:
--------------------------------------
ssh root@localhost -p 2024 "dmesg"
[  228.544439] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
[  228.544977] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
[  228.545381] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
[  228.545879] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/00000000
[  228.546360] pcieport 0000:0c:00.0:    [22] UncorrIntErr          
[  228.546698] pcieport 0000:0c:00.0: AER: XXX: call cxl_err_handler: 00000000a268bfcb 000000009e0da039
[  228.547103] cxl_pci 0000:0d:00.0: AER: XXX: call cxl_err_handler: 00000000b9f08b93 0000000000000000
[  228.547515] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
fan:~/cxl/cxl-test-tool$ 
--------------------------------------


Kernel changes:
--------------------------------------
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 5f7570c6173c..bcecd1283fc6 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -848,10 +848,13 @@ static void cxl_assign_port_error_handlers(struct pci_dev *pdev)
 {
 	struct pci_driver *pdrv = pdev->driver;
 
+    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p\n enter", pdev);
 	if (!pdrv)
 		return;
 
+    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p\n", pdrv);
 	pdrv->cxl_err_handler = &cxl_port_error_handlers;
+    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p %p\n", pdrv, pdrv->cxl_err_handler);
 }
 
 static void cxl_clear_port_error_handlers(void *data)
@@ -869,12 +872,14 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
 {
 	struct pci_dev *pdev = to_pci_dev(port->uport_dev);
 
+    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 1\n");
 	/* uport may have more than 1 downstream EP. Check if already mapped. */
 	if (port->uport_regs.ras) {
 		dev_warn(&port->dev, "RAS is already mapped\n");
 		return;
 	}
 
+    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 2\n");
 	port->reg_map.host = &port->dev;
 	if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
 				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
@@ -882,6 +887,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
 		return;
 	}
 
+    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 3\n");
 	cxl_assign_port_error_handlers(pdev);
 	devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
 }
@@ -898,11 +904,13 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
 	struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
 	struct pci_dev *pdev = to_pci_dev(dport_dev);
 
+    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 1\n");
 	if (dport->rch && host_bridge->native_aer) {
 		cxl_dport_map_rch_aer(dport);
 		cxl_disable_rch_root_ints(dport);
 	}
 
+    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 2\n");
 	/* dport may have more than 1 downstream EP. Check if already mapped. */
 	if (dport->regs.ras) {
 		dev_warn(dport_dev, "RAS is already mapped\n");
@@ -916,6 +924,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
 		return;
 	}
 
+    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 3\n");
 	cxl_assign_port_error_handlers(pdev);
 	devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
 }
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 067fd6389562..aa824584f8dd 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -82,13 +82,15 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
 	 * Now that the path to the root is established record all the
 	 * intervening ports in the chain.
 	 */
+    dev_dbg(host, "XXX: add endpoint\n");
 	for (iter = parent_port, down = NULL; !is_cxl_root(iter);
 	     down = iter, iter = to_cxl_port(iter->dev.parent)) {
 		struct cxl_ep *ep;
 
 		ep = cxl_ep_load(iter, cxlmd);
 		ep->next = down;
-		cxl_init_ep_ports_aer(ep);
+        dev_dbg(ep->ep, "XXX: init ep port aer\n");
+        cxl_init_ep_ports_aer(ep);
 	}
 
 	/* Note: endpoint port component registers are derived from @cxlds */
@@ -200,6 +202,7 @@ static int cxl_mem_probe(struct device *dev)
 			return -ENXIO;
 		}
 
+        dev_dbg(dev, "XXX: add endpoint\n");
 		rc = devm_cxl_add_endpoint(endpoint_parent, cxlmd, dport);
 		if (rc)
 			return rc;
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 3785f4ca5103..8285f14994e8 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -294,6 +294,11 @@ static int cxl_report_error_detected(struct pci_dev *dev, void *data)
 	bool *status = data;
 
 	device_lock(&dev->dev);
+    if (pdrv) {
+        dev_dbg(&dev->dev, "XXX: call cxl_err_handler: %p %p\n", pdrv, pdrv->cxl_err_handler);
+    } else {
+        dev_dbg(&dev->dev, "XXX: call cxl_err_handler: %p no handler\n", pdrv);
+    }
 	if (pdrv && pdrv->cxl_err_handler &&
 	    pdrv->cxl_err_handler->error_detected) {
 		const struct cxl_error_handlers *cxl_err_handler =
--------------------------------------

Fan
> 
> Below are test results for this patchset using Qemu with CXL root
> port(0c:00.0), CXL upstream switchport(0d:00.0), CXL downstream
> switchport(0e:00.0). A CXL endpoint(0f:00.0) CE and UCE logs are
> also added to show the existing PCIe endpoint handling is not changed.
> 
> This was tested using aer-inject updated to support CE and UCE internal
> error injection. CXL RAS was set using a test patch (not upstreamed but can
> provide if needed).
> 
>  - Root port UCE:
>  root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>  pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>  pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
>  pcieport 0000:0c:00.0:    [22] UncorrIntErr
>  aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>  cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
>  CPU: 1 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
>  Tainted: [E]=UNSIGNED_MODULE
>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>  Call Trace:
>   <TASK>
>   dump_stack_lvl+0x27/0x90
>   dump_stack+0x10/0x20
>   panic+0x33e/0x380
>   cxl_do_recovery+0x116/0x120
>   ? srso_return_thunk+0x5/0x5f
>   aer_isr+0x3e0/0x710
>   irq_thread_fn+0x28/0x70
>   irq_thread+0x179/0x240
>   ? srso_return_thunk+0x5/0x5f
>   ? __pfx_irq_thread_fn+0x10/0x10
>   ? __pfx_irq_thread_dtor+0x10/0x10
>   ? __pfx_irq_thread+0x10/0x10
>   kthread+0xf5/0x130
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork+0x3c/0x60
>   ? __pfx_kthread+0x10/0x10
>   ret_from_fork_asm+0x1a/0x30
>   </TASK>
>  Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
...
-- 
Fan Ni


* Re: [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging
  2024-11-01 18:00 ` Fan Ni
@ 2024-11-01 18:28   ` Bowman, Terry
  2024-11-01 19:11     ` Fan Ni
  2024-11-01 22:11     ` Fan Ni
  0 siblings, 2 replies; 55+ messages in thread
From: Bowman, Terry @ 2024-11-01 18:28 UTC (permalink / raw)
  To: Fan Ni
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa

[-- Attachment #1: Type: text/plain, Size: 14071 bytes --]

Hi Fan,

I added comments below.

On 11/1/2024 1:00 PM, Fan Ni wrote:
> On Fri, Oct 25, 2024 at 04:02:51PM -0500, Terry Bowman wrote:
>> This is a continuation of the CXL port error handling RFC from earlier.[1]
>> The RFC resulted in the decision to add CXL PCIe port error handling to
>> the existing RCH downstream port handling in the AER service driver. This
>> patchset adds the CXL PCIe port protocol error handling and logging.
>>
>> The first 7 patches update the existing AER service driver to support CXL
>> PCIe port protocol error handling and reporting. This includes AER service
>> driver changes for adding correctable and uncorrectable error support, CXL
>> specific recovery handling, and addition of CXL driver callback handlers.
>>
>> The following 7 patches address CXL driver support for CXL PCIe port
>> protocol errors. This includes the following changes to the CXL drivers:
>> mapping CXL port and downstream port RAS registers, interface updates for
>> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
>> adding port specific error handlers, and protocol error logging.
>>
>> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
>>
>> Testing:
> Hi Terry,
> I tried to test the patchset with aer_inject tool (with the patch you shared
> in the last version), and hit some issues.
> Could you help check and give some insights? Thanks.
>
> Below are some test setup info and results.
>
> I tested two topology,
>   a. one memdev directly attaced to a HB with only one RP;
>   b. a topology with cxl switch:
>          HB
>         /  \
>       RP0   RP1
>        |
>      switch
>        |
>  ----------------
>  |    |    |    |
> mem0 mem1 mem2 mem3
>
> For both topologies, I cannot reproduce the system panic shown in your cover
> letter.  
>
> btw, I tried both compile cxl as modules and in the kernel.
>
> Below, I will use the direct-attached topology (a) as an example to show what I
> tried, hope can get some clarity about the test and what I missed or did wrong.
>
> -------------------------------------
> pci device info on the test VM 
> root@fan:~# lspci
> 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
> 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
> 00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
> 00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> 00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> 00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
> 00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
> 00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
> 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
> 0c:00.0 PCI bridge: Intel Corporation Device 7075
> 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
> root@fan:~# 
> -------------------------------------
>
> The aer injection input file looks like below,
>
> -------------------------------------
> fan:~/cxl/cxl-test-tool$ cat /tmp/internal 
> AER
> PCI_ID 0000:0c:00.0
> UNCOR_STATUS INTERNAL
> HEADER_LOG 0 1 2 3
> ------------------------------------
>
> dmesg after aer injection 
>
> ssh root@localhost -p 2024 "dmesg"
> [  613.195352] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> [  613.195830] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> [  613.196253] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> [  613.198199] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
> -----------------------------------

This is likely because the device's CXL RAS status is not set, so the returned
status is false and the panic is bypassed. Unfortunately, aer-inject only sets
the AER status and triggers the interrupt. It does not set the CXL RAS status.

I attached 2 'test' patches. The first patch sets the device's RAS status to
simulate the error reporting. It will have to be adjusted because the patch
looks for a specific device's bus, which will likely differ from the bus of
the devices you test in your setup.

The 2nd patch enables UIE/CIE. I moved this out of the v2 patchset. I need to
revisit it to see whether it is needed in the patchset itself (not just as a
test patch).
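
A toy model of the behavior described above, as a sketch only. `struct fake_port_state`, `fake_aer_inject()`, and `fake_port_error_detected()` are hypothetical names, and the assumption is that the port handler reports an error only when the CXL RAS uncorrectable status is non-zero.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the two independent status registers. */
struct fake_port_state {
	uint32_t aer_uncor_status;	/* PCIe AER uncorrectable status */
	uint32_t ras_uncor_status;	/* CXL RAS uncorrectable status */
};

/* Models aer-inject: only the AER status is touched. */
static void fake_aer_inject(struct fake_port_state *port, uint32_t uncor)
{
	port->aer_uncor_status |= uncor;
}

/* Assumed handler behavior: look only at the CXL RAS status. */
static bool fake_port_error_detected(const struct fake_port_state *port)
{
	return port->ras_uncor_status != 0;
}
```

In this model, injecting 0x00400000 through `fake_aer_inject()` leaves `fake_port_error_detected()` returning false until something else, such as the first test patch, sets the RAS status.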

Regards,
Terry

>
> The problem seems to be related to the cxl_error_handler not been assigned for
> cxlmem device. 
>
> in
> cxl_do_recover() {
> ...
>     327     cxl_walk_bridge(bridge, cxl_report_error_detected, &status);                         
>     328     if (status)                                                                 
>     329         panic("CXL cachemem error. Invoking panic");                   
> ...
> }
> The status returned is false, so no panic().
>
> I tried to add some dev_dbg info to the code to debug.
> Below are the debug info and kernel code changes for debugging. 
> --------------------------------------
> fan:~/cxl/cxl-test-tool$ cxl-tool.py --cmd dmesg | grep XXX
> [    1.738909] cxl_mem:cxl_mem_probe:205: cxl_mem mem0: XXX: add endpoint
> [    1.739188] cxl_mem:devm_cxl_add_endpoint:85: cxl_port port1: XXX: add endpoint
> [    1.739509] cxl_mem:devm_cxl_add_endpoint:92: cxl_mem mem0: XXX: init ep port aer
> [    1.739876] cxl_core:cxl_dport_init_ras_reporting:907: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 1
> [    1.740338] cxl_core:cxl_dport_init_ras_reporting:913: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 2
> [    1.740812] cxl_core:cxl_dport_init_ras_reporting:927: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 3
> [    1.741273] cxl_core:cxl_assign_port_error_handlers:851: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____)
> [    1.741812] cxl_core:cxl_assign_port_error_handlers:855: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____)
> [    1.742263] cxl_core:cxl_assign_port_error_handlers:857: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____) (____ptrval____)
> fan:~/cxl/cxl-test-tool$ 
> --------------------------------------
>
> dmesg after error injection:
> --------------------------------------
> ssh root@localhost -p 2024 "dmesg"
> [  228.544439] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> [  228.544977] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> [  228.545381] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> [  228.545879] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/00000000
> [  228.546360] pcieport 0000:0c:00.0:    [22] UncorrIntErr          
> [  228.546698] pcieport 0000:0c:00.0: AER: XXX: call cxl_err_handler: 00000000a268bfcb 000000009e0da039
> [  228.547103] cxl_pci 0000:0d:00.0: AER: XXX: call cxl_err_handler: 00000000b9f08b93 0000000000000000
> [  228.547515] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
> fan:~/cxl/cxl-test-tool$ 
> --------------------------------------
>
>
> Kernel changes:
> --------------------------------------
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 5f7570c6173c..bcecd1283fc6 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -848,10 +848,13 @@ static void cxl_assign_port_error_handlers(struct pci_dev *pdev)
>  {
>  	struct pci_driver *pdrv = pdev->driver;
>  
> +    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p\n enter", pdev);
>  	if (!pdrv)
>  		return;
>  
> +    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p\n", pdrv);
>  	pdrv->cxl_err_handler = &cxl_port_error_handlers;
> +    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p %p\n", pdrv, pdrv->cxl_err_handler);
>  }
>  
>  static void cxl_clear_port_error_handlers(void *data)
> @@ -869,12 +872,14 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
>  {
>  	struct pci_dev *pdev = to_pci_dev(port->uport_dev);
>  
> +    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 1\n");
>  	/* uport may have more than 1 downstream EP. Check if already mapped. */
>  	if (port->uport_regs.ras) {
>  		dev_warn(&port->dev, "RAS is already mapped\n");
>  		return;
>  	}
>  
> +    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 2\n");
>  	port->reg_map.host = &port->dev;
>  	if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
>  				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
> @@ -882,6 +887,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
>  		return;
>  	}
>  
> +    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 3\n");
>  	cxl_assign_port_error_handlers(pdev);
>  	devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
>  }
> @@ -898,11 +904,13 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>  	struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
>  	struct pci_dev *pdev = to_pci_dev(dport_dev);
>  
> +    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 1\n");
>  	if (dport->rch && host_bridge->native_aer) {
>  		cxl_dport_map_rch_aer(dport);
>  		cxl_disable_rch_root_ints(dport);
>  	}
>  
> +    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 2\n");
>  	/* dport may have more than 1 downstream EP. Check if already mapped. */
>  	if (dport->regs.ras) {
>  		dev_warn(dport_dev, "RAS is already mapped\n");
> @@ -916,6 +924,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>  		return;
>  	}
>  
> +    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 3\n");
>  	cxl_assign_port_error_handlers(pdev);
>  	devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
>  }
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 067fd6389562..aa824584f8dd 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -82,13 +82,15 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
>  	 * Now that the path to the root is established record all the
>  	 * intervening ports in the chain.
>  	 */
> +    dev_dbg(host, "XXX: add endpoint\n");
>  	for (iter = parent_port, down = NULL; !is_cxl_root(iter);
>  	     down = iter, iter = to_cxl_port(iter->dev.parent)) {
>  		struct cxl_ep *ep;
>  
>  		ep = cxl_ep_load(iter, cxlmd);
>  		ep->next = down;
> -		cxl_init_ep_ports_aer(ep);
> +        dev_dbg(ep->ep, "XXX: init ep port aer\n");
> +        cxl_init_ep_ports_aer(ep);
>  	}
>  
>  	/* Note: endpoint port component registers are derived from @cxlds */
> @@ -200,6 +202,7 @@ static int cxl_mem_probe(struct device *dev)
>  			return -ENXIO;
>  		}
>  
> +        dev_dbg(dev, "XXX: add endpoint\n");
>  		rc = devm_cxl_add_endpoint(endpoint_parent, cxlmd, dport);
>  		if (rc)
>  			return rc;
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 3785f4ca5103..8285f14994e8 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -294,6 +294,11 @@ static int cxl_report_error_detected(struct pci_dev *dev, void *data)
>  	bool *status = data;
>  
>  	device_lock(&dev->dev);
> +    if (pdrv) {
> +        dev_dbg(&dev->dev, "XXX: call cxl_err_handler: %p %p\n", pdrv, pdrv->cxl_err_handler);
> +    } else {
> +        dev_dbg(&dev->dev, "XXX: call cxl_err_handler: %p no handler\n", pdrv);
> +    }
>  	if (pdrv && pdrv->cxl_err_handler &&
>  	    pdrv->cxl_err_handler->error_detected) {
>  		const struct cxl_error_handlers *cxl_err_handler =
> --------------------------------------
>
> Fan
>> Below are test results for this patchset using Qemu with CXL root
>> port(0c:00.0), CXL upstream switchport(0d:00.0), CXL downstream
>> switchport(0e:00.0). A CXL endpoint(0f:00.0) CE and UCE logs are
>> also added to show the existing PCIe endpoint handling is not changed.
>>
>> This was tested using aer-inject updated to support CE and UCE internal
>> error injection. CXL RAS was set using a test patch (not upstreamed but can
>> provide if needed).
>>
>>  - Root port UCE:
>>  root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
>>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>>  pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>>  pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
>>  pcieport 0000:0c:00.0:    [22] UncorrIntErr
>>  aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>>  cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
>>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
>>  CPU: 1 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
>>  Tainted: [E]=UNSIGNED_MODULE
>>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>>  Call Trace:
>>   <TASK>
>>   dump_stack_lvl+0x27/0x90
>>   dump_stack+0x10/0x20
>>   panic+0x33e/0x380
>>   cxl_do_recovery+0x116/0x120
>>   ? srso_return_thunk+0x5/0x5f
>>   aer_isr+0x3e0/0x710
>>   irq_thread_fn+0x28/0x70
>>   irq_thread+0x179/0x240
>>   ? srso_return_thunk+0x5/0x5f
>>   ? __pfx_irq_thread_fn+0x10/0x10
>>   ? __pfx_irq_thread_dtor+0x10/0x10
>>   ? __pfx_irq_thread+0x10/0x10
>>   kthread+0xf5/0x130
>>   ? __pfx_kthread+0x10/0x10
>>   ret_from_fork+0x3c/0x60
>>   ? __pfx_kthread+0x10/0x10
>>   ret_from_fork_asm+0x1a/0x30
>>   </TASK>
>>  Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
> ...

[-- Attachment #2: cxl-port-err-test-patches.tgz --]
[-- Type: application/x-compressed, Size: 1810 bytes --]


* Re: [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging
  2024-11-01 18:28   ` Bowman, Terry
@ 2024-11-01 19:11     ` Fan Ni
  2024-11-01 22:11     ` Fan Ni
  1 sibling, 0 replies; 55+ messages in thread
From: Fan Ni @ 2024-11-01 19:11 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: Fan Ni, ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa

On Fri, Nov 01, 2024 at 01:28:12PM -0500, Bowman, Terry wrote:
> Hi Fan,
> 
> I added comments below.
> 
> On 11/1/2024 1:00 PM, Fan Ni wrote:
> > On Fri, Oct 25, 2024 at 04:02:51PM -0500, Terry Bowman wrote:
> >> This is a continuation of the CXL port error handling RFC from earlier.[1]
> >> The RFC resulted in the decision to add CXL PCIe port error handling to
> >> the existing RCH downstream port handling in the AER service driver. This
> >> patchset adds the CXL PCIe port protocol error handling and logging.
> >>
> >> The first 7 patches update the existing AER service driver to support CXL
> >> PCIe port protocol error handling and reporting. This includes AER service
> >> driver changes for adding correctable and uncorrectable error support, CXL
> >> specific recovery handling, and addition of CXL driver callback handlers.
> >>
> >> The following 7 patches address CXL driver support for CXL PCIe port
> >> protocol errors. This includes the following changes to the CXL drivers:
> >> mapping CXL port and downstream port RAS registers, interface updates for
> >> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
> >> adding port specific error handlers, and protocol error logging.
> >>
> >> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
> >>
> >> Testing:
> > Hi Terry,
> > I tried to test the patchset with aer_inject tool (with the patch you shared
> > in the last version), and hit some issues.
> > Could you help check and give some insights? Thanks.
> >
> > Below are some test setup info and results.
> >
> > I tested two topologies:
> >   a. one memdev directly attached to a HB with only one RP;
> >   b. a topology with cxl switch:
> >          HB
> >         /  \
> >       RP0   RP1
> >        |
> >      switch
> >        |
> >  ----------------
> >  |    |    |    |
> > mem0 mem1 mem2 mem3
> >
> > For both topologies, I cannot reproduce the system panic shown in your cover
> > letter.  
> >
> > btw, I tried compiling cxl both as modules and built into the kernel.
> >
> > Below, I will use the direct-attached topology (a) as an example to show what I
> > tried. I hope to get some clarity about the test and what I missed or did wrong.
> >
> > -------------------------------------
> > pci device info on the test VM 
> > root@fan:~# lspci
> > 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
> > 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
> > 00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
> > 00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> > 00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> > 00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
> > 00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
> > 00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
> > 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
> > 0c:00.0 PCI bridge: Intel Corporation Device 7075
> > 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
> > root@fan:~# 
> > -------------------------------------
> >
> > The aer injection input file looks like below,
> >
> > -------------------------------------
> > fan:~/cxl/cxl-test-tool$ cat /tmp/internal 
> > AER
> > PCI_ID 0000:0c:00.0
> > UNCOR_STATUS INTERNAL
> > HEADER_LOG 0 1 2 3
> > ------------------------------------
> >
> > dmesg after aer injection 
> >
> > ssh root@localhost -p 2024 "dmesg"
> > [  613.195352] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> > [  613.195830] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> > [  613.196253] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> > [  613.198199] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
> > -----------------------------------
> 
> This is likely because the device's CXL RAS status is not set and as a result returns false and bypasses the panic.
> Unfortunately, the aer-inject only sets the AER status and triggers the interrupt. The CXL RAS is not set.
> 
> I attached 2 'test' patches. The first patch sets the device's RAS status to simulate the error reporting.
> This will have to be adjusted as the patch looks for a specific device's bus, and this will likely be a different
> bus than that of the devices you test in your setup.
> 
> The 2nd patch enables UIE/CIE. I moved this out of the v2 patchset. I need to revisit this to see if it is
> needed in the patchset itself (not just a test patch).
> 
> Regards,
> Terry

Hi Terry,

Thanks for the quick reply. With the two patches applied, the system
panics as expected.

Thanks,
Fan

> 
> >
> > The problem seems to be related to the cxl_error_handler not being assigned for
> > the cxlmem device.
> >
> > in
> > cxl_do_recovery() {
> > ...
> >     327     cxl_walk_bridge(bridge, cxl_report_error_detected, &status);                         
> >     328     if (status)                                                                 
> >     329         panic("CXL cachemem error. Invoking panic");                   
> > ...
> > }
> > The status returned is false, so no panic().
> >
> > I tried to add some dev_dbg info to the code to debug.
> > Below are the debug info and kernel code changes for debugging. 
> > --------------------------------------
> > fan:~/cxl/cxl-test-tool$ cxl-tool.py --cmd dmesg | grep XXX
> > [    1.738909] cxl_mem:cxl_mem_probe:205: cxl_mem mem0: XXX: add endpoint
> > [    1.739188] cxl_mem:devm_cxl_add_endpoint:85: cxl_port port1: XXX: add endpoint
> > [    1.739509] cxl_mem:devm_cxl_add_endpoint:92: cxl_mem mem0: XXX: init ep port aer
> > [    1.739876] cxl_core:cxl_dport_init_ras_reporting:907: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 1
> > [    1.740338] cxl_core:cxl_dport_init_ras_reporting:913: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 2
> > [    1.740812] cxl_core:cxl_dport_init_ras_reporting:927: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 3
> > [    1.741273] cxl_core:cxl_assign_port_error_handlers:851: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____)
> > [    1.741812] cxl_core:cxl_assign_port_error_handlers:855: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____)
> > [    1.742263] cxl_core:cxl_assign_port_error_handlers:857: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____) (____ptrval____)
> > fan:~/cxl/cxl-test-tool$ 
> > --------------------------------------
> >
> > dmesg after error injection:
> > --------------------------------------
> > ssh root@localhost -p 2024 "dmesg"
> > [  228.544439] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> > [  228.544977] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> > [  228.545381] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> > [  228.545879] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/00000000
> > [  228.546360] pcieport 0000:0c:00.0:    [22] UncorrIntErr          
> > [  228.546698] pcieport 0000:0c:00.0: AER: XXX: call cxl_err_handler: 00000000a268bfcb 000000009e0da039
> > [  228.547103] cxl_pci 0000:0d:00.0: AER: XXX: call cxl_err_handler: 00000000b9f08b93 0000000000000000
> > [  228.547515] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
> > fan:~/cxl/cxl-test-tool$ 
> > --------------------------------------
> >
> >
> > Kernel changes:
> > --------------------------------------
> > diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> > index 5f7570c6173c..bcecd1283fc6 100644
> > --- a/drivers/cxl/core/pci.c
> > +++ b/drivers/cxl/core/pci.c
> > @@ -848,10 +848,13 @@ static void cxl_assign_port_error_handlers(struct pci_dev *pdev)
> >  {
> >  	struct pci_driver *pdrv = pdev->driver;
> >  
> > +    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p\n enter", pdev);
> >  	if (!pdrv)
> >  		return;
> >  
> > +    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p\n", pdrv);
> >  	pdrv->cxl_err_handler = &cxl_port_error_handlers;
> > +    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p %p\n", pdrv, pdrv->cxl_err_handler);
> >  }
> >  
> >  static void cxl_clear_port_error_handlers(void *data)
> > @@ -869,12 +872,14 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
> >  {
> >  	struct pci_dev *pdev = to_pci_dev(port->uport_dev);
> >  
> > +    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 1\n");
> >  	/* uport may have more than 1 downstream EP. Check if already mapped. */
> >  	if (port->uport_regs.ras) {
> >  		dev_warn(&port->dev, "RAS is already mapped\n");
> >  		return;
> >  	}
> >  
> > +    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 2\n");
> >  	port->reg_map.host = &port->dev;
> >  	if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
> >  				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
> > @@ -882,6 +887,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
> >  		return;
> >  	}
> >  
> > +    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 3\n");
> >  	cxl_assign_port_error_handlers(pdev);
> >  	devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
> >  }
> > @@ -898,11 +904,13 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
> >  	struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
> >  	struct pci_dev *pdev = to_pci_dev(dport_dev);
> >  
> > +    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 1\n");
> >  	if (dport->rch && host_bridge->native_aer) {
> >  		cxl_dport_map_rch_aer(dport);
> >  		cxl_disable_rch_root_ints(dport);
> >  	}
> >  
> > +    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 2\n");
> >  	/* dport may have more than 1 downstream EP. Check if already mapped. */
> >  	if (dport->regs.ras) {
> >  		dev_warn(dport_dev, "RAS is already mapped\n");
> > @@ -916,6 +924,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
> >  		return;
> >  	}
> >  
> > +    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 3\n");
> >  	cxl_assign_port_error_handlers(pdev);
> >  	devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
> >  }
> > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > index 067fd6389562..aa824584f8dd 100644
> > --- a/drivers/cxl/mem.c
> > +++ b/drivers/cxl/mem.c
> > @@ -82,13 +82,15 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
> >  	 * Now that the path to the root is established record all the
> >  	 * intervening ports in the chain.
> >  	 */
> > +    dev_dbg(host, "XXX: add endpoint\n");
> >  	for (iter = parent_port, down = NULL; !is_cxl_root(iter);
> >  	     down = iter, iter = to_cxl_port(iter->dev.parent)) {
> >  		struct cxl_ep *ep;
> >  
> >  		ep = cxl_ep_load(iter, cxlmd);
> >  		ep->next = down;
> > -		cxl_init_ep_ports_aer(ep);
> > +        dev_dbg(ep->ep, "XXX: init ep port aer\n");
> > +        cxl_init_ep_ports_aer(ep);
> >  	}
> >  
> >  	/* Note: endpoint port component registers are derived from @cxlds */
> > @@ -200,6 +202,7 @@ static int cxl_mem_probe(struct device *dev)
> >  			return -ENXIO;
> >  		}
> >  
> > +        dev_dbg(dev, "XXX: add endpoint\n");
> >  		rc = devm_cxl_add_endpoint(endpoint_parent, cxlmd, dport);
> >  		if (rc)
> >  			return rc;
> > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> > index 3785f4ca5103..8285f14994e8 100644
> > --- a/drivers/pci/pcie/err.c
> > +++ b/drivers/pci/pcie/err.c
> > @@ -294,6 +294,11 @@ static int cxl_report_error_detected(struct pci_dev *dev, void *data)
> >  	bool *status = data;
> >  
> >  	device_lock(&dev->dev);
> > +    if (pdrv) {
> > +        dev_dbg(&dev->dev, "XXX: call cxl_err_handler: %p %p\n", pdrv, pdrv->cxl_err_handler);
> > +    } else {
> > +        dev_dbg(&dev->dev, "XXX: call cxl_err_handler: %p no handler\n", pdrv);
> > +    }
> >  	if (pdrv && pdrv->cxl_err_handler &&
> >  	    pdrv->cxl_err_handler->error_detected) {
> >  		const struct cxl_error_handlers *cxl_err_handler =
> > --------------------------------------
> >
> > Fan
> >> Below are test results for this patchset using Qemu with CXL root
> >> port(0c:00.0), CXL upstream switchport(0d:00.0), CXL downstream
> >> switchport(0e:00.0). A CXL endpoint(0f:00.0) CE and UCE logs are
> >> also added to show the existing PCIe endpoint handling is not changed.
> >>
> >> This was tested using aer-inject updated to support CE and UCE internal
> >> error injection. CXL RAS was set using a test patch (not upstreamed but can
> >> provide if needed).
> >>
> >>  - Root port UCE:
> >>  root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
> >>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> >>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> >>  pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> >>  pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
> >>  pcieport 0000:0c:00.0:    [22] UncorrIntErr
> >>  aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> >>  cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
> >>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
> >>  CPU: 1 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
> >>  Tainted: [E]=UNSIGNED_MODULE
> >>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> >>  Call Trace:
> >>   <TASK>
> >>   dump_stack_lvl+0x27/0x90
> >>   dump_stack+0x10/0x20
> >>   panic+0x33e/0x380
> >>   cxl_do_recovery+0x116/0x120
> >>   ? srso_return_thunk+0x5/0x5f
> >>   aer_isr+0x3e0/0x710
> >>   irq_thread_fn+0x28/0x70
> >>   irq_thread+0x179/0x240
> >>   ? srso_return_thunk+0x5/0x5f
> >>   ? __pfx_irq_thread_fn+0x10/0x10
> >>   ? __pfx_irq_thread_dtor+0x10/0x10
> >>   ? __pfx_irq_thread+0x10/0x10
> >>   kthread+0xf5/0x130
> >>   ? __pfx_kthread+0x10/0x10
> >>   ret_from_fork+0x3c/0x60
> >>   ? __pfx_kthread+0x10/0x10
> >>   ret_from_fork_asm+0x1a/0x30
> >>   </TASK>
> >>  Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> >>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
> > ...



-- 
Fan Ni

^ permalink raw reply	[flat|nested] 55+ messages in thread
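
A minimal user-space sketch of the behavior discussed in the message above,
under stated assumptions. It is not the kernel code: every struct and function
name here (model_dev, model_walk, model_would_panic, and so on) is hypothetical,
and the RAS status value 0x1 is an arbitrary placeholder. It models only what
the thread describes: aer-inject sets the AER status (00400000 in the logs) but
not the CXL RAS status, the endpoint driver has no CXL handler assigned, and the
walk therefore leaves the status false unless a test patch sets the RAS status.

```c
/*
 * User-space model only. All names are hypothetical and do not
 * match the kernel sources.
 */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct model_dev;

/* Stand-in for a driver's CXL error_detected callback. */
typedef bool (*model_error_detected_t)(const struct model_dev *dev);

struct model_dev {
	const char *name;
	uint32_t aer_uncor_status;      /* set by aer-inject */
	uint32_t cxl_ras_uncor_status;  /* not set by aer-inject */
	model_error_detected_t error_detected; /* NULL: no CXL handler */
};

/* Reports an error only when the modeled CXL RAS status has a bit set. */
static bool model_port_error_detected(const struct model_dev *dev)
{
	return dev->cxl_ras_uncor_status != 0;
}

/* Visit every device; true if any assigned handler reported an error. */
static bool model_walk(const struct model_dev *devs, size_t n)
{
	bool status = false;
	size_t i;

	assert(devs != NULL);
	for (i = 0; i < n; i++) {
		if (devs[i].error_detected &&
		    devs[i].error_detected(&devs[i]))
			status = true;
	}
	return status;
}

/* Returns true when the modeled recovery path would reach panic(). */
bool model_would_panic(bool ras_status_set)
{
	struct model_dev devs[] = {
		/* Root port: handler assigned, AER status injected. */
		{ "0000:0c:00.0", 0x00400000u,
		  ras_status_set ? 0x1u : 0x0u, model_port_error_detected },
		/* Endpoint: no CXL handler assigned in this model. */
		{ "0000:0d:00.0", 0x0u, 0x0u, NULL },
	};

	return model_walk(devs, sizeof(devs) / sizeof(devs[0]));
}
```

In this model, model_would_panic(false) corresponds to injecting with
aer-inject alone, and model_would_panic(true) corresponds to also applying a
test patch that sets the RAS status.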

* Re: [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging
  2024-11-01 18:28   ` Bowman, Terry
  2024-11-01 19:11     ` Fan Ni
@ 2024-11-01 22:11     ` Fan Ni
  2024-11-04 21:25       ` Bowman, Terry
  1 sibling, 1 reply; 55+ messages in thread
From: Fan Ni @ 2024-11-01 22:11 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: Fan Ni, ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa

On Fri, Nov 01, 2024 at 01:28:12PM -0500, Bowman, Terry wrote:
> Hi Fan,
> 
> I added comments below.
> 
> On 11/1/2024 1:00 PM, Fan Ni wrote:
> > On Fri, Oct 25, 2024 at 04:02:51PM -0500, Terry Bowman wrote:
> >> This is a continuation of the CXL port error handling RFC from earlier.[1]
> >> The RFC resulted in the decision to add CXL PCIe port error handling to
> >> the existing RCH downstream port handling in the AER service driver. This
> >> patchset adds the CXL PCIe port protocol error handling and logging.
> >>
> >> The first 7 patches update the existing AER service driver to support CXL
> >> PCIe port protocol error handling and reporting. This includes AER service
> >> driver changes for adding correctable and uncorrectable error support, CXL
> >> specific recovery handling, and addition of CXL driver callback handlers.
> >>
> >> The following 7 patches address CXL driver support for CXL PCIe port
> >> protocol errors. This includes the following changes to the CXL drivers:
> >> mapping CXL port and downstream port RAS registers, interface updates for
> >> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
> >> adding port specific error handlers, and protocol error logging.
> >>
> >> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
> >>
> >> Testing:
> > Hi Terry,
> > I tried to test the patchset with aer_inject tool (with the patch you shared
> > in the last version), and hit some issues.
> > Could you help check and give some insights? Thanks.
> >
> > Below are some test setup info and results.
> >
> > I tested two topologies:
> >   a. one memdev directly attached to a HB with only one RP;
> >   b. a topology with cxl switch:
> >          HB
> >         /  \
> >       RP0   RP1
> >        |
> >      switch
> >        |
> >  ----------------
> >  |    |    |    |
> > mem0 mem1 mem2 mem3
> >
> > For both topologies, I cannot reproduce the system panic shown in your cover
> > letter.  
> >
> > btw, I tried compiling cxl both as modules and built into the kernel.
> >
> > Below, I will use the direct-attached topology (a) as an example to show what I
> > tried. I hope to get some clarity about the test and what I missed or did wrong.
> >
> > -------------------------------------
> > pci device info on the test VM 
> > root@fan:~# lspci
> > 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
> > 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
> > 00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
> > 00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> > 00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> > 00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
> > 00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
> > 00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
> > 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
> > 0c:00.0 PCI bridge: Intel Corporation Device 7075
> > 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
> > root@fan:~# 
> > -------------------------------------
> >
> > The aer injection input file looks like below,
> >
> > -------------------------------------
> > fan:~/cxl/cxl-test-tool$ cat /tmp/internal 
> > AER
> > PCI_ID 0000:0c:00.0
> > UNCOR_STATUS INTERNAL
> > HEADER_LOG 0 1 2 3
> > ------------------------------------
> >
> > dmesg after aer injection 
> >
> > ssh root@localhost -p 2024 "dmesg"
> > [  613.195352] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> > [  613.195830] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> > [  613.196253] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> > [  613.198199] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
> > -----------------------------------
> 
> This is likely because the device's CXL RAS status is not set and as a result returns false and bypasses the panic.
> Unfortunately, the aer-inject only sets the AER status and triggers the interrupt. The CXL RAS is not set.
> 
> I attached 2 'test' patches. The first patch sets the device's RAS status to simulate the error reporting.
> This will have to be adjusted as the patch looks for a specific device's bus, and this will likely be a different
> bus than that of the devices you test in your setup.
> 
> The 2nd patch enables UIE/CIE. I moved this out of the v2 patchset. I need to revisit this to see if it is
> needed in the patchset itself (not just a test patch).
> 
> Regards,
> Terry
> 
Hi Terry, 

I checked the two patches you attached. Do we really need the first
patch to unmask internal errors? I see they are already unmasked in
aer_enable_internal_errors(), which is called in aer_probe().
I tried applying only the other patch and testing again; the test
output seems to be the same as with both patches applied. The system
panics as well.

Fan

> >
> > The problem seems to be related to the cxl_error_handler not being assigned for
> > the cxlmem device.
> >
> > in
> > cxl_do_recovery() {
> > ...
> >     327     cxl_walk_bridge(bridge, cxl_report_error_detected, &status);                         
> >     328     if (status)                                                                 
> >     329         panic("CXL cachemem error. Invoking panic");                   
> > ...
> > }
> > The status returned is false, so no panic().
> >
> > I tried to add some dev_dbg info to the code to debug.
> > Below are the debug info and kernel code changes for debugging. 
> > --------------------------------------
> > fan:~/cxl/cxl-test-tool$ cxl-tool.py --cmd dmesg | grep XXX
> > [    1.738909] cxl_mem:cxl_mem_probe:205: cxl_mem mem0: XXX: add endpoint
> > [    1.739188] cxl_mem:devm_cxl_add_endpoint:85: cxl_port port1: XXX: add endpoint
> > [    1.739509] cxl_mem:devm_cxl_add_endpoint:92: cxl_mem mem0: XXX: init ep port aer
> > [    1.739876] cxl_core:cxl_dport_init_ras_reporting:907: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 1
> > [    1.740338] cxl_core:cxl_dport_init_ras_reporting:913: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 2
> > [    1.740812] cxl_core:cxl_dport_init_ras_reporting:927: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 3
> > [    1.741273] cxl_core:cxl_assign_port_error_handlers:851: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____)
> > [    1.741812] cxl_core:cxl_assign_port_error_handlers:855: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____)
> > [    1.742263] cxl_core:cxl_assign_port_error_handlers:857: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____) (____ptrval____)
> > fan:~/cxl/cxl-test-tool$ 
> > --------------------------------------
> >
> > dmesg after error injection:
> > --------------------------------------
> > ssh root@localhost -p 2024 "dmesg"
> > [  228.544439] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> > [  228.544977] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> > [  228.545381] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> > [  228.545879] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/00000000
> > [  228.546360] pcieport 0000:0c:00.0:    [22] UncorrIntErr          
> > [  228.546698] pcieport 0000:0c:00.0: AER: XXX: call cxl_err_handler: 00000000a268bfcb 000000009e0da039
> > [  228.547103] cxl_pci 0000:0d:00.0: AER: XXX: call cxl_err_handler: 00000000b9f08b93 0000000000000000
> > [  228.547515] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
> > fan:~/cxl/cxl-test-tool$ 
> > --------------------------------------
> >
> >
> > Kernel changes:
> > --------------------------------------
> > diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> > index 5f7570c6173c..bcecd1283fc6 100644
> > --- a/drivers/cxl/core/pci.c
> > +++ b/drivers/cxl/core/pci.c
> > @@ -848,10 +848,13 @@ static void cxl_assign_port_error_handlers(struct pci_dev *pdev)
> >  {
> >  	struct pci_driver *pdrv = pdev->driver;
> >  
> > +    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p\n enter", pdev);
> >  	if (!pdrv)
> >  		return;
> >  
> > +    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p\n", pdrv);
> >  	pdrv->cxl_err_handler = &cxl_port_error_handlers;
> > +    dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p %p\n", pdrv, pdrv->cxl_err_handler);
> >  }
> >  
> >  static void cxl_clear_port_error_handlers(void *data)
> > @@ -869,12 +872,14 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
> >  {
> >  	struct pci_dev *pdev = to_pci_dev(port->uport_dev);
> >  
> > +    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 1\n");
> >  	/* uport may have more than 1 downstream EP. Check if already mapped. */
> >  	if (port->uport_regs.ras) {
> >  		dev_warn(&port->dev, "RAS is already mapped\n");
> >  		return;
> >  	}
> >  
> > +    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 2\n");
> >  	port->reg_map.host = &port->dev;
> >  	if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
> >  				   BIT(CXL_CM_CAP_CAP_ID_RAS))) {
> > @@ -882,6 +887,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
> >  		return;
> >  	}
> >  
> > +    dev_dbg(&port->dev, "XXX: assign port error handlers for uport 3\n");
> >  	cxl_assign_port_error_handlers(pdev);
> >  	devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
> >  }
> > @@ -898,11 +904,13 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
> >  	struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
> >  	struct pci_dev *pdev = to_pci_dev(dport_dev);
> >  
> > +    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 1\n");
> >  	if (dport->rch && host_bridge->native_aer) {
> >  		cxl_dport_map_rch_aer(dport);
> >  		cxl_disable_rch_root_ints(dport);
> >  	}
> >  
> > +    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 2\n");
> >  	/* dport may have more than 1 downstream EP. Check if already mapped. */
> >  	if (dport->regs.ras) {
> >  		dev_warn(dport_dev, "RAS is already mapped\n");
> > @@ -916,6 +924,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
> >  		return;
> >  	}
> >  
> > +    dev_dbg(dport_dev, "XXX: assign port error handlers for dport 3\n");
> >  	cxl_assign_port_error_handlers(pdev);
> >  	devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
> >  }
> > diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> > index 067fd6389562..aa824584f8dd 100644
> > --- a/drivers/cxl/mem.c
> > +++ b/drivers/cxl/mem.c
> > @@ -82,13 +82,15 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
> >  	 * Now that the path to the root is established record all the
> >  	 * intervening ports in the chain.
> >  	 */
> > +    dev_dbg(host, "XXX: add endpoint\n");
> >  	for (iter = parent_port, down = NULL; !is_cxl_root(iter);
> >  	     down = iter, iter = to_cxl_port(iter->dev.parent)) {
> >  		struct cxl_ep *ep;
> >  
> >  		ep = cxl_ep_load(iter, cxlmd);
> >  		ep->next = down;
> > -		cxl_init_ep_ports_aer(ep);
> > +        dev_dbg(ep->ep, "XXX: init ep port aer\n");
> > +        cxl_init_ep_ports_aer(ep);
> >  	}
> >  
> >  	/* Note: endpoint port component registers are derived from @cxlds */
> > @@ -200,6 +202,7 @@ static int cxl_mem_probe(struct device *dev)
> >  			return -ENXIO;
> >  		}
> >  
> > +        dev_dbg(dev, "XXX: add endpoint\n");
> >  		rc = devm_cxl_add_endpoint(endpoint_parent, cxlmd, dport);
> >  		if (rc)
> >  			return rc;
> > diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> > index 3785f4ca5103..8285f14994e8 100644
> > --- a/drivers/pci/pcie/err.c
> > +++ b/drivers/pci/pcie/err.c
> > @@ -294,6 +294,11 @@ static int cxl_report_error_detected(struct pci_dev *dev, void *data)
> >  	bool *status = data;
> >  
> >  	device_lock(&dev->dev);
> > +    if (pdrv) {
> > +        dev_dbg(&dev->dev, "XXX: call cxl_err_handler: %p %p\n", pdrv, pdrv->cxl_err_handler);
> > +    } else {
> > +        dev_dbg(&dev->dev, "XXX: call cxl_err_handler: %p no handler\n", pdrv);
> > +    }
> >  	if (pdrv && pdrv->cxl_err_handler &&
> >  	    pdrv->cxl_err_handler->error_detected) {
> >  		const struct cxl_error_handlers *cxl_err_handler =
> > --------------------------------------
> >
> > Fan
> >> Below are test results for this patchset using Qemu with CXL root
> >> port(0c:00.0), CXL upstream switchport(0d:00.0), CXL downstream
> >> switchport(0e:00.0). A CXL endpoint(0f:00.0) CE and UCE logs are
> >> also added to show the existing PCIe endpoint handling is not changed.
> >>
> >> This was tested using aer-inject updated to support CE and UCE internal
> >> error injection. CXL RAS was set using a test patch (not upstreamed but can
> >> provide if needed).
> >>
> >>  - Root port UCE:
> >>  root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
> >>  pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> >>  pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> >>  pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> >>  pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
> >>  pcieport 0000:0c:00.0:    [22] UncorrIntErr
> >>  aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> >>  cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
> >>  Kernel panic - not syncing: CXL cachemem error. Invoking panic
> >>  CPU: 1 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G            E      6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
> >>  Tainted: [E]=UNSIGNED_MODULE
> >>  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> >>  Call Trace:
> >>   <TASK>
> >>   dump_stack_lvl+0x27/0x90
> >>   dump_stack+0x10/0x20
> >>   panic+0x33e/0x380
> >>   cxl_do_recovery+0x116/0x120
> >>   ? srso_return_thunk+0x5/0x5f
> >>   aer_isr+0x3e0/0x710
> >>   irq_thread_fn+0x28/0x70
> >>   irq_thread+0x179/0x240
> >>   ? srso_return_thunk+0x5/0x5f
> >>   ? __pfx_irq_thread_fn+0x10/0x10
> >>   ? __pfx_irq_thread_dtor+0x10/0x10
> >>   ? __pfx_irq_thread+0x10/0x10
> >>   kthread+0xf5/0x130
> >>   ? __pfx_kthread+0x10/0x10
> >>   ret_from_fork+0x3c/0x60
> >>   ? __pfx_kthread+0x10/0x10
> >>   ret_from_fork_asm+0x1a/0x30
> >>   </TASK>
> >>  Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> >>  ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
> > ...



-- 
Fan Ni

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging
  2024-11-01 22:11     ` Fan Ni
@ 2024-11-04 21:25       ` Bowman, Terry
  2024-11-04 21:48         ` Fan Ni
  0 siblings, 1 reply; 55+ messages in thread
From: Bowman, Terry @ 2024-11-04 21:25 UTC (permalink / raw)
  To: Fan Ni
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa



On 11/1/2024 5:11 PM, Fan Ni wrote:
> On Fri, Nov 01, 2024 at 01:28:12PM -0500, Bowman, Terry wrote:
>> Hi Fan,
>>
>> I added comments below.
>>
>> On 11/1/2024 1:00 PM, Fan Ni wrote:
>>> On Fri, Oct 25, 2024 at 04:02:51PM -0500, Terry Bowman wrote:
>>>> This is a continuation of the CXL port error handling RFC from earlier.[1]
>>>> The RFC resulted in the decision to add CXL PCIe port error handling to
>>>> the existing RCH downstream port handling in the AER service driver. This
>>>> patchset adds the CXL PCIe port protocol error handling and logging.
>>>>
>>>> The first 7 patches update the existing AER service driver to support CXL
>>>> PCIe port protocol error handling and reporting. This includes AER service
>>>> driver changes for adding correctable and uncorrectable error support, CXL
>>>> specific recovery handling, and addition of CXL driver callback handlers.
>>>>
>>>> The following 7 patches address CXL driver support for CXL PCIe port
>>>> protocol errors. This includes the following changes to the CXL drivers:
>>>> mapping CXL port and downstream port RAS registers, interface updates for
>>>> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
>>>> adding port specific error handlers, and protocol error logging.
>>>>
>>>> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
>>>>
>>>> Testing:
>>> Hi Terry,
>>> I tried to test the patchset with aer_inject tool (with the patch you shared
>>> in the last version), and hit some issues.
>>> Could you help check and give some insights? Thanks.
>>>
>>> Below are some test setup info and results.
>>>
>>> I tested two topologies:
>>>   a. one memdev directly attached to a HB with only one RP;
>>>   b. a topology with cxl switch:
>>>          HB
>>>         /  \
>>>       RP0   RP1
>>>        |
>>>      switch
>>>        |
>>>  ----------------
>>>  |    |    |    |
>>> mem0 mem1 mem2 mem3
>>>
>>> For both topologies, I cannot reproduce the system panic shown in your cover
>>> letter.  
>>>
>>> btw, I tried compiling cxl both as modules and built into the kernel.
>>>
>>> Below, I will use the direct-attached topology (a) as an example to show what I
>>> tried. I hope to get some clarity about the test and what I missed or did wrong.
>>>
>>> -------------------------------------
>>> pci device info on the test VM 
>>> root@fan:~# lspci
>>> 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
>>> 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
>>> 00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
>>> 00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
>>> 00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
>>> 00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
>>> 00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
>>> 00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
>>> 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
>>> 0c:00.0 PCI bridge: Intel Corporation Device 7075
>>> 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
>>> root@fan:~# 
>>> -------------------------------------
>>>
>>> The aer injection input file looks like below,
>>>
>>> -------------------------------------
>>> fan:~/cxl/cxl-test-tool$ cat /tmp/internal 
>>> AER
>>> PCI_ID 0000:0c:00.0
>>> UNCOR_STATUS INTERNAL
>>> HEADER_LOG 0 1 2 3
>>> ------------------------------------
>>>
>>> dmesg after aer injection 
>>>
>>> ssh root@localhost -p 2024 "dmesg"
>>> [  613.195352] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>>> [  613.195830] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>>> [  613.196253] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>>> [  613.198199] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
>>> -----------------------------------
>> This is likely because the device's CXL RAS status is not set, so the handling returns false and bypasses the panic.
>> Unfortunately, aer-inject only sets the AER status and triggers the interrupt. The CXL RAS status is not set.
>>
>> I attached 2 'test' patches. The first patch sets the device's RAS status to simulate the error reporting.
>> It will have to be adjusted because the patch looks for a specific device's bus, which will likely differ
>> from the bus of the device you test in your setup.
>>
>> The 2nd patch enables UIE/CIE. I moved this out of the v2 patchset. I need to revisit this to see if it is
>> needed in the patchset itself (not just a test patch).
>>
>> Regards,
>> Terry
>>
> Hi Terry, 
>
> I checked the two patches you attached. Do we really need the first
> patch to unmask internal errors? I see they are already unmasked in
> aer_enable_internal_errors(), which is called in aer_probe().
> I tried applying only the other patch and tested again; the test
> output seems to be the same as with both patches applied. The system panics as well.
>
> Fan
Hi Fan,

Which device did you inject into? RP, DSP, or USP?

Yes, the RP UIE & CIE are enabled by the AER driver. RCEC too. But this is not done for CXL DSP
and USP. Below are details from the spec describing how an AER error masked at the source will not
be propagated as a notification to the root complex (RP or RCEC).

'If an individual error is masked when it is detected, its error status bit is still affected,
but no error reporting Message is sent to the Root Complex, and the error is not recorded in the
Header Log, TLP Prefix Log, or First Error Pointer.'[1]

[1] PCIe Spec 6.2.3.2.2 Masking Individual Errors

Also, there can be platform BIOS settings that enable/disable UIE/CIE.
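
As a rough illustration of the masking behaviour quoted above, the following
userspace C sketch models a port's AER mask and status registers. The two bit
values match PCI_ERR_UNC_INTN and PCI_ERR_COR_INTERNAL in
include/uapi/linux/pci_regs.h; struct aer_regs_model and the model_*() helpers
are hypothetical names made up for this example and are not part of the
patchset or the AER driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/pci_regs.h */
#define PCI_ERR_UNC_INTN      0x00400000u /* Uncorrectable Internal Error (UIE) */
#define PCI_ERR_COR_INTERNAL  0x00004000u /* Corrected Internal Error (CIE) */

/* Hypothetical in-memory model of a port's AER registers (not a kernel type). */
struct aer_regs_model {
	uint32_t uncor_status;
	uint32_t uncor_mask;
	uint32_t cor_status;
	uint32_t cor_mask;
};

/* Clear the UIE and CIE mask bits so internal errors get reported upstream. */
void model_unmask_internal_errors(struct aer_regs_model *r)
{
	r->uncor_mask &= ~PCI_ERR_UNC_INTN;
	r->cor_mask &= ~PCI_ERR_COR_INTERNAL;
}

/*
 * Record a detected uncorrectable error. Following the quoted spec text, the
 * status bit is set even when the error is masked. The return value models
 * whether an error message would be sent toward the root complex.
 */
bool model_detect_uncor(struct aer_regs_model *r, uint32_t err)
{
	r->uncor_status |= err;
	return (r->uncor_mask & err) == 0;
}
```

With UIE masked, as assumed here for a DSP or USP that the AER driver has not
touched, model_detect_uncor() sets the status bit but returns false. After
model_unmask_internal_errors() the same injection returns true.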

Regards,
Terry



* Re: [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging
  2024-11-04 21:25       ` Bowman, Terry
@ 2024-11-04 21:48         ` Fan Ni
  0 siblings, 0 replies; 55+ messages in thread
From: Fan Ni @ 2024-11-04 21:48 UTC (permalink / raw)
  To: Bowman, Terry
  Cc: Fan Ni, ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
	Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa

On Mon, Nov 04, 2024 at 03:25:38PM -0600, Bowman, Terry wrote:
> [..]
> Hi Fan,
> 
> Which device did you inject into? RP, DSP, or USP?
> 
> Yes, the RP UIE & CIE are enabled by the AER driver. RCEC too. But, this is not done for CXL DSP
> and USP. Below are details from the spec describing how an AER error masked at the source will not
> be propagated as notification to the root complex (RP or RCEC).
> 
> 'If an individual error is masked when it is detected, its error status bit is still affected,
> but no error reporting Message is sent to the Root Complex, and the error is not recorded in the
> Header Log, TLP Prefix Log, or First Error Pointer.'[1]
> 
> [1] PCIe Spec 6.2.3.2.2 Masking Individual Errors
> 
> Also, there can be platform BIOS settings that enable/disable UIE/CIE.
> 
> Regards,
> Terry
Oh, I see. I did inject into the RP in my previous setup, and I have confirmed
that we need the extra unmask for the downstream port case.

Thanks for the info.

Fan
> 

-- 
Fan Ni


* Re: [PATCH v2 05/14] PCI/AER: Add CXL PCIe port correctable error support in AER service driver
  2024-10-30 15:13   ` Jonathan Cameron
  2024-10-30 15:51     ` Bowman, Terry
@ 2024-11-04 21:50     ` Dan Williams
  2024-11-04 22:05       ` Bowman, Terry
  1 sibling, 1 reply; 55+ messages in thread
From: Dan Williams @ 2024-11-04 21:50 UTC (permalink / raw)
  To: Jonathan Cameron, Terry Bowman
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
	mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
	nathan.fontenot, Smita.KoralahalliChannabasappa

Jonathan Cameron wrote:
> On Fri, 25 Oct 2024 16:02:56 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
[..]
> Anyhow, I think it is fine but I would call out that this changes
> things so that the PCI error handlers are no longer called for CXL ports
> if it's an internal error.
> 
> With a sentence on that:
> 
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> I'm not 100% convinced the path of separate handlers is the way to go
> but we can always change things again if that doesn't work out.

Hmm, if that part is not clear there should at least be more
documentation as to the "why". For me it is the fact that CXL
potentially promotes endpoint errors to region scope recovery actions,
and that PCIe native AER has no concept of AER triggering an unrecoverable
system fatal response.

To date, panic on AER error has only been logic that ACPI APEI can
deploy, and the kernel has no chance to evaluate the error. So, CXL
error handlers are a reflection that these errors are outside of the PCIe
AER error model.


* Re: [PATCH v2 05/14] PCI/AER: Add CXL PCIe port correctable error support in AER service driver
  2024-11-04 21:50     ` Dan Williams
@ 2024-11-04 22:05       ` Bowman, Terry
  0 siblings, 0 replies; 55+ messages in thread
From: Bowman, Terry @ 2024-11-04 22:05 UTC (permalink / raw)
  To: Dan Williams, Jonathan Cameron
  Cc: ming4.li, linux-cxl, linux-kernel, linux-pci, dave, dave.jiang,
	alison.schofield, vishal.l.verma, bhelgaas, mahesh, ira.weiny,
	oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
	Smita.KoralahalliChannabasappa



On 11/4/2024 3:50 PM, Dan Williams wrote:
> Jonathan Cameron wrote:
>> On Fri, 25 Oct 2024 16:02:56 -0500
>> Terry Bowman <terry.bowman@amd.com> wrote:
> [..]
>> Anyhow, I think it is fine but I would call out that this changes
>> things so that the PCI error handlers are no longer called for CXL ports
>> if it's an internal error.
>>
>> With a sentence on that:
>>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>
>> I'm not 100% convinced the path of separate handlers is the way to go
>> but we can always change things again if that doesn't work out.
> Hmm, if that part is not clear there should at least be more
> documentation as to the "why". For me it is the fact that CXL
> potentially promotes endpoint errors to region scope recovery actions,
> and that PCIe native AER has no concept of AER triggering an unrecoverable
> system fatal response.
>
> To date, panic on AER error has only been logic that ACPI APEI can
> deploy, and the kernel has no chance to evaluate the error. So, CXL
> error handlers are a reflection that these errors are outside of the PCIe
> AER error model.
Hi Dan,

I'll elaborate more and touch on what you mentioned.

Regards,
Terry


