public inbox for linux-kernel@vger.kernel.org
* [PATCH 0/15] Enable CXL PCIe port protocol error handling and logging
@ 2024-10-08 22:16 Terry Bowman
From: Terry Bowman @ 2024-10-08 22:16 UTC
  To: ming4.li, linux-cxl, linux-kernel, linux-pci, dave,
	jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
	dan.j.williams, bhelgaas, mahesh, oohall, Benjamin.Cheatham,
	rrichter, nathan.fontenot, smita.koralahallichannabasappa,
	terry.bowman

This is a continuation of the CXL port error handling RFC from earlier.[1]
The RFC resulted in the decision to add CXL PCIe port error handling to
the existing RCH downstream port handling. This patchset adds the CXL PCIe
port handling and logging.

The first 7 patches update the existing AER service driver to support CXL
PCIe port protocol error handling and reporting. This includes AER service
driver changes for adding correctable and uncorrectable error support, CXL
specific recovery handling, and addition of CXL driver callback handlers.

The following 8 patches address CXL driver support for CXL PCIe port
protocol errors. This includes the following changes to the CXL drivers:
mapping CXL port and downstream port RAS registers, interface updates for
common RCH and VH, adding port specific error handlers, and protocol error
logging.

[1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/

Testing:

Below are test results for this patchset, using QEMU with a root port
(0c:00.0), an upstream switch port (0d:00.0), and a downstream switch port
(0e:00.0).

This was tested using aer-inject updated to support CE and UCE internal
error injection. CXL RAS was set using a test patch (not upstreamed).
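
As an aid to reading the aer_inject lines in the logs below, here is a
minimal Python sketch that decodes the injected status words. It assumes
the two hex words are printed as correctable/uncorrectable, which matches
the CE and UCE runs shown, and it names only the two bits these tests
exercise (bit 14 and bit 22, using the labels the AER driver prints). The
helper names are hypothetical and are not part of aer-inject or the kernel.

```python
import re

# Bit labels copied from the AER driver output in the logs
# ("[14] CorrIntErr", "[22] UncorrIntErr"). Only the bits used in this
# test run are listed; this is not a complete AER bit table.
COR_BITS = {14: "CorrIntErr"}
UNCOR_BITS = {22: "UncorrIntErr"}

# Assumed layout: "Injecting errors <correctable>/<uncorrectable> into device <BDF>"
INJECT_RE = re.compile(
    r"Injecting errors ([0-9a-f]{8})/([0-9a-f]{8}) into device (\S+)"
)


def set_bits(value):
    """Return the positions of all set bits in a 32-bit status word."""
    return [bit for bit in range(32) if value & (1 << bit)]


def decode_injection(line):
    """Decode one aer_inject log line, or return None if it does not match."""
    m = INJECT_RE.search(line)
    if m is None:
        return None
    cor = int(m.group(1), 16)
    uncor = int(m.group(2), 16)
    return {
        "device": m.group(3),
        "correctable": [COR_BITS.get(b, f"bit{b}") for b in set_bits(cor)],
        "uncorrectable": [UNCOR_BITS.get(b, f"bit{b}") for b in set_bits(uncor)],
    }
```

Passing the first line of each run below to decode_injection() gives the
target device and the injected error type. Bits without a label in the
tables are reported as "bitN" rather than guessed.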

    Root port UCE:
    root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
    [   27.318920] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
    [   27.320164] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
    [   27.321518] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
    [   27.322483] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00400000/02000000
    [   27.323243] pcieport 0000:0c:00.0:    [22] UncorrIntErr
    [   27.325584] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
    [   27.325584]
    [   27.327171] cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error'
    first_error: 'Memory Address Parity Error'
    [   27.333277] Kernel panic - not syncing: CXL cachemem error. Invoking panic
    [   27.333872] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3857
    [   27.334761] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
    [   27.335716] Call Trace:
    [   27.335985]  <TASK>
    [   27.336226]  panic+0x2ed/0x320
    [   27.336547]  ? __pfx_cxl_report_normal_detected+0x10/0x10
    [   27.337037]  ? __pfx_aer_root_reset+0x10/0x10
    [   27.337453]  cxl_do_recovery+0x304/0x310
    [   27.337833]  aer_isr+0x3fd/0x700
    [   27.338154]  ? __pfx_irq_thread_fn+0x10/0x10
    [   27.338572]  irq_thread_fn+0x1f/0x60
    [   27.338923]  irq_thread+0x102/0x1b0
    [   27.339267]  ? __pfx_irq_thread_dtor+0x10/0x10
    [   27.339683]  ? __pfx_irq_thread+0x10/0x10
    [   27.340059]  kthread+0xcd/0x100
    [   27.340387]  ? __pfx_kthread+0x10/0x10
    [   27.340748]  ret_from_fork+0x2f/0x50
    [   27.341100]  ? __pfx_kthread+0x10/0x10
    [   27.341466]  ret_from_fork_asm+0x1a/0x30
    [   27.341842]  </TASK>
    [   27.342281] Kernel Offset: 0x1ba00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
    [   27.343221] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

    Root port CE:
    root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
    [   19.444339] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
    [   19.445530] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
    [   19.446750] pcieport 0000:0c:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    [   19.447742] pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
    [   19.448549] pcieport 0000:0c:00.0:    [14] CorrIntErr
    [   19.449223] aer_event: 0000:0c:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
    [   19.449223]
    [   19.451415] cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'

    Upstream switch port UCE:
    root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
    [   45.236853] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
    [   45.238101] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
    [   45.239416] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
    [   45.240412] pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00400000/02000000
    [   45.241159] pcieport 0000:0d:00.0:    [22] UncorrIntErr
    [   45.242448] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
    [   45.242448]
    [   45.244008] cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error'
    first_error: 'Memory Address Parity Error'
    [   45.249129] Kernel panic - not syncing: CXL cachemem error. Invoking panic
    [   45.249800] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g1fb9097c3728 #3855
    [   45.250795] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
    [   45.251907] Call Trace:
    [   45.253284]  <TASK>
    [   45.253564]  panic+0x2ed/0x320
    [   45.253909]  ? __pfx_cxl_report_normal_detected+0x10/0x10
    [   45.255455]  ? __pfx_aer_root_reset+0x10/0x10
    [   45.255915]  cxl_do_recovery+0x304/0x310
    [   45.257219]  aer_isr+0x3fd/0x700
    [   45.257572]  ? __pfx_irq_thread_fn+0x10/0x10
    [   45.258006]  irq_thread_fn+0x1f/0x60
    [   45.258383]  irq_thread+0x102/0x1b0
    [   45.258748]  ? __pfx_irq_thread_dtor+0x10/0x10
    [   45.259196]  ? __pfx_irq_thread+0x10/0x10
    [   45.259605]  kthread+0xcd/0x100
    [   45.259956]  ? __pfx_kthread+0x10/0x10
    [   45.260386]  ret_from_fork+0x2f/0x50
    [   45.260879]  ? __pfx_kthread+0x10/0x10
    [   45.261418]  ret_from_fork_asm+0x1a/0x30
    [   45.261936]  </TASK>
    [   45.262451] Kernel Offset: 0xc600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
    [   45.263467] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

    Upstream switch port CE:
    root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh 
    [   37.504029] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
    [   37.506076] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
    [   37.507599] pcieport 0000:0d:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    [   37.508759] pcieport 0000:0d:00.0:   device [19e5:a128] error status/mask=00004000/0000a000
    [   37.509574] pcieport 0000:0d:00.0:    [14] CorrIntErr            
    [   37.510180] aer_event: 0000:0d:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
    [   37.510180] 
    [   37.512057] cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'

    Downstream switch port UCE:
    root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
    [   29.421532] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
    [   29.422812] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
    [   29.424551] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
    [   29.425670] pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00400000/02000000
    [   29.426487] pcieport 0000:0e:00.0:    [22] UncorrIntErr
    [   29.427111] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
    [   29.427111]
    [   29.428688] cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error'
    first_error: 'Memory Address Parity Error'
    [   29.430173] Kernel panic - not syncing: CXL cachemem error. Invoking panic
    [   29.430862] CPU: 12 UID: 0 PID: 122 Comm: irq/24-aerdrv Not tainted 6.11.0-rc1-port-error-g844fd2319372 #3851
    [   29.431874] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
    [   29.433031] Call Trace:
    [   29.433354]  <TASK>
    [   29.433631]  panic+0x2ed/0x320
    [   29.434010]  ? __pfx_cxl_report_normal_detected+0x10/0x10
    [   29.434653]  ? __pfx_aer_root_reset+0x10/0x10
    [   29.435179]  cxl_do_recovery+0x304/0x310
    [   29.435626]  aer_isr+0x3fd/0x700
    [   29.436027]  ? __pfx_irq_thread_fn+0x10/0x10
    [   29.436507]  irq_thread_fn+0x1f/0x60
    [   29.436898]  irq_thread+0x102/0x1b0
    [   29.437293]  ? __pfx_irq_thread_dtor+0x10/0x10
    [   29.437758]  ? __pfx_irq_thread+0x10/0x10
    [   29.438189]  kthread+0xcd/0x100
    [   29.438551]  ? __pfx_kthread+0x10/0x10
    [   29.438959]  ret_from_fork+0x2f/0x50
    [   29.439362]  ? __pfx_kthread+0x10/0x10
    [   29.439771]  ret_from_fork_asm+0x1a/0x30
    [   29.440221]  </TASK>
    [   29.440738] Kernel Offset: 0x10a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
    [   29.441812] ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---

    Downstream switch port CE:
    root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
    [  177.114442] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
    [  177.115602] pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
    [  177.116973] pcieport 0000:0e:00.0: PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    [  177.117985] pcieport 0000:0e:00.0:   device [19e5:a129] error status/mask=00004000/0000a000
    [  177.118809] pcieport 0000:0e:00.0:    [14] CorrIntErr
    [  177.119521] aer_event: 0000:0e:00.0 PCIe Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
    [  177.119521]
    [  177.122037] cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
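
The CXL trace lines above can be collected into a per-port summary with a
short script. The following is a sketch only: the regular expression is
inferred from the log lines in this posting (the uncorrectable event prints
"status: '...'" while the correctable event prints "status='...'"), and
parse_cxl_trace() is a hypothetical helper, not an existing tool.

```python
import re

# Pattern inferred from the trace lines in this posting. It accepts both
# "status: '...'" (uncorrectable) and "status='...'" (correctable).
TRACE_RE = re.compile(
    r"(cxl_port_aer_(?:un)?correctable_error): "
    r"device=(\S+) host=(\S+) status[:=]\s*'([^']*)'"
)


def parse_cxl_trace(log_text):
    """Return one dict per CXL port trace event found in log_text."""
    return [
        {"event": event, "device": device, "host": host, "status": status}
        for event, device, host, status in TRACE_RE.findall(log_text)
    ]
```

Run against the captured console output, this yields the device, its host,
and the reported RAS status for each injection, which makes it easy to
check that the root port, upstream switch port, and downstream switch port
each produced the expected event.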

Changes RFC->v1:
 [Dan] Rename cxl_rch_handle_error() to cxl_handle_error()
 [Dan] Add cxl_do_recovery()
 [Jonathan] Flatten cxl_setup_parent_uport()
 [Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
 [Jonathan] Rename cxl_dev_is_pci_type()
 [Ming] Use bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport)
 in place of find_cxl_port() and device_find_child()
 [Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
 [Ming] Don't use endpoint as host to cxl_map_component_regs()
 [Bjorn] Use "PCIe UIR/CIE" instead of "AER UI/CIE"
 [TODO][Bjorn] Don't use Kconfig to enable/disable a CXL external interface

Terry Bowman (15):
  cxl/aer/pci: Add CXL PCIe port error handler callbacks in AER service
    driver
  cxl/aer/pci: Update is_internal_error() to be callable w/o
    CONFIG_PCIEAER_CXL
  cxl/aer/pci: Refactor AER driver's existing interfaces to support CXL
    PCIe ports
  cxl/aer/pci: Add CXL PCIe port correctable error support in AER
    service driver
  cxl/aer/pci: Update AER driver to read UCE fatal status for all CXL
    PCIe port devices
  cxl/aer/pci: Introduce PCI_ERS_RESULT_PANIC to pci_ers_result type
  cxl/aer/pci: Add CXL PCIe port uncorrectable error recovery in AER
    service driver
  cxl/pci: Change find_cxl_ports() to be non-static
  cxl/pci: Map CXL PCIe downstream port RAS registers
  cxl/pci: Map CXL PCIe upstream port RAS registers
  cxl/pci: Update RAS handler interfaces to support CXL PCIe ports
  cxl/pci: Add error handler for CXL PCIe port RAS errors
  cxl/pci: Add trace logging for CXL PCIe port RAS errors
  cxl/aer/pci: Export pci_aer_unmask_internal_errors()
  cxl/pci: Enable internal CE/UCE interrupts for CXL PCIe port devices

 drivers/cxl/core/core.h  |   3 +
 drivers/cxl/core/pci.c   | 172 +++++++++++++++++++++++++++++++--------
 drivers/cxl/core/port.c  |   4 +-
 drivers/cxl/core/trace.h |  47 +++++++++++
 drivers/cxl/cxl.h        |  14 +++-
 drivers/cxl/mem.c        |  30 ++++++-
 drivers/cxl/pci.c        |   8 ++
 drivers/pci/pci.h        |   5 ++
 drivers/pci/pcie/aer.c   | 123 ++++++++++++++++++++--------
 drivers/pci/pcie/err.c   | 150 ++++++++++++++++++++++++++++++++++
 include/linux/aer.h      |  16 ++++
 include/linux/pci.h      |   3 +
 12 files changed, 503 insertions(+), 72 deletions(-)


base-commit: f7982d85e136ba7e26b31a725c1841373f81f84a
-- 
2.34.1

