* [PATCH v4 0/15] Enable CXL PCIe Port protocol error handling and logging
From: Terry Bowman @ 2024-12-11 23:39 UTC (permalink / raw)
To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
This is a continuation of the CXL Port error handling RFC from earlier.[1]
The RFC resulted in the decision to add CXL PCIe Port error handling to
the existing RCH Downstream Port handling in the AER service driver. This
patchset adds the CXL PCIe Port protocol error handling and logging.
The first 7 patches update the existing AER service driver to support CXL
PCIe Port protocol error handling and reporting. This includes AER service
driver changes for adding correctable and uncorrectable error support, CXL
specific recovery handling, and addition of CXL driver callback handlers.
The following 8 patches address CXL driver support for CXL PCIe Port
protocol errors. This includes the following changes to the CXL drivers:
mapping CXL Port and Downstream Port RAS registers, interface updates for
common Restricted CXL Host mode (RCH) and Virtual Hierarchy mode (VH),
adding Port specific error handlers, and protocol error logging.
Note, this patchset does not address CXL1.1/RCH-RCD error handling. The
plan is to update CXL1.1 handling in a separate following patchset. Future
CXL1.1 changes will be needed to reuse the CXL Port changes introduced
here. The changes in this patchset will not regress behavior or
functionality.
@Bjorn, can you please help take a look at patch#7 ('Add CXL PCIe Port
uncorrectable error recovery...')? This introduces cxl_walk_bridge()
because pci_walk_bridge() doesn't evaluate RP or DSP errors when passed as
the 'dev' parameter. Should pci_walk_bridge() be updated to evaluate RP and
DSP errors as done in cxl_walk_bridge()?
[1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
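To make the question about pci_walk_bridge() concrete, here is a small
userspace sketch of the two walk styles. It is only a toy model under stated
assumptions: 'struct toy_port' and the toy_walk_*() helpers are hypothetical
names, not kernel APIs, and the first walk models only the behavior described
above (the bridge passed in is not itself evaluated).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a PCIe port; not the kernel's struct pci_dev. */
struct toy_port {
	const char *name;
	int has_error;            /* 1 if this device logged an error */
	struct toy_port *child;   /* first device below this port, or NULL */
	struct toy_port *sibling; /* next device on the same bus, or NULL */
};

typedef void (*toy_cb)(struct toy_port *dev, void *data);

/* Visit every device below 'bridge', depth first, but not 'bridge' itself. */
static void toy_walk_below(struct toy_port *bridge, toy_cb cb, void *data)
{
	for (struct toy_port *d = bridge->child; d; d = d->sibling) {
		cb(d, data);
		toy_walk_below(d, cb, data);
	}
}

/* Visit 'bridge' first, then everything below it. */
static void toy_walk_with_bridge(struct toy_port *bridge, toy_cb cb, void *data)
{
	assert(bridge != NULL);
	cb(bridge, data);
	toy_walk_below(bridge, cb, data);
}

/* Callback: count the devices that report an error. */
static void toy_count_errors(struct toy_port *dev, void *data)
{
	if (dev->has_error)
		(*(int *)data)++;
}
```

With an error logged only in the Root Port, the first walk finds nothing and
the second finds it, which is the gap the question above is about.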
Testing
=======
Below are test results for this patchset using Qemu with CXL Root
Port(0c:00.0), CXL Upstream Switchport(0d:00.0), CXL Downstream
Switchport(0e:00.0).
This was tested using aer-inject updated to support CE and UCE internal
error injection. CXL RAS was set using a test patch (not upstreamed, but I can
provide it if needed).
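For readers mapping the injected values to the log lines below, this is a
minimal userspace sketch of the bit arithmetic. The toy_*() helpers are
hypothetical; the bit numbers (14 and 22) and the status/mask values are taken
from the logs in this cover letter, and the 'status & ~mask' filter matches
what the AER driver passes to trace_aer_event().

```c
#include <assert.h>
#include <stdint.h>

/* Return a 32-bit value with only bit 'n' set. */
static uint32_t toy_bit(unsigned int n)
{
	assert(n < 32);
	return UINT32_C(1) << n;
}

/* Return the index of the lowest set bit in 'status', or -1 if none is set. */
static int toy_first_error_bit(uint32_t status)
{
	for (int bit = 0; bit < 32; bit++)
		if (status & (UINT32_C(1) << bit))
			return bit;
	return -1;
}

/* Keep only the error bits that are set in 'status' and clear in 'mask'. */
static uint32_t toy_unmasked(uint32_t status, uint32_t mask)
{
	return status & ~mask;
}
```

Bit 14 is the correctable internal error (00004000) and bit 22 is the
uncorrectable internal error (00400000); neither is covered by the masks
shown in the logs (0000a000 and 02000000), so both are reported.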
== Root Port Correctable Error ==
root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00004000/0000a000
pcieport 0000:0c:00.0: [14] CorrIntErr
aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
cxl_port_aer_correctable_error: device=0000:0c:00.0 host=pci0000:0c status='Received Error From Physical Layer'
== Root Port UnCorrectable Error ==
root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
pcieport 0000:0c:00.0: [22] UncorrIntErr
aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 148 Comm: irq/24-aerdrv Tainted: G E 6.12.0-rc7-cxl-port-err-00016-g7ce90d33afcd #4727
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x27/0x90
dump_stack+0x10/0x20
panic+0x33e/0x380
cxl_do_recovery+0x122/0x130
? srso_return_thunk+0x5/0x5f
aer_isr+0x64f/0x700
? free_cpumask_var+0x9/0x10
? kfree+0x259/0x2e0
irq_thread_fn+0x28/0x70
irq_thread+0x179/0x240
? srso_return_thunk+0x5/0x5f
? __pfx_irq_thread_fn+0x10/0x10
? __pfx_irq_thread_dtor+0x10/0x10
? __pfx_irq_thread+0x10/0x10
kthread+0xf5/0x130
? __pfx_kthread+0x10/0x10
ret_from_fork+0x3c/0x60
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Kernel Offset: 0x33600000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: CXL cachemem error. ]---
== Upstream Port Correctable Error ==
root@tbowman-cxl:~/aer-inject# ./us-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0d:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0d:00.0
pcieport 0000:0d:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00004000/0000a000
pcieport 0000:0d:00.0: [14] CorrIntErr
aer_event: 0000:0d:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
cxl_port_aer_correctable_error: device=0000:0d:00.0 host=0000:0c:00.0 status='Received Error From Physical Layer'
== Upstream Port UnCorrectable Error ==
root@tbowman-cxl:~/aer-inject# ./us-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0d:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0d:00.0
pcieport 0000:0d:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:0d:00.0: device [19e5:a128] error status/mask=00400000/02000000
pcieport 0000:0d:00.0: [22] UncorrIntErr
aer_event: 0000:0d:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
cxl_port_aer_uncorrectable_error: device=0000:0d:00.0 host=0000:0c:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 150 Comm: irq/24-aerdrv Tainted: G E 6.12.0-rc7-cxl-port-err-00016-g7ce90d33afcd #4727
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x27/0x90
dump_stack+0x10/0x20
panic+0x33e/0x380
cxl_do_recovery+0x122/0x130
? srso_return_thunk+0x5/0x5f
aer_isr+0x64f/0x700
irq_thread_fn+0x28/0x70
irq_thread+0x179/0x240
? srso_return_thunk+0x5/0x5f
? __pfx_irq_thread_fn+0x10/0x10
? __pfx_irq_thread_dtor+0x10/0x10
? __pfx_irq_thread+0x10/0x10
kthread+0xf5/0x130
? __pfx_kthread+0x10/0x10
ret_from_fork+0x3c/0x60
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Kernel Offset: 0x2c400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: CXL cachemem error. ]---
== Downstream Port Correctable Error ==
root@tbowman-cxl:~/aer-inject# ./ds-ce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0e:00.0
pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0e:00.0
pcieport 0000:0e:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00004000/0000a000
pcieport 0000:0e:00.0: [14] CorrIntErr
aer_event: 0000:0e:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
cxl_port_aer_correctable_error: device=0000:0e:00.0 host=0000:0d:00.0 status='Received Error From Physical Layer'
== Downstream Port UnCorrectable Error ==
root@tbowman-cxl:~/aer-inject# ./ds-uce-inject.sh
pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0e:00.0
pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0e:00.0
pcieport 0000:0e:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:0e:00.0: device [19e5:a129] error status/mask=00400000/02000000
pcieport 0000:0e:00.0: [22] UncorrIntErr
aer_event: 0000:0e:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
cxl_port_aer_uncorrectable_error: device=0000:0e:00.0 host=0000:0d:00.0 status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
Kernel panic - not syncing: CXL cachemem error.
CPU: 10 UID: 0 PID: 148 Comm: irq/24-aerdrv Tainted: G E 6.12.0-rc7-cxl-port-err-00016-g7ce90d33afcd #4727
Tainted: [E]=UNSIGNED_MODULE
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x27/0x90
dump_stack+0x10/0x20
panic+0x33e/0x380
cxl_do_recovery+0x122/0x130
? srso_return_thunk+0x5/0x5f
aer_isr+0x64f/0x700
irq_thread_fn+0x28/0x70
irq_thread+0x179/0x240
? srso_return_thunk+0x5/0x5f
? __pfx_irq_thread_fn+0x10/0x10
? __pfx_irq_thread_dtor+0x10/0x10
? __pfx_irq_thread+0x10/0x10
kthread+0xf5/0x130
? __pfx_kthread+0x10/0x10
ret_from_fork+0x3c/0x60
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1a/0x30
</TASK>
Kernel Offset: 0x5800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
---[ end Kernel panic - not syncing: CXL cachemem error. ]---
Changes
=======
Changes in v3 -> v4
[Lukas] Capitalize PCIe and CXL device names as in specifications
[Lukas] Move call to pcie_is_cxl() into cxl_port_devsec()
[Lukas] Correct namespace spelling
[Lukas] Removed export from pcie_is_cxl_port()
[Lukas] Simplify 'if' blocks in cxl_handle_error()
[Lukas] Change panic message to remove redundant 'panic' text
[Ming] Update to call cxl_dport_init_ras_reporting() in RCH case
[lkp@intel] 'host' parameter is already removed. Remove parameter description too.
[Terry] Added field description for cxl_err_handlers in pci.h comment block
Changes in v2 -> v3
[Terry] Add UIE/CIE port enablement patch. Needed because only RP are enabled by AER driver.
[DaveJ] Isolate reading upstream port's AER info to only the CXL path
[Jonathan, Dan] Add details about separate handling paths for CXL & PCIe
[Jonathan] Add details to existing comment in devm_cxl_add_endpoint()
about call to cxl_init_ep_ports_aer()
[Jonathan] Updated cxl_init_ep_ports_aer() w/ checks for NULL;
[Jonathan] Move find_cxl_port() patch immediately before patch to create handlers
[Jonathan] Patch title fix: find_cxl_ports() -> find_cxl_port()
[Jonathan] Remove 2 unnecessary dev_warns() in cxl_dport_init_ras_reporting() and
cxl_uport_init_ras_reporting().
[Jonathan] Remove unnecessary filter on PCIe port devices in dev_is_cxl_pci()
[Jonathan] Change to use 2 cxl_port declarations in cxl_pci_port_ras()
[Jonathan] Fix spacing in 'struct cxl_error_handlers' declaration.
Changes in v1 -> v2
[Jonathan] Remove extra NULL check and cleanup in cxl_pci_port_ras()
[Jonathan] Update description to DSP map patch description
[Jonathan] Update cxl_pci_port_ras() to check for NULL port
[Jonathan] Don't call handler before handler port changes are present (patch order)
[Bjorn] Fix linebreak in cover sheet URL
[Bjorn] Remove timestamps from test logs in cover sheet
[Bjorn] Retitle AER commits to use "PCI/AER:"
[Bjorn] Retitle patch#3 to use renaming instead of refactoring
[Bjorn] Fix base commit-id on cover sheet
[Bjorn] Add VH spec reference/citation
[Terry] Removed last 2 patches to enable internal errors. Is not needed
because internal errors are enabled in AER driver.
[Dan] Create cxl_do_recovery() and pci_driver::cxl_err_handlers.
[Dan] Use kernel panic in CXL recovery
[Dan] cxl_port_hndlrs -> cxl_port_error_handlers
[Dan] Move cxl_port_error_handlers to pci_driver. Remove module (un)registration.
[Terry] Add patch w/ cxl_assign_port_error_handlers() and cxl_clear_port_error_handlers()
[Terry] Removed PCI_ERS_RESULT_PANIC patch. Is no longer needed because the result type parameter
is not used in the cxl_err_handlers callbacks.
Changes in RFC -> v1:
[Dan] Rename cxl_rch_handle_error() to cxl_handle_error()
[Dan] Add cxl_do_recovery()
[Jonathan] Flatten cxl_setup_parent_uport()
[Jonathan] Use cxl_component_regs instead of struct cxl_regs regs
[Jonathan] Rename cxl_dev_is_pci_type()
[Ming] bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport) can
replace these find_cxl_port() and device_find_child().
[Jonathan] Compact call to cxl_port_map_regs() in cxl_setup_parent_uport()
[Ming] Don't use endpoint as host to cxl_map_component_regs()
[Bjorn] Use "PCIe UIE/CIE" instead of "AER UI/CIE"
[Bjorn] Don't use Kconfig to enable/disable a CXL external interface
Terry Bowman (15):
PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct
pci_driver'
PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe Port
support
cxl/pci: Introduce PCIe helper functions pcie_is_cxl() and
pcie_is_cxl_port()
PCI/AER: Modify AER driver logging to report CXL or PCIe bus error
type
PCI/AER: Add CXL PCIe Port correctable error support in AER service
driver
PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe
Port devices
PCI/AER: Add CXL PCIe Port Uncorrectable Error recovery in AER service
driver
cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS
registers
cxl/pci: Map CXL PCIe Upstream Switch Port RAS registers
cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports
cxl/pci: Change find_cxl_port() to non-static
cxl/pci: Add error handler for CXL PCIe Port RAS errors
cxl/pci: Add trace logging for CXL PCIe Port RAS errors
cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch
Ports
drivers/cxl/core/core.h | 3 +
drivers/cxl/core/pci.c | 186 +++++++++++++++++++++++++++-------
drivers/cxl/core/port.c | 4 +-
drivers/cxl/core/trace.h | 47 +++++++++
drivers/cxl/cxl.h | 10 +-
drivers/cxl/mem.c | 39 ++++++-
drivers/pci/pci.c | 13 +++
drivers/pci/pci.h | 3 +
drivers/pci/pcie/aer.c | 107 +++++++++++--------
drivers/pci/pcie/err.c | 54 ++++++++++
drivers/pci/probe.c | 10 ++
include/linux/aer.h | 1 +
include/linux/pci.h | 14 +++
include/ras/ras_event.h | 9 +-
include/uapi/linux/pci_regs.h | 3 +-
15 files changed, 417 insertions(+), 86 deletions(-)
base-commit: 2d5404caa8c7bb5c4e0435f94b28834ae5456623
--
2.34.1
* [PATCH v4 01/15] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver'
From: Terry Bowman @ 2024-12-11 23:39 UTC (permalink / raw)
To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
CXL.io provides a PCIe-like protocol error implementation, but CXL.io and
PCIe have different handling requirements.
The PCIe AER service driver may attempt to recover PCIe devices with
uncorrectable errors, while recovery is not used for CXL.io. Recovery is not
used in the CXL.io case because of the potential for corrupting what may be
system memory.
Create the 'struct cxl_error_handlers' type and the pci_driver::cxl_err_handler
member, similar to 'struct pci_error_handlers' and pci_driver::err_handler.
Create handlers for correctable and uncorrectable CXL.io error handling.
The CXL error handlers will be used in future patches adding CXL PCIe
port protocol error handling.
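As an illustration of how a driver might fill in and use these callbacks,
here is a minimal userspace sketch. It only mirrors the shape of the structure
in the diff below. All toy_* names are hypothetical, toy_dispatch() is not the
AER driver's dispatch code, and the meaning given to the bool returned by
error_detected() is an assumption made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev; just counts callback hits. */
struct toy_pci_dev {
	int cor_count;
	int uncor_count;
};

/* Same callback shape as the structure added by this patch. */
struct toy_cxl_error_handlers {
	bool (*error_detected)(struct toy_pci_dev *dev);
	void (*cor_error_detected)(struct toy_pci_dev *dev);
};

/* Hypothetical stand-in for struct pci_driver. */
struct toy_pci_driver {
	const struct toy_cxl_error_handlers *cxl_err_handler;
};

static bool toy_port_error_detected(struct toy_pci_dev *dev)
{
	dev->uncor_count++;
	return true; /* assumption: true means an uncorrectable error was found */
}

static void toy_port_cor_error_detected(struct toy_pci_dev *dev)
{
	dev->cor_count++;
}

static const struct toy_cxl_error_handlers toy_port_handlers = {
	.error_detected = toy_port_error_detected,
	.cor_error_detected = toy_port_cor_error_detected,
};

/* Call the matching callback if the driver registered one. */
static bool toy_dispatch(const struct toy_pci_driver *drv,
			 struct toy_pci_dev *dev, bool correctable)
{
	assert(dev != NULL);
	if (!drv || !drv->cxl_err_handler)
		return false;
	if (correctable) {
		if (drv->cxl_err_handler->cor_error_detected)
			drv->cxl_err_handler->cor_error_detected(dev);
		return false;
	}
	if (!drv->cxl_err_handler->error_detected)
		return false;
	return drv->cxl_err_handler->error_detected(dev);
}
```

Keeping the CXL callbacks in their own structure lets a driver opt in to CXL
handling without touching its existing PCIe err_handler.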
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
---
include/linux/pci.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 573b4c4c2be6..f6a9dddfc9e9 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -886,6 +886,14 @@ struct pci_error_handlers {
void (*cor_error_detected)(struct pci_dev *dev);
};
+/* Compute Express Link (CXL) bus error event callbacks */
+struct cxl_error_handlers {
+ /* CXL bus error detected on this device */
+ bool (*error_detected)(struct pci_dev *dev);
+
+ /* Allow device driver to record more details of a correctable error */
+ void (*cor_error_detected)(struct pci_dev *dev);
+};
struct module;
@@ -931,6 +939,7 @@ struct module;
* @sriov_get_vf_total_msix: PF driver callback to get the total number of
* MSI-X vectors available for distribution to the VFs.
* @err_handler: See Documentation/PCI/pci-error-recovery.rst
+ * @cxl_err_handler: Compute Express Link specific error handlers.
* @groups: Sysfs attribute groups.
* @dev_groups: Attributes attached to the device that will be
* created once it is bound to the driver.
@@ -956,6 +965,7 @@ struct pci_driver {
int (*sriov_set_msix_vec_count)(struct pci_dev *vf, int msix_vec_count); /* On PF */
u32 (*sriov_get_vf_total_msix)(struct pci_dev *pf);
const struct pci_error_handlers *err_handler;
+ const struct cxl_error_handlers *cxl_err_handler;
const struct attribute_group **groups;
const struct attribute_group **dev_groups;
struct device_driver driver;
--
2.34.1
* [PATCH v4 02/15] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe Port support
From: Terry Bowman @ 2024-12-11 23:39 UTC (permalink / raw)
To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
The AER service driver already includes support for CXL restricted host
(RCH) Downstream Port error handling. The current implementation is based
on CXL1.1 using a root complex event collector.
Rename function interfaces and parameters where necessary to include
virtual hierarchy (VH) mode CXL PCIe Port error handling alongside the RCH
handling.[1] The CXL PCIe Port error handling will be added in a future
patch.
Limit changes to renaming variable and function names. No functional
changes are added.
[1] CXL 3.1 Spec, 9.12.2 CXL Virtual Hierarchy
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
---
drivers/pci/pcie/aer.c | 28 ++++++++++++++--------------
1 file changed, 14 insertions(+), 14 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 13b8586924ea..fe6edf26279e 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1029,7 +1029,7 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
return 0;
}
-static void cxl_rch_handle_error(struct pci_dev *dev, struct aer_err_info *info)
+static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
{
/*
* Internal errors of an RCEC indicate an AER error in an
@@ -1052,30 +1052,30 @@ static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
return *handles_cxl;
}
-static bool handles_cxl_errors(struct pci_dev *rcec)
+static bool handles_cxl_errors(struct pci_dev *dev)
{
bool handles_cxl = false;
- if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
- pcie_aer_is_native(rcec))
- pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
+ if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
+ pcie_aer_is_native(dev))
+ pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
return handles_cxl;
}
-static void cxl_rch_enable_rcec(struct pci_dev *rcec)
+static void cxl_enable_internal_errors(struct pci_dev *dev)
{
- if (!handles_cxl_errors(rcec))
+ if (!handles_cxl_errors(dev))
return;
- pci_aer_unmask_internal_errors(rcec);
- pci_info(rcec, "CXL: Internal errors unmasked");
+ pci_aer_unmask_internal_errors(dev);
+ pci_info(dev, "CXL: Internal errors unmasked");
}
#else
-static inline void cxl_rch_enable_rcec(struct pci_dev *dev) { }
-static inline void cxl_rch_handle_error(struct pci_dev *dev,
- struct aer_err_info *info) { }
+static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
+static inline void cxl_handle_error(struct pci_dev *dev,
+ struct aer_err_info *info) { }
#endif
/**
@@ -1113,7 +1113,7 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
{
- cxl_rch_handle_error(dev, info);
+ cxl_handle_error(dev, info);
pci_aer_handle_error(dev, info);
pci_dev_put(dev);
}
@@ -1491,7 +1491,7 @@ static int aer_probe(struct pcie_device *dev)
return status;
}
- cxl_rch_enable_rcec(port);
+ cxl_enable_internal_errors(port);
aer_enable_rootport(rpc);
pci_info(port, "enabled with IRQ %d\n", dev->irq);
return 0;
--
2.34.1
* [PATCH v4 03/15] cxl/pci: Introduce PCIe helper functions pcie_is_cxl() and pcie_is_cxl_port()
From: Terry Bowman @ 2024-12-11 23:39 UTC (permalink / raw)
To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
CXL and AER drivers need the ability to identify CXL devices and CXL port
devices.
First, add set_pcie_cxl() with logic checking for CXL Flexbus DVSEC
presence. The CXL Flexbus DVSEC presence is used because it is required
for all the CXL PCIe devices.[1]
Add the boolean 'struct pci_dev::is_cxl' to cache the presence of the CXL
Flexbus DVSEC.
Add pcie_is_cxl() as a macro to return 'struct pci_dev::is_cxl'.
Add pcie_is_cxl_port() to check if a device is a CXL Root Port, CXL
Upstream Switch Port, or CXL Downstream Switch Port. Also, verify the
CXL extensions DVSEC for port is present.[1]
[1] CXL 3.1 Spec, 8.1.1 PCIe Designated Vendor-Specific Extended
Capability (DVSEC) ID Assignment, Table 8-2
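The checks described above can be sketched in userspace as follows. This is a
toy model, not the kernel code: 'struct toy_dev', the enum, and the toy_*()
helpers are hypothetical, and the two has_*_dvsec fields stand in for the
results of pci_find_dvsec_capability() lookups.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_port_type {
	TOY_ENDPOINT,
	TOY_ROOT_PORT,
	TOY_UPSTREAM,
	TOY_DOWNSTREAM,
};

/* Hypothetical stand-in for the fields of struct pci_dev used here. */
struct toy_dev {
	enum toy_port_type type;
	bool has_flexbus_dvsec; /* models the CXL Flexbus DVSEC lookup */
	bool has_port_dvsec;    /* models the CXL port extensions DVSEC lookup */
	bool is_cxl;            /* cached at setup time */
};

/* Cache the Flexbus DVSEC presence once, as set_pcie_cxl() does at probe. */
static void toy_set_pcie_cxl(struct toy_dev *dev)
{
	assert(dev != NULL);
	dev->is_cxl = dev->has_flexbus_dvsec;
}

static bool toy_pcie_is_cxl(const struct toy_dev *dev)
{
	return dev->is_cxl;
}

/* True only for RP/USP/DSP devices that are CXL and have the port DVSEC. */
static bool toy_pcie_is_cxl_port(const struct toy_dev *dev)
{
	if (dev->type != TOY_ROOT_PORT &&
	    dev->type != TOY_UPSTREAM &&
	    dev->type != TOY_DOWNSTREAM)
		return false;

	/* As in the patch, the port DVSEC is not consulted for non-CXL devices. */
	return toy_pcie_is_cxl(dev) && dev->has_port_dvsec;
}
```

A CXL endpoint is therefore a CXL device but not a CXL port, and a plain PCIe
Root Port is neither.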
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
---
drivers/pci/pci.c | 13 +++++++++++++
drivers/pci/probe.c | 10 ++++++++++
include/linux/pci.h | 4 ++++
include/uapi/linux/pci_regs.h | 3 ++-
4 files changed, 29 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 225a6cd2e9ca..c96c304bc799 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -5034,10 +5034,23 @@ static int pci_dev_reset_slot_function(struct pci_dev *dev, bool probe)
static u16 cxl_port_dvsec(struct pci_dev *dev)
{
+ if (!pcie_is_cxl(dev))
+ return 0;
+
return pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
PCI_DVSEC_CXL_PORT);
}
+bool pcie_is_cxl_port(struct pci_dev *dev)
+{
+ if ((pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT) &&
+ (pci_pcie_type(dev) != PCI_EXP_TYPE_UPSTREAM) &&
+ (pci_pcie_type(dev) != PCI_EXP_TYPE_DOWNSTREAM))
+ return false;
+
+ return cxl_port_dvsec(dev);
+}
+
static bool cxl_sbr_masked(struct pci_dev *dev)
{
u16 dvsec, reg;
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index f1615805f5b0..277e3fc8e1a7 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1631,6 +1631,14 @@ static void set_pcie_thunderbolt(struct pci_dev *dev)
dev->is_thunderbolt = 1;
}
+static void set_pcie_cxl(struct pci_dev *dev)
+{
+ u16 dvsec = pci_find_dvsec_capability(dev, PCI_VENDOR_ID_CXL,
+ PCI_DVSEC_CXL_FLEXBUS);
+ if (dvsec)
+ dev->is_cxl = 1;
+}
+
static void set_pcie_untrusted(struct pci_dev *dev)
{
struct pci_dev *parent;
@@ -1945,6 +1953,8 @@ int pci_setup_device(struct pci_dev *dev)
/* Need to have dev->cfg_size ready */
set_pcie_thunderbolt(dev);
+ set_pcie_cxl(dev);
+
set_pcie_untrusted(dev);
/* "Unknown power state" */
diff --git a/include/linux/pci.h b/include/linux/pci.h
index f6a9dddfc9e9..33a9abecdaba 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -443,6 +443,7 @@ struct pci_dev {
unsigned int is_hotplug_bridge:1;
unsigned int shpc_managed:1; /* SHPC owned by shpchp */
unsigned int is_thunderbolt:1; /* Thunderbolt controller */
+ unsigned int is_cxl:1; /* Compute Express Link (CXL) */
/*
* Devices marked being untrusted are the ones that can potentially
* execute DMA attacks and similar. They are typically connected
@@ -743,6 +744,9 @@ static inline bool pci_is_vga(struct pci_dev *pdev)
return false;
}
+#define pcie_is_cxl(dev) (dev->is_cxl)
+bool pcie_is_cxl_port(struct pci_dev *dev);
+
#define for_each_pci_bridge(dev, bus) \
list_for_each_entry(dev, &bus->devices, bus_list) \
if (!pci_is_bridge(dev)) {} else
diff --git a/include/uapi/linux/pci_regs.h b/include/uapi/linux/pci_regs.h
index 12323b3334a9..5df6c74963c5 100644
--- a/include/uapi/linux/pci_regs.h
+++ b/include/uapi/linux/pci_regs.h
@@ -1186,9 +1186,10 @@
#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_PROTOCOL 0x00ff0000
#define PCI_DOE_DATA_OBJECT_DISC_RSP_3_NEXT_INDEX 0xff000000
-/* Compute Express Link (CXL r3.1, sec 8.1.5) */
+/* Compute Express Link (CXL r3.1, sec 8.1) */
#define PCI_DVSEC_CXL_PORT 3
#define PCI_DVSEC_CXL_PORT_CTL 0x0c
#define PCI_DVSEC_CXL_PORT_CTL_UNMASK_SBR 0x00000001
+#define PCI_DVSEC_CXL_FLEXBUS 7
#endif /* LINUX_PCI_REGS_H */
--
2.34.1
* [PATCH v4 04/15] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type
From: Terry Bowman @ 2024-12-11 23:39 UTC (permalink / raw)
To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
The AER driver and aer_event tracing currently log 'PCIe Bus Error'
for all errors.
Update the driver and aer_event tracing to log 'CXL Bus Error' for CXL
device errors.
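The effect on the log text can be shown with a small userspace sketch. The
format string is the one used in the diff below; toy_bus_type() and
toy_format_error() are hypothetical helpers that take a plain bool in place
of pcie_is_cxl(dev).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Pick the bus label the same way the patch does with pcie_is_cxl(). */
static const char *toy_bus_type(bool is_cxl)
{
	return is_cxl ? "CXL" : "PCIe";
}

/* Format one error line into 'buf'; returns what snprintf() returns. */
static int toy_format_error(char *buf, size_t len, bool is_cxl,
			    const char *severity, const char *layer,
			    const char *agent)
{
	assert(buf != NULL && len > 0);
	return snprintf(buf, len, "%s Bus Error: severity=%s, type=%s, (%s)",
			toy_bus_type(is_cxl), severity, layer, agent);
}
```

Only the leading label changes, so existing log parsers that match on
"Bus Error:" keep working while CXL and PCIe errors become distinguishable.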
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
---
drivers/pci/pcie/aer.c | 14 ++++++++------
include/ras/ras_event.h | 9 ++++++---
2 files changed, 14 insertions(+), 9 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index fe6edf26279e..53e9a11f6c0f 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -699,13 +699,14 @@ static void __aer_print_error(struct pci_dev *dev,
void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
{
+ const char *bus_type = pcie_is_cxl(dev) ? "CXL" : "PCIe";
int layer, agent;
int id = pci_dev_id(dev);
const char *level;
if (!info->status) {
- pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
- aer_error_severity_string[info->severity]);
+ pci_err(dev, "%s Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
+ bus_type, aer_error_severity_string[info->severity]);
goto out;
}
@@ -714,8 +715,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
- pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
- aer_error_severity_string[info->severity],
+ pci_printk(level, dev, "%s Bus Error: severity=%s, type=%s, (%s)\n",
+ bus_type, aer_error_severity_string[info->severity],
aer_error_layer[layer], aer_agent_string[agent]);
pci_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
@@ -730,7 +731,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
if (info->id && info->error_dev_num > 1 && info->id == id)
pci_err(dev, " Error of this Agent is reported first\n");
- trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
+ trace_aer_event(dev_name(&dev->dev), bus_type, (info->status & ~info->mask),
info->severity, info->tlp_header_valid, &info->tlp);
}
@@ -764,6 +765,7 @@ EXPORT_SYMBOL_GPL(cper_severity_to_aer);
void pci_print_aer(struct pci_dev *dev, int aer_severity,
struct aer_capability_regs *aer)
{
+ const char *bus_type = pcie_is_cxl(dev) ? "CXL" : "PCIe";
int layer, agent, tlp_header_valid = 0;
u32 status, mask;
struct aer_err_info info;
@@ -798,7 +800,7 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
if (tlp_header_valid)
__print_tlp_header(dev, &aer->header_log);
- trace_aer_event(dev_name(&dev->dev), (status & ~mask),
+ trace_aer_event(dev_name(&dev->dev), bus_type, (status & ~mask),
aer_severity, tlp_header_valid, &aer->header_log);
}
EXPORT_SYMBOL_NS_GPL(pci_print_aer, CXL);
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index e5f7ee0864e7..1bf8e7050ba8 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -297,15 +297,17 @@ TRACE_EVENT(non_standard_event,
TRACE_EVENT(aer_event,
TP_PROTO(const char *dev_name,
+ const char *bus_type,
const u32 status,
const u8 severity,
const u8 tlp_header_valid,
struct pcie_tlp_log *tlp),
- TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp),
+ TP_ARGS(dev_name, bus_type, status, severity, tlp_header_valid, tlp),
TP_STRUCT__entry(
__string( dev_name, dev_name )
+ __string( bus_type, bus_type )
__field( u32, status )
__field( u8, severity )
__field( u8, tlp_header_valid)
@@ -314,6 +316,7 @@ TRACE_EVENT(aer_event,
TP_fast_assign(
__assign_str(dev_name);
+ __assign_str(bus_type);
__entry->status = status;
__entry->severity = severity;
__entry->tlp_header_valid = tlp_header_valid;
@@ -325,8 +328,8 @@ TRACE_EVENT(aer_event,
}
),
- TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s\n",
- __get_str(dev_name),
+ TP_printk("%s %s Bus Error: severity=%s, %s, TLP Header=%s\n",
+ __get_str(dev_name), __get_str(bus_type),
__entry->severity == AER_CORRECTABLE ? "Corrected" :
__entry->severity == AER_FATAL ?
"Fatal" : "Uncorrected, non-fatal",
--
2.34.1
* [PATCH v4 05/15] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver
2024-12-11 23:39 [PATCH v4 0/15] Enable CXL PCIe Port protocol error handling and logging Terry Bowman
` (3 preceding siblings ...)
2024-12-11 23:39 ` [PATCH v4 04/15] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
@ 2024-12-11 23:39 ` Terry Bowman
2024-12-11 23:39 ` [PATCH v4 06/15] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe Port devices Terry Bowman
` (9 subsequent siblings)
14 siblings, 0 replies; 45+ messages in thread
From: Terry Bowman @ 2024-12-11 23:39 UTC (permalink / raw)
To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
The AER service driver supports handling Downstream Port protocol errors in
restricted CXL host (RCH) mode, also known as CXL1.1. It needs the same
functionality for CXL PCIe Ports operating in virtual hierarchy (VH)
mode.[1]
CXL and PCIe protocol error handling have different requirements that
necessitate a separate handling path. The AER service driver may try to
recover PCIe uncorrectable non-fatal errors (UCE). The same recovery is not
suitable for CXL PCIe Port devices because of the potential for system memory
corruption. Instead, CXL protocol error handling must use a kernel panic
in the case of a fatal or non-fatal UCE. The AER driver's PCIe error
handling does not panic the kernel in response to a UCE.
Introduce a separate path for CXL protocol error handling in the AER
service driver. This will allow CXL protocol errors to use CXL specific
handling instead of PCIe handling. Add the CXL specific changes without
affecting or adding functionality in the PCIe handling.
Make this update alongside the existing Downstream Port RCH error handling
logic, extending support to CXL PCIe Ports in VH mode.
is_internal_error() is currently only built when the CONFIG_PCIEAER_CXL
kernel config is enabled. Move is_internal_error()'s definition so that it
is always available, regardless of whether CONFIG_PCIEAER_CXL is enabled
or disabled.
The uncorrectable error (UCE) handling will be added in a future patch.
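The path selection described above can be sketched in user-space C. This
is only an illustrative model, not the kernel code: 'struct err_info',
'enum path' and select_path() are hypothetical names, and the
handles_cxl flag stands in for the result of handles_cxl_errors(). The
two status bit values are the AER internal error bits as defined in the
kernel's pci_regs.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* AER status bits for internal errors (values as in pci_regs.h). */
#define PCI_ERR_COR_INTERNAL 0x00004000u /* Corrected Internal Error */
#define PCI_ERR_UNC_INTN     0x00400000u /* Uncorrectable Internal Error */

/* Hypothetical stand-ins for the kernel's severity and handling path. */
enum severity { SEV_NONFATAL, SEV_FATAL, SEV_CORRECTABLE };
enum path { PATH_PCIE, PATH_CXL };

struct err_info {
	enum severity severity;
	uint32_t status;
};

/* Same test as the patch: pick the internal error bit by severity. */
static bool is_internal_error(const struct err_info *info)
{
	if (info->severity == SEV_CORRECTABLE)
		return info->status & PCI_ERR_COR_INTERNAL;
	return info->status & PCI_ERR_UNC_INTN;
}

/*
 * Only an internal error on a device that handles CXL errors takes the
 * CXL path. Everything else keeps the existing PCIe handling.
 */
static enum path select_path(const struct err_info *info, bool handles_cxl)
{
	return (is_internal_error(info) && handles_cxl) ? PATH_CXL : PATH_PCIE;
}
```

In this model a correctable internal error on a CXL Port selects
PATH_CXL, while the same status on a non-CXL device, or a non-internal
error on a CXL Port, stays on PATH_PCIE.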
[1] CXL 3.1 Spec, 12.2.2 CXL Root Ports, Downstream Switch Ports, and
Upstream Switch Ports
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
---
drivers/pci/pcie/aer.c | 61 +++++++++++++++++++++++++++---------------
1 file changed, 40 insertions(+), 21 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 53e9a11f6c0f..d75886174969 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -941,8 +941,15 @@ static bool find_source_device(struct pci_dev *parent,
return true;
}
-#ifdef CONFIG_PCIEAER_CXL
+static bool is_internal_error(struct aer_err_info *info)
+{
+ if (info->severity == AER_CORRECTABLE)
+ return info->status & PCI_ERR_COR_INTERNAL;
+ return info->status & PCI_ERR_UNC_INTN;
+}
+
+#ifdef CONFIG_PCIEAER_CXL
/**
* pci_aer_unmask_internal_errors - unmask internal errors
* @dev: pointer to the pcie_dev data structure
@@ -994,14 +1001,6 @@ static bool cxl_error_is_native(struct pci_dev *dev)
return (pcie_ports_native || host->native_aer);
}
-static bool is_internal_error(struct aer_err_info *info)
-{
- if (info->severity == AER_CORRECTABLE)
- return info->status & PCI_ERR_COR_INTERNAL;
-
- return info->status & PCI_ERR_UNC_INTN;
-}
-
static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
{
struct aer_err_info *info = (struct aer_err_info *)data;
@@ -1033,14 +1032,23 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
{
- /*
- * Internal errors of an RCEC indicate an AER error in an
- * RCH's downstream port. Check and handle them in the CXL.mem
- * device driver.
- */
- if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
- is_internal_error(info))
- pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
+ if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
+ return pcie_walk_rcec(dev, cxl_rch_handle_error_iter, info);
+
+ if (info->severity == AER_CORRECTABLE) {
+ struct pci_driver *pdrv = dev->driver;
+ int aer = dev->aer_cap;
+
+ if (aer)
+ pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
+ info->status);
+
+ if (pdrv && pdrv->cxl_err_handler &&
+ pdrv->cxl_err_handler->cor_error_detected)
+ pdrv->cxl_err_handler->cor_error_detected(dev);
+
+ pcie_clear_device_status(dev);
+ }
}
static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
@@ -1058,9 +1066,13 @@ static bool handles_cxl_errors(struct pci_dev *dev)
{
bool handles_cxl = false;
- if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC &&
- pcie_aer_is_native(dev))
+ if (!pcie_aer_is_native(dev))
+ return false;
+
+ if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_EC)
pcie_walk_rcec(dev, handles_cxl_error_iter, &handles_cxl);
+ else
+ handles_cxl = pcie_is_cxl_port(dev);
return handles_cxl;
}
@@ -1078,6 +1090,10 @@ static void cxl_enable_internal_errors(struct pci_dev *dev)
static inline void cxl_enable_internal_errors(struct pci_dev *dev) { }
static inline void cxl_handle_error(struct pci_dev *dev,
struct aer_err_info *info) { }
+static bool handles_cxl_errors(struct pci_dev *dev)
+{
+ return false;
+}
#endif
/**
@@ -1115,8 +1131,11 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
{
- cxl_handle_error(dev, info);
- pci_aer_handle_error(dev, info);
+ if (is_internal_error(info) && handles_cxl_errors(dev))
+ cxl_handle_error(dev, info);
+ else
+ pci_aer_handle_error(dev, info);
+
pci_dev_put(dev);
}
--
2.34.1
* [PATCH v4 06/15] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe Port devices
2024-12-11 23:39 [PATCH v4 0/15] Enable CXL PCIe Port protocol error handling and logging Terry Bowman
` (4 preceding siblings ...)
2024-12-11 23:39 ` [PATCH v4 05/15] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver Terry Bowman
@ 2024-12-11 23:39 ` Terry Bowman
2024-12-24 18:28 ` Jonathan Cameron
2024-12-11 23:39 ` [PATCH v4 07/15] PCI/AER: Add CXL PCIe Port Uncorrectable Error recovery in AER service driver Terry Bowman
` (8 subsequent siblings)
14 siblings, 1 reply; 45+ messages in thread
From: Terry Bowman @ 2024-12-11 23:39 UTC (permalink / raw)
To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
The AER service driver's aer_get_device_error_info() function doesn't read
uncorrectable (UCE) fatal error status from PCIe Upstream Port devices,
including CXL Upstream Switch Ports. As a result, fatal errors are not
logged or handled as needed for CXL PCIe Upstream Switch Port devices.
Update the aer_get_device_error_info() function to read the UCE fatal
status for all CXL PCIe Port devices. Make the change such that non-CXL
devices are not affected.
The fatal error status will be used in future patches implementing
CXL PCIe Port uncorrectable error handling and logging.
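The condition this patch extends can be modeled as a small predicate.
The sketch below is a user-space illustration under stated assumptions:
reads_uncor_status() and the enum names are hypothetical, the is_cxl
flag stands in for pcie_is_cxl(), and only the uncorrectable branch of
aer_get_device_error_info() is modeled. The port type values match the
PCIe Capabilities register encodings.

```c
#include <assert.h>
#include <stdbool.h>

/* PCIe device/port type encodings (PCI Express Capabilities register). */
enum port_type {
	TYPE_ENDPOINT   = 0x0,
	TYPE_ROOT_PORT  = 0x4,
	TYPE_UPSTREAM   = 0x5,
	TYPE_DOWNSTREAM = 0x6,
	TYPE_RC_EC      = 0xa,
};

/* Hypothetical stand-in for the AER severity values. */
enum severity { SEV_NONFATAL, SEV_FATAL, SEV_CORRECTABLE };

/*
 * Returns true when the uncorrectable status register would be read.
 * The last term is the one added by this patch: a CXL Upstream Switch
 * Port now qualifies even for a fatal error.
 */
static bool reads_uncor_status(enum port_type type, enum severity sev,
			       bool is_cxl)
{
	return type == TYPE_ROOT_PORT ||
	       type == TYPE_RC_EC ||
	       type == TYPE_DOWNSTREAM ||
	       sev == SEV_NONFATAL ||
	       (is_cxl && type == TYPE_UPSTREAM);
}
```

With this model, a fatal error on a non-CXL Upstream Port still skips
the read, which is the "non-CXL devices are not affected" property the
commit message asks for.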
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/pci/pcie/aer.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d75886174969..c1eb939c1cca 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1250,7 +1250,8 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
type == PCI_EXP_TYPE_RC_EC ||
type == PCI_EXP_TYPE_DOWNSTREAM ||
- info->severity == AER_NONFATAL) {
+ info->severity == AER_NONFATAL ||
+ (pcie_is_cxl(dev) && type == PCI_EXP_TYPE_UPSTREAM)) {
/* Link is still healthy for IO reads */
pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,
--
2.34.1
* [PATCH v4 07/15] PCI/AER: Add CXL PCIe Port Uncorrectable Error recovery in AER service driver
2024-12-11 23:39 [PATCH v4 0/15] Enable CXL PCIe Port protocol error handling and logging Terry Bowman
` (5 preceding siblings ...)
2024-12-11 23:39 ` [PATCH v4 06/15] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe Port devices Terry Bowman
@ 2024-12-11 23:39 ` Terry Bowman
2024-12-12 9:28 ` Alejandro Lucero Palau
2024-12-24 18:31 ` Jonathan Cameron
2024-12-11 23:39 ` [PATCH v4 08/15] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers Terry Bowman
` (7 subsequent siblings)
14 siblings, 2 replies; 45+ messages in thread
From: Terry Bowman @ 2024-12-11 23:39 UTC (permalink / raw)
To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
The existing recovery procedure for PCIe Uncorrectable Errors (UCE) does
not apply to CXL devices. Recovery cannot be used for CXL devices because
of potential corruption of what may be system memory. Also, current PCIe UCE
recovery, in the case of a Root Port (RP) or Downstream Switch Port (DSP),
does not begin at the RP/DSP but begins at the first downstream device.
This will miss handling CXL Protocol Errors in a CXL RP or DSP. A separate
CXL recovery is needed because of the different handling requirements.
Add a new function, cxl_do_recovery(), using the following.
Add cxl_walk_bridge() to iterate the detected error's sub-topology.
cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
will begin iteration at the RP or DSP rather than beginning at the
first downstream device.
Add cxl_report_error_detected() as an analog to report_error_detected().
It will call pci_driver::cxl_err_handler for each iterated downstream
device. The pci_driver::cxl_err_handler's UCE handler returns a boolean
indicating if there was a UCE detected during handling.
cxl_do_recovery() uses the status from cxl_report_error_detected() to
determine how to proceed. Non-fatal CXL UCEs will be treated as
fatal. If a UCE was present during handling then cxl_do_recovery()
will kernel panic.
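The walk order described above can be illustrated with a small
user-space model. Everything here is a hypothetical stand-in: 'struct
dev' replaces 'struct pci_dev', has_uce replaces the result of the
driver's UCE handler, walk_bus() approximates pci_walk_bus() (stop as
soon as the callback returns non-zero), and recovery_must_panic()
returns the decision instead of calling panic().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct dev {
	const char *name;
	bool has_uce;        /* what the driver's UCE handler would report */
	struct dev *child;   /* first device on the subordinate bus */
	struct dev *sibling; /* next device on the same bus */
	int visits;          /* how many times the callback ran here */
};

typedef int (*walk_cb)(struct dev *, void *);

/* Depth-first walk of a bus; stops once the callback returns non-zero. */
static int walk_bus(struct dev *first, walk_cb cb, void *data)
{
	for (struct dev *d = first; d; d = d->sibling) {
		if (cb(d, data))
			return 1;
		if (d->child && walk_bus(d->child, cb, data))
			return 1;
	}
	return 0;
}

/* Unlike the PCIe walk, the bridge itself is visited first. */
static void walk_from_bridge(struct dev *bridge, walk_cb cb, bool *status)
{
	cb(bridge, status);
	if (bridge->child && !*status)
		walk_bus(bridge->child, cb, status);
}

/* Accumulate "a UCE was seen" across all visited devices. */
static int report_error_detected(struct dev *d, void *data)
{
	bool *status = data;

	d->visits++;
	*status |= d->has_uce;
	return *status;
}

/* True when the model would reach the panic in cxl_do_recovery(). */
static bool recovery_must_panic(struct dev *bridge)
{
	bool status = false; /* must start cleared before the walk */

	walk_from_bridge(bridge, report_error_detected, &status);
	return status;
}
```

In this sketch a UCE reported at the DSP itself ends the walk before
any downstream device is visited, which is the case the PCIe walk would
have missed.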
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/pci/pci.h | 3 +++
drivers/pci/pcie/aer.c | 4 ++++
drivers/pci/pcie/err.c | 54 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 61 insertions(+)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 14d00ce45bfa..5a67e41919d8 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -658,6 +658,9 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
pci_channel_state_t state,
pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev));
+/* CXL error reporting and handling */
+void cxl_do_recovery(struct pci_dev *dev);
+
bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
int pcie_retrain_link(struct pci_dev *pdev, bool use_lt);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index c1eb939c1cca..861521872318 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1024,6 +1024,8 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
err_handler->error_detected(dev, pci_channel_io_normal);
else if (info->severity == AER_FATAL)
err_handler->error_detected(dev, pci_channel_io_frozen);
+
+ cxl_do_recovery(dev);
}
out:
device_unlock(&dev->dev);
@@ -1048,6 +1050,8 @@ static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
pdrv->cxl_err_handler->cor_error_detected(dev);
pcie_clear_device_status(dev);
+ } else {
+ cxl_do_recovery(dev);
}
}
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 31090770fffc..6f7cf5e0087f 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -276,3 +276,57 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
return status;
}
+
+static void cxl_walk_bridge(struct pci_dev *bridge,
+ int (*cb)(struct pci_dev *, void *),
+ void *userdata)
+{
+ bool *status = userdata;
+
+ cb(bridge, status);
+ if (bridge->subordinate && !*status)
+ pci_walk_bus(bridge->subordinate, cb, status);
+}
+
+static int cxl_report_error_detected(struct pci_dev *dev, void *data)
+{
+ struct pci_driver *pdrv = dev->driver;
+ bool *status = data;
+
+ device_lock(&dev->dev);
+ if (pdrv && pdrv->cxl_err_handler &&
+ pdrv->cxl_err_handler->error_detected) {
+ const struct cxl_error_handlers *cxl_err_handler =
+ pdrv->cxl_err_handler;
+ *status |= cxl_err_handler->error_detected(dev);
+ }
+ device_unlock(&dev->dev);
+ return *status;
+}
+
+void cxl_do_recovery(struct pci_dev *dev)
+{
+ struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
+ int type = pci_pcie_type(dev);
+ struct pci_dev *bridge;
+ bool status = false;
+
+ if (type == PCI_EXP_TYPE_ROOT_PORT ||
+ type == PCI_EXP_TYPE_DOWNSTREAM ||
+ type == PCI_EXP_TYPE_UPSTREAM ||
+ type == PCI_EXP_TYPE_ENDPOINT)
+ bridge = dev;
+ else
+ bridge = pci_upstream_bridge(dev);
+
+ cxl_walk_bridge(bridge, cxl_report_error_detected, &status);
+ if (status)
+ panic("CXL cachemem error.");
+
+ if (host->native_aer || pcie_ports_native) {
+ pcie_clear_device_status(dev);
+ pci_aer_clear_nonfatal_status(dev);
+ }
+
+ pci_info(bridge, "CXL uncorrectable error.\n");
+}
--
2.34.1
* [PATCH v4 08/15] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers
2024-12-11 23:39 [PATCH v4 0/15] Enable CXL PCIe Port protocol error handling and logging Terry Bowman
` (6 preceding siblings ...)
2024-12-11 23:39 ` [PATCH v4 07/15] PCI/AER: Add CXL PCIe Port Uncorrectable Error recovery in AER service driver Terry Bowman
@ 2024-12-11 23:39 ` Terry Bowman
2024-12-12 10:36 ` Alejandro Lucero Palau
2024-12-24 18:38 ` Jonathan Cameron
2024-12-11 23:39 ` [PATCH v4 09/15] cxl/pci: Map CXL PCIe Upstream " Terry Bowman
` (6 subsequent siblings)
14 siblings, 2 replies; 45+ messages in thread
From: Terry Bowman @ 2024-12-11 23:39 UTC (permalink / raw)
To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
The CXL mem driver (cxl_mem) currently maps and caches a pointer to RAS
registers for the endpoint's Root Port. The same needs to be done for
each of the CXL Downstream Switch Ports and CXL Root Ports found between
the endpoint and CXL Host Bridge.
Introduce cxl_init_ep_ports_aer() to be called for each CXL Port in the
sub-topology between the endpoint and the CXL Host Bridge. This function
will determine if there are CXL Downstream Switch Ports or CXL Root Ports
associated with this Port. The same check will be added in the future for
upstream switch ports.
Move the RAS register map logic from cxl_dport_map_ras() into
cxl_dport_init_ras_reporting(). This eliminates the need for the helper
function, cxl_dport_map_ras().
cxl_init_ep_ports_aer() calls cxl_dport_init_ras_reporting() to map
the RAS registers for CXL Downstream Switch Ports and CXL Root Ports.
cxl_dport_init_ras_reporting() must check for previously mapped registers
before mapping. This is necessary because endpoints under a CXL switch
may share CXL Downstream Switch Ports or CXL Root Ports. Ensure the port
registers are only mapped once.
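The map-once requirement can be shown with a minimal user-space sketch.
The names are hypothetical: 'struct dport' only models the cached RAS
pointer of 'struct cxl_dport', map_ras() stands in for
cxl_map_component_regs(), and map_calls exists purely to make the
behavior observable.

```c
#include <assert.h>
#include <stddef.h>

struct dport {
	void *ras;     /* cached mapping, NULL until mapped */
	int map_calls; /* counts real mapping attempts (illustration only) */
};

static char fake_ras_block[64]; /* placeholder for the mapped registers */

/* Stand-in for the real mapping call; returns 0 on success. */
static int map_ras(struct dport *dport)
{
	dport->map_calls++;
	dport->ras = fake_ras_block;
	return 0;
}

/*
 * Called once per endpoint. Endpoints below the same switch share the
 * dport, so a second call must find the cached pointer and return.
 */
static void dport_init_ras_reporting(struct dport *dport)
{
	if (dport->ras)
		return;

	if (map_ras(dport))
		dport->ras = NULL; /* leave unmapped on failure */
}
```

Calling dport_init_ras_reporting() for two endpoints that share one
dport performs a single mapping in this model.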
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/cxl/core/pci.c | 37 +++++++++++++++----------------------
drivers/cxl/cxl.h | 6 ++----
drivers/cxl/mem.c | 31 +++++++++++++++++++++++++++++--
3 files changed, 46 insertions(+), 28 deletions(-)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 5b46bc46aaa9..8540d1fd2e25 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -749,18 +749,6 @@ static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
}
}
-static void cxl_dport_map_ras(struct cxl_dport *dport)
-{
- struct cxl_register_map *map = &dport->reg_map;
- struct device *dev = dport->dport_dev;
-
- if (!map->component_map.ras.valid)
- dev_dbg(dev, "RAS registers not found\n");
- else if (cxl_map_component_regs(map, &dport->regs.component,
- BIT(CXL_CM_CAP_CAP_ID_RAS)))
- dev_dbg(dev, "Failed to map RAS capability.\n");
-}
-
static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
{
void __iomem *aer_base = dport->regs.dport_aer;
@@ -788,22 +776,27 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
/**
* cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
* @dport: the cxl_dport that needs to be initialized
- * @host: host device for devm operations
*/
-void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
+void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
{
- dport->reg_map.host = host;
- cxl_dport_map_ras(dport);
-
- if (dport->rch) {
- struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
-
- if (!host_bridge->native_aer)
- return;
+ struct device *dport_dev = dport->dport_dev;
+ struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
+ dport->reg_map.host = dport_dev;
+ if (dport->rch && host_bridge->native_aer) {
cxl_dport_map_rch_aer(dport);
cxl_disable_rch_root_ints(dport);
}
+
+ /* dport may have more than 1 downstream EP. Check if already mapped. */
+ if (dport->regs.ras)
+ return;
+
+ if (cxl_map_component_regs(&dport->reg_map, &dport->regs.component,
+ BIT(CXL_CM_CAP_CAP_ID_RAS))) {
+ dev_err(dport_dev, "Failed to map RAS capability.\n");
+ return;
+ }
}
EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 5406e3ab3d4a..51acca3415b4 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -763,11 +763,9 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
resource_size_t rcrb);
#ifdef CONFIG_PCIEAER_CXL
-void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport);
-void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host);
+void cxl_dport_init_ras_reporting(struct cxl_dport *dport);
#else
-static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport,
- struct device *host) { }
+static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport) { }
#endif
struct cxl_decoder *to_cxl_decoder(struct device *dev);
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index a9fd5cd5a0d2..0ae89c9da71e 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -45,6 +45,31 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
return 0;
}
+static bool dev_is_cxl_pci(struct device *dev, u32 pcie_type)
+{
+ struct pci_dev *pdev;
+
+ if (!dev || !dev_is_pci(dev))
+ return false;
+
+ pdev = to_pci_dev(dev);
+
+ return (pci_pcie_type(pdev) == pcie_type);
+}
+
+static void cxl_init_ep_ports_aer(struct cxl_ep *ep)
+{
+ struct cxl_dport *dport = ep->dport;
+
+ if (dport) {
+ struct device *dport_dev = dport->dport_dev;
+
+ if (dev_is_cxl_pci(dport_dev, PCI_EXP_TYPE_DOWNSTREAM) ||
+ dev_is_cxl_pci(dport_dev, PCI_EXP_TYPE_ROOT_PORT))
+ cxl_dport_init_ras_reporting(dport);
+ }
+}
+
static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
struct cxl_dport *parent_dport)
{
@@ -52,6 +77,9 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
struct cxl_port *endpoint, *iter, *down;
int rc;
+ if (parent_dport->rch)
+ cxl_dport_init_ras_reporting(parent_dport);
+
/*
* Now that the path to the root is established record all the
* intervening ports in the chain.
@@ -62,6 +90,7 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
ep = cxl_ep_load(iter, cxlmd);
ep->next = down;
+ cxl_init_ep_ports_aer(ep);
}
/* Note: endpoint port component registers are derived from @cxlds */
@@ -166,8 +195,6 @@ static int cxl_mem_probe(struct device *dev)
else
endpoint_parent = &parent_port->dev;
- cxl_dport_init_ras_reporting(dport, dev);
-
scoped_guard(device, endpoint_parent) {
if (!endpoint_parent->driver) {
dev_err(dev, "CXL port topology %s not enabled\n",
--
2.34.1
* [PATCH v4 09/15] cxl/pci: Map CXL PCIe Upstream Switch Port RAS registers
2024-12-11 23:39 [PATCH v4 0/15] Enable CXL PCIe Port protocol error handling and logging Terry Bowman
` (7 preceding siblings ...)
2024-12-11 23:39 ` [PATCH v4 08/15] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers Terry Bowman
@ 2024-12-11 23:39 ` Terry Bowman
2024-12-24 18:41 ` Jonathan Cameron
2024-12-11 23:39 ` [PATCH v4 10/15] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports Terry Bowman
` (5 subsequent siblings)
14 siblings, 1 reply; 45+ messages in thread
From: Terry Bowman @ 2024-12-11 23:39 UTC (permalink / raw)
To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
Add logic to map CXL PCIe Upstream Switch Port (USP) RAS registers.
Introduce 'struct cxl_regs' member into 'struct cxl_port' to cache a
pointer to the CXL Upstream Port's mapped RAS registers.
Also, introduce cxl_uport_init_ras_reporting() to perform the USP RAS
register mapping. This is similar to the existing
cxl_dport_init_ras_reporting() but for USP devices.
The USP may have multiple downstream endpoints. Before mapping the RAS
registers, check whether they are already mapped.
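How cxl_init_ep_ports_aer() chooses between the dport and uport
initializers can be sketched as a pure function. This is an assumption
laden model: 'struct port_dev' replaces 'struct device', the is_pci
flag replaces dev_is_pci(), and ras_init_actions() returns flags
instead of calling the initializers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* PCIe port type encodings (PCI Express Capabilities register). */
#define TYPE_ROOT_PORT  0x4
#define TYPE_UPSTREAM   0x5
#define TYPE_DOWNSTREAM 0x6

/* Flags naming which initializer the model would call. */
enum { INIT_NONE = 0, INIT_DPORT = 1, INIT_UPORT = 2 };

struct port_dev {
	bool is_pci;   /* false for e.g. a platform host bridge device */
	int pcie_type;
};

/* Mirrors dev_is_cxl_pci(): NULL and non-PCI devices never match. */
static bool dev_is_type(const struct port_dev *dev, int type)
{
	return dev && dev->is_pci && dev->pcie_type == type;
}

/* Decide which RAS mappings one cxl_ep record needs. */
static unsigned int ras_init_actions(const struct port_dev *dport_dev,
				     const struct port_dev *uport_dev)
{
	unsigned int actions = INIT_NONE;

	if (dev_is_type(dport_dev, TYPE_DOWNSTREAM) ||
	    dev_is_type(dport_dev, TYPE_ROOT_PORT))
		actions |= INIT_DPORT;

	if (dev_is_type(uport_dev, TYPE_UPSTREAM))
		actions |= INIT_UPORT;

	return actions;
}
```

In this model, a Root Port dport paired with a non-PCI uport device
yields only INIT_DPORT, while a DSP below a USP yields both flags.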
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/cxl/core/pci.c | 15 +++++++++++++++
drivers/cxl/cxl.h | 4 ++++
drivers/cxl/mem.c | 8 ++++++++
3 files changed, 27 insertions(+)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 8540d1fd2e25..08073bbe2697 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -773,6 +773,21 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
}
+void cxl_uport_init_ras_reporting(struct cxl_port *port)
+{
+ /* uport may have more than 1 downstream EP. Check if already mapped. */
+ if (port->uport_regs.ras)
+ return;
+
+ port->reg_map.host = &port->dev;
+ if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
+ BIT(CXL_CM_CAP_CAP_ID_RAS))) {
+ dev_err(&port->dev, "Failed to map RAS capability.\n");
+ return;
+ }
+}
+EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, CXL);
+
/**
* cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
* @dport: the cxl_dport that needs to be initialized
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 51acca3415b4..0cf8d2cfcd8b 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -592,6 +592,7 @@ struct cxl_dax_region {
* @parent_dport: dport that points to this port in the parent
* @decoder_ida: allocator for decoder ids
* @reg_map: component and ras register mapping parameters
+ * @uport_regs: mapped component registers
* @nr_dports: number of entries in @dports
* @hdm_end: track last allocated HDM decoder instance for allocation ordering
* @commit_end: cursor to track highest committed decoder for commit ordering
@@ -612,6 +613,7 @@ struct cxl_port {
struct cxl_dport *parent_dport;
struct ida decoder_ida;
struct cxl_register_map reg_map;
+ struct cxl_component_regs uport_regs;
int nr_dports;
int hdm_end;
int commit_end;
@@ -764,8 +766,10 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
#ifdef CONFIG_PCIEAER_CXL
void cxl_dport_init_ras_reporting(struct cxl_dport *dport);
+void cxl_uport_init_ras_reporting(struct cxl_port *port);
#else
static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport) { }
+static inline void cxl_uport_init_ras_reporting(struct cxl_port *port) { }
#endif
struct cxl_decoder *to_cxl_decoder(struct device *dev);
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 0ae89c9da71e..0ce71af8ce22 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -60,6 +60,7 @@ static bool dev_is_cxl_pci(struct device *dev, u32 pcie_type)
static void cxl_init_ep_ports_aer(struct cxl_ep *ep)
{
struct cxl_dport *dport = ep->dport;
+ struct cxl_port *port = ep->next;
if (dport) {
struct device *dport_dev = dport->dport_dev;
@@ -68,6 +69,13 @@ static void cxl_init_ep_ports_aer(struct cxl_ep *ep)
dev_is_cxl_pci(dport_dev, PCI_EXP_TYPE_ROOT_PORT))
cxl_dport_init_ras_reporting(dport);
}
+
+ if (port) {
+ struct device *uport_dev = port->uport_dev;
+
+ if (dev_is_cxl_pci(uport_dev, PCI_EXP_TYPE_UPSTREAM))
+ cxl_uport_init_ras_reporting(port);
+ }
}
static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
--
2.34.1
* [PATCH v4 10/15] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports
2024-12-11 23:39 [PATCH v4 0/15] Enable CXL PCIe Port protocol error handling and logging Terry Bowman
` (8 preceding siblings ...)
2024-12-11 23:39 ` [PATCH v4 09/15] cxl/pci: Map CXL PCIe Upstream " Terry Bowman
@ 2024-12-11 23:39 ` Terry Bowman
2024-12-12 10:38 ` Alejandro Lucero Palau
2024-12-24 18:42 ` Jonathan Cameron
2024-12-11 23:39 ` [PATCH v4 11/15] cxl/pci: Change find_cxl_port() to non-static Terry Bowman
` (4 subsequent siblings)
14 siblings, 2 replies; 45+ messages in thread
From: Terry Bowman @ 2024-12-11 23:39 UTC (permalink / raw)
To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
CXL PCIe Port protocol error handling support will be added to the
CXL drivers in the future. In preparation, update the existing
interfaces to support handling all CXL PCIe Port protocol errors.
The driver's RAS support functions currently rely on a 'struct
cxl_dev_state' type parameter, which is not available for CXL Port
devices. However, since the same CXL RAS capability structure is
needed across most CXL components and devices,[1] a common handling
approach should be adopted.
To accommodate this, update the __cxl_handle_cor_ras() and
__cxl_handle_ras() functions to use a `struct device` instead of
`struct cxl_dev_state`.
No functional changes are introduced.
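The shape of the updated correctable handler can be sketched in
user-space C. This is a model under assumptions, not the driver code:
'struct device' here is a local stand-in with only a name,
'struct fake_reg' emulates a write-1-to-clear status register, the mask
value 0x7f is assumed for illustration, and handle_cor_ras() reports
through printf() and an out parameter instead of a trace event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed mask of valid correctable status bits (illustration only). */
#define RAS_COR_STATUS_MASK 0x7fu

/* Local stand-in for the generic device the handler now receives. */
struct device {
	const char *name;
};

/* Emulated write-1-to-clear status register. */
struct fake_reg {
	uint32_t value;
};

static uint32_t reg_read(const struct fake_reg *reg)
{
	return reg->value;
}

static void reg_write_w1c(struct fake_reg *reg, uint32_t bits)
{
	reg->value &= ~bits; /* writing a 1 clears that status bit */
}

/*
 * Takes any device plus its RAS status register, so the same code can
 * serve endpoints and ports. Returns true if an error was logged.
 */
static bool handle_cor_ras(const struct device *dev,
			   struct fake_reg *status_reg, uint32_t *logged)
{
	uint32_t status;

	if (!status_reg)
		return false;

	status = reg_read(status_reg);
	if (!(status & RAS_COR_STATUS_MASK))
		return false;

	reg_write_w1c(status_reg, status & RAS_COR_STATUS_MASK);
	printf("%s: correctable RAS status 0x%08x\n", dev->name,
	       (unsigned int)status);
	*logged = status;
	return true;
}
```

Because the handler depends only on a device and a register base, the
caller decides whether those come from an endpoint or from a port.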
[1] CXL 3.1 Spec, 8.2.4 CXL.cache and CXL.mem Registers
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/cxl/core/pci.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 08073bbe2697..89f8d65d71ce 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -650,7 +650,7 @@ void read_cdat_data(struct cxl_port *port)
}
EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
-static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
+static void __cxl_handle_cor_ras(struct device *dev,
void __iomem *ras_base)
{
void __iomem *addr;
@@ -663,13 +663,13 @@ static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
status = readl(addr);
if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
- trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
+ trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
}
}
static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
{
- return __cxl_handle_cor_ras(cxlds, cxlds->regs.ras);
+ return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
}
/* CXL spec rev3.0 8.2.4.16.1 */
@@ -693,8 +693,7 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
* Log the state of the RAS status registers and prepare them to log the
* next error status. Return 1 if reset needed.
*/
-static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
- void __iomem *ras_base)
+static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
{
u32 hl[CXL_HEADERLOG_SIZE_U32];
void __iomem *addr;
@@ -721,7 +720,7 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
}
header_log_copy(ras_base, hl);
- trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe, hl);
+ trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
return true;
@@ -729,7 +728,7 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
{
- return __cxl_handle_ras(cxlds, cxlds->regs.ras);
+ return __cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
}
#ifdef CONFIG_PCIEAER_CXL
@@ -818,13 +817,13 @@ EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
struct cxl_dport *dport)
{
- return __cxl_handle_cor_ras(cxlds, dport->regs.ras);
+ return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, dport->regs.ras);
}
static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
struct cxl_dport *dport)
{
- return __cxl_handle_ras(cxlds, dport->regs.ras);
+ return __cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
}
/*
--
2.34.1
* [PATCH v4 11/15] cxl/pci: Change find_cxl_port() to non-static
2024-12-11 23:39 [PATCH v4 0/15] Enable CXL PCIe Port protocol error handling and logging Terry Bowman
` (9 preceding siblings ...)
2024-12-11 23:39 ` [PATCH v4 10/15] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports Terry Bowman
@ 2024-12-11 23:39 ` Terry Bowman
2024-12-11 23:39 ` [PATCH v4 12/15] cxl/pci: Add error handler for CXL PCIe Port RAS errors Terry Bowman
` (3 subsequent siblings)
14 siblings, 0 replies; 45+ messages in thread
From: Terry Bowman @ 2024-12-11 23:39 UTC (permalink / raw)
To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
CXL PCIe Port protocol error support will be added in the future. This
requires searching for a CXL PCIe Port device in the CXL topology as
provided by find_cxl_port(). However, find_cxl_port() is defined static
and as a result is not callable outside of this source file.
Update the find_cxl_port() declaration to be non-static.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
drivers/cxl/core/core.h | 3 +++
drivers/cxl/core/port.c | 4 ++--
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 0c62b4069ba0..d81e5ee25f58 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -110,4 +110,7 @@ bool cxl_need_node_perf_attrs_update(int nid);
int cxl_port_get_switch_dport_bandwidth(struct cxl_port *port,
struct access_coordinate *c);
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+ struct cxl_dport **dport);
+
#endif /* __CXL_CORE_H__ */
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index af92c67bc954..1c4daf9fd2f3 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1342,8 +1342,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
return NULL;
}
-static struct cxl_port *find_cxl_port(struct device *dport_dev,
- struct cxl_dport **dport)
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+ struct cxl_dport **dport)
{
struct cxl_find_port_ctx ctx = {
.dport_dev = dport_dev,
--
2.34.1
* [PATCH v4 12/15] cxl/pci: Add error handler for CXL PCIe Port RAS errors
2024-12-11 23:39 [PATCH v4 0/15] Enable CXL PCIe Port protocol error handling and logging Terry Bowman
` (10 preceding siblings ...)
2024-12-11 23:39 ` [PATCH v4 11/15] cxl/pci: Change find_cxl_port() to non-static Terry Bowman
@ 2024-12-11 23:39 ` Terry Bowman
2024-12-12 2:19 ` Li Ming
2024-12-24 18:43 ` Jonathan Cameron
2024-12-11 23:40 ` [PATCH v4 13/15] cxl/pci: Add trace logging " Terry Bowman
` (2 subsequent siblings)
14 siblings, 2 replies; 45+ messages in thread
From: Terry Bowman @ 2024-12-11 23:39 UTC (permalink / raw)
To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
Introduce correctable and uncorrectable CXL PCIe port protocol error
handlers.
The handlers will be called with a 'struct pci_dev' parameter
indicating the CXL Port device requiring handling. The CXL PCIe Port
device's underlying 'struct device' will match the Port device in the
CXL topology.
Use the PCIe Port's device object to find the matching Upstream Switch
Port, Downstream Switch Port, or Root Port in the CXL topology. The
matching device will contain a reference to the RAS register block used to
handle and log the error.
Invoke the existing __cxl_handle_ras() or __cxl_handle_cor_ras() passing
a reference to the RAS registers as a parameter. These functions will use
the register reference to clear the device's RAS status.
Future patches will assign the error handlers and add trace logging.
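The lookup described above can be sketched outside the kernel as a small, self-contained model. This is only an illustration under stated assumptions: the toy_* types and toy_port_ras() are hypothetical stand-ins for 'struct cxl_port', 'struct cxl_dport' and cxl_pci_port_ras(), and a flat array replaces the real CXL topology walk done by find_cxl_port() and bus_find_device().

```c
#include <stddef.h>

/* Hypothetical stand-ins for the PCIe port types checked by the patch. */
enum toy_port_type { TOY_ROOT_PORT, TOY_UPSTREAM, TOY_DOWNSTREAM, TOY_ENDPOINT };

/* Toy downstream port: the PCIe device it represents and its RAS block. */
struct toy_dport {
	const void *dport_dev;
	void *ras;
};

/* Toy CXL port: one upstream RAS block plus a list of downstream ports. */
struct toy_port {
	const void *uport_dev;
	void *uport_ras;
	struct toy_dport *dports;
	size_t nr_dports;
};

/*
 * Return the RAS block matching 'pdev', or NULL if none is found.
 * Root Ports and Downstream Switch Ports are matched against dports;
 * Upstream Switch Ports are matched against each port's uport_dev.
 */
void *toy_port_ras(const struct toy_port *ports, size_t nr_ports,
		   const void *pdev, enum toy_port_type type)
{
	size_t i, j;

	if (!pdev)
		return NULL;

	for (i = 0; i < nr_ports; i++) {
		const struct toy_port *port = &ports[i];

		if (type == TOY_UPSTREAM) {
			if (port->uport_dev == pdev)
				return port->uport_ras;
			continue;
		}

		/* Any other non-port type has no port RAS block. */
		if (type != TOY_ROOT_PORT && type != TOY_DOWNSTREAM)
			return NULL;

		for (j = 0; j < port->nr_dports; j++)
			if (port->dports[j].dport_dev == pdev)
				return port->dports[j].ras;
	}

	return NULL;
}
```

A NULL result corresponds to the case where the handlers are invoked for a device with no mapped RAS registers, which the callee must tolerate.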
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/cxl/core/pci.c | 61 ++++++++++++++++++++++++++++++++++++++++++
1 file changed, 61 insertions(+)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 89f8d65d71ce..52afaedf5171 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -772,6 +772,67 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
}
+static int match_uport(struct device *dev, const void *data)
+{
+ struct device *uport_dev = (struct device *)data;
+ struct cxl_port *port;
+
+ if (!is_cxl_port(dev))
+ return 0;
+
+ port = to_cxl_port(dev);
+
+ return port->uport_dev == uport_dev;
+}
+
+static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
+{
+ void __iomem *ras_base;
+ struct cxl_port *port;
+
+ if (!pdev)
+ return NULL;
+
+ if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
+ (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
+ struct cxl_dport *dport;
+
+ port = find_cxl_port(&pdev->dev, &dport);
+ ras_base = dport ? dport->regs.ras : NULL;
+ if (port)
+ put_device(&port->dev);
+ return ras_base;
+ } else if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {
+ struct device *port_dev;
+
+ port_dev = bus_find_device(&cxl_bus_type, NULL, &pdev->dev,
+ match_uport);
+ if (!port_dev)
+ return NULL;
+
+ port = to_cxl_port(port_dev);
+ ras_base = port ? port->uport_regs.ras : NULL;
+ put_device(port_dev);
+ return ras_base;
+ }
+
+ return NULL;
+}
+
+static void cxl_port_cor_error_detected(struct pci_dev *pdev)
+{
+ void __iomem *ras_base = cxl_pci_port_ras(pdev);
+
+ __cxl_handle_cor_ras(&pdev->dev, ras_base);
+}
+
+static bool cxl_port_error_detected(struct pci_dev *pdev)
+{
+ void __iomem *ras_base = cxl_pci_port_ras(pdev);
+
+ return __cxl_handle_ras(&pdev->dev, ras_base);
+}
+
void cxl_uport_init_ras_reporting(struct cxl_port *port)
{
/* uport may have more than 1 downstream EP. Check if already mapped. */
--
2.34.1
* [PATCH v4 13/15] cxl/pci: Add trace logging for CXL PCIe Port RAS errors
2024-12-11 23:39 [PATCH v4 0/15] Enable CXL PCIe Port protocol error handling and logging Terry Bowman
` (11 preceding siblings ...)
2024-12-11 23:39 ` [PATCH v4 12/15] cxl/pci: Add error handler for CXL PCIe Port RAS errors Terry Bowman
@ 2024-12-11 23:40 ` Terry Bowman
2024-12-12 9:46 ` Alejandro Lucero Palau
2024-12-24 18:46 ` Jonathan Cameron
2024-12-11 23:40 ` [PATCH v4 14/15] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers Terry Bowman
2024-12-11 23:40 ` [PATCH v4 15/15] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports Terry Bowman
14 siblings, 2 replies; 45+ messages in thread
From: Terry Bowman @ 2024-12-11 23:40 UTC (permalink / raw)
To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
The CXL drivers use kernel trace functions for logging endpoint and RCH
Downstream Port RAS errors. Similar functionality is required for CXL Root
Ports, CXL Downstream Switch Ports, and CXL Upstream Switch Ports.
Introduce trace logging functions for both RAS correctable and
uncorrectable errors specific to CXL PCIe Ports. Additionally, update
the PCIe Port error handlers to invoke these new trace functions.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/cxl/core/pci.c | 16 ++++++++++----
drivers/cxl/core/trace.h | 47 ++++++++++++++++++++++++++++++++++++++++
2 files changed, 59 insertions(+), 4 deletions(-)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 52afaedf5171..3294ad5ff28f 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -661,10 +661,14 @@ static void __cxl_handle_cor_ras(struct device *dev,
addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
status = readl(addr);
- if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
- writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
+ if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
+ return;
+ writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
+
+ if (is_cxl_memdev(dev))
trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
- }
+ else
+ trace_cxl_port_aer_correctable_error(dev, status);
}
static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
@@ -720,7 +724,11 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
}
header_log_copy(ras_base, hl);
- trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
+ if (is_cxl_memdev(dev))
+ trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
+ else
+ trace_cxl_port_aer_uncorrectable_error(dev, status, fe, hl);
+
writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
return true;
diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index 8389a94adb1a..681e415ac8f5 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -48,6 +48,34 @@
{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" } \
)
+TRACE_EVENT(cxl_port_aer_uncorrectable_error,
+ TP_PROTO(struct device *dev, u32 status, u32 fe, u32 *hl),
+ TP_ARGS(dev, status, fe, hl),
+ TP_STRUCT__entry(
+ __string(devname, dev_name(dev))
+ __string(host, dev_name(dev->parent))
+ __field(u32, status)
+ __field(u32, first_error)
+ __array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
+ ),
+ TP_fast_assign(
+ __assign_str(devname);
+ __assign_str(host);
+ __entry->status = status;
+ __entry->first_error = fe;
+ /*
+ * Embed the 512B headerlog data for user app retrieval and
+ * parsing, but no need to print this in the trace buffer.
+ */
+ memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
+ ),
+ TP_printk("device=%s host=%s status: '%s' first_error: '%s'",
+ __get_str(devname), __get_str(host),
+ show_uc_errs(__entry->status),
+ show_uc_errs(__entry->first_error)
+ )
+);
+
TRACE_EVENT(cxl_aer_uncorrectable_error,
TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
TP_ARGS(cxlmd, status, fe, hl),
@@ -96,6 +124,25 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" } \
)
+TRACE_EVENT(cxl_port_aer_correctable_error,
+ TP_PROTO(struct device *dev, u32 status),
+ TP_ARGS(dev, status),
+ TP_STRUCT__entry(
+ __string(devname, dev_name(dev))
+ __string(host, dev_name(dev->parent))
+ __field(u32, status)
+ ),
+ TP_fast_assign(
+ __assign_str(devname);
+ __assign_str(host);
+ __entry->status = status;
+ ),
+ TP_printk("device=%s host=%s status='%s'",
+ __get_str(devname), __get_str(host),
+ show_ce_errs(__entry->status)
+ )
+);
+
TRACE_EVENT(cxl_aer_correctable_error,
TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
TP_ARGS(cxlmd, status),
--
2.34.1
* [PATCH v4 14/15] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
2024-12-11 23:39 [PATCH v4 0/15] Enable CXL PCIe Port protocol error handling and logging Terry Bowman
` (12 preceding siblings ...)
2024-12-11 23:40 ` [PATCH v4 13/15] cxl/pci: Add trace logging " Terry Bowman
@ 2024-12-11 23:40 ` Terry Bowman
2024-12-12 2:31 ` Li Ming
2024-12-24 18:50 ` Jonathan Cameron
2024-12-11 23:40 ` [PATCH v4 15/15] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports Terry Bowman
14 siblings, 2 replies; 45+ messages in thread
From: Terry Bowman @ 2024-12-11 23:40 UTC (permalink / raw)
To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
pci_driver::cxl_err_handlers are not currently assigned handler callbacks.
The handlers can't be set in the pci_driver static definition because the
CXL PCIe Port devices are bound to the portdrv driver which is not CXL
driver aware.
Add cxl_assign_port_error_handlers() in the cxl_core module. This
function will assign the default handlers for a CXL PCIe Port device.
When the CXL Port (cxl_port or cxl_dport) is destroyed the device's
pci_driver::cxl_err_handlers must be set to NULL indicating they should no
longer be used.
Create cxl_clear_port_error_handlers() and register it to be called
when the CXL Port device (cxl_port or cxl_dport) is destroyed.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/cxl/core/pci.c | 40 ++++++++++++++++++++++++++++++++++++++++
1 file changed, 40 insertions(+)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 3294ad5ff28f..9734a4c55b29 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -841,8 +841,38 @@ static bool cxl_port_error_detected(struct pci_dev *pdev)
return __cxl_handle_ras(&pdev->dev, ras_base);
}
+static const struct cxl_error_handlers cxl_port_error_handlers = {
+ .error_detected = cxl_port_error_detected,
+ .cor_error_detected = cxl_port_cor_error_detected,
+};
+
+static void cxl_assign_port_error_handlers(struct pci_dev *pdev)
+{
+ struct pci_driver *pdrv;
+
+ if (!pdev || !pdev->driver)
+ return;
+
+ pdrv = pdev->driver;
+ pdrv->cxl_err_handler = &cxl_port_error_handlers;
+}
+
+static void cxl_clear_port_error_handlers(void *data)
+{
+ struct pci_dev *pdev = data;
+ struct pci_driver *pdrv;
+
+ if (!pdev || !pdev->driver)
+ return;
+
+ pdrv = pdev->driver;
+ pdrv->cxl_err_handler = NULL;
+}
+
void cxl_uport_init_ras_reporting(struct cxl_port *port)
{
+ struct pci_dev *pdev = to_pci_dev(port->uport_dev);
+
/* uport may have more than 1 downstream EP. Check if already mapped. */
if (port->uport_regs.ras)
return;
@@ -853,6 +883,9 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
dev_err(&port->dev, "Failed to map RAS capability.\n");
return;
}
+
+ cxl_assign_port_error_handlers(pdev);
+ devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
}
EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, CXL);
@@ -864,6 +897,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
{
struct device *dport_dev = dport->dport_dev;
struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
+ struct pci_dev *pdev = to_pci_dev(dport_dev);
dport->reg_map.host = dport_dev;
if (dport->rch && host_bridge->native_aer) {
@@ -880,6 +914,12 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
dev_err(dport_dev, "Failed to map RAS capability.\n");
return;
}
+
+ if (dport->rch)
+ return;
+
+ cxl_assign_port_error_handlers(pdev);
+ devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
}
EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
--
2.34.1
* [PATCH v4 15/15] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports
2024-12-11 23:39 [PATCH v4 0/15] Enable CXL PCIe Port protocol error handling and logging Terry Bowman
` (13 preceding siblings ...)
2024-12-11 23:40 ` [PATCH v4 14/15] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers Terry Bowman
@ 2024-12-11 23:40 ` Terry Bowman
2024-12-12 9:44 ` Alejandro Lucero Palau
2024-12-24 18:53 ` Jonathan Cameron
14 siblings, 2 replies; 45+ messages in thread
From: Terry Bowman @ 2024-12-11 23:40 UTC (permalink / raw)
To: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
Benjamin.Cheatham, rrichter, nathan.fontenot, terry.bowman,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
The AER service driver enables PCIe Uncorrectable Internal Errors (UIE) and
Correctable Internal Errors (CIE) for CXL Root Ports and CXL RCECs. The
UIE and CIE are used in reporting CXL Protocol Errors. The same UIE/CIE
enablement is needed for CXL PCIe Upstream and Downstream Ports in order to
notify the associated Root Port and OS.[1]
Export the AER service driver's pci_aer_unmask_internal_errors() function
to the CXL namespace.
Remove the function's dependency on the CONFIG_PCIEAER_CXL kernel config
because it is now an exported function.
Call pci_aer_unmask_internal_errors() during RAS initialization in:
cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting().
[1] PCIe Base Spec r6.2-1.0, 6.2.3.2.2 Masking Individual Errors
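As a rough illustration of what unmasking means here, the following standalone sketch clears the two internal-error bits in in-memory copies of the AER mask registers. The bit values match PCI_ERR_UNC_INTN and PCI_ERR_COR_INTERNAL in the kernel's pci_regs.h; 'struct toy_aer_masks' and toy_unmask_internal_errors() are hypothetical names, and the real function reads and writes PCI config space rather than a struct.

```c
#include <stdint.h>

/* Same bit values as the kernel's include/uapi/linux/pci_regs.h. */
#define PCI_ERR_UNC_INTN	0x00400000u	/* Uncorrectable Internal Error */
#define PCI_ERR_COR_INTERNAL	0x00004000u	/* Corrected Internal Error */

/* Hypothetical in-memory copy of the two AER mask registers. */
struct toy_aer_masks {
	uint32_t uncor_mask;
	uint32_t cor_mask;
};

/*
 * A set bit in a mask register suppresses reporting of that error, so
 * unmasking is a read-modify-write that clears only the internal-error
 * bit and leaves every other mask bit as it was.
 */
void toy_unmask_internal_errors(struct toy_aer_masks *m)
{
	m->uncor_mask &= ~PCI_ERR_UNC_INTN;
	m->cor_mask &= ~PCI_ERR_COR_INTERNAL;
}
```

Starting from fully masked registers (0xffffffff), the result is 0xffbfffff for the uncorrectable mask and 0xffffbfff for the correctable mask.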
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
drivers/cxl/core/pci.c | 2 ++
drivers/pci/pcie/aer.c | 5 +++--
include/linux/aer.h | 1 +
3 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 9734a4c55b29..740ac5d8809f 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -886,6 +886,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
cxl_assign_port_error_handlers(pdev);
devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
+ pci_aer_unmask_internal_errors(pdev);
}
EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, CXL);
@@ -920,6 +921,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
cxl_assign_port_error_handlers(pdev);
devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
+ pci_aer_unmask_internal_errors(pdev);
}
EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 861521872318..0fa1b1ed48c9 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -949,7 +949,6 @@ static bool is_internal_error(struct aer_err_info *info)
return info->status & PCI_ERR_UNC_INTN;
}
-#ifdef CONFIG_PCIEAER_CXL
/**
* pci_aer_unmask_internal_errors - unmask internal errors
* @dev: pointer to the pcie_dev data structure
@@ -960,7 +959,7 @@ static bool is_internal_error(struct aer_err_info *info)
* Note: AER must be enabled and supported by the device which must be
* checked in advance, e.g. with pcie_aer_is_native().
*/
-static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
+void pci_aer_unmask_internal_errors(struct pci_dev *dev)
{
int aer = dev->aer_cap;
u32 mask;
@@ -973,7 +972,9 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
mask &= ~PCI_ERR_COR_INTERNAL;
pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
}
+EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, CXL);
+#ifdef CONFIG_PCIEAER_CXL
static bool is_cxl_mem_dev(struct pci_dev *dev)
{
/*
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 4b97f38f3fcf..093293f9f12b 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -55,5 +55,6 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
int cper_severity_to_aer(int cper_severity);
void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
int severity, struct aer_capability_regs *aer_regs);
+void pci_aer_unmask_internal_errors(struct pci_dev *dev);
#endif //_AER_H_
--
2.34.1
* Re: [PATCH v4 04/15] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type
2024-12-11 23:39 ` [PATCH v4 04/15] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
@ 2024-12-12 1:34 ` Li Ming
2024-12-12 19:59 ` Bowman, Terry
0 siblings, 1 reply; 45+ messages in thread
From: Li Ming @ 2024-12-12 1:34 UTC (permalink / raw)
To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl,
ming4.li, dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, dan.j.williams, bhelgaas, mahesh, ira.weiny,
oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On 12/12/2024 7:39 AM, Terry Bowman wrote:
> The AER driver and aer_event tracing currently log 'PCIe Bus Type'
> for all errors.
>
> Update the driver and aer_event tracing to log 'CXL Bus Type' for CXL
> device errors.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
> Reviewed-by: Fan Ni <fan.ni@samsung.com>
> ---
> drivers/pci/pcie/aer.c | 14 ++++++++------
> include/ras/ras_event.h | 9 ++++++---
> 2 files changed, 14 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index fe6edf26279e..53e9a11f6c0f 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -699,13 +699,14 @@ static void __aer_print_error(struct pci_dev *dev,
>
> void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> {
> + const char *bus_type = pcie_is_cxl(dev) ? "CXL" : "PCIe";
> int layer, agent;
> int id = pci_dev_id(dev);
> const char *level;
>
> if (!info->status) {
> - pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> - aer_error_severity_string[info->severity]);
> + pci_err(dev, "%s Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
> + bus_type, aer_error_severity_string[info->severity]);
> goto out;
> }
>
> @@ -714,8 +715,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>
> level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
>
> - pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
> - aer_error_severity_string[info->severity],
> + pci_printk(level, dev, "%s Bus Error: severity=%s, type=%s, (%s)\n",
> + bus_type, aer_error_severity_string[info->severity],
> aer_error_layer[layer], aer_agent_string[agent]);
>
> pci_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
> @@ -730,7 +731,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> if (info->id && info->error_dev_num > 1 && info->id == id)
> pci_err(dev, " Error of this Agent is reported first\n");
>
> - trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
> + trace_aer_event(dev_name(&dev->dev), bus_type, (info->status & ~info->mask),
> info->severity, info->tlp_header_valid, &info->tlp);
> }
>
> @@ -764,6 +765,7 @@ EXPORT_SYMBOL_GPL(cper_severity_to_aer);
> void pci_print_aer(struct pci_dev *dev, int aer_severity,
> struct aer_capability_regs *aer)
> {
> + const char *bus_type = pcie_is_cxl(dev) ? "CXL" : "PCIe";
> int layer, agent, tlp_header_valid = 0;
> u32 status, mask;
> struct aer_err_info info;
> @@ -798,7 +800,7 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
> if (tlp_header_valid)
> __print_tlp_header(dev, &aer->header_log);
>
> - trace_aer_event(dev_name(&dev->dev), (status & ~mask),
> + trace_aer_event(dev_name(&dev->dev), bus_type, (status & ~mask),
> aer_severity, tlp_header_valid, &aer->header_log);
> }
> EXPORT_SYMBOL_NS_GPL(pci_print_aer, CXL);
> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
> index e5f7ee0864e7..1bf8e7050ba8 100644
> --- a/include/ras/ras_event.h
> +++ b/include/ras/ras_event.h
> @@ -297,15 +297,17 @@ TRACE_EVENT(non_standard_event,
>
> TRACE_EVENT(aer_event,
> TP_PROTO(const char *dev_name,
> + const char *bus_type,
> const u32 status,
> const u8 severity,
> const u8 tlp_header_valid,
> struct pcie_tlp_log *tlp),
>
> - TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp),
> + TP_ARGS(dev_name, bus_type, status, severity, tlp_header_valid, tlp),
>
> TP_STRUCT__entry(
> __string( dev_name, dev_name )
> + __string( bus_type, bus_type )
> __field( u32, status )
> __field( u8, severity )
> __field( u8, tlp_header_valid)
> @@ -314,6 +316,7 @@ TRACE_EVENT(aer_event,
>
> TP_fast_assign(
> __assign_str(dev_name);
> + __assign_str(bus_type);
> __entry->status = status;
> __entry->severity = severity;
> __entry->tlp_header_valid = tlp_header_valid;
> @@ -325,8 +328,8 @@ TRACE_EVENT(aer_event,
> }
> ),
>
> - TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s\n",
> - __get_str(dev_name),
> + TP_printk("%s %s Bus Error: severity=%s, %s, TLP Header=%s\n",
> + __get_str(dev_name), __get_str(bus_type),
> __entry->severity == AER_CORRECTABLE ? "Corrected" :
> __entry->severity == AER_FATAL ?
> "Fatal" : "Uncorrected, non-fatal",
Hi Terry,
Patch #3 uses the flexbus DVSEC to identify a CXL RP/USP/DSP. But per CXL r3.1 section 9.12.3 "Enumerating CXL RPs and DSPs", a flexbus DVSEC may still be present when the CXL RP/DSP is in the disconnected state or is connected to a PCIe device.
If a PCIe device is connected to a CXL RP/DSP and that RP/DSP reports an error, the error will also be logged as "CXL Bus Type". Is that expected? My understanding is that the CXL RP/DSP is operating in PCIe mode in that case.
If it is not expected, I think setting "pci_dev->is_cxl" during CXL port enumeration and CXL device probing is another option.
Thanks
Ming
* Re: [PATCH v4 12/15] cxl/pci: Add error handler for CXL PCIe Port RAS errors
2024-12-11 23:39 ` [PATCH v4 12/15] cxl/pci: Add error handler for CXL PCIe Port RAS errors Terry Bowman
@ 2024-12-12 2:19 ` Li Ming
2024-12-24 18:43 ` Jonathan Cameron
1 sibling, 0 replies; 45+ messages in thread
From: Li Ming @ 2024-12-12 2:19 UTC (permalink / raw)
To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl,
ming4.li, dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, dan.j.williams, bhelgaas, mahesh, ira.weiny,
oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On 12/12/2024 7:39 AM, Terry Bowman wrote:
> Introduce correctable and uncorrectable CXL PCIe port protocol error
> handlers.
>
> The handlers will be called with a 'struct pci_dev' parameter
> indicating the CXL Port device requiring handling. The CXL PCIe Port
> device's underlying 'struct device' will match the Port device in the
> CXL topology.
>
> Use the PCIe Port's device object to find the matching Upstream Switch
> Port, Downstream Switch Port, or Root Port in the CXL topology. The
> matching device will contain a reference to the RAS register block used to
> handle and log the error.
>
> Invoke the existing __cxl_handle_ras() or __cxl_handle_cor_ras() passing
> a reference to the RAS registers as a parameter. These functions will use
> the register reference to clear the device's RAS status.
>
> Future patches will assign the error handlers and add trace logging.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
> drivers/cxl/core/pci.c | 61 ++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 61 insertions(+)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 89f8d65d71ce..52afaedf5171 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -772,6 +772,67 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
> writel(aer_cmd, aer_base + PCI_ERR_ROOT_COMMAND);
> }
>
> +static int match_uport(struct device *dev, const void *data)
> +{
> + struct device *uport_dev = (struct device *)data;
> + struct cxl_port *port;
> +
> + if (!is_cxl_port(dev))
> + return 0;
> +
> + port = to_cxl_port(dev);
> +
> + return port->uport_dev == uport_dev;
> +}
> +
> +static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
> +{
> + void __iomem *ras_base;
> + struct cxl_port *port;
> +
> + if (!pdev)
> + return NULL;
> +
> + if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
> + (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
> + struct cxl_dport *dport;
> +
> + port = find_cxl_port(&pdev->dev, &dport);
> + ras_base = dport ? dport->regs.ras : NULL;
> + if (port)
> + put_device(&port->dev);
> + return ras_base;
> + } else if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {
> + struct device *port_dev;
> +
> + port_dev = bus_find_device(&cxl_bus_type, NULL, &pdev->dev,
> + match_uport);
> + if (!port_dev)
> + return NULL;
> +
> + port = to_cxl_port(port_dev);
> + ras_base = port ? port->uport_regs.ras : NULL;
I think there is no need to check 'port' here; 'ras_base = port->uport_regs.ras;' can be used directly, because match_uport() already checks it, so the returned port_dev must be a port.
Ming
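The reasoning in this comment can be modelled in a few lines. The sketch below assumes that to_cxl_port() returns NULL only when the device is not a CXL port; that is an assumption about the helper, not something shown in this patch. All toy_* names are hypothetical stand-ins for the kernel types and helpers.

```c
#include <stddef.h>

/* Hypothetical device type tags. */
enum { TOY_TYPE_OTHER, TOY_TYPE_CXL_PORT };

struct toy_device {
	int type;
};

/* Toy CXL port embedding its device, as 'struct cxl_port' embeds 'dev'. */
struct toy_cxl_port {
	struct toy_device dev;
	const void *uport_dev;
	void *uport_ras;
};

int toy_is_cxl_port(const struct toy_device *dev)
{
	return dev->type == TOY_TYPE_CXL_PORT;
}

/* Assumed behaviour: NULL only when 'dev' is not a CXL port. */
struct toy_cxl_port *toy_to_cxl_port(struct toy_device *dev)
{
	if (!toy_is_cxl_port(dev))
		return NULL;
	return (struct toy_cxl_port *)((char *)dev -
				       offsetof(struct toy_cxl_port, dev));
}

/*
 * Match callback: rejects non-port devices first, so any device that
 * matches is guaranteed to convert to a non-NULL port afterwards.
 */
int toy_match_uport(struct toy_device *dev, const void *data)
{
	if (!toy_is_cxl_port(dev))
		return 0;
	return toy_to_cxl_port(dev)->uport_dev == data;
}
```

Under that assumption, a device returned by a search using toy_match_uport() always yields a valid port, which is why the extra NULL check on 'port' looks redundant.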
* Re: [PATCH v4 14/15] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
2024-12-11 23:40 ` [PATCH v4 14/15] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers Terry Bowman
@ 2024-12-12 2:31 ` Li Ming
2024-12-17 14:39 ` Bowman, Terry
2024-12-24 18:50 ` Jonathan Cameron
1 sibling, 1 reply; 45+ messages in thread
From: Li Ming @ 2024-12-12 2:31 UTC (permalink / raw)
To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl,
ming4.li, dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, dan.j.williams, bhelgaas, mahesh, ira.weiny,
oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On 12/12/2024 7:40 AM, Terry Bowman wrote:
> pci_driver::cxl_err_handlers are not currently assigned handler callbacks.
> The handlers can't be set in the pci_driver static definition because the
> CXL PCIe Port devices are bound to the portdrv driver which is not CXL
> driver aware.
>
> Add cxl_assign_port_error_handlers() in the cxl_core module. This
> function will assign the default handlers for a CXL PCIe Port device.
>
> When the CXL Port (cxl_port or cxl_dport) is destroyed the device's
> pci_driver::cxl_err_handlers must be set to NULL indicating they should no
> longer be used.
>
> Create cxl_clear_port_error_handlers() and register it to be called
> when the CXL Port device (cxl_port or cxl_dport) is destroyed.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
> drivers/cxl/core/pci.c | 40 ++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 40 insertions(+)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 3294ad5ff28f..9734a4c55b29 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -841,8 +841,38 @@ static bool cxl_port_error_detected(struct pci_dev *pdev)
> return __cxl_handle_ras(&pdev->dev, ras_base);
> }
>
> +static const struct cxl_error_handlers cxl_port_error_handlers = {
> + .error_detected = cxl_port_error_detected,
> + .cor_error_detected = cxl_port_cor_error_detected,
> +};
> +
> +static void cxl_assign_port_error_handlers(struct pci_dev *pdev)
> +{
> + struct pci_driver *pdrv;
> +
> + if (!pdev || !pdev->driver)
> + return;
> +
> + pdrv = pdev->driver;
> + pdrv->cxl_err_handler = &cxl_port_error_handlers;
> +}
> +
> +static void cxl_clear_port_error_handlers(void *data)
> +{
> + struct pci_dev *pdev = data;
> + struct pci_driver *pdrv;
> +
> + if (!pdev || !pdev->driver)
> + return;
> +
> + pdrv = pdev->driver;
> + pdrv->cxl_err_handler = NULL;
> +}
> +
> void cxl_uport_init_ras_reporting(struct cxl_port *port)
> {
> + struct pci_dev *pdev = to_pci_dev(port->uport_dev);
> +
> /* uport may have more than 1 downstream EP. Check if already mapped. */
> if (port->uport_regs.ras)
> return;
> @@ -853,6 +883,9 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
> dev_err(&port->dev, "Failed to map RAS capability.\n");
> return;
> }
> +
> + cxl_assign_port_error_handlers(pdev);
> + devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
I think the first parameter of devm_add_action_or_reset() should be '&port->dev' rather than 'port->uport_dev'.
'port->uport_dev' is 'pci_dev->dev', which is destroyed on the PCI side, while 'port->dev' is destroyed on the CXL side.
> }
> EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, CXL);
>
> @@ -864,6 +897,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
> {
> struct device *dport_dev = dport->dport_dev;
> struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
> + struct pci_dev *pdev = to_pci_dev(dport_dev);
>
> dport->reg_map.host = dport_dev;
> if (dport->rch && host_bridge->native_aer) {
> @@ -880,6 +914,12 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
> dev_err(dport_dev, "Failed to map RAS capability.\n");
> return;
> }
> +
> + if (dport->rch)
> + return;
> +
> + cxl_assign_port_error_handlers(pdev);
> + devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
Same as above, this should use '&port->dev'.
Please correct me if I am wrong.
Ming
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
>
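The lifetime concern raised in this review can be illustrated with a toy model of device-managed cleanup actions. This is a simplified sketch, not the kernel's devres implementation: toy_add_action_or_reset() and toy_release() are hypothetical, and toy_release() stands in for whatever teardown runs the registered actions of one device.

```c
#include <stddef.h>

#define TOY_MAX_ACTIONS 4

/* Toy device carrying its own list of cleanup actions. */
struct toy_device {
	void (*action[TOY_MAX_ACTIONS])(void *);
	void *data[TOY_MAX_ACTIONS];
	size_t nr_actions;
};

/* Toy driver holding the handler pointer that must be cleared. */
struct toy_driver {
	const void *cxl_err_handler;
};

/*
 * Register 'fn' to run when 'dev' is torn down. If registration fails,
 * run 'fn' immediately and return an error, mirroring the "_or_reset"
 * behaviour. The return value should be checked by the caller.
 */
int toy_add_action_or_reset(struct toy_device *dev, void (*fn)(void *),
			    void *data)
{
	if (dev->nr_actions == TOY_MAX_ACTIONS) {
		fn(data);
		return -1;
	}
	dev->action[dev->nr_actions] = fn;
	dev->data[dev->nr_actions] = data;
	dev->nr_actions++;
	return 0;
}

/* Run this device's actions in reverse registration order. */
void toy_release(struct toy_device *dev)
{
	while (dev->nr_actions) {
		dev->nr_actions--;
		dev->action[dev->nr_actions](dev->data[dev->nr_actions]);
	}
}

/* Cleanup action: drop the handler pointer. */
void toy_clear_handlers(void *data)
{
	struct toy_driver *drv = data;

	drv->cxl_err_handler = NULL;
}
```

In this model, an action registered on the CXL port's device runs when that device is released, regardless of what happens to the PCI device. Registering it on the PCI device instead would leave the handlers assigned after the CXL port is gone, which is the situation the comment warns about.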
* Re: [PATCH v4 07/15] PCI/AER: Add CXL PCIe Port Uncorrectable Error recovery in AER service driver
2024-12-11 23:39 ` [PATCH v4 07/15] PCI/AER: Add CXL PCIe Port Uncorrectable Error recovery in AER service driver Terry Bowman
@ 2024-12-12 9:28 ` Alejandro Lucero Palau
2024-12-13 15:07 ` Bowman, Terry
2024-12-24 18:31 ` Jonathan Cameron
1 sibling, 1 reply; 45+ messages in thread
From: Alejandro Lucero Palau @ 2024-12-12 9:28 UTC (permalink / raw)
To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl,
ming4.li, dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, dan.j.williams, bhelgaas, mahesh, ira.weiny,
oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On 12/11/24 23:39, Terry Bowman wrote:
> Existing recovery procedure for PCIe Uncorrectable Errors (UCE) does not
> apply to CXL devices. Recovery can not be used for CXL devices because of
> potential corruption on what can be system memory. Also, current PCIe UCE
> recovery, in the case of a Root Port (RP) or Downstream Switch Port (DSP),
> does not begin at the RP/DSP but begins at the first downstream device.
> This will miss handling CXL Protocol Errors in a CXL RP or DSP. A separate
> CXL recovery is needed because of the different handling requirements
>
> Add a new function, cxl_do_recovery() using the following.
>
> Add cxl_walk_bridge() to iterate the detected error's sub-topology.
> cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
> will begin iteration at the RP or DSP rather than beginning at the
> first downstream device.
>
> Add cxl_report_error_detected() as an analog to report_error_detected().
> It will call pci_driver::cxl_err_handlers for each iterated downstream
> device. The pci_driver::cxl_err_handler's UCE handler returns a boolean
> indicating if there was a UCE error detected during handling.
>
> cxl_do_recovery() uses the status from cxl_report_error_detected() to
> determine how to proceed. Non-fatal CXL UCE errors will be treated as
> fatal. If a UCE was present during handling then cxl_do_recovery()
> will kernel panic.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
> drivers/pci/pci.h | 3 +++
> drivers/pci/pcie/aer.c | 4 ++++
> drivers/pci/pcie/err.c | 54 ++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 61 insertions(+)
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 14d00ce45bfa..5a67e41919d8 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -658,6 +658,9 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> pci_channel_state_t state,
> pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev));
>
> +/* CXL error reporting and handling */
> +void cxl_do_recovery(struct pci_dev *dev);
> +
> bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
> int pcie_retrain_link(struct pci_dev *pdev, bool use_lt);
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index c1eb939c1cca..861521872318 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1024,6 +1024,8 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
> err_handler->error_detected(dev, pci_channel_io_normal);
> else if (info->severity == AER_FATAL)
> err_handler->error_detected(dev, pci_channel_io_frozen);
> +
> + cxl_do_recovery(dev);
> }
> out:
> device_unlock(&dev->dev);
> @@ -1048,6 +1050,8 @@ static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
> pdrv->cxl_err_handler->cor_error_detected(dev);
>
> pcie_clear_device_status(dev);
> + } else {
> + cxl_do_recovery(dev);
> }
> }
>
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 31090770fffc..6f7cf5e0087f 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -276,3 +276,57 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>
> return status;
> }
> +
> +static void cxl_walk_bridge(struct pci_dev *bridge,
> + int (*cb)(struct pci_dev *, void *),
> + void *userdata)
> +{
> + bool *status = userdata;
> +
> + cb(bridge, status);
> + if (bridge->subordinate && !*status)
I would prefer not to use a pointer for status, as you are not changing
what it points to here: cast first, then use just !status in the
conditional.
> + pci_walk_bus(bridge->subordinate, cb, status);
> +}
> +
> +static int cxl_report_error_detected(struct pci_dev *dev, void *data)
> +{
> + struct pci_driver *pdrv = dev->driver;
> + bool *status = data;
> +
> + device_lock(&dev->dev);
> + if (pdrv && pdrv->cxl_err_handler &&
> + pdrv->cxl_err_handler->error_detected) {
> + const struct cxl_error_handlers *cxl_err_handler =
> + pdrv->cxl_err_handler;
> + *status |= cxl_err_handler->error_detected(dev);
This implies status should not be a bool pointer, as different bits can
be set by the returned value. But since the code only seems to care
whether any bit is set, meaning an error was detected, I guess that is
fine. However, the next function calling this one uses an int, which is
confusing to me. Instead of an OR, I would expect this to return as soon
as the first error is detected, handling the lock properly, with the walk
function breaking the walk behind the scenes if the return is anything
other than zero.
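To make the suggestion concrete, here is a minimal userspace sketch of the
"return on first error" shape. It is not kernel code: struct toy_dev,
toy_walk_bridge() and toy_report_error_detected() are hypothetical stand-ins,
and the topology is flattened into a linked list. The only kernel behaviour it
relies on is that pci_walk_bus() stops walking when the callback returns
nonzero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a PCI device; not the kernel's struct pci_dev. */
struct toy_dev {
	const char *name;
	bool has_uce;          /* pretend the driver's handler found a UCE */
	struct toy_dev *next;  /* next device below the bridge (flattened) */
};

/* Same contract as a pci_walk_bus() callback: nonzero stops the walk. */
typedef int (*toy_walk_cb)(struct toy_dev *dev, void *data);

/* Visit the bridge itself first, then everything below it, stopping early. */
static int toy_walk_bridge(struct toy_dev *bridge, toy_walk_cb cb, void *data)
{
	struct toy_dev *dev;
	int rc;

	assert(cb != NULL);
	for (dev = bridge; dev; dev = dev->next) {
		rc = cb(dev, data);
		if (rc)
			return rc;  /* first error ends the walk */
	}
	return 0;
}

/* Record the first failing device and stop; nothing is accumulated with |=. */
static int toy_report_error_detected(struct toy_dev *dev, void *data)
{
	const struct toy_dev **first_bad = data;

	if (!dev->has_uce)
		return 0;
	*first_bad = dev;
	return 1;
}
```

With this shape the caller only looks at the return value of the walk, so
there is no bool-versus-int mismatch on a shared status variable.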
> + }
> + device_unlock(&dev->dev);
> + return *status;
> +}
> +
> +void cxl_do_recovery(struct pci_dev *dev)
> +{
> + struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> + int type = pci_pcie_type(dev);
> + struct pci_dev *bridge;
> + int status;
> +
> + if (type == PCI_EXP_TYPE_ROOT_PORT ||
> + type == PCI_EXP_TYPE_DOWNSTREAM ||
> + type == PCI_EXP_TYPE_UPSTREAM ||
> + type == PCI_EXP_TYPE_ENDPOINT)
> + bridge = dev;
> + else
> + bridge = pci_upstream_bridge(dev);
> +
> + cxl_walk_bridge(bridge, cxl_report_error_detected, &status);
> + if (status)
> + panic("CXL cachemem error.");
> +
> + if (host->native_aer || pcie_ports_native) {
> + pcie_clear_device_status(dev);
> + pci_aer_clear_nonfatal_status(dev);
> + }
> +
> + pci_info(bridge, "CXL uncorrectable error.\n");
> +}
* Re: [PATCH v4 15/15] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports
2024-12-11 23:40 ` [PATCH v4 15/15] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports Terry Bowman
@ 2024-12-12 9:44 ` Alejandro Lucero Palau
2024-12-12 10:44 ` Alejandro Lucero Palau
2024-12-13 15:34 ` Bowman, Terry
2024-12-24 18:53 ` Jonathan Cameron
1 sibling, 2 replies; 45+ messages in thread
From: Alejandro Lucero Palau @ 2024-12-12 9:44 UTC (permalink / raw)
To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl,
ming4.li, dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, dan.j.williams, bhelgaas, mahesh, ira.weiny,
oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On 12/11/24 23:40, Terry Bowman wrote:
> The AER service driver enables PCIe Uncorrectable Internal Errors (UIE) and
> Correctable Internal errors (CIE) for CXL Root Ports and CXL RCEC's. The
> UIE and CIE are used in reporting CXL Protocol Errors. The same UIE/CIE
> enablement is needed for CXL PCIe Upstream and Downstream Ports in order to
> notify the associated Root Port and OS.[1]
>
> Export the AER service driver's pci_aer_unmask_internal_errors() function
> to CXL namespace.
>
> Remove the function's dependency on the CONFIG_PCIEAER_CXL kernel config
> because it is now an exported function.
>
> Call pci_aer_unmask_internal_errors() during RAS initialization in:
> cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting().
>
> [1] PCIe Base Spec r6.2-1.0, 6.2.3.2.2 Masking Individual Errors
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
> drivers/cxl/core/pci.c | 2 ++
> drivers/pci/pcie/aer.c | 5 +++--
> include/linux/aer.h | 1 +
> 3 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 9734a4c55b29..740ac5d8809f 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -886,6 +886,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
>
> cxl_assign_port_error_handlers(pdev);
> devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
> + pci_aer_unmask_internal_errors(pdev);
> }
> EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, CXL);
>
> @@ -920,6 +921,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>
> cxl_assign_port_error_handlers(pdev);
> devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
> + pci_aer_unmask_internal_errors(pdev);
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 861521872318..0fa1b1ed48c9 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -949,7 +949,6 @@ static bool is_internal_error(struct aer_err_info *info)
> return info->status & PCI_ERR_UNC_INTN;
> }
>
> -#ifdef CONFIG_PCIEAER_CXL
This ifdef move puzzles me. I would expect the ifdef to be used where the
next function is invoked, instead of moving it here.
It seems odd to have such a config option while the code calling the
related functions is unaware of it.
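For illustration only, one common way to keep callers free of #ifdefs while
still honouring the config is a real implementation plus an empty inline
stub. The sketch below is plain userspace C under stated assumptions:
TOY_CONFIG_PCIEAER_CXL, struct toy_dev, the TOY_ERR_* bits and both functions
are made-up names, not the AER driver's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the Kconfig symbol; set to 0 to build the stub. */
#define TOY_CONFIG_PCIEAER_CXL 1

/* Illustrative mask bits; names and values are toy definitions. */
#define TOY_ERR_UNC_INTN     0x00400000u
#define TOY_ERR_COR_INTERNAL 0x00004000u

/* Toy device: two plain fields standing in for the AER mask registers. */
struct toy_dev {
	uint32_t uncor_mask;
	uint32_t cor_mask;
};

#if TOY_CONFIG_PCIEAER_CXL
/* Config enabled: clear the internal-error bits so they get reported. */
static inline void toy_unmask_internal_errors(struct toy_dev *dev)
{
	dev->uncor_mask &= ~TOY_ERR_UNC_INTN;
	dev->cor_mask &= ~TOY_ERR_COR_INTERNAL;
}
#else
/* Config disabled: empty stub, so callers need no #ifdef of their own. */
static inline void toy_unmask_internal_errors(struct toy_dev *dev)
{
	(void)dev;
}
#endif

/* The caller compiles unchanged whichever branch above was selected. */
static void toy_init_ras_reporting(struct toy_dev *dev)
{
	assert(dev != NULL);
	toy_unmask_internal_errors(dev);
}
```

Whether the guard belongs in the header like this or at the call site is a
design choice for the series; the sketch only shows the first option.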
> /**
> * pci_aer_unmask_internal_errors - unmask internal errors
> * @dev: pointer to the pcie_dev data structure
> @@ -960,7 +959,7 @@ static bool is_internal_error(struct aer_err_info *info)
> * Note: AER must be enabled and supported by the device which must be
> * checked in advance, e.g. with pcie_aer_is_native().
> */
> -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> +void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> {
> int aer = dev->aer_cap;
> u32 mask;
> @@ -973,7 +972,9 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> mask &= ~PCI_ERR_COR_INTERNAL;
> pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
> }
> +EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, CXL);
>
> +#ifdef CONFIG_PCIEAER_CXL
> static bool is_cxl_mem_dev(struct pci_dev *dev)
> {
> /*
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 4b97f38f3fcf..093293f9f12b 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -55,5 +55,6 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
> int cper_severity_to_aer(int cper_severity);
> void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
> int severity, struct aer_capability_regs *aer_regs);
> +void pci_aer_unmask_internal_errors(struct pci_dev *dev);
> #endif //_AER_H_
>
* Re: [PATCH v4 13/15] cxl/pci: Add trace logging for CXL PCIe Port RAS errors
2024-12-11 23:40 ` [PATCH v4 13/15] cxl/pci: Add trace logging " Terry Bowman
@ 2024-12-12 9:46 ` Alejandro Lucero Palau
2024-12-24 18:46 ` Jonathan Cameron
1 sibling, 0 replies; 45+ messages in thread
From: Alejandro Lucero Palau @ 2024-12-12 9:46 UTC (permalink / raw)
To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl,
ming4.li, dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, dan.j.williams, bhelgaas, mahesh, ira.weiny,
oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On 12/11/24 23:40, Terry Bowman wrote:
> The CXL drivers use kernel trace functions for logging endpoint and RCH
> Downstream Port RAS errors. Similar functionality is required for CXL Root
> Ports, CXL Downstream Switch Ports, and CXL Upstream Switch Ports.
>
> Introduce trace logging functions for both RAS correctable and
> uncorrectable errors specific to CXL PCIe Ports. Additionally, update
> the PCIe Port error handlers to invoke these new trace functions.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Alejandro Lucero <alucerop@amd.com>
> ---
> drivers/cxl/core/pci.c | 16 ++++++++++----
> drivers/cxl/core/trace.h | 47 ++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 59 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 52afaedf5171..3294ad5ff28f 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -661,10 +661,14 @@ static void __cxl_handle_cor_ras(struct device *dev,
>
> addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
> status = readl(addr);
> - if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
> - writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> + if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
> + return;
> + writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> +
> + if (is_cxl_memdev(dev))
> trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
> - }
> + else
> + trace_cxl_port_aer_correctable_error(dev, status);
> }
>
> static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
> @@ -720,7 +724,11 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
> }
>
> header_log_copy(ras_base, hl);
> - trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
> + if (is_cxl_memdev(dev))
> + trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
> + else
> + trace_cxl_port_aer_uncorrectable_error(dev, status, fe, hl);
> +
> writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>
> return true;
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index 8389a94adb1a..681e415ac8f5 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -48,6 +48,34 @@
> { CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" } \
> )
>
> +TRACE_EVENT(cxl_port_aer_uncorrectable_error,
> + TP_PROTO(struct device *dev, u32 status, u32 fe, u32 *hl),
> + TP_ARGS(dev, status, fe, hl),
> + TP_STRUCT__entry(
> + __string(devname, dev_name(dev))
> + __string(host, dev_name(dev->parent))
> + __field(u32, status)
> + __field(u32, first_error)
> + __array(u32, header_log, CXL_HEADERLOG_SIZE_U32)
> + ),
> + TP_fast_assign(
> + __assign_str(devname);
> + __assign_str(host);
> + __entry->status = status;
> + __entry->first_error = fe;
> + /*
> + * Embed the 512B headerlog data for user app retrieval and
> + * parsing, but no need to print this in the trace buffer.
> + */
> + memcpy(__entry->header_log, hl, CXL_HEADERLOG_SIZE);
> + ),
> + TP_printk("device=%s host=%s status: '%s' first_error: '%s'",
> + __get_str(devname), __get_str(host),
> + show_uc_errs(__entry->status),
> + show_uc_errs(__entry->first_error)
> + )
> +);
> +
> TRACE_EVENT(cxl_aer_uncorrectable_error,
> TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
> TP_ARGS(cxlmd, status, fe, hl),
> @@ -96,6 +124,25 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
> { CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" } \
> )
>
> +TRACE_EVENT(cxl_port_aer_correctable_error,
> + TP_PROTO(struct device *dev, u32 status),
> + TP_ARGS(dev, status),
> + TP_STRUCT__entry(
> + __string(devname, dev_name(dev))
> + __string(host, dev_name(dev->parent))
> + __field(u32, status)
> + ),
> + TP_fast_assign(
> + __assign_str(devname);
> + __assign_str(host);
> + __entry->status = status;
> + ),
> + TP_printk("device=%s host=%s status='%s'",
> + __get_str(devname), __get_str(host),
> + show_ce_errs(__entry->status)
> + )
> +);
> +
> TRACE_EVENT(cxl_aer_correctable_error,
> TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
> TP_ARGS(cxlmd, status),
* Re: [PATCH v4 08/15] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers
2024-12-11 23:39 ` [PATCH v4 08/15] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers Terry Bowman
@ 2024-12-12 10:36 ` Alejandro Lucero Palau
2024-12-13 15:10 ` Bowman, Terry
2024-12-24 18:38 ` Jonathan Cameron
1 sibling, 1 reply; 45+ messages in thread
From: Alejandro Lucero Palau @ 2024-12-12 10:36 UTC (permalink / raw)
To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl,
ming4.li, dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, dan.j.williams, bhelgaas, mahesh, ira.weiny,
oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On 12/11/24 23:39, Terry Bowman wrote:
> The CXL mem driver (cxl_mem) currently maps and caches a pointer to RAS
> registers for the endpoint's Root Port. The same needs to be done for
> each of the CXL Downstream Switch Ports and CXL Root Ports found between
> the endpoint and CXL Host Bridge.
>
> Introduce cxl_init_ep_ports_aer() to be called for each CXL Port in the
> sub-topology between the endpoint and the CXL Host Bridge. This function
> will determine if there are CXL Downstream Switch Ports or CXL Root Ports
> associated with this Port. The same check will be added in the future for
> upstream switch ports.
>
> Move the RAS register map logic from cxl_dport_map_ras() into
> cxl_dport_init_ras_reporting(). This eliminates the need for the helper
> function, cxl_dport_map_ras().
>
> cxl_init_ep_ports_aer() calls cxl_dport_init_ras_reporting() to map
> the RAS registers for CXL Downstream Switch Ports and CXL Root Ports.
>
> cxl_dport_init_ras_reporting() must check for previously mapped registers
> before mapping. This is necessary because endpoints under a CXL switch
> may share CXL Downstream Switch Ports or CXL Root Ports. Ensure the port
> registers are only mapped once.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Alejandro Lucero <alucerop@amd.com>
Just a nit: the regs mapping could fail, which is properly reported, and
__cxl_handle_ras() checks that the regs iomem is there before using it.
But if the mapping failed, any error report arriving there is silently
dropped. If an AER event does happen, maybe add a WARN_ONCE there?
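Roughly what I have in mind, as a userspace sketch rather than a patch.
TOY_WARN_ONCE() only imitates the "print on first hit" behaviour of the
kernel's WARN_ONCE(); toy_handle_ras(), toy_warn_count and the message text
are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Counts emitted warnings so the once-only behaviour can be observed. */
static int toy_warn_count;

/* Imitation of WARN_ONCE(): each call site warns at most one time. */
#define TOY_WARN_ONCE(cond, msg)                               \
	do {                                                   \
		static bool warned;                            \
		if ((cond) && !warned) {                       \
			warned = true;                         \
			toy_warn_count++;                      \
			fprintf(stderr, "WARNING: %s\n", msg); \
		}                                              \
	} while (0)

/*
 * Toy handler: ras_base stands in for the mapped RAS registers.
 * Returns true when an uncorrectable status bit is set.
 */
static bool toy_handle_ras(const unsigned int *ras_base)
{
	if (!ras_base) {
		/* Mapping failed earlier: say so once instead of dropping silently. */
		TOY_WARN_ONCE(true, "RAS registers not mapped, error dropped");
		return false;
	}
	return *ras_base != 0;
}
```

Repeated errors on an unmapped port then produce a single warning instead of
either silence or a flood.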
> ---
> drivers/cxl/core/pci.c | 37 +++++++++++++++----------------------
> drivers/cxl/cxl.h | 6 ++----
> drivers/cxl/mem.c | 31 +++++++++++++++++++++++++++++--
> 3 files changed, 46 insertions(+), 28 deletions(-)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 5b46bc46aaa9..8540d1fd2e25 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -749,18 +749,6 @@ static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
> }
> }
>
> -static void cxl_dport_map_ras(struct cxl_dport *dport)
> -{
> - struct cxl_register_map *map = &dport->reg_map;
> - struct device *dev = dport->dport_dev;
> -
> - if (!map->component_map.ras.valid)
> - dev_dbg(dev, "RAS registers not found\n");
> - else if (cxl_map_component_regs(map, &dport->regs.component,
> - BIT(CXL_CM_CAP_CAP_ID_RAS)))
> - dev_dbg(dev, "Failed to map RAS capability.\n");
> -}
> -
> static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
> {
> void __iomem *aer_base = dport->regs.dport_aer;
> @@ -788,22 +776,27 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
> /**
> * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
> * @dport: the cxl_dport that needs to be initialized
> - * @host: host device for devm operations
> */
> -void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
> +void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
> {
> - dport->reg_map.host = host;
> - cxl_dport_map_ras(dport);
> -
> - if (dport->rch) {
> - struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
> -
> - if (!host_bridge->native_aer)
> - return;
> + struct device *dport_dev = dport->dport_dev;
> + struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
>
> + dport->reg_map.host = dport_dev;
> + if (dport->rch && host_bridge->native_aer) {
> cxl_dport_map_rch_aer(dport);
> cxl_disable_rch_root_ints(dport);
> }
> +
> + /* dport may have more than 1 downstream EP. Check if already mapped. */
> + if (dport->regs.ras)
> + return;
> +
> + if (cxl_map_component_regs(&dport->reg_map, &dport->regs.component,
> + BIT(CXL_CM_CAP_CAP_ID_RAS))) {
> + dev_err(dport_dev, "Failed to map RAS capability.\n");
> + return;
> + }
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
>
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 5406e3ab3d4a..51acca3415b4 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -763,11 +763,9 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
> resource_size_t rcrb);
>
> #ifdef CONFIG_PCIEAER_CXL
> -void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport);
> -void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host);
> +void cxl_dport_init_ras_reporting(struct cxl_dport *dport);
> #else
> -static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport,
> - struct device *host) { }
> +static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport) { }
> #endif
>
> struct cxl_decoder *to_cxl_decoder(struct device *dev);
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index a9fd5cd5a0d2..0ae89c9da71e 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -45,6 +45,31 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
> return 0;
> }
>
> +static bool dev_is_cxl_pci(struct device *dev, u32 pcie_type)
> +{
> + struct pci_dev *pdev;
> +
> + if (!dev || !dev_is_pci(dev))
> + return false;
> +
> + pdev = to_pci_dev(dev);
> +
> + return (pci_pcie_type(pdev) == pcie_type);
> +}
> +
> +static void cxl_init_ep_ports_aer(struct cxl_ep *ep)
> +{
> + struct cxl_dport *dport = ep->dport;
> +
> + if (dport) {
> + struct device *dport_dev = dport->dport_dev;
> +
> + if (dev_is_cxl_pci(dport_dev, PCI_EXP_TYPE_DOWNSTREAM) ||
> + dev_is_cxl_pci(dport_dev, PCI_EXP_TYPE_ROOT_PORT))
> + cxl_dport_init_ras_reporting(dport);
> + }
> +}
> +
> static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
> struct cxl_dport *parent_dport)
> {
> @@ -52,6 +77,9 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
> struct cxl_port *endpoint, *iter, *down;
> int rc;
>
> + if (parent_dport->rch)
> + cxl_dport_init_ras_reporting(parent_dport);
> +
> /*
> * Now that the path to the root is established record all the
> * intervening ports in the chain.
> @@ -62,6 +90,7 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
>
> ep = cxl_ep_load(iter, cxlmd);
> ep->next = down;
> + cxl_init_ep_ports_aer(ep);
> }
>
> /* Note: endpoint port component registers are derived from @cxlds */
> @@ -166,8 +195,6 @@ static int cxl_mem_probe(struct device *dev)
> else
> endpoint_parent = &parent_port->dev;
>
> - cxl_dport_init_ras_reporting(dport, dev);
> -
> scoped_guard(device, endpoint_parent) {
> if (!endpoint_parent->driver) {
> dev_err(dev, "CXL port topology %s not enabled\n",
* Re: [PATCH v4 10/15] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports
2024-12-11 23:39 ` [PATCH v4 10/15] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports Terry Bowman
@ 2024-12-12 10:38 ` Alejandro Lucero Palau
2024-12-24 18:42 ` Jonathan Cameron
1 sibling, 0 replies; 45+ messages in thread
From: Alejandro Lucero Palau @ 2024-12-12 10:38 UTC (permalink / raw)
To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl,
ming4.li, dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, dan.j.williams, bhelgaas, mahesh, ira.weiny,
oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On 12/11/24 23:39, Terry Bowman wrote:
> CXL PCIe Port protocol error handling support will be added to the
> CXL drivers in the future. In preparation, rename the existing
> interfaces to support handling all CXL PCIe Port protocol errors.
>
> The driver's RAS support functions currently rely on a 'struct
> cxl_dev_state' type parameter, which is not available for CXL Port
> devices. However, since the same CXL RAS capability structure is
> needed across most CXL components and devices, a common handling
> approach should be adopted.
>
> To accommodate this, update the __cxl_handle_cor_ras() and
> __cxl_handle_ras() functions to use a `struct device` instead of
> `struct cxl_dev_state`.
>
> No functional changes are introduced.
>
> [1] CXL 3.1 Spec, 8.2.4 CXL.cache and CXL.mem Registers
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Alejandro Lucero <alucerop@amd.com>
> ---
> drivers/cxl/core/pci.c | 17 ++++++++---------
> 1 file changed, 8 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 08073bbe2697..89f8d65d71ce 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -650,7 +650,7 @@ void read_cdat_data(struct cxl_port *port)
> }
> EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
>
> -static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
> +static void __cxl_handle_cor_ras(struct device *dev,
> void __iomem *ras_base)
> {
> void __iomem *addr;
> @@ -663,13 +663,13 @@ static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
> status = readl(addr);
> if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
> writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> - trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
> + trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
> }
> }
>
> static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
> {
> - return __cxl_handle_cor_ras(cxlds, cxlds->regs.ras);
> + return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
> }
>
> /* CXL spec rev3.0 8.2.4.16.1 */
> @@ -693,8 +693,7 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
> * Log the state of the RAS status registers and prepare them to log the
> * next error status. Return 1 if reset needed.
> */
> -static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
> - void __iomem *ras_base)
> +static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
> {
> u32 hl[CXL_HEADERLOG_SIZE_U32];
> void __iomem *addr;
> @@ -721,7 +720,7 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
> }
>
> header_log_copy(ras_base, hl);
> - trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe, hl);
> + trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
> writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>
> return true;
> @@ -729,7 +728,7 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
>
> static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
> {
> - return __cxl_handle_ras(cxlds, cxlds->regs.ras);
> + return __cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
> }
>
> #ifdef CONFIG_PCIEAER_CXL
> @@ -818,13 +817,13 @@ EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
> static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
> struct cxl_dport *dport)
> {
> - return __cxl_handle_cor_ras(cxlds, dport->regs.ras);
> + return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, dport->regs.ras);
> }
>
> static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
> struct cxl_dport *dport)
> {
> - return __cxl_handle_ras(cxlds, dport->regs.ras);
> + return __cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
> }
>
> /*
* Re: [PATCH v4 15/15] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports
2024-12-12 9:44 ` Alejandro Lucero Palau
@ 2024-12-12 10:44 ` Alejandro Lucero Palau
2024-12-13 15:22 ` Bowman, Terry
2024-12-13 15:34 ` Bowman, Terry
1 sibling, 1 reply; 45+ messages in thread
From: Alejandro Lucero Palau @ 2024-12-12 10:44 UTC (permalink / raw)
To: Terry Bowman, linux-cxl, linux-kernel, linux-pci, nifan.cxl,
ming4.li, dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, dan.j.williams, bhelgaas, mahesh, ira.weiny,
oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On 12/12/24 09:44, Alejandro Lucero Palau wrote:
>
> On 12/11/24 23:40, Terry Bowman wrote:
>> The AER service driver enables PCIe Uncorrectable Internal Errors
>> (UIE) and
>> Correctable Internal errors (CIE) for CXL Root Ports and CXL RCEC's. The
>> UIE and CIE are used in reporting CXL Protocol Errors. The same UIE/CIE
>> enablement is needed for CXL PCIe Upstream and Downstream Ports
>> in order to
>> notify the associated Root Port and OS.[1]
>>
>> Export the AER service driver's pci_aer_unmask_internal_errors()
>> function
>> to CXL namespace.
>>
>> Remove the function's dependency on the CONFIG_PCIEAER_CXL kernel config
>> because it is now an exported function.
>>
>> Call pci_aer_unmask_internal_errors() during RAS initialization in:
>> cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting().
>>
>> [1] PCIe Base Spec r6.2-1.0, 6.2.3.2.2 Masking Individual Errors
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>> drivers/cxl/core/pci.c | 2 ++
>> drivers/pci/pcie/aer.c | 5 +++--
>> include/linux/aer.h | 1 +
>> 3 files changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 9734a4c55b29..740ac5d8809f 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -886,6 +886,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port
>> *port)
>> cxl_assign_port_error_handlers(pdev);
>> devm_add_action_or_reset(port->uport_dev,
>> cxl_clear_port_error_handlers, pdev);
>> + pci_aer_unmask_internal_errors(pdev);
>> }
>> EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, CXL);
>> @@ -920,6 +921,7 @@ void cxl_dport_init_ras_reporting(struct
>> cxl_dport *dport)
>> cxl_assign_port_error_handlers(pdev);
>> devm_add_action_or_reset(dport_dev,
>> cxl_clear_port_error_handlers, pdev);
>> + pci_aer_unmask_internal_errors(pdev);
>> }
>> EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 861521872318..0fa1b1ed48c9 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -949,7 +949,6 @@ static bool is_internal_error(struct aer_err_info
>> *info)
>> return info->status & PCI_ERR_UNC_INTN;
>> }
>> -#ifdef CONFIG_PCIEAER_CXL
>
>
> This ifdef move puzzles me. I would expect to use it when the next
> function is invoked instead of moving it here.
>
> It seems weird to have such a config but code using those related
> functions not aware of it.
>
>
>> /**
>> * pci_aer_unmask_internal_errors - unmask internal errors
>> * @dev: pointer to the pcie_dev data structure
>> @@ -960,7 +959,7 @@ static bool is_internal_error(struct aer_err_info
>> *info)
>> * Note: AER must be enabled and supported by the device which must be
>> * checked in advance, e.g. with pcie_aer_is_native().
>> */
>> -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>> +void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>> {
>> int aer = dev->aer_cap;
>> u32 mask;
>> @@ -973,7 +972,9 @@ static void pci_aer_unmask_internal_errors(struct
>> pci_dev *dev)
>> mask &= ~PCI_ERR_COR_INTERNAL;
>> pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
>> }
>> +EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, CXL);
I forgot to mention that all these exports are changing in 6.13: the second
macro parameter is now a string, so this becomes just
EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, "CXL");
The codebase this patchset is based on is not affected, but I hope this
helps if you get weird errors with a newer kernel.
>> +#ifdef CONFIG_PCIEAER_CXL
>> static bool is_cxl_mem_dev(struct pci_dev *dev)
>> {
>> /*
>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>> index 4b97f38f3fcf..093293f9f12b 100644
>> --- a/include/linux/aer.h
>> +++ b/include/linux/aer.h
>> @@ -55,5 +55,6 @@ void pci_print_aer(struct pci_dev *dev, int
>> aer_severity,
>> int cper_severity_to_aer(int cper_severity);
>> void aer_recover_queue(int domain, unsigned int bus, unsigned int
>> devfn,
>> int severity, struct aer_capability_regs *aer_regs);
>> +void pci_aer_unmask_internal_errors(struct pci_dev *dev);
>> #endif //_AER_H_
>
* Re: [PATCH v4 04/15] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type
2024-12-12 1:34 ` Li Ming
@ 2024-12-12 19:59 ` Bowman, Terry
2024-12-14 13:34 ` Li Ming
0 siblings, 1 reply; 45+ messages in thread
From: Bowman, Terry @ 2024-12-12 19:59 UTC (permalink / raw)
To: Li Ming, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
Benjamin.Cheatham, rrichter, nathan.fontenot,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On 12/11/2024 7:34 PM, Li Ming wrote:
> On 12/12/2024 7:39 AM, Terry Bowman wrote:
>> The AER driver and aer_event tracing currently log 'PCIe Bus Type'
>> for all errors.
>>
>> Update the driver and aer_event tracing to log 'CXL Bus Type' for CXL
>> device errors.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>> Reviewed-by: Fan Ni <fan.ni@samsung.com>
>> ---
>> drivers/pci/pcie/aer.c | 14 ++++++++------
>> include/ras/ras_event.h | 9 ++++++---
>> 2 files changed, 14 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index fe6edf26279e..53e9a11f6c0f 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -699,13 +699,14 @@ static void __aer_print_error(struct pci_dev *dev,
>>
>> void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>> {
>> + const char *bus_type = pcie_is_cxl(dev) ? "CXL" : "PCIe";
>> int layer, agent;
>> int id = pci_dev_id(dev);
>> const char *level;
>>
>> if (!info->status) {
>> - pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
>> - aer_error_severity_string[info->severity]);
>> + pci_err(dev, "%s Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
>> + bus_type, aer_error_severity_string[info->severity]);
>> goto out;
>> }
>>
>> @@ -714,8 +715,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>>
>> level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
>>
>> - pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
>> - aer_error_severity_string[info->severity],
>> + pci_printk(level, dev, "%s Bus Error: severity=%s, type=%s, (%s)\n",
>> + bus_type, aer_error_severity_string[info->severity],
>> aer_error_layer[layer], aer_agent_string[agent]);
>>
>> pci_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
>> @@ -730,7 +731,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>> if (info->id && info->error_dev_num > 1 && info->id == id)
>> pci_err(dev, " Error of this Agent is reported first\n");
>>
>> - trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
>> + trace_aer_event(dev_name(&dev->dev), bus_type, (info->status & ~info->mask),
>> info->severity, info->tlp_header_valid, &info->tlp);
>> }
>>
>> @@ -764,6 +765,7 @@ EXPORT_SYMBOL_GPL(cper_severity_to_aer);
>> void pci_print_aer(struct pci_dev *dev, int aer_severity,
>> struct aer_capability_regs *aer)
>> {
>> + const char *bus_type = pcie_is_cxl(dev) ? "CXL" : "PCIe";
>> int layer, agent, tlp_header_valid = 0;
>> u32 status, mask;
>> struct aer_err_info info;
>> @@ -798,7 +800,7 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>> if (tlp_header_valid)
>> __print_tlp_header(dev, &aer->header_log);
>>
>> - trace_aer_event(dev_name(&dev->dev), (status & ~mask),
>> + trace_aer_event(dev_name(&dev->dev), bus_type, (status & ~mask),
>> aer_severity, tlp_header_valid, &aer->header_log);
>> }
>> EXPORT_SYMBOL_NS_GPL(pci_print_aer, CXL);
>> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
>> index e5f7ee0864e7..1bf8e7050ba8 100644
>> --- a/include/ras/ras_event.h
>> +++ b/include/ras/ras_event.h
>> @@ -297,15 +297,17 @@ TRACE_EVENT(non_standard_event,
>>
>> TRACE_EVENT(aer_event,
>> TP_PROTO(const char *dev_name,
>> + const char *bus_type,
>> const u32 status,
>> const u8 severity,
>> const u8 tlp_header_valid,
>> struct pcie_tlp_log *tlp),
>>
>> - TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp),
>> + TP_ARGS(dev_name, bus_type, status, severity, tlp_header_valid, tlp),
>>
>> TP_STRUCT__entry(
>> __string( dev_name, dev_name )
>> + __string( bus_type, bus_type )
>> __field( u32, status )
>> __field( u8, severity )
>> __field( u8, tlp_header_valid)
>> @@ -314,6 +316,7 @@ TRACE_EVENT(aer_event,
>>
>> TP_fast_assign(
>> __assign_str(dev_name);
>> + __assign_str(bus_type);
>> __entry->status = status;
>> __entry->severity = severity;
>> __entry->tlp_header_valid = tlp_header_valid;
>> @@ -325,8 +328,8 @@ TRACE_EVENT(aer_event,
>> }
>> ),
>>
>> - TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s\n",
>> - __get_str(dev_name),
>> + TP_printk("%s %s Bus Error: severity=%s, %s, TLP Header=%s\n",
>> + __get_str(dev_name), __get_str(bus_type),
>> __entry->severity == AER_CORRECTABLE ? "Corrected" :
>> __entry->severity == AER_FATAL ?
>> "Fatal" : "Uncorrected, non-fatal",
> Hi Terry,
>
>
> Patch #3 is using flexbus dvsec to identify CXL RP/USP/DSP. But per CXL r3.1 section 9.12.3 "Enumerating CXL RPs and DSPs", there may be a flexbus dvsec if CXL RP/DSP is in disconnect state or connecting to a PCIe device.
>
> If a PCIe device connects to a CXL RP/DSP, and the CXL RP/DSP reports an error, the error log will also be "CXL Bus Type". Is that expected? My understanding is that the CXL RP/DSP is working in PCIe mode.
>
> If not, I think that setting "pci_dev->is_cxl" during cxl port enumeration and CXL device probing is another option.
>
>
> Thanks
>
> Ming
>
Hi Ming,
aer_print_error() logs the AER details (including bus type) for the device that detected the error,
not for the RP AER reporting agent, unless the error is detected in the RP itself. The bus type is determined
using the 'dev' parameter, which in your example is a PCIe device, not a CXL device. aer_print_error()
will log "PCIe Bus Error" because the flexbus DVSEC will not be present in the 'dev' config space.
I agree that in your example the RP and downstream device will train to PCIe mode and not CXL mode. But the
flexbus DVSEC will still be present in the RP PCIe configuration space. The pci_dev::is_cxl structure
member indicates CXL support and does not reflect the current training state.
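To make the distinction concrete, here is a minimal user-space sketch. It is not kernel code: struct mock_pci_dev, its link_in_cxl_mode field, and mock_bus_type() are hypothetical stand-ins, and only the ternary mirrors the quoted patch. It shows that the logged label follows the capability flag alone, regardless of how the link trained.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few struct pci_dev fields discussed here. */
struct mock_pci_dev {
	bool is_cxl;		/* flexbus DVSEC present: device supports CXL */
	bool link_in_cxl_mode;	/* how the link actually trained; never consulted */
};

/* Mirrors the quoted ternary: the label depends only on is_cxl. */
static const char *mock_bus_type(const struct mock_pci_dev *dev)
{
	return dev->is_cxl ? "CXL" : "PCIe";
}
```

Under these assumptions, a CXL-capable RP whose link trained to PCIe mode still yields "CXL", while the PCIe device below it yields "PCIe".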
Regards,
Terry
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v4 07/15] PCI/AER: Add CXL PCIe Port Uncorrectable Error recovery in AER service driver
2024-12-12 9:28 ` Alejandro Lucero Palau
@ 2024-12-13 15:07 ` Bowman, Terry
0 siblings, 0 replies; 45+ messages in thread
From: Bowman, Terry @ 2024-12-13 15:07 UTC (permalink / raw)
To: Alejandro Lucero Palau, linux-cxl, linux-kernel, linux-pci,
nifan.cxl, ming4.li, dave, jonathan.cameron, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On 12/12/2024 3:28 AM, Alejandro Lucero Palau wrote:
> On 12/11/24 23:39, Terry Bowman wrote:
>> Existing recovery procedure for PCIe Uncorrectable Errors (UCE) does not
>> apply to CXL devices. Recovery can not be used for CXL devices because of
>> potential corruption on what can be system memory. Also, current PCIe UCE
>> recovery, in the case of a Root Port (RP) or Downstream Switch Port (DSP),
>> does not begin at the RP/DSP but begins at the first downstream device.
>> This will miss handling CXL Protocol Errors in a CXL RP or DSP. A separate
>> CXL recovery is needed because of the different handling requirements
>>
>> Add a new function, cxl_do_recovery() using the following.
>>
>> Add cxl_walk_bridge() to iterate the detected error's sub-topology.
>> cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
>> will begin iteration at the RP or DSP rather than beginning at the
>> first downstream device.
>>
>> Add cxl_report_error_detected() as an analog to report_error_detected().
>> It will call pci_driver::cxl_err_handlers for each iterated downstream
>> device. The pci_driver::cxl_err_handler's UCE handler returns a boolean
>> indicating if there was a UCE error detected during handling.
>>
>> cxl_do_recovery() uses the status from cxl_report_error_detected() to
>> determine how to proceed. Non-fatal CXL UCE errors will be treated as
>> fatal. If a UCE was present during handling then cxl_do_recovery()
>> will kernel panic.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>> drivers/pci/pci.h | 3 +++
>> drivers/pci/pcie/aer.c | 4 ++++
>> drivers/pci/pcie/err.c | 54 ++++++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 61 insertions(+)
>>
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index 14d00ce45bfa..5a67e41919d8 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -658,6 +658,9 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>> pci_channel_state_t state,
>> pci_ers_result_t (*reset_subordinates)(struct pci_dev *pdev));
>>
>> +/* CXL error reporting and handling */
>> +void cxl_do_recovery(struct pci_dev *dev);
>> +
>> bool pcie_wait_for_link(struct pci_dev *pdev, bool active);
>> int pcie_retrain_link(struct pci_dev *pdev, bool use_lt);
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index c1eb939c1cca..861521872318 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -1024,6 +1024,8 @@ static int cxl_rch_handle_error_iter(struct pci_dev *dev, void *data)
>> err_handler->error_detected(dev, pci_channel_io_normal);
>> else if (info->severity == AER_FATAL)
>> err_handler->error_detected(dev, pci_channel_io_frozen);
>> +
>> + cxl_do_recovery(dev);
>> }
>> out:
>> device_unlock(&dev->dev);
>> @@ -1048,6 +1050,8 @@ static void cxl_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>> pdrv->cxl_err_handler->cor_error_detected(dev);
>>
>> pcie_clear_device_status(dev);
>> + } else {
>> + cxl_do_recovery(dev);
>> }
>> }
>>
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index 31090770fffc..6f7cf5e0087f 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -276,3 +276,57 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>
>> return status;
>> }
>> +
>> +static void cxl_walk_bridge(struct pci_dev *bridge,
>> + int (*cb)(struct pci_dev *, void *),
>> + void *userdata)
>> +{
>> + bool *status = userdata;
>> +
>> + cb(bridge, status);
>> + if (bridge->subordinate && !*status)
>
> I would prefer not to use a pointer for status, as you are not changing
> what it points to here, so first a cast then using just !status in the
> conditional.
>
Hi Alejandro,
I agree. This can be significantly improved. I'll remove the condition on
'status' and instead introduce a condition on the cb() and pci_walk_bus() return values.
But 'status' does need to be a pointer so the caller (cxl_do_recovery()) can
determine whether to invoke panic().
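As a rough illustration of that direction, the following user-space sketch models a walk that starts at the bridge itself, stops as soon as a callback returns non-zero, and reports the result to the caller through a bool out-parameter. Everything here is hypothetical (struct mock_dev, mock_walk_bus(), mock_walk_bridge(), mock_report_error_detected()); it does not use or describe the real pci_walk_bus() signature, and the eventual refactor may look different.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a PCI device node; not the kernel's struct pci_dev. */
struct mock_dev {
	bool has_uce;			/* what the device's UCE handler would report */
	struct mock_dev *children;	/* first device on the subordinate bus */
	struct mock_dev *sibling;	/* next device on the same bus */
	int visited;			/* how many times the callback ran here */
};

typedef int (*mock_walk_cb)(struct mock_dev *dev, void *data);

/* Visit every device below a bus; stop as soon as cb() returns non-zero. */
static int mock_walk_bus(struct mock_dev *first, mock_walk_cb cb, void *data)
{
	for (struct mock_dev *dev = first; dev; dev = dev->sibling) {
		int rc = cb(dev, data);

		if (rc)
			return rc;
		rc = mock_walk_bus(dev->children, cb, data);
		if (rc)
			return rc;
	}
	return 0;
}

/* Unlike a plain bus walk, call cb() on the bridge first, then descend. */
static int mock_walk_bridge(struct mock_dev *bridge, mock_walk_cb cb, void *data)
{
	int rc = cb(bridge, data);

	if (rc)
		return rc;
	return mock_walk_bus(bridge->children, cb, data);
}

/* Record a detected UCE through the out-parameter and stop the walk. */
static int mock_report_error_detected(struct mock_dev *dev, void *data)
{
	bool *status = data;

	dev->visited++;
	if (dev->has_uce) {
		*status = true;
		return 1;
	}
	return 0;
}
```

In this sketch the caller is expected to initialize the flag to false before the walk and to check it afterwards; no bitwise OR is needed because the walk ends at the first device that reports an error.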
>> + pci_walk_bus(bridge->subordinate, cb, status);
>> +}
>> +
>> +static int cxl_report_error_detected(struct pci_dev *dev, void *data)
>> +{
>> + struct pci_driver *pdrv = dev->driver;
>> + bool *status = data;
>> +
>> + device_lock(&dev->dev);
>> + if (pdrv && pdrv->cxl_err_handler &&
>> + pdrv->cxl_err_handler->error_detected) {
>> + const struct cxl_error_handlers *cxl_err_handler =
>> + pdrv->cxl_err_handler;
>> + *status |= cxl_err_handler->error_detected(dev);
>
> This implies status should not be a bool pointer as different bits can
> be set by the returning value, but as the code seems to only care about
> any bit implying an error and therefore error detected, I guess that is
> fine. However, the next function calling this one is using an int ...
>
>
> Confusing to me. I would expect here not an OR but returning just when a
> first error is detected, handling the lock properly, with the walk
> function behind the scenes breaking the walk if the return is anything
> other than zero.
The cxl_err_handler->error_detected() return value is a bool, but the bitwise OR is not necessary. I'll refactor this as part of the change mentioned above.
Thanks for the feedback.
-Terry
>
>
>> + }
>> + device_unlock(&dev->dev);
>> + return *status;
>> +}
>> +
>> +void cxl_do_recovery(struct pci_dev *dev)
>> +{
>> + struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>> + int type = pci_pcie_type(dev);
>> + struct pci_dev *bridge;
>> + int status;
>> +
>> + if (type == PCI_EXP_TYPE_ROOT_PORT ||
>> + type == PCI_EXP_TYPE_DOWNSTREAM ||
>> + type == PCI_EXP_TYPE_UPSTREAM ||
>> + type == PCI_EXP_TYPE_ENDPOINT)
>> + bridge = dev;
>> + else
>> + bridge = pci_upstream_bridge(dev);
>> +
>> + cxl_walk_bridge(bridge, cxl_report_error_detected, &status);
>> + if (status)
>> + panic("CXL cachemem error.");
>> +
>> + if (host->native_aer || pcie_ports_native) {
>> + pcie_clear_device_status(dev);
>> + pci_aer_clear_nonfatal_status(dev);
>> + }
>> +
>> + pci_info(bridge, "CXL uncorrectable error.\n");
>> +}
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v4 08/15] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers
2024-12-12 10:36 ` Alejandro Lucero Palau
@ 2024-12-13 15:10 ` Bowman, Terry
0 siblings, 0 replies; 45+ messages in thread
From: Bowman, Terry @ 2024-12-13 15:10 UTC (permalink / raw)
To: Alejandro Lucero Palau, linux-cxl, linux-kernel, linux-pci,
nifan.cxl, dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, dan.j.williams, bhelgaas, mahesh, ira.weiny,
oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati, Li Ming
On 12/12/2024 4:36 AM, Alejandro Lucero Palau wrote:
> On 12/11/24 23:39, Terry Bowman wrote:
>> The CXL mem driver (cxl_mem) currently maps and caches a pointer to RAS
>> registers for the endpoint's Root Port. The same needs to be done for
>> each of the CXL Downstream Switch Ports and CXL Root Ports found between
>> the endpoint and CXL Host Bridge.
>>
>> Introduce cxl_init_ep_ports_aer() to be called for each CXL Port in the
>> sub-topology between the endpoint and the CXL Host Bridge. This function
>> will determine if there are CXL Downstream Switch Ports or CXL Root Ports
>> associated with this Port. The same check will be added in the future for
>> upstream switch ports.
>>
>> Move the RAS register map logic from cxl_dport_map_ras() into
>> cxl_dport_init_ras_reporting(). This eliminates the need for the helper
>> function, cxl_dport_map_ras().
>>
>> cxl_init_ep_ports_aer() calls cxl_dport_init_ras_reporting() to map
>> the RAS registers for CXL Downstream Switch Ports and CXL Root Ports.
>>
>> cxl_dport_init_ras_reporting() must check for previously mapped registers
>> before mapping. This is necessary because endpoints under a CXL switch
>> may share CXL Downstream Switch Ports or CXL Root Ports. Ensure the port
>> registers are only mapped once.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>
> Reviewed-by: Alejandro Lucero <alucerop@amd.com>
>
>
> Just a nit, but regs mapping could fail, which is properly reported, and
> the __cxl_handle_ras function checks for the regs iomem being there
> before using them. But if the mapping failed, any report there is
> silently dropped. If the AER is happening, maybe add a WARN_ONCE there?
>
Good point. I'll add the WARN_ONCE().
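A minimal user-space model of the idea, for illustration only: mock_warn_once() and mock_handle_ras() are hypothetical names, and the message text is made up. The sketch assumes the handler bails out when the RAS mapping is missing, and it warns only the first time that happens so a repeating error does not flood the log.

```c
#include <stdbool.h>
#include <stdio.h>

static int mock_warn_count;	/* number of warnings actually printed */

/* Print msg the first time cond is true; always return cond. */
static bool mock_warn_once(bool cond, bool *warned, const char *msg)
{
	if (cond && !*warned) {
		*warned = true;
		mock_warn_count++;
		fprintf(stderr, "WARNING: %s\n", msg);
	}
	return cond;
}

/* Hypothetical handler: ras_base is NULL when the RAS mapping failed. */
static bool mock_handle_ras(const void *ras_base)
{
	static bool warned;

	if (mock_warn_once(!ras_base, &warned, "RAS registers not mapped"))
		return false;	/* nothing to read; the error cannot be logged */
	return true;		/* registers available; handling would proceed */
}
```

Calling mock_handle_ras(NULL) twice prints a single warning, and a later call with a valid pointer proceeds without warning.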
- Terry
>> ---
>> drivers/cxl/core/pci.c | 37 +++++++++++++++----------------------
>> drivers/cxl/cxl.h | 6 ++----
>> drivers/cxl/mem.c | 31 +++++++++++++++++++++++++++++--
>> 3 files changed, 46 insertions(+), 28 deletions(-)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 5b46bc46aaa9..8540d1fd2e25 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -749,18 +749,6 @@ static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
>> }
>> }
>>
>> -static void cxl_dport_map_ras(struct cxl_dport *dport)
>> -{
>> - struct cxl_register_map *map = &dport->reg_map;
>> - struct device *dev = dport->dport_dev;
>> -
>> - if (!map->component_map.ras.valid)
>> - dev_dbg(dev, "RAS registers not found\n");
>> - else if (cxl_map_component_regs(map, &dport->regs.component,
>> - BIT(CXL_CM_CAP_CAP_ID_RAS)))
>> - dev_dbg(dev, "Failed to map RAS capability.\n");
>> -}
>> -
>> static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>> {
>> void __iomem *aer_base = dport->regs.dport_aer;
>> @@ -788,22 +776,27 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>> /**
>> * cxl_dport_init_ras_reporting - Setup CXL RAS report on this dport
>> * @dport: the cxl_dport that needs to be initialized
>> - * @host: host device for devm operations
>> */
>> -void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host)
>> +void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>> {
>> - dport->reg_map.host = host;
>> - cxl_dport_map_ras(dport);
>> -
>> - if (dport->rch) {
>> - struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport->dport_dev);
>> -
>> - if (!host_bridge->native_aer)
>> - return;
>> + struct device *dport_dev = dport->dport_dev;
>> + struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
>>
>> + dport->reg_map.host = dport_dev;
>> + if (dport->rch && host_bridge->native_aer) {
>> cxl_dport_map_rch_aer(dport);
>> cxl_disable_rch_root_ints(dport);
>> }
>> +
>> + /* dport may have more than 1 downstream EP. Check if already mapped. */
>> + if (dport->regs.ras)
>> + return;
>> +
>> + if (cxl_map_component_regs(&dport->reg_map, &dport->regs.component,
>> + BIT(CXL_CM_CAP_CAP_ID_RAS))) {
>> + dev_err(dport_dev, "Failed to map RAS capability.\n");
>> + return;
>> + }
>> }
>> EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
>>
>> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
>> index 5406e3ab3d4a..51acca3415b4 100644
>> --- a/drivers/cxl/cxl.h
>> +++ b/drivers/cxl/cxl.h
>> @@ -763,11 +763,9 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
>> resource_size_t rcrb);
>>
>> #ifdef CONFIG_PCIEAER_CXL
>> -void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport);
>> -void cxl_dport_init_ras_reporting(struct cxl_dport *dport, struct device *host);
>> +void cxl_dport_init_ras_reporting(struct cxl_dport *dport);
>> #else
>> -static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport,
>> - struct device *host) { }
>> +static inline void cxl_dport_init_ras_reporting(struct cxl_dport *dport) { }
>> #endif
>>
>> struct cxl_decoder *to_cxl_decoder(struct device *dev);
>> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
>> index a9fd5cd5a0d2..0ae89c9da71e 100644
>> --- a/drivers/cxl/mem.c
>> +++ b/drivers/cxl/mem.c
>> @@ -45,6 +45,31 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
>> return 0;
>> }
>>
>> +static bool dev_is_cxl_pci(struct device *dev, u32 pcie_type)
>> +{
>> + struct pci_dev *pdev;
>> +
>> + if (!dev || !dev_is_pci(dev))
>> + return false;
>> +
>> + pdev = to_pci_dev(dev);
>> +
>> + return (pci_pcie_type(pdev) == pcie_type);
>> +}
>> +
>> +static void cxl_init_ep_ports_aer(struct cxl_ep *ep)
>> +{
>> + struct cxl_dport *dport = ep->dport;
>> +
>> + if (dport) {
>> + struct device *dport_dev = dport->dport_dev;
>> +
>> + if (dev_is_cxl_pci(dport_dev, PCI_EXP_TYPE_DOWNSTREAM) ||
>> + dev_is_cxl_pci(dport_dev, PCI_EXP_TYPE_ROOT_PORT))
>> + cxl_dport_init_ras_reporting(dport);
>> + }
>> +}
>> +
>> static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
>> struct cxl_dport *parent_dport)
>> {
>> @@ -52,6 +77,9 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
>> struct cxl_port *endpoint, *iter, *down;
>> int rc;
>>
>> + if (parent_dport->rch)
>> + cxl_dport_init_ras_reporting(parent_dport);
>> +
>> /*
>> * Now that the path to the root is established record all the
>> * intervening ports in the chain.
>> @@ -62,6 +90,7 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
>>
>> ep = cxl_ep_load(iter, cxlmd);
>> ep->next = down;
>> + cxl_init_ep_ports_aer(ep);
>> }
>>
>> /* Note: endpoint port component registers are derived from @cxlds */
>> @@ -166,8 +195,6 @@ static int cxl_mem_probe(struct device *dev)
>> else
>> endpoint_parent = &parent_port->dev;
>>
>> - cxl_dport_init_ras_reporting(dport, dev);
>> -
>> scoped_guard(device, endpoint_parent) {
>> if (!endpoint_parent->driver) {
>> dev_err(dev, "CXL port topology %s not enabled\n",
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v4 15/15] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports
2024-12-12 10:44 ` Alejandro Lucero Palau
@ 2024-12-13 15:22 ` Bowman, Terry
0 siblings, 0 replies; 45+ messages in thread
From: Bowman, Terry @ 2024-12-13 15:22 UTC (permalink / raw)
To: Alejandro Lucero Palau, linux-cxl, linux-kernel, linux-pci,
nifan.cxl, dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, dan.j.williams, bhelgaas, mahesh, ira.weiny,
oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati, Li Ming
On 12/12/2024 4:44 AM, Alejandro Lucero Palau wrote:
> On 12/12/24 09:44, Alejandro Lucero Palau wrote:
>> On 12/11/24 23:40, Terry Bowman wrote:
>>> The AER service driver enables PCIe Uncorrectable Internal Errors
>>> (UIE) and
>>> Correctable Internal errors (CIE) for CXL Root Ports and CXL RCEC's. The
>>> UIE and CIE are used in reporting CXL Protocol Errors. The same UIE/CIE
>>> enablement is needed for CXL PCIe Upstream and Downstream Ports
>>> in order to
>>> notify the associated Root Port and OS.[1]
>>>
>>> Export the AER service driver's pci_aer_unmask_internal_errors()
>>> function
>>> to CXL namespace.
>>>
>>> Remove the function's dependency on the CONFIG_PCIEAER_CXL kernel config
>>> because it is now an exported function.
>>>
>>> Call pci_aer_unmask_internal_errors() during RAS initialization in:
>>> cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting().
>>>
>>> [1] PCIe Base Spec r6.2-1.0, 6.2.3.2.2 Masking Individual Errors
>>>
>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>> ---
>>> drivers/cxl/core/pci.c | 2 ++
>>> drivers/pci/pcie/aer.c | 5 +++--
>>> include/linux/aer.h | 1 +
>>> 3 files changed, 6 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>>> index 9734a4c55b29..740ac5d8809f 100644
>>> --- a/drivers/cxl/core/pci.c
>>> +++ b/drivers/cxl/core/pci.c
>>> @@ -886,6 +886,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port
>>> *port)
>>> cxl_assign_port_error_handlers(pdev);
>>> devm_add_action_or_reset(port->uport_dev,
>>> cxl_clear_port_error_handlers, pdev);
>>> + pci_aer_unmask_internal_errors(pdev);
>>> }
>>> EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, CXL);
>>> @@ -920,6 +921,7 @@ void cxl_dport_init_ras_reporting(struct
>>> cxl_dport *dport)
>>> cxl_assign_port_error_handlers(pdev);
>>> devm_add_action_or_reset(dport_dev,
>>> cxl_clear_port_error_handlers, pdev);
>>> + pci_aer_unmask_internal_errors(pdev);
>>> }
>>> EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>> index 861521872318..0fa1b1ed48c9 100644
>>> --- a/drivers/pci/pcie/aer.c
>>> +++ b/drivers/pci/pcie/aer.c
>>> @@ -949,7 +949,6 @@ static bool is_internal_error(struct aer_err_info
>>> *info)
>>> return info->status & PCI_ERR_UNC_INTN;
>>> }
>>> -#ifdef CONFIG_PCIEAER_CXL
>>
>> This ifdef move puzzles me. I would expect to use it when the next
>> function is invoked instead of moving it here.
>>
>> It seems weird to have such a config but code using those related
>> functions not aware of it.
>>
>>
>>> /**
>>> * pci_aer_unmask_internal_errors - unmask internal errors
>>> * @dev: pointer to the pcie_dev data structure
>>> @@ -960,7 +959,7 @@ static bool is_internal_error(struct aer_err_info
>>> *info)
>>> * Note: AER must be enabled and supported by the device which must be
>>> * checked in advance, e.g. with pcie_aer_is_native().
>>> */
>>> -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>>> +void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>>> {
>>> int aer = dev->aer_cap;
>>> u32 mask;
>>> @@ -973,7 +972,9 @@ static void pci_aer_unmask_internal_errors(struct
>>> pci_dev *dev)
>>> mask &= ~PCI_ERR_COR_INTERNAL;
>>> pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
>>> }
>>> +EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, CXL);
>
> Forgot to mention all these exports are changing in 6.13 with the second
> macro param now being a string, so just
>
> EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, "CXL");
>
>
> Not affected in the codebase linked to this patchset, but I hope it
> helps you when getting weird errors with a newer kernel.
Thanks for the heads-up.
- Terry
>
>>> +#ifdef CONFIG_PCIEAER_CXL
>>> static bool is_cxl_mem_dev(struct pci_dev *dev)
>>> {
>>> /*
>>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>>> index 4b97f38f3fcf..093293f9f12b 100644
>>> --- a/include/linux/aer.h
>>> +++ b/include/linux/aer.h
>>> @@ -55,5 +55,6 @@ void pci_print_aer(struct pci_dev *dev, int
>>> aer_severity,
>>> int cper_severity_to_aer(int cper_severity);
>>> void aer_recover_queue(int domain, unsigned int bus, unsigned int
>>> devfn,
>>> int severity, struct aer_capability_regs *aer_regs);
>>> +void pci_aer_unmask_internal_errors(struct pci_dev *dev);
>>> #endif //_AER_H_
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v4 15/15] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports
2024-12-12 9:44 ` Alejandro Lucero Palau
2024-12-12 10:44 ` Alejandro Lucero Palau
@ 2024-12-13 15:34 ` Bowman, Terry
1 sibling, 0 replies; 45+ messages in thread
From: Bowman, Terry @ 2024-12-13 15:34 UTC (permalink / raw)
To: Alejandro Lucero Palau, linux-cxl, linux-kernel, linux-pci,
nifan.cxl, dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, dan.j.williams, bhelgaas, mahesh, ira.weiny,
oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati, Li Ming
On 12/12/2024 3:44 AM, Alejandro Lucero Palau wrote:
> On 12/11/24 23:40, Terry Bowman wrote:
>> The AER service driver enables PCIe Uncorrectable Internal Errors (UIE) and
>> Correctable Internal errors (CIE) for CXL Root Ports and CXL RCEC's. The
>> UIE and CIE are used in reporting CXL Protocol Errors. The same UIE/CIE
>> enablement is needed for CXL PCIe Upstream and Downstream Ports in order to
>> notify the associated Root Port and OS.[1]
>>
>> Export the AER service driver's pci_aer_unmask_internal_errors() function
>> to CXL namespace.
>>
>> Remove the function's dependency on the CONFIG_PCIEAER_CXL kernel config
>> because it is now an exported function.
>>
>> Call pci_aer_unmask_internal_errors() during RAS initialization in:
>> cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting().
>>
>> [1] PCIe Base Spec r6.2-1.0, 6.2.3.2.2 Masking Individual Errors
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>> drivers/cxl/core/pci.c | 2 ++
>> drivers/pci/pcie/aer.c | 5 +++--
>> include/linux/aer.h | 1 +
>> 3 files changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 9734a4c55b29..740ac5d8809f 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -886,6 +886,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
>>
>> cxl_assign_port_error_handlers(pdev);
>> devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
>> + pci_aer_unmask_internal_errors(pdev);
>> }
>> EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, CXL);
>>
>> @@ -920,6 +921,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>>
>> cxl_assign_port_error_handlers(pdev);
>> devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
>> + pci_aer_unmask_internal_errors(pdev);
>> }
>> EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 861521872318..0fa1b1ed48c9 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -949,7 +949,6 @@ static bool is_internal_error(struct aer_err_info *info)
>> return info->status & PCI_ERR_UNC_INTN;
>> }
>>
>> -#ifdef CONFIG_PCIEAER_CXL
>
> This ifdef move puzzles me. I would expect to use it when the next
> function is invoked instead of moving it here.
>
> It seems weird to have such a config but code using those related
> functions not aware of it.
>
I was asked to remove the dependency on the Kconfig option (the ifdef) because the function is also being
exported and used across multiple subsystems. Because it's exported, the function's behavior
needs to be consistent and independent of a Kconfig option.
I'll update the commit message with this reasoning.
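For context on what callers get from the exported helper, here is a small user-space model of the unmasking step. It is a sketch under assumptions: struct mock_aer_masks and mock_unmask_internal_errors() are hypothetical, the two bit values are what I believe include/uapi/linux/pci_regs.h defines and should be verified there, and real code reads and writes config space rather than a struct.

```c
#include <stdint.h>

/* Assumed to match pci_regs.h; verify against the header before relying on them. */
#define MOCK_PCI_ERR_UNC_INTN		0x00400000u	/* Uncorrectable Internal Error */
#define MOCK_PCI_ERR_COR_INTERNAL	0x00004000u	/* Corrected Internal Error */

/* Hypothetical stand-in for the two AER mask registers. */
struct mock_aer_masks {
	uint32_t uncor_mask;
	uint32_t cor_mask;
};

/* Clear only the internal-error bits; every other mask bit is preserved. */
static void mock_unmask_internal_errors(struct mock_aer_masks *m)
{
	m->uncor_mask &= ~MOCK_PCI_ERR_UNC_INTN;
	m->cor_mask &= ~MOCK_PCI_ERR_COR_INTERNAL;
}
```

The operation is a plain read-modify-write with no CXL-specific state, which fits the argument that its behavior should not vary with a CXL-only config option.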
- Terry
>> /**
>> * pci_aer_unmask_internal_errors - unmask internal errors
>> * @dev: pointer to the pcie_dev data structure
>> @@ -960,7 +959,7 @@ static bool is_internal_error(struct aer_err_info *info)
>> * Note: AER must be enabled and supported by the device which must be
>> * checked in advance, e.g. with pcie_aer_is_native().
>> */
>> -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>> +void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>> {
>> int aer = dev->aer_cap;
>> u32 mask;
>> @@ -973,7 +972,9 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>> mask &= ~PCI_ERR_COR_INTERNAL;
>> pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
>> }
>> +EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, CXL);
>>
>> +#ifdef CONFIG_PCIEAER_CXL
>> static bool is_cxl_mem_dev(struct pci_dev *dev)
>> {
>> /*
>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>> index 4b97f38f3fcf..093293f9f12b 100644
>> --- a/include/linux/aer.h
>> +++ b/include/linux/aer.h
>> @@ -55,5 +55,6 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>> int cper_severity_to_aer(int cper_severity);
>> void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
>> int severity, struct aer_capability_regs *aer_regs);
>> +void pci_aer_unmask_internal_errors(struct pci_dev *dev);
>> #endif //_AER_H_
>>
^ permalink raw reply [flat|nested] 45+ messages in thread
* Re: [PATCH v4 04/15] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type
2024-12-12 19:59 ` Bowman, Terry
@ 2024-12-14 13:34 ` Li Ming
0 siblings, 0 replies; 45+ messages in thread
From: Li Ming @ 2024-12-14 13:34 UTC (permalink / raw)
To: Bowman, Terry, linux-cxl, linux-kernel, linux-pci, nifan.cxl,
dave, jonathan.cameron, dave.jiang, alison.schofield,
vishal.l.verma, dan.j.williams, bhelgaas, mahesh, ira.weiny,
oohall, Benjamin.Cheatham, rrichter, nathan.fontenot,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On 12/13/2024 3:59 AM, Bowman, Terry wrote:
>
>
> On 12/11/2024 7:34 PM, Li Ming wrote:
>> On 12/12/2024 7:39 AM, Terry Bowman wrote:
>>> The AER driver and aer_event tracing currently log 'PCIe Bus Type'
>>> for all errors.
>>>
>>> Update the driver and aer_event tracing to log 'CXL Bus Type' for CXL
>>> device errors.
>>>
>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>>> Reviewed-by: Dave Jiang <dave.jiang@intel.com>
>>> Reviewed-by: Fan Ni <fan.ni@samsung.com>
>>> ---
>>> drivers/pci/pcie/aer.c | 14 ++++++++------
>>> include/ras/ras_event.h | 9 ++++++---
>>> 2 files changed, 14 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>> index fe6edf26279e..53e9a11f6c0f 100644
>>> --- a/drivers/pci/pcie/aer.c
>>> +++ b/drivers/pci/pcie/aer.c
>>> @@ -699,13 +699,14 @@ static void __aer_print_error(struct pci_dev *dev,
>>>
>>> void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>>> {
>>> + const char *bus_type = pcie_is_cxl(dev) ? "CXL" : "PCIe";
>>> int layer, agent;
>>> int id = pci_dev_id(dev);
>>> const char *level;
>>>
>>> if (!info->status) {
>>> - pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
>>> - aer_error_severity_string[info->severity]);
>>> + pci_err(dev, "%s Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
>>> + bus_type, aer_error_severity_string[info->severity]);
>>> goto out;
>>> }
>>>
>>> @@ -714,8 +715,8 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>>>
>>> level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
>>>
>>> - pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
>>> - aer_error_severity_string[info->severity],
>>> + pci_printk(level, dev, "%s Bus Error: severity=%s, type=%s, (%s)\n",
>>> + bus_type, aer_error_severity_string[info->severity],
>>> aer_error_layer[layer], aer_agent_string[agent]);
>>>
>>> pci_printk(level, dev, " device [%04x:%04x] error status/mask=%08x/%08x\n",
>>> @@ -730,7 +731,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
>>> if (info->id && info->error_dev_num > 1 && info->id == id)
>>> pci_err(dev, " Error of this Agent is reported first\n");
>>>
>>> - trace_aer_event(dev_name(&dev->dev), (info->status & ~info->mask),
>>> + trace_aer_event(dev_name(&dev->dev), bus_type, (info->status & ~info->mask),
>>> info->severity, info->tlp_header_valid, &info->tlp);
>>> }
>>>
>>> @@ -764,6 +765,7 @@ EXPORT_SYMBOL_GPL(cper_severity_to_aer);
>>> void pci_print_aer(struct pci_dev *dev, int aer_severity,
>>> struct aer_capability_regs *aer)
>>> {
>>> + const char *bus_type = pcie_is_cxl(dev) ? "CXL" : "PCIe";
>>> int layer, agent, tlp_header_valid = 0;
>>> u32 status, mask;
>>> struct aer_err_info info;
>>> @@ -798,7 +800,7 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>>> if (tlp_header_valid)
>>> __print_tlp_header(dev, &aer->header_log);
>>>
>>> - trace_aer_event(dev_name(&dev->dev), (status & ~mask),
>>> + trace_aer_event(dev_name(&dev->dev), bus_type, (status & ~mask),
>>> aer_severity, tlp_header_valid, &aer->header_log);
>>> }
>>> EXPORT_SYMBOL_NS_GPL(pci_print_aer, CXL);
>>> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
>>> index e5f7ee0864e7..1bf8e7050ba8 100644
>>> --- a/include/ras/ras_event.h
>>> +++ b/include/ras/ras_event.h
>>> @@ -297,15 +297,17 @@ TRACE_EVENT(non_standard_event,
>>>
>>> TRACE_EVENT(aer_event,
>>> TP_PROTO(const char *dev_name,
>>> + const char *bus_type,
>>> const u32 status,
>>> const u8 severity,
>>> const u8 tlp_header_valid,
>>> struct pcie_tlp_log *tlp),
>>>
>>> - TP_ARGS(dev_name, status, severity, tlp_header_valid, tlp),
>>> + TP_ARGS(dev_name, bus_type, status, severity, tlp_header_valid, tlp),
>>>
>>> TP_STRUCT__entry(
>>> __string( dev_name, dev_name )
>>> + __string( bus_type, bus_type )
>>> __field( u32, status )
>>> __field( u8, severity )
>>> __field( u8, tlp_header_valid)
>>> @@ -314,6 +316,7 @@ TRACE_EVENT(aer_event,
>>>
>>> TP_fast_assign(
>>> __assign_str(dev_name);
>>> + __assign_str(bus_type);
>>> __entry->status = status;
>>> __entry->severity = severity;
>>> __entry->tlp_header_valid = tlp_header_valid;
>>> @@ -325,8 +328,8 @@ TRACE_EVENT(aer_event,
>>> }
>>> ),
>>>
>>> - TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s\n",
>>> - __get_str(dev_name),
>>> + TP_printk("%s %s Bus Error: severity=%s, %s, TLP Header=%s\n",
>>> + __get_str(dev_name), __get_str(bus_type),
>>> __entry->severity == AER_CORRECTABLE ? "Corrected" :
>>> __entry->severity == AER_FATAL ?
>>> "Fatal" : "Uncorrected, non-fatal",
>> Hi Terry,
>>
>>
>> Patch #3 uses the Flex Bus DVSEC to identify a CXL RP/USP/DSP. But per CXL r3.1 section 9.12.3 "Enumerating CXL RPs and DSPs", there may be a Flex Bus DVSEC even if the CXL RP/DSP is in the disconnected state or is connected to a PCIe device.
>>
>> If a PCIe device is connected to a CXL RP/DSP and the CXL RP/DSP reports an error, the error log will also show "CXL Bus Type". Is that expected? My understanding is that the CXL RP/DSP is operating in PCIe mode.
>>
>> If not, I think setting "pci_dev->is_cxl" during CXL port enumeration and CXL device probing is another option.
>>
>>
>> Thanks
>>
>> Ming
>>
> Hi Ming,
>
> aer_print_error() logs the AER details (including bus type) for the device that detected the error,
> not the RP AER reporting agent, unless the error is detected in the RP. The bus type is determined
> using the 'dev' parameter, which in your example is a PCIe device, not a CXL device. aer_print_error()
> will log "PCIe Bus Error" because the Flex Bus DVSEC will not be present in the 'dev' config space.
>
> I agree in your example the RP and downstream device will train to PCIe mode and not CXL mode. But, the
> flexbus DVSEC will still be present in the RP PCIe configuration space. The pci_dev::is_cxl structure
> member indicates CXL support and is not reflective of the current training state.
>
> Regards,
> Terry
>
>
Got it, thanks for your explanation.
Thanks
Ming
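
To make the distinction discussed above concrete, here is a minimal userspace sketch of a bus type label that follows the capability flag rather than the current link state. 'struct model_dev', its fields and model_bus_type() are hypothetical stand-ins, not kernel definitions; only the "CXL"/"PCIe" ternary mirrors the quoted patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the relevant bits of struct pci_dev. */
struct model_dev {
	bool is_cxl;           /* Flex Bus DVSEC found: device is CXL capable */
	bool link_in_cxl_mode; /* current training result, not consulted below */
};

/* Same shape as the patch's ternary: label by capability, not link state. */
static const char *model_bus_type(const struct model_dev *dev)
{
	assert(dev != NULL);
	return dev->is_cxl ? "CXL" : "PCIe";
}
```

Under this model a CXL capable Root Port that trained to PCIe mode is still labelled "CXL", while a plain PCIe endpoint below it is labelled "PCIe", which is the behaviour Terry describes.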
* Re: [PATCH v4 14/15] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
2024-12-12 2:31 ` Li Ming
@ 2024-12-17 14:39 ` Bowman, Terry
0 siblings, 0 replies; 45+ messages in thread
From: Bowman, Terry @ 2024-12-17 14:39 UTC (permalink / raw)
To: Li Ming, linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave,
jonathan.cameron, dave.jiang, alison.schofield, vishal.l.verma,
dan.j.williams, bhelgaas, mahesh, ira.weiny, oohall,
Benjamin.Cheatham, rrichter, nathan.fontenot,
Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On 12/11/2024 8:31 PM, Li Ming wrote:
> On 12/12/2024 7:40 AM, Terry Bowman wrote:
>> pci_driver::cxl_err_handlers are not currently assigned handler callbacks.
>> The handlers can't be set in the pci_driver static definition because the
>> CXL PCIe Port devices are bound to the portdrv driver which is not CXL
>> driver aware.
>>
>> Add cxl_assign_port_error_handlers() in the cxl_core module. This
>> function will assign the default handlers for a CXL PCIe Port device.
>>
>> When the CXL Port (cxl_port or cxl_dport) is destroyed the device's
>> pci_driver::cxl_err_handlers must be set to NULL indicating they should no
>> longer be used.
>>
>> Create cxl_clear_port_error_handlers() and register it to be called
>> when the CXL Port device (cxl_port or cxl_dport) is destroyed.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>> drivers/cxl/core/pci.c | 40 ++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 40 insertions(+)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 3294ad5ff28f..9734a4c55b29 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -841,8 +841,38 @@ static bool cxl_port_error_detected(struct pci_dev *pdev)
>> return __cxl_handle_ras(&pdev->dev, ras_base);
>> }
>>
>> +static const struct cxl_error_handlers cxl_port_error_handlers = {
>> + .error_detected = cxl_port_error_detected,
>> + .cor_error_detected = cxl_port_cor_error_detected,
>> +};
>> +
>> +static void cxl_assign_port_error_handlers(struct pci_dev *pdev)
>> +{
>> + struct pci_driver *pdrv;
>> +
>> + if (!pdev || !pdev->driver)
>> + return;
>> +
>> + pdrv = pdev->driver;
>> + pdrv->cxl_err_handler = &cxl_port_error_handlers;
>> +}
>> +
>> +static void cxl_clear_port_error_handlers(void *data)
>> +{
>> + struct pci_dev *pdev = data;
>> + struct pci_driver *pdrv;
>> +
>> + if (!pdev || !pdev->driver)
>> + return;
>> +
>> + pdrv = pdev->driver;
>> + pdrv->cxl_err_handler = NULL;
>> +}
>> +
>> void cxl_uport_init_ras_reporting(struct cxl_port *port)
>> {
>> + struct pci_dev *pdev = to_pci_dev(port->uport_dev);
>> +
>> /* uport may have more than 1 downstream EP. Check if already mapped. */
>> if (port->uport_regs.ras)
>> return;
>> @@ -853,6 +883,9 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
>> dev_err(&port->dev, "Failed to map RAS capability.\n");
>> return;
>> }
>> +
>> + cxl_assign_port_error_handlers(pdev);
>> + devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
> I think the first parameter of devm_add_action_or_reset() should be 'port->dev' rather than 'port->uport_dev'.
>
> 'port->uport_dev' is 'pci_dev->dev' which will be destroyed on pci side, 'port->dev' will be destroyed on cxl side.
Hi Ming,
Indeed, this needs to be changed for dport and uport as you also indicated in your other email.
This is necessary in case the CXL drivers are uninstalled. Making this change will de-register
the RAS error handler callbacks so they are not potentially used to handle Port CXL errors
after the CXL drivers are removed.
Thanks Ming.
-Terry
>> }
>> EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, CXL);
>>
>> @@ -864,6 +897,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>> {
>> struct device *dport_dev = dport->dport_dev;
>> struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
>> + struct pci_dev *pdev = to_pci_dev(dport_dev);
>>
>> dport->reg_map.host = dport_dev;
>> if (dport->rch && host_bridge->native_aer) {
>> @@ -880,6 +914,12 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>> dev_err(dport_dev, "Failed to map RAS capability.\n");
>> return;
>> }
>> +
>> + if (dport->rch)
>> + return;
>> +
>> + cxl_assign_port_error_handlers(pdev);
>> + devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
> Same as above, should use 'port->dev'.
>
> please correct me if I am wrong.
>
>
> Ming
>
>> }
>> EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
>>
>
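
The host device question raised above can be illustrated with a small userspace model of a devres-style action list. This is only a sketch under stated assumptions: 'struct model_device', model_add_action() and model_release() are hypothetical and merely imitate the idea that a registered action runs when its host device is torn down. It is not the kernel's devres implementation.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_ACTIONS 4

/* Hypothetical device that owns a list of cleanup actions. */
struct model_device {
	void (*action[MODEL_MAX_ACTIONS])(void *);
	void *data[MODEL_MAX_ACTIONS];
	size_t nr_actions;
};

/* Hypothetical stand-in for the handler pointer in struct pci_driver. */
struct model_driver {
	const void *cxl_err_handler;
};

/* Register fn(data) to run when 'host' is released. */
static int model_add_action(struct model_device *host,
			    void (*fn)(void *), void *data)
{
	assert(host != NULL && fn != NULL);
	if (host->nr_actions == MODEL_MAX_ACTIONS)
		return -1;
	host->action[host->nr_actions] = fn;
	host->data[host->nr_actions] = data;
	host->nr_actions++;
	return 0;
}

/* Tear down 'host': run its actions, most recently added first. */
static void model_release(struct model_device *host)
{
	while (host->nr_actions) {
		host->nr_actions--;
		host->action[host->nr_actions](host->data[host->nr_actions]);
	}
}

/* Analog of cxl_clear_port_error_handlers() in the quoted patch. */
static void model_clear_handlers(void *data)
{
	struct model_driver *drv = data;

	drv->cxl_err_handler = NULL;
}
```

In this model the handlers are cleared only when the device that hosts the action goes away. Hosting the action on the CXL side object therefore clears them when the CXL port is removed, even if the PCI device is still present.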
* Re: [PATCH v4 06/15] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe Port devices
2024-12-11 23:39 ` [PATCH v4 06/15] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe Port devices Terry Bowman
@ 2024-12-24 18:28 ` Jonathan Cameron
0 siblings, 0 replies; 45+ messages in thread
From: Jonathan Cameron @ 2024-12-24 18:28 UTC (permalink / raw)
To: Terry Bowman
Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
dave.jiang, alison.schofield, vishal.l.verma, dan.j.williams,
bhelgaas, mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On Wed, 11 Dec 2024 17:39:53 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> The AER service driver's aer_get_device_error_info() function doesn't read
> uncorrectable (UCE) fatal error status from PCIe Upstream Port devices,
> including CXL Upstream Switch Ports. As a result, fatal errors are not
> logged or handled as needed for CXL PCIe Upstream Switch Port devices.
>
> Update the aer_get_device_error_info() function to read the UCE fatal
> status for all CXL PCIe devices. Make the change such that non-CXL devices
> are not affected.
>
> The fatal error status will be used in future patches implementing
> CXL PCIe Port uncorrectable error handling and logging.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
This seems fine to me, though it interacts with the link healthy change
you pointed me at from Shuai Xue.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
> drivers/pci/pcie/aer.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index d75886174969..c1eb939c1cca 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1250,7 +1250,8 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
> } else if (type == PCI_EXP_TYPE_ROOT_PORT ||
> type == PCI_EXP_TYPE_RC_EC ||
> type == PCI_EXP_TYPE_DOWNSTREAM ||
> - info->severity == AER_NONFATAL) {
> + info->severity == AER_NONFATAL ||
> + (pcie_is_cxl(dev) && type == PCI_EXP_TYPE_UPSTREAM)) {
>
> /* Link is still healthy for IO reads */
> pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,
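
As a reading aid, the extended condition in the hunk above can be modelled as a standalone predicate. The enums and model_reads_uncor_status() are hypothetical names. The early return for correctable errors is an assumption about the branch that precedes the quoted 'else if', which the hunk does not show.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the PCI_EXP_TYPE_* port types. */
enum model_port_type {
	MODEL_ENDPOINT,
	MODEL_ROOT_PORT,
	MODEL_RC_EC,
	MODEL_UPSTREAM,
	MODEL_DOWNSTREAM,
};

/* Hypothetical stand-ins for the AER severities. */
enum model_severity {
	MODEL_CORRECTABLE,
	MODEL_NONFATAL,
	MODEL_FATAL,
};

/* True when the uncorrectable status register would be read. */
static bool model_reads_uncor_status(enum model_port_type type,
				     enum model_severity severity,
				     bool is_cxl)
{
	/* Assumption: correctable errors take an earlier branch. */
	if (severity == MODEL_CORRECTABLE)
		return false;

	return type == MODEL_ROOT_PORT ||
	       type == MODEL_RC_EC ||
	       type == MODEL_DOWNSTREAM ||
	       severity == MODEL_NONFATAL ||
	       (is_cxl && type == MODEL_UPSTREAM);
}
```

The only case that changes relative to the old condition is a fatal error on a CXL Upstream Port. Non-CXL Upstream Ports and endpoints with fatal errors are still skipped.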
* Re: [PATCH v4 07/15] PCI/AER: Add CXL PCIe Port Uncorrectable Error recovery in AER service driver
2024-12-11 23:39 ` [PATCH v4 07/15] PCI/AER: Add CXL PCIe Port Uncorrectable Error recovery in AER service driver Terry Bowman
2024-12-12 9:28 ` Alejandro Lucero Palau
@ 2024-12-24 18:31 ` Jonathan Cameron
1 sibling, 0 replies; 45+ messages in thread
From: Jonathan Cameron @ 2024-12-24 18:31 UTC (permalink / raw)
To: Terry Bowman
Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
dave.jiang, alison.schofield, vishal.l.verma, dan.j.williams,
bhelgaas, mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On Wed, 11 Dec 2024 17:39:54 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> The existing recovery procedure for PCIe Uncorrectable Errors (UCE) does not
> apply to CXL devices. Recovery cannot be used for CXL devices because of
> potential corruption of what can be system memory. Also, current PCIe UCE
> recovery, in the case of a Root Port (RP) or Downstream Switch Port (DSP),
> does not begin at the RP/DSP but begins at the first downstream device.
> This will miss handling CXL Protocol Errors in a CXL RP or DSP. A separate
> CXL recovery is needed because of the different handling requirements.
>
> Add a new function, cxl_do_recovery() using the following.
>
> Add cxl_walk_bridge() to iterate the detected error's sub-topology.
> cxl_walk_bridge() is similar to pci_walk_bridge() but the CXL flavor
> will begin iteration at the RP or DSP rather than beginning at the
> first downstream device.
I'm still not keen on this, purely because the subtle difference makes
maintenance harder. Guess I should find the time to make
the change to the PCI walker and see if anyone shouts that it is
a problem.
From the other branch of the thread it sounds like you are reworking this patch
anyway, so I'll wait for the next version before reviewing closely.
Jonathan
>
> Add cxl_report_error_detected() as an analog to report_error_detected().
> It will call pci_driver::cxl_err_handlers for each iterated downstream
> device. The pci_driver::cxl_err_handler's UCE handler returns a boolean
> indicating if there was a UCE error detected during handling.
>
> cxl_do_recovery() uses the status from cxl_report_error_detected() to
> determine how to proceed. Non-fatal CXL UCE errors will be treated as
> fatal. If a UCE was present during handling then cxl_do_recovery()
> will kernel panic.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
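
The difference between the two walkers described in the commit message can be sketched on a toy topology. Everything here is hypothetical ('struct model_node', model_walk_below(), model_walk_from_bridge()) and deliberately simplified. In particular it ignores the special cases the real PCI walker has and only shows where the iteration starts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical topology node: first child plus next sibling. */
struct model_node {
	const char *name;
	struct model_node *child;   /* first device below this one */
	struct model_node *sibling; /* next device on the same bus */
};

typedef void (*model_cb)(struct model_node *node, void *data);

/* Start at the first downstream device: the bridge itself is skipped. */
static void model_walk_below(struct model_node *bridge, model_cb cb, void *data)
{
	assert(bridge != NULL && cb != NULL);
	for (struct model_node *n = bridge->child; n; n = n->sibling) {
		cb(n, data);
		model_walk_below(n, cb, data);
	}
}

/* CXL flavor as described: visit the RP/DSP first, then everything below. */
static void model_walk_from_bridge(struct model_node *bridge, model_cb cb,
				   void *data)
{
	assert(bridge != NULL && cb != NULL);
	cb(bridge, data);
	model_walk_below(bridge, cb, data);
}

/* Example callback: count visited devices. */
static void model_count(struct model_node *node, void *data)
{
	(void)node;
	(*(int *)data)++;
}
```

For a chain RP -> USP -> DSP -> EP, the first walker visits three devices and the second visits four. The extra visit is the Root Port, which is where a CXL protocol error in the RP itself would otherwise be missed.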
* Re: [PATCH v4 08/15] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers
2024-12-11 23:39 ` [PATCH v4 08/15] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers Terry Bowman
2024-12-12 10:36 ` Alejandro Lucero Palau
@ 2024-12-24 18:38 ` Jonathan Cameron
1 sibling, 0 replies; 45+ messages in thread
From: Jonathan Cameron @ 2024-12-24 18:38 UTC (permalink / raw)
To: Terry Bowman
Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
dave.jiang, alison.schofield, vishal.l.verma, dan.j.williams,
bhelgaas, mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On Wed, 11 Dec 2024 17:39:55 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> The CXL mem driver (cxl_mem) currently maps and caches a pointer to RAS
> registers for the endpoint's Root Port. The same needs to be done for
> each of the CXL Downstream Switch Ports and CXL Root Ports found between
> the endpoint and CXL Host Bridge.
>
> Introduce cxl_init_ep_ports_aer() to be called for each CXL Port in the
> sub-topology between the endpoint and the CXL Host Bridge. This function
> will determine if there are CXL Downstream Switch Ports or CXL Root Ports
> associated with this Port. The same check will be added in the future for
> upstream switch ports.
>
> Move the RAS register map logic from cxl_dport_map_ras() into
> cxl_dport_init_ras_reporting(). This eliminates the need for the helper
> function, cxl_dport_map_ras().
Could possibly do that as a precursor.
>
> cxl_init_ep_ports_aer() calls cxl_dport_init_ras_reporting() to map
> the RAS registers for CXL Downstream Switch Ports and CXL Root Ports.
>
> cxl_dport_init_ras_reporting() must check for previously mapped registers
> before mapping. This is necessary because endpoints under a CXL switch
> may share CXL Downstream Switch Ports or CXL Root Ports. Ensure the port
> registers are only mapped once.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
* Re: [PATCH v4 09/15] cxl/pci: Map CXL PCIe Upstream Switch Port RAS registers
2024-12-11 23:39 ` [PATCH v4 09/15] cxl/pci: Map CXL PCIe Upstream " Terry Bowman
@ 2024-12-24 18:41 ` Jonathan Cameron
0 siblings, 0 replies; 45+ messages in thread
From: Jonathan Cameron @ 2024-12-24 18:41 UTC (permalink / raw)
To: Terry Bowman
Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
dave.jiang, alison.schofield, vishal.l.verma, dan.j.williams,
bhelgaas, mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On Wed, 11 Dec 2024 17:39:56 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> Add logic to map CXL PCIe Upstream Switch Port (USP) RAS registers.
>
> Introduce 'struct cxl_regs' member into 'struct cxl_port' to cache a
> pointer to the CXL Upstream Port's mapped RAS registers.
>
> Also, introduce cxl_uport_init_ras_reporting() to perform the USP RAS
> register mapping. This is similar to the existing
> cxl_dport_init_ras_reporting() but for USP devices.
>
> The USP may have multiple downstream endpoints. Before mapping AER
> registers check if the registers are already mapped.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
* Re: [PATCH v4 10/15] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports
2024-12-11 23:39 ` [PATCH v4 10/15] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports Terry Bowman
2024-12-12 10:38 ` Alejandro Lucero Palau
@ 2024-12-24 18:42 ` Jonathan Cameron
1 sibling, 0 replies; 45+ messages in thread
From: Jonathan Cameron @ 2024-12-24 18:42 UTC (permalink / raw)
To: Terry Bowman
Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
dave.jiang, alison.schofield, vishal.l.verma, dan.j.williams,
bhelgaas, mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On Wed, 11 Dec 2024 17:39:57 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> CXL PCIe Port protocol error handling support will be added to the
> CXL drivers in the future. In preparation, rename the existing
> interfaces to support handling all CXL PCIe Port protocol errors.
>
> The driver's RAS support functions currently rely on a 'struct
> cxl_dev_state' type parameter, which is not available for CXL Port
> devices. However, since the same CXL RAS capability structure is
> needed across most CXL components and devices, a common handling
> approach should be adopted.
>
> To accommodate this, update the __cxl_handle_cor_ras() and
> __cxl_handle_ras() functions to use a `struct device` instead of
> `struct cxl_dev_state`.
>
> No functional changes are introduced.
>
> [1] CXL 3.1 Spec, 8.2.4 CXL.cache and CXL.mem Registers
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
* Re: [PATCH v4 12/15] cxl/pci: Add error handler for CXL PCIe Port RAS errors
2024-12-11 23:39 ` [PATCH v4 12/15] cxl/pci: Add error handler for CXL PCIe Port RAS errors Terry Bowman
2024-12-12 2:19 ` Li Ming
@ 2024-12-24 18:43 ` Jonathan Cameron
1 sibling, 0 replies; 45+ messages in thread
From: Jonathan Cameron @ 2024-12-24 18:43 UTC (permalink / raw)
To: Terry Bowman
Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
dave.jiang, alison.schofield, vishal.l.verma, dan.j.williams,
bhelgaas, mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On Wed, 11 Dec 2024 17:39:59 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> Introduce correctable and uncorrectable CXL PCIe port protocol error
> handlers.
>
> The handlers will be called with a 'struct pci_dev' parameter
> indicating the CXL Port device requiring handling. The CXL PCIe Port
> device's underlying 'struct device' will match the Port device in the
> CXL topology.
>
> Use the PCIe Port's device object to find the matching Upstream Switch
> Port, Downstream Switch Port, or Root Port in the CXL topology. The
> matching device will contain a reference to the RAS register block used to
> handle and log the error.
>
> Invoke the existing __cxl_handle_ras() or __cxl_handle_cor_ras() passing
> a reference to the RAS registers as a parameter. These functions will use
> the register reference to clear the device's RAS status.
>
> Future patches will assign the error handlers and add trace logging.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Other than Li Ming's question, LGTM
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
* Re: [PATCH v4 13/15] cxl/pci: Add trace logging for CXL PCIe Port RAS errors
2024-12-11 23:40 ` [PATCH v4 13/15] cxl/pci: Add trace logging " Terry Bowman
2024-12-12 9:46 ` Alejandro Lucero Palau
@ 2024-12-24 18:46 ` Jonathan Cameron
2024-12-26 17:01 ` Bowman, Terry
1 sibling, 1 reply; 45+ messages in thread
From: Jonathan Cameron @ 2024-12-24 18:46 UTC (permalink / raw)
To: Terry Bowman
Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
dave.jiang, alison.schofield, vishal.l.verma, dan.j.williams,
bhelgaas, mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On Wed, 11 Dec 2024 17:40:00 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> The CXL drivers use kernel trace functions for logging endpoint and RCH
> Downstream Port RAS errors. Similar functionality is required for CXL Root
> Ports, CXL Downstream Switch Ports, and CXL Upstream Switch Ports.
>
> Introduce trace logging functions for both RAS correctable and
> uncorrectable errors specific to CXL PCIe Ports. Additionally, update
> the PCIe Port error handlers to invoke these new trace functions.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Trivial comment inline.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
> drivers/cxl/core/pci.c | 16 ++++++++++----
> drivers/cxl/core/trace.h | 47 ++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 59 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 52afaedf5171..3294ad5ff28f 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -661,10 +661,14 @@ static void __cxl_handle_cor_ras(struct device *dev,
>
> addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
> status = readl(addr);
> - if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
> - writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> + if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
> + return;
I'd put a blank line here.
> + writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> +
> + if (is_cxl_memdev(dev))
> trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
> - }
> + else
> + trace_cxl_port_aer_correctable_error(dev, status);
> }
>
> static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
> @@ -720,7 +724,11 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
> }
>
> header_log_copy(ras_base, hl);
> - trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
> + if (is_cxl_memdev(dev))
> + trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
> + else
> + trace_cxl_port_aer_uncorrectable_error(dev, status, fe, hl);
> +
> writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>
> return true;
* Re: [PATCH v4 14/15] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
2024-12-11 23:40 ` [PATCH v4 14/15] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers Terry Bowman
2024-12-12 2:31 ` Li Ming
@ 2024-12-24 18:50 ` Jonathan Cameron
2024-12-26 17:07 ` Bowman, Terry
1 sibling, 1 reply; 45+ messages in thread
From: Jonathan Cameron @ 2024-12-24 18:50 UTC (permalink / raw)
To: Terry Bowman
Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
dave.jiang, alison.schofield, vishal.l.verma, dan.j.williams,
bhelgaas, mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On Wed, 11 Dec 2024 17:40:01 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> pci_driver::cxl_err_handlers are not currently assigned handler callbacks.
> The handlers can't be set in the pci_driver static definition because the
> CXL PCIe Port devices are bound to the portdrv driver which is not CXL
> driver aware.
>
> Add cxl_assign_port_error_handlers() in the cxl_core module. This
> function will assign the default handlers for a CXL PCIe Port device.
>
> When the CXL Port (cxl_port or cxl_dport) is destroyed the device's
> pci_driver::cxl_err_handlers must be set to NULL indicating they should no
> longer be used.
>
> Create cxl_clear_port_error_handlers() and register it to be called
> when the CXL Port device (cxl_port or cxl_dport) is destroyed.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
> drivers/cxl/core/pci.c | 40 ++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 40 insertions(+)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 3294ad5ff28f..9734a4c55b29 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -841,8 +841,38 @@ static bool cxl_port_error_detected(struct pci_dev *pdev)
> return __cxl_handle_ras(&pdev->dev, ras_base);
> }
>
> +static const struct cxl_error_handlers cxl_port_error_handlers = {
> + .error_detected = cxl_port_error_detected,
> + .cor_error_detected = cxl_port_cor_error_detected,
> +};
> +
> +static void cxl_assign_port_error_handlers(struct pci_dev *pdev)
> +{
> + struct pci_driver *pdrv;
> +
> + if (!pdev || !pdev->driver)
> + return;
> +
> + pdrv = pdev->driver;
What stops a race here? It's fiddly to remove that driver but
it can be done. At least I think we are messing with portdrv
but this is such a fiddly stack I'm not 100% sure.
> + pdrv->cxl_err_handler = &cxl_port_error_handlers;
> +}
> +
> +static void cxl_clear_port_error_handlers(void *data)
> +{
> + struct pci_dev *pdev = data;
> + struct pci_driver *pdrv;
> +
> + if (!pdev || !pdev->driver)
> + return;
> +
> + pdrv = pdev->driver;
Likewise. Smells like a possible race.
> + pdrv->cxl_err_handler = NULL;
> +}
> +
> void cxl_uport_init_ras_reporting(struct cxl_port *port)
> {
> + struct pci_dev *pdev = to_pci_dev(port->uport_dev);
> +
> /* uport may have more than 1 downstream EP. Check if already mapped. */
> if (port->uport_regs.ras)
> return;
> @@ -853,6 +883,9 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
> dev_err(&port->dev, "Failed to map RAS capability.\n");
> return;
> }
> +
> + cxl_assign_port_error_handlers(pdev);
> + devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
> }
> EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, CXL);
>
> @@ -864,6 +897,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
> {
> struct device *dport_dev = dport->dport_dev;
> struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
> + struct pci_dev *pdev = to_pci_dev(dport_dev);
>
> dport->reg_map.host = dport_dev;
> if (dport->rch && host_bridge->native_aer) {
> @@ -880,6 +914,12 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
> dev_err(dport_dev, "Failed to map RAS capability.\n");
> return;
> }
> +
> + if (dport->rch)
> + return;
> +
> + cxl_assign_port_error_handlers(pdev);
> + devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
>
* Re: [PATCH v4 15/15] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports
2024-12-11 23:40 ` [PATCH v4 15/15] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports Terry Bowman
2024-12-12 9:44 ` Alejandro Lucero Palau
@ 2024-12-24 18:53 ` Jonathan Cameron
2024-12-26 17:19 ` Bowman, Terry
1 sibling, 1 reply; 45+ messages in thread
From: Jonathan Cameron @ 2024-12-24 18:53 UTC (permalink / raw)
To: Terry Bowman
Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, ming4.li, dave,
dave.jiang, alison.schofield, vishal.l.verma, dan.j.williams,
bhelgaas, mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati
On Wed, 11 Dec 2024 17:40:02 -0600
Terry Bowman <terry.bowman@amd.com> wrote:
> The AER service driver enables PCIe Uncorrectable Internal Errors (UIE) and
> Correctable Internal Errors (CIE) for CXL Root Ports and CXL RCECs. The
> UIE and CIE are used in reporting CXL Protocol Errors. The same UIE/CIE
> enablement is needed for CXL PCIe Upstream and Downstream Ports in order to
> notify the associated Root Port and OS.[1]
>
> Export the AER service driver's pci_aer_unmask_internal_errors() function
> to CXL namespace.
>
> Remove the function's dependency on the CONFIG_PCIEAER_CXL kernel config
> because it is now an exported function.
>
> Call pci_aer_unmask_internal_errors() during RAS initialization in:
> cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting().
>
> [1] PCIe Base Spec r6.2-1.0, 6.2.3.2.2 Masking Individual Errors
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Whilst I'm in favor of just enabling this across all devices I guess
I can cope with this more minimal form and it will create fewer bug
reports :).
It is a little messy because we are tweaking it from the 'wrong' driver
but I guess that is fine.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
And on that note, happy Christmas / holidays etc all. My backlog
of review looks much less scary now but I need a beer!
Jonathan
> ---
> drivers/cxl/core/pci.c | 2 ++
> drivers/pci/pcie/aer.c | 5 +++--
> include/linux/aer.h | 1 +
> 3 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 9734a4c55b29..740ac5d8809f 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -886,6 +886,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
>
> cxl_assign_port_error_handlers(pdev);
> devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
> + pci_aer_unmask_internal_errors(pdev);
> }
> EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, CXL);
>
> @@ -920,6 +921,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>
> cxl_assign_port_error_handlers(pdev);
> devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
> + pci_aer_unmask_internal_errors(pdev);
> }
> EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 861521872318..0fa1b1ed48c9 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -949,7 +949,6 @@ static bool is_internal_error(struct aer_err_info *info)
> return info->status & PCI_ERR_UNC_INTN;
> }
>
> -#ifdef CONFIG_PCIEAER_CXL
> /**
> * pci_aer_unmask_internal_errors - unmask internal errors
> * @dev: pointer to the pcie_dev data structure
> @@ -960,7 +959,7 @@ static bool is_internal_error(struct aer_err_info *info)
> * Note: AER must be enabled and supported by the device which must be
> * checked in advance, e.g. with pcie_aer_is_native().
> */
> -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> +void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> {
> int aer = dev->aer_cap;
> u32 mask;
> @@ -973,7 +972,9 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> mask &= ~PCI_ERR_COR_INTERNAL;
> pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
> }
> +EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, CXL);
>
> +#ifdef CONFIG_PCIEAER_CXL
> static bool is_cxl_mem_dev(struct pci_dev *dev)
> {
> /*
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 4b97f38f3fcf..093293f9f12b 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -55,5 +55,6 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
> int cper_severity_to_aer(int cper_severity);
> void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
> int severity, struct aer_capability_regs *aer_regs);
> +void pci_aer_unmask_internal_errors(struct pci_dev *dev);
> #endif //_AER_H_
>
* Re: [PATCH v4 13/15] cxl/pci: Add trace logging for CXL PCIe Port RAS errors
2024-12-24 18:46 ` Jonathan Cameron
@ 2024-12-26 17:01 ` Bowman, Terry
0 siblings, 0 replies; 45+ messages in thread
From: Bowman, Terry @ 2024-12-26 17:01 UTC (permalink / raw)
To: Jonathan Cameron
Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati, Li Ming
On 12/24/2024 12:46 PM, Jonathan Cameron wrote:
> On Wed, 11 Dec 2024 17:40:00 -0600
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The CXL drivers use kernel trace functions for logging endpoint and RCH
>> Downstream Port RAS errors. Similar functionality is required for CXL Root
>> Ports, CXL Downstream Switch Ports, and CXL Upstream Switch Ports.
>>
>> Introduce trace logging functions for both RAS correctable and
>> uncorrectable errors specific to CXL PCIe Ports. Additionally, update
>> the PCIe Port error handlers to invoke these new trace functions.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Trivial comment inline.
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Thank you Jonathan.
>> ---
>> drivers/cxl/core/pci.c | 16 ++++++++++----
>> drivers/cxl/core/trace.h | 47 ++++++++++++++++++++++++++++++++++++++++
>> 2 files changed, 59 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 52afaedf5171..3294ad5ff28f 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -661,10 +661,14 @@ static void __cxl_handle_cor_ras(struct device *dev,
>>
>> addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>> status = readl(addr);
>> - if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
>> - writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>> + if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
>> + return;
> I'd put a blank line here.
Sure, I will add the blank line.
Regards,
Terry
>> + writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>> +
>> + if (is_cxl_memdev(dev))
>> trace_cxl_aer_correctable_error(to_cxl_memdev(dev), status);
>> - }
>> + else
>> + trace_cxl_port_aer_correctable_error(dev, status);
>> }
>>
>> static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
>> @@ -720,7 +724,11 @@ static bool __cxl_handle_ras(struct device *dev, void __iomem *ras_base)
>> }
>>
>> header_log_copy(ras_base, hl);
>> - trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
>> + if (is_cxl_memdev(dev))
>> + trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev), status, fe, hl);
>> + else
>> + trace_cxl_port_aer_uncorrectable_error(dev, status, fe, hl);
>> +
>> writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>>
>> return true;
>
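The correctable-error flow in the hunk above (read the status, return early when no relevant bit is set, clear the bits, then pick a trace function by device type) can be modeled in user space. This is a minimal sketch, not the patch's code: every `sketch_` name is hypothetical, the mask value is a stand-in for CXL_RAS_CORRECTABLE_STATUS_MASK, the register is assumed to be write-1-to-clear as the writel() of the status bits suggests, and the trace calls are replaced by counters.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for CXL_RAS_CORRECTABLE_STATUS_MASK. */
#define SKETCH_COR_STATUS_MASK 0x7fu

enum sketch_kind { SKETCH_MEMDEV, SKETCH_PORT };

struct sketch_dev {
	enum sketch_kind kind;
	uint32_t cor_status;        /* simulated status register */
	unsigned int memdev_traces; /* stands in for the endpoint trace call */
	unsigned int port_traces;   /* stands in for the port trace call */
};

/* Assumed write-1-to-clear behaviour: every bit set in val is cleared. */
static void sketch_write_status(struct sketch_dev *dev, uint32_t val)
{
	dev->cor_status &= ~val;
}

/* Returns true when an error was found, cleared, and traced. */
static bool sketch_handle_cor_ras(struct sketch_dev *dev)
{
	uint32_t status;

	assert(dev != NULL);

	status = dev->cor_status;
	if (!(status & SKETCH_COR_STATUS_MASK))
		return false;

	sketch_write_status(dev, status & SKETCH_COR_STATUS_MASK);

	/* Dispatch on device type, as is_cxl_memdev() does in the patch. */
	if (dev->kind == SKETCH_MEMDEV)
		dev->memdev_traces++;
	else
		dev->port_traces++;

	return true;
}
```

The early return keeps the clear and the trace at one indentation level, which is the restructuring the hunk makes.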
* Re: [PATCH v4 14/15] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
2024-12-24 18:50 ` Jonathan Cameron
@ 2024-12-26 17:07 ` Bowman, Terry
2025-01-07 11:32 ` Jonathan Cameron
0 siblings, 1 reply; 45+ messages in thread
From: Bowman, Terry @ 2024-12-26 17:07 UTC (permalink / raw)
To: Jonathan Cameron
Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati, Li Ming
On 12/24/2024 12:50 PM, Jonathan Cameron wrote:
> On Wed, 11 Dec 2024 17:40:01 -0600
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> pci_driver::cxl_err_handlers are not currently assigned handler callbacks.
>> The handlers can't be set in the pci_driver static definition because the
>> CXL PCIe Port devices are bound to the portdrv driver which is not CXL
>> driver aware.
>>
>> Add cxl_assign_port_error_handlers() in the cxl_core module. This
>> function will assign the default handlers for a CXL PCIe Port device.
>>
>> When the CXL Port (cxl_port or cxl_dport) is destroyed the device's
>> pci_driver::cxl_err_handlers must be set to NULL indicating they should no
>> longer be used.
>>
>> Create cxl_clear_port_error_handlers() and register it to be called
>> when the CXL Port device (cxl_port or cxl_dport) is destroyed.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>> drivers/cxl/core/pci.c | 40 ++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 40 insertions(+)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 3294ad5ff28f..9734a4c55b29 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -841,8 +841,38 @@ static bool cxl_port_error_detected(struct pci_dev *pdev)
>> return __cxl_handle_ras(&pdev->dev, ras_base);
>> }
>>
>> +static const struct cxl_error_handlers cxl_port_error_handlers = {
>> + .error_detected = cxl_port_error_detected,
>> + .cor_error_detected = cxl_port_cor_error_detected,
>> +};
>> +
>> +static void cxl_assign_port_error_handlers(struct pci_dev *pdev)
>> +{
>> + struct pci_driver *pdrv;
>> +
>> + if (!pdev || !pdev->driver)
>> + return;
>> +
>> + pdrv = pdev->driver;
> What stops a race here? It's fiddly to remove that driver but
> it can be done. At least I think we are messing with the portdrv
> but this is such a fiddly stack I'm not 100% sure.
>
>> + pdrv->cxl_err_handler = &cxl_port_error_handlers;
>> +}
>> +
>> +static void cxl_clear_port_error_handlers(void *data)
>> +{
>> + struct pci_dev *pdev = data;
>> + struct pci_driver *pdrv;
>> +
>> + if (!pdev || !pdev->driver)
>> + return;
>> +
>> + pdrv = pdev->driver;
> Likewise. Smells like a possible race.
>
>> + pdrv->cxl_err_handler = NULL;
>> +}
>> +
I can add a get_device()/put_device() for both cxl_clear_port_error_handlers() and
cxl_assign_port_error_handlers() to prevent operating on a recently destroyed pci_dev.
Is that sufficient?

Regards,
Terry
>> void cxl_uport_init_ras_reporting(struct cxl_port *port)
>> {
>> + struct pci_dev *pdev = to_pci_dev(port->uport_dev);
>> +
>> /* uport may have more than 1 downstream EP. Check if already mapped. */
>> if (port->uport_regs.ras)
>> return;
>> @@ -853,6 +883,9 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
>> dev_err(&port->dev, "Failed to map RAS capability.\n");
>> return;
>> }
>> +
>> + cxl_assign_port_error_handlers(pdev);
>> + devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
>> }
>> EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, CXL);
>>
>> @@ -864,6 +897,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>> {
>> struct device *dport_dev = dport->dport_dev;
>> struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
>> + struct pci_dev *pdev = to_pci_dev(dport_dev);
>>
>> dport->reg_map.host = dport_dev;
>> if (dport->rch && host_bridge->native_aer) {
>> @@ -880,6 +914,12 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>> dev_err(dport_dev, "Failed to map RAS capability.\n");
>> return;
>> }
>> +
>> + if (dport->rch)
>> + return;
>> +
>> + cxl_assign_port_error_handlers(pdev);
>> + devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
>> }
>> EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
>>
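The assign-then-register-cleanup pattern in the quoted patch can be modeled in user space as below. This is only a sketch under stated assumptions: all `sketch_` names are hypothetical, the fixed-size action list is a toy replacement for the kernel's devres machinery, and the one behaviour carried over from devm_add_action_or_reset() is that a failed registration runs the action immediately.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_handlers { int id; };  /* stands in for the callback table */
struct sketch_driver { const struct sketch_handlers *cxl_err_handler; };
struct sketch_pdev { struct sketch_driver *driver; };

static const struct sketch_handlers sketch_port_handlers = { .id = 1 };

static void sketch_assign_handlers(struct sketch_pdev *pdev)
{
	if (!pdev || !pdev->driver)
		return;
	pdev->driver->cxl_err_handler = &sketch_port_handlers;
}

static void sketch_clear_handlers(void *data)
{
	struct sketch_pdev *pdev = data;

	if (!pdev || !pdev->driver)
		return;
	pdev->driver->cxl_err_handler = NULL;
}

#define SKETCH_MAX_ACTIONS 4

struct sketch_action {
	void (*fn)(void *);
	void *data;
};

struct sketch_devres {
	struct sketch_action actions[SKETCH_MAX_ACTIONS];
	size_t count;
};

/* Queue fn(data) for teardown; if it cannot be queued, run it at once. */
static int sketch_add_action_or_reset(struct sketch_devres *dr,
				      void (*fn)(void *), void *data)
{
	assert(dr != NULL && fn != NULL);

	if (dr->count == SKETCH_MAX_ACTIONS) {
		fn(data);
		return -1;
	}
	dr->actions[dr->count].fn = fn;
	dr->actions[dr->count].data = data;
	dr->count++;
	return 0;
}

/* Teardown: run queued actions in reverse order of registration. */
static void sketch_release_all(struct sketch_devres *dr)
{
	while (dr->count > 0) {
		dr->count--;
		dr->actions[dr->count].fn(dr->actions[dr->count].data);
	}
}
```

In this model the handlers are never left assigned without a matching clear: either the clear is queued for teardown, or it runs right away when queuing fails.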
* Re: [PATCH v4 15/15] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports
2024-12-24 18:53 ` Jonathan Cameron
@ 2024-12-26 17:19 ` Bowman, Terry
0 siblings, 0 replies; 45+ messages in thread
From: Bowman, Terry @ 2024-12-26 17:19 UTC (permalink / raw)
To: Jonathan Cameron
Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati, Li Ming
On 12/24/2024 12:53 PM, Jonathan Cameron wrote:
> On Wed, 11 Dec 2024 17:40:02 -0600
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The AER service driver enables PCIe Uncorrectable Internal Errors (UIE) and
>> Correctable Internal Errors (CIE) for CXL Root Ports and CXL RCECs. The
>> UIE and CIE are used in reporting CXL Protocol Errors. The same UIE/CIE
>> enablement is needed for CXL PCIe Upstream and Downstream Ports in order to
>> notify the associated Root Port and OS.[1]
>>
>> Export the AER service driver's pci_aer_unmask_internal_errors() function
>> to CXL namespace.
>>
>> Remove the function's dependency on the CONFIG_PCIEAER_CXL kernel config
>> because it is now an exported function.
>>
>> Call pci_aer_unmask_internal_errors() during RAS initialization in:
>> cxl_uport_init_ras_reporting() and cxl_dport_init_ras_reporting().
>>
>> [1] PCIe Base Spec r6.2-1.0, 6.2.3.2.2 Masking Individual Errors
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Whilst I'm in favor of just enabling this across all devices I guess
> I can cope with this more minimal form and it will create fewer bug
> reports :).
> It is a little messy because we are tweaking it from the 'wrong' driver
> but I guess that is fine.
>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>
> And on that note, happy Christmas / holidays etc all. My backlog
> of review looks much less scary now but I need a beer!
>
> Jonathan
>
I'm not aware of a better way to enable the interrupts in this case, but I am open
to anyone's ideas for improvement.
Happy holidays everyone!
Regards,
Terry
>
>> ---
>> drivers/cxl/core/pci.c | 2 ++
>> drivers/pci/pcie/aer.c | 5 +++--
>> include/linux/aer.h | 1 +
>> 3 files changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 9734a4c55b29..740ac5d8809f 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -886,6 +886,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
>>
>> cxl_assign_port_error_handlers(pdev);
>> devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
>> + pci_aer_unmask_internal_errors(pdev);
>> }
>> EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, CXL);
>>
>> @@ -920,6 +921,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
>>
>> cxl_assign_port_error_handlers(pdev);
>> devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
>> + pci_aer_unmask_internal_errors(pdev);
>> }
>> EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 861521872318..0fa1b1ed48c9 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -949,7 +949,6 @@ static bool is_internal_error(struct aer_err_info *info)
>> return info->status & PCI_ERR_UNC_INTN;
>> }
>>
>> -#ifdef CONFIG_PCIEAER_CXL
>> /**
>> * pci_aer_unmask_internal_errors - unmask internal errors
>> * @dev: pointer to the pcie_dev data structure
>> @@ -960,7 +959,7 @@ static bool is_internal_error(struct aer_err_info *info)
>> * Note: AER must be enabled and supported by the device which must be
>> * checked in advance, e.g. with pcie_aer_is_native().
>> */
>> -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>> +void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>> {
>> int aer = dev->aer_cap;
>> u32 mask;
>> @@ -973,7 +972,9 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>> mask &= ~PCI_ERR_COR_INTERNAL;
>> pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
>> }
>> +EXPORT_SYMBOL_NS_GPL(pci_aer_unmask_internal_errors, CXL);
>>
>> +#ifdef CONFIG_PCIEAER_CXL
>> static bool is_cxl_mem_dev(struct pci_dev *dev)
>> {
>> /*
>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>> index 4b97f38f3fcf..093293f9f12b 100644
>> --- a/include/linux/aer.h
>> +++ b/include/linux/aer.h
>> @@ -55,5 +55,6 @@ void pci_print_aer(struct pci_dev *dev, int aer_severity,
>> int cper_severity_to_aer(int cper_severity);
>> void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
>> int severity, struct aer_capability_regs *aer_regs);
>> +void pci_aer_unmask_internal_errors(struct pci_dev *dev);
>> #endif //_AER_H_
>>
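The unmasking done by pci_aer_unmask_internal_errors() is a read-modify-write of two AER mask registers. The user-space sketch below models it against a simulated config space. The offsets and bit values are the ones in include/uapi/linux/pci_regs.h; the `sketch_` helpers are hypothetical, and the uncorrectable half, which the hunk does not show, is assumed to mirror the correctable half.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offsets within the AER capability and the internal-error bits. */
#define PCI_ERR_UNCOR_MASK   0x08
#define PCI_ERR_UNC_INTN     0x00400000
#define PCI_ERR_COR_MASK     0x14
#define PCI_ERR_COR_INTERNAL 0x00004000

#define SKETCH_CFG_DWORDS 1024 /* 4 KiB of simulated config space */

struct sketch_cfg { uint32_t dw[SKETCH_CFG_DWORDS]; };

static uint32_t sketch_read_dword(const struct sketch_cfg *cfg, int off)
{
	return cfg->dw[off / 4];
}

static void sketch_write_dword(struct sketch_cfg *cfg, int off, uint32_t val)
{
	cfg->dw[off / 4] = val;
}

/* aer is the byte offset of the AER extended capability. */
static void sketch_unmask_internal_errors(struct sketch_cfg *cfg, int aer)
{
	uint32_t mask;

	assert(cfg != NULL);
	assert(aer >= 0x100 && aer % 4 == 0);
	assert(aer + PCI_ERR_COR_MASK < SKETCH_CFG_DWORDS * 4);

	/* A cleared mask bit means the error is reported. */
	mask = sketch_read_dword(cfg, aer + PCI_ERR_UNCOR_MASK);
	mask &= ~PCI_ERR_UNC_INTN;
	sketch_write_dword(cfg, aer + PCI_ERR_UNCOR_MASK, mask);

	mask = sketch_read_dword(cfg, aer + PCI_ERR_COR_MASK);
	mask &= ~PCI_ERR_COR_INTERNAL;
	sketch_write_dword(cfg, aer + PCI_ERR_COR_MASK, mask);
}
```

Only the two internal-error bits change; every other mask bit keeps the value it had before the call.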
* Re: [PATCH v4 14/15] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers
2024-12-26 17:07 ` Bowman, Terry
@ 2025-01-07 11:32 ` Jonathan Cameron
0 siblings, 0 replies; 45+ messages in thread
From: Jonathan Cameron @ 2025-01-07 11:32 UTC (permalink / raw)
To: Bowman, Terry
Cc: linux-cxl, linux-kernel, linux-pci, nifan.cxl, dave, dave.jiang,
alison.schofield, vishal.l.verma, dan.j.williams, bhelgaas,
mahesh, ira.weiny, oohall, Benjamin.Cheatham, rrichter,
nathan.fontenot, Smita.KoralahalliChannabasappa, lukas,
PradeepVineshReddy.Kodamati, Li Ming
On Thu, 26 Dec 2024 11:07:13 -0600
"Bowman, Terry" <terry.bowman@amd.com> wrote:
> On 12/24/2024 12:50 PM, Jonathan Cameron wrote:
> > On Wed, 11 Dec 2024 17:40:01 -0600
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >
> >> pci_driver::cxl_err_handlers are not currently assigned handler callbacks.
> >> The handlers can't be set in the pci_driver static definition because the
> >> CXL PCIe Port devices are bound to the portdrv driver which is not CXL
> >> driver aware.
> >>
> >> Add cxl_assign_port_error_handlers() in the cxl_core module. This
> >> function will assign the default handlers for a CXL PCIe Port device.
> >>
> >> When the CXL Port (cxl_port or cxl_dport) is destroyed the device's
> >> pci_driver::cxl_err_handlers must be set to NULL indicating they should no
> >> longer be used.
> >>
> >> Create cxl_clear_port_error_handlers() and register it to be called
> >> when the CXL Port device (cxl_port or cxl_dport) is destroyed.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> ---
> >> drivers/cxl/core/pci.c | 40 ++++++++++++++++++++++++++++++++++++++++
> >> 1 file changed, 40 insertions(+)
> >>
> >> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> >> index 3294ad5ff28f..9734a4c55b29 100644
> >> --- a/drivers/cxl/core/pci.c
> >> +++ b/drivers/cxl/core/pci.c
> >> @@ -841,8 +841,38 @@ static bool cxl_port_error_detected(struct pci_dev *pdev)
> >> return __cxl_handle_ras(&pdev->dev, ras_base);
> >> }
> >>
> >> +static const struct cxl_error_handlers cxl_port_error_handlers = {
> >> + .error_detected = cxl_port_error_detected,
> >> + .cor_error_detected = cxl_port_cor_error_detected,
> >> +};
> >> +
> >> +static void cxl_assign_port_error_handlers(struct pci_dev *pdev)
> >> +{
> >> + struct pci_driver *pdrv;
> >> +
> >> + if (!pdev || !pdev->driver)
> >> + return;
> >> +
> >> + pdrv = pdev->driver;
> > What stops a race here? It's fiddly to remove that driver but
> > it can be done. At least I think we are messing with the portdrv
> > but this is such a fiddly stack I'm not 100% sure.
> >
> >> + pdrv->cxl_err_handler = &cxl_port_error_handlers;
> >> +}
> >> +
> >> +static void cxl_clear_port_error_handlers(void *data)
> >> +{
> >> + struct pci_dev *pdev = data;
> >> + struct pci_driver *pdrv;
> >> +
> >> + if (!pdev || !pdev->driver)
> >> + return;
> >> +
> >> + pdrv = pdev->driver;
> > Likewise. Smells like a possible race.
> >
> >> + pdrv->cxl_err_handler = NULL;
> >> +}
> >> +
>
> I can add a get_device()/put_device() for both cxl_clear_port_error_handlers() and
> cxl_assign_port_error_handlers() to prevent operating on a recently destroyed pci_dev.
> Is that sufficient?
>
> Regards,
> Terry
Probably (by which I mean I think it is, but haven't checked in detail)
Jonathan
> >> void cxl_uport_init_ras_reporting(struct cxl_port *port)
> >> {
> >> + struct pci_dev *pdev = to_pci_dev(port->uport_dev);
> >> +
> >> /* uport may have more than 1 downstream EP. Check if already mapped. */
> >> if (port->uport_regs.ras)
> >> return;
> >> @@ -853,6 +883,9 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
> >> dev_err(&port->dev, "Failed to map RAS capability.\n");
> >> return;
> >> }
> >> +
> >> + cxl_assign_port_error_handlers(pdev);
> >> + devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
> >> }
> >> EXPORT_SYMBOL_NS_GPL(cxl_uport_init_ras_reporting, CXL);
> >>
> >> @@ -864,6 +897,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
> >> {
> >> struct device *dport_dev = dport->dport_dev;
> >> struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
> >> + struct pci_dev *pdev = to_pci_dev(dport_dev);
> >>
> >> dport->reg_map.host = dport_dev;
> >> if (dport->rch && host_bridge->native_aer) {
> >> @@ -880,6 +914,12 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
> >> dev_err(dport_dev, "Failed to map RAS capability.\n");
> >> return;
> >> }
> >> +
> >> + if (dport->rch)
> >> + return;
> >> +
> >> + cxl_assign_port_error_handlers(pdev);
> >> + devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
> >> }
> >> EXPORT_SYMBOL_NS_GPL(cxl_dport_init_ras_reporting, CXL);
> >>
>
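One possible reading of the get_device()/put_device() proposal is that assigning the handlers takes a reference on the device and clearing them drops it, so the device object cannot be freed in between. The user-space sketch below models only that reference counting. All `sketch_` names are hypothetical, a plain int replaces the kernel's reference count, and the sketch does not show whether holding a reference also covers the driver-unbind window raised in the review, which the thread leaves open.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_handlers { int id; };
struct sketch_driver { const struct sketch_handlers *cxl_err_handler; };

struct sketch_pdev {
	int refcount;
	bool released; /* set when the last reference is dropped */
	struct sketch_driver *driver;
};

static const struct sketch_handlers sketch_port_handlers = { .id = 1 };

static struct sketch_pdev *sketch_get(struct sketch_pdev *pdev)
{
	if (pdev)
		pdev->refcount++;
	return pdev;
}

static void sketch_put(struct sketch_pdev *pdev)
{
	if (!pdev)
		return;
	assert(pdev->refcount > 0);
	if (--pdev->refcount == 0)
		pdev->released = true; /* real code would free the object */
}

/* Returns true when handlers were assigned and a reference was taken. */
static bool sketch_assign_handlers(struct sketch_pdev *pdev)
{
	if (!pdev || !pdev->driver)
		return false;

	sketch_get(pdev); /* held until sketch_clear_handlers() runs */
	pdev->driver->cxl_err_handler = &sketch_port_handlers;
	return true;
}

/* Must only be registered after sketch_assign_handlers() returned true. */
static void sketch_clear_handlers(void *data)
{
	struct sketch_pdev *pdev = data;

	if (!pdev)
		return;
	if (pdev->driver)
		pdev->driver->cxl_err_handler = NULL;
	sketch_put(pdev); /* balances the reference taken at assignment */
}
```

In this model the original owner can drop its reference early and the object still stays valid until the cleanup action has run.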
Thread overview: 45+ messages
2024-12-11 23:39 [PATCH v4 0/15] Enable CXL PCIe Port protocol error handling and logging Terry Bowman
2024-12-11 23:39 ` [PATCH v4 01/15] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver' Terry Bowman
2024-12-11 23:39 ` [PATCH v4 02/15] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe Port support Terry Bowman
2024-12-11 23:39 ` [PATCH v4 03/15] cxl/pci: Introduce PCIe helper functions pcie_is_cxl() and pcie_is_cxl_port() Terry Bowman
2024-12-11 23:39 ` [PATCH v4 04/15] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
2024-12-12 1:34 ` Li Ming
2024-12-12 19:59 ` Bowman, Terry
2024-12-14 13:34 ` Li Ming
2024-12-11 23:39 ` [PATCH v4 05/15] PCI/AER: Add CXL PCIe Port correctable error support in AER service driver Terry Bowman
2024-12-11 23:39 ` [PATCH v4 06/15] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe Port devices Terry Bowman
2024-12-24 18:28 ` Jonathan Cameron
2024-12-11 23:39 ` [PATCH v4 07/15] PCI/AER: Add CXL PCIe Port Uncorrectable Error recovery in AER service driver Terry Bowman
2024-12-12 9:28 ` Alejandro Lucero Palau
2024-12-13 15:07 ` Bowman, Terry
2024-12-24 18:31 ` Jonathan Cameron
2024-12-11 23:39 ` [PATCH v4 08/15] cxl/pci: Map CXL PCIe Root Port and Downstream Switch Port RAS registers Terry Bowman
2024-12-12 10:36 ` Alejandro Lucero Palau
2024-12-13 15:10 ` Bowman, Terry
2024-12-24 18:38 ` Jonathan Cameron
2024-12-11 23:39 ` [PATCH v4 09/15] cxl/pci: Map CXL PCIe Upstream " Terry Bowman
2024-12-24 18:41 ` Jonathan Cameron
2024-12-11 23:39 ` [PATCH v4 10/15] cxl/pci: Update RAS handler interfaces to also support CXL PCIe Ports Terry Bowman
2024-12-12 10:38 ` Alejandro Lucero Palau
2024-12-24 18:42 ` Jonathan Cameron
2024-12-11 23:39 ` [PATCH v4 11/15] cxl/pci: Change find_cxl_port() to non-static Terry Bowman
2024-12-11 23:39 ` [PATCH v4 12/15] cxl/pci: Add error handler for CXL PCIe Port RAS errors Terry Bowman
2024-12-12 2:19 ` Li Ming
2024-12-24 18:43 ` Jonathan Cameron
2024-12-11 23:40 ` [PATCH v4 13/15] cxl/pci: Add trace logging " Terry Bowman
2024-12-12 9:46 ` Alejandro Lucero Palau
2024-12-24 18:46 ` Jonathan Cameron
2024-12-26 17:01 ` Bowman, Terry
2024-12-11 23:40 ` [PATCH v4 14/15] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers Terry Bowman
2024-12-12 2:31 ` Li Ming
2024-12-17 14:39 ` Bowman, Terry
2024-12-24 18:50 ` Jonathan Cameron
2024-12-26 17:07 ` Bowman, Terry
2025-01-07 11:32 ` Jonathan Cameron
2024-12-11 23:40 ` [PATCH v4 15/15] PCI/AER: Enable internal errors for CXL Upstream and Downstream Switch Ports Terry Bowman
2024-12-12 9:44 ` Alejandro Lucero Palau
2024-12-12 10:44 ` Alejandro Lucero Palau
2024-12-13 15:22 ` Bowman, Terry
2024-12-13 15:34 ` Bowman, Terry
2024-12-24 18:53 ` Jonathan Cameron
2024-12-26 17:19 ` Bowman, Terry