From: Fan Ni <nifan.cxl@gmail.com>
To: Terry Bowman <terry.bowman@amd.com>
Cc: ming4.li@intel.com, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
dave@stgolabs.net, jonathan.cameron@huawei.com,
dave.jiang@intel.com, alison.schofield@intel.com,
vishal.l.verma@intel.com, dan.j.williams@intel.com,
bhelgaas@google.com, mahesh@linux.ibm.com, ira.weiny@intel.com,
oohall@gmail.com, Benjamin.Cheatham@amd.com, rrichter@amd.com,
nathan.fontenot@amd.com, Smita.KoralahalliChannabasappa@amd.com
Subject: Re: [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging
Date: Fri, 1 Nov 2024 11:00:14 -0700 [thread overview]
Message-ID: <ZyUXLpQBBgTl733z@fan> (raw)
In-Reply-To: <20241025210305.27499-1-terry.bowman@amd.com>
On Fri, Oct 25, 2024 at 04:02:51PM -0500, Terry Bowman wrote:
> This is a continuation of the CXL port error handling RFC from earlier.[1]
> The RFC resulted in the decision to add CXL PCIe port error handling to
> the existing RCH downstream port handling in the AER service driver. This
> patchset adds the CXL PCIe port protocol error handling and logging.
>
> The first 7 patches update the existing AER service driver to support CXL
> PCIe port protocol error handling and reporting. This includes AER service
> driver changes for adding correctable and uncorrectable error support, CXL
> specific recovery handling, and addition of CXL driver callback handlers.
>
> The following 7 patches address CXL driver support for CXL PCIe port
> protocol errors. This includes the following changes to the CXL drivers:
> mapping CXL port and downstream port RAS registers, interface updates for
> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
> adding port specific error handlers, and protocol error logging.
>
> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
>
> Testing:
Hi Terry,
I tried to test the patchset with aer_inject tool (with the patch you shared
in the last version), and hit some issues.
Could you help check and give some insights? Thanks.
Below are some test setup info and results.
I tested two topology,
a. one memdev directly attaced to a HB with only one RP;
b. a topology with cxl switch:
HB
/ \
RP0 RP1
|
switch
|
----------------
| | | |
mem0 mem1 mem2 mem3
For both topologies, I cannot reproduce the system panic shown in your cover
letter.
btw, I tried both compile cxl as modules and in the kernel.
Below, I will use the direct-attached topology (a) as an example to show what I
tried, hope can get some clarity about the test and what I missed or did wrong.
-------------------------------------
pci device info on the test VM
root@fan:~# lspci
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
0c:00.0 PCI bridge: Intel Corporation Device 7075
0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
root@fan:~#
-------------------------------------
The aer injection input file looks like below,
-------------------------------------
fan:~/cxl/cxl-test-tool$ cat /tmp/internal
AER
PCI_ID 0000:0c:00.0
UNCOR_STATUS INTERNAL
HEADER_LOG 0 1 2 3
------------------------------------
dmesg after aer injection
ssh root@localhost -p 2024 "dmesg"
[ 613.195352] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
[ 613.195830] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
[ 613.196253] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
[ 613.198199] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
--------------------------------------
The problem seems to be related to the cxl_error_handler not been assigned for
cxlmem device.
in
cxl_do_recover() {
...
327 cxl_walk_bridge(bridge, cxl_report_error_detected, &status);
328 if (status)
329 panic("CXL cachemem error. Invoking panic");
...
}
The status returned is false, so no panic().
I tried to add some dev_dbg info to the code to debug.
Below are the debug info and kernel code changes for debugging.
--------------------------------------
fan:~/cxl/cxl-test-tool$ cxl-tool.py --cmd dmesg | grep XXX
[ 1.738909] cxl_mem:cxl_mem_probe:205: cxl_mem mem0: XXX: add endpoint
[ 1.739188] cxl_mem:devm_cxl_add_endpoint:85: cxl_port port1: XXX: add endpoint
[ 1.739509] cxl_mem:devm_cxl_add_endpoint:92: cxl_mem mem0: XXX: init ep port aer
[ 1.739876] cxl_core:cxl_dport_init_ras_reporting:907: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 1
[ 1.740338] cxl_core:cxl_dport_init_ras_reporting:913: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 2
[ 1.740812] cxl_core:cxl_dport_init_ras_reporting:927: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 3
[ 1.741273] cxl_core:cxl_assign_port_error_handlers:851: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____)
[ 1.741812] cxl_core:cxl_assign_port_error_handlers:855: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____)
[ 1.742263] cxl_core:cxl_assign_port_error_handlers:857: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____) (____ptrval____)
fan:~/cxl/cxl-test-tool$
--------------------------------------
dmesg after error injection:
--------------------------------------
ssh root@localhost -p 2024 "dmesg"
[ 228.544439] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
[ 228.544977] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
[ 228.545381] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
[ 228.545879] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/00000000
[ 228.546360] pcieport 0000:0c:00.0: [22] UncorrIntErr
[ 228.546698] pcieport 0000:0c:00.0: AER: XXX: call cxl_err_handler: 00000000a268bfcb 000000009e0da039
[ 228.547103] cxl_pci 0000:0d:00.0: AER: XXX: call cxl_err_handler: 00000000b9f08b93 0000000000000000
[ 228.547515] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
fan:~/cxl/cxl-test-tool$
--------------------------------------
Kernel changes:
--------------------------------------
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 5f7570c6173c..bcecd1283fc6 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -848,10 +848,13 @@ static void cxl_assign_port_error_handlers(struct pci_dev *pdev)
{
struct pci_driver *pdrv = pdev->driver;
+ dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p\n enter", pdev);
if (!pdrv)
return;
+ dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p\n", pdrv);
pdrv->cxl_err_handler = &cxl_port_error_handlers;
+ dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p %p\n", pdrv, pdrv->cxl_err_handler);
}
static void cxl_clear_port_error_handlers(void *data)
@@ -869,12 +872,14 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
{
struct pci_dev *pdev = to_pci_dev(port->uport_dev);
+ dev_dbg(&port->dev, "XXX: assign port error handlers for uport 1\n");
/* uport may have more than 1 downstream EP. Check if already mapped. */
if (port->uport_regs.ras) {
dev_warn(&port->dev, "RAS is already mapped\n");
return;
}
+ dev_dbg(&port->dev, "XXX: assign port error handlers for uport 2\n");
port->reg_map.host = &port->dev;
if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
BIT(CXL_CM_CAP_CAP_ID_RAS))) {
@@ -882,6 +887,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
return;
}
+ dev_dbg(&port->dev, "XXX: assign port error handlers for uport 3\n");
cxl_assign_port_error_handlers(pdev);
devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
}
@@ -898,11 +904,13 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
struct pci_dev *pdev = to_pci_dev(dport_dev);
+ dev_dbg(dport_dev, "XXX: assign port error handlers for dport 1\n");
if (dport->rch && host_bridge->native_aer) {
cxl_dport_map_rch_aer(dport);
cxl_disable_rch_root_ints(dport);
}
+ dev_dbg(dport_dev, "XXX: assign port error handlers for dport 2\n");
/* dport may have more than 1 downstream EP. Check if already mapped. */
if (dport->regs.ras) {
dev_warn(dport_dev, "RAS is already mapped\n");
@@ -916,6 +924,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
return;
}
+ dev_dbg(dport_dev, "XXX: assign port error handlers for dport 3\n");
cxl_assign_port_error_handlers(pdev);
devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
}
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 067fd6389562..aa824584f8dd 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -82,13 +82,15 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
* Now that the path to the root is established record all the
* intervening ports in the chain.
*/
+ dev_dbg(host, "XXX: add endpoint\n");
for (iter = parent_port, down = NULL; !is_cxl_root(iter);
down = iter, iter = to_cxl_port(iter->dev.parent)) {
struct cxl_ep *ep;
ep = cxl_ep_load(iter, cxlmd);
ep->next = down;
- cxl_init_ep_ports_aer(ep);
+ dev_dbg(ep->ep, "XXX: init ep port aer\n");
+ cxl_init_ep_ports_aer(ep);
}
/* Note: endpoint port component registers are derived from @cxlds */
@@ -200,6 +202,7 @@ static int cxl_mem_probe(struct device *dev)
return -ENXIO;
}
+ dev_dbg(dev, "XXX: add endpoint\n");
rc = devm_cxl_add_endpoint(endpoint_parent, cxlmd, dport);
if (rc)
return rc;
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 3785f4ca5103..8285f14994e8 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -294,6 +294,11 @@ static int cxl_report_error_detected(struct pci_dev *dev, void *data)
bool *status = data;
device_lock(&dev->dev);
+ if (pdrv) {
+ dev_dbg(&dev->dev, "XXX: call cxl_err_handler: %p %p\n", pdrv, pdrv->cxl_err_handler);
+ } else {
+ dev_dbg(&dev->dev, "XXX: call cxl_err_handler: %p no handler\n", pdrv);
+ }
if (pdrv && pdrv->cxl_err_handler &&
pdrv->cxl_err_handler->error_detected) {
const struct cxl_error_handlers *cxl_err_handler =
--------------------------------------
Fan
>
> Below are test results for this patchset using Qemu with CXL root
> port(0c:00.0), CXL upstream switchport(0d:00.0), CXL downstream
> switchport(0e:00.0). A CXL endpoint(0f:00.0) CE and UCE logs are
> also added to show the existing PCIe endpoint handling is not changed.
>
> This was tested using aer-inject updated to support CE and UCE internal
> error injection. CXL RAS was set using a test patch (not upstreamed but can
> provide if needed).
>
> - Root port UCE:
> root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
> pcieport 0000:0c:00.0: [22] UncorrIntErr
> aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
> cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
> Kernel panic - not syncing: CXL cachemem error. Invoking panic
> CPU: 1 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G E 6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
> Tainted: [E]=UNSIGNED_MODULE
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> Call Trace:
> <TASK>
> dump_stack_lvl+0x27/0x90
> dump_stack+0x10/0x20
> panic+0x33e/0x380
> cxl_do_recovery+0x116/0x120
> ? srso_return_thunk+0x5/0x5f
> aer_isr+0x3e0/0x710
> irq_thread_fn+0x28/0x70
> irq_thread+0x179/0x240
> ? srso_return_thunk+0x5/0x5f
> ? __pfx_irq_thread_fn+0x10/0x10
> ? __pfx_irq_thread_dtor+0x10/0x10
> ? __pfx_irq_thread+0x10/0x10
> kthread+0xf5/0x130
> ? __pfx_kthread+0x10/0x10
> ret_from_fork+0x3c/0x60
> ? __pfx_kthread+0x10/0x10
> ret_from_fork_asm+0x1a/0x30
> </TASK>
> Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
...
--
Fan Ni
next prev parent reply other threads:[~2024-11-01 18:00 UTC|newest]
Thread overview: 55+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-25 21:02 [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Terry Bowman
2024-10-25 21:02 ` [PATCH v2 01/14] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver' Terry Bowman
2024-10-30 15:14 ` Jonathan Cameron
2024-10-30 15:15 ` Bowman, Terry
2024-10-31 16:20 ` Dave Jiang
2024-10-31 20:24 ` Fan Ni
2024-10-25 21:02 ` [PATCH v2 02/14] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port support Terry Bowman
2024-10-30 15:13 ` Jonathan Cameron
2024-10-31 16:21 ` Dave Jiang
2024-10-31 20:25 ` Fan Ni
2024-10-25 21:02 ` [PATCH v2 03/14] cxl/pci: Introduce helper functions pcie_is_cxl() and pcie_is_cxl_port() Terry Bowman
2024-10-30 14:57 ` Jonathan Cameron
2024-10-31 16:25 ` Dave Jiang
2024-10-31 21:22 ` Fan Ni
2024-10-25 21:02 ` [PATCH v2 04/14] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
2024-10-30 14:56 ` Jonathan Cameron
2024-10-31 16:27 ` Dave Jiang
2024-10-31 21:27 ` Fan Ni
2024-10-25 21:02 ` [PATCH v2 05/14] PCI/AER: Add CXL PCIe port correctable error support in AER service driver Terry Bowman
2024-10-30 15:13 ` Jonathan Cameron
2024-10-30 15:51 ` Bowman, Terry
2024-11-04 21:50 ` Dan Williams
2024-11-04 22:05 ` Bowman, Terry
2024-10-31 16:37 ` Dave Jiang
2024-10-25 21:02 ` [PATCH v2 06/14] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe port devices Terry Bowman
2024-10-30 15:37 ` Jonathan Cameron
2024-10-31 16:58 ` Dave Jiang
2024-11-01 13:30 ` Bowman, Terry
2024-10-25 21:02 ` [PATCH v2 07/14] PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service driver Terry Bowman
2024-10-30 15:42 ` Jonathan Cameron
2024-10-25 21:02 ` [PATCH v2 08/14] cxl/pci: Change find_cxl_ports() to non-static Terry Bowman
2024-10-30 15:45 ` Jonathan Cameron
2024-10-30 15:54 ` Bowman, Terry
2024-10-25 21:03 ` [PATCH v2 09/14] cxl/pci: Map CXL PCIe root port and downstream switch port RAS registers Terry Bowman
2024-10-30 15:55 ` Jonathan Cameron
2024-10-25 21:03 ` [PATCH v2 10/14] cxl/pci: Map CXL PCIe upstream " Terry Bowman
2024-10-30 15:56 ` Jonathan Cameron
2024-10-25 21:03 ` [PATCH v2 11/14] cxl/pci: Rename RAS handler interfaces to also indicate CXL PCIe port support Terry Bowman
2024-10-30 15:59 ` Jonathan Cameron
2024-10-25 21:03 ` [PATCH v2 12/14] cxl/pci: Add error handler for CXL PCIe port RAS errors Terry Bowman
2024-10-30 16:03 ` Jonathan Cameron
2024-10-25 21:03 ` [PATCH v2 13/14] cxl/pci: Add trace logging " Terry Bowman
2024-10-30 16:07 ` Jonathan Cameron
2024-10-30 21:30 ` Bowman, Terry
2024-10-25 21:03 ` [PATCH v2 14/14] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers Terry Bowman
2024-10-30 16:11 ` Jonathan Cameron
2024-10-30 21:34 ` Bowman, Terry
2024-10-27 16:59 ` [PATCH v2 0/14] Applies to Base commit: 8cf0b93919e1 (tag: v6.12-rc2) Linux 6.12-rc2 Bowman, Terry
2024-10-28 1:05 ` [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Bowman, Terry
2024-11-01 18:00 ` Fan Ni [this message]
2024-11-01 18:28 ` Bowman, Terry
2024-11-01 19:11 ` Fan Ni
2024-11-01 22:11 ` Fan Ni
2024-11-04 21:25 ` Bowman, Terry
2024-11-04 21:48 ` Fan Ni
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZyUXLpQBBgTl733z@fan \
--to=nifan.cxl@gmail.com \
--cc=Benjamin.Cheatham@amd.com \
--cc=Smita.KoralahalliChannabasappa@amd.com \
--cc=alison.schofield@intel.com \
--cc=bhelgaas@google.com \
--cc=dan.j.williams@intel.com \
--cc=dave.jiang@intel.com \
--cc=dave@stgolabs.net \
--cc=ira.weiny@intel.com \
--cc=jonathan.cameron@huawei.com \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=mahesh@linux.ibm.com \
--cc=ming4.li@intel.com \
--cc=nathan.fontenot@amd.com \
--cc=oohall@gmail.com \
--cc=rrichter@amd.com \
--cc=terry.bowman@amd.com \
--cc=vishal.l.verma@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox