From: "Bowman, Terry" <terry.bowman@amd.com>
To: Fan Ni <nifan.cxl@gmail.com>
Cc: ming4.li@intel.com, linux-cxl@vger.kernel.org,
linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
dave@stgolabs.net, jonathan.cameron@huawei.com,
dave.jiang@intel.com, alison.schofield@intel.com,
vishal.l.verma@intel.com, dan.j.williams@intel.com,
bhelgaas@google.com, mahesh@linux.ibm.com, ira.weiny@intel.com,
oohall@gmail.com, Benjamin.Cheatham@amd.com, rrichter@amd.com,
nathan.fontenot@amd.com, Smita.KoralahalliChannabasappa@amd.com
Subject: Re: [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging
Date: Fri, 1 Nov 2024 13:28:12 -0500 [thread overview]
Message-ID: <8a73bfa1-b916-4f0a-92be-0cb677e1e334@amd.com> (raw)
In-Reply-To: <ZyUXLpQBBgTl733z@fan>
[-- Attachment #1: Type: text/plain, Size: 14071 bytes --]
Hi Fan,
I added comments below.
On 11/1/2024 1:00 PM, Fan Ni wrote:
> On Fri, Oct 25, 2024 at 04:02:51PM -0500, Terry Bowman wrote:
>> This is a continuation of the CXL port error handling RFC from earlier.[1]
>> The RFC resulted in the decision to add CXL PCIe port error handling to
>> the existing RCH downstream port handling in the AER service driver. This
>> patchset adds the CXL PCIe port protocol error handling and logging.
>>
>> The first 7 patches update the existing AER service driver to support CXL
>> PCIe port protocol error handling and reporting. This includes AER service
>> driver changes for adding correctable and uncorrectable error support, CXL
>> specific recovery handling, and addition of CXL driver callback handlers.
>>
>> The following 7 patches address CXL driver support for CXL PCIe port
>> protocol errors. This includes the following changes to the CXL drivers:
>> mapping CXL port and downstream port RAS registers, interface updates for
>> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
>> adding port specific error handlers, and protocol error logging.
>>
>> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
>>
>> Testing:
> Hi Terry,
> I tried to test the patchset with aer_inject tool (with the patch you shared
> in the last version), and hit some issues.
> Could you help check and give some insights? Thanks.
>
> Below are some test setup info and results.
>
> I tested two topology,
> a. one memdev directly attaced to a HB with only one RP;
> b. a topology with cxl switch:
> HB
> / \
> RP0 RP1
> |
> switch
> |
> ----------------
> | | | |
> mem0 mem1 mem2 mem3
>
> For both topologies, I cannot reproduce the system panic shown in your cover
> letter.
>
> btw, I tried both compile cxl as modules and in the kernel.
>
> Below, I will use the direct-attached topology (a) as an example to show what I
> tried, hope can get some clarity about the test and what I missed or did wrong.
>
> -------------------------------------
> pci device info on the test VM
> root@fan:~# lspci
> 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
> 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
> 00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
> 00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> 00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> 00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
> 00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
> 00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
> 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
> 0c:00.0 PCI bridge: Intel Corporation Device 7075
> 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
> root@fan:~#
> -------------------------------------
>
> The aer injection input file looks like below,
>
> -------------------------------------
> fan:~/cxl/cxl-test-tool$ cat /tmp/internal
> AER
> PCI_ID 0000:0c:00.0
> UNCOR_STATUS INTERNAL
> HEADER_LOG 0 1 2 3
> ------------------------------------
>
> dmesg after aer injection
>
> ssh root@localhost -p 2024 "dmesg"
> [ 613.195352] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> [ 613.195830] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> [ 613.196253] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> [ 613.198199] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
> -----------------------------------
This is likely because the device's CXL RAS status is not set and as a result returns false and bypasses the panic.
Unfortunately, the aer-inject only sets the AER status and triggers the interrupt. The CXL RAS is not set.
I attached 2 'test' patches. The first patch sets the device's RAS status to simulate the error reporting.
This will have to be adjusted as the patch looks for a specific device's bus and this will likely be a different
bus then the device's you test in your setup.
The 2nd patch enables UIE/CIE. I moved this out of the v2 patchset. I need to revisit this to see if it is
needed in the patchset itself (not just a test patch).
Regards,
Terry
>
> The problem seems to be related to the cxl_error_handler not been assigned for
> cxlmem device.
>
> in
> cxl_do_recover() {
> ...
> 327 cxl_walk_bridge(bridge, cxl_report_error_detected, &status);
> 328 if (status)
> 329 panic("CXL cachemem error. Invoking panic");
> ...
> }
> The status returned is false, so no panic().
>
> I tried to add some dev_dbg info to the code to debug.
> Below are the debug info and kernel code changes for debugging.
> --------------------------------------
> fan:~/cxl/cxl-test-tool$ cxl-tool.py --cmd dmesg | grep XXX
> [ 1.738909] cxl_mem:cxl_mem_probe:205: cxl_mem mem0: XXX: add endpoint
> [ 1.739188] cxl_mem:devm_cxl_add_endpoint:85: cxl_port port1: XXX: add endpoint
> [ 1.739509] cxl_mem:devm_cxl_add_endpoint:92: cxl_mem mem0: XXX: init ep port aer
> [ 1.739876] cxl_core:cxl_dport_init_ras_reporting:907: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 1
> [ 1.740338] cxl_core:cxl_dport_init_ras_reporting:913: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 2
> [ 1.740812] cxl_core:cxl_dport_init_ras_reporting:927: pcieport 0000:0c:00.0: XXX: assign port error handlers for dport 3
> [ 1.741273] cxl_core:cxl_assign_port_error_handlers:851: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____)
> [ 1.741812] cxl_core:cxl_assign_port_error_handlers:855: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____)
> [ 1.742263] cxl_core:cxl_assign_port_error_handlers:857: pcieport 0000:0c:00.0: XXX: cxl_err_handler: (____ptrval____) (____ptrval____)
> fan:~/cxl/cxl-test-tool$
> --------------------------------------
>
> dmesg after error injection:
> --------------------------------------
> ssh root@localhost -p 2024 "dmesg"
> [ 228.544439] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> [ 228.544977] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> [ 228.545381] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> [ 228.545879] pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/00000000
> [ 228.546360] pcieport 0000:0c:00.0: [22] UncorrIntErr
> [ 228.546698] pcieport 0000:0c:00.0: AER: XXX: call cxl_err_handler: 00000000a268bfcb 000000009e0da039
> [ 228.547103] cxl_pci 0000:0d:00.0: AER: XXX: call cxl_err_handler: 00000000b9f08b93 0000000000000000
> [ 228.547515] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
> fan:~/cxl/cxl-test-tool$
> --------------------------------------
>
>
> Kernel changes:
> --------------------------------------
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 5f7570c6173c..bcecd1283fc6 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -848,10 +848,13 @@ static void cxl_assign_port_error_handlers(struct pci_dev *pdev)
> {
> struct pci_driver *pdrv = pdev->driver;
>
> + dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p\n enter", pdev);
> if (!pdrv)
> return;
>
> + dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p\n", pdrv);
> pdrv->cxl_err_handler = &cxl_port_error_handlers;
> + dev_dbg(&pdev->dev, "XXX: cxl_err_handler: %p %p\n", pdrv, pdrv->cxl_err_handler);
> }
>
> static void cxl_clear_port_error_handlers(void *data)
> @@ -869,12 +872,14 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
> {
> struct pci_dev *pdev = to_pci_dev(port->uport_dev);
>
> + dev_dbg(&port->dev, "XXX: assign port error handlers for uport 1\n");
> /* uport may have more than 1 downstream EP. Check if already mapped. */
> if (port->uport_regs.ras) {
> dev_warn(&port->dev, "RAS is already mapped\n");
> return;
> }
>
> + dev_dbg(&port->dev, "XXX: assign port error handlers for uport 2\n");
> port->reg_map.host = &port->dev;
> if (cxl_map_component_regs(&port->reg_map, &port->uport_regs,
> BIT(CXL_CM_CAP_CAP_ID_RAS))) {
> @@ -882,6 +887,7 @@ void cxl_uport_init_ras_reporting(struct cxl_port *port)
> return;
> }
>
> + dev_dbg(&port->dev, "XXX: assign port error handlers for uport 3\n");
> cxl_assign_port_error_handlers(pdev);
> devm_add_action_or_reset(port->uport_dev, cxl_clear_port_error_handlers, pdev);
> }
> @@ -898,11 +904,13 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
> struct pci_host_bridge *host_bridge = to_pci_host_bridge(dport_dev);
> struct pci_dev *pdev = to_pci_dev(dport_dev);
>
> + dev_dbg(dport_dev, "XXX: assign port error handlers for dport 1\n");
> if (dport->rch && host_bridge->native_aer) {
> cxl_dport_map_rch_aer(dport);
> cxl_disable_rch_root_ints(dport);
> }
>
> + dev_dbg(dport_dev, "XXX: assign port error handlers for dport 2\n");
> /* dport may have more than 1 downstream EP. Check if already mapped. */
> if (dport->regs.ras) {
> dev_warn(dport_dev, "RAS is already mapped\n");
> @@ -916,6 +924,7 @@ void cxl_dport_init_ras_reporting(struct cxl_dport *dport)
> return;
> }
>
> + dev_dbg(dport_dev, "XXX: assign port error handlers for dport 3\n");
> cxl_assign_port_error_handlers(pdev);
> devm_add_action_or_reset(dport_dev, cxl_clear_port_error_handlers, pdev);
> }
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 067fd6389562..aa824584f8dd 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -82,13 +82,15 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
> * Now that the path to the root is established record all the
> * intervening ports in the chain.
> */
> + dev_dbg(host, "XXX: add endpoint\n");
> for (iter = parent_port, down = NULL; !is_cxl_root(iter);
> down = iter, iter = to_cxl_port(iter->dev.parent)) {
> struct cxl_ep *ep;
>
> ep = cxl_ep_load(iter, cxlmd);
> ep->next = down;
> - cxl_init_ep_ports_aer(ep);
> + dev_dbg(ep->ep, "XXX: init ep port aer\n");
> + cxl_init_ep_ports_aer(ep);
> }
>
> /* Note: endpoint port component registers are derived from @cxlds */
> @@ -200,6 +202,7 @@ static int cxl_mem_probe(struct device *dev)
> return -ENXIO;
> }
>
> + dev_dbg(dev, "XXX: add endpoint\n");
> rc = devm_cxl_add_endpoint(endpoint_parent, cxlmd, dport);
> if (rc)
> return rc;
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 3785f4ca5103..8285f14994e8 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -294,6 +294,11 @@ static int cxl_report_error_detected(struct pci_dev *dev, void *data)
> bool *status = data;
>
> device_lock(&dev->dev);
> + if (pdrv) {
> + dev_dbg(&dev->dev, "XXX: call cxl_err_handler: %p %p\n", pdrv, pdrv->cxl_err_handler);
> + } else {
> + dev_dbg(&dev->dev, "XXX: call cxl_err_handler: %p no handler\n", pdrv);
> + }
> if (pdrv && pdrv->cxl_err_handler &&
> pdrv->cxl_err_handler->error_detected) {
> const struct cxl_error_handlers *cxl_err_handler =
> --------------------------------------
>
> Fan
>> Below are test results for this patchset using Qemu with CXL root
>> port(0c:00.0), CXL upstream switchport(0d:00.0), CXL downstream
>> switchport(0e:00.0). A CXL endpoint(0f:00.0) CE and UCE logs are
>> also added to show the existing PCIe endpoint handling is not changed.
>>
>> This was tested using aer-inject updated to support CE and UCE internal
>> error injection. CXL RAS was set using a test patch (not upstreamed but can
>> provide if needed).
>>
>> - Root port UCE:
>> root@tbowman-cxl:~/aer-inject# ./root-uce-inject.sh
>> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
>> pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
>> pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>> pcieport 0000:0c:00.0: device [8086:7075] error status/mask=00400000/02000000
>> pcieport 0000:0c:00.0: [22] UncorrIntErr
>> aer_event: 0000:0c:00.0 CXL Bus Error: severity=Fatal, Uncorrectable Internal Error, TLP Header=Not available
>> cxl_port_aer_uncorrectable_error: device=0000:0c:00.0 host=pci0000:0c status: 'Memory Address Parity Error' first_error: 'Memory Address Parity Error'
>> Kernel panic - not syncing: CXL cachemem error. Invoking panic
>> CPU: 1 UID: 0 PID: 146 Comm: irq/24-aerdrv Tainted: G E 6.12.0-rc2-cxl-port-err-g2beab06a67d1 #4414
>> Tainted: [E]=UNSIGNED_MODULE
>> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
>> Call Trace:
>> <TASK>
>> dump_stack_lvl+0x27/0x90
>> dump_stack+0x10/0x20
>> panic+0x33e/0x380
>> cxl_do_recovery+0x116/0x120
>> ? srso_return_thunk+0x5/0x5f
>> aer_isr+0x3e0/0x710
>> irq_thread_fn+0x28/0x70
>> irq_thread+0x179/0x240
>> ? srso_return_thunk+0x5/0x5f
>> ? __pfx_irq_thread_fn+0x10/0x10
>> ? __pfx_irq_thread_dtor+0x10/0x10
>> ? __pfx_irq_thread+0x10/0x10
>> kthread+0xf5/0x130
>> ? __pfx_kthread+0x10/0x10
>> ret_from_fork+0x3c/0x60
>> ? __pfx_kthread+0x10/0x10
>> ret_from_fork_asm+0x1a/0x30
>> </TASK>
>> Kernel Offset: 0x29000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>> ---[ end Kernel panic - not syncing: CXL cachemem error. Invoking panic ]---
> ...
[-- Attachment #2: cxl-port-err-test-patches.tgz --]
[-- Type: application/x-compressed, Size: 1810 bytes --]
next prev parent reply other threads:[~2024-11-01 18:28 UTC|newest]
Thread overview: 55+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-25 21:02 [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Terry Bowman
2024-10-25 21:02 ` [PATCH v2 01/14] PCI/AER: Introduce 'struct cxl_err_handlers' and add to 'struct pci_driver' Terry Bowman
2024-10-30 15:14 ` Jonathan Cameron
2024-10-30 15:15 ` Bowman, Terry
2024-10-31 16:20 ` Dave Jiang
2024-10-31 20:24 ` Fan Ni
2024-10-25 21:02 ` [PATCH v2 02/14] PCI/AER: Rename AER driver's interfaces to also indicate CXL PCIe port support Terry Bowman
2024-10-30 15:13 ` Jonathan Cameron
2024-10-31 16:21 ` Dave Jiang
2024-10-31 20:25 ` Fan Ni
2024-10-25 21:02 ` [PATCH v2 03/14] cxl/pci: Introduce helper functions pcie_is_cxl() and pcie_is_cxl_port() Terry Bowman
2024-10-30 14:57 ` Jonathan Cameron
2024-10-31 16:25 ` Dave Jiang
2024-10-31 21:22 ` Fan Ni
2024-10-25 21:02 ` [PATCH v2 04/14] PCI/AER: Modify AER driver logging to report CXL or PCIe bus error type Terry Bowman
2024-10-30 14:56 ` Jonathan Cameron
2024-10-31 16:27 ` Dave Jiang
2024-10-31 21:27 ` Fan Ni
2024-10-25 21:02 ` [PATCH v2 05/14] PCI/AER: Add CXL PCIe port correctable error support in AER service driver Terry Bowman
2024-10-30 15:13 ` Jonathan Cameron
2024-10-30 15:51 ` Bowman, Terry
2024-11-04 21:50 ` Dan Williams
2024-11-04 22:05 ` Bowman, Terry
2024-10-31 16:37 ` Dave Jiang
2024-10-25 21:02 ` [PATCH v2 06/14] PCI/AER: Change AER driver to read UCE fatal status for all CXL PCIe port devices Terry Bowman
2024-10-30 15:37 ` Jonathan Cameron
2024-10-31 16:58 ` Dave Jiang
2024-11-01 13:30 ` Bowman, Terry
2024-10-25 21:02 ` [PATCH v2 07/14] PCI/AER: Add CXL PCIe port uncorrectable error recovery in AER service driver Terry Bowman
2024-10-30 15:42 ` Jonathan Cameron
2024-10-25 21:02 ` [PATCH v2 08/14] cxl/pci: Change find_cxl_ports() to non-static Terry Bowman
2024-10-30 15:45 ` Jonathan Cameron
2024-10-30 15:54 ` Bowman, Terry
2024-10-25 21:03 ` [PATCH v2 09/14] cxl/pci: Map CXL PCIe root port and downstream switch port RAS registers Terry Bowman
2024-10-30 15:55 ` Jonathan Cameron
2024-10-25 21:03 ` [PATCH v2 10/14] cxl/pci: Map CXL PCIe upstream " Terry Bowman
2024-10-30 15:56 ` Jonathan Cameron
2024-10-25 21:03 ` [PATCH v2 11/14] cxl/pci: Rename RAS handler interfaces to also indicate CXL PCIe port support Terry Bowman
2024-10-30 15:59 ` Jonathan Cameron
2024-10-25 21:03 ` [PATCH v2 12/14] cxl/pci: Add error handler for CXL PCIe port RAS errors Terry Bowman
2024-10-30 16:03 ` Jonathan Cameron
2024-10-25 21:03 ` [PATCH v2 13/14] cxl/pci: Add trace logging " Terry Bowman
2024-10-30 16:07 ` Jonathan Cameron
2024-10-30 21:30 ` Bowman, Terry
2024-10-25 21:03 ` [PATCH v2 14/14] cxl/pci: Add support to assign and clear pci_driver::cxl_err_handlers Terry Bowman
2024-10-30 16:11 ` Jonathan Cameron
2024-10-30 21:34 ` Bowman, Terry
2024-10-27 16:59 ` [PATCH v2 0/14] Applies to Base commit: 8cf0b93919e1 (tag: v6.12-rc2) Linux 6.12-rc2 Bowman, Terry
2024-10-28 1:05 ` [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging Bowman, Terry
2024-11-01 18:00 ` Fan Ni
2024-11-01 18:28 ` Bowman, Terry [this message]
2024-11-01 19:11 ` Fan Ni
2024-11-01 22:11 ` Fan Ni
2024-11-04 21:25 ` Bowman, Terry
2024-11-04 21:48 ` Fan Ni
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=8a73bfa1-b916-4f0a-92be-0cb677e1e334@amd.com \
--to=terry.bowman@amd.com \
--cc=Benjamin.Cheatham@amd.com \
--cc=Smita.KoralahalliChannabasappa@amd.com \
--cc=alison.schofield@intel.com \
--cc=bhelgaas@google.com \
--cc=dan.j.williams@intel.com \
--cc=dave.jiang@intel.com \
--cc=dave@stgolabs.net \
--cc=ira.weiny@intel.com \
--cc=jonathan.cameron@huawei.com \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-pci@vger.kernel.org \
--cc=mahesh@linux.ibm.com \
--cc=ming4.li@intel.com \
--cc=nathan.fontenot@amd.com \
--cc=nifan.cxl@gmail.com \
--cc=oohall@gmail.com \
--cc=rrichter@amd.com \
--cc=vishal.l.verma@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox