Enabling internal errors for VH CXL devices: [was: Re: Questions about CXL RAS injection test in qemu]

qemu-devel.nongnu.org archive mirror

* Enabling internal errors for VH CXL devices: [was: Re: Questions about CXL RAS injection test in qemu]
  2024-03-06 11:27 Questions about CXL RAS injection test in qemu Yuquan Wang
@ 2024-03-06 13:23 ` Jonathan Cameron via
  2024-03-06 17:12   ` Terry Bowman
  2024-03-06 17:16   ` Dan Williams
  0 siblings, 2 replies; 7+ messages in thread
From: Jonathan Cameron via @ 2024-03-06 13:23 UTC (permalink / raw)
  To: Yuquan Wang
  Cc: linux-cxl, qemu-devel, Robert Richter, Terry Bowman, dan.williams

On Wed, 6 Mar 2024 19:27:07 +0800
Yuquan Wang <wangyuquan1236@phytium.com.cn> wrote:

> Hello, Jonathan
> 
> Recently I ran into some problems with CXL RAS tests.
> 
> I tried to use the "cxl-inject-uncorrectable-errors" and "cxl-inject-correctable-error"
> QMP commands to inject CXL errors; however, the kernel in my QEMU machine printed
> nothing. The QMP connection was also unstable, and the machine always ended up
> "terminating on signal 2".

The qmp connection being unstable is odd - might be related to the CXL code, but
I'm not sure how..

> 
> In addition, I successfully used the HMP command "pcie_aer_inject_error" under the same conditions.
> The kernel printed the relevant information.

IIRC the AER paths print under all circumstances, whereas CXL errors do not; they simply
trigger tracepoints. You should still have seen device resets, though.

However, I spun up a test and I think the issue is more straightforward.
The uncorrectable and correctable internal errors are masked on the device.
I thought we had changed the default for this in Linux, but maybe not :(

A hack is to find the relevant device with lspci -tv and then use
setpci -s 0d:00.0 0x208.l=0
to clear all the mask bits for uncorrectable errors.
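
For anyone who would rather unmask only the internal-error bit instead of zeroing the whole register, here is a minimal Python sketch that builds the narrower setpci command. The register offsets (mask registers at +0x08 and +0x14 from the AER capability base) and the internal-error bit positions are the standard PCIe AER layout. The helper name, the 0x200 capability base, and the example mask values are assumptions for illustration; read the real values from the device first.

```python
# Standard PCIe AER register offsets, relative to the AER capability base.
AER_UNCOR_MASK_OFFSET = 0x08   # Uncorrectable Error Mask register
AER_COR_MASK_OFFSET = 0x14     # Correctable Error Mask register

# Internal-error bits in the two mask registers.
UNCOR_INTERNAL = 1 << 22       # Uncorrectable Internal Error (UIE)
COR_INTERNAL = 1 << 14         # Corrected Internal Error (CIE)


def unmask_command(bdf, aer_base, offset, current_mask, bit):
    """Hypothetical helper: build a setpci command clearing only `bit`."""
    new_mask = current_mask & ~bit & 0xFFFFFFFF
    return f"setpci -s {bdf} 0x{aer_base + offset:x}.l={new_mask:08x}"


# Assumed example values: AER capability at 0x200 on 0d:00.0, and current
# mask values that must be read from the real device beforehand.
print(unmask_command("0d:00.0", 0x200, AER_UNCOR_MASK_OFFSET,
                     0x02400000, UNCOR_INTERNAL))
print(unmask_command("0d:00.0", 0x200, AER_COR_MASK_OFFSET,
                     0x0000e000, COR_INTERNAL))
```

This only prints the commands; running them is left to the user, since writing the wrong offset on a device whose AER capability lives elsewhere would touch an unrelated register.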

Note that I tested this on a convenient arm64 setup, so it is always possible there is yet
another problem on x86.

Robert / Terry, I tracked down the patch where you enabled this for RCHs, and there was
some discussion on walking out on VH as well to enable this, but it seems that
never happened. Can you remember why?  Was it just kicked back for a future occasion?

Jonathan


> 
> Questions:
> 1) Are my CXL RAS test operations standard?
> 2) Is the error injected by "pcie_aer_inject_error" a "protocol & link error" of cxl.io?
>    Is the error injected by "cxl-inject-uncorrectable-errors" or "cxl-inject-correctable-error" a "protocol & link error" of cxl.cachemem?
> 
> I hope I can get some help here; any help will be greatly appreciated.
> 
> 
> My qemu command line:
> qemu-system-x86_64 \
> -M q35,nvdimm=on,cxl=on \
> -m 4G \
> -smp 4 \
> -object memory-backend-ram,size=2G,id=mem0 \
> -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
> -object memory-backend-ram,size=2G,id=mem1 \
> -numa node,nodeid=1,cpus=2-3,memdev=mem1 \
> -object memory-backend-ram,size=256M,id=cxl-mem0 \
> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
> -device cxl-type3,bus=root_port0,volatile-memdev=cxl-mem0,id=cxl-mem0 \
> -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k \
> -hda ../disk/ubuntu_x86_test_new.qcow2 \
> -nographic \
> -qmp tcp:127.0.0.1:4444,server,nowait \
> 
> QEMU version: 8.2.50, the latest commit of branch cxl-2024-03-05 in "https://gitlab.com/jic23/qemu"
> Kernel version: 6.8.0-rc6
> 
> My steps in the Qemu qmp:
> 1) telnet 127.0.0.1 4444
> 
> result:
> Trying 127.0.0.1...
> Connected to 127.0.0.1.
> Escape character is '^]'.
> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 2, "major": 8}, "package": "v6.2.0-19482-gccfb4fe221"}, "capabilities": ["oob"]}}
> 
> 2) { "execute": "qmp_capabilities" }
> 
> result:
> {"return": {}}
> 
> 3) To inject a correctable error:
> { "execute": "cxl-inject-correctable-error",
>     "arguments": {
>         "path": "/machine/peripheral/cxl-mem0",
>         "type": "physical"
>     } }
> 
> result:
> {"return": {}}
> 
> 3) To inject an uncorrectable error:
> { "execute": "cxl-inject-uncorrectable-errors",
>   "arguments": {
>     "path": "/machine/peripheral/cxl-mem0",
>     "errors": [
>         {
>             "type": "cache-address-parity",
>             "header": [ 3, 4]
>         },
>         {
>             "type": "cache-data-parity",
>             "header": [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
>         },
>         {
>             "type": "internal",
>             "header": [ 1, 2, 4]
>         }
>         ]
>   }}
> 
> result:
> {"return": {}}
> {"timestamp": {"seconds": 1709721640, "microseconds": 275345}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}}
> 
> Many thanks
> Yuquan
> 
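
The QMP steps quoted above can also be scripted rather than typed into telnet. The following is a minimal Python sketch under stated assumptions: the QMP socket is the TCP one from the quoted command line (127.0.0.1:4444), QEMU sends one JSON message per line, and the helper names `build_command` and `qmp_session` are made up for illustration. The command name, path and error type are the ones from the quoted message.

```python
import json
import socket


def build_command(name, arguments):
    """Build a QMP command object (hypothetical helper)."""
    return {"execute": name, "arguments": arguments}


def qmp_session(host, port, commands):
    """Send commands over one QMP connection and collect the replies.

    Assumes one JSON message per line. Asynchronous events (messages
    without a "return" or "error" key) are skipped.
    """
    replies = []
    with socket.create_connection((host, port)) as sock:
        stream = sock.makefile("rw", encoding="utf-8")
        json.loads(stream.readline())  # greeting banner
        # Capabilities negotiation must come before any other command.
        for cmd in [{"execute": "qmp_capabilities"}] + list(commands):
            stream.write(json.dumps(cmd) + "\n")
            stream.flush()
            while True:
                msg = json.loads(stream.readline())
                if "return" in msg or "error" in msg:
                    replies.append(msg)
                    break
    return replies


inject = build_command(
    "cxl-inject-correctable-error",
    {"path": "/machine/peripheral/cxl-mem0", "type": "physical"},
)
print(json.dumps(inject))
# With a running guest: qmp_session("127.0.0.1", 4444, [inject])
```

Nothing here addresses the "terminating on signal 2" problem itself; it only removes the interactive telnet session from the test procedure.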




* Re: Enabling internal errors for VH CXL devices: [was: Re: Questions about CXL RAS injection test in qemu]
  2024-03-06 13:23 ` Enabling internal errors for VH CXL devices: [was: Re: Questions about CXL RAS injection test in qemu] Jonathan Cameron via
@ 2024-03-06 17:12   ` Terry Bowman
  2024-03-06 19:06     ` Terry Bowman
  2024-03-06 17:16   ` Dan Williams
  1 sibling, 1 reply; 7+ messages in thread
From: Terry Bowman @ 2024-03-06 17:12 UTC (permalink / raw)
  To: Jonathan Cameron, Yuquan Wang
  Cc: linux-cxl, qemu-devel, Robert Richter, dan.williams

Hi Yuquan and Jon,

I added responses inline below.

On 3/6/24 07:23, Jonathan Cameron wrote:
> On Wed, 6 Mar 2024 19:27:07 +0800
> Yuquan Wang <wangyuquan1236@phytium.com.cn> wrote:
> 
>> Hello, Jonathan
>>
>> Recently I met some problems on CXL RAS tests. 
>>
>> I tried to use "cxl-inject-uncorrectable-errors" and "cxl-inject-correctable-error"
>> qmp to inject CXL errors, however, there was no any kernel printing information in 
>> my qemu machine. And the qmp connection was unstable that made the machine 
>> always "terminating on signal 2".
> 
> The qmp connection being unstable is odd - might be related to the CXL code, but
> I'm not sure how..
> 
>>
>> In addition, I successfully used the hmp "pcie_aer_inject_error" in the same conditions.
>> The kernel showed relevant print information.
> 
> IIRC the AER paths print under all circumstances whereas CXL errors do not, they simply
> trigger tracepoints - but you should have seen device resets.
> 
> However, I spun up a test and I think the issue is more straightforward.
> The uncorrectable and correctable internal errors are masked on the device.
> I thought we had changed the default for this in Linux, but maybe not :(
> 

The device AER UIE/CIE mask can be set and we can still expect device AER errors to be handled. The device reports
AER UIE/CIE to the root port/RCEC on behalf of device AER CRC, TLP, etc. errors.

In earlier changes we added logic to clear the RCEC UIE/CIE mask in order to properly receive
AER UIE/CIE notifications from devices and RCH dports.

"CXL Protocol and Link errors detected by components that are part of a CXL VH are
escalated and reported using standard PCIe error reporting mechanisms over CXL.io as
UIEs and/or CIEs. See PCIe Base Specification for details."[1]

[1] CXL3.1 12.2.1 - Protocol and Link Layer Error Reporting

> A hack is to find the relevant device with lspci -tv and then use
> setpci -s 0d:00.0 0x208.l=0
> to clear all the mask bits for uncorrectable errors.
> 
> Note I tested this on a convenient arm64 setup so always possible there is yet
> another problem on x86.
> 
> Robert / Terry, I tracked down the patch where you enabled this for RCHs and there was
> some discussion on walking out on VH as well to enable this, but seems it
> never happened. Can you remember why?  Just kicked back for a future occasion?
> 
> Jonathan
> 
> 

I tested (qemu x86) using the aer-inject tool and found it to work. Below shows the 
endpoint CIE is masked (0xe000 @ AER+0x14) and the injected error is properly handled
with root port logging and cxl_pci handler trace logs.

    # lspci | grep -i cxl
    0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)

    # lspci -s 0d:00.0 -vvv | grep Advanced
    Capabilities: [200 v2] Advanced Error Reporting

    # setpci -s 0d:00.0 0x208.l
    02400000

    # setpci -s 0d:00.0 0x214.l
    0000e000

    # cat aer-input.txt
    # Inject a correctable bad TLP error into the device with header log
    # words 0 1 2 3.
    #
    # Either specify the PCI id on the command-line option or uncomment and edit
    # the PCI_ID line below using the correct PCI ID.
    #
    # Note that system firmware/BIOS may mask certain errors and/or not report
    # header log words.
    #
    AER
    #PCI_ID 0000:0C.00.0
    COR_STATUS BAD_TLP
    HEADER_LOG 0 1 2 3

    # ./aer-inject -s 0000:0d:00.0 aer-input.txt
    [   72.850686] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000040/00000000 into device 0000:0d:00.0
    [   72.851784] pcieport 0000:0c:00.0: AER: Corrected error received: 0000:0d:00.0
    [   72.852594] cxl_pci 0000:0d:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
    [   72.853591] cxl_pci 0000:0d:00.0:   device [8086:0d93] error status/mask=00000040/0000e000
    # [   72.854277] cxl_pci 0000:0d:00.0:    [ 6] BadTLP
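
To make the register values in this output easier to read, here is a small Python sketch that decodes them. The bit names follow the standard PCIe AER register layout, and only the bits that appear in this output are listed; the helper name `decode` is made up for illustration. It also shows why the injected Bad TLP error (status 0x40) is still reported although the CIE bit is set in the mask 0xe000.

```python
# Subset of PCIe AER bit definitions, limited to the bits seen in this thread.
UNCOR_BITS = {
    22: "Uncorrectable Internal Error",
    25: "TLP Prefix Blocked",
}
COR_BITS = {
    6: "Bad TLP",
    13: "Advisory Non-Fatal",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode(value, names):
    """Return the names of all bits set in a 32-bit register value."""
    return [names.get(bit, f"bit {bit}")
            for bit in range(32) if value & (1 << bit)]


# Mask registers read with setpci at AER+0x08 and AER+0x14.
print(decode(0x02400000, UNCOR_BITS))
print(decode(0x0000e000, COR_BITS))

# An error is reported only if its status bit is not masked.
status, mask = 0x00000040, 0x0000e000
print(decode(status & ~mask, COR_BITS))
```

The last line prints the errors that survive the mask. Bad TLP is bit 6 and the mask covers only bits 13 to 15, so it passes through, which matches the log above.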

I have not tried to use cxl-inject-uncorrectable-errors or cxl-inject-correctable-error.

Regards,
Terry

>>
>> Question:
>> 1) Is my CXL RAS test operations standard?
>> 2) The error injected by "pcie_aer_inject_error" is "protocol & link errors" of cxl.io?
>>    The error injected by "cxl-inject-uncorrectable-errors" or "cxl-inject-correctable-error" is "protocol & link errors" of cxl.cachemem?
>>
>> Hope I can get some helps here, any help will be greatly appreciated.
>>
>>
>> My qemu command line:
>> qemu-system-x86_64 \
>> -M q35,nvdimm=on,cxl=on \
>> -m 4G \
>> -smp 4 \
>> -object memory-backend-ram,size=2G,id=mem0 \
>> -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
>> -object memory-backend-ram,size=2G,id=mem1 \
>> -numa node,nodeid=1,cpus=2-3,memdev=mem1 \
>> -object memory-backend-ram,size=256M,id=cxl-mem0 \
>> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
>> -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
>> -device cxl-type3,bus=root_port0,volatile-memdev=cxl-mem0,id=cxl-mem0 \
>> -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k \
>> -hda ../disk/ubuntu_x86_test_new.qcow2 \
>> -nographic \
>> -qmp tcp:127.0.0.1:4444,server,nowait \
>>
>> QEMU version: 8.2.50, the latest commit of branch cxl-2024-03-05 in "https://gitlab.com/jic23/qemu"
>> Kernel version: 6.8.0-rc6
>>
>> My steps in the Qemu qmp:
>> 1) telnet 127.0.0.1 4444
>>
>> result:
>> Trying 127.0.0.1...
>> Connected to 127.0.0.1.
>> Escape character is '^]'.
>> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 2, "major": 8}, "package": "v6.2.0-19482-gccfb4fe221"}, "capabilities": ["oob"]}}
>>
>> 2) { "execute": "qmp_capabilities" }
>>
>> result:
>> {"return": {}}
>>
>> 3) If inject correctable error:
>> { "execute": "cxl-inject-correctable-error",
>>     "arguments": {
>>         "path": "/machine/peripheral/cxl-mem0",
>>         "type": "physical"
>>     } }
>>
>> result:
>> {"return": {}}
>>
>> 3) If inject uncorrectable error:
>> { "execute": "cxl-inject-uncorrectable-errors",
>>   "arguments": {
>>     "path": "/machine/peripheral/cxl-mem0",
>>     "errors": [
>>         {
>>             "type": "cache-address-parity",
>>             "header": [ 3, 4]
>>         },
>>         {
>>             "type": "cache-data-parity",
>>             "header": [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
>>         },
>>         {
>>             "type": "internal",
>>             "header": [ 1, 2, 4]
>>         }
>>         ]
>>   }}
>>
>> result:
>> {"return": {}}
>> {"timestamp": {"seconds": 1709721640, "microseconds": 275345}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}}
>>
>> Many thanks
>> Yuquan
>>
> 



* RE: Enabling internal errors for VH CXL devices: [was: Re: Questions about CXL RAS injection test in qemu]
  2024-03-06 13:23 ` Enabling internal errors for VH CXL devices: [was: Re: Questions about CXL RAS injection test in qemu] Jonathan Cameron via
  2024-03-06 17:12   ` Terry Bowman
@ 2024-03-06 17:16   ` Dan Williams
  2024-03-06 17:42     ` Terry Bowman
  1 sibling, 1 reply; 7+ messages in thread
From: Dan Williams @ 2024-03-06 17:16 UTC (permalink / raw)
  To: Jonathan Cameron, Yuquan Wang
  Cc: linux-cxl, qemu-devel, Robert Richter, Terry Bowman, dan.williams,
	ming4.li

[ add Li Ming ]

Jonathan Cameron wrote:
[..]
> Robert / Terry, I tracked down the patch where you enabled this for RCHs and there was
> some discussion on walking out on VH as well to enable this, but seems it
> never happened. Can you remember why?  Just kicked back for a future occasion?
> 

Li Ming has the patch below waiting in the wings. Li Ming, this patch is
timely for this discussion; care to send out the full series? I expect it
needs to be an RFC given concerns with integrating with the pending port
switch error handling work.

-- 8< --
From: Li Ming <ming4.li@intel.com>
Subject: [PATCH RFC v3 3/6] PCI/AER: Enable RCEC to report internal error for CXL root port
Date: Thu, 1 Feb 2024 05:58:08 +0000

Per CXL r3.1 section 12.2.2, an RCEC can log the CXL.cachemem
protocol errors detected by a CXL root port as PCI_ERR_UNC_INTN or
PCI_ERR_COR_INTERNAL in its AER Capability. So unmask PCI_ERR_UNC_INTN and
PCI_ERR_COR_INTERNAL for that case.

Signed-off-by: Li Ming <ming4.li@intel.com>
---
 drivers/pci/pcie/aer.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 42a3bd35a3e1..ef8fd77cb920 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -985,7 +985,7 @@ static bool cxl_error_is_native(struct pci_dev *dev)
 {
 	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
 
-	return (pcie_ports_native || host->native_aer);
+	return (pcie_ports_native || host->native_aer) && host->is_cxl;
 }
 
 static bool is_internal_error(struct aer_err_info *info)
@@ -1041,8 +1041,14 @@ static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
 {
 	bool *handles_cxl = data;
 
-	if (!*handles_cxl)
-		*handles_cxl = is_cxl_mem_dev(dev) && cxl_error_is_native(dev);
+	if (!*handles_cxl) {
+		if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_END &&
+		    is_cxl_mem_dev(dev) && cxl_error_is_native(dev))
+			*handles_cxl = true;
+		if (pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT &&
+		    cxl_error_is_native(dev))
+			*handles_cxl = true;
+	}
 
 	/* Non-zero terminates iteration */
 	return *handles_cxl;
@@ -1054,13 +1060,18 @@ static bool handles_cxl_errors(struct pci_dev *rcec)
 
 	if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
 	    pcie_aer_is_native(rcec))
-		pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
+		pcie_walk_rcec_all(rcec, handles_cxl_error_iter, &handles_cxl);
 
 	return handles_cxl;
 }
 
-static void cxl_rch_enable_rcec(struct pci_dev *rcec)
+static void cxl_enable_rcec(struct pci_dev *rcec)
 {
+	/*
+	 * Enable RCEC's internal error report for two cases:
+	 * 1. RCiEP detected CXL.cachemem protocol errors
+	 * 2. CXL root port detected CXL.cachemem protocol errors.
+	 */
 	if (!handles_cxl_errors(rcec))
 		return;
 
@@ -1069,7 +1080,7 @@ static void cxl_rch_enable_rcec(struct pci_dev *rcec)
 }
 
 #else
-static inline void cxl_rch_enable_rcec(struct pci_dev *dev) { }
+static inline void cxl_enable_rcec(struct pci_dev *dev) { }
 static inline void cxl_rch_handle_error(struct pci_dev *dev,
 					struct aer_err_info *info) { }
 #endif
@@ -1494,7 +1505,7 @@ static int aer_probe(struct pcie_device *dev)
 		return status;
 	}
 
-	cxl_rch_enable_rcec(port);
+	cxl_enable_rcec(port);
 	aer_enable_rootport(rpc);
 	pci_info(port, "enabled with IRQ %d\n", dev->irq);
 	return 0;
-- 
2.40.1




* Re: Enabling internal errors for VH CXL devices: [was: Re: Questions about CXL RAS injection test in qemu]
  2024-03-06 17:16   ` Dan Williams
@ 2024-03-06 17:42     ` Terry Bowman
  0 siblings, 0 replies; 7+ messages in thread
From: Terry Bowman @ 2024-03-06 17:42 UTC (permalink / raw)
  To: Dan Williams, Jonathan Cameron, Yuquan Wang, Dan Williams
  Cc: linux-cxl, qemu-devel, Robert Richter, dan.williams, ming4.li

Hi Jon,

This appears to partially address the same problem Robert and I are working on. We
are working to add support for CXL port devices, including root ports, RCECs, USPs,
and DSPs. This was covered in the LPC presentation and discussion.

We did not originally include RCEC error handling support because the same is needed 
for all CXL port devices. Also, we wanted to avoid adding more CXL specifics to aer.c and 
were looking for a more general solution. This led to the discussion about changes to 
the PCIe port bus driver.

Regards,
Terry

On 3/6/24 11:16, Dan Williams wrote:
> [ add Li Ming ]
> 
> Jonathan Cameron wrote:
> [..]
>> Robert / Terry, I tracked down the patch where you enabled this for RCHs and there was
>> some discussion on walking out on VH as well to enable this, but seems it
>> never happened. Can you remember why?  Just kicked back for a future occasion?
>>
> 
> Li Ming has the patch below waiting in the wings. Li Ming, this patch is
> timely for this discussion; care to send out the full series? I expect it
> needs to be an RFC given concerns with integrating with the pending port
> switch error handling work.
> 
> -- 8< --
> From: Li Ming <ming4.li@intel.com>
> Subject: [PATCH RFC v3 3/6] PCI/AER: Enable RCEC to report internal error for CXL root port
> Date: Thu, 1 Feb 2024 05:58:08 +0000
> 
> Per CXL r3.1 section 12.2.2, RCEC is possible to log the CXL.cachemem
> protocol errors detected by CXL root port as PCI_ERR_UNC_INTN or
> PCI_ERR_COR_INTERNAL in AER Capability. So unmask PCI_ERR_UNC_INTN and
> PCI_ERR_COR_INTERNAL for that case.
> 
> Signed-off-by: Li Ming <ming4.li@intel.com>
> ---
>  drivers/pci/pcie/aer.c | 25 ++++++++++++++++++-------
>  1 file changed, 18 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 42a3bd35a3e1..ef8fd77cb920 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -985,7 +985,7 @@ static bool cxl_error_is_native(struct pci_dev *dev)
>  {
>  	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>  
> -	return (pcie_ports_native || host->native_aer);
> +	return (pcie_ports_native || host->native_aer) && host->is_cxl;
>  }
>  
>  static bool is_internal_error(struct aer_err_info *info)
> @@ -1041,8 +1041,14 @@ static int handles_cxl_error_iter(struct pci_dev *dev, void *data)
>  {
>  	bool *handles_cxl = data;
>  
> -	if (!*handles_cxl)
> -		*handles_cxl = is_cxl_mem_dev(dev) && cxl_error_is_native(dev);
> +	if (!*handles_cxl) {
> +		if (pci_pcie_type(dev) == PCI_EXP_TYPE_RC_END &&
> +		    is_cxl_mem_dev(dev) && cxl_error_is_native(dev))
> +			*handles_cxl = true;
> +		if (pci_pcie_type(dev) == PCI_EXP_TYPE_ROOT_PORT &&
> +		    cxl_error_is_native(dev))
> +			*handles_cxl = true;
> +	}
>  
>  	/* Non-zero terminates iteration */
>  	return *handles_cxl;
> @@ -1054,13 +1060,18 @@ static bool handles_cxl_errors(struct pci_dev *rcec)
>  
>  	if (pci_pcie_type(rcec) == PCI_EXP_TYPE_RC_EC &&
>  	    pcie_aer_is_native(rcec))
> -		pcie_walk_rcec(rcec, handles_cxl_error_iter, &handles_cxl);
> +		pcie_walk_rcec_all(rcec, handles_cxl_error_iter, &handles_cxl);
>  
>  	return handles_cxl;
>  }
>  
> -static void cxl_rch_enable_rcec(struct pci_dev *rcec)
> +static void cxl_enable_rcec(struct pci_dev *rcec)
>  {
> +	/*
> +	 * Enable RCEC's internal error report for two cases:
> +	 * 1. RCiEP detected CXL.cachemem protocol errors
> +	 * 2. CXL root port detected CXL.cachemem protocol errors.
> +	 */
>  	if (!handles_cxl_errors(rcec))
>  		return;
>  
> @@ -1069,7 +1080,7 @@ static void cxl_rch_enable_rcec(struct pci_dev *rcec)
>  }
>  
>  #else
> -static inline void cxl_rch_enable_rcec(struct pci_dev *dev) { }
> +static inline void cxl_enable_rcec(struct pci_dev *dev) { }
>  static inline void cxl_rch_handle_error(struct pci_dev *dev,
>  					struct aer_err_info *info) { }
>  #endif
> @@ -1494,7 +1505,7 @@ static int aer_probe(struct pcie_device *dev)
>  		return status;
>  	}
>  
> -	cxl_rch_enable_rcec(port);
> +	cxl_enable_rcec(port);
>  	aer_enable_rootport(rpc);
>  	pci_info(port, "enabled with IRQ %d\n", dev->irq);
>  	return 0;



* Re: Enabling internal errors for VH CXL devices: [was: Re: Questions about CXL RAS injection test in qemu]
  2024-03-06 17:12   ` Terry Bowman
@ 2024-03-06 19:06     ` Terry Bowman
  0 siblings, 0 replies; 7+ messages in thread
From: Terry Bowman @ 2024-03-06 19:06 UTC (permalink / raw)
  To: Jonathan Cameron, Yuquan Wang
  Cc: linux-cxl, qemu-devel, Robert Richter, dan.williams

Hi Yuquan,

For your test, the first logging will come from the AER driver if
everything is working correctly.

You may want to check whether the upstream PCI bridge's AER UIE/CIE
masks are set. This could prevent the error from being handled by the OS's
AER driver.
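
As a rough aid for that check, here is a minimal Python sketch that lists the PCI functions upstream of an endpoint from its resolved sysfs path, so that each one can be inspected with lspci. It assumes the usual sysfs layout, where the real path of /sys/bus/pci/devices/<BDF> contains one directory per upstream function. The example path is an assumption modelled on the addresses seen earlier in this thread, and the helper name `upstream_chain` is made up for illustration.

```python
import re

# Matches a full PCI address such as 0000:0d:00.0 (domain:bus:device.function).
BDF_RE = re.compile(r"^[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]$")


def upstream_chain(sysfs_realpath):
    """Return the PCI functions on the path, root side first, endpoint last."""
    return [part for part in sysfs_realpath.split("/") if BDF_RE.match(part)]


# Assumed example; on a real system obtain the path with
# os.path.realpath("/sys/bus/pci/devices/0000:0d:00.0").
example = "/sys/devices/pci0000:0c/0000:0c:00.0/0000:0d:00.0"
chain = upstream_chain(example)

# Every entry except the last one is upstream of the endpoint.
for bdf in chain[:-1]:
    print(f"lspci -s {bdf} -vvv")
```

The AER capability offset can differ from device to device, so the mask registers of each bridge need to be located from that bridge's own lspci output rather than by reusing the endpoint's 0x208 and 0x214.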

Regards,
Terry

On 3/6/24 11:12, Terry Bowman wrote:
> Hi Yuquan and Jon,
> 
> I added responses inline below.
> 
> On 3/6/24 07:23, Jonathan Cameron wrote:
>> On Wed, 6 Mar 2024 19:27:07 +0800
>> Yuquan Wang <wangyuquan1236@phytium.com.cn> wrote:
>>
>>> Hello, Jonathan
>>>
>>> Recently I met some problems on CXL RAS tests. 
>>>
>>> I tried to use "cxl-inject-uncorrectable-errors" and "cxl-inject-correctable-error"
>>> qmp to inject CXL errors, however, there was no any kernel printing information in 
>>> my qemu machine. And the qmp connection was unstable that made the machine 
>>> always "terminating on signal 2".
>>
>> The qmp connection being unstable is odd - might be related to the CXL code, but
>> I'm not sure how..
>>
>>>
>>> In addition, I successfully used the hmp "pcie_aer_inject_error" in the same conditions.
>>> The kernel showed relevant print information.
>>
>> IIRC the AER paths print under all circumstances whereas CXL errors do not, they simply
>> trigger tracepoints - but you should have seen device resets.
>>
>> However, I spun up a test and I think the issue is more straightforward.
>> The uncorrectable and correctable internal errors are masked on the device.
>> I thought we had changed the default for this in Linux, but maybe not :(
>>
> 
> The device AER UIE/CIE mask can be set and we can still expect device AER errors to be handled. The device reports
> AER UIE/CIE to the root port/RCEC on behalf of device AER CRC, TLP, etc. errors.
> 
> In earlier changes we added logic to clear the RCEC UIE/CIE mask in order to properly receive
> AER UIE/CIE notifications from devices and RCH dports.
> 
> "CXL Protocol and Link errors detected by components that are part of a CXL VH are
> escalated and reported using standard PCIe error reporting mechanisms over CXL.io as
> UIEs and/or CIEs. See PCIe Base Specification for details."[1]
> 
> [1] CXL3.1 12.2.1 - Protocol and Link Layer Error Reporting
> 
>> A hack is to find the relevant device with lspci -tv and then use
>> setpci -s 0d:00.0 0x208.l=0
>> to clear all the mask bits for uncorrectable errors.
>>
>> Note I tested this on a convenient arm64 setup so always possible there is yet
>> another problem on x86.
>>
>> Robert / Terry, I tracked down the patch where you enabled this for RCHs and there was
>> some discussion on walking out on VH as well to enable this, but seems it
>> never happened. Can you remember why?  Just kicked back for a future occasion?
>>
>> Jonathan
>>
>>
> 
> I tested (qemu x86) using the aer-inject tool and found it to work. Below shows the 
> endpoint CIE is masked (0xe000 @ AER+0x14) and the injected error is properly handled
> with root port logging and cxl_pci handler trace logs.
> 
>     # lspci | grep -i cxl
>     0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
> 
>     # lspci -s 0d:00.0 -vvv | grep Advanced
>     Capabilities: [200 v2] Advanced Error Reporting
> 
>     # setpci -s 0d:00.0 0x208.l
>     02400000
> 
>     # setpci -s 0d:00.0 0x214.l
>     0000e000
> 
>     # cat aer-input.txt
>     # Inject a correctable bad TLP error into the device with header log
>     # words 0 1 2 3.
>     #
>     # Either specify the PCI id on the command-line option or uncomment and edit
>     # the PCI_ID line below using the correct PCI ID.
>     #
>     # Note that system firmware/BIOS may mask certain errors and/or not report
>     # header log words.
>     #
>     AER
>     #PCI_ID 0000:0C.00.0
>     COR_STATUS BAD_TLP
>     HEADER_LOG 0 1 2 3
> 
>     # ./aer-inject -s 0000:0d:00.0 aer-input.txt
>     [   72.850686] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000040/00000000 into device 0000:0d:00.0
>     [   72.851784] pcieport 0000:0c:00.0: AER: Corrected error received: 0000:0d:00.0
>     [   72.852594] cxl_pci 0000:0d:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
>     [   72.853591] cxl_pci 0000:0d:00.0:   device [8086:0d93] error status/mask=00000040/0000e000
>     # [   72.854277] cxl_pci 0000:0d:00.0:    [ 6] BadTLP
> 
> I have not tried to use cxl-inject-uncorrectable-errors or cxl-inject-correctable-error.
> 
> Regards,
> Terry
> 
>>>
>>> Question:
>>> 1) Are my CXL RAS test operations standard?
>>> 2) The error injected by "pcie_aer_inject_error" is "protocol & link errors" of cxl.io?
>>>    The error injected by "cxl-inject-uncorrectable-errors" or "cxl-inject-correctable-error" is "protocol & link errors" of cxl.cachemem?
>>>
>>> I hope I can get some help here; any help will be greatly appreciated.
>>>
>>>
>>> My qemu command line:
>>> qemu-system-x86_64 \
>>> -M q35,nvdimm=on,cxl=on \
>>> -m 4G \
>>> -smp 4 \
>>> -object memory-backend-ram,size=2G,id=mem0 \
>>> -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
>>> -object memory-backend-ram,size=2G,id=mem1 \
>>> -numa node,nodeid=1,cpus=2-3,memdev=mem1 \
>>> -object memory-backend-ram,size=256M,id=cxl-mem0 \
>>> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
>>> -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
>>> -device cxl-type3,bus=root_port0,volatile-memdev=cxl-mem0,id=cxl-mem0 \
>>> -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k \
>>> -hda ../disk/ubuntu_x86_test_new.qcow2 \
>>> -nographic \
>>> -qmp tcp:127.0.0.1:4444,server,nowait \
>>>
>>> Qemu version: 8.2.50, the latest commit of branch cxl-2024-03-05 in "https://gitlab.com/jic23/qemu"
>>> Kernel version: 6.8.0-rc6
>>>
>>> My steps in the Qemu qmp:
>>> 1) telnet 127.0.0.1 4444
>>>
>>> result:
>>> Trying 127.0.0.1...
>>> Connected to 127.0.0.1.
>>> Escape character is '^]'.
>>> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 2, "major": 8}, "package": "v6.2.0-19482-gccfb4fe221"}, "capabilities": ["oob"]}}
>>>
>>> 2) { "execute": "qmp_capabilities" }
>>>
>>> result:
>>> {"return": {}}
>>>
>>> 3) To inject a correctable error:
>>> { "execute": "cxl-inject-correctable-error",
>>>     "arguments": {
>>>         "path": "/machine/peripheral/cxl-mem0",
>>>         "type": "physical"
>>>     } }
>>>
>>> result:
>>> {"return": {}}
>>>
>>> 3) To inject uncorrectable errors:
>>> { "execute": "cxl-inject-uncorrectable-errors",
>>>   "arguments": {
>>>     "path": "/machine/peripheral/cxl-mem0",
>>>     "errors": [
>>>         {
>>>             "type": "cache-address-parity",
>>>             "header": [ 3, 4]
>>>         },
>>>         {
>>>             "type": "cache-data-parity",
>>>             "header": [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
>>>         },
>>>         {
>>>             "type": "internal",
>>>             "header": [ 1, 2, 4]
>>>         }
>>>         ]
>>>   }}
>>>
>>> result:
>>> {"return": {}}
>>> {"timestamp": {"seconds": 1709721640, "microseconds": 275345}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}}
>>>
>>> Many thanks
>>> Yuquan
>>>
>>
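The QMP steps quoted above can also be scripted rather than typed into telnet. A minimal sketch in Python, assuming the same -qmp tcp:127.0.0.1:4444 socket and the same device path; the command names and argument layout are taken from the quoted messages, while the helper names (build_command, correctable_error, uncorrectable_errors, qmp_run) are hypothetical and exist only for illustration:

```python
import json
import socket


def build_command(name, arguments=None):
    """Return one QMP command serialized as a single JSON line."""
    msg = {"execute": name}
    if arguments is not None:
        msg["arguments"] = arguments
    return json.dumps(msg) + "\r\n"


def correctable_error(path, error_type="physical"):
    # Same shape as the cxl-inject-correctable-error request quoted above.
    return build_command("cxl-inject-correctable-error",
                         {"path": path, "type": error_type})


def uncorrectable_errors(path, errors):
    # errors is a list of (type, header_words) pairs.
    return build_command(
        "cxl-inject-uncorrectable-errors",
        {"path": path,
         "errors": [{"type": t, "header": list(h)} for t, h in errors]})


def qmp_run(commands, host="127.0.0.1", port=4444, timeout=5.0):
    """Send commands over one QMP connection and return their replies."""
    replies = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        reader = sock.makefile("r", encoding="utf-8")
        json.loads(reader.readline())  # greeting banner sent by QEMU
        for cmd in [build_command("qmp_capabilities")] + list(commands):
            sock.sendall(cmd.encode("utf-8"))
            # Skip asynchronous events until the reply to this command.
            while True:
                line = reader.readline()
                if not line:
                    raise ConnectionError("QMP connection closed")
                msg = json.loads(line)
                if "return" in msg or "error" in msg:
                    replies.append(msg)
                    break
    return replies
```

With a guest running, something like qmp_run([correctable_error("/machine/peripheral/cxl-mem0")]) would send the same request as step 3. Keeping the message builders separate from the socket code means the JSON can be checked without a running QEMU.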



* Enabling internal errors for VH CXL devices: [was: Re: Questions about CXL RAS injection test in qemu]
@ 2024-03-08  2:01 Yuquan Wang
  2024-03-08 12:59 ` Jonathan Cameron via
  0 siblings, 1 reply; 7+ messages in thread
From: Yuquan Wang @ 2024-03-08  2:01 UTC (permalink / raw)
  To: Jonathan.Cameron, Terry.Bowman; +Cc: linux-cxl, qemu-devel

On 2024-03-07 20:10,  jonathan.cameron wrote:

> Hack is to find the relevant device with lspci -tv and then use
> setpci -s 0d:00.0 0x208.l=0
> to clear all the mask bits for uncorrectable errors.

Thanks! The suggestions from you and Terry did work!
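For reference, the hack above writes 0 to the whole register, which unmasks every uncorrectable error. A narrower variant would clear only the internal error bits. A minimal sketch, assuming the AER extended capability sits at 0x200 on this device (which the 0x208 above implies, since the Uncorrectable Error Mask register is at offset 0x08 of the capability) and using the PCIe bit positions for Uncorrectable Internal Error (bit 22) and Corrected Internal Error (bit 14). The helper name unmask_command is hypothetical, and the generated commands have not been tested in this thread:

```python
AER_CAP_BASE = 0x200       # assumption: implied by the 0x208 in the setpci hack
UNCOR_MASK_OFFSET = 0x08   # Uncorrectable Error Mask register
COR_MASK_OFFSET = 0x14     # Correctable Error Mask register
UNCOR_INTERNAL = 1 << 22   # Uncorrectable Internal Error
COR_INTERNAL = 1 << 14     # Corrected Internal Error


def unmask_command(bdf, offset, bit, cap_base=AER_CAP_BASE):
    """Build a setpci command that clears one mask bit and leaves the rest."""
    reg = cap_base + offset
    # setpci "value:mask" syntax: only the bits set in mask are changed.
    return f"setpci -s {bdf} {reg:#x}.l=0:{bit:x}"


print(unmask_command("0d:00.0", UNCOR_MASK_OFFSET, UNCOR_INTERNAL))
print(unmask_command("0d:00.0", COR_MASK_OFFSET, COR_INTERNAL))
```

This prints setpci -s 0d:00.0 0x208.l=0:400000 and setpci -s 0d:00.0 0x214.l=0:4000. The capability offset should be confirmed with lspci -vvv before writing to the device.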

BTW, is my understanding below about CXL RAS correct?

>> 2) The error injected by "pcie_aer_inject_error" is "protocol & link errors" of cxl.io?
>>    The error injected by "cxl-inject-uncorrectable-errors" or "cxl-inject-correctable-error" is "protocol & link errors" of cxl.cachemem

Many thanks
Yuquan




* Re: Enabling internal errors for VH CXL devices: [was: Re: Questions about CXL RAS injection test in qemu]
  2024-03-08  2:01 Enabling internal errors for VH CXL devices: [was: Re: Questions about CXL RAS injection test in qemu] Yuquan Wang
@ 2024-03-08 12:59 ` Jonathan Cameron via
  0 siblings, 0 replies; 7+ messages in thread
From: Jonathan Cameron via @ 2024-03-08 12:59 UTC (permalink / raw)
  To: Yuquan Wang; +Cc: Terry.Bowman, linux-cxl, qemu-devel

On Fri, 8 Mar 2024 10:01:34 +0800
Yuquan Wang <wangyuquan1236@phytium.com.cn> wrote:

> On 2024-03-07 20:10,  jonathan.cameron wrote:
> 
> > Hack is to find the relevant device with lspci -tv and then use
> > setpci -s 0d:00.0 0x208.l=0
> > to clear all the mask bits for uncorrectable errors.  
> 
> Thanks! The suggestions from you and Terry did work!
> 
> BTW, is my understanding below about CXL RAS correct?
> 
> >> 2) The error injected by "pcie_aer_inject_error" is "protocol & link errors" of cxl.io?
> >>    The error injected by "cxl-inject-uncorrectable-errors" or "cxl-inject-correctable-error" is "protocol & link errors" of cxl.cachemem  
> 
> Many thanks
> Yuquan
> 
Yes.  Note that the two CXL errors are actually signalled as AER uncorrectable / correctable
internal errors, with the details available on the EP in the CXL-specific registers.
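To make the AER side of this concrete, here is a minimal sketch that decodes the "error status/mask=00000040/0000e000" pair from Terry's aer-inject log earlier in the thread. The bit positions are those of the PCIe Correctable Error Status / Mask registers; the function name decode_cor and the table name are hypothetical, and this is not the kernel's own decoder:

```python
# Bit positions in the AER Correctable Error Status / Mask registers.
COR_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def decode_cor(status_mask):
    """Split a 'status/mask' hex pair and name the bits set in each half."""
    status, mask = (int(field, 16) for field in status_mask.split("/"))

    def names(value):
        return [name for bit, name in sorted(COR_BITS.items())
                if value & (1 << bit)]

    return names(status), names(mask)


reported, masked = decode_cor("00000040/0000e000")
print("reported:", reported)
print("masked:  ", masked)
```

For that log line the reported list is just Bad TLP, and the masked list contains Advisory Non-Fatal Error, Corrected Internal Error and Header Log Overflow. So in that log the corrected internal error bit is masked, which is the AER bit the CXL correctable injection relies on.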

Jonathan


