From: "Li, Ming" <ming4.li@intel.com>
To: Dan Williams <dan.j.williams@intel.com>, <rrichter@amd.com>,
<terry.bowman@amd.com>
Cc: <linux-cxl@vger.kernel.org>, <linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH 0/6] Add support for root port RAS error handling
Date: Fri, 15 Mar 2024 16:40:47 +0800
Message-ID: <16ab732d-a009-45ee-a438-3faf048c7acd@intel.com>
In-Reply-To: <65f3a842988d6_a9b4294f7@dwillia2-mobl3.amr.corp.intel.com.notmuch>
On 3/15/2024 9:45 AM, Dan Williams wrote:
> Li Ming wrote:
>> Protocol errors signaled to a CXL root port may be captured by a Root
>> Complex Event Collector (RCEC). If those errors are not cleared and
>> reported, the system owner loses forensic information for system failure
>> analysis.
>>
>> Per CXL r3.1 section 9.18.1.5, the recommendation for this case from the
>> CXL specification is the 'Else' statement in 'IMPLEMENTATION NOTE' under
>> 'Table 9-24 RDPAS Structure':
>>
>> "Probe all CXL Downstream Ports and determine whether they have logged an
>> error in the CXL.io or CXL.cachemem status registers."
>>
>> The CXL subsystem already supports RCH RAS Error handling that has a
>> dependency on the RCEC. Reuse and extend that RCH topology support to
>> handle reported errors in the VH topology case. The implementation is
>> composed of:
>> * Provide a new interface on the RCEC side to walk all devices under
>> an RCEC and the RCEC's associated bus ranges. The PCIe AER core uses
>> this interface to walk all CXL endpoints and all CXL root ports under
>> the bus ranges.
>> * Update the PCIe AER core to enable Uncorrectable Internal Error and
>> Correctable Internal Error reporting for root ports.
>
> Thanks for the above background.
>
>> * Invoke the cxl_pci error handler for RCEC reported errors.
>
> So what do you expect happens when a switch is involved? In the RCH case
> it knows that the only thing that can fire RCEC is a root complex
> integrated endpoint implementation driven by cxl_pci. In the VH case it
> could be a switch.
>
>> * Handle root-port errors in the cxl_pci handler when the device is
>> direct attached.
>
> I do expect direct-attach to be a predominant use case, but I want to
> make sure that the implementation at least does not make the switch port
> error handling case more difficult to implement.

Hi Dan,

My rough idea at the moment is as follows. If a CXL switch is connected to the CXL RP, there are two cases:
1. No CXL memory device is connected to the switch. In this case, I'm not sure whether CXL.cachemem protocol errors can still occur between the RP and the switch without a CXL memory device. If not, maybe we don't need to consider this case?
2. A CXL memory device is connected to the switch. I think the cxl_pci error handler could also help handle CXL.cachemem protocol errors that occur in the switch USP/DSP.
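
To make the two cases concrete, here is a toy user-space model, not kernel code. Every name in it (struct node, enum port_kind, memdev_handler_reachable()) is hypothetical and only illustrates the question "is there a memory device below this root port whose cxl_pci handler could be invoked?". It does not answer whether case 1 can actually raise CXL.cachemem errors.

```c
/* Toy model of the topology question; all names are hypothetical. */
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHILDREN 4

enum port_kind { ROOT_PORT, SWITCH_USP, SWITCH_DSP, MEMDEV };

struct node {
	enum port_kind kind;
	struct node *child[MAX_CHILDREN];	/* devices directly downstream */
	size_t nr_children;
};

/*
 * Return true if a CXL memory device exists at or below @n.
 *
 * Case 1 (switch, no memory device): returns false, so there is no
 * cxl_pci-bound device whose handler could be called.
 * Case 2 (memory device behind the switch): returns true, so a cxl_pci
 * handler exists and could, in principle, also check the USP/DSP on
 * its path to the root port.
 */
static bool memdev_handler_reachable(const struct node *n)
{
	if (n->kind == MEMDEV)
		return true;

	for (size_t i = 0; i < n->nr_children; i++)
		if (memdev_handler_reachable(n->child[i]))
			return true;

	return false;
}
```

In this model the direct-attach topology is simply a ROOT_PORT node whose only child is a MEMDEV, so it falls under the same check as case 2.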