From: "Bowman, Terry" <terry.bowman@amd.com>
To: dan.j.williams@intel.com, dave@stgolabs.net,
	jonathan.cameron@huawei.com, dave.jiang@intel.com,
	alison.schofield@intel.com, bhelgaas@google.com,
	shiju.jose@huawei.com, ming.li@zohomail.com,
	Smita.KoralahalliChannabasappa@amd.com, rrichter@amd.com,
	dan.carpenter@linaro.org, PradeepVineshReddy.Kodamati@amd.com,
	lukas@wunner.de, Benjamin.Cheatham@amd.com,
	sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-cxl@vger.kernel.org, terry.bowman@amd.com
Subject: Re: [PATCH v10 00/17] Enable CXL PCIe Port Protocol Error handling and logging
Date: Thu, 24 Jul 2025 10:58:16 -0500	[thread overview]
Message-ID: <f5434232-caac-4459-80d2-e981de0e9549@amd.com> (raw)
In-Reply-To: <68815a66459e4_134cc710012@dwillia2-xfh.jf.intel.com.notmuch>

[-- Attachment #1: Type: text/plain, Size: 3637 bytes --]



On 7/23/2025 4:55 PM, dan.j.williams@intel.com wrote:
> Terry Bowman wrote:
>> This patchset updates CXL Protocol Error handling for CXL Ports and CXL
>> Endpoints (EP). The reach of this patchset grew from CXL Ports to include
>> EPs as well.
> [..]
>> == Testing ==
>> Testing results below show the Upstream Switch Port UCE and EP UCE errors
>> are handled as PCI errors. This is because aer_get_device_error_info() does
>> not populate the AER error severity and status in the case of FATAL UCE on
>> Upstream Ports and Endpoints. This is intentional because the USP link to
>> access the device can be compromised. The is_cxl_error() and
>> is_internal_error() checks fail as a result, and the error is then processed
>> as a PCI error. Also, the AER event logging is missing the PCIe AER status.
> Are those issues "TODO" or permanent quirks of the implementation?
>
> Although, looking at the error messages, they all seem to correctly say
> "CXL Bus Error", so I guess I am not seeing the end-user-visible problem
> with the details you are pointing out here. I.e. LGTM.
This is a potential TODO. It needs to be investigated as a possible improvement
for gathering fatal error details that are otherwise not collected for CXL
Upstream Ports and Endpoints. BTW, the check that results in an empty
aer_err_info is in aer_get_device_error_info().

The AER log's bus type is determined by aer_err_bus(). This function only
checks pci_dev::is_cxl, so it reports the CXL bus type for errors from all of
these CXL devices.

Because the aer_err_info data is missing for these devices, is_cxl_error()
returns false and the error is processed as a PCIe error. The PCIe EP UCE
handler does check the RAS status and calls panic() if a UCE is found.

> [..]
>> == Root Port ==
>> root@tbowman-cxl:~/aer-inject# ./root-ce-inject.sh
> Where can I find these inject scripts?
>
I attached cxl-aer-einj.tgz. It contains the scripts, the patch that enables
UIE/CIE injection in the aer-inject userspace tool, and a cover letter with the
remote branch and base details.

This tooling does __not__ set the CXL RAS error status. I have been using a
non-upstreamed debug patch to set the RAS status. It's not ideal, but it allows
me to test. I can clean that patch up and post it here if needed.

>> pcieport 0000:0c:00.0: aer_inject: Injecting errors 00004000/00000000 into device 0000:0c:00.0
>> pcieport 0000:0c:00.0: AER: Correctable error message received from 0000:0c:00.0
>> aer_event: 0000:0c:00.0 CXL Bus Error: severity=Corrected, Corrected Internal Error, TLP Header=Not available
>> pcieport 0000:0c:00.0: CXL Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>> pcieport 0000:0c:00.0:   device [8086:7075] error status/mask=00004000/0000a000
>> pcieport 0000:0c:00.0:    [14] CorrIntErr    
>> cxl_aer_correctable_error: memdev=0000:0c:00.0 host=pci0000:0c serial=0 status='CRC Threshold Hit'
> Hmm, why "memdev=" for a root port error? Will take a look at what
> cxl_aer_correctable_error() is doing.

There was a discussion about this here:
https://lore.kernel.org/linux-cxl/aGSI7oXthPW-AY6D@aschofie-mobl2.lan/

> [..] 
>> base-commit: 716ba3023561ccacfaa28f988d26717535b8fed1
> I cannot find this commit in mainline nor linux-next. Please do try to
> base series on mainline tags, or otherwise push a public baseline branch
> somewhere. Helps reviewers and build bots.
The base commit in the cover letter should have been a403fe6c0b17
("cxl/edac: Fix potential memory leak issues").

I somehow committed an extra change to my local cxl/next branch. I'll delete
that branch, check out remote/cxl/next again, and recreate cxl/next from it.
-Terry

[-- Attachment #2: cxl-aer-einj.tgz --]
[-- Type: application/x-compressed, Size: 2604 bytes --]
