public inbox for linux-kernel@vger.kernel.org
From: Fan Ni <nifan.cxl@gmail.com>
To: "Bowman, Terry" <terry.bowman@amd.com>
Cc: Fan Ni <nifan.cxl@gmail.com>,
	ming4.li@intel.com, linux-cxl@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-pci@vger.kernel.org,
	dave@stgolabs.net, jonathan.cameron@huawei.com,
	dave.jiang@intel.com, alison.schofield@intel.com,
	vishal.l.verma@intel.com, dan.j.williams@intel.com,
	bhelgaas@google.com, mahesh@linux.ibm.com, ira.weiny@intel.com,
	oohall@gmail.com, Benjamin.Cheatham@amd.com, rrichter@amd.com,
	nathan.fontenot@amd.com, Smita.KoralahalliChannabasappa@amd.com
Subject: Re: [PATCH v2 0/14] Enable CXL PCIe port protocol error handling and logging
Date: Mon, 4 Nov 2024 13:48:23 -0800
Message-ID: <ZylBJ8mASGFVyDch@fan>
In-Reply-To: <341e5c63-8f1c-4b53-a6f0-bdd7483f0c93@amd.com>

On Mon, Nov 04, 2024 at 03:25:38PM -0600, Bowman, Terry wrote:
> 
> 
> On 11/1/2024 5:11 PM, Fan Ni wrote:
> > On Fri, Nov 01, 2024 at 01:28:12PM -0500, Bowman, Terry wrote:
> >> Hi Fan,
> >>
> >> I added comments below.
> >>
> >> On 11/1/2024 1:00 PM, Fan Ni wrote:
> >>> On Fri, Oct 25, 2024 at 04:02:51PM -0500, Terry Bowman wrote:
> >>>> This is a continuation of the CXL port error handling RFC from earlier.[1]
> >>>> The RFC resulted in the decision to add CXL PCIe port error handling to
> >>>> the existing RCH downstream port handling in the AER service driver. This
> >>>> patchset adds the CXL PCIe port protocol error handling and logging.
> >>>>
> >>>> The first 7 patches update the existing AER service driver to support CXL
> >>>> PCIe port protocol error handling and reporting. This includes AER service
> >>>> driver changes for adding correctable and uncorrectable error support, CXL
> >>>> specific recovery handling, and addition of CXL driver callback handlers.
> >>>>
> >>>> The following 7 patches address CXL driver support for CXL PCIe port
> >>>> protocol errors. This includes the following changes to the CXL drivers:
> >>>> mapping CXL port and downstream port RAS registers, interface updates for
> >>>> common restricted CXL host mode (RCH) and virtual hierarchy mode (VH),
> >>>> adding port specific error handlers, and protocol error logging.
> >>>>
> >>>> [1] - https://lore.kernel.org/linux-cxl/20240617200411.1426554-1-terry.bowman@amd.com/
> >>>>
> >>>> Testing:
> >>> Hi Terry,
> >>> I tried to test the patchset with aer_inject tool (with the patch you shared
> >>> in the last version), and hit some issues.
> >>> Could you help check and give some insights? Thanks.
> >>>
> >>> Below are some test setup info and results.
> >>>
> >>> I tested two topologies:
> >>>   a. one memdev directly attached to a HB with only one RP;
> >>>   b. a topology with cxl switch:
> >>>          HB
> >>>         /  \
> >>>       RP0   RP1
> >>>        |
> >>>      switch
> >>>        |
> >>>  ----------------
> >>>  |    |    |    |
> >>> mem0 mem1 mem2 mem3
> >>>
> >>> For both topologies, I cannot reproduce the system panic shown in your cover
> >>> letter.  
> >>>
> >>> btw, I tried building cxl both as modules and built into the kernel.
> >>>
> >>> Below, I will use the direct-attached topology (a) as an example to show what I
> >>> tried. I hope to get some clarity about the test and what I missed or did wrong.
> >>>
> >>> -------------------------------------
> >>> pci device info on the test VM 
> >>> root@fan:~# lspci
> >>> 00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
> >>> 00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
> >>> 00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
> >>> 00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> >>> 00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
> >>> 00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
> >>> 00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
> >>> 00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
> >>> 00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
> >>> 0c:00.0 PCI bridge: Intel Corporation Device 7075
> >>> 0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
> >>> root@fan:~# 
> >>> -------------------------------------
> >>>
> >>> The aer injection input file looks like below,
> >>>
> >>> -------------------------------------
> >>> fan:~/cxl/cxl-test-tool$ cat /tmp/internal 
> >>> AER
> >>> PCI_ID 0000:0c:00.0
> >>> UNCOR_STATUS INTERNAL
> >>> HEADER_LOG 0 1 2 3
> >>> ------------------------------------
> >>>
> >>> dmesg after aer injection 
> >>>
> >>> ssh root@localhost -p 2024 "dmesg"
> >>> [  613.195352] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000000/00400000 into device 0000:0c:00.0
> >>> [  613.195830] pcieport 0000:0c:00.0: AER: Uncorrectable (Fatal) error message received from 0000:0c:00.0
> >>> [  613.196253] pcieport 0000:0c:00.0: CXL Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> >>> [  613.198199] pcieport 0000:0c:00.0: AER: No uncorrectable error found. Continuing.
> >>> -----------------------------------
> >> This is likely because the device's CXL RAS status is not set and as a result returns false and bypasses the panic.
> >> Unfortunately, the aer-inject only sets the AER status and triggers the interrupt. The CXL RAS is not set.
> >>
> >> I attached 2 'test' patches. The first patch sets the device's RAS status to simulate the error reporting.
> >> This will have to be adjusted as the patch looks for a specific device's bus and this will likely be a different
> >> bus than that of the device you test in your setup.
> >>
> >> The 2nd patch enables UIE/CIE. I moved this out of the v2 patchset. I need to revisit this to see if it is
> >> needed in the patchset itself (not just a test patch).
> >>
> >> Regards,
> >> Terry
> >>
> > Hi Terry, 
> >
> > I checked the two patches you attached. Do we really need the first
> > patch to unmask internal errors? I see they are already unmasked in
> > aer_enable_internal_errors() which is called in aer_probe().
> > I tried to only apply the other patch and test again, it seems the test
> > output is the same as applying two patches. The system panics as well.
> >
> > Fan
> Hi Fan,
> 
> Which device did you inject into? RP, DSP, or USP?
> 
> Yes, the RP UIE & CIE are enabled by the AER driver. RCEC too. But, this is not done for CXL DSP
> and USP. Below are details from the spec describing how an AER error masked at the source will not
> be propagated as notification to the root complex (RP or RCEC).
> 
> 'If an individual error is masked when it is detected, its error status bit is still affected,
> but no error reporting Message is sent to the Root Complex, and the error is not recorded in the
> Header Log, TLP Prefix Log, or First Error Pointer.'[1]
> 
> [1] PCIe Spec 6.2.3.2.2 Masking Individual Errors
> 
> Also, there can be platform BIOS settings that enable/disable UIE/CIE.
> 
> Regards,
> Terry
Oh, I see. I did inject into the RP in my previous setup, and I have
confirmed that the downstream port case needs the extra unmask.
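
For anyone repeating the test, here is a minimal sketch of how the masking
check could be done offline. It assumes the bit positions used by the kernel's
include/uapi/linux/pci_regs.h (PCI_ERR_UNC_INTN and PCI_ERR_COR_INTERNAL) and
that the two words in the aer_inject log line are the correctable and
uncorrectable status, in that order. The helper names are hypothetical, and
the mask values are expected to come from the port's AER Uncorrectable Error
Mask (offset 0x08) and Correctable Error Mask (offset 0x14) registers.

```python
# Bit positions follow the PCIe AER register layout; the constant names
# mirror the kernel's include/uapi/linux/pci_regs.h.
PCI_ERR_UNC_INTN = 0x00400000      # Uncorrectable Internal Error (UIE), bit 22
PCI_ERR_COR_INTERNAL = 0x00004000  # Corrected Internal Error (CIE), bit 14


def internal_errors_masked(uncor_mask: int, cor_mask: int) -> dict:
    """Report whether UIE/CIE are masked, given raw AER mask register values.

    Hypothetical helper: a set mask bit means the error is masked at the
    source and no error message is sent to the Root Complex.
    """
    return {
        "UIE": bool(uncor_mask & PCI_ERR_UNC_INTN),
        "CIE": bool(cor_mask & PCI_ERR_COR_INTERNAL),
    }


def parse_injected(line: str) -> tuple:
    """Extract (cor_status, uncor_status) from an aer_inject dmesg line.

    Assumes the 'Injecting errors <cor>/<uncor>' layout seen in the log above.
    """
    field = line.split("Injecting errors ")[1].split()[0]
    cor, uncor = (int(word, 16) for word in field.split("/"))
    return cor, uncor
```

With the log line from the RP test, parse_injected() yields an uncorrectable
status equal to PCI_ERR_UNC_INTN, which matches the "UNCOR_STATUS INTERNAL"
entry in the injection file. Running internal_errors_masked() on the mask
values of a DSP or USP would show whether the extra unmask is still needed
there.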

Thanks for the info.

Fan
> 

-- 
Fan Ni
