public inbox for linux-kernel@vger.kernel.org
From: Fan Ni <nifan.cxl@gmail.com>
To: Terry Bowman <Terry.Bowman@amd.com>
Cc: fan <nifan.cxl@gmail.com>,
	dan.j.williams@intel.com, ira.weiny@intel.com, dave@stgolabs.net,
	dave.jiang@intel.com, alison.schofield@intel.com,
	ming4.li@intel.com, vishal.l.verma@intel.com,
	jim.harris@samsung.com, ilpo.jarvinen@linux.intel.com,
	ardb@kernel.org, sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	Yazen.Ghannam@amd.com, Robert.Richter@amd.com,
	a.manzanares@samsung.com
Subject: Re: [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
Date: Mon, 19 Aug 2024 11:17:54 -0700
Message-ID: <ZsOMUi_dMhakCkit@fan>
In-Reply-To: <6db43744-1d92-482a-852f-8d43efa55b74@amd.com>

On Mon, Aug 19, 2024 at 11:21:01AM -0500, Terry Bowman wrote:
> Hi Fan
> 
> On 7/25/24 13:49, fan wrote:
> > On Mon, Jun 17, 2024 at 03:04:02PM -0500, Terry Bowman wrote:
> >> This patchset provides RAS logging for CXL root ports, CXL downstream
> >> switch ports, and CXL upstream switch ports. This includes changes to
> >> use a portdrv notifier chain to communicate CXL AER/RAS errors to a
> >> cxl_pci callback.
> >>
> >> The first 3 patches prepare for and add an atomic notifier chain to the
> >> portdrv driver. The portdrv's notifier chain reports the port device's
> >> AER internal errors to the registered callback(s). The preparation changes
> >> include a portdrv update to call the uncorrectable handler for PCIe root
> >> ports and PCIe downstream switch ports. Also, the AER correctable error
> >> (CE) status is made available to the AER CE handler.
> >>
> >> The next 4 patches are in preparation for adding an atomic notification
> >> callback in the cxl_pci driver. This is for receiving AER internal error
> >> events from the portdrv notifier chain. Preparation includes adding RAS
> >> register block mapping, adding trace functions for logging, and
> >> refactoring cxl_pci RAS functions for reuse.
> >>
> >> The final 2 patches enable the AER internal error interrupts.
> >>
> >> Testing RAS CE/UCE:
> >>   QEMU was used for testing CXL root port, CXL downstream switch port, and
> >>   CXL upstream switch port. The aer-inject tool was used to inject AER and
> >>   a test patch was used to set the AER CIE/UIE and RAS CE/UCE status during
> >>   testing. Testing passed with no issues.
> > 
> > Hi Terry,
> > 
> > Could you share a little more about the QEMU test setup?
> > From what I can see, QEMU can currently inject errors only into
> > type 3 devices. Is that true? If not, how do you inject errors into
> > port devices? Do we need a hack there?
> > 
> > Also, is the aer-inject tool you mentioned the one currently in the kernel
> > or something else?
> > https://elixir.bootlin.com/linux/v6.10-rc6/source/drivers/pci/pcie/aer_inject.c
> > 
> > Thanks,
> > Fan
> > 
> Sorry for the late response.
> 
> I used AMD RAS injection for testing HW root ports.
> 
> I used QEMU and the legacy aer-inject userspace tool to test switch ports (USP/DSP).[1]
> I added a couple of test patches to set the AER UIE/CIE because the tool doesn't support
> injecting UIE or CIE bits. I used a test patch for assigning the RAS status as well.
> 
> Regards,
> Terry
> 
> [1] - https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git/about/
> 

Hi Terry,
Thanks for the reply. I was able to inject AER errors using the aer_inject
kernel module and the userspace tool.
I am now trying to exercise the code in this patchset.

Fan

> > 
> >>  
> >>   An AMD platform with the AMD RAS error injection tool was used for
> >>   testing CXL root port injection. Testing passed with no issues.
> >>
> >>   TODO - regression test CXL1.1 RCH handling.
> >>
> >> Solutions Considered (1-4):
> >>   Below are solutions that were considered. Solution #4 is
> >>   implemented in this patchset. 
> >>
> >>   1.) Reassigning portdrv error handler for CXL port devices
> >>   
> >>   This solution was based on reassigning the portdrv's CE/UCE err_handler
> >>   to be CXL cxl_pci driver functions.
> >>   
> >>   I started with this solution and once the flow was working I realized
> >>   the endpoint removal would have to be addressed as well. While this
> >>   could be resolved, it does highlight the odd coupling and dependency
> >>   between the CXL port devices' error handling and the cxl_pci endpoint's
> >>   handlers. Also, the err_handler re-assignment at runtime required
> >>   ignoring the 'const' definition. I don't believe this should be
> >>   considered as a possible solution.
> >>   
> >>   2.) Update the AER driver to call cxl_pci driver's error handler before
> >>   calling pci_aer_handle_error()
> >>
> >>   This is similar to the existing RCH port error approach in aer.c.
> >>   In this solution the AER driver searches for a downstream CXL endpoint
> >>   to 'handle' detected CXL port protocol errors.
> >>
> >>   This is a good solution to consider if the one presented in this patchset
> >>   is not acceptable. I was initially reluctant to take this approach because it
> >>   adds more CXL coupling to the AER driver. But, I think this solution
> >>   would technically work. I believe Ming was working towards this
> >>   solution.
> >>
> >>   3.) Refactor portdrv
> >>   The portdrv refactoring solution is to change the portdrv service drivers
> >>   into PCIe auxiliary drivers. With this change the facility drivers can be
> >>   associated with a PCIe driver instead of being fixed to the portdrv driver.
> >>
> >>   In this case the CXL port functionality would be added either as a CXL
> >>   auxiliary driver or as a CXL specific port driver
> >>   (PCI_CLASS_BRIDGE_PCI_NORMAL).
> >>
> >>   This solution has challenges in the interrupt allocation by separate
> >>   auxiliary drivers and in binding of a specific driver. Binding is
> >>   currently based on PCIe class and would require extending the binding
> >>   logic to support multiple drivers for the same class.
> >>
> >>   Jonathan Cameron is working towards this solution by initially solving
> >>   for the PMU service driver.[1] It is using the auxiliary bus to associate
> >>   what were service drivers with the portdrv driver. Using a CXL auxiliary
> >>   for handling CXL port RAS errors would result in RAS logic called from
> >>   the cxl_pci and CXL auxiliary drivers. This may need a library driver.
> >>
> >>   4.) Using a portdrv notifier chain/callback for CIE/UIE
> >>   (Implemented in this patchset)
> >>
> >>   This solution uses a portdrv atomic chain notifier and a cxl_pci
> >>   callback to handle and log CXL port RAS errors.
> >>   
> >>   I chose this after trying solution #1 above. A few advantages of
> >>   this solution are:
> >>   - It is a general port implementation for the CIE/UIE-specific handling
> >>   mentioned in the PCIe spec.[2]
> >>   - A notifier is already used in the RAS MCE driver, an existing example.
> >>   - Does not introduce further CXL dependencies into the AER driver.
> >>   - The notifier chain provides registration/unregistration and
> >>   synchronization.
> >>
> >>   A disadvantage of this approach is that coupling still exists between the CXL
> >>   port's driver (portdrv) and the cxl_pci driver. The CXL port device's RAS
> >>   is handled by a notifier callback in the cxl_pci endpoint driver.
> >>
> >>   Most of the patches in this patchset could be reused to work with
> >>   solution #3 or solution #2. The atomic notifier could be dropped in
> >>   favor of an auxiliary device or AER driver awareness. The other
> >>   changes in this patchset could possibly be reused.
> >>
> >>   [1] Kernel.org -
> >>   https://lore.kernel.org/all/f4b23710-059a-51b7-9d27-b62e8b358b54@linux.intel.com
> >>   [2] PCI6.0 - 6.2.10 Internal errors
> >>
> >>  drivers/cxl/core/core.h    |   4 +
> >>  drivers/cxl/core/pci.c     | 153 ++++++++++++++++++++++++++++++++-----
> >>  drivers/cxl/core/port.c    |   6 +-
> >>  drivers/cxl/core/trace.h   |  34 +++++++++
> >>  drivers/cxl/cxl.h          |  10 +++
> >>  drivers/cxl/cxlpci.h       |   2 +
> >>  drivers/cxl/mem.c          |  32 +++++++-
> >>  drivers/cxl/pci.c          |  19 ++++-
> >>  drivers/pci/pcie/aer.c     |  10 ++-
> >>  drivers/pci/pcie/err.c     |  20 +++++
> >>  drivers/pci/pcie/portdrv.c |  32 ++++++++
> >>  drivers/pci/pcie/portdrv.h |   2 +
> >>  include/linux/aer.h        |   6 ++
> >>  13 files changed, 303 insertions(+), 27 deletions(-)
> >>
> >>
> >> base-commit: ca3d4767c8054447ac2a58356080e299a59e05b8
> >> -- 
> >> 2.34.1
> >>
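
For readers following solution 4 in the quoted cover letter, the
register/unregister/call pattern of a notifier chain can be sketched in
plain userspace C. This is only a model under stated assumptions: every
name below (chain_head, chain_register, EVT_CIE, log_port_error, and so
on) is hypothetical, none of it is the kernel notifier API or code from
this patchset, and the locking a real atomic notifier chain provides is
left out.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model of a notifier chain. All names are hypothetical;
 * this is not the kernel API and it does no locking. */
enum { CHAIN_DONE = 0, CHAIN_OK = 1, CHAIN_STOP = 2 };

struct chain_block {
	int (*call)(struct chain_block *nb, unsigned long event, void *data);
	struct chain_block *next;
	int priority;			/* higher priority runs first */
};

struct chain_head {
	struct chain_block *first;
};

/* Insert nb so the list stays sorted by descending priority. */
void chain_register(struct chain_head *head, struct chain_block *nb)
{
	struct chain_block **pos = &head->first;

	assert(nb && nb->call);
	while (*pos && (*pos)->priority >= nb->priority)
		pos = &(*pos)->next;
	nb->next = *pos;
	*pos = nb;
}

/* Unlink nb. Returns 0 on success, -1 if nb was not registered. */
int chain_unregister(struct chain_head *head, struct chain_block *nb)
{
	struct chain_block **pos;

	for (pos = &head->first; *pos; pos = &(*pos)->next) {
		if (*pos == nb) {
			*pos = nb->next;
			nb->next = NULL;
			return 0;
		}
	}
	return -1;
}

/* Run callbacks in order until one returns CHAIN_STOP.
 * Returns the last callback's result, or CHAIN_DONE for an empty chain. */
int chain_call(struct chain_head *head, unsigned long event, void *data)
{
	struct chain_block *nb = head->first;
	int ret = CHAIN_DONE;

	while (nb) {
		struct chain_block *next = nb->next;

		ret = nb->call(nb, event, data);
		if (ret == CHAIN_STOP)
			break;
		nb = next;
	}
	return ret;
}

/* Hypothetical event codes standing in for correctable (CIE) and
 * uncorrectable (UIE) internal errors reported by a port. */
enum { EVT_CIE = 1, EVT_UIE = 2 };

struct port_error_log {
	unsigned int cie_count;
	unsigned int uie_count;
};

/* Example subscriber: counts the events it recognizes, ignores others. */
int log_port_error(struct chain_block *nb, unsigned long event, void *data)
{
	struct port_error_log *log = data;

	(void)nb;
	if (event == EVT_CIE)
		log->cie_count++;
	else if (event == EVT_UIE)
		log->uie_count++;
	else
		return CHAIN_DONE;
	return CHAIN_OK;
}
```

In this model the reporting side only ever calls chain_call(), and the
subscriber can come and go through chain_register()/chain_unregister().
That is the decoupling the cover letter lists as an advantage; the
remaining coupling it mentions is that someone still has to register a
callback that understands the port's errors.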


