All of lore.kernel.org
 help / color / mirror / Atom feed
From: fan <nifan.cxl@gmail.com>
To: Terry Bowman <terry.bowman@amd.com>
Cc: dan.j.williams@intel.com, ira.weiny@intel.com, dave@stgolabs.net,
	dave.jiang@intel.com, alison.schofield@intel.com,
	ming4.li@intel.com, vishal.l.verma@intel.com,
	jim.harris@samsung.com, ilpo.jarvinen@linux.intel.com,
	ardb@kernel.org, sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	Yazen.Ghannam@amd.com, Robert.Richter@amd.com,
	a.manzanares@samsung.com
Subject: Re: [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
Date: Thu, 25 Jul 2024 11:49:01 -0700	[thread overview]
Message-ID: <ZqKeHWsZtDb1RKfh@debian> (raw)
In-Reply-To: <20240617200411.1426554-1-terry.bowman@amd.com>

On Mon, Jun 17, 2024 at 03:04:02PM -0500, Terry Bowman wrote:
> This patchset provides RAS logging for CXL root ports, CXL downstream
> switch ports, and CXL upstream switch ports. This includes changes to
> use a portdrv notifier chain to communicate CXL AER/RAS errors to a
> cxl_pci callback.
> 
> The first 3 patches prepare for and add an atomic notifier chain to the
> portdrv driver. The portdrv's notifier chain reports the port device's
> AER internal errors to the registered callback(s). The preparation changes
> include a portdrv update to call the uncorrectable handler for PCIe root
> ports and PCIe downstream switch ports. Also, the AER correctable error
> (CE) status is made available to the AER CE handler.
> 
> The next 4 patches are in preparation for adding an atomic notification
> callback in the cxl_pci driver. This is for receiving AER internal error
> events from the portdrv notifier chain. Preparation includes adding RAS
> register block mapping, adding trace functions for logging, and
> refactoring cxl_pci RAS functions for reuse.
> 
> The final 2 patches enable the AER internal error interrupts.
> 
> Testing RAS CE/UCE:
>   QEMU was used for testing CXL root port, CXL downstream switch port, and
>   CXL upstream switch port. The aer-inject tool was used to inject AER and
>   a test patch was used to set the AER CIE/UIE and RAS CE/UCE status during
>   testing. Testing passed with no issues.

Hi Terry,

Could you share a little more about the qemu test setup?
From what I see, it seems currently qemu can only inject error to
type3 devices, is that true? Or how to do that for port devices?
Do we need a hack there?

Also, is the aer-inject tool you mentioned the one currently in the kernel
or something else?
https://elixir.bootlin.com/linux/v6.10-rc6/source/drivers/pci/pcie/aer_inject.c

Thanks,
Fan


>  
>   An AMD platform with the AMD RAS error injection tool was used for
>   testing CXL root port injection. Testing passed with no issues.
> 
>   TODO - regression test CXL1.1 RCH handling.
> 
> Solutions Considered (1-4):
>   Below are solutions that were considered. Solution #4 is
>   implemented in this patchset. 
> 
>   1.) Reassigning portdrv error handler for CXL port devices
>   
>   This solution was based on reassigning the portdrv's CE/UCE err_handler
>   to be CXL cxl_pci driver functions.
>   
>   I started with this solution and once the flow was working I realized
>   the endpoint removal would have to be addressed as well. While this
>   could be resolved it does highlight the odd coupling and dependency
>   between the CXL port devices error handling with cxl_pci endpoint's
>   handlers. Also, the err_handler re-assignment at runtime required
>   ignoring the 'const' definition. I don't believe this should be
>   considered as a possible solution.
>   
>   2.) Update the AER driver to call cxl_pci driver's error handler before
>   calling pci_aer_handle_error()
> 
>   This is similar to the existing RCH port error approach in aer.c.
>   In this solution the AER driver searches for a downstream CXL endpoint
>   to 'handle' detected CXL port protocol errors.
> 
>   This is a good solution to consider if the one presented in this patchset
>   is not acceptable. I was initially reluctant to this approach because it
>   adds more CXL coupling to the AER driver. But, I think this solution
>   would technically work. I believe Ming was working towards this
>   solution.
> 
>   3.) Refactor portdrv
>   The portdrv refactoring solution is to change the portdrv service drivers
>   into PCIe auxiliary drivers. With this change the facility drivers can be
>   associated with a PCIe driver instead fixed bound to the portdrv driver.
> 
>   In this case the CXL port functionality would be added either as a CXL
>   auxiliary driver or as a CXL specific port driver
>   (PCI_CLASS_BRIDGE_PCI_NORMAL).
> 
>   This solution has challenges in the interrupt allocation by separate
>   auxiliary drivers and in binding of a specific driver. Binding is
>   currently based on PCIe class and would require extending the binding
>   logic to support multiple drivers for the same class.
> 
>   Jonathan Cameron is working towards this solution by initially solving
>   for the PMU service driver.[1] It is using the auxiliary bus to associate
>   what were service drivers with the portdrv driver. Using a CXL auxiliary
>   for handling CXL port RAS errors would result in RAS logic called from
>   the cxl_pci and CXL auxiliary drivers. This may need a library driver.
> 
>   4.) Using a portdrv notifier chain/callback for CIE/UIE
>   (Implemented in this patchset)
> 
>   This solution uses a portdrv atomic chain notifier and a cxl_pci
>   callback to handle and log CXL port RAS errors.
>   
>   I chose this after trying solution#1 above. I see a couple advantages to
>   this solution are:
>   - Is general port implementation for CIE/UIE specific handling mentioned
>   in the PCIe spec.[2]
>   - Notifier is used in RAS MCE driver as an existing example.
>   - Does not introduce further CXL dependencies into the AER driver.
>   - The notifier chain provides registration/unregistration and
>   synchronization.
> 
>   A disadvantage of this approach is coupling still exists between the CXL
>   port's driver (portdrv) and the cxl_pci driver. The CXL port device's RAS
>   is handled by a notifier callback in the cxl_pci endpoint driver.
> 
>   Most of the patches in this patchset could be reused to work with
>   solution#3 or solution#2. The atomic notifier could be dropped and
>   instead use an auxiliary device or AER driver awareness. The other
>   changes in this patchset could possibly be reused.
> 
>   [1] Kernel.org -
>   https://lore.kernel.org/all/f4b23710-059a-51b7-9d27-b62e8b358b54@linux.intel.com
>   [2] PCI6.0 - 6.2.10 Internal errors
> 
>  drivers/cxl/core/core.h    |   4 +
>  drivers/cxl/core/pci.c     | 153 ++++++++++++++++++++++++++++++++-----
>  drivers/cxl/core/port.c    |   6 +-
>  drivers/cxl/core/trace.h   |  34 +++++++++
>  drivers/cxl/cxl.h          |  10 +++
>  drivers/cxl/cxlpci.h       |   2 +
>  drivers/cxl/mem.c          |  32 +++++++-
>  drivers/cxl/pci.c          |  19 ++++-
>  drivers/pci/pcie/aer.c     |  10 ++-
>  drivers/pci/pcie/err.c     |  20 +++++
>  drivers/pci/pcie/portdrv.c |  32 ++++++++
>  drivers/pci/pcie/portdrv.h |   2 +
>  include/linux/aer.h        |   6 ++
>  13 files changed, 303 insertions(+), 27 deletions(-)
> 
> 
> base-commit: ca3d4767c8054447ac2a58356080e299a59e05b8
> -- 
> 2.34.1
> 

  parent reply	other threads:[~2024-07-25 18:49 UTC|newest]

Thread overview: 59+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-17 20:04 [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports Terry Bowman
2024-06-17 20:04 ` [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers Terry Bowman
2024-06-20 11:21   ` Jonathan Cameron
2024-06-24 14:58     ` Terry Bowman
2024-06-21 19:17   ` Dan Williams
2024-06-24 17:56     ` Terry Bowman
2024-07-10 20:48       ` nifan.cxl
2024-07-10 21:48         ` Terry Bowman
2024-07-11  1:14           ` fan
2024-08-19 18:35       ` Fan Ni
2024-06-17 20:04 ` [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register Terry Bowman
2024-06-20 11:31   ` Jonathan Cameron
2024-06-24 15:08     ` Terry Bowman
2024-06-21 19:23   ` Dan Williams
2024-06-24 18:00     ` Terry Bowman
2024-06-17 20:04 ` [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors Terry Bowman
2024-06-20 12:30   ` Jonathan Cameron
2024-06-24 15:22     ` Terry Bowman
2024-06-21 19:36   ` Dan Williams
2024-06-24 18:21     ` Terry Bowman
2024-06-24 21:46       ` Dan Williams
2024-06-25 14:41         ` Terry Bowman
2024-06-26  2:54   ` Li, Ming4
2024-06-26 13:39     ` Terry Bowman
2024-06-17 20:04 ` [RFC PATCH 4/9] cxl/pci: Map CXL PCIe ports' RAS registers Terry Bowman
2024-06-20 12:46   ` Jonathan Cameron
2024-06-24 15:51     ` Terry Bowman
2024-07-02 15:18       ` Jonathan Cameron
2024-06-26  3:39   ` Li, Ming4
2024-06-17 20:04 ` [RFC PATCH 5/9] cxl/pci: Update RAS handler interfaces to support CXL PCIe ports Terry Bowman
2024-06-20 12:49   ` Jonathan Cameron
2024-07-15 17:50   ` nifan.cxl
2024-06-17 20:04 ` [RFC PATCH 6/9] cxl/pci: Add trace logging for CXL PCIe port RAS errors Terry Bowman
2024-06-20 12:53   ` Jonathan Cameron
2024-06-24 15:53     ` Terry Bowman
2024-07-02 15:53       ` Jonathan Cameron
2024-06-17 20:04 ` [RFC PATCH 7/9] cxl/pci: Add atomic notifier callback for CXL PCIe port AER internal errors Terry Bowman
2024-06-20 13:09   ` Jonathan Cameron
2024-06-24 16:09     ` Terry Bowman
2024-07-02 15:58       ` Jonathan Cameron
2024-06-26  6:22   ` Li, Ming4
2024-06-26 13:51     ` Terry Bowman
2024-06-17 20:04 ` [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors() Terry Bowman
2024-06-19  7:09   ` Christoph Hellwig
2024-06-19 15:40     ` Terry Bowman
2024-06-20 13:11   ` Jonathan Cameron
2024-06-24 16:22     ` Terry Bowman
2024-07-10 21:47   ` Bjorn Helgaas
2024-06-17 20:04 ` [RFC PATCH 9/9] cxl/pci: Enable interrupts for CXL PCIe ports' AER internal errors Terry Bowman
2024-06-20 13:15   ` Jonathan Cameron
2024-06-24 16:46     ` Terry Bowman
2024-07-02 16:00       ` Jonathan Cameron
2024-06-21 19:04 ` [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports Dan Williams
2024-06-24 17:47   ` Terry Bowman
2024-06-24 20:51     ` Dan Williams
2024-06-25 14:29       ` Terry Bowman
2024-07-25 18:49 ` fan [this message]
2024-08-19 16:21   ` Terry Bowman
2024-08-19 18:17     ` Fan Ni

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZqKeHWsZtDb1RKfh@debian \
    --to=nifan.cxl@gmail.com \
    --cc=Robert.Richter@amd.com \
    --cc=Yazen.Ghannam@amd.com \
    --cc=a.manzanares@samsung.com \
    --cc=alison.schofield@intel.com \
    --cc=ardb@kernel.org \
    --cc=dan.j.williams@intel.com \
    --cc=dave.jiang@intel.com \
    --cc=dave@stgolabs.net \
    --cc=ilpo.jarvinen@linux.intel.com \
    --cc=ira.weiny@intel.com \
    --cc=jim.harris@samsung.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ming4.li@intel.com \
    --cc=sathyanarayanan.kuppuswamy@linux.intel.com \
    --cc=terry.bowman@amd.com \
    --cc=vishal.l.verma@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.