[RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports

* [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
@ 2024-06-17 20:04 Terry Bowman
  2024-06-17 20:04 ` [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers Terry Bowman
                   ` (10 more replies)
  0 siblings, 11 replies; 59+ messages in thread
From: Terry Bowman @ 2024-06-17 20:04 UTC (permalink / raw)
  To: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel, terry.bowman,
	Yazen.Ghannam, Robert.Richter

This patchset provides RAS logging for CXL root ports, CXL downstream
switch ports, and CXL upstream switch ports. This includes changes to
use a portdrv notifier chain to communicate CXL AER/RAS errors to a
cxl_pci callback.

The first 3 patches prepare for and add an atomic notifier chain to the
portdrv driver. The portdrv's notifier chain reports the port device's
AER internal errors to the registered callback(s). The preparation changes
include a portdrv update to call the uncorrectable handler for PCIe root
ports and PCIe downstream switch ports. Also, the AER correctable error
(CE) status is made available to the AER CE handler.

The next 4 patches are in preparation for adding an atomic notification
callback in the cxl_pci driver. This is for receiving AER internal error
events from the portdrv notifier chain. Preparation includes adding RAS
register block mapping, adding trace functions for logging, and
refactoring cxl_pci RAS functions for reuse.

The final 2 patches enable the AER internal error interrupts.

Testing RAS CE/UCE:
  QEMU was used for testing CXL root port, CXL downstream switch port, and
  CXL upstream switch port. The aer-inject tool was used to inject AER
  errors, and a test patch was used to set the AER CIE/UIE and RAS CE/UCE
  status during testing. Testing passed with no issues.

  An AMD platform with the AMD RAS error injection tool was used for
  testing CXL root port injection. Testing passed with no issues.

  TODO - regression test CXL 1.1 RCH handling.

Solutions Considered (1-4):
  Below are solutions that were considered. Solution #4 is
  implemented in this patchset. 

  1.) Reassigning portdrv error handler for CXL port devices

  This solution was based on reassigning the portdrv's CE/UCE err_handler
  to be CXL cxl_pci driver functions.

  I started with this solution, and once the flow was working I realized
  that endpoint removal would have to be addressed as well. While this
  could be resolved, it highlights the odd coupling and dependency
  between the CXL port devices' error handling and the cxl_pci endpoint's
  handlers. Also, reassigning the err_handler at runtime required
  ignoring the 'const' definition. I don't believe this should be
  considered as a possible solution.

  2.) Update the AER driver to call cxl_pci driver's error handler before
  calling pci_aer_handle_error()

  This is similar to the existing RCH port error approach in aer.c.
  In this solution the AER driver searches for a downstream CXL endpoint
  to 'handle' detected CXL port protocol errors.

  This is a good solution to consider if the one presented in this patchset
  is not acceptable. I was initially reluctant to take this approach
  because it adds more CXL coupling to the AER driver, but I think this
  solution would technically work. I believe Ming was working towards this
  solution.

  3.) Refactor portdrv

  The portdrv refactoring solution is to change the portdrv service drivers
  into PCIe auxiliary drivers. With this change the facility drivers can be
  associated with a PCIe driver instead of being bound only to the portdrv
  driver.

  In this case the CXL port functionality would be added either as a CXL
  auxiliary driver or as a CXL specific port driver
  (PCI_CLASS_BRIDGE_PCI_NORMAL).

  This solution has challenges in the interrupt allocation by separate
  auxiliary drivers and in binding of a specific driver. Binding is
  currently based on PCIe class and would require extending the binding
  logic to support multiple drivers for the same class.

  Jonathan Cameron is working towards this solution by initially solving
  for the PMU service driver.[1] His work uses the auxiliary bus to
  associate what were service drivers with the portdrv driver. Using a CXL
  auxiliary driver for handling CXL port RAS errors would result in RAS
  logic being called from both the cxl_pci and CXL auxiliary drivers. This
  may need a library driver.

  4.) Using a portdrv notifier chain/callback for CIE/UIE
  (Implemented in this patchset)

  This solution uses a portdrv atomic chain notifier and a cxl_pci
  callback to handle and log CXL port RAS errors.
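
  To make the shape of this concrete, here is a minimal userspace sketch
  of a notifier chain with one registered callback. It is only an
  illustration: every model_* name and the MODEL_* event codes are
  hypothetical, and there is no locking. In the kernel the corresponding
  calls are atomic_notifier_chain_register(),
  atomic_notifier_chain_unregister() and atomic_notifier_call_chain(),
  which also take care of the synchronization.

```c
#include <stddef.h>

/* Hypothetical return codes and event codes for this model only. */
#define MODEL_NOTIFY_DONE 0	/* callback did not care about the event */
#define MODEL_NOTIFY_OK   1	/* callback handled the event */

enum { MODEL_AER_CIE = 1, MODEL_AER_UIE = 2 };

struct model_notifier {
	int (*call)(struct model_notifier *nb, unsigned long event, void *data);
	struct model_notifier *next;
};

struct model_chain {
	struct model_notifier *head;
};

/* Add a callback at the front of the chain. */
void model_chain_register(struct model_chain *c, struct model_notifier *nb)
{
	nb->next = c->head;
	c->head = nb;
}

/* Unlink a callback; does nothing if it is not on the chain. */
void model_chain_unregister(struct model_chain *c, struct model_notifier *nb)
{
	for (struct model_notifier **p = &c->head; *p; p = &(*p)->next) {
		if (*p == nb) {
			*p = nb->next;
			nb->next = NULL;
			return;
		}
	}
}

/* Invoke every callback; return how many reported MODEL_NOTIFY_OK. */
int model_chain_call(struct model_chain *c, unsigned long event, void *data)
{
	int handled = 0;

	for (struct model_notifier *nb = c->head; nb; nb = nb->next)
		if (nb->call(nb, event, data) == MODEL_NOTIFY_OK)
			handled++;
	return handled;
}

/* Stand-in for the cxl_pci side: count the internal errors it is told about. */
struct model_ras_log {
	int ce;
	int uce;
};

int model_cxl_ras_cb(struct model_notifier *nb, unsigned long event, void *data)
{
	struct model_ras_log *log = data;

	(void)nb;
	if (event == MODEL_AER_CIE)
		log->ce++;
	else if (event == MODEL_AER_UIE)
		log->uce++;
	else
		return MODEL_NOTIFY_DONE;
	return MODEL_NOTIFY_OK;
}
```

  In this model the port side would call model_chain_call() from its
  CE/UCE handlers, and model_cxl_ras_cb() stands in for the cxl_pci RAS
  logging. Once the callback is unregistered the port side simply finds
  no listener.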

  I chose this after trying solution #1 above. I see several advantages to
  this solution:
  - It is a general port implementation for the CIE/UIE-specific handling
    mentioned in the PCIe spec.[2]
  - The RAS MCE driver already uses a notifier, so there is an existing
    example.
  - It does not introduce further CXL dependencies into the AER driver.
  - The notifier chain provides registration/unregistration and
    synchronization.

  A disadvantage of this approach is that coupling still exists between
  the CXL port's driver (portdrv) and the cxl_pci driver. The CXL port
  device's RAS is handled by a notifier callback in the cxl_pci endpoint
  driver.

  Most of the patches in this patchset could be reused with solution #2
  or solution #3. The atomic notifier would be dropped in favor of an
  auxiliary device or AER driver awareness, while the other changes could
  possibly be reused.

  [1] Kernel.org -
  https://lore.kernel.org/all/f4b23710-059a-51b7-9d27-b62e8b358b54@linux.intel.com
  [2] PCIe 6.0 - 6.2.10 Internal Errors

 drivers/cxl/core/core.h    |   4 +
 drivers/cxl/core/pci.c     | 153 ++++++++++++++++++++++++++++++++-----
 drivers/cxl/core/port.c    |   6 +-
 drivers/cxl/core/trace.h   |  34 +++++++++
 drivers/cxl/cxl.h          |  10 +++
 drivers/cxl/cxlpci.h       |   2 +
 drivers/cxl/mem.c          |  32 +++++++-
 drivers/cxl/pci.c          |  19 ++++-
 drivers/pci/pcie/aer.c     |  10 ++-
 drivers/pci/pcie/err.c     |  20 +++++
 drivers/pci/pcie/portdrv.c |  32 ++++++++
 drivers/pci/pcie/portdrv.h |   2 +
 include/linux/aer.h        |   6 ++
 13 files changed, 303 insertions(+), 27 deletions(-)

base-commit: ca3d4767c8054447ac2a58356080e299a59e05b8
-- 
2.34.1

* [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers
  2024-06-17 20:04 [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports Terry Bowman
@ 2024-06-17 20:04 ` Terry Bowman
  2024-06-20 11:21   ` Jonathan Cameron
  2024-06-21 19:17   ` Dan Williams
  2024-06-17 20:04 ` [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register Terry Bowman
                   ` (9 subsequent siblings)
  10 siblings, 2 replies; 59+ messages in thread
From: Terry Bowman @ 2024-06-17 20:04 UTC (permalink / raw)
  To: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel, terry.bowman,
	Yazen.Ghannam, Robert.Richter
  Cc: Bjorn Helgaas, linux-pci

The AER service driver does not currently call a handler for AER
uncorrectable errors (UCE) detected in root ports or downstream
ports. This is not needed in most cases because common PCIe port
functionality is handled by portdrv service drivers.

CXL root ports include CXL specific RAS registers that need logging
before starting do_recovery() in the UCE case.

Update the AER service driver to call the UCE handler for root ports
and downstream ports. These PCIe port devices are bound to the portdrv
driver that includes a CE and UCE handler to be called.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
---
 drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 705893b5f7b0..a4db474b2be5 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
 	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
 
+	/*
+	 * PCIe ports may include functionality beyond the standard
+	 * extended port capabilities. This may present a need to log and
+	 * handle errors not addressed in this driver. Examples are CXL
+	 * root ports and CXL downstream switch ports using AER UIE to
+	 * indicate CXL UCE RAS protocol errors.
+	 */
+	if (type == PCI_EXP_TYPE_ROOT_PORT ||
+	    type == PCI_EXP_TYPE_DOWNSTREAM) {
+		struct pci_driver *pdrv = dev->driver;
+
+		if (pdrv && pdrv->err_handler &&
+		    pdrv->err_handler->error_detected) {
+			const struct pci_error_handlers *err_handler;
+
+			err_handler = pdrv->err_handler;
+			status = err_handler->error_detected(dev, state);
+		}
+	}
+
 	/*
 	 * If the error was detected by a Root Port, Downstream Port, RCEC,
 	 * or RCiEP, recovery runs on the device itself.  For Ports, that
-- 
2.34.1


* Re: [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers
  2024-06-17 20:04 ` [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers Terry Bowman
@ 2024-06-20 11:21   ` Jonathan Cameron
  2024-06-24 14:58     ` Terry Bowman
  2024-06-21 19:17   ` Dan Williams
  1 sibling, 1 reply; 59+ messages in thread
From: Jonathan Cameron @ 2024-06-20 11:21 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci

On Mon, 17 Jun 2024 15:04:03 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER service driver does not currently call a handler for AER
> uncorrectable errors (UCE) detected in root ports or downstream
> ports. This is not needed in most cases because common PCIe port
> functionality is handled by portdrv service drivers.
> 
> CXL root ports include CXL specific RAS registers that need logging
> before starting do_recovery() in the UCE case.
> 
> Update the AER service driver to call the UCE handler for root ports
> and downstream ports. These PCIe port devices are bound to the portdrv
> driver that includes a CE and UCE handler to be called.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> ---
>  drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 705893b5f7b0..a4db474b2be5 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>  	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>  	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>  
> +	/*
> +	 * PCIe ports may include functionality beyond the standard
> +	 * extended port capabilities. This may present a need to log and
> +	 * handle errors not addressed in this driver. Examples are CXL
> +	 * root ports and CXL downstream switch ports using AER UIE to
> +	 * indicate CXL UCE RAS protocol errors.
> +	 */
> +	if (type == PCI_EXP_TYPE_ROOT_PORT ||
> +	    type == PCI_EXP_TYPE_DOWNSTREAM) {
> +		struct pci_driver *pdrv = dev->driver;
> +
> +		if (pdrv && pdrv->err_handler &&
> +		    pdrv->err_handler->error_detected) {
> +			const struct pci_error_handlers *err_handler;
> +
> +			err_handler = pdrv->err_handler;
> +			status = err_handler->error_detected(dev, state);

This status is going to get overridden by one of the pci_walk_bridge()
calls.  Should it be kept around and acted on, or dropped silently?
(I'd guess no for silent!).

> +		}
> +	}
> +
>  	/*
>  	 * If the error was detected by a Root Port, Downstream Port, RCEC,
>  	 * or RCiEP, recovery runs on the device itself.  For Ports, that


* Re: [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers
  2024-06-20 11:21   ` Jonathan Cameron
@ 2024-06-24 14:58     ` Terry Bowman
  0 siblings, 0 replies; 59+ messages in thread
From: Terry Bowman @ 2024-06-24 14:58 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci

Hi Jonathan,
I added a response below.

On 6/20/24 06:21, Jonathan Cameron wrote:
> On Mon, 17 Jun 2024 15:04:03 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
> 
>> The AER service driver does not currently call a handler for AER
>> uncorrectable errors (UCE) detected in root ports or downstream
>> ports. This is not needed in most cases because common PCIe port
>> functionality is handled by portdrv service drivers.
>>
>> CXL root ports include CXL specific RAS registers that need logging
>> before starting do_recovery() in the UCE case.
>>
>> Update the AER service driver to call the UCE handler for root ports
>> and downstream ports. These PCIe port devices are bound to the portdrv
>> driver that includes a CE and UCE handler to be called.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: linux-pci@vger.kernel.org
>> ---
>>  drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
>>  1 file changed, 20 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index 705893b5f7b0..a4db474b2be5 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>  	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>>  	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>>  
>> +	/*
>> +	 * PCIe ports may include functionality beyond the standard
>> +	 * extended port capabilities. This may present a need to log and
>> +	 * handle errors not addressed in this driver. Examples are CXL
>> +	 * root ports and CXL downstream switch ports using AER UIE to
>> +	 * indicate CXL UCE RAS protocol errors.
>> +	 */
>> +	if (type == PCI_EXP_TYPE_ROOT_PORT ||
>> +	    type == PCI_EXP_TYPE_DOWNSTREAM) {
>> +		struct pci_driver *pdrv = dev->driver;
>> +
>> +		if (pdrv && pdrv->err_handler &&
>> +		    pdrv->err_handler->error_detected) {
>> +			const struct pci_error_handlers *err_handler;
>> +
>> +			err_handler = pdrv->err_handler;
>> +			status = err_handler->error_detected(dev, state);
> 
> This status is going to get overridden by one of the pci_walk_bridge()
> calls.  Should it be kept around and acted on, or dropped silently?
> (I'd guess no for silent!).
> 

It should be used later.

According to the PCIe spec, "The only method of recovering from an
Uncorrectable Internal Error is reset or hardware replacement."

I need to make certain that carries through below.
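
As a rough illustration of what "carrying it through" could mean, here is
a userspace sketch that combines an early verdict with a later one instead
of assigning over it. The enum and both function names are hypothetical,
and the severity ordering is an assumption made for the example. This is
not the kernel's pci_ers_result_t or the merge logic in err.c.

```c
/* Hypothetical severity levels, ordered from least to most severe. */
enum model_result {
	MODEL_CAN_RECOVER = 0,
	MODEL_NEED_RESET  = 1,
	MODEL_DISCONNECT  = 2,
};

/* Keep whichever verdict is more severe, so an early one is not lost. */
enum model_result model_merge(enum model_result earlier,
			      enum model_result later)
{
	return later > earlier ? later : earlier;
}

/* Plain assignment, for comparison: the earlier verdict is discarded. */
enum model_result model_overwrite(enum model_result earlier,
				  enum model_result later)
{
	(void)earlier;
	return later;
}
```

With these definitions, a port handler that asks for a reset followed by a
downstream device that reports it can recover still ends in
MODEL_NEED_RESET when merged, but ends in MODEL_CAN_RECOVER when
overwritten.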

Regards,
Terry

>> +		}
>> +	}
>> +
>>  	/*
>>  	 * If the error was detected by a Root Port, Downstream Port, RCEC,
>>  	 * or RCiEP, recovery runs on the device itself.  For Ports, that
> 

* Re: [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers
  2024-06-17 20:04 ` [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers Terry Bowman
  2024-06-20 11:21   ` Jonathan Cameron
@ 2024-06-21 19:17   ` Dan Williams
  2024-06-24 17:56     ` Terry Bowman
  1 sibling, 1 reply; 59+ messages in thread
From: Dan Williams @ 2024-06-21 19:17 UTC (permalink / raw)
  To: Terry Bowman, dan.j.williams, ira.weiny, dave, dave.jiang,
	alison.schofield, ming4.li, vishal.l.verma, jim.harris,
	ilpo.jarvinen, ardb, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, Yazen.Ghannam, Robert.Richter
  Cc: Bjorn Helgaas, linux-pci

Terry Bowman wrote:
> The AER service driver does not currently call a handler for AER
> uncorrectable errors (UCE) detected in root ports or downstream
> ports. This is not needed in most cases because common PCIe port
> functionality is handled by portdrv service drivers.
> 
> CXL root ports include CXL specific RAS registers that need logging
> before starting do_recovery() in the UCE case.
> 
> Update the AER service driver to call the UCE handler for root ports
> and downstream ports. These PCIe port devices are bound to the portdrv
> driver that includes a CE and UCE handler to be called.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> ---
>  drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 705893b5f7b0..a4db474b2be5 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>  	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>  	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>  
> +	/*
> +	 * PCIe ports may include functionality beyond the standard
> +	 * extended port capabilities. This may present a need to log and
> +	 * handle errors not addressed in this driver. Examples are CXL
> +	 * root ports and CXL downstream switch ports using AER UIE to
> +	 * indicate CXL UCE RAS protocol errors.
> +	 */
> +	if (type == PCI_EXP_TYPE_ROOT_PORT ||
> +	    type == PCI_EXP_TYPE_DOWNSTREAM) {
> +		struct pci_driver *pdrv = dev->driver;
> +
> +		if (pdrv && pdrv->err_handler &&
> +		    pdrv->err_handler->error_detected) {
> +			const struct pci_error_handlers *err_handler;
> +
> +			err_handler = pdrv->err_handler;
> +			status = err_handler->error_detected(dev, state);
> +		}
> +	}
> +

Would not a more appropriate place for this be pci_walk_bridge() where
the ->subordinate == NULL and these type-check cases are unified?

* Re: [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers
  2024-06-21 19:17   ` Dan Williams
@ 2024-06-24 17:56     ` Terry Bowman
  2024-07-10 20:48       ` nifan.cxl
  2024-08-19 18:35       ` Fan Ni
  0 siblings, 2 replies; 59+ messages in thread
From: Terry Bowman @ 2024-06-24 17:56 UTC (permalink / raw)
  To: Dan Williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter
  Cc: Bjorn Helgaas, linux-pci

Hi Dan,

I added a response below.

On 6/21/24 14:17, Dan Williams wrote:
> Terry Bowman wrote:
>> The AER service driver does not currently call a handler for AER
>> uncorrectable errors (UCE) detected in root ports or downstream
>> ports. This is not needed in most cases because common PCIe port
>> functionality is handled by portdrv service drivers.
>>
>> CXL root ports include CXL specific RAS registers that need logging
>> before starting do_recovery() in the UCE case.
>>
>> Update the AER service driver to call the UCE handler for root ports
>> and downstream ports. These PCIe port devices are bound to the portdrv
>> driver that includes a CE and UCE handler to be called.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: linux-pci@vger.kernel.org
>> ---
>>  drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
>>  1 file changed, 20 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index 705893b5f7b0..a4db474b2be5 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>  	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>>  	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>>  
>> +	/*
>> +	 * PCIe ports may include functionality beyond the standard
>> +	 * extended port capabilities. This may present a need to log and
>> +	 * handle errors not addressed in this driver. Examples are CXL
>> +	 * root ports and CXL downstream switch ports using AER UIE to
>> +	 * indicate CXL UCE RAS protocol errors.
>> +	 */
>> +	if (type == PCI_EXP_TYPE_ROOT_PORT ||
>> +	    type == PCI_EXP_TYPE_DOWNSTREAM) {
>> +		struct pci_driver *pdrv = dev->driver;
>> +
>> +		if (pdrv && pdrv->err_handler &&
>> +		    pdrv->err_handler->error_detected) {
>> +			const struct pci_error_handlers *err_handler;
>> +
>> +			err_handler = pdrv->err_handler;
>> +			status = err_handler->error_detected(dev, state);
>> +		}
>> +	}
>> +
> 
> Would not a more appropriate place for this be pci_walk_bridge() where
> the ->subordinate == NULL and these type-check cases are unified?

It would. I can take a look at moving that.

Regards,
Terry

* Re: [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers
  2024-06-24 17:56     ` Terry Bowman
@ 2024-07-10 20:48       ` nifan.cxl
  2024-07-10 21:48         ` Terry Bowman
  2024-08-19 18:35       ` Fan Ni
  1 sibling, 1 reply; 59+ messages in thread
From: nifan.cxl @ 2024-07-10 20:48 UTC (permalink / raw)
  To: Terry Bowman
  Cc: Dan Williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci

On Mon, Jun 24, 2024 at 12:56:29PM -0500, Terry Bowman wrote:
> Hi Dan,
> 
> I added a response below.
> 
> On 6/21/24 14:17, Dan Williams wrote:
> > Terry Bowman wrote:
> >> The AER service driver does not currently call a handler for AER
> >> uncorrectable errors (UCE) detected in root ports or downstream
> >> ports. This is not needed in most cases because common PCIe port
> >> functionality is handled by portdrv service drivers.
> >>
> >> CXL root ports include CXL specific RAS registers that need logging
> >> before starting do_recovery() in the UCE case.
> >>
> >> Update the AER service driver to call the UCE handler for root ports
> >> and downstream ports. These PCIe port devices are bound to the portdrv
> >> driver that includes a CE and UCE handler to be called.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> Cc: Bjorn Helgaas <bhelgaas@google.com>
> >> Cc: linux-pci@vger.kernel.org
> >> ---
> >>  drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
> >>  1 file changed, 20 insertions(+)
> >>
> >> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> >> index 705893b5f7b0..a4db474b2be5 100644
> >> --- a/drivers/pci/pcie/err.c
> >> +++ b/drivers/pci/pcie/err.c
> >> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> >>  	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> >>  	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> >>  
> >> +	/*
> >> +	 * PCIe ports may include functionality beyond the standard
> >> +	 * extended port capabilities. This may present a need to log and
> >> +	 * handle errors not addressed in this driver. Examples are CXL
> >> +	 * root ports and CXL downstream switch ports using AER UIE to
> >> +	 * indicate CXL UCE RAS protocol errors.
> >> +	 */
> >> +	if (type == PCI_EXP_TYPE_ROOT_PORT ||
> >> +	    type == PCI_EXP_TYPE_DOWNSTREAM) {
> >> +		struct pci_driver *pdrv = dev->driver;
> >> +
> >> +		if (pdrv && pdrv->err_handler &&
> >> +		    pdrv->err_handler->error_detected) {
> >> +			const struct pci_error_handlers *err_handler;
> >> +
> >> +			err_handler = pdrv->err_handler;
> >> +			status = err_handler->error_detected(dev, state);
> >> +		}
> >> +	}
> >> +
> > 
> > Would not a more appropriate place for this be pci_walk_bridge() where
> > the ->subordinate == NULL and these type-check cases are unified?
> 
> > It would. I can take a look at moving that.

Has that already been handled in pci_walk_bridge?

The function pci_walk_bridge() will call report_error_detected, where
the err handler will be called. 
https://elixir.bootlin.com/linux/v6.10-rc6/source/drivers/pci/pcie/err.c#L80

Fan

> 
> Regards,
> Terry

* Re: [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers
  2024-07-10 20:48       ` nifan.cxl
@ 2024-07-10 21:48         ` Terry Bowman
  2024-07-11  1:14           ` fan
  0 siblings, 1 reply; 59+ messages in thread
From: Terry Bowman @ 2024-07-10 21:48 UTC (permalink / raw)
  To: nifan.cxl
  Cc: Dan Williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci

Hi Fan,

On 7/10/24 15:48, nifan.cxl@gmail.com wrote:
> On Mon, Jun 24, 2024 at 12:56:29PM -0500, Terry Bowman wrote:
>> Hi Dan,
>>
>> I added a response below.
>>
>> On 6/21/24 14:17, Dan Williams wrote:
>>> Terry Bowman wrote:
>>>> The AER service driver does not currently call a handler for AER
>>>> uncorrectable errors (UCE) detected in root ports or downstream
>>>> ports. This is not needed in most cases because common PCIe port
>>>> functionality is handled by portdrv service drivers.
>>>>
>>>> CXL root ports include CXL specific RAS registers that need logging
>>>> before starting do_recovery() in the UCE case.
>>>>
>>>> Update the AER service driver to call the UCE handler for root ports
>>>> and downstream ports. These PCIe port devices are bound to the portdrv
>>>> driver that includes a CE and UCE handler to be called.
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>>>> Cc: linux-pci@vger.kernel.org
>>>> ---
>>>>  drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
>>>>  1 file changed, 20 insertions(+)
>>>>
>>>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>>>> index 705893b5f7b0..a4db474b2be5 100644
>>>> --- a/drivers/pci/pcie/err.c
>>>> +++ b/drivers/pci/pcie/err.c
>>>> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>>>  	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>>>>  	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>>>>  
>>>> +	/*
>>>> +	 * PCIe ports may include functionality beyond the standard
>>>> +	 * extended port capabilities. This may present a need to log and
>>>> +	 * handle errors not addressed in this driver. Examples are CXL
>>>> +	 * root ports and CXL downstream switch ports using AER UIE to
>>>> +	 * indicate CXL UCE RAS protocol errors.
>>>> +	 */
>>>> +	if (type == PCI_EXP_TYPE_ROOT_PORT ||
>>>> +	    type == PCI_EXP_TYPE_DOWNSTREAM) {
>>>> +		struct pci_driver *pdrv = dev->driver;
>>>> +
>>>> +		if (pdrv && pdrv->err_handler &&
>>>> +		    pdrv->err_handler->error_detected) {
>>>> +			const struct pci_error_handlers *err_handler;
>>>> +
>>>> +			err_handler = pdrv->err_handler;
>>>> +			status = err_handler->error_detected(dev, state);
>>>> +		}
>>>> +	}
>>>> +
>>>
>>> Would not a more appropriate place for this be pci_walk_bridge() where
>>> the ->subordinate == NULL and these type-check cases are unified?
>>
>> It would. I can take a look at moving that.
> 
> Has that already been handled in pci_walk_bridge?
> 
> The function pci_walk_bridge() will call report_error_detected, where
> the err handler will be called. 
> https://elixir.bootlin.com/linux/v6.10-rc6/source/drivers/pci/pcie/err.c#L80
> 
> Fan
> 

You would think so, but in my testing the UCE handler was not called for
the PCIe ports (RP, USP, DSP). The pci_walk_bridge() function has 2 cases:
- If there is a subordinate/secondary bus, the callback is called for
  the devices below the port but not for the port itself.
- If there is no subordinate/secondary bus, the callback is invoked for
  the port itself.

The function header comment may explain it better:
/**
 * pci_walk_bridge - walk bridges potentially AER affected
 * @bridge:     bridge which may be a Port, an RCEC, or an RCiEP
 * @cb:         callback to be called for each device found
 * @userdata:   arbitrary pointer to be passed to callback
 *
 * If the device provided is a bridge, walk the subordinate bus, including
 * any bridged devices on buses under this bus.  Call the provided callback
 * on each device found.
 *
 * If the device provided has no subordinate bus, e.g., an RCEC or RCiEP,
 * call the callback on the device itself.
 */
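
The two cases can also be modelled in userspace. The sketch below is only
an illustration of the behaviour described in that comment: the model_*
structures and functions are hypothetical stand-ins for struct pci_dev,
struct pci_bus, pci_walk_bus() and pci_walk_bridge(), not the kernel code.

```c
#include <stddef.h>

struct model_bus;

/* Hypothetical stand-in for struct pci_dev. */
struct model_dev {
	const char *name;
	struct model_bus *subordinate;	/* NULL when there is no secondary bus */
	struct model_dev *next;		/* next device on the same bus */
};

/* Hypothetical stand-in for struct pci_bus. */
struct model_bus {
	struct model_dev *devices;
};

typedef void (*model_cb)(struct model_dev *dev, void *userdata);

/* Call cb on every device on the bus, descending through nested bridges. */
void model_walk_bus(struct model_bus *bus, model_cb cb, void *userdata)
{
	for (struct model_dev *d = bus->devices; d; d = d->next) {
		cb(d, userdata);
		if (d->subordinate)
			model_walk_bus(d->subordinate, cb, userdata);
	}
}

/* Case 1: walk what is below the bridge. Case 2: call cb on the bridge. */
void model_walk_bridge(struct model_dev *bridge, model_cb cb, void *userdata)
{
	if (bridge->subordinate)
		model_walk_bus(bridge->subordinate, cb, userdata);
	else
		cb(bridge, userdata);
}

/* Records how often one particular device was handed to the callback. */
struct model_visit_log {
	const struct model_dev *watch;
	int total;
	int watched_hits;
};

void model_count_visit(struct model_dev *dev, void *userdata)
{
	struct model_visit_log *log = userdata;

	log->total++;
	if (dev == log->watch)
		log->watched_hits++;
}

/* Build a port with or without one endpoint below it; count port visits. */
int model_port_visits(int has_subordinate)
{
	struct model_dev endpoint = { "endpoint", NULL, NULL };
	struct model_bus secondary = { &endpoint };
	struct model_dev port = {
		"root port", has_subordinate ? &secondary : NULL, NULL
	};
	struct model_visit_log log = { &port, 0, 0 };

	model_walk_bridge(&port, model_count_visit, &log);
	return log.watched_hits;
}
```

In this model a port with a secondary bus is never passed to the callback
itself, which matches what I saw with the UCE handler in testing.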

Regards,
Terry

>>
>> Regards,
>> Terry

* Re: [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers
  2024-07-10 21:48         ` Terry Bowman
@ 2024-07-11  1:14           ` fan
  0 siblings, 0 replies; 59+ messages in thread
From: fan @ 2024-07-11  1:14 UTC (permalink / raw)
  To: Terry Bowman
  Cc: nifan.cxl, Dan Williams, ira.weiny, dave, dave.jiang,
	alison.schofield, ming4.li, vishal.l.verma, jim.harris,
	ilpo.jarvinen, ardb, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, Yazen.Ghannam, Robert.Richter, Bjorn Helgaas,
	linux-pci

On Wed, Jul 10, 2024 at 04:48:09PM -0500, Terry Bowman wrote:
> Hi Fan,
> 
> On 7/10/24 15:48, nifan.cxl@gmail.com wrote:
> > On Mon, Jun 24, 2024 at 12:56:29PM -0500, Terry Bowman wrote:
> >> Hi Dan,
> >>
> >> I added a response below.
> >>
> >> On 6/21/24 14:17, Dan Williams wrote:
> >>> Terry Bowman wrote:
> >>>> The AER service driver does not currently call a handler for AER
> >>>> uncorrectable errors (UCE) detected in root ports or downstream
> >>>> ports. This is not needed in most cases because common PCIe port
> >>>> functionality is handled by portdrv service drivers.
> >>>>
> >>>> CXL root ports include CXL specific RAS registers that need logging
> >>>> before starting do_recovery() in the UCE case.
> >>>>
> >>>> Update the AER service driver to call the UCE handler for root ports
> >>>> and downstream ports. These PCIe port devices are bound to the portdrv
> >>>> driver that includes a CE and UCE handler to be called.
> >>>>
> >>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >>>> Cc: Bjorn Helgaas <bhelgaas@google.com>
> >>>> Cc: linux-pci@vger.kernel.org
> >>>> ---
> >>>>  drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
> >>>>  1 file changed, 20 insertions(+)
> >>>>
> >>>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> >>>> index 705893b5f7b0..a4db474b2be5 100644
> >>>> --- a/drivers/pci/pcie/err.c
> >>>> +++ b/drivers/pci/pcie/err.c
> >>>> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> >>>>  	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> >>>>  	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> >>>>  
> >>>> +	/*
> >>>> +	 * PCIe ports may include functionality beyond the standard
> >>>> +	 * extended port capabilities. This may present a need to log and
> >>>> +	 * handle errors not addressed in this driver. Examples are CXL
> >>>> +	 * root ports and CXL downstream switch ports using AER UIE to
> >>>> +	 * indicate CXL UCE RAS protocol errors.
> >>>> +	 */
> >>>> +	if (type == PCI_EXP_TYPE_ROOT_PORT ||
> >>>> +	    type == PCI_EXP_TYPE_DOWNSTREAM) {
> >>>> +		struct pci_driver *pdrv = dev->driver;
> >>>> +
> >>>> +		if (pdrv && pdrv->err_handler &&
> >>>> +		    pdrv->err_handler->error_detected) {
> >>>> +			const struct pci_error_handlers *err_handler;
> >>>> +
> >>>> +			err_handler = pdrv->err_handler;
> >>>> +			status = err_handler->error_detected(dev, state);
> >>>> +		}
> >>>> +	}
> >>>> +
> >>>
> >>> Would not a more appropriate place for this be pci_walk_bridge() where
> >>> the ->subordinate == NULL and these type-check cases are unified?
> >>
> >> It does. I can take a look at moving that.
> > 
> > Has that already been handled in pci_walk_bridge?
> > 
> > The function pci_walk_bridge() will call report_error_detected, where
> > the err handler will be called. 
> > https://elixir.bootlin.com/linux/v6.10-rc6/source/drivers/pci/pcie/err.c#L80
> > 
> > Fan
> > 
> 
> You would think so but the UCE handler was not called in my testing for the PCIe 
> ports (RP,USP,DSP). The pci_walk_bridge() function has 2 cases:
> - If there is a subordinate/secondary bus then the callback is called for
> those downstream devices but not the port itself.
> - If there is no subordinate/secondary bus then the callback is invoked for the 
> port itself.
> 
> The function header comment may explain it better:
> /**
>  * pci_walk_bridge - walk bridges potentially AER affected
>  * @bridge:     bridge which may be a Port, an RCEC, or an RCiEP
>  * @cb:         callback to be called for each device found
>  * @userdata:   arbitrary pointer to be passed to callback
>  *
>  * If the device provided is a bridge, walk the subordinate bus, including
>  * any bridged devices on buses under this bus.  Call the provided callback
>  * on each device found.
>  *
>  * If the device provided has no subordinate bus, e.g., an RCEC or RCiEP,
>  * call the callback on the device itself.
>  */
> 
> Regards,
> Terry

OK, interesting.
Btw, what is the "state" passed to pcie_do_recovery(...state...)?

Fan

> 
> >>
> >> Regards,
> >> Terry

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers
  2024-06-24 17:56     ` Terry Bowman
  2024-07-10 20:48       ` nifan.cxl
@ 2024-08-19 18:35       ` Fan Ni
  1 sibling, 0 replies; 59+ messages in thread
From: Fan Ni @ 2024-08-19 18:35 UTC (permalink / raw)
  To: Terry Bowman
  Cc: Dan Williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci,
	a.manzanares

On Mon, Jun 24, 2024 at 12:56:29PM -0500, Terry Bowman wrote:
> Hi Dan,
> 
> I added a response below.
> 
> On 6/21/24 14:17, Dan Williams wrote:
> > Terry Bowman wrote:
> >> The AER service driver does not currently call a handler for AER
> >> uncorrectable errors (UCE) detected in root ports or downstream
> >> ports. This is not needed in most cases because common PCIe port
> >> functionality is handled by portdrv service drivers.
> >>
> >> CXL root ports include CXL specific RAS registers that need logging
> >> before starting do_recovery() in the UCE case.
> >>
> >> Update the AER service driver to call the UCE handler for root ports
> >> and downstream ports. These PCIe port devices are bound to the portdrv
> >> driver that includes a CE and UCE handler to be called.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> Cc: Bjorn Helgaas <bhelgaas@google.com>
> >> Cc: linux-pci@vger.kernel.org
> >> ---
> >>  drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
> >>  1 file changed, 20 insertions(+)
> >>
> >> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> >> index 705893b5f7b0..a4db474b2be5 100644
> >> --- a/drivers/pci/pcie/err.c
> >> +++ b/drivers/pci/pcie/err.c
> >> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> >>  	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> >>  	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> >>  
> >> +	/*
> >> +	 * PCIe ports may include functionality beyond the standard
> >> +	 * extended port capabilities. This may present a need to log and
> >> +	 * handle errors not addressed in this driver. Examples are CXL
> >> +	 * root ports and CXL downstream switch ports using AER UIE to
> >> +	 * indicate CXL UCE RAS protocol errors.
> >> +	 */
> >> +	if (type == PCI_EXP_TYPE_ROOT_PORT ||
> >> +	    type == PCI_EXP_TYPE_DOWNSTREAM) {
> >> +		struct pci_driver *pdrv = dev->driver;
> >> +
> >> +		if (pdrv && pdrv->err_handler &&
> >> +		    pdrv->err_handler->error_detected) {
> >> +			const struct pci_error_handlers *err_handler;
> >> +
> >> +			err_handler = pdrv->err_handler;
> >> +			status = err_handler->error_detected(dev, state);
> >> +		}
> >> +	}
> >> +
> > 
> > Would not a more appropriate place for this be pci_walk_bridge() where
> > the ->subordinate == NULL and these type-check cases are unified?
> 
> It does. I can take a look at moving that.
> 

Based on the current code logic, the code added here will be executed as
long as the type matches (downstream port or root port). I also noticed
that the ->subordinate == NULL case is never reached when I inject an
error through the aer_inject module and the user space tool. If my error
injection method is right, the behaviour will change after the code move.

Here is some of my experimental setup:

QEMU +  cxl topology (one type3 memdev directly attached to a HB with a
single root port).

1. Load the cxl related drivers before error injection

2. Do aer inject with aer_inject inside the QEMU VM

# aer_inject ~/nonfatal

aer inject input file looks like below
-----------------------------------------------------
fan:~/cxl/linux-fixes$ cat ~/nonfatal 
# Inject an uncorrectable/non-fatal training error into the device
# with header log words 0 1 2 3.
#
# Either specify the PCI id on the command-line option or uncomment and edit
# the PCI_ID line below using the correct PCI ID.
#
# Note that system firmware/BIOS may mask certain errors, change their severity
# and/or not report header log words.
#
AER
PCI_ID 0000:0c:00.0
UNCOR_STATUS COMP_ABORT
HEADER_LOG 0 1 2 3
-----------------------------------------------------

The "lspci" output on the VM looks like below
----------------------------------------------------
Qemu: execute "lspci" on VM
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
0c:00.0 PCI bridge: Intel Corporation Device 7075
0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
--------------------------------------------------

Fan


> Regards,
> Terry

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register
  2024-06-17 20:04 [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports Terry Bowman
  2024-06-17 20:04 ` [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers Terry Bowman
@ 2024-06-17 20:04 ` Terry Bowman
  2024-06-20 11:31   ` Jonathan Cameron
  2024-06-21 19:23   ` Dan Williams
  2024-06-17 20:04 ` [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors Terry Bowman
                   ` (8 subsequent siblings)
  10 siblings, 2 replies; 59+ messages in thread
From: Terry Bowman @ 2024-06-17 20:04 UTC (permalink / raw)
  To: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel, terry.bowman,
	Yazen.Ghannam, Robert.Richter
  Cc: Bjorn Helgaas, linux-pci

The AER service driver clears the AER correctable error (CE) status before
calling the correctable error handler. As a result, the error's status is
no longer set when it is read from the CE handler.

The AER CE status is needed by the portdrv's CE handler. The portdrv's
CE handler is intended to only call the registered notifier callbacks
if the CE error status has correctable internal error (CIE) set.

This is not a problem for AER uncorrectable errors (UCE). The UCE status
is still present in the AER capability and available for reading, if
needed, when the UCE handler is called.

Change the order of clearing the CE status and calling the CE handler:
call the CE handler first, then clear the CE status after the handler
returns.
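
The effect of the reordering can be shown with a small userspace model.
This is only a sketch under assumptions: model_port,
model_write_cor_status() and the MODEL_COR_INTERNAL bit value are
hypothetical stand-ins rather than the kernel's AER code, and the status
register is modeled as write-1-to-clear.

```c
#include <stdint.h>

/* Illustrative bit standing in for the correctable internal error bit. */
#define MODEL_COR_INTERNAL 0x4000u

struct model_port {
	uint32_t cor_status;	/* write-1-to-clear status register */
	int notified;		/* set when the handler saw the bit */
};

/* Writing 1 to a bit clears it; writing 0 leaves it unchanged. */
static void model_write_cor_status(struct model_port *port, uint32_t val)
{
	port->cor_status &= ~val;
}

/* Stand-in for a CE handler that re-reads the status register. */
static void model_cor_error_detected(struct model_port *port)
{
	if (port->cor_status & MODEL_COR_INTERNAL)
		port->notified = 1;
}

/* Ordering before the change: clear first, then call the handler. */
static void model_handle_ce_clear_first(struct model_port *port,
					uint32_t status)
{
	model_write_cor_status(port, status);
	model_cor_error_detected(port);
}

/* Ordering after the change: call the handler, then clear. */
static void model_handle_ce_handler_first(struct model_port *port,
					  uint32_t status)
{
	model_cor_error_detected(port);
	model_write_cor_status(port, status);
}
```

With the internal error bit set, model_handle_ce_clear_first() leaves
notified at 0 because the handler reads an already-cleared register, while
model_handle_ce_handler_first() sets notified before the bit is cleared.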

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
---
 drivers/pci/pcie/aer.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index ac6293c24976..4dc03cb9aff0 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1094,9 +1094,6 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 		 * Correctable error does not need software intervention.
 		 * No need to go through error recovery process.
 		 */
-		if (aer)
-			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
-					info->status);
 		if (pcie_aer_is_native(dev)) {
 			struct pci_driver *pdrv = dev->driver;
 
@@ -1105,6 +1102,10 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 				pdrv->err_handler->cor_error_detected(dev);
 			pcie_clear_device_status(dev);
 		}
+		if (aer)
+			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
+					info->status);
+
 	} else if (info->severity == AER_NONFATAL)
 		pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
 	else if (info->severity == AER_FATAL)
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register
  2024-06-17 20:04 ` [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register Terry Bowman
@ 2024-06-20 11:31   ` Jonathan Cameron
  2024-06-24 15:08     ` Terry Bowman
  2024-06-21 19:23   ` Dan Williams
  1 sibling, 1 reply; 59+ messages in thread
From: Jonathan Cameron @ 2024-06-20 11:31 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci

On Mon, 17 Jun 2024 15:04:04 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> The AER service driver clears the AER correctable error (CE) status before
> calling the correctable error handler. This results in the error's status
> not correctly reflected if read from the CE handler.
> 
> The AER CE status is needed by the portdrv's CE handler. The portdrv's
> CE handler is intended to only call the registered notifier callbacks
> if the CE error status has correctable internal error (CIE) set.
> 
> This is not a problem for AER uncorrrectbale errors (UCE). The UCE status

uncorrectable

> is still present in the AER capability and available for reading, if
> needed, when the UCE handler is called.

I'm seeing the clear in the DPC path for UCE. For other cases is
it a side effect of the reset?

> 
> Change the order of clearing the CE status and calling the CE handler.
> Make it to call the CE handler first and then clear the CE status
> after returning.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
Seems reasonable, but there are many gremlins around the ordering in these
flows, so I'm not particularly confident. With that in mind:
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huwei.com>

> ---
>  drivers/pci/pcie/aer.c | 7 ++++---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index ac6293c24976..4dc03cb9aff0 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1094,9 +1094,6 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  		 * Correctable error does not need software intervention.
>  		 * No need to go through error recovery process.
>  		 */
> -		if (aer)
> -			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
> -					info->status);
>  		if (pcie_aer_is_native(dev)) {
>  			struct pci_driver *pdrv = dev->driver;
>  
> @@ -1105,6 +1102,10 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>  				pdrv->err_handler->cor_error_detected(dev);
>  			pcie_clear_device_status(dev);
>  		}
> +		if (aer)
> +			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
> +					info->status);
> +
>  	} else if (info->severity == AER_NONFATAL)
>  		pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
>  	else if (info->severity == AER_FATAL)


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register
  2024-06-20 11:31   ` Jonathan Cameron
@ 2024-06-24 15:08     ` Terry Bowman
  0 siblings, 0 replies; 59+ messages in thread
From: Terry Bowman @ 2024-06-24 15:08 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci



On 6/20/24 06:31, Jonathan Cameron wrote:
> On Mon, 17 Jun 2024 15:04:04 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
> 
>> The AER service driver clears the AER correctable error (CE) status before
>> calling the correctable error handler. This results in the error's status
>> not correctly reflected if read from the CE handler.
>>
>> The AER CE status is needed by the portdrv's CE handler. The portdrv's
>> CE handler is intended to only call the registered notifier callbacks
>> if the CE error status has correctable internal error (CIE) set.
>>
>> This is not a problem for AER uncorrrectbale errors (UCE). The UCE status
> 
> uncorrectable
> 

Thank you.

>> is still present in the AER capability and available for reading, if
>> needed, when the UCE handler is called.
> 
> I'm seeing the clear in the DPC path for UCE. For other cases is
> it a side effect of the reset?
> 

It depends on when it's being read. I'm assuming this is after recovery in
your case, and after recovery it will be zeroed.

Regards,
Terry

>>
>> Change the order of clearing the CE status and calling the CE handler.
>> Make it to call the CE handler first and then clear the CE status
>> after returning.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: linux-pci@vger.kernel.org
> Seems reasonable, but there are many gremlins around the ordering in these
> flows, so I'm not particularly confident. With that in mind:
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huwei.com>
> 
>> ---
>>  drivers/pci/pcie/aer.c | 7 ++++---
>>  1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index ac6293c24976..4dc03cb9aff0 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -1094,9 +1094,6 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>  		 * Correctable error does not need software intervention.
>>  		 * No need to go through error recovery process.
>>  		 */
>> -		if (aer)
>> -			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>> -					info->status);
>>  		if (pcie_aer_is_native(dev)) {
>>  			struct pci_driver *pdrv = dev->driver;
>>  
>> @@ -1105,6 +1102,10 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>>  				pdrv->err_handler->cor_error_detected(dev);
>>  			pcie_clear_device_status(dev);
>>  		}
>> +		if (aer)
>> +			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>> +					info->status);
>> +
>>  	} else if (info->severity == AER_NONFATAL)
>>  		pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
>>  	else if (info->severity == AER_FATAL)
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register
  2024-06-17 20:04 ` [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register Terry Bowman
  2024-06-20 11:31   ` Jonathan Cameron
@ 2024-06-21 19:23   ` Dan Williams
  2024-06-24 18:00     ` Terry Bowman
  1 sibling, 1 reply; 59+ messages in thread
From: Dan Williams @ 2024-06-21 19:23 UTC (permalink / raw)
  To: Terry Bowman, dan.j.williams, ira.weiny, dave, dave.jiang,
	alison.schofield, ming4.li, vishal.l.verma, jim.harris,
	ilpo.jarvinen, ardb, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, Yazen.Ghannam, Robert.Richter
  Cc: Bjorn Helgaas, linux-pci

Terry Bowman wrote:
> The AER service driver clears the AER correctable error (CE) status before
> calling the correctable error handler. This results in the error's status
> not correctly reflected if read from the CE handler.
> 
> The AER CE status is needed by the portdrv's CE handler. The portdrv's
> CE handler is intended to only call the registered notifier callbacks
> if the CE error status has correctable internal error (CIE) set.

Is this a fix or a prep patch? It reads like a "fix", but there are no
notifiers to worry about today.

> This is not a problem for AER uncorrectable errors (UCE). The UCE status
> is still present in the AER capability and available for reading, if
> needed, when the UCE handler is called.
> 
> Change the order of clearing the CE status and calling the CE handler.
> Make it to call the CE handler first and then clear the CE status
> after returning.

With the changelog clarified to indicate whether this has any impact on
current behavior you can add:

Acked-by: Dan Williams <dan.j.williams@intel.com>

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register
  2024-06-21 19:23   ` Dan Williams
@ 2024-06-24 18:00     ` Terry Bowman
  0 siblings, 0 replies; 59+ messages in thread
From: Terry Bowman @ 2024-06-24 18:00 UTC (permalink / raw)
  To: Dan Williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter
  Cc: Bjorn Helgaas, linux-pci

Hi Dan, I added a response below.

On 6/21/24 14:23, Dan Williams wrote:
> Terry Bowman wrote:
>> The AER service driver clears the AER correctable error (CE) status before
>> calling the correctable error handler. This results in the error's status
>> not correctly reflected if read from the CE handler.
>>
>> The AER CE status is needed by the portdrv's CE handler. The portdrv's
>> CE handler is intended to only call the registered notifier callbacks
>> if the CE error status has correctable internal error (CIE) set.
> 
> Is this a fix or a prep patch? It reads like a "fix", but there are no
> notifiers to worry about today.
> 

I will mention that this is "in preparation for a future patch".

>> This is not a problem for AER uncorrectable errors (UCE). The UCE status
>> is still present in the AER capability and available for reading, if
>> needed, when the UCE handler is called.
>>
>> Change the order of clearing the CE status and calling the CE handler.
>> Make it to call the CE handler first and then clear the CE status
>> after returning.
> 
> With the changelog clarified to indicate whether this has any impact on
> current behavior you can add:
> 
> Acked-by: Dan Williams <dan.j.williams@intel.com>

Regards,
Terry

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors
  2024-06-17 20:04 [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports Terry Bowman
  2024-06-17 20:04 ` [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers Terry Bowman
  2024-06-17 20:04 ` [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register Terry Bowman
@ 2024-06-17 20:04 ` Terry Bowman
  2024-06-20 12:30   ` Jonathan Cameron
                     ` (2 more replies)
  2024-06-17 20:04 ` [RFC PATCH 4/9] cxl/pci: Map CXL PCIe ports' RAS registers Terry Bowman
                   ` (7 subsequent siblings)
  10 siblings, 3 replies; 59+ messages in thread
From: Terry Bowman @ 2024-06-17 20:04 UTC (permalink / raw)
  To: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel, terry.bowman,
	Yazen.Ghannam, Robert.Richter
  Cc: Bjorn Helgaas, linux-pci

PCIe port devices are bound to portdrv, the PCIe port bus driver. portdrv
does not implement an AER correctable error (CE) handler but does implement
an AER uncorrectable error (UCE) handler. The UCE handler is fairly
straightforward in that it only checks for frozen error state and returns
the next step for recovery accordingly.

As a result, errors from port devices relying on AER correctable internal
errors (CIE) and AER uncorrectable internal errors (UIE) will not be
handled. Note, the PCIe spec indicates AER CIE/UIE can be used to report
implementation specific errors.[1]

CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
are examples of devices using the AER CIE/UIE for implementation specific
purposes. These CXL ports use the AER interrupt and AER CIE/UIE status to
report CXL RAS errors.[2]

Add an atomic notifier to portdrv's CE/UCE handlers. Use the atomic
notifier to report CIE/UIE errors to the registered functions. This will
require adding a CE handler and updating the existing UCE handler.

For the UCE handler, return PCI_ERS_RESULT_NEED_RESET on UIE errors. The
PCIe spec states: "The only method of recovering from an Uncorrectable
Internal Error is reset or hardware replacement."[1]

[1] PCI6.0 - 6.2.10 Internal Errors
[2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
             Upstream Switch Ports
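
The handler behaviour described above can be approximated in userspace. A
minimal sketch, assuming a fixed-size callback array in place of the
kernel's atomic notifier chain. All model_* names are hypothetical, and the
locking of a real notifier chain is not modeled.

```c
#include <stddef.h>

#define MODEL_MAX_CALLBACKS 4

enum model_severity { MODEL_CORRECTABLE, MODEL_FATAL };
enum model_result { MODEL_CAN_RECOVER, MODEL_NEED_RESET };

typedef void (*model_notifier_fn)(enum model_severity sev, void *dev);

struct model_chain {
	model_notifier_fn fns[MODEL_MAX_CALLBACKS];
	size_t count;
};

/* Append a callback; returns -1 when the chain is full. */
static int model_chain_register(struct model_chain *chain,
				model_notifier_fn fn)
{
	if (chain->count == MODEL_MAX_CALLBACKS)
		return -1;
	chain->fns[chain->count++] = fn;
	return 0;
}

/* Invoke every registered callback in registration order. */
static void model_chain_call(struct model_chain *chain,
			     enum model_severity sev, void *dev)
{
	for (size_t i = 0; i < chain->count; i++)
		chain->fns[i](sev, dev);
}

/* UCE path: notify on an internal error and always request a reset. */
static enum model_result model_portdrv_error_detected(struct model_chain *chain,
						      void *dev,
						      int internal_error,
						      int frozen)
{
	if (internal_error) {
		model_chain_call(chain, MODEL_FATAL, dev);
		return MODEL_NEED_RESET;
	}
	return frozen ? MODEL_NEED_RESET : MODEL_CAN_RECOVER;
}

/* CE path: notify only when the correctable internal error bit is set. */
static void model_portdrv_cor_error_detected(struct model_chain *chain,
					     void *dev, int internal_error)
{
	if (internal_error)
		model_chain_call(chain, MODEL_CORRECTABLE, dev);
}

/* Example subscriber: tallies events per severity. */
static int model_seen[2];

static void model_count_event(enum model_severity sev, void *dev)
{
	(void)dev;
	model_seen[sev]++;
}
```

In this model, errors without the internal error indication never reach
the registered callbacks, and the UCE path falls back to the existing
frozen-state check.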

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
---
 drivers/pci/pcie/portdrv.c | 32 ++++++++++++++++++++++++++++++++
 drivers/pci/pcie/portdrv.h |  2 ++
 2 files changed, 34 insertions(+)

diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
index 14a4b89a3b83..86d80e0e9606 100644
--- a/drivers/pci/pcie/portdrv.c
+++ b/drivers/pci/pcie/portdrv.c
@@ -37,6 +37,9 @@ struct portdrv_service_data {
 	u32 service;
 };
 
+ATOMIC_NOTIFIER_HEAD(portdrv_aer_internal_err_chain);
+EXPORT_SYMBOL_GPL(portdrv_aer_internal_err_chain);
+
 /**
  * release_pcie_device - free PCI Express port service device structure
  * @dev: Port service device to release
@@ -745,11 +748,39 @@ static void pcie_portdrv_shutdown(struct pci_dev *dev)
 static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
 					pci_channel_state_t error)
 {
+	if (dev->aer_cap) {
+		u32 status;
+
+		pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_UNCOR_STATUS,
+				      &status);
+
+		if (status & PCI_ERR_UNC_INTN) {
+			atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
+						   AER_FATAL, (void *)dev);
+			return PCI_ERS_RESULT_NEED_RESET;
+		}
+	}
+
 	if (error == pci_channel_io_frozen)
 		return PCI_ERS_RESULT_NEED_RESET;
 	return PCI_ERS_RESULT_CAN_RECOVER;
 }
 
+static void pcie_portdrv_cor_error_detected(struct pci_dev *dev)
+{
+	u32 status;
+
+	if (!dev->aer_cap)
+		return;
+
+	pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_COR_STATUS,
+			      &status);
+
+	if (status & PCI_ERR_COR_INTERNAL)
+		atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
+					   AER_CORRECTABLE, (void *)dev);
+}
+
 static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev)
 {
 	size_t off = offsetof(struct pcie_port_service_driver, slot_reset);
@@ -780,6 +811,7 @@ static const struct pci_device_id port_pci_ids[] = {
 
 static const struct pci_error_handlers pcie_portdrv_err_handler = {
 	.error_detected = pcie_portdrv_error_detected,
+	.cor_error_detected = pcie_portdrv_cor_error_detected,
 	.slot_reset = pcie_portdrv_slot_reset,
 	.mmio_enabled = pcie_portdrv_mmio_enabled,
 };
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index 12c89ea0313b..8a39197f0203 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -121,4 +121,6 @@ static inline void pcie_pme_interrupt_enable(struct pci_dev *dev, bool en) {}
 #endif /* !CONFIG_PCIE_PME */
 
 struct device *pcie_port_find_device(struct pci_dev *dev, u32 service);
+
+extern struct atomic_notifier_head portdrv_aer_internal_err_chain;
 #endif /* _PORTDRV_H_ */
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors
  2024-06-17 20:04 ` [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors Terry Bowman
@ 2024-06-20 12:30   ` Jonathan Cameron
  2024-06-24 15:22     ` Terry Bowman
  2024-06-21 19:36   ` Dan Williams
  2024-06-26  2:54   ` Li, Ming4
  2 siblings, 1 reply; 59+ messages in thread
From: Jonathan Cameron @ 2024-06-20 12:30 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci

On Mon, 17 Jun 2024 15:04:05 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> PCIe port devices are bound to portdrv, the PCIe port bus driver. portdrv
> does not implement an AER correctable handler (CE) but does implement the
> AER uncorrectable error (UCE). The UCE handler is fairly straightforward
> in that it only checks for frozen error state and returns the next step
> for recovery accordingly.
> 
> As a result, port devices relying on AER correctable internal errors (CIE)
> and AER uncorrectable internal errors (UIE) will not be handled. Note,
> the PCIe spec indicates AER CIE/UIE can be used to report implementation
> specific errors.[1]
> 
> CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
> are examples of devices using the AER CIE/UIE for implementation specific
> purposes. These CXL ports use the AER interrupt and AER CIE/UIE status to
> report CXL RAS errors.[2]
> 
> Add an atomic notifier to portdrv's CE/UCE handlers. Use the atomic
> notifier to report CIE/UIE errors to the registered functions. This will
> require adding a CE handler and updating the existing UCE handler.
> 
> For the UCE handler, the CXL spec states UIE errors should return need
> reset: "The only method of recovering from an Uncorrectable Internal Error
> is reset or hardware replacement."[1]
> 
> [1] PCI6.0 - 6.2.10 Internal Errors
> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>              Upstream Switch Ports
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> ---
>  drivers/pci/pcie/portdrv.c | 32 ++++++++++++++++++++++++++++++++
>  drivers/pci/pcie/portdrv.h |  2 ++
>  2 files changed, 34 insertions(+)
> 
> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> index 14a4b89a3b83..86d80e0e9606 100644
> --- a/drivers/pci/pcie/portdrv.c
> +++ b/drivers/pci/pcie/portdrv.c
> @@ -37,6 +37,9 @@ struct portdrv_service_data {
>  	u32 service;
>  };
>  
> +ATOMIC_NOTIFIER_HEAD(portdrv_aer_internal_err_chain);
> +EXPORT_SYMBOL_GPL(portdrv_aer_internal_err_chain);

Perhaps these should be per instance of the portdrv?
I'd imagine we only want to register CXL ones on CXL ports etc
and it's annoying to have to check at runtime for relevance
of a particular notifier.

> +
>  /**
>   * release_pcie_device - free PCI Express port service device structure
>   * @dev: Port service device to release
> @@ -745,11 +748,39 @@ static void pcie_portdrv_shutdown(struct pci_dev *dev)
>  static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
>  					pci_channel_state_t error)
>  {
> +	if (dev->aer_cap) {
> +		u32 status;
> +
> +		pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_UNCOR_STATUS,
> +				      &status);
> +
> +		if (status & PCI_ERR_UNC_INTN) {
> +			atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
> +						   AER_FATAL, (void *)dev);

Don't think the cast is needed as always fine to implicitly cast to and from
void * in C.
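
For illustration, a minimal userspace sketch of the point about void *
conversions. The struct and function names here are hypothetical stand-ins,
not kernel definitions; the sketch only shows that an object pointer converts
to and from void * without an explicit cast in C.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev; not the kernel definition. */
struct fake_dev {
	int id;
};

/* Takes an opaque pointer, as a notifier callback's data argument would. */
static int read_id(void *data)
{
	/* void * converts back to an object pointer without a cast. */
	struct fake_dev *dev = data;

	return dev ? dev->id : -1;
}

/* Passes a typed pointer where void * is expected; no (void *) cast. */
static int pass_without_cast(struct fake_dev *dev)
{
	return read_id(dev);
}
```

The same reasoning would let the atomic_notifier_call_chain() calls in the
patch pass dev directly as the final argument.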

> +			return PCI_ERS_RESULT_NEED_RESET;
> +		}
> +	}
> +
>  	if (error == pci_channel_io_frozen)
>  		return PCI_ERS_RESULT_NEED_RESET;
>  	return PCI_ERS_RESULT_CAN_RECOVER;
>  }
>  
> +static void pcie_portdrv_cor_error_detected(struct pci_dev *dev)
> +{
> +	u32 status;
> +
> +	if (!dev->aer_cap)
> +		return;
> +
> +	pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_COR_STATUS,
> +			      &status);
> +
> +	if (status & PCI_ERR_COR_INTERNAL)
> +		atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
> +					   AER_CORRECTABLE, (void *)dev);

No need for the cast.

> +}
> +
>  static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev)
>  {
>  	size_t off = offsetof(struct pcie_port_service_driver, slot_reset);
> @@ -780,6 +811,7 @@ static const struct pci_device_id port_pci_ids[] = {
>  
>  static const struct pci_error_handlers pcie_portdrv_err_handler = {
>  	.error_detected = pcie_portdrv_error_detected,
> +	.cor_error_detected = pcie_portdrv_cor_error_detected,
>  	.slot_reset = pcie_portdrv_slot_reset,
>  	.mmio_enabled = pcie_portdrv_mmio_enabled,
>  };
> diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
> index 12c89ea0313b..8a39197f0203 100644
> --- a/drivers/pci/pcie/portdrv.h
> +++ b/drivers/pci/pcie/portdrv.h
> @@ -121,4 +121,6 @@ static inline void pcie_pme_interrupt_enable(struct pci_dev *dev, bool en) {}
>  #endif /* !CONFIG_PCIE_PME */
>  
>  struct device *pcie_port_find_device(struct pci_dev *dev, u32 service);
> +
> +extern struct atomic_notifier_head portdrv_aer_internal_err_chain;
>  #endif /* _PORTDRV_H_ */


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors
  2024-06-20 12:30   ` Jonathan Cameron
@ 2024-06-24 15:22     ` Terry Bowman
  0 siblings, 0 replies; 59+ messages in thread
From: Terry Bowman @ 2024-06-24 15:22 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci

Hi Jonathan,

I added responses inline below.

On 6/20/24 07:30, Jonathan Cameron wrote:
> On Mon, 17 Jun 2024 15:04:05 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
> 
>> PCIe port devices are bound to portdrv, the PCIe port bus driver. portdrv
>> does not implement an AER correctable handler (CE) but does implement the
>> AER uncorrectable error (UCE). The UCE handler is fairly straightforward
>> in that it only checks for frozen error state and returns the next step
>> for recovery accordingly.
>>
>> As a result, port devices relying on AER correctable internal errors (CIE)
>> and AER uncorrectable internal errors (UIE) will not be handled. Note,
>> the PCIe spec indicates AER CIE/UIE can be used to report implementation
>> specific errors.[1]
>>
>> CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
>> are examples of devices using the AER CIE/UIE for implementation specific
>> purposes. These CXL ports use the AER interrupt and AER CIE/UIE status to
>> report CXL RAS errors.[2]
>>
>> Add an atomic notifier to portdrv's CE/UCE handlers. Use the atomic
>> notifier to report CIE/UIE errors to the registered functions. This will
>> require adding a CE handler and updating the existing UCE handler.
>>
>> For the UCE handler, the CXL spec states UIE errors should return need
>> reset: "The only method of recovering from an Uncorrectable Internal Error
>> is reset or hardware replacement."[1]
>>
>> [1] PCI6.0 - 6.2.10 Internal Errors
>> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>>              Upstream Switch Ports
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: linux-pci@vger.kernel.org
>> ---
>>  drivers/pci/pcie/portdrv.c | 32 ++++++++++++++++++++++++++++++++
>>  drivers/pci/pcie/portdrv.h |  2 ++
>>  2 files changed, 34 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
>> index 14a4b89a3b83..86d80e0e9606 100644
>> --- a/drivers/pci/pcie/portdrv.c
>> +++ b/drivers/pci/pcie/portdrv.c
>> @@ -37,6 +37,9 @@ struct portdrv_service_data {
>>  	u32 service;
>>  };
>>  
>> +ATOMIC_NOTIFIER_HEAD(portdrv_aer_internal_err_chain);
>> +EXPORT_SYMBOL_GPL(portdrv_aer_internal_err_chain);
> 
> Perhaps these should be per instance of the portdrv?
> I'd imagine we only want to register CXL ones on CXL ports etc
> and it's annoying to have to check at runtime for relevance
> of a particular notifier.
> 

This could be made per-instance by moving the notifier head into the
PCI device's drvdata. That would likely need a portdrv setup/init helper
function to enable it for a particular PCI device.
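
A rough userspace sketch of the per-instance idea, assuming each port owns
its own chain and a setup helper registers the CXL callback only on CXL
ports. Every name below (mock_port, mock_port_setup(), and so on) is
hypothetical and is not portdrv or notifier API; it only mirrors the shape
of the suggestion.

```c
#include <stddef.h>

/* Userspace mock; all names are hypothetical and not kernel API. */
#define MOCK_MAX_CALLBACKS 4

typedef int (*mock_err_cb)(unsigned long severity, void *data);

struct mock_notifier_head {
	mock_err_cb cbs[MOCK_MAX_CALLBACKS];
	size_t count;
};

/* Stand-in for per-port driver data that owns its own chain. */
struct mock_port {
	int is_cxl;
	int errors_seen;
	struct mock_notifier_head internal_err_chain;
};

static int mock_chain_register(struct mock_notifier_head *head, mock_err_cb cb)
{
	if (head->count >= MOCK_MAX_CALLBACKS)
		return -1;
	head->cbs[head->count++] = cb;
	return 0;
}

static void mock_chain_call(struct mock_notifier_head *head,
			    unsigned long severity, void *data)
{
	for (size_t i = 0; i < head->count; i++)
		head->cbs[i](severity, data);
}

/* Callback a CXL driver might register; it just counts the events. */
static int mock_cxl_callback(unsigned long severity, void *data)
{
	struct mock_port *port = data;

	(void)severity;
	port->errors_seen++;
	return 0;
}

/* Setup helper: only CXL ports get a callback on their chain. */
static int mock_port_setup(struct mock_port *port)
{
	if (!port->is_cxl)
		return 0;
	return mock_chain_register(&port->internal_err_chain,
				   mock_cxl_callback);
}

/* Error path: walks only this port's chain, so no relevance check. */
static void mock_port_report_internal_error(struct mock_port *port,
					    unsigned long severity)
{
	mock_chain_call(&port->internal_err_chain, severity, port);
}
```

With this layout a plain PCIe port has an empty chain, so the callback never
needs to check at runtime whether the reporting device is a CXL port.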

>> +
>>  /**
>>   * release_pcie_device - free PCI Express port service device structure
>>   * @dev: Port service device to release
>> @@ -745,11 +748,39 @@ static void pcie_portdrv_shutdown(struct pci_dev *dev)
>>  static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
>>  					pci_channel_state_t error)
>>  {
>> +	if (dev->aer_cap) {
>> +		u32 status;
>> +
>> +		pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_UNCOR_STATUS,
>> +				      &status);
>> +
>> +		if (status & PCI_ERR_UNC_INTN) {
>> +			atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
>> +						   AER_FATAL, (void *)dev);
> 
> Don't think the cast is needed as always fine to implicitly cast to and from
> void * in C.
> 

Ok.

>> +			return PCI_ERS_RESULT_NEED_RESET;
>> +		}
>> +	}
>> +
>>  	if (error == pci_channel_io_frozen)
>>  		return PCI_ERS_RESULT_NEED_RESET;
>>  	return PCI_ERS_RESULT_CAN_RECOVER;
>>  }
>>  
>> +static void pcie_portdrv_cor_error_detected(struct pci_dev *dev)
>> +{
>> +	u32 status;
>> +
>> +	if (!dev->aer_cap)
>> +		return;
>> +
>> +	pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_COR_STATUS,
>> +			      &status);
>> +
>> +	if (status & PCI_ERR_COR_INTERNAL)
>> +		atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
>> +					   AER_CORRECTABLE, (void *)dev);
> 
> No need for the cast.
> 

Ok

Regards,
Terry

>> +}
>> +
>>  static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev)
>>  {
>>  	size_t off = offsetof(struct pcie_port_service_driver, slot_reset);
>> @@ -780,6 +811,7 @@ static const struct pci_device_id port_pci_ids[] = {
>>  
>>  static const struct pci_error_handlers pcie_portdrv_err_handler = {
>>  	.error_detected = pcie_portdrv_error_detected,
>> +	.cor_error_detected = pcie_portdrv_cor_error_detected,
>>  	.slot_reset = pcie_portdrv_slot_reset,
>>  	.mmio_enabled = pcie_portdrv_mmio_enabled,
>>  };
>> diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
>> index 12c89ea0313b..8a39197f0203 100644
>> --- a/drivers/pci/pcie/portdrv.h
>> +++ b/drivers/pci/pcie/portdrv.h
>> @@ -121,4 +121,6 @@ static inline void pcie_pme_interrupt_enable(struct pci_dev *dev, bool en) {}
>>  #endif /* !CONFIG_PCIE_PME */
>>  
>>  struct device *pcie_port_find_device(struct pci_dev *dev, u32 service);
>> +
>> +extern struct atomic_notifier_head portdrv_aer_internal_err_chain;
>>  #endif /* _PORTDRV_H_ */
> 


* Re: [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors
  2024-06-17 20:04 ` [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors Terry Bowman
  2024-06-20 12:30   ` Jonathan Cameron
@ 2024-06-21 19:36   ` Dan Williams
  2024-06-24 18:21     ` Terry Bowman
  2024-06-26  2:54   ` Li, Ming4
  2 siblings, 1 reply; 59+ messages in thread
From: Dan Williams @ 2024-06-21 19:36 UTC (permalink / raw)
  To: Terry Bowman, dan.j.williams, ira.weiny, dave, dave.jiang,
	alison.schofield, ming4.li, vishal.l.verma, jim.harris,
	ilpo.jarvinen, ardb, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, Yazen.Ghannam, Robert.Richter
  Cc: Bjorn Helgaas, linux-pci

Terry Bowman wrote:
> PCIe port devices are bound to portdrv, the PCIe port bus driver. portdrv
> does not implement an AER correctable handler (CE) but does implement the
> AER uncorrectable error (UCE). The UCE handler is fairly straightforward
> in that it only checks for frozen error state and returns the next step
> for recovery accordingly.
> 
> As a result, port devices relying on AER correctable internal errors (CIE)
> and AER uncorrectable internal errors (UIE) will not be handled. Note,
> the PCIe spec indicates AER CIE/UIE can be used to report implementation
> specific errors.[1]
> 
> CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
> are examples of devices using the AER CIE/UIE for implementation specific
> purposes. These CXL ports use the AER interrupt and AER CIE/UIE status to
> report CXL RAS errors.[2]
> 
> Add an atomic notifier to portdrv's CE/UCE handlers. Use the atomic
> notifier to report CIE/UIE errors to the registered functions. This will
> require adding a CE handler and updating the existing UCE handler.
> 
> For the UCE handler, the CXL spec states UIE errors should return need
> reset: "The only method of recovering from an Uncorrectable Internal Error
> is reset or hardware replacement."[1]
> 
> [1] PCI6.0 - 6.2.10 Internal Errors
> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>              Upstream Switch Ports
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> ---
>  drivers/pci/pcie/portdrv.c | 32 ++++++++++++++++++++++++++++++++
>  drivers/pci/pcie/portdrv.h |  2 ++
>  2 files changed, 34 insertions(+)
> 
> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> index 14a4b89a3b83..86d80e0e9606 100644
> --- a/drivers/pci/pcie/portdrv.c
> +++ b/drivers/pci/pcie/portdrv.c
> @@ -37,6 +37,9 @@ struct portdrv_service_data {
>  	u32 service;
>  };
>  
> +ATOMIC_NOTIFIER_HEAD(portdrv_aer_internal_err_chain);
> +EXPORT_SYMBOL_GPL(portdrv_aer_internal_err_chain);
> +
>  /**
>   * release_pcie_device - free PCI Express port service device structure
>   * @dev: Port service device to release
> @@ -745,11 +748,39 @@ static void pcie_portdrv_shutdown(struct pci_dev *dev)
>  static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
>  					pci_channel_state_t error)
>  {
> +	if (dev->aer_cap) {
> +		u32 status;
> +
> +		pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_UNCOR_STATUS,
> +				      &status);
> +
> +		if (status & PCI_ERR_UNC_INTN) {
> +			atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
> +						   AER_FATAL, (void *)dev);
> +			return PCI_ERS_RESULT_NEED_RESET;
> +		}
> +	}
> +

Oh, this is a finer grained  / lower-level location than I was
expecting. I was expecting that the notifier was just conveying the port
interrupt notification to a driver that knew how to take the next step.
This pcie_portdrv_error_detected() is a notification that is already
"downstream" of the AER notification.

If PCIe does not care about CIE and UIE then don't make it care, but
redirect the notifications to the CXL side that may care.

Leave the portdrv handlers PCIe native as much as possible.

Now, I have not thought through the full implications of that
suggestion, but for now am reacting to this AER -> PCIe err_handler ->
CXL notifier as potentially more awkward than AER -> CXL notifier. It's a
separate error handling domain that the PCIe side likely does not want
to worry about. PCIe side is only responsible for allowing CXL to
register for the notifications because the AER interrupt is shared.


* Re: [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors
  2024-06-21 19:36   ` Dan Williams
@ 2024-06-24 18:21     ` Terry Bowman
  2024-06-24 21:46       ` Dan Williams
  0 siblings, 1 reply; 59+ messages in thread
From: Terry Bowman @ 2024-06-24 18:21 UTC (permalink / raw)
  To: Dan Williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter
  Cc: Bjorn Helgaas, linux-pci

Hi Dan,

I added responses inline below.

On 6/21/24 14:36, Dan Williams wrote:
> Terry Bowman wrote:
>> PCIe port devices are bound to portdrv, the PCIe port bus driver. portdrv
>> does not implement an AER correctable handler (CE) but does implement the
>> AER uncorrectable error (UCE). The UCE handler is fairly straightforward
>> in that it only checks for frozen error state and returns the next step
>> for recovery accordingly.
>>
>> As a result, port devices relying on AER correctable internal errors (CIE)
>> and AER uncorrectable internal errors (UIE) will not be handled. Note,
>> the PCIe spec indicates AER CIE/UIE can be used to report implementation
>> specific errors.[1]
>>
>> CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
>> are examples of devices using the AER CIE/UIE for implementation specific
>> purposes. These CXL ports use the AER interrupt and AER CIE/UIE status to
>> report CXL RAS errors.[2]
>>
>> Add an atomic notifier to portdrv's CE/UCE handlers. Use the atomic
>> notifier to report CIE/UIE errors to the registered functions. This will
>> require adding a CE handler and updating the existing UCE handler.
>>
>> For the UCE handler, the CXL spec states UIE errors should return need
>> reset: "The only method of recovering from an Uncorrectable Internal Error
>> is reset or hardware replacement."[1]
>>
>> [1] PCI6.0 - 6.2.10 Internal Errors
>> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>>              Upstream Switch Ports
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: linux-pci@vger.kernel.org
>> ---
>>  drivers/pci/pcie/portdrv.c | 32 ++++++++++++++++++++++++++++++++
>>  drivers/pci/pcie/portdrv.h |  2 ++
>>  2 files changed, 34 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
>> index 14a4b89a3b83..86d80e0e9606 100644
>> --- a/drivers/pci/pcie/portdrv.c
>> +++ b/drivers/pci/pcie/portdrv.c
>> @@ -37,6 +37,9 @@ struct portdrv_service_data {
>>  	u32 service;
>>  };
>>  
>> +ATOMIC_NOTIFIER_HEAD(portdrv_aer_internal_err_chain);
>> +EXPORT_SYMBOL_GPL(portdrv_aer_internal_err_chain);
>> +
>>  /**
>>   * release_pcie_device - free PCI Express port service device structure
>>   * @dev: Port service device to release
>> @@ -745,11 +748,39 @@ static void pcie_portdrv_shutdown(struct pci_dev *dev)
>>  static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
>>  					pci_channel_state_t error)
>>  {
>> +	if (dev->aer_cap) {
>> +		u32 status;
>> +
>> +		pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_UNCOR_STATUS,
>> +				      &status);
>> +
>> +		if (status & PCI_ERR_UNC_INTN) {
>> +			atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
>> +						   AER_FATAL, (void *)dev);
>> +			return PCI_ERS_RESULT_NEED_RESET;
>> +		}
>> +	}
>> +
> 
> Oh, this is a finer grained  / lower-level location than I was
> expecting. I was expecting that the notifier was just conveying the port
> interrupt notification to a driver that knew how to take the next step.
> This pcie_portdrv_error_detected() is a notification that is already
> "downstream" of the AER notification.
> 

My intent was to implement the UIE/CIE "implementation specific" behavior
mentioned in the PCI spec. This included allowing port devices to be notified
if needed. This plan is not ideal, but it works within the current PCI portdrv
constraints until we can introduce a CXL-specific port driver.

> If PCIe does not care about CIE and UIE then don't make it care, but
> redirect the notifications to the CXL side that may care.
> 
> Leave the portdrv handlers PCIe native as much as possible.
> 
> Now, I have not thought through the full implications of that
> suggestion, but for now am reacting to this AER -> PCIe err_handler ->
> CXL notifier as potentially more awkward than AER -> CXL notifier. It's a
> separate error handling domain that the PCIe side likely does not want
> to worry about. PCIe side is only responsible for allowing CXL to
> register for the notifications because the AER interrupt is shared.

Hmmm, this sounds like either option #2 or introducing a CXL portdrv service
driver.

Thanks for the reviews, and please let me know which option you
would like me to pursue.

Regards,
Terry



* Re: [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors
  2024-06-24 18:21     ` Terry Bowman
@ 2024-06-24 21:46       ` Dan Williams
  2024-06-25 14:41         ` Terry Bowman
  0 siblings, 1 reply; 59+ messages in thread
From: Dan Williams @ 2024-06-24 21:46 UTC (permalink / raw)
  To: Terry Bowman, Dan Williams, ira.weiny, dave, dave.jiang,
	alison.schofield, ming4.li, vishal.l.verma, jim.harris,
	ilpo.jarvinen, ardb, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, Yazen.Ghannam, Robert.Richter
  Cc: Bjorn Helgaas, linux-pci

Terry Bowman wrote:
> Hi Dan,
> 
> I added responses inline below.
> 
> On 6/21/24 14:36, Dan Williams wrote:
> > Terry Bowman wrote:
> >> PCIe port devices are bound to portdrv, the PCIe port bus driver. portdrv
> >> does not implement an AER correctable handler (CE) but does implement the
> >> AER uncorrectable error (UCE). The UCE handler is fairly straightforward
> >> in that it only checks for frozen error state and returns the next step
> >> for recovery accordingly.
> >>
> >> As a result, port devices relying on AER correctable internal errors (CIE)
> >> and AER uncorrectable internal errors (UIE) will not be handled. Note,
> >> the PCIe spec indicates AER CIE/UIE can be used to report implementation
> >> specific errors.[1]
> >>
> >> CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
> >> are examples of devices using the AER CIE/UIE for implementation specific
> >> purposes. These CXL ports use the AER interrupt and AER CIE/UIE status to
> >> report CXL RAS errors.[2]
> >>
> >> Add an atomic notifier to portdrv's CE/UCE handlers. Use the atomic
> >> notifier to report CIE/UIE errors to the registered functions. This will
> >> require adding a CE handler and updating the existing UCE handler.
> >>
> >> For the UCE handler, the CXL spec states UIE errors should return need
> >> reset: "The only method of recovering from an Uncorrectable Internal Error
> >> is reset or hardware replacement."[1]
> >>
> >> [1] PCI6.0 - 6.2.10 Internal Errors
> >> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
> >>              Upstream Switch Ports
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> Cc: Bjorn Helgaas <bhelgaas@google.com>
> >> Cc: linux-pci@vger.kernel.org
> >> ---
> >>  drivers/pci/pcie/portdrv.c | 32 ++++++++++++++++++++++++++++++++
> >>  drivers/pci/pcie/portdrv.h |  2 ++
> >>  2 files changed, 34 insertions(+)
> >>
> >> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> >> index 14a4b89a3b83..86d80e0e9606 100644
> >> --- a/drivers/pci/pcie/portdrv.c
> >> +++ b/drivers/pci/pcie/portdrv.c
> >> @@ -37,6 +37,9 @@ struct portdrv_service_data {
> >>  	u32 service;
> >>  };
> >>  
> >> +ATOMIC_NOTIFIER_HEAD(portdrv_aer_internal_err_chain);
> >> +EXPORT_SYMBOL_GPL(portdrv_aer_internal_err_chain);
> >> +
> >>  /**
> >>   * release_pcie_device - free PCI Express port service device structure
> >>   * @dev: Port service device to release
> >> @@ -745,11 +748,39 @@ static void pcie_portdrv_shutdown(struct pci_dev *dev)
> >>  static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
> >>  					pci_channel_state_t error)
> >>  {
> >> +	if (dev->aer_cap) {
> >> +		u32 status;
> >> +
> >> +		pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_UNCOR_STATUS,
> >> +				      &status);
> >> +
> >> +		if (status & PCI_ERR_UNC_INTN) {
> >> +			atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
> >> +						   AER_FATAL, (void *)dev);
> >> +			return PCI_ERS_RESULT_NEED_RESET;
> >> +		}
> >> +	}
> >> +
> > 
> > Oh, this is a finer grained  / lower-level location than I was
> > expecting. I was expecting that the notifier was just conveying the port
> > interrupt notification to a driver that knew how to take the next step.
> > This pcie_portdrv_error_detected() is a notification that is already
> > "downstream" of the AER notification.
> > 
> 
> My intent was to implement the UIE/CIE "implementation specific" behavior as 
> mentioned in the PCI spec. This included allowing port devices to be notified if 
> needed. This plan is not ideal but works within the PCI portdrv situation
> and before we can introduce a CXL specific portdriver.

...but it really isn't implementation specific behavior like all the
other anonymous internal error cases. This is an open standard
definition that just happens to alias with the PCIe "internal"
notification mechanism.

> 
> > If PCIe does not care about CIE and UIE then don't make it care, but
> > redirect the notifications to the CXL side that may care.
> > 
> > Leave the portdrv handlers PCIe native as much as possible.
> > 
> > Now, I have not thought through the full implications of that
> > suggestion, but for now am reacting to this AER -> PCIe err_handler ->
> > CXL notifier as potentially more awkward than AER -> CXL notifier. It's a
> > separate error handling domain that the PCIe side likely does not want
> > to worry about. PCIe side is only responsible for allowing CXL to
> > register for the notifications because the AER interrupt is shared.
> 
> Hmmm, this sounds like either option#2 or introducing a CXL portdrv service 
> driver. 
> 
> Thanks for the reviews and please let me know which option you
> would like me to pursue.

So after looking at this patchset I think calling the PCIe portdrv error
handler set for anything other than PCIe errors is likely a mistake. The
CXL protocol side of the house can experience errors that have no
relation to errors that PCIe needs to handle or care about.

I am thinking something like cxl_rch_handle_error() becomes
cxl_handle_error(), and when that successfully handles the error there
is no need to trigger pcie_do_recovery().

pcie_do_recovery() is too tightly scoped to error recovery that is
reasonable for PCIe links. That may not be reasonable for CXL devices,
where protocol errors potentially imply that a system memory
transaction failed. The blast radius of CXL protocol errors is not
constrained to single devices like in the PCIe case.

With that change something like a new cxl_do_recovery() can operate on
the cxl_port topology and know that it has exclusive control of the
error handling registers.
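
As a rough sketch of that control flow only: the mock below assumes the AER
path first offers the error to a CXL handler and falls back to PCIe recovery
when the handler does not consume it. All types and functions are
hypothetical userspace stand-ins; cxl_handle_error() and cxl_do_recovery()
are proposals in this thread, not existing interfaces.

```c
#include <stdbool.h>

/* Hypothetical mock; mirrors only the control flow discussed above. */
struct mock_err_info {
	bool is_cxl_port;	/* device is a CXL RP, DSP, or USP */
	bool internal_error;	/* AER CIE/UIE status bit was set */
};

enum mock_path {
	MOCK_PATH_CXL,		/* where a cxl_do_recovery() would run */
	MOCK_PATH_PCIE,		/* where pcie_do_recovery() would run */
};

/* Stand-in for the proposed cxl_handle_error(): true when consumed. */
static bool mock_cxl_handle_error(const struct mock_err_info *info)
{
	return info->is_cxl_port && info->internal_error;
}

/* AER-side dispatch: CXL gets the first look, PCIe is the fallback. */
static enum mock_path mock_aer_dispatch(const struct mock_err_info *info)
{
	if (mock_cxl_handle_error(info))
		return MOCK_PATH_CXL;
	return MOCK_PATH_PCIE;
}
```

In this shape the portdrv error handlers stay PCIe native, and only errors
the CXL side declines reach the PCIe recovery path.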


* Re: [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors
  2024-06-24 21:46       ` Dan Williams
@ 2024-06-25 14:41         ` Terry Bowman
  0 siblings, 0 replies; 59+ messages in thread
From: Terry Bowman @ 2024-06-25 14:41 UTC (permalink / raw)
  To: Dan Williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter
  Cc: Bjorn Helgaas, linux-pci



On 6/24/24 16:46, Dan Williams wrote:
> Terry Bowman wrote:
>> Hi Dan,
>>
>> I added responses inline below.
>>
>> On 6/21/24 14:36, Dan Williams wrote:
>>> Terry Bowman wrote:
>>>> PCIe port devices are bound to portdrv, the PCIe port bus driver. portdrv
>>>> does not implement an AER correctable handler (CE) but does implement the
>>>> AER uncorrectable error (UCE). The UCE handler is fairly straightforward
>>>> in that it only checks for frozen error state and returns the next step
>>>> for recovery accordingly.
>>>>
>>>> As a result, port devices relying on AER correctable internal errors (CIE)
>>>> and AER uncorrectable internal errors (UIE) will not be handled. Note,
>>>> the PCIe spec indicates AER CIE/UIE can be used to report implementation
>>>> specific errors.[1]
>>>>
>>>> CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
>>>> are examples of devices using the AER CIE/UIE for implementation specific
>>>> purposes. These CXL ports use the AER interrupt and AER CIE/UIE status to
>>>> report CXL RAS errors.[2]
>>>>
>>>> Add an atomic notifier to portdrv's CE/UCE handlers. Use the atomic
>>>> notifier to report CIE/UIE errors to the registered functions. This will
>>>> require adding a CE handler and updating the existing UCE handler.
>>>>
>>>> For the UCE handler, the CXL spec states UIE errors should return need
>>>> reset: "The only method of recovering from an Uncorrectable Internal Error
>>>> is reset or hardware replacement."[1]
>>>>
>>>> [1] PCI6.0 - 6.2.10 Internal Errors
>>>> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>>>>              Upstream Switch Ports
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>>>> Cc: linux-pci@vger.kernel.org
>>>> ---
>>>>  drivers/pci/pcie/portdrv.c | 32 ++++++++++++++++++++++++++++++++
>>>>  drivers/pci/pcie/portdrv.h |  2 ++
>>>>  2 files changed, 34 insertions(+)
>>>>
>>>> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
>>>> index 14a4b89a3b83..86d80e0e9606 100644
>>>> --- a/drivers/pci/pcie/portdrv.c
>>>> +++ b/drivers/pci/pcie/portdrv.c
>>>> @@ -37,6 +37,9 @@ struct portdrv_service_data {
>>>>  	u32 service;
>>>>  };
>>>>  
>>>> +ATOMIC_NOTIFIER_HEAD(portdrv_aer_internal_err_chain);
>>>> +EXPORT_SYMBOL_GPL(portdrv_aer_internal_err_chain);
>>>> +
>>>>  /**
>>>>   * release_pcie_device - free PCI Express port service device structure
>>>>   * @dev: Port service device to release
>>>> @@ -745,11 +748,39 @@ static void pcie_portdrv_shutdown(struct pci_dev *dev)
>>>>  static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
>>>>  					pci_channel_state_t error)
>>>>  {
>>>> +	if (dev->aer_cap) {
>>>> +		u32 status;
>>>> +
>>>> +		pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_UNCOR_STATUS,
>>>> +				      &status);
>>>> +
>>>> +		if (status & PCI_ERR_UNC_INTN) {
>>>> +			atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
>>>> +						   AER_FATAL, (void *)dev);
>>>> +			return PCI_ERS_RESULT_NEED_RESET;
>>>> +		}
>>>> +	}
>>>> +
>>>
>>> Oh, this is a finer grained  / lower-level location than I was
>>> expecting. I was expecting that the notifier was just conveying the port
>>> interrupt notification to a driver that knew how to take the next step.
>>> This pcie_portdrv_error_detected() is a notification that is already
>>> "downstream" of the AER notification.
>>>
>>
>> My intent was to implement the UIE/CIE "implementation specific" behavior as 
>> mentioned in the PCI spec. This included allowing port devices to be notified if 
>> needed. This plan is not ideal but works within the PCI portdrv situation
>> and before we can introduce a CXL specific portdriver.
> 
> ...but it really isn't implementation specific behavior like all the
> other anonymous internal error cases. This is an open standard
> definition that just happens to alias with the PCIe "internal"
> notification mechanism.
> 
>>
>>> If PCIe does not care about CIE and UIE then don't make it care, but
>>> redirect the notifications to the CXL side that may care.
>>>
>>> Leave the portdrv handlers PCIe native as much as possible.
>>>
>>> Now, I have not thought through the full implications of that
>>> suggestion, but for now am reacting to this AER -> PCIe err_handler ->
>>> CXL notifier as potentially more awkward than AER -> CXL notifier. It's a
>>> separate error handling domain that the PCIe side likely does not want
>>> to worry about. PCIe side is only responsible for allowing CXL to
>>> register for the notifications because the AER interrupt is shared.
>>
>> Hmmm, this sounds like either option#2 or introducing a CXL portdrv service 
>> driver. 
>>
>> Thanks for the reviews and please let me know which option you
>> would like me to pursue.
> 
> So after looking at this patchset I think calling the PCIe portdrv error
> handler set for anything other than PCIe errors is likely a mistake. The
> CXL protocol side of the house can experience errors that have no
> relation to errors that PCIe needs to handle or care about.
> 
> I am thinking something like cxl_rch_handle_error() becomes
> cxl_handle_error() and when that successfully handles the error then no
> need to trigger pcie_do_recovery().
> 
> pcie_do_recovery() is too tightly scoped to error recovery that is
> reasonable for PCIe links. That may not be reasonable to CXL devices
> where protocol errors potentially implicate that a system memory
> transaction failed. The blast radius of CXL protocol errors are not
> constrained to single devices like the PCIe case.
> 
> With that change something like a new cxl_do_recovery() can operate on
> the cxl_port topology and know that it has exclusive control of the
> error handling registers.

Ok, I'll refactor the existing AER RCH downstream port handling to support
CXL USP, DSP, and RP as well. I can incorporate much of the feedback from 
this RFC into the new patchset.

Thanks Dan.

Regards,
Terry


* Re: [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors
  2024-06-17 20:04 ` [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors Terry Bowman
  2024-06-20 12:30   ` Jonathan Cameron
  2024-06-21 19:36   ` Dan Williams
@ 2024-06-26  2:54   ` Li, Ming4
  2024-06-26 13:39     ` Terry Bowman
  2 siblings, 1 reply; 59+ messages in thread
From: Li, Ming4 @ 2024-06-26  2:54 UTC (permalink / raw)
  To: Terry Bowman, Williams, Dan J, Weiny, Ira, dave@stgolabs.net,
	Jiang, Dave, Schofield, Alison, Verma, Vishal L,
	jim.harris@samsung.com, ilpo.jarvinen@linux.intel.com,
	ardb@kernel.org, sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	Yazen.Ghannam@amd.com, Robert.Richter@amd.com
  Cc: Bjorn Helgaas, linux-pci@vger.kernel.org

On 6/18/2024 4:04 AM, Terry Bowman wrote:
> PCIe port devices are bound to portdrv, the PCIe port bus driver. portdrv
> does not implement an AER correctable handler (CE) but does implement the
> AER uncorrectable error (UCE). The UCE handler is fairly straightforward
> in that it only checks for frozen error state and returns the next step
> for recovery accordingly.
>
> As a result, port devices relying on AER correctable internal errors (CIE)
> and AER uncorrectable internal errors (UIE) will not be handled. Note,
> the PCIe spec indicates AER CIE/UIE can be used to report implementation
> specific errors.[1]
>
> CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
> are examples of devices using the AER CIE/UIE for implementation specific
> purposes. These CXL ports use the AER interrupt and AER CIE/UIE status to
> report CXL RAS errors.[2]
>
> Add an atomic notifier to portdrv's CE/UCE handlers. Use the atomic
> notifier to report CIE/UIE errors to the registered functions. This will
> require adding a CE handler and updating the existing UCE handler.
>
> For the UCE handler, return need reset for UIE errors. The PCIe spec states:
> "The only method of recovering from an Uncorrectable Internal Error is reset
> or hardware replacement."[1]
>
> [1] PCI6.0 - 6.2.10 Internal Errors
> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>              Upstream Switch Ports
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> ---
>  drivers/pci/pcie/portdrv.c | 32 ++++++++++++++++++++++++++++++++
>  drivers/pci/pcie/portdrv.h |  2 ++
>  2 files changed, 34 insertions(+)
>
> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> index 14a4b89a3b83..86d80e0e9606 100644
> --- a/drivers/pci/pcie/portdrv.c
> +++ b/drivers/pci/pcie/portdrv.c
> @@ -37,6 +37,9 @@ struct portdrv_service_data {
>  	u32 service;
>  };
>  
> +ATOMIC_NOTIFIER_HEAD(portdrv_aer_internal_err_chain);
> +EXPORT_SYMBOL_GPL(portdrv_aer_internal_err_chain);
> +
>  /**
>   * release_pcie_device - free PCI Express port service device structure
>   * @dev: Port service device to release
> @@ -745,11 +748,39 @@ static void pcie_portdrv_shutdown(struct pci_dev *dev)
>  static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
>  					pci_channel_state_t error)
>  {
> +	if (dev->aer_cap) {
> +		u32 status;
> +
> +		pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_UNCOR_STATUS,
> +				      &status);
> +
> +		if (status & PCI_ERR_UNC_INTN) {
> +			atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
> +						   AER_FATAL, (void *)dev);
> +			return PCI_ERS_RESULT_NEED_RESET;
> +		}
> +	}
> +
>  	if (error == pci_channel_io_frozen)
>  		return PCI_ERS_RESULT_NEED_RESET;
>  	return PCI_ERS_RESULT_CAN_RECOVER;
>  }
>  
> +static void pcie_portdrv_cor_error_detected(struct pci_dev *dev)
> +{
> +	u32 status;
> +
> +	if (!dev->aer_cap)
> +		return;

It seems the dev->aer_cap check is not needed in cor_error_detected(): aer_get_device_error_info() already checks it and won't call handle_error_source() if the device has no AER capability. But I am curious why pci_aer_handle_error() checks dev->aer_cap again after aer_get_device_error_info().

> +
> +	pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_COR_STATUS,
> +			      &status);
> +
> +	if (status & PCI_ERR_COR_INTERNAL)
> +		atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
> +					   AER_CORRECTABLE, (void *)dev);
> +}
> +
>  static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev)
>  {
>  	size_t off = offsetof(struct pcie_port_service_driver, slot_reset);
> @@ -780,6 +811,7 @@ static const struct pci_device_id port_pci_ids[] = {
>  
>  static const struct pci_error_handlers pcie_portdrv_err_handler = {
>  	.error_detected = pcie_portdrv_error_detected,
> +	.cor_error_detected = pcie_portdrv_cor_error_detected,
>  	.slot_reset = pcie_portdrv_slot_reset,
>  	.mmio_enabled = pcie_portdrv_mmio_enabled,
>  };
> diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
> index 12c89ea0313b..8a39197f0203 100644
> --- a/drivers/pci/pcie/portdrv.h
> +++ b/drivers/pci/pcie/portdrv.h
> @@ -121,4 +121,6 @@ static inline void pcie_pme_interrupt_enable(struct pci_dev *dev, bool en) {}
>  #endif /* !CONFIG_PCIE_PME */
>  
>  struct device *pcie_port_find_device(struct pci_dev *dev, u32 service);
> +
> +extern struct atomic_notifier_head portdrv_aer_internal_err_chain;
>  #endif /* _PORTDRV_H_ */
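
For readers following the notifier flow in the quoted patch, here is a
minimal userspace sketch of it. It is a model under stated assumptions,
not the kernel implementation: every model_* name is hypothetical, the
chain is a plain singly linked list with no locking or RCU (unlike a real
atomic notifier chain), and status values are passed in a struct instead
of being read from config space. Only the two status bit masks are taken
from the kernel's pci_regs.h.

```c
#include <stddef.h>
#include <stdint.h>

/* AER internal error status bits, as defined in the kernel's pci_regs.h. */
#define PCI_ERR_UNC_INTN	0x00400000u	/* Uncorrectable Internal Error */
#define PCI_ERR_COR_INTERNAL	0x00004000u	/* Corrected Internal Error */

/* Model-only enums; names and values are assumptions for this sketch. */
enum model_severity { MODEL_AER_NONFATAL, MODEL_AER_FATAL, MODEL_AER_CORRECTABLE };
enum model_result { MODEL_CAN_RECOVER, MODEL_NEED_RESET };

/* Hypothetical stand-in for a port's AER status registers. */
struct model_port {
	uint32_t uncor_status;
	uint32_t cor_status;
};

/* Hypothetical stand-in for struct notifier_block. */
struct model_notifier {
	void (*fn)(enum model_severity sev, struct model_port *port);
	struct model_notifier *next;
};

static struct model_notifier *model_chain;

static void model_chain_register(struct model_notifier *nb)
{
	nb->next = model_chain;
	model_chain = nb;
}

/* Call every registered callback; return how many were called. */
static int model_chain_call(enum model_severity sev, struct model_port *port)
{
	int called = 0;

	for (struct model_notifier *nb = model_chain; nb; nb = nb->next) {
		nb->fn(sev, port);
		called++;
	}
	return called;
}

/* UCE path: internal errors are forwarded and always need a reset. */
static enum model_result model_error_detected(struct model_port *port, int frozen)
{
	if (port->uncor_status & PCI_ERR_UNC_INTN) {
		model_chain_call(MODEL_AER_FATAL, port);
		return MODEL_NEED_RESET;
	}
	return frozen ? MODEL_NEED_RESET : MODEL_CAN_RECOVER;
}

/* CE path: only correctable internal errors are forwarded. */
static void model_cor_error_detected(struct model_port *port)
{
	if (port->cor_status & PCI_ERR_COR_INTERNAL)
		model_chain_call(MODEL_AER_CORRECTABLE, port);
}

/* Example subscriber, playing the role of the cxl_pci callback. */
static int model_ce_seen, model_uce_seen;

static void model_cxl_callback(enum model_severity sev, struct model_port *port)
{
	(void)port;
	if (sev == MODEL_AER_CORRECTABLE)
		model_ce_seen++;
	else
		model_uce_seen++;
}
```

A subscriber registers once with model_chain_register() and is then
called for every internal error the two handlers see. Errors that are not
internal errors fall through to the existing frozen/recoverable logic.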



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors
  2024-06-26  2:54   ` Li, Ming4
@ 2024-06-26 13:39     ` Terry Bowman
  0 siblings, 0 replies; 59+ messages in thread
From: Terry Bowman @ 2024-06-26 13:39 UTC (permalink / raw)
  To: Li, Ming4, Williams, Dan J, Weiny, Ira, dave@stgolabs.net,
	Jiang, Dave, Schofield, Alison, Verma, Vishal L,
	jim.harris@samsung.com, ilpo.jarvinen@linux.intel.com,
	ardb@kernel.org, sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	Yazen.Ghannam@amd.com, Robert.Richter@amd.com
  Cc: Bjorn Helgaas, linux-pci@vger.kernel.org



On 6/25/24 21:54, Li, Ming4 wrote:
> On 6/18/2024 4:04 AM, Terry Bowman wrote:
>> PCIe port devices are bound to portdrv, the PCIe port bus driver. portdrv
>> does not implement an AER correctable error (CE) handler but does implement
>> an AER uncorrectable error (UCE) handler. The UCE handler is fairly
>> straightforward in that it only checks for frozen error state and returns
>> the next step for recovery accordingly.
>>
>> As a result, port devices relying on AER correctable internal errors (CIE)
>> and AER uncorrectable internal errors (UIE) will not be handled. Note,
>> the PCIe spec indicates AER CIE/UIE can be used to report implementation
>> specific errors.[1]
>>
>> CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
>> are examples of devices using the AER CIE/UIE for implementation specific
>> purposes. These CXL ports use the AER interrupt and AER CIE/UIE status to
>> report CXL RAS errors.[2]
>>
>> Add an atomic notifier to portdrv's CE/UCE handlers. Use the atomic
>> notifier to report CIE/UIE errors to the registered functions. This will
>> require adding a CE handler and updating the existing UCE handler.
>>
>> For the UCE handler, return need reset for UIE errors. The PCIe spec states:
>> "The only method of recovering from an Uncorrectable Internal Error is reset
>> or hardware replacement."[1]
>>
>> [1] PCI6.0 - 6.2.10 Internal Errors
>> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>>              Upstream Switch Ports
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: linux-pci@vger.kernel.org
>> ---
>>  drivers/pci/pcie/portdrv.c | 32 ++++++++++++++++++++++++++++++++
>>  drivers/pci/pcie/portdrv.h |  2 ++
>>  2 files changed, 34 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
>> index 14a4b89a3b83..86d80e0e9606 100644
>> --- a/drivers/pci/pcie/portdrv.c
>> +++ b/drivers/pci/pcie/portdrv.c
>> @@ -37,6 +37,9 @@ struct portdrv_service_data {
>>  	u32 service;
>>  };
>>  
>> +ATOMIC_NOTIFIER_HEAD(portdrv_aer_internal_err_chain);
>> +EXPORT_SYMBOL_GPL(portdrv_aer_internal_err_chain);
>> +
>>  /**
>>   * release_pcie_device - free PCI Express port service device structure
>>   * @dev: Port service device to release
>> @@ -745,11 +748,39 @@ static void pcie_portdrv_shutdown(struct pci_dev *dev)
>>  static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
>>  					pci_channel_state_t error)
>>  {
>> +	if (dev->aer_cap) {
>> +		u32 status;
>> +
>> +		pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_UNCOR_STATUS,
>> +				      &status);
>> +
>> +		if (status & PCI_ERR_UNC_INTN) {
>> +			atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
>> +						   AER_FATAL, (void *)dev);
>> +			return PCI_ERS_RESULT_NEED_RESET;
>> +		}
>> +	}
>> +
>>  	if (error == pci_channel_io_frozen)
>>  		return PCI_ERS_RESULT_NEED_RESET;
>>  	return PCI_ERS_RESULT_CAN_RECOVER;
>>  }
>>  
>> +static void pcie_portdrv_cor_error_detected(struct pci_dev *dev)
>> +{
>> +	u32 status;
>> +
>> +	if (!dev->aer_cap)
>> +		return;
> 
> It seems the dev->aer_cap check is not needed in cor_error_detected(): aer_get_device_error_info() already checks it and won't call handle_error_source() if the device has no AER capability. But I am curious why pci_aer_handle_error() checks dev->aer_cap again after aer_get_device_error_info().
> 

Hi Ming,

I agree this check should be removed. 

Regards,
Terry
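
As a rough illustration of the agreed change, the sketch below models a
CE handler that drops its own aer_cap check and relies on the caller's
gate instead. It assumes, as described in the review above, that the AER
core does not dispatch to the handler when the device has no AER
capability. All sketch_* names are hypothetical, config space is a plain
array, and the handler counts the event instead of calling a notifier
chain. The register offset and bit mask are the pci_regs.h values.

```c
#include <stdint.h>

#define PCI_ERR_COR_STATUS	0x10		/* offset within the AER capability */
#define PCI_ERR_COR_INTERNAL	0x00004000u	/* Corrected Internal Error */

/* Hypothetical stand-in for struct pci_dev plus its config space. */
struct sketch_dev {
	uint16_t aer_cap;		/* 0 when the device has no AER capability */
	uint32_t config[1024];		/* 4 KiB of config space, one dword per entry */
	int internal_ce_reported;	/* stands in for the notifier call */
};

static uint32_t sketch_read_config_dword(const struct sketch_dev *dev,
					 unsigned int off)
{
	return dev->config[off / 4];
}

/* Handler without its own aer_cap check; it relies on the caller's gate. */
static void sketch_cor_error_detected(struct sketch_dev *dev)
{
	uint32_t status;

	status = sketch_read_config_dword(dev, dev->aer_cap + PCI_ERR_COR_STATUS);
	if (status & PCI_ERR_COR_INTERNAL)
		dev->internal_ce_reported++;
}

/* Model of the AER core gate; returns 1 if the handler was called. */
static int sketch_aer_dispatch(struct sketch_dev *dev)
{
	if (!dev->aer_cap)
		return 0;

	sketch_cor_error_detected(dev);
	return 1;
}
```

The handler is only safe without the check as long as every caller goes
through a gate like sketch_aer_dispatch().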

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [RFC PATCH 4/9] cxl/pci: Map CXL PCIe ports' RAS registers
  2024-06-17 20:04 [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports Terry Bowman
                   ` (2 preceding siblings ...)
  2024-06-17 20:04 ` [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors Terry Bowman
@ 2024-06-17 20:04 ` Terry Bowman
  2024-06-20 12:46   ` Jonathan Cameron
  2024-06-26  3:39   ` Li, Ming4
  2024-06-17 20:04 ` [RFC PATCH 5/9] cxl/pci: Update RAS handler interfaces to support CXL PCIe ports Terry Bowman
                   ` (6 subsequent siblings)
  10 siblings, 2 replies; 59+ messages in thread
From: Terry Bowman @ 2024-06-17 20:04 UTC (permalink / raw)
  To: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel, terry.bowman,
	Yazen.Ghannam, Robert.Richter

RAS registers are not currently mapped for CXL root ports, CXL downstream
switch ports, and CXL upstream switch ports. Update the driver to map the
ports' RAS registers in preparation for RAS logging and handling to be
added in the future.

Add a 'struct cxl_regs' variable to 'struct cxl_port'. This will be used
to store a pointer to the upstream port's mapped RAS registers.

Invoke the RAS mapping logic from the CXL memory device probe routine
after the endpoint is added. This ensures the ports have completed
training and the RAS registers are present in CXL.cachemem.

Refactor the cxl_dport_map_regs() function to support mapping the CXL
PCIe ports. Also, check for previously mapped registers in the topology
including CXL switch. Endpoints under a CXL switch share a CXL root port
and will be iterated for each endpoint. Only map once.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/pci.c | 30 +++++++++++++++++++++++++-----
 drivers/cxl/cxl.h      |  5 +++++
 drivers/cxl/mem.c      | 32 ++++++++++++++++++++++++++++++--
 3 files changed, 60 insertions(+), 7 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 0df09bd79408..e6c91b3dfccf 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -787,16 +787,26 @@ static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
 	dport->regs.dport_aer = dport_aer;
 }
 
-static void cxl_dport_map_regs(struct cxl_dport *dport)
+static void cxl_port_map_regs(struct device *dev,
+			      struct cxl_register_map *map,
+			      struct cxl_regs *regs)
 {
-	struct cxl_register_map *map = &dport->reg_map;
-	struct device *dev = dport->dport_dev;
-
 	if (!map->component_map.ras.valid)
 		dev_dbg(dev, "RAS registers not found\n");
-	else if (cxl_map_component_regs(map, &dport->regs.component,
+	else if (regs->ras)
+		dev_dbg(dev, "RAS registers already initialized\n");
+	else if (cxl_map_component_regs(map, &regs->component,
 					BIT(CXL_CM_CAP_CAP_ID_RAS)))
 		dev_dbg(dev, "Failed to map RAS capability.\n");
+}
+
+static void cxl_dport_map_regs(struct cxl_dport *dport)
+{
+	struct cxl_register_map *map = &dport->reg_map;
+	struct cxl_regs *regs = &dport->regs;
+	struct device *dev = dport->dport_dev;
+
+	cxl_port_map_regs(dev, map, regs);
 
 	if (dport->rch)
 		cxl_dport_map_rch_aer(dport);
@@ -831,6 +841,16 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
 	}
 }
 
+void cxl_setup_parent_uport(struct device *host, struct cxl_port *port)
+{
+	struct cxl_register_map *map = &port->reg_map;
+	struct cxl_regs *regs = &port->regs;
+	struct device *uport_dev = port->uport_dev;
+
+	cxl_port_map_regs(uport_dev, map, regs);
+}
+EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_uport, CXL);
+
 void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport)
 {
 	struct device *dport_dev = dport->dport_dev;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 036d17db68e0..7cee678fdb75 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -587,6 +587,7 @@ struct cxl_dax_region {
  * @parent_dport: dport that points to this port in the parent
  * @decoder_ida: allocator for decoder ids
  * @reg_map: component and ras register mapping parameters
+ * @regs: mapped component registers
  * @nr_dports: number of entries in @dports
  * @hdm_end: track last allocated HDM decoder instance for allocation ordering
  * @commit_end: cursor to track highest committed decoder for commit ordering
@@ -607,6 +608,7 @@ struct cxl_port {
 	struct cxl_dport *parent_dport;
 	struct ida decoder_ida;
 	struct cxl_register_map reg_map;
+	struct cxl_regs regs;
 	int nr_dports;
 	int hdm_end;
 	int commit_end;
@@ -757,9 +759,12 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
 
 #ifdef CONFIG_PCIEAER_CXL
 void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport);
+void cxl_setup_parent_uport(struct device *host, struct cxl_port *port);
 #else
 static inline void cxl_setup_parent_dport(struct device *host,
 					  struct cxl_dport *dport) { }
+static inline void cxl_setup_parent_uport(struct device *host,
+					  struct cxl_port *port) { }
 #endif
 
 struct cxl_decoder *to_cxl_decoder(struct device *dev);
diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
index 0c79d9ce877c..51a4641fc9a6 100644
--- a/drivers/cxl/mem.c
+++ b/drivers/cxl/mem.c
@@ -45,10 +45,39 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
 	return 0;
 }
 
+static bool cxl_dev_is_pci_type(struct device *dev, u32 pcie_type)
+{
+	struct pci_dev *pdev;
+
+	if (!dev_is_pci(dev))
+		return false;
+
+	pdev = to_pci_dev(dev);
+	if (pci_pcie_type(pdev) != pcie_type)
+		return false;
+
+	return pci_find_dvsec_capability(pdev, PCI_DVSEC_VENDOR_ID_CXL,
+					 CXL_DVSEC_REG_LOCATOR);
+}
+
+static void cxl_setup_ep_parent_ports(struct cxl_ep *ep, struct device *host)
+{
+	struct cxl_dport *dport = ep->dport;
+
+	if (cxl_dev_is_pci_type(dport->dport_dev, PCI_EXP_TYPE_DOWNSTREAM) ||
+	    cxl_dev_is_pci_type(dport->dport_dev, PCI_EXP_TYPE_ROOT_PORT))
+		cxl_setup_parent_dport(host, ep->dport);
+
+	if (cxl_dev_is_pci_type(dport->port->uport_dev, PCI_EXP_TYPE_UPSTREAM))
+		cxl_setup_parent_uport(host, ep->dport->port);
+}
+
 static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
 				 struct cxl_dport *parent_dport)
 {
 	struct cxl_port *parent_port = parent_dport->port;
+	struct cxl_dev_state *cxlds = cxlmd->cxlds;
+	struct pci_dev *pdev = to_pci_dev(cxlds->dev);
 	struct cxl_port *endpoint, *iter, *down;
 	int rc;
 
@@ -62,6 +91,7 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
 
 		ep = cxl_ep_load(iter, cxlmd);
 		ep->next = down;
+		cxl_setup_ep_parent_ports(ep, &pdev->dev);
 	}
 
 	/* Note: endpoint port component registers are derived from @cxlds */
@@ -157,8 +187,6 @@ static int cxl_mem_probe(struct device *dev)
 	else
 		endpoint_parent = &parent_port->dev;
 
-	cxl_setup_parent_dport(dev, dport);
-
 	device_lock(endpoint_parent);
 	if (!endpoint_parent->driver) {
 		dev_err(dev, "CXL port topology %s not enabled\n",
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 4/9] cxl/pci: Map CXL PCIe ports' RAS registers
  2024-06-17 20:04 ` [RFC PATCH 4/9] cxl/pci: Map CXL PCIe ports' RAS registers Terry Bowman
@ 2024-06-20 12:46   ` Jonathan Cameron
  2024-06-24 15:51     ` Terry Bowman
  2024-06-26  3:39   ` Li, Ming4
  1 sibling, 1 reply; 59+ messages in thread
From: Jonathan Cameron @ 2024-06-20 12:46 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter

On Mon, 17 Jun 2024 15:04:06 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> RAS registers are not currently mapped for CXL root ports, CXL downstream
> switch ports, and CXL upstream switch ports. Update the driver to map the
> ports' RAS registers in preparation for RAS logging and handling to be
> added in the future.
> 
> Add a 'struct cxl_regs' variable to 'struct cxl_port'. This will be used
> to store a pointer to the upstream port's mapped RAS registers.
> 
> Invoke the RAS mapping logic from the CXL memory device probe routine
> after the endpoint is added. This ensures the ports have completed
> training and the RAS registers are present in CXL.cachemem.
> 
> Refactor the cxl_dport_map_regs() function to support mapping the CXL
> PCIe ports. Also, check for previously mapped registers in the topology
> including CXL switch. Endpoints under a CXL switch share a CXL root port
> and will be iterated for each endpoint. Only map once.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Hi Terry,

A few minor comments inline.

Thanks,

Jonathan

> ---
>  drivers/cxl/core/pci.c | 30 +++++++++++++++++++++++++-----
>  drivers/cxl/cxl.h      |  5 +++++
>  drivers/cxl/mem.c      | 32 ++++++++++++++++++++++++++++++--
>  3 files changed, 60 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 0df09bd79408..e6c91b3dfccf 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -787,16 +787,26 @@ static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
>  	dport->regs.dport_aer = dport_aer;
>  }
>  
> -static void cxl_dport_map_regs(struct cxl_dport *dport)
> +static void cxl_port_map_regs(struct device *dev,
> +			      struct cxl_register_map *map,
> +			      struct cxl_regs *regs)
>  {
> -	struct cxl_register_map *map = &dport->reg_map;
> -	struct device *dev = dport->dport_dev;
> -
>  	if (!map->component_map.ras.valid)
>  		dev_dbg(dev, "RAS registers not found\n");

Maybe return here, as nothing useful is going to happen after this.

> -	else if (cxl_map_component_regs(map, &dport->regs.component,
> +	else if (regs->ras)
> +		dev_dbg(dev, "RAS registers already initialized\n");

Likewise, return if this condition occurs.

> +	else if (cxl_map_component_regs(map, &regs->component,
>  					BIT(CXL_CM_CAP_CAP_ID_RAS)))
>  		dev_dbg(dev, "Failed to map RAS capability.\n");
> +}
> +
> +static void cxl_dport_map_regs(struct cxl_dport *dport)
> +{
> +	struct cxl_register_map *map = &dport->reg_map;
> +	struct cxl_regs *regs = &dport->regs;
> +	struct device *dev = dport->dport_dev;
> +
> +	cxl_port_map_regs(dev, map, regs);
>  
>  	if (dport->rch)
>  		cxl_dport_map_rch_aer(dport);
> @@ -831,6 +841,16 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>  	}
>  }
>  
> +void cxl_setup_parent_uport(struct device *host, struct cxl_port *port)
> +{
> +	struct cxl_register_map *map = &port->reg_map;
> +	struct cxl_regs *regs = &port->regs;
> +	struct device *uport_dev = port->uport_dev;
> +
> +	cxl_port_map_regs(uport_dev, map, regs);

Maybe it will be used later, but based on this patch alone.
	cxl_port_map_regs(port->uport_dev, &port->reg_map,
			  &port->regs);

is more compact and I don't think loses anything on the readability front.


> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_uport, CXL);
> +
>  void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport)
>  {
>  	struct device *dport_dev = dport->dport_dev;
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 036d17db68e0..7cee678fdb75 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -587,6 +587,7 @@ struct cxl_dax_region {
>   * @parent_dport: dport that points to this port in the parent
>   * @decoder_ida: allocator for decoder ids
>   * @reg_map: component and ras register mapping parameters
> + * @regs: mapped component registers
>   * @nr_dports: number of entries in @dports
>   * @hdm_end: track last allocated HDM decoder instance for allocation ordering
>   * @commit_end: cursor to track highest committed decoder for commit ordering
> @@ -607,6 +608,7 @@ struct cxl_port {
>  	struct cxl_dport *parent_dport;
>  	struct ida decoder_ida;
>  	struct cxl_register_map reg_map;
> +	struct cxl_regs regs;

Does mapping the whole cxl_regs in make sense?
At least currently we can't use the pmu regs in there from here
for instance - the mess with interrupts means that has to bind
via the port driver (for now anyway).
Maybe struct cxl_component_regs is more appropriate here?


>  	int nr_dports;
>  	int hdm_end;
>  	int commit_end;
> @@ -757,9 +759,12 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
>  
>  #ifdef CONFIG_PCIEAER_CXL
>  void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport);
> +void cxl_setup_parent_uport(struct device *host, struct cxl_port *port);
>  #else
>  static inline void cxl_setup_parent_dport(struct device *host,
>  					  struct cxl_dport *dport) { }
> +static inline void cxl_setup_parent_uport(struct device *host,
> +					  struct cxl_port *port) { }
>  #endif
>  
>  struct cxl_decoder *to_cxl_decoder(struct device *dev);
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 0c79d9ce877c..51a4641fc9a6 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -45,10 +45,39 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
>  	return 0;
>  }
>  
> +static bool cxl_dev_is_pci_type(struct device *dev, u32 pcie_type)
Naming perhaps needs work to make it clear this is checking if
it's a CXL device of that type.
Also, seems like general functionality that belongs in core/pci.c or
similar.

> +{
> +	struct pci_dev *pdev;
> +
> +	if (!dev_is_pci(dev))
> +		return false;
> +
> +	pdev = to_pci_dev(dev);
> +	if (pci_pcie_type(pdev) != pcie_type)
> +		return false;
> +
> +	return pci_find_dvsec_capability(pdev, PCI_DVSEC_VENDOR_ID_CXL,
> +					 CXL_DVSEC_REG_LOCATOR);
> +}
> +
> +static void cxl_setup_ep_parent_ports(struct cxl_ep *ep, struct device *host)
> +{
> +	struct cxl_dport *dport = ep->dport;
> +
> +	if (cxl_dev_is_pci_type(dport->dport_dev, PCI_EXP_TYPE_DOWNSTREAM) ||
> +	    cxl_dev_is_pci_type(dport->dport_dev, PCI_EXP_TYPE_ROOT_PORT))
> +		cxl_setup_parent_dport(host, ep->dport);
> +
> +	if (cxl_dev_is_pci_type(dport->port->uport_dev, PCI_EXP_TYPE_UPSTREAM))
> +		cxl_setup_parent_uport(host, ep->dport->port);
> +}
> +
>  static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
>  				 struct cxl_dport *parent_dport)
>  {
>  	struct cxl_port *parent_port = parent_dport->port;
> +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +	struct pci_dev *pdev = to_pci_dev(cxlds->dev);
>  	struct cxl_port *endpoint, *iter, *down;
>  	int rc;
>  
> @@ -62,6 +91,7 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
>  
>  		ep = cxl_ep_load(iter, cxlmd);
>  		ep->next = down;
> +		cxl_setup_ep_parent_ports(ep, &pdev->dev);
>  	}
>  
>  	/* Note: endpoint port component registers are derived from @cxlds */
> @@ -157,8 +187,6 @@ static int cxl_mem_probe(struct device *dev)
>  	else
>  		endpoint_parent = &parent_port->dev;
>  
> -	cxl_setup_parent_dport(dev, dport);
> -
>  	device_lock(endpoint_parent);
>  	if (!endpoint_parent->driver) {
>  		dev_err(dev, "CXL port topology %s not enabled\n",


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 4/9] cxl/pci: Map CXL PCIe ports' RAS registers
  2024-06-20 12:46   ` Jonathan Cameron
@ 2024-06-24 15:51     ` Terry Bowman
  2024-07-02 15:18       ` Jonathan Cameron
  0 siblings, 1 reply; 59+ messages in thread
From: Terry Bowman @ 2024-06-24 15:51 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter

Hi Jonathan,

I added responses inline below.

On 6/20/24 07:46, Jonathan Cameron wrote:
> On Mon, 17 Jun 2024 15:04:06 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
> 
>> RAS registers are not currently mapped for CXL root ports, CXL downstream
>> switch ports, and CXL upstream switch ports. Update the driver to map the
>> ports' RAS registers in preparation for RAS logging and handling to be
>> added in the future.
>>
>> Add a 'struct cxl_regs' variable to 'struct cxl_port'. This will be used
>> to store a pointer to the upstream port's mapped RAS registers.
>>
>> Invoke the RAS mapping logic from the CXL memory device probe routine
>> after the endpoint is added. This ensures the ports have completed
>> training and the RAS registers are present in CXL.cachemem.
>>
>> Refactor the cxl_dport_map_regs() function to support mapping the CXL
>> PCIe ports. Also, check for previously mapped registers in the topology
>> including CXL switch. Endpoints under a CXL switch share a CXL root port
>> and will be iterated for each endpoint. Only map once.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Hi Terry,
> 
> A few minor comments inline.
> 
> Thanks,
> 
> Jonathan
> 
>> ---
>>  drivers/cxl/core/pci.c | 30 +++++++++++++++++++++++++-----
>>  drivers/cxl/cxl.h      |  5 +++++
>>  drivers/cxl/mem.c      | 32 ++++++++++++++++++++++++++++++--
>>  3 files changed, 60 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 0df09bd79408..e6c91b3dfccf 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -787,16 +787,26 @@ static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
>>  	dport->regs.dport_aer = dport_aer;
>>  }
>>  
>> -static void cxl_dport_map_regs(struct cxl_dport *dport)
>> +static void cxl_port_map_regs(struct device *dev,
>> +			      struct cxl_register_map *map,
>> +			      struct cxl_regs *regs)
>>  {
>> -	struct cxl_register_map *map = &dport->reg_map;
>> -	struct device *dev = dport->dport_dev;
>> -
>>  	if (!map->component_map.ras.valid)
>>  		dev_dbg(dev, "RAS registers not found\n");
> 
> Maybe return here, as nothing useful is going to happen after this.
> 

Ok

>> -	else if (cxl_map_component_regs(map, &dport->regs.component,
>> +	else if (regs->ras)
>> +		dev_dbg(dev, "RAS registers already initialized\n");
> 
> Likewise, return if this condition occurs.
> 
Ok

>> +	else if (cxl_map_component_regs(map, &regs->component,
>>  					BIT(CXL_CM_CAP_CAP_ID_RAS)))
>>  		dev_dbg(dev, "Failed to map RAS capability.\n");
>> +}
>> +
>> +static void cxl_dport_map_regs(struct cxl_dport *dport)
>> +{
>> +	struct cxl_register_map *map = &dport->reg_map;
>> +	struct cxl_regs *regs = &dport->regs;
>> +	struct device *dev = dport->dport_dev;
>> +
>> +	cxl_port_map_regs(dev, map, regs);
>>  
>>  	if (dport->rch)
>>  		cxl_dport_map_rch_aer(dport);
>> @@ -831,6 +841,16 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>>  	}
>>  }
>>  
>> +void cxl_setup_parent_uport(struct device *host, struct cxl_port *port)
>> +{
>> +	struct cxl_register_map *map = &port->reg_map;
>> +	struct cxl_regs *regs = &port->regs;
>> +	struct device *uport_dev = port->uport_dev;
>> +
>> +	cxl_port_map_regs(uport_dev, map, regs);
> 
> Maybe it will be used later, but based on this patch alone.
> 	cxl_port_map_regs(port->uport_dev, &port->reg_map,
> 			  &port->regs);
> is more compact and I don't think loses anything on the readability front.
> 
> 
Good point.

>> +}
>> +EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_uport, CXL);
>> +
>>  void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport)
>>  {
>>  	struct device *dport_dev = dport->dport_dev;
>> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
>> index 036d17db68e0..7cee678fdb75 100644
>> --- a/drivers/cxl/cxl.h
>> +++ b/drivers/cxl/cxl.h
>> @@ -587,6 +587,7 @@ struct cxl_dax_region {
>>   * @parent_dport: dport that points to this port in the parent
>>   * @decoder_ida: allocator for decoder ids
>>   * @reg_map: component and ras register mapping parameters
>> + * @regs: mapped component registers
>>   * @nr_dports: number of entries in @dports
>>   * @hdm_end: track last allocated HDM decoder instance for allocation ordering
>>   * @commit_end: cursor to track highest committed decoder for commit ordering
>> @@ -607,6 +608,7 @@ struct cxl_port {
>>  	struct cxl_dport *parent_dport;
>>  	struct ida decoder_ida;
>>  	struct cxl_register_map reg_map;
>> +	struct cxl_regs regs;
> 
> Does mapping the whole cxl_regs in make sense?
> At least currently we can't use the pmu regs in there from here
> for instance - the mess with interrupts means that has to bind
> via the port driver (for now anyway).
> Maybe struct cxl_component_regs is more appropriate here?
> 
> 
Only the RAS registers are mapped. But, as you point out, this can be changed
to struct cxl_component_regs, which more precisely reflects how it's used.

>>  	int nr_dports;
>>  	int hdm_end;
>>  	int commit_end;
>> @@ -757,9 +759,12 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
>>  
>>  #ifdef CONFIG_PCIEAER_CXL
>>  void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport);
>> +void cxl_setup_parent_uport(struct device *host, struct cxl_port *port);
>>  #else
>>  static inline void cxl_setup_parent_dport(struct device *host,
>>  					  struct cxl_dport *dport) { }
>> +static inline void cxl_setup_parent_uport(struct device *host,
>> +					  struct cxl_port *port) { }
>>  #endif
>>  
>>  struct cxl_decoder *to_cxl_decoder(struct device *dev);
>> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
>> index 0c79d9ce877c..51a4641fc9a6 100644
>> --- a/drivers/cxl/mem.c
>> +++ b/drivers/cxl/mem.c
>> @@ -45,10 +45,39 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
>>  	return 0;
>>  }
>>  
>> +static bool cxl_dev_is_pci_type(struct device *dev, u32 pcie_type)
> Naming perhaps needs work to make it clear this is checking if
> it's a CXL device of that type.
> Also, seems like general functionality that belongs in core/pci.c or
> similar.

Any suggestions on what to use for the rename?

Regards,
Terry

> 
>> +{
>> +	struct pci_dev *pdev;
>> +
>> +	if (!dev_is_pci(dev))
>> +		return false;
>> +
>> +	pdev = to_pci_dev(dev);
>> +	if (pci_pcie_type(pdev) != pcie_type)
>> +		return false;
>> +
>> +	return pci_find_dvsec_capability(pdev, PCI_DVSEC_VENDOR_ID_CXL,
>> +					 CXL_DVSEC_REG_LOCATOR);
>> +}
>> +
>> +static void cxl_setup_ep_parent_ports(struct cxl_ep *ep, struct device *host)
>> +{
>> +	struct cxl_dport *dport = ep->dport;
>> +
>> +	if (cxl_dev_is_pci_type(dport->dport_dev, PCI_EXP_TYPE_DOWNSTREAM) ||
>> +	    cxl_dev_is_pci_type(dport->dport_dev, PCI_EXP_TYPE_ROOT_PORT))
>> +		cxl_setup_parent_dport(host, ep->dport);
>> +
>> +	if (cxl_dev_is_pci_type(dport->port->uport_dev, PCI_EXP_TYPE_UPSTREAM))
>> +		cxl_setup_parent_uport(host, ep->dport->port);
>> +}
>> +
>>  static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
>>  				 struct cxl_dport *parent_dport)
>>  {
>>  	struct cxl_port *parent_port = parent_dport->port;
>> +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
>> +	struct pci_dev *pdev = to_pci_dev(cxlds->dev);
>>  	struct cxl_port *endpoint, *iter, *down;
>>  	int rc;
>>  
>> @@ -62,6 +91,7 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
>>  
>>  		ep = cxl_ep_load(iter, cxlmd);
>>  		ep->next = down;
>> +		cxl_setup_ep_parent_ports(ep, &pdev->dev);
>>  	}
>>  
>>  	/* Note: endpoint port component registers are derived from @cxlds */
>> @@ -157,8 +187,6 @@ static int cxl_mem_probe(struct device *dev)
>>  	else
>>  		endpoint_parent = &parent_port->dev;
>>  
>> -	cxl_setup_parent_dport(dev, dport);
>> -
>>  	device_lock(endpoint_parent);
>>  	if (!endpoint_parent->driver) {
>>  		dev_err(dev, "CXL port topology %s not enabled\n",
> 
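
The two early returns agreed above, together with the map-once rule from
the commit message, can be sketched in userspace as follows. This is a
model under assumptions, not the driver code: the model_* types are
hypothetical stand-ins for struct cxl_register_map and the mapped
registers, a NULL base address stands in for a mapping failure, and the
function returns a result code so the outcome can be observed, whereas
the patch's function returns void and logs with dev_dbg().

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the RAS part of struct cxl_register_map. */
struct model_reg_map {
	bool ras_valid;		/* RAS capability was found during enumeration */
	void *ras_base;		/* where the registers would be mapped from */
};

/* Hypothetical stand-in for the mapped component registers. */
struct model_regs {
	void *ras;		/* non-NULL once mapped */
};

enum model_map_result {
	MODEL_MAP_NOT_FOUND,
	MODEL_MAP_ALREADY_MAPPED,
	MODEL_MAP_FAILED,
	MODEL_MAP_OK,
};

static enum model_map_result
model_port_map_regs(const struct model_reg_map *map, struct model_regs *regs)
{
	/* Nothing to map: return early. */
	if (!map->ras_valid)
		return MODEL_MAP_NOT_FOUND;

	/* A port shared by several endpoints is visited once per endpoint. */
	if (regs->ras)
		return MODEL_MAP_ALREADY_MAPPED;

	/* Stand-in for a failing register mapping call. */
	if (!map->ras_base)
		return MODEL_MAP_FAILED;

	regs->ras = map->ras_base;
	return MODEL_MAP_OK;
}
```

With early returns each condition reads as an independent guard, and the
mapping step no longer sits at the end of an else-if chain.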

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 4/9] cxl/pci: Map CXL PCIe ports' RAS registers
  2024-06-24 15:51     ` Terry Bowman
@ 2024-07-02 15:18       ` Jonathan Cameron
  0 siblings, 0 replies; 59+ messages in thread
From: Jonathan Cameron @ 2024-07-02 15:18 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter


> >> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> >> index 0c79d9ce877c..51a4641fc9a6 100644
> >> --- a/drivers/cxl/mem.c
> >> +++ b/drivers/cxl/mem.c
> >> @@ -45,10 +45,39 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
> >>  	return 0;
> >>  }
> >>  
> >> +static bool cxl_dev_is_pci_type(struct device *dev, u32 pcie_type)  
> > Naming perhaps needs work to make it clear this is checking if
> > it's a CXL device of that type.
> > Also, seems like general functionality that belongs in core/pci.c or
> > similar.  
> 
> Any suggestions on what to use for the rename?
dev_is_pcie_of_type() perhaps?
The dvsec bit obviously is less general but might be handled
separately with
	if ((dev_is_pcie_of_type(dport->dport_dev, PCI_EXP_TYPE_DOWNSTREAM) ||
	     dev_is_pcie_of_type(dport->dport_dev, PCI_EXP_TYPE_ROOT_PORT)) &&
	    cxl_dev_regloc(dport->dport_dev))

where 
cxl_dev_regloc() is that lookup and is also used in core/regs.c

Or something along those lines.
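
For illustration only (not part of the posted patch), the split suggested
above can be modelled in a small userspace sketch. struct mock_dev, the
MOCK_TYPE_* values and cxl_dport_needs_ras_setup() are hypothetical
stand-ins for struct device, PCI_EXP_TYPE_* and the caller's condition; the
in-kernel helpers would use dev_is_pci(), pci_pcie_type() and
pci_find_dvsec_capability() instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's PCI_EXP_TYPE_* values. */
enum mock_pcie_type {
	MOCK_TYPE_ENDPOINT,
	MOCK_TYPE_ROOT_PORT,
	MOCK_TYPE_UPSTREAM,
	MOCK_TYPE_DOWNSTREAM,
};

/* Hypothetical stand-in for struct device / struct pci_dev. */
struct mock_dev {
	bool is_pci;			/* models dev_is_pci() */
	enum mock_pcie_type type;	/* models pci_pcie_type() */
	unsigned int regloc_offset;	/* models the DVSEC lookup, 0 if absent */
};

/* General part: is this a PCIe device of the requested port type? */
static bool dev_is_pcie_of_type(const struct mock_dev *dev,
				enum mock_pcie_type type)
{
	assert(dev != NULL);
	return dev->is_pci && dev->type == type;
}

/* CXL-specific part: offset of the register locator DVSEC, or 0. */
static unsigned int cxl_dev_regloc(const struct mock_dev *dev)
{
	assert(dev != NULL);
	return dev->is_pci ? dev->regloc_offset : 0;
}

/* The combined condition a caller would use for a dport device. */
static bool cxl_dport_needs_ras_setup(const struct mock_dev *dport_dev)
{
	return (dev_is_pcie_of_type(dport_dev, MOCK_TYPE_DOWNSTREAM) ||
		dev_is_pcie_of_type(dport_dev, MOCK_TYPE_ROOT_PORT)) &&
	       cxl_dev_regloc(dport_dev);
}
```

Keeping the type check free of CXL knowledge is what would let it live in
more general code, while the DVSEC lookup stays a CXL helper.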

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 4/9] cxl/pci: Map CXL PCIe ports' RAS registers
  2024-06-17 20:04 ` [RFC PATCH 4/9] cxl/pci: Map CXL PCIe ports' RAS registers Terry Bowman
  2024-06-20 12:46   ` Jonathan Cameron
@ 2024-06-26  3:39   ` Li, Ming4
  1 sibling, 0 replies; 59+ messages in thread
From: Li, Ming4 @ 2024-06-26  3:39 UTC (permalink / raw)
  To: Terry Bowman, Williams, Dan J, Weiny, Ira, dave@stgolabs.net,
	Jiang, Dave, Schofield, Alison, Verma, Vishal L,
	jim.harris@samsung.com, ilpo.jarvinen@linux.intel.com,
	ardb@kernel.org, sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	Yazen.Ghannam@amd.com, Robert.Richter@amd.com

On 6/18/2024 4:04 AM, Terry Bowman wrote:
> RAS registers are not currently mapped for CXL root ports, CXL downstream
> switch ports, and CXL upstream switch ports. Update the driver to map the
> ports' RAS registers in preparation for RAS logging and handling to be
> added in the future.
>
> Add a 'struct cxl_regs' variable to 'struct cxl_port'. This will be used
> to store a pointer to the upstream port's mapped RAS registers.
>
> Invoke the RAS mapping logic from the CXL memory device probe routine
> after the endpoint is added. This ensures the ports have completed
> training and the RAS registers are present in CXL.cachemem.
>
> Refactor the cxl_dport_map_regs() function to support mapping the CXL
> PCIe ports. Also, check for previously mapped registers in the topology
> including CXL switch. Endpoints under a CXL switch share a CXL root port
> and will be iterated for each endpoint. Only map once.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/core/pci.c | 30 +++++++++++++++++++++++++-----
>  drivers/cxl/cxl.h      |  5 +++++
>  drivers/cxl/mem.c      | 32 ++++++++++++++++++++++++++++++--
>  3 files changed, 60 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 0df09bd79408..e6c91b3dfccf 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -787,16 +787,26 @@ static void cxl_dport_map_rch_aer(struct cxl_dport *dport)
>  	dport->regs.dport_aer = dport_aer;
>  }
>  
> -static void cxl_dport_map_regs(struct cxl_dport *dport)
> +static void cxl_port_map_regs(struct device *dev,
> +			      struct cxl_register_map *map,
> +			      struct cxl_regs *regs)
>  {
> -	struct cxl_register_map *map = &dport->reg_map;
> -	struct device *dev = dport->dport_dev;
> -
>  	if (!map->component_map.ras.valid)
>  		dev_dbg(dev, "RAS registers not found\n");
> -	else if (cxl_map_component_regs(map, &dport->regs.component,
> +	else if (regs->ras)
> +		dev_dbg(dev, "RAS registers already initialized\n");
> +	else if (cxl_map_component_regs(map, &regs->component,
>  					BIT(CXL_CM_CAP_CAP_ID_RAS)))
>  		dev_dbg(dev, "Failed to map RAS capability.\n");
> +}
> +
> +static void cxl_dport_map_regs(struct cxl_dport *dport)
> +{
> +	struct cxl_register_map *map = &dport->reg_map;
> +	struct cxl_regs *regs = &dport->regs;
> +	struct device *dev = dport->dport_dev;
> +
> +	cxl_port_map_regs(dev, map, regs);
>  
>  	if (dport->rch)
>  		cxl_dport_map_rch_aer(dport);
> @@ -831,6 +841,16 @@ static void cxl_disable_rch_root_ints(struct cxl_dport *dport)
>  	}
>  }
>  
> +void cxl_setup_parent_uport(struct device *host, struct cxl_port *port)
> +{
> +	struct cxl_register_map *map = &port->reg_map;
> +	struct cxl_regs *regs = &port->regs;
> +	struct device *uport_dev = port->uport_dev;
> +
> +	cxl_port_map_regs(uport_dev, map, regs);
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_uport, CXL);
> +
>  void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport)
>  {
>  	struct device *dport_dev = dport->dport_dev;
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 036d17db68e0..7cee678fdb75 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -587,6 +587,7 @@ struct cxl_dax_region {
>   * @parent_dport: dport that points to this port in the parent
>   * @decoder_ida: allocator for decoder ids
>   * @reg_map: component and ras register mapping parameters
> + * @regs: mapped component registers
>   * @nr_dports: number of entries in @dports
>   * @hdm_end: track last allocated HDM decoder instance for allocation ordering
>   * @commit_end: cursor to track highest committed decoder for commit ordering
> @@ -607,6 +608,7 @@ struct cxl_port {
>  	struct cxl_dport *parent_dport;
>  	struct ida decoder_ida;
>  	struct cxl_register_map reg_map;
> +	struct cxl_regs regs;
>  	int nr_dports;
>  	int hdm_end;
>  	int commit_end;
> @@ -757,9 +759,12 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
>  
>  #ifdef CONFIG_PCIEAER_CXL
>  void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport);
> +void cxl_setup_parent_uport(struct device *host, struct cxl_port *port);
>  #else
>  static inline void cxl_setup_parent_dport(struct device *host,
>  					  struct cxl_dport *dport) { }
> +static inline void cxl_setup_parent_uport(struct device *host,
> +					  struct cxl_port *port) { }
>  #endif
>  
>  struct cxl_decoder *to_cxl_decoder(struct device *dev);
> diff --git a/drivers/cxl/mem.c b/drivers/cxl/mem.c
> index 0c79d9ce877c..51a4641fc9a6 100644
> --- a/drivers/cxl/mem.c
> +++ b/drivers/cxl/mem.c
> @@ -45,10 +45,39 @@ static int cxl_mem_dpa_show(struct seq_file *file, void *data)
>  	return 0;
>  }
>  
> +static bool cxl_dev_is_pci_type(struct device *dev, u32 pcie_type)
> +{
> +	struct pci_dev *pdev;
> +
> +	if (!dev_is_pci(dev))
> +		return false;
> +
> +	pdev = to_pci_dev(dev);
> +	if (pci_pcie_type(pdev) != pcie_type)
> +		return false;
> +
> +	return pci_find_dvsec_capability(pdev, PCI_DVSEC_VENDOR_ID_CXL,
> +					 CXL_DVSEC_REG_LOCATOR);
> +}
> +
> +static void cxl_setup_ep_parent_ports(struct cxl_ep *ep, struct device *host)
> +{
> +	struct cxl_dport *dport = ep->dport;
> +
> +	if (cxl_dev_is_pci_type(dport->dport_dev, PCI_EXP_TYPE_DOWNSTREAM) ||
> +	    cxl_dev_is_pci_type(dport->dport_dev, PCI_EXP_TYPE_ROOT_PORT))
> +		cxl_setup_parent_dport(host, ep->dport);
You reuse cxl_setup_parent_dport() for root ports here, and I noticed that
cxl_setup_parent_dport() updates dport->reg_map.host. So the host of the
dport's reg_map is the first CXL device that tries to map the registers on
the dport. The register mapping will be released when that device is
removed, but it should remain available to the other devices/switches
under the root port after that removal.
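
A simplified userspace model of the concern, for illustration only. struct
mock_dport, mock_setup_parent_dport() and mock_remove_host() are
hypothetical names; the model assumes the mapping's lifetime is tied to
whichever host mapped it first, which is how I read the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a dport whose RAS mapping is owned by one host. */
struct mock_dport {
	const void *map_host;	/* device whose removal releases the mapping */
	bool ras_mapped;	/* models regs->ras being non-NULL */
};

/* Models the "only map once" behaviour: the first caller becomes owner. */
static void mock_setup_parent_dport(struct mock_dport *dport, const void *host)
{
	assert(dport != NULL);
	if (dport->ras_mapped)
		return;		/* later endpoints reuse the existing mapping */
	dport->map_host = host;
	dport->ras_mapped = true;
}

/* Models device-managed cleanup when a host device is removed. */
static void mock_remove_host(struct mock_dport *dport, const void *host)
{
	assert(dport != NULL);
	if (dport->ras_mapped && dport->map_host == host) {
		dport->ras_mapped = false;
		dport->map_host = NULL;
	}
}
```

With two endpoints under the same root port, removing the one that mapped
first leaves the second without a mapping in this model, even though it is
still present.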
> +
> +	if (cxl_dev_is_pci_type(dport->port->uport_dev, PCI_EXP_TYPE_UPSTREAM))
> +		cxl_setup_parent_uport(host, ep->dport->port);
> +}
> +
>  static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
>  				 struct cxl_dport *parent_dport)
>  {
>  	struct cxl_port *parent_port = parent_dport->port;
> +	struct cxl_dev_state *cxlds = cxlmd->cxlds;
> +	struct pci_dev *pdev = to_pci_dev(cxlds->dev);
>  	struct cxl_port *endpoint, *iter, *down;
>  	int rc;
>  
> @@ -62,6 +91,7 @@ static int devm_cxl_add_endpoint(struct device *host, struct cxl_memdev *cxlmd,
>  
>  		ep = cxl_ep_load(iter, cxlmd);
>  		ep->next = down;
> +		cxl_setup_ep_parent_ports(ep, &pdev->dev);
>  	}
>  
>  	/* Note: endpoint port component registers are derived from @cxlds */
> @@ -157,8 +187,6 @@ static int cxl_mem_probe(struct device *dev)
>  	else
>  		endpoint_parent = &parent_port->dev;
>  
> -	cxl_setup_parent_dport(dev, dport);
> -
>  	device_lock(endpoint_parent);
>  	if (!endpoint_parent->driver) {
>  		dev_err(dev, "CXL port topology %s not enabled\n",



^ permalink raw reply	[flat|nested] 59+ messages in thread

* [RFC PATCH 5/9] cxl/pci: Update RAS handler interfaces to support CXL PCIe ports
  2024-06-17 20:04 [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports Terry Bowman
                   ` (3 preceding siblings ...)
  2024-06-17 20:04 ` [RFC PATCH 4/9] cxl/pci: Map CXL PCIe ports' RAS registers Terry Bowman
@ 2024-06-17 20:04 ` Terry Bowman
  2024-06-20 12:49   ` Jonathan Cameron
  2024-07-15 17:50   ` nifan.cxl
  2024-06-17 20:04 ` [RFC PATCH 6/9] cxl/pci: Add trace logging for CXL PCIe port RAS errors Terry Bowman
                   ` (5 subsequent siblings)
  10 siblings, 2 replies; 59+ messages in thread
From: Terry Bowman @ 2024-06-17 20:04 UTC (permalink / raw)
  To: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel, terry.bowman,
	Yazen.Ghannam, Robert.Richter

CXL RAS error handling includes support for endpoints and RCH downstream
ports. The same support is missing for CXL root ports, CXL downstream
switch ports, and CXL upstream switch ports. This patch is in preparation
for adding CXL ports' RAS handling.

The cxl_pci driver's RAS support functions use the 'struct cxl_dev_state'
type parameter that is not available in CXL port devices. The same CXL
RAS capability structure is required for most CXL components/devices
and should have common handling where possible.[1]

Update __cxl_handle_cor_ras() and __cxl_handle_ras() to use 'struct
device' instead of 'struct cxl_dev_state'. Add function call to translate
device to CXL device state where needed.

[1] CXL3.1 - 8.2.4 CXL.cache and CXL.mem Registers

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/pci.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index e6c91b3dfccf..59a317ab84bb 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -686,9 +686,10 @@ void read_cdat_data(struct cxl_port *port)
 }
 EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
 
-static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
+static void __cxl_handle_cor_ras(struct device *dev,
 				 void __iomem *ras_base)
 {
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
 	void __iomem *addr;
 	u32 status;
 
@@ -699,13 +700,13 @@ static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
 	status = readl(addr);
 	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
 		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
-		trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
+		trace_cxl_aer_correctable_error(cxlmd, status);
 	}
 }
 
 static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
 {
-	return __cxl_handle_cor_ras(cxlds, cxlds->regs.ras);
+	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
 }
 
 /* CXL spec rev3.0 8.2.4.16.1 */
@@ -729,9 +730,10 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
  * Log the state of the RAS status registers and prepare them to log the
  * next error status. Return 1 if reset needed.
  */
-static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
-				  void __iomem *ras_base)
+static bool __cxl_handle_ras(struct device *dev,
+			     void __iomem *ras_base)
 {
+	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
 	u32 hl[CXL_HEADERLOG_SIZE_U32];
 	void __iomem *addr;
 	u32 status;
@@ -757,7 +759,7 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
 	}
 
 	header_log_copy(ras_base, hl);
-	trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe, hl);
+	trace_cxl_aer_uncorrectable_error(cxlmd, status, fe, hl);
 	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
 
 	return true;
@@ -765,7 +767,7 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
 
 static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
 {
-	return __cxl_handle_ras(cxlds, cxlds->regs.ras);
+	return __cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
 }
 
 #ifdef CONFIG_PCIEAER_CXL
@@ -871,13 +873,13 @@ EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_dport, CXL);
 static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
 					  struct cxl_dport *dport)
 {
-	return __cxl_handle_cor_ras(cxlds, dport->regs.ras);
+	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, dport->regs.ras);
 }
 
 static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
 				       struct cxl_dport *dport)
 {
-	return __cxl_handle_ras(cxlds, dport->regs.ras);
+	return __cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
 }
 
 /*
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 5/9] cxl/pci: Update RAS handler interfaces to support CXL PCIe ports
  2024-06-17 20:04 ` [RFC PATCH 5/9] cxl/pci: Update RAS handler interfaces to support CXL PCIe ports Terry Bowman
@ 2024-06-20 12:49   ` Jonathan Cameron
  2024-07-15 17:50   ` nifan.cxl
  1 sibling, 0 replies; 59+ messages in thread
From: Jonathan Cameron @ 2024-06-20 12:49 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter

On Mon, 17 Jun 2024 15:04:07 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> CXL RAS error handling includes support for endpoints and RCH downstream
> ports. The same support is missing for CXL root ports, CXL downstream
> switch ports, and CXL upstream switch ports. This patch is in preparation
> for adding CXL ports' RAS handling.
> 
> The cxl_pci driver's RAS support functions use the 'struct cxl_dev_state'
> type parameter that is not available in CXL port devices. The same CXL
> RAS capability structure is required for most CXL components/devices
> and should have common handling where possible.[1]
> 
> Update __cxl_handle_cor_ras() and __cxl_handle_ras() to use 'struct
> device' instead of 'struct cxl_dev_state'. Add function call to translate
> device to CXL device state where needed.
> 
> [1] CXL3.1 - 8.2.4 CXL.cache and CXL.mem Registers
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
I've not looked at how it's used yet, as I'm reading these in order,
but based on the explanation and code here it looks good to me.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>




^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 5/9] cxl/pci: Update RAS handler interfaces to support CXL PCIe ports
  2024-06-17 20:04 ` [RFC PATCH 5/9] cxl/pci: Update RAS handler interfaces to support CXL PCIe ports Terry Bowman
  2024-06-20 12:49   ` Jonathan Cameron
@ 2024-07-15 17:50   ` nifan.cxl
  1 sibling, 0 replies; 59+ messages in thread
From: nifan.cxl @ 2024-07-15 17:50 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter

On Mon, Jun 17, 2024 at 03:04:07PM -0500, Terry Bowman wrote:
> CXL RAS error handling includes support for endpoints and RCH downstream
> ports. The same support is missing for CXL root ports, CXL downstream
> switch ports, and CXL upstream switch ports. This patch is in preparation
> for adding CXL ports' RAS handling.
> 
> The cxl_pci driver's RAS support functions use the 'struct cxl_dev_state'
> type parameter that is not available in CXL port devices. The same CXL
> RAS capability structure is required for most CXL components/devices
> and should have common handling where possible.[1]
> 
> Update __cxl_handle_cor_ras() and __cxl_handle_ras() to use 'struct
> device' instead of 'struct cxl_dev_state'. Add function call to translate
> device to CXL device state where needed.
> 
> [1] CXL3.1 - 8.2.4 CXL.cache and CXL.mem Registers
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/core/pci.c | 20 +++++++++++---------
>  1 file changed, 11 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index e6c91b3dfccf..59a317ab84bb 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -686,9 +686,10 @@ void read_cdat_data(struct cxl_port *port)
>  }
>  EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
>  
> -static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
> +static void __cxl_handle_cor_ras(struct device *dev,
>  				 void __iomem *ras_base)
>  {
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>  	void __iomem *addr;
>  	u32 status;
>  
> @@ -699,13 +700,13 @@ static void __cxl_handle_cor_ras(struct cxl_dev_state *cxlds,
>  	status = readl(addr);
>  	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
>  		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> -		trace_cxl_aer_correctable_error(cxlds->cxlmd, status);
> +		trace_cxl_aer_correctable_error(cxlmd, status);
>  	}
>  }
>  
>  static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
>  {
> -	return __cxl_handle_cor_ras(cxlds, cxlds->regs.ras);
> +	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
>  }
>  
>  /* CXL spec rev3.0 8.2.4.16.1 */
> @@ -729,9 +730,10 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
>   * Log the state of the RAS status registers and prepare them to log the
>   * next error status. Return 1 if reset needed.
>   */
> -static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
> -				  void __iomem *ras_base)
> +static bool __cxl_handle_ras(struct device *dev,
> +			     void __iomem *ras_base)
>  {
> +	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>  	u32 hl[CXL_HEADERLOG_SIZE_U32];
>  	void __iomem *addr;
>  	u32 status;
> @@ -757,7 +759,7 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
>  	}
>  
>  	header_log_copy(ras_base, hl);
> -	trace_cxl_aer_uncorrectable_error(cxlds->cxlmd, status, fe, hl);
> +	trace_cxl_aer_uncorrectable_error(cxlmd, status, fe, hl);
>  	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>  
>  	return true;
> @@ -765,7 +767,7 @@ static bool __cxl_handle_ras(struct cxl_dev_state *cxlds,
>  
>  static bool cxl_handle_endpoint_ras(struct cxl_dev_state *cxlds)
>  {
> -	return __cxl_handle_ras(cxlds, cxlds->regs.ras);
> +	return __cxl_handle_ras(&cxlds->cxlmd->dev, cxlds->regs.ras);
>  }
>  
>  #ifdef CONFIG_PCIEAER_CXL
> @@ -871,13 +873,13 @@ EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_dport, CXL);
>  static void cxl_handle_rdport_cor_ras(struct cxl_dev_state *cxlds,
>  					  struct cxl_dport *dport)
>  {
> -	return __cxl_handle_cor_ras(cxlds, dport->regs.ras);
> +	return __cxl_handle_cor_ras(&cxlds->cxlmd->dev, dport->regs.ras);
>  }
>  
>  static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
>  				       struct cxl_dport *dport)
>  {
> -	return __cxl_handle_ras(cxlds, dport->regs.ras);
> +	return __cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
>  }
>  
>  /*
> -- 
> 2.34.1
> 

Looks good to me.

Fan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [RFC PATCH 6/9] cxl/pci: Add trace logging for CXL PCIe port RAS errors
  2024-06-17 20:04 [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports Terry Bowman
                   ` (4 preceding siblings ...)
  2024-06-17 20:04 ` [RFC PATCH 5/9] cxl/pci: Update RAS handler interfaces to support CXL PCIe ports Terry Bowman
@ 2024-06-17 20:04 ` Terry Bowman
  2024-06-20 12:53   ` Jonathan Cameron
  2024-06-17 20:04 ` [RFC PATCH 7/9] cxl/pci: Add atomic notifier callback for CXL PCIe port AER internal errors Terry Bowman
                   ` (4 subsequent siblings)
  10 siblings, 1 reply; 59+ messages in thread
From: Terry Bowman @ 2024-06-17 20:04 UTC (permalink / raw)
  To: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel, terry.bowman,
	Yazen.Ghannam, Robert.Richter

The cxl_pci driver uses kernel trace functions to log RAS errors for
endpoints and RCH downstream ports. The same is needed for CXL root ports,
CXL downstream switch ports, and CXL upstream switch ports.

Add RAS correctable and RAS uncorrectable trace logging functions for
CXL PCIe ports.

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/trace.h | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
index e5f13260fc52..5cfd9952d88a 100644
--- a/drivers/cxl/core/trace.h
+++ b/drivers/cxl/core/trace.h
@@ -48,6 +48,23 @@
 	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
 )
 
+TRACE_EVENT(cxl_port_aer_uncorrectable_error,
+	TP_PROTO(struct device *dev, u32 status),
+	TP_ARGS(dev, status),
+	TP_STRUCT__entry(
+		__string(devname, dev_name(dev))
+		__field(u32, status)
+	),
+	TP_fast_assign(
+		__assign_str(devname, dev_name(dev));
+		__entry->status = status;
+	),
+	TP_printk("device=%s status='%s'",
+		  __get_str(devname),
+		  show_uc_errs(__entry->status)
+	)
+);
+
 TRACE_EVENT(cxl_aer_uncorrectable_error,
 	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
 	TP_ARGS(cxlmd, status, fe, hl),
@@ -96,6 +113,23 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
 	{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" }	\
 )
 
+TRACE_EVENT(cxl_port_aer_correctable_error,
+	TP_PROTO(struct device *dev, u32 status),
+	TP_ARGS(dev, status),
+	TP_STRUCT__entry(
+		__string(devname, dev_name(dev))
+		__field(u32, status)
+	),
+	TP_fast_assign(
+		__assign_str(devname, dev_name(dev));
+		__entry->status = status;
+	),
+	TP_printk("device=%s status='%s'",
+		  __get_str(devname),
+		  show_ce_errs(__entry->status)
+	)
+);
+
 TRACE_EVENT(cxl_aer_correctable_error,
 	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
 	TP_ARGS(cxlmd, status),
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 6/9] cxl/pci: Add trace logging for CXL PCIe port RAS errors
  2024-06-17 20:04 ` [RFC PATCH 6/9] cxl/pci: Add trace logging for CXL PCIe port RAS errors Terry Bowman
@ 2024-06-20 12:53   ` Jonathan Cameron
  2024-06-24 15:53     ` Terry Bowman
  0 siblings, 1 reply; 59+ messages in thread
From: Jonathan Cameron @ 2024-06-20 12:53 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter

On Mon, 17 Jun 2024 15:04:08 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> The cxl_pci driver uses kernel trace functions to log RAS errors for
> endpoints and RCH downstream ports. The same is needed for CXL root ports,
> CXL downstream switch ports, and CXL upstream switch ports.
> 
> Add RAS correctable and RAS uncorrectable trace logging functions for
> CXL PCIe ports.
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/core/trace.h | 34 ++++++++++++++++++++++++++++++++++
>  1 file changed, 34 insertions(+)
> 
> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> index e5f13260fc52..5cfd9952d88a 100644
> --- a/drivers/cxl/core/trace.h
> +++ b/drivers/cxl/core/trace.h
> @@ -48,6 +48,23 @@
>  	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
>  )
>  
> +TRACE_EVENT(cxl_port_aer_uncorrectable_error,
> +	TP_PROTO(struct device *dev, u32 status),

By comparison with existing code, why no fe or header
log?  Don't exist for ports for some reason?
Serial number of the port might also be useful.

> +	TP_ARGS(dev, status),
> +	TP_STRUCT__entry(
> +		__string(devname, dev_name(dev))
> +		__field(u32, status)
> +	),
> +	TP_fast_assign(
> +		__assign_str(devname, dev_name(dev));
> +		__entry->status = status;
> +	),
> +	TP_printk("device=%s status='%s'",
> +		  __get_str(devname),
> +		  show_uc_errs(__entry->status)
> +	)
> +);
> +
>  TRACE_EVENT(cxl_aer_uncorrectable_error,
>  	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
>  	TP_ARGS(cxlmd, status, fe, hl),
> @@ -96,6 +113,23 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
>  	{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" }	\
>  )
>  
> +TRACE_EVENT(cxl_port_aer_correctable_error,
> +	TP_PROTO(struct device *dev, u32 status),
> +	TP_ARGS(dev, status),
> +	TP_STRUCT__entry(
> +		__string(devname, dev_name(dev))
> +		__field(u32, status)
> +	),
> +	TP_fast_assign(
> +		__assign_str(devname, dev_name(dev));
> +		__entry->status = status;
> +	),
> +	TP_printk("device=%s status='%s'",
> +		  __get_str(devname),
> +		  show_ce_errs(__entry->status)
> +	)
> +);
> +
>  TRACE_EVENT(cxl_aer_correctable_error,
>  	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
>  	TP_ARGS(cxlmd, status),


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 6/9] cxl/pci: Add trace logging for CXL PCIe port RAS errors
  2024-06-20 12:53   ` Jonathan Cameron
@ 2024-06-24 15:53     ` Terry Bowman
  2024-07-02 15:53       ` Jonathan Cameron
  0 siblings, 1 reply; 59+ messages in thread
From: Terry Bowman @ 2024-06-24 15:53 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter

Hi Jonathan,

I added responses inline below.

On 6/20/24 07:53, Jonathan Cameron wrote:
> On Mon, 17 Jun 2024 15:04:08 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
> 
>> The cxl_pci driver uses kernel trace functions to log RAS errors for
>> endpoints and RCH downstream ports. The same is needed for CXL root ports,
>> CXL downstream switch ports, and CXL upstream switch ports.
>>
>> Add RAS correctable and RAS uncorrectable trace logging functions for
>> CXL PCIe ports.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>  drivers/cxl/core/trace.h | 34 ++++++++++++++++++++++++++++++++++
>>  1 file changed, 34 insertions(+)
>>
>> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
>> index e5f13260fc52..5cfd9952d88a 100644
>> --- a/drivers/cxl/core/trace.h
>> +++ b/drivers/cxl/core/trace.h
>> @@ -48,6 +48,23 @@
>>  	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
>>  )
>>  
>> +TRACE_EVENT(cxl_port_aer_uncorrectable_error,
>> +	TP_PROTO(struct device *dev, u32 status),
> 
> By comparison with existing code, why no fe or header
> log?  Don't exist for ports for some reason?
> Serial number of the port might also be useful.
> 

The AER FE and header are the same for ports and the logging 
needs to be added here.

There is no serial number for the ports.
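
As a rough illustration of what the missing FE logging would carry, here is
a userspace sketch of deriving the first error from the uncorrectable
status. It assumes the port path would mirror what the endpoint handler
appears to do: with a single status bit set, that bit is the first error;
with several set, the first-error pointer in the capability control
register selects it. struct mock_ras_regs, MOCK_RAS_CAP_CONTROL_FE_MASK
(assumed to be bits 5:0) and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Assumed layout: first-error pointer in bits 5:0 of capability control. */
#define MOCK_RAS_CAP_CONTROL_FE_MASK 0x3fu

/* Hypothetical snapshot of the two RAS registers used here. */
struct mock_ras_regs {
	uint32_t uc_status;	/* uncorrectable error status */
	uint32_t cap_control;	/* capability control (first-error pointer) */
};

/* Count set bits, as the kernel's hweight32() would. */
static unsigned int mock_hweight32(uint32_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/* Return a mask with only the first-error bit set (0 if out of range). */
static uint32_t mock_first_error(const struct mock_ras_regs *regs)
{
	uint32_t ptr;

	assert(regs != NULL);
	if (mock_hweight32(regs->uc_status) <= 1)
		return regs->uc_status;

	ptr = regs->cap_control & MOCK_RAS_CAP_CONTROL_FE_MASK;
	if (ptr >= 32)
		return 0;	/* pointer outside the 32-bit status register */
	return UINT32_C(1) << ptr;
}
```

The returned mask and the header log could then be passed to the port
trace event the same way the endpoint event takes fe and hl.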

Regards,
Terry

>> +	TP_ARGS(dev, status),
>> +	TP_STRUCT__entry(
>> +		__string(devname, dev_name(dev))
>> +		__field(u32, status)
>> +	),
>> +	TP_fast_assign(
>> +		__assign_str(devname, dev_name(dev));
>> +		__entry->status = status;
>> +	),
>> +	TP_printk("device=%s status='%s'",
>> +		  __get_str(devname),
>> +		  show_uc_errs(__entry->status)
>> +	)
>> +);
>> +
>>  TRACE_EVENT(cxl_aer_uncorrectable_error,
>>  	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status, u32 fe, u32 *hl),
>>  	TP_ARGS(cxlmd, status, fe, hl),
>> @@ -96,6 +113,23 @@ TRACE_EVENT(cxl_aer_uncorrectable_error,
>>  	{ CXL_RAS_CE_PHYS_LAYER_ERR, "Received Error From Physical Layer" }	\
>>  )
>>  
>> +TRACE_EVENT(cxl_port_aer_correctable_error,
>> +	TP_PROTO(struct device *dev, u32 status),
>> +	TP_ARGS(dev, status),
>> +	TP_STRUCT__entry(
>> +		__string(devname, dev_name(dev))
>> +		__field(u32, status)
>> +	),
>> +	TP_fast_assign(
>> +		__assign_str(devname, dev_name(dev));
>> +		__entry->status = status;
>> +	),
>> +	TP_printk("device=%s status='%s'",
>> +		  __get_str(devname),
>> +		  show_ce_errs(__entry->status)
>> +	)
>> +);
>> +
>>  TRACE_EVENT(cxl_aer_correctable_error,
>>  	TP_PROTO(const struct cxl_memdev *cxlmd, u32 status),
>>  	TP_ARGS(cxlmd, status),
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 6/9] cxl/pci: Add trace logging for CXL PCIe port RAS errors
  2024-06-24 15:53     ` Terry Bowman
@ 2024-07-02 15:53       ` Jonathan Cameron
  0 siblings, 0 replies; 59+ messages in thread
From: Jonathan Cameron @ 2024-07-02 15:53 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter

On Mon, 24 Jun 2024 10:53:51 -0500
Terry Bowman <Terry.Bowman@amd.com> wrote:

> Hi Jonathan,
> 
> I added responses inline below.
> 
> On 6/20/24 07:53, Jonathan Cameron wrote:
> > On Mon, 17 Jun 2024 15:04:08 -0500
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >   
> >> The cxl_pci driver uses kernel trace functions to log RAS errors for
> >> endpoints and RCH downstream ports. The same is needed for CXL root ports,
> >> CXL downstream switch ports, and CXL upstream switch ports.
> >>
> >> Add RAS correctable and RAS uncorrectable trace logging functions for
> >> CXL PCIe ports.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> ---
> >>  drivers/cxl/core/trace.h | 34 ++++++++++++++++++++++++++++++++++
> >>  1 file changed, 34 insertions(+)
> >>
> >> diff --git a/drivers/cxl/core/trace.h b/drivers/cxl/core/trace.h
> >> index e5f13260fc52..5cfd9952d88a 100644
> >> --- a/drivers/cxl/core/trace.h
> >> +++ b/drivers/cxl/core/trace.h
> >> @@ -48,6 +48,23 @@
> >>  	{ CXL_RAS_UC_IDE_RX_ERR, "IDE Rx Error" }			  \
> >>  )
> >>  
> >> +TRACE_EVENT(cxl_port_aer_uncorrectable_error,
> >> +	TP_PROTO(struct device *dev, u32 status),  
> > 
> > By comparison with existing code, why no fe or header
> > log?  Don't exist for ports for some reason?
> > Serial number of the port might also be useful.
> >   
> 
> The AER FE and header are the same for ports and the logging 
> needs to be added here.
> 
> There is no serial number for the ports.
Why not? At least for switch USP there might be (actually
I believe there can be for pretty much anything but there
are rules on them matching in switch functions).

J

> 
> Regards,
> Terry

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [RFC PATCH 7/9] cxl/pci: Add atomic notifier callback for CXL PCIe port AER internal errors
  2024-06-17 20:04 [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports Terry Bowman
                   ` (5 preceding siblings ...)
  2024-06-17 20:04 ` [RFC PATCH 6/9] cxl/pci: Add trace logging for CXL PCIe port RAS errors Terry Bowman
@ 2024-06-17 20:04 ` Terry Bowman
  2024-06-20 13:09   ` Jonathan Cameron
  2024-06-26  6:22   ` Li, Ming4
  2024-06-17 20:04 ` [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors() Terry Bowman
                   ` (3 subsequent siblings)
  10 siblings, 2 replies; 59+ messages in thread
From: Terry Bowman @ 2024-06-17 20:04 UTC (permalink / raw)
  To: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel, terry.bowman,
	Yazen.Ghannam, Robert.Richter

CXL root ports, CXL downstream switch ports, and CXL upstream switch
ports are bound to the PCIe port bus driver, portdrv. portdrv provides
an atomic notifier chain for reporting PCIe port device AER
correctable internal errors (CIE) and AER uncorrectable internal
errors (UIE).

CXL PCIe port devices use AER CIE/UIE to report CXL RAS.[1]

Add a cxl_pci atomic notification callback for handling the portdrv's
AER UIE/CIE notifications.

Register the atomic notification callback when the cxl_pci module
loads. Unregister the callback when the cxl_pci module unloads.

Implement the callback to check if the device parameter is a valid
CXL PCIe port. If it is valid then make the notification callback call
__cxl_handle_cor_ras() or __cxl_handle_ras() depending on the AER
type.

[1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
             Upstream Switch Ports

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/core.h |  4 ++
 drivers/cxl/core/pci.c  | 97 ++++++++++++++++++++++++++++++++++++++---
 drivers/cxl/core/port.c |  6 +--
 drivers/cxl/cxl.h       |  5 +++
 drivers/cxl/cxlpci.h    |  2 +
 drivers/cxl/pci.c       | 19 +++++++-
 6 files changed, 123 insertions(+), 10 deletions(-)

diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index bc5a95665aa0..69bef1db6ee0 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -94,4 +94,8 @@ int cxl_update_hmat_access_coordinates(int nid, struct cxl_region *cxlr,
 				       enum access_coordinate_class access);
 bool cxl_need_node_perf_attrs_update(int nid);
 
+struct cxl_dport *find_dport(struct cxl_port *port, int id);
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+			       struct cxl_dport **dport);
+
 #endif /* __CXL_CORE_H__ */
diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index 59a317ab84bb..e630eccb733d 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -689,7 +689,6 @@ EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
 static void __cxl_handle_cor_ras(struct device *dev,
 				 void __iomem *ras_base)
 {
-	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
 	void __iomem *addr;
 	u32 status;
 
@@ -698,10 +697,17 @@ static void __cxl_handle_cor_ras(struct device *dev,
 
 	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
 	status = readl(addr);
-	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
-		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
+
+	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
+		return;
+
+	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
+	if (is_cxl_memdev(dev)) {
+		struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+
 		trace_cxl_aer_correctable_error(cxlmd, status);
-	}
+	} else if (dev_is_pci(dev))
+		trace_cxl_port_aer_correctable_error(dev, status);
 }
 
 static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
@@ -733,7 +739,6 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
 static bool __cxl_handle_ras(struct device *dev,
 			     void __iomem *ras_base)
 {
-	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
 	u32 hl[CXL_HEADERLOG_SIZE_U32];
 	void __iomem *addr;
 	u32 status;
@@ -759,7 +764,13 @@ static bool __cxl_handle_ras(struct device *dev,
 	}
 
 	header_log_copy(ras_base, hl);
-	trace_cxl_aer_uncorrectable_error(cxlmd, status, fe, hl);
+	if (is_cxl_memdev(dev)) {
+		struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
+
+		trace_cxl_aer_uncorrectable_error(cxlmd, status, fe, hl);
+	} else if (dev_is_pci(dev))
+		trace_cxl_port_aer_uncorrectable_error(dev, status);
+
 	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
 
 	return true;
@@ -882,6 +893,80 @@ static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
 	return __cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
 }
 
+static int match_uport(struct device *dev, void *data)
+{
+	struct device *uport_dev = (struct device *)data;
+	struct cxl_port *port;
+
+	if (!is_cxl_port(dev))
+		return 0;
+
+	port = to_cxl_port(dev);
+
+	return (port->uport_dev == uport_dev);
+}
+
+static struct cxl_port *pci_to_cxl_uport(struct pci_dev *pdev)
+{
+	struct cxl_dport *dport;
+	struct device *port_dev;
+	struct cxl_port *port;
+
+	port = find_cxl_port(pdev->dev.parent, &dport);
+	if (!port)
+		return NULL;
+	put_device(&port->dev);
+
+	port_dev = device_find_child(&port->dev, &pdev->dev, match_uport);
+	if (!port_dev)
+		return NULL;
+	put_device(port_dev);
+
+	port = to_cxl_port(port_dev);
+
+	return port;
+}
+
+static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
+{
+	void __iomem *ras_base = NULL;
+
+	if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
+	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
+		struct cxl_dport *dport;
+
+		find_cxl_port(&pdev->dev, &dport);
+		ras_base = dport ? dport->regs.ras : NULL;
+	} else if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {
+		struct cxl_port *port = pci_to_cxl_uport(pdev);
+
+		ras_base = port ? port->regs.ras : NULL;
+	}
+
+	return ras_base;
+}
+
+int port_internal_err_cb(struct notifier_block *unused,
+			 unsigned long event, void *ptr)
+{
+	struct pci_dev *pdev = (struct pci_dev *)ptr;
+	void __iomem *ras_base;
+
+	if (!pdev)
+		return 0;
+
+	if (event == AER_CORRECTABLE) {
+		ras_base = cxl_pci_port_ras(pdev);
+		__cxl_handle_cor_ras(&pdev->dev, ras_base);
+	} else if ((event == AER_FATAL) || (event == AER_NONFATAL)) {
+		ras_base = cxl_pci_port_ras(pdev);
+		__cxl_handle_ras(&pdev->dev, ras_base);
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(port_internal_err_cb, CXL);
+
 /*
  * Copy the AER capability registers using 32 bit read accesses.
  * This is necessary because RCRB AER capability is MMIO mapped. Clear the
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index 887ed6e358fb..d0f95c965ab4 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -1027,7 +1027,7 @@ void put_cxl_root(struct cxl_root *cxl_root)
 }
 EXPORT_SYMBOL_NS_GPL(put_cxl_root, CXL);
 
-static struct cxl_dport *find_dport(struct cxl_port *port, int id)
+struct cxl_dport *find_dport(struct cxl_port *port, int id)
 {
 	struct cxl_dport *dport;
 	unsigned long index;
@@ -1336,8 +1336,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
 	return NULL;
 }
 
-static struct cxl_port *find_cxl_port(struct device *dport_dev,
-				      struct cxl_dport **dport)
+struct cxl_port *find_cxl_port(struct device *dport_dev,
+			       struct cxl_dport **dport)
 {
 	struct cxl_find_port_ctx ctx = {
 		.dport_dev = dport_dev,
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 7cee678fdb75..04725344393b 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -11,6 +11,7 @@
 #include <linux/log2.h>
 #include <linux/node.h>
 #include <linux/io.h>
+#include "../pci/pcie/portdrv.h"
 
 /**
  * DOC: cxl objects
@@ -760,11 +761,15 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
 #ifdef CONFIG_PCIEAER_CXL
 void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport);
 void cxl_setup_parent_uport(struct device *host, struct cxl_port *port);
+int port_internal_err_cb(struct notifier_block *unused,
+			 unsigned long event, void *ptr);
 #else
 static inline void cxl_setup_parent_dport(struct device *host,
 					  struct cxl_dport *dport) { }
 static inline void cxl_setup_parent_uport(struct device *host,
 					  struct cxl_port *port) { }
+static inline int port_internal_err_cb(struct notifier_block *unused,
+				unsigned long event, void *ptr) { return 0; }
 #endif
 
 struct cxl_decoder *to_cxl_decoder(struct device *dev);
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 93992a1c8eec..6044955e1c48 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -130,4 +130,6 @@ void read_cdat_data(struct cxl_port *port);
 void cxl_cor_error_detected(struct pci_dev *pdev);
 pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
 				    pci_channel_state_t state);
+int port_err_nb_cb(struct notifier_block *unused,
+		   unsigned long event, void *ptr);
 #endif /* __CXL_PCI_H__ */
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 2ff361e756d6..f4183c5aea38 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -926,6 +926,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	return rc;
 }
 
+struct notifier_block port_internal_err_nb = {
+	.notifier_call = port_internal_err_cb,
+};
+
 static const struct pci_device_id cxl_mem_pci_tbl[] = {
 	/* PCI class code for CXL.mem Type-3 Devices */
 	{ PCI_DEVICE_CLASS((PCI_CLASS_MEMORY_CXL << 8 | CXL_MEMORY_PROGIF), ~0)},
@@ -974,6 +978,19 @@ static struct pci_driver cxl_pci_driver = {
 	},
 };
 
-module_pci_driver(cxl_pci_driver);
+static int __init cxl_pci_init(void)
+{
+	atomic_notifier_chain_register(&portdrv_aer_internal_err_chain, &port_internal_err_nb);
+	return pci_register_driver(&cxl_pci_driver);
+}
+module_init(cxl_pci_init);
+
+static void __exit cxl_pci_exit(void)
+{
+	atomic_notifier_chain_unregister(&portdrv_aer_internal_err_chain, &port_internal_err_nb);
+	pci_unregister_driver(&cxl_pci_driver);
+}
+module_exit(cxl_pci_exit);
+
 MODULE_LICENSE("GPL v2");
 MODULE_IMPORT_NS(CXL);
-- 
2.34.1



* Re: [RFC PATCH 7/9] cxl/pci: Add atomic notifier callback for CXL PCIe port AER internal errors
  2024-06-17 20:04 ` [RFC PATCH 7/9] cxl/pci: Add atomic notifier callback for CXL PCIe port AER internal errors Terry Bowman
@ 2024-06-20 13:09   ` Jonathan Cameron
  2024-06-24 16:09     ` Terry Bowman
  2024-06-26  6:22   ` Li, Ming4
  1 sibling, 1 reply; 59+ messages in thread
From: Jonathan Cameron @ 2024-06-20 13:09 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter

On Mon, 17 Jun 2024 15:04:09 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> CXL root ports, CXL downstream switch ports, and CXL upstream switch
> ports are bound to the PCIe port bus driver, portdrv. portdrv provides
> an atomic notifier chain for reporting PCIe port device AER
> correctable internal errors (CIE) and AER uncorrectable internal
> errors (UIE).
> 
> CXL PCIe port devices use AER CIE/UIE to report CXL RAS.[1]
> 
> Add a cxl_pci atomic notification callback for handling the portdrv's
> AER UIE/CIE notifications.
> 
> Register the atomic notification callback when the cxl_pci module
> loads. Unregister the callback when the cxl_pci module unloads.
> 
> Implement the callback to check if the device parameter is a valid
> CXL PCIe port. If it is valid then make the notification callback call
> __cxl_handle_cor_ras() or __cxl_handle_ras() depending on the AER
> type.
> 
> [1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>              Upstream Switch Ports
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Hi Terry,

Some comments inline.  Mostly this comes down to the earlier question of whether
this notifier should be registered per device or globally.
I think I still prefer per device, but attaching the handler will be trickier
and I'm guessing there may be some locking/lifetime issues doing that which
are avoided by a global notifier.

Jonathan

> ---
>  drivers/cxl/core/core.h |  4 ++
>  drivers/cxl/core/pci.c  | 97 ++++++++++++++++++++++++++++++++++++++---
>  drivers/cxl/core/port.c |  6 +--
>  drivers/cxl/cxl.h       |  5 +++
>  drivers/cxl/cxlpci.h    |  2 +
>  drivers/cxl/pci.c       | 19 +++++++-
>  6 files changed, 123 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index bc5a95665aa0..69bef1db6ee0 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -94,4 +94,8 @@ int cxl_update_hmat_access_coordinates(int nid, struct cxl_region *cxlr,
>  				       enum access_coordinate_class access);
>  bool cxl_need_node_perf_attrs_update(int nid);
>  
> +struct cxl_dport *find_dport(struct cxl_port *port, int id);
> +struct cxl_port *find_cxl_port(struct device *dport_dev,
> +			       struct cxl_dport **dport);
> +
>  #endif /* __CXL_CORE_H__ */
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 59a317ab84bb..e630eccb733d 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -689,7 +689,6 @@ EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
>  static void __cxl_handle_cor_ras(struct device *dev,
>  				 void __iomem *ras_base)
>  {
> -	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>  	void __iomem *addr;
>  	u32 status;
>  
> @@ -698,10 +697,17 @@ static void __cxl_handle_cor_ras(struct device *dev,
>  
>  	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>  	status = readl(addr);
> -	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
> -		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> +

Blank line probably not wanted as we want to group the status
check with the read (it's kind of an error check).

> +	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
> +		return;
> +
> +	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> +	if (is_cxl_memdev(dev)) {
> +		struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +
>  		trace_cxl_aer_correctable_error(cxlmd, status);
As below - don't bother with local cxlmd variable.

> -	}
> +	} else if (dev_is_pci(dev))
> +		trace_cxl_port_aer_correctable_error(dev, status);
>  }
>  
>  static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
> @@ -733,7 +739,6 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
>  static bool __cxl_handle_ras(struct device *dev,
>  			     void __iomem *ras_base)
>  {
> -	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>  	u32 hl[CXL_HEADERLOG_SIZE_U32];
>  	void __iomem *addr;
>  	u32 status;
> @@ -759,7 +764,13 @@ static bool __cxl_handle_ras(struct device *dev,
>  	}
>  
>  	header_log_copy(ras_base, hl);
> -	trace_cxl_aer_uncorrectable_error(cxlmd, status, fe, hl);
> +	if (is_cxl_memdev(dev)) {
> +		struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
Just use this inline.
		trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev),
						  status, fe, hl);

> +
> +		trace_cxl_aer_uncorrectable_error(cxlmd, status, fe, hl);
> +	} else if (dev_is_pci(dev))
> +		trace_cxl_port_aer_uncorrectable_error(dev, status);

As before, why no fe or hl?  I'm sure I'm missing some spec subtlety
but a comment would help me and others avoid that.

> +
>  	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>  
>  	return true;
> @@ -882,6 +893,80 @@ static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
>  	return __cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
>  }
>  
> +static int match_uport(struct device *dev, void *data)
> +{
> +	struct device *uport_dev = (struct device *)data;
> +	struct cxl_port *port;
> +
> +	if (!is_cxl_port(dev))
> +		return 0;
> +
> +	port = to_cxl_port(dev);
> +
> +	return (port->uport_dev == uport_dev);
() doesn't add much so I'd drop them.

> +}
> +
> +static struct cxl_port *pci_to_cxl_uport(struct pci_dev *pdev)
> +{
> +	struct cxl_dport *dport;
> +	struct device *port_dev;
> +	struct cxl_port *port;
> +
> +	port = find_cxl_port(pdev->dev.parent, &dport);
> +	if (!port)
> +		return NULL;
> +	put_device(&port->dev);
I'm confused on the lifetimes. Doesn't it make more sense
to hold this until after you've stopped using it? So move the
put after device_find_child(...)
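
For illustration, the ordering suggested here can be sketched in userspace
C. This is only a sketch under stated assumptions: mock_dev, mock_get(),
mock_put(), mock_find_child() and lookup_uport() are hypothetical stand-ins
for struct device, get_device(), put_device(), device_find_child() and
pci_to_cxl_uport(), not the kernel implementations. The parent reference is
held across the child lookup and the child is returned with its reference
still held, so the caller decides when to drop it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted struct device. */
struct mock_dev {
	int refcount;
	struct mock_dev *child;
};

/* Stand-in for get_device(): take a reference, tolerate NULL. */
static struct mock_dev *mock_get(struct mock_dev *d)
{
	if (d)
		d->refcount++;
	return d;
}

/* Stand-in for put_device(): drop a reference, tolerate NULL. */
static void mock_put(struct mock_dev *d)
{
	if (d)
		d->refcount--;
}

/* Stand-in for device_find_child(): the match comes back referenced. */
static struct mock_dev *mock_find_child(struct mock_dev *parent)
{
	assert(parent->refcount > 0);	/* parent must still be pinned */
	return mock_get(parent->child);
}

/*
 * Hold the parent across the child lookup and drop it only afterwards.
 * The child is returned with its reference held; the caller puts it
 * once it has finished using the result.
 */
static struct mock_dev *lookup_uport(struct mock_dev *parent)
{
	struct mock_dev *child;

	if (!mock_get(parent))
		return NULL;

	child = mock_find_child(parent);
	mock_put(parent);

	return child;
}
```

A caller would pair lookup_uport() with a later mock_put() on the returned
pointer, after the last use of the result.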

> +
> +	port_dev = device_find_child(&port->dev, &pdev->dev, match_uport);
> +	if (!port_dev)
> +		return NULL;
> +	put_device(port_dev);
> +
> +	port = to_cxl_port(port_dev);
> +
> +	return port;

	return to_cxl_port(port_dev);

> +}
> +
> +static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
> +{
> +	void __iomem *ras_base = NULL;
Don't initialize and...
> +
> +	if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
> +	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
> +		struct cxl_dport *dport;
> +
> +		find_cxl_port(&pdev->dev, &dport);
> +		ras_base = dport ? dport->regs.ras : NULL;
		if (dport)
			return dport->regs.ras;
> +	} else if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {
> +		struct cxl_port *port = pci_to_cxl_uport(pdev);
> +
> +		ras_base = port ? port->regs.ras : NULL;
		if (port)
			return port->regs.ras;
> +	}
return NULL;
> +
> +	return ras_base;
> +}
> +
> +int port_internal_err_cb(struct notifier_block *unused,
> +			 unsigned long event, void *ptr)
> +{
> +	struct pci_dev *pdev = (struct pci_dev *)ptr;

I think you can use this notifier for other types of device in future?
If it's going to be global, we definitely want to check here that we
actually have a CXL port of some type.

It may be that via the various calls any non CXL device
will result in a safe error. However that's not obvious, so an
upfront check makes sense (or a per device notifier registration!)
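
As a rough userspace sketch of the upfront check described here: the
callback filters out anything that is not a CXL port before it touches any
RAS state. All names below (mock_pdev, the MOCK_* constants,
mock_port_internal_err_cb()) are hypothetical, and the is_cxl_port flag
stands in for whatever test the real driver would use to identify a CXL
port; none of this is the patch's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical return codes, modelled on notifier "done"/"ok" results. */
enum { MOCK_NOTIFY_DONE = 0, MOCK_NOTIFY_OK = 1 };

/* Hypothetical event codes for the three AER severities. */
enum mock_event { MOCK_AER_NONFATAL, MOCK_AER_FATAL, MOCK_AER_CORRECTABLE };

struct mock_pdev {
	bool is_cxl_port;	/* assumed result of a CXL port check */
	int cor_handled;	/* correctable events handled */
	int uncor_handled;	/* uncorrectable events handled */
};

static int mock_port_internal_err_cb(unsigned long event, void *ptr)
{
	struct mock_pdev *pdev = ptr;

	/* Upfront filter: a global chain can deliver non-CXL devices. */
	if (!pdev || !pdev->is_cxl_port)
		return MOCK_NOTIFY_DONE;

	switch (event) {
	case MOCK_AER_CORRECTABLE:
		pdev->cor_handled++;
		return MOCK_NOTIFY_OK;
	case MOCK_AER_FATAL:
	case MOCK_AER_NONFATAL:
		pdev->uncor_handled++;
		return MOCK_NOTIFY_OK;
	default:
		/* Unknown event type: leave it for other listeners. */
		return MOCK_NOTIFY_DONE;
	}
}
```

With a per-device registration the filter would be unnecessary, since only
CXL ports would ever reach the callback.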

> +	void __iomem *ras_base;
> +
> +	if (!pdev)
> +		return 0;
> +
> +	if (event == AER_CORRECTABLE) {
> +		ras_base = cxl_pci_port_ras(pdev);
> +		__cxl_handle_cor_ras(&pdev->dev, ras_base);

as below - one line should be fine for this.

> +	} else if ((event == AER_FATAL) || (event == AER_NONFATAL)) {
> +		ras_base = cxl_pci_port_ras(pdev);
> +		__cxl_handle_ras(&pdev->dev, ras_base);

		__cxl_handle_ras(&pdev->dev, cxl_pci_port_ras(pdev));

> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(port_internal_err_cb, CXL);
> +
>  /*
>   * Copy the AER capability registers using 32 bit read accesses.
>   * This is necessary because RCRB AER capability is MMIO mapped. Clear the
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 887ed6e358fb..d0f95c965ab4 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -1027,7 +1027,7 @@ void put_cxl_root(struct cxl_root *cxl_root)
>  }
>  EXPORT_SYMBOL_NS_GPL(put_cxl_root, CXL);
>  
> -static struct cxl_dport *find_dport(struct cxl_port *port, int id)
> +struct cxl_dport *find_dport(struct cxl_port *port, int id)
>  {
>  	struct cxl_dport *dport;
>  	unsigned long index;
> @@ -1336,8 +1336,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
>  	return NULL;
>  }
>  
> -static struct cxl_port *find_cxl_port(struct device *dport_dev,
> -				      struct cxl_dport **dport)
> +struct cxl_port *find_cxl_port(struct device *dport_dev,
> +			       struct cxl_dport **dport)
>  {
>  	struct cxl_find_port_ctx ctx = {
>  		.dport_dev = dport_dev,
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 7cee678fdb75..04725344393b 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -11,6 +11,7 @@
>  #include <linux/log2.h>
>  #include <linux/node.h>
>  #include <linux/io.h>
> +#include "../pci/pcie/portdrv.h"

Why add the include?  Maybe it's only needed in the .c files?

>  
>  /**
>   * DOC: cxl objects
> @@ -760,11 +761,15 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
>  #ifdef CONFIG_PCIEAER_CXL
>  void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport);
>  void cxl_setup_parent_uport(struct device *host, struct cxl_port *port);
> +int port_internal_err_cb(struct notifier_block *unused,
> +			 unsigned long event, void *ptr);
>  #else
>  static inline void cxl_setup_parent_dport(struct device *host,
>  					  struct cxl_dport *dport) { }
>  static inline void cxl_setup_parent_uport(struct device *host,
>  					  struct cxl_port *port) { }
> +static inline int port_internal_err_cb(struct notifier_block *unused,
> +				unsigned long event, void *ptr) { return 0; }
>  #endif
>  
>  struct cxl_decoder *to_cxl_decoder(struct device *dev);
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 93992a1c8eec..6044955e1c48 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -130,4 +130,6 @@ void read_cdat_data(struct cxl_port *port);
>  void cxl_cor_error_detected(struct pci_dev *pdev);
>  pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>  				    pci_channel_state_t state);
> +int port_err_nb_cb(struct notifier_block *unused,
> +		   unsigned long event, void *ptr);
>  #endif /* __CXL_PCI_H__ */
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 2ff361e756d6..f4183c5aea38 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -926,6 +926,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	return rc;
>  }
>  
> +struct notifier_block port_internal_err_nb = {
> +	.notifier_call = port_internal_err_cb,
> +};
> +
>  static const struct pci_device_id cxl_mem_pci_tbl[] = {
>  	/* PCI class code for CXL.mem Type-3 Devices */
>  	{ PCI_DEVICE_CLASS((PCI_CLASS_MEMORY_CXL << 8 | CXL_MEMORY_PROGIF), ~0)},
> @@ -974,6 +978,19 @@ static struct pci_driver cxl_pci_driver = {
>  	},
>  };
>  
> -module_pci_driver(cxl_pci_driver);
> +static int __init cxl_pci_init(void)
> +{
> +	atomic_notifier_chain_register(&portdrv_aer_internal_err_chain, &port_internal_err_nb);

Long line that you can easily break.

> +	return pci_register_driver(&cxl_pci_driver);
> +}
> +module_init(cxl_pci_init);
> +
> +static void __exit cxl_pci_exit(void)
> +{
> +	atomic_notifier_chain_unregister(&portdrv_aer_internal_err_chain, &port_internal_err_nb);

Long line that you can easily break.

> +	pci_unregister_driver(&cxl_pci_driver);
> +}
> +module_exit(cxl_pci_exit);
> +
>  MODULE_LICENSE("GPL v2");
>  MODULE_IMPORT_NS(CXL);



* Re: [RFC PATCH 7/9] cxl/pci: Add atomic notifier callback for CXL PCIe port AER internal errors
  2024-06-20 13:09   ` Jonathan Cameron
@ 2024-06-24 16:09     ` Terry Bowman
  2024-07-02 15:58       ` Jonathan Cameron
  0 siblings, 1 reply; 59+ messages in thread
From: Terry Bowman @ 2024-06-24 16:09 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter

Hi Jonathan,

I added responses inline below.

On 6/20/24 08:09, Jonathan Cameron wrote:
> On Mon, 17 Jun 2024 15:04:09 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
> 
>> CXL root ports, CXL downstream switch ports, and CXL upstream switch
>> ports are bound to the PCIe port bus driver, portdrv. portdrv provides
>> an atomic notifier chain for reporting PCIe port device AER
>> correctable internal errors (CIE) and AER uncorrectable internal
>> errors (UIE).
>>
>> CXL PCIe port devices use AER CIE/UIE to report CXL RAS.[1]
>>
>> Add a cxl_pci atomic notification callback for handling the portdrv's
>> AER UIE/CIE notifications.
>>
>> Register the atomic notification callback when the cxl_pci module
>> loads. Unregister the callback when the cxl_pci module unloads.
>>
>> Implement the callback to check if the device parameter is a valid
>> CXL PCIe port. If it is valid then make the notification callback call
>> __cxl_handle_cor_ras() or __cxl_handle_ras() depending on the AER
>> type.
>>
>> [1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>>              Upstream Switch Ports
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Hi Terry,
> 
> Some comments inline.  Mostly this comes down to the earlier question of whether
> this notifier should be registered per device or globally.
> I think I still prefer per device, but attaching the handler will be trickier
> and I'm guessing there may be some locking/lifetime issues doing that which
> are avoided by a global notifier.
> 
> Jonathan
> 

I agree on the per-device notifier.

>> ---
>>  drivers/cxl/core/core.h |  4 ++
>>  drivers/cxl/core/pci.c  | 97 ++++++++++++++++++++++++++++++++++++++---
>>  drivers/cxl/core/port.c |  6 +--
>>  drivers/cxl/cxl.h       |  5 +++
>>  drivers/cxl/cxlpci.h    |  2 +
>>  drivers/cxl/pci.c       | 19 +++++++-
>>  6 files changed, 123 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
>> index bc5a95665aa0..69bef1db6ee0 100644
>> --- a/drivers/cxl/core/core.h
>> +++ b/drivers/cxl/core/core.h
>> @@ -94,4 +94,8 @@ int cxl_update_hmat_access_coordinates(int nid, struct cxl_region *cxlr,
>>  				       enum access_coordinate_class access);
>>  bool cxl_need_node_perf_attrs_update(int nid);
>>  
>> +struct cxl_dport *find_dport(struct cxl_port *port, int id);
>> +struct cxl_port *find_cxl_port(struct device *dport_dev,
>> +			       struct cxl_dport **dport);
>> +
>>  #endif /* __CXL_CORE_H__ */
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 59a317ab84bb..e630eccb733d 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -689,7 +689,6 @@ EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
>>  static void __cxl_handle_cor_ras(struct device *dev,
>>  				 void __iomem *ras_base)
>>  {
>> -	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>>  	void __iomem *addr;
>>  	u32 status;
>>  
>> @@ -698,10 +697,17 @@ static void __cxl_handle_cor_ras(struct device *dev,
>>  
>>  	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>>  	status = readl(addr);
>> -	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
>> -		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>> +
> 
> Blank line probably not wanted as we want to group the status
> check with the read (it's kind of an error check).
> 

Ok.

>> +	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
>> +		return;
>> +
>> +	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>> +	if (is_cxl_memdev(dev)) {
>> +		struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>> +
>>  		trace_cxl_aer_correctable_error(cxlmd, status);
> As below - don't bother with local cxlmd variable.
> 

Ok.

>> -	}
>> +	} else if (dev_is_pci(dev))
>> +		trace_cxl_port_aer_correctable_error(dev, status);
>>  }
>>  
>>  static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
>> @@ -733,7 +739,6 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
>>  static bool __cxl_handle_ras(struct device *dev,
>>  			     void __iomem *ras_base)
>>  {
>> -	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>>  	u32 hl[CXL_HEADERLOG_SIZE_U32];
>>  	void __iomem *addr;
>>  	u32 status;
>> @@ -759,7 +764,13 @@ static bool __cxl_handle_ras(struct device *dev,
>>  	}
>>  
>>  	header_log_copy(ras_base, hl);
>> -	trace_cxl_aer_uncorrectable_error(cxlmd, status, fe, hl);
>> +	if (is_cxl_memdev(dev)) {
>> +		struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> Just use this inline.
> 		trace_cxl_aer_uncorrectable_error(to_cxl_memdev(dev),
> 						  status, fe, hl);
> 
>> +
>> +		trace_cxl_aer_uncorrectable_error(cxlmd, status, fe, hl);
>> +	} else if (dev_is_pci(dev))
>> +		trace_cxl_port_aer_uncorrectable_error(dev, status);
> 
> As before, why no fe or hl?  I'm sure I'm missing some spec subtlety
> but a comment would help me and others avoid that.
> 

Adding the fe and hl is on my list of changes. No, you're not missing
anything; you're spot on.
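
For illustration only, here is a userspace sketch of how a first-error (fe)
value could be derived for the port trace, assuming the port path mirrors
the existing handling: when more than one status bit is set, the first
error is taken from a pointer field in the capability control register,
otherwise the status itself identifies it. MOCK_FE_POINTER_MASK,
mock_popcount32() and mock_first_error() are hypothetical names, and the
field width is an assumption rather than something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed width of the first-error pointer field (hypothetical name). */
#define MOCK_FE_POINTER_MASK 0x3fu

/* Count set bits; userspace stand-in for a population-count helper. */
static unsigned int mock_popcount32(uint32_t v)
{
	unsigned int n = 0;

	for (; v; v &= v - 1)
		n++;
	return n;
}

/*
 * With one status bit set, that bit is the first error. With several,
 * use the pointer field from the capability control value to pick the
 * bit that was logged first.
 */
static uint32_t mock_first_error(uint32_t status, uint32_t cap_control)
{
	unsigned int ptr;

	if (mock_popcount32(status) <= 1)
		return status;

	ptr = cap_control & MOCK_FE_POINTER_MASK;

	/* Guard the shift: a pointer past bit 31 has no 32-bit encoding. */
	return ptr < 32 ? UINT32_C(1) << ptr : 0;
}
```

The resulting value would be passed to the port trace event alongside the
status and header log, in the same way the memdev trace receives them.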

>> +
>>  	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>>  
>>  	return true;
>> @@ -882,6 +893,80 @@ static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
>>  	return __cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
>>  }
>>  
>> +static int match_uport(struct device *dev, void *data)
>> +{
>> +	struct device *uport_dev = (struct device *)data;
>> +	struct cxl_port *port;
>> +
>> +	if (!is_cxl_port(dev))
>> +		return 0;
>> +
>> +	port = to_cxl_port(dev);
>> +
>> +	return (port->uport_dev == uport_dev);
> () doesn't add much so I'd drop them.
> 
>> +}
>> +
>> +static struct cxl_port *pci_to_cxl_uport(struct pci_dev *pdev)
>> +{
>> +	struct cxl_dport *dport;
>> +	struct device *port_dev;
>> +	struct cxl_port *port;
>> +
>> +	port = find_cxl_port(pdev->dev.parent, &dport);
>> +	if (!port)
>> +		return NULL;
>> +	put_device(&port->dev);
> I'm confused on the lifetimes. Doesn't it make more sense
> to hold this until after you've stopped using it? So move the
> put after device_find_child(...)
> 

Ok.

>> +
>> +	port_dev = device_find_child(&port->dev, &pdev->dev, match_uport);
>> +	if (!port_dev)
>> +		return NULL;
>> +	put_device(port_dev);
>> +
>> +	port = to_cxl_port(port_dev);
>> +
>> +	return port;
> 
> 	return to_cxl_port(port_dev);
> 
>> +}
>> +
>> +static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
>> +{
>> +	void __iomem *ras_base = NULL;
> Don't initialize and...

There is a possibility that an incorrect PCIe type is passed, and this is
intended to return NULL in those cases. Should ras_base not be preinitialized
for that scenario?
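
For reference, a minimal userspace sketch of the early-return shape
suggested in the review, showing that an unsupported type still yields NULL
without a preinitialized local. The types and names (mock_pcie_type,
mock_regs, mock_pdev, mock_port_ras()) are hypothetical stand-ins, and the
dport/uport pointers model the result of the port lookups rather than the
real find_cxl_port() and pci_to_cxl_uport() calls.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the PCIe port types of interest. */
enum mock_pcie_type {
	MOCK_ROOT_PORT,
	MOCK_DOWNSTREAM,
	MOCK_UPSTREAM,
	MOCK_ENDPOINT,
};

struct mock_regs {
	void *ras;		/* mapped RAS register block */
};

struct mock_pdev {
	enum mock_pcie_type type;
	struct mock_regs *dport;	/* NULL if no dport was found */
	struct mock_regs *uport;	/* NULL if no uport was found */
};

static void *mock_port_ras(const struct mock_pdev *pdev)
{
	if (pdev->type == MOCK_ROOT_PORT || pdev->type == MOCK_DOWNSTREAM) {
		if (pdev->dport)
			return pdev->dport->ras;
	} else if (pdev->type == MOCK_UPSTREAM) {
		if (pdev->uport)
			return pdev->uport->ras;
	}

	/* Reached for unsupported types and for failed lookups alike. */
	return NULL;
}
```

Every path that does not find a RAS block falls through to the single
return NULL, which covers the wrong-type case without the initializer.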

>> +
>> +	if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
>> +	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
>> +		struct cxl_dport *dport;
>> +
>> +		find_cxl_port(&pdev->dev, &dport);
>> +		ras_base = dport ? dport->regs.ras : NULL;
> 		if (dport)
> 			return dport->regs.ras;
>> +	} else if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {
>> +		struct cxl_port *port = pci_to_cxl_uport(pdev);
>> +
>> +		ras_base = port ? port->regs.ras : NULL;
> 		if (port)
> 			return port->regs.ras;
>> +	}
> return NULL;
>> +
>> +	return ras_base;
>> +}
>> +
>> +int port_internal_err_cb(struct notifier_block *unused,
>> +			 unsigned long event, void *ptr)
>> +{
>> +	struct pci_dev *pdev = (struct pci_dev *)ptr;
> 
> I think you can use this notifier for other types of device in future?
> If it's going to be global, we definitely want to check here that we
> actually have a CXL port of some type.
> 
> It may be that via the various calls any non CXL device
> will result in a safe error. However that's not obvious, so an
> upfront check makes sense (or a per device notifier registration!)
> 

cxl_pci_port_ras() performs a PCIe type check and sets the RAS base to NULL if
the type is not a port.

>> +	void __iomem *ras_base;
>> +
>> +	if (!pdev)
>> +		return 0;
>> +
>> +	if (event == AER_CORRECTABLE) {
>> +		ras_base = cxl_pci_port_ras(pdev);
>> +		__cxl_handle_cor_ras(&pdev->dev, ras_base);
> 
> as below - one line should be fine for this.
> 
>> +	} else if ((event == AER_FATAL) || (event == AER_NONFATAL)) {
>> +		ras_base = cxl_pci_port_ras(pdev);
>> +		__cxl_handle_ras(&pdev->dev, ras_base);
> 
> 		__cxl_handle_ras(&pdev->dev, cxl_pci_port_ras(pdev));
> 
>> +	}
>> +
>> +	return 0;
>> +}
>> +EXPORT_SYMBOL_NS_GPL(port_internal_err_cb, CXL);
>> +
>>  /*
>>   * Copy the AER capability registers using 32 bit read accesses.
>>   * This is necessary because RCRB AER capability is MMIO mapped. Clear the
>> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
>> index 887ed6e358fb..d0f95c965ab4 100644
>> --- a/drivers/cxl/core/port.c
>> +++ b/drivers/cxl/core/port.c
>> @@ -1027,7 +1027,7 @@ void put_cxl_root(struct cxl_root *cxl_root)
>>  }
>>  EXPORT_SYMBOL_NS_GPL(put_cxl_root, CXL);
>>  
>> -static struct cxl_dport *find_dport(struct cxl_port *port, int id)
>> +struct cxl_dport *find_dport(struct cxl_port *port, int id)
>>  {
>>  	struct cxl_dport *dport;
>>  	unsigned long index;
>> @@ -1336,8 +1336,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
>>  	return NULL;
>>  }
>>  
>> -static struct cxl_port *find_cxl_port(struct device *dport_dev,
>> -				      struct cxl_dport **dport)
>> +struct cxl_port *find_cxl_port(struct device *dport_dev,
>> +			       struct cxl_dport **dport)
>>  {
>>  	struct cxl_find_port_ctx ctx = {
>>  		.dport_dev = dport_dev,
>> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
>> index 7cee678fdb75..04725344393b 100644
>> --- a/drivers/cxl/cxl.h
>> +++ b/drivers/cxl/cxl.h
>> @@ -11,6 +11,7 @@
>>  #include <linux/log2.h>
>>  #include <linux/node.h>
>>  #include <linux/io.h>
>> +#include "../pci/pcie/portdrv.h"
> 
> Why add the include?  Maybe it's only needed in the .c files?
> 

Ok

>>  
>>  /**
>>   * DOC: cxl objects
>> @@ -760,11 +761,15 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
>>  #ifdef CONFIG_PCIEAER_CXL
>>  void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport);
>>  void cxl_setup_parent_uport(struct device *host, struct cxl_port *port);
>> +int port_internal_err_cb(struct notifier_block *unused,
>> +			 unsigned long event, void *ptr);
>>  #else
>>  static inline void cxl_setup_parent_dport(struct device *host,
>>  					  struct cxl_dport *dport) { }
>>  static inline void cxl_setup_parent_uport(struct device *host,
>>  					  struct cxl_port *port) { }
>> +static inline int port_internal_err_cb(struct notifier_block *unused,
>> +				unsigned long event, void *ptr) { return 0; }
>>  #endif
>>  
>>  struct cxl_decoder *to_cxl_decoder(struct device *dev);
>> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
>> index 93992a1c8eec..6044955e1c48 100644
>> --- a/drivers/cxl/cxlpci.h
>> +++ b/drivers/cxl/cxlpci.h
>> @@ -130,4 +130,6 @@ void read_cdat_data(struct cxl_port *port);
>>  void cxl_cor_error_detected(struct pci_dev *pdev);
>>  pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>>  				    pci_channel_state_t state);
>> +int port_err_nb_cb(struct notifier_block *unused,
>> +		   unsigned long event, void *ptr);
>>  #endif /* __CXL_PCI_H__ */
>> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>> index 2ff361e756d6..f4183c5aea38 100644
>> --- a/drivers/cxl/pci.c
>> +++ b/drivers/cxl/pci.c
>> @@ -926,6 +926,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>>  	return rc;
>>  }
>>  
>> +struct notifier_block port_internal_err_nb = {
>> +	.notifier_call = port_internal_err_cb,
>> +};
>> +
>>  static const struct pci_device_id cxl_mem_pci_tbl[] = {
>>  	/* PCI class code for CXL.mem Type-3 Devices */
>>  	{ PCI_DEVICE_CLASS((PCI_CLASS_MEMORY_CXL << 8 | CXL_MEMORY_PROGIF), ~0)},
>> @@ -974,6 +978,19 @@ static struct pci_driver cxl_pci_driver = {
>>  	},
>>  };
>>  
>> -module_pci_driver(cxl_pci_driver);
>> +static int __init cxl_pci_init(void)
>> +{
>> +	atomic_notifier_chain_register(&portdrv_aer_internal_err_chain, &port_internal_err_nb);
> 
> Long line that you can easily break.
> 
>> +	return pci_register_driver(&cxl_pci_driver);
>> +}
>> +module_init(cxl_pci_init);
>> +
>> +static void __exit cxl_pci_exit(void)
>> +{
>> +	atomic_notifier_chain_unregister(&portdrv_aer_internal_err_chain, &port_internal_err_nb);
> 
> Long line that you can easily break.
> 
>> +	pci_unregister_driver(&cxl_pci_driver);
>> +}
>> +module_exit(cxl_pci_exit);
>> +
>>  MODULE_LICENSE("GPL v2");
>>  MODULE_IMPORT_NS(CXL);
> 


* Re: [RFC PATCH 7/9] cxl/pci: Add atomic notifier callback for CXL PCIe port AER internal errors
  2024-06-24 16:09     ` Terry Bowman
@ 2024-07-02 15:58       ` Jonathan Cameron
  0 siblings, 0 replies; 59+ messages in thread
From: Jonathan Cameron @ 2024-07-02 15:58 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter


> >> +
> >> +static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
> >> +{
> >> +	void __iomem *ras_base = NULL;  
> > Don't initialize and...  
> 
> There is a possibility that the incorrect PCI type is passed, and this is
> intended to return NULL in those cases. Shouldn't ras_base be preinitialized
> for that scenario?

From a code point of view at least, nope - just return NULL directly
given it's an error case.

> 
> >> +
> >> +	if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
> >> +	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
> >> +		struct cxl_dport *dport;
> >> +
> >> +		find_cxl_port(&pdev->dev, &dport);
> >> +		ras_base = dport ? dport->regs.ras : NULL;  
> > 		if (dport)
> > 			return dport->regs.ras;  
> >> +	} else if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {
> >> +		struct cxl_port *port = pci_to_cxl_uport(pdev);
> >> +
> >> +		ras_base = port ? port->regs.ras : NULL;  
> > 		if (port)
> > 			return port->regs.ras;  
> >> +	}  
> > return NULL;  

This is why you don't need to set ras_base.
If you get here it's always NULL.
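
A minimal userspace sketch of the early-return shape described above, with no
local variable preinitialized to NULL. The fake_* types and port_ras() are
hypothetical stand-ins for struct pci_dev, struct cxl_dport, struct cxl_port
and cxl_pci_port_ras(); the find_cxl_port()/pci_to_cxl_uport() lookups are
reduced to plain pointers that may be NULL.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
enum fake_pcie_type {
	FAKE_ROOT_PORT,
	FAKE_DOWNSTREAM,
	FAKE_UPSTREAM,
	FAKE_ENDPOINT,
};

struct fake_regs { void *ras; };
struct fake_dport { struct fake_regs regs; };
struct fake_port { struct fake_regs regs; };

struct fake_pdev {
	enum fake_pcie_type type;
	struct fake_dport *dport;	/* result of the dport lookup, may be NULL */
	struct fake_port *uport;	/* result of the uport lookup, may be NULL */
};

/* Return the RAS base as soon as it is known; every other path is an error. */
static void *port_ras(const struct fake_pdev *pdev)
{
	if (pdev->type == FAKE_ROOT_PORT || pdev->type == FAKE_DOWNSTREAM) {
		if (pdev->dport)
			return pdev->dport->regs.ras;
	} else if (pdev->type == FAKE_UPSTREAM) {
		if (pdev->uport)
			return pdev->uport->regs.ras;
	}

	/* Wrong PCIe type or failed lookup: nothing was assigned on the way. */
	return NULL;
}
```

With this shape the wrong-type case and the failed-lookup case both fall
through to the single return NULL at the end.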

> >> +
> >> +	return ras_base;
> >> +}
> >> +
> >> +int port_internal_err_cb(struct notifier_block *unused,
> >> +			 unsigned long event, void *ptr)
> >> +{
> >> +	struct pci_dev *pdev = (struct pci_dev *)ptr;  
> > 
> > I think you can use this notifier for other types of device in future?
> > If it's going to be global definitely want to check here that we
> > actually have a CXL port of some type.
> > 
> > It may be that via the various calls any non CXL device
> > will result in a safe error. However that's not obvious, so an
> > upfront check makes sense (or a per device notifier registration!)
> >   
> 
> cxl_pci_port_ras() performs PCIe type check and sets RAS base to NULL if 


* Re: [RFC PATCH 7/9] cxl/pci: Add atomic notifier callback for CXL PCIe port AER internal errors
  2024-06-17 20:04 ` [RFC PATCH 7/9] cxl/pci: Add atomic notifier callback for CXL PCIe port AER internal errors Terry Bowman
  2024-06-20 13:09   ` Jonathan Cameron
@ 2024-06-26  6:22   ` Li, Ming4
  2024-06-26 13:51     ` Terry Bowman
  1 sibling, 1 reply; 59+ messages in thread
From: Li, Ming4 @ 2024-06-26  6:22 UTC (permalink / raw)
  To: Terry Bowman, Williams, Dan J, Weiny, Ira, dave@stgolabs.net,
	Jiang, Dave, Schofield, Alison, Verma, Vishal L,
	jim.harris@samsung.com, ilpo.jarvinen@linux.intel.com,
	ardb@kernel.org, sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	Yazen.Ghannam@amd.com, Robert.Richter@amd.com

On 6/18/2024 4:04 AM, Terry Bowman wrote:
> CXL root ports, CXL downstream switch ports, and CXL upstream switch
> ports are bound to the PCIe port bus driver, portdrv. portdrv provides
> an atomic notifier chain for reporting PCIe port device AER
> correctable internal errors (CIE) and AER uncorrectable internal
> errors (UIE).
>
> CXL PCIe port devices use AER CIE/UIE to report CXL RAS.[1]
>
> Add a cxl_pci atomic notification callback for handling the portdrv's
> AER UIE/CIE notifications.
>
> Register the atomic notification callback in the cxl_pci module's
> load. Unregister the callback in the cxl_pci driver's unload.
>
> Implement the callback to check if the device parameter is a valid
> CXL PCIe port. If it is valid then make the notification callback call
> __cxl_handle_cor_ras() or __cxl_handle_ras() depending on the AER
> type.
>
> [1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>              Upstream Switch Ports
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> ---
>  drivers/cxl/core/core.h |  4 ++
>  drivers/cxl/core/pci.c  | 97 ++++++++++++++++++++++++++++++++++++++---
>  drivers/cxl/core/port.c |  6 +--
>  drivers/cxl/cxl.h       |  5 +++
>  drivers/cxl/cxlpci.h    |  2 +
>  drivers/cxl/pci.c       | 19 +++++++-
>  6 files changed, 123 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
> index bc5a95665aa0..69bef1db6ee0 100644
> --- a/drivers/cxl/core/core.h
> +++ b/drivers/cxl/core/core.h
> @@ -94,4 +94,8 @@ int cxl_update_hmat_access_coordinates(int nid, struct cxl_region *cxlr,
>  				       enum access_coordinate_class access);
>  bool cxl_need_node_perf_attrs_update(int nid);
>  
> +struct cxl_dport *find_dport(struct cxl_port *port, int id);
> +struct cxl_port *find_cxl_port(struct device *dport_dev,
> +			       struct cxl_dport **dport);
> +
>  #endif /* __CXL_CORE_H__ */
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index 59a317ab84bb..e630eccb733d 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -689,7 +689,6 @@ EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
>  static void __cxl_handle_cor_ras(struct device *dev,
>  				 void __iomem *ras_base)
>  {
> -	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>  	void __iomem *addr;
>  	u32 status;
>  
> @@ -698,10 +697,17 @@ static void __cxl_handle_cor_ras(struct device *dev,
>  
>  	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>  	status = readl(addr);
> -	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
> -		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> +
> +	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
> +		return;
> +
> +	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
> +	if (is_cxl_memdev(dev)) {
> +		struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +
>  		trace_cxl_aer_correctable_error(cxlmd, status);
> -	}
> +	} else if (dev_is_pci(dev))
> +		trace_cxl_port_aer_correctable_error(dev, status);
>  }
>  
>  static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
> @@ -733,7 +739,6 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
>  static bool __cxl_handle_ras(struct device *dev,
>  			     void __iomem *ras_base)
>  {
> -	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>  	u32 hl[CXL_HEADERLOG_SIZE_U32];
>  	void __iomem *addr;
>  	u32 status;
> @@ -759,7 +764,13 @@ static bool __cxl_handle_ras(struct device *dev,
>  	}
>  
>  	header_log_copy(ras_base, hl);
> -	trace_cxl_aer_uncorrectable_error(cxlmd, status, fe, hl);
> +	if (is_cxl_memdev(dev)) {
> +		struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
> +
> +		trace_cxl_aer_uncorrectable_error(cxlmd, status, fe, hl);
> +	} else if (dev_is_pci(dev))
> +		trace_cxl_port_aer_uncorrectable_error(dev, status);
> +
>  	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>  
>  	return true;
> @@ -882,6 +893,80 @@ static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
>  	return __cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
>  }
>  
> +static int match_uport(struct device *dev, void *data)
> +{
> +	struct device *uport_dev = (struct device *)data;
> +	struct cxl_port *port;
> +
> +	if (!is_cxl_port(dev))
> +		return 0;
> +
> +	port = to_cxl_port(dev);
> +
> +	return (port->uport_dev == uport_dev);
> +}
> +
> +static struct cxl_port *pci_to_cxl_uport(struct pci_dev *pdev)
> +{
> +	struct cxl_dport *dport;
> +	struct device *port_dev;
> +	struct cxl_port *port;
> +
> +	port = find_cxl_port(pdev->dev.parent, &dport);
> +	if (!port)
> +		return NULL;
> +	put_device(&port->dev);
> +
> +	port_dev = device_find_child(&port->dev, &pdev->dev, match_uport);
> +	if (!port_dev)
> +		return NULL;

It seems a single bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport)
call could replace the find_cxl_port() and device_find_child() calls here.
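
A rough userspace sketch of that idea: one walk over every device on the bus
with a match callback, instead of finding the parent port first and then
searching its children. struct fake_device, fake_bus_find_device() and
pci_to_uport() are hypothetical stand-ins, and the "bus" is just an array.
As far as I know the real bus_find_device() passes the match data as
const void *, so match_uport() would need that signature.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct device plus the cxl_port fields used. */
struct fake_device {
	int is_port;				/* stands in for is_cxl_port(dev) */
	int refcount;
	const struct fake_device *uport_dev;	/* stands in for port->uport_dev */
};

typedef int (*match_fn)(struct fake_device *dev, const void *data);

/* Walk every device on the "bus"; take a reference on the first match. */
static struct fake_device *fake_bus_find_device(struct fake_device *devs,
						size_t n, const void *data,
						match_fn match)
{
	for (size_t i = 0; i < n; i++) {
		if (match(&devs[i], data)) {
			devs[i].refcount++;
			return &devs[i];
		}
	}

	return NULL;
}

/* Match only ports whose upstream device is the one passed in as data. */
static int match_uport(struct fake_device *dev, const void *data)
{
	if (!dev->is_port)
		return 0;

	return dev->uport_dev == data;
}

/* Single lookup; the caller owns the reference taken on the result. */
static struct fake_device *pci_to_uport(struct fake_device *bus, size_t n,
					const struct fake_device *pdev_dev)
{
	return fake_bus_find_device(bus, n, pdev_dev, match_uport);
}
```

Either way the lookup returns a device with an elevated reference count, so
the caller still has to drop it with put_device() when done.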


> +	put_device(port_dev);
> +
> +	port = to_cxl_port(port_dev);
> +
> +	return port;
> +}
> +
> +static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
> +{
> +	void __iomem *ras_base = NULL;
> +
> +	if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
> +	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
> +		struct cxl_dport *dport;
> +
> +		find_cxl_port(&pdev->dev, &dport);
> +		ras_base = dport ? dport->regs.ras : NULL;

This needs a put_device(&port->dev) after find_cxl_port(). Using scope-based
resource management with __free() would be better here.
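
To illustrate the scope-based approach, here is a userspace sketch built on the
GCC/Clang cleanup attribute, which is the mechanism the kernel's __free()
helper wraps. Everything named fake_* and the AUTO_PUT_PORT macro are
hypothetical; the reference count is a plain int so the effect can be observed.

```c
#include <stddef.h>

/* Hypothetical stand-ins; the refcount is a plain int for illustration. */
struct fake_device { int refcount; };
struct fake_dport { void *ras; };
struct fake_port {
	struct fake_device dev;
	struct fake_dport dport;
};

static void fake_put_device(struct fake_device *dev)
{
	if (dev)
		dev->refcount--;
}

/* Like find_cxl_port(): returns the port with a reference held, or NULL. */
static struct fake_port *fake_find_port(struct fake_port *candidate,
					struct fake_dport **dport)
{
	if (!candidate) {
		*dport = NULL;
		return NULL;
	}

	candidate->dev.refcount++;
	*dport = &candidate->dport;
	return candidate;
}

/* Cleanup handler: runs automatically when the annotated variable goes out of scope. */
static void fake_put_port(struct fake_port **port)
{
	if (*port)
		fake_put_device(&(*port)->dev);
}

#define AUTO_PUT_PORT __attribute__((cleanup(fake_put_port)))

/* The reference is dropped on every return path without an explicit put. */
static void *port_ras(struct fake_port *candidate)
{
	struct fake_dport *dport = NULL;
	struct fake_port *port AUTO_PUT_PORT = fake_find_port(candidate, &dport);

	if (!port || !dport)
		return NULL;

	return dport->ras;
}
```

The return value is evaluated before the cleanup handler runs, so reading
dport->ras while the reference is still held is safe in this sketch.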


> +	} else if (pci_pcie_type(pdev) == PCI_EXP_TYPE_UPSTREAM) {
> +		struct cxl_port *port = pci_to_cxl_uport(pdev);
> +
> +		ras_base = port ? port->regs.ras : NULL;
> +	}
> +
> +	return ras_base;
> +}
> +
> +int port_internal_err_cb(struct notifier_block *unused,
> +			 unsigned long event, void *ptr)
> +{
> +	struct pci_dev *pdev = (struct pci_dev *)ptr;
> +	void __iomem *ras_base;
> +
> +	if (!pdev)
> +		return 0;
> +
> +	if (event == AER_CORRECTABLE) {
> +		ras_base = cxl_pci_port_ras(pdev);
> +		__cxl_handle_cor_ras(&pdev->dev, ras_base);
> +	} else if ((event == AER_FATAL) || (event == AER_NONFATAL)) {
> +		ras_base = cxl_pci_port_ras(pdev);
> +		__cxl_handle_ras(&pdev->dev, ras_base);
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(port_internal_err_cb, CXL);
> +
>  /*
>   * Copy the AER capability registers using 32 bit read accesses.
>   * This is necessary because RCRB AER capability is MMIO mapped. Clear the
> diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
> index 887ed6e358fb..d0f95c965ab4 100644
> --- a/drivers/cxl/core/port.c
> +++ b/drivers/cxl/core/port.c
> @@ -1027,7 +1027,7 @@ void put_cxl_root(struct cxl_root *cxl_root)
>  }
>  EXPORT_SYMBOL_NS_GPL(put_cxl_root, CXL);
>  
> -static struct cxl_dport *find_dport(struct cxl_port *port, int id)
> +struct cxl_dport *find_dport(struct cxl_port *port, int id)
>  {
>  	struct cxl_dport *dport;
>  	unsigned long index;
> @@ -1336,8 +1336,8 @@ static struct cxl_port *__find_cxl_port(struct cxl_find_port_ctx *ctx)
>  	return NULL;
>  }
>  
> -static struct cxl_port *find_cxl_port(struct device *dport_dev,
> -				      struct cxl_dport **dport)
> +struct cxl_port *find_cxl_port(struct device *dport_dev,
> +			       struct cxl_dport **dport)
>  {
>  	struct cxl_find_port_ctx ctx = {
>  		.dport_dev = dport_dev,
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 7cee678fdb75..04725344393b 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -11,6 +11,7 @@
>  #include <linux/log2.h>
>  #include <linux/node.h>
>  #include <linux/io.h>
> +#include "../pci/pcie/portdrv.h"
>  
>  /**
>   * DOC: cxl objects
> @@ -760,11 +761,15 @@ struct cxl_dport *devm_cxl_add_rch_dport(struct cxl_port *port,
>  #ifdef CONFIG_PCIEAER_CXL
>  void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport);
>  void cxl_setup_parent_uport(struct device *host, struct cxl_port *port);
> +int port_internal_err_cb(struct notifier_block *unused,
> +			 unsigned long event, void *ptr);
>  #else
>  static inline void cxl_setup_parent_dport(struct device *host,
>  					  struct cxl_dport *dport) { }
>  static inline void cxl_setup_parent_uport(struct device *host,
>  					  struct cxl_port *port) { }
> +static inline int port_internal_err_cb(struct notifier_block *unused,
> +				unsigned long event, void *ptr) { return 0; }
>  #endif
>  
>  struct cxl_decoder *to_cxl_decoder(struct device *dev);
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 93992a1c8eec..6044955e1c48 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -130,4 +130,6 @@ void read_cdat_data(struct cxl_port *port);
>  void cxl_cor_error_detected(struct pci_dev *pdev);
>  pci_ers_result_t cxl_error_detected(struct pci_dev *pdev,
>  				    pci_channel_state_t state);
> +int port_err_nb_cb(struct notifier_block *unused,
> +		   unsigned long event, void *ptr);
>  #endif /* __CXL_PCI_H__ */
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 2ff361e756d6..f4183c5aea38 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -926,6 +926,10 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	return rc;
>  }
>  
> +struct notifier_block port_internal_err_nb = {
> +	.notifier_call = port_internal_err_cb,
> +};
> +
>  static const struct pci_device_id cxl_mem_pci_tbl[] = {
>  	/* PCI class code for CXL.mem Type-3 Devices */
>  	{ PCI_DEVICE_CLASS((PCI_CLASS_MEMORY_CXL << 8 | CXL_MEMORY_PROGIF), ~0)},
> @@ -974,6 +978,19 @@ static struct pci_driver cxl_pci_driver = {
>  	},
>  };
>  
> -module_pci_driver(cxl_pci_driver);
> +static int __init cxl_pci_init(void)
> +{
> +	atomic_notifier_chain_register(&portdrv_aer_internal_err_chain, &port_internal_err_nb);
> +	return pci_register_driver(&cxl_pci_driver);
> +}
> +module_init(cxl_pci_init);
> +
> +static void __exit cxl_pci_exit(void)
> +{
> +	atomic_notifier_chain_unregister(&portdrv_aer_internal_err_chain, &port_internal_err_nb);
> +	pci_unregister_driver(&cxl_pci_driver);
> +}
> +module_exit(cxl_pci_exit);
> +
>  MODULE_LICENSE("GPL v2");
>  MODULE_IMPORT_NS(CXL);




* Re: [RFC PATCH 7/9] cxl/pci: Add atomic notifier callback for CXL PCIe port AER internal errors
  2024-06-26  6:22   ` Li, Ming4
@ 2024-06-26 13:51     ` Terry Bowman
  0 siblings, 0 replies; 59+ messages in thread
From: Terry Bowman @ 2024-06-26 13:51 UTC (permalink / raw)
  To: Li, Ming4, Williams, Dan J, Weiny, Ira, dave@stgolabs.net,
	Jiang, Dave, Schofield, Alison, Verma, Vishal L,
	jim.harris@samsung.com, ilpo.jarvinen@linux.intel.com,
	ardb@kernel.org, sathyanarayanan.kuppuswamy@linux.intel.com,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	Yazen.Ghannam@amd.com, Robert.Richter@amd.com



On 6/26/24 01:22, Li, Ming4 wrote:
> On 6/18/2024 4:04 AM, Terry Bowman wrote:
>> CXL root ports, CXL downstream switch ports, and CXL upstream switch
>> ports are bound to the PCIe port bus driver, portdrv. portdrv provides
>> an atomic notifier chain for reporting PCIe port device AER
>> correctable internal errors (CIE) and AER uncorrectable internal
>> errors (UIE).
>>
>> CXL PCIe port devices use AER CIE/UIE to report CXL RAS.[1]
>>
>> Add a cxl_pci atomic notification callback for handling the portdrv's
>> AER UIE/CIE notifications.
>>
>> Register the atomic notification callback in the cxl_pci module's
>> load. Unregister the callback in the cxl_pci driver's unload.
>>
>> Implement the callback to check if the device parameter is a valid
>> CXL PCIe port. If it is valid then make the notification callback call
>> __cxl_handle_cor_ras() or __cxl_handle_ras() depending on the AER
>> type.
>>
>> [1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>>              Upstream Switch Ports
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> ---
>>  drivers/cxl/core/core.h |  4 ++
>>  drivers/cxl/core/pci.c  | 97 ++++++++++++++++++++++++++++++++++++++---
>>  drivers/cxl/core/port.c |  6 +--
>>  drivers/cxl/cxl.h       |  5 +++
>>  drivers/cxl/cxlpci.h    |  2 +
>>  drivers/cxl/pci.c       | 19 +++++++-
>>  6 files changed, 123 insertions(+), 10 deletions(-)
>>
>> diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
>> index bc5a95665aa0..69bef1db6ee0 100644
>> --- a/drivers/cxl/core/core.h
>> +++ b/drivers/cxl/core/core.h
>> @@ -94,4 +94,8 @@ int cxl_update_hmat_access_coordinates(int nid, struct cxl_region *cxlr,
>>  				       enum access_coordinate_class access);
>>  bool cxl_need_node_perf_attrs_update(int nid);
>>  
>> +struct cxl_dport *find_dport(struct cxl_port *port, int id);
>> +struct cxl_port *find_cxl_port(struct device *dport_dev,
>> +			       struct cxl_dport **dport);
>> +
>>  #endif /* __CXL_CORE_H__ */
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index 59a317ab84bb..e630eccb733d 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -689,7 +689,6 @@ EXPORT_SYMBOL_NS_GPL(read_cdat_data, CXL);
>>  static void __cxl_handle_cor_ras(struct device *dev,
>>  				 void __iomem *ras_base)
>>  {
>> -	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>>  	void __iomem *addr;
>>  	u32 status;
>>  
>> @@ -698,10 +697,17 @@ static void __cxl_handle_cor_ras(struct device *dev,
>>  
>>  	addr = ras_base + CXL_RAS_CORRECTABLE_STATUS_OFFSET;
>>  	status = readl(addr);
>> -	if (status & CXL_RAS_CORRECTABLE_STATUS_MASK) {
>> -		writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>> +
>> +	if (!(status & CXL_RAS_CORRECTABLE_STATUS_MASK))
>> +		return;
>> +
>> +	writel(status & CXL_RAS_CORRECTABLE_STATUS_MASK, addr);
>> +	if (is_cxl_memdev(dev)) {
>> +		struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>> +
>>  		trace_cxl_aer_correctable_error(cxlmd, status);
>> -	}
>> +	} else if (dev_is_pci(dev))
>> +		trace_cxl_port_aer_correctable_error(dev, status);
>>  }
>>  
>>  static void cxl_handle_endpoint_cor_ras(struct cxl_dev_state *cxlds)
>> @@ -733,7 +739,6 @@ static void header_log_copy(void __iomem *ras_base, u32 *log)
>>  static bool __cxl_handle_ras(struct device *dev,
>>  			     void __iomem *ras_base)
>>  {
>> -	struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>>  	u32 hl[CXL_HEADERLOG_SIZE_U32];
>>  	void __iomem *addr;
>>  	u32 status;
>> @@ -759,7 +764,13 @@ static bool __cxl_handle_ras(struct device *dev,
>>  	}
>>  
>>  	header_log_copy(ras_base, hl);
>> -	trace_cxl_aer_uncorrectable_error(cxlmd, status, fe, hl);
>> +	if (is_cxl_memdev(dev)) {
>> +		struct cxl_memdev *cxlmd = to_cxl_memdev(dev);
>> +
>> +		trace_cxl_aer_uncorrectable_error(cxlmd, status, fe, hl);
>> +	} else if (dev_is_pci(dev))
>> +		trace_cxl_port_aer_uncorrectable_error(dev, status);
>> +
>>  	writel(status & CXL_RAS_UNCORRECTABLE_STATUS_MASK, addr);
>>  
>>  	return true;
>> @@ -882,6 +893,80 @@ static bool cxl_handle_rdport_ras(struct cxl_dev_state *cxlds,
>>  	return __cxl_handle_ras(&cxlds->cxlmd->dev, dport->regs.ras);
>>  }
>>  
>> +static int match_uport(struct device *dev, void *data)
>> +{
>> +	struct device *uport_dev = (struct device *)data;
>> +	struct cxl_port *port;
>> +
>> +	if (!is_cxl_port(dev))
>> +		return 0;
>> +
>> +	port = to_cxl_port(dev);
>> +
>> +	return (port->uport_dev == uport_dev);
>> +}
>> +
>> +static struct cxl_port *pci_to_cxl_uport(struct pci_dev *pdev)
>> +{
>> +	struct cxl_dport *dport;
>> +	struct device *port_dev;
>> +	struct cxl_port *port;
>> +
>> +	port = find_cxl_port(pdev->dev.parent, &dport);
>> +	if (!port)
>> +		return NULL;
>> +	put_device(&port->dev);
>> +
>> +	port_dev = device_find_child(&port->dev, &pdev->dev, match_uport);
>> +	if (!port_dev)
>> +		return NULL;
> 
> It seems a single bus_find_device(&cxl_bus_type, NULL, &pdev->dev, match_uport)
> call could replace the find_cxl_port() and device_find_child() calls here.
> 
> 

That would be a good improvement/optimization. I'll look into making that change.

>> +	put_device(port_dev);
>> +
>> +	port = to_cxl_port(port_dev);
>> +
>> +	return port;
>> +}
>> +
>> +static void __iomem *cxl_pci_port_ras(struct pci_dev *pdev)
>> +{
>> +	void __iomem *ras_base = NULL;
>> +
>> +	if ((pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) ||
>> +	    (pci_pcie_type(pdev) == PCI_EXP_TYPE_DOWNSTREAM)) {
>> +		struct cxl_dport *dport;
>> +
>> +		find_cxl_port(&pdev->dev, &dport);
>> +		ras_base = dport ? dport->regs.ras : NULL;
> 
> This needs a put_device(&port->dev) after find_cxl_port(). Using scope-based
> resource management with __free() would be better here.
> 
> 

Thanks.

Regards,
Terry


* [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors()
  2024-06-17 20:04 [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports Terry Bowman
                   ` (6 preceding siblings ...)
  2024-06-17 20:04 ` [RFC PATCH 7/9] cxl/pci: Add atomic notifier callback for CXL PCIe port AER internal errors Terry Bowman
@ 2024-06-17 20:04 ` Terry Bowman
  2024-06-19  7:09   ` Christoph Hellwig
                     ` (2 more replies)
  2024-06-17 20:04 ` [RFC PATCH 9/9] cxl/pci: Enable interrupts for CXL PCIe ports' AER internal errors Terry Bowman
                   ` (2 subsequent siblings)
  10 siblings, 3 replies; 59+ messages in thread
From: Terry Bowman @ 2024-06-17 20:04 UTC (permalink / raw)
  To: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel, terry.bowman,
	Yazen.Ghannam, Robert.Richter
  Cc: Bjorn Helgaas, linux-pci

AER correctable internal errors (CIE) and AER uncorrectable internal
errors (UIE) are disabled through the AER mask register by default.[1]

CXL PCIe ports use the CIE/UIE to report RAS errors and as a result
need CIE/UIE enabled.[2]

Change pci_aer_unmask_internal_errors() function to be exported for
the CXL driver and other drivers to use.

[1] PCI6.0 - 7.8.4.3 Uncorrectable
[2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and Upstream
             Switch Ports

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
---
 drivers/pci/pcie/aer.c | 3 ++-
 include/linux/aer.h    | 6 ++++++
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 4dc03cb9aff0..d7a1982f0c50 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -951,7 +951,7 @@ static bool find_source_device(struct pci_dev *parent,
  * Note: AER must be enabled and supported by the device which must be
  * checked in advance, e.g. with pcie_aer_is_native().
  */
-static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
+void pci_aer_unmask_internal_errors(struct pci_dev *dev)
 {
 	int aer = dev->aer_cap;
 	u32 mask;
@@ -964,6 +964,7 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
 	mask &= ~PCI_ERR_COR_INTERNAL;
 	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
 }
+EXPORT_SYMBOL_GPL(pci_aer_unmask_internal_errors);
 
 static bool is_cxl_mem_dev(struct pci_dev *dev)
 {
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 4b97f38f3fcf..a4fd25ea0280 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -50,6 +50,12 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
 static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
 #endif
 
+#ifdef CONFIG_PCIEAER_CXL
+void pci_aer_unmask_internal_errors(struct pci_dev *dev);
+#else
+static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
+#endif
+
 void pci_print_aer(struct pci_dev *dev, int aer_severity,
 		    struct aer_capability_regs *aer);
 int cper_severity_to_aer(int cper_severity);
-- 
2.34.1



* Re: [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors()
  2024-06-17 20:04 ` [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors() Terry Bowman
@ 2024-06-19  7:09   ` Christoph Hellwig
  2024-06-19 15:40     ` Terry Bowman
  2024-06-20 13:11   ` Jonathan Cameron
  2024-07-10 21:47   ` Bjorn Helgaas
  2 siblings, 1 reply; 59+ messages in thread
From: Christoph Hellwig @ 2024-06-19  7:09 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci

On Mon, Jun 17, 2024 at 03:04:10PM -0500, Terry Bowman wrote:
> AER correctable internal errors (CIE) and AER uncorrectable internal
> errors (UIE) are disabled through the AER mask register by default.[1]
> 
> CXL PCIe ports use the CIE/UIE to report RAS errors and as a result
> need CIE/UIE enabled.[2]
> 
> Change pci_aer_unmask_internal_errors() function to be exported for
> the CXL driver and other drivers to use.

I can't actually find a user for this.  Maybe that's because you did
weird partial CCs for your series, or maybe it's because you don't
want to tell us.  Either way it's a no-go.


* Re: [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors()
  2024-06-19  7:09   ` Christoph Hellwig
@ 2024-06-19 15:40     ` Terry Bowman
  0 siblings, 0 replies; 59+ messages in thread
From: Terry Bowman @ 2024-06-19 15:40 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci



On 6/19/24 02:09, Christoph Hellwig wrote:
> On Mon, Jun 17, 2024 at 03:04:10PM -0500, Terry Bowman wrote:
>> AER correctable internal errors (CIE) and AER uncorrectable internal
>> errors (UIE) are disabled through the AER mask register by default.[1]
>>
>> CXL PCIe ports use the CIE/UIE to report RAS errors and as a result
>> need CIE/UIE enabled.[2]
>>
>> Change pci_aer_unmask_internal_errors() function to be exported for
>> the CXL driver and other drivers to use.
> 
> I can't actually find a user for this.  Maybe that's because you did
> weird partial CCs for your series, or maybe it's because you don't
> want to tell us.  Either way it's a no-go.

The user is in the following patch (9/9), which was not sent to the PCI list.
If there is a rework, I'll make sure both are sent to the PCI list.

https://lore.kernel.org/all/20240617200411.1426554-10-terry.bowman@amd.com/

Regards,
Terry


* Re: [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors()
  2024-06-17 20:04 ` [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors() Terry Bowman
  2024-06-19  7:09   ` Christoph Hellwig
@ 2024-06-20 13:11   ` Jonathan Cameron
  2024-06-24 16:22     ` Terry Bowman
  2024-07-10 21:47   ` Bjorn Helgaas
  2 siblings, 1 reply; 59+ messages in thread
From: Jonathan Cameron @ 2024-06-20 13:11 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci

On Mon, 17 Jun 2024 15:04:10 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> AER correctable internal errors (CIE) and AER uncorrectable internal
> errors (UIE) are disabled through the AER mask register by default.[1]
> 
> CXL PCIe ports use the CIE/UIE to report RAS errors and as a result
> need CIE/UIE enabled.[2]
> 
> Change pci_aer_unmask_internal_errors() function to be exported for
> the CXL driver and other drivers to use.

I've perhaps forgotten the end conclusion, but I thought there was
a request to just try enabling this in general and mask it out only
for known broken devices?

Admittedly that's a more daring path, so maybe I hallucinated it!
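
For reference, unmasking here amounts to clearing one bit in each of the two
AER mask registers. A userspace sketch under that assumption, with the AER
capability modeled as a byte array: struct fake_aer_cap, cap_read32(),
cap_write32() and unmask_internal_errors() are hypothetical names, while the
register offsets and bit values are the ones defined in
include/uapi/linux/pci_regs.h.

```c
#include <stdint.h>
#include <string.h>

/* Offsets within the AER capability and bit values, as in pci_regs.h. */
#define PCI_ERR_UNCOR_MASK	0x08		/* Uncorrectable Error Mask */
#define PCI_ERR_UNC_INTN	0x00400000	/* Uncorrectable internal error */
#define PCI_ERR_COR_MASK	0x14		/* Correctable Error Mask */
#define PCI_ERR_COR_INTERNAL	0x00004000	/* Corrected internal error */

/* Hypothetical model of the first 0x18 bytes of an AER capability. */
struct fake_aer_cap { uint8_t bytes[0x18]; };

static uint32_t cap_read32(const struct fake_aer_cap *cap, unsigned int off)
{
	uint32_t val;

	memcpy(&val, &cap->bytes[off], sizeof(val));
	return val;
}

static void cap_write32(struct fake_aer_cap *cap, unsigned int off, uint32_t val)
{
	memcpy(&cap->bytes[off], &val, sizeof(val));
}

/* Read-modify-write: clear only the internal error bits, keep the rest. */
static void unmask_internal_errors(struct fake_aer_cap *cap)
{
	uint32_t mask;

	mask = cap_read32(cap, PCI_ERR_UNCOR_MASK);
	mask &= ~(uint32_t)PCI_ERR_UNC_INTN;
	cap_write32(cap, PCI_ERR_UNCOR_MASK, mask);

	mask = cap_read32(cap, PCI_ERR_COR_MASK);
	mask &= ~(uint32_t)PCI_ERR_COR_INTERNAL;
	cap_write32(cap, PCI_ERR_COR_MASK, mask);
}
```

A mask-by-default-off policy with a quirk list for broken devices would be
the inverse: set these same two bits again for the devices on the list.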

> 
> [1] PCI6.0 - 7.8.4.3 Uncorrectable
> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and Upstream
>              Switch Ports
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> ---
>  drivers/pci/pcie/aer.c | 3 ++-
>  include/linux/aer.h    | 6 ++++++
>  2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 4dc03cb9aff0..d7a1982f0c50 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -951,7 +951,7 @@ static bool find_source_device(struct pci_dev *parent,
>   * Note: AER must be enabled and supported by the device which must be
>   * checked in advance, e.g. with pcie_aer_is_native().
>   */
> -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> +void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>  {
>  	int aer = dev->aer_cap;
>  	u32 mask;
> @@ -964,6 +964,7 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>  	mask &= ~PCI_ERR_COR_INTERNAL;
>  	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
>  }
> +EXPORT_SYMBOL_GPL(pci_aer_unmask_internal_errors);
>  
>  static bool is_cxl_mem_dev(struct pci_dev *dev)
>  {
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 4b97f38f3fcf..a4fd25ea0280 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -50,6 +50,12 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
>  static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
>  #endif
>  
> +#ifdef CONFIG_PCIEAER_CXL
> +void pci_aer_unmask_internal_errors(struct pci_dev *dev);
> +#else
> +static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
> +#endif
> +
>  void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  		    struct aer_capability_regs *aer);
>  int cper_severity_to_aer(int cper_severity);



* Re: [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors()
  2024-06-20 13:11   ` Jonathan Cameron
@ 2024-06-24 16:22     ` Terry Bowman
  0 siblings, 0 replies; 59+ messages in thread
From: Terry Bowman @ 2024-06-24 16:22 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci

Hi Jonathan,

I added a response inline below.

On 6/20/24 08:11, Jonathan Cameron wrote:
> On Mon, 17 Jun 2024 15:04:10 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
> 
>> AER correctable internal errors (CIE) and AER uncorrectable internal
>> errors (UIE) are disabled through the AER mask register by default.[1]
>>
>> CXL PCIe ports use the CIE/UIE to report RAS errors and as a result
>> need CIE/UIE enabled.[2]
>>
>> Change pci_aer_unmask_internal_errors() function to be exported for
>> the CXL driver and other drivers to use.
> 
> I've perhaps forgotten the end conclusion, but I thought there was
> a request to just try enabling this in general and mask it out only
> for known broken devices?
> 
> Admittedly that's a more daring path, so maybe I hallucinated it!
> 

I remember there was discussion. A quick search for PCI_ERR_COR_INTERNAL and 
PCI_ERR_UNC_INTERNAL doesn't find any default enablement. 

Regards,
Terry 

>>
>> [1] PCI6.0 - 7.8.4.3 Uncorrectable
>> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and Upstream
>>              Switch Ports
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: linux-pci@vger.kernel.org
>> ---
>>  drivers/pci/pcie/aer.c | 3 ++-
>>  include/linux/aer.h    | 6 ++++++
>>  2 files changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 4dc03cb9aff0..d7a1982f0c50 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -951,7 +951,7 @@ static bool find_source_device(struct pci_dev *parent,
>>   * Note: AER must be enabled and supported by the device which must be
>>   * checked in advance, e.g. with pcie_aer_is_native().
>>   */
>> -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>> +void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>>  {
>>  	int aer = dev->aer_cap;
>>  	u32 mask;
>> @@ -964,6 +964,7 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>>  	mask &= ~PCI_ERR_COR_INTERNAL;
>>  	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
>>  }
>> +EXPORT_SYMBOL_GPL(pci_aer_unmask_internal_errors);
>>  
>>  static bool is_cxl_mem_dev(struct pci_dev *dev)
>>  {
>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>> index 4b97f38f3fcf..a4fd25ea0280 100644
>> --- a/include/linux/aer.h
>> +++ b/include/linux/aer.h
>> @@ -50,6 +50,12 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
>>  static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
>>  #endif
>>  
>> +#ifdef CONFIG_PCIEAER_CXL
>> +void pci_aer_unmask_internal_errors(struct pci_dev *dev);
>> +#else
>> +static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
>> +#endif
>> +
>>  void pci_print_aer(struct pci_dev *dev, int aer_severity,
>>  		    struct aer_capability_regs *aer);
>>  int cper_severity_to_aer(int cper_severity);
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors()
  2024-06-17 20:04 ` [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors() Terry Bowman
  2024-06-19  7:09   ` Christoph Hellwig
  2024-06-20 13:11   ` Jonathan Cameron
@ 2024-07-10 21:47   ` Bjorn Helgaas
  2 siblings, 0 replies; 59+ messages in thread
From: Bjorn Helgaas @ 2024-07-10 21:47 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci

On Mon, Jun 17, 2024 at 03:04:10PM -0500, Terry Bowman wrote:
> AER correctable internal errors (CIE) and AER uncorrectable internal
> errors (UIE) are disabled through the AER mask register by default.[1]

Nit: "Correctable Errors" and "Uncorrectable Errors" are generic PCIe
concepts that exist independent of AER, so I wouldn't prefix them with
AER.  The AER mask registers control *reporting* of errors, but of
course they don't disable the errors themselves.

> CXL PCIe ports use the CIE/UIE to report RAS errors and as a result
> need CIE/UIE enabled.[2]
> 
> Change pci_aer_unmask_internal_errors() function to be exported for
> the CXL driver and other drivers to use.
> 
> [1] PCI6.0 - 7.8.4.3 Uncorrectable

s/PCI6.0 .../PCIe r6.0, sec 7.8.4.3/ since there is a conventional PCI
spec as well.

> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and Upstream
>              Switch Ports
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> ---
>  drivers/pci/pcie/aer.c | 3 ++-
>  include/linux/aer.h    | 6 ++++++
>  2 files changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 4dc03cb9aff0..d7a1982f0c50 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -951,7 +951,7 @@ static bool find_source_device(struct pci_dev *parent,
>   * Note: AER must be enabled and supported by the device which must be
>   * checked in advance, e.g. with pcie_aer_is_native().
>   */
> -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> +void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>  {
>  	int aer = dev->aer_cap;
>  	u32 mask;
> @@ -964,6 +964,7 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>  	mask &= ~PCI_ERR_COR_INTERNAL;
>  	pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
>  }
> +EXPORT_SYMBOL_GPL(pci_aer_unmask_internal_errors);
>  
>  static bool is_cxl_mem_dev(struct pci_dev *dev)
>  {
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 4b97f38f3fcf..a4fd25ea0280 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -50,6 +50,12 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
>  static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
>  #endif
>  
> +#ifdef CONFIG_PCIEAER_CXL
> +void pci_aer_unmask_internal_errors(struct pci_dev *dev);
> +#else
> +static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
> +#endif

I don't like the idea of exporting a generic PCI interface that only
does something when CONFIG_PCIEAER_CXL is enabled.  If there's ever a
non-CXL caller, it will be confused.
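
A userspace sketch of that concern, under stated assumptions: every
fake_* name and FAKE_CONFIG_PCIEAER_CXL are hypothetical stand-ins and
not kernel symbols; only the PCI_ERR_COR_INTERNAL bit value is taken
from pci_regs.h. With the option left undefined, the stub compiles to
a no-op, so a non-CXL caller still sees the error masked afterwards.

```c
#include <stdint.h>

/* Corrected Internal Error bit in the Correctable Error Mask register */
#define PCI_ERR_COR_INTERNAL 0x00004000u

/* Hypothetical stand-in for a device's AER correctable mask register */
struct fake_dev {
	uint32_t cor_mask;
};

#ifdef FAKE_CONFIG_PCIEAER_CXL
/* Real body: clear the mask bit so the error is reported */
static inline void fake_unmask_internal_errors(struct fake_dev *dev)
{
	dev->cor_mask &= ~PCI_ERR_COR_INTERNAL;
}
#else
/* Stub: silently does nothing when the CXL option is disabled */
static inline void fake_unmask_internal_errors(struct fake_dev *dev)
{
	(void)dev;
}
#endif

/*
 * Hypothetical non-CXL caller. Returns 0 if the internal error is
 * unmasked after the call, -1 if it is still masked.
 */
static int fake_non_cxl_probe(struct fake_dev *dev)
{
	fake_unmask_internal_errors(dev);
	return (dev->cor_mask & PCI_ERR_COR_INTERNAL) ? -1 : 0;
}
```

The caller gets no indication that the call had no effect, which is
the confusion described above.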

Bjorn

^ permalink raw reply	[flat|nested] 59+ messages in thread

* [RFC PATCH 9/9] cxl/pci: Enable interrupts for CXL PCIe ports' AER internal errors
  2024-06-17 20:04 [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports Terry Bowman
                   ` (7 preceding siblings ...)
  2024-06-17 20:04 ` [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors() Terry Bowman
@ 2024-06-17 20:04 ` Terry Bowman
  2024-06-20 13:15   ` Jonathan Cameron
  2024-06-21 19:04 ` [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports Dan Williams
  2024-07-25 18:49 ` fan
  10 siblings, 1 reply; 59+ messages in thread
From: Terry Bowman @ 2024-06-17 20:04 UTC (permalink / raw)
  To: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel, terry.bowman,
	Yazen.Ghannam, Robert.Richter

CXL RAS errors are reported through AER interrupts using the AER status:
correctbale internal errors (CIE) and AER uncorrectable internal errors
(UIE).[1] But, the AER CIE/UIE are disabled by default preventing
notification of CXL RAS errors.[2]

Enable CXL PCIe port RAS notification by unmasking the ports' AER CIE
and UIE errors.

[1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and Upstream
             Switch Ports
[2] PCI6.0 - 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h),
             7.8.4.6 Correctable Error Mask Register (Offset 14h)

Signed-off-by: Terry Bowman <terry.bowman@amd.com>
---
 drivers/cxl/core/pci.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
index e630eccb733d..73637d39df0a 100644
--- a/drivers/cxl/core/pci.c
+++ b/drivers/cxl/core/pci.c
@@ -861,6 +861,12 @@ void cxl_setup_parent_uport(struct device *host, struct cxl_port *port)
 	struct device *uport_dev = port->uport_dev;
 
 	cxl_port_map_regs(uport_dev, map, regs);
+
+	if (dev_is_pci(uport_dev)) {
+		struct pci_dev *pdev = to_pci_dev(uport_dev);
+
+		pci_aer_unmask_internal_errors(pdev);
+	}
 }
 EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_uport, CXL);
 
@@ -878,6 +884,12 @@ void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport)
 
 	if (dport->rch)
 		cxl_disable_rch_root_ints(dport);
+
+	if (dev_is_pci(dport_dev)) {
+		struct pci_dev *pdev = to_pci_dev(dport_dev);
+
+		pci_aer_unmask_internal_errors(pdev);
+	}
 }
 EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_dport, CXL);
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 9/9] cxl/pci: Enable interrupts for CXL PCIe ports' AER internal errors
  2024-06-17 20:04 ` [RFC PATCH 9/9] cxl/pci: Enable interrupts for CXL PCIe ports' AER internal errors Terry Bowman
@ 2024-06-20 13:15   ` Jonathan Cameron
  2024-06-24 16:46     ` Terry Bowman
  0 siblings, 1 reply; 59+ messages in thread
From: Jonathan Cameron @ 2024-06-20 13:15 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter

On Mon, 17 Jun 2024 15:04:11 -0500
Terry Bowman <terry.bowman@amd.com> wrote:

> CXL RAS errors are reported through AER interrupts using the AER status:
> correctbale internal errors (CIE) and AER uncorrectable internal errors

correctable

> (UIE).[1] But, the AER CIE/UIE are disabled by default preventing
> notification of CXL RAS errors.[2]
> 
> Enable CXL PCIe port RAS notification by unmasking the ports' AER CIE
> and UIE errors.
> 
> [1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and Upstream
>              Switch Ports
> [2] PCI6.0 - 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h),
>              7.8.4.6 Correctable Error Mask Register (Offset 14h)
> 
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>

I'm not sure doing this from a driver other than the one handling the
errors makes sense.  It does a couple of RMWs without any locking, and
without considering that the driver bound to the PCI port might care
about this changing.

I'd like more info on why we don't just turn this on in general
and hence avoid the need to control it from the 'wrong' place.
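
For reference, the read-modify-write pattern in question can be
sketched in userspace as below. This is only an illustration: the
fake_* names are hypothetical, the struct stands in for config space,
and only the two bit values are taken from pci_regs.h. Nothing in the
sequence prevents another writer from changing the same mask register
between the read and the write.

```c
#include <stdint.h>

/* Bit values as defined in include/uapi/linux/pci_regs.h */
#define PCI_ERR_UNC_INTN     0x00400000u /* Uncorrectable Internal Error */
#define PCI_ERR_COR_INTERNAL 0x00004000u /* Corrected Internal Error */

/* Hypothetical stand-in for a port's AER mask registers */
struct fake_aer_regs {
	uint32_t uncor_mask;
	uint32_t cor_mask;
};

/* Clear the internal-error mask bits with two unlocked RMW sequences */
static void fake_unmask_internal_errors(struct fake_aer_regs *regs)
{
	uint32_t mask;

	mask = regs->uncor_mask;	/* read */
	mask &= ~PCI_ERR_UNC_INTN;	/* modify */
	regs->uncor_mask = mask;	/* write */

	mask = regs->cor_mask;
	mask &= ~PCI_ERR_COR_INTERNAL;
	regs->cor_mask = mask;
}
```

Other mask bits are preserved; only the two internal-error bits are
cleared.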

Jonathan



> ---
>  drivers/cxl/core/pci.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
> index e630eccb733d..73637d39df0a 100644
> --- a/drivers/cxl/core/pci.c
> +++ b/drivers/cxl/core/pci.c
> @@ -861,6 +861,12 @@ void cxl_setup_parent_uport(struct device *host, struct cxl_port *port)
>  	struct device *uport_dev = port->uport_dev;
>  
>  	cxl_port_map_regs(uport_dev, map, regs);
> +
> +	if (dev_is_pci(uport_dev)) {
> +		struct pci_dev *pdev = to_pci_dev(uport_dev);
> +
> +		pci_aer_unmask_internal_errors(pdev);

I'd skip the local variable for conciseness.

> +	}
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_uport, CXL);
>  
> @@ -878,6 +884,12 @@ void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport)
>  
>  	if (dport->rch)
>  		cxl_disable_rch_root_ints(dport);
> +
> +	if (dev_is_pci(dport_dev)) {
> +		struct pci_dev *pdev = to_pci_dev(dport_dev);
> +
> +		pci_aer_unmask_internal_errors(pdev);

likewise.

> +	}
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_dport, CXL);
>  


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 9/9] cxl/pci: Enable interrupts for CXL PCIe ports' AER internal errors
  2024-06-20 13:15   ` Jonathan Cameron
@ 2024-06-24 16:46     ` Terry Bowman
  2024-07-02 16:00       ` Jonathan Cameron
  0 siblings, 1 reply; 59+ messages in thread
From: Terry Bowman @ 2024-06-24 16:46 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter

Hi Jonathan,

I added responses inline below.

On 6/20/24 08:15, Jonathan Cameron wrote:
> On Mon, 17 Jun 2024 15:04:11 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
> 
>> CXL RAS errors are reported through AER interrupts using the AER status:
>> correctbale internal errors (CIE) and AER uncorrectable internal errors
> 
> correctable
> 

Thanks.

>> (UIE).[1] But, the AER CIE/UIE are disabled by default preventing
>> notification of CXL RAS errors.[2]
>>
>> Enable CXL PCIe port RAS notification by unmasking the ports' AER CIE
>> and UIE errors.
>>
>> [1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and Upstream
>>              Switch Ports
>> [2] PCI6.0 - 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h),
>>              7.8.4.6 Correctable Error Mask Register (Offset 14h)
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> 
> I'm not sure doing this from a driver other than the one handling the
> errors makes sense.  It is doing a couple of RMW without any locking
> or guarantees that the driver bound to the PCI port might care about
> this changing.
> 

I think this could fit into the helper function mentioned in our earlier 
discussion. When the portdrv's notifier enabler is called it could also
enable the UIE/CIE.

> I'd like more info on why we don't just turn this on in general
> and hence avoid the need to control it from the 'wrong' place.
> 
> Jonathan
> 

I was trying to enable it only where needed, given that one case is not
yet a pattern. At this point it is needed only for the CXL RCH downstream
port and CXL VH ports (portdrv).

Would you like the UIE/CIE unmask added to the AER driver init?

> 
> 
>> ---
>>  drivers/cxl/core/pci.c | 12 ++++++++++++
>>  1 file changed, 12 insertions(+)
>>
>> diff --git a/drivers/cxl/core/pci.c b/drivers/cxl/core/pci.c
>> index e630eccb733d..73637d39df0a 100644
>> --- a/drivers/cxl/core/pci.c
>> +++ b/drivers/cxl/core/pci.c
>> @@ -861,6 +861,12 @@ void cxl_setup_parent_uport(struct device *host, struct cxl_port *port)
>>  	struct device *uport_dev = port->uport_dev;
>>  
>>  	cxl_port_map_regs(uport_dev, map, regs);
>> +
>> +	if (dev_is_pci(uport_dev)) {
>> +		struct pci_dev *pdev = to_pci_dev(uport_dev);
>> +
>> +		pci_aer_unmask_internal_errors(pdev);
> 
> I'd skip the local variable for conciseness.
> 
>> +	}
>>  }
>>  EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_uport, CXL);
>>  
>> @@ -878,6 +884,12 @@ void cxl_setup_parent_dport(struct device *host, struct cxl_dport *dport)
>>  
>>  	if (dport->rch)
>>  		cxl_disable_rch_root_ints(dport);
>> +
>> +	if (dev_is_pci(dport_dev)) {
>> +		struct pci_dev *pdev = to_pci_dev(dport_dev);
>> +
>> +		pci_aer_unmask_internal_errors(pdev);
> 
> likewise.
> 

Got it.

Regards,
Terry

>> +	}
>>  }
>>  EXPORT_SYMBOL_NS_GPL(cxl_setup_parent_dport, CXL);
>>  
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 9/9] cxl/pci: Enable interrupts for CXL PCIe ports' AER internal errors
  2024-06-24 16:46     ` Terry Bowman
@ 2024-07-02 16:00       ` Jonathan Cameron
  0 siblings, 0 replies; 59+ messages in thread
From: Jonathan Cameron @ 2024-07-02 16:00 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter

On Mon, 24 Jun 2024 11:46:01 -0500
Terry Bowman <Terry.Bowman@amd.com> wrote:

> Hi Jonathan,
> 
> I added responses inline below.
> 
> On 6/20/24 08:15, Jonathan Cameron wrote:
> > On Mon, 17 Jun 2024 15:04:11 -0500
> > Terry Bowman <terry.bowman@amd.com> wrote:
> >   
> >> CXL RAS errors are reported through AER interrupts using the AER status:
> >> correctbale internal errors (CIE) and AER uncorrectable internal errors  
> > 
> > correctable
> >   
> 
> Thanks.
> 
> >> (UIE).[1] But, the AER CIE/UIE are disabled by default preventing
> >> notification of CXL RAS errors.[2]
> >>
> >> Enable CXL PCIe port RAS notification by unmasking the ports' AER CIE
> >> and UIE errors.
> >>
> >> [1] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and Upstream
> >>              Switch Ports
> >> [2] PCI6.0 - 7.8.4.3 Uncorrectable Error Mask Register (Offset 08h),
> >>              7.8.4.6 Correctable Error Mask Register (Offset 14h)
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>  
> > 
> > I'm not sure doing this from a driver other than the one handling the
> > errors makes sense.  It is doing a couple of RMW without any locking
> > or guarantees that the driver bound to the PCI port might care about
> > this changing.
> >   
> 
> I think this could fit into the helper function mentioned in our earlier 
> discussion. When the portdrv's notifier enabler is called it could also
> enable the UIE/CIE.
> 
> > I'd like more info on why we don't just turn this on in general
> > and hence avoid the need to control it from the 'wrong' place.
> > 
> > Jonathan
> >   
> 
> I was trying to enable only where needed given the one case is not a 
> pattern, yet. At this point it is only for CXL RCH downstream port 
> and CXL VH ports (portdrv).
> 
> Would you like for the UIE/CIE unmask added to the AER driver init ?

If we can get away with it, yes!



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
  2024-06-17 20:04 [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports Terry Bowman
                   ` (8 preceding siblings ...)
  2024-06-17 20:04 ` [RFC PATCH 9/9] cxl/pci: Enable interrupts for CXL PCIe ports' AER internal errors Terry Bowman
@ 2024-06-21 19:04 ` Dan Williams
  2024-06-24 17:47   ` Terry Bowman
  2024-07-25 18:49 ` fan
  10 siblings, 1 reply; 59+ messages in thread
From: Dan Williams @ 2024-06-21 19:04 UTC (permalink / raw)
  To: Terry Bowman, dan.j.williams, ira.weiny, dave, dave.jiang,
	alison.schofield, ming4.li, vishal.l.verma, jim.harris,
	ilpo.jarvinen, ardb, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, Yazen.Ghannam, Robert.Richter

Terry Bowman wrote:
> This patchset provides RAS logging for CXL root ports, CXL downstream
> switch ports, and CXL upstream switch ports. This includes changes to
> use a portdrv notifier chain to communicate CXL AER/RAS errors to a
> cxl_pci callback.
> 
> The first 3 patches prepare for and add an atomic notifier chain to the
> portdrv driver. The portdrv's notifier chain reports the port device's
> AER internal errors to the registered callback(s). The preparation changes
> include a portdrv update to call the uncorrectable handler for PCIe root
> ports and PCIe downstream switch ports. Also, the AER correctable error
> (CE) status is made available to the AER CE handler.
> 
> The next 4 patches are in preparation for adding an atomic notification
> callback in the cxl_pci driver. This is for receiving AER internal error
> events from the portdrv notifier chain. Preparation includes adding RAS
> register block mapping, adding trace functions for logging, and
> refactoring cxl_pci RAS functions for reuse.
> 
> The final 2 patches enable the AER internal error interrupts.
[..] 
> 
> Solutions Considered (1-4):
>   Below are solutions that were considered. Solution #4 is
>   implemented in this patchset. 
[..]
>  2.) Update the AER driver to call cxl_pci driver's error handler before
>  calling pci_aer_handle_error()
>
>  This is similar to the existing RCH port error approach in aer.c.
>  In this solution the AER driver searches for a downstream CXL endpoint
>  to 'handle' detected CXL port protocol errors.
>
>  This is a good solution to consider if the one presented in this patchset
>  is not acceptable. I was initially reluctant to this approach because it
>  adds more CXL coupling to the AER driver. But, I think this solution
>  would technically work. I believe Ming was working towards this
>  solution.

I feel like the coupling is warranted because these things *are* PCIe
and CXL ports, but it means solving the interrupt distribution problem.

>   3.) Refactor portdrv
>   The portdrv refactoring solution is to change the portdrv service drivers
>   into PCIe auxiliary drivers. With this change the facility drivers can be
>   associated with a PCIe driver instead fixed bound to the portdrv driver.
> 
>   In this case the CXL port functionality would be added either as a CXL
>   auxiliary driver or as a CXL specific port driver
>   (PCI_CLASS_BRIDGE_PCI_NORMAL).
> 
>   This solution has challenges in the interrupt allocation by separate
>   auxiliary drivers and in binding of a specific driver. Binding is
>   currently based on PCIe class and would require extending the binding
>   logic to support multiple drivers for the same class.
> 
>   Jonathan Cameron is working towards this solution by initially solving
>   for the PMU service driver.[1] It is using the auxiliary bus to associate
>   what were service drivers with the portdrv driver. Using a CXL auxiliary
>   for handling CXL port RAS errors would result in RAS logic called from
>   the cxl_pci and CXL auxiliary drivers. This may need a library driver.

I don't think auxiliary bus is a fundamental step forward from pcie
portdrv, it's just a s/pcie_port_bus_type/auxiliary_bus_type/ rename,
but with all the same problems around how to distribute interrupt
services to different interested parties.

So I think notifiers are interesting from the perspective of a software
hack to enable interrupt distribution. However, given that dynamic MSI-X
support is within reach I am interested in exploring that path and
mandating that archs that want to handle CXL protocol errors natively
need to enable dynamic MSI-X. Otherwise, those platforms should disclaim
native protocol error handling support via CXL _OSC.

In other words, I expect native dynamic MSI-X support is more
maintainable in the sense of keeping all the code in one notification
domain.

>   4.) Using a portdrv notifier chain/callback for CIE/UIE
>   (Implemented in this patchset)
> 
>   This solution uses a portdrv atomic chain notifier and a cxl_pci
>   callback to handle and log CXL port RAS errors.

Oh, I will need to look at the cxl_pci tie-in for this. I was
expecting cxl_pci to get involved only in the RCH case, because the port
and the endpoint are one and the same object. In the VH case I would only
expect cxl_pci to get involved for its own observed protocol errors, not
those reported upstream from that endpoint.
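
As a rough picture of the notifier arrangement in solution 4, here is
a userspace sketch. It is not the kernel's notifier implementation
(no priorities, no RCU or locking), and all fake_* names are
hypothetical. One side owns the chain and publishes error events; the
other registers a callback to log them.

```c
#include <stddef.h>

#define FAKE_NOTIFY_DONE 0	/* keep walking the chain */
#define FAKE_NOTIFY_STOP 1	/* stop after this callback */

/* Hypothetical notifier block: a callback plus a link to the next one */
struct fake_notifier {
	int (*call)(struct fake_notifier *nb, unsigned long event, void *data);
	struct fake_notifier *next;
};

/* Chain head, owned by the publisher (portdrv in the patchset) */
static struct fake_notifier *fake_chain;

/* Add a callback at the head of the chain */
static void fake_chain_register(struct fake_notifier *nb)
{
	nb->next = fake_chain;
	fake_chain = nb;
}

/* Invoke each callback in turn; return how many were called */
static int fake_chain_call(unsigned long event, void *data)
{
	int called = 0;

	for (struct fake_notifier *nb = fake_chain; nb; nb = nb->next) {
		called++;
		if (nb->call(nb, event, data) == FAKE_NOTIFY_STOP)
			break;
	}
	return called;
}

/* Subscriber side (cxl_pci in the patchset): count the events seen */
static int fake_cxl_events_logged;

static int fake_cxl_ras_cb(struct fake_notifier *nb, unsigned long event,
			   void *data)
{
	(void)nb;
	(void)event;
	(void)data;
	fake_cxl_events_logged++;
	return FAKE_NOTIFY_DONE;
}
```

The publisher never needs to know who is listening, which is also why
the coupling between the port's driver and the endpoint driver is easy
to overlook.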

>   I chose this after trying solution#1 above. I see a couple advantages to
>   this solution are:
>   - Is general port implementation for CIE/UIE specific handling mentioned
>   in the PCIe spec.[2]
>   - Notifier is used in RAS MCE driver as an existing example.
>   - Does not introduce further CXL dependencies into the AER driver.
>   - The notifier chain provides registration/unregistration and
>   synchronization.
> 
>   A disadvantage of this approach is coupling still exists between the CXL
>   port's driver (portdrv) and the cxl_pci driver. The CXL port device's RAS
>   is handled by a notifier callback in the cxl_pci endpoint driver.
> 
>   Most of the patches in this patchset could be reused to work with
>   solution#3 or solution#2. The atomic notifier could be dropped and
>   instead use an auxiliary device or AER driver awareness. The other
>   changes in this patchset could possibly be reused.

I appreciate the discussion of tradeoffs, thanks Terry!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
  2024-06-21 19:04 ` [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports Dan Williams
@ 2024-06-24 17:47   ` Terry Bowman
  2024-06-24 20:51     ` Dan Williams
  0 siblings, 1 reply; 59+ messages in thread
From: Terry Bowman @ 2024-06-24 17:47 UTC (permalink / raw)
  To: Dan Williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter

Hi Dan,

I added responses below.

On 6/21/24 14:04, Dan Williams wrote:
> Terry Bowman wrote:
>> This patchset provides RAS logging for CXL root ports, CXL downstream
>> switch ports, and CXL upstream switch ports. This includes changes to
>> use a portdrv notifier chain to communicate CXL AER/RAS errors to a
>> cxl_pci callback.
>>
>> The first 3 patches prepare for and add an atomic notifier chain to the
>> portdrv driver. The portdrv's notifier chain reports the port device's
>> AER internal errors to the registered callback(s). The preparation changes
>> include a portdrv update to call the uncorrectable handler for PCIe root
>> ports and PCIe downstream switch ports. Also, the AER correctable error
>> (CE) status is made available to the AER CE handler.
>>
>> The next 4 patches are in preparation for adding an atomic notification
>> callback in the cxl_pci driver. This is for receiving AER internal error
>> events from the portdrv notifier chain. Preparation includes adding RAS
>> register block mapping, adding trace functions for logging, and
>> refactoring cxl_pci RAS functions for reuse.
>>
>> The final 2 patches enable the AER internal error interrupts.
> [..] 
>>
>> Solutions Considered (1-4):
>>   Below are solutions that were considered. Solution #4 is
>>   implemented in this patchset. 
> [..]
>>  2.) Update the AER driver to call cxl_pci driver's error handler before
>>  calling pci_aer_handle_error()
>>
>>  This is similar to the existing RCH port error approach in aer.c.
>>  In this solution the AER driver searches for a downstream CXL endpoint
>>  to 'handle' detected CXL port protocol errors.
>>
>>  This is a good solution to consider if the one presented in this patchset
>>  is not acceptable. I was initially reluctant to this approach because it
>>  adds more CXL coupling to the AER driver. But, I think this solution
>>  would technically work. I believe Ming was working towards this
>>  solution.
> 
> I feel like the coupling is warranted because these things *are* PCIe
> and CXL ports, but it means solving the interrupt distribution problem.
> 

I understand the service driver interrupt issue, but it is not clear how it
applies to CXL port error handling. Can you help me understand how the
interrupt issue affects CXL port AER UIE/CIE handling in the AER driver?


>>   3.) Refactor portdrv
>>   The portdrv refactoring solution is to change the portdrv service drivers
>>   into PCIe auxiliary drivers. With this change the facility drivers can be
>>   associated with a PCIe driver instead fixed bound to the portdrv driver.
>>
>>   In this case the CXL port functionality would be added either as a CXL
>>   auxiliary driver or as a CXL specific port driver
>>   (PCI_CLASS_BRIDGE_PCI_NORMAL).
>>
>>   This solution has challenges in the interrupt allocation by separate
>>   auxiliary drivers and in binding of a specific driver. Binding is
>>   currently based on PCIe class and would require extending the binding
>>   logic to support multiple drivers for the same class.
>>
>>   Jonathan Cameron is working towards this solution by initially solving
>>   for the PMU service driver.[1] It is using the auxiliary bus to associate
>>   what were service drivers with the portdrv driver. Using a CXL auxiliary
>>   for handling CXL port RAS errors would result in RAS logic called from
>>   the cxl_pci and CXL auxiliary drivers. This may need a library driver.
> 
> I don't think auxiliary bus is a fundamental step forward from pcie
> portdrv, it's just a s/pcie_port_bus_type/auxiliary_bus_type/ rename,
> but with all the same problems around how to distribute interrupt
> services to different interested parties.
> 
> So I think notifiers are interesting from the perspective of a software
> hack to enable interrupt distribution. However, given that dynamic MSI-X
> support is within reach I am interested in exploring that path and
> mandating that archs that want to handle CXL protocol errors natively
> need to enable dynamic MSI-X. Otherwise, those platforms should disclaim
> native protocol error handling support via CXL _OSC.
> 
> In other words, I expect native dynamic MSI-X support is more
> maintainable in the sense of keeping all the code in one notification
> domain.
> 
>>   4.) Using a portdrv notifier chain/callback for CIE/UIE
>>   (Implemented in this patchset)
>>
>>   This solution uses a portdrv atomic chain notifier and a cxl_pci
>>   callback to handle and log CXL port RAS errors.
> 
> Oh, I will need to look that the cxl_pci tie in for this, I was
> expecting cxl_pci only gets involved in the RCH case because the port
> and the endpoint are one in the same object. in the VH case I would only
> expect cxl_pci to get involved for its own observed protocol errors, not
> those reported upstream from that endpoint.
> 

The CXL port error handling needs a place to live, and there are few options
at the moment. Where do you want the CXL port error handlers to reside?

Regards,
Terry

>>   I chose this after trying solution#1 above. I see a couple advantages to
>>   this solution are:
>>   - Is general port implementation for CIE/UIE specific handling mentioned
>>   in the PCIe spec.[2]
>>   - Notifier is used in RAS MCE driver as an existing example.
>>   - Does not introduce further CXL dependencies into the AER driver.
>>   - The notifier chain provides registration/unregistration and
>>   synchronization.
>>
>>   A disadvantage of this approach is coupling still exists between the CXL
>>   port's driver (portdrv) and the cxl_pci driver. The CXL port device's RAS
>>   is handled by a notifier callback in the cxl_pci endpoint driver.
>>
>>   Most of the patches in this patchset could be reused to work with
>>   solution#3 or solution#2. The atomic notifier could be dropped and
>>   instead use an auxiliary device or AER driver awareness. The other
>>   changes in this patchset could possibly be reused.
> 
> I appreciate the discussion of tradeoffs, thanks Terry!

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
  2024-06-24 17:47   ` Terry Bowman
@ 2024-06-24 20:51     ` Dan Williams
  2024-06-25 14:29       ` Terry Bowman
  0 siblings, 1 reply; 59+ messages in thread
From: Dan Williams @ 2024-06-24 20:51 UTC (permalink / raw)
  To: Terry Bowman, Dan Williams, ira.weiny, dave, dave.jiang,
	alison.schofield, ming4.li, vishal.l.verma, jim.harris,
	ilpo.jarvinen, ardb, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, Yazen.Ghannam, Robert.Richter

Terry Bowman wrote:
> Hi Dan,
> 
> I added responses below.
> 
> On 6/21/24 14:04, Dan Williams wrote:
> > Terry Bowman wrote:
> >> This patchset provides RAS logging for CXL root ports, CXL downstream
> >> switch ports, and CXL upstream switch ports. This includes changes to
> >> use a portdrv notifier chain to communicate CXL AER/RAS errors to a
> >> cxl_pci callback.
> >>
> >> The first 3 patches prepare for and add an atomic notifier chain to the
> >> portdrv driver. The portdrv's notifier chain reports the port device's
> >> AER internal errors to the registered callback(s). The preparation changes
> >> include a portdrv update to call the uncorrectable handler for PCIe root
> >> ports and PCIe downstream switch ports. Also, the AER correctable error
> >> (CE) status is made available to the AER CE handler.
> >>
> >> The next 4 patches are in preparation for adding an atomic notification
> >> callback in the cxl_pci driver. This is for receiving AER internal error
> >> events from the portdrv notifier chain. Preparation includes adding RAS
> >> register block mapping, adding trace functions for logging, and
> >> refactoring cxl_pci RAS functions for reuse.
> >>
> >> The final 2 patches enable the AER internal error interrupts.
> > [..] 
> >>
> >> Solutions Considered (1-4):
> >>   Below are solutions that were considered. Solution #4 is
> >>   implemented in this patchset. 
> > [..]
> >>  2.) Update the AER driver to call cxl_pci driver's error handler before
> >>  calling pci_aer_handle_error()
> >>
> >>  This is similar to the existing RCH port error approach in aer.c.
> >>  In this solution the AER driver searches for a downstream CXL endpoint
> >>  to 'handle' detected CXL port protocol errors.
> >>
> >>  This is a good solution to consider if the one presented in this patchset
> >>  is not acceptable. I was initially reluctant to this approach because it
> >>  adds more CXL coupling to the AER driver. But, I think this solution
> >>  would technically work. I believe Ming was working towards this
> >>  solution.
> > 
> > I feel like the coupling is warranted because these things *are* PCIe
> > and CXL ports, but it means solving the interrupt distribution problem.
> > 
> 
> I understand the service driver interrupt issue but it is not clear how it 
> applies to the CXL port error handling. Can you help me understand how the 
> interrupt issue affects CXL port AER UIE/CIE handling in the AER driver?

Just the case of the AER MSI/-X vector being multiplexed with other CXL
functionality on the same device. If the CXL interrupt vector is to be
enabled later then it means MSI/-X vector enabling needs to be dynamic.

...but yeah, not a problem now as we are only talking about PCIe AER
events and not multiplexing yet. I.e. that problem can be solved later.

> 
> 
> >>   3.) Refactor portdrv
> >>   The portdrv refactoring solution is to change the portdrv service drivers
> >>   into PCIe auxiliary drivers. With this change the facility drivers can be
> >>   associated with a PCIe driver instead of being bound to the portdrv driver.
> >>
> >>   In this case the CXL port functionality would be added either as a CXL
> >>   auxiliary driver or as a CXL specific port driver
> >>   (PCI_CLASS_BRIDGE_PCI_NORMAL).
> >>
> >>   This solution has challenges in the interrupt allocation by separate
> >>   auxiliary drivers and in binding of a specific driver. Binding is
> >>   currently based on PCIe class and would require extending the binding
> >>   logic to support multiple drivers for the same class.
> >>
> >>   Jonathan Cameron is working towards this solution by initially solving
> >>   for the PMU service driver.[1] It is using the auxiliary bus to associate
> >>   what were service drivers with the portdrv driver. Using a CXL auxiliary
> >>   for handling CXL port RAS errors would result in RAS logic called from
> >>   the cxl_pci and CXL auxiliary drivers. This may need a library driver.
> > 
> > I don't think auxiliary bus is a fundamental step forward from pcie
> > portdrv, it's just a s/pcie_port_bus_type/auxiliary_bus_type/ rename,
> > but with all the same problems around how to distribute interrupt
> > services to different interested parties.
> > 
> > So I think notifiers are interesting from the perspective of a software
> > hack to enable interrupt distribution. However, given that dynamic MSI-X
> > support is within reach I am interested in exploring that path and
> > mandating that archs that want to handle CXL protocol errors natively
> > need to enable dynamic MSI-X. Otherwise, those platforms should disclaim
> > native protocol error handling support via CXL _OSC.
> > 
> > In other words, I expect native dynamic MSI-X support is more
> > maintainable in the sense of keeping all the code in one notification
> > domain.
> > 
> >>   4.) Using a portdrv notifier chain/callback for CIE/UIE
> >>   (Implemented in this patchset)
> >>
> >>   This solution uses a portdrv atomic chain notifier and a cxl_pci
> >>   callback to handle and log CXL port RAS errors.
> > 
> > Oh, I will need to look at the cxl_pci tie-in for this. I was
> > expecting cxl_pci only gets involved in the RCH case because the port
> > and the endpoint are one and the same object. In the VH case I would only
> > expect cxl_pci to get involved for its own observed protocol errors, not
> > those reported upstream from that endpoint.
> > 
> 
> The CXL port error handling needs a place to live with few options at the moment.
> Where do you want the CXL port error handlers to reside? 

I need to go understand exactly why cxl_pci is involved in this current
proposal, but I was thinking it is probably more natural for cxl_port to
have error handlers.
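
For readers following the thread, the notifier-chain arrangement from
solution #4, with the callback owned by a port-level handler as suggested
above, can be modeled in userspace roughly as below. This is only an
illustrative sketch and is not code from the patchset: every name in it
(err_chain, err_notifier, port_ras_notify, port_err_stats, the event
codes) is hypothetical, and it omits the locking/RCU synchronization that
the kernel's atomic notifier chains provide.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event codes standing in for AER internal errors. */
enum port_err_event { PORT_ERR_CIE = 1, PORT_ERR_UIE = 2 };

#define ERR_NOTIFY_DONE 0	/* keep walking the chain */
#define ERR_NOTIFY_STOP 1	/* stop after this callback */

/* One registered callback; modeled loosely on a kernel notifier block. */
struct err_notifier {
	int (*call)(struct err_notifier *nb, enum port_err_event ev, void *data);
	int priority;			/* higher priority runs first */
	struct err_notifier *next;
};

struct err_chain {
	struct err_notifier *head;
};

/* Insert in descending priority order. No locking in this model. */
static void err_chain_register(struct err_chain *chain, struct err_notifier *nb)
{
	struct err_notifier **pos = &chain->head;

	assert(nb && nb->call);
	while (*pos && (*pos)->priority >= nb->priority)
		pos = &(*pos)->next;
	nb->next = *pos;
	*pos = nb;
}

/* Returns 0 on success, -1 if the notifier was not registered. */
static int err_chain_unregister(struct err_chain *chain, struct err_notifier *nb)
{
	struct err_notifier **pos;

	for (pos = &chain->head; *pos; pos = &(*pos)->next) {
		if (*pos == nb) {
			*pos = nb->next;
			nb->next = NULL;
			return 0;
		}
	}
	return -1;
}

/* Invoke callbacks in order; returns how many callbacks were invoked. */
static int err_chain_call(struct err_chain *chain, enum port_err_event ev,
			  void *data)
{
	struct err_notifier *nb, *next;
	int invoked = 0;

	for (nb = chain->head; nb; nb = next) {
		next = nb->next;
		invoked++;
		if (nb->call(nb, ev, data) == ERR_NOTIFY_STOP)
			break;
	}
	return invoked;
}

/* Hypothetical port-level RAS handler: it only counts events. */
struct port_err_stats {
	unsigned int ce;
	unsigned int uce;
};

static int port_ras_notify(struct err_notifier *nb, enum port_err_event ev,
			   void *data)
{
	struct port_err_stats *stats = data;

	(void)nb;
	if (ev == PORT_ERR_CIE)
		stats->ce++;
	else if (ev == PORT_ERR_UIE)
		stats->uce++;
	return ERR_NOTIFY_DONE;
}
```

In this model the reporting side only knows about the chain, and whichever
module owns the handler registers and unregisters its own callback, so
moving the handler from one driver to another would not change the
reporting side.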


* Re: [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
  2024-06-24 20:51     ` Dan Williams
@ 2024-06-25 14:29       ` Terry Bowman
  0 siblings, 0 replies; 59+ messages in thread
From: Terry Bowman @ 2024-06-25 14:29 UTC (permalink / raw)
  To: Dan Williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter



On 6/24/24 15:51, Dan Williams wrote:
> Terry Bowman wrote:
>> Hi Dan,
>>
>> I added responses below.
>>
>> On 6/21/24 14:04, Dan Williams wrote:
>>> Terry Bowman wrote:
>>>> This patchset provides RAS logging for CXL root ports, CXL downstream
>>>> switch ports, and CXL upstream switch ports. This includes changes to
>>>> use a portdrv notifier chain to communicate CXL AER/RAS errors to a
>>>> cxl_pci callback.
>>>>
>>>> The first 3 patches prepare for and add an atomic notifier chain to the
>>>> portdrv driver. The portdrv's notifier chain reports the port device's
>>>> AER internal errors to the registered callback(s). The preparation changes
>>>> include a portdrv update to call the uncorrectable handler for PCIe root
>>>> ports and PCIe downstream switch ports. Also, the AER correctable error
>>>> (CE) status is made available to the AER CE handler.
>>>>
>>>> The next 4 patches are in preparation for adding an atomic notification
>>>> callback in the cxl_pci driver. This is for receiving AER internal error
>>>> events from the portdrv notifier chain. Preparation includes adding RAS
>>>> register block mapping, adding trace functions for logging, and
>>>> refactoring cxl_pci RAS functions for reuse.
>>>>
>>>> The final 2 patches enable the AER internal error interrupts.
>>> [..] 
>>>>
>>>> Solutions Considered (1-4):
>>>>   Below are solutions that were considered. Solution #4 is
>>>>   implemented in this patchset. 
>>> [..]
>>>>  2.) Update the AER driver to call cxl_pci driver's error handler before
>>>>  calling pci_aer_handle_error()
>>>>
>>>>  This is similar to the existing RCH port error approach in aer.c.
>>>>  In this solution the AER driver searches for a downstream CXL endpoint
>>>>  to 'handle' detected CXL port protocol errors.
>>>>
>>>>  This is a good solution to consider if the one presented in this patchset
>>>>  is not acceptable. I was initially reluctant to this approach because it
>>>>  adds more CXL coupling to the AER driver. But, I think this solution
>>>>  would technically work. I believe Ming was working towards this
>>>>  solution.
>>>
>>> I feel like the coupling is warranted because these things *are* PCIe
>>> and CXL ports, but it means solving the interrupt distribution problem.
>>>
>>
>> I understand the service driver interrupt issue but it is not clear how it 
>> applies to the CXL port error handling. Can you help me understand how the 
>> interrupt issue affects CXL port AER UIE/CIE handling in the AER driver?
> 
> Just the case of the AER MSI/-X vector being multiplexed with other CXL
> functionality on the same device. If the CXL interrupt vector is to be
> enabled later then it means MSI/-X vector enabling needs to be dynamic.
> 
> ...but yeah, not a problem now as we are only talking about PCIe AER
> events and not multiplexing yet. I.e. that problem can be solved later.
> 
>>
>>
>>>>   3.) Refactor portdrv
>>>>   The portdrv refactoring solution is to change the portdrv service drivers
>>>>   into PCIe auxiliary drivers. With this change the facility drivers can be
>>>>   associated with a PCIe driver instead of being bound to the portdrv driver.
>>>>
>>>>   In this case the CXL port functionality would be added either as a CXL
>>>>   auxiliary driver or as a CXL specific port driver
>>>>   (PCI_CLASS_BRIDGE_PCI_NORMAL).
>>>>
>>>>   This solution has challenges in the interrupt allocation by separate
>>>>   auxiliary drivers and in binding of a specific driver. Binding is
>>>>   currently based on PCIe class and would require extending the binding
>>>>   logic to support multiple drivers for the same class.
>>>>
>>>>   Jonathan Cameron is working towards this solution by initially solving
>>>>   for the PMU service driver.[1] It is using the auxiliary bus to associate
>>>>   what were service drivers with the portdrv driver. Using a CXL auxiliary
>>>>   for handling CXL port RAS errors would result in RAS logic called from
>>>>   the cxl_pci and CXL auxiliary drivers. This may need a library driver.
>>>
>>> I don't think auxiliary bus is a fundamental step forward from pcie
>>> portdrv, it's just a s/pcie_port_bus_type/auxiliary_bus_type/ rename,
>>> but with all the same problems around how to distribute interrupt
>>> services to different interested parties.
>>>
>>> So I think notifiers are interesting from the perspective of a software
>>> hack to enable interrupt distribution. However, given that dynamic MSI-X
>>> support is within reach I am interested in exploring that path and
>>> mandating that archs that want to handle CXL protocol errors natively
>>> need to enable dynamic MSI-X. Otherwise, those platforms should disclaim
>>> native protocol error handling support via CXL _OSC.
>>>
>>> In other words, I expect native dynamic MSI-X support is more
>>> maintainable in the sense of keeping all the code in one notification
>>> domain.
>>>
>>>>   4.) Using a portdrv notifier chain/callback for CIE/UIE
>>>>   (Implemented in this patchset)
>>>>
>>>>   This solution uses a portdrv atomic chain notifier and a cxl_pci
>>>>   callback to handle and log CXL port RAS errors.
>>>
>>> Oh, I will need to look at the cxl_pci tie-in for this. I was
>>> expecting cxl_pci only gets involved in the RCH case because the port
>>> and the endpoint are one and the same object. In the VH case I would only
>>> expect cxl_pci to get involved for its own observed protocol errors, not
>>> those reported upstream from that endpoint.
>>>
>>
>> The CXL port error handling needs a place to live with few options at the moment.
>> Where do you want the CXL port error handlers to reside? 
> 
> I need to go understand exactly why cxl_pci is involved in this current
> proposal, but I was thinking it is probably more natural for cxl_port to
> have error handlers.

Ok. I agree, cxl_port is a better location for the handlers.

Regards,
Terry


* Re: [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
  2024-06-17 20:04 [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports Terry Bowman
                   ` (9 preceding siblings ...)
  2024-06-21 19:04 ` [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports Dan Williams
@ 2024-07-25 18:49 ` fan
  2024-08-19 16:21   ` Terry Bowman
  10 siblings, 1 reply; 59+ messages in thread
From: fan @ 2024-07-25 18:49 UTC (permalink / raw)
  To: Terry Bowman
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter, a.manzanares

On Mon, Jun 17, 2024 at 03:04:02PM -0500, Terry Bowman wrote:
> This patchset provides RAS logging for CXL root ports, CXL downstream
> switch ports, and CXL upstream switch ports. This includes changes to
> use a portdrv notifier chain to communicate CXL AER/RAS errors to a
> cxl_pci callback.
> 
> The first 3 patches prepare for and add an atomic notifier chain to the
> portdrv driver. The portdrv's notifier chain reports the port device's
> AER internal errors to the registered callback(s). The preparation changes
> include a portdrv update to call the uncorrectable handler for PCIe root
> ports and PCIe downstream switch ports. Also, the AER correctable error
> (CE) status is made available to the AER CE handler.
> 
> The next 4 patches are in preparation for adding an atomic notification
> callback in the cxl_pci driver. This is for receiving AER internal error
> events from the portdrv notifier chain. Preparation includes adding RAS
> register block mapping, adding trace functions for logging, and
> refactoring cxl_pci RAS functions for reuse.
> 
> The final 2 patches enable the AER internal error interrupts.
> 
> Testing RAS CE/UCE:
>   QEMU was used for testing CXL root port, CXL downstream switch port, and
>   CXL upstream switch port. The aer-inject tool was used to inject AER and
>   a test patch was used to set the AER CIE/UIE and RAS CE/UCE status during
>   testing. Testing passed with no issues.

Hi Terry,

Could you share a little more about the QEMU test setup?
From what I can see, QEMU can currently only inject errors into
Type 3 devices. Is that true? If so, how do you inject errors for port
devices? Do we need a hack there?

Also, is the aer-inject tool you mentioned the one currently in the kernel
or something else?
https://elixir.bootlin.com/linux/v6.10-rc6/source/drivers/pci/pcie/aer_inject.c

Thanks,
Fan


>  
>   An AMD platform with the AMD RAS error injection tool was used for
>   testing CXL root port injection. Testing passed with no issues.
> 
>   TODO - regression test CXL1.1 RCH handling.
> 
> Solutions Considered (1-4):
>   Below are solutions that were considered. Solution #4 is
>   implemented in this patchset. 
> 
>   1.) Reassigning portdrv error handler for CXL port devices
>   
>   This solution was based on reassigning the portdrv's CE/UCE err_handler
>   to be CXL cxl_pci driver functions.
>   
>   I started with this solution and once the flow was working I realized
>   the endpoint removal would have to be addressed as well. While this
>   could be resolved it does highlight the odd coupling and dependency
>   between the CXL port devices error handling with cxl_pci endpoint's
>   handlers. Also, the err_handler re-assignment at runtime required
>   ignoring the 'const' definition. I don't believe this should be
>   considered as a possible solution.
>   
>   2.) Update the AER driver to call cxl_pci driver's error handler before
>   calling pci_aer_handle_error()
> 
>   This is similar to the existing RCH port error approach in aer.c.
>   In this solution the AER driver searches for a downstream CXL endpoint
>   to 'handle' detected CXL port protocol errors.
> 
>   This is a good solution to consider if the one presented in this patchset
>   is not acceptable. I was initially reluctant to this approach because it
>   adds more CXL coupling to the AER driver. But, I think this solution
>   would technically work. I believe Ming was working towards this
>   solution.
> 
>   3.) Refactor portdrv
>   The portdrv refactoring solution is to change the portdrv service drivers
>   into PCIe auxiliary drivers. With this change the facility drivers can be
>   associated with a PCIe driver instead of being bound to the portdrv driver.
> 
>   In this case the CXL port functionality would be added either as a CXL
>   auxiliary driver or as a CXL specific port driver
>   (PCI_CLASS_BRIDGE_PCI_NORMAL).
> 
>   This solution has challenges in the interrupt allocation by separate
>   auxiliary drivers and in binding of a specific driver. Binding is
>   currently based on PCIe class and would require extending the binding
>   logic to support multiple drivers for the same class.
> 
>   Jonathan Cameron is working towards this solution by initially solving
>   for the PMU service driver.[1] It is using the auxiliary bus to associate
>   what were service drivers with the portdrv driver. Using a CXL auxiliary
>   for handling CXL port RAS errors would result in RAS logic called from
>   the cxl_pci and CXL auxiliary drivers. This may need a library driver.
> 
>   4.) Using a portdrv notifier chain/callback for CIE/UIE
>   (Implemented in this patchset)
> 
>   This solution uses a portdrv atomic chain notifier and a cxl_pci
>   callback to handle and log CXL port RAS errors.
>   
>   I chose this after trying solution#1 above. The advantages I see in
>   this solution are:
>   - Is a general port implementation of the CIE/UIE-specific handling
>   mentioned in the PCIe spec.[2]
>   - Notifier is used in RAS MCE driver as an existing example.
>   - Does not introduce further CXL dependencies into the AER driver.
>   - The notifier chain provides registration/unregistration and
>   synchronization.
> 
>   A disadvantage of this approach is coupling still exists between the CXL
>   port's driver (portdrv) and the cxl_pci driver. The CXL port device's RAS
>   is handled by a notifier callback in the cxl_pci endpoint driver.
> 
>   Most of the patches in this patchset could be reused to work with
>   solution#3 or solution#2. The atomic notifier could be dropped and
>   instead use an auxiliary device or AER driver awareness. The other
>   changes in this patchset could possibly be reused.
> 
>   [1] Kernel.org -
>   https://lore.kernel.org/all/f4b23710-059a-51b7-9d27-b62e8b358b54@linux.intel.com
>   [2] PCI6.0 - 6.2.10 Internal errors
> 
>  drivers/cxl/core/core.h    |   4 +
>  drivers/cxl/core/pci.c     | 153 ++++++++++++++++++++++++++++++++-----
>  drivers/cxl/core/port.c    |   6 +-
>  drivers/cxl/core/trace.h   |  34 +++++++++
>  drivers/cxl/cxl.h          |  10 +++
>  drivers/cxl/cxlpci.h       |   2 +
>  drivers/cxl/mem.c          |  32 +++++++-
>  drivers/cxl/pci.c          |  19 ++++-
>  drivers/pci/pcie/aer.c     |  10 ++-
>  drivers/pci/pcie/err.c     |  20 +++++
>  drivers/pci/pcie/portdrv.c |  32 ++++++++
>  drivers/pci/pcie/portdrv.h |   2 +
>  include/linux/aer.h        |   6 ++
>  13 files changed, 303 insertions(+), 27 deletions(-)
> 
> 
> base-commit: ca3d4767c8054447ac2a58356080e299a59e05b8
> -- 
> 2.34.1
> 


* Re: [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
  2024-07-25 18:49 ` fan
@ 2024-08-19 16:21   ` Terry Bowman
  2024-08-19 18:17     ` Fan Ni
  0 siblings, 1 reply; 59+ messages in thread
From: Terry Bowman @ 2024-08-19 16:21 UTC (permalink / raw)
  To: fan
  Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
	ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
	sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
	Yazen.Ghannam, Robert.Richter, a.manzanares

Hi Fan

On 7/25/24 13:49, fan wrote:
> On Mon, Jun 17, 2024 at 03:04:02PM -0500, Terry Bowman wrote:
>> This patchset provides RAS logging for CXL root ports, CXL downstream
>> switch ports, and CXL upstream switch ports. This includes changes to
>> use a portdrv notifier chain to communicate CXL AER/RAS errors to a
>> cxl_pci callback.
>>
>> The first 3 patches prepare for and add an atomic notifier chain to the
>> portdrv driver. The portdrv's notifier chain reports the port device's
>> AER internal errors to the registered callback(s). The preparation changes
>> include a portdrv update to call the uncorrectable handler for PCIe root
>> ports and PCIe downstream switch ports. Also, the AER correctable error
>> (CE) status is made available to the AER CE handler.
>>
>> The next 4 patches are in preparation for adding an atomic notification
>> callback in the cxl_pci driver. This is for receiving AER internal error
>> events from the portdrv notifier chain. Preparation includes adding RAS
>> register block mapping, adding trace functions for logging, and
>> refactoring cxl_pci RAS functions for reuse.
>>
>> The final 2 patches enable the AER internal error interrupts.
>>
>> Testing RAS CE/UCE:
>>   QEMU was used for testing CXL root port, CXL downstream switch port, and
>>   CXL upstream switch port. The aer-inject tool was used to inject AER and
>>   a test patch was used to set the AER CIE/UIE and RAS CE/UCE status during
>>   testing. Testing passed with no issues.
> 
> Hi Terry,
> 
> Could you share a little more about the QEMU test setup?
> From what I can see, QEMU can currently only inject errors into
> Type 3 devices. Is that true? If so, how do you inject errors for port
> devices? Do we need a hack there?
> 
> Also, is the aer-inject tool you mentioned the one currently in the kernel
> or something else?
> https://elixir.bootlin.com/linux/v6.10-rc6/source/drivers/pci/pcie/aer_inject.c
> 
> Thanks,
> Fan
> 
Sorry for the late response.

I used AMD RAS injection for testing HW root ports.

I used QEMU and the legacy aer-inject userspace tool to test switch ports (USP/DSP).[1]
I added a couple of test patches to set the AER UIE/CIE status because the tool doesn't
support injecting the UIE or CIE bits. I also used a test patch to set the RAS status.

Regards,
Terry

[1] - https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git/about/
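
The test patches mentioned above are not included in this thread. As a
rough illustration of what "setting the AER UIE/CIE" amounts to, the
userspace sketch below ORs the internal-error bits into AER status words.
The two bit masks are believed to match the definitions in the kernel's
include/uapi/linux/pci_regs.h; the struct and function names are
hypothetical and are not taken from the patchset or from aer-inject.

```c
#include <assert.h>
#include <stdint.h>

/* Bit masks as defined in the kernel's include/uapi/linux/pci_regs.h. */
#define PCI_ERR_UNC_INTN	0x00400000u	/* Uncorrectable Internal Error */
#define PCI_ERR_COR_INTERNAL	0x00004000u	/* Corrected Internal Error */

/* Hypothetical stand-in for a port's AER status registers. */
struct aer_status {
	uint32_t uncor_status;
	uint32_t cor_status;
};

/* Set the UIE bit (uncorrectable != 0) or the CIE bit (uncorrectable == 0). */
static void test_set_internal_error(struct aer_status *st, int uncorrectable)
{
	assert(st);
	if (uncorrectable)
		st->uncor_status |= PCI_ERR_UNC_INTN;
	else
		st->cor_status |= PCI_ERR_COR_INTERNAL;
}

/* True if either internal-error bit is set; other AER bits are ignored. */
static int aer_is_internal_error(const struct aer_status *st)
{
	return (st->uncor_status & PCI_ERR_UNC_INTN) ||
	       (st->cor_status & PCI_ERR_COR_INTERNAL);
}
```

A check like aer_is_internal_error() is the kind of test a handler could
use to decide whether an AER event is an internal error worth forwarding
for CXL RAS logging; whether the patchset does exactly this is an
assumption here.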

> 
>>  
>>   An AMD platform with the AMD RAS error injection tool was used for
>>   testing CXL root port injection. Testing passed with no issues.
>>
>>   TODO - regression test CXL1.1 RCH handling.
>>
>> Solutions Considered (1-4):
>>   Below are solutions that were considered. Solution #4 is
>>   implemented in this patchset. 
>>
>>   1.) Reassigning portdrv error handler for CXL port devices
>>   
>>   This solution was based on reassigning the portdrv's CE/UCE err_handler
>>   to be CXL cxl_pci driver functions.
>>   
>>   I started with this solution and once the flow was working I realized
>>   the endpoint removal would have to be addressed as well. While this
>>   could be resolved it does highlight the odd coupling and dependency
>>   between the CXL port devices error handling with cxl_pci endpoint's
>>   handlers. Also, the err_handler re-assignment at runtime required
>>   ignoring the 'const' definition. I don't believe this should be
>>   considered as a possible solution.
>>   
>>   2.) Update the AER driver to call cxl_pci driver's error handler before
>>   calling pci_aer_handle_error()
>>
>>   This is similar to the existing RCH port error approach in aer.c.
>>   In this solution the AER driver searches for a downstream CXL endpoint
>>   to 'handle' detected CXL port protocol errors.
>>
>>   This is a good solution to consider if the one presented in this patchset
>>   is not acceptable. I was initially reluctant to this approach because it
>>   adds more CXL coupling to the AER driver. But, I think this solution
>>   would technically work. I believe Ming was working towards this
>>   solution.
>>
>>   3.) Refactor portdrv
>>   The portdrv refactoring solution is to change the portdrv service drivers
>>   into PCIe auxiliary drivers. With this change the facility drivers can be
>>   associated with a PCIe driver instead of being bound to the portdrv driver.
>>
>>   In this case the CXL port functionality would be added either as a CXL
>>   auxiliary driver or as a CXL specific port driver
>>   (PCI_CLASS_BRIDGE_PCI_NORMAL).
>>
>>   This solution has challenges in the interrupt allocation by separate
>>   auxiliary drivers and in binding of a specific driver. Binding is
>>   currently based on PCIe class and would require extending the binding
>>   logic to support multiple drivers for the same class.
>>
>>   Jonathan Cameron is working towards this solution by initially solving
>>   for the PMU service driver.[1] It is using the auxiliary bus to associate
>>   what were service drivers with the portdrv driver. Using a CXL auxiliary
>>   for handling CXL port RAS errors would result in RAS logic called from
>>   the cxl_pci and CXL auxiliary drivers. This may need a library driver.
>>
>>   4.) Using a portdrv notifier chain/callback for CIE/UIE
>>   (Implemented in this patchset)
>>
>>   This solution uses a portdrv atomic chain notifier and a cxl_pci
>>   callback to handle and log CXL port RAS errors.
>>   
>>   I chose this after trying solution#1 above. The advantages I see in
>>   this solution are:
>>   - Is a general port implementation of the CIE/UIE-specific handling
>>   mentioned in the PCIe spec.[2]
>>   - Notifier is used in RAS MCE driver as an existing example.
>>   - Does not introduce further CXL dependencies into the AER driver.
>>   - The notifier chain provides registration/unregistration and
>>   synchronization.
>>
>>   A disadvantage of this approach is coupling still exists between the CXL
>>   port's driver (portdrv) and the cxl_pci driver. The CXL port device's RAS
>>   is handled by a notifier callback in the cxl_pci endpoint driver.
>>
>>   Most of the patches in this patchset could be reused to work with
>>   solution#3 or solution#2. The atomic notifier could be dropped and
>>   instead use an auxiliary device or AER driver awareness. The other
>>   changes in this patchset could possibly be reused.
>>
>>   [1] Kernel.org -
>>   https://lore.kernel.org/all/f4b23710-059a-51b7-9d27-b62e8b358b54@linux.intel.com
>>   [2] PCI6.0 - 6.2.10 Internal errors
>>
>>  drivers/cxl/core/core.h    |   4 +
>>  drivers/cxl/core/pci.c     | 153 ++++++++++++++++++++++++++++++++-----
>>  drivers/cxl/core/port.c    |   6 +-
>>  drivers/cxl/core/trace.h   |  34 +++++++++
>>  drivers/cxl/cxl.h          |  10 +++
>>  drivers/cxl/cxlpci.h       |   2 +
>>  drivers/cxl/mem.c          |  32 +++++++-
>>  drivers/cxl/pci.c          |  19 ++++-
>>  drivers/pci/pcie/aer.c     |  10 ++-
>>  drivers/pci/pcie/err.c     |  20 +++++
>>  drivers/pci/pcie/portdrv.c |  32 ++++++++
>>  drivers/pci/pcie/portdrv.h |   2 +
>>  include/linux/aer.h        |   6 ++
>>  13 files changed, 303 insertions(+), 27 deletions(-)
>>
>>
>> base-commit: ca3d4767c8054447ac2a58356080e299a59e05b8
>> -- 
>> 2.34.1
>>


* Re: [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
  2024-08-19 16:21   ` Terry Bowman
@ 2024-08-19 18:17     ` Fan Ni
  0 siblings, 0 replies; 59+ messages in thread
From: Fan Ni @ 2024-08-19 18:17 UTC (permalink / raw)
  To: Terry Bowman
  Cc: fan, dan.j.williams, ira.weiny, dave, dave.jiang,
	alison.schofield, ming4.li, vishal.l.verma, jim.harris,
	ilpo.jarvinen, ardb, sathyanarayanan.kuppuswamy, linux-cxl,
	linux-kernel, Yazen.Ghannam, Robert.Richter, a.manzanares

On Mon, Aug 19, 2024 at 11:21:01AM -0500, Terry Bowman wrote:
> Hi Fan
> 
> On 7/25/24 13:49, fan wrote:
> > On Mon, Jun 17, 2024 at 03:04:02PM -0500, Terry Bowman wrote:
> >> This patchset provides RAS logging for CXL root ports, CXL downstream
> >> switch ports, and CXL upstream switch ports. This includes changes to
> >> use a portdrv notifier chain to communicate CXL AER/RAS errors to a
> >> cxl_pci callback.
> >>
> >> The first 3 patches prepare for and add an atomic notifier chain to the
> >> portdrv driver. The portdrv's notifier chain reports the port device's
> >> AER internal errors to the registered callback(s). The preparation changes
> >> include a portdrv update to call the uncorrectable handler for PCIe root
> >> ports and PCIe downstream switch ports. Also, the AER correctable error
> >> (CE) status is made available to the AER CE handler.
> >>
> >> The next 4 patches are in preparation for adding an atomic notification
> >> callback in the cxl_pci driver. This is for receiving AER internal error
> >> events from the portdrv notifier chain. Preparation includes adding RAS
> >> register block mapping, adding trace functions for logging, and
> >> refactoring cxl_pci RAS functions for reuse.
> >>
> >> The final 2 patches enable the AER internal error interrupts.
> >>
> >> Testing RAS CE/UCE:
> >>   QEMU was used for testing CXL root port, CXL downstream switch port, and
> >>   CXL upstream switch port. The aer-inject tool was used to inject AER and
> >>   a test patch was used to set the AER CIE/UIE and RAS CE/UCE status during
> >>   testing. Testing passed with no issues.
> > 
> > Hi Terry,
> > 
> > Could you share a little more about the QEMU test setup?
> > From what I can see, QEMU can currently only inject errors into
> > Type 3 devices. Is that true? If so, how do you inject errors for port
> > devices? Do we need a hack there?
> > 
> > Also, is the aer-inject tool you mentioned the one currently in the kernel
> > or something else?
> > https://elixir.bootlin.com/linux/v6.10-rc6/source/drivers/pci/pcie/aer_inject.c
> > 
> > Thanks,
> > Fan
> > 
> Sorry for the late response.
> 
> I used AMD RAS injection for testing HW root ports.
> 
> I used QEMU and the legacy aer-inject userspace tool to test switch ports (USP/DSP).[1]
> I added a couple of test patches to set the AER UIE/CIE status because the tool doesn't
> support injecting the UIE or CIE bits. I also used a test patch to set the RAS status.
> 
> Regards,
> Terry
> 
> [1] - https://git.kernel.org/pub/scm/linux/kernel/git/gong.chen/aer-inject.git/about/
> 

Hi Terry,
Thanks for the reply. I was able to inject AER errors using the aer_inject
kernel module and the userspace tool.
I am now trying to exercise the code in this patchset.

Fan

> > 
> >>  
> >>   An AMD platform with the AMD RAS error injection tool was used for
> >>   testing CXL root port injection. Testing passed with no issues.
> >>
> >>   TODO - regression test CXL1.1 RCH handling.
> >>
> >> Solutions Considered (1-4):
> >>   Below are solutions that were considered. Solution #4 is
> >>   implemented in this patchset. 
> >>
> >>   1.) Reassigning portdrv error handler for CXL port devices
> >>   
> >>   This solution was based on reassigning the portdrv's CE/UCE err_handler
> >>   to be CXL cxl_pci driver functions.
> >>   
> >>   I started with this solution and once the flow was working I realized
> >>   the endpoint removal would have to be addressed as well. While this
> >>   could be resolved it does highlight the odd coupling and dependency
> >>   between the CXL port devices error handling with cxl_pci endpoint's
> >>   handlers. Also, the err_handler re-assignment at runtime required
> >>   ignoring the 'const' definition. I don't believe this should be
> >>   considered as a possible solution.
> >>   
> >>   2.) Update the AER driver to call cxl_pci driver's error handler before
> >>   calling pci_aer_handle_error()
> >>
> >>   This is similar to the existing RCH port error approach in aer.c.
> >>   In this solution the AER driver searches for a downstream CXL endpoint
> >>   to 'handle' detected CXL port protocol errors.
> >>
> >>   This is a good solution to consider if the one presented in this patchset
> >>   is not acceptable. I was initially reluctant to this approach because it
> >>   adds more CXL coupling to the AER driver. But, I think this solution
> >>   would technically work. I believe Ming was working towards this
> >>   solution.
> >>
> >>   3.) Refactor portdrv
> >>   The portdrv refactoring solution is to change the portdrv service drivers
> >>   into PCIe auxiliary drivers. With this change the facility drivers can be
> >>   associated with a PCIe driver instead of being fixed to the portdrv driver.
> >>
> >>   In this case the CXL port functionality would be added either as a CXL
> >>   auxiliary driver or as a CXL specific port driver
> >>   (PCI_CLASS_BRIDGE_PCI_NORMAL).
> >>
> >>   This solution has challenges in the interrupt allocation by separate
> >>   auxiliary drivers and in binding of a specific driver. Binding is
> >>   currently based on PCIe class and would require extending the binding
> >>   logic to support multiple drivers for the same class.
> >>
> >>   Jonathan Cameron is working towards this solution by initially solving
> >>   for the PMU service driver.[1] It is using the auxiliary bus to associate
> >>   what were service drivers with the portdrv driver. Using a CXL auxiliary
> >>   driver for handling CXL port RAS errors would result in RAS logic being
> >>   called from both the cxl_pci and CXL auxiliary drivers. This may need a
> >>   library driver.
> >>
> >>   4.) Using a portdrv notifier chain/callback for CIE/UIE
> >>   (Implemented in this patchset)
> >>
> >>   This solution uses a portdrv atomic chain notifier and a cxl_pci
> >>   callback to handle and log CXL port RAS errors.
> >>   
> >>   I chose this after trying solution #1 above. The advantages I see in
> >>   this solution are:
> >>   - It is a general port implementation of the CIE/UIE-specific handling
> >>   mentioned in the PCIe spec.[2]
> >>   - A notifier is already used in the RAS MCE driver, which serves as an
> >>   existing example.
> >>   - Does not introduce further CXL dependencies into the AER driver.
> >>   - The notifier chain provides registration/unregistration and
> >>   synchronization.
> >>
> >>   A disadvantage of this approach is that coupling still exists between
> >>   the CXL port's driver (portdrv) and the cxl_pci driver. The CXL port
> >>   device's RAS is handled by a notifier callback in the cxl_pci endpoint
> >>   driver.
> >>
> >>   Most of the patches in this patchset could be reused to work with
> >>   solution #3 or solution #2. The atomic notifier could be dropped in
> >>   favor of an auxiliary device or AER driver awareness. The other
> >>   changes in this patchset could possibly be reused.
> >>
> >>   [1] Kernel.org -
> >>   https://lore.kernel.org/all/f4b23710-059a-51b7-9d27-b62e8b358b54@linux.intel.com
> >>   [2] PCIe 6.0 - 6.2.10 Internal Errors
> >>
> >>  drivers/cxl/core/core.h    |   4 +
> >>  drivers/cxl/core/pci.c     | 153 ++++++++++++++++++++++++++++++++-----
> >>  drivers/cxl/core/port.c    |   6 +-
> >>  drivers/cxl/core/trace.h   |  34 +++++++++
> >>  drivers/cxl/cxl.h          |  10 +++
> >>  drivers/cxl/cxlpci.h       |   2 +
> >>  drivers/cxl/mem.c          |  32 +++++++-
> >>  drivers/cxl/pci.c          |  19 ++++-
> >>  drivers/pci/pcie/aer.c     |  10 ++-
> >>  drivers/pci/pcie/err.c     |  20 +++++
> >>  drivers/pci/pcie/portdrv.c |  32 ++++++++
> >>  drivers/pci/pcie/portdrv.h |   2 +
> >>  include/linux/aer.h        |   6 ++
> >>  13 files changed, 303 insertions(+), 27 deletions(-)
> >>
> >>
> >> base-commit: ca3d4767c8054447ac2a58356080e299a59e05b8
> >> -- 
> >> 2.34.1
> >>
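
The notifier-chain flow described under solution 4 in the quoted cover
letter can be illustrated with a small userspace sketch. This is not code
from the patchset and it does not use the kernel's notifier API: every
demo_* name, the event values, and the return codes are hypothetical, and
the locking/RCU protection that a kernel atomic notifier chain provides is
left out. It only shows the shape: a port-side chain, an endpoint-side
callback that registers on it, and internal-error events delivered to
whatever is registered.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical return codes and events; not the kernel's values. */
enum { DEMO_NOTIFY_DONE = 0, DEMO_NOTIFY_OK = 1, DEMO_NOTIFY_STOP = 2 };
enum { DEMO_EVT_CIE = 1, DEMO_EVT_UIE = 2 };

struct demo_notifier {
	int (*call)(struct demo_notifier *nb, unsigned long event, void *data);
	int priority;			/* higher priority runs first */
	struct demo_notifier *next;
};

struct demo_chain {
	struct demo_notifier *head;
};

/* Insert in descending priority order. */
static void demo_chain_register(struct demo_chain *c, struct demo_notifier *nb)
{
	struct demo_notifier **p = &c->head;

	while (*p && (*p)->priority >= nb->priority)
		p = &(*p)->next;
	nb->next = *p;
	*p = nb;
}

/* Returns 0 if the block was found and unlinked, -1 otherwise. */
static int demo_chain_unregister(struct demo_chain *c, struct demo_notifier *nb)
{
	struct demo_notifier **p;

	for (p = &c->head; *p; p = &(*p)->next) {
		if (*p == nb) {
			*p = nb->next;
			nb->next = NULL;
			return 0;
		}
	}
	return -1;
}

/* Call every registered callback until one asks to stop. */
static int demo_chain_call(struct demo_chain *c, unsigned long event, void *data)
{
	struct demo_notifier *nb, *next;
	int ret = DEMO_NOTIFY_DONE;

	for (nb = c->head; nb; nb = next) {
		next = nb->next;
		ret = nb->call(nb, event, data);
		if (ret == DEMO_NOTIFY_STOP)
			break;
	}
	return ret;
}

/* Endpoint-side listener: the notifier block is the first member, so a
 * pointer to it can be converted back to the containing structure. */
struct demo_cxl_listener {
	struct demo_notifier nb;
	int ce_count;
	int uce_count;
};

static int demo_cxl_port_err_cb(struct demo_notifier *nb, unsigned long event,
				void *data)
{
	struct demo_cxl_listener *l = (struct demo_cxl_listener *)nb;

	(void)data;	/* a real callback would receive the port device here */
	switch (event) {
	case DEMO_EVT_CIE:
		l->ce_count++;
		return DEMO_NOTIFY_OK;
	case DEMO_EVT_UIE:
		l->uce_count++;
		return DEMO_NOTIFY_OK;
	default:
		return DEMO_NOTIFY_DONE;
	}
}

/* Register, report one CIE and one UIE, unregister, report again. */
static int demo_run(void)
{
	struct demo_chain chain = { NULL };
	struct demo_cxl_listener l = { .nb = { .call = demo_cxl_port_err_cb } };

	demo_chain_register(&chain, &l.nb);
	demo_chain_call(&chain, DEMO_EVT_CIE, NULL);
	demo_chain_call(&chain, DEMO_EVT_UIE, NULL);
	assert(demo_chain_unregister(&chain, &l.nb) == 0);
	demo_chain_call(&chain, DEMO_EVT_UIE, NULL);	/* nobody listening */

	return l.ce_count * 10 + l.uce_count;
}
```

Calling demo_run() from a main() exercises the whole flow. The event sent
after unregistration is not counted, which mirrors the
registration/unregistration property listed among the advantages of
solution 4.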

Thread overview: 59+ messages
2024-06-17 20:04 [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports Terry Bowman
2024-06-17 20:04 ` [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers Terry Bowman
2024-06-20 11:21   ` Jonathan Cameron
2024-06-24 14:58     ` Terry Bowman
2024-06-21 19:17   ` Dan Williams
2024-06-24 17:56     ` Terry Bowman
2024-07-10 20:48       ` nifan.cxl
2024-07-10 21:48         ` Terry Bowman
2024-07-11  1:14           ` fan
2024-08-19 18:35       ` Fan Ni
2024-06-17 20:04 ` [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register Terry Bowman
2024-06-20 11:31   ` Jonathan Cameron
2024-06-24 15:08     ` Terry Bowman
2024-06-21 19:23   ` Dan Williams
2024-06-24 18:00     ` Terry Bowman
2024-06-17 20:04 ` [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors Terry Bowman
2024-06-20 12:30   ` Jonathan Cameron
2024-06-24 15:22     ` Terry Bowman
2024-06-21 19:36   ` Dan Williams
2024-06-24 18:21     ` Terry Bowman
2024-06-24 21:46       ` Dan Williams
2024-06-25 14:41         ` Terry Bowman
2024-06-26  2:54   ` Li, Ming4
2024-06-26 13:39     ` Terry Bowman
2024-06-17 20:04 ` [RFC PATCH 4/9] cxl/pci: Map CXL PCIe ports' RAS registers Terry Bowman
2024-06-20 12:46   ` Jonathan Cameron
2024-06-24 15:51     ` Terry Bowman
2024-07-02 15:18       ` Jonathan Cameron
2024-06-26  3:39   ` Li, Ming4
2024-06-17 20:04 ` [RFC PATCH 5/9] cxl/pci: Update RAS handler interfaces to support CXL PCIe ports Terry Bowman
2024-06-20 12:49   ` Jonathan Cameron
2024-07-15 17:50   ` nifan.cxl
2024-06-17 20:04 ` [RFC PATCH 6/9] cxl/pci: Add trace logging for CXL PCIe port RAS errors Terry Bowman
2024-06-20 12:53   ` Jonathan Cameron
2024-06-24 15:53     ` Terry Bowman
2024-07-02 15:53       ` Jonathan Cameron
2024-06-17 20:04 ` [RFC PATCH 7/9] cxl/pci: Add atomic notifier callback for CXL PCIe port AER internal errors Terry Bowman
2024-06-20 13:09   ` Jonathan Cameron
2024-06-24 16:09     ` Terry Bowman
2024-07-02 15:58       ` Jonathan Cameron
2024-06-26  6:22   ` Li, Ming4
2024-06-26 13:51     ` Terry Bowman
2024-06-17 20:04 ` [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors() Terry Bowman
2024-06-19  7:09   ` Christoph Hellwig
2024-06-19 15:40     ` Terry Bowman
2024-06-20 13:11   ` Jonathan Cameron
2024-06-24 16:22     ` Terry Bowman
2024-07-10 21:47   ` Bjorn Helgaas
2024-06-17 20:04 ` [RFC PATCH 9/9] cxl/pci: Enable interrupts for CXL PCIe ports' AER internal errors Terry Bowman
2024-06-20 13:15   ` Jonathan Cameron
2024-06-24 16:46     ` Terry Bowman
2024-07-02 16:00       ` Jonathan Cameron
2024-06-21 19:04 ` [RFC PATCH 0/9] Add RAS support for CXL root ports, CXL downstream switch ports, and CXL upstream switch ports Dan Williams
2024-06-24 17:47   ` Terry Bowman
2024-06-24 20:51     ` Dan Williams
2024-06-25 14:29       ` Terry Bowman
2024-07-25 18:49 ` fan
2024-08-19 16:21   ` Terry Bowman
2024-08-19 18:17     ` Fan Ni