* [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers
[not found] <20240617200411.1426554-1-terry.bowman@amd.com>
@ 2024-06-17 20:04 ` Terry Bowman
2024-06-20 11:21 ` Jonathan Cameron
2024-06-21 19:17 ` Dan Williams
2024-06-17 20:04 ` [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register Terry Bowman
` (2 subsequent siblings)
3 siblings, 2 replies; 29+ messages in thread
From: Terry Bowman @ 2024-06-17 20:04 UTC (permalink / raw)
To: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel, terry.bowman,
Yazen.Ghannam, Robert.Richter
Cc: Bjorn Helgaas, linux-pci
The AER service driver does not currently call a handler for AER
uncorrectable errors (UCE) detected in root ports or downstream
ports. This is not needed in most cases because common PCIe port
functionality is handled by portdrv service drivers.
CXL root ports include CXL specific RAS registers that need logging
before starting do_recovery() in the UCE case.
Update the AER service driver to call the UCE handler for root ports
and downstream ports. These PCIe port devices are bound to the portdrv
driver that includes a CE and UCE handler to be called.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
---
drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 705893b5f7b0..a4db474b2be5 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
+ /*
+ * PCIe ports may include functionality beyond the standard
+ * extended port capabilities. This may present a need to log and
+ * handle errors not addressed in this driver. Examples are CXL
+ * root ports and CXL downstream switch ports using AER UIE to
+ * indicate CXL UCE RAS protocol errors.
+ */
+ if (type == PCI_EXP_TYPE_ROOT_PORT ||
+ type == PCI_EXP_TYPE_DOWNSTREAM) {
+ struct pci_driver *pdrv = dev->driver;
+
+ if (pdrv && pdrv->err_handler &&
+ pdrv->err_handler->error_detected) {
+ const struct pci_error_handlers *err_handler;
+
+ err_handler = pdrv->err_handler;
+ status = err_handler->error_detected(dev, state);
+ }
+ }
+
/*
* If the error was detected by a Root Port, Downstream Port, RCEC,
* or RCiEP, recovery runs on the device itself. For Ports, that
--
2.34.1
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register
[not found] <20240617200411.1426554-1-terry.bowman@amd.com>
2024-06-17 20:04 ` [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers Terry Bowman
@ 2024-06-17 20:04 ` Terry Bowman
2024-06-20 11:31 ` Jonathan Cameron
2024-06-21 19:23 ` Dan Williams
2024-06-17 20:04 ` [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors Terry Bowman
2024-06-17 20:04 ` [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors() Terry Bowman
3 siblings, 2 replies; 29+ messages in thread
From: Terry Bowman @ 2024-06-17 20:04 UTC (permalink / raw)
To: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel, terry.bowman,
Yazen.Ghannam, Robert.Richter
Cc: Bjorn Helgaas, linux-pci
The AER service driver clears the AER correctable error (CE) status before
calling the correctable error handler. This results in the error's status
not correctly reflected if read from the CE handler.
The AER CE status is needed by the portdrv's CE handler. The portdrv's
CE handler is intended to only call the registered notifier callbacks
if the CE error status has correctable internal error (CIE) set.
This is not a problem for AER uncorrrectbale errors (UCE). The UCE status
is still present in the AER capability and available for reading, if
needed, when the UCE handler is called.
Change the order of clearing the CE status and calling the CE handler.
Make it to call the CE handler first and then clear the CE status
after returning.
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
---
drivers/pci/pcie/aer.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index ac6293c24976..4dc03cb9aff0 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1094,9 +1094,6 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
* Correctable error does not need software intervention.
* No need to go through error recovery process.
*/
- if (aer)
- pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
- info->status);
if (pcie_aer_is_native(dev)) {
struct pci_driver *pdrv = dev->driver;
@@ -1105,6 +1102,10 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
pdrv->err_handler->cor_error_detected(dev);
pcie_clear_device_status(dev);
}
+ if (aer)
+ pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
+ info->status);
+
} else if (info->severity == AER_NONFATAL)
pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
else if (info->severity == AER_FATAL)
--
2.34.1
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors
[not found] <20240617200411.1426554-1-terry.bowman@amd.com>
2024-06-17 20:04 ` [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers Terry Bowman
2024-06-17 20:04 ` [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register Terry Bowman
@ 2024-06-17 20:04 ` Terry Bowman
2024-06-20 12:30 ` Jonathan Cameron
` (2 more replies)
2024-06-17 20:04 ` [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors() Terry Bowman
3 siblings, 3 replies; 29+ messages in thread
From: Terry Bowman @ 2024-06-17 20:04 UTC (permalink / raw)
To: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel, terry.bowman,
Yazen.Ghannam, Robert.Richter
Cc: Bjorn Helgaas, linux-pci
PCIe port devices are bound to portdrv, the PCIe port bus driver. portdrv
does not implement an AER correctable handler (CE) but does implement the
AER uncorrectable error (UCE). The UCE handler is fairly straightforward
in that it only checks for frozen error state and returns the next step
for recovery accordingly.
As a result, port devices relying on AER correctable internal errors (CIE)
and AER uncorrectable internal errors (UIE) will not be handled. Note,
the PCIe spec indicates AER CIE/UIE can be used to report implementation
specific errors.[1]
CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
are examples of devices using the AER CIE/UIE for implementation specific
purposes. These CXL ports use the AER interrupt and AER CIE/UIE status to
report CXL RAS errors.[2]
Add an atomic notifier to portdrv's CE/UCE handlers. Use the atomic
notifier to report CIE/UIE errors to the registered functions. This will
require adding a CE handler and updating the existing UCE handler.
For the UCE handler, the CXL spec states UIE errors should return need
reset: "The only method of recovering from an Uncorrectable Internal Error
is reset or hardware replacement."[1]
[1] PCI6.0 - 6.2.10 Internal Errors
[2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
Upstream Switch Ports
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
---
drivers/pci/pcie/portdrv.c | 32 ++++++++++++++++++++++++++++++++
drivers/pci/pcie/portdrv.h | 2 ++
2 files changed, 34 insertions(+)
diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
index 14a4b89a3b83..86d80e0e9606 100644
--- a/drivers/pci/pcie/portdrv.c
+++ b/drivers/pci/pcie/portdrv.c
@@ -37,6 +37,9 @@ struct portdrv_service_data {
u32 service;
};
+ATOMIC_NOTIFIER_HEAD(portdrv_aer_internal_err_chain);
+EXPORT_SYMBOL_GPL(portdrv_aer_internal_err_chain);
+
/**
* release_pcie_device - free PCI Express port service device structure
* @dev: Port service device to release
@@ -745,11 +748,39 @@ static void pcie_portdrv_shutdown(struct pci_dev *dev)
static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
pci_channel_state_t error)
{
+ if (dev->aer_cap) {
+ u32 status;
+
+ pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_UNCOR_STATUS,
+ &status);
+
+ if (status & PCI_ERR_UNC_INTN) {
+ atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
+ AER_FATAL, (void *)dev);
+ return PCI_ERS_RESULT_NEED_RESET;
+ }
+ }
+
if (error == pci_channel_io_frozen)
return PCI_ERS_RESULT_NEED_RESET;
return PCI_ERS_RESULT_CAN_RECOVER;
}
+static void pcie_portdrv_cor_error_detected(struct pci_dev *dev)
+{
+ u32 status;
+
+ if (!dev->aer_cap)
+ return;
+
+ pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_COR_STATUS,
+ &status);
+
+ if (status & PCI_ERR_COR_INTERNAL)
+ atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
+ AER_CORRECTABLE, (void *)dev);
+}
+
static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev)
{
size_t off = offsetof(struct pcie_port_service_driver, slot_reset);
@@ -780,6 +811,7 @@ static const struct pci_device_id port_pci_ids[] = {
static const struct pci_error_handlers pcie_portdrv_err_handler = {
.error_detected = pcie_portdrv_error_detected,
+ .cor_error_detected = pcie_portdrv_cor_error_detected,
.slot_reset = pcie_portdrv_slot_reset,
.mmio_enabled = pcie_portdrv_mmio_enabled,
};
diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
index 12c89ea0313b..8a39197f0203 100644
--- a/drivers/pci/pcie/portdrv.h
+++ b/drivers/pci/pcie/portdrv.h
@@ -121,4 +121,6 @@ static inline void pcie_pme_interrupt_enable(struct pci_dev *dev, bool en) {}
#endif /* !CONFIG_PCIE_PME */
struct device *pcie_port_find_device(struct pci_dev *dev, u32 service);
+
+extern struct atomic_notifier_head portdrv_aer_internal_err_chain;
#endif /* _PORTDRV_H_ */
--
2.34.1
^ permalink raw reply related [flat|nested] 29+ messages in thread
* [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors()
[not found] <20240617200411.1426554-1-terry.bowman@amd.com>
` (2 preceding siblings ...)
2024-06-17 20:04 ` [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors Terry Bowman
@ 2024-06-17 20:04 ` Terry Bowman
2024-06-19 7:09 ` Christoph Hellwig
` (2 more replies)
3 siblings, 3 replies; 29+ messages in thread
From: Terry Bowman @ 2024-06-17 20:04 UTC (permalink / raw)
To: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel, terry.bowman,
Yazen.Ghannam, Robert.Richter
Cc: Bjorn Helgaas, linux-pci
AER correctable internal errors (CIE) and AER uncorrectable internal
errors (UIE) are disabled through the AER mask register by default.[1]
CXL PCIe ports use the CIE/UIE to report RAS errors and as a result
need CIE/UIE enabled.[2]
Change pci_aer_unmask_internal_errors() function to be exported for
the CXL driver and other drivers to use.
[1] PCI6.0 - 7.8.4.3 Uncorrectable
[2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and Upstream
Switch Ports
Signed-off-by: Terry Bowman <terry.bowman@amd.com>
Cc: Bjorn Helgaas <bhelgaas@google.com>
Cc: linux-pci@vger.kernel.org
---
drivers/pci/pcie/aer.c | 3 ++-
include/linux/aer.h | 6 ++++++
2 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 4dc03cb9aff0..d7a1982f0c50 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -951,7 +951,7 @@ static bool find_source_device(struct pci_dev *parent,
* Note: AER must be enabled and supported by the device which must be
* checked in advance, e.g. with pcie_aer_is_native().
*/
-static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
+void pci_aer_unmask_internal_errors(struct pci_dev *dev)
{
int aer = dev->aer_cap;
u32 mask;
@@ -964,6 +964,7 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
mask &= ~PCI_ERR_COR_INTERNAL;
pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
}
+EXPORT_SYMBOL_GPL(pci_aer_unmask_internal_errors);
static bool is_cxl_mem_dev(struct pci_dev *dev)
{
diff --git a/include/linux/aer.h b/include/linux/aer.h
index 4b97f38f3fcf..a4fd25ea0280 100644
--- a/include/linux/aer.h
+++ b/include/linux/aer.h
@@ -50,6 +50,12 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
#endif
+#ifdef CONFIG_PCIEAER_CXL
+void pci_aer_unmask_internal_errors(struct pci_dev *dev);
+#else
+static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
+#endif
+
void pci_print_aer(struct pci_dev *dev, int aer_severity,
struct aer_capability_regs *aer);
int cper_severity_to_aer(int cper_severity);
--
2.34.1
^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors()
2024-06-17 20:04 ` [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors() Terry Bowman
@ 2024-06-19 7:09 ` Christoph Hellwig
2024-06-19 15:40 ` Terry Bowman
2024-06-20 13:11 ` Jonathan Cameron
2024-07-10 21:47 ` Bjorn Helgaas
2 siblings, 1 reply; 29+ messages in thread
From: Christoph Hellwig @ 2024-06-19 7:09 UTC (permalink / raw)
To: Terry Bowman
Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci
On Mon, Jun 17, 2024 at 03:04:10PM -0500, Terry Bowman wrote:
> AER correctable internal errors (CIE) and AER uncorrectable internal
> errors (UIE) are disabled through the AER mask register by default.[1]
>
> CXL PCIe ports use the CIE/UIE to report RAS errors and as a result
> need CIE/UIE enabled.[2]
>
> Change pci_aer_unmask_internal_errors() function to be exported for
> the CXL driver and other drivers to use.
I can't actually find a user for this. Maybe that's because you did
weird partial CCs for your series, or maybe it's because you don't
want to tell us. Either way it's a no-go.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors()
2024-06-19 7:09 ` Christoph Hellwig
@ 2024-06-19 15:40 ` Terry Bowman
0 siblings, 0 replies; 29+ messages in thread
From: Terry Bowman @ 2024-06-19 15:40 UTC (permalink / raw)
To: Christoph Hellwig
Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci
On 6/19/24 02:09, Christoph Hellwig wrote:
> On Mon, Jun 17, 2024 at 03:04:10PM -0500, Terry Bowman wrote:
>> AER correctable internal errors (CIE) and AER uncorrectable internal
>> errors (UIE) are disabled through the AER mask register by default.[1]
>>
>> CXL PCIe ports use the CIE/UIE to report RAS errors and as a result
>> need CIE/UIE enabled.[2]
>>
>> Change pci_aer_unmask_internal_errors() function to be exported for
>> the CXL driver and other drivers to use.
>
> I can't actually find a user for this. Maybe that's because you did
> weird partial CCs for your series, or maybe it's because you don't
> want to tell us. Either way it's a no-go.
The use is in the following patchset (9/9) that missed being shared with
PCI list. If there is rework I'll fix so both are sent to PCI list.
https://lore.kernel.org/all/20240617200411.1426554-10-terry.bowman@amd.com/
Regards,
Terry
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers
2024-06-17 20:04 ` [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers Terry Bowman
@ 2024-06-20 11:21 ` Jonathan Cameron
2024-06-24 14:58 ` Terry Bowman
2024-06-21 19:17 ` Dan Williams
1 sibling, 1 reply; 29+ messages in thread
From: Jonathan Cameron @ 2024-06-20 11:21 UTC (permalink / raw)
To: Terry Bowman
Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci
On Mon, 17 Jun 2024 15:04:03 -0500
Terry Bowman <terry.bowman@amd.com> wrote:
> The AER service driver does not currently call a handler for AER
> uncorrectable errors (UCE) detected in root ports or downstream
> ports. This is not needed in most cases because common PCIe port
> functionality is handled by portdrv service drivers.
>
> CXL root ports include CXL specific RAS registers that need logging
> before starting do_recovery() in the UCE case.
>
> Update the AER service driver to call the UCE handler for root ports
> and downstream ports. These PCIe port devices are bound to the portdrv
> driver that includes a CE and UCE handler to be called.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> ---
> drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 705893b5f7b0..a4db474b2be5 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>
> + /*
> + * PCIe ports may include functionality beyond the standard
> + * extended port capabilities. This may present a need to log and
> + * handle errors not addressed in this driver. Examples are CXL
> + * root ports and CXL downstream switch ports using AER UIE to
> + * indicate CXL UCE RAS protocol errors.
> + */
> + if (type == PCI_EXP_TYPE_ROOT_PORT ||
> + type == PCI_EXP_TYPE_DOWNSTREAM) {
> + struct pci_driver *pdrv = dev->driver;
> +
> + if (pdrv && pdrv->err_handler &&
> + pdrv->err_handler->error_detected) {
> + const struct pci_error_handlers *err_handler;
> +
> + err_handler = pdrv->err_handler;
> + status = err_handler->error_detected(dev, state);
This status is going to get overridden by one of the pci_walk_bridge()
calls. Should it be kept around and acted on, or dropped silently?
(I'd guess no for silent!).
> + }
> + }
> +
> /*
> * If the error was detected by a Root Port, Downstream Port, RCEC,
> * or RCiEP, recovery runs on the device itself. For Ports, that
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register
2024-06-17 20:04 ` [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register Terry Bowman
@ 2024-06-20 11:31 ` Jonathan Cameron
2024-06-24 15:08 ` Terry Bowman
2024-06-21 19:23 ` Dan Williams
1 sibling, 1 reply; 29+ messages in thread
From: Jonathan Cameron @ 2024-06-20 11:31 UTC (permalink / raw)
To: Terry Bowman
Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci
On Mon, 17 Jun 2024 15:04:04 -0500
Terry Bowman <terry.bowman@amd.com> wrote:
> The AER service driver clears the AER correctable error (CE) status before
> calling the correctable error handler. This results in the error's status
> not correctly reflected if read from the CE handler.
>
> The AER CE status is needed by the portdrv's CE handler. The portdrv's
> CE handler is intended to only call the registered notifier callbacks
> if the CE error status has correctable internal error (CIE) set.
>
> This is not a problem for AER uncorrrectbale errors (UCE). The UCE status
uncorrectable
> is still present in the AER capability and available for reading, if
> needed, when the UCE handler is called.
I'm seeing the clear in the DPC path for UCE. For other cases is
it a side effect of the reset?
>
> Change the order of clearing the CE status and calling the CE handler.
> Make it to call the CE handler first and then clear the CE status
> after returning.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
Seems reasonable, but many gremlins around the ordering in these
flows, so I'm to particularly confident. With that in mind.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huwei.com>
> ---
> drivers/pci/pcie/aer.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index ac6293c24976..4dc03cb9aff0 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1094,9 +1094,6 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
> * Correctable error does not need software intervention.
> * No need to go through error recovery process.
> */
> - if (aer)
> - pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
> - info->status);
> if (pcie_aer_is_native(dev)) {
> struct pci_driver *pdrv = dev->driver;
>
> @@ -1105,6 +1102,10 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
> pdrv->err_handler->cor_error_detected(dev);
> pcie_clear_device_status(dev);
> }
> + if (aer)
> + pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
> + info->status);
> +
> } else if (info->severity == AER_NONFATAL)
> pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
> else if (info->severity == AER_FATAL)
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors
2024-06-17 20:04 ` [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors Terry Bowman
@ 2024-06-20 12:30 ` Jonathan Cameron
2024-06-24 15:22 ` Terry Bowman
2024-06-21 19:36 ` Dan Williams
2024-06-26 2:54 ` Li, Ming4
2 siblings, 1 reply; 29+ messages in thread
From: Jonathan Cameron @ 2024-06-20 12:30 UTC (permalink / raw)
To: Terry Bowman
Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci
On Mon, 17 Jun 2024 15:04:05 -0500
Terry Bowman <terry.bowman@amd.com> wrote:
> PCIe port devices are bound to portdrv, the PCIe port bus driver. portdrv
> does not implement an AER correctable handler (CE) but does implement the
> AER uncorrectable error (UCE). The UCE handler is fairly straightforward
> in that it only checks for frozen error state and returns the next step
> for recovery accordingly.
>
> As a result, port devices relying on AER correctable internal errors (CIE)
> and AER uncorrectable internal errors (UIE) will not be handled. Note,
> the PCIe spec indicates AER CIE/UIE can be used to report implementation
> specific errors.[1]
>
> CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
> are examples of devices using the AER CIE/UIE for implementation specific
> purposes. These CXL ports use the AER interrupt and AER CIE/UIE status to
> report CXL RAS errors.[2]
>
> Add an atomic notifier to portdrv's CE/UCE handlers. Use the atomic
> notifier to report CIE/UIE errors to the registered functions. This will
> require adding a CE handler and updating the existing UCE handler.
>
> For the UCE handler, the CXL spec states UIE errors should return need
> reset: "The only method of recovering from an Uncorrectable Internal Error
> is reset or hardware replacement."[1]
>
> [1] PCI6.0 - 6.2.10 Internal Errors
> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
> Upstream Switch Ports
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> ---
> drivers/pci/pcie/portdrv.c | 32 ++++++++++++++++++++++++++++++++
> drivers/pci/pcie/portdrv.h | 2 ++
> 2 files changed, 34 insertions(+)
>
> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> index 14a4b89a3b83..86d80e0e9606 100644
> --- a/drivers/pci/pcie/portdrv.c
> +++ b/drivers/pci/pcie/portdrv.c
> @@ -37,6 +37,9 @@ struct portdrv_service_data {
> u32 service;
> };
>
> +ATOMIC_NOTIFIER_HEAD(portdrv_aer_internal_err_chain);
> +EXPORT_SYMBOL_GPL(portdrv_aer_internal_err_chain);
Perhaps these should be per instance of the portdrv?
I'd imagine we only want to register CXL ones on CXL ports etc
and it's annoying to have to check at runtime for relevance
of a particular notifier.
> +
> /**
> * release_pcie_device - free PCI Express port service device structure
> * @dev: Port service device to release
> @@ -745,11 +748,39 @@ static void pcie_portdrv_shutdown(struct pci_dev *dev)
> static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
> pci_channel_state_t error)
> {
> + if (dev->aer_cap) {
> + u32 status;
> +
> + pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_UNCOR_STATUS,
> + &status);
> +
> + if (status & PCI_ERR_UNC_INTN) {
> + atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
> + AER_FATAL, (void *)dev);
Don't think the cast is needed as always fine to implicitly cast to and from
void * in C.
> + return PCI_ERS_RESULT_NEED_RESET;
> + }
> + }
> +
> if (error == pci_channel_io_frozen)
> return PCI_ERS_RESULT_NEED_RESET;
> return PCI_ERS_RESULT_CAN_RECOVER;
> }
>
> +static void pcie_portdrv_cor_error_detected(struct pci_dev *dev)
> +{
> + u32 status;
> +
> + if (!dev->aer_cap)
> + return;
> +
> + pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_COR_STATUS,
> + &status);
> +
> + if (status & PCI_ERR_COR_INTERNAL)
> + atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
> + AER_CORRECTABLE, (void *)dev);
No need for the cast.
> +}
> +
> static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev)
> {
> size_t off = offsetof(struct pcie_port_service_driver, slot_reset);
> @@ -780,6 +811,7 @@ static const struct pci_device_id port_pci_ids[] = {
>
> static const struct pci_error_handlers pcie_portdrv_err_handler = {
> .error_detected = pcie_portdrv_error_detected,
> + .cor_error_detected = pcie_portdrv_cor_error_detected,
> .slot_reset = pcie_portdrv_slot_reset,
> .mmio_enabled = pcie_portdrv_mmio_enabled,
> };
> diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
> index 12c89ea0313b..8a39197f0203 100644
> --- a/drivers/pci/pcie/portdrv.h
> +++ b/drivers/pci/pcie/portdrv.h
> @@ -121,4 +121,6 @@ static inline void pcie_pme_interrupt_enable(struct pci_dev *dev, bool en) {}
> #endif /* !CONFIG_PCIE_PME */
>
> struct device *pcie_port_find_device(struct pci_dev *dev, u32 service);
> +
> +extern struct atomic_notifier_head portdrv_aer_internal_err_chain;
> #endif /* _PORTDRV_H_ */
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors()
2024-06-17 20:04 ` [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors() Terry Bowman
2024-06-19 7:09 ` Christoph Hellwig
@ 2024-06-20 13:11 ` Jonathan Cameron
2024-06-24 16:22 ` Terry Bowman
2024-07-10 21:47 ` Bjorn Helgaas
2 siblings, 1 reply; 29+ messages in thread
From: Jonathan Cameron @ 2024-06-20 13:11 UTC (permalink / raw)
To: Terry Bowman
Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci
On Mon, 17 Jun 2024 15:04:10 -0500
Terry Bowman <terry.bowman@amd.com> wrote:
> AER correctable internal errors (CIE) and AER uncorrectable internal
> errors (UIE) are disabled through the AER mask register by default.[1]
>
> CXL PCIe ports use the CIE/UIE to report RAS errors and as a result
> need CIE/UIE enabled.[2]
>
> Change pci_aer_unmask_internal_errors() function to be exported for
> the CXL driver and other drivers to use.
I've perhaps forgotten the end conclusion, but I thought there was
a request to just try enabling this in general and mask it out only
for known broken devices?
Admittedly that's a more daring path, so maybe I hallucinated it!
>
> [1] PCI6.0 - 7.8.4.3 Uncorrectable
> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and Upstream
> Switch Ports
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> ---
> drivers/pci/pcie/aer.c | 3 ++-
> include/linux/aer.h | 6 ++++++
> 2 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 4dc03cb9aff0..d7a1982f0c50 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -951,7 +951,7 @@ static bool find_source_device(struct pci_dev *parent,
> * Note: AER must be enabled and supported by the device which must be
> * checked in advance, e.g. with pcie_aer_is_native().
> */
> -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> +void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> {
> int aer = dev->aer_cap;
> u32 mask;
> @@ -964,6 +964,7 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> mask &= ~PCI_ERR_COR_INTERNAL;
> pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
> }
> +EXPORT_SYMBOL_GPL(pci_aer_unmask_internal_errors);
>
> static bool is_cxl_mem_dev(struct pci_dev *dev)
> {
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 4b97f38f3fcf..a4fd25ea0280 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -50,6 +50,12 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
> static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
> #endif
>
> +#ifdef CONFIG_PCIEAER_CXL
> +void pci_aer_unmask_internal_errors(struct pci_dev *dev);
> +#else
> +static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
> +#endif
> +
> void pci_print_aer(struct pci_dev *dev, int aer_severity,
> struct aer_capability_regs *aer);
> int cper_severity_to_aer(int cper_severity);
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers
2024-06-17 20:04 ` [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers Terry Bowman
2024-06-20 11:21 ` Jonathan Cameron
@ 2024-06-21 19:17 ` Dan Williams
2024-06-24 17:56 ` Terry Bowman
1 sibling, 1 reply; 29+ messages in thread
From: Dan Williams @ 2024-06-21 19:17 UTC (permalink / raw)
To: Terry Bowman, dan.j.williams, ira.weiny, dave, dave.jiang,
alison.schofield, ming4.li, vishal.l.verma, jim.harris,
ilpo.jarvinen, ardb, sathyanarayanan.kuppuswamy, linux-cxl,
linux-kernel, Yazen.Ghannam, Robert.Richter
Cc: Bjorn Helgaas, linux-pci
Terry Bowman wrote:
> The AER service driver does not currently call a handler for AER
> uncorrectable errors (UCE) detected in root ports or downstream
> ports. This is not needed in most cases because common PCIe port
> functionality is handled by portdrv service drivers.
>
> CXL root ports include CXL specific RAS registers that need logging
> before starting do_recovery() in the UCE case.
>
> Update the AER service driver to call the UCE handler for root ports
> and downstream ports. These PCIe port devices are bound to the portdrv
> driver that includes a CE and UCE handler to be called.
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> ---
> drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index 705893b5f7b0..a4db474b2be5 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>
> + /*
> + * PCIe ports may include functionality beyond the standard
> + * extended port capabilities. This may present a need to log and
> + * handle errors not addressed in this driver. Examples are CXL
> + * root ports and CXL downstream switch ports using AER UIE to
> + * indicate CXL UCE RAS protocol errors.
> + */
> + if (type == PCI_EXP_TYPE_ROOT_PORT ||
> + type == PCI_EXP_TYPE_DOWNSTREAM) {
> + struct pci_driver *pdrv = dev->driver;
> +
> + if (pdrv && pdrv->err_handler &&
> + pdrv->err_handler->error_detected) {
> + const struct pci_error_handlers *err_handler;
> +
> + err_handler = pdrv->err_handler;
> + status = err_handler->error_detected(dev, state);
> + }
> + }
> +
Would not a more appropriate place for this be pci_walk_bridge() where
the ->subordinate == NULL and these type-check cases are unified?
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register
2024-06-17 20:04 ` [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register Terry Bowman
2024-06-20 11:31 ` Jonathan Cameron
@ 2024-06-21 19:23 ` Dan Williams
2024-06-24 18:00 ` Terry Bowman
1 sibling, 1 reply; 29+ messages in thread
From: Dan Williams @ 2024-06-21 19:23 UTC (permalink / raw)
To: Terry Bowman, dan.j.williams, ira.weiny, dave, dave.jiang,
alison.schofield, ming4.li, vishal.l.verma, jim.harris,
ilpo.jarvinen, ardb, sathyanarayanan.kuppuswamy, linux-cxl,
linux-kernel, Yazen.Ghannam, Robert.Richter
Cc: Bjorn Helgaas, linux-pci
Terry Bowman wrote:
> The AER service driver clears the AER correctable error (CE) status before
> calling the correctable error handler. This results in the error's status
> not correctly reflected if read from the CE handler.
>
> The AER CE status is needed by the portdrv's CE handler. The portdrv's
> CE handler is intended to only call the registered notifier callbacks
> if the CE error status has correctable internal error (CIE) set.
Is this a fix or a prep patch? It reads like a "fix", but there are no
notifiers to worry about today.
> This is not a problem for AER uncorrrectbale errors (UCE). The UCE status
> is still present in the AER capability and available for reading, if
> needed, when the UCE handler is called.
>
> Change the order of clearing the CE status and calling the CE handler.
> Make it to call the CE handler first and then clear the CE status
> after returning.
With the changelog clarified to indicate whether this has any impact on
current behavior you can add:
Acked-by: Dan Williams <dan.j.williams@intel.com>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors
2024-06-17 20:04 ` [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors Terry Bowman
2024-06-20 12:30 ` Jonathan Cameron
@ 2024-06-21 19:36 ` Dan Williams
2024-06-24 18:21 ` Terry Bowman
2024-06-26 2:54 ` Li, Ming4
2 siblings, 1 reply; 29+ messages in thread
From: Dan Williams @ 2024-06-21 19:36 UTC (permalink / raw)
To: Terry Bowman, dan.j.williams, ira.weiny, dave, dave.jiang,
alison.schofield, ming4.li, vishal.l.verma, jim.harris,
ilpo.jarvinen, ardb, sathyanarayanan.kuppuswamy, linux-cxl,
linux-kernel, Yazen.Ghannam, Robert.Richter
Cc: Bjorn Helgaas, linux-pci
Terry Bowman wrote:
> PCIe port devices are bound to portdrv, the PCIe port bus driver. portdrv
> does not implement an AER correctable handler (CE) but does implement the
> AER uncorrectable error (UCE). The UCE handler is fairly straightforward
> in that it only checks for frozen error state and returns the next step
> for recovery accordingly.
>
> As a result, port devices relying on AER correctable internal errors (CIE)
> and AER uncorrectable internal errors (UIE) will not be handled. Note,
> the PCIe spec indicates AER CIE/UIE can be used to report implementation
> specific errors.[1]
>
> CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
> are examples of devices using the AER CIE/UIE for implementation specific
> purposes. These CXL ports use the AER interrupt and AER CIE/UIE status to
> report CXL RAS errors.[2]
>
> Add an atomic notifier to portdrv's CE/UCE handlers. Use the atomic
> notifier to report CIE/UIE errors to the registered functions. This will
> require adding a CE handler and updating the existing UCE handler.
>
> For the UCE handler, the CXL spec states UIE errors should return need
> reset: "The only method of recovering from an Uncorrectable Internal Error
> is reset or hardware replacement."[1]
>
> [1] PCI6.0 - 6.2.10 Internal Errors
> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
> Upstream Switch Ports
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> ---
> drivers/pci/pcie/portdrv.c | 32 ++++++++++++++++++++++++++++++++
> drivers/pci/pcie/portdrv.h | 2 ++
> 2 files changed, 34 insertions(+)
>
> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> index 14a4b89a3b83..86d80e0e9606 100644
> --- a/drivers/pci/pcie/portdrv.c
> +++ b/drivers/pci/pcie/portdrv.c
> @@ -37,6 +37,9 @@ struct portdrv_service_data {
> u32 service;
> };
>
> +ATOMIC_NOTIFIER_HEAD(portdrv_aer_internal_err_chain);
> +EXPORT_SYMBOL_GPL(portdrv_aer_internal_err_chain);
> +
> /**
> * release_pcie_device - free PCI Express port service device structure
> * @dev: Port service device to release
> @@ -745,11 +748,39 @@ static void pcie_portdrv_shutdown(struct pci_dev *dev)
> static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
> pci_channel_state_t error)
> {
> + if (dev->aer_cap) {
> + u32 status;
> +
> + pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_UNCOR_STATUS,
> + &status);
> +
> + if (status & PCI_ERR_UNC_INTN) {
> + atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
> + AER_FATAL, (void *)dev);
> + return PCI_ERS_RESULT_NEED_RESET;
> + }
> + }
> +
Oh, this is a finer grained / lower-level location than I was
expecting. I was expecting that the notifier was just conveying the port
interrupt notification to a driver that knew how to take the next step.
This pcie_portdrv_error_detected() is a notification that is already
"downstream" of the AER notification.
If PCIe does not care about CIE and UIE then don't make it care, but
redirect the notifications to the CXL side that may care.
Leave the portdrv handlers PCIe native as much as possible.
Now, I have not thought through the full implications of that
suggestion, but for now am reacting to this AER -> PCIe err_handler ->
CXL notfier as potentially more awkward than AER -> CXL notifier. It's a
separate error handling domain that the PCIe side likely does not want
to worry about. PCIe side is only responsible for allowing CXL to
register for the notifications beacuse the AER interrupt is shared.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers
2024-06-20 11:21 ` Jonathan Cameron
@ 2024-06-24 14:58 ` Terry Bowman
0 siblings, 0 replies; 29+ messages in thread
From: Terry Bowman @ 2024-06-24 14:58 UTC (permalink / raw)
To: Jonathan Cameron
Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci
Hi Jonathan,
I added a response below.
On 6/20/24 06:21, Jonathan Cameron wrote:
> On Mon, 17 Jun 2024 15:04:03 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The AER service driver does not currently call a handler for AER
>> uncorrectable errors (UCE) detected in root ports or downstream
>> ports. This is not needed in most cases because common PCIe port
>> functionality is handled by portdrv service drivers.
>>
>> CXL root ports include CXL specific RAS registers that need logging
>> before starting do_recovery() in the UCE case.
>>
>> Update the AER service driver to call the UCE handler for root ports
>> and downstream ports. These PCIe port devices are bound to the portdrv
>> driver that includes a CE and UCE handler to be called.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: linux-pci@vger.kernel.org
>> ---
>> drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
>> 1 file changed, 20 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index 705893b5f7b0..a4db474b2be5 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>> pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>> struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>>
>> + /*
>> + * PCIe ports may include functionality beyond the standard
>> + * extended port capabilities. This may present a need to log and
>> + * handle errors not addressed in this driver. Examples are CXL
>> + * root ports and CXL downstream switch ports using AER UIE to
>> + * indicate CXL UCE RAS protocol errors.
>> + */
>> + if (type == PCI_EXP_TYPE_ROOT_PORT ||
>> + type == PCI_EXP_TYPE_DOWNSTREAM) {
>> + struct pci_driver *pdrv = dev->driver;
>> +
>> + if (pdrv && pdrv->err_handler &&
>> + pdrv->err_handler->error_detected) {
>> + const struct pci_error_handlers *err_handler;
>> +
>> + err_handler = pdrv->err_handler;
>> + status = err_handler->error_detected(dev, state);
>
> This status is going to get overridden by one of the pci_walk_bridge()
> calls. Should it be kept around and acted on, or dropped silently?
> (I'd guess no for silent!).
>
It should be used later.
According to PCI spec "The only method of recovering from an Uncorrectable
Internal Error is reset or hardware replacement."
I need to make certain that carries through below.
Regards,
Terry
>> + }
>> + }
>> +
>> /*
>> * If the error was detected by a Root Port, Downstream Port, RCEC,
>> * or RCiEP, recovery runs on the device itself. For Ports, that
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register
2024-06-20 11:31 ` Jonathan Cameron
@ 2024-06-24 15:08 ` Terry Bowman
0 siblings, 0 replies; 29+ messages in thread
From: Terry Bowman @ 2024-06-24 15:08 UTC (permalink / raw)
To: Jonathan Cameron
Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci
On 6/20/24 06:31, Jonathan Cameron wrote:
> On Mon, 17 Jun 2024 15:04:04 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> The AER service driver clears the AER correctable error (CE) status before
>> calling the correctable error handler. This results in the error's status
>> not correctly reflected if read from the CE handler.
>>
>> The AER CE status is needed by the portdrv's CE handler. The portdrv's
>> CE handler is intended to only call the registered notifier callbacks
>> if the CE error status has correctable internal error (CIE) set.
>>
>> This is not a problem for AER uncorrrectbale errors (UCE). The UCE status
>
> uncorrectable
>
Thank you.
>> is still present in the AER capability and available for reading, if
>> needed, when the UCE handler is called.
>
> I'm seeing the clear in the DPC path for UCE. For other cases is
> it a side effect of the reset?
>
Depends on when its being read. I'm assuming this is after recovery in your case.
And after recovery it will be zeroed.
Regards,
Terry
>>
>> Change the order of clearing the CE status and calling the CE handler.
>> Make it to call the CE handler first and then clear the CE status
>> after returning.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: linux-pci@vger.kernel.org
> Seems reasonable, but many gremlins around the ordering in these
> flows, so I'm to particularly confident. With that in mind.
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huwei.com>
>
>> ---
>> drivers/pci/pcie/aer.c | 7 ++++---
>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index ac6293c24976..4dc03cb9aff0 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -1094,9 +1094,6 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>> * Correctable error does not need software intervention.
>> * No need to go through error recovery process.
>> */
>> - if (aer)
>> - pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>> - info->status);
>> if (pcie_aer_is_native(dev)) {
>> struct pci_driver *pdrv = dev->driver;
>>
>> @@ -1105,6 +1102,10 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
>> pdrv->err_handler->cor_error_detected(dev);
>> pcie_clear_device_status(dev);
>> }
>> + if (aer)
>> + pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
>> + info->status);
>> +
>> } else if (info->severity == AER_NONFATAL)
>> pcie_do_recovery(dev, pci_channel_io_normal, aer_root_reset);
>> else if (info->severity == AER_FATAL)
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors
2024-06-20 12:30 ` Jonathan Cameron
@ 2024-06-24 15:22 ` Terry Bowman
0 siblings, 0 replies; 29+ messages in thread
From: Terry Bowman @ 2024-06-24 15:22 UTC (permalink / raw)
To: Jonathan Cameron
Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci
Hi Jonathan,
I added responses inline below.
On 6/20/24 07:30, Jonathan Cameron wrote:
> On Mon, 17 Jun 2024 15:04:05 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> PCIe port devices are bound to portdrv, the PCIe port bus driver. portdrv
>> does not implement an AER correctable handler (CE) but does implement the
>> AER uncorrectable error (UCE). The UCE handler is fairly straightforward
>> in that it only checks for frozen error state and returns the next step
>> for recovery accordingly.
>>
>> As a result, port devices relying on AER correctable internal errors (CIE)
>> and AER uncorrectable internal errors (UIE) will not be handled. Note,
>> the PCIe spec indicates AER CIE/UIE can be used to report implementation
>> specific errors.[1]
>>
>> CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
>> are examples of devices using the AER CIE/UIE for implementation specific
>> purposes. These CXL ports use the AER interrupt and AER CIE/UIE status to
>> report CXL RAS errors.[2]
>>
>> Add an atomic notifier to portdrv's CE/UCE handlers. Use the atomic
>> notifier to report CIE/UIE errors to the registered functions. This will
>> require adding a CE handler and updating the existing UCE handler.
>>
>> For the UCE handler, the CXL spec states UIE errors should return need
>> reset: "The only method of recovering from an Uncorrectable Internal Error
>> is reset or hardware replacement."[1]
>>
>> [1] PCI6.0 - 6.2.10 Internal Errors
>> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>> Upstream Switch Ports
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: linux-pci@vger.kernel.org
>> ---
>> drivers/pci/pcie/portdrv.c | 32 ++++++++++++++++++++++++++++++++
>> drivers/pci/pcie/portdrv.h | 2 ++
>> 2 files changed, 34 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
>> index 14a4b89a3b83..86d80e0e9606 100644
>> --- a/drivers/pci/pcie/portdrv.c
>> +++ b/drivers/pci/pcie/portdrv.c
>> @@ -37,6 +37,9 @@ struct portdrv_service_data {
>> u32 service;
>> };
>>
>> +ATOMIC_NOTIFIER_HEAD(portdrv_aer_internal_err_chain);
>> +EXPORT_SYMBOL_GPL(portdrv_aer_internal_err_chain);
>
> Perhaps these should be per instance of the portdrv?
> I'd imagine we only want to register CXL ones on CXL ports etc
> and it's annoying to have to check at runtime for relevance
> of a particular notifier.
>
This could be made per-instance by moving to the PCI/device drvdata. This
would likely need a portdrv setup-init helper function to enable for a
particular PCI device.
>> +
>> /**
>> * release_pcie_device - free PCI Express port service device structure
>> * @dev: Port service device to release
>> @@ -745,11 +748,39 @@ static void pcie_portdrv_shutdown(struct pci_dev *dev)
>> static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
>> pci_channel_state_t error)
>> {
>> + if (dev->aer_cap) {
>> + u32 status;
>> +
>> + pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_UNCOR_STATUS,
>> + &status);
>> +
>> + if (status & PCI_ERR_UNC_INTN) {
>> + atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
>> + AER_FATAL, (void *)dev);
>
> Don't think the cast is needed as always fine to implicitly cast to and from
> void * in C.
>
Ok.
>> + return PCI_ERS_RESULT_NEED_RESET;
>> + }
>> + }
>> +
>> if (error == pci_channel_io_frozen)
>> return PCI_ERS_RESULT_NEED_RESET;
>> return PCI_ERS_RESULT_CAN_RECOVER;
>> }
>>
>> +static void pcie_portdrv_cor_error_detected(struct pci_dev *dev)
>> +{
>> + u32 status;
>> +
>> + if (!dev->aer_cap)
>> + return;
>> +
>> + pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_COR_STATUS,
>> + &status);
>> +
>> + if (status & PCI_ERR_COR_INTERNAL)
>> + atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
>> + AER_CORRECTABLE, (void *)dev);
>
> No need for the cast.
>
Ok
Regards,
Terry
>> +}
>> +
>> static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev)
>> {
>> size_t off = offsetof(struct pcie_port_service_driver, slot_reset);
>> @@ -780,6 +811,7 @@ static const struct pci_device_id port_pci_ids[] = {
>>
>> static const struct pci_error_handlers pcie_portdrv_err_handler = {
>> .error_detected = pcie_portdrv_error_detected,
>> + .cor_error_detected = pcie_portdrv_cor_error_detected,
>> .slot_reset = pcie_portdrv_slot_reset,
>> .mmio_enabled = pcie_portdrv_mmio_enabled,
>> };
>> diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
>> index 12c89ea0313b..8a39197f0203 100644
>> --- a/drivers/pci/pcie/portdrv.h
>> +++ b/drivers/pci/pcie/portdrv.h
>> @@ -121,4 +121,6 @@ static inline void pcie_pme_interrupt_enable(struct pci_dev *dev, bool en) {}
>> #endif /* !CONFIG_PCIE_PME */
>>
>> struct device *pcie_port_find_device(struct pci_dev *dev, u32 service);
>> +
>> +extern struct atomic_notifier_head portdrv_aer_internal_err_chain;
>> #endif /* _PORTDRV_H_ */
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors()
2024-06-20 13:11 ` Jonathan Cameron
@ 2024-06-24 16:22 ` Terry Bowman
0 siblings, 0 replies; 29+ messages in thread
From: Terry Bowman @ 2024-06-24 16:22 UTC (permalink / raw)
To: Jonathan Cameron
Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci
Hi Jonathan,
I added a response inline below.
On 6/20/24 08:11, Jonathan Cameron wrote:
> On Mon, 17 Jun 2024 15:04:10 -0500
> Terry Bowman <terry.bowman@amd.com> wrote:
>
>> AER correctable internal errors (CIE) and AER uncorrectable internal
>> errors (UIE) are disabled through the AER mask register by default.[1]
>>
>> CXL PCIe ports use the CIE/UIE to report RAS errors and as a result
>> need CIE/UIE enabled.[2]
>>
>> Change pci_aer_unmask_internal_errors() function to be exported for
>> the CXL driver and other drivers to use.
>
> I've perhaps forgotten the end conclusion, but I thought there was
> a request to just try enabling this in general and mask it out only
> for known broken devices?
>
> Admittedly that's a more daring path, so maybe I hallucinated it!
>
I remember there was discussion. A quick search for PCI_ERR_COR_INTERNAL and
PCI_ERR_UNC_INTERNAL doesn't find any default enablement.
Regards,
Terry
>>
>> [1] PCI6.0 - 7.8.4.3 Uncorrectable
>> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and Upstream
>> Switch Ports
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: linux-pci@vger.kernel.org
>> ---
>> drivers/pci/pcie/aer.c | 3 ++-
>> include/linux/aer.h | 6 ++++++
>> 2 files changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index 4dc03cb9aff0..d7a1982f0c50 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -951,7 +951,7 @@ static bool find_source_device(struct pci_dev *parent,
>> * Note: AER must be enabled and supported by the device which must be
>> * checked in advance, e.g. with pcie_aer_is_native().
>> */
>> -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>> +void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>> {
>> int aer = dev->aer_cap;
>> u32 mask;
>> @@ -964,6 +964,7 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
>> mask &= ~PCI_ERR_COR_INTERNAL;
>> pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
>> }
>> +EXPORT_SYMBOL_GPL(pci_aer_unmask_internal_errors);
>>
>> static bool is_cxl_mem_dev(struct pci_dev *dev)
>> {
>> diff --git a/include/linux/aer.h b/include/linux/aer.h
>> index 4b97f38f3fcf..a4fd25ea0280 100644
>> --- a/include/linux/aer.h
>> +++ b/include/linux/aer.h
>> @@ -50,6 +50,12 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
>> static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
>> #endif
>>
>> +#ifdef CONFIG_PCIEAER_CXL
>> +void pci_aer_unmask_internal_errors(struct pci_dev *dev);
>> +#else
>> +static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
>> +#endif
>> +
>> void pci_print_aer(struct pci_dev *dev, int aer_severity,
>> struct aer_capability_regs *aer);
>> int cper_severity_to_aer(int cper_severity);
>
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers
2024-06-21 19:17 ` Dan Williams
@ 2024-06-24 17:56 ` Terry Bowman
2024-07-10 20:48 ` nifan.cxl
2024-08-19 18:35 ` Fan Ni
0 siblings, 2 replies; 29+ messages in thread
From: Terry Bowman @ 2024-06-24 17:56 UTC (permalink / raw)
To: Dan Williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
Yazen.Ghannam, Robert.Richter
Cc: Bjorn Helgaas, linux-pci
Hi Dan,
I added a response below.
On 6/21/24 14:17, Dan Williams wrote:
> Terry Bowman wrote:
>> The AER service driver does not currently call a handler for AER
>> uncorrectable errors (UCE) detected in root ports or downstream
>> ports. This is not needed in most cases because common PCIe port
>> functionality is handled by portdrv service drivers.
>>
>> CXL root ports include CXL specific RAS registers that need logging
>> before starting do_recovery() in the UCE case.
>>
>> Update the AER service driver to call the UCE handler for root ports
>> and downstream ports. These PCIe port devices are bound to the portdrv
>> driver that includes a CE and UCE handler to be called.
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: linux-pci@vger.kernel.org
>> ---
>> drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
>> 1 file changed, 20 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index 705893b5f7b0..a4db474b2be5 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>> pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>> struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>>
>> + /*
>> + * PCIe ports may include functionality beyond the standard
>> + * extended port capabilities. This may present a need to log and
>> + * handle errors not addressed in this driver. Examples are CXL
>> + * root ports and CXL downstream switch ports using AER UIE to
>> + * indicate CXL UCE RAS protocol errors.
>> + */
>> + if (type == PCI_EXP_TYPE_ROOT_PORT ||
>> + type == PCI_EXP_TYPE_DOWNSTREAM) {
>> + struct pci_driver *pdrv = dev->driver;
>> +
>> + if (pdrv && pdrv->err_handler &&
>> + pdrv->err_handler->error_detected) {
>> + const struct pci_error_handlers *err_handler;
>> +
>> + err_handler = pdrv->err_handler;
>> + status = err_handler->error_detected(dev, state);
>> + }
>> + }
>> +
>
> Would not a more appropriate place for this be pci_walk_bridge() where
> the ->subordinate == NULL and these type-check cases are unified?
It does. I can take a look at moving that.
Regards,
Terry
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register
2024-06-21 19:23 ` Dan Williams
@ 2024-06-24 18:00 ` Terry Bowman
0 siblings, 0 replies; 29+ messages in thread
From: Terry Bowman @ 2024-06-24 18:00 UTC (permalink / raw)
To: Dan Williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
Yazen.Ghannam, Robert.Richter
Cc: Bjorn Helgaas, linux-pci
Hi Dan, I added a response below.
On 6/21/24 14:23, Dan Williams wrote:
> Terry Bowman wrote:
>> The AER service driver clears the AER correctable error (CE) status before
>> calling the correctable error handler. This results in the error's status
>> not correctly reflected if read from the CE handler.
>>
>> The AER CE status is needed by the portdrv's CE handler. The portdrv's
>> CE handler is intended to only call the registered notifier callbacks
>> if the CE error status has correctable internal error (CIE) set.
>
> Is this a fix or a prep patch? It reads like a "fix", but there are no
> notifiers to worry about today.
>
I will add mention "in preparation for future patch".
>> This is not a problem for AER uncorrrectbale errors (UCE). The UCE status
>> is still present in the AER capability and available for reading, if
>> needed, when the UCE handler is called.
>>
>> Change the order of clearing the CE status and calling the CE handler.
>> Make it to call the CE handler first and then clear the CE status
>> after returning.
>
> With the changelog clarified to indicate whether this has any impact on
> current behavior you can add:
>
> Acked-by: Dan Williams <dan.j.williams@intel.com>
Regards,
Terry
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors
2024-06-21 19:36 ` Dan Williams
@ 2024-06-24 18:21 ` Terry Bowman
2024-06-24 21:46 ` Dan Williams
0 siblings, 1 reply; 29+ messages in thread
From: Terry Bowman @ 2024-06-24 18:21 UTC (permalink / raw)
To: Dan Williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
Yazen.Ghannam, Robert.Richter
Cc: Bjorn Helgaas, linux-pci
Hi Dan,
I added responses inline below.
On 6/21/24 14:36, Dan Williams wrote:
> Terry Bowman wrote:
>> PCIe port devices are bound to portdrv, the PCIe port bus driver. portdrv
>> does not implement an AER correctable handler (CE) but does implement the
>> AER uncorrectable error (UCE). The UCE handler is fairly straightforward
>> in that it only checks for frozen error state and returns the next step
>> for recovery accordingly.
>>
>> As a result, port devices relying on AER correctable internal errors (CIE)
>> and AER uncorrectable internal errors (UIE) will not be handled. Note,
>> the PCIe spec indicates AER CIE/UIE can be used to report implementation
>> specific errors.[1]
>>
>> CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
>> are examples of devices using the AER CIE/UIE for implementation specific
>> purposes. These CXL ports use the AER interrupt and AER CIE/UIE status to
>> report CXL RAS errors.[2]
>>
>> Add an atomic notifier to portdrv's CE/UCE handlers. Use the atomic
>> notifier to report CIE/UIE errors to the registered functions. This will
>> require adding a CE handler and updating the existing UCE handler.
>>
>> For the UCE handler, the CXL spec states UIE errors should return need
>> reset: "The only method of recovering from an Uncorrectable Internal Error
>> is reset or hardware replacement."[1]
>>
>> [1] PCI6.0 - 6.2.10 Internal Errors
>> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>> Upstream Switch Ports
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: linux-pci@vger.kernel.org
>> ---
>> drivers/pci/pcie/portdrv.c | 32 ++++++++++++++++++++++++++++++++
>> drivers/pci/pcie/portdrv.h | 2 ++
>> 2 files changed, 34 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
>> index 14a4b89a3b83..86d80e0e9606 100644
>> --- a/drivers/pci/pcie/portdrv.c
>> +++ b/drivers/pci/pcie/portdrv.c
>> @@ -37,6 +37,9 @@ struct portdrv_service_data {
>> u32 service;
>> };
>>
>> +ATOMIC_NOTIFIER_HEAD(portdrv_aer_internal_err_chain);
>> +EXPORT_SYMBOL_GPL(portdrv_aer_internal_err_chain);
>> +
>> /**
>> * release_pcie_device - free PCI Express port service device structure
>> * @dev: Port service device to release
>> @@ -745,11 +748,39 @@ static void pcie_portdrv_shutdown(struct pci_dev *dev)
>> static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
>> pci_channel_state_t error)
>> {
>> + if (dev->aer_cap) {
>> + u32 status;
>> +
>> + pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_UNCOR_STATUS,
>> + &status);
>> +
>> + if (status & PCI_ERR_UNC_INTN) {
>> + atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
>> + AER_FATAL, (void *)dev);
>> + return PCI_ERS_RESULT_NEED_RESET;
>> + }
>> + }
>> +
>
> Oh, this is a finer grained / lower-level location than I was
> expecting. I was expecting that the notifier was just conveying the port
> interrupt notification to a driver that knew how to take the next step.
> This pcie_portdrv_error_detected() is a notification that is already
> "downstream" of the AER notification.
>
My intent was to implement the UIE/CIE "implementation specific" behavior as
mentioned in the PCI spec. This included allowing port devices to be notified if
needed. This plan is not ideal but works within the PCI portdrv situation
and before we can introduce a CXL specific portdriver.
> If PCIe does not care about CIE and UIE then don't make it care, but
> redirect the notifications to the CXL side that may care.
>
> Leave the portdrv handlers PCIe native as much as possible.
>
> Now, I have not thought through the full implications of that
> suggestion, but for now am reacting to this AER -> PCIe err_handler ->
> CXL notfier as potentially more awkward than AER -> CXL notifier. It's a
> separate error handling domain that the PCIe side likely does not want
> to worry about. PCIe side is only responsible for allowing CXL to
> register for the notifications beacuse the AER interrupt is shared.
Hmmm, this sounds like either option#2 or introducing a CXL portdrv service
driver.
Thanks for the reviews and please let me know which option you
would like me to purse.
Regards,
Terry
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors
2024-06-24 18:21 ` Terry Bowman
@ 2024-06-24 21:46 ` Dan Williams
2024-06-25 14:41 ` Terry Bowman
0 siblings, 1 reply; 29+ messages in thread
From: Dan Williams @ 2024-06-24 21:46 UTC (permalink / raw)
To: Terry Bowman, Dan Williams, ira.weiny, dave, dave.jiang,
alison.schofield, ming4.li, vishal.l.verma, jim.harris,
ilpo.jarvinen, ardb, sathyanarayanan.kuppuswamy, linux-cxl,
linux-kernel, Yazen.Ghannam, Robert.Richter
Cc: Bjorn Helgaas, linux-pci
Terry Bowman wrote:
> Hi Dan,
>
> I added responses inline below.
>
> On 6/21/24 14:36, Dan Williams wrote:
> > Terry Bowman wrote:
> >> PCIe port devices are bound to portdrv, the PCIe port bus driver. portdrv
> >> does not implement an AER correctable handler (CE) but does implement the
> >> AER uncorrectable error (UCE). The UCE handler is fairly straightforward
> >> in that it only checks for frozen error state and returns the next step
> >> for recovery accordingly.
> >>
> >> As a result, port devices relying on AER correctable internal errors (CIE)
> >> and AER uncorrectable internal errors (UIE) will not be handled. Note,
> >> the PCIe spec indicates AER CIE/UIE can be used to report implementation
> >> specific errors.[1]
> >>
> >> CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
> >> are examples of devices using the AER CIE/UIE for implementation specific
> >> purposes. These CXL ports use the AER interrupt and AER CIE/UIE status to
> >> report CXL RAS errors.[2]
> >>
> >> Add an atomic notifier to portdrv's CE/UCE handlers. Use the atomic
> >> notifier to report CIE/UIE errors to the registered functions. This will
> >> require adding a CE handler and updating the existing UCE handler.
> >>
> >> For the UCE handler, the CXL spec states UIE errors should return need
> >> reset: "The only method of recovering from an Uncorrectable Internal Error
> >> is reset or hardware replacement."[1]
> >>
> >> [1] PCI6.0 - 6.2.10 Internal Errors
> >> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
> >> Upstream Switch Ports
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> Cc: Bjorn Helgaas <bhelgaas@google.com>
> >> Cc: linux-pci@vger.kernel.org
> >> ---
> >> drivers/pci/pcie/portdrv.c | 32 ++++++++++++++++++++++++++++++++
> >> drivers/pci/pcie/portdrv.h | 2 ++
> >> 2 files changed, 34 insertions(+)
> >>
> >> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> >> index 14a4b89a3b83..86d80e0e9606 100644
> >> --- a/drivers/pci/pcie/portdrv.c
> >> +++ b/drivers/pci/pcie/portdrv.c
> >> @@ -37,6 +37,9 @@ struct portdrv_service_data {
> >> u32 service;
> >> };
> >>
> >> +ATOMIC_NOTIFIER_HEAD(portdrv_aer_internal_err_chain);
> >> +EXPORT_SYMBOL_GPL(portdrv_aer_internal_err_chain);
> >> +
> >> /**
> >> * release_pcie_device - free PCI Express port service device structure
> >> * @dev: Port service device to release
> >> @@ -745,11 +748,39 @@ static void pcie_portdrv_shutdown(struct pci_dev *dev)
> >> static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
> >> pci_channel_state_t error)
> >> {
> >> + if (dev->aer_cap) {
> >> + u32 status;
> >> +
> >> + pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_UNCOR_STATUS,
> >> + &status);
> >> +
> >> + if (status & PCI_ERR_UNC_INTN) {
> >> + atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
> >> + AER_FATAL, (void *)dev);
> >> + return PCI_ERS_RESULT_NEED_RESET;
> >> + }
> >> + }
> >> +
> >
> > Oh, this is a finer grained / lower-level location than I was
> > expecting. I was expecting that the notifier was just conveying the port
> > interrupt notification to a driver that knew how to take the next step.
> > This pcie_portdrv_error_detected() is a notification that is already
> > "downstream" of the AER notification.
> >
>
> My intent was to implement the UIE/CIE "implementation specific" behavior as
> mentioned in the PCI spec. This included allowing port devices to be notified if
> needed. This plan is not ideal but works within the PCI portdrv situation
> and before we can introduce a CXL specific portdriver.
...but it really isn't implementation specific behavior like all the
other anonymous internal error cases. This is an open standard
definition that just happens to alias with the PCIe "internal"
notification mechanism.
>
> > If PCIe does not care about CIE and UIE then don't make it care, but
> > redirect the notifications to the CXL side that may care.
> >
> > Leave the portdrv handlers PCIe native as much as possible.
> >
> > Now, I have not thought through the full implications of that
> > suggestion, but for now am reacting to this AER -> PCIe err_handler ->
> > CXL notfier as potentially more awkward than AER -> CXL notifier. It's a
> > separate error handling domain that the PCIe side likely does not want
> > to worry about. PCIe side is only responsible for allowing CXL to
> > register for the notifications beacuse the AER interrupt is shared.
>
> Hmmm, this sounds like either option#2 or introducing a CXL portdrv service
> driver.
>
> Thanks for the reviews and please let me know which option you
> would like me to purse.
So after looking at this patchset I think calling the PCIe portdrv error
handler set for anything other than PCIe errors is likely a mistake. The
CXL protocol side of the house can experience errors that have no
relation to errors that PCIe needs to handle or care about.
I am thinking something like cxl_rch_handle_error() becomes
cxl_handle_error() and when that successfully handles the error then no
need to trigger pcie_do_recovery().
pcie_do_recovery() is too tightly scoped to error recovery that is
reasonable for PCIe links. That may not be reasonable to CXL devices
where protocol errors potentially implicate that a system memory
transaction failed. The blast radius of CXL protocol errors are not
constrained to single devices like the PCIe case.
With that change something like a new cxl_do_recovery() can operate on
the cxl_port topology and know that it has exclusive control of the
error handling registers.
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors
2024-06-24 21:46 ` Dan Williams
@ 2024-06-25 14:41 ` Terry Bowman
0 siblings, 0 replies; 29+ messages in thread
From: Terry Bowman @ 2024-06-25 14:41 UTC (permalink / raw)
To: Dan Williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
Yazen.Ghannam, Robert.Richter
Cc: Bjorn Helgaas, linux-pci
On 6/24/24 16:46, Dan Williams wrote:
> Terry Bowman wrote:
>> Hi Dan,
>>
>> I added responses inline below.
>>
>> On 6/21/24 14:36, Dan Williams wrote:
>>> Terry Bowman wrote:
>>>> PCIe port devices are bound to portdrv, the PCIe port bus driver. portdrv
>>>> does not implement an AER correctable handler (CE) but does implement the
>>>> AER uncorrectable error (UCE). The UCE handler is fairly straightforward
>>>> in that it only checks for frozen error state and returns the next step
>>>> for recovery accordingly.
>>>>
>>>> As a result, port devices relying on AER correctable internal errors (CIE)
>>>> and AER uncorrectable internal errors (UIE) will not be handled. Note,
>>>> the PCIe spec indicates AER CIE/UIE can be used to report implementation
>>>> specific errors.[1]
>>>>
>>>> CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
>>>> are examples of devices using the AER CIE/UIE for implementation specific
>>>> purposes. These CXL ports use the AER interrupt and AER CIE/UIE status to
>>>> report CXL RAS errors.[2]
>>>>
>>>> Add an atomic notifier to portdrv's CE/UCE handlers. Use the atomic
>>>> notifier to report CIE/UIE errors to the registered functions. This will
>>>> require adding a CE handler and updating the existing UCE handler.
>>>>
>>>> For the UCE handler, the CXL spec states UIE errors should return need
>>>> reset: "The only method of recovering from an Uncorrectable Internal Error
>>>> is reset or hardware replacement."[1]
>>>>
>>>> [1] PCI6.0 - 6.2.10 Internal Errors
>>>> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>>>> Upstream Switch Ports
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>>>> Cc: linux-pci@vger.kernel.org
>>>> ---
>>>> drivers/pci/pcie/portdrv.c | 32 ++++++++++++++++++++++++++++++++
>>>> drivers/pci/pcie/portdrv.h | 2 ++
>>>> 2 files changed, 34 insertions(+)
>>>>
>>>> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
>>>> index 14a4b89a3b83..86d80e0e9606 100644
>>>> --- a/drivers/pci/pcie/portdrv.c
>>>> +++ b/drivers/pci/pcie/portdrv.c
>>>> @@ -37,6 +37,9 @@ struct portdrv_service_data {
>>>> u32 service;
>>>> };
>>>>
>>>> +ATOMIC_NOTIFIER_HEAD(portdrv_aer_internal_err_chain);
>>>> +EXPORT_SYMBOL_GPL(portdrv_aer_internal_err_chain);
>>>> +
>>>> /**
>>>> * release_pcie_device - free PCI Express port service device structure
>>>> * @dev: Port service device to release
>>>> @@ -745,11 +748,39 @@ static void pcie_portdrv_shutdown(struct pci_dev *dev)
>>>> static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
>>>> pci_channel_state_t error)
>>>> {
>>>> + if (dev->aer_cap) {
>>>> + u32 status;
>>>> +
>>>> + pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_UNCOR_STATUS,
>>>> + &status);
>>>> +
>>>> + if (status & PCI_ERR_UNC_INTN) {
>>>> + atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
>>>> + AER_FATAL, (void *)dev);
>>>> + return PCI_ERS_RESULT_NEED_RESET;
>>>> + }
>>>> + }
>>>> +
>>>
>>> Oh, this is a finer grained / lower-level location than I was
>>> expecting. I was expecting that the notifier was just conveying the port
>>> interrupt notification to a driver that knew how to take the next step.
>>> This pcie_portdrv_error_detected() is a notification that is already
>>> "downstream" of the AER notification.
>>>
>>
>> My intent was to implement the UIE/CIE "implementation specific" behavior as
>> mentioned in the PCI spec. This included allowing port devices to be notified if
>> needed. This plan is not ideal but works within the PCI portdrv situation
>> and before we can introduce a CXL specific portdriver.
>
> ...but it really isn't implementation specific behavior like all the
> other anonymous internal error cases. This is an open standard
> definition that just happens to alias with the PCIe "internal"
> notification mechanism.
>
>>
>>> If PCIe does not care about CIE and UIE then don't make it care, but
>>> redirect the notifications to the CXL side that may care.
>>>
>>> Leave the portdrv handlers PCIe native as much as possible.
>>>
>>> Now, I have not thought through the full implications of that
>>> suggestion, but for now am reacting to this AER -> PCIe err_handler ->
>>> CXL notfier as potentially more awkward than AER -> CXL notifier. It's a
>>> separate error handling domain that the PCIe side likely does not want
>>> to worry about. PCIe side is only responsible for allowing CXL to
>>> register for the notifications beacuse the AER interrupt is shared.
>>
>> Hmmm, this sounds like either option#2 or introducing a CXL portdrv service
>> driver.
>>
>> Thanks for the reviews and please let me know which option you
>> would like me to purse.
>
> So after looking at this patchset I think calling the PCIe portdrv error
> handler set for anything other than PCIe errors is likely a mistake. The
> CXL protocol side of the house can experience errors that have no
> relation to errors that PCIe needs to handle or care about.
>
> I am thinking something like cxl_rch_handle_error() becomes
> cxl_handle_error() and when that successfully handles the error then no
> need to trigger pcie_do_recovery().
>
> pcie_do_recovery() is too tightly scoped to error recovery that is
> reasonable for PCIe links. That may not be reasonable to CXL devices
> where protocol errors potentially implicate that a system memory
> transaction failed. The blast radius of CXL protocol errors are not
> constrained to single devices like the PCIe case.
>
> With that change something like a new cxl_do_recovery() can operate on
> the cxl_port topology and know that it has exclusive control of the
> error handling registers.
Ok, I'll refactor the existing AER RCH downstream port handling to support
CXL USP, DSP, and RP as well. I can incorporate much of the feedback from
this RFC into the new patchset.
Thanks Dan.
Regards,
Terry
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors
2024-06-17 20:04 ` [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors Terry Bowman
2024-06-20 12:30 ` Jonathan Cameron
2024-06-21 19:36 ` Dan Williams
@ 2024-06-26 2:54 ` Li, Ming4
2024-06-26 13:39 ` Terry Bowman
2 siblings, 1 reply; 29+ messages in thread
From: Li, Ming4 @ 2024-06-26 2:54 UTC (permalink / raw)
To: Terry Bowman, Williams, Dan J, Weiny, Ira, dave@stgolabs.net,
Jiang, Dave, Schofield, Alison, Verma, Vishal L,
jim.harris@samsung.com, ilpo.jarvinen@linux.intel.com,
ardb@kernel.org, sathyanarayanan.kuppuswamy@linux.intel.com,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
Yazen.Ghannam@amd.com, Robert.Richter@amd.com
Cc: Bjorn Helgaas, linux-pci@vger.kernel.org
On 6/18/2024 4:04 AM, Terry Bowman wrote:
> PCIe port devices are bound to portdrv, the PCIe port bus driver. portdrv
> does not implement an AER correctable handler (CE) but does implement the
> AER uncorrectable error (UCE). The UCE handler is fairly straightforward
> in that it only checks for frozen error state and returns the next step
> for recovery accordingly.
>
> As a result, port devices relying on AER correctable internal errors (CIE)
> and AER uncorrectable internal errors (UIE) will not be handled. Note,
> the PCIe spec indicates AER CIE/UIE can be used to report implementation
> specific errors.[1]
>
> CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
> are examples of devices using the AER CIE/UIE for implementation specific
> purposes. These CXL ports use the AER interrupt and AER CIE/UIE status to
> report CXL RAS errors.[2]
>
> Add an atomic notifier to portdrv's CE/UCE handlers. Use the atomic
> notifier to report CIE/UIE errors to the registered functions. This will
> require adding a CE handler and updating the existing UCE handler.
>
> For the UCE handler, the CXL spec states UIE errors should return need
> reset: "The only method of recovering from an Uncorrectable Internal Error
> is reset or hardware replacement."[1]
>
> [1] PCI6.0 - 6.2.10 Internal Errors
> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
> Upstream Switch Ports
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> ---
> drivers/pci/pcie/portdrv.c | 32 ++++++++++++++++++++++++++++++++
> drivers/pci/pcie/portdrv.h | 2 ++
> 2 files changed, 34 insertions(+)
>
> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
> index 14a4b89a3b83..86d80e0e9606 100644
> --- a/drivers/pci/pcie/portdrv.c
> +++ b/drivers/pci/pcie/portdrv.c
> @@ -37,6 +37,9 @@ struct portdrv_service_data {
> u32 service;
> };
>
> +ATOMIC_NOTIFIER_HEAD(portdrv_aer_internal_err_chain);
> +EXPORT_SYMBOL_GPL(portdrv_aer_internal_err_chain);
> +
> /**
> * release_pcie_device - free PCI Express port service device structure
> * @dev: Port service device to release
> @@ -745,11 +748,39 @@ static void pcie_portdrv_shutdown(struct pci_dev *dev)
> static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
> pci_channel_state_t error)
> {
> + if (dev->aer_cap) {
> + u32 status;
> +
> + pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_UNCOR_STATUS,
> + &status);
> +
> + if (status & PCI_ERR_UNC_INTN) {
> + atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
> + AER_FATAL, (void *)dev);
> + return PCI_ERS_RESULT_NEED_RESET;
> + }
> + }
> +
> if (error == pci_channel_io_frozen)
> return PCI_ERS_RESULT_NEED_RESET;
> return PCI_ERS_RESULT_CAN_RECOVER;
> }
>
> +static void pcie_portdrv_cor_error_detected(struct pci_dev *dev)
> +{
> + u32 status;
> +
> + if (!dev->aer_cap)
> + return;
Seems like that dev->aer_cap checking is not needed for cor_error_detected, aer_get_device_error_info() already checked it and won't call handle_error_source() if device has not AER capability. But I am curious why pci_aer_handle_error() checks dev->aer_cap again after aer_get_device_error_info().
> +
> + pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_COR_STATUS,
> + &status);
> +
> + if (status & PCI_ERR_COR_INTERNAL)
> + atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
> + AER_CORRECTABLE, (void *)dev);
> +}
> +
> static pci_ers_result_t pcie_portdrv_slot_reset(struct pci_dev *dev)
> {
> size_t off = offsetof(struct pcie_port_service_driver, slot_reset);
> @@ -780,6 +811,7 @@ static const struct pci_device_id port_pci_ids[] = {
>
> static const struct pci_error_handlers pcie_portdrv_err_handler = {
> .error_detected = pcie_portdrv_error_detected,
> + .cor_error_detected = pcie_portdrv_cor_error_detected,
> .slot_reset = pcie_portdrv_slot_reset,
> .mmio_enabled = pcie_portdrv_mmio_enabled,
> };
> diff --git a/drivers/pci/pcie/portdrv.h b/drivers/pci/pcie/portdrv.h
> index 12c89ea0313b..8a39197f0203 100644
> --- a/drivers/pci/pcie/portdrv.h
> +++ b/drivers/pci/pcie/portdrv.h
> @@ -121,4 +121,6 @@ static inline void pcie_pme_interrupt_enable(struct pci_dev *dev, bool en) {}
> #endif /* !CONFIG_PCIE_PME */
>
> struct device *pcie_port_find_device(struct pci_dev *dev, u32 service);
> +
> +extern struct atomic_notifier_head portdrv_aer_internal_err_chain;
> #endif /* _PORTDRV_H_ */
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors
2024-06-26 2:54 ` Li, Ming4
@ 2024-06-26 13:39 ` Terry Bowman
0 siblings, 0 replies; 29+ messages in thread
From: Terry Bowman @ 2024-06-26 13:39 UTC (permalink / raw)
To: Li, Ming4, Williams, Dan J, Weiny, Ira, dave@stgolabs.net,
Jiang, Dave, Schofield, Alison, Verma, Vishal L,
jim.harris@samsung.com, ilpo.jarvinen@linux.intel.com,
ardb@kernel.org, sathyanarayanan.kuppuswamy@linux.intel.com,
linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
Yazen.Ghannam@amd.com, Robert.Richter@amd.com
Cc: Bjorn Helgaas, linux-pci@vger.kernel.org
On 6/25/24 21:54, Li, Ming4 wrote:
> On 6/18/2024 4:04 AM, Terry Bowman wrote:
>> PCIe port devices are bound to portdrv, the PCIe port bus driver. portdrv
>> does not implement an AER correctable handler (CE) but does implement the
>> AER uncorrectable error (UCE). The UCE handler is fairly straightforward
>> in that it only checks for frozen error state and returns the next step
>> for recovery accordingly.
>>
>> As a result, port devices relying on AER correctable internal errors (CIE)
>> and AER uncorrectable internal errors (UIE) will not be handled. Note,
>> the PCIe spec indicates AER CIE/UIE can be used to report implementation
>> specific errors.[1]
>>
>> CXL root ports, CXL downstream switch ports, and CXL upstream switch ports
>> are examples of devices using the AER CIE/UIE for implementation specific
>> purposes. These CXL ports use the AER interrupt and AER CIE/UIE status to
>> report CXL RAS errors.[2]
>>
>> Add an atomic notifier to portdrv's CE/UCE handlers. Use the atomic
>> notifier to report CIE/UIE errors to the registered functions. This will
>> require adding a CE handler and updating the existing UCE handler.
>>
>> For the UCE handler, the CXL spec states UIE errors should return need
>> reset: "The only method of recovering from an Uncorrectable Internal Error
>> is reset or hardware replacement."[1]
>>
>> [1] PCI6.0 - 6.2.10 Internal Errors
>> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and
>> Upstream Switch Ports
>>
>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>> Cc: linux-pci@vger.kernel.org
>> ---
>> drivers/pci/pcie/portdrv.c | 32 ++++++++++++++++++++++++++++++++
>> drivers/pci/pcie/portdrv.h | 2 ++
>> 2 files changed, 34 insertions(+)
>>
>> diff --git a/drivers/pci/pcie/portdrv.c b/drivers/pci/pcie/portdrv.c
>> index 14a4b89a3b83..86d80e0e9606 100644
>> --- a/drivers/pci/pcie/portdrv.c
>> +++ b/drivers/pci/pcie/portdrv.c
>> @@ -37,6 +37,9 @@ struct portdrv_service_data {
>> u32 service;
>> };
>>
>> +ATOMIC_NOTIFIER_HEAD(portdrv_aer_internal_err_chain);
>> +EXPORT_SYMBOL_GPL(portdrv_aer_internal_err_chain);
>> +
>> /**
>> * release_pcie_device - free PCI Express port service device structure
>> * @dev: Port service device to release
>> @@ -745,11 +748,39 @@ static void pcie_portdrv_shutdown(struct pci_dev *dev)
>> static pci_ers_result_t pcie_portdrv_error_detected(struct pci_dev *dev,
>> pci_channel_state_t error)
>> {
>> + if (dev->aer_cap) {
>> + u32 status;
>> +
>> + pci_read_config_dword(dev, dev->aer_cap + PCI_ERR_UNCOR_STATUS,
>> + &status);
>> +
>> + if (status & PCI_ERR_UNC_INTN) {
>> + atomic_notifier_call_chain(&portdrv_aer_internal_err_chain,
>> + AER_FATAL, (void *)dev);
>> + return PCI_ERS_RESULT_NEED_RESET;
>> + }
>> + }
>> +
>> if (error == pci_channel_io_frozen)
>> return PCI_ERS_RESULT_NEED_RESET;
>> return PCI_ERS_RESULT_CAN_RECOVER;
>> }
>>
>> +static void pcie_portdrv_cor_error_detected(struct pci_dev *dev)
>> +{
>> + u32 status;
>> +
>> + if (!dev->aer_cap)
>> + return;
>
> Seems like that dev->aer_cap checking is not needed for cor_error_detected, aer_get_device_error_info() already checked it and won't call handle_error_source() if device has not AER capability. But I am curious why pci_aer_handle_error() checks dev->aer_cap again after aer_get_device_error_info().
>
Hi Ming,
I agree this check should be removed.
Regards,
Terry
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers
2024-06-24 17:56 ` Terry Bowman
@ 2024-07-10 20:48 ` nifan.cxl
2024-07-10 21:48 ` Terry Bowman
2024-08-19 18:35 ` Fan Ni
1 sibling, 1 reply; 29+ messages in thread
From: nifan.cxl @ 2024-07-10 20:48 UTC (permalink / raw)
To: Terry Bowman
Cc: Dan Williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci
On Mon, Jun 24, 2024 at 12:56:29PM -0500, Terry Bowman wrote:
> Hi Dan,
>
> I added a response below.
>
> On 6/21/24 14:17, Dan Williams wrote:
> > Terry Bowman wrote:
> >> The AER service driver does not currently call a handler for AER
> >> uncorrectable errors (UCE) detected in root ports or downstream
> >> ports. This is not needed in most cases because common PCIe port
> >> functionality is handled by portdrv service drivers.
> >>
> >> CXL root ports include CXL specific RAS registers that need logging
> >> before starting do_recovery() in the UCE case.
> >>
> >> Update the AER service driver to call the UCE handler for root ports
> >> and downstream ports. These PCIe port devices are bound to the portdrv
> >> driver that includes a CE and UCE handler to be called.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> Cc: Bjorn Helgaas <bhelgaas@google.com>
> >> Cc: linux-pci@vger.kernel.org
> >> ---
> >> drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
> >> 1 file changed, 20 insertions(+)
> >>
> >> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> >> index 705893b5f7b0..a4db474b2be5 100644
> >> --- a/drivers/pci/pcie/err.c
> >> +++ b/drivers/pci/pcie/err.c
> >> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> >> pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> >> struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> >>
> >> + /*
> >> + * PCIe ports may include functionality beyond the standard
> >> + * extended port capabilities. This may present a need to log and
> >> + * handle errors not addressed in this driver. Examples are CXL
> >> + * root ports and CXL downstream switch ports using AER UIE to
> >> + * indicate CXL UCE RAS protocol errors.
> >> + */
> >> + if (type == PCI_EXP_TYPE_ROOT_PORT ||
> >> + type == PCI_EXP_TYPE_DOWNSTREAM) {
> >> + struct pci_driver *pdrv = dev->driver;
> >> +
> >> + if (pdrv && pdrv->err_handler &&
> >> + pdrv->err_handler->error_detected) {
> >> + const struct pci_error_handlers *err_handler;
> >> +
> >> + err_handler = pdrv->err_handler;
> >> + status = err_handler->error_detected(dev, state);
> >> + }
> >> + }
> >> +
> >
> > Would not a more appropriate place for this be pci_walk_bridge() where
> > the ->subordinate == NULL and these type-check cases are unified?
>
> It does. I can take a look at moving that.
Has that already been handled in pci_walk_bridge?
The function pci_walk_bridge() will call report_error_detected, where
the err handler will be called.
https://elixir.bootlin.com/linux/v6.10-rc6/source/drivers/pci/pcie/err.c#L80
Fan
>
> Regards,
> Terry
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors()
2024-06-17 20:04 ` [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors() Terry Bowman
2024-06-19 7:09 ` Christoph Hellwig
2024-06-20 13:11 ` Jonathan Cameron
@ 2024-07-10 21:47 ` Bjorn Helgaas
2 siblings, 0 replies; 29+ messages in thread
From: Bjorn Helgaas @ 2024-07-10 21:47 UTC (permalink / raw)
To: Terry Bowman
Cc: dan.j.williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci
On Mon, Jun 17, 2024 at 03:04:10PM -0500, Terry Bowman wrote:
> AER correctable internal errors (CIE) and AER uncorrectable internal
> errors (UIE) are disabled through the AER mask register by default.[1]
Nit: "Correctable Errors" and "Uncorrectable Errors" are generic PCIe
concepts that exist independent of AER, so I wouldn't prefix them with
AER. The AER mask registers control *reporting* of errors, but of
course they don't disable the errors themselves.
> CXL PCIe ports use the CIE/UIE to report RAS errors and as a result
> need CIE/UIE enabled.[2]
>
> Change pci_aer_unmask_internal_errors() function to be exported for
> the CXL driver and other drivers to use.
>
> [1] PCI6.0 - 7.8.4.3 Uncorrectable
s/PCI6.0 .../PCIe r6.0, sec 7.8.4.3/ since there is a conventional PCI
spec as well.
> [2] CXL3.1 - 12.2.2 CXL Root Ports, Downstream Switch Ports, and Upstream
> Switch Ports
>
> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> Cc: Bjorn Helgaas <bhelgaas@google.com>
> Cc: linux-pci@vger.kernel.org
> ---
> drivers/pci/pcie/aer.c | 3 ++-
> include/linux/aer.h | 6 ++++++
> 2 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 4dc03cb9aff0..d7a1982f0c50 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -951,7 +951,7 @@ static bool find_source_device(struct pci_dev *parent,
> * Note: AER must be enabled and supported by the device which must be
> * checked in advance, e.g. with pcie_aer_is_native().
> */
> -static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> +void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> {
> int aer = dev->aer_cap;
> u32 mask;
> @@ -964,6 +964,7 @@ static void pci_aer_unmask_internal_errors(struct pci_dev *dev)
> mask &= ~PCI_ERR_COR_INTERNAL;
> pci_write_config_dword(dev, aer + PCI_ERR_COR_MASK, mask);
> }
> +EXPORT_SYMBOL_GPL(pci_aer_unmask_internal_errors);
>
> static bool is_cxl_mem_dev(struct pci_dev *dev)
> {
> diff --git a/include/linux/aer.h b/include/linux/aer.h
> index 4b97f38f3fcf..a4fd25ea0280 100644
> --- a/include/linux/aer.h
> +++ b/include/linux/aer.h
> @@ -50,6 +50,12 @@ static inline int pci_aer_clear_nonfatal_status(struct pci_dev *dev)
> static inline int pcie_aer_is_native(struct pci_dev *dev) { return 0; }
> #endif
>
> +#ifdef CONFIG_PCIEAER_CXL
> +void pci_aer_unmask_internal_errors(struct pci_dev *dev);
> +#else
> +static inline void pci_aer_unmask_internal_errors(struct pci_dev *dev) { }
> +#endif
I don't like the idea of exporting a generic PCI interface that only
does something when CONFIG_PCIEAER_CXL is enabled. If there's ever a
non-CXL caller, it will be confused.
Bjorn
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers
2024-07-10 20:48 ` nifan.cxl
@ 2024-07-10 21:48 ` Terry Bowman
2024-07-11 1:14 ` fan
0 siblings, 1 reply; 29+ messages in thread
From: Terry Bowman @ 2024-07-10 21:48 UTC (permalink / raw)
To: nifan.cxl
Cc: Dan Williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci
Hi Fan,
On 7/10/24 15:48, nifan.cxl@gmail.com wrote:
> On Mon, Jun 24, 2024 at 12:56:29PM -0500, Terry Bowman wrote:
>> Hi Dan,
>>
>> I added a response below.
>>
>> On 6/21/24 14:17, Dan Williams wrote:
>>> Terry Bowman wrote:
>>>> The AER service driver does not currently call a handler for AER
>>>> uncorrectable errors (UCE) detected in root ports or downstream
>>>> ports. This is not needed in most cases because common PCIe port
>>>> functionality is handled by portdrv service drivers.
>>>>
>>>> CXL root ports include CXL specific RAS registers that need logging
>>>> before starting do_recovery() in the UCE case.
>>>>
>>>> Update the AER service driver to call the UCE handler for root ports
>>>> and downstream ports. These PCIe port devices are bound to the portdrv
>>>> driver that includes a CE and UCE handler to be called.
>>>>
>>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
>>>> Cc: Bjorn Helgaas <bhelgaas@google.com>
>>>> Cc: linux-pci@vger.kernel.org
>>>> ---
>>>> drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
>>>> 1 file changed, 20 insertions(+)
>>>>
>>>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>>>> index 705893b5f7b0..a4db474b2be5 100644
>>>> --- a/drivers/pci/pcie/err.c
>>>> +++ b/drivers/pci/pcie/err.c
>>>> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>>> pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>>>> struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>>>>
>>>> + /*
>>>> + * PCIe ports may include functionality beyond the standard
>>>> + * extended port capabilities. This may present a need to log and
>>>> + * handle errors not addressed in this driver. Examples are CXL
>>>> + * root ports and CXL downstream switch ports using AER UIE to
>>>> + * indicate CXL UCE RAS protocol errors.
>>>> + */
>>>> + if (type == PCI_EXP_TYPE_ROOT_PORT ||
>>>> + type == PCI_EXP_TYPE_DOWNSTREAM) {
>>>> + struct pci_driver *pdrv = dev->driver;
>>>> +
>>>> + if (pdrv && pdrv->err_handler &&
>>>> + pdrv->err_handler->error_detected) {
>>>> + const struct pci_error_handlers *err_handler;
>>>> +
>>>> + err_handler = pdrv->err_handler;
>>>> + status = err_handler->error_detected(dev, state);
>>>> + }
>>>> + }
>>>> +
>>>
>>> Would not a more appropriate place for this be pci_walk_bridge() where
>>> the ->subordinate == NULL and these type-check cases are unified?
>>
>> It does. I can take a look at moving that.
>
> Has that already been handled in pci_walk_bridge?
>
> The function pci_walk_bridge() will call report_error_detected, where
> the err handler will be called.
> https://elixir.bootlin.com/linux/v6.10-rc6/source/drivers/pci/pcie/err.c#L80
>
> Fan
>
You would think so but the UCE handler was not called in my testing for the PCIe
ports (RP,USP,DSP). The pci_walk_bridge() function has 2 cases:
- If there is a subordinate/secondary bus then the callback is called for
those downstream devices but not the port itself.
- If there is no subordinate/secondary bus then the callback is invoked for the
port itself.
The function header comment may explain it better:
/**
* pci_walk_bridge - walk bridges potentially AER affected
* @bridge: bridge which may be a Port, an RCEC, or an RCiEP
* @cb: callback to be called for each device found
* @userdata: arbitrary pointer to be passed to callback
*
* If the device provided is a bridge, walk the subordinate bus, including
* any bridged devices on buses under this bus. Call the provided callback
* on each device found.
*
* If the device provided has no subordinate bus, e.g., an RCEC or RCiEP,
* call the callback on the device itself.
*/
Regards,
Terry
>>
>> Regards,
>> Terry
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers
2024-07-10 21:48 ` Terry Bowman
@ 2024-07-11 1:14 ` fan
0 siblings, 0 replies; 29+ messages in thread
From: fan @ 2024-07-11 1:14 UTC (permalink / raw)
To: Terry Bowman
Cc: nifan.cxl, Dan Williams, ira.weiny, dave, dave.jiang,
alison.schofield, ming4.li, vishal.l.verma, jim.harris,
ilpo.jarvinen, ardb, sathyanarayanan.kuppuswamy, linux-cxl,
linux-kernel, Yazen.Ghannam, Robert.Richter, Bjorn Helgaas,
linux-pci
On Wed, Jul 10, 2024 at 04:48:09PM -0500, Terry Bowman wrote:
> Hi Fan,
>
> On 7/10/24 15:48, nifan.cxl@gmail.com wrote:
> > On Mon, Jun 24, 2024 at 12:56:29PM -0500, Terry Bowman wrote:
> >> Hi Dan,
> >>
> >> I added a response below.
> >>
> >> On 6/21/24 14:17, Dan Williams wrote:
> >>> Terry Bowman wrote:
> >>>> The AER service driver does not currently call a handler for AER
> >>>> uncorrectable errors (UCE) detected in root ports or downstream
> >>>> ports. This is not needed in most cases because common PCIe port
> >>>> functionality is handled by portdrv service drivers.
> >>>>
> >>>> CXL root ports include CXL specific RAS registers that need logging
> >>>> before starting do_recovery() in the UCE case.
> >>>>
> >>>> Update the AER service driver to call the UCE handler for root ports
> >>>> and downstream ports. These PCIe port devices are bound to the portdrv
> >>>> driver that includes a CE and UCE handler to be called.
> >>>>
> >>>> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >>>> Cc: Bjorn Helgaas <bhelgaas@google.com>
> >>>> Cc: linux-pci@vger.kernel.org
> >>>> ---
> >>>> drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
> >>>> 1 file changed, 20 insertions(+)
> >>>>
> >>>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> >>>> index 705893b5f7b0..a4db474b2be5 100644
> >>>> --- a/drivers/pci/pcie/err.c
> >>>> +++ b/drivers/pci/pcie/err.c
> >>>> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> >>>> pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> >>>> struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> >>>>
> >>>> + /*
> >>>> + * PCIe ports may include functionality beyond the standard
> >>>> + * extended port capabilities. This may present a need to log and
> >>>> + * handle errors not addressed in this driver. Examples are CXL
> >>>> + * root ports and CXL downstream switch ports using AER UIE to
> >>>> + * indicate CXL UCE RAS protocol errors.
> >>>> + */
> >>>> + if (type == PCI_EXP_TYPE_ROOT_PORT ||
> >>>> + type == PCI_EXP_TYPE_DOWNSTREAM) {
> >>>> + struct pci_driver *pdrv = dev->driver;
> >>>> +
> >>>> + if (pdrv && pdrv->err_handler &&
> >>>> + pdrv->err_handler->error_detected) {
> >>>> + const struct pci_error_handlers *err_handler;
> >>>> +
> >>>> + err_handler = pdrv->err_handler;
> >>>> + status = err_handler->error_detected(dev, state);
> >>>> + }
> >>>> + }
> >>>> +
> >>>
> >>> Would not a more appropriate place for this be pci_walk_bridge() where
> >>> the ->subordinate == NULL and these type-check cases are unified?
> >>
> >> It does. I can take a look at moving that.
> >
> > Has that already been handled in pci_walk_bridge?
> >
> > The function pci_walk_bridge() will call report_error_detected, where
> > the err handler will be called.
> > https://elixir.bootlin.com/linux/v6.10-rc6/source/drivers/pci/pcie/err.c#L80
> >
> > Fan
> >
>
> You would think so but the UCE handler was not called in my testing for the PCIe
> ports (RP,USP,DSP). The pci_walk_bridge() function has 2 cases:
> - If there is a subordinate/secondary bus then the callback is called for
> those downstream devices but not the port itself.
> - If there is no subordinate/secondary bus then the callback is invoked for the
> port itself.
>
> The function header comment may explain it better:
> /**
> * pci_walk_bridge - walk bridges potentially AER affected
> * @bridge: bridge which may be a Port, an RCEC, or an RCiEP
> * @cb: callback to be called for each device found
> * @userdata: arbitrary pointer to be passed to callback
> *
> * If the device provided is a bridge, walk the subordinate bus, including
> * any bridged devices on buses under this bus. Call the provided callback
> * on each device found.
> *
> * If the device provided has no subordinate bus, e.g., an RCEC or RCiEP,
> * call the callback on the device itself.
> */
>
> Regards,
> Terry
OK, interesting.
Btw, what is the "state" passed to pcie_do_recovery(...state...)?
Fan
>
> >>
> >> Regards,
> >> Terry
^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers
2024-06-24 17:56 ` Terry Bowman
2024-07-10 20:48 ` nifan.cxl
@ 2024-08-19 18:35 ` Fan Ni
1 sibling, 0 replies; 29+ messages in thread
From: Fan Ni @ 2024-08-19 18:35 UTC (permalink / raw)
To: Terry Bowman
Cc: Dan Williams, ira.weiny, dave, dave.jiang, alison.schofield,
ming4.li, vishal.l.verma, jim.harris, ilpo.jarvinen, ardb,
sathyanarayanan.kuppuswamy, linux-cxl, linux-kernel,
Yazen.Ghannam, Robert.Richter, Bjorn Helgaas, linux-pci,
a.manzanares
On Mon, Jun 24, 2024 at 12:56:29PM -0500, Terry Bowman wrote:
> Hi Dan,
>
> I added a response below.
>
> On 6/21/24 14:17, Dan Williams wrote:
> > Terry Bowman wrote:
> >> The AER service driver does not currently call a handler for AER
> >> uncorrectable errors (UCE) detected in root ports or downstream
> >> ports. This is not needed in most cases because common PCIe port
> >> functionality is handled by portdrv service drivers.
> >>
> >> CXL root ports include CXL specific RAS registers that need logging
> >> before starting do_recovery() in the UCE case.
> >>
> >> Update the AER service driver to call the UCE handler for root ports
> >> and downstream ports. These PCIe port devices are bound to the portdrv
> >> driver that includes a CE and UCE handler to be called.
> >>
> >> Signed-off-by: Terry Bowman <terry.bowman@amd.com>
> >> Cc: Bjorn Helgaas <bhelgaas@google.com>
> >> Cc: linux-pci@vger.kernel.org
> >> ---
> >> drivers/pci/pcie/err.c | 20 ++++++++++++++++++++
> >> 1 file changed, 20 insertions(+)
> >>
> >> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> >> index 705893b5f7b0..a4db474b2be5 100644
> >> --- a/drivers/pci/pcie/err.c
> >> +++ b/drivers/pci/pcie/err.c
> >> @@ -203,6 +203,26 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> >> pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> >> struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> >>
> >> + /*
> >> + * PCIe ports may include functionality beyond the standard
> >> + * extended port capabilities. This may present a need to log and
> >> + * handle errors not addressed in this driver. Examples are CXL
> >> + * root ports and CXL downstream switch ports using AER UIE to
> >> + * indicate CXL UCE RAS protocol errors.
> >> + */
> >> + if (type == PCI_EXP_TYPE_ROOT_PORT ||
> >> + type == PCI_EXP_TYPE_DOWNSTREAM) {
> >> + struct pci_driver *pdrv = dev->driver;
> >> +
> >> + if (pdrv && pdrv->err_handler &&
> >> + pdrv->err_handler->error_detected) {
> >> + const struct pci_error_handlers *err_handler;
> >> +
> >> + err_handler = pdrv->err_handler;
> >> + status = err_handler->error_detected(dev, state);
> >> + }
> >> + }
> >> +
> >
> > Would not a more appropriate place for this be pci_walk_bridge() where
> > the ->subordinate == NULL and these type-check cases are unified?
>
> It does. I can take a look at moving that.
>
Based on current code logic, the code added here will be executed as
long as the type matches (downstream port or root port), and I also
noticed the case ->subordinate == NULL never gets touched when I try to
inject an error through the aer_inject module and the user space tool.
If my way to do error injection is right, it means the behaviour will
get changed after the code move.
Here is some of my experimental setup:
QEMU + cxl topology (one type3 memdev directly attached to a HB with a
single root port).
1. Load the cxl related drivers before error injection
2. Do aer inject with aer_inject inside the QEMU VM
# aer_inject ~/nonfatal
aer inject input file looks like below
-----------------------------------------------------
fan:~/cxl/linux-fixes$ cat ~/nonfatal
# Inject an uncorrectable/non-fatal training error into the device
# with header log words 0 1 2 3.
#
# Either specify the PCI id on the command-line option or uncomment and edit
# the PCI_ID line below using the correct PCI ID.
#
# Note that system firmware/BIOS may mask certain errors, change their severity
# and/or not report header log words.
#
AER
PCI_ID 0000:0c:00.0
UNCOR_STATUS COMP_ABORT
HEADER_LOG 0 1 2 3
-----------------------------------------------------
The "lspci" output on the VM looks like below
----------------------------------------------------
Qemu: execute "lspci" on VM
00:00.0 Host bridge: Intel Corporation 82G33/G31/P35/P31 Express DRAM Controller
00:01.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:02.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
00:03.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
00:04.0 Unclassified device [0002]: Red Hat, Inc. Virtio filesystem
00:05.0 Host bridge: Red Hat, Inc. QEMU PCIe Expander bridge
00:1f.0 ISA bridge: Intel Corporation 82801IB (ICH9) LPC Interface Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation 82801IR/IO/IH (ICH9R/DO/DH) 6 port SATA Controller [AHCI mode] (rev 02)
00:1f.3 SMBus: Intel Corporation 82801I (ICH9 Family) SMBus Controller (rev 02)
0c:00.0 PCI bridge: Intel Corporation Device 7075
0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
--------------------------------------------------
Fan
> Regards,
> Terry
^ permalink raw reply [flat|nested] 29+ messages in thread
end of thread, other threads:[~2024-08-19 18:35 UTC | newest]
Thread overview: 29+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <20240617200411.1426554-1-terry.bowman@amd.com>
2024-06-17 20:04 ` [RFC PATCH 1/9] PCI/AER: Update AER driver to call root port and downstream port UCE handlers Terry Bowman
2024-06-20 11:21 ` Jonathan Cameron
2024-06-24 14:58 ` Terry Bowman
2024-06-21 19:17 ` Dan Williams
2024-06-24 17:56 ` Terry Bowman
2024-07-10 20:48 ` nifan.cxl
2024-07-10 21:48 ` Terry Bowman
2024-07-11 1:14 ` fan
2024-08-19 18:35 ` Fan Ni
2024-06-17 20:04 ` [RFC PATCH 2/9] PCI/AER: Call AER CE handler before clearing AER CE status register Terry Bowman
2024-06-20 11:31 ` Jonathan Cameron
2024-06-24 15:08 ` Terry Bowman
2024-06-21 19:23 ` Dan Williams
2024-06-24 18:00 ` Terry Bowman
2024-06-17 20:04 ` [RFC PATCH 3/9] PCI/portdrv: Update portdrv with an atomic notifier for reporting AER internal errors Terry Bowman
2024-06-20 12:30 ` Jonathan Cameron
2024-06-24 15:22 ` Terry Bowman
2024-06-21 19:36 ` Dan Williams
2024-06-24 18:21 ` Terry Bowman
2024-06-24 21:46 ` Dan Williams
2024-06-25 14:41 ` Terry Bowman
2024-06-26 2:54 ` Li, Ming4
2024-06-26 13:39 ` Terry Bowman
2024-06-17 20:04 ` [RFC PATCH 8/9] PCI/AER: Export pci_aer_unmask_internal_errors() Terry Bowman
2024-06-19 7:09 ` Christoph Hellwig
2024-06-19 15:40 ` Terry Bowman
2024-06-20 13:11 ` Jonathan Cameron
2024-06-24 16:22 ` Terry Bowman
2024-07-10 21:47 ` Bjorn Helgaas
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox