[PATCH v5 0/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered

* [PATCH v5 0/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered
@ 2025-09-17  6:33 Shuai Xue
  2025-09-17  6:33 ` [PATCH v5 1/3] PCI/DPC: Clarify naming for error port in DPC Handling Shuai Xue
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Shuai Xue @ 2025-09-17  6:33 UTC (permalink / raw)
  To: bhelgaas, mahesh, mani, Jonathan.Cameron,
	sathyanarayanan.kuppuswamy
  Cc: oohall, xueshuai, linux-pci, linux-kernel, linuxppc-dev

changes since v4:
- rebase to 6.17-rc6
- pick up Reviewed-by tag for PATCH[2] from Sathyanarayanan
- fix minor typos in commit log per Manivannan

changes since v3:
- squash patch 1 and 2 into one patch per Sathyanarayanan
- add a comment note for dpc_process_error() per Sathyanarayanan
- pick up Reviewed-by tag for PATCH[1] from Sathyanarayanan

changes since v2:
- move the "err_port" rename to a separate patch per Sathyanarayanan
- rewrite comments of dpc_process_error per Sathyanarayanan
- remove NULL initialization for err_dev per Sathyanarayanan

changes since v1:
- rewrite commit log per Bjorn
- refactor aer_get_device_error_info to reduce duplication per Keith
- fix to avoid reporting fatal errors twice for root and downstream ports per Keith

The AER driver has historically avoided reading the configuration space of an
endpoint or RCiEP that reported a fatal error, considering the link to that
device unreliable. Consequently, when a fatal error occurs, the AER and DPC
drivers do not report specific error types, resulting in logs like:

	pcieport 0015:00:00.0: EDR: EDR event received
	pcieport 0015:00:00.0: EDR: Reported EDR dev: 0015:00:00.0
	pcieport 0015:00:00.0: DPC: containment event, status:0x200d, ERR_FATAL received from 0015:01:00.0
	pcieport 0015:00:00.0: AER: broadcast error_detected message
	pcieport 0015:00:00.0: AER: broadcast mmio_enabled message
	pcieport 0015:00:00.0: AER: broadcast resume message
	pcieport 0015:00:00.0: pciehp: Slot(21): Link Down/Up ignored
	pcieport 0015:00:00.0: AER: device recovery successful
	pcieport 0015:00:00.0: EDR: DPC port successfully recovered
	pcieport 0015:00:00.0: EDR: Status for 0015:00:00.0: 0x80

AER status registers are sticky and Write-1-to-clear. If the link recovers
after a hot reset, the AER status and TLP header of the error device can
still be read safely. In that case, report the fatal error details, which
helps to identify the root cause.
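
To illustrate why the status survives the reset, here is a minimal
user-space sketch of a sticky, write-1-to-clear register. It is a model
only: struct w1c_reg and the w1c_*() helpers are hypothetical names for
this example and are not part of the kernel or of this series.

```c
#include <stdint.h>

/* Hypothetical model of a sticky, write-1-to-clear status register. */
struct w1c_reg {
	uint32_t value;
};

/* Hardware side: latch one or more error bits. */
static void w1c_latch(struct w1c_reg *reg, uint32_t bits)
{
	reg->value |= bits;
}

/*
 * Software write: every 1 bit in @bits clears the matching status bit,
 * every 0 bit leaves the status bit untouched.
 */
static void w1c_write(struct w1c_reg *reg, uint32_t bits)
{
	reg->value &= ~bits;
}

/*
 * Hot reset: sticky bits keep their value, so in this model the reset
 * deliberately does nothing to the register.
 */
static void w1c_hot_reset(struct w1c_reg *reg)
{
	(void)reg;
}
```

In this model, a bit latched before w1c_hot_reset() is still set
afterwards and only goes away once software writes a 1 to it, which is
the property the series relies on when it reads the endpoint after the
link has come back.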

After this patch, the logs look like:

	pcieport 0015:00:00.0: EDR: EDR event received
	pcieport 0015:00:00.0: EDR: Reported EDR dev: 0015:00:00.0
	pcieport 0015:00:00.0: DPC: containment event, status:0x200d, ERR_FATAL received from 0015:01:00.0
	pcieport 0015:00:00.0: AER: broadcast error_detected message
	vfio-pci 0015:01:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
	pcieport 0015:00:00.0: pciehp: Slot(21): Link Down/Up ignored
	vfio-pci 0015:01:00.0:   device [144d:a80a] error status/mask=00001000/00400000
	vfio-pci 0015:01:00.0:    [12] TLP                    (First)
	vfio-pci 0015:01:00.0: AER:   TLP Header: 0x4a004010 0x00000040 0x01000000 0xffffffff
	pcieport 0015:00:00.0: AER: broadcast mmio_enabled message
	pcieport 0015:00:00.0: AER: broadcast resume message
	pcieport 0015:00:00.0: AER: device recovery successful
	pcieport 0015:00:00.0: EDR: DPC port successfully recovered
	pcieport 0015:00:00.0: EDR: Status for 0015:00:00.0: 0x80
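
For readers decoding the raw values in the log above, the following
sketch extracts the DPC trigger fields from status 0x200d and finds the
bit set in the AER status/mask words. The mask values follow the DPC
Status layout used by the kernel's pci_regs.h; the macro, enum and
function names are made up for this example.

```c
#include <stdint.h>

/* Values follow the DPC Status register layout; names are illustrative. */
#define DPC_STATUS_TRIGGER      0x0001u /* Trigger Status, bit 0 */
#define DPC_STATUS_TRIGGER_RSN  0x0006u /* Trigger Reason, bits 2:1 */

/* Trigger Reason encodings after shifting the field down to bit 0. */
enum dpc_trigger_reason {
	DPC_RSN_UNCOR  = 0, /* unmasked uncorrectable error */
	DPC_RSN_NFE    = 1, /* ERR_NONFATAL received */
	DPC_RSN_FE     = 2, /* ERR_FATAL received */
	DPC_RSN_IN_EXT = 3, /* reason is in the extension field */
};

/* Return non-zero if the Trigger Status bit is set. */
static int dpc_triggered(uint16_t status)
{
	return (status & DPC_STATUS_TRIGGER) != 0;
}

/* Extract the 2-bit Trigger Reason field. */
static enum dpc_trigger_reason dpc_reason(uint16_t status)
{
	return (enum dpc_trigger_reason)((status & DPC_STATUS_TRIGGER_RSN) >> 1);
}

/* Return the index of the lowest set bit, or -1 if @value is zero. */
static int lowest_set_bit(uint32_t value)
{
	for (int bit = 0; bit < 32; bit++)
		if (value & (UINT32_C(1) << bit))
			return bit;
	return -1;
}
```

With status 0x200d the trigger bit is set and the reason field decodes
to DPC_RSN_FE, matching "ERR_FATAL received" in the log. The AER status
00001000 has only bit 12 set, which is the "[12] TLP" line, and the mask
00400000 corresponds to bit 22.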

Shuai Xue (3):
  PCI/DPC: Clarify naming for error port in DPC Handling
  PCI/DPC: Run recovery on device that detected the error
  PCI/AER: Report fatal errors of RCiEP and EP if link recovered

 drivers/pci/pci.h      |  5 +++--
 drivers/pci/pcie/aer.c | 11 +++++++----
 drivers/pci/pcie/dpc.c | 31 ++++++++++++++++++++++++-------
 drivers/pci/pcie/edr.c | 35 ++++++++++++++++++-----------------
 drivers/pci/pcie/err.c | 11 +++++++++++
 5 files changed, 63 insertions(+), 30 deletions(-)

-- 
2.39.3



* [PATCH v5 1/3] PCI/DPC: Clarify naming for error port in DPC Handling
  2025-09-17  6:33 [PATCH v5 0/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered Shuai Xue
@ 2025-09-17  6:33 ` Shuai Xue
  2025-09-17  6:33 ` [PATCH v5 2/3] PCI/DPC: Run recovery on device that detected the error Shuai Xue
  2025-09-17  6:33 ` [PATCH v5 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered Shuai Xue
  2 siblings, 0 replies; 9+ messages in thread
From: Shuai Xue @ 2025-09-17  6:33 UTC (permalink / raw)
  To: bhelgaas, mahesh, mani, Jonathan.Cameron,
	sathyanarayanan.kuppuswamy
  Cc: oohall, xueshuai, linux-pci, linux-kernel, linuxppc-dev

dpc_handler() is registered for the error port that receives the DPC
interrupt, and acpi_dpc_port_get() locates the port that experienced the
containment event.

Rename edev and pdev to err_port for clarity, so that a later patch can
avoid misusing err_port in pcie_do_recovery().

No functional changes intended.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/pci/pcie/dpc.c | 10 +++++-----
 drivers/pci/pcie/edr.c | 34 +++++++++++++++++-----------------
 2 files changed, 22 insertions(+), 22 deletions(-)

diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index fc18349614d7..bff29726c6a5 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -361,21 +361,21 @@ static bool dpc_is_surprise_removal(struct pci_dev *pdev)
 
 static irqreturn_t dpc_handler(int irq, void *context)
 {
-	struct pci_dev *pdev = context;
+	struct pci_dev *err_port = context;
 
 	/*
 	 * According to PCIe r6.0 sec 6.7.6, errors are an expected side effect
 	 * of async removal and should be ignored by software.
 	 */
-	if (dpc_is_surprise_removal(pdev)) {
-		dpc_handle_surprise_removal(pdev);
+	if (dpc_is_surprise_removal(err_port)) {
+		dpc_handle_surprise_removal(err_port);
 		return IRQ_HANDLED;
 	}
 
-	dpc_process_error(pdev);
+	dpc_process_error(err_port);
 
 	/* We configure DPC so it only triggers on ERR_FATAL */
-	pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link);
+	pcie_do_recovery(err_port, pci_channel_io_frozen, dpc_reset_link);
 
 	return IRQ_HANDLED;
 }
diff --git a/drivers/pci/pcie/edr.c b/drivers/pci/pcie/edr.c
index e86298dbbcff..521fca2f40cb 100644
--- a/drivers/pci/pcie/edr.c
+++ b/drivers/pci/pcie/edr.c
@@ -150,7 +150,7 @@ static int acpi_send_edr_status(struct pci_dev *pdev, struct pci_dev *edev,
 
 static void edr_handle_event(acpi_handle handle, u32 event, void *data)
 {
-	struct pci_dev *pdev = data, *edev;
+	struct pci_dev *pdev = data, *err_port;
 	pci_ers_result_t estate = PCI_ERS_RESULT_DISCONNECT;
 	u16 status;
 
@@ -169,36 +169,36 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
 	 * may be that port or a parent of it (PCI Firmware r3.3, sec
 	 * 4.6.13).
 	 */
-	edev = acpi_dpc_port_get(pdev);
-	if (!edev) {
+	err_port = acpi_dpc_port_get(pdev);
+	if (!err_port) {
 		pci_err(pdev, "Firmware failed to locate DPC port\n");
 		return;
 	}
 
-	pci_dbg(pdev, "Reported EDR dev: %s\n", pci_name(edev));
+	pci_dbg(pdev, "Reported EDR dev: %s\n", pci_name(err_port));
 
 	/* If port does not support DPC, just send the OST */
-	if (!edev->dpc_cap) {
-		pci_err(edev, FW_BUG "This device doesn't support DPC\n");
+	if (!err_port->dpc_cap) {
+		pci_err(err_port, FW_BUG "This device doesn't support DPC\n");
 		goto send_ost;
 	}
 
 	/* Check if there is a valid DPC trigger */
-	pci_read_config_word(edev, edev->dpc_cap + PCI_EXP_DPC_STATUS, &status);
+	pci_read_config_word(err_port, err_port->dpc_cap + PCI_EXP_DPC_STATUS, &status);
 	if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) {
-		pci_err(edev, "Invalid DPC trigger %#010x\n", status);
+		pci_err(err_port, "Invalid DPC trigger %#010x\n", status);
 		goto send_ost;
 	}
 
-	dpc_process_error(edev);
-	pci_aer_raw_clear_status(edev);
+	dpc_process_error(err_port);
+	pci_aer_raw_clear_status(err_port);
 
 	/*
 	 * Irrespective of whether the DPC event is triggered by ERR_FATAL
 	 * or ERR_NONFATAL, since the link is already down, use the FATAL
 	 * error recovery path for both cases.
 	 */
-	estate = pcie_do_recovery(edev, pci_channel_io_frozen, dpc_reset_link);
+	estate = pcie_do_recovery(err_port, pci_channel_io_frozen, dpc_reset_link);
 
 send_ost:
 
@@ -207,15 +207,15 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
 	 * to firmware. If not successful, send _OST(0xF, BDF << 16 | 0x81).
 	 */
 	if (estate == PCI_ERS_RESULT_RECOVERED) {
-		pci_dbg(edev, "DPC port successfully recovered\n");
-		pcie_clear_device_status(edev);
-		acpi_send_edr_status(pdev, edev, EDR_OST_SUCCESS);
+		pci_dbg(err_port, "DPC port successfully recovered\n");
+		pcie_clear_device_status(err_port);
+		acpi_send_edr_status(pdev, err_port, EDR_OST_SUCCESS);
 	} else {
-		pci_dbg(edev, "DPC port recovery failed\n");
-		acpi_send_edr_status(pdev, edev, EDR_OST_FAILED);
+		pci_dbg(err_port, "DPC port recovery failed\n");
+		acpi_send_edr_status(pdev, err_port, EDR_OST_FAILED);
 	}
 
-	pci_dev_put(edev);
+	pci_dev_put(err_port);
 }
 
 void pci_acpi_add_edr_notifier(struct pci_dev *pdev)
-- 
2.39.3



* [PATCH v5 2/3] PCI/DPC: Run recovery on device that detected the error
  2025-09-17  6:33 [PATCH v5 0/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered Shuai Xue
  2025-09-17  6:33 ` [PATCH v5 1/3] PCI/DPC: Clarify naming for error port in DPC Handling Shuai Xue
@ 2025-09-17  6:33 ` Shuai Xue
  2025-09-17 22:01   ` kernel test robot
  2025-09-17  6:33 ` [PATCH v5 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered Shuai Xue
  2 siblings, 1 reply; 9+ messages in thread
From: Shuai Xue @ 2025-09-17  6:33 UTC (permalink / raw)
  To: bhelgaas, mahesh, mani, Jonathan.Cameron,
	sathyanarayanan.kuppuswamy
  Cc: oohall, xueshuai, linux-pci, linux-kernel, linuxppc-dev

The current implementation of pcie_do_recovery() assumes that the
recovery process is executed for the device that detected the error.
However, the DPC driver currently passes the error port that experienced
the DPC event to pcie_do_recovery().

Use the Source ID register to identify the device that detected the
error. When given the error device, pcie_do_recovery() finds the
upstream bridge and walks the bridges potentially affected by the AER
event. Subsequent commits can then accurately access the AER status of
the error device.

No functional changes are expected.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/pci/pci.h      |  2 +-
 drivers/pci/pcie/dpc.c | 25 +++++++++++++++++++++----
 drivers/pci/pcie/edr.c |  7 ++++---
 3 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 34f65d69662e..de2f07cefa72 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -654,7 +654,7 @@ struct rcec_ea {
 void pci_save_dpc_state(struct pci_dev *dev);
 void pci_restore_dpc_state(struct pci_dev *dev);
 void pci_dpc_init(struct pci_dev *pdev);
-void dpc_process_error(struct pci_dev *pdev);
+struct pci_dev *dpc_process_error(struct pci_dev *pdev);
 pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
 bool pci_dpc_recovered(struct pci_dev *pdev);
 unsigned int dpc_tlp_log_len(struct pci_dev *dev);
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index bff29726c6a5..f6069f621683 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -260,10 +260,20 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,
 	return 1;
 }
 
-void dpc_process_error(struct pci_dev *pdev)
+/**
+ * dpc_process_error - handle the DPC error status
+ * @pdev: the port that experienced the containment event
+ *
+ * Return: the device that detected the error.
+ *
+ * NOTE: The device reference count is increased, the caller must decrement
+ * the reference count by calling pci_dev_put().
+ */
+struct pci_dev *dpc_process_error(struct pci_dev *pdev)
 {
 	u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
 	struct aer_err_info info = {};
+	struct pci_dev *err_dev;
 
 	pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
 
@@ -279,6 +289,7 @@ void dpc_process_error(struct pci_dev *pdev)
 			pci_aer_clear_nonfatal_status(pdev);
 			pci_aer_clear_fatal_status(pdev);
 		}
+		err_dev = pci_dev_get(pdev);
 		break;
 	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE:
 	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE:
@@ -290,6 +301,8 @@ void dpc_process_error(struct pci_dev *pdev)
 				"ERR_FATAL" : "ERR_NONFATAL",
 			 pci_domain_nr(pdev->bus), PCI_BUS_NUM(source),
 			 PCI_SLOT(source), PCI_FUNC(source));
+		err_dev = pci_get_domain_bus_and_slot(pci_domain_nr(pdev->bus),
+					    PCI_BUS_NUM(source), source & 0xff);
 		break;
 	case PCI_EXP_DPC_STATUS_TRIGGER_RSN_IN_EXT:
 		ext_reason = status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT;
@@ -304,8 +317,11 @@ void dpc_process_error(struct pci_dev *pdev)
 		if (ext_reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_RP_PIO &&
 		    pdev->dpc_rp_extensions)
 			dpc_process_rp_pio_error(pdev);
+		err_dev = pci_dev_get(pdev);
 		break;
 	}
+
+	return err_dev;
 }
 
 static void pci_clear_surpdn_errors(struct pci_dev *pdev)
@@ -361,7 +377,7 @@ static bool dpc_is_surprise_removal(struct pci_dev *pdev)
 
 static irqreturn_t dpc_handler(int irq, void *context)
 {
-	struct pci_dev *err_port = context;
+	struct pci_dev *err_port = context, *err_dev;
 
 	/*
 	 * According to PCIe r6.0 sec 6.7.6, errors are an expected side effect
@@ -372,10 +388,11 @@ static irqreturn_t dpc_handler(int irq, void *context)
 		return IRQ_HANDLED;
 	}
 
-	dpc_process_error(err_port);
+	err_dev = dpc_process_error(err_port);
 
 	/* We configure DPC so it only triggers on ERR_FATAL */
-	pcie_do_recovery(err_port, pci_channel_io_frozen, dpc_reset_link);
+	pcie_do_recovery(err_dev, pci_channel_io_frozen, dpc_reset_link);
+	pci_dev_put(err_dev);
 
 	return IRQ_HANDLED;
 }
diff --git a/drivers/pci/pcie/edr.c b/drivers/pci/pcie/edr.c
index 521fca2f40cb..3f971bb04433 100644
--- a/drivers/pci/pcie/edr.c
+++ b/drivers/pci/pcie/edr.c
@@ -150,7 +150,7 @@ static int acpi_send_edr_status(struct pci_dev *pdev, struct pci_dev *edev,
 
 static void edr_handle_event(acpi_handle handle, u32 event, void *data)
 {
-	struct pci_dev *pdev = data, *err_port;
+	struct pci_dev *pdev = data, *err_port, *err_dev;
 	pci_ers_result_t estate = PCI_ERS_RESULT_DISCONNECT;
 	u16 status;
 
@@ -190,7 +190,7 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
 		goto send_ost;
 	}
 
-	dpc_process_error(err_port);
+	err_dev = dpc_process_error(err_port);
 	pci_aer_raw_clear_status(err_port);
 
 	/*
@@ -198,7 +198,7 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
 	 * or ERR_NONFATAL, since the link is already down, use the FATAL
 	 * error recovery path for both cases.
 	 */
-	estate = pcie_do_recovery(err_port, pci_channel_io_frozen, dpc_reset_link);
+	estate = pcie_do_recovery(err_dev, pci_channel_io_frozen, dpc_reset_link);
 
 send_ost:
 
@@ -215,6 +215,7 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
 		acpi_send_edr_status(pdev, err_port, EDR_OST_FAILED);
 	}
 
+	pci_dev_put(err_dev);
 	pci_dev_put(err_port);
 }
 
-- 
2.39.3
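
The patch above looks up the error device from the 16-bit Source ID with
PCI_BUS_NUM(source) and source & 0xff. As a rough illustration of that
layout, the sketch below splits a Source ID into bus, device and
function in user space. struct bdf and both helpers are hypothetical
names for this example; only the bit layout is taken from the patch.

```c
#include <stdint.h>

/* Hypothetical container for a decoded Source ID. */
struct bdf {
	uint8_t bus; /* bits 15:8 */
	uint8_t dev; /* bits 7:3 */
	uint8_t fn;  /* bits 2:0 */
};

/* Split a 16-bit Source ID (requester ID) into bus/device/function. */
static struct bdf decode_source_id(uint16_t source)
{
	struct bdf id = {
		.bus = (uint8_t)((source >> 8) & 0xff),
		.dev = (uint8_t)((source >> 3) & 0x1f),
		.fn  = (uint8_t)(source & 0x07),
	};

	return id;
}

/* The combined devfn byte, i.e. what "source & 0xff" yields in the patch. */
static uint8_t source_devfn(uint16_t source)
{
	return (uint8_t)(source & 0xff);
}
```

A Source ID of 0x0100 decodes to bus 01, device 00, function 0, which is
the 0015:01:00.0 endpoint named in the cover letter's log once the
domain is added. Note that pci_get_domain_bus_and_slot() returns NULL
when no matching device is found, so callers of dpc_process_error()
presumably need to handle a NULL err_dev.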



* [PATCH v5 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered
  2025-09-17  6:33 [PATCH v5 0/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered Shuai Xue
  2025-09-17  6:33 ` [PATCH v5 1/3] PCI/DPC: Clarify naming for error port in DPC Handling Shuai Xue
  2025-09-17  6:33 ` [PATCH v5 2/3] PCI/DPC: Run recovery on device that detected the error Shuai Xue
@ 2025-09-17  6:33 ` Shuai Xue
  2025-09-17 19:09   ` Kuppuswamy Sathyanarayanan
  2025-09-18 20:33   ` Bjorn Helgaas
  2 siblings, 2 replies; 9+ messages in thread
From: Shuai Xue @ 2025-09-17  6:33 UTC (permalink / raw)
  To: bhelgaas, mahesh, mani, Jonathan.Cameron,
	sathyanarayanan.kuppuswamy
  Cc: oohall, xueshuai, linux-pci, linux-kernel, linuxppc-dev

The AER driver has historically avoided reading the configuration space of
an endpoint or RCiEP that reported a fatal error, considering the link to
that device unreliable. Consequently, when a fatal error occurs, the AER
and DPC drivers do not report specific error types, resulting in logs like:

	pcieport 0015:00:00.0: EDR: EDR event received
	pcieport 0015:00:00.0: EDR: Reported EDR dev: 0015:00:00.0
	pcieport 0015:00:00.0: DPC: containment event, status:0x200d, ERR_FATAL received from 0015:01:00.0
	pcieport 0015:00:00.0: AER: broadcast error_detected message
	pcieport 0015:00:00.0: AER: broadcast mmio_enabled message
	pcieport 0015:00:00.0: AER: broadcast resume message
	pcieport 0015:00:00.0: pciehp: Slot(21): Link Down/Up ignored
	pcieport 0015:00:00.0: AER: device recovery successful
	pcieport 0015:00:00.0: EDR: DPC port successfully recovered
	pcieport 0015:00:00.0: EDR: Status for 0015:00:00.0: 0x80

AER status registers are sticky and Write-1-to-clear. If the link recovers
after a hot reset, the AER status and TLP header of the error device can
still be read safely. In that case, report the fatal error details, which
helps to identify the root cause.

After this patch, the logs look like:

	pcieport 0015:00:00.0: EDR: EDR event received
	pcieport 0015:00:00.0: EDR: Reported EDR dev: 0015:00:00.0
	pcieport 0015:00:00.0: DPC: containment event, status:0x200d, ERR_FATAL received from 0015:01:00.0
	pcieport 0015:00:00.0: AER: broadcast error_detected message
	vfio-pci 0015:01:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
	pcieport 0015:00:00.0: pciehp: Slot(21): Link Down/Up ignored
	vfio-pci 0015:01:00.0:   device [144d:a80a] error status/mask=00001000/00400000
	vfio-pci 0015:01:00.0:    [12] TLP                    (First)
	vfio-pci 0015:01:00.0: AER:   TLP Header: 0x4a004010 0x00000040 0x01000000 0xffffffff
	pcieport 0015:00:00.0: AER: broadcast mmio_enabled message
	pcieport 0015:00:00.0: AER: broadcast resume message
	pcieport 0015:00:00.0: AER: device recovery successful
	pcieport 0015:00:00.0: EDR: DPC port successfully recovered
	pcieport 0015:00:00.0: EDR: Status for 0015:00:00.0: 0x80

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
 drivers/pci/pci.h      |  3 ++-
 drivers/pci/pcie/aer.c | 11 +++++++----
 drivers/pci/pcie/dpc.c |  2 +-
 drivers/pci/pcie/err.c | 11 +++++++++++
 4 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index de2f07cefa72..b8d364545e7d 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -629,7 +629,8 @@ struct aer_err_info {
 	struct pcie_tlp_log tlp;	/* TLP Header */
 };
 
-int aer_get_device_error_info(struct aer_err_info *info, int i);
+int aer_get_device_error_info(struct aer_err_info *info, int i,
+			      bool link_healthy);
 void aer_print_error(struct aer_err_info *info, int i);
 
 int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e286c197d716..157ad7fb44a0 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1351,12 +1351,14 @@ EXPORT_SYMBOL_GPL(aer_recover_queue);
  * aer_get_device_error_info - read error status from dev and store it to info
  * @info: pointer to structure to store the error record
  * @i: index into info->dev[]
+ * @link_healthy: link is healthy or not
  *
  * Return: 1 on success, 0 on error.
  *
  * Note that @info is reused among all error devices. Clear fields properly.
  */
-int aer_get_device_error_info(struct aer_err_info *info, int i)
+int aer_get_device_error_info(struct aer_err_info *info, int i,
+			      bool link_healthy)
 {
 	struct pci_dev *dev;
 	int type, aer;
@@ -1387,7 +1389,8 @@ int aer_get_device_error_info(struct aer_err_info *info, int i)
 	} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
 		   type == PCI_EXP_TYPE_RC_EC ||
 		   type == PCI_EXP_TYPE_DOWNSTREAM ||
-		   info->severity == AER_NONFATAL) {
+		   info->severity == AER_NONFATAL ||
+		   (info->severity == AER_FATAL && link_healthy)) {
 
 		/* Link is still healthy for IO reads */
 		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,
@@ -1420,11 +1423,11 @@ static inline void aer_process_err_devices(struct aer_err_info *e_info)
 
 	/* Report all before handling them, to not lose records by reset etc. */
 	for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
-		if (aer_get_device_error_info(e_info, i))
+		if (aer_get_device_error_info(e_info, i, false))
 			aer_print_error(e_info, i);
 	}
 	for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
-		if (aer_get_device_error_info(e_info, i))
+		if (aer_get_device_error_info(e_info, i, false))
 			handle_error_source(e_info->dev[i], e_info);
 	}
 }
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index f6069f621683..21c4e8371279 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -284,7 +284,7 @@ struct pci_dev *dpc_process_error(struct pci_dev *pdev)
 		pci_warn(pdev, "containment event, status:%#06x: unmasked uncorrectable error detected\n",
 			 status);
 		if (dpc_get_aer_uncorrect_severity(pdev, &info) &&
-		    aer_get_device_error_info(&info, 0)) {
+		    aer_get_device_error_info(&info, 0, false)) {
 			aer_print_error(&info, 0);
 			pci_aer_clear_nonfatal_status(pdev);
 			pci_aer_clear_fatal_status(pdev);
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index de6381c690f5..744d77ee7271 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -196,6 +196,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	struct pci_dev *bridge;
 	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
 	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
+	struct aer_err_info info;
 
 	/*
 	 * If the error was detected by a Root Port, Downstream Port, RCEC,
@@ -223,6 +224,15 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 			pci_warn(bridge, "subordinate device reset failed\n");
 			goto failed;
 		}
+
+		info.dev[0] = dev;
+		info.level = KERN_ERR;
+		info.severity = AER_FATAL;
+		/* Link recovered, report fatal errors of RCiEP or EP */
+		if ((type == PCI_EXP_TYPE_ENDPOINT ||
+		     type == PCI_EXP_TYPE_RC_END) &&
+		    aer_get_device_error_info(&info, 0, true))
+			aer_print_error(&info, 0);
 	} else {
 		pci_walk_bridge(bridge, report_normal_detected, &status);
 	}
@@ -259,6 +269,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
 	if (host->native_aer || pcie_ports_native) {
 		pcie_clear_device_status(dev);
 		pci_aer_clear_nonfatal_status(dev);
+		pci_aer_clear_fatal_status(dev);
 	}
 
 	pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);
-- 
2.39.3
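
The condition changed in aer_get_device_error_info() can be read as a
small predicate: when is it considered safe to read the uncorrectable
status from the device? The sketch below restates it in user space. The
enums and the function name are hypothetical, and the early return for
correctable errors is an assumption about the branch that precedes the
one shown in the hunk.

```c
#include <stdbool.h>

/* Illustrative stand-ins for the PCI_EXP_TYPE_* and AER_* constants. */
enum port_type { TYPE_ENDPOINT, TYPE_RC_END, TYPE_ROOT_PORT,
		 TYPE_DOWNSTREAM, TYPE_RC_EC };
enum severity { SEV_CORRECTABLE, SEV_NONFATAL, SEV_FATAL };

/*
 * Mirror of the "else if" condition after the patch: ports on the
 * upstream side of the link and non-fatal errors are always readable;
 * a fatal error on an endpoint or RCiEP is readable only if the caller
 * states that the link is healthy again.
 */
static bool may_read_uncor_status(enum port_type type, enum severity sev,
				  bool link_healthy)
{
	/* Assumed: correctable errors are handled by an earlier branch. */
	if (sev == SEV_CORRECTABLE)
		return false;

	return type == TYPE_ROOT_PORT ||
	       type == TYPE_RC_EC ||
	       type == TYPE_DOWNSTREAM ||
	       sev == SEV_NONFATAL ||
	       (sev == SEV_FATAL && link_healthy);
}
```

Existing callers pass false and so keep the old behaviour; only the new
call in pcie_do_recovery(), made after the link reset succeeded, passes
true for an endpoint or RCiEP.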



* Re: [PATCH v5 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered
  2025-09-17  6:33 ` [PATCH v5 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered Shuai Xue
@ 2025-09-17 19:09   ` Kuppuswamy Sathyanarayanan
  2025-09-18  2:45     ` Shuai Xue
  2025-09-18 20:33   ` Bjorn Helgaas
  1 sibling, 1 reply; 9+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2025-09-17 19:09 UTC (permalink / raw)
  To: Shuai Xue, bhelgaas, mahesh, mani, Jonathan.Cameron
  Cc: oohall, linux-pci, linux-kernel, linuxppc-dev


On 9/16/25 23:33, Shuai Xue wrote:
> The AER driver has historically avoided reading the configuration space of
> an endpoint or RCiEP that reported a fatal error, considering the link to
> that device unreliable. Consequently, when a fatal error occurs, the AER
> and DPC drivers do not report specific error types, resulting in logs like:
>
> 	pcieport 0015:00:00.0: EDR: EDR event received
> 	pcieport 0015:00:00.0: EDR: Reported EDR dev: 0015:00:00.0
> 	pcieport 0015:00:00.0: DPC: containment event, status:0x200d, ERR_FATAL received from 0015:01:00.0
> 	pcieport 0015:00:00.0: AER: broadcast error_detected message
> 	pcieport 0015:00:00.0: AER: broadcast mmio_enabled message
> 	pcieport 0015:00:00.0: AER: broadcast resume message
> 	pcieport 0015:00:00.0: pciehp: Slot(21): Link Down/Up ignored
> 	pcieport 0015:00:00.0: AER: device recovery successful
> 	pcieport 0015:00:00.0: EDR: DPC port successfully recovered
> 	pcieport 0015:00:00.0: EDR: Status for 0015:00:00.0: 0x80
>
> AER status registers are sticky and Write-1-to-clear. If the link recovered
> after hot reset, we can still safely access AER status and TLP header of the
> error device. In such case, report fatal errors which helps to figure out the
> error root case.
>
> After this patch, the logs like:
>
> 	pcieport 0015:00:00.0: EDR: EDR event received
> 	pcieport 0015:00:00.0: EDR: Reported EDR dev: 0015:00:00.0
> 	pcieport 0015:00:00.0: DPC: containment event, status:0x200d, ERR_FATAL received from 0015:01:00.0
> 	pcieport 0015:00:00.0: AER: broadcast error_detected message
> 	vfio-pci 0015:01:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> 	pcieport 0015:00:00.0: pciehp: Slot(21): Link Down/Up ignored
> 	vfio-pci 0015:01:00.0:   device [144d:a80a] error status/mask=00001000/00400000
> 	vfio-pci 0015:01:00.0:    [12] TLP                    (First)
> 	vfio-pci 0015:01:00.0: AER:   TLP Header: 0x4a004010 0x00000040 0x01000000 0xffffffff
> 	pcieport 0015:00:00.0: AER: broadcast mmio_enabled message
> 	pcieport 0015:00:00.0: AER: broadcast resume message
> 	pcieport 0015:00:00.0: AER: device recovery successful
> 	pcieport 0015:00:00.0: EDR: DPC port successfully recovered
> 	pcieport 0015:00:00.0: EDR: Status for 0015:00:00.0: 0x80
>
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> ---
>   drivers/pci/pci.h      |  3 ++-
>   drivers/pci/pcie/aer.c | 11 +++++++----
>   drivers/pci/pcie/dpc.c |  2 +-
>   drivers/pci/pcie/err.c | 11 +++++++++++
>   4 files changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index de2f07cefa72..b8d364545e7d 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -629,7 +629,8 @@ struct aer_err_info {
>   	struct pcie_tlp_log tlp;	/* TLP Header */
>   };
>   
> -int aer_get_device_error_info(struct aer_err_info *info, int i);
> +int aer_get_device_error_info(struct aer_err_info *info, int i,
> +			      bool link_healthy);
>   void aer_print_error(struct aer_err_info *info, int i);
>   
>   int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index e286c197d716..157ad7fb44a0 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1351,12 +1351,14 @@ EXPORT_SYMBOL_GPL(aer_recover_queue);
>    * aer_get_device_error_info - read error status from dev and store it to info
>    * @info: pointer to structure to store the error record
>    * @i: index into info->dev[]
> + * @link_healthy: link is healthy or not
>    *
>    * Return: 1 on success, 0 on error.
>    *
>    * Note that @info is reused among all error devices. Clear fields properly.
>    */
> -int aer_get_device_error_info(struct aer_err_info *info, int i)
> +int aer_get_device_error_info(struct aer_err_info *info, int i,
> +			      bool link_healthy)
>   {
>   	struct pci_dev *dev;
>   	int type, aer;
> @@ -1387,7 +1389,8 @@ int aer_get_device_error_info(struct aer_err_info *info, int i)
>   	} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
>   		   type == PCI_EXP_TYPE_RC_EC ||
>   		   type == PCI_EXP_TYPE_DOWNSTREAM ||
> -		   info->severity == AER_NONFATAL) {
> +		   info->severity == AER_NONFATAL ||
> +		   (info->severity == AER_FATAL && link_healthy)) {
>   
>   		/* Link is still healthy for IO reads */
>   		pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,
> @@ -1420,11 +1423,11 @@ static inline void aer_process_err_devices(struct aer_err_info *e_info)
>   
>   	/* Report all before handling them, to not lose records by reset etc. */
>   	for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
> -		if (aer_get_device_error_info(e_info, i))
> +		if (aer_get_device_error_info(e_info, i, false))
>   			aer_print_error(e_info, i);
>   	}
>   	for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
> -		if (aer_get_device_error_info(e_info, i))
> +		if (aer_get_device_error_info(e_info, i, false))
>   			handle_error_source(e_info->dev[i], e_info);
>   	}
>   }
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index f6069f621683..21c4e8371279 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -284,7 +284,7 @@ struct pci_dev *dpc_process_error(struct pci_dev *pdev)
>   		pci_warn(pdev, "containment event, status:%#06x: unmasked uncorrectable error detected\n",
>   			 status);
>   		if (dpc_get_aer_uncorrect_severity(pdev, &info) &&
> -		    aer_get_device_error_info(&info, 0)) {
> +		    aer_get_device_error_info(&info, 0, false)) {
>   			aer_print_error(&info, 0);
>   			pci_aer_clear_nonfatal_status(pdev);
>   			pci_aer_clear_fatal_status(pdev);
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index de6381c690f5..744d77ee7271 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -196,6 +196,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>   	struct pci_dev *bridge;
>   	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>   	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> +	struct aer_err_info info;
>   
>   	/*
>   	 * If the error was detected by a Root Port, Downstream Port, RCEC,
> @@ -223,6 +224,15 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>   			pci_warn(bridge, "subordinate device reset failed\n");
>   			goto failed;
>   		}
> +
> +		info.dev[0] = dev;
> +		info.level = KERN_ERR;
> +		info.severity = AER_FATAL;
> +		/* Link recovered, report fatal errors of RCiEP or EP */
> +		if ((type == PCI_EXP_TYPE_ENDPOINT ||
> +		     type == PCI_EXP_TYPE_RC_END) &&
> +		    aer_get_device_error_info(&info, 0, true))
> +			aer_print_error(&info, 0);
>   	} else {
>   		pci_walk_bridge(bridge, report_normal_detected, &status);
>   	}
> @@ -259,6 +269,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>   	if (host->native_aer || pcie_ports_native) {
>   		pcie_clear_device_status(dev);
>   		pci_aer_clear_nonfatal_status(dev);
> +		pci_aer_clear_fatal_status(dev);

The above change looks unrelated to dumping the error info. It would be
better to move it to a separate patch.

>   	}
>   
>   	pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);


* Re: [PATCH v5 2/3] PCI/DPC: Run recovery on device that detected the error
  2025-09-17  6:33 ` [PATCH v5 2/3] PCI/DPC: Run recovery on device that detected the error Shuai Xue
@ 2025-09-17 22:01   ` kernel test robot
  0 siblings, 0 replies; 9+ messages in thread
From: kernel test robot @ 2025-09-17 22:01 UTC (permalink / raw)
  To: Shuai Xue, bhelgaas, mahesh, mani, Jonathan.Cameron,
	sathyanarayanan.kuppuswamy
  Cc: oe-kbuild-all, oohall, xueshuai, linux-pci, linux-kernel,
	linuxppc-dev

Hi Shuai,

kernel test robot noticed the following build warnings:

[auto build test WARNING on pci/for-linus]
[also build test WARNING on linus/master v6.17-rc6 next-20250917]
[cannot apply to pci/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Shuai-Xue/PCI-DPC-Clarify-naming-for-error-port-in-DPC-Handling/20250917-143459
base:   https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git for-linus
patch link:    https://lore.kernel.org/r/20250917063352.19429-3-xueshuai%40linux.alibaba.com
patch subject: [PATCH v5 2/3] PCI/DPC: Run recovery on device that detected the error
config: loongarch-allyesconfig (https://download.01.org/0day-ci/archive/20250918/202509180501.eB8FJ5Vt-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 7c861bcedf61607b6c087380ac711eb7ff918ca6)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250918/202509180501.eB8FJ5Vt-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509180501.eB8FJ5Vt-lkp@intel.com/

All warnings (new ones prefixed by >>):

   In file included from <built-in>:3:
   In file included from include/linux/compiler_types.h:171:
   include/linux/compiler-clang.h:28:9: warning: '__SANITIZE_ADDRESS__' macro redefined [-Wmacro-redefined]
      28 | #define __SANITIZE_ADDRESS__
         |         ^
   <built-in>:371:9: note: previous definition is here
     371 | #define __SANITIZE_ADDRESS__ 1
         |         ^
>> drivers/pci/pcie/edr.c:188:6: warning: variable 'err_dev' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
     188 |         if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) {
         |             ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/pci/pcie/edr.c:218:14: note: uninitialized use occurs here
     218 |         pci_dev_put(err_dev);
         |                     ^~~~~~~
   drivers/pci/pcie/edr.c:188:2: note: remove the 'if' if its condition is always false
     188 |         if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) {
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     189 |                 pci_err(err_port, "Invalid DPC trigger %#010x\n", status);
         |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     190 |                 goto send_ost;
         |                 ~~~~~~~~~~~~~~
     191 |         }
         |         ~
   drivers/pci/pcie/edr.c:181:6: warning: variable 'err_dev' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
     181 |         if (!err_port->dpc_cap) {
         |             ^~~~~~~~~~~~~~~~~~
   drivers/pci/pcie/edr.c:218:14: note: uninitialized use occurs here
     218 |         pci_dev_put(err_dev);
         |                     ^~~~~~~
   drivers/pci/pcie/edr.c:181:2: note: remove the 'if' if its condition is always false
     181 |         if (!err_port->dpc_cap) {
         |         ^~~~~~~~~~~~~~~~~~~~~~~~~
     182 |                 pci_err(err_port, FW_BUG "This device doesn't support DPC\n");
         |                 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     183 |                 goto send_ost;
         |                 ~~~~~~~~~~~~~~
     184 |         }
         |         ~
   drivers/pci/pcie/edr.c:153:50: note: initialize the variable 'err_dev' to silence this warning
     153 |         struct pci_dev *pdev = data, *err_port, *err_dev;
         |                                                         ^
         |                                                          = NULL
   3 warnings generated.
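
The warning arises because both early "goto send_ost" paths reach pci_dev_put(err_dev) before err_dev has been assigned. The sketch below is a userspace model of that control flow, not the edr.c code: struct fake_dev, fake_dev_put() and handle_event() are hypothetical stand-ins, and it assumes dpc_process_error() hands back a device with a reference held. It shows the initialization clang suggests, which relies on the put helper ignoring NULL, as pci_dev_put() does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev; only a refcount is modelled. */
struct fake_dev {
	int refcount;
};

/* Mirrors the NULL-tolerant contract of pci_dev_put(): NULL is a no-op. */
static void fake_dev_put(struct fake_dev *dev)
{
	if (dev)
		dev->refcount--;
}

/*
 * Model of edr_handle_event(): two early exits jump to the cleanup label
 * before err_dev is assigned. Returns 1 if the recovery step was reached.
 */
static int handle_event(struct fake_dev *err_port, int has_dpc_cap,
			int triggered, struct fake_dev *detected)
{
	struct fake_dev *err_dev = NULL;	/* the initialization clang suggests */
	int recovered = 0;

	assert(err_port != NULL);	/* the real code returns earlier if it is NULL */

	if (!has_dpc_cap)
		goto send_ost;
	if (!triggered)
		goto send_ost;

	/* Stands in for dpc_process_error() returning a referenced device. */
	err_dev = detected;
	err_dev->refcount++;
	recovered = 1;

send_ost:
	fake_dev_put(err_dev);	/* safe on every path: err_dev is NULL or valid */
	fake_dev_put(err_port);
	return recovered;
}
```

An alternative with the same effect would be a second cleanup label that skips the put of err_dev on the early paths; the NULL initialization is simply the smaller change.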


vim +188 drivers/pci/pcie/edr.c

ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  150  
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  151  static void edr_handle_event(acpi_handle handle, u32 event, void *data)
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  152  {
267102466d7b592 Shuai Xue                  2025-09-17  153  	struct pci_dev *pdev = data, *err_port, *err_dev;
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  154  	pci_ers_result_t estate = PCI_ERS_RESULT_DISCONNECT;
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  155  	u16 status;
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  156  
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  157  	if (event != ACPI_NOTIFY_DISCONNECT_RECOVER)
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  158  		return;
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  159  
774820b362b07b9 Bjorn Helgaas              2023-04-07  160  	/*
774820b362b07b9 Bjorn Helgaas              2023-04-07  161  	 * pdev is a Root Port or Downstream Port that is still present and
774820b362b07b9 Bjorn Helgaas              2023-04-07  162  	 * has triggered a containment event, e.g., DPC, so its child
774820b362b07b9 Bjorn Helgaas              2023-04-07  163  	 * devices have been disconnected (ACPI r6.5, sec 5.6.6).
774820b362b07b9 Bjorn Helgaas              2023-04-07  164  	 */
af03958da0678c3 Kuppuswamy Sathyanarayanan 2020-04-15  165  	pci_info(pdev, "EDR event received\n");
af03958da0678c3 Kuppuswamy Sathyanarayanan 2020-04-15  166  
774820b362b07b9 Bjorn Helgaas              2023-04-07  167  	/*
774820b362b07b9 Bjorn Helgaas              2023-04-07  168  	 * Locate the port that experienced the containment event.  pdev
774820b362b07b9 Bjorn Helgaas              2023-04-07  169  	 * may be that port or a parent of it (PCI Firmware r3.3, sec
774820b362b07b9 Bjorn Helgaas              2023-04-07  170  	 * 4.6.13).
774820b362b07b9 Bjorn Helgaas              2023-04-07  171  	 */
a56b1e47845b946 Shuai Xue                  2025-09-17  172  	err_port = acpi_dpc_port_get(pdev);
a56b1e47845b946 Shuai Xue                  2025-09-17  173  	if (!err_port) {
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  174  		pci_err(pdev, "Firmware failed to locate DPC port\n");
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  175  		return;
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  176  	}
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  177  
a56b1e47845b946 Shuai Xue                  2025-09-17  178  	pci_dbg(pdev, "Reported EDR dev: %s\n", pci_name(err_port));
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  179  
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  180  	/* If port does not support DPC, just send the OST */
a56b1e47845b946 Shuai Xue                  2025-09-17  181  	if (!err_port->dpc_cap) {
a56b1e47845b946 Shuai Xue                  2025-09-17  182  		pci_err(err_port, FW_BUG "This device doesn't support DPC\n");
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  183  		goto send_ost;
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  184  	}
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  185  
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  186  	/* Check if there is a valid DPC trigger */
a56b1e47845b946 Shuai Xue                  2025-09-17  187  	pci_read_config_word(err_port, err_port->dpc_cap + PCI_EXP_DPC_STATUS, &status);
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 @188  	if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) {
a56b1e47845b946 Shuai Xue                  2025-09-17  189  		pci_err(err_port, "Invalid DPC trigger %#010x\n", status);
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  190  		goto send_ost;
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  191  	}
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  192  
267102466d7b592 Shuai Xue                  2025-09-17  193  	err_dev = dpc_process_error(err_port);
a56b1e47845b946 Shuai Xue                  2025-09-17  194  	pci_aer_raw_clear_status(err_port);
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  195  
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  196  	/*
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  197  	 * Irrespective of whether the DPC event is triggered by ERR_FATAL
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  198  	 * or ERR_NONFATAL, since the link is already down, use the FATAL
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  199  	 * error recovery path for both cases.
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  200  	 */
267102466d7b592 Shuai Xue                  2025-09-17  201  	estate = pcie_do_recovery(err_dev, pci_channel_io_frozen, dpc_reset_link);
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  202  
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  203  send_ost:
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  204  
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  205  	/*
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  206  	 * If recovery is successful, send _OST(0xF, BDF << 16 | 0x80)
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  207  	 * to firmware. If not successful, send _OST(0xF, BDF << 16 | 0x81).
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  208  	 */
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  209  	if (estate == PCI_ERS_RESULT_RECOVERED) {
a56b1e47845b946 Shuai Xue                  2025-09-17  210  		pci_dbg(err_port, "DPC port successfully recovered\n");
a56b1e47845b946 Shuai Xue                  2025-09-17  211  		pcie_clear_device_status(err_port);
a56b1e47845b946 Shuai Xue                  2025-09-17  212  		acpi_send_edr_status(pdev, err_port, EDR_OST_SUCCESS);
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  213  	} else {
a56b1e47845b946 Shuai Xue                  2025-09-17  214  		pci_dbg(err_port, "DPC port recovery failed\n");
a56b1e47845b946 Shuai Xue                  2025-09-17  215  		acpi_send_edr_status(pdev, err_port, EDR_OST_FAILED);
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  216  	}
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  217  
267102466d7b592 Shuai Xue                  2025-09-17  218  	pci_dev_put(err_dev);
a56b1e47845b946 Shuai Xue                  2025-09-17  219  	pci_dev_put(err_port);
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  220  }
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23  221  

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


* Re: [PATCH v5 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd
  2025-09-17 19:09   ` Kuppuswamy Sathyanarayanan
@ 2025-09-18  2:45     ` Shuai Xue
  0 siblings, 0 replies; 9+ messages in thread
From: Shuai Xue @ 2025-09-18  2:45 UTC (permalink / raw)
  To: Kuppuswamy Sathyanarayanan, bhelgaas, mahesh, mani,
	Jonathan.Cameron
  Cc: oohall, linux-pci, linux-kernel, linuxppc-dev



On 2025/9/18 03:09, Kuppuswamy Sathyanarayanan wrote:
> 
> On 9/16/25 23:33, Shuai Xue wrote:
>> The AER driver has historically avoided reading the configuration space of
>> an endpoint or RCiEP that reported a fatal error, considering the link to
>> that device unreliable. Consequently, when a fatal error occurs, the AER
>> and DPC drivers do not report specific error types, resulting in logs like:
>>
>>     pcieport 0015:00:00.0: EDR: EDR event received
>>     pcieport 0015:00:00.0: EDR: Reported EDR dev: 0015:00:00.0
>>     pcieport 0015:00:00.0: DPC: containment event, status:0x200d, ERR_FATAL received from 0015:01:00.0
>>     pcieport 0015:00:00.0: AER: broadcast error_detected message
>>     pcieport 0015:00:00.0: AER: broadcast mmio_enabled message
>>     pcieport 0015:00:00.0: AER: broadcast resume message
>>     pcieport 0015:00:00.0: pciehp: Slot(21): Link Down/Up ignored
>>     pcieport 0015:00:00.0: AER: device recovery successful
>>     pcieport 0015:00:00.0: EDR: DPC port successfully recovered
>>     pcieport 0015:00:00.0: EDR: Status for 0015:00:00.0: 0x80
>>
>> AER status registers are sticky and Write-1-to-clear. If the link recovered
>> after hot reset, we can still safely access AER status and TLP header of the
>> error device. In such case, report fatal errors which helps to figure out the
>> error root cause.
>>
>> After this patch, the logs look like:
>>
>>     pcieport 0015:00:00.0: EDR: EDR event received
>>     pcieport 0015:00:00.0: EDR: Reported EDR dev: 0015:00:00.0
>>     pcieport 0015:00:00.0: DPC: containment event, status:0x200d, ERR_FATAL received from 0015:01:00.0
>>     pcieport 0015:00:00.0: AER: broadcast error_detected message
>>     vfio-pci 0015:01:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>>     pcieport 0015:00:00.0: pciehp: Slot(21): Link Down/Up ignored
>>     vfio-pci 0015:01:00.0:   device [144d:a80a] error status/mask=00001000/00400000
>>     vfio-pci 0015:01:00.0:    [12] TLP                    (First)
>>     vfio-pci 0015:01:00.0: AER:   TLP Header: 0x4a004010 0x00000040 0x01000000 0xffffffff
>>     pcieport 0015:00:00.0: AER: broadcast mmio_enabled message
>>     pcieport 0015:00:00.0: AER: broadcast resume message
>>     pcieport 0015:00:00.0: AER: device recovery successful
>>     pcieport 0015:00:00.0: EDR: DPC port successfully recovered
>>     pcieport 0015:00:00.0: EDR: Status for 0015:00:00.0: 0x80
>>
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>> ---
>>   drivers/pci/pci.h      |  3 ++-
>>   drivers/pci/pcie/aer.c | 11 +++++++----
>>   drivers/pci/pcie/dpc.c |  2 +-
>>   drivers/pci/pcie/err.c | 11 +++++++++++
>>   4 files changed, 21 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index de2f07cefa72..b8d364545e7d 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -629,7 +629,8 @@ struct aer_err_info {
>>       struct pcie_tlp_log tlp;    /* TLP Header */
>>   };
>> -int aer_get_device_error_info(struct aer_err_info *info, int i);
>> +int aer_get_device_error_info(struct aer_err_info *info, int i,
>> +                  bool link_healthy);
>>   void aer_print_error(struct aer_err_info *info, int i);
>>   int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index e286c197d716..157ad7fb44a0 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -1351,12 +1351,14 @@ EXPORT_SYMBOL_GPL(aer_recover_queue);
>>    * aer_get_device_error_info - read error status from dev and store it to info
>>    * @info: pointer to structure to store the error record
>>    * @i: index into info->dev[]
>> + * @link_healthy: link is healthy or not
>>    *
>>    * Return: 1 on success, 0 on error.
>>    *
>>    * Note that @info is reused among all error devices. Clear fields properly.
>>    */
>> -int aer_get_device_error_info(struct aer_err_info *info, int i)
>> +int aer_get_device_error_info(struct aer_err_info *info, int i,
>> +                  bool link_healthy)
>>   {
>>       struct pci_dev *dev;
>>       int type, aer;
>> @@ -1387,7 +1389,8 @@ int aer_get_device_error_info(struct aer_err_info *info, int i)
>>       } else if (type == PCI_EXP_TYPE_ROOT_PORT ||
>>              type == PCI_EXP_TYPE_RC_EC ||
>>              type == PCI_EXP_TYPE_DOWNSTREAM ||
>> -           info->severity == AER_NONFATAL) {
>> +           info->severity == AER_NONFATAL ||
>> +           (info->severity == AER_FATAL && link_healthy)) {
>>           /* Link is still healthy for IO reads */
>>           pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,
>> @@ -1420,11 +1423,11 @@ static inline void aer_process_err_devices(struct aer_err_info *e_info)
>>       /* Report all before handling them, to not lose records by reset etc. */
>>       for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
>> -        if (aer_get_device_error_info(e_info, i))
>> +        if (aer_get_device_error_info(e_info, i, false))
>>               aer_print_error(e_info, i);
>>       }
>>       for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
>> -        if (aer_get_device_error_info(e_info, i))
>> +        if (aer_get_device_error_info(e_info, i, false))
>>               handle_error_source(e_info->dev[i], e_info);
>>       }
>>   }
>> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
>> index f6069f621683..21c4e8371279 100644
>> --- a/drivers/pci/pcie/dpc.c
>> +++ b/drivers/pci/pcie/dpc.c
>> @@ -284,7 +284,7 @@ struct pci_dev *dpc_process_error(struct pci_dev *pdev)
>>           pci_warn(pdev, "containment event, status:%#06x: unmasked uncorrectable error detected\n",
>>                status);
>>           if (dpc_get_aer_uncorrect_severity(pdev, &info) &&
>> -            aer_get_device_error_info(&info, 0)) {
>> +            aer_get_device_error_info(&info, 0, false)) {
>>               aer_print_error(&info, 0);
>>               pci_aer_clear_nonfatal_status(pdev);
>>               pci_aer_clear_fatal_status(pdev);
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index de6381c690f5..744d77ee7271 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -196,6 +196,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>       struct pci_dev *bridge;
>>       pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>>       struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>> +    struct aer_err_info info;
>>       /*
>>        * If the error was detected by a Root Port, Downstream Port, RCEC,
>> @@ -223,6 +224,15 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>               pci_warn(bridge, "subordinate device reset failed\n");
>>               goto failed;
>>           }
>> +
>> +        info.dev[0] = dev;
>> +        info.level = KERN_ERR;
>> +        info.severity = AER_FATAL;
>> +        /* Link recovered, report fatal errors of RCiEP or EP */
>> +        if ((type == PCI_EXP_TYPE_ENDPOINT ||
>> +             type == PCI_EXP_TYPE_RC_END) &&
>> +            aer_get_device_error_info(&info, 0, true))
>> +            aer_print_error(&info, 0);
>>       } else {
>>           pci_walk_bridge(bridge, report_normal_detected, &status);
>>       }
>> @@ -259,6 +269,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>>       if (host->native_aer || pcie_ports_native) {
>>           pcie_clear_device_status(dev);
>>           pci_aer_clear_nonfatal_status(dev);
>> +        pci_aer_clear_fatal_status(dev);
> 
> Above change looks unrelated to dumping the error info. It would be better if
> you move it to a separate patch.
> 

Hi, Kuppuswamy,

Thanks for the quick reply and valuable comments.
Sure, I will add two new separate patches.

The first is to use pcie_aer_is_native() to check for native AER control.

--------- patch 1 start ------

Subject: [PATCH v6 4/5] PCI/ERR: Use pcie_aer_is_native() to check for native AER control

Replace the manual checks for native AER control with the
pcie_aer_is_native() helper, which provides a more robust way
to determine if we have native control of AER.

No functional changes intended.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
  drivers/pci/pcie/err.c | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 1397f62f13dc..86624ae61cb6 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -195,7 +195,6 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
  	int type = pci_pcie_type(dev);
  	struct pci_dev *bridge;
  	pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
-	struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
  	struct aer_err_info info;

  	/*
@@ -266,7 +265,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
  	 * it is responsible for clearing this status.  In that case, the
  	 * signaling device may not even be visible to the OS.
  	 */
-	if (host->native_aer || pcie_ports_native) {
+	if (pcie_aer_is_native(dev)) {
  		pcie_clear_device_status(dev);
  		pci_aer_clear_nonfatal_status(dev);
  	}
-- 

--------- patch 1 end ------

The second clears the AER fatal status for err_dev:

--------- patch 2 start ------

Subject: [PATCH v6 5/5] PCI/AER: Clear both AER fatal and non-fatal status

The DPC driver clears AER fatal status for the port that reported the
error, but not for the downstream device that detected the error.  The
current recovery code only clears non-fatal AER status, leaving fatal
status bits set in the error device.

Use pci_aer_raw_clear_status() to clear both fatal and non-fatal error
status in the error device, ensuring all AER status bits are properly
cleared after recovery.

Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
  drivers/pci/pcie/err.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 86624ae61cb6..96d99eaf13d2 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -267,7 +267,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
  	 */
  	if (pcie_aer_is_native(dev)) {
  		pcie_clear_device_status(dev);
-		pci_aer_clear_nonfatal_status(dev);
+		pci_aer_raw_clear_status(dev);
  	}

  	pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);
-- 
2.39.3

--------- patch 2 end ------
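
The difference between the clearing helpers can be sketched with a small write-1-to-clear model. This is a userspace illustration only: struct fake_aer and the functions below are hypothetical, the bit values in the usage note are arbitrary, and raw_clear() reflects my reading that pci_aer_raw_clear_status() also clears the correctable status (the Root Error Status handling for Root Ports is left out).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a device's AER status registers. */
struct fake_aer {
	uint32_t uncor_status;	/* sticky, write-1-to-clear */
	uint32_t uncor_sever;	/* 1 = this error bit is signalled as fatal */
	uint32_t cor_status;	/* sticky, write-1-to-clear */
};

/* W1C write: every 1 bit in val clears that status bit, 0 bits are kept. */
static void w1c_write(uint32_t *reg, uint32_t val)
{
	assert(reg != NULL);
	*reg &= ~val;
}

/* Clears only the bits whose severity is non-fatal. */
static void clear_nonfatal(struct fake_aer *aer)
{
	w1c_write(&aer->uncor_status, aer->uncor_status & ~aer->uncor_sever);
}

/* Clears only the bits whose severity is fatal. */
static void clear_fatal(struct fake_aer *aer)
{
	w1c_write(&aer->uncor_status, aer->uncor_status & aer->uncor_sever);
}

/* Clears everything that is set, regardless of severity. */
static void raw_clear(struct fake_aer *aer)
{
	w1c_write(&aer->cor_status, aer->cor_status);
	w1c_write(&aer->uncor_status, aer->uncor_status);
}
```

With uncor_status = 0x00101000 and uncor_sever = 0x00001000, clear_nonfatal() leaves 0x00001000 behind, which is the leftover fatal bit the commit message describes; raw_clear() leaves both status registers at zero.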

I can also squash these two patches into one. Which way do you prefer?

Thanks.
Shuai





* Re: [PATCH v5 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd
  2025-09-17  6:33 ` [PATCH v5 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd Shuai Xue
  2025-09-17 19:09   ` Kuppuswamy Sathyanarayanan
@ 2025-09-18 20:33   ` Bjorn Helgaas
  2025-09-19  1:41     ` Shuai Xue
  1 sibling, 1 reply; 9+ messages in thread
From: Bjorn Helgaas @ 2025-09-18 20:33 UTC (permalink / raw)
  To: Shuai Xue
  Cc: bhelgaas, mahesh, mani, Jonathan.Cameron,
	sathyanarayanan.kuppuswamy, oohall, linux-pci, linux-kernel,
	linuxppc-dev

On Wed, Sep 17, 2025 at 02:33:52PM +0800, Shuai Xue wrote:
> The AER driver has historically avoided reading the configuration space of
> an endpoint or RCiEP that reported a fatal error, considering the link to
> that device unreliable. Consequently, when a fatal error occurs, the AER
> and DPC drivers do not report specific error types, resulting in logs like:
> 
> 	pcieport 0015:00:00.0: EDR: EDR event received
> 	pcieport 0015:00:00.0: EDR: Reported EDR dev: 0015:00:00.0
> 	pcieport 0015:00:00.0: DPC: containment event, status:0x200d, ERR_FATAL received from 0015:01:00.0
> 	pcieport 0015:00:00.0: AER: broadcast error_detected message
> 	pcieport 0015:00:00.0: AER: broadcast mmio_enabled message
> 	pcieport 0015:00:00.0: AER: broadcast resume message
> 	pcieport 0015:00:00.0: pciehp: Slot(21): Link Down/Up ignored
> 	pcieport 0015:00:00.0: AER: device recovery successful
> 	pcieport 0015:00:00.0: EDR: DPC port successfully recovered
> 	pcieport 0015:00:00.0: EDR: Status for 0015:00:00.0: 0x80

When you update this series, can you indent these messages with two
spaces instead of a tab?  That will preserve a little space and also
preserve the formatting when "git log" adds its own indentation.


* Re: [PATCH v5 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd
  2025-09-18 20:33   ` Bjorn Helgaas
@ 2025-09-19  1:41     ` Shuai Xue
  0 siblings, 0 replies; 9+ messages in thread
From: Shuai Xue @ 2025-09-19  1:41 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: bhelgaas, mahesh, mani, Jonathan.Cameron,
	sathyanarayanan.kuppuswamy, oohall, linux-pci, linux-kernel,
	linuxppc-dev



On 2025/9/19 04:33, Bjorn Helgaas wrote:
> On Wed, Sep 17, 2025 at 02:33:52PM +0800, Shuai Xue wrote:
>> The AER driver has historically avoided reading the configuration space of
>> an endpoint or RCiEP that reported a fatal error, considering the link to
>> that device unreliable. Consequently, when a fatal error occurs, the AER
>> and DPC drivers do not report specific error types, resulting in logs like:
>>
>> 	pcieport 0015:00:00.0: EDR: EDR event received
>> 	pcieport 0015:00:00.0: EDR: Reported EDR dev: 0015:00:00.0
>> 	pcieport 0015:00:00.0: DPC: containment event, status:0x200d, ERR_FATAL received from 0015:01:00.0
>> 	pcieport 0015:00:00.0: AER: broadcast error_detected message
>> 	pcieport 0015:00:00.0: AER: broadcast mmio_enabled message
>> 	pcieport 0015:00:00.0: AER: broadcast resume message
>> 	pcieport 0015:00:00.0: pciehp: Slot(21): Link Down/Up ignored
>> 	pcieport 0015:00:00.0: AER: device recovery successful
>> 	pcieport 0015:00:00.0: EDR: DPC port successfully recovered
>> 	pcieport 0015:00:00.0: EDR: Status for 0015:00:00.0: 0x80
> 
> When you update this series, can you indent these messages with two
> spaces instead of a tab?  That will preserve a little space and also
> preserve the formatting when "git log" adds its own indentation.


Sure, I will indent them with two spaces instead of a tab.

Thanks.
Shuai


Thread overview: 9+ messages
2025-09-17  6:33 [PATCH v5 0/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd Shuai Xue
2025-09-17  6:33 ` [PATCH v5 1/3] PCI/DPC: Clarify naming for error port in DPC Handling Shuai Xue
2025-09-17  6:33 ` [PATCH v5 2/3] PCI/DPC: Run recovery on device that detected the error Shuai Xue
2025-09-17 22:01   ` kernel test robot
2025-09-17  6:33 ` [PATCH v5 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recoverd Shuai Xue
2025-09-17 19:09   ` Kuppuswamy Sathyanarayanan
2025-09-18  2:45     ` Shuai Xue
2025-09-18 20:33   ` Bjorn Helgaas
2025-09-19  1:41     ` Shuai Xue
