* [PATCH v5 1/3] PCI/DPC: Clarify naming for error port in DPC Handling
2025-09-17 6:33 [PATCH v5 0/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered Shuai Xue
@ 2025-09-17 6:33 ` Shuai Xue
2025-09-17 6:33 ` [PATCH v5 2/3] PCI/DPC: Run recovery on device that detected the error Shuai Xue
2025-09-17 6:33 ` [PATCH v5 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered Shuai Xue
2 siblings, 0 replies; 9+ messages in thread
From: Shuai Xue @ 2025-09-17 6:33 UTC (permalink / raw)
To: bhelgaas, mahesh, mani, Jonathan.Cameron,
sathyanarayanan.kuppuswamy
Cc: oohall, xueshuai, linux-pci, linux-kernel, linuxppc-dev
dpc_handler() is registered for the error port, which receives the DPC
interrupt, and acpi_dpc_port_get() locates the port that experienced the
containment event.
Rename edev and pdev to err_port for clarity, so that a later patch can
avoid misusing err_port in pcie_do_recovery().
No functional changes intended.
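For context, the error port is also what dpc_handler() receives as its IRQ
context: the DPC service driver registers the handler on the port it probes.
A rough sketch of that wiring, loosely based on dpc_probe() in
drivers/pci/pcie/dpc.c (the exact flags and name string here are
illustrative, not verbatim):

  static int dpc_probe(struct pcie_device *dev)
  {
          struct pci_dev *pdev = dev->port;  /* Root/Downstream Port with DPC */
          int status;

          /*
           * The port itself is the IRQ context, so the "context"
           * argument of dpc_handler() is the err_port this patch
           * renames.
           */
          status = devm_request_threaded_irq(&dev->device, dev->irq, dpc_irq,
                                             dpc_handler, IRQF_SHARED,
                                             "pcie-dpc", pdev);
          return status;
  }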
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
drivers/pci/pcie/dpc.c | 10 +++++-----
drivers/pci/pcie/edr.c | 34 +++++++++++++++++-----------------
2 files changed, 22 insertions(+), 22 deletions(-)
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index fc18349614d7..bff29726c6a5 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -361,21 +361,21 @@ static bool dpc_is_surprise_removal(struct pci_dev *pdev)
static irqreturn_t dpc_handler(int irq, void *context)
{
- struct pci_dev *pdev = context;
+ struct pci_dev *err_port = context;
/*
* According to PCIe r6.0 sec 6.7.6, errors are an expected side effect
* of async removal and should be ignored by software.
*/
- if (dpc_is_surprise_removal(pdev)) {
- dpc_handle_surprise_removal(pdev);
+ if (dpc_is_surprise_removal(err_port)) {
+ dpc_handle_surprise_removal(err_port);
return IRQ_HANDLED;
}
- dpc_process_error(pdev);
+ dpc_process_error(err_port);
/* We configure DPC so it only triggers on ERR_FATAL */
- pcie_do_recovery(pdev, pci_channel_io_frozen, dpc_reset_link);
+ pcie_do_recovery(err_port, pci_channel_io_frozen, dpc_reset_link);
return IRQ_HANDLED;
}
diff --git a/drivers/pci/pcie/edr.c b/drivers/pci/pcie/edr.c
index e86298dbbcff..521fca2f40cb 100644
--- a/drivers/pci/pcie/edr.c
+++ b/drivers/pci/pcie/edr.c
@@ -150,7 +150,7 @@ static int acpi_send_edr_status(struct pci_dev *pdev, struct pci_dev *edev,
static void edr_handle_event(acpi_handle handle, u32 event, void *data)
{
- struct pci_dev *pdev = data, *edev;
+ struct pci_dev *pdev = data, *err_port;
pci_ers_result_t estate = PCI_ERS_RESULT_DISCONNECT;
u16 status;
@@ -169,36 +169,36 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
* may be that port or a parent of it (PCI Firmware r3.3, sec
* 4.6.13).
*/
- edev = acpi_dpc_port_get(pdev);
- if (!edev) {
+ err_port = acpi_dpc_port_get(pdev);
+ if (!err_port) {
pci_err(pdev, "Firmware failed to locate DPC port\n");
return;
}
- pci_dbg(pdev, "Reported EDR dev: %s\n", pci_name(edev));
+ pci_dbg(pdev, "Reported EDR dev: %s\n", pci_name(err_port));
/* If port does not support DPC, just send the OST */
- if (!edev->dpc_cap) {
- pci_err(edev, FW_BUG "This device doesn't support DPC\n");
+ if (!err_port->dpc_cap) {
+ pci_err(err_port, FW_BUG "This device doesn't support DPC\n");
goto send_ost;
}
/* Check if there is a valid DPC trigger */
- pci_read_config_word(edev, edev->dpc_cap + PCI_EXP_DPC_STATUS, &status);
+ pci_read_config_word(err_port, err_port->dpc_cap + PCI_EXP_DPC_STATUS, &status);
if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) {
- pci_err(edev, "Invalid DPC trigger %#010x\n", status);
+ pci_err(err_port, "Invalid DPC trigger %#010x\n", status);
goto send_ost;
}
- dpc_process_error(edev);
- pci_aer_raw_clear_status(edev);
+ dpc_process_error(err_port);
+ pci_aer_raw_clear_status(err_port);
/*
* Irrespective of whether the DPC event is triggered by ERR_FATAL
* or ERR_NONFATAL, since the link is already down, use the FATAL
* error recovery path for both cases.
*/
- estate = pcie_do_recovery(edev, pci_channel_io_frozen, dpc_reset_link);
+ estate = pcie_do_recovery(err_port, pci_channel_io_frozen, dpc_reset_link);
send_ost:
@@ -207,15 +207,15 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
* to firmware. If not successful, send _OST(0xF, BDF << 16 | 0x81).
*/
if (estate == PCI_ERS_RESULT_RECOVERED) {
- pci_dbg(edev, "DPC port successfully recovered\n");
- pcie_clear_device_status(edev);
- acpi_send_edr_status(pdev, edev, EDR_OST_SUCCESS);
+ pci_dbg(err_port, "DPC port successfully recovered\n");
+ pcie_clear_device_status(err_port);
+ acpi_send_edr_status(pdev, err_port, EDR_OST_SUCCESS);
} else {
- pci_dbg(edev, "DPC port recovery failed\n");
- acpi_send_edr_status(pdev, edev, EDR_OST_FAILED);
+ pci_dbg(err_port, "DPC port recovery failed\n");
+ acpi_send_edr_status(pdev, err_port, EDR_OST_FAILED);
}
- pci_dev_put(edev);
+ pci_dev_put(err_port);
}
void pci_acpi_add_edr_notifier(struct pci_dev *pdev)
--
2.39.3
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH v5 2/3] PCI/DPC: Run recovery on device that detected the error
2025-09-17 6:33 [PATCH v5 0/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered Shuai Xue
2025-09-17 6:33 ` [PATCH v5 1/3] PCI/DPC: Clarify naming for error port in DPC Handling Shuai Xue
@ 2025-09-17 6:33 ` Shuai Xue
2025-09-17 22:01 ` kernel test robot
2025-09-17 6:33 ` [PATCH v5 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered Shuai Xue
2 siblings, 1 reply; 9+ messages in thread
From: Shuai Xue @ 2025-09-17 6:33 UTC (permalink / raw)
To: bhelgaas, mahesh, mani, Jonathan.Cameron,
sathyanarayanan.kuppuswamy
Cc: oohall, xueshuai, linux-pci, linux-kernel, linuxppc-dev
The current implementation of pcie_do_recovery() assumes that the
recovery process is executed for the device that detected the error.
However, the DPC driver currently passes the error port that experienced
the DPC event to pcie_do_recovery().
Use the DPC Error Source ID register to correctly identify the device
that detected the error. When the error device is passed in,
pcie_do_recovery() will find the upstream bridge and walk the bridges
potentially affected by the error, and subsequent commits will be able
to accurately access the AER status of the error device.
No functional changes intended.
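The lookup this relies on appears in the hunk below: the DPC Error Source ID
register carries the Requester ID of the device that sent the
ERR_FATAL/ERR_NONFATAL message, and pci_get_domain_bus_and_slot() resolves
that ID to a pci_dev, taking a reference the caller must drop. A condensed
sketch of the decoding:

  u16 source;
  struct pci_dev *err_dev;

  /* Requester ID (bus number in [15:8], devfn in [7:0]) of the error source */
  pci_read_config_word(pdev, pdev->dpc_cap + PCI_EXP_DPC_SOURCE_ID, &source);

  /* Returns NULL if no matching device; otherwise holds a reference */
  err_dev = pci_get_domain_bus_and_slot(pci_domain_nr(pdev->bus),
                                        PCI_BUS_NUM(source),
                                        source & 0xff);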
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
drivers/pci/pci.h | 2 +-
drivers/pci/pcie/dpc.c | 25 +++++++++++++++++++++----
drivers/pci/pcie/edr.c | 7 ++++---
3 files changed, 26 insertions(+), 8 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 34f65d69662e..de2f07cefa72 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -654,7 +654,7 @@ struct rcec_ea {
void pci_save_dpc_state(struct pci_dev *dev);
void pci_restore_dpc_state(struct pci_dev *dev);
void pci_dpc_init(struct pci_dev *pdev);
-void dpc_process_error(struct pci_dev *pdev);
+struct pci_dev *dpc_process_error(struct pci_dev *pdev);
pci_ers_result_t dpc_reset_link(struct pci_dev *pdev);
bool pci_dpc_recovered(struct pci_dev *pdev);
unsigned int dpc_tlp_log_len(struct pci_dev *dev);
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index bff29726c6a5..f6069f621683 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -260,10 +260,20 @@ static int dpc_get_aer_uncorrect_severity(struct pci_dev *dev,
return 1;
}
-void dpc_process_error(struct pci_dev *pdev)
+/**
+ * dpc_process_error - handle the DPC error status
+ * @pdev: the port that experienced the containment event
+ *
+ * Return: the device that detected the error.
+ *
+ * NOTE: The device reference count is increased, the caller must decrement
+ * the reference count by calling pci_dev_put().
+ */
+struct pci_dev *dpc_process_error(struct pci_dev *pdev)
{
u16 cap = pdev->dpc_cap, status, source, reason, ext_reason;
struct aer_err_info info = {};
+ struct pci_dev *err_dev;
pci_read_config_word(pdev, cap + PCI_EXP_DPC_STATUS, &status);
@@ -279,6 +289,7 @@ void dpc_process_error(struct pci_dev *pdev)
pci_aer_clear_nonfatal_status(pdev);
pci_aer_clear_fatal_status(pdev);
}
+ err_dev = pci_dev_get(pdev);
break;
case PCI_EXP_DPC_STATUS_TRIGGER_RSN_NFE:
case PCI_EXP_DPC_STATUS_TRIGGER_RSN_FE:
@@ -290,6 +301,8 @@ void dpc_process_error(struct pci_dev *pdev)
"ERR_FATAL" : "ERR_NONFATAL",
pci_domain_nr(pdev->bus), PCI_BUS_NUM(source),
PCI_SLOT(source), PCI_FUNC(source));
+ err_dev = pci_get_domain_bus_and_slot(pci_domain_nr(pdev->bus),
+ PCI_BUS_NUM(source), source & 0xff);
break;
case PCI_EXP_DPC_STATUS_TRIGGER_RSN_IN_EXT:
ext_reason = status & PCI_EXP_DPC_STATUS_TRIGGER_RSN_EXT;
@@ -304,8 +317,11 @@ void dpc_process_error(struct pci_dev *pdev)
if (ext_reason == PCI_EXP_DPC_STATUS_TRIGGER_RSN_RP_PIO &&
pdev->dpc_rp_extensions)
dpc_process_rp_pio_error(pdev);
+ err_dev = pci_dev_get(pdev);
break;
}
+
+ return err_dev;
}
static void pci_clear_surpdn_errors(struct pci_dev *pdev)
@@ -361,7 +377,7 @@ static bool dpc_is_surprise_removal(struct pci_dev *pdev)
static irqreturn_t dpc_handler(int irq, void *context)
{
- struct pci_dev *err_port = context;
+ struct pci_dev *err_port = context, *err_dev;
/*
* According to PCIe r6.0 sec 6.7.6, errors are an expected side effect
@@ -372,10 +388,11 @@ static irqreturn_t dpc_handler(int irq, void *context)
return IRQ_HANDLED;
}
- dpc_process_error(err_port);
+ err_dev = dpc_process_error(err_port);
/* We configure DPC so it only triggers on ERR_FATAL */
- pcie_do_recovery(err_port, pci_channel_io_frozen, dpc_reset_link);
+ pcie_do_recovery(err_dev, pci_channel_io_frozen, dpc_reset_link);
+ pci_dev_put(err_dev);
return IRQ_HANDLED;
}
diff --git a/drivers/pci/pcie/edr.c b/drivers/pci/pcie/edr.c
index 521fca2f40cb..3f971bb04433 100644
--- a/drivers/pci/pcie/edr.c
+++ b/drivers/pci/pcie/edr.c
@@ -150,7 +150,7 @@ static int acpi_send_edr_status(struct pci_dev *pdev, struct pci_dev *edev,
static void edr_handle_event(acpi_handle handle, u32 event, void *data)
{
- struct pci_dev *pdev = data, *err_port;
+ struct pci_dev *pdev = data, *err_port, *err_dev;
pci_ers_result_t estate = PCI_ERS_RESULT_DISCONNECT;
u16 status;
@@ -190,7 +190,7 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
goto send_ost;
}
- dpc_process_error(err_port);
+ err_dev = dpc_process_error(err_port);
pci_aer_raw_clear_status(err_port);
/*
@@ -198,7 +198,7 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
* or ERR_NONFATAL, since the link is already down, use the FATAL
* error recovery path for both cases.
*/
- estate = pcie_do_recovery(err_port, pci_channel_io_frozen, dpc_reset_link);
+ estate = pcie_do_recovery(err_dev, pci_channel_io_frozen, dpc_reset_link);
send_ost:
@@ -215,6 +215,7 @@ static void edr_handle_event(acpi_handle handle, u32 event, void *data)
acpi_send_edr_status(pdev, err_port, EDR_OST_FAILED);
}
+ pci_dev_put(err_dev);
pci_dev_put(err_port);
}
--
2.39.3
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v5 2/3] PCI/DPC: Run recovery on device that detected the error
2025-09-17 6:33 ` [PATCH v5 2/3] PCI/DPC: Run recovery on device that detected the error Shuai Xue
@ 2025-09-17 22:01 ` kernel test robot
0 siblings, 0 replies; 9+ messages in thread
From: kernel test robot @ 2025-09-17 22:01 UTC (permalink / raw)
To: Shuai Xue, bhelgaas, mahesh, mani, Jonathan.Cameron,
sathyanarayanan.kuppuswamy
Cc: oe-kbuild-all, oohall, xueshuai, linux-pci, linux-kernel,
linuxppc-dev
Hi Shuai,
kernel test robot noticed the following build warnings:
[auto build test WARNING on pci/for-linus]
[also build test WARNING on linus/master v6.17-rc6 next-20250917]
[cannot apply to pci/next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Shuai-Xue/PCI-DPC-Clarify-naming-for-error-port-in-DPC-Handling/20250917-143459
base: https://git.kernel.org/pub/scm/linux/kernel/git/pci/pci.git for-linus
patch link: https://lore.kernel.org/r/20250917063352.19429-3-xueshuai%40linux.alibaba.com
patch subject: [PATCH v5 2/3] PCI/DPC: Run recovery on device that detected the error
config: loongarch-allyesconfig (https://download.01.org/0day-ci/archive/20250918/202509180501.eB8FJ5Vt-lkp@intel.com/config)
compiler: clang version 22.0.0git (https://github.com/llvm/llvm-project 7c861bcedf61607b6c087380ac711eb7ff918ca6)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250918/202509180501.eB8FJ5Vt-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202509180501.eB8FJ5Vt-lkp@intel.com/
All warnings (new ones prefixed by >>):
In file included from <built-in>:3:
In file included from include/linux/compiler_types.h:171:
include/linux/compiler-clang.h:28:9: warning: '__SANITIZE_ADDRESS__' macro redefined [-Wmacro-redefined]
28 | #define __SANITIZE_ADDRESS__
| ^
<built-in>:371:9: note: previous definition is here
371 | #define __SANITIZE_ADDRESS__ 1
| ^
>> drivers/pci/pcie/edr.c:188:6: warning: variable 'err_dev' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
188 | if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/pci/pcie/edr.c:218:14: note: uninitialized use occurs here
218 | pci_dev_put(err_dev);
| ^~~~~~~
drivers/pci/pcie/edr.c:188:2: note: remove the 'if' if its condition is always false
188 | if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
189 | pci_err(err_port, "Invalid DPC trigger %#010x\n", status);
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
190 | goto send_ost;
| ~~~~~~~~~~~~~~
191 | }
| ~
drivers/pci/pcie/edr.c:181:6: warning: variable 'err_dev' is used uninitialized whenever 'if' condition is true [-Wsometimes-uninitialized]
181 | if (!err_port->dpc_cap) {
| ^~~~~~~~~~~~~~~~~~
drivers/pci/pcie/edr.c:218:14: note: uninitialized use occurs here
218 | pci_dev_put(err_dev);
| ^~~~~~~
drivers/pci/pcie/edr.c:181:2: note: remove the 'if' if its condition is always false
181 | if (!err_port->dpc_cap) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~
182 | pci_err(err_port, FW_BUG "This device doesn't support DPC\n");
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
183 | goto send_ost;
| ~~~~~~~~~~~~~~
184 | }
| ~
drivers/pci/pcie/edr.c:153:50: note: initialize the variable 'err_dev' to silence this warning
153 | struct pci_dev *pdev = data, *err_port, *err_dev;
| ^
| = NULL
3 warnings generated.
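The warning is legitimate: on the two early "goto send_ost" paths (no DPC
capability, no valid DPC trigger), err_dev is never assigned before the
pci_dev_put(err_dev) at the end of the function runs. Since pci_dev_put()
is a no-op for a NULL argument, the minimal fix is the initialization the
robot suggests:

  struct pci_dev *pdev = data, *err_port, *err_dev = NULL;

  /* ... unchanged body ... */

  /* pci_dev_put(NULL) is a no-op, so the early-exit paths are now safe */
  pci_dev_put(err_dev);
  pci_dev_put(err_port);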
vim +188 drivers/pci/pcie/edr.c
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 150
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 151 static void edr_handle_event(acpi_handle handle, u32 event, void *data)
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 152 {
267102466d7b592 Shuai Xue 2025-09-17 153 struct pci_dev *pdev = data, *err_port, *err_dev;
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 154 pci_ers_result_t estate = PCI_ERS_RESULT_DISCONNECT;
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 155 u16 status;
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 156
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 157 if (event != ACPI_NOTIFY_DISCONNECT_RECOVER)
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 158 return;
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 159
774820b362b07b9 Bjorn Helgaas 2023-04-07 160 /*
774820b362b07b9 Bjorn Helgaas 2023-04-07 161 * pdev is a Root Port or Downstream Port that is still present and
774820b362b07b9 Bjorn Helgaas 2023-04-07 162 * has triggered a containment event, e.g., DPC, so its child
774820b362b07b9 Bjorn Helgaas 2023-04-07 163 * devices have been disconnected (ACPI r6.5, sec 5.6.6).
774820b362b07b9 Bjorn Helgaas 2023-04-07 164 */
af03958da0678c3 Kuppuswamy Sathyanarayanan 2020-04-15 165 pci_info(pdev, "EDR event received\n");
af03958da0678c3 Kuppuswamy Sathyanarayanan 2020-04-15 166
774820b362b07b9 Bjorn Helgaas 2023-04-07 167 /*
774820b362b07b9 Bjorn Helgaas 2023-04-07 168 * Locate the port that experienced the containment event. pdev
774820b362b07b9 Bjorn Helgaas 2023-04-07 169 * may be that port or a parent of it (PCI Firmware r3.3, sec
774820b362b07b9 Bjorn Helgaas 2023-04-07 170 * 4.6.13).
774820b362b07b9 Bjorn Helgaas 2023-04-07 171 */
a56b1e47845b946 Shuai Xue 2025-09-17 172 err_port = acpi_dpc_port_get(pdev);
a56b1e47845b946 Shuai Xue 2025-09-17 173 if (!err_port) {
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 174 pci_err(pdev, "Firmware failed to locate DPC port\n");
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 175 return;
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 176 }
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 177
a56b1e47845b946 Shuai Xue 2025-09-17 178 pci_dbg(pdev, "Reported EDR dev: %s\n", pci_name(err_port));
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 179
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 180 /* If port does not support DPC, just send the OST */
a56b1e47845b946 Shuai Xue 2025-09-17 181 if (!err_port->dpc_cap) {
a56b1e47845b946 Shuai Xue 2025-09-17 182 pci_err(err_port, FW_BUG "This device doesn't support DPC\n");
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 183 goto send_ost;
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 184 }
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 185
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 186 /* Check if there is a valid DPC trigger */
a56b1e47845b946 Shuai Xue 2025-09-17 187 pci_read_config_word(err_port, err_port->dpc_cap + PCI_EXP_DPC_STATUS, &status);
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 @188 if (!(status & PCI_EXP_DPC_STATUS_TRIGGER)) {
a56b1e47845b946 Shuai Xue 2025-09-17 189 pci_err(err_port, "Invalid DPC trigger %#010x\n", status);
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 190 goto send_ost;
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 191 }
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 192
267102466d7b592 Shuai Xue 2025-09-17 193 err_dev = dpc_process_error(err_port);
a56b1e47845b946 Shuai Xue 2025-09-17 194 pci_aer_raw_clear_status(err_port);
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 195
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 196 /*
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 197 * Irrespective of whether the DPC event is triggered by ERR_FATAL
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 198 * or ERR_NONFATAL, since the link is already down, use the FATAL
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 199 * error recovery path for both cases.
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 200 */
267102466d7b592 Shuai Xue 2025-09-17 201 estate = pcie_do_recovery(err_dev, pci_channel_io_frozen, dpc_reset_link);
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 202
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 203 send_ost:
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 204
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 205 /*
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 206 * If recovery is successful, send _OST(0xF, BDF << 16 | 0x80)
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 207 * to firmware. If not successful, send _OST(0xF, BDF << 16 | 0x81).
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 208 */
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 209 if (estate == PCI_ERS_RESULT_RECOVERED) {
a56b1e47845b946 Shuai Xue 2025-09-17 210 pci_dbg(err_port, "DPC port successfully recovered\n");
a56b1e47845b946 Shuai Xue 2025-09-17 211 pcie_clear_device_status(err_port);
a56b1e47845b946 Shuai Xue 2025-09-17 212 acpi_send_edr_status(pdev, err_port, EDR_OST_SUCCESS);
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 213 } else {
a56b1e47845b946 Shuai Xue 2025-09-17 214 pci_dbg(err_port, "DPC port recovery failed\n");
a56b1e47845b946 Shuai Xue 2025-09-17 215 acpi_send_edr_status(pdev, err_port, EDR_OST_FAILED);
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 216 }
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 217
267102466d7b592 Shuai Xue 2025-09-17 218 pci_dev_put(err_dev);
a56b1e47845b946 Shuai Xue 2025-09-17 219 pci_dev_put(err_port);
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 220 }
ac1c8e35a3262d0 Kuppuswamy Sathyanarayanan 2020-03-23 221
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v5 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered
2025-09-17 6:33 [PATCH v5 0/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered Shuai Xue
2025-09-17 6:33 ` [PATCH v5 1/3] PCI/DPC: Clarify naming for error port in DPC Handling Shuai Xue
2025-09-17 6:33 ` [PATCH v5 2/3] PCI/DPC: Run recovery on device that detected the error Shuai Xue
@ 2025-09-17 6:33 ` Shuai Xue
2025-09-17 19:09 ` Kuppuswamy Sathyanarayanan
2025-09-18 20:33 ` Bjorn Helgaas
2 siblings, 2 replies; 9+ messages in thread
From: Shuai Xue @ 2025-09-17 6:33 UTC (permalink / raw)
To: bhelgaas, mahesh, mani, Jonathan.Cameron,
sathyanarayanan.kuppuswamy
Cc: oohall, xueshuai, linux-pci, linux-kernel, linuxppc-dev
The AER driver has historically avoided reading the configuration space of
an endpoint or RCiEP that reported a fatal error, considering the link to
that device unreliable. Consequently, when a fatal error occurs, the AER
and DPC drivers do not report specific error types, resulting in logs like:
pcieport 0015:00:00.0: EDR: EDR event received
pcieport 0015:00:00.0: EDR: Reported EDR dev: 0015:00:00.0
pcieport 0015:00:00.0: DPC: containment event, status:0x200d, ERR_FATAL received from 0015:01:00.0
pcieport 0015:00:00.0: AER: broadcast error_detected message
pcieport 0015:00:00.0: AER: broadcast mmio_enabled message
pcieport 0015:00:00.0: AER: broadcast resume message
pcieport 0015:00:00.0: pciehp: Slot(21): Link Down/Up ignored
pcieport 0015:00:00.0: AER: device recovery successful
pcieport 0015:00:00.0: EDR: DPC port successfully recovered
pcieport 0015:00:00.0: EDR: Status for 0015:00:00.0: 0x80
AER status registers are sticky and Write-1-to-clear. If the link recovered
after a hot reset, we can still safely access the AER status and TLP header
of the error device. In that case, report the fatal errors, which helps to
figure out the root cause.
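Because the registers are sticky and Write-1-to-clear, reading the status
after the link comes back up both reports the error and tells us which bits
to clear. A minimal sketch of the read-then-clear sequence, assuming
dev->aer_cap has already been validated:

  u32 status;
  int aer = dev->aer_cap;

  /* Sticky: survives the hot reset that recovered the link */
  pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, &status);

  /* W1C: writing the value back clears exactly the bits that were set */
  pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, status);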
After this patch, the logs look like:
pcieport 0015:00:00.0: EDR: EDR event received
pcieport 0015:00:00.0: EDR: Reported EDR dev: 0015:00:00.0
pcieport 0015:00:00.0: DPC: containment event, status:0x200d, ERR_FATAL received from 0015:01:00.0
pcieport 0015:00:00.0: AER: broadcast error_detected message
vfio-pci 0015:01:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0015:00:00.0: pciehp: Slot(21): Link Down/Up ignored
vfio-pci 0015:01:00.0: device [144d:a80a] error status/mask=00001000/00400000
vfio-pci 0015:01:00.0: [12] TLP (First)
vfio-pci 0015:01:00.0: AER: TLP Header: 0x4a004010 0x00000040 0x01000000 0xffffffff
pcieport 0015:00:00.0: AER: broadcast mmio_enabled message
pcieport 0015:00:00.0: AER: broadcast resume message
pcieport 0015:00:00.0: AER: device recovery successful
pcieport 0015:00:00.0: EDR: DPC port successfully recovered
pcieport 0015:00:00.0: EDR: Status for 0015:00:00.0: 0x80
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
drivers/pci/pci.h | 3 ++-
drivers/pci/pcie/aer.c | 11 +++++++----
drivers/pci/pcie/dpc.c | 2 +-
drivers/pci/pcie/err.c | 11 +++++++++++
4 files changed, 21 insertions(+), 6 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index de2f07cefa72..b8d364545e7d 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -629,7 +629,8 @@ struct aer_err_info {
struct pcie_tlp_log tlp; /* TLP Header */
};
-int aer_get_device_error_info(struct aer_err_info *info, int i);
+int aer_get_device_error_info(struct aer_err_info *info, int i,
+ bool link_healthy);
void aer_print_error(struct aer_err_info *info, int i);
int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e286c197d716..157ad7fb44a0 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -1351,12 +1351,14 @@ EXPORT_SYMBOL_GPL(aer_recover_queue);
* aer_get_device_error_info - read error status from dev and store it to info
* @info: pointer to structure to store the error record
* @i: index into info->dev[]
+ * @link_healthy: link is healthy or not
*
* Return: 1 on success, 0 on error.
*
* Note that @info is reused among all error devices. Clear fields properly.
*/
-int aer_get_device_error_info(struct aer_err_info *info, int i)
+int aer_get_device_error_info(struct aer_err_info *info, int i,
+ bool link_healthy)
{
struct pci_dev *dev;
int type, aer;
@@ -1387,7 +1389,8 @@ int aer_get_device_error_info(struct aer_err_info *info, int i)
} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
type == PCI_EXP_TYPE_RC_EC ||
type == PCI_EXP_TYPE_DOWNSTREAM ||
- info->severity == AER_NONFATAL) {
+ info->severity == AER_NONFATAL ||
+ (info->severity == AER_FATAL && link_healthy)) {
/* Link is still healthy for IO reads */
pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,
@@ -1420,11 +1423,11 @@ static inline void aer_process_err_devices(struct aer_err_info *e_info)
/* Report all before handling them, to not lose records by reset etc. */
for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
- if (aer_get_device_error_info(e_info, i))
+ if (aer_get_device_error_info(e_info, i, false))
aer_print_error(e_info, i);
}
for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
- if (aer_get_device_error_info(e_info, i))
+ if (aer_get_device_error_info(e_info, i, false))
handle_error_source(e_info->dev[i], e_info);
}
}
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index f6069f621683..21c4e8371279 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -284,7 +284,7 @@ struct pci_dev *dpc_process_error(struct pci_dev *pdev)
pci_warn(pdev, "containment event, status:%#06x: unmasked uncorrectable error detected\n",
status);
if (dpc_get_aer_uncorrect_severity(pdev, &info) &&
- aer_get_device_error_info(&info, 0)) {
+ aer_get_device_error_info(&info, 0, false)) {
aer_print_error(&info, 0);
pci_aer_clear_nonfatal_status(pdev);
pci_aer_clear_fatal_status(pdev);
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index de6381c690f5..744d77ee7271 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -196,6 +196,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
struct pci_dev *bridge;
pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
+ struct aer_err_info info;
/*
* If the error was detected by a Root Port, Downstream Port, RCEC,
@@ -223,6 +224,15 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
pci_warn(bridge, "subordinate device reset failed\n");
goto failed;
}
+
+ info.dev[0] = dev;
+ info.level = KERN_ERR;
+ info.severity = AER_FATAL;
+ /* Link recovered, report fatal errors of RCiEP or EP */
+ if ((type == PCI_EXP_TYPE_ENDPOINT ||
+ type == PCI_EXP_TYPE_RC_END) &&
+ aer_get_device_error_info(&info, 0, true))
+ aer_print_error(&info, 0);
} else {
pci_walk_bridge(bridge, report_normal_detected, &status);
}
@@ -259,6 +269,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
if (host->native_aer || pcie_ports_native) {
pcie_clear_device_status(dev);
pci_aer_clear_nonfatal_status(dev);
+ pci_aer_clear_fatal_status(dev);
}
pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);
--
2.39.3
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v5 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered
2025-09-17 6:33 ` [PATCH v5 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered Shuai Xue
@ 2025-09-17 19:09 ` Kuppuswamy Sathyanarayanan
2025-09-18 2:45 ` Shuai Xue
2025-09-18 20:33 ` Bjorn Helgaas
1 sibling, 1 reply; 9+ messages in thread
From: Kuppuswamy Sathyanarayanan @ 2025-09-17 19:09 UTC (permalink / raw)
To: Shuai Xue, bhelgaas, mahesh, mani, Jonathan.Cameron
Cc: oohall, linux-pci, linux-kernel, linuxppc-dev
On 9/16/25 23:33, Shuai Xue wrote:
> The AER driver has historically avoided reading the configuration space of
> an endpoint or RCiEP that reported a fatal error, considering the link to
> that device unreliable. Consequently, when a fatal error occurs, the AER
> and DPC drivers do not report specific error types, resulting in logs like:
>
> pcieport 0015:00:00.0: EDR: EDR event received
> pcieport 0015:00:00.0: EDR: Reported EDR dev: 0015:00:00.0
> pcieport 0015:00:00.0: DPC: containment event, status:0x200d, ERR_FATAL received from 0015:01:00.0
> pcieport 0015:00:00.0: AER: broadcast error_detected message
> pcieport 0015:00:00.0: AER: broadcast mmio_enabled message
> pcieport 0015:00:00.0: AER: broadcast resume message
> pcieport 0015:00:00.0: pciehp: Slot(21): Link Down/Up ignored
> pcieport 0015:00:00.0: AER: device recovery successful
> pcieport 0015:00:00.0: EDR: DPC port successfully recovered
> pcieport 0015:00:00.0: EDR: Status for 0015:00:00.0: 0x80
>
> AER status registers are sticky and Write-1-to-clear. If the link recovered
> after a hot reset, we can still safely access the AER status and TLP header
> of the error device. In that case, report the fatal errors, which helps to
> figure out the root cause.
>
> After this patch, the logs look like:
>
> pcieport 0015:00:00.0: EDR: EDR event received
> pcieport 0015:00:00.0: EDR: Reported EDR dev: 0015:00:00.0
> pcieport 0015:00:00.0: DPC: containment event, status:0x200d, ERR_FATAL received from 0015:01:00.0
> pcieport 0015:00:00.0: AER: broadcast error_detected message
> vfio-pci 0015:01:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> pcieport 0015:00:00.0: pciehp: Slot(21): Link Down/Up ignored
> vfio-pci 0015:01:00.0: device [144d:a80a] error status/mask=00001000/00400000
> vfio-pci 0015:01:00.0: [12] TLP (First)
> vfio-pci 0015:01:00.0: AER: TLP Header: 0x4a004010 0x00000040 0x01000000 0xffffffff
> pcieport 0015:00:00.0: AER: broadcast mmio_enabled message
> pcieport 0015:00:00.0: AER: broadcast resume message
> pcieport 0015:00:00.0: AER: device recovery successful
> pcieport 0015:00:00.0: EDR: DPC port successfully recovered
> pcieport 0015:00:00.0: EDR: Status for 0015:00:00.0: 0x80
>
> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
> ---
> drivers/pci/pci.h | 3 ++-
> drivers/pci/pcie/aer.c | 11 +++++++----
> drivers/pci/pcie/dpc.c | 2 +-
> drivers/pci/pcie/err.c | 11 +++++++++++
> 4 files changed, 21 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index de2f07cefa72..b8d364545e7d 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -629,7 +629,8 @@ struct aer_err_info {
> struct pcie_tlp_log tlp; /* TLP Header */
> };
>
> -int aer_get_device_error_info(struct aer_err_info *info, int i);
> +int aer_get_device_error_info(struct aer_err_info *info, int i,
> + bool link_healthy);
> void aer_print_error(struct aer_err_info *info, int i);
>
> int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index e286c197d716..157ad7fb44a0 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -1351,12 +1351,14 @@ EXPORT_SYMBOL_GPL(aer_recover_queue);
> * aer_get_device_error_info - read error status from dev and store it to info
> * @info: pointer to structure to store the error record
> * @i: index into info->dev[]
> + * @link_healthy: link is healthy or not
> *
> * Return: 1 on success, 0 on error.
> *
> * Note that @info is reused among all error devices. Clear fields properly.
> */
> -int aer_get_device_error_info(struct aer_err_info *info, int i)
> +int aer_get_device_error_info(struct aer_err_info *info, int i,
> + bool link_healthy)
> {
> struct pci_dev *dev;
> int type, aer;
> @@ -1387,7 +1389,8 @@ int aer_get_device_error_info(struct aer_err_info *info, int i)
> } else if (type == PCI_EXP_TYPE_ROOT_PORT ||
> type == PCI_EXP_TYPE_RC_EC ||
> type == PCI_EXP_TYPE_DOWNSTREAM ||
> - info->severity == AER_NONFATAL) {
> + info->severity == AER_NONFATAL ||
> + (info->severity == AER_FATAL && link_healthy)) {
>
> /* Link is still healthy for IO reads */
> pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,
> @@ -1420,11 +1423,11 @@ static inline void aer_process_err_devices(struct aer_err_info *e_info)
>
> /* Report all before handling them, to not lose records by reset etc. */
> for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
> - if (aer_get_device_error_info(e_info, i))
> + if (aer_get_device_error_info(e_info, i, false))
> aer_print_error(e_info, i);
> }
> for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
> - if (aer_get_device_error_info(e_info, i))
> + if (aer_get_device_error_info(e_info, i, false))
> handle_error_source(e_info->dev[i], e_info);
> }
> }
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index f6069f621683..21c4e8371279 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -284,7 +284,7 @@ struct pci_dev *dpc_process_error(struct pci_dev *pdev)
> pci_warn(pdev, "containment event, status:%#06x: unmasked uncorrectable error detected\n",
> status);
> if (dpc_get_aer_uncorrect_severity(pdev, &info) &&
> - aer_get_device_error_info(&info, 0)) {
> + aer_get_device_error_info(&info, 0, false)) {
> aer_print_error(&info, 0);
> pci_aer_clear_nonfatal_status(pdev);
> pci_aer_clear_fatal_status(pdev);
> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
> index de6381c690f5..744d77ee7271 100644
> --- a/drivers/pci/pcie/err.c
> +++ b/drivers/pci/pcie/err.c
> @@ -196,6 +196,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> struct pci_dev *bridge;
> pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
> struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
> + struct aer_err_info info;
>
> /*
> * If the error was detected by a Root Port, Downstream Port, RCEC,
> @@ -223,6 +224,15 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> pci_warn(bridge, "subordinate device reset failed\n");
> goto failed;
> }
> +
> + info.dev[0] = dev;
> + info.level = KERN_ERR;
> + info.severity = AER_FATAL;
> + /* Link recovered, report fatal errors of RCiEP or EP */
> + if ((type == PCI_EXP_TYPE_ENDPOINT ||
> + type == PCI_EXP_TYPE_RC_END) &&
> + aer_get_device_error_info(&info, 0, true))
> + aer_print_error(&info, 0);
> } else {
> pci_walk_bridge(bridge, report_normal_detected, &status);
> }
> @@ -259,6 +269,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
> if (host->native_aer || pcie_ports_native) {
> pcie_clear_device_status(dev);
> pci_aer_clear_nonfatal_status(dev);
> + pci_aer_clear_fatal_status(dev);
Above change looks unrelated to dumping the error info. It would be better if
you move it to a separate patch.
> }
>
> pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v5 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered
2025-09-17 19:09 ` Kuppuswamy Sathyanarayanan
@ 2025-09-18 2:45 ` Shuai Xue
0 siblings, 0 replies; 9+ messages in thread
From: Shuai Xue @ 2025-09-18 2:45 UTC (permalink / raw)
To: Kuppuswamy Sathyanarayanan, bhelgaas, mahesh, mani,
Jonathan.Cameron
Cc: oohall, linux-pci, linux-kernel, linuxppc-dev
On 2025/9/18 03:09, Kuppuswamy Sathyanarayanan wrote:
>
> On 9/16/25 23:33, Shuai Xue wrote:
>> The AER driver has historically avoided reading the configuration space of
>> an endpoint or RCiEP that reported a fatal error, considering the link to
>> that device unreliable. Consequently, when a fatal error occurs, the AER
>> and DPC drivers do not report specific error types, resulting in logs like:
>>
>> pcieport 0015:00:00.0: EDR: EDR event received
>> pcieport 0015:00:00.0: EDR: Reported EDR dev: 0015:00:00.0
>> pcieport 0015:00:00.0: DPC: containment event, status:0x200d, ERR_FATAL received from 0015:01:00.0
>> pcieport 0015:00:00.0: AER: broadcast error_detected message
>> pcieport 0015:00:00.0: AER: broadcast mmio_enabled message
>> pcieport 0015:00:00.0: AER: broadcast resume message
>> pcieport 0015:00:00.0: pciehp: Slot(21): Link Down/Up ignored
>> pcieport 0015:00:00.0: AER: device recovery successful
>> pcieport 0015:00:00.0: EDR: DPC port successfully recovered
>> pcieport 0015:00:00.0: EDR: Status for 0015:00:00.0: 0x80
>>
>> AER status registers are sticky and Write-1-to-clear. If the link recovered
>> after a hot reset, we can still safely access the AER status and TLP header
>> of the error device. In that case, report the fatal errors, which helps to
>> figure out the root cause.
>>
>> After this patch, the logs look like:
>>
>> pcieport 0015:00:00.0: EDR: EDR event received
>> pcieport 0015:00:00.0: EDR: Reported EDR dev: 0015:00:00.0
>> pcieport 0015:00:00.0: DPC: containment event, status:0x200d, ERR_FATAL received from 0015:01:00.0
>> pcieport 0015:00:00.0: AER: broadcast error_detected message
>> vfio-pci 0015:01:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>> pcieport 0015:00:00.0: pciehp: Slot(21): Link Down/Up ignored
>> vfio-pci 0015:01:00.0: device [144d:a80a] error status/mask=00001000/00400000
>> vfio-pci 0015:01:00.0: [12] TLP (First)
>> vfio-pci 0015:01:00.0: AER: TLP Header: 0x4a004010 0x00000040 0x01000000 0xffffffff
>> pcieport 0015:00:00.0: AER: broadcast mmio_enabled message
>> pcieport 0015:00:00.0: AER: broadcast resume message
>> pcieport 0015:00:00.0: AER: device recovery successful
>> pcieport 0015:00:00.0: EDR: DPC port successfully recovered
>> pcieport 0015:00:00.0: EDR: Status for 0015:00:00.0: 0x80
>>
>> Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
>> ---
>> drivers/pci/pci.h | 3 ++-
>> drivers/pci/pcie/aer.c | 11 +++++++----
>> drivers/pci/pcie/dpc.c | 2 +-
>> drivers/pci/pcie/err.c | 11 +++++++++++
>> 4 files changed, 21 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index de2f07cefa72..b8d364545e7d 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -629,7 +629,8 @@ struct aer_err_info {
>> struct pcie_tlp_log tlp; /* TLP Header */
>> };
>> -int aer_get_device_error_info(struct aer_err_info *info, int i);
>> +int aer_get_device_error_info(struct aer_err_info *info, int i,
>> + bool link_healthy);
>> void aer_print_error(struct aer_err_info *info, int i);
>> int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index e286c197d716..157ad7fb44a0 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -1351,12 +1351,14 @@ EXPORT_SYMBOL_GPL(aer_recover_queue);
>> * aer_get_device_error_info - read error status from dev and store it to info
>> * @info: pointer to structure to store the error record
>> * @i: index into info->dev[]
>> + * @link_healthy: link is healthy or not
>> *
>> * Return: 1 on success, 0 on error.
>> *
>> * Note that @info is reused among all error devices. Clear fields properly.
>> */
>> -int aer_get_device_error_info(struct aer_err_info *info, int i)
>> +int aer_get_device_error_info(struct aer_err_info *info, int i,
>> + bool link_healthy)
>> {
>> struct pci_dev *dev;
>> int type, aer;
>> @@ -1387,7 +1389,8 @@ int aer_get_device_error_info(struct aer_err_info *info, int i)
>> } else if (type == PCI_EXP_TYPE_ROOT_PORT ||
>> type == PCI_EXP_TYPE_RC_EC ||
>> type == PCI_EXP_TYPE_DOWNSTREAM ||
>> - info->severity == AER_NONFATAL) {
>> + info->severity == AER_NONFATAL ||
>> + (info->severity == AER_FATAL && link_healthy)) {
>> /* Link is still healthy for IO reads */
>> pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS,
>> @@ -1420,11 +1423,11 @@ static inline void aer_process_err_devices(struct aer_err_info *e_info)
>> /* Report all before handling them, to not lose records by reset etc. */
>> for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
>> - if (aer_get_device_error_info(e_info, i))
>> + if (aer_get_device_error_info(e_info, i, false))
>> aer_print_error(e_info, i);
>> }
>> for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
>> - if (aer_get_device_error_info(e_info, i))
>> + if (aer_get_device_error_info(e_info, i, false))
>> handle_error_source(e_info->dev[i], e_info);
>> }
>> }
>> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
>> index f6069f621683..21c4e8371279 100644
>> --- a/drivers/pci/pcie/dpc.c
>> +++ b/drivers/pci/pcie/dpc.c
>> @@ -284,7 +284,7 @@ struct pci_dev *dpc_process_error(struct pci_dev *pdev)
>> pci_warn(pdev, "containment event, status:%#06x: unmasked uncorrectable error detected\n",
>> status);
>> if (dpc_get_aer_uncorrect_severity(pdev, &info) &&
>> - aer_get_device_error_info(&info, 0)) {
>> + aer_get_device_error_info(&info, 0, false)) {
>> aer_print_error(&info, 0);
>> pci_aer_clear_nonfatal_status(pdev);
>> pci_aer_clear_fatal_status(pdev);
>> diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
>> index de6381c690f5..744d77ee7271 100644
>> --- a/drivers/pci/pcie/err.c
>> +++ b/drivers/pci/pcie/err.c
>> @@ -196,6 +196,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>> struct pci_dev *bridge;
>> pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
>> struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
>> + struct aer_err_info info;
>> /*
>> * If the error was detected by a Root Port, Downstream Port, RCEC,
>> @@ -223,6 +224,15 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>> pci_warn(bridge, "subordinate device reset failed\n");
>> goto failed;
>> }
>> +
>> + info.dev[0] = dev;
>> + info.level = KERN_ERR;
>> + info.severity = AER_FATAL;
>> + /* Link recovered, report fatal errors of RCiEP or EP */
>> + if ((type == PCI_EXP_TYPE_ENDPOINT ||
>> + type == PCI_EXP_TYPE_RC_END) &&
>> + aer_get_device_error_info(&info, 0, true))
>> + aer_print_error(&info, 0);
>> } else {
>> pci_walk_bridge(bridge, report_normal_detected, &status);
>> }
>> @@ -259,6 +269,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
>> if (host->native_aer || pcie_ports_native) {
>> pcie_clear_device_status(dev);
>> pci_aer_clear_nonfatal_status(dev);
>> + pci_aer_clear_fatal_status(dev);
>
> Above change looks unrelated to dumping the error info. It would be better if
> you move it to a separate patch.
>
Hi, Kuppuswamy,
Thanks for the quick reply and valuable comments.
Sure, I will add two new separate patches.
The first is to use pcie_aer_is_native() to check for native AER control.
--------- patch 1 start ------
Subject: [PATCH v6 4/5] PCI/ERR: Use pcie_aer_is_native() to check for native AER control
Replace the manual checks for native AER control with the
pcie_aer_is_native() helper, which provides a more robust way
to determine if we have native control of AER.
No functional changes intended.
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
drivers/pci/pcie/err.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 1397f62f13dc..86624ae61cb6 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -195,7 +195,6 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
int type = pci_pcie_type(dev);
struct pci_dev *bridge;
pci_ers_result_t status = PCI_ERS_RESULT_CAN_RECOVER;
- struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);
struct aer_err_info info;
/*
@@ -266,7 +265,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
* it is responsible for clearing this status. In that case, the
* signaling device may not even be visible to the OS.
*/
- if (host->native_aer || pcie_ports_native) {
+ if (pcie_aer_is_native(dev)) {
pcie_clear_device_status(dev);
pci_aer_clear_nonfatal_status(dev);
}
--
--------- patch 1 end ------
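For reference, pcie_aer_is_native() folds the same host->native_aer ||
pcie_ports_native check into a single helper and additionally requires the
device to actually have an AER capability; roughly (paraphrased from
drivers/pci/pcie/aer.c, not verbatim):

  int pcie_aer_is_native(struct pci_dev *dev)
  {
          struct pci_host_bridge *host = pci_find_host_bridge(dev->bus);

          if (!dev->aer_cap)
                  return 0;

          return pcie_ports_native || host->native_aer;
  }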
The second is to clear the AER fatal status for err_dev:
--------- patch 2 start ------
Subject: [PATCH v6 5/5] PCI/AER: Clear both AER fatal and non-fatal status
The DPC driver clears the AER fatal status for the port that reported the
error, but not for the downstream device that detected the error. The
current recovery code only clears the non-fatal AER status, leaving fatal
status bits set in the error device.
Use pci_aer_raw_clear_status() to clear both fatal and non-fatal error
status in the error device, ensuring all AER status bits are properly
cleared after recovery.
Signed-off-by: Shuai Xue <xueshuai@linux.alibaba.com>
---
drivers/pci/pcie/err.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/pci/pcie/err.c b/drivers/pci/pcie/err.c
index 86624ae61cb6..96d99eaf13d2 100644
--- a/drivers/pci/pcie/err.c
+++ b/drivers/pci/pcie/err.c
@@ -267,7 +267,7 @@ pci_ers_result_t pcie_do_recovery(struct pci_dev *dev,
*/
if (pcie_aer_is_native(dev)) {
pcie_clear_device_status(dev);
- pci_aer_clear_nonfatal_status(dev);
+ pci_aer_raw_clear_status(dev);
}
pci_walk_bridge(bridge, pci_pm_runtime_put, NULL);
--
2.39.3
--------- patch 2 end ------
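For reference on the difference between the two helpers:
pci_aer_clear_nonfatal_status() clears only the non-fatal bits of the
Uncorrectable Error Status register, while pci_aer_raw_clear_status() clears
the correctable and uncorrectable status wholesale, without checking AER
ownership. A condensed sketch of the raw clear (paraphrased; the Root Error
Status handling for Root Ports and RCECs is omitted):

  void pci_aer_raw_clear_status(struct pci_dev *dev)
  {
          int aer = dev->aer_cap;
          u32 status;

          if (!aer)
                  return;

          /* Both registers are W1C: write back whatever was read */
          pci_read_config_dword(dev, aer + PCI_ERR_COR_STATUS, &status);
          pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS, status);

          pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, &status);
          pci_write_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, status);
  }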
I can also squash these two patches into one. Which way do you prefer?
Thanks.
Shuai
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH v5 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered
2025-09-17 6:33 ` [PATCH v5 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered Shuai Xue
2025-09-17 19:09 ` Kuppuswamy Sathyanarayanan
@ 2025-09-18 20:33 ` Bjorn Helgaas
2025-09-19 1:41 ` Shuai Xue
1 sibling, 1 reply; 9+ messages in thread
From: Bjorn Helgaas @ 2025-09-18 20:33 UTC (permalink / raw)
To: Shuai Xue
Cc: bhelgaas, mahesh, mani, Jonathan.Cameron,
sathyanarayanan.kuppuswamy, oohall, linux-pci, linux-kernel,
linuxppc-dev
On Wed, Sep 17, 2025 at 02:33:52PM +0800, Shuai Xue wrote:
> The AER driver has historically avoided reading the configuration space of
> an endpoint or RCiEP that reported a fatal error, considering the link to
> that device unreliable. Consequently, when a fatal error occurs, the AER
> and DPC drivers do not report specific error types, resulting in logs like:
>
> pcieport 0015:00:00.0: EDR: EDR event received
> pcieport 0015:00:00.0: EDR: Reported EDR dev: 0015:00:00.0
> pcieport 0015:00:00.0: DPC: containment event, status:0x200d, ERR_FATAL received from 0015:01:00.0
> pcieport 0015:00:00.0: AER: broadcast error_detected message
> pcieport 0015:00:00.0: AER: broadcast mmio_enabled message
> pcieport 0015:00:00.0: AER: broadcast resume message
> pcieport 0015:00:00.0: pciehp: Slot(21): Link Down/Up ignored
> pcieport 0015:00:00.0: AER: device recovery successful
> pcieport 0015:00:00.0: EDR: DPC port successfully recovered
> pcieport 0015:00:00.0: EDR: Status for 0015:00:00.0: 0x80
When you update this series, can you indent these messages with two
spaces instead of a tab? That will preserve a little space and also
preserve the formatting when "git log" adds its own indentation.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH v5 3/3] PCI/AER: Report fatal errors of RCiEP and EP if link recovered
2025-09-18 20:33 ` Bjorn Helgaas
@ 2025-09-19 1:41 ` Shuai Xue
0 siblings, 0 replies; 9+ messages in thread
From: Shuai Xue @ 2025-09-19 1:41 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: bhelgaas, mahesh, mani, Jonathan.Cameron,
sathyanarayanan.kuppuswamy, oohall, linux-pci, linux-kernel,
linuxppc-dev
On 2025/9/19 04:33, Bjorn Helgaas wrote:
> On Wed, Sep 17, 2025 at 02:33:52PM +0800, Shuai Xue wrote:
>> The AER driver has historically avoided reading the configuration space of
>> an endpoint or RCiEP that reported a fatal error, considering the link to
>> that device unreliable. Consequently, when a fatal error occurs, the AER
>> and DPC drivers do not report specific error types, resulting in logs like:
>>
>> pcieport 0015:00:00.0: EDR: EDR event received
>> pcieport 0015:00:00.0: EDR: Reported EDR dev: 0015:00:00.0
>> pcieport 0015:00:00.0: DPC: containment event, status:0x200d, ERR_FATAL received from 0015:01:00.0
>> pcieport 0015:00:00.0: AER: broadcast error_detected message
>> pcieport 0015:00:00.0: AER: broadcast mmio_enabled message
>> pcieport 0015:00:00.0: AER: broadcast resume message
>> pcieport 0015:00:00.0: pciehp: Slot(21): Link Down/Up ignored
>> pcieport 0015:00:00.0: AER: device recovery successful
>> pcieport 0015:00:00.0: EDR: DPC port successfully recovered
>> pcieport 0015:00:00.0: EDR: Status for 0015:00:00.0: 0x80
>
> When you update this series, can you indent these messages with two
> spaces instead of a tab? That will preserve a little space and also
> preserve the formatting when "git log" adds its own indentation.
Sure, I will indent with spaces.
Thanks.
Shuai
^ permalink raw reply [flat|nested] 9+ messages in thread