* [PATCH v2] PCI/AER: Fix AER log missing in DPC case
@ 2026-01-29 14:01 Sizhe Liu
2026-02-06 10:00 ` Sizhe LIU
2026-02-06 20:10 ` Bjorn Helgaas
0 siblings, 2 replies; 5+ messages in thread
From: Sizhe Liu @ 2026-01-29 14:01 UTC (permalink / raw)
To: bhelgaas, jonathan.cameron, shiju.jose, pandoh
Cc: linux-pci, linuxarm, prime.zeng, fanghao11, shenyang39, liusizhe5
In the current DPC error reporting case, some AER log information is missing.
-- Error log abnormal
pcieport 0000:20:00.0: DPC: containment event, status: 0x1f11: unmasked uncorrectable error detected
(------ AER error log supposed to be printed here, but missing ------)
nvme nvme0: frozen state error detected, reset controller
{4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
Cause:
In aer_print_error(), PCIe AER errors is reported, and is rate-limited
by info->ratelimit_print[i]. There are two entry points for
aer_print_error().
1) Native AER
aer_isr_one_error_type() -> aer_process_err_devices() ->
aer_print_error()
2) DPC
dpc_process_error() -> aer_print_error()
The value of info->ratelimit_print[i] is initialized correctly in
the native AER case:
aer_isr_one_error_type() -> find_source_device() ->
find_device_iter() -> add_error_device()
In the DPC case, info->ratelimit_print[i] is not initialized and
alloc by 0 , so in aer_print_error(), it will directly return at line
if (!info->ratelimit_print[i])
This will result in losing the AER log messages in the DPC case.
Solution:
1. Move the initialization of info->ratelimit_print[i] to
aer_ratelimit_print_init().
2. Add aer_ratelimit_print_init() in dpc_process_error().
3. Replace the initialization by aer_ratelimit_print_init()in
Native AER case.
Test with AER inject:
Set the DPC reporting priority in the BIOS and send
MalfTLP(AER FATAL ERROR) to device.
-- Error log normal
pcieport 0000:20:00.0: DPC: containment event, status:0x1f11: unmasked uncorrectable error detected
pcieport 0000:20:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:20:00.0: device [19e5:a120] error status/mask=00040000/04580000
pcieport 0000:20:00.0: [18] MalfTLP (First)
pcieport 0000:20:00.0: AER: TLP Header: 0x00000000 0x00000000 0x00000000 0x00000000
nvme nvme0: frozen state error detected, reset controller
{2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[1] https://lore.kernel.org/linux-pci/20260127035405.712271-1-liusizhe5@huawei.com/
Fixes: a57f2bfb4a58 ("PCI/AER: Ratelimit correctable and non-fatal error logging")
Signed-off-by: Sizhe Liu <liusizhe5@huawei.com>
---
v2
- Corrected the format and spelling errors in the commit log.
drivers/pci/pci.h | 1 +
drivers/pci/pcie/aer.c | 35 +++++++++++++++++++++++------------
drivers/pci/pcie/dpc.c | 1 +
3 files changed, 25 insertions(+), 12 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 0e67014aa001..0cbcbcd52354 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -748,6 +748,7 @@ struct aer_err_info {
int aer_get_device_error_info(struct aer_err_info *info, int i);
void aer_print_error(struct aer_err_info *info, int i);
+void aer_ratelimit_print_init(struct pci_dev *dev, struct aer_err_info *e_info, int idx);
int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
unsigned int tlp_len, bool flit,
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e0bcaa896803..b73915b63327 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -925,6 +925,28 @@ int cper_severity_to_aer(int cper_severity)
EXPORT_SYMBOL_GPL(cper_severity_to_aer);
#endif
+/**
+ * aer_ratelimit_print_init - set flag whether error message is printed
+ * @dev: pointer to pci_dev to be rate-limited
+ * @e_info: pointer to error info
+ * @idx: index for ratelimit_print array
+ */
+void aer_ratelimit_print_init(struct pci_dev *dev, struct aer_err_info *e_info, int idx)
+{
+ /*
+ * Ratelimit AER log messages. "dev" is either the source
+ * identified by the root's Error Source ID or it has an unmasked
+ * error logged in its own AER Capability. Messages are emitted
+ * when "ratelimit_print[i]" is non-zero. If we will print detail
+ * for a downstream device, make sure we print the Error Source ID
+ * from the root as well.
+ */
+ if (aer_ratelimit(dev, e_info->severity)) {
+ e_info->ratelimit_print[idx] = 1;
+ e_info->root_ratelimit_print = 1;
+ }
+}
+
void pci_print_aer(struct pci_dev *dev, int aer_severity,
struct aer_capability_regs *aer)
{
@@ -990,18 +1012,7 @@ static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
e_info->dev[i] = pci_dev_get(dev);
e_info->error_dev_num++;
- /*
- * Ratelimit AER log messages. "dev" is either the source
- * identified by the root's Error Source ID or it has an unmasked
- * error logged in its own AER Capability. Messages are emitted
- * when "ratelimit_print[i]" is non-zero. If we will print detail
- * for a downstream device, make sure we print the Error Source ID
- * from the root as well.
- */
- if (aer_ratelimit(dev, e_info->severity)) {
- e_info->ratelimit_print[i] = 1;
- e_info->root_ratelimit_print = 1;
- }
+ aer_ratelimit_print_init(dev, e_info, i);
return 0;
}
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index fc18349614d7..d17adc642781 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -275,6 +275,7 @@ void dpc_process_error(struct pci_dev *pdev)
status);
if (dpc_get_aer_uncorrect_severity(pdev, &info) &&
aer_get_device_error_info(&info, 0)) {
+ aer_ratelimit_print_init(pdev, &info, 0);
aer_print_error(&info, 0);
pci_aer_clear_nonfatal_status(pdev);
pci_aer_clear_fatal_status(pdev);
--
2.33.0
^ permalink raw reply related [flat|nested] 5+ messages in thread* Re: [PATCH v2] PCI/AER: Fix AER log missing in DPC case 2026-01-29 14:01 [PATCH v2] PCI/AER: Fix AER log missing in DPC case Sizhe Liu @ 2026-02-06 10:00 ` Sizhe LIU 2026-02-06 20:10 ` Bjorn Helgaas 1 sibling, 0 replies; 5+ messages in thread From: Sizhe LIU @ 2026-02-06 10:00 UTC (permalink / raw) To: bhelgaas, jonathan.cameron, shiju.jose, pandoh Cc: linux-pci, linuxarm, prime.zeng, fanghao11, shenyang39 Hi, Gentle ping. Best regards, Sizhe On 2026/1/29 22:01, Sizhe Liu wrote: > In the current DPC error reporting case, some AER log information is missing. > > -- Error log abnormal > pcieport 0000:20:00.0: DPC: containment event, status: 0x1f11: unmasked uncorrectable error detected > (------ AER error log supposed to be printed here, but missing ------) > nvme nvme0: frozen state error detected, reset controller > {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0 > > Cause: > In aer_print_error(), PCIe AER errors is reported, and is rate-limited > by info->ratelimit_print[i]. There are two entry points for > aer_print_error(). > > 1) Native AER > aer_isr_one_error_type() -> aer_process_err_devices() -> > aer_print_error() > 2) DPC > dpc_process_error() -> aer_print_error() > > The value of info->ratelimit_print[i] is initialized correctly in > the native AER case: > aer_isr_one_error_type() -> find_source_device() -> > find_device_iter() -> add_error_device() > > In the DPC case, info->ratelimit_print[i] is not initialized and > alloc by 0 , so in aer_print_error(), it will directly return at line > if (!info->ratelimit_print[i]) > This will result in losing the AER log messages in the DPC case. > > Solution: > 1. Move the initialization of info->ratelimit_print[i] to > aer_ratelimit_print_init(). > 2. Add aer_ratelimit_print_init() in dpc_process_error(). > 3. Replace the initialization by aer_ratelimit_print_init()in > Native AER case. > > Test with AER inject: > Set the DPC reporting priority in the BIOS and send > MalfTLP(AER FATAL ERROR) to device. > > -- Error log normal > pcieport 0000:20:00.0: DPC: containment event, status:0x1f11: unmasked uncorrectable error detected > pcieport 0000:20:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID) > pcieport 0000:20:00.0: device [19e5:a120] error status/mask=00040000/04580000 > pcieport 0000:20:00.0: [18] MalfTLP (First) > pcieport 0000:20:00.0: AER: TLP Header: 0x00000000 0x00000000 0x00000000 0x00000000 > nvme nvme0: frozen state error detected, reset controller > {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0 > > [1] https://lore.kernel.org/linux-pci/20260127035405.712271-1-liusizhe5@huawei.com/ > > Fixes: a57f2bfb4a58 ("PCI/AER: Ratelimit correctable and non-fatal error logging") > Signed-off-by: Sizhe Liu <liusizhe5@huawei.com> > --- > v2 > - Corrected the format and spelling errors in the commit log. > > drivers/pci/pci.h | 1 + > drivers/pci/pcie/aer.c | 35 +++++++++++++++++++++++------------ > drivers/pci/pcie/dpc.c | 1 + > 3 files changed, 25 insertions(+), 12 deletions(-) > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > index 0e67014aa001..0cbcbcd52354 100644 > --- a/drivers/pci/pci.h > +++ b/drivers/pci/pci.h > @@ -748,6 +748,7 @@ struct aer_err_info { > > int aer_get_device_error_info(struct aer_err_info *info, int i); > void aer_print_error(struct aer_err_info *info, int i); > +void aer_ratelimit_print_init(struct pci_dev *dev, struct aer_err_info *e_info, int idx); > > int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2, > unsigned int tlp_len, bool flit, > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index e0bcaa896803..b73915b63327 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -925,6 +925,28 @@ int cper_severity_to_aer(int cper_severity) > EXPORT_SYMBOL_GPL(cper_severity_to_aer); > #endif > > +/** > + * aer_ratelimit_print_init - set flag whether error message is printed > + * @dev: pointer to pci_dev to be rate-limited > + * @e_info: pointer to error info > + * @idx: index for ratelimit_print array > + */ > +void aer_ratelimit_print_init(struct pci_dev *dev, struct aer_err_info *e_info, int idx) > +{ > + /* > + * Ratelimit AER log messages. "dev" is either the source > + * identified by the root's Error Source ID or it has an unmasked > + * error logged in its own AER Capability. Messages are emitted > + * when "ratelimit_print[i]" is non-zero. If we will print detail > + * for a downstream device, make sure we print the Error Source ID > + * from the root as well. > + */ > + if (aer_ratelimit(dev, e_info->severity)) { > + e_info->ratelimit_print[idx] = 1; > + e_info->root_ratelimit_print = 1; > + } > +} > + > void pci_print_aer(struct pci_dev *dev, int aer_severity, > struct aer_capability_regs *aer) > { > @@ -990,18 +1012,7 @@ static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev) > e_info->dev[i] = pci_dev_get(dev); > e_info->error_dev_num++; > > - /* > - * Ratelimit AER log messages. "dev" is either the source > - * identified by the root's Error Source ID or it has an unmasked > - * error logged in its own AER Capability. Messages are emitted > - * when "ratelimit_print[i]" is non-zero. If we will print detail > - * for a downstream device, make sure we print the Error Source ID > - * from the root as well. > - */ > - if (aer_ratelimit(dev, e_info->severity)) { > - e_info->ratelimit_print[i] = 1; > - e_info->root_ratelimit_print = 1; > - } > + aer_ratelimit_print_init(dev, e_info, i); > return 0; > } > > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c > index fc18349614d7..d17adc642781 100644 > --- a/drivers/pci/pcie/dpc.c > +++ b/drivers/pci/pcie/dpc.c > @@ -275,6 +275,7 @@ void dpc_process_error(struct pci_dev *pdev) > status); > if (dpc_get_aer_uncorrect_severity(pdev, &info) && > aer_get_device_error_info(&info, 0)) { > + aer_ratelimit_print_init(pdev, &info, 0); > aer_print_error(&info, 0); > pci_aer_clear_nonfatal_status(pdev); > pci_aer_clear_fatal_status(pdev); ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v2] PCI/AER: Fix AER log missing in DPC case 2026-01-29 14:01 [PATCH v2] PCI/AER: Fix AER log missing in DPC case Sizhe Liu 2026-02-06 10:00 ` Sizhe LIU @ 2026-02-06 20:10 ` Bjorn Helgaas 2026-02-10 8:00 ` Sizhe LIU 1 sibling, 1 reply; 5+ messages in thread From: Bjorn Helgaas @ 2026-02-06 20:10 UTC (permalink / raw) To: Sizhe Liu Cc: bhelgaas, jonathan.cameron, shiju.jose, pandoh, linux-pci, linuxarm, prime.zeng, fanghao11, shenyang39 On Thu, Jan 29, 2026 at 10:01:03PM +0800, Sizhe Liu wrote: > In the current DPC error reporting case, some AER log information is missing. > > -- Error log abnormal > pcieport 0000:20:00.0: DPC: containment event, status: 0x1f11: unmasked uncorrectable error detected > (------ AER error log supposed to be printed here, but missing ------) > nvme nvme0: frozen state error detected, reset controller > {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0 > > Cause: > In aer_print_error(), PCIe AER errors is reported, and is rate-limited > by info->ratelimit_print[i]. There are two entry points for > aer_print_error(). > > 1) Native AER > aer_isr_one_error_type() -> aer_process_err_devices() -> > aer_print_error() > 2) DPC > dpc_process_error() -> aer_print_error() > > The value of info->ratelimit_print[i] is initialized correctly in > the native AER case: > aer_isr_one_error_type() -> find_source_device() -> > find_device_iter() -> add_error_device() > > In the DPC case, info->ratelimit_print[i] is not initialized and > alloc by 0 , so in aer_print_error(), it will directly return at line > if (!info->ratelimit_print[i]) > This will result in losing the AER log messages in the DPC case. > > Solution: > 1. Move the initialization of info->ratelimit_print[i] to > aer_ratelimit_print_init(). > 2. Add aer_ratelimit_print_init() in dpc_process_error(). > 3. Replace the initialization by aer_ratelimit_print_init()in > Native AER case. I see the problem, and I think you're right that we're not logging any AER info for DPC events (including events handled via the EDR path, which also calls dpc_process_error()). Currently we do the ratelimit init in add_error_device(), which also includes pci_dev_get() for the device. I don't see a similar pci_dev_get() anywhere in the DPC path. There is one in the EDR path: edr_handle_event acpi_dpc_port_get pci_dev_get <-- dpc_process_error aer_get_device_error_info(aer_err_info) aer_print_error(aer_err_info) pcie_do_recovery pci_dev_put Maybe DPC and EDR should be using add_error_device() directly? It seems like holding that reference on the device is important. > Test with AER inject: > Set the DPC reporting priority in the BIOS and send > MalfTLP(AER FATAL ERROR) to device. > > -- Error log normal > pcieport 0000:20:00.0: DPC: containment event, status:0x1f11: unmasked uncorrectable error detected > pcieport 0000:20:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID) > pcieport 0000:20:00.0: device [19e5:a120] error status/mask=00040000/04580000 > pcieport 0000:20:00.0: [18] MalfTLP (First) > pcieport 0000:20:00.0: AER: TLP Header: 0x00000000 0x00000000 0x00000000 0x00000000 > nvme nvme0: frozen state error detected, reset controller > {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0 > > [1] https://lore.kernel.org/linux-pci/20260127035405.712271-1-liusizhe5@huawei.com/ > > Fixes: a57f2bfb4a58 ("PCI/AER: Ratelimit correctable and non-fatal error logging") > Signed-off-by: Sizhe Liu <liusizhe5@huawei.com> > --- > v2 > - Corrected the format and spelling errors in the commit log. > > drivers/pci/pci.h | 1 + > drivers/pci/pcie/aer.c | 35 +++++++++++++++++++++++------------ > drivers/pci/pcie/dpc.c | 1 + > 3 files changed, 25 insertions(+), 12 deletions(-) > > diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h > index 0e67014aa001..0cbcbcd52354 100644 > --- a/drivers/pci/pci.h > +++ b/drivers/pci/pci.h > @@ -748,6 +748,7 @@ struct aer_err_info { > > int aer_get_device_error_info(struct aer_err_info *info, int i); > void aer_print_error(struct aer_err_info *info, int i); > +void aer_ratelimit_print_init(struct pci_dev *dev, struct aer_err_info *e_info, int idx); > > int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2, > unsigned int tlp_len, bool flit, > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > index e0bcaa896803..b73915b63327 100644 > --- a/drivers/pci/pcie/aer.c > +++ b/drivers/pci/pcie/aer.c > @@ -925,6 +925,28 @@ int cper_severity_to_aer(int cper_severity) > EXPORT_SYMBOL_GPL(cper_severity_to_aer); > #endif > > +/** > + * aer_ratelimit_print_init - set flag whether error message is printed > + * @dev: pointer to pci_dev to be rate-limited > + * @e_info: pointer to error info > + * @idx: index for ratelimit_print array > + */ > +void aer_ratelimit_print_init(struct pci_dev *dev, struct aer_err_info *e_info, int idx) > +{ > + /* > + * Ratelimit AER log messages. "dev" is either the source > + * identified by the root's Error Source ID or it has an unmasked > + * error logged in its own AER Capability. Messages are emitted > + * when "ratelimit_print[i]" is non-zero. If we will print detail > + * for a downstream device, make sure we print the Error Source ID > + * from the root as well. > + */ > + if (aer_ratelimit(dev, e_info->severity)) { > + e_info->ratelimit_print[idx] = 1; > + e_info->root_ratelimit_print = 1; > + } > +} > + > void pci_print_aer(struct pci_dev *dev, int aer_severity, > struct aer_capability_regs *aer) > { > @@ -990,18 +1012,7 @@ static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev) > e_info->dev[i] = pci_dev_get(dev); > e_info->error_dev_num++; > > - /* > - * Ratelimit AER log messages. "dev" is either the source > - * identified by the root's Error Source ID or it has an unmasked > - * error logged in its own AER Capability. Messages are emitted > - * when "ratelimit_print[i]" is non-zero. If we will print detail > - * for a downstream device, make sure we print the Error Source ID > - * from the root as well. > - */ > - if (aer_ratelimit(dev, e_info->severity)) { > - e_info->ratelimit_print[i] = 1; > - e_info->root_ratelimit_print = 1; > - } > + aer_ratelimit_print_init(dev, e_info, i); > return 0; > } > > diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c > index fc18349614d7..d17adc642781 100644 > --- a/drivers/pci/pcie/dpc.c > +++ b/drivers/pci/pcie/dpc.c > @@ -275,6 +275,7 @@ void dpc_process_error(struct pci_dev *pdev) > status); > if (dpc_get_aer_uncorrect_severity(pdev, &info) && > aer_get_device_error_info(&info, 0)) { > + aer_ratelimit_print_init(pdev, &info, 0); > aer_print_error(&info, 0); > pci_aer_clear_nonfatal_status(pdev); > pci_aer_clear_fatal_status(pdev); > -- > 2.33.0 > ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v2] PCI/AER: Fix AER log missing in DPC case 2026-02-06 20:10 ` Bjorn Helgaas @ 2026-02-10 8:00 ` Sizhe LIU 2026-02-11 12:18 ` Sizhe Liu 0 siblings, 1 reply; 5+ messages in thread From: Sizhe LIU @ 2026-02-10 8:00 UTC (permalink / raw) To: Bjorn Helgaas Cc: bhelgaas, jonathan.cameron, shiju.jose, pandoh, linux-pci, linuxarm, prime.zeng, fanghao11, shenyang39 On 2026/2/7 4:10, Bjorn Helgaas wrote: > On Thu, Jan 29, 2026 at 10:01:03PM +0800, Sizhe Liu wrote: >> In the current DPC error reporting case, some AER log information is missing. >> >> -- Error log abnormal >> pcieport 0000:20:00.0: DPC: containment event, status: 0x1f11: unmasked uncorrectable error detected >> (------ AER error log supposed to be printed here, but missing ------) >> nvme nvme0: frozen state error detected, reset controller >> {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0 >> >> Cause: >> In aer_print_error(), PCIe AER errors is reported, and is rate-limited >> by info->ratelimit_print[i]. There are two entry points for >> aer_print_error(). >> >> 1) Native AER >> aer_isr_one_error_type() -> aer_process_err_devices() -> >> aer_print_error() >> 2) DPC >> dpc_process_error() -> aer_print_error() >> >> The value of info->ratelimit_print[i] is initialized correctly in >> the native AER case: >> aer_isr_one_error_type() -> find_source_device() -> >> find_device_iter() -> add_error_device() >> >> In the DPC case, info->ratelimit_print[i] is not initialized and >> alloc by 0 , so in aer_print_error(), it will directly return at line >> if (!info->ratelimit_print[i]) >> This will result in losing the AER log messages in the DPC case. >> >> Solution: >> 1. Move the initialization of info->ratelimit_print[i] to >> aer_ratelimit_print_init(). >> 2. Add aer_ratelimit_print_init() in dpc_process_error(). >> 3. Replace the initialization by aer_ratelimit_print_init()in >> Native AER case. > I see the problem, and I think you're right that we're not logging any > AER info for DPC events (including events handled via the EDR path, > which also calls dpc_process_error()). > > Currently we do the ratelimit init in add_error_device(), which also > includes pci_dev_get() for the device. I don't see a similar > pci_dev_get() anywhere in the DPC path. There is one in the EDR path: > > edr_handle_event > acpi_dpc_port_get > pci_dev_get <-- > dpc_process_error > aer_get_device_error_info(aer_err_info) > aer_print_error(aer_err_info) > pcie_do_recovery > pci_dev_put > > Maybe DPC and EDR should be using add_error_device() directly? It > seems like holding that reference on the device is important. I agree with directly using add_error_device() – holding a reference to the device is a more robust approach. Below is the detailed call trace for reference: DPC path: dpc_handler add_error_device <-- dpc_process_error aer_get_device_error_info(aer_err_info) aer_print_error(aer_err_info) pcie_do_recovery pci_dev_put EDR path: edr_handle_event acpi_dpc_port_get add_error_device <-- dpc_process_error aer_get_device_error_info(aer_err_info) aer_print_error(aer_err_info) pcie_do_recovery pci_dev_put I will implement this change in the v3 patch and add relevant comments to add_error_device() for clarity. Thanks, Sizhe >> Test with AER inject: >> Set the DPC reporting priority in the BIOS and send >> MalfTLP(AER FATAL ERROR) to device. >> >> -- Error log normal >> pcieport 0000:20:00.0: DPC: containment event, status:0x1f11: unmasked uncorrectable error detected >> pcieport 0000:20:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID) >> pcieport 0000:20:00.0: device [19e5:a120] error status/mask=00040000/04580000 >> pcieport 0000:20:00.0: [18] MalfTLP (First) >> pcieport 0000:20:00.0: AER: TLP Header: 0x00000000 0x00000000 0x00000000 0x00000000 >> nvme nvme0: frozen state error detected, reset controller >> {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0 >> >> [1] https://lore.kernel.org/linux-pci/20260127035405.712271-1-liusizhe5@huawei.com/ >> >> Fixes: a57f2bfb4a58 ("PCI/AER: Ratelimit correctable and non-fatal error logging") >> Signed-off-by: Sizhe Liu <liusizhe5@huawei.com> >> --- >> v2 >> - Corrected the format and spelling errors in the commit log. >> >> drivers/pci/pci.h | 1 + >> drivers/pci/pcie/aer.c | 35 +++++++++++++++++++++++------------ >> drivers/pci/pcie/dpc.c | 1 + >> 3 files changed, 25 insertions(+), 12 deletions(-) >> >> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h >> index 0e67014aa001..0cbcbcd52354 100644 >> --- a/drivers/pci/pci.h >> +++ b/drivers/pci/pci.h >> @@ -748,6 +748,7 @@ struct aer_err_info { >> >> int aer_get_device_error_info(struct aer_err_info *info, int i); >> void aer_print_error(struct aer_err_info *info, int i); >> +void aer_ratelimit_print_init(struct pci_dev *dev, struct aer_err_info *e_info, int idx); >> >> int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2, >> unsigned int tlp_len, bool flit, >> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c >> index e0bcaa896803..b73915b63327 100644 >> --- a/drivers/pci/pcie/aer.c >> +++ b/drivers/pci/pcie/aer.c >> @@ -925,6 +925,28 @@ int cper_severity_to_aer(int cper_severity) >> EXPORT_SYMBOL_GPL(cper_severity_to_aer); >> #endif >> >> +/** >> + * aer_ratelimit_print_init - set flag whether error message is printed >> + * @dev: pointer to pci_dev to be rate-limited >> + * @e_info: pointer to error info >> + * @idx: index for ratelimit_print array >> + */ >> +void aer_ratelimit_print_init(struct pci_dev *dev, struct aer_err_info *e_info, int idx) >> +{ >> + /* >> + * Ratelimit AER log messages. "dev" is either the source >> + * identified by the root's Error Source ID or it has an unmasked >> + * error logged in its own AER Capability. Messages are emitted >> + * when "ratelimit_print[i]" is non-zero. If we will print detail >> + * for a downstream device, make sure we print the Error Source ID >> + * from the root as well. >> + */ >> + if (aer_ratelimit(dev, e_info->severity)) { >> + e_info->ratelimit_print[idx] = 1; >> + e_info->root_ratelimit_print = 1; >> + } >> +} >> + >> void pci_print_aer(struct pci_dev *dev, int aer_severity, >> struct aer_capability_regs *aer) >> { >> @@ -990,18 +1012,7 @@ static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev) >> e_info->dev[i] = pci_dev_get(dev); >> e_info->error_dev_num++; >> >> - /* >> - * Ratelimit AER log messages. "dev" is either the source >> - * identified by the root's Error Source ID or it has an unmasked >> - * error logged in its own AER Capability. Messages are emitted >> - * when "ratelimit_print[i]" is non-zero. If we will print detail >> - * for a downstream device, make sure we print the Error Source ID >> - * from the root as well. >> - */ >> - if (aer_ratelimit(dev, e_info->severity)) { >> - e_info->ratelimit_print[i] = 1; >> - e_info->root_ratelimit_print = 1; >> - } >> + aer_ratelimit_print_init(dev, e_info, i); >> return 0; >> } >> >> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c >> index fc18349614d7..d17adc642781 100644 >> --- a/drivers/pci/pcie/dpc.c >> +++ b/drivers/pci/pcie/dpc.c >> @@ -275,6 +275,7 @@ void dpc_process_error(struct pci_dev *pdev) >> status); >> if (dpc_get_aer_uncorrect_severity(pdev, &info) && >> aer_get_device_error_info(&info, 0)) { >> + aer_ratelimit_print_init(pdev, &info, 0); >> aer_print_error(&info, 0); >> pci_aer_clear_nonfatal_status(pdev); >> pci_aer_clear_fatal_status(pdev); >> -- >> 2.33.0 >> > ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH v2] PCI/AER: Fix AER log missing in DPC case 2026-02-10 8:00 ` Sizhe LIU @ 2026-02-11 12:18 ` Sizhe Liu 0 siblings, 0 replies; 5+ messages in thread From: Sizhe Liu @ 2026-02-11 12:18 UTC (permalink / raw) To: Bjorn Helgaas Cc: bhelgaas, jonathan.cameron, shiju.jose, pandoh, linux-pci, linuxarm, prime.zeng, fanghao11, shenyang39 On 2026/2/10 16:00, Sizhe LIU wrote: > On 2026/2/7 4:10, Bjorn Helgaas wrote: > >> On Thu, Jan 29, 2026 at 10:01:03PM +0800, Sizhe Liu wrote: >>> In the current DPC error reporting case, some AER log information is >>> missing. >>> >>> -- Error log abnormal >>> pcieport 0000:20:00.0: DPC: containment event, status: 0x1f11: >>> unmasked uncorrectable error detected >>> (------ AER error log supposed to be printed here, but missing ------) >>> nvme nvme0: frozen state error detected, reset controller >>> {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error >>> Source: 0 >>> >>> Cause: >>> In aer_print_error(), PCIe AER errors is reported, and is rate-limited >>> by info->ratelimit_print[i]. There are two entry points for >>> aer_print_error(). >>> >>> 1) Native AER >>> aer_isr_one_error_type() -> aer_process_err_devices() -> >>> aer_print_error() >>> 2) DPC >>> dpc_process_error() -> aer_print_error() >>> >>> The value of info->ratelimit_print[i] is initialized correctly in >>> the native AER case: >>> aer_isr_one_error_type() -> find_source_device() -> >>> find_device_iter() -> add_error_device() >>> >>> In the DPC case, info->ratelimit_print[i] is not initialized and >>> alloc by 0 , so in aer_print_error(), it will directly return at line >>> if (!info->ratelimit_print[i]) >>> This will result in losing the AER log messages in the DPC case. >>> >>> Solution: >>> 1. Move the initialization of info->ratelimit_print[i] to >>> aer_ratelimit_print_init(). >>> 2. Add aer_ratelimit_print_init() in dpc_process_error(). >>> 3. Replace the initialization by aer_ratelimit_print_init()in >>> Native AER case. >> I see the problem, and I think you're right that we're not logging any >> AER info for DPC events (including events handled via the EDR path, >> which also calls dpc_process_error()). >> >> Currently we do the ratelimit init in add_error_device(), which also >> includes pci_dev_get() for the device. I don't see a similar >> pci_dev_get() anywhere in the DPC path. There is one in the EDR path: >> >> edr_handle_event >> acpi_dpc_port_get >> pci_dev_get <-- >> dpc_process_error >> aer_get_device_error_info(aer_err_info) >> aer_print_error(aer_err_info) >> pcie_do_recovery >> pci_dev_put >> >> Maybe DPC and EDR should be using add_error_device() directly? It >> seems like holding that reference on the device is important. > I agree with directly using add_error_device() – holding a reference > to the device is a more robust approach. > Below is the detailed call trace for reference: > > DPC path: > dpc_handler > add_error_device <-- > dpc_process_error > aer_get_device_error_info(aer_err_info) > aer_print_error(aer_err_info) > pcie_do_recovery > pci_dev_put > > EDR path: > edr_handle_event > acpi_dpc_port_get > add_error_device <-- > dpc_process_error > aer_get_device_error_info(aer_err_info) > aer_print_error(aer_err_info) > pcie_do_recovery > pci_dev_put > > I will implement this change in the v3 patch and add relevant comments > to add_error_device() for clarity. > > Thanks, > Sizhe Hi, I attempted to do this, but found the required changes and their impact grew increasingly substantial. We rely on aer_err_info->severity to initialize ratelimit_print in add_error_device(). To replace pci_dev_get() with add_error_device(), we must read registers to get the severity before the call. Besides pci_dev_get(), acpi_dpc_port_get() uses another path that calls pci_get_domain_bus_and_slot() to obtain the error device and take a reference via BDF, not from the pci_dev structure. This requires a separate initialization of ratelimit_print. A less intrusive approach in my opinion: 1. Extract ratelimit_print initialization into a dedicated function aer_ratelimit_print_init(), and call it after dpc_get_aer_uncorrect_severity() inside dpc_process_error(). 2. Add pci_dev_get() / pci_dev_put() pairs directly in dpc_handler(). I plan to incorporate these changes into the v3 patch. Looking forward to your feedback. Thanks, Sizhe >>> Test with AER inject: >>> Set the DPC reporting priority in the BIOS and send >>> MalfTLP(AER FATAL ERROR) to device. >>> >>> -- Error log normal >>> pcieport 0000:20:00.0: DPC: containment event, status:0x1f11: >>> unmasked uncorrectable error detected >>> pcieport 0000:20:00.0: PCIe Bus Error: severity=Uncorrectable >>> (Fatal), type=Transaction Layer, (Receiver ID) >>> pcieport 0000:20:00.0: device [19e5:a120] error >>> status/mask=00040000/04580000 >>> pcieport 0000:20:00.0: [18] MalfTLP (First) >>> pcieport 0000:20:00.0: AER: TLP Header: 0x00000000 0x00000000 >>> 0x00000000 0x00000000 >>> nvme nvme0: frozen state error detected, reset controller >>> {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error >>> Source: 0 >>> >>> [1] >>> https://lore.kernel.org/linux-pci/20260127035405.712271-1-liusizhe5@huawei.com/ >>> >>> Fixes: a57f2bfb4a58 ("PCI/AER: Ratelimit correctable and non-fatal >>> error logging") >>> Signed-off-by: Sizhe Liu <liusizhe5@huawei.com> >>> --- >>> v2 >>> - Corrected the format and spelling errors in the commit log. >>> >>> drivers/pci/pci.h | 1 + >>> drivers/pci/pcie/aer.c | 35 +++++++++++++++++++++++------------ >>> drivers/pci/pcie/dpc.c | 1 + >>> 3 files changed, 25 insertions(+), 12 deletions(-) >>> >>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h >>> index 0e67014aa001..0cbcbcd52354 100644 >>> --- a/drivers/pci/pci.h >>> +++ b/drivers/pci/pci.h >>> @@ -748,6 +748,7 @@ struct aer_err_info { >>> int aer_get_device_error_info(struct aer_err_info *info, int i); >>> void aer_print_error(struct aer_err_info *info, int i); >>> +void aer_ratelimit_print_init(struct pci_dev *dev, struct >>> aer_err_info *e_info, int idx); >>> int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2, >>> unsigned int tlp_len, bool flit, >>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c >>> index e0bcaa896803..b73915b63327 100644 >>> --- a/drivers/pci/pcie/aer.c >>> +++ b/drivers/pci/pcie/aer.c >>> @@ -925,6 +925,28 @@ int cper_severity_to_aer(int cper_severity) >>> EXPORT_SYMBOL_GPL(cper_severity_to_aer); >>> #endif >>> +/** >>> + * aer_ratelimit_print_init - set flag whether error message is >>> printed >>> + * @dev: pointer to pci_dev to be rate-limited >>> + * @e_info: pointer to error info >>> + * @idx: index for ratelimit_print array >>> + */ >>> +void aer_ratelimit_print_init(struct pci_dev *dev, struct >>> aer_err_info *e_info, int idx) >>> +{ >>> + /* >>> + * Ratelimit AER log messages. "dev" is either the source >>> + * identified by the root's Error Source ID or it has an unmasked >>> + * error logged in its own AER Capability. Messages are emitted >>> + * when "ratelimit_print[i]" is non-zero. If we will print detail >>> + * for a downstream device, make sure we print the Error Source ID >>> + * from the root as well. >>> + */ >>> + if (aer_ratelimit(dev, e_info->severity)) { >>> + e_info->ratelimit_print[idx] = 1; >>> + e_info->root_ratelimit_print = 1; >>> + } >>> +} >>> + >>> void pci_print_aer(struct pci_dev *dev, int aer_severity, >>> struct aer_capability_regs *aer) >>> { >>> @@ -990,18 +1012,7 @@ static int add_error_device(struct >>> aer_err_info *e_info, struct pci_dev *dev) >>> e_info->dev[i] = pci_dev_get(dev); >>> e_info->error_dev_num++; >>> - /* >>> - * Ratelimit AER log messages. "dev" is either the source >>> - * identified by the root's Error Source ID or it has an unmasked >>> - * error logged in its own AER Capability. Messages are emitted >>> - * when "ratelimit_print[i]" is non-zero. If we will print detail >>> - * for a downstream device, make sure we print the Error Source ID >>> - * from the root as well. >>> - */ >>> - if (aer_ratelimit(dev, e_info->severity)) { >>> - e_info->ratelimit_print[i] = 1; >>> - e_info->root_ratelimit_print = 1; >>> - } >>> + aer_ratelimit_print_init(dev, e_info, i); >>> return 0; >>> } >>> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c >>> index fc18349614d7..d17adc642781 100644 >>> --- a/drivers/pci/pcie/dpc.c >>> +++ b/drivers/pci/pcie/dpc.c >>> @@ -275,6 +275,7 @@ void dpc_process_error(struct pci_dev *pdev) >>> status); >>> if (dpc_get_aer_uncorrect_severity(pdev, &info) && >>> aer_get_device_error_info(&info, 0)) { >>> + aer_ratelimit_print_init(pdev, &info, 0); >>> aer_print_error(&info, 0); >>> pci_aer_clear_nonfatal_status(pdev); >>> pci_aer_clear_fatal_status(pdev); >>> -- >>> 2.33.0 >>> >> > > ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2026-02-11 12:18 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-01-29 14:01 [PATCH v2] PCI/AER: Fix AER log missing in DPC case Sizhe Liu 2026-02-06 10:00 ` Sizhe LIU 2026-02-06 20:10 ` Bjorn Helgaas 2026-02-10 8:00 ` Sizhe LIU 2026-02-11 12:18 ` Sizhe Liu
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox