* [PATCH v2] PCI/AER: Fix AER log missing in DPC case
@ 2026-01-29 14:01 Sizhe Liu
2026-02-06 10:00 ` Sizhe LIU
2026-02-06 20:10 ` Bjorn Helgaas
0 siblings, 2 replies; 5+ messages in thread
From: Sizhe Liu @ 2026-01-29 14:01 UTC (permalink / raw)
To: bhelgaas, jonathan.cameron, shiju.jose, pandoh
Cc: linux-pci, linuxarm, prime.zeng, fanghao11, shenyang39, liusizhe5
In the current DPC error reporting case, some AER log information is missing.
-- Error log abnormal
pcieport 0000:20:00.0: DPC: containment event, status: 0x1f11: unmasked uncorrectable error detected
(------ AER error log supposed to be printed here, but missing ------)
nvme nvme0: frozen state error detected, reset controller
{4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
Cause:
In aer_print_error(), PCIe AER errors are reported, and the output is
rate-limited by info->ratelimit_print[i]. There are two entry points
for aer_print_error().
1) Native AER
aer_isr_one_error_type() -> aer_process_err_devices() ->
aer_print_error()
2) DPC
dpc_process_error() -> aer_print_error()
The value of info->ratelimit_print[i] is initialized correctly in
the native AER case:
aer_isr_one_error_type() -> find_source_device() ->
find_device_iter() -> add_error_device()
In the DPC case, info->ratelimit_print[i] is never set and keeps its
zero-initialized value, so aer_print_error() returns immediately at
if (!info->ratelimit_print[i])
As a result, the AER log messages are lost in the DPC case.
Solution:
1. Move the initialization of info->ratelimit_print[i] to
aer_ratelimit_print_init().
2. Add aer_ratelimit_print_init() in dpc_process_error().
3. Replace the open-coded initialization in the native AER case with a
call to aer_ratelimit_print_init().
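To illustrate the cause and the fix outside the kernel, here is a minimal userspace sketch. It is not the kernel code: the model_* names, the reduced struct, and the always-allow model_ratelimit() are assumptions made for illustration. Only the ratelimit_print[] / root_ratelimit_print fields and the early-return check follow this patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MODEL_MAX_DEVS 5

/* Reduced stand-in for struct aer_err_info (hypothetical layout). */
struct model_err_info {
	int severity;
	int ratelimit_print[MODEL_MAX_DEVS];
	int root_ratelimit_print;
};

/* Hypothetical stand-in for aer_ratelimit(): always allows printing. */
static int model_ratelimit(int severity)
{
	(void)severity;
	return 1;
}

/* Mirrors the shape of the helper this patch introduces. */
static void model_ratelimit_print_init(struct model_err_info *info, int idx)
{
	assert(idx >= 0 && idx < MODEL_MAX_DEVS);
	if (model_ratelimit(info->severity)) {
		info->ratelimit_print[idx] = 1;
		info->root_ratelimit_print = 1;
	}
}

/* Returns 1 if a log line was emitted, 0 if it was suppressed. */
static int model_print_error(const struct model_err_info *info, int idx)
{
	assert(idx >= 0 && idx < MODEL_MAX_DEVS);
	if (!info->ratelimit_print[idx])
		return 0;	/* the early return that hides the log */
	printf("PCIe Bus Error: severity=%d\n", info->severity);
	return 1;
}

/* DPC-like path before the fix: the flag is never initialized. */
static int model_dpc_before(void)
{
	struct model_err_info info;

	memset(&info, 0, sizeof(info));
	info.severity = 2;
	return model_print_error(&info, 0);
}

/* DPC-like path after the fix: the flag is set before printing. */
static int model_dpc_after(void)
{
	struct model_err_info info;

	memset(&info, 0, sizeof(info));
	info.severity = 2;
	model_ratelimit_print_init(&info, 0);
	return model_print_error(&info, 0);
}
```

In this model, model_dpc_before() returns 0 because the flag is still zero, while model_dpc_after() returns 1 once the helper has run.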
Test with AER inject:
Set the DPC reporting priority in the BIOS and send a
MalfTLP (AER fatal error) to the device.
-- Error log normal
pcieport 0000:20:00.0: DPC: containment event, status:0x1f11: unmasked uncorrectable error detected
pcieport 0000:20:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:20:00.0: device [19e5:a120] error status/mask=00040000/04580000
pcieport 0000:20:00.0: [18] MalfTLP (First)
pcieport 0000:20:00.0: AER: TLP Header: 0x00000000 0x00000000 0x00000000 0x00000000
nvme nvme0: frozen state error detected, reset controller
{2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[1] https://lore.kernel.org/linux-pci/20260127035405.712271-1-liusizhe5@huawei.com/
Fixes: a57f2bfb4a58 ("PCI/AER: Ratelimit correctable and non-fatal error logging")
Signed-off-by: Sizhe Liu <liusizhe5@huawei.com>
---
v2
- Corrected the format and spelling errors in the commit log.
drivers/pci/pci.h | 1 +
drivers/pci/pcie/aer.c | 35 +++++++++++++++++++++++------------
drivers/pci/pcie/dpc.c | 1 +
3 files changed, 25 insertions(+), 12 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 0e67014aa001..0cbcbcd52354 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -748,6 +748,7 @@ struct aer_err_info {
int aer_get_device_error_info(struct aer_err_info *info, int i);
void aer_print_error(struct aer_err_info *info, int i);
+void aer_ratelimit_print_init(struct pci_dev *dev, struct aer_err_info *e_info, int idx);
int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
unsigned int tlp_len, bool flit,
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e0bcaa896803..b73915b63327 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -925,6 +925,28 @@ int cper_severity_to_aer(int cper_severity)
EXPORT_SYMBOL_GPL(cper_severity_to_aer);
#endif
+/**
+ * aer_ratelimit_print_init - set flag whether error message is printed
+ * @dev: pointer to pci_dev to be rate-limited
+ * @e_info: pointer to error info
+ * @idx: index for ratelimit_print array
+ */
+void aer_ratelimit_print_init(struct pci_dev *dev, struct aer_err_info *e_info, int idx)
+{
+ /*
+ * Ratelimit AER log messages. "dev" is either the source
+ * identified by the root's Error Source ID or it has an unmasked
+ * error logged in its own AER Capability. Messages are emitted
+ * when "ratelimit_print[idx]" is non-zero. If we will print detail
+ * for a downstream device, make sure we print the Error Source ID
+ * from the root as well.
+ */
+ if (aer_ratelimit(dev, e_info->severity)) {
+ e_info->ratelimit_print[idx] = 1;
+ e_info->root_ratelimit_print = 1;
+ }
+}
+
void pci_print_aer(struct pci_dev *dev, int aer_severity,
struct aer_capability_regs *aer)
{
@@ -990,18 +1012,7 @@ static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
e_info->dev[i] = pci_dev_get(dev);
e_info->error_dev_num++;
- /*
- * Ratelimit AER log messages. "dev" is either the source
- * identified by the root's Error Source ID or it has an unmasked
- * error logged in its own AER Capability. Messages are emitted
- * when "ratelimit_print[i]" is non-zero. If we will print detail
- * for a downstream device, make sure we print the Error Source ID
- * from the root as well.
- */
- if (aer_ratelimit(dev, e_info->severity)) {
- e_info->ratelimit_print[i] = 1;
- e_info->root_ratelimit_print = 1;
- }
+ aer_ratelimit_print_init(dev, e_info, i);
return 0;
}
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index fc18349614d7..d17adc642781 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -275,6 +275,7 @@ void dpc_process_error(struct pci_dev *pdev)
status);
if (dpc_get_aer_uncorrect_severity(pdev, &info) &&
aer_get_device_error_info(&info, 0)) {
+ aer_ratelimit_print_init(pdev, &info, 0);
aer_print_error(&info, 0);
pci_aer_clear_nonfatal_status(pdev);
pci_aer_clear_fatal_status(pdev);
--
2.33.0
* Re: [PATCH v2] PCI/AER: Fix AER log missing in DPC case
2026-01-29 14:01 [PATCH v2] PCI/AER: Fix AER log missing in DPC case Sizhe Liu
@ 2026-02-06 10:00 ` Sizhe LIU
2026-02-06 20:10 ` Bjorn Helgaas
1 sibling, 0 replies; 5+ messages in thread
From: Sizhe LIU @ 2026-02-06 10:00 UTC (permalink / raw)
To: bhelgaas, jonathan.cameron, shiju.jose, pandoh
Cc: linux-pci, linuxarm, prime.zeng, fanghao11, shenyang39
Hi,
Gentle ping.
Best regards,
Sizhe
* Re: [PATCH v2] PCI/AER: Fix AER log missing in DPC case
2026-01-29 14:01 [PATCH v2] PCI/AER: Fix AER log missing in DPC case Sizhe Liu
2026-02-06 10:00 ` Sizhe LIU
@ 2026-02-06 20:10 ` Bjorn Helgaas
2026-02-10 8:00 ` Sizhe LIU
1 sibling, 1 reply; 5+ messages in thread
From: Bjorn Helgaas @ 2026-02-06 20:10 UTC (permalink / raw)
To: Sizhe Liu
Cc: bhelgaas, jonathan.cameron, shiju.jose, pandoh, linux-pci,
linuxarm, prime.zeng, fanghao11, shenyang39
On Thu, Jan 29, 2026 at 10:01:03PM +0800, Sizhe Liu wrote:
> In the current DPC error reporting case, some AER log information is missing.
>
> -- Error log abnormal
> pcieport 0000:20:00.0: DPC: containment event, status: 0x1f11: unmasked uncorrectable error detected
> (------ AER error log supposed to be printed here, but missing ------)
> nvme nvme0: frozen state error detected, reset controller
> {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
>
> Cause:
> In aer_print_error(), PCIe AER errors are reported, and the output is
> rate-limited by info->ratelimit_print[i]. There are two entry points
> for aer_print_error().
>
> 1) Native AER
> aer_isr_one_error_type() -> aer_process_err_devices() ->
> aer_print_error()
> 2) DPC
> dpc_process_error() -> aer_print_error()
>
> The value of info->ratelimit_print[i] is initialized correctly in
> the native AER case:
> aer_isr_one_error_type() -> find_source_device() ->
> find_device_iter() -> add_error_device()
>
> In the DPC case, info->ratelimit_print[i] is never set and keeps its
> zero-initialized value, so aer_print_error() returns immediately at
> if (!info->ratelimit_print[i])
> As a result, the AER log messages are lost in the DPC case.
>
> Solution:
> 1. Move the initialization of info->ratelimit_print[i] to
> aer_ratelimit_print_init().
> 2. Add aer_ratelimit_print_init() in dpc_process_error().
> 3. Replace the open-coded initialization in the native AER case with a
> call to aer_ratelimit_print_init().
I see the problem, and I think you're right that we're not logging any
AER info for DPC events (including events handled via the EDR path,
which also calls dpc_process_error()).
Currently we do the ratelimit init in add_error_device(), which also
includes pci_dev_get() for the device. I don't see a similar
pci_dev_get() anywhere in the DPC path. There is one in the EDR path:
  edr_handle_event
    acpi_dpc_port_get
      pci_dev_get                 <--
    dpc_process_error
      aer_get_device_error_info(aer_err_info)
      aer_print_error(aer_err_info)
    pcie_do_recovery
    pci_dev_put
Maybe DPC and EDR should be using add_error_device() directly? It
seems like holding that reference on the device is important.
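The reference-holding pattern referred to here can be sketched in userspace as follows. The model_dev type and the model_* helpers are hypothetical stand-ins for struct pci_dev, pci_dev_get() and pci_dev_put(). The sketch only shows that the reference is taken before error processing and dropped afterwards, not how the PCI core implements it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted device. */
struct model_dev {
	int refcount;
};

/* Take a reference; NULL is passed through unchanged. */
static struct model_dev *model_dev_get(struct model_dev *dev)
{
	if (dev)
		dev->refcount++;
	return dev;
}

/* Drop a reference; NULL is ignored. */
static void model_dev_put(struct model_dev *dev)
{
	if (!dev)
		return;
	assert(dev->refcount > 0);
	dev->refcount--;
}

/*
 * Model of an event handler that pins the device for the whole time it
 * is being processed. Returns the refcount observed while the reference
 * was held, or -1 if there was no device.
 */
static int model_handle_event(struct model_dev *dev)
{
	struct model_dev *held = model_dev_get(dev);
	int held_during;

	if (!held)
		return -1;

	held_during = held->refcount;
	/* error logging and recovery would run here, with "held" pinned */

	model_dev_put(held);
	return held_during;
}
```

The get/put pair is balanced on every path, so the count seen by the caller is unchanged after the handler returns.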
* Re: [PATCH v2] PCI/AER: Fix AER log missing in DPC case
2026-02-06 20:10 ` Bjorn Helgaas
@ 2026-02-10 8:00 ` Sizhe LIU
2026-02-11 12:18 ` Sizhe Liu
0 siblings, 1 reply; 5+ messages in thread
From: Sizhe LIU @ 2026-02-10 8:00 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: bhelgaas, jonathan.cameron, shiju.jose, pandoh, linux-pci,
linuxarm, prime.zeng, fanghao11, shenyang39
On 2026/2/7 4:10, Bjorn Helgaas wrote:
> I see the problem, and I think you're right that we're not logging any
> AER info for DPC events (including events handled via the EDR path,
> which also calls dpc_process_error()).
>
> Currently we do the ratelimit init in add_error_device(), which also
> includes pci_dev_get() for the device. I don't see a similar
> pci_dev_get() anywhere in the DPC path. There is one in the EDR path:
>
>   edr_handle_event
>     acpi_dpc_port_get
>       pci_dev_get                 <--
>     dpc_process_error
>       aer_get_device_error_info(aer_err_info)
>       aer_print_error(aer_err_info)
>     pcie_do_recovery
>     pci_dev_put
>
> Maybe DPC and EDR should be using add_error_device() directly? It
> seems like holding that reference on the device is important.
I agree with using add_error_device() directly; holding a reference to
the device is a more robust approach.
The call flows would then look like this:
DPC path:
  dpc_handler
    add_error_device              <--
    dpc_process_error
      aer_get_device_error_info(aer_err_info)
      aer_print_error(aer_err_info)
    pcie_do_recovery
    pci_dev_put
EDR path:
  edr_handle_event
    acpi_dpc_port_get
      add_error_device            <--
    dpc_process_error
      aer_get_device_error_info(aer_err_info)
      aer_print_error(aer_err_info)
    pcie_do_recovery
    pci_dev_put
I will implement this change in the v3 patch and add relevant comments
to add_error_device() for clarity.
Thanks,
Sizhe
* Re: [PATCH v2] PCI/AER: Fix AER log missing in DPC case
2026-02-10 8:00 ` Sizhe LIU
@ 2026-02-11 12:18 ` Sizhe Liu
0 siblings, 0 replies; 5+ messages in thread
From: Sizhe Liu @ 2026-02-11 12:18 UTC (permalink / raw)
To: Bjorn Helgaas
Cc: bhelgaas, jonathan.cameron, shiju.jose, pandoh, linux-pci,
linuxarm, prime.zeng, fanghao11, shenyang39
On 2026/2/10 16:00, Sizhe LIU wrote:
> I agree with directly using add_error_device() – holding a reference
> to the device is a more robust approach.
> Below is the detailed call trace for reference:
>
> DPC path:
>   dpc_handler
>     add_error_device              <--
>     dpc_process_error
>       aer_get_device_error_info(aer_err_info)
>       aer_print_error(aer_err_info)
>     pcie_do_recovery
>     pci_dev_put
>
> EDR path:
>   edr_handle_event
>     acpi_dpc_port_get
>       add_error_device            <--
>     dpc_process_error
>       aer_get_device_error_info(aer_err_info)
>       aer_print_error(aer_err_info)
>     pcie_do_recovery
>     pci_dev_put
>
> I will implement this change in the v3 patch and add relevant comments
> to add_error_device() for clarity.
>
> Thanks,
> Sizhe
Hi,
I attempted this, but the required changes and their impact became
increasingly substantial. add_error_device() relies on
aer_err_info->severity to initialize ratelimit_print, so to replace
pci_dev_get() with add_error_device() we would have to read the
registers to obtain the severity before the call.
Also, besides pci_dev_get(), acpi_dpc_port_get() has another path that
calls pci_get_domain_bus_and_slot() to look up the error device by BDF
and take a reference, rather than starting from a pci_dev structure.
That path would need a separate initialization of ratelimit_print.
A less intrusive approach, in my opinion:
1. Extract the ratelimit_print initialization into a dedicated function,
aer_ratelimit_print_init(), and call it after
dpc_get_aer_uncorrect_severity() inside dpc_process_error().
2. Add pci_dev_get() / pci_dev_put() pairs directly in dpc_handler().
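A rough userspace sketch of the ordering proposed above: take a reference, read the severity, initialize the rate-limit flag, print, then drop the reference. All model_* names are hypothetical, and the budget-based limiter is an assumption for illustration (fatal errors are assumed to be exempt, in line with the title of the Fixes: commit). It is not the kernel's aer_ratelimit().

```c
#include <assert.h>
#include <stdbool.h>

enum model_severity { MODEL_NONFATAL, MODEL_FATAL };

/* Hypothetical device: refcount, a message budget, and a pending error. */
struct model_dev {
	int refcount;
	int nonfatal_budget;		/* non-fatal messages still allowed */
	enum model_severity pending;	/* what the "registers" would report */
};

struct model_err_info {
	enum model_severity severity;
	bool severity_valid;
	int ratelimit_print[1];
};

/* Assumed policy: fatal is never limited, non-fatal consumes a budget. */
static bool model_ratelimit(struct model_dev *dev, enum model_severity sev)
{
	if (sev == MODEL_FATAL)
		return true;
	if (dev->nonfatal_budget > 0) {
		dev->nonfatal_budget--;
		return true;
	}
	return false;
}

/* Stand-in for reading the severity from the device. */
static void model_read_severity(struct model_dev *dev,
				struct model_err_info *info)
{
	info->severity = dev->pending;
	info->severity_valid = true;
}

/* The flag can only be initialized once the severity is known. */
static void model_ratelimit_print_init(struct model_dev *dev,
				       struct model_err_info *info, int idx)
{
	assert(info->severity_valid);
	if (model_ratelimit(dev, info->severity))
		info->ratelimit_print[idx] = 1;
}

/* Returns true if the error log would be printed for this event. */
static bool model_dpc_handler(struct model_dev *dev)
{
	struct model_err_info info = { 0 };
	bool printed;

	dev->refcount++;				/* item 2: get */
	model_read_severity(dev, &info);
	model_ratelimit_print_init(dev, &info, 0);	/* item 1 */
	printed = info.ratelimit_print[0] != 0;
	dev->refcount--;				/* item 2: put */
	return printed;
}
```

The assert in model_ratelimit_print_init() captures the dependency described above: the rate-limit decision needs the severity, so it cannot run before the severity has been read.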
I plan to incorporate these changes into the v3 patch.
Looking forward to your feedback.
Thanks,
Sizhe