* [PATCH] PCI/AER: Fix AER log missing in DPC case
@ 2026-01-27 3:54 Sizhe Liu
2026-01-27 15:56 ` Jonathan Cameron
From: Sizhe Liu @ 2026-01-27 3:54 UTC (permalink / raw)
To: bhelgaas, jonathan.cameron, shiju.jose, pandoh
Cc: linux-pci, linuxarm, prime.zeng, fanghao11, shenyang39, liusizhe5
In the current DPC error reporting case, some AER log information
is missing.
-- Error log abnormal (line breaks adjusted)
[ 976.604003] pcieport 0000:20:00.0: DPC: containment event, status:
0x1f11: unmasked uncorrectable error detected
(------ AER error log supposed to be printed here, but missing ------)
[ 976.604030] nvme nvme0: frozen state error detected, reset controller
[ 977.812932] {4}[Hardware Error]: Hardware error from APEI
Generic Hardware Error Source: 0
Cause:
In aer_print_error(), PCIe AER errors are reported, and are rate-limited
by info->ratelimit_print[i]. There are two entry points for
aer_print_error().
1) Native AER
aer_isr_one_error_type() -> aer_process_err_devices() ->
aer_print_error()
2) DPC
dpc_process_error() -> aer_print_error()
The value of info->ratelimit_print[i] is initialized correctly in
the native AER case:
aer_isr_one_error_type() -> find_source_device() ->
find_device_iter() -> add_error_device()
In the DPC case, info->ratelimit_print[i] is never set and keeps its
initial value of 0, so aer_print_error() returns immediately at the line
if (!info->ratelimit_print[i])
This will result in lossing the AER log messages in the DPC case.
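The early return described above can be modelled in user space. The sketch below is not kernel code: struct err_info_model, model_print_error() and MODEL_MAX_DEVS are hypothetical stand-ins for struct aer_err_info, aer_print_error() and the array size. It only illustrates that a flag left at zero suppresses the report.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MODEL_MAX_DEVS 5 /* hypothetical array size, for illustration only */

/* Hypothetical stand-in for the part of struct aer_err_info used here. */
struct err_info_model {
	unsigned int ratelimit_print[MODEL_MAX_DEVS];
};

/*
 * Models the check quoted in the commit message: return without printing
 * when ratelimit_print[i] is zero. Returns true if a message was printed.
 */
static bool model_print_error(const struct err_info_model *info, int i)
{
	if (!info->ratelimit_print[i])
		return false;

	printf("PCIe Bus Error: device index %d\n", i);
	return true;
}

/* Models the DPC path before the fix: info is zeroed and the flag is never set. */
static bool model_dpc_path_before_fix(void)
{
	struct err_info_model info;

	memset(&info, 0, sizeof(info));
	return model_print_error(&info, 0);
}
```

Calling model_dpc_path_before_fix() prints nothing and returns false, which corresponds to the missing AER lines in the abnormal log.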
Solution:
1. Move the initialization of info->ratelimit_print[i] to
aer_ratelimit_print_init().
2. Add aer_ratelimit_print_init() in dpc_process_error().
3. Replace the initialization by aer_ratelimit_print_init() in the
native AER case.
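The three steps above can be sketched as a user-space model. This is an illustration under stated assumptions, not the patch itself: struct fix_info_model, model_ratelimit(), model_ratelimit_print_init() and model_fix_print_error() are hypothetical stand-ins for struct aer_err_info, aer_ratelimit(), aer_ratelimit_print_init() and aer_print_error(), and the stand-in rate limiter always allows printing.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define FIX_MODEL_MAX_DEVS 5 /* hypothetical array size, for illustration only */

/* Hypothetical stand-in for the fields of struct aer_err_info touched by the patch. */
struct fix_info_model {
	int severity;
	unsigned int root_ratelimit_print;
	unsigned int ratelimit_print[FIX_MODEL_MAX_DEVS];
};

/* Stand-in for aer_ratelimit(): always allows printing in this sketch. */
static bool model_ratelimit(int severity)
{
	(void)severity;
	return true;
}

/* Step 1: the flag initialization lives in one helper. */
static void model_ratelimit_print_init(struct fix_info_model *info, int idx)
{
	if (model_ratelimit(info->severity)) {
		info->ratelimit_print[idx] = 1;
		info->root_ratelimit_print = 1;
	}
}

/* Same early-return check as in the commit message; true if a message was printed. */
static bool model_fix_print_error(const struct fix_info_model *info, int i)
{
	if (!info->ratelimit_print[i])
		return false;

	printf("PCIe Bus Error: device index %d\n", i);
	return true;
}

/* Step 2: the DPC path calls the helper before printing. */
static bool model_dpc_path_after_fix(void)
{
	struct fix_info_model info;

	memset(&info, 0, sizeof(info));
	model_ratelimit_print_init(&info, 0);
	return model_fix_print_error(&info, 0);
}
```

With the helper called first, model_dpc_path_after_fix() prints the message and returns true. Step 3 corresponds to the native AER path calling the same helper instead of setting the flags inline.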
Test with AER inject:
Set the DPC reporting priority in the BIOS and send
MalfTLP(AER FATAL ERROR) to device.
-- Error log normal (line breaks adjusted)
[ 5366.943807] pcieport 0000:20:00.0: DPC: containment event,
status:0x1f11: unmasked uncorrectable error detected
[ 5366.943826] pcieport 0000:20:00.0: PCIe Bus Error:
severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
[ 5366.943830] pcieport 0000:20:00.0:
device [19e5:a120] error status/mask=00040000/04580000
[ 5366.943833] pcieport 0000:20:00.0: [18] MalfTLP (First)
[ 5366.943836] pcieport 0000:20:00.0: AER: TLP Header:
0x00000000 0x00000000 0x00000000 0x00000000
[ 5366.943843] nvme nvme0: frozen state error detected, reset controller
[ 5368.156778] {2}[Hardware Error]: Hardware error from APEI Generic
Hardware Error Source: 0
Fixes: a57f2bfb4a58 ("PCI/AER: Ratelimit correctable and non-fatal error logging")
Signed-off-by: Sizhe Liu <liusizhe5@huawei.com>
---
drivers/pci/pci.h | 1 +
drivers/pci/pcie/aer.c | 35 +++++++++++++++++++++++------------
drivers/pci/pcie/dpc.c | 1 +
3 files changed, 25 insertions(+), 12 deletions(-)
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 0e67014aa001..0cbcbcd52354 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -748,6 +748,7 @@ struct aer_err_info {
int aer_get_device_error_info(struct aer_err_info *info, int i);
void aer_print_error(struct aer_err_info *info, int i);
+void aer_ratelimit_print_init(struct pci_dev *dev, struct aer_err_info *e_info, int idx);
int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
unsigned int tlp_len, bool flit,
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e0bcaa896803..b73915b63327 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -925,6 +925,28 @@ int cper_severity_to_aer(int cper_severity)
EXPORT_SYMBOL_GPL(cper_severity_to_aer);
#endif
+/**
+ * aer_ratelimit_print_init - set flag whether error message is printed
+ * @dev: pointer to pci_dev to be rate-limited
+ * @e_info: pointer to error info
+ * @idx: index for ratelimit_print array
+ */
+void aer_ratelimit_print_init(struct pci_dev *dev, struct aer_err_info *e_info, int idx)
+{
+ /*
+ * Ratelimit AER log messages. "dev" is either the source
+ * identified by the root's Error Source ID or it has an unmasked
+ * error logged in its own AER Capability. Messages are emitted
+ * when "ratelimit_print[i]" is non-zero. If we will print detail
+ * for a downstream device, make sure we print the Error Source ID
+ * from the root as well.
+ */
+ if (aer_ratelimit(dev, e_info->severity)) {
+ e_info->ratelimit_print[idx] = 1;
+ e_info->root_ratelimit_print = 1;
+ }
+}
+
void pci_print_aer(struct pci_dev *dev, int aer_severity,
struct aer_capability_regs *aer)
{
@@ -990,18 +1012,7 @@ static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
e_info->dev[i] = pci_dev_get(dev);
e_info->error_dev_num++;
- /*
- * Ratelimit AER log messages. "dev" is either the source
- * identified by the root's Error Source ID or it has an unmasked
- * error logged in its own AER Capability. Messages are emitted
- * when "ratelimit_print[i]" is non-zero. If we will print detail
- * for a downstream device, make sure we print the Error Source ID
- * from the root as well.
- */
- if (aer_ratelimit(dev, e_info->severity)) {
- e_info->ratelimit_print[i] = 1;
- e_info->root_ratelimit_print = 1;
- }
+ aer_ratelimit_print_init(dev, e_info, i);
return 0;
}
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index fc18349614d7..d17adc642781 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -275,6 +275,7 @@ void dpc_process_error(struct pci_dev *pdev)
status);
if (dpc_get_aer_uncorrect_severity(pdev, &info) &&
aer_get_device_error_info(&info, 0)) {
+ aer_ratelimit_print_init(pdev, &info, 0);
aer_print_error(&info, 0);
pci_aer_clear_nonfatal_status(pdev);
pci_aer_clear_fatal_status(pdev);
--
2.33.0
* Re: [PATCH] PCI/AER: Fix AER log missing in DPC case
2026-01-27 3:54 [PATCH] PCI/AER: Fix AER log missing in DPC case Sizhe Liu
@ 2026-01-27 15:56 ` Jonathan Cameron
2026-01-29 13:22 ` Sizhe LIU
From: Jonathan Cameron @ 2026-01-27 15:56 UTC (permalink / raw)
To: Sizhe Liu
Cc: bhelgaas, shiju.jose, pandoh, linux-pci, linuxarm, prime.zeng,
fanghao11, shenyang39
On Tue, 27 Jan 2026 11:54:05 +0800
Sizhe Liu <liusizhe5@huawei.com> wrote:
> In the current DPC error reporting case, some AER log information
> is missing.
Wrap commit messages up to 75 chars, this is about 65.
Only do that for stuff where the formatting isn't fixed for other
reasons, such as the log below.
>
> -- Error log abnormal (line breaks adjusted)
> [ 976.604003] pcieport 0000:20:00.0: DPC: containment event, status:
> 0x1f11: unmasked uncorrectable error detected
> (------ AER error log supposed to be printed here, but missing ------)
> [ 976.604030] nvme nvme0: frozen state error detected, reset controller
> [ 977.812932] {4}[Hardware Error]: Hardware error from APEI
> Generic Hardware Error Source: 0
>
This, but do remove timestamps as those don't matter to anyone.
Not wrapping helps if anyone is searching for this later.
Abnormal error log
pcieport 0000:20:00.0: DPC: containment event, status: 0x1f11: unmasked uncorrectable error detected
(------ AER error log supposed to be printed here, but missing ------)
nvme nvme0: frozen state error detected, reset controller
{4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
> Cause:
> In aer_print_error(), PCIe AER errors is reported, and is rate-limited
> by info->ratelimit_print[i]. There are two entry points for
> aer_print_error().
>
> 1) Native AER
> aer_isr_one_error_type() -> aer_process_err_devices() ->
> aer_print_error()
> 2) DPC
> dpc_process_error() -> aer_print_error()
>
> The value of info->ratelimit_print[i] is initialized correctly in
> the native AER case:
> aer_isr_one_error_type() -> find_source_device() ->
> find_device_iter() -> add_error_device()
>
> In the DPC case, info->ratelimit_print[i] is not initialized and
> alloc by 0 , so in aer_print_error(), it will directly return at line
> if (!info->ratelimit_print[i])
> This will result in lossing the AER log messages in the DPC case.
losing
>
> Solution:
> 1. Move the initialization of info->ratelimit_print[i] to
> aer_ratelimit_print_init().
> 2. Add aer_ratelimit_print_init() in dpc_process_error().
> 3. Replace the initialization by aer_ratelimit_print_init()in
> Native AER case.
>
> Test with AER inject:
> Set the DPC reporting priority in the BIOS and send
> MalfTLP(AER FATAL ERROR) to device.
>
> -- Error log normal (line breaks adjusted)
> [ 5366.943807] pcieport 0000:20:00.0: DPC: containment event,
> status:0x1f11: unmasked uncorrectable error detected
> [ 5366.943826] pcieport 0000:20:00.0: PCIe Bus Error:
> severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> [root@localhost ~]# [ 5366.943830] pcieport 0000:20:00.0:
> device [19e5:a120] error status/mask=00040000/04580000
> [ 5366.943833] pcieport 0000:20:00.0: [18] MalfTLP (First)
> [ 5366.943836] pcieport 0000:20:00.0: AER: TLP Header:
> 0x00000000 0x00000000 0x00000000 0x00000000
> [ 5366.943843] nvme nvme0: frozen state error detected, reset controller
> [ 5368.156778] {2}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 0
Reformat this and drop timestamps as well:
Normal error log
pcieport 0000:20:00.0: DPC: containment event, status:0x1f11: unmasked uncorrectable error detected
pcieport 0000:20:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:20:00.0: device [19e5:a120] error status/mask=00040000/04580000
pcieport 0000:20:00.0: [18] MalfTLP (First)
pcieport 0000:20:00.0: AER: TLP Header: 0x00000000 0x00000000 0x00000000 0x00000000
nvme nvme0: frozen state error detected, reset controller
{2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
>
> Fixes: a57f2bfb4a58 ("PCI/AER: Ratelimit correctable and non-fatal error logging")
> Signed-off-by: Sizhe Liu <liusizhe5@huawei.com>
Otherwise looks good to me.
Thanks,
Jonathan
* Re: [PATCH] PCI/AER: Fix AER log missing in DPC case
2026-01-27 15:56 ` Jonathan Cameron
@ 2026-01-29 13:22 ` Sizhe LIU
From: Sizhe LIU @ 2026-01-29 13:22 UTC (permalink / raw)
To: Jonathan Cameron
Cc: bhelgaas, shiju.jose, pandoh, linux-pci, linuxarm, prime.zeng,
fanghao11, shenyang39
On 2026/1/27 23:56, Jonathan Cameron wrote:
> On Tue, 27 Jan 2026 11:54:05 +0800
> Sizhe Liu <liusizhe5@huawei.com> wrote:
>
>> In the current DPC error reporting case, some AER log information
>> is missing.
> Wrap commit messages up to 75 chars, this is about 65.
> Only do that for stuff where the formatting isn't fixed for other
> reasons, such as the log below.
Got it, I will align the format.
>> -- Error log abnormal (line breaks adjusted)
>> [ 976.604003] pcieport 0000:20:00.0: DPC: containment event, status:
>> 0x1f11: unmasked uncorrectable error detected
>> (------ AER error log supposed to be printed here, but missing ------)
>> [ 976.604030] nvme nvme0: frozen state error detected, reset controller
>> [ 977.812932] {4}[Hardware Error]: Hardware error from APEI
>> Generic Hardware Error Source: 0
>>
> This, but do remove timestamps as those don't matter to anyone.
> Not wrapping helps if anyone is searching for this later.
>
> Abnormal error log
>
> pcieport 0000:20:00.0: DPC: containment event, status: 0x1f11: unmasked uncorrectable error detected
> (------ AER error log supposed to be printed here, but missing ------)
> nvme nvme0: frozen state error detected, reset controller
> {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
>
I will remove the timestamps and adjust the log accordingly.
>> Cause:
>> In aer_print_error(), PCIe AER errors is reported, and is rate-limited
>> by info->ratelimit_print[i]. There are two entry points for
>> aer_print_error().
>>
>> 1) Native AER
>> aer_isr_one_error_type() -> aer_process_err_devices() ->
>> aer_print_error()
>> 2) DPC
>> dpc_process_error() -> aer_print_error()
>>
>> The value of info->ratelimit_print[i] is initialized correctly in
>> the native AER case:
>> aer_isr_one_error_type() -> find_source_device() ->
>> find_device_iter() -> add_error_device()
>>
>> In the DPC case, info->ratelimit_print[i] is not initialized and
>> alloc by 0 , so in aer_print_error(), it will directly return at line
>> if (!info->ratelimit_print[i])
>> This will result in lossing the AER log messages in the DPC case.
> losing
Oops, good catch! Thanks :)
>> Solution:
>> 1. Move the initialization of info->ratelimit_print[i] to
>> aer_ratelimit_print_init().
>> 2. Add aer_ratelimit_print_init() in dpc_process_error().
>> 3. Replace the initialization by aer_ratelimit_print_init()in
>> Native AER case.
>>
>> Test with AER inject:
>> Set the DPC reporting priority in the BIOS and send
>> MalfTLP(AER FATAL ERROR) to device.
>>
>> -- Error log normal (line breaks adjusted)
>> [ 5366.943807] pcieport 0000:20:00.0: DPC: containment event,
>> status:0x1f11: unmasked uncorrectable error detected
>> [ 5366.943826] pcieport 0000:20:00.0: PCIe Bus Error:
>> severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>> [root@localhost ~]# [ 5366.943830] pcieport 0000:20:00.0:
>> device [19e5:a120] error status/mask=00040000/04580000
>> [ 5366.943833] pcieport 0000:20:00.0: [18] MalfTLP (First)
>> [ 5366.943836] pcieport 0000:20:00.0: AER: TLP Header:
>> 0x00000000 0x00000000 0x00000000 0x00000000
>> [ 5366.943843] nvme nvme0: frozen state error detected, reset controller
>> [ 5368.156778] {2}[Hardware Error]: Hardware error from APEI Generic
>> Hardware Error Source: 0
> Reformat this and drop timestamps as well:
>
> Normal error log
>
> pcieport 0000:20:00.0: DPC: containment event, status:0x1f11: unmasked uncorrectable error detected
> pcieport 0000:20:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> pcieport 0000:20:00.0: device [19e5:a120] error status/mask=00040000/04580000
> pcieport 0000:20:00.0: [18] MalfTLP (First)
> pcieport 0000:20:00.0: AER: TLP Header: 0x00000000 0x00000000 0x00000000 0x00000000
> nvme nvme0: frozen state error detected, reset controller
> {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
I will reformat the AER log as above.
>> Fixes: a57f2bfb4a58 ("PCI/AER: Ratelimit correctable and non-fatal error logging")
>> Signed-off-by: Sizhe Liu <liusizhe5@huawei.com>
> Otherwise looks good to me.
>
> Thanks,
>
> Jonathan
>
A v2 patch will be sent for adjusting the commit log.
Kind regards,
Sizhe