public inbox for linux-pci@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] PCI/AER: Fix AER log missing in DPC case
@ 2026-01-29 14:01 Sizhe Liu
  2026-02-06 10:00 ` Sizhe LIU
  2026-02-06 20:10 ` Bjorn Helgaas
  0 siblings, 2 replies; 5+ messages in thread
From: Sizhe Liu @ 2026-01-29 14:01 UTC (permalink / raw)
  To: bhelgaas, jonathan.cameron, shiju.jose, pandoh
  Cc: linux-pci, linuxarm, prime.zeng, fanghao11, shenyang39, liusizhe5

In the current DPC error reporting case, some AER log information is missing.

-- Error log abnormal
pcieport 0000:20:00.0: DPC: containment event, status: 0x1f11: unmasked uncorrectable error detected
(------ AER error log supposed to be printed here, but missing ------)
nvme nvme0: frozen state error detected, reset controller
{4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0

Cause:
In aer_print_error(), PCIe AER errors is reported, and is rate-limited
by info->ratelimit_print[i]. There are two entry points for
aer_print_error().

1) Native AER
aer_isr_one_error_type() -> aer_process_err_devices() ->
aer_print_error()
2) DPC
dpc_process_error() -> aer_print_error()

The value of info->ratelimit_print[i] is initialized correctly in
the native AER case:
aer_isr_one_error_type() -> find_source_device() ->
find_device_iter() -> add_error_device()

In the DPC case, info->ratelimit_print[i] is not initialized and
alloc by 0 , so in aer_print_error(), it will directly return at line
if (!info->ratelimit_print[i])
This will result in losing the AER log messages in the DPC case.

Solution:
1. Move the initialization of info->ratelimit_print[i] to
aer_ratelimit_print_init().
2. Add aer_ratelimit_print_init() in dpc_process_error().
3. Replace the initialization by aer_ratelimit_print_init()in
Native AER case.

Test with AER inject:
Set the DPC reporting priority in the BIOS and send
MalfTLP(AER FATAL ERROR) to device.

-- Error log normal
pcieport 0000:20:00.0: DPC: containment event, status:0x1f11: unmasked uncorrectable error detected
pcieport 0000:20:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
pcieport 0000:20:00.0: device [19e5:a120] error status/mask=00040000/04580000
pcieport 0000:20:00.0:    [18] MalfTLP   (First)
pcieport 0000:20:00.0: AER:   TLP Header: 0x00000000 0x00000000 0x00000000 0x00000000
nvme nvme0: frozen state error detected, reset controller
{2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0

[1] https://lore.kernel.org/linux-pci/20260127035405.712271-1-liusizhe5@huawei.com/

Fixes: a57f2bfb4a58 ("PCI/AER: Ratelimit correctable and non-fatal error logging")
Signed-off-by: Sizhe Liu <liusizhe5@huawei.com>
---
v2
- Corrected the format and spelling errors in the commit log.

 drivers/pci/pci.h      |  1 +
 drivers/pci/pcie/aer.c | 35 +++++++++++++++++++++++------------
 drivers/pci/pcie/dpc.c |  1 +
 3 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 0e67014aa001..0cbcbcd52354 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -748,6 +748,7 @@ struct aer_err_info {
 
 int aer_get_device_error_info(struct aer_err_info *info, int i);
 void aer_print_error(struct aer_err_info *info, int i);
+void aer_ratelimit_print_init(struct pci_dev *dev, struct aer_err_info *e_info, int idx);
 
 int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
 		      unsigned int tlp_len, bool flit,
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index e0bcaa896803..b73915b63327 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -925,6 +925,28 @@ int cper_severity_to_aer(int cper_severity)
 EXPORT_SYMBOL_GPL(cper_severity_to_aer);
 #endif
 
+/**
+ * aer_ratelimit_print_init - set flag whether error message is printed
+ * @dev: pointer to pci_dev to be rate-limited
+ * @e_info: pointer to error info
+ * @idx: index for ratelimit_print array
+ */
+void aer_ratelimit_print_init(struct pci_dev *dev, struct aer_err_info *e_info, int idx)
+{
+	/*
+	 * Ratelimit AER log messages.  "dev" is either the source
+	 * identified by the root's Error Source ID or it has an unmasked
+	 * error logged in its own AER Capability.  Messages are emitted
+	 * when "ratelimit_print[i]" is non-zero.  If we will print detail
+	 * for a downstream device, make sure we print the Error Source ID
+	 * from the root as well.
+	 */
+	if (aer_ratelimit(dev, e_info->severity)) {
+		e_info->ratelimit_print[idx] = 1;
+		e_info->root_ratelimit_print = 1;
+	}
+}
+
 void pci_print_aer(struct pci_dev *dev, int aer_severity,
 		   struct aer_capability_regs *aer)
 {
@@ -990,18 +1012,7 @@ static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
 	e_info->dev[i] = pci_dev_get(dev);
 	e_info->error_dev_num++;
 
-	/*
-	 * Ratelimit AER log messages.  "dev" is either the source
-	 * identified by the root's Error Source ID or it has an unmasked
-	 * error logged in its own AER Capability.  Messages are emitted
-	 * when "ratelimit_print[i]" is non-zero.  If we will print detail
-	 * for a downstream device, make sure we print the Error Source ID
-	 * from the root as well.
-	 */
-	if (aer_ratelimit(dev, e_info->severity)) {
-		e_info->ratelimit_print[i] = 1;
-		e_info->root_ratelimit_print = 1;
-	}
+	aer_ratelimit_print_init(dev, e_info, i);
 	return 0;
 }
 
diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
index fc18349614d7..d17adc642781 100644
--- a/drivers/pci/pcie/dpc.c
+++ b/drivers/pci/pcie/dpc.c
@@ -275,6 +275,7 @@ void dpc_process_error(struct pci_dev *pdev)
 			 status);
 		if (dpc_get_aer_uncorrect_severity(pdev, &info) &&
 		    aer_get_device_error_info(&info, 0)) {
+			aer_ratelimit_print_init(pdev, &info, 0);
 			aer_print_error(&info, 0);
 			pci_aer_clear_nonfatal_status(pdev);
 			pci_aer_clear_fatal_status(pdev);
-- 
2.33.0


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] PCI/AER: Fix AER log missing in DPC case
  2026-01-29 14:01 [PATCH v2] PCI/AER: Fix AER log missing in DPC case Sizhe Liu
@ 2026-02-06 10:00 ` Sizhe LIU
  2026-02-06 20:10 ` Bjorn Helgaas
  1 sibling, 0 replies; 5+ messages in thread
From: Sizhe LIU @ 2026-02-06 10:00 UTC (permalink / raw)
  To: bhelgaas, jonathan.cameron, shiju.jose, pandoh
  Cc: linux-pci, linuxarm, prime.zeng, fanghao11, shenyang39

Hi,
Gentle ping.


Best regards,
Sizhe

On 2026/1/29 22:01, Sizhe Liu wrote:
> In the current DPC error reporting case, some AER log information is missing.
>
> -- Error log abnormal
> pcieport 0000:20:00.0: DPC: containment event, status: 0x1f11: unmasked uncorrectable error detected
> (------ AER error log supposed to be printed here, but missing ------)
> nvme nvme0: frozen state error detected, reset controller
> {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
>
> Cause:
> In aer_print_error(), PCIe AER errors is reported, and is rate-limited
> by info->ratelimit_print[i]. There are two entry points for
> aer_print_error().
>
> 1) Native AER
> aer_isr_one_error_type() -> aer_process_err_devices() ->
> aer_print_error()
> 2) DPC
> dpc_process_error() -> aer_print_error()
>
> The value of info->ratelimit_print[i] is initialized correctly in
> the native AER case:
> aer_isr_one_error_type() -> find_source_device() ->
> find_device_iter() -> add_error_device()
>
> In the DPC case, info->ratelimit_print[i] is not initialized and
> alloc by 0 , so in aer_print_error(), it will directly return at line
> if (!info->ratelimit_print[i])
> This will result in losing the AER log messages in the DPC case.
>
> Solution:
> 1. Move the initialization of info->ratelimit_print[i] to
> aer_ratelimit_print_init().
> 2. Add aer_ratelimit_print_init() in dpc_process_error().
> 3. Replace the initialization by aer_ratelimit_print_init()in
> Native AER case.
>
> Test with AER inject:
> Set the DPC reporting priority in the BIOS and send
> MalfTLP(AER FATAL ERROR) to device.
>
> -- Error log normal
> pcieport 0000:20:00.0: DPC: containment event, status:0x1f11: unmasked uncorrectable error detected
> pcieport 0000:20:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> pcieport 0000:20:00.0: device [19e5:a120] error status/mask=00040000/04580000
> pcieport 0000:20:00.0:    [18] MalfTLP   (First)
> pcieport 0000:20:00.0: AER:   TLP Header: 0x00000000 0x00000000 0x00000000 0x00000000
> nvme nvme0: frozen state error detected, reset controller
> {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
>
> [1] https://lore.kernel.org/linux-pci/20260127035405.712271-1-liusizhe5@huawei.com/
>
> Fixes: a57f2bfb4a58 ("PCI/AER: Ratelimit correctable and non-fatal error logging")
> Signed-off-by: Sizhe Liu <liusizhe5@huawei.com>
> ---
> v2
> - Corrected the format and spelling errors in the commit log.
>
>   drivers/pci/pci.h      |  1 +
>   drivers/pci/pcie/aer.c | 35 +++++++++++++++++++++++------------
>   drivers/pci/pcie/dpc.c |  1 +
>   3 files changed, 25 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 0e67014aa001..0cbcbcd52354 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -748,6 +748,7 @@ struct aer_err_info {
>   
>   int aer_get_device_error_info(struct aer_err_info *info, int i);
>   void aer_print_error(struct aer_err_info *info, int i);
> +void aer_ratelimit_print_init(struct pci_dev *dev, struct aer_err_info *e_info, int idx);
>   
>   int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
>   		      unsigned int tlp_len, bool flit,
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index e0bcaa896803..b73915b63327 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -925,6 +925,28 @@ int cper_severity_to_aer(int cper_severity)
>   EXPORT_SYMBOL_GPL(cper_severity_to_aer);
>   #endif
>   
> +/**
> + * aer_ratelimit_print_init - set flag whether error message is printed
> + * @dev: pointer to pci_dev to be rate-limited
> + * @e_info: pointer to error info
> + * @idx: index for ratelimit_print array
> + */
> +void aer_ratelimit_print_init(struct pci_dev *dev, struct aer_err_info *e_info, int idx)
> +{
> +	/*
> +	 * Ratelimit AER log messages.  "dev" is either the source
> +	 * identified by the root's Error Source ID or it has an unmasked
> +	 * error logged in its own AER Capability.  Messages are emitted
> +	 * when "ratelimit_print[i]" is non-zero.  If we will print detail
> +	 * for a downstream device, make sure we print the Error Source ID
> +	 * from the root as well.
> +	 */
> +	if (aer_ratelimit(dev, e_info->severity)) {
> +		e_info->ratelimit_print[idx] = 1;
> +		e_info->root_ratelimit_print = 1;
> +	}
> +}
> +
>   void pci_print_aer(struct pci_dev *dev, int aer_severity,
>   		   struct aer_capability_regs *aer)
>   {
> @@ -990,18 +1012,7 @@ static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
>   	e_info->dev[i] = pci_dev_get(dev);
>   	e_info->error_dev_num++;
>   
> -	/*
> -	 * Ratelimit AER log messages.  "dev" is either the source
> -	 * identified by the root's Error Source ID or it has an unmasked
> -	 * error logged in its own AER Capability.  Messages are emitted
> -	 * when "ratelimit_print[i]" is non-zero.  If we will print detail
> -	 * for a downstream device, make sure we print the Error Source ID
> -	 * from the root as well.
> -	 */
> -	if (aer_ratelimit(dev, e_info->severity)) {
> -		e_info->ratelimit_print[i] = 1;
> -		e_info->root_ratelimit_print = 1;
> -	}
> +	aer_ratelimit_print_init(dev, e_info, i);
>   	return 0;
>   }
>   
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index fc18349614d7..d17adc642781 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -275,6 +275,7 @@ void dpc_process_error(struct pci_dev *pdev)
>   			 status);
>   		if (dpc_get_aer_uncorrect_severity(pdev, &info) &&
>   		    aer_get_device_error_info(&info, 0)) {
> +			aer_ratelimit_print_init(pdev, &info, 0);
>   			aer_print_error(&info, 0);
>   			pci_aer_clear_nonfatal_status(pdev);
>   			pci_aer_clear_fatal_status(pdev);


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] PCI/AER: Fix AER log missing in DPC case
  2026-01-29 14:01 [PATCH v2] PCI/AER: Fix AER log missing in DPC case Sizhe Liu
  2026-02-06 10:00 ` Sizhe LIU
@ 2026-02-06 20:10 ` Bjorn Helgaas
  2026-02-10  8:00   ` Sizhe LIU
  1 sibling, 1 reply; 5+ messages in thread
From: Bjorn Helgaas @ 2026-02-06 20:10 UTC (permalink / raw)
  To: Sizhe Liu
  Cc: bhelgaas, jonathan.cameron, shiju.jose, pandoh, linux-pci,
	linuxarm, prime.zeng, fanghao11, shenyang39

On Thu, Jan 29, 2026 at 10:01:03PM +0800, Sizhe Liu wrote:
> In the current DPC error reporting case, some AER log information is missing.
> 
> -- Error log abnormal
> pcieport 0000:20:00.0: DPC: containment event, status: 0x1f11: unmasked uncorrectable error detected
> (------ AER error log supposed to be printed here, but missing ------)
> nvme nvme0: frozen state error detected, reset controller
> {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
> 
> Cause:
> In aer_print_error(), PCIe AER errors is reported, and is rate-limited
> by info->ratelimit_print[i]. There are two entry points for
> aer_print_error().
> 
> 1) Native AER
> aer_isr_one_error_type() -> aer_process_err_devices() ->
> aer_print_error()
> 2) DPC
> dpc_process_error() -> aer_print_error()
> 
> The value of info->ratelimit_print[i] is initialized correctly in
> the native AER case:
> aer_isr_one_error_type() -> find_source_device() ->
> find_device_iter() -> add_error_device()
> 
> In the DPC case, info->ratelimit_print[i] is not initialized and
> alloc by 0 , so in aer_print_error(), it will directly return at line
> if (!info->ratelimit_print[i])
> This will result in losing the AER log messages in the DPC case.
> 
> Solution:
> 1. Move the initialization of info->ratelimit_print[i] to
> aer_ratelimit_print_init().
> 2. Add aer_ratelimit_print_init() in dpc_process_error().
> 3. Replace the initialization by aer_ratelimit_print_init()in
> Native AER case.

I see the problem, and I think you're right that we're not logging any
AER info for DPC events (including events handled via the EDR path,
which also calls dpc_process_error()).

Currently we do the ratelimit init in add_error_device(), which also
includes pci_dev_get() for the device.  I don't see a similar
pci_dev_get() anywhere in the DPC path.  There is one in the EDR path:

  edr_handle_event
    acpi_dpc_port_get
      pci_dev_get                    <--
    dpc_process_error
      aer_get_device_error_info(aer_err_info)
      aer_print_error(aer_err_info)
    pcie_do_recovery
    pci_dev_put

Maybe DPC and EDR should be using add_error_device() directly?  It
seems like holding that reference on the device is important.

> Test with AER inject:
> Set the DPC reporting priority in the BIOS and send
> MalfTLP(AER FATAL ERROR) to device.
> 
> -- Error log normal
> pcieport 0000:20:00.0: DPC: containment event, status:0x1f11: unmasked uncorrectable error detected
> pcieport 0000:20:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
> pcieport 0000:20:00.0: device [19e5:a120] error status/mask=00040000/04580000
> pcieport 0000:20:00.0:    [18] MalfTLP   (First)
> pcieport 0000:20:00.0: AER:   TLP Header: 0x00000000 0x00000000 0x00000000 0x00000000
> nvme nvme0: frozen state error detected, reset controller
> {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
> 
> [1] https://lore.kernel.org/linux-pci/20260127035405.712271-1-liusizhe5@huawei.com/
> 
> Fixes: a57f2bfb4a58 ("PCI/AER: Ratelimit correctable and non-fatal error logging")
> Signed-off-by: Sizhe Liu <liusizhe5@huawei.com>
> ---
> v2
> - Corrected the format and spelling errors in the commit log.
> 
>  drivers/pci/pci.h      |  1 +
>  drivers/pci/pcie/aer.c | 35 +++++++++++++++++++++++------------
>  drivers/pci/pcie/dpc.c |  1 +
>  3 files changed, 25 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 0e67014aa001..0cbcbcd52354 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -748,6 +748,7 @@ struct aer_err_info {
>  
>  int aer_get_device_error_info(struct aer_err_info *info, int i);
>  void aer_print_error(struct aer_err_info *info, int i);
> +void aer_ratelimit_print_init(struct pci_dev *dev, struct aer_err_info *e_info, int idx);
>  
>  int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
>  		      unsigned int tlp_len, bool flit,
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index e0bcaa896803..b73915b63327 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -925,6 +925,28 @@ int cper_severity_to_aer(int cper_severity)
>  EXPORT_SYMBOL_GPL(cper_severity_to_aer);
>  #endif
>  
> +/**
> + * aer_ratelimit_print_init - set flag whether error message is printed
> + * @dev: pointer to pci_dev to be rate-limited
> + * @e_info: pointer to error info
> + * @idx: index for ratelimit_print array
> + */
> +void aer_ratelimit_print_init(struct pci_dev *dev, struct aer_err_info *e_info, int idx)
> +{
> +	/*
> +	 * Ratelimit AER log messages.  "dev" is either the source
> +	 * identified by the root's Error Source ID or it has an unmasked
> +	 * error logged in its own AER Capability.  Messages are emitted
> +	 * when "ratelimit_print[i]" is non-zero.  If we will print detail
> +	 * for a downstream device, make sure we print the Error Source ID
> +	 * from the root as well.
> +	 */
> +	if (aer_ratelimit(dev, e_info->severity)) {
> +		e_info->ratelimit_print[idx] = 1;
> +		e_info->root_ratelimit_print = 1;
> +	}
> +}
> +
>  void pci_print_aer(struct pci_dev *dev, int aer_severity,
>  		   struct aer_capability_regs *aer)
>  {
> @@ -990,18 +1012,7 @@ static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
>  	e_info->dev[i] = pci_dev_get(dev);
>  	e_info->error_dev_num++;
>  
> -	/*
> -	 * Ratelimit AER log messages.  "dev" is either the source
> -	 * identified by the root's Error Source ID or it has an unmasked
> -	 * error logged in its own AER Capability.  Messages are emitted
> -	 * when "ratelimit_print[i]" is non-zero.  If we will print detail
> -	 * for a downstream device, make sure we print the Error Source ID
> -	 * from the root as well.
> -	 */
> -	if (aer_ratelimit(dev, e_info->severity)) {
> -		e_info->ratelimit_print[i] = 1;
> -		e_info->root_ratelimit_print = 1;
> -	}
> +	aer_ratelimit_print_init(dev, e_info, i);
>  	return 0;
>  }
>  
> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
> index fc18349614d7..d17adc642781 100644
> --- a/drivers/pci/pcie/dpc.c
> +++ b/drivers/pci/pcie/dpc.c
> @@ -275,6 +275,7 @@ void dpc_process_error(struct pci_dev *pdev)
>  			 status);
>  		if (dpc_get_aer_uncorrect_severity(pdev, &info) &&
>  		    aer_get_device_error_info(&info, 0)) {
> +			aer_ratelimit_print_init(pdev, &info, 0);
>  			aer_print_error(&info, 0);
>  			pci_aer_clear_nonfatal_status(pdev);
>  			pci_aer_clear_fatal_status(pdev);
> -- 
> 2.33.0
> 

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] PCI/AER: Fix AER log missing in DPC case
  2026-02-06 20:10 ` Bjorn Helgaas
@ 2026-02-10  8:00   ` Sizhe LIU
  2026-02-11 12:18     ` Sizhe Liu
  0 siblings, 1 reply; 5+ messages in thread
From: Sizhe LIU @ 2026-02-10  8:00 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: bhelgaas, jonathan.cameron, shiju.jose, pandoh, linux-pci,
	linuxarm, prime.zeng, fanghao11, shenyang39

On 2026/2/7 4:10, Bjorn Helgaas wrote:

> On Thu, Jan 29, 2026 at 10:01:03PM +0800, Sizhe Liu wrote:
>> In the current DPC error reporting case, some AER log information is missing.
>>
>> -- Error log abnormal
>> pcieport 0000:20:00.0: DPC: containment event, status: 0x1f11: unmasked uncorrectable error detected
>> (------ AER error log supposed to be printed here, but missing ------)
>> nvme nvme0: frozen state error detected, reset controller
>> {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
>>
>> Cause:
>> In aer_print_error(), PCIe AER errors is reported, and is rate-limited
>> by info->ratelimit_print[i]. There are two entry points for
>> aer_print_error().
>>
>> 1) Native AER
>> aer_isr_one_error_type() -> aer_process_err_devices() ->
>> aer_print_error()
>> 2) DPC
>> dpc_process_error() -> aer_print_error()
>>
>> The value of info->ratelimit_print[i] is initialized correctly in
>> the native AER case:
>> aer_isr_one_error_type() -> find_source_device() ->
>> find_device_iter() -> add_error_device()
>>
>> In the DPC case, info->ratelimit_print[i] is not initialized and
>> alloc by 0 , so in aer_print_error(), it will directly return at line
>> if (!info->ratelimit_print[i])
>> This will result in losing the AER log messages in the DPC case.
>>
>> Solution:
>> 1. Move the initialization of info->ratelimit_print[i] to
>> aer_ratelimit_print_init().
>> 2. Add aer_ratelimit_print_init() in dpc_process_error().
>> 3. Replace the initialization by aer_ratelimit_print_init()in
>> Native AER case.
> I see the problem, and I think you're right that we're not logging any
> AER info for DPC events (including events handled via the EDR path,
> which also calls dpc_process_error()).
>
> Currently we do the ratelimit init in add_error_device(), which also
> includes pci_dev_get() for the device.  I don't see a similar
> pci_dev_get() anywhere in the DPC path.  There is one in the EDR path:
>
>    edr_handle_event
>      acpi_dpc_port_get
>        pci_dev_get                    <--
>      dpc_process_error
>        aer_get_device_error_info(aer_err_info)
>        aer_print_error(aer_err_info)
>      pcie_do_recovery
>      pci_dev_put
>
> Maybe DPC and EDR should be using add_error_device() directly?  It
> seems like holding that reference on the device is important.
I agree with directly using add_error_device() – holding a reference to 
the device is a more robust approach.
Below is the detailed call trace for reference:

DPC path:
     dpc_handler
         add_error_device           <--
         dpc_process_error
             aer_get_device_error_info(aer_err_info)
             aer_print_error(aer_err_info)
         pcie_do_recovery
         pci_dev_put

EDR path:
     edr_handle_event
         acpi_dpc_port_get
             add_error_device           <--
         dpc_process_error
             aer_get_device_error_info(aer_err_info)
             aer_print_error(aer_err_info)
         pcie_do_recovery
         pci_dev_put

I will implement this change in the v3 patch and add relevant comments 
to add_error_device() for clarity.

Thanks,
Sizhe
>> Test with AER inject:
>> Set the DPC reporting priority in the BIOS and send
>> MalfTLP(AER FATAL ERROR) to device.
>>
>> -- Error log normal
>> pcieport 0000:20:00.0: DPC: containment event, status:0x1f11: unmasked uncorrectable error detected
>> pcieport 0000:20:00.0: PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>> pcieport 0000:20:00.0: device [19e5:a120] error status/mask=00040000/04580000
>> pcieport 0000:20:00.0:    [18] MalfTLP   (First)
>> pcieport 0000:20:00.0: AER:   TLP Header: 0x00000000 0x00000000 0x00000000 0x00000000
>> nvme nvme0: frozen state error detected, reset controller
>> {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
>>
>> [1] https://lore.kernel.org/linux-pci/20260127035405.712271-1-liusizhe5@huawei.com/
>>
>> Fixes: a57f2bfb4a58 ("PCI/AER: Ratelimit correctable and non-fatal error logging")
>> Signed-off-by: Sizhe Liu <liusizhe5@huawei.com>
>> ---
>> v2
>> - Corrected the format and spelling errors in the commit log.
>>
>>   drivers/pci/pci.h      |  1 +
>>   drivers/pci/pcie/aer.c | 35 +++++++++++++++++++++++------------
>>   drivers/pci/pcie/dpc.c |  1 +
>>   3 files changed, 25 insertions(+), 12 deletions(-)
>>
>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>> index 0e67014aa001..0cbcbcd52354 100644
>> --- a/drivers/pci/pci.h
>> +++ b/drivers/pci/pci.h
>> @@ -748,6 +748,7 @@ struct aer_err_info {
>>   
>>   int aer_get_device_error_info(struct aer_err_info *info, int i);
>>   void aer_print_error(struct aer_err_info *info, int i);
>> +void aer_ratelimit_print_init(struct pci_dev *dev, struct aer_err_info *e_info, int idx);
>>   
>>   int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
>>   		      unsigned int tlp_len, bool flit,
>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>> index e0bcaa896803..b73915b63327 100644
>> --- a/drivers/pci/pcie/aer.c
>> +++ b/drivers/pci/pcie/aer.c
>> @@ -925,6 +925,28 @@ int cper_severity_to_aer(int cper_severity)
>>   EXPORT_SYMBOL_GPL(cper_severity_to_aer);
>>   #endif
>>   
>> +/**
>> + * aer_ratelimit_print_init - set flag whether error message is printed
>> + * @dev: pointer to pci_dev to be rate-limited
>> + * @e_info: pointer to error info
>> + * @idx: index for ratelimit_print array
>> + */
>> +void aer_ratelimit_print_init(struct pci_dev *dev, struct aer_err_info *e_info, int idx)
>> +{
>> +	/*
>> +	 * Ratelimit AER log messages.  "dev" is either the source
>> +	 * identified by the root's Error Source ID or it has an unmasked
>> +	 * error logged in its own AER Capability.  Messages are emitted
>> +	 * when "ratelimit_print[i]" is non-zero.  If we will print detail
>> +	 * for a downstream device, make sure we print the Error Source ID
>> +	 * from the root as well.
>> +	 */
>> +	if (aer_ratelimit(dev, e_info->severity)) {
>> +		e_info->ratelimit_print[idx] = 1;
>> +		e_info->root_ratelimit_print = 1;
>> +	}
>> +}
>> +
>>   void pci_print_aer(struct pci_dev *dev, int aer_severity,
>>   		   struct aer_capability_regs *aer)
>>   {
>> @@ -990,18 +1012,7 @@ static int add_error_device(struct aer_err_info *e_info, struct pci_dev *dev)
>>   	e_info->dev[i] = pci_dev_get(dev);
>>   	e_info->error_dev_num++;
>>   
>> -	/*
>> -	 * Ratelimit AER log messages.  "dev" is either the source
>> -	 * identified by the root's Error Source ID or it has an unmasked
>> -	 * error logged in its own AER Capability.  Messages are emitted
>> -	 * when "ratelimit_print[i]" is non-zero.  If we will print detail
>> -	 * for a downstream device, make sure we print the Error Source ID
>> -	 * from the root as well.
>> -	 */
>> -	if (aer_ratelimit(dev, e_info->severity)) {
>> -		e_info->ratelimit_print[i] = 1;
>> -		e_info->root_ratelimit_print = 1;
>> -	}
>> +	aer_ratelimit_print_init(dev, e_info, i);
>>   	return 0;
>>   }
>>   
>> diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
>> index fc18349614d7..d17adc642781 100644
>> --- a/drivers/pci/pcie/dpc.c
>> +++ b/drivers/pci/pcie/dpc.c
>> @@ -275,6 +275,7 @@ void dpc_process_error(struct pci_dev *pdev)
>>   			 status);
>>   		if (dpc_get_aer_uncorrect_severity(pdev, &info) &&
>>   		    aer_get_device_error_info(&info, 0)) {
>> +			aer_ratelimit_print_init(pdev, &info, 0);
>>   			aer_print_error(&info, 0);
>>   			pci_aer_clear_nonfatal_status(pdev);
>>   			pci_aer_clear_fatal_status(pdev);
>> -- 
>> 2.33.0
>>
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] PCI/AER: Fix AER log missing in DPC case
  2026-02-10  8:00   ` Sizhe LIU
@ 2026-02-11 12:18     ` Sizhe Liu
  0 siblings, 0 replies; 5+ messages in thread
From: Sizhe Liu @ 2026-02-11 12:18 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: bhelgaas, jonathan.cameron, shiju.jose, pandoh, linux-pci,
	linuxarm, prime.zeng, fanghao11, shenyang39

On 2026/2/10 16:00, Sizhe LIU wrote:

> On 2026/2/7 4:10, Bjorn Helgaas wrote:
>
>> On Thu, Jan 29, 2026 at 10:01:03PM +0800, Sizhe Liu wrote:
>>> In the current DPC error reporting case, some AER log information is 
>>> missing.
>>>
>>> -- Error log abnormal
>>> pcieport 0000:20:00.0: DPC: containment event, status: 0x1f11: 
>>> unmasked uncorrectable error detected
>>> (------ AER error log supposed to be printed here, but missing ------)
>>> nvme nvme0: frozen state error detected, reset controller
>>> {4}[Hardware Error]: Hardware error from APEI Generic Hardware Error 
>>> Source: 0
>>>
>>> Cause:
>>> In aer_print_error(), PCIe AER errors is reported, and is rate-limited
>>> by info->ratelimit_print[i]. There are two entry points for
>>> aer_print_error().
>>>
>>> 1) Native AER
>>> aer_isr_one_error_type() -> aer_process_err_devices() ->
>>> aer_print_error()
>>> 2) DPC
>>> dpc_process_error() -> aer_print_error()
>>>
>>> The value of info->ratelimit_print[i] is initialized correctly in
>>> the native AER case:
>>> aer_isr_one_error_type() -> find_source_device() ->
>>> find_device_iter() -> add_error_device()
>>>
>>> In the DPC case, info->ratelimit_print[i] is not initialized and
>>> alloc by 0 , so in aer_print_error(), it will directly return at line
>>> if (!info->ratelimit_print[i])
>>> This will result in losing the AER log messages in the DPC case.
>>>
>>> Solution:
>>> 1. Move the initialization of info->ratelimit_print[i] to
>>> aer_ratelimit_print_init().
>>> 2. Add aer_ratelimit_print_init() in dpc_process_error().
>>> 3. Replace the initialization by aer_ratelimit_print_init()in
>>> Native AER case.
>> I see the problem, and I think you're right that we're not logging any
>> AER info for DPC events (including events handled via the EDR path,
>> which also calls dpc_process_error()).
>>
>> Currently we do the ratelimit init in add_error_device(), which also
>> includes pci_dev_get() for the device.  I don't see a similar
>> pci_dev_get() anywhere in the DPC path.  There is one in the EDR path:
>>
>>    edr_handle_event
>>      acpi_dpc_port_get
>>        pci_dev_get                    <--
>>      dpc_process_error
>>        aer_get_device_error_info(aer_err_info)
>>        aer_print_error(aer_err_info)
>>      pcie_do_recovery
>>      pci_dev_put
>>
>> Maybe DPC and EDR should be using add_error_device() directly? It
>> seems like holding that reference on the device is important.
> I agree with directly using add_error_device() – holding a reference 
> to the device is a more robust approach.
> Below is the detailed call trace for reference:
>
> DPC path:
>     dpc_handler
>         add_error_device           <--
>         dpc_process_error
>             aer_get_device_error_info(aer_err_info)
>             aer_print_error(aer_err_info)
>         pcie_do_recovery
>         pci_dev_put
>
> EDR path:
>     edr_handle_event
>         acpi_dpc_port_get
>             add_error_device           <--
>         dpc_process_error
>             aer_get_device_error_info(aer_err_info)
>             aer_print_error(aer_err_info)
>         pcie_do_recovery
>         pci_dev_put
>
> I will implement this change in the v3 patch and add relevant comments 
> to add_error_device() for clarity.
>
> Thanks,
> Sizhe
Hi,

I attempted to do this, but found the required changes and their impact
grew increasingly substantial. We rely on aer_err_info->severity
to initialize ratelimit_print in add_error_device().
To replace pci_dev_get() with add_error_device(),
we must read registers to get the severity before the call.

Besides pci_dev_get(), acpi_dpc_port_get() uses another path
that calls pci_get_domain_bus_and_slot() to obtain the error
device and take a reference via BDF, not from the pci_dev structure.
This requires a separate initialization of ratelimit_print.

A less intrusive approach in my opinion:
1. Extract ratelimit_print initialization into a dedicated function
    aer_ratelimit_print_init(), and call it after 
dpc_get_aer_uncorrect_severity()
    inside dpc_process_error().
2. Add pci_dev_get() / pci_dev_put() pairs directly in dpc_handler().

I plan to incorporate these changes into the v3 patch.
Looking forward to your feedback.

Thanks,
Sizhe
>>> Test with AER inject:
>>> Set the DPC reporting priority in the BIOS and send
>>> MalfTLP(AER FATAL ERROR) to device.
>>>
>>> -- Error log normal
>>> pcieport 0000:20:00.0: DPC: containment event, status:0x1f11: 
>>> unmasked uncorrectable error detected
>>> pcieport 0000:20:00.0: PCIe Bus Error: severity=Uncorrectable 
>>> (Fatal), type=Transaction Layer, (Receiver ID)
>>> pcieport 0000:20:00.0: device [19e5:a120] error 
>>> status/mask=00040000/04580000
>>> pcieport 0000:20:00.0:    [18] MalfTLP   (First)
>>> pcieport 0000:20:00.0: AER:   TLP Header: 0x00000000 0x00000000 
>>> 0x00000000 0x00000000
>>> nvme nvme0: frozen state error detected, reset controller
>>> {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error 
>>> Source: 0
>>>
>>> [1] 
>>> https://lore.kernel.org/linux-pci/20260127035405.712271-1-liusizhe5@huawei.com/
>>>
>>> Fixes: a57f2bfb4a58 ("PCI/AER: Ratelimit correctable and non-fatal 
>>> error logging")
>>> Signed-off-by: Sizhe Liu <liusizhe5@huawei.com>
>>> ---
>>> v2
>>> - Corrected the format and spelling errors in the commit log.
>>>
>>>   drivers/pci/pci.h      |  1 +
>>>   drivers/pci/pcie/aer.c | 35 +++++++++++++++++++++++------------
>>>   drivers/pci/pcie/dpc.c |  1 +
>>>   3 files changed, 25 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
>>> index 0e67014aa001..0cbcbcd52354 100644
>>> --- a/drivers/pci/pci.h
>>> +++ b/drivers/pci/pci.h
>>> @@ -748,6 +748,7 @@ struct aer_err_info {
>>>     int aer_get_device_error_info(struct aer_err_info *info, int i);
>>>   void aer_print_error(struct aer_err_info *info, int i);
>>> +void aer_ratelimit_print_init(struct pci_dev *dev, struct 
>>> aer_err_info *e_info, int idx);
>>>     int pcie_read_tlp_log(struct pci_dev *dev, int where, int where2,
>>>                 unsigned int tlp_len, bool flit,
>>> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
>>> index e0bcaa896803..b73915b63327 100644
>>> --- a/drivers/pci/pcie/aer.c
>>> +++ b/drivers/pci/pcie/aer.c
>>> @@ -925,6 +925,28 @@ int cper_severity_to_aer(int cper_severity)
>>>   EXPORT_SYMBOL_GPL(cper_severity_to_aer);
>>>   #endif
>>>   +/**
>>> + * aer_ratelimit_print_init - set flag whether error message is 
>>> printed
>>> + * @dev: pointer to pci_dev to be rate-limited
>>> + * @e_info: pointer to error info
>>> + * @idx: index for ratelimit_print array
>>> + */
>>> +void aer_ratelimit_print_init(struct pci_dev *dev, struct 
>>> aer_err_info *e_info, int idx)
>>> +{
>>> +    /*
>>> +     * Ratelimit AER log messages.  "dev" is either the source
>>> +     * identified by the root's Error Source ID or it has an unmasked
>>> +     * error logged in its own AER Capability.  Messages are emitted
>>> +     * when "ratelimit_print[i]" is non-zero.  If we will print detail
>>> +     * for a downstream device, make sure we print the Error Source ID
>>> +     * from the root as well.
>>> +     */
>>> +    if (aer_ratelimit(dev, e_info->severity)) {
>>> +        e_info->ratelimit_print[idx] = 1;
>>> +        e_info->root_ratelimit_print = 1;
>>> +    }
>>> +}
>>> +
>>>   void pci_print_aer(struct pci_dev *dev, int aer_severity,
>>>              struct aer_capability_regs *aer)
>>>   {
>>> @@ -990,18 +1012,7 @@ static int add_error_device(struct 
>>> aer_err_info *e_info, struct pci_dev *dev)
>>>       e_info->dev[i] = pci_dev_get(dev);
>>>       e_info->error_dev_num++;
>>>   -    /*
>>> -     * Ratelimit AER log messages.  "dev" is either the source
>>> -     * identified by the root's Error Source ID or it has an unmasked
>>> -     * error logged in its own AER Capability.  Messages are emitted
>>> -     * when "ratelimit_print[i]" is non-zero.  If we will print detail
>>> -     * for a downstream device, make sure we print the Error Source ID
>>> -     * from the root as well.
>>> -     */
>>> -    if (aer_ratelimit(dev, e_info->severity)) {
>>> -        e_info->ratelimit_print[i] = 1;
>>> -        e_info->root_ratelimit_print = 1;
>>> -    }
>>> +    aer_ratelimit_print_init(dev, e_info, i);
>>>       return 0;
>>>   }
>>>   diff --git a/drivers/pci/pcie/dpc.c b/drivers/pci/pcie/dpc.c
>>> index fc18349614d7..d17adc642781 100644
>>> --- a/drivers/pci/pcie/dpc.c
>>> +++ b/drivers/pci/pcie/dpc.c
>>> @@ -275,6 +275,7 @@ void dpc_process_error(struct pci_dev *pdev)
>>>                status);
>>>           if (dpc_get_aer_uncorrect_severity(pdev, &info) &&
>>>               aer_get_device_error_info(&info, 0)) {
>>> +            aer_ratelimit_print_init(pdev, &info, 0);
>>>               aer_print_error(&info, 0);
>>>               pci_aer_clear_nonfatal_status(pdev);
>>>               pci_aer_clear_fatal_status(pdev);
>>> -- 
>>> 2.33.0
>>>
>>
>
>

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-02-11 12:18 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-01-29 14:01 [PATCH v2] PCI/AER: Fix AER log missing in DPC case Sizhe Liu
2026-02-06 10:00 ` Sizhe LIU
2026-02-06 20:10 ` Bjorn Helgaas
2026-02-10  8:00   ` Sizhe LIU
2026-02-11 12:18     ` Sizhe Liu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox