linux-kernel.vger.kernel.org archive mirror
* [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error
@ 2024-06-20  2:58 Zhenzhong Duan
  2024-06-20  2:58 ` [PATCH v5 1/2] PCI/AER: Clear UNCOR_STATUS bits that might be ANFE Zhenzhong Duan
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Zhenzhong Duan @ 2024-06-20  2:58 UTC
  To: linux-pci
  Cc: bhelgaas, mahesh, oohall, linuxppc-dev, linux-acpi, rafael, lenb,
	james.morse, tony.luck, bp, dave, jonathan.cameron, dave.jiang,
	alison.schofield, vishal.l.verma, ira.weiny, helgaas, linmiaohe,
	shiju.jose, adam.c.preble, lukas, Smita.KoralahalliChannabasappa,
	rrichter, linux-cxl, linux-edac, linux-kernel, erwin.tsaur,
	sathyanarayanan.kuppuswamy, dan.j.williams, feiting.wanyan,
	yudong.wang, chao.p.peng, qingshun.wang, Zhenzhong Duan

Hi,

This series continues Qingshun's v2 [1], but focuses on ANFE processing,
as the subject suggests, and drops the trace event for now. I think it is
a bit heavy to do extra I/O to read PCIe registers only for tracing, and
I have not seen a community request for it so far.

According to PCIe Base Specification Revision 6.1, Sections 6.2.3.2.4 and
6.2.4.3, certain uncorrectable errors will signal ERR_COR instead of
ERR_NONFATAL, logged as Advisory Non-Fatal Error(ANFE), and set bits in
both Correctable Error(CE) Status register and Uncorrectable Error(UE)
Status register. Currently, when handling AER events the kernel will only
look at CE status or UE status, but never both. In the ANFE case, bits set
in the UE status register are neither reported nor cleared until the next
FE/NFE arrives.

For instance, previously, when the kernel receives an ANFE with Poisoned
TLP in OS native AER mode, only the status of CE will be reported and
cleared:

  AER: Correctable error message received from 0000:b7:02.0
  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00002000/00000000
     [13] NonFatalErr

If the kernel receives a Malformed TLP after that, two UEs will be
reported, which is unexpected. The Malformed TLP Header is lost since
the previous ANFE gated the TLP header logs:

  PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00041000/00180020
     [12] TLP                    (First)
     [18] MalfTLP
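
As a reading aid (not part of this series), the status words in these logs
can be decoded with a small user-space sketch. aer_status_bits() is a
hypothetical helper name; it only lists which bits are set, so 00041000
yields bits 12 and 18, matching the two UEs printed above.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Collect the indices of the set bits in a 32-bit AER status word,
 * lowest bit first. Returns how many indices were written to out[].
 * Hypothetical helper for illustration only; not kernel code.
 */
static size_t aer_status_bits(uint32_t status, int out[32])
{
	size_t n = 0;

	for (int bit = 0; bit < 32; bit++)
		if (status & (UINT32_C(1) << bit))
			out[n++] = bit;
	return n;
}
```

Calling it with 0x00002000 gives the single bit 13 (NonFatalErr in the CE
status register), and with 0x00041000 gives bits 12 and 18.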

To handle this case properly, calculate the potentially ANFE-related
status bits and save them in aer_err_info. Use this information to
determine the status bits that need to be cleared.

Now, for the previous scenario, both CE status and related UE status will
be reported and cleared after ANFE:

  AER: Correctable error message received from 0000:b7:02.0
  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00002000/00000000
     [13] NonFatalErr
    Uncorrectable errors that may cause Advisory Non-Fatal:
     [12] TLP

Note:
checkpatch.pl produces the following warnings on patches 1 and 2:

WARNING: 'UE' may be misspelled - perhaps 'USE'?
#22:
uncorrectable error(UE) status should be cleared. However, there is no

...similar warnings omitted...

This is a false positive, so it is not fixed.

WARNING: Prefer a maximum 75 chars per line (possible unwrapped commit description?)
#10:
  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)

...similar warnings omitted...

For readability reasons, these warnings are not fixed.



[1] https://lore.kernel.org/linux-pci/20240125062802.50819-1-qingshun.wang@linux.intel.com

Thanks
Qingshun, Zhenzhong

Changelog:
v5:
  - squash patch 1 and 3 (Kuppuswamy)
  - add comment about avoiding race and fix typo error (Kuppuswamy)
  - collect Jonathan and Kuppuswamy's R-b

v4:
  - Fix a race in anfe_get_uc_status() (Jonathan)
  - Add a comment to explain side effect of processing ANFE as NFE (Jonathan)
  - Drop the check for PCI_EXP_DEVSTA_NFED

v3:
  - Split ANFE print and processing to two patches (Bjorn)
  - Simplify ANFE handling, drop trace event
  - Polish comments and patch description
  - Add Tested-by

v2:
  - Reference to the latest PCIe Specification in both commit messages
    and comments, as suggested by Bjorn Helgaas.
  - Describe the reason for storing additional information in
    aer_err_info in the commit message of PATCH 1, as suggested by Bjorn
    Helgaas.
  - Add more details of behavior changes in the commit message of PATCH
    2, as suggested by Bjorn Helgaas.

v4: https://lkml.org/lkml/2024/5/9/247
v3: https://lore.kernel.org/lkml/20240417061407.1491361-1-zhenzhong.duan@intel.com
v2: https://lore.kernel.org/linux-pci/20240125062802.50819-1-qingshun.wang@linux.intel.com
v1: https://lore.kernel.org/linux-pci/20240111073227.31488-1-qingshun.wang@linux.intel.com


Zhenzhong Duan (2):
  PCI/AER: Clear UNCOR_STATUS bits that might be ANFE
  PCI/AER: Print UNCOR_STATUS bits that might be ANFE

 drivers/pci/pci.h      |  1 +
 drivers/pci/pcie/aer.c | 79 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 79 insertions(+), 1 deletion(-)

-- 
2.34.1



* [PATCH v5 1/2] PCI/AER: Clear UNCOR_STATUS bits that might be ANFE
  2024-06-20  2:58 [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error Zhenzhong Duan
@ 2024-06-20  2:58 ` Zhenzhong Duan
  2025-08-22 17:20   ` Bjorn Helgaas
  2024-06-20  2:58 ` [PATCH v5 2/2] PCI/AER: Print " Zhenzhong Duan
  2024-07-12  9:56 ` [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error Duan, Zhenzhong
  2 siblings, 1 reply; 13+ messages in thread
From: Zhenzhong Duan @ 2024-06-20  2:58 UTC
  To: linux-pci
  Cc: bhelgaas, mahesh, oohall, linuxppc-dev, linux-acpi, rafael, lenb,
	james.morse, tony.luck, bp, dave, jonathan.cameron, dave.jiang,
	alison.schofield, vishal.l.verma, ira.weiny, helgaas, linmiaohe,
	shiju.jose, adam.c.preble, lukas, Smita.KoralahalliChannabasappa,
	rrichter, linux-cxl, linux-edac, linux-kernel, erwin.tsaur,
	sathyanarayanan.kuppuswamy, dan.j.williams, feiting.wanyan,
	yudong.wang, chao.p.peng, qingshun.wang, Zhenzhong Duan,
	Jonathan Cameron, Kuppuswamy Sathyanarayanan

In some cases the detector of a Non-Fatal Error(NFE) is not the most
appropriate agent to determine the type of the error. For example,
when software performs a configuration read from a non-existent
device or Function, the completer will send an ERR_NONFATAL Message.
On some platforms, ERR_NONFATAL results in a System Error, which
breaks normal software probing.

Advisory Non-Fatal Error(ANFE) is a special case that can be used
in the above scenario. It is predominantly determined by the role of the
detecting agent (Requester, Completer, or Receiver) and the specific
error. In such cases, an agent with AER signals the NFE (if enabled)
by sending an ERR_COR Message as an advisory to software, instead of
sending ERR_NONFATAL.

When processing an ANFE, ideally both the correctable error(CE) status and
the uncorrectable error(UE) status should be cleared. However, there is no
way to fully identify the UE associated with an ANFE. Even worse, a
Non-Fatal Error(NFE) may set the same UE status bit as an ANFE. Treating
an ANFE as an NFE will reproduce the issue mentioned above, i.e., breaking
software probing; treating an NFE as an ANFE will make us ignore some UEs
which need an active recovery operation. To avoid clearing UEs that are
not ANFE by accident, the most conservative route is taken here: if the
NFE Detected bit is set in Device Status, do not touch UE status; those
bits should be cleared later by the UE handler. Otherwise, a specific set
of UEs that may be raised as ANFE according to the PCIe specification will
be cleared if their corresponding severity is Non-Fatal.
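
The rule above can be sketched in user space as follows. This is an
illustration under assumptions, not the patch itself: anfe_candidates() is
a hypothetical name, the constants are copied from
include/uapi/linux/pci_regs.h, and the severity argument is assumed to be
the raw Uncorrectable Error Severity register (bit set means Fatal).

```c
#include <stdint.h>

/* Values as defined in include/uapi/linux/pci_regs.h */
#define PCI_EXP_DEVSTA_NFED	0x0002		/* Non-Fatal Error Detected */
#define PCI_ERR_UNC_POISON_TLP	0x00001000u	/* Poisoned TLP */
#define PCI_ERR_UNC_COMP_TIME	0x00004000u	/* Completion Timeout */
#define PCI_ERR_UNC_COMP_ABORT	0x00008000u	/* Completer Abort */
#define PCI_ERR_UNC_UNX_COMP	0x00010000u	/* Unexpected Completion */
#define PCI_ERR_UNC_UNSUP	0x00100000u	/* Unsupported Request */

#define ANFE_UNC_MASK	(PCI_ERR_UNC_POISON_TLP | PCI_ERR_UNC_COMP_TIME | \
			 PCI_ERR_UNC_COMP_ABORT | PCI_ERR_UNC_UNX_COMP | \
			 PCI_ERR_UNC_UNSUP)

/*
 * Hypothetical helper: return the UE status bits that may have been
 * raised as ANFE. uncor_sever is assumed to be the raw Uncorrectable
 * Error Severity register, where a set bit means "Fatal".
 */
static uint32_t anfe_candidates(uint32_t uncor_status, uint32_t uncor_mask,
				uint32_t uncor_sever, uint16_t devsta)
{
	/* Conservative route: a real NFE may own these bits, so claim none. */
	if (devsta & PCI_EXP_DEVSTA_NFED)
		return 0;

	/* Unmasked, Non-Fatal, and one of the five ANFE-capable errors. */
	return uncor_status & ~uncor_mask & ~uncor_sever & ANFE_UNC_MASK;
}
```

With status 0x00041000, mask 0x00180020, MalfTLP configured as Fatal and
NFE Detected clear, only bit 12 (Poisoned TLP) is returned; with NFE
Detected set, nothing is returned and the bits are left to the UE handler.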

To achieve this, cache the UNCOR_STATUS bits that might be ANFE in
aer_err_info.anfe_status and clear them in pci_aer_handle_error().
aer_err_info.anfe_status will also be used to print ANFE-related bits
in the following patch.
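
Clearing works because the AER status registers are write-1-to-clear:
writing the cached bits back clears exactly those bits and leaves the rest
alone. A minimal user-space model of that behaviour, with rw1c_write() as
a hypothetical name (real accesses go through pci_write_config_dword()):

```c
#include <stdint.h>

/*
 * Model a write to a write-1-to-clear (RW1C) register: every bit set in
 * value is cleared in the register, every zero bit is left unchanged.
 * Hypothetical helper for illustration only.
 */
static uint32_t rw1c_write(uint32_t reg, uint32_t value)
{
	return reg & ~value;
}
```

For example, writing anfe_status 0x00001000 to a UE status of 0x00041000
leaves 0x00040000, so a later Malformed TLP is no longer reported together
with the stale Poisoned TLP bit.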

For instance, previously, when the kernel receives an ANFE with Poisoned
TLP in OS native AER mode, only the status of CE will be reported and
cleared:

  AER: Correctable error message received from 0000:b7:02.0
  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00002000/00000000
     [13] NonFatalErr

If the kernel receives a Malformed TLP after that, two UEs will be
reported, which is unexpected. The Malformed TLP Header is lost since
the previous ANFE gated the TLP header logs:

  PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00041000/00180020
     [12] TLP                    (First)
     [18] MalfTLP

To handle this case properly, calculate the potentially ANFE-related
status bits and save them in aer_err_info. Use this information to
determine the status bits that need to be cleared.

Now, for the previous scenario, both CE status and related UE status will
be cleared after ANFE:

  AER: Correctable error message received from 0000:b7:02.0
  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00002000/00000000
     [13] NonFatalErr

  PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00040000/00180020
     [18] MalfTLP                    (First)

Tested-by: Yudong Wang <yudong.wang@intel.com>
Co-developed-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
Signed-off-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/pci/pci.h      |  1 +
 drivers/pci/pcie/aer.c | 64 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 64 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index fd44565c4756..b889a6204db0 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -410,6 +410,7 @@ struct aer_err_info {
 
 	unsigned int status;		/* COR/UNCOR Error Status */
 	unsigned int mask;		/* COR/UNCOR Error Mask */
+	unsigned int anfe_status;	/* UNCOR Error Status for ANFE */
 	struct pcie_tlp_log tlp;	/* TLP Header */
 };
 
diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index ac6293c24976..3dcfa0191169 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -107,6 +107,12 @@ struct aer_stats {
 					PCI_ERR_ROOT_MULTI_COR_RCV |	\
 					PCI_ERR_ROOT_MULTI_UNCOR_RCV)
 
+#define AER_ERR_ANFE_UNC_MASK		(PCI_ERR_UNC_POISON_TLP |	\
+					PCI_ERR_UNC_COMP_TIME |		\
+					PCI_ERR_UNC_COMP_ABORT |	\
+					PCI_ERR_UNC_UNX_COMP |		\
+					PCI_ERR_UNC_UNSUP)
+
 static int pcie_aer_disable;
 static pci_ers_result_t aer_root_reset(struct pci_dev *dev);
 
@@ -1094,9 +1100,14 @@ static void pci_aer_handle_error(struct pci_dev *dev, struct aer_err_info *info)
 		 * Correctable error does not need software intervention.
 		 * No need to go through error recovery process.
 		 */
-		if (aer)
+		if (aer) {
 			pci_write_config_dword(dev, aer + PCI_ERR_COR_STATUS,
 					info->status);
+			if (info->anfe_status)
+				pci_write_config_dword(dev,
+						       aer + PCI_ERR_UNCOR_STATUS,
+						       info->anfe_status);
+		}
 		if (pcie_aer_is_native(dev)) {
 			struct pci_driver *pdrv = dev->driver;
 
@@ -1196,6 +1207,53 @@ void aer_recover_queue(int domain, unsigned int bus, unsigned int devfn,
 EXPORT_SYMBOL_GPL(aer_recover_queue);
 #endif
 
+static void anfe_get_uc_status(struct pci_dev *dev, struct aer_err_info *info)
+{
+	u32 uncor_mask, uncor_status, anfe_status;
+	u16 device_status;
+	int aer = dev->aer_cap;
+
+	/*
+	 * To avoid race between device status read and error status register read,
+	 * cache uncorrectable error status before checking for NFE in device status
+	 * register.
+	 */
+	pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, &uncor_status);
+	pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &uncor_mask);
+	/*
+	 * According to PCIe Base Specification Revision 6.1 Section 6.2.3.2.4,
+	 * if an UNCOR error is raised as Advisory Non-Fatal error, it will
+	 * match the following conditions:
+	 *	a. The severity of the error is Non-Fatal.
+	 *	b. The error is one of the following:
+	 *		1. Poisoned TLP           (Section 6.2.3.2.4.3)
+	 *		2. Completion Timeout     (Section 6.2.3.2.4.4)
+	 *		3. Completer Abort        (Section 6.2.3.2.4.1)
+	 *		4. Unexpected Completion  (Section 6.2.3.2.4.5)
+	 *		5. Unsupported Request    (Section 6.2.3.2.4.1)
+	 */
+	anfe_status = uncor_status & ~uncor_mask & ~info->severity &
+		      AER_ERR_ANFE_UNC_MASK;
+
+	if (pcie_capability_read_word(dev, PCI_EXP_DEVSTA, &device_status))
+		return;
+	/*
+	 * Take the most conservative route here. If there are Non-Fatal errors
+	 * detected, do not assume any bit in uncor_status is set by ANFE.
+	 */
+	if (device_status & PCI_EXP_DEVSTA_NFED)
+		return;
+
+	/*
+	 * If there is another ANFE between reading uncor_status and clearing
+	 * PCI_ERR_COR_ADV_NFAT bit in cor_status register, that ANFE isn't
+	 * recorded in info->anfe_status. It will be read out as NFE in
+	 * following uncor_status register reading and processed by NFE
+	 * handler.
+	 */
+	info->anfe_status = anfe_status;
+}
+
 /**
  * aer_get_device_error_info - read error status from dev and store it to info
  * @dev: pointer to the device expected to have a error record
@@ -1213,6 +1271,7 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
 
 	/* Must reset in this function */
 	info->status = 0;
+	info->anfe_status = 0;
 	info->tlp_header_valid = 0;
 
 	/* The device might not support AER */
@@ -1226,6 +1285,9 @@ int aer_get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
 			&info->mask);
 		if (!(info->status & ~info->mask))
 			return 0;
+
+		if (info->status & PCI_ERR_COR_ADV_NFAT)
+			anfe_get_uc_status(dev, info);
 	} else if (type == PCI_EXP_TYPE_ROOT_PORT ||
 		   type == PCI_EXP_TYPE_RC_EC ||
 		   type == PCI_EXP_TYPE_DOWNSTREAM ||
-- 
2.34.1



* [PATCH v5 2/2] PCI/AER: Print UNCOR_STATUS bits that might be ANFE
  2024-06-20  2:58 [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error Zhenzhong Duan
  2024-06-20  2:58 ` [PATCH v5 1/2] PCI/AER: Clear UNCOR_STATUS bits that might be ANFE Zhenzhong Duan
@ 2024-06-20  2:58 ` Zhenzhong Duan
  2025-08-22 17:27   ` Bjorn Helgaas
  2025-08-29 21:18   ` Bjorn Helgaas
  2024-07-12  9:56 ` [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error Duan, Zhenzhong
  2 siblings, 2 replies; 13+ messages in thread
From: Zhenzhong Duan @ 2024-06-20  2:58 UTC
  To: linux-pci
  Cc: bhelgaas, mahesh, oohall, linuxppc-dev, linux-acpi, rafael, lenb,
	james.morse, tony.luck, bp, dave, jonathan.cameron, dave.jiang,
	alison.schofield, vishal.l.verma, ira.weiny, helgaas, linmiaohe,
	shiju.jose, adam.c.preble, lukas, Smita.KoralahalliChannabasappa,
	rrichter, linux-cxl, linux-edac, linux-kernel, erwin.tsaur,
	sathyanarayanan.kuppuswamy, dan.j.williams, feiting.wanyan,
	yudong.wang, chao.p.peng, qingshun.wang, Zhenzhong Duan,
	Jonathan Cameron, Kuppuswamy Sathyanarayanan

When an Advisory Non-Fatal error(ANFE) triggers, both correctable error(CE)
status and ANFE related uncorrectable error(UE) status will be printed:

  AER: Correctable error message received from 0000:b7:02.0
  PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00002000/00000000
     [13] NonFatalErr
    Uncorrectable errors that may cause Advisory Non-Fatal:
     [12] TLP
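
The per-bit output format can be tried out in user space with the sketch
below. It is an illustration only: format_anfe_bits() and ue_names[] are
hypothetical, the name table holds just the two strings that appear in the
logs of this thread, and every other bit falls back to "Unknown Error Bit".

```c
#include <stdint.h>
#include <stdio.h>

/* Only the names seen in this thread's logs; other entries stay NULL. */
static const char *const ue_names[32] = {
	[12] = "TLP",
	[18] = "MalfTLP",
};

/*
 * Format one "   [NN] name" line per set bit of anfe_status into buf.
 * Returns the number of complete lines written. Hypothetical helper.
 */
static int format_anfe_bits(uint32_t anfe_status, char *buf, size_t len)
{
	size_t used = 0;
	int lines = 0;

	if (len)
		buf[0] = '\0';

	for (int i = 0; i < 32; i++) {
		const char *name;
		int n;

		if (!(anfe_status & (UINT32_C(1) << i)))
			continue;

		name = ue_names[i] ? ue_names[i] : "Unknown Error Bit";
		n = snprintf(buf + used, len - used, "   [%2d] %s\n", i, name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* line did not fit; stop counting here */
		used += (size_t)n;
		lines++;
	}
	return lines;
}
```

For anfe_status 0x00001000 this produces the single line "   [12] TLP",
matching the last line of the example output above.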

Tested-by: Yudong Wang <yudong.wang@intel.com>
Co-developed-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
Signed-off-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
---
 drivers/pci/pcie/aer.c | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 3dcfa0191169..ba3a54092f2c 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -681,6 +681,7 @@ static void __aer_print_error(struct pci_dev *dev,
 {
 	const char **strings;
 	unsigned long status = info->status & ~info->mask;
+	unsigned long anfe_status = info->anfe_status;
 	const char *level, *errmsg;
 	int i;
 
@@ -701,6 +702,20 @@ static void __aer_print_error(struct pci_dev *dev,
 				info->first_error == i ? " (First)" : "");
 	}
 	pci_dev_aer_stats_incr(dev, info);
+
+	if (!anfe_status)
+		return;
+
+	strings = aer_uncorrectable_error_string;
+	pci_printk(level, dev, "Uncorrectable errors that may cause Advisory Non-Fatal:\n");
+
+	for_each_set_bit(i, &anfe_status, 32) {
+		errmsg = strings[i];
+		if (!errmsg)
+			errmsg = "Unknown Error Bit";
+
+		pci_printk(level, dev, "   [%2d] %s\n", i, errmsg);
+	}
 }
 
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
-- 
2.34.1



* RE: [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error
  2024-06-20  2:58 [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error Zhenzhong Duan
  2024-06-20  2:58 ` [PATCH v5 1/2] PCI/AER: Clear UNCOR_STATUS bits that might be ANFE Zhenzhong Duan
  2024-06-20  2:58 ` [PATCH v5 2/2] PCI/AER: Print " Zhenzhong Duan
@ 2024-07-12  9:56 ` Duan, Zhenzhong
  2025-08-21 16:58   ` Matthew W Carlis
  2 siblings, 1 reply; 13+ messages in thread
From: Duan, Zhenzhong @ 2024-07-12  9:56 UTC
  To: linux-pci@vger.kernel.org, bhelgaas@google.com
  Cc: mahesh@linux.ibm.com, oohall@gmail.com,
	linuxppc-dev@lists.ozlabs.org, linux-acpi@vger.kernel.org,
	rafael@kernel.org, lenb@kernel.org, james.morse@arm.com,
	Luck, Tony, bp@alien8.de, dave@stgolabs.net,
	jonathan.cameron@huawei.com, Jiang, Dave, Schofield, Alison,
	Verma, Vishal L, Weiny, Ira, helgaas@kernel.org,
	linmiaohe@huawei.com, shiju.jose@huawei.com, Preble, Adam C,
	lukas@wunner.de, Smita.KoralahalliChannabasappa@amd.com,
	rrichter@amd.com, linux-cxl@vger.kernel.org,
	linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	Tsaur, Erwin, Kuppuswamy, Sathyanarayanan, Williams, Dan J,
	Wanyan, Feiting, Wang, Yudong, Peng, Chao P,
	qingshun.wang@linux.intel.com

Hi Bjorn,

Kindly ping. This series has had Reviewed-by tags and no further comments
for a month. Would you consider picking it up, or are further improvements
needed? I look forward to your suggestions.

Thanks
Zhenzhong




* [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error
  2024-07-12  9:56 ` [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error Duan, Zhenzhong
@ 2025-08-21 16:58   ` Matthew W Carlis
  2025-08-22  1:45     ` Duan, Zhenzhong
  0 siblings, 1 reply; 13+ messages in thread
From: Matthew W Carlis @ 2025-08-21 16:58 UTC
  To: zhenzhong.duan
  Cc: Smita.KoralahalliChannabasappa, adam.c.preble, alison.schofield,
	bhelgaas, bp, chao.p.peng, dan.j.williams, dave.jiang, dave,
	erwin.tsaur, feiting.wanyan, helgaas, ira.weiny, james.morse,
	lenb, linux-acpi, linux-cxl, linux-edac, linux-kernel, linux-pci,
	linuxppc-dev, lukas, mahesh, oohall, qingshun.wang, rafael,
	rrichter, sathyanarayanan.kuppuswamy, tony.luck, vishal.l.verma,
	yudong.wang, msaggi, sconnor, ashishk, rhan, jrangi, agovindjee,
	bamstadt

Hello. My team had independently started to make a change similar to this
before realizing that someone had already taken a stab at it. It is highly
desirable in my mind to have an improved handling of Advisory Errors in
the upstream kernel. Is there anything we can do to help move this effort
along? Perhaps testing? We have a decent variety of system configurations &
are able to inject various kinds of errors via special devices/commands etc.

Thanks,
-Matt


* RE: [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error
  2025-08-21 16:58   ` Matthew W Carlis
@ 2025-08-22  1:45     ` Duan, Zhenzhong
  2025-08-22 16:51       ` Bjorn Helgaas
  0 siblings, 1 reply; 13+ messages in thread
From: Duan, Zhenzhong @ 2025-08-22  1:45 UTC
  To: Carlis, Matthew
  Cc: Smita.KoralahalliChannabasappa@amd.com, Preble, Adam C,
	Schofield, Alison, bhelgaas@google.com, bp@alien8.de,
	Peng, Chao P, Williams, Dan J, Jiang, Dave, dave@stgolabs.net,
	erwin.tsaur@intel.com, Wanyan, Feiting, helgaas@kernel.org,
	Weiny, Ira, james.morse@arm.com, lenb@kernel.org,
	linux-acpi@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	lukas@wunner.de, mahesh@linux.ibm.com, oohall@gmail.com,
	qingshun.wang@linux.intel.com, rafael@kernel.org,
	rrichter@amd.com, Kuppuswamy, Sathyanarayanan, Luck, Tony,
	Verma, Vishal L, Wang, Yudong, Saggi, Meeta,
	sconnor@purestorage.com, Karkare, Ashish, rhan@purestorage.com,
	Rangi, Jasjeet, Govindjee, Arjun, Amstadt, Bob

Hi Matthew,

Feel free to take it over if you are interested. The maintainer didn't
respond to this series; perhaps he expects some improvement in it.

Thanks
Zhenzhong

>-----Original Message-----
>From: Matthew W Carlis <mattc@purestorage.com>
>Subject: [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error
>
>Hello. My team had independently started to make a change similar to this
>before realizing that someone had already taken a stab at it. It is highly
>desirable in my mind to have an improved handling of Advisory Errors in
>the upstream kernel. Is there anything we can do to help move this effort
>along? Perhaps testing? We have a decent variety of system configurations &
>are able to inject various kinds of errors via special devices/commands etc.
>
>Thanks,
>-Matt


* Re: [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error
  2025-08-22  1:45     ` Duan, Zhenzhong
@ 2025-08-22 16:51       ` Bjorn Helgaas
  2025-08-22 18:15         ` Matthew W Carlis
  2025-08-28  1:00         ` Matthew W Carlis
  0 siblings, 2 replies; 13+ messages in thread
From: Bjorn Helgaas @ 2025-08-22 16:51 UTC
  To: Duan, Zhenzhong
  Cc: Carlis, Matthew, Smita.KoralahalliChannabasappa@amd.com,
	Preble, Adam C, Schofield, Alison, bhelgaas@google.com,
	bp@alien8.de, Peng, Chao P, Williams, Dan J, Jiang, Dave,
	dave@stgolabs.net, erwin.tsaur@intel.com, Wanyan, Feiting,
	Weiny, Ira, james.morse@arm.com, lenb@kernel.org,
	linux-acpi@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	lukas@wunner.de, mahesh@linux.ibm.com, oohall@gmail.com,
	qingshun.wang@linux.intel.com, rafael@kernel.org,
	rrichter@amd.com, Kuppuswamy, Sathyanarayanan, Luck, Tony,
	Verma, Vishal L, Wang, Yudong, Saggi, Meeta,
	sconnor@purestorage.com, Karkare, Ashish, rhan@purestorage.com,
	Rangi, Jasjeet, Govindjee, Arjun, Amstadt, Bob

On Fri, Aug 22, 2025 at 01:45:30AM +0000, Duan, Zhenzhong wrote:
> Hi Matthew,
> 
> Feel free to take it over if you are interested. Maintainer didn't
> respond to this series, perhaps he expects some improvement in the
> series.

I'm terribly sorry, this is my fault.  It just fell off my list for no
good reason.  Matthew, if you are able to test and/or provide a
Reviewed-by, that would be the best thing you can do to move this
forward (although neither is actually necessary).

Bjorn

> >-----Original Message-----
> >From: Matthew W Carlis <mattc@purestorage.com>
> >Subject: [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error
> >
> >Hello. My team had independently started to make a change similar to this
> >before realizing that someone had already taken a stab at it. It is highly
> >desirable in my mind to have an improved handling of Advisory Errors in
> >the upstream kernel. Is there anything we can do to help move this effort
> >along? Perhaps testing? We have a decent variety of system configurations &
> >are able to inject various kinds of errors via special devices/commands etc.
> >
> >Thanks,
> >-Matt


* Re: [PATCH v5 1/2] PCI/AER: Clear UNCOR_STATUS bits that might be ANFE
  2024-06-20  2:58 ` [PATCH v5 1/2] PCI/AER: Clear UNCOR_STATUS bits that might be ANFE Zhenzhong Duan
@ 2025-08-22 17:20   ` Bjorn Helgaas
  0 siblings, 0 replies; 13+ messages in thread
From: Bjorn Helgaas @ 2025-08-22 17:20 UTC
  To: Zhenzhong Duan
  Cc: linux-pci, bhelgaas, mahesh, oohall, linuxppc-dev, linux-acpi,
	rafael, lenb, james.morse, tony.luck, bp, dave, jonathan.cameron,
	dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
	linmiaohe, shiju.jose, adam.c.preble, lukas,
	Smita.KoralahalliChannabasappa, rrichter, linux-cxl, linux-edac,
	linux-kernel, erwin.tsaur, sathyanarayanan.kuppuswamy,
	dan.j.williams, feiting.wanyan, yudong.wang, chao.p.peng,
	qingshun.wang, Kuppuswamy Sathyanarayanan

On Thu, Jun 20, 2024 at 10:58:56AM +0800, Zhenzhong Duan wrote:
> In some cases the detector of a Non-Fatal Error(NFE) is not the most
> appropriate agent to determine the type of the error. For example,
> when software performs a configuration read from a non-existent
> device or Function, completer will send an ERR_NONFATAL Message.
> On some platforms, ERR_NONFATAL results in a System Error, which
> breaks normal software probing.
> 
> Advisory Non-Fatal Error(ANFE) is a special case that can be used
> in above scenario. It is predominantly determined by the role of the
> detecting agent (Requester, Completer, or Receiver) and the specific
> error. In such cases, an agent with AER signals the NFE (if enabled)
> by sending an ERR_COR Message as an advisory to software, instead of
> sending ERR_NONFATAL.
> 
> When processing an ANFE, ideally both correctable error(CE) status and
> uncorrectable error(UE) status should be cleared. However, there is no
> way to fully identify the UE associated with ANFE. Even worse, Non-Fatal
> Error(NFE) may set the same UE status bit as ANFE. Treating an ANFE as
> NFE will reproduce above mentioned issue, i.e., breaking software probing;
> treating NFE as ANFE will make us ignore some UEs which need active
> recover operation. To avoid clearing UEs that are not ANFE by accident,
> the most conservative route is taken here: If any of the NFE Detected
> bits is set in Device Status, do not touch UE status, they should be
> cleared later by the UE handler. Otherwise, a specific set of UEs that
> may be raised as ANFE according to the PCIe specification will be cleared
> if their corresponding severity is Non-Fatal.
> 
> To achieve above purpose, cache UNCOR_STATUS bits that might be ANFE
> in aer_err_info.anfe_status and clean them in pci_aer_handle_error().
> aer_err_info.anfe_status will also be used to print ANFE related bits
> in following patch.
> 
> For instance, previously, when the kernel receives an ANFE with Poisoned
> TLP in OS native AER mode, only the status of CE will be reported and
> cleared:
> 
>   AER: Correctable error message received from 0000:b7:02.0
>   PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>     device [8086:0db0] error status/mask=00002000/00000000
>      [13] NonFatalErr
> 
> If the kernel receives a Malformed TLP after that, two UEs will be
> reported, which is unexpected. The Malformed TLP Header is lost since
> the previous ANFE gated the TLP header logs:
> 
>   PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>     device [8086:0db0] error status/mask=00041000/00180020
>      [12] TLP                    (First)
>      [18] MalfTLP
> 
> To handle this case properly, calculate the potential ANFE-related status
> bits and save them in aer_err_info. Use this information to determine the
> status bits that need to be cleared.
> 
> Now, for the previous scenario, both CE status and related UE status will
> be cleared after ANFE:
> 
>   AER: Correctable error message received from 0000:b7:02.0
>   PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>     device [8086:0db0] error status/mask=00002000/00000000
>      [13] NonFatalErr
> 
>   PCIe Bus Error: severity=Uncorrectable (Fatal), type=Transaction Layer, (Receiver ID)
>     device [8086:0db0] error status/mask=00040000/00180020
>      [18] MalfTLP                    (First)
> 
> Tested-by: Yudong Wang <yudong.wang@intel.com>
> Co-developed-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
> Signed-off-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
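
The "(First)" marker in the logs above follows the First Error Pointer in
the AER Capabilities and Control register. A minimal user-space sketch of
that comparison, assuming the FEP field is the low five bits as in
include/uapi/linux/pci_regs.h; is_first_error() is a hypothetical helper
name, not something taken from the patch:

```c
#include <stdbool.h>
#include <stdint.h>

/* First Error Pointer: low five bits of the AER Capabilities and Control
 * register (same definition as in include/uapi/linux/pci_regs.h). */
#define PCI_ERR_CAP_FEP(x)	((x) & 0x1f)

/*
 * Hypothetical helper: true if UE status bit 'bit' is the one the First
 * Error Pointer refers to, i.e. the error the header log belongs to.
 */
static bool is_first_error(uint32_t aer_cap_ctrl, unsigned int bit)
{
	return PCI_ERR_CAP_FEP(aer_cap_ctrl) == bit;
}
```

With a stale Poisoned TLP bit left in UE status, the pointer stays at 12,
so a later Malformed TLP (bit 18) is not reported as first, which matches
the 00041000 log quoted above.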

This no longer applies cleanly; would you mind rebasing it to pci/main
(v6.17-rc1)?  There have been recent AER changes; if they affect the
dmesg text, could you update that as well?

> +static void anfe_get_uc_status(struct pci_dev *dev, struct aer_err_info *info)
> +{
> +	u32 uncor_mask, uncor_status, anfe_status;
> +	u16 device_status;
> +	int aer = dev->aer_cap;
> +
> +	/*
> +	 * To avoid race between device status read and error status register read,
> +	 * cache uncorrectable error status before checking for NFE in device status
> +	 * register.

I can't tell for sure from the patch, but if this doesn't fit in 80
columns, can you rewrap it so it matches the rest of the file?

> +	 */
> +	pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_STATUS, &uncor_status);
> +	pci_read_config_dword(dev, aer + PCI_ERR_UNCOR_MASK, &uncor_mask);
> +	/*
> +	 * According to PCIe Base Specification Revision 6.1 Section 6.2.3.2.4,
> +	 * if an UNCOR error is raised as Advisory Non-Fatal error, it will
> +	 * match the following conditions:
> +	 *	a. The severity of the error is Non-Fatal.
> +	 *	b. The error is one of the following:
> +	 *		1. Poisoned TLP           (Section 6.2.3.2.4.3)
> +	 *		2. Completion Timeout     (Section 6.2.3.2.4.4)
> +	 *		3. Completer Abort        (Section 6.2.3.2.4.1)
> +	 *		4. Unexpected Completion  (Section 6.2.3.2.4.5)
> +	 *		5. Unsupported Request    (Section 6.2.3.2.4.1)
> +	 */

Could you update the citation to PCIe 7.0, please?

Bjorn
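
For readers following the review, the selection rule from the commit log
and the quoted comment can be sketched in plain C. This is a minimal
user-space sketch, not the patch itself. The bit values mirror
include/uapi/linux/pci_regs.h; ANFE_CANDIDATE_MASK and
anfe_candidate_status() are hypothetical names, and filtering by the UE
mask is an assumption because the rest of anfe_get_uc_status() is not
quoted in this reply:

```c
#include <stdint.h>

/* Bit values as in include/uapi/linux/pci_regs.h */
#define PCI_EXP_DEVSTA_NFED	0x0002		/* Non-Fatal Error Detected */
#define PCI_ERR_UNC_POISON_TLP	0x00001000	/* Poisoned TLP */
#define PCI_ERR_UNC_COMP_TIME	0x00004000	/* Completion Timeout */
#define PCI_ERR_UNC_COMP_ABORT	0x00008000	/* Completer Abort */
#define PCI_ERR_UNC_UNX_COMP	0x00010000	/* Unexpected Completion */
#define PCI_ERR_UNC_UNSUP	0x00100000	/* Unsupported Request */

/* Hypothetical name: the five UEs listed in the quoted comment */
#define ANFE_CANDIDATE_MASK	(PCI_ERR_UNC_POISON_TLP | \
				 PCI_ERR_UNC_COMP_TIME | \
				 PCI_ERR_UNC_COMP_ABORT | \
				 PCI_ERR_UNC_UNX_COMP | \
				 PCI_ERR_UNC_UNSUP)

/*
 * Hypothetical helper: return the UE status bits that may belong to an
 * ANFE.  All arguments are register values the caller has already read.
 */
static uint32_t anfe_candidate_status(uint32_t uncor_status,
				      uint32_t uncor_mask,
				      uint32_t uncor_sever,
				      uint16_t device_status)
{
	/* Conservative route: a real NFE may be pending, leave UE status alone */
	if (device_status & PCI_EXP_DEVSTA_NFED)
		return 0;

	/*
	 * Keep only candidate bits whose severity is Non-Fatal (a set bit in
	 * the severity register means Fatal).  Dropping masked bits is an
	 * assumption of this sketch.
	 */
	return uncor_status & ~uncor_mask & ~uncor_sever & ANFE_CANDIDATE_MASK;
}
```

With the status/mask values from the example log (00041000/00180020) and
a severity register that marks Malformed TLP as Fatal, only bit 12 would
be selected; the Malformed TLP bit is left for the UE handler.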


* Re: [PATCH v5 2/2] PCI/AER: Print UNCOR_STATUS bits that might be ANFE
  2024-06-20  2:58 ` [PATCH v5 2/2] PCI/AER: Print " Zhenzhong Duan
@ 2025-08-22 17:27   ` Bjorn Helgaas
  2025-08-29 21:18   ` Bjorn Helgaas
  1 sibling, 0 replies; 13+ messages in thread
From: Bjorn Helgaas @ 2025-08-22 17:27 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: linux-pci, bhelgaas, mahesh, oohall, linuxppc-dev, linux-acpi,
	rafael, lenb, james.morse, tony.luck, bp, dave, jonathan.cameron,
	dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
	linmiaohe, shiju.jose, adam.c.preble, lukas,
	Smita.KoralahalliChannabasappa, rrichter, linux-cxl, linux-edac,
	linux-kernel, erwin.tsaur, sathyanarayanan.kuppuswamy,
	dan.j.williams, feiting.wanyan, yudong.wang, chao.p.peng,
	qingshun.wang, Kuppuswamy Sathyanarayanan

On Thu, Jun 20, 2024 at 10:58:57AM +0800, Zhenzhong Duan wrote:
> When an Advisory Non-Fatal error(ANFE) triggers, both correctable error(CE)
> status and ANFE related uncorrectable error(UE) status will be printed:
> 
>   AER: Correctable error message received from 0000:b7:02.0
>   PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>     device [8086:0db0] error status/mask=00002000/00000000
>      [13] NonFatalErr
>     Uncorrectable errors that may cause Advisory Non-Fatal:
>      [12] TLP

I forgot to mention this on the other patch, but please add spaces
between the spelled-out terms and the "()" abbreviation, e.g.,
"Correctable Error (CE)".

Also, can you update this commit log to say what the patch does?  It's
OK if it repeats and/or expands on the subject.

> Tested-by: Yudong Wang <yudong.wang@intel.com>
> Co-developed-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
> Signed-off-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  drivers/pci/pcie/aer.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 3dcfa0191169..ba3a54092f2c 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -681,6 +681,7 @@ static void __aer_print_error(struct pci_dev *dev,
>  {
>  	const char **strings;
>  	unsigned long status = info->status & ~info->mask;
> +	unsigned long anfe_status = info->anfe_status;
>  	const char *level, *errmsg;
>  	int i;
>  
> @@ -701,6 +702,20 @@ static void __aer_print_error(struct pci_dev *dev,
>  				info->first_error == i ? " (First)" : "");
>  	}
>  	pci_dev_aer_stats_incr(dev, info);
> +
> +	if (!anfe_status)
> +		return;
> +
> +	strings = aer_uncorrectable_error_string;
> +	pci_printk(level, dev, "Uncorrectable errors that may cause Advisory Non-Fatal:\n");

Will have to look at the spec more, but I don't think "may cause" is
quite the right wording here.  It's not that an Uncorrectable Error
causes a separate Advisory Non-Fatal Error; IIUC there's only a single
error and it's just *treated* and signaled differently.

> +
> +	for_each_set_bit(i, &anfe_status, 32) {
> +		errmsg = strings[i];
> +		if (!errmsg)
> +			errmsg = "Unknown Error Bit";
> +
> +		pci_printk(level, dev, "   [%2d] %s\n", i, errmsg);

I think we might have removed pci_printk() recently, so this might
need adjustment.

> +	}
>  }
>  
>  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> -- 
> 2.34.1
> 
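
The loop added by this hunk can be tried outside the kernel. Below is a
minimal user-space sketch, assuming snprintf() in place of the kernel
print helper. format_anfe_bits() and the two-entry name table are
hypothetical; the table only carries the names that appear in the logs of
this thread, so every other bit falls back to "Unknown Error Bit":

```c
#include <stdint.h>
#include <stdio.h>

/* Deliberately tiny table: only the names seen in the logs of this thread */
static const char *const uncor_names[32] = {
	[12] = "TLP",
	[18] = "MalfTLP",
};

/*
 * Hypothetical helper: write one "[bit] name" line per set bit into buf
 * and return the number of set bits.
 */
static int format_anfe_bits(uint32_t anfe_status, char *buf, size_t len)
{
	size_t off = 0;
	int count = 0;

	if (len)
		buf[0] = '\0';

	for (unsigned int i = 0; i < 32; i++) {
		const char *name;

		if (!(anfe_status & (UINT32_C(1) << i)))
			continue;

		name = uncor_names[i] ? uncor_names[i] : "Unknown Error Bit";
		if (off < len) {
			int n = snprintf(buf + off, len - off,
					 "   [%2u] %s\n", i, name);
			if (n > 0)
				off += (size_t)n;
		}
		count++;
	}
	return count;
}
```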


* [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error
  2025-08-22 16:51       ` Bjorn Helgaas
@ 2025-08-22 18:15         ` Matthew W Carlis
  2025-08-27  9:42           ` Duan, Zhenzhong
  2025-08-28  1:00         ` Matthew W Carlis
  1 sibling, 1 reply; 13+ messages in thread
From: Matthew W Carlis @ 2025-08-22 18:15 UTC (permalink / raw)
  To: helgaas
  Cc: Smita.KoralahalliChannabasappa, adam.c.preble, agovindjee,
	alison.schofield, ashishk, bamstadt, bhelgaas, bp, chao.p.peng,
	dan.j.williams, dave.jiang, dave, erwin.tsaur, feiting.wanyan,
	ira.weiny, james.morse, jrangi, lenb, linux-acpi, linux-cxl,
	linux-edac, linux-kernel, linux-pci, linuxppc-dev, lukas, mahesh,
	mattc, msaggi, oohall, qingshun.wang, rafael, rhan, rrichter,
	sathyanarayanan.kuppuswamy, sconnor, tony.luck, vishal.l.verma,
	yudong.wang, zhenzhong.duan

On Fri, 22 Aug 2025 11:51:12 -0500, Bjorn Helgaas wrote 
> I'm terribly sorry, this is my fault.  It just fell off my list for no
> good reason.  Matthew, if you are able to test and/or provide a
> Reviewed-by, that would be the best thing you can do to move this
> forward (although neither is actually necessary).

It seems that for PCI there is always a massive list of things in flight.
Difficult for any mortal to keep up with. We pulled the patch into our
kernel & have started testing it. I'll sync-up with my team internally to
see exactly what the plan is & how long we think it will take.

Cheers!
-Matt


* RE: [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error
  2025-08-22 18:15         ` Matthew W Carlis
@ 2025-08-27  9:42           ` Duan, Zhenzhong
  0 siblings, 0 replies; 13+ messages in thread
From: Duan, Zhenzhong @ 2025-08-27  9:42 UTC (permalink / raw)
  To: Carlis, Matthew, helgaas@kernel.org
  Cc: Smita.KoralahalliChannabasappa@amd.com, Preble, Adam C,
	Govindjee, Arjun, Schofield, Alison, Karkare, Ashish,
	Amstadt, Bob, bhelgaas@google.com, bp@alien8.de, Peng, Chao P,
	Williams, Dan J, Jiang, Dave, dave@stgolabs.net,
	erwin.tsaur@intel.com, Wanyan, Feiting, Weiny, Ira,
	james.morse@arm.com, Rangi, Jasjeet, lenb@kernel.org,
	linux-acpi@vger.kernel.org, linux-cxl@vger.kernel.org,
	linux-edac@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	lukas@wunner.de, mahesh@linux.ibm.com, Carlis, Matthew,
	Saggi, Meeta, oohall@gmail.com, qingshun.wang@linux.intel.com,
	rafael@kernel.org, rhan@purestorage.com, rrichter@amd.com,
	Kuppuswamy, Sathyanarayanan, sconnor@purestorage.com, Luck, Tony,
	Verma, Vishal L, Wang, Yudong



>-----Original Message-----
>From: Matthew W Carlis <mattc@purestorage.com>
>Subject: [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error
>
>On Fri, 22 Aug 2025 11:51:12 -0500, Bjorn Helgaas wrote
>> I'm terribly sorry, this is my fault.  It just fell off my list for no
>> good reason.  Matthew, if you are able to test and/or provide a
>> Reviewed-by, that would be the best thing you can do to move this
>> forward (although neither is actually necessary).
>
>It seems that for PCI there is always a massive list of things in flight.
>Difficult for any mortal to keep up with.

Fully agree, never mind, Bjorn.

BRs,
Zhenzhong


* [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error
  2025-08-22 16:51       ` Bjorn Helgaas
  2025-08-22 18:15         ` Matthew W Carlis
@ 2025-08-28  1:00         ` Matthew W Carlis
  1 sibling, 0 replies; 13+ messages in thread
From: Matthew W Carlis @ 2025-08-28  1:00 UTC (permalink / raw)
  To: helgaas
  Cc: Smita.KoralahalliChannabasappa, adam.c.preble, agovindjee,
	alison.schofield, ashishk, bamstadt, bhelgaas, bp, chao.p.peng,
	dan.j.williams, dave.jiang, dave, erwin.tsaur, feiting.wanyan,
	ira.weiny, james.morse, jrangi, lenb, linux-acpi, linux-cxl,
	linux-edac, linux-kernel, linux-pci, linuxppc-dev, lukas, mahesh,
	mattc, msaggi, oohall, qingshun.wang, rafael, rhan, rrichter,
	sathyanarayanan.kuppuswamy, sconnor, tony.luck, vishal.l.verma,
	yudong.wang, zhenzhong.duan

On Fri, 22 Aug 2025 11:51:12 -0500, Bjorn Helgaas wrote 
> Matthew, if you are able to test and/or provide a Reviewed-by, that would
> be the best thing you can do to move this forward ...

I spent some time looking at the patch and thinking about it a little
more carefully. The only thing I don't really like in this revision
of the patch is the logging for "may cause Advisory". Example below
from "[PATCH v5 2/2] PCI/AER: Print UNCOR_STATUS bits that might be ANFE".

AER: Correctable error message received from 0000:b7:02.0
PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
  device [8086:0db0] error status/mask=00002000/00000000
   [13] NonFatalErr
  Uncorrectable errors that may cause Advisory Non-Fatal:
   [12] TLP

I don't think we really need to log the UE caused by ANF any differently
than any other UE, & in fact I would prefer not to. I think we should log all
the UE status bits in the same format as before. Taking the example above,
it would be nice if the logging looked like this.

AER: Correctable error message received from 0000:b7:02.0
PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
  device [8086:0db0] error status/mask=00002000/00000000
   [13] NonFatalErr
PCIe Bus Error: severity=Uncorrectable (Non-Fatal), type=Transaction Layer
   [12] TLP

If there was only one error (that triggered ANF handling) then we would
know that the Non-Fatal UE was what triggered the NonFatalErr. If some other
Non-Fatal errors are happening at the same time then it doesn't really matter
which was sent via ERR_COR vs ERR_NONFATAL since we would also know from Root
Error Status that we had received at least one of each message type. The
objective, in my mind, is to free up header logs & log status details without
making the error recovery worse.
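
The Root Error Status check mentioned above can be sketched as follows. A
minimal user-space sketch, assuming the bit values from
include/uapi/linux/pci_regs.h; saw_cor_and_nonfatal() is a hypothetical
helper name:

```c
#include <stdbool.h>
#include <stdint.h>

/* Root Error Status bits, as in include/uapi/linux/pci_regs.h */
#define PCI_ERR_ROOT_COR_RCV		0x00000001 /* ERR_COR Received */
#define PCI_ERR_ROOT_UNCOR_RCV		0x00000004 /* ERR_FATAL/NONFATAL Received */
#define PCI_ERR_ROOT_NONFATAL_RCV	0x00000020 /* Non-Fatal Error Messages Received */

/*
 * Hypothetical helper: true if the Root Port has recorded at least one
 * ERR_COR and at least one ERR_NONFATAL message.
 */
static bool saw_cor_and_nonfatal(uint32_t root_status)
{
	return (root_status & PCI_ERR_ROOT_COR_RCV) &&
	       (root_status & PCI_ERR_ROOT_NONFATAL_RCV);
}
```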

Does this sound reasonable or unreasonable? I can update the patch-set &
re-submit if 'reasonable'.

Cheers!
-Matt


* Re: [PATCH v5 2/2] PCI/AER: Print UNCOR_STATUS bits that might be ANFE
  2024-06-20  2:58 ` [PATCH v5 2/2] PCI/AER: Print " Zhenzhong Duan
  2025-08-22 17:27   ` Bjorn Helgaas
@ 2025-08-29 21:18   ` Bjorn Helgaas
  1 sibling, 0 replies; 13+ messages in thread
From: Bjorn Helgaas @ 2025-08-29 21:18 UTC (permalink / raw)
  To: Zhenzhong Duan
  Cc: linux-pci, bhelgaas, mahesh, oohall, linuxppc-dev, linux-acpi,
	rafael, lenb, james.morse, tony.luck, bp, dave, jonathan.cameron,
	dave.jiang, alison.schofield, vishal.l.verma, ira.weiny,
	linmiaohe, shiju.jose, adam.c.preble, lukas,
	Smita.KoralahalliChannabasappa, rrichter, linux-cxl, linux-edac,
	linux-kernel, erwin.tsaur, sathyanarayanan.kuppuswamy,
	dan.j.williams, feiting.wanyan, yudong.wang, chao.p.peng,
	qingshun.wang, Kuppuswamy Sathyanarayanan, Matthew W Carlis

[+cc Matt]

On Thu, Jun 20, 2024 at 10:58:57AM +0800, Zhenzhong Duan wrote:
> When an Advisory Non-Fatal error(ANFE) triggers, both correctable error(CE)
> status and ANFE related uncorrectable error(UE) status will be printed:
> 
>   AER: Correctable error message received from 0000:b7:02.0
>   PCIe Bus Error: severity=Correctable, type=Transaction Layer, (Receiver ID)
>     device [8086:0db0] error status/mask=00002000/00000000
>      [13] NonFatalErr
>     Uncorrectable errors that may cause Advisory Non-Fatal:
>      [12] TLP
> 
> Tested-by: Yudong Wang <yudong.wang@intel.com>
> Co-developed-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
> Signed-off-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com>
> ---
>  drivers/pci/pcie/aer.c | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 3dcfa0191169..ba3a54092f2c 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -681,6 +681,7 @@ static void __aer_print_error(struct pci_dev *dev,
>  {
>  	const char **strings;
>  	unsigned long status = info->status & ~info->mask;
> +	unsigned long anfe_status = info->anfe_status;
>  	const char *level, *errmsg;
>  	int i;
>  
> @@ -701,6 +702,20 @@ static void __aer_print_error(struct pci_dev *dev,
>  				info->first_error == i ? " (First)" : "");
>  	}
>  	pci_dev_aer_stats_incr(dev, info);
> +
> +	if (!anfe_status)
> +		return;

__aer_print_error() is used by both native AER handling, where Linux
fields the AER interrupt and reads the AER status registers directly,
and APEI GHES firmware-first error handling, where platform firmware
fields the AER interrupt, reads the AER status registers, and packages
them up to hand off to Linux via aer_recover_queue().

But the previous patch only sets info->anfe_status for the native
path, so the APEI GHES path doesn't get the benefit of this change.

I think both paths should log the same ANFE information.

> +
> +	strings = aer_uncorrectable_error_string;
> +	pci_printk(level, dev, "Uncorrectable errors that may cause Advisory Non-Fatal:\n");
> +
> +	for_each_set_bit(i, &anfe_status, 32) {
> +		errmsg = strings[i];
> +		if (!errmsg)
> +			errmsg = "Unknown Error Bit";
> +
> +		pci_printk(level, dev, "   [%2d] %s\n", i, errmsg);
> +	}
>  }
>  
>  void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
> -- 
> 2.34.1
> 


end of thread, other threads:[~2025-08-29 21:18 UTC | newest]

Thread overview: 13+ messages
2024-06-20  2:58 [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error Zhenzhong Duan
2024-06-20  2:58 ` [PATCH v5 1/2] PCI/AER: Clear UNCOR_STATUS bits that might be ANFE Zhenzhong Duan
2025-08-22 17:20   ` Bjorn Helgaas
2024-06-20  2:58 ` [PATCH v5 2/2] PCI/AER: Print " Zhenzhong Duan
2025-08-22 17:27   ` Bjorn Helgaas
2025-08-29 21:18   ` Bjorn Helgaas
2024-07-12  9:56 ` [PATCH v5 0/2] PCI/AER: Handle Advisory Non-Fatal error Duan, Zhenzhong
2025-08-21 16:58   ` Matthew W Carlis
2025-08-22  1:45     ` Duan, Zhenzhong
2025-08-22 16:51       ` Bjorn Helgaas
2025-08-22 18:15         ` Matthew W Carlis
2025-08-27  9:42           ` Duan, Zhenzhong
2025-08-28  1:00         ` Matthew W Carlis
