From: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
To: linux-pci@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	linux-acpi@vger.kernel.org
Cc: Miaohe Lin <linmiaohe@huawei.com>,
	Alison Schofield <alison.schofield@intel.com>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	erwin.tsaur@intel.com,
	Kuppuswamy Sathyanarayanan
	<sathyanarayanan.kuppuswamy@linux.intel.com>,
	linux-cxl@vger.kernel.org, linux-kernel@vger.kernel.org,
	Oliver O'Halloran <oohall@gmail.com>,
	chao.p.peng@linux.intel.com, Ira Weiny <ira.weiny@intel.com>,
	Davidlohr Bueso <dave@stgolabs.net>,
	Dave Jiang <dave.jiang@intel.com>,
	Vishal Verma <vishal.l.verma@intel.com>,
	Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>,
	Bjorn Helgaas <helgaas@kernel.org>, Len Brown <lenb@kernel.org>,
	Robert Richter <rrichter@amd.com>, Borislav Petkov <bp@alien8.de>,
	"Wang, Qingshun" <qingshun.wang@linux.intel.com>,
	Jonathan Cameron <jonathan.cameron@huawei.com>,
	Bjorn Helgaas <bhelgaas@google.com>,
	Dan Williams <dan.j.williams@intel.com>,
	linux-edac@vger.kernel.org, Tony Luck <tony.luck@intel.com>,
	feiting.wanyan@intel.com, Adam Preble <adam.c.preble@intel.com>,
	Mahesh J Salgaonkar <mahesh@linux.ibm.com>,
	Li Yang <leoyang.li@nxp.com>, Lukas Wunner <lukas@wunner.de>,
	James Morse <james.morse@arm.com>,
	qingshun.wang@intel.com, Shiju Jose <shiju.jose@huawei.com>
Subject: [PATCH v2 0/4] PCI/AER: Handle Advisory Non-Fatal properly
Date: Thu, 25 Jan 2024 14:27:58 +0800
Message-ID: <20240125062802.50819-1-qingshun.wang@linux.intel.com>

According to PCIe Base Specification Revision 6.1, Sections 6.2.3.4
and 6.2.4.3, certain uncorrectable errors signal ERR_COR instead of
ERR_NONFATAL, are logged as Advisory Non-Fatal Errors (ANFE), and set
bits in both the Correctable Error (CE) Status register and the
Uncorrectable Error (UE) Status register. Currently, when handling AER
events, the kernel looks at either the CE status or the UE status, but
never both. In the ANFE case, bits set in the UE status register are
therefore neither reported nor cleared until the next Fatal/Non-Fatal
error arrives.

For instance, when the kernel currently receives an ANFE with Poisoned
TLP in OS native AER mode, only the CE status is reported and cleared:

  AER: Corrected error received: 0000:b7:02.0
  PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00002000/00000000
     [13] NonFatalErr           

If the kernel receives a Malformed TLP after that, two UEs are
reported, which is unexpected. The Malformed TLP header is also lost,
because the previous ANFE gated the TLP header logs:

  PCIe Bus Error: severity=Uncorrected (Fatal), type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00041000/00180020
     [12] TLP                    (First)
     [18] MalfTLP       
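
For readers decoding these logs: the status words are plain bitmasks, so
00002000 has only bit 13 set and 00041000 has bits 12 and 18 set. A minimal
standalone sketch of that decoding follows. The helper name aer_set_bits()
is hypothetical and is not part of this series or of the kernel:

```c
#include <stdint.h>

/*
 * Hypothetical helper (not kernel code): store the positions of all set
 * bits of a 32-bit AER status word in 'bits', lowest first, and return
 * how many were found.
 */
static int aer_set_bits(uint32_t status, int bits[32])
{
	int n = 0;

	for (int i = 0; i < 32; i++)
		if (status & (UINT32_C(1) << i))
			bits[n++] = i;
	return n;
}
```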

To handle this case properly, add fields to aer_err_info to track both
CE and UE status/mask, UE severity, and Device Status. Use this
information to determine the status bits that need to be cleared.
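
One possible way to derive the UE bits that belong to an ANFE from that
information is sketched below. This is an illustration under assumptions,
not the code in this series: the function name is hypothetical, the
constant simply mirrors the bit 13 value seen in the log above, and the
selection rule (unmasked UE bits whose severity is Non-Fatal, considered
only when the Advisory Non-Fatal bit is set and unmasked in CE status) is
an assumed simplification of the behavior described above.

```c
#include <stdint.h>

/* CE status bit 13, "NonFatalErr" in the log above (Advisory Non-Fatal). */
#define COR_ADV_NFAT UINT32_C(0x00002000)

/*
 * Hypothetical sketch: return the UE status bits that may have caused an
 * ANFE and therefore should be reported and cleared together with it.
 *
 * uncor_severity follows the AER convention: 1 = Fatal, 0 = Non-Fatal.
 * Assumption: only unmasked, Non-Fatal UE bits can be advisory.
 */
static uint32_t anfe_related_uncor_bits(uint32_t cor_status,
					uint32_t cor_mask,
					uint32_t uncor_status,
					uint32_t uncor_mask,
					uint32_t uncor_severity)
{
	/* No unmasked Advisory Non-Fatal bit: nothing to attribute. */
	if (!(cor_status & ~cor_mask & COR_ADV_NFAT))
		return 0;

	return uncor_status & ~uncor_mask & ~uncor_severity;
}
```

For example, with CE status 00002000, UE status 00001000, UE mask 00180020,
and a severity register that marks bit 12 as Non-Fatal (an assumption, since
the severity value is not shown in the logs), the function returns 00001000.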

Now, in the previous scenario, both the CE status and the related UE
status are reported and cleared after an ANFE:

  AER: Corrected error received: 0000:b7:02.0
  PCIe Bus Error: severity=Corrected, type=Transaction Layer, (Receiver ID)
    device [8086:0db0] error status/mask=00002000/00000000
     [13] NonFatalErr           
    Uncorrectable errors that may cause Advisory Non-Fatal:
     [12] TLP

Additionally, add more data to the aer_event tracepoint, which helps
external observers better understand ANFE and other errors.

Note:
checkpatch.pl will produce the following errors and warnings on PATCH 4:

  ERROR: space prohibited after that open parenthesis '('
  #103: FILE: include/ras/ras_event.h:319:
  +		__field(	u16,		link_status	)

  ...similar errors omitted...

  WARNING: quoted string split across lines
  #126: FILE: include/ras/ras_event.h:342:
  +	TP_printk("%s PCIe Bus Error: severity=%s, %s, TLP Header=%s, "
  +		 "Correctable Error Status=0x%08x, "

  ...similar warnings omitted...

For readability reasons, these errors and warnings are not fixed,
following the code style of existing examples in the kernel source tree.

Change log:
v2:
  - Refer to the latest PCIe Specification in both commit messages
    and comments, as suggested by Bjorn Helgaas.
  - Describe the reason for storing additional information in
    aer_err_info in the commit message of PATCH 1, as suggested by Bjorn
    Helgaas.
  - Add more details of behavior changes in the commit message of PATCH
    2, as suggested by Bjorn Helgaas.

v1: https://lore.kernel.org/linux-pci/20240111073227.31488-1-qingshun.wang@linux.intel.com/

Wang, Qingshun (4):
  PCI/AER: Store more information in aer_err_info
  PCI/AER: Handle Advisory Non-Fatal properly
  PCI/AER: Fetch information for FTrace
  RAS: Trace more information in aer_event

 drivers/acpi/apei/ghes.c      |  16 ++-
 drivers/cxl/core/pci.c        |  15 ++-
 drivers/pci/pci.h             |  12 ++-
 drivers/pci/pcie/aer.c        | 191 +++++++++++++++++++++++++++-------
 include/linux/aer.h           |   6 +-
 include/linux/pci.h           |  27 +++++
 include/ras/ras_event.h       |  48 ++++++++-
 include/uapi/linux/pci_regs.h |   1 +
 8 files changed, 269 insertions(+), 47 deletions(-)


base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
-- 
2.42.0


