From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: Zhenzhong Duan <zhenzhong.duan@intel.com>
Cc: <linux-pci@vger.kernel.org>, <linuxppc-dev@lists.ozlabs.org>,
<linux-acpi@vger.kernel.org>, <rafael@kernel.org>,
<lenb@kernel.org>, <james.morse@arm.com>, <tony.luck@intel.com>,
<bp@alien8.de>, <dave@stgolabs.net>, <dave.jiang@intel.com>,
<alison.schofield@intel.com>, <vishal.l.verma@intel.com>,
<ira.weiny@intel.com>, <bhelgaas@google.com>,
<helgaas@kernel.org>, <mahesh@linux.ibm.com>, <oohall@gmail.com>,
<linmiaohe@huawei.com>, <shiju.jose@huawei.com>,
<adam.c.preble@intel.com>, <lukas@wunner.de>,
<Smita.KoralahalliChannabasappa@amd.com>, <rrichter@amd.com>,
<linux-cxl@vger.kernel.org>, <linux-edac@vger.kernel.org>,
<linux-kernel@vger.kernel.org>, <erwin.tsaur@intel.com>,
<sathyanarayanan.kuppuswamy@intel.com>,
<dan.j.williams@intel.com>, <feiting.wanyan@intel.com>,
<yudong.wang@intel.com>, <chao.p.peng@intel.com>,
<qingshun.wang@linux.intel.com>
Subject: Re: [PATCH v4 1/3] PCI/AER: Store UNCOR_STATUS bits that might be ANFE in aer_err_info
Date: Thu, 6 Jun 2024 16:06:47 +0100
Message-ID: <20240606160647.0000644e@Huawei.com>
In-Reply-To: <20240509084833.2147767-2-zhenzhong.duan@intel.com>
On Thu, 9 May 2024 16:48:31 +0800
Zhenzhong Duan <zhenzhong.duan@intel.com> wrote:
> In some cases the detector of a Non-Fatal Error(NFE) is not the most
> appropriate agent to determine the type of the error. For example,
> when software performs a configuration read from a non-existent
> device or Function, the completer will send an ERR_NONFATAL Message.
> On some platforms, ERR_NONFATAL results in a System Error, which
> breaks normal software probing.
>
> Advisory Non-Fatal Error(ANFE) is a special case that can be used
> in the above scenario. It is predominantly determined by the role of the
> detecting agent (Requester, Completer, or Receiver) and the specific
> error. In such cases, an agent with AER signals the NFE (if enabled)
> by sending an ERR_COR Message as an advisory to software, instead of
> sending ERR_NONFATAL.
>
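
For illustration only, a minimal sketch of how a handler might recognise the advisory case described above: the agent reports through ERR_COR, and the Advisory Non-Fatal Error Status bit (bit 13 of the AER Correctable Error Status register) is set. The macro and function names are made up for this sketch and are not taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Advisory Non-Fatal Error Status: bit 13 of AER Correctable Error Status. */
#define SKETCH_COR_ADV_NFAT (UINT32_C(1) << 13)

/* Hypothetical helper: true when a correctable-error report carries an ANFE. */
static bool sketch_cor_is_anfe(uint32_t cor_status)
{
	return (cor_status & SKETCH_COR_ADV_NFAT) != 0;
}
```

A status value of 0x00002000 would be classified as ANFE here, while a value with only bit 0 (Receiver Error) set would not.
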
> When processing an ANFE, ideally both correctable error(CE) status and
> uncorrectable error(UE) status should be cleared. However, there is no
> way to fully identify the UE associated with ANFE. Even worse, Non-Fatal
> Error(NFE) may set the same UE status bit as ANFE. Treating an ANFE as
> NFE will reproduce the above-mentioned issue, i.e., breaking software probing;
> treating an NFE as an ANFE will make us ignore some UEs which need active
> recovery. To avoid clearing UEs that are not ANFE by accident,
> the most conservative route is taken here: If any of the NFE Detected
> bits is set in Device Status, do not touch UE status; they should be
> cleared later by the UE handler. Otherwise, a specific set of UEs that
> may be raised as ANFE according to the PCIe specification will be cleared
> if their corresponding severity is Non-Fatal.
>
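
The conservative rule in the paragraph above can be sketched roughly as follows. This is a standalone illustration, not the code from the patch: the `SKETCH_` names are hypothetical, the bit values are written out as I understand the Device Status and AER Uncorrectable Error Status layouts, and the set of UEs in the mask is an assumption about which errors may be raised as ANFE, so it should be checked against the PCIe specification and the patch itself.

```c
#include <stdint.h>

/* Device Status: Non-Fatal Error Detected. */
#define SKETCH_DEVSTA_NFED     0x0002u

/* AER Uncorrectable Error Status bits assumed to be possible ANFE sources. */
#define SKETCH_UNC_POISON_TLP  0x00001000u /* Poisoned TLP Received */
#define SKETCH_UNC_COMP_TIME   0x00004000u /* Completion Timeout */
#define SKETCH_UNC_COMP_ABORT  0x00008000u /* Completer Abort */
#define SKETCH_UNC_UNX_COMP    0x00010000u /* Unexpected Completion */
#define SKETCH_UNC_UNSUP       0x00100000u /* Unsupported Request */

#define SKETCH_ANFE_UNC_MASK (SKETCH_UNC_POISON_TLP | SKETCH_UNC_COMP_TIME | \
			      SKETCH_UNC_COMP_ABORT | SKETCH_UNC_UNX_COMP | \
			      SKETCH_UNC_UNSUP)

/*
 * Return the UE status bits that may be treated as ANFE.
 * A severity bit of 1 means Fatal, 0 means Non-Fatal.
 */
static uint32_t sketch_anfe_status(uint16_t devsta, uint32_t uncor_status,
				   uint32_t uncor_severity)
{
	/* A real NFE may be pending: leave every UE bit to the UE handler. */
	if (devsta & SKETCH_DEVSTA_NFED)
		return 0;

	/* Keep only Non-Fatal bits that belong to the assumed ANFE set. */
	return uncor_status & ~uncor_severity & SKETCH_ANFE_UNC_MASK;
}
```

With Non-Fatal Error Detected clear, a Completion Timeout configured as Non-Fatal is returned as an ANFE candidate, while the same bit configured as Fatal, or any bit outside the mask, is left alone.
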
> To achieve the above purpose, store UNCOR_STATUS bits that might be ANFE
> in aer_err_info.anfe_status so that those bits can be printed and
> processed later.
>
> Tested-by: Yudong Wang <yudong.wang@intel.com>
> Co-developed-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
> Signed-off-by: "Wang, Qingshun" <qingshun.wang@linux.intel.com>
> Signed-off-by: Zhenzhong Duan <zhenzhong.duan@intel.com>
Not my most confident review ever as this is nasty and gives
me a headache but your description is good and I think the
implementation looks reasonable.
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>