Questions: Should kernel panic when PCIe fatal error occurs?


From: Shuai Xue <xueshuai@linux.alibaba.com>
To: "lenb@kernel.org" <lenb@kernel.org>,
	"james.morse@arm.com" <james.morse@arm.com>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	"bp@alien8.de" <bp@alien8.de>,
	mahesh@linux.ibm.com, bhelgaas@google.com,
	Jonathan Cameron <Jonathan.Cameron@Huawei.com>,
	gregkh@linuxfoundation.org
Cc: "linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linuxppc-dev@lists.ozlabs.org" <linuxppc-dev@lists.ozlabs.org>,
	Linux PCI <linux-pci@vger.kernel.org>,
	Shuai Xue <xueshuai@linux.alibaba.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>
Subject: Questions: Should kernel panic when PCIe fatal error occurs?
Date: Mon, 18 Sep 2023 17:39:58 +0800
Message-ID: <e486db16-d36d-9e14-4f10-dc755c0ef97d@linux.alibaba.com>

Hi all,

Error reporting and recovery are among the important features of PCIe, and
the kernel has supported them since version 2.6, 17 years ago. I am curious
about the expected behavior of the software. I will first recap the error
classification and then list my questions below it.

## Recap: Error classification

- Fatal Errors

Fatal errors are uncorrectable error conditions which render the particular
Link and related hardware unreliable. For Fatal errors, a reset of the
components on the Link may be required to return to reliable operation.
Platform handling of Fatal errors, and any efforts to limit the effects of
these errors, is platform implementation specific. (PCIe 6.0.1, sec
6.2.2.2.1 Fatal Errors).

- Non-Fatal Errors

Non-fatal errors are uncorrectable errors which cause a particular
transaction to be unreliable but the Link is otherwise fully functional.
Isolating Non-fatal from Fatal errors provides Requester/Receiver logic in
a device or system management software the opportunity to recover from the
error without resetting the components on the Link and disturbing other
transactions in progress. Devices not associated with the transaction in
error are not impacted by the error. (PCIe 6.0.1, sec 6.2.2.2.2 Non-Fatal
Errors).

## What does the kernel do?

The Linux kernel supports both the OS native and firmware first modes in
AER and DPC drivers. The error recovery API is defined in `struct
pci_error_handlers`, and the recovery process is performed in several
stages in pcie_do_recovery(). The main difference between handling fatal
and non-fatal errors is that the kernel resets the link only when a fatal
error is detected.
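
To make the staged flow concrete, here is a minimal user-space sketch of
how I read it. It is a model, not the kernel code: the names
`do_recovery`, `error_handlers`, `recovery_trace` and the `ERS_*` /
`CHANNEL_*` constants are hypothetical stand-ins for pcie_do_recovery(),
`struct pci_error_handlers` and the kernel's result/state types, and it
handles a single device instead of walking a bus.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's channel state and result types. */
enum channel_state { CHANNEL_NORMAL, CHANNEL_FROZEN }; /* non-fatal, fatal */
enum ers_result { ERS_CAN_RECOVER, ERS_NEED_RESET, ERS_DISCONNECT, ERS_RECOVERED };

/* Reduced model of struct pci_error_handlers: one device, no arguments. */
struct error_handlers {
	enum ers_result (*error_detected)(enum channel_state state);
	enum ers_result (*mmio_enabled)(void);
	enum ers_result (*slot_reset)(void);
	void (*resume)(void);
};

/* Records what the model did, so the two paths can be compared. */
struct recovery_trace {
	bool link_reset;
	bool resumed;
};

static enum ers_result do_recovery(const struct error_handlers *h,
				   enum channel_state state,
				   bool link_reset_ok,
				   struct recovery_trace *trace)
{
	enum ers_result status;

	assert(h != NULL && h->error_detected != NULL && trace != NULL);
	trace->link_reset = false;
	trace->resumed = false;

	/* Stage 1: tell the driver an error was detected. */
	status = h->error_detected(state);

	/* Only the fatal (frozen) path resets the link. */
	if (state == CHANNEL_FROZEN) {
		trace->link_reset = true;
		if (!link_reset_ok)
			return ERS_DISCONNECT;
	}

	/* Stage 2: optional MMIO-enabled callback. */
	if (status == ERS_CAN_RECOVER) {
		status = ERS_RECOVERED;
		if (h->mmio_enabled)
			status = h->mmio_enabled();
	}

	/* Stage 3: optional slot-reset callback. */
	if (status == ERS_NEED_RESET) {
		status = ERS_RECOVERED;
		if (h->slot_reset)
			status = h->slot_reset();
	}

	if (status != ERS_RECOVERED)
		return ERS_DISCONNECT; /* recovery failed, no panic in this model */

	/* Stage 4: resume normal operation. */
	if (h->resume)
		h->resume();
	trace->resumed = true;
	return ERS_RECOVERED;
}

/* Example driver: asks for a reset when frozen, otherwise can recover. */
static enum ers_result demo_error_detected(enum channel_state state)
{
	return state == CHANNEL_FROZEN ? ERS_NEED_RESET : ERS_CAN_RECOVER;
}

static const struct error_handlers demo_handlers = {
	.error_detected = demo_error_detected,
};
```

Calling `do_recovery(&demo_handlers, CHANNEL_NORMAL, true, &trace)` leaves
`trace.link_reset` false, while the same call with `CHANNEL_FROZEN` sets
it. A failed link reset only returns `ERS_DISCONNECT` here, which is the
situation question 2 below asks about.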

## Questions

1. Should the kernel panic when fatal errors occur without AER recovery?

IMHO, the answer is NO. The AER driver handles both fatal and non-fatal
errors, and I have not found any panic() calls in the recovery path in OS
native mode.

As far as I know, on many x86 platforms,
`struct acpi_hest_generic_status::error_severity` is set to CPER_SEV_FATAL
in firmware first mode. As a result, the kernel panics immediately in
ghes_proc() when a fatal AER error occurs, and the AER driver gets no
chance to handle the error and perform recovery.

For both fatal and non-fatal errors,
`struct acpi_hest_generic_status::error_severity` should be set to
CPER_SEV_RECOVERABLE, and `struct acpi_hest_generic_data::error_severity`
should reflect the real severity. The kernel would then handle PCIe errors
in firmware first mode the same way it does in OS native mode.
Please correct me if I am wrong.
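
The two severity levels can be sketched as below. This is a simplified
user-space model of my understanding, not the GHES code: the function
`ghes_model_dispatch` and the `ACT_*` actions are hypothetical, the
`CPER_SEV_*` values are redefined locally with the same numbering as the
kernel's cper.h, and treating corrected or informational sections as
log-only is an assumption made to keep the model small.

```c
#include <assert.h>

/* Redefined locally; same numbering as the CPER severities in cper.h. */
enum cper_sev {
	CPER_SEV_RECOVERABLE,
	CPER_SEV_FATAL,
	CPER_SEV_CORRECTED,
	CPER_SEV_INFORMATIONAL,
};

/* Hypothetical outcomes of processing one PCIe error section. */
enum ghes_action {
	ACT_PANIC,                /* kernel panics, AER driver never runs */
	ACT_AER_RECOVER_FATAL,    /* queued to AER recovery as fatal */
	ACT_AER_RECOVER_NONFATAL, /* queued to AER recovery as non-fatal */
	ACT_LOG_ONLY,             /* assumption: nothing to recover */
};

/*
 * block_sev models acpi_hest_generic_status::error_severity,
 * section_sev models acpi_hest_generic_data::error_severity.
 */
static enum ghes_action ghes_model_dispatch(enum cper_sev block_sev,
					    enum cper_sev section_sev)
{
	assert(block_sev <= CPER_SEV_INFORMATIONAL);
	assert(section_sev <= CPER_SEV_INFORMATIONAL);

	/* A fatal block severity ends everything before sections are handled. */
	if (block_sev == CPER_SEV_FATAL)
		return ACT_PANIC;

	/* Otherwise the per-section severity selects the AER path. */
	switch (section_sev) {
	case CPER_SEV_FATAL:
		return ACT_AER_RECOVER_FATAL;
	case CPER_SEV_RECOVERABLE:
		return ACT_AER_RECOVER_NONFATAL;
	default:
		return ACT_LOG_ONLY;
	}
}
```

With the firmware behavior I see today, a fatal PCIe error corresponds to
`ghes_model_dispatch(CPER_SEV_FATAL, CPER_SEV_FATAL)`, which yields
`ACT_PANIC`. With the mapping proposed above it would be
`ghes_model_dispatch(CPER_SEV_RECOVERABLE, CPER_SEV_FATAL)`, which reaches
the fatal AER recovery path instead.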

However, I changed my mind on this issue after encountering a case where
error propagation was detected following a fatal DLLP error (Data Link
Protocol Error). The DLLP error occurred in the compute node, causing the
node to panic because `struct acpi_hest_generic_status::error_severity` was
set to CPER_SEV_FATAL. Nevertheless, data corruption was still detected in
the storage node by CRC.

2. Should the kernel panic when AER recovery fails?

This question is actually a TODO [1] that was added when the AER driver was
first upstreamed 17 years ago, and it is still relevant today. The kernel
does not proactively panic in OS native mode, regardless of the error type.
The DLLP error propagation case suggests that the kernel perhaps should
panic when recovery fails.

3. Should DPC be enabled by default to contain fatal and non-fatal errors?

According to the PCIe specification, DPC halts PCIe traffic below a
Downstream Port after an unmasked uncorrectable error is detected at or
below the Port, avoiding the potential spread of any data corruption.

The kernel configures DPC to be triggered only on ERR_FATAL. Taken
literally, does this mean that only fatal errors can spread data
corruption? In addition, the AER severity is programmable through the
Uncorrectable Error Severity Register (Offset 0Ch in PCIe AER cap). If an
error that is fatal by default, e.g. DLLP, is set as non-fatal, DPC will
not be triggered.
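
The interaction can be illustrated with a small user-space sketch. It is
only a model under stated assumptions: `signaled_message`, `dpc_triggers`
and the `DPC_TRIGGER_*` names are hypothetical, error masking is ignored,
and `UNC_DLP` is bit 4 (Data Link Protocol Error) of the AER
uncorrectable error registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Data Link Protocol Error bit in the AER uncorrectable error registers. */
#define UNC_DLP (UINT32_C(1) << 4)

enum err_msg { ERR_NONFATAL, ERR_FATAL };

/* Models the DPC Trigger Enable field: 00b, 01b and 10b. */
enum dpc_trigger {
	DPC_TRIGGER_DISABLED = 0,
	DPC_TRIGGER_FATAL = 1,             /* trigger on ERR_FATAL only */
	DPC_TRIGGER_NONFATAL_OR_FATAL = 2, /* trigger on either message */
};

/*
 * The Uncorrectable Error Severity Register decides, per error bit,
 * whether an unmasked error is signaled as ERR_FATAL (bit set) or
 * ERR_NONFATAL (bit clear). error_bit must be a single bit.
 */
static enum err_msg signaled_message(uint32_t severity_reg, uint32_t error_bit)
{
	assert(error_bit != 0 && (error_bit & (error_bit - 1)) == 0);
	return (severity_reg & error_bit) ? ERR_FATAL : ERR_NONFATAL;
}

/* Whether DPC would contain the error for a given trigger setting. */
static bool dpc_triggers(enum dpc_trigger mode, enum err_msg msg)
{
	switch (mode) {
	case DPC_TRIGGER_FATAL:
		return msg == ERR_FATAL;
	case DPC_TRIGGER_NONFATAL_OR_FATAL:
		return true;
	default:
		return false;
	}
}
```

With the severity bit set, `dpc_triggers(DPC_TRIGGER_FATAL,
signaled_message(UNC_DLP, UNC_DLP))` is true. Once the bit is cleared,
`signaled_message(0, UNC_DLP)` returns `ERR_NONFATAL` and the same DPC
setting no longer triggers, although the underlying link error is
unchanged.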

Looking forward to any comments and replies :)

Thank you.

Best Regards,
Shuai

[1] https://github.com/torvalds/linux/commit/6c2b374d74857e892080ee726184ec1d15e7d4e4#diff-fea64904d30501b59d2e948189bbedc476fc270ed4c15e4ae29d7f0efd06771aR438
