From: Shuai Xue <xueshuai@linux.alibaba.com>
To: David Laight <David.Laight@ACULAB.COM>,
Bjorn Helgaas <helgaas@kernel.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>,
"gregkh@linuxfoundation.org" <gregkh@linuxfoundation.org>,
Jonathan Cameron <Jonathan.Cameron@huawei.com>,
"mahesh@linux.ibm.com" <mahesh@linux.ibm.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
"bp@alien8.de" <bp@alien8.de>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
Linux PCI <linux-pci@vger.kernel.org>,
"bhelgaas@google.com" <bhelgaas@google.com>,
"james.morse@arm.com" <james.morse@arm.com>,
"linuxppc-dev@lists.ozlabs.org" <linuxppc-dev@lists.ozlabs.org>,
"lenb@kernel.org" <lenb@kernel.org>
Subject: Re: Questions: Should kernel panic when PCIe fatal error occurs?
Date: Mon, 25 Sep 2023 09:43:52 +0800
Message-ID: <f70e93c6-ba5b-a71c-4b82-33b279c76b0e@linux.alibaba.com>
In-Reply-To: <2e5870e416f84e8fad8340061ec303e2@AcuMS.aculab.com>
On 2023/9/21 21:20, David Laight wrote:
> ...
> I've got a target to generate AER errors by generating read cycles
> that are inside the address range that the bridge forwards but
> outside of any BAR because there are 2 different sized BARs.
> (Pretty easy to setup.)
> On the system I was using they didn't get propagated all the way
> to the root bridge - but were visible in the lower bridge.
So how did you observe it? If the error message does not propagate
to the root bridge, I think no AER interrupt will be triggered.
> It would be nice for a driver to be able to detect/clear such
> a flag if it gets an unexpected ~0u read value.
> (I'm not sure an error callback helps.)
IMHO, the general model is that an error detected at an endpoint should be
routed to the upstream port (for example, an RCiEP routes its error message
to the RCEC) so that the AER port service can handle the error. The device
driver then only has to implement the error handler callbacks.
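To make the division of labour concrete, here is a minimal userspace sketch of
that model. It is not kernel code: every demo_* name is hypothetical, and the
real kernel interface is struct pci_error_handlers with callbacks such as
error_detected, slot_reset and resume. The sketch only assumes that the port
service picks the channel state and the driver answers with how it can recover.

```c
/* Userspace model only: all demo_* names are hypothetical stand-ins,
 * not the kernel API. */
enum demo_channel_state {
	DEMO_IO_NORMAL,		/* non-fatal error, link still usable */
	DEMO_IO_FROZEN,		/* fatal error, I/O to the device is blocked */
	DEMO_IO_PERM_FAILURE,	/* device is considered gone */
};

enum demo_result {
	DEMO_CAN_RECOVER,	/* driver can continue without a reset */
	DEMO_NEED_RESET,	/* driver asks for a slot/link reset */
	DEMO_DISCONNECT,	/* driver gives up on the device */
	DEMO_NO_HANDLER,	/* driver registered no callback (assumption) */
};

struct demo_dev {
	int io_stopped;		/* set once the driver has quiesced its I/O */
};

struct demo_error_handlers {
	enum demo_result (*error_detected)(struct demo_dev *dev,
					   enum demo_channel_state state);
};

/* Driver side: the only thing the driver has to provide. */
static enum demo_result demo_error_detected(struct demo_dev *dev,
					    enum demo_channel_state state)
{
	dev->io_stopped = 1;	/* stop issuing new requests first */

	switch (state) {
	case DEMO_IO_NORMAL:
		return DEMO_CAN_RECOVER;
	case DEMO_IO_FROZEN:
		return DEMO_NEED_RESET;
	default:
		return DEMO_DISCONNECT;
	}
}

static const struct demo_error_handlers demo_handlers = {
	.error_detected = demo_error_detected,
};

/* Port service side: forwards the error to whatever the driver registered. */
static enum demo_result demo_port_report(const struct demo_error_handlers *h,
					 struct demo_dev *dev,
					 enum demo_channel_state state)
{
	if (!h || !h->error_detected)
		return DEMO_NO_HANDLER;
	return h->error_detected(dev, state);
}
```

Calling demo_port_report(&demo_handlers, &dev, DEMO_IO_FROZEN) models a fatal
error reaching a driver that has a callback; passing a null handler table
models a driver that has none.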
>
> OTOH a 'nebs compliant' server routed any kind of PCIe link error
> through to some 'system management' logic that then raised an NMI.
> I'm not sure who thought an NMI was a good idea - they are pretty
> impossible to handle in the kernel and too late to be of use to
> the code performing the access.
I think it is the responsibility of the device to prevent errors from
spreading while reporting that they have been detected. For example, drop
what is currently in flight (drain the submit queue) and report the error in
the completion record.
Both NMI and MSI are asynchronous interrupts.
>
> In any case we were getting one after 'echo 1 >xxx/remove' and
> then taking the PCIe link down by reprogramming the fpga.
> So the link going down was entirely expected, but there seemed
> to be nothing we could do to stop the kernel crashing.
>
> I'm sure 'nebs compliant' ought to contain some requirements for
> resilience to hardware failures!
How did the kernel crash after the link went down? Did the system detect a
surprise down error?
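One way to check this, assuming you can capture the value of the AER
Uncorrectable Error Status register of the port above the device, is to test
the Surprise Down bit. The sketch below is plain userspace C. The demo_* names
are hypothetical; the bit values are meant to mirror PCI_ERR_UNC_DLP and
PCI_ERR_UNC_SURPDN from the kernel's pci_regs.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits of the AER Uncorrectable Error Status register (demo_* names are
 * hypothetical; values follow PCI_ERR_UNC_DLP / PCI_ERR_UNC_SURPDN). */
#define DEMO_UNC_DLP	0x00000010u	/* Data Link Protocol Error */
#define DEMO_UNC_SURPDN	0x00000020u	/* Surprise Down Error */

/* True if the captured status value reports a Surprise Down error. */
static bool demo_has_surprise_down(uint32_t uncor_status)
{
	return (uncor_status & DEMO_UNC_SURPDN) != 0;
}

/* True if the captured status value reports a Data Link Protocol error. */
static bool demo_has_dlp_error(uint32_t uncor_status)
{
	return (uncor_status & DEMO_UNC_DLP) != 0;
}
```

For a captured value of 0x00000020, demo_has_surprise_down() returns true and
demo_has_dlp_error() returns false.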
Best Regards,
Shuai