From: "Alex G." <mr.nuke.me@gmail.com>
To: "Luck, Tony" <tony.luck@intel.com>, Borislav Petkov <bp@alien8.de>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>,
alex_gagniuc@dellteam.com, austin_bolen@dell.com,
shyam_iyer@dell.com, "Rafael J. Wysocki" <rjw@rjwysocki.net>,
Len Brown <lenb@kernel.org>,
Tyler Baicar <tbaicar@codeaurora.org>,
Will Deacon <will.deacon@arm.com>,
James Morse <james.morse@arm.com>,
Shiju Jose <shiju.jose@huawei.com>,
"Jonathan (Zhixiong) Zhang" <zjzhang@codeaurora.org>,
Dongjiu Geng <gengdongjiu@huawei.com>,
ACPI Devel Maling List <linux-acpi@vger.kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v6 1/2] acpi: apei: Rename ghes_severity() to ghes_cper_severity()
Date: Tue, 22 May 2018 13:13:12 -0500
Message-ID: <1071d2a4-486a-e2e1-5273-1d0bbe807beb@gmail.com>
In-Reply-To: <20180522175742.GA3543@agluck-desk>
On 05/22/2018 12:57 PM, Luck, Tony wrote:
> On Tue, May 22, 2018 at 04:54:26PM +0200, Borislav Petkov wrote:
>> I especially don't want to have the case where a PCIe error is *really*
>> fatal and then we noodle in some handlers debating about the severity
>> because it got marked as recoverable intermittently and end up causing
>> data corruption on the storage device. Here's a real no-no for ya.
>
> All that we have is a message from the BIOS that this is a "fatal"
> error. When did we start trusting the BIOS to give us accurate
> information?
When we merged ACPI handling.
> PCIe fatal means that the link or the device is broken. But that
> seems a poor reason to take down a large server that may have
> dozens of devices (some of them set up specifically to handle
> errors ... e.g. mirrored disks on separate controllers, or NIC
> devices that have been "bonded" together).
>
> So, as long as the action for a "fatal" error is to mark a link
> down and offline the device, that seems a pretty reasonable course
> of action.
>
> The argument gets a lot more marginal if you simply reset the
> link and re-enable the device to "fix" it. That might be enough,
> but I don't think the OS has enough data to make the call.
I'm not 100% satisfied with how the AER handler works, or with how certain
drivers (nvme!!!) interface with AER handling. But this is an argument
that belongs in PCI code, and a fight I will fight with Bjorn and Keith.
The issue we're having with Borislav and Rafael's estate is that we
can't make it to PCI land.
I'm seeing here the same fight that I saw with firmware vs OS, where fw
wants to have control, and OS wants to have control. I saw the same with
ME/CSE/CSME team vs ChromeOS team, where ME team did everything possible
to make sure only they can access the boot vector and boot the
processor, and ChromeOS team couldn't use this approach because they
wanted their own root of trust. I've seen this in other places as well,
though confidentiality agreements prevent me from talking about it.
It's the issue of control, and it's a fact of life. Borislav and Rafael
don't want to relinquish control until they can be 100% certain that
going further will result in 100% recovery. That is a goal I aspire to
as well, but an unachievable ideal nonetheless.
I thought the best compromise would be to be as close as possible to
native handling. That is, if AER can't recover, we don't recover the
device, but the machine keeps running. I think there's some deeper
history to GHES handling, which I didn't take into consideration. The
fight is to convince appropriate parties to share the responsibility in
a way which doesn't kill the machine. We still have a ways to go until
we get there.
Alex
> -Tony
>
> P.S. I deliberately put "fatal" in quotes above because to
> quote "The Princess Bride" -- "that word, I do not think it
> means what you think it means". :-)
>