From: "Alex G." <mr.nuke.me@gmail.com>
To: Borislav Petkov <bp@alien8.de>
Cc: linux-acpi@vger.kernel.org, linux-edac@vger.kernel.org,
rjw@rjwysocki.net, lenb@kernel.org, tony.luck@intel.com,
tbaicar@codeaurora.org, will.deacon@arm.com, james.morse@arm.com,
shiju.jose@huawei.com, zjzhang@codeaurora.org,
gengdongjiu@huawei.com, linux-kernel@vger.kernel.org,
alex_gagniuc@dellteam.com, austin_bolen@dell.com,
shyam_iyer@dell.com, devel@acpica.org, mchehab@kernel.org,
robert.moore@intel.com, erik.schmauss@intel.com
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.
Date: Thu, 19 Apr 2018 17:55:08 -0500
Message-ID: <dba07329-2f85-2bde-85ea-5bdf26fb8df2@gmail.com>
In-Reply-To: <20180419190323.GF5635@pd.tnic>
On 04/19/2018 02:03 PM, Borislav Petkov wrote:
> (snip useful explanation).
>
> On Thu, Apr 19, 2018 at 12:40:54PM -0500, Alex G. wrote:
>> On the r740xd, FW just hides those errors from the OS with no further
>> notification. On this machine BIOS sets things up such that non-posted
>> requests report fatal (PCIe) errors. FW still tries very hard to hide
>> this from the OS, and I think the heuristic is that if the drive
>> physical presence is gone, don't even report the error.
>
> Ok, second question: can you detect from the error signatures alone that
> it was a surprise removal?
I suppose you could make some inference, given the timing of other
events going on around the crash. It's not uncommon to see a "Card
not present" event around drive removal.
Since the presence detect pin breaks last, you might not get that
interrupt for a long while. In that case it's much harder to determine
if you're seeing a SURPRISE!!! removal or some other fault.
I don't think you can use GHES alone to determine the nature of the
event. There is not a 1:1 mapping from the set of things going wrong to
the set of PCIe errors.
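To make the timing inference concrete, here is a rough sketch of what such a check could look like when run over dmesg text. It only asks whether pciehp logged a "Link Down" for the same port shortly before the hardware error. The helper name `link_down_near`, the 2-second window, and the regular expressions are assumptions made for illustration, not anything the kernel provides, and a hit is only a hint of a removal, not proof.

```python
import re

# "[ 51.414616] message" -> (timestamp, message). Assumes unwrapped dmesg lines.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")
# pciehp hotplug message of the form seen in the log in this mail.
LINK_DOWN_RE = re.compile(r"^pciehp (\S+?):pcie\d+: Slot\(\d+\): Link Down$")


def parse(lines):
    """Yield (timestamp, message) for every line that carries a timestamp."""
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            yield float(m.group(1)), m.group(2)


def link_down_near(lines, device, error_ts, window=2.0):
    """Return True if `device` logged Link Down at most `window` seconds
    before `error_ts`. The window length is an arbitrary assumption."""
    for ts, msg in parse(lines):
        m = LINK_DOWN_RE.match(msg)
        if m and m.group(1) == device and 0 <= error_ts - ts <= window:
            return True
    return False


log = [
    "[ 51.414616] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down",
    "[ 51.414634] pciehp 0000:b0:05.0:pcie204: Slot(179): Link Down",
    "[ 52.711616] {1}[Hardware Error]: event severity: fatal",
]
print(link_down_near(log, "0000:b0:06.0", 52.711616))
```

If the Link Down or presence-detect event arrives late, or never, a check like this says nothing either way, which is the case described above.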
> How does such an error look like, in detail?
It's green on the soft side, with lots of red accents, as well as some
textured white shades:
[ 51.414616] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
[ 51.414634] pciehp 0000:b0:05.0:pcie204: Slot(179): Link Down
[ 52.703343] FIRMWARE BUG: Firmware sent fatal error that we were able to correct
[ 52.703345] BROKEN FIRMWARE: Complain to your hardware vendor
[ 52.703347] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[ 52.703358] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up
[ 52.711616] {1}[Hardware Error]: event severity: fatal
[ 52.716754] {1}[Hardware Error]: Error 0, type: fatal
[ 52.721891] {1}[Hardware Error]: section_type: PCIe error
[ 52.727463] {1}[Hardware Error]: port_type: 6, downstream switch port
[ 52.734075] {1}[Hardware Error]: version: 3.0
[ 52.738607] {1}[Hardware Error]: command: 0x0407, status: 0x0010
[ 52.744786] {1}[Hardware Error]: device_id: 0000:b0:06.0
[ 52.750271] {1}[Hardware Error]: slot: 4
[ 52.754371] {1}[Hardware Error]: secondary_bus: 0xb3
[ 52.759509] {1}[Hardware Error]: vendor_id: 0x10b5, device_id: 0x9733
[ 52.766123] {1}[Hardware Error]: class_code: 000406
[ 52.771182] {1}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0003
[ 52.779038] pcieport 0000:b0:06.0: aer_status: 0x00100000, aer_mask: 0x01a10000
[ 52.782303] nvme0n1: detected capacity change from 3200631791616 to 0
[ 52.786348] pcieport 0000:b0:06.0: [20] Unsupported Request
[ 52.786349] pcieport 0000:b0:06.0: aer_layer=Transaction Layer, aer_agent=Requester ID
[ 52.786350] pcieport 0000:b0:06.0: aer_uncor_severity: 0x004eb030
[ 52.786352] pcieport 0000:b0:06.0: TLP Header: 40000001 0000020f e12023bc 01000000
[ 52.786357] pcieport 0000:b0:06.0: broadcast error_detected message
[ 52.883895] pci 0000:b3:00.0: device has no driver
[ 52.883976] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
[ 52.884184] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down event queued; currently getting powered on
[ 52.967175] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up
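For anyone reading the hex values in the log by hand, the following is a small sketch that decodes the three AER words (aer_status, aer_mask, aer_uncor_severity). The bit names follow the AER Uncorrectable Error Status register layout as I understand it from the PCIe spec; the function names `decode_aer_uncor` and `is_fatal` are made up for this example.

```python
# Bit positions in the AER Uncorrectable Error Status/Mask/Severity registers.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked",
}


def decode_aer_uncor(value):
    """Return [(bit, name), ...] for every set bit, lowest bit first."""
    return [(bit, AER_UNCOR_BITS.get(bit, "Unknown/Reserved"))
            for bit in range(32) if value & (1 << bit)]


def is_fatal(severity, bit):
    """In the severity register a set bit means fatal, a clear bit non-fatal."""
    return bool(severity & (1 << bit))


# Values copied from the log above.
aer_status = 0x00100000
aer_mask = 0x01A10000
aer_uncor_severity = 0x004EB030

for bit, name in decode_aer_uncor(aer_status):
    masked = bool(aer_mask & (1 << bit))
    fatal = is_fatal(aer_uncor_severity, bit)
    print(f"[{bit}] {name}: masked={masked} fatal={fatal}")
```

For the values above, the only status bit set is bit 20 (Unsupported Request). It is not masked, and its bit in aer_uncor_severity is clear, so the AER severity word classifies it as non-fatal even though the GHES record carries "event severity: fatal".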
> Got error logs somewhere to dump?
Sure [1]. They have the ANSI sequences, so you might want to wget and
grep them in a color terminal.
Alex
[1] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180416-1919.log