linux-acpi.vger.kernel.org archive mirror
From: "Alex G." <mr.nuke.me@gmail.com>
To: Borislav Petkov <bp@alien8.de>
Cc: linux-acpi@vger.kernel.org, linux-edac@vger.kernel.org,
	rjw@rjwysocki.net, lenb@kernel.org, tony.luck@intel.com,
	tbaicar@codeaurora.org, will.deacon@arm.com, james.morse@arm.com,
	shiju.jose@huawei.com, zjzhang@codeaurora.org,
	gengdongjiu@huawei.com, linux-kernel@vger.kernel.org,
	alex_gagniuc@dellteam.com, austin_bolen@dell.com,
	shyam_iyer@dell.com, devel@acpica.org, mchehab@kernel.org,
	robert.moore@intel.com, erik.schmauss@intel.com
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.
Date: Thu, 19 Apr 2018 17:55:08 -0500
Message-ID: <dba07329-2f85-2bde-85ea-5bdf26fb8df2@gmail.com>
In-Reply-To: <20180419190323.GF5635@pd.tnic>



On 04/19/2018 02:03 PM, Borislav Petkov wrote:
> (snip useful explanation).
> 
> On Thu, Apr 19, 2018 at 12:40:54PM -0500, Alex G. wrote:
>> On the r740xd, FW just hides those errors from the OS with no further
>> notification. On this machine BIOS sets things up such that non-posted
>> requests report fatal (PCIe) errors. FW still tries very hard to hide
>> this from the OS, and I think the heuristic is that if the drive
>> physical presence is gone, don't even report the error.
> 
> Ok, second question: can you detect from the error signatures alone that
> it was a surprise removal? 

I suppose you could make some inference, given the timing of other
events going on around the crash. It's not uncommon to see a "Card
not present" event around drive removal.

Since the presence detect pin breaks last, you might not get that
interrupt for a long while. In that case it's much harder to determine
if you're seeing a SURPRISE!!! removal or some other fault.

I don't think you can use GHES alone to determine the nature of the
event. There is not a 1:1 mapping from the set of things going wrong to
the set of PCIe errors.
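
To illustrate the kind of timing inference I mean, here is a minimal
sketch in Python. It is only a heuristic, not something the kernel does:
it scans dmesg-style lines and, for each GHES PCIe report, lists pciehp
"Link Down" or "Card not present" events for the same device within a
time window. The function name, the regexes and the 5 second window are
assumptions made for this example, and it expects each log message on a
single line.

```python
import re

# "[   51.414616] message" -> timestamp and message text.
LINE_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")
# pciehp events, e.g. "pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down".
HOTPLUG_RE = re.compile(
    r"pciehp (\S+?):pcie\d+: Slot\(\d+\): (Link Down|Card not present)")
# GHES field naming the reporting device, e.g. "device_id: 0000:b0:06.0".
GHES_DEV_RE = re.compile(
    r"\[Hardware Error\]:\s+device_id: "
    r"([0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])")


def hotplug_events_near_errors(lines, window=5.0):
    """Pair each GHES PCIe report with nearby hotplug events (heuristic)."""
    hotplug = []  # (timestamp, device, event)
    errors = []   # (timestamp, device)
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue
        ts, msg = float(m.group(1)), m.group(2)
        hp = HOTPLUG_RE.search(msg)
        if hp:
            hotplug.append((ts, hp.group(1), hp.group(2)))
            continue
        dev = GHES_DEV_RE.search(msg)
        if dev:
            errors.append((ts, dev.group(1)))
    # Look both before and after the error, since presence detect can lag.
    return [
        (dev, ts, [e for e in hotplug
                   if e[1] == dev and abs(e[0] - ts) <= window])
        for ts, dev in errors
    ]
```

An empty list of nearby events does not rule out a removal, and a
non-empty one does not prove it. At best this narrows things down.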

> How does such an error look like, in detail?

It's green on the soft side, with lots of red accents, as well as some
textured white shades:

[   51.414616] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
[   51.414634] pciehp 0000:b0:05.0:pcie204: Slot(179): Link Down
[   52.703343] FIRMWARE BUG: Firmware sent fatal error that we were able to correct
[   52.703345] BROKEN FIRMWARE: Complain to your hardware vendor
[   52.703347] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[   52.703358] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up
[   52.711616] {1}[Hardware Error]: event severity: fatal
[   52.716754] {1}[Hardware Error]:  Error 0, type: fatal
[   52.721891] {1}[Hardware Error]:   section_type: PCIe error
[   52.727463] {1}[Hardware Error]:   port_type: 6, downstream switch port
[   52.734075] {1}[Hardware Error]:   version: 3.0
[   52.738607] {1}[Hardware Error]:   command: 0x0407, status: 0x0010
[   52.744786] {1}[Hardware Error]:   device_id: 0000:b0:06.0
[   52.750271] {1}[Hardware Error]:   slot: 4
[   52.754371] {1}[Hardware Error]:   secondary_bus: 0xb3
[   52.759509] {1}[Hardware Error]:   vendor_id: 0x10b5, device_id: 0x9733
[   52.766123] {1}[Hardware Error]:   class_code: 000406
[   52.771182] {1}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0003
[   52.779038] pcieport 0000:b0:06.0: aer_status: 0x00100000, aer_mask: 0x01a10000
[   52.782303] nvme0n1: detected capacity change from 3200631791616 to 0
[   52.786348] pcieport 0000:b0:06.0:    [20] Unsupported Request
[   52.786349] pcieport 0000:b0:06.0: aer_layer=Transaction Layer, aer_agent=Requester ID
[   52.786350] pcieport 0000:b0:06.0: aer_uncor_severity: 0x004eb030
[   52.786352] pcieport 0000:b0:06.0:   TLP Header: 40000001 0000020f e12023bc 01000000
[   52.786357] pcieport 0000:b0:06.0: broadcast error_detected message
[   52.883895] pci 0000:b3:00.0: device has no driver
[   52.883976] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
[   52.884184] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down event queued; currently getting powered on
[   52.967175] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up
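
For anyone who wants to decode the aer_status, aer_mask and
aer_uncor_severity values by hand, here is a small Python sketch. The bit
names are my reading of the AER Uncorrectable Error Status register
layout in the PCIe spec, so check them against the spec before relying on
them. The function name and the output format are made up for this
example.

```python
# Bit positions shared by the AER Uncorrectable Error Status, Mask and
# Severity registers (assumed from the PCIe spec, not taken from this log).
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked",
}


def decode_aer_uncor(status, mask, severity):
    """List each error bit set in status, with its mask and severity bits."""
    decoded = []
    for bit, name in sorted(AER_UNCOR_BITS.items()):
        flag = 1 << bit
        if status & flag:
            decoded.append({
                "bit": bit,
                "name": name,
                "masked": bool(mask & flag),
                "fatal": bool(severity & flag),
            })
    return decoded


# Values from the log above.
for err in decode_aer_uncor(0x00100000, 0x01A10000, 0x004EB030):
    print(err)
```

With the values from this log, the only status bit set is bit 20, which
matches the "[20] Unsupported Request" line. That bit is clear in both
the mask and the severity register, so going by the port's own AER
registers the error is unmasked and non-fatal, even though the GHES
record calls it fatal.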


> Got error logs somewhere to dump?

Sure [1]. They have the ANSI sequences, so you might want to wget and
grep them in a color terminal.

Alex

[1] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180416-1919.log
