From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Alex G."
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.
Date: Thu, 19 Apr 2018 17:55:08 -0500
Message-ID:
References: <20180416215903.7318-1-mr.nuke.me@gmail.com>
 <20180416215903.7318-4-mr.nuke.me@gmail.com>
 <20180418175415.GJ4795@pd.tnic>
 <20180419154006.GE3600@pd.tnic>
 <977608e6-9f5d-c523-a78a-993ac5bfd55f@gmail.com>
 <20180419164528.GD5635@pd.tnic>
 <20180419190323.GF5635@pd.tnic>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20180419190323.GF5635@pd.tnic>
Content-Language: en-US
Sender: linux-kernel-owner@vger.kernel.org
To: Borislav Petkov
Cc: linux-acpi@vger.kernel.org, linux-edac@vger.kernel.org,
 rjw@rjwysocki.net, lenb@kernel.org, tony.luck@intel.com,
 tbaicar@codeaurora.org, will.deacon@arm.com, james.morse@arm.com,
 shiju.jose@huawei.com, zjzhang@codeaurora.org, gengdongjiu@huawei.com,
 linux-kernel@vger.kernel.org, alex_gagniuc@dellteam.com,
 austin_bolen@dell.com, shyam_iyer@dell.com, devel@acpica.org,
 mchehab@kernel.org, robert.moore@intel.com, erik.schmauss@intel.com
List-Id: linux-acpi@vger.kernel.org

On 04/19/2018 02:03 PM, Borislav Petkov wrote:
> (snip useful explanation).
>
> On Thu, Apr 19, 2018 at 12:40:54PM -0500, Alex G. wrote:
>> On the r740xd, FW just hides those errors from the OS with no further
>> notification. On this machine BIOS sets things up such that non-posted
>> requests report fatal (PCIe) errors. FW still tries very hard to hide
>> this from the OS, and I think the heuristic is that if the drive
>> physical presence is gone, don't even report the error.
>
> Ok, second question: can you detect from the error signatures alone that
> it was a surprise removal?

I suppose you could make some inference, given the timing of other
events going on around the crash. It's not uncommon to see a "Card
not present" event around drive removal.

Since the presence detect pin breaks last, you might not get that
interrupt for a long while. In that case it's much harder to determine
if you're seeing a SURPRISE!!! removal or some other fault.

I don't think you can use GHES alone to determine the nature of the
event. There is not a 1:1 mapping from the set of things going wrong to
the set of PCIe errors.

> How does such an error look like, in detail?

It's green on the soft side, with lots of red accents, as well as some
textured white shades:

[   51.414616] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
[   51.414634] pciehp 0000:b0:05.0:pcie204: Slot(179): Link Down
[   52.703343] FIRMWARE BUG: Firmware sent fatal error that we were able to correct
[   52.703345] BROKEN FIRMWARE: Complain to your hardware vendor
[   52.703347] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[   52.703358] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up
[   52.711616] {1}[Hardware Error]: event severity: fatal
[   52.716754] {1}[Hardware Error]:  Error 0, type: fatal
[   52.721891] {1}[Hardware Error]:   section_type: PCIe error
[   52.727463] {1}[Hardware Error]:   port_type: 6, downstream switch port
[   52.734075] {1}[Hardware Error]:   version: 3.0
[   52.738607] {1}[Hardware Error]:   command: 0x0407, status: 0x0010
[   52.744786] {1}[Hardware Error]:   device_id: 0000:b0:06.0
[   52.750271] {1}[Hardware Error]:   slot: 4
[   52.754371] {1}[Hardware Error]:   secondary_bus: 0xb3
[   52.759509] {1}[Hardware Error]:   vendor_id: 0x10b5, device_id: 0x9733
[   52.766123] {1}[Hardware Error]:   class_code: 000406
[   52.771182] {1}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0003
[   52.779038] pcieport 0000:b0:06.0: aer_status: 0x00100000, aer_mask: 0x01a10000
[   52.782303] nvme0n1: detected capacity change from 3200631791616 to 0
[   52.786348] pcieport 0000:b0:06.0:    [20] Unsupported Request
[   52.786349] pcieport 0000:b0:06.0: aer_layer=Transaction Layer, aer_agent=Requester ID
[   52.786350] pcieport 0000:b0:06.0: aer_uncor_severity: 0x004eb030
[   52.786352] pcieport 0000:b0:06.0:   TLP Header: 40000001 0000020f e12023bc 01000000
[   52.786357] pcieport 0000:b0:06.0: broadcast error_detected message
[   52.883895] pci 0000:b3:00.0: device has no driver
[   52.883976] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
[   52.884184] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down event queued; currently getting powered on
[   52.967175] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up

> Got error logs somewhere to dump?

Sure [1]. They have the ANSI sequences, so you might want to wget and
grep them in a color terminal.

Alex

[1] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180416-1919.log
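
For readers who want to check the register values in the log by hand,
here is a minimal sketch that decodes them. It is not part of the
original message. It assumes that aer_status, aer_mask and
aer_uncor_severity in the log are the port's AER Uncorrectable Error
Status, Mask and Severity registers, and it uses the bit names from the
PCIe specification. The helper names decode_bits and classify are made
up for illustration.

```python
# Bit positions in the PCIe AER Uncorrectable Error Status register.
# The Mask and Severity registers use the same bit layout.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked",
}


def decode_bits(value):
    """Return (bit, name) pairs for every bit set in a 32-bit register value."""
    return [
        (bit, AER_UNCOR_BITS.get(bit, "Reserved/unknown"))
        for bit in range(32)
        if value & (1 << bit)
    ]


def classify(status, severity):
    """Map each error set in status to True if the severity register marks it fatal."""
    return {name: bool(severity & (1 << bit)) for bit, name in decode_bits(status)}


# Values copied from the log in this message.
aer_status = 0x00100000
aer_mask = 0x01A10000
aer_uncor_severity = 0x004EB030

print("status:  ", decode_bits(aer_status))
print("mask:    ", decode_bits(aer_mask))
print("fatal?   ", classify(aer_status, aer_uncor_severity))
```

Under those assumptions, the only status bit set is bit 20
(Unsupported Request), matching the "[20] Unsupported Request" line, and
that bit is clear in the severity value, so the port itself classifies
the error as non-fatal even though the GHES record reports "fatal".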