From mboxrd@z Thu Jan 1 00:00:00 1970
From: Borislav Petkov
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.
Date: Sun, 22 Apr 2018 12:48:49 +0200
Message-ID: <20180422104849.GA32754@pd.tnic>
References: <20180416215903.7318-1-mr.nuke.me@gmail.com> <20180416215903.7318-4-mr.nuke.me@gmail.com> <20180418175415.GJ4795@pd.tnic> <20180419154006.GE3600@pd.tnic> <977608e6-9f5d-c523-a78a-993ac5bfd55f@gmail.com> <20180419164528.GD5635@pd.tnic> <20180419190323.GF5635@pd.tnic>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Return-path:
Content-Disposition: inline
In-Reply-To:
Sender: linux-kernel-owner@vger.kernel.org
To: "Alex G."
Cc: linux-acpi@vger.kernel.org, linux-edac@vger.kernel.org, rjw@rjwysocki.net, lenb@kernel.org, tony.luck@intel.com, tbaicar@codeaurora.org, will.deacon@arm.com, james.morse@arm.com, shiju.jose@huawei.com, zjzhang@codeaurora.org, gengdongjiu@huawei.com, linux-kernel@vger.kernel.org, alex_gagniuc@dellteam.com, austin_bolen@dell.com, shyam_iyer@dell.com, devel@acpica.org, mchehab@kernel.org, robert.moore@intel.com, erik.schmauss@intel.com, Yazen Ghannam , Ard Biesheuvel
List-Id: linux-acpi@vger.kernel.org

On Thu, Apr 19, 2018 at 05:55:08PM -0500, Alex G. wrote:
> > How does such an error look like, in detail?
>
> It's green on the soft side, with lots of red accents, as well as some
> textured white shades:
>
> [   51.414616] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
> [   51.414634] pciehp 0000:b0:05.0:pcie204: Slot(179): Link Down
> [   52.703343] FIRMWARE BUG: Firmware sent fatal error that we were able
> to correct
> [   52.703345] BROKEN FIRMWARE: Complain to your hardware vendor
> [   52.703347] {1}[Hardware Error]: Hardware error from APEI Generic
> Hardware Error Source: 1
> [   52.703358] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up
> [   52.711616] {1}[Hardware Error]: event severity: fatal
> [   52.716754] {1}[Hardware Error]:  Error 0, type: fatal
> [   52.721891] {1}[Hardware Error]:   section_type: PCIe error
> [   52.727463] {1}[Hardware Error]:   port_type: 6, downstream switch port
> [   52.734075] {1}[Hardware Error]:   version: 3.0
> [   52.738607] {1}[Hardware Error]:   command: 0x0407, status: 0x0010
> [   52.744786] {1}[Hardware Error]:   device_id: 0000:b0:06.0
> [   52.750271] {1}[Hardware Error]:   slot: 4
> [   52.754371] {1}[Hardware Error]:   secondary_bus: 0xb3
> [   52.759509] {1}[Hardware Error]:   vendor_id: 0x10b5, device_id: 0x9733
> [   52.766123] {1}[Hardware Error]:   class_code: 000406
> [   52.771182] {1}[Hardware Error]:   bridge: secondary_status: 0x0000,
> control: 0x0003
> [   52.779038] pcieport 0000:b0:06.0: aer_status: 0x00100000, aer_mask:
> 0x01a10000
> [   52.782303] nvme0n1: detected capacity change from 3200631791616 to 0
> [   52.786348] pcieport 0000:b0:06.0:    [20] Unsupported Request
> [   52.786349] pcieport 0000:b0:06.0: aer_layer=Transaction Layer,
> aer_agent=Requester ID
> [   52.786350] pcieport 0000:b0:06.0: aer_uncor_severity: 0x004eb030
> [   52.786352] pcieport 0000:b0:06.0:   TLP Header: 40000001 0000020f
> e12023bc 01000000
> [   52.786357] pcieport 0000:b0:06.0: broadcast error_detected message
> [   52.883895] pci 0000:b3:00.0: device has no driver
> [   52.883976] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
> [   52.884184] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down event
> queued; currently getting powered on
> [   52.967175] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up

Btw, from another discussion we're having with Yazen:

@Yazen, do you see how this error record is worth shit?

 class_code: 000406
 command: 0x0407, status: 0x0010
 bridge: secondary_status: 0x0000, control: 0x0003
 aer_status: 0x00100000, aer_mask: 0x01a10000
 aer_uncor_severity: 0x004eb030

those above are only some of the fields which are purely useless
undecoded. Makes me wonder what's worse for the user: dump the
half-decoded error or not dump an error at all...

Anyway, Alex, I see this in the logs:

[   66.581121] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
[   66.591939] pciehp 0000:b0:05.0:pcie204: Slot(179): Card not present
[   66.592102] pciehp 0000:b0:06.0:pcie204: Slot(176): Card not present

and that comes from that pciehp_isr() interrupt handler AFAICT.

So there *is* a way to know that the card is not present anymore. So,
theoretically, and ignoring the code layering for now, we can connect
that error to the card not present event and then ignore the error...

Hmmm.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
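
To illustrate the point about undecoded fields, here is a minimal sketch of what
decoding three of the raw values from the record could look like. It assumes the
standard bit layout of the PCI command register and of the AER Uncorrectable
Error Status/Mask registers; the table and function names (PCI_COMMAND_BITS,
AER_UNCOR_BITS, decode_bits) are hypothetical and are not kernel or library
APIs.

```python
# Hypothetical helper tables: bit position -> human-readable name.
# Bit positions follow the PCI command register and the PCIe AER
# Uncorrectable Error Status register layouts (an assumption of this sketch).
PCI_COMMAND_BITS = {
    0: "I/O Space Enable",
    1: "Memory Space Enable",
    2: "Bus Master Enable",
    6: "Parity Error Response",
    8: "SERR# Enable",
    10: "Interrupt Disable",
}

AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked",
}


def decode_bits(value, names):
    """Return the names of all set bits in value, lowest bit first."""
    decoded = []
    bit = 0
    while value >> bit:
        if value & (1 << bit):
            # Bits missing from the table are reported rather than dropped.
            decoded.append(names.get(bit, f"bit {bit} (unknown)"))
        bit += 1
    return decoded


# Values taken from the error record quoted above.
print("command    0x0407:    ", decode_bits(0x0407, PCI_COMMAND_BITS))
print("aer_status 0x00100000:", decode_bits(0x00100000, AER_UNCOR_BITS))
print("aer_mask   0x01a10000:", decode_bits(0x01A10000, AER_UNCOR_BITS))
```

With these tables, aer_status decodes to the single "Unsupported Request" bit,
which matches the "[20] Unsupported Request" line that the AER code prints
further down in the same log.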