From mboxrd@z Thu Jan 1 00:00:00 1970
From: "Alex G."
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.
Date: Mon, 23 Apr 2018 23:19:25 -0500
Message-ID: <70c43399-e8e5-5061-b5a5-451deb5f02fa@gmail.com>
References: <20180416215903.7318-1-mr.nuke.me@gmail.com>
 <20180416215903.7318-4-mr.nuke.me@gmail.com>
 <20180418175415.GJ4795@pd.tnic>
 <20180419154006.GE3600@pd.tnic>
 <977608e6-9f5d-c523-a78a-993ac5bfd55f@gmail.com>
 <20180419164528.GD5635@pd.tnic>
 <20180419190323.GF5635@pd.tnic>
 <20180422104849.GA32754@pd.tnic>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 7bit
Return-path:
In-Reply-To: <20180422104849.GA32754@pd.tnic>
Content-Language: en-US
Sender: linux-kernel-owner@vger.kernel.org
To: Borislav Petkov
Cc: linux-acpi@vger.kernel.org, linux-edac@vger.kernel.org,
 rjw@rjwysocki.net, lenb@kernel.org, tony.luck@intel.com,
 tbaicar@codeaurora.org, will.deacon@arm.com, james.morse@arm.com,
 shiju.jose@huawei.com, zjzhang@codeaurora.org, gengdongjiu@huawei.com,
 linux-kernel@vger.kernel.org, alex_gagniuc@dellteam.com,
 austin_bolen@dell.com, shyam_iyer@dell.com, devel@acpica.org,
 mchehab@kernel.org, robert.moore@intel.com, erik.schmauss@intel.com,
 Yazen Ghannam, Ard Biesheuvel
List-Id: linux-acpi@vger.kernel.org

On 04/22/2018 05:48 AM, Borislav Petkov wrote:
> On Thu, Apr 19, 2018 at 05:55:08PM -0500, Alex G. wrote:
>>> How does such an error look like, in detail?
>>
>> It's green on the soft side, with lots of red accents, as well as some
>> textured white shades:
>>
>> [   51.414616] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
>> [   51.414634] pciehp 0000:b0:05.0:pcie204: Slot(179): Link Down
>> [   52.703343] FIRMWARE BUG: Firmware sent fatal error that we were able
>> to correct
>> [   52.703345] BROKEN FIRMWARE: Complain to your hardware vendor
>> [   52.703347] {1}[Hardware Error]: Hardware error from APEI Generic
>> Hardware Error Source: 1
>> [   52.703358] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up
>> [   52.711616] {1}[Hardware Error]: event severity: fatal
>> [   52.716754] {1}[Hardware Error]:  Error 0, type: fatal
>> [   52.721891] {1}[Hardware Error]:   section_type: PCIe error
>> [   52.727463] {1}[Hardware Error]:   port_type: 6, downstream switch port
>> [   52.734075] {1}[Hardware Error]:   version: 3.0
>> [   52.738607] {1}[Hardware Error]:   command: 0x0407, status: 0x0010
>> [   52.744786] {1}[Hardware Error]:   device_id: 0000:b0:06.0
>> [   52.750271] {1}[Hardware Error]:   slot: 4
>> [   52.754371] {1}[Hardware Error]:   secondary_bus: 0xb3
>> [   52.759509] {1}[Hardware Error]:   vendor_id: 0x10b5, device_id: 0x9733
>> [   52.766123] {1}[Hardware Error]:   class_code: 000406
>> [   52.771182] {1}[Hardware Error]:   bridge: secondary_status: 0x0000,
>> control: 0x0003
>> [   52.779038] pcieport 0000:b0:06.0: aer_status: 0x00100000, aer_mask:
>> 0x01a10000
>> [   52.782303] nvme0n1: detected capacity change from 3200631791616 to 0
>> [   52.786348] pcieport 0000:b0:06.0:    [20] Unsupported Request
>> [   52.786349] pcieport 0000:b0:06.0: aer_layer=Transaction Layer,
>> aer_agent=Requester ID
>> [   52.786350] pcieport 0000:b0:06.0: aer_uncor_severity: 0x004eb030
>> [   52.786352] pcieport 0000:b0:06.0:   TLP Header: 40000001 0000020f
>> e12023bc 01000000
>> [   52.786357] pcieport 0000:b0:06.0: broadcast error_detected message
>> [   52.883895] pci 0000:b3:00.0: device has no driver
>> [   52.883976] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
>> [   52.884184] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down event
>> queued; currently getting powered on
>> [   52.967175] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Up
>
> Btw, from another discussion we're having with Yazen:
>
> @Yazen, do you see how this error record is worth shit?
>
>   class_code: 000406
>   command: 0x0407, status: 0x0010
>   bridge: secondary_status: 0x0000, control: 0x0003
>   aer_status: 0x00100000, aer_mask: 0x01a10000
>   aer_uncor_severity: 0x004eb030

That tells you what FFS said about the error. Keep in mind that FFS has
cleared the hardware error bits, which the AER handler would normally
read from the PCI device.

> those above are only some of the fields which are purely useless
> undecoded. Makes me wonder what's worse for the user: dump the
> half-decoded error or not dump an error at all...

It's immediately obvious if there's a glaring FFS bug and if we get
bogus data. If you distrust firmware as much as I do, then you will find
great value in having such info in the logs. It's probably not too
useful to a casual user, but then neither is a majority of the system log.

> Anyway, Alex, I see this in the logs:
>
> [   66.581121] pciehp 0000:b0:06.0:pcie204: Slot(176): Link Down
> [   66.591939] pciehp 0000:b0:05.0:pcie204: Slot(179): Card not present
> [   66.592102] pciehp 0000:b0:06.0:pcie204: Slot(176): Card not present
>
> and that comes from that pciehp_isr() interrupt handler AFAICT.
>
> So there *is* a way to know that the card is not present anymore. So,
> theoretically, and ignoring the code layering for now, we can connect
> that error to the card not present event and then ignore the error...

You're missing the timing and assuming you will get the hotplug
interrupt. In this example, you have 22ms between the link down and
presence detect state change. This is a fairly fast removal.

Hotplug dependencies aside (you can have the kernel run without PCIe
hotplug support), I don't think you want to just linger in NMI for
dozens of milliseconds waiting for presence detect confirmation.

For enterprise SFF NVMe drives, the data lanes will disconnect before
the presence detect. FFS relies on presence detect, and these are two of
the reasons why slow removal is such a problem. You might not get a
presence detect interrupt at all.

Presence detect is optional for PCIe. PD is such a reliable heuristic
that it guarantees worse error handling than the crackmonkey firmware. I
don't see how it might be useful in a way which gives us better handling
than firmware.

> Hmmm.

Hmmm

Anyway, heuristics about PCIe error recovery belong in the recovery
handler. I don't think it's smart to apply policy before we get there.

Alex
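
For reference, the raw AER fields quoted in the message can be decoded by
hand. What follows is a minimal sketch, not kernel code. It assumes that
aer_status, aer_mask and aer_uncor_severity hold the PCIe AER Uncorrectable
Error Status, Mask and Severity registers, which share one bit layout. The
table name and the functions decode_aer_uncor and fatal_by_severity are
hypothetical helpers made up for this illustration.

```python
# Bit positions shared by the PCIe AER Uncorrectable Error Status, Mask
# and Severity registers. Assumption: the aer_* fields in the log are raw
# copies of these registers.
AER_UNCOR_BITS = {
    4: "Data Link Protocol Error",
    5: "Surprise Down Error",
    12: "Poisoned TLP",
    13: "Flow Control Protocol Error",
    14: "Completion Timeout",
    15: "Completer Abort",
    16: "Unexpected Completion",
    17: "Receiver Overflow",
    18: "Malformed TLP",
    19: "ECRC Error",
    20: "Unsupported Request",
    21: "ACS Violation",
    22: "Uncorrectable Internal Error",
    23: "MC Blocked TLP",
    24: "AtomicOp Egress Blocked",
    25: "TLP Prefix Blocked",
}


def decode_aer_uncor(value):
    """Return the names of all known bits set in value, lowest bit first."""
    return [name for bit, name in sorted(AER_UNCOR_BITS.items())
            if value & (1 << bit)]


def fatal_by_severity(status, severity):
    """Return the reported errors whose severity bit is set (1 = fatal)."""
    return decode_aer_uncor(status & severity)


if __name__ == "__main__":
    # Values copied from the quoted log.
    aer_status = 0x00100000
    aer_mask = 0x01A10000
    aer_uncor_severity = 0x004EB030

    print("reported:", decode_aer_uncor(aer_status))
    print("masked:  ", decode_aer_uncor(aer_mask))
    print("fatal:   ", fatal_by_severity(aer_status, aer_uncor_severity))
```

With the values from this log, the only status bit set is bit 20,
Unsupported Request, which agrees with the "[20] Unsupported Request" line
printed by pcieport, and the severity value does not have that bit set.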