From mboxrd@z Thu Jan 1 00:00:00 1970
From: Borislav Petkov
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.
Date: Wed, 18 Apr 2018 19:54:15 +0200
Message-ID: <20180418175415.GJ4795@pd.tnic>
References: <20180416215903.7318-1-mr.nuke.me@gmail.com> <20180416215903.7318-4-mr.nuke.me@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Return-path:
Content-Disposition: inline
In-Reply-To: <20180416215903.7318-4-mr.nuke.me@gmail.com>
Sender: linux-kernel-owner@vger.kernel.org
To: Alexandru Gagniuc
Cc: linux-acpi@vger.kernel.org, linux-edac@vger.kernel.org, rjw@rjwysocki.net, lenb@kernel.org, tony.luck@intel.com, tbaicar@codeaurora.org, will.deacon@arm.com, james.morse@arm.com, shiju.jose@huawei.com, zjzhang@codeaurora.org, gengdongjiu@huawei.com, linux-kernel@vger.kernel.org, alex_gagniuc@dellteam.com, austin_bolen@dell.com, shyam_iyer@dell.com, devel@acpica.org, mchehab@kernel.org, robert.moore@intel.com, erik.schmauss@intel.com
List-Id: linux-acpi@vger.kernel.org

On Mon, Apr 16, 2018 at 04:59:02PM -0500, Alexandru Gagniuc wrote:
> Firmware is evil:
>  - ACPI was created to "try and make the 'ACPI' extensions somehow
>  Windows specific" in order to "work well with NT and not the others
>  even if they are open"
>  - EFI was created to hide "secret" registers from the OS.
>  - UEFI was created to allow compromising an otherwise secure OS.
>
> Never has firmware been created to solve a problem or simplify an
> otherwise cumbersome process. It is of no surprise then, that
> firmware nowadays intentionally crashes an OS.

I don't believe I'm saying this but, get rid of that rant. Even though I
agree, it doesn't belong in a commit message.

>
> One simple way to do that is to mark GHES errors as fatal. Firmware
> knows and even expects that an OS will crash in this case. And most
> OSes do.
>
> PCIe errors are notorious for having different definitions of "fatal".
> In ACPI, and other firmware standards, 'fatal' means the machine is
> about to explode and needs to be reset. In PCIe, on the other hand,
> fatal means that the link to a device has died. In the hotplug world
> of PCIe, this is akin to a USB disconnect. From that view, the "fatal"
> loss of a link is a normal event. To allow a machine to crash in this
> case is downright idiotic.
>
> To solve this, implement an IRQ safe handler for AER. This makes sure
> we have enough information to invoke the full AER handler later down
> the road, and tells ghes_notify_nmi that "It's all cool".
> ghes_notify_nmi() then gets calmed down a little, and doesn't panic().
>
> Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
> ---
>  drivers/acpi/apei/ghes.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 42 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
> index 2119c51b4a9e..e0528da4e8f8 100644
> --- a/drivers/acpi/apei/ghes.c
> +++ b/drivers/acpi/apei/ghes.c
> @@ -481,12 +481,26 @@ static int ghes_handle_aer(struct acpi_hest_generic_data *gdata, int sev)
>  	return ghes_severity(gdata->error_severity);
>  }
>  
> +static int ghes_handle_aer_irqsafe(struct acpi_hest_generic_data *gdata,
> +				   int sev)
> +{
> +	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
> +
> +	/* The system can always recover from AER errors. */
> +	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
> +		pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO)
> +		return CPER_SEV_RECOVERABLE;
> +
> +	return ghes_severity(gdata->error_severity);
> +}

Well, Tyler touched that AER error severity handling recently and we had
it all nicely documented in the comment above ghes_handle_aer().

Your ghes_handle_aer_irqsafe() graft basically bypasses
ghes_handle_aer() instead of being incorporated into it.

If all you wanna say is, the severity computation should go through all
the sections and look at each error's severity before making a decision,
then add that to ghes_severity() instead of doing that "deferrable"
severity dance.

And add the changes to the policy to the comment above
ghes_handle_aer(). I don't want any changes from people coming and going
and leaving us scratching heads why we did it this way.

And no need for those handlers and so on - make it simple first - then we
can talk more complex handling.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
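
For illustration only, the "go through all the sections and look at each error's severity before making a decision" idea from the review could be sketched roughly as below. This is a standalone userspace sketch, not kernel code: `enum sev`, `struct section`, `section_severity()` and `block_severity()` are hypothetical stand-ins rather than anything in ghes.c, and capping an AER section at "recoverable" only when both the device ID and AER info are valid is an assumption modelled on the check in the quoted patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical severity scale, ordered so that a larger value is worse. */
enum sev { SEV_NONE, SEV_CORRECTED, SEV_RECOVERABLE, SEV_PANIC };

/* Hypothetical stand-in for one section of an error status block. */
struct section {
	enum sev reported;	/* severity as reported by firmware */
	bool is_pcie_aer;	/* section describes a PCIe AER error */
	bool has_device_id;	/* device ID field is marked valid */
	bool has_aer_info;	/* AER info field is marked valid */
};

/*
 * Per-section policy (assumed): a PCIe AER section that carries enough
 * information to run the AER handler later is treated as recoverable at
 * worst, whatever severity firmware reported for it.
 */
static enum sev section_severity(const struct section *s)
{
	if (s->is_pcie_aer && s->has_device_id && s->has_aer_info &&
	    s->reported > SEV_RECOVERABLE)
		return SEV_RECOVERABLE;

	return s->reported;
}

/* Walk every section and return the worst per-section severity found. */
static enum sev block_severity(const struct section *secs, size_t n)
{
	enum sev worst = SEV_NONE;
	size_t i;

	for (i = 0; i < n; i++) {
		enum sev cur = section_severity(&secs[i]);

		if (cur > worst)
			worst = cur;
	}

	return worst;
}
```

With this shape, a block holding only a well-described AER section no longer reaches the panic level, while a block that also holds a fatal section of any other kind still does. The policy lives in one place, which is where a comment documenting it would go.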