From mboxrd@z Thu Jan 1 00:00:00 1970
From: Borislav Petkov
Subject: Re: [RFC PATCH v2 3/4] acpi: apei: Do not panic() when correctable errors are marked as fatal.
Date: Wed, 25 Apr 2018 19:15:57 +0200
Message-ID: <20180425171557.GC2597@pd.tnic>
References: <20180419154006.GE3600@pd.tnic>
 <977608e6-9f5d-c523-a78a-993ac5bfd55f@gmail.com>
 <20180419164528.GD5635@pd.tnic>
 <20180419190323.GF5635@pd.tnic>
 <20180422104849.GA32754@pd.tnic>
 <70c43399-e8e5-5061-b5a5-451deb5f02fa@gmail.com>
 <20180425140108.GA2597@pd.tnic>
 <48944beb-4e29-05cc-857b-7698e3dbe89b@gmail.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Return-path:
Content-Disposition: inline
In-Reply-To: <48944beb-4e29-05cc-857b-7698e3dbe89b@gmail.com>
Sender: linux-kernel-owner@vger.kernel.org
To: "Alex G."
Cc: linux-acpi@vger.kernel.org, linux-edac@vger.kernel.org,
 rjw@rjwysocki.net, lenb@kernel.org, tony.luck@intel.com,
 tbaicar@codeaurora.org, will.deacon@arm.com, james.morse@arm.com,
 shiju.jose@huawei.com, zjzhang@codeaurora.org, gengdongjiu@huawei.com,
 linux-kernel@vger.kernel.org, alex_gagniuc@dellteam.com,
 austin_bolen@dell.com, shyam_iyer@dell.com, devel@acpica.org,
 mchehab@kernel.org, robert.moore@intel.com, erik.schmauss@intel.com,
 Yazen Ghannam, Ard Biesheuvel
List-Id: linux-acpi@vger.kernel.org

On Wed, Apr 25, 2018 at 10:00:53AM -0500, Alex G. wrote:
> Firmware-first.

Ok, my guess was right.

> We could probably use more of the native AER print functions, but that's
> beyond the scope of this patch.

No no, this does not belong in this patchset.

> Like the exact thing that this patch series implements? :)

Exact thing? I don't think so.

No, your patchset is grafting some funky and questionable side-handler
which gets to see the PCIe errors first, out-of-line and then it
practically downgrades their severity outside of the error processing
flow.

What I've been telling you to do is to extend ghes_severity() to
give a lower-than-PANIC severity for CPER_SEC_PCIE errors first
so that the machine doesn't panic from them anymore and those PCIe
errors get processed in the normal error processing path down
through ghes_do_proc() and then land in ghes_handle_aer(). No ad hoc
->handle_irqsafe thing - just the normal straightforward error
processing path.

There, in ghes_handle_aer(), you do the check whether the device is
still there - i.e., you try to apply some heuristics to detect the error
type and why the system is complaining - you maybe even check whether
the NVMe device is still there - and *then* you do the proper recovery
action.

And you document for the future people looking at this code *why* you're
doing this.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.
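For readers following the thread, the flow suggested above can be
illustrated with a small userspace sketch. This is not kernel code and
not the patch under review: every sketch_/SKETCH_ name is hypothetical,
the choice of "recoverable" as the lower-than-PANIC level is an
assumption, and the device_present flag merely stands in for whatever
heuristic the real ghes_handle_aer() would apply.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's severity and section types. */
enum sketch_sev { SKETCH_SEV_NO, SKETCH_SEV_CORRECTED,
                  SKETCH_SEV_RECOVERABLE, SKETCH_SEV_PANIC };
enum sketch_sec { SKETCH_SEC_PCIE, SKETCH_SEC_OTHER };
enum sketch_act { SKETCH_ACT_AER_RECOVER, SKETCH_ACT_REMOVE_DEVICE };

struct sketch_section {
    enum sketch_sec type;
    enum sketch_sev fw_sev;   /* severity as reported by firmware */
    bool device_present;      /* assumed result of an "is it still there" probe */
};

/*
 * Severity decision, made in one place: a PCIe section that firmware
 * marked fatal is capped below PANIC (assumption: recoverable), so it
 * continues down the normal processing path instead of panicking.
 */
static enum sketch_sev sketch_section_severity(const struct sketch_section *s)
{
    if (s->type == SKETCH_SEC_PCIE && s->fw_sev == SKETCH_SEV_PANIC)
        return SKETCH_SEV_RECOVERABLE;
    return s->fw_sev;
}

/* Overall severity of a status block: the worst of its sections. */
static enum sketch_sev sketch_status_severity(const struct sketch_section *secs,
                                              size_t n)
{
    enum sketch_sev worst = SKETCH_SEV_NO;

    for (size_t i = 0; i < n; i++) {
        enum sketch_sev sev = sketch_section_severity(&secs[i]);

        if (sev > worst)
            worst = sev;
    }
    return worst;
}

/*
 * Stand-in for the AER handler at the end of the normal path: check
 * whether the device is still there, then pick the recovery action.
 */
static enum sketch_act sketch_handle_aer(const struct sketch_section *s)
{
    assert(s->type == SKETCH_SEC_PCIE);
    return s->device_present ? SKETCH_ACT_AER_RECOVER
                             : SKETCH_ACT_REMOVE_DEVICE;
}
```

The point of the arrangement is that the severity decision stays inside
the single severity function and the recovery decision stays inside the
handler, so no side-handler has to see the error out-of-line.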