From mboxrd@z Thu Jan 1 00:00:00 1970
From: Alexandru Gagniuc
Subject: [RFC PATCH v4 3/3] acpi: apei: Do not panic() on PCIe errors reported through GHES
Date: Mon, 30 Apr 2018 16:33:52 -0500
Message-ID: <20180430213358.8319-3-mr.nuke.me@gmail.com>
References: <20180430212836.7807-1-mr.nuke.me@gmail.com> <20180430213358.8319-1-mr.nuke.me@gmail.com>
Return-path:
In-Reply-To: <20180430213358.8319-1-mr.nuke.me@gmail.com>
Sender: linux-kernel-owner@vger.kernel.org
To: bp@alien8.de
Cc: alex_gagniuc@dellteam.com, austin_bolen@dell.com, shyam_iyer@dell.com,
	Alexandru Gagniuc, "Rafael J. Wysocki", Len Brown, Tony Luck,
	Mauro Carvalho Chehab, Robert Moore, Erik Schmauss, Tyler Baicar,
	Will Deacon, James Morse, Shiju Jose, "Jonathan (Zhixiong) Zhang",
	Dongjiu Geng, linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-edac@vger.kernel.org, devel@acpica.org
List-Id: linux-acpi@vger.kernel.org

The policy was to panic() when GHES said that an error is "Fatal".
This logic is wrong for several reasons, as it doesn't take into
account what caused the error.

PCIe fatal errors indicate that the link to a device is either
unstable or unusable. They don't indicate that the machine is on fire,
and they are not severe enough that we need to panic(). Instead of
relying on crackmonkey firmware, evaluate the error severity based on
what caused the error (GHES subsections).

Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
---
 drivers/acpi/apei/ghes.c | 45 ++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 42 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index c9f1971333c1..49318fba409c 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -425,8 +425,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
  * GHES_SEV_RECOVERABLE -> AER_NONFATAL
  * GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
  *     These both need to be reported and recovered from by the AER driver.
- * GHES_SEV_PANIC does not make it to this handling since the kernel must
- *     panic.
+ * GHES_SEV_PANIC -> AER_FATAL
  */
 static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
 {
@@ -459,6 +458,46 @@ static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
 #endif
 }
 
+/* PCIe errors should not cause a panic. */
+static int ghes_sec_pcie_severity(struct acpi_hest_generic_data *gdata)
+{
+	struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
+
+	if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
+	    pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&
+	    IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))
+		return GHES_SEV_RECOVERABLE;
+
+	return ghes_cper_severity(gdata->error_severity);
+}
+/*
+ * The severity field in the status block is oftentimes more severe than it
+ * needs to be. This makes it an unreliable metric for the severity. A more
+ * reliable way is to look at each subsection and correlate it with how well
+ * the error can be handled.
+ *   - SEC_PCIE: All PCIe errors can be handled by AER.
+ */
+static int ghes_severity(struct ghes *ghes)
+{
+	int worst_sev, sec_sev;
+	struct acpi_hest_generic_data *gdata;
+	const guid_t *section_type;
+	const struct acpi_hest_generic_status *estatus = ghes->estatus;
+
+	worst_sev = GHES_SEV_NO;
+	apei_estatus_for_each_section(estatus, gdata) {
+		section_type = (guid_t *)gdata->section_type;
+		sec_sev = ghes_cper_severity(gdata->error_severity);
+
+		if (guid_equal(section_type, &CPER_SEC_PCIE))
+			sec_sev = ghes_sec_pcie_severity(gdata);
+
+		worst_sev = max(worst_sev, sec_sev);
+	}
+
+	return worst_sev;
+}
+
 static void ghes_do_proc(struct ghes *ghes,
 			 const struct acpi_hest_generic_status *estatus)
 {
@@ -944,7 +983,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
 			ret = NMI_HANDLED;
 		}
 
-		sev = ghes_cper_severity(ghes->estatus->error_severity);
+		sev = ghes_severity(ghes);
 		if (sev >= GHES_SEV_PANIC) {
 			oops_begin();
 			ghes_print_queued_estatus();
-- 
2.14.3