* [PATCH v7 1/3] acpi: apei: Rename GHES_SEV_PANIC to GHES_SEV_FATAL
2018-05-25 15:53 [PATCH v7 0/3] acpi: apei: Drop panic() on fatal errors policy Alexandru Gagniuc
@ 2018-05-25 15:53 ` Alexandru Gagniuc
2018-05-25 15:53 ` [PATCH v7 2/3] acpi: apei: Rename ghes_severity() to ghes_cper_severity() Alexandru Gagniuc
` (2 subsequent siblings)
3 siblings, 0 replies; 5+ messages in thread
From: Alexandru Gagniuc @ 2018-05-25 15:53 UTC
To: linux-acpi
Cc: alex_gagniuc, austin_bolen, shyam_iyer, Alexandru Gagniuc,
Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, x86, Rafael J. Wysocki, Len Brown,
Mauro Carvalho Chehab, Robert Moore, Erik Schmauss, Tyler Baicar,
Will Deacon, James Morse, Jonathan (Zhixiong) Zhang, Dongjiu Geng,
linux-edac, linux-kernel, devel
'GHES_SEV_PANIC' implies that the kernel must panic. That was true
many years ago, when fatal errors could not be handled and recovered
from. However, this is no longer the case with PCIe AER and DPC errors.
The latter class of errors is contained at the hardware level.
'GHES_SEV_PANIC' is confusing because it implies a policy to crash the
system on fatal errors. Drop this questionable policy, and rename the
enum to 'GHES_SEV_FATAL' to better convey the meaning.
Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
---
arch/x86/kernel/cpu/mcheck/mce-apei.c | 2 +-
drivers/acpi/apei/ghes.c | 11 +++++------
drivers/edac/ghes_edac.c | 2 +-
include/acpi/ghes.h | 2 +-
4 files changed, 8 insertions(+), 9 deletions(-)
diff --git a/arch/x86/kernel/cpu/mcheck/mce-apei.c b/arch/x86/kernel/cpu/mcheck/mce-apei.c
index 2eee85379689..cbec89f5cdf0 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-apei.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-apei.c
@@ -53,7 +53,7 @@ void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
if (severity >= GHES_SEV_RECOVERABLE)
m.status |= MCI_STATUS_UC;
- if (severity >= GHES_SEV_PANIC) {
+ if (severity >= GHES_SEV_FATAL) {
m.status |= MCI_STATUS_PCC;
m.tsc = rdtsc();
}
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 1efefe919555..26a41bbe222b 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -281,10 +281,10 @@ static inline int ghes_severity(int severity)
case CPER_SEV_RECOVERABLE:
return GHES_SEV_RECOVERABLE;
case CPER_SEV_FATAL:
- return GHES_SEV_PANIC;
+ return GHES_SEV_FATAL;
default:
/* Unknown, go panic */
- return GHES_SEV_PANIC;
+ return GHES_SEV_FATAL;
}
}
@@ -425,8 +425,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
* GHES_SEV_RECOVERABLE -> AER_NONFATAL
* GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
* These both need to be reported and recovered from by the AER driver.
- * GHES_SEV_PANIC does not make it to this handling since the kernel must
- * panic.
+ * GHES_SEV_FATAL does not make it to this handler
*/
static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
{
@@ -705,7 +704,7 @@ static int ghes_proc(struct ghes *ghes)
if (rc)
goto out;
- if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_PANIC) {
+ if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_FATAL) {
__ghes_panic(ghes);
}
@@ -946,7 +945,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
}
sev = ghes_severity(ghes->estatus->error_severity);
- if (sev >= GHES_SEV_PANIC) {
+ if (sev >= GHES_SEV_FATAL) {
oops_begin();
ghes_print_queued_estatus();
__ghes_panic(ghes);
diff --git a/drivers/edac/ghes_edac.c b/drivers/edac/ghes_edac.c
index 68b6ee18bea6..8455758327d4 100644
--- a/drivers/edac/ghes_edac.c
+++ b/drivers/edac/ghes_edac.c
@@ -220,7 +220,7 @@ void ghes_edac_report_mem_error(struct ghes *ghes, int sev,
case GHES_SEV_RECOVERABLE:
type = HW_EVENT_ERR_UNCORRECTED;
break;
- case GHES_SEV_PANIC:
+ case GHES_SEV_FATAL:
type = HW_EVENT_ERR_FATAL;
break;
default:
diff --git a/include/acpi/ghes.h b/include/acpi/ghes.h
index 8feb0c866ee0..322f7ede24bd 100644
--- a/include/acpi/ghes.h
+++ b/include/acpi/ghes.h
@@ -49,7 +49,7 @@ enum {
GHES_SEV_NO = 0x0,
GHES_SEV_CORRECTED = 0x1,
GHES_SEV_RECOVERABLE = 0x2,
- GHES_SEV_PANIC = 0x3,
+ GHES_SEV_FATAL = 0x3,
};
/* From drivers/edac/ghes_edac.c */
--
2.14.3
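The mce-apei.c hunk above depends only on the ordering of the GHES_SEV_* values, which the rename leaves unchanged. A minimal userspace sketch of that threshold logic follows. The GHES_SEV_* values are copied from the include/acpi/ghes.h hunk; the SKETCH_MCI_STATUS_* bit positions and the function name are assumptions made for illustration and are not taken from this patch.

```c
#include <stdint.h>

/* Values as shown in the include/acpi/ghes.h hunk, after the rename. */
enum {
	GHES_SEV_NO = 0x0,
	GHES_SEV_CORRECTED = 0x1,
	GHES_SEV_RECOVERABLE = 0x2,
	GHES_SEV_FATAL = 0x3,
};

/* Assumed bit positions, for illustration only. */
#define SKETCH_MCI_STATUS_UC	(1ULL << 61)	/* uncorrected error */
#define SKETCH_MCI_STATUS_PCC	(1ULL << 57)	/* processor context corrupt */

/* Hypothetical helper: which status bits a given GHES severity would set. */
static uint64_t sketch_mce_status_bits(int severity)
{
	uint64_t status = 0;

	if (severity >= GHES_SEV_RECOVERABLE)
		status |= SKETCH_MCI_STATUS_UC;
	if (severity >= GHES_SEV_FATAL)
		status |= SKETCH_MCI_STATUS_PCC;

	return status;
}
```

Because GHES_SEV_FATAL keeps the value 0x3, every ">=" comparison in the series behaves the same before and after the rename.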
* [PATCH v7 2/3] acpi: apei: Rename ghes_severity() to ghes_cper_severity()
From: Alexandru Gagniuc @ 2018-05-25 15:53 UTC
To: linux-acpi
Cc: alex_gagniuc, austin_bolen, shyam_iyer, Alexandru Gagniuc,
Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, x86, Rafael J. Wysocki, Len Brown,
Mauro Carvalho Chehab, Robert Moore, Erik Schmauss, Tyler Baicar,
Will Deacon, James Morse, Jonathan (Zhixiong) Zhang, Dongjiu Geng,
linux-edac, linux-kernel, devel
ghes_severity() is a misnomer in this case, as it implies the severity
of the entire GHES structure. Instead, it maps one CPER value to one
GHES_SEV* value. ghes_cper_severity() is clearer.
Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
---
drivers/acpi/apei/ghes.c | 17 ++++++++---------
1 file changed, 8 insertions(+), 9 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 26a41bbe222b..1b22e18168f5 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -271,7 +271,7 @@ static void ghes_fini(struct ghes *ghes)
unmap_gen_v2(ghes);
}
-static inline int ghes_severity(int severity)
+static inline int ghes_cper_severity(int severity)
{
switch (severity) {
case CPER_SEV_INFORMATIONAL:
@@ -388,7 +388,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
#ifdef CONFIG_ACPI_APEI_MEMORY_FAILURE
unsigned long pfn;
int flags = -1;
- int sec_sev = ghes_severity(gdata->error_severity);
+ int sec_sev = ghes_cper_severity(gdata->error_severity);
struct cper_sec_mem_err *mem_err = acpi_hest_get_payload(gdata);
if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
@@ -467,10 +467,10 @@ static void ghes_do_proc(struct ghes *ghes,
guid_t *fru_id = &NULL_UUID_LE;
char *fru_text = "";
- sev = ghes_severity(estatus->error_severity);
+ sev = ghes_cper_severity(estatus->error_severity);
apei_estatus_for_each_section(estatus, gdata) {
sec_type = (guid_t *)gdata->section_type;
- sec_sev = ghes_severity(gdata->error_severity);
+ sec_sev = ghes_cper_severity(gdata->error_severity);
if (gdata->validation_bits & CPER_SEC_VALID_FRU_ID)
fru_id = (guid_t *)gdata->fru_id;
@@ -511,7 +511,7 @@ static void __ghes_print_estatus(const char *pfx,
char pfx_seq[64];
if (pfx == NULL) {
- if (ghes_severity(estatus->error_severity) <=
+ if (ghes_cper_severity(estatus->error_severity) <=
GHES_SEV_CORRECTED)
pfx = KERN_WARNING;
else
@@ -533,7 +533,7 @@ static int ghes_print_estatus(const char *pfx,
static DEFINE_RATELIMIT_STATE(ratelimit_uncorrected, 5*HZ, 2);
struct ratelimit_state *ratelimit;
- if (ghes_severity(estatus->error_severity) <= GHES_SEV_CORRECTED)
+ if (ghes_cper_severity(estatus->error_severity) <= GHES_SEV_CORRECTED)
ratelimit = &ratelimit_corrected;
else
ratelimit = &ratelimit_uncorrected;
@@ -704,9 +704,8 @@ static int ghes_proc(struct ghes *ghes)
if (rc)
goto out;
- if (ghes_severity(ghes->estatus->error_severity) >= GHES_SEV_FATAL) {
+ if (ghes_cper_severity(ghes->estatus->error_severity) >= GHES_SEV_FATAL)
__ghes_panic(ghes);
- }
if (!ghes_estatus_cached(ghes->estatus)) {
if (ghes_print_estatus(NULL, ghes->generic, ghes->estatus))
@@ -944,7 +943,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
ret = NMI_HANDLED;
}
- sev = ghes_severity(ghes->estatus->error_severity);
+ sev = ghes_cper_severity(ghes->estatus->error_severity);
if (sev >= GHES_SEV_FATAL) {
oops_begin();
ghes_print_queued_estatus();
--
2.14.3
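The one-value-to-one-value mapping that the new name describes can be sketched in userspace as below. The fatal, recoverable and default cases follow the hunks in patch 1/3. The informational and corrected cases are not visible in the quoted hunks and are filled in here as assumptions, as are the numeric CPER_SEV_* values and the sketch_ function name.

```c
/* Assumed CPER encoding; note that it is not ordered by badness. */
enum {
	CPER_SEV_RECOVERABLE,
	CPER_SEV_FATAL,
	CPER_SEV_CORRECTED,
	CPER_SEV_INFORMATIONAL,
};

/* GHES encoding from include/acpi/ghes.h; ordered from least to most severe. */
enum {
	GHES_SEV_NO = 0x0,
	GHES_SEV_CORRECTED = 0x1,
	GHES_SEV_RECOVERABLE = 0x2,
	GHES_SEV_FATAL = 0x3,
};

/* Hypothetical stand-in for ghes_cper_severity(): one CPER value in, one GHES value out. */
static int sketch_ghes_cper_severity(int severity)
{
	switch (severity) {
	case CPER_SEV_INFORMATIONAL:
		return GHES_SEV_NO;		/* assumption, not shown in the hunks */
	case CPER_SEV_CORRECTED:
		return GHES_SEV_CORRECTED;	/* assumption, not shown in the hunks */
	case CPER_SEV_RECOVERABLE:
		return GHES_SEV_RECOVERABLE;
	case CPER_SEV_FATAL:
		return GHES_SEV_FATAL;
	default:
		/* Unknown values are treated as fatal, as in the patch. */
		return GHES_SEV_FATAL;
	}
}
```

The function looks at a single severity field only. Callers such as ghes_do_proc() apply it once to the status block and once to each section, so it never computes a severity for the GHES structure as a whole.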
* [PATCH v7 3/3] acpi: apei: Do not panic() on PCIe errors reported through GHES
From: Alexandru Gagniuc @ 2018-05-25 15:53 UTC
To: linux-acpi
Cc: alex_gagniuc, austin_bolen, shyam_iyer, Alexandru Gagniuc,
Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, x86, Rafael J. Wysocki, Len Brown,
Mauro Carvalho Chehab, Robert Moore, Erik Schmauss, Tyler Baicar,
Will Deacon, James Morse, Jonathan (Zhixiong) Zhang, Dongjiu Geng,
linux-edac, linux-kernel, devel
As previously noted, the policy to panic on any "Fatal" GHES error is
not suitable for several classes of errors. The most notable class is
errors that the hardware already contains. The correct policy is to
achieve identical behavior to native error handling, i.e. when errors
are not reported through GHES. In some cases this may not be possible,
because we first have to exit the NMI, which requires the following
special considerations:
PCIe AER errors are contained and reported at the root port. On
DPC-capable hardware, containment can be done by all downstream ports. DPC
also has the added advantage of preventing future errors. Since these
errors stop at the root port, we can do all the work we need to exit
NMI and reach the error handler.
This patch does away with the mindless crashing of the system, and
correctly invokes the AER handler. When AER is not enabled, or the
firmware doesn't provide sufficient information to identify the source
of the error, the original panic() behavior is maintained.
Signed-off-by: Alexandru Gagniuc <mr.nuke.me@gmail.com>
---
drivers/acpi/apei/ghes.c | 43 +++++++++++++++++++++++++++++++++++++++++--
1 file changed, 41 insertions(+), 2 deletions(-)
diff --git a/drivers/acpi/apei/ghes.c b/drivers/acpi/apei/ghes.c
index 1b22e18168f5..f7126f6d8d52 100644
--- a/drivers/acpi/apei/ghes.c
+++ b/drivers/acpi/apei/ghes.c
@@ -425,7 +425,7 @@ static void ghes_handle_memory_failure(struct acpi_hest_generic_data *gdata, int
* GHES_SEV_RECOVERABLE -> AER_NONFATAL
* GHES_SEV_RECOVERABLE && CPER_SEC_RESET -> AER_FATAL
* These both need to be reported and recovered from by the AER driver.
- * GHES_SEV_FATAL does not make it to this handler
+ * GHES_SEV_FATAL -> AER_FATAL
*/
static void ghes_handle_aer(struct acpi_hest_generic_data *gdata)
{
@@ -837,6 +837,45 @@ static inline void ghes_sea_remove(struct ghes *ghes) { }
static struct llist_head ghes_estatus_llist;
static struct irq_work ghes_proc_irq_work;
+/* PCIe AER errors are safe if AER section contains enough info. */
+static int ghes_pcie_has_safe_handler(struct acpi_hest_generic_data *gdata)
+{
+ struct cper_sec_pcie *pcie_err = acpi_hest_get_payload(gdata);
+
+ if (pcie_err->validation_bits & CPER_PCIE_VALID_DEVICE_ID &&
+ pcie_err->validation_bits & CPER_PCIE_VALID_AER_INFO &&
+ IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER))
+ return true;
+
+ return false;
+}
+
+/*
+ * Do we have an error handler that we can safely reach? We're concerned with
+ * being able to notify an error handler by crossing the NMI/IRQ boundary,
+ * being able to schedule_work, and so forth.
+ */
+static int ghes_has_fatal_handler(struct ghes *ghes)
+{
+ int worst_sev, sec_sev;
+ bool safe = true;
+ struct acpi_hest_generic_data *gdata;
+ const guid_t *section_type;
+ const struct acpi_hest_generic_status *estatus = ghes->estatus;
+
+ apei_estatus_for_each_section(estatus, gdata) {
+ section_type = (guid_t *)gdata->section_type;
+
+ if (guid_equal(section_type, &CPER_SEC_PCIE))
+ safe = ghes_pcie_has_safe_handler(gdata);
+
+ if (!safe)
+ break;
+ }
+
+ return safe;
+}
+
/*
* NMI may be triggered on any CPU, so ghes_in_nmi is used for
* having only one concurrent reader.
@@ -944,7 +983,7 @@ static int ghes_notify_nmi(unsigned int cmd, struct pt_regs *regs)
}
sev = ghes_cper_severity(ghes->estatus->error_severity);
- if (sev >= GHES_SEV_FATAL) {
+ if ((sev >= GHES_SEV_FATAL) && !ghes_has_fatal_handler(ghes)) {
oops_begin();
ghes_print_queued_estatus();
__ghes_panic(ghes);
--
2.14.3
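The panic decision added by this patch can be modelled in userspace roughly as follows. This is a simplified sketch, not the kernel code: struct sketch_section, the SKETCH_* flag values, the aer_enabled parameter (standing in for IS_ENABLED(CONFIG_ACPI_APEI_PCIEAER)) and all sketch_ function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_GHES_SEV_FATAL		0x3	/* value from include/acpi/ghes.h */
#define SKETCH_PCIE_VALID_DEVICE_ID	0x0008	/* placeholder bit, illustration only */
#define SKETCH_PCIE_VALID_AER_INFO	0x0080	/* placeholder bit, illustration only */

/* Hypothetical, flattened view of one error section. */
struct sketch_section {
	bool is_pcie;			/* stands in for the CPER_SEC_PCIE GUID check */
	unsigned int validation_bits;
};

/* A PCIe section is "safe" only if device ID and AER info are both valid. */
static bool sketch_pcie_has_safe_handler(const struct sketch_section *sec,
					 bool aer_enabled)
{
	return (sec->validation_bits & SKETCH_PCIE_VALID_DEVICE_ID) &&
	       (sec->validation_bits & SKETCH_PCIE_VALID_AER_INFO) &&
	       aer_enabled;
}

/* Walk the sections and stop at the first one without a safe handler. */
static bool sketch_has_fatal_handler(const struct sketch_section *secs,
				     size_t count, bool aer_enabled)
{
	bool safe = true;
	size_t i;

	for (i = 0; i < count; i++) {
		if (secs[i].is_pcie)
			safe = sketch_pcie_has_safe_handler(&secs[i], aer_enabled);
		if (!safe)
			break;
	}

	return safe;
}

/* Mirrors the new condition in ghes_notify_nmi(). */
static bool sketch_should_panic(int sev, const struct sketch_section *secs,
				size_t count, bool aer_enabled)
{
	return sev >= SKETCH_GHES_SEV_FATAL &&
	       !sketch_has_fatal_handler(secs, count, aer_enabled);
}
```

Note that, following the loop as written in the patch, a record with no PCIe section leaves "safe" at its initial value, so this model does not panic for such a record. Whether that matches the intent stated in the commit message is a question for review.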
* Re: [PATCH v7 0/3] acpi: apei: Drop panic() on fatal errors policy
From: Rafael J. Wysocki @ 2018-05-27 9:36 UTC
To: Alexandru Gagniuc
Cc: ACPI Devel Maling List, alex_gagniuc, austin_bolen, shyam_iyer,
Tony Luck, Borislav Petkov, Thomas Gleixner, Ingo Molnar,
H. Peter Anvin, the arch/x86 maintainers, Rafael J. Wysocki,
Len Brown, Mauro Carvalho Chehab, Robert Moore, Erik Schmauss,
Tyler Baicar, Will Deacon, James Morse, Jonathan (Zhixiong) Zhang,
Dongjiu Geng, open list:EDAC-CORE, Linux Kernel Mailing List,
devel
On Fri, May 25, 2018 at 5:53 PM, Alexandru Gagniuc <mr.nuke.me@gmail.com> wrote:
> FFS (firmware-first) handling through APEI seems to have developed a
> policy to panic() on any fatal errors. This policy is completely
> independent of the non-FFS case. It is also inconsistent with the
> native error handlers, a number of which will recover the system from
> fatal errors.
>
> The purpose of this series is to obsolete this idiotic policy, with
> the motivation to enable identical handling of PCIe errors to native
> reporting.
>
>
> Rafael, this is copypaste from the previous patch series. I suspect
> you might have missed it last time, because you asked questions which
> were answered here. I've included it so you don't have to go digging
> old emails:
I didn't miss it, but I didn't like your answers.
Anyway, as a rule, no GHES/APEI patches are applied without an ACK
from either Boris or Tony, so you need to talk to them about the
patches.
Thanks,
Rafael