From: Tony Luck <tony.luck@intel.com>
To: Borislav Petkov <bp@alien8.de>
Cc: Yazen Ghannam <yazen.ghannam@amd.com>,
Smita Koralahalli <Smita.KoralahalliChannabasappa@amd.com>,
Carlos Bilbao <carlos.bilbao@amd.com>,
x86@kernel.org, linux-edac@vger.kernel.org,
linux-kernel@vger.kernel.org, Tony Luck <tony.luck@intel.com>
Subject: [PATCH 2/2] x86/mce: Dump the stack for recoverable machine checks in kernel context
Date: Thu, 22 Sep 2022 12:51:36 -0700
Message-ID: <20220922195136.54575-3-tony.luck@intel.com>
In-Reply-To: <20220922195136.54575-1-tony.luck@intel.com>
It isn't generally useful to dump the stack for a fatal machine check.
The error was detected by hardware when some parity or ECC check failed;
software isn't the problem.

But the kernel now has a few places where it can recover from a machine
check by treating it as an error, e.g. when copying parameters for system
calls from an application.

To ease the hunt for additional code flows where machine check errors
can be recovered, it is useful to know, for example, why the kernel was
copying a page. Perhaps that code sequence can be modified to handle
machine checks as errors.

Add a new machine check severity value to indicate when a stack dump
may be useful.
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
arch/x86/kernel/cpu/mce/internal.h | 1 +
arch/x86/kernel/cpu/mce/core.c | 11 +++++++++--
arch/x86/kernel/cpu/mce/severity.c | 2 +-
3 files changed, 11 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/cpu/mce/internal.h b/arch/x86/kernel/cpu/mce/internal.h
index 7e03f5b7f6bd..f03aaff79e39 100644
--- a/arch/x86/kernel/cpu/mce/internal.h
+++ b/arch/x86/kernel/cpu/mce/internal.h
@@ -18,6 +18,7 @@ enum severity_level {
MCE_UC_SEVERITY,
MCE_AR_SEVERITY,
MCE_PANIC_SEVERITY,
+ MCE_PANIC_STACKDUMP_SEVERITY,
};
extern struct blocking_notifier_head x86_mce_decoder_chain;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 2c8ec5c71712..69ec63eaa625 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -44,6 +44,7 @@
#include <linux/sync_core.h>
#include <linux/task_work.h>
#include <linux/hardirq.h>
+#include <linux/sched/debug.h>
#include <asm/intel-family.h>
#include <asm/processor.h>
@@ -254,6 +255,9 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
wait_for_panic();
barrier();
+ if (final->severity == MCE_PANIC_STACKDUMP_SEVERITY)
+ show_stack(NULL, NULL, KERN_DEFAULT);
+
bust_spinlocks(1);
console_verbose();
} else {
@@ -864,6 +868,7 @@ static __always_inline int mce_no_way_out(struct mce *m, char **msg, unsigned lo
struct pt_regs *regs)
{
char *tmp = *msg;
+ int severity;
int i;
for (i = 0; i < this_cpu_read(mce_num_banks); i++) {
@@ -876,9 +881,11 @@ static __always_inline int mce_no_way_out(struct mce *m, char **msg, unsigned lo
quirk_sandybridge_ifu(i, m, regs);
m->bank = i;
- if (mce_severity(m, regs, &tmp, true) >= MCE_PANIC_SEVERITY) {
+ severity = mce_severity(m, regs, &tmp, true);
+ if (severity >= MCE_PANIC_SEVERITY) {
mce_read_aux(m, i);
*msg = tmp;
+ m->severity = severity;
return 1;
}
}
@@ -994,7 +1001,7 @@ static void mce_reign(void)
*/
if (m && global_worst >= MCE_PANIC_SEVERITY) {
/* call mce_severity() to get "msg" for panic */
- mce_severity(m, NULL, &msg, true);
+ m->severity = mce_severity(m, NULL, &msg, true);
mce_panic("Fatal machine check", m, msg);
}
diff --git a/arch/x86/kernel/cpu/mce/severity.c b/arch/x86/kernel/cpu/mce/severity.c
index c4477162c07d..89d083c5bd06 100644
--- a/arch/x86/kernel/cpu/mce/severity.c
+++ b/arch/x86/kernel/cpu/mce/severity.c
@@ -174,7 +174,7 @@ static struct severity {
USER
),
MCESEV(
- PANIC, "Data load in unrecoverable area of kernel",
+ PANIC_STACKDUMP, "Data load in unrecoverable area of kernel",
SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
KERNEL
),
--
2.37.3
Thread overview: 9+ messages
2022-09-22 19:51 [PATCH 0/2] Dump stack after certain machine checks Tony Luck
2022-09-22 19:51 ` [PATCH 1/2] x86/mce: Use severity table to handle uncorrected errors in kernel Tony Luck
2022-09-22 19:51 ` Tony Luck [this message]
2022-10-31 16:44 ` [PATCH 2/2] x86/mce: Dump the stack for recoverable machine checks in kernel context Borislav Petkov
2022-10-31 17:13 ` Luck, Tony
2022-10-31 18:36 ` Borislav Petkov
2022-10-31 19:20 ` Luck, Tony
2022-10-31 10:30 ` [PATCH 0/2] Dump stack after certain machine checks Borislav Petkov
2022-11-01 17:36 ` Yazen Ghannam