From: Ingo Molnar <mingo@kernel.org>
To: Zhiquan Li <zhiquan1.li@intel.com>
Cc: x86@kernel.org, linux-edac@vger.kernel.org,
linux-kernel@vger.kernel.org, patches@lists.linux.dev,
bp@alien8.de, tony.luck@intel.com, naoya.horiguchi@nec.com,
Youquan Song <youquan.song@intel.com>
Subject: Re: [PATCH RESEND v2] x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic
Date: Mon, 2 Oct 2023 21:06:12 +0200
Message-ID: <ZRsUpM/XtPAE50Rm@gmail.com>
In-Reply-To: <20230914030539.1622477-1-zhiquan1.li@intel.com>
* Zhiquan Li <zhiquan1.li@intel.com> wrote:
> Kdump can exclude the HWPosion page to avoid touch the error page
> again, the prerequisite is the PG_hwpoison page flag is set.
> However, for some MCE fatal error cases, there is no opportunity
> to queue a task for calling memory_failure(), as a result,
> the capture kernel touches the error page again and panics.
>
> Add function mce_set_page_hwpoison_now() which marks a page as
> HWPoison before kernel panic() for MCE error, so that the dump
> program can check and skip the error page and prevent the capture
> kernel panic.
>
> [Tony: Changed TestSetPageHWPoison() to SetPageHWPoison()]
>
> Co-developed-by: Youquan Song <youquan.song@intel.com>
> Signed-off-by: Youquan Song <youquan.song@intel.com>
> Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
>
> ---
> V2 RESEND notes:
> - No changes on this, just rebasing as v6.6-rc1 is out.
> - Added the tag from Naoya.
> Link: https://lore.kernel.org/all/20230719211625.298785-1-tony.luck@intel.com/#t
>
> Changes since V1:
> - Revised the commit message as per Naoya's suggestion.
> - Replaced "TODO" comment in code with comments based on mailing list
> discussion on the lack of value in covering other page types.
> Link: https://lore.kernel.org/all/20230127015030.30074-1-tony.luck@intel.com/
> ---
> arch/x86/kernel/cpu/mce/core.c | 18 ++++++++++++++++++
> 1 file changed, 18 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 6f35f724cc14..2725698268f3 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -156,6 +156,22 @@ void mce_unregister_decode_chain(struct notifier_block *nb)
> }
> EXPORT_SYMBOL_GPL(mce_unregister_decode_chain);
>
> +/*
> + * Kdump can exclude the HWPosion page to avoid touch the error page again,
> + * the prerequisite is the PG_hwpoison page flag is set. However, for some
> + * MCE fatal error cases, there are no opportunity to queue a task
> + * for calling memory_failure(), as a result, the capture kernel panics.
> + * This function marks the page as HWPoison before kernel panic() for MCE.
> + */

The English in this commit is *atrocious*, both in the changelog and in
the comments - how on Earth did the 'Posion' typo, half a dozen other typos
and the bad grammar survive ~3 iterations and a Reviewed-by tag??

The version below fixes up the worst of it, but I suspect that's not the only
problem with this patch...

Thanks,

	Ingo
================>
From: Zhiquan Li <zhiquan1.li@intel.com>
Date: Thu, 14 Sep 2023 11:05:39 +0800
Subject: [PATCH] x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic

Kdump can exclude a HWPoison page so that the capture kernel does not
touch the error page again, but this requires the PG_hwpoison page flag
to be set. For some fatal MCE cases there is no opportunity to queue a
task that calls memory_failure(), so the flag is never set and the
capture kernel touches the error page again and panics.

Add the mce_set_page_hwpoison_now() function, which marks a page as
HWPoison before the kernel calls panic() for an MCE, so that the dump
program can check for and skip the error page and the capture kernel
does not panic.

[ Tony: Changed TestSetPageHWPoison() to SetPageHWPoison() ]
[ mingo: Fixed the comments & changelog ]
Co-developed-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Youquan Song <youquan.song@intel.com>
Signed-off-by: Zhiquan Li <zhiquan1.li@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lore.kernel.org/all/20230719211625.298785-1-tony.luck@intel.com/#t
---
arch/x86/kernel/cpu/mce/core.c | 18 ++++++++++++++++++
1 file changed, 18 insertions(+)
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 6f35f724cc14..1a14e8233c5a 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -156,6 +156,22 @@ void mce_unregister_decode_chain(struct notifier_block *nb)
}
EXPORT_SYMBOL_GPL(mce_unregister_decode_chain);
+/*
+ * Kdump can exclude the HWPoison page to avoid touching the error page again,
+ * the prerequisite is that the PG_hwpoison page flag is set. However, for some
+ * MCE fatal error cases, there is no opportunity to queue a task
+ * for calling memory_failure(), and as a result, the capture kernel panics.
+ * This function marks the page as HWPoison before kernel panic() for MCE.
+ */
+static void mce_set_page_hwpoison_now(unsigned long pfn)
+{
+	struct page *p;
+
+	p = pfn_to_online_page(pfn);
+	if (p)
+		SetPageHWPoison(p);
+}
+
 static void __print_mce(struct mce *m)
 {
 	pr_emerg(HW_ERR "CPU %d: Machine Check%s: %Lx Bank %d: %016Lx\n",
@@ -286,6 +302,8 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
 	if (!fake_panic) {
 		if (panic_timeout == 0)
 			panic_timeout = mca_cfg.panic_timeout;
+		if (final && (final->status & MCI_STATUS_ADDRV))
+			mce_set_page_hwpoison_now(final->addr >> PAGE_SHIFT);
 		panic(msg);
 	} else
 		pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);
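
For readers who want to see the idea outside the kernel, here is a minimal
user-space sketch of what the two hunks above do, plus a toy version of the
filtering a dump program could apply afterwards. This is an illustration under
stated assumptions, not the kernel's or any dump tool's implementation: every
sketch_* name is hypothetical, the memory map is a 16-entry array, 4 KiB pages
are assumed, and the NULL check on 'final' from the real hunk is left out
because the sketch passes status and address directly. The only value taken
from the real code is the position of the address-valid bit (bit 58 of
MCi_STATUS), which is what MCI_STATUS_ADDRV tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* All sketch_* names are hypothetical stand-ins, not kernel APIs. */
#define SKETCH_PAGE_SHIFT	12		/* assumes 4 KiB pages */
#define SKETCH_NR_PAGES		16		/* toy memory map size */
#define SKETCH_STATUS_ADDRV	(1ULL << 58)	/* address-valid bit in MCi_STATUS */

struct sketch_page {
	bool hwpoison;				/* models the PG_hwpoison flag */
};

static struct sketch_page sketch_mem_map[SKETCH_NR_PAGES];

/* Stand-in for pfn_to_online_page(): NULL if no page backs this PFN. */
static struct sketch_page *sketch_pfn_to_page(uint64_t pfn)
{
	return pfn < SKETCH_NR_PAGES ? &sketch_mem_map[pfn] : NULL;
}

/* Same shape as mce_set_page_hwpoison_now(): mark the page if it exists. */
static void sketch_set_hwpoison(uint64_t pfn)
{
	struct sketch_page *p = sketch_pfn_to_page(pfn);

	if (p)
		p->hwpoison = true;
}

/* Same shape as the mce_panic() hunk: act only on a valid error address. */
static void sketch_before_panic(uint64_t status, uint64_t addr)
{
	if (status & SKETCH_STATUS_ADDRV)
		sketch_set_hwpoison(addr >> SKETCH_PAGE_SHIFT);
}

/* Toy dump filter: count the pages a dump would copy, skipping poisoned ones. */
static size_t sketch_count_dumpable_pages(void)
{
	size_t n = 0;

	for (uint64_t pfn = 0; pfn < SKETCH_NR_PAGES; pfn++)
		if (!sketch_mem_map[pfn].hwpoison)
			n++;
	return n;
}

/* Self-check: an error at address 0x3abc poisons PFN 3 and nothing else. */
static void sketch_demo(void)
{
	sketch_before_panic(SKETCH_STATUS_ADDRV, 0x3abcULL);
	assert(sketch_mem_map[3].hwpoison);
	assert(sketch_count_dumpable_pages() == SKETCH_NR_PAGES - 1);
}
```

Calling sketch_demo() from a main() exercises the whole path. An address
reported without the address-valid bit, or one whose PFN has no backing page,
leaves the map untouched, which matches the two guards in the patch.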