From: Shuai Xue <xueshuai@linux.alibaba.com>
To: Borislav Petkov <bp@alien8.de>
Cc: keescook@chromium.org, tony.luck@intel.com, gpiccoli@igalia.com,
	rafael@kernel.org, lenb@kernel.org, james.morse@arm.com,
	tglx@linutronix.de, mingo@redhat.com,
	dave.hansen@linux.intel.com, x86@kernel.org, hpa@zytor.com,
	ardb@kernel.org, robert.moore@intel.com,
	linux-hardening@vger.kernel.org, linux-acpi@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-edac@vger.kernel.org,
	linux-efi@vger.kernel.org,
	acpica-devel@lists.linuxfoundation.org,
	baolin.wang@linux.alibaba.com
Subject: Re: [RFC PATCH v2 0/9] Use ERST for persistent storage of MCE and APEI errors
Date: Sat, 7 Oct 2023 15:15:45 +0800
Message-ID: <f654be8f-aa98-1bed-117b-ebdf96d23df1@linux.alibaba.com>
In-Reply-To: <20230928144345.GAZRWRIXH1Tfgn5EpO@fat_crate.local>



On 2023/9/28 22:43, Borislav Petkov wrote:
> On Mon, Sep 25, 2023 at 03:44:17PM +0800, Shuai Xue wrote:
>> After /dev/mcelog character device deprecated by commit 5de97c9f6d85
>> ("x86/mce: Factor out and deprecate the /dev/mcelog driver"), the
>> serialized MCE error record, of previous boot in persistent storage is not
>> collected via APEI ERST.
> 
> You lost me here. /dev/mcelog is deprecated but you can still use it and
> apei_write_mce() still happens.

Yes, you are right. apei_write_mce() still happens, so MCE records are
written to persistent storage and can be retrieved with apei_read_mce().
Previously, retrieval was performed by the mcelog package. However, mcelog
has been deprecated, and some distribution kernels, such as Arch's, are not
even built with the required configuration option
CONFIG_X86_MCELOG_LEGACY.[1]

So, IMHO, it's better to add a way to retrieve MCE records that works with
the new-generation rasdaemon solution.

> 
> Looking at your patches, you're adding this to ghes so how about you sit
> down first and explain your exact use case and what exactly you wanna
> do?
> 
> Thx.
> 

Sorry for the poor cover letter. I hope the following response can clarify
the matter.

Q1: What is the exact problem?

Traditionally, a fatal hardware error causes Linux to print an error log to
the console, e.g. via print_mce() or __ghes_print_estatus(), and then
reboot. On Linux, the primary method for obtaining debugging information
about a serious error or fault is the kdump mechanism. Kdump captures a
wealth of kernel and machine state and writes it to a file for post-mortem
debugging.

In certain scenarios, e.g. hosts/guests with root filesystems on NFS/iSCSI
where the networking software and/or hardware fails, kdump fails to collect
the hardware error context, leaving us unaware of what actually
occurred. In the public cloud scenario, multiple virtual machines run on a
single physical server, and if that server experiences a failure, it can
potentially impact multiple tenants. It is crucial for us to thoroughly
analyze the root causes of each instance failure in order to:

- Provide customers with a detailed explanation of the outage to reassure them.
- Collect the characteristics of the failures, such as ECC syndrome, to enable fault prediction.
- Explore potential solutions to prevent widespread outages.

In short, hardware error information needs to be serialized so that it is
available for post-mortem debugging.

Q2: What exactly do I want to do?

The MCE handler, do_machine_check(), saves the MCE record to persistent
storage, and mcelog retrieves it. Mcelog was deprecated when kernel 4.12
was released in 2017, and the help text of the configuration option
CONFIG_X86_MCELOG_LEGACY suggests switching to the new-generation
rasdaemon solution. The GHES handler does not yet support saving APEI
error records to persistent storage.

To make serialized hardware error information available for post-mortem
debugging, I want to:
- add support for saving APEI error records to flash via ERST before
  panicking,
- add support for retrieving MCE or APEI error records from flash and
  emitting the related tracepoints after the system boots successfully
  again, so that rasdaemon can collect them.
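
As a rough user-space sketch of the two steps above (this is not the patch
code): the fixed array below merely stands in for the ERST-backed flash,
printf() stands in for a tracepoint, and store_write(), store_replay() and
struct err_record are hypothetical names chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for ERST-backed flash: a small array of slots. */
#define STORE_SLOTS 4
#define PAYLOAD_MAX 64

enum rec_type { REC_MCE = 1, REC_APEI = 2 };

struct err_record {
	int used;                  /* slot holds a valid record */
	uint64_t id;               /* record identifier */
	enum rec_type type;        /* MCE or APEI generic error */
	char payload[PAYLOAD_MAX]; /* serialized error details */
};

static struct err_record store[STORE_SLOTS];
static uint64_t next_id = 1;

/*
 * Step 1 (fatal error path): serialize one record before panicking.
 * Returns 0 on success, -1 if the store is full.
 */
int store_write(enum rec_type type, const char *payload)
{
	assert(payload != NULL);

	for (int i = 0; i < STORE_SLOTS; i++) {
		if (store[i].used)
			continue;
		store[i].used = 1;
		store[i].id = next_id++;
		store[i].type = type;
		/* snprintf() truncates safely and always NUL-terminates. */
		snprintf(store[i].payload, PAYLOAD_MAX, "%s", payload);
		return 0;
	}
	return -1;
}

/*
 * Step 2 (next successful boot): read back every saved record, emit
 * it (printf() here, a tracepoint in the kernel), then clear the slot
 * so the record is not reported again. Returns the number replayed.
 */
int store_replay(void)
{
	int replayed = 0;

	for (int i = 0; i < STORE_SLOTS; i++) {
		if (!store[i].used)
			continue;
		printf("trace: type=%s id=%llu payload=\"%s\"\n",
		       store[i].type == REC_MCE ? "mce" : "apei",
		       (unsigned long long)store[i].id,
		       store[i].payload);
		store[i].used = 0;
		replayed++;
	}
	return replayed;
}
```

Calling store_write() once for an MCE record and once for an APEI record,
and then store_replay(), reports both records; a second store_replay()
reports none because the slots were cleared after the first replay.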


Best Regards,
Shuai


[1] https://wiki.archlinux.org/title/Machine-check_exception
