From: Dave Hansen <dave.hansen@intel.com>
To: Breno Leitao <leitao@debian.org>
Cc: "Rafael J. Wysocki" <rafael@kernel.org>,
Len Brown <lenb@kernel.org>, James Morse <james.morse@arm.com>,
Tony Luck <tony.luck@intel.com>, Borislav Petkov <bp@alien8.de>,
Robert Moore <robert.moore@intel.com>,
Thomas Gleixner <tglx@linutronix.de>,
Ingo Molnar <mingo@redhat.com>,
Dave Hansen <dave.hansen@linux.intel.com>,
x86@kernel.org, "H. Peter Anvin" <hpa@zytor.com>,
Hanjun Guo <guohanjun@huawei.com>,
Mauro Carvalho Chehab <mchehab@kernel.org>,
Mahesh J Salgaonkar <mahesh@linux.ibm.com>,
Oliver O'Halloran <oohall@gmail.com>,
Bjorn Helgaas <bhelgaas@google.com>,
linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org,
acpica-devel@lists.linux.dev, osandov@osandov.com,
xueshuai@linux.alibaba.com, konrad.wilk@oracle.com,
linux-edac@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
linux-pci@vger.kernel.org, kernel-team@meta.com, osandov@fb.com
Subject: Re: [PATCH v4] vmcoreinfo: Track and log recoverable hardware errors
Date: Fri, 1 Aug 2025 09:24:43 -0700
Message-ID: <0c045f1b-44d0-430c-9e8a-58b65dd84453@intel.com>
In-Reply-To: <f3yl424iqiyctgz4j36hzjrhkgae3a2h5smhalm2qbmq3nrpzd@oeuprthscfez>

On 8/1/25 08:13, Breno Leitao wrote:
> Hello Dave,
>
> On Fri, Aug 01, 2025 at 07:52:17AM -0700, Dave Hansen wrote:
>> On 8/1/25 05:31, Breno Leitao wrote:
>>> Introduce a generic infrastructure for tracking recoverable hardware
>>> errors (HW errors that are visible to the OS but does not cause a panic)
>>> and record them for vmcore consumption.
>> ...
>>
>> Are there patches for the consumer side of this, too? Or do humans
>> looking at crash dumps have to know what to go digging for?
>>
>> In either case, don't we need documentation for this new ABI?
>
> I have considered this, but the documentation for vmcoreinfo
> (admin-guide/kdump/vmcoreinfo.rst) solely documents what is explicitly
> exposed by vmcore, which differs from the nature of these counters.
>
> Where would be a good place to document it?

I'm not picky. But you also didn't quite answer the question I was asking.

Is this new data for humans or machines to read?

>>> @@ -1690,6 +1691,9 @@ noinstr void do_machine_check(struct pt_regs *regs)
>>> }
>>>
>>> out:
>>> + /* Given it didn't panic, mark it as recoverable */
>>> + hwerr_log_error_type(HWERR_RECOV_MCE);
>>> +
>>
>> Does "MCE" mean anything outside of x86?
>
> AFAIK this is a MCE concept.

I'm not really sure what that response means.

There are two problems here. First is that HWERR_RECOV_MCE is defined in
arch-generic code, but it may never get used by anything other than x86
when CONFIG_X86_MCE is enabled.

That also completely wastes space in your data structure when
CONFIG_X86_MCE=n. Not a huge deal as-is, but it's still a bit sloppy
and wasteful.
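
To illustrate the wasted-slot concern, here is a minimal userspace sketch
of one possible way to avoid it: compile the enum entry in only when the
option is set, so the array sized by the enum shrinks with it. This is an
illustration under stated assumptions, not something proposed in this
thread. HWERR_RECOV_OTHER, HWERR_RECOV_MAX and struct hwerr_info are
hypothetical names, and time() stands in for ktime_get_real_seconds().

```c
/*
 * Userspace sketch, not kernel code. Build with -DCONFIG_X86_MCE to
 * mimic CONFIG_X86_MCE=y. HWERR_RECOV_OTHER, HWERR_RECOV_MAX and
 * struct hwerr_info are hypothetical names used for illustration.
 */
#include <assert.h>
#include <time.h>

enum hwerr_error_type {
	HWERR_RECOV_OTHER,	/* hypothetical arch-neutral source */
#ifdef CONFIG_X86_MCE
	HWERR_RECOV_MCE,	/* present only when x86 MCE support is built */
#endif
	HWERR_RECOV_MAX,	/* number of sources actually compiled in */
};

struct hwerr_info {
	int count;		/* errors seen from this source */
	long long timestamp;	/* wall-clock seconds of the latest one */
};

/* Sized by the enum, so a compiled-out source takes no slot. */
static struct hwerr_info hwerr_data[HWERR_RECOV_MAX];

void hwerr_log_error_type(enum hwerr_error_type src)
{
	assert(src < HWERR_RECOV_MAX);
	hwerr_data[src].count++;
	/* time() stands in for ktime_get_real_seconds() in userspace. */
	hwerr_data[src].timestamp = (long long)time(NULL);
}
```

With this layout, any caller that names HWERR_RECOV_MCE would itself have
to be built only when the same option is set, otherwise the build breaks.
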
...
>>> + hwerr_data[src].count++;
>>> + hwerr_data[src].timestamp = ktime_get_real_seconds();
>>> +}
>>> +EXPORT_SYMBOL_GPL(hwerr_log_error_type);
>>
>> I'd also love to hear more about _actual_ users of this. Surely, someone
>> hit a real world problem and thought this would be a nifty solution. Who
>> was that? What problem did they hit? How does this help them?
>
> Yes, this has been extensively discussed in the very first version of
> the patch. Borislav raised the same question, which was discussed in the
> following link:
>
> https://lore.kernel.org/all/20250715125327.GGaHZPRz9QLNNO-7q8@fat_crate.local/

When someone raises a concern, we usually try to alleviate the concern
in a way that is self-contained in the next posting. A cover letter with
a full explanation would be one place to put the reasoning, for example.

But expecting future reviewers to plod through all the old threads isn't
really feasible.