From: Bert Karwatzki <spasswolf@web.de>
To: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: Borislav Petkov <bp@alien8.de>, Tony Luck <tony.luck@intel.com>,
linux-kernel@vger.kernel.org, linux-next@vger.kernel.org,
linux-edac@vger.kernel.org, linux-acpi@vger.kernel.org,
x86@kernel.org, rafael@kernel.org, qiuxu.zhuo@intel.com,
nik.borisov@suse.com, Smita.KoralahalliChannabasappa@amd.com,
spasswolf@web.de
Subject: Re: spurious mce Hardware Error messages in next-20250912
Date: Wed, 17 Sep 2025 17:33:29 +0200
Message-ID: <6e1eda7dd55f6fa30405edf7b0f75695cf55b237.camel@web.de>
In-Reply-To: <20250917144148.GA1313380@yaz-khff2.amd.com>
On Wednesday, 17.09.2025 at 10:41 -0400, Yazen Ghannam wrote:
> On Wed, Sep 17, 2025 at 09:13:11AM +0200, Bert Karwatzki wrote:
> > On Tuesday, 16.09.2025 at 22:27 +0200, Bert Karwatzki wrote:
> [...]
> >
> > I ran a test for 10h and got one real deferred error. I also looked through
> > older logs (which only go back to 2025-08-17), and they do not contain any
> > mce Hardware Error messages. Here's the output of
> >
> > $ dmesg | grep -E "mce|Hardware Error"
> > [...]
> > [10163.739261] [ T9326] mce: [Hardware Error]: Machine check events logged
> > [10163.739265] [ T9326] [Hardware Error]: Deferred error, no action required.
> > [10163.739267] [ T9326] [Hardware Error]: CPU:0 (19:50:0) MC14_STATUS[-|-|-|AddrV|PCC|-|-|Deferred|-|-]: 0x8700900800000000
> > [10163.739275] [ T9326] [Hardware Error]: Error Addr: 0x0095464100000020
> > [10163.739276] [ T9326] [Hardware Error]: IPID: 0x000700b040000000
> > [10163.739278] [ T9326] [Hardware Error]: L3 Cache Ext. Error Code: 0
> > [10163.739279] [ T9326] [Hardware Error]: cache level: RESV, tx: INSN
> > [...]
This seems to be a real deferred error.
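For reference, the status value in the log above can be decoded by hand. The following is only a minimal sketch: the bit positions are the MCI_STATUS_* definitions from the kernel's arch/x86/include/asm/mce.h, the flag selection is not exhaustive, and the names MCI_STATUS_FLAGS and decode_mca_status are made up for this illustration.

```python
# Bit positions follow the MCI_STATUS_* definitions in the Linux kernel's
# arch/x86/include/asm/mce.h. Only a subset of flags is listed here.
MCI_STATUS_FLAGS = {
    "Val": 63,
    "Over": 62,
    "UC": 61,
    "En": 60,
    "MiscV": 59,
    "AddrV": 58,
    "PCC": 57,
    "SyndV": 53,      # AMD-specific
    "Deferred": 44,   # AMD-specific
    "Poison": 43,     # AMD-specific
}


def decode_mca_status(status: int) -> list[str]:
    """Return the names of the listed flags set in status, highest bit first."""
    return [name for name, bit in MCI_STATUS_FLAGS.items() if (status >> bit) & 1]


if __name__ == "__main__":
    # MC14_STATUS value taken from the dmesg output quoted above.
    print(decode_mca_status(0x8700900800000000))
```

For 0x8700900800000000 this lists Val, AddrV, PCC and Deferred, which agrees with the AddrV, PCC and Deferred fields the kernel printed.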
>
> Summary so far:
> 1) Errors are found on CPU0 banks 11 and 14.
> 2) Errors are found during MCA timer-based polling.
> 3) The data is coming from MCA_DESTAT register.
> 4) The status bits are not consistent with documentation.
> 5) Likely these errors are not generating a deferred error interrupt.
>
> Bert, can you please collect the following data?
>
> 1) Output of "/proc/interrupts".
> a) The MCE, MCP, THR, and DFR lines are of interest.
> b) We should verify if any other notification types occur besides
> "MCP" (MCA polling).
This is from next-20250916 (without the debug patch). Unfortunately, I had
already rebooted after the test run with next-20250912 and your debug patch.
$ cat /proc/interrupts | grep -E "DFR|THR|MCE|MCP"
THR: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Threshold APIC interrupts
DFR: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Deferred Error APIC interrupts
MCE: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Machine check exceptions
MCP: 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39 39 Machine check polls
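To check whether any notification type other than MCP has fired, the per-CPU counters can be summed per row. This is a rough sketch, not an existing tool: sum_interrupt_rows is a hypothetical helper, and it assumes each row sits on a single line, as it does in /proc/interrupts itself.

```python
def sum_interrupt_rows(text: str, names: set[str]) -> dict[str, int]:
    """Sum the per-CPU counters of the named rows in /proc/interrupts text."""
    totals = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields or not fields[0].endswith(":"):
            continue
        name = fields[0][:-1]
        if name not in names:
            continue
        total = 0
        for field in fields[1:]:
            if not field.isdigit():
                break  # first non-numeric field starts the description
            total += int(field)
        totals[name] = total
    return totals


if __name__ == "__main__":
    # Sample rows rebuilt from the values above (16 CPUs).
    sample = "\n".join([
        "DFR: " + " ".join(["0"] * 16) + " Deferred Error APIC interrupts",
        "MCP: " + " ".join(["39"] * 16) + " Machine check polls",
    ])
    print(sum_interrupt_rows(sample, {"THR", "DFR", "MCE", "MCP"}))
```

On a live system the same function could be fed the contents of /proc/interrupts; a non-zero total for THR, DFR or MCE would mean a notification type other than polling occurred.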
> 2) Using an older kernel, read the MCA_DESTAT registers for L3 cache.
> a) CPU0 bank 11: "sudo rdmsr -p 0 0xC00020b8"
> b) CPU0 bank 14: "sudo rdmsr -p 0 0xC00020e8"
> c) If these are non-zero, then I think we can confirm that the
> spurious data was always there.
>
> Thanks,
> Yazen
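The two MSR addresses requested above follow from the per-bank register layout. A small sketch, assuming the kernel's msr-index.h value for MSR_AMD64_SMCA_MC0_DESTAT (0xc0002008) and a spacing of 0x10 between banks; SMCA_BANK_STRIDE and smca_destat_msr are names invented for this example.

```python
# Bank 0 MCA_DESTAT address, as defined in the kernel's msr-index.h.
MSR_AMD64_SMCA_MC0_DESTAT = 0xC0002008
# Hypothetical name: each bank's register block is 0x10 MSRs apart.
SMCA_BANK_STRIDE = 0x10


def smca_destat_msr(bank: int) -> int:
    """Return the MCA_DESTAT MSR address for the given bank number."""
    if bank < 0:
        raise ValueError("bank must be non-negative")
    return MSR_AMD64_SMCA_MC0_DESTAT + SMCA_BANK_STRIDE * bank


if __name__ == "__main__":
    for bank in (11, 14):
        print(f"bank {bank}: {smca_destat_msr(bank):#x}")
```

Banks 11 and 14 give 0xc00020b8 and 0xc00020e8, the addresses used in the rdmsr commands.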
This is from 6.12.43+deb13-amd64 (the stock Debian trixie kernel, currently the
oldest version I have installed):
# rdmsr -p 0 0xC00020b8
8700aa0800000000
# rdmsr -p 0 0xC00020e8
8700a28800000000
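Both values are non-zero. To compare them bit by bit, a short sketch like the following may help; set_bits is a hypothetical helper, and the only bit meanings assumed are bit 63 (Val) and bit 44 (Deferred) from the kernel's mce.h.

```python
def set_bits(value: int) -> list[int]:
    """Return the positions of all set bits in a 64-bit value, highest first."""
    return [bit for bit in range(63, -1, -1) if (value >> bit) & 1]


# Values returned by the two rdmsr commands above.
BANK11_DESTAT = 0x8700AA0800000000
BANK14_DESTAT = 0x8700A28800000000

if __name__ == "__main__":
    print("bank 11:", set_bits(BANK11_DESTAT))
    print("bank 14:", set_bits(BANK14_DESTAT))
    # Bits that differ between the two registers.
    print("differ :", set_bits(BANK11_DESTAT ^ BANK14_DESTAT))
```

Both values have bit 63 set and bit 44 clear, and they differ only in bits 43 and 39.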
Bert Karwatzki