linux-acpi.vger.kernel.org archive mirror
From: Bert Karwatzki <spasswolf@web.de>
To: Yazen Ghannam <yazen.ghannam@amd.com>
Cc: Borislav Petkov <bp@alien8.de>, Tony Luck <tony.luck@intel.com>,
	 linux-kernel@vger.kernel.org, linux-next@vger.kernel.org,
	 linux-edac@vger.kernel.org, linux-acpi@vger.kernel.org,
	x86@kernel.org,  rafael@kernel.org, qiuxu.zhuo@intel.com,
	nik.borisov@suse.com,  Smita.KoralahalliChannabasappa@amd.com,
	spasswolf@web.de
Subject: Re: spurious mce Hardware Error messages in next-20250912
Date: Wed, 17 Sep 2025 17:33:29 +0200
Message-ID: <6e1eda7dd55f6fa30405edf7b0f75695cf55b237.camel@web.de>
In-Reply-To: <20250917144148.GA1313380@yaz-khff2.amd.com>

On Wednesday, 17.09.2025 at 10:41 -0400, Yazen Ghannam wrote:
> On Wed, Sep 17, 2025 at 09:13:11AM +0200, Bert Karwatzki wrote:
> > On Tuesday, 16.09.2025 at 22:27 +0200, Bert Karwatzki wrote:
> [...]
> > 
> > I ran a test for 10h and got one real deferred error. I also looked through
> > older logs (which only go back to 2025-08-17), and they do not contain any
> > mce Hardware Error messages. Here's the output of
> > 
> > $ dmesg | grep -E "mce|Hardware Error"
> > [...]
> > [10163.739261] [   T9326] mce: [Hardware Error]: Machine check events logged
> > [10163.739265] [   T9326] [Hardware Error]: Deferred error, no action required.
> > [10163.739267] [   T9326] [Hardware Error]: CPU:0 (19:50:0) MC14_STATUS[-|-|-|AddrV|PCC|-|-|Deferred|-|-]: 0x8700900800000000
> > [10163.739275] [   T9326] [Hardware Error]: Error Addr: 0x0095464100000020
> > [10163.739276] [   T9326] [Hardware Error]: IPID: 0x000700b040000000
> > [10163.739278] [   T9326] [Hardware Error]: L3 Cache Ext. Error Code: 0
> > [10163.739279] [   T9326] [Hardware Error]: cache level: RESV, tx: INSN
> > [...]

This seems to be a real deferred error.
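
As a cross-check of the flag list in the MC14_STATUS line above, here is a
minimal sketch that decodes the raw value by hand. The bit positions are the
ones I recall from the MCI_STATUS_* macros in arch/x86/include/asm/mce.h; treat
them as an assumption and verify against the header or the AMD PPR. The helper
name decode_status is made up for this example.

```python
# Bit positions assumed from the kernel's MCI_STATUS_* definitions
# (arch/x86/include/asm/mce.h); quoted from memory, please verify.
STATUS_BITS = {
    "Val": 63, "Over": 62, "UC": 61, "En": 60, "MiscV": 59,
    "AddrV": 58, "PCC": 57, "TCC": 55, "SyndV": 53,
    "CECC": 46, "UECC": 45, "Deferred": 44, "Poison": 43, "Scrub": 40,
}


def decode_status(value: int) -> list[str]:
    """Return the names of all known flag bits that are set in value."""
    return [name for name, bit in STATUS_BITS.items() if (value >> bit) & 1]


# MC14_STATUS value taken from the dmesg output quoted above.
print(decode_status(0x8700900800000000))
# prints ['Val', 'AddrV', 'PCC', 'Deferred']
```

Under these assumptions the result agrees with the AddrV, PCC and Deferred
fields that the kernel printed.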

> 
> Summary so far:
> 1) Errors are found on CPU0 banks 11 and 14.
> 2) Errors are found during MCA timer-based polling.
> 3) The data is coming from MCA_DESTAT register.
> 4) The status bits are not consistent with documentation.
> 5) Likely these errors are not generating a deferred error interrupt.
> 
> Bert, can you please collect the following data?
> 
> 1) Output of "/proc/interrupts".
>   a) The MCE, MCP, THR, and DFR lines are of interest.
>   b) We should verify if any other notification types occur besides
>      "MCP" (MCA polling).

This is from next-20250916 (without the debug patch). Unfortunately, I had
already rebooted after the test run with next-20250912 and your debug patch.

$ cat /proc/interrupts | grep -E "DFR|THR|MCE|MCP"
 THR:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Threshold APIC interrupts
 DFR:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Deferred Error APIC interrupts
 MCE:          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0          0   Machine check exceptions
 MCP:         39         39         39         39         39         39         39         39         39         39         39         39         39         39         39         39   Machine check polls
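
To compare such counters between boots more easily, the per-CPU columns can be
summed per notification type. This is only a sketch: the function name
mca_interrupt_totals and the helper row are made up here, and the sample text is
rebuilt from the 16-CPU values shown above rather than read from the live
/proc/interrupts file.

```python
def mca_interrupt_totals(text, labels=("MCE", "MCP", "THR", "DFR")):
    """Sum the per-CPU counters of the selected /proc/interrupts rows."""
    totals = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        label = label.strip()
        if not sep or label not in labels:
            continue  # header line, numeric IRQ rows, or rows we do not care about
        total = 0
        for token in rest.split():
            if not token.isdigit():
                break  # reached the textual description at the end of the row
            total += int(token)
        totals[label] = total
    return totals


def row(label, count, desc, cpus=16):
    """Rebuild one row in the layout shown above (hypothetical helper)."""
    return f" {label}:" + f"{count:>11}" * cpus + f"   {desc}"


sample = "\n".join([
    row("THR", 0, "Threshold APIC interrupts"),
    row("DFR", 0, "Deferred Error APIC interrupts"),
    row("MCE", 0, "Machine check exceptions"),
    row("MCP", 39, "Machine check polls"),
])

print(mca_interrupt_totals(sample))
# prints {'THR': 0, 'DFR': 0, 'MCE': 0, 'MCP': 624}
```

On a live system the same function could be fed the contents of
/proc/interrupts; only the MCP total is non-zero for the values above.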



> 2) Using an older kernel, read the MCA_DESTAT registers for L3 cache.
>   a) CPU0 bank 11: "sudo rdmsr -p 0 0xC00020b8"
>   b) CPU0 bank 14: "sudo rdmsr -p 0 0xC00020e8"
>   c) If these are non-zero, then I think we can confirm that the
>      spurious data was always there.
> 
> Thanks,
> Yazen

This is from 6.12.43+deb13-amd64 (the stock Debian trixie kernel, currently the
oldest version I have installed):

# rdmsr -p 0 0xC00020b8
8700aa0800000000
# rdmsr -p 0 0xC00020e8
8700a28800000000
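
For reference, a small sketch of how the two MSR addresses relate to the bank
numbers, plus a check of two bits in the values read above. It assumes the SMCA
layout as I understand it from the kernel (MCA_DESTAT of bank 0 at 0xC0002008,
0x10 per bank) and the Val/Deferred bit positions 63 and 44; the names
destat_msr and bit_set are made up for this example.

```python
SMCA_MC0_DESTAT = 0xC0002008  # assumed: MCA_DESTAT MSR of bank 0
SMCA_BANK_STRIDE = 0x10       # assumed: distance between consecutive banks

VAL_BIT = 63       # assumed position of the Val flag
DEFERRED_BIT = 44  # assumed position of the Deferred flag


def destat_msr(bank: int) -> int:
    """Return the MCA_DESTAT MSR address for the given bank."""
    return SMCA_MC0_DESTAT + SMCA_BANK_STRIDE * bank


def bit_set(value: int, bit: int) -> bool:
    """Return True if the given bit is set in value."""
    return bool((value >> bit) & 1)


# Values read with rdmsr on CPU0, as shown above.
readings = {11: 0x8700AA0800000000, 14: 0x8700A28800000000}

for bank, value in readings.items():
    print(
        f"bank {bank}: MSR {destat_msr(bank):#x} "
        f"Val={bit_set(value, VAL_BIT)} Deferred={bit_set(value, DEFERRED_BIT)}"
    )
```

If the assumed bit layout is right, both registers have Val set but Deferred
clear, so the data is non-zero on the older kernel as well.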


Bert Karwatzki
