public inbox for linux-next@vger.kernel.org
From: Yazen Ghannam <yazen.ghannam@amd.com>
To: Bert Karwatzki <spasswolf@web.de>
Cc: Nikolay Borisov <nik.borisov@suse.com>,
	Borislav Petkov <bp@alien8.de>, Tony Luck <tony.luck@intel.com>,
	linux-kernel@vger.kernel.org, linux-next@vger.kernel.org,
	linux-edac@vger.kernel.org, linux-acpi@vger.kernel.org,
	x86@kernel.org, rafael@kernel.org, qiuxu.zhuo@intel.com,
	Smita.KoralahalliChannabasappa@amd.com
Subject: Re: spurious (?) mce Hardware Error messages in v6.19
Date: Mon, 16 Feb 2026 15:25:46 -0500
Message-ID: <20260216202546.GA420258@yaz-khff2.amd.com>
In-Reply-To: <21ba47fa8893b33b94370c2a42e5084cf0d2e975.camel@web.de>

On Thu, Feb 12, 2026 at 01:50:05PM +0100, Bert Karwatzki wrote:
> I couldn't test this patch as I was busy figuring out this:
> 243b467dea17 Revert "drm/amd: Check if ASPM is enabled from PCIe subsystem"
> but with this done I could do some testing on v6.19. The periodic bogus mce
> errors are gone because smca_should_log_poll_error() usually returns false, but
> I still get some error messages for which I'm not sure if they are real errors.
> 
> I monitored smca_should_log_poll_error() like this (in v6.19 (errors do not occur in v6.18)):
> 
> static bool smca_should_log_poll_error(struct mce *m)
> {
> 	if (m->status & MCI_STATUS_VAL) {
> 		printk(KERN_INFO "%s: 0\n", __func__);
> 		return true;
> 	}
> 
> 	m->status = mce_rdmsrq(MSR_AMD64_SMCA_MCx_DESTAT(m->bank));
> 	if ((m->status & MCI_STATUS_VAL) && (m->status & MCI_STATUS_DEFERRED)) {
> 		printk(KERN_INFO "%s: 1\n", __func__);
> 		m->kflags |= MCE_CHECK_DFR_REGS;
> 		return true;
> 	}
> 
> 	printk(KERN_INFO "%s: 2\n", __func__);
> 	return false;
> }
> 
> And get these error messages (usually just once or twice per boot)
> 
> Examples from v6.19:
> $ grep -aE "Hardware Error|smca_should_log_poll_error: 1" /var/log/kern.log
> 
> 2026-02-10T16:15:01.001203+01:00 lisa kernel: [    C0] smca_should_log_poll_error: 1
> 2026-02-10T16:15:01.001815+01:00 lisa kernel: [T45426] mce: [Hardware Error]: Machine check events logged
> 2026-02-10T16:15:01.001818+01:00 lisa kernel: [T45426] [Hardware Error]: Deferred error, no action required.
> 2026-02-10T16:15:01.001819+01:00 lisa kernel: [T45426] [Hardware Error]: CPU:0 (19:50:0) MC14_STATUS[-|-|-|AddrV|PCC|-|-|Deferred|-|-]: 0x8700900800000000
> 2026-02-10T16:15:01.001821+01:00 lisa kernel: [T45426] [Hardware Error]: Error Addr: 0x01b3877c00000020
> 2026-02-10T16:15:01.001822+01:00 lisa kernel: [T45426] [Hardware Error]: IPID: 0x000700b040000000
> 2026-02-10T16:15:01.001831+01:00 lisa kernel: [T45426] [Hardware Error]: L3 Cache Ext. Error Code: 0
> 2026-02-10T16:15:01.001832+01:00 lisa kernel: [T45426] [Hardware Error]: cache level: RESV, tx: INSN
> 
> 2026-02-11T14:24:13.358353+01:00 lisa kernel: [    C0] smca_should_log_poll_error: 1
> 2026-02-11T14:24:13.358832+01:00 lisa kernel: [T310371] mce: [Hardware Error]: Machine check events logged
> 2026-02-11T14:24:13.361773+01:00 lisa kernel: [T310371] [Hardware Error]: Deferred error, no action required.
> 2026-02-11T14:24:13.361778+01:00 lisa kernel: [T310371] [Hardware Error]: CPU:0 (19:50:0) MC11_STATUS[-|-|-|AddrV|-|-|SyndV|UECC|Deferred|-|-]: 0x8424b0c8009d011e
> 2026-02-11T14:24:13.361781+01:00 lisa kernel: [T310371] [Hardware Error]: Error Addr: 0x01f8a43400000020
> 2026-02-11T14:24:13.361782+01:00 lisa kernel: [T310371] [Hardware Error]: IPID: 0x000700b040000000, Syndrome: 0x0000000000000042
> 2026-02-11T14:24:13.361787+01:00 lisa kernel: [T310371] [Hardware Error]: L3 Cache Ext. Error Code: 29
> 2026-02-11T14:24:13.361788+01:00 lisa kernel: [T310371] [Hardware Error]: cache level: L2, tx: RESV, mem-tx: RD
> 
> 2026-02-12T10:07:28.804529+01:00 lisa kernel: [    C0] smca_should_log_poll_error: 1
> 2026-02-12T10:07:28.805020+01:00 lisa kernel: [T393396] mce: [Hardware Error]: Machine check events logged
> 2026-02-12T10:07:28.805028+01:00 lisa kernel: [T393396] [Hardware Error]: Deferred error, no action required.
> 2026-02-12T10:07:28.805029+01:00 lisa kernel: [T393396] [Hardware Error]: CPU:0 (19:50:0) MC11_STATUS[-|-|-|AddrV|PCC|-|-|Deferred|-|-]: 0x8700900800000000
> 2026-02-12T10:07:28.805030+01:00 lisa kernel: [T393396] [Hardware Error]: Error Addr: 0x01300a9d00000020
> 2026-02-12T10:07:28.805031+01:00 lisa kernel: [T393396] [Hardware Error]: IPID: 0x000700b040000000
> 2026-02-12T10:07:28.805033+01:00 lisa kernel: [T393396] [Hardware Error]: L3 Cache Ext. Error Code: 0
> 2026-02-12T10:07:28.805034+01:00 lisa kernel: [T393396] [Hardware Error]: cache level: RESV, tx: INSN

The first one and the third one are definitely bogus.

This is evident because the "PCC" (Processor Context Corrupt) bit is
set. A real error with PCC set would result in a machine check
exception, and the kernel would panic.

The second one seems mostly valid. However, a deferred error should
cause a deferred error interrupt, and in this case it was found through
timer polling. The similarity with the others makes it suspect too.

I think we should filter these out. You can ignore these for now, if
they aren't regularly occurring like before.

> 
> Are the "Error Addr" reported here supposed to be physical addresses of memory?
> If they are they don't seem to make sense to me given the following output of
> "cat /proc/iomem":
> 

The "Error Addr" is the value of the MCA_ADDR register. This register is
formatted based on what the bank represents and the error code. In this
case, you have an "L3 cache" error. So the address is some
implementation-specific format with set, way, index, etc. But I wouldn't
give much attention to this, since the errors are bogus.

Thanks for following up on this topic. I'll see about a filtering
mechanism. My first thought is to sanity check the status bits, etc.,
and filter anything that isn't consistent with the architecture. And we
can have an option to remove this filtering for those who want all the
data for doing hardware checkout.

Thanks,
Yazen
