From: Yazen Ghannam <yazen.ghannam@amd.com>
To: "Luck, Tony" <tony.luck@intel.com>, Borislav Petkov <bp@alien8.de>
Cc: yazen.ghannam@amd.com, Aristeu Rozanski <aris@ruivo.org>,
"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
Aristeu Rozanski <aris@redhat.com>
Subject: Re: [PATCH] mce: prevent concurrent polling of MCE events
Date: Fri, 9 Jun 2023 12:00:53 -0400
Message-ID: <facb48e2-73a0-e780-4fda-2ecbdfd3b48b@amd.com>
In-Reply-To: <SJ1PR11MB60831A82C6BACEBCCC50E397FC51A@SJ1PR11MB6083.namprd11.prod.outlook.com>
On 6/9/23 11:24 AM, Luck, Tony wrote:
>> So "UCNA" is like the AMD "Deferred" severity it seems. How is this
>> different from "Action Optional"? I've been equating DFR and AO.
>
Thanks Tony.
> Categories of uncorrected memory errors on Intel
>
> 1) "UCNA" ... these are logged by memory controllers when ECC says that a memory read cannot
> supply correct data. If CMCI is enabled, signaled with CMCI. Note that these will occur on prefetch
> or speculative reads as well as "regular" reads. The data might never be consumed.
>
Yes, this is like AMD.
Key differences:
* Logged using "Deferred" severity. However, deferred errors
  aren't always from the memory controller. So there still needs
  to be an error code check in addition to severity.
* Signalled with a Deferred error APIC interrupt. This way UC
  errors can be signalled independently of CEs.
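
A rough sketch of that two-part check, not the kernel's actual code. The bit
positions mirror the Linux MCI_STATUS_VAL and MCI_STATUS_DEFERRED definitions;
is_deferred_memory_error() and its from_memory_controller argument are
hypothetical stand-ins for whatever bank-type or error-code test identifies a
memory controller error.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as used by the Linux MCI_STATUS_* macros. */
#define MCI_STATUS_VAL      (1ULL << 63)  /* register contents are valid */
#define MCI_STATUS_DEFERRED (1ULL << 44)  /* AMD: uncorrected error, deferred */

/*
 * Hypothetical helper: severity alone is not enough, because deferred
 * errors can come from sources other than the memory controller. The
 * caller supplies the result of the separate error code check.
 */
static bool is_deferred_memory_error(uint64_t status, bool from_memory_controller)
{
	if (!(status & MCI_STATUS_VAL))
		return false;
	if (!(status & MCI_STATUS_DEFERRED))
		return false;
	return from_memory_controller;
}
```

Keeping the source check as a separate input makes it explicit that both
conditions have to hold before treating the error as a memory error.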
> 2) "SRAO". This is now legacy. Pre-Icelake systems log these for uncorrected errors found by
> the patrol scrubber, and for evictions of poison from L3 cache (if that poison was due to an ECC
> failure in the cache itself, not for poison created elsewhere and currently resident in the cache).
> Signaled with a broadcast machine check. Icelake and newer systems use UCNA for these.
>
Yes, this mostly falls within the Deferred/UCNA case for AMD also.
> 3) "SRAR". Attempt to consume poison (either data read or instruction fetch). Signaled with
> machine check. Pre-Skylake this was broadcast. Skylake and newer have an opt-in mechanism
> to request #MC delivery to just the logical CPU trying to consume (Linux always opts in).
>
>
> UCNA = Uncorrected No Action required. But Linux does take action to try to offline the page.
>
That's right. So we'll use this for the Deferred memory error case. But
there will need to be updates for these to be actionable on AMD systems.
> SRAO = Software Recoverable Action Optional. As with UCNA Linux tries to offline the page.
>
Okay, so ignore these going forward.
> SRAR = Software Recoverable Action Required. Linux will replace a clean page with a new copy
> if it can (think read-only text pages mapped from ELF executable). If not it sends SIGBUS to the
> application. Some SRAR in the kernel are recoverable ... see the copy_mc*() functions.
>
Yep, this mostly works. There's still an AMD IF Unit quirk that needs to
be handled. And the kernel recovery cases need to be tested.
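
For reference, on Intel parts that support software error recovery, the three
categories quoted above are distinguished by the UC, PCC, S, and AR flags in
MCi_STATUS. A minimal sketch, assuming the bit positions used by the Linux
MCI_STATUS_* macros; classify_uc() and the enum are illustrative names, not
kernel APIs.

```c
#include <stdint.h>

/* Bit positions as used by the Linux MCI_STATUS_* macros. */
#define MCI_STATUS_VAL (1ULL << 63)  /* register contents are valid */
#define MCI_STATUS_UC  (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_PCC (1ULL << 57)  /* processor context corrupt */
#define MCI_STATUS_S   (1ULL << 56)  /* signaled with a machine check */
#define MCI_STATUS_AR  (1ULL << 55)  /* action required */

enum uc_class { UC_NONE, UC_FATAL, UC_UCNA, UC_SRAO, UC_SRAR };

/* Illustrative only: map status flags to the categories discussed above. */
static enum uc_class classify_uc(uint64_t status)
{
	if (!(status & MCI_STATUS_VAL) || !(status & MCI_STATUS_UC))
		return UC_NONE;   /* invalid, or not an uncorrected error */
	if (status & MCI_STATUS_PCC)
		return UC_FATAL;  /* context corrupt: not recoverable */
	if (!(status & MCI_STATUS_S))
		return UC_UCNA;   /* not signaled via #MC (AR is ignored here) */
	return (status & MCI_STATUS_AR) ? UC_SRAR : UC_SRAO;
}
```

The sketch ignores AR when S is clear, since that combination is not one of
the categories listed above.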
Thanks again!
-Yazen