From: Dave Hansen <dave.hansen@intel.com>
To: "Sironi, Filippo" <sironi@amazon.de>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Cc: "tony.luck@intel.com" <tony.luck@intel.com>,
"bp@alien8.de" <bp@alien8.de>,
"tglx@linutronix.de" <tglx@linutronix.de>,
"mingo@redhat.com" <mingo@redhat.com>,
"dave.hansen@linux.intel.com" <dave.hansen@linux.intel.com>,
"x86@kernel.org" <x86@kernel.org>,
"hpa@zytor.com" <hpa@zytor.com>,
"linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>
Subject: Re: [PATCH] x86/mce: Increase the size of the MCE pool from 2 to 8 pages
Date: Thu, 12 Oct 2023 08:49:39 -0700
Message-ID: <6591377b-7911-444b-abf9-cfc978472d76@intel.com>
In-Reply-To: <EDD08AA3-C404-4DB6-96BA-2B25519B2496@amazon.de>

On 10/12/23 04:46, Sironi, Filippo wrote:
> There's correlation across the errors that we're seeing; indeed,
> we're looking at the same row being responsible for multiple CPUs
> tripping and running into #MC. I still don't like the full lack of
> visibility; it's not uncommon in a large fleet to have to take a
> server out of production, replace a DIMM, and shortly afterwards take
> it out of production again to replace another DIMM just because some
> of the errors weren't properly logged.

So you had two nearly simultaneous DIMM failures. The first failed,
filled up the buffer and then the second failed, but there was no room.
The second failed *SO* soon after the first that there was no
opportunity to empty the buffer between.

Right?

How do you know that storing 8 pages of records will catch this case as
opposed to storing 2?
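
For a rough sense of scale, here is a back-of-the-envelope sketch in
userspace C. The 4096-byte page size and the helper name are assumptions
made for illustration; the per-record size is left as a parameter
because the real footprint of a pool entry is not stated in this thread.

```c
#include <stddef.h>

/* Assumed page size for the sketch; not taken from the patch. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Hypothetical helper: number of whole fixed-size records that fit in a
 * pool spanning `pages` pages. Returns 0 for a zero record size.
 */
static size_t pool_capacity(size_t pages, size_t record_size)
{
	if (record_size == 0)
		return 0;
	return (pages * SKETCH_PAGE_SIZE) / record_size;
}
```

With a hypothetical 128-byte record, pool_capacity(2, 128) is 64 and
pool_capacity(8, 128) is 256. Either way the pool is finite, so whether
8 pages is enough depends on how many records one burst produces.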

>> Is there any way that the size of the pool can be more automatically
>> determined? Is the likelihood of a bunch of errors proportional to the
>> number of CPUs or amount of RAM or some other aspect of the hardware?
>>
>> Could the pool be emptied more aggressively so that it does not fill up?

You didn't really address the additional questions I posed there.

I'll add one more: how many of the messages are duplicates or
*effectively* duplicates? Or is it hard to tell, at the time the
entries are being made, that they are duplicates?
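
To make "effectively duplicates" concrete, one possible definition is
sketched below in userspace C. Everything here is hypothetical: the
trimmed-down record is not the kernel's struct mce, and treating two
records as duplicates when they share a bank and fall in the same
aligned address block is just one candidate rule, not something the
patch proposes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down error record for illustration only. */
struct err_rec {
	uint8_t  bank;
	uint64_t status;
	uint64_t addr;
};

/*
 * Assumed rule: two records are "effectively" duplicates when they come
 * from the same bank and their addresses fall in the same aligned block
 * of 2^shift bytes. A shift of 64 or more treats all addresses as equal.
 */
static bool effectively_dup(const struct err_rec *a,
			    const struct err_rec *b, unsigned int shift)
{
	if (a->bank != b->bank)
		return false;
	if (shift >= 64)
		return true;
	return (a->addr >> shift) == (b->addr >> shift);
}

/* Count records that duplicate some earlier record. O(n^2), fine for a sketch. */
static size_t count_dups(const struct err_rec *recs, size_t n,
			 unsigned int shift)
{
	size_t dups = 0;

	for (size_t i = 1; i < n; i++) {
		for (size_t j = 0; j < i; j++) {
			if (effectively_dup(&recs[i], &recs[j], shift)) {
				dups++;
				break;
			}
		}
	}
	return dups;
}
```

Running something like count_dups() over a captured log, offline, would
give a first estimate of how much of the pool is spent on repeats.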

It _should_ also be fairly easy to enlarge the buffer on demand, say, if
it got half full. What's the time scale over which the buffer filled
up? Did a single #MC fill it up?
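
A minimal sketch of the "grow when half full" idea, in userspace C and
modelling only the slot accounting. The struct and function names are
hypothetical. It assumes the reserve path must not allocate (as would be
the case in #MC context), so that path only raises a flag and the actual
growth happens later from a context where allocation is allowed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pool bookkeeping; no real memory is managed here. */
struct rec_pool {
	size_t cap;       /* slots currently available */
	size_t used;      /* slots occupied */
	bool   want_grow; /* set once the pool is at least half full */
};

/*
 * Reserve one slot. Never allocates. Returns false when the pool is
 * full, which corresponds to a dropped record.
 */
static bool pool_reserve(struct rec_pool *p)
{
	if (p->used >= p->cap)
		return false;
	p->used++;
	if (p->used * 2 >= p->cap)
		p->want_grow = true;
	return true;
}

/*
 * Meant to run later, outside the error path. Doubles the capacity,
 * capped at max_cap, if growth was requested.
 */
static void pool_maybe_grow(struct rec_pool *p, size_t max_cap)
{
	if (!p->want_grow || p->cap >= max_cap)
		return;
	p->cap = (p->cap * 2 > max_cap) ? max_cap : p->cap * 2;
	p->want_grow = (p->used * 2 >= p->cap);
}
```

This only helps if the buffer fills over a time scale long enough for
the deferred growth to run; if a single #MC fills it, it does not.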

I really think we need to understand what the problem is and have _some_
confidence that the proposed solution will fix that, even if we're just
talking about a new config option.