From: Borislav Petkov <bp@alien8.de>
To: "Luck, Tony" <tony.luck@intel.com>
Cc: "x86@kernel.org" <x86@kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"patches@lists.linux.dev" <patches@lists.linux.dev>,
Yazen Ghannam <yazen.ghannam@amd.com>
Subject: Re: [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2"
Date: Tue, 28 Jun 2022 17:59:54 +0200
Message-ID: <Yrsleko0MnGtwaaR@zn.tnic>
In-Reply-To: <7da92773f7084c57814f7ef4d033bc53@intel.com>
On Mon, Jun 27, 2022 at 05:27:57PM +0000, Luck, Tony wrote:
> Existing default is 1023 ... which is not a good choice for anyone (except
> perhaps ostriches that want to bury their heads in the sand and ignore marginal
> DIMMs for as long as possible).
Why isn't that a good choice?
I'm sure there are error rates where this fits just fine.
> So changing the threshold to "2" would be an improvement in at least
> being right for one vendor, instead of wrong for all.
So I'm pretty sure that is not needed on AMD at all.
> Linux already had a hook in the GHES code to take an error record from
> the platform and offline a page. So this "smart" code could be done
> by BIOS or BMC just providing the resulting list of pages that should
> be taken offline to Linux.
So my worry is some firmware agent interfering with our recovery
strategy. And reportedly, there are people who don't like the firmware
recovery at all and prefer that it all be done in the OS.
Which then makes it a problem of how to synchronize with the firmware
about who does what in RAS. And we don't have any API here...
Anyway, this is just a worry I have from watching where it is all
going.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
Thread overview: 12+ messages
2022-06-07 21:20 [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2" Tony Luck
2022-06-27 14:40 ` Borislav Petkov
2022-06-27 17:27 ` Luck, Tony
2022-06-28 15:59 ` Borislav Petkov [this message]
2022-06-28 16:51 ` Luck, Tony
2022-06-30 7:11 ` Borislav Petkov
2022-06-30 17:02 ` Luck, Tony
2022-07-01 8:49 ` Borislav Petkov
2022-07-01 16:44 ` Luck, Tony
2022-07-01 19:12 ` [PATCH] RAS/CEC: Reduce offline page threshold for Intel systems Tony Luck
2022-08-02 12:07 ` Yazen Ghannam
2022-08-02 16:18 ` [PATCH v2] " Tony Luck