From: Borislav Petkov <bp@alien8.de>
To: "Luck, Tony" <tony.luck@intel.com>
Cc: "x86@kernel.org" <x86@kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"patches@lists.linux.dev" <patches@lists.linux.dev>,
Yazen Ghannam <yazen.ghannam@amd.com>
Subject: Re: [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2"
Date: Fri, 1 Jul 2022 10:49:43 +0200
Message-ID: <Yr61Jy6aGhxeulxN@zn.tnic>
In-Reply-To: <Yr3XLMwYnRMa3Opw@agluck-desk3.sc.intel.com>
On Thu, Jun 30, 2022 at 10:02:36AM -0700, Luck, Tony wrote:
> Yes. The cost to offline a page is low (4KB reduction in system capacity
> on a system with 10's or 100's of GB memory).
*If* that page is going to go bad at all.
> The risk to the system if the page does develop an uncorrected error is
> high (process is killed, or system crashes).
That's not what the papers say.
> The question is whether the default threshold should be "do I feel
> lucky?" and those corrected errors are nothing to worry about. Or
> "do I want to take the safe path?" and premptively offline pages
> at the first sign of trouble.
Well, we can't decide that for every possible situation so if Intel's
recommendation is to do that on Intel systems, then users can set that.
/sys/kernel/debug/ras/cec/action_threshold is perhaps not the perfect
interface for that but we can make something more user-friendly.
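[Editor's aside, not part of the original message: the knob referred to here is a plain debugfs file, so a distro or CSP can already apply its own policy at boot. A minimal sketch, assuming CONFIG_RAS_CEC=y, debugfs mounted at the usual /sys/kernel/debug, and root privileges:]

```shell
# Sketch: query and set the CEC page-offlining threshold at runtime.
# Assumes CONFIG_RAS_CEC=y and debugfs mounted at /sys/kernel/debug;
# requires root. The path is taken from the message above.
CEC=/sys/kernel/debug/ras/cec
if [ -w "$CEC/action_threshold" ]; then
    cat "$CEC/action_threshold"          # current value (1023 by default)
    echo 2 > "$CEC/action_threshold"     # the aggressive policy under discussion
else
    echo "CEC debugfs interface not available" >&2
fi
```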
> Is there a study about "wobbly" DIMMs?
Most of the papers I looked at say that the majority of errors are CE
and that there's a likelihood that those errors can turn into UEs, but none
quantifies that likelihood. One paper says that a huge number of the
errors are transient. If you offline such a page just because two alpha
particles flew through it, you're offlining a perfectly good page.
DRAM vendor is also important as different DRAM vendors show different
error stats. And so on and so on.
So you can't simply go and decide for all and say, the answer is 2.
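[Editor's aside, not part of the original message: the CEC already tries to account for exactly these transient CEs by periodically aging out stale per-page counts, controlled by a sibling debugfs file. A sketch under the same assumptions (CONFIG_RAS_CEC=y, debugfs mounted):]

```shell
# Sketch: inspect the CEC decay interval, which periodically ages out
# stale CE counts so purely transient hits don't keep accumulating
# toward the action threshold. File lives in the same debugfs dir.
CEC=/sys/kernel/debug/ras/cec
if [ -r "$CEC/decay_interval" ]; then
    cat "$CEC/decay_interval"   # interval between decay runs
else
    echo "CEC debugfs interface not available" >&2
fi
```

[Whether decay plus a low threshold suffices to avoid offlining good pages is exactly the open question in this thread.]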
> We now have some real data. Instead of a "finger in the air guess"
> that was made (on a different generation of DIMM technology ... the
> AMD paper you reference below says DDR4 is 5.5x worse than DDR3).
In the next sentence it says that the hardware handles those errors just
fine!
> Second most common on DDR4 DIMMs is "row failure". Which current ECC
> systems don't handle well.
This is not what we're talking about here - we're talking about
offlining pages after 2 CEs.
As to the row offlining - yes, no question there, we need to address
that.
> While that's low from one perspective (0.6% servers affected) it's high
> enough to be interesting to the CSP - because they lose revenue and
> reputation when they have to tell their customers: "sorry the VM you
> rented from us just crashed". Note that one physical system crashing
> may take down dozens of VMs.
So that whitepaper doesn't specify what they mean by "fault". Because
in one of the papers in the Reference section, they explain the
terminology:
"A fault is the underlying cause of an error, such as a stuck-at bit or
high-energy particle strike. Faults can be active (causing errors), or
dormant (not causing errors).
An error is an incorrect portion of state resulting from an active
fault, such as an incorrect value in memory. Errors may be detected and
possibly corrected by higher level mechanisms such as parity or error
correcting codes (ECC). They may also go uncorrected, or in the worst
case, completely undetected (i.e., silent)."
So even if we put on the most pessimistic glasses and say that 0.6%
of the faults result in system crashes, then a CSP can go and set the
threshold to something lower for their use case after following
recommendations by DRAM and CPU vendor and so on.
> While anyone can tune the RAS_CEC threshold, the default value should
> be something reasonable. I'm sticking with "2" being a much more
> reasonable default than 1023.
You can make that configurable or Intel-only or whatever - but not
unconditional for everyone.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
Thread overview: 13+ messages
2022-06-07 21:20 [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2" Tony Luck
2022-06-27 14:40 ` Borislav Petkov
2022-06-27 17:27 ` Luck, Tony
2022-06-28 15:59 ` Borislav Petkov
2022-06-28 16:51 ` Luck, Tony
2022-06-30 7:11 ` Borislav Petkov
2022-06-30 17:02 ` Luck, Tony
2022-07-01 8:49 ` Borislav Petkov [this message]
2022-07-01 16:44 ` Luck, Tony
2022-07-01 19:12 ` [PATCH] RAS/CEC: Reduce offline page threshold for Intel systems Tony Luck
2022-08-02 12:07 ` Yazen Ghannam
2022-08-02 16:18 ` [PATCH v2] " Tony Luck
2022-08-22 17:41 ` [tip: ras/core] " tip-bot2 for Tony Luck