From: Tony Luck <tony.luck@intel.com>
To: Borislav Petkov <bp@alien8.de>
Cc: x86@kernel.org, linux-kernel@vger.kernel.org,
patches@lists.linux.dev, Tony Luck <tony.luck@intel.com>
Subject: [PATCH] RAS/CEC: Reduce default threshold to offline a page to "2"
Date: Tue, 7 Jun 2022 14:20:15 -0700
Message-ID: <20220607212015.175591-1-tony.luck@intel.com>
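A large-scale study of memory errors in data centers showed that it is
best to aggressively take pages with corrected errors offline. This is
the best strategy for using corrected errors as a predictor of future
uncorrected errors.
Change the default action_threshold from COUNT_MASK to 2 so that a page
is offlined when a second corrected error is seen.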
Signed-off-by: Tony Luck <tony.luck@intel.com>
---
Here's the link to the study. I thought of putting it into the code
comment or the commit message, but such links tend to go stale when
websites are reorganised.
https://www.intel.com/content/dam/www/public/us/en/documents/intel-and-samsung-mrt-improving-memory-reliability-at-data-centers.pdf
The paper has two recommendations:
1) Change threshold to "2".
2) Do very smart platform-dependent things.
This commit only addresses the first :-)
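To make the effect of the first recommendation concrete, here is a minimal user-space sketch of a per-page corrected-error counter with an action threshold. It is not the code in drivers/ras/cec.c: the struct, the function name and the table size are hypothetical, and decay and the kernel's actual data structure are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_PAGES 8	/* hypothetical table size, not the kernel's */

/* Hypothetical, simplified stand-in for a corrected-error collector. */
struct ce_sketch {
	uint64_t pfn[SKETCH_MAX_PAGES];		/* pages seen so far */
	unsigned int count[SKETCH_MAX_PAGES];	/* corrected errors per page */
	size_t used;				/* number of valid entries */
	unsigned int action_threshold;		/* errors before offlining */
};

/*
 * Record one corrected error for @pfn. Returns true when the error count
 * for that page reaches action_threshold, i.e. when the caller would
 * offline the page. The entry is then dropped from the table.
 */
static bool sketch_record_ce(struct ce_sketch *t, uint64_t pfn)
{
	size_t i;

	assert(t != NULL);

	for (i = 0; i < t->used; i++) {
		if (t->pfn[i] != pfn)
			continue;
		if (++t->count[i] < t->action_threshold)
			return false;
		/* Threshold reached: move the last entry into this slot. */
		t->used--;
		t->pfn[i] = t->pfn[t->used];
		t->count[i] = t->count[t->used];
		return true;
	}

	/* First corrected error seen for this page. */
	if (t->action_threshold <= 1)
		return true;
	if (t->used == SKETCH_MAX_PAGES)
		return false;	/* table full: this sketch drops the event */
	t->pfn[t->used] = pfn;
	t->count[t->used] = 1;
	t->used++;
	return false;
}
```

With action_threshold set to 2, the first corrected error on a page is only recorded and the second one triggers the offline request. With a very large threshold the same page would have to fail many more times before anything happens.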
---
drivers/ras/cec.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index 42f2fc0bc8a9..5d614c383ccf 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -125,8 +125,11 @@ static struct ce_array {
static DEFINE_MUTEX(ce_mutex);
static u64 dfs_pfn;
-/* Amount of errors after which we offline */
-static u64 action_threshold = COUNT_MASK;
+/*
+ * Number of errors after which we offline. Default is to aggressively
+ * offline the page when a second error is seen.
+ */
+static u64 action_threshold = 2;
/* Each element "decays" each decay_interval which is 24hrs by default. */
#define CEC_DECAY_DEFAULT_INTERVAL 24 * 60 * 60 /* 24 hrs */
--
2.35.3