From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mga06.intel.com (mga06b.intel.com [134.134.136.31])
	(using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits))
	(No client certificate requested)
	by smtp.subspace.kernel.org (Postfix) with ESMTPS id 2CC337498
	for ; Fri, 1 Jul 2022 19:12:54 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple;
	d=intel.com; i=@intel.com; q=dns/txt; s=Intel;
	t=1656702775; x=1688238775;
	h=from:to:cc:subject:date:message-id:in-reply-to:
	 references:mime-version:content-transfer-encoding;
	bh=7Ji9e0hAxIJNMDuh9KYE+lqOyY+7TlQzjmXgYGpP8ic=;
	b=b8YNBbiuLJoPYkDcUf1vgEya9yMdfEeGrFWq5/5ikprSI9oC9y3RsAaR
	 olbn6/TSXvtmtoATHjVykmV1WLoT7aS7DEhwBDyqOu6uUpVpCzEVzIpvH
	 R20jd1VgXh/P5C1Sl6Hxn4VuooXObY6FVq3rptUSR3qMgWBEpgN+sK7Jl
	 pNFV86ikp1z4jYwpgMx1Qhz3WL5wo3pULast2f6KlsetmG4bgYD7eSUYu
	 kj+BykiQqZrtB2b3UDOPoYxJ1eJjK4sIESh6vRFhYZV/KnKdvFcOxyCak
	 wMteA28CiEhiAizKiEy8lpWThmT1DnaJPvLkEplCSlkKwdBtSOFbYXT9m
	 Q==;
X-IronPort-AV: E=McAfee;i="6400,9594,10395"; a="344410072"
X-IronPort-AV: E=Sophos;i="5.92,238,1650956400";
	d="scan'208";a="344410072"
Received: from fmsmga007.fm.intel.com ([10.253.24.52])
	by orsmga104.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Jul 2022 12:12:44 -0700
X-IronPort-AV: E=Sophos;i="5.92,238,1650956400";
	d="scan'208";a="596366381"
Received: from agluck-desk3.sc.intel.com ([172.25.222.78])
	by fmsmga007-auth.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 01 Jul 2022 12:12:44 -0700
From: Tony Luck
To: yazen.ghannam@amd.com
Cc: tony.luck@intel.com, bp@alien8.de, linux-kernel@vger.kernel.org,
	patches@lists.linux.dev, x86@kernel.org
Subject: [PATCH] RAS/CEC: Reduce offline page threshold for Intel systems
Date: Fri, 1 Jul 2022 12:12:39 -0700
Message-Id: <20220701191239.619940-1-tony.luck@intel.com>
X-Mailer: git-send-email 2.35.3
In-Reply-To:
References:
Precedence: bulk
X-Mailing-List: patches@lists.linux.dev
List-Id:
List-Subscribe:
List-Unsubscribe:
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit

A large-scale study of memory errors on Intel systems in data centers
showed that aggressively taking pages with corrected errors offline is
the best strategy for using corrected errors as a predictor of future
uncorrected errors.

It is unknown whether this would help other vendors. There are some
indicators that it would not.

Set the threshold to "2" on Intel systems.

Do-not-apply-without-agreement-from-AMD
Signed-off-by: Tony Luck
---
 drivers/ras/cec.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/ras/cec.c b/drivers/ras/cec.c
index 42f2fc0bc8a9..b1fc193b2036 100644
--- a/drivers/ras/cec.c
+++ b/drivers/ras/cec.c
@@ -556,6 +556,14 @@ static int __init cec_init(void)
 	if (ce_arr.disabled)
 		return -ENODEV;
 
+	/*
+	 * Intel systems may avoid uncorrectable errors
+	 * if pages with corrected errors are aggressively
+	 * taken offline.
+	 */
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+		action_threshold = 2;
+
 	ce_arr.array = (void *)get_zeroed_page(GFP_KERNEL);
 	if (!ce_arr.array) {
 		pr_err("Error allocating CE array page!\n");
-- 
2.35.3
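
For illustration only, here is a userspace sketch of what a threshold of 2 means in practice. It is not the code in drivers/ras/cec.c: the names `enum vendor`, `pick_threshold()`, `record_error()` and `DEFAULT_THRESHOLD` are hypothetical, the default value is an assumed placeholder, and the sketch assumes a page is acted on once its corrected-error count reaches the threshold.

```c
#include <stdbool.h>

/* Hypothetical vendor tags; the kernel compares boot_cpu_data.x86_vendor
 * against X86_VENDOR_* constants instead. */
enum vendor { VENDOR_INTEL, VENDOR_OTHER };

/* Assumed placeholder for the non-Intel default; not taken from the patch. */
#define DEFAULT_THRESHOLD 1023u

/* Mirror the patch's decision: Intel gets the aggressive threshold. */
static unsigned int pick_threshold(enum vendor v)
{
	return v == VENDOR_INTEL ? 2u : DEFAULT_THRESHOLD;
}

/*
 * Record one corrected error against a page's counter.
 * Returns true once the count has reached the threshold, i.e. the
 * point at which this sketch would take the page offline.
 */
static bool record_error(unsigned int *count, unsigned int threshold)
{
	(*count)++;
	return *count >= threshold;
}
```

Under these assumptions, calling `record_error()` twice on a fresh counter with `pick_threshold(VENDOR_INTEL)` returns false for the first corrected error and true for the second, while the other-vendor path keeps returning false until the much larger default is reached.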