Date: Wed, 30 Jul 2025 18:21:37 +0200
From: Mauro Carvalho Chehab
To: Breno Leitao
Cc: Shuai Xue, Tony Luck, Borislav Petkov, "Rafael J. Wysocki",
 Len Brown, James Morse, Robert Moore, Thomas Gleixner, Ingo Molnar,
 Dave Hansen, x86@kernel.org, "H. Peter Anvin", Hanjun Guo,
 Mauro Carvalho Chehab, Mahesh J Salgaonkar, Oliver O'Halloran,
 Bjorn Helgaas, linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org,
 acpica-devel@lists.linux.dev, osandov@osandov.com, konrad.wilk@oracle.com,
 linux-edac@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
 linux-pci@vger.kernel.org, kernel-team@meta.com
Subject: Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
Message-ID: <20250730182137.18605ea1@foz.lan>
References: <20250722-vmcore_hw_error-v3-1-ff0683fc1f17@debian.org>
 <7ce9731a-b212-4e27-8809-0559eb36c5f2@linux.alibaba.com>
 <4qh2wbcbzdajh2tvki26qe4tqjazmyvbn7v7aqqhkxpitdrexo@ucch4ppo7i4e>
 <4ef01be1-44b2-4bf5-afec-a90d4f71e955@linux.alibaba.com>
 <2a7ok3hdq3hmz45fzosd5vve4qpn6zy5uoogg33warsekigazu@wgfi7qsg5ixo>

On Wed, 30 Jul 2025 06:11:52 -0700, Breno Leitao wrote:

> Hello Shuai,
>
> On Wed, Jul 30, 2025 at 10:13:13AM +0800, Shuai Xue wrote:
> > In ghes_log_hwerr(), you're counting both CPER_SEV_CORRECTED and
> > CPER_SEV_RECOVERABLE errors:
>
> Thanks. I was reading this code a bit more, and I want to make sure my
> understanding is correct, given I was confused about CORRECTED and
> RECOVERABLE errors.
>
> CPER_SEV_CORRECTED means it is corrected in the background, and the OS
> was not even notified about it.
> That includes 1-bit ECC error.
> Those are not the errors we are interested in, since they are irrelevant
> to the OS.

Hardware-corrected errors aren't irrelevant. The rasdaemon utilities
capture such errors, as they may be a symptom of a hardware defect. As a
matter of fact, rasdaemon lets you set thresholds that trigger an action,
such as disabling memory blocks that contain defective memory.

This is especially relevant for HPC and supercomputer workloads, where it
is far cheaper to disable a block of bad memory than to lose an entire
job, possibly after several weeks of run time, just because a defective
memory ended up causing a failure in the application.

Regards,
Mauro
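
The threshold idea in the reply above can be sketched in a few lines of C. This is only an illustration under stated assumptions, not rasdaemon's code or its configuration: the names ce_table, ce_block and ce_record, the block identifier, and the values CE_THRESHOLD and MAX_BLOCKS are all hypothetical. The sketch counts corrected errors per memory block and reports the moment a block crosses the threshold, which is where a real tool would take the block out of use.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy values, chosen only for illustration. */
#define CE_THRESHOLD	3	/* corrected errors tolerated per block */
#define MAX_BLOCKS	64	/* blocks tracked by this toy table */

struct ce_block {
	uint64_t block_id;	/* hypothetical identifier of a memory block */
	unsigned int count;	/* corrected errors seen so far */
	bool disabled;		/* set once the threshold is reached */
};

struct ce_table {
	struct ce_block blocks[MAX_BLOCKS];
	size_t used;
};

/*
 * Record one corrected error against block_id.
 * Returns true exactly once per block: on the error that reaches
 * CE_THRESHOLD, when the caller should take the block out of use.
 */
static bool ce_record(struct ce_table *t, uint64_t block_id)
{
	struct ce_block *b = NULL;
	size_t i;

	/* Look for an existing entry for this block. */
	for (i = 0; i < t->used; i++) {
		if (t->blocks[i].block_id == block_id) {
			b = &t->blocks[i];
			break;
		}
	}

	/* First error for this block: allocate a new entry. */
	if (!b) {
		if (t->used == MAX_BLOCKS)
			return false;	/* table full: error is not tracked */
		b = &t->blocks[t->used++];
		b->block_id = block_id;
		b->count = 0;
		b->disabled = false;
	}

	if (b->disabled)
		return false;	/* already handled, nothing more to do */

	if (++b->count < CE_THRESHOLD)
		return false;	/* still below the threshold */

	b->disabled = true;
	return true;
}
```

A caller would zero-initialize a struct ce_table, call ce_record() for every corrected error it sees, and act only when the function returns true. Counting per block rather than globally is the point of the design: a single noisy block can be isolated without reacting to unrelated one-off corrected errors elsewhere.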