Message-ID: <4ef01be1-44b2-4bf5-afec-a90d4f71e955@linux.alibaba.com>
Date: Mon, 28 Jul 2025 09:08:25 +0800
X-Mailing-List: linux-kernel@vger.kernel.org
Subject: Re: [PATCH v3] vmcoreinfo: Track and log recoverable hardware errors
To: Breno Leitao
Cc: Tony Luck, Borislav Petkov, "Rafael J. Wysocki", Len Brown,
 James Morse, Robert Moore, Thomas Gleixner, Ingo Molnar, Dave Hansen,
 x86@kernel.org, "H. Peter Anvin", Hanjun Guo, Mauro Carvalho Chehab,
 Mahesh J Salgaonkar, Oliver O'Halloran, Bjorn Helgaas,
 linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org,
 acpica-devel@lists.linux.dev, osandov@osandov.com, konrad.wilk@oracle.com,
 linux-edac@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
 linux-pci@vger.kernel.org, kernel-team@meta.com
References: <20250722-vmcore_hw_error-v3-1-ff0683fc1f17@debian.org>
 <7ce9731a-b212-4e27-8809-0559eb36c5f2@linux.alibaba.com>
 <4qh2wbcbzdajh2tvki26qe4tqjazmyvbn7v7aqqhkxpitdrexo@ucch4ppo7i4e>
From: Shuai Xue

On 2025/7/26 00:16, Breno Leitao wrote:
> Hello Shuai,
>
> On Fri, Jul 25, 2025 at 03:40:58PM +0800, Shuai Xue wrote:
>>>> APEI does not define an error type named GHES. GHES is just a kernel
>>>> driver name. Many hardware error types can be handled in GHES (see
>>>> ghes_do_proc), for example, AER is routed by GHES when firmware-first
>>>> mode is used. As far as I know, firmware-first mode is commonly used in
>>>> production. Should GHES errors be categorized into AER, memory, and CXL
>>>> memory instead?
>>>
>>> I also considered slicing the data differently initially, but then
>>> realized it would add more complexity than necessary for my needs.
>>>
>>> If you believe we should further subdivide the data, I’m happy to do so.
>>>
>>> You’re suggesting a structure like this, which would then map to the
>>> corresponding CPER_SEC_ sections:
>>>
>>> enum hwerr_error_type {
>>> 	HWERR_RECOV_AER, // maps to CPER_SEC_PCIE
>>> 	HWERR_RECOV_MCE, // maps to default MCE + CPER_SEC_PCIE
>>
>> Is CPER_SEC_PCIE a typo here?
>
> Correct, HWERR_RECOV_MCE would map to the regular MCE and not errors
> coming from GHES.
>
>>> 	HWERR_RECOV_CXL, // maps to CPER_SEC_CXL_*
>>> 	HWERR_RECOV_MEMORY, // maps to CPER_SEC_PLATFORM_MEM
>>> }
>>>
>>> Additionally, what about events related to CPU, Firmware, or DMA
>>> errors, for example CPER_SEC_PROC, CPER_SEC_FW, CPER_SEC_DMAR? Should we
>>> include those in the classification as well?
>>
>> I would like to split each error coming from GHES into its own type;
>> that sounds more reasonable. I cannot tell what happened from
>> HWERR_RECOV_AER at all :(
>
> Makes sense. Regarding your answer, I suppose we might want to have
> something like the following:
>
> enum hwerr_error_type {
> 	HWERR_RECOV_MCE, // maps to errors in do_machine_check()
> 	HWERR_RECOV_CXL, // maps to CPER_SEC_CXL_
> 	HWERR_RECOV_PCI, // maps to AER (pci_dev_aer_stats_incr()) and CPER_SEC_PCIE and CPER_SEC_PCI
> 	HWERR_RECOV_MEMORY, // maps to CPER_SEC_PLATFORM_MEM_
> 	HWERR_RECOV_CPU, // maps to CPER_SEC_PROC_
> 	HWERR_RECOV_DMA, // maps to CPER_SEC_DMAR_
> 	HWERR_RECOV_OTHERS, // maps to CPER_SEC_FW_, CPER_SEC_DMAR_,
> }
>
> Is this what you think we should track?
>
> Thanks
> --breno

It sounds good to me.

Thanks.
Shuai
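
For illustration only, here is a minimal userspace sketch of how the enum agreed on above could be used to classify errors arriving through GHES and count them per class. It is not code from the patch. Everything prefixed with demo_/DEMO_ is hypothetical, as is HWERR_RECOV_MAX; the real ghes_do_proc() distinguishes CPER sections by GUID, which the small enum below only stands in for. Mapping the DMAR sections to HWERR_RECOV_DMA (rather than OTHERS) is also an assumption, since the proposal lists them under both.

```c
#include <assert.h>

/* Error classes as proposed in the thread above. */
enum hwerr_error_type {
	HWERR_RECOV_MCE,
	HWERR_RECOV_CXL,
	HWERR_RECOV_PCI,
	HWERR_RECOV_MEMORY,
	HWERR_RECOV_CPU,
	HWERR_RECOV_DMA,
	HWERR_RECOV_OTHERS,
	HWERR_RECOV_MAX,	/* hypothetical: number of classes, not in the thread */
};

/*
 * Hypothetical stand-in for the CPER section types. The kernel identifies
 * sections by GUID; a plain enum keeps this sketch self-contained.
 */
enum demo_cper_section {
	DEMO_SEC_PLATFORM_MEM,
	DEMO_SEC_PCIE,
	DEMO_SEC_PROC,
	DEMO_SEC_CXL,
	DEMO_SEC_DMAR,
	DEMO_SEC_FW,
};

/* One counter per error class; zero-initialized as a file-scope object. */
unsigned int demo_hwerr_count[HWERR_RECOV_MAX];

/* Map a section reported via GHES to the proposed error class. */
enum hwerr_error_type demo_classify(enum demo_cper_section sec)
{
	switch (sec) {
	case DEMO_SEC_PLATFORM_MEM:
		return HWERR_RECOV_MEMORY;
	case DEMO_SEC_PCIE:
		return HWERR_RECOV_PCI;
	case DEMO_SEC_PROC:
		return HWERR_RECOV_CPU;
	case DEMO_SEC_CXL:
		return HWERR_RECOV_CXL;
	case DEMO_SEC_DMAR:
		return HWERR_RECOV_DMA;	/* assumption: DMA rather than OTHERS */
	default:
		return HWERR_RECOV_OTHERS;	/* e.g. firmware error records */
	}
}

/* Record one recoverable error under its class. */
void demo_log_ghes_error(enum demo_cper_section sec)
{
	enum hwerr_error_type type = demo_classify(sec);

	assert(type < HWERR_RECOV_MAX);
	demo_hwerr_count[type]++;
}
```

With this split, an AER error delivered in firmware-first mode lands in the same HWERR_RECOV_PCI counter as one handled natively, so the counter names the hardware that failed rather than the driver that reported it.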