Date: Wed, 9 Sep 2020 14:02:03 +0200
From: Borislav Petkov
To: Shiju Jose
Cc: "linux-edac@vger.kernel.org", "linux-acpi@vger.kernel.org",
    "linux-kernel@vger.kernel.org", "tony.luck@intel.com",
    "rjw@rjwysocki.net", "james.morse@arm.com", "lenb@kernel.org", Linuxarm
Subject: Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core
Message-ID: <20200909120203.GB12237@zn.tnic>
References: <20200901140140.1772-1-shiju.jose@huawei.com> <20200901143539.GC8392@zn.tnic> <512b7b8e6cb846aabaf5a2191cd9b5d4@huawei.com>
In-Reply-To: <512b7b8e6cb846aabaf5a2191cd9b5d4@huawei.com>

On Tue, Sep 01, 2020 at 04:20:54PM +0000, Shiju Jose wrote:
> CPU CEC derived the infrastructure of the CEC only and the logic
> used in the CEC for CE count storage, CE count calculation and page
> isolation is very unique for the memory pages, which seems cannot be
> reusable for the CPU CEs.

Oh, because it saves the reported error's PFN and you want to save
[CPU num | error count]?

Well, you can easily change that by extending the existing CEC to have a
different storage format for CPU errors, i.e., use a different ce_array
which gets passed to the functions anyway.

> Also the values set for the parameters such as threshold, time period
> for the memory errors and CPU errors would be different.

And your implementation with sliding windows is so totally different
that it warrants the duplication of the code? I don't think so. You can
use the current CEC to do exactly what you wanna do, with the decaying
and so on.

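For illustration only, here is a minimal user-space sketch of what a [CPU num | error count] element could look like when packed into a single 64-bit value, so that it could sit in an array the same way a PFN-based element does. Everything in it is an assumption made for this example: the names (cpu_elem_pack() and friends), the 12-bit count width, the saturating increment and the halving decay are hypothetical and are not taken from cec.c or from the posted patch.

```c
#include <stdint.h>

/*
 * Hypothetical element layout (not the layout used in cec.c):
 *   bits [63:12]  CPU number
 *   bits [11:0]   correctable error count
 */
#define COUNT_BITS 12
#define COUNT_MASK ((UINT64_C(1) << COUNT_BITS) - 1)

/* Pack a CPU number and an error count into one element. */
static inline uint64_t cpu_elem_pack(uint32_t cpu, uint32_t count)
{
	return ((uint64_t)cpu << COUNT_BITS) | (count & COUNT_MASK);
}

/* Extract the CPU number from an element. */
static inline uint32_t cpu_elem_cpu(uint64_t elem)
{
	return (uint32_t)(elem >> COUNT_BITS);
}

/* Extract the error count from an element. */
static inline uint32_t cpu_elem_count(uint64_t elem)
{
	return (uint32_t)(elem & COUNT_MASK);
}

/* Count one more error; saturate so the count never spills into the CPU bits. */
static inline uint64_t cpu_elem_inc(uint64_t elem)
{
	if (cpu_elem_count(elem) == COUNT_MASK)
		return elem;
	return elem + 1;
}

/*
 * Illustrative decay step: halve the count and keep the CPU number.
 * This is a stand-in chosen for the example, not the decay scheme of cec.c.
 */
static inline uint64_t cpu_elem_decay(uint64_t elem)
{
	return (elem & ~COUNT_MASK) | (cpu_elem_count(elem) >> 1);
}
```

With a layout along these lines, the only CPU-specific parts are the pack and unpack helpers; the array handling, thresholding and periodic decay would operate on opaque 64-bit elements regardless of what the upper bits mean.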
Because all you wanna do is count the errors a CPU triggered.

However, a CPU can trigger a *lot* of different types of errors. You're
putting them all in the same basket by doing:

	else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM))
		/* add to CEC */

and only for correctable.

What type of errors get reported in CPER_SEC_PROC_ARM?

If they're all lumped together and if some functional unit generates a
lot of errors, instead of disabling that unit only, you'll go and remove
the whole CPU? Doesn't make a whole lot of sense to me.

How about you define what exactly you're trying to solve, maybe give an
example of a real issue someone is encountering and you're trying to
address?

Because there was never a necessity so far to disable CPUs on x86 due to
correctable errors. Why is that needed on ARM?

> Thus extending cec.c to support CPU CEs would include adding CPU CEC
> specific code for storing error count, isolation etc which I thought
> would result the code less tidy and less readable unless find more
> reusable logic.

Depends on how you design it. But with what I'm seeing so far, I'm still
sceptical this is needed at all.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette