From: Borislav Petkov <bp@alien8.de>
To: Shiju Jose <shiju.jose@huawei.com>
Cc: "linux-edac@vger.kernel.org" <linux-edac@vger.kernel.org>,
"linux-acpi@vger.kernel.org" <linux-acpi@vger.kernel.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
"tony.luck@intel.com" <tony.luck@intel.com>,
"rjw@rjwysocki.net" <rjw@rjwysocki.net>,
"james.morse@arm.com" <james.morse@arm.com>,
"lenb@kernel.org" <lenb@kernel.org>,
Linuxarm <linuxarm@huawei.com>
Subject: Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core
Date: Wed, 9 Sep 2020 14:02:03 +0200
Message-ID: <20200909120203.GB12237@zn.tnic>
In-Reply-To: <512b7b8e6cb846aabaf5a2191cd9b5d4@huawei.com>
On Tue, Sep 01, 2020 at 04:20:54PM +0000, Shiju Jose wrote:
> CPU CEC derived the infrastructure of the CEC only and the logic
> used in the CEC for CE count storage, CE count calculation and page
> isolation is very unique for the memory pages, which seems cannot be
> reusable for the CPU CEs.
Oh, because it saves the reported error's PFN and you want to save

	[CPU num | error count]

?
Well, you can easily change that by extending the existing CEC to have a
different storage format for CPU errors, i.e., use a different ce_array
which gets passed to the functions anyway.
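To illustrate the kind of storage format meant here, a minimal user-space sketch follows. It is not code from cec.c or from the patch: the struct name `cpu_ce_array`, the 32/32 bit split, the fixed `MAX_ELEMS` capacity and the helper names are all assumptions made for the example. Each 64-bit element packs the CPU number into the upper half and the error count into the lower half.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: CPU number in bits 63..32, error count in bits 31..0. */
#define CPU_SHIFT	32
#define COUNT_MASK	0xffffffffULL
#define MAX_ELEMS	64	/* arbitrary capacity for the sketch */

/* Hypothetical counterpart to a ce_array, holding CPU elements instead of PFNs. */
struct cpu_ce_array {
	uint64_t array[MAX_ELEMS];
	size_t n;		/* number of elements in use */
};

static inline uint32_t elem_cpu(uint64_t e)
{
	return (uint32_t)(e >> CPU_SHIFT);
}

static inline uint32_t elem_count(uint64_t e)
{
	return (uint32_t)(e & COUNT_MASK);
}

/*
 * Record one correctable error for @cpu.
 * Returns the updated count for that CPU, or 0 if the array is full.
 */
static uint32_t cpu_ce_add(struct cpu_ce_array *ca, uint32_t cpu)
{
	for (size_t i = 0; i < ca->n; i++) {
		if (elem_cpu(ca->array[i]) != cpu)
			continue;

		/* Saturate so the count never carries into the CPU field. */
		if (elem_count(ca->array[i]) < COUNT_MASK)
			ca->array[i]++;

		return elem_count(ca->array[i]);
	}

	if (ca->n == MAX_ELEMS)
		return 0;

	ca->array[ca->n++] = ((uint64_t)cpu << CPU_SHIFT) | 1;
	return 1;
}
```

A caller would compare the returned count against whatever threshold applies to CPU errors. The linear search only keeps the sketch short; a real implementation would want something better.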
> Also the values set for the parameters such as threshold, time period
> for the memory errors and CPU errors would be different.
And your implementation with sliding windows is so totally different
that it warrants the duplication of the code? I don't think so.
You can use the current CEC to do exactly what you wanna do, with the
decaying and so on.
Because all you wanna do is count the errors a CPU triggered.
However, a CPU can trigger a *lot* of different types of errors.
You're putting them all in the same basket by doing:

	else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM))
		/* add to CEC */

and only for correctable.
What type of errors get reported in CPER_SEC_PROC_ARM?
If they're all lumped together and if some functional unit generates a
lot of errors, instead of disabling that unit only, you'll go and remove
the whole CPU?
Doesn't make a whole lot of sense to me.
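To make the distinction concrete, here is a rough sketch of counting per functional unit instead of per CPU. Everything in it is hypothetical: the `enum fu` values, the `NR_CPUS_SKETCH` table size and the `ce_account()` helper are made up for illustration, and which units can actually be told apart depends on what the error record reports.

```c
#include <stdint.h>

/* Hypothetical functional units; the real set depends on the error record. */
enum fu { FU_CACHE, FU_TLB, FU_BUS, FU_OTHER, FU_MAX };

#define NR_CPUS_SKETCH	8	/* arbitrary size for the sketch */

/* One counter per (CPU, functional unit) pair instead of one per CPU. */
static uint32_t ce_count[NR_CPUS_SKETCH][FU_MAX];

/*
 * Account one corrected error against @unit of @cpu.
 * Returns 1 once that pair has reached @threshold, 0 otherwise
 * (including for out-of-range arguments).
 */
static int ce_account(unsigned int cpu, enum fu unit, uint32_t threshold)
{
	if (cpu >= NR_CPUS_SKETCH || unit >= FU_MAX)
		return 0;

	return ++ce_count[cpu][unit] >= threshold;
}
```

With counters kept this way, a noisy unit trips its own threshold without the errors of the other units on that CPU counting towards it.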
How about you define what exactly you're trying to solve, maybe give an
example of a real issue someone is encountering and you're trying to
address? Because there was never a necessity so far to disable CPUs on
x86 due to correctable errors. Why is that needed on ARM?
> Thus extending cec.c to support CPU CEs would include adding CPU CEC
> specific code for storing error count, isolation etc which I thought
> would result the code less tidy and less readable unless find more
> reusable logic.
Depends on how you design it.
But with what I'm seeing so far, I'm still sceptical this is needed at
all.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette