From: Thierry Reding <thierry.reding@gmail.com>
To: Borislav Petkov <bp@alien8.de>, Arnd Bergmann <arnd@arndb.de>
Cc: arm@kernel.org, soc@kernel.org, Jon Hunter <jonathanh@nvidia.com>,
linux-tegra@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, linux-edac@vger.kernel.org,
Mauro Carvalho Chehab <mchehab@kernel.org>,
Tony Luck <tony.luck@intel.com>,
James Morse <james.morse@arm.com>,
Robert Richter <rric@kernel.org>,
Rahul Bedarkar <rabedarkar@nvidia.com>
Subject: Re: [GIT PULL 1/7] soc/tegra: Changes for v5.20-rc1
Date: Tue, 27 Sep 2022 18:00:19 +0200
Message-ID: <YzMeE2HKOd5WaNqd@orome>
In-Reply-To: <YtAajDYfcVHRGl1U@nazgul.tnic>
On Thu, Jul 14, 2022 at 03:31:07PM +0200, Borislav Petkov wrote:
> On Wed, Jul 13, 2022 at 02:14:27PM +0200, Arnd Bergmann wrote:
> > I think this is just a reflection of what other hardware can do:
> > most machines only detect memory errors, but the EDAC subsystem
> > can work with any type in principle. There are also a lot of
> > conditions elsewhere that can be detected but not corrected.
>
> Just a couple of thoughts from looking at this:
>
> So the EDAC thing reports *hardware* errors by using the RAS
> capabilities built into an IP block. So it started with memory
> controllers but it is getting extended to other blocks. AMD are looking
> at how to integrate GPU hw errors reporting into it, for example.
>
> Looking at that CBB thing, it looks like it is supposed to report not
> so much hardware errors but operational errors. Some of the hw errors
> reported by RAS hw are also operation-related but not the majority.
>
> Then, EDAC has this counters exposed in:
>
> $ grep -r . /sys/devices/system/edac/
> /sys/devices/system/edac/power/runtime_active_time:0
> /sys/devices/system/edac/power/runtime_status:unsupported
> /sys/devices/system/edac/power/runtime_suspended_time:0
> /sys/devices/system/edac/power/control:auto
> /sys/devices/system/edac/pci/edac_pci_log_pe:1
> /sys/devices/system/edac/pci/pci0/pe_count:0
> /sys/devices/system/edac/pci/pci0/npe_count:0
> /sys/devices/system/edac/pci/pci_parity_count:0
> /sys/devices/system/edac/pci/pci_nonparity_count:0
> /sys/devices/system/edac/pci/edac_pci_log_npe:1
> /sys/devices/system/edac/pci/edac_pci_panic_on_pe:0
> /sys/devices/system/edac/pci/check_pci_errors:0
> /sys/devices/system/edac/mc/power/runtime_active_time:0
> /sys/devices/system/edac/mc/power/runtime_status:unsupported
> ...
>
> with the respective hierarchy: memory controllers, PCI errors, etc.
>
> So the main question is, does it make sense for you to fit this into the
> EDAC hierarchy and what would even be the advantage of making it part of
> EDAC?

Closing the loop on this: we've decided to keep this in drivers/soc for
now, with the option of re-evaluating when we encounter similar
functionality on other hardware.

I'm also going to hijack the thread because something else came up
recently that fits the audience here and is right up the same alley: on
Tegra234 there is a mechanism, called FSI (Functional Safety Island),
for reporting failures to an external MCU that monitors the system.
Dedicated hardware in the SoC can send these errors to the MCU via
different transports, and the idea is to report software-detected
failures from kernel drivers such as I2C or PCI via this mechanism, so
that appropriate action can be taken. So essentially we're looking at
adding some new API, preferably something generic, to these bus drivers,
along with "provider" drivers that get notified of these reports so that
they can forward them to the FSI (and then the MCU).
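
To make the shape of such an API a bit more concrete, here is a rough
userspace sketch of the reporter/provider split described above. It is
only an illustration under assumptions: every name in it (fault_report,
fault_provider, fault_provider_register(), fault_report_submit(),
fsi_notify()) is hypothetical, nothing like it exists in the kernel
today, and a real implementation would need locking, unregistration and
a proper transport behind the provider.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* All names below are hypothetical and exist only for illustration. */
enum fault_source { FAULT_SOURCE_I2C, FAULT_SOURCE_PCI };

/* What a bus driver would hand over when it detects a failure. */
struct fault_report {
	enum fault_source source;
	int code;		/* errno-style error code */
	const char *detail;	/* short human-readable description */
};

/* A "provider" is notified of every report and forwards it onwards. */
struct fault_provider {
	const char *name;
	int (*notify)(struct fault_provider *provider,
		      const struct fault_report *report);
	unsigned int forwarded;	/* number of reports passed on so far */
	struct fault_provider *next;
};

/* Singly linked list of registered providers (no locking in this sketch). */
static struct fault_provider *fault_providers;

static void fault_provider_register(struct fault_provider *provider)
{
	assert(provider && provider->notify);
	provider->next = fault_providers;
	fault_providers = provider;
}

/* Called by bus drivers; returns how many providers accepted the report. */
static unsigned int fault_report_submit(const struct fault_report *report)
{
	struct fault_provider *provider;
	unsigned int delivered = 0;

	assert(report);
	for (provider = fault_providers; provider; provider = provider->next)
		if (provider->notify(provider, report) == 0)
			delivered++;

	return delivered;
}

/* Stand-in for a provider that would forward the report to the FSI. */
static int fsi_notify(struct fault_provider *provider,
		      const struct fault_report *report)
{
	provider->forwarded++;
	printf("%s: source=%d code=%d (%s)\n", provider->name,
	       report->source, report->code, report->detail);
	return 0;
}

static struct fault_provider fsi_provider = {
	.name = "fsi",
	.notify = fsi_notify,
};
```

In this sketch a bus driver would fill in a struct fault_report in its
error path and call fault_report_submit(); with no provider registered
the report is simply dropped and the call returns 0.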

This again doesn't seem to be a great fit for EDAC as it is today, but
I also can't find anything better looking around the kernel. So I'm
wondering whether others have encountered this and might have solved it
already and I just haven't found it, or whether it would be worth
creating a new subsystem for. Or perhaps this could be integrated into
EDAC somehow? I'm a bit reluctant to add yet another custom
infrastructure for this, given that it's functionality that likely
exists in other SoCs as well.

Any thoughts on this?

Thierry