From: Sumit Gupta <sumitg@nvidia.com>
To: Borislav Petkov <bp@alien8.de>, Arnd Bergmann <arnd@arndb.de>
Cc: Thierry Reding <thierry.reding@gmail.com>,
arm-soc <arm@kernel.org>, "SoC Team" <soc@kernel.org>,
Jon Hunter <jonathanh@nvidia.com>,
"open list:TEGRA ARCHITECTURE SUPPORT"
<linux-tegra@vger.kernel.org>,
Linux ARM <linux-arm-kernel@lists.infradead.org>,
<linux-edac@vger.kernel.org>,
"Mauro Carvalho Chehab" <mchehab@kernel.org>,
Tony Luck <tony.luck@intel.com>,
"James Morse" <james.morse@arm.com>,
Robert Richter <rric@kernel.org>, Sumit Gupta <sumitg@nvidia.com>,
<bbasu@nvidia.com>, Vikram Sethi <vsethi@nvidia.com>
Subject: Re: [GIT PULL 1/7] soc/tegra: Changes for v5.20-rc1
Date: Fri, 15 Jul 2022 13:36:16 +0530
Message-ID: <8dd2310d-cf1d-600e-0bd3-7b16c7b4ac18@nvidia.com>
In-Reply-To: <YtAajDYfcVHRGl1U@nazgul.tnic>

Hi Arnd, Boris,

Thank you for your input.

>> I think this is just a reflection of what other hardware can do:
>> most machines only detect memory errors, but the EDAC subsystem
>> can work with any type in principle. There are also a lot of
>> conditions elsewhere that can be detected but not corrected.
>
> Just a couple of thoughts from looking at this:
>
> So the EDAC thing reports *hardware* errors by using the RAS
> capabilities built into an IP block. So it started with memory
> controllers but it is getting extended to other blocks. AMD are looking
> at how to integrate GPU hw errors reporting into it, for example.
>
> Looking at that CBB thing, it looks like it is supposed to report not
> so much hardware errors but operational errors. Some of the hw errors
> reported by RAS hw are also operation-related but not the majority.
>
The CBB driver reports errors caused by bad MMIO accesses made by
software. The vast majority of CBB errors are programming errors in
setting up address windows, which lead to decode errors.
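
As a rough illustration of what a decode error means here, the sketch
below looks an MMIO address up in a table of address windows and
reports a decode error when no window claims it. This is a hypothetical
model, not the CBB driver code: the window table, the names
(addr_window, decode, DECODE_ERR), the addresses and the targets are
all made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical address window covering [base, base + size). */
struct addr_window {
	uint64_t base;
	uint64_t size;
	const char *target;
};

/* Illustrative result codes, not the error codes reported by CBB. */
enum decode_result {
	DECODE_OK,
	DECODE_ERR,
};

/* Made-up windows with a deliberate gap between them. */
static const struct addr_window windows[] = {
	{ 0x03000000, 0x00100000, "periph-a" },
	{ 0x03200000, 0x00100000, "periph-b" },
};

/*
 * Return DECODE_OK and the owning target if some window claims addr,
 * otherwise DECODE_ERR with *target set to NULL.
 */
static enum decode_result decode(uint64_t addr, const char **target)
{
	for (size_t i = 0; i < sizeof(windows) / sizeof(windows[0]); i++) {
		const struct addr_window *w = &windows[i];

		if (addr >= w->base && addr - w->base < w->size) {
			*target = w->target;
			return DECODE_OK;
		}
	}

	*target = NULL;
	return DECODE_ERR;
}
```

In this model, an access to 0x03100000 falls into the gap between the
two windows and is reported as a decode error, which is the kind of
software mistake described above rather than a hardware fault.
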
> Then, EDAC has this counters exposed in:
>
> $ grep -r . /sys/devices/system/edac/
> /sys/devices/system/edac/power/runtime_active_time:0
> /sys/devices/system/edac/power/runtime_status:unsupported
> /sys/devices/system/edac/power/runtime_suspended_time:0
> /sys/devices/system/edac/power/control:auto
> /sys/devices/system/edac/pci/edac_pci_log_pe:1
> /sys/devices/system/edac/pci/pci0/pe_count:0
> /sys/devices/system/edac/pci/pci0/npe_count:0
> /sys/devices/system/edac/pci/pci_parity_count:0
> /sys/devices/system/edac/pci/pci_nonparity_count:0
> /sys/devices/system/edac/pci/edac_pci_log_npe:1
> /sys/devices/system/edac/pci/edac_pci_panic_on_pe:0
> /sys/devices/system/edac/pci/check_pci_errors:0
> /sys/devices/system/edac/mc/power/runtime_active_time:0
> /sys/devices/system/edac/mc/power/runtime_status:unsupported
> ...
>
> with the respective hierarchy: memory controllers, PCI errors, etc.
>
> So the main question is, does it make sense for you to fit this into the
> EDAC hierarchy and what would even be the advantage of making it part of
> EDAC?
>
I also think this does not fit well with the errors reported by EDAC,
which are mainly hardware errors, as Boris explained.

Please share your thoughts and let us know whether we can merge the
patches as they are.

> HTH.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
Thread overview: 14+ messages
[not found] <20220708185608.676474-1-thierry.reding@gmail.com>
[not found] ` <20220708185608.676474-2-thierry.reding@gmail.com>
2022-07-12 13:27 ` [GIT PULL 1/7] soc/tegra: Changes for v5.20-rc1 Arnd Bergmann
2022-07-13 10:58 ` Thierry Reding
2022-07-13 12:14 ` Arnd Bergmann
2022-07-13 12:19 ` Jon Hunter
2022-07-13 12:36 ` Arnd Bergmann
2022-07-14 6:49 ` Jon Hunter
2022-07-13 20:22 ` Thierry Reding
2022-07-14 6:30 ` Jon Hunter
2022-07-14 14:45 ` Arnd Bergmann
2022-07-14 13:31 ` Borislav Petkov
2022-07-15 8:06 ` Sumit Gupta [this message]
2022-07-28 17:34 ` Thierry Reding
2022-08-22 9:31 ` Sumit Gupta
2022-09-27 16:00 ` Thierry Reding