linux-edac.vger.kernel.org archive mirror
* [v11 PATCH 0/2] Add L1 and L2 error detection for A72
@ 2025-05-29  3:00 Vijay Balakrishna
From: Vijay Balakrishna @ 2025-05-29  3:00 UTC
  To: Borislav Petkov, Tony Luck, Rob Herring, Krzysztof Kozlowski,
	Conor Dooley
  Cc: James Morse, Mauro Carvalho Chehab, Robert Richter, linux-edac,
	linux-kernel, Tyler Hicks, Marc Zyngier, Sascha Hauer,
	Lorenzo Pieralisi, devicetree, Vijay Balakrishna

This is an attempt to revive the [v5] series. I have tried to address the
comments and suggestions from Marc Zyngier since [v5]. Additionally, I have
limited support to A72 processors per the [v8] discussion. Testing the driver
on a problematic A72 SoC detected Correctable Errors (CEs).
Below are logs captured from that SoC across several boot instances.

[  876.896022] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'

[ 3700.978086] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'

[  976.956158] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'

[ 1427.933606] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'

[  192.959911] EDAC DEVICE0: CE: cortex-arm64-edac instance: cpu2 block: L1 count: 1 'L1-D Data RAM correctable error(s) on CPU 2'
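
As a rough illustration, log lines like the ones above could be tallied per
CPU and cache block with a small script. This is only a sketch: the regular
expression is inferred from the quoted samples (it is not a stable kernel
interface), UE lines are assumed to share the same layout, and the names
parse_edac_line() and tally() are made up for this example.

```python
import re
from collections import Counter

# Pattern modelled on the CE lines quoted above. The field layout is an
# assumption drawn from those samples, not a documented ABI.
EDAC_LINE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+EDAC DEVICE\d+: (?P<kind>CE|UE): "
    r"(?P<ctl>\S+) instance: (?P<instance>\S+) block: (?P<block>\S+) "
    r"count: (?P<count>\d+) '(?P<msg>[^']*)'"
)


def parse_edac_line(line):
    """Return a dict of fields for one EDAC log line, or None if no match."""
    m = EDAC_LINE.search(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = float(rec["ts"])
    rec["count"] = int(rec["count"])
    return rec


def tally(lines):
    """Sum reported error counts per (kind, instance, block)."""
    totals = Counter()
    for line in lines:
        rec = parse_edac_line(line)
        if rec is not None:
            totals[(rec["kind"], rec["instance"], rec["block"])] += rec["count"]
    return totals
```

Feeding the five lines above to tally() would group them under a single
("CE", "cpu2", "L1") key, which is the kind of per-CPU signal a fleet monitor
could alert on.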

Testing our product kernel involved adding the 'edac-enabled' property to CPU
nodes in the DTS. For mainline sanity checks, we tested under QEMU by
extracting the default DTB and modifying the DTS to include the 'edac-enabled'
property. We then verified the presence of sysfs nodes for CE and UE counts
for the emulated A72 CPUs.
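
The sysfs check described above could be scripted along these lines. This is
a minimal sketch under assumptions: it presumes the counters are exposed as
files named ce_count and ue_count somewhere below /sys/devices/system/edac
(the usual EDAC device layout), and read_edac_counts() is a hypothetical
helper, not part of this series.

```python
import os


def read_edac_counts(root="/sys/devices/system/edac"):
    """Collect every ce_count/ue_count file below root.

    Returns a dict mapping the path relative to root to the integer value.
    The default root and the file names are assumptions about the EDAC
    sysfs layout; unreadable or non-numeric files are skipped.
    """
    counts = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in ("ce_count", "ue_count"):
            if name not in filenames:
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path) as f:
                    value = int(f.read().strip())
            except (OSError, ValueError):
                continue
            counts[os.path.relpath(path, root)] = value
    return counts
```

An empty result on a system where the driver is expected to be loaded would
suggest that the 'edac-enabled' property was not picked up.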

Our primary focus is on A72. We have a significant number of A72-based systems
in our fleet, and timely replacements via monitoring CEs will be instrumental
in managing them effectively.

I am eager to hear your suggestions and feedback on this series.

Thanks,
Vijay

[v5] https://lore.kernel.org/all/20210401110615.15326-1-s.hauer@pengutronix.de/#t
[v6] https://lore.kernel.org/all/1744241785-20256-1-git-send-email-vijayb@linux.microsoft.com/
[v7] https://lore.kernel.org/all/1744409319-24912-1-git-send-email-vijayb@linux.microsoft.com/#t
[v8] https://lore.kernel.org/all/1746404860-27069-1-git-send-email-vijayb@linux.microsoft.com/
[v9] https://lore.kernel.org/all/1747353973-4749-1-git-send-email-vijayb@linux.microsoft.com/
[v10] https://lore.kernel.org/all/1748387790-20838-1-git-send-email-vijayb@linux.microsoft.com/

Changes since v10:
- edac_a72.c: add copyright line (Jonathan)
- cpus.yaml: drop stale comment line (Krzysztof)
- added "Reviewed-by" tags

Changes since v9: 
- commit title, message and prefix update (Boris)
- fix spelling in Kconfig help text (Boris)
- prepared patches against edac-for-next (Boris)
- struct naming update from "merrsr" to "mem_err_synd_reg" (Boris)
- grouping of all defines (Boris)
- function variable declarations in reverse fir tree order (Boris)
- simplify naming of static functions (Boris)
- protect smp_call_function_single() against CPU hotplug (Boris)
- "CPU" in visible string instead of "cpu" (Boris)
- error message reflects "edac_a72" driver name (Boris)
- fixed the issues with device_node release using scope exit (Jonathan)
- of_cpu_device_node_get() instead of of_get_cpu_node() (Jonathan)
- make dt-binding update applicable only for A72 using if/then schema (Rob)

Changes since v8: 
- removed support for A53 and A57
- added entry to MAINTAINERS
- added missing module exit point to enable unload

Changes since v7:
- v5 was based on an internal product kernel; review identified the
  following issues
- correct format specifier to print CPUID/WAY
- removal of unused dynamic attributes for edac_device_alloc_ctl_info()
- driver remove callback return type is void

Changes since v6:
- move the clearing of the CPU/L2 syndrome registers back into
  read_errors(), as it was in [v5]
- upon detecting a valid error, clear syndrome registers immediately
  to avoid clobbering between the read and write (Marc)
- NULL return check for of_get_cpu_node() (Tyler)
- of_node_put() to avoid refcount issue (Tyler)
- quotes are dropped in yaml file (Krzysztof)

Changes since v5:
- rebase on v6.15-rc1
- the syndrome registers for CPU/L2 memory errors are cleared only upon
  detecting an error, followed by an isb() for synchronization (Marc)
- "edac-enabled" hunk moved to initial patch to avoid breaking virtual
  environments (Marc)
- to ensure compatibility across all three families, we are not reporting
  "L1 Dirty RAM," documented only in the A53 TRM
- the above prompted changing the default CPU L1 error message from
  "unknown" to "Unspecified"
- capturing CPUID/WAY information in L2 memory error log (Marc)
- module license from "GPL v2" to "GPL" (checkpatch.pl warning)
- extend support for A72

Sascha Hauer (2):
  EDAC: Add EDAC driver for ARM Cortex A72 cores
  dt-bindings: arm: cpus: Add edac-enabled property

 .../devicetree/bindings/arm/cpus.yaml         |  50 ++--
 MAINTAINERS                                   |   7 +
 drivers/edac/Kconfig                          |   8 +
 drivers/edac/Makefile                         |   1 +
 drivers/edac/edac_a72.c                       | 230 ++++++++++++++++++
 5 files changed, 280 insertions(+), 16 deletions(-)
 create mode 100644 drivers/edac/edac_a72.c


base-commit: 855b5de2e562c07d6cda4deb08d09dc2e0e2b18d
-- 
2.49.0


Thread overview: 6+ messages
2025-05-29  3:00 [v11 PATCH 0/2] Add L1 and L2 error detection for A72 Vijay Balakrishna
2025-05-29  3:00 ` [v11 PATCH 1/2] EDAC: Add EDAC driver for ARM Cortex A72 cores Vijay Balakrishna
2025-06-30 16:33   ` Borislav Petkov
2025-05-29  3:00 ` [v11 PATCH 2/2] dt-bindings: arm: cpus: Add edac-enabled property Vijay Balakrishna
2025-05-29  9:55 ` [v11 PATCH 0/2] Add L1 and L2 error detection for A72 Borislav Petkov
2025-06-27 16:49   ` Vijay Balakrishna