From: Umang Chheda <umang.chheda@oss.qualcomm.com>
To: Ruidong Tian <tianruidong@linux.alibaba.com>,
	catalin.marinas@arm.com, will@kernel.org, lpieralisi@kernel.org,
	guohanjun@huawei.com, sudeep.holla@arm.com, rafael@kernel.org,
	robin.murphy@arm.com, mark.rutland@arm.com, tony.luck@intel.com,
	bp@alien8.de, tglx@linutronix.de, peterz@infradead.org
Cc: lenb@kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-acpi@vger.kernel.org,
	linux-perf-users@vger.kernel.org, linux-edac@vger.kernel.org,
	mchehab@kernel.org, xueshuai@linux.alibaba.com,
	zhuo.song@linux.alibaba.com, oliver.yang@linux.alibaba.com
Subject: Re: [PATCH v6 00/16] Support Armv8 RAS Extensions for Kernel-first error handling
Date: Mon, 9 Mar 2026 18:58:29 +0530
Message-ID: <edf7e7eb-8f02-4672-bc31-16e0a8fb9715@oss.qualcomm.com>
In-Reply-To: <20260122094656.73399-1-tianruidong@linux.alibaba.com>

Hello Ruidong Tian,

On 1/22/2026 3:16 PM, Ruidong Tian wrote:
> Motivation: Reliability in Modern Data Centers
> =================================================
> In modern data centers, proactive maintenance is essential for achieving high
> service availability. The practice of using Corrected Errors (CE) to predict
> impending Uncorrected Errors (UE) is already widely deployed at scale across
> the industry, e.g. at Alibaba[2], Tencent[4], Intel[1], and AMD[3]. By analyzing CE
> telemetry, operators can identify failing hardware and perform migrations
> before catastrophic failures occur.
>
> Problem: Inefficient CE Collection on ARM
> ==========================================
> Currently, ARM-based systems primarily rely on "Firmware-First" error
> handling (e.g., via GHES). This path is inherently heavy-weight. To avoid
> significant performance overhead, firmware is often configured with high
> thresholds, reporting to the OS only after thousands of CEs have occurred.
> If the threshold is set lower, the high frequency of errors leads to
> excessive and costly context switching between the OS and firmware.
> Consequently, ARM platforms currently lack an efficient mechanism to collect
> the granular CE data required for high-fidelity error prediction.
>
> Solution: Kernel-First Handling via AEST
> ===========================================
> Other architectures have long utilized "Kernel-First" approaches for
> efficient CE collection: Intel provides CMCI (Corrected Machine Check
> Interrupt), and AMD has recently introduced similar CE interrupt support[5].
>
> On the ARM architecture, hardware already provides the necessary RAS
> Extensions[6], and the ACPI AEST specification[0] defines a standardized way for
> the OS to discover these error source registers. This series implements
> AEST support, enabling the kernel to:
>
>  - Discover error sources directly via ACPI tables.
>  - Handle CE notifications via direct interrupts.
>  - Bypass firmware overhead to collect every CE or use low-latency thresholds.
>
> This implementation provides the missing link for efficient RAS telemetry
> on ARM, bringing it to parity with other enterprise architectures.

Thanks for posting this series enabling kernel-first handling for the Armv8 RAS extensions.

We noticed that the current implementation targets ACPI-based server platforms. On embedded/SoC systems, Device Tree is often the primary firmware description.
Do you have any plans to add DT-based support for the same flow? If not, do you see any blockers to extending this series to support DT
(e.g., DT bindings plus a discovery/registration path analogous to the ACPI plumbing)?
If DT support is in scope, we would be happy to align on the expected approach and help with review, development, and testing on DT-based platforms.

> Background and Maintenance
> =============================
> This series is based on Tyler Baicar's preliminary patches [7]. I attempted
> to follow up with Tyler in 2022 [8] but received no reply. As he no longer
> appears active on the mailing list, I have picked up this work, updated it
> to align with the latest AEST v2.0 specification, and addressed pending
> feedback to ensure this critical feature is integrated into the mainline.
>
> AEST Driver Architecture
> ========================
>
> The AEST driver is structured into three primary components:
>   - AEST device: Responsible for handling interrupts, managing the lifecycle
>                  of AEST nodes, and processing error records.
>   - AEST node: Corresponds directly to a RAS node in the hardware
>   - AEST record: Represents a set of RAS registers associated with a specific
>                  error source.
>
> Comparison with x86 MCA:
>
> RAS record ≈ MCA bank.
> RAS node ≈ A set of MCA banks + CMCI on a core.
>
> The key difference lies in uncore handling: x86 typically maps uncore errors
> (like those from a memory controller) into core-based MCA banks. In contrast,
> ARM requires uncore components to provide their own standalone RAS nodes. When
> a component requires multiple such nodes, they are grouped and managed as a
> "RAS device" in the AEST driver.
>
> These components are organized hierarchically as follows:
>
>  ┌──────────────────────────────────────────────────┐
>  │             AEST Driver Device Management        │
>  │┌─────────────┐    ┌──────────┐     ┌───────────┐ │
>  ││ AEST Device ├─┬─►│AEST Node ├──┬─►│AEST Record│ │
>  │└─────────────┘ │  └──────────┘  │  └───────────┘ │
>  │                │       .        │  ┌───────────┐ │
>  │                │       .        ├─►│AEST Record│ │
>  │                │       .        │  └───────────┘ │
>  │                │  ┌──────────┐  │        .       │
>  │                ├─►│AEST Node │  │        .       │
>  │                │  └──────────┘  │        .       │
>  │                │                │  ┌───────────┐ │
>  │                │  ┌──────────┐  └─►│AEST Record│ │
>  │                └─►│AEST Node │     └───────────┘ │
>  │                   └──────────┘                   │
>  └──────────────────────────────────────────────────┘
>
> AEST Interrupt Handling
> =======================
>
> Upon an AEST interrupt, the driver performs the following sequence:
> 1. The AEST device iterates through all registered AEST nodes to identify the
>    specific node(s) and record(s) that reported an error.
> 2. Each node typically contains two types of records:
>       - report record: Errors can be located efficiently through a bitmap
>                        in the `ERRGSR` register.
>       - poll record: The node must individually poll all records to determine
>                      if an error has occurred.
> 3. process record:
>       - if the error is corrected, the CE threshold is reset and the error
>         event is logged.
>       - if the error is deferred, the relevant registers are dumped and
>         `memory_failure()` is invoked.
>       - if the error is uncorrected, the system panics. UEs typically trigger
>         an exception rather than an interrupt, but if one is detected here,
>         the driver still panics.
> 4. decode record: The AEST driver notifies other relevant drivers, such as
>    EDAC, to further decode the reported RAS register information.
>
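
To check our understanding of the sequence above, here is a minimal user-space
sketch of the node/record walk. It is a model under stated assumptions, not the
series' code: the struct layout, the `report_mask` field, and all `sketch_*`
names are hypothetical, and the status array stands in for the real
ERR<n>STATUS reads. The bit positions follow the ERR<n>STATUS layout in the
Arm RAS Extension as we read it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ERR<n>STATUS bits used for classification. */
#define ERR_STATUS_V   (UINT64_C(1) << 30) /* record holds a valid error */
#define ERR_STATUS_UE  (UINT64_C(1) << 29) /* uncorrected error */
#define ERR_STATUS_CE  (UINT64_C(3) << 24) /* corrected error (2-bit field) */
#define ERR_STATUS_DE  (UINT64_C(1) << 23) /* deferred error */

#define SKETCH_MAX_RECORDS 64

/* Hypothetical model of one AEST node; not the driver's real structure. */
struct sketch_node {
	uint64_t errgsr;      /* ERRGSR: one pending bit per report record */
	uint64_t report_mask; /* records visible through ERRGSR; rest are polled */
	uint64_t status[SKETCH_MAX_RECORDS]; /* stand-in for ERR<n>STATUS */
	unsigned int nr_records;
};

/* What the handler would have done, counted instead of executed. */
struct sketch_counts {
	unsigned int ce; /* log the event, reset the CE threshold */
	unsigned int de; /* dump registers, call memory_failure() */
	unsigned int ue; /* panic */
};

/* Step 3: classify one record and account for the action taken. */
static void sketch_process_record(uint64_t status, struct sketch_counts *c)
{
	if (!(status & ERR_STATUS_V))
		return;
	if (status & ERR_STATUS_UE)
		c->ue++;
	else if (status & ERR_STATUS_DE)
		c->de++;
	else if (status & ERR_STATUS_CE)
		c->ce++;
}

/* Step 2: report records are filtered by ERRGSR, poll records are always read. */
static void sketch_handle_node(const struct sketch_node *node,
			       struct sketch_counts *c)
{
	for (unsigned int i = 0;
	     i < node->nr_records && i < SKETCH_MAX_RECORDS; i++) {
		uint64_t bit = UINT64_C(1) << i;

		if ((node->report_mask & bit) && !(node->errgsr & bit))
			continue; /* report record with nothing pending */
		sketch_process_record(node->status[i], c);
	}
}

/* Step 1: the device walks every registered node on each interrupt. */
static void sketch_handle_irq(const struct sketch_node *nodes, size_t nr_nodes,
			      struct sketch_counts *c)
{
	assert(nodes != NULL && c != NULL);
	for (size_t n = 0; n < nr_nodes; n++)
		sketch_handle_node(&nodes[n], c);
}
```

If this matches the intended flow, the discovery side (how the node list and
`report_mask` get populated) looks like the only part that is ACPI-specific,
which is what prompted the DT question above.
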
> Testing
> ===================
> I have tested this series on a THead Yitian710 SoC with a customized BIOS.
> QEMU[9] can also be used for preliminary driver testing.
>
> 1. Boot QEMU
>
> qemu-system-aarch64 -smp 4 -m 32G \
>   -cpu host --enable-kvm -machine virt,gic-version=3 \
>   -kernel Image -initrd initrd.cpio.gz \
>   -device virtio-net-pci,netdev=t0 -netdev user,id=t0 \
>   -bios /usr/share/edk2/aarch64/QEMU_EFI.fd  \
>   -append "rdinit=/sbin/init earlycon verbose debug console=ttyAMA0 aest.dyndbg='+pt'" \
>   -nographic -d guest_errors -D qemu.log
>
> 2. inject error
> devmem 0x90d0808 l 0xc4800390
>
> 2.1 Memory error
> [   64.959849] AEST: {1}[Hardware Error]: Hardware error from AEST memory.90d0000
> [   64.959852] AEST: {1}[Hardware Error]:  Error from memory at SRAT proximity domain 0x0
> [   64.959855] AEST: {1}[Hardware Error]:   ERR0FR: 0x40000080044081
> [   64.959858] AEST: {1}[Hardware Error]:   ERR0CTRL: 0x108
> [   64.959859] AEST: {1}[Hardware Error]:   ERR0STATUS: 0xc4800390
> [   64.959860] AEST: {1}[Hardware Error]:   ERR0ADDR: 0x8400000043344521
> [   64.959861] AEST: {1}[Hardware Error]:   ERR0MISC0: 0x7fff00000000
> [   64.959861] AEST: {1}[Hardware Error]:   ERR0MISC1: 0x0
> [   64.959862] AEST: {1}[Hardware Error]:   ERR0MISC2: 0x0
> [   64.959863] AEST: {1}[Hardware Error]:   ERR0MISC3: 0x0
> [   64.959873] Memory failure: 0x43344: recovery action for free buddy page: Recovered
>
> 2.2 CMN error
> [  132.044283] AEST: {2}[Hardware Error]: Hardware error from AEST XP
> [  132.044286] AEST: {2}[Hardware Error]:  Error from vendor hid ARMHC700 uid 0x0
> [  132.044288] AEST: {2}[Hardware Error]:   ERR0FR: 0x48a5
> [  132.044290] AEST: {2}[Hardware Error]:   ERR0CTRL: 0x108
> [  132.044292] AEST: {2}[Hardware Error]:   ERR0STATUS: 0xc4800390
> [  132.044293] AEST: {2}[Hardware Error]:   ERR0ADDR: 0x8400000043344521
> [  132.044295] AEST: {2}[Hardware Error]:   ERR0MISC0: 0x0
> [  132.044296] AEST: {2}[Hardware Error]:   ERR0MISC1: 0x0
> [  132.044298] AEST: {2}[Hardware Error]:   ERR0MISC2: 0x0
> [  132.044299] AEST: {2}[Hardware Error]:   ERR0MISC3: 0x0
> [  132.044302] Memory failure: 0x43344: recovery action for already poisoned page: Failed
>
> [0]: https://developer.arm.com/documentation/den0085/0200/
> [1]: Intel: Predicting Uncorrectable Memory Errors from the Correctable Error History
> [2]: Alibaba. Predicting DRAM-Caused Risky VMs in Large-Scale Clouds. Published in HPCA2025
> [3]: AMD: Physics-informed machine learning for DRAM error modeling
> [4]: Tencent: Predicting uncorrectable memory errors for proactive replacement: An empirical study on large-scale field data
> [5]: https://lore.kernel.org/all/20251104-wip-mca-updates-v8-4-66c8eacf67b9@amd.com/
> [6]: https://developer.arm.com/documentation/ihi0100/
> [7]: https://lore.kernel.org/all/20211124170708.3874-1-baicar@os.amperecomputing.com/
> [8]: https://lore.kernel.org/all/b365db02-b28c-1b22-2e87-c011cef848e2@linux.alibaba.com/
> [9]: https://github.com/winterddd/qemu/tree/error_record
>
> Change from V5:
> https://lore.kernel.org/all/20251230090945.43969-1-tianruidong@linux.alibaba.com/
> 1. Based on the feedback from Borislav Petkov, I've dropped the idea of a 
>    unified address translation interface across ARM and AMD.
>
> Change from V4:
> https://lore.kernel.org/all/20251222094351.38792-1-tianruidong@linux.alibaba.com/
> 1. Fix build warnings in 0010 and 0014 reported by the kernel test robot:
>     https://lore.kernel.org/all/202512230122.CfXZcF76-lkp@intel.com/
>     https://lore.kernel.org/all/202512230007.Vs6IvFVD-lkp@intel.com/
> 2. Dropped the extra patch(0014) that was mistakenly included in v4.
>
> Change from V3:
> https://lore.kernel.org/all/20250115084228.107573-1-tianruidong@linux.alibaba.com/
> 1. Add vendor AEST node framework and support CMN700
> 2. Borislav Petkov
>     - Split into multiple smaller patches for easier review.
>     - refined the English in the cover letter for better flow.
> 3. Accept Tomohiro Misono's comment
>
> Change from V2:
> https://lore.kernel.org/all/20240321025317.114621-1-tianruidong@linux.alibaba.com/
> 1. Tomohiro Misono
>     - dump register before panic
> 2. Baolin Wang & Shuai Xue: accepted all comments.
> 3. Support AEST V2.
>
> Change from V1:
> https://lore.kernel.org/all/20240304111517.33001-1-tianruidong@linux.alibaba.com/
> 1. Marc Zyngier
>   - Use readq/writeq_relaxed instead of readq/writeq for MMIO address.
>   - Add sync for system register operation.
>   - Use irq_is_percpu_devid() helper to identify a per-CPU interrupt.
>   - Other fix.
> 2. Set RAS CE threshold in AEST driver.
> 3. Enable RAS interrupt explicitly in driver.
> 4. UER and UEO trigger memory_failure() rather than panic.
>
> Ruidong Tian (16):
>   ACPI/AEST: Parse the AEST table
>   ras: AEST: Add probe/remove for AEST driver
>   ras: AEST: support different group format
>   ras: AEST: Unify the read/write interface for system and MMIO register
>   ras: AEST: Probe RAS system architecture version
>   ras: AEST: Support RAS Common Fault Injection Model Extension
>   ras: AEST: Support CE threshold of error record
>   ras: AEST: Enable and register IRQs
>   ras: AEST: Add cpuhp callback
>   ras: AEST: Introduce AEST driver sysfs interface
>   ras: AEST: Add error count tracking and debugfs interface
>   ras: AEST: Allow configuring CE threshold via debugfs
>   ras: AEST: Introduce AEST inject interface to test AEST driver
>   ras: AEST: Add framework to process AEST vendor node
>   ras: AEST: support vendor node CMN700
>   trace, ras: add ARM RAS extension trace event
>
>  Documentation/ABI/testing/debugfs-aest |   99 +++
>  MAINTAINERS                            |   11 +
>  arch/arm64/include/asm/arm-cmn.h       |   47 ++
>  arch/arm64/include/asm/ras.h           |   95 +++
>  drivers/acpi/arm64/Kconfig             |   11 +
>  drivers/acpi/arm64/Makefile            |    1 +
>  drivers/acpi/arm64/aest.c              |  311 +++++++
>  drivers/perf/arm-cmn.c                 |   37 +-
>  drivers/ras/Kconfig                    |    1 +
>  drivers/ras/Makefile                   |    1 +
>  drivers/ras/aest/Kconfig               |   17 +
>  drivers/ras/aest/Makefile              |    8 +
>  drivers/ras/aest/aest-cmn.c            |  330 ++++++++
>  drivers/ras/aest/aest-core.c           | 1054 ++++++++++++++++++++++++
>  drivers/ras/aest/aest-inject.c         |  131 +++
>  drivers/ras/aest/aest-sysfs.c          |  228 +++++
>  drivers/ras/aest/aest.h                |  410 +++++++++
>  drivers/ras/ras.c                      |    3 +
>  include/linux/acpi_aest.h              |   75 ++
>  include/linux/cpuhotplug.h             |    1 +
>  include/linux/ras.h                    |    8 +
>  include/ras/ras_event.h                |   71 ++
>  22 files changed, 2914 insertions(+), 36 deletions(-)
>  create mode 100644 Documentation/ABI/testing/debugfs-aest
>  create mode 100644 arch/arm64/include/asm/arm-cmn.h
>  create mode 100644 arch/arm64/include/asm/ras.h
>  create mode 100644 drivers/acpi/arm64/aest.c
>  create mode 100644 drivers/ras/aest/Kconfig
>  create mode 100644 drivers/ras/aest/Makefile
>  create mode 100644 drivers/ras/aest/aest-cmn.c
>  create mode 100644 drivers/ras/aest/aest-core.c
>  create mode 100644 drivers/ras/aest/aest-inject.c
>  create mode 100644 drivers/ras/aest/aest-sysfs.c
>  create mode 100644 drivers/ras/aest/aest.h
>  create mode 100644 include/linux/acpi_aest.h


Thanks,
Umang

