From: Ruidong Tian <tianruidong@linux.alibaba.com>
To: Umang Chheda <umang.chheda@oss.qualcomm.com>,
catalin.marinas@arm.com, will@kernel.org, lpieralisi@kernel.org,
guohanjun@huawei.com, sudeep.holla@arm.com, rafael@kernel.org,
robin.murphy@arm.com, mark.rutland@arm.com, tony.luck@intel.com,
bp@alien8.de, tglx@linutronix.de, peterz@infradead.org
Cc: lenb@kernel.org, linux-kernel@vger.kernel.org,
linux-arm-kernel@lists.infradead.org, linux-acpi@vger.kernel.org,
linux-perf-users@vger.kernel.org, linux-edac@vger.kernel.org,
mchehab@kernel.org, xueshuai@linux.alibaba.com,
zhuo.song@linux.alibaba.com, oliver.yang@linux.alibaba.com
Subject: Re: [PATCH v6 00/16] Support Armv8 RAS Extensions for Kernel-first error handling
Date: Wed, 11 Mar 2026 11:25:00 +0800
Message-ID: <d4e4e2b1-c182-42f9-9d60-b12f0fd7f977@linux.alibaba.com>
In-Reply-To: <edf7e7eb-8f02-4672-bc31-16e0a8fb9715@oss.qualcomm.com>
On 2026/3/9 21:28, Umang Chheda wrote:
> Hello Ruidong Tian,
>
> On 1/22/2026 3:16 PM, Ruidong Tian wrote:
>> Motivation: Reliability in Modern Data Centers
>> =================================================
>> In modern data centers, proactive maintenance is essential for achieving high
>> service availability. The practice of using Corrected Errors (CE) to predict
>> impending Uncorrected Errors (UE) is already widely deployed at scale across
>> the industry, for example at Alibaba[2], Tencent[4], Intel[1], and AMD[3]. By analyzing CE
>> telemetry, operators can identify failing hardware and perform migrations
>> before catastrophic failures occur.
>>
>> Problem: Inefficient CE Collection on ARM
>> ==========================================
>> Currently, ARM-based systems primarily rely on "Firmware-First" error
>> handling (e.g., via GHES). This path is inherently heavy-weight. To avoid
>> significant performance overhead, firmware is often configured with high
>> thresholds—reporting to the OS only after thousands of CEs have occurred.
>> If the threshold is set lower, the high frequency of errors leads to
>> excessive and costly context switching between the OS and firmware.
>> Consequently, ARM platforms currently lack an efficient mechanism to collect
>> the granular CE data required for high-fidelity error prediction.
>>
>> Solution: Kernel-First Handling via AEST
>> ===========================================
>> Other architectures have long utilized "Kernel-First" approaches for
>> efficient CE collection: Intel provides CMCI (Corrected Machine Check
>> Interrupt), and AMD has recently introduced similar CE interrupt support[5].
>>
>> On the ARM architecture, hardware already provides the necessary RAS
>> Extensions[6], and the ACPI AEST specification[0] defines a standardized way for
>> the OS to discover these error source registers. This series implements
>> AEST support, enabling the kernel to:
>>
>> - Discover error sources directly via ACPI tables.
>> - Handle CE notifications via direct interrupts.
>> - Bypass firmware overhead to collect every CE or use low-latency thresholds.
>>
>> This implementation provides the missing link for efficient RAS telemetry
>> on ARM, bringing it to parity with other enterprise architectures.
>
> Thanks for posting this series enabling kernel-first handling for the Armv8 RAS extensions.
>
> We noticed the current implementation targets ACPI-based server platforms. For embedded/SoC systems, Device Tree is often the primary firmware description.
> Do you have any plans to add DT-based support for the same flow? If not, do you see any blockers to extending this series to support DT
> (e.g., DT bindings + discovery/registration path analogous to the ACPI plumbing)?
> If DT support is in scope, we would be happy to align on the expected approach and help with review/development/testing for DT-based platforms.
Hi Umang,

Thanks for the reply.

Adding Device Tree support should be straightforward. We just need a patch
similar to "ACPI/AEST: Parse the AEST table" that parses the DT description
into struct acpi_aest_node (which might need renaming) and struct aest_hnode.
The driver itself should require only minimal changes.

However, I'm not very familiar with DT and have no DT engineering support,
so I would need some guidance on the following DT-related questions:

- Is there a specification, similar to AEST, that defines how RAS extension
  information should be reported in DT?
- How should the DT be designed?
- How can I set up QEMU and modify DT files for debugging?

If you are prepared to invest the necessary DT-related effort, I would be
happy to adjust the patchset to meet the needs of both parties. I believe
only small modifications are required.
>
>> Background and Maintenance
>> =============================
>> This series is based on Tyler Baicar's preliminary patches [7]. I attempted
>> to follow up with Tyler in 2022 [8] but received no reply. As he no longer
>> appears active on the mailing list, I have picked up this work, updated it
>> to align with the latest AEST v2.0 specification, and addressed pending
>> feedback to ensure this critical feature is integrated into the mainline.
>>
>> AEST Driver Architecture
>> ========================
>>
>> The AEST driver is structured into three primary components:
>> - AEST device: Responsible for handling interrupts, managing the lifecycle
>> of AEST nodes, and processing error records.
>> - AEST node: Corresponds directly to a RAS node in the hardware
>> - AEST record: Represents a set of RAS registers associated with a specific
>> error source.
>>
>> Comparison with x86 MCA:
>>
>> RAS record ≈ MCA bank.
>> RAS node ≈ A set of MCA banks + CMCI on a core.
>>
>> The key difference lies in uncore handling: x86 typically maps uncore errors
>> (like those from a memory controller) into core-based MCA banks. In contrast,
>> ARM requires uncore components to provide their own standalone RAS nodes. When
>> a component requires multiple such nodes, they are grouped and managed as a
>> "RAS device" in the AEST driver.
>>
>> These components are organized hierarchically as follows:
>>
>> ┌──────────────────────────────────────────────────┐
>> │ AEST Driver Device Management │
>> │┌─────────────┐ ┌──────────┐ ┌───────────┐ │
>> ││ AEST Device ├─┬─►│AEST Node ├──┬─►│AEST Record│ │
>> │└─────────────┘ │ └──────────┘ │ └───────────┘ │
>> │ │ . │ ┌───────────┐ │
>> │ │ . ├─►│AEST Record│ │
>> │ │ . │ └───────────┘ │
>> │ │ ┌──────────┐ │ . │
>> │ ├─►│AEST Node │ │ . │
>> │ │ └──────────┘ │ . │
>> │ │ │ ┌───────────┐ │
>> │ │ ┌──────────┐ └─►│AEST Record│ │
>> │ └─►│AEST Node │ └───────────┘ │
>> │ └──────────┘ │
>> └──────────────────────────────────────────────────┘
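
To make the hierarchy above concrete, here is a minimal userspace C sketch
of the device -> node -> record containment. All sketch_* names are
hypothetical and chosen only for illustration; they are not the structures
defined in drivers/ras/aest/aest.h, which carry far more state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one error record (one set of ERR<n>* registers). */
struct sketch_record {
	int index;                      /* record index within its node */
};

/* Hypothetical model of one RAS node owning an array of records. */
struct sketch_node {
	const char *name;
	struct sketch_record *records;
	size_t nr_records;
};

/* Hypothetical model of an AEST device grouping one or more nodes. */
struct sketch_device {
	const char *name;
	struct sketch_node *nodes;
	size_t nr_nodes;
};

/* Walk device -> node and return the total number of records below it. */
static size_t sketch_count_records(const struct sketch_device *dev)
{
	size_t total = 0;

	assert(dev != NULL);
	for (size_t n = 0; n < dev->nr_nodes; n++)
		total += dev->nodes[n].nr_records;
	return total;
}
```

A device with two nodes holding three records and one record would make
sketch_count_records() return 4.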
>>
>> AEST Interrupt Handling
>> =======================
>>
>> Upon an AEST interrupt, the driver performs the following sequence:
>> 1. The AEST device iterates through all registered AEST nodes to identify the
>> specific node(s) and record(s) that reported an error.
>> 2. Each node typically contains two types of records:
>> - report record: Errors can be located efficiently through a bitmap
>> in the `ERRGSR` register.
>> - poll record: The node must individually poll all records to determine
>> if an error has occurred.
>> 3. process record:
>> - If the error is corrected, the CE threshold is reset and the error event
>> is logged.
>> - If the error is deferred, the relevant registers are dumped and
>> `memory_failure()` is invoked.
>> - If the error is uncorrected, the system panics. UEs typically trigger an
>> exception rather than an interrupt, but if one is detected here, the
>> driver panics.
>> 4. decode record: The AEST driver notifies other relevant drivers, such as
>> EDAC, to further decode the reported RAS register information.
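
Steps 1 to 3 above can be sketched in plain C as follows. This is a
simplified userspace model, not the driver code: registers are passed in as
plain integers, the sk_* names are hypothetical, and the bit positions are
my reading of the ERR<n>STATUS layout in the Arm RAS specification (V at
bit 30, UE at bit 29, CE at bits [25:24], DE at bit 23). The UET subtype is
ignored here, so the sketch follows the simplified policy of step 3 only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed ERR<n>STATUS bit layout (see the lead-in above). */
#define SK_STATUS_V   (UINT64_C(1) << 30)  /* record holds a valid error */
#define SK_STATUS_UE  (UINT64_C(1) << 29)  /* uncorrected error */
#define SK_STATUS_CE  (UINT64_C(3) << 24)  /* corrected error, 2-bit field */
#define SK_STATUS_DE  (UINT64_C(1) << 23)  /* deferred error */

/* Ordered by severity so that a larger value means a worse outcome. */
enum sk_action { SK_NONE, SK_LOG_CE, SK_MEMORY_FAILURE, SK_PANIC };

/* Step 3: pick the action for one record from its status value. */
static enum sk_action sk_classify(uint64_t status)
{
	if (!(status & SK_STATUS_V))
		return SK_NONE;
	if (status & SK_STATUS_UE)
		return SK_PANIC;
	if (status & SK_STATUS_DE)
		return SK_MEMORY_FAILURE;
	if (status & SK_STATUS_CE)
		return SK_LOG_CE;
	return SK_NONE;
}

/*
 * Steps 1-2 for "report" records: only records whose bit is set in the
 * ERRGSR bitmap are examined. Returns the most severe action found.
 */
static enum sk_action sk_handle_node(uint64_t errgsr, const uint64_t *status,
				     unsigned int nr_records)
{
	enum sk_action worst = SK_NONE;

	assert(nr_records <= 64);
	for (unsigned int i = 0; i < nr_records; i++) {
		enum sk_action a;

		if (!(errgsr & (UINT64_C(1) << i)))
			continue;       /* record i did not report */
		a = sk_classify(status[i]);
		if (a > worst)
			worst = a;
	}
	return worst;
}
```

For "poll" records the same sk_classify() call would simply be applied to
every record, without consulting the bitmap first.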
>>
>> Testing
>> ===================
>> I have tested this series on a THead Yitian710 SoC with a customized BIOS.
>> QEMU[9] can also be used for preliminary driver testing.
>>
>> 1. Boot QEMU
>>
>> qemu-system-aarch64 -smp 4 -m 32G \
>> -cpu host --enable-kvm -machine virt,gic-version=3 \
>> -kernel Image -initrd initrd.cpio.gz \
>> -device virtio-net-pci,netdev=t0 -netdev user,id=t0 \
>> -bios /usr/share/edk2/aarch64/QEMU_EFI.fd \
>> -append "rdinit=/sbin/init earlycon verbose debug console=ttyAMA0 aest.dyndbg='+pt'" \
>> -nographic -d guest_errors -D qemu.log
>>
>> 2. Inject error
>> devmem 0x90d0808 l 0xc4800390
>>
>> 2.1 Memory error
>> [ 64.959849] AEST: {1}[Hardware Error]: Hardware error from AEST memory.90d0000
>> [ 64.959852] AEST: {1}[Hardware Error]: Error from memory at SRAT proximity domain 0x0
>> [ 64.959855] AEST: {1}[Hardware Error]: ERR0FR: 0x40000080044081
>> [ 64.959858] AEST: {1}[Hardware Error]: ERR0CTRL: 0x108
>> [ 64.959859] AEST: {1}[Hardware Error]: ERR0STATUS: 0xc4800390
>> [ 64.959860] AEST: {1}[Hardware Error]: ERR0ADDR: 0x8400000043344521
>> [ 64.959861] AEST: {1}[Hardware Error]: ERR0MISC0: 0x7fff00000000
>> [ 64.959861] AEST: {1}[Hardware Error]: ERR0MISC1: 0x0
>> [ 64.959862] AEST: {1}[Hardware Error]: ERR0MISC2: 0x0
>> [ 64.959863] AEST: {1}[Hardware Error]: ERR0MISC3: 0x0
>> [ 64.959873] Memory failure: 0x43344: recovery action for free buddy page: Recovered
>>
>> 2.2 CMN error
>> [ 132.044283] AEST: {2}[Hardware Error]: Hardware error from AEST XP
>> [ 132.044286] AEST: {2}[Hardware Error]: Error from vendor hid ARMHC700 uid 0x0
>> [ 132.044288] AEST: {2}[Hardware Error]: ERR0FR: 0x48a5
>> [ 132.044290] AEST: {2}[Hardware Error]: ERR0CTRL: 0x108
>> [ 132.044292] AEST: {2}[Hardware Error]: ERR0STATUS: 0xc4800390
>> [ 132.044293] AEST: {2}[Hardware Error]: ERR0ADDR: 0x8400000043344521
>> [ 132.044295] AEST: {2}[Hardware Error]: ERR0MISC0: 0x0
>> [ 132.044296] AEST: {2}[Hardware Error]: ERR0MISC1: 0x0
>> [ 132.044298] AEST: {2}[Hardware Error]: ERR0MISC2: 0x0
>> [ 132.044299] AEST: {2}[Hardware Error]: ERR0MISC3: 0x0
>> [ 132.044302] Memory failure: 0x43344: recovery action for already poisoned page: Failed
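
As a worked example for the logs above, the following C sketch decodes the
injected ERR0STATUS value 0xc4800390 and derives the page frame number that
memory_failure() reports from ERR0ADDR. It is illustrative only: the sk_*
names are hypothetical, the field positions are my reading of the Arm RAS
specification, and the PFN helper assumes that the physical address sits in
the low 56 bits of ERR<n>ADDR and that pages are 4 KiB.

```c
#include <assert.h>
#include <stdint.h>

/* Selected ERR<n>STATUS fields, decoded into plain integers. */
struct sk_decoded {
	unsigned int addr_valid;   /* AV, bit 31 */
	unsigned int valid;        /* V,  bit 30 */
	unsigned int uncorrected;  /* UE, bit 29 */
	unsigned int misc_valid;   /* MV, bit 26 */
	unsigned int deferred;     /* DE, bit 23 */
	unsigned int ierr;         /* IERR, bits [15:8] */
	unsigned int serr;         /* SERR, bits [7:0] */
};

static struct sk_decoded sk_decode_status(uint64_t s)
{
	struct sk_decoded d;

	d.addr_valid  = (s >> 31) & 1;
	d.valid       = (s >> 30) & 1;
	d.uncorrected = (s >> 29) & 1;
	d.misc_valid  = (s >> 26) & 1;
	d.deferred    = (s >> 23) & 1;
	d.ierr        = (s >> 8) & 0xff;
	d.serr        = s & 0xff;
	return d;
}

/* Drop the attribute bits above bit 55, then convert to a 4 KiB PFN. */
static uint64_t sk_addr_to_pfn(uint64_t erraddr)
{
	uint64_t paddr = erraddr & ((UINT64_C(1) << 56) - 1);

	return paddr >> 12;
}
```

With these assumptions, 0xc4800390 decodes as a valid deferred error with
valid address and MISC registers, IERR 0x03 and SERR 0x90, and ERR0ADDR
0x8400000043344521 maps to PFN 0x43344, which matches the "Memory failure:
0x43344" lines in both logs.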
>>
>> [0]: https://developer.arm.com/documentation/den0085/0200/
>> [1]: Intel: Predicting Uncorrectable Memory Errors from the Correctable Error History
>> [2]: Alibaba. Predicting DRAM-Caused Risky VMs in Large-Scale Clouds. Published in HPCA2025
>> [3]: AMD: Physics-informed machine learning for DRAM error modeling
>> [4]: Tencent: Predicting uncorrectable memory errors for proactive replacement: An empirical study on large-scale field data
>> [5]: https://lore.kernel.org/all/20251104-wip-mca-updates-v8-4-66c8eacf67b9@amd.com/
>> [6]: https://developer.arm.com/documentation/ihi0100/
>> [7]: https://lore.kernel.org/all/20211124170708.3874-1-baicar@os.amperecomputing.com/
>> [8]: https://lore.kernel.org/all/b365db02-b28c-1b22-2e87-c011cef848e2@linux.alibaba.com/
>> [9]: https://github.com/winterddd/qemu/tree/error_record
>>
>> Change from V5:
>> https://lore.kernel.org/all/20251230090945.43969-1-tianruidong@linux.alibaba.com/
>> 1. Based on the feedback from Borislav Petkov, I've dropped the idea of a
>> unified address translation interface across ARM and AMD.
>>
>> Change from V4:
>> https://lore.kernel.org/all/20251222094351.38792-1-tianruidong@linux.alibaba.com/
>> 1. Fix build warnings in 0010 and 0014 reported by the kernel test robot:
>> https://lore.kernel.org/all/202512230122.CfXZcF76-lkp@intel.com/
>> https://lore.kernel.org/all/202512230007.Vs6IvFVD-lkp@intel.com/
>> 2. Dropped the extra patch(0014) that was mistakenly included in v4.
>>
>> Change from V3:
>> https://lore.kernel.org/all/20250115084228.107573-1-tianruidong@linux.alibaba.com/
>> 1. Add vendor AEST node framework and support CMN700
>> 2. Borislav Petkov
>> - Split into multiple smaller patches for easier review.
>> - refined the English in the cover letter for better flow.
>> 3. Accept Tomohiro Misono's comment
>>
>> Change from V2:
>> https://lore.kernel.org/all/20240321025317.114621-1-tianruidong@linux.alibaba.com/
>> 1. Tomohiro Misono
>> - dump register before panic
>> 2. Baolin Wang & Shuai Xue: accept all comments.
>> 3. Support AEST V2.
>>
>> Change from V1:
>> https://lore.kernel.org/all/20240304111517.33001-1-tianruidong@linux.alibaba.com/
>> 1. Marc Zyngier
>> - Use readq/writeq_relaxed instead of readq/writeq for MMIO address.
>> - Add sync for system register operation.
>> - Use irq_is_percpu_devid() helper to identify a per-CPU interrupt.
>> - Other fix.
>> 2. Set RAS CE threshold in AEST driver.
>> 3. Enable RAS interrupt explicitly in driver.
>> 4. UER and UEO trigger memory_failure() rather than panic.
>>
>> Ruidong Tian (16):
>> ACPI/AEST: Parse the AEST table
>> ras: AEST: Add probe/remove for AEST driver
>> ras: AEST: support different group format
>> ras: AEST: Unify the read/write interface for system and MMIO register
>> ras: AEST: Probe RAS system architecture version
>> ras: AEST: Support RAS Common Fault Injection Model Extension
>> ras: AEST: Support CE threshold of error record
>> ras: AEST: Enable and register IRQs
>> ras: AEST: Add cpuhp callback
>> ras: AEST: Introduce AEST driver sysfs interface
>> ras: AEST: Add error count tracking and debugfs interface
>> ras: AEST: Allow configuring CE threshold via debugfs
>> ras: AEST: Introduce AEST inject interface to test AEST driver
>> ras: AEST: Add framework to process AEST vendor node
>> ras: AEST: support vendor node CMN700
>> trace, ras: add ARM RAS extension trace event
>>
>> Documentation/ABI/testing/debugfs-aest | 99 +++
>> MAINTAINERS | 11 +
>> arch/arm64/include/asm/arm-cmn.h | 47 ++
>> arch/arm64/include/asm/ras.h | 95 +++
>> drivers/acpi/arm64/Kconfig | 11 +
>> drivers/acpi/arm64/Makefile | 1 +
>> drivers/acpi/arm64/aest.c | 311 +++++++
>> drivers/perf/arm-cmn.c | 37 +-
>> drivers/ras/Kconfig | 1 +
>> drivers/ras/Makefile | 1 +
>> drivers/ras/aest/Kconfig | 17 +
>> drivers/ras/aest/Makefile | 8 +
>> drivers/ras/aest/aest-cmn.c | 330 ++++++++
>> drivers/ras/aest/aest-core.c | 1054 ++++++++++++++++++++++++
>> drivers/ras/aest/aest-inject.c | 131 +++
>> drivers/ras/aest/aest-sysfs.c | 228 +++++
>> drivers/ras/aest/aest.h | 410 +++++++++
>> drivers/ras/ras.c | 3 +
>> include/linux/acpi_aest.h | 75 ++
>> include/linux/cpuhotplug.h | 1 +
>> include/linux/ras.h | 8 +
>> include/ras/ras_event.h | 71 ++
>> 22 files changed, 2914 insertions(+), 36 deletions(-)
>> create mode 100644 Documentation/ABI/testing/debugfs-aest
>> create mode 100644 arch/arm64/include/asm/arm-cmn.h
>> create mode 100644 arch/arm64/include/asm/ras.h
>> create mode 100644 drivers/acpi/arm64/aest.c
>> create mode 100644 drivers/ras/aest/Kconfig
>> create mode 100644 drivers/ras/aest/Makefile
>> create mode 100644 drivers/ras/aest/aest-cmn.c
>> create mode 100644 drivers/ras/aest/aest-core.c
>> create mode 100644 drivers/ras/aest/aest-inject.c
>> create mode 100644 drivers/ras/aest/aest-sysfs.c
>> create mode 100644 drivers/ras/aest/aest.h
>> create mode 100644 include/linux/acpi_aest.h
>
>
> Thanks,
> Umang