Re: [PATCH v6 00/16] Support Armv8 RAS Extensions for Kernel-first error handling

public inbox for linux-acpi@vger.kernel.org

From: Ruidong Tian <tianruidong@linux.alibaba.com>
To: Umang Chheda <umang.chheda@oss.qualcomm.com>,
	catalin.marinas@arm.com, will@kernel.org, lpieralisi@kernel.org,
	guohanjun@huawei.com, sudeep.holla@arm.com, rafael@kernel.org,
	robin.murphy@arm.com, mark.rutland@arm.com, tony.luck@intel.com,
	bp@alien8.de, tglx@linutronix.de, peterz@infradead.org
Cc: lenb@kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-acpi@vger.kernel.org,
	linux-perf-users@vger.kernel.org, linux-edac@vger.kernel.org,
	mchehab@kernel.org, xueshuai@linux.alibaba.com,
	zhuo.song@linux.alibaba.com, oliver.yang@linux.alibaba.com
Subject: Re: [PATCH v6 00/16] Support Armv8 RAS Extensions for Kernel-first error handling
Date: Wed, 11 Mar 2026 11:25:00 +0800
Message-ID: <d4e4e2b1-c182-42f9-9d60-b12f0fd7f977@linux.alibaba.com>
In-Reply-To: <edf7e7eb-8f02-4672-bc31-16e0a8fb9715@oss.qualcomm.com>



On 2026/3/9 21:28, Umang Chheda wrote:
> Hello Ruidong Tian,
> 
> On 1/22/2026 3:16 PM, Ruidong Tian wrote:
>> Motivation: Reliability in Modern Data Centers
>> =================================================
>> In modern data centers, proactive maintenance is essential for achieving high
>> service availability. The practice of using Corrected Errors (CE) to predict
>> impending Uncorrected Errors (UE) is already widely deployed at scale across
>> the industry, e.g. at Alibaba[2], Tencent[4], Intel[1] and AMD[3]. By analyzing CE
>> telemetry, operators can identify failing hardware and perform migrations
>> before catastrophic failures occur.
>>
>> Problem: Inefficient CE Collection on ARM
>> ==========================================
>> Currently, ARM-based systems primarily rely on "Firmware-First" error
>> handling (e.g., via GHES). This path is inherently heavy-weight. To avoid
>> significant performance overhead, firmware is often configured with high
>> thresholds—reporting to the OS only after thousands of CEs have occurred.
>> If the threshold is set lower, the high frequency of errors leads to
>> excessive and costly context switching between the OS and firmware.
>> Consequently, ARM platforms currently lack an efficient mechanism to collect
>> the granular CE data required for high-fidelity error prediction.
>>
>> Solution: Kernel-First Handling via AEST
>> ===========================================
>> Other architectures have long utilized "Kernel-First" approaches for
>> efficient CE collection: Intel provides CMCI (Corrected Machine Check
>> Interrupt), and AMD has recently introduced similar CE interrupt support[5].
>>
>> On the ARM architecture, hardware already provides the necessary RAS
>> Extensions[6], and the ACPI AEST specification[0] defines a standardized way for
>> the OS to discover these error source registers. This series implements
>> AEST support, enabling the kernel to:
>>
>>   - Discover error sources directly via ACPI tables.
>>   - Handle CE notifications via direct interrupts.
>>   - Bypass firmware overhead to collect every CE or use low-latency thresholds.
>>
>> This implementation provides the missing link for efficient RAS telemetry
>> on ARM, bringing it to parity with other enterprise architectures.
> 
> Thanks for posting this series enabling kernel-first handling for the Armv8 RAS extensions.
> 
> We noticed the current implementation targets ACPI-based server platforms. For embedded/SoC systems, Device Tree is often the primary firmware description.
> Do you have any plans to add DT-based support for the same flow? If not, do you see any blockers to extending this series to support DT
> (e.g., DT bindings plus a discovery/registration path analogous to the ACPI plumbing)?
> If DT support is in scope, we would be happy to align on the expected approach and help with review/development/testing for DT-based platforms.

Hi Umang,

Thanks for the reply.

Adding Device Tree support should be easy. We just need a patch similar
to "ACPI/AEST: Parse the AEST table" that parses the DT description and
populates struct acpi_aest_node (which might need renaming) and struct
aest_hnode. The driver part requires minimal changes.
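
To illustrate the idea, here is a minimal Python model (not kernel code) of two
firmware front ends filling one firmware-neutral node description that the
driver core consumes. Everything in it is an assumption made for illustration:
`NodeDesc`, its fields, the dictionary keys, the "num-records" property and
IRQ 50 are hypothetical and are neither taken from the AEST table layout nor
from any existing DT binding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeDesc:
    """Firmware-neutral description of one RAS node (hypothetical fields)."""
    name: str
    mmio_base: int
    irq: int
    record_count: int


def from_acpi(entry: dict) -> NodeDesc:
    # Stand-in for the ACPI front end; the keys are illustrative only.
    return NodeDesc(entry["name"], entry["base_address"],
                    entry["gsiv"], entry["num_records"])


def from_dt(node: dict) -> NodeDesc:
    # Stand-in for a DT front end; "num-records" is a made-up property.
    return NodeDesc(node["name"], node["reg"][0],
                    node["interrupts"][0], node["num-records"])


def probe(desc: NodeDesc) -> str:
    # The driver core only sees NodeDesc, so it does not depend on
    # which front end produced the description.
    return (f"{desc.name}@{desc.mmio_base:x}: irq {desc.irq}, "
            f"{desc.record_count} record(s)")


acpi_desc = from_acpi({"name": "memory", "base_address": 0x90d0000,
                       "gsiv": 50, "num_records": 1})
dt_desc = from_dt({"name": "memory", "reg": [0x90d0000, 0x10000],
                   "interrupts": [50], "num-records": 1})
```

Both front ends yield the same `NodeDesc`, which is why the driver part should
need few changes once the parsing step is separated from it.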

However, I'm not very familiar with DT and lack DT engineering support, 
so I would need some guidance on these DT-related questions:

- Is there a specification, similar to AEST, that defines how RAS
extension information should be described in DT?
- How should the DT be designed?
- How can I set up QEMU and modify the DT files for development and debugging?

I would be happy to adjust the patchset to meet the needs of both
parties if you are prepared to invest the necessary (DT-related) effort.
In practice, I believe only minor modifications are required.

> 
>> Background and Maintenance
>> =============================
>> This series is based on Tyler Baicar's preliminary patches [7]. I attempted
>> to follow up with Tyler in 2022 [8] but received no reply. As he no longer
>> appears active on the mailing list, I have picked up this work, updated it
>> to align with the latest AEST v2.0 specification, and addressed pending
>> feedback to ensure this critical feature is integrated into the mainline.
>>
>> AEST Driver Architecture
>> ========================
>>
>> The AEST driver is structured into three primary components:
>>    - AEST device: Responsible for handling interrupts, managing the lifecycle
>>                   of AEST nodes, and processing error records.
>>    - AEST node: Corresponds directly to a RAS node in the hardware
>>    - AEST record: Represents a set of RAS registers associated with a specific
>>                   error source.
>>
>> Comparison with x86 MCA:
>>
>> RAS record ≈ MCA bank.
>> RAS node ≈ A set of MCA banks + CMCI on a core.
>>
>> The key difference lies in uncore handling: x86 typically maps uncore errors
>> (like those from a memory controller) into core-based MCA banks. In contrast,
>> ARM requires uncore components to provide their own standalone RAS nodes. When
>> a component requires multiple such nodes, they are grouped and managed as a
>> "RAS device" in the AEST driver.
>>
>> These components are organized hierarchically as follows:
>>
>>   ┌──────────────────────────────────────────────────┐
>>   │             AEST Driver Device Management        │
>>   │┌─────────────┐    ┌──────────┐     ┌───────────┐ │
>>   ││ AEST Device ├─┬─►│AEST Node ├──┬─►│AEST Record│ │
>>   │└─────────────┘ │  └──────────┘  │  └───────────┘ │
>>   │                │       .        │  ┌───────────┐ │
>>   │                │       .        ├─►│AEST Record│ │
>>   │                │       .        │  └───────────┘ │
>>   │                │  ┌──────────┐  │        .       │
>>   │                ├─►│AEST Node │  │        .       │
>>   │                │  └──────────┘  │        .       │
>>   │                │                │  ┌───────────┐ │
>>   │                │  ┌──────────┐  └─►│AEST Record│ │
>>   │                └─►│AEST Node │     └───────────┘ │
>>   │                   └──────────┘                   │
>>   └──────────────────────────────────────────────────┘
>>
>> AEST Interrupt Handling
>> =======================
>>
>> Upon an AEST interrupt, the driver performs the following sequence:
>> 1. The AEST device iterates through all registered AEST nodes to identify the
>>     specific node(s) and record(s) that reported an error.
>> 2. Each node typically contains two types of records:
>>        - report record: Errors can be located efficiently through a bitmap
>>                         in the `ERRGSR` register.
>>        - poll record: The node must individually poll all records to determine
>>                       if an error has occurred.
>> 3. process record:
>>        - If the error is corrected, the CE threshold is reset and the error
>>          event is logged.
>>        - If the error is deferred, the relevant registers are dumped and
>>          `memory_failure()` is invoked.
>>        - If the error is uncorrected, the system panics. UEs typically trigger
>>          an exception rather than an interrupt, but they are still handled if
>>          detected here.
>> 4. decode record: The AEST driver notifies other relevant drivers, such as
>>     EDAC, to further decode the reported RAS register information.
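
The sequence above can be modelled in a few lines of Python. This is only a
sketch of the described control flow, not the driver's code: `Record`, `Node`,
`pending()`, `process()` and `handle_irq()` are hypothetical names, the
severity string stands in for the contents of ERR<n>STATUS, and step 4
(notifying decoders such as EDAC) is left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Record:
    index: int
    severity: Optional[str] = None  # None (no error), "CE", "DE" or "UE"


@dataclass
class Node:
    records: List[Record]
    report_indices: List[int]  # records that are summarised in ERRGSR
    errgsr: int = 0            # model of ERRGSR: bit i set => record i has an error

    def pending(self) -> List[Record]:
        found = []
        for rec in self.records:
            if rec.index in self.report_indices:
                # Report record: one bitmap test locates the error.
                if self.errgsr & (1 << rec.index):
                    found.append(rec)
            elif rec.severity is not None:
                # Poll record: its status must be read individually.
                found.append(rec)
        return found


def process(rec: Record) -> str:
    if rec.severity == "CE":
        return "reset CE threshold, log event"
    if rec.severity == "DE":
        return "dump registers, memory_failure()"
    if rec.severity == "UE":
        return "panic"
    raise ValueError(f"record {rec.index} has no error")


def handle_irq(nodes: List[Node]) -> List[Tuple[int, str]]:
    # Step 1: walk every registered node; steps 2-3: find and process records.
    return [(rec.index, process(rec)) for node in nodes for rec in node.pending()]


node = Node(records=[Record(0, "CE"), Record(1), Record(2, "DE")],
            report_indices=[0, 1], errgsr=0b01)
actions = handle_irq([node])
```

In this example record 0 is found through the bitmap, record 1 is skipped
without being read, and record 2 is found by polling.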
>>
>> Testing
>> ===================
>> I have tested this series on a THead Yitian710 SoC with a customized BIOS.
>> QEMU[9] can also be used for preliminary driver testing.
>>
>> 1. Boot QEMU
>>
>> qemu-system-aarch64 -smp 4 -m 32G \
>>    -cpu host --enable-kvm -machine virt,gic-version=3 \
>>    -kernel Image -initrd initrd.cpio.gz \
>>    -device virtio-net-pci,netdev=t0 -netdev user,id=t0 \
>>    -bios /usr/share/edk2/aarch64/QEMU_EFI.fd  \
>>    -append "rdinit=/sbin/init earlycon verbose debug console=ttyAMA0 aest.dyndbg='+pt'" \
>>    -nographic -d guest_errors -D qemu.log
>>
>> 2. Inject an error
>> devmem 0x90d0808 l 0xc4800390
>>
>> 2.1 Memory error
>> [   64.959849] AEST: {1}[Hardware Error]: Hardware error from AEST memory.90d0000
>> [   64.959852] AEST: {1}[Hardware Error]:  Error from memory at SRAT proximity domain 0x0
>> [   64.959855] AEST: {1}[Hardware Error]:   ERR0FR: 0x40000080044081
>> [   64.959858] AEST: {1}[Hardware Error]:   ERR0CTRL: 0x108
>> [   64.959859] AEST: {1}[Hardware Error]:   ERR0STATUS: 0xc4800390
>> [   64.959860] AEST: {1}[Hardware Error]:   ERR0ADDR: 0x8400000043344521
>> [   64.959861] AEST: {1}[Hardware Error]:   ERR0MISC0: 0x7fff00000000
>> [   64.959861] AEST: {1}[Hardware Error]:   ERR0MISC1: 0x0
>> [   64.959862] AEST: {1}[Hardware Error]:   ERR0MISC2: 0x0
>> [   64.959863] AEST: {1}[Hardware Error]:   ERR0MISC3: 0x0
>> [   64.959873] Memory failure: 0x43344: recovery action for free buddy page: Recovered
>>
>> 2.2 CMN error
>> [  132.044283] AEST: {2}[Hardware Error]: Hardware error from AEST XP
>> [  132.044286] AEST: {2}[Hardware Error]:  Error from vendor hid ARMHC700 uid 0x0
>> [  132.044288] AEST: {2}[Hardware Error]:   ERR0FR: 0x48a5
>> [  132.044290] AEST: {2}[Hardware Error]:   ERR0CTRL: 0x108
>> [  132.044292] AEST: {2}[Hardware Error]:   ERR0STATUS: 0xc4800390
>> [  132.044293] AEST: {2}[Hardware Error]:   ERR0ADDR: 0x8400000043344521
>> [  132.044295] AEST: {2}[Hardware Error]:   ERR0MISC0: 0x0
>> [  132.044296] AEST: {2}[Hardware Error]:   ERR0MISC1: 0x0
>> [  132.044298] AEST: {2}[Hardware Error]:   ERR0MISC2: 0x0
>> [  132.044299] AEST: {2}[Hardware Error]:   ERR0MISC3: 0x0
>> [  132.044302] Memory failure: 0x43344: recovery action for already poisoned page: Failed
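
As a worked example, the injected ERR0STATUS value and the logged ERR0ADDR can
be decoded by hand. The sketch below assumes the ERR<n>STATUS field positions
as I understand them from the Arm RAS architecture (AV[31], V[30], UE[29],
ER[28], OF[27], MV[26], CE[25:24], DE[23], PN[22], UET[21:20], IERR[15:8],
SERR[7:0]), a physical address in ERR<n>ADDR[55:0], and 4 KiB pages. Please
check these against the specification[6] before relying on them. The helper
names are hypothetical.

```python
def bits(value: int, hi: int, lo: int) -> int:
    """Extract the inclusive bit field value[hi:lo]."""
    return (value >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_status(status: int) -> dict:
    # Field positions are an assumption, see the text above.
    return {
        "AV": bits(status, 31, 31),    # address valid
        "V": bits(status, 30, 30),     # record valid
        "UE": bits(status, 29, 29),    # uncorrected error
        "ER": bits(status, 28, 28),
        "OF": bits(status, 27, 27),    # overflow
        "MV": bits(status, 26, 26),    # MISC registers valid
        "CE": bits(status, 25, 24),    # corrected error
        "DE": bits(status, 23, 23),    # deferred error
        "PN": bits(status, 22, 22),    # poison
        "UET": bits(status, 21, 20),
        "IERR": bits(status, 15, 8),
        "SERR": bits(status, 7, 0),
    }


PAGE_SHIFT = 12  # assumes 4 KiB pages


def pfn_from_addr(err_addr: int) -> int:
    # Drop the attribute bits above bit 55, then convert to a page frame number.
    return bits(err_addr, 55, 0) >> PAGE_SHIFT


status = decode_status(0xc4800390)
pfn = pfn_from_addr(0x8400000043344521)
```

Under these assumptions the value 0xc4800390 has V, AV, MV and DE set with UE
and CE clear, i.e. a deferred error with a valid address, and the address
yields pfn 0x43344, which matches the "Memory failure: 0x43344" lines in both
logs.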
>>
>> [0]: https://developer.arm.com/documentation/den0085/0200/
>> [1]: Intel: Predicting Uncorrectable Memory Errors from the Correctable Error History
>> [2]: Alibaba: Predicting DRAM-Caused Risky VMs in Large-Scale Clouds. Published in HPCA 2025
>> [3]: AMD: Physics-informed machine learning for DRAM error modeling
>> [4]: Tencent: Predicting uncorrectable memory errors for proactive replacement: An empirical study on large-scale field data
>> [5]: https://lore.kernel.org/all/20251104-wip-mca-updates-v8-4-66c8eacf67b9@amd.com/
>> [6]: https://developer.arm.com/documentation/ihi0100/
>> [7]: https://lore.kernel.org/all/20211124170708.3874-1-baicar@os.amperecomputing.com/
>> [8]: https://lore.kernel.org/all/b365db02-b28c-1b22-2e87-c011cef848e2@linux.alibaba.com/
>> [9]: https://github.com/winterddd/qemu/tree/error_record
>>
>> Change from V5:
>> https://lore.kernel.org/all/20251230090945.43969-1-tianruidong@linux.alibaba.com/
>> 1. Based on the feedback from Borislav Petkov, I've dropped the idea of a
>>     unified address translation interface across ARM and AMD.
>>
>> Change from V4:
>> https://lore.kernel.org/all/20251222094351.38792-1-tianruidong@linux.alibaba.com/
>> 1. Fix build warnings in 0010 and 0014 reported by the kernel test robot:
>>      https://lore.kernel.org/all/202512230122.CfXZcF76-lkp@intel.com/
>>      https://lore.kernel.org/all/202512230007.Vs6IvFVD-lkp@intel.com/
>> 2. Dropped the extra patch(0014) that was mistakenly included in v4.
>>
>> Change from V3:
>> https://lore.kernel.org/all/20250115084228.107573-1-tianruidong@linux.alibaba.com/
>> 1. Add vendor AEST node framework and support CMN700
>> 2. Borislav Petkov
>>      - Split into multiple smaller patches for easier review.
>>      - refined the English in the cover letter for better flow.
>> 3. Accept Tomohiro Misono's comment
>>
>> Change from V2:
>> https://lore.kernel.org/all/20240321025317.114621-1-tianruidong@linux.alibaba.com/
>> 1. Tomohiro Misono
>>      - dump register before panic
>> 2. Baolin Wang & Shuai Xue: accepted all comments.
>> 3. Support AEST V2.
>>
>> Change from V1:
>> https://lore.kernel.org/all/20240304111517.33001-1-tianruidong@linux.alibaba.com/
>> 1. Marc Zyngier
>>    - Use readq/writeq_relaxed instead of readq/writeq for MMIO address.
>>    - Add sync for system register operation.
>>    - Use irq_is_percpu_devid() helper to identify a per-CPU interrupt.
>>    - Other fix.
>> 2. Set RAS CE threshold in AEST driver.
>> 3. Enable RAS interrupt explicitly in driver.
>> 4. UER and UEO trigger memory_failure() rather than panic.
>>
>> Ruidong Tian (16):
>>    ACPI/AEST: Parse the AEST table
>>    ras: AEST: Add probe/remove for AEST driver
>>    ras: AEST: support different group format
>>    ras: AEST: Unify the read/write interface for system and MMIO register
>>    ras: AEST: Probe RAS system architecture version
>>    ras: AEST: Support RAS Common Fault Injection Model Extension
>>    ras: AEST: Support CE threshold of error record
>>    ras: AEST: Enable and register IRQs
>>    ras: AEST: Add cpuhp callback
>>    ras: AEST: Introduce AEST driver sysfs interface
>>    ras: AEST: Add error count tracking and debugfs interface
>>    ras: AEST: Allow configuring CE threshold via debugfs
>>    ras: AEST: Introduce AEST inject interface to test AEST driver
>>    ras: AEST: Add framework to process AEST vendor node
>>    ras: AEST: support vendor node CMN700
>>    trace, ras: add ARM RAS extension trace event
>>
>>   Documentation/ABI/testing/debugfs-aest |   99 +++
>>   MAINTAINERS                            |   11 +
>>   arch/arm64/include/asm/arm-cmn.h       |   47 ++
>>   arch/arm64/include/asm/ras.h           |   95 +++
>>   drivers/acpi/arm64/Kconfig             |   11 +
>>   drivers/acpi/arm64/Makefile            |    1 +
>>   drivers/acpi/arm64/aest.c              |  311 +++++++
>>   drivers/perf/arm-cmn.c                 |   37 +-
>>   drivers/ras/Kconfig                    |    1 +
>>   drivers/ras/Makefile                   |    1 +
>>   drivers/ras/aest/Kconfig               |   17 +
>>   drivers/ras/aest/Makefile              |    8 +
>>   drivers/ras/aest/aest-cmn.c            |  330 ++++++++
>>   drivers/ras/aest/aest-core.c           | 1054 ++++++++++++++++++++++++
>>   drivers/ras/aest/aest-inject.c         |  131 +++
>>   drivers/ras/aest/aest-sysfs.c          |  228 +++++
>>   drivers/ras/aest/aest.h                |  410 +++++++++
>>   drivers/ras/ras.c                      |    3 +
>>   include/linux/acpi_aest.h              |   75 ++
>>   include/linux/cpuhotplug.h             |    1 +
>>   include/linux/ras.h                    |    8 +
>>   include/ras/ras_event.h                |   71 ++
>>   22 files changed, 2914 insertions(+), 36 deletions(-)
>>   create mode 100644 Documentation/ABI/testing/debugfs-aest
>>   create mode 100644 arch/arm64/include/asm/arm-cmn.h
>>   create mode 100644 arch/arm64/include/asm/ras.h
>>   create mode 100644 drivers/acpi/arm64/aest.c
>>   create mode 100644 drivers/ras/aest/Kconfig
>>   create mode 100644 drivers/ras/aest/Makefile
>>   create mode 100644 drivers/ras/aest/aest-cmn.c
>>   create mode 100644 drivers/ras/aest/aest-core.c
>>   create mode 100644 drivers/ras/aest/aest-inject.c
>>   create mode 100644 drivers/ras/aest/aest-sysfs.c
>>   create mode 100644 drivers/ras/aest/aest.h
>>   create mode 100644 include/linux/acpi_aest.h
> 
> 
> Thanks,
> Umang

