From: Umang Chheda <umang.chheda@oss.qualcomm.com>
To: Ruidong Tian <tianruidong@linux.alibaba.com>,
	catalin.marinas@arm.com, will@kernel.org, lpieralisi@kernel.org,
	guohanjun@huawei.com, sudeep.holla@arm.com, rafael@kernel.org,
	robin.murphy@arm.com, mark.rutland@arm.com, tony.luck@intel.com,
	bp@alien8.de, tglx@linutronix.de, peterz@infradead.org
Cc: lenb@kernel.org, linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-acpi@vger.kernel.org,
	linux-perf-users@vger.kernel.org, linux-edac@vger.kernel.org,
	mchehab@kernel.org, xueshuai@linux.alibaba.com,
	zhuo.song@linux.alibaba.com, oliver.yang@linux.alibaba.com
Subject: Re: [PATCH v6 00/16] Support Armv8 RAS Extensions for Kernel-first error handling
Date: Tue, 17 Mar 2026 12:51:40 +0530
Message-ID: <95d2de63-5571-4fce-aa32-6e8969bb9149@oss.qualcomm.com>
In-Reply-To: <d4e4e2b1-c182-42f9-9d60-b12f0fd7f977@linux.alibaba.com>


On 3/11/2026 8:55 AM, Ruidong Tian wrote:
>
>
> On 2026/3/9 21:28, Umang Chheda wrote:
>> Hello Ruidong Tian,
>>
>> On 1/22/2026 3:16 PM, Ruidong Tian wrote:
>>> Motivation: Reliability in Modern Data Centers
>>> =================================================
>>> In modern data centers, proactive maintenance is essential for achieving high
>>> service availability. The practice of using Corrected Errors (CE) to predict
>>> impending Uncorrected Errors (UE) is already widely deployed at scale across
>>> the industry, like Alibaba[2], Tencent[4], Intel[1], AMD[2]. By analyzing CE
>>> telemetry, operators can identify failing hardware and perform migrations
>>> before catastrophic failures occur.
>>>
>>> Problem: Inefficient CE Collection on ARM
>>> ==========================================
>>> Currently, ARM-based systems primarily rely on "Firmware-First" error
>>> handling (e.g., via GHES). This path is inherently heavy-weight. To avoid
>>> significant performance overhead, firmware is often configured with high
>>> thresholds—reporting to the OS only after thousands of CEs have occurred.
>>> If the threshold is set lower, the high frequency of errors leads to
>>> excessive and costly context switching between the OS and firmware.
>>> Consequently, ARM platforms currently lack an efficient mechanism to collect
>>> the granular CE data required for high-fidelity error prediction.
>>>
>>> Solution: Kernel-First Handling via AEST
>>> ===========================================
>>> Other architectures have long utilized "Kernel-First" approaches for
>>> efficient CE collection: Intel provides CMCI (Corrected Machine Check
>>> Interrupt), and AMD has recently introduced similar CE interrupt support[5].
>>>
>>> On the ARM architecture, hardware already provides the necessary RAS
>>> Extensions[6], and the ACPI AEST specification[0] defines a standardized way for
>>> the OS to discover these error source registers. This series implements
>>> AEST support, enabling the kernel to:
>>>
>>>   - Discover error sources directly via ACPI tables.
>>>   - Handle CE notifications via direct interrupts.
>>>   - Bypass firmware overhead to collect every CE or use low-latency thresholds.
>>>
>>> This implementation provides the missing link for efficient RAS telemetry
>>> on ARM, bringing it to parity with other enterprise architectures.
>>
>> Thanks for posting this series enabling kernel-first handling for the Armv8 RAS extensions.
>>
>> We noticed the current implementation targets ACPI-based server platforms. For embedded/SoC systems, Device Tree is often the primary firmware description.
>> Do you have any plans to add DT-based support for the same flow? If not, do you see any blockers to extending this series to support DT
>> (e.g., DT bindings + discovery/registration path analogous to the ACPI plumbing)?
>> If DT support is in scope, we would be happy to align on the expected approach and help with review/development/testing for DT-based platforms.
>
> Hi Umang,
>
> Thanks for the reply.
>
> Adding Device Tree support should be easy. We just need a patch similar to "ACPI/AEST: Parse the AEST table" that fills struct acpi_aest_node (which might need renaming) and struct aest_hnode from the DT. The driver part requires minimal changes.
>
> However, I'm not very familiar with DT and lack DT engineering support, so I would need some guidance on these DT-related questions:
>
> - Is there a specification, similar to AEST, that outlines the reporting requirements for RAS extension information?

Currently, no spec or bindings are defined for DT-based systems. In the discussion on an earlier patch [1], the maintainers
expected DT bindings to be aligned with, and equivalent to, the AEST spec. Accordingly, we plan to propose DT bindings
that are aligned with the AEST spec.

[1] Re: [PATCH 1/2] dt-bindings: edac: Add DT bindings for Kryo EDAC - James Morse <https://lore.kernel.org/linux-edac/312fc8b8-7019-0c74-6a92-c6740cab5dad@arm.com/>

> - How should the DT be designed?
>
Same as above: we plan to propose DT bindings equivalent to the AEST spec.

> - How can I develop QEMU and modify DT files for debugging, etc.?

We can share the steps for DT-based systems in this thread.


>
> I would be happy to adjust the patchset to meet the needs of both parties if you are prepared to invest the necessary (DT-related) effort. In practice, I believe only minor modifications are required.

Thanks. Yes, we are open to contributing to extending this patch series to support DT-based systems as well.

[...]


Thanks,
Umang


Thread overview: 24+ messages
2026-01-22  9:46 [PATCH v6 00/16] Support Armv8 RAS Extensions for Kernel-first error handling Ruidong Tian
2026-01-22  9:46 ` [PATCH v6 01/16] ACPI/AEST: Parse the AEST table Ruidong Tian
2026-01-22  9:46 ` [PATCH v6 02/16] ras: AEST: Add probe/remove for AEST driver Ruidong Tian
2026-01-22  9:46 ` [PATCH v6 03/16] ras: AEST: support different group format Ruidong Tian
2026-01-22  9:46 ` [PATCH v6 04/16] ras: AEST: Unify the read/write interface for system and MMIO register Ruidong Tian
2026-01-22  9:46 ` [PATCH v6 05/16] ras: AEST: Probe RAS system architecture version Ruidong Tian
2026-01-22  9:46 ` [PATCH v6 06/16] ras: AEST: Support RAS Common Fault Injection Model Extension Ruidong Tian
2026-01-22  9:46 ` [PATCH v6 07/16] ras: AEST: Support CE threshold of error record Ruidong Tian
2026-01-22  9:46 ` [PATCH v6 08/16] ras: AEST: Enable and register IRQs Ruidong Tian
2026-01-22  9:46 ` [PATCH v6 09/16] ras: AEST: Add cpuhp callback Ruidong Tian
2026-01-22  9:46 ` [PATCH v6 10/16] ras: AEST: Introduce AEST driver sysfs interface Ruidong Tian
2026-01-22  9:46 ` [PATCH v6 11/16] ras: AEST: Add error count tracking and debugfs interface Ruidong Tian
2026-01-22  9:46 ` [PATCH v6 12/16] ras: AEST: Allow configuring CE threshold via debugfs Ruidong Tian
2026-01-22  9:46 ` [PATCH v6 13/16] ras: AEST: Introduce AEST inject interface to test AEST driver Ruidong Tian
2026-01-27 12:52   ` kernel test robot
2026-01-22  9:46 ` [PATCH v6 14/16] ras: AEST: Add framework to process AEST vendor node Ruidong Tian
2026-01-22  9:46 ` [PATCH v6 15/16] ras: AEST: support vendor node CMN700 Ruidong Tian
2026-01-27 18:56   ` kernel test robot
2026-03-09 19:21   ` Robin Murphy
2026-03-13  9:49     ` Ruidong Tian
2026-01-22  9:46 ` [PATCH v6 16/16] trace, ras: add ARM RAS extension trace event Ruidong Tian
2026-03-09 13:28 ` [PATCH v6 00/16] Support Armv8 RAS Extensions for Kernel-first error handling Umang Chheda
2026-03-11  3:25   ` Ruidong Tian
2026-03-17  7:21     ` Umang Chheda [this message]
