From: Sasha Levin <sashal@kernel.org>
To: "David Hildenbrand (Arm)" <david@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>,
akpm@linux-foundation.org, corbet@lwn.net, ljs@kernel.org,
Liam.Howlett@oracle.com, vbabka@kernel.org, rppt@kernel.org,
surenb@google.com, mhocko@suse.com, skhan@linuxfoundation.org,
jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com,
linux-mm@kvack.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, Sasha Levin <sashal@nvidia.com>,
Sanif Veeras <sveeras@nvidia.com>,
"Claude:claude-opus-4-7" <noreply@anthropic.com>
Subject: Re: [RFC 4/7] mm: add page consistency checker implementation
Date: Mon, 27 Apr 2026 19:24:46 -0400
Message-ID: <ae_wPuqYpVxicOUq@laps>
In-Reply-To: <36d82055-67f3-4c29-a605-a9848a28f7cb@kernel.org>

On Mon, Apr 27, 2026 at 09:37:02PM +0200, David Hildenbrand (Arm) wrote:
>
>>>
>>> Thanks, but I fundamentally don't understand how RAS capabilities interact here?
>>> We have mm/memory-failure.c for a reason :)
>>
>> We do, but self driving safety requires way more than the current hardware can
>> provide.
>>
>> I'll point you to https://dl.acm.org/doi/10.1145/2775054.2694348 , which
>> researched these issues in a datacenter environment (so no sun exposure,
>> temperature controlled, designed to avoid electromagnetic interference).
>>
>> "We call a fault that generates an error larger than 2 bits in an ECC word an
>> undetectable-by-SECDED fault. A fault is undetectable-by-SECDED if it affects
>> more than two bits in any ECC word, and the data written to that location does
>> not match the value produced by the fault."
>>
>> [...]
>>
>> "A Cielo node has 288 DRAM devices, so this translates to 6048, 518, and 57.6
>> FIT per node for vendors A, B, and C, respectively. This translates to one
>> undetected error every 0.8 days, every 9.5 days, and every 85 days on a machine
>> the size of Cielo."
>>
>> [...]
>>
>> "Our main conclusion from this data is that SEC-DED ECC is poorly suited to
>> modern DRAM subsystems. The rate of undetected errors is too high to justify
>> its use in very large scale systems comprised of thousands of nodes where
>> fidelity of results is critical."
>
>Yes, I read before that ECC is insufficient to detect certain bitflips.
>
>But I don't understand how this patch set here is going to move the needle in
>any reasonable way?
>
>You have your magical self-driving car algorithm.
>
>Bitflips can corrupt your algorithm, your data, the kernel image, your user page
>tables, your kernel page tables. Even a pointer to a bitmap :)
>
>... and we worry about the state of allocated vs. free pages.

Do we agree that this is one piece of a (much) larger puzzle that we would need
to tackle?

>Please enlighten me!

Definitely! This is a pretty hefty body of work, so besides trying to get
the code out there we're also working on documentation, talks, webinars, etc. in
the context of ELISA (https://elisa.tech/).

The concept itself was approved by an independent assessor as compliant with
the relevant safety standard, so the story is there; we're just working on
getting it out.

--
Thanks,
Sasha