public inbox for linux-kernel@vger.kernel.org
From: Sasha Levin <sashal@kernel.org>
To: "David Hildenbrand (Arm)" <david@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>,
	akpm@linux-foundation.org, corbet@lwn.net, ljs@kernel.org,
	Liam.Howlett@oracle.com, vbabka@kernel.org, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, skhan@linuxfoundation.org,
	jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com,
	linux-mm@kvack.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, Sasha Levin <sashal@nvidia.com>,
	Sanif Veeras <sveeras@nvidia.com>,
	"Claude:claude-opus-4-7" <noreply@anthropic.com>
Subject: Re: [RFC 4/7] mm: add page consistency checker implementation
Date: Mon, 27 Apr 2026 19:24:46 -0400
Message-ID: <ae_wPuqYpVxicOUq@laps>
In-Reply-To: <36d82055-67f3-4c29-a605-a9848a28f7cb@kernel.org>

On Mon, Apr 27, 2026 at 09:37:02PM +0200, David Hildenbrand (Arm) wrote:
>
>>>
>>> Thanks, but I fundamentally don't understand how RAS capabilities interact here?
>>> We have mm/memory-failure.c for a reason :)
>>
>> We do, but self driving safety requires way more than the current hardware can
>> provide.
>>
>> I'll point you to https://dl.acm.org/doi/10.1145/2775054.2694348 , which
>> researched these issues in a datacenter environment (so no sun exposure,
>> temperature controlled, designed to avoid electromagnetic interference).
>>
>> "We call a fault that generates an error larger than 2 bits in an ECC word an
>> undetectable-by-SECDED fault. A fault is undetectable-by-SECDED if it affects
>> more than two bits in any ECC word, and the data written to that location does
>> not match the value produced by the fault."
>>
>> [...]
>>
>> "A Cielo node has 288 DRAM devices, so this translates to 6048, 518, and 57.6
>> FIT per node for vendors A, B, and C, respectively. This translates to one
>> undetected error every 0.8 days, every 9.5 days, and every 85 days on a machine
>> the size of Cielo."
>>
>> [...]
>>
>> "Our main conclusion from this data is that SEC-DED ECC is poorly suited to
>> modern DRAM subsystems. The rate of undetected errors is too high to justify
>> its use in very large scale systems comprised of thousands of nodes where
>> fidelity of results is critical."
>
>Yes, I read before that ECC is insufficient to detect certain bitflips.
>
>But I don't understand how this patch set here is going to move the needle in
>any reasonable way?
>
>You have your magical self-driving car algorithm.
>
>Bitflips can corrupt your algorithm, your data, the kernel image, your user page
>tables, your kernel page tables. Even a pointer to a bitmap :)
>
>... and we worry about the state of allocated vs. free pages.

Do we agree that this is one piece of a (much) larger puzzle that we would need
to tackle?

>Please enlighten me!

Definitely! This is a pretty hefty body of work, so outside of trying to get
the code out there we're also working on documentation, talks, webinars, etc. in
the context of ELISA (https://elisa.tech/).

The concept itself was approved by an independent assessor as compliant with
the relevant safety standard, so the story is there, we're just working on
getting it out.

-- 
Thanks,
Sasha
