From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Sasha Levin <sashal@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>,
akpm@linux-foundation.org, corbet@lwn.net, ljs@kernel.org,
Liam.Howlett@oracle.com, vbabka@kernel.org, rppt@kernel.org,
surenb@google.com, mhocko@suse.com, skhan@linuxfoundation.org,
jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com,
linux-mm@kvack.org, linux-doc@vger.kernel.org,
linux-kernel@vger.kernel.org, Sasha Levin <sashal@nvidia.com>,
Sanif Veeras <sveeras@nvidia.com>,
"Claude:claude-opus-4-7" <noreply@anthropic.com>
Subject: Re: [RFC 4/7] mm: add page consistency checker implementation
Date: Mon, 27 Apr 2026 21:37:02 +0200
Message-ID: <36d82055-67f3-4c29-a605-a9848a28f7cb@kernel.org>
In-Reply-To: <ae-xXHUGfDiFcBUk@laps>
>>
>> Thanks, but I fundamentally don't understand how RAS capabilities interact here?
>> We have mm/memory-failure.c for a reason :)
>
> We do, but self driving safety requires way more than the current hardware can
> provide.
>
> I'll point you to https://dl.acm.org/doi/10.1145/2775054.2694348 , which
> researched these issues in a datacenter environment (so no sun exposure,
> temperature controlled, designed to avoid electromagnetic interference).
>
> "We call a fault that generates an error larger than 2 bits in an ECC word an
> undetectable-by-SECDED fault. A fault is undetectable-by-SECDED if it affects
> more than two bits in any ECC word, and the data written to that location does
> not match the value produced by the fault."
>
> [...]
>
> "A Cielo node has 288 DRAM devices, so this translates to 6048, 518, and 57.6
> FIT per node for vendors A, B, and C, respectively. This translates to one
> undetected error every 0.8 days, every 9.5 days, and every 85 days on a machine
> the size of Cielo."
>
> [...]
>
> "Our main conclusion from this data is that SEC-DED ECC is poorly suited to
> modern DRAM subsystems. The rate of undetected errors is too high to justify
> its use in very large scale systems comprised of thousands of nodes where
> fidelity of results is critical."
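
For reference, the arithmetic behind the quoted FIT figures can be sketched as
below. This is a minimal sketch, not taken from the paper's text as quoted
here: it assumes FIT means failures per 10^9 hours of operation, and it
assumes a machine of roughly 8500 nodes, a figure that does not appear in the
quote above. The helper name days_between_errors is hypothetical.

```python
# FIT = failures in time, i.e. expected failures per 10^9 hours of operation.
HOURS_PER_FIT_WINDOW = 1e9

# Stated in the quoted text: DRAM devices per node.
DEVICES_PER_NODE = 288

# ASSUMPTION: node count of the machine; not stated in the quoted text.
ASSUMED_NODES = 8500


def days_between_errors(fit_per_node: float, nodes: int) -> float:
    """Mean days between undetected errors across a machine of `nodes` nodes."""
    machine_fit = fit_per_node * nodes          # FIT rates add across nodes
    hours = HOURS_PER_FIT_WINDOW / machine_fit  # mean hours between errors
    return hours / 24.0


if __name__ == "__main__":
    # Per-node FIT values for vendors A, B, and C, as quoted above.
    for vendor, fit in (("A", 6048), ("B", 518), ("C", 57.6)):
        per_device = fit / DEVICES_PER_NODE
        days = days_between_errors(fit, ASSUMED_NODES)
        print(f"vendor {vendor}: {per_device:.2f} FIT/device, "
              f"one undetected error every {days:.1f} days")
```

Under that assumed node count, the per-node figures work out to about 0.8,
9.5, and 85 days per machine, which lines up with the intervals in the quote.
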

Yes, I read before that ECC is insufficient to detect certain bitflips.

But I don't understand how this patch set here is going to move the needle in
any reasonable way?

You have your magical self-driving car algorithm.

Bitflips can corrupt your algorithm, your data, the kernel image, your user page
tables, your kernel page tables. Even a pointer to a bitmap :)

... and we worry about the state of allocated vs. free pages.

Please enlighten me!

>
> The passengers you've mentioned before would be excited if they knew how high
> the bar is around their safety :)
Heh :)
--
Cheers,
David