public inbox for linux-mm@kvack.org
From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Sasha Levin <sashal@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>,
	akpm@linux-foundation.org, corbet@lwn.net, ljs@kernel.org,
	Liam.Howlett@oracle.com, vbabka@kernel.org, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, skhan@linuxfoundation.org,
	jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com,
	linux-mm@kvack.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, Sasha Levin <sashal@nvidia.com>,
	Sanif Veeras <sveeras@nvidia.com>,
	"Claude:claude-opus-4-7" <noreply@anthropic.com>
Subject: Re: [RFC 4/7] mm: add page consistency checker implementation
Date: Mon, 27 Apr 2026 21:37:02 +0200
Message-ID: <36d82055-67f3-4c29-a605-a9848a28f7cb@kernel.org>
In-Reply-To: <ae-xXHUGfDiFcBUk@laps>


>>
>> Thanks, but I fundamentally don't understand how RAS capabilities interact here?
>> We have mm/memory-failure.c for a reason :)
> 
> We do, but self-driving safety requires way more than the current hardware can
> provide.
> 
> I'll point you to https://dl.acm.org/doi/10.1145/2775054.2694348 , which
> researched these issues in a datacenter environment (so no sun exposure,
> temperature controlled, designed to avoid electromagnetic interference).
> 
> "We call a fault that generates an error larger than 2 bits in an ECC word an
> undetectable-by-SECDED fault. A fault is undetectable-by-SECDED if it affects
> more than two bits in any ECC word, and the data written to that location does
> not match the value produced by the fault."
> 
> [...]
> 
> "A Cielo node has 288 DRAM devices, so this translates to 6048, 518, and 57.6
> FIT per node for vendors A, B, and C, respectively. This translates to one
> undetected error every 0.8 days, every 9.5 days, and every 85 days on a machine
> the size of Cielo."
> 
> [...]
> 
> "Our main conclusion from this data is that SEC-DED ECC is poorly suited to
> modern DRAM subsystems. The rate of undetected errors is too high to justify
> its use in very large scale systems comprised of thousands of nodes where
> fidelity of results is critical."

Yes, I read before that ECC is insufficient to detect certain bitflips.

But I don't understand how this patch set here is going to move the needle in
any reasonable way?

You have your magical self-driving car algorithm.

Bitflips can corrupt your algorithm, your data, the kernel image, your user page
tables, your kernel page tables. Even a pointer to a bitmap :)

... and we worry about the state of allocated vs. free pages.

Please enlighten me!

> 
> The passengers you've mentioned before would be excited if they knew how high
> the bar is around their safety :)

Heh :)

-- 
Cheers,

David



