public inbox for linux-doc@vger.kernel.org
From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Sasha Levin <sashal@kernel.org>
Cc: Pasha Tatashin <pasha.tatashin@soleen.com>,
	akpm@linux-foundation.org, corbet@lwn.net, ljs@kernel.org,
	Liam.Howlett@oracle.com, vbabka@kernel.org, rppt@kernel.org,
	surenb@google.com, mhocko@suse.com, skhan@linuxfoundation.org,
	jackmanb@google.com, hannes@cmpxchg.org, ziy@nvidia.com,
	linux-mm@kvack.org, linux-doc@vger.kernel.org,
	linux-kernel@vger.kernel.org, Sasha Levin <sashal@nvidia.com>,
	Sanif Veeras <sveeras@nvidia.com>,
	"Claude:claude-opus-4-7" <noreply@anthropic.com>
Subject: Re: [RFC 4/7] mm: add page consistency checker implementation
Date: Tue, 28 Apr 2026 09:22:40 +0200
Message-ID: <63befc48-8173-4ad3-917b-e09c55dd3191@kernel.org>
In-Reply-To: <ae_wPuqYpVxicOUq@laps>

On 4/28/26 01:24, Sasha Levin wrote:
> On Mon, Apr 27, 2026 at 09:37:02PM +0200, David Hildenbrand (Arm) wrote:
>>
>>>
>>> We do, but self driving safety requires way more than the current hardware can
>>> provide.
>>>
>>> I'll point you to https://dl.acm.org/doi/10.1145/2775054.2694348 , which
>>> researched these issues in a datacenter environment (so no sun exposure,
>>> temperature controlled, designed to avoid electromagnetic interference).
>>>
>>> "We call a fault that generates an error larger than 2 bits in an ECC word an
>>> undetectable-by-SECDED fault. A fault is undetectable-by-SECDED if it affects
>>> more than two bits in any ECC word, and the data written to that location does
>>> not match the value produced by the fault."
>>>
>>> [...]
>>>
>>> "A Cielo node has 288 DRAM devices, so this translates to 6048, 518, and 57.6
>>> FIT per node for vendors A, B, and C, respectively. This translates to one
>>> undetected error every 0.8 days, every 9.5 days, and every 85 days on a machine
>>> the size of Cielo."
>>>
>>> [...]
>>>
>>> "Our main conclusion from this data is that SEC-DED ECC is poorly suited to
>>> modern DRAM subsystems. The rate of undetected errors is too high to justify
>>> its use in very large scale systems comprised of thousands of nodes where
>>> fidelity of results is critical."
>>
>> Yes, I read before that ECC is insufficient to detect certain bitflips.
>>
>> But I don't understand how this patch set here is going to move the needle in
>> any reasonable way?
>>
>> You have your magical self-driving car algorithm.
>>
>> Bitflips can corrupt your algorithm, your data, the kernel image, your user page
>> tables, your kernel page tables. Even a pointer to a bitmap :)
>>
>> ... and we worry about the state of allocated vs. free pages.
> 
> Do we agree that this is one piece of a (much) larger puzzle that we would need
> to tackle?

Once you have solved the really hard problems (corruption of random page state,
page tables, all of that), we can think about whether adding complexity to the
page allocator to detect possible corruptions is worthwhile.

As you state in your reply to Vlasta, the buddy allocator keeps free pages in a
list. A pointer corruption there would be rather fatal, and I don't follow how
the approach here makes things any better.

So for the time being, I don't think this proposal moves the needle in any
reasonable way, and I don't think we want this any time soon.

-- 
Cheers,

David
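
For readers who want to check the FIT figures quoted above, here is a minimal
sketch of the arithmetic. FIT means failures per 10^9 device-hours. The node
count of 8500 is an assumption, not something stated in the quoted text; it is
used only because it roughly reproduces the quoted intervals. The function
names are hypothetical.

```python
# FIT = failures in time, i.e. failures per 1e9 hours of operation.
HOURS_PER_FIT_UNIT = 1e9

# Assumption: approximate node count for a machine the size of Cielo.
# This number is NOT given in the quoted text.
ASSUMED_NODES = 8500


def fit_per_device(fit_per_node, devices_per_node):
    """Per-device FIT implied by a per-node FIT."""
    return fit_per_node / devices_per_node


def days_between_errors(fit_per_node, nodes):
    """Mean time between undetected errors across the machine, in days."""
    failures_per_hour = fit_per_node * nodes / HOURS_PER_FIT_UNIT
    return 1.0 / failures_per_hour / 24.0


if __name__ == "__main__":
    # Per-node FIT values for vendors A, B, and C, as quoted above.
    for vendor, fit in (("A", 6048.0), ("B", 518.0), ("C", 57.6)):
        print(vendor,
              round(fit_per_device(fit, 288), 2),
              round(days_between_errors(fit, ASSUMED_NODES), 1))
```

Under that assumed node count, the results land close to the quoted 0.8, 9.5,
and 85 days, which suggests the quoted intervals are whole-machine figures
rather than per-node ones.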

