Re: [PATCH v1] mm/gup: remove (VM_)BUG_ONs

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

From: David Hildenbrand <david@redhat.com>
To: John Hubbard <jhubbard@nvidia.com>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	Jason Gunthorpe <jgg@ziepe.ca>
Cc: Michal Hocko <mhocko@suse.com>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	Andrew Morton <akpm@linux-foundation.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>, Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Peter Xu <peterx@redhat.com>
Subject: Re: [PATCH v1] mm/gup: remove (VM_)BUG_ONs
Date: Wed, 11 Jun 2025 11:32:07 +0200	[thread overview]
Message-ID: <e3db61eb-6dda-44f7-94a7-97f43c19093e@redhat.com> (raw)
In-Reply-To: <8ce6c104-d2fb-4fed-a108-224a6113c227@nvidia.com>

On 07.06.25 20:00, John Hubbard wrote:
> On 6/7/25 6:53 AM, Lorenzo Stoakes wrote:
>> On Sat, Jun 07, 2025 at 10:42:14AM -0300, Jason Gunthorpe wrote:
>>> On Fri, Jun 06, 2025 at 08:03:15PM +0100, Lorenzo Stoakes wrote:
>>>> On Fri, Jun 06, 2025 at 07:46:52PM +0100, Lorenzo Stoakes wrote:
>>>>> On Fri, Jun 06, 2025 at 03:42:12PM -0300, Jason Gunthorpe wrote:
>>>>>> On Fri, Jun 06, 2025 at 08:23:25PM +0200, David Hildenbrand wrote:
> ...
>>>> And I guess we'd not want a new interface for this like WARN_ON_ONCE_STORED()
>>>> because that'd be... weird and how would anyone think to use that and nearly all
>>>> cases wouldn't.
>>>
>>> No! Delete WARN_ON_ONCE and say the new global mechanism is good
>>> enough to capture the first WARN_ON, everyone always uses it always
>>> and then nobody needs to think about this anymore when writing new
>>> code.
> Interesting. This would cause WARN_ON*() to act similarly to lockdep, in
> that the first instance is the only one that works.
> 
> That works, but if-and-only-if the system is kept completely WARN-free.
> However, I have a mitigation for that, below.
> 
>>
>> Well that is simpler :)
>>
>> I have encountered situations where I've had more than one and needed 2nd+
>> but it is rare as you say.
>>
>> My late night incoherent babbling yesterday was perhaps because I
>> misunderstood David/John as to what they encountered in the past... maybe
>> they can clarify...
> 
> I've debugged lots of production systems, often these were large HPC
> clusters and supercomputers. I've seen:
> 
> a) Long up-times, with (of course!) relatively small dmesg buffer sizes,
> so that early logs are long gone. This means that WARN_ON_ONCE() is
> quite often gone (overwritten). This is common.
> 
> The worst part is that if you go to reproduce a problem, you don't
> see the next warning in the logs!! This is devastating, especially if
> the site makes it hard to ask for a system reboot. (If you have
> ~20,000 nodes in the cluster, a reboot is not a small affair.)
> 
> b) WARN_ON() loops that fill up the dmesg buffer immediately, which
> of course also overwrites anything leading up to the first WARN_ON.
> 
> However, (b) is rare on production kernels--it's more commonly
> self-inflicted, if I make a mistake and write WARN_ON() when I should
> have written WARN_ON_ONCE(), in a custom build to explore a problem.
> 
> c) Other printk-based noise in a loop, occasionally this also wraps
> the dmesg buffer. Especially if a customer has done a bit of their
> own "special" logging code.
> 
> 
>>
>> I do find myself grepping dmesg to find the first warning and it's
>> _annoying_ to do so, so this would be handy.
>>
>> But I'm not sure it'd justify getting rid of WARN_ON_ONCE() when you are in
>> a loop or something and now your dmesg is going to go to hell, still useful
>> not to do that, esp. if you know there's no value to further warnings
> 
> In all cases, the ideal outcome is a dmesg buffer that includes (let's
> fantasize for a moment):
> 
> 1) the earliest dmesg output (showing any oddness with system
> configuration, and what hardware was brought online, etc), plus
> 
> 2) the first problem that is logged, plus
> 
> 3) ...sometimes two or even three things happen in order to get to the
> problem, so (again, fantasizing!) really the first *three* warnings
> would be good to have.
> 
> So this is just a 3-element array of global warnings--not another
> circular buffer, but a little larger than a TAINT or lockdep type
> of capture.

Can't we do something very simple to get started: remove VM_WARN_ON_ONCE 
and implement VM_WARN_ON as something that uses a single global variable 
to print up to *magic value* 10 warnings. Then we'll simply stop 
printing any further warnings.

Races when updating that global variable? Who cars if we end up printing 11.

That is, have a global limit over all of the VM_WARN.

Later, we can do more advanced stuff, such as storing them in some other 
buffer to persist them etc.

-- 
Cheers,

David / dhildenb

next prev parent reply	other threads:[~2025-06-11  9:32 UTC|newest]

Thread overview: 55+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-04 14:05 [PATCH v1] mm/gup: remove (VM_)BUG_ONs David Hildenbrand
2025-06-04 14:22 ` Vlastimil Babka
2025-06-04 14:26   ` Suren Baghdasaryan
2025-06-04 14:48 ` Lorenzo Stoakes
2025-06-04 14:58   ` David Hildenbrand
2025-06-04 15:44     ` Lorenzo Stoakes
2025-06-04 15:42   ` Linus Torvalds
2025-06-04 16:05     ` Lorenzo Stoakes
2025-06-04 15:14 ` Jason Gunthorpe
2025-06-04 16:01   ` David Hildenbrand
2025-06-04 17:25 ` SeongJae Park
2025-06-04 19:12 ` Liam R. Howlett
2025-06-04 19:16   ` David Hildenbrand
2025-06-05  1:07 ` John Hubbard
2025-06-05  5:37   ` Vlastimil Babka
2025-06-05  6:08     ` David Hildenbrand
2025-06-05  8:48       ` Vlastimil Babka
2025-06-05 12:29         ` David Hildenbrand
2025-06-05  7:10 ` Michal Hocko
2025-06-06  8:10   ` David Hildenbrand
2025-06-06  8:31     ` Michal Hocko
2025-06-06  9:01       ` David Hildenbrand
2025-06-06 10:13         ` Michal Hocko
2025-06-06 10:19           ` Lorenzo Stoakes
2025-06-06 10:28             ` David Hildenbrand
2025-06-06 11:04               ` Lorenzo Stoakes
2025-06-06 11:44                 ` David Hildenbrand
2025-06-06 11:56                   ` Michal Hocko
2025-06-06 12:12                   ` Lorenzo Stoakes
2025-06-06 12:17                     ` David Hildenbrand
2025-06-06 17:57                 ` John Hubbard
2025-06-06 18:06                   ` Lorenzo Stoakes
2025-06-06 18:15                     ` David Hildenbrand
2025-06-06 18:21                       ` John Hubbard
2025-06-06 18:23                         ` David Hildenbrand
2025-06-06 18:31                           ` John Hubbard
2025-06-06 18:36                             ` David Hildenbrand
2025-06-06 18:39                               ` John Hubbard
2025-06-06 18:34                           ` Lorenzo Stoakes
2025-06-06 18:42                           ` Jason Gunthorpe
2025-06-06 18:46                             ` Lorenzo Stoakes
2025-06-06 19:03                               ` Lorenzo Stoakes
2025-06-07 13:42                                 ` Jason Gunthorpe
2025-06-07 13:53                                   ` Lorenzo Stoakes
2025-06-07 18:00                                     ` John Hubbard
2025-06-09  9:57                                       ` Vlastimil Babka
2025-07-24 10:54                                         ` Vlastimil Babka
2025-07-24 10:56                                           ` Lorenzo Stoakes
2025-07-24 17:27                                             ` John Hubbard
2025-06-11  9:32                                       ` David Hildenbrand [this message]
2025-06-11 12:03                                         ` Jason Gunthorpe
2025-06-11 12:06                                           ` Lorenzo Stoakes
2025-06-06 10:28             ` Michal Hocko
2025-06-06 10:27           ` David Hildenbrand
2025-06-06  8:12 ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e3db61eb-6dda-44f7-94a7-97f43c19093e@redhat.com \
    --to=david@redhat.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=jgg@ziepe.ca \
    --cc=jhubbard@nvidia.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mhocko@suse.com \
    --cc=peterx@redhat.com \
    --cc=rppt@kernel.org \
    --cc=surenb@google.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).