From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: David Hildenbrand <david@redhat.com>
Cc: "Liam R. Howlett" <Liam.Howlett@oracle.com>,
	maple-tree@lists.infradead.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Vlastimil Babka <vbabka@suse.cz>,
	Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jann Horn <jannh@google.com>, Pedro Falcato <pfalcato@suse.de>,
	Charan Teja Kalla <quic_charante@quicinc.com>,
	shikemeng@huaweicloud.com, kasong@tencent.com, nphamcs@gmail.com,
	bhe@redhat.com, baohua@kernel.org, chrisl@kernel.org,
	Matthew Wilcox <willy@infradead.org>
Subject: Re: [RFC PATCH 0/6] Remove XA_ZERO from error recovery of
Date: Mon, 18 Aug 2025 16:47:58 +0100
Message-ID: <1e26e895-cad6-4920-a9df-21619777d25a@lucifer.local>
In-Reply-To: <3970cd97-2e9e-403f-867a-3addfbe399dc@redhat.com>

On Mon, Aug 18, 2025 at 11:44:16AM +0200, David Hildenbrand wrote:
> On 15.08.25 21:10, Liam R. Howlett wrote:
> > Before you read on, please take a moment to acknowledge that David
> > Hildenbrand asked for this, so I'm blaming mostly him :)
>
> :)
>
> >
> > It is possible that the dup_mmap() call fails on allocating or setting
> > up a vma after the maple tree of the oldmm is copied.  Today, that
> > failure point is marked by inserting an XA_ZERO entry over the failure
> > point so that the exact location does not need to be communicated
> > through to exit_mmap().
> >
> > However, a race exists in the tear down process because the dup_mmap()
> > drops the mmap lock before exit_mmap() can remove the partially set up
> > vma tree.  This means that other tasks may get to the mm tree and find
> > the invalid vma pointer (since it's an XA_ZERO entry), even though the
> > mm is marked as MMF_OOM_SKIP and MMF_UNSTABLE.
> >
> > To remove the race fully, the tree must be cleaned up before dropping
> > the lock.  This is accomplished by extracting the vma cleanup in
> > exit_mmap() and changing the required functions to pass through the vma
> > search limit.
> >
> > This does run the risk of increasing the possibility of finding no vmas
> > (which is already possible!) in code that isn't careful.
>
> Right, it would also happen if __mt_dup() fails I guess.
>
> >
> > The passing of so many limits and variables was such a mess when the
> > dup_mmap() was introduced that it was avoided in favour of the XA_ZERO
> > entry marker, but since the swap case was the second time we've hit
> > cases of walking an almost-dead mm, here's the alternative to checking
> > MMF_UNSTABLE before wandering into other mm structs.
>
> Changes look fairly small and reasonable, so I really like this.
>
> I agree with Jann that doing a partial teardown might be even better, but
> code-wise I suspect it might end up with a lot more churn and weird
> allocation-corner-cases to handle.

I've yet to review the series and see exactly what's proposed but on gut
instinct (and based on past experience with the munmap gather/reattach
stuff), some kind of a partial thing like this tends to end up a nightmare
of weird-stuff-you-didn't-think-about.

So I'm instinctively against this.

However I'll take a proper look through this series shortly and hopefully
have more intelligent things to say...

An aside - I was working on a crazy anon idea over the weekend (I know, I
know) and noticed that mm life cycle is just weird. I observed apparent
duplicate calls of __mmdrop() for instance (I think the unwinding just
broke), the delayed mmdrop is strange and the whole area seems rife with
complexity.

So I'm glad David talked you into doing this ;) this particular edge case
was always strange, and we have now hit it twice (this time more
seriously, as it's due to a fatal signal, which is much more likely to
arise than an OOM scenario with too-small-to-fail allocations).

BTW where are we with the hotfix for the swapoff case [0]? I think we
agreed on setting MMF_UNSTABLE there and using that to decide not to
proceed in unuse_mm(), right?
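
To make the shape of that check concrete, here is a rough userspace sketch,
not the actual hotfix. Everything in it is a hypothetical stand-in:
struct model_mm, MODEL_MMF_UNSTABLE (with an arbitrary bit value),
model_fail_dup() and model_unuse_mm() only model the idea that a failed
dup_mmap() marks the mm unstable and that the swapoff walker refuses to
touch such an mm.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flag bit; the value is illustrative, not the kernel's. */
#define MODEL_MMF_UNSTABLE (1UL << 0)

/* Hypothetical, heavily reduced stand-in for struct mm_struct. */
struct model_mm {
	unsigned long flags;	/* models the mm flags word */
	int nr_vmas;		/* models the number of vmas in the tree */
};

/*
 * Models dup_mmap() failing part-way through: the mm is marked
 * unstable so that later walkers know the tree cannot be trusted.
 */
static void model_fail_dup(struct model_mm *mm)
{
	assert(mm != NULL);
	mm->flags |= MODEL_MMF_UNSTABLE;
}

/*
 * Models the early bail-out in the swapoff walker: returns how many
 * vmas would be scanned, which is zero for an unstable mm.
 */
static int model_unuse_mm(const struct model_mm *mm)
{
	assert(mm != NULL);
	if (mm->flags & MODEL_MMF_UNSTABLE)
		return 0;	/* tree may hold a marker, not a vma: skip */
	return mm->nr_vmas;
}
```

The point is just that the decision is made once, up front, from the flag,
rather than by inspecting individual tree entries during the walk.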

Cheers, Lorenzo

[0]: https://lore.kernel.org/all/20250808092156.1918973-1-quic_charante@quicinc.com/


Thread overview (35+ messages):
2025-08-15 19:10 [RFC PATCH 0/6] Remove XA_ZERO from error recovery of Liam R. Howlett
2025-08-15 19:10 ` [RFC PATCH 1/6] mm/mmap: Move exit_mmap() trace point Liam R. Howlett
2025-08-19 18:27   ` Lorenzo Stoakes
2025-08-21 21:12   ` Chris Li
2025-08-15 19:10 ` [RFC PATCH 2/6] mm/mmap: Abstract vma clean up from exit_mmap() Liam R. Howlett
2025-08-19 18:38   ` Lorenzo Stoakes
2025-09-03 19:56     ` Liam R. Howlett
2025-09-04 15:21       ` Lorenzo Stoakes
2025-08-15 19:10 ` [RFC PATCH 3/6] mm/vma: Add limits to unmap_region() for vmas Liam R. Howlett
2025-08-19 18:48   ` Lorenzo Stoakes
2025-09-03 19:57     ` Liam R. Howlett
2025-09-04 15:23       ` Lorenzo Stoakes
2025-08-15 19:10 ` [RFC PATCH 4/6] mm/memory: Add tree limit to free_pgtables() Liam R. Howlett
2025-08-18 15:36   ` Lorenzo Stoakes
2025-08-18 15:54     ` Liam R. Howlett
2025-08-19 19:14   ` Lorenzo Stoakes
2025-09-03 20:19     ` Liam R. Howlett
2025-09-04 10:20       ` David Hildenbrand
2025-09-04 15:36         ` Lorenzo Stoakes
2025-09-09 17:19         ` Liam R. Howlett
2025-09-04 15:33       ` Lorenzo Stoakes
2025-08-15 19:10 ` [RFC PATCH 5/6] mm/vma: Add page table limit to unmap_region() Liam R. Howlett
2025-08-19 19:27   ` Lorenzo Stoakes
2025-08-15 19:10 ` [RFC PATCH 6/6] mm: Change dup_mmap() recovery Liam R. Howlett
2025-08-18 15:12   ` Lorenzo Stoakes
2025-08-18 15:29     ` Lorenzo Stoakes
2025-08-19 20:33   ` Lorenzo Stoakes
2025-09-04  0:13     ` Liam R. Howlett
2025-09-04 15:40       ` Lorenzo Stoakes
2025-08-15 19:49 ` [RFC PATCH 0/6] Remove XA_ZERO from error recovery of Jann Horn
2025-08-18 15:48   ` Liam R. Howlett
2025-08-18  9:44 ` David Hildenbrand
2025-08-18 14:26   ` Charan Teja Kalla
2025-08-18 14:54     ` Liam R. Howlett
2025-08-18 15:47   ` Lorenzo Stoakes [this message]
