From: Minchan Kim <minchan@kernel.org>
To: Christian Brauner <brauner@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
Oleg Nesterov <oleg@redhat.com>, Jann Horn <jannh@google.com>,
akpm@linux-foundation.org, hca@linux.ibm.com,
linux-s390@vger.kernel.org, david@kernel.org, mhocko@suse.com,
linux-mm@kvack.org, linux-kernel@vger.kernel.org,
surenb@google.com, timmurray@google.com
Subject: Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
Date: Fri, 22 May 2026 15:09:13 -0700 [thread overview]
Message-ID: <ahDUCbo4fIKA7rMZ@google.com> (raw)
In-Reply-To: <20260521-voreilig-investieren-34f6e0c56258@brauner>
On Thu, May 21, 2026 at 01:50:44PM +0200, Christian Brauner wrote:
> On Tue, May 19, 2026 at 01:53:29PM -0700, Minchan Kim wrote:
> > On Sat, May 16, 2026 at 09:31:04AM -0700, Linus Torvalds wrote:
> > > On Fri, 15 May 2026 at 22:47, Minchan Kim <minchan@kernel.org> wrote:
> > > >
> > > > Regarding proc_mem_open(), it actually operates very close to what you suggested.
> > > > It acquires a reference to the mm_struct itself via mmgrab() but immediately
> > > > unpins the address space memory via mmput(). Thus, no long-term mm_users
> > > > reference is held across the open file descriptor.
> > >
> > > Ahh, and we've actually done that since 2012. How time flies..
> > >
> > > > The latency issue occurs during seqfile iteration (m_start/m_stop) in
> > > > smaps/maps, or during get_cmdline() and ptrace_access_vm(), where the reader
> > > > temporarily acquires mm_users via mmget_not_zero() or get_task_mm().
> > >
> > > Ok, so it's that much smaller region.
> > >
> > > How about a completely different approach then - make exit_mmap() just
> > > take the mmap_write_lock() properly, and allow walking the vma's
> > > without ever grabbing mm_users at all?
> > >
> > > IOW, just a mm_count ref would be sufficient to hold the mm_struct
> > > around, and then the read-lock protects against exit_mm() actually
> > > tearing the list down when the last "real" user goes away.
> > >
> > > [ exit_mm() is currently a bit odd - it does take the mmap_write lock,
> > > but it *starts* with the read-lock.
> > >
> > > I'm not sure why it does that - it used to do the write lock over
> > > the whole sequence, but that was changed in commit bf3980c85212 ("mm:
> > > drop oom code from exit_mmap").
> > >
> > > Sure, read-lock allows more concurrency, but that would seem to be a
> > > complete non-issue for exit_mmap(), and switching locking seems to
> > > just complicate things.
> > >
> > > But that's a separate issue that I just happened to notice while
> > > looking at this ]
> > >
> > > I may be missing something else again.
> >
> > Hi Linus,
> >
> > Sorry for the slow response.
> > Thank you for the incredibly detailed feedback and the suggestions.
> >
> > Your proposal to avoid mm_users and synchronize via mmap_lock is an elegant
> > conceptual cleanup. However, from the perspective of userspace OOM recovery,
> > we hit two critical roadblocks that this alone cannot resolve:
> >
> > First, the -ESRCH race remains unsolved.
> > Even if we don't grab mm_users, the victim process still clears its task->mm
> > to NULL early in exit_mm(). Here is the timing mismatch:
> >
> > CPU A (Userspace OOM Killer) CPU B (Victim Task)
> > ---------------------------- -------------------
> > 1. Sends SIGKILL
> > 2. Victim receives SIGKILL
> > do_exit()
> > exit_mm()
> > task->mm = NULL <==== (Stops pinning mm)
> > mmput()
> > 3. Calls process_mrelease()
> > (Looks up task->mm)
> > (Sees NULL)
> > Returns -ESRCH! <======================================== (Reaping fails!)
> >
> > Without Jann's patch to preserve the mm pointer via task->exit_mm, the
> > userspace killer won't even have a chance to attempt reaping.
> >
> > Second, the latency bottleneck transfers from mmput() to mmap_lock.
> > If a low-priority procfs reader is preempted or stalled while holding the
> > mmap_read_lock, the exiting process calling exit_mmap() will block indefinitely
> > when trying to acquire the mmap_write_lock.
> >
> > Crucially, if this lock contention occurs, process_mrelease() itself would
> > also block on the same mmap_lock while trying to reap the memory, defeating the
> > synchronous and expedited nature of the API.
> >
> > [An Alternative Proposal: Combining Kill and Reap via pidfd_send_signal()]
> >
> > Taking a step back, I believe the fundamental issue stems from separating
> > the asynchronous "Kill" and synchronous "Reap" operations into two distinct
> > system calls. Because userspace cannot predict when the victim will execute
> > exit_mm(), the timing mismatch is practically unavoidable so the reaping
> > doesn't work in the end.
> >
> > Since Christian understandably dislikes combining signaling semantics into
> > process_mrelease(), perhaps we could solve this from the signal side.
> >
> > What if we introduce a new flag for pidfd_send_signal(), such as
> > PIDFD_SIGNAL_PROCESS_GROUP_EXPEDITE?
> >
> > When invoked with this flag and SIGKILL, pidfd_send_signal() would deliver the
> > fatal signal and immediately trigger the oom_reaper's VM zapping on the target
> > mm within the same synchronous syscall context (where task->mm is guaranteed to
> > be valid and easily locked).
>
> Maybe. We would need to see what that actually looks like.
Hi Christian,
Sure, Let me cook the initial draft.
Thanks.
next prev parent reply other threads:[~2026-05-22 22:09 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-11 21:42 [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag Minchan Kim
2026-05-15 14:41 ` Christian Brauner
2026-05-15 15:49 ` Oleg Nesterov
2026-05-15 20:15 ` Oleg Nesterov
2026-05-15 22:33 ` Minchan Kim
2026-05-15 23:45 ` Linus Torvalds
2026-05-16 0:00 ` Linus Torvalds
2026-05-16 5:47 ` Minchan Kim
2026-05-16 16:31 ` Linus Torvalds
2026-05-19 20:53 ` Minchan Kim
2026-05-21 11:50 ` Christian Brauner
2026-05-22 22:09 ` Minchan Kim [this message]
2026-05-22 22:41 ` Linus Torvalds
2026-05-22 22:44 ` Linus Torvalds
2026-05-15 20:42 ` Suren Baghdasaryan
2026-05-15 20:52 ` Minchan Kim
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ahDUCbo4fIKA7rMZ@google.com \
--to=minchan@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=brauner@kernel.org \
--cc=david@kernel.org \
--cc=hca@linux.ibm.com \
--cc=jannh@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=linux-s390@vger.kernel.org \
--cc=mhocko@suse.com \
--cc=oleg@redhat.com \
--cc=surenb@google.com \
--cc=timmurray@google.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox