All of lore.kernel.org
 help / color / mirror / Atom feed
From: Minchan Kim <minchan@kernel.org>
To: Christian Brauner <brauner@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>,
	Oleg Nesterov <oleg@redhat.com>, Jann Horn <jannh@google.com>,
	akpm@linux-foundation.org, hca@linux.ibm.com,
	linux-s390@vger.kernel.org, david@kernel.org, mhocko@suse.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	surenb@google.com, timmurray@google.com
Subject: Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
Date: Fri, 22 May 2026 15:09:13 -0700	[thread overview]
Message-ID: <ahDUCbo4fIKA7rMZ@google.com> (raw)
In-Reply-To: <20260521-voreilig-investieren-34f6e0c56258@brauner>

On Thu, May 21, 2026 at 01:50:44PM +0200, Christian Brauner wrote:
> On Tue, May 19, 2026 at 01:53:29PM -0700, Minchan Kim wrote:
> > On Sat, May 16, 2026 at 09:31:04AM -0700, Linus Torvalds wrote:
> > > On Fri, 15 May 2026 at 22:47, Minchan Kim <minchan@kernel.org> wrote:
> > > >
> > > > Regarding proc_mem_open(), it actually operates very close to what you suggested.
> > > > It acquires a reference to the mm_struct itself via mmgrab() but immediately
> > > > unpins the address space memory via mmput(). Thus, no long-term mm_users
> > > > reference is held across the open file descriptor.
> > > 
> > > Ahh, and we've actually done that since 2012. How time flies..
> > > 
> > > > The latency issue occurs during seqfile iteration (m_start/m_stop) in
> > > > smaps/maps, or during get_cmdline() and ptrace_access_vm(), where the reader
> > > > temporarily acquires mm_users via mmget_not_zero() or get_task_mm().
> > > 
> > > Ok, so it's that much smaller region.
> > > 
> > > How about a completely different approach then - make exit_mmap() just
> > > take the mmap_write_lock() properly, and allow walking the vma's
> > > without ever grabbing mm_users at all?
> > > 
> > > IOW, just a mm_count ref would be sufficient to hold the mm_struct
> > > around, and then the read-lock protects against exit_mm() actually
> > > tearing the list down when the last "real" user goes away.
> > > 
> > > [ exit_mm() is currently a bit odd - it does take the mmap_write lock,
> > > but it *starts* with the read-lock.
> > > 
> > >    I'm not sure why it does that - it used to do the write lock over
> > > the whole sequence, but that was changed in commit bf3980c85212 ("mm:
> > > drop oom code from exit_mmap").
> > > 
> > >   Sure, read-lock allows more concurrency, but that would seem to be a
> > > complete non-issue for exit_mmap(), and switching locking seems to
> > > just complicate things.
> > > 
> > >   But that's a separate issue that I just happened to notice while
> > > looking at this ]
> > > 
> > > I may be missing something else again.
> > 
> > Hi Linus,
> > 
> > Sorry for the slow response.
> > Thank you for the incredibly detailed feedback and the suggestions.
> > 
> > Your proposal to avoid mm_users and synchronize via mmap_lock is an elegant
> > conceptual cleanup. However, from the perspective of userspace OOM recovery,
> > we hit two critical roadblocks that this alone cannot resolve:
> > 
> > First, the -ESRCH race remains unsolved.
> > Even if we don't grab mm_users, the victim process still clears its task->mm
> > to NULL early in exit_mm(). Here is the timing mismatch:
> > 
> > CPU A (Userspace OOM Killer)      CPU B (Victim Task)
> > ----------------------------      -------------------
> > 1. Sends SIGKILL
> > 2. Victim receives SIGKILL
> >                                   do_exit()
> >                                     exit_mm()
> >                                       task->mm = NULL  <==== (Stops pinning mm)
> >                                       mmput()
> > 3. Calls process_mrelease()
> >    (Looks up task->mm)
> >    (Sees NULL)
> >    Returns -ESRCH! <======================================== (Reaping fails!)
> > 
> > Without Jann's patch to preserve the mm pointer via task->exit_mm, the
> > userspace killer won't even have a chance to attempt reaping.
> > 
> > Second, the latency bottleneck transfers from mmput() to mmap_lock.
> > If a low-priority procfs reader is preempted or stalled while holding the
> > mmap_read_lock, the exiting process calling exit_mmap() will block indefinitely
> > when trying to acquire the mmap_write_lock.
> > 
> > Crucially, if this lock contention occurs, process_mrelease() itself would
> > also block on the same mmap_lock while trying to reap the memory, defeating the
> > synchronous and expedited nature of the API.
> > 
> > [An Alternative Proposal: Combining Kill and Reap via pidfd_send_signal()]
> > 
> > Taking a step back, I believe the fundamental issue stems from separating
> > the asynchronous "Kill" and synchronous "Reap" operations into two distinct
> > system calls. Because userspace cannot predict when the victim will execute
> > exit_mm(), the timing mismatch is practically unavoidable so the reaping
> > doesn't work in the end.
> > 
> > Since Christian understandably dislikes combining signaling semantics into
> > process_mrelease(), perhaps we could solve this from the signal side.
> > 
> > What if we introduce a new flag for pidfd_send_signal(), such as
> > PIDFD_SIGNAL_PROCESS_GROUP_EXPEDITE?
> > 
> > When invoked with this flag and SIGKILL, pidfd_send_signal() would deliver the
> > fatal signal and immediately trigger the oom_reaper's VM zapping on the target
> > mm within the same synchronous syscall context (where task->mm is guaranteed to
> > be valid and easily locked).
> 
> Maybe. We would need to see what that actually looks like.

Hi Christian,

Sure, Let me cook the initial draft.

Thanks.

  reply	other threads:[~2026-05-22 22:09 UTC|newest]

Thread overview: 16+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-11 21:42 [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag Minchan Kim
2026-05-15 14:41 ` Christian Brauner
2026-05-15 15:49   ` Oleg Nesterov
2026-05-15 20:15   ` Oleg Nesterov
2026-05-15 22:33     ` Minchan Kim
2026-05-15 23:45       ` Linus Torvalds
2026-05-16  0:00         ` Linus Torvalds
2026-05-16  5:47         ` Minchan Kim
2026-05-16 16:31           ` Linus Torvalds
2026-05-19 20:53             ` Minchan Kim
2026-05-21 11:50               ` Christian Brauner
2026-05-22 22:09                 ` Minchan Kim [this message]
2026-05-22 22:41               ` Linus Torvalds
2026-05-22 22:44                 ` Linus Torvalds
2026-05-15 20:42   ` Suren Baghdasaryan
2026-05-15 20:52   ` Minchan Kim

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ahDUCbo4fIKA7rMZ@google.com \
    --to=minchan@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=brauner@kernel.org \
    --cc=david@kernel.org \
    --cc=hca@linux.ibm.com \
    --cc=jannh@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=mhocko@suse.com \
    --cc=oleg@redhat.com \
    --cc=surenb@google.com \
    --cc=timmurray@google.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.