Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: Minchan Kim <minchan@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Oleg Nesterov <oleg@redhat.com>,
	Christian Brauner <brauner@kernel.org>,
	Jann Horn <jannh@google.com>,
	akpm@linux-foundation.org, hca@linux.ibm.com,
	linux-s390@vger.kernel.org, david@kernel.org, mhocko@suse.com,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org,
	surenb@google.com, timmurray@google.com
Subject: Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
Date: Tue, 19 May 2026 13:53:29 -0700	[thread overview]
Message-ID: <agzNyXKIirm1zOob@google.com> (raw)
In-Reply-To: <CAHk-=wg08=sZb=hVa22KjN2pD3aNk-diGmKQnizCgmoMMfrvwQ@mail.gmail.com>

On Sat, May 16, 2026 at 09:31:04AM -0700, Linus Torvalds wrote:
> On Fri, 15 May 2026 at 22:47, Minchan Kim <minchan@kernel.org> wrote:
> >
> > Regarding proc_mem_open(), it actually operates very close to what you suggested.
> > It acquires a reference to the mm_struct itself via mmgrab() but immediately
> > unpins the address space memory via mmput(). Thus, no long-term mm_users
> > reference is held across the open file descriptor.
> 
> Ahh, and we've actually done that since 2012. How time flies..
> 
> > The latency issue occurs during seqfile iteration (m_start/m_stop) in
> > smaps/maps, or during get_cmdline() and ptrace_access_vm(), where the reader
> > temporarily acquires mm_users via mmget_not_zero() or get_task_mm().
> 
> Ok, so it's that much smaller region.
> 
> How about a completely different approach then - make exit_mmap() just
> take the mmap_write_lock() properly, and allow walking the vma's
> without ever grabbing mm_users at all?
> 
> IOW, just a mm_count ref would be sufficient to hold the mm_struct
> around, and then the read-lock protects against exit_mm() actually
> tearing the list down when the last "real" user goes away.
> 
> [ exit_mm() is currently a bit odd - it does take the mmap_write lock,
> but it *starts* with the read-lock.
> 
>    I'm not sure why it does that - it used to do the write lock over
> the whole sequence, but that was changed in commit bf3980c85212 ("mm:
> drop oom code from exit_mmap").
> 
>   Sure, read-lock allows more concurrency, but that would seem to be a
> complete non-issue for exit_mmap(), and switching locking seems to
> just complicate things.
> 
>   But that's a separate issue that I just happened to notice while
> looking at this ]
> 
> I may be missing something else again.

Hi Linus,

Sorry for the slow response.
Thank you for the incredibly detailed feedback and the suggestions.

Your proposal to avoid mm_users and synchronize via mmap_lock is an elegant
conceptual cleanup. However, from the perspective of userspace OOM recovery,
we hit two critical roadblocks that this alone cannot resolve:

First, the -ESRCH race remains unsolved.
Even if we don't grab mm_users, the victim process still clears its task->mm
to NULL early in exit_mm(). Here is the timing mismatch:

CPU A (Userspace OOM Killer)      CPU B (Victim Task)
----------------------------      -------------------
1. Sends SIGKILL
2. Victim receives SIGKILL
                                  do_exit()
                                    exit_mm()
                                      task->mm = NULL  <==== (Stops pinning mm)
                                      mmput()
3. Calls process_mrelease()
   (Looks up task->mm)
   (Sees NULL)
   Returns -ESRCH! <======================================== (Reaping fails!)

Without Jann's patch to preserve the mm pointer via task->exit_mm, the
userspace killer won't even have a chance to attempt reaping.

Second, the latency bottleneck transfers from mmput() to mmap_lock.
If a low-priority procfs reader is preempted or stalled while holding the
mmap_read_lock, the exiting process calling exit_mmap() will block indefinitely
when trying to acquire the mmap_write_lock.

Crucially, if this lock contention occurs, process_mrelease() itself would
also block on the same mmap_lock while trying to reap the memory, defeating the
synchronous and expedited nature of the API.

[An Alternative Proposal: Combining Kill and Reap via pidfd_send_signal()]

Taking a step back, I believe the fundamental issue stems from separating
the asynchronous "Kill" and synchronous "Reap" operations into two distinct
system calls. Because userspace cannot predict when the victim will execute
exit_mm(), the timing mismatch is practically unavoidable so the reaping
doesn't work in the end.

Since Christian understandably dislikes combining signaling semantics into
process_mrelease(), perhaps we could solve this from the signal side.

What if we introduce a new flag for pidfd_send_signal(), such as
PIDFD_SIGNAL_PROCESS_GROUP_EXPEDITE?

When invoked with this flag and SIGKILL, pidfd_send_signal() would deliver the
fatal signal and immediately trigger the oom_reaper's VM zapping on the target
mm within the same synchronous syscall context (where task->mm is guaranteed to
be valid and easily locked).

This would completely eliminate the -ESRCH race by making the kill-and-reap
operation atomic from userspace's perspective, while keeping each syscall
focused strictly on its primary responsibility (signaling vs. reclaiming)

Honestly, if we adopt this atomic interface, it might actually make the
separate process_mrelease() syscall obsolete. I am not entirely sure about
the historical reasons why they were split into two distinct APIs
in the first place, but merging them into a single pidfd-based atomic
operation seems much cleaner.

I would highly appreciate everyone's thoughts on this perspective and
alternative direction.

> 
> Also, I do really hate the smap code. People have optimized it because
> it's so piggy, but that code is still just silly. The "rollup" case in
> particular knows how bad it is, and does that whole "unlock and relock
> under contention" because it knows it's a horrible latency pig.

And yes, I completely agree with your frustration on the smaps code—it is
indeed a massive latency pig. In fact, userspace tools have increasingly moved
away from smaps and even PSS (Proportional Set Size) altogether because they
are simply too slow to be usable in production.

> 
> Oh well. But it really feels like we *could* do this all without mm_users. No?
> 
>                   Linus
> 


  reply	other threads:[~2026-05-19 20:53 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-11 21:42 [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag Minchan Kim
2026-05-15 14:41 ` Christian Brauner
2026-05-15 15:49   ` Oleg Nesterov
2026-05-15 20:15   ` Oleg Nesterov
2026-05-15 22:33     ` Minchan Kim
2026-05-15 23:45       ` Linus Torvalds
2026-05-16  0:00         ` Linus Torvalds
2026-05-16  5:47         ` Minchan Kim
2026-05-16 16:31           ` Linus Torvalds
2026-05-19 20:53             ` Minchan Kim [this message]
2026-05-15 20:42   ` Suren Baghdasaryan
2026-05-15 20:52   ` Minchan Kim

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=agzNyXKIirm1zOob@google.com \
    --to=minchan@kernel.org \
    --cc=akpm@linux-foundation.org \
    --cc=brauner@kernel.org \
    --cc=david@kernel.org \
    --cc=hca@linux.ibm.com \
    --cc=jannh@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=mhocko@suse.com \
    --cc=oleg@redhat.com \
    --cc=surenb@google.com \
    --cc=timmurray@google.com \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox