Date: Tue, 19 May 2026 13:53:29 -0700
From: Minchan Kim
To: Linus Torvalds
Cc: Oleg Nesterov, Christian Brauner, Jann Horn, akpm@linux-foundation.org,
 hca@linux.ibm.com, linux-s390@vger.kernel.org, david@kernel.org,
 mhocko@suse.com, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 surenb@google.com, timmurray@google.com
Subject: Re: [PATCH v3] mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
References: <20260511214226.937793-1-minchan@kernel.org>
 <20260515-nachdenken-umbenannt-a90006a46e14@brauner>
X-Mailing-List: linux-s390@vger.kernel.org

On Sat, May 16, 2026 at 09:31:04AM -0700, Linus Torvalds wrote:
> On Fri, 15 May 2026 at 22:47, Minchan Kim wrote:
> >
> > Regarding proc_mem_open(), it actually operates very close to what you suggested.
> > It acquires a reference to the mm_struct itself via mmgrab() but immediately
> > unpins the address space memory via mmput(). Thus, no long-term mm_users
> > reference is held across the open file descriptor.
>
> Ahh, and we've actually done that since 2012. How time flies..
>
> > The latency issue occurs during seqfile iteration (m_start/m_stop) in
> > smaps/maps, or during get_cmdline() and ptrace_access_vm(), where the reader
> > temporarily acquires mm_users via mmget_not_zero() or get_task_mm().
>
> Ok, so it's that much smaller region.
>
> How about a completely different approach then - make exit_mmap() just
> take the mmap_write_lock() properly, and allow walking the vma's
> without ever grabbing mm_users at all?
>
> IOW, just a mm_count ref would be sufficient to hold the mm_struct
> around, and then the read-lock protects against exit_mm() actually
> tearing the list down when the last "real" user goes away.
>
> [ exit_mm() is currently a bit odd - it does take the mmap_write lock,
> but it *starts* with the read-lock.
>
> I'm not sure why it does that - it used to do the write lock over
> the whole sequence, but that was changed in commit bf3980c85212 ("mm:
> drop oom code from exit_mmap").
>
> Sure, read-lock allows more concurrency, but that would seem to be a
> complete non-issue for exit_mmap(), and switching locking seems to
> just complicate things.
>
> But that's a separate issue that I just happened to notice while
> looking at this ]
>
> I may be missing something else again.

Hi Linus,

Sorry for the slow response. Thank you for the incredibly detailed feedback
and the suggestions.

Your proposal to avoid mm_users and synchronize via mmap_lock is an elegant
conceptual cleanup. However, from the perspective of userspace OOM recovery,
we hit two critical roadblocks that this alone cannot resolve:

First, the -ESRCH race remains unsolved. Even if we don't grab mm_users, the
victim process still clears its task->mm to NULL early in exit_mm(). Here is
the timing mismatch:

  CPU A (Userspace OOM Killer)      CPU B (Victim Task)
  ----------------------------      -------------------
  1. Sends SIGKILL
                                    2. Victim receives SIGKILL
                                       do_exit()
                                         exit_mm()
                                           task->mm = NULL  <==== (Stops pinning mm)
                                           mmput()
  3. Calls process_mrelease()
     (Looks up task->mm)
     (Sees NULL)
     Returns -ESRCH!  <==== (Reaping fails!)

Without Jann's patch to preserve the mm pointer via task->exit_mm, the
userspace killer won't even have a chance to attempt reaping.

Second, the latency bottleneck transfers from mmput() to mmap_lock.
If a low-priority procfs reader is preempted or stalled while holding the
mmap_read_lock, the exiting process calling exit_mmap() will block for an
unbounded time when trying to acquire the mmap_write_lock. Crucially, if this
lock contention occurs, process_mrelease() itself would also block on the same
mmap_lock while trying to reap the memory, defeating the synchronous and
expedited nature of the API.

[An Alternative Proposal: Combining Kill and Reap via pidfd_send_signal()]

Taking a step back, I believe the fundamental issue stems from separating the
asynchronous "Kill" and synchronous "Reap" operations into two distinct system
calls. Because userspace cannot predict when the victim will execute exit_mm(),
the timing mismatch is practically unavoidable, so the reap attempt ends up
failing.

Since Christian understandably dislikes combining signaling semantics into
process_mrelease(), perhaps we could solve this from the signal side.

What if we introduce a new flag for pidfd_send_signal(), such as
PIDFD_SIGNAL_PROCESS_GROUP_EXPEDITE?

When invoked with this flag and SIGKILL, pidfd_send_signal() would deliver the
fatal signal and immediately trigger the oom_reaper's VM zapping on the target
mm within the same synchronous syscall context (where task->mm is guaranteed to
be valid and easily locked).

This would completely eliminate the -ESRCH race by making the kill-and-reap
operation atomic from userspace's perspective, while keeping each syscall
focused strictly on its primary responsibility (signaling vs. reclaiming).

Honestly, if we adopt this atomic interface, it might actually make the
separate process_mrelease() syscall obsolete. I am not entirely sure about the
historical reasons why they were split into two distinct APIs in the first
place, but merging them into a single pidfd-based atomic operation seems much
cleaner.

I would highly appreciate everyone's thoughts on this perspective and
alternative direction.

>
> Also, I do really hate the smap code.
> People have optimized it because
> it's so piggy, but that code is still just silly. The "rollup" case in
> particular knows how bad it is, and does that whole "unlock and relock
> under contention" because it knows it's a horrible latency pig.

And yes, I completely agree with your frustration with the smaps code. It is
indeed a massive latency pig. In fact, userspace tools have increasingly moved
away from smaps and even PSS (Proportional Set Size) altogether because they
are simply too slow to be usable in production.

>
> Oh well. But it really feels like we *could* do this all without mm_users. No?
>
> Linus
>
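
For reference, the two-step kill-then-reap sequence from the timing diagram
earlier in this mail can be sketched from the userspace side roughly as
follows. This is only an illustration, not code from the patch:
kill_then_reap() is a hypothetical helper name, the fallback syscall numbers
are assumed from the generic syscall table for builds against older headers,
and error handling is reduced to returning a negative errno.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: generic syscall numbers, used only if the headers lack them. */
#ifndef SYS_pidfd_send_signal
#define SYS_pidfd_send_signal 424
#endif
#ifndef SYS_process_mrelease
#define SYS_process_mrelease 448
#endif

/*
 * Hypothetical helper: kill the process behind pidfd, then try to reap
 * its memory. Returns 0 on success or a negative errno value.
 */
static int kill_then_reap(int pidfd)
{
	/* Step 1 in the diagram: the kill is asynchronous. */
	if (syscall(SYS_pidfd_send_signal, pidfd, SIGKILL, NULL, 0) < 0)
		return -errno;

	/*
	 * Step 3 in the diagram: the reap is a separate syscall. Per the
	 * race described in this thread, if the victim has already run
	 * exit_mm() (step 2), this call fails with ESRCH.
	 */
	if (syscall(SYS_process_mrelease, pidfd, 0) < 0)
		return -errno;

	return 0;
}
```

A caller would typically obtain pidfd from pidfd_open() and has to treat
-ESRCH from the second step as "the victim already dropped its mm", which is
the window described above; nothing in userspace can close that window between
the two calls.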