Date: Wed, 15 Apr 2026 16:26:34 -0700
From: Minchan Kim
To: Michal Hocko
Cc: akpm@linux-foundation.org, david@kernel.org, brauner@kernel.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, surenb@google.com,
	timmurray@google.com
Subject: Re: [RFC 0/3] mm: process_mrelease: expedited reclaim and auto-kill support
References: <20260413223948.556351-1-minchan@kernel.org>

On Wed, Apr 15, 2026 at 09:38:05AM +0200, Michal Hocko wrote:
> On Tue 14-04-26 13:00:16, Minchan Kim wrote:
> > On Tue, Apr 14, 2026 at 08:57:57AM +0200, Michal Hocko wrote:
> > > On Mon 13-04-26 15:39:45, Minchan Kim wrote:
> > > > This patch series introduces optimizations to expedite memory reclamation
> > > > in process_mrelease() and provides a secure, race-free "auto-kill"
> > > > mechanism for efficient container shutdown and OOM handling.
> > > >
> > > > Currently, process_mrelease() unmaps pages but leaves clean file folios
> > > > on the LRU list, relying on standard memory reclaim to eventually free
> > > > them. Furthermore, requiring userspace to send a SIGKILL prior to
> > > > invoking process_mrelease() introduces scheduling race conditions where
> > > > the victim task may enter the exit path prematurely, bypassing expedited
> > > > reclamation hooks.
> > > >
> > > > This series addresses these limitations in three logical steps.
> > > >
> > > > Patch #1: mm: process_mrelease: expedite clean file folio reclaim via mmu_gather
> > > > Integrates clean file folio eviction directly into the low-level TLB
> > > > batching (mmu_gather) infrastructure.
> > > > Symmetrically truncates clean file
> > > > folios alongside anonymous pages during the unmap loop.
> > >
> > > Why do we need to care about clean page cache? Is this a form of
> > > drop_caches?
> >
> > The goal is to ensure the memory is actually freed by the time
> > process_mrelease returns. Currently, process_mrelease unmaps pages, but
> > page caches remain on the LRU, leaving them to be reclaimed later
> > by kswapd or direct reclaim.
>
> Correct. This was the initial design decision because there is not much
> you can assume about page cache pages which are very often shared. Even
> if they are not mapped by all users.

Fair point. However, that is exactly the trade-off: leaving unmapped page
cache to be reclaimed asynchronously keeps system memory pressure high for
too long. On Android, this delay forces lmkd to kill additional, otherwise
innocent background apps before the memory from the original victim is
recovered.

This approach also preserves safety: mapping_evict_folio() will naturally
fail if the folio is still actively shared by other processes. That is not
a perfect guarantee, but it is a minimal safety net. I believe preventing
redundant background-app kills is well worth the extra I/O of re-reading
the evicted caches; for example, the user's music app keeps playing
instead of being killed.

> > This delay defeats the purpose of
> > "expedited" release. It's not a global drop_caches, but rather a
> > targeted eviction for the victim process to make its memory immediately
> > available for other urgent allocations.
>
> Clean page cache reclaim should be quite effective. Why doesn't kswapd
> keep up in that regard? Or is this more a per-memcg problem where there
> is no background reclaim and you are hitting direct reclaim to clean up
> those pages?

Many reasons; I cannot enumerate every way kswapd falls behind memory
allocation, but the main ones are:

1. kswapd is CPU-hungry.
2. We have no control over which core (big, middle, or little) kswapd
   runs on.
3. Allocation is far faster than memory reclaim.
4.
kswapd can be stuck on some lock.
5. kswapd may be busy swapping out, since anon and file reclaim are
   serialized.

> > > > Patch #2: mm: process_mrelease: skip LRU movement for exclusive file folios
> > > > Skips costly LRU marking (folio_mark_accessed) for exclusive file-backed
> > > > folios undergoing process_mrelease reclaim. Perf profiling reveals that
> > > > LRU movement accounts for ~55% of overhead during unmap.
> > >
> > > OK, but why is this not desirable behavior for mrelease?
> >
> > In Android, lmkd kills background apps under memory pressure and then calls
> > process_mrelease. If the memory release is slow due to LRU overhead (~55% as
> > noted), it cannot keep up with the allocation speed of the foreground app.
> > This delay often leads to "over-killing" - killing more background apps
> > than necessary because the system hasn't yet "seen" the memory freed
> > from the first kill.
>
> OK, I see. More on that below.
>
> > > > Patch #3: mm: process_mrelease: introduce PROCESS_MRELEASE_REAP_KILL flag
> > > > Adds an auto-kill flag supporting atomic teardown. Utilizes a dedicated
> > > > signal code (KILL_MRELEASE) to guarantee MMF_UNSTABLE is marked in the
> > > > signal delivery path, preventing scheduling races.
> > >
> > > Could you explain why those races are a real problem?
> >
> > The race occurs when the victim process starts its own exit path (after
> > SIGKILL) before the caller can invoke process_mrelease. If the victim
> > reaches the exit path first, the caller might lose the window to apply
> > these expedited reclamation optimizations.
>
> Isn't this the problem you are trying to solve then? You are special
> casing process_mrelease while you really want to expedite the process
> memory clean up.
>
> The same situation happens with the global OOM and your approach doesn't
> really close the race anyway. You send SIGKILL first and the victim can
> hit the exit path right after that before you start processing the rest.
> That is not fundamentally different from doing that in two syscalls,
> race window is just smaller.

No, this approach completely closes the race. When it invokes
do_send_sig_info(SIGKILL) with the KILL_MRELEASE code, the kernel sets the
MMF_UNSTABLE flag on the victim's mm_struct in the signal delivery path
(kernel/signal.c) *before* the task begins processing the signal. By the
time the victim is scheduled and wakes up to handle the fatal signal, the
MMF_UNSTABLE flag is already set. This guarantees that the victim's own
exit path (do_exit -> exit_mmap) uses the expedited reclamation
optimizations automatically, regardless of whether the reaper or the
victim is scheduled first. The same idea could be applied to the OOM
killer.

> All that being said, I do not think those special hacks for
> process_mrelease is the right approach. I very much agree that the
> address space tear down for a dying process could be improved and we
> should be focusing on that part.

I think process_mrelease is crucial here because relying on the exit path
is non-deterministic. The mm_struct's reference count can be raised by
other tasks or kernel subsystems, and the actual address-space teardown
(exit_mmap) is deferred until that count drops to zero. If any component
holds a reference, the teardown is delayed indefinitely. We have observed
this problem in the field quite a lot.

This is precisely why process_mrelease was introduced in the first place:
it allows an external reaper to bypass the reference-count delays and
synchronously reap the memory of a dying process on demand. Improving the
regular exit path alone cannot substitute for that, as it would still be
stalled by outstanding references.

I don't know what alternative you have in mind, but please share it if you
have a better approach.