From: wangzicheng
To: Shakeel Butt; lsf-pc@lists.linux-foundation.org
CC: Andrew Morton; Johannes Weiner; David Hildenbrand; Michal Hocko;
 Qi Zheng; Lorenzo Stoakes; Chen Ridong; Emil Tsalapatis;
 Alexei Starovoitov; Axel Rasmussen; Yuanchu Xie; Wei Xu; Kairui Song;
 Matthew Wilcox; Nhat Pham; Gregory Price; Barry Song <21cnbao@gmail.com>;
 David Stevens; wangtao; Vernon Yang; David Rientjes; Kalesh Singh;
 T. J. Mercier; Baolin Wang; Suren Baghdasaryan; Meta kernel team;
 bpf@vger.kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org;
 liulu 00013167; gao xu; wangxin 00023513
Subject: RE: [LSF/MM/BPF TOPIC] Towards Unified and Extensible Memory Reclaim (reclaim_ext)
Date: Thu, 26 Mar 2026 07:18:35 +0000
Message-ID: <12a0c8c9d12040fa8d23658ca57a8760@honor.com>
References: <20260325210637.3704220-1-shakeel.butt@linux.dev>
In-Reply-To: <20260325210637.3704220-1-shakeel.butt@linux.dev>

> -----Original Message-----
> From: owner-linux-mm@kvack.org On Behalf Of Shakeel Butt
> Sent: Thursday, March 26, 2026 5:07 AM
> To: lsf-pc@lists.linux-foundation.org
> Cc: Andrew Morton; Johannes Weiner; David Hildenbrand; Michal Hocko;
> Qi Zheng; Lorenzo Stoakes; Chen Ridong; Emil Tsalapatis;
> Alexei Starovoitov; Axel Rasmussen;
> Yuanchu Xie; Wei Xu; Kairui Song; Matthew Wilcox; Nhat Pham;
> Gregory Price; Barry Song <21cnbao@gmail.com>; David Stevens;
> Vernon Yang; David Rientjes; Kalesh Singh; wangzicheng;
> T. J. Mercier; Baolin Wang; Suren Baghdasaryan; Meta kernel team;
> bpf@vger.kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org
> Subject: [LSF/MM/BPF TOPIC] Towards Unified and Extensible Memory
> Reclaim (reclaim_ext)
>
> The Problem
> -----------
>
> Memory reclaim in the kernel is a mess. We ship two completely separate
> eviction algorithms -- traditional LRU and MGLRU -- in the same file.
> mm/vmscan.c is over 8,000 lines. 40% of it is MGLRU-specific code that
> duplicates functionality already present in the traditional path. Every
> bug fix, every optimization, every feature has to be done twice or it
> only works for half the users. This is not sustainable. It has to stop.
>
> We should unify both algorithms into a single code path. In this path,
> both algorithms are a set of hooks called from that path. Everyone
> maintains, understands, and evolves a single codebase. Optimizations
> are now evaluated against -- and available to -- both algorithms. And
> the next time someone develops a new LRU algorithm, they can do so in a
> way that does not add churn to existing code.
>
> How We Got Here
> ---------------
>
> MGLRU brought interesting ideas -- multi-generation aging, page table
> scanning, Bloom filters, spatial lookaround. But we never tried to
> refactor the existing reclaim code or integrate these mechanisms into
> the traditional path. 3,300 lines of code were dumped as a completely
> parallel implementation with a runtime toggle to switch between the
> two. No attempt to evolve the existing code or share mechanisms between
> the two paths -- just a second reclaim system bolted on next to the
> first.
>
> To be fair, traditional reclaim is not easy to refactor.
> It has accumulated decades of heuristics trying to work for every
> workload, and touching any of it risks regressions. But difficulty is
> not an excuse. There was no justification for not even trying -- not
> attempting to generalize the existing scanning path, not proposing
> shared abstractions, not offering the new mechanisms as improvements to
> the code that was already there. Hard does not mean impossible, and the
> cost of not trying is what we are living with now.
>
> The Differences That Matter
> ---------------------------
>
> The two algorithms differ in how they classify pages, detect access,
> and decide what to evict. But most of these differences are not
> fundamental -- they are mechanisms that got trapped inside one
> implementation when they could benefit both. Not making those
> mechanisms shareable leaves potential free performance gains on the
> table.
>
> Access detection: Traditional LRU walks reverse mappings (RMAP) from
> the page back to its page table entries. MGLRU walks page tables
> forward, scanning process address spaces directly. Neither approach is
> inherently tied to its eviction policy. Page table scanning would
> benefit traditional LRU just as much -- it is cache-friendly, batches
> updates without the LRU lock, and naturally exploits spatial locality.
> There is no reason this should be MGLRU-only.
>
> Bloom filters and lookaround: MGLRU uses Bloom filters to skip cold
> page table regions and a lookaround optimization to scan adjacent PTEs
> during eviction. These are general-purpose optimizations for any
> scanning path. They are locked inside MGLRU today for no good reason.
>
> Lock-free age updates: MGLRU updates folio age using atomic flag
> operations, avoiding the LRU lock during scanning. Traditional reclaim
> can use the same technique to reduce lock contention.
>
> Page classification: Traditional LRU uses two buckets
> (active/inactive).
> MGLRU uses four generations with timestamps and reference frequency
> tiers. This is the policy difference -- how many age buckets and how
> pages move between them. Every other mechanism is shareable.
>
> Both systems already share the core reclaim mechanics -- writeback,
> unmapping, swap, NUMA demotion, and working set tracking. The shareable
> mechanisms listed above should join that common core. What remains
> after that is a thin policy layer -- and that is all that should differ
> between algorithms.
>
> The Fix: One Reclaim, Pluggable and Extensible
> ----------------------------------------------
>
> We need one reclaim system, not two. One code path that everyone
> maintains, everyone tests, and everyone benefits from. But it needs to
> be pluggable, as there will always be cases where someone wants some
> customization for their specialized workload or wants to explore some
> new techniques/ideas, and we do not want to get into the current mess
> again.
>
> The unified reclaim must separate mechanism from policy. The mechanisms
> -- writeback, unmapping, swap, NUMA demotion, workingset tracking --
> are shared today and should stay shared. The policy decisions -- how to
> detect access, how to classify pages, which pages to evict, when to
> protect a page -- are where the two algorithms differ, and where future
> algorithms will differ too. Make those pluggable.
>
> This gives us one maintained code path with the flexibility to evolve.
> New ideas get implemented as new policies, not as 3,000-line forks.
> Good mechanisms from MGLRU (page table scanning, Bloom filters,
> lookaround) become shared infrastructure available to any policy. And
> if someone comes up with a better eviction algorithm tomorrow, they
> plug it in without touching the core.
>
> Making reclaim pluggable implies we define it as a set of function
> methods (let's call them reclaim_ops) hooking into a stable codebase we
> rarely modify.
> We then have two big questions to answer: how do these reclaim ops
> look, and how do we move the existing code to the new model?
>
> How Do We Get There
> -------------------
>
> Do we merge the two mechanisms feature by feature, or do we prioritize
> moving MGLRU to the pluggable model, then follow with LRU once we are
> happy with the result?
>
> Whichever option we choose, we do the work in small, self-contained
> phases. Each phase ships independently, each phase makes the code
> better, each phase is bisectable. No big bang. No disruption. No
> excuses.
>
> Option A: Factor and Merge
>
> MGLRU is already pretty modular. However, we do not know which
> optimizations are actually generic and which ones are only useful for
> MGLRU itself.
>
> Phase 1 -- Factor out just MGLRU into reclaim_ops. We make no
> functional changes to MGLRU. Traditional LRU code is left completely
> untouched at this stage.
>
> Phase 2 -- Merge the two paths one method at a time. Right now the code
> diverts control to MGLRU from the very top of the high-level hooks. We
> instead unify the algorithms starting from the very beginning of LRU,
> deciding what to keep in common code and what to move into a
> traditional LRU path.
>
> Advantages:
> - We do not touch LRU until Phase 2, avoiding churn.
> - Makes it easy to experiment with combining MGLRU features into
>   traditional LRU. We do not actually know which optimizations are
>   useful and which should stay in MGLRU hooks.
>
> Disadvantages:
> - We will not find out whether reclaim_ops exposes the right methods
>   until we merge the paths at the end. We will have to change the ops
>   if it turns out we need a different split. The reclaim_ops API will
>   be private and have a single user, so it is not that bad, but it may
>   require additional changes.
>
> Option B: Merge and Factor
>
> Phase 1 -- Extract MGLRU mechanisms into shared infrastructure.
> Page table scanning, Bloom filter PMD skipping, lookaround, lock-free
> folio age updates. These are independently useful. Make them available
> to both algorithms. Stop hoarding good ideas inside one code path.
>
> Phase 2 -- Collapse the remaining differences. Generalize the list
> infrastructure to N classifications (trad=2, MGLRU=4). Unify eviction
> entry points. Common classification/promotion interface. At this point
> the two "algorithms" are thin wrappers over shared code.
>
> Phase 3 -- Define the hook interface. Define reclaim_ops around the
> remaining policy differences. Layer BPF on top (reclaim_ext).
> Traditional LRU and MGLRU become two instances of the same interface.
> Adding a third algorithm means writing a new set of hooks, not forking
> 3,000 lines.
>
> Advantages:
> - We get signals on what should be shared earlier. We know every shared
>   method is useful because we use it for both algorithms.
> - Can test LRU optimizations on MGLRU early.
>
> Disadvantages:
> - Slower, as we factor out both algorithms and expand reclaim_ops all
>   at once.
>
> Open Questions
> --------------
>
> - Policy granularity: system-wide, per-node, or per-cgroup?
> - Mechanism/policy boundary: needs iteration; get it wrong and we
>   either constrain policies or duplicate code.
> - Validation: reclaim quality is hard to measure; we need agreed-upon
>   benchmarks.
> - Simplicity: the end result must be simpler than what we have today,
>   not more complex. If it is not simpler, we failed.
> --
> 2.52.0

Hi Shakeel,

The reclaim_ops direction looks very promising. I'd be interested in
the discussion.

We are particularly interested in the individual effects of several
mechanisms currently bundled in MGLRU. reclaim_ops would provide a
great opportunity to run ablation experiments, e.g. testing traditional
LRU with page table scanning.
On policy granularity, it would also be interesting to see something
like "reclaim_ext" [1,2] taking control at different levels, similar to
what sched_ext does for scheduling policies.

Best,
Zicheng

[1] cache_ext: Customizing the Page Cache with eBPF
[2] PageFlex: Flexible and Efficient User-space Delegation of Linux
    Paging Policies with eBPF