public inbox for linux-mm@kvack.org
From: wangzicheng <wangzicheng@honor.com>
To: Shakeel Butt <shakeel.butt@linux.dev>,
	"lsf-pc@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	David Hildenbrand <david@kernel.org>,
	Michal Hocko <mhocko@kernel.org>,
	Qi Zheng <zhengqi.arch@bytedance.com>,
	Lorenzo Stoakes <ljs@kernel.org>,
	Chen Ridong <chenridong@huaweicloud.com>,
	Emil Tsalapatis <emil@etsalapatis.com>,
	Alexei Starovoitov <ast@kernel.org>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
	Kairui Song <ryncsn@gmail.com>,
	Matthew Wilcox <willy@infradead.org>,
	Nhat Pham <nphamcs@gmail.com>, Gregory Price <gourry@gourry.net>,
	Barry Song <21cnbao@gmail.com>,
	David Stevens <stevensd@google.com>,
	wangtao <tao.wangtao@honor.com>,
	Vernon Yang <vernon2gm@gmail.com>,
	David Rientjes <rientjes@google.com>,
	Kalesh Singh <kaleshsingh@google.com>,
	"T . J . Mercier" <tjmercier@google.com>,
	"Baolin Wang" <baolin.wang@linux.alibaba.com>,
	Suren Baghdasaryan <surenb@google.com>,
	Meta kernel team <kernel-team@meta.com>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	liulu 00013167 <liulu.liu@honor.com>, gao xu <gaoxu2@honor.com>,
	wangxin 00023513 <wangxin23@honor.com>
Subject: RE: [LSF/MM/BPF TOPIC] Towards Unified and Extensible Memory Reclaim (reclaim_ext)
Date: Thu, 26 Mar 2026 07:18:35 +0000	[thread overview]
Message-ID: <12a0c8c9d12040fa8d23658ca57a8760@honor.com> (raw)
In-Reply-To: <20260325210637.3704220-1-shakeel.butt@linux.dev>



> -----Original Message-----
> From: owner-linux-mm@kvack.org <owner-linux-mm@kvack.org> On Behalf
> Of Shakeel Butt
> Sent: Thursday, March 26, 2026 5:07 AM
> To: lsf-pc@lists.linux-foundation.org
> Cc: Andrew Morton <akpm@linux-foundation.org>; Johannes Weiner
> <hannes@cmpxchg.org>; David Hildenbrand <david@kernel.org>; Michal
> Hocko <mhocko@kernel.org>; Qi Zheng <zhengqi.arch@bytedance.com>;
> Lorenzo Stoakes <ljs@kernel.org>; Chen Ridong
> <chenridong@huaweicloud.com>; Emil Tsalapatis <emil@etsalapatis.com>;
> Alexei Starovoitov <ast@kernel.org>; Axel Rasmussen
> <axelrasmussen@google.com>; Yuanchu Xie <yuanchu@google.com>; Wei
> Xu <weixugc@google.com>; Kairui Song <ryncsn@gmail.com>; Matthew
> Wilcox <willy@infradead.org>; Nhat Pham <nphamcs@gmail.com>; Gregory
> Price <gourry@gourry.net>; Barry Song <21cnbao@gmail.com>; David
> Stevens <stevensd@google.com>; Vernon Yang <vernon2gm@gmail.com>;
> David Rientjes <rientjes@google.com>; Kalesh Singh
> <kaleshsingh@google.com>; wangzicheng <wangzicheng@honor.com>; T . J .
> Mercier <tjmercier@google.com>; Baolin Wang
> <baolin.wang@linux.alibaba.com>; Suren Baghdasaryan
> <surenb@google.com>; Meta kernel team <kernel-team@meta.com>;
> bpf@vger.kernel.org; linux-mm@kvack.org; linux-kernel@vger.kernel.org
> Subject: [LSF/MM/BPF TOPIC] Towards Unified and Extensible Memory
> Reclaim (reclaim_ext)
> 
> The Problem
> -----------
> 
> Memory reclaim in the kernel is a mess. We ship two completely separate
> eviction algorithms -- traditional LRU and MGLRU -- in the same file.
> mm/vmscan.c is over 8,000 lines. 40% of it is MGLRU-specific code that
> duplicates functionality already present in the traditional path. Every bug fix,
> every optimization, every feature has to be done twice or it only works for
> half the users. This is not sustainable. It has to stop.
> 
> We should unify both algorithms into a single code path, with each
> algorithm reduced to a set of hooks called from that path. Everyone maintains,
> understands, and evolves a single codebase. Optimizations are now
> evaluated against -- and available to -- both algorithms. And the next time
> someone develops a new LRU algorithm, they can do so in a way that does
> not add churn to existing code.
> 
> How We Got Here
> ---------------
> 
> MGLRU brought interesting ideas -- multi-generation aging, page table
> scanning, Bloom filters, spatial lookaround. But we never tried to refactor the
> existing reclaim code or integrate these mechanisms into the traditional path.
> 3,300 lines of code were dumped as a completely parallel implementation
> with a runtime toggle to switch between the two.
> No attempt to evolve the existing code or share mechanisms between the
> two paths -- just a second reclaim system bolted on next to the first.
> 
> To be fair, traditional reclaim is not easy to refactor. It has accumulated
> decades of heuristics trying to work for every workload, and touching any of
> it risks regressions. But difficulty is not an excuse.
> There was no justification for not even trying -- not attempting to generalize
> the existing scanning path, not proposing shared abstractions, not offering
> the new mechanisms as improvements to the code that was already there.
> Hard does not mean impossible, and the cost of not trying is what we are
> living with now.
> 
> The Differences That Matter
> ---------------------------
> 
> The two algorithms differ in how they classify pages, detect access, and
> decide what to evict. But most of these differences are not fundamental
> -- they are mechanisms that got trapped inside one implementation when
> they could benefit both. Not making those mechanisms shareable leaves
> potential free performance gains on the table.
> 
> Access detection: Traditional LRU walks reverse mappings (RMAP) from the
> page back to its page table entries. MGLRU walks page tables forward,
> scanning process address spaces directly. Neither approach is inherently tied
> to its eviction policy. Page table scanning would benefit traditional LRU just as
> much -- it is cache-friendly, batches updates without the LRU lock, and
> naturally exploits spatial locality. There is no reason this should be MGLRU-
> only.
> 
> Bloom filters and lookaround: MGLRU uses Bloom filters to skip cold page
> table regions and a lookaround optimization to scan adjacent PTEs during
> eviction. These are general-purpose optimizations for any scanning path.
> They are locked inside MGLRU today for no good reason.
> 
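As a rough userspace illustration of the Bloom-filter idea above (all names and constants here are mine, not the kernel's): the scanner remembers which PMD-aligned ranges held young PTEs, and a later walk can safely skip any range the filter reports cold, since a Bloom filter has no false negatives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical sketch: a tiny two-hash Bloom filter keyed by PMD-aligned
 * addresses. A "cold" answer is always safe to trust (no false negatives);
 * a "hot" answer may occasionally be a false positive, which only costs
 * an unnecessary scan, never a missed one. */
#define BLOOM_BITS 4096
static uint8_t bloom[BLOOM_BITS / 8];

static uint32_t hash1(uint64_t key)
{
	return (uint32_t)((key * 0x9E3779B97F4A7C15ULL) >> 40) % BLOOM_BITS;
}

static uint32_t hash2(uint64_t key)
{
	return (uint32_t)((key * 0xC2B2AE3D27D4EB4FULL) >> 40) % BLOOM_BITS;
}

/* Record that this PMD range contained young PTEs during a scan. */
static void bloom_set(uint64_t pmd_addr)
{
	uint32_t h1 = hash1(pmd_addr), h2 = hash2(pmd_addr);

	bloom[h1 / 8] |= 1u << (h1 % 8);
	bloom[h2 / 8] |= 1u << (h2 % 8);
}

/* Ask whether a later walk should bother descending into this range. */
static bool bloom_maybe_hot(uint64_t pmd_addr)
{
	uint32_t h1 = hash1(pmd_addr), h2 = hash2(pmd_addr);

	return ((bloom[h1 / 8] >> (h1 % 8)) & 1) &&
	       ((bloom[h2 / 8] >> (h2 % 8)) & 1);
}
```

Nothing in this structure depends on generations, which is the point: any scanning path, traditional or otherwise, could consult it.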
> Lock-free age updates: MGLRU updates folio age using atomic flag
> operations, avoiding the LRU lock during scanning. Traditional reclaim can use
> the same technique to reduce lock contention.
> 
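The lock-free age update above can be prototyped outside the kernel with C11 atomics (struct and flag names below are illustrative, not the kernel's): the scanner publishes "seen young" with an atomic RMW on a per-folio flags word, and eviction consumes the bit the same way, so neither side needs the LRU lock for this.

```c
#include <stdatomic.h>

/* Hypothetical sketch of lock-free age updates. */
#define AGE_YOUNG (1u << 0)

struct fake_folio {
	atomic_uint flags;
};

/* Scan path: mark the folio referenced without taking any lock. */
static void scan_mark_young(struct fake_folio *f)
{
	atomic_fetch_or_explicit(&f->flags, AGE_YOUNG, memory_order_relaxed);
}

/* Eviction path: returns nonzero if the folio was referenced since the
 * last check, clearing the bit as a side effect of the same RMW. */
static unsigned int evict_test_and_clear_young(struct fake_folio *f)
{
	return atomic_fetch_and_explicit(&f->flags, ~AGE_YOUNG,
					 memory_order_relaxed) & AGE_YOUNG;
}
```

Because both sides use a single RMW, a reference can never be lost between the test and the clear, which is what makes dropping the lock safe for this particular update.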
> Page classification: Traditional LRU uses two buckets (active/inactive).
> MGLRU uses four generations with timestamps and reference frequency
> tiers. This is the policy difference -- how many age buckets and how pages
> move between them. Every other mechanism is shareable.
> 
> Both systems already share the core reclaim mechanics -- writeback,
> unmapping, swap, NUMA demotion, and working set tracking. The shareable
> mechanisms listed above should join that common core. What remains after
> that is a thin policy layer -- and that is all that should differ between
> algorithms.
> 
> The Fix: One Reclaim, Pluggable and Extensible
> -----------------------------------------------
> 
> We need one reclaim system, not two. One code path that everyone
> maintains, everyone tests, and everyone benefits from. But it needs to be
> pluggable as there will always be cases where someone wants some
> customization for their specialized workload or wants to explore some new
> techniques/ideas, and we do not want to get into the current mess again.
> 
> The unified reclaim must separate mechanism from policy. The mechanisms
> -- writeback, unmapping, swap, NUMA demotion, workingset tracking -- are
> shared today and should stay shared. The policy decisions -- how to detect
> access, how to classify pages, which pages to evict, when to protect a page --
> are where the two algorithms differ, and where future algorithms will differ
> too. Make those pluggable.
> 
> This gives us one maintained code path with the flexibility to evolve.
> New ideas get implemented as new policies, not as 3,000-line forks. Good
> mechanisms from MGLRU (page table scanning, Bloom filters, lookaround)
> become shared infrastructure available to any policy. And if someone comes
> up with a better eviction algorithm tomorrow, they plug it in without
> touching the core.
> 
> Making reclaim pluggable implies we define it as a set of function pointers
> (let's call them reclaim_ops) hooking into a stable codebase we rarely modify.
> We then have two big questions to answer: how do these reclaim ops look,
> and how do we move the existing code to the new model?
> 
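To make the shape of this concrete, here is one speculative userspace sketch of what reclaim_ops might look like; every name here is my guess, since the proposal deliberately leaves the API open. The policy decisions named earlier (access feedback, classification, eviction choice) become function pointers, and common code only ever dispatches through them.

```c
#include <stddef.h>

/* Hypothetical shape of reclaim_ops; all names are speculative. */
struct demo_folio {
	int refs;
};

struct reclaim_ops {
	int nr_classes;                              /* 2 = active/inactive, 4 = generations */
	void (*folio_accessed)(struct demo_folio *); /* access-detection feedback */
	int  (*classify)(const struct demo_folio *); /* which bucket/generation */
	int  (*should_evict)(const struct demo_folio *);
};

/* A toy "traditional LRU" policy instance: one reference bit, two buckets. */
static void trad_accessed(struct demo_folio *f)
{
	if (f->refs < 1)
		f->refs++;
}

static int trad_classify(const struct demo_folio *f)
{
	return f->refs ? 1 : 0;	/* 1 = active, 0 = inactive */
}

static int trad_should_evict(const struct demo_folio *f)
{
	return f->refs == 0;
}

static const struct reclaim_ops traditional_lru_ops = {
	.nr_classes     = 2,
	.folio_accessed = trad_accessed,
	.classify       = trad_classify,
	.should_evict   = trad_should_evict,
};

/* Common code never hard-codes the policy: it asks the ops, then runs
 * the shared mechanism (unmap, writeback, swap) on a yes. */
static int shrink_one(const struct reclaim_ops *ops, struct demo_folio *f)
{
	return ops->should_evict(f);
}
```

An MGLRU-style instance would supply nr_classes = 4 and generation-based classify/should_evict, while shrink_one and everything below it stays untouched, which is the separation the proposal argues for.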
> How Do We Get There
> -------------------
> 
> Do we merge the two mechanisms feature by feature, or do we prioritize
> moving MGLRU to the pluggable model then follow with LRU once we are
> happy with the result?
> 
> Whichever option we choose, we do the work in small, self-contained phases.
> Each phase ships independently, each phase makes the code better, each
> phase is bisectable. No big bang. No disruption. No excuses.
> 
> Option A: Factor and Merge
> 
> MGLRU is already pretty modular. However, we do not know which
> optimizations are actually generic and which ones are only useful for MGLRU
> itself.
> 
> Phase 1 -- Factor out just MGLRU into reclaim_ops. We make no functional
> changes to MGLRU. Traditional LRU code is left completely untouched at this
> stage.
> 
> Phase 2 -- Merge the two paths one method at a time. Right now the code
> diverts control to MGLRU from the very top of the high-level hooks. We
> instead unify the algorithms starting from the very beginning of the LRU
> path, deciding what to keep in common code and what to move into a
> traditional LRU path.
> 
> Advantages:
> - We do not touch LRU until Phase 2, avoiding churn.
> - Makes it easy to experiment with combining MGLRU features into
>   traditional LRU. We do not actually know which optimizations are
>   useful and which should stay in MGLRU hooks.
> 
> Disadvantages:
> - We will not find out whether reclaim_ops exposes the right methods
>   until we merge the paths at the end. We will have to change the ops
>   if it turns out we need a different split. The reclaim_ops API will
>   be private and have a single user so it is not that bad, but it may
>   require additional changes.
> 
> Option B: Merge and Factor
> 
> Phase 1 -- Extract MGLRU mechanisms into shared infrastructure. Page table
> scanning, Bloom filter PMD skipping, lookaround, lock-free folio age updates.
> These are independently useful. Make them available to both algorithms.
> Stop hoarding good ideas inside one code path.
> 
> Phase 2 -- Collapse the remaining differences. Generalize list infrastructure
> to N classifications (trad=2, MGLRU=4). Unify eviction entry points. Common
> classification/promotion interface. At this point the two "algorithms" are thin
> wrappers over shared code.
> 
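Generalizing to N classifications might reduce to something as simple as the following sketch (a deliberately toy model, with counters standing in for per-bucket list heads): the same structure covers N=2 and N=4, and eviction always drains the coldest non-empty bucket regardless of N.

```c
/* Hypothetical sketch of list infrastructure generalized to N classes. */
#define MAX_CLASSES 8

struct class_lists {
	int nr_classes;
	int nr_folios[MAX_CLASSES];	/* stand-in for per-bucket list heads */
};

static void classes_init(struct class_lists *cl, int n)
{
	int i;

	cl->nr_classes = n;
	for (i = 0; i < MAX_CLASSES; i++)
		cl->nr_folios[i] = 0;
}

/* The policy's classify() decides the bucket; the list code doesn't care
 * whether that was an active/inactive decision or a generation number. */
static void classes_add(struct class_lists *cl, int class)
{
	cl->nr_folios[class % cl->nr_classes]++;
}

/* Evict from the coldest non-empty bucket, whatever N is. Returns the
 * bucket evicted from, or -1 if everything is empty. */
static int classes_evict_coldest(struct class_lists *cl)
{
	int i;

	for (i = 0; i < cl->nr_classes; i++) {
		if (cl->nr_folios[i]) {
			cl->nr_folios[i]--;
			return i;
		}
	}
	return -1;
}
```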
> Phase 3 -- Define the hook interface. Define reclaim_ops around the
> remaining policy differences. Layer BPF on top (reclaim_ext).
> Traditional LRU and MGLRU become two instances of the same interface.
> Adding a third algorithm means writing a new set of hooks, not forking
> 3,000 lines.
> 
> Advantages:
> - We get signals on what should be shared earlier. We know every shared
>   mechanism is useful because both algorithms use it.
> - Can test LRU optimizations on MGLRU early.
> 
> Disadvantages:
> - Slower, as we factor out both algorithms and expand reclaim_ops all
>   at once.
> 
> Open Questions
> --------------
> 
> - Policy granularity: system-wide, per-node, or per-cgroup?
> - Mechanism/policy boundary: needs iteration; get it wrong and we
>   either constrain policies or duplicate code.
> - Validation: reclaim quality is hard to measure; we need agreed-upon
>   benchmarks.
> - Simplicity: the end result must be simpler than what we have today,
>   not more complex. If it is not simpler, we failed.
> --
> 2.52.0
> 

Hi Shakeel,

The reclaim_ops direction looks very promising. I'd be interested in the discussion.

We are particularly interested in the individual effects of several mechanisms
currently bundled in MGLRU. reclaim_ops would provide a great opportunity to
run ablation experiments, e.g. testing traditional LRU with page table scanning.

On policy granularity, it would also be interesting to see something like "reclaim_ext" [1][2]
taking control at different levels, similar to what sched_ext does for scheduling policies.

Best,
Zicheng

[1] cache_ext: Customizing the Page Cache with eBPF
[2] PageFlex: Flexible and Efficient User-space Delegation of Linux Paging Policies with eBPF





Thread overview: 18+ messages
2026-03-25 21:06 [LSF/MM/BPF TOPIC] Towards Unified and Extensible Memory Reclaim (reclaim_ext) Shakeel Butt
2026-03-26  0:10 ` T.J. Mercier
2026-03-26  2:05 ` Andrew Morton
2026-03-26  7:03   ` Michal Hocko
2026-03-26  8:02     ` Lorenzo Stoakes (Oracle)
2026-03-26 12:37       ` Kairui Song
2026-03-26 13:13         ` Lorenzo Stoakes (Oracle)
2026-03-26 13:42           ` David Hildenbrand (Arm)
2026-03-26 13:45             ` Lorenzo Stoakes (Oracle)
2026-03-26 12:06   ` Kairui Song
2026-03-26 12:31     ` Lorenzo Stoakes (Oracle)
2026-03-26 13:17       ` Kairui Song
2026-03-26 13:26         ` Lorenzo Stoakes (Oracle)
2026-03-26 13:21   ` Shakeel Butt
2026-03-26  7:12 ` Michal Hocko
2026-03-26 13:44   ` Shakeel Butt
2026-03-26  7:18 ` wangzicheng [this message]
2026-03-26 11:43 ` Lorenzo Stoakes (Oracle)
