From: "T.J. Mercier" <tjmercier@google.com>
To: Shakeel Butt <shakeel.butt@linux.dev>
Cc: lsf-pc@lists.linux-foundation.org,
Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>,
David Hildenbrand <david@kernel.org>,
Michal Hocko <mhocko@kernel.org>,
Qi Zheng <zhengqi.arch@bytedance.com>,
Lorenzo Stoakes <ljs@kernel.org>,
Chen Ridong <chenridong@huaweicloud.com>,
Emil Tsalapatis <emil@etsalapatis.com>,
Alexei Starovoitov <ast@kernel.org>,
Axel Rasmussen <axelrasmussen@google.com>,
Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
Kairui Song <ryncsn@gmail.com>,
Matthew Wilcox <willy@infradead.org>,
Nhat Pham <nphamcs@gmail.com>, Gregory Price <gourry@gourry.net>,
Barry Song <21cnbao@gmail.com>,
David Stevens <stevensd@google.com>,
Vernon Yang <vernon2gm@gmail.com>,
David Rientjes <rientjes@google.com>,
Kalesh Singh <kaleshsingh@google.com>,
wangzicheng <wangzicheng@honor.com>,
Baolin Wang <baolin.wang@linux.alibaba.com>,
Suren Baghdasaryan <surenb@google.com>,
Meta kernel team <kernel-team@meta.com>,
bpf@vger.kernel.org, linux-mm@kvack.org,
linux-kernel@vger.kernel.org
Subject: Re: [LSF/MM/BPF TOPIC] Towards Unified and Extensible Memory Reclaim (reclaim_ext)
Date: Wed, 25 Mar 2026 17:10:34 -0700 [thread overview]
Message-ID: <CABdmKX2T9AmFH4RvShSB_82+_W1fY6fqsAX18C82nOzmTruq6Q@mail.gmail.com> (raw)
In-Reply-To: <20260325210637.3704220-1-shakeel.butt@linux.dev>
On Wed, Mar 25, 2026 at 2:07 PM Shakeel Butt <shakeel.butt@linux.dev> wrote:
>
> The Problem
> -----------
>
> Memory reclaim in the kernel is a mess. We ship two completely separate
> eviction algorithms -- traditional LRU and MGLRU -- in the same file.
> mm/vmscan.c is over 8,000 lines. 40% of it is MGLRU-specific code that
> duplicates functionality already present in the traditional path. Every
> bug fix, every optimization, every feature has to be done twice or it
> only works for half the users. This is not sustainable. It has to stop.
>
> We should unify both algorithms into a single code path. In this path,
> both algorithms are a set of hooks called from that path. Everyone
> maintains, understands, and evolves a single codebase. Optimizations are
> now evaluated against -- and available to -- both algorithms. And the
> next time someone develops a new LRU algorithm, they can do so in a way
> that does not add churn to existing code.
>
> How We Got Here
> ---------------
>
> MGLRU brought interesting ideas -- multi-generation aging, page table
> scanning, Bloom filters, spatial lookaround. But we never tried to
> refactor the existing reclaim code or integrate these mechanisms into the
> traditional path. 3,300 lines of code were dumped as a completely
> parallel implementation with a runtime toggle to switch between the two.
> No attempt to evolve the existing code or share mechanisms between the
> two paths -- just a second reclaim system bolted on next to the first.
>
> To be fair, traditional reclaim is not easy to refactor. It has
> accumulated decades of heuristics trying to work for every workload, and
> touching any of it risks regressions. But difficulty is not an excuse.
> There was no justification for not even trying -- not attempting to
> generalize the existing scanning path, not proposing shared
> abstractions, not offering the new mechanisms as improvements to the code
> that was already there. Hard does not mean impossible, and the cost of
> not trying is what we are living with now.
>
> The Differences That Matter
> ---------------------------
>
> The two algorithms differ in how they classify pages, detect access, and
> decide what to evict. But most of these differences are not fundamental
> -- they are mechanisms that got trapped inside one implementation when
> they could benefit both. Not making those mechanisms shareable leaves
> potential free performance gains on the table.
>
> Access detection: Traditional LRU walks reverse mappings (RMAP) from the
> page back to its page table entries. MGLRU walks page tables forward,
> scanning process address spaces directly. Neither approach is inherently
> tied to its eviction policy. Page table scanning would benefit
> traditional LRU just as much -- it is cache-friendly, batches updates
> without the LRU lock, and naturally exploits spatial locality. There is
> no reason this should be MGLRU-only.
>
> Bloom filters and lookaround: MGLRU uses Bloom filters to skip cold
> page table regions and a lookaround optimization to scan adjacent PTEs
> during eviction. These are general-purpose optimizations for any
> scanning path. They are locked inside MGLRU today for no good reason.
>
> Lock-free age updates: MGLRU updates folio age using atomic flag
> operations, avoiding the LRU lock during scanning. Traditional reclaim
> can use the same technique to reduce lock contention.
>
> Page classification: Traditional LRU uses two buckets
> (active/inactive). MGLRU uses four generations with timestamps and
> reference frequency tiers. This is the policy difference --
> how many age buckets and how pages move between them. Every other
> mechanism is shareable.
>
> Both systems already share the core reclaim mechanics -- writeback,
> unmapping, swap, NUMA demotion, and working set tracking. The shareable
> mechanisms listed above should join that common core. What remains after
> that is a thin policy layer -- and that is all that should differ between
> algorithms.
>
> The Fix: One Reclaim, Pluggable and Extensible
> -----------------------------------------------
>
> We need one reclaim system, not two. One code path that everyone
> maintains, everyone tests, and everyone benefits from. But it needs to
> be pluggable as there will always be cases where someone wants some
> customization for their specialized workload or wants to explore some
> new techniques/ideas, and we do not want to get into the current mess
> again.
>
> The unified reclaim must separate mechanism from policy. The mechanisms
> -- writeback, unmapping, swap, NUMA demotion, workingset tracking -- are
> shared today and should stay shared. The policy decisions -- how to
> detect access, how to classify pages, which pages to evict, when to
> protect a page -- are where the two algorithms differ, and where future
> algorithms will differ too. Make those pluggable.
>
> This gives us one maintained code path with the flexibility to evolve.
> New ideas get implemented as new policies, not as 3,000-line forks. Good
> mechanisms from MGLRU (page table scanning, Bloom filters, lookaround)
> become shared infrastructure available to any policy. And if someone
> comes up with a better eviction algorithm tomorrow, they plug it in
> without touching the core.
>
> Making reclaim pluggable implies we define it as a set of function
> methods (let's call them reclaim_ops) hooking into a stable codebase we
> rarely modify. We then have two big questions to answer: how do these
> reclaim ops look, and how do we move the existing code to the new model?
>
> How Do We Get There
> -------------------
>
> Do we merge the two mechanisms feature by feature, or do we prioritize
> moving MGLRU to the pluggable model then follow with LRU once we are
> happy with the result?
>
> Whichever option we choose, we do the work in small, self-contained
> phases. Each phase ships independently, each phase makes the code
> better, each phase is bisectable. No big bang. No disruption. No
> excuses.
>
> Option A: Factor and Merge
>
> MGLRU is already pretty modular. However, we do not know which
> optimizations are actually generic and which ones are only useful for
> MGLRU itself.
>
> Phase 1 -- Factor out just MGLRU into reclaim_ops. We make no functional
> changes to MGLRU. Traditional LRU code is left completely untouched at
> this stage.
>
> Phase 2 -- Merge the two paths one method at a time. Right now the code
> diverts control to MGLRU from the very top of the high-level hooks. We
> instead unify the algorithms starting from the very beginning of LRU and
> deciding what to keep in common code and what to move into a traditional
> LRU path.
>
> Advantages:
> - We do not touch LRU until Phase 2, avoiding churn.
> - Makes it easy to experiment with combining MGLRU features into
> traditional LRU. We do not actually know which optimizations are
> useful and which should stay in MGLRU hooks.
>
> Disadvantages:
> - We will not find out whether reclaim_ops exposes the right methods
> until we merge the paths at the end. We will have to change the ops
> if it turns out we need a different split. The reclaim_ops API will
> be private and have a single user so it is not that bad, but it may
> require additional changes.
>
> Option B: Merge and Factor
>
> Phase 1 -- Extract MGLRU mechanisms into shared infrastructure. Page
> table scanning, Bloom filter PMD skipping, lookaround, lock-free folio
> age updates. These are independently useful. Make them available to both
> algorithms. Stop hoarding good ideas inside one code path.
>
> Phase 2 -- Collapse the remaining differences. Generalize list
> infrastructure to N classifications (trad=2, MGLRU=4). Unify eviction
> entry points. Common classification/promotion interface. At this point
> the two "algorithms" are thin wrappers over shared code.
>
> Phase 3 -- Define the hook interface. Define reclaim_ops around the
> remaining policy differences. Layer BPF on top (reclaim_ext).
> Traditional LRU and MGLRU become two instances of the same interface.
> Adding a third algorithm means writing a new set of hooks, not forking
> 3,000 lines.
>
> Advantages:
> - We get signals on what should be shared earlier. We know every shared
> method to be useful because we use it for both algorithms.
> - Can test LRU optimizations on MGLRU early.
>
> Disadvantages:
> - Slower, as we factor out both algorithms and expand reclaim_ops all
> at once.
>
> Open Questions
> --------------
>
> - Policy granularity: system-wide, per-node, or per-cgroup?
> - Mechanism/policy boundary: needs iteration; get it wrong and we
> either constrain policies or duplicate code.
> - Validation: reclaim quality is hard to measure; we need agreed-upon
> benchmarks.
> - Simplicity: the end result must be simpler than what we have today,
> not more complex. If it is not simpler, we failed.
> --
> 2.52.0
>
Hi Shakeel,
Nice outline, I'd be quite interested in this discussion. It's a
little difficult for me to imagine a reclaim_ops getting us to
complete convergence, but it seems like a good way to start making
progress.
Unfortuantely I got an LSFMM Invitation Decline, so I won't be there.
Take good notes. :)
-T.J.
next prev parent reply other threads:[~2026-03-26 0:10 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-03-25 21:06 [LSF/MM/BPF TOPIC] Towards Unified and Extensible Memory Reclaim (reclaim_ext) Shakeel Butt
2026-03-26 0:10 ` T.J. Mercier [this message]
2026-03-26 2:05 ` Andrew Morton
2026-03-26 7:03 ` Michal Hocko
2026-03-26 8:02 ` Lorenzo Stoakes (Oracle)
2026-03-26 12:37 ` Kairui Song
2026-03-26 13:13 ` Lorenzo Stoakes (Oracle)
2026-03-26 13:42 ` David Hildenbrand (Arm)
2026-03-26 13:45 ` Lorenzo Stoakes (Oracle)
2026-03-26 12:06 ` Kairui Song
2026-03-26 12:31 ` Lorenzo Stoakes (Oracle)
2026-03-26 13:17 ` Kairui Song
2026-03-26 13:26 ` Lorenzo Stoakes (Oracle)
2026-03-26 13:21 ` Shakeel Butt
2026-03-26 7:12 ` Michal Hocko
2026-03-26 13:44 ` Shakeel Butt
2026-03-26 7:18 ` wangzicheng
2026-03-26 11:43 ` Lorenzo Stoakes (Oracle)
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CABdmKX2T9AmFH4RvShSB_82+_W1fY6fqsAX18C82nOzmTruq6Q@mail.gmail.com \
--to=tjmercier@google.com \
--cc=21cnbao@gmail.com \
--cc=akpm@linux-foundation.org \
--cc=ast@kernel.org \
--cc=axelrasmussen@google.com \
--cc=baolin.wang@linux.alibaba.com \
--cc=bpf@vger.kernel.org \
--cc=chenridong@huaweicloud.com \
--cc=david@kernel.org \
--cc=emil@etsalapatis.com \
--cc=gourry@gourry.net \
--cc=hannes@cmpxchg.org \
--cc=kaleshsingh@google.com \
--cc=kernel-team@meta.com \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=lsf-pc@lists.linux-foundation.org \
--cc=mhocko@kernel.org \
--cc=nphamcs@gmail.com \
--cc=rientjes@google.com \
--cc=ryncsn@gmail.com \
--cc=shakeel.butt@linux.dev \
--cc=stevensd@google.com \
--cc=surenb@google.com \
--cc=vernon2gm@gmail.com \
--cc=wangzicheng@honor.com \
--cc=weixugc@google.com \
--cc=willy@infradead.org \
--cc=yuanchu@google.com \
--cc=zhengqi.arch@bytedance.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox