From: wangzicheng <wangzicheng@honor.com>
To: Shakeel Butt <shakeel.butt@linux.dev>,
"lsf-pc@lists.linux-foundation.org"
<lsf-pc@lists.linux-foundation.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Johannes Weiner <hannes@cmpxchg.org>,
David Hildenbrand <david@kernel.org>,
Michal Hocko <mhocko@kernel.org>,
Qi Zheng <zhengqi.arch@bytedance.com>,
Lorenzo Stoakes <ljs@kernel.org>,
Chen Ridong <chenridong@huaweicloud.com>,
Emil Tsalapatis <emil@etsalapatis.com>,
Alexei Starovoitov <ast@kernel.org>,
Axel Rasmussen <axelrasmussen@google.com>,
Yuanchu Xie <yuanchu@google.com>, Wei Xu <weixugc@google.com>,
Kairui Song <ryncsn@gmail.com>,
Matthew Wilcox <willy@infradead.org>,
Nhat Pham <nphamcs@gmail.com>, Gregory Price <gourry@gourry.net>,
Barry Song <21cnbao@gmail.com>,
David Stevens <stevensd@google.com>,
wangtao <tao.wangtao@honor.com>,
Vernon Yang <vernon2gm@gmail.com>,
David Rientjes <rientjes@google.com>,
Kalesh Singh <kaleshsingh@google.com>,
"T . J . Mercier" <tjmercier@google.com>,
"Baolin Wang" <baolin.wang@linux.alibaba.com>,
Suren Baghdasaryan <surenb@google.com>,
Meta kernel team <kernel-team@meta.com>,
"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
"linux-mm@kvack.org" <linux-mm@kvack.org>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
liulu 00013167 <liulu.liu@honor.com>, gao xu <gaoxu2@honor.com>,
wangxin 00023513 <wangxin23@honor.com>
Subject: RE: [LSF/MM/BPF TOPIC] Towards Unified and Extensible Memory Reclaim (reclaim_ext)
Date: Thu, 26 Mar 2026 07:18:35 +0000
Message-ID: <12a0c8c9d12040fa8d23658ca57a8760@honor.com>
In-Reply-To: <20260325210637.3704220-1-shakeel.butt@linux.dev>
> -----Original Message-----
> From: Shakeel Butt <shakeel.butt@linux.dev>
> Sent: Thursday, March 26, 2026 5:07 AM
> Subject: [LSF/MM/BPF TOPIC] Towards Unified and Extensible Memory
> Reclaim (reclaim_ext)
>
> The Problem
> -----------
>
> Memory reclaim in the kernel is a mess. We ship two completely separate
> eviction algorithms -- traditional LRU and MGLRU -- in the same file.
> mm/vmscan.c is over 8,000 lines. 40% of it is MGLRU-specific code that
> duplicates functionality already present in the traditional path. Every bug fix,
> every optimization, every feature has to be done twice or it only works for
> half the users. This is not sustainable. It has to stop.
>
> We should unify both algorithms into a single code path, with each
> algorithm reduced to a set of hooks called from it. Everyone maintains,
> understands, and evolves a single codebase. Optimizations are now
> evaluated against -- and available to -- both algorithms. And the next time
> someone develops a new LRU algorithm, they can do so in a way that does
> not add churn to existing code.
>
> How We Got Here
> ---------------
>
> MGLRU brought interesting ideas -- multi-generation aging, page table
> scanning, Bloom filters, spatial lookaround. But we never tried to refactor the
> existing reclaim code or integrate these mechanisms into the traditional path.
> 3,300 lines of code were dumped as a completely parallel implementation
> with a runtime toggle to switch between the two.
> No attempt to evolve the existing code or share mechanisms between the
> two paths -- just a second reclaim system bolted on next to the first.
>
> To be fair, traditional reclaim is not easy to refactor. It has accumulated
> decades of heuristics trying to work for every workload, and touching any of
> it risks regressions. But difficulty is not an excuse.
> There was no justification for not even trying -- not attempting to generalize
> the existing scanning path, not proposing shared abstractions, not offering
> the new mechanisms as improvements to the code that was already there.
> Hard does not mean impossible, and the cost of not trying is what we are
> living with now.
>
> The Differences That Matter
> ---------------------------
>
> The two algorithms differ in how they classify pages, detect access, and
> decide what to evict. But most of these differences are not fundamental
> -- they are mechanisms that got trapped inside one implementation when
> they could benefit both. Not making those mechanisms shareable leaves
> potential free performance gains on the table.
>
> Access detection: Traditional LRU walks reverse mappings (RMAP) from the
> page back to its page table entries. MGLRU walks page tables forward,
> scanning process address spaces directly. Neither approach is inherently tied
> to its eviction policy. Page table scanning would benefit traditional LRU just as
> much -- it is cache-friendly, batches updates without the LRU lock, and
> naturally exploits spatial locality. There is no reason this should be MGLRU-
> only.
>
> Bloom filters and lookaround: MGLRU uses Bloom filters to skip cold page
> table regions and a lookaround optimization to scan adjacent PTEs during
> eviction. These are general-purpose optimizations for any scanning path.
> They are locked inside MGLRU today for no good reason.
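As a toy illustration of the mechanism being described here (the real filter lives in mm/vmscan.c; the sizes, hash functions, and names below are all invented for this sketch), a scanner could remember which PMD-sized regions looked hot like this:

```c
/* Toy Bloom filter: remember which PMD-sized regions contained young
 * PTEs so a later walk can skip the rest. Sizes, hash functions, and
 * names are invented for illustration only. */
#include <stdbool.h>
#include <stdint.h>

#define BLOOM_BITS    1024u
#define BITS_PER_LONG (8 * sizeof(unsigned long))

struct bloom {
	unsigned long bits[BLOOM_BITS / (8 * sizeof(unsigned long))];
};

/* Hash the PMD index (addr >> 21 for 2MB regions), not the raw
 * address, so aligned addresses do not all collide. */
static uint32_t hash1(uint64_t addr)
{
	return (uint32_t)((addr >> 21) * 2654435761u) % BLOOM_BITS;
}

static uint32_t hash2(uint64_t addr)
{
	return (uint32_t)((addr >> 21) * 40503u + 1) % BLOOM_BITS;
}

static void set_bit_(struct bloom *b, uint32_t i)
{
	b->bits[i / BITS_PER_LONG] |= 1ul << (i % BITS_PER_LONG);
}

static bool test_bit_(const struct bloom *b, uint32_t i)
{
	return b->bits[i / BITS_PER_LONG] & (1ul << (i % BITS_PER_LONG));
}

/* Record that the PMD region covering @pmd_addr held young PTEs. */
static void bloom_add(struct bloom *b, uint64_t pmd_addr)
{
	set_bit_(b, hash1(pmd_addr));
	set_bit_(b, hash2(pmd_addr));
}

/* False positives are possible, false negatives are not: a "cold"
 * answer is always safe to act on by skipping the region. */
static bool bloom_maybe_hot(const struct bloom *b, uint64_t pmd_addr)
{
	return test_bit_(b, hash1(pmd_addr)) && test_bit_(b, hash2(pmd_addr));
}
```

Nothing in this structure knows anything about generations, which is the point: any scanning path, traditional or otherwise, could consult it.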
>
> Lock-free age updates: MGLRU updates folio age using atomic flag
> operations, avoiding the LRU lock during scanning. Traditional reclaim can use
> the same technique to reduce lock contention.
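The lock-free pattern amounts to an atomic read-modify-write on a per-folio flags word. This standalone sketch uses C11 atomics and invented names, not the kernel's actual folio flag helpers:

```c
/* Illustration only: marking and harvesting a per-folio age bit with
 * atomic RMW operations instead of under the LRU lock. Names and
 * types are invented; the kernel has its own folio flag machinery. */
#include <stdatomic.h>
#include <stdbool.h>

#define FOLIO_YOUNG (1u << 0)  /* hypothetical "recently accessed" bit */

struct folio_stub {
	_Atomic unsigned int flags;
};

/* The scanner marks the folio young without taking any lock... */
static void folio_mark_young(struct folio_stub *f)
{
	atomic_fetch_or_explicit(&f->flags, FOLIO_YOUNG,
				 memory_order_relaxed);
}

/* ...and eviction atomically tests-and-clears the bit, so a racing
 * mark is never lost and a single access is never counted twice. */
static bool folio_test_clear_young(struct folio_stub *f)
{
	unsigned int old;

	old = atomic_fetch_and_explicit(&f->flags, ~FOLIO_YOUNG,
					memory_order_relaxed);
	return old & FOLIO_YOUNG;
}
```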
>
> Page classification: Traditional LRU uses two buckets (active/inactive).
> MGLRU uses four generations with timestamps and reference frequency
> tiers. This is the policy difference -- how many age buckets and how pages
> move between them. Every other mechanism is shareable.
>
> Both systems already share the core reclaim mechanics -- writeback,
> unmapping, swap, NUMA demotion, and working set tracking. The shareable
> mechanisms listed above should join that common core. What remains after
> that is a thin policy layer -- and that is all that should differ between
> algorithms.
>
> The Fix: One Reclaim, Pluggable and Extensible
> -----------------------------------------------
>
> We need one reclaim system, not two. One code path that everyone
> maintains, everyone tests, and everyone benefits from. But it needs to be
> pluggable as there will always be cases where someone wants some
> customization for their specialized workload or wants to explore some new
> techniques/ideas, and we do not want to get into the current mess again.
>
> The unified reclaim must separate mechanism from policy. The mechanisms
> -- writeback, unmapping, swap, NUMA demotion, workingset tracking -- are
> shared today and should stay shared. The policy decisions -- how to detect
> access, how to classify pages, which pages to evict, when to protect a page --
> are where the two algorithms differ, and where future algorithms will differ
> too. Make those pluggable.
>
> This gives us one maintained code path with the flexibility to evolve.
> New ideas get implemented as new policies, not as 3,000-line forks. Good
> mechanisms from MGLRU (page table scanning, Bloom filters, lookaround)
> become shared infrastructure available to any policy. And if someone comes
> up with a better eviction algorithm tomorrow, they plug it in without
> touching the core.
>
> Making reclaim pluggable implies we define it as a set of function pointers
> (let's call them reclaim_ops) hooking into a stable codebase we rarely modify.
> We then have two big questions to answer: how do these reclaim ops look,
> and how do we move the existing code to the new model?
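As a strawman for the first question, the ops table might look something like the following. Every name here is invented; defining the real methods is exactly the work being proposed:

```c
/* Hypothetical shape of a reclaim_ops table. All member names and
 * signatures are invented for illustration. */
struct folio;   /* opaque to this sketch */
struct lruvec;  /* likewise */

struct reclaim_ops {
	/* detect recent access: rmap walk vs. forward page table scan */
	void (*age)(struct lruvec *lruvec);
	/* map a folio to one of n_buckets age classes */
	int (*classify)(struct folio *folio);
	/* hand the shared core its next eviction candidate, or NULL */
	struct folio *(*next_victim)(struct lruvec *lruvec);
	/* 2 for active/inactive, 4 for MGLRU generations */
	int n_buckets;
};

/* The shared core would only ever go through the table: */
static int nr_buckets(const struct reclaim_ops *ops)
{
	return ops->n_buckets;
}

/* Both existing algorithms become instances of the same interface. */
static const struct reclaim_ops trad_ops  = { .n_buckets = 2 };
static const struct reclaim_ops mglru_ops = { .n_buckets = 4 };
```

Since the interface starts out private with in-kernel users only, the split between methods can keep changing until it settles.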
>
> How Do We Get There
> -------------------
>
> Do we merge the two mechanisms feature by feature, or do we prioritize
> moving MGLRU to the pluggable model then follow with LRU once we are
> happy with the result?
>
> Whichever option we choose, we do the work in small, self-contained phases.
> Each phase ships independently, each phase makes the code better, each
> phase is bisectable. No big bang. No disruption. No excuses.
>
> Option A: Factor and Merge
>
> MGLRU is already pretty modular. However, we do not know which
> optimizations are actually generic and which ones are only useful for MGLRU
> itself.
>
> Phase 1 -- Factor out just MGLRU into reclaim_ops. We make no functional
> changes to MGLRU. Traditional LRU code is left completely untouched at this
> stage.
>
> Phase 2 -- Merge the two paths one method at a time. Right now the code
> diverts control to MGLRU from the very top of the high-level hooks. We
> instead unify the algorithms starting from the very beginning of the LRU
> path, deciding what to keep in common code and what to move into
> traditional-LRU-specific hooks.
>
> Advantages:
> - We do not touch LRU until Phase 2, avoiding churn.
> - Makes it easy to experiment with combining MGLRU features into
> traditional LRU. We do not actually know which optimizations are
> useful and which should stay in MGLRU hooks.
>
> Disadvantages:
> - We will not find out whether reclaim_ops exposes the right methods
> until we merge the paths at the end. We will have to change the ops
> if it turns out we need a different split. The reclaim_ops API will
> be private and have a single user so it is not that bad, but it may
> require additional changes.
>
> Option B: Merge and Factor
>
> Phase 1 -- Extract MGLRU mechanisms into shared infrastructure. Page table
> scanning, Bloom filter PMD skipping, lookaround, lock-free folio age updates.
> These are independently useful. Make them available to both algorithms.
> Stop hoarding good ideas inside one code path.
>
> Phase 2 -- Collapse the remaining differences. Generalize list infrastructure
> to N classifications (trad=2, MGLRU=4). Unify eviction entry points. Common
> classification/promotion interface. At this point the two "algorithms" are thin
> wrappers over shared code.
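The N-classification generalization in Phase 2 could be sketched as below, with minimal stand-ins for the kernel's list_head; the names and the higher-index-is-older convention are invented for illustration:

```c
/* Sketch: one list per age class, so active/inactive (n == 2) and
 * MGLRU generations (n == 4) become the same structure walked by the
 * same code. Names and conventions are invented for illustration. */
#include <stddef.h>

#define MAX_NR_GENS 4

/* Minimal intrusive list, standing in for the kernel's list_head. */
struct list_head {
	struct list_head *prev, *next;
};

static void INIT_LIST_HEAD(struct list_head *h)
{
	h->prev = h->next = h;
}

static void list_add(struct list_head *n, struct list_head *h)
{
	n->next = h->next;
	n->prev = h;
	h->next->prev = n;
	h->next = n;
}

static int list_empty(const struct list_head *h)
{
	return h->next == h;
}

struct gen_lruvec {
	int nr_gens;  /* 2 for traditional, 4 for MGLRU */
	struct list_head gens[MAX_NR_GENS];
};

static void gen_lruvec_init(struct gen_lruvec *v, int nr_gens)
{
	v->nr_gens = nr_gens;
	for (int i = 0; i < nr_gens; i++)
		INIT_LIST_HEAD(&v->gens[i]);
}

/* Eviction scans from the oldest class down (higher index == older
 * here) and stops at the first non-empty list; identical whether the
 * policy configured two classes or four. Returns -1 when all empty. */
static int oldest_nonempty_gen(const struct gen_lruvec *v)
{
	for (int g = v->nr_gens - 1; g >= 0; g--)
		if (!list_empty(&v->gens[g]))
			return g;
	return -1;
}
```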
>
> Phase 3 -- Define the hook interface. Define reclaim_ops around the
> remaining policy differences. Layer BPF on top (reclaim_ext).
> Traditional LRU and MGLRU become two instances of the same interface.
> Adding a third algorithm means writing a new set of hooks, not forking
> 3,000 lines.
>
> Advantages:
> - We get signals on what should be shared earlier. We know every shared
>   mechanism is useful because both algorithms use it.
> - Can test LRU optimizations on MGLRU early.
>
> Disadvantages:
> - Slower, as we factor out both algorithms and expand reclaim_ops all
> at once.
>
> Open Questions
> --------------
>
> - Policy granularity: system-wide, per-node, or per-cgroup?
> - Mechanism/policy boundary: needs iteration; get it wrong and we
> either constrain policies or duplicate code.
> - Validation: reclaim quality is hard to measure; we need agreed-upon
> benchmarks.
> - Simplicity: the end result must be simpler than what we have today,
> not more complex. If it is not simpler, we failed.
> --
> 2.52.0
>
Hi Shakeel,
The reclaim_ops direction looks very promising. I'd be interested in the discussion.
We are particularly interested in the individual effects of several mechanisms
currently bundled in MGLRU. reclaim_ops would provide a great opportunity to
run ablation experiments, e.g. testing traditional LRU with page table scanning.
On policy granularity, it would also be interesting to see something like
"reclaim_ext" [1][2] taking control at different levels, similar to what
sched_ext does for scheduling policies.
Best,
Zicheng
[1] cache_ext: Customizing the Page Cache with eBPF
[2] PageFlex: Flexible and Efficient User-space Delegation of Linux Paging Policies with eBPF