* [LSF/MM/BPF TOPIC] Towards Unified and Extensible Memory Reclaim (reclaim_ext)
@ 2026-03-25 21:06 Shakeel Butt
From: Shakeel Butt @ 2026-03-25 21:06 UTC (permalink / raw)
  To: lsf-pc
  Cc: Andrew Morton, Johannes Weiner, David Hildenbrand, Michal Hocko,
	Qi Zheng, Lorenzo Stoakes, Chen Ridong, Emil Tsalapatis,
	Alexei Starovoitov, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Kairui Song, Matthew Wilcox, Nhat Pham, Gregory Price, Barry Song,
	David Stevens, Vernon Yang, David Rientjes, Kalesh Singh,
	wangzicheng, T . J . Mercier, Baolin Wang, Suren Baghdasaryan,
	Meta kernel team, bpf, linux-mm, linux-kernel

The Problem
-----------

Memory reclaim in the kernel is a mess. We ship two completely separate
eviction algorithms -- traditional LRU and MGLRU -- in the same file.
mm/vmscan.c is over 8,000 lines. 40% of it is MGLRU-specific code that
duplicates functionality already present in the traditional path. Every
bug fix, every optimization, every feature has to be done twice or it
only works for half the users. This is not sustainable. It has to stop.

We should unify both algorithms into a single code path, with each
algorithm reduced to a set of hooks called from that path. Everyone
maintains, understands, and evolves a single codebase. Optimizations are
now evaluated against -- and available to -- both algorithms. And the
next time someone develops a new LRU algorithm, they can do so in a way
that does not add churn to existing code.

How We Got Here
---------------

MGLRU brought interesting ideas -- multi-generation aging, page table
scanning, Bloom filters, spatial lookaround. But we never tried to
refactor the existing reclaim code or integrate these mechanisms into the
traditional path. 3,300 lines of code were dumped as a completely
parallel implementation with a runtime toggle to switch between the two.
No attempt to evolve the existing code or share mechanisms between the
two paths -- just a second reclaim system bolted on next to the first.

To be fair, traditional reclaim is not easy to refactor. It has
accumulated decades of heuristics trying to work for every workload, and
touching any of it risks regressions. But difficulty is not an excuse.
There was no justification for not even trying -- not attempting to
generalize the existing scanning path, not proposing shared
abstractions, not offering the new mechanisms as improvements to the code
that was already there. Hard does not mean impossible, and the cost of
not trying is what we are living with now.

The Differences That Matter
---------------------------

The two algorithms differ in how they classify pages, detect access, and
decide what to evict. But most of these differences are not fundamental
-- they are mechanisms that got trapped inside one implementation when
they could benefit both. Keeping those mechanisms unshared leaves free
performance gains on the table.

Access detection: Traditional LRU walks reverse mappings (RMAP) from the
page back to its page table entries. MGLRU walks page tables forward,
scanning process address spaces directly. Neither approach is inherently
tied to its eviction policy. Page table scanning would benefit
traditional LRU just as much -- it is cache-friendly, batches updates
without the LRU lock, and naturally exploits spatial locality. There is
no reason this should be MGLRU-only.
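To make the contrast concrete, here is a toy userspace sketch of the two
walk directions (all names hypothetical; this is not the kernel's rmap or
lru_gen code). The forward table walk streams through memory sequentially
and batches the accessed-bit clears; the RMAP-style walk visits only the
scattered entries mapping one page:

```c
#include <assert.h>

/* Toy contrast of the two access-detection directions, on a fake "page
 * table" of accessed bits. Userspace sketch with hypothetical names. */
#define NR_PTES 512

/* Forward page table walk: one sequential, cache-friendly pass that
 * also batches the accessed-bit clears. Returns the young count. */
static int table_walk(unsigned char *accessed, int n)
{
	int young = 0;

	for (int i = 0; i < n; i++) {
		if (accessed[i]) {
			young++;
			accessed[i] = 0; /* batched A-bit clear */
		}
	}
	return young;
}

/* RMAP-style walk: for one page, visit only the PTE slots that map it,
 * jumping around the table instead of streaming through it. */
static int rmap_check(const int *pte_idx, int nr,
		      const unsigned char *accessed)
{
	int young = 0;

	for (int i = 0; i < nr; i++)
		young += accessed[pte_idx[i]] ? 1 : 0;
	return young;
}
```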

Bloom filters and lookaround: MGLRU uses Bloom filters to skip cold
page table regions and a lookaround optimization to scan adjacent PTEs
during eviction. These are general-purpose optimizations for any
scanning path. They are locked inside MGLRU today for no good reason.
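As a rough illustration of why this is generic, here is a minimal
userspace Bloom filter sketch (hypothetical names, fixed size, two
multiplicative hashes -- not the kernel's lru_gen implementation). The
property any scanning path can exploit: a clear result means the range
was definitely never marked hot, so the walk can skip it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Minimal sketch of the Bloom-filter idea used to skip cold PMD ranges
 * during page table walks. Hypothetical names, illustration only. */
#define BLOOM_BITS 1024 /* filter size in bits; indices are 10 bits */

struct pmd_bloom {
	uint64_t bits[BLOOM_BITS / 64];
};

/* Two cheap multiplicative hashes over a PMD address; each keeps the
 * top 10 bits of the 64-bit product, giving an index 0..1023. */
static uint32_t hash1(uint64_t addr)
{
	return (uint32_t)(addr * 0x9E3779B97F4A7C15ULL >> 54);
}

static uint32_t hash2(uint64_t addr)
{
	return (uint32_t)(addr * 0xC2B2AE3D27D4EB4FULL >> 54);
}

/* Mark a PMD range as containing young entries. */
static void bloom_set(struct pmd_bloom *bf, uint64_t pmd_addr)
{
	uint32_t a = hash1(pmd_addr), b = hash2(pmd_addr);

	bf->bits[a / 64] |= 1ULL << (a % 64);
	bf->bits[b / 64] |= 1ULL << (b % 64);
}

/* False positives are possible, false negatives are not: a 0 result
 * means the range was never marked hot and can be skipped. */
static int bloom_test(const struct pmd_bloom *bf, uint64_t pmd_addr)
{
	uint32_t a = hash1(pmd_addr), b = hash2(pmd_addr);

	return (bf->bits[a / 64] >> (a % 64) & 1) &&
	       (bf->bits[b / 64] >> (b % 64) & 1);
}
```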

Lock-free age updates: MGLRU updates folio age using atomic flag
operations, avoiding the LRU lock during scanning. Traditional reclaim
can use the same technique to reduce lock contention.
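The technique itself is a one-liner; a userspace sketch with C11 atomics
(fake_folio and FLAG_YOUNG are hypothetical stand-ins for the kernel's
folio flag helpers, which use their own atomic bitops):

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch of a lock-free age update: set a "recently accessed" flag with
 * one atomic RMW instead of taking the LRU lock. Hypothetical names. */
#define FLAG_YOUNG (1u << 0)

struct fake_folio {
	_Atomic unsigned int flags;
};

/* Mark a folio as recently accessed without holding any list lock.
 * Returns 1 if this call newly set the flag, 0 if it was already set,
 * so concurrent scanners count each folio at most once. */
static int folio_mark_young(struct fake_folio *f)
{
	unsigned int old = atomic_fetch_or(&f->flags, FLAG_YOUNG);

	return !(old & FLAG_YOUNG);
}
```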

Page classification: Traditional LRU uses two buckets
(active/inactive). MGLRU uses four generations with timestamps and
reference frequency tiers. This is the policy difference --
how many age buckets and how pages move between them. Every other
mechanism is shareable.
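That policy difference can be sketched as an N-bucket generalization,
where traditional LRU is the N=2 instance and MGLRU the N=4 instance.
All names below are hypothetical; the real lru_gen code tracks more
state (timestamps, tiers), but the bucket arithmetic is the core:

```c
#include <assert.h>

/* Sketch of generalizing page classification to N age buckets.
 * Hypothetical names, illustration only. */
#define MAX_NR_GENS 4

struct gen_lists {
	int nr_gens;           /* 2 for traditional LRU, 4 for MGLRU */
	unsigned long min_seq; /* oldest live generation */
	unsigned long max_seq; /* youngest generation */
};

/* Map a folio's birth sequence number to a list index 0..nr_gens-1.
 * For nr_gens == 2 this degenerates to active/inactive. */
static int gen_index(const struct gen_lists *g, unsigned long seq)
{
	return (int)(seq % g->nr_gens);
}

/* Aging: open a new youngest generation, but only while fewer than
 * nr_gens generations are live; eviction drains min_seq to make room. */
static void inc_max_seq(struct gen_lists *g)
{
	if (g->max_seq - g->min_seq + 1 < (unsigned long)g->nr_gens)
		g->max_seq++;
}
```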

Both systems already share the core reclaim mechanics -- writeback,
unmapping, swap, NUMA demotion, and working set tracking. The shareable
mechanisms listed above should join that common core. What remains after
that is a thin policy layer -- and that is all that should differ between
algorithms.

The Fix: One Reclaim, Pluggable and Extensible
-----------------------------------------------

We need one reclaim system, not two. One code path that everyone
maintains, everyone tests, and everyone benefits from. But it needs to
be pluggable as there will always be cases where someone wants some
customization for their specialized workload or wants to explore some
new techniques/ideas, and we do not want to get into the current mess
again.

The unified reclaim must separate mechanism from policy. The mechanisms
-- writeback, unmapping, swap, NUMA demotion, workingset tracking -- are
shared today and should stay shared. The policy decisions -- how to
detect access, how to classify pages, which pages to evict, when to
protect a page -- are where the two algorithms differ, and where future
algorithms will differ too. Make those pluggable.

This gives us one maintained code path with the flexibility to evolve.
New ideas get implemented as new policies, not as 3,000-line forks. Good
mechanisms from MGLRU (page table scanning, Bloom filters, lookaround)
become shared infrastructure available to any policy. And if someone
comes up with a better eviction algorithm tomorrow, they plug it in
without touching the core.

Making reclaim pluggable implies we define each policy as a set of
callbacks (let's call them reclaim_ops) hooking into a stable codebase we
rarely modify. We then have two big questions to answer: how do these
reclaim ops look, and how do we move the existing code to the new model?
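On the first question, here is one possible shape for reclaim_ops,
purely a strawman in userspace C -- the actual method split is exactly
what needs discussion. The hooks mirror the policy decisions listed
above: access detection, classification, eviction choice, protection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Strawman reclaim_ops; every name here is hypothetical. */
struct folio;  /* opaque in this sketch */
struct lruvec; /* opaque in this sketch */

struct reclaim_ops {
	const char *name;
	/* Scan for recently accessed pages (RMAP or page table walk). */
	void (*detect_access)(struct lruvec *lruvec);
	/* Place a folio in an age bucket; returns the bucket index. */
	int (*classify)(struct lruvec *lruvec, struct folio *folio);
	/* Pick the next eviction candidate, or NULL if none. */
	struct folio *(*select_victim)(struct lruvec *lruvec);
	/* Veto eviction of a folio (e.g. workingset protection). */
	bool (*should_protect)(struct folio *folio);
};

/* A trivial policy instance showing the registration shape; hooks left
 * NULL would fall back to common-code defaults in a real design. */
static bool never_protect(struct folio *folio)
{
	(void)folio;
	return false;
}

static const struct reclaim_ops demo_ops = {
	.name = "demo",
	.should_protect = never_protect,
};
```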

How Do We Get There
-------------------

Do we merge the two mechanisms feature by feature, or do we prioritize
moving MGLRU to the pluggable model then follow with LRU once we are
happy with the result?

Whichever option we choose, we do the work in small, self-contained
phases. Each phase ships independently, each phase makes the code
better, each phase is bisectable. No big bang. No disruption. No
excuses.

Option A: Factor and Merge

MGLRU is already pretty modular. However, we do not know which
optimizations are actually generic and which ones are only useful for
MGLRU itself.

Phase 1 -- Factor out just MGLRU into reclaim_ops. We make no functional
changes to MGLRU. Traditional LRU code is left completely untouched at
this stage.

Phase 2 -- Merge the two paths one method at a time. Right now the code
diverts control to MGLRU from the very top of the high-level hooks. We
instead unify the algorithms starting from the very beginning of LRU,
deciding what to keep in common code and what to move into
traditional-LRU hooks.

Advantages:
- We do not touch LRU until Phase 2, avoiding churn.
- Makes it easy to experiment with combining MGLRU features into
  traditional LRU. We do not actually know which optimizations are
  useful and which should stay in MGLRU hooks.

Disadvantages:
- We will not find out whether reclaim_ops exposes the right methods
  until we merge the paths at the end. We will have to change the ops
  if it turns out we need a different split. The reclaim_ops API will
  be private and have a single user so it is not that bad, but it may
  require additional changes.

Option B: Merge and Factor

Phase 1 -- Extract MGLRU mechanisms into shared infrastructure. Page
table scanning, Bloom filter PMD skipping, lookaround, lock-free folio
age updates. These are independently useful. Make them available to both
algorithms. Stop hoarding good ideas inside one code path.

Phase 2 -- Collapse the remaining differences. Generalize list
infrastructure to N classifications (trad=2, MGLRU=4). Unify eviction
entry points. Common classification/promotion interface. At this point
the two "algorithms" are thin wrappers over shared code.

Phase 3 -- Define the hook interface. Define reclaim_ops around the
remaining policy differences. Layer BPF on top (reclaim_ext).
Traditional LRU and MGLRU become two instances of the same interface.
Adding a third algorithm means writing a new set of hooks, not forking
3,000 lines.

Advantages:
- We get signals earlier on what should be shared. We know every shared
  mechanism is useful because both algorithms use it.
- Can test LRU optimizations on MGLRU early.

Disadvantages:
- Slower, as we factor out both algorithms and expand reclaim_ops all
  at once.

Open Questions
--------------

- Policy granularity: system-wide, per-node, or per-cgroup?
- Mechanism/policy boundary: needs iteration; get it wrong and we
  either constrain policies or duplicate code.
- Validation: reclaim quality is hard to measure; we need agreed-upon
  benchmarks.
- Simplicity: the end result must be simpler than what we have today,
  not more complex. If it is not simpler, we failed.
-- 
2.52.0


