* [LSF/MM/BPF TOPIC] Towards Unified and Extensible Memory Reclaim (reclaim_ext)
@ 2026-03-25 21:06 Shakeel Butt
From: Shakeel Butt @ 2026-03-25 21:06 UTC (permalink / raw)
  To: lsf-pc
  Cc: Andrew Morton, Johannes Weiner, David Hildenbrand, Michal Hocko,
	Qi Zheng, Lorenzo Stoakes, Chen Ridong, Emil Tsalapatis,
	Alexei Starovoitov, Axel Rasmussen, Yuanchu Xie, Wei Xu,
	Kairui Song, Matthew Wilcox, Nhat Pham, Gregory Price, Barry Song,
	David Stevens, Vernon Yang, David Rientjes, Kalesh Singh,
	wangzicheng, T . J . Mercier, Baolin Wang, Suren Baghdasaryan,
	Meta kernel team, bpf, linux-mm, linux-kernel

The Problem
-----------

Memory reclaim in the kernel is a mess. We ship two completely separate
eviction algorithms -- traditional LRU and MGLRU -- in the same file.
mm/vmscan.c is over 8,000 lines. 40% of it is MGLRU-specific code that
duplicates functionality already present in the traditional path. Every
bug fix, every optimization, every feature has to be done twice or it
only works for half the users. This is not sustainable. It has to stop.

We should unify both algorithms into a single code path, with each
algorithm reduced to a set of hooks called from that path. Everyone
maintains, understands, and evolves a single codebase. Optimizations are
now evaluated against -- and available to -- both algorithms. And the
next time someone develops a new LRU algorithm, they can do so in a way
that does not add churn to existing code.

How We Got Here
---------------

MGLRU brought interesting ideas -- multi-generation aging, page table
scanning, Bloom filters, spatial lookaround. But we never tried to
refactor the existing reclaim code or integrate these mechanisms into the
traditional path. 3,300 lines of code were dumped as a completely
parallel implementation with a runtime toggle to switch between the two.
No attempt to evolve the existing code or share mechanisms between the
two paths -- just a second reclaim system bolted on next to the first.

To be fair, traditional reclaim is not easy to refactor. It has
accumulated decades of heuristics trying to work for every workload, and
touching any of it risks regressions. But difficulty is not an excuse.
There was no justification for not even trying -- not attempting to
generalize the existing scanning path, not proposing shared
abstractions, not offering the new mechanisms as improvements to the code
that was already there. Hard does not mean impossible, and the cost of
not trying is what we are living with now.

The Differences That Matter
---------------------------

The two algorithms differ in how they classify pages, detect access, and
decide what to evict. But most of these differences are not fundamental
-- they are mechanisms that got trapped inside one implementation when
they could benefit both. Keeping those mechanisms unshared leaves free
performance gains on the table.

Access detection: Traditional LRU walks reverse mappings (RMAP) from the
page back to its page table entries. MGLRU walks page tables forward,
scanning process address spaces directly. Neither approach is inherently
tied to its eviction policy. Page table scanning would benefit
traditional LRU just as much -- it is cache-friendly, batches updates
without the LRU lock, and naturally exploits spatial locality. There is
no reason this should be MGLRU-only.
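To make the contrast concrete, here is a toy userspace sketch of the two
walk directions (all names hypothetical; this is not the kernel's rmap or
lru_gen code). The forward table walk streams through memory sequentially
and batches the accessed-bit clears; the RMAP-style walk visits only the
scattered entries mapping one page:

```c
#include <assert.h>

/* Toy contrast of the two access-detection directions, on a fake "page
 * table" of accessed bits. Userspace sketch with hypothetical names. */
#define NR_PTES 512

/* Forward page table walk: one sequential, cache-friendly pass that
 * also batches the accessed-bit clears. Returns the young count. */
static int table_walk(unsigned char *accessed, int n)
{
	int young = 0;

	for (int i = 0; i < n; i++) {
		if (accessed[i]) {
			young++;
			accessed[i] = 0; /* batched A-bit clear */
		}
	}
	return young;
}

/* RMAP-style walk: for one page, visit only the PTE slots that map it,
 * jumping around the table instead of streaming through it. */
static int rmap_check(const int *pte_idx, int nr,
		      const unsigned char *accessed)
{
	int young = 0;

	for (int i = 0; i < nr; i++)
		young += accessed[pte_idx[i]] ? 1 : 0;
	return young;
}
```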

Bloom filters and lookaround: MGLRU uses Bloom filters to skip cold
page table regions and a lookaround optimization to scan adjacent PTEs
during eviction. These are general-purpose optimizations for any
scanning path. They are locked inside MGLRU today for no good reason.
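As a rough illustration of why this is generic, here is a minimal
userspace Bloom filter sketch (hypothetical names, fixed size, two
multiplicative hashes -- not the kernel's lru_gen implementation). The
property any scanning path can exploit: a clear result means the range
was definitely never marked hot, so the walk can skip it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Minimal sketch of the Bloom-filter idea used to skip cold PMD ranges
 * during page table walks. Hypothetical names, illustration only. */
#define BLOOM_BITS 1024 /* filter size in bits; indices are 10 bits */

struct pmd_bloom {
	uint64_t bits[BLOOM_BITS / 64];
};

/* Two cheap multiplicative hashes over a PMD address; each keeps the
 * top 10 bits of the 64-bit product, giving an index 0..1023. */
static uint32_t hash1(uint64_t addr)
{
	return (uint32_t)(addr * 0x9E3779B97F4A7C15ULL >> 54);
}

static uint32_t hash2(uint64_t addr)
{
	return (uint32_t)(addr * 0xC2B2AE3D27D4EB4FULL >> 54);
}

/* Mark a PMD range as containing young entries. */
static void bloom_set(struct pmd_bloom *bf, uint64_t pmd_addr)
{
	uint32_t a = hash1(pmd_addr), b = hash2(pmd_addr);

	bf->bits[a / 64] |= 1ULL << (a % 64);
	bf->bits[b / 64] |= 1ULL << (b % 64);
}

/* False positives are possible, false negatives are not: a 0 result
 * means the range was never marked hot and can be skipped. */
static int bloom_test(const struct pmd_bloom *bf, uint64_t pmd_addr)
{
	uint32_t a = hash1(pmd_addr), b = hash2(pmd_addr);

	return (bf->bits[a / 64] >> (a % 64) & 1) &&
	       (bf->bits[b / 64] >> (b % 64) & 1);
}
```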

Lock-free age updates: MGLRU updates folio age using atomic flag
operations, avoiding the LRU lock during scanning. Traditional reclaim
can use the same technique to reduce lock contention.
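The technique itself is a one-liner; a userspace sketch with C11 atomics
(fake_folio and FLAG_YOUNG are hypothetical stand-ins for the kernel's
folio flag helpers, which use their own atomic bitops):

```c
#include <assert.h>
#include <stdatomic.h>

/* Sketch of a lock-free age update: set a "recently accessed" flag with
 * one atomic RMW instead of taking the LRU lock. Hypothetical names. */
#define FLAG_YOUNG (1u << 0)

struct fake_folio {
	_Atomic unsigned int flags;
};

/* Mark a folio as recently accessed without holding any list lock.
 * Returns 1 if this call newly set the flag, 0 if it was already set,
 * so concurrent scanners count each folio at most once. */
static int folio_mark_young(struct fake_folio *f)
{
	unsigned int old = atomic_fetch_or(&f->flags, FLAG_YOUNG);

	return !(old & FLAG_YOUNG);
}
```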

Page classification: Traditional LRU uses two buckets
(active/inactive). MGLRU uses four generations with timestamps and
reference frequency tiers. This is the policy difference --
how many age buckets and how pages move between them. Every other
mechanism is shareable.
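That policy difference can be sketched as an N-bucket generalization,
where traditional LRU is the N=2 instance and MGLRU the N=4 instance.
All names below are hypothetical; the real lru_gen code tracks more
state (timestamps, tiers), but the bucket arithmetic is the core:

```c
#include <assert.h>

/* Sketch of generalizing page classification to N age buckets.
 * Hypothetical names, illustration only. */
#define MAX_NR_GENS 4

struct gen_lists {
	int nr_gens;           /* 2 for traditional LRU, 4 for MGLRU */
	unsigned long min_seq; /* oldest live generation */
	unsigned long max_seq; /* youngest generation */
};

/* Map a folio's birth sequence number to a list index 0..nr_gens-1.
 * For nr_gens == 2 this degenerates to active/inactive. */
static int gen_index(const struct gen_lists *g, unsigned long seq)
{
	return (int)(seq % g->nr_gens);
}

/* Aging: open a new youngest generation, but only while fewer than
 * nr_gens generations are live; eviction drains min_seq to make room. */
static void inc_max_seq(struct gen_lists *g)
{
	if (g->max_seq - g->min_seq + 1 < (unsigned long)g->nr_gens)
		g->max_seq++;
}
```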

Both systems already share the core reclaim mechanics -- writeback,
unmapping, swap, NUMA demotion, and working set tracking. The shareable
mechanisms listed above should join that common core. What remains after
that is a thin policy layer -- and that is all that should differ between
algorithms.

The Fix: One Reclaim, Pluggable and Extensible
-----------------------------------------------

We need one reclaim system, not two. One code path that everyone
maintains, everyone tests, and everyone benefits from. But it needs to
be pluggable as there will always be cases where someone wants some
customization for their specialized workload or wants to explore some
new techniques/ideas, and we do not want to get into the current mess
again.

The unified reclaim must separate mechanism from policy. The mechanisms
-- writeback, unmapping, swap, NUMA demotion, workingset tracking -- are
shared today and should stay shared. The policy decisions -- how to
detect access, how to classify pages, which pages to evict, when to
protect a page -- are where the two algorithms differ, and where future
algorithms will differ too. Make those pluggable.

This gives us one maintained code path with the flexibility to evolve.
New ideas get implemented as new policies, not as 3,000-line forks. Good
mechanisms from MGLRU (page table scanning, Bloom filters, lookaround)
become shared infrastructure available to any policy. And if someone
comes up with a better eviction algorithm tomorrow, they plug it in
without touching the core.

Making reclaim pluggable implies we define each policy as a set of
callbacks (let's call them reclaim_ops) hooking into a stable codebase we
rarely modify. We then have two big questions to answer: how do these
reclaim ops look, and how do we move the existing code to the new model?
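On the first question, here is one possible shape for reclaim_ops,
purely a strawman in userspace C -- the actual method split is exactly
what needs discussion. The hooks mirror the policy decisions listed
above: access detection, classification, eviction choice, protection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Strawman reclaim_ops; every name here is hypothetical. */
struct folio;  /* opaque in this sketch */
struct lruvec; /* opaque in this sketch */

struct reclaim_ops {
	const char *name;
	/* Scan for recently accessed pages (RMAP or page table walk). */
	void (*detect_access)(struct lruvec *lruvec);
	/* Place a folio in an age bucket; returns the bucket index. */
	int (*classify)(struct lruvec *lruvec, struct folio *folio);
	/* Pick the next eviction candidate, or NULL if none. */
	struct folio *(*select_victim)(struct lruvec *lruvec);
	/* Veto eviction of a folio (e.g. workingset protection). */
	bool (*should_protect)(struct folio *folio);
};

/* A trivial policy instance showing the registration shape; hooks left
 * NULL would fall back to common-code defaults in a real design. */
static bool never_protect(struct folio *folio)
{
	(void)folio;
	return false;
}

static const struct reclaim_ops demo_ops = {
	.name = "demo",
	.should_protect = never_protect,
};
```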

How Do We Get There
-------------------

Do we merge the two mechanisms feature by feature, or do we prioritize
moving MGLRU to the pluggable model then follow with LRU once we are
happy with the result?

Whichever option we choose, we do the work in small, self-contained
phases. Each phase ships independently, each phase makes the code
better, each phase is bisectable. No big bang. No disruption. No
excuses.

Option A: Factor and Merge

MGLRU is already pretty modular. However, we do not know which
optimizations are actually generic and which ones are only useful for
MGLRU itself.

Phase 1 -- Factor out just MGLRU into reclaim_ops. We make no functional
changes to MGLRU. Traditional LRU code is left completely untouched at
this stage.

Phase 2 -- Merge the two paths one method at a time. Right now the code
diverts control to MGLRU from the very top of the high-level hooks. We
instead unify the algorithms starting from the very beginning of LRU,
deciding what to keep in common code and what to move into
traditional-LRU hooks.

Advantages:
- We do not touch LRU until Phase 2, avoiding churn.
- Makes it easy to experiment with combining MGLRU features into
  traditional LRU. We do not actually know which optimizations are
  useful and which should stay in MGLRU hooks.

Disadvantages:
- We will not find out whether reclaim_ops exposes the right methods
  until we merge the paths at the end. We will have to change the ops
  if it turns out we need a different split. The reclaim_ops API will
  be private and have a single user so it is not that bad, but it may
  require additional changes.

Option B: Merge and Factor

Phase 1 -- Extract MGLRU mechanisms into shared infrastructure. Page
table scanning, Bloom filter PMD skipping, lookaround, lock-free folio
age updates. These are independently useful. Make them available to both
algorithms. Stop hoarding good ideas inside one code path.

Phase 2 -- Collapse the remaining differences. Generalize list
infrastructure to N classifications (trad=2, MGLRU=4). Unify eviction
entry points. Common classification/promotion interface. At this point
the two "algorithms" are thin wrappers over shared code.

Phase 3 -- Define the hook interface. Define reclaim_ops around the
remaining policy differences. Layer BPF on top (reclaim_ext).
Traditional LRU and MGLRU become two instances of the same interface.
Adding a third algorithm means writing a new set of hooks, not forking
3,000 lines.

Advantages:
- We get signals earlier on what should be shared. We know every shared
  mechanism is useful because both algorithms use it.
- Can test LRU optimizations on MGLRU early.

Disadvantages:
- Slower, as we factor out both algorithms and expand reclaim_ops all
  at once.

Open Questions
--------------

- Policy granularity: system-wide, per-node, or per-cgroup?
- Mechanism/policy boundary: needs iteration; get it wrong and we
  either constrain policies or duplicate code.
- Validation: reclaim quality is hard to measure; we need agreed-upon
  benchmarks.
- Simplicity: the end result must be simpler than what we have today,
  not more complex. If it is not simpler, we failed.
-- 
2.52.0


