linux-mm.kvack.org archive mirror
* [DISCUSS] Proposal: move slab shrinking into a dedicated kernel thread to improve reclaim efficiency
@ 2025-10-20  2:22 Yifan Ji
  0 siblings, 0 replies; 4+ messages in thread
From: Yifan Ji @ 2025-10-20  2:22 UTC (permalink / raw)
  To: linux-mm
  Cc: linux-kernel, Andrew Morton, Michal Hocko, Johannes Weiner,
	Vlastimil Babka, Matthew Wilcox


Hi all,

We have been investigating reclaim performance on mobile systems under
memory pressure and noticed that slab shrinking often accounts for a
significant portion of reclaim time in both direct reclaim and kswapd
contexts.
In some cases, shrink_slab() can take a noticeably long time when multiple
shrinkers are active, leading to latency spikes and slower overall reclaim
progress.

To address this, we are considering an approach to move slab shrinking
into a *dedicated kernel thread*. The intention is to decouple slab reclaim
from the direct reclaim and kswapd paths, allowing it to proceed
asynchronously under controlled conditions such as system idle periods or
specific reclaim triggers.

Motivation:
 - Reduce latency in direct reclaim paths by offloading potentially
   long-running slab reclaim work.
 - Improve overall reclaim efficiency by scheduling slab shrinking
   separately from page reclaim.
 - Allow more flexible control over when and how slab caches are aged
   or shrunk.

Proposed direction:
 - Introduce a kernel thread responsible for invoking shrink_slab()
   periodically or when signaled (a rough sketch follows below).
 - Keep the existing shrinker infrastructure intact but move the
   execution context outside of direct reclaim and kswapd.
 - Optionally trigger this thread based on system activity (e.g.
   idle detection, vmpressure events, or background reclaim).
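
A rough sketch of what we have in mind is below. This is illustrative
only: shrink_slab() is currently static to mm/vmscan.c, so the sketch
assumes a hypothetical exported helper, shrink_slab_all(), that walks
the online nodes/memcgs and invokes shrink_slab() for each; the
interval and priority values are placeholders.

#include <linux/kthread.h>
#include <linux/wait.h>
#include <linux/jiffies.h>
#include <linux/atomic.h>
#include <linux/gfp.h>
#include <linux/mmzone.h>	/* DEF_PRIORITY */
#include <linux/init.h>
#include <linux/err.h>

#define KSLABD_INTERVAL_MS	5000	/* placeholder periodic interval */

/* Hypothetical wrapper that calls shrink_slab() for all nodes/memcgs. */
unsigned long shrink_slab_all(gfp_t gfp_mask, int priority);

static struct task_struct *kslabd_task;
static DECLARE_WAIT_QUEUE_HEAD(kslabd_wait);
static atomic_t kslabd_kick = ATOMIC_INIT(0);

/* Called from e.g. vmpressure or idle hooks to request a pass now. */
void kslabd_wakeup(void)
{
	atomic_set(&kslabd_kick, 1);
	wake_up_interruptible(&kslabd_wait);
}

static int kslabd(void *unused)
{
	while (!kthread_should_stop()) {
		/* Sleep until kicked or the periodic interval expires. */
		wait_event_interruptible_timeout(kslabd_wait,
				atomic_read(&kslabd_kick) ||
				kthread_should_stop(),
				msecs_to_jiffies(KSLABD_INTERVAL_MS));
		if (kthread_should_stop())
			break;
		atomic_set(&kslabd_kick, 0);
		shrink_slab_all(GFP_KERNEL, DEF_PRIORITY);
	}
	return 0;
}

/* In a real patch this would be hooked from mm init. */
static int __init kslabd_init(void)
{
	kslabd_task = kthread_run(kslabd, NULL, "kslabd");
	if (IS_ERR(kslabd_task))
		return PTR_ERR(kslabd_task);
	return 0;
}

The wakeup policy (idle detection, vmpressure, watermarks) would then
just be a question of where kslabd_wakeup() gets called from.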

We’d like to gather community feedback on:

 - Whether decoupling slab reclaim from kswapd and direct reclaim
   makes sense from a design and maintainability perspective.
 - Potential implications on fairness, concurrency, and memcg
   accounting.
 - Any related prior work or alternative ideas that have been discussed
   in this area.

Thanks for your time and consideration.

Best regards,
Yifan Ji



* [DISCUSS] Proposal: move slab shrinking into a dedicated kernel thread to improve reclaim efficiency
@ 2025-10-21  2:52 Yifan Ji
  2025-10-21  5:25 ` Christoph Hellwig
  0 siblings, 1 reply; 4+ messages in thread
From: Yifan Ji @ 2025-10-21  2:52 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, Michal Hocko, Johannes Weiner, Vlastimil Babka,
	Matthew Wilcox

Hi all,

We've been profiling memory reclaim performance on mobile systems and found
that slab shrinking can dominate reclaim time, particularly when multiple
shrinkers are active. In some cases, shrink_slab() introduces noticeable
latency in both direct reclaim and kswapd contexts.

We are exploring an approach to move slab shrinking into a dedicated kernel
thread, decoupling it from direct reclaim and kswapd. The goal is to perform
slab reclaim asynchronously under controlled conditions such as idle periods
or vmpressure triggers.

Motivation:
 - Reduce latency in direct reclaim paths.
 - Improve reclaim efficiency by separating page and slab reclaim.
 - Provide more flexible scheduling for slab shrinking.

Proposed direction:
 - Introduce a kernel thread that periodically or conditionally calls
shrink_slab().

We'd appreciate feedback on:
 - Whether this decoupling aligns with the design of the current reclaim model.
 - Possible implications on fairness, concurrency, and memcg behavior.

Thanks for your time and input.

Best regards,
Yifan Ji



* Re: [DISCUSS] Proposal: move slab shrinking into a dedicated kernel thread to improve reclaim efficiency
  2025-10-21  2:52 [DISCUSS] Proposal: move slab shrinking into a dedicated kernel thread to improve reclaim efficiency Yifan Ji
@ 2025-10-21  5:25 ` Christoph Hellwig
  2025-10-24  0:47   ` Dave Chinner
  0 siblings, 1 reply; 4+ messages in thread
From: Christoph Hellwig @ 2025-10-21  5:25 UTC (permalink / raw)
  To: Yifan Ji
  Cc: linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	Vlastimil Babka, Matthew Wilcox, Dave Chinner

[adding Dave who has spent a lot of time on shrinkers]

On Tue, Oct 21, 2025 at 10:52:41AM +0800, Yifan Ji wrote:
> Hi all,
> 
> We've been profiling memory reclaim performance on mobile systems and found
> that slab shrinking can dominate reclaim time, particularly when multiple
> shrinkers are active. In some cases, shrink_slab() introduces noticeable
> latency in both direct reclaim and kswapd contexts.
> 
> We are exploring an approach to move slab shrinking into a dedicated kernel
> thread, decoupling it from direct reclaim and kswapd. The goal is to perform
> slab reclaim asynchronously under controlled conditions such as idle periods
> or vmpressure triggers.

That would mirror what everyone else in reclaim / writeback does and have the
same benefits and pitfalls, such as throttling.  I'd suggest you give it a
spin and report your findings.

> Motivation:
>  - Reduce latency in direct reclaim paths.
>  - Improve reclaim efficiency by separating page and slab reclaim.
>  - Provide more flexible scheduling for slab shrinking.
> 
> Proposed direction:
>  - Introduce a kernel thread that periodically or conditionally calls
> shrink_slab().
> 
> We'd appreciate feedback on:
>  - Whether this decoupling aligns with the design of the current reclaim model.
>  - Possible implications on fairness, concurrency, and memcg behavior.
> 
> Thanks for your time and input.
> 
> Best regards,
> Yifan Ji
> 
---end quoted text---



* Re: [DISCUSS] Proposal: move slab shrinking into a dedicated kernel thread to improve reclaim efficiency
  2025-10-21  5:25 ` Christoph Hellwig
@ 2025-10-24  0:47   ` Dave Chinner
  0 siblings, 0 replies; 4+ messages in thread
From: Dave Chinner @ 2025-10-24  0:47 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Yifan Ji, linux-mm, Andrew Morton, Michal Hocko, Johannes Weiner,
	Vlastimil Babka, Matthew Wilcox

On Mon, Oct 20, 2025 at 10:25:51PM -0700, Christoph Hellwig wrote:
> [adding Dave who has spent a lot of time on shrinkers]
> 
> On Tue, Oct 21, 2025 at 10:52:41AM +0800, Yifan Ji wrote:
> > Hi all,
> > 
> > We've been profiling memory reclaim performance on mobile systems and found
> > that slab shrinking can dominate reclaim time, particularly when multiple
> > shrinkers are active. In some cases, shrink_slab() introduces noticeable
> > latency in both direct reclaim and kswapd contexts.

Sure, it can increase memory reclaim latency, but that's because
memory reclaim takes time to reclaim objects. The more reclaimable
objects there are in shrinkable caches, the more time and overhead
it takes to reclaim them.

If the workload is heavily biased towards shrinkable caches rather
than file- or anon-pages, then profiles will show shrinkers taking
up all the reclaim time and CPU. This is not a bug, nor is it an
indication of an actual reclaim problem.

So before we start even thinking about "solutions", we need to
understand the problem you are trying to solve. Can you please post
the profiles, the workload analysis, the shrinkable cache sizes that
are being worked on, along with the state of page reclaim at the same
time, etc?

i.e. we need to understand why the shrinkers are taking time and
determine if it is an individual subsystem shrinker implementation
issue, a shrinker/page reclaim balance issue, or something else that
is causing the symptoms you are seeing.

> > We are exploring an approach to move slab shrinking into a dedicated kernel
> > thread, decoupling it from direct reclaim and kswapd. The goal is to perform
> > slab reclaim asynchronously under controlled conditions such as idle periods
> > or vmpressure triggers.

There be dragons.

Page reclaim and shrinker reclaim are intimately tied together to
balance the reclaim across all system caches. Maintaining
performance is related to working set retention, so we have to
balance reclaim across all caches that need to retain a working set
of cached objects, e.g. the file page cache, dentry cache and inode
cache are all delicately balanced against each other, and separating
page cache reclaim from dentry and inode cache reclaim is likely to
cause working set retention balance problems across the caches.

e.g. if progress is not being made, we have to increase reclaim
pressure on both page and shrinker reclaim at the same time so
reclaim balance is maintained. If we decouple them and only increase
pressure on one side then all sorts of bad things can happen.  e.g.
there's nothing left for page reclaim to reclaim, so it starts
thinking that we're approaching OOM territory. At the same time, the
shrinker caches could be making progress releasing objects from high
object count shrinkable caches, so it doesn't think there is any
memory pressure at all.

Then we end up with the page reclaim side declaring OOM and killing
stuff, whilst there is still lots of reclaimable memory in the
machine and reclaim of that is making good progress. That would be a
bad thing, and this is one of the reasons that page reclaim and
shrinker reclaim are intimately tied together....

Separating them whilst maintaining good co-ordination and control
will be no easy task. My intuition suggests that it'll end up with
so many corner cases where things go bad that even a mess of
heuristics won't be able to address them....

> That would mirror what everyone in reclaim / writeback does and have the
> same benefits and pitfalls like throttling.  I'd suggest you give it a
> spin and report your findings.

Kind of, but not really.  Decoupling shrinkers from direct reclaim
doesn't address all latency and overhead problems with direct
reclaim (like inline memory compaction).

The IO-less dirty throttling implementation took direct writeback
from the throttling context and moved it all into the background.
We went from unbound writeback concurrency to writeback being
controlled by a single task. IOWs, we decoupled throttling
concurrency from writeback IO. This allowed writeback IO to be done
in the most efficient manner possible whilst not having to care
about incoming write concurrency.

The direct analogue for memory allocation is the allocating task
performing direct reclaim, i.e. memory allocation fails so it then
runs reclaim itself. We get unbound concurrency in memory reclaim,
and that means single threaded shrinkers are exposed to unbound
concurrency. This is exactly the same problem that direct writeback
from dirty page throttling had.

IOWs, if there's a problem with too much concurrency in single
threaded shrinkers, the solution is not to push the shrinkers into a
background thread, but to push all of direct reclaim into a set of
bound concurrency controlled asynchronous worker tasks.

Then memory allocation only needs to wait on reclaim progress being
made. It doesn't burn CPU scanning for things to reclaim, it doesn't
burn CPU contending on locks for exclusive reclaim resources or
single threaded shrinker paths, etc.

The control loop would be almost as simple as dirty page throttling.
i.e. allocation only needs to be able to kick background reclaim,
and tasks doing allocation only need to wait for a certain number
of pages to be reclaimed. (i.e. same as dirty page throttling).
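
To make that concrete, a very rough sketch of such a control loop
might look like the following.  Every name here (reclaim_throttle(),
reclaim_account_freed(), kreclaimd_wakeup()) is invented for
illustration; none of this exists in the kernel today.

#include <linux/completion.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/minmax.h>

void kreclaimd_wakeup(void);	/* hypothetical: kick background reclaim */

struct reclaim_waiter {
	struct list_head	list;
	unsigned long		need;	/* pages this waiter still needs freed */
	struct completion	done;
};

static LIST_HEAD(reclaim_waiters);
static DEFINE_SPINLOCK(reclaim_waiters_lock);

/* Allocation side: queue up, kick background reclaim, wait for progress. */
void reclaim_throttle(unsigned long nr_pages)
{
	struct reclaim_waiter w = { .need = nr_pages };

	init_completion(&w.done);
	spin_lock(&reclaim_waiters_lock);
	list_add_tail(&w.list, &reclaim_waiters);
	spin_unlock(&reclaim_waiters_lock);

	kreclaimd_wakeup();
	wait_for_completion(&w.done);
}

/* Reclaim side: credit freed pages to waiters in FIFO order. */
void reclaim_account_freed(unsigned long nr_freed)
{
	struct reclaim_waiter *w, *tmp;

	spin_lock(&reclaim_waiters_lock);
	list_for_each_entry_safe(w, tmp, &reclaim_waiters, list) {
		unsigned long credit = min(nr_freed, w->need);

		w->need -= credit;
		nr_freed -= credit;
		if (!w->need) {
			list_del(&w->list);
			complete(&w->done);
		}
		if (!nr_freed)
			break;
	}
	spin_unlock(&reclaim_waiters_lock);
}

The per-memcg variant described below is the same idea with the
waiter list and the freed-page accounting made per-memcg.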

As for per-memcg reclaim, this would be similar in concept to the
per-BDI dirty throttling. We would have a per-memcg reclaim waiter
queue, and as background reclaim frees pages associated with a
memcg, they are accounted to that memcg. When enough pages have been
reclaimed in the memcg, background reclaim wakes the first waiter on
the memcg reclaim queue.

IOWs, if the problems you are seeing are a result of too much
concurrency from direct reclaim, the solution is to get rid of
direct reclaim altogether.  Memory allocation only needs -something-
to make forwards progress reclaiming pages; it doesn't actually need
to perform every possible garbage collection operation itself...

But unbound direct reclaim concurrency might not be the problem, so
that may not be the right solution. Hence we really need to
understand what problems you are trying to address before we can
really make any solid suggestions on how the problem could be best
resolved.

> > Motivation:
> >  - Reduce latency in direct reclaim paths.

Yup, direct reclaim is very harmful to performance in many cases.
Unbound concurrency causes reclaim efficiency issues (as per above),
in-line memory compaction is a massive resource hog (oh, boy does
that hurt!), and so on.

> >  - Improve reclaim efficiency by separating page and slab reclaim.

I'm not sure that it will have that effect. Separating them
introduces a bunch of new complexity and behaviours that will have
to be managed, and in the meantime it doesn't address various
underlying issues that create inefficiencies...

> >  - Provide more flexible scheduling for slab shrinking.

Perhaps, but this by itself doesn't actually improve anything.

> > Proposed direction:
> >  - Introduce a kernel thread that periodically or conditionally calls
> > shrink_slab().

You can effectively simulate that with the /proc/sys/vm/drop_caches
infrastructure. Write a patch that allows you to specify how many
objects to reclaim in a pass and you can experiment with this
functionality from (multiple) userspace tasks however you want....
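
For example, even without that patch, something as simple as the
following (run as root) gives you a periodic slab-only reclaim pass to
experiment with: writing "2" to drop_caches frees reclaimable slab
objects such as dentries and inodes, though the stock interface
reclaims everything it can in one pass rather than a bounded number of
objects.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	for (;;) {
		/* Each pass asks the kernel to drop reclaimable slab objects. */
		int fd = open("/proc/sys/vm/drop_caches", O_WRONLY);

		if (fd < 0) {
			perror("open /proc/sys/vm/drop_caches");
			return 1;
		}
		if (write(fd, "2", 1) != 1)
			perror("write");
		close(fd);
		sleep(30);	/* arbitrary interval for the experiment */
	}
}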

> > We'd appreciate feedback on:
> >  - Whether this decoupling aligns with the design of the current reclaim model.

IMO it is not a good fit, but others may have different views.

> >  - Possible implications on fairness, concurrency, and memcg behavior.

Lots - I have barely scratched the surface in my comments above.  You also
have to think about NUMA topology, how to co-ordinate reclaim across
multiple shrinker and page reclaim specific tasks within a node and
across a machine, supporting fast directed memcg-only reclaim, etc.

Really, though, we need to start with a common understanding of the
problem that you are trying to solve. Hence I think the best thing
you can do at this point is tell us in detail about the problem
being observed....

-Dave.
-- 
Dave Chinner
david@fromorbit.com


